Is www.page-store.com sucking up all your bandwidth?
Over the past month we have noticed a new bot appear on the horizon, purporting to be from Page-Store. As you can see from their site, they are a little light on detail about what on earth they are really going to do with all the data they are sucking up.

A number of other bloggers have noted this pattern also.
This looks like a (good?) example of the new breed of companies that use the Amazon EC2 platform to create as many on-demand servers as they need, sucking up blogs and sites from all over the internet. That's all fine and dandy, except how does Amazon plan to deal with the fact that we may have to start denying the IP addresses of certain rogue servers? This becomes problematic when such a server reboots and Amazon allocates it a new IP address: a genuine service whose Amazon image picks up the old banned IP address is instantly blacklisted!
We at Blog-City are seeing a huge amount of collective traffic from these servers as they go after our user base's content. So we've decided to simply deny this bot until we are told that they are indeed providing something good and worthwhile. There are much easier ways to get our public content - they just need to ask!
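Since the crawler's IP addresses can change, one way to deny a bot like this is to match on its User-Agent header instead. The following is only a minimal sketch, not a description of how Blog-City actually does it: it assumes the crawler's User-Agent contains the token "page-store", and the function names and sample header strings are made up for illustration.

```python
import re

# Assumption: the crawler's User-Agent header contains the token "page-store".
# The pattern and any sample header strings are illustrative, not from real logs.
BLOCKED_AGENTS = re.compile(r"page-store", re.IGNORECASE)


def is_blocked(user_agent):
    """Return True when the User-Agent header matches a blocked crawler."""
    if not user_agent:
        return False  # a missing or empty header is not treated as a match here
    return BLOCKED_AGENTS.search(user_agent) is not None


def status_for(user_agent):
    """Pick an HTTP status: 403 for blocked crawlers, 200 for everyone else."""
    return 403 if is_blocked(user_agent) else 200
```

A request handler could call `status_for(request_user_agent)` before doing any real work. The obvious limitation is that this only catches a bot that identifies itself honestly; a crawler that changes its User-Agent slips straight through.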
Comments (6)
The plot thickens, maybe (http://break-left.org/blog/?p=277). Looks like powerset.com hired page-store.com to do their crawling, only now with a hacked-up user agent for their Heritrix spider.
If you use Apache on a Linux server, you can use your .htaccess file to block page-store without relying on blocking dynamic IPs. Just add the following lines. Note that you can list as many other domain names as you like; just separate them with the vertical bar (|). Below my example is a further list of other previously known offenders I use in my configuration.
====
SetEnvIfNoCase Referer ".*(page-store|webdevsquare).*" BadReferrer
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=BadReferrer
</Limit>
====
aizzo|xopy|blogincome|adminshop|italiancharms4u|4italiancharms|andrewsaluk|datashaping|mortgagerefinancingtoday|insurancequotecity|injketandtonercartridges|quickcontactsonline|sex3k|paydayloansandcashadvance|z5n-home-loan-refinancing|flowersdeliveredquick|milf-xxx-action|skin-care-companies|BusinessReferences|domain-name-registration-4u|firsthorizonmtg|genaholincorporated|thetrafficproject|hostingtutorial|onlinecasino.forever|mtmarket.ru
Well I am glad they replied to your email. Sadly they did not for me.
And as for their respecting the robots.txt, I can confirm that must be a recent addition at their end, because they surely were not respecting it to begin with. So while you may put the entry down to "ranting", I think entries such as these (and others on various webmaster forums) only highlighted to them that they ought to do something if they didn't want their name to become mud on the Internet. I am glad they are reacting to customer complaints. Job well done.
I noticed the same last night but, instead of ranting and raving or contemplating in a blog, decided to write a polite email to page-store. Guess what - this morning I had the reply and an apology for any inconvenience. Just wondering if really no one else had this simple idea before?
Btw, this is the statement on robots.txt files (haven't tried it yet, anyone else?): "User-agent to use in the robots.txt file: page-store. The crawler definitely reads and interprets the robots.txt file. It is not very clever with non-compliant inputs yet, however."
Page-store sucked down 550 MB of data in 27 hours; the whole site is only 750 MB, which includes about 20 twenty-minute audio files. When I banned it by IP it continued to try to hit the site four times a minute for about 3 days.
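For anyone wanting to try the robots.txt route quoted above, here is a rough sketch of what such an entry might look like, checked locally with Python's standard `urllib.robotparser`. The "page-store" token comes from the quoted statement; banning the whole site with `Disallow: /`, the `allowed` helper and the example.com URL are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: the "page-store" token is taken from the
# quoted statement; "Disallow: /" (ban the entire site) is an assumption.
ROBOTS_TXT = """\
User-agent: page-store
Disallow: /
"""


def allowed(agent, url):
    """Return True if a compliant parser would let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)
```

With this file, `allowed("page-store", "http://example.com/blog/")` is false, while an agent that is not listed is still allowed. This only shows how a standards-following parser reads the file; whether the crawler itself honours it is exactly what the comments above disagree about.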




Hi All. Sorry if the crawler has caused any inconvenience. We're committed to being good web citizens. Obviously we occasionally fall short. For example, we're not too smart, yet, about filtering out all GET requests that may trigger db updates, like "click here to rate", etc. We strive to improve the crawler, always. In fact, we're definitely going to replace Heritrix with a more intelligent bot. It's been doing a fair job, but not great.
Page-store is building a YMG-scale web repository for use by search startup companies. Lots of startups get torn up by the conflicting requirements of building innovative IP at the same time that they're building web-scale infrastructure. Even with VC funding there's almost never enough time to accomplish both tasks. I thought it made sense to factor the problem, and at least to make available a collective high-quality crawl. So far the crawl is available, and the quality is improving. Page-store is _very_ cheap (not free!), but easily within reach of any group that wants to build web-scale search applications.