Webmasters these days get very touchy about letting new spiders walk all over their sites. There are so many scraper bots, email harvesters, exploit probers, students running Nutch on gigabit university pipes, and other ill-behaved new search bots that some site owners nervously huddle in forum bunkers, anxiously scanning their logs for suspect new visitors so they can quickly issue bot and IP bans.
Cuill, the search startup from ex-Googlers that is anticipated to launch soon, seems to have run a rather high-rate crawl when they were getting started, and it generated a large number of robots.txt bans. Here is a list of sites which have banned Cuill's user-agent "Twiceler".
A well-behaved crawler needs to follow a set of loosely defined behaviors to be 'polite': don't crawl a site too fast, don't crawl any single IP address too fast, don't pull too much bandwidth from small sites by e.g. downloading tons of full-res media that will never be indexed, meticulously obey robots.txt, identify itself with a user-agent string that points to a detailed web page explaining the purpose of the bot, etc.
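To make a couple of those rules concrete, here is a rough sketch in Python of the robots.txt and crawl-rate checks. It is not how Cuill or any real crawler works. The `PoliteFetchPlanner` class, the `ExampleBot` user-agent and the sample robots.txt are all made up for illustration; only `urllib.robotparser` is a real standard-library module. The sketch throttles per hostname and does no fetching at all; a real crawler would also throttle per IP address, as noted above.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: bans one crawler outright, slows everyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: Twiceler
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


class PoliteFetchPlanner:
    """Decides whether and when a URL may be fetched (illustrative only)."""

    def __init__(self, user_agent, default_delay=1.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.default_delay = default_delay  # seconds between hits on one host
        self.clock = clock
        self.rules = {}      # host -> parsed robots.txt
        self.next_slot = {}  # host -> earliest time of the next fetch

    def load_robots(self, host, robots_txt):
        """Parse robots.txt text that was already downloaded for this host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        """True only if robots.txt for the host is known and permits the URL."""
        parser = self.rules.get(urlsplit(url).netloc)
        # No robots.txt loaded yet: refuse rather than guess.
        return parser is not None and parser.can_fetch(self.user_agent, url)

    def wait_time(self, url):
        """Seconds to sleep before fetching; also reserves the host's slot."""
        host = urlsplit(url).netloc
        parser = self.rules.get(host)
        delay = parser.crawl_delay(self.user_agent) if parser else None
        if delay is None:
            delay = self.default_delay
        now = self.clock()
        start = max(now, self.next_slot.get(host, now))
        self.next_slot[host] = start + delay
        return start - now
```

A caller would check `allowed(url)` first, then `time.sleep(planner.wait_time(url))` before each request, sending a user-agent such as `ExampleBot/0.1 (+http://example.com/bot.html)` that points at a page describing the bot.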
Apart from the widely recognized challenges to building a new search engine, sites like del.icio.us and compete.com that ban all new robots aside from the big 4 (Google, Yahoo, MSN and Ask) make it that much harder for a new entrant to gain a footing. However, the web is so bloody vast that even tens of thousands of site bans are unlikely to make a significant impact on the aggregate perceived quality of a major new engine.
My initial take was that this had to be annoying for Cuill. As a crawler author, I can attest that each new site rejection hurts personally. :) But now I'm not so sure. Looking over the list, aside from a few major sites like Yelp, you could argue that getting all the forum SEOs to robots-exclude your new engine might actually help improve your index quality. Perhaps a Cuill robots ban is a quality signal? :)
Comments (8)
That's pretty funny about the index:-)
Posted by Prakash S | April 8, 2008 10:16 AM
Rich,
This is an interesting post, but the text here implies that the Nutch crawler is "ill-behaved", which isn't the case. I'm one of the original contributors, and we wrote the crawler to observe every one of the polite practices you list. It won't even run out of the box - the user is forced to indicate a proper user-agent first.
Of course it's possible to force Nutch to do impolite things (it's open source), but the user has to be actively ill-intentioned, not simply a newbie or student.
Posted by Mike Cafarella | April 8, 2008 1:26 PM
Mike, sorry about that -- I didn't mean to imply that Nutch was a bad crawler. The suspicious webmasters do however frequently ban all open source crawlers as a matter of practice (along with things like wget), as a hurdle to keep the amateurs that downloaded some code and are running against the web off of their sites.
e.g.:
http://www.webmasterworld.com/search_engine_spiders/3407574.htm
Someone using Nutch would do well to customize the user-agent to avoid such bans.
Posted by Rich Skrenta | April 8, 2008 1:35 PM
How nice of you to supply this UNBLOCKED and indexed list of all of our sites' robots.txt URLs to make things easier on the site scrapers. You should not be linking to our robots.txt in this manner, indexed or not.
The Twiceler bot is blocked from all of our sites merely for its bad behavior.
It crawls where it should not be and basically wreaks havoc on a site when it does.
Blocking the user agent in our robots.txt has not even been effective, as the bot continues to crawl how it sees fit.
If you block its IP it merely returns within seconds with a new one. We have documented a ton of IPs from this bot in many different c-classes and it always returns.
Posted by Concerned Webmaster | April 19, 2008 1:37 PM
Yeah, it kind of x-rays certain affiliated groups of spammy sites, doesn't it? Grouping sites by robots.txt sigs, I think I saw that mentioned in a research paper somewhere. :)
Cuill is a reputable company and crawls from a known set of IPs which they list on their site. If you are getting robots violations and they are coming from other IPs the most likely bet is that someone is spoofing their user-agent to scrape your content. It's not Cuill.
Posted by Rich Skrenta | April 19, 2008 1:49 PM
Hi,
My site is listed in that list which has banned Cuill.
And yes, I did ban cuill. I must explain what happened and why I banned it.
My server started to crash on application pools. I am not sure if this is the right terminology. Basically, the application process was hanging at recycle and causing the whole of IIS to hang up, so not only did the particular website stop working, but so did the other sites hosted on the same server.
After hours of debugging, I noticed that cuill (which I had never heard of before) was crawling the site aggressively.
I immediately emailed Mike and asked him to stop crawling my site, and to be sure, I also banned it.
Even though it is banned, it still attempts to crawl my site.
Posted by M. Savas ZORLU | April 23, 2008 12:56 PM
You know the saying: once bitten twice shy...
As a webmaster I get a bit tired of constantly having to deal with the startup crawler du jour.
From law firms looking for DMCA violations to vertical search engines, to image aggregators, to company intelligence resellers... It feels to me that everybody and their brother has gotten into spidering sites.
With 10,000s of pages that have content that is only relevant to a targeted audience who is perfectly able to find us on the majors, I do not hesitate to block (and possibly ban) when I see an aggressive crawler that does not provide me or my customers with direct benefits.
Posted by Cuill banning webmaster | April 23, 2008 2:12 PM
I don't see the point why the robots-files of the banning sites are listed on your site. Even more so I don't see the point why they are LINKED. Helping other start up crawlers/scrapers/email collectors by showing them the way?
Posted by Another cuill banning webmaster | May 9, 2008 1:19 PM