Webmasters these days get very touchy about letting new spiders walk all over their sites. There are so many scraper bots, email harvesters, exploit probers, students running Nutch on gigabit university pipes, and other ill-behaved new search bots that some site owners nervously huddle in forum bunkers, anxiously scanning their logs for suspect new visitors so they can quickly issue bot and IP bans.
Cuill, the search startup from ex-Googlers that is anticipated to launch soon, seems to have run a rather high-rate crawl when they were getting started, generating a large number of robots.txt bans. Here is a list of sites which have banned Cuill's user-agent, "Twiceler".
A well-behaved crawler needs to follow a set of loosely-defined behaviors to be 'polite': don't crawl a site too fast, don't crawl any single IP address too fast, don't pull too much bandwidth from small sites by e.g. downloading tons of full-res media that will never be indexed, meticulously obey robots.txt, identify itself with a user-agent string that points to a detailed web page explaining the purpose of the bot, etc.
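For the curious, here is roughly what those rules look like in code: a bare-bones sketch using only the Python standard library. The bot name, URL and delay are placeholders, and a real crawler needs much more (per-IP throttling, Crawl-delay support, backoff on errors), but the shape is the same.

```python
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urlparse

# Placeholder identity: point the URL at a page explaining your bot.
USER_AGENT = "ExampleBot/0.1 (+http://example.com/bot.html)"
POLITE_DELAY = 5.0  # seconds between requests to any single host

_robots = {}      # cached robots.txt parser per host
_last_fetch = {}  # time of the last request per host

def allowed(url):
    """Check robots.txt before fetching, caching one parser per host."""
    host = urlparse(url).netloc
    if host not in _robots:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url("http://%s/robots.txt" % host)
        try:
            rp.read()
        except OSError:
            rp.disallow_all = True  # unreadable robots.txt: stay out
        _robots[host] = rp
    return _robots[host].can_fetch(USER_AGENT, url)

def polite_fetch(url):
    """Fetch one URL, obeying robots.txt and a per-host crawl delay."""
    if not allowed(url):
        return None
    host = urlparse(url).netloc
    wait = POLITE_DELAY - (time.time() - _last_fetch.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)  # never hit the same host back-to-back
    _last_fetch[host] = time.time()
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

The two habits that matter most are consulting robots.txt before every single fetch and never letting two requests to the same host land without a pause between them.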
Apart from the widely-recognized challenges of building a new search engine, sites like del.icio.us and compete.com that ban all new robots aside from the big 4 (Google, Yahoo, MSN and Ask) make it that much harder for a new entrant to gain a foothold. However, the web is so bloody vast that even tens of thousands of site bans are unlikely to make a significant impact on the aggregate perceived quality of a major new engine.
My initial take was that this had to be annoying for Cuill. As a crawler author, I can attest that each new site rejection personally hurts. :) But now I'm not so sure. Looking over the list, aside from a few major sites like Yelp, you could argue that getting all the forum SEOs to robots-exclude your new engine might actually help improve your index quality. Perhaps a Cuill robots ban is a quality signal? :)
Comments (12)
That's pretty funny about the index :-)
Posted by Prakash S | April 8, 2008 10:16 AM
Rich,
This is an interesting post, but the text here implies that the Nutch crawler is "ill-behaved", which isn't the case. I'm one of the original contributors, and we wrote the crawler to observe every one of the polite practices you list. It won't even run out of the box - the user is forced to indicate a proper user-agent first.
Of course it's possible to force Nutch to do impolite things (it's open source), but the user has to be actively ill-intentioned, not simply a newbie or student.
Posted by Mike Cafarella | April 8, 2008 1:26 PM
Mike, sorry about that -- I didn't mean to imply that Nutch was a bad crawler. Suspicious webmasters do, however, frequently ban all open source crawlers as a matter of practice (along with things like wget), as a hurdle to keep amateurs who have downloaded some code and are running it against the web off of their sites.
e.g.:
http://www.webmasterworld.com/search_engine_spiders/3407574.htm
Someone using Nutch would do well to customize the user-agent to avoid such bans.
Posted by Rich Skrenta | April 8, 2008 1:35 PM
How nice of you to supply this UNBLOCKED and indexed list of all of our sites' robots.txt URLs to make things easier on the site scrapers. You should not be linking to our robots.txt in this manner, indexed or not.
The Twiceler bot is blocked from all of our sites simply for its bad behavior.
It crawls where it should not and basically wreaks havoc on a site when it does.
Blocking the user-agent in our robots.txt has not even been effective, as the bot continues to crawl however it sees fit.
If you block its IP, it merely returns within seconds with a new one. We have documented a ton of IPs from this bot in many different C-classes, and it always returns.
Posted by Concerned Webmaster | April 19, 2008 1:37 PM
Yeah, it kind of x-rays certain affiliated groups of spammy sites, doesn't it? Grouping sites by robots.txt sigs: I think I saw that mentioned in a research paper somewhere. :)
Cuill is a reputable company and crawls from a known set of IPs, which they list on their site. If you are getting robots violations and they are coming from other IPs, the most likely bet is that someone is spoofing their user-agent to scrape your content. It's not Cuill.
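If you want to verify a hit yourself, here is a rough sketch of the standard check: reverse-resolve the connecting IP, then confirm the hostname forward-resolves back to the same address. A scraper can fake its user-agent and even its PTR record, but not the forward lookup. I'm assuming their crawl hosts reverse-resolve under cuill.com (e.g. crawl-16.cuill.com); treat the exact suffix as an assumption and check their site for the authoritative list.

```python
import socket

def is_real_twiceler(ip):
    """Reverse-resolve the IP, check the domain, then forward-confirm."""
    try:
        host = socket.gethostbyaddr(ip)[0]   # e.g. crawl-16.cuill.com
    except OSError:
        return False
    if not host.endswith(".cuill.com"):      # assumed suffix; verify it
        return False
    try:
        # Forward-confirm: the hostname must resolve back to this IP.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```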
Posted by Rich Skrenta | April 19, 2008 1:49 PM
Hi,
My site is in that list of sites which have banned Cuill.
And yes, I did ban Cuill. I must explain what happened and why I banned it.
My server started to crash on application pools. I am not sure if this is the right terminology. Basically, the application process was hanging at recycle and causing the whole IIS to hang up; so not only did that particular website stop working, but so did the other sites hosted on the same server.
After hours of debugging, I noticed that Cuill (which I had never heard of before) was crawling the site aggressively.
I immediately emailed Mike and asked him to stop crawling my site, and to be sure, I also banned it.
Even though it is banned, it still attempts to crawl my site.
Posted by M. Savas ZORLU | April 23, 2008 12:56 PM
You know the saying: once bitten, twice shy...
As a webmaster I get a bit tired of constantly having to deal with the startup crawler du jour.
From law firms looking for DMCA violations, to vertical search engines, to image aggregators, to company intelligence resellers... it feels like everybody and their brother has gotten into spidering sites.
With tens of thousands of pages whose content is only relevant to a targeted audience that is perfectly able to find us on the majors, I do not hesitate to block (and possibly ban) when I see an aggressive crawler that does not provide me or my customers with any direct benefit.
Posted by Cuill banning webmaster | April 23, 2008 2:12 PM
I don't see the point of listing the robots.txt files of the banning sites on your site. Even more so, I don't see the point of LINKING them. Helping other startup crawlers/scrapers/email collectors by showing them the way?
Posted by Another cuill banning webmaster | May 9, 2008 1:19 PM
That'll be 10,001 sites now: I've watched the accursed Twiceler 0.9 sucking garbage content out of my spam trap in cgi-bin for the past week and have reported this problem a second time. I can only hope that a company spending that much money can get Twiceler 1.0 to actually take notice of the robots.txt file?
Posted by PeterG22 | May 26, 2008 2:17 PM
And garbage it is. Here's a part of my weblog, as it has been for several months now. Cuill is eating more than twice as much as the Googlebot, the only one I really want reading my site. Since I pay for bandwidth, I guess I'll send them the bill. Let's hope the number of bans runs to a million before they see the light.
#   Hits           Files          KBytes          Visits      Hostname
1   19711 19.05%   19711 26.15%   281056 31.19%     3  0.11%  crawl-16.cuill.com
2    8747  8.45%    8735 11.59%   129899 14.42%     7  0.26%  crawl-66-249-65-6.googlebot.com
3    3627  3.50%    1226  1.63%    15445  1.71%    47  1.71%  ip5457af95.direct-adsl.nl
4    3553  3.43%    3295  4.37%    59330  6.58%    14  0.51%  ip503d2a0a.speed.planet.nl
5    2990  2.89%     461  0.61%     6030  0.67%    14  0.51%  ip5453ed4b.adsl-surfen.hetnet.nl
6    2302  2.22%    2287  3.03%    14440  1.60%    26  0.95%  grootammers2.dbinet.nl
Posted by cootje321 | July 12, 2008 7:52 AM
I had them crawling my site, and it didn't really bother me till I heard lots of bad things about them, and the fact that it was always on my site day in and day out.
I did mail Twiceler about removing me from the crawl list, and I was promptly removed from the crawl.
I would think people who have asked to be removed and think they haven't been are actually seeing a look-alike which has ill intentions.
Posted by clyde | July 14, 2008 12:08 PM
Not only do I block them in "/robots.txt"; I found their abuse of and disregard for that file so obnoxious that I now block their IP netblocks in my firewall as well. "Cuill"'s robot repeatedly found its way into my malicious robot traps, something that "well behaved" robots would never do if they properly respected the robots.txt control file.
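For anyone who wants such a trap: disallow a path in robots.txt, link to it somewhere no human will ever click, and log everything that fetches it anyway. A minimal CGI sketch; the paths and filenames here are illustrative, not what I actually run:

```python
#!/usr/bin/env python3
# Minimal robot-trap CGI; paths are illustrative. robots.txt contains:
#   User-agent: *
#   Disallow: /cgi-bin/trap.py
# so anything requesting this script is ignoring robots.txt.
import datetime
import os

def main():
    ip = os.environ.get("REMOTE_ADDR", "-")
    ua = os.environ.get("HTTP_USER_AGENT", "-")
    # Record the offender; these entries become candidates for an IP ban.
    with open("/var/log/robot_trap.log", "a") as log:
        log.write("%s %s %s\n" % (datetime.datetime.now().isoformat(), ip, ua))
    print("Content-Type: text/plain")
    print()
    print("Nothing to see here.")

if __name__ == "__main__":
    main()
```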
PS: Despite how they "claim" to pronounce their name, it's spelled as if it were Quill, the old feathery writing implement. Maybe the true idea of their robot was to make us feel as if we had been stuck by a whole bunch of quills!
Posted by The Snarkmaster | June 24, 2009 2:17 PM