The story goes that, one day back in the 1940s, a group of atomic scientists, including the famous Enrico Fermi, were sitting around talking, when the subject turned to extraterrestrial life. Fermi is supposed to have then asked, "So? Where is everybody?" What he meant was: If there are all these billions of planets in the universe that are capable of supporting life, and millions of intelligent species out there, then how come none has visited Earth? This has come to be known as the Fermi Paradox.
My buddy Greg Lindahl maintains a collection of historical documents on his personal website, and gets enough traffic each month that he worries about his colo bandwidth bill.
When he analyzed his web logs recently and tallied up the self-reporting robots, he was surprised at how few he actually found crawling his site, and mentioned the Fermi quote I've reproduced above. If there really are 100 search engine startups (via Charles Knight at Read/WriteWeb), shouldn't we be seeing more activity from them?
Here is the list of every crawler that fetched over 1000 pages for the past three months:
1612960  Yahoo! Slurp      help.yahoo.com                     bigco
 365308  msnbot            search.msn.com/msnbot.htm          bigco
 148090  Googlebot         www.google.com/bot.html            bigco
 140120  VoilaBot          www.voila.com                      bigco
  68829  Ask Jeeves/Teoma  about.ask.com                      bigco
  62005  psbot             www.picsearch.com/bot.html         startup
  39193  BecomeBot         www.become.com/site_owners.html    shopping
  30006  WebVac            www.WebVac.org                     edu
  29778  ShopWiki          www.shopwiki.com/wiki/Help:Bot     shopping
  22124  noxtrumbot        www.noxtrum.com                    bigco
  20963  Twiceler          www.cuill.com/twiceler/robot.html  startup
  17113  MJ12bot           majestic12.co.uk/bot.php           startup
  15650  Gigabot           www.gigablast.com/spider.html      startup
  10404  ia_archiver       www.archive.org                    nonprofit
   9337  Seekbot           www.seekbot.net/bot.html           startup
   9152  genieBot          www.genieknows.com                 startup
   7246  FAST MetaWeb      www.fastsearch.com                 enterprise
   7243  worio bot         worio.com                          edu
   6868  CazoodleBot       www.cazoodle.com                   startup
   6608  ConveraCrawler    www.authoritativeweb.com/crawl     enterprise
   6293  IRLbot            irl.cs.tamu.edu/crawler            edu
   5487  Exabot            www.exabot.com/go/robot            bigco
   4215  ilial             www.ilial.com/crawler              startup
   3991  SBIder            www.sitesell.com/sbider.html       memetracker
   3673  boitho-dcbot      www.boitho.com/dcbot.html          enterprise
   3601  accelobot         www.accelobot.com                  memetracker
   2878  Accoona-AI-Agent  www.accoona.com                    startup
   2521  Factbot           www.factbites.com                  startup
   2054  heritrix          i.stanford.edu                     edu
   2003  Findexa           www.findexa.no                     ?
   1760  appie             www.walhello.com                   startup?
   1678  envolk            www.envolk.com                     spammers
   1464  ichiro            help.goo.ne.jp/door/crawler.html   bigco
   1165  IDBot             www.id-search.org/bot.html         edu
   1161  Sogou             www.sogou.com/docs/help            bigco
   1029  Speedy Spider     www.entireweb.com                  bigco
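For anyone who wants to run the same kind of tally on their own logs, here is a rough sketch in Python. It is not the script Greg used (I don't know what he used). It assumes the access log is in the common "combined" format, where the user agent is the last quoted field on each line, and the `KNOWN_BOTS` list and the `tally_crawlers` name are placeholders made up for illustration.

```python
import re
from collections import Counter

# Assumption: combined log format, so the user agent is the last
# double-quoted field on the line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

# Hypothetical starter list of substrings that self-reporting crawlers
# put in their user agent; extend it with whatever shows up in your logs.
KNOWN_BOTS = ["Yahoo! Slurp", "msnbot", "Googlebot", "Twiceler"]


def tally_crawlers(lines, bots=KNOWN_BOTS, min_pages=1000):
    """Count fetches per crawler; keep those with more than min_pages."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # line does not end with a quoted field
        agent = match.group(1).lower()
        for bot in bots:
            if bot.lower() in agent:
                counts[bot] += 1
                break  # attribute each request to at most one crawler
    # most_common() sorts by count, highest first
    return [(bot, n) for bot, n in counts.most_common() if n > min_pages]
```

Feeding it an open log file (`tally_crawlers(open("access.log"))`, with the filename being whatever your server writes) gives a list of `(crawler, fetches)` pairs sorted from most to least aggressive. Note that this counts requests rather than distinct pages, and it only sees robots that identify themselves honestly.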
There are a couple of surprises here... One is how much more aggressively Yahoo is crawling than everyone else. (Maybe he should just ban Yahoo to cut his hosting fees :)
Another is how few startups are actually crawling... And the ones that are aren't correlated with the folks getting buzz right now. In three months of data I didn't see a single visit from Zermelo, Powerset's crawler. I don't see Hakia in there at all, but they do have an index and actually refer a little traffic, which leads me to believe that they've licensed a crawl from someone else.
There hasn't been a lot of public information about Cuill since Matt Marshall's brief cryptic entry on them. But they're crawling fairly aggressively, and they've put up a public about us page detailing the impressive credentials of the founders, Tom Costello, Anna Patterson and Russell Power. Anna is the author of a widely-read intro paper on how to write a search engine from scratch.
...
The conventional wisdom is that all sorts of folks are trying to take on Google and develop meaning-based search, and that France and Germany are supposedly both state-funding their own search efforts (heh). But if all these folks are out crawling the web... more than 11 of them should be showing up in webserver logs. ;)
Update: Charles Knight posts a ton of quotes from alt search engine folks on their approaches to crawling. Pretty interesting.
Comments (14)
OK, it's slightly off-topic (and much more than you really want to know) but just for the record:
The Findexa crawler is indexing for Yelo.no (which redirects to gulesider.no - the leading Norwegian yellow pages). The fun part is that Findexa doesn't really exist anymore after it was bought by Eniro, but still the crawler points to the non-existing page http://www.findexa.no/gulesider/article26548.ece ...
Posted by Hans Fredrik Nordhaug | August 5, 2007 3:26 PM
Well with Spinn3r we only crawl blog content so we shouldn't show up on a historical site.
I wonder if other crawlers/startups have similar limitations.
Posted by Kevin Burton | August 5, 2007 11:42 PM
Most of the partner sites that Congoo indexes provide RSS or XML feeds, so there is no need to crawl their sites. I can imagine that with FeedBurner and so many other open feed technologies, we aren't the only ones using feeds instead of crawling. Crawling is very inefficient. When our system crawls a site, 90% of the processing is realizing that you have already indexed this page; the other 10% is actually finding something new. Feeds are better, and you don't have to create customized crawling agents to figure out the formatting of every site. Feeds are uniform...the future of indexing.
Posted by Rafael Cosentino | August 6, 2007 6:48 AM
FAROO uses a special kind of distributed crawler, which is crawling "below the radar".
When a user opens a page with his browser, it will be automatically inserted into the distributed index of the peer-to-peer network.
Therefore there is no access to a website by FAROO itself. Thus additional network load of a traditional crawler is omitted.
Posted by wolf | August 6, 2007 7:15 AM
Sorry, I should have been more clear. I'm talking about web search startups, not niche crawlers. The web is 30 billion pages. Topix e.g. crawled 50k sites. Spinn3r got to a lot more with 3M (last count I heard) but that's still not web scale.
Posted by Rich Skrenta | August 6, 2007 9:11 AM
Rich, it's a good observation - we also see who crawls Quintura (www.quintura.com). I agree that having its own index is a necessity for a search startup. Don't you think a search startup does not need its own crawler? There are quite a few open source ones.
Posted by Yakov | August 6, 2007 10:49 AM
Again slightly offtopic, but it's sad how many of those crawlers in the list above ignore robots.txt
I've already had to 403-bin a couple of those bots (Tailrank, Cazoodle) for blatantly ignoring excluded paths and running amok.
At least the big 3 (or 4) crawlers behave. Sorta.
Posted by drac | August 7, 2007 2:41 AM
Why don't search engines share data the way Open Source shares code?
See my blog for more:
http://smoothspan.wordpress.com/2007/08/07/why-don%e2%80%99t-search-startups-share-data-aka-open-source-style-web-crawling/
Best,
BW
Posted by Bob Warfield | August 7, 2007 5:53 AM
Great article! It seems that you are generating a bit of buzz with this one. I got inspired by your post (and by Read/WriteWeb) to discuss the issue in my blog post for today (www.garystew.com). Migoa (my company) is a vertical search engine, so I guess technically we're not really covered by your review. But from what we know of the European vertical search players, only we and Extate have proprietary crawlers. Of course, this is only based on publicly available data, so it's possible that there are more European vertical search players with proprietary crawlers. In any case, thanks for the great article.
Posted by Gary Stewart | August 7, 2007 7:48 AM
For those of us in the human-powered search space, it might be because we actually visit sites ourselves instead of sending a spider out to do the dirty work.
Posted by Adam Jusko | August 7, 2007 8:09 AM
Whoa, you're up to 76 pages according to google. Only 49,999,999,924 to go.
Mahalo still has the jump on you though with 2120 pages. They only have 49,999,997,880 to go...
Posted by Rich Skrenta | August 7, 2007 8:33 AM
When I checked they said 81, a 6.5% jump in just hours. At that rate, we'll have the rest of the Web scoured in no time.
(To be fair, we do have over 600 pages, regardless of what Google might tell you. Sure it's not much, but it's a start...)
Posted by Adam Jusko | August 7, 2007 12:40 PM
Hey Rich... Much more than 3M now :)
Also, we're DEF thinking web scale :)
Posted by Kevin Burton | August 7, 2007 10:31 PM
Gurge had a great writeup about what a sloppy bot Slurp is: http://gurge.com/blog/2007/06/27/yahoo-slurp-makes-a-mess/
I wonder if they bought the 10 year extended warranty from Inktomi.
Posted by Sebastian | August 26, 2007 4:55 PM