The story goes that, one day back in the 1940s, a group of atomic scientists, including the famous Enrico Fermi, were sitting around talking, when the subject turned to extraterrestrial life. Fermi is supposed to have then asked, "So? Where is everybody?" What he meant was: If there are all these billions of planets in the universe that are capable of supporting life, and millions of intelligent species out there, then how come none has visited Earth? This has come to be known as the Fermi Paradox.
My buddy Greg Lindahl maintains a collection of historical documents on his personal website, and gets enough traffic each month that he worries about his colo bandwidth bill.
When he analyzed his web logs recently and tallied up the self-reporting robots, he was surprised at how few he actually found crawling his site, and mentioned the Fermi quote I've reproduced above. If there really are 100 search engine startups (via Charles Knight at Read/Write Web), shouldn't we be seeing more activity from them?
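I don't know exactly how Greg ran his tally, but a minimal sketch of the idea might look like the following. It assumes the server writes Apache-style combined logs (user-agent as the last quoted field), and the `BOT_HINTS` substrings and the `tally_robots` name are my own guesses for illustration, not anything from his setup. It only catches robots that identify themselves, which is the whole point of "self-reporting".

```python
import re
from collections import Counter

# Assumption: combined log format, where each line ends with
#   "referer" "user-agent"
UA_RE = re.compile(r'"[^"]*" "([^"]*)"\s*$')

# Hypothetical substrings that self-reporting robots tend to include
# in their user-agent string. Not exhaustive.
BOT_HINTS = ("bot", "crawler", "spider", "slurp", "archiver")


def tally_robots(lines, min_fetches=1000):
    """Count fetches per robot user-agent; keep those above min_fetches."""
    counts = Counter()
    for line in lines:
        match = UA_RE.search(line)
        if not match:
            continue  # line is not in the assumed format
        agent = match.group(1)
        if any(hint in agent.lower() for hint in BOT_HINTS):
            counts[agent] += 1
    # most_common() sorts by descending count, like the list in this post
    return [(agent, n) for agent, n in counts.most_common() if n > min_fetches]
```

To run it over a real log you would do something like `tally_robots(open("access.log"))`, where `access.log` is a placeholder path. Grouping the raw user-agent strings by bot name and labeling each one as bigco, startup, and so on would still be a manual step.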
Here is the list of every crawler that fetched over 1000 pages over the past three months (pages fetched, crawler name, info URL, and my rough category for each):
1612960  Yahoo! Slurp      help.yahoo.com                       bigco
 365308  msnbot            search.msn.com/msnbot.htm            bigco
 148090  Googlebot         www.google.com/bot.html              bigco
 140120  VoilaBot          www.voila.com                        bigco
  68829  Ask Jeeves/Teoma  about.ask.com                        bigco
  62005  psbot             www.picsearch.com/bot.html           startup
  39193  BecomeBot         www.become.com/site_owners.html      shopping
  30006  WebVac            www.WebVac.org                       edu
  29778  ShopWiki          www.shopwiki.com/wiki/Help:Bot       shopping
  22124  noxtrumbot        www.noxtrum.com                      bigco
  20963  Twiceler          www.cuill.com/twiceler/robot.html    startup
  17113  MJ12bot           majestic12.co.uk/bot.php             startup
  15650  Gigabot           www.gigablast.com/spider.html        startup
  10404  ia_archiver       www.archive.org                      nonprofit
   9337  Seekbot           www.seekbot.net/bot.html             startup
   9152  genieBot          www.genieknows.com                   startup
   7246  FAST MetaWeb      www.fastsearch.com                   enterprise
   7243  worio bot         worio.com                            edu
   6868  CazoodleBot       www.cazoodle.com                     startup
   6608  ConveraCrawler    www.authoritativeweb.com/crawl       enterprise
   6293  IRLbot            irl.cs.tamu.edu/crawler              edu
   5487  Exabot            www.exabot.com/go/robot              bigco
   4215  ilial             www.ilial.com/crawler                startup
   3991  SBIder            www.sitesell.com/sbider.html         memetracker
   3673  boitho-dcbot      www.boitho.com/dcbot.html            enterprise
   3601  accelobot         www.accelobot.com                    memetracker
   2878  Accoona-AI-Agent  www.accoona.com                      startup
   2521  Factbot           www.factbites.com                    startup
   2054  heritrix          i.stanford.edu                       edu
   2003  Findexa           www.findexa.no                       ?
   1760  appie             www.walhello.com                     startup?
   1678  envolk            www.envolk.com                       spammers
   1464  ichiro            help.goo.ne.jp/door/crawler.html     bigco
   1165  IDBot             www.id-search.org/bot.html           edu
   1161  Sogou             www.sogou.com/docs/help              bigco
   1029  Speedy Spider     www.entireweb.com                    bigco
There are a couple of surprises here... One is how much more aggressively Yahoo is crawling than everyone else. (Maybe he should just ban Yahoo to cut his hosting fees :)
Another is how few startups are actually crawling... And the ones that are aren't correlated with the folks getting buzz right now. In three months of data I didn't see a single visit from Zermelo, Powerset's crawler. I don't see Hakia in there at all, but they do have an index and actually refer a little traffic, which leads me to believe that they've licensed a crawl from someone else.
There hasn't been a lot of public information about Cuill since Matt Marshall's brief, cryptic entry on them. But they're crawling fairly aggressively, and they've put up a public "About Us" page detailing the impressive credentials of the founders, Tom Costello, Anna Patterson and Russell Power. Anna is the author of a widely-read intro paper on how to write a search engine from scratch.
The conventional wisdom is that there are all sorts of folks trying to take on Google and develop meaning-based search, and that France and Germany are both state-funding their own search efforts (heh). But if all these folks are out crawling the web... more than 11 of them should be showing up in webserver logs. ;)
Update: Charles Knight posts a ton of quotes from alt search engine folks on their approaches to crawling. Pretty interesting.