
The 11 startups actually crawling the web

The story goes that, one day back in the 1940s, a group of atomic scientists, including the famous Enrico Fermi, were sitting around talking when the subject turned to extraterrestrial life. Fermi is supposed to have asked, "So? Where is everybody?" What he meant was: if there are billions of planets in the universe capable of supporting life, and millions of intelligent species out there, then how come none of them has visited Earth? This has come to be known as the Fermi Paradox.

My buddy Greg Lindahl maintains a collection of historical documents on his personal website, and gets enough traffic each month that he worries about his colo bandwidth bill.

When he analyzed his web logs recently and tallied up the self-reporting robots, he was surprised at how few he actually found crawling his site, and mentioned the Fermi quote I've reproduced above. If there really are 100 search engine startups (via Charles Knight at Read/WriteWeb), shouldn't we be seeing more activity from them?
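
(As an aside, a tally like this is easy to reproduce. Here's a minimal Python sketch of the log-grinding involved, assuming an Apache combined-format access log; the filename and the bot-marker heuristic are my assumptions, not Greg's actual script:)

    import re
    from collections import Counter

    # The quoted User-Agent is the last field of an Apache
    # "combined" format log line: ... "referer" "user-agent"
    UA_RE = re.compile(r'"([^"]*)"\s*$')

    # Self-reporting robots usually include one of these markers
    # (and often an info URL) in their User-Agent string.
    BOT_MARKERS = ("bot", "crawler", "spider", "slurp", "archiver")

    def tally_robots(logfile):
        counts = Counter()
        with open(logfile, errors="replace") as f:
            for line in f:
                m = UA_RE.search(line)
                if m and any(b in m.group(1).lower() for b in BOT_MARKERS):
                    counts[m.group(1)] += 1
        return counts

    for ua, n in tally_robots("access.log").most_common():
        if n > 1000:  # same cutoff as the list below
            print(n, ua)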

Here is the list of every crawler that fetched over 1,000 pages in the past three months (pages fetched, robot name, info URL, and category):

1612960 Yahoo! Slurp help.yahoo.com bigco
365308 msnbot search.msn.com/msnbot.htm bigco
148090 Googlebot www.google.com/bot.html bigco
140120 VoilaBot www.voila.com bigco
68829 Ask Jeeves/Teoma about.ask.com bigco
62005 psbot www.picsearch.com/bot.html startup
39193 BecomeBot www.become.com/site_owners.html shopping
30006 WebVac www.WebVac.org edu
29778 ShopWiki www.shopwiki.com/wiki/Help:Bot shopping
22124 noxtrumbot www.noxtrum.com bigco
20963 Twiceler www.cuill.com/twiceler/robot.html startup
17113 MJ12bot majestic12.co.uk/bot.php startup
15650 Gigabot www.gigablast.com/spider.html startup
10404 ia_archiver www.archive.org nonprofit
9337 Seekbot www.seekbot.net/bot.html startup
9152 genieBot www.genieknows.com startup
7246 FAST MetaWeb www.fastsearch.com enterprise
7243 worio bot worio.com edu
6868 CazoodleBot www.cazoodle.com startup
6608 ConveraCrawler www.authoritativeweb.com/crawl enterprise
6293 IRLbot irl.cs.tamu.edu/crawler edu
5487 Exabot www.exabot.com/go/robot bigco
4215 ilial www.ilial.com/crawler startup
3991 SBIder www.sitesell.com/sbider.html memetracker
3673 boitho-dcbot www.boitho.com/dcbot.html enterprise
3601 accelobot www.accelobot.com memetracker
2878 Accoona-AI-Agent www.accoona.com startup
2521 Factbot www.factbites.com startup
2054 heritrix i.stanford.edu edu
2003 Findexa www.findexa.no ?
1760 appie www.walhello.com startup?
1678 envolk www.envolk.com spammers
1464 ichiro help.goo.ne.jp/door/crawler.html bigco
1165 IDBot www.id-search.org/bot.html edu
1161 Sogou www.sogou.com/docs/help bigco
1029 Speedy Spider www.entireweb.com bigco

There are a couple of surprises here... One is how much more aggressively Yahoo is crawling than everyone else. (Maybe he should just ban Yahoo to cut his hosting fees :)
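
(For the record, that ban is two lines of robots.txt, assuming Slurp keeps honoring it:)

    User-agent: Slurp
    Disallow: /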

Another is how few startups are actually crawling... and the ones that are crawling aren't the ones getting buzz right now. In three months of data I didn't see a single visit from Zermelo, Powerset's crawler. I don't see Hakia in there at all, but they do have an index and actually refer a little traffic, which leads me to believe they've licensed a crawl from someone else.

There hasn't been a lot of public information about Cuill since Matt Marshall's brief, cryptic entry on them. But they're crawling fairly aggressively, and they've put up a public "about us" page detailing the impressive credentials of the founders, Tom Costello, Anna Patterson, and Russell Power. Anna is the author of a widely read intro paper on how to write a search engine from scratch.

...

The conventional wisdom is that there are all sorts of folks trying to take on Google and develop meaning-based search; France and Germany are supposedly both state-funding their own search efforts (heh). But if all these folks were really out crawling the web... more than 11 of them should be showing up in webserver logs. ;)

Update: Charles Knight posts a ton of quotes from alt search engine folks on their approaches to crawling. Pretty interesting.

TrackBack

Listed below are links to weblogs that reference The 11 startups actually crawling the web:

» Why Aren't Alt Search Engines Crawling Websites? from Read/WriteWeb
Based on log file evidence from a friend who runs a personal website, Rich Skrenta claims that only 11 search startups are actually crawling the web. He wonders where all the alt search engines are? For some reason, Rich doesn't... [Read More]

» SearchCap: The Day In Search, August 7, 2007 from Search Engine Land: News About Search Engines & Search Marketing
Below is what happened in search today, as reported on Search Engine Land and from other places across the web.... [Read More]

Comments (14)

Hans Fredrik Nordhaug:

OK, it's slightly off-topic (and much more than you really want to know) but just for the record:

The Findexa crawler is indexing for Yelo.no (which redirects to gulesider.no, the leading Norwegian yellow pages). The fun part is that Findexa doesn't really exist anymore after it was bought by Eniro, but the crawler still points to the nonexistent page http://www.findexa.no/gulesider/article26548.ece ...

Well, with Spinn3r we only crawl blog content, so we shouldn't show up on a historical site.

I wonder if other crawlers/startups have similar limitations.

Rafael Cosentino:

Most of the partner sites that Congoo indexes provide RSS or XML feeds, so there is no need to crawl their sites. I imagine that with FeedBurner and so many other open feed technologies, we aren't the only ones using feeds instead of crawling. Crawling is very inefficient: when our system crawls a site, 90% of the processing is realizing that you have already indexed a page; the other 10% is actually finding something new. Feeds are better, and you don't have to create customized crawling agents to figure out the formatting of every site. Feeds are uniform... the future of indexing.
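
To make that concrete, here's a rough Python sketch of the feed-polling approach, using the feedparser library with conditional GET so that an unchanged feed costs almost nothing to re-check (the feed URL is made up):

    import feedparser  # Universal Feed Parser

    def poll(url, etag=None, modified=None):
        # Pass the ETag/Last-Modified from the previous poll so the
        # server can answer 304 Not Modified instead of resending.
        d = feedparser.parse(url, etag=etag, modified=modified)
        if getattr(d, "status", None) == 304:
            return [], etag, modified  # nothing new since last time
        entries = [(e.get("link"), e.get("title")) for e in d.entries]
        return entries, d.get("etag"), d.get("modified")

    # e.g. (hypothetical feed URL):
    # entries, etag, mod = poll("http://example.com/index.xml")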

wolf:

FAROO uses a special kind of distributed crawler, which is crawling "below the radar".

When a user opens a page in their browser, it is automatically inserted into the distributed index of the peer-to-peer network.

FAROO itself therefore never fetches a site directly, so the extra network load of a traditional crawler is avoided.

Sorry, I should have been clearer. I'm talking about web search startups, not niche crawlers. The web is 30 billion pages. Topix, for example, crawled 50k sites. Spinn3r got to a lot more with 3M (last count I heard), but that's still not web scale.

Rich, it's a good observation - we also see who crawls Quintura (www.quintura.com). I agree that having its own index is a necessity for a search startup. But don't you think a search startup doesn't need its own crawler? There are quite a few open-source ones.

drac:

Again slightly off-topic, but it's sad how many of the crawlers in the list above ignore robots.txt.

I've already had to 403-bin a couple of those bots (Tailrank, Cazoodle) for blatantly ignoring excluded paths and running amok.

At least the big 3 (or 4) crawlers behave. Sorta.
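
Verifying a violation is easy enough, by the way; here's a quick Python sketch using the standard library's robots.txt parser (the bot name and URLs are made-up examples):

    import urllib.robotparser

    def allowed(bot, site, path):
        # Fetch and parse the site's robots.txt, then test one path
        # against the rules that apply to this bot's User-Agent.
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(site.rstrip("/") + "/robots.txt")
        rp.read()
        return rp.can_fetch(bot, site.rstrip("/") + path)

    # Was that logged fetch actually permitted?
    print(allowed("CazoodleBot", "http://example.com", "/excluded/page.html"))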

Why don't search engines share data the way Open Source shares code?

See my blog for more:

http://smoothspan.wordpress.com/2007/08/07/why-don%e2%80%99t-search-startups-share-data-aka-open-source-style-web-crawling/

Best,

BW

Great article! It seems you are generating a bit of buzz with this one. I was inspired by your post (and by Read/WriteWeb) to discuss the issue in my blog post for today (www.garystew.com). Migoa (my company) is a vertical search engine, so I guess technically we're not really covered by your review. But from what we know of the European vertical search players, only we and Extate have proprietary crawlers. Of course, this is based only on publicly available data, so it's possible there are more European vertical search players with proprietary crawlers. In any case, thanks for the great article.

For those of us in the human-powered search space, it might be because we actually visit sites ourselves instead of sending a spider out to do the dirty work.

Whoa, you're up to 76 pages according to Google. Only 49,999,999,924 to go.

Mahalo still has the jump on you though with 2120 pages. They only have 49,999,997,880 to go...

When I checked they said 81, a 6.5% jump in just hours. At that rate, we'll have the rest of the Web scoured in no time.

(To be fair, we do have over 600 pages, regardless of what Google might tell you. Sure it's not much, but it's a start...)

Hey Rich... Much more than 3M now :)

Also, we're DEF thinking web scale :)

Sebastian:

Gurge had a great writeup about what a sloppy bot Slurp is: http://gurge.com/blog/2007/06/27/yahoo-slurp-makes-a-mess/

I wonder if they bought the 10-year extended warranty from Inktomi.

