Did Powerset outsource their crawl?

I've been seeing Zermelo, Powerset's crawler, hitting my pages. Sort of:

ec2-67-202-8-249.compute-1.amazonaws.com - - [28/Mar/2008:23:31:06 -0700] "GET /2006/12/scale_limits_design.html HTTP/1.0" 200 11526 "http://www.skrenta.com/2006/12/i_took_a_ukulele_lesson_once.html" "zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]"
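If you want to check your own logs for the same visitor, here is a minimal sketch that pulls the reverse-DNS host and the contact addresses out of a line like the one above. It assumes the Apache "combined" log format with hostname lookups enabled; the names `COMBINED`, `parse_line`, and `crawler_contacts` are made up for this example.

```python
import re

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None

def crawler_contacts(agent):
    """Return the e-mail addresses advertised as email:... in a user-agent."""
    return re.findall(r'email:([^\s,\]]+)', agent)

# The log line quoted in this post, split across string literals for width.
LINE = (
    'ec2-67-202-8-249.compute-1.amazonaws.com - - [28/Mar/2008:23:31:06 -0700] '
    '"GET /2006/12/scale_limits_design.html HTTP/1.0" 200 11526 '
    '"http://www.skrenta.com/2006/12/i_took_a_ukulele_lesson_once.html" '
    '"zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) '
    '[email:crawl@powerset.com,email:paul@page-store.com]"'
)

entry = parse_line(LINE)
print(entry["host"])                     # where the request came from
print(crawler_contacts(entry["agent"]))  # who the crawler says to contact
```

The host field shows the Amazon origin and the user-agent carries both contact addresses, which are the two details that make this hit look outsourced.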

They're using the open-source Heritrix crawler, running out of Amazon Web Services. But who is page-store.com? From their site:

Vertical search sites are relatively costly to operate. A single vertical search engine may need to sweep all or a large part of the web selecting the pages pertinent to a small set of topics. Startup and operating costs are proportional to the input page set size, but revenue may be only proportional to the size of the selected subset.

Page-store positions itself as a web wholesaler, supplying page and link information to vertical search engine companies on a per-use basis. The effect is to level the playing field between vertical search and general horizontal internet search.

Page-store can provide

  • selected page feeds based on deep web crawls
  • page metadata
  • black-box filters
  • anchor text results
  • link information

Did Powerset outsource their crawl?

Comments (4)

If it's a stock Heritrix they're going to run into relevancy problems.

Heritrix doesn't do any URL reordering in its frontier so you're going to stumble through the Internet without any type of prioritization.

That's fine for Heritrix as it's mostly for archive crawls.

Of course, this is assuming that Powerset is doing a full Internet crawl. They might be doing a smaller crawl just to get a subset of the .net.

Kevin

Mark:

We were hit by a crawler out of a Washington state company. We called the company and some kid who answered said that they were crawling on behalf of Google. (I suspect he wasn't supposed to say that, but we caught them after hours and this kid answered.)

I think this is pretty common. In the case of Google, they need to check on cloaking, and they can only do that from IP addresses different from their normal ones, and from crawlers that give a false user agent.

@Kevin:

Not sure what you mean by relevancy concerns -- in a broad crawl, the generally breadth-first, host-rotation visitation strategy of Heritrix isn't likely to bury any highly inlinked pages behind less-relevant material for long.

You can also weight the relative effort spent on certain domains or hosts, so no early-discovered deep site need crowd out the tops of later-discovered sites. And, the ordering within queues is customizable, especially in the latest 2.0 release.

I had heard a rumor PowerSet was using Heritrix early on, then that they'd moved away, so it's interesting to see them using Heritrix again via page-store. Don't know their scale, but there's at least one other group that's doing repeated 4-billion-plus-URL crawls with Heritrix.

- Gordon
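To make the "breadth-first, host-rotation" idea in the comment above concrete, here is a toy sketch of a frontier that keeps one FIFO queue per host and serves hosts round-robin. It is not Heritrix's code, and the class and method names are invented for illustration; a real crawler frontier would also need politeness delays, per-host budgets, and persistence.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit

class HostRotatingFrontier:
    """Toy frontier: one FIFO queue per host, hosts visited in rotation."""

    def __init__(self):
        self.queues = OrderedDict()  # host -> deque of URLs, in rotation order
        self.seen = set()            # URLs already scheduled, to avoid repeats

    def add(self, url):
        """Schedule a URL unless it has been scheduled before."""
        if url in self.seen:
            return
        self.seen.add(url)
        host = urlsplit(url).hostname or ""
        self.queues.setdefault(host, deque()).append(url)

    def pop(self):
        """Return the next URL to fetch, or None when the frontier is empty."""
        if not self.queues:
            return None
        # Take the host at the front of the rotation and its oldest URL.
        host, queue = self.queues.popitem(last=False)
        url = queue.popleft()
        if queue:
            # The host still has work, so it rejoins at the back of the rotation.
            self.queues[host] = queue
        return url

frontier = HostRotatingFrontier()
for u in ["http://a.example/1", "http://a.example/2",
          "http://a.example/3", "http://b.example/1"]:
    frontier.add(u)

order = []
while (url := frontier.pop()) is not None:
    order.append(url)
print(order)
# ['http://a.example/1', 'http://b.example/1', 'http://a.example/2', 'http://a.example/3']
```

Even though three URLs from a.example were discovered first, b.example gets its turn after a single fetch, which is the sense in which an early-discovered deep site need not crowd out later-discovered ones.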

Outsource Medical Transcription:

PowerSet could have outsourced their crawl in order to hide their activity. But since Heritrix is open source, anybody could have access to it.

About

This page contains a single entry from the blog posted on April 7, 2008 8:55 AM.
