ec2-67-202-8-249.compute-1.amazonaws.com - - [28/Mar/2008:23:31:06 -0700] "GET /2006/12/scale_limits_design.html HTTP/1.0" 200 11526 "http://www.skrenta.com/2006/12/i_took_a_ukulele_lesson_once.html" "zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:email@example.com,email:firstname.lastname@example.org]"
They're using the open-source Heritrix crawler, running out of Amazon Web Services. But who is page-store.com? From their site:
Vertical search sites are relatively costly to operate. A single vertical search engine may need to sweep all or a large part of the web selecting the pages pertinent to a small set of topics. Startup and operating costs are proportional to the input page set size, but revenue may be only proportional to the size of the selected subset.
Page-store positions itself as a web wholesaler, supplying page and link information to vertical search engine companies on a per-use basis. The effect is to level the playing field between vertical search and general horizontal internet search.
Page-store can provide
- selected page feeds based on deep web crawls
- page metadata
- black-box filters
- anchor text results
- link information
Did Powerset outsource their crawl?