
April 2008 Archives

April 7, 2008

Did Powerset outsource their crawl?

I've been seeing Zermelo, Powerset's crawler, hitting my pages. Sort of:

ec2-67-202-8-249.compute-1.amazonaws.com - - [28/Mar/2008:23:31:06 -0700] "GET /2006/12/scale_limits_design.html HTTP/1.0" 200 11526 "http://www.skrenta.com/2006/12/i_took_a_ukulele_lesson_once.html" "zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]"
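
If you want to tally this sort of thing in your own logs, a quick Perl scan will do it. A throwaway sketch (the script name and its default pattern are mine, nothing official) that counts hits per user-agent matching a substring in an Apache combined-format log:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Count hits per user-agent in an Apache combined-format log,
    # keeping only agents that match a substring from the command line.
    my $pattern = shift || 'zermelo';
    my %hits;

    while (my $line = <>) {
        # the user-agent is the last double-quoted field on the line
        next unless $line =~ /"([^"]*)"\s*$/;
        my $ua = $1;
        $hits{$ua}++ if $ua =~ /\Q$pattern\E/i;
    }

    for my $ua (sort { $hits{$b} <=> $hits{$a} } keys %hits) {
        printf "%6d  %s\n", $hits{$ua}, $ua;
    }

Run it as, e.g., perl bothits.pl zermelo < access_log.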

They're using the open-source Heritrix crawler, running out of Amazon Web Services. But who is page-store.com? From their site:

Vertical search sites are relatively costly to operate. A single vertical search engine may need to sweep all or a large part of the web selecting the pages pertinent to a small set of topics. Startup and operating costs are proportional to the input page set size, but revenue may be only proportional to the size of the selected subset.

Page-store positions itself as a web wholesaler, supplying page and link information to vertical search engine companies on a per-use basis. The effect is to level the playing field between vertical search and general horizontal internet search.

Page-store can provide

  • selected page feeds based on deep web crawls
  • page metadata
  • black-box filters
  • anchor text results
  • link information

Did Powerset outsource their crawl?

April 8, 2008

Cuill is banned on 10,000 sites

Be careful while you debug your crawler...

Webmasters these days get very touchy about letting new spiders walk all over their sites. There are so many scraper bots, email harvesters, exploit probers, students running Nutch on gigabit university pipes, and other ill-behaved new search bots that some site owners nervously huddle in forum bunkers, anxiously scanning their logs for suspect new visitors so they can quickly issue bot and IP bans.

Cuill, the much-anticipated search startup from ex-Googlers, seems to have run a rather high-rate crawl when they were getting started, one that generated a large number of robots.txt bans. Here is a list of sites which have banned Cuill's user-agent, "Twiceler".

A well-behaved crawler needs to follow a set of loosely-defined behaviors to be 'polite': don't crawl a site too fast, don't crawl any single IP address too fast, don't pull too much bandwidth from small sites by e.g. downloading tons of full-res media that will never be indexed, meticulously obey robots.txt, identify itself with a user-agent string that points to a detailed web page explaining the purpose of the bot, etc.
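
Perl's LWP::RobotUA gets you two of those behaviors for free: it fetches and honors robots.txt, and it enforces a minimum delay between requests to the same host. A minimal sketch (the bot name and addresses are hypothetical):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::RobotUA;

    # LWP::RobotUA obeys robots.txt and rate-limits per host.
    my $ua = LWP::RobotUA->new(
        agent => 'examplebot/0.1 (+http://example.com/bot.html)',  # hypothetical
        from  => 'crawl@example.com',                              # hypothetical
    );
    $ua->delay(1);    # at least 1 minute between hits on a given host

    for my $url (@ARGV) {
        my $res = $ua->get($url);
        # a robots.txt ban comes back as 403 "Forbidden by robots.txt"
        print $res->code, " ", $url, "\n";
    }

The rest of the politeness rules - per-IP limits, skipping heavy media - you still have to build yourself.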

Apart from the widely-recognized challenges of building a new search engine, sites like del.icio.us and compete.com that ban all new robots aside from the big 4 (Google, Yahoo, MSN and Ask) make it that much harder for a new entrant to gain a foothold. However, the web is so bloody vast that even tens of thousands of site bans are unlikely to make a significant impact on the aggregate perceived quality of a major new engine.

My initial take was that this had to be annoying for Cuill. As a crawler author, I can attest that each new site rejection hurts personally. :) But now I'm not so sure. Looking over the list, aside from a few major sites like Yelp, you could argue that getting all the forum SEOs to robots-exclude your new engine might actually improve your index quality. Perhaps a Cuill robots ban is a quality signal? :)

April 9, 2008

AppEngine - Web HyperCard, finally

Google's AppEngine is being compared to Amazon's EC2/S3. But Google deserves credit here for coming up with a very differently positioned product. There may be overlap for many users, of course, but it's really operating at a whole different level of the stack.

Folks who want or need more control over the environment - the ability to manually manage their own machine instances, run code other than Python, etc. - will stay with EC2. EC2 is a step above Rackspace.

But rather than thinking of AppEngine as a step above EC2, I think of it as sitting somewhere around MySpace. Or "Ning 1.0", as Zoho points out.

In the beginning was GeoCities... No, even further back, in the beginning was HyperCard. HyperCard was a pre-web application for Macs that let you design a "stack" of pages - a website on a floppy, really. Popular stacks got traded far and wide. HyperCard stacks existed for every imaginable purpose - "Time Table of History", games, crossword puzzles, the Bible, etc.

The thing about HyperCard was that it wasn't just static text and images like base HTML. It had a scripting language, a database, and the Apple UI built in, so you could create mini applications.

It feels like the web has been trying to claw its way back to the simple utility of HyperCard ever since Mosaic. GeoCities was the first massive-uptake anyone-can-build-here website haven. But it was all static HTML.

Sure, you can paste JavaScript widgets onto your page, and have content driven by external sites. But to make the website a first-class object - on functional parity with a "real" website - it needs to be backed by a database and programmability. And setting up MySQL, renting machine space, configuring Linux, programming all the boilerplate, not to mention the scalability issues if your site gets popular - this is all a big hurdle.

So to hide all those details behind a platform that's easy to get started with, and lower the barrier to entry for writing public application websites... Well, that's a big deal. Hats off to Google for bringing this to market.

I'm not alone...somewhat similar thoughts from Nate Westheimer...

April 14, 2008

Cluster map propagation in Amazon Dynamo

Dynamo is Amazon's scalable key/value storage service. The paper is a good read, but I found the way the cluster node list information is propagated in Dynamo to be a little odd. The algorithm is that every 60 seconds a node will talk to another node in the cluster, chosen at random, and exchange update information. I wondered how fast a change would propagate through the cluster, so I simulated the propagation.
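
Something like this, in spirit - a minimal sketch that assumes synchronous rounds and a push-pull exchange (if either side of a gossip pair has heard about the change, both have afterward); the paper doesn't pin the model down this precisely:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Simulate rumor spread under Dynamo-style gossip: each round,
    # every node contacts one random peer and the pair exchange state.
    my $nodes = 5000;
    my @knows = (0) x $nodes;    # has node $i heard about the change?
    $knows[0] = 1;               # the change starts at a single node
    my $aware = 1;
    my $round = 0;

    while ($aware < $nodes) {
        $round++;
        my @next = @knows;
        for my $i (0 .. $nodes - 1) {
            my $peer = int rand $nodes;
            next if $peer == $i;
            # push-pull: after the exchange both sides know
            if ($knows[$i] || $knows[$peer]) {
                $next[$i] = $next[$peer] = 1;
            }
        }
        @knows = @next;
        $aware = grep { $_ } @knows;
        printf "round %2d: %5d of %d nodes know\n", $round, $aware, $nodes;
    }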

For a 5,000 node cluster it takes about 9 update cycles for a change to reach every other node. Since each update is on a 60 second timer, that's 9 minutes for a change to push out.

I didn't do a very sophisticated time model... plus there is random start and all that. So maybe in practice it's a little different. But 9 minutes seems like a long time to propagate a host change out to the rest of the cluster. Maybe I misinterpreted what they're doing?

I recall some confusion about whether Dynamo was actually providing SimpleDB, or if they were two separate software systems. Does anyone know if this was resolved?

April 16, 2008

Web robot names considered, and rejected

Google's is "Googlebot"
Yahoo's is "Slurp"
Cuill's is "Twiceler"

It makes sense to have a friendly robot user-agent, so nervous webmasters won't ban it. You don't want to call your crawler 'sitejacker' or something. Unfortunately, my favorite candidates were:

Crawlhammer
Webraker
Lurchy
Client9

hmmm. :-|

"Oh no! It's CrawlHammer!!"

If even in your heart you hide the URLs ... there it shall rake for them...

...

Does anyone know what the purpose of the '+' in front of a URL in a robot's user-agent string is? Some sites put in the '+', others don't...

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Mozilla/5.0 (compatible; Ask Jeeves/Teoma; +http://about.ask.com/en/docs/about/webmasters.shtml)

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)

Gigabot/3.0 (http://www.gigablast.com/spider.html)

Microsoft "hits back" at Google with re-launch of 4-year old Newsbot

The memecrowd sure has a short memory... maybe I'm just showing my age here, but still.
CNET: Microsoft hits back at Google with Live Search News
Search Engine Land: Microsoft Launches Live Search News
Search Engine Watch: Windows Live Search Offers Google News Alternative

MSN Newsbot? Anyone? From 2004:

CNET: Google News faces Microsoft rival (Jul 27, 2004)
Wash Post: Microsoft Deploys Newsbot To Track Down Headlines (Aug 1, 2004)
Geeking with Greg: MSN Newsbot review (Jul 27, 2004)

April 22, 2008

Starbucks "re" branding

It will be interesting to see how the return of Howard Schultz, the CEO who originally built Starbucks, and the return to their original plan and ideas turns out.

He's already had one successful stunt, closing the whole system for three hours to retrain workers in how to make coffee, which generated a lot of PR.

Now comes the introduction of a new house blend, named after the original Starbucks store. But also, surprise! The original logo is back.

Usually logos and identities get vaguer, cleaner, and more abstract as the MBAs wash/rinse/repeat. Starbucks is going back to the gritty and vaguely obscene logo they launched with.


Deadprogrammer famously detailed the history of the Starbucks logo, going back to a 15th-century woodcut. The original logo was slightly sanitized, but each corporate revision made it more and more abstract and less recognizable as to what it actually was. My wife said "I had no idea there was even anything inside that circle; I had never looked until you pointed it out to me."

Face logos are great brands, but they always seem to get watered down and more cartoony over time. This is the case with a lot of the face logos on food at the grocery store; the original versions were closer to actual faces than abstract logos (think Chef Boyardee here).

This happened to KFC with the Colonel... he started out as a realistic line drawing of Colonel Sanders with the company name, "Kentucky Fried Chicken." After the waves of rebranding stylists were done with him, he was an abstract cartoon. They couldn't stop there and abbreviated the company name. You wouldn't want to realize you're eating FRIED CHICKEN when you're at KFC, after all. You probably want to be eating a healthy salad with dressing on the side. That's why you went in there, right??

I bet Dunkin' Donuts wishes they could rename themselves "DD". Hmmm, maybe "empty vessel" names aren't so bad after all... :)

It's interesting to think about brand identities that get going because they're a little gritty and different and personal. They don't start out whitewashed, but after getting successful they put on the bland suit. What would the AOL redesigners do to Drudge's site if they bought it?

Hypertable architecture talk Wednesday in Palo Alto

Doug Judd will be discussing the internals and architecture of Hypertable tomorrow in Palo Alto at 6:30pm.

Hypertable is an open source, high performance, distributed database modeled after Google's Bigtable. It differs from traditional relational database technology in that the emphasis is on scalability as opposed to transaction support and table joining. Tables in Hypertable are sorted by a single primary key. However, tables can smoothly and cost-effectively scale to petabytes in size by leveraging a large cluster of commodity hardware. Hypertable is designed to run on top of an existing distributed file system such as the Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). One of the top design objectives for this project has been optimum performance. To that end, the system is written almost entirely in C++, which differentiates it from other Bigtable-like efforts, such as HBase. We expect Hypertable to replace MySQL for much of Web 2.0 backend technology. In this presentation, Doug will give an architectural overview of Hypertable. He will describe some of the key design decisions and will highlight some of the places where Hypertable diverges from the system described in the Bigtable paper.
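
The "sorted by a single primary key" part is the heart of the Bigtable data model: instead of joining tables, you design row keys so that rows you want to read together sort together, and a query becomes a key-range scan. A toy illustration in Perl - just the idea, not Hypertable's actual API, and the rows are made up:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Toy Bigtable-style table: one sorted key, values hang off it.
    # Keys are "reversed-domain/date" so one site's rows cluster together.
    my %table = (
        'com.example.www/2008-04-14' => 'page A',   # hypothetical rows
        'com.example.www/2008-04-16' => 'page B',
        'com.example.www/2008-04-22' => 'page C',
        'org.example.www/2008-04-01' => 'page D',
    );

    # "Query" = scan the contiguous key range sharing a prefix.
    my $prefix = 'com.example.www/';
    for my $key (sort keys %table) {
        next unless index($key, $prefix) == 0;
        print "$key => $table{$key}\n";
    }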

More details.

April 24, 2008

Microsoft bias in MSN search results, surprise

I was looking to see which search sites might have a particular bug that I (ahem) came across, so I was trying searches for the number 0 in various places. There is a pretty good Wikipedia page about zero. Zero has a rich and interesting history, and there are many other potentially reasonable results.

But I was surprised to see MSN Search had demoted their good results below some crappy ones from MSDN.

Lame! Boosting a page with an inferior lexical match and lower overall relevance just because it's from their own network... give 'em credit for being old school. :)

...

I found my bug on Yahoo Search. I had tried a lot of smaller engines first because I didn't think a major would have this bug. You can't search for 0 on Yahoo. You can search for all the other numbers, but not 0 ...

Why? Because 0 is false. It suggests Yahoo is using a scripting language to front their search form, and a programmer did something like if ( $query ) rather than if ( $query ne '' ).
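
A two-minute demonstration of the trap, assuming the front end really is Perl (which that syntax suggests):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # In Perl the *string* "0" is false, so a bare truth test can't
    # tell "no query" apart from a search for 0.
    for my $query ('', '0', 'zero') {
        my $truth = $query       ? 'truthy'    : 'falsy';
        my $empty = $query ne '' ? 'non-empty' : 'empty';
        printf "[%s] %s, %s\n", $query, $truth, $empty;
    }

The middle case prints "[0] falsy, non-empty" - exactly the query that if ( $query ) throws away.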
