
What doesn't clog your algo makes it stronger...

Valleywag outed the startup day job of the guys who collectively edit the hilarious snark site uncov. The startup, Persai, was "hiding in plain site" since they have a blog and have been pretty open about the tech they're using and their daily gripes.
"Persai is a startup that seeks to apply advanced machine learning techniques to content and advertising. We are using Amazon's web services to build a scalable architecture that will learn from consumer interests over time and match them with content crawled from around the web. The idea behind Persai is that you will have an active agent crawling the web looking for content that is relevant to you and only you. Every link we recommend will be something you want to read. We are zigging to social news' zag where popularity trumps relevance to the individual."
    -- from news.ycombinator

Anyway, a few days ago Persai released a Nutch webcrawl-generated set of "118,254 feeds of pure greatness". Intertwingly begged to differ about the quality after running some stats on the feeds. This generated some interesting comments...one in particular jumped out at me:

But if you look at the list itself, two sites are grossly overrepresented, and they account for the majority of the 301s and Timeouts. [emphasis mine]

I got a sinking feeling as I read this. I had curl'd over the corpus already to eyeball it ...yeah that's a list of feeds all right... but hadn't tallied the domains...

$ sed -e 's/^http:..//' -e 's/\/.*$//' persai_feedcorpus | count | head
 35695   rss.topix.net
 14613   izynews.de
  2831   feeds.feedburner.com
  1869   p.moreover.com
  1314   www.livejournal.com
  1241   rss.groups.yahoo.com
  1191   www.discountwatcher.com
  1096   news.bbc.co.uk
  1072   www.alibaba.com
   882   xml.newsisfree.com
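
A note for anyone replaying this: `count` in the pipeline above is not a standard utility. I'm assuming it is a local helper that behaves roughly like `sort | uniq -c | sort -rn`. Under that assumption, a minimal sketch of the same tally looks like this (`tally_domains` is just an illustrative name, not something from the Persai or Intertwingly posts):

```shell
# Tally feed URLs by host, most frequent first.
# Assumption: the `count` helper in the original pipeline is roughly
# equivalent to `sort | uniq -c | sort -rn`.
tally_domains() {
  # Strip the leading "http://" and everything from the first "/" onward,
  # leaving only the host, then count duplicates and sort by count.
  sed -e 's/^http:..//' -e 's/\/.*$//' | sort | uniq -c | sort -rn
}

# Usage, with the corpus file named as in the session above:
#   tally_domains < persai_feedcorpus | head
```

The `sed` expressions are the same ones used above, so `https` URLs would not be handled; that matches the original command rather than improving on it.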

Nooooo... Of course.. Sigh.

Comments (2)


Yea, the list sucks at the moment. It's 2 days' worth of work and I just posted the corpus for the hell of it. Oh well, that's what you get for putting it on the internet.

Chris Tolles:

I guess we need to do some looking around here....


This page contains a single entry from the blog posted on August 6, 2007 7:54 PM.
