"Persai is a startup that seeks to apply advanced machine learning techniques to content and advertising. We are using Amazon's web services to build a scalable architecture that will learn from consumer interests over time and match them with content crawled from around the web. The idea behind Persai is that you will have an active agent crawling the web looking for content that is relevant to you and only you. Every link we recommend will be something you want to read. We are zigging to social news' zag where popularity trumps relevance to the individual."
-- from news.ycombinator
Anyway, a few days ago Persai released a Nutch webcrawl-generated set of "118,254 feeds of pure greatness". Intertwingly begged to differ about the quality after running some stats on the feeds. This generated some interesting comments...one in particular jumped out at me:
But if you look at the list itself, two sites are grossly overrepresented, and they account for the majority of the 301s and Timeouts. [emphasis mine]
I got a sinking feeling as I read this. I had curl'd over the corpus already to eyeball it ...yeah that's a list of feeds all right... but hadn't tallied the domains...
$ sed -e 's/^http:..//' -e 's/\/.*$//' persai_feedcorpus | count | head
Nooooo... Of course.. Sigh.