"Persai is a startup that seeks to apply advanced machine learning techniques to content and advertising. We are using Amazon's web services to build a scalable architecture that will learn from consumer interests over time and match them with content crawled from around the web. The idea behind Persai is that you will have an active agent crawling the web looking for content that is relevant to you and only you. Every link we recommend will be something you want to read. We are zigging to social news' zag where popularity trumps relevance to the individual."
-- from news.ycombinator
Anyway, a few days ago Persai released a Nutch webcrawl-generated set of "118,254 feeds of pure greatness". Intertwingly begged to differ about the quality after running some stats on the feeds. This generated some interesting comments...one in particular jumped out at me:
But if you look at the list itself, two sites are grossly overrepresented, and they account for the majority of the 301s and Timeouts. [emphasis mine]
I got a sinking feeling as I read this. I had curl'd over the corpus already to eyeball it... yeah, that's a list of feeds all right... but hadn't tallied the domains...
$ sed -e 's/^http:..//' -e 's/\/.*$//' persai_feedcorpus | count | head
35695 rss.topix.net
14613 izynews.de
2831 feeds.feedburner.com
1869 p.moreover.com
1314 www.livejournal.com
1241 rss.groups.yahoo.com
1191 www.discountwatcher.com
1096 news.bbc.co.uk
1072 www.alibaba.com
882 xml.newsisfree.com
Nooooo... Of course.. Sigh.
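For anyone repeating the tally: `count` in the pipeline above is not a standard utility. Assuming it is shorthand for `sort | uniq -c | sort -rn` (a guess on my part, the original helper isn't shown), a portable sketch could look like this. The function name `tally_domains` is made up for illustration, and the pattern also strips `https://`, which the original one-liner did not.

```shell
# Tally feed URLs per host, most frequent host first.
# Assumption: the `count` helper is equivalent to `sort | uniq -c | sort -rn`.
# `tally_domains` is a hypothetical name, not part of the original post.
tally_domains() {
    # Strip the scheme, then everything from the first slash onward,
    # leaving only the host; read from the named files or from stdin.
    sed -e 's|^https\{0,1\}://||' -e 's|/.*$||' "$@" |
        sort |        # group identical hosts together
        uniq -c |     # prefix each distinct host with its count
        sort -rn      # largest counts first
}
```

Calling `tally_domains persai_feedcorpus | head` should then give a top-ten list like the one above, provided the guess about `count` holds.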
Comments (2)
Yea the list sucks at the moment. It's 2 days' worth of work and I just posted the corpus for the hell of it. Oh well, that's what you get for putting it on the internet.
Posted by Kyle | August 7, 2007 11:33 AM
I guess we need to do some looking around here....
Posted by Chris Tolles | August 7, 2007 2:16 PM