
Another way to look at Wikia Search

Despite the poor reaction to Wikia Search's launch, there is something substantial and worthwhile about the project that hasn't really come up in the coverage.

To understand Wikia Search you have to go back to the launch of the Nutch project in 2003:

Meet Nutch, the open-source search engine. Open-source applications are unusual in that the code upon which the software runs is not owned by a private, commercial company but rather bound by a simple license that allows anyone to use, modify, and even profit from it free of charge, as long as they pledge to contribute their own innovations back into the code base. Because of this, anyone will be able to access Nutch's code and use it to their own ends, without paying licensing fees or hewing to a particular company's set of rules.

Perhaps more important, Google takes a "trust us" approach to search; they say they don't skew their PageRank formula to favor certain sites, but we have no way of knowing for sure. With Nutch, the indexing and page-ranking technologies are all open and visible; you can check them yourself if you have a problem with your page's ranking. Just as Linux has taken on Windows, revolutionizing the rules of search-engine development and distribution, Nutch could pose a threat to Google and other search giants. Interestingly, early Nutch development was supported in part by Overture's R&D division, and an Overture official sits on the Nutch board.

"Search is interesting again," says Doug Cutting, a founder and core project manager at Nutch. Cutting, whose development chops were honed at Xerox (XRX) PARC, Excite and Apple (AAPL), is building Nutch (that's his toddler's all-purpose word for "meal") with a small team of engineers based around the country. But Cutting says they hope that once Nutch is loosed on the world, tinkerers from Romania to China to Palo Alto will help build it into a robust platform, in the spirit of Linux or Apache (which has garnered more than 60 percent of the Web-server software market in just the last couple of years).

The thought I had at the time was, the open source model is great, but the problem with search is that without a sponsor to pay for racks full of machines and gigabits of bandwidth, eager would-be developers are stuck. You can't develop a search engine on a laptop sitting in the university cafe.

So, of course, there is no web-scale version on Nutch.org today. But Nutch has succeeded in smaller-scale deployments, such as indexing university intranets, where it basically competes in the enterprise search space against commercial products such as Thunderstone and the Google Search Appliance. Universities are more open to tinkering with the open source Nutch / Lucene alternative and so have been early adopters there.

Enter Jimmy Wales. Wikia is the web-scale sponsor that Nutch didn't have when it launched in 2003. Wikia has 1,000 servers now and can afford the multi-gigabit bandwidth bill. They're providing the hosting platform that Nutch has been starved for, so that contributors can show up and advance Nutch to an industry-grade level.

Yes, the site looks like someone was thrilled to get it to compile for the first time the night before launch. The appalled reactions are understandable given the expectations and high-profile PR.

Look past that.

Early open source projects often look grim. If you go onto SourceForge and find some promising 0.1 project, you know what to expect. I agree with Markson that the mistake here was in Wikia's positioning of the launch. But I don't think that's necessarily going to have long-term effects. Ultimately they just need a small handful of developers and contributors to help move the rock uphill. And then iterate.

And don't count out the power of the open source model. Giving a functioning web-scale search platform to all of the academic researchers who only get to test their experimental ranking algos on little clusters could enable real progress. Check back in two years and I'll bet that Wikia Search is going to be a valid competitive alternative search site. Certainly a long shot to unseat Google, but at least a worthy alternative.



Comments (5)



Any web developer who was half serious about writing their own search engine can use Alexa's search platform. I am surprised that nobody is talking about Alexa's platform, given everybody's euphoria when it first launched ( http://www.techmeme.com/051213/p2 )

I know people would have to pay to access Alexa's index, but open source algos can still be run on top of it.

Any thoughts on why people are not talking about Amazon's effort anymore?

I remember the hubbub around the launch of the Alexa stuff too. But the UI and re-ranking that can be put around someone else's index are pretty shallow; the editorial voice of a search engine is in the index itself. If you put a front end on top of AWS or Gigablast, you've launched AWS or Gigablast with a new front end, not a new search engine.

"Index" and "relevance" are umbrella terms... Many factors combine to produce perceived relevance on the result page. The provider of the index has already chosen what to optimize away to provide their results on a reasonable hardware platform: what the crawl pattern is, how to parse the page and tokenize terms, what kind of snippet you can get back, what facets per URL are available for pivots, and what ranking algos can be used.

If you use someone else's index as the basis of your own, all of the interesting design choices have already been made for you.


As an adult webmaster whose sites don't appear in Google at all for the most logical keywords, I'm obviously interested in search alternatives where the ranking algorithm can be determined and influenced by the actual humans who are trying to find my site.

What's disappointing about this launch is not the weak index currently in place. What's disappointing is that there isn't even a framework visible for human improvement. I get that such tools may eventually appear, but that's hardly the point.

"Today we've released some lemonade. It doesn't have any lemons in it, or lemon flavor of any kind, but we'll be adding that in a future release. And right now the liquid isn't water, or anything you'd want to drink -- it's just some clear liquid we borrowed for proof of concept."

So, really, what got released?


This is interesting.

I looked at Nutch for Topix's search. I was struck by how all the tools for tokenizing, lexing, scoring, etc. (natively in Java) were geared towards TF/IDF using stemming. A lot of hacking was going to be necessary to bend the API to do just the simple parsing and categorization that we wanted for Topix.

I can only imagine that given Nutch as a starting point, if you didn't want to deviate from the API you would be very constrained in what you could do to improve the rankings.
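
For readers who haven't met the term, here is a rough sketch of what "TF/IDF using stemming" means. It is a toy illustration in Python, not Nutch's or Lucene's actual scoring code: the names crude_stem, tokenize and tf_idf_scores are made up for this example, the suffix-stripping "stemmer" is a crude stand-in for a real one, and the formula is the plain textbook version (term frequency times the log of inverse document frequency).

```python
import math
from collections import Counter


def crude_stem(token):
    # Hypothetical toy stemmer: strip a few common English suffixes,
    # as long as at least three characters remain.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    # Lowercase, split on whitespace, then stem each token.
    return [crude_stem(t) for t in text.lower().split()]


def tf_idf_scores(query, docs):
    # Return one textbook TF-IDF score per document for the query.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    query_terms = tokenize(query)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term in tf:
                idf = math.log(n_docs / df[term])
                score += (tf[term] / len(tokens)) * idf
        scores.append(score)
    return scores


docs = [
    "searching the web",
    "open source search engine",
    "lemonade without lemons",
]
print(tf_idf_scores("search", docs))
```

Because of the stemming step, "searching" and "search" count as the same term, so the first two documents both match the query and the third scores zero. Everything in a pipeline like this assumes word-level term statistics, which is the constraint the comment above is describing.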


... it's only going to be worth anything if they invest in a good trust metric.

Search without a trust metric is like peanut butter without jelly or a bagel without cream cheese.




This page contains a single entry from the blog posted on January 8, 2008 8:31 AM.
