January 2008 Archives

January 2, 2008

Why Search?

I've gone and founded a search startup... you can read about it in this write-up in TechCrunch. But I get asked - why do search?

Simple - the idea that the current state-of-the-art in search is what we'll all be using, essentially unchanged, in 5 or 10 years, is absurd to me.

The web is big. Really, really big. It's literally billions and billions of pages. It's Carl Sagan big. And it's doubling in size every year or two.

So the idea that what you can see in positions 1-3 above the fold on Google is the sum of what the web has to say about every possible query is crazy.

And yet they have 85%+ market share, and little effective competition. At the same time there is such a fabulous business in search. It's the highest monetization service on the web, by far. Why does this Coke have no Pepsi?

Having just spent 5 years in the media space, I've come away with the idea that editorial differentiation is possible. But the editorial voice of a search engine is in the index...so it has to be algorithmic editorial differentiation.

Google and its copy-tition were designed 10 years ago. But the web has changed significantly in the past decade. Google was built to index a web that no longer exists... a web where people still engaged in social linking behavior, for one thing.

But at the end of the day, founding a startup has to be about personal motivation. My roots go back to OS internals, networking, algorithms, and product boot-up strategies. Basically, trying to make algorithmic sense of the vastness of the web is a difficult but really interesting problem. So is tilting at the biggest brand on the web. It's all just plain fun, which ideally should be the point of working. ;)

January 6, 2008

About the name 'Blekko'

In 1988 I was in college and desperately wanted to run some kind of multitasking OS on my own hardware. These were the dark days before Linux and FreeBSD. I had an account on the university Vax system but it was slow and I didn't have any privs to speak of. My first hope was for a Microvax but they were $15k. I ended up scrounging up a 286 system and installing SCO Xenix on it.

Xenix was a 16 bit port of AT&T's Unix and did the job. I loaded up my box with memory, serial ports, two modems and a serial terminal. I was in heaven.

I wanted to connect to the campus network via uucp and so my computer needed a name. I christened it 'blekko'; how I came up with this I have no idea, but I liked the sound of it. Thus was born my first "net" address, blekko.uucp.

So 'blekko', while it may sound like a weird Web 2.0 name, actually pre-dates the existence of the web. :-)

Now when Mike and I were setting up the new company we got to a point with the lawyers where they needed a name to proceed with the incorporation of the company. I didn't want to pick a name then; names are a big deal and you should put a lot of thought into them. So to put off the decision we decided to call the company "BX10.net". This was an inside joke based on one of our colo server names. But the main idea was that there was no way we'd ever launch with that, so it would usefully serve as a placeholder name but force us to change it later.

Well the state of California rejected our incorp under that name. Apparently there is a BX11, Inc. and they said "BX10" was too close. So in the interest of forging ahead with the company creation I fished out all the names I had in my domain account and sent them over to Mike.

Mike ordered the list by the ones he thought were funniest and sent them off to the lawyers to try, in order, until one worked. Blekko was the first name and went through.

Now I still think that it's important to put more than five minutes of thought into a company name. Especially if the five minutes' worth of thought yields "I would never use that name, are you insane?" But the reactions we've had have been ... interesting. Folks definitely love it or hate it. I actually score hate ahead of indifference; provoking a strong emotional response, even a negative one, helps the name stick in people's heads. :-)

One vendor we were talking to earnestly told us the name was fantastic and we must never change it. I'm not sure if he was pulling my leg though.

We've actually spoken to some naming/branding firms... I had always figured that investing more than $14.95 in a corporate identity made sense for a multimillion dollar startup effort... I mean you put millions of dollars into your coders and your ops, but you're going to settle for some name that happened to be free on Go Daddy?

The naming experts have had some interesting comments. They said phonetically 'blekko' wasn't bad. It's unique, staccato, memorable, and short. It does have some unpleasant phonetic associations. But they said mainly it was an "empty vessel" name. Meaning simply that the name doesn't suggest any idea in the mind of the person hearing it. It's an empty vessel that marketing would have to fill with a particular brand meaning.

We're still undecided on whether 'blekko' will actually be the launch name or if we will come up with something else. But I have to say the TechCrunch/Techmeme/Digg press and reaction have provided some fascinating test-marketing feedback. You can't pay for this stuff... and since it will be a little while before we launch anything, if we go with a different name later, it won't be a big deal to change it then.

I wonder what the name inspector would make of 'blekko'...

Update:

The Name Inspector reviews 'blekko'. He doesn't seem to like it. Although there is this curious comment at the end of the article:

But you’re in stealth mode. The Name Inspector believes you have no intention of launching as Blekko. Though he hopes he’s wrong.

Does that mean that he does want us to launch as 'blekko'? Hmmm....

January 7, 2008

Long tail in a short table

I finally found some stats on the fraction of porn queries out there to answer my question...plus, it was in a table classifying user searches into overall categories. This data was obtained by some researchers who manually classified a full week's worth of AOL search data:

Category        Share     Category        Share
Other           15.69%    News&Society    5.85%
Entertainment   12.60%    Computing       5.38%
Shopping        10.21%    Orgs&Inst       4.46%
Porn             7.19%    Home&Garden     3.82%
URL              6.78%    Autos           3.46%
Research         6.77%    Sports          3.30%
Misspellings     6.53%    Travel          3.09%
Places           6.13%    Games           2.38%
Business         6.07%    Personal Fin    1.63%
Health           5.99%    Holidays        1.63%

There are always methodology questions with data like this, but I've looked at the AOL data and am comfortable assuming that the categories are at least approximately realistic.

It's interesting to see the smooth spread across so many different categories. It's also easy to see why only focusing on a category or two may not be an effective product strategy. Shopping is the most lucrative of the verticals, and a healthy chunk at 10% of all searches. But if you focus only on shopping, that means users have to go elsewhere for the other 90% of their searches.

January 8, 2008

Another way to look at Wikia Search

Despite Wikia Search's unfortunate launch reaction, there is something substantial and worthwhile about the project that hasn't really come up in the coverage.

To understand Wikia Search you have to go back to the launch of the Nutch project in 2003:

Meet Nutch, the open-source search engine. Open-source applications are unusual in that the code upon which the software runs is not owned by a private, commercial company but rather bound by a simple license that allows anyone to use, modify, and even profit from it free of charge, as long as they pledge to contribute their own innovations back into the code base. Because of this, anyone will be able to access Nutch's code and use it to their own ends, without paying licensing fees or hewing to a particular company's set of rules.

Perhaps more important, Google takes a "trust us" approach to search; they say they don't skew their PageRank formula to favor certain sites, but we have no way of knowing for sure. With Nutch, the indexing and page-ranking technologies are all open and visible; you can check them yourself if you have a problem with your page's ranking. Just as Linux has taken on Windows, revolutionizing the rules of search-engine development and distribution, Nutch could pose a threat to Google and other search giants. Interestingly, early Nutch development was supported in part by Overture's R&D division, and an Overture official sits on the Nutch board.

"Search is interesting again," says Doug Cutting, a founder and core project manager at Nutch. Cutting, whose development chops were honed at Xerox (XRX) PARC, Excite and Apple (AAPL), is building Nutch (that's his toddler's all-purpose word for "meal") with a small team of engineers based around the country. But Cutting says they hope that once Nutch is loosed on the world, tinkerers from Romania to China to Palo Alto will help build it into a robust platform, in the spirit of Linux or Apache (which has garnered more than 60 percent of the Web-server software market in just the last couple of years).

The thought I had at the time was, the open source model is great, but the problem with search is that without a sponsor to pay for racks full of machines and gigabits of bandwidth, eager would-be developers are stuck. You can't develop a search engine on a laptop sitting in the university cafe.

Thus there is no web-scale version today on Nutch.org, of course. But Nutch has succeeded in smaller scale deployments, such as indexing university intranets. Basically competing in the enterprise search space, against commercial products such as Thunderstone and the Google search appliance. Universities are more open to tinkering with the open source Nutch / Lucene alternative and so have been early adopters there.

Enter Jimmy Wales. Wikia is the web-scale sponsor that Nutch didn't have when it launched in 2003. Wikia has 1,000 servers now and can afford the multi-gigabit bandwidth bill. They're providing the hosting platform which Nutch has been starved for to let contributors show up and advance Nutch to industry-level.

Yes, the site looks like someone was thrilled to get it to compile for the first time the night before launch. The appalled reactions are understandable given the expectations and high-profile PR.

Look past that.

Early open source projects often look grim. If you go onto sourceforge and find some promising 0.1 project, you know what to expect. I agree with Markson that the mistake here was in Wikia's positioning of the launch. But I don't think that's necessarily going to have long-term effects. Ultimately they just need a small handful of developers and contributors to help move the rock uphill. And then iterate.

And don't count out the power of the open source model. Giving all of the academic researchers who only get to test their experimental ranking algos on little clusters a functioning web-scale search platform could enable real progress. Check back in 2 years and I'll bet that Wikia Search is going to be a valid competitive alternative search site. Certainly a long shot to unseat Google, but at least a worthy alternative.

Updated data from Topix on registration-free commenting

Newspapers are apparently still fretting over whether to allow users to comment on their sites. Old-school editors like to hold the reins tightly; approval-before-posting is a common moderation model on newspaper web sites. You'd think they'd be more open to letting in the usergen pageviews...

Some new data out of Topix shows the difference in quality (measured by post kill ratios) between registered and unregistered commenters.

Total posts by registered users: 22,336
Total posts by non-registered users: 60,772

Posts by registered users that got killed: 992
Posts by unregistered users that got killed: 4,095

% posts killed (registered users): 4.4%
% posts killed (unregistered): 6.7%

The unregistered commenters have a roughly 50% higher kill rate. But they come with nearly 3X the traffic.
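
For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the ratios from the four counts quoted above. The variable and function names are mine, not anything Topix publishes, and it assumes the counts are post totals for the same time window.

```python
# Counts as quoted in this post (Topix data).
registered_posts = 22_336
unregistered_posts = 60_772
registered_killed = 992
unregistered_killed = 4_095


def kill_rate(killed, total):
    """Fraction of posts that moderators removed."""
    return killed / total


reg_rate = kill_rate(registered_killed, registered_posts)
unreg_rate = kill_rate(unregistered_killed, unregistered_posts)

# How much higher the unregistered kill rate is, relative to registered.
relative_increase = unreg_rate / reg_rate - 1

# How many unregistered posts arrive for every registered post.
volume_ratio = unregistered_posts / registered_posts

print(f"registered kill rate:   {reg_rate:.1%}")
print(f"unregistered kill rate: {unreg_rate:.1%}")
print(f"relative increase:      {relative_increase:.0%}")
print(f"volume ratio:           {volume_ratio:.1f}x")
```

The two kill rates come out at 4.4% and 6.7%, matching the figures above. The relative increase works out to a bit over 50%, and the volume ratio to a bit under 3X, so both headline numbers are fair roundings.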

Further evidence that the Ni-Chan paradox still holds:

  • Registration keeps out good posters. People with lives will tend to ignore forums with a registration process.
  • Registration lets in bad posters. Children and Internet addicts tend to have free time to go register an account and check their e-mail for the confirmation message. They will generally make your forum a waste of bandwidth.
  • Registration attracts trolls. If someone is interested in destroying a forum, a registration process only adds to the excitement of a challenge. Trolls are not out to protect their own reputation. They seek to destroy other peoples' "reputation".
  • Anonymity counters vanity. On a forum where registration is required, or even where people give themselves names, a clique is developed of the elite users, and posts deal as much with who you are as what you are posting. On an anonymous forum, if you can't tell who posts what, logic will overrule vanity.

I like this dataporn since it's applicable beyond newspaper and forum sites, to other kinds of recruitment-funnel online participation systems. Make it easy for users, especially first-time visitors, to jump in and participate. But also give power users the ability to invest more in their identity on your site.

    January 15, 2008

    Open source Bigtable clone 'Hypertable' posts performance numbers

    Zvents will soon be releasing their open-source Bigtable clone called Hypertable, and have posted some performance numbers that look quite good. Especially for such an early release.

    But maybe that's not surprising, since Hypertable was designed for speed by Zvents search architect Doug Judd. He rejected Java (used by HBase, the Hadoop-project Bigtable effort) in favor of C++ in order to get the performance as high as possible.

    With a small test inserting about 28M rows of data from the AOL search dataset, they achieved a per-node write rate of approximately 7mb/sec. Iteration over the data once loaded was also quite fast, at nearly 1M cells/second.

    The question is how the system will scale up to much larger amounts of data. But the early perf numbers are encouraging. Doug and co will also need to get the word out about Hypertable and get a developer community going around this project if it's going to achieve its full potential.

    Hypertable can run on top of either HDFS or KFS. Zvents CEO Ethan Stock told me they will be releasing it under GPL 2.1 on Jan 31st.

    January 18, 2008

    Database gods bitch about mapreduce

    This is what disruption sounds like.

    This rant by major database guys against mapreduce is pretty telling.

    (You can read a good rebuttal here, and discussion on ycomb.)

    The thing that disrupts you is always uglier and worse in some way. Fewer features, less developed. But if there's a 10X price win in there somewhere, the cheap rickety thing wins in the end.

    Think Linux vs. AT&T Unix, or MySQL vs. Oracle.

    I'll also take exception to the claim that schemas won out over unstructured data in the 60's. Unix ultimately trounced Multics and its ilk, not simply because of quasi-open source and economics, but also because the programming model was superior. "A file is just a stream of bytes" was a radical departure from the record and key oriented approaches that were dominant at the time. Some folks haven't stopped fighting the war though. Oracle's multi-decade messaging effort deserves more credit for the acceptance of databases as industry-standard tech than the idea that warring academics came to realize some deep truth about the way data "should" be stored.

    Is it the case that mapreduce on top of something like HDFS + Hypertable is a competitor to old-style monolithic databases running on big iron? You bet it is.

    Linear perf, linear cost scale, and the programming flexibility of unstructured Unix-like I/O in GFS or fluid schemas in Bigtable. All good.

    And I wouldn't be surprised if the adoption curve, even for conservative Fortune-500 companies, was quicker than we've seen in the past. Bolt a map/reduce cluster onto the side of your data warehouse and mine those CRM records for business insights. Sounds like a startup idea we'll be seeing soon enough. ;-)
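
To make the "mine those CRM records" idea concrete, here is a toy, single-process sketch of the map / shuffle / reduce pattern. It is not Hadoop's or Google's API; the record fields (`customer`, `region`, `amount`) and every function name are invented for illustration. A real cluster would spread the map and reduce calls across many machines, but the shape of the program is the same.

```python
from collections import defaultdict

# Hypothetical CRM records; the field names are made up for this example.
records = [
    {"customer": "acme", "region": "west", "amount": 120.0},
    {"customer": "globex", "region": "east", "amount": 50.0},
    {"customer": "initech", "region": "west", "amount": 80.0},
    {"customer": "hooli", "region": "east", "amount": 25.5},
]


def map_record(record):
    """Map step: emit (key, value) pairs for one input record."""
    yield record["region"], record["amount"]


def shuffle(pairs):
    """Shuffle step: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_values(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def run_mapreduce(inputs):
    """Run map, shuffle, and reduce in order, all in one process."""
    pairs = (pair for record in inputs for pair in map_record(record))
    grouped = shuffle(pairs)
    return dict(reduce_values(key, values) for key, values in grouped.items())


revenue_by_region = run_mapreduce(records)
print(revenue_by_region)
```

Note that nothing here needed a schema declared up front: the map function decides which fields matter at query time, which is the flexibility argument in a nutshell.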

    January 21, 2008

    Markson: The Tin Handcuffs of SEO

    When I stopped living in the problem and began living in the answer, the problem went away.
          -- Randy Treft

    Mike Markson has a thought-provoking post on SEO.

    I have a buddy who compares getting VC funding to getting hooked on heroin. He says that instead of optimizing the company to build the right product, the funding often optimizes the company to do whatever is necessary to close the next round.

    SEO can be like that. It's such an easy way to get traffic. Certainly easier than making a great product that spreads word-of-mouth all by itself.

    Every month Google allots the web-sites in its index a certain amount of traffic. Some sites do better than others, but for the most part each site takes its monthly Google traffic home and tries to do the best it can with it.

    ...

    If you actually look at the recent successful sites over the past few years - YouTube, MySpace, Facebook, etc. - none of them got there by Google traffic. They created a product and figured out a way to get mass appeal outside the Google regulatory system.

    There's more; it's worth reading the whole thing.

    About January 2008

    This page contains all entries posted to Skrentablog in January 2008. They are listed from oldest to newest.

    December 2007 is the previous archive.

    February 2008 is the next archive.
