Main

main Archives

December 14, 2006

WikiPedia is the top recipient of Google's traffic

Google sends more traffic to WikiPedia than any other site, according to Hitwise. I've noticed that WikiPedia has top rankings now for more and more searches. Since they've doubled their article count over the past year, to 1.5M English articles, their search footprint has expanded. This trend will only continue.

This is also exactly the kind of destination content that Google loves. The articles are excellent quality, and WikiPedia is entirely non-commercial, so Google can feel good about having them rank at the top of the organic listings. Any algorithmic changes Google makes in the future are likely to preserve and reinforce this effect.

Google accounts for nearly 50% of their inbound traffic. But "wikipedia" is their top inbound search term, so people are seeking the brand out directly, as Steve Rubel noted a year ago.

The full list of top-10 Google downstream sites is:

  1. Wikipedia
  2. eBay
  3. Amazon.com
  4. Yahoo
  5. IMDB
  6. Walmart
  7. Answers.com
  8. Target
  9. MapQuest
  10. BizRate

This list doesn't distinguish between paid and non-paid listings, but clearly WikiPedia isn't buying adwords, so 100% of their juice is organic. Given how often I see About.com come up in the results, I'm surprised not to see them in this list.

It will be interesting to see whether Jimmy Wales can replicate this success with Wikia, the VC-backed commercial twin to WikiPedia. 100% revshare, wow -- I don't fully get the model yet but he's clearly got a bold plan up his sleeve.

More Hitwise coolness from SEOmoz.

Congrats Bryn

Props to my pal Bryn on the issuance of his patent for a secure distributed random number generator ...which came out in 2003, but he just noticed, thanks to the new Google patent search.

He pointed out that I actually implemented the thing in Java (this was for the IP security layer in Sun's JavaStation kernel), and my comments in the code actually appear at the end of the patent. Truel remarked:

Just as exciting is the discovery that Rich used to write code that had comments...in html. I'm not even sure how to process it.

December 15, 2006

Success?

Mr. Skrenta,

    As I understand you are one of the founders of the ODP. My question is simple and I'm hoping that you can help me. I've submitted our site about 15 times over the last 4 years (waiting months and months between submissions) and have not yet been listed. I've followed all of the rules each and every time. Is it possible that there is something not entirely kosher with the way editors handle things? Is it possible that the buzz on the internet that says there is some corruption going on in the ODP with editors is true? Something is really wrong when someone submits a site for 4 years and is ignored.

Sigh.

December 16, 2006

I took a ukulele lesson once...

I've watched/listened to this guy like 10 times now...this is just awesome. :-)

DMOZ had 9 lives. Used up yet?

RIP DMOZ: 1998-2006

aka Open Directory Project
aka Netscape Open Directory
aka directory.mozilla.org
aka NewHoo
aka GnuHoo

Peter Da Vanzo: Is DMOZ Dead?
Tom Lustina: Here Lies ODP
Sean Bolton: DMOZ, Please Die Already
Resource Zone: submit URL link not working
Trond Sorvoja: Will AOL allow an Open Directory Foundation?

Apparently the machine holding dmoz in AOL ops crashed. Standard backups had been discontinued for some reason; during unsuccessful attempts to restore some of the lost data, ops blew away the rest of the existing data on the system.

So for the past 6 weeks, a few folks have been trying to patch the system back together again (reverse engineering from the latest RDF dump, I suppose). But 6 weeks is a very long outage. Add in the massive AOL layoffs last week, and it's not clear if there's even anyone left over there who cares. Even if some form of the ODP editing system is brought back, its continued existence within AOL seems extremely doubtful.

dmoz doesn't exactly operate on a model of transparency, to say the least, so they have been keeping the details of what happened private. Perhaps they're concerned about an exodus of the remaining editors, or gleeful proclamations of death from the SEM industry. The remaining ODP editors will probably be mad at me for discussing this, but they get mad at me whenever I talk about the ODP....ironic! :-) Hey guys, it's 2006, open up.

...

What do you do when you get an email like this?

To: "Rich Skrenta"
Date: Tue, 14 Jul 1998 18:47:15 -0700
Subject: Infoseek and NewHoo

Rich,

I just got off the phone with Steve Kirsch, Infoseek's founder and Chairman of the Board. We are very much interested in purchasing the technology, content, and founders of NewHoo. This is our preferred option, but we would certainly consider discussing other partnering opportunities if this doesn't work out.

We think that the best way to continue the process would be for you to name a price range for a possible purchase, including the appropriate market and financial information justifying that price.

Next, we can continue our discussions if there is enough interest on this side.

Regards,
Scott

We launched NewHoo in June 1998. Within 4 months we had the CTO of LookSmart saying he wanted to quit and join us, an acquisition offer from Infoseek, a $5M funding offer from Lycos, an angel funding offer being brokered by the Venture Law Group, and an acquisition offer from Netscape. We took the Netscape offer; it was a great strategic fit, since they had a lot of traffic to pour on the directory, and were willing to give the data away for free.

Unfortunately, as with many (most?) acquisitions, the hopeful little product was eventually lost within the sprawling org.

In a 2003 talk, I predicted that the server would get lost in AOL ops, and, deprived of any staff who understood how it worked, it would just crash one day, and that would be it.

My (edited) reply to a dmoz meta editor who contacted me about the extended outage:

Not sure if you all have been following the drama going on within AOL, but I doubt they have any attention for dmoz at all at this point, less even than usual. In fact, my guess is that everyone involved in the management chain there over dmoz for the past 6 years is now gone.

http://www.brianalvey.com/2006/12/15/just-add-drama/
http://valleywag.com/tech/aol/fucking-way-222195.php

So regardless of whether specific front-line people in AOL ops can get the machine running again or not, I doubt that the environment there will be very good in the longer term. All of the folks there who had been championing product-led growth are now gone. One possible outcome is that Time Warner is slimming AOL down for an eventual spin-out. A more cynical take is that they're going to deliberately torture the org first, as payback for the destruction in Time Warner value following the AOL merger (this idea was put forward in an NY Times story a few months ago).

I do think it's a great time for a new directory to emerge, and human editing, supported by enough technical automation to make the editors productive, could be a powerful model. Bob Keating's ideas around building a faceted directory are spot-on IMO.

However, I maintain my belief that, without a monetary engine -- in other words, without making the directory a business at some level -- dependence on corporate patronage will eventually leave it weak and understaffed again. One option I might suggest is to look at something like Jimmy Wales' new Wikia service, and see if it could fit the bill, at least at some level. If so, the dmoz editors could move over there and start building again.

WikiPedia is another model to consider. It seems to have depended on patronage, and has probably been limited in the past by resource constraints. Modest advertising (e.g. adsense/adwords on search) on dmoz could easily have supported a staff of 10-20 full time employees, as well as hosting costs. Call it a nonprofit foundation, but you need the entity and some money coming in to pay for things like...proper ops (gosh you could have that from Rackspace for a monthly fee, including backups :-).

But unlikely to be possible within AOL, I'm afraid. I ran a scan of the forums to estimate active editorship...I count approx 4000 recent posters to the forums; given the old 50% measurement, that suggests about 8-10k active editors -- plenty to build something fairly interesting again in a relatively short time.

...

In any case, if I can be helpful in any way, let me know.

-- Rich

Update:
I spoke to Bob Keating yesterday and apparently my post shook things up a bit inside of AOL with respect to the ODP. He credited it with getting them to finally assign a sysadmin back to dmoz, which hasn't had a dedicated SA for some time, part of the reason this outage was so long.

I was also contacted by Jimmy Wales of Wikipedia/Wikia, who very much would like to rescue dmoz and give it a good home.

So this post has directly led to the server being fixed as well as a significant offer of help from a major industry figure.

AOL just had an impressive re-org. I actually briefly worked with Ron Grant while I was there. He's a scarily effective thinker and negotiator and frankly scares the living bejeezus out of me. You have something broken and rotting like AOL, you want some bold moves to try to fix it. Ron's the right guy for that.

Similarly I think the ODP is suffering from its closed, stultifying culture. There needs to be a re-org within the editor culture itself before the ODP will be able to truly move forward. Fire the handful of metas at the core of this rot and have a general housecleaning. Institute term limits for the senior ODP positions; that works great in politics to clean out the old corrupt guys and make way for fresh blood.

December 17, 2006

Wrap

Andrew Goodman with the best year-end recap so far: a month-by-month survey of the most surprising web milestones of 2006.

Fred Wilson experiments with RSS ad distribution in feeds. You gotta hand it to the guy for using his blog as a platform for experimentation and learning. Not content to just slurp it up from the business press, Fred is a lean-forward VC. Nice. :-)

Brad Templeton proposed a Zero User Interface backup solution. I think 3X the cost on the disks is going to be a big discouragement, but I really like the idea of Zero User Interface design. Google search is zero user interface: DWIM.

Joel Spolsky: "Every few days some crappy software I can't even remember installing pops up noisy bulletins asking me if I want to upgrade something or other. I could not care LESS. I'm doing something. Leave me alone! I'm sure that the team at Sun Microsystems who just released this fabulous new version of the Java virtual machine have been thinking about the incremental release night and day for months and months, but the other 5,000,000,000 of us here on the planet really don't give a flying monkey. You just cannot imagine how little I want to spend even three seconds of my life thinking about whether or not to install that new JVM."

David Naylor's new social network crossed with a speed-dating site: TickMe.

Tim Bray: Referrer stats proving Slashdot in decline, Reddit on the ascent.

Blog-tag

I was blog-tagged by Peter Da Vanzo (thanks Peter). Hmmm, 5 things you don't know about me...

  • I whistle like a madman. Often Christmas carols, year-round. I'm not usually even aware I'm doing it. Off-key, not pleasant or tuneful, and usually fairly loud. I've been heard blocks away. I drove my mother insane with my whistling.

  • I worked for 2 weeks for a telephone survey firm in college. That was by far the worst job I've ever had. I had to phone people at dinnertime and take them through 15-minute surveys. We were supposed to lie about how long the survey would take. The worst was when someone, screaming obscenities, would hang up during the final few minutes. A floor manager would then come by and scream at me. "You could have kept them on the line!"

  • The best job I ever had was being a lifeguard. Getting a tan and hanging out by the pool was great, but the most novel part of my job involved pool cleaning Sunday mornings. It was a big pool, and you couldn't just stand on the side with the pole to clean the bottom. So I had to wear a Jules Verne style compressed air contraption to walk around the bottom of the pool with the vacuum thingy. The boss would shut off the air hose to let you know your time was up (you couldn't stay down too long or you'd go silly).

  • I studied Mandarin for 3 years in college. I really enjoyed studying the language, though it started as a way to pass a 2-year language requirement. True to my teacher's stern prediction, since graduating 'Ni Hao Ma' is about all that's left.

  • I collect old computer books. Especially stuff about OS and programming language design. For some reason I haven't been able to get my hands on a copy of Computer Lib/Dream Machines, although Bill Danielson lent me his copy once so I could read it.

I tag Greg, Brad, Susan, Arin, Mike.

...hmmm, too much I- me- my- in this post...it will never make it past the Topix narcissism filter. :-)

December 18, 2006

Scaling eBay

Interesting presentation on the history of eBay's architectural evolution, from one-coder startup through 200M registered users and 2 petabytes of data. (via Tim Bray).

Update: Great followup, including a survey of other commentary on this talk, by Greg Linden.

December 19, 2006

Kooky but cool gmail spamguard idea

I came across a gripe that topix forums don't support plus signs in their email addrs in my daily topix vanity scan... John Reinke has come up with a way of forwarding gmail accounts to each other with disposable suffixes, so that he can shut off a poisoned email target if it starts to get spam. Clever.

I get way too much spam; my old personal addresses are essentially unusable at this point, having been around since about 1991, posted on usenet, the web, etc. I have been putting off doing something about this but figure I'll eventually come up with a solution. The problem is that I want to recover use of these old addresses, not just protect brand new ones (I usually use mailinator for throwaway registrations). So either I write some magical content- or connection-based spam detector (pita), upgrade to the latest & greatest off-the-shelf stuff (have done before, pita to maintain), or do some kind of aggressive whitelist. Hmmm.

Our new VP ops says he has a custom home-grown system he uses on his personal mail that he's going to deploy at topix. Maybe if that works well I'll see if I can use it on my private stuff too. Gosh, seems like we could have a whole new round of spam fighting startups. The last crop did well, but ... there's still spam, so clearly we're not done. :-)
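
The core of the suffix idea is simple enough to sketch. Here's a minimal, hypothetical version in Python (the burn-list and addresses are made up, and Reinke's actual setup chains gmail forwarding rules rather than running code):

    # Hypothetical burn-list filter for plus-addressed mail. When a disposable
    # suffix starts drawing spam, add it to BURNED and that alias goes dead.
    BURNED = {"sketchyshop", "oldforum"}

    def accept(to_addr):
        """Keep mail unless it's addressed to a burned disposable suffix."""
        local = to_addr.split("@", 1)[0]
        if "+" not in local:
            return True                 # bare address; normal filtering applies
        suffix = local.split("+", 1)[1]
        return suffix not in BURNED

    assert accept("me+bank@gmail.com")
    assert not accept("me+sketchyshop@gmail.com")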

Google's true search market share is 70%

Sitting here in Palo Alto, running a web business, it's pretty clear who the winner of the search game is. But every month I have to suffer through reading about Google's supposed 40-something percent market share. Everybody involved in the search industry and everyone who actually runs a website knows these numbers are completely wrong.

Of course I'm not the first person to point out how off-kilter web measurements are.

A modest proposal

Let's look at search referral traffic the way a site owner would.

I picked a basket of medium-to-large websites and looked at the inbound search traffic percentages using Hitwise. I included Topix in this mix, both because it's a representative content site, and also because I could double-check the Hitwise numbers against our own server logs and 3rd party measurements from Google Analytics. As it turns out, the relative inbound referral ratios agreed between Hitwise, Google Analytics and our own server stats.

The results:


Site         hitwise-google  hitwise-yahoo  hitwise-msn  hitwise-ask  Google  Yahoo  MSN    Ask   hitwise-total
apple.com          8.62            2.38          1.69         0        67.9%  18.8%  13.3%  0.0%     12.69
craigslist         7.48            3.40          1.17         0.12     61.5%  27.9%   9.6%  1.0%     12.17
ebay              10.12            3.36          2.57         0.44     61.4%  20.4%  15.6%  2.7%     16.49
flickr            17.72            7.26          1.34         0.45     66.2%  27.1%   5.0%  1.7%     26.77
nytimes           16.67            2.84          1.34         0.53     78.0%  13.3%   6.3%  2.5%     21.38
topix.net         40.50           10.02          0.65         1.56     76.8%  19.0%   1.2%  3.0%     52.73
tripadvisor       47.57            5.87          3.51         1.42     81.5%  10.1%   6.0%  2.4%     58.37
usatoday           6.43            2.07          1.40         0        64.9%  20.9%  14.1%  0.0%      9.90
wikipedia         48.36           10.98          3.66         2.57     73.8%  16.7%   5.6%  3.9%     65.57
youtube           12.97            2.28          2.16         0        74.5%  13.1%  12.4%  0.0%     17.41
Average                                                                70.6%  18.7%   8.9%  1.7%

What I did

I did a simple average of the percentages instead of a weighted average, to offset the chance that a particular site was being unduly favored by a particular engine. (It doesn't look like that is happening, though; both Yahoo and Google favor Wikipedia and IMDB in their top organic outbound referrals, so they seem to be sending traffic to the same kinds of places in their listings.) These numbers probably undercount Ask, because Ask referrals were below the top-inbound-referrer cutoff for some of these sites.
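
The arithmetic is easy to reproduce. A back-of-the-napkin sketch in Python (the Hitwise numbers are hand-copied from the table above; only three sites shown, the rest work the same way):

    # Normalize each site's Hitwise inbound-search percentages to engine
    # shares, then take the simple (unweighted) average across sites.
    sites = {
        "apple.com": (8.62, 2.38, 1.69, 0.0),    # google, yahoo, msn, ask
        "nytimes":   (16.67, 2.84, 1.34, 0.53),
        "wikipedia": (48.36, 10.98, 3.66, 2.57),
    }

    sums = [0.0, 0.0, 0.0, 0.0]
    for site, cols in sites.items():
        shares = [100 * c / sum(cols) for c in cols]
        sums = [a + b for a, b in zip(sums, shares)]
        print("%-12s" % site, " ".join("%5.1f%%" % s for s in shares))

    print("%-12s" % "Average", " ".join("%5.1f%%" % (s / len(sites)) for s in sums))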

I'm not a professional analyst, and my approach here is pretty back-of-the-napkin. Still, it confirms what those of us in the search industry have known for a long time.

The New York Times, for instance, gets nearly 6X as much traffic from Google as it does from Yahoo. Tripadvisor gets 8X as much traffic from Google vs. Yahoo.

Even Yahoo's own sites are no exception. While Yahoo's flickr service receives a greater fraction of Yahoo search traffic than average, it still gets 2.4 times as much traffic from Google as it does from Yahoo.

My favorite example (not included in the above stats): According to Hitwise, Yahoo blogger Jeremy Zawodny gets 92% of his inbound search traffic from Google, and only 2.7% from Yahoo. :-)

"We see little to stop Google from reaching 70 percent market share eventually; the question, really, comes down to, 'How long could it take?" -- RBC Capital Markets analyst Jordan Rohan.

Welcome to the future, we're already there. To paraphrase an old industry saying about IBM...

Google's not the competition, Google's the environment.

December 20, 2006

The real #1 search

You want to know what the real #1 search is? The one no-one will tell you about? The search that you can't buy through adwords or overture?

Talk about suffering through... Every year-end we get these top-10 lists out of the PR departments at search engines. (Our top search at Topix is 'obituaries', as it always is).

If you've ever looked at raw search logs and tried to tally them, you know how noisy that data can be. You have to filter out all the bogus robot hits, and normalize per source & partner (think of a static href link into a search, like piggly wiggly appearing on a high-traffic page -- do you de-dup by referrer as well?). You start throwing out the edgy stuff and playing with stripping out quotes and coalescing typos and misspellings, and before you know it you can have whatever list you want when you're done.

At Netscape I found that the code used to tally the reports actually discarded the search that more people did than any other search. It was top by a huge margin, too. No, not sex, yahoo, porn, britney or anything like that.

It was the null search. People just clicking on the search box without having typed anything. Of course the empty string was stripped out during the log tally run, inadvertently discarding a key piece of user behavior data.

We had an error page which came up when that happened...we quickly replaced it with a friendly page of instructions. It doesn't seem like Yahoo and Google do anything at all with that search now; MSN does a page turn onto a null result page.
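
Here's a toy sketch of how that bug happens (the log format and field names are made up for illustration; this isn't the Netscape tally code):

    # A tally script that filters robots and normalizes queries. The natural
    # "skip empty queries" branch quietly discards the most common search.
    from collections import Counter

    def tally(lines, keep_null=True):
        counts = Counter()
        for line in lines:
            ua, query = line.rstrip("\n").split("\t")  # user-agent TAB query
            if "bot" in ua.lower():
                continue                               # drop robot hits
            q = query.strip().strip('"').lower()       # crude normalization
            if not q:
                if not keep_null:
                    continue                           # the classic mistake
                q = "<null search>"
            counts[q] += 1
        return counts

    log = ["Mozilla\t", "Mozilla\tyahoo", "Mozilla\t", "Googlebot\tbritney"]
    print(tally(log).most_common())                    # null search on top
    print(tally(log, keep_null=False).most_common())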

Season's Greetings, Oracle!

I was probably a little scroogey here, hmmm.

In the spirit of the holidays, I will let the mysql thing go...this time.

From: Rich Skrenta
Sent: Wednesday, December 20, 2006 2:31 PM
To: xxxxxx.xxxxxx@oracle.com
Cc: XXXXXX
Subject: RE: 30 minutes in January?

We don't use any databases here. I hate databases. If anyone bought
Oracle here I'd fire them. Someone brought up mysql last week for
a little stat-counter and I'm probably going to fire them after
the holidays.

http://blog.topix.net/archives/000045.html
--
Rich Skrenta
CEO, Topix.net


From: XXXXXX XXXXXX [mailto:xxxxxx.xxxxxx@oracle.com]
Sent: Wednesday, December 20, 2006 1:35 PM
To: Topix.net, RichSkrenta
Cc: XXXXXX
Subject: 30 minutes in January?

Rich,

I'm the local Oracle field rep and wanted to set some time to
make an introduction and to get to know your priorities for '07.
I happened by last week and XXXXXX suggested I follow up via email.
I work with local startups regarding:

Database/Infrastructure
Application Integration and Middleware
Business Intelligence/Reporting

We could explore these areas or others depending on how you feel time
would be best spent.

Please let me know if you see a 30 minute slot on your calendar in
January that would be good for a meeting.

Have a great holiday season. I look forward to speaking with you soon.

XXXXXX

XXXXXX XXXXXX
Technology Sales - Silicon Valley North
(XXX) XXX - XXXX office/fax
(XXX) XXX - XXXX cell
xxxxxx.xxxxxx@oracle.com

I've been reading too much Anonymous Lawyer...

December 21, 2006

Triptych

December 22, 2006

Hot Dog Demo

My father was a plastic surgeon. He took me to ..work.. once and I watched him re-attach someone's index finger that had been cut off on a circular saw. I watched the whole operation, and after the finger was bandaged up at the end, I passed out. I think I was 11.

My crystal radio didn't work. I couldn't solder neatly. I got blobs all over the circuit boards and invisibly ruined components by overheating them. "Do you want to be a doctor like your dad?" Wet stuff was even deeper into the physical world. That wasn't for me. So I went into software.

Watch Sawstop's hot dog demo.

I'm often neurotically distracted by safety-related things, like seatbelts. I have a pal who hated to wear his seatbelt. He had 100 rationalizations for not putting it on. He smoked a lot too, and was trying to quit. I told him to keep smoking and put his seatbelt on, he'd come out ahead statistically. I bet a lot of kids with parents who work in the ER hear a lot of brain-scarring accident stories while they're growing up.

December 23, 2006

I bought a goat

I've always liked goats. They seem so gentle and thoughtful, and it's amusing how they'll actually try to eat your pants and stuff. Goats are cool.

I looked into the whole goat-as-pet thing, since I have a fairly large backyard, although it is quite steep -- perfect! But taking care of a goat is a lot of work, they're social animals, they want to be around other goats, there are various goat-issues that you need a proper vet to assist with, and who knows if the city even allows goats (although a neighbor has a chicken coop, so perhaps I could get away with it.)

I thought of donating a goat to my kids' school. I could visit the goat, but then the kids and hopefully some of the other parents would have to take care of it. I'm sure the kids would like a goat.

But what I ended up doing was buying a goat through Heifer International. Hopefully my goat will be helpful to someone who can really make good use of it.

Go buy a goat for someone right now. (You only have a few more days for those '06 tax deductions, and you won't feel like giving on the 26th!) You can also give chickens, pigs, water buffalo, cows, and even crazy stuff like a hive of bees. See their full catalog.

December 24, 2006

Heat Miser and the secret ingredient of our childhoods

Since the earliest Christmas I can remember, I have been disturbed by the suspicious resemblance between Heat Miser in The Year without a Santa Claus, and Burgermeister Meisterburger in Santa Claus is Coming to Town. I thought this was some straightforward shadiness on the part of Rankin/Bass, the production company behind the stop-motion holiday shows; sort of a meta-deception behind the puppetry evil. I started poking around in the Burgermeister's background and came across the career of a remarkable actor, as well as a chain of connected childhood flashbacks.

Burgermeister Meisterburger was voiced by Paul Frees. WikiPedia says, "like Mel Blanc, he was known in the industry as 'The Man of a Thousand Voices'". As I scanned the list of his credits, I nearly fell out of my chair. He had voiced, in an uncredited role, Colossus himself in Colossus: The Forbin Project.

Colossus is one of the most overlooked sci-fi classics. Produced prior to 2001: A Space Odyssey, it was considered too depressing to release by studio execs, and shelved until HAL grabbed the insane killer computer first-mover advantage.
Ship early, ship often.

He also voiced K.A.R.R., the evil twin to David Hasselhoff's good-robot car K.I.T.T. in an episode of Knight Rider.

Bryn's aside:

Back in August when John Mark Karr "confessed" to killing JonBenet Ramsey, I noticed that the AdSense on Topix.net when I searched for "Karr" was all about Knight Rider ring tones and David Hasselhoff. After some searching I discovered that KARR was the evil twin of KITT (the true star of Knight Rider). So either AdSense buyers are very thorough or Google keyword targeting is very advanced. Time for Turbo-Boost.

But that's not all.

Frees also voiced the original Star Wars trailer. Bob says

I was intrigued by this man's ability to be the voice of authority. His voice is so strong and commanding that people would bring him in to lend credibility to the worst examples of writing.

This trailer has the worst copywriting I've ever heard. "It's the story of a boy, a girl, and a universe." "A million years in the making, and it's coming to your theater this summer." and, my favorite, "Somewhere in space, this may all be happening right now."

He sounded a lot like the "Ghost Host" from the Haunted Mansion. When I listen to it, I can almost remember some voiceovers for ads as well...maybe Levi's jeans? I'm sure he did a lot of ads; too bad there isn't a site for discovering these.

I remember that Levi's ad! It was on Bob Abel's demo tape. I can remember the voiceover; it did sound just like the Star Wars trailer voice. Same time period too...could very well have been Frees. Bob Abel did some amazing graphics stuff, including Sexy Robot, which was a Super Bowl ad for ... cans. As in, paid for by a canned-food industry association. Crazy.

Those graphics look primitive now. When I saw them in 1986 they were still eye-popping. I think Abel had something to do with Tron too. Tron never looked good, unfortunately, not even when it first came out.

Paul Frees did the voice-over for your childhood

If you grew up in the 70's, this guy's voice was the secret ingredient in your childhood. Your alter-parent from the TV. The voice behind your personal voice-over track. Buy this. Watch this now. I command you. Just like Colossus. The voice of the commercial state, just like 1984, but instead of being run by the govt it's run by cereal companies.

I got so distracted with Frees that I never followed up on the Heat Miser. Will have to save that for next Christmas Eve's post. :-) Merry Christmas!

December 26, 2006

Market Sizing with Junk Mail

There is a whole industry of folks who collect lists of names and addresses to sell to junk mailers. They've got lists of everything -- active seniors who have taken a cruise, newly licensed RNs, boat owners (selectable by length of boat). There are literally thousands of these lists. They're sometimes handy to find out how big a market is, or how many of a certain kind of business there are.

Mike and I were poking around Yelp to grok their model.

Mike: "So they got a DB of the restaurants from Acxiom or someone, made landing pages and collect community on them..."

Rich: "What do they have in Google?"

> site:yelp.com

Mike: "270k, doesn't seem like a lot, there's gotta be more restaurants than that."

Rich: "Google 'restaurant masterfile'"

Mike: "Restaurant Owners Masterfile, 385k restaurants, close..."

('Yelp' ... where did that name come from? Yet-another-listings-provider?.. But 'yelp' sounds better than 'yalp'...nah)

PayPal sure seems like it was a successful startup founder factory...

December 27, 2006

Bizarre dialogue with installer

"Don't the batteries get too hot and explode?"

"No, they're behind the pine cone."

This exchange struck me as odd. Even in context I had to think about it.

...

December 30, 2006

Google automatic inline map feature

How did Google know what I wanted?

How did they associate the query with the address, and infer the intent to see a map?

I tried a couple of other likely locations, but wasn't able to make the map pop up.

December 31, 2006

There is no fold

Fascinating data from a company called Clicktale regarding whether users scroll below the fold on web pages. Clicktale has some magic that records user sessions on your website and can replay them to provide usability data. Cool. This looks like the next best thing to eyetracking all your visitors.

Their conclusions:

  • Don't try to squeeze your web page and make it more compact. There is little benefit in squeezing your pages since many visitors will scroll down below the fold to see your entire page.
  • Since visitors will scroll all the way to the bottom of your web page, make life easier for them and divide your layout into sections for easy scanning.
  • Minimize your written text and maximize images, visitors usually don't read text - they scan web pages.
  • Encourage your visitors to scroll down by using a cut-off layout.

When AOL acquired ICQ, the AOL designers tried to get the ICQ folks to shorten their pages. AOL usability guidelines forbade scrolling, but ICQ had pages that went on for miles and miles. They were crazy and just never ended, but their users obviously loved the service and loved the feel around the whole product, so it worked for them. I've seen debates about how long pages should "optimally" be still playing out on design blogs. Nice to see some real user data about how sites are actually being used.

Browser: Do not try and find the fold. That's impossible. Instead, only try to realize the truth.
Designer: What truth?
Browser: There is no fold.
Designer: There is no fold?
Browser: Then you'll see that it is not the fold that matters, it is only yourself.

Last Day of the Year

Santa Cruz, California

Not pictured: post-traumatic squid disorder incident.

January 1, 2007

Winner-Take-All: Google and the Third Age of Computing

IBM         1950-1980
Microsoft   1984-1998
Google      2001-

Google has won both the online search and advertising markets. They hold a considerable technological lead, both with algorithms as well as their astonishing web-scale computing platform. Beyond this, however, network effects around their industry position and brand will prevent any competitor from capturing market share from them -- even if it were possible to match their technology platform.

To paraphrase an old comment about IBM, made during its 30 year dominance of the enterprise mainframe market, Google is not your competition, Google is the environment. Online businesses which struggle against this new reality will pay opportunity costs both in online advertising revenue as well as product success.

Competitors such as Yahoo should quickly move to align themselves with this inevitability. Yahoo could add an extra $1.5B to their revenue overnight by conceding monetization to Google and becoming a distribution partner for Adwords, as Ask Jeeves did.

Google is the start page for the Internet

The net isn't a directed graph. It's not a tree. It's a single point labeled G connected to 10 billion destination pages.

If the Internet were a monolithic product, say the work of some alternate-future AT&T that hadn't been broken up, then you'd turn it on and it would have a start page. From there you'd be able to reach all of the destination services, however many there were.

Well, that's how the net has organized itself after all.

From this position, Google derives immense and amazing power. And they make money, but not only for themselves. Google makes advertisers money. Google makes publishers money. Google drives multi-billion dollar industries profiting from Google SEM/SEO.

Most businesses on the net get 70% of their traffic from Google. These businesses are not competitors with Google; they are its partners, and have an interest in driving Google's success. Google has made partners of us all.

Why does Google make so much money?

It turns out that owning the starting point on the Internet is really, really valuable.

Not just because it gets a lot of traffic. It's because that traffic is so much more valuable than the rest of the page views bouncing around the net. Google's CPMs are $90-120, vs. $4-5 for an average browse page view elsewhere.

This value premium on search vs. content is because of the massive concentration of choice potential which exists on the decision point, Google.

John Battelle calls this power behind user search queries "intent". This is why the ROI of a clickthrough bought from Google is so much higher than a clickthrough bought from a banner ad impression. It represents a higher likelihood that someone is going to take action if they came from a search instead of a mouse click.

No one wants to be on a search engine, they want to be on one of the 10 billion destination or application pages of the net. They may navigate "directly" to these pages because they know the name and/or have been there before. And they may move between pages by following links - say, from a blog like Valleywag to an interesting article. But these are 1:100 fan-out effects.

Google is a 1:10,000,000,000 fan-out effect. When you need to find a new page, or perhaps even to navigate to one you've been to before, you go back to the starting point -- Google.

To reconstitute Google's full value on the destination pages, you'd have to have a network which participated in a majority of the destination landings. Such a network would also participate in repeat visits which G does not see; but it would hit users after a decision point, and so might still have less overall value; it's harder to distract someone into going elsewhere from a sidebar than when they're on the locator service.

But it's a lot easier to monetize G's 1:10B branching point than to participate in 10B destination pages.

And once you own it, you can have the rest of the net too. :-)

Google's next step: owning the rest of the page views on the net

Just as Microsoft used their platform monopoly to push into vertical apps, expect Google to continue to push into lucrative destination verticals -- shopping searches, finance, photos, mail, social media, etc. They are being haphazard about this now but will likely refine their thinking and execution over time. It's actually not inconceivable that they could eventually own all of the destination page views too. Crazy as it sounds, it's conceivable that they could actually end up owning the entire net, or most of what counts.

Complaints are already being heard about Google using their starting point power to muscle into verticals.

My 70% market share number was conservative so as to be believable; others report that Google is more like 78-80%.

Competitors who want to dethrone Google need to fight a two-front war. They have to build a killer consumer search service as well as a successful advertising network. Building one of these is difficult, but doing both simultaneously is nearly impossible. Google's dominance in both of these areas gives them an unfair advantage, and allows them to easily parry any attacks.

How zero switching costs paradoxically yield a winner-take-all market

Search engines have zero user switching costs. Unlike switching email providers, there is no user data to move over, or addresses which need to be forwarded or communicated to peers. You just type in a new name and go to the new place.

If switching costs are zero, the first thought is that it should be easy for a worthy challenger to take some share away from the leader. Paradoxically, it's the reverse that happens.

Zero switching costs lead to a winner-take-all market for the leader. Even a modest initial lead will snowball until majority market share is reached and maintained. This is because, faced with a choice between two products, in the absence of switching costs users will choose the better one, even if it is only slightly better.
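
You can watch that snowball in a toy simulation (entirely my own illustration; the quality edge, brand weight, and noise constants are arbitrary). One engine gets a small real quality edge, brand perception adds a bonus proportional to current share, and every user makes a fresh choice each period -- zero switching cost:

    # Toy winner-take-all dynamics: G's 5% quality edge compounds through
    # share-dependent brand perception toward a dominant, stable share.
    import random

    quality = {"G": 1.05, "Y": 1.00}
    share = {"G": 0.5, "Y": 0.5}

    for year in range(10):
        wins = {"G": 0, "Y": 0}
        for _ in range(20000):     # zero switching cost: every choice is fresh
            perceived = {e: quality[e] + 0.5 * share[e] + random.gauss(0, 0.3)
                         for e in quality}
            wins[max(perceived, key=perceived.get)] += 1
        share = {e: wins[e] / 20000 for e in quality}
        print("year %d: G share %.0f%%" % (year, 100 * share["G"]))

With these particular constants the leader drifts from roughly 55% up into the high 70s and stays there; remove the brand term and the edge never compounds.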

Google had a vastly better product than any other search engine for a number of years. Competitors have closed the gap somewhat, but Google is still better. Everyone (70-80%) knows this now, and so the Google-has-better-search concept is now built into Google's brand.

Even if a competitor such as Yahoo, MSN or Ask were to fully close the gap at this point, they would still have to overcome the final brand perception gap. This is the effect where market research shows that users who see Google's logo on top of Yahoo's results perceive the results to be of higher quality; users looking at Google's results with Yahoo's logo on top view them as having less relevance. Brand perception effects have been measured to account for about 8% in things like beer. A few years ago an AOL researcher replicated this study in a shopping mall in Virginia with AOL Search results vs. Google.

Back to zero switching costs and winner-take-all. Suppose the product gap has been closed, and the two products are now identical. But one product has a powerful brand perception that it is better. In the marketing analysis, that's the same as being better. Users will stick with the leader.

Economic and social forces reinforce a feedback loop of success for the leader. The best programmers will leave the losers to work for the winning team. Major online sites will invest in organizing their sites to appeal to the winning search engine. Advertisers will be drawn to the leader, giving it a greater share of resources to invest in continuing and strengthening its lead.

Yahoo is leaving a lot of money on the table

Everybody wants to own their own advertisers. Talk to newspaper execs if you want to get an earful about ceding sales to the online giant. Controlling sales is a point of pride, and of some perceived strategic value. But quantifying the opportunity cost throws a stark light on the huge cost of opting out of Google's winning monetization platform.

This story has played out before. In 2001, Ask Jeeves was on the ropes. Battered by the dot-com crash, its ticker symbol was in danger of being de-listed from the Nasdaq. Skip forward to 2003, and they're flying high again. The magic in between was doing a deal with Google to have Adwords take over monetization. Google quickly became responsible for 70% of Ask Jeeves revenue, and Ask Jeeves stock rose 1,685% in the year following that deal.

Yahoo should accept Google's search and monetization dominance. Yahoo will not recover the search application, and browse views are not competitive and cannot be made to be so. They should do a deal with Google for Adwords/Adsense across their entire network, as Ask Jeeves did. They should be able to obtain at least an 85% rev share; assuming Google monetizes at roughly $0.20/search, that would take Yahoo from $0.10/search to $0.17, a 70% increase in search revenue overnight.

That's an extra $1.5B or so of yearly revenue being left on the table while they try to build a copy of Google's revenue platform.

Such a deal could even see Google's triumphant return to powering Yahoo's search results, which would provide superior results for users. In a way, this is simply rolling back Yahoo's configuration a few years, to the point where Yahoo used third parties -- Google and Overture -- for both search and monetization. Yahoo's effort to vertically integrate these functions has failed; it hasn't yielded a winning consumer product, and it's leaving billions of dollars in potential revenue on the table.

What about Microsoft?

Microsoft isn't a player online any more than IBM is. IBM?

IBM still has a great business, inhabiting the business enterprise market where they've been since they started. When the PC era arose, the popular question was why IBM couldn't own that new market too. Sad requiems were printed the day IBM finally gave up and exited the PC business.

Stodgy old IBM was perfectly suited to selling to Fortune 1000 CIOs and the government, but wasn't configured to deliver PCs to consumers. The winner of that game was Microsoft. Surprise...the winner of the PC market didn't actually sell PCs! How could IBM have known...

The PC market isn't going away either. Microsoft has a great business too. But now the question everyone asks is why Microsoft doesn't own this new thing, the Internet. Surely with all those resources it could own any new market that arose.

But it shouldn't be surprising that huge successful companies can't make the leap into owning a completely new and different market. New markets bring with them new rules, and require different skills to win. Microsoft has the same shot as any well-funded venture at knocking off Google's crown. But they don't get a special pass just because they make a lot of money selling Word and Excel and have their logo on keyboards.

We get used to seeing the giant squash everything it steps on as it strides through the domain of its market dominance. But sooner or later, the terrain changes, and the old leader can go no further.

Nobody even bothers asking why IBM isn't a player in consumer search. IBM and consumer websites just don't have anything to do with one another. PC software and websites don't have anything to do with each other either.

All Hail the New King Google

The interregnum between the end of the PC era and the rise of the online world has concluded, and Google is the new king of forward market growth in computing and software technology. Major companies will succeed by working within the framework of Google's industry dominance, and smaller players will operate in niches or in service to the giant.

"I for one welcome our new insect overlords."

:-)

January 3, 2007

Filter

"Avoid pedantic people like the plague: they want to prove they wasted a bit more time than you getting a nuance right."

      -- Andrew Goodman, Undistracted

 

"My field is artificial intelligence, but I'm sad to say that this subject started on the wrong page of the map many years ago and most of us haven't woken up to it yet."

      -- Steve Grand, The Strong Possibility That We've Got Everything Horribly Wrong

 

"You would think the distant #2 would be sucking up more."

      -- poster, Digital Point forums, in Christmas gift from Yahoo?

 

January 5, 2007

Mike's New Soap-Vox

My co-founder Mike Markson has gone and got himself his own blog, over at the new easy-to-use Vox service from Six Apart. Out of the gate he's got a head-scratcher about 527 groups and Internet syndication.

I don't have any unique insights around this, but it reminds me of the personal SEM campaigns that Chris Zaharias has been running for political and social issues.

No idea where this all goes, but I wouldn't be surprised if we've barely scratched the net's potential effects on political messaging so far.

Update:

Well that honeymoon with Vox didn't last long. But Mike's still on 6A, just hardcore hosted movable type now with his own domain.

January 6, 2007

Elevator pitch archaeology

So last year I read this story on VentureBeat about genius Tony Hsieh and how he's doubled his sales every year since 1999 for his online shoe store, Zappos. Mernit had mentioned Zappos before, and I don't usually think about shoes much, but got curious about how someone was succeeding in a retail vertical online. Two mentions of this guy, I gotta go read about it.

All the press about Tony talked about customer service and the 24/7 warehouse and having a fast website. That was great, operational excellence and all that. Sure.

But not being a big shoe shopper or shoe thinker, I was kind of flying blind in the space. I wanted to understand the original vision for the business, to glimpse the spark that led someone to think they could make a successful startup out of selling shoes online. Retail is so hard, and I dunno, I would just expect that between existing bricks & mortar retailers with websites, and ordering direct from manufacturers over the web, shoes would be pretty much covered, and it would be hard to get a foothold to make a big business.

So I went to look at the site, but it didn't help. The tagline really left me stumped. "We are a service company that happens to sell ... Shoes Handbags Apparel Accessories". Huh.

I didn't get it. I mean, that's great and all. I expect that sort of thing on a poster in the warehouse over the drinking fountain. Like if you go to the restroom at Best Buy and see the wall with all of the reminders for the employees on how to upsell properly. Or the big "Check Your Appearance" over the mirror in the employee hallways in casinos just before the doors that lead back into the public areas. It's an internal motto, a way they think about themselves. McDonald's is "Quality, Service, Cleanliness, and Value", but that's not their advertising tagline. (It's currently "I'm lovin' it", unfortunately).

There was no way "We're a service company" was the original spark behind Zappos. Yeah, we're going to happen to sell shoes, and we're going to be great operationally. That elevator pitch didn't hook any VCs.

I know the kind of meeting that results in "We're a service company" ending up on the website, and it wasn't around a kitchen table. It happened later. After other people were hired.

I went over to the Internet Archive to see what Zappos had looked like when it first launched. Sure enough, Zappos circa 1999-2001:

World's largest shoe store. Of course! It's so blindingly obvious (in hindsight). They're the Amazon.com of shoes. That's the elevator pitch. "We're going to be the Amazon.com of shoes." They're going to have everything, be really comprehensive. And of course have a great website and handle returns and ship things fast and all that stuff you need to do well if you're going to have a hope in retail.

Now Amazon.com doesn't call themselves the "earth's biggest bookstore" anymore. They don't seem to have a tagline at all now, that I can see on their website. Books became limiting, and they wanted to become a superstore, and sell everything.

But then this Zappos thing came along. And although Amazon sold shoes on their website, I guess Zappos was getting all the shoe-buzz and eating away at the vertical. So Amazon has launched Endless.com, an online shoe store.

Now Endless.com has a tagline. Which is reminiscent of Amazon's original tagline, and Zappos.

"Endless Style, Endless Options." Earth's biggest... world's largest... endless... hmmm. So now Amazon has launched a site to be the Amazon.com of shoes. Ironic! :-)

January 7, 2007

Taste that beats the others cold

In his book, Adcult USA, James Twitchell tells a story about Rosser Reeves. An executive of Minute Maid once complained about Reeves's refusal to fiddle with the advertising, saying "You have 47 people working on my brand, and you haven't changed the campaign in 12 years. What are they doing?"

Reeves replied: "They're keeping your people from changing your ad."

    -- Is it the end of the ultimate advertising slogan?, Al Ries

...

How to Ship Code and Influence People

In 1995 I was working for an AT&T spinoff, Unix System Labs, in the kernel group, but USL wasn't doing too well. It looked like Windows was winning the OS wars (remember those?) and was going to scrape Unix from the face of the earth. My buddy Tom had been trying to get me to come out to the west coast to work for Sun.

He was working in a group that did network security. He said that it didn't matter what OS won, we'd always have network security to worry about. That made sense to me. He set up the interviews, and since I didn't know anything about security, told me to buy a copy of Applied Cryptography and read it before I got there.

So I got hired into Sun's group developing firewalls and IP-level encryption. This was great work. Security is really intellectually challenging and rewarding. Little unseen errors in apparently simple code or protocols can lead to collapse of the entire system. Rigor and thought count.

But commercially our group was a bust. In addition to our firewall product, the Sun sales force was reselling a third-party product too. And they seemed to like to sell the third party product better than Sun's homegrown one.

We were also trying to sell an add-on security workstation package. The sales force didn't seem to think much of the commission on our $99 software bundle when they were pushing their multi-million dollar hardware orders.

Furthermore, the US government didn't seem to want to let us sell our stuff. International sales of cryptographic products are regulated, since they're classified as "munitions". We had Swiss banks that wanted to buy our encrypting firewalls. We'd have to meet with the NSA to get export approval, but were never able to sell the full strength versions. There was a dark story about a rebuffed offer to include a backdoor in the crypto for the powers-that-be, which led them to subsequently look on us unfavorably. We tried all sorts of shenanigans to loophole around these restrictions but it was an uphill battle.

The final nail in the coffin came when Sun's crypto protocol lost out to another faction's in the IETF.

I was just an engineer in this group, but the reality of what was happening in the market to our product line started to seep in. Here I was putting all of this effort into stuff that never would be used by anyone. It was still intellectually challenging...like doing crossword puzzles or something. But it had no utility to the world.

I started to look around and I saw many other examples of groups working on stuff that no one would ever use or care about. Mobile IP initiatives, endless work around standards that nobody cared about, research from the labs that would never be applied or even cited.

Yikes.

I had written stuff that people actually used, before. It felt good. I had written a usenet newsreader that was used by hundreds of thousands of people. I was running an online game, as a commercial hobby on the side, which had several hundred paying customers. Sheesh, I thought. My side projects have more customers than my day job.

So I made a simple resolution. I wanted to work on stuff that people would actually use.

This sounds simple. But if you walk the halls of Sun, AOL, HP, IBM, Cisco, Siebel, Oracle, any university, many startups, and even Google and Yahoo, you'll find people working on stuff that isn't going to ship. Or that if it does ship, it won't be noticed, or won't move the needle.

That's tragic. It's like writing a blog that nobody reads. :-) People make fun of bloggers who are writing "only for their mother". But what about the legion of programmers writing code paths that will never be traversed? Wasted effort!

Some of that may be inevitable. You try experimental things. Sometimes they don't work. Everyone can't be maximally productive 100% of the time, so there may be lesser-value tasks that still keep the engine warm and have some marginal utility. But still. Evolutionarily, frustration is useful. It kicks us out of non-productive ruts. People should get frustrated more easily. Frustration should be driven by an awareness of futile effort.

* * *

Without business models, bizdev deals for distribution, and market economics that afford a place for your product, it doesn't matter how pretty the code is. Ugly code and awful products win all the time.

From an engineering perspective, it's simply zooming out the field-of-view to include the entire market, including the users, competitors, and so on. They're part of the total engineering solution. If you've written an app with some web forms and a database, but you haven't solved the problem of how to get users to come to the web form, then you've left part of the problem unsolved.

Greg Linden details some of the tricks that have been used by startups to get a leg up in a crowded world. He wonders if you have to be, perhaps, a little bit evil to have a hope. I'm not sure, but you should have some idea of how you're going to launch the bird, and the market and distribution economics that let it stay aloft.

All of this is a long-winded way of explaining why I include all this gunk about network effects and switching costs and distribution and brand perception on my blog. Because the world, full of competitors and networked humans with their set of behavior patterns, is part of the spec. If you're designing a product, but don't understand how the system of networked humans will work around it, you really can't understand how your product will work either.

No minute lost comes ever back again
Take heed and see ye nothing do in vain

Redwood City Fire

View from my house of a fire on Seaport boulevard today. The smoke trailed off down towards San Jose and lasted for hours.

Fire at car shredding business in Redwood City creates heavy smoke
Bay City News Service

Redwood City firefighters have contained a one-alarm fire that broke out in a large heap of trash at a car shredding business in Redwood City, but the smoke is visible from miles around, a fire chief reported.

The fire was reported at 2:38 p.m. at Sims Metal at 699 Seaport Blvd in the port area of Redwood City, Fire Chief Gerald Kohlmann reported.

There were no injuries reported, but the fire department got calls from as far away as Oakland reporting the smoke, Kohlmann said.

The fire broke out in a pile of flammable materials from scrapped cars, including roof liners and upholstery. The fire is contained, but it could be hours before the fire is extinguished, Kohlmann reported.

Nothing indicates that the fire was set intentionally, Kohlmann said. The firefighters are in the process of extinguishing the flames, but the materials are still burning deep within the pile.

There are no special precautions to be taken, but Kohlmann advises residents nearby with respiratory conditions to remain inside.

Firefighters from Menlo Park and Woodside Fire District were called to help extinguish the fire.

(Story via Topix)

January 9, 2007

Sebastopol

Tim O'Reilly, Brady Forrest

The smiley was invented in Pittsburgh :-)

Time is the one thing that can never be retrieved. One may lose and regain a friend; one may lose and regain money; opportunity once spurned may come again; but the hours that are lost in idleness can never be brought back to be used in gainful pursuits. Most careers are made or marred in the hours after supper.

      -- C. R. Lawton

Where do I find this crap?

Tom and I ..snarfed.. a giant pile of jokes & quotes from Mike Fryd's joke program running on CMU's TOPSA.ARPA in 1985. As a graduate student in the 80's, Mike Fryd wrote a phototypesetting language called SCRIBE, and then, if I recall correctly, sold it for a nice chunk and apparently became some kind of beach bum in Florida.

I've been dragging the stolen joke collection around with me in a file called yukko.dat ever since. It seems to have a different lineage than fortune, and is a nostalgic relic from the CMU Tops-20 culture, so this weird file of old jokes is special to me. joke is the first program I write in any new programming language that I learn, since it exercises primitive string operators, file I/O, and calling a decent random number generator.
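
For the record, joke in Python comes out to about a dozen lines (this assumes a fortune-style data file with jokes separated by lines containing only "%"; the actual yukko.dat format is from memory):

    # joke.py -- exercises string ops, file I/O, and the RNG, which is the point.
    import random
    import sys

    def load_jokes(path):
        with open(path) as f:
            return [j.strip() for j in f.read().split("\n%\n") if j.strip()]

    if __name__ == "__main__":
        path = sys.argv[1] if len(sys.argv) > 1 else "yukko.dat"
        print(random.choice(load_jokes(path)))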

Mike Fryd also has the distinction of having participated in the CMU bboard thread where the smiley was invented. Although, unfortunately for the purposes of this post, he was not the one to invent it. :-(

January 11, 2007

Typing Trumps Pointing

Windows Vista's main navigational mode from the Start menu relies on typing, not pointing:

Google was right all along. It's not quite a command line renaissance, but it is an implied victory of textual search over traditional point-and-click desktop GUI metaphors. Typing trumps pointing. There's far too much content in the world-- and even on your local computer-- for browsing and pointing to work reliably as a navigation scheme today. Keyboard, text and search are the new bedrock navigation schemes for the 21st century.

From Coding Horror.

Appropriate Discoverability

95% of SEO is getting the basics right: title, meta, h1 h2, link anchor text, sane url structure, and so forth. That stuff still matters and it's amazing how so many businesses with tons of content don't do it.
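
Most of those basics are even mechanically checkable, which makes the neglect more amazing. A crude sketch of the kind of audit I mean (stdlib-only, with a hypothetical page; a real check would also look at anchor text, URL structure, and so on):

    # Flag missing on-page basics: title, meta description, h1/h2.
    from html.parser import HTMLParser

    class BasicsCheck(HTMLParser):
        def __init__(self):
            super().__init__()
            self.found = set()

        def handle_starttag(self, tag, attrs):
            if tag in ("title", "h1", "h2"):
                self.found.add(tag)
            elif tag == "meta" and dict(attrs).get("name") == "description":
                self.found.add("meta description")

    page = """<html><head><title>Chez Luigi reviewed</title></head>
    <body><h1>Chez Luigi</h1><p>Four stars.</p></body></html>"""

    checker = BasicsCheck()
    checker.feed(page)
    for basic in ("title", "meta description", "h1", "h2"):
        print("ok  " if basic in checker.found else "MISS", basic)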

I have a little rant that I give the folks who run online newspapers about SEO.

Newspapers actually pay writers to go to restaurants and eat food. Of course they're supposed to write a review of the place afterwards.

They have thousands of these reviews, often going back years, for every restaurant of note in the newspaper's market. For major restaurants, there may be multiple reviews.

Yet if you go to Google and type in any restaurant name, you're not likely to ever come across a newspaper restaurant review in the results. Instead you'll see Yahoo local, yelp, chefmoz (heh), zagat's, chowhound, jatbar. The only newspaper I found was Dan Pulcrano's Metroactive, which is doing a pretty good job of getting its reviews in front of the searches in the Bay Area.

These would be very valuable pageviews to be getting. AdSense could do $10-30 CPM on these landings. Not to mention the value to the newspaper of holding on to a claim of authority for restaurant reviews in its area.

Newspapers also pay writers to go watch movies. When the movie is later released on DVD, they pay writers to rent the DVD and watch it. And write reviews. Yet again, if you type any movie name into Google, there are no newspaper results.

For a major newspaper chain, across the multiple properties they have, and given the separate reviews often written for the theater release and DVD release, they may have a number of individual reviews for the same movie. Enough to create a whole mini IMDB with a stack of editorial around each movie. If they organized the content right and got it indexed properly.

Again, these are very valuable pageviews. But they're being claimed by Amazon, Rotten Tomatoes, and other aggregators.

Here's a third example. Who best knows about all of the garage sales in your town this Saturday? The newspaper... But I'll bet if you type 'your town garage sales' into Google you're going to see other people there besides your local newspaper. Even though the newspaper has the most comprehensive list of upcoming garage sales.

SEO continues to have some dark connotations from its spammy past and the aggressive tactics it sometimes uses. But there are really three levels of SEO:

  1. Inadequate discoverability. You've got something that should be findable, but for technical reasons your content is either not indexed or not ranked appropriately.

  2. Appropriate discoverability. Your content shows up for the right searches, in the right rank. If a human editor at Google were to review your rankings, they'd agree that it was appropriate.

  3. Inappropriate discoverability. You're ranking for terms you have no business ranking for, or the position within the results is out-of-whack with what a human reviewer would deem appropriate. Affiliates showing up ahead of the main company, content-generated spam farms ranking for random queries, etc.

Newspapers have a lot of great content, really high quality stuff that cost them a lot of money to develop. Users would love to come across this content, when appropriate. Google would even like to help users find that content, since the users will be happier. But often technical best practices aren't being followed with the CMS, and the valuable content fields lie fallow.

Todd Friesen:

It still shocks me to be on a call with a client or a potential client and to talk about the 95% and get that "wow" reaction. 99% of the online world does not know what we do and more importantly does not know what we know. To them it's rocket science and when we show them results in a month they happily pay their bill and get back to doing what they do best - which, quite often, feels like rocket science to me though I'm sure they think it's easy.

We do a fair bit of SEO at Range and we recently have a great success story that involved online revenue moving from 6 figures a year to 7 figures a year (and I don't mean from 9M to 10M). Most of that campaign was the fundamentals that weren't in place prior to our involvements. We fixed Titles and Metas, URLs, ALTs, internal linking and did some external linking work. We also rewrote thousands of pages of content. Go ask the CMO what we did and you may hear terms like rocket science and magic and to that CMO it is rocket science with a huge payoff in revenue.

Anchor text not limited to the anchor

I tried a search for 'google third age' and this came up:

Now that's an odd snippet. It's not from my document, it's from Stanley Wong's reference to my post:

I just read Rick Skrenta’s great blog post,
Winner-Take-All: Google and the Third Age of Computing

Rick is right on the money with a lot of his observations, especially the fact that Google has built their huge lead on the backs of the Search and Advertising dominance.

Why did that snippet come up instead of one from my document? I think it's because 'third age' doesn't actually appear in the body of my blog post. But why not just show the beginning of the post? Maybe the little blockquote table with IBM and Microsoft and the dates threw it off? Hmmm.

If I search for 'google insect overlords' I get the appropriate snippets from the body of my post:

So it looks like Google hasn't just added the text within the anchor href to the target document's index material; it has gone quite a bit outside of the anchor to pull in surrounding text, and added that as well.

index material for a page =
    text on page + anchors to page + relevant text surrounding anchors

This could boost relevance in cases where anchors aren't optimally formed from the point of view of the search engine, but sufficient confidence can be gained that the nearby text is relevant to the link. e.g.: "The text for the U.S. Constitution can be found here." Clearly 'here' is not very valuable anchor text, but "U.S. Constitution" is tantalizingly close...
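
Here's a minimal sketch of what that recipe might look like. The five-word window is my invention for the example; Google's actual mechanics are anyone's guess:

    import re

    def index_terms(target_text, inbound_links, window=5):
        # Terms for a page: its own text, plus anchor text from
        # inbound links, plus words near each anchor on the linking
        # page. `window` = assumed words kept on each side.
        terms = re.findall(r"\w+", target_text.lower())
        for context, anchor in inbound_links:
            words = re.findall(r"\w+", context.lower())
            a = re.findall(r"\w+", anchor.lower())
            terms += a
            # Find the anchor inside its surrounding context and
            # grab the neighboring words.
            for i in range(len(words) - len(a) + 1):
                if words[i:i + len(a)] == a:
                    terms += words[max(0, i - window):i]
                    terms += words[i + len(a):i + len(a) + window]
                    break
        return terms

    links = [("The text for the U.S. Constitution can be found here.",
              "here")]
    print(index_terms("We the People...", links))
    # 'constitution' now indexes the target page even though the
    # word never appears on it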

This isn't just for snippet purposes; words from that extended snippet/anchor work on the query side. "rick's great blog post" turns up my post, even though 'rick' doesn't appear in the link proper or anywhere on my site (I go by 'Rich', not 'Rick').

Maybe Google has been doing this for a while but I've never noticed it before. It could certainly lend a whole new angle to Googlebombing, if you can not only spike the searches for someone, but actually write their site's snippets for some queries. :-)

January 12, 2007

Over-Communicate

At AOL I learned a management dictum: "Over-communicate". The idea was that lack of communication caused all manner of ills, and if you didn't take it for granted that everyone knew what was going on, that everyone knew what you knew, then you would tell them, and thus avoid problems that would otherwise occur. Pick up the phone, send an email, and avoid a project trainwreck.

I've been thinking about how blogging is such a scalable communication tool. The dynamics of the blog mean that you don't necessarily have to meet with everyone to establish the common frame-of-reference that's so handy for effective communication.

I'm sure all of Fred's portfolio CEOs read his blog so they know what's on his mind. They don't have to have a lunch or a call with him to cover that background stuff. And when they do meet in person, the meeting will be more productive, since they'll have had time to think about what he's been writing about.

Jason's another blogger I can't stop myself from reading. I first directly encountered Jason several years ago when he flamed me in the comments on Battelle's blog, claiming Topix had blacklisted Weblogsinc from our crawl. I thought we were headed for our first PR disaster. Who was this guy? Who is Weblogsinc? Why did we blacklist them?

Jason and I chatted on the phone and it was all straightened out. And I started reading his blog.

When I see Jason post stuff like this I have to stop for a second. It seems new, this idea that you broadcast everything in your head and there is a net win. It seems to work for him. How generalizable is that, though? When does it work, when doesn't it work? Hmmm.

Today is the one-month anniversary of my blog. I'm still trying to get my style and rhythm and voice down. So far it's been pretty rewarding. I'm sure you'll let me know how I'm doing. :-)

I can't draw

But I do anyway!

January 14, 2007

The programmer productivity front

Programming Language
Operating System
Cluster/Grid     <--- you are here
Knowledge Base
AI

I looked at inbound traffic for a recent post and was surprised to see programming.reddit.com at the top of the list. I knew about Reddit before but not this sub-reddit. I checked it out and the articles were geeky-cool (for a programmer). But after a few days of reading I started to get an uneasy feeling about the place.

What was all this fretting about why nobody uses Lisp or functional languages? Haskell, ML, yikes. I felt like I'd been teleported back in time to my college days. Maybe this was an east-coast vs. west-coast thing? Reddit is in that Boston/MIT corridor, Paul Graham talks about Lisp all the time, are they really still worried about this stuff?

Language? Bah. The action is in the frontier after the OS.

Don't get me wrong, I love programming languages, and I have a soft spot for language design. I tried (and failed) to design a new language early in my career. I even have a collection of books about historical programming language design. I've seen huge productivity wins with better programming abstractions, and sure, picking nonconventional choices can often give you a leg-up over the competition.

Picking a language isn't just a personal choice though. It has to be tempered by the realities of how mature the platform is, whether you can hire people who will want to work in your language, how appealing your tech platform will appear to partners, investors, acquirers... Yahoo Shopping isn't written in Lisp anymore, they rewrote it. Of course.

But the productivity and development problems that I see building search and web apps just aren't happening at the language statement level.

Language statements generally live inside a single program process. But coordinating all the pieces of communicating software across a modest 500-node application like Topix is a bitch.

I want a fast scratchpad for my 50 front-ends to be able to share, kind of like sys V shared memory, but networked. I want get, put, append, tail, queue, dequeue, infinitely scalable across some RAID-ish cluster. Billions of keys, petabytes of data, if I get something a zillion times a second from all the front ends it should adapt so it can serve that fast, but migrate stuff I never get to slower storage. Everything should be redundant, fault-tolerant, self-repairing and administratively scalable.
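
Roughly the interface I'm wishing for, sketched in Python. Every name and signature below is invented; the point is that nothing off-the-shelf provides it:

    from abc import ABC, abstractmethod

    class ClusterScratchpad(ABC):
        # A hypothetical networked, sysV-shm-flavored store; the
        # redundancy, migration and fault-tolerance all live behind
        # this thin API.

        @abstractmethod
        def get(self, key): ...            # fetch a value by key

        @abstractmethod
        def put(self, key, value): ...     # store or overwrite a value

        @abstractmethod
        def append(self, key, value): ...  # add bytes to the end

        @abstractmethod
        def tail(self, key, n): ...        # read the last n bytes

        @abstractmethod
        def enqueue(self, q, item): ...    # producer end of a queue

        @abstractmethod
        def dequeue(self, q): ...          # consumer end of a queue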

You end up building some version of this every time you make an eBay, Second Life, Hotmail, Bloglines, AIM, Google, Inktomi, Webfountain, Facebook, Flickr, Paypal, Youtube.

A zillion machines, a zillion concurrent connections, a big mess of data, never lose any of it, never go down, oh and the SLA is never take longer than 50ms to do anything. And be simple and fun to program on top of so the programmers can work on the actual app instead of spending all their time firefighting the cluster support layer.

We all keep cobbling together solutions for whatever app we happen to be writing out of ad-hoc clustered RDBMSs, Reiser, Berkeley DBs, piles of coordination code and scripted admin.

Language innovations like Ruby are great, especially when they get some traction and acceptance so that you actually could use them if you wanted to. But all of the recent languages that get used have come out of individual eccentrics. They're incremental aesthetic exercises. They're also all more alike than different. Language innovation is basically done, and mostly has been for a long time.

Machine-level OS research died too, probably sometime in the 90's. Rob Pike, a veteran of Bell Labs' Unix group, put out a paper in 2000 called "Systems Software Research is Irrelevant."

Systems software research has become a sideline to the excitement in the computing industry...

Ironically, at a time when computing is almost the definition of innovation, research in both software and hardware at universities and much of industry is becoming insular, ossified and irrelevant...

What is Systems Research these days? Web caches, web servers, file systems, network packet delays, all that stuff. Performance, peripherals, and applications, but not kernels or even user-level applications.

Now after Pike wrote that he left Bell Labs and went to work at Google.

Of course. Google is doing more cluster OS research than anyone right now. You could argue that Google's technology success owes more to the block & tackle work of managing 500,000 servers than to little algorithms that power search and ad targeting. GFS, MapReduce, BigTable.

A smart researcher can write an ad targeting algorithm or some pagerank variant in a weekend. It's relatively easy to think up new algorithms; implementing them and getting them to run, especially for web-scale problems, is the hard part. Without the platform to develop and deploy against, it's like you're writing code on paper waiting for the computer to be invented so you can run your program.
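
For instance, the textbook power-iteration form of PageRank fits in a dozen lines of Python. That's the weekend part; none of the web-scale plumbing appears here, which is exactly the point:

    def pagerank(links, d=0.85, iters=50):
        # links: {page: [pages it links to]}; assumes every link
        # target also appears as a key.
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iters):
            new = {p: (1 - d) / n for p in pages}
            for p, outs in links.items():
                if outs:
                    for q in outs:
                        new[q] += d * rank[p] / len(outs)
                else:
                    # dangling page: spread its rank across everyone
                    for q in pages:
                        new[q] += d * rank[p] / n
            rank = new
        return rank

    print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))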

It's too bad there isn't a standard platform for all this stuff, so we wouldn't all have to stop and write a new custom version every time we want to code something that will need more than a single machine to run on.

Peculiar distribution and economic dynamics -- giving the Unix source away to universities -- led to the entire industry eventually standardizing on the C/Unix/POSIX syscall OS model. GNU and Linux helped vastly here by obliterating the stranglehold that AT&T held over the technology, which was holding adoption back. New languages get scale by being free, so they can get critical adoption mass, bake their platform to maturity, and become viable, become socially acceptable to pragmatic users.

But we don't need a clone of SYSV or a free C compiler or a dynamic language with socially-acceptable syntax now. We need an industrial strength, hyper scalable cluster OS.

The problem is that the kind of eccentrics that gave us Unix, GNU, Linux, Perl, Ruby, aren't likely to be able to deliver here. Who has 500 machines in their garage and a million pageviews/day as a personal thorn in their side? Only companies have these problems, and when companies build a platform to solve the problem, the platform isn't general, and it's not given away.

Hmmm.

January 16, 2007

Referer Rankology

I started this blog last month. I figured there was no point writing a blog that nobody reads, so I set about getting some initial readers. I tried to write a couple of ..interesting.. posts to establish this base. I figured my daily observations would be as good as the next blog's (although now I see that it's harder than it looks, if rewarding). I decided on a format which interspersed little eccentric, personal items to give breaks between the longer essays full of harder-to-digest industry analysis.

My first few major posts did pretty well, but the third one was the real zinger. It got quite a bit of pickup, including lots of link love, ranking #1 on Techmeme, being Slashdotted, and resulting in a few reporters calling me. So what's that worth? Here's the tally.

The post directly received about 20k total hits so far (not counting RSS reads or reads from my homepage), linked from about 500 unique domains. The top 15 inbound referers were:

2836  slashdot.org
958  reddit.com
826  stumbleupon.com
619  del.icio.us
439  google reader
424  groklaw.net
419  techmeme.com
372  blogs.zdnet.com
368  battellemedia.com
324  arstechnica.com
303  gigaom.com
277  valleywag.com
263  bloglines.com
174  newsgator.com
159  dnjournal.com

I didn't rank equally on all of these sites, so it's not a completely apples-to-apples comparison. But still the relative ranks are interesting, since there are some new names sending strong traffic. StumbleUpon looks like it's going to be a winner. Other mentions of StumbleUpon that I've seen in the blogosphere suggest that it's growing like a weed, and is sending strong traffic to featured sites.

This post has also received 179 del.icio.us bookmarks, compared with 344 for my previous bell-ringer post about Google in 2004. The archives do indeed turn out to be worth more than the homepage. Even for my month-old blog, most of the action is in referrals to the archived posts, as opposed to readers of the front page.

For reference, my total readership for month one is approx 350k visits for my 40-odd posts. There appear to be approx 1,000 regular weekly readers here after a month, based on measurements of posts with an image which I could track independently of readership medium. This gives a total conversion rate of about 0.3%. Ouch. Converting linkbait hits to readers is hard.

Update: Wait, that's not right. I had the wrong option to my script. I've had a total of 20k inbound referers, out of 60k total visits. Not 350k. This gives a trial-to-reader conversion rate of about 5%. That's not too bad.

Attack Products

Whether invading countries or markets, the first wave of troops to see battle are the commandos... Commandos parachute behind enemy lines or quietly crawl ashore at night. A start-up's biggest advantage is speed, and speed is what commandos live for. They work hard, fast, and cheap, though often with a low level of professionalism, which is okay, too, because professionalism is expensive. Their job is to do lots of damage with surprise and teamwork, establishing a beachhead before the enemy is even aware that they exist. Ideally, they do this by building the prototype of a product that is so creative, so exactly correct for its purpose that by its very existence it leads to the destruction of other products. They make creativity a destructive act. more...

    -- Accidental Empires by Robert X. Cringely

I was thinking about this in the context of Yahoo and Microsoft's competition with Google on search, compared with Steve Jobs and the dazzling launch of the iPhone.

On one hand you have essentially copycat search products which, while perhaps competently implemented, haven't significantly innovated the space or gained back any market share.

On the other hand, the iPhone's design is so dazzling that it's left designers worldwide gaping in open-mouthed awe.

Mike Davidson:

There are so many things to say about this iPhone that it's hard to know where to start. To me, the single most impressive thing about it is that, like a lot of Apple products but specifically this one, there is no other company in the world capable of inventing it. How many times do you see a new product come out and you think "Damn, I wish I would have thought of that!"

The iPhone is no such product.

You couldn't think of it, and even if you did, your finished product would be a godamned fingerpainting compared to this. It is so fulfilling to watch technology unfold like this, in the hands of the most indispensable and world-changing CEO of our lifetime. It makes all other work you may be doing in the technology world seem like peanuts.

When Apple says they are five years ahead of every other phone on the market with this offering, they are being conservative.

Jeffrey Friedl:

Motorola has been around for a long time... has it never learned anything about designing a product for humans to use? ...

I watched the introduction of Apple's iPhone today ... and was astounded, not that the iPhone seems to have such a great user-interface design (although it does), but that it's so great in the face of a history of moronic phone design.

This isn't just a slick PR machine success. This is a genuinely stunning product, something "that is so creative, so exactly correct for its purpose that by its very existence it leads to the destruction of other products." How do you ship something so great it leaves the top people in the field awestruck?

How is it possible that Steve Jobs runs big, old Apple like a lean startup? And not just any average startup, but a kick-down-the-doors successful one. Repeatedly, too!

This usually gets chalked up to the cult of the genius. Sure Jobs is a genius, but management theory is all about getting good results out of large groups of people with varying talents. And Jobs doesn't have a monopoly on all the smart people in the valley. If you're in charge of managing product development somewhere, isn't there some playbook (something like The Innovator's Dilemma) for how to organize your team to more reliably ship devastatingly effective, innovative products instead of me-too, committee-designed clone exercises that fail to achieve their goals?

Cave Man Programmer

I sometimes feel like some kind of cave man programmer. Frozen in ice sometime after the 6502 assembly era, thawed out in the post-OO LAMP age. There's lots of new stuff. Some of it good. Why am I so damned cranky?

Some aspects of the modern world delight. I discovered Applied Cryptography with glee; like a box containing a lighter, sharp knife, flashlight, mirror, binoculars, and a compass, the usefulness of the tools in that book immediately leapt out at me. Far beyond the security domain, knowing how to do protocol analysis, use MD5/SHA, decent RNGs, salts, Diffie-Hellman, stream ciphers and the like seem like essential tools. Does everyone learn this stuff as core CS in school now? I certainly hope so.
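
For example, here's the salt trick from that toolbox in a few lines of Python. A sketch only; a real system today would reach for a slow key-derivation function rather than a bare hash:

    import hashlib, secrets

    def hash_password(password, salt=None):
        # The salt defeats precomputed dictionary attacks, since
        # identical passwords hash differently per user.
        salt = salt if salt is not None else secrets.token_bytes(16)
        digest = hashlib.sha256(salt + password.encode()).hexdigest()
        return salt, digest

    salt, stored = hash_password("hunter2")
    _, attempt = hash_password("hunter2", salt)
    print(attempt == stored)  # True only with the right password+salt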

I search for an analogous "Applied AI" to no avail. Some algorithms seem promising, but instead of sharp knives and binoculars there are only plastic toys. Useless Bayesian 85% A/B classifiers that require tons of training data, only good for writing papers, but not actual code.

Entire chattering research volumes of nonsense, tautologically proving nothing very interesting, because if the books knew how to do what their titles suggested, we'd all be a lot further along with this stuff.

The damned book I want hasn't been written yet. I should have stayed frozen longer.

January 17, 2007

Speculative Fiction

Fred Vogelstein of Wired has a fascinating account of Yahoo's decision not to buy Google, and their choice to purchase Inktomi/Overture instead and go it alone:

"Five billion dollars, 7 billion, 10 billion. I don't know what they're really worth -- and you don't either," he told his staff. "There's no fucking way we're going to do this!"

Semel could talk tough because he had a backup plan. Yahoo would go out and buy its own top-notch search engine and its own search-advertising technology, and it would beat Google in the emerging arena of little text ads that pop up next to search results.
    -- "How Yahoo Blew It", by Fred Vogelstein

Office discussion ensues...

Mike:

Of course, if Yahoo had bought Google, they would have killed it. Google wouldn't be Google today in that scenario; they'd look more like Inktomi or Overture.

Rich:

Yes, the tyranny of the linear time continuum. We can never really know what things would have looked like if they had paid the $5B. But, since most big acquisitions wreck both companies, it probably wouldn't have come out too well.

Chris joins in. Paraphrasing:

The $5B meeting that Semel rejected is a journalistic device. Someone needs to take the blame for Yahoo's mess; it might as well be the CEO. And if you can find a smoking-gun meeting, so much the better -- regardless of whether that particular choice really was an actual decision point for them. The overall truth that we valley techies believe -- that culturally Yahoo chose the wrong road, pursuing media over technology -- still holds.

Chris, not finished, cranked up the what-if ray further:

Although, if Yahoo had bought Google, thus killing it, there would be no Google today to make Yahoo look so bad by comparison. There would just be the regular industry mess we had before. People would still think Yahoo was cool. They should have bought Google, not to capitalize on its potential growth or technology, but to take out a dangerous competitor. The move would have succeeded, regardless of whether they integrated it well, or completely fucked it up.

The thing is, the real mistake here was not buying Google sooner. Yahoo was seriously late to the party in 2002. Heck, I tried to buy Google for AOL in 1999. I had no authority, having only been at AOL 2 months after the acquisition of NewHoo. Dave Beckwith, VP of Search at Netscape, and I visited Larry and Sergey in their Menlo Park garage headquarters. Dave was being cagey so I asked Larry flat out -- how much? Larry's reply: "You don't understand. We don't want to just get rich ourselves. We want to make our family and friends rich too." Cool. Of course AOL would have killed them.

Jim Lanzone of Ask Jeeves also tried to buy Google, but the $1B ask was too high. Overture tried to buy them and was rebuffed. There seems to have been a long line of folks in 1999-2000 who recognized Google's value, but couldn't justify the price then.

Yahoo's stock was higher in 1999-2000, and Google only cost $1B. Two years later, Yahoo was down and Google's price had shot up. It was too late.

January 18, 2007

Yahoo World Explorer: geosurf hyperlocal with photos

Yahoo Research Berkeley has launched World Explorer, a cool little app that lets you type in a location and browse geotagged photos from Flickr. Here are some from Palo Alto centered over Google. :-)

The blog philosophers

Blake's blog: career, management and life philosophy, via kitchen utensils:

Job descriptions are evil. Though they serve a purpose in the grand scheme and organizational hierarchy of large companies, in the end they harm the long term growth of the employee. They are debilitating because they train us to focus on coloring within the lines, and never ask us to look outside for ways in which we can add value. We are trained to just multi-task on the job, but never multi-task our jobs.

Blake's got some very cool bay area restaurant reviews too. Worth a read.

But I hope this doesn't progress as far down the rabbit hole as Phillip Eby has... Phil is a well-known Python programmer who seems to be morphing into some kind of new-age guru via his blog. He has a book, motivational seminars, the whole works. I actually like his stuff, in appropriate doses. The Multiple Self really weirded me out...but you gotta be careful when you start trying to hack your own brain...

:-)

Platform Shrugged

I think a lot of people have been willing to give Semel and the whole "Yahoo as a media company" so much space over the years because of its sheer size. SEM only got mainstream attention in the past 2 years, so now everyone realizes what many of us realized since 2002: Overture was a shambles.

What some of us formerly speculated about, has become more obvious: Semel, and others who shy away from technology, don't add value to a company like Yahoo. Platform and technology issues aren't trivial, obviously. Hands must be gotten dirty, even in the top jobs.

      -- Andrew Goodman (in the comments)

Andrew's previous comments on media vs. technology, from 2004, were prescient.

January 19, 2007

Search isn't over

A number of commenters have interpreted my winner-takes-all post as saying that I don't think startups have a chance taking on Google. Not at all. My point was that Microsoft and Yahoo have the same chance as any startup at that game. Maybe more so, because of extra resources and distribution; but less so, because it's hard to innovate inside a big organization; I figure those about cancel each other out. But the big "incumbents" don't get a special pass to win.

Bill Burnham says that search startups are dead. His points are generally reasonable, but I think it's a mistake to write off the category again.

Search was written off as 'done' before, in 1998, but it wasn't. There is far more depth here, both on the technology front and in new markets, that has not yet been plumbed. Google has got a great business because they are focused, for the most part, on some of the most interesting computing technology problems we'll face for the next 50 years. This is no shallow vein. It is not just advertising. Rich ore is yet to be mined... and Google will not own it all.

Giving in to Despair

I knew industry people on the east coast in the early 90's who thought the fledgling Internet was essentially doomed because AT&T was going to own it, once they got their act together and woke up to the opportunity. That sounds absurd, I know, but the thing is, there were people who really did believe this, and they based investment decisions around that idea, based personal career choices around it.

Later in Silicon Valley I met people who thought the growing Internet was essentially doomed because MSN was going to own it. Once Microsoft woke up to the opportunity, they would surely just eat the whole thing, and nobody could stop them. The valley really had a conditioned fear complex around Microsoft. Well, the despair that leaked into their thinking compromised the quality of the decisions they made regarding how to approach the net -- for product development, investments, career choices.

Now I see the same kind of despair in the search space, thanks to Google. Heck, I've helped feed it. Is the despair appropriate? Should it be influencing your decisions about investments, product development, career? Should we all pack up and move off to clean tech, or the once-and-future enterprise 2.0, or nano or bio?

I'm amazed that new ventures are launched in mature industries like beverages, airlines, or toothpaste.

Or that such an innovative leap is possible in a mature space like cellphones and PDAs.

Now you have a rapidly changing field like software, algorithms, extraction, AI... plus the rapidly morphing social composition of the Internet, the evolving composition of the net's content... all this change makes for opportunity.

Google puts its pants on one leg at a time too

Young companies on hypergrowth trajectories seem to inevitably stumble. The painful exercise transforms the company; the free-wheeling culture is clamped down on, process and bureaucracy are instituted. Then more nimble competitors can scurry around them. Who knows if the big G will be able to escape this trial but it sure is a common pattern.

Organizations are difficult to scale. Management gets a smaller and smaller rudder for the growing boat. Turning fast was a luxury of youth.

Practices that seemed great in 2002-2003, such as red-zone hiring rates, 20% time, or vast build-outs of infrastructure, could bring significant challenges later.

Don't get me wrong, I believe Google is a fantastic company, as anyone who has read my posts knows. But to bet against the search startup space is equivalent to betting that Google is going to bat 1000. And nobody ever bats 1000.

"When Being a Verb is Not Enough"

Google is building for a future they see but most of the rest of us don't.
      -- Cringely, in his latest article on PBS.org

January 22, 2007

Brain Cloud

Youtube is further proof that worse is better.

Lisp is a brilliant failure, by a bipolar student.

"My love for Lisp pretty much destroyed my career as a programmer."

Smalltalk, Lisp, Scheme, Eiffel, QNX, Pick, Ryze. These things share something in common.

To paraphrase Tony Hoare, premature commercialization is the root of all evil.

If you have something that depends on a network effect for success, and you decide to charge everyone money upfront, you won't get the uptake necessary for platform success. You won't create the large-scale network effect which would create your future market of customers.

So if you come up with some nifty new programming language, and want everyone to use it, don't immediately go start a company and try to sell the compiler. You have to give it away. Maybe you can bake some kind of upsell into the thing, maybe not. Same goes for operating systems, social networking platforms, whatever.

Another way of saying this is that you should trade revenue to get market share.

It's interesting that the computing platforms we use today seem to have come out of either academic distribution or a quasi hippy Berkeley culture. "Code should be free". If you go all capitalist on your platform innovation too early, it goes nowhere.

...

I was trying to research Dick Pick, of Pick Systems, the guy who built the database language, to find material for this post, but Google kept messing up my queries. You can't search on Dick Pick. You get all these hits for "Dick's Picks". I sat in a meeting once where Larry and Sergey explained why stemming was bad. They were right. They should have stuck to that position.

January 23, 2007

What I'm reading


Foundations of Genetic Programming, Terascale Knowledge Acquisition, Introduction to Evolutionary Computing, The Ghost Map, Granta, The Oxford Handbook of Computational Linguistics, What Predicts Divorce?, The Blind Side, The Text Mining Handbook, Anticipatory Learning Classifier Systems, Memory-Based Language Processing, Scalable Optimization via Probabilistic Modeling, Foundations of Statistical Natural Language Processing, Blink, Universal Principles of Design, Genetic Programming IV.

(this mess was on my nightstand until last weekend, when I moved it to the floor so I could scan all the covers at once, and see my clock again)

Years ago I made a decision that if I ever was curious about a book, any book, and thought I might one day read at least a few chapters, I'd buy it. No questions. Ideas and knowledge are so hard to come by, but sometimes nearly infinitely valuable. Surmounting the barriers of time, motivation, and effort is hard; I might as well tempt myself by collecting a store of good bait.

It's worked well for me over the years. Novels and historical books get read linearly. Tech books get a random access scatter pattern where I basically try to follow some thread of interest throughout a collection of sources. I've currently got more technical books in the set than normal, since I was curious about whether AI had made any recent progress with web-scale datasets.

The divorce book has been causing me problems though. I bought that as a deep-dive from Malcolm Gladwell's Blink. One of Blink's chapters describes a psychologist, John Gottman, who can predict whether a married couple will get divorced within 6 years with 90% accuracy, based on watching a 15-minute videotaped interview of the couple. "That's amazing," I thought. "How on earth does he do that?" The details in Blink were sketchy.

So I bought Gottman's 500-page tome detailing his research, as well as related material about observing interactions and the Facial Action Coding System (FACS). I've even ordered FACS training software so I can try to learn to recognize the thousands of facial micro-expressions and what they mean. It seems like this would just be handy to know, in negotiation, or acting, or business, or life. (Try the smile test to see a little demo of what FACS is about).

I haven't gotten back to Blink yet because I'm still down in this sub-thread it spawned. It's been incredibly interesting, and I intend to blog about the whole business at some point. But in the meantime I have this book about divorce lying around. People see that and they instantly think they know why I have it. My mother-in-law spotted it and that's led to all sorts of sideways glances.

I also had a book about how to own & operate your own bar sitting around. People would come over and ask my wife if I was thinking of leaving the technology industry and opening a bar. Good lord no. Foodservice and retail are the last things on earth I would get into. But there was the book in the restaurant supply store, and it had chapters like "How to know when your bartender is stealing from you" and I just couldn't resist.

Eyes work using a page fault mechanism. They're so good at it that you don't even notice.

You can only see at a high-resolution in a fairly small area, and even that has a big fat blind spot right exactly in the middle, but you still walk around thinking you have a ultra-high resolution panoramic view of everything. Why? Because your eyes move really fast, and, under ordinary circumstances, they are happy to jump instantly to wherever you need them to jump to. And your mind provides this really complete abstraction, providing you with the illusion of complete vision when all you really have is a very small area of high res vision, a large area of extremely low-res vision, and the ability to page-fault-in anything you want to see -- so quickly that you walk around all day thinking you have the whole picture projected internally in a little theatre in your brain.

      -- Joel Spolsky, the Big Picture

It's a stretch, but you can kind of look at knowledge acquisition through reading that way too. Your head is a fast index in RAM. You can page-fault in whatever you want to know about from the slower offline world of paper and ideas. But that's a slow, expensive integration process. It requires reading, understanding, even sleep.

It's not just about wanting to keep the reading machine from getting bogged down in useless drivel or dead-ends. It's about actively managing the reading-input queue, nearly at the paragraph level, just like an engineering product queue. I think, if I've got time to read 20 paragraphs of something now...should they all be from this book, or that book, or should I scan 5 and then focus a bit? Am I still getting useful yield out of this thread, or could the next 10 minutes be better spent skipping forward?

You might think that's kind of a scattershot approach, like maybe I just can't focus on anything long enough to pay significant attention to it or something. But I don't think that's it. I'm a coder, I can focus like a madman for hours. But I don't have infinite time. I just want to maximize my data input yield.

I have an uncle who is a university English professor, and I'm sure he'd spit his coffee out at my engineer's approach to reading. But I'm pretty passionate about learning stuff, and I get even more excited about making use of that learning to make new stuff. It works for me...

January 25, 2007

Knight News Challenge

I'm off to Miami to help judge the applications for the Knight Foundation's News Challenge grants for community media projects.

The Knight Foundation has launched a new competition that will award as much as $5 million in its first year in community news projects that best use the digital world to connect people to the real world.

If the quality of entries warrant it, the foundation may spend as much as $25 million during the next five years in the search for bold community news experiments.

They got about a zillion applications...I spent the last 24 hours reading grant proposals...

January 26, 2007

The joy of the hack

A reporter just called me and wanted to talk about my virus, Elk Cloner, that I wrote back in 1982, when I was in the 9th grade. Apparently it's the 25th anniversary of the virus and since I wrote the first one she wanted my thoughts.

First thought: "25 years? Aaaah I'm old!"

Fortunately I still regularly get the feeling I had back when I wrote cloner.

"Why did I do it", she asked. "Was it malicious?"

No, not malicious. It was a practical joke combined with a hack. A wonderful hack.

Back then nothing was networked. We had these computers in a lab, and there was software for them on floppy disks. You stick in the disk and run the software. Simple.

The aha moment was when I realized I could essentially get my program to move around by itself. I could give it its own motive force, by having it hide in the resident RAM of the machine between floppy changes and hitch a ride onto the next floppy that would be inserted. Whoa. That would be cool.

Insight without implementation is worthless, so to work I went.

That aha feeling is the burst that lets you know you just had a really cool idea. The moment you realize a hack is possible. NewHoo was like that. How do we build a web directory with human labor if we have no money? There were elements in Topix too that made us giddy when we thought of them. The hack doesn't have to be code; it can be little business insights. Even groups of people and individuals have hacks.

The essence of the hack isn't just realizing you can use a system in a new, unexpected way. It's getting a disproportionate effect from your effort. It's catalyzing potential energy stored in the system.

And the hack often changes the whole world. The user-generated content model we developed with NewHoo is ubiquitous now; it was the main inspiration behind Wikipedia. Viruses and exploits are of course all too common. You can't put the genie back in the bottle.

The only consolation is that the genie would have gotten out anyway. But it's fun to be the first to let it out. :-)

January 27, 2007

Foggy view today

Foggy view today towards Redwood City. On a clear day you can see Hangar One, the old military dirigible station, at Moffett. Today you can barely see past the port.

Party Topix

I was on a long flight back from Miami and missed the Paidcontent mixer. Bah. At least Bob and Mike appear to have had fun. :-)

January 30, 2007

Topix Classified Network

Heh:

Stock analysts credited yesterday's gain [of Tribune Corp's stock] to the announcement that classifieds for general merchandise from Topix.net also would be posted on Newsday.com and other Tribune Web sites. The papers' ads also would appear on the Topix.net site.

via Newsday

This follows Topix's announcement yesterday of powering free listings across Tribune online newspaper properties.

Team Topix is at the Newspaper Association of America convention in Vegas this week. If you're here, stop by our booth and say hi.

January 31, 2007

Eeny meeny, jelly beanie, the Polycom is about to speak

February 4, 2007

FORTY

Where have I been... I'll tell you where I've been. I've been looking at dead bodies in Las Vegas and it SPOOKED ME OUT.

I missed Bodies...The Exhibition when it was in San Francisco a while back. Last week we were at the NAA convention in Las Vegas, and Burns tells me the bodies exhibition is at the Tropicana. I had to go. I'd read about the crazy process where they take cadavers and fill the veins and stuff with plastic and turn the bodies into life-sized, real models of themselves or something. So Burns and I take a hike up there from Mandalay Bay one day before our show.

Tolles wouldn't come, he was too creeped out.

The exhibition was amazing, even better than I expected.

Imagine a person's circulatory system, filled with plastic, holding its shape, with the rest of the person dissolved away so only the red-colored blood vessels remain, hanging 3D in space.

The fetus room had its own special warning, so you could go around and skip that part if you wanted to. Rightly so. Yuugh.

The respiratory system room had a big plexiglass box with a hole in the top of it. The box was filled with hundreds of packs of cigarettes. Next to the box was a case with two preserved lungs, a healthy lung and a black-tarred smoker's lung. A sign over the plexiglass box said "Quit now." While Burns and I watched, a woman dropped her pack into the box. I wonder if she'll really quit. It's hard.

Spending an hour looking at partially-dissected cadavers really started to haunt me about my own mortality. I was wondering who all these exhibited cadavers belonged to. I mean, who were these people, that their bodies ended up in a traveling show in Las Vegas with a bunch of people gawking at them? This isn't just Hamlet holding his buddy's skull; it's surreal on a modern scale.

The last time I felt this way was in Paris, when I visited the catacombs. Apparently in the 18th century the dead of Paris were overflowing the cemeteries, so monks dug up all the bodies, cleaned off the bones, and neatly stacked them in various ..patterns.. against the walls in a network of tunnels buried beneath the city. You can take a tour. (Take a close look at that picture. Those skulls are arranged in the shape of a heart. What were the monks thinking?)

The place is genuinely creepy, all wet-dripping stuff from the ceiling and skulls and plaques with ominous messages of doom every few paces.

While I was there a girl from a high school tour group freaked out and ran past me screaming "I have to get out of here!" She was going the wrong way.

That was ten years ago. I think it would be even more spooky to me now. You don't see any old people touring the catacombs.

February 6, 2007

98% spam

Absolutely horrifying email spam stats from our new VP of ops at Topix.

We receive 25,000 mail connections per day; each connection is an attempt by some machine to send us mail. Of those 25,000 connections we are able to reject nearly 80% of them outright since the IP address of the originating machine is registered as a known spam offender. Using various additional checks and methods we are able to reject about 98% of all incoming connections before they even have a chance to send us message content. Of the 2% that do make it to the point where they send us a complete mail message we are able to reject 25% of those as containing spam, a virus, or an unsafe attachment type. That means that in the end only 1.5% of all attempted mail connections actually result in delivered mail.
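
The funnel arithmetic checks out, using the quoted percentages:

    connections = 25000
    reach_content = connections * (1 - 0.98)  # 2% get to send content
    delivered = reach_content * (1 - 0.25)    # 25% of those rejected
    print(reach_content, delivered, delivered / connections)
    # 500.0 375.0 0.015 -> 1.5% of connections yield delivered mail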

That's push email. Consider the day in our near future when 98% of the http fetchable web is spam. Auto-generated text, on-the-fly scraper-reconstituters, and so forth.

The bright side: web spam is an evolutionary force that pushes relevance innovations such as trustrank forward. Spam created the market opportunity for Google, when Altavista succumbed in 97-98. Search startups should be praying to the spam gods for a second opportunity. :-)

February 7, 2007

Topix vs. the newsroom model: enabling mass many-to-many communication

Getting hyperlocal community started on the web is hard. It's tough to do for a single community. This is where successes like Baristanet and Mike Orren's Texasgigs stand out. You have to start with some seed content to draw an initial audience, and then get them talking back to you, and talking to each other. It's a tricky boot-up process to manage, and many attempts have failed.

It's even harder if you're trying to take on thousands of local communities all at once, with a single brand.

Local newspapers by all rights should own their local communities online. But there are 1,500 local newspapers in the US, and most of them don't have a lot of spare resources to experiment online. Usually most of their effort is focused on the product that keeps them employed -- print. And I'll say it -- culturally they don't really like the web. Newsroom-driven attempts to deal with the web are usually one-way and read-only. Of course. Go to newspaper industry conferences and you'll find they're still talking about nonsense like electronic readers to view images of the print newspaper, and charging money for online subscriptions. And when it comes to facilitating conversation with the community, newsrooms tend to get seriously uncomfortable with the realities of dealing with the net public. They typically pull community features the minute things get hot -- which, ironically, is when you need discussion and debate the most, and when the discussions often become the most interesting.

There's an underlying assumption in common between the newspapers' online efforts and the single-community hyperlocal startups: that there will be many local brands online. Currently there are over 3,000 local news brands in the US, if you include both newspapers and local TV stations with newscasts. That made sense when geography defined the radius of distribution around TV towers or newspaper delivery trucks. But put your content online, and that radius goes away. Distribution expands to the edges of the net.

Are 3,000 community websites sustainable? Will online consumers really accept 3,000 local brands? Or will a McDonald's model win instead, subsuming the thousands of mom & pop hamburger stands?

Topix is betting that the national play will ultimately win.

Single-community hyperlocal sites tend to have passionate, involved editors and participants. But the labor and overhead costs are high. And they don't generally scale -- to tens, hundreds, or thousands of additional communities. So the costs of the overhead have to be borne by the revenue potential of just one or a small number of communities.

The challenge for a national hyperlocal play, on the other hand, is to get quality in the long tail of geography. Can that be done from an office in Palo Alto? Won't local community development suffer without an editorial office based in every city and town?

But this apparent deficit may in fact be a strength.

Single-community hyperlocal sites are still rooted in the model of the print newsroom. This is a one-to-many mindset that empowers a few to determine what the many will read. It made sense when the distribution network started with an expensive printing press.

But the Internet is not about one-to-many. It's not a printing press or a TV tower. The Internet is the first mass many-to-many communication medium. Users aren't on their community site just to hear one person talk. You can be the greatest local journalist in the world, but reading your output takes 5 minutes a day. Users are there to talk with each other. To learn from, gossip with, and argue with their neighbors. Users provide their own draw. The necessary job of the 'editor' here isn't to run the conversation, like a teacher in a 6th-grade class. It's simply to make sure that the conversation 1) gets started, and 2) doesn't completely run off the rails.

Scaling this to enable millions of people to talk with each other in thousands of simultaneous parallel conversations is not simply a matter of putting up lots of empty message boards and hoping it all works. You need a lot of seed traffic, as well as topically- and geographically-segmented content to prime the pump. You also need a social architecture that draws visitors into discussion with each other. Challenge #1 is getting the party started.

Challenge #2 is keeping it safe and on-track. A free concert in the park with 30,000 music-lovers is great, until the lights go out and the police aren't there. Then it can get ugly really fast. Not because most of the participants are bad people. But because a few bad apples can ruin things for the majority.

This has happened to nearly every online community system, from Usenet to the comment threads that the Washington Post took down, to the news message boards that Yahoo recently took down.

We've spent considerable effort at Topix on an architecture to not only get online communities booted up, but to let them socially scale. Effectively dealing with spam, hate speech, profanity and trolls isn't just about maintaining the quality of the commentary. Scalable moderation is essential for enabling growth to larger and larger audiences. If you don't keep the quality sufficiently high, you stop adding users.

We do a lot of this at the software level, with layers and layers of filters to catch all sorts of bad stuff. We also have centralized human editorial overseeing the whole show. The simple idea: get rid of the bottom 5% of posts every day.
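
As a sketch of the shape of that pipeline (the filters and weights below are invented for illustration; the real layers are far more involved):

    def score_post(post, filters):
        # Each filter layer returns a penalty in [0, 1]; sum them up.
        return sum(f(post) for f in filters)

    def daily_cull(posts, filters, cut=0.05):
        # Drop the worst-scoring 5% of the day's posts (the stated
        # rule of thumb); the rest stand, pending editorial review.
        ranked = sorted(posts, key=lambda p: score_post(p, filters))
        keep = int(len(ranked) * (1 - cut))
        return ranked[:keep]

    filters = [
        lambda p: 1.0 if "BUY CHEAP" in p.upper() else 0.0,  # spam-ish
        lambda p: min(p.count("!") / 10.0, 1.0),             # shouting
    ]
    posts = ["Nice park cleanup on Saturday", "BUY CHEAP MEDS!!!!!!!!!!"]
    print(daily_cull(posts, filters))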

Starting from zero a little over a year ago, we now have over 1,000 active forums (an active forum being defined as one which receives at least 5 posts per day). There are 30,000 local communities in the US, so by one measure we've hardly made a dent in nationwide hyperlocal. But 1000 active communities is no small achievement, when you consider how hard it is to start just one. Local community now represents 35% of Topix's traffic, and continues to grow double-digits every month.

Community-driven content has been so successful for us, in fact, that we're going to be re-organizing our entire site to focus on it. I'll have more to post about this as we get closer to the date of our big relaunch. :-)

February 10, 2007

Boost RF range with your head

Here's a weird tip I learned way back from a hardcore Sun engineer, Ben Stoltz... You can use your head as an antenna to boost the range for little RF devices like car key fobs, garage door openers, etc. It sounds crazy and I didn't believe it until I tried it myself. Stick the device under your chin and hit the key... I can double the range on my car RF key this way. It really helps to find your car in a garage when you've forgotten where you parked. Or to hit the garage door signal when you're slightly out of range.

Of course you have to not worry about what this is doing to your head. Is it just something about the shape of your skull, or is it the quasi-electrical circuitry of your neural mass that is serving as the antenna to amplify the signal? I don't know. Jeezus. I tried to swear it off but then I was looking for my car once, and it just works so well... Don't think about it, I tell myself.

I tried to use this trick to get my blackberry to sync to the ground from the 6 hour flight I was on today, but no luck. I would get partial signal for a few seconds but then it would fade out. I guess if you're going 500 miles an hour it's not just the 35,000 feet keeping you from hitting the towers. Where's our in-air wifi?

:-(

February 12, 2007

Facial Action Coding System

I mentioned before that reading Blink got me fascinated with studying facial emotion via the Facial Action Coding System (FACS). I'd played with some of the online tools like Artnatomy, but apparently full FACS training takes 80 hours and requires a bunch of video; you can't learn it from a book, since you have to be trained to recognize fleeting, subtle expressions and what they mean.

So I ordered a training CD from the lab of Paul Ekman, one of the researchers who developed FACS, and it finally came.

Micro Expression Training Tool

While most facial expressions last for two or three seconds, micro expressions last a fraction of that -- 1/25th of a second. These are signs of emotions just emerging; emotions expressed before the person displaying them knows what he or she is feeling, or emotions the person is trying to conceal. You can learn to spot these micro expressions and have access to this valuable information.

Subtle Expression Training Tool

With SETT -- in under an hour -- you can train yourself to see very small facial movements that often appear in just one region of the face: the brows, eyelids, cheeks, nose or lips. These small movements may occur when an emotion begins gradually, when emotions are repressed or when a person is deliberately trying to eliminate any sign of how he or she is feeling, but a trace still remains.

Understanding the code-language of the face seems like a great way to improve communication, not to mention being able to spot lies, false smiles, contempt, and the like. This seems like it would be useful in business, relationships, all sorts of situations.

I just started working through the exercises on the CD today. We'll see how it goes. Unfortunately this CD isn't full FACS though; I may need to hunt around for additional training materials.

February 13, 2007

The Failure of We (the) Media

In the wake of the latest We Media event last week the usual round of self-flagellation by a group of attendees is occurring.

David Cohn wishes We Media was an unconference. Scott Karp acknowledges that media companies need to make money, but bizarrely refers to that as an "ideological agenda". A BBC exec calls it groundhog day. The staid Mark Glaser even throws a few rocks (one at me!), with a big pile-on in his comments to boot.

Andrew Nachison and Dale Peskin put on a solid show. They make a real effort to have the panels interact with the audience. This is not easy -- you get 200 people in a room for 90 minutes, how much interactivity do you really expect? But they managed to pull it off and the results were far more successful than many other panels I've been on. They also get really amazing attendees. Major publishers, tons of senior media execs, a dozen startup CEOs, VCs, Pulitzer-winning journalists, and an eclectic spectrum of media spinners like documentary filmmakers, artists, social activists, and the like to keep the cocktail mixer from getting too businessy and boring. If you can't find someone interesting to talk to at this thing, it's not the event's fault.

So what's the problem?

The problem is that the hopes that Dan Gillmor raised for the media industry in his book -- which kicked off this whole business -- have largely failed.

Tremendous excitement followed the publication of Dan's We the Media (the conference's namesake). It accompanied the trumpeting of a new model of media by the newsy press, and the rise of blogs with attendant breathless hype.

Unfortunately, after doing the author's victory tour, Dan then attempted to put his ideas into practice in a business venture. I suppose there is some due credit for having the courage to cross the line from a long career as a newspaper journalist (observer) to become a startup founder (participant), and to try to prove the viability of the alt.media business plan outlined in his book.

But, like nearly every News 2.0 venture so far, Dan's Bayosphere was a failure.

He has a lot of company. The dog's breakfast of new media startups includes Gather, Backfence, Newstrust, Daylife, TailRank, Associated Content, Pegasus News, Tinfinger, Findory, Inform, Newsvine, Memeorandum, NowPublic. The highest distinction on this list is to be one of the few still spoken of in the present tense (or present perfect -- "They haven't yet succeeded...")

And yes, I would include Topix here as well. We are, in fact, the most successful News 2.0 company, with over a million pageviews/day, 10M server/4.6M Comscore uniques, a million participants in our forums, a $60M exit, yada yada. But let's face it: even we haven't yet burned down the world, or upended the news industry.

There is actually a media revolution in the works. So what's going on here? By implicit definition, participatory media is non-commercial. If it's commercial, someone owns it, and it's not "we" anymore.

Furthermore, as soon as a new media venture crosses the line and tries to become a business, it either becomes a successful business or a failed one. Businesses aren't about ideology, they're about getting a job done and earning revenue to keep the thing going. Even wild success tends to leave ideology behind. Ideology is the realm of nonprofits and failures.

There is still a power law to success, and the few continue to reap disproportionate rewards, as they always have. Pub media turns out to be a farm league for big media. The bloggers who "make it" look more and more like regular media, and less like "us". They graduate to the A-list, and start to get lumped in and criticized along with the establishment. Success looks like a sellout to a big media company, or a good business doing job boards and conferences on the side to pay the bills.

Yes, there is a media revolution in the works. But it's messy, it's nasty videos on Youtube, not the neat & tidy civic Welcome Wagon of citizen journalism. You can't quit your job as a journalist and replace your salary with adsense on your blog. You'll be lucky to make beer money, let alone pay COBRA and fund your SEP-IRA.

And big media has been watching, and buying the winning ventures, and building their own platforms to -- yes you're right! -- exploit the new models.

So shut up and keep blogging, or putting your time in as a wage slave in your chosen profession, or keep slugging it out at a startup. But please stop whining that "we" haven't achieved consensus at the latest industry schmoozefest. If you don't know why you're there, you probably shouldn't be. :-)

February 14, 2007

Nothing up my sleeve...

February 18, 2007

Two Cows and Venture Capital

You have two cows. One is male, and one is female. Mike Moritz says he loves both cows and will buy 35% of the pair for $100. After the deal is signed he tells you to kill your female cow, and then says your male cow must produce a baby cow within three months or you're fired. Three months and one day later he fires you, takes your remaining cow, and transfers it into a milking machine company which then goes public on Nasdaq, earning him $10,000,000. Citing a lactation preference in the term sheet, however, he keeps all but $0.10 of the proceeds. "No hard feelings," he says, "and be sure to come back the next time you have cows."
    -- Paul Kedrosky

Hmmm

Creating a new product like the iPod or even the Prius is a far more modest achievement than developing a new process. The former are what we normally think of as inventions, of course. But the latter, at least in Toyota's case, presents a novel way of thinking about work and the capabilities of human organizations.
    -- From 0 to 60 to World Domination, NYT Magazine
"People don't scale: Truly lazy developers let their machines do the work for them... smart developers know that people don't scale-- machines do. If you want it done the same way every time, and with any semblance of reliability, you want the human factor removed as much as is reasonably possible... I ask myself-- how can I make sure I never have to deal with this problem again? If my solution fixes it so nobody ever has to deal with that problem, that's a nice side-effect, too.
    -- Jeff Atwood

February 24, 2007

Hitmaking

Markson gnaws at the apparent lack of a hitmaker playbook for online marketing:

Many industries have hit-making down to a science. Hollywood is a good example - from the script to the cast to the production to the testing to the release to the marketing to the distribution - every step of the way, calculated decisions are made that are designed to maximize a movie's success. Same can be said for cheeseburgers, toothpaste, autos, etc. They know what it takes to make a hit and they pursue it. They might not always get it right, but there is some research and science to the choices that are made.

Offhand, I can think of some online equivalents. The gaming industry is a hit-driven business. A game starts with a concept, whether it's Tiger Woods Golf, Lego Star Wars, or some MMORPG Dungeons & Dragons descendant. Get a physics engine, voice actors, motion capture, all that expensive stuff that goes into the multi-million dollar gaming budgets these days. Get it out by July so it can be on the shelves for Xmas. Make sure your distribution agreements are in place so your focus-group tested dodecahedral boxes will be arranged in the proper stacks on the end caps at Fry's. There are a lot of details to get right to drive a gaming hit.

Another: the mini blog empires built by Jason Calacanis and Nick Denton. The model is -- think of a concept for a blog, get a fresh design and a great name, and hire a contract blogger. The pay is $2k/month, you're 1099, not an employee, bring your own laptop, work from home, and if you don't increase traffic 10% month-to-month you're fired and someone else will try... After two or three bloggers have tried, if the site is still not working it gets shut down. Have competent centralized ad sales, launch PR, and a great serving platform. Repeat and scale.

But Mike then talks about distribution strategies:

As far as I can tell there are only three real online hitmaking "strategies": SEM/SEO, viral and syphoning (taking your existing traffic and using promotion on your site to "syphon" it off to a new product.)

I don't think of those as hitmaker strategies; they're only a piece of the puzzle. Those are tactical distribution methods. I'd add pure word-of-mouth as the holy grail here, separate from viral:

  • SEM - purchase traffic profitably via some kind of arbitrage. ex: Shopping.com, NexTag, Monster.com.
  • SEO - rank organically for free search traffic. ex: About.com, Autobytel, Yelp.
  • Viral - product contains a built-in spam-your-friends mechanism. ex: Hotmail (the original!), Friendster, Youtube.
  • Syphoning - brand extension and traffic promotion from a winning business to promote a new product. ex: Gmail. I'm having a hard time with examples here because this strategy basically doesn't work. Zshops, MSN, Live, Endless.com, Alexa, A9, Yahoo 360...?
  • Word of mouth - something so compelling people refer their friends even without a built-in spam mechanism. ex: Google.

Viral and word of mouth often work together -- a spam-your-friends mechanism will only be used if someone likes the service enough to use it themselves, and to recommend it to others. It just lowers the activation threshold.

For most people in our industry this is old hat. Yet still there are routinely sites launched by startups and big companies alike that don't have any of these mechanisms. It's hard to predict success but you can often spot failure in the works when a site launches with zero SEO, no way to sustainably buy traffic, no viral aspect, a ho-hum product, and a billboard on the hill in South San Francisco begging visitors to come. These happen all the time and you wonder what the VCs and founders were thinking. Can web distribution 101 really be unknown to them at this point?

The thing is, the entire startup environment in Silicon Valley and the rest of the net-connected entrepreneurial world is the hitmaker factory. The VCs are the formal, but not exclusive drivers of this show. VCs are the hitmakers. They have time-honed playbooks for how to churn out hits from entrepreneurial ventures in fast-changing markets. Even their apparently-trite maxims actually code for a wealth of wisdom.

Movies and games burn out after a while so the engine needs to keep making fresh ones. For programming problems and online brands, though, once you have a winner you're basically done unless the winner screws it up at some point (e.g. AltaVista -> Google) or the market or technology base moves again.

So once you have Google you don't really need more people trying to make a great search engine; Silicon Valley made a really good one, everyone in the world can use it for free, and they're doing a great job keeping their product in good shape.

That happens all the time. We don't need big dialup ISPs anymore, so UUnet and AOL are in the past. We don't need PC or OS startups anymore so they're in the past. We don't need browser startups anymore since we've all got several browsers that work just fine. Mosaic/Netscape and Spyglass are over. We don't need someone to start retailing shoes or books or X online anymore since you can buy anything you want with a few mouse clicks and have it tomorrow FedEx. Fortunes were made filling all of those needs, but the needs are filled now. Fortunes were also made delivering nutmeg and sugar and fresh lettuce to our kitchens too. But they're done. You have to find a new need to make a new fortune, not solve an already-solved problem.

March 2, 2007

Yahoo Singing News

I thought this was a joke when I first heard it a while ago. I'd been making wisecracks about Yahoo News doing a sock puppet version for a while and figured I was confusing myself with some kind of meme-echo. No, it seems like this is actually real:
Yahoo! is hoping a quirky take on the news will strike a chord as its next original programming effort.

The Web giant confirmed Wednesday that it will launch a new initiative before the end of this quarter that will feature a journalist-cum-crooner who will sing the news.
    -- Hollywood Reporter (via PaidContent)

I know a lot of folks over at Yahoo and they're great. I actually feel really bad every time I lob a rock at them. Plus they give me all kinds of grief over how negative I'm being. And what do I know, The Daily Show is the most watched news show for a certain age segment, so maybe with Youtube viral distribution and whatnot this singing news show will be some kind of real hit.

But on another level, it's just wrong. It's not "Yahoo News". The Daily Show isn't a brand extension from World News Tonight, it's over on Comedy Central where it belongs. Singing news isn't news, and singing news doesn't scale. It may be a great show, but making little musical productions come out of this company with such a proud heritage as a valley tech titan is almost too much for me to bear. Yahoo has truly lost its way and needs to fix it right now.

Serious Yahoo engineers -- quit before further damage is done to your resumes. I will hire you. Google will hire you. Someone will hire you. You will be happier in an org that values tech and scale and algorithms. Where your boss comes by and chides you for not having met with the patent attorney lately to try to file some stuff. Get out of there!

March 4, 2007

Random AOL user search tool

I had been using the dontdelete site to search the AOL user search data which was released last August. AOL's release of this data generated a storm of controversy, which led to a bunch of staff resignations for the folks involved over there. Still, their idea that this data is a really valuable research tool for the world was correct, and folks at the big shops like AOL, MSN, ASK, and Google all have quality data like this to work with. The rest of us didn't, but now we do. Thanks AOL guys. You took one on the chin for us and we are grateful!

Unfortunately the dontdelete tool doesn't seem to work anymore. Smells like someone loaded it into mysql and the database isn't running anymore or something. So I hacked up a quick little replacement for the purpose I was using it for: browsing a random user session. I based this on my joke code, so it would be fast. It's fast! 2-3ms to return a random session from the 577,663 available.
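
For the curious, the trick is just a precomputed offset index. Here's a minimal sketch -- hypothetical filenames, not the actual boredom.cgi code -- of how you can return a random record in a couple of milliseconds with no database at all:

    #!/usr/bin/perl
    # Sketch: an offline pass writes one packed 32-bit byte offset per
    # session into sessions.idx; each request picks a random slot,
    # seeks, and reads. No table scan, no mysql.
    use strict;
    use warnings;

    my $data_file  = "sessions.txt";   # one session record per line
    my $index_file = "sessions.idx";   # packed 'N' (32-bit) offsets

    open my $idx, '<:raw', $index_file or die "$index_file: $!";
    my $n = (-s $index_file) / 4;      # number of sessions indexed
    seek $idx, 4 * int(rand $n), 0;    # jump to a random index slot
    read $idx, my $packed, 4;
    my $offset = unpack 'N', $packed;

    open my $data, '<', $data_file or die "$data_file: $!";
    seek $data, $offset, 0;            # jump straight to the record
    print scalar <$data>;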

Why is this useful... When you get past the voyeuristic fun, I've found that it's actually really hard to think up representative random searches to try out search engines to see how they do. I've never been very good at this; someone sends me to a new search engine, and I type 'skrenta', and then I go blank. Mike typed 'britney spears' when I showed him AskX. The problem is that 'britney spears' has been hand-optimized at Yahoo, Google, MSN and ASK, because there are guys just like us working at all of those companies. It's supposedly a popular query category, it's obviously monetizable, and it's easy to license the AMG or Muze data and make them better. But I have this nagging suspicion that 'skrenta' and 'britney spears' aren't serving me very well to take effective soundings of a new engine's quality.

Hence my random search tool. Real users type such gonzo stuff into the search box. You can't make this stuff up, which is the point. I included fresh-window links to a basket of other SE's, so you can see how the query does on different engines.

My all-time favorite so far: [will anastasia hurt my pregnancy]

Easy for a human to correct! You know what she means ("anesthesia", i.e. what are the risks of pain meds during pregnancy, getting an epidural, etc.) But no search engine can do that phonetic correction yet based on the greater context of the sentence. Maybe Powerset is working on stuff like this.
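
Out of curiosity I checked: a plain phonetic hash already puts the two words in the same bucket -- it's picking the right candidate from the sentence context that's the hard part. A quick test with the CPAN module Text::Soundex:

    use strict;
    use warnings;
    use Text::Soundex;   # CPAN phonetic hashing module

    # Both words collapse to the same Soundex code, so a phonetic pass
    # could propose "anesthesia" -- but only the surrounding sentence
    # can pick it over Anastasia-the-name and every other A523 word.
    print soundex("anastasia"),  "\n";   # A523
    print soundex("anesthesia"), "\n";   # A523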

Give it a try here:

Skrenta's random AOL user search tool

March 7, 2007

Getting Stuff Done: Activation and Method

"Today I'm gonna show you how to drive a sports car. First, you need a lot of money!"

I'm a sucker for cheeseball motivational platitudes. I caught the bug a long time ago watching late nite TV when a commercial for Tom Vu came on. "I came to this country with nothing. Now look, I have all this!" He was standing in front of a rolls-royce parked in front of a mansion with a bunch of women in bikinis.

What amazed me about his commercial was that the appeals were general inducements to do something, anything. They weren't specific to attending his seminar. It was like 90% of the commercial was designed simply to make the viewer want to do something, and to raise the viewer's energy level. At the end it told you to go to his free seminar to find out more.

Unfortunately, in Tom Vu's case, the method was a scheme to acquire and flip distressed real estate. But that wasn't what interested me. Sure, there are a lot of schemes to try to make money. Going to a Tom Vu real estate seminar was a pretty sketchy way to go about that. But it is true that, for whatever you want to do, you have to want to get started before the method even becomes an issue.

Related: Black Hat SEO's Do it fucking now.

March 8, 2007

Blake's Blackberry Boredom tools

Blake told me he was using my aol random user query tool from his Blackberry while he was out somewhere waiting for his wife or something, since he found the queries interesting/amusing. But the page was too heavy to really work well on a blackberry, because of all the links in the table I had for trying the queries on various search engines.

So I've made two stripped-down anti-boredom tools for Blake. One is a lightweight version of the aol random user session tool. The other one is joke. Warning: since the jokes are from the 70's CMU tops-a joke file, there are many offensive ones -- ones that used to be offensive back then and have gotten even worse with the passage of time. Do not view these jokes if you can be offended by written material in any way.

http://www.skrenta.com/boredom.cgi
http://www.skrenta.com/joke.cgi

March 9, 2007

Freebase: one to watch

Holy smokes, this is cool. A new startup called Freebase, founded by computing gods, is taking on web search, with a Google Base-like database, but built with an open, ODP-like model.

A new company founded by a longtime technologist is setting out to create a vast public database intended to be read by computers rather than people, paving the way for a more automated Internet in which machines will routinely share information.

Mr. Hillis first described his idea for creating a knowledge web he called Aristotle in a paper in 2000. But he said he did not try to build the system until he had recruited two technical experts as co-founders. Robert Cook, an expert in parallel computing and database design, is Metaweb's executive vice president for product development. John Giannandrea, formerly chief technologist at Tellme Networks and chief technologist of the Web browser group at Netscape/AOL, is the company's chief technology officer.
    -- Start-Up Aims for Database to Automate Web Searching, by John Markoff.

Danny Hillis is a computing legend, having founded a company to produce the Connection Machine, one of the first massively parallel computers and a very slick piece of work. John Giannandrea ("jg") was the chief technologist of Netscape's browser group, and recently CTO of Tellme, which built a massive voice-recognizing telco application. He runs a tier-1 colocation business as a side hobby to his day jobs. Not just vision here but deep technical implementation experience.

And lest you be deceived by the academic aura:

Based in San Francisco, Metaweb Technologies, Inc. was spun out of Applied Minds, Inc. in July, 2005 to build a better infrastructure for the Web. Metaweb was founded by Danny Hillis and funded by Benchmark Capital, Millennium Technology Ventures, Omidyar Network and other prominent investors. It is led by battle-hardened alumni of Netscape, The Internet Archive, Alexa, Tellme, Intel and Broderbund.

How long has Danny been around? He's even in the joke file, with a koan about marvin minsky:

In the days when Sussman was a novice Minsky once came to him as he sat
hacking at the PDP-6. "What are you doing?", asked Minsky.
  "I am training a randomly wired neural net to play Tic-Tac-Toe."
  "Why is the net wired randomly?", asked Minsky?
  "I do not want it to have any preconceptions of how to play."
  Minsky shut his eyes,
  "Why do you close your eyes?", Sussman asked his teacher.
  "So that the room will be empty."
At that moment, Sussman was enlightened.
      -- Danny Hillis

These guys are the stuff, I would watch very closely. :-)

Nifty OJR unconference March 30

Robert Niles of USC's Online Journalism Review sent me a ping about the unconference they'll be hosting later this month:

Last month, I posted a note to OJR.org's discussion board inspired by your blog post on the WeMedia conference: http://www.ojr.org/ojr/discussion/55/

Since you were so forceful in your comments, I thought you might be interested in hearing about a conference that OJR is hosting later this month: http://www.ojr.org/ojr/conference/

Our theme is "An Introduction to Entrepreneurial Journalism Online." The event does not feature traditional panels, but is run in a discussion-driven Unconference/BloggerCon-type format. And we're not interested in academic, theoretical discussions led by people who have never produced a successful website. Our focus is practical, with people who are actually making independent online media work talking with those who want to do the same.

I hope that you might consider joining us, or perhaps spreading the word about the event.

Thank you,
Robert

Robert Niles
Editor, USC Annenberg Online Journalism Review
http://www.ojr.org
rniles@usc.edu

I've always been a huge fan of OJR, and this looks like a fun crowd to spend a Friday in LA with. I'll be there. :-)

March 12, 2007

Being a programmer in NY

If you are in Boston, Austin, Raleigh-Durham, Silicon Valley, or Seattle, as a programmer you have a lot of choices of where to work. In New York, the choices are investment banks, some hospitals, advertising agencies -- but not technology companies. There are very, very few technology companies in New York.

But New York is still the largest city in America, and there are an awful lot of programmers who are stuck in New York because their wife is going to medical school, or their family is there, or they just love the city, or they want to do improv theater and this is the best place to do it -- millions of reasons why a programmer might find themselves in New York. Every programmer wants to work at a product company because it is so much better than working as a slave in an investment bank. And there were none in New York.

We would go to parties, and we'd find geeks, and they'd say, "Do you know of any software product companies in New York where I can work?" And we would say, "Gee, no. I can't really think of any." This is what programmers would talk to each other about: how can I get out of the investment bank in New York? So part of our model was, "Let's create a fun place for us to work, since we are stuck in New York City. Create a software company specifically in New York City."
    -- Joel Spolsky, Founders at Work

I worked as a programmer in NJ in the early 90's, for a spinoff of AT&T. They were doing heavy software development on Unix. There were high paying jobs with big cash bonuses in NYC, but they were in trading firms and investment banks and stuff like that. It's really different being a product company, where the technology IP is front and center, vs. being part of the back office or operations staff of a firm that makes its way doing stuff other than product development.

When I came out to the sfbay to visit a friend I was blown away by all the logos we passed driving down the road. Company after company that I'd heard of, all lined up. I thought, this is great, even if one of these places goes out of business, I can get a job across the street. So I moved.

Founders at Work, as others have mentioned, is really a great read. Highly recommended if you're interested in reading war stories from the early days of a wide range of startups.

March 13, 2007

Kafka-esque!

I'm in the Wall Street Journal today, with a story about our purchase of Topix.com for $1M and the SEO issues related to moving the domain.

The story has caused a bit of blog buzz, given the quoted price for the domain and the open acknowledgement of the SEO concern for us. Predictably, the two responses are:

- Isn't that a lot of money to spend for a domain?
- Should you really be so dependent on SEO for traffic?

Back in 2003 when we were looking for a name, we came across Topix.net. The name 'topix' really fit what we were trying to do, and it was a heck of a lot better than the other names we'd come up with. It turned out we could buy the name from a South Korean squatter for $800. So we took it.

Of course I knew we were breaking one of the rules of domain names, which is never get anything besides the .com. But I thought that advice might be outmoded. In the early days of the Netscape browser, if you typed a word into the URL bar, the browser would automatically append ".com" onto it if it wasn't already a domain. But the browser doesn't do that anymore.

Since those early days, there has also been a flurry of alternate top level domains released: .tv, .info, .fm, all of the country domains, and so forth. Surely, the advice that you had to have a .com wasn't as relevant anymore?

Well, we got our answer when our very first press story came out. This was in March 2004 when we got a front page business section launch story in the Mercury News. They gave us sweet coverage since we were the only startup to come out of palo alto in months (this was just as the dot-com crash was beginning to thaw). Unfortunately, while the article clearly spelled "Topix.net", the caption under our photo -- the most visible part of the story after the headline -- called us Topix.com. Someone had transcribed the name and mistakenly changed the .net to .com, out of habit, I suppose.

Since that time we've built up quite a bit of usage, much of it return visitors who have bookmarked one of our pages, or become active in our local forums. But still, we continued to have issues where someone would assume a .com ending for the name. Mail got sent to the wrong address, links to us were wrong, stories incorrectly mentioned our URL.

Beyond that, as part of some frank self-evaluations we were doing around our site and how we could make it better, and the brand stronger, we ran some user surveys and focus groups. "What do you think of the name?" was one of the questions we asked. The news was good & bad; people actually really liked the name 'topix', but the '.net' was a serious turn-off. It confused users, it made the name seem technical rather than friendly, and it communicated to the world that "we didn't own our own name."

So our choice was to 1) live with it, 2) move to a completely new name, or 3) try to go buy the .com. We'd talked to the owners of Topix.com since day 1 of our existence. They were a successful Canadian business, they were actively using the name for their business, and didn't really need to sell. In essence, the negotiations to buy the domains, while recently completed, actually took over three years.

So I brought up the question with our board. This is going to be expensive, should we look into it? They were very supportive. Their take was, if we were going to invest in our brand, and in having a better connection with our users, as opposed to remaining a geek-tool or just getting SEO traffic, that we'd want to make sure the brand was top-tier.

While the cost seemed expensive, in the context of the dollars behind our partial acquisition and funding -- $64M -- it wasn't really that large. Furthermore, unlike other marketing spends which tend to be a quick shot of attention which dissipates, this would be an asset which we'd own forever. Names are critically important on the net, and if we were ever to hope for having a mass audience, it made sense to at least own our own name.

So we decided to fix this issue once and for all, and we got the name.

What about SEO?

Now to the second question... How dependent should we be on SEO?

Contrary to what Danny Sullivan says, we have never thought of ourselves as primarily a news search engine, but rather as having the mission of aggregating audience around localities. We have over 50 feeds of professional content available on our site (full text articles that we have the rights to display), including content from Reuters, the AP, and Tribune. Furthermore, an increasing fraction of our content and traffic is occurring in our local community forums. This is content 100% unique to Topix and is a very sticky service for us with our users.

But we do rely on SEO for what we think of as new user trials. Our goal is not to rely on this traffic, but rather to get as much adoption as possible. The fraction of "trials" that we convert to "return users" is our purest guide to how well the site is delivering value to our visitors, and the goal for all of our product initiatives is to increase this fraction.

To say that a content site should not rely on search engine traffic -- most of which comes from Google -- is naive. The web is 10 billion pages now, with a single point of entry. That's the way the web works. If you want to have a web business, you have to acknowledge this reality.

Sites such as Wikipedia, Answers.com, About.com and TripAdvisor receive massive amounts of traffic from search engines. I would think that 50% would be a low guess. About, Answers.com and TripAdvisor are big businesses, and they would be completely clobbered if users stopped being able to find them from Google. This is not unusual; it is the norm. Barry Diller talked about the importance of SEO to his sites in his keynote at a recent conference.

Sometimes retailers get hosed because the city decides to re-pave the street their business is on. The street is infrastructure. Like it or not, Google is infrastructure on the net now. They're the source of all the foot traffic. The three words in retail are "location, location, location." The three words online are "search engine optimization." It means the same thing.

The good news is that, as we sign more and more users up into our community system, Topix should become less reliant on external traffic. But it's never going to be the case that we're not going to want our content to be findable by someone looking for it, from the place everyone starts -- Google.

March 14, 2007

Don Dodge on YouTube...and "Linden's Curse"

Most of you know I was a VP at Napster back in 2000 when the RIAA was suing us. I learned a lot about "fair use", DMCA safe harbors, take down notice rules, and the enormous penalties for copyright infringement. These laws are tough and there is no wiggle room.

Forget all the rationalizations about how YouTube really helps promote the content and Google is providing a great service to help users find it. The courts will hear none of it. What matters is the law and the facts...and they don't look good for Google.

Google should have taken my advice and just done an exclusive advertising deal with YouTube. Google is all about advertising. They didn't need to acquire YouTube to accomplish their advertising objective.
    -- Don Dodge, "I told you so"

I actually went through this movie myself at AOL Music. My team was running Netscape Search when my boss said Steve Case didn't think search was interesting and if I wanted a career at Netscape I should lead this "Tiger Team" to deliver a subscription music service for AOL by xmas 2000. This was in June or something so there were only 6 months to build everything. My first question was "What about the rights to the music?" I was told, "Don't worry about the rights. We'll get the rights. Just build it."

So we built it, but it never shipped. No rights. We even built 2.0. It was great. It went out to an AOL user beta audience, but it never shipped either. We couldn't even get the rights to Warner's catalog, which post-AOLTW merger was part of our own company. I went to New York and LA as part of big negotiating teams and got to meet with music executives. Ron Grant led one of these teams; he was great, really impressive. But we never got the rights.

Alongside this internal corporate drama I'm reading in the press about Napster getting nuked from space, how it's so bad that the Hummer-Winblad VCs who touched it are going to be personally sued. Ye gods. Rich's take-away: music biz sucks! No fun there.

Don's words ring true.

See also what I'm going to start referring to as "Linden's Curse": YouTube is not Googly

March 15, 2007

Ries, Reeves and the USP

Miller has paid an enormous price for its countless line extensions over the years. Miller could have been the number one brand of beer in the U.S.

Miller Lite was the first light beer in the mind. But instead of giving its new light beer a powerful new brand name, Miller Brewing chose a terrible generic name, Lite.

To compound the error, the verbal confusion between 'Lite' and 'light' forced the company to rebrand its new light beer Miller Lite.

Who hands a bartender their order written on a napkin? Verbally, Lite and light are indistinguishable. Tragic.

There's another problem, too. When you saddle a beer with a diet word like light, you undermine its manliness. Miller made multiple mistakes all at once and it has cost them dearly.

    -- Laura Ries, Warning: Massive line extension can kill you

I've loved the Ries branding ideas since reading The 22 Immutable Laws of Branding several years ago. Al Ries invented product positioning, which goes hand-in-glove with the concept of the USP -- the Unique Selling Proposition -- that Rosser Reeves pushed. I've been reading the original Reeves "Reality in Advertising" from 1960 (sparked by a comment here), and his points about the USP and penetration in advertising totally agree with the Ries "own a word in the mind of the customer" rules on branding. Between Reeves and Ries you take away rules like: trying to push two or more features in a product's positioning will lead to disaster. You can have a successful product with a main feature you're communicating, but then add another benefit to your messaging, end up communicating neither credibly, and muddle your brand image in the process, losing share. USPs fell out of favor after Reeves, but the case studies they present all ring true to me. Time to bring back the USP.

March 16, 2007

Skrenta on Naming

Would Google have been as successful if it had been called BackRub?

Before 'Google', that's what Larry's project was called. And yes, that's Larry's hand.

Even in 1997, names were tough. I've been registering names since 1991 and it has always seemed hard to pick a good name for a new service. Larry wanted to register 'Googol.com' but it was taken, so he took the misspelling. This was fortunate, since 'Google' is friendlier, and the misspelling makes for a stronger brand.

Google was a great service, positioned ideally to focus on quality despite an initially smaller index, against existing dominant services which were being poorly tended by their owners. And the name was great!

But what if Larry had left the original name, and called it BackRub? BackRub, despite the geek reference to link analysis, is icky, suggesting an intimacy with the product that most consumers probably wouldn't want to have. Would the takeoff rate have been as high?

* * *

The most vivid 'unfortunate name' example I can think of is Yggdrasil Linux. Yggdrasil was the first Linux distribution on CD-ROM, put out by then-college student Adam Richter. This was back in the early 90's, when, if you wanted to run something Unix-like on commodity 386 boxes, your choices were UnixWare or BSDI for thousands of dollars. Yggdrasil was $30, and Adam was getting tons of mail with checks enclosed in his dorm room. A flurry of positive press followed his launch.

It was way-early for Linux, but to be the first distro -- to have the chance at being the official one, starting years before Red Hat and everyone else... Yowsa that would have been cool. But the name! Ouch. No one could remember it, speak it, or spell it. "Wtf?" "It's supposed to be the Norse tree of life... Unix source tree... Get it?"

It's not fair to beat up on Adam, I was around then and I wasn't starting a Linux distro, so he gets points for actually launching the thing. And his successors ("Slackware") didn't do much better with their names. But that's what makes it tragic. The choice of the name could have been all the difference between launching something that would grow into the next enterprise OS provider, or not getting any traction.

* * *

When we launched GnuHoo.com in 1998, we figured it would be a success if we got 1,000 editors to eventually sign up. We had 1,000 editor signups in the first three weeks, with little promotion. GnuHoo was a great name. It had the entire business concept in six letters -- open source yahoo. It rhymed. It was short. Our users loved it.

The only problems were the 'Gnu' and the 'Hoo'. First the cease & desist came from the Free Software Foundation. We changed it to NewHoo. That was a great name too. Then the cease & desist came from Yahoo. We were in the midst of selling the project to Netscape, and they didn't care about our fledgling brand, so that wasn't an issue, and once the sale was complete we renamed the project.

Which contributed greatly to a fade into obscurity. It was variously called "directory.mozilla.org", "dmoz.org" (after mozilla said they didn't want to have anything to do with us, they were just about browsers), "Netscape Open Directory", the "Open Directory Project", and the ODP or just DMOZ. We built the largest directory of the web, with 6M hand-edited sites, the thing was 4X bigger than Yahoo's directory, the data was used (and still is today) by all the major search engines. But we were like ADM. "We're in everything you eat!" But the weak and fuzzy naming really killed public awareness.

Another great thing about "NewHoo" was that it repositioned Yahoo by its very existence: "New Yahoo". In other words, Yahoo's directory is all rotted to hell and uses an old model, come over to this new, better way to do the same thing. Of course Yahoo had to kill the name. But if you can reposition a competitor with the very introduction of your name/tagline/USP, well that's the triple word score.

* * *

Skrenta's Name Rules

  • .com must be free. :-|
  • Don't base on common words, or combos made out of common words. 'Topix' actually fails this test, since 'topics' is too common. 'Yahoo' is a word, but a rare one. 'Excite' was too common a word to work as a strong brand. 'Live' is a terrible name because it is way too common.
  • Creative misspelling turns it into a stronger brand and trademark (Googol -> Google, Flicker -> Flickr).
  • Look out for phonetic->spelling ambiguities. Too many can be a red flag. If there are just a few, see if you can stake out all the misspellings yourself, so you can install redirects (a sketch of the wiring follows this list). Flickr needs Flicker to redirect the type-in traffic. For "NewHoo" we had domain squatters taking "nuhoo", "nuhu", "noohoo", etc. Not critical, but it will save you annoyance later.
  • Try to state the USP of your product in the name. Base the name on the benefit to the user, not its features.
  • Try to look for something with an emotional connection instead of just riffing on a mechanical description of the product. Hit the Wikipedia 'random page' button to get name ideas rather than the thesaurus.
  • Think about whether your name aspires to be a verb or a place. Google and Zillow are verbs, Myspace is a place. Make sure your name works for its goal.
  • If the name has good sonorous aspects, like alliteration, consonance, assonance, etc., that is a plus. They will help people remember your name, and like it better. StubHub, FogDog, YouTube.
And finally...if you can't think of a name, and you have the resources (say in a funded startup), I would absolutely spend the bucks and hire a good naming/branding firm like A Hundred Monkeys (disclosure: we've worked with the hundred monkeys at topix, they were great.) They're going to come up with a better name than the ones you've thought up. It will be worth it in the long run. Naming fees are cheap compared to the total investment in your product, to your engineering budget, to your marketing budget, and a good name can be a strong wind at your back. The name is the cornerstone of both your product and your marketing. Get a good one!
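
On the redirect point above: once you own the misspellings, the wiring is cheap. A minimal sketch in Apache config, using the NewHoo squatter domains as hypothetical examples (assuming you'd bought them back):

    <VirtualHost *:80>
        # Catch the phonetic misspellings and 301 them at the real brand.
        ServerName  nuhoo.com
        ServerAlias www.nuhoo.com nuhu.com noohoo.com
        Redirect permanent / http://www.newhoo.com/
    </VirtualHost>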

* * *

The Name Inspector has a great blog that focuses on tech startup naming. Would love to see him analyze the search engines that got big -- Lycos, InfoSeek, AltaVista, Excite, Yahoo, Google ... maybe even LookSmart. :-)

March 20, 2007

Ranting like a madman

So I go to lunch with a friend, and I'm giving him a dump of some of my current thinking in an area. I talk for like an hour and a half. He talks too, but I talk a lot, and really fast, but it's good stuff, and he seems convinced and is tracking it. But all the while, some small sub-part of my brain that is listening to myself speak is thinking to itself that I sound like a bloody madman. I ask the dude what he thinks of my rant-arc but he seems impressed by the material. So I go back to the office and tell Bob that story. Bob knows my rants. "So do I sound like a madman?" He thinks for a moment, and there's a pause, and some kind of fleeting expression, and then he responds, "Well, no..." :-)

March 21, 2007

'I am my own grandfather' and AT&T

Does anyone else see all the new AT&T advertising and get creeped out? It was bad enough that AT&T wasn't even AT&T anymore after the breakup. Everyone knew old ma bell, the phone company, the one from the old SNL skit. Ma bell used to make solid phones that never broke. Not the pieces of junk in stores with the AT&T logo on them now.

But then AT&T went away. It was sad, they were such a cool brand, with their deathstar logo, even with all the scar tissue like NCR it still commanded power and respect. They invented the transistor and Unix and stuff after all. Gone.

But now they're back from the dead, reanimated by some 1/7th offspring that ate some of its siblings, and now its parent. I need a better story to welcome this monkey's paw walking frankenstein brand corpse back into my house. Something better than the billboard on 101 that tells me they're "reinventing television". Huh? Didn't I used to pay some cable bill that said AT&T on it? That must be what they mean.

(I've stolen the idea of this headline and Mark Twain quote below from some other blog, but I can't find it now to quote/link. Sorry. The other parts of the brand rant are all mine though. :-)

I married a widow with a grown daughter. My father fell in love with my step-daughter and married her, thus becoming my son-in-law, and my step-daughter became my mother because she was my father's wife. My wife gave birth to a son, who was of course my father's brother-in-law, and also my uncle, for he was the brother of my step-mother. My father's wife became the mother of a son, who was, of course, my brother, and also my grandchild, for he was the son of my daughter. Accordingly, my wife was my grandmother because she was my mother's mother. I was my wife's husband and grandchild at the same time, and as the husband of a person's grandmother is his grandfather -- I AM MY OWN GRANDFATHER!
      -- Mark Twain.

March 26, 2007

How to beat Google, part 1

Our entire industry is scared witless by Google's dominance in search and advertising. Microsoft and Yahoo have been unsuccessful at staunching the bleeding of their search market share. VCs parrot the Google PR FUD machine that you need giant datacenters next to hydroelectric dams to compete. They spout nonsense about how startups should just use Alexa's crawl and put some ajax on top of it. Ye gods.

Grow a spine people! You have a giant growing market with just one dominant competitor, not even any real #2. You're going to do clean-tech energy saving software to shut off lightbulbs in high-rises instead? Pfft. Get a stick and try to knock G's crown off.

So here are my tips to get started. These are all about competing with Google's search engine. Of course G is big business now and does a lot of different things. Their advertising business is particularly strong, and exhibits some eBay-like network effects that substantially enhance its defensibility. Still, even if you're going to take that on too, you have to start with a strong base of search driven traffic.

  1. A conventional attack against Google's search product will fail. They are unassailable in their core domain. If you merely duplicate Google's search engine, you will have nothing. A copy of their product with your brand has no pull against the original product with their brand.

  2. Duplicating Google's engine is uninteresting anyway. The design and approach were begun a decade ago. You can do better now.

  3. You need both a great product and a strong new brand. Both are hard problems. The lack of either dooms the effort. "Strong new brand" specifically excludes "search.you.com". The branding and positioning are half the battle.

  4. You need to position your product to sub-segment the market and carve out a new niche. Or better, define an entirely new category. See Ries on how to launch a new brand into a market owned by a competitor. If it can be done in Ketchup or Shampoo, it can be done in search.

  5. Forget interface innovation. The editorial value of search is in the index, not the interface. That's why google's minimalist interface is so appealing. Interface features only get in the way.

  6. Forget about asking users to do anything besides typing two words into a box.

  7. Users do not click on clusters, or tags, or categories, or directory tabs, or pulldowns. Ever. Extra work from users is going the wrong way. You want to figure out how the user can do even less work.

  8. Your results need to be in a single column. UI successes like Google and blogging have shown that we don't want multiple columns. Distractions from the middle with junk on the sides corrupt your thinking and drive users away.

  9. Your product must look different than Google in some way that is deliberately incompatible with their UI, for two reasons. One, if you look the same as them, consumers can't tell how you're different, and then you won't pull any users over. Two, if your results are shown in the same form as Google's, they will simply copy whatever innovations you introduce. You need to do something they can't copy, not because they're not technically capable of doing so, but because of the constraints of their legacy interface on Google.com.

  10. Your core team will be 2-3 people, not 20. You cannot build something new and different with a big team. Big teams are only capable of duplicating existing technology. The sum of 20 sets of vision is mud.

  11. Search is more about systems software than algorithms or relevance tricks. That's why Google has all those OS programmers. You need a strong platform to win, you can't just cobble it together as you go like other big web apps.

  12. Do not fear Google's vast CapEx. You should wish maintenance of that monster on your worst enemies. Resource constraints are healthy for innovation. You're building something new and different anyway.

March 27, 2007

Adding people makes all software better

Tolles pointed out yesterday that, in spite of my apparent obsession with google's pure algorithmic approach to organizing the world's information, all of my personally successful projects have involved a strong social aspect to the software:

  • first micro virus - arguably social software :)
  • monster - first user-designable MUD
  • usenet newsreader - usenet was a huge early net.community
  • DMOZ - massive community success to build a web directory
  • Topix - huge uptake in the local community over the past year

Greg Sterling even sagely commented on my how-to-beat-google list that "A distributed editorial staff is in there somewhere."

It does seem that, no matter what you're trying to build, adding people into the mix seems to make the software better. Software is cold and shallow, people humanize it, and the public can provide wonderful extensibility and depth to a system that your cube-bound programming staff would never be able to match. Fred Wilson had the seminal post on this with his "All software should be social."

The only caveat I would add is that it's often much harder than it looks to scale the social architecture. You can get early usage takeoff quickly by throwing the doors on your system open and letting people in. But regulating quality, rejecting spam, and keeping out the various bad actors who inevitably show up once a system gains audience is essential for a service to grow beyond its initial early adopters and have a shot at a mass audience. It's pretty common for social services with early promise to crap out after they attract enough traffic to be worthy of spamming and to draw the trolls. If quality falls while the user base grows, the size of the community served becomes self-limiting.

This is why all of the successful social sites have back rooms full of reviewers, scanning every uploaded photo, reading every user flagged post, trying kill-list keyword searches against their own services to look for bad stuff. It's why Craig Newmark describes himself as a "customer support representative" at Craigslist.

People on the outside require more people on the inside. :-)

March 28, 2007

Re: How to beat Google

It's got to be a slow news day at ZDnet if apparent inconsistencies in my blog posts over the past 4 months are news. :-)

Credit to Donna for actually noticing and calling me on it though. But to try to clear things up a bit:

Many interpreted my winner-take-all post from January to mean that I thought startups shouldn't attempt to compete against Google. Not at all. In fact, I think startups are the only chance against Google. You need some kind of disruptive or market-changing innovation to succeed, since Google is just so damn good. For a variety of reasons, I think it's really hard for big companies to do that kind of work internally.

The comments on how to beat google have been really interesting. In particular this livesearch.alltheweb.com thing is kinda nifty. I guess it doesn't look like a Google killer to me, but I like it anyway.

Also the folks on Threadwatch reminded me about the raffle model. Iwon.com used this and acquired quite a bit of traffic quickly. Basically you run a lottery and give a million bucks or something to a random user of your search engine every day. Now Iwon was icky and no one in the search industry likes to think like a sweepstakes marketer, but the idea does have some pull.

Think about it this way. Google collects billions of dollars from advertisers. They fund a lot of non-search stuff over there. Where else could that money go? How about a loyalty rebate program back to users? Marketers have proposed various attention-buying schemes over the years. That's essentially what your Safeway or Harrah's card is. It's paying you money in return for your behavioral data, and permission to market to you. Loyalty and cash back programs are really popular and sticky. Now 5% cash back on my searches is bupkus, but if you pooled together everyone's and gave it all to one person every day...
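
Back of the envelope -- every number here is an assumption for illustration, not anyone's actual financials:

    use strict;
    use warnings;

    my $searches_per_day   = 200_000_000;   # assumed query volume
    my $revenue_per_search = 0.10;          # assumed ad revenue per query, $
    my $rebate_rate        = 0.05;          # the 5% "cash back"

    # Individually: 20 searches/day * $0.10 * 5% is a dime a day. Bupkus.
    # Pooled: 200M * $0.10 * 5% is $1M/day -- an Iwon-scale daily jackpot.
    my $pool = $searches_per_day * $revenue_per_search * $rebate_rate;
    printf "daily prize pool: \$%d\n", $pool;   # prints $1000000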

That's crazy talk, even for me. ;-)

March 29, 2007

Conservative coding

An expat investment banker in Brussels once told me that two non-native english speakers can often converse far more easily in English than a native and a non-native speaker. That's counter-intuitive, isn't it? Shouldn't the pair with the native speaker have an easier time?

It turns out that native speakers use a far broader footprint of the language, and reference all sorts of cultural idioms when they speak. And so the non-native speaker has no idea what they're talking about. But two non-native speakers are both using a smaller, common, conservative subset, so there are fewer misunderstandings.

* * *

Everything at topix is written in perl. That sometimes elicits the "What's up with that?" from techies. "Perl looks like line noise. Isn't your code hard to maintain?"

Well, as hard as anyone's I guess, but not because of the language.

We do crazy fun stuff in our system, like mmap'ing giant files with key-offset indices at the front, pulling out chunks of data, decompressing them, and thawing them into perl objects. We can do something like 6,000 of those a second on a regular box. We now have a scalable get/put service based on that running on a 500 node cluster. We do named entity disambiguation and all sorts of text analytics in perl. Performance isn't an issue, not from the language anyway. We worry about disk seeks and network latency and stuff like that. But not statement execution. There are a handful of functions that got written in C but it's pretty tiny.
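
For the curious, here's the shape of that read path -- a minimal sketch with an invented record layout (the real system is surely more involved), using File::Map, Compress::Zlib and Storable from CPAN:

    use strict;
    use warnings;
    use File::Map      qw(map_file);    # mmap the store read-only
    use Compress::Zlib qw(uncompress);
    use Storable       qw(thaw);

    # Invented layout: a 4-byte record count, then fixed 24-byte index
    # entries (16-byte key, 4-byte offset, 4-byte length), then the
    # compressed Storable blobs themselves.
    map_file my $map, 'store.db';
    my $count = unpack 'N', substr($map, 0, 4);

    sub get {
        my ($key) = @_;
        for my $i (0 .. $count - 1) {   # linear scan; binary search in real life
            my ($k, $off, $len) = unpack 'a16 N N', substr($map, 4 + 24 * $i, 24);
            $k =~ s/\0+\z//;            # strip key padding
            return thaw(uncompress(substr($map, $off, $len))) if $k eq $key;
        }
        return;                         # not found
    }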

"What about python and ruby?"

I think that anyone using perl, python or ruby is about 100X more productive than someone working in Java or C++. Within the three I don't really have strong opinions though.

If you choose to deliberately limit yourself to a subset of whatever language you're working in, code can pretty much come out looking the same in all three.

Trouble starts when you try to get fancy.

I often see gee-whiz programmers gleefully coding wonderful stuff that no one else can make heads or tails of. Certainly not the new junior engineer we just hired, who was a sharp coder in two other languages but just started learning perl a few weeks ago.

And the gee-whiz stuff doesn't buy much. You can trim out a few lines here or there, but often the complexity is more at the greater system level, and the performance has to do with the systems and algorithmic stuff. Obfuscating a few lines to leverage a language trick doesn't actually benefit the system, and it certainly doesn't benefit the other members of the team who might have to pick up that code later. Coding is social, it's not just a private dialogue between you and the machine.

I've known a lot of languages in my career. I've studied language design and written compilers. I see big productivity differences between classes of languages, but within the classes, not so many. But folks always seem to get religious about one vs. the other. Frankly, it's a red-flag. It signals idealism over pragmatism, a love of a particular toolset over a focus on the goals of the project.

Ulysses is great if you want phd english majors to study your work for years to figure out what it means. But put the five dollar words away when you're writing the install guide for your new blogging package. Coding is the same way. Put the fancy stuff away and code for the rest of us mortals.

March 31, 2007

The Architecture of Mailinator

Fascinating description of the architectural evolution of the Mailinator service, from the what-you-would-expect thing (sendmail connected to a web interface to mailboxes) to the current form, which includes:

  • Never touches disk - did away even with checkpointing!
  • Has its own simple smtp server to receive connections
  • Uses adaptive forgetting as a scaling tool
  • Deliberately manages smtp session length -- takes longer to accept mail when the server isn't busy, to slow spammers down, but goes fast when the server is loaded because it needs to (sketched below). wow. Points for the idea, 10x score for actually implementing it :)
  • Optimized for "survival" above all other criteria

Well worth reading the whole thing through on the mailinator blog; there is much wisdom here...
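
The adaptive session-length trick is easy to sketch, at least. A toy version, with the load measure and the delay cap as assumed knobs:

    use strict;
    use warnings;
    use Time::HiRes qw(sleep);

    my $MAX_DELAY = 10;   # assumed cap: seconds per reply when fully idle

    # $load runs 0.0 (idle) .. 1.0 (saturated). Idle server: stall each
    # SMTP reply to burn spammers' connection time. Busy server: answer
    # at full speed, because the capacity is needed for real work.
    sub reply_delay {
        my ($load) = @_;
        return $MAX_DELAY * (1 - $load);
    }

    # e.g. before sending each "250 OK":
    #   sleep reply_delay(current_load());   # current_load() left as a stub
    printf "load %.1f -> %4.1fs delay\n", $_, reply_delay($_) for 0, 0.5, 1.0;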

(via programming.reddit.com)

April 1, 2007

What do you do when your success ... sucks?

We took a hard look at ourselves at Topix last year. We had built up a strong local audience on the site, but a lot of it was SEO, and while users were clearly getting some value out of our product, we hadn't made something that people really cared about. As cool a technical trick as our aggregated geolocalized news pages were, they actually pretty much sucked.

Thus began a six-month self-examination of why, exactly, our product sucked, and what we could do to un-suckify it.

As CEO I immediately rejected suggestions to reinvent the whole site as a myspace or digg clone, or any of the other fads du jour. I don't believe that you can win by making a clone of something else. That violates one of my rules of branded web products, which is basically that there can be only one of everything.

That would also be throwing the baby out with the bathwater. We had drawn a hyperlocal audience of millions of visitors spread over thousands of local city pages on our site. No one else had ever achieved this. We knew these visitors had shown up because they wanted to connect in some way with their town, online. But we weren't delivering the goods. We were leaving them unsatisfied.

We had many assets to draw on -- aggregation and AI technology, our recently launched local forums, content agreements with the AP, Reuters, Tribune, and 50 other top news organizations. Plenty of funding and engineers and seed traffic. If we couldn't somehow use these assets to build a great site, well the board should scrap our butts out of there.

Brand Therapy

This was a painful process. We crammed the entire company into a room, but no rah-rah speech this time. Instead we treated ourselves to a brainstorming session about why the site was lame. It was not fun.

We did the full marketing playbook. Focus groups with the mirrored glass and video cameras. On-site surveys. Telephone surveys. Accosting people on Caltrain and doing A/B surveys with paper mock-ups (no kidding). Therapy sessions with brandologists.

Finally we started to get somewhere.

Two key insights had emerged. The first was that users arriving at our site had no idea who we were or what the site was about. "Who the fuck are you guys?" was the question our site needed to answer for visitors, according to the brandologists. In person, and even on our corporate blog, we apparently came across as passionate about what we were doing. But none of this showed through on the site itself. "News untouched by human hands" was what we were actually delivering, and it wasn't working.

The second problem was sort of a structural flaw with our news pages. They didn't conform to any standard web page metaphor. Let me explain what I mean by that.

Back in 1995, when the web was new, visitors to a new site would lean forward, squint at the page, and try to figure out how it worked. The Southwest Airlines page was a picture of a check-in booth at the airport. You had to click on the picture of the phone to get the phone list, and so on.

That metaphor didn't last. People don't lean forward and squint at web pages to figure out how they work anymore. They instantly recognize -- within 100 milliseconds -- which class of site a page belongs to -- search result, retail browse, blog, newspaper, spam site, message board, etc. And if they don't recognize what kind of page they're on, they generally give up and hit the back button.

Our news pages didn't conform to any standard metaphor. Some people thought they were search results. But they weren't; our pure news search was a separate section of the site. Some people thought we were a newspaper, with human editors. Some visitors thought we were a blog. But our news items didn't behave in very bloggy ways. Most people just didn't know who we were or what the page was trying to do. Further confusing matters was our front page, which really didn't have anything to do with the local news pages within the site. From the front we either looked like Google News or a national newspaper, depending on who you asked.

This all seems blindingly obvious in hindsight, but it was quite a bit to unravel. We were also left with the question -- ok, now we know what's wrong. But how do we fix it?

Reinventing Topix

So here is the plan we came up with.

  • Ride the winners on our existing site. The part of our site that was growing like a weed was the locally-oriented forums. We'd had over a million people post in these forums over the past year, and they now account for just under 50% of our traffic. Clearly this part of our site was working. Our new product would emphasize people over the machine.

  • Fix the local pages by making them work like community-edited blogs. Strictly obey the blog metaphor, with chronological posts, and all of the associated visual cues which tell you that you're on a blog, and not on, say, a google news search result.

  • We would run the show just like DMOZ, although borrowing some subsequent innovations from Wikipedia. This was a reliable model, we had done this before with 75,000 volunteers, but no one had done it for news yet. We needed to build an editorial system that could provide an umbrella quality filter around thousands of daily contributors.

    This would also close the quality gap we had between our mechanical aggregation of the news, and the judgment that humans can apply.

  • Anthropomorphize our existing technology into the roboblogger. This was a brilliant idea from one of our lead engineers. It simultaneously solves three problems: 1) Booting up a new city -- you need posting activity to draw the first editors, and the roboblogger would give us that. But he is shy and gets out of the way if humans show up and take over a page. 2) If the community editors go on vacation, the roboblogger can step back in and take over while they're gone. 3) People know when a robot is editing the page vs. a human. His profile icon is a picture of a little tin-can robot. His handle is 'roboblogger'. No more confusion. (A sketch of his posting logic follows this list.)

  • Kill the home page. It should be an "enter your ZIP code" box. Putting national news on this page created too much confusion with our main mission, which has always been local.

  • Streamline the experience. We'd joked that we have the old AOL audience on our site: when they email us feedback or bug reports, they still have the CAPS LOCK key held down. We have sheriffs, teachers, doctors, airline pilots, bankers, real estate agents, lots of regular folk. And not just clumped on the coasts, but spread pretty evenly across all the states. Most of our users are not bloggers, and they're not fans of some Silicon Valley Web 2.0 startup. They just want to talk to people in their town. We had to make the experience simple for them.
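
Here's the sketch of the roboblogger's yield-to-humans rule promised above -- hypothetical code, not our actual implementation, with made-up field names:

    sub roboblogger_should_post {
        my ($city) = @_;
        my $since_last_human = time() - $city->{last_human_post_time};
        # If a human editor has posted within the past week, stay out of
        # the way -- the page is theirs now.
        return 0 if $since_last_human < 7 * 24 * 3600;
        # Otherwise step in: a brand-new city, or the editors are on vacation.
        return 1;
    }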

The cool thing about this plan is that it leveraged a lot of the good stuff we had already done. The aggregation technology we had built would be redirected at assisting human editors, providing dashboards of candidate stories for them, and taking care of the boot-up and vacation problems. The new site would put people front-and-center, and people in our local forums had been driving all of our growth over the past year. And we were better positioned than anyone else to do this: we had millions of users and the seed content to boot it up.

And the potential success case looks very, very interesting. When we launched NewHoo (dmoz's original name) in 1998, we figured it would be pretty cool if we signed up 1,000 editors. We signed up 1,000 editors in the first three weeks, without any existing traffic or promotion. Ultimately 75,000 editors signed up to help. Topix is starting with a far broader base of seed traffic, and a pretty slick local news CMS for every city in the country.

We'll see! :-)

April 5, 2007

"Brandthropology"

A brand is a differentiator, a promise, a license to charge a premium. A brand is a mental shortcut that discourages rational thought, an infusing with the spirit of the maker, a naming that invites this essence to inhabit this body. A brand is a performance, a gathering, an inspiration. A brand is a semiotic enterprise of the firm, the companion spirit of the firm, a hologram of the firm. A brand is a contract, a relationship, a guarantee; an elastic covenant with loose rules of engagement; a non-zero-sum game; improvisational theater at best, guerrilla theater at worst. As perceived vessels of exploitation, brands provide the impetus for generics and voluntary simplicity, as well as targets for demonstrations of cultural nationalism. McDonaldization, Coca-Colonization, and Disneyfication are simultaneously courted and countered, imported and deported. The swooshstika becomes a badge of infamy, Ronald McDonald is toppled and graffitoed, and iPod adverts are morphed with images from the infamous Abu Ghraib prison to protest the war in i-Raq. The brand demands an antiphonal, overlapping call-and-response patterned singing among communicants. It requires collusion, collaboration, and the willing suspension of disbelief.

...

Imagine the brand as a Thai spirit house. A ubiquitous structure in residential and commercial neighborhoods, often mistaken by tourists as a bird house, this tiny building resembles a temple, and acts as a dwelling for spirits of the land and household, who are plied with offertory gifts by petitioners in search of favors or assuring pledges. The spirit house is often piled high with gifts of flowers, food and currency, left by suppliants in hope of intercession by the residents. As will be evident in the following pages, I view branding as the creation of household gods, the mythic charter of our consumer culture. The brand is also a habitat in which consumers can be induced to dwell. In that dwelling, consumers domesticate the space, transforming it, and themselves, to essence. The resulting glow emanating from the dwelling is the brand's aura.

    -- John F. Sherry, Jr., in Kellogg on Branding

I think this is great stuff, whatever it means. But if you're a product person and you find yourself in a marketing meeting, and some marketing dude starts throwing around the B-word, and it seems like a pretty wicked tool for them to wield since it basically can mean anything they want it to -- well, you'll know why.

April 6, 2007

Early adopter pilotfish: pornographers vs. SEOs

Pornographers are apocryphally given credit for leading tech early adoption. I wondered if this was actually true. We know about Beta vs. VHS and all that. But it turns out it goes all the way back to the daguerreotypes and early photography. Crazy:

In 1841, William Fox Talbot patented the calotype process, the first negative-positive process, making possible multiple copies [of photographs]. This invention permitted an almost limitless number of prints to be produced from a glass negative. Also, the reduction in exposure time made a true mass market for pornographic pictures possible. The technology was immediately employed to reproduce nude portraits. Paris soon became the centre of this trade. In 1848 only thirteen photography studios existed in Paris; by 1860, there were over 400. Most of them profited by selling illicit pornography to the masses who could now afford it. The pictures were also sold near train stations, by traveling salesmen and women in the streets who hid them under their dresses.
    -- wikipedia

But in search and traffic I think it's not pornographers, but the SEO industry that deserves the credit.

On one hand, the relevance issues introduced by spam pretty much define the entire environment that both search and social media exist within. It's not a matter of just bolting a spam filter onto your product once it's done; being crap-resistant, and able to greyscale-score content across the entire quality spectrum, needs to be core to your software.

Beyond that, however, SEOs often pay more detailed and critical attention to the web industry than most industry analysts, who simply eat press releases and comment on them. SEOs are continually probing for weakness in, and insight into, the evolving global online traffic market.

Some SEOs I follow:

SEO by the Sea has been methodically going through the patent filings from Google, Yahoo, Flickr, Ask, Technorati, etc., looking for insight into their ranking and anti-spam methodologies. Cool. :-)

SEOmoz has a nice how-to guide for getting links into WikiPedia.

Another gem from SEOmoz: "Every so often, one of our employees will roll into the office and announce, 'I'm going to get on Digg today.'" How do they do that? Sock puppets, Amazon's Mechanical Turk, or just plain old linkbait?

I find this stuff hilarious but also insightful. If you're designing social media systems, you should be keeping an eye on the $2B industry that sells links from your site to their clients.

:-)

April 10, 2007

Will the Internet kill universities, too?

Universities are extraordinary institutions. They are in fact, the last bastions of mediaevalism left in modern society outside, perhaps, the church. Like churches they attracted a certain type of person who did not share the values of the commercial world. The oldest universities date from the eleventh and twelfth centuries - hundreds of years before the invention of the printing press. In an age where books were scarce, communication was difficult and people who could read and write were almost as rare as the books, it made sense to centralise the acquisition and dissemination of knowledge. If you wanted to learn you headed towards where the books were and the people who could read them and that meant the great universities like Paris and Oxford. Poor communication, expensive reading materials and illiteracy were the foundation blocks for the universities. If today we have excellent communications, free online information and general literacy, we also have an environment in which the universities are struggling to maintain their position. That, of course, is not an accident.

...

This is where this student begins by recognising that university, like school, is also fairly phony in many ways. What saves university is generally the beauty of the subject as built by great minds. But if you just look at the professors and don't see past their narrow obsession with their pointless and largely unread (and unreadable) publications to the great invisible university of the mind, you will probably conclude it's as phony as anything else. Which it is.

    -- Mark Tarver, Why I am not a professor

Actually I think that universities are far more useful as a social accreditation filter than for academic enrichment. In other words, they prove you can do work you're told to, on time, and give you an opportunity to develop and prove your skills at getting along with peers and superiors. If you're going to study tech, or literature, or languages because you love the subjects, you're going to do that anyway. But the hoop-jumping is the real point of the test.

I knew a number of bright folks who, for whatever reason, couldn't quite get it together in school and left. Some just a credit or two shy of graduation. It all comes down to: "Do well."

April 12, 2007

Un-suckification week 1 report

So we made some rather massive changes to Topix last week. It's still very early, but ... how's it going?

Is Skrenta going to get fired or is the damn thing working? :-)

So far it looks GREAT. We've approved over 500 editors in the first week. Not all are active, and some of them signed up for non-local channels. But at this point we've got about 100 daily active local cities being edited.

You can see the list of editors here and the list of most recent local editor posting actions here.

Just after we launched this stuff I realized something which hadn't occurred to me before, about the DMOZ/wikipedia model and how we're trying to apply it to local news. There's an advantage to community-edited news which actually makes it a much easier problem to tackle than either a web directory or encyclopedia.

At DMOZ we signed up 75,000 editors, who ultimately created 400,000 categories and filled them with links. The problem was that, even with 400k categories, we hadn't even made a dent in the problem of organizing the web's information. 400k categories is less than 1% of what you need for that problem.

But local news is a finite domain. We have 32,500 local news channels. Once we approve an editor for a town, if they become active then that town's page is basically 'fixed'. We aim to sign up multiple editors, and of course the character and style of editing varies, but pretty much any human can do a better job than our roboblogging technology.

So if we signed up an equivalent number of editors to DMOZ, we'd have an average of a bit over two editors per locality (75,000 editors across 32,500 channels). It wouldn't work out that evenly; we'd have clumps with multiple editors in bigger cities, and some small towns would still only be roboblogged. But I'm guessing we'd have coverage over about 1/3rd of the US map, or 10,000 towns.

Another neat thing about this model that I hadn't thought of before is that any kind of commercial spam we might receive is going to get "washed away" in the daily flow of articles on the local pages. Unlike Wikipedia or DMOZ, where a spammed link can hide for years, nothing really "sticks" to Topix, since it's all new each day. This does mean we need continuously active editors though, or at least a steady re-supply of new editors if some of the old ones drop out.

Something about the redesign has also led to a big jump in our local forum activity. We set a new forum record yesterday, with 47k posts. That's a 25% increase from two weeks ago.

So early results look good. I'll post more detailed stats on our active coverage and posting volume in about a month.

April 13, 2007

Foo *

Some comments are too good to leave buried in the ... comments. :-)

Freebase: one to watch:

nyet! no! eggads! do you know why google is popular?! the world's data does not want to be structured.

there are only three ways to do this:

1. treat all bits as potentially noisy and use probabilistic methods to try to fish it out. see: google.

2. given the impossibility of structuring all the world's data into domain-specific schemas of value, semi structure your data into one humongous associative array. okay, now what?

3. ignore (2) and actually try to create a bazillion community authored and maintained schemas. the only problem is that schema design isn't much fun, and amateur schemas will break easily.

How to beat Google, part 1:

I've never understood how Google can insist that their infrastructure costs are actually an impediment to any startup. Sure, it costs a lot to serve 200M queries a day, but 200M queries usually come with a lot of money attached.

Early adopter pilotfish: pornographers vs. SEOs:

Will it be porn that finally bootstraps IPv6?: http://www.ipv6experiment.com/

Yahoo Singing News:

The site is now up at http://underground.yahoo.com/ . Changed your mind yet?

Speculative Fiction

it is okay to blame terry, he has been paid very well, by any standard in corporate america. including his stock grants, semel has certainly extracted hundreds of millions of dollars in compensation. for that much money it is fair to expect results. while he doesn't strike one as the type of person to grasp the viability of search, he has surrounded himself with advisors who certainly should have been able to make this assessment.

one can also turn some blame on jerry yang, who was instrumental in attracting terry. jerry should have known that such a technophobe would have problems dealing with the inevitable semi-technical issues that a yahoo ceo would have to grasp on some meaningful level. it was those senior yahoos who were engaged in the ceo search who incorrectly assumed that yahoo was simply "another media company", and their search was predicated on this. had they understood that this generalization was meaningless, they would have directed their search elsewhere.

but to be fair, the winds of the internet have shifted. no one cares about "integrated networks" like yahoo and aol anymore because they have failed to deliver more utility than the rest of the web. google is a rest-of-the-web company...its search and advertizing products leverage the entire web instead of trying to fight it. i'm not sure anyone at yahoo saw this coming.

p.s. i have a double-digit employee number at yahoo. that doesn't make me right, but some of what i cite is based on observation, not speculation.

htbg, notes

no this isn't part II yet, just some random thoughts I had this morning.

i'm on vacation this week so no polish, sorry. :-|

13. Both personalization and natural language approaches to search seem to mainly be about disambiguation. I've written a big disambiguation engine, one of the better commercial ones on the net. Disambiguation doesn't seem as interesting to me anymore.

Grouping terms and ranking compounds is more useful, IMO. Hence Ask having unfortunate results for stuff like lady diana car accident. Is this Edison yet?

Full blown question answering, apart from being something that nobody actually wants, is a matter of first structuring the web, and then converting English into some SQL-like stuff to run against it. If you could structure the web, though, you could skip the SQL business, because you'd already have 98% of the win.

Sentence tagging doesn't seem that interesting. Parts of speech are this Chomskyan red herring where a set of artificial categories has been imposed on English. So you have a tagger with an n% error rate mapping these basically useless categories to web text. If you actually were able to put together some kind of probabilistic parse map, you could predict completions like "I played fetch with my <x>". Classic taggers don't do much for typical queries either.
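
To make that concrete, here's a toy sketch (mine, purely illustrative): skip the part-of-speech categories entirely and just count observed continuations in a corpus.

    # Predict the <x> in "I played fetch with my <x>" by counting
    # continuations in text read on stdin -- no POS tags needed.
    my %next;
    while (my $line = <STDIN>) {
        while ($line =~ /\bfetch with my (\w+)/gi) {
            $next{ lc $1 }++;
        }
    }
    my ($best) = sort { $next{$b} <=> $next{$a} } keys %next;
    print "I played fetch with my $best\n" if defined $best;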

Check out the great Ask patent screenshots from seo by the sea. So which rule(s) do they violate? We don't always really want what we think we want. Or maybe I'm wrong, it is a cool looking mock. :)

April 15, 2007

Roboblogger's busy profile

Keith pointed out to me that the edit history on roboblogger's profile page has turned out to be unexpectedly useful for debugging. You can actually follow the little guy around the site as he autoposts from the news stream. It's basically the inverse of the list of human posts. :-)

Vacation too short. 9am panel tomorrow morning at Moscone. zoinks.

April 16, 2007

Web 2.0, year 3

Mike cracked me up with one of his points in his rollup of day 1 at today's web 2.0 conference.

Topix has come a long way in the past few years. We used to have the worst booth at these conferences - a shabby sign with a typo in it, and that's all. Now we have the full blown booth with street teams outside promoting the site. Pretty cool.

    -- marksonland

We really did have a typo in our first booth sign. I think that thing was proofed by like 5 different people, and everyone missed it. Our sales guy walked up when we had finally set up the booth at its first show and instantly pointed out the typo. We ended up getting some lame sticker thing to "fix" it from the printing company. We used that horrible booth for another year or so before finally replacing it.

Well our booth looks better now, plus we have a cool street team survey going on in front of Moscone:

More on flickr. :-)

April 17, 2007

Holy shit

Something about our relaunch has caused a 25% spike in our forum posting activity.

(click for a non-awfully-scaled image)

At first I thought our captcha had been broken and this had to be spam. But after rooting around we decided it wasn't spam; it was real activity. Then I looked at SEO but didn't see any change there. Our best bet now is that it's due to the redesign plus the new site being faster, which has led to more on-site activity.

An ODP editor is picking a bone with my "75k" number of dmoz editors. That's the number on the dmoz homepage. Yeah, of course not all of them are active; the site is 9 years old and the number never goes down. Plus they stopped letting anyone new in about 4 years ago. A much more interesting number is the number of daily edits. But in the end it's apples-to-oranges, since dmoz in theory benefits from the previous edits of now-inactive editors, whereas Topix will only benefit from sustained daily editing activity.

Thus it's more interesting for me to say that Topix had 804 editor posting events yesterday, a 24% increase from the 645 posting events it had the previous Monday. It's hard to predict success with two weeks of data, but of course we want to answer the question -- will Topix become big in its domain like dmoz and wikipedia, or will it moulder with no use like a Backfence? Well, at this point it looks like it will become very big. No one will realize it for 1-2 years while it grows though. :)

Our launch party came off quite well last night. We had about 200 people at the St. Regis and I saw a lot of old pals I hadn't connected with in a while. The street teams worked out better than I expected, and I got some good quotage from the morning panel. My talk tomorrow in front of the whole audience (wow, big room) should be fun.

Media Panel Art

I drew this picture during my panel yesterday.

Afterwards I showed it to some guys from Snap who were chatting with some Japanese VCs but they all looked at me like I was insane.

April 24, 2007

Please Stand By

April 27, 2007

Grouchy Rich

Over the past couple of months I did press tours in NY and SF with 25+ interviews, sat on four panels, delivered a high order bit at Web 2.0, threw our launch party for about 200 people, and dealt with some stuff in our org.

I've scored myself on tests like the 16PF and I've generally come out as 50/50 introvert-extrovert. Questions like "Are you energized by meeting lots of new people at a party, or are you drained by the experience?" are the sort that help score you as an introvert or extrovert. The stereotypical engineer is an introvert. Remote employees should be extroverts. This may seem counter-intuitive at first; if you're working by yourself at home all the time, isn't that a better job for an introvert? But no, it turns out you want someone who will cross the extra barriers (telephone, IM, email, plane flights) to reach out to the group and over-communicate.

It's a big part of my job to run around and talk to people, but this particular media tour kinda left me fried and I got strange and grouchy in parts toward the end. So I apologize if you ran into me and I was brusque.

April 30, 2007

Mass media was a temporary phenomenon

On a panel at Web 2.0 I made some comments about media fragmentation and advertising. Part of my comment got quoted, and then, echo-sphere style, some folks responded to the fragment instead of to what I had actually said. Forget moving to email-only interviews; I should only communicate via my blog -- the best way to preserve the message delivery... :) In any event, here is something closer to what I said on the panel.

In 1960 an advertiser could spend $5M a year and reach 160M television viewers. With the right message, sustained over a few years, 85% message penetration into the audience was achievable.

That world was dying before the Internet came along. When three TV channels exploded into 300, the audience spread out across the new terrain. Putting the audience back together became difficult and took more money.

This was happening in magazines as well. More and more titles, you can't hit everyone with the right ad in Reader's Digest anymore.

The audience isn't huddled every night in front of three TV channels anymore. And you can't reach them with $5M of 1960 dollars. The audience is divided across 300 channels, DVDs, TiVo, iTunes, YouTube, BitTorrent, Flickr, MMORPGs, and millions of other options. Saturation marketing costs something like $30+M for a few-week blitz to launch a new movie. What if you wanted to saturate like they could in 1960, and drill your damn jingle into every consumer's head until there was no way they couldn't hear it when they saw the box in the drugstore? Costs to launch a new top-tier brand from scratch start at $150M now.

So it's more expensive to reach the same audience of people, because they spread out into a zillion different places. But it isn't the net, per se, that did this. It was scarcity that caused people to huddle around the same few media outputs in the first place.

Printing presses are expensive, and it's expensive to move paper around. So the number of newspapers and magazines was originally limited, and we all read the same ones. Radio and TV spectrum are limited and licensed, and with few channels we used to all tune into the same ones.

But with more efficient printing, distribution, spectrum use, choices were already multiplying. And when the Internet showed up -- the ultimate mass many-to-many zero-incremental-cost media distribution network -- well there goes "mass" media. If you can manage to put back together a few tens of millions of users on the Net -- a tiny fraction of the 160M 1960 TV audience -- you have a huge web business.

But, while the turnkey mass media channel that let you annoy everyone in the country for only $5M a year is toast, we're not back to the pre-print age. Messages can still take off, but they have to be self-propagating in order to get voted up and linked and shared.

If the web follows what happened in magazines and television, audience domination by the biggest sites will eventually wane. Maybe we're all on Yahoo or Facebook or Youtube today, but in 10 years these sites will command a smaller fraction of the total audience, because there will be a steady proliferation of high quality, niche-targeted alternatives.

Linkbait (ahem, "Social Media Optimization") may be all we have left of mass media when this shift is over.

May 2, 2007

Digg's huge PR bonanza

A few days ago I wrote about how the disintegration of mass media had led to escalating costs to launch a new brand... $150M these days, up from single-digit millions in the 1960s. That's a huge run-up, even adjusting for inflation.

Blake presciently noted that getting sued could be one of the best ways to achieve tons of cheap PR:

So my question is (always mindful of the "All Press is Good Press" cliche): Are we getting to the point where a lawsuit becomes the most cost-effective way to boot-strap market yourself?

In the wake of Digg's massive PR jackpot, others have noticed this too. Andy Beal wrote:

This whole mess has created a lot of publicity for Digg. It has demonstrated how powerful it is and how influential the voice of its users.

Yeah. :-)

May 3, 2007

"My spoon is too big!"

I must be wrong in the head to like this so much. I couldn't stop laughing.

In particular the end apocalypse sequence (starting at 7:00 min) is amazing.

Update: I'm not wrong in the head! (not about this, anyway :)

A little digging for this post turned up the fact that the animator, Don Hertzfeldt, was nominated for an Academy Award for this short (which is titled "Rejected"). Apparently it's received over 27 awards, and is the #3 most popular short of all time according to IMDB.

May 4, 2007

yahoo.msn.com

My first reaction on seeing the Techmeme headline about msft-yhoo was "pretty please!" Although I would rather Google tried to do a giant acquisition deal and threw themselves into the tar pit.

But then I felt kind of sad, because msft and yahoo aren't really standing in the way of anyone else succeeding right now. In fact they're struggling hard to compete themselves, trying their best... and then this comes along. Anyone who has worked in a bigco knows what this nonsense does to productivity. Imagine every single one of your employees spending hours today talking about this. Well it's Friday at least. But they'll keep talking about it on Monday...

May 8, 2007

Markson: The Top 10 Reasons Why Newspapers Are Sinking Online

I was going to blog about the whole newspaper death spiral business in the WSJ, given that last year we built an entire system with the AP to map local stories back to their originating publications, in part to address concerns such as Hussman's. But Mike's beaten me to the punch, and it's a good thing since he's got a far more comprehensive take on the state of the news industry. He pretty much covers everything...and it doesn't look good.

Marksonland: The Top 10 Reasons Why Newspapers Are Sinking Online

May 9, 2007

Giving up on Microsoft?

Jeff Atwood giving up on Microsoft? Holy cow.

There is a huge gulf between Microsoft and Unix developers. I somehow missed walking down the Microsoft road, since I'd started on the Apple II (BASIC, 6502 assembly, Pascal) and never had an IBM PC way back. Then when I got to school it was Tops-20 and VAX/VMS and a little bit of Unix here and there. And by the time I got a PC, it wasn't to run Windows, but rather SCO XENIX on my 286.

I thought this was going to catch up with me around '93, since it looked like Windows was going to kill Unix dead. And then I'd have to start over and learn all this msft stuff. But no, the Internet came along, and suddenly I could code "client server" cross-platform GUIs with print statements. Thank f'ing god I thought.

And it turned out Unix seemed a whole lot better suited to server software, having been designed as a multiuser OS from the beginning. There were horror stories of startups paying 24/7 operators to sit watching banks of NT machines and rebooting them when they froze. And the initial failed attempt to migrate Hotmail off of unix when it was acquired by msft. Whereas we'd routinely get uptimes of hundreds of days on our unix servers. (Heck, the uptime for this machine is currently 158 days.)

At this point it doesn't seem to come up much anymore. As Jeff points out, there don't seem to be many web startups running on a microsoft platform. When they do crop up you know their tech isn't likely to be very strong. You see nonsense like Dipsie supposedly being "the next google" but then hear they're coding everything on microsoft and you don't have to pay any attention anymore since you know there's nothing there. There are the odd successful standouts like Fog Creek shipping actual PC apps, but they seem increasingly rare.

You can probably even avoid buying the usual raft of PC stuff on the business side now. It's thousands of dollars, and installation and maintenance are a pain. Raw Linux could be a bit much for a bizdev or marketing employee to use, but OSX + Google apps is probably a good enough replacement.

May 10, 2007

14 rules for fast web pages

Steve Souders of Yahoo's "Exceptional Performance Team" gave an insanely great presentation at Web 2.0 about optimizing website performance by focusing on front-end issues. Unfortunately I didn't get to see it in person, but the Web 2.0 talks have just been put up, and the ppt is fascinating -- an absolute must-read for anyone involved in web products.

His work has been serialized on the Yahoo user interface blog, and will also be published in an upcoming O'Reilly title (est publish date: Sep 07).

We have so much of this wrong at Topix right now that it makes me want to cry, but you can bet I've already emailed this ppt to my eng team. :) Even if you're pure mgmt or product marketing you need to be aware of these issues and how they directly affect user experience. We've seen a direct correlation between site speed and traffic.

This is a big presentation, with a lot of data in it (a whole book's worth, apparently), but halfway through he boils it down into 14 rules for faster front-end performance (I've sketched a few of them as server config after the list):

  1. Make fewer HTTP requests
  2. Use a CDN
  3. Add an Expires header
  4. Gzip components
  5. Put CSS at the top
  6. Move JS to the bottom
  7. Avoid CSS expressions
  8. Make JS and CSS external
  9. Reduce DNS lookups
  10. Minify JS
  11. Avoid redirects
  12. Remove duplicate scripts
  13. Turn off ETags
  14. Make AJAX cacheable and small
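
Several of these rules live in web server configuration. As a flavor of what they look like in practice, here's a minimal Apache sketch of rules 3, 4 and 13 -- my own illustration, not from the deck:

    # Rule 3: far-future Expires headers on static assets (mod_expires)
    <IfModule mod_expires.c>
        ExpiresActive On
        ExpiresByType image/png  "access plus 1 year"
        ExpiresByType text/css   "access plus 1 year"
        ExpiresByType application/x-javascript "access plus 1 year"
    </IfModule>

    # Rule 4: gzip text components (mod_deflate)
    <IfModule mod_deflate.c>
        AddOutputFilterByType DEFLATE text/html text/css application/x-javascript
    </IfModule>

    # Rule 13: turn off ETags, which defeat caching across a server farm
    # (unsetting the header needs mod_headers)
    FileETag None
    Header unset ETag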

The full talk has details on what all of these mean in practice. The final slide of the deck is a set of references and resources, which I've pulled out here for clickability:

book: http://www.oreilly.com/catalog/9780596514211/
examples: http://stevesouders.com/examples/
image maps: http://www.w3.org/TR/html401/struct/objects.html#h-13.6
CSS sprites: http://alistapart.com/articles/sprites
inline images: http://tools.ietf.org/html/rfc2397
jsmin: http://crockford.com/javascript/jsmin
dojo compressor: http://dojotoolkit.org/docs/shrinksafe
HTTP status codes: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
IBM Page Detailer: http://alphaworks.ibm.com/tech/pagedetailer
Fasterfox: http://fasterfox.mozdev.org/
LiveHTTPHeaders: http://livehttpheaders.mozdev.org/
Firebug: http://getfirebug.com/
YUIBlog: http://yuiblog.com/blog/2006/11/28/performance-research-part-1/
    http://yuiblog.com/blog/2007/01/04/performance-research-part-2/
    http://yuiblog.com/blog/2007/03/01/performance-research-part-3/
    http://yuiblog.com/blog/2007/04/11/performance-research-part-4/
YDN: http://developer.yahoo.net/blog/archives/2007/03/high_performanc.html
    http://developer.yahoo.net/blog/archives/2007/04/rule_1_make_few.html

Update: Yahoo has summarized these nicely on their developer blog.

May 13, 2007

Give Hammer a break

The collective response to Michael Arrington including MC Hammer on his TechCrunch 20 review panel is pretty lame, IMO.

Commenters should keep in mind that this is a real guy they're talking about. Have some friggin courtesy. I've actually met the guy at a party, so maybe it's easier for me to imagine him as a real human who reads the net too, and not just some TV celeb caricature. If you were introduced to him at CES or an industry party, would you say this stuff to his face? He's a nice guy, he's got a blog, and he's done a lot of other stuff since that 80's video.

Also, putting down someone who had a successful career in one area, and who is trying to reinvent themselves in a new role doesn't seem right to me. There are plenty of people who had careers in sports, music, movies, etc. and then go on to second careers in politics, wall street, real estate, etc. I think that's just great and should be encouraged.

But the worst conceit of the crowd's response is the assumption that this guy can't know anything about technology, and thus the idea of him doing a social network is silly. But the thing is -- there isn't really very much technology in social networks. You can build one of these puppies in a weekend, or have one built for you outsourced for $15-25k. It's a commodity at this point. Success is based on boot-up and network-effects. So maybe, just maybe, is it possible that someone with a successful media and promotion background, with lots of contacts in those areas, with a name everyone recognizes, might actually have a decent shot at promoting something? Versus an unknown 20-something rails programmer freshly minted with their geek degree, and $20k in "VC"?

I met a bunch of music industry folks while I was at AOL Music, and many of them were savvy businesspeople and highly entrepreneurial. One aging rock dude, long out of contract, had even taught himself to program and built a subscription-driven site for his hard core fans where he posted tracks, videos, did live chats, etc.

It's hard to escape your stereotype, I guess. Leonard Nimoy has done 20 things since Star Trek, but he's still got to do that hand thing whenever people approach him in public.

I think Hammer's a great choice to make the event a bit less valley insular. And, as Renee Blodgett recently suggested about valley events in general, to liven things up a bit.

I have no idea what Hammer is up to, or if it's credible or not. But sheesh, cut the guy some slack.

May 14, 2007

If you're so good...

Stockbroker: I can make you 10x on this stock in 6 months!

Punter: If you're as good as you say you are, you'd be making money for yourself, instead of pretending you can for me!

So when you see an SEO consultant quit the consulting, to focus full time on his own stuff... well, at least you know his former clients were getting good advice! :)

May 15, 2007

Scaling Facebook, Hi5 with memcached

From a discussion board thread pointed to by programming.reddit, a nifty discussion of how high-volume sites like Facebook and Hi5 are using memcached as a critical scaling tool:

From: Steve Grimm <... facebook.com>
Subject: Re: Largest production memcached install?

No clue if we're the largest installation, but Facebook has roughly 200 dedicated memcached servers in its production environment, plus a small number of others for development and so on. A few of those 200 are hot spares. They are all 16GB 4-core AMD64 boxes, just because that's where the price/performance sweet spot is for us right now (though it looks like 32GB boxes are getting more economical lately, so I suspect we'll roll out some of those this year.)

We have a home-built management and monitoring system that keeps track of all our servers, both memcached and other custom backend stuff. Some of our other backend services are written memcached-style with fully interchangeable instances; for such services, the monitoring system knows how to take a hot spare and swap it into place when a live server has a failure. When one of our memcached servers dies, a replacement is always up and running in under a minute.

All our services use a unified database-backed configuration scheme which has a Web front-end we use for manual operations like adding servers to handle increased load. Unfortunately that management and configuration system is highly tailored to our particular environment, but I expect you could accomplish something similar on the monitoring side using Nagios or another such app.

...

At peak times we see about 35-40% utilization (that's across all 4 CPUs.) But as you say, that number will vary dramatically depending on how you use it. The biggest single user of CPU time isn't actually memcached per se; it's interrupt handling for all the incoming packets.

 

From: Paul Lindner <... inuus.com>

Don't forget about latency. At Hi5 we cache entire user profiles that are composed of data from up to a dozen databases. Each page might need access to many profiles. Getting these from cache is about the only way you can achieve sub 500ms response times, even with the best DBs.

We're also using memcache as a write-back cache for transient data. Data is written to memcache, then queued to the DB where it's eventually written to long-term storage. The effect is dramatic -- heavy write spikes are greatly diminished and we get predictable response times.

That said there's situations that memcache didn't work for our requirements. Storing friend graph relations was one of them. That's taken care of by another in-memory proprietary system. At some point we might consider merging some of this functionality into memcached including:

  • Multicast listener/broadcaster protocols
  • fixed size data structure storage
    (perhaps done via pluggable hashing algorithms??)
  • Loading the entire contents of one server from another.
    (while processing ongoing multicast updates to get in sync)
I'd be interested in working with others who want to add these types of features to memcache.
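
Lindner's write-back pattern is worth sketching. Something like this, using the standard Cache::Memcached perl client -- a simplified illustration, where the queue is a stand-in for something durable and load_from_databases() is a hypothetical helper:

    use Cache::Memcached;

    my $memd = Cache::Memcached->new({ servers => ["127.0.0.1:11211"] });
    my @db_queue;    # stand-in for a real durable queue

    sub save_profile {
        my ($user_id, $profile) = @_;
        # Write to the cache first, so readers see fresh data immediately.
        $memd->set("profile:$user_id", $profile);
        # Queue the DB write; a background worker drains the queue, turning
        # write spikes into a steady, predictable stream.
        push @db_queue, [$user_id, $profile];
    }

    sub get_profile {
        my ($user_id) = @_;
        my $profile = $memd->get("profile:$user_id");
        return $profile if defined $profile;
        $profile = load_from_databases($user_id);    # cache miss: hit the DBs
        $memd->set("profile:$user_id", $profile);
        return $profile;
    }

    sub load_from_databases { my ($uid) = @_; return { id => $uid } }   # stub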

Greg Linden has commented on a talk about Livejournal's use of memcached for scaling. See also previous posts on scaling for ebay and mailinator.

May 30, 2007

'tie' considered harmful

Something has always left me uneasy about the 'tie' feature in perl, and I've been trying to reconcile it with my evolving view of programmer-system productivity.

To productively use a feature, like multi-process append to the same file, you have to understand the underlying performance and reliability behavior. Append is going to work great for 50 apache processes appending lines to a common log file without locking, but not for 2 processes appending 25k chunks to the same file, since they'll get corrupted. If you understand how unix's write-with-append semantics work you can get away with very fast updates to lots of little files without paying any locking penalties (twitter should probably have done something like this).
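
The safe-append case looks something like this -- a minimal sketch of the pattern the post describes, for small line-sized writes:

    use Fcntl qw(O_WRONLY O_APPEND O_CREAT);

    # Each apache process opens the log with O_APPEND; the kernel positions
    # every write at end-of-file, so small single-line appends don't need a
    # lock. Big 25k writes are where interleaved corruption bites you.
    sysopen(my $log, "access.log", O_WRONLY | O_APPEND | O_CREAT, 0644)
        or die "can't open log: $!";
    syswrite($log, "GET /index.html 200\n");    # one line, one atomic append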

Similarly, when you see %foo in perl, you instantly know the perf footprint. It's an in-memory hash, it's going to be fast, and you won't get into trouble unless you find a corner like making a zillion hashes-of-hashes and then discover that there's a 200-300 byte overhead for each one.

But tie destroys your knowledge of how the hash works. The perf characteristics become completely different. A simple-minded approach to building a search keyword index with a hash-of-lists, which might work acceptably well with in-memory hashes, suddenly becomes a disaster when you tie it to berkeley-db -- because you're not using an in-memory hash anymore, you're using a disguised call to berkeley-db.

I don't think the syntactic sugar win of the notational convenience trumps the potential confusion for those who will view the code later, or even the confusingly overloaded semantics for the original programmer. I'd rather just know that %foo is an in-memory perl hash, and if I'm going to stuff something in a berkeley-db it's going to be with an explicit API.
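
Concretely, the contrast looks like this -- an illustrative sketch with arbitrary file names, using DB_File for the tie style and the BerkeleyDB module for the explicit style:

    use DB_File;
    use BerkeleyDB;

    my %mem;                               # plain perl hash: in memory, fast
    tie my %disk, 'DB_File', 'cache.db';   # looks identical, isn't

    $mem{foo}  = 1;    # memory write
    $disk{foo} = 1;    # disguised database write to cache.db

    # The explicit style: the DB call is visible at the call site, so
    # nobody mistakes it for a fast in-memory hash.
    my $db = BerkeleyDB::Hash->new(-Filename => 'cache2.db',
                                   -Flags    => DB_CREATE)
        or die "can't open db: $BerkeleyDB::Error";
    $db->db_put("foo", 1);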

As an aside, when I say 'productive', I'm trying to envision the entire life of the code and the product. Not just getting it written and working, but the lifetime maintenance load of the code: will people in ops need to monkey with the system to keep it healthy, have pitfalls been left for new programmers inheriting the code, will it gracefully scale and degrade, and so on.

This is related to an evolving philosophy of programmer-system productivity that I've been developing, which I plan to write more about later.

Code is our enemy

Code is bad. It rots. It requires periodic maintenance. It has bugs that need to be found. New features mean old code has to be adapted.

The more code you have, the more places there are for bugs to hide. The longer checkouts or compiles take. The longer it takes a new employee to make sense of your system. If you have to refactor there's more stuff to move around.

Furthermore, more code often means less flexibility and functionality. This is counter-intuitive, but a lot of times a simple, elegant solution is faster and more general than the plodding mess of code produced by a programmer of lesser talent.

Code is produced by engineers. To make more code requires more engineers. Engineers have n^2 communication costs, and all that code they add to the system, while expanding its capability, also increases a whole basket of costs.

You should do whatever possible to increase the productivity of individual programmers in terms of the expressive power of the code they write. Less code to do the same thing (and possibly better). Fewer programmers to hire. Lower organizational communication costs.

The minimum description length principle (MDL) is often used in genetic programming to identify the most promising candidate programs from a population. The shorter solutions are often better; not just shorter, but actually faster and/or more general.

A few hours reading WTF should convince anyone that there are often vast differences in the amount of code different programmers will put into the same task. But it's not just wtf? code. Components like a page crawler can have very different solutions. Maybe you can re-implement a 10k line solution into a 1k line solution, by taking a different approach. And it turns out that the shorter crawler is actually more general and works in a lot more cases. I've seen this over and over again in code and I'm convinced that it's harder to write something short and robust than something big and brittle.

I've been looking for ways to get code out of the code. Is there something the code is doing that can be turned into an external dataset, and driven by a web UI, or some rule-list that I can contract out to someone on elance? Maybe a little rule-based language has to be written. I've seen this yield an unexpected productivity increase. It turns out that using the web tool to edit the rules in the little domain-specific language ends up being more productive than messing around in the raw code anyway. The time spent formalizing the subdomain language is more than paid back.

Code has three lifetime performance curves:

  • Code that is consistent over time. The MD5 function is just great and it always does what we want. We act like all code is like this but most of the interesting parts of the system really aren't.

  • Code that will get worse over time, or will inevitably cause a problem in the future.

    Humans will have to jump in at some point to deal. You know this when you write the code, if you stop to think. Appending lines to a logfile without bothering to implement rotation is like this. Having a database that you know will grow over time on a single disk that counts on someone to type 'df' every so often and eventually deal is like that too.

    RAID is kind of like this. It reduces disk reliability problems by some constant. But when a disk fails, RAID has to email someone and say it's going to lose data unless someone steps in and deals. In a growing service, RAID is going to generate m management events for n disks. As n grows, m grows. 10X the disk cluster, 10X the management events. Wonderful. Better to architect something that decays organically over time, without requiring pager-level immediate support or else catastrophically failing -- e.g. the datacenter in one of these shipping container prototypes.

  • Code that gets better over time.

    This is the frontier.

    Google's spelling corrector is like this. It works okay on a small crawl, but better on a big crawl.

    People in the system can be organized this way, working on a component (like a dataset or ruleset) that they steadily improve over time. They're external to the core programming team but they make the code better by improving it with data.

    I've been wondering if it's possible to generally insert learning components at certain points into the code to adaptively respond to failure cases, scenarios, etc. Why am I manually tuning this perf variable or setting this backoff strategy? Why are we manually doing A/B testing and putting the results back into CVS to run another test, when the whole loop could be wired up to the live site to run by itself and just adapt and/or improve over time? I need to bake this some more but I think it's promising.
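
    The self-running A/B loop, for instance, could be as simple as an epsilon-greedy choice wired into the page -- a hypothetical sketch, not anything we've built:

        # Keep serving whichever page variant is earning more clicks,
        # while still exploring the alternative 10% of the time.
        my %shown  = (A => 1, B => 1);    # start at 1 to avoid divide-by-zero
        my %clicks = (A => 0, B => 0);

        sub rate { my ($v) = @_; return $clicks{$v} / $shown{$v} }

        sub pick_variant {
            return rand() < 0.1 ? (rand() < 0.5 ? 'A' : 'B')
                                : (sort { rate($b) <=> rate($a) } keys %shown)[0];
        }

        sub record {
            my ($variant, $clicked) = @_;
            $shown{$variant}++;
            $clicks{$variant}++ if $clicked;
        }

        # Per request: pick_variant() to serve, record() when the click
        # (or non-click) comes back. No CVS round-trip required.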

June 21, 2007

Are network effects getting weaker?


"This place is dead anyway."

I was thinking about how fast Facebook has replaced LinkedIn for my valley connections. In a period of about two months, it seems like most of my contacts have deserted LinkedIn and moved to Facebook. They seemed to seed it through connections with their board, investors, etc., making a deliberate effort to invite lots of valley techies, journos, and the like -- and it worked.

Now the little remaining LinkedIn activity I have is mostly from outside the SF Bay Area.

The move was easier than I thought. Facebook seems a little more open, so it's easy to browse around and connect with people you know. And, rather than my store of data in LinkedIn holding me there, starting fresh with a blank slate was a great way to clean up my contact list.

Friendster, Linkedin, Orkut, Myspace, Facebook. Orkut? Yeah it was hot for a little while too. I had a bunch of contacts in there. Not to mention even earlier stuff, like AIM buddy lists, email lists, etc.

Myspace was the king just a few months ago. Apparently it's not hip anymore:

MySpace is a tired social network that may have a ton of traffic but it has peaked. It doesn't have mojo anymore.

Like AOL in 1999, it will take years before people realize it.

MySpace isn't worth 25% of the combined company.

Facebook, yes, MySpace no.

    -- Fred Wilson

So what's going on here?

In an environment where travel is free and instantaneous, you get flash mobs.

If a place is cool, or new, or interesting, you go there to check it out.

A place might be interesting simply because there are a lot of other people there at the moment. We're instinctively drawn towards crowds. "Why are all those people gathered over there? I'll check it out too."

A swarm of bees clumped on a particular tree branch doesn't mean that the branch is some magical bee-place with a lock-in for centuries.

Sure, maybe they stay there.

Or maybe the bees move to a new hive.

Another analogy would be a new restaurant that is the hip new place to go.

Maybe they have great food, or maybe it's just the scene, but a lot of times the bloom fades and the crowd moves on to a new place.

Customers had a billing relationship with AOL. Moving to a new ISP was a hassle. They left in the end anyway.

It's easier than ever to move from one service to another. Blog reader? No problem. Photo site? I have accounts on all of them anyway. Social networks? Yeah I'm signed up on all of them. I use the ones everyone else is using, at the moment. Just like we all do. The rest have a stub profile for me, but don't see much activity.

I started wondering if there was less lock-in than I thought on other services supposedly protected by strong network effects. Like eBay, for instance.

They've got all the buyers, and all the sellers. But what fraction of their transactions are "Buy it now" from their 700,000 merchants? Is there an 80/20 rule to those merchants? Could a core be drawn to a new service?

Ebay hasn't updated itself significantly, ever. I'd like to see a Facebook marketplace. The stronger identity around facebook profiles would be better than the anonymous "trust" ratings.

Hmmm.

June 27, 2007

Leaving Topix... (but in very good hands)

As has been announced elsewhere, I've stepped down as CEO of Topix, and my longtime friend and co-founder Chris Tolles has been promoted into the top spot.

Topix has been experiencing very strong traffic growth over the past 6 months. We launched a substantial set of improvements to the site in April, and traffic since then has soared. Topix's local forum activity continues to grow at double digit rates per month. Topix is a monster in local community and is growing like a weed.

So why the change, then...? Well, first let me say that this was a change that I personally initiated with our board. I have always seen myself as a product guy first and foremost. And some of the magic tricks I rely on tend to be based on technical architectural advantages, injecting attack products into market holes, and new product boot-up strategies. I have managed medium-sized groups before effectively, and I can certainly hold my own with PR, but at the end of the day if I find myself only doing those things I start to get a little grouchy. Also, the skills Topix needs now are not really based on innovating some new algorithm or launch magic. Rather, marketing, sales, and operations are key to Topix achieving its full potential.

So in looking at who would be best to take Topix to the next level, promoting Chris into this spot was the obvious choice. He has a depth of experience in sales and marketing, has managed successful community products before at Netscape/AOL, and provides great continuity of management with the company. Furthermore, he and I have actually traded positions previously. He's worked for me, and I and other folks on our team have worked for him before, several times ... at Sun, Netscape and AOL. So I know he can do the job, and the passion and experience that he'll bring to the role will be a huge asset for Topix.

Others agree with this assessment. Ben Smith wrote:

Congratulations to Mr. Tolles. Chris is going to do an incredible job making things happen. He is a marketing machine who knows product, can sell and has more passion to win than 90% of the executives in the valley.

This is the perfect opportunity to take what he helped create with Skrenta, Markson and others to the next level.

I will also say for the record that our board, with members from Tribune, Gannett, and Mayfield Fund, has been absolutely outstanding from the very beginning of our partnership with them, throughout this transition. Their advice, vision and support have been essential, and it's been a real pleasure to have had the opportunity to work with them.

I'll be continuing to serve on the board, and provide assistance wherever I can. As I still represent the third largest equity stake in Topix, after Tribune and Gannett, I have every interest in seeing Topix continue to grow and succeed.

Congratulations to Chris, and here's looking forward to Topix's inevitable victory in the local community space. :)

Also check out Mike's thoughts on the transition...

June 29, 2007

Palo Alto iPhone line pics



more...

June 30, 2007

MORE

The DVD box said "The best 6 minutes of film ever created."

I agree.

Mark Osborne's MORE took 9 months to create, but is only 6 minutes long. One reviewer compared it to a cross between Brazil and Citizen Kane. You can watch it on the web, but do yourself a favor and buy the DVD. It looks much better at a proper resolution and encoding. Plus there are extra commentary tracks and features which are pretty interesting. And you can feel good about supporting Mark; this is clearly a labor of love and he's working on new stuff too.

July 2, 2007

Macbook/Linksys wireless kernel hang solution

Shortly after getting my new macbook pro, I started to have issues with crashes where the kernel would hang and the display would freeze, requiring a power cycle/reboot. Calls to Apple were basically useless. After futzing around I was able to diagnose the problem: I could provoke it by scp'ing a few large files. Some troubleshooting revealed that the macbook would hang only when I was using wireless; if I plugged the ethernet cable in it was fine. I started mucking with my Linksys "wireless-N" home router's advanced wifi settings and found one that stopped the macbook from crashing:

I disabled "frame burst" (which was enabled by default) and the problems completely disappeared.

I think the Mac is just dandy, but still it makes me think. I've been running a variant of Unix for the past 20 years. SCO Xenix, sysv 3.2, SVR4, Unixware, Dynix, sunos, Solaris, BSDI, Linux...and now OSX. 20 years, and we still have flaky drivers. 20 years, the industry still can't write a friggin driver that doesn't completely waste your machine if it sees a funny packet it doesn't parse right.

Update: That didn't fix it. I had to disable 'N' entirely in the router. That fixed it. See the comments for more details...

July 10, 2007

Fletcher's angry list of startup rules

Mark Fletcher posted a great list of startup rules a little while back. Mark was the creator of Bloglines and Onelist (which, after merging with eGroups, sold to Yahoo and became Yahoo Groups).

I first met Mark in 1998 when a VC tried to marry NewHoo with Onelist, and told us that the combination "might be interesting". Pfft! NewHoo and Onelist both went on to successful exits, and Mark's an interesting guy, so I guess it all worked out in the end...

Mark's advice is spot-on for a lot of the web 2.0 companies being launched now. I like Mark's list because it's a little edgier than a lot of the smile-faced spin you see on VC blogs.

Note that Mark himself seems to start companies during down-cycles, and sell them when the market gets hot. Then he goes on vacation for a year or two and waits for the next down-cycle. :)

1. Your idea isn't new. Pick an idea; at least 50 other people have thought of it. Get over your stunning brilliance and realize that execution matters more.

2. Stealth startups suck. You're not working on the Manhattan Project, Einstein. Get something out as quickly as possible and promote the hell out of it.

3. If you don't have scaling problems, you're not growing fast enough.

4. If you're successful, people will try to take advantage of you. Hope that you're in that position, and hope that you're smart enough to not fall for it.

5. People will tell you they know more than you do. If that's really the case, you shouldn't be doing your startup.

6. Your competition will inflate their numbers. Take any startup traffic number and slash it in half. At least.

7. Perfection is the enemy of good enough. Leonardo could paint the Mona Lisa only once. You, Bob Ross, can push a bug release every 5 minutes because you were at least smart enough to do a web app.

8. The size of your startup is not a reflection of your manhood. More employees does not make you more of a man (or woman as the case may be).

9. You don't need business development people. If you're successful, companies will come to you. The deals will still be distractions and not worth doing, but at least you're not spending any effort trying to get them.

10. You have to be wrong in the head to start a company. But we have all the fun.

11. Starting a company will teach you what it's like to be a manic depressive. They, at least, can take medication.

12. Your startup isn't succeeding? You have two options: go home with your tail between your legs or do something about it. What's it going to be?

13. If you don't pay attention to your competition, they will turn out to be geniuses and will crush you. If you do pay attention to them, they will turn out to be idiots and you will have wasted your time. Which would you prefer?

14. Startups are not a democracy. Want a democracy? Go run for class president, Bueller.

15. You're doing a web app, right? This isn't the 1980s. Your crummy, half-assed web app will still be more successful than your competitor's most polished software application.

Update: Only tangentially relevant, but uncov is just too damn funny.

July 12, 2007

WTF happened to Popdex?

Popdex, along with Blogdex and the Daypop Top 40, was one of the first-generation meme trackers for the blogosphere, ranking top blog posts based on linking activity. The folks maintaining this set of first-generation tools seem to have collectively lost interest in them, after seeing them supplanted by social ranking tools such as Digg and Reddit.

But I was surprised to see Popdex become a wholesale spam farm. It's eerie: there are even "archives" supposedly going back in time, showing "results" from 2004-2006. The only thing is, these pages are just more spam. Popdex didn't actually use to look like that. Interesting to see the choices for the spam anchors on the sidebar and the post titles.

I wonder if the former owner of popdex just let their domain expire, or if they had a more active role in this.

August 5, 2007

The 11 startups actually crawling the web

The story goes that, one day back in the 1940s, a group of atomic scientists, including the famous Enrico Fermi, were sitting around talking, when the subject turned to extraterrestrial life. Fermi is supposed to have then asked, "So? Where is everybody?" What he meant was: If there are all these billions of planets in the universe that are capable of supporting life, and millions of intelligent species out there, then how come none has visited Earth? This has come to be known as The Fermi Paradox.

My buddy Greg Lindahl maintains a collection of historical documents on his personal website, and gets enough traffic each month that he worries about his colo bandwidth bill.

When he analyzed his web logs recently and tallied up the self-reporting robots, he was surprised at how few he actually found crawling his site, and mentioned the Fermi quote I've reproduced above. If there really are 100 search engine startups (via Charles Knight at Read/Write Web), shouldn't we be seeing more activity from them?

Here is the list of every crawler that fetched over 1000 pages over the past three months:

1612960 Yahoo! Slurp help.yahoo.com bigco
365308 msnbot search.msn.com/msnbot.htm bigco
148090 Googlebot www.google.com/bot.html bigco
140120 VoilaBot www.voila.com bigco
68829 Ask Jeeves/Teoma about.ask.com bigco
62005 psbot www.picsearch.com/bot.html startup
39193 BecomeBot www.become.com/site_owners.html shopping
30006 WebVac www.WebVac.org edu
29778 ShopWiki www.shopwiki.com/wiki/Help:Bot shopping
22124 noxtrumbot www.noxtrum.com bigco
20963 Twiceler www.cuill.com/twiceler/robot.html startup
17113 MJ12bot majestic12.co.uk/bot.php startup
15650 Gigabot www.gigablast.com/spider.html startup
10404 ia_archiver www.archive.org nonprofit
9337 Seekbot www.seekbot.net/bot.html startup
9152 genieBot www.genieknows.com startup
7246 FAST MetaWeb www.fastsearch.com enterprise
7243 worio bot worio.com edu
6868 CazoodleBot www.cazoodle.com startup
6608 ConveraCrawler www.authoritativeweb.com/crawl enterprise
6293 IRLbot irl.cs.tamu.edu/crawler edu
5487 Exabot www.exabot.com/go/robot bigco
4215 ilial www.ilial.com/crawler startup
3991 SBIder www.sitesell.com/sbider.html memetracker
3673 boitho-dcbot www.boitho.com/dcbot.html enterprise
3601 accelobot www.accelobot.com memetracker
2878 Accoona-AI-Agent www.accoona.com startup
2521 Factbot www.factbites.com startup
2054 heritrix i.stanford.edu edu
2003 Findexa www.findexa.no ?
1760 appie www.walhello.com startup?
1678 envolk www.envolk.com spammers
1464 ichiro help.goo.ne.jp/door/crawler.html bigco
1165 IDBot www.id-search.org/bot.html edu
1161 Sogou www.sogou.com/docs/help bigco
1029 Speedy Spider www.entireweb.com bigco

There are a couple of surprises here... One is how much more aggressively Yahoo is crawling than everyone else. (Maybe he should just ban Yahoo to cut his hosting fees :)

Another is how few startups are actually crawling... And the ones that are crawling aren't the ones getting buzz right now. In three months of data I didn't see a single visit from Zermelo, Powerset's crawler. I don't see Hakia in there at all, but they do have an index and actually refer a little traffic, which leads me to believe that they've licensed a crawl from someone else.

There hasn't been a lot of public information about Cuill since Matt Marshall's brief cryptic entry on them. But they're crawling fairly aggressively, and they've put up a public about us page detailing the impressive credentials of the founders, Tom Costello, Anna Patterson and Russell Power. Anna is the author of a widely-read intro paper on how to write a search engine from scratch.

...

The conventional wisdom is that there are all sorts of folks trying to take on Google and develop meaning-based search; France and Germany are supposedly both state-funding their own search efforts (heh). But if all these folks really are out crawling the web, more than 11 of them should be showing up in webserver logs. ;)

Update: Charles Knight posts a ton of quotes from alt search engine folks on their approaches to crawling. Pretty interesting.

August 6, 2007

What doesn't clog your algo makes it stronger...

Valleywag outed the startup day job of the guys who collectively edit the hilarious snark site uncov. The startup, Persai, was "hiding in plain site", since they have a blog and have been pretty open about the tech they're using and their daily gripes.
"Persai is a startup that seeks to apply advanced machine learning techniques to content and advertising. We are using Amazon's web services to build a scalable architecture that will learn from consumer interests over time and match them with content crawled from around the web. The idea behind Persai is that you will have an active agent crawling the web looking for content that is relevant to you and only you. Every link we recommend will be something you want to read. We are zigging to social news' zag where popularity trumps relevance to the individual."
    -- from news.ycombinator

Anyway, a few days ago Persai released a Nutch webcrawl-generated set of "118,254 feeds of pure greatness". Intertwingly begged to differ about the quality after running some stats on the feeds. This generated some interesting comments...one in particular jumped out at me:

But if you look at the list itself, two sites are grossly overrepresented, and they account for the majority of the 301s and Timeouts. [emphasis mine]

I got a sinking feeling as I read this. I had curl'd over the corpus already to eyeball it ...yeah that's a list of feeds all right... but hadn't tallied the domains...

$ sed -e 's/^http:..//' -e 's/\/.*$//' persai_feedcorpus | count | head
 35695   rss.topix.net
 14613   izynews.de
  2831   feeds.feedburner.com
  1869   p.moreover.com
  1314   www.livejournal.com
  1241   rss.groups.yahoo.com
  1191   www.discountwatcher.com
  1096   news.bbc.co.uk
  1072   www.alibaba.com
   882   xml.newsisfree.com

Nooooo... Of course.. Sigh.
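
(Btw, "count" in that pipeline is a little local tally helper, not a standard Unix tool. If you don't have one handy, a few lines of Python produce the same histogram -- a rough sketch of what the sed | count | head pipeline does:)

    from collections import Counter

    # strip the scheme, keep the hostname, tally, biggest first
    counts = Counter()
    for line in open('persai_feedcorpus'):
        host = line.strip().replace('http://', '', 1).split('/')[0]
        if host:
            counts[host] += 1
    for host, n in counts.most_common(10):
        print('%7d   %s' % (n, host))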

August 8, 2007

My top 10 beefs with the iPhone

Well, I'm gonna catch some heat for this, but here are my iphone beefs. My anecdotal experience based on talking to friends is that if you're coming from a treo the iphone is great; if you're coming from a blackberry, there are some rude shocks.

Serious power users I know carry both an iphone and a bbery. I'm not gonna do that right now; it defeats the point of the small form factor. Unfortunately there's not a clear winner here; neither one is better in every way. If I were to score them, the iphone gets a lot more total points, but has some serious gaps w/ the bbery.

  • Safari controls are often unresponsive while it's transferring a page. Can't scroll, can't side-scroll, can't expand or shrink, the stop button doesn't work, it ignores the back button. This happens during dns delays too.


    loading techcrunch, touch screen unresponsive, rendering lag

  • No synchronous gmail app. What's this pop nonsense, is this a joke?

  • The anti-keybounce or skeptical-touch software makes it lose keypresses that I think should be valid.

  • Very difficult to type while driving with one hand. Or thumb. Even looking up a number from the contact list and initiating a call can be tricky when it loses keypresses or gets them wrong because your thumb is hitting the screen at a funny angle.

  • Can't hear it ring. If the little holes are covered up you can't hear it at all. Like when it's in my pocket. Which is all the time.

  • Everyone I know with an iphone picked the classic phone ring, since of the bunch it's the most audible. Which still isn't great.

  • When I flip the screen sideways I wish the dpi would stay the same instead of expanding. I'm flipping the screen to get more sideways real estate. So every time I have to squish it back down.

  • Surprised they didn't do screen flip at a lower os level so all the apps got it. Even safari won't screen flip if the keyboard is popped up.

  • The accelerometers are funny, I sorta wish I could flip it with a button instead of twisting it around. I use it a lot reclining or lying down and then it gets the orientation wrong.

  • Touch keyboard is mostly useless. I can't type on this thing. The bbery was much better even with its mini-keyboard. At least it could guess correctly; the iphone makes "stupid" errors where the bbery predictive software would have gotten it right.

Despite the flaws the browser is good enough that I don't think I could go back. My biggest beef with the bbery browser was that it didn't do cookies, so it couldn't remember site logins. The way it ripped apart pages into a stream of text actually made them fit on the screen pretty well, then I could do a one-dimensional scroll to see everything, rather than the 3D scroll I have to do on the iphone to get page coverage (up/down, side/side, expand/shrink).

Waah! Google stole my idea!

"Google stole my idea"

if you stop crying you can have ice cream later

August 14, 2007

Byzantine Sequence Number Generation

The 645 clock was a huge box, 8 foot refrigerator size, containing a clock accurate to a microsecond. It hooked into the system as a "passive device," meaning that it looked like a bank of memory. Memory reads from a port with a clock on it returned the time in microseconds since 0000 GMT Jan 1, 1901. (52-bit register) The clock guaranteed that no two readings were the same. It had a real-time alarm register also. Inside there was a crystal in an oven, all kinds of ancient electronics.
    -- from a description of the Multics implementation on the GE-645

That's funny. It seems like serious overkill just to make unique timestamps, even for Multics. :)

Let's Paxos for lunch...


In the garage.

Update: why keith has those bandages on his knees.

August 16, 2007

We Worship MD5, the GOD of HASH

For some time I had been looking for a mutual exclusion algorithm that satisfied my complete list of desirable properties. I finally found one--the N!-bit algorithm described in this paper. The algorithm is wildly impractical, requiring N! bits of storage for N processors, but practicality was not one of my requirements. So, I decided to publish a compendium of everything I knew about the theory of mutual exclusion.

The 3-bit algorithm described in this paper came about because of a visit by Michael Rabin. He is an advocate of probabilistic algorithms, and he claimed that a probabilistic solution to the mutual exclusion problem would be better than a deterministic one. I believe that it was during his brief visit that we came up with a probabilistic algorithm requiring just three bits of storage per processor. Probabilistic algorithms don't appeal to me. (This is a question of aesthetics, not practicality.) So later, I figured out how to remove the probability and turn it into a deterministic algorithm.
    -- Lamport

3N vs. N! Some folks just aren't comfortable with probabilistic algorithms. Lamport here clearly knows what he is doing, but still has aesthetic problems with them.

In some people's minds, algorithms should be proveably correct at all times and for all inputs (as with defect-free programming and formal methods). Probabilistic algorithms give up this property. There is always a chance that the algorithm will produce a false result. But this chance can be made as small as desired. If the chance of the software failing is made smaller than the chance of the hardware failing (or of the user spontaneously combusting, or whatever), there's little to worry about.
    -- Bruce Schneier in Dr. Dobb's Journal

The common practical case I run into with coders is that they're unfamiliar with figuring out how big a hash they need to "not worry about" collisions. Here's the rule of thumb.

MD5 Quickie Tutorial

Suppose you're using something like MD5 (the GOD of HASH). MD5 takes any length string of input bytes and outputs 128 bits. The bits are consistently random, based on the input string. If you send the same string in twice, you'll get the exact same random 16 bytes coming out. But if you make even a tiny change to the input string -- even a single bit change -- you'll get a completely different output hash.

So when do you need to worry about collisions? The working rule-of-thumb here comes from the birthday paradox. Basically you can expect to see the first collision after hashing 2^(n/2) items, or 2^64 for MD5.

2^64 is a big number. If there are 100 billion urls on the web, and we MD5'd them all, would we see a collision? Well no, since 100,000,000,000 is way less than 2^64:

    18,446,744,073,709,551,616   2^64
               100,000,000,000  <2^37

(Another way of putting this: hashing a set of 2^k items into m-bit strings yields about 2^(2k-m) expected collisions. [1])
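
If you want to sanity-check the rule of thumb yourself, the birthday approximation is a couple of lines of Python (nothing MD5-specific here, just C(n,2) / 2^bits):

    def expected_collisions(items, hash_bits):
        # birthday approximation: C(items, 2) / 2**hash_bits
        return items * (items - 1) / 2.0 / 2**hash_bits

    print(expected_collisions(10**11, 128))  # ~1.5e-17 -- don't worry
    print(expected_collisions(2**64, 128))   # ~0.5 -- first-collision territory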

Other MD5 tips & tricks

  • Unique ID generation

    Say you want to create a set of fixed-sized IDs based on chunks of text -- urls, for example. Urls can be long, with 100+ bytes common. They're varying sizes too. But md5(url) is 16 bytes, consistently, and you're unlikely to ever have a collision, so it's safe to use the md5 as an ID for the URL.
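
    In Python, for instance, it's a one-liner with the standard hashlib module (url_id is just an illustrative name):

        import hashlib

        def url_id(url):
            # variable-length URL in, fixed 16-byte ID out
            return hashlib.md5(url.encode('utf-8')).digest()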

  • Checksums

    Don't trust your disk or your OS to properly detect errors for you. The CRC and protocol checksums they use are weak and bad data can get delivered.

    Instead, bring out an industrial strength checksum and protect your own data. MD5 your data before you stuff it onto the disk, check the MD5 when you read it.

        save_to_disk(data,md5(data))
        ...
        (data,md5) = read_from_disk()
        if (md5(data) != md5)
            read_error
    

    This kind of paranoia is healthy for code -- your module doesn't have to trust the teetering stack of plates if it's doing its own end-to-end consistency check.
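
    Here's one way the pseudocode above could look as runnable Python -- the 16-byte-digest-header file layout is just my assumption for illustration:

        import hashlib

        def save_to_disk(path, data):
            with open(path, 'wb') as f:
                f.write(hashlib.md5(data).digest() + data)

        def read_from_disk(path):
            with open(path, 'rb') as f:
                stored, data = f.read(16), f.read()
            if hashlib.md5(data).digest() != stored:
                raise IOError('md5 mismatch -- corrupted read')
            return data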

  • Password security

    Suppose you're writing a web app and you're going to have users login. They sign up with an account name and a password. How do you store the password?

    You could store the password in your database, "in the clear". But this should be avoided. If your site is hacked, someone could get a giant list of usernames and passwords.

    So instead, store md5(password) in the database. When a user tries to login, take the password they entered, md5 it, and then check it against what is in the database. The process can then forget the cleartext password they entered. If the site is hacked, no one can recover the list of passwords. Even employees are protected from casually seeing other people's passwords while debugging.

    If you don't store the password, how can you email it to someone who forgets it? You can't. Instead, invent a new random password, store its md5 in the database, and email the new password to the user.

    If a site can email you your original password, it's storing it in the clear in its database. Tsk, tsk.
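
    Sketched out in Python, the whole scheme is tiny (the dict stands in for your user table; see the salt note further down before copying this literally):

        import hashlib, secrets

        def store_password(db, user, password):
            db[user] = hashlib.md5(password.encode()).hexdigest()

        def check_password(db, user, password):
            return db.get(user) == hashlib.md5(password.encode()).hexdigest()

        def reset_password(db, user):
            new_password = secrets.token_urlsafe(8)  # invent a new random password
            store_password(db, user, new_password)
            return new_password                      # email this one to the user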

  • Hash table addressing

    There are whole chapters of textbooks devoted to the pitfalls and difficulties of writing hash addressing algorithms. Because most of these algorithms are weak, they require you to rejigger your hash table size to be relatively prime to your original hash table size when you expand it.

    Forget that nonsense. MD5 isn't a weak hash function and you don't need to worry about that stuff. MD5 your key and have your table size be a power of 2. As an engineer, your table sizes should be powers of 2 anyway. Leave the primes to the academics.
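
    For example (a sketch -- table_size is assumed to be a power of 2):

        import hashlib

        def slot(key, table_size):
            # take 64 bits of the md5 and mask; no prime-modulus games
            h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], 'big')
            return h & (table_size - 1)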

  • Random number generation

    The typical library RNG available isn't generally very good. For the same reason that you want your hashes to be randomly distributed, you want your random numbers to actually be random, and not to have some underlying mathematical structure showing through.

    Having random numbers that can't be guessed or predicted can be surprisingly useful. MD5 based sequence numbers were a solution for the TCP sequence number guessing attacks.

    I also recall some players of an old online game who broke the game's RNG and could predict the outcome of upcoming battles. The library RNG was known, and the entire seed state was 32 bits, which was easy to plow through to find the seed the game was using. Solution: a stronger RNG, with more internal state, that can't be predicted.

    Here is an md5-based RNG that I wrote some time ago.
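
    I won't reproduce it here, but the general idea is md5 run in counter mode over a wide seed -- a minimal sketch, not the linked implementation:

        import hashlib, os

        class Md5Rng:
            def __init__(self, seed=None):
                self.seed = seed or os.urandom(16)  # 128 bits of state, not 32
                self.counter = 0

            def next_block(self):
                # each call hashes seed+counter for 16 fresh, unguessable bytes
                self.counter += 1
                return hashlib.md5(self.seed + str(self.counter).encode()).digest()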

  • What if you need more than 16 bytes?

    You can use SHA1 or SHA256, which generate 160 and 256 bits of output, respectively. Or you can chain hashes together to get an arbitrary amount of output material:

        a = md5(s . '0')
        b = md5(s . '1')
    

    Because md5 is cryptographically secure, this is safe. You can make as many unique 16 byte hashes from an input string as you want.

        md5('Rich Skrenta')  = 15ddc636 023977a2 22c3423d a5e8fbee
        md5('Rich Skrenta0') = 4343e346 b4036f80 7015847d cf983010
        md5('Rich Skrenta1') = da79412d c52c47b4 fa7848e4 54f89614
    

  • I heard MD5 was broken and you should use SHA

    For cryptographic purposes, MD5 and SHA have both been broken such that a sophisticated attacker can create multiple documents that intentionally hash to the same value.

    But for practical uses like hash tables, decent RNGs, and unique ID generation, these algorithms maintain their full utility. The alternatives considered are often non-secure CRCs or hashes anyway, so a cryptographic hash weakness is not a concern.

    If you're concerned about some nefarious actor leaving data around designed to deliberately cause hash collisions in your algorithm, throw a secret salt onto the end or the beginning of the material that you're hashing:

            hash = md5(s . 'xyzzy')  [good point]
            hash = md5('xyzzy' . s)
    

  • Isn't MD5 overkill?

    Folks sometimes say MD5 is "overkill" for a lot of these applications. But it's good, cheap, strong, and it works. It's not going to cause you problems if you use it. You're not going to ever have to debug it or second guess it. If you have perf problems, and suspect MD5, and then go profile your code, it's not going to be MD5 that's causing your problems. You're going to find that it was something else.

    But if you feel you absolutely must leave the path and look for some faster hashes, check out Bob Jenkins' site. [Also see the Hsieh hash, it looks very good.]

  • How fast is MD5?

    About as fast as your disk or network transfer rate.

    Algorithm   Size (bits)   MB/s
    MD4         128           165.0
    MD5         128            98.8
    SHA-1       160            58.9

    These are 2004 numbers from the perl Digest implementation.
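
    They're easy to re-measure on your own box, though -- a quick-and-dirty timing loop:

        import hashlib, time

        data = b'x' * (64 * 1024 * 1024)  # 64 MB of input
        for name in ('md5', 'sha1', 'sha256'):
            t0 = time.time()
            hashlib.new(name).update(data)
            print(name, round(64 / (time.time() - t0)), 'MB/s')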

Be happy and love the MD5.

August 17, 2007

Crypto vs. the working coder

Working in security tends to make people jumpy and nervous. Most security coders don't understand any of the crypto internals of the tools they use, so they must rely on a handful of trusted experts like Schneier to tell them what's safe. Even so, the algorithms last only about 10-15 years before they're broken.

Kids poke holes in protocols that spent years peer-reviewing their way through the IETF. Implementations are about as secure as Swiss cheese, but it doesn't matter, since the commercial success of a security product has more to do with its channel marketing strategy than actual security. Rumors surface that some Chinese mathematicians have wrecked part of the functional toolkit we've used for the last decade in all of our products, and it's time to pack up the tents and move, again.

So a culture of nit-picking and paranoia surrounds crypto stuff. If you are using a security algorithm, so the thinking goes, it must be because there is a threat. And if there is a threat, the algorithm must be made perfectly secure.

That may be an appropriate way to think for security products. But it turns out that security techniques are often useful in general programming. MD5 is a great checksum, much better than CRCs. If you have 500 nodes in a cluster, each with some disks, yes I will guarantee you that read/write corruption can occur and get into your app. TCP packets do arrive corrupted, even though they're not supposed to.

Yes, Jenkins is faster. But it's only a 32-bit hash, whereas MD5 is 128. Yes, Whirlpool is more secure. But I don't need a 512 bit checksum. MD5 is a great compromise.

Salts and HMAC are great. But you know what? The reality is that 9 out of 10 websites store your password in the clear. It would be nice if we could get the run-of-the-mill programmer to at least understand how to hash a password before trying to scare them off with the more advanced stuff. Otherwise they're going to throw up their hands and say their app doesn't really need to be secure anyway.

You can't say MD5 without a geek chorus shouting "It's broken, you must not use it for anything." But regular programmers who don't understand the basic utility of these fat hash functions are missing out. The fog of confusion hanging over the security space doesn't benefit Joe coder, who could make practical use of these tools in general applications.

The message from security folks is that you shouldn't be using any of their algs for non-secure applications. If you use their stuff, you have to go all the way.

But that's bunk. The engineering tolerances for crypto security are way beyond what the typical application needs to get general-purpose utility out of these functions. MD5 is a great general-purpose hash. There is useful stuff in between the extremes of a crappy CRC and SHA-512.

So MD5 away to make your stateless GUIDs and be happy. :)

August 19, 2007

RSS reader shares for Skrentablog

Reader       Subs     %
Google        562    65%
Bloglines     178    21%
NewsGator      63     7%
Netvibes       44     5%
Fastladder     14     2%
Livedoor        6     0%

The data comes from fetches that look like this in the webserver logs:

GET /atom.xml   Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 512 subscribers)
GET /index.xml  Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 50 subscribers)
GET /atom.xml   Bloglines/3.1 (http://www.bloglines.com; 142 subscribers)
GET /index.xml  Bloglines/3.1 (http://www.bloglines.com; 36 subscribers)
GET /atom.xml   NewsGatorOnline/2.0 (http://www.newsgator.com; 56 subscribers)
GET /index.xml  NewsGatorOnline/2.0 (http://www.newsgator.com; 7 subscribers)
GET /atom.xml   Netvibes (http://www.netvibes.com/; 44 subscribers)
GET /atom.xml   Fastladder FeedFetcher/0.01 (http://fastladder.com/; 14 subscribers)
GET /atom.xml   livedoor FeedFetcher/0.01 (http://reader.livedoor.com/; 6 subscribers)
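
Tallying those is a one-regex job -- something like this sketch (the reader name would come off the user-agent field the same way):

    import re

    line = 'GET /atom.xml   Bloglines/3.1 (http://www.bloglines.com; 142 subscribers)'
    m = re.search(r'(\d+) subscribers', line)
    if m:
        print(int(m.group(1)))  # 142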

August 20, 2007

Some thoughts on Mahalo

I was surprised (along with many others) that Jason chose to launch a "human powered search engine" as his next venture. More so at the reported funding of $20M.

I'm a fan of Jason's antics and his promotional ability, but at first glance making this Spruce Goose fly looks like it would need David Copperfield plus a reduction in the universe's gravitational constant.

Is it really possible to do dmoz/about 2.0 and make a go of it?

Having founded the biggest human-powered search site on the web (600,000 pages), and more recently running a content startup with substantial SEO distribution, I have a few comments and suggestions for Mahalo.

To be fair, there have been some notable SEO successes. About.com is probably the biggest seo win ever, with a $410M sale to NY Times in 2005. About has been huge into SEO since they were known as The Mining Company. About guides got an SEO manual when they joined and were directed to author high-value seo content, as Mahalo is doing with its staff. About now has approx 3-6M pages indexed in Google.

dmoz wasn't seo driven itself but was a huge presence in the early seo industry. Because we gave the dmoz data away and so many other sites put it up, getting a link in dmoz meant that you instantly had thousands of links from across the web. Plus dmoz.org was PR10 for a while which was nice. You had to have a link in dmoz just to get to the "base" level of pagerank a normal website should have. Google had to adjust some of their algs because the pagerank warping effect of this was so huge.

But the most successful SEO site currently is Wikipedia. They get a full 2% of Google's outbound traffic. I don't expect that to last at the current level; Wikipedia is showing up in too many searches and it's gone over the line. But Google's quirky aesthetics are OK with Wikipedia being there, because it is on the non-commercial side of the fence and is hugely open.

At this point though I'm thinking SEO has gotta be dead as a startup business model. It was kind of unknown stuff in 2003 but now the cat's out of the bag. It seems like the last attempt of web 2.0 sites that aren't able to get social adoption is to start flooding the Google index with tag landing page spam or a crappy template page for every restaurant in the country.

We know this from experience: No one will ever go to Mahalo directly, just as no one ever went to About.com, dmoz, Tripadvisor, Nextag, IMDB or any other vertical or broad-but-shallow site. Google is where everyone starts and Mahalo's distribution strategy has to be SEO. Its traffic is going to live or die based on SEO skill and Google's continued favor.

If Mahalo doesn't get SEO traffic it's gonna have to morph into something else. In the past a site like Looksmart that had lots of editorial generated directory content could sell that to other portals. Those days are over though with content being commoditized so I doubt there is big licensing revenue in Mahalo's future. But Jason is smart and wily and I'm sure he'll keep twisting the cube until he finds a solution.

The other structural challenge with human-powered directories has always been maintenance. It's not just the labor to create the pages in the first place; you also have to revise them regularly to keep them up to date. Otherwise they rot. So a site with N links bears an ongoing cost to revisit and re-evaluate each one every M days. Wikipedia is more resilient against rot because it is substantially a historical/reference site. But the topical/commercial queries Mahalo is targeting will require periodic review, or the pages will start looking dated in a year or two. Links rot, spammers take them over, or they simply point to out-of-date resources. So you have to re-author all your pages every 3mo-2years, depending on how topical the subject is. We crawled dir.yahoo way back and 8% of their links were dead; some categories hadn't been visited by yahoo editors in years. This was the inspiration for dmoz, but even it succumbed to a similar fate, just on a bigger scale. :)

In the meantime here are some tactical comments for the Mahalo site itself:

  • Hyphens instead of underscores Jason! You too outside.in. C'mon guys, this is basic stuff.

  • Put the guide note under the <h1> and call it <h2>, it'll do better. Mahalo needs lots of guide notes. Without the contiguous block of text from the guide note, the links aren't enough to validate a landing. 250 words is ideal but anything is better than nothing.

  • <title> should match <h1> should match url. Don't forget to add <meta name="description">, this should match the <h2>

  • Not really seo but a general idea ... Reference pages in general are boring. Jason is the supreme master of linkbait... Could each mahalo page be turned into a controversy of its own? When someone biases a wikipedia page, it gets more attention and traffic, not less...

  • Marshall Simmonds was the SEO expert at TheMiningCo/About. He MADE that site. I bet he singlehandedly enabled 90% of the $410M of value.

    "$410 million for SEO? I'll bet they could hire marshall simmonds, About's director of search, for a fraction of that." [1]

    Marshall gave a talk at WebmasterWorld Pubcon 2004 where he laid out About's whole seo strategy that had made them so successful. The ppt was on the conference CD. Unfortunately I've lost mine but I'm sure you can track down the talk. You need to see that deck.

  • Minor, but if you are concerned with speed: 1) remove urchin, 2) 15% of mahalo's page markup is whitespace -- it may compress away, but stripping it before sending the page out is hygienic, 3) don't forget the 14 rules.

YMMV. Good luck.

August 23, 2007

But Craigslist actually *is* a den of sin, Mike

Just look! (warning: NSFW. In fact, not safe for home either really.)

Attn: all hobbyists and escorts - AJC front page - m4w - 29
Date: 2007-08-22, 2:51PM EDT

Looks like a few bad apples are gonna ruin the ATL scene again. Today's Atlanta-Journal-Constitution has a front page article in which Mayor Franklin blames Craigslist for promoting child sex, and the vice squad discloses how it has been conducting stings on this list. To the ladies, thank you all for your lovely services, and pls be on guard, and to my follow hobbyists, lets continue to flag the fakes, and also be on guard. Pls don't let a few bad apples ruin a good thing. Have fun, and as always, play safe.

The BKeeper.

Techdirt and its commenters mocked the Atlanta city mayor over her accusation that Craigslist is promoting child prostitution. I dunno, it's pretty clear what's going on in the "erotic services" section on Craigslist. That's a huge part of their traffic too. It's not all apartment-finding and mattress-selling over there, you know. :)

I found this quote from the AJC article interesting:

Company founder Craig Newmark, who also was mailed Franklin's letter, no longer is involved in the company's daily affairs and is traveling, Best said.

Craig's not part of Craigslist anymore?

August 26, 2007

Rotten Tomatoes / RT / Redux

Ironically, Rich Skrenta from Topix (formerly the founder of DMOZ/ODP) owns the domain name "rt.com" which I pursued unsuccessfully for many many years (it's not like he needs the money anymore). You wouldn't believe how many people can't spell "tomatoes". Despite not getting the domain name, we struck up a pretty good friendship and he provided me with some very valuable words of advice when we were in the middle of being acquired. His experience being acquired by Netscape, which was quickly thereafter sucked up by AOL/Time-Warner, is similar to my experience being acquired by IGN and then sucked up by Fox Interactive Media. Without his words of advice (make sure that you have an escape hatch in case there's a change of ownership), I'd probably be very unhappy right now?
   -- Stephen Wang, in Startup Review

Fyi I found that para - honest - not because I was googling skrenta, it was for backlinks to rt.com. Heh. Missed it when it first came out. Btw Stephen doesn't credit Chris Tolles there but IIRC Tolles did a lot of the talking so maybe Stephen owes him a beer too. :)

But jeezus what a great seo post Stephen wrote, go read the whole thing. SEO was a big part of Rotten Tomatoes as you can imagine and it worked out great for them. I had originally met Stephen because he wanted to buy my domain name, which I wasn't really interested in selling. But I became fascinated with his startup and his personal tenacity as an entrepreneur. This was no quick flip for them, it took them years to build. These guys loved movies and slaved night and day on rotten tomatoes all through the dot com bust. It was a walk through the desert for them but eventually it paid off with a great exit to IGN.

I'll say also that they built a great site and I still use it to check out the read on a movie if I'm not up on the openings or want to delve in.

Stephen's got a new project now:

Four of us formerly from Rotten Tomatoes (including Patrick Lee and myself) have gathered together in Hong Kong and have recently launched a new online community of artists (filmmaker, musicians and more) initially targeting Asia (http://www.alivenotdead.com).

August 27, 2007

Pass the hat for Greg Stein

Kevin Burton emailed me to let me know that he was trying to do something nice for Greg Stein, the director of the Apache foundation, who was mugged and seriously injured in front of his house in Mountain View.

Details here.

Seems like a nice thing to do. Let's see...

apache == cool
beer (a micro-hefeweizen by the looks) == cool
greg stein == cool

So git yer wallets out you apathetic webwags and toss some bills into the hat for Greg! How much did you ever pay to use Apache? Ok well there's a good rationalization for you. Time to make a bit of it up. Thank god we're not paying $1295 for Netscape Enterprise Server. :-)

August 28, 2007

Counting stuff is really hard

I've never worked anywhere where the logs could be tallied well. Netscape, AOL, they had giant systems that slurped up the logs from the front ends and stuffed them into web-enabled databases. Every query took 90 seconds to run, half of them timed out. Forget ad-hoc queries or tossing a custom regex in. Sometimes the logs would break and it'd be weeks or months or never before they worked again.

Sometimes there was just too much traffic to be able to count it all. More log events came in every 24 hours than could be processed in a 24 hour log run.

Google Analytics doesn't seem to fare much better. Granted, we probably put more data into it at Topix than the average site does. But I could never get unique IP counts out of that thing. It would just spin and spin until my browser gave up.

I've repeatedly seen senior engineers fail to make headway on the log problem. Logs should be easy, right? What could be more straightforward than collecting a set of files each day and tallying the lines?

It turns out that anything involving lots of data spread over a cluster of machines is hard. Correct that: even little bits of data spread over a cluster are hard. i=n++ in a distributed environment is a PhD thesis.

We take the simplicity of i=n++ or counting lines for granted. It all begins with a single CPU and we know that model. In fact, we know that model so deeply that we think in it, in the same way that language shapes what we can think about. The von Neumann architecture defines our perception of what is easy and what is hard.

But it doesn't map at all to distributed systems.

The approach of the industry has been to try to impose von Neumann semantics on the distributed system. Recently some have started to question whether that's the right approach.

The underlying assumption ... is that any system that is scalable, fault-tolerant, and upgradable is composed of N nodes, where N>1.

The problem with current data storage systems, with rare exception, is that they are all "one box native" applications, i.e. from a world where N=1. From Berkeley DB to MySQL, they were all designed initially to sit on one box. Even after several years of dealing with MegaData you still see painful stories like what the YouTube guys went through as they scaled up. All of this stems from an N=1 mentality.
    -- Joe Gregorio

Distributed systems upend our intuition of what should be hard and what should be easy. So we try to devise protocols and systems to carry forward what was easy in our N=1 single CPU world.

But these algorithms are seriously messed up. "Let's Paxos for lunch" is a joke because Paxos is such a ridiculously complicated protocol. Yes I understand its inner beauty and all that but c'mon. Sometimes you get the feeling the universe is on your side when you use a technique. Like exponential backoff. You've been using that since you were a kid learning about social interactions and how to manage frustration. It feels right. But if you come to a point in your design where something like Paxos needs to be brought out, maybe the universe is telling you that you're doing it wrong.

It may be a bit unusual, but my way of thinking of "distributed systems" was the 30+ year (and still continuing) effort to make many systems look like one. Distributed transactions, quorum algorithms, RPC, synchronous request-response, tightly-coupled schema, and similar efforts all try to mask the existence of independence from the application developer and from the user. In other words, make it look to the application like many systems are one system. While I have invested a significant portion of my career working in this effort, I have repented and believe that we are evolving away from this approach.
    -- Pat Helland

This stuff isn't just for egghead protocol designers and comp sci academics. Basically any project that is sooner or later going to run on more than a single box encounters these problems. Your coders have modules to finish. But they have no tools in their arsenal to deal with this stuff. The SQL or POSIX APIs leave programmers woefully unprepared for even a trivial foray outside of N=1.

Humility in the face of complexity makes programmers better. Logs sucker-punch good programmers because their assumptions about what should be hard and what should be easy are upended by N>1. Once you get two machines in the mix, if your requirements include reliability, consistency, fault-tolerance, and high performance, you are at the bleeding edge of distributed systems research.

This is not what we want to be worrying about. We're making huge social media systems to change the world. Head-spinning semantic analysis algorithms. Creepy targeted monetization networks. The future is indeed bright. But we take for granted the implicit requirements that the application will be able to scale, that it will stay up, that it will work.

So why does Technorati go down so much... why is Twitter having problems scaling... why did Friendster lose? All those places benefited from top-notch programmers and lots of resources. How can it be, we ask, that the top software designers in the world, with potentially millions of dollars personally at stake, create systems that let everyone down?

Of course programmers make systems that don't satisfy all of the (implicit) requirements. Nobody knows how to yet. We're still figuring this stuff out. There are no off-the-shelf toolkits.

Without a standardized approach or toolset, programmers do what they can and get the job done anyway. So you have cron jobs ssh'ing files around, ad-hoc DB replication schemes, de-normalized data sharded across god-knows-where. And the maintenance load for the cluster starts to increase...

"We're fine here," some readers will say. "We have a great system to count our logs." But below the visible surface of FAIL is the hidden realm of productivity opportunity cost. Getting the application to work, to scale, to be resilient to failures is just the start. Making it a joy to program is the differentiator.

* * *

There is a place where they can count their logs. They had to make this funny distributed hash-of-hashes data structure. It's got some unusual features for a database - explicit application management of disk seeks, a notion of time-based versioning, and a severely limited transactional model. It relies on an odd cast of supporting software. Paxos is even in there under the hood. That wasn't enough, so they hired one of the original guys who invented Unix a million years ago, and the first thing he did was to invent an entirely new programming language to use it.

But now they can count their logs.

:-)

Update:

Kevin Burton: Distributed System Design is Difficult. We're seeing distributed systems effects even on single machines now, thanks to multiple cores.

Moon


"Frig! The moon looks like the sun..."


"Gaak!"

That squiggle is actually made of moon-light though. That's kinda neat...


"Moon good, rest dark. Maybe I can photoshop it..."

Feh.

Hawk or Friedl could do much better with this view. We'll see if that book Friedl recommended helps... stay tuned. :-|

August 29, 2007

Spooky "Elk Cloner" movie

An art school student has made a spooky CGI movie "in honor of" Elk Cloner (that first virus thing that I seem to be associated with...)

He's got a ton of details about how he did the animation, even scans of his original hand-written notes.

He's submitted the movie to 21 film festivals. I hope it wins some awards. Go elk cloner go!

August 30, 2007

Popdex Revisited

So I blogged about how Popdex had been taken over by spammers.

The original author of Popdex commented about how he sold the project, and it had taken this unfortunate turn.

Now, when you search for "popdex", instead of Popdex.com sitting at #1 as before, the popdex spammer site doesn't show up anywhere in the results:

It's cool that Google actively polices web spam. But unfortunately this manual whack-a-mole job (Matt was that you?) didn't entirely work, since the first result is now an extremely ad- and adsense-heavy page (even threw a popunder at me that got through Firefox's blocker) which simply mirrors the old popdex pitch text and points to popdex.com.

I would love to see exactly what that manual whack-a-mole interface looked like. I wonder how scalable hitting a -zap- button on individual spam sites is in the end, though.

August 31, 2007

Be careful what you wish for

Big AP-wide story by Nick Jesdanun on my 1982 elk cloner virus in your papers for the holiday weekend. Fun. Nick also blogged a bit about writing the story.

update ... a reporter in Pittsburgh spotted the story on the wire and called me up to add some local color.

Time to boot up the emulator...

rc$ ../a2/a2 cloner.dsk

THE SMILING ELK: MODIFIED DOS  16/01/82

RICHARD SKRENTA          SLAVE DISKETTE


DISK VOLUME 254

 A 002 HELLO
 T 013 CLONER
 T 020 CLONER 2.0
 B 006 CLONER.OBJ
 B 002 CLONER.OBK

]BLOAD CLONER.OBJ
]CALL -151

*9000L

9000-   02          ???
9001-   A9 FF       LDA   #$FF
9003-   85 4C       STA   $4C
9005-   A9 8F       LDA   #$8F
9007-   85 4D       STA   $4D
9009-   A9 20       LDA   #$20
900B-   8D 80 A1    STA   $A180
900E-   A9 5B       LDA   #$5B
9010-   8D 81 A1    STA   $A181
9013-   A9 A7       LDA   #$A7
9015-   8D 82 A1    STA   $A182
9018-   A9 AD       LDA   #$AD
901A-   8D D1 A4    STA   $A4D1
901D-   A9 B6       LDA   #$B6
901F-   8D D2 A4    STA   $A4D2
9022-   A9 AA       LDA   #$AA
9024-   8D D3 A4    STA   $A4D3
9027-   A9 4C       LDA   #$4C
9029-   8D 13 A4    STA   $A413
902C-   A9 90       LDA   #$90
*9244G

ELK CLONER:

   THE PROGRAM WITH A PERSONALITY


IT WILL GET ON ALL YOUR DISKS
IT WILL INFILTRATE YOUR CHIPS
YES IT'S CLONER!

IT WILL STICK TO YOU LIKE GLUE
IT WILL MODIFY RAM TOO
SEND IN THE CLONER!

*9001G

WRITE PROTECTED

]
The Apple II was such a great computer to learn on. You turn it on, jump right into a ROM monitor, and start typing in assembly. Those were the days. :)

Oh btw that "CLONER 2.0" was the evil version that I never released.

September 5, 2007

Social software design, circa 1998

You are three quick steps away from joining rdos-general.

No I'm not. I'm one-click away from leaving...

September 6, 2007

History Lesson

"Some consider UNIX to be the second most important invention to come out of AT&T Bell Labs after the transistor." [1]
...
"The AT&T lawyers, concerned with consent-decree compliance, had believed it was safe to allow universities to have Unix." [2]
...
"Did the consent decree of 1956, then, kick off open source?" [3]
...
"There was a period in the early 1980s when Tops-20 commanded as fervent a culture of partisans as Unix or ITS - but DEC's decision to scrap all the internal rivals to the VAX architecture and its VMS OS killed the DEC-20 and put an end to Tops-20's brief period of popularity. DEC attempted to convince TOPS-20 users to convert to VMS, but instead, by the late 1980s, most of the TOPS-20 hackers had migrated to Unix." [4]
...
"Instead they hired David Cutler, who had played an important role in the development of VMS at DEC. VMS was a successful and innovative industrial OS in its days, and Digital had been working on it since the 1970's. Cutler took some 20 former Digital employees with him, and he and his team began the development of NT." [5]
...
"There are two things Cutler hates: Unix and GUIs. You don't say the word 'Unix' in his presence - you literally do not say it. You shouldn't say 'C++' either, as Bill himself gradually learned." [6]
...
"Windows-NT is VMS reimplemented." [7]
...
"At DEC Cutler is widely credited for terminating the 1979-80 Desktop RSTS project and scrapping the manufacturing prototype. Compared to the subsequently announced IBM-PC, RSTS had 40,000 running applications, ANSI languages, and a DBMS. RSTS had a reputation as a robust, stable and reliable multi-user, multi-tasking operating system." [8]
...
"Although e-mail had been available for use within a computer since the early 1960s, e-mail between computers got its start on DEC-20s... This involved not just new software and protocols, but also new notation: the now-ubiquitous user@host address format." [9]
...
"The DEC-20 was like a clipper ship, the highest expression of a technology which many believed obsolete -- the large central timesharing computer... Meanwhile, UNIX was taking over the computing world... For these reasons, we decided that it was time to start converting from TOPS-20 to UNIX.

VMS was not chosen for several reasons. First, we were feeling somewhat betrayed by DEC's abandonment of TOPS-20, and did not want to leave ourselves open to the same treatment in the future... Furthermore, UNIX has networking and communications for all our major requirements: Ethernet, TCP/IP, DECnet, BITNET, RS-232 and LAT terminals, Kermit.

The "user-friendly" shell provided by the TOPS-20 Exec, which gives help to those who need it but does not penalize expert users, is probably the feature that was missed the most." [10]

...
"From a technical perspective, the biggest mistake we made in VMS was not writing it in a high level language. At that time, we had a group of very accomplished assembly language programmers, some stringent size constraints, and no compiler with the appropriate quality for operating system development." [11]
...
"Developers designed both VMS and NT to overcome the weaknesses of UNIX, which Cutler once described as a "junk OS designed by a committee of Ph.D.s."

Coming from the mainframe world, Cutler's never been a fan of the PC. He considered the Intel x86 line of microprocessors and the OSs that relied on that chip family, to be something of a joke. So in the early days of NT development, circa 1988 to 1990, Cutler focused on a RISC-based chip from MIPS and demanded that his programmers write code that would work on any processor, rather than Intel x86-specific code, which might have been faster but would have been less portable. Cutler's disdain for the x86 continued through the 1990s, as Microsoft ported its NT code-base to other platforms, including DEC's Alpha and IBM's PowerPC.

Of course, NT's cross-platform capabilities didn't seem to matter much by the late 1990s as Microsoft canceled almost every non-x86 port. Cutler's beloved MIPS chip fell first, followed by the PowerPC, and, finally, the Alpha, which had been sold as part of DEC to Compaq." [?]

...
"When the PDP-11 computer arrived at Bell Labs, Dennis Ritchie built on B to create a new language called C which inherited Thompson's taste for concise syntax, and had a powerful mix of high-level functionality and the detailed features required to program an operating system. Most of the components of Unix were eventually rewritten in C, culminating with the kernel itself in 1973. Because of its convenience and power, C went on to become the most popular programming language in the world over the next quarter century.

This development of Unix in C had two important consequences: It made it much easier to port Unix to newly developed computers. It also made Unix easy to customize and improve by any programmer that could learn the high-level C programming language. Many did learn C, and went on to experiment with modifications to the operating system, producing many useful new extensions and enhancements." [12]

...
"In the early 1990's we visited Microsoft to try to ensure that their new OS Windows NT would be available on IA32. We met with Dave Cutler, and he was adamant that IA32 was doomed and would we please get lost so he could target Alpha and then whatever 64-bit architecture was certain to replace IA32 by Intel. It was not a polite disagreement; that guy (Cutler) HATED IA32 and wasn't reluctant to transfer his displeasure to IA32's representatives (us). What an ugly business meeting. Smart guy, though." [13]

 

quotes have been edited for clarity and conciseness

 

d|i|g|i|t|a|l :)

September 10, 2007

Keith Squared

Heh. Those crazy engineers.

September 25, 2007

My compiler vs. the monkeys

In high school I took part in a computer science fair in my hometown of Pittsburgh. It was city-wide, with folks from schools all over the area participating. The judging was held at Monroeville Mall. (The same mall where 'Night of the Living Dead' was shot, btw :)

My entry was a compiler for a custom language I had invented, implemented in Apple Pascal. I forget the details of the language, but the thing worked, and produced 6502 assembler as output. The assembler could be transferred to a DOS 3.3 disk and turned into a runnable program with the BIG MAC assembler. The compiler was mostly recursive descent, except for the expression parsing, where I used precedence parsing. Some local dad who had a job as a mainframe programmer had shown me the railroad switching algorithm to parse expressions, and I thought it was incredibly cool, so it got used too even though it wasn't really necessary.

(I saw a chapter on precedence parsing made it into Beautiful Code...about time that technique got some pr.)

Anyway, on the designated Saturday I showed up and set up my Apple II on a table. Only when the judges came by -- who were representatives from a local Radio Shack -- did I realize that my demo was a little lacking. Watching a compiler run is not very exciting, and the cumbersome process to move the intermediate output code over to DOS where it could be assembled and run didn't help. The judges were baffled by what I had done -- it wasn't clear they knew what a compiler was -- and my mumbled explanations didn't help. They moved on.

Later I learned that a program that animated some monkeys on a screen had won the competition.

My buddy didn't fare much better. He had implemented some kind of font-rendering system for his Epson MX-80 dot matrix printer. He had hand-designed all sorts of new fonts for it too -- kerning, crazy stuff -- it was some kind of proto Apple II postscripty-thing. Frankly I was mystified by his project, but it seemed like an awful lot of work and was rather impressive. I felt better that he hadn't won either.

I took away an important life lesson from this experience.

Yeah yeah, the value of the demo... presentation, practice, performance. That thin shiny stuff sometimes blows away deep heavy stuff. No... I learned that staged competitions with subjective judging by small panels suck.

September 27, 2007

Kosmix releases Google GFS workalike 'KFS' as open source

Search startup Kosmix has released a C++ implementation of the Google File System as open source. This parallels the existing Hadoop/HDFS project, which is written in Java. The Kosmix team has deep engineering talent and a strong track record, having recently built a web-scale crawler and search engine from scratch. Google has a set of tools that the rest of the industry needs in order to compete...it's cool that folks are stepping up to the task and leveraging the open source model to try to provide some balance.

KFS arrives with an impressive set of features for an alpha release:

  • Incremental scalability - New chunkserver nodes can be added as storage needs increase; the system automatically adapts to the new nodes.

  • Availability - Replication is used to provide availability in the face of chunkserver failures.

  • Re-balancing - Periodically, the meta-server may rebalance the chunks amongst chunkservers. This is done to help with balancing disk space utilization amongst nodes.

  • Data integrity - To handle disk corruptions to data blocks, data blocks are checksummed. Checksum verification is done on each read; whenever there is a checksum mismatch, re-replication is used to recover the corrupted chunk.

  • Client side fail-over - During reads, if the client library determines that the chunkserver it is communicating with is unreachable, the client library will fail-over to another chunkserver and continue the read. This fail-over is transparent to the application.

  • Language support - KFS client library can be accessed from C++, Java, and Python.

  • FUSE support on Linux - By mounting KFS via FUSE, existing Linux utilities (such as ls) can interface with KFS.

  • Leases - KFS client library uses caching to improve performance. Leases are used to support cache consistency.

Every startup that scales beyond a single machine needs platform technology to build their application and run their cluster. If enough folks adopt the code and contribute, the hope is that it could become something like the gcc/linux/perl of the cluster storage layer.

October 1, 2007

Beautiful presentations: Jon Bentley's quicksort video

This is one of the best technical presentations I've seen. Which is all the more amazing because the speaker, Jon Bentley, spends nearly an hour talking about the quicksort algorithm.

This is based on a chapter in Beautiful Code, but goes deeper into the subject, and seeing Jon present definitely adds a lot to the material.

Btw there's a fascinating Google/Bell Labs comment by the introducer:

There's quite a crowd - well it seems to be occupying a fair chunk of the front here - of people that used to occupy the Unix room at Bell Labs.

In the olden days there was one Bell system (and it worked!) and a lot of us were sitting in this place in Murray Hill, and over time more and more people have come to Google.

So now it feels like this has become sort of the home base for a lot of us.

Google has felt to me like it's the new AT&T Bell Labs. Huge monopoly profits funding a ton of great researchers, generally left alone. The analogy also makes sense because Google has actually hired so many from the old Bell Labs...

October 24, 2007

Fall


Corn maze off 84. I bought a box of organic strawberries here, they were amazing.


Straw bale maze off route 1. The maze was pretty cool...it was quiet in there. Lots of different rooms to explore, not just corridors.


"Do not throw out black thing!"


The Mystery Machine van from Scooby-doo used to be parked by Fry's in Palo Alto every time I went there. Lately I hadn't seen it there. Here I spotted it at Home Depot in San Carlos.


No cows this day on the hike up to the dish by Stanford.

My dish story:

When I came out to California to interview at Sun in 1995, my buddy Tom was driving me around so I could get a feel for the area. We were going up 280 when I saw the dish. "Holy shit, what's that?!" I made him exit the highway and find the access road up to it. It turned out it was a park and you could walk up there. Tom didn't want to walk up the hill, so he waited in his car smoking while I did the 30 min loop up to see the thing.

I love that dish. I tell my kids that we're using it to look for the aliens, even though I don't think it's part of SETI anymore.


Corn and pumpkins.


Boo.

November 3, 2007

Tapping on keys

Apparently I can't code and blog at the same time. When I work on something it generally takes over my life and I think about it 24/7. If I'm writing code, even when I'm not sitting in front of the keyboard there is usually some part of my brain working through a problem or just turning the structure of the system around in my head to see what needs to be done.

On the other hand it's not like I'm holed up in a fire tower working on a novel or something. I'm still walking around and seeing stuff and talking to people, and could have stuff to post. Maybe if I could bang out the words faster I could keep the blog going concurrently. My posts have generally taken at least an hour to write in the past. That's too much time each day when that hour could be put to other use. I'm going to experiment over the next week and see if I can't optimize post-writing time down so I can keep stuff going here without compromising the other stuff I'm doing.

Rotting Pumpkins

November 4, 2007

Ranking Web 2.0 sites by server latency

Server latency is the start of the battle for site performance. There are great tutorials on how to optimize your html, but if your server takes too long sending the bytes out in the first place, there's nothing the browser can do but wait.

It gets even worse. Server latency directly affects your site's hardware requirements. Slow html, in some sense, is the user's problem; their browser will spin trying to render your spaghetti css. But the longer it takes your server to put out a page, the fewer pages it can serve per second. Which means buying more servers for the same load.

For example, if your server returns a page in 50ms, it can pump out 20 of those a second. If it takes 250ms instead, it can put out only four per second. That's a 5X difference in the number of machines needed to serve the same number of users.
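
Here's that arithmetic as a few lines of C++, assuming a toy model with one serial worker (real servers overlap requests with concurrency, but the per-worker math scales the same way):

    // Toy model: one worker handling one request at a time.
    #include <initializer_list>
    #include <iostream>

    int main() {
        for (double latency_ms : {50.0, 250.0, 500.0}) {
            double pages_per_sec = 1000.0 / latency_ms;  // reciprocal of latency
            std::cout << latency_ms << "ms -> " << pages_per_sec
                      << " pages/sec per worker\n";
        }
        // 50ms gives 20/sec, 250ms gives 4/sec: the 5X gap in
        // servers needed for the same load.
        return 0;
    }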

Just for review, recall that 1 millisecond = 1/1000th of a second. So:

    50ms = 1/20th of a second
    250ms = 1/4 of a second
    500ms = 1/2 of a second

The human eye's threshold for perceiving latency is about 50ms. Crudely simplified, this means that if you render a screen in 50ms or less, it's going to be perceived as instantaneous. If it takes longer, people can sense the latency. Lots of subliminal processing occurs on an image in the first 250ms, with conscious processing happening after 270ms.

(Making a leap here... This means that people's well-trained subliminal neural hardware is deciding whether to click Back even before they've consciously realized what they're looking at...cool. :-)

I'd recommend the following performance yardstick for server latency:

    50ms = pretty good
    250ms = avg/sluggish, but still OK
    >500ms = your site is slow as molasses

Faster is always better, but if you're in the 50-100ms range you can feel pretty good about your platform. Over that, and there are probably some easy wins to be had, which will pay off in user satisfaction and a lower hardware ramp in the co-lo.

So how does the rest of the net stack up?

The following list is the result of running apachebench on 530 Web 2.0 sites pulled from CrunchBase. I also added some of the major sites such as Google, Yahoo, and so forth for comparison.

Caveat: I don't recommend trying this at home. I was deliberately trying to avoid overloading anyone's servers, so this was not a stress test, just a touch test done off-peak with a mean result coming from 10 non-concurrent fetches. Running apachebench against Google (or anyone else for that matter) is a great way to get IP-banned.

There is a huge range -- 100X! -- in performance figures, from <10ms for the fastest sites to slow sites which took seconds to squirt out their html.

Yahoo trumps google on this test, returning their homepage in a mean time of 13.620ms, vs. Google's 30.716ms. This despite Yahoo's page having nearly twice as many bytes (although still quite lean at just under 10k of html).

I included information about the sites' webservers, but there doesn't seem to be any consistent trend. Apache appears on both the slowest and the fastest sites, with many other servers spread through the range too. Apparently it doesn't matter what brand of webserver you have, it's how you use it.

Anyway, on to the list.

Continue reading "Ranking Web 2.0 sites by server latency" »

November 7, 2007

Network Effect Entrepreneurs

"Today Silicon Valley is full of 'network-effect entrepreneurs' "
      -- Steve Perlman

It's 1998... Bob and I are writing code for Sun. C++, kernel, networking, heavy QA process. 18 month release trains to get into Solaris. No one uses our product. Sun's channel strategy for desktop software is fsck'd. Everyone in the IETF hates us. Typical 90's bigco software job.

But then the net comes along. Microsoft buys Hotmail for $400M.

Hotmail wasn't Netscape. A browser is a big honking piece of client code. You spend 30 minutes compiling against some gargantuan event-driven windowing framework only to crash your windows box when the thing runs. Hard work.

Hotmail was some web forms on top of sendmail. You use printf to make web forms. I could have written Hotmail. You could have written Hotmail.

Hotmail was so successful that the founders and VC's were arguing over who invented the "P.S. Get your free email at Hotmail" viral advert appended to every outgoing message. Success has many fathers.

Think about that. Not the server mail delivery/connection model. Not their anti-spam. Their big I.P. story was this one-line message appended to the end of the email.

Bob and I coded NewHoo in two months. HTML forms on top of a database. We got Wired, Red Herring, Netly News in the first month after launch. Bizdev from Looksmart, Infoseek, Lycos. printf and html forms were working great for us. This was a lot easier than debugging locks in the kernel.

Competing projects sprung up to chase us -- Freedir, Infoseek's Go Guides, Zeal, Wherewithal -- but we didn't think they had much chance. Something about being the first put us at the head of the pack. No matter how many users the followers signed up, we always stayed way ahead.

That's a network effect barrier to entry.

The barrier certainly wasn't our code. Our I.P. wasn't our C or Perl. It was directory data and users.

eBay was like this too. You could write a clone of eBay in a weekend. It's printf's and a database. But there's no point, because the trick would be how you would get everyone from over there onto your site. eBay's barrier to entry isn't their code.

There are still products with technology I.P. Oracle, RenderMan, Google... those shadowy funds arbitraging adsense to yahoo in Europe. RenderMan awes me. Every year they make a better movie with it. All that ray tracing math to make hair and mist and fire and faces look more realistic. 10 years of hard work by a big team are in that package. That's cool.

But connect-the-dots has the day, thanks in no small part to the takedown of MSM and the pillaging of its ad dollars.

Fun fun!

Meeting today

"We wanted to check you guys out before coming over here. But your company webpage just had picture of a paper bag puppet on it. Then we googled you, but it said you were writing a virus for the iphone. We were wondering, who are these guys?"

Heh.

November 8, 2007

Editorial Selection in Retail

I recently read Into the Wild, the story of Christopher McCandless's journey to Alaska. (If I had known that it was about to become a Sean Penn directed movie, I probably wouldn't have bought it ;)

I had picked up the book because it was on one of the "recommended" tables at Borders. It wasn't a new release, but I grabbed it because it looked interesting.

Louis Borders (he of the bookstore bearing his name) once told me that Borders had slipped after he had left, and that their "editorial selection" wasn't as good anymore.

This was a new idea to me... the idea that a bookstore could have an "editorial voice", based on what they feature on the end caps, the "staff picks" tables, the books turned cover-facing-out on the bookshelves, and overall in the selection of books that are stocked.

It totally made sense to me. The old Printer's Inc in Mountain View used to be a great bookstore. The ownership changed, they remodeled the store, and it feels to me like half the books are gone now. The half I wanted to buy.

It's not just bookstores though. I've started to notice editorial selection everywhere now. I'd never thought about it that way before. Even a restaurant has an editorial selection. Some restaurants try to be all things to all people. This makes sense in a New Jersey diner, but that's not how you get to be the French Laundry. Or In-n-Out burger, for that matter. It's not by being all things to all people.

Wish you were here...

Ha!

Spamgen english->spanish->english

Willy Wonka and the factory of the chocolate is a film 1971 cradle in the book Charlie of the 1964 children and the factory of the chocolate by Roald author Welsh Dahl. The film dug in generally the revisions received critical foxhole on its opening in 1971, but it was only a low commercial success, despite on the years the film has become one of more the beloved, well-known affluent family always films fact - and in spite of its age, and fable original creative attempt as kinematic musical comedy for the children - also has to you from grown in an important classic work of the cult with the children and the adults.

Try this for a random spam blog from google blogspot. :) I've seen a number of wordgen produced blogs out there. Not sure if anyone is using machine translation to wash text yet though.

November 13, 2007

Gee whiz

I've just been offered $125k for rt.com, the first domain I ever registered. I got it back in 1991, before there was even a web. It was for email hosted on my uucp Amiga Unix node.

The domain market sure looks hot right now. That seems like a good offer, but the domain has sentimental value, and who knows what domains might be worth in another 10 or 20 years...

Spice Girls VC

So one day a few years ago I'm sitting in a VC's office having a chat. I had a few ideas rattling around in my head but the VC had his eyes on a then-current space which was hot. He tossed a business plan for one of the leading startups into my lap.

"Where'd you get this?" I asked.

"They gave it to me."

He went on to talk about how he wanted to launch a company into the space as well, and I'd be a great vp eng. He said he knew a guy with some technology who could be cto, had a vp marketing in mind, and then we'd just need a world class ceo to round out the band.

I formed a theory that the process of seeking VC ended up calling your own competitors into existence. You'll meet with many more VCs than the 1-2 who end up funding you. But after seeing a company or two get funded in your space, the VCs who passed or weren't able to get in decide they want to have a bet in the space too. Fortunately they have the benefit of having heard your pitch and the opportunity to personally grill you at length on your approach.

But doing the Spice Girls or N' Sync thing to put a startup together can be tricky. Startup founders can be so cranky / eccentric.

November 15, 2007

Netscape's "Tin Man" rocket

If you used to work in Netscape building 23 or 24, you probably remember the giant rocket-ship shaped tower between the buildings. The purpose of the tower (as described by a FAQ on the Netscape intranet) was to pump a poison called TCE out of the groundwater and spray it into the air, supposedly so it would "dissipate":

The Tower a.k.a. The Tin Man a.k.a. The Rocket...

This is an Air Stripper. It is used to remove VOCs (Volatile Organic Compounds) from ground water wells located under our campus. In summary, water is pumped to the top of the tower, and then released over 'wiffle balls' to increase the surface area that it flows over. At the same time, a powerful stream of air is sent up the tower, where the high velocities pull the VOC's out of the water. They are then ejected high into the atmosphere, where they evaporate & diffuse immediately. For more information on how the Air Stripper operates, please visit The Air Stripper FAQ Page.

Building 23 Has an Interesting View

Yes, building 23 is right next to the tower, but do not despair. The VOCs are discharged from the air stripper at a very high rate of speed, & they evaporate into the atmosphere immediately. The EPA, the BAAQMD, the RWQCB, the previous owners, and Netscape have & will continue to monitor the processes and the quality of the environment across our entire campus. Similarly, much of the Bay Area, including many residential areas, have ground contamination issues & are undergoing remediation.

What's Down There Anyway?

Primarily Trichloroethene (TCE), and derivatives of that chemical. (TCA, PCE, DCE, DCA, trace amounts of Freon 113, Phenol, Vinyl Chloride, DCB) The area where these are found is approximately a half mile wide and 2 miles long, much of which is covered by Moffet Airfield.

There was a little putting green between the two buildings, and if you were anywhere near it or walking between the buildings you could feel the mist raining down from the tower. The building HVAC intakes were also nearby.

We were moved to building 24 in 1999, and wondered about this hare-brained scheme to rid groundwater of poison by spraying it onto people and into ventilation intakes. Bryn's dad is a PhD chemist, so we asked him about TCE. His opinion was that it was bad stuff and we likely didn't want to be soaking in it. So we asked Netscape for TCE testing of our air quality, but predictably got the runaround. Shrug.

A few years later, tragi-predictably, the EPA reclassified TCE to be far more harmful than they previously had claimed:

The U.S. Environmental Protection Agency will require 10 Silicon Valley high-tech companies that once operated manufacturing plants in Mountain View to conduct -- for the first time -- air-quality testing for a toxic substance inside several offices that were later built on the land.

The same companies are suspected of having leaked into the ground a substance called trichloroethylene, known as TCE, a widely used solvent that cleans machine parts.

Now, the EPA believes TCE might be 60 to 70 times more dangerous to humans than previously thought, and it is concerned that contamination in groundwater is seeping into the air inside office buildings constructed in areas vacated by those companies.

    -- old unlinkable merc story

Might be seeping up? Yah right. It was pumped out of the ground and sprayed all over the place, on purpose. Sheesh.

I'm too apathetic about such things to worry over what kind of increased risk of god-knows-what I might have in the future, but if you care check out the TCE blog.

November 26, 2007

Unix should have a newfd call

So you can open a file and then unlink() it. The fd hangs onto the file, even though the file isn't visible in the filesystem anymore. Once you close the file, it goes away.

This is kinda cool and gets used every so often, but much more useful would be the reverse.

Cleaning up files that are in the process of being created, before they are rename()'d into place, is a pain. If you could create a new filesystem fd in the unlinked state first, and then link it into the filesystem once it was ready, all the temp file and unlink-on-error nonsense could be done away with.

newfd() would have to take a path to associate the fd with a particular filesystem, like statvfs(), but that's easy.

A lot of newbie programmer errors where a partially-written file is put into place over the existing file would be eliminated too. I bet this could have saved a lot of trouble over the years.
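
For contrast, here's the dance you have to do today in POSIX: write to a visible temp file, then rename() it into place, remembering to unlink() on every error path. A minimal sketch with made-up paths:

    // The temp-file dance today's POSIX requires, which the proposed
    // newfd() would eliminate. Paths here are made up for the example.
    #include <cstdio>      // rename()
    #include <cstdlib>     // mkstemp()
    #include <unistd.h>    // write(), fsync(), close(), unlink()

    int main() {
        char tmpl[] = "/var/data/index.XXXXXX";
        int fd = mkstemp(tmpl);   // temp file is immediately visible in the filesystem
        if (fd < 0) return 1;

        const char msg[] = "new contents\n";
        if (write(fd, msg, sizeof msg - 1) < 0 || fsync(fd) < 0) {
            close(fd);
            unlink(tmpl);         // error path: must remember to clean up
            return 1;
        }
        close(fd);

        // Atomic replace: readers see either the old file or the new one.
        if (rename(tmpl, "/var/data/index.db") < 0) {
            unlink(tmpl);         // ...and clean up here too
            return 1;
        }
        // With a newfd()-style call, both unlink() branches would vanish:
        // an fd that's never linked into place just disappears on close.
        return 0;
    }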

Kinda late to be adding calls like this though. 30+ years late.

November 28, 2007

Apple's "1984" ad: Rejected by the board

So the best commercial ever - literally, the best commercial ever - shows up with Ridley Scott as the director, and half the folks who see it, including Apple's board, want to give the time back to the network rather than run it.

The ad was the pride of the entire agency. They were confident that 1984 would generate a tremendous interest in not only the Macintosh, but all Apple products.

Unfortunately, Apple's board didn't concur. When the board was shown the ad, cofounder Mike Markkula suggested that Apple drop Chiat/Day altogether. The rest of the board was not impressed either.

Sculley was discouraged by the board's reaction and asked Chiat/Day to sell back both the timeslots to CBS (the commercial was to air uncut during a minute spot, and an abbreviated version would be aired during a thirty second spot). If a buyer could not be found, Manuals [an ad featuring a stack of manuals] would be run instead.
    -- Tom Hormby

But a few people could see the ad was great stuff and wouldn't give up:

Chiat/Day defied Sculley and only sold the thirty second spot.

Steve Wozniak, who was still friends with Jobs at the time, heard about the board's refusal to support the ad from Jobs, who also showed it to him. Wozniak loved the ad and offered to pay for the spot personally if Jobs was unable to get Apple to air the ad.

Amazing. The ad agency ignores the CEO's instructions, and Woz the founder steps in to offer to pay for the ad out of his own pocket if they don't run it. That's so cool...

Popular history remembers successful efforts being destined for greatness from the start. But there's usually a messier story behind the scenes.

Interesting tidbit: the models who tried out for the ad were physically unable to throw the sledgehammer. They had to hire an actual competitive discus thrower to play the part. Cool. :)

November 29, 2007

If you can't read this...

The IP address of this blog changed from 205.217.153.42 to 205.217.153.43. Windows for some reason doesn't seem to honor dns ttl at all and I had to reboot my windows machines to see the update. So if you're not reading this post, you have bad dns... eh. well. hmm.

When we've moved big sites we always leave behind an http redirector on the old IP for a few weeks. It's surprising to me just how many clients out there will continue to use old dns weeks after an update to an entry with a 15-minute ttl. I didn't bother with the redirector for this ip change on my blog though, seemed like overkill.
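
For reference, the redirector itself is nearly trivial: anything that answers every request on the old IP with a 301 pointing at the new address. A minimal C++ sketch (the target hostname is a placeholder, error handling is trimmed, and in practice a one-line webserver redirect rule does the same job):

    // Minimal 301 redirector to leave behind on the old IP.
    // (Port 80 needs root; use 8080 or so for testing.)
    #include <arpa/inet.h>    // htons()
    #include <netinet/in.h>   // sockaddr_in
    #include <sys/socket.h>   // socket(), bind(), listen(), accept()
    #include <unistd.h>       // read(), write(), close()

    int main() {
        int srv = socket(AF_INET, SOCK_STREAM, 0);
        int yes = 1;
        setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes);

        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = INADDR_ANY;    // i.e. the old IP
        addr.sin_port = htons(80);
        if (bind(srv, (sockaddr*)&addr, sizeof addr) < 0) return 1;
        listen(srv, 16);

        const char reply[] =
            "HTTP/1.1 301 Moved Permanently\r\n"
            "Location: http://www.example.com/\r\n"
            "Connection: close\r\n"
            "\r\n";

        for (;;) {
            int c = accept(srv, nullptr, nullptr);
            if (c < 0) continue;
            char buf[4096];
            read(c, buf, sizeof buf);           // drain the request; we ignore it
            write(c, reply, sizeof reply - 1);  // every URL gets the same answer
            close(c);
        }
    }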

December 1, 2007

EC2 - the return of timesharing

I continue to be surprised at the success of EC2/S3. I know a lot of startups using it, and I can imagine this sort of machine/storage virtualization taking over a big part of the datacenter/colo market. The appeal of built-in server financing and the ease of scaling up or down are so compelling that folks are willing to work around the (pretty severe) limitations in the current service. (Of course the product is still very young and amzn will continue to improve it over time.)

Anyone who's ever tried to get financing or leaseback for machines knows what a pain it can be and how difficult it can be to qualify. EC2 makes all that pain go away, you can have 1 or 20 servers and scale up or down at a moment's notice. It's really more financial tech than datacenter magic.

I wonder if some kind of standardization for how to deploy virtual nodes and storage is going to develop. Presumably, if there are other companies that are going to jump into the virtual datacenter market, their APIs aren't going to look exactly like Amazon's.

I heard that Amazon's EC2/S3 service is getting a lot of calls from law enforcement because it's being used to host kiddie porn and file sharing services. Apparently being able to set up storage and compute farms from a web form with a credit card, in someone else's datacenter, is pretty appealing to folks who don't want to get caught for what they're hosting.

Of course you expect this sort of thing, it's just a cost of doing business, like running a big forum system. Any big ISP or community site has dedicated staff to handle the law enforcement requests and to police the userbase. Interesting that you have to do this sort of thing even to run a virtual datacenter product.

December 3, 2007

Weird Stuff - the other Silicon Valley tech museum

The Computer History Museum at Shoreline & 101 is pretty cool. If you're in the area and haven't been it's worth a visit. Seeing all the old electronics and panels covered with switches delivers a strong shot of nostalgia mixed with awe at the rate of progress computing hardware has made over the years. They've got the guidance computer from the nose-cone of an old Minuteman nuclear missile, a DEC-10, an Enigma machine, an IBM 360, lots more great stuff. And the place has that distinctive smell of old electronics. The smell of vacuum tubes and bakelite. :)

But there is another, unofficial, computer history museum, about five minutes away in Sunnyvale. Weird Stuff, a surplus equipment store. If you want to take some of the artifacts of the old Valley home with you, this is the place.

I bought an old working 1U server for $65, a rack for $50, some crazy old line tester thing with a hundred switches for $10, and a mechanical typewriter for a few bucks (it didn't have a price on it, and the checkout dude said "how much do you want to pay for it? ok.")

They get batches of semi-newish equipment too, so it's a great place for deals on telco racks, routers, switches, laser printers, patch panels, etc.

But I go there to see artifacts of the old valley. Half the crazy old devices on their shelves were someone's startup dream at some point. An old 1U firewall box that originally sold for 10's of $k, in a stack for $25/each. Telebit trailblazer uucp modems. Apple II's. Incomprehensible test equipment.

At least old hardware gets to rust on a shelf in a warehouse for a while after its life is up. Old software just goes >poof<, and is gone...

The Ark could be in here somewhere.

December 4, 2007

There is no Building Code for software

Did you ever have a leak around a window or roof vent or patio?

Did you ever have a program with bugs?

The building code is pretty cool. Not knowing anything at all about construction I was fascinated to see the detailed specifications about what must/must not be done in various kinds of residential and commercial construction. Like requiring thicker gypsum / sheetrock inside of closets located under stairwells. Why? Because if a fire breaks out in the closet, the staircase burns, and then egress is cut off for people on the upper floor.

Another interesting one is a requirement to have vertical bars on deck railings rather than horizontal ones. Why? Because it's harder for kids to climb over the railing and fall.

There are thousands of specifications like this. Some small details, some major structural points. How windows should be properly flashed, pipes connected, electricity kept safe, the foundation secured, on and on and on.

People have been living in houses for a very long time.

Houses rot, fall down, burn down. Pipes burst. Roofs leak.

Each of the rules in the building code has arisen because some particular failure scenario happened often enough that it made sense to add the rule.

It kind of creeps me out. Behind that rule about thicker lining in stairwell closets are... well, some fires that burned out stairwells. And perhaps some people who couldn't get out. It's not just theoretical.

According to Wikipedia, the Code of Hammurabi from ancient Babylon in 1760 B.C. was the first building code:

  • If a builder builds a house for someone, and does not construct it properly, and the house which he built falls in and kills its owner, then that builder shall be put to death.
  • If it kills the son of the owner, the son of that builder shall be put to death.
  • If it kills a slave of the owner, then he shall pay, slave for slave, to the owner of the house.
  • If it ruins goods, he shall make compensation for all that has been ruined, and inasmuch as he did not construct properly this house which he built and it fell, he shall re-erect the house from his own means.
  • If a builder builds a house for someone, even though he has not yet completed it; if then the walls seem toppling, the builder must make the walls solid from his own means.

There is no building code for software. There are a lot of anecdotal proscriptions, and a ton of knowledge on the subject. But for joe the general software contractor - Jeff Atwood's "80%" programmer - sometimes the expedient is chosen over the correct. Not because they're malicious or incompetent. Just because they haven't devoted their life to studying the art. They just want to learn the trade and work it. Where's the rulebook?

In software, unless you're in medical devices, or fly-by-wire aircraft systems, you don't usually kill people with bad software. Thank goodness.

We haven't been living in software houses for thousands of years. Software is more complicated, and each system is novel. It's still a black art. And every software project is in part an R&D exercise.

So I think it's still a long time before we'll have a building code for software.

December 5, 2007

I'm shocked, shocked to hear about the secret Wikipedia cabal

A buddy asked me what I thought of the "secret wikipedia mailing list" brouhaha:

From the Register:

Controversy has erupted among the encyclopedia's core contributors, after a rogue editor revealed that the site's top administrators are using a secret insider mailing list to crack down on perceived threats to their power.

Many suspected that such a list was in use, as the Wikipedia "ruling clique" grew increasingly concerned with banning editors for the most petty of reasons. But now that the list's existence is confirmed, the rank and file are on the verge of revolt.

He wondered if this was unique to Wikipedia, or if we'd seen this sort of thing at dmoz or topix.

The fact is that there is no way to prevent players in a social game from colluding to increase their effectiveness.

If players can coordinate their actions to get more power, they will. People are social creatures; they form cliques, groups, and tribes, and like to hierarchically organize themselves. This consistently happens if you have any kind of extra privilege for the senior folks -- e.g. editall or meta capability in the Open Directory. But it happens even in purely discourse-mediated systems, where parties will collude to promote / denounce agreed-upon subjects.

Sometimes what happens can feel like a virtual re-creation of the Stanford Prison Experiment.

What I pointed out to my buddy, however, was that you need to be careful before you try to architect or legislate this out of your system. Game designers know there is a careful balance between keeping long-running multiplayer systems inviting to new folks while letting experienced players continue to progress in status and power. The power is one of the main rewards in a social system. And it's going to your most loyal and productive game addicts.

The 80/20 rule is vastly over-used, but we found that it did apply in dmoz. A small group of editors did most of the work. If you remove the rewards for the power-users, to make the playing field more "democratic", you may be pissing off your best users.

Other takes from Matthew Ingram, Mashable, others.

December 10, 2007

PageRank wrecked the web

Two years later and rel=nofollow is still bugging folks.

Google needs YOUR help

It's still bugging me, too. It doesn't make any sense.

Bad linking hurts everybody

Google couldn't seriously be asking webmasters to tag which of their links were going to affect pagerank vs. the ones they'd sold. Could they?

Let's pick up all the trash in the world

That would be like asking everyone in the world to please be nice so the old algorithm will still work.

Do a random act of internet kindness

If I close my eyes and wish really hard I can bring back the golden age of the 1999 web. Back when links still indicated site quality.

THINK before you LINK

Back when spam was simpler, and G wasn't party to both sides of the transaction.

PageRank stands for PR

The toolbar pagerank display is disconnected from the real topic-sensitive pagerank used in the SERPS. Google can cut your PR in half but your SERPs don't change. It's a message, but what does it mean?

NOFOLLOW if you're PAID, or PAY the cost

Why would they want us to think that these things mattered?

If you don't toe the line, we'll ban you. You'll be sorry.

We can't actually ban the Washington Post or the Stanford Daily though. But we're going to threaten you to make you shape up.

Don't say untrue things about people

On one hand it seems an oddly utopian world view, not a pragmatic one.

Help Google by only publishing quality links

What happened to all the genius researchers building Skynet with their 1 million servers? What's all the AI for if they can't do a better job of tagging web pages than asking users to do it for them?

You mean they can't even detect TextLinkAds on a page, and have to resort to this weird business threat model instead?

The web of spam

Links used to be for human navigation.

Google made them count for money and they're ruined now.

Nofollow isn't going to put it back the way it was.

PageRank wrecked the web

Google is the cause of all of this.
and Google is going down with it.

December 13, 2007

Multi-paned search UI in testing at Google

It's cool that Google has gotten around to implementing the multi-pane search interface. Wags are saying that Google copied Ask on this, but really, it was Ask that copied A9's innovative interface. And now that Udi Manber, who built A9, is running search products at Google, it makes sense to see him testing an evolution of those ideas.

A9's interface (which was powered under the hood by Google results at the time) didn't seem to get traction when it launched. But those ideas, deployed on the Real Thing, could be a different story.

My next hope is to see some personalization come out on the results.

I have some personal skepticism that either multiple columns or p13n is a good idea. But it would be nice to see Google explore those.

December 15, 2007

Google sees own shadow, jumps overboard

Google announces "Knol"...

First-order response

Bad news for jason and mahalo! Google declares war on jimmy and wikipedia!

Some context

So Google makes an algo that puts wikipedia at the top of all the results. You search for 'hamburger', you get the encyclopedia definition of a hamburger. Riiiight... But questioning the wisdom of this algorithmic choice is off the table.

So they say, "Whoa. Look at that site at the top of all our results. We made them that big with our traffic. We should have a site like that, and then we could be there, instead. But we'll do it right this time. Our way. And put our ads on it!"

Onebox Thinking

Ask has those nice oneboxes. You search for britney, you get her AMG profile and a little picture from better days. But that's just the AMG dataset. You can implement about 100 of those custom datasets, and then smoke and noise start to come out of your feed integration team, and you can't take in any more feeds.

Google has Oneboxes. A lot of them are programmatic. sfo to lax, chicago weather, goog, things like that. But gosh, isn't wikipedia being in the top spot for all those searches just a kind of Onebox? An informational-article Onebox? Wikipedia only has 1.5M articles, that doesn't seem like a lot. Heck, jason pumped out 26,000 in a few months with a little team. What if this were properly scaled to the web?

Google could then scale its informational oneboxes. And keep them under its control. Not have them run by some kimono-wearing guy who wants to let the community decide how the content should be edited. A guy who won't take green US dollars for ads. What's he thinking? Better not trust him. ;-)

So what's the problem

Google is optimized for one result. Position #1, the I'm Feeling Lucky button. Oneboxes fit into this goal. The programmatic ones are command-line tools. 'weather 60201'.

But Oneboxes aren't webby. Even Mahalo, with its editor-created pages, seeks to link out to the breadth of information available about a topic. To be the best hub for that topic - not the destination. Wikipedia is a destination, but by virtue of the democratic inclusion process, mostly succeeds in distilling the web's voices into an objective resource.

There are many first-order problems with the Knol plan. Paul Montgomery zeroes in on some of the moderation issues nicely. But set aside the nightmare of trying to coax a usergen-content business to produce quality output. The question is, if this did succeed, would it contribute to building the ultimate web experience that we really want?

December 18, 2007

What fraction of searches are porn?

I found a stat that claimed that 25% of internet searches were for porn. This appeared in the CS Monitor, and my guess is it came from here.

I don't see that high a fraction of porn queries in the AOL dataset though...perhaps as little as a few percent. I wonder if they were filtered? But I didn't think they were, not in that way anyhow.

January 2, 2008

Why Search?

I've gone and founded a search startup... you can read about it in this write-up in TechCrunch. But I get asked - why do search?

Simple - the idea that the current state-of-the-art in search is what we'll all be using, essentially unchanged, in 5 or 10 years, is absurd to me.

The web is big. Really, really big. It's literally billions and billions of pages. It's Carl Sagan big. And it's doubling in size every year or two.

So the idea that what you can see in positions 1-3 above the fold on Google are the sum of what the web has to say about every possible query is crazy.

And yet they have 85%+ market share, and little effective competition. At the same time there is such a fabulous business in search. It's the highest monetization service on the web, by far. Why does this Coke have no Pepsi?

Having just spent 5 years in the media space, I've come away with the idea that editorial differentiation is possible. But the editorial voice of a search engine is in the index...so it has to be algorithmic editorial differentiation.

Google and its copy-tition were designed 10 years ago. But the web has changed significantly in the past decade. Google was built to index a web that no longer exists... a web where people still engaged in social linking behavior, for one thing.

But at the end of the day, founding a startup has to be about personal motivation. My roots go back to os internals, networking, algorithms, and product boot-up strategies. Basically, trying to make algorithmic sense of the vastness of the web is a difficult but really interesting problem. So is tilting at the biggest brand on the web. It's all just plain fun, which ideally should be the point of working. ;)

January 6, 2008

About the name 'Blekko'

In 1988 I was in college and desperately wanted to run some kind of multitasking OS on my own hardware. These were the dark days before Linux and FreeBSD. I had an account on the university Vax system but it was slow and I didn't have any privs to speak of. My first hope was for a Microvax but they were $15k. I ended up scrounging up a 286 system and installing SCO Xenix on it.

Xenix was a 16 bit port of AT&T's Unix and did the job. I loaded up my box with memory, serial ports, two modems and a serial terminal. I was in heaven.

I wanted to connect to the campus network via uucp, and so my computer needed a name. I christened it 'blekko'; how I came up with this I have no idea, but I liked the sound of it. Thus was born my first "net" address, blekko.uucp.

So 'blekko', while it may sound like a weird Web 2.0 name actually pre-dates the existence of the web. :-)

Now when Mike and I were setting up the new company we got to a point with the lawyers where they needed a name to proceed with the incorporation of the company. I didn't want to pick a name then, names are a big deal and you should put a lot of thought into them. So to put off the decision we decided to call the company "BX10.net". This was an inside joke based on one of our colo server names. But the main idea was that there was no way we'd ever launch with that, so it would usefully serve as a placeholder name but force us to change it later.

Well the state of California rejected our incorp under that name. Apparently there is a BX11, Inc. and they said "BX10" was too close. So in the interest of forging ahead with the company creation I fished out all the names I had in my domain account and sent them over to Mike.

Mike ordered the list by the ones he thought were funniest and sent them off to the lawyers to try, in order, until one worked. Blekko was the first name and went through.

Now I still think that it's important to put more than five minutes of thought into a company name. Especially if the five minutes' worth of thought yields "I would never use that name, are you insane?" But the reactions we've had have been ... interesting. Folks definitely love it or hate it. I actually score hate ahead of indifference; provoking a strong emotional response, even a negative one, helps the name stick in people's heads. :-)

One vendor we were talking to earnestly told us the name was fantastic and we must never change it. I'm not sure if he was pulling my leg though.

We've actually spoken to some naming/branding firms... I had always figured that investing more than $14.95 in a corporate identity made sense for a multimillion dollar startup effort... I mean you put millions of dollars into your coders and your ops, but you're going to settle for some name that happened to be free on Go Daddy?

The naming experts have had some interesting comments. They said phonetically 'blekko' wasn't bad. It's unique, staccato, memorable, and short. It does have some unpleasant phonetic associations. But they said mainly it was an "empty vessel" name. Meaning simply that the name doesn't suggest any idea in the mind of the person hearing it. It's an empty vessel that marketing would have to fill with a particular brand meaning.

We're still undecided on whether 'blekko' will actually be the launch name or if we will come up with something else. But I have to say the TechCrunch/Techmeme/Digg press and reaction have provided some fascinating test-marketing feedback. You can't pay for this stuff... and since it will be a little while before we launch anything, if we go with a different name later, it won't be a big deal to change it then.

I wonder what the name inspector would make of 'blekko'...

Update:

The Name Inspector reviews 'blekko'. He doesn't seem to like it. Although there is this curious comment at the end of the article:

But you’re in stealth mode. The Name Inspector believes you have no intention of launching as Blekko. Though he hopes he’s wrong.

Does that mean that he does want us to launch as 'blekko'? Hmmm....

January 7, 2008

Long tail in a short table

I finally found some stats on the fraction of porn queries out there to answer my question...plus, it was in a table classifying user searches into overall categories. This data was obtained by some researchers who manually classified a full week's worth of AOL search data:

Other 15.69%    News&Society 5.85%
Entertainment 12.60%    Computing 5.38%
Shopping 10.21%    Orgs&Inst 4.46%
Porn 7.19%    Home&Garden 3.82%
URL 6.78%    Autos 3.46%
Research 6.77%    Sports 3.30%
Misspellings 6.53%    Travel 3.09%
Places 6.13%    Games 2.38%
Business 6.07%    Personal Fin 1.63%
Health 5.99%    Holidays 1.63%

There are always methodology questions with data like this, but I've looked at the AOL data and am comfortable assuming that the categories are at least approximately realistic.

It's interesting to see the smooth spread across so many different categories. It's also easy to see why only focusing on a category or two may not be an effective product strategy. Shopping is the most lucrative of the verticals, and a healthy chunk at 10% of all searches. But if you focus only on shopping, that means users have to go elsewhere for the other 90% of their searches.

January 8, 2008

Another way to look at Wikia Search

Despite Wikia Search's unfortunate launch reaction, there is something substantial and worthwhile about the project that hasn't really come up in the coverage.

To understand Wikia Search you have to go back to the launch of the Nutch project in 2003:

Meet Nutch, the open-source search engine. Open-source applications are unusual in that the code upon which the software runs is not owned by a private, commercial company but rather bound by a simple license that allows anyone to use, modify, and even profit from it free of charge, as long as they pledge to contribute their own innovations back into the code base. Because of this, anyone will be able to access Nutch's code and use it to their own ends, without paying licensing fees or hewing to a particular company's set of rules.

Perhaps more important, Google takes a "trust us" approach to search; they say they don't skew their PageRank formula to favor certain sites, but we have no way of knowing for sure. With Nutch, the indexing and page-ranking technologies are all open and visible; you can check them yourself if you have a problem with your page's ranking. Just as Linux has taken on Windows, revolutionizing the rules of search-engine development and distribution, Nutch could pose a threat to Google and other search giants. Interestingly, early Nutch development was supported in part by Overture's R&D division, and an Overture official sits on the Nutch board.

"Search is interesting again," says Doug Cutting, a founder and core project manager at Nutch. Cutting, whose development chops were honed at Xerox (XRX) PARC, Excite and Apple (AAPL), is building Nutch (that's his toddler's all-purpose word for "meal") with a small team of engineers based around the country. But Cutting says they hope that once Nutch is loosed on the world, tinkerers from Romania to China to Palo Alto will help build it into a robust platform, in the spirit of Linux or Apache (which has garnered more than 60 percent of the Web-server software market in just the last couple of years).

The thought I had at the time was, the open source model is great, but the problem with search is that without a sponsor to pay for racks full of machines and gigabits of bandwidth, eager would-be developers are stuck. You can't develop a search engine on a laptop sitting in the university cafe.

Thus there is no web-scale version today on Nutch.org, of course. But Nutch has succeeded in smaller scale deployments, such as indexing university intranets. Basically competing in the enterprise search space, against commercial products such as Thunderstone and the Google search appliance. Universities are more open to tinkering with the open source Nutch / Lucene alternative and so have been early adopters there.

Enter Jimmy Wales. Wikia is the web-scale sponsor that Nutch didn't have when it launched in 2003. Wikia has 1,000 servers now and can afford the multi-gigabit bandwidth bill. They're providing the hosting platform that Nutch has been starved for, letting contributors show up and advance Nutch to industry level.

Yes, the site looks like someone was thrilled to get it to compile for the first time the night before launch. The appalled reactions are understandable given the expectations and high-profile PR.

Look past that.

Early open source projects often look grim. If you go onto sourceforge and find some promising 0.1 project, you know what to expect. I agree with Markson that the mistake here was in Wikia's positioning of the launch. But I don't think that's necessarily going to have long-term effects. Ultimately they just need a small handful of developers and contributors to help move the rock uphill. And then iterate.

And don't count out the power of the open source model. Giving all of the academic researchers who only get to test their experimental ranking algos on little clusters a functioning web-scale search platform could enable real progress. Check back in 2 years and I'll bet that Wikia Search is going to be a valid competitive alternative search site. Certainly a long shot to unseat Google, but at least a worthy alternative.

Updated data from Topix on registration-free commenting

Newspapers are apparently still fretting over whether to allow users to comment on their sites. Old-school editors like to hold the reins tightly; approval-before-posting is a common moderation model on newspaper web sites. You'd think they'd be more open to letting in the usergen pageviews...

Some new data out of Topix compares quality (measured by post kill ratios) between registered and unregistered commenters.

Total by registered users: 22,336
Total by non-registered: 60,772

Posts by registered users that got killed: 992
Posts by unregistered users that got killed: 4,095

% posts killed (registered users): 4.4%
% posts killed (unregistered): 6.7%

The unregistered commenters have a 50% higher kill rate. But they come with 3X the traffic.
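
Checking the arithmetic from the numbers above (trivial, but it shows where the "50% higher" and "3X" come from):

    // Ratios from the Topix comment data quoted above.
    #include <iostream>

    int main() {
        double reg_kill   =  992.0 / 22336.0;   // ~4.4% of registered posts killed
        double unreg_kill = 4095.0 / 60772.0;   // ~6.7% of unregistered posts killed
        std::cout << "kill-rate ratio: " << unreg_kill / reg_kill << "\n"  // ~1.5, i.e. ~50% higher
                  << "traffic ratio:   " << 60772.0 / 22336.0 << "\n";     // ~2.7, i.e. roughly 3X
        return 0;
    }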

Further evidence that the Ni-Chan paradox still holds:

  • Registration keeps out good posters. People with lives will tend to ignore forums with a registration process.
  • Registration lets in bad posters. Children and Internet addicts tend to have free time to go register an account and check their e-mail for the confirmation message. They will generally make your forum a waste of bandwidth.
  • Registration attracts trolls. If someone is interested in destroying a forum, a registration process only adds to the excitement of a challenge. Trolls are not out to protect their own reputation. They seek to destroy other peoples' "reputation."
  • Anonymity counters vanity. On a forum where registration is required, or even where people give themselves names, a clique is developed of the elite users, and posts deal as much with who you are as what you are posting. On an anonymous forum, if you can't tell who posts what, logic will overrule vanity.
I like this dataporn since it's applicable beyond newspaper and forum sites, to other kinds of recruitment-funnel online participation systems. Make it easy for users, especially first-time visitors, to jump in and participate. But also give power users the ability to invest more in their identity on your site.

January 15, 2008

Open source Bigtable clone 'Hypertable' posts performance numbers

Zvents will soon be releasing their open-source Bigtable clone called Hypertable, and has posted some performance numbers that look quite good. Especially for such an early release.

But maybe that's not surprising, since Hypertable was designed by Zvents search architect Doug Judd for speed. He rejected Java (used by HBase, the Hadoop-project Bigtable effort) in favor of C++ in order to get the performance as high as possible.

With a small test inserting about 28M rows of data from the AOL search dataset, they achieved a per-node write rate of approximately 7MB/sec. Iteration over the data once loaded was also quite fast, at nearly 1M cells/second.

The question is how the system will scale up to much larger amounts of data. But the early perf numbers are encouraging. Doug and co. will also need to get the word out about Hypertable and get a developer community going around this project if it's going to achieve its full potential.

Hypertable can run on top of either HDFS or KFS. Zvents CEO Ethan Stock told me they will be releasing it under GPL 2.1 on Jan 31st.

January 18, 2008

Database gods bitch about mapreduce

This is what disruption sounds like.

This rant by major database guys against mapreduce is pretty telling.

(You can read a good rebuttal here, and discussion on ycomb.)

The thing that disrupts you is always uglier and worse in some way. Less features, less developed. But if there's a 10X price win in there somewhere, the cheap rickety thing wins in the end.

Think Linux vs. AT&T Unix, or mysql vs. Oracle.

I'll also take exception to the claim that schemas won out over unstructured data in the 60's. Unix ultimately trounced Multics and its ilk, not simply because of quasi-open source and economics, but also because the programming model was superior. "A file is just a stream of bytes" was a radical departure from the record and key oriented approaches that were dominant at the time. Some folks haven't stopped fighting the war though. Oracle's multi-decade messaging effort deserves more credit for the acceptance of databases as industry-standard tech than the idea that warring academics came to realize some deep truth about the way data "should" be stored.

Is it the case that mapreduce on top of something like HDFS + Hypertable is a competitor to old-style monolithic databases running on big iron? You bet it is.

Linear perf, linear cost scale, and the programming flexibility of unstructured Unix-like I/O in GFS or fluid schemas in Bigtable. All good.
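
For readers who haven't run into the model: word count is the canonical example. Map emits (word, 1) pairs, the shuffle groups pairs by key, and reduce sums each group. Here's that simulated in-process in C++ -- a toy of the programming model only, not a distributed implementation:

    // Word count in the map/reduce style, simulated in one process.
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    using KV = std::pair<std::string, int>;

    // map: one input record (a line of text) -> a list of (word, 1) pairs
    static std::vector<KV> map_fn(const std::string& line) {
        std::vector<KV> out;
        std::istringstream in(line);
        std::string word;
        while (in >> word) out.push_back({word, 1});
        return out;
    }

    int main() {
        std::vector<std::string> input = {"the quick fox", "the lazy dog"};

        // shuffle: group emitted values by key (the framework's job)
        std::map<std::string, std::vector<int>> groups;
        for (const auto& line : input)
            for (const auto& kv : map_fn(line))
                groups[kv.first].push_back(kv.second);

        // reduce: sum the values for each key
        for (const auto& [word, counts] : groups) {
            int total = 0;
            for (int c : counts) total += c;
            std::cout << word << " " << total << "\n";
        }
        return 0;
    }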

And I wouldn't be surprised if the adoption curve, even for conservative Fortune-500 companies, was quicker than we've seen in the past. Bolt a map/reduce cluster onto the side of your data warehouse and mine those CRM records for business insights. Sounds like a startup idea we'll be seeing soon enough. ;-)

January 21, 2008

Markson: The Tin Handcuffs of SEO

When I stopped living in the problem and began living in the answer, the problem went away.
      -- Randy Treft

Mike Markson has a thought-provoking post on SEO.

I have a buddy who compares getting VC funding to getting hooked on heroin. He says that instead of optimizing the company to build the right product, the funding often optimizes the company to do whatever is necessary to close the next round.

SEO can be like that. It's such an easy way to get traffic. Certainly easier than making a great product that spreads word-of-mouth all by itself.

Every month Google allots the web-sites in its index a certain amount of traffic. Some sites do better than others, but for the most part each site takes its monthly Google traffic home and tries to do the best it can with it.

...

If you actually look at the recent successful sites over the past few years - YouTube, MySpace, Facebook, etc. - none of them got there by Google traffic. They created a product and figured out a way to get mass appeal outside the Google regulatory system.

There's more, it's worth reading the whole thing.

February 2, 2008

The peanut butter jar is empty

I was rooting for Jerry Yang. Tech founder returns to helm to take over. That's a story I have to get behind.

But it never seemed to me like he was fully in charge. I wonder how many stakeholders Consensus had over there in the top suite.

It is okay to blame terry, he has been paid very well, by any standard in corporate america. including his stock grants, semel has certainly extracted hundreds of millions of dollars in compensation. For that much money it is fair to expect results. while he doesn't strike one as the type of person to grasp the viability of search, he has surrounded himself with advisors who certainly should have been able to make this assessment.

One can also turn some blame on jerry yang, who was instrumental in attracting terry. Jerry should have known that such a technophobe would have problems dealing with the inevitable semi-technical issues that a yahoo ceo would have to grasp on some meaningful level. It was those senior yahoos who were engaged in the ceo search who incorrectly assumed that yahoo was simply "another media company", and their search was predicated on this. Had they understood that this generalization was meaningless, they would have directed their search elsewhere.

But to be fair, the winds of the internet have shifted. no one cares about "integrated networks" like yahoo and aol anymore because they have failed to deliver more utility than the rest of the web. Google is a rest-of-the-web company...its search and advertizing products leverage the entire web instead of trying to fight it. I'm not sure anyone at yahoo saw this coming.

    -- comment to Speculative Fiction, on Terry Semel's decision not to buy Google

Rumors about a msft or some private equity takeout of Yahoo have been bobbing around for years. Sure, Yahoo could do (could have done?) the monetization deal with Google, and become Google's #1 adsense publisher. But on the product side...what would you do?

Take a look at Yahoo's list of products:

Sixty-one services! 61! How do you wrap a brand around that?

Yahoo used to mean "search", back in 1995. Then they line-extended their name onto everything...even physical stuff, like credit cards and mice and keyboards and a magazine (a real one, on paper!). Now what does Yahoo mean? What is the first word that jumps into people's heads when they think of Yahoo?

I subscribe to Trout-Ries branding. Line extensions == generally harmful. But it's the default silicon valley product manager launch move. You trade short-term interest in a new product for long term damage to the core brand. It takes 5-10 years to build a major brand. And it takes 5-10 years for the full effect of line-extensions to erode a strong brand.

Long after the folks responsible for junk like this have moved on, the effect remains in consumers' collective subconscious memory.

Think about it... Yahoo doesn't mean keyboards. They didn't do plastics or ergonomic research or think of some insight about key travel distance or how audible the click should be or do wireless really well. Apple thinks about that stuff when they do a design. They work closely with the manufacturers to find out the latest materials and new techniques they can incorporate into their products. But Yahoo didn't do that. They just slapped their name on a box.

Anyone who passed this keyboard sitting on a shelf in a store could see that. How many people saw the box vs. bought the thing? They sold some keyboards, but far more people saw the message of the keyboard. A message that Yahoo wasn't only about their directory or search functions. Or even about their website. Yahoo does everything! No... the message was that Yahoo was willing to put their name on anything.

Trout-Ries: If you do everything, then you do nothing...

Mike thinks Yahoo Mail is the first thing that comes to mind for people when they think of Yahoo. I asked my wife, she said Yahoo Groups.

I asked her why she didn't use more Yahoo stuff.

"Well, if you think of the web as a giant marketplace, and you're looking for something, you could go to the "Yahoo company store" and look at just what they have, or you could go through the main door and look at everything."

"What's the main door?"

"I don't know, I guess I just use Google."

Sixty-one services. Not just names on a site-map, they're groups in the org chart.

Google should take a close look at this. They're up to 39 services. 39!

This story is over... now the cycle can start over with Google. I do hope that Google's brand-extending product managers diligently continue their efforts this year. :)

    February 7, 2008

    Google finally copies Topix 2004

    Heh. Google has launched a local news version of Google News. You can put in a zip code and get a geo-spun slice of their stories around a locality. Cool.

    But it doesn't seem like Google is going as far as Topix did in finding local references in non-local sources... We had a geoKB with named entities for every town in the US, and would disambiguate the references in the stories. Our geoKB knew the name of every street in the country. As well as every bridge, tunnel, body of water, hospital, school, jail... we even had a database of mayor names in our local KB (got that yet goog? :-) Sometimes helpful to tell the Springfields apart.

    I'd routinely see local stories from crazy sources... stuff that I never would have found any other way. My town (san carlos, ca) was once the cover story of a magazine called Government Procurement, because our city hall had put solar cells all over their roof. I never would have seen that story without topix.

    This was pretty neat stuff when Topix launched in January, 2004. Now if Google just added 50,000 vetted local blogs to the mix, and a community with 100k posts/day, they'd have something. :-)

    February 12, 2008

    Amazon is the Google of buying stuff

    I went into a little corner non-chain convenience store by my house (the "Devonshire Little Store") for some milk and noticed a big plastic tub of Dubble Bubble at the cash register.

    Folks I've worked with know that I have a thing for gum.

    I was doing the math at $0.10/piece, but then figured "what the heck" and asked if I could buy the whole bucket. That seemed to piss off the store owner.

    He said he would only sell me $5 worth at $0.10/piece. The bucket said "180ct" and was about 1/3 down. I tried to chat him up. "You can't get this at Costco, can you?" "No, not at Costco." He wouldn't tell me where he got the Dubble Bubble buckets wholesale.

    An hour later, I'd chewed through half of my stash and was thinking there had to be a better way to get quantity gum.

    Enter Amazon. I'm happy to say that 1,260 pieces of Dubble Bubble (in various-sized plastic buckets) are now on their way to me. I'll have them tomorrow.

    Recently I've found that my online purchasing has increased, and consolidated, through Amazon. I did 80% of my christmas shopping through Amazon. I've bought scissors, wall thermometers, toys, video games, a camera, a bunch of DVDs, and of course books...

    A couple of things I've had to go outside to get..MREs (ebay), and a coffee machine for the office (amzn didn't have the model we were looking for.) But if amzn has it, I use them to buy it.

    Give credit to Bezos ... he's built the best ecommerce fulfillment platform in the business. One-click purchasing, Amazon Prime, reviews, "Where's my stuff?", multiple credit cards and shipping addresses on file in my account... it all just works.

    And with their merchants, they offer just about everything.

    When I go somewhere else on the web to buy stuff it's invariably a rude shock. Basic gaps in the checkout process. Delayed or missing order confirmation emails. Bozo shipping policies. Stuff that I don't have to worry about with Amazon.

    When I want to know something, I go to Google.

    But when I want to buy something, I go to Amazon.

    February 17, 2008

    Quote week

    "There's something deep in software development that not everyone gets but the people at Bell Labs did. It's the undercurrent of "the New Jersey Style", "Worse is Better", and "the Unix philosophy" - and it's not just a feature of Bell Labs software either. You see it in the original Ethernet specification where packet collision was considered normal.. and the same sort of idea is deep in the internet protocol. It's deep awareness of design ramification - a willingness to live with a little less to avoid the bigger mess and a willingness to see elegance in the real rather than the vision."
          -- Michael Feathers, Beautiful Code blog

    February 19, 2008

    Code must be nurtured

    Here's a theory of software quality for you: software must be nurtured. The existence of bugs isn't mysterious to any honest programmer. They are the product of neglect. Finding a bug in one's code isn't so much a surprise as a feeling of deja vu. Ohhhh yesssss, I remember thinking I should check that condition. Programmers have complete control over the quality of their code and, when working on code they care about, tend to produce things that work. The secret is to care for the programmers, so that they take good care of the software.
          -- Coderspiel

    February 20, 2008

    Leak proof

    So for now, my advice is this: don't start a new project without at least one architect with several years of solid experience in the language, classes, APIs, and platforms you're building on. If you have a choice of platforms, use the one your team has the most skills with, even if it's not the trendiest or nominally the most productive. And when you're designing abstractions or programming tools, go the extra mile to make them leak proof.
        -- Joel on Software

    February 21, 2008

    Nobody is really smart enough to program computers

    Fully understanding an average program requires an almost limitless capacity to absorb details and an equal capacity to comprehend them all at the same time. The way you focus your intelligence is more important than how much intelligence you have.

    At the 1972 Turing Award lecture, Edsger Dijkstra delivered a paper titled "The Humble Programmer." He argued that most of programming is an attempt to compensate for the strictly limited size of our skulls. The people who are best at programming are the people who realize how small their brains are. They are humble. The people who are the worst at programming are the people who refuse to accept the fact that their brains aren't equal to the task.

    The purpose of many good programming practices is to reduce the load on your gray cells. You might think that the high road would be to develop better mental abilities so you wouldn't need these programming crutches. You might think that a programmer who uses mental crutches is taking the low road. Empirically, however, it's been shown that humble programmers who compensate for their fallibilities write code that's easier for themselves and others to understand and that has fewer errors.
          -- Jeff Atwood, Coding Horror

    February 22, 2008

    Lamport's Bakery Algorithm

    This paper describes the bakery algorithm for implementing mutual exclusion. I have invented many concurrent algorithms. I feel that I did not invent the bakery algorithm, I discovered it. Like all shared-memory synchronization algorithms, the bakery algorithm requires that one process be able to read a word of memory while another process is writing it. (Each memory location is written by only one process, so concurrent writing never occurs.) Unlike any previous algorithm, and almost all subsequent algorithms, the bakery algorithm works regardless of what value is obtained by a read that overlaps a write. If the write changes the value from 0 to 1, a concurrent read could obtain the value 7456 (assuming that 7456 is a value that could be in the memory location). The algorithm still works. I didn't try to devise an algorithm with this property. I discovered that the bakery algorithm had this property after writing a proof of its correctness and noticing that the proof did not depend on what value is returned by a read that overlaps a write.

    I don't know how many people realize how remarkable this algorithm is. Perhaps the person who realized it better than anyone is Anatol Holt, a former colleague at Massachusetts Computer Associates. When I showed him the algorithm and its proof and pointed out its amazing property, he was shocked. He refused to believe it could be true. He could find nothing wrong with my proof, but he was certain there must be a flaw. He left that night determined to find it. I don't know when he finally reconciled himself to the algorithm's correctness.

    ...

    What is significant about the bakery algorithm is that it implements mutual exclusion without relying on any lower-level mutual exclusion. Assuming that reads and writes of a memory location are atomic actions, as previous mutual exclusion algorithms had done, is tantamount to assuming mutually exclusive access to the location. So a mutual exclusion algorithm that assumes atomic reads and writes is assuming lower-level mutual exclusion. Such an algorithm cannot really be said to solve the mutual exclusion problem. Before the bakery algorithm, people believed that the mutual exclusion problem was unsolvable--that you could implement mutual exclusion only by using lower-level mutual exclusion. Brinch Hansen said exactly this in a 1972 paper. Many people apparently still believe it.

    ...

    For a couple of years after my discovery of the bakery algorithm, everything I learned about concurrency came from studying it. ... The bakery algorithm marked the beginning of my study of distributed algorithms.
        -- Leslie Lamport

    I find this story fascinating. Lamport has invented a bunch of cool algorithms. But here he describes having "discovered" the Bakery algorithm, and then spending years studying the algorithm he had written.
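
    For reference, the algorithm itself is tiny. Here is a sketch in Perl ithreads (my transcription, not Lamport's notation). One big caveat: Perl's shared variables do their own internal locking on every access, so this can only illustrate the logic; it can't demonstrate the overlapping-read property that makes the algorithm remarkable.

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my $N = 4;                            # number of competing threads
    my @choosing :shared = (0) x $N;      # thread i is picking a ticket
    my @number   :shared = (0) x $N;      # thread i's ticket (0 = not contending)

    sub bakery_lock {
        my $i = shift;
        $choosing[$i] = 1;
        my $max = 0;
        $_ > $max and $max = $_ for @number;
        $number[$i] = 1 + $max;           # take a ticket one higher than any seen
        $choosing[$i] = 0;
        for my $j (0 .. $N - 1) {
            next if $j == $i;
            threads->yield while $choosing[$j];   # wait until j has its ticket
            # defer to j while j holds a smaller ticket (ties broken by id)
            threads->yield while $number[$j] != 0
                && (   $number[$j] < $number[$i]
                    || ($number[$j] == $number[$i] && $j < $i));
        }
    }

    sub bakery_unlock { my $i = shift; $number[$i] = 0; }

    my $counter :shared = 0;
    my @workers = map {
        my $id = $_;
        threads->create(sub {
            for (1 .. 1000) { bakery_lock($id); $counter++; bakery_unlock($id); }
        });
    } 0 .. $N - 1;
    $_->join for @workers;
    print "counter = $counter\n";         # 4000 iff mutual exclusion held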

    How many of us find a solution to a problem, and then spend years studying the solution, learning from it? Actually I think I've learned more from studying bugs in my code than algorithms. If I could just avoid ever coding any bugs...

    Lamport has done a bunch of other stuff, including inventing Paxos, the distributed consensus algorithm behind google's distributed lock manager Chubby.

    February 27, 2008

    The real reason Google's clicks are flat

    From SEO Black Hat:

    Google reduced the clickable area on Adsense text ads ... Before, a user could click anywhere on the ad and be brought to the destination. After the changes, users have to click on something that looks like a hyperlink.

    "The CTR on text ads declined about 60% in the last 2 months with Googles changes, Image ads on the other hand stayed the same."
    - January 4th, 2008 Marcus of Plentyoffish.com

    4 months later, that little back and forth in the Google Rec Room shaved about $85 Billion (with a B) off the market capitalization.

    But it wasn't as stupid an idea as it might seem. You see, Adsense works in a quasi-marketplace environment. The market will bid up the cost per click once it adjusts for the drop in accidental clicks. Right now, marketers should be getting better value per click as a higher percentage of the clicks are "real" or intentional. That will lead to higher bids per click and ultimately should be close to a break-even for GOOG's bottom line.

    Is the Sky Really Falling?

    The problem is that in the interim, GOOG gives almost no guidance to the stock market. Mutual Fund types are really too thick to grasp exactly what's going on, so they think that this "slowing" in the growth has to do with the potential recession affecting GOOG.

    Meanwhile, the real story is that Online Advertising Spending will continue to grow at about 30% per year for at least the next 3 years and GOOG is poised to take a disproportionate amount of that growth even if nothing else they do is even marginally successful.
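
    The break-even claim is easy to sanity-check with toy numbers (mine, purely illustrative): if 60% of the old clicks were accidental, advertisers were really paying for the 40% that converted, and a rational market re-bids to the same cost per intentional click.

    # toy break-even arithmetic -- illustrative numbers only
    my $clicks = 100;              # clicks/day on an ad before the change
    my $bid    = 0.40;             # $/click when 60% of clicks are accidental
    printf "before: \$%.2f/day\n", $clicks * $bid;               # $40.00/day

    my $real_fraction = 0.40;      # fraction of old clicks that were intentional
    my $clicks_after  = $clicks * $real_fraction;                # accidental clicks gone
    my $bid_after     = $bid / $real_fraction;                   # same $ per intentional click
    printf "after:  \$%.2f/day\n", $clicks_after * $bid_after;   # still $40.00/day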

    March 6, 2008

    Who will stop Google from going to 90% market share?

    Jason predicts Google going to 90% market share.. He makes a solid argument and covers the bases. Referred traffic today suggests Google is at about 85%. Ask just quit the game, msn/yahoo put themselves into a tarpit. So the field is Google's...

    The only thing that can change this is new players. A string of uninteresting search attempts and lackluster competition have convinced people that it's impossible to stop Google's ascent.

    Google may have a network effect on ads, but the switching costs for the search app itself are small. Easier than switching free email providers. It's just another content site, and users are willing to try new search engines. There just haven't been any interesting new ones to try in a long time.

    I was hopeful that Wikia would launch something interesting and break the n-game losing streak of the upstarts, but sadly it was another shallow effort.

    I'm rooting for Cuill next. They have a very credible team. Anna built the current version of Google, and now she's working on the next gen. If they launch something interesting in any dimension, they'll show the market that you don't need a million servers and half of the phd's in the field to build a search app. It takes 20 people and $5M of hardware...if you know what you're doing.

    March 12, 2008

    NFS server %s not responding still trying

    :)

    April 7, 2008

    Did Powerset outsource their crawl?

    I've been seeing Zermelo, Powerset's crawler, hitting my pages. Sort of:

    ec2-67-202-8-249.compute-1.amazonaws.com - - [28/Mar/2008:23:31:06 -0700] "GET /2006/12/scale_limits_design.html HTTP/1.0" 200 11526 "http://www.skrenta.com/2006/12/i_took_a_ukulele_lesson_once.html" "zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]"

    They're using the open-source Heritrix crawler, running out of Amazon Web Services. But who is page-store.com? From their site:

    Vertical search sites are relatively costly to operate. A single vertical search engine may need to sweep all or a large part of the web selecting the pages pertinent to a small set of topics. Startup and operating costs are proportional to the input page set size, but revenue may be only proportional to the size of the selected subset.

    Page-store positions itself as a web wholesaler, supplying page and link information to vertical search engine companies on a per-use basis. The effect is to level the playing field between vertical search and general horizontal internet search.

    Page-store can provide

    • selected page feeds based on deep web crawls
    • page metadata
    • black-box filters
    • anchor text results
    • link information

    Did Powerset outsource their crawl?

    April 8, 2008

    Cuill is banned on 10,000 sites

    Be careful while you debug your crawler...

    Webmasters these days get very touchy about letting new spiders walk all over their sites. There are so many scraper bots, email harvesters, exploit probers, students running Nutch on gigabit university pipes, and other ill-behaved new search bots that some site owners nervously huddle in forum bunkers anxiously scanning their logs for suspect new visitors, so they can quickly issue bot and ip bans.

    Cuill, the search startup from ex-googlers that is anticipated to launch soon, seems to have run a rather high-rate crawl when they were getting started, one that generated a large number of robots.txt bans. Here is a list of sites which have banned Cuill's user-agent "Twiceler".

    A well-behaved crawler needs to follow a set of loosely-defined behaviors to be 'polite' - don't crawl a site too fast, don't crawl any single IP address too fast, don't pull too much bandwidth from small sites by e.g. downloading tons of full res media that will never be indexed, meticulously obey robots.txt, identify itself with user-agent string that points to a detailed web page explaining the purpose of the bot, etc.
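
    Most of these politeness rules are bookkeeping, and Perl's stock LWP::RobotUA handles robots.txt and per-host delays out of the box. A minimal sketch (the agent string, contact address, and URLs are placeholders):

    use strict;
    use warnings;
    use LWP::RobotUA;

    # identify the bot, link to a page explaining it, and give a contact address
    my $ua = LWP::RobotUA->new(
        agent => 'examplebot/1.0 (+http://www.example.com/bot.html)',
        from  => 'crawl@example.com',
    );
    $ua->delay(1/6);    # delay is in minutes: wait 10 seconds between hits to a host

    for my $url (@ARGV) {
        my $resp = $ua->get($url);   # fetches robots.txt first and sleeps as needed
        print $resp->code, " $url\n";
    }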

    Apart from the widely-recognized challenges to building a new search engine, sites like del.icio.us and compete.com that ban all new robots aside from the big 4 (Google, Yahoo, MSN and Ask) make it that much harder for a new entrant to gain a foothold. However the web is so bloody vast that even tens of thousands of site bans are unlikely to make a significant impact in the aggregate perceived quality of a major new engine.

    My initial take was that this had to be annoying for Cuill. As a crawler author, I can attest that each new site rejection personally hurts. :) But now I'm not so sure. Looking over the list, aside from a few major sites like Yelp, you could argue that getting all the forum seo's to robots-exclude your new engine might actually help improve your index quality. Perhaps a Cuill robots ban is a quality signal? :)

    April 9, 2008

    AppEngine - Web Hypercard, finally

    Google's AppEngine is being compared to Amazon's EC2/S3. But Google deserves credit here for coming up with a pretty differently-positioned product. There may be overlap for many users of course, but it's really operating at a whole different level of the stack.

    Folks that want/need more control over the environment, ability to manually manage their own machine instances, run code other than python, etc. will stay with EC2. EC2 is a step above RackSpace.

    But rather than thinking of AppEngine as a step above EC2, instead I think of it somewhere around Myspace. Or "Ning 1.0", as Zoho points out.

    In the beginning was GeoCities... No, even further back, in the beginning was Hypercard. Hypercard was a pre-web application for Macs that let you design a "stack" of pages - a website on a floppy, really. Popular stacks got traded far and wide. Hypercard stacks existed for every imaginable purpose - "Time Table of History", games, crossword puzzles, the Bible, etc.

    The thing about Hypercard was that it wasn't just static text and images like base html. It had a scripting language, a database, and the Apple UI built-in, so you could create mini applications.

    It feels like the web has been trying to claw its way back to the simple utility of Hypercard ever since Mosaic. GeoCities was the first massive-uptake anyone-can-build-here website haven. But it was all static html.

    Sure, you can paste javascript widgets onto your page, and have content driven by external sites. But to make the website a first-class object - on functional parity with a "real" website - it needs to be backed by a database and programmability. But setting up mysql, renting machine space, configuring linux, programming all the boilerplate, not to mention the scalability issues if your site gets popular -- this is all a big hurdle.

    So to hide all those details behind a platform that's easy to get started with, and lower the barrier to entry for writing public application websites... Well that's a big deal. Hats off to Google for bringing this to market.

    I'm not alone...somewhat similar thoughts from Nate Westheimer...

    April 14, 2008

    Cluster map propagation in Amazon Dynamo

    Dynamo is Amazon's scalable key/value storage service. The paper is a good read, but I found the way the cluster node list information was propagated in dynamo to be a little odd. The algorithm is that every 60 seconds a node will talk to another node in the cluster, chosen at random, and exchange update information. I wondered how fast a change would propagate through the cluster, so I simulated the propagation.
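
    The simulation only takes a few lines. Here's a sketch of the kind of thing I ran (a reconstruction, assuming synchronous rounds where every node gossips once per cycle and an exchange informs both sides):

    use strict;
    use warnings;

    my $n = 5000;                    # cluster size
    my @informed = (0) x $n;         # which nodes have seen the membership change
    $informed[0] = 1;                # node 0 observes the change
    my $count = 1;

    my $cycle = 0;
    while ($count < $n) {
        $cycle++;
        for my $node (0 .. $n - 1) {
            my $peer = int rand $n;
            next if $peer == $node;
            # an exchange is symmetric: if either side knows, both do afterward
            if ($informed[$node] || $informed[$peer]) {
                $count++ unless $informed[$node];
                $count++ unless $informed[$peer];
                $informed[$node] = $informed[$peer] = 1;
            }
        }
        print "cycle $cycle: $count/$n nodes informed\n";
    }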

    For a 5,000 node cluster it takes about 9 update cycles for a change to reach every other node. Since each update is on a 60 second timer, that's 9 minutes for a change to push out.

    I didn't do a very sophisticated time model... plus there is random start and all that. So maybe in practice it's a little different. But 9 minutes seems like a long time to propagate a host change out to the rest of the cluster. Maybe I mis-interpreted what they're doing?

    I recall some confusion about whether Dynamo was actually providing SimpleDB, or if they were two separate software systems. Does anyone know if this was resolved?

    April 16, 2008

    Web robot names considered, and rejected

    Google's is "Googlebot"
    Yahoo's is "Slurp"
    Cuill's is "Twiceler"

    It makes sense to have a friendly robot user agent, so nervous webmasters won't ban it. You don't want to call your crawler 'sitejacker' or something... Unfortunately my favorite candidates were:

    Crawlhammer
    Webraker
    Lurchy
    Client9

    hmmm. :-|

    "Oh no! It's CrawlHammer!!"

    If even in your heart you hide the urls ... there it shall rake for them...

    ...

    Does anyone know what the purpose of a '+' in front of a url in the robots user-agent is? Some sites put in the '+', others don't...

    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

    Mozilla/5.0 (compatible; Ask Jeeves/Teoma; +http://about.ask.com/en/docs/about/webmasters.shtml)

    Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

    Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)

    Gigabot/3.0 (http://www.gigablast.com/spider.html)

    Microsoft "hits back" at Google with re-launch of 4-year old Newsbot

    The memecrowd sure has a short memory... maybe I'm just showing my age here, but still.
    CNET: Microsoft hits back at Google with Live Search News
    Search Engine Land: Microsoft Launches Live Search News
    Search Engine Watch: Windows Live Search Offers Google News Alternative

    MSN Newsbot? Anyone? From 2004:

    CNET: Google News faces Microsoft rival (Jul 27, 2004)
    Wash Post: Microsoft Deploys Newsbot To Track Down Headlines (Aug 1, 2004)
    Geeking with Greg: MSN Newsbot review (Jul 27, 2004)

    April 22, 2008

    Hypertable architecture talk Wednesday in Palo Alto

    Doug Judd will be discussing the internals and architecture of Hypertable tomorrow in Palo Alto at 6:30pm.

    Hypertable is an open source, high performance, distributed database modeled after Google's Bigtable. It differs from traditional relational database technology in that the emphasis is on scalability as opposed to transaction support and table joining. Tables in Hypertable are sorted by a single primary key. However, tables can smoothly and cost-effectively scale to petabytes in size by leveraging a large cluster of commodity hardware. Hypertable is designed to run on top of an existing distributed file system such as the Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). One of the top design objectives for this project has been optimum performance. To that end, the system is written almost entirely in C++, which differentiates it from other Bigtable-like efforts, such as HBase. We expect Hypertable to replace MySQL for much of Web 2.0 backend technology. In this presentation, Doug will give an architectural overview of Hypertable. He will describe some of the key design decisions and will highlight some of the places where Hypertable diverges from the system described in the Bigtable paper.

    More details.

    April 24, 2008

    Microsoft bias in MSN search results, surprise

    I was looking to see which search sites might have a particular bug that I (ahem) came across, and was trying searches for the number 0 in various places. There is a pretty good Wikipedia page about zero. Zero has a rich and interesting history and there are many other potentially reasonable results.

    But I was surprised to see MSN search had demoted their good results below some crappy ones from MSDN:

    Lame! Falling into an inferior lex position and a lower overall relevance page to boost their own network results...give em credit for being old school. :)

    ...

    I found my bug on Yahoo Search. I had tried a lot of smaller engines first because I didn't think a major would have this bug. You can't search for 0 on Yahoo. You can search for all the other numbers, but not 0 ...

    Why?.. Because 0 is false. It suggests Yahoo is using a scripting language to front their search form, and a programmer did something like if ( $query ) rather than if ( $query ne '' ).
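
    The bug is easy to reproduce in a few lines of Perl (hypothetical code, obviously not Yahoo's actual front end). Both the number 0 and the string "0" are false in a boolean context, so a bare truthiness test on the query silently swallows a search for 0:

    use strict;
    use warnings;

    for my $query ('0', '', 'zero') {
        my $buggy   = $query       ? 'searches' : 'drops it';
        my $correct = $query ne '' ? 'searches' : 'drops it';
        printf "query '%s': if (\$query) %s, if (\$query ne '') %s\n",
               $query, $buggy, $correct;
    }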

    May 1, 2008

    How Fake Luxury Conquered the World

    The legend says that once upon a time there was a General Motors. This General Motors, GM for short, had a car and a brand for every need, along the plan developed by the great Alfred Sloan prior to the Second World War. There were Chevrolets for regular folk, Pontiacs for the cautious old people (and, thanks to John Z. Delorean's development of the 1964 GTO, for angry young people as well), Buicks and Oldsmobiles for doctors and successful businessmen, and Cadillacs at the very top, for the most successful men in the land.
    ...
    It would have stayed that way forever, but one day a mysterious yet important man at GM had a mysterious yet important idea: Executives should drive cars from their own division!

    Which leads to every division of GM building their own version of the Cadillac.

    Read more: How Fake Luxury Conquered The World

    (thanks Bryn for the tip)

    blekko is hiring

    blekko is building a new search engine from scratch and I'm looking to hire a few more coders.

    Search is an absolutely fascinating problem to work on for a bunch of reasons. For one thing you have to scale the system before getting your first user. You can't just start with a server or two and add more when the users come. Step 1 is to copy the internet onto your cluster. Step 2 is to analyze it...

    The componentry is remarkably deep.

    Search is like 7 hard problems wrapped into a stack. Distributed systems, html analytics, text analytics/semantics, anti-spam, AI/ML, frontend/UI. And scale... Apart from the sexy high-end algos there are also the boring 10-year-old system libraries and off-the-shelf tools that crack under stress and sometimes need a look. You open the hood and wonder how the thing ever worked in the first place...

    Plus there is always something fresh and new every day mining through the vast sordidness of the many billions of pages on the web. You expect to be amazed at the endless varieties of crazy porn domains and new approaches to webspam. But there are equal horrors in the small, finding pathological charset issues, previously-undiscovered abominable server implementations, psychopathic website owners. The web is a reactive fuzz test.

    I know there are some great coders out there reading this blog who would have a blast working on some of the pieces here that need to get built. This is a great opportunity to join an experienced team early, building a big system from the ground up. If you think you might be interested, send me an email and we can chat.

    fyi our interviews always have coding tests. Primarily we are looking for folks who love to write code and are good at it. :)

    October 23, 2008

    What's up Rich

    If blogging is dead it must be time to start Skrentablog up again. Apologies for letting the blog go dormant the last little while, I've had my head down in technology. Quick update: 200 servers, 11 employees, lots of code. Crawl, index, test, repeat.

    We hired a naming firm to come up with a better name than 'blekko', they did a great job. Down to two candidates. Testing them.

    We built a wicked cluster platform to run our stuff. It's kind of like bigtable from the top-down api view but is an integrated design, vs. the layered impedance mismatches with stuff like gfs/chubby. No masters, all swarm algos. We crawl/index/serve into structured storage. It's very fast, has integrated mapjobs, and is really easy to program on top of. I'll post more details about it in the future.

    More posts to come, I promise.

    October 29, 2008

    Retro Conservation Advertising

    The modern green/eco movement is bringing back the idea of eating local, having a garden, saving energy, etc. and pointing out the links between items (like bottled water and oil).

    But we've been here before. Check out these WWI gov't posters.


    "Don't waste paper - a pound of paper wasted is a pound of fuel wasted"


    "Keep the home garden going"

    Check out all the detailed instructions in that one. Public education indeed.

    More posters...

    November 2, 2008

    Lucy on Elections

    It's hard being a campaign worker.
    We're completely at the mercy of our candidate.
    We do all the work, and the candidate gets all the credit.
    We ring doorbells, and make the posters, and build up the candidate's image.
    And then he says something stupid, and ruins everything we've done.

    The next time I do any campaigning, it's gonna be for myself!

          -- Lucy, You're (not) elected, Charlie Brown

    November 14, 2008

    Cold calls, cold response

    Every few days cold-calling salespeople show up at our office unannounced to pitch us on insurance, lease deals, laser toner, office supplies, voip plans, bottled water, etc.

    We have an open office. So when they enter, 11 people immediately look up at them. This can apparently be somewhat intimidating, based on their flummoxed reactions. They usually ask for a business card so they can call us later. I sometimes offer them mine, since my card doesn't have a phone number on it. Then they beat a hasty retreat.

    Lately we've been trying a new tactic - not acking their presence when they come in. There's no receptionist (of course), and it's not clear who they should attempt to speak with. None of us really wants to listen to their pitch or take their flier anyway, so playing a game of chicken with the other folks in the office sort of emerged as the default behavior. Who will be the first to crack, make eye contact, and thus become the dupe left holding the flier or handing out their business card?

    I almost feel sorry for them. Almost!

    November 21, 2008

    Thank heaven for tax refunds

    In 2000 before the dot-com meltdown I bought a few cases of french bordeaux. Even though I like bordeaux, it half-seemed like a silly purchase at the time. But when the wine arrived I was happy: the bordeaux had risen in value since I purchased it, while my accounts had gone down in the meantime thanks to the stock market death-spiral. Win, sorta.

    Unfortunately there was also a bmw 540 that I decided was too indulgent to buy and passed on. Afterward I kicked myself -- it would have been free. I would have exercised some netscape options I had in order to buy it. Instead I held onto them, and eventually they declined in value until they were worthless. I should have bought the car!

    I saw a joke circulating at the time that beer would have yielded a better return than some stocks. The beer bottles could be returned for the 5 cent deposit, but stocks became worthless. Plus you would get to drink the beer.

    Now we're going through it again, but even worse. The banker line now is that it's not the return on your capital that you should be worried about, it's the return of your capital.

    I just got a state of California tax refund check. Normally it's inefficient to pay too much withholding, essentially lending the government your money interest-free until tax time. In this case though it turned out to be a decent investment. :-|

    November 22, 2008

    Detecting spam from http headers?

    Greg Linden describes a paper about finding spam simply by inspecting the returned http headers:
    In our proposed approach, the [crawler] only reads the response line and HTTP session headers ... then ... employs a classifier to evaluate the headers ... If the headers are classified as spam, the [crawler] closes the connection ... [and] ignores the [content] ... saving valuable bandwidth and storage.

    We were able to detect 88.2% of the Web spam pages with a false positive rate of only 0.4% ... while only adding an average of 101 [microseconds] to each HTTP retrieval operation .... [and saving] an average of 15.4K of bandwidth and storage.
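
    Mechanically the idea is cheap to prototype: pull only the headers and score them against a weighted feature list. A sketch with made-up weights (the paper trains a real classifier; LWP's head() stands in for reading just the response headers):

    use strict;
    use warnings;
    use LWP::UserAgent;

    # usage: perl headercheck.pl http://example.com/
    # made-up feature weights for illustration only
    my %spam_weight = (
        'server: fedora'      => 0.8,
        'x-powered-by: php/4' => 0.6,
    );

    my $ua    = LWP::UserAgent->new;
    my $resp  = $ua->head($ARGV[0]);          # headers only, never fetch the body
    my $score = 0;
    for my $name ($resp->headers->header_field_names) {
        my $line = lc($name . ': ' . $resp->header($name));
        $score += $spam_weight{$line} || 0;
    }
    print $score > 0.5 ? "classify as spam, skip body\n" : "fetch body\n";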

    After running web crawls for the past year and finding all manner of spam, I have to say I'm skeptical this technique would really catch much spam on the actual web. Among the top 10 http header features they identify as spam-predictors are:

    • Accept-Ranges: bytes
    • Content-Type: text/html; charset=iso-8859-1
    • Server: Fedora
    • X-powered-by: php/4
    • 64.225.154.135

    These are pretty standard-looking headers. Let's look at some actual spam though and see if we can see anything funny.

    $ curl -I http://www.fancieface.com/
    HTTP/1.1 200 OK
    Date: Sat, 22 Nov 2008 19:13:11 GMT
    Server: Apache/1.3.26 (Unix) mod_ssl/2.8.12 OpenSSL/0.9.6b
    Last-Modified: Tue, 21 Oct 2008 11:51:10 GMT
    ETag: "2081cc-ba62-48fdc22e"
    Accept-Ranges: bytes
    Content-Length: 47714
    Content-Type: text/html
    

    Very spammy site, but totally vanilla headers. How about some rolex watch spam:

    $ curl -I http://superjewelryguide.com/300.html
    HTTP/1.1 200 OK
    Date: Sat, 22 Nov 2008 17:48:26 GMT
    Server: Apache
    X-Powered-By: PHP/5.2.6
    Content-Type: text/html
    

    Again, pretty vanilla. Plus this technique isn't going to work at all for spam hosted within trusted domains. Here's some cialis spam smeared onto a my.nbc.com page:

    $ curl -I http://my.nbc.com/blogs/GaryRobinson/main/2008/10/13/cialis-cheapest-cialis-pills-here
    HTTP/1.1 200 OK
    Server: Apache/2.2.0 (Unix) DAV/2 PHP/5.1.6
    X-Powered-By: PHP/5.1.6
    Wirt: (null)
    Content-Type: text/html
    Expires: Sat, 22 Nov 2008 19:16:33 GMT
    Cache-Control: max-age=0, no-cache, no-store
    Pragma: no-cache
    Date: Sat, 22 Nov 2008 19:16:33 GMT
    Content-Length: 0
    Connection: keep-alive
    Set-Cookie: pers_cookie_insert_nbc.com_app1_prod_80=1572983360.20480.0000;
            expires=Sat, 22-Nov-2008 23:16:33 GMT; path=/
    

    A trusted domain, but very fishy headers! :-)

    It's incredibly difficult to get a high quality random sample of the web. You can't factor crawler strategy bias out of the sample, and any small sample is not necessarily going to be very representative.

    If the researchers did find good coverage with quirky headers and even individual ip addresses, I suspect that the crawl they're using may be over-weighted in pages from a few servers that spewed out a lot of urls/virtual hosts.

    March 14, 2009

    The news medium has a message: "Goodbye"

    Every so often there's a story about a technophobe executive so out of touch that a secretary has to print out their email every morning so they can read it on paper and dictate replies.

    That's what the print newspaper is, of course. Why on earth would you print all that stuff out? Over a hundred pages, most of which you're not going to read, with the crease down the middle of the front page photo, story jumps everywhere, a carbon-footprint disaster to produce, distribute and recycle. It's absurd.

    Back in 1980 newspapers were the main way that bytes flowed into people's homes. Radio and TV for audio/video, but the newspaper delivered the bytes that were read like the text-based web.

    I once worked out some rough back-of-napkin estimates on the number of text bytes in the paper. It was only delivered once during the day, but if you average the bytes across the entire 24 hour period it came out to be about the rate of a 300 baud modem. The newspaper was the internet.
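
    The napkin math is easy to redo (the word length and framing-bit assumptions are mine):

    # back-of-napkin: a newspaper's daily text content as a bitrate
    my $baud          = 300;                   # ~300 bits/sec
    my $bytes_per_sec = $baud / 10;            # 8 data bits + start/stop framing
    my $bytes_per_day = $bytes_per_sec * 24 * 60 * 60;
    printf "daily budget: %.1f MB\n", $bytes_per_day / 1e6;          # ~2.6 MB

    my $bytes_per_word = 6;                    # ~5 characters plus a space
    printf "words/day: %.0f\n", $bytes_per_day / $bytes_per_word;    # ~430,000
    # a few hundred thousand words is plausible for a big metro daily once
    # you count the classifieds and stock tables, so ~300 baud checks out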

    It was mostly one way - except for all those classified ads and the letters to the editor. It was really a lot more like AOL, since it was centrally controlled and edited.

    But it did represent the sole text byte pipe into the home. And so it contained every content vertical, all in one package. National news, world news, local community sections. Little league scores and the NFL. Weather, stock tables, TV listings, home sales. Advertising: national, local and personal. Games and political commentary and the police blotter. Everything.

    Fortified by the high cost of the printing press and the limited radius of delivery trucks there was a natural local monopoly to these things. And indeed, they were a wonderful business, a so-called license to print money. Huge fortunes were made.

    That's all over now of course. The subsidy that classifieds supplied for bureaus in distant cities is gone. The class of professional reporters as we know them is going to be smaller and funded differently.

    I was at the TechCrunch office welcoming party last night, and was struck by how unassuming the offices were. This was the big move up, of course. They were still unpacking after moving out of Mike Arrington's house. But it was a small office with a few desks scattered around, a handful of computers. I've toured the massive AP newsroom, rebuilt in 2004 to cater to every desire of a journalist. The Reuters newsroom had pods that look like they were inspired by Norad in Wargames, with circular banks of monitors around central stations, all showing live feeds or charts from various sources. The old Mercury News offices were vast.

    TechCrunch was a modest affair by comparison. So this is where it all happens..., I thought. This is what the modern business press looks like now.

    Get used to it.

    April 8, 2009

    Bryn turned me into a muppet

    April 9, 2009

    blekko's ambient cluster health visualization

    When you have several hundred servers in a cluster, knowing the state and health of all of them can be a challenge. Traditional pager alert systems often either log too many events, which makes people tune them out, or miss non-fatal but still serious server sickness, such as degraded disk/cpu/network performance or subtle application errors.

    This becomes especially true when the cluster and application are designed for high availability. If the application is doing its best to hide server failures from the user, it's often not apparent when a serious problem is developing until the site fails in a more public or obvious way.

    We called these "analog failures" at Topix. There was a fairly complicated chain of processing for incoming stories that had been crawled. Crawl, categorize, cluster, dedup, roboedit, push to front ends, and push to incremental search system. Once an engineer mistakenly deleted half of the sources from our crawl, and it took us a disturbingly long time to notice. The problem was that, while overall we had half as many stories on the site, most pages still had new stories coming in, so we didn't notice that anything was wrong.

    Sometimes a server has a messed up failure, like its networking card starts losing 50% of its packets, but stuff is still getting through. Or a drive is in the process of failing, and its read/write rate is 10% of normal, but it hasn't failed enough to be removed from service yet. The cpu overheated and is running at a fraction of its normal speed. There seem to be limitless numbers of unusual ways that servers can fail.

    At blekko, there are dozens of stats we'd ideally like to track per host:

    • How full are each of the disks?
    • Are there any SMART errors being reported from the drives?
    • Are we getting read or write errors?
    • What is the read/write throughput rate? Sometimes failures degrade the rate substantially, but the disk continues to function
    • What is the current disk read latency?
    • Is packet loss occurring to the node?
    • What is the read/write network throughput?
    • What is the cpu load?
    • How much memory is in use?
    • How much swap is being used?
    • How big is the kernel's dirty page cache?
    • What are the internal/external temperature sensors reading?
    • How many live filesystems are on the host vs. dead disks?

    Other stats pertain to our cluster datastore:

    • How many buckets are on each host?
    • Is the host above or below goal for its number of buckets?
    • What is the outbound write lag from the host?
    • What is the maximum seek depth for a given path/bucket?
    • Do we have three copies of every bucket (R3)?
    • If we're not at R3, how many bucket copies are occurring?
    • For running mapjobs, what is their ETA + read/write/error rate?
    • Are the ram caches fully loaded?
    • Are we crawling/indexing, what is the rate compared with historical?

    The first step is to start putting the stats you want to be able to see into a big status table. But at 175 hosts, the table is kind of long, and it's hard to spot developing problems in the middle of the table.

    So we have been experimenting with mapping system stats onto different visualizations, so we can tell at a glance the overall state of hundreds of servers, and spot minor problems before they grow.

    A table with 175 rows is pretty long, but you can fit 175 squares into a very small picture. This grid shows overall disk usage by host. The color of the tile shows the disk usage: red is 90%+, orange is 80%+, yellow is 70%+, blue is below 60%. Dead filesystems on the node are represented by grey bars inside the tile. The whole grid is sorted worst-to-best, so it's easy to see the fraction of hosts at a given level of usage.
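
    Generating that kind of grid is nearly trivial. Here's a sketch that emits one color-coded HTML tile per host (a hypothetical helper, not our actual dashboard code; it fakes the usage data and collapses the unspecified 60-70% band into blue):

    use strict;
    use warnings;

    # hostname => percent-full; gather these from the hosts however you like
    my %usage;
    $usage{ sprintf 'host%03d', $_ } = int rand 100 for 1 .. 175;

    sub tile_color {
        my $pct = shift;
        return 'red'    if $pct >= 90;
        return 'orange' if $pct >= 80;
        return 'yellow' if $pct >= 70;
        return 'blue';
    }

    # sort worst-to-best so the trouble clusters at the front of the grid
    print "<div style='width:400px'>\n";
    for my $h (sort { $usage{$b} <=> $usage{$a} } keys %usage) {
        printf "<div title='%s: %d%%' style='float:left;width:18px;height:18px;"
             . "margin:1px;background:%s'></div>\n",
             $h, $usage{$h}, tile_color($usage{$h});
    }
    print "</div>\n";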

    Our datastore uses a series of buckets (4096 in our current map) to spread the data across the servers. Each bucket is stored three times. If we have three copies of every bucket, we're at "R3". This is the standard healthy state of the system.

    Because fetch/store operations will route around failures, it's not at all apparent from the view of the application if some buckets do not have three copies, and the cluster is degraded. So we have a grid of the buckets in our system, color coded to show whether there are 0/1/2/3 copies of the bucket.

    In the above picture, the set of buckets in red have only 1 copy. The yellow buckets have 2 copies, and the green have three. We have a big monitor with this display in our office; if it ever shows anything but a big green "3", folks notice and can investigate.
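
    The aggregation behind that display is just a histogram over bucket replica counts. A sketch with faked inputs (real copy counts would come from host reports):

    use strict;
    use warnings;

    # bucket id => number of live copies, aggregated from host reports (faked here)
    my %copies;
    $copies{$_} = 3 for 0 .. 4095;
    $copies{$_} = 1 for 17, 1042;            # pretend two buckets are under-replicated

    my %histogram;
    $histogram{ $copies{$_} }++ for keys %copies;
    printf "%5d buckets with %d copies\n", $histogram{$_} || 0, $_ for 0 .. 3;

    my $min = 3;
    $copies{$_} < $min and $min = $copies{$_} for keys %copies;
    print "cluster replica state: R$min\n";  # show the big green "3" only at R3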

    For variety we've experimented with other ways to show data. This display is showing the fraction of a path in our datastore which has been loaded into the ram cache. Ram cache misses will fall back to disk, so it's not necessarily apparent to the user if the ram cache isn't loaded or working. But the disk fetch is much slower than the ram cache, so it's good to know if some machines have crashed and the ram cache isn't at 100%.

    Other parts of the display are standard graphs for data aggregated across all of the servers. These are super useful to spot overall load issues.

    We're still experimenting with finding the best data to collect and show. But the ambient displays so far are a big win. Obvious issues are immediately visible to everyone in our office. And people will walk by and look at the deeper graphs and sometimes spot issues. Taking the data from being something where you would have to proactively type a cli command or click around on some web forms, to displays that engineers will stop and look at for a few minutes on their way to/from getting a coffee or soda, has been a big improvement in our awareness and response to cluster issues.

    April 21, 2009

    Topix passes USA Today to become #1 online site for Gannett, Tribune and McClatchy

    Four years after our deal to sell a majority of Topix to the top three US newspaper companies, Topix becomes the #1 online property for Gannett, Tribune and McClatchy.

    Congrats to the Topix team on the fantastic recent site growth!

    June 1, 2009

    Bingram BetaHoo - poking at a few Bing queries

    I like Bing! Bing.com is live and it looks really cool. Very fast, clean UI, strong navigational results, nice extra features like the hover panes, aggressive title relevance, plus all the vertical sub-engines. People like it.

    That said it's brand new and we all want to kick the tires.

    Search engines are built out of a lot of layered systems. One part can be working great but be subverted by another part that has a gap. Like any product there are always bugs to be fixed and improvements to be made. So launch day isn't the final word on relevance. But it's interesting to survey a variety of results to poke around.

    • Overall the navigational results seem very strong.

    • Bing is doing aggressive title rewriting to boost perceived relevance. Google has done some of this for a while - note the title change on the same url based on the query - [skrentablog] vs. [rich skrenta].

      The "Skrenta, Rich" title came from dmoz.

      Bing is going farther. Sometimes it makes the result look better than Goog's, e.g. [san carlos art and wine fair]. But others are odd, like result #3 for [mike arrington]. That funny-looking title looks like it came from anchortext.

    • Bing's indexing of *.blogspot.com seems really limited. For instance [radish king] doesn't turn up radishking.blogspot.com. Site:blogspot.com on bing returns an estimate of just 560k results. Compared to Google (340m) and Yahoo (230m), Bing's blogspot index seems tiny. Other blogspot sites I've gone looking for are missing too. I wonder if this is some kind of rank or index penalty given the large amount of blogspot spam, or if there is some other issue with their crawl.

    • [michael arrington] vs. [mike arrington]. TechCrunch is #2 for Michael Arrington, but is way down at the bottom of the page for Mike Arrington. This seems to be the fault of the section-ized results; it's under a heading called "Mike Arrington Blog". As others have noted I'm not a big fan of sections or universal search style sections on result pages. It's unfortunate to see a strong result for the query get pushed that far down.

    • Bing, like Google, returns Dogpile and AltaVista for [search engine]. (Yahoo looks like they manually pinned a couple of results for this query.)

    Overall the few bugs I've seen are relatively minor issues in the scheme of the entire product, and I'm sure they will eventually be addressed by the Bing engineers. It's so cool to have a powerful new engine out with interesting results. Kudos, Microsoft!

    July 28, 2009

    There’s No Such Thing As A Google Killer

    Google is an amazing story. In a little more than 10 years, they have built not only a multi-billion dollar company that employs thousands of people, but also the world’s strongest brand. This is an anomalous story that may never be repeated.

    So let’s just get this out of the way: there is no such thing as a Google killer. No company is going to play David to their Goliath and slay them with a well-aimed stone from a slingshot. Google is here to stay.

    Why do I bring this up? I am one of the founders of a search start-up. One that recently raised money from a couple of great venture capital firms. So whenever anything is printed about us, or even comes up in casual conversation, the term “Google killer” gets bandied about. Again, I think you’re as likely to see a Google-killer as you are to find Sasquatch or the Loch Ness monster.

    So why join a search start-up then? Because I don’t believe that to be successful in this business you need to be a Google-killer. In fact, trying to be a Google-killer is probably the one sure way not to succeed.

    If you were to start a soft-drink company, would you be a Coke killer? Would you create a product that tasted exactly like Coke and put it in a red can? Of course not, that would be product suicide. You make something that tastes different and package it differently – Snapple, Red Bull, Vitamin Water.

    We think the same about search. Google isn’t going anywhere. We think there are a lot of problems that search isn’t addressing right now that it could be. And that’s where we want to play. Own a category or die. So no, we’re not a Google killer. But stay tuned for more…

    October 15, 2009

    blekko is hiring software engineers


    blekko is building a disruptive general-purpose web search engine. We are hiring software engineers.

    Web search is not only one of the most important technologies of our time, but it is also incredibly fun to work on because it requires cutting-edge algorithms from a wide range of disciplines. It is one of the hardest startup challenges today – but the monetization is much higher than anything else on the web, and there are fewer credible competitors than most people think.

    Our team has founded multiple successful startups and held leadership positions at major tech companies such as Google, Sun and Netscape/AOL. We have funding from top-tier venture investors and a roster of highly prominent Silicon Valley angels including Marc Andreessen, Ron Conway, and two early Googlers.

    Our crawl/index/search/query code is implemented on top of a distributed storage system that supports integrated map job execution, data replication, scalability, and fault tolerance. The programming model is similar to Google BigTable, but the application-level code tends to be more high-level and pleasant to work with than a typical high performance distributed application.

    We are looking for talented software engineers who enjoy working on big systems, appreciate the productivity wins of interpreted languages and good API design, want to work on advanced search applications at web scale, and are:

    • Highly productive coders, self-motivated and able to learn new skills quickly
    • Intellectually curious and more pragmatic than theoretic
    • Comfortable in a small-company, startup environment

    Pluses:

    • (In descending order of importance): UNIX/Linux, Perl, C/C++, Javascript, HTML/CSS
    • Search, particularly web search
    • Large-scale distributed systems (e.g., Map/Reduce, Hadoop, distributed filesystems, clustered databases)
    • Deep systems knowledge of operating systems, I/O, and networks
    • Applied math, statistics and/or machine learning, particularly as applied to ranking and classification
    • Degree in computer science or related area, especially masters and PhD
    • Industry experience, especially in startups or domain-relevant Internet companies
    • Interest in potential leadership opportunities as the company scales

    Blekko is located in Redwood Shores, California across from the main Oracle campus. If interested, please contact blekkojobs@blekko.com

    November 12, 2009

    The future of business journalism

    Pay for play? Sigh.

    From: Sally Bailey <Sally.bailey@amgl.co.uk>
    Date: Thu, Nov 12, 2009 at 6:42 AM
    Subject: Blekko Raises $2.5 Million - Front Cover Proposal - ACQ Magazine

    Good afternoon Rich, I hope you are well.

    Many thanks for your time a short while ago and your offer of assistance on this matter. We have discussed this internally and would like to elevate this to a front cover position.

    The recent $2.5 million raised obviously marks an important addition to your company portfolio. The Blekko brand is highly regarded and stands out in a competitive sector.

    We wish to propose elevating the coverage to the front cover of the magazine, if of interest to you. I am sure this greater exposure will be appetising considering our "penetration" into the sector. Within the report we will be discussing the deal itself but also examining the wider market.

    You may already be familiar with the magazine but for your convenience I have included below the FLEXIPAGE link to our most recent issue:

    http://view.vcab.com/...

    The cover opportunity includes:

    • A Front Cover Headline
    • Contents page reference
    • Editorial content within the magazine over a set number of pages
    • Electronic reproduction of the coverage
    • A hard copy of the complete edition (plus further copies as required)

    There are 3 options available depending on the amount of editorial content you desire:

    • Option 1 – 1 page of editorial - £940.00 +VAT
    • Option 2 – 2 pages of editorial - £1940.00 +VAT
    • Option 3 – 4 pages of editorial - £2940.00 +VAT
    If you would like to proceed with the front cover report please reply confirming which option you are interested in and the applicable cost.
    Your thoughts are needed urgently as these are popular positions with limited space available.

    If you have any queries at all please so not hesitate in contacting me. I will await your feedback.

    Kindest Regards

    Sally Bailey - ACQ Magazine

    Mainland UK Switchboard - 0044 870 242 7021
    Mainland UK Facsimile - 0044 870 242 7023
    Email – sally.bailey@amgl.co.uk

    Website - www.amgl.co.uk

    June 12, 2010

    New Dog

    Finally got the yard fenced in, plus hard lobbying from the kids ... so 5 years since winston, I got a new dog.

    June 18, 2010

    Smile for the camera, Rich

    July 19, 2010

    If blekko sees its shadow, 6 more weeks of beta

    blekko has (finally!) entered private beta...TechCrunch has the details on a preview of our new search engine.

    Am I insane for trying to build a new search engine from scratch? Maybe... but blekko is pretty cool anyway. :-)

    blekko is introducing a novel search syntax we call slashtags. Using simple tags to refine your queries (e.g. /date, /demblogs, /people, /health, /satire, etc.), you can quickly filter search results to just the sites you want, change the way results are sorted, and more.

    We have hundreds of slashtags you can get started with on blekko, plus we have a toolbox to let you make and share your own.
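
    For example (hypothetical queries, using a couple of the built-in tags listed above):

        cure for hiccups /health       restrict results to just health sites
        bay area housing /date         re-sort the results by date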

    Furthermore, we intend to be fully open about our crawl and rank data for the web. We don't believe security through obscurity is the best way to drive search ranking quality forward. So we have a set of tools on blekko.com which let you understand what factors are driving our rankings, and let you dive behind any url or site to see what their web search footprint looks like.

    So what took so long? It turns out that it's really friggin hard to build a search engine from scratch. Especially a good one. We've built our system from the ground up, with a multi-billion page index, a prioritized crawl, and new ranking and anti-spam technology. We also have a ton of classifiers firing on every url we crawl, many of which are powering built-in slashtags.

    Drop me a note if you'd like to get on the list for our private beta. It really is a beta, we still have some bugs to shake out before opening the site completely. But we could use your help to shake out the system before our launch.

    I'll leave you with blekko's founding principles. We call them the Web Search Bill of Rights:

    1. Search shall be open
    2. Search results shall involve people
    3. Ranking data shall not be kept secret
    4. Web data shall be readily available
    5. There is no one-size-fits-all for search
    6. Advanced search shall be accessible
    7. Search engine tools shall be open to all
    8. Search & community go hand-in-hand
    9. Spam does not belong in search results
    10. Privacy of searchers shall not be violated

    Here is my co-founder Mike's take on why blekko is cool.

    July 30, 2010

    CrunchUp New Product Demo: Blekko by Founder and CEO Rich Skrenta

    September 1, 2010

    blekko coverage and twitter glow

    The Web Ain’t Dead. Blekko.com Has Come to Save It. (stephenpickering.com)

    blekko Explains Itself: Exclusive Video (Update: Exclusive Invite) (battellemedia.com)

    Blekko Makes Influencing the Influencers Easy (travismurdock.com)

    Blekko: A Search Engine Which Is Also A Killer SEO Tool (searchengineland.com)

    Blekko vs. Google: I Do Believe I’m Now in Love With BOTH Search Engines (web-savvy-marketing.com)

    Blekko search: first impressions (economist.com)

    The Upside-Down Logic of Taking on Google at Search (technologyreview.com)

    Blekko search: Coming soon to a browser near you (memburn.com)

    Link data from Blekko (receptional.com)

    Blekko review (arhg.net)

    New Search Engine Blekko Takes Vertical Search to a New Level (eyetraffic.com)

    Taking Blekko out for a Spin (thenoisychannel.com)

    Blekko Screencast And Founder Interview (techcrunch.com)

    TechCrunch Review: The Blekko Search Engine Prepares To Launch (techcrunch.com)

    Blekko Demos Slick New Search Engine (thenextweb.com)

    Blekko: New Search Engine Lets You “Spin” The Web (searchengineland.com)

     

    Using #blekko, have a wicked smile in my face, like the smile I had when I used Google back in 2000.

    This #Blekko is freaking awesome. Example, I can search Apple Computers /History. Slashtags are going to change the web!

    @bryanphelps @blekko is a new kind of crack. woot! You can't take just one bite of the stuff. Awesome :-)

    Who cares about FB QnA! @blekko has managed to launch a search engine in 2010 & it actually works! Love it! Much needed goog competition!

    @blekko love it! Relevance is awesome even for obscure sites! Open up the API soon & let the party begin!

    #Blekko has a nice feature: you can track pages by AdSense ID :-)

    I have officially created my first to slashtags on Blekko. I'm totally digging it. I feel like a superpower.

    voteforme | Slashtags rock! @blekko has kept me up until 1am :-)

    Finally got in to @blekko. Crowdsourced indexes that are social more amazing than I expected.

    "you have marked the site www.ehow.com as spam. it is now dead to you." - I'm in love with @blekko!

    #blekko /date is brilliant. sort reviews based on dates, weeds out -ve reviews of early versions of products, say dell u2410 :)

    @blekko is pretty rad...

    Loving the very intuitive interface for @blekko. Never thought I'd feel that way about search again!

    I just made the /so slash tag on #Blekko to search the Stack Overflow trilogy - lovin' it!

    Trying out new search engine Blekko. It has very cool 'slashtags built in' functionality and transparency http://blekko.com #search #blekko

    got my #blekko account /very /cool. Thanks @blekko!

    Dude. Blekko is pretty damn cool.

    Wow. @blekko isn't half bad with search results.

    So, I've been using @Blekko for today and I'm just blown away by the options. Researching via verticals just got very, very, easy.

    Trying #blekko the new search engine. seems really good to me

    yesiamben - Actually really impressed with @blekko! The search stuff seems good, the results are relatively spam free, and I love the SEO stuff... #win

    Interestingly, @blekko is the only search engine to rank Google first for the query "search engine". Google lists Dogpile first.

    nicholasmarx - Received my @blekko beta invite today. Loving it so far, very cool.

    I think #blekko is right, security through obscurity doesn't work techcrunch.com/...

    Searching for "ffmpegx iPad conversion," @blekko gave me way more useful results than google, even without slash tags. So impressed.

    RT @RedUrbanNA: Finally gotten to play with Blekko http://blekko.com Slashtags are a real game changer. http://goo.gl/4h1H

    The more I play with blekko the more I really like it. I think I am going to try and go blekko for a week

    Nice "UX" slashtag on Blekko!! http://bit.ly/bZWqzm Google is still my primary search engine, but I now use Blekko for advanced search

    @blekko been playing around with blekko this weekend and it's impressive. Still determining how to fit it into my normal web use

    @justinhj dude, on that note, you seen @blekko? shit is fucking amazing

    Searched "open office for mac" on Google and got nothing worth clicking on, used @Blekko and got what I wanted #blekkowins

    Wow - @blekko is really neat. They definitely have something there.

    @blekko That is awesome! Today, I've used blekko to find something on @techcrunch instead of using their search. It was way more productive.

    Checking out @blekko The slashtag system makes it very easy to repeat searches and narrow the scope of a search

    @blekko it will take some getting used to but I could see blekko becoming a verb.

    Lightbulb just went off in my head on how @blekko works and it's freaking sweet. WTB Blekko search bar for Safari.

    @blekko I'm currently testing blekko.com and am astonished by the exact and complete search results. Thumbs up!

    Trying @blekko. A bit mind boggling to begin with but definitely interesting. A kind of "Command line for personal search".

    stefankulk - Creating a slashtag on @Blekko is like setting up a @Google custom search (http://www.google.com/cse/), but ten times faster

    Just tried blekko search engine -- "python /tech" gives me the programming language "python /science" gives me the snake -- cool! #blekko

    Congrats #blekko for winning. I compared @blekko against Google CSE head2head with the same list of websites

    @milesEfron But what @blekko is doing with them is quite different. The closest comparables are vertical search and custom search engines.

    @bradwarwick there is a lot in @ blekko that you'll like too - I'll show you tomorrow (if you have time)

    I am trying out @blekko right now and I LOVE IT

    Need to make @blekko an online habit. Can get more done on @blekko in 5 minutes than would take 20 on other sites.

    I'm really starting to love #blekko. Made it my homepage already.

    Just created /green slashtag. As a LEED AP this looks like it will be a sweet tool Way to go Rich!

    As a member of @blekko, I am now able to slash the web. SWEET!

    Had a sense of achievement one more time using #Blekko to figure out where my friend will be in holidays.

    @handshake20 have you heard of @blekko? I just started playing with the beta, has options to curate the web. cool stuff.

    well since google has turned evil, i'm going to @blekko and waiting for them to start introducing /email and /apps

    Just got an invite to test @blekko . I like what I see.. Very useful search engine!

    Just got my blekko invite. This is seriously cool.

    yeah. just received a beta invite for @blekko... first impression: awesome tool :)

    Really impressed with Blekko, a new search engine currently in private beta. Easily created a sanews slashtag. Allows: strike action /sanews

    I GOT MY @BLEKKO INVITE ALREADY!! I love you guys! New home page for sure

    Blekko: the only latter-day Google search "competitor" I haven't immediately laughed at http://bit.ly/cCrghv (Still a bit geeks-only tho).

    blekko is like a search engine for the top 5%. very clever. now plan to experiment with it for a week. starting. now.

    Thank you @blekko for the invite to your new search engine. So far I am really liking it. suggestion: red links to blue or user customizable

    And that's the end of "A Week With @Blekko". Went pretty well! Looking forward to giving everyone details on how it went!

    #blekko rocks, Google killer. I've seen Chacha, Mahalo, Quintura, even Bing redesign, so I totally didn't expect anything... stunned

    Email: "An invitation was sent to you by Rich Skrenta to try a new search engine developed by blekko." Nice

    Just got my blekko account...i like.

    Repost! #blekko rocks, Google killer. I've seen Chacha, Mahalo, Quintura, even Bing redesign, so I totally didn't expect anything... stunned

    maybe... #blekko could be a first collective intelligence search engine. if success, it will be a google killer. over the PageRank :-)

    ooh no #bing versus @blekko what to do? Google is no more search engine of choice - I'll sleep on it

    Figuring this @blekko thing out. Slash tags seem pretty awesome

    @blekko I'd also love to see API slashtags for /facebook, /google, /posterous, /tumblr, /wordpress and /identica or /statusnet.

    @blekko looks very interesting, great done guys! Now you have one more user ;-)

    Blekko.com - Why another search engine? Because this one is unique.: http://wp.me/pOEb4-2B

    i got my #blekko beta invite. this is awesome. my new search engine of choice.

    @ScepticGeek You are missing the commercial aspect of blekko! There is need for YBoss alternative - Bing's index is worse than blekkos!

    I like blekko.com. Start wondering how search engines could even exist without #slashtags

    Thumbs up to Blekko. Community driven Slashtags are a very cool twist on search. Very useful. Congrats! http://blekko.com/

    @blekko :) Thanks for the invitation - it´s more revolutionary as i heared about.

    Playing around with #blekko and finding it really cool.

    The new @blekko search engine is awesome! Good job!

    Testing Blekko and this is not bad, not bad at all. It actually has way more features than any other search engine.

    playing with blekko tonight, I like the search results i'm getting

    surprised: @blekko is delivering significantly better results than google...

    Trying @blekko and its fantastic! Can conceive of some cool /slashtag tools using set operators: /politics - (/conservative + /liberal)

    Just tried @blekko search beta and with out doubt its impressive ....they are sure to change the future of searching the web

    Hot Startup #3 this week is @blekko. Full web search engine that provides query refinement features called "Slashtags" http://bit.ly/aSizkO

    Wow, blekko made my life happier. http://blekko.com/

    Hey @blekko really like what your doing with search!

    @blekko I see slashtags for /alcohol, /beer, /tea and /wine. I'd love to see a built-in /coffee as well.

    My use of google is down ~75% after receiving a beta invite to Blekko and my productivity is up easily as much. Hot damn. #SlashTagsFTW

    Content.mills beware! #blekko lets me blacklist domains. It could have been you Google... do you remember what happened to MS round '95?

    @AndrewGirdwood's review of new search engine Blekko - "modern and up to date" http://bit.ly/9khrcR

    @calebhicks Blekko is definitely a more intelligent way to search. Have you used it extensively yet?

    got beta access to @blekko http://blekko.com using it today instead of Google, good experience up until now

    10-second impression of @blekko: nerdy. very, very nerdy.

    Playing around witk @blekko. Really interesting new approach to search, and some great stuff for SEO purposes

    Anybody want a beta invite to use an awesome new search engine @blekko? Really impressive work they are doing with slashtags and analytics.

    Testing out Blekko search now. Definitely some nice points in terms of the filters. Check out the feature-set here: http://selnd.com/dpFBdW

    Video: A new kind of socially-biased search engine: @Blekko youtube.com/watch?v=tlESXi… (I love this innovative new search engine, here's why).

    Had @blekko for 2 mins, incredibly powerful. UI leaves a lot to be desired. Hope it's improved, can see this becoming my default.

    @blekko is surprising. I need to play more with slashtags but it seems a little more social than google

    trying out #blekko . it's been very fast and accurate, so far... love the #slashtags feature too :)

    #blekko is my default search engine now. It didn't work for some specific long tail very well, but slash tags will make up for it massively.

    Blekko is a stealth search engine which aims to change the search game by being fairly transparent with ranking data. S.. http://dld.bz/ts5a

    Really intrigued at how simply and competently @blekko improved searching the web for me.

    Very impressed with @blekko -- SEO and Analytic features beat commercial offerings, and searching emphasises fun & discovery

    @blekko OK, I'm hooked! I asked for a /coffee slashtag, and here it is. It even searches @counter_culture @starbucks & @coffeegeek

    every time I go to @blekko I started humming a Peter Gabriel song.

    The time is right for a search solution like Blekko. They let me filter out Demand Media? They are my new best friends. http://bit.ly/dtqyvV

    @blekko social search engine looks great, thx for the invitation #blekko this will grow fast I guess

    Just got a nice thank you email from @blekko for raising a bug. No one have ever thanked me for raising a bug before..

    I'm liking @blekko search more and more every time I use it. Just wish they would fix a few bugs.

    Just created a /github slashtag for Blekko, works really well. Surprised there wasn't one already shared.

    @staticnrg Well, it tries. I am newly in love with @Blekko #hcsm

    Got my @blekko invite. Interesting concept! This could actually be usable.

    #blekko is great for lists, especially tracking your competitors - my defintion - #blekko is a new search engine with analytics

    Alright, my twitter has been buggy for days, but all is right again..My Blekko invite hath arrived. Time to slash for new facts. TY #BLEKKO

    Most new search engines fail badly. @blekko is preparing to defy the odds: http://bit.ly/bLpg6V

    Ow. 3am. @blekko ate my sleep cycle. But now I can search for Scala without typefaces, stairs, sign makers, and fonts polluting the results.

    Wow - @blekko is painfully cool.

    ha ha, somebody has broke the #google here so everybody's blocked, but since I switched to #blekko I don't really care :)

    I like @blekko because it's not just a copy of Google - it's something new! I'm /happy /to /be /a /beta /tester #blekko

    @blekko Finally an actual innovation in web search! And, a team that can actually implement it ... can I have an invite?! #blekko

    @gogoeskimo Think of /slashtags as blekko's page rank. Much more effective at filtering noise since they originate from users. $0.02

    @blekko Must say you are doing great with local search for the UK, I wasn't expecting that this early on.

    Blekko is a full web search engine that differentiates itself by offering access to data and algorithms - http://blekko.com/ #search #Blekko

    checking out blekko. i like their puppet.

    I think @blekko could be great if you could edit other people's slashtags. (Tag owner would approve edits.)

    Search results I'm getting from Blekko are better than Google and Bing. Bravo!

    I think the /noporn blekko slashtag is the best thing since sliced bananas.

    @blekko Thanks for the invite, really interesting to see what you are doing with search. I like the SEO data.

    The slashtag feature and the transparency of new search engine Blekko are pretty nifty

    Blog - The Upside-Down Logic of Taking on Google at Search - New search engine Blekko has opened in beta--is it diff... http://ow.ly/18r0tK

    @parislemon you, my friend, should use @blekko for the site removal feature alone let me know if you want an invite about

    Playing with @blekko, I like it!

    @Blekko rocks! (got my invite) RT @scobleizer Video: New kind of socially-biased arch engine: @Blekko

    Getting in on this @blekko love in, it's pretty darn impressive.

    Twenty minutes using @blekko and I'm already loving it. Amazing search experience.

    @blekko thanks for the invite. Surprised there wasn't a /wikipedia slash yet, o/wise cool service, not sold on it yet but fast + useful.

    So true! RT @blekko: RT @spladow: . @blekko has the best default avatars. http://brizzly.com/pic/37SI

    blekko is not half bad at all.

    Thanks @blekko for the invite - already have ideas for a couple of slashtags now to read up on how to make them. :-)

    Just identified a whole network using @blekko pretty powerful stuff

    been having a good play with new search engine @blekko this week and am loving the #seo #slashtag verrrrrrry useful!

    New Search Engine Blekko Takes Vertical Search to a New Level http://bit.ly/cKq9gy | #verticalsearch

    RobBruceMcNair: @mcbay watch this vid with @blekko and @scobleizer especially 16 mins in for real-time link http://url4.eu/73ihJ

    Great start to the day, got a @blekko invite first thing in the morning, 2 KM run and 2 KM walk.

    Trying #blekko and it looks really interesting. I like the ease use of /tag.

    @kristy hehe, meantime I'll just tease you w/ snippets of how awesome it is ;) #blekko

    Going to try to use @blekko as my main search engine now. We'll see how this goes

    @blekko. Love it. Sadly I make a living from Google. But I love it. Can I keep it anyway?

    Just playing around with @blekko beta. Amazing concept, and it works really well already. The /rank slash is incredible.

    September 21, 2010

    Get a cool blekko slashtag man (or woman) tee shirt

    When we started blekko, I was fortunate to find myself working with a really great technical team. I'd worked with most of these folks before at various valley companies over the years and knew that they were killer engineers. We were pretty confident that we had the technical chops to build a new search engine from scratch. We also had a bunch of ideas about a differentiated search experience that could be built...

    But we had absolutely no in-house artistic talent. It's a running joke around our office that we're really good at building ugly web sites.

    I'm envious of startups that are fortunate enough to have a really talented designer in their core team. Unfortunately that just wasn't us. So when we were hunting around for some imagery to go with our brand, we turned to 99designs.

    Our product director had the idea that slashtags turn our users into superusers, since slashtags let them kill spam with a single click, quickly pivot through numerous verticals, and x-ray valuable seo ranking data.

    But the imagery really got fleshed out during our design competition. We wanted something cool we could put on a tee shirt, beyond the simple 'blekko' wordmark.

    Slashtag Man came back from the design competition and everybody loved him. After we made him the default avatar on the site, some of our beta users tweeted that they weren't going to upload their own pic because they liked slashtag man so much.

    We had the designers make a Slashtag Girl so we could put her on the back of the women's tees we ordered.

    The imagery here worked perfectly for what we were trying to do - empowering regular users to turn into super-users when they search.

    To help us celebrate our new super-hero identity, we've got 100 t-shirts to give away. If you want one, just send a note to shirt@blekko.com with your size, whether you prefer a women's tee, and of course your mailing address, and we'll send you a tee shirt.

    We'll also be giving away tees at our booth at upcoming shows - TechCrunch Disrupt, SMX East, PubCon and Defrag. If you're at one of these shows, please stop by and chat and we'll give you a tee if we haven't run out.

    Slash the web!

    September 22, 2010

    The magic of tee shirts and how we easily mail them anywhere - even internationally

    I had an idea at blekko that I wanted to give away a bunch of tee shirts as a marketing promo and a reward for our beta users. People love tee shirts and companies seem to be really stingy with them at trade shows these days. Instead you get a cheap pen or a squeeze ball.

    Companies pay thousands of dollars for an insert in the free bag you get at registration, or a banner on the wall, or to sponsor the cocktail reception.

    I think most of that stuff is worthless.

    For the cost of a banner on the wall at a big show you can make 500 or 1000 tee shirts.

    Instead of a logo on the wall that nobody even notices, you can have hundreds of cool industry people wearing your logo a couple of times a month. All day at work! And people love getting the shirts.

    When I told my team I also wanted to mail a bunch of shirts to our users they initially freaked out. Just how do you do that efficiently? Fortunately our marketing director Stephen had already solved this problem.

    He is a passionate motorcyclist and commutes to work every day on his BMW R1200GS. He actually travels about 20,000 miles a year on his bike. Two years ago he started a side business selling motorcycle accessories.

    So he quickly set blekko up with the same shipping solution that he had built for his company.

    Within a few days we had a Dymo LabelWriter 4XL along with rolls of Dymo labels. We subscribed to Endicia for Mac. Bubble mailers came locally from RoyalMailers.com located in Emeryville, CA. Within minutes we were printing our own USPS shipping labels with integrated stamps.

    We can print international shipping labels just as easily, with the PS-2976 customs form built right in. The great thing about Endicia's service is that it includes the round postmaster stamp on each label, so you can drop international shipments right into a mailbox without having to go to the post office for inspection.

    This stuff is really slick... With this setup our community manager Cheralyn was able to package and mail a huge stack of tee shirts in just a couple of hours.

    If this had been done by hand it would have taken a couple of days.

    Here are some of the international requests we got:

    Also I have only little hope that you will ship one of this gorgeous shirts to Germany. PLEASE DO!

    Not sure if you ship to Canada?

    I'd wear it at all the barcamps & at work (I am a webdesigner)

    Of course we in Austria - Europa want to help spread your super-hero identity!!!

    Don't know do you ship in Croatia, but it would be really great to wear one on my job :) i would sure be unique :)

    I'd love an awesome t-shirt please please please - I'm ready to slash the web. I doubt you will send one to Germany, though, even if it would be very shrewd international marketing …

    Non-US folks are blown away when they find we actually will mail a shirt to them.

    September 25, 2010

    team blekko at techcrunch disrupt hackathon

    Note other key tools in addition to the macbooks - diet dr. pepper and the bottle of scotch in the middle of the table.

    October 31, 2010

    Crowdsourcing search relevance


    How on earth do you try to disrupt the search space?

    Search requires not only a big software system, but also a massive set of relevance data to help the algorithms make sense of the billions of pages on the web.

    Bing and Google have hundreds of contractors that use web tools to refine this relevance data - classifying porn, spam, domain parks, ecommerce sites, fake 404's, markov-spam, official sites, and so on.

    As a 20-person startup, we asked ourselves how blekko could assemble this essential data. Hire contractors? Use Mechanical Turk? Elance?

    But - of course! - we know a much better way.... A way you can get orders of magnitude greater participation, while at the same time being very open about the process.

    Let the public in.

    We realized we could make web tools that let users sign up and help make the search engine better. If we opened up the process, we could not only get orders of magnitude more people involved than we could ever hope to employ, we could also create an open, accountable process around the search engine relevance data.

    Not everyone has to participate for the model to work - most people don't edit Wikipedia, yet we have a vast encyclopedia which long ago dwarfed the closed Britannica.

    But a small fraction of the web audience that does get involved can help make the search experience better for everyone else.

    We're starting by letting users define their own vertical search experiences, using a feature we call slashtags. Slashtags let all of the vertical engines that people define on blekko live within the same search box. They also let you do a search and quickly pivot from one vertical to another.

    I was looking for a 2% cash back credit card. 1% cards are pretty common, but 2% cards are harder to track down. [cash back credit card] is a trainwreck of spammed results on any engine. So I made a blekko /money tag with the top 100 personal finance bloggers that I got from Kiplinger's. Bingo: [cash back credit card /money] and I have great results.

    Being able to go into a spammy category like health, personal finance, hotels or even lyrics and search just the best sites immediately uplevels the results. Trusted sources with no spam.

    And our users are making tags we would never have thought of. One of our users created a /glutenfree tag, so you can search [chicken soup /glutenfree]. Another created a slashtag for user experience design sites, so now we have a great /ux tag.
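
    Under the hood, you can think of a tag like /money or /glutenfree as a curated set of hosts applied as a whitelist over an already-ranked result list. A minimal sketch, with made-up hosts standing in for the real curated list:

      from urllib.parse import urlparse

      # Hypothetical stand-ins for a real /money slashtag's curated hosts.
      MONEY_TAG = {"kiplinger.com", "bogleheads.org", "mymoneyblog.com"}

      def host(url):
          h = urlparse(url).hostname or ""
          return h[4:] if h.startswith("www.") else h

      def apply_slashtag(ranked_urls, tag_hosts):
          """Keep only results whose host is on the slashtag's curated list."""
          return [u for u in ranked_urls if host(u) in tag_hosts]

      results = ["http://www.kiplinger.com/article/credit.html",
                 "http://spam-farm.example/cash-back-cards"]
      print(apply_slashtag(results, MONEY_TAG))
      # -> ['http://www.kiplinger.com/article/credit.html']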

    We have a vision of curated algorithmic search that brings quality back to the web at scale, and involves the public to get there.

    We're just getting started though, so stay tuned.

    Cool searches which show off features:

    cure for headaches
    pregnancy tips
    big island resorts
    industrial design colleges
    pan fried noodles
    obama /date
    global warming /conservative
    bioinformatics /people
    blekko /links
    blekko /links /date
    blekko /links /date /rss
    techcrunch /seo
    om malik /rank

    Read more:

    A New Search Engine, Where Less Is More - New York Times

    Start-Up Aims at Google: Blekko.com Taps Users to Narrow Results, Avoid Spam Sites - Wall Street Journal

    Blekko, The “Slashtag” Search Engine, Goes Live - Search Engine Land

    With Help from You, New Search Engine Slashes Through Spam - Wired

    Alternative Search Engine Blekko Launches to Eliminate Spam in Search - Mashable

    Blekko launches the biased search engine - CNET

    Search engine Blekko to rely on the human touch - Reuters

    Google And Blekko Head-To-Head: Blekko Lives To Fight Another Day - Search Engine Land

    Power to the People: A new search engine, Blekko, uses human editors to promote quality pages and block spam content from its results. - MIT Technology Review

    Interview of Rich Skrenta, of Blekko + Topix + DMOZ Fame - SEO Book

    One Reason To Take New Search Engine Blekko Seriously - Business Insider

    Blekko: The Newest Search Engine - PC Mag

    "I recommend Blekko. It's the best out-of-the-chute new engine I've seen in the last 10 years, seriously." -- John C. Dvorak

    November 1, 2010

    Anatomy of blekko's press launch

    Today’s news generated coverage in several top-tier national business outlets, including NYT, WSJ, AP, Reuters, Huffington Post, WIRED, FT, TIME, and BBC, as well as multiple leading tech outlets, such as Mashable, PC Mag, PC World, CNET, eWeek, ZDNET, BusinessInsider and many more.

    The AP piece received more than 70 reposts in top-tier outlets including CBS News, CNBC, the NBC Today Show, the LA Times, the Washington Post, the Huffington Post, and the Seattle PI.

    Broadcast coverage has also been really strong with more than 85 airings nationwide mentioning Blekko’s launch.

    Rich made a great appearance on Bloomberg TV.

    We’ve also received coverage from NPR, as well as NBC, CBS and ABC network affiliates in the Top DMAs including New York, San Francisco, Boston, Los Angeles and Dallas.

    Today’s news also generated quite a bit of social media buzz with over 3,200 tweets to date. The Mashable piece alone received more than 1,400 retweets.

    After the Wall Street Journal broke our press embargo 5 hours early, someone on Hacker News asked why we were launching the site on a Sunday afternoon.

    But if you want to be in the Monday morning press, a writer's story needs to be done and ready by Sunday. Edited and approved and fact-checked and, if it's going to be in print, sent to the printing press. And you had to meet with them to tell them your story before that. So the article is actually done long before you read it, and is just working its way through some process until it lands on someone's doorstep or pops up on a website.

    For any kind of big press announcement, however, there isn't a single story. You want many people writing about you - for something major, like a new product announcement, as many as you can get. So you have to coordinate a bunch of different writers, and try to get all the press to show up at the same time.

    You coordinate multiple stories coming out at the same time with an "embargo", which is generally a cluster-fsck, because trying to get 20 journalists to agree to all hit "publish" at the same time on a story is like herding cats. The embargoes have been broken on every large PR event I've ever been a part of. Sometimes it's intentional, sometimes it's just a mistake.

    Nearly all business press comes out this way.

    If a startup decides to not bother with all of this embargo stuff, they don't get a press pop. No Techmeme, no Digg, no Hacker News, no Reddit, no Google News, no Twitter glow. No secondary press -- reporters tend to write about what they hear a lot of other reporters writing about. You just see a random story here and there occasionally.

    I did 20 press briefings last week. We had so many back-to-back interviews last Friday we hired a cab to drive us around the city all day. We couldn't have kept the schedule if we had to find parking each time.

    btw, if you don't have a great PR firm, you won't have this problem. You won't have 20 meetings in one week with 8 of them on Friday.

    I love the part where I get to tell the story. The more open you are, the more interesting it is. Just tell them what it's really like to be an entrepreneur trying to push out some crazy-brained idea on the market. How you raised money, got people to join, found cheap hotel rooms for the launch and got camping cots and a crock-pot of chili and whatever else you did. It's way more interesting than some dry old press release. And more fun to tell, too.

    In the end we got a ton of press - nearly all of it positive.

    But it doesn't just happen by itself.

    November 6, 2010

    ReadWriteWeb: "What have these people built?"

    Teaming with excitement after completing giant post about how to be a bittersweet power user of @Blekko... The best parts of the Internet make me feel all the more alive and I'm going to write some crazy shit before I leave this world like the original vision of a beautiful startup's founding dream.
          - Marshall Kirkpatrick
    ReadWriteWeb: How to Use Blekko to Rock at Your Job

    November 12, 2010

    Domainers comment on blekko

    He is totally correct. It is totally in the long-term best interest of search engines to de-index all mass-produces content and all affiliate arbitrage and ppc arbitrage. And it does mean de-indexing all DM, Epik, all domainer mass-development efforts, and sites like business.com, etc...
          -- Johnny
    Of course sham ‘developers’ are quick to point out immense value of their bogus content. Really though, step outside this wee little community bubble and ask *anyone else* in the internet world about this; the prevailing theme is abject disgust over what the internet is becoming and what monetarily incentivised content is doing to the quality of information found on the web.

    The entire internet is rapidly turning into a contrived-content landfill and to be sure, a movement is slowly but surely taking place to offer alternative search solutions. If G has anything resembling an Achilles heel, this is the closest thing to it.

    At its core, much like direct keyword navigation, the profitability of trash content is really nothing more than a user-sophistication issue. There are still enough clueless users out there on the web- ones that cannot differentiate between a splog adfarm and a legitimately information-rich page- to keep the bounce rate just low enough and clicks just high enough to stay black… There are just enough ‘grannies getting their first computers’ to keep the game alive. The thing is, this is changing at light speed and in time, the users demanding higher quality content will no longer be limited to tech geeks and people who really ‘get’ the web. It will be everyone.

    I was recently researching tax lien investing. For one particular keyword string, an Ezine article ranked very high.

    Kinda like how a liar can spot another liar or a thief can spot another thief better than anyone else, I immediately recognized this as farmed, drivel content. The problem was, the information it conveyed was wholly incorrect, in spite of the narrative being written with an authoritative tone, in spite of being ranked shockingly high in serp. It was obviously written by someone who knew precisely *nothing* about the topic at hand but was getting paid to write an article, so they hit the expected research sources, formed a dirty, five minute opinion and stood themselves out there as a bonafide expert… and once they were done writing that article, they repeated that same intellectually bankrupt process with their next paid articles on Alpacas, Forex, Lawrence Kansas Home Mortgages, Medical Tourism in India or whatever else their employer paid them to write.

    This is not a sustainable model for the web. G is in a crappy spot since their monetization schemes are the impetus that drive so much of this, yet it all goes against their larger philosophy about content quality… If a challenger ever arises to threaten their dominance, it will be by devising a better algo to filter out this crap and deliver cleaner information to John Q Netguy.
          -- Anon

    Did Blekko launch the "Minimum Viable Product?"

    I posted a response on Quora to a question titled "How long is too long to release a minimum viable product?". One commenter asked about blekko:

    It's interesting that Blekko was in development for over three years before launching. Granted, it's a search engine, but the world has changed an awful lot in those three years.

    We felt that the threshold for a "minimum viable product" in the search space was higher than for other products, because the expectations are so high. Negative reception on launch day tends to set a permanent impression in the market which is difficult to recover from, as Cuil found.

    In part this is because search engine launches tend to get more attention than launches in other product categories.

    On one hand, people have told us we're crazy to be even trying to take on web search because it's impossible... On the other hand they ask why it took us so long. ;)

    For our previous company, Topix, we soft-launched a prototype after 9 months of development, and then followed up 3 months later with a bigger press launch. We could do that with Topix, but I don't think it would have worked with blekko.

    Some of the dings we've gotten in the launch press actually seem pretty reasonable if you consider that we have a 10-day-old web search engine being compared against ... Google.

    Two weeks post-launch, we have a bunch of fans, sustained search traffic 24/7, and users creating slashtags; bloggers are writing how-to posts and guides to using blekko, and we are receiving tons of great feedback from our initial surge of users.

    So maybe we launched the Minimum Viable Product after all.

    November 23, 2010

    blekko partners with DuckDuckGo

    When we founded blekko, we decided to find a new playbook to launch and grow our search startup. We deliberately avoided playing into the old hype of being called a "Google killer". We also resolved to work with other search startups, especially ones that shared our conviction to eliminate webspam.

    So I'm pleased to announce blekko's first search partnership, with fellow search startup DuckDuckGo. When DuckDuckGo users search on a term that matches one of blekko's seven auto-fired slashtag categories, they will see results from blekko. (The seven categories are health, colleges, autos, personal finance, lyrics, recipes and hotels.)

    As part of this partnership, blekko users will have access to DuckDuckGo's "Zero-Click Info" on a site-by-site basis. Zero-Click Info helps users find the most relevant information on sites and search terms without having to click on search results.

    We’re happy to work with Gabriel and the team at DuckDuckGo. And not just because we both have weird names. It’s because we can kill spam a lot faster working together than we can working against each other. :-)

    Read more:

    Blekko Partners Up With Search Engine DuckDuckGo (TechCrunch)

    Alternative Search Startup Blekko Announces First Partnership (Mashable)

    DuckDuckGo/blekko search partnership (GabrielWeinberg.com)

    Blekko and DuckDuckGo Launch Search Partnership (BusinessWire)

    Blekko Announces its first Partnership (Marksonland)

    November 28, 2010

    Algorithmic search is sinking

    There's a fascinating story in the New York Times today about an online retailer who actually increased sales and profits by insulting, threatening and even cheating his customers because the more online complaints he got, the better he ranked on Google.

    A woman who purchased eyeglasses on one of this online retailer’s sites was harassed and stalked for weeks because she tried to return a purchase. At one point the online retailer told her, "you put your hand in fire. Now it’s time to get burned." The woman told the New York Times, "This might sound like an exaggeration, but I feared for my life. I was actually looking over my shoulder when I left my apartment."

    It turns out that the hundreds of online complaints being written about this bad actor were perversely fooling search algorithms into believing this was a quality site because it had a large number of inbound links. In fact, this retailer would intentionally begin battles with customers when he needed to drive an increase in traffic.

    Unfortunately this is just a single appalling story in a huge trend we're seeing. There is a finite set of decent retailers you might want to buy stuff from online. But there is an ever-increasing number of spam sites on the web. We're at the point now where there are far more fake retailers than real ones online. The bad sites are getting ever more sophisticated at appearing legitimate, to consumers as well as to search engines.

    Algorithmic search is sinking.

    The only way to combat this and return trust and quality to search is by taking an editorial stand and having humans identify the best sites for every category. The algorithm can't find its way through the web's growing hall of mirrors anymore. And it's only going to get worse.

    November 29, 2010

    blekko hits broadway?

    I got this handwritten pitch in the post this morning. I had been thinking about getting a billboard on 101 for blekko but this would be so much better. We could have a huge electric sign in Times Square. Of course this might cut into our tee shirt budget a bit but I think it would be worth it.

    December 15, 2010

    Friends Make Search Better

    Social graph, meet the link graph. Link graph, meet the social graph. We're sure you two will be fast friends.

    One of the reasons we built blekko is that we believe that with respect to ranking, a link, well, just ain’t what she used to be. In 1998, a link to a web site WAS the first social vote of quality. Someone took the time to log into their Geocities page and type in:

      <a href="http://marksonland.com">Marksonland</a>
    

    Now? Most links are auto/mass-generated with the sole purpose of gaming the search engines.

    You know what’s not gamed? Likes. Likes are your actual friends going around the internet telling each other the sites they think are good and bad. Friends don’t spam – and friends don’t let friends like spam.

    Another way to think about it is that your friends are already curating the web every time they click the Like button. Blekko is all about human curation. Bringing Likes directly to search results is yet another method by which blekko is fighting the good fight of keeping spammers out of your search results.

    How it works:

    1. Log onto blekko through Facebook Connect. If you are already a blekko user, there is an option to sync your existing account to FB as well.

    2. When you log in, blekko automatically creates a slashtag for you called /likes. /likes will include all the sites you and your friends have “Liked.” (If you have a lot of likes, it could take a few minutes to populate your /likes)

    3. Every search you do will layer in Like information about a particular site.

    4. You can search only the sites you and your friends like by appending /likes to the end of any query (ex. san francisco sushi /likes)

    5. You can sort any search result page by number of likes (as opposed to date or relevance) by clicking the icon next to the date button on the top right.

    6. You can like any site directly from blekko by clicking the Like button on the second line of search results.
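
    As a rough sketch of how steps 2 and 4 might compose - illustrative only, with a hypothetical shape for the Like data rather than Facebook's actual API response - building /likes amounts to collecting the distinct hosts behind the Liked URLs and using them as a whitelist:

      from urllib.parse import urlparse

      # Hypothetical URLs you and your friends have Liked.
      liked_urls = ["http://www.example-sushi.com/menu",
                    "http://marksonland.com/post"]

      def host(url):
          h = urlparse(url).hostname or ""
          return h[4:] if h.startswith("www.") else h

      likes_tag = {host(u) for u in liked_urls}
      # A query like [san francisco sushi /likes] would then restrict
      # results to these hosts, as in the earlier whitelist sketch.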

    We’ve demo’d this feature to a few folks and everyone is pretty much uniformly blown away by it. We hope you are too.

    This is a seriously cool integration of Facebook data with slashtags. It demos quickly and the "aha" hits fast when you see your own social data spinning results. Check it out - http://blekko.com/

    More: TechCrunch: Blekko Goes Social, Now Lets You Search Sites Your Friends Have ‘Liked’ On Facebook

    January 6, 2011

    Introducing the Spam Clock


    www.spamclock.com

    I consider myself a glass half full kind of guy, but it's hard to remain optimistic about the future of the World Wide Web. I think it's fantastic that my kids have access in real time to almost every piece of information and knowledge in the world. But ever since we started working on Blekko, I've become exposed to the dark side of the Internet.

    Scratch below the surface of all this great information, or in our case dig deep below the surface, and it is shocking what is happening to the Internet. Millions upon millions of pages of junk are being unleashed on the web, a virtual torrent of pages designed solely to generate a few pennies in ad revenue for their creators. I fear that we are approaching a tipping point, where the volume of garbage soars beyond and overwhelms the value of what is on the web. Look at what has happened to email: Microsoft estimates that 90 percent of the mail that passes through its Hotmail servers is spam.

    What happened to email was the result of very powerful economics. Spammers and con artists discovered they could reach a massive audience for pennies. And this scale of audience essentially guaranteed a very small but profitable return. Today the economic incentives for web spammers are even more lucrative than email spam and almost guarantee a continued blizzard of trash on the web.

    Web spammers simply have to create pages on the web and sit back and let search engines send them money. Current search engines have abandoned any attempt to enforce even the slightest modicum of quality control. Revenue is guaranteed if a page can draw a click.

    The result is a global sweatshop workforce cranking out millions of pages of web trash. I fear we are looking at the very scary future of the web in the job postings at Mechanical Turk. Researchers recently reviewed job postings there and found that 41 percent of all jobs offered over a two-month period were aimed at recruiting workers to create spam. Most of these jobs offered folks a measly dollar a page. Some paid as little as 5 cents. But all these jobs are being filled and the spam gets spewed out.


    ("The most infamous girl in the history of the Internet")

    Consider that in 2000 there were about 7 million hosts on the internet offering essentially all the content on the web. By 2010, the number of web hosts had soared to 250 million. How many of these 240-plus million new hosts offer legitimate content? A small fraction. The rest is spam.

    Which brings me to my larger point. This spam on the web is creating REAL problems that are affecting much more than our ability just to find information.

    The energy and other costs for crawling, storing and serving this trash is soaring. I saw a recent estimate that 15% of the world's energy consumption in 10 years could go to support Internet usage. A fair amount of that energy is being burned by the thousands upon thousands of servers at incumbent search engines. Making search greener by weeding out spam could have a significant impact on energy consumption.

    The problems and challenges of spam to the entire world are going to get worse. As the online economy continues to grow at double digits compared to stalled growth for the offline economy, the incentives for spammers get even more lucrative.

    That's why we've created the world's first Spam Clock. This clock is going to record in real time the amount of web spam that is being spewed out. The clock is designed to bring greater attention to this growing problem. While it is illustrative more than scientifically accurate, it is truly indicative of the soaring spam problem.
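
    Mechanically, a counter like this is just extrapolation from an assumed rate. A toy version - the rate here is back-computed from the clock passing 1 billion on day 41 of 2011, about 282 pages per second; the real clock's inputs may differ:

      import time

      SPAM_PAGES_PER_SEC = 1_000_000_000 / (41 * 86400)  # ~282 pages/sec
      EPOCH = time.mktime((2011, 1, 1, 0, 0, 0, 0, 0, -1))  # 1/1/11

      def spam_count(now=None):
          """Extrapolated count of spam pages created since 1/1/11."""
          now = time.time() if now is None else now
          return int((now - EPOCH) * SPAM_PAGES_PER_SEC)

      print(f"{spam_count():,} spam pages and counting")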

    Finally, what can we do about this? Honestly, we think our search engine can be an important solution but we need your help. If we can together create a search engine that is a curated resource of the best trusted sources on the web, we can do a great deal to reduce the economic incentive for creating spam. Spam operators won't even offer that nickel on Mechanical Turk if the chances are pretty good that a human editor will never include that page in the search database.

    So we'd like to invite web searchers everywhere to help us clean up the web. It can be done. If we can just organize the best sources of information for the top 1000 search verticals we will drastically improve the web experience. And we will immediately create the first ever disincentive for polluting the web.

    Please join us.

    Read more:

    Blekko launches Spam Clock to keep pressure on Google (Danny Sullivan)

    The Spam Clock is live (Marksonland)

    February 11, 2011

    Burning Spam!

    The Spam Clock, which measures how many pieces of spam have been created on the internet since 1/1/11, passed 1 billion today. Only 41 days into the new year.

    We decided to commemorate this milestone in a special way. Watch:

    Also see Marksonland's take...

    February 15, 2011

    blekko + stackoverflow = better programming slashtags


    I've been hugely impressed with the programming community that Jeff Atwood and Joel Spolsky have built at Stack Overflow. In a short time Stack Overflow has risen to be the dominant programmer community on the web. And it was created, in part, as a response to frustration with running into a content farm that was spamming programming queries.

    Stackoverflow is sort of like the anti-experts-exchange (minus the nausea-inducing sleaze and quasi-legal search engine gaming) meets wikipedia meets programming reddit.
            -- Introducing Stack Overflow

    (Interestingly - and unrelated - we recently banned experts-exchange.com at blekko after tallying our users' /spam votes and noting that experts-exchange was the #1 most disliked site on blekko.)

    Jeff and Joel have built a vibrant community of experts, and we felt they'd be able to help us edit blekko's programming and tech slashtags. Stack Overflow members have already suggested new slashtags that we've created, and we've begun adding their members as editors to slashtags.

    The full list of slashtags (so far) that Stack Overflow will be overseeing is:

    /android
    /bsd
    /cloud
    /couchdb
    /css
    /directx
    /dotnet
    /emacs
    /freebsd
    /fsf
    /hackerspaces
    /hadoop
    /hpc
    /ipadapps
    /it
    /java
    /js
    /lego
    /linux
    /mongodb
    /ms
    /nosql
    /open-source
    /opengl
    /perl
    /php
    /python
    /rails
    /ruby
    /so
    /sql
    /tech
    /techblogs
    /ubuntu
    /unix
    /utf8
    /ux
    /videogames
    /vim
    /webdesign
    /windows

    March 3, 2011

    blekko trading cards

    blekko is at four tradeshows this month, including SXSW. We wanted to do a cool booth giveaway rather than just a pen or a squishy ball. So we ran a contest and had actual artists come up with some blekko comics that could fit on a trading card. The project turned out way better than we expected.

    Read more about it at the blekkoblog.

    April 6, 2011

    Web startup, circa 2004

    Topix. The 2 servers on the table were the whole site at this point. $60m exit 15 months later.

    We were in the cheapest office space we could find with a Palo Alto zip code. It was above a trophy shop in a termite-infested wooden building. But we could open the windows! (and smell fumes from the cabinet-painting shop across the street).

    Palo Alto fiber was in the street in front of our office, but actually getting access to it was bureaucratically impossible. An abortive effort to put a fast microwave link on the roof of the building went nowhere but wasted a lot of time. So we pulled a T1 (router is on the wire rack above the servers) and were in business.

    May 13, 2011

    blekko did 50m searches in April, 750k users

    blekko got some great coverage last Sunday in the NY Times: An Engine’s Tall Order: Streamline the Search

    Earlier this week we made an announcement about enhancing searcher privacy. Key highlights:

    • Personal information (such as IP addresses) will be retained a maximum of 48 hours (see the sketch after this list)
    • A new HTTPS Preferred® system, which automatically points searchers at HTTPS (secure) websites in many cases
    • SuperPrivacy® and NoAds opt-out privacy settings allow users to suppress ads and reduce logging of search keywords
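
    A toy illustration of the 48-hour retention rule above - my sketch, not blekko's actual logging code. It assumes each log record is a dict carrying a timestamp ts and an ip field; both names are hypothetical:

      import time

      RETENTION_SECS = 48 * 3600  # retain personal info at most 48 hours

      def scrub(records, now=None):
          """Drop the (hypothetical) personal 'ip' field from old records."""
          now = time.time() if now is None else now
          for rec in records:
              if now - rec["ts"] > RETENTION_SECS:
                  rec.pop("ip", None)
          return records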

    And finally, blekko has grown traffic every month since our launch last November. April visitors were up 30% from March, with 750,000 unique visitors coming to blekko.com doing 50 million searches.

    Not bad for a 5 month old search engine. :-)

    September 20, 2011

    Blekko's not afraid of Google, why is Washington?

    Eric Schmidt will appear before a Senate committee tomorrow to defend Google against claims that it has abused its position in the marketplace.

    Apparently this is the prize when you win really big: you get to pitch your startup to Congress.

    The former tech darling has begun to assume the same status of “startup grown too big for its britches” that was once hung around the neck of its nemesis Microsoft.

    But we don’t need federal intervention to level the playing field with Google. Innovation and competition are far more powerful instruments to battle companies that have grown powerful and influential. Which has been more detrimental to Microsoft's business? The lawsuit brought by the Department of Justice in the 90s, or the innovative products Apple has brought to the marketplace?

    The success of Google should be applauded on Capitol Hill, not derided.

    Let’s let entrepreneurs, technology and good old-fashioned innovation deal with Google. Consumers will always be the winners in that scenario.

    September 29, 2011

    blekko raises $30m, adds Yandex as strategic investor

    From Russia with love: Yandex backs US search startup Blekko with $15 million, computing power (AP)

    Search Engine Blekko Raises $30 Million From Russian Search Giant Yandex And Others (TechCrunch)

    Upstart Search Engine Raises $30 Million, Gets Investment From Russian Search Company Yandex (Business Insider)

    Blekko Takes on Google — With Help From Russia (Mashable)

    Blekko Closes $30M Funding – Yandex Strategic Investor (blekko blog)

    January 7, 2013

    blekko launches izik: tablet search reimagined

    Friends of blekko!

    We are very pleased to announce izik, our new tablet search app. We launched izik on Friday, and today it is the #3 free reference app in the Apple app store.

    We believe that the move to the tablet from the desktop/laptop is an environmental shift in how people consume web content. We have developed a search product that addresses the following unique problems of tablet search:

    • Typing is harder on tablets
    • Context needs to be in the result set to accommodate shorter queries
    • Swipe features & gestures
    • Tablets are image driven
    • Multiple browser windows aren’t a right click away (so clicking on a bad result is more punitive on a tablet)

    The product we have developed, izik, is the first search experience specifically optimized for the tablet. It leverages our core technology to create a truly unique search experience for tablet users that is both functional and beautiful.

    Press coverage:

    Blekko Launches Izik, A Tablet-Optimized Search App (TechCrunch)

    Blekko Launches New Tablet Search Engine “Izik” (Search Engine Land)

    New search engine sets out to prove Google isn’t the best option for finding things on tablets (Washington Post)

    izik: Take Search for a Joy Ride on Your Tablet (blekko blog)

    Meet Izik: Tablet-Friendly Search From Blekko (Search Engine Watch)

    Izik is a great little internet search app built for your iPad (TUAW)

    A Search Engine Made for Mobile Devices (New York Times)

    Izik Review (Maclife)

    Download the app:

    Apple app store - iPad

    Google Play - Android

    Mobile site:

    izik.com
