
May 2007 Archives

May 2, 2007

Digg's huge PR bonanza

A few days ago I wrote about how the disintegration of mass media has led to escalating costs to launch a new brand... $150M these days, up from single-digit millions in the 1960s. That's a huge run-up, even adjusting for inflation.

Blake presciently noted that getting sued could be one of the best ways to achieve tons of cheap PR:

So my question is (always mindful of the "All Press is Good Press" cliche): Are we getting to the point where a lawsuit becomes the most cost-effective way to bootstrap-market yourself?

In the wake of Digg's massive PR jackpot, others have noticed this too. Andy Beal wrote:

This whole mess has created a lot of publicity for Digg. It has demonstrated how powerful it is and how influential the voice of its users is.

Yeah. :-)

May 3, 2007

"My spoon is too big!"

I must be wrong in the head to like this so much. I couldn't stop laughing.

In particular the end apocalypse sequence (starting at 7:00 min) is amazing.

Update: I'm not wrong in the head! (not about this, anyway :)

A little digging for this post turned up the fact that the animator, Don Hertzfeldt, was nominated for an Academy Award for this short (which is titled "Rejected"). Apparently it has received over 27 awards, and it's the #3 most popular short of all time according to IMDB.

May 4, 2007

yahoo.msn.com

My first reaction on seeing the Techmeme headline about msft-yhoo was "pretty please!" Though I'd rather see Google try a giant acquisition deal and throw itself into the tar pit.

But then I felt kind of sad, because msft and yahoo aren't really standing in the way of anyone else succeeding right now. In fact they're struggling hard to compete, trying their best... and then this comes along. Anyone who has worked in a bigco knows what this kind of nonsense does to productivity. Imagine every single one of your employees spending hours today talking about this. Well, it's Friday at least. But they'll keep talking about it on Monday...

May 8, 2007

Markson: The Top 10 Reasons Why Newspapers Are Sinking Online

I was going to blog about the whole newspaper death spiral business in the WSJ, given that last year we built an entire system with the AP to map local stories back to their originating publications, in part to address concerns such as Hussman's. But Mike's beaten me to the punch, and it's a good thing since he's got a far more comprehensive take on the state of the news industry. He pretty much covers everything...and it doesn't look good.

Marksonland: The Top 10 Reasons Why Newspapers Are Sinking Online

May 9, 2007

Giving up on Microsoft?

Jeff Atwood giving up on Microsoft? Holy cow.

There is a huge gulf between Microsoft and Unix developers. I somehow missed walking down the Microsoft road, since I'd started on the Apple II (BASIC, 6502 assembly, Pascal) and never had an IBM PC back then. Then when I got to school it was TOPS-20 and VAX/VMS and a little bit of Unix here and there. And by the time I got a PC, it wasn't to run Windows, but rather SCO XENIX on my 286.

I thought this was going to catch up with me around '93, since it looked like Windows was going to kill Unix dead, and then I'd have to start over and learn all this msft stuff. But no, the Internet came along, and suddenly I could code "client server" cross-platform GUIs with print statements. Thank f'ing god, I thought.

And it turned out Unix seemed a whole lot better suited to server software, having been designed as a multiuser OS from the beginning. There were horror stories of startups paying 24/7 operators to sit watching banks of NT machines and rebooting them when they froze, and of the initial failed attempt to migrate Hotmail off of Unix after it was acquired by msft. Whereas we'd routinely get uptimes of hundreds of days on our Unix servers. (Heck, the uptime for this machine is currently 158 days.)

At this point it doesn't seem to come up much anymore. As Jeff points out, there don't seem to be many web startups running on a Microsoft platform. When they do crop up, you know their tech isn't likely to be very strong. You see nonsense like Dipsie supposedly being "the next Google," but then you hear they're coding everything on Microsoft and you don't have to pay any attention anymore, since you know there's nothing there. There are the odd successful standouts like Fog Creek shipping actual PC apps, but they seem increasingly rare.

You can probably even avoid buying the usual raft of PC stuff on the business side now: it's thousands of dollars, and installation and maintenance are a pain. Raw Linux could be a bit much for a bizdev or marketing employee to use, but OS X + Google Apps is probably a good enough replacement.

May 10, 2007

14 rules for fast web pages

Steve Souders of Yahoo's "Exceptional Performance Team" gave an insanely great presentation at Web 2.0 about optimizing website performance by focusing on front-end issues. Unfortunately I didn't get to see it in person, but the Web 2.0 talks have just been put up, and the ppt is fascinating and absolutely a must-read for anyone involved in web products.

His work has been serialized on the Yahoo user interface blog, and will also be published in an upcoming O'Reilly title (est publish date: Sep 07).

We have so much of this wrong at Topix right now that it makes me want to cry, but you can bet I've already emailed this ppt to my eng team. :) Even if you're pure mgmt or product marketing, you need to be aware of these issues and how they directly affect user experience. We've seen a direct correlation between site speed and traffic.

This is a big presentation with a lot of data in it (a whole book's worth, apparently), but halfway through he boils it down into 14 rules for faster front-end performance:

  1. Make fewer HTTP requests
  2. Use a CDN
  3. Add an Expires header
  4. Gzip components
  5. Put CSS at the top
  6. Move JS to the bottom
  7. Avoid CSS expressions
  8. Make JS and CSS external
  9. Reduce DNS lookups
  10. Minify JS
  11. Avoid redirects
  12. Remove duplicate scripts
  13. Turn off ETags
  14. Make AJAX cacheable and small
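
As a quick illustration of rules 3 and 4, here's a minimal sketch of a CGI-style Perl handler that sends a far-future Expires header and gzips its output. This is my own toy example (assuming IO::Compress::Gzip is available), not code from the talk:

    #!/usr/bin/perl
    # Toy illustration of rule 3 (Expires header) and rule 4 (gzip components).
    use strict;
    use warnings;
    use IO::Compress::Gzip qw(gzip $GzipError);
    use POSIX qw(strftime);

    my $body = "<html><body>Hello, world</body></html>";

    # Rule 4: compress the component before sending it over the wire.
    gzip \$body => \my $gzipped or die "gzip failed: $GzipError";

    # Rule 3: a far-future Expires header so the browser caches the component.
    my $expires = strftime("%a, %d %b %Y %H:%M:%S GMT",
                           gmtime(time + 10 * 365 * 24 * 3600));

    binmode STDOUT;
    print "Content-Type: text/html\r\n";
    print "Content-Encoding: gzip\r\n";
    print "Expires: $expires\r\n";
    print "Content-Length: " . length($gzipped) . "\r\n";
    print "\r\n";
    print $gzipped;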

The full talk has details on what all of these mean in practice. The final slide of the deck is a set of references and resources, which I've pulled out here for clickability:

book: http://www.oreilly.com/catalog/9780596514211/
examples: http://stevesouders.com/examples/
image maps: http://www.w3.org/TR/html401/struct/objects.html#h-13.6
CSS sprites: http://alistapart.com/articles/sprites
inline images: http://tools.ietf.org/html/rfc2397
jsmin: http://crockford.com/javascript/jsmin
dojo compressor: http://dojotoolkit.org/docs/shrinksafe
HTTP status codes: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
IBM Page Detailer: http://alphaworks.ibm.com/tech/pagedetailer
Fasterfox: http://fasterfox.mozdev.org/
LiveHTTPHeaders: http://livehttpheaders.mozdev.org/
Firebug: http://getfirebug.com/
YUIBlog: http://yuiblog.com/blog/2006/11/28/performance-research-part-1/
    http://yuiblog.com/blog/2007/01/04/performance-research-part-2/
    http://yuiblog.com/blog/2007/03/01/performance-research-part-3/
    http://yuiblog.com/blog/2007/04/11/performance-research-part-4/
YDN: http://developer.yahoo.net/blog/archives/2007/03/high_performanc.html
    http://developer.yahoo.net/blog/archives/2007/04/rule_1_make_few.html

Update: Yahoo has summarized these nicely on their developer blog.

May 13, 2007

Give Hammer a break

The collective response to Michael Arrington including MC Hammer on his TechCrunch 20 review panel is pretty lame, IMO.

Commenters should keep in mind that this is a real guy they're talking about. Have some friggin courtesy. I've actually met the guy at a party, so maybe it's easier for me to imagine him as a real human who reads the net too, and not just some TV celeb caricature. If you were introduced to him at CES or an industry party, would you say this stuff to his face? He's a nice guy, he's got a blog, and he's done a lot of other stuff since that 80's video.

Also, putting down someone who had a successful career in one area and who is trying to reinvent themselves in a new role doesn't seem right to me. There are plenty of people who had careers in sports, music, movies, etc. and then went on to second careers in politics, Wall Street, real estate, etc. I think that's just great and should be encouraged.

But the worst conceit of the crowd's response is the assumption that this guy can't know anything about technology, and thus that the idea of him doing a social network is silly. But the thing is -- there isn't really very much technology in social networks. You can build one of these puppies in a weekend, or have one built for you outsourced for $15-25k. It's a commodity at this point. Success is based on boot-up and network effects. So maybe, just maybe, is it possible that someone with a successful media and promotion background, with lots of contacts in those areas, with a name everyone recognizes, might actually have a decent shot at promoting something? Versus an unknown 20-something Rails programmer freshly minted with their geek degree and $20k in "VC"?

I met a bunch of music industry folks while I was at AOL Music, and many of them were savvy businesspeople and highly entrepreneurial. One aging rock dude, long out of contract, had even taught himself to program and built a subscription-driven site for his hard-core fans where he posted tracks and videos, did live chats, etc.

It's hard to escape your stereotype, I guess. Leonard Nimoy has done 20 things since Star Trek, but he's still got to do that hand thing whenever people approach him in public.

I think Hammer's a great choice to make the event a bit less Valley-insular. And, as Renee Blodgett recently suggested about Valley events in general, to liven things up a bit.

I have no idea what Hammer is up to, or if it's credible or not. But sheesh, cut the guy some slack.

May 14, 2007

If you're so good...

Stockbroker: I can make you 10x on this stock in 6 months!

Punter: If you're as good as you say you are, you'd be making money for yourself, instead of pretending you can for me!

So when you see an SEO consultant quit consulting to focus full-time on his own stuff... well, at least you know his former clients were getting good advice! :)

May 15, 2007

Scaling Facebook, Hi5 with memcached

From a discussion board thread pointed to by programming.reddit, a nifty discussion of how high-volume sites like Facebook and Hi5 are using memcached as a critical scaling tool:

From: Steve Grimm <... facebook.com>
Subject: Re: Largest production memcached install?

No clue if we're the largest installation, but Facebook has roughly 200 dedicated memcached servers in its production environment, plus a small number of others for development and so on. A few of those 200 are hot spares. They are all 16GB 4-core AMD64 boxes, just because that's where the price/performance sweet spot is for us right now (though it looks like 32GB boxes are getting more economical lately, so I suspect we'll roll out some of those this year.)

We have a home-built management and monitoring system that keeps track of all our servers, both memcached and other custom backend stuff. Some of our other backend services are written memcached-style with fully interchangeable instances; for such services, the monitoring system knows how to take a hot spare and swap it into place when a live server has a failure. When one of our memcached servers dies, a replacement is always up and running in under a minute.

All our services use a unified database-backed configuration scheme which has a Web front-end we use for manual operations like adding servers to handle increased load. Unfortunately that management and configuration system is highly tailored to our particular environment, but I expect you could accomplish something similar on the monitoring side using Nagios or another such app.

...

At peak times we see about 35-40% utilization (that's across all 4 CPUs.) But as you say, that number will vary dramatically depending on how you use it. The biggest single user of CPU time isn't actually memcached per se; it's interrupt handling for all the incoming packets.

 

From: Paul Lindner <... inuus.com>

Don't forget about latency. At Hi5 we cache entire user profiles that are composed of data from up to a dozen databases. Each page might need access to many profiles. Getting these from cache is about the only way you can achieve sub 500ms response times, even with the best DBs.

We're also using memcache as a write-back cache for transient data. Data is written to memcache, then queued to the DB where it's eventually written to long-term storage. The effect is dramatic -- heavy write spikes are greatly diminished and we get predictable response times.

That said, there are situations where memcache didn't work for our requirements. Storing friend graph relations was one of them. That's taken care of by another in-memory proprietary system. At some point we might consider merging some of this functionality into memcached, including:

  • Multicast listener/broadcaster protocols
  • fixed size data structure storage
    (perhaps done via pluggable hashing algorithms??)
  • Loading the entire contents of one server from another.
    (while processing ongoing multicast updates to get in sync)

I'd be interested in working with others who want to add these types of features to memcache.

Greg Linden has commented on a talk about LiveJournal's use of memcached for scaling. See also previous posts on scaling at eBay and Mailinator.
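
As a concrete (if simplified) illustration of the read-through caching pattern Lindner describes, here's a sketch in Perl with the Cache::Memcached client. The server addresses and the profile loader are made up for the example:

    use strict;
    use warnings;
    use Cache::Memcached;

    # Hypothetical read-through cache for user profiles, in the spirit of the
    # Hi5 description above. load_profile_from_db() stands in for the real
    # (possibly multi-database) profile assembly.
    my $memd = Cache::Memcached->new({
        servers => [ '10.0.0.1:11211', '10.0.0.2:11211' ],
    });

    sub get_profile {
        my ($user_id) = @_;
        my $key = "profile:$user_id";

        # Cache hit: no database gets touched at all.
        my $profile = $memd->get($key);
        return $profile if defined $profile;

        # Cache miss: assemble the profile the slow way, then cache it.
        $profile = load_profile_from_db($user_id);
        $memd->set($key, $profile, 600);    # expire after 10 minutes
        return $profile;
    }

    # Stand-in for the expensive multi-database assembly.
    sub load_profile_from_db {
        my ($user_id) = @_;
        return { id => $user_id, name => "user$user_id" };
    }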

May 30, 2007

'tie' considered harmful

Something has always left me uneasy about the 'tie' feature in perl, and I've been trying to reconcile it with my evolving view of programmer-system productivity.

To productively use a feature, like multi-process append to the same file, you have to understand the underlying performance and reliability behavior. Append is going to work great for 50 Apache processes appending lines to a common log file without locking, but not for 2 processes appending 25k chunks to the same file, since the writes will interleave and corrupt the file. If you understand how Unix's write-with-append semantics work, you can get away with very fast updates to lots of little files without paying any locking penalties (Twitter should probably have done something like this).
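
To make the append trick concrete, here's a tiny sketch of the pattern (my own example, with a made-up log path): every process opens the same file with O_APPEND and writes one short line at a time, with no locking.

    use strict;
    use warnings;
    use Fcntl qw(O_WRONLY O_APPEND O_CREAT);

    # Each process opens the shared log in append mode; every syswrite lands
    # at the current end of file. Keeping each write to a single short line
    # is what makes skipping locks workable in practice.
    sysopen(my $log, "/tmp/hits.log", O_WRONLY | O_APPEND | O_CREAT, 0644)
        or die "can't open log: $!";
    syswrite($log, "$$ " . time() . " GET /index.html\n")
        or die "write failed: $!";
    close($log);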

Similarly, when you see %foo in perl, you instantly know the perf footprint. It's an in-memory hash, it's going to be fast, and you won't get into trouble unless you hit a corner case like making a zillion hashes-of-hashes and then discovering that there's a 200-300 byte overhead for each one.
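
That per-hash overhead is easy to measure directly; here's a quick check with the Devel::Size module (assuming it's installed), which is roughly how I'd verify the footprint:

    use strict;
    use warnings;
    use Devel::Size qw(total_size);

    # Compare an empty hash against a hash of 100,000 tiny sub-hashes to see
    # the fixed per-hash overhead add up (the total also includes the keys).
    my %empty;
    my %huge = map { ($_ => {}) } (1 .. 100_000);

    printf "empty hash: %d bytes\n", total_size(\%empty);
    printf "100k sub-hashes: %d bytes (~%.0f bytes per entry)\n",
        total_size(\%huge), total_size(\%huge) / 100_000;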

But tie destroys your knowledge of how the hash works. The perf characteristics become completely different. A simple-minded approach to building a search keyword index with a hash-of-lists, which might work acceptably well with in-memory hashes, suddenly becomes a disaster when you tie it to berkeley-db. Because you're not using an in-memory hash anymore, you're using a disguised call to berkeley-db.

I don't think the syntactic-sugar win from the notational convenience trumps the potential confusion for those who will view the code later, or even the confusingly overloaded semantics for the original programmer. I'd rather just know that %foo is an in-memory perl hash, and if I'm going to stuff something in a berkeley-db it's going to be with an explicit API.
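
To make the contrast concrete, here's a small sketch of my own (using the BerkeleyDB module, with a made-up filename) showing the same store done both ways:

    use strict;
    use warnings;
    use BerkeleyDB;

    # The tied version: %count reads like an ordinary in-memory hash, but
    # every increment below is really a disk-backed Berkeley DB get + put.
    tie my %count, 'BerkeleyDB::Hash',
        -Filename => 'counts.db',
        -Flags    => DB_CREATE
        or die "tie failed: $BerkeleyDB::Error";
    $count{"foo"}++;
    untie %count;

    # The explicit version: same file, but the cost is visible at the call
    # site, so nobody mistakes it for a cheap in-memory hash lookup.
    my $db = BerkeleyDB::Hash->new(-Filename => 'counts.db', -Flags => DB_CREATE)
        or die "can't open counts.db: $BerkeleyDB::Error";
    my $val = 0;
    $db->db_get("foo", $val);
    $db->db_put("foo", $val + 1);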

As an aside, when I say 'productive', I'm trying to envision the entire life of the code and the product: not just getting it written and working, but the lifetime maintenance load of the code, whether people in ops will need to monkey with the system to keep it healthy, whether pitfalls have been left for new programmers inheriting the code, whether it will gracefully scale and degrade, and so on.

This is related to an evolving philosophy of programmer-system productivity that I've been developing, which I plan to write more about later.

Code is our enemy

Code is bad. It rots. It requires periodic maintenance. It has bugs that need to be found. New features mean old code has to be adapted.

The more code you have, the more places there are for bugs to hide. The longer checkouts or compiles take. The longer it takes a new employee to make sense of your system. If you have to refactor there's more stuff to move around.

Furthermore, more code often means less flexibility and functionality. This is counter-intuitive, but a lot of times a simple, elegant solution is faster and more general than the plodding mess of code produced by a programmer of lesser talent.

Code is produced by engineers. To make more code requires more engineers. Engineers have n^2 communication costs, and all that code they add to the system, while expanding its capability, also increases a whole basket of costs.

You should do whatever you can to increase the productivity of individual programmers in terms of the expressive power of the code they write. Less code to do the same thing (and possibly do it better). Fewer programmers to hire. Lower organizational communication costs.

The minimum description length principle (MDL) is often used in genetic programming to identify the most promising candidate programs from a population. The shorter solutions are often better; not just shorter, but actually faster and/or more general.

A few hours reading WTF should convince anyone that there are often vast differences in the amount of code different programmers will put into the same task. But it's not just wtf? code. Components like a page crawler can have very different solutions. Maybe you can re-implement a 10k-line solution as a 1k-line solution by taking a different approach, and it turns out that the shorter crawler is actually more general and works in a lot more cases. I've seen this over and over again, and I'm convinced that it's harder to write something short and robust than something big and brittle.

I've been looking for ways to get code out of the code. Is there something the code is doing that can be turned into an external dataset, driven by a web UI, or turned into a rule list that I can contract out to someone on Elance? Maybe a little rule-based language has to be written. I've seen this yield an unexpected productivity increase: it turns out that using the web tool to edit the rules in the little domain-specific language ends up being more productive than messing around in the raw code anyway. The time spent formalizing the subdomain language is more than paid back.
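
As a hypothetical example of what I mean (the file name and rule format are made up for illustration): a plain-text rules file of "pattern<TAB>category" lines that a non-programmer can edit through a web form, applied by a tiny generic loop instead of a pile of hard-coded branches.

    use strict;
    use warnings;

    # Hypothetical rules file, story_rules.txt, one "pattern<TAB>category" per line:
    #
    #   city council        local-politics
    #   high school         schools
    #   robbery|assault     crime

    sub load_rules {
        my ($file) = @_;
        my @rules;
        open(my $fh, '<', $file) or die "can't read $file: $!";
        while (my $line = <$fh>) {
            chomp $line;
            next if $line =~ /^\s*(#|$)/;            # skip comments and blanks
            my ($pattern, $category) = split /\t+/, $line, 2;
            push @rules, [ qr/$pattern/i, $category ];
        }
        return \@rules;
    }

    sub categorize {
        my ($rules, $headline) = @_;
        for my $rule (@$rules) {
            return $rule->[1] if $headline =~ $rule->[0];
        }
        return 'other';
    }

    my $rules = load_rules('story_rules.txt');
    print categorize($rules, 'Robbery suspect arrested downtown'), "\n";   # prints "crime"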

Code has three lifetime performance curves:

  • Code that is consistent over time. The MD5 function is just great and it always does what we want. We act like all code is like this but most of the interesting parts of the system really aren't.

  • Code that will get worse over time, or will inevitably cause a problem in the future.

    Humans will have to jump in at some point to deal with it. You know this when you write the code, if you stop to think. Appending lines to a logfile without bothering to implement rotation is like this. So is having a database that you know will grow over time sitting on a single disk, counting on someone to type 'df' every so often and eventually deal with it.

    RAID is kind of like this. It reduces disk reliability problems by some constant factor. But when a disk fails, RAID has to email someone and say it's going to lose data unless someone steps in and deals with it. In a growing service, RAID is going to generate m management events for n disks. As n grows, m grows. 10X the disk cluster, 10X the management events. Wonderful. Better to architect something that decays organically over time, without requiring pager-level immediate support to stave off catastrophic failure, e.g. the datacenter in one of those shipping container prototypes.

  • Code that gets better over time.

    This is the frontier.

    Google's spelling corrector is like this. It works okay on a small crawl, but better on a big crawl.

    People in the system can be organized this way, working on a component (like a dataset or ruleset) that they steadily improve over time. They're external to the core programming team but they make the code better by improving it with data.

    I've been wondering if it's possible to generally insert learning components at certain points in the code to adaptively respond to failure cases, scenarios, etc. Why am I manually tuning this perf variable or setting this backoff strategy? Why are we manually doing A/B testing and checking the results back into CVS to run another test, when the whole loop could be wired up to the live site to run by itself and just adapt and/or improve over time? I need to bake this some more, but I think it's promising. (A toy sketch of what I mean follows this list.)
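
Here's that toy sketch: an epsilon-greedy chooser that shifts traffic toward whichever variant is converting better, instead of a human reading a report and checking a constant into CVS. The variants, conversion rates, and traffic are all made up for illustration.

    use strict;
    use warnings;

    # Toy self-adjusting A/B loop (an epsilon-greedy bandit): mostly serve
    # whichever variant is converting best, but keep exploring a little so
    # the system can notice when the answer changes.
    my %stats = (
        A => { shown => 0, converted => 0 },
        B => { shown => 0, converted => 0 },
    );
    my $epsilon = 0.1;    # fraction of traffic reserved for exploration

    sub rate {
        my ($v) = @_;
        return $stats{$v}{shown} ? $stats{$v}{converted} / $stats{$v}{shown} : 0;
    }

    sub pick_variant {
        # Explore occasionally (and until both variants have been tried).
        if (rand() < $epsilon || !$stats{A}{shown} || !$stats{B}{shown}) {
            return rand() < 0.5 ? 'A' : 'B';
        }
        # Otherwise exploit the current winner.
        return rate('A') >= rate('B') ? 'A' : 'B';
    }

    sub record {
        my ($variant, $converted) = @_;
        $stats{$variant}{shown}++;
        $stats{$variant}{converted}++ if $converted;
    }

    # Simulated traffic: B "really" converts at 5%, A at 3%. Over time the
    # loop shifts most impressions to B with no human in the loop.
    for (1 .. 10_000) {
        my $v = pick_variant();
        record($v, rand() < ($v eq 'B' ? 0.05 : 0.03));
    }
    printf "A: shown %d, rate %.3f   B: shown %d, rate %.3f\n",
        $stats{A}{shown}, rate('A'), $stats{B}{shown}, rate('B');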

About May 2007

This page contains all entries posted to Skrentablog in May 2007. They are listed from oldest to newest.
