August 2007 Archives

August 3, 2007

I5 fire at Tejon pass

Last week I was on my way to the beach heading down I5. We stopped at the clump of gas stations and restaurants just before the Tejon pass (the one with the Panda Garden, Iron Skillet and Wendy's). When we came out, there was a giant column of smoke coming up from it. "That wasn't there before. Oh-oh... I hope it doesn't affect traffic!"

I managed to snap a couple of pics with my iphone from behind the wheel. :)

The firefighters were out in force, and there were helicopters and planes dropping stuff on the fire. Looked like it just burned out a few hills before they got it under control.

August 5, 2007

The 11 startups actually crawling the web

The story goes that, one day back in the 1940s, a group of atomic scientists, including the famous Enrico Fermi, were sitting around talking, when the subject turned to extraterrestrial life. Fermi is supposed to have then asked, "So? Where is everybody?" What he meant was: If there are all these billions of planets in the universe that are capable of supporting life, and millions of intelligent species out there, then how come none has visited earth? This has come to be known as The Fermi Paradox.

My buddy Greg Lindahl maintains a collection of historical documents on his personal website, and gets enough traffic each month that he worries about his colo bandwidth bill.

When he analyzed his web logs recently and tallied up the self-reporting robots, he was surprised at how few he actually found crawling his site, and mentioned the Fermi quote I've reproduced above. If there really are 100 search engine startups (via Charles Knight at Read/Write web), shouldn't we be seeing more activity from them?

Here is the list of every crawler that fetched over 1000 pages for the past three months:

1612960 Yahoo! Slurp help.yahoo.com bigco
365308 msnbot search.msn.com/msnbot.htm bigco
148090 Googlebot www.google.com/bot.html bigco
140120 VoilaBot www.voila.com bigco
68829 Ask Jeeves/Teoma about.ask.com bigco
62005 psbot www.picsearch.com/bot.html startup
39193 BecomeBot www.become.com/site_owners.html shopping
30006 WebVac www.WebVac.org edu
29778 ShopWiki www.shopwiki.com/wiki/Help:Bot shopping
22124 noxtrumbot www.noxtrum.com bigco
20963 Twiceler www.cuill.com/twiceler/robot.html startup
17113 MJ12bot majestic12.co.uk/bot.php startup
15650 Gigabot www.gigablast.com/spider.html startup
10404 ia_archiver www.archive.org nonprofit
9337 Seekbot www.seekbot.net/bot.html startup
9152 genieBot www.genieknows.com startup
7246 FAST MetaWeb www.fastsearch.com enterprise
7243 worio bot worio.com edu
6868 CazoodleBot www.cazoodle.com startup
6608 ConveraCrawler www.authoritativeweb.com/crawl enterprise
6293 IRLbot irl.cs.tamu.edu/crawler edu
5487 Exabot www.exabot.com/go/robot bigco
4215 ilial www.ilial.com/crawler startup
3991 SBIder www.sitesell.com/sbider.html memetracker
3673 boitho-dcbot www.boitho.com/dcbot.html enterprise
3601 accelobot www.accelobot.com memetracker
2878 Accoona-AI-Agent www.accoona.com startup
2521 Factbot www.factbites.com startup
2054 heritrix i.stanford.edu edu
2003 Findexa www.findexa.no ?
1760 appie www.walhello.com startup?
1678 envolk www.envolk.com spammers
1464 ichiro help.goo.ne.jp/door/crawler.html bigco
1165 IDBot www.id-search.org/bot.html edu
1161 Sogou www.sogou.com/docs/help bigco
1029 Speedy Spider www.entireweb.com bigco
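
A tally like the one above can be sketched in a few lines of Python. This is only a sketch under assumptions, not the script Greg actually used: it assumes Apache-style combined logs where the user agent is the last double-quoted field on each line, and the `BOT_HINTS` substrings and the `tally_robots` name are made up for illustration.

```python
import re
from collections import Counter

# Assumption: combined log format, user agent is the last quoted field.
UA_RE = re.compile(r'"([^"]*)"\s*$')

# Hypothetical substrings that mark a self-reporting robot.
BOT_HINTS = ("bot", "crawl", "spider", "slurp")


def tally_robots(lines, min_pages=1000):
    """Count fetches per user agent and keep robots at or above min_pages."""
    counts = Counter()
    for line in lines:
        match = UA_RE.search(line)
        if not match:
            continue  # line does not end in a quoted field; skip it
        agent = match.group(1)
        if any(hint in agent.lower() for hint in BOT_HINTS):
            counts[agent] += 1
    # most_common() sorts by descending count, like the list above
    return [(agent, n) for agent, n in counts.most_common() if n >= min_pages]
```

Feeding it three months of log lines with the default `min_pages=1000` would give the same kind of cutoff used for the list. Crawlers that don't identify themselves in the user agent are invisible to this approach.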

There are a couple of surprises here... One is how much more aggressively Yahoo is crawling than everyone else. (Maybe he should just ban Yahoo to cut his hosting fees :)

Another is how few startups are actually crawling... And the ones that are aren't correlated with the folks getting buzz right now. In three months of data I didn't see a single visit from Zermelo, Powerset's crawler. I don't see Hakia in there at all, but they do have an index and actually refer a little traffic, which leads me to believe that they've licensed a crawl from someone else.

There hasn't been a lot of public information about Cuill since Matt Marshall's brief cryptic entry on them. But they're crawling fairly aggressively, and they've put up a public about us page detailing the impressive credentials of the founders, Tom Costello, Anna Patterson and Russell Power. Anna is the author of a widely-read intro paper on how to write a search engine from scratch.


The conventional wisdom is that all sorts of folks are trying to take on Google and develop meaning-based search; France and Germany are supposedly both state-funding their own search efforts (heh). But if all these folks are out crawling the web... more than 11 of them should be showing up in webserver logs. ;)

Update: Charles Knight posts a ton of quotes from alt search engine folks on their approaches to crawling. Pretty interesting.

August 6, 2007

What doesn't clog your algo makes it stronger...

Valleywag outed the startup day job of the guys who collectively edit the hilarious snark site uncov. The startup, Persai, was "hiding in plain site" since they have a blog and have been pretty open about the tech they're using and their daily gripes.
"Persai is a startup that seeks to apply advanced machine learning techniques to content and advertising. We are using Amazon's web services to build a scalable architecture that will learn from consumer interests over time and match them with content crawled from around the web. The idea behind Persai is that you will have an active agent crawling the web looking for content that is relevant to you and only you. Every link we recommend will be something you want to read. We are zigging to social news' zag where popularity trumps relevance to the individual."
    -- from news.ycombinator

Anyway, a few days ago Persai released a Nutch webcrawl-generated set of "118,254 feeds of pure greatness". Intertwingly begged to differ about the quality after running some stats on the feeds. This generated some interesting comments...one in particular jumped out at me:

But if you look at the list itself, two sites are grossly overrepresented, and they account for the majority of the 301s and Timeouts. [emphasis mine]

I got a sinking feeling as I read this. I had curl'd over the corpus already to eyeball it ...yeah that's a list of feeds all right... but hadn't tallied the domains...

$ sed -e 's/^http:..//' -e 's/\/.*$//' persai_feedcorpus | count | head
 35695   rss.topix.net
 14613   izynews.de
  2831   feeds.feedburner.com
  1869   p.moreover.com
  1314   www.livejournal.com
  1241   rss.groups.yahoo.com
  1191   www.discountwatcher.com
  1096   news.bbc.co.uk
  1072   www.alibaba.com
   882   xml.newsisfree.com

Nooooo... Of course.. Sigh.
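
For anyone without a `count` command handy (it isn't a standard Unix tool; presumably it's a local helper along the lines of `sort | uniq -c | sort -rn`), the same per-domain tally can be sketched in Python. The function name `top_domains` is just for illustration, and the sketch assumes one feed URL per line.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_domains(urls, limit=10):
    """Return the most common hostnames among an iterable of feed URLs."""
    hosts = Counter()
    for url in urls:
        host = urlsplit(url.strip()).netloc
        if host:  # skip blank or malformed lines
            hosts[host] += 1
    return hosts.most_common(limit)
```

Something like `top_domains(open("persai_feedcorpus"))` would then give the top ten hosts, assuming the corpus file is in the current directory.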

August 8, 2007

My top 10 beefs with the iPhone

Well I'm gonna catch some heat for this but here are my iphone beefs. My anecdotal experience based on talking to friends is that if you're coming from a treo the iphone is great, if you're coming from a blackberry, there are some rude shocks.

Serious power users I know carry both an iphone and bbery. I'm not gonna do that right now, that's defeating the point of the small form factor. Unfortunately there's not a clear winner here, neither one is better in every way. If I were to score, the iphone gets a lot more total points, but has some serious gaps w/ the bbery.

  • Safari controls are often unresponsive while it's transferring a page. Can't scroll, can't side scroll, can't expand or shrink, stop button doesn't work, it ignores the back button. This happens during dns delays too.

    loading techcrunch, touch screen unresponsive, rendering lag

  • No synchronous gmail app. What's this pop nonsense, is this a joke?

  • Anti keybounce or the skeptical touch software makes it lose keypresses I think should be valid.

  • Very difficult to type while driving with one hand. Or thumb. Even looking up a number from the contact list and initiating it can be tricky when it loses keypresses or gets them wrong because your thumb is hitting the screen at a funny angle.

  • Can't hear it ring. If the little holes are covered up you can't hear it at all. Like when it's in my pocket. Which is all the time.

  • Everyone I know with an iphone picked the classic phone ring, since of the bunch it's the most audible. Which still isn't great.

  • When I flip the screen sideways I wish the dpi would stay the same instead of expanding. I'm flipping the screen to get more sideways real estate. So every time I have to squish it back down.

  • Surprised they didn't do screen flip at a lower os level so all the apps got it. Even safari won't screen flip if the keyboard is popped up.

  • The accelerometers are funny, I sorta wish I could flip it with a button instead of twisting it around. I use it a lot reclining or lying down and then it gets the orientation wrong.

  • Touch keyboard is mostly useless. I can't type on this thing. bberry was much better even with their mini-keyboard. At least it could guess correctly, iphone makes "stupid" errors where the bbery predictive software would have gotten it right.

Despite the flaws the browser is good enough that I don't think I could go back. My biggest beef with the bbery browser was that it didn't do cookies, so it couldn't remember site logins. The way it ripped apart pages into a stream of text actually made them fit on the screen pretty well, then I could do a one-dimensional scroll to see everything, rather than the 3D scroll I have to do on the iphone to get page coverage (up/down, side/side, expand/shrink).

Waah! Google stole my idea!

"Google stole my idea"

if you stop crying you can have ice cream later

August 14, 2007

Byzantine Sequence Number Generation

The 645 clock was a huge box, 8 foot refrigerator size, containing a clock accurate to a microsecond. It hooked into the system as a "passive device," meaning that it looked like a bank of memory. Memory reads from a port with a clock on it returned the time in microseconds since 0000 GMT Jan 1, 1901. (52-bit register) The clock guaranteed that no two readings were the same. It had a real-time alarm register also. Inside there was a crystal in an oven, all kinds of ancient electronics.
    -- from a description of the Multics implementation on the GE-645

That's funny. It seems like serious overkill just to make unique timestamps, even for Multics. :)

Let's Paxos for lunch...

In the garage.

Update: why keith has those bandages on his knees.

August 16, 2007

We Worship MD5, the GOD of HASH

For some time I had been looking for a mutual exclusion algorithm that satisfied my complete list of desirable properties. I finally found one--the N!-bit algorithm described in this paper. The algorithm is wildly impractical, requiring N! bits of storage for N processors, but practicality was not one of my requirements. So, I decided to publish a compendium of everything I knew about the theory of mutual exclusion.

The 3-bit algorithm described in this paper came about because of a visit by Michael Rabin. He is an advocate of probabilistic algorithms, and he claimed that a probabilistic solution to the mutual exclusion problem would be better than a deterministic one. I believe that it was during his brief visit that we came up with a probabilistic algorithm requiring just three bits of storage per processor. Probabilistic algorithms don't appeal to me. (This is a question of aesthetics, not practicality.) So later, I figured out how to remove the probability and turn it into a deterministic algorithm.
    -- Lamport

3N vs. N! Some folks just aren't comfortable with probabilistic algorithms. Lamport here clearly knows what he is doing, but still has aesthetic problems with them.

In some people's minds, algorithms should be provably correct at all times and for all inputs (as with defect-free programming and formal methods). Probabilistic algorithms give up this property. There is always a chance that the algorithm will produce a false result. But this chance can be made as small as desired. If the chance of the software failing is made smaller than the chance of the hardware failing (or of the user spontaneously combusting, or whatever), there's little to worry about.
    -- Bruce Schneier in Dr. Dobb's Journal

The common practical case I run into with coders is that they're unfamiliar with figuring out how big a hash they need to "not worry about" collisions. Here's the rule of thumb.

MD5 Quickie Tutorial

Suppose you're using something like MD5 (the GOD of HASH). MD5 takes any length string of input bytes and outputs 128 bits. The bits are consistently random, based on the input string. If you send the same string in twice, you'll get the exact same random 16 bytes coming out. But if you make even a tiny change to the input string -- even a single bit change -- you'll get a completely different output hash.

So when do you need to worry about collisions? The working rule-of-thumb here comes from the birthday paradox. Basically you can expect to see the first collision after hashing 2^(n/2) items for an n-bit hash, or 2^64 for MD5.

2^64 is a big number. If there are 100 billion urls on the web, and we MD5'd them all, would we see a collision? Well no, since 100,000,000,000 is way less than 2^64:

    18,446,744,073,709,551,616   2^64
               100,000,000,000  <2^37

(Another way of putting this is that the expected number of collisions from hashing a set of 2^k strings to m-bit strings will be about 2^(2k-m). [1])
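
The arithmetic above is easy to check. Here's a small sketch that plugs the 100-billion-URL example into both rules of thumb; the function names are made up for illustration, and the formulas are just the ones quoted above, not an exact collision probability.

```python
import math


def items_before_first_collision(bits):
    """Birthday rule of thumb: about 2^(bits/2) items before a collision."""
    return 2 ** (bits // 2)


def expected_collisions_log2(k, m):
    """log2 of the estimate 2^(2k - m) for 2^k items and m-bit hashes."""
    return 2 * k - m


urls = 100_000_000_000                     # 100 billion URLs
k = math.ceil(math.log2(urls))             # 37, since 10^11 < 2^37
print(items_before_first_collision(128))   # 18446744073709551616
print(expected_collisions_log2(k, 128))    # -54, i.e. about 2^-54 collisions
```

So even rounding the web up to 2^37 URLs, the expected number of MD5 collisions is around 2^-54, which is effectively zero.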

Other MD5 tips & tricks

  • Unique ID generation

    Say you want to create a set of fixed-sized IDs based on chunks of text -- urls, for example. Urls can be long, with 100+ bytes common. They're varying sizes too. But md5(url) is 16 bytes, consistently, and you're unlikely to ever have a collision, so it's safe to use the md5 as an ID for the URL.

  • Checksums

    Don't trust your disk or your OS to properly detect errors for you. The CRC and protocol checksums they use are weak and bad data can get delivered.

    Instead, bring out an industrial strength checksum and protect your own data. MD5 your data before you stuff it onto the disk, check the MD5 when you read it.

        (data, stored_md5) = read_from_disk()
        if (md5(data) != stored_md5)
            die "checksum mismatch: data is corrupt"

    This kind of paranoia is healthy for code -- your module doesn't have to trust the teetering stack of plates if it's doing its own end-to-end consistency check.

  • Password security

    Suppose you're writing a web app and you're going to have users login. They sign up with an account name and a password. How do you store the password?

    You could store the password in your database, "in the clear". But this should be avoided. If your site is hacked, someone could get a giant list of usernames and passwords.

    So instead, store md5(password) in the database. When a user tries to login, take the password they entered, md5 it, and then check it against what is in the database. The process can then forget the cleartext password they entered. If the site is hacked, no one can recover the list of passwords. Even employees are protected from casually seeing other people's passwords while debugging.

    If you don't store the password, how can you email it to someone if they forget it? Instead of emailing the user their forgotten password, instead invent a new, random password, store the md5 of it in the database, and email the new random password to the user.

    If a site can email you your original password, it's storing it in the clear in its database. Tisk, tisk.

  • Hash table addressing

    There are whole chapters of textbooks devoted to the pitfalls and difficulties of writing hash addressing algorithms. Because most of these algorithms are weak, they require you to rejigger your hash table size to be relatively prime to your original hash table size when you expand it.

    Forget that nonsense. MD5 isn't a weak hash function and you don't need to worry about that stuff. MD5 your key and have your table size be a power of 2. As an engineer, your table sizes should be powers of 2 anyway. Leave the primes to the academics.

  • Random number generation

    The typical library RNG available isn't generally very good. For the same reason that you want your hashes to be randomly distributed, you want your random numbers to actually be random, and not to have some underlying mathematical structure showing through.

    Having random numbers that can't be guessed or predicted can be surprisingly useful. MD5 based sequence numbers were a solution for the TCP sequence number guessing attacks.

    I also recall some players of an old online game who broke the game's RNG, and could predict the outcome of upcoming battles. The library RNG was known, the entire seed state was 32 bits, which was easy to plow through to find the seed the game was using. Solution: a stronger RNG, with more internal state, that can't be predicted.

    Here is an md5-based RNG that I wrote some time ago.

  • What if you need more than 16 bytes?

    You can use SHA1 or SHA256, which generate 160 and 256 bits of output, respectively. Or you can chain hashes together to get an arbitrary amount of output material:

        a = md5(s . '0')
        b = md5(s . '1')

    Because md5 is cryptographically secure, this is safe. You can make as many unique 16 byte hashes from an input string as you want.

        md5('Rich Skrenta')  = 15ddc636 023977a2 22c3423d a5e8fbee
        md5('Rich Skrenta0') = 4343e346 b4036f80 7015847d cf983010
        md5('Rich Skrenta1') = da79412d c52c47b4 fa7848e4 54f89614

  • I heard MD5 was broken and you should use SHA

    For cryptographic purposes, MD5 and SHA have both been broken such that a sophisticated attacker can create multiple documents that intentionally hash to the same value.

    But for practical uses like hash tables, decent RNGs, and unique ID generation, these algorithms maintain their full utility. The alternatives considered are often non-secure CRCs or hashes anyway, so a cryptographic hash weakness is not a concern.

    If you're concerned about some nefarious actor leaving data around designed to deliberately cause hash collisions in your algorithm, throw a secret salt onto the beginning of the material that you're hashing (not the end, as this post first suggested) [good point]:

            hash = md5('xyzzy' . s)

  • Isn't MD5 overkill?

    Folks sometimes say MD5 is "overkill" for a lot of these applications. But it's good, cheap, strong, and it works. It's not going to cause you problems if you use it. You're not going to ever have to debug it or second guess it. If you have perf problems, and suspect MD5, and then go profile your code, it's not going to be MD5 that's causing your problems. You're going to find that it was something else.

    But if you feel you absolutely must leave the path and look for some faster hashes, check out Bob Jenkins' site. [Also see the Hsieh hash, it looks very good.]

  • How fast is MD5?

    About as fast as your disk or network transfer rate.


    These are 2004 numbers from the perl Digest implementation.
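
Several of the tips above fit in a few lines each. Here's a sketch in Python using the standard `hashlib` module; the helper names (`url_id`, `verify`, `bucket`, `chained`, `salted`) are invented for illustration and the `'xyzzy'` secret is the placeholder from the example above, so treat this as one way to do it rather than the way.

```python
import hashlib


def md5_bytes(data: bytes) -> bytes:
    """Return the 16-byte MD5 digest of data."""
    return hashlib.md5(data).digest()


def url_id(url: str) -> bytes:
    """Fixed-size 16-byte ID for a variable-length URL."""
    return md5_bytes(url.encode("utf-8"))


def verify(data: bytes, stored_digest: bytes) -> bool:
    """End-to-end check: recompute the digest on read and compare."""
    return md5_bytes(data) == stored_digest


def bucket(key: bytes, table_bits: int) -> int:
    """Index into a table of 2**table_bits slots using the low digest bits."""
    value = int.from_bytes(md5_bytes(key), "big")
    return value & ((1 << table_bits) - 1)


def chained(s: bytes, count: int) -> bytes:
    """Concatenate md5(s . '0'), md5(s . '1'), ... for more output bytes."""
    return b"".join(md5_bytes(s + str(i).encode()) for i in range(count))


def salted(s: bytes, secret: bytes = b"xyzzy") -> bytes:
    """Hash with a secret prefix so outsiders can't precompute collisions."""
    return md5_bytes(secret + s)
```

Because the table size in `bucket` is a power of 2, growing the table is just a matter of using one more bit of the same digest; no prime hunting required.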

Be happy and love the MD5.

August 17, 2007

Crypto vs. the working coder

Working in security tends to make people jumpy and nervous. Most security coders don't understand any of the crypto internals of the tools they use, so they must rely on a handful of trusted experts like Schneier to tell them what's safe. Even so the algorithms last for about 10-15 years before they're broken.

Kids poke holes in protocols that spent years peer-reviewing their way through the IETF. Implementations are about as secure as swiss cheese, but it doesn't matter since the commercial success of a security product has more to do with its channel marketing strategy than actual security. Rumors surface that some Chinese mathematicians have wrecked part of the functional toolkit we've used for the last decade in all of our products, and it's time to pack up the tents and move, again.

So a culture of nit-picking and paranoia surrounds crypto stuff. If you are using a security algorithm, so the thinking goes, it must be because there is a threat. And if there is a threat, the algorithm must be made perfectly secure.

That may be an appropriate way to think for security products. But it turns out that security techniques are often useful in general programming. MD5 is a great checksum, much better than CRCs. If you have 500 nodes in a cluster, each with some disks, yes I will guarantee you that read/write corruption can occur and get into your app. TCP packets do arrive corrupted, even though they're not supposed to.

Yes, Jenkins is faster. But it's only a 32-bit hash, whereas MD5 is 128. Yes, Whirlpool is more secure. But I don't need a 512 bit checksum. MD5 is a great compromise.

Salts and HMAC are great. But you know what? The reality is that 9 out of 10 websites store your password in the clear. It would be nice if we could get the run-of-the-mill programmer to at least understand how to hash a password before trying to scare them off with the more advanced stuff. Otherwise they're going to throw up their hands and say their app doesn't really need to be secure anyway.

You can't say MD5 without a geek chorus shouting "It's broken, you must not use it for anything." When regular programmers don't understand the basic utility of these fat hash functions they're missing out though. The fog of confusion hanging over the security space doesn't benefit Joe coder who could make practical use of these tools in general applications.

The message from security folks is that you shouldn't be using any of their algs for non-secure applications. If you use their stuff, you have to go all the way.

But that's bunk. The engineering tolerances for crypto security are way beyond what the typical application needs for general purpose utility out of these functions. MD5 is a great general-purpose hash. There is useful stuff in between the extremes of a crappy CRC and SHA-512.

So MD5 away to make your stateless GUIDs and be happy. :)

August 19, 2007

RSS reader shares for Skrentablog


The data comes from fetches that look like this in the webserver logs:

GET /atom.xml   Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 512 subscribers)
GET /index.xml  Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 50 subscribers)
GET /atom.xml   Bloglines/3.1 (http://www.bloglines.com; 142 subscribers)
GET /index.xml  Bloglines/3.1 (http://www.bloglines.com; 36 subscribers)
GET /atom.xml   NewsGatorOnline/2.0 (http://www.newsgator.com; 56 subscribers)
GET /index.xml  NewsGatorOnline/2.0 (http://www.newsgator.com; 7 subscribers)
GET /atom.xml   Netvibes (http://www.netvibes.com/; 44 subscribers)
GET /atom.xml   Fastladder FeedFetcher/0.01 (http://fastladder.com/; 14 subscribers)
GET /atom.xml   livedoor FeedFetcher/0.01 (http://reader.livedoor.com/; 6 subscribers)
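
Pulling the counts out of lines like these is a one-regex job. Here's a sketch that totals subscribers per reader; it assumes the trimmed "GET path agent" layout shown above rather than a full log line, the `subscriber_totals` name is made up, and adding the /atom.xml and /index.xml counts together assumes nobody subscribes to both.

```python
import re
from collections import Counter

# Assumption: lines look like "GET <path> <agent> (<url>; <N> subscribers)".
LINE_RE = re.compile(r"^GET\s+(\S+)\s+(.+?)\s*\(.*?(\d+) subscribers\)\s*$")


def subscriber_totals(lines):
    """Sum the self-reported subscriber counts per feed reader."""
    totals = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not a feed fetcher that reports subscribers
        _path, agent, subscribers = match.groups()
        # "Bloglines/3.1" -> "Bloglines", "Feedfetcher-Google;" -> "Feedfetcher-Google"
        name = agent.rstrip(";").split("/")[0]
        totals[name] += int(subscribers)
    return totals
```

Dividing each reader's total by the sum of all totals gives its share.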

August 20, 2007

Some thoughts on Mahalo

I was surprised (along with many others) that Jason chose to launch a "human powered search engine" as his next venture. More so at the reported funding of $20M.

I'm a fan of Jason's antics and his promotional ability, but at first glance making this spruce goose fly looks like it would need David Copperfield plus a reduction in the universe's gravitational constant.

Is it really possible to do dmoz/about 2.0 and have a go of it?

Having founded the biggest human-powered search site on the web (600,000 pages) and more recently running a content startup with substantial SEO distribution I have a few comments and suggestions for Mahalo.

To be fair there have been some notable SEO successes. About.com is probably the biggest seo win ever, with a $410M sale to NY Times in 2005. About has been huge into SEO since they were known as The Mining Company. About guides got an SEO manual when they joined and were directed to author high-value seo content, as Mahalo is doing with its staff. About now has approx 3-6M pages indexed in Google.

dmoz wasn't seo driven itself but was a huge presence in the early seo industry. Because we gave the dmoz data away and so many other sites put it up, getting a link in dmoz meant that you instantly had thousands of links from across the web. Plus dmoz.org was PR10 for a while which was nice. You had to have a link in dmoz just to get to the "base" level of pagerank a normal website should have. Google had to adjust some of their algs because the pagerank warping effect of this was so huge.

But the most successful SEO site currently is Wikipedia. They get a full 2% of Google's outbound traffic. I don't expect that to last at the current level, Wikipedia is showing up in too many searches and it's gone over the line. But Google's quirky aesthetics are OK with Wikipedia being there because it is on the non-commercial side of the fence and is hugely open.

At this point though I'm thinking SEO has gotta be dead as a startup business model. It was kind of unknown stuff in 2003 but now the cat's out of the bag. It seems like the last attempt of web 2.0 sites that aren't able to get social adoption is to start flooding the Google index with tag landing page spam or a crappy template page for every restaurant in the country.

We know this from experience: No one will ever go to Mahalo directly, just as no one ever went to About.com, dmoz, Tripadvisor, Nextag, IMDB or any other vertical or broad-but-shallow site. Google is where everyone starts and Mahalo's distribution strategy has to be SEO. Its traffic is going to live or die based on SEO skill and Google's continued favor.

If Mahalo doesn't get SEO traffic it's gonna have to morph into something else. In the past a site like Looksmart that had lots of editorial generated directory content could sell that to other portals. Those days are over though with content being commoditized so I doubt there is big licensing revenue in Mahalo's future. But Jason is smart and wily and I'm sure he'll keep twisting the cube until he finds a solution.

The other structural challenge with human powered directories has always been maintenance. It's not just the labor effort to create the pages in the first place, you also have to revise them regularly to keep them up to date. Otherwise they rot. So a site with N links carries an ongoing cost: you have to revisit and re-evaluate each one every M days. Wikipedia is more resilient against rot because it is substantially a historical/reference site. But the topical/commercial queries Mahalo is targeting will require periodic review, or they will start looking dated in a year or two. Links rot, spammers take them over, or they simply point to out-of-date resources. So you have to re-author all your pages every 3mo-2years depending on how topical the subject is. We crawled dir.yahoo way back and they were 8% dead links, some categories hadn't been visited by yahoo editors in years. This was the inspiration for dmoz but even it succumbed to a similar fate, just on a bigger scale. :)

In the meantime here are some tactical comments for the Mahalo site itself:

  • Hyphens instead of underscores Jason! You too outside.in. C'mon guys, this is basic stuff.

  • Put the guide note under the <h1> and call it <h2>, it'll do better. Mahalo needs lots of guide notes. Without the contiguous block of text from the guide note, the links aren't enough to validate a landing. 250 words is ideal but anything is better than nothing.

  • <title> should match <h1> should match url. Don't forget to add <meta name="description">, this should match the <h2>

  • Not really seo but a general idea ... Reference pages in general are boring. Jason is the supreme master of linkbait... Could each mahalo page be turned into a controversy of its own? When someone biases a wikipedia page, it gets more attention and traffic, not less...

  • Marshall Simmonds was the SEO expert at TheMiningCo/About. He MADE that site. I bet he singlehandedly enabled 90% of the $410M of value.

    "$410 million for SEO? I'll bet they could hire marshall simmonds, About's director of search, for a fraction of that." [1]

    Marshall gave a talk at WebmasterWorld Pubcon 2004 where he laid out About's whole seo strategy that had made them so successful. The ppt was on the conference CD. Unfortunately I've lost mine but I'm sure you can track down the talk. You need to see that deck.

  • Minor but if you are concerned with speed, then: 1) remove urchin, 2) 15% of mahalo's pages are whitespace, that may compress or not but eliminating that before sending the page out is hygienic. 3) Don't forget the 14 rules.

YMMV. Good luck.

August 23, 2007

But Craigslist actually *is* a den of sin, Mike

Just look! (warning: NSFW. In fact, not safe for home either really.)

Attn: all hobbyists and escorts - AJC front page - m4w - 29
Date: 2007-08-22, 2:51PM EDT

Looks like a few bad apples are gonna ruin the ATL scene again. Today's Atlanta-Journal-Constitution has a front page article in which Mayor Franklin blames Craigslist for promoting child sex, and the vice squad discloses how it has been conducting stings on this list. To the ladies, thank you all for your lovely services, and pls be on guard, and to my follow hobbyists, lets continue to flag the fakes, and also be on guard. Pls don't let a few bad apples ruin a good thing. Have fun, and as always, play safe.

The BKeeper.

Techdirt and its commenters mocked the Atlanta city mayor over her accusation that Craigslist is promoting child prostitution. I dunno, it's pretty clear what's going on in the "erotic services" section on Craigslist. That's a huge part of their traffic too. It's not all apartment-finding and mattress-selling over there, you know. :)

I found this quote from the AJC article interesting:

Company founder Craig Newmark, who also was mailed Franklin's letter, no longer is involved in the company's daily affairs and is traveling, Best said.

Craig's not part of Craigslist anymore?

August 26, 2007

Rotten Tomatoes / RT / Redux

Ironically, Rich Skrenta from Topix (formerly the founder of DMOZ/ODP) owns the domain name "rt.com" which I pursued unsuccessfully for many many years (it's not like he needs the money anymore). You wouldn't believe how many people can't spell "tomatoes". Despite not getting the domain name, we struck up a pretty good friendship and he provided me with some very valuable words of advice when we were in the middle of being acquired. His experience being acquired by Netscape and which was quickly thereafter sucked up by AOL/Time-Warner is similar to my experience being acquired by IGN and then sucked up by Fox Interactive Media. Without his words of advice (make sure that you have an escape hatch in case there's change of ownership), I'd probably be very unhappy right now?
   -- Stephen Wang, in Startup Review

Fyi I found that para - honest - not because I was googling skrenta, it was for backlinks to rt.com. Heh. Missed it when it first came out. Btw Stephen doesn't credit Chris Tolles there but IIRC Tolles did a lot of the talking so maybe Stephen owes him a beer too. :)

But jeezus what a great seo post Stephen wrote, go read the whole thing. SEO was a big part of Rotten Tomatoes as you can imagine and it worked out great for them. I had originally met Stephen because he wanted to buy my domain name, which I wasn't really interested in selling. But I became fascinated with his startup and his personal tenacity as an entrepreneur. This was no quick flip for them, it took them years to build. These guys loved movies and slaved night and day on rotten tomatoes all through the dot com bust. It was a walk through the desert for them but eventually it paid off with a great exit to IGN.

I'll say also that they built a great site and I still use it to check out the read on a movie if I'm not up on the openings or want to delve in.

Stephen's got a new project now:

Four of us formerly from Rotten Tomatoes (including Patrick Lee and myself) have gathered together in Hong Kong and have recently launched a new online community of artists (filmmakers, musicians and more) initially targeting Asia (http://www.alivenotdead.com).

August 27, 2007

Pass the hat for Greg Stein

Kevin Burton emailed me to let me know that he was trying to do something nice for Greg Stein, the director of the Apache foundation, who was mugged and seriously injured in front of his house in Mountain View.

Details here.

Seems like a nice thing to do. Let's see...

apache == cool
beer (a micro-hefeweizen by the looks) == cool
greg stein == cool

So git yer wallets out you apathetic webwags and toss some bills into the hat for Greg! How much did you ever pay to use Apache? Ok well there's a good rationalization for you. Time to make a bit of it up. Thank god we're not paying $1295 for Netscape Enterprise Server. :-)

August 28, 2007

Counting stuff is really hard

I've never worked anywhere where the logs could be tallied well. Netscape, AOL, they had giant systems that slurped up the logs from the front ends and stuffed them into web-enabled databases. Every query took 90 seconds to run, half of them timed out. Forget ad-hoc queries or tossing a custom regex in. Sometimes the logs would break and it'd be weeks or months or never before they worked again.

Sometimes there was just too much traffic to be able to count it all. More log events came in every 24 hours than could be processed in a 24 hour log run.

Google Analytics doesn't seem to fare much better. Granted, we probably put more data into it at Topix than the average site. But I could never get unique IP counts out of that thing. It would just spin and spin until my browser gave up.

I've repeatedly seen senior engineers fail to make headway on the log problem. Logs should be easy, right? What could be more straightforward than collecting a set of files each day and tallying the lines?
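For reference, here is roughly what the "easy" single-box version looks like. This is a minimal sketch, not anything Topix or Netscape actually ran: the log format (client IP as the first whitespace-separated field) and the names `tally_lines` and `sample_log` are assumptions made up for illustration.

```python
from collections import Counter
from typing import Iterable


def tally_lines(lines: Iterable[str]) -> Counter:
    """Count hits per client IP.

    Assumes (hypothetically) that the IP is the first
    whitespace-separated field on each log line.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


# A made-up three-line log, standing in for one day's files.
sample_log = [
    '10.0.0.1 - - "GET /index.html"',
    '10.0.0.2 - - "GET /about.html"',
    '10.0.0.1 - - "GET /feed.xml"',
]

counts = tally_lines(sample_log)
total_hits = sum(counts.values())  # every line counted once
unique_ips = len(counts)           # distinct first fields
```

On one machine, with everything fitting in memory, that really is the whole job. The trouble starts when the lines live on many front ends and the counter no longer fits in one process.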

It turns out that anything involving lots of data spread over a cluster of machines is hard. Correct that: Even little bits of data spread over a cluster are hard. i=n++ in a distributed environment is a PhD thesis.

We take the simplicity of i=n++ or counting lines for granted. It all begins with a single CPU and we know that model. In fact, we know that model so deeply that we think in it, in the same way that language shapes what we can think about. The von Neumann architecture defines our perception of what is easy and what is hard.

But it doesn't map at all to distributed systems.
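To make the i=n++ point concrete, here is a toy simulation of what goes wrong when two machines each do a read, add one, and write back against the same stored value with no coordination. It is a sketch under loud assumptions: `SharedCounter` is a hypothetical stand-in for a value living on some remote box, and the interleaving is forced by hand rather than produced by a real network.

```python
class SharedCounter:
    """Hypothetical stand-in for a counter stored on a remote machine."""

    def __init__(self) -> None:
        self.value = 0

    def read(self) -> int:
        return self.value

    def write(self, value: int) -> None:
        self.value = value


def serial_increments(counter: SharedCounter) -> None:
    """Two increments, one after the other: the N=1 intuition."""
    counter.write(counter.read() + 1)
    counter.write(counter.read() + 1)


def interleaved_increments(counter: SharedCounter) -> None:
    """Two nodes both read before either one writes back."""
    seen_by_a = counter.read()
    seen_by_b = counter.read()    # node B reads the same stale value
    counter.write(seen_by_a + 1)
    counter.write(seen_by_b + 1)  # overwrites A's update


serial = SharedCounter()
serial_increments(serial)

racy = SharedCounter()
interleaved_increments(racy)
```

After the serial run the counter holds 2; after the interleaved run it holds 1, so one increment has silently vanished. Everything that prevents this (locks, compare-and-swap, consensus) costs round trips and needs an answer for what happens when a node dies halfway through.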

The approach of the industry has been to try to impose von Neumann semantics on the distributed system. Recently some have started to question whether that's the right approach.

The underlying assumption ... is that any system that is scalable, fault-tolerant, and upgradable is composed of N nodes, where N>1.

The problem with current data storage systems, with rare exception, is that they are all "one box native" applications, i.e. from a world where N=1. From Berkeley DB to MySQL, they were all designed initially to sit on one box. Even after several years of dealing with MegaData you still see painful stories like what the YouTube guys went through as they scaled up. All of this stems from an N=1 mentality.
    -- Joe Gregorio

Distributed systems upend our intuition of what should be hard and what should be easy. So we try to devise protocols and systems to carry forward what was easy in our N=1 single CPU world.

But these algorithms are seriously messed up. "Let's Paxos for lunch" is a joke because Paxos is such a ridiculously complicated protocol. Yes I understand its inner beauty and all that but c'mon. Sometimes you get the feeling the universe is on your side when you use a technique. Like exponential backoff. You've been using that since you were a kid learning about social interactions and how to manage frustration. It feels right. But if you come to a point in your design where something like Paxos needs to be brought out, maybe the universe is telling you that you're doing it wrong.
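Exponential backoff, by contrast, fits in a few lines. The sketch below is one common way to write it, not a reference implementation: the names `backoff_delays` and `call_with_backoff`, the default numbers, and the flaky operation are all invented for illustration, and real code would usually add random jitter to the waits.

```python
import time
from typing import Callable, Iterator


def backoff_delays(base: float = 0.1, factor: float = 2.0,
                   cap: float = 5.0, attempts: int = 6) -> Iterator[float]:
    """Yield the wait before each retry: base, base*factor, ... capped at cap."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def call_with_backoff(operation: Callable[[], object],
                      sleep: Callable[[float], object] = time.sleep,
                      **delay_options: float) -> object:
    """Try once, then retry after each backoff delay; re-raise the last error."""
    try:
        return operation()
    except Exception as error:
        last_error = error
    for delay in backoff_delays(**delay_options):
        sleep(delay)  # wait longer after every failure
        try:
            return operation()
        except Exception as error:
            last_error = error
    raise last_error


# Hypothetical operation that fails twice, then succeeds.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("try again")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky, sleep=waits.append, base=1.0)
```

Passing `waits.append` as the `sleep` function is only there so the example runs instantly; with the default `time.sleep` it would pause for one second and then two before the third, successful call.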

It may be a bit unusual, but my way of thinking of "distributed systems" was the 30+ year (and still continuing) effort to make many systems look like one. Distributed transactions, quorum algorithms, RPC, synchronous request-response, tightly-coupled schema, and similar efforts all try to mask the existence of independence from the application developer and from the user. In other words, make it look to the application like many systems are one system. While I have invested a significant portion of my career working in this effort, I have repented and believe that we are evolving away from this approach.
    -- Pat Helland

This stuff isn't just for egghead protocol designers and comp sci academics. Basically any project that is sooner or later going to run on more than a single box encounters these problems. Your coders have modules to finish. But they have no tools in their arsenal to deal with this stuff. The SQL or Posix APIs leave programmers woefully unprepared for even a trivial foray outside of N=1.

Humility in the face of complexity makes programmers better. Logs sucker-punch good programmers because their assumptions about what should be hard and what should be easy are upended by N>1. Once you get two machines in the mix, if your requirements include reliability, consistency, fault-tolerance, and high performance, you are at the bleeding edge of distributed systems research.

This is not what we want to be worrying about. We're making huge social media systems to change the world. Head-spinning semantic analysis algorithms. Creepy targeted monetization networks. The future is indeed bright. But we take for granted the implicit requirements that the application will be able to scale, that it will stay up, that it will work.

So why does Technorati go down so much... why is Twitter having problems scaling... why did Friendster lose? All those places benefited from top-notch programmers and lots of resources. How can it be, we ask, that the top software designers in the world, with potentially millions of dollars personally at stake, create systems that let everyone down?

Of course programmers make systems that don't satisfy all of the (implicit) requirements. Nobody knows how to yet. We're still figuring this stuff out. There are no off-the-shelf toolkits.

Without a standardized approach or toolset, programmers do what they can and get the job done anyway. So you have cron jobs ssh'ing files around, ad-hoc DB replication schemes, de-normalized data sharded across god-knows-where. And the maintenance load for the cluster starts to increase...

"We're fine here," some readers will say. "We have a great system to count our logs." But below the visible surface of FAIL is the hidden realm of productivity opportunity cost. Getting the application to work, to scale, to be resilient to failures is just the start. Making it a joy to program is the differentiator.

* * *

There is a place where they can count their logs. They had to make this funny distributed hash-of-hashes data structure. It's got some unusual features for a database - explicit application management of disk seeks, a notion of time-based versioning, and a severely limited transactional model. It relies on an odd cast of supporting software. Paxos is even in there under the hood. That wasn't enough so they hired one of the original guys who invented Unix a million years ago, and the first thing he did was to invent an entirely new programming language to use it.

But now they can count their logs.
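For a feel of what a "hash-of-hashes with time-based versioning" means as a data shape, here is a single-process toy. It is only a guess at the shape described above, not that system's real API: `VersionedTable`, its methods, and the row and column names are all hypothetical, and the hard parts (distribution, disk seeks, the limited transactions) are left out entirely.

```python
from collections import defaultdict
from typing import Optional


class VersionedTable:
    """Toy map of row key -> column name -> {timestamp: value}."""

    def __init__(self) -> None:
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row: str, column: str, timestamp: int, value: object) -> None:
        """Store a value as one more version; older versions are kept."""
        self.rows[row][column][timestamp] = value

    def get(self, row: str, column: str,
            as_of: Optional[int] = None) -> object:
        """Return the newest version, or the newest one at or before as_of."""
        versions = self.rows.get(row, {}).get(column, {})
        eligible = [t for t in versions if as_of is None or t <= as_of]
        if not eligible:
            return None
        return versions[max(eligible)]


# Made-up daily hit counts for one page, keyed by date-like timestamps.
table = VersionedTable()
table.put("example.com/index.html", "hits", 20070827, 120)
table.put("example.com/index.html", "hits", 20070828, 150)

latest = table.get("example.com/index.html", "hits")
earlier = table.get("example.com/index.html", "hits", as_of=20070827)
missing = table.get("example.com/nope.html", "hits")
```

Here `latest` is 150, `earlier` is 120, and `missing` is `None`. Keeping every timestamped version means a write never has to overwrite anything in place, which is one reason this shape is friendlier to N>1 than i=n++.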



Kevin Burton: Distributed System Design is Difficult. We're seeing distributed systems effects even on single machines now, thanks to multiple cores.


"Frig! The moon looks like the sun..."


That squiggle is actually made of moon-light though. That's kinda neat...

"Moon good, rest dark. Maybe I can photoshop it..."


Hawk or Friedl could do much better with this view. We'll see if that book Friedl recommended helps... stay tuned. :-|

August 29, 2007

Spooky "Elk Cloner" movie

An art school student has made a spooky CGI movie "in honor of" Elk Cloner (that first virus thing that I seem to be associated with...)

He's got a ton of details about how he did the animation, even scans of his original hand-written notes.

He's submitted the movie to 21 film festivals. I hope it wins some awards. Go elk cloner go!

August 30, 2007

Popdex Revisited

So I blogged about how Popdex had been taken over by spammers.

The original author of Popdex commented about how he sold the project, and it had taken this unfortunate turn.

Now, when you search for "popdex", instead of seeing Popdex.com at #1 as before, the popdex spammer site doesn't show up anywhere in the results.

It's cool that Google actively polices web spam. But unfortunately this manual whack-a-mole job (Matt was that you?) didn't entirely work, since the first result is now an extremely ad- and adsense-heavy page (even threw a popunder at me that got through Firefox's blocker) which simply mirrors the old popdex pitch text and points to popdex.com.

I would love to see exactly what that manual whack-a-mole interface looked like. I wonder how scalable in the end hitting a -zap- button on individual spam sites is though.

August 31, 2007

Be careful what you wish for

Big AP-wide story by Nick Jesdanun on my 1982 elk cloner virus in your papers for the holiday weekend. Fun. Nick also blogged a bit about writing the story.

update ... a reporter in Pittsburgh spotted the story on the wire and called me up to add some local color.

Time to boot up the emulator...

rc$ ../a2/a2 cloner.dsk




 A 002 HELLO
 T 020 CLONER 2.0

]CALL -151


9000-   02          ???
9001-   A9 FF       LDA   #$FF
9003-   85 4C       STA   $4C
9005-   A9 8F       LDA   #$8F
9007-   85 4D       STA   $4D
9009-   A9 20       LDA   #$20
900B-   8D 80 A1    STA   $A180
900E-   A9 5B       LDA   #$5B
9010-   8D 81 A1    STA   $A181
9013-   A9 A7       LDA   #$A7
9015-   8D 82 A1    STA   $A182
9018-   A9 AD       LDA   #$AD
901A-   8D D1 A4    STA   $A4D1
901D-   A9 B6       LDA   #$B6
901F-   8D D2 A4    STA   $A4D2
9022-   A9 AA       LDA   #$AA
9024-   8D D3 A4    STA   $A4D3
9027-   A9 4C       LDA   #$4C
9029-   8D 13 A4    STA   $A413
902C-   A9 90       LDA   #$90







The Apple II was such a great computer to learn on. You turn it on, can jump right into a ROM monitor and start typing in assembly. Those were the days. :)

Oh btw that "CLONER 2.0" was the evil version that I never released.
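If you don't have an emulator handy, the visible part of that listing is simple enough to trace by hand: it is nothing but LDA-immediate and STA pairs. The sketch below replays the bytes from $9001 through $902D and records which addresses get which values. It handles only the three opcodes that appear in the listing ($A9, $85, $8D), and the names `LISTING_BYTES` and `trace_stores` are made up for this example. It says nothing about what the rest of the program does.

```python
# Bytes copied from the disassembly above, starting at $9001
# (the lone $02 at $9000 is skipped).
LISTING_BYTES = bytes.fromhex(
    "A9FF 854C A98F 854D "
    "A920 8D80A1 A95B 8D81A1 A9A7 8D82A1 "
    "A9AD 8DD1A4 A9B6 8DD2A4 A9AA 8DD3A4 "
    "A94C 8D13A4 A990"
)


def trace_stores(code: bytes, origin: int):
    """Replay LDA #imm / STA zp / STA abs and return (stores, accumulator)."""
    accumulator = None
    stores = {}
    pc = 0
    while pc < len(code):
        opcode = code[pc]
        if opcode == 0xA9:    # LDA #immediate
            accumulator = code[pc + 1]
            pc += 2
        elif opcode == 0x85:  # STA zero page
            stores[code[pc + 1]] = accumulator
            pc += 2
        elif opcode == 0x8D:  # STA absolute, little-endian address
            address = code[pc + 1] | (code[pc + 2] << 8)
            stores[address] = accumulator
            pc += 3
        else:
            raise ValueError(
                f"unhandled opcode ${opcode:02X} at ${origin + pc:04X}")
    return stores, accumulator


stores, final_a = trace_stores(LISTING_BYTES, origin=0x9001)
patches = {f"${addr:04X}": f"${value:02X}" for addr, value in stores.items()}
```

The trace shows nine stores: $FF and $8F into $4C/$4D, the bytes 20 5B A7 into $A180-$A182, AD B6 AA into $A4D1-$A4D3, and 4C into $A413. Read as 6502 opcodes, 20 is JSR, AD is LDA absolute and 4C is JMP, so these look like small patches written into code elsewhere in memory, though the listing alone doesn't say what that code is.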

About August 2007

This page contains all entries posted to Skrentablog in August 2007. They are listed from oldest to newest.
