
Counting stuff is really hard

I've never worked anywhere where the logs could be tallied well. Netscape, AOL: they had giant systems that slurped up the logs from the front ends and stuffed them into web-enabled databases. Every query took 90 seconds to run, and half of them timed out. Forget ad-hoc queries or tossing in a custom regex. Sometimes the logs would break, and it'd be weeks or months, or never, before they worked again.

Sometimes there was just too much traffic to be able to count it all. More log events came in every 24 hours than could be processed in a 24-hour log run.

Google Analytics doesn't seem to fare much better. Granted, we probably put more data into it at Topix than the average site. But I could never get unique IP counts out of that thing. It would just spin and spin until my browser gave up.

I've repeatedly seen senior engineers fail to make headway on the log problem. Logs should be easy, right? What could be more straightforward than collecting a set of files each day and tallying the lines?

It turns out that anything involving lots of data spread over a cluster of machines is hard. Correct that: even little bits of data spread over a cluster are hard. i=n++ in a distributed environment is a PhD thesis.

We take the simplicity of i=n++ or counting lines for granted. It all begins with a single CPU and we know that model. In fact, we know that model so deeply that we think in it, in the same way that language shapes what we can think about. The von Neumann architecture defines our perception of what is easy and what is hard.
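To make the i=n++ point concrete, here is a minimal sketch of one standard answer from the distributed-systems literature: a grow-only counter (a G-counter CRDT), where each node only ever increments its own slot and replicas merge by taking the per-slot maximum. The node names are made up for illustration; this is a toy, not a production library.

```python
class GCounter:
    """Toy grow-only counter (G-counter): each node increments only its own
    slot, so concurrent increments never conflict; merge is element-wise max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> that node's local count

    def increment(self):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + 1

    def merge(self, other):
        # Merging is commutative, associative, and idempotent, so replicas
        # can gossip state in any order and still converge.
        for node, n in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), n)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment()   # two events seen by node a
b.increment()                  # one event seen by node b
a.merge(b); b.merge(a)         # exchange state in both directions
assert a.value() == b.value() == 3
```

Even this "easy" version gives up something: you get eventual convergence of a sum, not a single authoritative n you can read-modify-write.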

But it doesn't map at all to distributed systems.

The approach of the industry has been to try to impose von Neumann semantics on the distributed system. Recently some have started to question whether that's the right approach.

The underlying assumption ... is that any system that is scalable, fault-tolerant, and upgradable is composed of N nodes, where N>1.

The problem with current data storage systems, with rare exception, is that they are all "one box native" applications, i.e. from a world where N=1. From Berkeley DB to MySQL, they were all designed initially to sit on one box. Even after several years of dealing with MegaData you still see painful stories like what the YouTube guys went through as they scaled up. All of this stems from an N=1 mentality.
    -- Joe Gregorio

Distributed systems upend our intuition of what should be hard and what should be easy. So we try to devise protocols and systems to carry forward what was easy in our N=1 single CPU world.

But these algorithms are seriously messed up. "Let's Paxos for lunch" is a joke because Paxos is such a ridiculously complicated protocol. Yes I understand its inner beauty and all that but c'mon. Sometimes you get the feeling the universe is on your side when you use a technique. Like exponential backoff. You've been using that since you were a kid learning about social interactions and how to manage frustration. It feels right. But if you come to a point in your design where something like Paxos needs to be brought out, maybe the universe is telling you that you're doing it wrong.
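The exponential backoff the universe seems to approve of fits in a few lines. This is a hedged sketch, not anyone's canonical implementation; the function name, defaults, and the "full jitter" variant (sleeping a random fraction of the backoff window to spread out retries) are my choices.

```python
import random
import time

def retry_with_backoff(op, max_tries=5, base=0.05, cap=2.0):
    """Call op() until it succeeds, sleeping an exponentially growing,
    jittered delay between failures. Names and defaults are illustrative."""
    for attempt in range(max_tries):
        try:
            return op()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of retries: surface the last error
            delay = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter spreads retries out

# A deliberately flaky operation that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient failure")
    return "ok"

assert retry_with_backoff(flaky) == "ok"
```

The contrast with Paxos is the point: this whole technique is a loop, a multiply, and a random sleep.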

It may be a bit unusual, but my way of thinking of "distributed systems" was the 30+ year (and still continuing) effort to make many systems look like one. Distributed transactions, quorum algorithms, RPC, synchronous request-response, tightly-coupled schema, and similar efforts all try to mask the existence of independence from the application developer and from the user. In other words, make it look to the application like many systems are one system. While I have invested a significant portion of my career working in this effort, I have repented and believe that we are evolving away from this approach.
    -- Pat Helland

This stuff isn't just for egghead protocol designers and comp-sci academics. Basically any project that is sooner or later going to run on more than a single box encounters these problems. Your coders have modules to finish. But they have no tools in their arsenal to deal with this stuff. The SQL and POSIX APIs leave programmers woefully unprepared for even a trivial foray outside of N=1.

Humility in the face of complexity makes programmers better. Logs sucker-punch good programmers because their assumptions about what should be hard and what should be easy are upended by N>1. Once you get two machines in the mix, if your requirements include reliability, consistency, fault-tolerance, and high performance, you are at the bleeding edge of distributed systems research.

This is not what we want to be worrying about. We're making huge social media systems to change the world. Head-spinning semantic analysis algorithms. Creepy targeted monetization networks. The future is indeed bright. But we take for granted the implicit requirements that the application will be able to scale, that it will stay up, that it will work.

So why does Technorati go down so much... why is Twitter having problems scaling... why did Friendster lose? All those places benefited from top-notch programmers and lots of resources. How can it be, we ask, that the top software designers in the world, with potentially millions of dollars personally at stake, create systems that let everyone down?

Of course programmers make systems that don't satisfy all of the (implicit) requirements. Nobody knows how to yet. We're still figuring this stuff out. There are no off-the-shelf toolkits.

Without a standardized approach or toolset, programmers do what they can and get the job done anyway. So you have cron jobs ssh'ing files around, ad-hoc DB replication schemes, de-normalized data sharded across god-knows-where. And the maintenance load for the cluster starts to increase...

"We're fine here," some readers will say. "We have a great system to count our logs." But below the visible surface of FAIL is the hidden realm of productivity opportunity cost. Getting the application to work, to scale, to be resilient to failures is just the start. Making it a joy to program is the differentiator.

* * *

There is a place where they can count their logs. They had to make this funny distributed hash-of-hashes data structure. It's got some unusual features for a database: explicit application management of disk seeks, a notion of time-based versioning, and a severely limited transactional model. It relies on an odd cast of supporting software. Paxos is even in there under the hood. That wasn't enough, so they hired one of the original guys who invented Unix a million years ago, and the first thing he did was to invent an entirely new programming language to use it.

But now they can count their logs.
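The "distributed hash-of-hashes" with time-based versioning can be caricatured, on one machine, as a map keyed by row and column where each cell holds a few timestamped versions. This toy sketch is my reading of the description above, with names I made up; it is nothing like the real system's API or its distribution machinery.

```python
import time

class VersionedTable:
    """Toy caricature: row -> column -> list of (timestamp, value) versions,
    kept sorted oldest-first and pruned to the newest max_versions."""

    def __init__(self, max_versions=3):
        self.rows = {}
        self.max_versions = max_versions

    def put(self, row, col, value, ts=None):
        versions = self.rows.setdefault(row, {}).setdefault(col, [])
        versions.append((time.time() if ts is None else ts, value))
        versions.sort()
        del versions[:-self.max_versions]  # drop everything but the newest N

    def get(self, row, col, ts=None):
        """Newest value, or the newest value written at or before ts."""
        for t, v in reversed(self.rows.get(row, {}).get(col, [])):
            if ts is None or t <= ts:
                return v
        return None

t = VersionedTable()
t.put("com.example/index", "anchor:home", "v1", ts=1)
t.put("com.example/index", "anchor:home", "v2", ts=2)
assert t.get("com.example/index", "anchor:home") == "v2"
assert t.get("com.example/index", "anchor:home", ts=1) == "v1"
```

The single-box version is trivial; the entire point of the post is that spreading this map over thousands of machines, reliably, is what took the odd cast of supporting software.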



Kevin Burton: Distributed System Design is Difficult. We're seeing distributed systems effects even on single machines now, thanks to multiple cores.

Comments (3)

Jim McCoy:

Counting logs is not hard, but counting them in a way that enables you to make standard SQL queries while still doing your coding in Perl is hard. When your toolkit is limited to simple languages and you base your system on the "well, it worked when we had three machines, so it should work for three hundred machines" principle, you start to run into problems.

Just need a count of unique IPs or a similar isolated data point? Simple: use a counted Bloom filter. Want to be able to do complex queries over your distributed logs and not grind your servers into dust? Use Mnesia and Erlang.
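The Bloom filter suggestion can be sketched briefly: use the filter to remember which IPs have already been seen, so each IP bumps a counter at most once. This is my illustrative version, not the commenter's code; the class name, the SHA-256-derived hash positions, and the sizes are all made up, and false positives make the result a slight undercount.

```python
import hashlib

class BloomUniqueCounter:
    """Approximate distinct counter: a Bloom filter remembers seen IPs, so
    each IP increments the count at most once. m and k are toy sizes."""

    def __init__(self, m=1 << 20, k=4):
        self.bits = bytearray(m // 8)
        self.m, self.k = m, k
        self.count = 0

    def _positions(self, item):
        # k independent-ish bit positions derived from salted SHA-256
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, ip):
        idxs = list(self._positions(ip))
        already = all((self.bits[i // 8] >> (i % 8)) & 1 for i in idxs)
        for i in idxs:
            self.bits[i // 8] |= 1 << (i % 8)
        if not already:
            self.count += 1

c = BloomUniqueCounter()
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"]:
    c.add(ip)
assert c.count == 3
```

The appeal for logs is that the state is a fixed-size bit array: it can be merged across machines with a bitwise OR, though merging the counters themselves then takes more care.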

Hmm.... seems to me we have had this discussion before :)

The problem with solving scalability issues is twofold: most people subscribe to YAGNI when designing a small-scale application and get hosed when they discover that they really do need those complex algorithms and esoteric languages, and most businesses do not realize that it is harder to teach an old-school code ninja how to think about scalability than it is to teach a scalability expert how to lead a team of code ninjas.

That's great, Jim. I'm glad Erlang/Mnesia have made distributed computing a solved problem. I'm sure the rest of the industry will breathe a sigh of relief over that as well.

Seriously, mnesia/erlang seem worthy of attention, but I get the sense you're responding to a different post than the one I thought I wrote...

Jim McCoy:

I think I was responding to part of your post, but obviously not as clearly as I had hoped.

Distributed systems are different. They are not necessarily harder, but they require you to give up a lot of long-held assumptions about how things work, what you can know about the system at any particular point in time, and how components interact. Some of the tools available to us make this transition easier than others, and sometimes the key factor is not the tool you use but how you apply it to a problem.

Erlang/OTP and Mnesia do not make distributed computing a solved problem, but they give you a toolset that makes solving distributed problems a lot easier. There is very little within these tools that could not be replicated in other languages with varying degrees of difficulty (some decent efforts to replicate some of the better features have made progress in Scheme and Haskell, and I think that Oz has some nice features that even Erlang lacks). What makes Erlang different, IMHO, is that it is already here with a lot of the features you claim are needed.

If you start with a system that is built on the fast & cheap side of the fast/cheap/good triangle, then it should not be too surprising that things start to fall apart eventually. If you look at your Technorati, Friendster, and Twitter examples, they all probably started with the ill-advised "what is the least we can do to make this work" mantra and later paid the price. Twitter and Technorati recovered (but were very vulnerable for a period of time), while Friendster probably missed out on its chance at a billion-dollar payout because of this choice.



This page contains a single entry from the blog posted on August 28, 2007 8:16 AM.
