Sometimes there was just too much traffic to count it all. More log events came in every 24 hours than could be processed in a 24-hour log run.
Google Analytics doesn't seem to fare much better. Granted, we probably put more data into it at Topix than the average site. But I could never get unique IP counts out of that thing. It would just spin and spin until my browser gave up.
I've repeatedly seen senior engineers fail to make headway on the log problem. Logs should be easy, right? What could be more straightforward than collecting a set of files each day and tallying the lines?
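And on one box, it really is that straightforward. A minimal sketch of the whole job (the paths and log format here are made up, but this is roughly the shape of it):

    # Count requests and unique client IPs for a day, on a single machine.
    # Assumes common-log-style lines with the client IP as the first field.
    import glob

    total = 0
    unique_ips = set()
    for path in glob.glob("/var/log/frontend/access-*.log"):   # hypothetical paths
        with open(path) as f:
            for line in f:
                total += 1
                unique_ips.add(line.split(" ", 1)[0])

    print(total, len(unique_ips))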
It turns out that anything involving lots of data spread over a cluster of machines is hard. Correct that: even a little bit of data spread over a cluster is hard. i=n++ in a distributed environment is a hard problem.
We take the simplicity of i=n++ or counting lines for granted. It all begins with a single CPU and we know that model. In fact, we know that model so deeply that we think in it, in the same way that language shapes what we can think about. The von Neumann architecture defines our perception of what is easy and what is hard.
But it doesn't map at all to distributed systems.
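A toy illustration of why, with a plain dict standing in for a counter that lives somewhere else on the network (the names here are mine, not from any real system):

    # Two writers doing read-modify-write against a shared counter lose updates.
    import threading, time

    store = {"n": 0}   # stand-in for a remote counter: a DB row, a cache key, ...

    def incr_many(times):
        for _ in range(times):
            n = store["n"]        # read
            time.sleep(0)         # yield, as a network round-trip would
            store["n"] = n + 1    # write back, clobbering concurrent increments

    workers = [threading.Thread(target=incr_many, args=(10000,)) for _ in range(2)]
    for w in workers: w.start()
    for w in workers: w.join()
    print(store["n"])   # almost always well short of 20000

On one CPU you get i=n++ for free from the hardware. Once the counter is a network hop away, you have to buy that atomicity back somehow, and every way of buying it costs you.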
The approach of the industry has been to try to impose von Neumann semantics on the distributed system. Recently some have started to question whether that's the right approach.
The underlying assumption ... is that any system that is scalable, fault-tolerant, and upgradable is composed of N nodes, where N>1.
The problem with current data storage systems, with rare exception, is that they are all "one box native" applications, i.e. from a world where N=1. From Berkeley DB to MySQL, they were all designed initially to sit on one box. Even after several years of dealing with MegaData you still see painful stories like what the YouTube guys went through as they scaled up. All of this stems from an N=1 mentality.
-- Joe Gregorio
Distributed systems upend our intuition of what should be hard and what should be easy. So we try to devise protocols and systems to carry forward what was easy in our N=1 single CPU world.
But these algorithms are seriously messed up. "Let's Paxos for lunch" is a joke because Paxos is such a ridiculously complicated protocol. Yes I understand its inner beauty and all that but c'mon. Sometimes you get the feeling the universe is on your side when you use a technique. Like exponential backoff. You've been using that since you were a kid learning about social interactions and how to manage frustration. It feels right. But if you come to a point in your design where something like Paxos needs to be brought out, maybe the universe is telling you that you're doing it wrong.
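For contrast, here is roughly what exponential backoff looks like; the function and parameter names are mine, but the whole idea fits in a dozen lines:

    # Retry a flaky operation, waiting roughly twice as long after each failure.
    import random, time

    def with_backoff(op, retries=5, base=0.1, cap=10.0):
        for attempt in range(retries):
            try:
                return op()
            except OSError:                    # stand-in for "transient failure"
                if attempt == retries - 1:
                    raise                      # out of patience, give up
                delay = min(cap, base * (2 ** attempt))
                time.sleep(random.uniform(0, delay))   # jitter avoids synchronized retries

You can explain that to anybody in a sentence. Try doing the same with Paxos.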
It may be a bit unusual, but my way of thinking of "distributed systems" was the 30+ year (and still continuing) effort to make many systems look like one. Distributed transactions, quorum algorithms, RPC, synchronous request-response, tightly-coupled schema, and similar efforts all try to mask the existence of independence from the application developer and from the user. In other words, make it look to the application like many systems are one system. While I have invested a significant portion of my career working in this effort, I have repented and believe that we are evolving away from this approach.
-- Pat Helland
This stuff isn't just for egghead protocol designers and comp sci academics. Basically any project that is sooner or later going to run on more than a single box encounters these problems. Your coders have modules to finish. But they have no tools in their arsenal to deal with this stuff. The SQL or POSIX APIs leave programmers woefully unprepared for even a trivial foray outside of N=1.
Humility in the face of complexity makes programmers better. Logs sucker-punch good programmers because their assumptions about what should be hard and what should be easy are upended by N>1. Once you get two machines in the mix, if your requirements include reliability, consistency, fault-tolerance, and high performance, you are at the bleeding edge of distributed systems research.
This is not what we want to be worrying about. We're making huge social media systems to change the world. Head-spinning semantic analysis algorithms. Creepy targeted monetization networks. The future is indeed bright. But we take for granted the implicit requirements that the application will be able to scale, that it will stay up, that it will work.
So why does Technorati go down so much... why is Twitter having problems scaling... why did Friendster lose? All of those places benefited from top-notch programmers and lots of resources. How can it be, we ask, that the top software designers in the world, with potentially millions of dollars personally at stake, create systems that let everyone down?
Of course programmers make systems that don't satisfy all of the (implicit) requirements. Nobody knows how to yet. We're still figuring this stuff out. There are no off-the-shelf toolkits.
Without a standardized approach or toolset, programmers do what they can and get the job done anyway. So you have cron jobs ssh'ing files around, ad-hoc DB replication schemes, de-normalized data sharded across god-knows-where. And the maintenance load for the cluster starts to increase...
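Concretely, something like this ends up in a crontab somewhere (hostnames and paths invented for the example):

    # Pull a count from each frontend over ssh, then add them up.
    import subprocess

    HOSTS = ["web1", "web2", "web3"]
    total = 0
    for host in HOSTS:
        # If one host is down or slow tonight, the whole run fails or stalls.
        out = subprocess.run(["ssh", host, "wc", "-l", "/var/log/access.log"],
                             capture_output=True, text=True, check=True)
        total += int(out.stdout.split()[0])
    print(total)

It works, right up until it doesn't, and then somebody gets to babysit it at 3am.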
"We're fine here," some readers will say. "We have a great system to count our logs." But below the visible surface of FAIL is the hidden realm of productivity opportunity cost. Getting the application to work, to scale, to be resilient to failures is just the start. Making it a joy to program is the differentiator.
* * *
There is a place where they can count their logs. They had to make this funny distributed hash-of-hashes data structure. It's got some unusual features for a database: explicit application management of disk seeks, a notion of time-based versioning, and a severely limited transactional model. It relies on an odd cast of supporting software. Paxos is even in there under the hood. That wasn't enough, so they hired one of the original guys who invented Unix a million years ago, and the first thing he did was to invent an entirely new programming language to use it.
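The shape they describe is roughly a map of maps with timestamped cells. Here's a single-box sketch of just the data model (the names and key format are mine, not any real system's API, and the hard part, spreading it over thousands of machines, is exactly what this doesn't show):

    # row -> column -> {timestamp: value}; newer versions don't overwrite older ones.
    from collections import defaultdict
    import time

    table = defaultdict(lambda: defaultdict(dict))

    def put(row, col, value, ts=None):
        table[row][col][ts if ts is not None else time.time()] = value

    def get_latest(row, col):
        versions = table[row][col]
        return versions[max(versions)] if versions else None

    put("com.example.www/2008-04-01", "hits", 12345, ts=1)
    put("com.example.www/2008-04-01", "hits", 12346, ts=2)   # newer version; the old one is kept
    print(get_latest("com.example.www/2008-04-01", "hits"))  # -> 12346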
But now they can count their logs.
Kevin Burton: Distributed System Design is Difficult. We're seeing distributed systems effects even on single machines now, thanks to multiple cores.