
blekko's ambient cluster health visualization

When you have several hundred servers in a cluster, knowing the state and health of all of them can be a challenge. Traditional pager alert systems often either log too many events, which makes people tune them out, or miss non-fatal but still serious server sickness, such as degraded disk/cpu/network performance or subtle application errors.

This becomes especially true when the cluster and application are designed for high availability. If the application is doing its best to hide server failures from the user, it's often not apparent when a serious problem is developing until the site fails in a more public or obvious way.

We called these "analog failures" at Topix. There was a fairly complicated chain of processing for incoming stories that had been crawled: crawl, categorize, cluster, dedup, roboedit, push to front ends, and push to the incremental search system. Once, an engineer mistakenly deleted half of the sources from our crawl, and it took us a disturbingly long time to notice. The problem was that, while overall we had half as many stories on the site, most pages still had new stories coming in, so we didn't notice that anything was wrong.

Sometimes a server has a messed-up failure, like its network card starts losing 50% of its packets, but stuff is still getting through. Or a drive is in the process of failing, and its read/write rate is 10% of normal, but it hasn't failed enough to be removed from service yet. Or the cpu overheats and runs at a fraction of its normal speed. There seem to be limitless numbers of unusual ways that servers can fail.

At blekko, there are dozens of stats we'd ideally like to track per host:

  • How full are each of the disks?
  • Are there any SMART errors being reported from the drives?
  • Are we getting read or write errors?
  • What is the read/write throughput rate? Sometimes failures degrade the rate substantially, but the disk continues to function.
  • What is the current disk read latency?
  • Is packet loss occurring to the node?
  • What is the read/write network throughput?
  • What is the cpu load?
  • How much memory is in use?
  • How much swap is being used?
  • How big is the kernel's dirty page cache?
  • What are the internal/external temperature sensors reading?
  • How many live filesystems are on the host vs. dead disks?

Other stats pertain to our cluster datastore:

  • How many buckets are on each host?
  • Is the host above or below goal for its number of buckets?
  • What is the outbound write lag from the host?
  • What is the maximum seek depth for a given path/bucket?
  • Do we have three copies of every bucket (R3)?
  • If we're not at R3, how many bucket copies are occurring?
  • For running mapjobs, what is their ETA + read/write/error rate?
  • Are the ram caches fully loaded?
  • Are we crawling/indexing, and what is the rate compared with the historical rate?

The first step is to start putting the stats you want to be able to see into a big status table. But at 175 hosts, the table is kind of long, and it's hard to spot developing problems in the middle of the table.
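For concreteness, the raw feed behind such a table could be as simple as a per-host script along these lines. This is only a sketch, not blekko's actual collector; the stat choices and output format are assumptions, and it pulls from standard Linux sources (df, /proc/loadavg, /proc/meminfo).

#!/usr/bin/perl
# Hypothetical per-host stat reporter: prints one status line per run.
use strict;
use warnings;
use Sys::Hostname;

# Worst disk usage percentage across mounted filesystems, via df.
my $worst_disk = 0;
for my $line (`df -kP`) {
    next unless $line =~ /\s(\d+)%\s+\//;    # e.g. "... 83% /data3"
    $worst_disk = $1 if $1 > $worst_disk;
}

# 1-minute load average from /proc/loadavg.
open my $la, '<', '/proc/loadavg' or die "loadavg: $!";
my $first = <$la>;
my ($load1) = split ' ', $first;

# Free memory (kB) from /proc/meminfo.
my %mem;
open my $mi, '<', '/proc/meminfo' or die "meminfo: $!";
while (<$mi>) { $mem{$1} = $2 if /^(\w+):\s+(\d+)/ }

printf "%s disk_max=%d%% load1=%s memfree_kb=%d\n",
    hostname(), $worst_disk, $load1, $mem{MemFree} || 0;

A central job could collect one such line per host and render the result as the status table, or as the pictures described below.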

So we have been experimenting with mapping system stats onto different visualizations, so we can tell at a glance the overall state of hundreds of servers, and spot minor problems before they grow.

A table with 175 rows is pretty long, but you can fit 175 squares into a very small picture. This grid shows overall disk usage by host. The color of each tile shows the disk usage: red is 90%, orange is 80%, yellow is 70%, and blue is below 60%. Dead filesystems on the node are represented by grey bars inside the tile. The whole grid is sorted worst-to-best, so it's easy to see the fraction of hosts at a given level of usage.
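Rendering that kind of grid is cheap. Here's a hypothetical sketch (not our actual renderer) that sorts hosts worst-to-best and emits one colored HTML tile per host, using the thresholds above; the 60-70% band isn't spelled out in the text, so the sketch just lumps it in with blue.

#!/usr/bin/perl
# Hypothetical disk-usage grid: one small colored tile per host, worst first.
use strict;
use warnings;

# Host => worst disk usage %; in practice this comes from the collectors.
my %disk_pct = ( host001 => 91, host002 => 74, host003 => 55, host004 => 83 );

sub tile_color {
    my ($pct) = @_;
    return 'red'    if $pct >= 90;
    return 'orange' if $pct >= 80;
    return 'yellow' if $pct >= 70;
    return 'blue';    # 60-70% band not specified in the post; lumped with blue
}

print "<div style='width:300px'>\n";
for my $host (sort { $disk_pct{$b} <=> $disk_pct{$a} } keys %disk_pct) {
    printf "<span title='%s: %d%%' style='display:inline-block; " .
           "width:14px; height:14px; margin:1px; background:%s'></span>\n",
        $host, $disk_pct{$host}, tile_color($disk_pct{$host});
}
print "</div>\n";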

Our datastore uses a series of buckets (4096 in our current map) to spread the data across the servers. Each bucket is stored three times. If we have three copies of every bucket, we're at "R3". This is the standard healthy state of the system.

Because fetch/store operations will route around failures, it's not at all apparent from the view of the application if some buckets do not have three copies, and the cluster is degraded. So we have a grid of the buckets in our system, color coded to show whether there are 0/1/2/3 copies of the bucket.

In the above picture, the set of buckets in red have only 1 copy. The yellow buckets have 2 copies, and the green buckets have three. We have a big monitor with this display in our office; if it ever shows anything but a big green "3", folks notice and can investigate.
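The copy-count bookkeeping behind that display amounts to something like the sketch below. It's hypothetical -- the real datastore presumably gets its replica locations from its own metadata -- but it shows the shape of the R3 check and the color mapping (black for 0 copies is an assumption; the post only names red/yellow/green).

#!/usr/bin/perl
# Hypothetical R3 check: count replicas per bucket, pick a tile color, and
# report how many buckets are fully replicated.
use strict;
use warnings;

my $nbuckets = 4096;    # size of the bucket map, per the post

# host => list of bucket numbers it holds; stand-in data for illustration.
my %host_buckets = (
    host001 => [ 0, 1, 2 ],
    host002 => [ 0, 1 ],
    host003 => [ 0 ],
);

my @copies = (0) x $nbuckets;
for my $host (keys %host_buckets) {
    $copies[$_]++ for @{ $host_buckets{$host} };
}

my %color = ( 0 => 'black', 1 => 'red', 2 => 'yellow', 3 => 'green' );
my $at_r3 = grep { $_ >= 3 } @copies;
printf "%d of %d buckets at R3\n", $at_r3, $nbuckets;
# Each bucket's tile would be drawn in $color{ $copies[$b] > 3 ? 3 : $copies[$b] }.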

For variety we've experimented with other ways to show data. This display is showing the fraction of a path in our datastore which has been loaded into the ram cache. Ram cache misses will fall back to disk, so it's not necessarily apparent to the user if the ram cache isn't loaded or working. But the disk fetch is much slower than the ram cache, so it's good to know if some machines have crashed and the ram cache isn't at 100%.
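To make the "not apparent to the user" point concrete, the read path looks roughly like this hypothetical sketch: a cache miss silently falls back to the disk copy, so callers always get an answer and only the latency changes. Names and data here are made up for illustration.

#!/usr/bin/perl
# Hypothetical read path plus a cache-coverage stat.
use strict;
use warnings;

my %disk      = ( k1 => 'v1', k2 => 'v2', k3 => 'v3', k4 => 'v4' );
my %ram_cache = ( k1 => 'v1', k2 => 'v2' );    # only half the rows are loaded

sub fetch {
    my ($rowkey) = @_;
    return $ram_cache{$rowkey} if exists $ram_cache{$rowkey};    # fast path
    return $disk{$rowkey};    # slow path -- correct answer, just higher latency
}

my $pct = 100 * keys(%ram_cache) / keys(%disk);
printf "ram cache %.0f%% loaded; fetch(k3) = %s\n", $pct, fetch('k3');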

Other parts of the display are standard graphs for data aggregated across all of the servers. These are super useful to spot overall load issues.

We're still experimenting with finding the best data to collect and show, but the ambient displays so far are a big win. Obvious issues are immediately visible to everyone in our office. And people will walk by, look at the deeper graphs, and sometimes spot issues. Taking the data from something you would have to proactively query with a cli command or by clicking around on some web forms, to displays that engineers will stop and look at for a few minutes on their way to or from getting a coffee or soda, has been a big improvement in our awareness of and response to cluster issues.

Comments (10)

great stuff, thanks for posting it!

It'd be great if there was some way in those visualizations to indicate a "desired state". There's a lot of viewer-knowledge required in the use cases you talk about...

One of the hypotheses we wanted to test was that the human eye/brain is really good at taking complex pictures over time and developing an idea of usual / unusual states. The issue that comes up with setting alarms in systems like nagios is that you have to foresee all of the unusual conditions and put triggers on them. But you can't always foresee the right places & points to put the triggers.

We had a nagios alarm on our temp sensors at 90F. After our datacenter did some work, one of our external sensors rose from 65F to 85F in 2 days. It never triggered the nagios alert, but this was clearly a major change and we spotted it only because someone recognized the "usual" temperatures they saw in our cluster status command (hmm we should add a grid for temps).
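To make that failure concrete: the nagios check compared readings against an absolute ceiling, while the signal that mattered was the jump relative to the sensor's own history -- exactly the kind of shift someone notices on a regularly-viewed display. A hypothetical sketch of the two checks side by side (readings and threshold taken from the story above):

#!/usr/bin/perl
# Hypothetical contrast: fixed-ceiling alert vs. deviation-from-baseline.
use strict;
use warnings;
use List::Util qw(sum);

my @history = (65, 64, 66, 65, 65, 66, 64);    # recent readings, degrees F
my $current = 85;                              # after the datacenter work
my $ceiling = 90;                              # the nagios threshold

print "hard alert\n" if $current >= $ceiling;  # never fires: 85 < 90

my $baseline = sum(@history) / @history;       # about 65F
printf "unusual: %+.0fF vs. baseline %.0fF\n", $current - $baseline, $baseline
    if abs($current - $baseline) > 10;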

When you make a picture out of the data, and make it frequently viewable by people casually walking by, they start to recognize what typical / aberrant displays are, without having to pre-program what these states might be.

Dan Clkar:

Very informative and well written article. One thing I would mention though is that you didn't explain what buckets are. Your datastores use buckets, but what's a bucket?

I would assume it's an abstraction you use to represent a container for data.

I did skip over some details that folks will no doubt wonder about... I intend to post more details about the architecture of our datastore in the near future.

Essentially, our cluster storage system looks like BigTable from the top, but the data is stored in buckets which are distributed across the nodes of the cluster. digest(rowkey) provides the bucket #, for any bucket map whose size is a power of 2 (1024, 4096, etc).

$hash = $pm->get( "/path/foo", $rowkey )

$pm->set( "/path/foo", $rowkey, { column => $val } )

and so forth..
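For readers wondering what digest(rowkey) might look like concretely, here is a hypothetical version; the choice of MD5 and of which digest bits to use is an assumption for illustration, not necessarily what blekko does.

#!/usr/bin/perl
# Hypothetical rowkey -> bucket mapping for a power-of-two bucket map.
use strict;
use warnings;
use Digest::MD5 qw(md5);

sub bucket_for {
    my ($rowkey, $nbuckets) = @_;              # $nbuckets must be a power of 2
    my $bits = unpack('N', md5($rowkey));      # first 32 bits of the digest
    return $bits & ($nbuckets - 1);            # equivalent to % $nbuckets here
}

printf "bucket for 'example.com/page' in a 4096-bucket map: %d\n",
    bucket_for('example.com/page', 4096);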

As others have said, well-written and informative. One comment pointed out that another tool requires an understanding of failure modes to be configured correctly. The article points out that things fail in new and unusual ways, over time. Each of these is valid, and they point to one conclusion: the purpose of any tool is to aid identification of failures and to do the best it can to bring attention to a potential trend. However, none of them will ever be a total replacement for looking at log files -- which are often the source of their data.

Visualization is an aid, not a replacement (this just re-states something Rich Skrenta wrote). "Then why am I using tools to aid me, if I can't trust them?" seems a logical question. The answer is that you can trust them, but because these tools result from (mathematically) an integration, smaller deviations disappear. To determine the importance of these deviations, you must see them. To see them, look at the nitty-gritty. To do this efficiently, use a sampling plan.

The most efficacious method requires some understanding of statistics. (Truthfully, most folks don't develop this understanding until somewhere between their MSc and PhD studies, but the use of an existing plan nearly removes this requirement.) Tables from an existing plan can be useful when applying this to clusters larger than 20 machines. (Less than that and I can't make any promises.) How to apply a sampling plan is likely to be intuitive to someone who can appreciate the algorithms behind Blekko's tool.

Older standards, e.g. older MIL specs, are adequate and may be more useful than more recent ones because the more recent standards have become an industry unto themselves (ISO, ANSI, etc). You can learn about the methods from a source such as "The Quality Control Handbook" by Juran and Gryna. Also, the American Society for Quality (http://www.asq.org/) refers to several sources. A personal favorite (that I no longer have) is the "original" text on Statistical Process Control, published by Western Electric (and -- with another date, I think -- by AT&T.)

Use tools for watching the day-to-day operations in your cluster, and use sampling plans on raw data (plus failures that slip by the tool) to design improvements to the tool. Nothing can replace getting your fingers dirty. You must use your time judiciously, sharing it between use of the tool for daily ops and the application of a corrective action process, to drive improvements to the tool.
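For readers who want to act on this advice but don't have a formal acceptance-sampling table handy, even a simple unbiased random sample of raw log lines is a start. The sketch below uses reservoir sampling (a plain random sample, not the MIL-spec style plans mentioned above); the log path and sample size are arbitrary choices for the example.

#!/usr/bin/perl
# Hypothetical reservoir sample: every log line has an equal chance of landing
# in the fixed-size sample, without holding the whole log in memory.
use strict;
use warnings;

my $k = 50;                      # sample size -- arbitrary
my @sample;
my $seen = 0;

open my $log, '<', '/var/log/syslog' or die "open: $!";   # path is an example
while (my $line = <$log>) {
    $seen++;
    if (@sample < $k) {
        push @sample, $line;
    } elsif (rand($seen) < $k) {
        $sample[ int rand $k ] = $line;    # replace a random existing entry
    }
}
print @sample;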

Thanks to the original author for the well-written post about Blekko.

Despite the prevalence of the colored belts we see today (six sigma), someone had to teach today's colored belts. The material comes from these references (in part, at the least). For reasons unknown to me, QC methods deployed after World War II just won't go away!

"[I] use belt to hold pants up!" -- Mr. Miyagi, played by Pat Morita in "The Karate Kid"

We actually pushed a version of this in Spinn3r a long time ago...

I think you looked at it a few months back Rich.

This is about to be mounted on our wall :)

We also have internal cluster stats like this which we're exposing more of....

and I agree it's nice to have everything in a somewhat unconventional, visible form.

The human brain has evolved a good ability to detect visible trends, so we might as well play towards that.

I think this excellent work ought to be in: http://www.flickr.com/groups/webopsviz/ :)

A while ago (http://www.skrenta.com/2008/10/whats_up_rich.html), you indicated a new name is coming.

Has it been chosen?

We're sticking with blekko.

I just happened across this page. Love your data viz ideas - we recently released what we're calling a seurat view in GroundWork Monitor. It provides a similar "whole infrastructure" view. Check out screenshot 2 on this page:
http://www.groundworkopensource.com/products/screenshots.html


