
When you have several hundred servers in a cluster, knowing the state and health of all of them can be a challenge.
Traditional pager alert systems often either log too
many events, which makes people tune them out, or miss non-fatal but still serious server sickness, such as
degraded disk/CPU/network performance or subtle application
errors.
This is especially true when the cluster and application are designed for high availability. If the
application is doing its best to hide server failures from
the user, it's often not apparent when a serious problem is developing until the site fails in a more public or obvious way.
We called these "analog failures" at Topix. There was a fairly complicated chain of processing for incoming stories that had been crawled: crawl, categorize, cluster, dedup,
roboedit, push to front ends, and push to the incremental search
system. Once, an engineer mistakenly deleted half of the sources from our crawl, and it took us a disturbingly long
time to notice. The problem was that, while overall we had
half as many stories on the site, most pages still had new
stories coming in, so we didn't notice that anything was wrong.
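
One way to catch this kind of failure is to compare a site-wide rate against its recent history instead of looking at individual pages. The following is a minimal sketch, not what Topix actually ran; the function name `rate_drop`, the 0.7 threshold, and the sample counts are all assumptions made for illustration.

```python
from statistics import median

def rate_drop(history, current, threshold=0.7):
    """Return True if `current` is below `threshold` times the median of `history`.

    `history` is a list of recent counts (e.g. stories per hour, site-wide).
    The 0.7 threshold is an arbitrary illustrative choice.
    """
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = median(history)
    if baseline <= 0:
        return False  # a zero baseline cannot meaningfully "drop"
    return current < threshold * baseline

# Hypothetical hourly story counts for the whole site, not per page.
last_week = [1000, 980, 1020, 990, 1010]
print(rate_drop(last_week, 500))  # half the usual volume
print(rate_drop(last_week, 970))  # ordinary variation
```

The point of measuring the total is that each page can still look alive while the aggregate has quietly halved.
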
Sometimes a server fails only partway: its network card starts losing 50% of its packets, but traffic
is still getting through. Or a drive is in the process of
failing, and its read/write rate is 10% of normal, but it
hasn't failed enough to be removed from service yet. Or the
CPU has overheated and is running at a fraction of its normal
speed. There seem to be limitless numbers of unusual ways
that servers can fail.
At blekko, there are dozens of stats we'd ideally like to track per host:
- How full is each of the disks?
- Are there any SMART errors being reported from the drives?
- Are we getting read or write errors?
- What is the read/write throughput rate? Sometimes failures degrade the rate substantially, but the disk continues to function.
- What is the current disk read latency?
- Is packet loss occurring to the node?
- What is the read/write network throughput?
- What is the cpu load?
- How much memory is in use?
- How much swap is being used?
- How big is the kernel's dirty page cache?
- What are the internal/external temperature sensors reading?
- How many live filesystems are on the host vs. dead disks?
Other stats pertain to our cluster datastore:
- How many buckets are on each host?
- Is the host above or below goal for its number of buckets?
- What is the outbound write lag from the host?
- What is the maximum seek depth for a given path/bucket?
- Do we have three copies of every bucket (R3)?
- If we're not at R3, how many bucket copies are occurring?
- For running mapjobs, what is their ETA + read/write/error rate?
- Are the ram caches fully loaded?
- Are we crawling/indexing, and how does the rate compare with the historical rate?
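
As an illustration of how a few of the per-host numbers above (memory in use, swap in use, dirty page cache) might be gathered on Linux, here is a small sketch that parses text in the format of `/proc/meminfo`. The function names and the sample values are made up for the example, and treating "memory in use" as `MemTotal - MemAvailable` is one assumption among several reasonable definitions (`MemAvailable` is only present on reasonably recent kernels).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value.

    Most fields are reported in kB; a few (such as HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def memory_stats(fields):
    """Derive a few per-host numbers, in kB, from parsed meminfo fields."""
    return {
        # Assumption: "in use" means total minus what the kernel calls available.
        "mem_used_kb": fields["MemTotal"] - fields["MemAvailable"],
        "swap_used_kb": fields["SwapTotal"] - fields["SwapFree"],
        "dirty_kb": fields["Dirty"],
    }

# Made-up sample; on a real host you would read open("/proc/meminfo").read().
SAMPLE = """\
MemTotal:       16384000 kB
MemFree:         1024000 kB
MemAvailable:    8192000 kB
Dirty:              2048 kB
SwapTotal:       4096000 kB
SwapFree:        4000000 kB
"""

stats = memory_stats(parse_meminfo(SAMPLE))
print(stats)
```

Keeping the parser separate from the file read makes it easy to run the same code against text shipped back from each host.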

The first step is to start putting the stats you want to be
able to see into a big status table. But at 175 hosts, the
table is kind of long, and it's hard to spot developing
problems in the middle of the table.
So we have been experimenting with mapping system stats onto
different visualizations, so we can tell at a glance the
overall state of hundreds of servers, and spot minor
problems before they grow.

A table with 175 rows is pretty long, but you can fit 175 squares into a very small picture. This grid shows overall
disk usage by host. The color of the tile shows the disk
usage: red is 90%, orange is 80%, yellow is 70%, blue is
below 60%. Dead filesystems on the node are represented by
grey bars inside the tile. The whole grid is sorted worst-to-best, so it's easy to see the fraction of hosts at
a given level of usage.
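
The tile grid described above can be sketched in a few lines. This is only an illustration under assumptions: the percentages are treated as lower bounds for each color, anything under 70% is lumped in with blue (the 60-70% band is not specified), the grey bars for dead filesystems are left out, and the names `usage_color` and `build_grid` and the 25-column default are invented for the example.

```python
def usage_color(pct):
    """Map a disk usage percentage to a tile color.

    Assumption: thresholds are lower bounds, and everything under 70%
    (including the unspecified 60-70% band) is shown as blue.
    """
    if pct >= 90:
        return "red"
    if pct >= 80:
        return "orange"
    if pct >= 70:
        return "yellow"
    return "blue"

def build_grid(usage_by_host, columns=25):
    """Return rows of (host, color) tiles, sorted worst-to-best by usage."""
    ordered = sorted(usage_by_host.items(), key=lambda kv: kv[1], reverse=True)
    tiles = [(host, usage_color(pct)) for host, pct in ordered]
    # Chop the flat, sorted list into fixed-width rows for display.
    return [tiles[i:i + columns] for i in range(0, len(tiles), columns)]

# Hypothetical hosts and usage percentages.
grid = build_grid({"h1": 55, "h2": 93, "h3": 72, "h4": 85}, columns=2)
for row in grid:
    print(row)
```

Because the tiles are sorted before being laid out, the colored region at the start of the grid directly shows what fraction of hosts is at each usage level.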

Our datastore uses a series of buckets (4096 in our current map) to spread the data across the servers. Each bucket is
stored three times. If we have three copies of every
bucket, we're at "R3". This is the standard healthy state of the system.
Because fetch/store operations will route around failures, it's not at all apparent from the view of the application if
some buckets do not have three copies, and the cluster is
degraded. So we have a grid of the buckets in our system,
color coded to show whether there are 0/1/2/3 copies of the
bucket.

In the above picture, the set of buckets in red have only 1 copy. The yellow buckets have 2 copies, and the green have
three. We have a big monitor with this display in our
office. If it ever shows anything but a big green "3", folks
notice and can investigate.
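
The replica grid boils down to counting copies per bucket and taking the minimum. The sketch below shows that idea under assumptions: the input shape (host name mapped to the bucket ids it holds), the function names, and the "black" color for zero copies are all invented for illustration, since only the red/yellow/green meanings are given above.

```python
from collections import Counter

NUM_BUCKETS = 4096  # size of the current bucket map
# red/yellow/green follow the text; "black" for zero copies is an assumption.
COLORS = {0: "black", 1: "red", 2: "yellow", 3: "green"}

def replica_counts(holdings, num_buckets=NUM_BUCKETS):
    """Count copies per bucket.

    `holdings` maps a host name to the bucket ids stored on that host.
    A bucket listed twice on the same host still counts as one copy.
    """
    counts = [0] * num_buckets
    for buckets in holdings.values():
        for bucket in set(buckets):
            counts[bucket] += 1
    return counts

def replication_level(counts):
    """The cluster is at "R<n>" where n is the fewest copies of any bucket."""
    return min(counts)

def color_summary(counts):
    """Tally tile colors; more than three copies is still shown as green."""
    return Counter(COLORS[min(c, 3)] for c in counts)

# Tiny hypothetical cluster with 4 buckets instead of 4096.
holdings = {"a": [0, 1, 2, 3], "b": [0, 1, 2], "c": [0, 1]}
counts = replica_counts(holdings, num_buckets=4)
print(counts, "R%d" % replication_level(counts), dict(color_summary(counts)))
```

The single number from `replication_level` is what would drive the big "3" on the wall display, while the per-bucket colors show where the degradation is.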

For variety we've experimented with other ways to show data. This display is showing the fraction of a path in our
datastore which has been loaded into the ram cache. Ram
cache misses will fall back to disk, so it's not
necessarily apparent to the user if the ram cache isn't loaded or working. But the disk fetch is
much slower than the ram cache, so it's good to know if some
machines have crashed and the ram cache isn't at 100%.

Other parts of the display are standard graphs for data
aggregated across all of the servers. These are super
useful to spot overall load issues.
We're still experimenting with finding the best data to collect and show, but the ambient displays so far are a big win. Obvious issues are immediately visible to everyone in
our office, and people will walk by, look at the deeper
graphs, and sometimes spot issues. Moving the data from
something you had to proactively type a
CLI command or click around on some web forms to see, to
displays that engineers will stop and look at for a few
minutes on their way to or from getting a coffee or soda, has
been a big improvement in our awareness of and response to cluster
issues.