When you have several hundred servers in a cluster, knowing the state and health of all of them can be a challenge. Traditional pager alert systems often either log too many events, which trains people to tune them out, or miss non-fatal but still serious server sickness, such as degraded disk/cpu/network performance or subtle application errors.
This becomes especially true when the cluster and application are designed for high availability. If the application is doing its best to hide server failures from the user, it's often not apparent when a serious problem is developing until the site fails in a more public or obvious way.
We called these "analog failures" at Topix. There was a fairly complicated chain of processing for incoming stories that had been crawled: crawl, categorize, cluster, dedup, roboedit, push to front ends, and push to the incremental search system. Once, an engineer mistakenly deleted half of the sources from our crawl, and it took us a disturbingly long time to notice. While overall we had half as many stories on the site, most pages still had new stories coming in, so nothing looked obviously wrong.
Sometimes a server fails in a messy, partial way: its network card starts losing 50% of its packets, but some traffic still gets through. Or a drive is in the process of failing, and its read/write rate is 10% of normal, but it hasn't failed hard enough to be removed from service yet. Or the cpu overheated and is running at a fraction of its normal speed. There seem to be limitless numbers of unusual ways that servers can fail.
At blekko, there are dozens of stats we'd ideally like to track per host:
- How full are each of the disks?
- Are there any SMART errors being reported from the drives?
- Are we getting read or write errors?
- What is the read/write throughput rate? Sometimes failures degrade the rate substantially, but the disk continues to function
- What is the current disk read latency?
- Is packet loss occurring to the node?
- What is the read/write network throughput?
- What is the cpu load?
- How much memory is in use?
- How much swap is being used?
- How big is the kernel's dirty page cache?
- What are the internal/external temperature sensors reading?
- How many live filesystems are on the host vs. dead disks?
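Several of the per-host stats above can be collected with nothing beyond the standard library. As a sketch (the function name `host_stats` and the choice of which stats to sample are illustrative, not blekko's actual collector):

```python
import os

def disk_usage_pct(path="/"):
    # fraction of the filesystem at `path` that is in use, as a percentage
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    return 100.0 * (total - free) / total

def host_stats():
    # sample a few of the per-host stats listed above;
    # a real collector would also read SMART data, /proc/meminfo, sensors, etc.
    load_1m, load_5m, load_15m = os.getloadavg()
    return {
        "disk_pct": disk_usage_pct("/"),
        "load_1m": load_1m,
        "load_5m": load_5m,
    }
```

A cron job or daemon on each host would push a dict like this to a central status table.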
Other stats pertain to our cluster datastore:
- How many buckets are on each host?
- Is the host above or below goal for its number of buckets?
- What is the outbound write lag from the host?
- What is the maximum seek depth for a given path/bucket?
- Do we have three copies of every bucket (R3)?
- If we're not at R3, how many bucket copies are occurring?
- For running mapjobs, what is their ETA + read/write/error rate?
- Are the ram caches fully loaded?
- Are we crawling/indexing, and how does the rate compare with historical rates?
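The bucket-count-vs-goal check above can be sketched in a few lines: with R copies of each bucket spread across the cluster, each host's goal is roughly `NUM_BUCKETS * R / num_hosts`. The data layout and function name here are hypothetical, not blekko's actual code:

```python
NUM_BUCKETS = 4096  # bucket count from the text
R = 3               # three copies of every bucket ("R3")

def bucket_counts_vs_goal(bucket_hosts, hosts):
    """bucket_hosts: bucket id -> list of hosts holding a copy.
    Returns host -> (bucket count, delta from goal)."""
    goal = NUM_BUCKETS * R / len(hosts)
    counts = {h: 0 for h in hosts}
    for held_by in bucket_hosts.values():
        for h in held_by:
            counts[h] += 1
    return {h: (n, n - goal) for h, n in counts.items()}
```

Hosts far below goal are candidates for rebalancing; hosts far above it may be overloaded.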
The first step is to start putting the stats you want to be able to see into a big status table. But at 175 hosts, the table is kind of long, and it's hard to spot developing problems in the middle of the table.
So we have been experimenting with mapping system stats onto different visualizations, so we can tell at a glance the overall state of hundreds of servers, and spot minor problems before they grow.
A table with 175 rows is pretty long, but you can fit 175 squares into a very small picture. This grid shows overall disk usage by host. The color of each tile shows the disk usage: red is above 90%, orange above 80%, yellow above 70%, and blue below 60%. Dead filesystems on the node are represented by grey bars inside the tile. The whole grid is sorted worst-to-best, so it's easy to see the fraction of hosts at a given level of usage.
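A minimal sketch of the tile coloring and worst-to-best sort just described (the color thresholds come from the text; the hostnames and usage numbers are made up for illustration):

```python
def tile_color(pct):
    # thresholds match the text: red 90%+, orange 80%+, yellow 70%+, blue otherwise
    if pct >= 90:
        return "red"
    if pct >= 80:
        return "orange"
    if pct >= 70:
        return "yellow"
    return "blue"

# hypothetical per-host disk usage percentages
usage = {"host1": 93, "host2": 71, "host3": 55, "host4": 84}

# sort worst-to-best so the most-full hosts cluster at one end of the grid
grid = [(host, tile_color(pct))
        for host, pct in sorted(usage.items(), key=lambda kv: kv[1], reverse=True)]
```

Rendering `grid` as a row-major block of colored squares gives the at-a-glance picture.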
Our datastore uses a series of buckets (4096 in our current map) to spread the data across the servers. Each bucket is stored three times. If we have three copies of every bucket, we're at "R3". This is the standard healthy state of the system.
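As a sketch of how such a bucketed datastore can work: keys hash to one of the 4096 buckets, and each bucket is placed on three hosts. The 4096-bucket map and R3 replication come from the text, but the hash function and placement scheme below are illustrative assumptions, not blekko's actual design:

```python
import hashlib

NUM_BUCKETS = 4096  # bucket count from the text
R = 3               # replication factor ("R3")

def bucket_for(key):
    # deterministic hash of a key into one of the buckets; md5 is used
    # here only as a stable example hash
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
    return h % NUM_BUCKETS

def replica_hosts(bucket, hosts):
    # hypothetical placement: R distinct hosts chosen by rotating the host list
    return [hosts[(bucket + i) % len(hosts)] for i in range(R)]
```

Any scheme works as long as it's deterministic, so readers and writers agree on where a bucket's copies live.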
Because fetch/store operations will route around failures, it's not at all apparent from the view of the application if some buckets do not have three copies, and the cluster is degraded. So we have a grid of the buckets in our system, color coded to show whether there are 0/1/2/3 copies of the bucket.
In the above picture, the set of buckets in red have only 1 copy. The yellow buckets have 2 copies, and the green have 3. We have a big monitor with this display in our office; if it ever shows anything but a big green "3", folks notice and can investigate.
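The replica-count grid can be sketched as follows. The 1/2/3-copy colors come from the text; the 0-copy color and the function names are assumptions:

```python
# 1=red, 2=yellow, 3=green per the text; black for 0 copies is an assumption
COPY_COLOR = {0: "black", 1: "red", 2: "yellow", 3: "green"}

def grid_colors(copies, num_buckets=4096):
    """copies: bucket id -> number of live replicas (missing buckets count as 0).
    Returns the tile color for every bucket in the grid."""
    return [COPY_COLOR[min(copies.get(b, 0), 3)] for b in range(num_buckets)]

def cluster_is_r3(copies, num_buckets=4096):
    # the big green "3": every bucket has at least three live copies
    return all(copies.get(b, 0) >= 3 for b in range(num_buckets))
```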
For variety we've experimented with other ways to show data. This display is showing the fraction of a path in our datastore which has been loaded into the ram cache. Ram cache misses will fall back to disk, so it's not necessarily apparent to the user if the ram cache isn't loaded or working. But the disk fetch is much slower than the ram cache, so it's good to know if some machines have crashed and the ram cache isn't at 100%.
Other parts of the display are standard graphs for data aggregated across all of the servers. These are super useful to spot overall load issues.
We're still experimenting with finding the best data to collect and show. But the ambient displays so far are a big win. Obvious issues are immediately visible to everyone in our office, and people walking by will look at the deeper graphs and sometimes spot issues. Taking the data from something you'd have to proactively query with a cli command or a few clicks through web forms, to displays that engineers will stop and look at for a few minutes on their way to or from getting a coffee or soda, has been a big improvement in our awareness of and response to cluster issues.