Service dependencies, graceful degradation and health checks

This is a post about web application load balancing. Specifically, it’s about what should happen when things fail.

I followed a link to the Rack health check middleware, Pinglish, and something in the README caught my eye:

The request handler should check the health of all services the application depends on, answering questions like, “Can I query against my MySQL database,” “Can I create/read keys in Redis,” or “How many docs are in my ElasticSearch index?”

and later

The response must return an HTTP 200 OK status code if all health checks pass.

This idea reduces the state of the system’s dependencies to a single success or failure, as returned to the load balancer. If any of the required services are down, the health is reported to the load balancer as a failure. It’s something that I’ve tried, and regretted when it caused unnecessary outages.

Before implementing any health check you need to give some thought to how you intend your application to degrade in the event that a service it calls fails.

Modern web applications typically use many services, and the failure of any one service call does not necessarily render the entire application unusable. The service might deliver functionality that is not vital to the usefulness of the webpage. For example, recommendations and related products are often delivered by services, and the pages they appear on often still serve their core purpose without them. Does that mean we should fail load balancer health checks if that service call fails?
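As a rough illustration of a page that keeps its core purpose when an optional service call fails, here is a minimal Python sketch. The names `fetch_recommendations` and `render_product_page` are hypothetical, and the service call is a stand-in that always fails so the fallback path is visible.

```python
def fetch_recommendations(product_id):
    """Stand-in for a call to a hypothetical recommendations service.

    It always raises here, to represent the service being down.
    """
    raise ConnectionError("recommendations service unavailable")


def render_product_page(product_id, recommender=fetch_recommendations):
    """Build the page data, treating recommendations as optional."""
    # The core content does not depend on the optional service.
    page = {"product_id": product_id, "title": f"Product {product_id}"}
    try:
        page["recommendations"] = recommender(product_id)
    except (ConnectionError, TimeoutError):
        # Degrade: serve the page without the recommendations panel.
        page["recommendations"] = []
    return page
```

Calling `render_product_page(42)` still returns the product title, with an empty recommendations list in place of an error.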

Whole swathes of functionality might be unavailable due to the failure of a dependent service, and whole classes of page might return errors, yet people who don’t use that functionality can carry on using the application, unaware that anything is wrong. Again, this doesn’t necessarily reduce the state of the application to “failed”. If you can’t count the number of documents in ElasticSearch, it might mean that search is broken, but the rest of the application is fine.

If, however, you have configured your health check to fail whenever any of the services your application uses fails, then the likely result of such an outage is that all of your application nodes fail their health checks at the same time. Load balancers differ slightly in how they handle the situation where every node has failed its health check, but it’s not unusual for the load balancer to simply serve an error page without even attempting to send requests to the nodes. That is a complete failure of the web application.

In nearly every situation, a better solution is to not fail the health check and to continue serving with degraded functionality.

It’s a good idea to understand what will happen in the event of these service failures, and to go beyond that by actually simulating failure: make services fail to respond, or make them respond with errors.
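One cheap way to simulate a failing service is to do it in a test, by replacing the service call with one that raises. The sketch below uses Python’s standard `unittest.mock`; `SearchClient`, `search_summary` and `simulate_search_outage` are hypothetical names invented for the example, not part of any real client library.

```python
from unittest import mock


class SearchClient:
    """Hypothetical client for a search service."""

    def count_documents(self):
        return 1000


def search_summary(client):
    """Describe the index, degrading to a message if search is down."""
    try:
        return f"{client.count_documents()} documents indexed"
    except (ConnectionError, TimeoutError):
        return "Search is temporarily unavailable"


def simulate_search_outage():
    """Make the service call time out and see how the application responds."""
    client = SearchClient()
    with mock.patch.object(
        client, "count_documents", side_effect=TimeoutError("simulated outage")
    ):
        return search_summary(client)
```

This only exercises the application’s own error handling. It is not a substitute for making a real service unavailable in a test environment, which also shows how timeouts and the load balancer behave.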

In the context of health checks, be very careful when deciding which “services the application depends on.” Restrict them to things that the application absolutely requires in order to be usable.
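To make that distinction concrete, here is a minimal sketch of a health check that only fails when a critical dependency is down, and merely reports optional ones as degraded. It is written in Python rather than as Rack middleware, and it is not how Pinglish works. The function names, the split into critical and optional checks, and the choice of 503 for failure are all assumptions made for the example.

```python
def run_check(check):
    """Run one check; a check is any callable that raises on failure."""
    try:
        check()
        return True
    except Exception:
        return False


def health_check(critical_checks, optional_checks):
    """Return (status_code, report) for a load balancer health endpoint."""
    report = {}
    healthy = True
    # Only critical dependencies can take the node out of rotation.
    for name, check in critical_checks.items():
        ok = run_check(check)
        report[name] = "ok" if ok else "failed"
        healthy = healthy and ok
    # Optional dependencies are reported but never fail the check.
    for name, check in optional_checks.items():
        report[name] = "ok" if run_check(check) else "degraded"
    return (200 if healthy else 503), report
```

With a working database check and a broken search check, this returns 200 and a report marking search as degraded, so the node stays in rotation while monitoring can still see the problem.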