On Service Health Checks

Taha Khan
3 min read · Jan 2, 2022

Health checks are a core part of building a service in the modern world. Simply put, a health check validates that the service is running well and can continue to serve customers. Generally, they are implemented as an endpoint in a service.

This article covers the types of health checks, some examples of how to implement health checks, and common response strategies.

Health checks come in several types, but in general you should implement at least two:

  1. Simple health check
  2. Deep health check

Simple Health Check

As the name suggests, a simple health check should check only a few things and should be fast. Use it to tell quickly whether the service is up and can serve customer requests.

This endpoint can be used by the load balancer (LB) to pull any bad instances of a service out of the rotation.

Some basic checks to perform here (note: this is an incomplete list):

  1. Application is actually running :)
  2. Validate all the components of the application are loaded (think Java beans)
  3. Validate the hardware resources are in a good state (e.g., CPU usage isn’t at or nearing 100%)
  4. Validate you can write to disk (for logs etc.)

Here’s an example of a simple health check in Java:
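A minimal sketch, using only the JDK's built-in `com.sun.net.httpserver.HttpServer`. The class name, the `/health` path, port 8080, the 0.95 load threshold, and the use of the temp directory as a stand-in for the log directory are all assumptions made for illustration. The "components are loaded" check is left out because it depends on your framework.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

class SimpleHealthCheck {
    // Assumed threshold: a load average above 95% of the core count counts as unhealthy.
    static final double MAX_LOAD_PER_CORE = 0.95;

    /** Creates and deletes a small probe file to confirm the directory is writable. */
    static boolean canWriteTo(Path dir) {
        try {
            Files.delete(Files.createTempFile(dir, "health", ".tmp"));
            return true;
        } catch (IOException | SecurityException e) {
            return false;
        }
    }

    /** Uses the system load average as a rough proxy for CPU pressure. */
    static boolean cpuHasHeadroom() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // negative when the platform cannot report it
        return load < 0 || load / os.getAvailableProcessors() < MAX_LOAD_PER_CORE;
    }

    /** Runs the cheap local checks and reports each one by name. */
    static Map<String, Boolean> runChecks(Path logDir) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        results.put("disk", canWriteTo(logDir));
        results.put("cpu", cpuHasHeadroom());
        return results;
    }

    /** 200 keeps the instance in rotation; 503 signals the LB to pull it. */
    static int statusCode(Map<String, Boolean> results) {
        return results.containsValue(false) ? 503 : 200;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real log directory.
        Path logDir = Paths.get(System.getProperty("java.io.tmpdir"));
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/health", exchange -> {
            Map<String, Boolean> results = runChecks(logDir);
            byte[] body = results.toString().getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(statusCode(results), body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start(); // a response from this endpoint also shows the application is running
    }
}
```

Every check here is local and cheap, so the endpoint stays fast enough for an LB to poll frequently.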

How to respond on failure?

Here are a few situations that come to mind, all of which I’ve dealt with:

Single instance reports failure

Generally, you should be safe when only a single instance goes bad as it can be taken out of rotation by the LB. Depending on the architecture of your infrastructure, you can even use this to auto-replace the bad server/pod/instance to make sure you can still serve all the incoming requests[1].

A single failing instance can also signal a bad deployment, especially if you use a one-box pattern[2]. It’s generally a good idea to have a deployment monitor that stops the deployment at this point and rolls back.
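The one-box flow with a deployment monitor can be sketched roughly as follows. The `Fleet` interface, the `rollOut` method, and the `Outcome` enum are hypothetical names invented for this sketch; a real deployment system would supply its own way to push a version and read an instance's health.

```java
import java.time.Duration;
import java.util.List;

class OneBoxDeployment {
    /** Hypothetical abstraction over whatever deploys to and probes your instances. */
    interface Fleet {
        void deploy(String instance, String version);
        boolean isHealthy(String instance); // e.g. the simple health check result
    }

    enum Outcome { COMPLETED, ROLLED_BACK }

    /**
     * Deploys to the first instance, watches its health for the bake period,
     * then either continues to the rest of the fleet or rolls the one-box back.
     */
    static Outcome rollOut(Fleet fleet, List<String> instances, String newVersion,
                           String previousVersion, Duration bakePeriod, Duration pollInterval) {
        String oneBox = instances.get(0);
        fleet.deploy(oneBox, newVersion);

        long start = System.nanoTime();
        do {
            // Stop and roll back on the first failed check (or if we are interrupted).
            if (!fleet.isHealthy(oneBox) || !pause(pollInterval)) {
                fleet.deploy(oneBox, previousVersion);
                return Outcome.ROLLED_BACK;
            }
        } while (System.nanoTime() - start < bakePeriod.toNanos());

        // The one-box stayed healthy for the whole bake period: continue to the fleet.
        for (String instance : instances.subList(1, instances.size())) {
            fleet.deploy(instance, newVersion);
        }
        return Outcome.COMPLETED;
    }

    private static boolean pause(Duration interval) {
        try {
            Thread.sleep(interval.toMillis());
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

In practice the monitor would likely watch more than one signal (error rates, latency), but the shape stays the same: bake, then decide.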

Multiple instances report failure

This type of failure is usually caused by a bad deployment or by latent issues in auxiliary daemons running alongside the service. Here, get a sense of the issue, decide whether a rollback is necessary, and then continue to root-cause it.

Deep Health Check

A deep health check validates that the application can actually process customer requests correctly. It can ping databases and other services to make sure the whole system is functional.

Do NOT use this health check to take service instances out of rotation[3].

Instead, emit metrics on which dependencies are failing and create the necessary alarms based on those metrics. These alarms can quickly tell you which dependencies are down and, by extension, how customers are affected.

Some health checks to perform here (note: also an incomplete list):

  1. Validate you can connect to your databases
  2. Validate your (critical) dependency services are up
  3. Validate you can connect to (any) queues

Here’s an example of a deep health check in Java:
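A minimal sketch, assuming each dependency can be probed by a function that returns true or false. The `DeepHealthCheck` class, its method names, the in-memory failure counters (a stand-in for a real metrics client), and the hosts and ports in `main` are all hypothetical. The TCP probe only shows that a port accepts connections; a real check would more likely run a trivial query or call the dependency's own health endpoint.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

class DeepHealthCheck {
    private final Map<String, BooleanSupplier> probes = new LinkedHashMap<>();
    // Stand-in for a real metrics client: one failure counter per dependency.
    private final Map<String, AtomicLong> failures = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    DeepHealthCheck(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void register(String dependency, BooleanSupplier probe) {
        probes.put(dependency, probe);
    }

    /** Runs every probe with a timeout and counts a failure for each unhealthy dependency. */
    Map<String, Boolean> run() {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (Map.Entry<String, BooleanSupplier> entry : probes.entrySet()) {
            boolean healthy = runWithTimeout(entry.getValue());
            if (!healthy) {
                failures.computeIfAbsent(entry.getKey(), k -> new AtomicLong()).incrementAndGet();
            }
            results.put(entry.getKey(), healthy);
        }
        return results;
    }

    long failureCount(String dependency) {
        AtomicLong count = failures.get(dependency);
        return count == null ? 0 : count.get();
    }

    private boolean runWithTimeout(BooleanSupplier probe) {
        try {
            return CompletableFuture.supplyAsync(() -> probe.getAsBoolean())
                    .get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException | TimeoutException e) {
            return false; // a throwing or hanging probe counts as a failed dependency
        }
    }

    /** Example probe: can a TCP connection to the dependency be opened in time? */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        DeepHealthCheck check = new DeepHealthCheck(2000);
        // Hypothetical hosts and ports; replace with your own dependencies.
        check.register("database", () -> canConnect("db.internal.example", 5432, 1000));
        check.register("queue", () -> canConnect("queue.internal.example", 5672, 1000));
        System.out.println(check.run());
    }
}
```

The result is a per-dependency report plus failure counts to alarm on, rather than a single pass/fail status for the LB to act on.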

How to respond on failure?

The only advice I can give here is to triage the issue, figure it out and mitigate[4]. Something’s going horribly wrong, and customers are likely getting affected.

Appendix

[1] Make sure all the logs are safely stored and, if possible, store the server state as well so it can be used later to triage the issue.

[2] One-box pattern: You deploy to a single instance, monitor its metrics for a set time (bake period) then continue deploying to the whole fleet.

[3] Imagine this: one of your dependencies goes down, every instance fails its deep health check, the LB pulls them all out of rotation, and your service goes down with it.

[4] Always prefer mitigation over fixing the issue. Mitigation generally gets you back to a stable state faster than a full fix.

Taha Khan

Software Engineer by trade. Solving complex problems with simple solutions at Amazon Alexa.