Setting up Service Monitoring — The Why’s and What’s

You have built an app and are ready to release it to the world. But are you monitoring it? How will you know when something is off, or when something has simply stopped working? This is where monitoring comes into the picture.

Health checks

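This is the most basic form of monitoring, and every application should implement it. A health check is an endpoint or command that reports whether the service is able to do its job. You are most likely already using health checks if you run behind a load balancer, use Kubernetes liveness probes, or register your services in Consul.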
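As a rough illustration, a health check can be a small function that runs a few dependency checks and maps the outcome to an HTTP status code. The sketch below uses only the Python standard library; the `/healthz` path, the `run_health_checks` helper, and the shape of the JSON body are assumptions made for this example, not a standard.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Callable, Dict, Tuple


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run each named check and return (HTTP status code, JSON body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception:
            # A check that raises is reported as failing instead of crashing the endpoint.
            results[name] = "failing"
    healthy = all(state == "ok" for state in results.values())
    body = json.dumps({"status": "ok" if healthy else "unhealthy", "checks": results})
    return (200 if healthy else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    # Hypothetical registry of checks; fill it with real dependency probes.
    checks: Dict[str, Callable[[], bool]] = {}

    def do_GET(self):
        if self.path != "/healthz":  # assumed path, pick whatever your platform expects
            self.send_error(404)
            return
        status, body = run_health_checks(self.checks)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

A load balancer or liveness probe would then poll the endpoint and treat any non-200 response as unhealthy. Serving it could be as simple as `HTTPServer(("", 8080), HealthHandler).serve_forever()`, where the port is again just an example.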

Infrastructure metrics

Your second step should be to monitor the infrastructure metrics exported by your platform. Collect OS-level data such as CPU, memory, disk IO, and network utilization. Your cloud provider may provide some data as well. If you are using Docker or Kubernetes, collect their metrics too. Your monitoring system likely has integrations that export this data with little effort. If you are using Prometheus, running node_exporter on each machine exposes host metrics for Prometheus to scrape. Similarly, the Prometheus Kubernetes operator will fetch metrics from your Kubernetes cluster.
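In practice an exporter does this work for you, but it can help to see how little is involved in reading a few host-level numbers. The sketch below is only an illustration using the Python standard library; the `collect_host_metrics` name and the metric keys are made up for this example, and a real setup would rely on an existing integration instead.

```python
import os
import shutil


def collect_host_metrics(path="."):
    """Return a small dict of host-level metrics for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    metrics = {
        "cpu_count": os.cpu_count(),
        "disk_total_bytes": usage.total,
        "disk_used_ratio": usage.used / usage.total if usage.total else 0.0,
    }
    try:
        # Load averages are only available on Unix-like systems.
        load_1m, load_5m, load_15m = os.getloadavg()
        metrics["load_1m"] = load_1m
        metrics["load_5m"] = load_5m
        metrics["load_15m"] = load_15m
    except (AttributeError, OSError):
        pass
    return metrics
```

Calling `collect_host_metrics()` on a schedule and shipping the result to your monitoring system is the same pattern an exporter follows, just with far fewer metrics.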

Response codes and errors

We want as few errors as possible, so we start by counting them. A simple way to track them is to add middleware to your framework that records every request and response. For an HTTP micro-service, you may track total requests, successes and failures, and the time taken by each request.
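To make the middleware idea concrete, here is a minimal sketch for a WSGI application that counts requests by status class. It assumes a WSGI stack; the `RequestCounterMiddleware` name and the counter keys are illustrative, and a real service would publish these counts to its monitoring system instead of keeping them in memory.

```python
from collections import Counter


class RequestCounterMiddleware:
    """Wrap a WSGI app and count requests by response status class (2xx, 4xx, 5xx...)."""

    def __init__(self, app):
        self.app = app
        # In-memory counts; not synchronized, so treat them as approximate under threads.
        self.counts = Counter()

    def __call__(self, environ, start_response):
        self.counts["total"] += 1

        def counting_start_response(status, headers, exc_info=None):
            # The status string looks like "200 OK"; keep only the numeric code.
            code = int(status.split(" ", 1)[0])
            self.counts[f"{code // 100}xx"] += 1
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, counting_start_response)
        except Exception:
            # The app raised before producing a response.
            self.counts["unhandled_exceptions"] += 1
            raise
```

Wrapping is a one-liner, `app = RequestCounterMiddleware(app)`, after which `app.counts` holds entries such as `total`, `2xx`, and `5xx`. The ratio of `5xx` to `total` is a reasonable first error-rate signal.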

Latency and timings

Timing is a crucial indicator of performance, so you certainly want to track it. Monitor request and response timings across your stack in as fine-grained a manner as possible. The most accessible place to start is your application server. As with request/response status monitoring, you can begin with a middleware that logs request and response times. Then extend that to any proxies or load balancers. You want to be able to break down how much time is spent on routing decisions, network latency, SSL handshakes, and so on.
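One way to get a per-stage breakdown is to time each stage separately and look at percentiles instead of averages, since a few slow requests can hide behind a healthy mean. The sketch below is an assumption-laden illustration: `LatencyRecorder` and the stage names are hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math
import time
from contextlib import contextmanager


class LatencyRecorder:
    """Collect durations per named stage and report nearest-rank percentiles."""

    def __init__(self):
        self.samples = {}  # stage name -> list of durations in seconds

    @contextmanager
    def timed(self, stage):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the timed block raised.
            self.samples.setdefault(stage, []).append(time.perf_counter() - start)

    def percentile(self, stage, pct):
        """Nearest-rank percentile; raises KeyError if the stage has no samples."""
        ordered = sorted(self.samples[stage])
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        return ordered[rank - 1]
```

Usage would look like `with recorder.timed("db_query"): ...` around each stage of a request, followed by `recorder.percentile("db_query", 95)` when reporting. Keeping every sample in memory is fine for a sketch, but a real system would use histograms with fixed buckets.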

Database performance metrics

I have come across it multiple times: if something is slow, start by checking your database. Databases are massive beasts that do a lot of time-consuming IO and unfortunately don’t scale that well. However, your database likely exports a ton of data that can be very useful for debugging issues or predicting a future performance regression. Stats like table sizes, query timings, tuples read per query, and tuples returned are some of the things you should be monitoring. Database monitoring is so vast that it deserves its own post altogether.
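Most databases expose these statistics themselves, but query timings can also be captured on the application side. Below is a minimal sketch using SQLite from the Python standard library purely as a stand-in; the `timed_query` helper, the stats keys, and the 100 ms slow-query threshold are all assumptions for illustration.

```python
import sqlite3
import time


def timed_query(conn, sql, params=(), slow_threshold_s=0.1):
    """Run a query and return (rows, stats) with its duration and row count."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed = time.perf_counter() - start
    stats = {
        "sql": sql,
        "seconds": elapsed,
        "rows_returned": len(rows),
        # Flag queries that took at least the (assumed) threshold.
        "slow": elapsed >= slow_threshold_s,
    }
    return rows, stats
```

Feeding `stats` into your metrics pipeline gives you per-query timings and row counts, which is often enough to spot the query that regressed after a release.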

Cache metrics

A cache speeds up your services by keeping frequently used data readily available. Monitor the cache size, hit rate, miss rate, and eviction rate. The goal of a cache is to maximize the number of hits, so the miss ratio (misses divided by total lookups) should be as low as possible without risking serving stale data. If the miss ratio is too high, you should reconfigure your cache and maybe even rethink your caching strategy.
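To show where these numbers come from, here is a small LRU cache that counts its own hits, misses, and evictions. It is a sketch for illustration only; the `InstrumentedLRUCache` class is hypothetical, and production caches such as Redis or Memcached report equivalent counters themselves.

```python
from collections import OrderedDict


class InstrumentedLRUCache:
    """A least-recently-used cache that tracks hits, misses, and evictions."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._data = OrderedDict()
        self.hits = 0
        self.misses = 0
        self.evictions = 0

    def get(self, key, default=None):
        if key in self._data:
            self.hits += 1
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        self.misses += 1
        return default

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the least recently used entry
            self.evictions += 1

    def hit_ratio(self):
        lookups = self.hits + self.misses
        return self.hits / lookups if lookups else 0.0
```

The miss ratio is simply `1 - hit_ratio()` once there has been at least one lookup. A rising eviction count together with a falling hit ratio usually suggests the cache is too small for the working set.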

Queue metrics

If you are using message queues to process jobs, you must monitor them. Producer and consumer counts, produce rates, consumption rates, lag, and failures are the critical numbers to track. It is also essential to set alerts for when any of these metrics become abnormal; they provide an early warning of potential failures.
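As a sketch of what such alerting could look like, the function below derives lag and failure ratio from a snapshot of queue counters and returns any alerts. The `QueueSnapshot` fields and the default thresholds are assumptions chosen for this example; sensible limits depend entirely on your workload.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueueSnapshot:
    produced: int   # messages produced so far
    consumed: int   # messages consumed so far
    failed: int     # consumed messages whose processing failed
    consumers: int  # currently connected consumers


def queue_alerts(snapshot: QueueSnapshot, max_lag: int = 1000,
                 max_failure_ratio: float = 0.05) -> List[str]:
    """Return human-readable alerts for a queue snapshot (empty list if all is well)."""
    alerts = []
    lag = snapshot.produced - snapshot.consumed
    if snapshot.consumers == 0:
        alerts.append("no active consumers")
    if lag > max_lag:
        alerts.append(f"lag {lag} exceeds {max_lag}")
    if snapshot.consumed and snapshot.failed / snapshot.consumed > max_failure_ratio:
        alerts.append("failure ratio above threshold")
    return alerts
```

Running this against periodic snapshots gives early warning of a stalled consumer (lag grows while the consumer count drops) well before users notice delayed jobs.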

Business metrics

Apart from the generic metrics discussed above, your application will likely have custom metrics that you want to track. Things like the number of completed orders or transactions are specific to your application and your business. These numbers give assurance about how the overall system is performing. In case of an incident, these metrics also help to assess the real impact.
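For example, if you track completed orders per minute, a rough impact estimate for an incident is the gap between the normal rate and what was actually observed. The sketch below assumes a flat baseline, which is a simplification; the `estimate_missed_orders` name and the numbers in the usage note are made up for illustration.

```python
from typing import Sequence


def estimate_missed_orders(baseline_per_minute: float,
                           observed_per_minute: Sequence[float]) -> float:
    """Estimate orders lost during an incident against a flat per-minute baseline."""
    expected = baseline_per_minute * len(observed_per_minute)
    actual = sum(observed_per_minute)
    # Never report a negative loss if the observed numbers beat the baseline.
    return max(0.0, expected - actual)
```

With a baseline of 10 orders per minute and observed counts of `[10, 4, 0, 6]` over four minutes, the estimate is 20 missed orders. Real traffic is rarely flat, so comparing against the same window from the previous day or week is usually a better baseline.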

Computer Whisperer. Open-source contributor. Find me at https://amitosh.in/