Taming Logs
Logs are essential. They are a very convenient way to know what is happening in your application. But logs can be overwhelming. They can be too verbose, too cryptic, or too noisy. They can be hard to read, hard to search, hard to understand, hard to manage, and ultimately, hard to trust.
Unless you tame them!
Step 1: Store Logs in a Centralized Location
Logs are only helpful if you can access them easily when you need them. If logs are scattered across multiple servers, formats, and locations, they are hard to access, search, analyze and reason about.
The first step in taming logs is to store them in a centralized location so they are accessible and searchable from a single place. You can use the log management service that comes with your cloud provider, such as AWS CloudWatch Logs, Google Cloud Logging, or Azure Monitor. You can also use a third-party log management service, such as Datadog, Loggly, Papertrail, Splunk, Sumo Logic, or Logz.io.
You can also use an open-source log management solution, such as Fluentd, the ELK stack, Graylog, Grafana + Loki, or ClickHouse.
Such services will require installing and configuring an agent to ship logs to your log server. Some agents will also ship system logs, allowing you to observe other parts of your servers, such as system services.
Step 2: Format Logs in a Standardized Way
If logs are in different formats, with varying fields and structure, they are hard for humans and machines to parse. Free-form log text only allows you to search using a full-text index, which is both expensive and limited in terms of expressiveness.
By formatting logs in a standardized way, you can define a standard parsing schema to extract essential fields from logs, such as timestamp, log level, message, and metadata. This metadata makes logs easier to search and filter. Log agents will also add some metadata about the source automatically. If you use Fluentd or Logstash (from the ELK stack), you can use grok patterns to parse logs and extract even more fields.
You can define a straightforward log format such as:
<timestamp> <log level> <message> [key1=value1 key2=value2 ...]
Your log agent can then parse this format, extract the timestamp, log level, message, and metadata fields, and forward them to your log server.
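As a rough illustration of the agent-side parsing, here is a minimal sketch in Go that extracts the fields from a line in the format above (the parseLine helper is made up for this example; real agents rely on battle-tested parsers and grok patterns):

package main

import (
    "fmt"
    "strings"
)

// ParsedLog holds the fields extracted from a line in the
// "<timestamp> <log level> <message> [key=value ...]" format.
type ParsedLog struct {
    Timestamp string
    Level     string
    Message   string
    Metadata  map[string]string
}

// parseLine splits a log line into timestamp, level, message, and metadata.
func parseLine(line string) ParsedLog {
    meta := map[string]string{}

    // Peel off the optional "[key=value ...]" block at the end of the line.
    if start := strings.LastIndex(line, "["); start != -1 && strings.HasSuffix(line, "]") {
        for _, pair := range strings.Fields(line[start+1 : len(line)-1]) {
            if k, v, ok := strings.Cut(pair, "="); ok {
                meta[k] = v
            }
        }
        line = strings.TrimSpace(line[:start])
    }

    // The first two tokens are the timestamp and log level; the rest is the message.
    parts := strings.SplitN(line, " ", 3)
    parsed := ParsedLog{Metadata: meta}
    if len(parts) == 3 {
        parsed.Timestamp, parsed.Level, parsed.Message = parts[0], parts[1], parts[2]
    }
    return parsed
}

func main() {
    fmt.Printf("%+v\n", parseLine("2021-01-01T00:00:00Z INFO Order created [order_id=ABCD1234 region=eu]"))
}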
Step 3: Define a correlation ID
Your application will have multiple concurrent executions, each generating various logs. To make it easier to trace the logs produced by a single execution, you can define a correlation ID that is unique to each execution at the very start of the invocation. You can then pass this correlation ID from one component to another and print it in every logging statement, so that the logs of a single execution can be traced across components.
Let’s add a correlation ID to our log format:
<timestamp> <correlation ID> <log level> <message> [key1=value1 key2=value2 ...]
Your log agent can then extract the correlation ID field and forward it to your log server.
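As a minimal sketch of the idea in Go (the helper names below are made up for this example), you can generate the ID once at the start of the invocation, carry it in a context.Context, and include it in every log statement:

package main

import (
    "context"
    "crypto/rand"
    "encoding/hex"
    "fmt"
)

type ctxKey string

const correlationIDKey ctxKey = "correlation_id"

// newCorrelationID generates a random ID at the very start of an invocation.
func newCorrelationID() string {
    b := make([]byte, 8)
    rand.Read(b)
    return hex.EncodeToString(b)
}

// logWithCorrelation prints a log line that always carries the correlation ID
// taken from the context, tying together all logs of one execution.
func logWithCorrelation(ctx context.Context, level, message string) {
    id, _ := ctx.Value(correlationIDKey).(string)
    fmt.Printf("correlation_id=%s log_level=%s message=%q\n", id, level, message)
}

func createOrder(ctx context.Context) {
    logWithCorrelation(ctx, "INFO", "Order created")
}

func main() {
    // Attach the correlation ID once, then pass the context to every component.
    ctx := context.WithValue(context.Background(), correlationIDKey, newCorrelationID())
    logWithCorrelation(ctx, "INFO", "Handling request")
    createOrder(ctx)
}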
Step 4: Define a trace ID
In a distributed system, a single request can span multiple services. To make it easier to trace the logs generated by a single request across various services, you can define a trace ID unique to each request at the very start of the request. Your code can pass this trace ID between services, and you can reuse the same trace ID while printing logs.
Let’s add a trace ID to our log format:
<timestamp> <trace ID> <correlation ID> <log level> <message> [key1=value1 key2=value2 ...]
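Here is a rough Go sketch of propagating a trace ID between services over HTTP. The X-Trace-Id header name and the downstream URL are placeholders for this example; in practice you may prefer an established convention such as the W3C Trace Context traceparent header:

package main

import (
    "fmt"
    "net/http"
)

// Example header used to carry the trace ID between services.
const traceHeader = "X-Trace-Id"

func handler(w http.ResponseWriter, r *http.Request) {
    // Reuse the trace ID sent by the caller; the service that receives the
    // original request is responsible for generating it.
    traceID := r.Header.Get(traceHeader)
    fmt.Printf("trace_id=%s log_level=INFO message=%q\n", traceID, "Processing request")

    // Forward the same trace ID to the next service in the chain.
    req, err := http.NewRequest(http.MethodGet, "http://downstream.example/charge", nil)
    if err == nil {
        req.Header.Set(traceHeader, traceID)
        // http.DefaultClient.Do(req) // the outgoing call carries the same trace ID
    }

    w.WriteHeader(http.StatusOK)
}

func main() {
    http.HandleFunc("/orders", handler)
    http.ListenAndServe(":8080", nil)
}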
Step 5: Define log levels
Logs can be noisy and generate a lot of information that is not useful. To make logs more useful, you can define a log level that indicates the importance of a log message. You can then filter logs based on their log level so that you only see the log messages that are important to you.
Define good log levels such as:
- DEBUG: verbose; valuable information for debugging and troubleshooting
- INFO: good to know something happened
- WARN: something unexpected happened, but the application can recover
- ERROR: something unexpected happened, and the application cannot recover
- CRITICAL: something unexpected happened, and the application will crash
Review the log lines introduced in your code base and ensure they use the correct log level. Logging too much information at a higher log level can make finding important logs harder.
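For instance, with Go’s standard log/slog package you can configure a minimum level so that anything below it is filtered out; a minimal sketch:

package main

import (
    "log/slog"
    "os"
)

func main() {
    // Only log messages at INFO level or above; DEBUG lines are dropped.
    logger := slog.New(slog.NewTextHandler(os.Stdout, &slog.HandlerOptions{
        Level: slog.LevelInfo,
    }))

    logger.Debug("cache lookup", "key", "user:42") // filtered out
    logger.Info("order created", "order_id", "ABCD1234")
    logger.Warn("payment retry", "attempt", 2)
    logger.Error("payment failed", "order_id", "ABCD1234")
}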
Step 6: Implement structured logging
If logs are in free-form text format, they are challenging to search and filter. By implementing structured logging, you can define a standard schema that allows you to use off-the-shelf tools to extract essential fields from logs, such as timestamp, log level, message, and metadata.
For example, for logs associated with a particular order, you can add the order ID to the log message as structured metadata instead of embedding it in the log string itself so that you can search for all logs associated with that order.
2021-01-01T00:00:00Z 1234 5678 INFO Order created order_id=ABCD1234
To emit structured logs, you can use a structured logging library such as logrus or zap for Go, logback for Java, or Structlog for Python. JSON is a widely used format for structured logs, and logfmt is also a good option.
Your log agent can then parse this format, extract the timestamp, log level, message, and metadata fields, and forward them to your log server. Your log server can then index these fields and make them searchable. You can even create dashboards and alerts based on these fields.
In JSON format, the log message would look like this:
{ "timestamp": "2021-01-01T00:00:00Z", "correlation_id": "1234", "trace_id": "5678", "log_level": "INFO", "message": "Order created", "metadata": { "order_id": "ABCD1234" } }
While JSON is a great candidate format for structured logs, it can be verbose. You can use a more compact format such as “logfmt”:
timestamp=2021-01-01T00:00:00Z correlation_id=1234 trace_id=5678 log_level=INFO message="Order created" order_id=ABCD1234
If your logging library does not support structured logging, you can still emit a structured format by defining a simple wrapper that formats your log messages accordingly.
Here is an example in Go that prints logs in logfmt:
package main

import (
    "fmt"
    "time"
)

// logfmt prints a single log line as space-separated key=value pairs.
func logfmt(level string, message string, metadata map[string]string) {
    fmt.Printf("timestamp=%s log_level=%s message=%q",
        time.Now().Format(time.RFC3339), level, message)
    // Append each metadata field as an extra key="value" pair.
    for k, v := range metadata {
        fmt.Printf(" %s=%q", k, v)
    }
    fmt.Println()
}
You can then use this wrapper to log messages in a structured format:
logfmt("INFO", "Order created", map[string]string{"order_id": "ABCD1234"})
⚠️ Some people find JSON logs hard to read, especially during development. You can configure your logging library to log in JSON format in production and a more human-readable format during development. Libraries like Logback and structlog have a developer-friendly “console” format that you can use during local development.
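With Go’s log/slog, for example, switching between a JSON handler in production and a human-readable text handler in development is a one-line decision; in this sketch the APP_ENV variable name is just an example, not something slog knows about:

package main

import (
    "log/slog"
    "os"
)

func main() {
    // Emit JSON in production and human-readable text locally.
    var handler slog.Handler
    if os.Getenv("APP_ENV") == "production" {
        handler = slog.NewJSONHandler(os.Stdout, nil)
    } else {
        handler = slog.NewTextHandler(os.Stdout, nil)
    }
    logger := slog.New(handler)

    // Structured fields instead of values embedded in the message string.
    logger.Info("Order created", "order_id", "ABCD1234", "correlation_id", "1234", "trace_id", "5678")
}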
Step 7: Add context to your logs
Log messages are context-dependent: they depend on the application’s state at the time and place they are generated. By adding context to your log messages, you provide additional information that helps you understand them better.
You can add context such as:
- the authenticated user ID
- the request method, path, and parameters
- the client IP address and user agent
- the session or tenant ID
Use a middleware to add this context to your log messages, as shown in the sketch below. The middleware can extract this information from the incoming request and attach it to every log message generated while handling it.
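Here is a rough sketch of such a middleware in Go using net/http and log/slog; the field names and the route are only examples:

package main

import (
    "log/slog"
    "net/http"
    "os"
)

// withRequestContext wraps a handler and attaches request details to the
// logger so that log lines emitted for this request carry the same context.
func withRequestContext(logger *slog.Logger, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        reqLogger := logger.With(
            "method", r.Method,
            "path", r.URL.Path,
            "remote_addr", r.RemoteAddr,
            "user_agent", r.UserAgent(),
        )
        reqLogger.Info("request received")
        // In a real application you would also put reqLogger into the request
        // context so downstream handlers log with the same fields.
        next.ServeHTTP(w, r)
    })
}

func main() {
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
    mux := http.NewServeMux()
    mux.HandleFunc("/orders", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })
    http.ListenAndServe(":8080", withRequestContext(logger, mux))
}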
Step 8: Cut down on log verbosity
Don’t log everything. Verbose logs bury the information you actually need. To make logs more useful, reduce log verbosity by logging only the vital information.
Log something only if it is:
- Useful for understanding the state of the application
- Useful for debugging and troubleshooting
- Useful for monitoring and alerting
- Useful for auditing and compliance
Step 9: Check logs for sensitive data
In 2019, Facebook (now Meta) was under fire for inadvertently logging the passwords of hundreds of millions of users due to a misconfiguration. While there was no evidence of abuse or unauthorized access, they received major flak from the global security and privacy community regarding the company’s security practices and the broader implications for user privacy.
Logs can inadvertently capture and expose sensitive information such as passwords, API keys, or personal data. Regularly review your logs to ensure that sensitive data is either masked or excluded from logs entirely. Implement automated tools to scan logs for such information. Sanitizing logs is critical to prevent security breaches and comply with data protection regulations.
Many log management services provide tools to scan log files in real time, flagging or redacting any information that matches known patterns, such as credit card numbers, social security numbers, emails, or other personally identifiable information (PII). Logging libraries also provide hooks and post-processors to enable such detection at the application level, along with remediation features like automatically masking or removing sensitive data before it can be stored or transmitted.
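As a simple illustration of application-level redaction, here is a Go sketch that masks anything that looks like an email address or a card number before a line is stored or shipped; the patterns are deliberately crude, and real scanners are far more thorough:

package main

import (
    "fmt"
    "regexp"
)

// Deliberately simple patterns for illustration only.
var (
    emailPattern      = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)
    cardNumberPattern = regexp.MustCompile(`\b(?:\d[ -]?){13,16}\b`)
)

// redact masks sensitive-looking substrings in a log line.
func redact(line string) string {
    line = emailPattern.ReplaceAllString(line, "[REDACTED_EMAIL]")
    line = cardNumberPattern.ReplaceAllString(line, "[REDACTED_CARD]")
    return line
}

func main() {
    fmt.Println(redact(`payment failed for jane@example.com card=4111 1111 1111 1111`))
    // prints: payment failed for [REDACTED_EMAIL] card=[REDACTED_CARD]
}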
Step 10: Consider supporting changing log levels at runtime
Changing log levels at runtime can be helpful for debugging and troubleshooting: it lets you change the log level of your application without restarting it. This is especially useful when you want to temporarily increase your application’s log verbosity to debug an issue and then dial it back once the issue is resolved.
You can implement a log-level change endpoint in your application to adjust the log level at runtime. You must protect this endpoint with authentication and authorization to prevent unauthorized access.
Java libraries like SLF4J and Logback support changing log levels at runtime using JMX or Jolokia. You can implement this feature in other languages as well by exposing secured management endpoints.
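In Go, for example, log/slog’s LevelVar makes this easy to sketch; the endpoint below omits the authentication and authorization you would need in practice:

package main

import (
    "log/slog"
    "net/http"
    "os"
)

func main() {
    // LevelVar holds the current minimum level and can be changed at runtime.
    var level slog.LevelVar
    level.Set(slog.LevelInfo)

    logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: &level}))
    slog.SetDefault(logger)

    // Example management endpoint: /admin/log-level?level=DEBUG
    // Protect this with authentication and authorization before exposing it.
    http.HandleFunc("/admin/log-level", func(w http.ResponseWriter, r *http.Request) {
        var newLevel slog.Level
        if err := newLevel.UnmarshalText([]byte(r.URL.Query().Get("level"))); err != nil {
            http.Error(w, "unknown log level", http.StatusBadRequest)
            return
        }
        level.Set(newLevel)
        w.WriteHeader(http.StatusNoContent)
    })

    http.ListenAndServe(":8080", nil)
}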
Step 11: Monitor your logs
Logs can be a valuable source of information about your application’s state. While they are not a substitute for dedicated observability tooling like metrics and application traces, monitoring your logs lets you detect issues before they become critical. You can also create dashboards and alerts based on log messages to monitor your application’s health.
Log management services provide matching, filtering, and aggregation capabilities that use patterns to match log occurrences and emit metrics or trigger alerts.
For example, most databases provide a slow query log, where they record queries whose execution time exceeds a configured threshold. By monitoring slow query occurrences, your logging system can alert you to a potential issue in your application.
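As a toy illustration of the idea (in practice this matching happens in your log pipeline rather than in application code), here is a Go sketch that counts slow-query lines from stdin and prints an alert when they cross a threshold:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    const threshold = 5 // alert once this many slow queries are seen

    // Read log lines from stdin, e.g.: ./monitor < mysql-slow.log
    scanner := bufio.NewScanner(os.Stdin)
    slowQueries := 0
    for scanner.Scan() {
        if strings.Contains(strings.ToLower(scanner.Text()), "slow query") {
            slowQueries++
        }
    }

    fmt.Printf("metric slow_query_count=%d\n", slowQueries)
    if slowQueries >= threshold {
        fmt.Println("ALERT: slow query count exceeded threshold, investigate database load")
    }
}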
Step 12: Implement log rotation and log retention policies
Logs can consume a lot of storage, which can get very expensive, especially when using managed log management services. Define sensible archival and deletion policies for logs. For example, you can keep logs in “hot” storage such as ClickHouse for 60 days, then move them to “cold” storage such as S3 for up to a year. Bulk object storage is far cheaper than an actively searchable database. Tools like Amazon Athena and Loki still allow you to query archived logs when needed.
… And the list doesn’t end here
There are many more techniques you can employ to make logs more useful. Following these steps one by one will give you a robust baseline to start from.
Do you employ some other creative ways to make the best use of logs? Let me know in the comments below!
Originally published at https://recursivefunction.blog.