Observability: Monitoring, Logging & Tracing

Tracing Errors with ConSol

Modern applications are complex. Consequently, they generate a multitude of measurement and analysis data that describe the application’s health and help to fix the problem should an error occur. The observer, however, is confronted with a barely manageable flood of data. Observability is the ability to monitor IT applications holistically in order to get this flood of data under control. This requires the application to provide appropriate data. It also requires giving developers and operations teams the tools they need to act swiftly and purposefully in the event of an error.

All applications and the underlying infrastructure produce metrics, logs and, where useful, traces. These are collected and processed by proven open-source tools such as Prometheus (metrics), Loki (logs) and Jaeger (traces), and then visualized centrally in Grafana dashboards. There, users get an overview of the applications and of the infrastructure components they share. For long-term storage of data, additional databases such as InfluxDB can be used.

Observability – Chasing “Mister X”

Observability essentially consists of three components: monitoring, logging and tracing. Monitoring reports when a defined service level or quality criterion is not met. To this end, the application developers define appropriate metrics, which are provided directly by the application. The logs contain the error messages of each individual software component. They show where in the various services the error occurs. Tracing allows us to identify the path a call has taken between services before resulting in a problem. By means of correlation IDs, we can view all this information together in a central dashboard. This way we keep an overview even in complex applications and quickly track down the source of an error.
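How a correlation ID ties logs and traces together can be sketched as follows. This is a minimal illustration in plain Python and not the API of any of the tools named on this page; the record layout, the field name `correlation_id` and the helper `collect_by_correlation_id` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical records as they might arrive from separate log and trace stores.
log_records = [
    {"correlation_id": "req-1", "service": "shop", "message": "order received"},
    {"correlation_id": "req-2", "service": "shop", "message": "order received"},
    {"correlation_id": "req-1", "service": "payment", "message": "card declined"},
]
trace_spans = [
    {"correlation_id": "req-1", "service": "shop", "duration_ms": 120},
    {"correlation_id": "req-1", "service": "payment", "duration_ms": 95},
    {"correlation_id": "req-2", "service": "shop", "duration_ms": 30},
]


def collect_by_correlation_id(logs, spans):
    """Group log records and spans that belong to the same call."""
    grouped = defaultdict(lambda: {"logs": [], "spans": []})
    for record in logs:
        grouped[record["correlation_id"]]["logs"].append(record)
    for span in spans:
        grouped[span["correlation_id"]]["spans"].append(span)
    return dict(grouped)


calls = collect_by_correlation_id(log_records, trace_spans)

# Everything that happened for one call, across services.
for entry in calls["req-1"]["logs"]:
    print(entry["service"], entry["message"])
```

Because every service writes the same ID, a single lookup returns the whole story of one call instead of one log file per service.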

Observability Tools

For observability applications we favor open-source solutions. In our experience, they are at no disadvantage compared with commercial solutions. We have been using open-source solutions for many years, both with our clients and in our own production environments. They also offer a remarkably broad range of functions.

Prometheus is the de facto standard for cloud-native monitoring and alerting. Configuring where and how metrics are collected is simple. Most applications support exporting metrics to Prometheus, and there is equally good support for self-written applications in all common programming languages.

Loki allows for simple importing and indexing of logs. Its configuration is derived from Prometheus and aims at quickly finding logs that match certain criteria. To this end, only a very small index is written. Because queries are heavily parallelized, they execute quickly even on large amounts of data.
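The idea of a small index combined with parallel scanning can be sketched in a few lines. This is a toy model and not Loki’s implementation: it assumes that only the labels of a log stream are indexed and that the log lines themselves are searched by brute force. The `streams` store and the functions `select_streams`, `scan` and `query` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical store: the index holds only label sets, never log content.
streams = {
    (("app", "shop"), ("env", "prod")): ["GET /cart 200", "POST /order 500 timeout"],
    (("app", "payment"), ("env", "prod")): ["charge ok", "charge failed: timeout"],
    (("app", "shop"), ("env", "test")): ["GET /cart 500 timeout"],
}


def select_streams(selector):
    """Use the label index to pick the streams whose labels match the selector."""
    return [
        key for key in streams
        if all(dict(key).get(name) == value for name, value in selector.items())
    ]


def scan(key, needle):
    """Brute-force scan of one stream's lines for a substring."""
    return [line for line in streams[key] if needle in line]


def query(selector, needle):
    """Narrow down by labels first, then scan the remaining streams in parallel."""
    keys = select_streams(selector)
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda key: scan(key, needle), keys)
    return [line for lines in results for line in lines]


print(query({"env": "prod"}, "timeout"))
```

The index stays small because it grows with the number of label combinations, not with the volume of log text.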

Grafana is used to visualize metrics. It integrates very well with Prometheus, Loki and Jaeger and allows metrics as well as traces to be displayed in graphs. It is also possible to jump directly to individual traces and to display the logs that belong to certain metrics. Besides a large selection of predefined dashboards with various metrics, users can also create their own dashboards.

Jaeger supports the OpenTracing standard, which makes it easy to integrate applications with Jaeger. For self-written applications there is, much like with Prometheus, broad support across programming languages and frameworks. Other advantages of Jaeger, besides its widespread use, include its simple installation and its ability to scale even with larger amounts of data.

Important Terms & Notes

Logging is used to record special events or problematic and faulty situations so that an error can be understood when difficulties arise. How informative these records are is up to the developers. For most programming languages there are logging frameworks that provide standardized log formats. This becomes important when logs are to be collected centrally or retrieved according to certain criteria. Since local log files are lost when a container restarts, centralized log collection is mandatory, especially for short-lived containers.
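A standardized log format can be as simple as one JSON object per line. The following sketch uses Python’s standard `logging` module; the class `JsonFormatter`, the helper `build_logger`, the chosen fields and the logger name `order-service` are assumptions for illustration. Writing to standard output instead of a local file is one common way to let a central collector pick up the lines.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name, stream=sys.stdout):
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


log = build_logger("order-service")
log.error("payment failed for order %s", 4711)
```

Fixed field names such as `level` and `logger` are what make it possible to find logs again by criteria later on.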

In today’s distributed systems, and especially in microservice architectures, simple logging is no longer sufficient. A request has to be traceable across various services or methods, since it is often the interaction between microservices that results in problems or performance bottlenecks. This requires that, in addition to the end-user calls, information on service calls is also provided and stored in special tracing log events. Moreover, these tracing logs must be stored centrally for all services involved so that the call hierarchy can be displayed. External libraries or services that are used must meet these additional requirements as well.

With OpenTracing, frameworks are available for many programming languages that make it easy to follow end-user calls across various services via so-called spans or correlation IDs. This standard is already supported by many open-source libraries.
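The relationship between spans and the call hierarchy can be illustrated with a hand-rolled model. This is not the OpenTracing API; the `Span` class, its fields and the function `render_tree` are hypothetical and only show the principle that every span carries the ID of its trace and of its parent.

```python
import itertools
from dataclasses import dataclass
from typing import Optional

_span_ids = itertools.count(1)


@dataclass
class Span:
    trace_id: str                    # shared by all spans of one end-user call
    operation: str
    parent_id: Optional[int] = None  # None marks the root span
    span_id: int = 0

    def __post_init__(self):
        # Assign a fresh, process-wide unique ID to every span.
        self.span_id = next(_span_ids)

    def child(self, operation):
        """Start a span for a downstream call within the same trace."""
        return Span(self.trace_id, operation, parent_id=self.span_id)


def render_tree(spans, parent_id=None, depth=0):
    """Return the call hierarchy as indented lines."""
    lines = []
    for span in spans:
        if span.parent_id == parent_id:
            lines.append("  " * depth + span.operation)
            lines.extend(render_tree(spans, span.span_id, depth + 1))
    return lines


root = Span("trace-42", "GET /checkout")
cart = root.child("cart-service: load cart")
pay = root.child("payment-service: charge")
bank = pay.child("bank-gateway: authorize")

print("\n".join(render_tree([root, cart, pay, bank])))
```

Once all services store their spans centrally, the same parent links are enough to rebuild the hierarchy for any call.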

Metrics are numerical representations of states (e.g. number of open connections) or throughput (e.g. volume written to a hard drive since a specific point in time, number of calls to a specific function). In this they differ from logs and tracing data, which relate to individual events.
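The difference between the two kinds of metrics shows in how they are read. A state is used as it is, whereas a throughput value that has accumulated since some point in time only becomes meaningful as a rate of change. A minimal sketch with made-up numbers; the function name `rate_per_second` is an assumption, and the sketch ignores the case where the counter is reset.

```python
def rate_per_second(earlier, later):
    """Average increase per second between two (timestamp_s, value) samples."""
    (t0, v0), (t1, v1) = earlier, later
    if t1 <= t0:
        raise ValueError("samples must be in chronological order")
    return (v1 - v0) / (t1 - t0)


# A state metric is read directly: 17 connections are open right now.
open_connections = 17

# A throughput metric accumulates; two samples taken 60 seconds apart.
bytes_written_samples = [(0, 1_000_000), (60, 1_600_000)]  # (seconds, bytes)

print(rate_per_second(*bytes_written_samples))  # 10000.0 bytes per second
```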

Metrics can be queried from standard applications (e.g. NGINX, databases or objects in Kubernetes) via so-called exporters or metrics endpoints. Custom applications should be instrumented so that SLAs can be measured and further usage information (e.g. number and response time of critical calls) can be obtained for detailed performance analysis.
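Instrumenting a critical call for its number of invocations and its response time can be sketched with a decorator. This is a plain-Python illustration, not a Prometheus client library; the registries `call_count` and `call_seconds`, the decorator `instrumented` and the function `checkout` are hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process registry: call count and total duration per name.
call_count = defaultdict(int)
call_seconds = defaultdict(float)


def instrumented(name):
    """Decorator that counts calls and accumulates their duration."""
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the call even if it raised an exception.
                call_count[name] += 1
                call_seconds[name] += time.perf_counter() - start
        return wrapper
    return decorate


@instrumented("checkout")
def checkout(order_id):
    time.sleep(0.01)  # stands in for real work
    return f"order {order_id} confirmed"


checkout(1)
checkout(2)
average = call_seconds["checkout"] / call_count["checkout"]
print(call_count["checkout"], f"{average:.3f}s")
```

In a real application these two numbers would be exposed on a metrics endpoint rather than printed.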

The metrics format in wide use today was introduced by Prometheus and standardized as OpenMetrics. A metric point is composed as follows:

  • Metric name: describes what is represented, e.g. server_open_connection_count
  • Labels: distinguish between different measured instances, e.g. instance=127.0.0.1:8080
  • Timestamp: the time at which the value was valid
  • Value: the numeric value

In this way, performance and, where appropriate, the number of errors or specific states can be represented compactly and visualized graphically in Grafana.
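The four parts of a metric point map directly onto one line of the Prometheus text exposition format, where the optional timestamp is given in milliseconds since the epoch. The sketch below renders such a line; the function `format_sample`, the value 42 and the timestamp are made up for illustration, and the sketch does not escape special characters in label values.

```python
def format_sample(name, labels, value, timestamp_ms=None):
    """Render one metric point as a line of Prometheus text exposition format."""
    label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    line = f"{name}{{{label_text}}}" if label_text else name
    line += f" {value}"
    if timestamp_ms is not None:
        line += f" {timestamp_ms}"
    return line


print(format_sample(
    "server_open_connection_count",
    {"instance": "127.0.0.1:8080"},
    42,
    timestamp_ms=1700000000000,
))
# server_open_connection_count{instance="127.0.0.1:8080"} 42 1700000000000
```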

Based on these numeric values, rules can be defined that trigger a message when the system exceeds limit values, e.g. if more than 90% of the available connections have been in use for more than 10 minutes, or if more than 2% of queries on average resulted in errors over 5 minutes. A metrics monitoring tool such as Prometheus stores the metrics, checks these conditions and can then notify those in charge.
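The first rule, a limit that has to be exceeded for a certain time before anyone is notified, can be sketched as follows. This is a simplified model of such a check and not Prometheus’s rule evaluation; the function `breached_for` and the sample values are assumptions.

```python
def breached_for(samples, threshold, duration_s):
    """True if the most recent samples stayed above threshold for more than duration_s.

    samples: chronologically ordered (timestamp_s, value) pairs.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # the current breach begins here
        else:
            breach_start = None  # any dip below the limit resets the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start > duration_s


# Share of occupied connections, sampled once per minute (made-up values).
usage = [(minute * 60, 0.95) for minute in range(12)]
print(breached_for(usage, threshold=0.90, duration_s=600))  # True
```

Requiring the condition to hold for a while keeps short spikes from notifying the operations team.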

Monitoring refers to the supervision of applications and infrastructure. In case of faulty states or performance bottlenecks, the responsible operations teams are notified, ideally before the application’s users become aware of major problems.

State-of-the-art monitoring systems like Prometheus are metrics-based. In other words, they detect problematic states from the metrics and then trigger alerts. In addition, they store the metrics over an extended period of time, so that visualization tools like Grafana can later use them to analyze problematic situations as well.

OpenShift Service Mesh Based on Istio

Microservice management: An intermediary layer in your application ensures secure and error-free inter-service communication and also helps to optimize your application’s performance. This means less code for your developers, who can now concentrate fully on the app’s business value.

Contact

Christoph Ehlers

+49-89-45841-100