SRE: Software Engineering for IT Operations

Site Reliability Engineering (SRE) is a DevOps approach developed by Google. It treats IT operations as a software problem to be solved with software engineering methods. The result is optimized processes and systems that account for the risk of errors and can handle them. Continuous delivery is key: rolling out many small releases regularly reduces the risk of each individual change. In addition, SRE tasks are planned to include time for improvements and for automating recurring tasks.

Site Reliability Engineering is firmly established in ConSol's business processes: experienced software engineers work hand in hand with the IT Operations team, and our cloud and monitoring experts proactively contribute their specialist knowledge. After all, the main goal of SRE is also ours: the more business topics we cover with software tools and automation, the lower the workload in continuous development and operation.

Site Reliability Engineering in a Few Points

What Does a Site Reliability Engineering Team Do?
An SRE team consists of software engineers and is responsible for running services in production.

Why Software Engineers Instead of Sysadmins?
In classical operations, the workload grows linearly with the number or size of the services. This approach is no longer practicable, especially in modern microservice architectures. Site Reliability Engineering therefore solves operational tasks with software rather than manually: the more tasks are handled by software, the lower the manual workload.

How Is an SRE Team Organized?
There are several ways to organize these teams. Google relies on three pillars. First, the amount of time Site Reliability Engineers spend on manual tasks is capped, which gives them the capacity to develop SRE tools. Second, on-call duty is organized professionally so that there is enough time for a thorough post-mortem analysis when an error occurs. Third, when it comes to assessing the risk of taking new features live, error budgets ensure that the service developers and the SRE team pull together.
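An error budget is commonly derived from a service level objective (SLO): the share of time the SLO leaves below 100 % is the unreliability that may be spent on releases and incidents. The following is a minimal sketch of that arithmetic. The class name, the 99.9 % SLO and the 30-day window are assumptions chosen for illustration, not values from a specific project.

```java
// Illustrative helper; the name ErrorBudget and all numbers used with it are assumptions.
class ErrorBudget {

    /** Downtime in minutes that an availability SLO allows within the given window. */
    static double allowedDowntimeMinutes(double sloPercent, int windowDays) {
        double windowMinutes = windowDays * 24.0 * 60.0;
        return windowMinutes * (100.0 - sloPercent) / 100.0;
    }

    /** Budget left after the downtime already consumed; negative means the budget is overspent. */
    static double remainingMinutes(double sloPercent, int windowDays, double downtimeSoFarMinutes) {
        return allowedDowntimeMinutes(sloPercent, windowDays) - downtimeSoFarMinutes;
    }

    public static void main(String[] args) {
        // Assumed example: 99.9 % availability over 30 days, 30 minutes of downtime so far.
        System.out.printf("allowed:   %.1f min%n", allowedDowntimeMinutes(99.9, 30));
        System.out.printf("remaining: %.1f min%n", remainingMinutes(99.9, 30, 30.0));
    }
}
```

With these assumed numbers the budget is about 43 minutes per window. If little budget remains, the team would typically hold back risky releases until the window moves on.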

What Are the Team's Responsibilities?
SRE teams are responsible for service availability, latency, performance, efficiency, deployment, monitoring, emergency response and capacity planning.

SRE Practical Tips

Endpoints for liveness and readiness probes are often implemented very simply: they respond with 200 OK as soon as the application has started. In a number of projects, we have found that this is not enough. We therefore use the health checks to also test whether all adjacent systems and message queues are reachable. This enables us to detect problems during deployment and, ideally, resolve them automatically.
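A health check that also covers adjacent systems could look roughly like the following sketch. It uses only the Java standard library; the class name ReadinessCheck, the check names and the plain TCP connect test are assumptions for illustration, and a real project would more likely plug such checks into the health mechanism of its framework.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical aggregate health check: ready only if every registered dependency check passes.
class ReadinessCheck {
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    ReadinessCheck register(String name, BooleanSupplier check) {
        checks.put(name, check);
        return this;
    }

    /** Runs every check; a check that throws counts as failed. */
    Map<String, Boolean> run() {
        Map<String, Boolean> results = new LinkedHashMap<>();
        checks.forEach((name, check) -> {
            boolean ok;
            try {
                ok = check.getAsBoolean();
            } catch (RuntimeException e) {
                ok = false;
            }
            results.put(name, ok);
        });
        return results;
    }

    /** 200 if all dependencies are reachable, otherwise 503 Service Unavailable. */
    int httpStatus() {
        return run().containsValue(false) ? 503 : 200;
    }

    /** Simple reachability test: can a TCP connection be opened within the timeout? */
    static boolean tcpReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

A check for a database or message broker would then be registered as, for example, `register("queue", () -> ReadinessCheck.tcpReachable(host, port, 500))`, where host and port come from the application's own configuration.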

In one project, a problem with EJB Timer Services caused the transaction to roll back after each run. As long as one of the following runs succeeds, this is not a problem in itself. To find out whether we were dealing with expected rollbacks or with real bugs, we implemented a metric that measures the time elapsed since the last successful run. This allowed us to distinguish between temporary and permanent errors.
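The idea behind such a metric can be sketched in a few lines. The class below is an illustration only: the name LastSuccessGauge, the injectable clock and the staleness threshold are assumptions, and in practice the value would be exported through whatever metrics library the project uses.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical gauge: seconds elapsed since a recurring job last completed successfully.
class LastSuccessGauge {
    private final LongSupplier clockMillis;
    private final AtomicLong lastSuccessMillis;

    /** The clock is injectable so the gauge can be tested; use System::currentTimeMillis in production. */
    LastSuccessGauge(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
        // Start counting from creation so a job that never succeeds also becomes visible.
        this.lastSuccessMillis = new AtomicLong(clockMillis.getAsLong());
    }

    /** Call at the end of every run that committed successfully. */
    void markSuccess() {
        lastSuccessMillis.set(clockMillis.getAsLong());
    }

    long secondsSinceLastSuccess() {
        return (clockMillis.getAsLong() - lastSuccessMillis.get()) / 1000;
    }

    /** Temporary errors recover below the threshold; permanent errors keep growing past it. */
    boolean isStale(long thresholdSeconds) {
        return secondsSinceLastSuccess() > thresholdSeconds;
    }
}
```

An alert on this value only fires when rollbacks persist longer than the chosen threshold, so occasional expected rollbacks stay quiet.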

With Java applications, it is worth taking thread dumps regularly. Thread dumps help with post-mortem analysis and profiling. A series of thread dumps quickly reveals, for example, when an external system is blocked and new threads with blocked calls are constantly being started. In particular, we recommend taking two to three thread dumps in the stop script so that the state before the shutdown can still be analyzed after the application has been restarted.
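As a minimal sketch, a thread dump can be produced from inside the JVM with the standard ThreadMXBean, for example from a scheduled task or a shutdown hook. The class name ThreadDumper and the output layout are assumptions for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper that renders the state and stack of every live thread as text.
class ThreadDumper {

    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: do not collect locked monitors or synchronizers, which keeps the dump cheap.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("\tat ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump());
    }
}
```

In a stop script, the same information is usually captured from outside the process with JDK tools such as jstack, taken a few seconds apart so that the dumps can be compared.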

With the Mapped Diagnostic Context (MDC), logging frameworks make it possible to add information such as the user name to every log line by default, so that you can trace which log lines belong together during log analysis. However, MDC data is not always available, for example when the user has not yet been determined. It is therefore worthwhile to also include the thread name in the log format. The thread name provides a simple and reliable way to track which log lines belong to the same request.
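The effect can be illustrated with java.util.logging alone. In the sketch below, the USER thread-local is a hypothetical stand-in for an MDC entry, and the formatter assumes that the handler formats each record on the thread that logged it. Real projects would normally achieve the same through the pattern configuration of their logging framework.

```java
import java.util.logging.Formatter;
import java.util.logging.LogRecord;

// Hypothetical formatter: always prints the thread name, plus the user if one is known.
class ThreadAwareFormatter extends Formatter {

    // Stand-in for an MDC entry; empty until the user has been determined.
    static final ThreadLocal<String> USER = new ThreadLocal<>();

    @Override
    public String format(LogRecord record) {
        String user = USER.get();
        return String.format("%s [%s] user=%s %s%n",
                record.getLevel(),
                Thread.currentThread().getName(), // available even before the user is known
                user == null ? "-" : user,
                formatMessage(record));
    }
}
```

Lines logged before authentication then show `user=-` but still carry the thread name, so they can be matched to the later lines of the same request.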

In a project in the telecommunications industry, we faced the challenge that many microservices called the same endpoint, while the total number of calls was not allowed to exceed a certain threshold per second. We solved this by using ZooKeeper to coordinate the calls. The advantage: we were able to avoid a central coordination system as a single point of failure.

Manual steps during build and deployment are a common source of errors. It is worth automating everything. Modern CI/CD pipelines not only reduce the risk of errors but also relieve the SREs of tedious recurring tasks.

The biggest challenge in load testing is generating realistic test data. This concerns more than the content of the data: in a large migration project, we found that the way data is fragmented in a database can have a significant impact on performance. Discovering this as early as the load test phase was decisive for the project’s success.

The more measuring points an application has, the better. This helps not only in operation: load testing, for example, is much more valuable when it shows not only whether a service meets its SLOs but also where potential bottlenecks are.
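What a measuring point can look like in code is sketched below: each named step records how long it took, so that a load test can show which step dominates the response time. The class MeasuringPoints and its method names are assumptions for illustration; a production service would typically use an established metrics library instead.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical in-process timer registry with one entry per named measuring point.
class MeasuringPoints {
    private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> calls = new ConcurrentHashMap<>();

    /** Runs the work and records its duration under the given measuring point. */
    <T> T time(String point, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            record(point, System.nanoTime() - start);
        }
    }

    void record(String point, long nanos) {
        totalNanos.computeIfAbsent(point, k -> new LongAdder()).add(nanos);
        calls.computeIfAbsent(point, k -> new LongAdder()).increment();
    }

    double meanMillis(String point) {
        LongAdder count = calls.get(point);
        if (count == null || count.sum() == 0) {
            return 0.0;
        }
        return totalNanos.get(point).sum() / (double) count.sum() / 1_000_000.0;
    }

    /** The measuring point with the highest mean duration, i.e. the first bottleneck candidate. */
    Optional<String> slowestPoint() {
        return totalNanos.keySet().stream().max(Comparator.comparingDouble(this::meanMillis));
    }
}
```

Wrapping calls such as a database query in `time("db", ...)` then lets a load test report the slowest step in addition to the overall latency.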

Software engineers like to use modern design patterns such as circuit breakers. To avoid overload, however, you should not lose sight of classic configuration options. Pool sizes in Java application servers, for example, should be dimensioned so that, in the event of an unexpected peak load, excess requests are stopped as early as possible and downstream components are not overloaded.
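The same principle can be shown with a plain ThreadPoolExecutor: a fixed number of threads plus a small bounded queue means that excess requests are rejected immediately instead of piling up. This is a sketch under assumptions; the class name BoundedPool and the sizes are illustrative, and an application server would expose equivalent settings in its own configuration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo of a pool that sheds load early instead of queueing without limit.
class BoundedPool {

    static ThreadPoolExecutor create(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,                          // fixed size: no growth under load
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),   // bounded backlog
                new ThreadPoolExecutor.AbortPolicy());     // reject instead of blocking or buffering
    }

    /** Submits the given number of slow tasks at once and returns how many were rejected. */
    static int countRejected(ThreadPoolExecutor pool, int requests) {
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < requests; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // simulates a call to a slow downstream system
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // in a server this would become an immediate error response
                }
            }
        } finally {
            release.countDown();
        }
        return rejected;
    }
}
```

With two threads and a queue of three, a burst of ten slow requests leads to five rejections: the pool accepts only what it can hold, and the downstream system never sees more than two concurrent calls.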

Site Reliability Engineers must familiarize themselves with the normal behavior of their services and regularly check their logs. Otherwise, in case of an error, a lot of time is lost in investigating oddities that have nothing to do with the acute fault.

Project Profiles

ConSol stands for technological excellence and practical expertise. We draw on three decades of cross-industry project experience – in medium-sized companies as well as in DAX corporations and with other heavyweights. We support you in important software architecture decisions and place your solution on a solid, future-proof foundation.


Branch: Automotive

Project content: Application for market research and sales forecasting of automotive components.
Technologies: React, SpringBoot, MSSQL


Branch: Authority

Project content: Customer portal in a modern design for the provision of services for private persons and companies.
Technologies: Angular, SpringBoot

Branch: IT

Project content: Administration GUI for an Enterprise Cloud Software.
Technologies: React

Branch: Automotive

Project content: B2B application for the detailed comparison of vehicles and their equipment features.
Technologies: Angular, JavaEE, MongoDB

Branch: Telecommunications

Project content: Classic customer self-care portal for a telecommunications provider.
Technologies: Backbone.js, SpringBoot, JWT

Branch: Authority

Project content: Customer management for a municipal authority.
Technologies: Angular, Grails, ConSol CM

Branch: Automotive

Project content: Administration GUI for an application for provisioning vehicles.
Technologies: Angular, JavaEE, Microservice

Branch: Telecommunications

Project content: Application for market research and sales forecasting of automotive components.
Technologies: Microservice, Spring, REST, SOAP, Citrus


Branch: Telecommunications

Project content: Development of an integration platform for the exchange of Smart Meter data with SAP.
Technologies: Weblogic, SpringIntegration, SOAP, Messaging, Citrus


Branch: Automotive

Project content: Online interfaces for the provisioning of new vehicles
Technologies: Microservice, REST, MQTT


Branch: Automotive

Project content: Gateway for the internet communication of vehicles
Technologies: Microservice, REST, SOAP, MQTT, Citrus


Branch: Telecommunications

Project content: Web-based mail and messaging platform
Technologies: SOAP, Messaging, SMS, Citrus


Branch: Telecommunications

Project content: Online interface for data exchange with telecommunication providers
Technologies: SOAP, Spring, Citrus



Lidl
Branch: Retail

Solution: Business service and system monitoring based on Nagios


pbb Deutsche Pfandbriefbank
Branch: Finances

Solution: End-to-end application monitoring with Sakuli


it@M, central IT service provider for the City of Munich
Branch: Public Administration

Solution: Open Source Monitoring with OMD


M-net
Branch: Telecommunications

Solution: Future-proof monitoring on Nagios basis


Statutory Health Insurance Association of Lower Saxony
Branch: Public Administration

Solution: Seamless IT monitoring of the server landscape

Technologies / Competences

Sakuli
Thruk
OMD
Grafana
Prometheus
coshsh
Jolokia
Mod-Gearman

Contact

Lutz Keller

+49-89-45841-100