SRE: Site Reliability Engineering

Software Engineering for IT Operations

Site Reliability Engineering (SRE) is a DevOps approach that originated at Google. It treats IT operations as a software problem to be solved with software engineering. The result is optimized processes and systems that anticipate errors and know how to handle them. Continuous delivery is key: rolling out many small releases at regular intervals reduces the risk of each individual change. In addition, SRE work is planned so that it leaves time for improvements and for automating recurring tasks.

Site Reliability Engineering is firmly established in ConSol's business processes: experienced software engineers work hand in hand with the IT Operations team, while our cloud and monitoring experts proactively contribute their specialist knowledge. The main goal of SRE is also ours: the more topics we can cover with software tools and automation, the lower the workload in ongoing development and operation.

Site Reliability Engineering in a Few Points

What Does a Site Reliability Engineering Team Do?

An SRE team consists of software engineers and is responsible for running services in production.

Why Software Engineers Instead of Sysadmins?

In traditional operations, the workload grows linearly with the number or size of the services. This approach is no longer practicable, especially in modern microservice architectures. Site Reliability Engineering therefore solves operational tasks with software rather than manually: the more tasks software takes over, the lower the workload.

How Is an SRE Team Organized?

There are several ways to organize these teams. Google relies on three pillars. First, the amount of time Site Reliability Engineers spend on manual tasks is capped, which gives them the capacity to develop SRE tools. Second, on-call duty is organized professionally so that there is enough time for a thorough post-mortem analysis when an error occurs. Third, error budgets ensure that the service developers and the SRE team pull together when assessing the risk of taking new features live.
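To make the idea of an error budget concrete, here is a minimal sketch in Python. It assumes a request-based availability SLO; the function name `error_budget`, the example numbers, and the rule "release only while budget remains" are illustrative assumptions, not a description of Google's or ConSol's actual policy.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Return allowed and remaining failures for one measurement window.

    slo is the target success ratio, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    # The budget is the share of requests that may fail without breaking the SLO.
    allowed = (1.0 - slo) * total_requests
    remaining = allowed - failed_requests
    return {
        "allowed_failures": allowed,
        "remaining_failures": remaining,
        # Hypothetical policy: risky releases only while budget is left.
        "release_allowed": remaining > 0,
    }


# Example window: one million requests, 400 of them failed.
budget = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=400)
```

With a 99.9 % target and one million requests, roughly 1,000 failures are tolerable, so 400 failures leave room for further releases. Once the budget is used up, developers and SREs share an interest in stabilizing the service before shipping new features.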

What Are the Team's Responsibilities?

SRE teams are responsible for service availability, latency, performance, efficiency, deployment, monitoring, emergency response and capacity planning.


Flexibility, automation, and the collaborative efforts of development, operations, cloud, and monitoring experts are crucial in our digital world. This approach enables new developments to reach market maturity quickly and with minimal risk, while reducing the effort required for their ongoing development and operation. That's why we at ConSol embrace Site Reliability Engineering.

Oliver Weise
Head of Platform Engineering

Our SRE Know-How

SRE Practical Tips

Endpoints for liveness and readiness probes are often implemented very simply: they respond with 200 OK as soon as the application has started. In a number of projects, we have found that this is not enough. We have therefore started to have the health checks also test the accessibility of all adjacent systems and message queues. This enables us to detect problems during deployment and, ideally, resolve them automatically.
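A health check that also looks at adjacent systems could be sketched as follows. This is a simplified Python illustration, not the implementation used in the projects mentioned: the helper names `tcp_reachable` and `readiness`, the host names, and the choice of a plain TCP connect as the reachability test are all assumptions.

```python
import socket
from typing import Callable, Dict, Tuple


def tcp_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and map the outcome to an HTTP status code."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "up" if check() else "down"
        except Exception:
            # A check that crashes is treated like an unreachable dependency.
            results[name] = "down"
    status = 200 if all(state == "up" for state in results.values()) else 503
    return status, results


# Hypothetical dependencies of a service; the lambdas only run when
# readiness(CHECKS) is called, for example from the probe endpoint.
CHECKS = {
    "database": lambda: tcp_reachable("db.internal.example", 5432),
    "message-queue": lambda: tcp_reachable("mq.internal.example", 5672),
}
```

A probe endpoint would return the status code from `readiness(CHECKS)` and could include the per-dependency results in the response body, so that a failed deployment immediately shows which adjacent system is unreachable.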

During one project, a problem with EJB Timer Services caused the transaction to roll back after each run. As long as one of the following runs succeeds, this is not a problem in itself. To find out whether we were dealing with expected rollbacks or with real bugs, we implemented a metric that measures the time elapsed since the last successful run. This allowed us to distinguish between temporary and permanent errors.
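The "time since last successful run" metric can be illustrated with a small Python sketch. The class name `LastSuccessTracker`, the injectable clock, and the threshold-based classification are assumptions made for this example; the original project used EJB timers in Java.

```python
import time
from typing import Callable


class LastSuccessTracker:
    """Track how long ago a periodic job last completed successfully."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # Start-up counts as the reference point until the first success.
        self._last_success = clock()

    def run(self, job: Callable[[], None]) -> bool:
        """Execute the job; only a run without an exception resets the timer."""
        try:
            job()
        except Exception:
            return False
        self._last_success = self._clock()
        return True

    def seconds_since_last_success(self) -> float:
        """The value to export as a gauge metric."""
        return self._clock() - self._last_success

    def is_permanent_failure(self, threshold_seconds: float) -> bool:
        """Treat failures as permanent once no run succeeded for too long."""
        return self.seconds_since_last_success() > threshold_seconds
```

An occasional rollback keeps the metric low because the next successful run resets it. Only when the value keeps growing beyond a threshold that fits the job's schedule is an alert justified.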

With Java applications, it is worth taking thread dumps regularly. Thread dumps help with post-mortem analysis and profiling. Comparing successive thread dumps quickly reveals, for example, when an external system is blocked and new threads with blocked calls are constantly being started. In particular, we recommend taking two to three thread dumps in the stop script so that the state of the application before a restart can still be analyzed afterwards.
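A stop script could collect the thread dumps along the following lines. The sketch is written in Python and relies on the JDK's `jstack` command being on the path; the function names, the file naming scheme, and the two-second interval are assumptions for illustration.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, List


def jstack_dump(pid: int) -> str:
    """Return a thread dump of the JVM with the given pid via the JDK tool jstack."""
    result = subprocess.run(
        ["jstack", str(pid)], capture_output=True, text=True, check=True
    )
    return result.stdout


def capture_thread_dumps(
    pid: int,
    out_dir: Path,
    count: int = 3,
    interval_seconds: float = 2.0,
    dump: Callable[[int], str] = jstack_dump,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Path]:
    """Write `count` thread dumps to out_dir, pausing between consecutive dumps."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for index in range(1, count + 1):
        target = out_dir / f"threaddump-{pid}-{index}.txt"
        target.write_text(dump(pid))
        written.append(target)
        if index < count:
            # The pause makes it visible which threads stay stuck between dumps.
            sleep(interval_seconds)
    return written
```

Called right before the process is stopped, for example as `capture_thread_dumps(pid, Path("/var/log/myapp/dumps"))` with a hypothetical log directory, this leaves several dumps on disk that can be compared after the restart.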

With Mapped Diagnostic Context (MDC), logging frameworks make it possible to include information such as user names in every log line, so that log analysis can trace which lines belong together. However, MDC data are not always available, for example when the user has not yet been determined. It is therefore worthwhile to include the thread name in the log format as well. As long as a request is handled by a single thread, the thread name provides a reliable and easy way to track which log lines belong to the same request.
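The same idea can be shown with Python's standard `logging` module, which is used here only as a stand-in for a Java logging framework with MDC. The logger name `sre.example`, the `user` field, and the filter class are assumptions for this sketch.

```python
import logging
from typing import TextIO


class DefaultUserFilter(logging.Filter):
    """Ensure every record has a 'user' attribute, even before login."""

    def filter(self, record: logging.LogRecord) -> bool:
        if not hasattr(record, "user"):
            record.user = "-"  # context not known yet
        return True


def build_logger(stream: TextIO) -> logging.Logger:
    """Create a logger whose format contains both thread name and user."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [%(threadName)s] user=%(user)s %(message)s")
    )
    handler.addFilter(DefaultUserFilter())
    logger = logging.getLogger("sre.example")
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

After `log = build_logger(sys.stderr)`, a call such as `log.info("request received")` is written with `user=-`, while `log.info("order stored", extra={"user": "alice"})` carries the user name. Both lines contain the same thread name, so they can be correlated even though the user was unknown at first.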

During a project in the telecommunications industry, we faced the challenge that many microservices were calling the same endpoint, yet the total number of calls per second was not allowed to exceed a certain threshold. We solved this by using ZooKeeper to coordinate the calls. The advantage: we were able to avoid a central coordination system as a single point of failure.

Manual steps during build and deployment are a common source of errors. It is worth automating everything. Modern CI/CD pipelines not only reduce the risk of errors but also relieve the SREs of annoying recurring tasks.

The biggest challenge in load testing is to generate realistic test data. This involves more than the content of the data. In a large migration project, we found that the way data are fragmented in a database can have a significant impact on performance. The fact that we discovered this as early as the load test phase was decisive for the project's success.

The more measuring points an application has, the better. This helps not only in operation. Load testing, for example, is much more valuable when it shows not only whether a service adheres to its SLOs, but also where potential bottlenecks are.
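One lightweight way to add measuring points is a timing decorator, sketched here in Python. The names `measured`, `DURATIONS`, and `slowest` are made up for this example; in a real service the durations would normally be exported to a metrics system instead of being kept in a dictionary.

```python
import time
from collections import defaultdict
from functools import wraps
from typing import Dict, List, Tuple

# Collected durations in seconds, keyed by measuring point name.
DURATIONS: Dict[str, List[float]] = defaultdict(list)


def measured(name: str):
    """Decorator that records how long each call of the wrapped function takes."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even if the call raises.
                DURATIONS[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


def slowest(durations: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Return measuring points ordered by total time spent, slowest first."""
    totals = [(name, sum(values)) for name, values in durations.items()]
    return sorted(totals, key=lambda item: item[1], reverse=True)
```

Decorating the steps of a request, for instance with `@measured("db.query")` and `@measured("render")`, and calling `slowest(DURATIONS)` after a load test shows where the time is spent, not just whether the overall latency target was met.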

Software engineers like to use modern design patterns such as circuit breakers. To avoid overload, however, you should not lose sight of classic configuration options. Pool sizes in Java application servers, for example, should be dimensioned so that, in the event of an unexpected peak load, requests are stopped as early as possible and downstream components are not overloaded.
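The effect of a deliberately limited pool can be demonstrated with a small Python sketch. It is not a Java application server configuration; the class `BoundedExecutor`, its parameters, and the exception `PoolExhausted` are hypothetical names used to show the principle of rejecting work early instead of queuing it without limit.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class PoolExhausted(RuntimeError):
    """Raised when neither a worker nor a queue slot is available."""


class BoundedExecutor:
    """Thread pool that rejects new work immediately once it is saturated."""

    def __init__(self, max_workers: int, max_queued: int) -> None:
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        # One slot per running or waiting task; no slot means reject.
        self._slots = threading.BoundedSemaphore(max_workers + max_queued)

    def submit(self, fn, *args, **kwargs) -> Future:
        if not self._slots.acquire(blocking=False):
            raise PoolExhausted("pool and queue are full, rejecting request")
        try:
            future = self._executor.submit(fn, *args, **kwargs)
        except Exception:
            self._slots.release()
            raise
        # Free the slot as soon as the task has finished, failed or been cancelled.
        future.add_done_callback(lambda _: self._slots.release())
        return future

    def shutdown(self) -> None:
        self._executor.shutdown(wait=True)
```

During a load peak, callers receive `PoolExhausted` right at the entry point and can answer with an error such as HTTP 503, while the database or backend behind the pool never sees more than `max_workers` concurrent calls.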

Site Reliability Engineers must familiarize themselves with the normal behavior of their services and regularly check their logs. Otherwise, in case of an error, a lot of time is lost in investigating oddities that have nothing to do with the acute fault.

Site Reliability Engineering: Technologies & Competencies

Any more Questions about SRE for Optimized Processes and Systems?

Let's talk!

Marc Mühlhoff

# IT Ops

# Observability

# Cloud Services

+49-211-339903-74

