SRE: Striving for Antifragile Operations

With increasing customer demands, system reliability becomes more critical by the second. You can’t prevent all incidents, but you can learn from them—making your systems more resilient, reliable and capable of providing the services your customers need.

“Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk, and uncertainty. Yet, in spite of the ubiquity of the phenomenon, there is no word for the exact opposite of fragile. Let us call it antifragile. Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better.”

—Nassim Taleb, from Antifragile: Things that Gain from Disorder

Antifragility is a quality attribute of systems. How can we improve systems so that incidents are detected earlier, resolved faster, and do not recur? Site Reliability Engineering (SRE) is the methodology that Google uses to run very large production systems reliably. Can we learn from Google’s practices to make our own operations more antifragile?

Fragile, Robust, and Antifragile

According to Nassim Taleb, there are three kinds of systems:

Fragile systems fail when put under stress. For example, a single, manually configured server that cannot handle the load or scale up when too many users visit the website.
Robust systems can handle stressors without failing (at least up to some point). In software, an example is proper exception handling that keeps the system running when an error occurs.
Antifragile systems benefit from stressors. When we increase pressure on the system or introduce errors into it, the system becomes more resilient. For example, chaos engineering introduces failure on purpose to identify weak points so they can be resolved. By fixing holes before they surface unexpectedly as incidents, the system becomes stronger.
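
To make the idea of introducing failure on purpose concrete, here is a minimal sketch of fault injection in Python. It is a toy, not a chaos engineering tool: the FlakyDependency class, the fetch_with_retry helper, and the run_experiment function are all hypothetical names invented for this illustration. The experiment injects failures into a dependency and measures how many calls still fail from the caller's point of view, which exposes whether the retry logic is a weak point.

```python
import random


class FlakyDependency:
    """Hypothetical downstream service that fails a chosen fraction of calls on purpose."""

    def __init__(self, failure_rate, seed=0):
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so experiments are repeatable

    def fetch(self, key):
        # Inject a failure with probability `failure_rate`.
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return f"value-for-{key}"


def fetch_with_retry(dependency, key, attempts=3):
    """Call the dependency, retrying up to `attempts` times on injected failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return dependency.fetch(key)
        except ConnectionError as exc:
            last_error = exc
    raise last_error


def run_experiment(failure_rate, calls=1000, attempts=3):
    """Return the fraction of calls that still failed after retries."""
    dependency = FlakyDependency(failure_rate, seed=42)
    failed = 0
    for i in range(calls):
        try:
            fetch_with_retry(dependency, str(i), attempts)
        except ConnectionError:
            failed += 1
    return failed / calls


if __name__ == "__main__":
    for attempts in (1, 3):
        rate = run_experiment(failure_rate=0.3, attempts=attempts)
        print(f"attempts={attempts}: {rate:.1%} of calls failed for the caller")
```

Running the experiment with one attempt and then with three shows how much the retries absorb. In a real system the same question would be asked of timeouts, fallbacks, and failover, ideally in a controlled environment first.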

Understanding Your System

SRE is a data-centric approach that focuses on creating systems that learn from errors and outages. Typically this data is collected from technical elements, such as CPU usage or memory utilization. However, these metrics don’t give you direct insight into the impact the incident has on your end user. So, it’s important to align business objectives with technical objectives.

Technical objectives are often raw metrics. In SRE these are called Service Level Indicators (SLIs). Examples of SLIs could be “request latency,” “error rate,” and “throughput.”
Business objectives are often contractual agreements with your customers that include consequences whenever they are not met. These contractual agreements are also called Service Level Agreements (SLAs).

To bridge the gap between SLIs and SLAs, you can define Service Level Objectives (SLOs).

SLOs are the targets you set for a service level such as availability, expressed as a target value for an SLI. They are tied directly to the customer experience and stated explicitly. If an SLO is not met, customers are unhappy.

By defining SLOs and SLIs, you can discover (potential) failures before your customers do.
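
The relationship between the three terms can be sketched in a few lines of Python. This is an illustration under assumptions, not a prescribed implementation: the function names, the request counts, and the target values (99.9% for the SLO, 99.5% for the SLA) are all made up for the example. The point it shows is that an SLO set stricter than the SLA gives you an internal warning before the contractual limit is reached.

```python
def availability_sli(good_requests, total_requests):
    """SLI: fraction of requests that succeeded in the measurement window."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def evaluate(sli, slo_target, sla_target):
    """Classify a measured SLI against an internal SLO and a contractual SLA.

    Assumes slo_target >= sla_target, so the SLO is breached first.
    """
    if sli < sla_target:
        return "sla_breached"  # contractual consequences apply
    if sli < slo_target:
        return "slo_breached"  # internal warning: act before customers notice
    return "ok"


if __name__ == "__main__":
    # Hypothetical targets and request counts, chosen only for illustration.
    slo, sla = 0.999, 0.995
    for good, total in [(9995, 10000), (9980, 10000), (9900, 10000)]:
        sli = availability_sli(good, total)
        print(f"SLI={sli:.4f} -> {evaluate(sli, slo, sla)}")
```

The same pattern applies to other SLIs mentioned above, such as request latency, as long as "good" is defined per request (for example, served within a latency threshold).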

Handling Incidents

When running applications, we eventually have to deal with failure. Creating and operating reliable systems requires understanding every way the system can fail—and making sure you detect and resolve the failures when they do happen.

Whenever a failure or incident occurs, it has real, often monetary, consequences. It can bring down parts of your company or even the company as a whole. To minimize the impact of an incident, the first priority is resolving it. After the incident has been resolved, you need to investigate the underlying cause. If the root cause is fixed, the incident should not recur, which means the incident has left the system better than it was.

Embracing Failure

Since system failures are inevitable, the question that remains is how to deal with them. Often failures are experienced as stressful, even frightening—the consequences can be dreadful.

For this reason, people are often afraid to fail, which creates a fragile environment. To become antifragile, you need to embrace failure. When a failure occurs, the question should not be who is to blame. Instead, focus on how the failure could have occurred and how to prevent it from happening again. Within SRE this is done using a concept called “blameless postmortems.”

Blameless postmortems cover several critical aspects of becoming antifragile:

Make sure every failure is documented.
Understand all root causes; knowing exactly what caused a failure not only helps resolve it but also helps prevent it from recurring.
Establish specific actions to reduce the likelihood of the failure repeating, or to reduce its impact when it does.
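
The three aspects above can be captured in a simple record. The sketch below is one possible shape, assuming nothing about any particular postmortem template: the Postmortem and ActionItem classes, their field names, and the gaps method are hypothetical. Note that the record deliberately has no field for who caused the failure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionItem:
    description: str
    kind: str  # "prevent" (reduce likelihood) or "mitigate" (reduce impact)
    done: bool = False


@dataclass
class Postmortem:
    title: str
    what_happened: str  # the documented failure, described without blame
    root_causes: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)

    def gaps(self):
        """Return which of the three aspects are still missing."""
        problems = []
        if not self.what_happened.strip():
            problems.append("failure is not documented")
        if not self.root_causes:
            problems.append("no root cause identified")
        if not self.action_items:
            problems.append("no action items defined")
        return problems


if __name__ == "__main__":
    # Hypothetical incident, invented for the example.
    pm = Postmortem(
        title="Checkout outage",
        what_happened="Checkout returned errors after a configuration change.",
        root_causes=["Configuration change was not validated before rollout"],
    )
    print(pm.gaps())
```

A check like gaps can be used to keep a postmortem open until all three aspects are covered.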

Introducing blameless postmortems helps take away the fear of failure, so people raise and address issues sooner. The faster you are able to see (potential) failures, the faster you can take action to recover. Eventually, this will make your system more reliable and increase customer satisfaction.

Becoming Antifragile

By introducing SRE concepts into your daily workflow, you can build processes and systems that lead to antifragility. Understanding your system, proper incident management, and embracing failure all contribute to creating a culture of continuous feedback and continuous improvement. Every time your system fails, you will be able to respond faster, find the root cause, and then define concrete actions to resolve it, prevent it, and learn from it.
