Key Metrics for Evaluating the Business Impact of SRE

Site Reliability Engineering (SRE) is a critical discipline that combines software engineering and operations to enhance the reliability and performance of complex systems. SRE may be a technical topic, but it has a wide and lasting impact on business performance. In fact, it plays a crucial role in properly executing customer journeys, increasing user satisfaction, and driving higher revenue. Unfortunately, despite this impact, the topic rarely reaches most business executives.

In this series of blogs, we highlight the importance of discussing SRE from a business perspective and everything you stand to gain from it. In our previous blog, we discussed the specific ways in which SRE impacts the business. This blog focuses on how we can measure that impact.

Business performance is often measured through Business Objectives and related Key Performance Indicators (KPIs). They are predetermined goals and checkpoints that guide the direction of the company.

For example, for an e-commerce web shop, KPIs would include number of daily visits, time spent on the website, total revenue generated, and many other metrics.

While there are numerous ways to map these KPIs to the website's value streams, we will stick to a relatively simple example. If the web shop hosts a variety of items, then to generate revenue it must build systems that enable searching for, ordering, and paying for the products. In this example, these systems represent elements of the Value Stream.

Let’s zoom in on the search aspect. It is crucial to ensure that customers find the products they want to buy with as few searches as possible. Too many searches will likely lead to frustration, with many giving up on the website or switching to a competitor, unsurprisingly resulting in loss of revenue, market share, and customer retention.

Therefore, it is plausible to conclude that in this scenario, a relevant KPI may be a reduction in the time a customer spends searching before they find a product and put it into their basket.

Suppose the web shop offers three search-related services. We can define how each service contributes to this Business Objective and KPI:

Simple Product Search: if the customers know exactly which product they want (a pink phone case)
Top 10 products from a category: if they know what they want (a protective phone case), but are not sure which exact product to buy
Relevant Recommendations: if they recently purchased a phone, they might need a protective phone case

The better the quality and performance of our search services, the fewer searches a customer needs. Furthermore, customers are not restricted to only one search option. Without a relevant recommendation, they can still look at the top 10 products from a category.

To measure the quality and performance of these services – from a technical, not a functional, perspective – we will introduce two key concepts of Site Reliability Engineering: Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

Service Level Indicators

What is it?

Service Level Indicators (SLIs) are point-in-time measurements of a certain technical quality of a service, indicating how the service is performing.

The most used types of SLIs, also known as the four Golden Signals, include:

Latency: The time it takes to service a request
Error rate: The rate of requests that fail, either explicitly, implicitly, or by policy
Throughput: The number of requests or transactions that a system or service handles in a given time period
Saturation: How much of the system's resource capacity is in use, that is, how "full" the service is

Why does it matter?

Putting SLIs into the context of our webshop, let’s zoom in on the product search and recommendation services:

A good quality indicator for the simple product search service may be latency. This tells us how long it takes for our service to receive the query, run it, and display the results of a simple product search.
For the recommendation service, a good quality indicator may be the error rate. This tells us how often our service returns with errors or does not show recommendations.

In both cases, the lower the quality of the service (higher latency or higher error rate), the less likely it is that customers will find what they are looking for. Additionally, slow product searches or missing recommendations impact the overall user experience and ultimately jeopardize the organization's ability to meet its KPIs (user engagement, repeat visitors, overall revenue).
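To make the two indicators concrete, here is a minimal Python sketch of how they could be computed from a batch of request records. The `Request` class and its field names are hypothetical, and the nearest-rank percentile is only one of several common percentile definitions; treat this as an illustration rather than a monitoring implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One service call. Field names are assumptions made for this sketch."""
    latency_seconds: float
    succeeded: bool


def error_rate(requests):
    """Fraction of requests that failed (0.0 when there was no traffic)."""
    if not requests:
        return 0.0
    failures = sum(1 for r in requests if not r.succeeded)
    return failures / len(requests)


def latency_percentile(requests, percentile):
    """Nearest-rank latency percentile, e.g. percentile=98 for the p98."""
    if not requests:
        raise ValueError("no requests to measure")
    ordered = sorted(r.latency_seconds for r in requests)
    # Nearest-rank method: the smallest value with at least
    # `percentile` percent of the observations at or below it.
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]
```

For the product search service we would look at `latency_percentile(...)`, and for the recommendation service at `error_rate(...)`, each computed over the calls to that service.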

Service Level Objectives

What is it?

Service Level Objectives (SLOs) define the desired quality of service over a given period of time. For every SLI, there is a corresponding SLO. Together, SLOs give you a bird’s eye view of the performance of your system.

Why does it matter?

Using the aforementioned SLIs for product search and recommendations, we can create the following SLOs:

For simple product searches, we want the service to respond within 1 second for at least 98% of the searches submitted in a rolling window of 60 minutes.
For recommendations, we want the service to successfully generate recommendations for at least 97 out of every 100 service calls.

While the SLI type is generally referred to as error rate, it is more convenient to express the corresponding SLO as a success rate (100% - error rate).
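As a rough illustration, both SLOs reduce to the same check: the share of "good" events must be at or above a target. The sketch below uses hypothetical function names and treats a period without traffic as compliant, which is an assumption rather than a rule.

```python
def success_rate(error_rate):
    """Convert an error rate (0.0 to 1.0) into a success rate."""
    return 1.0 - error_rate


def meets_slo(good_events, total_events, target):
    """Return True when the share of good events is at or above the target.

    Assumption for this sketch: with no traffic there is nothing to
    breach, so the SLO counts as met.
    """
    if total_events == 0:
        return True
    return good_events / total_events >= target


# Search SLO: at least 98% of searches answered within 1 second.
search_ok = meets_slo(good_events=985, total_events=1000, target=0.98)

# Recommendation SLO: at least 97 successful calls out of every 100.
recommendations_ok = meets_slo(good_events=96, total_events=100, target=0.97)
```

With these made-up numbers, the search service meets its SLO (98.5% of searches were fast enough), while the recommendation service does not (96 successful calls out of 100 is below the target of 97).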

As a rule of thumb, a handful (3-5) of SLIs and SLOs should be defined to measure service quality at critical points of the customer journey.

But how do we know if these desired levels of service quality, as defined by SLIs and SLOs, are good enough?

Defining Meaningful SLIs and SLOs

Like everything else, the key to tracking performance is making sure the metrics are thought-out, well-defined, and relevant.

Well-defined SLIs should be good indicators of service quality performance, tracking deterioration and improvement quickly. Consequently, meaningful SLOs should be early warning signals of potential business impact.

In the case of the product search example we explored above, a search response time of around 1 second is a reasonable expectation of acceptable quality of service. Teams should be notified as soon as fewer than 98% of the searches are answered within 1 second. This way, the team can address the potential issue before it becomes so severe that searches take several seconds or return no results at all.
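The notification rule for the search SLO can be sketched in a few lines of Python. The function name, the `(timestamp, latency_seconds)` sample format, and the choice to stay silent when there is no traffic are all assumptions made for illustration; a real setup would usually express this rule in the monitoring tool already in use.

```python
from datetime import datetime, timedelta


def search_slo_breached(samples, now,
                        window=timedelta(minutes=60),
                        threshold_seconds=1.0,
                        target=0.98):
    """Return True when too few recent searches were answered in time.

    `samples` is an iterable of (timestamp, latency_seconds) pairs.
    Only samples inside the rolling window that ends at `now` count.
    """
    recent = [latency for ts, latency in samples
              if now - window < ts <= now]
    if not recent:
        return False  # assumption: no traffic, nothing to alert on
    fast = sum(1 for latency in recent if latency <= threshold_seconds)
    return fast / len(recent) < target
```

Calling this function on a schedule, for example once a minute, and notifying the team whenever it returns True would implement the early warning described above.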

Concerning the rolling window of 60 minutes, we want to know how response times have evolved during that period. This way, we can observe, recognize, and analyze trends in service quality. With this line of reasoning, we may also consider the introduction of another SLI for product search (this time throughput-related) – correlating how response times change in relation to traffic. In this case, if there is a correlation between throughput and latency, increasing throughput may serve as an early warning signal of potentially increasing response times for product searches.
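One simple way to check whether throughput and latency move together is a Pearson correlation over per-minute measurements. The sketch below is a plain-Python illustration with hypothetical input series; it measures linear association only, so a value close to 1 suggests, but does not prove, that rising traffic drives response times up.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical per-minute measurements: searches per minute and the
# latency (in seconds) observed in that same minute.
searches_per_minute = [100, 200, 300, 400]
latency_seconds = [0.2, 0.4, 0.6, 0.8]
correlation = pearson_correlation(searches_per_minute, latency_seconds)
```

In this made-up data, latency grows in step with traffic, so the coefficient comes out very close to 1.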

A common pitfall when defining SLIs and SLOs is that organizations start by looking at a long list of existing metrics and trying to make something relevant out of them. This can be time-consuming and introduce selection bias, relying on metrics already present instead of ones that may be better suited to measure service quality.

Instead, defining relevant SLIs and meaningful SLOs should start from our understanding of how IT systems support and deliver business value and customer experience.

Aligning business and IT performance

Traditionally, aligning business and expected IT performance is done by defining non-functional requirements. Non-functional requirements refer to the aspects of a system that are not related to its specific functionality but rather to its overall performance, security, reliability, and other characteristics. Defined in isolation, such requirements can result in either over-engineered systems or systems that technically meet the non-functional requirements yet don’t deliver the desired quality of service from an end-user perspective.

By building on our understanding of how IT systems support and deliver business value in conjunction with meaningful SLIs and SLOs, we can create an aligned measure of business and IT performance. Through these SRE practices, we aim to identify and understand where system reliability and resiliency are critical in delivering and securing business objectives.

As a rule, SLOs should be proxies for meeting Business Objectives. If the quality of the service is above the SLOs, IT performance is unlikely to be what keeps the business from meeting its Business Objectives. However, if the quality of the service is below the desired level set by the SLOs, it may put the Business Objectives at risk.

In the context of our webshop example, the Business Objective can be to increase revenue by 10% by generating additional traffic to the webshop. This could be achieved through, for example, a marketing campaign. If the campaign works, the systems supporting the webshop (such as the front end, search, and checkout) will need to deal with increased traffic, potentially requiring increased performance. If the webshop becomes unresponsive, search times out, or the checkout service stops responding, the unreliability of the system and its components may prevent the business objective from being realized.

While this example may be clear and directly linked to revenue, the same logic can and should be applied to any scenario, giving us an understanding of how our IT systems support the delivery of business value.

Suppose the business objective is to increase revenue through cross-selling of related products. In that case, the SLOs defined for the recommendation service will serve as proxies and early warning signals. If the system cannot generate or display recommendations, realizing the targeted revenue increase through cross-selling may be difficult.

This understanding among business leaders can help remove the disconnect between IT and Business objectives, ensuring an aligned vision for the system and overall better performance.

Interested in learning more about site reliability engineering? Check our Site Reliability Engineering service page.
