Blizzard Entertainment, the video game giant, places great importance on ensuring that its Warcraft franchise games are always available to players, including during off-peak hours. However, due to the complexity of the underlying system, constant maintenance is essential. Despite the recurring need for changes, updates, and bug fixes, Blizzard’s SRE teams work consistently to keep downtime to a minimum. Downtime is dangerous because it means customers are unable to access the game they have paid for, exposing the company to frustrated customers, reputational damage, and financial losses in the short and long term. SRE teams are the silent heroes responsible for ensuring gamers can play to their heart’s content, protecting them from the capriciousness of the system.
IT organizations always want their production environment to be up. Downtime costs money, reputation, and ultimately customers. However, software systems are inherently unpredictable and will always fail at some point, which is why so much emphasis is placed on SRE.
Site Reliability Engineering (SRE) is a discipline that combines software engineering and operations to enhance the reliability and performance of complex systems like web applications, cloud/network infrastructures, and databases. Its focus is on building and maintaining scalable, reliable, and efficient systems. However, its impact doesn’t end there. Since SRE plays a crucial role in reducing downtime, improving system resilience, and enabling faster detection and resolution of issues, it contributes to improved customer experience, increased user satisfaction, and higher revenue potential. Nevertheless, despite its impact on business performance, it is a topic that has largely remained confined to tech teams.
This needs to change.
An unreliable and underperforming system not only impacts the technical performance and output of the system but will likely impact the wider business performance. In this series of blogs, we’ll explore how we can align IT and tech teams to build a Reliability Strategy that benefits the entire organization.
Understand How Your IT Systems Support Your Business
There has historically been a disconnect between IT and business teams when it comes to the impact they have on each other. Despite SRE being a “technical topic”, it is crucial to understand how individual components and the overall system support and deliver business value.
It is therefore important to look beyond the technical practices of Site Reliability Engineering (SRE). Let’s take Observability, for instance. Observability is the technical practice of instrumenting systems to deliver the right information through logs, metrics, and traces. While the right visibility into a system’s performance is a cornerstone of successful SRE implementations, this “raw data” doesn’t deliver meaningful information in itself. It needs to be put into the right context: creating relevant measures and meaningful objectives for service quality.
This is the key to aligning SRE with business objectives: understanding how the system’s components contribute to the customer’s experience. Acknowledging the role each component plays in delivering this value is the first step to building a strong Reliability Strategy. We’ve laid out a structured way through which you can accomplish this.
Value Stream and Customer Journey Mapping
Let’s take an e-commerce website as our guinea pig.
In general, a web shop’s main strategy and objective is to generate revenue. This requires the web shop to build systems that let customers search for, order, and pay for products.
The web shop can only reach its business objectives if its systems and their components are designed and developed in alignment with the elements of the Value Stream. These systems not only need to work together to achieve the overall business objectives, but also need to work individually and independently to ensure a smooth experience for the website visitor.
With this Value Stream map, we can now look at the customer’s journey as they search, order, and pay for the products. This is likely what the journey would look like:
- First, customers may want to search for specific products (look at the top 10 products of a category, or see relevant recommendations)
- Second, they may want to select and put products in their basket and proceed to order by filling in shipping details as well as applying promotion codes.
- Lastly, they need to pay for the selected products.
If any of the components laid out in the Value Stream fails, the customer will not be able to perform the desired action, resulting in them not purchasing the product and ultimately lowering revenue for the company.
By using Value Stream and Customer Journey mapping, we can create a starting point for understanding how systems deliver business value and establish a shared understanding and language between Business and IT stakeholders.
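As a rough illustration of this point, the journey can be modelled as an ordered list of steps in which a failure at any step blocks everything after it. This is only a sketch: the step names follow the search, order, and pay elements above, while the function names are made up for the example.

```python
from typing import List, Set

# The three Value Stream elements from the example, in journey order.
JOURNEY: List[str] = ["search", "order", "pay"]


def completed_steps(failed: Set[str]) -> List[str]:
    """Return the steps a customer can finish before hitting a failed step."""
    done: List[str] = []
    for step in JOURNEY:
        if step in failed:
            break  # the customer is blocked here and cannot continue
        done.append(step)
    return done


def purchase_possible(failed: Set[str]) -> bool:
    """A purchase only happens if every step of the journey can be completed."""
    return completed_steps(failed) == JOURNEY
```

Calling `completed_steps({"order"})`, for instance, shows that the customer can still search but never reaches payment, which is exactly the lost revenue described above.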
Service Mapping
From this combined, non-technical overview of the Value Stream elements and the Customer Journey, we can now look at how systems and their components, such as services, are supporting and delivering business value. This is achieved through Service Mapping.
Let’s take the ‘Search’ element from our Value Stream: searching for a product, looking at the top 10 products of a category, and seeing relevant recommendations. In our example, each of these user actions is supported by a single, simple service, resulting in a one-to-one mapping.
However, showing relevant recommendations may be achieved by several services, such as: using past purchase history of the customer, using analytics to show products often bought together, or providing similar alternatives. When this is the case, all relevant (downstream) services need to be mapped out to create a comprehensive Service Map.
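A Service Map like this can be written down as plain data, which makes it easy to ask which user actions may be affected when a service fails. The sketch below is one possible representation, not a prescribed format: all service names are hypothetical, and it assumes that any failing downstream service affects the action that relies on it.

```python
from typing import Dict, List, Set

# User actions from the 'Search' element mapped to the service behind each one.
# All service names are hypothetical placeholders.
SERVICE_MAP: Dict[str, List[str]] = {
    "search for a product": ["product-search"],
    "view top 10 products of a category": ["top-products"],
    "see relevant recommendations": ["recommendations"],
}

# Downstream services that a service relies on (also hypothetical).
DOWNSTREAM: Dict[str, List[str]] = {
    "recommendations": [
        "purchase-history",
        "bought-together-analytics",
        "similar-alternatives",
    ],
}


def all_services(action: str) -> Set[str]:
    """Return every service an action relies on, including downstream ones."""
    seen: Set[str] = set()
    stack = list(SERVICE_MAP.get(action, []))
    while stack:
        service = stack.pop()
        if service in seen:
            continue
        seen.add(service)
        stack.extend(DOWNSTREAM.get(service, []))
    return seen


def affected_actions(failed_service: str) -> List[str]:
    """List the user actions that depend, directly or indirectly, on a service."""
    return sorted(a for a in SERVICE_MAP if failed_service in all_services(a))
```

For example, `affected_actions("purchase-history")` returns only the recommendations action, while product search and the top 10 list are untouched.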
Looking at our three services, we see that if none of them is available, users can’t find products and therefore won’t be able to order and subsequently pay for them.
However, these services may not be dependent on each other. An individual product search may not require a list of top 10 products per category, nor a personalized list of recommendations.
Individual services will still deliver business value, and measuring quality of individual services helps us to create an aligned view and measurement of Business and IT performance.
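One simple way to measure the quality of an individual service is to compare the share of successful requests against an agreed objective. The sketch below assumes availability as the measure; the class and function names are illustrative, and treating a service with no traffic as fully available is a choice made for the example.

```python
from dataclasses import dataclass


@dataclass
class ServiceStats:
    """Request counts for one service over a measurement window."""
    name: str
    total_requests: int
    successful_requests: int


def availability(stats: ServiceStats) -> float:
    """Fraction of requests that succeeded in the window."""
    if stats.total_requests == 0:
        return 1.0  # assumption: no traffic means nothing failed
    return stats.successful_requests / stats.total_requests


def meets_objective(stats: ServiceStats, objective: float) -> bool:
    """Check the measured availability against an agreed objective."""
    return availability(stats) >= objective


# Made-up numbers, purely to show the calculation.
search = ServiceStats("product-search", total_requests=1000, successful_requests=995)
print(availability(search), meets_objective(search, 0.99))
```

Because the objective is agreed per service, the same number means something to both sides: IT sees a measured availability, and the business sees whether customers could search when they wanted to.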
It is important to understand the relational impact of IT systems on the wider business and to align technical practices with business objectives. Value stream and customer journey mapping, as well as service mapping, provide insights into how systems deliver business value and facilitate a unified understanding between business and IT stakeholders. By measuring the quality of individual services, a comprehensive view of business and IT performance can be achieved. Overall, SRE is essential for ensuring the reliability, efficiency, and success of complex systems in today's digital landscape.
Interested in learning more about site reliability engineering? Check out our Site Reliability Engineering service page.