Hey there! My name is Daniil, and today, I want to share some thoughts on reliability. I’m sure all of you know what reliability means, but I’d like to emphasize one particular angle – reliability from the end users’ perspective. I’m also certain many of you have faced the challenge of prioritizing between new features driven by business needs and technical excellence driven by engineers. So, can SLOs help us tackle both of these challenges?
Before we dive in, let me share a bit about myself and why I’m so passionate about this topic. I’m currently an Engineering Manager at Teya, a fintech startup whose mission is to empower small businesses across Europe with the best financial platform. I support backend teams on the Acquiring side of the business, where, as you might expect, reliability is crucial. Prior to Teya, I worked at Meta (formerly Facebook), where I also supported backend teams as we prepared to launch a new e-commerce platform across the Family of Apps. Ensuring everything was up to the highest standard for the launch was a key focus, and this is where my passion for reliability truly developed.
Now, let’s get back to the topic, and I’d like to start by covering the key terms.
SLOs are crucial for making data-driven decisions that balance technical excellence with product increments. Properly defined SLOs reflect user happiness; consistently violating them means you’re continuously disappointing users. This makes SLOs a powerful tool for prioritization, ensuring that critical issues are addressed before less impactful enhancements. Obviously, nobody needs a new feature on the thirty-fifth screen in the app if the most-used feature takes ages to run.
The beauty of properly defined SLOs is that they stop being purely technical terms and start speaking a language the business understands. Performance and reliability work is usually very difficult to prioritize because its business value isn’t directly visible. Yet in the past, just by having the right SLOs, I managed to reshuffle fully packed roadmaps multiple times to make room for exactly that kind of work.
Moreover, adopting an SLO framework across all teams in an organization ensures everyone is aligned. When all upstream and downstream dependencies use SLOs, it helps teams understand what commitments they can make, as a service can’t be better than its upstream dependencies. It also keeps teams accountable. If one service fails to meet its SLOs due to an external dependency, it still impacts the overall user experience, underscoring the interconnected nature of service reliability.
Okay, I hope by this point you’ve already decided that SLOs will be your next priority. Great decision! But where to start? What should you do? There are so many options for defining an SLO, but ironically, very few will actually do the job.
Basically, to get on board with SLOs, you need just three things (well, maybe four): a metric, a target value, a window, and sometimes a threshold. So, how do you pick them?
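To make those ingredients concrete, here is a minimal sketch of what an SLO definition could look like as a plain data structure. The class and field names are illustrative only, not taken from any particular SLO tool.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SLO:
    metric: str                           # the SLI you measure, e.g. "availability" or "latency"
    target: float                         # e.g. 0.999 means "three nines"
    window_days: int                      # rolling window the target applies to
    threshold_ms: Optional[float] = None  # the optional fourth piece, e.g. a latency cutoff

# "99% of checkout requests complete within 300 ms over a rolling 28-day window"
checkout_latency = SLO(metric="latency", target=0.99, window_days=28, threshold_ms=300.0)
print(checkout_latency)
```

Note that the threshold only matters for metrics like latency, where you first have to decide what counts as "fast enough" before you can count good events; for availability, the metric itself already splits events into good and bad.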
Choosing the right metrics to monitor is the first step in implementing SLOs. Prioritize metrics that:
Directly Reflect User Happiness: Metrics should have a clear, linear relationship with customer satisfaction. This is the cornerstone of making the whole exercise successful. What are the most important features the service provides for users? In the payments world, it’s the ability to make a transaction quickly and reliably. In the e-commerce world, it’s the ability to buy a product. And so on and so forth.
As you can see, the prerequisite for this step is to make sure you understand the business and its clients. What do they do, and what do they want? Once you know the answers to these questions, picking the right metric will be easy. You wouldn’t even consider using CPU usage, for example, as you will know that this metric by itself has almost zero impact on the user experience. However, latency for certain endpoints might have a very high impact.
There are a few common metrics that might be a good starting point depending on the type of service you have:
Request-driven Services: availability, latency, and quality of responses.
Data Processing Services: freshness, coverage, correctness, and throughput.
For request-driven services, availability and latency are usually a win-win approach if you’re just getting started. In most cases, we start with these, so I’d recommend you do the same. With data processing services, it’s a bit more complicated, as it really depends a lot on the nature of the service and its usage. So here, you’ll need to assess it yourself. What’s important is to not pick all of them from day one. Select one, a maximum of two, and start down that route. “The man on top of the mountain didn't fall there.” © Vince Lombardi. Start small, and then iterate.
Another interesting question is how to define “valid” requests and “success” for availability, for example. As a starting point, calculating the share of 5xx errors among all requests would be a good approach. But as you iterate, you’ll unpack many more interesting questions about the nature of the errors. Just please, don’t overcomplicate things from the beginning.
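As a sketch of that starting point, availability can be computed as the share of responses that did not fail with a 5xx status; the function below is illustrative, assuming you already have request counts from your metrics system.

```python
def availability(total_requests: int, server_errors_5xx: int) -> float:
    """Fraction of requests that did not fail with a 5xx status."""
    if total_requests == 0:
        return 1.0  # no traffic -> nothing violated the SLO
    return 1.0 - server_errors_5xx / total_requests

# e.g. 1,000,000 requests with 800 5xx responses
print(f"{availability(1_000_000, 800):.2%}")  # -> 99.92%
```

Everything interesting hides in the two input numbers: health checks, retries, and client errors (4xx) all raise questions about what is "valid" and what is a "success" — which is exactly why it pays to keep the first version this simple.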
The concept of reliability often revolves around the number of "9s" (e.g., 99.9%, 99.99%). However, aiming for 100% reliability is impractical and typically incorrect. As Ben Treynor Sloss, founder of Site Reliability Engineering (SRE) at Google, stated, "100% is the wrong reliability target for basically everything."
Instead, focus on setting realistic and achievable SLOs that align with user expectations and business goals. How do you find this ideal value? You won’t be surprised to hear the same advice as in the previous section—iterate! “Picking the wrong number is better than picking no number.” © SRE.Google.
Start by familiarizing yourself with the table that converts a specific number of "9s" into minutes of unreliability per day, week, or month. This will give you a practical understanding of what’s possible and what’s not. For instance, anything below 10 minutes of downtime is usually achievable only with automation and no human intervention.
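If you don’t have that table at hand, it’s easy to derive: the error budget is simply one minus the target, multiplied by the length of the period. A small sketch (using a 30-day month):

```python
# Minutes in each period of interest
MINUTES = {"day": 24 * 60, "week": 7 * 24 * 60, "month": 30 * 24 * 60}

for target in (0.99, 0.999, 0.9999):
    budget = 1.0 - target  # fraction of time allowed to be unreliable
    row = ", ".join(f"{period}: {mins * budget:.1f} min" for period, mins in MINUTES.items())
    print(f"{target:.2%} -> {row}")

# 99.00% -> day: 14.4 min, week: 100.8 min, month: 432.0 min
# 99.90% -> day: 1.4 min,  week: 10.1 min,  month: 43.2 min
# 99.99% -> day: 0.1 min,  week: 1.0 min,   month: 4.3 min
```

The jump from three to four nines makes the point from the text vividly: roughly four minutes of downtime per month leaves no room for a human to be paged, investigate, and react.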
If you have historical data, use it to define your initial target. If you don’t, begin with two or three nines, depending on your service's maturity. And remember to iterate as you gather more data and insights.
Also, keep in mind that adding another "9" often comes at a steep cost, with the price of additional reliability increasing almost exponentially. It’s always a good idea to assess whether those costs are justified based on the value it brings to the business.
In the realm of software engineering, striking the right balance between service reliability and rapid innovation is a challenging task. Service Level Objectives (SLOs) are essential for managing this tension. By defining and measuring SLOs, organizations can ensure that the push for fast development does not compromise the reliability of their services. SLOs provide a structured framework to maintain reliability while still prioritizing user satisfaction and allowing room for innovation.
Setting SLOs is an iterative process. It’s better to start with an imperfect target and refine it over time than to avoid setting a goal altogether. This approach allows teams to gather data, learn from experience, and make incremental improvements.
Identify Critical User Journeys: Focus on the most important user interactions with your service. Determine the key areas where reliability is crucial.
Define SLIs: Choose metrics that best represent the reliability of these critical journeys. Ensure they are measurable and meaningful.
Set SLOs: Establish realistic targets for the chosen SLIs, considering user expectations and business goals.
Monitor and Iterate: Continuously track SLIs and compare them against the SLOs. Use this data to make informed decisions and drive improvements.
Communicate and Align: Ensure all stakeholders understand the SLOs and their significance. Align the organization's efforts towards achieving these objectives.
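The “Monitor and Iterate” step above can be sketched as an error-budget check: compare the measured SLI against the SLO and see how much budget is left in the window. The function below is a hypothetical illustration, not a production burn-rate alert.

```python
def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is violated)."""
    allowed_bad = (1.0 - slo_target) * total_events  # the budget, in events
    if allowed_bad == 0:
        return 0.0  # a 100% target leaves no budget at all
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad

# 99.9% target, 1M requests in the window, 600 of them failed:
# 600 of the 1,000 allowed bad events are spent, so 40% of the budget remains
print(f"{error_budget_remaining(0.999, 999_400, 1_000_000):.0%}")  # -> 40%
```

A healthy remaining budget is a data-driven green light for shipping features; a nearly exhausted one is the signal to reshuffle the roadmap toward reliability work, which is exactly the prioritization conversation SLOs are meant to enable.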
Let’s build software that is reliable in the eyes of end users, not just one that shows green charts in our dashboards.