Demystifying Observability in the SRE Process

Dell IT shares a closer look at observability techniques that can help your organization improve site stability with SRE.

We live in a complicated world of interconnected IT systems and growing data in which customers demand flawless experiences and businesses strive to accelerate innovation. IT can no longer rely on traditional monitoring techniques to keep these modern systems up and running with the speed and agility of the market we serve. That is where observability comes in.

Observability is a mechanism that helps Site Reliability Engineering (SRE) teams understand and explain unexpected system behavior with the help of logs, traces and metrics. It helps IT proactively manage the performance of complex distributed systems running on evolving infrastructure.

The right observability strategy and solution translate into increased site reliability, better customer experience and higher team productivity. With the surge of data, we need to quickly separate signal from noise so we can aggregate it, analyze it and respond to it as needed. A key success metric for observability is the average time to find and resolve issues. Speed defines success in today’s digital economy.

Now, more than ever, learning to simplify complex systems is essential.

The only way to troubleshoot an unknown failure condition and optimize an application’s behavior is to instrument and collect all the data about your environment at full fidelity. However, the mere availability of data doesn’t deliver an observability solution.

While out-of-the-box solutions can give you a head start with observability, they tend to fall short of providing a complete solution for your unique needs.

Fortunately, a few observability techniques can help simplify complexity and lead to better clarity and success.

Brainstorm with Subject Matter Experts

Modern distributed architectures have numerous interdependencies, which means they also have many points of failure. A key component of resilient systems is being able to quickly pinpoint the exact location of a detected problem. That’s why, when building an SRE strategy, one of the first steps an SRE Enablement team takes is to work with subject matter experts who have an end-to-end view of their ecosystem.

Start by holding a brainstorming session with architects, engineering team leads, SREs, DevOps, on-call support teams, incident management and a user experience designer to create a bird’s-eye, end-to-end view of the organization’s ecosystem.

The session helps cut the clutter and identify high-level services that can be represented on a single screen and that depict the end-to-end flow of interconnected applications. This rough end-to-end flow is a living and breathing artifact that will evolve as the application ecosystem goes through transformation.

Set Up KPIs and Scoring 

Once you have a list of services to observe, identify the key performance indicators (KPIs) for each service. KPIs are derived from logs and metrics, which need to be collected from various sources.

After the data is ingested into the tool of your choice, look back at the historical behavior of the service (ideally four weeks) to determine optimal thresholds. Outline the “Good,” the “Bad” and the “Ugly.”
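One way to turn that history into thresholds is to take percentiles of the past samples. The following is a minimal sketch, not a description of any particular tool: the function names, the 90th and 99th percentile cut points, and the assumption that higher values are worse (as with latency) are all illustrative choices.

```python
from statistics import quantiles


def derive_thresholds(history, good_pct=90, bad_pct=99):
    """Derive (good_max, bad_max) cut points from historical KPI samples.

    Assumptions: higher values are worse (for example latency in ms), and
    the 90th/99th percentiles are illustrative defaults, not a rule.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    if not 1 <= good_pct < bad_pct <= 99:
        raise ValueError("expected 1 <= good_pct < bad_pct <= 99")
    # quantiles(..., n=100) returns the 99 percentile cut points.
    cuts = quantiles(history, n=100, method="inclusive")
    return cuts[good_pct - 1], cuts[bad_pct - 1]


def classify(value, good_max, bad_max):
    """Label a current reading as Good, Bad or Ugly."""
    if value <= good_max:
        return "Good"
    if value <= bad_max:
        return "Bad"
    return "Ugly"


# Hypothetical history: latency samples (ms) from the past four weeks.
history_ms = list(range(1, 101))
good_max, bad_max = derive_thresholds(history_ms)
print(classify(50, good_max, bad_max))
```

For KPIs where lower values are worse, such as success rate, the comparisons would need to be inverted.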

Depending on the domain, what entails a service can vary quite a bit, including web service, app, network, database, message queue, email and many more. Every service has different stakeholders, KPIs and criteria for measuring success and performance.

So how can you build an observability solution that is easy to understand for everybody despite various subject domains? That’s where scoring comes in.

Scoring is a mechanism ingrained in all of us. While we all had different subjects in school, they were generally graded on a scale of one to 100 percent. Everyone understands what 50 means and what 90 means, irrespective of the subject. Measuring the health or performance of a service should be treated no differently.

A common way to calculate a service health score is to identify the three most important KPIs within a service and assign each a weight, from most important to least important. Combine each KPI’s weight with the percentage by which that KPI is degraded to score the health of the service.
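The weighted calculation can be sketched as follows. This is one plausible reading of the approach, assuming the score is 100 minus the weighted average degradation; the KPI names, the 3/2/1 weights and the degradation figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kpi:
    name: str
    weight: float        # relative importance (higher = more important)
    degraded_pct: float  # 0 = healthy, 100 = fully degraded


def health_score(kpis):
    """Return a 0-100 health score: 100 minus the weighted degradation."""
    total_weight = sum(k.weight for k in kpis)
    if total_weight <= 0:
        raise ValueError("KPI weights must sum to a positive number")
    # Clamp each degradation to 0..100 so one bad reading cannot
    # push the score outside the familiar percentage scale.
    penalty = sum(
        k.weight * min(max(k.degraded_pct, 0.0), 100.0) for k in kpis
    ) / total_weight
    return 100.0 - penalty


# Hypothetical web service with three KPIs weighted 3, 2 and 1.
kpis = [
    Kpi("error_rate", weight=3, degraded_pct=10),
    Kpi("latency", weight=2, degraded_pct=0),
    Kpi("throughput", weight=1, degraded_pct=40),
]
print(round(health_score(kpis), 1))
```

Because the weights are normalized by their sum, only their ratios matter, so teams can use whatever weight scale is easiest to reason about.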

You can further simplify your service health score by equating score levels with another widely understood gauge: the traffic light signals of red, yellow and green.
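Mapping a score to a traffic light color takes only a few lines. In this sketch the cut-offs of 90 and 70 are assumptions for illustration; each organization would choose its own.

```python
def traffic_light(score, green_min=90.0, yellow_min=70.0):
    """Map a 0-100 health score to a traffic light color.

    The default cut-offs (90 and 70) are illustrative assumptions.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= green_min:
        return "green"
    if score >= yellow_min:
        return "yellow"
    return "red"


for score in (95, 75, 40):
    print(score, traffic_light(score))
```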

Standardizing the decision-making process using a scoring model means faster and more automated decisions.

Putting Everything Together 

Once an IT organization creates an architected design, KPIs and service health scores, an SRE team can then combine them into a diagram to create a single pane of glass via a ready-made observability tool or a custom-built solution. The single pane of glass is designed to be fully interactive and intuitive, so that anyone using it can drill down into problem areas.

The SRE teams or the engineering teams should build the drilldown views and maintain them to match the health scores depicted on the single pane of glass.

While the SRE dashboards provide continuous monitoring of ecosystems, the strategy doesn’t depend on watching them 24×7. Monitoring results are used in tandem with other data available to correlate and address performance events.

For instance, you may see a webpage degrading, database latency creeping up and a domain name system service degrading. Traditionally, that might trigger three separate alerts. But at Dell, our notification strategy, with the help of custom orchestration, generates one notification encompassing the cause and the effects.

The system avoids duplicate notifications about the same incident by centralizing datasets and designating only one tool for incident creation.
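The idea of folding related alerts into one notification can be illustrated with a small sketch. This is not Dell's orchestration, whose internals are not described here; it simply assumes a known dependency map and treats any degraded service with no degraded dependency as a likely cause. The service names and the dependency chain are hypothetical.

```python
def consolidate_alerts(alerts, depends_on):
    """Fold related alerts into a single cause-and-effects notification.

    alerts:     iterable of degraded service names (duplicates allowed)
    depends_on: hypothetical map of service -> services it relies on
    """
    degraded = set(alerts)  # duplicate alerts for one service collapse
    # A degraded service with no degraded dependency is a likely cause.
    causes = sorted(
        svc for svc in degraded
        if not degraded.intersection(depends_on.get(svc, ()))
    )
    if not causes:
        # Circular dependencies leave no clear root; report everything.
        causes = sorted(degraded)
    effects = sorted(degraded.difference(causes))
    return {"causes": causes, "effects": effects}


# Hypothetical chain: the webpage relies on the database, which relies on DNS.
depends_on = {"webpage": ["database"], "database": ["dns"]}
alerts = ["webpage", "database", "dns", "webpage"]
print(consolidate_alerts(alerts, depends_on))
```

Three raw alerts (plus one duplicate) become a single notification that names DNS as the likely cause and the database and webpage as effects.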

It is important to target observability notifications to the specific development teams impacted by the system issue and use the communication channels that best connect with those teams. While email was once the traditional communication channel for alerts, collaboration today might include MS Teams, Slack, SMS, Mobile Apps and WhatsApp, to name a few.

The observability strategy should include mapping microservices to development teams and establishing the communication channels for critical issue notification.
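Such a mapping can be as simple as a lookup table. The sketch below is illustrative only: the service names, team names, channels and the fallback route are all hypothetical.

```python
from typing import NamedTuple


class Route(NamedTuple):
    team: str
    channels: tuple


# Hypothetical mapping of microservices to owning teams and channels.
ROUTES = {
    "checkout-service": Route("payments-team", ("slack", "sms")),
    "search-service": Route("discovery-team", ("teams",)),
}

# Assumed fallback so that unmapped services still reach someone.
DEFAULT_ROUTE = Route("sre-on-call", ("email",))


def notification_targets(service, routes=ROUTES, default=DEFAULT_ROUTE):
    """Return (team, channel) pairs to notify for a degraded service."""
    route = routes.get(service, default)
    return [(route.team, channel) for channel in route.channels]


print(notification_targets("checkout-service"))
print(notification_targets("unknown-service"))
```

Keeping the mapping as data rather than logic makes it easy to review and update as teams and services change.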

Ultimately, the goal of observability is to enable multiple teams to act with shared data, connect people with processes and align with larger business objectives.

Keep up with our Dell Digital strategies and more at Dell Technologies: Our Digital Transformation.

About the Author: Tanuj Arcot

Tanuj has extensive experience in building highly reliable, noise-free observability solutions with a keen focus on self-healing automation and orchestration. His work has directly contributed to a 95% reduction in MTTD (mean time to detect issues), and his work on orchestration and automation has led to a 20% reduction in MTTR (mean time to resolve issues). He has owned the complete life cycle of observability, from strategy to architecture to delivery, and from concept to measurable business outcomes. As a leader of the SRE enablement practice, he has personally trained and transformed several software engineers into SREs. He has a track record of rapidly adapting SRE principles and observability solutions to changes in infrastructure topologies and ecosystems. Tanuj is an innovator and a member of the Technical Leadership Community (TLC). He is a technologist who serves as a reviewer for the patent committee and an active volunteer for patent drives.