Building an SRE Community One Team at a Time

See how Dell IT is creating a Site Reliability Engineering community to improve site operations and better serve customers.

Deploying Site Reliability Engineering (SRE) throughout an organization has its challenges, both cultural and technical. We previously discussed how bringing an SRE practice inside Dell Digital, Dell’s IT organization, has improved the reliability and scalability of our eCommerce platforms. Today, we’re sharing more on how we’ve created a centralized SRE Enablement program and set out on a broader mission to help organizations across IT deploy an SRE approach to improve their site operations.

Eighteen months into our enablement effort, we are helping five IT organizations create SRE teams and implement SRE solutions and automation capabilities. Several more organizations want to start or mature their SRE capabilities.

Scaling SRE From the Center Out

In the wake of our successful eCommerce pilot, our SRE Enablement team created a Center of Excellence (COE) to provide the basis for an IT-wide outreach effort via a series of roadshows. 

The COE documents SRE tools and best practices that help teams improve site reliability. These include setting up a real-time, end-to-end monitoring ecosystem on desktop and mobile devices, delivering intelligent proactive notifications, automating solutions for recurring issues, reducing operational effort and reducing the mean time to find and mean time to repair performance incidents.
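
As a rough illustration of the last two measures, the sketch below computes mean time to find and mean time to repair from a list of incident records. The record fields (started, detected, resolved) and the choice to measure repair time from detection are assumptions made for this example, not a description of the tooling Dell uses.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: when the issue began, when its source
# was found, and when it was fixed.
INCIDENTS = [
    {
        "started": datetime(2023, 1, 5, 10, 0),
        "detected": datetime(2023, 1, 5, 10, 30),
        "resolved": datetime(2023, 1, 5, 11, 30),
    },
    {
        "started": datetime(2023, 1, 9, 12, 0),
        "detected": datetime(2023, 1, 9, 12, 10),
        "resolved": datetime(2023, 1, 9, 12, 30),
    },
]


def mean_gap(incidents: list[dict], begin: str, end: str) -> timedelta:
    """Average the time between two timestamps across all incidents."""
    gaps = [incident[end] - incident[begin] for incident in incidents]
    return sum(gaps, timedelta()) / len(gaps)


# Assumption: "find" runs from start to detection, "repair" from
# detection to resolution. Other definitions are in common use.
mean_time_to_find = mean_gap(INCIDENTS, "started", "detected")
mean_time_to_repair = mean_gap(INCIDENTS, "detected", "resolved")
```

Tracking both figures separately shows whether a team's time is going into locating a problem or into fixing it.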

If your organization plans to pursue SRE enablement, I recommend creating a centralized place where you build products and formalize a practice that can be scaled with consistency. This will bring down the cost, time and effort needed to make reliability a reality.

As we have added IT organizations to our SRE enablement effort, we have expanded our core team of SRE engineers in the COE and now have 35 team members overseeing our SRE products and processes.

Sizing Up SRE Maturity 

In each case, the first step to helping a participating organization adopt an SRE strategy is to assess their SRE maturity. We first ask teams to assess themselves in terms of SRE work they may be doing. We then evaluate them based on a maturity assessment model that measures SRE fundamentals, including current operation monitoring capabilities, their track record in addressing issues, service level objectives and current roles and responsibilities.

The maturity assessment generates a score that helps teams set priorities and define the cultural shift they need to make, away from the traditional ticketing approach to site reliability and toward an engineering mindset.
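
One simple way to picture such a score is a weighted average over the fundamentals being assessed. The sketch below is illustrative only: the dimension names echo the fundamentals mentioned above, while the weights, the 0 to 5 rating scale and the function names are assumptions rather than the actual assessment model.

```python
# Hypothetical weights for each assessed fundamental (they sum to 1.0).
WEIGHTS = {
    "monitoring": 0.3,
    "issue_resolution": 0.3,
    "service_level_objectives": 0.2,
    "roles_and_responsibilities": 0.2,
}
MAX_RATING = 5  # assumed rating scale: 0 (absent) to 5 (mature)


def maturity_score(ratings: dict[str, int]) -> float:
    """Return a weighted maturity score on a 0-100 scale."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    total = sum(WEIGHTS[name] * ratings[name] / MAX_RATING for name in WEIGHTS)
    return round(total * 100, 1)


def priorities(ratings: dict[str, int], count: int = 2) -> list[str]:
    """Return the lowest-rated fundamentals, which a team might tackle first."""
    return sorted(WEIGHTS, key=lambda name: ratings[name])[:count]


example = {
    "monitoring": 2,
    "issue_resolution": 3,
    "service_level_objectives": 1,
    "roles_and_responsibilities": 4,
}
score = maturity_score(example)
focus = priorities(example)
```

The lowest-rated fundamentals give a team a natural starting list of priorities, while the overall score offers a baseline to measure progress against.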

Once we have a maturity score, we help participating organizations create their own SRE team and develop the needed skills. In some cases, we work with the organization’s own engineers, who may come from varied backgrounds including software, architecture and networking. One organization we worked with, for example, started its SRE team with five engineers from differing backgrounds. Another had only one engineer, and we helped the leader create a brand-new team.

We offer training to team members as well as assist with hiring SRE engineers. Team size varies with the organization’s needs. In most cases for us, it is eight to 12 team members per major ecosystem.

Helping Organizations Build Capabilities

With an organization’s SRE team and priorities in place, we work with each team to build the first component of SRE capabilities—observability. This is where the organization creates an end-to-end, bird’s eye view of their IT ecosystems.

It starts with defining and gathering the needed data. Insights from subject matter experts (SMEs) are crucial here to determine how the data should look, which performance changes are significant enough to monitor and what the key performance indicators (KPIs) should be.
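
To make this concrete, a KPI definition can be thought of as a metric name plus the thresholds that experts consider significant. The sketch below assumes a simple "higher is worse" metric; the Kpi class, the metric names and the threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kpi:
    """A metric with thresholds agreed with subject matter experts."""
    name: str
    warn_above: float
    critical_above: float


def classify(kpi: Kpi, value: float) -> str:
    """Label a reading as ok, warning or critical."""
    if value >= kpi.critical_above:
        return "critical"
    if value >= kpi.warn_above:
        return "warning"
    return "ok"


def flag(kpis: list[Kpi], readings: dict[str, float]) -> dict[str, str]:
    """Return only the KPIs whose current reading needs attention."""
    flagged = {}
    for kpi in kpis:
        status = classify(kpi, readings[kpi.name])
        if status != "ok":
            flagged[kpi.name] = status
    return flagged


# Hypothetical KPIs and readings for illustration.
KPIS = [
    Kpi("link_latency_ms", warn_above=50, critical_above=200),
    Kpi("packet_loss_pct", warn_above=1, critical_above=5),
]
result = flag(KPIS, {"link_latency_ms": 75, "packet_loss_pct": 0.2})
```

Keeping the thresholds in data rather than in code lets experts adjust what counts as significant without touching the monitoring logic.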

In Networking, for example, SMEs were invaluable to defining performance issues across the complex network footprint of 26 data centers. While that team is still in the process of creating observability for all its data centers, the initial work provides a template to ease that process.

We put the compiled data into a third-party data platform tool to create a dashboard that allows each organization to continuously monitor capabilities and flag performance concerns via a single pane of glass.

Achieving observability is a major first step for our organizations’ SRE teams, and the benefits are clear. In Networking, for instance, the support team previously had to spend hours searching through the array of spines, routers and firewalls that make up a data center’s ecosystem to find the source of an issue; with SRE, the team can now pinpoint the problem in a matter of minutes via a single pane of glass.

After working with SRE enablement this past year, the Service and Connectivity teams have honed their observability processes and are about to begin the next SRE step, the orchestration phase. That’s where systems are put in place to notify the right sets of people, including product and operation teams, when an incident occurs.
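
The core of that orchestration step is a routing rule that maps an incident to the groups that should hear about it. The sketch below is a minimal illustration; the ecosystem names, team names and severity rule are hypothetical and do not reflect any actual configuration.

```python
# Hypothetical routing table: which teams are notified per ecosystem.
ROUTES = {
    "networking": ["network-sre", "network-operations"],
    "ecommerce": ["ecommerce-sre", "ecommerce-product"],
}
DEFAULT_ROUTE = ["sre-enablement"]  # assumed fallback for unmapped ecosystems


def recipients(incident: dict) -> list[str]:
    """Return the teams to notify for an incident."""
    # Copy the list so the routing table itself is never modified.
    teams = list(ROUTES.get(incident["ecosystem"], DEFAULT_ROUTE))
    # Assumption: critical incidents also page a manager on call.
    if incident.get("severity") == "critical":
        teams.append("on-call-manager")
    return teams


notified = recipients({"ecosystem": "networking", "severity": "critical"})
```

A fallback route ensures that an incident from an unmapped ecosystem still reaches someone instead of being dropped silently.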

From there, SRE teams will add automation and self-healing capabilities where possible—the final phase of SRE fundamentals.
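
In its simplest form, self-healing pairs each known recurring issue with an automated fix and hands everything else to a person. The sketch below illustrates that pattern under stated assumptions: the issue names and remediation functions are placeholders, and a real remediation would call the platform's own tooling rather than return a string.

```python
from typing import Callable


def restart_service(incident: dict) -> str:
    # Placeholder for a real restart performed by platform tooling.
    return f"restarted {incident['target']}"


def clear_cache(incident: dict) -> str:
    # Placeholder for a real cache purge.
    return f"cleared cache on {incident['target']}"


# Hypothetical mapping from recurring issue types to automated fixes.
REMEDIATIONS: dict[str, Callable[[dict], str]] = {
    "service_unresponsive": restart_service,
    "cache_full": clear_cache,
}


def handle(incident: dict) -> str:
    """Apply a known fix if one exists; otherwise escalate to a person."""
    action = REMEDIATIONS.get(incident["issue"])
    if action is None:
        return f"escalated {incident['issue']} to on-call engineer"
    return action(incident)


outcome = handle({"issue": "cache_full", "target": "web-01"})
```

Escalating anything without a registered fix keeps automation limited to issues the team already understands well.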

With every step of the SRE enablement process, our organizations gain efficiency and are better equipped to keep their applications up and running for their customers. As customer needs keep changing, the SRE teams they put in place will continue to evolve their processes and provide support.

The Future of SRE 

Our efforts to scale SRE across Dell IT are ongoing, and SRE Enablement continues to evolve its approach as we expand our practice. For example, we are in the process of launching an SRE Community of Practice where current and prospective SRE users can delve into SRE processes and tools and drive conversations and collaboration. The site will feature insights from participating teams about their experiences, the benefits of creating an SRE team and where they are in their SRE journeys.

Keep up with our Dell Digital strategies and more at Dell Technologies: Our Digital Transformation.

About the Author: Scott Mosqueda

Scott Mosqueda was previously Senior Director for Site Reliability Engineering Enablement with Dell Technologies.