Keeping Our Sites Up and Running with SRE

Learn how Dell IT’s Site Reliability Engineering effort is helping IT organizations keep sites available and scalable across our ecosystem.

It is a high-tech reality: the more nimble, fast-changing and segmented into microservices IT operations become, the more challenging it is to keep up with what might go wrong and fix it with the same agility and speed. Enter Site Reliability Engineering (SRE).

Dell Digital, Dell’s IT organization, has developed an SRE initiative that provides IT organization owners with a bird’s eye view of their IT ecosystems, constant feedback on anything that goes wrong, and self-healing capabilities to fix it when it does.

By definition, SRE is the practice of applying a software engineering approach to IT operations. SRE engineers create solutions and automation capabilities to make sure our IT platforms and services are reliable, scalable and available to customers when they need them.

SRE at Dell began as an effort to reduce downtime in our eCommerce environment. Over time, it has expanded to work with a growing number of IT organizations to improve product reliability and increase maintenance efficiency and analytic capabilities.

Creating a bird’s eye view

We began our efforts in eCommerce in part because that organization was already cultivating some reliability practices that it called site health. When IT leaders called for a way to reduce downtime across IT environments, eCommerce emerged as a good place to start. We created a small team of SRE engineers to forge a more comprehensive reliability strategy.

One of the first steps towards engineering site reliability was building observability into our products. As part of an SRE pilot, we built a dashboard with monitoring, search and report capabilities that shows us at a glance, in red, yellow and green, what needs attention first. This end-to-end, bird’s eye view of the eCommerce experience covers not just the customer-facing applications but the backend services as well.

Building this overview took time. We chose a third-party data platform tool that organizes monitoring around key performance indicators (KPIs). We worked with product owners to look at services and how they talk to different application components, and we built KPIs for each capability, viewable via a single pane of glass.
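To make the idea of per-capability KPIs feeding a red, yellow and green view more concrete, here is a minimal sketch in Python. It is not the third-party tool we use, and every name in it (`Kpi`, `rollup`, `dashboard`, the thresholds and the sample capabilities) is a hypothetical chosen for illustration. It assumes each KPI is a "higher is worse" number with a warning and a critical threshold, and that a capability is only as healthy as its worst KPI.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    # Ordered so that max() picks the most severe status.
    GREEN = 0
    YELLOW = 1
    RED = 2


@dataclass(frozen=True)
class Kpi:
    """Hypothetical KPI where a higher value means worse health."""
    name: str
    value: float
    warn_at: float  # at or above this value the KPI is YELLOW
    crit_at: float  # at or above this value the KPI is RED

    def status(self) -> Status:
        if self.value >= self.crit_at:
            return Status.RED
        if self.value >= self.warn_at:
            return Status.YELLOW
        return Status.GREEN


def rollup(kpis: list[Kpi]) -> Status:
    """Assumed rule: a capability takes the status of its worst KPI."""
    return max((k.status() for k in kpis), default=Status.GREEN)


def dashboard(capabilities: dict[str, list[Kpi]]) -> dict[str, str]:
    """Map each capability name to the color shown on the single pane."""
    return {name: rollup(kpis).name for name, kpis in capabilities.items()}


if __name__ == "__main__":
    view = dashboard({
        "search": [Kpi("error_rate_pct", 0.2, warn_at=1.0, crit_at=5.0)],
        "checkout": [
            Kpi("error_rate_pct", 0.2, warn_at=1.0, crit_at=5.0),
            Kpi("p95_latency_ms", 900, warn_at=800, crit_at=1500),
        ],
    })
    for capability, color in view.items():
        print(f"{capability}: {color}")
```

The worst-KPI rule is one simple design choice among several; a real dashboard could just as well weight KPIs by business impact.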

This is a far cry from our old way of working. To understand what I mean, picture a customer shopping for a laptop. They search the site, find the products they need and configure them to see product details, pricing, tax calculation, shipping and so on. All this makes up the experience that we show to our customers on one side. On the other side is the view of how it is all being done in the backend of the system.

Previously, we were operating in a box for each one of these services. So, Dell Digital product model owners could tell you, “Yes, my service is working fine,” but they couldn’t tell you that the experience was working fine end to end.

And when a problem arose, we had no way of spotting it in real time. In fact, most often, we would get a call or message from the business saying something was out of whack—say pricing or missing products. Our traditional engineering operational team would then have to check all the various places where the problem could be and would likely hand it off to a second team to fix it. And by this time, it would probably have self-solved, but our customers would have had a negative experience.

Using SRE, however, we are now constantly going through our environment and checking these things in real time. We make sure the pricing is right from the front end, to the configuration, to the cart, to the checkout. And if it’s not, we notify the product model owner and automatically fix the issue if possible.
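As a rough illustration of this kind of end-to-end check, the sketch below compares the price seen at each stage of the journey, notifies an owner on a mismatch and attempts an automated fix when one is supplied. It is a simplified sketch under stated assumptions, not our production orchestration: the stage names, the choice of the front-end price as the reference, and the `notify` and `auto_fix` callbacks are all hypothetical.

```python
from __future__ import annotations

from decimal import Decimal
from typing import Callable, Mapping, Optional

# Hypothetical stage names, in the order a customer moves through them.
STAGES = ("front_end", "configuration", "cart", "checkout")


def find_price_mismatches(prices: Mapping[str, Decimal]) -> list[str]:
    """Return the stages whose price differs from the front-end price.

    Assumption: the front-end price is the reference. A stage with no
    observed price is also reported as a mismatch.
    """
    reference = prices["front_end"]
    return [stage for stage in STAGES[1:] if prices.get(stage) != reference]


def check_and_heal(
    prices: Mapping[str, Decimal],
    notify: Callable[[str], None],
    auto_fix: Optional[Callable[[list[str]], bool]] = None,
) -> str:
    """Run one check: notify on a mismatch, then try an automated fix."""
    mismatched = find_price_mismatches(prices)
    if not mismatched:
        return "ok"
    notify(f"price mismatch at: {', '.join(mismatched)}")
    if auto_fix is not None and auto_fix(mismatched):
        return "healed"
    return "needs_attention"


if __name__ == "__main__":
    observed = {
        "front_end": Decimal("999.99"),
        "configuration": Decimal("999.99"),
        "cart": Decimal("0.00"),  # simulated pricing fault
        "checkout": Decimal("999.99"),
    }
    print(check_and_heal(observed, notify=print))
```

In practice a check like this would run continuously against live or synthetic sessions; the single function call here only shows the decision logic.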

After our successful pilot, Dell Digital decided to create a center of excellence around our core SRE products and the practices to expand these capabilities to multiple IT organizations. 

Making it happen with data and SRE input

Besides building an overview of product operations, SRE requires gathering accurate, actionable data on the product and then tapping the right expertise to create an orchestration process to automate monitoring and response to operation issues across our products at scale.

Basically, SRE needs data plus insights from subject matter experts (SMEs) who know what the data is supposed to be and can help us determine how to respond to problems. Based on that, our SRE engineers write orchestration to put together alerts and automated fixes where possible. The cycle of success for SRE is observability to orchestration to automation/auto-heal.

If something is not working, we need the SMEs to help us to understand the significance of logs and systems information and display the appropriate warning on our dashboard. For instance, a certain payment option might be down, but if it is one that is seldom used by our customers, we may just issue a notice to the product owner rather than showing a yellow or red status flagging a major incident. Trouble in a main payment system will immediately trigger an early warning at a critical level.
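The payment example can be sketched as a small triage rule. This is only an illustration of how SME knowledge might be encoded: it assumes the SMEs supply each payment option's share of transactions, and the function name, the `Severity` levels and the 5 percent and 25 percent thresholds are hypothetical values, not figures from our environment.

```python
from enum import Enum


class Severity(Enum):
    NOTICE = "notice"  # inform the product owner only
    YELLOW = "yellow"  # degraded status on the dashboard
    RED = "red"        # critical early warning


def triage_payment_outage(
    usage_share: float,
    yellow_at: float = 0.05,  # hypothetical threshold
    red_at: float = 0.25,     # hypothetical threshold
) -> Severity:
    """Pick a severity from the share of transactions using the option.

    usage_share is a fraction between 0.0 and 1.0, assumed to come from
    SMEs or from historical transaction data.
    """
    if not 0.0 <= usage_share <= 1.0:
        raise ValueError("usage_share must be between 0.0 and 1.0")
    if usage_share >= red_at:
        return Severity.RED
    if usage_share >= yellow_at:
        return Severity.YELLOW
    return Severity.NOTICE


if __name__ == "__main__":
    # A seldom-used option versus a main payment system.
    print(triage_payment_outage(0.01).value)
    print(triage_payment_outage(0.60).value)
```

Keeping the thresholds as parameters lets SMEs tune them per product without touching the orchestration around them.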

With SRE automation, we don’t just notify product teams that something is wrong. We also give them the pertinent information, including which customers are impacted, along with a channel to communicate and resolve the issue.

As we collect comprehensive data to build SRE capabilities for a given IT organization, we strive to supply a working model early in the process. Our SRE designer first consults SMEs to build a storyboard. We fill in data where we have it and work to add more going forward. In the meantime, the dashboard can benefit the product teams right away and expand in scale as we get more data.

Providing product teams with real-time insights

In addition to observability and orchestration, our SRE products also include data analytics, mobile access to SRE capabilities, the ability to track specific customer interactions and a two-way chat feature that we recently launched called SRE Assistant.

For data analytics, we make SRE data available to product owners, business users and operations teams upon request. They no longer need to navigate multiple systems to obtain a specific metric.

For example, if I want to know what my customer satisfaction (CSAT) score is today, I can immediately obtain it. I can also get detailed customer feedback, as well as a play-by-play view of what a particular customer did in a specific interaction session.

We have also built a maturity model to measure where teams should focus from an SRE perspective. The teams can build goals around specific success measurements and mature in both the product and practice of SRE.

We are currently working with three additional IT organizations to mature the capabilities of their SRE teams and are looking to bring in up to two more this year.

While it is too early in the process to tell how much SRE will increase reliability for these operations, we achieved significant improvement for eCommerce in our pilot program. In fact, for our most recent Black Friday sales event, eCommerce achieved the unprecedented milestone of 100 percent availability.

SRE isn’t for every organization. Some teams are too small or specialized to use this methodology. But for many organizations, the ability to see their operations in a single pane of glass, detect and fix problems in real time and head off future problems holds tremendous promise in achieving a fundamental goal—keeping their products and services up and running for the customers who rely on them.

Keep up with our Dell Digital strategies and more at Dell Technologies: Our Digital Transformation.  

Scott Mosqueda

About the Author: Scott Mosqueda

Scott Mosqueda was previously Senior Director for Site Reliability Engineering Enablement with Dell Technologies.