Dell is Democratizing Data with SRE

How Dell IT is evangelizing SRE data in real time to drive team member collaboration.

While Site Reliability Engineering (SRE) begins with gathering data from across IT organizations to create a bird’s eye view of ecosystems so we can monitor, fix and prevent system issues, the value of that aggregated data doesn’t stop there. Sharing observability data beyond SRE engineers to teams across those organizations not only increases transparency, but it also taps the potential for new improvements and innovation.

That’s why Dell Digital’s Site Reliability Engineering Enablement team is making data available to product owners, business users and operations teams across Dell IT via a two-way chat tool we call the SRE Assistant.

Beyond using chat to give team members easily understandable insights into ecosystem issues, the SRE Assistant is evolving to give users access to data beyond systems health, from sales numbers to customer satisfaction, with a simple query.

It’s all part of democratizing data so team members across Dell IT’s organizations, regardless of their technical know-how, can work with data comfortably, feel confident talking about it and, as a result, make data-informed decisions and build customer experiences powered by data.

Democratizing System Alert Data

Dell Digital, Dell’s IT organization, began piloting an SRE strategy two years ago to reduce downtime in our eCommerce environment. We expanded this effort to create an SRE initiative to help organizations across IT use SRE practices to improve their product reliability and increase maintenance efficiency.

Part of the SRE process is sending system alert notifications to the specific development teams affected by an issue, using those teams’ preferred communication channels. As we created these incident alert communication channels, we realized it made sense to make them two-way, so users could both receive alerts and ask questions. To that end, we built a chatbot using our collaboration tool’s framework, channeling alerts to specific team members who could, in turn, seek further details via chat.
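As a rough illustration of that routing idea, here is a minimal Python sketch. The team names, channel identifiers and the route_alert function are all hypothetical; they stand in for whatever the collaboration tool and alerting pipeline actually expose, which this article does not describe.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Alert:
    team: str      # team affected by the issue
    summary: str   # short, plain-language description


# Hypothetical mapping of teams to their preferred channels.
PREFERRED_CHANNELS = {
    "network": "chat:#network-oncall",
    "ecommerce": "chat:#ecommerce-alerts",
}
# Assumed fallback for teams with no registered preference.
DEFAULT_CHANNEL = "chat:#sre-general"


def route_alert(alert: Alert) -> Tuple[str, str]:
    """Return the (channel, message) pair an alert would be sent to."""
    channel = PREFERRED_CHANNELS.get(alert.team, DEFAULT_CHANNEL)
    return channel, f"[{alert.team}] {alert.summary}"
```

A real bot would hand the returned pair to the chat platform's posting API and listen on the same channel for follow-up questions, which is what makes the channel two-way.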

An important goal of our alert notification strategy was standardizing and simplifying the alert data we send when issues arise.

Our SRE observability tool aggregates a wide range of data to provide a bird’s eye view and determine solutions. Our data might come from a network device, a storage device, an application database, or applications for eCommerce and order, inventory and incident management. Alerts involve different stakeholders, different KPIs and different thresholds that determine when something is in breach.

To help team members at all levels of technical knowledge easily interpret issue alerts, we consolidate multiple dashboard metrics into a single percentage score, from 0% to 100%, and color-code each incident red, yellow or green based on that score.
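One way to picture this kind of scoring is the short Python sketch below. It is an assumption-laden illustration, not our production logic: the simple average, the convention that a higher score means a healthier system, and the 90 and 70 cutoffs are all made up for the example.

```python
from statistics import mean
from typing import Dict


def health_score(metrics: Dict[str, float]) -> float:
    """Consolidate several 0-100 metric scores into one score.

    A plain average is an assumption; a real system might weight metrics.
    """
    if not metrics:
        raise ValueError("at least one metric is required")
    return mean(metrics.values())


def color_for(score: float) -> str:
    """Map a 0-100 score to a color using hypothetical thresholds."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:   # assumed cutoff for healthy
        return "green"
    if score >= 70:   # assumed cutoff for degraded
        return "yellow"
    return "red"
```

For instance, three hypothetical metrics scoring 100, 80 and 60 average to 80, which these illustrative thresholds would label yellow.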

This is a key step in democratizing data. For example, let’s say there’s a network outage in a data center. Traditionally, only the network team is notified about it right away. Other team members might come to know about it later in the process and may not be comfortable with technical details of the event.

When we break alert data down in a way that everyone understands, it doesn’t matter whether alert recipients know the subject or not. Someone may not know anything about networks or databases, but the color or grade of the alert offers a basic and quick understanding of the problem.

Simplifying Data Requests

With more team members accessing and understanding system health data via our chatbot, we had another inspiration. What if we share the extensive array of data we have collected from the SRE practice more widely?

We have aggregated data on sales, our service management platform, our application stack, networks, databases and more. We decided this data is rich enough to evangelize to a broader audience of team members.

The SRE Assistant could pull data from what we have collected in response to specific team member requests. It would access APIs (application programming interfaces), which for much of the data would fetch requested information from our observability tool. It could also fetch data from non-SRE sources using APIs. A salesperson could get daily order totals. A service provider could check customer satisfaction numbers. And since the SRE Assistant is available on our main collaboration tool, they can do so on their mobile device.
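The request flow described here could be sketched in Python roughly as follows. Everything in the sketch is hypothetical: the keyword matching, the handler names and the placeholder strings stand in for real API calls to the observability tool and other sources, none of which are detailed in this article.

```python
from typing import Callable, Dict


# Stand-ins for real API calls; each returns a placeholder, not real data.
def fetch_daily_orders() -> str:
    return "daily order total: <value from sales API>"


def fetch_customer_satisfaction() -> str:
    return "customer satisfaction: <value from service API>"


def fetch_system_health() -> str:
    return "system health: <value from observability tool API>"


# Hypothetical keyword-to-source routing table.
HANDLERS: Dict[str, Callable[[], str]] = {
    "orders": fetch_daily_orders,
    "satisfaction": fetch_customer_satisfaction,
    "health": fetch_system_health,
}


def answer(question: str) -> str:
    """Pick the first handler whose keyword appears in the question."""
    text = question.lower()
    for keyword, handler in HANDLERS.items():
        if keyword in text:
            return handler()
    return "Sorry, I don't have data for that yet."
```

Keeping the sources behind a single lookup table is one simple way to let the data selection grow as team members add requests: supporting a new question means registering one more handler.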

This data is available in separate tools across IT, but until now there was no single source where team members could get that information all in one place.

This is the other aspect of democratizing data: breaking down tool silos and bringing together necessary information where users are in their communication channels.

Team members can use the SRE Assistant to ask not only about IT system performance but also about a business function, such as how it’s performing at a given point in time.

Our chatbot is a bit like the ubiquitous digital assistants Alexa or Siri. Users just frame a question in the bot, and the SRE Assistant will use APIs to pull the relevant information from a source and present it in the chat.

Taking Our Data Capabilities on the Road

While our team hasn’t formally unveiled it to users yet, the SRE Assistant chatbot has been well received so far by the limited number of current users.

The data selection provided has grown organically as team members have added requests. On the alert notification front, we have seen a lot of promise around increasing team member collaboration in response to sharing system issues. With alerts now being made available teamwide via the bot, everyone sees the same thing and there is an urgency to fix things.

Overall, the cross-pollination of information the SRE Assistant provides blurs the silos and encourages outreach and collaboration. It increases transparency about system performance. And perhaps most importantly, because it uses our central collaboration tool that is available on mobile devices, users can access alerts and data wherever they are. So, they receive up-to-date information about their systems and can make queries with ease.

In the coming months, we’re sharing the SRE Assistant more broadly across teams, product owners and the business community using our SRE Enablement program, educating them on its chatbot capabilities.

We are convinced there is a lot to be learned from sharing our SRE data across IT. We expect using our data wisely will yield better opportunities to improve how we serve our customers.

Keep up with our Dell Digital strategies and more at Dell Technologies: Our Digital Transformation.

About the Author: Tanuj Arcot

Tanuj Arcot was formerly with Site Reliability Engineering for Dell Technologies.