Predictive Analytics for IT Operations: Continuing the Journey

Viktor Mayer-Schönberger and Kenneth Cukier, authors of Big Data: A Revolution That Will Transform How We Live, Work and Think, wrote, “If big data teaches us anything, it is that just acting better, making improvements – without deeper understanding – is often good enough.”

EMC IT not only recognizes the hidden value of Big Data but also strives to use it to generate better outcomes, so that we can act better and faster to improve our customers’ experience.

In his November 2013 article, Dan Inbar from EMC’s IT organization eloquently presented what IT has been doing to improve the operations of our Exchange email environment. PAITO (Predictive Analytics for IT Operations) is our Big Data analytics solution for outage prediction. It allows our IT operations team to collect, analyze, store, and leverage key indicators to predict and prevent interruptions in mission-critical operations. The journey that started more than a year ago as a pilot has evolved into a full-fledged IT data lake and analytics platform for various IT-managed areas, including applications, servers, devices, licenses, network, storage, security and workloads.

A data-driven approach to profiling and proactively identifying anomalous behavior from metrics and event log data is core to PAITO’s analytics capabilities. PAITO is a real-time solution designed for scalability: it handles high-velocity, high-variety data ingestion and supports self-service streaming analytics as well as historical data analytics. These characteristics are essential to building a sustainable data lake and analytics platform that can adapt to the growth and complexity of our IT environment.

Building a real-time big data analytics platform requires a sophisticated stream data ingestion process. PAITO is designed to collect Exchange server performance counters, as well as system and application logs, remotely from each monitored server every 120 seconds and to stream them into the PAITO data lake. Performance counters are important for monitoring performance and load, but prediction comes from the information hidden in the system and application logs, read in conjunction with the performance counters. These logs must be collected and consumed in real time to predict outcomes. This is one of the most critical components of PAITO.
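To make the idea of reading logs "in conjunction with" counters concrete, here is a minimal sketch in Python. It assumes that counter samples and log events are aligned by server and by the 120-second collection window. The record layout, the function names, the server name and the log messages are all hypothetical, and this is not PAITO’s actual ingestion code.

```python
from collections import defaultdict

INTERVAL_S = 120  # collection interval stated in the article


def window_start(ts, interval=INTERVAL_S):
    """Round an epoch timestamp (in seconds) down to the start of its window."""
    return ts - ts % interval


def join_by_window(counter_samples, log_events):
    """Group counters and log messages by (server, window start).

    counter_samples: iterable of (server, epoch_seconds, {counter_name: value})
    log_events:      iterable of (server, epoch_seconds, message)
    """
    merged = defaultdict(lambda: {"counters": {}, "events": []})
    for server, ts, counters in counter_samples:
        merged[(server, window_start(ts))]["counters"].update(counters)
    for server, ts, message in log_events:
        merged[(server, window_start(ts))]["events"].append(message)
    return dict(merged)


# Hypothetical data for a single server.
samples = [
    ("ex01", 240, {"cpu_pct": 71.0}),
    ("ex01", 360, {"cpu_pct": 93.5}),
]
events = [
    ("ex01", 300, "Store latency warning"),
    ("ex01", 370, "Database dismount"),
]

records = join_by_window(samples, events)
for key in sorted(records):
    print(key, records[key])
```

Each resulting record carries both the counters and the log messages seen for one server in one window, which is the kind of combined view a downstream prediction step could consume.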

PAITO’s analytics module extends the platform to facilitate a series of analytics tasks. It uses a deductive learning process to understand the state of the system at any given time. The state of the system is based on a queue of events that have already occurred and, as the system arrives at a particular state, the analytics module continuously predicts the future states. These predictions are expressed as probabilities. Whenever the probability of a future state exceeds a pre-defined threshold, the analytics module triggers an alert.
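The article does not describe the prediction algorithm itself. As an illustration of predicting future states and alerting on a threshold, the following sketch uses a first-order Markov chain as a stand-in. It estimates transition probabilities from a past sequence of states and flags any undesirable next state whose probability exceeds the threshold. The state names, the history and the threshold value are hypothetical.

```python
from collections import Counter, defaultdict


def fit_transitions(history):
    """Estimate P(next_state | current_state) from an ordered list of states."""
    counts = defaultdict(Counter)
    for current, nxt in zip(history, history[1:]):
        counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for state, c in counts.items()
    }


def alerts_for(current_state, transitions, bad_states, threshold):
    """Return (state, probability) pairs for bad next states above the threshold."""
    probs = transitions.get(current_state, {})
    return [(s, p) for s, p in probs.items() if s in bad_states and p > threshold]


# Hypothetical sequence of observed system states.
history = [
    "ok", "ok", "queue_growing", "outage", "ok",
    "queue_growing", "outage", "ok", "queue_growing", "ok",
]
transitions = fit_transitions(history)

# With this history, "queue_growing" was followed by "outage" in 2 of 3 cases.
print(alerts_for("queue_growing", transitions, {"outage"}, threshold=0.5))
```

A production system would likely condition on a longer queue of events rather than only the current state; the single-state version is used here only to keep the sketch short.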

For example, PAITO evaluates the performance counters of Windows servers and computes health scores in real time. If the health score for any server falls below zero following a certain pattern, PAITO alerts support staff immediately so they can take follow-up action. In addition, the analytics module follows an adaptive learning technique in which historical data is used to discover new states of the system, which become inputs to the prediction algorithm.
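The scoring formula and the alerting pattern are not given in the article, so the sketch below makes two loud assumptions: the health score is the headroom of the most stressed counter relative to a per-counter limit, and the "pattern" is a run of consecutive samples with a score below zero. The counter names and limits are hypothetical.

```python
def health_score(sample, limits):
    """Return 1.0 minus the worst counter's usage relative to its limit.

    A positive score means headroom remains; a score below zero means
    at least one counter has exceeded its limit.
    """
    return 1.0 - max(sample[name] / limit for name, limit in limits.items())


def should_alert(scores, consecutive=3):
    """True if the most recent `consecutive` scores are all below zero."""
    recent = scores[-consecutive:]
    return len(recent) == consecutive and all(s < 0 for s in recent)


# Hypothetical limits and samples for one Windows server.
limits = {"cpu_pct": 90.0, "rpc_latency_ms": 50.0}
samples = [
    {"cpu_pct": 45.0, "rpc_latency_ms": 20.0},
    {"cpu_pct": 95.0, "rpc_latency_ms": 30.0},
    {"cpu_pct": 99.0, "rpc_latency_ms": 60.0},
    {"cpu_pct": 92.0, "rpc_latency_ms": 75.0},
]

scores = [health_score(s, limits) for s in samples]
print(scores)
print(should_alert(scores))
```

Requiring several consecutive negative scores, rather than a single one, is one simple way to avoid alerting on a momentary spike.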

The ability to scale and process a large volume of data, both structured and unstructured, enabled with stream data processing and rapid data access, requires a data processing layer that combines in-memory computing with a high-speed MPP (massively parallel processing) framework. Our Greenplum MPP database, combined with the Hadoop Distributed File System (HDFS) and a real-time analytical model execution engine on Storm, gives PAITO a scalable real-time analytical platform.

In addition to information extraction and analytic prediction, PAITO’s data visualization features are valuable for understanding historical trends and identifying scope for future improvements.

In conclusion, PAITO connects Big Data and analytics through the convergence of complex data and data-driven decision-making capabilities. It has enabled both business and IT to ask the right questions about the data sitting in IT platforms. As a result, it has fostered new ways to discover metrics for measuring performance and to proactively monitor critical services and predict their behavior. Looking back on the journey, it is a paradigm shift for us in enabling our command center with insightful information and transforming our Exchange platform into a self-monitored, proactive system. The result is a unique opportunity to serve our customers better.

About the Author: Bhanu Dhanaraj