The Big Data Wilderness: Finding Your Way Starts With Asking The Right Questions

When it comes to defining how to use Big Data analytics for your business, consider these words of wisdom from baseball great Yogi Berra: “If you don’t know where you’re going, you will wind up somewhere else.”

Knowing where you want to go in analyzing the vast amounts of data you now have available is one of the most important aspects of leveraging that data to gain value for your organization. In other words, if you don’t at least directionally define the question or the business problem you are trying to solve before you begin your data query process, you are unlikely to get the accurate and valuable answers you need. Without proper planning, you can spin your wheels indefinitely without ever producing a useful analysis.

While there is more data available for business analytics than ever before, it will not tell you where to go with your research. And just throwing algorithms at the data is not a valid or useful path to real-time insights. We held a Big Data Analytics summit within EMC a couple of months ago (see here and here), where we started to poke at this problem a little bit.

Framing the question

How do you define the path for your Big Data analysis project so you don’t wind up somewhere else? You first need to frame the problem or business question you want answered as specifically as possible, based on your defined goals and hypotheses. The higher the value for the business, the higher the priority of the problem. Brainstorm with your team as you set this goal. Look at what others have done in related analytical approaches. You might want to involve a subject matter expert or consultant for this step to help come up with the crispest, cleanest definition and scope of your experiment.


Build your query around data that is already available, at least initially. Try to structure your analysis so there is some low-hanging fruit in terms of expected successful results. These results can provide a foundation for your experiment before you go on to larger sets of business problems.

Identify the data

Once you have the problem framed, you need to identify the data to bring in for the query process and determine how to ingest and cleanse that data. There is a misconception that if you have a lot of data to work with, it is not necessary to clean it. However, you do need to apply basic rules of data quality regardless of how large or small your data set is.

Some algorithms can be misled by data errors. For example, when EMC IT analyzed performance results for our Exchange email system to predict when servers would fail, we found that the results could vary widely if values in some data fields weren’t filled in properly. The analysis is sensitive to the quality of the data.
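To make that sensitivity concrete, here is a minimal Python sketch. The readings are invented for illustration and are not taken from our Exchange project; the function names are hypothetical as well. It simply compares an average computed when blank fields are silently treated as zero with one computed after the blanks are dropped.

```python
from statistics import mean

# Hypothetical utilization readings for one server (made-up numbers).
# None marks a field that was never filled in.
readings = [72.0, 75.0, None, 74.0, None, 73.0]

def mean_treating_missing_as_zero(values):
    """Naive approach: a blank field silently becomes 0."""
    return mean(0.0 if v is None else v for v in values)

def mean_ignoring_missing(values):
    """Basic data-quality rule: drop unfilled fields before aggregating."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no usable readings")
    return mean(present)

naive = mean_treating_missing_as_zero(readings)
cleaned = mean_ignoring_missing(readings)
print(f"blanks counted as zero: {naive}, blanks dropped: {cleaned}")
```

With only two blank fields out of six, the naive average drops from 73.5 to 49.0, which is more than enough to send a failure-prediction model in the wrong direction.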


Expect the data preparation for your analysis to take time. You not only have to find the right data but also have to apply mechanisms to filter that data down to a workable and informative subset. In our Exchange analysis project, we had 8,000 parameters in the original data set. Ingesting all of those parameters would not have been useful. In fact, we narrowed our data set from 8,000 to 100 highly informative parameters. By doing so, we eliminated some of the clutter and made sure that we focused on the desired patterns.
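One simple way to do this kind of narrowing is to rank each parameter by how strongly it correlates with the outcome you care about and keep only the top few. The sketch below is an illustration of that idea, not the method we used on the Exchange data; the parameter names, the values, and the `top_k_parameters` helper are all assumptions made up for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def top_k_parameters(columns, target, k):
    """Return the k parameter names most correlated (in absolute value) with target."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical parameters, one value per observation window.
columns = {
    "queue_length": [1, 2, 3, 4, 5],
    "disk_latency_ms": [5, 3, 6, 2, 4],
    "firmware_version": [7, 7, 7, 7, 7],  # constant, so it carries no signal
}
failed = [0, 0, 0, 1, 1]  # 1 = server failed in that window

selected = top_k_parameters(columns, failed, k=1)
print(selected)
```

Correlation ranking is only a first pass. It misses parameters that matter only in combination, so the shortlist should still be reviewed with subject matter experts.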

Of course, bringing in structured as well as unstructured data from multiple sources is a challenge. We find that about 80 percent of the time spent on an analytics project tends to be on data ingestion. While I won’t address it in this blog, there are a number of tools that can help you absorb large amounts of data in a standardized process.

Analyzing and re-analyzing

With your data in place, the analysis can begin. Even if this data-science-driven part of the project seems like the less demanding portion of Big Data analytics, it still takes time. You can get initial analytical results in a few weeks, but remember that Big Data analysis actually involves sprints of data science. We bring in the data, iterate once, compare it, go back to the subject matter experts to see if it makes sense, and then try it again. The goal is to avoid false positives and false negatives in the research results.
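A quick way to track false positives and false negatives from one iteration to the next is to tally predictions against what actually happened. This is a minimal sketch with invented predictions and outcomes; `confusion_counts` is a hypothetical helper, not part of any tool mentioned here.

```python
def confusion_counts(predicted, actual):
    """Tally true/false positives and negatives for boolean predictions."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for p, a in zip(predicted, actual):
        if p and a:
            counts["tp"] += 1  # predicted a failure, and it failed
        elif p and not a:
            counts["fp"] += 1  # false alarm
        elif not p and a:
            counts["fn"] += 1  # missed failure
        else:
            counts["tn"] += 1  # correctly predicted no failure
    return counts

# Hypothetical results from one iteration of a failure-prediction model.
predicted = [True, True, False, False, True, False]
actual = [True, False, False, True, True, False]

counts = confusion_counts(predicted, actual)
print(counts)
```

Comparing these four numbers after each sprint shows whether a change to the data or the model is actually reducing false alarms and missed failures, or just trading one for the other.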

In EMC IT, we have put in place the notion of Data Scientist as a Service: our data science team is available on a consulting basis to collaborate with the business on the tools and statistical techniques that will optimize their data ingestion and analytical strategy.

What we tend to work on initially is a training set of data, which we can then apply to a larger data set as the iterations progress. Since data cleansing and verification is a time-consuming process and also the baseline for the entire analysis, the best option is to start with a small set of variables (the data itself can be large) that are as clean and accurate as possible. As the project progresses, variables can be added. Starting with a very high number of variables in an unclean data set dramatically increases both the time it takes for the data scientist to make sense of the data and the time spent on cleansing and pre-processing. That is how we start to get on the path of prediction using Big Data.

Done right, leveraging Big Data holds tremendous potential for gaining foresight into just about every aspect of your organization’s business operations, but the process to get to that predictive point is a complex one that takes planning and precision. Take the time to map out your Big Data analysis project and you won’t end up somewhere you didn’t want to go.

Read KK’s previous entries about Big Data: 

Liberate Your Big Data and Close the Loop to Leverage Its True Value
How EMC Uses Big Data to Deliver Value to the Business

About the Author: KK Krishnakumar