Keeping Garbage Out of the Lake: Data Governance in the Big Data World

As organizations unleash the power of the data lake by giving the business broader access to more and more data, they face a growing IT dilemma: how to keep improperly governed or poor-quality data from polluting the lake.

While IT’s traditional approach to managing data governance and quality has been quite effective over the years, the magnitude of data in today’s data lake far exceeds traditional data warehouse volumes. Traditional tools and tactics are being overwhelmed by Big Data in the lake.

There are, however, strategies that organizations can use to reshape data governance and quality standards in the Big Data world. While our tactics and tools are still evolving, I will share some of the approaches we are developing at EMC IT to keep our data lake clean.


Here’s the challenge

Data governance in the Big Data world still starts with the traditional data governance components: providing a framework for data assets and managing the proper data lexicon and data definitions. We still rely on data ownership to track data changes and monitor data access. And we still use some of the same data governance tools, such as Collibra and Attivio.

However, what we began to see after launching our data lake was that the skyrocketing number of data elements requiring governance was outstripping our data governance capacity. We recently counted some 33,000 data elements that we need to govern and manage for quality, compared with fewer than half that number in our traditional data warehouse.

The result has been a slowdown in our governance process. Our tools and technology were unable to keep up with the added demand. It began taking days to load the metadata and data definitions for our data in the lake. And while the traditional data world gave us mature technologies that automated data structure changes and updates, those technologies were not developed enough to handle the large amount of unstructured data in the lake. That meant we had to manually monitor and manage vast amounts of unstructured data.

Our data stewards, who manage and monitor our data sets in the lake, were overwhelmed by the influx of new data. The question became: how do we manage provisioning, definitions, access and changes across such a vast number of attributes?

The answer, we concluded, is that we don’t. We have begun a new approach to data governance based on the idea that not all of the data in our data lake requires 100 percent accuracy and top-level governance. We are working with the business to classify data based on its type and use, applying an approach called exactitude versus approximation. Under this approach, data is tiered based on how important its accuracy is, and governance and quality practices are adjusted accordingly. Our core data, such as customer data, remains a top-tier priority and is still subject to full data governance and quality monitoring to safeguard its accuracy, consistency, timeliness and contextual correctness. Other data, such as data used for predictive modeling where accuracy is less critical, is assigned to a lesser tier with less stringent governance practices.

An example of data that might be subject to less stringent standards is a set of millions of records used for predictive analytics, where inaccuracies in a small percentage of records don’t impact the predictive strength of the model. The idea is that if there is an accuracy problem in 5 percent of such a large data set, we shouldn’t expend the resources to fix that 5 percent.
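To make the tiering idea concrete, here is a minimal sketch, in Python, of how tier-based quality policies might work. The tier names, thresholds and checks are illustrative assumptions on my part, not a description of our actual tooling.

```python
# A minimal sketch of exactitude-versus-approximation tiering (illustrative only;
# tier names, thresholds and checks are hypothetical, not EMC IT's actual tooling).
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class TierPolicy:
    name: str
    max_error_rate: float   # fraction of flawed records the tier tolerates
    checks: List[str]       # which data quality dimensions we measure

# Tier 1 core data gets every check and zero tolerance; tier 2 approximation
# data gets fewer checks and tolerates a small error rate (e.g., 5 percent).
EXACTITUDE = TierPolicy("exactitude", 0.00,
                        ["completeness", "accuracy", "consistency", "timeliness"])
APPROXIMATION = TierPolicy("approximation", 0.05, ["completeness"])

def meets_quality_bar(records: List[dict], policy: TierPolicy,
                      check_fns: Dict[str, Callable[[dict], bool]]) -> bool:
    """Return True if the share of records failing the tier's checks
    stays within the tier's tolerated error rate."""
    if not records:
        return True
    flawed = sum(
        1 for record in records
        if any(not check_fns[check](record) for check in policy.checks)
    )
    return flawed / len(records) <= policy.max_error_rate
```

In this sketch, a customer data set evaluated against EXACTITUDE fails on any flawed record, while a large model training set evaluated against APPROXIMATION passes as long as no more than 5 percent of its records are flawed.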

EMC IT is just beginning to use this tiering process to ease our data governance capacity crunch. Out of the 33,000 data elements selected from millions in our data lake, we have already been able to shift more than half to the second tier where we don’t measure all of the dimensions of data quality. That way we can more effectively focus our efforts on the core data elements and assets that do require 100 percent accuracy.

I consider this to be a good turning point in terms of data management and governance, where we are managing data quality on a return on investment (ROI) basis.

Striking the right balance

Beyond easing the governance capacity crunch, the Big Data world presents another, less defined challenge: How do we strike the right balance between data governance and enabling the innovation the data lake opens up?

You want everyone to be able to play with the data in the lake and start building predictive models and other advanced analytics, which in many cases means failing fast and iterating toward solutions. If you have too much governance, you could choke that innovation. At the same time, you don’t want to eliminate quality control entirely and risk having the data lake become a data garbage dump.

Toward achieving that balance, EMC IT is focusing our full governance effort on a centralized data hub in the lake, which holds our core data, while allowing more loosely governed assets in the workspaces where our users pursue innovative analytics. In the workspaces, we ask business users to provide the metadata and track it in Collibra. They can then make changes and do whatever they want with their workspace data, as long as they update the metadata from time to time.

To make sure we are meeting business users’ needs, IT is partnering with a governance body on the business side, the Data Governance Steering Committee. The steering committee is working to establish a network of data stewards for each business domain to oversee management of data that affects their business.

While IT is part of the steering committee, we want the business to run it because the business knows its data and its usage best.

More to come

As we continue to evolve our data governance for the Big Data world, there is a lot more to be done. Besides using several vendor-supplied data governance tools, we have built our own governance tool that monitors the quality of structured data in the lake. We are currently working on a similar tool for unstructured data. Eventually, our team would also like to build a predictive data quality model that can fill in missing data records based on surrounding data, along the lines of the sketch below.
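That predictive fill capability is still on the drawing board, but as a rough illustration of the direction, here is a minimal Python sketch that fills a missing numeric value from surrounding records using per-group medians in pandas. The column and group names ("region", "order_value") are hypothetical examples, not a description of our model.

```python
# A rough sketch of filling missing values from surrounding data (illustrative only;
# the column names "region" and "order_value" are hypothetical examples).
import pandas as pd

def fill_missing(df: pd.DataFrame, group_key: str, column: str) -> pd.DataFrame:
    """Fill gaps in `column` using the median of similar records (same group_key)."""
    df = df.copy()
    group_median = df.groupby(group_key)[column].transform("median")
    df[column] = df[column].fillna(group_median)
    # Fall back to the overall median when an entire group is missing.
    df[column] = df[column].fillna(df[column].median())
    return df

# Example: the missing east-region order value is filled with the east median (100.0).
orders = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "order_value": [100.0, None, 80.0, 90.0],
})
print(fill_missing(orders, "region", "order_value"))
```

A real predictive model would go further than group medians, but the principle is the same: infer the missing value from the context that surrounds it.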

We are also working to refine our data quality tiering, which currently classifies data only as critical or non-critical.

It is an exciting time in the Big Data world, where the volume of data flowing into the data lake continues to challenge our data governance capabilities. Hopefully, sharing our evolving approach at EMC IT has offered you some insights to help safeguard and maximize your data lake.

About the Author: Shahidul Mannan