Data Confidence Fabric and the Importance of Vetted Data

Relying on data you can't place your confidence in isn't just bad business; it's dangerous. There's now a mechanism to neutralize the threat and bank on data you can trust.

By Nicole Reineke, Distinguished Engineer, Dell Technologies

Remember that classic line from Jaws: “You’re gonna need a bigger boat”? Chief Brody knows this for a fact because he has stared down the maw of the beast. He has real, trustworthy data to back his claim.

Today’s enterprises aren’t always as lucky. With the myriad sources of data feeding their insights, they can’t always tell whether that data is equally trustworthy. Given that major decisions in enterprises feed on data, a lot is at stake. Increasingly, we are using data to inform operational capital expenditures (bigger boats), revenue forecasts, risk mitigation projects, support solutions, product design, and marketing strategies. Moreover, as we increase our reliance on artificial intelligence (AI) and machine learning (ML) algorithms to automate those decisions, we must ensure that the data we are using is trustworthy.

Given that data is the make-or-break currency for today’s digital enterprises, there has to be a way of increasing trustworthiness with transparency. The Data Confidence Fabric (DCF) delivers just that: It serves as a standardized way of quantifying and measuring the trustworthiness of data. This enables enterprises to use data that meets their relevancy and trust standards to deliver more confident insights, so they can take those insights all the way to the bank.

What Is Making Data Less Trustworthy?

When I talk to chief data officers, they typically state that data generated within the four walls of their enterprise (in-house) is inherently trustworthy. Beyond those walls, though, data often comes bundled with more burden than bounty.

The problem these days is not so much that in-house data has lost its way; it’s that businesses are increasingly using large volumes of data generated outside an enterprise’s firewall. The increase is typically tied to the rise of data science practices that require expansive quantities of data for training. Frequently, external data’s provenance and lineage aren’t transparent, which makes the data hard to trust. The rise of remote work means a lot of information is being created beyond the corporate network; laptops in a home office or unauthorized mobile devices may never have passed a security litmus test before use. Cooperative customer engagements are another source of large volumes of data, but again, can you really trust the source without access to the full lineage?

Finally, the explosion of Internet of Things (IoT) and Industrial Internet of Things (IIoT) technologies is delivering data from a whole range of Edge devices in the field. Sensors and devices are streaming real-time and near-real-time data that, when analyzed and acted upon, could deliver a powerful competitive advantage.

But Edge data is increasingly being augmented and analyzed at locations closer to the devices (as we push processing out toward the Edge), leaving that data more vulnerable to attack. Amid this major pendulum swing from a centralized IT model to a distributed computing architecture, these risks will need to be mitigated.

Taming the Data Beast

As enterprises embrace the plethora of sources and avenues of data open to them, they need a mechanism to gauge trustworthiness. Essentially, they need an ecosystem in which data can be measured and filtered based on reliability and trustworthiness. The DCF provides this ecosystem.

To populate the DCF, every piece of data created at a location registered with the DCF can be automatically or manually logged with a set of attributes. Attributes can describe numerous qualities about the data, including its origins, at a level that is meaningful for the company: specific BIOS information, secure boot status, authentication enablement, immutable storage, the IoT sensor model, IP address, and so on.
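To make that concrete, here is a minimal sketch of what a data record with trust attributes might look like at ingestion time. The class, field names, and attribute keys are illustrative assumptions, not part of any standard DCF schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical DCF ingestion record; every name below is an illustrative assumption.
@dataclass
class DcfRecord:
    data_id: str                 # identifier for the piece of data
    payload: Any                 # the data itself
    attributes: Dict[str, Any] = field(default_factory=dict)  # trust-relevant attributes

record = DcfRecord(
    data_id="sensor-0042",
    payload={"temperature_c": 21.7},
    attributes={
        "secure_boot": True,         # device booted with verified firmware
        "authenticated": True,       # source authenticated before sending
        "immutable_storage": False,  # not yet written to immutable storage
        "sensor_model": "TH-200",    # hypothetical IoT sensor model
        "source_ip": "10.12.0.17",
    },
)
```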

Based on the business prioritization of those attributes (which we can refer to as “rules”), a data score is then calculated. As an example, data generated on a non-secure device might register a lower score than data from a secure-boot, in-house desktop computer. Moreover, attributes can be given different scores and weightings. For example, data can score higher when it includes provenance information or when its legitimacy was verified using two-factor authentication.
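As a rough sketch of how such rule-based scoring could work, the snippet below simply sums the weights of whichever attributes a record carries. The attribute names and weight values are assumptions standing in for a business’s own priorities, not a prescribed DCF rule set.

```python
# Illustrative rules: attribute name -> weight reflecting business priority.
RULES = {
    "secure_boot": 3.0,
    "authenticated": 2.0,
    "immutable_storage": 1.5,
    "provenance_known": 2.5,
}

def confidence_score(attributes: dict) -> float:
    """Sum the weights of every rule whose attribute is present and truthy."""
    return sum(weight for attr, weight in RULES.items() if attributes.get(attr))

# Data from a secure-boot, authenticated in-house machine outscores data
# from a source that could only be authenticated.
in_house = {"secure_boot": True, "authenticated": True, "immutable_storage": True}
external = {"authenticated": True}
print(confidence_score(in_house))  # 6.5
print(confidence_score(external))  # 2.0
```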

Given that human beings ultimately decide which conditions merit trustworthiness, the worry about bias creeping into the process is understandable. But including fact-based information as part of the attributes, and clearly marking it as such, will enable end users to tell which aspects of the score were based on fact and which were based on inference.

When the data finally reaches the point where it is pressed into service, it carries a net DCF score that enables decision-makers to vet its trustworthiness against the needs of the activity it is being used for.
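One way to picture that final vetting step is a simple per-activity threshold, as in the sketch below; the activity names and minimum scores are assumptions for illustration only.

```python
# Hypothetical minimum DCF scores per activity; names and values are assumptions.
THRESHOLDS = {
    "model_training": 5.0,        # automated pipelines demand high confidence
    "exploratory_analysis": 2.0,  # low-stakes exploration tolerates more risk
}

def vetted(records: list[dict], activity: str) -> list[dict]:
    """Keep only records whose net DCF score meets the activity's threshold."""
    minimum = THRESHOLDS[activity]
    return [r for r in records if r["dcf_score"] >= minimum]

records = [
    {"data_id": "sensor-0042", "dcf_score": 6.5},
    {"data_id": "partner-feed-7", "dcf_score": 2.0},
]
print(vetted(records, "model_training"))        # only sensor-0042 qualifies
print(vetted(records, "exploratory_analysis"))  # both records qualify
```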

As a revolutionary approach to data management, the DCF gives businesses a quantifiable way to screen, measure, and determine whether data can be trusted based on greater context. It means people can put their toes in the water with more confidence.