CRAP and CRUD: From Database to Datacloud

Followers of EMC over the last year cannot have avoided the word transformation. We have never seen so many profound changes taking place simultaneously in IT. And for the first time in the last three decades, industry trends more powerful than any individual company are collectively changing the conversation about data management from database to datacloud.

The relational database management system (RDBMS) is an elegant technology that enables fast queries and transactional updates of structured data (rows and columns), and it has over the years become the standard application data persistence layer. With the establishment of RDBMS standards for querying (SQL) and connectivity (ODBC/JDBC), the database became the stickiest part of the IT stack, because many application developers were happy to leave the job of data management to Sybase, Oracle, IBM and Microsoft.

But in the Cloud conversation, there are three megatrends in data management that point to a changing emphasis from database to datacloud.


First, data itself is changing. We are witnessing a deluge in which the amount of data will increase 44x this decade, according to IDC’s Digital Universe study. The great majority of this new data, however, is not your typical CRUD (Create, Read, Update, Delete) data, aka structured data. Instead, it is CRAP (Create, Replicate, Append, Process) data, often generated by machines and arriving in large quantities at high velocity. Examples include web logs, social streams, sensor data, videos, ticker data, mobile geo-spatial data and so on.

A new generation of applications seeks to gain insights from this new data in (near) real time and then, almost always, retain it for deeper processing later. Almost none of this data needs to support RDBMS update operations or transactional capabilities. The relational database, while a beautiful data management tool for CRUD data, is not really designed for CRAP data. The pioneers among the Internet service providers have been building their own systems for processing CRAP data, and some of these systems have since been open sourced (such as Hadoop) and are gaining acceptance in the enterprise. However, there is still no industry-standard “big data platform” and no universal best practices for how CRAP data should be ingested, stored, and consumed.
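The CRUD/CRAP distinction above can be made concrete with a small sketch. The snippet below (all names are illustrative, not any particular product's API) contrasts a CRUD-style in-place update with a CRAP-style append-only event log that is only ever created, appended to, and processed:

```python
import time

# CRUD style: structured records are updated in place (RDBMS territory).
accounts = {"alice": {"balance": 100}}
accounts["alice"]["balance"] -= 25  # Update: mutate the existing row

# CRAP style: machine-generated events are only created, replicated,
# appended, and processed -- never updated in place.
event_log = []  # stand-in for an append-only store such as HDFS

def append_event(log, source, payload):
    """Append an immutable event record; nothing is ever rewritten."""
    log.append({"ts": time.time(), "source": source, "payload": payload})

append_event(event_log, "web", {"path": "/home", "status": 200})
append_event(event_log, "sensor", {"temp_c": 21.4})

# Process: derive insight by scanning the log, leaving it untouched.
web_hits = sum(1 for e in event_log if e["source"] == "web")
```

Because nothing in the log is ever rewritten, this kind of data does not need the update and transaction machinery that makes an RDBMS valuable for CRUD workloads.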

But enough about CRAP and CRUD; there is other… stuff to talk about.

Hybrid Cloud Changes the Equation

The second trend that is disrupting RDBMS dominance is the Cloud, or more precisely, the growing popularity of SaaS applications. These applications typically run in the SaaS provider’s own data centers, and the associated application data is stored in those same data centers. That means a customer’s RDBMS can no longer be the common data fabric between its Private Cloud and its SaaS applications.

The need for applications to access and query data generated by other applications, however, remains, especially for reporting and analytics. This calls for a new “cloud data fabric” that lets applications connect to multiple data sources, both SaaS and databases, across a Hybrid Cloud, and run queries spanning different datastores in different data centers. Here lies one of the major pieces of unfinished business in the current Cloud Computing landscape.
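A minimal sketch of what such a federated query might look like: join records from a SaaS application with records from a private-cloud database. Both sources are simulated here with in-memory lists; in practice each connector would call a SaaS REST API or an ODBC/JDBC driver, and all names are hypothetical:

```python
# Pretend this came from a SaaS provider's API.
saas_orders = [
    {"customer_id": 1, "total": 250},
    {"customer_id": 2, "total": 90},
]

# Pretend this came from a private-cloud RDBMS.
private_customers = [
    {"customer_id": 1, "name": "Acme Corp"},
    {"customer_id": 2, "name": "Globex"},
]

def federated_join(left, right, key):
    """Join rows from two independent datastores on a shared key."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

# A cross-datastore "report" neither source could produce alone.
report = federated_join(saas_orders, private_customers, "customer_id")
```

The hard part a real cloud data fabric must solve is everything this sketch hides: authentication across providers, moving only the rows a query needs, and pushing work down to each datastore rather than pulling everything to one place.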

A Vote for Democratization

The third trend is the democratization of IT. Traditionally, business intelligence has been a high-stakes game, requiring millions of dollars of upfront investment in infrastructure and software tools, as well as specialized skills to perform the analytics. Consequently, only the larger companies could afford it. But the rise of open source tools for advanced analytics and statistical analysis has put a powerful, easy-to-consume cloud analytics platform in the hands of more businesses and IT professionals. By delivering analytics as a service, combined with innovative data visualization tools and PaaS, more and more people will have access to the value hidden in their data.

These three trends, big data, cloud, and the democratization of IT, will together rock the data management market. If the old world was the database world, then the new one will be the datacloud world. The RDBMS will not disappear, but it will no longer play the role of the default integration and data management platform for the Cloud. It is in creating the datacloud that the hardest data problems of today and tomorrow will be solved. In this new world, there is no incumbent. But given the mission and heritage of EMC, Greenplum and VMware, we feel we have something valuable to add to the conversation.

The EMC, Greenplum, VMware Blueprint

EMC, Greenplum and VMware will together introduce a blueprint for a best-in-class “big data platform” that helps users ingest, persist and consume CRAP data in real time. It will include a real-time distributed ingestion engine that processes the incoming stream, both triggering events and organizing the data into two data stores. The bulk data will flow to the “big data storage” that our Cloud Infrastructure can provide natively, possibly via an HDFS API. The “hotter” data, including indices, samplings, other metadata and cache, will flow to the “big memory grid,” possibly based on GemFire technology, where it is available for real-time queries. The platform can be consumed via a PaaS platform, via visualization tools, or via standard APIs. It will also support deeper analytics by running Hadoop jobs against the “big data storage.”
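The two-store split described above can be sketched in a few lines. This is an illustrative toy, not EMC, Greenplum or GemFire code: an ingestion function routes every raw event to bulk storage while maintaining hot aggregates in an in-memory grid for real-time queries.

```python
bulk_store = []    # stand-in for HDFS-backed "big data storage"
memory_grid = {}   # stand-in for a GemFire-like "big memory grid"

def ingest(event):
    """Route one incoming event to both stores."""
    bulk_store.append(event)  # all raw data lands in bulk storage
    key = event["source"]
    hot = memory_grid.setdefault(key, {"count": 0, "last": None})
    hot["count"] += 1         # hot aggregate, queryable in real time
    hot["last"] = event       # cached most-recent event per source

for e in [{"source": "web", "path": "/a"},
          {"source": "web", "path": "/b"},
          {"source": "sensor", "temp_c": 20.0}]:
    ingest(e)

# Real-time queries hit the memory grid; deeper analytics (e.g.
# Hadoop jobs) would later scan everything in bulk_store.
web_count = memory_grid["web"]["count"]
```

The design choice the sketch illustrates: the memory grid holds only small, derived, latency-sensitive state, while the bulk store keeps every raw event for later reprocessing.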

We are at the onset of a data renaissance. The primacy of the relational database is being challenged by these historic waves.  We here at EMC, Greenplum and VMware are looking forward to the challenge of helping to usher in the new age of the datacloud.

About the Author: Charles Fan