New Pivotal HD Unlocks Hadoop’s Big Data Potential

I’m very excited to finally be able to talk publicly about something we’ve been working on for quite some time. In 2011, we announced our first version of Greenplum Hadoop.

Today, we are unveiling Pivotal HD, the world’s most powerful Hadoop distribution.

Pivotal HD isn’t just about making Hadoop better and much faster. It is about significantly expanding Hadoop’s capabilities as a data platform and unlocking Hadoop’s potential as the key to Big Data for data-driven enterprises. Check out the press release here for the big picture.

So what is it?

Pivotal HD is a new distribution of Apache Hadoop, featuring native integration of EMC Greenplum’s massively parallel processing (MPP) database with Apache Hadoop. Together, they form the most cost-effective and flexible open source Big Data platform ever developed.

[Image: Pivotal HD box shot]

Pivotal HD’s core Hadoop distribution includes everything you would expect from a Hadoop distribution (HBase, Hive, Pig, a management console, an installer, and more). But what takes the Pivotal HD Hadoop distribution to the next level is a major new component that we are very proud of: HAWQ, a relational database that runs atop HDFS. In short, we gutted the code path where our solid, mature, fast MPP Greenplum Database wrote to local disk; instead, it now writes directly to HDFS.

HAWQ has its own execution engine, separate from MapReduce, and manages its own data, which is stored on HDFS. HAWQ bridges the gap: it is a SQL interface layer on top of HDFS that also organizes the data it stores there. It also boasts a core feature called GPXF (Greenplum Extension Framework) that allows HAWQ to read flat files in HDFS stored in just about any common format (delimited text, sequence files, protobuf, and Avro). In addition, it has native support for HBase and offers a ton of intelligent features for retrieving HBase data.
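To make that concrete, here is a minimal sketch of how an application might define and query a GPXF external table over data already sitting in HDFS. It assumes a standard PostgreSQL driver (psycopg2 here) can reach the HAWQ master, since HAWQ descends from the Greenplum/PostgreSQL lineage; the host names, table definition, and GPXF LOCATION URI are illustrative placeholders, not the exact documented syntax.

```python
# A minimal sketch, not official documentation: because HAWQ descends from
# the Greenplum/PostgreSQL lineage, a standard PostgreSQL driver such as
# psycopg2 can typically connect to the HAWQ master. The external-table DDL
# below is illustrative only; the exact GPXF LOCATION syntax and options
# depend on the release, so check the Pivotal HD documentation.
import psycopg2

# Hypothetical connection details for a HAWQ master node.
conn = psycopg2.connect(host="hawq-master.example.com", port=5432,
                        dbname="analytics", user="gpadmin")
cur = conn.cursor()

# Illustrative GPXF external table over delimited text files already in HDFS.
cur.execute("""
    CREATE EXTERNAL TABLE ext_page_views (
        user_id   bigint,
        url       text,
        viewed_at timestamp
    )
    LOCATION ('gpxf://namenode:50070/data/page_views/*')  -- placeholder URI
    FORMAT 'TEXT' (DELIMITER ',');
""")
conn.commit()

# Once registered, the HDFS data is queryable with ordinary SQL.
cur.execute("SELECT count(*) FROM ext_page_views;")
print(cur.fetchone()[0])

cur.close()
```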

This massive array of features and functionality stretches what a Hadoop cluster can do further into the analytical and data service realm, where Hadoop has historically struggled because it was designed primarily as a batch data processing system.

Together, Hadoop and HAWQ cover all the bases, empowering customers to use the right tool for the right job within the same cluster.

We’re finding that HAWQ is hundreds of times faster than Hive. We’ve also tested against some competing SQL-on-Hadoop solutions and found HAWQ to be orders of magnitude faster for some queries (especially group-bys and joins, which are rather important).

The result is a near real-time analytical SQL database that runs on Hadoop. You can run queries with sub-second response times while also running over much larger datasets, with the full expressiveness of SQL, in the same engine. Meanwhile, you don’t have to sacrifice or compromise anything on the Hadoop/MapReduce side of the house; they work together to get the job (whatever it is) done. To learn more, visit our Pivotal HD product page. I also dive deeper into the technology here.
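As a hedged illustration of what “the full expressiveness of SQL” looks like in practice, the sketch below reuses the hypothetical connection and table from the earlier example to run an analytical query that combines aggregation with a window function.

```python
# Continuing the illustrative psycopg2 connection and the hypothetical
# ext_page_views table from the earlier sketch. The point is that analytical
# SQL constructs such as GROUP BY aggregation and window functions run
# directly against data living in HDFS.
cur = conn.cursor()
cur.execute("""
    SELECT url,
           count(*)                             AS views,
           rank() OVER (ORDER BY count(*) DESC) AS popularity_rank
    FROM ext_page_views
    GROUP BY url;
""")
for url, views, popularity_rank in cur.fetchmany(20):  # peek at the first rows
    print(popularity_rank, url, views)
cur.close()
```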

About the Author: Donald Miner