Dreaming of Building a 1000 node Hadoop Cluster?

The dream is real for EMC Greenplum: a 1000-node Hadoop cluster, a.k.a. the Analytics Workbench, went live May 22, 2012 during EMC World.  When I first heard about this large-scale Analytics Workbench project, I immediately thought how harmful it must be for the environment.  What is the point of creating the world’s largest environment for Hadoop testing and development?  Well, the joke is on me, because this Big Data platform will facilitate ground-breaking insights to improve the quality of life AND create a greener environment.

I wanted to speak to the person who led the effort in creating this Big Data masterpiece – Apurva Desai, Sr. Director of Hadoop Engineering at EMC Greenplum.  Apurva worked with our internal systems integration team, led by Greg Robidoux, to architect the cluster, and managed a team to build and test it. It has been rumored that Apurva’s blood is on the cluster, given the many cuts and scrapes suffered while putting it together.  Watch the stop-motion video clip of the Analytics Workbench being built by Apurva’s team.

Creating a 1000-node Apache Hadoop cluster seems like a daunting task. Why did you take on the project?

I knew building a large-scale Hadoop cluster was certainly possible because I used to manage a Hadoop deployment at Yahoo, which expanded to 30,000 nodes over time. With this particular Analytics Workbench project, I was given the opportunity to build a 1000-node Hadoop cluster from bare metal, which is very challenging, yet exciting.

What exactly is the Analytics Workbench?

The EMC Greenplum Analytics Workbench consists of 1,000 servers clustered together running Hadoop at Switch’s SuperNap facility in Las Vegas. This environment was made possible through our partners – Intel, Mellanox Technologies, Micron, Seagate, SuperMicro, Switch, and more – who contributed their products and services to create the Analytics Workbench.

What was the impetus for EMC Greenplum to build the Analytics Workbench?

As Big Data thought leaders, it is our responsibility to drive Big Data success and adoption through the continuous development and validation of features at scale. The Analytics Workbench now offers developers access to a large-scale infrastructure for testing, refining and enhancing their Big Data analytics applications.

How would an organization go about building a 1000 node cluster?

First, you need a location. We partnered with Switch to host the cluster at the SuperNap facility in Las Vegas. Second, you need a specific set of skills – hardware expertise to physically build the cluster, Hadoop expertise to deploy the software, and heavy program management skills to manage the hardware components contributed by the partners.

So what are the hardware components?

  • 54 racks – 50 racks with 20 data nodes each, and the remaining 4 racks for infrastructure and support nodes
  • Intel – 2,000 processors
  • Micron – 6,000 x 8GB DIMMs
  • Seagate – 12,000 drives of 2TB each
  • Mellanox – InfiniBand backend: 68 switches and 1,000 NICs
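To put those numbers in perspective, a quick back-of-the-envelope calculation (a sketch based only on the component counts listed above; per-node configuration beyond the rack layout is not specified) gives the cluster's aggregate memory and raw storage:

```python
# Rough aggregate totals for the Analytics Workbench, derived from the
# partner-contributed component counts in the list above.

data_nodes = 50 * 20            # 50 racks x 20 data nodes per rack
dimms, dimm_gb = 6000, 8        # Micron: 6,000 x 8GB DIMMs
drives, drive_tb = 12000, 2     # Seagate: 12,000 x 2TB drives

total_ram_tb = dimms * dimm_gb / 1024    # GB -> TB
raw_storage_pb = drives * drive_tb / 1000  # TB -> PB

print(f"{data_nodes} data nodes")
print(f"~{total_ram_tb:.0f} TB total RAM")
print(f"{raw_storage_pb:.0f} PB raw disk")
```

That works out to roughly 47 TB of RAM and 24 PB of raw disk across the 1,000 data nodes – before HDFS replication, which would typically divide usable capacity by three.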

How do I get access to this cluster?

Currently, the cluster is available to our internal engineering team to develop and validate Greenplum HD features at scale. Not only can we add more robust features to our Hadoop distribution, but we can also contribute these features back to the open-source Apache Hadoop community. This summer, the cluster will also be made available to members of Greenplum’s training and certification classes for Hadoop. A unique aspect of Greenplum’s Hadoop training program is that any individual who successfully completes the course will be granted access to the 1,000-node cluster to use as a sandbox environment. We are also working on an onboarding process for external users, such as academia, to not only facilitate Big Data research projects but also accelerate Hadoop development. For more information on how to apply for the onboarding process, feel free to contact Tashneem.maistry@emc.com.

About the Author: Mona Patel