How to Manage Rapidly-Expanding Data and Infrastructure

Since Hadoop’s inception, both data analytics and analytics infrastructures have grown and evolved tremendously. A paradigm shift is now underway that is poised to further transform the way enterprises manage their rapidly expanding data and supporting infrastructure.

The paradigm shift is taking place around analytics architectures – specifically Hadoop-based architectures – with respect to deployment. Let’s begin by looking at how many analytics projects get started within an enterprise.

How Analytics Projects Start

Most projects start with some type of discovery process. Take, for example, a team of scientists looking to find that “golden nugget” within their data which might offer a significant value-add to their business. These scientists might work with their IT department to set up a small Hadoop cluster, to load the data and begin the iterative process of data visualization, cleansing and testing until the hypothesis is either validated or disproven. If the hypothesis is proven, the ideas are implemented and eventually go live – success!

What happens next – continuing with the example of our successful scientists – is that other departments take notice of that success, and want to utilize the data platform so they too can benefit. In most cases, they’ll not only use the existing data in the cluster, but also bring in new external or internal data and new applications for their projects. This starts to increase the physical size of the infrastructure.

The issue we hear most often from customers at this point is how to effectively grow and manage that ever-expanding cluster as more data and applications go into it. The inherent problem with such expanding architectures is that they quickly become too complex to maintain. There are a lot of spinning disks, there is the unavoidable 3x data replication to maintain, and these clusters quickly become very cost-inefficient.
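To put a number on the replication overhead, here is a minimal sketch of the arithmetic. The function name `raw_capacity_tb` and the 100 TB figure are illustrative assumptions, not values from any particular deployment; the factor of 3 is the 3x replication mentioned above.

```python
def raw_capacity_tb(usable_tb: float, replication: int = 3) -> float:
    """Return the raw disk capacity needed to hold `usable_tb` of data.

    `replication` is the number of copies kept of every block
    (3 matches the 3x replication discussed in the text).
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return usable_tb * replication


if __name__ == "__main__":
    # Hypothetical dataset size, chosen only for illustration.
    dataset_tb = 100
    print(f"{dataset_tb} TB of data needs {raw_capacity_tb(dataset_tb)} TB of raw disk")
```

Every additional terabyte a new department brings in costs three terabytes of disk, which is why these clusters grow so quickly.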

How to Manage Their Growth

There are three potential approaches to consider:

  • Add more nodes, through direct attached storage (DAS), adding both compute and storage
  • De-couple compute and storage, through network attached scale out storage (NAS)
  • A hybrid architecture (i.e., a tiered storage Hadoop architecture)

With traditional Hadoop deployments, the primary way to expand the infrastructure is to add more nodes with direct attached storage (DAS). This adds storage and compute power at the same time. The challenge here is that as the number of applications and the amount of data grow, data access patterns change. Different applications require different performance environments – some need more compute power, others are more storage dependent.

Adding more nodes can be inefficient if you only need more storage – you’ll overspend on compute power you might not need. Likewise, if you only need compute power, you’re also potentially purchasing storage that will go unused.
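The overspend is easy to see with a rough sizing calculation. The sketch below assumes a hypothetical node type (`NodeSpec`) with made-up core and disk figures; it simply counts how many identical DAS nodes are needed to satisfy both demands and how much of the purchased capacity sits idle.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """A hypothetical DAS node: compute and storage are bought together."""
    cores: int
    raw_tb: float


def das_nodes_needed(spec: NodeSpec, cores_needed: int, raw_tb_needed: float) -> int:
    """Nodes required when compute and storage can only scale together."""
    by_compute = math.ceil(cores_needed / spec.cores)
    by_storage = math.ceil(raw_tb_needed / spec.raw_tb)
    # The larger of the two demands dictates the node count.
    return max(by_compute, by_storage)


def stranded_resources(spec: NodeSpec, cores_needed: int, raw_tb_needed: float):
    """Return (idle_cores, idle_raw_tb) purchased but not required."""
    nodes = das_nodes_needed(spec, cores_needed, raw_tb_needed)
    return nodes * spec.cores - cores_needed, nodes * spec.raw_tb - raw_tb_needed


if __name__ == "__main__":
    # Illustrative numbers only: a storage-heavy workload.
    spec = NodeSpec(cores=32, raw_tb=48)
    idle_cores, idle_tb = stranded_resources(spec, cores_needed=64, raw_tb_needed=960)
    print(f"idle cores: {idle_cores}, idle raw TB: {idle_tb}")
```

With these assumed figures, storage demand forces 20 nodes even though two would cover the compute requirement, so most of the cores are paid for and never used. Swap the inputs for a compute-heavy workload and the idle resource becomes disk instead.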

Which brings us to the second approach. As the number of clusters grows in an enterprise, the inherent complexity of maintaining them grows as well, and that complexity doesn’t always grow linearly. By separating compute and storage, and incorporating an enterprise-grade storage solution that can perform data-level functions such as governance, security, encryption, user management, data access patterns and multitenancy, much of the inherent complexity associated with growing data volumes can be mitigated.

This approach allows users to adjust to shifting application performance requirements as necessary. For example, if more compute power is needed, add more servers. In a virtualized environment, you can simply spin up more compute nodes to address compute challenges. This elastic type virtualized environment can be implemented far more quickly than deploying physical hardware.

The same is true for storage-dependent applications. For those apps, moving away from traditional DAS infrastructures and leveraging network-attached storage (NAS) allows you to leave compute power as is and focus on increasing storage. This is cost effective since you don’t have to pay for compute resources – it also reduces the data footprint in the data center, including costs and resources associated with cooling and maintenance of server hardware.

The third option is a hybrid tiered storage environment, which addresses many of the challenges brought about by the paradigm shift mentioned earlier. It also embodies the idea of separating storage from compute, but in addition it allows enterprises to tier their data based on its temperature: “hot” (frequently accessed) data or “cold” (archived, less frequently accessed) data. The longer data exists, the more it cools. What makes sense in this type of environment is to have a cheaper, deeper NAS solution extending your Hadoop namespace. Enterprises can increase their storage footprint by adding less expensive NAS environments that don’t require data replication, essentially providing cheaper archives, without adding compute.
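As a rough illustration of temperature-based tiering, the sketch below labels a dataset “hot” or “cold” from the time it was last accessed. The 90-day window, the function name `tier_for` and the sample dates are all assumptions made for the example; a real policy would be driven by your own access patterns and storage tooling.

```python
from datetime import datetime, timedelta

# Assumed cutoff: data untouched for longer than this is treated as cold.
HOT_WINDOW = timedelta(days=90)


def tier_for(last_access: datetime, now: datetime,
             hot_window: timedelta = HOT_WINDOW) -> str:
    """Classify data as 'hot' or 'cold' by how recently it was accessed."""
    return "hot" if now - last_access <= hot_window else "cold"


if __name__ == "__main__":
    now = datetime(2017, 6, 1)
    # Hypothetical datasets and their last-access timestamps.
    datasets = {
        "clickstream_current": datetime(2017, 5, 28),
        "clickstream_2014": datetime(2015, 1, 10),
    }
    for name, last_access in datasets.items():
        print(name, "->", tier_for(last_access, now))
```

Data labeled cold is the candidate for the cheaper, deeper NAS tier, while hot data stays on the storage closest to compute.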

How to Take Advantage of Data Lakes

Bringing a true, multiprotocol NAS product into the Hadoop environment enables enterprises to take advantage of that magic “DL” word: data lakes.

Briefly, I define a data lake as a means of storing data that gives an organization full, multi-tenant, secure and scalable access to all of its data, at all times, in line with the organization’s requirements. In other words, a data lake allows access by any and all applications, regardless of the connectivity requirements for that data, while maintaining the data in one central location.

Data lakes provide access to the “Three V’s of Big Data”:

  • Velocity – You can grow and manage your compute size based on application demand
  • Volume – You can handle the data volumes coming in
  • Variety – You can manage different data sources providing different types of data, which require a variety of gateways to applications to connect and access that data securely

As data volumes continue to expand exponentially, and Hadoop analytics architectures continue to grow within the enterprise, the demand for enterprise-grade, highly available, highly secure, extremely scalable storage architecture will only continue to grow as well. My recommendation is to consider either decoupling your storage and compute, or looking at a tiered Hadoop storage architecture.

About the Author: Hamid Djam

Hamid is the CTO of Analytics for EMC’s Emerging Technology Division. He joined EMC in 2010 as one of the founding members of the Big Data Solutions Engineering and Product Management team, where he successfully set the direction for the Analytics product and solutions roadmap, brought in 3rd party partners and integrators such as Attunity, and was instrumental in integrating Attunity as the Greenplum CDC engine. He assisted in the development of the Brazil Big Data R&D center, and led global partner training and sales enablement efforts. Hamid also won the annual EMC innovation award for multi-tenant Hadoop implementations and registered a patent for Genomic Data Store on Hadoop in 2013. Most recently, Hamid led Big Data solutions development for the VCE/Global Solutions Engineering organization, developing proven solutions, reference architectures and best practices. Prior to joining EMC, Hamid held various engineering, pre-sales, post-sales, and leadership roles in pioneering startups such as Greenplum (Pivotal) and DatAllegro (Microsoft PDW/APS), as well as engineering Product Management roles at Oracle (Exadata) and Teradata. He has excellent knowledge and understanding of, and a great passion for, analytics, MPP and distributed computing, and enjoys learning about new advances in technology. He is passionate about the world of data analytics and improving the world through predictive data modeling. He holds a B.S. in Software Engineering from California State University Fullerton. He has also completed numerous data modeling and statistical analysis courses at University of California Irvine. Hamid regularly speaks at conferences on Big Data, Data Science, and Analytics. He is a polyglot and is fluent in Dutch, Farsi and Spanish and conversant in German and Italian.