Hadoop and MapReduce (MR) lower the entry barrier for Big Data processing by making data-intensive processing easy and cost-effective. It is easy because programmers need only write and deploy Mapper and Reducer tasks, while Hadoop handles much of the scaffolding. It is inexpensive because Hadoop MR jobs can run on commodity servers, avoiding the need to deploy specialized (read: costly) hardware. Still, Hadoop and MR have limitations.
• Hadoop and MR are designed for “batch-oriented” processing of large-scale data rather than for interactive use. Their emphasis on high throughput over low latency works well as long as jobs run in the background without any human interaction. However, many classes of applications do require low-latency analysis. For example, a credit card fraud detection application requires real-time results.
• Similarly, interactive/exploratory querying isn’t served well by Hadoop MR. This is because Hadoop launches new MR jobs and loads data from disk for each job, irrespective of the data access pattern and history. Combine this with the high latency of an individual MR job and users’ general expectation of sub-second response times, and it’s easy to see why interactive analysis with Hadoop MR is seen as impractical.
• Moreover, Hadoop MR is not well suited to computations involving iterative processing. Because Hadoop doesn’t keep working sets of data in memory, each iteration incurs the overhead of fetching the data again. Iterative computations are common in many applications, including machine learning and graph processing algorithms.
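The reloading cost described in the last two points can be illustrated with a toy sketch. This is not Hadoop code: `SimulatedDisk`, `step`, `iterate_reloading`, and `iterate_cached` are hypothetical names invented for illustration, and the "disk" is just a counter that records how often the dataset is read. The sketch assumes an iterative job that refines an estimate over the same input on every pass.

```python
# Toy model (an assumption for illustration, not Hadoop's API):
# count simulated disk reads for an iterative computation.


class SimulatedDisk:
    """Hypothetical stand-in for on-disk storage; counts dataset reads."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def load(self):
        # Each call models one full read of the input from disk.
        self.reads += 1
        return list(self._records)


def step(data, estimate):
    """One iteration: move the estimate halfway toward the data mean."""
    mean = sum(data) / len(data)
    return estimate + 0.5 * (mean - estimate)


def iterate_reloading(disk, iterations):
    """Batch-job style: every iteration reloads its input from disk."""
    estimate = 0.0
    for _ in range(iterations):
        data = disk.load()
        estimate = step(data, estimate)
    return estimate


def iterate_cached(disk, iterations):
    """In-memory style: load the working set once, then reuse it."""
    data = disk.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(data, estimate)
    return estimate
```

Running both functions for the same number of iterations over the same records gives the same estimate, but the reloading version reads the dataset once per iteration while the cached version reads it once in total. That difference is the overhead that in-memory alternatives aim to remove.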
In this Knowledge Sharing article, Ravi Sharda provides an overview of various new and upcoming alternatives to Hadoop MR. Some of these alternatives work at the infrastructure level, replacing the guts of Hadoop while keeping interface-level semantic compatibility with Hadoop. Others re-use the existing Hadoop infrastructure but provide a new higher-level interface designed to address some of its limitations. Prominent examples include Apache Spark, Apache Drill, Yet Another Resource Negotiator (YARN), Apache Tez, HaLoop, Apache Hama, and Apache Giraph.