Hadoop, HDFS, MapReduce and Spark on Big Data

For the past few years, more and more companies have become interested in starting big data projects. There are many big data technologies on the market right now, like Hadoop, the Hadoop Distributed File System (HDFS), MapReduce, Spark, Hive, Pig, and many more. Although Hadoop is the most popular in this space, Spark has gained a lot of attention since 2014. For people interested in implementing big data in the near future, one of the most common questions is: which one should we use, Hadoop or Spark? It is a great question, but it may not be an apples-to-apples comparison.
Hadoop refers to a full ecosystem that includes core projects like HDFS, MapReduce, Hive, Flume, and HBase. It has been around for about 10 years and has proven to be
the solution of choice for processing large datasets. Hadoop has three major layers:

  • HDFS for storage
  • YARN for resource management
  • MapReduce as the computation engine

In other words, HDFS is Hadoop’s storage layer, while MapReduce is Hadoop’s processing framework, providing automatic task parallelization and distribution. The following diagram shows the components of Hadoop.
Spark targets the computation engine layer and can solve similar problems as MapReduce does, but with an in-memory approach. So MapReduce and Spark are the apples-to-apples comparison. The following are the major differences between the two:

Response Speed
One major issue with MapReduce is that it needs to persist intermediate data back to disk after each map or reduce action. A MapReduce job can write and read intermediate data on disk many times during execution, depending on the total number of Mappers, Reducers, and Combiners in the job configuration. Spark improves on this I/O read/write performance problem by processing intermediate data in memory for fast computation. But Spark does need significantly more memory than MapReduce for caching.
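To illustrate where that intermediate data comes from, here is a minimal pure-Python sketch of the three conceptual phases of a word count. This is only an illustration of the data flow, not real MapReduce or Spark code: in MapReduce, the output of the map and shuffle phases is what gets spilled to disk, while Spark keeps the same intermediate results in memory.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle: group values by key. In MapReduce this intermediate
    # data is persisted to disk between phases; Spark keeps it in memory.
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big deal", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'deal': 1}
```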

Ease of Use
Another significant improvement in Spark over MapReduce is that Spark not only brings data into memory for computation, but is also easier to use for development. Spark supports Python and Scala in addition to Java, and has many more APIs than MapReduce.
Here is the sample code for WordCount in Hadoop MapReduce.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// These two classes are nested inside the job's driver class.
public static class WordCountMapClass extends MapReduceBase
  implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
       OutputCollector<Text, IntWritable> output,
       Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
       word.set(itr.nextToken());
       output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
  implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
       OutputCollector<Text, IntWritable> output,
       Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
       sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

The following is the sample code to perform the same work in Spark. Since I like writing code in Python, here is the Python implementation.

text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

Hardware Cost
Spark does require investing more in memory, which has an impact on your hardware budget, while MapReduce can use regular commodity hardware.

Storage Model
MapReduce requires HDFS storage for job execution. Spark doesn’t provide a storage system and runs on top of one. Spark decouples the computation model from the storage model: it provides an alternative computation model, abstracting data files as RDDs, while working with different storage systems, like Hadoop HDFS, Amazon S3, Cassandra, or flat files. By the way, S3 is Amazon Web Services’ solution for handling large files in the cloud. If running Spark on Amazon, you want your Amazon EC2 compute nodes in the same region as your S3 files to improve speed and save cost.
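A toy sketch of what that decoupling means in practice: the computation code stays the same, and only the URI naming the storage backend changes. The reader functions and paths below are made up purely for illustration; real Spark plugs Hadoop input formats in behind the same textFile() call.

```python
from urllib.parse import urlparse

# Hypothetical readers for each storage backend (illustration only).
READERS = {
    "hdfs": lambda uri: f"read {uri} from HDFS",
    "s3":   lambda uri: f"read {uri} from Amazon S3",
    "file": lambda uri: f"read {uri} from the local filesystem",
}

def read_text(uri):
    # Dispatch on the URI scheme; the word-count logic downstream
    # would be identical regardless of where the bytes came from.
    scheme = urlparse(uri).scheme or "file"
    return READERS[scheme](uri)

print(read_text("hdfs://namenode/data/logs.txt"))
print(read_text("s3://my-bucket/data/logs.txt"))
```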

Related Technologies
MapReduce is suited to batch-oriented applications. If you want a real-time option, you need to use another platform, like Impala or Storm. For graph processing, you need to use Giraph. For machine learning, people used to use Apache Mahout, but now people favor Spark for machine learning.

Spark has a single core library that supports not only data computation, but also data streaming, interactive queries, graph processing, and machine learning. The following is the diagram of Spark components:
Hadoop MapReduce is a mature platform, is the primary framework on Hadoop for distributed processing, and has a large number of production deployments across many enterprises. It does have the limitations of high latency and batch orientation.
With a rich set of APIs and in-memory processing, Spark can support both batch and real-time workloads and is gradually gaining ground in distributed processing.

There are several Apache Hadoop distributors in the big data space, like Cloudera, Hortonworks, and MapR. Cloudera’s CDH distribution is one of the most commonly used. It not only includes Hadoop, Hive, Spark, and YARN, but also has its own Impala for ad-hoc queries and Cloudera Manager for Hadoop configuration and management.

So what is the usual starting point for a company just getting into the big data space? There are two major types of big data projects:
1) Pure Big Data Project. The data comes from various sources: RDBMSs, logging data, streaming data, and event-triggered data. The data can be structured or unstructured.
After the data is populated in Hadoop HDFS storage, all further processing happens in the Hadoop ecosystem, with no interaction between Hadoop and the RDBMS during the processing stage.
This type of project usually requires changes in workflow and significant new code. It is typically used in scenarios where the project is working on something brand new and no work has been done in the space. In other words, writing a lot of new code is not an issue because there is no existing code anyway.

2) Hybrid Big Data Project. Some data, like historical data or certain large tables or table partitions, is offloaded from the RDBMS to HDFS. Further processing involves
the RDBMS and the Hadoop ecosystem working together. This type of project usually requires keeping the majority of the current workflow and modifying certain existing code, especially keeping changes in SQL to a minimal level. The Hybrid Big Data project requires significantly more knowledge of both Big Data and RDBMSs, as tuning happens not only in the Hadoop ecosystem, but also on the RDBMS side.

In this blog, I will focus on the first type of project, the Pure Big Data project, and will discuss the Hybrid Big Data project in a separate post in the future.

Of course, there are many approaches to Pure Big Data projects. Here is one approach I usually recommend.
1) Build the Hadoop cluster using one of the major distributions: Cloudera CDH, Hortonworks, or MapR.
2) Define the process or workflow that regularly puts files on HDFS. It doesn’t matter where the source data comes from. If the data comes from an RDBMS, you can use Sqoop, or just flat CSV files exported from the RDBMS. If the data is streaming data, you can use Flume or other tools to ingest the data into HDFS.
3) Understand the data structure on HDFS, define your Hive table DDLs, and then point the locations of the Hive tables to the file locations on HDFS. Yes, Hive might be slow for certain processing, but it is usually a good start to verify that your big data ecosystem is performing normally.
If you’re using Cloudera CDH, you can use Impala to perform ad-hoc queries; it has much better performance than Hive. For resource allocation and scheduling, you can use YARN.
4) If the business requirements for the big data project are machine learning, iterative data mining, or real-time stream processing, Spark is usually the choice. If the dataset during the batch processing stage is very large and cannot fit into memory, use traditional Hadoop MapReduce to process the data.
5) Result analysis.
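As a sketch of step 3, an external Hive table can be pointed at an existing HDFS directory. The table name, columns, delimiter, and path below are made up for illustration; adapt them to your actual file layout.

```sql
-- Hypothetical example: map a Hive table onto CSV files already on HDFS.
CREATE EXTERNAL TABLE web_logs (
  event_time  STRING,
  user_id     BIGINT,
  url         STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/web_logs';
```

Because the table is EXTERNAL, dropping it removes only the metadata; the files on HDFS stay in place for other tools like Impala or Spark to read.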

Finally, as a general rule, I would recommend using Hadoop HDFS as the data storage framework; HDFS is a great solution for permanent storage. Use Spark for fast in-memory processing and its built-in machine learning capability. Use MapReduce if the data cannot fit into memory during batch processing. Use Hive and Impala for quick SQL access, and use Pig to save time writing MapReduce jobs. From a learning perspective, it makes more sense to me for beginners to start with Hadoop HDFS, MapReduce, Impala, and Hive, and then Spark.

Of course, it also depends on what kinds of problems need to be solved, the available budget, and the completion timeline. If the problem can be solved offline, response time is not critical, and the budget is very limited, Hive and Impala can do the work. If you want fast response times, you need to consider Spark, but the hardware could be expensive given Spark’s memory requirements. In the end, you want to identify what problem big data is trying to solve and stay flexible about using the right tool for the solution.