Major functions and components of Hadoop for big data

By Indra Giri & Priya Chetty on April 4, 2017

With the increasing use of big data applications across industries, Hadoop has gained popularity as a data analysis platform over the last decade. It is an open-source framework that provides a distributed file system for big data sets and lets users process and transform those data sets into useful information using the MapReduce programming model (White, 2009).
Most of the Hadoop framework is written in Java, with some code in C, and its API is Java-based. However, programs written in other languages such as Python can also run on the framework through a utility known as Hadoop Streaming.
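As a rough illustration of the streaming approach, the sketch below shows a word-count mapper and reducer written in Python. It assumes the usual streaming convention that each program reads lines of text and writes tab-separated key and value pairs, and that the framework sorts the mapper output by key before the reducer sees it. The function names map_lines and reduce_lines are hypothetical, and in-memory lists stand in for standard input and output so the example can run without a cluster.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key.

    The input must already be sorted by key, which the sort step
    between map and reduce is assumed to provide.
    """
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local stand-in for a streaming job: map, sort by key, then reduce.
sample = ["Hadoop stores big data", "Hadoop processes big data"]
mapped = sorted(map_lines(sample))
result = list(reduce_lines(mapped))
print(result)
```

In an actual streaming job, each function would be wrapped in its own script that iterates over standard input and prints each emitted line.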

Two major functions of Hadoop

Hadoop has two major functions. First, it provides a distributed file system for big data sets. Second, it transforms those data sets into useful information using the MapReduce programming model. Big data sets are generally hundreds of gigabytes in size or larger. For data of this size, Hadoop provides the Hadoop Distributed File System (HDFS), which stores the data across clusters of commodity machines and allows it to be accessed in parallel.

HDFS replicates each block of a data set on several commodity machines (three by default), which makes storage and processing more reliable and robust. If a node fails, Hadoop can detect the failure and restart the task on another healthy node. The framework is also highly scalable and can be reconfigured at any time as the user's needs grow. Setting up the Hadoop framework on a machine does not require any major hardware change. The machine only needs to meet basic minimum requirements for RAM, disk space and operating system, and these are easy to upgrade where they fall short (Taylor, 2010).
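The effect of replication on reliability can be sketched with a toy model. The example below assumes a replication factor of three (the HDFS default) and a simple round-robin placement; real HDFS placement also takes racks into account, and the function and node names here are hypothetical.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one live replica."""
    return [
        block
        for block, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    ]


nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(num_blocks=6, nodes=nodes)

# Losing one node leaves every block readable from another replica.
print(readable_blocks(placement, failed_nodes={"node2"}))
```

In this model a block becomes unreadable only when all three of its replicas sit on failed nodes, which is why a single node failure does not lose data.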

Major components

The major components of the Hadoop framework include:

Hadoop Common
Hadoop Distributed File System (HDFS)
MapReduce
Hadoop YARN

Hadoop Common is the most essential part of the framework. It contains the utilities and libraries used by the other modules, and it provides components and interfaces for distributed file systems and general I/O. These include serialization, Java RPC (Remote Procedure Call) and file-based data structures. It was known as Hadoop Core before July 2009, after which it was renamed Hadoop Common (The Apache Software Foundation, 2014).

Hadoop distributed file system (HDFS)

HDFS is the distributed file system that comes with the Hadoop framework. It can store very large data sets, ranging from gigabytes to petabytes in size (Borthakur, 2008). It is built around the write-once, read-many-times data processing pattern: a data set is written to the cluster once and then read and processed as many times as required. HDFS is organised like a tree, with a name node (the master) at the root and data nodes (the workers) beneath it. The data nodes are the commodity machines where the data is stored. The name node manages the file system namespace and keeps track of which data nodes hold the blocks of each file. Task scheduling is handled separately: in Hadoop 1 it is the job of the MapReduce job tracker, which is a distinct daemon from the name node. Here is a basic diagram of HDFS architecture.

Figure: The basic structure of the HDFS system
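The division of labour between the name node and the data nodes can be sketched in a few lines. This is a toy model under stated assumptions, not the HDFS API: the class and function names are hypothetical, the block size is four characters so the example stays readable, and replication is left out.

```python
class NameNode:
    """Master: keeps only metadata about files, blocks and their locations."""

    def __init__(self, node_names):
        self.node_names = node_names
        self.files = {}  # file name -> list of (block id, data node name)

    def allocate(self, filename, num_blocks):
        """Record where each block of a new file should live."""
        if filename in self.files:
            raise FileExistsError(f"{filename} is write-once")
        entries = [
            (f"{filename}#{i}", self.node_names[i % len(self.node_names)])
            for i in range(num_blocks)
        ]
        self.files[filename] = entries
        return entries

    def locate(self, filename):
        """Tell a client which data node holds each block of a file."""
        return self.files[filename]


def write_file(name_node, data_nodes, filename, content, block_size=4):
    """Split content into blocks and store each on its assigned data node."""
    chunks = [content[i:i + block_size] for i in range(0, len(content), block_size)]
    entries = name_node.allocate(filename, len(chunks))
    for (block_id, node_name), chunk in zip(entries, chunks):
        data_nodes[node_name][block_id] = chunk


def read_file(name_node, data_nodes, filename):
    """Ask the name node for block locations, then read from the data nodes."""
    return "".join(
        data_nodes[node_name][block_id]
        for block_id, node_name in name_node.locate(filename)
    )


data_nodes = {"dn1": {}, "dn2": {}, "dn3": {}}  # each worker stores block contents
name_node = NameNode(list(data_nodes))
write_file(name_node, data_nodes, "log.txt", "write once, read many")
print(name_node.locate("log.txt"))
print(read_file(name_node, data_nodes, "log.txt"))
```

The name node never touches the file contents in this sketch; it only answers the question of where each block lives, and a second write to the same file name is rejected to mirror the write-once pattern.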

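HDFS has a few disadvantages. For example, it is not a good fit for applications that need low-latency access to data, because it is designed for high throughput rather than fast response. Similarly, HDFS is not suitable for data sets made up of a large number of small files (White, 2009).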
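The small-files problem comes from the name node holding the metadata for every file and block in memory. A back-of-the-envelope estimate is sketched below. It assumes a rule of thumb of roughly 150 bytes of name node memory per file or block entry and a block size of 128 MB; both figures are assumptions for illustration and vary between versions and configurations.

```python
import math

BYTES_PER_OBJECT = 150          # assumed rule-of-thumb cost per file or block entry
BLOCK_SIZE = 128 * 1024 ** 2    # assumed block size of 128 MB


def namenode_memory(num_files, file_size):
    """Estimate name node memory for `num_files` files of `file_size` bytes."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT


total = 1024 ** 4  # the same 1 TB of data, stored two ways
small = namenode_memory(total // (1024 ** 2), 1024 ** 2)  # as 1 MB files
large = namenode_memory(total // (1024 ** 3), 1024 ** 3)  # as 1 GB files
print(small, large)
```

Under these assumptions, storing the same terabyte as 1 MB files costs the name node a few hundred times more memory than storing it as 1 GB files, even though the amount of data is identical.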

MapReduce

This component is the implementation of the MapReduce programming model, used for processing large distributed data sets in parallel. MapReduce is a process of two phases: the Map phase and the Reduce phase. The Map phase takes in a set of data and breaks it down into key-value pairs. The output of the Map phase becomes the input of the Reduce phase, where it is reduced to a smaller set of key-value pairs. The key-value pairs produced by the Reduce phase are the final output of the MapReduce process (Taylor, 2010).
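The two phases can be illustrated with a single-machine sketch. The example below finds the maximum temperature recorded per year from made-up sample readings; run_mapreduce, mapper and reducer are hypothetical names, and the grouping step between the phases is a simplified stand-in for what the framework does across a cluster.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    # Map phase: every input record yields zero or more (key, value) pairs.
    intermediate = [pair for record in records for pair in mapper(record)]

    # Group all values that share a key before handing them to reduce.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: collapse each key's values into one output pair.
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


def mapper(record):
    """Turn one 'year,temperature' line into a (year, temperature) pair."""
    year, temperature = record.split(",")
    yield year, int(temperature)


def reducer(year, temperatures):
    """Keep only the highest temperature seen for the year."""
    return year, max(temperatures)


readings = ["1949,111", "1950,0", "1949,78", "1950,22", "1950,-11"]  # made-up data
print(run_mapreduce(readings, mapper, reducer))
```

Five input records become five intermediate pairs in the Map phase, and the Reduce phase shrinks them to one pair per year.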

Note that the Reduce phase takes place only after the Map phase has completed; until then it remains blocked. The main advantage of the MapReduce paradigm is that it allows the data to be processed in parallel over a large cluster of commodity machines, so that more data is processed in less time (White, 2009).
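The ordering constraint can be seen in a small sketch where map tasks run concurrently and the reduce step starts only once all of them have returned. Threads from the Python standard library stand in for cluster nodes here, which is an assumption made purely for illustration; the function names and input splits are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_task(split):
    """Count words in one input split, independently of the others."""
    return Counter(split.split())


def reduce_task(partial_counts):
    """Merge the per-split counts into a single result."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


splits = ["big data big", "data cluster", "big cluster cluster"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # list() waits for every map task, so reduce starts only afterwards.
    partials = list(pool.map(map_task, splits))

totals = reduce_task(partials)
print(dict(totals))
```

Because each map task touches only its own split, adding more workers lets more splits be processed at the same time without changing the result.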

Hadoop yet another resource negotiator (YARN)

YARN is the framework responsible for managing the resources of the cluster's commodity machines and scheduling the jobs that run on them (Vavilapalli et al., 2013). It defines how the available system resources are used by the nodes and how the assigned jobs are scheduled. It is one of the major features of Hadoop 2. YARN runs the next generation of MapReduce, also known as MapReduce 2, which has many advantages over the traditional one. In particular, YARN does not hit the scalability bottlenecks that affected the traditional MapReduce paradigm.

In traditional MapReduce, the job tracker has two major responsibilities: job scheduling and monitoring the progress of the various tasks. YARN divides these between two independent daemons. A resource manager takes care of the system resources to be assigned across the cluster, while a per-application application master takes responsibility for the tasks of its application running on the nodes. YARN has also made it possible for users to run different versions of MapReduce on the same cluster to suit their requirements, which makes the cluster more manageable.
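The split of responsibilities can be sketched as two small classes: one that knows only about cluster capacity and one that knows only about a single application's tasks. This is a toy model, not the YARN API. The class and method names are hypothetical, memory is the only resource considered, and the second class plays the role that Vavilapalli et al. (2013) call the application master.

```python
class ResourceManager:
    """Cluster-wide daemon: knows only about free capacity per node."""

    def __init__(self, free_memory):
        self.free_memory = dict(free_memory)  # node name -> free MB

    def allocate(self, memory_mb):
        """Grant a container on the first node with enough free memory."""
        for node, free in self.free_memory.items():
            if free >= memory_mb:
                self.free_memory[node] = free - memory_mb
                return node
        return None  # nothing available right now


class ApplicationMaster:
    """Per-application daemon: requests containers and tracks its own tasks."""

    def __init__(self, name, tasks, memory_per_task):
        self.name = name
        self.pending = list(tasks)
        self.running = {}  # task -> node it was placed on
        self.memory_per_task = memory_per_task

    def request_containers(self, resource_manager):
        """Start as many pending tasks as the cluster can currently host."""
        while self.pending:
            node = resource_manager.allocate(self.memory_per_task)
            if node is None:
                break  # wait until resources free up
            self.running[self.pending.pop(0)] = node


rm = ResourceManager({"node1": 2048, "node2": 1024})
app = ApplicationMaster("wordcount", ["map-0", "map-1", "map-2", "map-3"], 1024)
app.request_containers(rm)
print(app.running, app.pending)
```

Because the resource manager does not track individual tasks, several application masters, each with its own scheduling logic, could share the same cluster in this model.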

Increasing importance of Hadoop

In April 2008, a Hadoop-based program running on a 910-node cluster broke a world record by sorting one terabyte of data in just 209 seconds (Taylor, 2010). Since then, Hadoop has seen growing use across industries, from data science to bioinformatics. It has developed considerably over the last decade, and Hadoop 2 is the result. Low-cost implementation and easy scalability are the features that attract users and make it so popular.

References

Borthakur, D. (2008). HDFS Architecture Guide.
Pessach, Y. (2013). Distributed Storage: Concepts, Algorithms, and Implementations.
Taylor, R. C. (2010). An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. Proceedings of the 11th Annual Bioinformatics Open Source Conference (BOSC) 2010.
The Apache Software Foundation. (2014). What Is Apache Hadoop?
Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., … Saha, B. (2013). Apache Hadoop YARN: yet another resource negotiator. Proceedings of the 4th Annual Symposium on Cloud Computing.
White, T. (2009). Hadoop: The definitive guide.
