With the increasing use of big data applications across industries, Hadoop has gained popularity in data analysis over the last decade. It is an open-source framework that provides a distributed file system for big data sets. This allows users to process and transform big data sets into useful information using the MapReduce programming model of data processing (White, 2009).
Most of the Hadoop framework is written in Java, with some code written in C, and it exposes a Java-based API. However, programs written in other languages, such as Python, can also use the framework through a utility known as Hadoop streaming.
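To make the Hadoop streaming idea more concrete, here is a minimal sketch of a word-count mapper and reducer in Python. It assumes the usual streaming convention in which each step reads text lines and emits tab-separated key-value lines, and in which the reducer receives its input sorted by key. The function names `map_lines` and `reduce_lines` and the sample sentences are illustrative assumptions, not part of any Hadoop API.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per word; the input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation with made-up input; sorted() stands in for the
# sort that Hadoop performs between the map and reduce steps.
sample = ["Hadoop stores data", "Hadoop processes data"]
mapped = sorted(map_lines(sample))
result = list(reduce_lines(mapped))
print(result)
```

In an actual streaming job, each function would live in its own script and be fed from standard input, and the framework would perform the sort between the two steps.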
Two major functions of Hadoop
The first is to provide a distributed file system for big data sets. The second is to transform those data sets into useful information using the MapReduce programming model. Big data sets are typically hundreds of gigabytes in size or larger. For data sets this large, Hadoop provides a distributed file system (HDFS), which stores them across a cluster of commodity machines and allows them to be accessed in parallel.
HDFS splits each data set into blocks and replicates every block on several commodity machines (three by default), which makes the system more reliable and robust. If a node fails, Hadoop can detect the failure and restart the affected task on another healthy node that holds a copy of the data. The framework is also highly scalable and can be reconfigured as the needs of the user grow. Setting up the Hadoop framework on a machine does not require any major hardware change. The machine only needs to meet some basic minimum requirements for RAM, disk space and operating system. These requirements are easy to meet through an upgrade if a machine falls short of them (Taylor, 2010).
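The effect of replication on fault tolerance can be illustrated with a small simulation. The sketch below is a simplification under stated assumptions: it places replicas round-robin, whereas real HDFS uses a rack-aware placement policy, and the helper names `place_replicas` and `readable_blocks` and the node names are hypothetical. The replication factor of 3 matches the HDFS default.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified stand-in for HDFS block placement (assumption).
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed):
    """Return the blocks that keep at least one replica on a healthy node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]


nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(num_blocks=5, nodes=nodes)

# A single failed node leaves every block readable from another replica.
after_one_failure = readable_blocks(placement, failed={"node2"})
# Losing three of the four nodes wipes out all replicas of some blocks.
after_three_failures = readable_blocks(placement, failed={"node1", "node2", "node3"})
print(after_one_failure, after_three_failures)
```

With three replicas per block, data only becomes unavailable when every node holding a given block fails at the same time.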
The major components of Hadoop framework include:
- Hadoop Common
- Hadoop Distributed File System (HDFS)
- Hadoop MapReduce
- Hadoop YARN
Hadoop Common is the most essential part of the framework. It contains the utilities and libraries used by the other modules and provides components and interfaces for distributed file systems and general I/O. These include serialization, Java RPC (Remote Procedure Call) and file-based data structures. It was known as Hadoop Core until July 2009, when it was renamed Hadoop Common (The Apache Software Foundation, 2014).
Hadoop Distributed File System (HDFS)
HDFS is the distributed file system that comes with the Hadoop framework. It can store very large datasets, ranging from gigabytes to petabytes in size (Borthakur, 2008). It is built around a write-once, read-many-times data processing pattern. In other words, a dataset is typically written to HDFS once and then read and processed as many times as required. HDFS follows a master-worker architecture in which a namenode (the master) manages a number of datanodes (the workers). The datanodes run on the commodity machines and store the actual data blocks, while the namenode maintains the filesystem namespace, that is, the directory tree and the record of which datanodes hold the blocks of each file. Scheduling of the tasks to be performed is handled separately by the jobtracker of the MapReduce layer. Here is a basic diagram of the HDFS architecture.
HDFS has a few disadvantages. For example, it is not a good fit for applications that require low-latency access to data, because it is optimized for high throughput rather than fast response times. Similarly, HDFS is not suitable for data sets made up of a large number of small files, because the namenode keeps the metadata for every file in memory (White, 2009).
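The small-files limitation can be made tangible with a rough estimate of namenode memory. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes of namenode memory per file or block entry, and a 128 MB block size; both figures are assumptions for illustration, and the function name `namenode_memory_bytes` is hypothetical.

```python
BYTES_PER_OBJECT = 150            # assumed rule of thumb per file or block entry
DEFAULT_BLOCK_SIZE = 128 * 1024**2  # assumed block size of 128 MB


def namenode_memory_bytes(num_files, file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Estimate namenode memory for `num_files` files of `file_size` bytes."""
    # Integer ceiling division; even a tiny file occupies one block entry.
    blocks_per_file = max(1, -(-file_size // block_size))
    # One entry for the file itself plus one entry per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


# The same one terabyte of data stored in two different ways.
many_small = namenode_memory_bytes(num_files=1024**2, file_size=1024**2)
few_large = namenode_memory_bytes(num_files=1024, file_size=1024**3)
print(many_small, few_large)
```

Under these assumptions, a million 1 MB files cost the namenode about 300 MB of memory, while the same volume held in 1,024 files of 1 GB each costs a little over 1 MB.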
Hadoop MapReduce
Hadoop MapReduce is the implementation of the MapReduce programming model, which is used to process large distributed datasets in parallel. MapReduce consists of two phases: the Map phase and the Reduce phase. The Map phase takes a set of input data and breaks it down into key-value pairs. The output of the Map phase is passed as input to the Reduce phase, where the values that share a key are combined into a smaller set of key-value pairs. The key-value pairs produced by the Reduce phase are the final output of the MapReduce process (Taylor, 2010).
Note that the reduce function runs only after the Map phase has completed. Until then, the Reduce phase remains blocked. The main advantage of the MapReduce paradigm is that it allows data to be processed in parallel over a large cluster of commodity machines, which leads to higher throughput in less time (White, 2009).
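The two phases described above can be sketched in memory as follows. This is a single-process illustration of the programming model, not Hadoop's implementation: the helper `run_mapreduce`, the two user functions and the temperature readings are all invented for the example.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group-by-key and reduce sequentially in one process."""
    # Map phase: every record yields zero or more (key, value) pairs.
    intermediate = [pair for record in records for pair in map_fn(record)]

    # Grouping step: collect all values per key. Reduction starts only
    # after every map output is available, mirroring the blocked Reduce phase.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: collapse the values of each key into one output pair.
    return {key: reduce_fn(key, values) for key, values in sorted(grouped.items())}


def map_reading(record):
    """Turn a 'year,temperature' line into a (year, temperature) pair."""
    year, temperature = record.split(",")
    yield year, int(temperature)


def max_temperature(year, temperatures):
    """Keep only the highest temperature seen for a year."""
    return max(temperatures)


readings = ["1949,111", "1950,0", "1949,78", "1950,22", "1950,-11"]
result = run_mapreduce(readings, map_reading, max_temperature)
print(result)
```

On a cluster, the map calls would run in parallel on the nodes that hold the input blocks, and the grouping step would move data between machines. Here everything runs in order in one process to keep the data flow visible.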
Hadoop Yet Another Resource Negotiator (YARN)
YARN is the framework responsible for managing the resources of the cluster's commodity machines and for scheduling the jobs that run on them (Vavilapalli et al., 2013). It defines how the available system resources are shared among the nodes and how the assigned jobs are scheduled. It is one of the major features of Hadoop 2. The next generation of MapReduce, also known as MapReduce 2, is built on YARN and has many advantages over the traditional implementation. In particular, YARN avoids the scalability bottlenecks that affected the traditional MapReduce runtime.
In the traditional MapReduce framework, the jobtracker has two major responsibilities: job scheduling and monitoring the progress of the various tasks. YARN splits these responsibilities between two independent daemons. A resource manager allocates the system resources of the cluster to applications, while an application master takes responsibility for the lifecycle of an application running on the nodes. YARN also makes it possible for users to run different versions of MapReduce on the same cluster to suit their requirements, which makes the cluster more manageable.
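The division of labour between the two daemons can be sketched as a toy model. The classes below only mirror the roles described above: they are not the YARN API, memory is the only resource considered, and every class, method and node name is a hypothetical choice made for this illustration.

```python
class ResourceManager:
    """Tracks free memory per node and hands out containers on request."""

    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)

    def allocate(self, memory_mb):
        """Return the first node with enough free memory, or None."""
        for node, free in self.free.items():
            if free >= memory_mb:
                self.free[node] = free - memory_mb
                return node
        return None  # no node can host the container right now


class ApplicationMaster:
    """Requests containers for one application and tracks its tasks."""

    def __init__(self, name, resource_manager):
        self.name = name
        self.resource_manager = resource_manager
        self.running = {}   # task name -> node hosting its container
        self.pending = []   # tasks still waiting for resources

    def launch(self, tasks, memory_mb):
        for task in tasks:
            node = self.resource_manager.allocate(memory_mb)
            if node is None:
                self.pending.append(task)
            else:
                self.running[task] = node


rm = ResourceManager({"node1": 2048, "node2": 1024})
wordcount = ApplicationMaster("wordcount", rm)
wordcount.launch(["map-0", "map-1", "map-2", "map-3"], memory_mb=1024)
print(wordcount.running, wordcount.pending)
```

The resource manager knows nothing about individual tasks, and the application master knows nothing about other applications. This separation is what the split of the jobtracker's duties is meant to convey.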
Increasing importance of Hadoop
In April 2008, a Hadoop-based program running on a 910-node cluster broke a world record by sorting one terabyte of data in just 209 seconds (Taylor, 2010). Since then, Hadoop has seen steadily increasing use across industries and fields, from data science to bioinformatics. It has undergone significant development over the last decade, and Hadoop 2 is the result. Low-cost implementation and easy scalability are the features that attract users to it and make it so popular.
- Borthakur, D. (2008). HDFS Architecture Guide.
- Pessach, Y. (2013). Distributed Storage: Concepts, Algorithms, and Implementations.
- Taylor, R. C. (2010). An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. Proceedings of the 11th Annual Bioinformatics Open Source Conference (BOSC) 2010.
- The Apache Software Foundation. (2014). What Is Apache Hadoop?
- Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., … Saha, B. (2013). Apache Hadoop YARN: yet another resource negotiator. Proceedings of the 4th Annual Symposium on Cloud Computing.
- White, T. (2009). Hadoop: The Definitive Guide.