Major functions and components of Hadoop for big data

With the increasing use of big data applications across industries, Hadoop has gained popularity in data analysis over the last decade. It is an open-source framework that provides a distributed file system for big data sets and allows users to process and transform those data sets into useful information using the MapReduce programming model (White, 2009).
Most of the Hadoop framework is written in Java, with some native code in C, and it exposes a Java-based API. However, programs written in other languages such as Python can also use the framework through a utility known as Hadoop Streaming.

Two major functions of Hadoop

Hadoop serves two major functions. Firstly, it provides a distributed file system for big data sets. Secondly, it transforms those data sets into useful information using the MapReduce programming model. Big data sets typically run to hundreds of gigabytes or more, so Hadoop provides the Hadoop Distributed File System (HDFS), which stores them across clusters of commodity machines and allows them to be accessed in parallel.

HDFS replicates the data blocks across several commodity machines, making storage and processing more reliable and robust: if one node fails, Hadoop can detect the failure and restart the task on another healthy node. The framework is also highly scalable and can be reconfigured at any time as the user's needs grow. Setting up Hadoop on a machine does not require any major hardware change; the machine only needs to meet basic minimum requirements for RAM, disk space and operating system, and these are easy to upgrade if they are not already met (Taylor, 2010).
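
To make the point about configuration more concrete, the short Java sketch below reads two standard cluster settings through Hadoop's Configuration API. It is only a minimal sketch: the property names are standard Hadoop 2 keys, and the fallback values shown are the usual defaults rather than anything specific to a particular cluster.

    import org.apache.hadoop.conf.Configuration;

    public class ClusterDefaults {
        public static void main(String[] args) {
            // Picks up core-site.xml / hdfs-site.xml from the classpath if they are present
            Configuration conf = new Configuration();

            // dfs.replication: how many datanodes keep a copy of each block (commonly 3)
            System.out.println("Replication factor: " + conf.getInt("dfs.replication", 3));

            // dfs.blocksize: size of each HDFS block (128 MB by default in Hadoop 2)
            System.out.println("Block size (bytes): " + conf.getLong("dfs.blocksize", 134217728L));
        }
    }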

Major components 

The major components of Hadoop framework include:

  • Hadoop Common
  • Hadoop Distributed File System (HDFS)
  • MapReduce
  • Hadoop YARN

Hadoop common

Hadoop Common is the most essential part of the framework. It contains the utilities and libraries used by the other modules and provides components and interfaces for the distributed file system and for general I/O, including serialization, Java RPC (Remote Procedure Call) and file-based data structures. It was known as Hadoop Core until July 2009, when it was renamed Hadoop Common (The Apache Software Foundation, 2014).
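
As a small illustration of the file-based data structures and Writable serialization that Hadoop Common provides, the Java sketch below writes a key-value pair to a SequenceFile and reads it back. The file name and the record are purely hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("example.seq");  // hypothetical output file

            // Write key-value pairs using Hadoop's Writable serialization
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(IntWritable.class))) {
                writer.append(new Text("records"), new IntWritable(42));
            }

            // Read the same pairs back from the file-based data structure
            try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                    SequenceFile.Reader.file(path))) {
                Text key = new Text();
                IntWritable value = new IntWritable();
                while (reader.next(key, value)) {
                    System.out.println(key + "\t" + value);
                }
            }
        }
    }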

Hadoop distributed file system (HDFS)

HDFS is the distributed file system that comes with the Hadoop framework. It can be used to store very large datasets, ranging from gigabytes to petabytes in size (Borthakur, 2008). It is built around a write-once, read-many-times data processing pattern: a dataset is written to the cluster once and then read and processed as many times as required. HDFS follows a master-worker structure with a namenode (the master) and datanodes (the workers). The namenode manages the filesystem namespace and metadata and keeps track of where each block is stored, while the datanodes, running on the commodity machines, store the actual data. Here is a basic diagram of the HDFS architecture.

[Figure: Basic structure of the HDFS system]
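
To make the write-once, read-many-times pattern more tangible, here is a minimal Java sketch against the HDFS FileSystem API. The namenode address and the file path are assumptions and would have to match an actual cluster.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000");  // assumed namenode address
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/sample.txt");      // hypothetical path

            // Write the file once ...
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeBytes("hello hdfs\n");
            }

            // ... then read it back as many times as needed
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file)))) {
                System.out.println(in.readLine());
            }

            // Each block of the file is replicated across several datanodes
            System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
        }
    }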

HDFS does have a few limitations. For example, it is not suitable for applications that need low-latency access to data, nor for data sets that consist of a large number of small files (White, 2009).

MapReduce

Hadoop MapReduce is the implementation of the MapReduce programming model used for processing large distributed datasets in parallel. A MapReduce job runs in two phases: the Map phase and the Reduce phase. The Map phase takes in a set of data and breaks it down into key-value pairs. The output of the Map phase then becomes the input of the Reduce phase, where it is reduced to a smaller set of key-value pairs. The key-value pairs emitted by the Reduce phase are the final output of the MapReduce process (Taylor, 2010).

Note that the Reduce phase begins only after the Map phase has completed; until then it remains blocked. The main advantage of the MapReduce paradigm is that it allows the data to be processed in parallel over a large cluster of commodity machines, which produces results in far less time (White, 2009).
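
The canonical word-count job is the usual way to make the two phases concrete: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts for each word. The sketch below uses the Hadoop 2 Java MapReduce API; the input and output HDFS paths are supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: break each input line into (word, 1) key-value pairs
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word into a smaller set of pairs
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }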

Hadoop yet another resource negotiator (YARN)

YARN is the framework responsible for managing the resources of the cluster's commodity machines and for scheduling the jobs that run on them (Vavilapalli et al., 2013). It defines how the available system resources are shared among the nodes and how the various submitted jobs are scheduled. YARN is one of the major features of Hadoop 2 and underpins the next generation of MapReduce, known as MapReduce 2, which has many advantages over the classic implementation. In particular, YARN avoids the scalability bottlenecks of the traditional MapReduce paradigm.

In classic MapReduce, the jobtracker had two major responsibilities: job scheduling, and monitoring the progress of the running tasks. YARN splits these between two independent daemons: a resource manager, which allocates the cluster's resources to applications, and a per-application application master, which tracks the tasks of its own application running on the nodes. YARN also makes it possible to run different versions of MapReduce on the same cluster, which makes the cluster easier to manage.
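
From a client's point of view, opting into YARN is largely a matter of configuration. The Java fragment below is a minimal sketch that directs a MapReduce 2 job to run on YARN and points it at a resource manager; the hostname used here is a placeholder, not a real address.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class YarnJobSetup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Run MapReduce 2 on YARN instead of the classic jobtracker
            conf.set("mapreduce.framework.name", "yarn");

            // Resource manager that hands out cluster resources (placeholder hostname)
            conf.set("yarn.resourcemanager.hostname", "resourcemanager.example.com");

            Job job = Job.getInstance(conf, "yarn example job");
            // ... mapper, reducer and input/output paths would be configured here as usual
        }
    }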

Increasing importance of Hadoop

In April 2008, a Hadoop program running on a 910-node cluster broke a world record by sorting one terabyte of data in just 209 seconds (Taylor, 2010). Since then, Hadoop has seen ever-increasing use across industries, from data science to bioinformatics and beyond. It has undergone substantial development over the last decade, and Hadoop 2 is the result. Low-cost implementation and easy scalability are the features that attract users and make it so popular.

References

  • Borthakur, D. (2008). HDFS Architecture Guide.
  • Pessach, Y. (2013). Distributed Storage: Concepts, Algorithms, and Implementations.
  • Taylor, R. C. (2010). An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. Proceedings of the 11th Annual Bioinformatics Open Source Conference (BOSC) 2010.
  • The Apache Software Foundation. (2014). What Is Apache Hadoop?
  • Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., … Saha, B. (2013). Apache Hadoop YARN: yet another resource negotiator. Proceedings of the 4th Annual Symposium on Cloud Computing.
  • White, T. (2009). Hadoop: The Definitive Guide.

Ankur Sharma

Research Analyst at Project Guru
Ankur holds a Bachelor's degree in Computer Science and is a tech enthusiast in Core Java and Advanced Java programming. He has a penchant for writing and has also managed his own blog on technology.
