Importing data into hadoop distributed file system (HDFS)

By Priya Chetty on April 21, 2017

Hadoop is one of the applications for big data analysis, which is quite popular for its storage system that is Hadoop distributed file system (HDFS). It is a Java-based open source framework which stores big datasets in its distributed file system and processes them using MapReduce programming model. Since the last decade, the size of big datasets has increased exponentially, going up to exabytes. Furthermore, even in a small organisation, big datasets range from hundreds of gigabytes to hundreds of petabytes (1 petabyte = 1000 terabytes = 1000000 gigabytes). When the size of datasets increase, it becomes more difficult for traditional applications to analyse them. That is where frameworks like Hadoop and its storage file system come into play (Taylor, 2010).

Hadoop distributed file system stores data in the form of clusters on commodity machines from where data can be accessed parallelly when required. It operates on data processing pattern, write-once, read many times. This means dataset is copied from the commodity machine to the memory and then processed as many times as required. Hadoop works like a tree, where a name node is connected to various data nodes and all the tasks are performed by the former on the latter (Borthakur, 2008).

Hadoop’s distributed file system can be used to store big datasets irrespective of whether they are structured or not. The system stores them as raw files on the commodity machines which can be accessed whenever required.

Structured data

It refers to data which is organised into a pattern or exists as a relational database. Structured datasets have a definite length and a defined format and is easier to access and process. This is because it is organised and seamless, making it readily searchable for simple and straightforward search algorithms. Since the data is already organised, there is no need to find patterns like in the case of unstructured data (Sherpa Software, 2013). Furthermore, it is less time-consuming without much resource wastage, thus it also reduces expenses. Structured data originates from various sources such as:

Machine-generated data, with or without human intervention.
Sensor data
Data from weblogs
Gaming-related data
Financial data
Click-stream data
Point-of-sales data

Unstructured data

Unstructured data has no pre-defined model, length or format. Unlike structured data, it is not in an organised form, thus patterns are ascertained from those large amounts of datasets which have been transformed by the organisation based on their utility. Unstructured datasets are generally bigger in size due to their non-organised form. It takes more time to analyse them making it more expensive for an organisation (Sherpa Software, 2013). Examples of unstructured data are as follows:

Email messages
Word processing documents
Photos and videos
Audio files
Presentations
Social media posts
PDF files

Importing data into the Hadoop distributed file system

As mentioned above, the distributed file system stores data in clusters. Each commodity machine acts as a cluster and it stores data in the form of raw files. Both types of data, structured and unstructured are imported in different ways into the Hadoop system. There are many tools and options available online to import structured data into the Hadoop file system, one of them is Sqoop. It was earlier released by Cloudera, but is currently being managed and maintained by Apache. In addition, this tool is used to import data from a relational database, data warehouses into Hadoop, as they are the major sources of structured datasets.

using hadoop distributed file system for importing data into hadoop — The basic architecture of the Hadoop distributed file system (HDFS)

Sqoop is a command line tool which allows users to import either individual tables or entire databases into Hadoop distributed file system by MapReduce code internally. It consists of two components, Sqoop Import and Sqoop Export. As the name suggests, Sqoop Import is responsible for importing data into the Hadoop distributed file system, whereas Sqoop Export manages export of data out of the distributed file system. Sqoop uses connectors for data transfer between the Hadoop file system and external data storage systems. The following is a basic command to import a table “Employees” from MySQL into the distributed file system using the Sqoop command line (Ting & Cecho, 2013).

$ sqoop import --connect jdbc:mysql://localhost/test
 --username user --password pass --table employees

Using Flume for importing unstructured data sets

On the other hand in cases of unstructured data sets, one can use Flume to import into Hadoop. It is yet another Apache project which is quite popular for its efficiency and reliability in case of collecting large amounts of log data from various sources. Flume has an event-driven pipeline architecture with three major parts as follows:

The source which defines where data will be imported from.
Sink which defines where data will be moved to.
Channels which establish a connection between source and sink.

Furthermore, like connectors in Sqoop, Flume has agents which are responsible for data transfer. Therefore it is highly flexible and scalable as one can use it on thousands of machines or for just a few machines. Also, it is known for its high throughput, low latency, high fault tolerance and scalability. Therefore, it is the best tool for importing huge amounts of data (Iconiq Inc., 2015). The basic command to import data from an external source named “Test” using the configuration file “test.conf” into the Hadoop distributed file system is as follows (IBM, 2016):

$ bin/flume-ng agent --conf./conf/ -f conf/test.conf Dflume.root.logger=DEBUG,console -n TestAgent

Both Sqoop and Flume are efficient tools for importing data into Hadoop but they serve different purposes and one cannot use them for the same tasks. When dealing with big datasets, Flume is the most popular tool in the industry and many big organisations such as Goibibo, Mozilla, Syoncloud, SimpleReachs use it to for importing large streams of log data (Iconiq Inc., 2015)

References

Borthakur, D. (2008). Hadoop distributed file system (HDFS) Architecture Guide.
IBM. (2016). Importing data by using Flume. Retrieved from https://www.ibm.com/support/knowledgecenter/SSPT3X_2.0.0/com.ibm.swg.im.infosphere.biginsights.import.doc/doc/data_motion_flume.html.
Iconiq Inc. (2015). Sqoop vs. Flume Battle of the Hadoop ETL tools. Retrieved from https://www.dezyre.com/article/sqoop-vs-flume-battle-of-the-hadoop-etl-tools-/176.
Sherpa Software. (2013). Structured and Unstructured Data: What is It? Retrieved from http://sherpasoftware.com/blog/structured-and-unstructured-data-what-is-it/.
Taylor, R. C. (2010). An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. Proceedings of the 11th Annual Bioinformatics Open Source Conference (BOSC) 2010.
Ting, K., & Cecho, J. J. (2013). Apache Sqoop Cookbook.

Priya Chetty

I am a management graduate with specialisation in Marketing and Finance. I have over 12 years' experience in research and analysis. This includes fundamental and applied research in the domains of management and social sciences. I am well versed with academic research principles. Over the years i have developed a mastery in different types of data analysis on different applications like SPSS, Amos, and NVIVO. My expertise lies in inferring the findings and creating actionable strategies based on them.

Over the past decade I have also built a profile as a researcher on Project Guru's Knowledge Tank division. I have penned over 200 articles that have earned me 400+ citations so far. My Google Scholar profile can be accessed here.

I now consult university faculty through Faculty Development Programs (FDPs) on the latest developments in the field of research. I also guide individual researchers on how they can commercialise their inventions or research findings. Other developments im actively involved in at Project Guru include strengthening the "Publish" division as a bridge between industry and academia by bringing together experienced research persons, learners, and practitioners to collaboratively work on a common goal.

Structured data

Unstructured data

Importing data into the Hadoop distributed file system

Using Flume for importing unstructured data sets

References

Discuss