Importing data into the Hadoop distributed file system (HDFS)

Hadoop is one of the most widely used frameworks for big data analysis, and it is particularly well known for its storage layer, the Hadoop distributed file system (HDFS). It is a Java-based open-source framework which stores big datasets in its distributed file system and processes them using the MapReduce programming model. Over the last decade, the size of big datasets has grown exponentially, reaching up to exabytes. Even in a small organisation, big datasets can range from hundreds of gigabytes to hundreds of petabytes (1 petabyte = 1,000 terabytes, or 1,000,000 gigabytes). As datasets grow, it becomes increasingly difficult for traditional applications to analyse them. That is where frameworks like Hadoop and its storage file system come into play (Taylor, 2010).

The Hadoop distributed file system stores data across clusters of commodity machines, from which it can be read in parallel when required. It follows a write-once, read-many-times processing pattern: a dataset is written to the cluster once and then read into memory and processed as many times as required. Hadoop is organised like a tree, in which a single namenode manages the file system metadata and coordinates the work carried out on the many datanodes that hold the actual data blocks (Borthakur, 2008).
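
As a small illustration of this master/worker layout, the namenode's view of the cluster and its live datanodes can be printed from the command line. This is a minimal sketch, assuming a running cluster with the hdfs client on the PATH:

$ hdfs dfsadmin -report

The report lists the overall storage capacity along with each datanode currently known to the namenode.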

Hadoop’s distributed file system can store big datasets irrespective of whether they are structured or unstructured. The system keeps them as raw files on the commodity machines, which can be accessed whenever required.
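
In practice, storing and reading such raw files is done with the HDFS shell. The following is a minimal sketch, assuming a running cluster; the directory and the file name sales.csv are hypothetical and used only for illustration:

$ hdfs dfs -mkdir -p /user/analyst/raw         # create a directory in HDFS (hypothetical path)
$ hdfs dfs -put sales.csv /user/analyst/raw/   # write the raw file into the cluster once
$ hdfs dfs -ls /user/analyst/raw               # list the stored files
$ hdfs dfs -cat /user/analyst/raw/sales.csv    # read the data back as often as required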

Structured data

Structured data refers to data which is organised into a fixed pattern, typically as a relational database. Structured datasets have a definite length and a defined format, which makes them easier to access and process. Because the data is organised and consistent, it is readily searchable with simple and straightforward search algorithms, and there is no need to first discover patterns, as there is with unstructured data (Sherpa Software, 2013). Processing it is therefore less time-consuming and wastes fewer resources, which also reduces expenses. Structured data originates from various sources such as:

  • Machine-generated data, with or without human intervention.
  • Sensor data
  • Data from web logs
  • Gaming-related data
  • Financial data
  • Click-stream data
  • Point-of-sales data

Unstructured data

Unstructured data has no pre-defined model, length or format. Unlike structured data, it is not stored in an organised form, so an organisation must first transform the data and extract patterns from it before it becomes useful. Unstructured datasets are generally larger because of this lack of organisation, and they take more time to analyse, which makes them more expensive for an organisation to handle (Sherpa Software, 2013). Examples of unstructured data are as follows:

  • Email messages
  • Word processing documents
  • Photos and videos
  • Audio files
  • Presentations
  • Social media posts
  • PDF files

Importing data into Hadoop distributed file system

As mentioned above, the distributed file system stores data in clusters. Each commodity machine acts as a node in the cluster and stores data in the form of raw files. Both types of data, structured and unstructured, are imported into the Hadoop system in different ways. There are many tools available to import structured data into the Hadoop file system; one of them is Sqoop. It was originally released by Cloudera but is currently managed and maintained by Apache. The tool is used to import data from relational databases and data warehouses into Hadoop, as these are the major sources of structured datasets.

Figure: Basic architecture of the Hadoop distributed file system (HDFS)

Sqoop is a command-line tool which allows users to import either individual tables or entire databases into the Hadoop distributed file system, generating MapReduce code internally to carry out the transfer. It consists of two components, Sqoop Import and Sqoop Export. As the names suggest, Sqoop Import is responsible for importing data into the Hadoop distributed file system, whereas Sqoop Export manages the export of data out of the distributed file system. Sqoop uses connectors for data transfer between the Hadoop file system and external data storage systems. The following is a basic command to import a table “employees” from MySQL into the distributed file system using the Sqoop command line (Ting & Cecho, 2013).

$ sqoop import --connect jdbc:mysql://localhost/test \
    --username user --password pass --table employees
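
A few common variations control where and how the data lands in HDFS, and Sqoop Export moves processed data back out. The options below (--target-dir, -m and --export-dir) are standard Sqoop flags, but the HDFS paths and the table name employees_summary are hypothetical and shown only as a sketch:

$ sqoop import --connect jdbc:mysql://localhost/test \
    --username user --password pass --table employees \
    --target-dir /user/analyst/employees -m 4

$ sqoop export --connect jdbc:mysql://localhost/test \
    --username user --password pass --table employees_summary \
    --export-dir /user/analyst/summary

Here --target-dir chooses the HDFS directory for the imported table, -m sets the number of parallel map tasks, and --export-dir points Sqoop Export at the HDFS directory whose contents are written back into MySQL.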

Using Flume for importing unstructured data sets

For unstructured data sets, on the other hand, one can use Flume to import data into Hadoop. It is another Apache project, popular for its efficiency and reliability in collecting large amounts of log data from various sources. Flume has an event-driven pipeline architecture with three major parts, sketched in the sample configuration that follows the list:

  • Source, which defines where data will be imported from.
  • Sink, which defines where data will be moved to.
  • Channel, which establishes the connection between the source and the sink.
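
To make these three parts concrete, a Flume agent is wired together in a plain properties file. The sketch below is a hypothetical test.conf for an agent named TestAgent (the name used in the command further down); the log path and HDFS path are assumptions made only for illustration:

# name the source, channel and sink of the agent
TestAgent.sources = src1
TestAgent.channels = ch1
TestAgent.sinks = snk1

# source: tail an application log file (hypothetical path)
TestAgent.sources.src1.type = exec
TestAgent.sources.src1.command = tail -F /var/log/app/app.log
TestAgent.sources.src1.channels = ch1

# channel: an in-memory buffer between the source and the sink
TestAgent.channels.ch1.type = memory
TestAgent.channels.ch1.capacity = 10000

# sink: write the collected events into HDFS (hypothetical path)
TestAgent.sinks.snk1.type = hdfs
TestAgent.sinks.snk1.hdfs.path = hdfs://localhost:9000/flume/events
TestAgent.sinks.snk1.hdfs.fileType = DataStream
TestAgent.sinks.snk1.channel = ch1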

Furthermore, just as Sqoop uses connectors, Flume uses agents which are responsible for data transfer. This makes it highly flexible and scalable, as it can run on anything from a few machines to thousands of machines. It is also known for its high throughput, low latency and fault tolerance, which makes it well suited to importing huge amounts of data (Iconiq Inc., 2015). The basic command to start an agent named “TestAgent” with the configuration file “test.conf” and import data into the Hadoop distributed file system is as follows (IBM, 2016):

$ bin/flume-ng agent --conf ./conf/ -f conf/test.conf -Dflume.root.logger=DEBUG,console -n TestAgent
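
Once the agent is running, one quick check is to list the HDFS sink directory used in the sketch above (again a hypothetical path) to confirm that events are being delivered:

$ hdfs dfs -ls /flume/events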

Both Sqoop and Flume are efficient tools for importing data into Hadoop, but they serve different purposes and cannot be used interchangeably. When dealing with large streams of log data, Flume is among the most popular tools in the industry, and many big organisations such as Goibibo, Mozilla, Syoncloud and SimpleReach use it for importing big datasets (Iconiq Inc., 2015).
