Importing data into Hadoop distributed file system (HDFS)
Hadoop is one of the most popular frameworks for big data analysis, known in particular for its storage layer, the Hadoop distributed file system (HDFS). It is a Java-based open source framework which stores big datasets in its distributed file system and processes them using the MapReduce programming model. Over the last decade, the size of big datasets has increased exponentially, going up to exabytes. Even in a small organisation, big datasets can range from hundreds of gigabytes to hundreds of petabytes (1 petabyte = 1,000 terabytes = 1,000,000 gigabytes). As the size of datasets increases, it becomes more difficult for traditional applications to analyse them. That is where frameworks like Hadoop and its storage file system come into play (Taylor, 2010).
The Hadoop distributed file system stores data across a cluster of commodity machines, from where it can be accessed in parallel when required. It follows a write-once, read-many-times data processing pattern. This means a dataset is typically written to the file system once and then read and processed as many times as required. Hadoop is organised like a tree, where a name node is connected to various data nodes: the name node manages the file system and coordinates the operations, while the data nodes hold the data (Borthakur, 2008).
Hadoop’s distributed file system can be used to store big datasets irrespective of whether they are structured or not. The system stores them as raw files on the commodity machines which can be accessed whenever required.
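The write-once, read-many pattern can be sketched with the standard `hdfs dfs` file system shell. This is a minimal illustration rather than a recommended procedure: the file name `weblog.txt` and the directory `/user/demo/raw` are assumptions made for the example, and the small `run` helper is hypothetical. By default it only prints the commands, so the script can be read and tried without a Hadoop installation.

```shell
#!/bin/sh
# Sketch only: the file name and HDFS directory below are hypothetical.
LOCAL_FILE="weblog.txt"
HDFS_DIR="/user/demo/raw"

# Print each command by default; set DRY_RUN=0 on a machine with a
# configured Hadoop client to actually execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create a target directory in HDFS.
run hdfs dfs -mkdir -p "$HDFS_DIR"
# Write once: copy the raw file from the local disk into HDFS.
run hdfs dfs -put "$LOCAL_FILE" "$HDFS_DIR/"
# Read many times: list the directory and stream the file back.
run hdfs dfs -ls "$HDFS_DIR"
run hdfs dfs -cat "$HDFS_DIR/$LOCAL_FILE"
```

The file is stored as-is, whether its content is structured or not; any interpretation of the content happens later, when the data is read and processed.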
Structured data refers to data which is organised into a pattern or exists in a relational database. Structured datasets have a definite length and a defined format and are easier to access and process. Because the data is organised, it is readily searchable with simple and straightforward search algorithms, and there is no need to find patterns first, as in the case of unstructured data (Sherpa Software, 2013). Processing it is also less time-consuming and wastes fewer resources, which reduces expenses. Structured data originates from various sources such as:
- Machine-generated data, with or without human intervention
- Sensor data
- Data from weblogs
- Gaming-related data
- Financial data
- Click-stream data
- Point-of-sales data
Unstructured data has no pre-defined model, length or format. Unlike structured data, it is not in an organised form, so patterns have to be ascertained from large amounts of data that the organisation has transformed based on its utility. Unstructured datasets are generally bigger in size due to their non-organised form. Analysing them takes more time, which makes it more expensive for an organisation (Sherpa Software, 2013). Examples of unstructured data are as follows:
- Email messages
- Word processing documents
- Photos and videos
- Audio files
- Social media posts
- PDF files
Importing data into the Hadoop distributed file system
As mentioned above, the distributed file system stores data across a cluster. Each commodity machine acts as a node of the cluster and stores data in the form of raw files. Structured and unstructured data are imported into the Hadoop system in different ways. There are many tools and options available to import structured data into the Hadoop file system; one of them is Sqoop. It was originally released by Cloudera and later became an Apache project. This tool is used to import data from relational databases and data warehouses into Hadoop, as they are the major sources of structured datasets.
Sqoop is a command line tool which allows users to import either individual tables or entire databases into the Hadoop distributed file system; internally, it generates MapReduce code to perform the transfer. It consists of two components, Sqoop Import and Sqoop Export. As the names suggest, Sqoop Import is responsible for importing data into the Hadoop distributed file system, whereas Sqoop Export manages the export of data out of the distributed file system. Sqoop uses connectors for data transfer between the Hadoop file system and external data storage systems. The following is a basic command to import the table “employees” from a MySQL database named “test” into the distributed file system using the Sqoop command line (Ting & Cecho, 2013).
$ sqoop import --connect jdbc:mysql://localhost/test --username user --password pass --table employees
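The basic command can be extended to cover the other cases mentioned above, namely importing an entire database and exporting data back out of HDFS. The following is a sketch under stated assumptions, not a definitive recipe: the HDFS directories under `/user/demo` and the `run` helper are hypothetical, and the connection details are simply reused from the command above. The helper prints the commands by default, so nothing is executed unless `DRY_RUN=0` is set on a machine where Sqoop is installed.

```shell
#!/bin/sh
# Sketch only: the HDFS paths and the run helper are hypothetical.
CONNECT="jdbc:mysql://localhost/test"
DB_USER="user"

# Print each command by default; set DRY_RUN=0 to actually execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Import one table into a chosen HDFS directory using 4 parallel map tasks.
# -P prompts for the password instead of exposing it on the command line.
run sqoop import --connect "$CONNECT" --username "$DB_USER" -P \
    --table employees --target-dir /user/demo/employees --num-mappers 4

# Import every table of the database under one parent directory.
run sqoop import-all-tables --connect "$CONNECT" --username "$DB_USER" -P \
    --warehouse-dir /user/demo/test

# Export files from HDFS back into an existing database table.
run sqoop export --connect "$CONNECT" --username "$DB_USER" -P \
    --table employees --export-dir /user/demo/employees
```

Using `-P` rather than `--password` keeps the password out of the shell history, which is usually preferable outside of quick local tests.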
Using Flume for importing unstructured data sets
For unstructured datasets, on the other hand, one can use Flume to import data into Hadoop. It is another Apache project, which is popular for its efficiency and reliability in collecting large amounts of log data from various sources. Flume has an event-driven pipeline architecture with three major parts:
- Source, which defines where data will be imported from.
- Sink, which defines where data will be moved to.
- Channel, which establishes the connection between the source and the sink.
Furthermore, like connectors in Sqoop, Flume has agents which are responsible for data transfer. It is highly flexible and scalable, as one can use it on thousands of machines or on just a few. It is also known for its high throughput, low latency and high fault tolerance. This makes it well suited to importing huge amounts of data (Iconiq Inc., 2015). The basic command to import data from an external source named “Test” into the Hadoop distributed file system, using the configuration file “test.conf” and an agent named “TestAgent”, is as follows (IBM, 2016):
$ bin/flume-ng agent --conf ./conf/ -f conf/test.conf -Dflume.root.logger=DEBUG,console -n TestAgent
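The command above expects a configuration file that wires a source, a channel and a sink together for the agent. The original file is not shown, so the following is only a sketch of what a minimal `conf/test.conf` might look like. The component names (`src1`, `ch1`, `sink1`), the log file path and the HDFS URL are all assumptions made for illustration. The script just writes the file and does not start Flume.

```shell
#!/bin/sh
# Sketch only: the component names (src1, ch1, sink1), the log path and the
# HDFS URL are hypothetical. The agent name matches "-n TestAgent".
mkdir -p conf

cat > conf/test.conf <<'EOF'
# Declare the three parts of the agent.
TestAgent.sources = src1
TestAgent.channels = ch1
TestAgent.sinks = sink1

# Source: where data is imported from (here, new lines of a log file).
TestAgent.sources.src1.type = exec
TestAgent.sources.src1.command = tail -F /var/log/app.log
TestAgent.sources.src1.channels = ch1

# Channel: buffers events in memory between source and sink.
TestAgent.channels.ch1.type = memory
TestAgent.channels.ch1.capacity = 1000

# Sink: where data is moved to (here, a directory in HDFS).
TestAgent.sinks.sink1.type = hdfs
TestAgent.sinks.sink1.hdfs.path = hdfs://localhost:8020/user/demo/flume
TestAgent.sinks.sink1.hdfs.fileType = DataStream
TestAgent.sinks.sink1.channel = ch1
EOF

echo "wrote conf/test.conf"
```

A memory channel is the simplest choice for a sketch, but events held in memory are lost if the agent stops; a file-backed channel is the usual alternative when that matters.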
Both Sqoop and Flume are efficient tools for importing data into Hadoop, but they serve different purposes and cannot be used for the same tasks. When dealing with big datasets, Flume is among the most popular tools in the industry, and many big organisations such as Goibibo, Mozilla, Syoncloud and SimpleReachs use it for importing large streams of log data (Iconiq Inc., 2015).
References
- Borthakur, D. (2008). Hadoop distributed file system (HDFS) Architecture Guide.
- IBM. (2016). Importing data by using Flume. Retrieved from https://www.ibm.com/support/knowledgecenter/SSPT3X_2.0.0/com.ibm.swg.im.infosphere.biginsights.import.doc/doc/data_motion_flume.html.
- Iconiq Inc. (2015). Sqoop vs. Flume Battle of the Hadoop ETL tools. Retrieved from https://www.dezyre.com/article/sqoop-vs-flume-battle-of-the-hadoop-etl-tools-/176.
- Sherpa Software. (2013). Structured and Unstructured Data: What is It? Retrieved from http://sherpasoftware.com/blog/structured-and-unstructured-data-what-is-it/.
- Taylor, R. C. (2010). An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. Proceedings of the 11th Annual Bioinformatics Open Source Conference (BOSC) 2010.
- Ting, K., & Cecho, J. J. (2013). Apache Sqoop Cookbook.