Beginners Guide - Introduction of Big Data & Hadoop

Fazlur Rahman

4.85/5 (13 votes)

21 Jan 2017CPOL10 min read

22.9K

What is Big Data and how Hadoop been introduced to overcome the problems associated with Big Data?

Big Data

As the name implies, Big Data is the huge amount of data which is complex and difficult to store, maintain or access in regular file system using traditional data processing applications. And what are the sources of this huge set of data.

A typical large stock exchange
Mobile phones
Video sharing portal like YouTube, Vimeo, Dailymotion, etc.
Social networks like Facebook, Twitter, LinkedIn, etc.
Network sensors
Web pages, text and documents
Web logs
System logs
Search index data
CCTV images

Data Types

Data can be identified as the following three types:

Structured Data: Data which is presented in a tabular format and stores in RDMS (Relational Database Management System)
Semi-structured Data: Data which does not have a formal data model and stores in XML, JSON, etc.
Unstructured Data: Data which does not have a pre-defined data model like video, audio, image, text, web logs, system logs, etc.

Characteristics of Big Data Technology

A regular file system with typical data processing application faces the following challenges:

Volume – The volume of data coming from different sources is high and potentially increasing day by day
Velocity – A single processor, limited RAM and limited storage based system is not enough to process this high volume of data
Variety – Data coming from different sources varies

And therefore, the Big Data Technology comes into the picture:

It helps to store, manage and process high volume and variety of data in cost & time effective manner.
It analyzes data in its native form, which could be unstructured, structured or streaming.
It captures data from live events in real time.
It has a very well defined and strong system failure mechanism which provides high-availability. It handles system uptime and downtime:
- Using commodity hardware for data storage and analysis
- Maintain multiple copies of the same data across clusters
It stores data in blocks in different machines and then merges them on demand.

Hadoop

Hadoop is a platform or framework which helps to store high volume and variety of data in single or distributed file storage. It's open source, programmed in Java and distributed by Apache Foundation. It has a distributed filesystem called HDFS (Hadoop Distributed File System) which enables storing & fast data transfer among distributed file storages and MapReduce to process the data.

So, Hadoop has two main components:

HDFS is a specially designed file system to store and transfer of data among parallel servers using streaming access pattern.
MapReduce to process data.

Hadoop Hardware Architecture

There are some key terms that need to be understood:

Commodity Hardware: PCs/Servers uses cheap hardware can be used to make clusters.
Cluster: A set of commodity PCs/Servers interconnected in a network
Node: Each of the commodity PCs/Servers is called node.

So, Hadoop supports the concept of distributed architecture. The above diagram shows how a set of interconnected nodes makes clusters and how the clusters are interconnected through Hadoop framework.

The number of nodes in each cluster depends on the network speed
Uplink from cluster to node is 3 to 4 Gb/s
Uplink from cluster to cluster is 1 GB/s

HDFS vs Regular File System

Regular File System	Hadoop Distributed File System
The size of each data block is small like 4KB	The size of each data block is 64 MB or 120 MB
If a 2KB file is stored in one block, the remaining 2 KB is unused or wasted	If a file of 50 MB is stored in one block, the remaining 14 MB can be used
Slow access to blocks	Provides high-throughput access to data blocks
Large data access suffers from disk I/O problems, mainly because of multiple seek operation	Reads huge data sequentially after a single seek
Provides fancy and user friendly interface for managing the file system	Provides limited interface for managing the file system
Creates only one copy of each data block. If the data block is erased, the data is lost	Creates 3 replicas of each data block by default and distributes them on computers throughout the clusters to enable reliable and rapid data access

HDFS exposes a specially designed file system on top of OS defined file system.
It facilitates the user to store data in files.
It maintains Hierarchical file system with directories and files.
HDFS supports different file I/O operations like create, delete, rename, move, etc.

Hadoop Core Services

Hadoop follows Master-Slave architecture. There are 5 services run in Hadoop:

NameNode
Secondary NameNode
JobTracker
DataNode
TaskTracker

NameNode, Secondary NameNode and JobTracker are called master services and DataNode and TaskTracker are called slave services.

As the diagram denotes, each of the master services can talk to each other and each of the slave services can talk to each other. Since DataNode is a slave service of NameNode, they can talk to each other and TaskTracker is a slave service of JobTracker, they can also talk to each other.

HDFS Operation Principle

The HDFS components comprise different servers like NameNode, DataNode and Secondary NameNode.

NameNode Server

NameNode server is a single instance server which is responsible for the following:

Maintain the file system namespace
NameNode behaves like table of content of a book. It knows the location of each block of data.
Manage the files and directories in file system hierarchy.
It uses a file called FsImage to store the entire file system namespace including mapping of blocks to file and file system properties. This file is stored in NameNode server’s local file system.
It uses a transaction log called EditLog to record every change that occurs to the file system meta data. This file is stored in NameNode server’s local file system.
If there is any I/O operation occurs in HDFS, the Meta Data files of NameNode server is updated.
Meta Data files are loaded into Memory of NameNode server. Whenever there is a new DataNode server joined to the cluster, the Meta Data files in memory are updated and then keep an image of the files in local file system as checkpoint.
The Metadata size is limited to the RAM available in NameNode server.
NameNode is a critical one point of failure. If it fails, the entire cluster will fail.
But the NameNode server can partially be restored from a secondary namenode server.

DataNode Server

There could be any number of DataNode servers in a cluster depending on type of network and storage system in place. It is responsible for the following:

Store and maintain the data blocks
Report to NameNode server periodically to update meta data information
Store and retrieve the blocks when there is a request comes from client or NameNode server
Execute read, write requests, performs block creation, deletion and replication upon instruction from NameNode
Each of the DataNode servers sends Heartbeat and BlockReport to NameNode server in specific duration.
If any DataNode server does not report to NameNode server in specific duration, NameNode server thinks this DataNode server as dead and remove the metadata information of that DataNode server.

Secondary NameNode Server

There could be a single instance of Secondary NameNode server. It is responsible for the following:

Maintain a backup of NameNode server
It’s not treated as a disaster recovery of NameNode server but the NameNode server can partially be restored from this server.
Keeps namespace image through edit log periodically

When a client request Hadoop to store a file, the request goes to NameNode server. For example, the file size is 300 MB. Since the size of each data block is 64MB, the file will be divided into 5 chunks of data blocks where 4 of them equals 64 MB and the 5^th one is 44MB and stores them in 5 different data node servers within same cluster with 3 replicas. Here, the chunks of data are called inputsplit. NameNode service then keep the information like where the data blocks are stored, how much is the block size, etc. This information is called meta data.

Here is the complete flow of the operation:

The file is divided into 5 inputsplit say a.jpg, b.jpg, c.jpg, d.jpg and e.jpg and the original file name is photo.jpg and file size 300 MB
The client sends request to NameNode server with this details asking what are the DataNode Servers has available data blocks to store them.
NameNode server respond to client with the details of DataNode servers which has enough space to store the file. Let’s say it sends the following details:

InputSplit	DataNode Server
a.jpg	Data Node Server 1
b.jpg	Data Node Server 3
c.jpg	Data Node Server 5
d.jpg	Data Node Server 6
e.jpg	Data Node Server 7

Once the client receives response from NameNode server, it starts requesting the DataNode servers to store the file. It starts sending the first inputsplit a.jpg to DataNode server 1.
Once the DataNode Server 1 receives the request, it stores a.jpg in its local file system and request replication to DataNode Server 3.
Once the DataNode Server 3 receives the request, it stores a.jpg in its local file system and request another replication to DataNode Server 7.
Once the DataNode Server 7 receives the request, it stores a.jpg in its local file system and send acknowledgement back to DataNode Server 3 saying that the file has been stored properly.
DataNode Server 3 then send acknowledgement back to DataNode Server 1 saying that the file has been properly replicated in DataNode Server 3 & 5
DataNode Server 1 then send acknowledgement back to the client saying that the file has been stored and replicated properly.
DataNode Server 1, 3 & 5 send BlockReport to NameNode server to update metadata information.
The same process repeats for the other inputsplits.
If any of the DataNode server 1, 3 & 5 stops sending Heartbeat and BlockReport, NameNode server thinks that the DataNode server is dead and choose another DataNode server to replace the replication of the a.jpg inputsplit.
There should be a program written in Java or any other language to process the file photo.jpg. The client sends this program to JobTracker component of Hadoop. The JobTracker component gets the meta data information from NameNode server and then communicates with TaskTracker components of the respective DataNode servers to process the file. The communication between JobTracker and TaskTracker is called Map. The number of inputslits in DataNode servers equals the number of Mapper. In the above example, there will be 5 mappers running to process photo.jpg file.
The TaskTracker components keep reporting to JobTracker component if they are processing the request properly or are they alive. If there is any TaskTracker stops reporting to JobTracker, the JobTracker assigns the same task to one of the TaskTrackers where the replications of the inputslipt are stored.
The JobTracker assigns tasks to TaskTracker depending on how close the TaskTracker is and how many mappers are running.
Each Mapper produces one output file for every task assigned. In this example, there will be 5 mappers producing 5 different output files. There will be a Reducer who will combine these 5 input files and report to the DataNode server where the Reducer is running, DataNode Server 4 for example. The DataNode Server 4 will then communicate with NameNode server by providing meta data saying that there is one out file called output.jpg has been processed and ready to use.
Client will keep watching and communicate with NameNode server once the processing is completed 100% and output.jpg file is generated. NameNode server response back to client saying that the file has been processed and ready to use in DataNode Server 4.
Client then sends request to DataNode Server 4 directly and get the output.jpg file

Conclusion

Hope you enjoyed reading it and have learnt something new. In my next consecutive articles, I will show you how to install Hadoop and explain different components of Hadoop in details.

Thanks for reading my article and keep in touch.

History

22^nd January, 2017: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)