Big Data
As the name implies, Big Data is the huge amount of data which is complex and difficult to store, maintain or access in regular file system using traditional data processing applications. And what are the sources of this huge set of data.
- A typical large stock exchange
- Mobile phones
- Video sharing portal like YouTube, Vimeo, Dailymotion, etc.
- Social networks like Facebook, Twitter, LinkedIn, etc.
- Network sensors
- Web pages, text and documents
- Web logs
- System logs
- Search index data
- CCTV images
Data Types
Data can be identified as the following three types:
- Structured Data: Data which is presented in a tabular format and stores in RDMS (Relational Database Management System)
- Semi-structured Data: Data which does not have a formal data model and stores in XML, JSON, etc.
- Unstructured Data: Data which does not have a pre-defined data model like video, audio, image, text, web logs, system logs, etc.
Characteristics of Big Data Technology
A regular file system with typical data processing application faces the following challenges:
- Volume – The volume of data coming from different sources is high and potentially increasing day by day
- Velocity – A single processor, limited RAM and limited storage based system is not enough to process this high volume of data
- Variety – Data coming from different sources varies
And therefore, the Big Data Technology comes into the picture:
- It helps to store, manage and process high volume and variety of data in cost & time effective manner.
- It analyzes data in its native form, which could be unstructured, structured or streaming.
- It captures data from live events in real time.
- It has a very well defined and strong system failure mechanism which provides high-availability. It handles system uptime and downtime:
- Using commodity hardware for data storage and analysis
- Maintain multiple copies of the same data across clusters
- It stores data in blocks in different machines and then merges them on demand.
Hadoop
Hadoop is a platform or framework which helps to store high volume and variety of data in single or distributed file storage. It's open source, programmed in Java and distributed by Apache Foundation. It has a distributed filesystem called HDFS (Hadoop Distributed File System) which enables storing & fast data transfer among distributed file storages and MapReduce
to process the data.
So, Hadoop has two main components:
- HDFS is a specially designed file system to store and transfer of data among parallel servers using streaming access pattern.
- MapReduce to process data.
Hadoop Hardware Architecture
There are some key terms that need to be understood:
- Commodity Hardware: PCs/Servers uses cheap hardware can be used to make clusters.
- Cluster: A set of commodity PCs/Servers interconnected in a network
- Node: Each of the commodity PCs/Servers is called node.
So, Hadoop supports the concept of distributed architecture. The above diagram shows how a set of interconnected nodes makes clusters and how the clusters are interconnected through Hadoop framework.
- The number of nodes in each cluster depends on the network speed
- Uplink from cluster to node is 3 to 4 Gb/s
- Uplink from cluster to cluster is 1 GB/s
HDFS vs Regular File System
Regular File System | Hadoop Distributed File System |
The size of each data block is small like 4KB
| The size of each data block is 64 MB or 120 MB
|
If a 2KB file is stored in one block, the remaining 2 KB is unused or wasted
| If a file of 50 MB is stored in one block, the remaining 14 MB can be used
|
Slow access to blocks
| Provides high-throughput access to data blocks
|
Large data access suffers from disk I/O problems, mainly because of multiple seek operation
| Reads huge data sequentially after a single seek
|
Provides fancy and user friendly interface for managing the file system
| Provides limited interface for managing the file system
|
Creates only one copy of each data block. If the data block is erased, the data is lost
| Creates 3 replicas of each data block by default and distributes them on computers throughout the clusters to enable reliable and rapid data access
|
- HDFS exposes a specially designed file system on top of OS defined file system.
- It facilitates the user to store data in files.
- It maintains Hierarchical file system with directories and files.
- HDFS supports different file I/O operations like create, delete, rename, move, etc.
Hadoop Core Services
Hadoop follows Master-Slave architecture. There are 5 services run in Hadoop:
NameNode
Secondary NameNode
JobTracker
DataNode
TaskTracker
NameNode
, Secondary NameNode
and JobTracker
are called master services and DataNode
and TaskTracker
are called slave services.
As the diagram denotes, each of the master services can talk to each other and each of the slave services can talk to each other. Since DataNode
is a slave service of NameNode
, they can talk to each other and TaskTracker
is a slave service of JobTracker
, they can also talk to each other.
HDFS Operation Principle
The HDFS components comprise different servers like NameNode
, DataNode
and Secondary NameNode
.
NameNode Server
NameNode
server is a single instance server which is responsible for the following:
- Maintain the file system namespace
NameNode
behaves like table of content of a book. It knows the location of each block of data. - Manage the files and directories in file system hierarchy.
- It uses a file called
FsImage
to store the entire file system namespace including mapping of blocks to file and file system properties. This file is stored in NameNode
server’s local file system. - It uses a transaction log called
EditLog
to record every change that occurs to the file system meta data. This file is stored in NameNode
server’s local file system. - If there is any I/O operation occurs in HDFS, the Meta Data files of
NameNode
server is updated. - Meta Data files are loaded into Memory of
NameNode
server. Whenever there is a new DataNode
server joined to the cluster, the Meta Data files in memory are updated and then keep an image of the files in local file system as checkpoint. - The
Metadata
size is limited to the RAM available in NameNode
server. NameNode
is a critical one point of failure. If it fails, the entire cluster will fail. - But the
NameNode
server can partially be restored from a secondary namenode server.
DataNode Server
There could be any number of DataNode
servers in a cluster depending on type of network and storage system in place. It is responsible for the following:
- Store and maintain the data blocks
- Report to
NameNode
server periodically to update meta data information - Store and retrieve the blocks when there is a request comes from client or
NameNode
server - Execute read, write requests, performs block creation, deletion and replication upon instruction from
NameNode
- Each of the
DataNode
servers sends Heartbeat
and BlockReport
to NameNode
server in specific duration. - If any
DataNode
server does not report to NameNode
server in specific duration, NameNode
server thinks this DataNode
server as dead and remove the metadata information of that DataNode
server.
Secondary NameNode Server
There could be a single instance of Secondary NameNode
server. It is responsible for the following:
- Maintain a backup of
NameNode
server - It’s not treated as a disaster recovery of
NameNode
server but the NameNode
server can partially be restored from this server. - Keeps namespace image through edit log periodically
When a client request Hadoop to store a file, the request goes to NameNode
server. For example, the file size is 300 MB. Since the size of each data block is 64MB, the file will be divided into 5 chunks of data blocks where 4 of them equals 64 MB and the 5th one is 44MB and stores them in 5 different data node servers within same cluster with 3 replicas. Here, the chunks of data are called inputsplit
. NameNode
service then keep the information like where the data blocks are stored, how much is the block size, etc. This information is called meta data.
Here is the complete flow of the operation:
- The file is divided into 5 inputsplit say a.jpg, b.jpg, c.jpg, d.jpg and e.jpg and the original file name is photo.jpg and file size 300 MB
- The client sends request to
NameNode
server with this details asking what are the DataNode
Servers has available data blocks to store them. NameNode
server respond to client with the details of DataNode
servers which has enough space to store the file. Let’s say it sends the following details:
InputSplit | DataNode Server |
a.jpg
| Data Node Server 1
|
b.jpg
| Data Node Server 3
|
c.jpg
| Data Node Server 5
|
d.jpg
| Data Node Server 6
|
e.jpg
| Data Node Server 7
|
- Once the client receives response from
NameNode
server, it starts requesting the DataNode
servers to store the file. It starts sending the first inputsplit
a.jpg to DataNode
server 1. - Once the
DataNode
Server 1 receives the request, it stores a.jpg in its local file system and request replication to DataNode
Server 3. - Once the
DataNode
Server 3 receives the request, it stores a.jpg in its local file system and request another replication to DataNode
Server 7. - Once the
DataNode
Server 7 receives the request, it stores a.jpg in its local file system and send acknowledgement back to DataNode
Server 3 saying that the file has been stored properly. DataNode
Server 3 then send acknowledgement back to DataNode
Server 1 saying that the file has been properly replicated in DataNode
Server 3 & 5 DataNode
Server 1 then send acknowledgement back to the client saying that the file has been stored and replicated properly. DataNode
Server 1, 3 & 5 send BlockReport
to NameNode
server to update metadata information. - The same process repeats for the other inputsplits.
- If any of the
DataNode
server 1, 3 & 5 stops sending Heartbeat
and BlockReport
, NameNode
server thinks that the DataNode
server is dead and choose another DataNode
server to replace the replication of the a.jpg inputsplit
. - There should be a program written in Java or any other language to process the file photo.jpg. The client sends this program to
JobTracker
component of Hadoop. The JobTracker
component gets the meta data information from NameNode
server and then communicates with TaskTracker
components of the respective DataNode
servers to process the file. The communication between JobTracker
and TaskTracker
is called Map
. The number of inputslit
s in DataNode
servers equals the number of Mapper
. In the above example, there will be 5 mappers running to process photo.jpg file. - The
TaskTracker
components keep reporting to JobTracker
component if they are processing the request properly or are they alive. If there is any TaskTracker
stops reporting to JobTracker
, the JobTracker
assigns the same task to one of the TaskTracker
s where the replications of the inputslipt
are stored. - The
JobTracker
assigns tasks to TaskTracker
depending on how close the TaskTracker
is and how many mappers are running. - Each
Mapper
produces one output file for every task assigned. In this example, there will be 5 mappers producing 5 different output files. There will be a Reducer
who will combine these 5 input files and report to the DataNode
server where the Reducer
is running, DataNode
Server 4 for example. The DataNode
Server 4 will then communicate with NameNode
server by providing meta data saying that there is one out file called output.jpg has been processed and ready to use. - Client will keep watching and communicate with
NameNode
server once the processing is completed 100% and output.jpg file is generated. NameNode
server response back to client saying that the file has been processed and ready to use in DataNode
Server 4. - Client then sends request to
DataNode
Server 4 directly and get the output.jpg file
Conclusion
Hope you enjoyed reading it and have learnt something new. In my next consecutive articles, I will show you how to install Hadoop and explain different components of Hadoop in details.
Thanks for reading my article and keep in touch.
History
- 22nd January, 2017: Initial version