Architecture of Hadoop:
Hadoop follows a master-slave architecture, which means there is one master machine and multiple slave machines. The data that you give to Hadoop is stored across these machines in the cluster.
Two important components of Hadoop are:
1. HDFS (data storage)
2. MapReduce (analyzing and processing)
1. HDFS:
The Hadoop Distributed File System (HDFS) is a distributed file system used to store very large amounts of data. HDFS follows a master-slave architecture, which means there is one master machine (the Name Node) and multiple slave machines (the Data Nodes). The data that you give to Hadoop is stored across these machines in the cluster.
Various components of HDFS are:
a) Blocks:
HDFS is a block-structured file system in which each file is split into blocks of equal size and stored across one or more machines in the cluster. HDFS blocks are 64 MB by default in Apache Hadoop 1.x and 128 MB by default in Hadoop 2.x and in Cloudera's distribution, but the block size can be changed as needed. If a file in HDFS is smaller than the block size, it does not occupy a full block. For example, if the file size is 10 MB and the HDFS block size is 128 MB, the file takes up only 10 MB of space.
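To make this concrete, here is a minimal sketch (not production code) that writes a file to HDFS with an explicit block size using Hadoop's Java FileSystem API. The path /tmp/example.txt, the 128 MB block size and the replication factor of 3 are only assumptions for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connect to the default file system (HDFS)

        long blockSize = 128L * 1024 * 1024;        // 128 MB blocks for this file (assumption)
        short replication = 3;                      // assumed replication factor
        int bufferSize = 4096;

        // Each block of this file will be at most 128 MB; a 10 MB file
        // still produces a single block that stores only 10 MB.
        FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"),
                true, bufferSize, replication, blockSize);
        out.writeUTF("hello hdfs");
        out.close();
    }
}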
b) Name Node:
The Name Node is the controller/master of the system. It manages the file system namespace, decides on which Data Nodes the blocks of a file are placed, and stores the metadata of all the files in HDFS. This metadata includes the file name, the location of each block, the block size and the file permissions.
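As an illustration, the sketch below asks for exactly this metadata through the Java FileSystem API, assuming a file such as /tmp/example.txt already exists in HDFS: the file length, block size, replication factor, permissions, and which Data Nodes hold each block.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadataExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/tmp/example.txt"));

        System.out.println("length      = " + status.getLen());
        System.out.println("block size  = " + status.getBlockSize());
        System.out.println("replication = " + status.getReplication());
        System.out.println("permission  = " + status.getPermission());

        // For each block, the name node reports which data nodes store a replica.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + block.getOffset()
                    + " (" + block.getLength() + " bytes) on "
                    + String.join(", ", block.getHosts()));
        }
    }
}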
c) Data Node-
The Data Node is the place where the actual data is stored: the blocks of every file live on the Data Nodes. Data Nodes store and retrieve blocks when they are requested by the client or the Name Node, and they perform operations such as block creation, deletion and replication as instructed by the Name Node.
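As a small illustration of that last point, here is a sketch that requests a change in a file's replication factor through the Java API; the Name Node then instructs the Data Nodes to create or delete block replicas. The path and the factor of 3 are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        boolean accepted = fs.setReplication(new Path("/tmp/example.txt"), (short) 3);
        System.out.println("replication change accepted: " + accepted);
        // The actual copying/deleting of block replicas happens on the data nodes,
        // asynchronously, as directed by the name node.
    }
}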
d) Secondary Name Node-
Many people think that the Secondary Name Node is just a backup of the primary Name Node in Hadoop, but it is not a backup node. The Name Node is the primary node, and it stores all the metadata in the fsimage and edit log files. The Secondary Name Node periodically pulls read-only copies of these files, merges the edit log into the fsimage in its own temporary working directory, and copies the resulting checkpoint back to the Name Node, which then replaces its old fsimage with the updated one. This keeps the edit log from growing too large; the Secondary Name Node does not take over client requests when the Name Node goes down.
2. MapReduce:
MapReduce is a framework and processing technique with which we can write applications that process huge amounts of data in parallel, on large clusters of commodity hardware, in a reliable manner. MapReduce programs are usually written in Java, but higher-level tools such as Apache Pig can also be used to express the processing.
The MapReduce algorithm contains two important tasks.
a) Map-
In the map stage, the mapper takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
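For illustration, here is a minimal word-count mapper sketch using Hadoop's org.apache.hadoop.mapreduce Java API; the class name and the word-count use case are assumptions for the example, not part of the architecture itself.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Break each input line into tokens and emit a (word, 1) tuple per token.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}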
b) Reduce-
In the reduce stage, the reducer takes the output of the map stage as its input and combines those data tuples into a smaller set of tuples. The reduce job is always performed after the map job.
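And the matching reducer sketch: for each word it sums the 1s emitted by the mappers, so that many (word, 1) tuples become a single (word, count) tuple.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        result.set(sum);
        context.write(word, result);    // emit (word, total count)
    }
}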
Components of MapReduce:
a) JobTracker:
The JobTracker is a daemon that typically runs on the name node (master) machine. There is only one JobTracker in the cluster, but there are many TaskTrackers, one per data node. The JobTracker assigns tasks to the different TaskTrackers. It is a single point of failure: if the JobTracker goes down, all running jobs are halted. It receives heartbeats from the TaskTrackers, based on which it decides whether an assigned task has been completed. (A client-side driver sketch is shown after the process list below.)
JobTracker process:
1. JobTracker receives the requests from the client.
2. JobTracker talks to the NameNode to determine the location of the data.
3. JobTracker finds the best TaskTracker nodes to execute tasks.
4. The JobTracker submits the work to the chosen TaskTracker nodes.
5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals then work is scheduled on a different TaskTracker.
6. When the work is completed, the JobTracker updates its status and reports the overall status of the job back to the client.
7. The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.
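To make steps 1-4 concrete from the client's side, here is a minimal driver sketch that configures and submits a word-count job to the cluster, reusing the TokenizerMapper and IntSumReducer sketches shown earlier; the /input and /output paths are assumptions. It uses the newer org.apache.hadoop.mapreduce Job API; in classic MapReduce the client would submit through JobConf/JobClient to the JobTracker, but the flow is the same.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/input"));     // assumed input directory
        FileOutputFormat.setOutputPath(job, new Path("/output"));  // assumed output directory

        // Submit the job and wait; progress reported back by the cluster is printed to the console.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}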
b) TaskTracker:
The TaskTracker is also a daemon, and it runs on the data nodes. TaskTrackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the JobTracker initializes the job and divides the work among the TaskTrackers, which perform the map and reduce tasks. While executing these tasks, each TaskTracker continuously communicates with the JobTracker by sending heartbeats. If the JobTracker does not receive a heartbeat from a TaskTracker within the specified time, it assumes that the TaskTracker has crashed and assigns its tasks to another TaskTracker in the cluster.
TaskTracker process:
1. The JobTracker submits the work to the TaskTracker nodes.
2. The TaskTracker runs the tasks and reports the status of each task to the JobTracker.
3. Its function is to follow the orders of the JobTracker and to update the JobTracker with its progress status periodically.
4. TaskTracker will be in constant communication with the JobTracker.
5. TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, JobTracker will assign the task to another node.