Hadoop Cluster Setup with HDFS Architecture Demo

Hello and welcome! In this blog we will learn how to set up a Hadoop cluster on our local machine using Oracle VirtualBox, and we will also walk through a practical HDFS demo. Before moving on to the hands-on part, let us first build a basic understanding of the technology.

What is Hadoop?

Apache Hadoop is a collection of open-source software utilities that facilitates the distribution of large data sets across clusters of computers using simple programming models.

It provides a software framework for distributed storage and for processing big data using the MapReduce model. It was originally designed for computer clusters ("a set of tightly connected computers that work together, so they can be viewed as a single system") built from commodity hardware, though it has also found use on clusters of higher-end hardware.

The base Apache Hadoop framework is composed of the following modules:

  • Hadoop Common: libraries and utilities needed by other Hadoop modules;
  • Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
  • Hadoop YARN (introduced in 2012): a platform responsible for managing computing resources in clusters and scheduling users' applications on them;
  • Hadoop MapReduce: an implementation of the MapReduce programming model for large-scale data processing.

You should now be familiar with the basic terminology, so let us quickly move on to the interesting part: setting up the cluster.

Pre-requisite —

Here I will be creating a master-slave cluster, i.e. NameNode and DataNodes, with the HDFS practical in mind. It will consist of a single NameNode (master node) and two DataNodes (slave nodes), although you can tweak this depending on your use case.

Step 1 :- Installing RHEL 8 on Oracle VirtualBox

To install RHEL 8 on VirtualBox, you can follow the article below. In this cluster setup we will be using one NameNode (installed in GUI / Workstation mode) and two DataNodes (installed in CLI / minimal mode).

Step 2 :- Download Hadoop and JDK

To avoid conflicts between the Hadoop version and the OS JDK version, I'll be using a tested combination of both:

Hadoop version 1.2.1-1

JDK version 8

Download these either on Windows and then use WinSCP to transfer them to Linux, or download them directly on Linux through the links provided.

Step 3 :- Install JDK & Hadoop on Name Node (Master Node)

First we will install the JDK, as it is a prerequisite for installing Hadoop.

Once the JDK installs successfully, we will install Hadoop using rpm.
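A minimal sketch of these two installs, assuming hypothetical rpm file names (substitute the exact names of the files you downloaded):

```shell
# Install the JDK first (file name is a placeholder)
rpm -ivh jdk-8-linux-x64.rpm
java -version          # verify the JDK is installed

# Then install Hadoop 1.2.1-1 (file name is a placeholder)
rpm -ivh hadoop-1.2.1-1.x86_64.rpm
hadoop version         # verify the Hadoop install
```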

Step 4 :- Install JDK & Hadoop on Data Node (Slave Node)

In this step we will do on both DataNodes the same thing we did on the NameNode: first install the JDK and then Hadoop. Note that this has to be done on each DataNode.

Step 5 :- Configuring Master Node (Name Node)

To configure the NameNode for HDFS, we will create a directory in the root (/) drive where the file system's metadata will live. Once the directory is created, we will configure hdfs-site.xml to tell the node to act as the NameNode.

In hdfs-site.xml I have created a property that tells the node it is the NameNode and is being configured for DFS (Distributed File System). In the name tag we provide the keyword for the DFS NameNode directory, and in the value tag we provide the directory we created earlier.
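For Hadoop 1.x that keyword is dfs.name.dir. A sketch of the file, assuming the directory created above is /nn (substitute your own path):

```xml
<!-- hdfs-site.xml on the NameNode; /nn is an assumed directory name -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
```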

Once this configuration is done, we will move on to the core-site.xml file. This file is responsible for the networking side of the NameNode: it specifies the IP of the NameNode and the port number.

In core-site.xml I have created a property that tells the node it is being configured for DFS. In the name tag we provide the keyword for the file system's NameNode, and in the value tag we provide the protocol (hdfs), the IP of the NameNode, and a valid, unused port (here, port 9001).
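For Hadoop 1.x that keyword is fs.default.name. A sketch, assuming 192.168.1.10 as an example NameNode IP (substitute your own):

```xml
<!-- core-site.xml on the NameNode; the IP address is an example -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.10:9001</value>
  </property>
</configuration>
```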

Step 6 :- Configuring Slave Node (Data Node)

To configure the DataNode for HDFS, we will create a directory in the root (/) drive whose storage will be contributed to the distributed file system. Once the directory is created, we will configure hdfs-site.xml to tell the node to act as a DataNode.

In hdfs-site.xml I have created a property that tells the node it is a DataNode and is being configured for DFS. In the name tag we provide the keyword for the DFS DataNode directory, and in the value tag we provide the directory we created earlier, whose storage will be contributed to the DFS.
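For Hadoop 1.x that keyword is dfs.data.dir. A sketch, assuming the directory created above is /dn (substitute your own path):

```xml
<!-- hdfs-site.xml on each DataNode; /dn is an assumed directory name -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>
```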

Once this configuration is done, we will move on to the core-site.xml file. This file is responsible for the networking side of the DataNode: it tells the DataNode the IP and port of the NameNode so that the DataNode can contact it.

In core-site.xml I have created a property that tells the DataNode it is being configured for DFS. In the name tag we provide the keyword for the file system's NameNode, and in the value tag we provide the protocol (hdfs) along with the IP and port of the NameNode, so that the DataNode connects to the right address.
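This is the same fs.default.name property as on the NameNode, pointing at the NameNode's address (the IP here is the same example as before):

```xml
<!-- core-site.xml on each DataNode; points at the NameNode (example IP) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.10:9001</value>
  </property>
</configuration>
```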

Step 7 :- Starting Daemon on Name Node

Once all the configuration is done, we will start the Hadoop service. Before starting it, we use the "jps" command to check for running Java processes; at this point there are none. We also use netstat to check whether any process is listening on port 9001.

Once we start the Hadoop service, we will see a Java process named NameNode, and netstat will show port 9001 listening. Keep in mind that for other nodes to reach this port, we need to open it in the firewall. Alternatively we could disable the firewall entirely, but that is not recommended.
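The sequence on the NameNode might look like the following. Note that on a fresh cluster the NameNode's metadata directory typically needs to be formatted once before the first start; the firewall commands assume firewalld, the default on RHEL 8:

```shell
jps                               # before: no Hadoop Java processes
netstat -tnlp | grep 9001         # before: nothing listening on 9001

hadoop namenode -format           # first run only: initialize the metadata directory
hadoop-daemon.sh start namenode   # start the NameNode daemon

jps                               # after: shows a NameNode process
netstat -tnlp | grep 9001         # after: port 9001 is listening

# Open the port rather than disabling the firewall
firewall-cmd --add-port=9001/tcp --permanent
firewall-cmd --reload
```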

Step 8 :- Starting Daemon on Data Node

After finishing the configuration on the DataNodes and starting the service on the NameNode, we will start the Hadoop service on each DataNode. Before starting it, we again use the "jps" command to check for running Java processes; there are none yet.

Once we start the Hadoop service, we will see a Java process named DataNode. If we see this, our DataNode is configured correctly and has connected to the NameNode.
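The same sequence on a DataNode, to be repeated on each one:

```shell
jps                               # before: no Hadoop Java processes
hadoop-daemon.sh start datanode   # start the DataNode daemon
jps                               # after: shows a DataNode process
```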

Output:

Once all of this is set up, we will check the distributed storage with the command "hadoop dfsadmin -report". This gives a complete report of all connected DataNodes and the storage each one is contributing to the cluster.
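To round off the HDFS practical, we can also store and list a file in the distributed file system; the file paths here are purely illustrative:

```shell
hadoop dfsadmin -report               # cluster capacity and per-DataNode usage
hadoop fs -put /etc/hosts /demo.txt   # write a file into HDFS (illustrative paths)
hadoop fs -ls /                       # list the HDFS root to confirm the upload
```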

So this was a small example of a Distributed File System: the storage contributed by the slave nodes (DataNodes) is pooled into a single distributed file system managed by the NameNode.

Summary

In this article, we have studied the Hadoop architecture. Hadoop follows a master-slave topology: the master node manages the distributed storage contributed by the slave nodes. HDFS is Hadoop's distributed file system for storing big data. The HDFS daemon NameNode runs on the master node of the cluster, and the DataNode daemon runs on the slave nodes.

I hope you liked my effort to make the Hadoop journey easy. If you found this article helpful, show your love with a clap. Keep learning!

Connect with me on -