How to Set Up a Hadoop Cluster Using Ansible

Abhishek Sahu
3 min read · Mar 23, 2021

Hello readers! Today I am going to show you how you can set up a Hadoop distributed-storage cluster using an Ansible playbook. Before starting, let me introduce the key terms of this blog.


Ansible: A configuration-management tool developed by Red Hat, widely used to configure large numbers of devices with simple code files called playbooks. Playbooks are declarative in nature and written in YAML (YAML Ain't Markup Language) format.
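For a flavour of what a playbook looks like, here is a minimal, hypothetical example that installs Java (a Hadoop prerequisite) on every node; the package name assumes a RHEL-family system:

---
- hosts: all
  become: yes
  tasks:
    # Hadoop needs a JDK on every node in the cluster
    - name: Install Java 8
      package:
        name: java-1.8.0-openjdk
        state: present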

Hadoop: Open-source software from Apache for distributed storage and distributed computing, used mainly to handle the problem of BIG DATA. It works on a master/slave principle, where one node manages the other instances: the master node is known as the NameNode and the workers are DataNodes (slave nodes).

Now let’s start creating our own Hadoop cluster using Ansible.

Prerequisites:

  • Ansible should be installed on one of your systems.
  • To install Ansible, run “pip3 install ansible”.
  • There should be SSH connectivity between all the nodes.
  • SSH keys should be copied to all the nodes for password-free login.
  • To copy an SSH key, use “ssh-copy-id -i <key> <IP>” (the default key is id_rsa.pub), as shown below.
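For example, generating a key pair and copying it to one node looks like this (the user and IP are placeholders; repeat the copy for every node):

ssh-keygen -t rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub root@192.168.1.10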

Step 1: Download the given repo to your Ansible node using git.
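Assuming the repository name matches the directory used in the next step (the URL placeholder stands in for the repo link from the original post):

git clone <repository-url> playtosetupHadoopCluster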

Step 2: Move into the downloaded repo using the change-directory (cd) command.

cd playtosetupHadoopCluster

Step 3: Edit the inventory.txt file inside it and enter the IPs of the nodes on which you want to set up the Hadoop cluster.


Enter all the IPs under the HadoopCluster group, the IP of the node on which you want to set up the NameNode under the namenode group, and all the remaining IPs under the datanode group.
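A filled-in inventory.txt might look like this (the IPs are placeholders; the group names follow the description above and should match what the repo's playbook expects):

[HadoopCluster]
192.168.1.10
192.168.1.11
192.168.1.12

[namenode]
192.168.1.10

[datanode]
192.168.1.11
192.168.1.12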

Step 4: Run the ansible playbook “hadoop.yml” using the following command.

ansible-playbook hadoop.yml
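If the run hangs or fails to reach a host, an ad-hoc ping helps narrow it down (passing the inventory with -i is an assumption; the repo may configure it elsewhere, e.g. in ansible.cfg):

ansible all -i inventory.txt -m ping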


Once launched, you can watch the playbook work through its tasks and configure the Hadoop cluster.


NOTE: If you are running the playbook on cloud instances, enter 0.0.0.0 as the IP for the NameNode, since a cloud instance by default doesn’t know its own public IP. When prompted on a DataNode, enter the public IP of your NameNode’s cloud instance.
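To illustrate why: HDFS daemons bind to the address configured in core-site.xml, so on a cloud NameNode the relevant property might end up looking like this (the property name and port are assumptions based on a Hadoop 1.x setup; the playbook's prompts fill in the actual values):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>

On each DataNode, the same property would instead hold the NameNode’s public IP, e.g. hdfs://<namenode-public-IP>:9001.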


After the playbook finishes successfully, check your cluster with the following command, or by opening the NameNode’s IP in your browser on port 50070.

hadoop dfsadmin -report
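As an additional sanity check (assuming the JDK’s jps tool is on the PATH), you can list the running Hadoop daemons on each node; expect a NameNode process on the master and a DataNode process on each worker:

jps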


Thanks for reading this blog. I hope you found it helpful!

For any queries, I am available on LinkedIn.


Abhishek Sahu

Hey readers, I am a tech enthusiast and a Computer Science student. Here I share various industry use cases and their solutions.