Integrating LVM volumes with Hadoop and AWS to provide elasticity.
Hadoop: Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications in scalable clusters of computer servers. It’s at the center of an ecosystem of big data technologies that are primarily used to support advanced analytics initiatives, including predictive analytics, data mining and machine learning.
Prerequisites: Read this blog first for an understanding of LVM.
Integrating LVM with Hadoop
Now, to use Hadoop we need a minimum of two VMs: one for the NameNode and another for the DataNode. So I am launching one more VM for setting up the NameNode. As my current setup is not capable of running two VMs at once, I am launching one of the instances on AWS.
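If you want to script this step instead of using the AWS console, an EC2 instance can be launched from the AWS CLI along these lines (the AMI ID, key pair, and security group below are placeholders, not values from my setup):

```bash
# Launch one EC2 instance for the NameNode (all IDs/names are placeholders)
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type t2.micro \
  --key-name my-keypair \
  --security-group-ids sg-0123456789abcdef0 \
  --count 1
```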
Now I have access to two instances. To set up Hadoop, we need two packages: Java and Hadoop. Download them to your instances from the links below. As I am using specific versions of Java and Hadoop, it is hard to install them from a default package manager like yum or apt.
Hadoop: https://softwareforarth.s3.ap-south-1.amazonaws.com/hadoop-1.2.1-1.x86_64.rpm
Java: https://softwareforarth.s3.ap-south-1.amazonaws.com/jdk-8u171-linux-x64.rpm
Install the above packages with the “rpm -i <package> --force” command.
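For example, on each instance (a minimal sketch; it assumes wget is available and you are running as root):

```bash
# Download both packages from the links above
wget https://softwareforarth.s3.ap-south-1.amazonaws.com/jdk-8u171-linux-x64.rpm
wget https://softwareforarth.s3.ap-south-1.amazonaws.com/hadoop-1.2.1-1.x86_64.rpm

# Install Java first, then Hadoop (note that --force is a double-dash option)
rpm -i jdk-8u171-linux-x64.rpm
rpm -i hadoop-1.2.1-1.x86_64.rpm --force
```

You can verify the installs with “java -version” and “hadoop version”.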
Install both packages on both virtual machines in the same way. After installing them, it’s time to set up the NameNode and DataNode.
Setting Up the NameNode.
To set up the NameNode, edit the /etc/hadoop/hdfs-site.xml and /etc/hadoop/core-site.xml files, using configuration along the lines shown below.
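A minimal sketch of the two files (port 9001 is an assumption on my part; use whichever port you prefer, as long as the DataNode later points at the same one):

```xml
<!-- /etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <!-- directory where the NameNode keeps its metadata -->
    <name>dfs.name.dir</name>
    <value>/namenode</value>
  </property>
</configuration>
```

```xml
<!-- /etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <!-- address and port the NameNode listens on (9001 is a placeholder) -->
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>
```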
We use 0.0.0.0 as the NameNode’s IP because, by default, an AWS instance does not know its own public IP address. We use the /namenode folder as the NameNode’s metadata directory.
Now, to start the NameNode, first format it with the “hadoop namenode -format” command, then run “hadoop-daemon.sh start namenode” to start the Hadoop service.
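Concretely (jps is just an optional JDK utility for confirming the daemon is up):

```bash
# One-time step: format the NameNode's metadata directory
hadoop namenode -format

# Start the NameNode daemon
hadoop-daemon.sh start namenode

# Optional: confirm the NameNode process is running
jps
```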
Use the “hadoop dfsadmin -report” command to see detailed information about the Hadoop cluster. As you can see, no DataNode is configured yet, so the report shows 0 capacity.
Setting Up the DataNode.
To set up the DataNode, edit the same core-site.xml and hdfs-site.xml files. For the DataNode’s storage directory we use the folder on which we mounted our LVM volume, so that we can increase or decrease the DataNode’s size easily. For the NameNode address we use the public IP of the AWS instance on which the NameNode is set up.
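A minimal sketch of the two files on the DataNode (the mount point /dn1 is a placeholder for your own LVM mount point, NAMENODE_PUBLIC_IP is a placeholder for the real address, and the port must match whatever the NameNode uses):

```xml
<!-- /etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <!-- directory where this DataNode stores blocks: the folder our
         LVM logical volume is mounted on (placeholder path) -->
    <name>dfs.data.dir</name>
    <value>/dn1</value>
  </property>
</configuration>
```

```xml
<!-- /etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <!-- where to reach the NameNode; IP and port are placeholders -->
    <name>fs.default.name</name>
    <value>hdfs://NAMENODE_PUBLIC_IP:9001</value>
  </property>
</configuration>
```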
Now start the DataNode with the “hadoop-daemon.sh start datanode” command. With that, our Hadoop setup is complete; we can check it through the web UI or by running “hadoop dfsadmin -report”.
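For example:

```bash
# Start the DataNode daemon
hadoop-daemon.sh start datanode

# The report should now list one DataNode along with its capacity
hadoop dfsadmin -report
```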
As you can see, only about 12.24 GB is currently configured. Now suppose we need to increase the size of the DataNode without shutting down the cluster. Can we achieve this?
As we configured our DataNode on an LVM volume, yes we can, without it ever stopping its contribution to the cluster.
Let’s add 3 GB to the current capacity.
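A minimal sketch of the online resize (myvg and mylv are placeholder volume group and logical volume names; the resize2fs step assumes an ext4 filesystem, while XFS would need xfs_growfs on the mount point instead):

```bash
# Grow the logical volume by 3 GiB while it stays mounted
lvextend --size +3G /dev/myvg/mylv

# Grow the ext4 filesystem online so the DataNode sees the new space
resize2fs /dev/myvg/mylv
```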
Now refresh the web UI or check the report again from the CLI.
You can see that the storage size has increased, all without shutting down the cluster or the DataNode.
This is how you can integrate Hadoop with LVM to provide elastic storage.
I would like to thank Vimal Sir for asking me to perform this task.
Thanks for reading this blog.