
Apache Hadoop cluster setup guide

APACHE HADOOP CLUSTER SETUP ON UBUNTU 16 (64-BIT)


Step-1: Install the Ubuntu OS for the master and slave nodes.
            1- Install VMware Workstation 14.
            https://www.vmware.com/in/products/workstation-pro/workstation-pro-evaluation.html
            2- Install Ubuntu 16 (64-bit) for the master node using VMware.
            3- Install Ubuntu 16 (64-bit) for the slave node using VMware.



Step-2: Update the root password so that you can perform all admin-level operations.

             sudo passwd root    <set a new root password>


Step-3: Create a user for the Hadoop ecosystem from the root account.
            It is recommended to create a separate user for Hadoop to isolate the Hadoop file system from the Unix file system.
           # useradd -m -s /bin/bash hadoopuser   <-m creates the user's home directory>
           # passwd hadoopuser
    New password:
    Retype new password:
            compgen -u   <to list all users>
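
            Optionally (not something the original steps require), you can also give hadoopuser sudo rights and switch to it to confirm the account works:
            usermod -aG sudo hadoopuser   <optional: lets hadoopuser run admin commands>
            su - hadoopuser               <switch to the new user>
            whoami                        <should print hadoopuser>
            exit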



Step-4: Check the IP address of the master and slave nodes and update the hosts file with the FQDNs.
              ifconfig
              master IP on my machine: 192.168.60.132
              slave IP on my machine:  192.168.60.133

              Edit the hosts file on both machines, master and slave:
              sudo nano /etc/hosts   and   sudo nano /etc/hostname   <to change the hostname>

              Add the IP addresses to /etc/hosts:
              192.168.60.132 master.node.com master
              192.168.60.133 slave.node.com slave
              <Ctrl+X, then confirm to save>
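
              A quick sanity check, assuming the IPs above match your own ifconfig output: ping each node by its FQDN from the other machine.
              ping -c 2 slave.node.com     <run on the master>
              ping -c 2 master.node.com    <run on the slave>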




Step-5: Change the hostname on both machines, master and slave.
              nano /etc/hostname
              Replace "ubuntu" with "master" on the master machine and with "slave" on the slave machine.
              # hostname master        <on the master node>
              # hostname               <verify>
              # hostname slave         <on the slave node>
              # hostname               <verify>

Step-6: Transfer all setup files <hadoop, java, spark, scala, hive, pig>.
       The internet should be connected. Shut down all machines, then go to Virtual Machine Settings ---> Options tab ->
       Shared Folders -> enable
       Guest Isolation -> enable

Step-7: Turn on the machine and go to Network -> Windows Network -> access the shared folder to reach all
            setup files such as the JDK, Apache Hadoop and Apache Spark.

           https://spark.apache.org/downloads.html

            If Samba is not installed on Linux you cannot share folders and files.
            Go to the folder's Properties in the Linux file manager -> Local Network Share -> check "Share this folder" -> install Samba when prompted.

            Then go again to Network -> Windows Network -> access the shared folder
            username: <your Windows OS username>
            password: <your Windows OS password>


Step-8: Install SSH so that the master and slave can communicate with each other.
             sudo apt-get install scala
             sudo apt-get install openssh-server openssh-client   <to access the slave node terminals>
             ssh localhost   <check that it is working>

             ssh-keygen -t rsa -P ""
            cat ~/.ssh/id_rsa.pub
            cat ~/.ssh/authorized_keys  <no such file yet>
            cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   <append the key's contents; this creates the file>
            chmod 700 ~/.ssh/authorized_keys

Copy this key to every machine, master as well as slave node <you can run this command from the master and copy the key onto the slave>:

             cat ~/.ssh/id_rsa.pub | ssh username@remote_host "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"

            Or, if you are facing problems, copy the key with ssh-copy-id instead:
             ssh-keygen -b 4096
             ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@master
             ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@slave
             <some useful commands for you>
             restorecon -R -v  ~/.ssh                   
             service ssh stop             
             service ssh start
             ssh -vv user@host                             
             whereis ssh

This is my master node's RSA public key:
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCq4Ov0Pl5HdhgDt+StGJHEFnNRCn6lMSAHjIW366aCBPUKcdzEP8V/CxMDsJvKIMuBv17tXYuUWDvWCsQlk+jGImybRrDm+W78sURfAg46TIB8XDmVr47qFWXlecCyDSzn13mNKXsMGLfSZjCS1ibwmSnExBmsewf9qQnYDorgJdqSwlEe7AOc60MGDoixnCnWPBmapfbmcrKlQ86B7RXghxuRXFVMjCKmrO/SPKEwpwlBbsQjP/5itz8tE5XXIHahzz1o0M9EYuKeeASpMHFRZEaeI18SbiVF7xiOjmuz6wnr6OHbb1SGG0R05d9iaDf5K08eVJIkWxfR/FGcuP+L root@master
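
Once the public key is in the slave's authorized_keys, a quick check is to run a command over ssh; it should complete without a password prompt (this assumes the keys were set up as root on both machines, as the root@master comment above suggests; replace root with hadoop if that is the account you used with ssh-copy-id):
             ssh root@slave hostname      <should print slave, with no password prompt>
             ssh root@master hostname     <should print master>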

Step-9: Install JDK 1.8.
   1- Switch to the Desktop/data directory:
   2- > cd Desktop
   3- > cd data
   4- Desktop/data> tar xvfz jdk1.8.tar.gz      <extract the JDK archive>
Make a new java directory inside /usr/lib:
        root@master> cd /usr/lib
        root /usr/lib> mkdir java
root /usr/lib> chmod 777 /usr/lib/java  <give full permissions on the java directory>

                Manually copy the jdk1.8.0_161 folder from Desktop/data and paste it inside the /usr/lib/java directory.
                Move to /usr/lib/java/jdk1.8.0_161/
        $ cd /usr
       -> $ cd lib/java
       -> $ cd jdk1.8.0_161/
-> sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/java/jdk1.8.0_161/bin/java" 1
-> sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/java/jdk1.8.0_161/bin/javac" 1
-> sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/java/jdk1.8.0_161/bin/javaws" 1
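
To confirm the JDK is wired up (the paths assume the jdk1.8.0_161 directory used above), check the version and the registered alternatives:
   java -version                             <should report version 1.8.0_161>
   javac -version
   sudo update-alternatives --config java    <lists/selects the registered java alternatives>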

Step-10: Update the .bashrc file.
root#>  nano ~/.bashrc
root#>  gedit ~/.bashrc   <use nano if gedit is not working>
     
  Now add the following at the end of the .bashrc file:
                #JAVA_HOME directory setup
              export JAVA_HOME=/usr/lib/java/jdk1.8.0_161
              export PATH="$PATH:$JAVA_HOME/bin"
source ~/.bashrc    <reload the file>
               echo $JAVA_HOME   <verify the path>



Step-11: Copy the master node's RSA key to all slave nodes.

<SLAVE>
Install SSH on the slave node: sudo apt-get install openssh-server openssh-client
ssh-keygen -t rsa -P ""
nano ~/.ssh/authorized_keys      <paste the master's public key, shown below, into this file>

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCq4Ov0Pl5HdhgDt+StGJHEFnNRCn6lMSAHjIW366aCBPUKcdzEP8V/CxMDsJvKIMuBv17tXYuUWDvWCsQlk+jGImybRrDm+W78sURfAg46TIB8XDmVr47qFWXlecCyDSzn13mNKXsMGLfSZjCS1ibwmSnExBmsewf9qQnYDorgJdqSwlEe7AOc60MGDoixnCnWPBmapfbmcrKlQ86B7RXghxuRXFVMjCKmrO/SPKEwpwlBbsQjP/5itz8tE5XXIHahzz1o0M9EYuKeeASpMHFRZEaeI18SbiVF7xiOjmuz6wnr6OHbb1SGG0R05d9iaDf5K08eVJIkWxfR/FGcuP+L root@master

cat ~/.ssh/authorized_keys  <confirm the master's key is now in the file>
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   <also append the slave's own key>
Go back to the master:
ssh 192.168.60.133   <you can now access the slave node's terminal from the master without a password>


Step-12: Install Spark on the master node.

     1- su root   <enter the root password>
2- cd Desktop/data
3- root@master Desktop/data> tar xzf spark-2.2.1-bin-hadoop2.7.tgz
4- Set the path in ~/.bashrc:

export JAVA_HOME=<path-of-Java-installation>    (here: /usr/lib/java/jdk1.8.0_161)
export SPARK_HOME=<path-to-the-root-of-your-spark-installation>    (e.g. the extracted spark-2.2.1-bin-hadoop2.7 directory)
export PATH=$PATH:$SPARK_HOME/bin
   
       source ~/.bashrc  <update the .bashrc file>
        cd $SPARK_HOME  <check your path>
        > spark-shell
> pyspark

Enjoy, Apache Spark is working now! We will see Spark cluster installation in the next blog.

I. Spark master web UI (standalone mode)
http://MASTER-IP:8080/

II. Spark application UI
http://MASTER-IP:4040/
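
As a quick smoke test before moving on, you can run the SparkPi example that ships with Spark (assuming SPARK_HOME points at the extracted spark-2.2.1-bin-hadoop2.7 directory):
$SPARK_HOME/bin/run-example SparkPi 10     <should end with a line like "Pi is roughly 3.14...">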

Step-14: Install Apache Hadoop on both the master and the slave node.
              1- Extract hadoop-2.6.5 on both machines.
              2- Create a new directory /usr/lib/hadoop on both machines.
              3- Copy and paste the Hadoop folders and files inside the /usr/lib/hadoop directory.

             4- Set the .bashrc file for Hadoop:

export HADOOP_HOME=/usr/lib/hadoop/hadoop-2.6.5
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

 Ctrl+X   <save and exit>
source ~/.bashrc    <to update the .bashrc file>
cd $HADOOP_HOME   or   cd $HADOOP_HOME/bin
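
If the paths are right, the hadoop command now resolves from any directory:
hadoop version    <should print Hadoop 2.6.5>
which hadoop      <should point inside $HADOOP_HOME/bin>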


Step-15: Update hadoop-env.sh.
              Update JAVA_HOME inside hadoop-env.sh ($HADOOP_HOME/etc/hadoop):
              export JAVA_HOME=/usr/lib/java/jdk1.8.0_161

              (For reference, an example from a different setup:
              export JAVA_HOME=/opt/jdk1.7.0_17
              export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
              export HADOOP_CONF_DIR=/opt/hadoop/hadoop/conf)

Step-16: Update core-site.xml.
               nano core-site.xml
               Point fs.defaultFS at the master in core-site.xml:
               <configuration>
                 <property>
                   <name>fs.defaultFS</name>            //  older name: fs.default.name
                   <value>hdfs://m1:9000</value>        //  or hdfs://192.168.60.132:9000 <use your master hostname or IP>
                 </property>
               </configuration>

                $ mkdir /var/lib/hadoop
                $ chmod 777 /var/lib/hadoop

               <configuration>
                 <property>
                   <name>hadoop.tmp.dir</name>
                   <value>/var/lib/hadoop</value>
                 </property>
               </configuration>
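
To confirm the file is being picked up (this assumes HADOOP_CONF_DIR is exported as in Step-14), you can read the value back with hdfs getconf:
               hdfs getconf -confKey fs.defaultFS    <should print hdfs://m1:9000, or your master hostname/IP>
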
Step-17: Update hdfs-site.xml.

dfs.replication (data replication value) = 1
(In the path below, /hadoop/ is the user name, and
hadoopinfra/hdfs/namenode is the directory used by the HDFS file system.)
namenode path = /home/hadoop/hadoopinfra/hdfs/namenode
(hadoopinfra/hdfs/datanode is the directory used by the HDFS file system.)
datanode path = /home/hadoop/hadoopinfra/hdfs/datanode
  <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
    </property>

where hadoop is your username

<configuration>   
<property>       
<name>dfs.replication</name>         
<value>2</value>     
</property>

<property>       
<name>dfs.name.dir</name>         
<value>file:/usr/lib/hadoop/hadoopdata/hdfs/namenode</value>
</property>

<property>       
<name>dfs.data.dir</name>         
<value>file:/usr/lib/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
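
Whichever pair of paths you use, the namenode and datanode directories must exist and be writable by the user who starts Hadoop, so create them before formatting (the paths below match the /usr/lib/hadoop variant above; adjust for the /home/hadoop variant, and skip the chown if you run everything as root):
mkdir -p /usr/lib/hadoop/hadoopdata/hdfs/namenode     <on the master>
mkdir -p /usr/lib/hadoop/hadoopdata/hdfs/datanode     <on the slave>
chown -R hadoopuser:hadoopuser /usr/lib/hadoop/hadoopdata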


Update mapred-site.xml

<configuration>
  <property> 
<name>mapreduce.framework.name</name> 
<value>yarn</value>
</property>
</configuration>


or   <alternatively, use the older MRv1 property below in a cluster setup>

<configuration>
  <property> 
<name>mapred.job.tracker</name> 
<value>localhost:9001</value>
</property>
</configuration>


Update yarn-site.xml

<configuration>
          <property>
              <name>yarn.acl.enable</name>
              <value>0</value>
          </property>
          <property>
             <name>yarn.resourcemanager.hostname</name>
             <value>master</value>        //  use your master hostname
           </property>
  <property> 
<name>yarn.nodemanager.aux-services</name> 
<value>mapreduce_shuffle</value>
</property>
</configuration>
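
All four *-site.xml files should normally match on the master and the slave. Assuming passwordless ssh from Step-8 and the same install path on both machines, one way to keep them in sync is to copy them from the master with scp:
scp $HADOOP_CONF_DIR/*-site.xml  root@slave:$HADOOP_CONF_DIR/    <replace root with the user you use on the slave>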




Step-19: Format your namenode to start Apache Hadoop.
                1- $HADOOP_HOME> hadoop namenode -format   or   $HADOOP_HOME/bin> hdfs namenode -format
               2- Now you can run start-all.sh:
$HADOOP_HOME/sbin> start-all.sh
or   start-dfs.sh
> start-yarn.sh

               3- $HADOOP_HOME/etc/hadoop> vi masters
m1
$HADOOP_HOME/etc/hadoop> vi slaves
m1
m2
<m1 and m2 stand for the master and slave hostnames; use the names you put in /etc/hosts, e.g. master and slave>

               4- $HADOOP_HOME> ls
Delete the current/ and temp/ directories:
$HADOOP_HOME> rm -rvf  current/ temp/

               5- $HADOOP_HOME/sbin> jps  [check the daemon status]
hadoop dfsadmin -report   [HDFS cluster report]
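
With the slaves file above listing both m1 and m2, a typical jps output (exact process IDs will differ) looks like:
master> jps    <expect NameNode, SecondaryNameNode, ResourceManager, DataNode, NodeManager, Jps>
slave>  jps    <expect DataNode, NodeManager, Jps>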


Apache Hadoop Useful Commands

start-all.sh & stop-all.sh :
Used to start and stop the Hadoop daemons all at once. Issuing them on the master machine will start/stop the daemons on all nodes of the cluster. Deprecated, as you have probably already noticed.

start-dfs.sh, stop-dfs.sh and start-yarn.sh, stop-yarn.sh :
Same as above, but these start/stop the HDFS and YARN daemons separately, on all nodes, from the master machine. It is advisable to use these commands now instead of start-all.sh & stop-all.sh.

hadoop-daemon.sh namenode/datanode and yarn-daemon.sh resourcemanager :
To start individual daemons on an individual machine manually. You need to go to that particular node and issue these commands.
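
For example (script names as they ship in the Hadoop 2.x sbin directory), to restart only the slave's daemons after a reboot, log in to the slave and run:
hadoop-daemon.sh start datanode
yarn-daemon.sh start nodemanager
hadoop-daemon.sh stop datanode      <and the matching stop commands when needed>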

hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put /home/master/Desktop/abc.txt  /user/hadoop/

http://localhost:50070/  [NameNode web UI]
http://hadooptutorial.info/hdfs-web-ui/
http://localhost:8088/  [ResourceManager: all applications on the Hadoop cluster]

#start-dfs.sh
#start-yarn.sh
#stop-yarn.sh
#stop-dfs.sh
#yarn node -list
#yarn application -list
#yarn jar    ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount "books/*" output
