Prepare your computer for Hadoop

Summary

This is a tutorial on how to set up Hadoop and run applications locally. The instructions are written for Linux.

Installation

Install prerequisites

Before proceeding you should have the following installed:

  • A version of Java
  • SSH
  • PDSH

For Java you can use the following tutorial.

For SSH and PDSH you can use the following commands (assuming a Debian-based OS):

sudo apt-get install ssh
sudo apt-get install pdsh
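
Before moving on, you can quickly confirm that each tool is installed and reachable from your PATH (the exact version output will vary by system):

java -version
ssh -V
pdsh -V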

Install Hadoop

Hadoop can be installed by simply downloading the tar.gz file, extracting it, and configuring the JAVA_HOME path. You don't have to use a specific installer.

Download the tar.gz from the official mirror site. The mirror might change in the future, so you can look for the official release page (at the time of this tutorial the link is this one).

For example, for version 3.3.6, you could use the following command:

wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz

Download the SHA file:

wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512

Verify the hash:

gpg --print-md SHA512 hadoop-3.3.6.tar.gz
cat hadoop-3.3.6.tar.gz.sha512

The two hashes should be the same.
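
Instead of comparing the hashes by eye, you can let diff do it. This is a small sketch that assumes the .sha512 file uses the same format as the gpg --print-md output, which Apache release checksums typically do; if the formats differ, compare visually as above:

gpg --print-md SHA512 hadoop-3.3.6.tar.gz | diff - hadoop-3.3.6.tar.gz.sha512

No output means the hashes match.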

Extract the file:

tar xvzf hadoop-3.3.6.tar.gz
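
The rest of the commands in this tutorial are run from inside the extracted directory, so change into it first:

cd hadoop-3.3.6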

Configuring JAVA_HOME for Hadoop

Edit the file etc/hadoop/hadoop-env.sh and set the JAVA_HOME line:

export JAVA_HOME=/usr/java/latest

To find the proper path you can use the following command:

sudo update-alternatives --config java

The path you should use is one of the paths displayed by the previous command, with the trailing /bin/java suffix removed.

For example if the command update-alternatives displays the path:

/usr/lib/jvm/java-17-openjdk-amd64/bin/java 

Then the line you should add in etc/hadoop/hadoop-env.sh would be:

export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
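
As an optional shortcut, the following one-liner derives the same value automatically (a sketch that assumes java is on your PATH and that its symlink chain resolves to the real JDK directory):

readlink -f "$(which java)" | sed 's:/bin/java::'

Once JAVA_HOME is set, you can sanity-check the installation from the extracted directory:

bin/hadoop version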

Creating required directories

To run an application you first need to manually create an input directory:

mkdir input

Then you can run a demo:

# use Hadoop's XML config files as sample input
cp etc/hadoop/*.xml input
# run the example grep job: it counts matches of the regex in the input files
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar grep input output 'dfs[a-z.]+'
# inspect the results
cat output/*

Before running another application you need to delete the output directory: Hadoop refuses to overwrite it and will fail with an error if it already exists.
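
For example, before rerunning the demo:

rm -r output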

You may optionally configure the HDFS filesystem, but it is not necessary (and not recommended if you are a beginner who simply wants to experiment a little with Hadoop).