Hadoop

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

How to install Hadoop for Windows?

To start, download the HDP for Windows MSI at http://hortonworks.com/thankyou-hdp11-win/. It is about 460MB, and will take a moment to download.

Next, install the Microsoft Visual C++ 2010 Redistributable Package (x64).

Download and install the .NET Framework if you haven’t already.

Next, set up Java and the JAVA_HOME environment variable, which Hadoop requires. Make sure to install Java to a path that contains no spaces; “Program Files” will not work!

To set up JAVA_HOME, open the file browser, right-click Computer, and go to Properties -> Advanced System Settings -> Environment variables. Then create a new System variable called JAVA_HOME that points to your Java install (in this case, C:\java\jdk1.6.0_31).
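
The two requirements above (the variable is set, and the path has no spaces) can be checked from a Cygwin or other POSIX shell. This is only a minimal sketch: check_java_home is a hypothetical helper name, not part of Hadoop or the HDP installer.

```shell
# Hypothetical helper: sanity-check a JAVA_HOME value.
# Returns non-zero if the value is empty or contains a space.
check_java_home() {
  java_home_value="$1"
  if [ -z "$java_home_value" ]; then
    printf '%s\n' "JAVA_HOME is not set"
    return 1
  fi
  case "$java_home_value" in
    *" "*)
      printf '%s\n' "JAVA_HOME contains a space: $java_home_value"
      return 1
      ;;
  esac
  printf '%s\n' "JAVA_HOME looks usable: $java_home_value"
}

# Example call (open a new shell first so the new variable is visible):
#   check_java_home "$JAVA_HOME"
```

The check only looks at the text of the variable; it does not confirm that a JDK is actually installed at that location.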

Finally, we need to download and install Python and add it to the Path environment variable, using the same dialog as for JAVA_HOME. Go to Computer -> Properties -> Advanced System Settings -> Environment variables. Then append the Python install path, for example C:\Python27, to the existing Path value, separated from it by a ‘;’.

Verify that your path is set up by opening a new shell and typing python, which should start the Python interpreter. Type quit() to exit. Now we’re ready for our configuration.
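
The same verification can be scripted from a Cygwin or other POSIX shell. This is a minimal sketch, assuming the shell inherits the Windows Path; on_path is a hypothetical helper name.

```shell
# Hypothetical helper: report whether a command can be found on the PATH.
on_path() {
  if command -v "$1" >/dev/null 2>&1; then
    printf '%s\n' "$1 found on PATH"
  else
    printf '%s\n' "$1 NOT found on PATH"
    return 1
  fi
}

# Example call:
#   on_path python
```

If the command is reported as not found, open a new shell and try again, since already-open shells do not see later changes to the environment variables.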

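Set up the SSH daemon:

Both the Hadoop scripts and the Eclipse plugin need passwordless SSH to operate. This section describes how to set it up in the Cygwin environment. It assumes Cygwin is already installed together with its openssh package, which provides the ssh-host-config script used below.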

Configure ssh daemon

  1. Open the Cygwin command prompt.
  2. Execute the following command:
    ssh-host-config
  3. When asked if privilege separation should be used, answer no.
  4. When asked if sshd should be installed as a service, answer yes.
  5. When asked about the value of the CYGWIN environment variable, enter ntsec.

    Example of using ssh-host-config (screenshot of a sample session: the input typed by the user is shown in pink and the output from the system in gray).

Start SSH daemon

  1. Find the My Computer icon either on your desktop or in the Start menu, right-click it, and select Manage from the context menu.
  2. Open Services and Applications in the left-hand panel, then select the Services item.
  3. Find the CYGWIN sshd item in the main section and right-click it.
  4. Select Start from the context menu.
  5. A small window should pop up indicating the progress of the service start-up. After that window disappears, the status of the CYGWIN sshd service should change to Started.

Set up authorization keys

The Eclipse plugin and the Hadoop scripts require SSH authentication to be performed with authorization keys rather than passwords. To enable key-based authentication, you have to set up authorization keys. The following steps describe how to do it.

  1. Open the Cygwin command prompt.
  2. Execute the following command to generate keys:
    ssh-keygen
  3. When prompted for filenames and passphrases, press ENTER to accept the default values.
  4. After the command has finished generating the key, enter the following command to change into your .ssh directory:
    cd ~/.ssh
  5. Check that the keys were indeed generated by executing the following command:
    ls -l

    You should see two files, id_rsa.pub and id_rsa, with recent creation dates. These files contain the authorization keys.

  6. To register the new authorization keys, enter the following command. Note the double angle brackets (>>); they are very important, because they append to the file instead of overwriting it.
    cat id_rsa.pub >> authorized_keys
  7. Now check that the keys were set up correctly by executing the following command:
    ssh localhost

    Since this is a new SSH installation, you will be warned that the authenticity of the host could not be established and asked whether you really want to connect. Answer yes and press ENTER. You should see the Cygwin prompt again, which means that you have connected successfully.

  8. Now execute the command again:
    ssh localhost

    This time you should not be prompted for anything.
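
The registration step appends with >>, so running it a second time adds the same key again. A duplicate line is harmless but untidy, and using a single > by mistake would wipe any keys already in the file. The sketch below shows one way to make the step safe to repeat; register_key is a hypothetical helper name, and the file names are the defaults from the steps above.

```shell
# Hypothetical helper: append a public key to an authorized_keys file
# only if that exact line is not already present.
register_key() {
  pub_file="$1"
  auth_file="$2"
  # grep flags: -F fixed string, -x whole-line match, -q quiet,
  # -f read the pattern (the key line) from the public key file.
  # A missing authorized_keys file simply counts as "not registered yet".
  if grep -qxFf "$pub_file" "$auth_file" 2>/dev/null; then
    printf '%s\n' "key already registered"
  else
    cat "$pub_file" >> "$auth_file"
    printf '%s\n' "key registered"
  fi
}

# Example call, using the default file names:
#   register_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```

After registering the key, ssh localhost remains the real test that key-based login works.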

Download, copy and unpack Hadoop

The next step is to download and unpack the Hadoop distribution.

  1. Download Hadoop 0.19.1 and place it in a folder on your computer, such as C:\Java.
  2. Open the Cygwin command prompt.
  3. Execute the following command to change to your home directory:
    cd
  4. Then execute the following command to show your home directory folder in a Windows Explorer window:
    explorer .
  5. Open another Explorer window and navigate to the folder that contains the downloaded Hadoop archive.
  6. Copy the Hadoop archive into your home directory folder.
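
The copy can also be done from the Cygwin prompt itself, followed by the unpack step. This is a minimal sketch under two assumptions: the downloaded archive is a gzipped tarball named hadoop-0.19.1.tar.gz (check the actual file name of your download), and it sits in C:\Java, which Cygwin exposes as /cygdrive/c/Java. copy_and_unpack is a hypothetical helper name.

```shell
# Hypothetical helper: copy an archive into a destination folder
# and unpack it there.
copy_and_unpack() {
  archive="$1"
  dest="$2"
  cp "$archive" "$dest"/ || return 1
  # -x extract, -z gunzip, -f archive file, -C unpack into this folder
  tar -xzf "$dest/$(basename "$archive")" -C "$dest"
}

# Example call (archive name and location are assumptions):
#   copy_and_unpack /cygdrive/c/Java/hadoop-0.19.1.tar.gz "$HOME"
```

After it finishes, ls in the home directory should list both the archive and the unpacked Hadoop folder.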
