Hadoop architecture: a bird's-eye view


Hadoop's architecture from a bird's-eye view, as described by Mike Olson, CEO of Cloudera

Hadoop is designed to run on a large number of machines that don't share any memory or disks.

That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one.

When you want to load all of your organization's data into Hadoop, what the software does is bust that data into pieces that it then spreads across your different servers. There's no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides.
And because multiple copies are stored, data on a server that goes offline or dies can be automatically replicated from a known good copy.


In a centralized database system, you have got one big disk connected to four or eight or 16 big processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPU cores.

You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That's MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.
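To make the map/reduce idea concrete, here is a minimal word-count sketch using Hadoop Streaming, which lets plain Unix commands act as the mapper and reducer. The jar path and HDFS directories are illustrative and assume a Hadoop 2.x install like the one set up later on this blog, so adjust them to your own layout.

# mapper: split each line into one word per line (each word becomes a key)
# reducer: the framework sorts by key before the reduce step, so uniq -c counts each word
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
  -input /user/hduser/books \
  -output /user/hduser/wordcount-out \
  -mapper 'tr -s " " "\n"' \
  -reducer 'uniq -c'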

Architecturally, the reason you're able to deal with lots of data is because Hadoop spreads it out. And the reason you're able to ask complicated computational questions is because you have got all of these processors, working in parallel, harnessed together.


Usage of Hadoop


Usage of the Hadoop framework in industry - applicable to almost any industry.

Hadoop was originally conceived to solve the problem of storing large quantities of data at a very low cost, even for giants like Google, Facebook, or Yahoo. A low-cost implementation matters most for something that is new and expected to be used heavily.

Analytic applications are the main output of Hadoop, and they can be broadly grouped into a few buckets.

1. Refine the data

Irrespective of the vertical or domain, Hadoop is used to refine large quantities of raw data into more manageable, informational data. The resulting data is loaded into existing systems and can simply be accessed by traditional tools. Note that this new data set is much richer than the one that existed before.

2. Analysis of Data


A common data-analysis use case is an enterprise that starts by capturing data it was previously discarding (exhaust data such as web logs, social media data, etc.). This data can be combined with other data and used to build applications that act on the trends found in it.

3. Enhancement of applications

Existing applications (web, desktop, or mobile) can be further enhanced with the data Hadoop provides. This can be used to give users a better, more customized service, so that they come to you instead of a competitor. Simply understanding user patterns can achieve this for companies.

Also read - Hadoop architecture


Hadoop and Ubuntu - step 1


Hadoop and Ubuntu - step 1 of setting up Hadoop on the Ubuntu OS

There are basically two methods of using Hadoop:
1. Configure Hadoop on Windows - this involves the Hadoop setup, the Cygwin tool, Java, and Eclipse.
I initially configured this on my laptop; however, when I tried to perform the same configuration on another machine (to be used as another Hadoop node), Cygwin broke down.
As a result, I was not able to complete the whole Hadoop setup.

2. Configure Hadoop on Linux - because of the above experience, I decided to go with a Linux-based OS for Hadoop.

Using a Linux-based OS is the best approach, for the reasons below:
1. Hadoop is designed for Linux-based systems (yes, it is)
2. Hadoop requires SSH, which is simple to configure on Linux (on Windows it requires Cygwin, which basically gives the feel and experience of Linux on Windows)
3. The system is naturally more secure on a secure OS; exhibit A - Linux.

STEP 1 - Choose and configure (Linux) OS of choice on Machine of Choice

I chose Ubuntu - freely and easily available, with good GUI-based support for a heavy Windows user.

I chose to perform the installation on VirtualBox - an open-source virtualization tool by Oracle.

Download and install VirtualBox on your machine.

Download and install Ubuntu in VirtualBox as a guest OS.

Get ready for further set up for Hadoop on Ubuntu
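If you prefer the command line to the VirtualBox GUI for the steps above, the same VM can be created with VBoxManage. This is only a sketch; the VM name, memory, disk size, and ISO path are illustrative, so adjust them to your machine.

# create and register the VM, then give it memory and CPUs
VBoxManage createvm --name ubuntu-hadoop --ostype Ubuntu --register
VBoxManage modifyvm ubuntu-hadoop --memory 2048 --cpus 2
# create a virtual disk and attach it, plus the Ubuntu ISO as a DVD
VBoxManage createhd --filename ~/VirtualBox\ VMs/ubuntu-hadoop/ubuntu-hadoop.vdi --size 20000
VBoxManage storagectl ubuntu-hadoop --name SATA --add sata
VBoxManage storageattach ubuntu-hadoop --storagectl SATA --port 0 --device 0 --type hdd --medium ~/VirtualBox\ VMs/ubuntu-hadoop/ubuntu-hadoop.vdi
VBoxManage storageattach ubuntu-hadoop --storagectl SATA --port 1 --device 0 --type dvddrive --medium ~/Downloads/ubuntu.iso
# boot the VM and run the Ubuntu installer from the attached ISO
VBoxManage startvm ubuntu-hadoop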

This completes step 1.

You can also learn about usage of Hadoop and about Hadoop architecture on BabaGyan.com.


Hadoop and Ubuntu - step 2


Hadoop and Ubuntu - step 2 - Install Oracle Java for Hadoop setup.

In step 1 of the setup, available here, we looked at installing a Linux-based OS (Ubuntu) for Hadoop, as we opted for Linux instead of Windows. We also saw the reasons for that preference.

STEP 1- Choose and configure (Linux) OS of choice on Machine of Choice

STEP 2 - Install Java and configure it on the machine

The available choices for Java are OpenJDK and Oracle Java. I preferred Oracle Java. Follow the instructions below to configure Oracle Java on Ubuntu.

Download Oracle Java from its official download page. The version should be compatible with your OS and machine type (32 or 64 bit). The archive will land in a folder such as Downloads.

Uncompress it from a terminal window using

tar -xvf jdk-7u2-linux-x64.tar.gz

The uncompressed directory (its name depends on the downloaded version; here jdk1.7.0_02) should live under /usr/lib/jvm, so let's move it there. (Only a JRE is available as software through the official Ubuntu repositories; we want the latest JDK, with its bundled JRE, from Oracle, which is why we downloaded it.)
sudo mkdir -p /usr/lib/jvm

sudo mv ./jdk1.7.0_02 /usr/lib/jvm/jdk1.7.0

Set the environment variables for Java. Open /etc/profile and append the code below.
JAVA_HOME=/usr/lib/jvm/jdk1.7.0

PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
JRE_HOME=/usr/lib/jvm/jdk1.7.0/jre
PATH=$PATH:$HOME/bin:$JRE_HOME/bin
export JAVA_HOME
export JRE_HOME
export PATH
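
If you want to verify the variables before rebooting, reloading the profile in the current shell is enough for a quick check:

source /etc/profile
echo $JAVA_HOME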


Reboot now (or log out and log back in) so that every session picks up the variables. After this, Ubuntu has to be told that the JDK is available, so run the commands below:
sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.7.0/bin/java" 1


sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.7.0/bin/javac" 1


sudo update-alternatives --config java


Run the same --config command for javac. That's it, done. You can now check the Java and javac versions with

java -version

java version "1.7.0"

Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) Client VM (build 21.0-b17, mixed mode)


This completes step 2. You can also learn about usage of Hadoop and about Hadoop architecture on BabaGyan.com.


Hadoop and Ubuntu - step 3


Hadoop and Ubuntu - step 3 - Configure SSH and a user for SSH on Ubuntu

Step 1 of the setup is available here, and step 2 is available here. We looked at installing a Linux-based OS (Ubuntu) for Hadoop, as we opted for Linux instead of Windows, and we saw the reasons for that preference. We also installed and configured our chosen Java - Oracle Java.

STEP 1- Choose and configure (Linux) OS of choice on Machine of Choice

STEP 2 - Install Java and configure it on the machine

STEP 3 - Configure SSH and user for SSH on Ubuntu



This step is pretty straightforward. We create a user and a user group; all Hadoop cluster nodes will use the same user name and be part of the same group. Let's call the group hadoop and the user hduser. Then we will generate an SSH RSA key with an empty passphrase (for password-less access by Hadoop).

Use an Ubuntu terminal window and the commands below.

Create group
sudo addgroup hadoop

Create User and add it to the group
sudo adduser --ingroup hadoop hduser


Log in as hduser and generate the SSH key
su - hduser

ssh-keygen -t rsa -P ""

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
7b:62:<<more hex codes>> hduser@ubuntu
The key's randomart image is:
<<some image>>

Store the generated key
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
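
On some setups sshd refuses keys if the .ssh directory or authorized_keys file is too permissive, so tightening the permissions is a harmless precaution (this assumes the default locations; skip it if the test below already works):

chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys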

Test SSH
ssh localhost

The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is c7:47:55:<<more hex code>>.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS <<info>>
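
The host-key prompt appears only on the first connection; later connections should log you straight in without a password. Leave the test session before continuing:

exit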


This completes step 3. You can also learn about usage of Hadoop and about Hadoop architecture on BabaGyan.com.


Hadoop and Ubuntu - step 4


Hadoop and Ubuntu - step 4 - Install and configure Hadoop, the last step in creating a single Hadoop node

Step 1 of the setup is available here, and step 2 is available here. We looked at installing a Linux-based OS (Ubuntu) for Hadoop, as we opted for Linux instead of Windows, and we saw the reasons for that preference. We also installed and configured our chosen Java - Oracle Java.

STEP 1 - Choose and configure (Linux) OS of choice on Machine of Choice

STEP 2 - Install Java and configure it on the machine

STEP 3 - Configure SSH and user for SSH on Ubuntu

STEP 4 - Download and Configure Hadoop



Log in as hduser. Download the Hadoop 2.x tar file from any mirror here.

Uncompress the Hadoop tar.gz file into /usr/local and rename the directory. We will also change the owner. Use the terminal for all commands.

cd Downloads

sudo tar vxzf hadoop-2.2.0.tar.gz -C /usr/local

cd /usr/local

sudo mv hadoop-2.2.0 hadoop

sudo chown -R hduser:hadoop hadoop

Update hduser's .bashrc file
cd ~

gksudo gedit .bashrc

Append the text below at the end of the file. Use the JDK folder name as it actually exists on your machine - jdk1.7.0 if you followed step 2, otherwise something like "jdk-7-i386" (check in /usr/lib/jvm).

#Hadoop variables
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
#end of update
Save and close
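
If you want the new variables in the current terminal right away (the reboot a little later makes them permanent for every session), reload .bashrc and check one of them:

source ~/.bashrc
echo $HADOOP_INSTALL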

Now open hadoop-env.sh for updating Java Home (JAVA_HOME)
gksudo gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/jdk1.7.0

Save and close. Reboot the system and log in as hduser again.

Now, verify the Hadoop installation from the terminal
hadoop version

This should give something like below

Hadoop 2.2.0
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768

Compiled by hortonmu on 2013-10-07T06:28Z

Compiled with protoc 2.5.0

From source with checksum 79e53ce7994d1628b240f09af91e1af4

This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-2.2.0.jar


If you get this, congratulations! Hadoop is now successfully installed. If not, leave me a comment on the contact page.

Now we configure it by updating its XML files.

Open core-site.xml and add the given text between <configuration> </configuration> tags
gksudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>

Save and Close the file

Open yarn-site.xml and add the given text between <configuration> </configuration> tags
gksudo gedit /usr/local/hadoop/etc/hadoop/yarn-site.xml

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

Save and Close the file

Open mapred-site.xml.template and add the given text between <configuration> </configuration> tags
gksudo gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml.template

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

Save the file as mapred-site.xml in the /usr/local/hadoop/etc/hadoop/ directory and close it.
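
Alternatively, since gedit's "Save As" can be fiddly, you can copy the template to mapred-site.xml first and then edit the copy in place (this assumes the ownership change to hduser done earlier):

cd /usr/local/hadoop/etc/hadoop
cp mapred-site.xml.template mapred-site.xml
gksudo gedit mapred-site.xml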

Let's now create the NameNode and DataNode directories through the terminal
cd ~

mkdir -p mydata/hdfs/namenode

mkdir -p mydata/hdfs/datanode
Now, update hdfs-site.xml and add the given text between <configuration> </configuration> tags
gksudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hduser/mydata/hdfs/namenode</value>
</property>

<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hduser/mydata/hdfs/datanode</value>
</property>

Next, we will format HDFS for our first use of Hadoop and start the Hadoop services
hdfs namenode -format

start-dfs.sh

start-yarn.sh

Verify that the Hadoop processes are running with
jps

Something like the below should appear in the output (the process IDs will differ)

2970 ResourceManager
3461 Jps
3177 NodeManager
2361 NameNode
2840 SecondaryNameNode
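
When you are done experimenting, the services can be stopped with the companion scripts (they live in the same sbin directory that .bashrc already put on the PATH):

stop-yarn.sh
stop-dfs.sh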

This completes the set up steps for Hadoop.

You can also learn about usage of Hadoop and about Hadoop architecture on BabaGyan.com.


Solutions for issues faced during hadoop configuration


Solutions for issues faced during Hadoop configuration - getting started with Hadoop as a beginner is not so straightforward. There are many steps and issues that one has to overcome. I have been through it, so I am putting this up for everyone to refer to. A step-by-step guide for Hadoop configuration is also available here - STEP 1, STEP 2, STEP 3 and STEP 4.


1. Which Hadoop to fetch:


There are two flavors of Hadoop - 1.x and 2.x.
1.x is the original line, while 2.x is a newer line that includes the YARN engine. So, go for a Hadoop 2.x version.
You can find more about Hadoop here.

2. Which machine to use:
The initial options are Windows and Linux. Since SSH will be used extensively, prefer a flavor of Linux for Hadoop. It also eliminates the need to license each instance/node that you create.
Prefer Ubuntu if you are an extensive Windows user, since you will not feel completely lost in the Unix-like environment. There is also a lot of online help for Ubuntu.
Use this guide for downloading Ubuntu and installing it on a VM.

3. Actual machines or Virtual machines:
I guess this is pretty easy to decide: virtual machines, of course. You will need at least one physical machine with a recent configuration and at least 4 GB of RAM for the VMs to run.

4. Which Virtualization environment:
There are many options, but the most popular are VirtualBox by Oracle and VMware. VirtualBox is free and open source, and online support for it is good enough, so prefer VirtualBox.
You can find how to set up VirtualBox here.

5. Which Java to use:
The most common Java for Linux-based systems is OpenJDK, and the Oracle JDK is always available. Choose a Java version as per the Hadoop documentation. It is best to go for the Oracle JDK, but an older, tested version of Java.

You can find how to install java here

Major Issues:

1. Java and Ubuntu - 32 bit or 64 bit
If your machine is a recent 64-bit one, you may be tempted to go for a 64-bit version of the OS as well as of Java. But don't go for it just yet.

The pre-built Hadoop native libraries are compiled for 32 bit, and if you are using a 64-bit OS you may run into problems and errors such as:

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

The solution is to recompile the Hadoop native libraries on your 64-bit machine. Have a look at the native libraries page and the building-native-libraries-for-Hadoop page. But then, it would be a lot easier to just use a 32-bit OS instead, wouldn't it?
(shout yes!!!)
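
If you do want to stay on 64 bit, the rough recipe is to install a build tool-chain and rebuild Hadoop from source with the native profile enabled. This is only a sketch; the package names below are typical Ubuntu ones and may differ on your release, and Hadoop 2.2 specifically needs protobuf 2.5, so treat the building-native-libraries page above as the authoritative reference.

# build dependencies (names may vary by Ubuntu release)
sudo apt-get install build-essential maven cmake zlib1g-dev libssl-dev protobuf-compiler
# run from the top of the Hadoop source tree; produces a tarball with 64-bit native libs
mvn package -Pdist,native -DskipTests -Dtar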

2. VirtualBox - low graphics mode
VirtualBox may run into a "running in low-graphics mode" error if you are using 32-bit Ubuntu. This is due to a missing guest plugin that ships with VirtualBox.
You will have to mount the Linux Guest Additions CD image and run it. The first step is to set Ubuntu up to build kernel modules:
sudo apt-get install dkms
then mount the Guest Additions CD and run its installer
sudo mount /dev/cdrom /cdrom
sudo sh /cdrom/VBoxLinuxAdditions.run


3. VirtualBox - mouse pointer appears a little above where it should
This is due to a missing patch. You can have a look at it here.

To fix this, download the VBoxGuest-linux.c.patch file from the link above. Then run these commands on your Ubuntu virtual machine (the vboxguest version number in the path depends on your Guest Additions version):
sudo cp VBoxGuest-linux.c.patch /usr/src/vboxguest-4.1.16/vboxguest/VBoxGuest-linux.c
sudo /etc/init.d/vboxadd setup
You can also learn about usage of Hadoop and about Hadoop architecture on BabaGyan.com.
