I am building a Big Data platform in our lab. I installed Hadoop 2.5.0 on Ubuntu 14.04 LTS with Oracle JDK 7 (java version 1.7.0_65), with the help of these two guides:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://askubuntu.com/questions/144433/how-to-install-hadoop
After successfully deploying on a single machine, I moved on to a multi-node cluster:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
I ran into a problem where one of the datanodes failed to start because of the following exception:
java.io.IOException: Incompatible clusterIDs in /app/hadoop/tmp/dfs/data
The fix was similar to the one described for incompatible namespaceIDs:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/#javaioioexception-incompatible-namespaceids
I followed the manual fix and edited the clusterID in /app/hadoop/tmp/dfs/data/current/VERSION to match the one in /app/hadoop/tmp/dfs/name/current/VERSION.
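For anyone who prefers not to edit the file by hand, here is a minimal sketch of the same fix. It assumes both VERSION files are readable from the machine you run it on (which is the case when the master also acts as a datanode) and that the datanode is stopped first. The function names cluster_id and sync_cluster_id are my own helpers, not Hadoop commands.

```shell
# Print the clusterID recorded in a Hadoop VERSION file ($1).
cluster_id() {
    sed -n 's/^clusterID=//p' "$1"
}

# Copy the clusterID from the namenode VERSION file ($1) into the
# datanode VERSION file ($2). A backup of the datanode file is kept
# next to it with a .bak suffix.
sync_cluster_id() {
    name_version=$1
    data_version=$2
    id=$(cluster_id "$name_version")
    # Refuse to touch the datanode file if no clusterID was found.
    [ -n "$id" ] || return 1
    sed -i.bak "s/^clusterID=.*/clusterID=$id/" "$data_version"
}

# Example call with the paths from this post (run while the datanode is stopped):
# sync_cluster_id /app/hadoop/tmp/dfs/name/current/VERSION /app/hadoop/tmp/dfs/data/current/VERSION
```

Running cluster_id on both files before and after is a quick way to confirm that the two IDs now match.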
Once the setup was finished, the datanodes failed to find the namenode. The fix for that issue is given at:
http://stackoverflow.com/questions/8872807/hadoop-datanodes-cannot-find-namenode
Once Hadoop was up with all the datanodes, I moved on to YARN. However, the nodes failed to connect to the ResourceManager:
Retrying connect to server: 0.0.0.0/0.0.0.0:8031
I had to modify yarn-site.xml according to the answer here:
http://stackoverflow.com/questions/21840771/simple-yarn-benchmark-testdfsio-fails
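I am not reproducing the linked answer verbatim. As a rough sketch of the kind of change involved: 0.0.0.0:8031 is the default resource-tracker address, so the nodes need to be told where the ResourceManager actually runs. The helper below (yarn_rm_properties is a name I made up) prints the three address properties for a host name you pass in. The ports are the YARN defaults. The host name is an assumption you have to fill in with your own master.

```shell
# Print yarn-site.xml properties that point the nodes at the ResourceManager.
# $1 is the ResourceManager host name (placeholder; use your master's name).
yarn_rm_properties() {
    rm_host=$1
    cat <<EOF
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>${rm_host}:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>${rm_host}:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>${rm_host}:8032</value>
</property>
EOF
}
```

The printed fragment goes inside the configuration element of yarn-site.xml on every node, and YARN has to be restarted afterwards.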
After this, I moved on to running the example from michael-noll.com. An extra step is needed before copying the files, namely creating the target directory in HDFS:
hdfs dfs -mkdir -p /user/hduser/gutenberg
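The directory creation and the copy can also be wrapped together. This is only a sketch: stage_input is my own helper name, and the local directory holding the downloaded texts is whatever you used, so it is passed in as an argument.

```shell
# Create the HDFS target directory ($2) and copy every file from the
# local directory ($1) into it. The copy is skipped if mkdir fails.
stage_input() {
    local_dir=$1
    hdfs_dir=$2
    hdfs dfs -mkdir -p "$hdfs_dir" &&
    hdfs dfs -put "$local_dir"/* "$hdfs_dir"
}

# Example call (the local path is an assumption):
# stage_input /tmp/gutenberg /user/hduser/gutenberg
```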
The example itself can be run as follows:
hduser@node0:/usr/local/hadoop/share/hadoop/mapreduce$ hadoop jar hadoop-mapreduce-examples-2.5.0.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
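Two small helpers that may be useful around the job, again only as a sketch with names of my own choosing. A MapReduce job refuses to start if its output directory already exists, so the directory has to be removed before a rerun. The result files are assumed to follow the usual part-r-* naming of reducer output.

```shell
# Remove the output directory ($1) of a previous run, if there is one.
clean_output() {
    hdfs dfs -rm -r -f "$1"
}

# List the job output directory ($1) and print the first lines of the
# result files. $2 is the number of lines to show (default 20).
show_output() {
    hdfs dfs -ls "$1" &&
    hdfs dfs -cat "$1/part-r-*" | head -n "${2:-20}"
}

# Example calls with the output path from this post:
# clean_output /user/hduser/gutenberg-output
# show_output /user/hduser/gutenberg-output
```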