17 December 2014

Building Hadoop Sources

You'd like to hack the Hadoop source code in your favorite IDE? Here are the steps to follow to avoid lots of frustration.

1. Checkout/Clone the repo

I'm running the Cloudera distribution, so I'd like to stay in sync with the exact version I'm running. The community repo https://github.com/apache/hadoop would also do:

git clone https://github.com/cloudera/hadoop-common.git hadoop-cdh

Go to the freshly cloned repo

cd hadoop-cdh
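To know which version to sync with, note that the first line of `hadoop version` has the form `Hadoop <version>`. Here is a minimal sketch of extracting it. The sample line and the helper name `hadoop_version_of` are assumptions for illustration, and since tag names differ between repos, it lists the tags rather than guessing one:

```shell
#!/usr/bin/env bash
# Extract the version from the first line of `hadoop version`,
# which looks like "Hadoop <version>".
hadoop_version_of() {
  # $1: first line of the `hadoop version` output
  printf '%s\n' "${1#Hadoop }"
}

# Sample line (an assumption, for illustration only).
version="$(hadoop_version_of "Hadoop 2.5.0-cdh5.2.0")"
echo "$version"

# On a real node, from inside the clone, something like:
#   version="$(hadoop_version_of "$(hadoop version | head -n 1)")"
#   git tag -l "*${version#*-}*"    # list tags mentioning e.g. cdh5.2.0
#   git checkout <matching-tag>
```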

2. Ensure Java 7+

You'd better anticipate the Java 7 EOL in April 2015 and go with 8, but you must have at least Java 7.

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
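A quick way to confirm that the JDK behind JAVA_HOME is recent enough is to parse its version string. This is a rough sketch: the helper name `java_major_version` is made up here, and it assumes the usual `1.7.0_65` style strings (newer JDKs print e.g. `11.0.2`, which it also handles):

```shell
#!/usr/bin/env bash
# Turn a Java version string into its major version number.
java_major_version() {
  # $1: version string such as "1.7.0_65", "1.8.0_25" or "11.0.2"
  local v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;            # legacy scheme: 1.7.0_65 -> 7.0_65
  esac
  printf '%s\n' "${v%%[._-]*}"     # keep only the leading number
}

java_major_version "1.7.0_65"
java_major_version "1.8.0_25"

# On a real box (java -version writes to stderr):
#   raw="$("$JAVA_HOME/bin/java" -version 2>&1 | head -n 1 | cut -d'"' -f2)"
#   [ "$(java_major_version "$raw")" -ge 7 ] || echo "Java 7+ required" >&2
```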

3. Ensure Protobuf 2.5+

You'll find more info about where to get and how to build protobuf 2.5+ on Ubuntu 12.x here: http://steveloughran.blogspot.ch/2013/09/how-to-update-ubuntu-12x-box-to-protoc.html

You can then point the Hadoop Maven plugin to your custom installation of protobuf:

export HADOOP_PROTOC_CDH5_PATH=/path/to/protobuf-2.5.0/src/protoc
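Before starting a long build, it may be worth checking that the path really points to a working protoc of the expected version. `protoc --version` prints a line such as `libprotoc 2.5.0`. A minimal sketch, where `check_protoc` is a hypothetical helper and the prefix match on `2.5` is an assumption about what you want to accept:

```shell
#!/usr/bin/env bash
# Check that a protoc binary exists and reports the wanted version prefix.
check_protoc() {
  # $1: path to the protoc binary, $2: wanted version prefix, e.g. 2.5
  local bin="$1" want="$2" got
  [ -x "$bin" ] || { echo "not executable: $bin" >&2; return 1; }
  got="$("$bin" --version)" || return 1        # e.g. "libprotoc 2.5.0"
  case "${got#libprotoc }" in
    "$want"*) return 0 ;;
    *) echo "found '$got', wanted $want" >&2; return 1 ;;
  esac
}

# Typical call, once the variable above is exported:
#   check_protoc "$HADOOP_PROTOC_CDH5_PATH" 2.5 && echo "protoc looks fine"
```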

You can now compile the project:

mvn clean install -DskipTests

4. Run Maven Eclipse

Unfortunately the maven-eclipse-plugin has a regression after version 2.6, and thus this exact version needs to be used to run it properly:

mvn org.apache.maven.plugins:maven-eclipse-plugin:2.6:clean org.apache.maven.plugins:maven-eclipse-plugin:2.6:eclipse

And that's it, you're good to import the project you need into your favorite IDE!

16 December 2014

Install HDP 2.2 on Ubuntu 14.04 Trusty Tahr

HDP 2.2 is out. Woot!

Disclaimer: This article describes outright hacks to install HDP 2.2 on Ubuntu 14.04, which is not (yet) officially supported by Hortonworks. Use it at your own risk.

The Minimal Requirement page specifies "64-bit Ubuntu Precise (12.04)", which I understood to be the minimum (i.e. oldest) Ubuntu version HDP runs on. That is not the case: HDP 2.2 does not run on any other version of Ubuntu. And that's a pity. On my brand new cluster, I'd obviously like to install 14.04, which comes with a newer version of Ansible, to name only one benefit.

So what's the problem?

The Ambari agent detects the OS version and reports it to the Ambari server. The server has a hardcoded list of OS version strings it accepts, including "ubuntu12".

The idea here is to trick the Ambari agent into reporting ubuntu12 to the Ambari server so that the installation process can move forward.

Hack it!

The guilty file is /usr/lib/ambari-agent/lib/ambari_commons/os_check.py, which reads /etc/*-release to find out the OS version. Replacing 14.04 with 12.04 in these files does the trick.

$ sed -e "s/14\.04/12.04/g" -i /etc/*-release
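Since this rewrites system files, a slightly more careful variant keeps a backup and can be tried on a throwaway file first. A sketch only: `spoof_release` is a made-up name, the sample file content is an assumption, and `-i.orig` relies on GNU sed. The dot is escaped so that it matches literally:

```shell
#!/usr/bin/env bash
# Rewrite 14.04 -> 12.04 in place, keeping FILE.orig next to each file.
spoof_release() {
  sed -i.orig -e 's/14\.04/12.04/g' "$@"
}

# Demo on a scratch file instead of the real /etc/*-release files.
demo="$(mktemp)"
printf 'DISTRIB_ID=Ubuntu\nDISTRIB_RELEASE=14.04\n' > "$demo"
spoof_release "$demo"
cat "$demo"
rm -f "$demo" "$demo.orig"

# For real (as root):
#   spoof_release /etc/*-release
```

The `.orig` backups no longer match the `/etc/*-release` glob, and moving one back over its file undoes the hack.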

All set! Now Ambari agent can successfully contact the server.

Ganglia configuration

Ganglia relies on apache2, and the configuration layout changed in 14.04 from /etc/apache2/conf.d to /etc/apache2/conf-available and /etc/apache2/conf-enabled (more info on Debian sources). So another hack is required: create a symlink from /etc/apache2/conf.d to /etc/apache2/conf-enabled, otherwise Ganglia will fail to install:

$ ln -fs /etc/apache2/conf-enabled/ /etc/apache2/conf.d
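If this step ends up in a provisioning script that runs more than once, a second `ln -fs` would follow the existing conf.d link and drop a stray link inside conf-enabled. With GNU ln, adding `-n` avoids that. A small sketch in a scratch directory (`link_dir` is a hypothetical helper, not part of any package):

```shell
#!/usr/bin/env bash
# Create or replace a symlink to a directory, safely repeatable.
link_dir() {
  # $1: existing directory, $2: symlink to create or replace
  # -n keeps ln from descending into $2 when it already links to a directory.
  ln -sfn "$1" "$2"
}

# Demo in a scratch directory standing in for /etc/apache2.
root="$(mktemp -d)"
mkdir "$root/conf-enabled"
link_dir "$root/conf-enabled" "$root/conf.d"
link_dir "$root/conf-enabled" "$root/conf.d"   # running it again is harmless
readlink "$root/conf.d"
rm -rf "$root"

# For real (as root):
#   link_dir /etc/apache2/conf-enabled /etc/apache2/conf.d
```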

Point the PostgreSQL JDBC library to the correct jar

The libpostgresql-jdbc-java package is required to use Ambari, Hive, Oozie etc. with PostgreSQL. In 14.04, it provides two jars, /usr/share/java/postgresql-jdbc3.jar and /usr/share/java/postgresql-jdbc4.jar. Hive, at least, looks for /usr/share/java/postgresql-jdbc.jar (version agnostic). A symlink to the jdbc4 jar needs to be created to avoid a startup failure:
  • In Ansible:
    file: src=/usr/share/java/postgresql-jdbc4.jar dest=/usr/share/java/postgresql-jdbc.jar state=link
  • Simply in bash:
    ln -sf /usr/share/java/postgresql-jdbc4.jar /usr/share/java/postgresql-jdbc.jar
Another link to update is the PostgreSQL jar that Ambari sends out to test the JDBC connection:
  • In Ansible:
    file: src=/usr/share/java/postgresql-jdbc.jar dest=/var/lib/ambari-server/resources/postgres-jdbc-driver.jar state=link
  • Simply in bash:
    ln -sf /usr/share/java/postgresql-jdbc.jar /var/lib/ambari-server/resources/postgres-jdbc-driver.jar
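The two links form a chain (driver jar, then postgresql-jdbc.jar, then postgresql-jdbc4.jar), so a break anywhere leaves a dangling link. A quick check can be sketched as follows. `resolves_to_file` is a made-up helper, and the demo uses a scratch directory rather than the real paths:

```shell
#!/usr/bin/env bash
# True only if $1 is a symlink that ends up at an existing regular file.
resolves_to_file() {
  [ -L "$1" ] && [ -f "$1" ]
}

# Demo in a scratch directory mirroring the two-hop chain above.
root="$(mktemp -d)"
touch "$root/postgresql-jdbc4.jar"
ln -sf "$root/postgresql-jdbc4.jar" "$root/postgresql-jdbc.jar"
ln -sf "$root/postgresql-jdbc.jar" "$root/postgres-jdbc-driver.jar"

for link in "$root/postgresql-jdbc.jar" "$root/postgres-jdbc-driver.jar"; do
  if resolves_to_file "$link"; then
    echo "ok: ${link##*/}"
  else
    echo "broken: ${link##*/}" >&2
  fi
done
rm -rf "$root"

# For real:
#   resolves_to_file /usr/share/java/postgresql-jdbc.jar
#   resolves_to_file /var/lib/ambari-server/resources/postgres-jdbc-driver.jar
```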

Testing your deployment (aka updated 1TB terasort copy-paste example for HDP 2.2)

$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
    teragen -Ddfs.block.size=536870912 -Dmapred.map.tasks=20 \
    10000000000 /tmp/1Tsort/input
$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
    terasort -Ddfs.block.size=536870912 -Dmapred.reduce.tasks=20 -Dmapreduce.terasort.output.replication=3 \
    /tmp/1Tsort/input /tmp/1Tsort/output
$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
    teravalidate /tmp/1Tsort/output /tmp/1Tsort/report
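For anyone adapting the numbers to a smaller cluster, the sizes above can be recomputed with plain shell arithmetic. teragen writes 100-byte rows, so the row count alone fixes the dataset size. The variable names below are only illustrative:

```shell
#!/usr/bin/env bash
rows=10000000000        # row count passed to teragen above
row_bytes=100           # teragen writes 100-byte rows
block_bytes=536870912   # value passed as -Ddfs.block.size
maps=20                 # value passed as -Dmapred.map.tasks

total_bytes=$((rows * row_bytes))
echo "dataset size: $((total_bytes / 1000000000000)) TB"
echo "block size: $((block_bytes / 1024 / 1024)) MiB"
echo "data per teragen map: $((total_bytes / maps / 1000000000)) GB"
```

Dividing `rows` by 1000 gives a 1 GB run with the same commands, which is handy as a smoke test before committing to the full terabyte.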

Find more details about HDP 2.2 in the complete HDP 2.2 documentation.