It's Friday. It's rainy. All the monitoring is green (including me ;)). Time to test something cool. Hey, has anyone ever tried to run Tez on CDH? That sounds awesome. Let's give it a try!
At the time of writing, the latest branch is 0.6.0. So go for it. The install guide is pretty straightforward: compile, upload, configure and run. Do not forget to check out the right branch :)
I installed the libs in /opt/tez-0.6.0 and used the following commands to run it.
Mind the magic option that switches the processing engine from MapReduce to Tez: mapreduce.framework.name=yarn-tez
# Setting up configuration environment variables
export TEZ_HOME=/opt/tez-0.6.0
export TEZ_CONF_DIR=${TEZ_HOME}/tez-conf
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:${TEZ_CONF_DIR}:$(find ${TEZ_HOME} -name "*.jar" | paste -sd ":")"
# Run an example on YARN
yarn jar /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-examples-2.5.0-cdh5.2.5.jar pi -Dmapreduce.framework.name=yarn-tez 16 100000
BAM. This didn't work. I got the following stack trace:
java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray(II[BI[BIILjava/lang/String;JZ)V
    at org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray(Native Method)
    at org.apache.hadoop.util.NativeCrc32.calculateChunkedSumsByteArray(NativeCrc32.java:86)
    at org.apache.hadoop.util.DataChecksum.calculateChunkedSums(DataChecksum.java:430)
    at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:202)
    at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:163)
    at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1993)
    at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1946)
    at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
    at org.apache.tez.dag.history.recovery.RecoveryService.handle(RecoveryService.java:265)
    at org.apache.tez.dag.history.HistoryEventHandler.handleCriticalEvent(HistoryEventHandler.java:102)
    at org.apache.tez.dag.app.DAGAppMaster.startDAG(DAGAppMaster.java:1954)
    at org.apache.tez.dag.app.DAGAppMaster.startDAG(DAGAppMaster.java:1911)
    at org.apache.tez.dag.app.DAGAppMaster.serviceStart(DAGAppMaster.java:1628)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.tez.dag.app.DAGAppMaster$6.run(DAGAppMaster.java:2036)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:2032)
    at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:1846)
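There's already a JIRA for that: HADOOP-11064. In a nutshell: I’m running CDH 5.2, which is based on Hadoop 2.5, while Tez 0.6 is built against Hadoop 2.6 by default. The Hadoop 2.6 classes bundled with Tez end up calling a native checksum method that the cluster's Hadoop 2.5 native library does not provide.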
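A quick way to spot this kind of mismatch before submitting a job is to compare the Hadoop version on the cluster with the versions of the Hadoop jars bundled in the Tez install. The following is only a sketch: it assumes the full Tez tarball (the one that ships Hadoop jars) was unpacked under TEZ_HOME, and the helper function names are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: compare the Hadoop versions bundled with Tez against the cluster's.
# The TEZ_HOME default mirrors the path used in this post; the function names
# are hypothetical helpers, not part of Tez or Hadoop.

# Extract the version from a Hadoop jar file name,
# e.g. hadoop-common-2.5.0-cdh5.2.5.jar -> 2.5.0-cdh5.2.5
jar_hadoop_version() {
  basename "$1" .jar | sed -E 's/^hadoop-[a-z-]+-([0-9][0-9A-Za-z.-]*)$/\1/'
}

# Print the distinct Hadoop versions found among the jars under a directory.
bundled_hadoop_versions() {
  find "$1" -name 'hadoop-*.jar' 2>/dev/null | while read -r jar; do
    jar_hadoop_version "$jar"
  done | sort -u
}

# Only touch the real installation when the hadoop CLI is present.
if command -v hadoop >/dev/null 2>&1; then
  echo "Cluster: $(hadoop version | head -n 1)"
  echo "Bundled with Tez:"
  bundled_hadoop_versions "${TEZ_HOME:-/opt/tez-0.6.0}"
fi
```

If the two versions differ, as they did here, rebuilding Tez against the cluster's Hadoop version is the safer route.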
Let’s do it properly and build it with the exact CDH version I’m running. I created a profile cdh5.2 in the pom file, which looks like this:
Tez on CDH 5.2.5 -- Pom profile -- See Gist 64ba10edb64d744f05a0
And tried to rebuild with the profile cdh5.2 enabled:
mvn -Pcdh5.2 clean package -DskipTests=true -Dmaven.javadoc.skip=true
BAM. I got the following error:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project tez-mapreduce: Compilation failure
[ERROR] /home/mil2048/Projects/tez/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/mapreduce/JobContextImpl.java:[57,8] org.apache.tez.mapreduce.hadoop.mapreduce.JobContextImpl is not abstract and does not override abstract method userClassesTakesPrecedence() in org.apache.hadoop.mapreduce.JobContext
Abstract method org.apache.hadoop.mapreduce.JobContext.userClassesTakesPrecedence? WTF?
Well, Cloudera applies custom patches to the Hadoop source code, which from time to time breaks downstream projects. That's the case here. See the commit details: https://github.com/cloudera/hadoop-common/commit/fb3fd746f4340142978596c1997a7bfb19007e69
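To see whether a given Hadoop build carries the extra method without digging through commits, the class can be inspected with javap. This is a minimal sketch: the jar path is an assumption modelled on the parcel layout used earlier in this post, and declares_method is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: check whether JobContext declares a given method in a Hadoop jar.
# The default JAR path is an assumption based on the CDH parcel layout used
# in this post; adjust it to the jar that actually exists on your system.

# Succeeds if the javap listing read on stdin declares the named method.
declares_method() {
  grep -q "[[:space:]]$1("
}

JAR="${JAR:-/opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-client-core-2.5.0-cdh5.2.5.jar}"
CLASS=org.apache.hadoop.mapreduce.JobContext

# Only inspect the real jar when javap and the jar are both available.
if command -v javap >/dev/null 2>&1 && [ -f "$JAR" ]; then
  if javap -cp "$JAR" "$CLASS" | declares_method userClassesTakesPrecedence; then
    echo "$CLASS declares userClassesTakesPrecedence(): implementations must override it"
  else
    echo "$CLASS does not declare userClassesTakesPrecedence()"
  fi
fi
```

Running the same check against a stock Apache Hadoop jar and a CDH jar makes the difference visible at a glance.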
So I applied the following patch to the Tez sources and successfully compiled the project:
Tez on CDH 5.2.5 -- CDH-4212 -- See Gist 23225004a78949d4c849
The good news is that this Cloudera change does not seem to be in CDH 5.4 anymore, so Tez on CDH 5.4 might be more straightforward.
Now that it's built, let's re-run it.
Woot, the pi example works now! First Tez on CDH job run ever!
Ok, first step done. And now, what about Hive on Tez on CDH?
hive
set hive.execution.engine=tez;
SELECT * FROM myTable;
BAM.
Well, CDH 5.2/5.3 ships with Hive 0.13, which is in turn built against Tez 0.4.1-incubating (see https://github.com/cloudera/hive/blob/cdh5-0.13.1_5.2.5/pom.xml for more details), so let's try Hive with Tez 0.4.0 and my custom patch.
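For other Hive builds, the expected Tez version can be pulled out of the pom instead of read by eye. A small sketch, assuming the version is declared as a tez.version Maven property and that a copy of the Hive pom.xml is available locally (the POM variable and the function name are placeholders for this example):

```shell
#!/usr/bin/env bash
# Sketch: read the Tez version a Hive source tree is built against.
# Assumes the pom declares it as a <tez.version> Maven property.

# Print the value of the first <tez.version> element found on stdin.
pom_tez_version() {
  sed -n 's:.*<tez\.version>\(.*\)</tez\.version>.*:\1:p' | head -n 1
}

# POM is a hypothetical path to a local copy of the Hive pom.xml.
POM="${POM:-pom.xml}"
if [ -f "$POM" ]; then
  echo "Hive builds against Tez $(pom_tez_version < "$POM")"
fi
```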
Check out branch 0.4.0, apply the patch, rebuild, upload, run Hive, and ...
Woot!! Hive on Tez on CDH. Time to go and celebrate this great achievement! ;)