Tez | Wang's Tech Blog

Tagged: Tez Toggle Comment Threads | Keyboard Shortcuts

Wang 20:12 on 2018-03-25 Permalink | Reply
Tags: Ambari, BigData ( 37 ), Cluster ( 30 ), Presto ( 7 ), Tez

[Presto] Integrate with Ambari

Days before I have installed presto and ambari separately, officially ambari doesn’t support presto, you have to download ambari-presto-service and configure it yourself if you wanna manage presto on ambari.

So I tried this.

1.download hdp yum repository

wget -nv http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.6.3.0/hdp.repo -O /etc/yum.repos.d/HDP.repo

2.download ambari-presto-service and configure

version=`hdp-select status hadoop-client | sed 's/hadoop-client - ([0-9].[0-9]).*/1/'`
mkdir /var/lib/ambari-server/resources/stacks/HDP/$version/services/PRESTO
wget https://github.com/prestodb/ambari-presto-service/releases/download/v1.2/ambari-presto-1.2.tar.gz
tar -xvf ambari-presto-1.2.tar.gz -C /var/lib/ambari-server/resources/stacks/HDP/$version/services/PRESTO
mv /var/lib/ambari-server/resources/stacks/HDP/$version/services/PRESTO/ambari-presto-1.2/* /var/lib/ambari-server/resources/stacks/HDP/$version/services/PRESTO
rm -rf /var/lib/ambari-server/resources/stacks/HDP/$version/services/PRESTO/ambari-presto-1.2
chmod -R +x /var/lib/ambari-server/resources/stacks/HDP/$version/services/PRESTO/*

3.restart ambari-server

ambari-server restart

4.add presto service on ambari, please configure discovery.uri when you add presto service, e.g. discovery.uri: http://coordinator:8285

After doing this, you could add catalogs and use presto as query engine.

I did a simple query comparison between Tez and Presto, if you wanna accurate benchmark result, I think this benchmark test could help. The query is to calculate sum on a hive table.

Presto: 4s

presto:test> select sum(count) as sum from (
          -> select count(*) as count from t0004998 where month = '6.5'
          -> union
          -> select count(*) as count from t0004998 where typestatus in ('VL2216','VL2217','VL2218','VL2219','VL2220')
          -> union
          -> select count(*) as count from t0004998 where countrycode in ('FAMILY','FORM','GENUS','KINGDOM','ORDER','PHYLUM','SPECIES')
          -> ) t;
  sum   
--------
 307374 
(1 row)

Query 20180317_102034_00040_sq83e, FINISHED, 1 node
Splits: 29 total, 29 done (100.00%)
0:04 [982K rows, 374MB] [231K rows/s, 87.8MB/s]

Tez: 29.77s

hive> select sum(count) from (
    > select count(*) as count from t0004998 where month = "6.5"
    > union
    > select count(*) as count from t0004998 where typestatus in ("VL2216","VL2217","VL2218","VL2219","VL2220")
    > union
    > select count(*) as count from t0004998 where countrycode in ("FAMILY","FORM","GENUS","KINGDOM","ORDER","PHYLUM","SPECIES")
    > ) t;
Query ID = hdfs_20180317102109_5fd30986-f840-450e-aedd-b51c5e3a48f1
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1521267007048_0012)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Map 10 .........   SUCCEEDED      1          1        0        0       1       0
Map 8 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 11 .....   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       1
Reducer 4 ......   SUCCEEDED      1          1        0        0       0       0
Reducer 6 ......   SUCCEEDED      1          1        0        0       0       0
Reducer 7 ......   SUCCEEDED      1          1        0        0       0       0
Reducer 9 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 09/09  [==========================>>] 100%  ELAPSED TIME: 29.77 s    
--------------------------------------------------------------------------------
OK
307374
Time taken: 30.732 seconds, Fetched: 1 row(s)

Wang 20:37 on 2018-03-06 Permalink | Reply
Tags: BigData ( 37 ), Cluster ( 30 ), Hadoop ( 14 ), Hive ( 15 ), Tez

[Performance Test] MR vs Tez(2)

I test the performance of MR vs Tez again on cluster, I created a new table which contains 28,872,974 rows, below are cluster servers:

Host	OS	Memory	CPU	Disk	Region
master.c.ambari-195807.internal	CentOS 7	13 GB	Intel Ivy Bridge: 2	200G	asia-east1-a
slave1.c.ambari-195807.internal	CentOS 7	13 GB	Intel Ivy Bridge: 2	200G	asia-east1-a
slave2.c.ambari-195807.internal	CentOS 7	13 GB	Intel Ivy Bridge: 2	200G	asia-east1-a
slave3.c.ambari-195807.internal	CentOS 7	13 GB	Intel Ivy Bridge: 2	200G	asia-east1-a

1.MR

1.1.create table

hive> CREATE TABLE gbif.gbif_0004998
    > STORED AS ORC
    > TBLPROPERTIES("orc.compress"="snappy")
    > AS SELECT * FROM gbif.gbif_0004998_ori;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = gizmo_20180225064259_8df29800-b260-48f5-a409-80d6ea5200ad
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1519536795015_0001, Tracking URL = http://master.c.ambari-195807.internal:8088/proxy/application_1519536795015_0001/
Kill Command = /opt/apps/hadoop-2.8.3/bin/hadoop job  -kill job_1519536795015_0001
Hadoop job information for Stage-1: number of mappers: 43; number of reducers: 0
2018-02-25 06:43:15,110 Stage-1 map = 0%,  reduce = 0%
2018-02-25 06:44:15,419 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 231.6 sec
2018-02-25 06:44:36,386 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 380.45 sec
2018-02-25 06:44:37,810 Stage-1 map = 3%,  reduce = 0%, Cumulative CPU 386.09 sec
2018-02-25 06:44:41,695 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU 422.02 sec
...
...
2018-02-25 06:47:36,112 Stage-1 map = 97%,  reduce = 0%, Cumulative CPU 1388.9 sec
2018-02-25 06:47:38,185 Stage-1 map = 98%,  reduce = 0%, Cumulative CPU 1392.1 sec
2018-02-25 06:47:45,434 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1402.14 sec
MapReduce Total cumulative CPU time: 23 minutes 22 seconds 140 msec
Ended Job = job_1519536795015_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master.c.ambari-195807.internal:9000/user/hive/warehouse/gbif.db/.hive-staging_hive_2018-02-25_06-42-59_672_2925216554228494176-1/-ext-10002
Moving data to directory hdfs://master.c.ambari-195807.internal:9000/user/hive/warehouse/gbif.db/gbif_0004998
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 43   Cumulative CPU: 1402.14 sec   HDFS Read: 11519083564 HDFS Write: 1210708016 SUCCESS
Total MapReduce CPU Time Spent: 23 minutes 22 seconds 140 msec
OK
Time taken: 288.681 seconds

1.2.query by on condition

hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = gizmo_20180225065438_d2343424-5178-4c44-8b9d-0b28f8b701fa
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1519536795015_0002, Tracking URL = http://master.c.ambari-195807.internal:8088/proxy/application_1519536795015_0002/
Kill Command = /opt/apps/hadoop-2.8.3/bin/hadoop job  -kill job_1519536795015_0002
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 1
2018-02-25 06:54:50,078 Stage-1 map = 0%,  reduce = 0%
2018-02-25 06:55:02,485 Stage-1 map = 40%,  reduce = 0%, Cumulative CPU 21.01 sec
2018-02-25 06:55:03,544 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 38.51 sec
2018-02-25 06:55:06,704 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 49.23 sec
2018-02-25 06:55:09,881 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 51.88 sec
MapReduce Total cumulative CPU time: 51 seconds 880 msec
Ended Job = job_1519536795015_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 5  Reduce: 1   Cumulative CPU: 51.88 sec   HDFS Read: 1936305 HDFS Write: 107 SUCCESS
Total MapReduce CPU Time Spent: 51 seconds 880 msec
OK
2547716
Time taken: 32.292 seconds, Fetched: 1 row(s)

1.3.query by two conditions

hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE' and year > 1900;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = gizmo_20180225081238_766d3707-7eb4-4818-860e-887c48d507ce
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1519545228015_0002, Tracking URL = http://master.c.ambari-195807.internal:8088/proxy/application_1519545228015_0002/
Kill Command = /opt/apps/hadoop-2.8.3/bin/hadoop job  -kill job_1519545228015_0002
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 1
2018-02-25 08:17:31,666 Stage-1 map = 0%,  reduce = 0%
2018-02-25 08:17:43,866 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 10.58 sec
2018-02-25 08:17:46,045 Stage-1 map = 60%,  reduce = 0%, Cumulative CPU 34.12 sec
2018-02-25 08:17:54,996 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 41.73 sec
2018-02-25 08:17:57,126 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 51.37 sec
2018-02-25 08:17:58,192 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 53.72 sec
MapReduce Total cumulative CPU time: 53 seconds 720 msec
Ended Job = job_1519545228015_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 5  Reduce: 1   Cumulative CPU: 53.72 sec   HDFS Read: 8334197 HDFS Write: 107 SUCCESS
Total MapReduce CPU Time Spent: 53 seconds 720 msec
OK
2547716
Time taken: 321.138 seconds, Fetched: 1 row(s)

2.Tez

2.1.create table

hive> CREATE TABLE gbif.gbif_0004998
    > STORED AS ORC
    > TBLPROPERTIES("orc.compress"="snappy")
    > AS SELECT * FROM gbif.gbif_0004998_ori;
Query ID = gizmo_20180225075657_bae527a7-7cbd-46d9-afbf-70a5adcdee7c
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1519545228015_0001)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 639.61 s   
----------------------------------------------------------------------------------------------
Moving data to directory hdfs://master.c.ambari-195807.internal:9000/user/hive/warehouse/gbif.db/gbif_0004998
OK
Time taken: 664.817 seconds

2.2.query by one condition

hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE';
Query ID = gizmo_20180225080856_d1f13489-30b0-4045-bdeb-e3e5e085e736
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1519545228015_0001)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      5          5        0        0       0       0  
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 17.91 s    
----------------------------------------------------------------------------------------------
OK
2547716
Time taken: 19.255 seconds, Fetched: 1 row(s)

2.2.query by two conditions

hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE' and year > 1900;
Query ID = gizmo_20180225081200_0279f8e6-544b-4573-858b-33f48bf1fa35
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1519545228015_0001)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      5          5        0        0       0       0  
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 16.96 s    
----------------------------------------------------------------------------------------------
OK
2547716
Time taken: 17.635 seconds, Fetched: 1 row(s)

3.Summary

Rows: 28,872,974

Type	Create Table	Query By One Condition	Query By Two Conditions
MR	288.681s	32.292s	321.138s
Tez	664.817s	19.255s	17.635s

According to the result, MR is quicker than Tez on creation, but slower than Tez on query, along with query condition’s increase, MR’s query performance became worse.

But why MR is quicker than Tez on creation, currently I don’t know, need to be investigated later.

Maybe it has relationship with storage, I have checked the filesystem after the two kinds of creation, it’s different. MR has many small files, but Tez has one much bigger file.

MR generated files

Tez generated files

Wang 21:43 on 2018-03-02 Permalink | Reply
Tags: BigData ( 37 ), CentOS7 ( 7 ), Cloud ( 29 ), Cluster ( 30 ), GCP ( 12 ), Hadoop ( 14 ), Hive ( 15 ), Mysql ( 7 ), Tez

[GCP ] Install bigdata cluster

I applied google cloud for trial which give me 300$, so I initialize 4 severs to do test.

Servers:

Host	OS	Memory	CPU	Disk	Region
master.c.ambari-195807.internal	CentOS 7	13 GB	Intel Ivy Bridge: 2	200G	asia-east1-a
slave1.c.ambari-195807.internal	CentOS 7	13 GB	Intel Ivy Bridge: 2	200G	asia-east1-a
slave2.c.ambari-195807.internal	CentOS 7	13 GB	Intel Ivy Bridge: 2	200G	asia-east1-a
slave3.c.ambari-195807.internal	CentOS 7	13 GB	Intel Ivy Bridge: 2	200G	asia-east1-a

1.prepare

1.1.configure ssh key on each slave to make master login without password

1.2.install jdk1.8 on each server, download, set JAVA_HOME in profile

1.3.configure hostnames in /etc/hosts on each server

2.install hadoop

2.1.download hadoop 2.8.2

wget http://ftp.jaist.ac.jp/pub/apache/hadoop/common/hadoop-2.8.3/hadoop-2.8.3.tar.gz
tar -vzxf hadoop-2.8.3.tar.gz && cd hadoop-2.8.3

2.2.configure core-site.xml

<property>
    <name>fs.default.name</name>
    <value>hdfs://master.c.ambari-195807.internal:9000</value> 
</property>
<property>
    <name>hadoop.tmp.dir</name>  
    <value>/data/hadoop/hdfs/tmp</value>
</property>
<property>
    <name>hadoop.http.filter.initializers</name>
    <value>org.apache.hadoop.security.HttpCrossOriginFilterInitializer</value>
</property>

2.3.configure hdfs-site.xml

<property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/dfs/name</value>
</property>
<property>
    <name>dfs.data.dir</name>
    <value>/opt/hadoop/dfs/data</value>
</property>
<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>

2.4.configure mapred-site.xml

<property>  
    <name>mapred.job.tracker</name>  
    <value>master.c.ambari-195807.internal:49001</value>  
</property>
<property>
    <name>mapreduce.framework.name</name>  
    <value>yarn</value>  
</property>
<property>
    <name>mapred.local.dir</name>  
    <value>/data/hadoop/mapred</value>  
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>2048</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>4096</value>
</property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
</property>
<property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>
</property>
<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>4096</value>
</property>
<property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx6144m</value>
</property>
<property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx6144m</value>
</property>

2.5.configure yarn-site.xml

<property>  
    <name>yarn.resourcemanager.hostname</name>  
    <value>master.c.ambari-195807.internal</value>  
</property>  
<property>  
    <name>yarn.resourcemanager.address</name>  
    <value>${yarn.resourcemanager.hostname}:8032</value>  
</property>  
<property>  
    <name>yarn.resourcemanager.scheduler.address</name>  
    <value>${yarn.resourcemanager.hostname}:8030</value>  
</property>  
<property>  
    <name>yarn.resourcemanager.webapp.address</name>  
    <value>${yarn.resourcemanager.hostname}:8088</value>  
</property>  
<property>  
    <name>yarn.resourcemanager.webapp.https.address</name>  
    <value>${yarn.resourcemanager.hostname}:8090</value>  
</property>  
<property>  
    <name>yarn.resourcemanager.resource-tracker.address</name>  
    <value>${yarn.resourcemanager.hostname}:8031</value>  
</property>  
<property>  
    <name>yarn.resourcemanager.admin.address</name>  
    <value>${yarn.resourcemanager.hostname}:8033</value>  
</property>  
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
    <name>yarn.timeline-service.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.timeline-service.generic-application-history.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.timeline-service.http-cross-origin.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.timeline-service.hostname</name>
    <value>master.c.ambari-195807.internal</value>
</property>
<property>
    <name>yarn.resourcemanager.webapp.cross-origin.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.resourcemanager.address</name>
    <value>master.c.ambari-195807.internal:8032</value>
</property>
<property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master.c.ambari-195807.internal:8030</value>
</property>
<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master.c.ambari-195807.internal:8031</value>
</property>

2.6.set slaves

echo slave1.c.ambari-195807.internal >>slaves
echo slave2.c.ambari-195807.internal >>slaves
echo slave3.c.ambari-195807.internal >>slaves

2.7.copy hadoop from master to each slave

scp -r hadoop-2.8.3/ gizmo@slave1.c.ambari-195807.internal:/opt/apps/
scp -r hadoop-2.8.3/ gizmo@slave2.c.ambari-195807.internal:/opt/apps/
scp -r hadoop-2.8.3/ gizmo@slave3.c.ambari-195807.internal:/opt/apps/

2.8.configure hadoop env profile

echo 'export HADOOP_HOME=/opt/apps/hadoop-2.8.3' >>~/.bashrc
echo 'export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop' >>~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$JAVA_HOME/bin' >>~/.bashrc

2.9.start hdfs/yarn

start-dfs.hs
start-yarn.sh

2.10.check

hdfs, http://master.c.ambari-195807.internal:50070

yarn, http://master.c.ambari-195807.internal:8088

3.install hive

3.1.download hive 2.3.2

wget http://ftp.jaist.ac.jp/pub/apache/hive/hive-2.3.2/apache-hive-2.3.2-bin.tar.gz
tar -zvxf apache-hive-2.3.2-bin.tar.gz && cd apache-hive-2.3.2-bin

3.2.configure hive env profile

echo 'export HIVE_HOME=/opt/apps/apache-hive-2.3.2-bin' >>~/.bashrc
echo 'export PATH=$PATH:$HIVE_HOME/bin' >>~/.bashrc

3.3.install mysql to store metadata

rpm -ivh http://repo.mysql.com/mysql57-community-release-el7.rpm
yum install -y mysql-server
systemctl start mysqld
mysql_password="pa12ss34wo!@d#"
mysql_default_password=`grep 'temporary password' /var/log/mysqld.log | awk -F ': ' '{print $2}'`
mysql -u root -p${mysql_default_password} -e "set global validate_password_policy=0; set global validate_password_length=4;" --connect-expired-password
mysqladmin -u root -p${mysql_default_password} password ${mysql_password}
mysql -u root -p${mysql_password} -e "create database hive default charset 'utf8'; flush privileges;"
mysql -u root -p${mysql_password} -e "grant all privileges on hive.* to hive@'' identified by 'hive'; flush privileges;"

3.4.download mysql driver

wget http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.45/mysql-connector-java-5.1.45.jar -O $HIVE_HOME/lib

3.5.configure hive-site.xml

<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive</value>
    </property>
</configuration>

3.6.initialize hive meta tables

schematool -dbType mysql -initSchema

3.7.test hive

4.install tez

4.1.please follow the instruction “install tez on single server” on each server

5.install hbase

5.1.download hbase 1.2.6

wget http://ftp.jaist.ac.jp/pub/apache/hbase/1.2.6/hbase-1.2.6-bin.tar.gz
tar -vzxf hbase-1.2.6-bin.tar.gz && cd hbase-1.2.6

5.2.configure hbase-site.xml

<property>
    <name>hbase.rootdir</name>
    <value>hdfs://master.c.ambari-195807.internal:9000/hbase</value>
</property>
<property>
    <name>hbase.master</name>
    <value>master</value>
</property>
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>
<property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
</property>
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>slave1.c.ambari-195807.internal,slave2.c.ambari-195807.internal,slave3.c.ambari-195807.internal</value>
</property>
<property>
    <name>dfs.support.append</name>
    <value>true</value>
</property>
<property>  
    <name>hbase.master.info.port</name>  
    <value>60010</value>  
</property>

5.3.configure regionservers

echo slave1.c.ambari-195807.internal >>regionservers
echo slave2.c.ambari-195807.internal >>regionservers
echo slave3.c.ambari-195807.internal >>regionservers

5.4.copy hbase from master to each slave

5.5.configure hbase env profile

echo 'export HBASE_HOME=/opt/apps/hbase-1.2.6' >>~/.bashrc 
echo 'export PATH=$PATH:$HBASE_HOME/bin' >>~/.bashrc

5.6.start hbase

start-hbase.sh

5.7.check, http://35.194.253.162:60010

Things done!

Wang 19:44 on 2018-02-25 Permalink | Reply
Tags: BigData ( 37 ), Hive ( 15 ), Performance ( 35 ), Tez

[Performance Test] MR vs Tez

I tested performance about MR and Tez on my laptop, it’s single server, so it’s not very accurate.

I create two tables to do the test which contains the datasets I downloaded from GBIF.

gbif_0004998: 327,316 rows

gbif_0004991: 6,914,665 rows

1.test gbif_0004998

create by MR

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may no be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> CREATE TABLE gbif.gbif_0004998
> STORED AS ORC
> TBLPROPERTIES("orc.compress"="snappy")
> AS SELECT * FROM gbif.gbif_0004998_ori;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = wanghongmeng_20180224190744_f3fb257a-829e-40c2-974b-5abeb3d88693
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1519462946874_0007, Tracking URL = http://localhost:8088/proxy/application_1519462946874_0007/
Kill Command = /usr/local/Cellar/hadoop/2.8.2/bin/hadoop job -kill job_1519462946874_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-02-24 19:07:53,043 Stage-1 map = 0%, reduce = 0%
2018-02-24 19:08:10,204 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1519462946874_0007
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/gbif.db/.hive-staging_hive_2018-02-24_19-07-44_762_5371659277950436672-1/-ext-10002
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/gbif.db/gbif_0004998
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1 HDFS Read: 130582415 HDFS Write: 13510429 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 28.28 seconds

create by Tez

hive> set hive.execution.engine=tez;
hive> CREATE TABLE gbif.gbif_0004998
> STORED AS ORC
> TBLPROPERTIES("orc.compress"="snappy")
> AS SELECT * FROM gbif.gbif_0004998_ori;
Query ID = wanghongmeng_20180224193755_bd7fda12-bfd7-4abf-9c3e-0f90b9b58607
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1519462946874_0013)

----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED 
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 
----------------------------------------------------------------------------------------------
VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 15.65 s 
----------------------------------------------------------------------------------------------
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/gbif.db/gbif_0004998
OK
gbif_0004998_ori.gbifid gbif_0004998_ori.datasetkey gbif_0004998_ori.occurrenceid gbif_0004998_ori.kingdom gbif_0004998_ori.phylum gbif_0004998_ori.class gbif_0004998_ori.orders gbif_0004998_ori.family gbif_0004998_ori.genus gbif_0004998_ori.species gbif_0004998_ori.infraspecificepithet gbif_0004998_ori.taxonrank gbif_0004998_ori.scientificname gbif_0004998_ori.countrycode gbif_0004998_ori.locality gbif_0004998_ori.publishingorgkey gbif_0004998_ori.decimallatitude gbif_0004998_ori.decimallongitude gbif_0004998_ori.coordinateuncertaintyinmeters gbif_0004998_ori.coordinateprecision gbif_0004998_ori.elevation gbif_0004998_ori.elevationaccuracy gbif_0004998_ori.depth gbif_0004998_ori.depthaccuracy gbif_0004998_ori.eventdate gbif_0004998_ori.day gbif_0004998_ori.month gbif_0004998_ori.year gbif_0004998_ori.taxonkey gbif_0004998_ori.specieskey gbif_0004998_ori.basisofrecord gbif_0004998_ori.institutioncode gbif_0004998_ori.collectioncode gbif_0004998_ori.catalognumber gbif_0004998_ori.recordnumber gbif_0004998_ori.identifiedby gbif_0004998_ori.license gbif_0004998_ori.rightsholder gbif_0004998_ori.recordedby gbif_0004998_ori.typestatus gbif_0004998_ori.establishmentmeans gbif_0004998_ori.lastinterpreted gbif_0004998_ori.mediatype gbif_0004998_ori.issue
Time taken: 16.631 seconds

query by MR

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = wanghongmeng_20180224194412_0c9a74e1-b01e-4b92-8db4-f31522d44bd9
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1519462946874_0016, Tracking URL = http://localhost:8088/proxy/application_1519462946874_0016/
Kill Command = /usr/local/Cellar/hadoop/2.8.2/bin/hadoop job -kill job_1519462946874_0016
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-02-24 19:44:24,034 Stage-1 map = 0%, reduce = 0%
2018-02-24 19:44:33,661 Stage-1 map = 100%, reduce = 0%
2018-02-24 19:44:40,063 Stage-1 map = 100%, reduce = 100%
Ended Job = job_1519462946874_0016
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1 Reduce: 1 HDFS Read: 30539 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
total
28918
Time taken: 28.529 seconds, Fetched: 1 row(s)

query by Tez

hive> set hive.execution.engine=tez;
hive> select count(*) from gbif_0004998 where mediatype = 'STILLIMAGE';
Query ID = wanghongmeng_20180224193902_f03b627e-e091-4632-87e5-0d8af6484032
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1519462946874_0013)

----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED 
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 
Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 5.97 s 
----------------------------------------------------------------------------------------------
OK
total
28918
Time taken: 6.438 seconds, Fetched: 1 row(s)

2.test gbif_0004991

create by MR

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> CREATE TABLE gbif.gbif_0004991
> STORED AS ORC
> TBLPROPERTIES("orc.compress"="snappy")
> AS SELECT * FROM gbif.gbif_0004991_ori;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = wanghongmeng_20180224191238_19301476-a77f-45fa-a405-05a8732a45e9
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1519462946874_0010, Tracking URL = http://localhost:8088/proxy/application_1519462946874_0010/
Kill Command = /usr/local/Cellar/hadoop/2.8.2/bin/hadoop job -kill job_1519462946874_0010
Hadoop job information for Stage-1: number of mappers: 14; number of reducers: 0
2018-02-24 19:16:32,473 Stage-1 map = 0%, reduce = 0%
2018-02-24 19:17:32,678 Stage-1 map = 0%, reduce = 0%
2018-02-24 19:17:40,248 Stage-1 map = 7%, reduce = 0%
2018-02-24 19:17:51,119 Stage-1 map = 11%, reduce = 0%
2018-02-24 19:17:52,207 Stage-1 map = 18%, reduce = 0%
2018-02-24 19:17:58,625 Stage-1 map = 21%, reduce = 0%
2018-02-24 19:18:13,859 Stage-1 map = 25%, reduce = 0%
2018-02-24 19:18:15,999 Stage-1 map = 32%, reduce = 0%
2018-02-24 19:18:30,537 Stage-1 map = 36%, reduce = 0%
2018-02-24 19:18:31,625 Stage-1 map = 39%, reduce = 0%
2018-02-24 19:18:32,759 Stage-1 map = 43%, reduce = 0%
2018-02-24 19:19:17,117 Stage-1 map = 46%, reduce = 0%
2018-02-24 19:19:19,250 Stage-1 map = 50%, reduce = 0%
2018-02-24 19:19:25,639 Stage-1 map = 54%, reduce = 0%
2018-02-24 19:19:28,825 Stage-1 map = 57%, reduce = 0%
2018-02-24 19:19:32,031 Stage-1 map = 61%, reduce = 0%
2018-02-24 19:19:33,101 Stage-1 map = 64%, reduce = 0%
2018-02-24 19:19:39,470 Stage-1 map = 68%, reduce = 0%
2018-02-24 19:19:42,677 Stage-1 map = 71%, reduce = 0%
2018-02-24 19:19:54,459 Stage-1 map = 75%, reduce = 0%
2018-02-24 19:19:58,723 Stage-1 map = 79%, reduce = 0%
2018-02-24 19:20:04,147 Stage-1 map = 82%, reduce = 0%
2018-02-24 19:20:06,277 Stage-1 map = 86%, reduce = 0%
2018-02-24 19:20:15,977 Stage-1 map = 93%, reduce = 0%
2018-02-24 19:20:20,269 Stage-1 map = 96%, reduce = 0%
2018-02-24 19:20:36,398 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1519462946874_0010
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/gbif.db/.hive-staging_hive_2018-02-24_19-12-38_616_5758586722663198282-1/-ext-10002
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/gbif.db/gbif_0004991
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 14 HDFS Read: 3539512736 HDFS Write: 342789525 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 481.311 seconds

create via Tez

hive> set hive.execution.engine=tez;
hive> CREATE TABLE gbif.gbif_0004991
> STORED AS ORC
> TBLPROPERTIES("orc.compress"="snappy")
> AS SELECT * FROM gbif.gbif_0004991_ori;
Query ID = wanghongmeng_20180224192800_111872d9-059b-4a8a-9fd7-e3ea02af8898
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (Executing on YARN cluster with App id application_1519462946874_0013)

----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED 
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 
----------------------------------------------------------------------------------------------
VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 241.12 s 
----------------------------------------------------------------------------------------------
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/gbif.db/gbif_0004991
OK
gbif_0004991_ori.gbifid gbif_0004991_ori.datasetkey gbif_0004991_ori.occurrenceid gbif_0004991_ori.kingdom gbif_0004991_ori.phylum gbif_0004991_ori.class gbif_0004991_ori.orders gbif_0004991_ori.family gbif_0004991_ori.genus gbif_0004991_ori.species gbif_0004991_ori.infraspecificepithet gbif_0004991_ori.taxonrank gbif_0004991_ori.scientificname gbif_0004991_ori.countrycode gbif_0004991_ori.locality gbif_0004991_ori.publishingorgkey gbif_0004991_ori.decimallatitude gbif_0004991_ori.decimallongitude gbif_0004991_ori.coordinateuncertaintyinmeters gbif_0004991_ori.coordinateprecision gbif_0004991_ori.elevation gbif_0004991_ori.elevationaccuracy gbif_0004991_ori.depth gbif_0004991_ori.depthaccuracy gbif_0004991_ori.eventdate gbif_0004991_ori.day gbif_0004991_ori.month gbif_0004991_ori.year gbif_0004991_ori.taxonkey gbif_0004991_ori.specieskey gbif_0004991_ori.basisofrecord gbif_0004991_ori.institutioncode gbif_0004991_ori.collectioncode gbif_0004991_ori.catalognumber gbif_0004991_ori.recordnumber gbif_0004991_ori.identifiedby gbif_0004991_ori.license gbif_0004991_ori.rightsholder gbif_0004991_ori.recordedby gbif_0004991_ori.typestatus gbif_0004991_ori.establishmentmeans gbif_0004991_ori.lastinterpreted gbif_0004991_ori.mediatype gbif_0004991_ori.issue
Time taken: 252.548 seconds

query via MR

hive> set hive.execution.engine=mr;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> select count(*) from gbif_0004991 where mediatype = 'STILLIMAGE';
Query ID = wanghongmeng_20180224192630_b2934027-2423-4945-864b-6ce663e676fa
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1519462946874_0012, Tracking URL = http://localhost:8088/proxy/application_1519462946874_0012/
Kill Command = /usr/local/Cellar/hadoop/2.8.2/bin/hadoop job -kill job_1519462946874_0012
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2018-02-24 19:26:43,988 Stage-1 map = 0%, reduce = 0%
2018-02-24 19:27:00,086 Stage-1 map = 50%, reduce = 0%
2018-02-24 19:27:03,287 Stage-1 map = 74%, reduce = 0%
2018-02-24 19:27:05,422 Stage-1 map = 100%, reduce = 0%
2018-02-24 19:27:08,595 Stage-1 map = 100%, reduce = 100%
Ended Job = job_1519462946874_0012
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2 Reduce: 1 HDFS Read: 602777 HDFS Write: 106 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
total
374998
Time taken: 38.903 seconds, Fetched: 1 row(s)

query via Tez

hive> set hive.execution.engine=tez;
hive> select count(*) from gbif_0004991 where mediatype = 'STILLIMAGE';
Query ID = wanghongmeng_20180224193241_f4edd363-fdb8-4461-b687-4b775e8719c0
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1519462946874_0013)

----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED 
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 2 2 0 0 0 0 
Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 16.54 s 
----------------------------------------------------------------------------------------------
OK
total
374998
Time taken: 17.258 seconds, Fetched: 1 row(s)

3.summary

Table	Total Count	Create Table	Query
gbif_0004998	327,316	MR	28.28s	MR	28.529s
Tez	16.631s	Tez	6.438s
gbif_0004991	6,914,665	MR	481.311s	MR	38.903s
Tez	252.548s	Tez	17.258s

Wang 20:34 on 2018-02-24 Permalink | Reply
Tags: BigData ( 37 ), Hadoop ( 14 ), Hive ( 15 ), Tez

Conflicting jars of Hadoop and Tez

After I installed Tez, it’s ok to run hive jobs via Tez, but when I changed engine to MR, I got below error:

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = wanghongmeng_20180224185414_623cf20b-77d4-4a09-a17d-41c72ed76ac3
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. DEFAULT_MR_AM_ADMIN_USER_ENV

I can’t see any useful information from logs, after long time’s investigating, I found hadoop-mapreduce-client-common-2.7.0.jar/hadoop-mapreduce-client-core-2.7.0.jar under Tez library were conflicting with hadoop version, my installed hadoop version was 2.8.2, so I removed the two jars.

After doing this, I could run hive on MR successfully..😀

Wang 19:51 on 2018-02-24 Permalink | Reply
Tags: BigData ( 37 ), Hadoop ( 14 ), Hive ( 15 ), Tez, Tomcat

Replace MR with Tez on hive2

From hive2 Hive-on-MR is not recommended, you could see the warning information when running hive cli

Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

So I installed Tez to replace MR to run jobs, below are installation steps.

1.install Tez

1.1.down Tez and unpackage

wget http://ftp.jaist.ac.jp/pub/apache/tez/0.9.0/apache-tez-0.9.0-src.tar.gz
tar -zvxf apache-tez-0.9.0-src.tar.gz && cd apache-tez-0.9.0-src

1.2.compile and build Tez jar, you need install protobuf/maven before compiling

mvn clean package -DskipTests=true -Dmaven.javadoc.skip=true

1.3.upload Tez to hdfs

hadoop fs -mkdir /apps
hadoop fs -copyFromLocal tez-dist/target/tez-0.9.0.tar.gz /apps/

1.4.create tez-site.xml under hadoop conf directory

cat <<'EOF' > $HADOOP_CONF_DIR/tez-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>tez.lib.uris</name>
        <value>${fs.defaultFS}/apps/tez-0.9.0.tar.gz</value>
    </property>
    <property>
        <name>tez.history.logging.service.class</name>
        value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value>
    </property>
    <property>
        <name>tez.tez-ui.history-url.base</name>
        <value>http://localhost:8080/tez-ui/</value>
    </property>
</configuration>
EOF

1.5.append configurations to yarn-site.xml

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
    <name>yarn.timeline-service.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.timeline-service.generic-application-history.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.timeline-service.http-cross-origin.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.timeline-service.hostname</name>
    <value>localhost</value>
</property>
<property>
    <name>yarn.resourcemanager.webapp.cross-origin.enabled</name>
    <value>true</value>
</property>
<property>  
    <name>yarn.resourcemanager.address</name>  
    <value>localhost:8032</value>  
</property>  
<property>  
    <name>yarn.resourcemanager.scheduler.address</name>  
    <value>localhost:8030</value>  
</property>  
<property>  
    <name>yarn.resourcemanager.resource-tracker.address</name>  
    <value>localhost:8031</value>  
</property>

1.6.append configuration to core-site.xml

<property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>  
    <value>/data/hadoop/hdfs/tmp</value>
</property>
<property>
    <name>hadoop.http.filter.initializers</name>
    <value>org.apache.hadoop.security.HttpCrossOriginFilterInitializer</value>
</property>

1.7.unpackage tez-dist/target/tez-0.9.0-minimal.tar.gz

1.8.append env to /etc/profile

export TEZ_CONF_DIR="location of tez-site.xml"
export TEZ_JARS="location of unpackaged tez-0.9.0-minimal.tar.gz"
export HADOOP_CLASSPATH=${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*

1.9.start timelineserver

yarn-daemon.sh start timelineserver

1.10.configure tez ui, install tomcat, unpackage tez-ui/target/tez-ui-0.9.0.war into webapps, rename unpackaged directory to tez-ui

1.11.start tomcat, visit http://localhost:8080/tez-ui to test

2.test Tez

2.1.change job engine to Tez

hive> set hive.execution.engine=tez;

2.2.run job to test

hive> select count(*) from gbif_0004998;
Query ID = wanghongmeng_20180224180801_e5ddcf23-1e1a-4724-8156-1393807c2ac0
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1519462946874_0003)

----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED 
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 
Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 9.87 s 
----------------------------------------------------------------------------------------------
OK
327316
Time taken: 23.876 seconds, Fetched: 1 row(s)

2.3.check result on tez ui

Wang's Tech Blog

Recent Posts

Tags

Archives

My Girl

Tagged: Tez Toggle Comment Threads | Keyboard Shortcuts

Wang 20:12 on 2018-03-25 Permalink | Reply
Tags: Ambari, BigData ( 37 ), Cluster ( 30 ), Presto ( 7 ), Tez

[Presto] Integrate with Ambari

Reply Cancel reply

Wang 20:37 on 2018-03-06 Permalink | Reply
Tags: BigData ( 37 ), Cluster ( 30 ), Hadoop ( 14 ), Hive ( 15 ), Tez

[Performance Test] MR vs Tez(2)

Wang 19:44 on 2018-02-25 Permalink | Reply
Tags: BigData ( 37 ), Hive ( 15 ), Performance ( 35 ), Tez

[Performance Test] MR vs Tez

1.test gbif_0004998

create by MR

create by Tez

query by MR

query by Tez

2.test gbif_0004991

create by MR

create via Tez

query via MR

query via Tez

3.summary

Wang 20:34 on 2018-02-24 Permalink | Reply
Tags: BigData ( 37 ), Hadoop ( 14 ), Hive ( 15 ), Tez

Conflicting jars of Hadoop and Tez

Wang 19:51 on 2018-02-24 Permalink | Reply
Tags: BigData ( 37 ), Hadoop ( 14 ), Hive ( 15 ), Tez, Tomcat

Replace MR with Tez on hive2

Wang's Tech Blog

Recent Posts

Tags

Archives

My Girl

Tagged: Tez Toggle Comment Threads | Keyboard Shortcuts

Wang 20:12 on 2018-03-25 Permalink | Reply Tags: Ambari, BigData ( 37 ), Cluster ( 30 ), Presto ( 7 ), Tez

[Presto] Integrate with Ambari

Reply Cancel reply

Wang 20:37 on 2018-03-06 Permalink | Reply Tags: BigData ( 37 ), Cluster ( 30 ), Hadoop ( 14 ), Hive ( 15 ), Tez

[Performance Test] MR vs Tez(2)

Wang 21:43 on 2018-03-02 Permalink | Reply Tags: BigData ( 37 ), CentOS7 ( 7 ), Cloud ( 29 ), Cluster ( 30 ), GCP ( 12 ), Hadoop ( 14 ), Hive ( 15 ), Mysql ( 7 ), Tez

[GCP ] Install bigdata cluster

Wang 19:44 on 2018-02-25 Permalink | Reply Tags: BigData ( 37 ), Hive ( 15 ), Performance ( 35 ), Tez

[Performance Test] MR vs Tez

1.test gbif_0004998

create by MR

create by Tez

query by MR

query by Tez

2.test gbif_0004991

create by MR

create via Tez

query via MR

query via Tez

3.summary

Wang 20:34 on 2018-02-24 Permalink | Reply Tags: BigData ( 37 ), Hadoop ( 14 ), Hive ( 15 ), Tez

Conflicting jars of Hadoop and Tez

Wang 19:51 on 2018-02-24 Permalink | Reply Tags: BigData ( 37 ), Hadoop ( 14 ), Hive ( 15 ), Tez, Tomcat

Replace MR with Tez on hive2

Wang 20:12 on 2018-03-25 Permalink | Reply
Tags: Ambari, BigData ( 37 ), Cluster ( 30 ), Presto ( 7 ), Tez

Wang 20:37 on 2018-03-06 Permalink | Reply
Tags: BigData ( 37 ), Cluster ( 30 ), Hadoop ( 14 ), Hive ( 15 ), Tez

Wang 21:43 on 2018-03-02 Permalink | Reply
Tags: BigData ( 37 ), CentOS7 ( 7 ), Cloud ( 29 ), Cluster ( 30 ), GCP ( 12 ), Hadoop ( 14 ), Hive ( 15 ), Mysql ( 7 ), Tez

Wang 19:44 on 2018-02-25 Permalink | Reply
Tags: BigData ( 37 ), Hive ( 15 ), Performance ( 35 ), Tez

Wang 20:34 on 2018-02-24 Permalink | Reply
Tags: BigData ( 37 ), Hadoop ( 14 ), Hive ( 15 ), Tez

Wang 19:51 on 2018-02-24 Permalink | Reply
Tags: BigData ( 37 ), Hadoop ( 14 ), Hive ( 15 ), Tez, Tomcat