Tagged: Hive Toggle Comment Threads | Keyboard Shortcuts

  • Wang 20:56 on 2019-11-11 Permalink | Reply
    Tags: , , , Hive,   

    Include Ranger to protect your hadoop ecosystem 

    Apache Ranger

    Apache Ranger™ is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.

    The vision with Ranger is to provide comprehensive security across the Apache Hadoop ecosystem. With the advent of Apache YARN, the Hadoop platform can now support a true data lake architecture. Enterprises can potentially run multiple workloads, in a multi tenant environment. Data security within Hadoop needs to evolve to support multiple use cases for data access, while also providing a framework for central administration of security policies and monitoring of user access.

     
  • Wang 22:21 on 2018-11-05 Permalink | Reply
    Tags: , , , , , Hive, , ,   

    [Presto] Secure with LDAP 

    For security issue we decided to enable LDAP in presto, to deploy presto into kubernetes cluster we build presto image ourselves which include kerberos authentication and LDAP configurations.

    As you see the image structure, configurations under catalog/etc/hive are very important, please pay attention.

    krb5.conf and xxx.keytab are used to connect to kerberos

    password-authenticator.properties and ldap_server.pem under etc, hive.properties and hive-security.json under catalog are used to connect to LDAP.

    password-authenticator.properties

    password-authenticator.name=ldap
    ldap.url=ldaps://<IP>:<PORT>
    ldap.user-bind-pattern=xxxxxx
    ldap.user-base-dn=xxxxxx
    

    hive.properties

    connector.name=hive-hadoop2
    hive.security=file
    security.config-file=<hive-security.json>
    hive.metastore.authentication.type=KERBEROS
    hive.metastore.uri=thrift://<IP>:<PORT>
    hive.metastore.service.principal=<SERVER-PRINCIPAL>
    hive.metastore.client.principal=<CLIENT-PRINCIPAL>
    hive.metastore.client.keytab=<KEYTAB>
    hive.config.resources=core-site.xml, hdfs-site.xml
    

    hive-security.json

    {
      "schemas": [{
        "user": "user_1",
        "schema": "db_1",
        "owner": false
      }, {
        "user": " ",
        "schema": "db_1",
        "owner": false
      }, {
        "user": "user_2",
        "schema": "db_2",
        "owner": false
      }],
      "tables": [{
        "user": "user_1",
        "schema": "db_1",
        "table": "table_1",
        "privileges": ["SELECT"]
      }, {
        "user": "user_1",
        "schema": "db_1",
        "table": "table_2",
        "privileges": ["SELECT"]
      }, {
        "user": "user_2",
        "schema": "db_1",
        "table": ".*",
        "privileges": ["SELECT"]
      }, {
        "user": "user_2",
        "schema": "db_2",
        "table": "table_1",
        "privileges": ["SELECT"]
      }, {
        "user": "user_2",
        "schema": "db_2",
        "table": "table_2",
        "privileges": ["SELECT"]
      }],
      "sessionProperties": [{
        "allow": false
      }]
    }
    
     
  • Wang 16:56 on 2018-05-02 Permalink | Reply
    Tags: , , Hive, ,   

    [Presto] Connect hive by kerberos 

    For data security, hadoop cluster usually implement different security mechanisms, most commonly used mechanism is kerberos. Recently I tested how to connect hive by kerberos in presto.

    1.Add krb5.conf/keytab/hdfs-site.xml/core-site.xml in every node.

    2.Modify etc/jvm.properties, append -Djava.security.krb5.conf=”krb5.conf location”

    3.Create hive.properties under etc/catalog

    cat << 'EOF' > etc/catalog/hive.properties
    connector.name=hive-hadoop2
    
    hive.metastore.uri=thrift://xxx:9083
    hive.metastore.authentication.type=KERBEROS
    hive.metastore.service.principal=xxx@xxx.com
    hive.metastore.client.principal=xxx@xxx.com
    hive.metastore.client.keytab="keytab location"
    
    hive.config.resources="core-site.xml and hdfs-site.xml" location
    EOF
    

    4.Download hadoop-lzo jar into plugin/hive-hadoop2

    wget http://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.16/hadoop-lzo-0.4.16.jar -O plugin/hive-hadoop2
    

    5.Get principal tgt

    export KRB5_CONFIG="krb5.conf location"
    kinit -kt "keytab location" xxx@xxx.com
    

    6.Restart presto

    bin/launcher restart
    
     
  • Wang 22:15 on 2018-04-14 Permalink | Reply
    Tags: , Hive   

    Hiveserver2 – hive-jdbc version conflict 

    When I tested connecting hive by hiveserver2, I got error:

    Exception in thread "main" java.sql.SQLException: Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000/test_hive: Could not establish connection to jdbc:hive2://localhost:10000/test_hive: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{set:hiveconf:hive.server2.thrift.resultset.default.fetch.size=1000, use:database=test_hive})
    at org.apache.hive.jdbc.HiveConnection.(HiveConnection.java:224)
    at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:107)
    at java.sql.DriverManager.getConnection(DriverManager.java:664)
    at java.sql.DriverManager.getConnection(DriverManager.java:247)
    at com.rakuten.dsd.api.cdna.Thrift.main(Thrift.java:17)
    Caused by: java.sql.SQLException: Could not establish connection to jdbc:hive2://localhost:10000/test_hive: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{set:hiveconf:hive.server2.thrift.resultset.default.fetch.size=1000, use:database=test_hive})
    at org.apache.hive.jdbc.HiveConnection.openSession(HiveConnection.java:699)
    at org.apache.hive.jdbc.HiveConnection.(HiveConnection.java:200)
    ... 4 more
    Caused by: org.apache.thrift.TApplicationException: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{set:hiveconf:hive.server2.thrift.resultset.default.fetch.size=1000, use:database=test_hive})
    at org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
    at org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_OpenSession(TCLIService.java:168)
    at org.apache.hive.service.rpc.thrift.TCLIService$Client.OpenSession(TCLIService.java:155)
    at org.apache.hive.jdbc.HiveConnection.openSession(HiveConnection.java:680)
    ... 5 more
    
    

    After checking, it’s said that the version of hive-jdbc and hive are conflict, so I changed hive-jdbc version as the same as hive, problem solved.

     
  • Wang 21:36 on 2018-03-20 Permalink | Reply
    Tags: , , Hive,   

    [Presto] Build pseudo cluster 

    Presto is a distributed query engine which is developed by Facebook, for specific concept and advantages, please refer to the official document, below are the steps how I build pseudo cluster on my mac.

    1.download presto

    wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.196/presto-server-0.196.tar.gz
    tar -zvxf presto-server-0.196.tar.gz && cd presto-server-0.196
    

    2.configure configurations

    mkdir etc
    
    cat << 'EOF' > etc/jvm.config
    -server
    -Xmx16G
    -Xms16G
    -XX:+UseG1GC
    -XX:G1HeapRegionSize=32M
    -XX:+UseGCOverheadLimit
    -XX:+ExplicitGCInvokesConcurrent
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:+ExitOnOutOfMemoryError
    EOF
    
    cat << 'EOF' > etc/log.properties
    com.facebook.presto=INFO
    EOF
    
    cat << 'EOF' > etc/config1.properties
    coordinator=true
    node-scheduler.include-coordinator=true
    http-server.http.port=8001
    query.max-memory=24GB
    query.max-memory-per-node=8GB
    discovery-server.enabled=true
    discovery.uri=http://localhost:8001
    EOF
    
    cat << 'EOF' > etc/config2.properties
    coordinator=false
    node-scheduler.include-coordinator=true
    http-server.http.port=8002
    query.max-memory=24GB
    query.max-memory-per-node=8GB
    discovery-server.enabled=true
    discovery.uri=http://localhost:8001
    EOF
    
    cat << 'EOF' > etc/config3.properties
    coordinator=true
    node-scheduler.include-coordinator=true
    http-server.http.port=8003
    query.max-memory=24GB
    query.max-memory-per-node=8GB
    discovery-server.enabled=true
    discovery.uri=http://localhost:8001
    EOF
    
    cat << 'EOF' > etc/node1.properties
    node.environment=test
    node.id=671d18f9-dd0f-412d-b18c-fe6d7989b040
    node.data-dir=/usr/local/Cellar/presto/0.196/data/node1
    EOF
    
    cat << 'EOF' > etc/node2.properties
    node.environment=test
    node.id=e72fdd91-a135-4936-9a3e-f888c5106ed9
    node.data-dir=/usr/local/Cellar/presto/0.196/data/node2
    EOF
    
    cat << 'EOF' > etc/node3.properties
    node.environment=test
    node.id=6ab76715-1812-4093-95cf-1945f4cfefe3
    node.data-dir=/usr/local/Cellar/presto/0.196/data/node3
    EOF
    

    p.s. If you want to restrict operation, please add access-control.properties as below, only permit read operation.

    cat << 'EOF' > etc/access-control.properties
    access-control.name=read-only
    EOF
    

    3.start presto server

    bin/launcher start --config=etc/config1.properties --node-config=etc/node1.properties
    bin/launcher start --config=etc/config2.properties --node-config=etc/node2.properties
    bin/launcher start --config=etc/config3.properties --node-config=etc/node3.properties
    

    4.downlaod cli

    wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.196/presto-cli-0.196-executable.jar -O bin/presto-cli
    chmod +x bin/presto-cli
    

    5.create catalogs

    cat << 'EOF' > etc/catalog/mysql.properties
    connector.name=mysql
    connection-url=jdbc:mysql://localhost:3306?useSSL=false
    connection-user=presto
    connection-password=presto
    EOF
    
    cat << 'EOF' > etc/catalog/hive.properties
    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://localhost:9083
    EOF
    

    6.connect

    bin/presto-cli --server localhost:8001 --catalog hive
    
    presto> show catalogs;
     Catalog 
    ---------
     hive    
     mysql   
     system  
    (3 rows)
    
    Query 20180318_045410_00013_sq83e, FINISHED, 1 node
    Splits: 1 total, 1 done (100.00%)
    0:00 [0 rows, 0B] [0 rows/s, 0B/s]
    

    Screenshot:


    P.S. If build cluster, pay attention to below items:

    1.node.id in node.properties in every node must be unique in the cluster, you could generate it by uuid/uuidgen.

    2.query.max-memory-per-node in config.properties better to be half of -Xmx in jvm.config.

     
  • Wang 21:33 on 2018-03-11 Permalink | Reply
    Tags: , , , Hive, ,   

    [Sqoop1] Interact MySQL with HDFS/Hive/HBase 

    install sqoop1 on mac

    brew install sqoop
    

    #if you have set env profiles, uncomment profiles in conf/sqoop-env.sh

    1.MySQL -> HDFS

    1.1.import table

    sqoop import --connect jdbc:mysql://localhost/test --direct --username root --P --table t1 --warehouse-dir /mysql/test --fields-terminated-by ','
    

    1.2.import schema

    sqoop import-all-tables --connect jdbc:mysql://localhost/test --direct --username root -P --warehouse-dir /mysql/test --fields-terminated-by ','
    

    2.MySQL -> Hive

    2.1.import definition

    sqoop create-hive-table --connect jdbc:mysql://localhost/test --table t1 --username root --P --hive-database test
    

    2.2.import table

    sqoop import --connect jdbc:mysql://localhost/test --username root --P --table t1 --hive-import --hive-database test --hive-table t1 --fields-terminated-by ','
    

    2.3.import schema

    sqoop import-all-tables --connect jdbc:mysql://localhost/test --username root --P --hive-import --hive-database test --fields-terminated-by ','
    

    3.MySQL -> HBase

    3.1.definition

    sqoop import --connect jdbc:mysql://localhost/test --username root --P --table t1
    

    3.2.import table, need create table in hbase first

    sqoop import --connect jdbc:mysql://localhost/test --username root --P --table t1 --hbase-bulkload --hbase-table test.t1 --column-family basic --fields-terminated-by ','
    

    3.3.import table without creating table in hbase, but pay attention to hbase/sqoop version

    sqoop import --connect jdbc:mysql://localhost/test --username root --P --table t1 --hbase-bulkload --hbase-create-table --hbase-table test.t1 --column-family basic --fields-terminated-by ','
    

    4.HDFS/Hive/HBase -> MySQL

    sqoop export --connect jdbc:mysql://localhost/test --username root --P --table t1 --export-dir /user/hive/warehouse/test.db/t1 --fields-terminated-by ','
    
     
  • Wang 20:37 on 2018-03-06 Permalink | Reply
    Tags: , , , Hive,   

    [Performance Test] MR vs Tez(2) 

    I test the performance of MR vs Tez again on cluster, I created a new table which contains 28,872,974 rows, below are cluster servers:

    Host

    OS

    Memory

    CPU

    Disk

    Region

    master.c.ambari-195807.internal

    CentOS 7

    13 GB

    Intel Ivy Bridge: 2

    200G

    asia-east1-a

    slave1.c.ambari-195807.internal

    CentOS 7

    13 GB

    Intel Ivy Bridge: 2

    200G

    asia-east1-a

    slave2.c.ambari-195807.internal

    CentOS 7

    13 GB

    Intel Ivy Bridge: 2

    200G

    asia-east1-a

    slave3.c.ambari-195807.internal

    CentOS 7

    13 GB

    Intel Ivy Bridge: 2

    200G

    asia-east1-a

    1.MR

    1.1.create table

    hive> CREATE TABLE gbif.gbif_0004998
        > STORED AS ORC
        > TBLPROPERTIES("orc.compress"="snappy")
        > AS SELECT * FROM gbif.gbif_0004998_ori;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = gizmo_20180225064259_8df29800-b260-48f5-a409-80d6ea5200ad
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_1519536795015_0001, Tracking URL = http://master.c.ambari-195807.internal:8088/proxy/application_1519536795015_0001/
    Kill Command = /opt/apps/hadoop-2.8.3/bin/hadoop job  -kill job_1519536795015_0001
    Hadoop job information for Stage-1: number of mappers: 43; number of reducers: 0
    2018-02-25 06:43:15,110 Stage-1 map = 0%,  reduce = 0%
    2018-02-25 06:44:15,419 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 231.6 sec
    2018-02-25 06:44:36,386 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 380.45 sec
    2018-02-25 06:44:37,810 Stage-1 map = 3%,  reduce = 0%, Cumulative CPU 386.09 sec
    2018-02-25 06:44:41,695 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU 422.02 sec
    ...
    ...
    2018-02-25 06:47:36,112 Stage-1 map = 97%,  reduce = 0%, Cumulative CPU 1388.9 sec
    2018-02-25 06:47:38,185 Stage-1 map = 98%,  reduce = 0%, Cumulative CPU 1392.1 sec
    2018-02-25 06:47:45,434 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1402.14 sec
    MapReduce Total cumulative CPU time: 23 minutes 22 seconds 140 msec
    Ended Job = job_1519536795015_0001
    Stage-4 is selected by condition resolver.
    Stage-3 is filtered out by condition resolver.
    Stage-5 is filtered out by condition resolver.
    Moving data to directory hdfs://master.c.ambari-195807.internal:9000/user/hive/warehouse/gbif.db/.hive-staging_hive_2018-02-25_06-42-59_672_2925216554228494176-1/-ext-10002
    Moving data to directory hdfs://master.c.ambari-195807.internal:9000/user/hive/warehouse/gbif.db/gbif_0004998
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 43   Cumulative CPU: 1402.14 sec   HDFS Read: 11519083564 HDFS Write: 1210708016 SUCCESS
    Total MapReduce CPU Time Spent: 23 minutes 22 seconds 140 msec
    OK
    Time taken: 288.681 seconds
    

    1.2.query by on condition

    hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE';
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = gizmo_20180225065438_d2343424-5178-4c44-8b9d-0b28f8b701fa
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1519536795015_0002, Tracking URL = http://master.c.ambari-195807.internal:8088/proxy/application_1519536795015_0002/
    Kill Command = /opt/apps/hadoop-2.8.3/bin/hadoop job  -kill job_1519536795015_0002
    Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 1
    2018-02-25 06:54:50,078 Stage-1 map = 0%,  reduce = 0%
    2018-02-25 06:55:02,485 Stage-1 map = 40%,  reduce = 0%, Cumulative CPU 21.01 sec
    2018-02-25 06:55:03,544 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 38.51 sec
    2018-02-25 06:55:06,704 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 49.23 sec
    2018-02-25 06:55:09,881 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 51.88 sec
    MapReduce Total cumulative CPU time: 51 seconds 880 msec
    Ended Job = job_1519536795015_0002
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 5  Reduce: 1   Cumulative CPU: 51.88 sec   HDFS Read: 1936305 HDFS Write: 107 SUCCESS
    Total MapReduce CPU Time Spent: 51 seconds 880 msec
    OK
    2547716
    Time taken: 32.292 seconds, Fetched: 1 row(s)
    

    1.3.query by two conditions

    hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE' and year > 1900;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = gizmo_20180225081238_766d3707-7eb4-4818-860e-887c48d507ce
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1519545228015_0002, Tracking URL = http://master.c.ambari-195807.internal:8088/proxy/application_1519545228015_0002/
    Kill Command = /opt/apps/hadoop-2.8.3/bin/hadoop job  -kill job_1519545228015_0002
    Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 1
    2018-02-25 08:17:31,666 Stage-1 map = 0%,  reduce = 0%
    2018-02-25 08:17:43,866 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 10.58 sec
    2018-02-25 08:17:46,045 Stage-1 map = 60%,  reduce = 0%, Cumulative CPU 34.12 sec
    2018-02-25 08:17:54,996 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 41.73 sec
    2018-02-25 08:17:57,126 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 51.37 sec
    2018-02-25 08:17:58,192 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 53.72 sec
    MapReduce Total cumulative CPU time: 53 seconds 720 msec
    Ended Job = job_1519545228015_0002
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 5  Reduce: 1   Cumulative CPU: 53.72 sec   HDFS Read: 8334197 HDFS Write: 107 SUCCESS
    Total MapReduce CPU Time Spent: 53 seconds 720 msec
    OK
    2547716
    Time taken: 321.138 seconds, Fetched: 1 row(s)
    

    2.Tez

    2.1.create table

    hive> CREATE TABLE gbif.gbif_0004998
        > STORED AS ORC
        > TBLPROPERTIES("orc.compress"="snappy")
        > AS SELECT * FROM gbif.gbif_0004998_ori;
    Query ID = gizmo_20180225075657_bae527a7-7cbd-46d9-afbf-70a5adcdee7c
    Total jobs = 1
    Launching Job 1 out of 1
    Status: Running (Executing on YARN cluster with App id application_1519545228015_0001)
    
    ----------------------------------------------------------------------------------------------
            VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
    ----------------------------------------------------------------------------------------------
    VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 639.61 s   
    ----------------------------------------------------------------------------------------------
    Moving data to directory hdfs://master.c.ambari-195807.internal:9000/user/hive/warehouse/gbif.db/gbif_0004998
    OK
    Time taken: 664.817 seconds
    

    2.2.query by one condition

    hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE';
    Query ID = gizmo_20180225080856_d1f13489-30b0-4045-bdeb-e3e5e085e736
    Total jobs = 1
    Launching Job 1 out of 1
    Status: Running (Executing on YARN cluster with App id application_1519545228015_0001)
    
    ----------------------------------------------------------------------------------------------
            VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container     SUCCEEDED      5          5        0        0       0       0  
    Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
    ----------------------------------------------------------------------------------------------
    VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 17.91 s    
    ----------------------------------------------------------------------------------------------
    OK
    2547716
    Time taken: 19.255 seconds, Fetched: 1 row(s)
    

    2.2.query by two conditions

    hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE' and year > 1900;
    Query ID = gizmo_20180225081200_0279f8e6-544b-4573-858b-33f48bf1fa35
    Total jobs = 1
    Launching Job 1 out of 1
    Status: Running (Executing on YARN cluster with App id application_1519545228015_0001)
    
    ----------------------------------------------------------------------------------------------
            VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container     SUCCEEDED      5          5        0        0       0       0  
    Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
    ----------------------------------------------------------------------------------------------
    VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 16.96 s    
    ----------------------------------------------------------------------------------------------
    OK
    2547716
    Time taken: 17.635 seconds, Fetched: 1 row(s)
    

    3.Summary

    Rows: 28,872,974

    TypeCreate TableQuery By One ConditionQuery By Two Conditions
    MR288.681s32.292s321.138s
    Tez664.817s19.255s17.635s

    According to the result, MR is quicker than Tez on creation, but slower than Tez on query, along with query condition’s increase, MR’s query performance became worse.

    But why MR is quicker than Tez on creation, currently I don’t know, need to be investigated later.

    Maybe it has relationship with storage, I have checked the filesystem after the two kinds of creation, it’s different. MR has many small files, but Tez has one much bigger file.

    MR generated files

    Tez generated files

     
  • Wang 21:43 on 2018-03-02 Permalink | Reply
    Tags: , , , , , , Hive, ,   

    [GCP ] Install bigdata cluster 

    I applied google cloud for trial which give me 300$, so I initialize 4 severs to do test.

    Servers:

    Host

    OS

    Memory

    CPU

    Disk

    Region

    master.c.ambari-195807.internal

    CentOS 7

    13 GB

    Intel Ivy Bridge: 2

    200G

    asia-east1-a

    slave1.c.ambari-195807.internal

    CentOS 7

    13 GB

    Intel Ivy Bridge: 2

    200G

    asia-east1-a

    slave2.c.ambari-195807.internal

    CentOS 7

    13 GB

    Intel Ivy Bridge: 2

    200G

    asia-east1-a

    slave3.c.ambari-195807.internal

    CentOS 7

    13 GB

    Intel Ivy Bridge: 2

    200G

    asia-east1-a

    1.prepare

    1.1.configure ssh key on each slave to make master login without password

    1.2.install jdk1.8 on each server, download, set JAVA_HOME in profile

    1.3.configure hostnames in /etc/hosts on each server


    2.install hadoop

    2.1.download hadoop 2.8.2

    wget http://ftp.jaist.ac.jp/pub/apache/hadoop/common/hadoop-2.8.3/hadoop-2.8.3.tar.gz
    tar -vzxf hadoop-2.8.3.tar.gz && cd hadoop-2.8.3
    

    2.2.configure core-site.xml

    <property>
        <name>fs.default.name</name>
        <value>hdfs://master.c.ambari-195807.internal:9000</value> 
    </property>
    <property>
        <name>hadoop.tmp.dir</name>  
        <value>/data/hadoop/hdfs/tmp</value>
    </property>
    <property>
        <name>hadoop.http.filter.initializers</name>
        <value>org.apache.hadoop.security.HttpCrossOriginFilterInitializer</value>
    </property>
    

    2.3.configure hdfs-site.xml

    <property>
        <name>dfs.name.dir</name>
        <value>/data/hadoop/dfs/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/opt/hadoop/dfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    

    2.4.configure mapred-site.xml

    <property>  
        <name>mapred.job.tracker</name>  
        <value>master.c.ambari-195807.internal:49001</value>  
    </property>
    <property>
        <name>mapreduce.framework.name</name>  
        <value>yarn</value>  
    </property>
    <property>
        <name>mapred.local.dir</name>  
        <value>/data/hadoop/mapred</value>  
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>2048</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>4096</value>
    </property>
      <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>4096</value>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>4096</value>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>4096</value>
    </property>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx6144m</value>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx6144m</value>
    </property>
    

    2.5.configure yarn-site.xml

    <property>  
        <name>yarn.resourcemanager.hostname</name>  
        <value>master.c.ambari-195807.internal</value>  
    </property>  
    <property>  
        <name>yarn.resourcemanager.address</name>  
        <value>${yarn.resourcemanager.hostname}:8032</value>  
    </property>  
    <property>  
        <name>yarn.resourcemanager.scheduler.address</name>  
        <value>${yarn.resourcemanager.hostname}:8030</value>  
    </property>  
    <property>  
        <name>yarn.resourcemanager.webapp.address</name>  
        <value>${yarn.resourcemanager.hostname}:8088</value>  
    </property>  
    <property>  
        <name>yarn.resourcemanager.webapp.https.address</name>  
        <value>${yarn.resourcemanager.hostname}:8090</value>  
    </property>  
    <property>  
        <name>yarn.resourcemanager.resource-tracker.address</name>  
        <value>${yarn.resourcemanager.hostname}:8031</value>  
    </property>  
    <property>  
        <name>yarn.resourcemanager.admin.address</name>  
        <value>${yarn.resourcemanager.hostname}:8033</value>  
    </property>  
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.timeline-service.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.timeline-service.generic-application-history.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.timeline-service.http-cross-origin.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.timeline-service.hostname</name>
        <value>master.c.ambari-195807.internal</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.cross-origin.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master.c.ambari-195807.internal:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master.c.ambari-195807.internal:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master.c.ambari-195807.internal:8031</value>
    </property>
    

    2.6.set slaves

    echo slave1.c.ambari-195807.internal >>slaves
    echo slave2.c.ambari-195807.internal >>slaves
    echo slave3.c.ambari-195807.internal >>slaves
    

    2.7.copy hadoop from master to each slave

    scp -r hadoop-2.8.3/ gizmo@slave1.c.ambari-195807.internal:/opt/apps/
    scp -r hadoop-2.8.3/ gizmo@slave2.c.ambari-195807.internal:/opt/apps/
    scp -r hadoop-2.8.3/ gizmo@slave3.c.ambari-195807.internal:/opt/apps/
    

    2.8.configure hadoop env profile

    echo 'export HADOOP_HOME=/opt/apps/hadoop-2.8.3' >>~/.bashrc
    echo 'export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop' >>~/.bashrc
    echo 'export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$JAVA_HOME/bin' >>~/.bashrc
    

    2.9.start hdfs/yarn

    start-dfs.hs
    start-yarn.sh
    

    2.10.check

    hdfs, http://master.c.ambari-195807.internal:50070

    yarn, http://master.c.ambari-195807.internal:8088


    3.install hive

    3.1.download hive 2.3.2

    wget http://ftp.jaist.ac.jp/pub/apache/hive/hive-2.3.2/apache-hive-2.3.2-bin.tar.gz
    tar -zvxf apache-hive-2.3.2-bin.tar.gz && cd apache-hive-2.3.2-bin
    

    3.2.configure hive env profile

    echo 'export HIVE_HOME=/opt/apps/apache-hive-2.3.2-bin' >>~/.bashrc
    echo 'export PATH=$PATH:$HIVE_HOME/bin' >>~/.bashrc
    

    3.3.install mysql to store metadata

    rpm -ivh http://repo.mysql.com/mysql57-community-release-el7.rpm
    yum install -y mysql-server
    systemctl start mysqld
    mysql_password="pa12ss34wo!@d#"
    mysql_default_password=`grep 'temporary password' /var/log/mysqld.log | awk -F ': ' '{print $2}'`
    mysql -u root -p${mysql_default_password} -e "set global validate_password_policy=0; set global validate_password_length=4;" --connect-expired-password
    mysqladmin -u root -p${mysql_default_password} password ${mysql_password}
    mysql -u root -p${mysql_password} -e "create database hive default charset 'utf8'; flush privileges;"
    mysql -u root -p${mysql_password} -e "grant all privileges on hive.* to hive@'' identified by 'hive'; flush privileges;"
    

    3.4.download mysql driver

    wget http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.45/mysql-connector-java-5.1.45.jar -O $HIVE_HOME/lib
    

    3.5.configure hive-site.xml

    <configuration>
        <property>
            <name>javax.jdo.option.ConnectionURL</name>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <value>com.mysql.jdbc.Driver</value>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionUserName</name>
            <value>hive</value>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionPassword</name>
            <value>hive</value>
        </property>
    </configuration>
    

    3.6.initialize hive meta tables

    schematool -dbType mysql -initSchema
    

    3.7.test hive


    4.install tez

    4.1.please follow the instruction “install tez on single server” on each server


    5.install hbase

    5.1.download hbase 1.2.6

    wget http://ftp.jaist.ac.jp/pub/apache/hbase/1.2.6/hbase-1.2.6-bin.tar.gz
    tar -vzxf hbase-1.2.6-bin.tar.gz && cd hbase-1.2.6
    

    5.2.configure hbase-site.xml

    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://master.c.ambari-195807.internal:9000/hbase</value>
    </property>
    <property>
        <name>hbase.master</name>
        <value>master</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2181</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>slave1.c.ambari-195807.internal,slave2.c.ambari-195807.internal,slave3.c.ambari-195807.internal</value>
    </property>
    <property>
        <name>dfs.support.append</name>
        <value>true</value>
    </property>
    <property>  
        <name>hbase.master.info.port</name>  
        <value>60010</value>  
    </property>
    

    5.3.configure regionservers

    echo slave1.c.ambari-195807.internal >>regionservers
    echo slave2.c.ambari-195807.internal >>regionservers
    echo slave3.c.ambari-195807.internal >>regionservers
    

    5.4.copy hbase from master to each slave

    5.5.configure hbase env profile

    echo 'export HBASE_HOME=/opt/apps/hbase-1.2.6' >>~/.bashrc 
    echo 'export PATH=$PATH:$HBASE_HOME/bin' >>~/.bashrc
    

    5.6.start hbase

    start-hbase.sh
    

    5.7.check, http://35.194.253.162:60010


    Things done!

     
  • Wang 21:34 on 2018-02-27 Permalink | Reply
    Tags: , ETL, , , Hive, NoSQL   

    Import data from hive to hbase 

    Recently I need restore data from hive to hbase, I found there are no direct ways to do this by tools like sqoop, so I converted it myself.

    1.create hbase namespace and table which contained one columnfamily named basic

    create_namespace 'gbif'
    create 'gbif.gbif_0004998', 'basic'
    

    1.create intermediate hive table which following hive/hbase tables’s structure

    CREATE EXTERNAL TABLE intermediate.hbase_gbif_0004998 (gbifid string, datasetkey string, occurrenceid string, kingdom string, phylum string, class string, orders string, family string, genus string, species string, infraspecificepithet string, taxonrank string, scientificname string, countrycode string, locality string, publishingorgkey string, decimallatitude string, decimallongitude string, coordinateuncertaintyinmeters string, coordinateprecision string, elevation string, elevationaccuracy string, depth string, depthaccuracy string, eventdate string, day string, month string, year string, taxonkey string, specieskey string, basisofrecord string, institutioncode string, collectioncode string, catalognumber string, recordnumber string, identifiedby string, license string, rightsholder string, recordedby string, typestatus string, establishmentmeans string, lastinterpreted string, mediatype string, issue string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,basic:datasetkey,basic:occurrenceid,basic:kingdom,basic:phylum,basic:class,basic:orders,basic:family,basic:genus,basic:species,basic:infraspecificepithet,basic:taxonrank,basic:scientificname,basic:countrycode,basic:locality,basic:publishingorgkey,basic:decimallatitude,basic:decimallongitude,basic:coordinateuncertaintyinmeters,basic:coordinateprecision,basic:elevation,basic:elevationaccuracy,basic:depth,basic:depthaccuracy,basic:eventdate,basic:day,basic:month,basic:year,basic:taxonkey,basic:specieskey,basic:basisofrecord,basic:institutioncode,basic:collectioncode,basic:catalognumber,basic:recordnumber,basic:identifiedby,basic:license,basic:rightsholder,basic:recordedby,basic:typestatus,basic:establishmentmeans,basic:lastinterpreted,basic:mediatype,basic:issue") 
    TBLPROPERTIES("hbase.table.name" = "gbif.gbif_0004998");
    

    3.insert data into intermediate hive table

    insert overwrite table intermediate.hbase_gbif_0004998 select * from gbif.gbif_0004998;
    

    4.get intermediate hive table’s hdfs path

    desc formatted intermediate.hbase_gbif_0004998;
    

    #5.import into hbase from hdfs

    #hbase --config config_dir org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles 
    #hdfs://localhost:9000/user/hive/warehouse/intermediate.db/hbase_gbif_0004998 
    gbif.gbif_0004998
    

    6.check hbase’s data

    count 'gbif.gbif_0004998'
    
    ...
     ...
     Current count: 326000, row: 986217061
     Current count: 327000, row: 991771339
     327316 row(s) in 13.6890 seconds
    
    => 327316
    

    7.get data from hbase table

    hbase(main):008:0> get 'gbif.gbif_0004998', '1019778874'
    COLUMN CELL 
    basic:basisofrecord timestamp=1519452831179, value=LIVING_SPECIMEN 
    basic:catalognumber timestamp=1519452831179, value=A0011 
    basic:class timestamp=1519452831179, value=Liliopsida 
    basic:collectioncode timestamp=1519452831179, value=ArxC3xA1ceas 
    basic:coordinateprecision timestamp=1519452831179, value= 
    basic:coordinateuncertaintyinmeters timestamp=1519452831179, value= 
    basic:countrycode timestamp=1519452831179, value=CO 
    basic:datasetkey timestamp=1519452831179, value=fd5ae2bb-6ee6-4e5c-8428-6284fa385f9a 
    basic:day timestamp=1519452831179, value=23 
    basic:decimallatitude timestamp=1519452831179, value= 
    basic:decimallongitude timestamp=1519452831179, value= 
    basic:depth timestamp=1519452831179, value= 
    basic:depthaccuracy timestamp=1519452831179, value= 
    basic:elevation timestamp=1519452831179, value= 
    basic:elevationaccuracy timestamp=1519452831179, value= 
    basic:establishmentmeans timestamp=1519452831179, value= 
    basic:eventdate timestamp=1519452831179, value=2007-08-23T02:00Z 
    basic:family timestamp=1519452831179, value=Araceae 
    basic:genus timestamp=1519452831179, value=Anthurium 
    basic:identifiedby timestamp=1519452831179, value= 
    basic:infraspecificepithet timestamp=1519452831179, value= 
    basic:institutioncode timestamp=1519452831179, value=CorporacixC3xB3n San Jorge 
    basic:issue timestamp=1519452831179, value= 
    basic:kingdom timestamp=1519452831179, value=Plantae 
    basic:lastinterpreted timestamp=1519452831179, value=2018-02-03T23:09Z 
    basic:license timestamp=1519452831179, value=CC0_1_0 
    basic:locality timestamp=1519452831179, value= 
    basic:mediatype timestamp=1519452831179, value= 
    basic:month timestamp=1519452831179, value=8 
    basic:occurrenceid timestamp=1519452831179, value=JBSJ:Araceas:A0011 
    basic:orders timestamp=1519452831179, value=Alismatales 
    basic:phylum timestamp=1519452831179, value=Tracheophyta 
    basic:publishingorgkey timestamp=1519452831179, value=1904954c-81e7-4254-9778-ae3deed93de6 
    basic:recordedby timestamp=1519452831179, value=Oyuela G. 
    basic:recordnumber timestamp=1519452831179, value= 
    basic:rightsholder timestamp=1519452831179, value=CorporacixC3xB3n San Jorge 
    basic:scientificname timestamp=1519452831179, value=Anthurium cabrerense Engl. 
    basic:species timestamp=1519452831179, value=Anthurium cabrerense 
    basic:specieskey timestamp=1519452831179, value=2872557 
    basic:taxonkey timestamp=1519452831179, value=2872557 
    basic:taxonrank timestamp=1519452831179, value=SPECIES 
    basic:typestatus timestamp=1519452831179, value= 
    basic:year timestamp=1519452831179, value=2007
    
     
  • Wang 19:44 on 2018-02-25 Permalink | Reply
    Tags: , Hive, ,   

    [Performance Test] MR vs Tez 

    I tested performance about MR and Tez on my laptop, it’s single server, so it’s not very accurate.

    I create two tables to do the test which contains the datasets I downloaded from GBIF.

    gbif_0004998: 327,316 rows

    gbif_0004991: 6,914,665 rows

    1.test gbif_0004998
    create by MR
    hive> set hive.execution.engine=mr;
    Hive-on-MR is deprecated in Hive 2 and may no be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    hive> CREATE TABLE gbif.gbif_0004998
    > STORED AS ORC
    > TBLPROPERTIES("orc.compress"="snappy")
    > AS SELECT * FROM gbif.gbif_0004998_ori;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = wanghongmeng_20180224190744_f3fb257a-829e-40c2-974b-5abeb3d88693
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_1519462946874_0007, Tracking URL = http://localhost:8088/proxy/application_1519462946874_0007/
    Kill Command = /usr/local/Cellar/hadoop/2.8.2/bin/hadoop job -kill job_1519462946874_0007
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    2018-02-24 19:07:53,043 Stage-1 map = 0%, reduce = 0%
    2018-02-24 19:08:10,204 Stage-1 map = 100%, reduce = 0%
    Ended Job = job_1519462946874_0007
    Stage-4 is selected by condition resolver.
    Stage-3 is filtered out by condition resolver.
    Stage-5 is filtered out by condition resolver.
    Moving data to directory hdfs://localhost:9000/user/hive/warehouse/gbif.db/.hive-staging_hive_2018-02-24_19-07-44_762_5371659277950436672-1/-ext-10002
    Moving data to directory hdfs://localhost:9000/user/hive/warehouse/gbif.db/gbif_0004998
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 1 HDFS Read: 130582415 HDFS Write: 13510429 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    Time taken: 28.28 seconds
    
    create by Tez
    hive> set hive.execution.engine=tez;
    hive> CREATE TABLE gbif.gbif_0004998
    > STORED AS ORC
    > TBLPROPERTIES("orc.compress"="snappy")
    > AS SELECT * FROM gbif.gbif_0004998_ori;
    Query ID = wanghongmeng_20180224193755_bd7fda12-bfd7-4abf-9c3e-0f90b9b58607
    Total jobs = 1
    Launching Job 1 out of 1
    Status: Running (Executing on YARN cluster with App id application_1519462946874_0013)
    
    ----------------------------------------------------------------------------------------------
    VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED 
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 
    ----------------------------------------------------------------------------------------------
    VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 15.65 s 
    ----------------------------------------------------------------------------------------------
    Moving data to directory hdfs://localhost:9000/user/hive/warehouse/gbif.db/gbif_0004998
    OK
    gbif_0004998_ori.gbifid gbif_0004998_ori.datasetkey gbif_0004998_ori.occurrenceid gbif_0004998_ori.kingdom gbif_0004998_ori.phylum gbif_0004998_ori.class gbif_0004998_ori.orders gbif_0004998_ori.family gbif_0004998_ori.genus gbif_0004998_ori.species gbif_0004998_ori.infraspecificepithet gbif_0004998_ori.taxonrank gbif_0004998_ori.scientificname gbif_0004998_ori.countrycode gbif_0004998_ori.locality gbif_0004998_ori.publishingorgkey gbif_0004998_ori.decimallatitude gbif_0004998_ori.decimallongitude gbif_0004998_ori.coordinateuncertaintyinmeters gbif_0004998_ori.coordinateprecision gbif_0004998_ori.elevation gbif_0004998_ori.elevationaccuracy gbif_0004998_ori.depth gbif_0004998_ori.depthaccuracy gbif_0004998_ori.eventdate gbif_0004998_ori.day gbif_0004998_ori.month gbif_0004998_ori.year gbif_0004998_ori.taxonkey gbif_0004998_ori.specieskey gbif_0004998_ori.basisofrecord gbif_0004998_ori.institutioncode gbif_0004998_ori.collectioncode gbif_0004998_ori.catalognumber gbif_0004998_ori.recordnumber gbif_0004998_ori.identifiedby gbif_0004998_ori.license gbif_0004998_ori.rightsholder gbif_0004998_ori.recordedby gbif_0004998_ori.typestatus gbif_0004998_ori.establishmentmeans gbif_0004998_ori.lastinterpreted gbif_0004998_ori.mediatype gbif_0004998_ori.issue
    Time taken: 16.631 seconds
    
    query by MR
    hive> set hive.execution.engine=mr;
    Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE';
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = wanghongmeng_20180224194412_0c9a74e1-b01e-4b92-8db4-f31522d44bd9
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
    Starting Job = job_1519462946874_0016, Tracking URL = http://localhost:8088/proxy/application_1519462946874_0016/
    Kill Command = /usr/local/Cellar/hadoop/2.8.2/bin/hadoop job -kill job_1519462946874_0016
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2018-02-24 19:44:24,034 Stage-1 map = 0%, reduce = 0%
    2018-02-24 19:44:33,661 Stage-1 map = 100%, reduce = 0%
    2018-02-24 19:44:40,063 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_1519462946874_0016
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 1 Reduce: 1 HDFS Read: 30539 HDFS Write: 105 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    total
    28918
    Time taken: 28.529 seconds, Fetched: 1 row(s)
    
    query by Tez
    hive> set hive.execution.engine=tez;
    hive> select count(*) from gbif_0004998 where mediatype = 'STILLIMAGE';
    Query ID = wanghongmeng_20180224193902_f03b627e-e091-4632-87e5-0d8af6484032
    Total jobs = 1
    Launching Job 1 out of 1
    Status: Running (Executing on YARN cluster with App id application_1519462946874_0013)
    
    ----------------------------------------------------------------------------------------------
    VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED 
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 
    Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 
    ----------------------------------------------------------------------------------------------
    VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 5.97 s 
    ----------------------------------------------------------------------------------------------
    OK
    total
    28918
    Time taken: 6.438 seconds, Fetched: 1 row(s)
    
    2.test gbif_0004991
    create by MR
    hive> set hive.execution.engine=mr;
    Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    hive> CREATE TABLE gbif.gbif_0004991
    > STORED AS ORC
    > TBLPROPERTIES("orc.compress"="snappy")
    > AS SELECT * FROM gbif.gbif_0004991_ori;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = wanghongmeng_20180224191238_19301476-a77f-45fa-a405-05a8732a45e9
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_1519462946874_0010, Tracking URL = http://localhost:8088/proxy/application_1519462946874_0010/
    Kill Command = /usr/local/Cellar/hadoop/2.8.2/bin/hadoop job -kill job_1519462946874_0010
    Hadoop job information for Stage-1: number of mappers: 14; number of reducers: 0
    2018-02-24 19:16:32,473 Stage-1 map = 0%, reduce = 0%
    2018-02-24 19:17:32,678 Stage-1 map = 0%, reduce = 0%
    2018-02-24 19:17:40,248 Stage-1 map = 7%, reduce = 0%
    2018-02-24 19:17:51,119 Stage-1 map = 11%, reduce = 0%
    2018-02-24 19:17:52,207 Stage-1 map = 18%, reduce = 0%
    2018-02-24 19:17:58,625 Stage-1 map = 21%, reduce = 0%
    2018-02-24 19:18:13,859 Stage-1 map = 25%, reduce = 0%
    2018-02-24 19:18:15,999 Stage-1 map = 32%, reduce = 0%
    2018-02-24 19:18:30,537 Stage-1 map = 36%, reduce = 0%
    2018-02-24 19:18:31,625 Stage-1 map = 39%, reduce = 0%
    2018-02-24 19:18:32,759 Stage-1 map = 43%, reduce = 0%
    2018-02-24 19:19:17,117 Stage-1 map = 46%, reduce = 0%
    2018-02-24 19:19:19,250 Stage-1 map = 50%, reduce = 0%
    2018-02-24 19:19:25,639 Stage-1 map = 54%, reduce = 0%
    2018-02-24 19:19:28,825 Stage-1 map = 57%, reduce = 0%
    2018-02-24 19:19:32,031 Stage-1 map = 61%, reduce = 0%
    2018-02-24 19:19:33,101 Stage-1 map = 64%, reduce = 0%
    2018-02-24 19:19:39,470 Stage-1 map = 68%, reduce = 0%
    2018-02-24 19:19:42,677 Stage-1 map = 71%, reduce = 0%
    2018-02-24 19:19:54,459 Stage-1 map = 75%, reduce = 0%
    2018-02-24 19:19:58,723 Stage-1 map = 79%, reduce = 0%
    2018-02-24 19:20:04,147 Stage-1 map = 82%, reduce = 0%
    2018-02-24 19:20:06,277 Stage-1 map = 86%, reduce = 0%
    2018-02-24 19:20:15,977 Stage-1 map = 93%, reduce = 0%
    2018-02-24 19:20:20,269 Stage-1 map = 96%, reduce = 0%
    2018-02-24 19:20:36,398 Stage-1 map = 100%, reduce = 0%
    Ended Job = job_1519462946874_0010
    Stage-4 is selected by condition resolver.
    Stage-3 is filtered out by condition resolver.
    Stage-5 is filtered out by condition resolver.
    Moving data to directory hdfs://localhost:9000/user/hive/warehouse/gbif.db/.hive-staging_hive_2018-02-24_19-12-38_616_5758586722663198282-1/-ext-10002
    Moving data to directory hdfs://localhost:9000/user/hive/warehouse/gbif.db/gbif_0004991
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 14 HDFS Read: 3539512736 HDFS Write: 342789525 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    Time taken: 481.311 seconds
    
    create via Tez
    hive> set hive.execution.engine=tez;
    hive> CREATE TABLE gbif.gbif_0004991
    > STORED AS ORC
    > TBLPROPERTIES("orc.compress"="snappy")
    > AS SELECT * FROM gbif.gbif_0004991_ori;
    Query ID = wanghongmeng_20180224192800_111872d9-059b-4a8a-9fd7-e3ea02af8898
    Total jobs = 1
    Launching Job 1 out of 1
    Tez session was closed. Reopening...
    Session re-established.
    Status: Running (Executing on YARN cluster with App id application_1519462946874_0013)
    
    ----------------------------------------------------------------------------------------------
    VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED 
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 
    ----------------------------------------------------------------------------------------------
    VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 241.12 s 
    ----------------------------------------------------------------------------------------------
    Moving data to directory hdfs://localhost:9000/user/hive/warehouse/gbif.db/gbif_0004991
    OK
    gbif_0004991_ori.gbifid gbif_0004991_ori.datasetkey gbif_0004991_ori.occurrenceid gbif_0004991_ori.kingdom gbif_0004991_ori.phylum gbif_0004991_ori.class gbif_0004991_ori.orders gbif_0004991_ori.family gbif_0004991_ori.genus gbif_0004991_ori.species gbif_0004991_ori.infraspecificepithet gbif_0004991_ori.taxonrank gbif_0004991_ori.scientificname gbif_0004991_ori.countrycode gbif_0004991_ori.locality gbif_0004991_ori.publishingorgkey gbif_0004991_ori.decimallatitude gbif_0004991_ori.decimallongitude gbif_0004991_ori.coordinateuncertaintyinmeters gbif_0004991_ori.coordinateprecision gbif_0004991_ori.elevation gbif_0004991_ori.elevationaccuracy gbif_0004991_ori.depth gbif_0004991_ori.depthaccuracy gbif_0004991_ori.eventdate gbif_0004991_ori.day gbif_0004991_ori.month gbif_0004991_ori.year gbif_0004991_ori.taxonkey gbif_0004991_ori.specieskey gbif_0004991_ori.basisofrecord gbif_0004991_ori.institutioncode gbif_0004991_ori.collectioncode gbif_0004991_ori.catalognumber gbif_0004991_ori.recordnumber gbif_0004991_ori.identifiedby gbif_0004991_ori.license gbif_0004991_ori.rightsholder gbif_0004991_ori.recordedby gbif_0004991_ori.typestatus gbif_0004991_ori.establishmentmeans gbif_0004991_ori.lastinterpreted gbif_0004991_ori.mediatype gbif_0004991_ori.issue
    Time taken: 252.548 seconds
    
    query via MR
    hive> set hive.execution.engine=mr;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    hive> select count(*) from gbif_0004991 where mediatype = 'STILLIMAGE';
    Query ID = wanghongmeng_20180224192630_b2934027-2423-4945-864b-6ce663e676fa
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
    Starting Job = job_1519462946874_0012, Tracking URL = http://localhost:8088/proxy/application_1519462946874_0012/
    Kill Command = /usr/local/Cellar/hadoop/2.8.2/bin/hadoop job -kill job_1519462946874_0012
    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
    2018-02-24 19:26:43,988 Stage-1 map = 0%, reduce = 0%
    2018-02-24 19:27:00,086 Stage-1 map = 50%, reduce = 0%
    2018-02-24 19:27:03,287 Stage-1 map = 74%, reduce = 0%
    2018-02-24 19:27:05,422 Stage-1 map = 100%, reduce = 0%
    2018-02-24 19:27:08,595 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_1519462946874_0012
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 2 Reduce: 1 HDFS Read: 602777 HDFS Write: 106 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    total
    374998
    Time taken: 38.903 seconds, Fetched: 1 row(s)
    
    query via Tez
    hive> set hive.execution.engine=tez;
    hive> select count(*) from gbif_0004991 where mediatype = 'STILLIMAGE';
    Query ID = wanghongmeng_20180224193241_f4edd363-fdb8-4461-b687-4b775e8719c0
    Total jobs = 1
    Launching Job 1 out of 1
    Status: Running (Executing on YARN cluster with App id application_1519462946874_0013)
    
    ----------------------------------------------------------------------------------------------
    VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED 
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container SUCCEEDED 2 2 0 0 0 0 
    Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 
    ----------------------------------------------------------------------------------------------
    VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 16.54 s 
    ----------------------------------------------------------------------------------------------
    OK
    total
    374998
    Time taken: 17.258 seconds, Fetched: 1 row(s)
    
    3.summary

    Table

    Total Count

    Create Table

    Query

    gbif_0004998

    327,316

    MR28.28sMR28.529s
    Tez16.631sTez6.438s

    gbif_0004991

    6,914,665

    MR481.311sMR38.903s
    Tez252.548sTez17.258s
     
c
Compose new post
j
Next post/Next comment
k
Previous post/Previous comment
r
Reply
e
Edit
o
Show/Hide comments
t
Go to top
l
Go to login
h
Show/Hide help
shift + esc
Cancel
%d bloggers like this: