Updates from February, 2018

  • Wang 21:34 on 2018-02-27
    Tags: ETL, NoSQL

    Import data from hive to hbase 

    Recently I needed to restore data from Hive to HBase. I found there is no direct way to do this with tools like Sqoop, so I converted the data myself.

    1.create hbase namespace and table containing one column family named basic

    create_namespace 'gbif'
    create 'gbif.gbif_0004998', 'basic'
    

    2.create an intermediate hive table following the hive/hbase table structure

    CREATE EXTERNAL TABLE intermediate.hbase_gbif_0004998 (gbifid string, datasetkey string, occurrenceid string, kingdom string, phylum string, class string, orders string, family string, genus string, species string, infraspecificepithet string, taxonrank string, scientificname string, countrycode string, locality string, publishingorgkey string, decimallatitude string, decimallongitude string, coordinateuncertaintyinmeters string, coordinateprecision string, elevation string, elevationaccuracy string, depth string, depthaccuracy string, eventdate string, day string, month string, year string, taxonkey string, specieskey string, basisofrecord string, institutioncode string, collectioncode string, catalognumber string, recordnumber string, identifiedby string, license string, rightsholder string, recordedby string, typestatus string, establishmentmeans string, lastinterpreted string, mediatype string, issue string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,basic:datasetkey,basic:occurrenceid,basic:kingdom,basic:phylum,basic:class,basic:orders,basic:family,basic:genus,basic:species,basic:infraspecificepithet,basic:taxonrank,basic:scientificname,basic:countrycode,basic:locality,basic:publishingorgkey,basic:decimallatitude,basic:decimallongitude,basic:coordinateuncertaintyinmeters,basic:coordinateprecision,basic:elevation,basic:elevationaccuracy,basic:depth,basic:depthaccuracy,basic:eventdate,basic:day,basic:month,basic:year,basic:taxonkey,basic:specieskey,basic:basisofrecord,basic:institutioncode,basic:collectioncode,basic:catalognumber,basic:recordnumber,basic:identifiedby,basic:license,basic:rightsholder,basic:recordedby,basic:typestatus,basic:establishmentmeans,basic:lastinterpreted,basic:mediatype,basic:issue") 
    TBLPROPERTIES("hbase.table.name" = "gbif.gbif_0004998");
    

    3.insert data into intermediate hive table

    insert overwrite table intermediate.hbase_gbif_0004998 select * from gbif.gbif_0004998;
    

    4.get intermediate hive table’s hdfs path

    desc formatted intermediate.hbase_gbif_0004998;
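
    To pull just the warehouse path out of that output, a convenience one-liner can help (the hive -e plus grep combination is my own addition, not part of the original workflow):

    hive -e "desc formatted intermediate.hbase_gbif_0004998;" | grep -i "Location:"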
    

    5.import into hbase from hdfs

    hbase --config config_dir org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
        hdfs://localhost:9000/user/hive/warehouse/intermediate.db/hbase_gbif_0004998 \
        gbif.gbif_0004998
    

    6.check hbase’s data

    count 'gbif.gbif_0004998'
    
    ...
     ...
     Current count: 326000, row: 986217061
     Current count: 327000, row: 991771339
     327316 row(s) in 13.6890 seconds
    
    => 327316
    

    7.get data from hbase table

    hbase(main):008:0> get 'gbif.gbif_0004998', '1019778874'
    COLUMN CELL 
    basic:basisofrecord timestamp=1519452831179, value=LIVING_SPECIMEN 
    basic:catalognumber timestamp=1519452831179, value=A0011 
    basic:class timestamp=1519452831179, value=Liliopsida 
    basic:collectioncode timestamp=1519452831179, value=Ar\xC3\xA1ceas 
    basic:coordinateprecision timestamp=1519452831179, value= 
    basic:coordinateuncertaintyinmeters timestamp=1519452831179, value= 
    basic:countrycode timestamp=1519452831179, value=CO 
    basic:datasetkey timestamp=1519452831179, value=fd5ae2bb-6ee6-4e5c-8428-6284fa385f9a 
    basic:day timestamp=1519452831179, value=23 
    basic:decimallatitude timestamp=1519452831179, value= 
    basic:decimallongitude timestamp=1519452831179, value= 
    basic:depth timestamp=1519452831179, value= 
    basic:depthaccuracy timestamp=1519452831179, value= 
    basic:elevation timestamp=1519452831179, value= 
    basic:elevationaccuracy timestamp=1519452831179, value= 
    basic:establishmentmeans timestamp=1519452831179, value= 
    basic:eventdate timestamp=1519452831179, value=2007-08-23T02:00Z 
    basic:family timestamp=1519452831179, value=Araceae 
    basic:genus timestamp=1519452831179, value=Anthurium 
    basic:identifiedby timestamp=1519452831179, value= 
    basic:infraspecificepithet timestamp=1519452831179, value= 
    basic:institutioncode timestamp=1519452831179, value=Corporaci\xC3\xB3n San Jorge 
    basic:issue timestamp=1519452831179, value= 
    basic:kingdom timestamp=1519452831179, value=Plantae 
    basic:lastinterpreted timestamp=1519452831179, value=2018-02-03T23:09Z 
    basic:license timestamp=1519452831179, value=CC0_1_0 
    basic:locality timestamp=1519452831179, value= 
    basic:mediatype timestamp=1519452831179, value= 
    basic:month timestamp=1519452831179, value=8 
    basic:occurrenceid timestamp=1519452831179, value=JBSJ:Araceas:A0011 
    basic:orders timestamp=1519452831179, value=Alismatales 
    basic:phylum timestamp=1519452831179, value=Tracheophyta 
    basic:publishingorgkey timestamp=1519452831179, value=1904954c-81e7-4254-9778-ae3deed93de6 
    basic:recordedby timestamp=1519452831179, value=Oyuela G. 
    basic:recordnumber timestamp=1519452831179, value= 
    basic:rightsholder timestamp=1519452831179, value=Corporaci\xC3\xB3n San Jorge 
    basic:scientificname timestamp=1519452831179, value=Anthurium cabrerense Engl. 
    basic:species timestamp=1519452831179, value=Anthurium cabrerense 
    basic:specieskey timestamp=1519452831179, value=2872557 
    basic:taxonkey timestamp=1519452831179, value=2872557 
    basic:taxonrank timestamp=1519452831179, value=SPECIES 
    basic:typestatus timestamp=1519452831179, value= 
    basic:year timestamp=1519452831179, value=2007
    
     
  • Wang 19:44 on 2018-02-25

    [Performance Test] MR vs Tez 

    I tested the performance of MR and Tez on my laptop. It is a single server, so the results are not very accurate.

    I created two tables for the test, containing datasets I downloaded from GBIF.

    gbif_0004998: 327,316 rows

    gbif_0004991: 6,914,665 rows
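
    If you want to reproduce the row counts, a quick check from the shell (hive -e is just a convenience here; any hive client works):

    hive -e "select count(*) from gbif.gbif_0004998; select count(*) from gbif.gbif_0004991;"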

    1.test gbif_0004998
    create by MR
    hive> set hive.execution.engine=mr;
    Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    hive> CREATE TABLE gbif.gbif_0004998
    > STORED AS ORC
    > TBLPROPERTIES("orc.compress"="snappy")
    > AS SELECT * FROM gbif.gbif_0004998_ori;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = wanghongmeng_20180224190744_f3fb257a-829e-40c2-974b-5abeb3d88693
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_1519462946874_0007, Tracking URL = http://localhost:8088/proxy/application_1519462946874_0007/
    Kill Command = /usr/local/Cellar/hadoop/2.8.2/bin/hadoop job -kill job_1519462946874_0007
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    2018-02-24 19:07:53,043 Stage-1 map = 0%, reduce = 0%
    2018-02-24 19:08:10,204 Stage-1 map = 100%, reduce = 0%
    Ended Job = job_1519462946874_0007
    Stage-4 is selected by condition resolver.
    Stage-3 is filtered out by condition resolver.
    Stage-5 is filtered out by condition resolver.
    Moving data to directory hdfs://localhost:9000/user/hive/warehouse/gbif.db/.hive-staging_hive_2018-02-24_19-07-44_762_5371659277950436672-1/-ext-10002
    Moving data to directory hdfs://localhost:9000/user/hive/warehouse/gbif.db/gbif_0004998
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 1 HDFS Read: 130582415 HDFS Write: 13510429 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    Time taken: 28.28 seconds
    
    create by Tez
    hive> set hive.execution.engine=tez;
    hive> CREATE TABLE gbif.gbif_0004998
    > STORED AS ORC
    > TBLPROPERTIES("orc.compress"="snappy")
    > AS SELECT * FROM gbif.gbif_0004998_ori;
    Query ID = wanghongmeng_20180224193755_bd7fda12-bfd7-4abf-9c3e-0f90b9b58607
    Total jobs = 1
    Launching Job 1 out of 1
    Status: Running (Executing on YARN cluster with App id application_1519462946874_0013)
    
    ----------------------------------------------------------------------------------------------
    VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED 
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 
    ----------------------------------------------------------------------------------------------
    VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 15.65 s 
    ----------------------------------------------------------------------------------------------
    Moving data to directory hdfs://localhost:9000/user/hive/warehouse/gbif.db/gbif_0004998
    OK
    gbif_0004998_ori.gbifid gbif_0004998_ori.datasetkey gbif_0004998_ori.occurrenceid gbif_0004998_ori.kingdom gbif_0004998_ori.phylum gbif_0004998_ori.class gbif_0004998_ori.orders gbif_0004998_ori.family gbif_0004998_ori.genus gbif_0004998_ori.species gbif_0004998_ori.infraspecificepithet gbif_0004998_ori.taxonrank gbif_0004998_ori.scientificname gbif_0004998_ori.countrycode gbif_0004998_ori.locality gbif_0004998_ori.publishingorgkey gbif_0004998_ori.decimallatitude gbif_0004998_ori.decimallongitude gbif_0004998_ori.coordinateuncertaintyinmeters gbif_0004998_ori.coordinateprecision gbif_0004998_ori.elevation gbif_0004998_ori.elevationaccuracy gbif_0004998_ori.depth gbif_0004998_ori.depthaccuracy gbif_0004998_ori.eventdate gbif_0004998_ori.day gbif_0004998_ori.month gbif_0004998_ori.year gbif_0004998_ori.taxonkey gbif_0004998_ori.specieskey gbif_0004998_ori.basisofrecord gbif_0004998_ori.institutioncode gbif_0004998_ori.collectioncode gbif_0004998_ori.catalognumber gbif_0004998_ori.recordnumber gbif_0004998_ori.identifiedby gbif_0004998_ori.license gbif_0004998_ori.rightsholder gbif_0004998_ori.recordedby gbif_0004998_ori.typestatus gbif_0004998_ori.establishmentmeans gbif_0004998_ori.lastinterpreted gbif_0004998_ori.mediatype gbif_0004998_ori.issue
    Time taken: 16.631 seconds
    
    query by MR
    hive> set hive.execution.engine=mr;
    Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE';
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = wanghongmeng_20180224194412_0c9a74e1-b01e-4b92-8db4-f31522d44bd9
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
    Starting Job = job_1519462946874_0016, Tracking URL = http://localhost:8088/proxy/application_1519462946874_0016/
    Kill Command = /usr/local/Cellar/hadoop/2.8.2/bin/hadoop job -kill job_1519462946874_0016
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2018-02-24 19:44:24,034 Stage-1 map = 0%, reduce = 0%
    2018-02-24 19:44:33,661 Stage-1 map = 100%, reduce = 0%
    2018-02-24 19:44:40,063 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_1519462946874_0016
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 1 Reduce: 1 HDFS Read: 30539 HDFS Write: 105 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    total
    28918
    Time taken: 28.529 seconds, Fetched: 1 row(s)
    
    query by Tez
    hive> set hive.execution.engine=tez;
    hive> select count(*) from gbif_0004998 where mediatype = 'STILLIMAGE';
    Query ID = wanghongmeng_20180224193902_f03b627e-e091-4632-87e5-0d8af6484032
    Total jobs = 1
    Launching Job 1 out of 1
    Status: Running (Executing on YARN cluster with App id application_1519462946874_0013)
    
    ----------------------------------------------------------------------------------------------
    VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED 
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 
    Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 
    ----------------------------------------------------------------------------------------------
    VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 5.97 s 
    ----------------------------------------------------------------------------------------------
    OK
    total
    28918
    Time taken: 6.438 seconds, Fetched: 1 row(s)
    
    2.test gbif_0004991
    create by MR
    hive> set hive.execution.engine=mr;
    Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    hive> CREATE TABLE gbif.gbif_0004991
    > STORED AS ORC
    > TBLPROPERTIES("orc.compress"="snappy")
    > AS SELECT * FROM gbif.gbif_0004991_ori;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = wanghongmeng_20180224191238_19301476-a77f-45fa-a405-05a8732a45e9
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_1519462946874_0010, Tracking URL = http://localhost:8088/proxy/application_1519462946874_0010/
    Kill Command = /usr/local/Cellar/hadoop/2.8.2/bin/hadoop job -kill job_1519462946874_0010
    Hadoop job information for Stage-1: number of mappers: 14; number of reducers: 0
    2018-02-24 19:16:32,473 Stage-1 map = 0%, reduce = 0%
    2018-02-24 19:17:32,678 Stage-1 map = 0%, reduce = 0%
    2018-02-24 19:17:40,248 Stage-1 map = 7%, reduce = 0%
    2018-02-24 19:17:51,119 Stage-1 map = 11%, reduce = 0%
    2018-02-24 19:17:52,207 Stage-1 map = 18%, reduce = 0%
    2018-02-24 19:17:58,625 Stage-1 map = 21%, reduce = 0%
    2018-02-24 19:18:13,859 Stage-1 map = 25%, reduce = 0%
    2018-02-24 19:18:15,999 Stage-1 map = 32%, reduce = 0%
    2018-02-24 19:18:30,537 Stage-1 map = 36%, reduce = 0%
    2018-02-24 19:18:31,625 Stage-1 map = 39%, reduce = 0%
    2018-02-24 19:18:32,759 Stage-1 map = 43%, reduce = 0%
    2018-02-24 19:19:17,117 Stage-1 map = 46%, reduce = 0%
    2018-02-24 19:19:19,250 Stage-1 map = 50%, reduce = 0%
    2018-02-24 19:19:25,639 Stage-1 map = 54%, reduce = 0%
    2018-02-24 19:19:28,825 Stage-1 map = 57%, reduce = 0%
    2018-02-24 19:19:32,031 Stage-1 map = 61%, reduce = 0%
    2018-02-24 19:19:33,101 Stage-1 map = 64%, reduce = 0%
    2018-02-24 19:19:39,470 Stage-1 map = 68%, reduce = 0%
    2018-02-24 19:19:42,677 Stage-1 map = 71%, reduce = 0%
    2018-02-24 19:19:54,459 Stage-1 map = 75%, reduce = 0%
    2018-02-24 19:19:58,723 Stage-1 map = 79%, reduce = 0%
    2018-02-24 19:20:04,147 Stage-1 map = 82%, reduce = 0%
    2018-02-24 19:20:06,277 Stage-1 map = 86%, reduce = 0%
    2018-02-24 19:20:15,977 Stage-1 map = 93%, reduce = 0%
    2018-02-24 19:20:20,269 Stage-1 map = 96%, reduce = 0%
    2018-02-24 19:20:36,398 Stage-1 map = 100%, reduce = 0%
    Ended Job = job_1519462946874_0010
    Stage-4 is selected by condition resolver.
    Stage-3 is filtered out by condition resolver.
    Stage-5 is filtered out by condition resolver.
    Moving data to directory hdfs://localhost:9000/user/hive/warehouse/gbif.db/.hive-staging_hive_2018-02-24_19-12-38_616_5758586722663198282-1/-ext-10002
    Moving data to directory hdfs://localhost:9000/user/hive/warehouse/gbif.db/gbif_0004991
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 14 HDFS Read: 3539512736 HDFS Write: 342789525 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    Time taken: 481.311 seconds
    
    create via Tez
    hive> set hive.execution.engine=tez;
    hive> CREATE TABLE gbif.gbif_0004991
    > STORED AS ORC
    > TBLPROPERTIES("orc.compress"="snappy")
    > AS SELECT * FROM gbif.gbif_0004991_ori;
    Query ID = wanghongmeng_20180224192800_111872d9-059b-4a8a-9fd7-e3ea02af8898
    Total jobs = 1
    Launching Job 1 out of 1
    Tez session was closed. Reopening...
    Session re-established.
    Status: Running (Executing on YARN cluster with App id application_1519462946874_0013)
    
    ----------------------------------------------------------------------------------------------
    VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED 
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 
    ----------------------------------------------------------------------------------------------
    VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 241.12 s 
    ----------------------------------------------------------------------------------------------
    Moving data to directory hdfs://localhost:9000/user/hive/warehouse/gbif.db/gbif_0004991
    OK
    gbif_0004991_ori.gbifid gbif_0004991_ori.datasetkey gbif_0004991_ori.occurrenceid gbif_0004991_ori.kingdom gbif_0004991_ori.phylum gbif_0004991_ori.class gbif_0004991_ori.orders gbif_0004991_ori.family gbif_0004991_ori.genus gbif_0004991_ori.species gbif_0004991_ori.infraspecificepithet gbif_0004991_ori.taxonrank gbif_0004991_ori.scientificname gbif_0004991_ori.countrycode gbif_0004991_ori.locality gbif_0004991_ori.publishingorgkey gbif_0004991_ori.decimallatitude gbif_0004991_ori.decimallongitude gbif_0004991_ori.coordinateuncertaintyinmeters gbif_0004991_ori.coordinateprecision gbif_0004991_ori.elevation gbif_0004991_ori.elevationaccuracy gbif_0004991_ori.depth gbif_0004991_ori.depthaccuracy gbif_0004991_ori.eventdate gbif_0004991_ori.day gbif_0004991_ori.month gbif_0004991_ori.year gbif_0004991_ori.taxonkey gbif_0004991_ori.specieskey gbif_0004991_ori.basisofrecord gbif_0004991_ori.institutioncode gbif_0004991_ori.collectioncode gbif_0004991_ori.catalognumber gbif_0004991_ori.recordnumber gbif_0004991_ori.identifiedby gbif_0004991_ori.license gbif_0004991_ori.rightsholder gbif_0004991_ori.recordedby gbif_0004991_ori.typestatus gbif_0004991_ori.establishmentmeans gbif_0004991_ori.lastinterpreted gbif_0004991_ori.mediatype gbif_0004991_ori.issue
    Time taken: 252.548 seconds
    
    query via MR
    hive> set hive.execution.engine=mr;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    hive> select count(*) from gbif_0004991 where mediatype = 'STILLIMAGE';
    Query ID = wanghongmeng_20180224192630_b2934027-2423-4945-864b-6ce663e676fa
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
    Starting Job = job_1519462946874_0012, Tracking URL = http://localhost:8088/proxy/application_1519462946874_0012/
    Kill Command = /usr/local/Cellar/hadoop/2.8.2/bin/hadoop job -kill job_1519462946874_0012
    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
    2018-02-24 19:26:43,988 Stage-1 map = 0%, reduce = 0%
    2018-02-24 19:27:00,086 Stage-1 map = 50%, reduce = 0%
    2018-02-24 19:27:03,287 Stage-1 map = 74%, reduce = 0%
    2018-02-24 19:27:05,422 Stage-1 map = 100%, reduce = 0%
    2018-02-24 19:27:08,595 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_1519462946874_0012
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 2 Reduce: 1 HDFS Read: 602777 HDFS Write: 106 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    total
    374998
    Time taken: 38.903 seconds, Fetched: 1 row(s)
    
    query via Tez
    hive> set hive.execution.engine=tez;
    hive> select count(*) from gbif_0004991 where mediatype = 'STILLIMAGE';
    Query ID = wanghongmeng_20180224193241_f4edd363-fdb8-4461-b687-4b775e8719c0
    Total jobs = 1
    Launching Job 1 out of 1
    Status: Running (Executing on YARN cluster with App id application_1519462946874_0013)
    
    ----------------------------------------------------------------------------------------------
    VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED 
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container SUCCEEDED 2 2 0 0 0 0 
    Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 
    ----------------------------------------------------------------------------------------------
    VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 16.54 s 
    ----------------------------------------------------------------------------------------------
    OK
    total
    374998
    Time taken: 17.258 seconds, Fetched: 1 row(s)
    
    3.summary

    Table         Total Count   Create Table                 Query
    gbif_0004998  327,316       MR 28.28s   / Tez 16.631s    MR 28.529s / Tez 6.438s
    gbif_0004991  6,914,665     MR 481.311s / Tez 252.548s   MR 38.903s / Tez 17.258s
     
  • Wang 20:34 on 2018-02-24

    Conflicting jars of Hadoop and Tez 

    After I installed Tez, hive jobs ran fine via Tez, but when I switched the engine back to MR, I got the error below:

    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = wanghongmeng_20180224185414_623cf20b-77d4-4a09-a17d-41c72ed76ac3
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
    FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. DEFAULT_MR_AM_ADMIN_USER_ENV
    

    I couldn't find any useful information in the logs. After a long investigation, I found that hadoop-mapreduce-client-common-2.7.0.jar and hadoop-mapreduce-client-core-2.7.0.jar under the Tez library directory conflicted with my installed hadoop version (2.8.2), so I removed the two jars.
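
    A minimal sketch of the cleanup, assuming the Tez jars live under $TEZ_JARS/lib (the path is an assumption; adjust it to wherever Tez is unpacked):

    # find the bundled MapReduce client jars that shadow the installed hadoop 2.8.2 ones
    ls ${TEZ_JARS}/lib | grep 'hadoop-mapreduce-client.*2\.7\.0'
    # remove them so the jars from the local hadoop installation are used instead
    rm ${TEZ_JARS}/lib/hadoop-mapreduce-client-common-2.7.0.jar
    rm ${TEZ_JARS}/lib/hadoop-mapreduce-client-core-2.7.0.jar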

    After doing this, I could run hive on MR successfully. 😀

     
  • Wang 19:51 on 2018-02-24
    Tags: Tomcat

    Replace MR with Tez on hive2 

    Since hive2, Hive-on-MR is not recommended; you can see the warning when running the hive cli:

    Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    

    So I installed Tez to replace MR for running jobs. Below are the installation steps.

    1.install Tez

    1.1.download Tez and unpack it

    wget http://ftp.jaist.ac.jp/pub/apache/tez/0.9.0/apache-tez-0.9.0-src.tar.gz
    tar -zvxf apache-tez-0.9.0-src.tar.gz && cd apache-tez-0.9.0-src
    

    1.2.compile and build the Tez jars; you need to install protobuf and maven before compiling

    mvn clean package -DskipTests=true -Dmaven.javadoc.skip=true
    

    1.3.upload Tez to hdfs

    hadoop fs -mkdir /apps
    hadoop fs -copyFromLocal tez-dist/target/tez-0.9.0.tar.gz /apps/
    

    1.4.create tez-site.xml under hadoop conf directory

    cat <<'EOF' > $HADOOP_CONF_DIR/tez-site.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>tez.lib.uris</name>
            <value>${fs.defaultFS}/apps/tez-0.9.0.tar.gz</value>
        </property>
        <property>
            <name>tez.history.logging.service.class</name>
            <value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value>
        </property>
        <property>
            <name>tez.tez-ui.history-url.base</name>
            <value>http://localhost:8080/tez-ui/</value>
        </property>
    </configuration>
    EOF
    

    1.5.append configurations to yarn-site.xml

    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.timeline-service.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.timeline-service.generic-application-history.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.timeline-service.http-cross-origin.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.timeline-service.hostname</name>
        <value>localhost</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.cross-origin.enabled</name>
        <value>true</value>
    </property>
    <property>  
        <name>yarn.resourcemanager.address</name>  
        <value>localhost:8032</value>  
    </property>  
    <property>  
        <name>yarn.resourcemanager.scheduler.address</name>  
        <value>localhost:8030</value>  
    </property>  
    <property>  
        <name>yarn.resourcemanager.resource-tracker.address</name>  
        <value>localhost:8031</value>  
    </property>
    

    1.6.append configuration to core-site.xml

    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>  
        <value>/data/hadoop/hdfs/tmp</value>
    </property>
    <property>
        <name>hadoop.http.filter.initializers</name>
        <value>org.apache.hadoop.security.HttpCrossOriginFilterInitializer</value>
    </property>
    

    1.7.unpack tez-dist/target/tez-0.9.0-minimal.tar.gz
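
    A minimal sketch of this step (the target directory /usr/local/tez is my assumption; any local directory works as long as TEZ_JARS in step 1.8 points to it):

    mkdir -p /usr/local/tez
    tar -zxf tez-dist/target/tez-0.9.0-minimal.tar.gz -C /usr/local/tez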

    1.8.append env to /etc/profile

    export TEZ_CONF_DIR="location of tez-site.xml"
    export TEZ_JARS="location of unpackaged tez-0.9.0-minimal.tar.gz"
    export HADOOP_CLASSPATH=${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*
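
    After appending these exports, reload the environment in the current shell (or open a new one) so hadoop and hive pick up the Tez jars:

    source /etc/profile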
    

    1.9.start timelineserver

    yarn-daemon.sh start timelineserver
    

    1.10.configure the tez ui: install tomcat, unpack tez-ui/target/tez-ui-0.9.0.war into tomcat's webapps directory, and rename the unpacked directory to tez-ui
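
    A minimal sketch of the deployment, assuming tomcat is installed under /usr/local/tomcat and the commands run from the apache-tez-0.9.0-src build directory (both paths are assumptions):

    mkdir -p /usr/local/tomcat/webapps/tez-ui
    unzip tez-ui/target/tez-ui-0.9.0.war -d /usr/local/tomcat/webapps/tez-ui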

    1.11.start tomcat, visit http://localhost:8080/tez-ui to test

    2.test Tez

    2.1.change job engine to Tez

    hive> set hive.execution.engine=tez;
    

    2.2.run job to test

    hive> select count(*) from gbif_0004998;
    Query ID = wanghongmeng_20180224180801_e5ddcf23-1e1a-4724-8156-1393807c2ac0
    Total jobs = 1
    Launching Job 1 out of 1
    Status: Running (Executing on YARN cluster with App id application_1519462946874_0003)
    
    ----------------------------------------------------------------------------------------------
    VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED 
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 
    Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 
    ----------------------------------------------------------------------------------------------
    VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 9.87 s 
    ----------------------------------------------------------------------------------------------
    OK
    327316
    Time taken: 23.876 seconds, Fetched: 1 row(s)
    

    2.3.check result on tez ui

     
  • Wang 22:13 on 2018-02-21

    Manage BDP by ambari 

    It is tedious and complicated to manage bigdata platforms; so much software needs to be installed and coordinated to work well together, so I tried ambari to manage it all.

    1.run centos7 container

    docker run -dit --name centos7 --privileged --publish 8080:8080 centos:7 /usr/sbin/init
    

    2.operate container

    2.1.enter container

    docker exec -it centos7 bash
    

    2.2.update yum and install tools

    yum update -y && yum install -y wget
    

    2.3.download the ambari repository

    wget -nv http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.6.0.0/ambari.repo -O /etc/yum.repos.d/ambari.repo
    

    2.4.install the ambari server and agent

    yum install -y ambari-server
    yum install -y ambari-agent
    

    2.5.install mysql as the metastore; create the mysql repo under /etc/yum.repos.d

    cat << 'EOF' >/etc/yum.repos.d/mysql.5.7.repo
    [mysql57-community]
    name=MySQL 5.7 Community Server
    baseurl=http://repo.mysql.com/yum/mysql-5.7-community/el/7/$basearch/
    enabled=1
    gpgcheck=0
    EOF
    

    2.6.install mysql server

    yum install -y mysql-community-server
    

    2.7.start mysql

    systemctl start mysqld
    

    2.8.create mysql user && init database

    mysql_password=ambari
    mysql_default_password=`grep 'temporary password' /var/log/mysqld.log | awk -F ': ' '{print $2}'`
    mysql -u root -p${mysql_default_password} -e "set global validate_password_policy=0; set global validate_password_length=4;" --connect-expired-password
    mysqladmin -u root -p${mysql_default_password} password ${mysql_password}
    mysql -u root -p${mysql_password} -e "create database ambari default charset 'utf8'; flush privileges;"
    mysql -u root -p${mysql_password} -e "grant all privileges on ambari.* to ambari@'' identified by 'ambari'; flush privileges;"
    mysql -u root -p${mysql_password} -e "use ambari; source /var/lib/ambari-server/resources/Ambari-DDL-MySQL-CREATE.sql;"
    

    2.9.download mysql driver

    driver_path=/usr/share/java
    mkdir -p ${driver_path}
    wget http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.45/mysql-connector-java-5.1.45.jar -O ${driver_path}/mysql-connector.jar
    

    2.10.set up the ambari server; pay attention to the database configuration, you need to select mysql manually

    ambari-server setup
    

    2.11.modify ambari database configuration

    echo "server.jdbc.driver.path=${driver_path}/mysql-connector.jar" >> /etc/ambari-server/conf/ambari.properties
    

    2.12.start ambari

    ambari-server start
    ambari-agent start
    ambari-server setup --jdbc-db=mysql --jdbc-driver=${driver_path}/mysql-connector.jar
    

    3.login, default account: admin/admin
    http://localhost:8080


    P.S.

    The above steps configure a single server. If you want to build a cluster with several servers, you also need to configure ssh keys (please google the specific steps, it is simple) and start ambari-agent on the slave servers.
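
    A minimal sketch of the extra per-slave work, assuming the master can already resolve the slave hostnames (the hostnames and paths are assumptions):

    # on the master: create a key and push it to each slave for passwordless ssh
    ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
    ssh-copy-id root@slave1
    # on each slave: install the agent, point it at the ambari server host, and start it
    yum install -y ambari-agent
    sed -i 's/hostname=localhost/hostname=master/' /etc/ambari-agent/conf/ambari-agent.ini
    ambari-agent start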


    Below are screenshots of a mini cluster built with 4 servers:

     
  • Wang 23:30 on 2018-02-16
    Tags: Movie

    meet joe black..

     
  • Wang 23:19 on 2018-02-13

    Delete Hive Meta Error 

    I got this error when I dropped a hive table; it occurred while the hive metastore cascaded the delete in mysql:

    com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; 
    check the manual that corresponds to your MySQL server version for the right syntax to use near 'OPTION SQL_SELECT_LIMIT=DEFAULT' at line 1
    • Mysql Version: 5.6
    • Hive Version: 2.3.1
    • Mysql Connector Version: 5.1.26

    After searching, I found it is a bug in the mysql connector: https://bugs.mysql.com/bug.php?id=66659.

    So I changed the mysql connector to 5.1.39, and the problem was solved. 😀
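
    A minimal sketch of swapping the connector, assuming hive loads the driver from $HIVE_HOME/lib (the paths and download URL are assumptions; use whatever location your hive installation reads the jdbc driver from):

    cd $HIVE_HOME/lib
    rm mysql-connector-java-5.1.26.jar
    wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.39/mysql-connector-java-5.1.39.jar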

     
  • Wang 18:23 on 2018-02-10
    Tags: Chrome, Web

    not good..

     
  • Wang 22:15 on 2018-02-07

    Use GBIF’s dataset to do analysis 

    GBIF, the Global Biodiversity Information Facility, contains a huge amount of data; I think it is a good source of sample data for analysis.

    Please follow the website's instructions to download a sample dataset.

    After downloading, I imported the dataset into hive; below are the steps.

    1.create hdfs path

    hdfs dfs -mkdir -p /user/hive/gbif/0004998
    

    2.upload the dataset into the hdfs directory created in step 1

    hdfs dfs -copyFromLocal /Users/wanghongmeng/Desktop/0004998-180131172636756.csv /user/hive/gbif/0004998
    

    3.create hive table and load dataset

    CREATE EXTERNAL TABLE gbif_0004998_ori (
    gbifid string,
    datasetkey string,
    occurrenceid string,
    kingdom string,
    ...
    ...
    establishmentmeans string,
    lastinterpreted string,
    mediatype string,
    issue string)
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY '\t'
    STORED as TEXTFILE
    LOCATION '/user/hive/gbif/0004998'
    tblproperties ('skip.header.line.count'='1');
    

    4.create a new hive table with snappy compression, then drop the origin table

    CREATE TABLE gbif.gbif_0004998
    STORED AS ORC
    TBLPROPERTIES("orc.compress"="snappy")
    AS SELECT * FROM gbif.gbif_0004998_ori;
    
    drop table gbif.gbif_0004998_ori;
    

    5.check the hive table's information

    hive> desc formatted gbif_0004998;
    OK
    # col_name data_type comment 
    
    gbifid string 
    datasetkey string 
    occurrenceid string 
    kingdom string 
    phylum string 
    ...
    ...
    # Detailed Table Information 
    Database: gbif 
    Owner: wanghongmeng 
    CreateTime: Wed Feb 7 21:28:25 JST 2018 
    LastAccessTime: UNKNOWN 
    Retention: 0 
    Location: hdfs://localhost:9000/user/hive/warehouse/gbif.db/gbif_0004998 
    Table Type: MANAGED_TABLE 
    Table Parameters: 
    COLUMN_STATS_ACCURATE {"BASIC_STATS":"true"}
    numFiles 1 
    numRows 327316 
    orc.compress snappy 
    rawDataSize 1319738112 
    totalSize 13510344 
    transient_lastDdlTime 1519457306 
    
    # Storage Information 
    SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde 
    InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat 
    OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat 
    Compressed: No 
    Num Buckets: -1 
    Bucket Columns: [] 
    Sort Columns: [] 
    Storage Desc Params: 
    serialization.format 1 
    Time taken: 0.078 seconds, Fetched: 74 row(s)
    

    6.check data

    hive> select * from gbif.gbif_0004998 limit 5;
    OK
    1633594438 8130e5c6-f762-11e1-a439-00145eb45e9a KINGDOM incertae sedis EE Põhja-Kiviõli opencast mine 70488160-b003-11d8-a8af-b8a03c50a862 59.366475 26.8873 1000.0 2010-04-30T02:00Z 30 4 2010 0 FOSSIL_SPECIMEN Institute of Geology at TUT GIT 343-200 Toom CC_BY_NC_4_0 Toom 2018-02-02T20:24Z STILLIMAGE GEODETIC_DATUM_ASSUMED_WGS84;TAXON_MATCH_NONE
    1633594440 8130e5c6-f762-11e1-a439-00145eb45e9a KINGDOM incertae sedis EE Neitla Quarry 70488160-b003-11d8-a8af-b8a03c50a862 59.102247 25.762486 10.0 2012-09-12T02:00Z 12 9 2012 0 FOSSIL_SPECIMEN Institute of Geology at TUT GIT 362-272 CC_BY_NC_4_0 Toom 2018-02-02T20:24Z STILLIMAGE GEODETIC_DATUM_ASSUMED_WGS84;TAXON_MATCH_NONE
    1633594442 8130e5c6-f762-11e1-a439-00145eb45e9a KINGDOM incertae sedis EE Päri quarry 70488160-b003-11d8-a8af-b8a03c50a862 58.840459 24.042791 10.0 2014-05-23T02:00Z 23 5 2014 0 FOSSIL_SPECIMEN Institute of Geology at TUT GIT 340-303 Toom CC_BY_NC_4_0 Hints, O. 2018-02-02T20:24Z STILLIMAGE GEODETIC_DATUM_ASSUMED_WGS84;TAXON_MATCH_NONE
    1633594445 8130e5c6-f762-11e1-a439-00145eb45e9a KINGDOM incertae sedis EE Saxby shore 70488160-b003-11d8-a8af-b8a03c50a862 59.027778 23.117222 10.0 2017-06-17T02:00Z 17 6 2017 0 FOSSIL_SPECIMEN Institute of Geology at TUT GIT 362-544 Toom CC_BY_NC_4_0 Toom 2018-02-02T20:24Z STILLIMAGE GEODETIC_DATUM_ASSUMED_WGS84;TAXON_MATCH_NONE
    1633594446 8130e5c6-f762-11e1-a439-00145eb45e9a KINGDOM incertae sedis EE Saxby shore 70488160-b003-11d8-a8af-b8a03c50a862 59.027778 23.117222 10.0 2017-06-17T02:00Z 17 6 2017 0 FOSSIL_SPECIMEN Institute of Geology at TUT GIT 362-570 CC_BY_NC_4_0 Baranov 2018-02-02T20:24Z GEODETIC_DATUM_ASSUMED_WGS84;TAXON_MATCH_NONE
    Time taken: 0.172 seconds, Fetched: 5 row(s)
    
     
  • Wang 17:19 on 2018-02-04
    Tags: Music

    Swallowing wind, kissing rain, burying the setting sun, never once wavering; defying mountains, chasing the sea, treading snowy paths, never despairing; holding a flower, raising the wine, yet undone by the world's mad passions; with these two eyes, a hundred arms or a thousand hands there is no guarding against it; the sky so wide, the snow so endless, with whom shall I sail; the rolling sand, the rippling water, laughing and drifting free; one moment of stolen joy condemns a girl's long love to burial
     