Use GBIF’s dataset to do analysis

GBIF, global biodiversity information facility which contains huge data, I think it’s good to do analysis with gbif’s data sample.

Please follow the web’s instruction to download the sample dataset.

After doing this, I imported the dataset into hive, below are the steps.

1.create hdfs path

hdfs dfs -mkdir -p /user/hive/gbif/0004998

2.upload dataset into hdfs’s directory which was created on step 1

hdfs dfs -copyFromLocal /Users/wanghongmeng/Desktop/0004998-180131172636756.csv /user/hive/gbif/0004998

3.create hive table and load dataset

CREATE EXTERNAL TABLE gbif_0004998_ori (
gbifid string,
datasetkey string,
occurrenceid string,
kingdom string,
...
...
establishmentmeans string,
lastinterpreted string,
mediatype string,
issue string)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY 't'
STORED as TEXTFILE
LOCATION '/user/hive/gbif/0004998'
tblproperties ('skip.header.line.count'='1');

4.create new hive table by snappy compression, then drop origin table

CREATE TABLE gbif.gbif_0004998
STORED AS ORC
TBLPROPERTIES("orc.compress"="snappy")
AS SELECT * FROM gbif.gbif_0004998_ori;

drop table gbif.gbif_0004998_ori;

5.check hive table’s infomation

hive> desc formatted gbif_0004998;
OK
# col_name data_type comment 

gbifid string 
datasetkey string 
occurrenceid string 
kingdom string 
phylum string 
...
...
# Detailed Table Information 
Database: gbif 
Owner: wanghongmeng 
CreateTime: Wed Feb 7 21:28:25 JST 2018 
LastAccessTime: UNKNOWN 
Retention: 0 
Location: hdfs://localhost:9000/user/hive/warehouse/gbif.db/gbif_0004998 
Table Type: MANAGED_TABLE 
Table Parameters: 
COLUMN_STATS_ACCURATE {"BASIC_STATS":"true"}
numFiles 1 
numRows 327316 
orc.compress snappy 
rawDataSize 1319738112 
totalSize 13510344 
transient_lastDdlTime 1519457306 

# Storage Information 
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde 
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat 
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat 
Compressed: No 
Num Buckets: -1 
Bucket Columns: [] 
Sort Columns: [] 
Storage Desc Params: 
serialization.format 1 
Time taken: 0.078 seconds, Fetched: 74 row(s)

6.check data

hive> select * from gbif.gbif_0004998 limit 5;
OK
1633594438 8130e5c6-f762-11e1-a439-00145eb45e9a KINGDOM incertae sedis EE Põhja-Kiviõli opencast mine 70488160-b003-11d8-a8af-b8a03c50a862 59.366475 26.8873 1000.0 2010-04-30T02:00Z 30 4 2010 0 FOSSIL_SPECIMEN Institute of Geology at TUT GIT 343-200 Toom CC_BY_NC_4_0 Toom 2018-02-02T20:24Z STILLIMAGE GEODETIC_DATUM_ASSUMED_WGS84;TAXON_MATCH_NONE
1633594440 8130e5c6-f762-11e1-a439-00145eb45e9a KINGDOM incertae sedis EE Neitla Quarry 70488160-b003-11d8-a8af-b8a03c50a862 59.102247 25.762486 10.0 2012-09-12T02:00Z 12 9 2012 0 FOSSIL_SPECIMEN Institute of Geology at TUT GIT 362-272 CC_BY_NC_4_0 Toom 2018-02-02T20:24Z STILLIMAGE GEODETIC_DATUM_ASSUMED_WGS84;TAXON_MATCH_NONE
1633594442 8130e5c6-f762-11e1-a439-00145eb45e9a KINGDOM incertae sedis EE Päri quarry 70488160-b003-11d8-a8af-b8a03c50a862 58.840459 24.042791 10.0 2014-05-23T02:00Z 23 5 2014 0 FOSSIL_SPECIMEN Institute of Geology at TUT GIT 340-303 Toom CC_BY_NC_4_0 Hints, O. 2018-02-02T20:24Z STILLIMAGE GEODETIC_DATUM_ASSUMED_WGS84;TAXON_MATCH_NONE
1633594445 8130e5c6-f762-11e1-a439-00145eb45e9a KINGDOM incertae sedis EE Saxby shore 70488160-b003-11d8-a8af-b8a03c50a862 59.027778 23.117222 10.0 2017-06-17T02:00Z 17 6 2017 0 FOSSIL_SPECIMEN Institute of Geology at TUT GIT 362-544 Toom CC_BY_NC_4_0 Toom 2018-02-02T20:24Z STILLIMAGE GEODETIC_DATUM_ASSUMED_WGS84;TAXON_MATCH_NONE
1633594446 8130e5c6-f762-11e1-a439-00145eb45e9a KINGDOM incertae sedis EE Saxby shore 70488160-b003-11d8-a8af-b8a03c50a862 59.027778 23.117222 10.0 2017-06-17T02:00Z 17 6 2017 0 FOSSIL_SPECIMEN Institute of Geology at TUT GIT 362-570 CC_BY_NC_4_0 Baranov 2018-02-02T20:24Z GEODETIC_DATUM_ASSUMED_WGS84;TAXON_MATCH_NONE
Time taken: 0.172 seconds, Fetched: 5 row(s)