-
Wang
Use Ranger to protect your Hadoop ecosystem
Apache Ranger
Apache Ranger™ is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
The vision with Ranger is to provide comprehensive security across the Apache Hadoop ecosystem. With the advent of Apache YARN, the Hadoop platform can now support a true data lake architecture. Enterprises can potentially run multiple workloads in a multi-tenant environment. Data security within Hadoop needs to evolve to support multiple use cases for data access, while also providing a framework for central administration of security policies and monitoring of user access.
-
Wang
[Presto] Secure with LDAP
For security reasons we decided to enable LDAP in Presto. To deploy Presto into a Kubernetes cluster we build the Presto image ourselves, and the image includes the Kerberos authentication and LDAP configurations.
As you can see from the image structure, the configurations under etc and catalog are very important, so please pay attention to them.
krb5.conf and xxx.keytab are used to connect to Kerberos.
password-authenticator.properties and ldap_server.pem under etc, plus hive.properties and hive-security.json under catalog, are used to connect to LDAP.
password-authenticator.properties
password-authenticator.name=ldap
ldap.url=ldaps://<IP>:<PORT>
ldap.user-bind-pattern=xxxxxx
ldap.user-base-dn=xxxxxx
hive.properties
connector.name=hive-hadoop2
hive.security=file
security.config-file=<hive-security.json>
hive.metastore.authentication.type=KERBEROS
hive.metastore.uri=thrift://<IP>:<PORT>
hive.metastore.service.principal=<SERVER-PRINCIPAL>
hive.metastore.client.principal=<CLIENT-PRINCIPAL>
hive.metastore.client.keytab=<KEYTAB>
hive.config.resources=core-site.xml, hdfs-site.xml
hive-security.json
{ "schemas": [{ "user": "user_1", "schema": "db_1", "owner": false }, { "user": " ", "schema": "db_1", "owner": false }, { "user": "user_2", "schema": "db_2", "owner": false }], "tables": [{ "user": "user_1", "schema": "db_1", "table": "table_1", "privileges": ["SELECT"] }, { "user": "user_1", "schema": "db_1", "table": "table_2", "privileges": ["SELECT"] }, { "user": "user_2", "schema": "db_1", "table": ".*", "privileges": ["SELECT"] }, { "user": "user_2", "schema": "db_2", "table": "table_1", "privileges": ["SELECT"] }, { "user": "user_2", "schema": "db_2", "table": "table_2", "privileges": ["SELECT"] }], "sessionProperties": [{ "allow": false }] }
-
Wang
[Sqoop2] Notebook
Recently I tested Sqoop2, which has many new features compared to Sqoop1. For the comparison you can check here and Stack Overflow; below I will walk through the operation manual.
1.install
wget http://ftp.jaist.ac.jp/pub/apache/sqoop/1.99.7/sqoop-1.99.7-bin-hadoop200.tar.gz
tar -vzxf sqoop-1.99.7-bin-hadoop200.tar.gz && cd sqoop-1.99.7-bin-hadoop200
2.replace the @LOGDIR@ and @BASEDIR@ placeholders in conf/sqoop.properties
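For example, the placeholders can be replaced in one shot (a sketch; the log and base directories below are arbitrary choices):
sed -i 's|@LOGDIR@|/var/log/sqoop2|g; s|@BASEDIR@|/opt/sqoop2|g' conf/sqoop.properties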
3.download mysql driver into server/lib
4.configure proxy user in core-site.xml
<property> <name>hadoop.proxyuser.sqoop2.hosts</name> <value>*</value> </property>
<property> <name>hadoop.proxyuser.sqoop2.groups</name> <value>*</value> </property>
5.verify & start sqoop2 server
bin/sqoop2-tool verify
bin/sqoop2-server start
6.start client & test
bin/sqoop2-shell
7.show the connectors
sqoop:000> show connector
+------------------------+---------+------------------------------------------------------------+----------------------+
| Name                   | Version | Class                                                      | Supported Directions |
+------------------------+---------+------------------------------------------------------------+----------------------+
| generic-jdbc-connector | 1.99.7  | org.apache.sqoop.connector.jdbc.GenericJdbcConnector       | FROM/TO              |
| kite-connector         | 1.99.7  | org.apache.sqoop.connector.kite.KiteConnector              | FROM/TO              |
| oracle-jdbc-connector  | 1.99.7  | org.apache.sqoop.connector.jdbc.oracle.OracleJdbcConnector | FROM/TO              |
| ftp-connector          | 1.99.7  | org.apache.sqoop.connector.ftp.FtpConnector                | TO                   |
| hdfs-connector         | 1.99.7  | org.apache.sqoop.connector.hdfs.HdfsConnector              | FROM/TO              |
| kafka-connector        | 1.99.7  | org.apache.sqoop.connector.kafka.KafkaConnector            | TO                   |
| sftp-connector         | 1.99.7  | org.apache.sqoop.connector.sftp.SftpConnector              | TO                   |
+------------------------+---------+------------------------------------------------------------+----------------------+
8.create links & show links
sqoop:000> create link -connector generic-jdbc-connector
sqoop:000> create link -connector hdfs-connector
sqoop:000> show link
+-------------+------------------------+---------+
| Name        | Connector Name         | Enabled |
+-------------+------------------------+---------+
| mysql-local | generic-jdbc-connector | true    |
| hdfs-local  | hdfs-connector         | true    |
+-------------+------------------------+---------+
sqoop:000> show link --all
2 link(s) to show:
link with name mysql-local (Enabled: true, Created by hongmeng.wang at 3/1/18 10:56 AM, Updated by hongmeng.wang at 3/1/18 12:51 PM)
  Using Connector generic-jdbc-connector with name {1}
  Database connection
    Driver class: com.mysql.jdbc.Driver
    Connection String: jdbc:mysql://localhost:3306
    Username: root
    Password:
    Fetch Size: 100
    Connection Properties:
      protocol = tcp
      useUnicode = true
      characterEncoding = utf-8
      autoReconnect = true
  SQL Dialect
    Identifier enclose: (blank, if use default, will get error)
link with name hdfs-local (Enabled: true, Created by hongmeng.wang at 3/1/18 10:58 AM, Updated by hongmeng.wang at 3/1/18 12:54 PM)
  Using Connector hdfs-connector with name {1}
  HDFS cluster
    URI: hdfs://localhost:9000
    Conf directory: /usr/local/Cellar/hadoop/2.8.2/libexec/etc/hadoop
    Additional configs::
9.create job & show job
sqoop:000> create job -f "mysql-local" -t "hdfs-local"
sqoop:000> show job
+----+-----------------+--------------------------------------+-----------------------------+---------+
| Id | Name            | From Connector                       | To Connector                | Enabled |
+----+-----------------+--------------------------------------+-----------------------------+---------+
| 1  | mysql-2-hdfs-t1 | mysql-local (generic-jdbc-connector) | hdfs-local (hdfs-connector) | true    |
+----+-----------------+--------------------------------------+-----------------------------+---------+
sqoop:000> show job --all
1 job(s) to show:
Job with name mysql-2-hdfs-segment (Enabled: true, Created by hongmeng.wang at 3/1/18 11:06 AM, Updated by hongmeng.wang at 3/1/18 11:39 AM)
  Throttling resources
    Extractors:
    Loaders:
  Classpath configuration
    Extra mapper jars:
From link: mysql-local
  Database source
    Schema name: test
    Table name: t1
    SQL statement:
    Column names:
    Partition column: id
    Partition column nullable:
    Boundary query:
  Incremental read
    Check column:
    Last value:
To link: hdfs-local
  Target configuration
    Override null value: true
    Null value:
    File format: TEXT_FILE
    Compression codec: NONE
    Custom codec:
    Output directory: /sqoop/mysql/test
    Append mode:
10.start job & check job’s status
sqoop:000> start job -name mysql-2-hdfs-segment
Submission details
Job Name: mysql-2-hdfs-segment
Server URL: http://localhost:12000/sqoop/
Created by: sqoop2
Creation date: 2018-03-01 13:53:37 JST
Lastly updated by: sqoop2
External ID: job_1519869491258_0001
        http://localhost:8088/proxy/application_1519869491258_0001/
2018-03-01 13:53:37 JST: BOOTING  - Progress is not available
sqoop:000> status job -n mysql-2-hdfs-segment
Submission details
Job Name: mysql-2-hdfs-segment
Server URL: http://localhost:12000/sqoop/
Created by: sqoop2
Creation date: 2018-03-01 14:01:54 JST
Lastly updated by: sqoop2
External ID: job_1519869491258_0002
        http://localhost:8088/proxy/application_1519869491258_0002/
2018-03-01 14:03:31 JST: BOOTING  - 0.00 %
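If a run needs to be cancelled, the job can also be stopped by name from the same shell (a sketch based on the 1.99.7 shell commands):
sqoop:000> stop job -n mysql-2-hdfs-segment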
Issues
1.modify org.apache.sqoop.submission.engine.mapreduce.configuration.directory in conf/sqoop.properties to point at your Hadoop configuration directory if you get the error below when executing bin/sqoop2-tool verify
Exception in thread "main" java.lang.RuntimeException: Failure in server initialization
    at org.apache.sqoop.core.SqoopServer.initialize(SqoopServer.java:68)
    at org.apache.sqoop.server.SqoopJettyServer.<init>(SqoopJettyServer.java:67)
    at org.apache.sqoop.server.SqoopJettyServer.main(SqoopJettyServer.java:177)
Caused by: org.apache.sqoop.common.SqoopException: MAPREDUCE_0002:Failure on submission engine initialization - Invalid Hadoop configuration directory (not a directory or permission issues): /etc/hadoop/conf/
    at org.apache.sqoop.submission.mapreduce.MapreduceSubmissionEngine.initialize(MapreduceSubmissionEngine.java:97)
    at org.apache.sqoop.driver.JobManager.initialize(JobManager.java:257)
    at org.apache.sqoop.core.SqoopServer.initialize(SqoopServer.java:64)
    ... 2 more
2.check $CLASSPATH and $HADOOP_CLASSPATH; some jars may conflict if you get the error below:
Caused by: java.lang.SecurityException: sealing violation: package org.apache.derby.impl.services.locks is sealed
    at java.net.URLClassLoader.getAndVerifyPackage(URLClassLoader.java:399)
    at java.net.URLClassLoader.definePackageInternal(URLClassLoader.java:419)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:451)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at org.apache.derby.impl.services.monitor.BaseMonitor.getImplementations(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.getDefaultImplementations(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.runWithState(Unknown Source)
    at org.apache.derby.impl.services.monitor.FileMonitor.<init>(Unknown Source)
    at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown Source)
    at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source)
    at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source)
    at org.apache.derby.jdbc.EmbeddedDriver.<clinit>(Unknown Source)
    ... 11 more
-
Wang
[Yarn] Configure queue and capacity
Modify capacity-scheduler.xml under $HADOOP_CONF_DIR. I configured three queues: default, business and platform.
<configuration>
  <property> <name>yarn.scheduler.capacity.maximum-applications</name> <value>10000</value> </property>
  <property> <name>yarn.scheduler.capacity.maximum-am-resource-percent</name> <value>0.1</value> </property>
  <property> <name>yarn.scheduler.capacity.resource-calculator</name> <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value> </property>
  <property> <name>yarn.scheduler.capacity.root.queues</name> <value>default,business,platform</value> </property>
  <property> <name>yarn.scheduler.capacity.root.default.capacity</name> <value>50</value> </property>
  <property> <name>yarn.scheduler.capacity.root.default.user-limit-factor</name> <value>1</value> </property>
  <property> <name>yarn.scheduler.capacity.root.default.maximum-capacity</name> <value>50</value> </property>
  <property> <name>yarn.scheduler.capacity.root.default.state</name> <value>RUNNING</value> </property>
  <property> <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name> <value>*</value> </property>
  <property> <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name> <value>*</value> </property>
  <property> <name>yarn.scheduler.capacity.root.business.capacity</name> <value>30</value> </property>
  <property> <name>yarn.scheduler.capacity.root.business.user-limit-factor</name> <value>1</value> </property>
  <property> <name>yarn.scheduler.capacity.root.business.maximum-capacity</name> <value>30</value> </property>
  <property> <name>yarn.scheduler.capacity.root.business.state</name> <value>RUNNING</value> </property>
  <property> <name>yarn.scheduler.capacity.root.business.acl_submit_applications</name> <value>*</value> </property>
  <property> <name>yarn.scheduler.capacity.root.business.acl_administer_queue</name> <value>*</value> </property>
  <property> <name>yarn.scheduler.capacity.root.platform.capacity</name> <value>20</value> </property>
  <property> <name>yarn.scheduler.capacity.root.platform.user-limit-factor</name> <value>1</value> </property>
  <property> <name>yarn.scheduler.capacity.root.platform.maximum-capacity</name> <value>20</value> </property>
  <property> <name>yarn.scheduler.capacity.root.platform.state</name> <value>RUNNING</value> </property>
  <property> <name>yarn.scheduler.capacity.root.platform.acl_submit_applications</name> <value>*</value> </property>
  <property> <name>yarn.scheduler.capacity.root.platform.acl_administer_queue</name> <value>*</value> </property>
  <property> <name>yarn.scheduler.capacity.node-locality-delay</name> <value>40</value> </property>
  <property> <name>yarn.scheduler.capacity.queue-mappings</name> <value></value> </property>
  <property> <name>yarn.scheduler.capacity.queue-mappings-override.enable</name> <value>false</value> </property>
  <property> <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments</name> <value>1</value> </property>
</configuration>
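After saving the file, the new queues can be loaded without restarting the ResourceManager, and jobs can target a queue explicitly. A sketch (the jar, class and arguments are placeholders; the -D option assumes the driver uses ToolRunner):
# reload the queue definitions
yarn rmadmin -refreshQueues
# submit a MapReduce job to the business queue
hadoop jar my-job.jar com.example.MyJob -Dmapreduce.job.queuename=business <args>
# or route Hive queries to the platform queue
hive> set mapreduce.job.queuename=platform;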
-
Wang
[Sqoop1] Interact MySQL with HDFS/Hive/HBase
install sqoop1 on mac
brew install sqoop
#if you have set env profiles, uncomment the corresponding lines in conf/sqoop-env.sh
1.MySQL -> HDFS
1.1.import table
sqoop import --connect jdbc:mysql://localhost/test --direct --username root -P --table t1 --warehouse-dir /mysql/test --fields-terminated-by ','
1.2.import schema
sqoop import-all-tables --connect jdbc:mysql://localhost/test --direct --username root -P --warehouse-dir /mysql/test --fields-terminated-by ','
2.MySQL -> Hive
2.1.import definition
sqoop create-hive-table --connect jdbc:mysql://localhost/test --table t1 --username root -P --hive-database test
2.2.import table
sqoop import --connect jdbc:mysql://localhost/test --username root -P --table t1 --hive-import --hive-database test --hive-table t1 --fields-terminated-by ','
2.3.import schema
sqoop import-all-tables --connect jdbc:mysql://localhost/test --username root -P --hive-import --hive-database test --fields-terminated-by ','
3.MySQL -> HBase
3.1.definition
sqoop import --connect jdbc:mysql://localhost/test --username root -P --table t1
3.2.import a table (the table needs to be created in HBase first)
sqoop import --connect jdbc:mysql://localhost/test --username root -P --table t1 --hbase-bulkload --hbase-table test.t1 --column-family basic --fields-terminated-by ','
3.3.import a table without creating it in HBase first, but pay attention to HBase/Sqoop version compatibility
sqoop import --connect jdbc:mysql://localhost/test --username root -P --table t1 --hbase-bulkload --hbase-create-table --hbase-table test.t1 --column-family basic --fields-terminated-by ','
4.HDFS/Hive/HBase -> MySQL
sqoop export --connect jdbc:mysql://localhost/test --username root -P --table t1 --export-dir /user/hive/warehouse/test.db/t1 --fields-terminated-by ','
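For repeated loads it is usually better not to re-import the whole table every time; Sqoop1 supports incremental imports. A sketch, assuming t1 has an auto-increment id column and the target directory is a placeholder:
sqoop import --connect jdbc:mysql://localhost/test --username root -P --table t1 \
  --incremental append --check-column id --last-value 0 \
  --target-dir /mysql/test/t1 --fields-terminated-by ','
Sqoop prints the new --last-value at the end of each run, which you pass to the next run (or let a saved sqoop job track for you).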
-
Wang
[Performance Test] MR vs Tez(2)
I tested the performance of MR vs Tez again, this time on a cluster. I created a new table which contains 28,872,974 rows; the cluster servers are listed below:
Host                             OS        Memory  CPU                  Disk  Region
master.c.ambari-195807.internal  CentOS 7  13 GB   Intel Ivy Bridge: 2  200G  asia-east1-a
slave1.c.ambari-195807.internal  CentOS 7  13 GB   Intel Ivy Bridge: 2  200G  asia-east1-a
slave2.c.ambari-195807.internal  CentOS 7  13 GB   Intel Ivy Bridge: 2  200G  asia-east1-a
slave3.c.ambari-195807.internal  CentOS 7  13 GB   Intel Ivy Bridge: 2  200G  asia-east1-a
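Both rounds run the same statements; only the execution engine differs. The switch can be done per Hive session as below (a sketch; it can also be set globally in hive-site.xml):
hive> set hive.execution.engine=mr;   -- before the MapReduce round
hive> set hive.execution.engine=tez;  -- before the Tez round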
1.MR
1.1.create table
hive> CREATE TABLE gbif.gbif_0004998
    > STORED AS ORC
    > TBLPROPERTIES("orc.compress"="snappy")
    > AS SELECT * FROM gbif.gbif_0004998_ori;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = gizmo_20180225064259_8df29800-b260-48f5-a409-80d6ea5200ad
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1519536795015_0001, Tracking URL = http://master.c.ambari-195807.internal:8088/proxy/application_1519536795015_0001/
Kill Command = /opt/apps/hadoop-2.8.3/bin/hadoop job -kill job_1519536795015_0001
Hadoop job information for Stage-1: number of mappers: 43; number of reducers: 0
2018-02-25 06:43:15,110 Stage-1 map = 0%, reduce = 0%
2018-02-25 06:44:15,419 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 231.6 sec
2018-02-25 06:44:36,386 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 380.45 sec
2018-02-25 06:44:37,810 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 386.09 sec
2018-02-25 06:44:41,695 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 422.02 sec
... ...
2018-02-25 06:47:36,112 Stage-1 map = 97%, reduce = 0%, Cumulative CPU 1388.9 sec
2018-02-25 06:47:38,185 Stage-1 map = 98%, reduce = 0%, Cumulative CPU 1392.1 sec
2018-02-25 06:47:45,434 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1402.14 sec
MapReduce Total cumulative CPU time: 23 minutes 22 seconds 140 msec
Ended Job = job_1519536795015_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master.c.ambari-195807.internal:9000/user/hive/warehouse/gbif.db/.hive-staging_hive_2018-02-25_06-42-59_672_2925216554228494176-1/-ext-10002
Moving data to directory hdfs://master.c.ambari-195807.internal:9000/user/hive/warehouse/gbif.db/gbif_0004998
MapReduce Jobs Launched:
Stage-Stage-1: Map: 43   Cumulative CPU: 1402.14 sec   HDFS Read: 11519083564 HDFS Write: 1210708016 SUCCESS
Total MapReduce CPU Time Spent: 23 minutes 22 seconds 140 msec
OK
Time taken: 288.681 seconds
1.2.query by one condition
hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = gizmo_20180225065438_d2343424-5178-4c44-8b9d-0b28f8b701fa
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1519536795015_0002, Tracking URL = http://master.c.ambari-195807.internal:8088/proxy/application_1519536795015_0002/
Kill Command = /opt/apps/hadoop-2.8.3/bin/hadoop job -kill job_1519536795015_0002
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 1
2018-02-25 06:54:50,078 Stage-1 map = 0%, reduce = 0%
2018-02-25 06:55:02,485 Stage-1 map = 40%, reduce = 0%, Cumulative CPU 21.01 sec
2018-02-25 06:55:03,544 Stage-1 map = 80%, reduce = 0%, Cumulative CPU 38.51 sec
2018-02-25 06:55:06,704 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 49.23 sec
2018-02-25 06:55:09,881 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 51.88 sec
MapReduce Total cumulative CPU time: 51 seconds 880 msec
Ended Job = job_1519536795015_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 5  Reduce: 1   Cumulative CPU: 51.88 sec   HDFS Read: 1936305 HDFS Write: 107 SUCCESS
Total MapReduce CPU Time Spent: 51 seconds 880 msec
OK
2547716
Time taken: 32.292 seconds, Fetched: 1 row(s)
1.3.query by two conditions
hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE' and year > 1900;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = gizmo_20180225081238_766d3707-7eb4-4818-860e-887c48d507ce
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1519545228015_0002, Tracking URL = http://master.c.ambari-195807.internal:8088/proxy/application_1519545228015_0002/
Kill Command = /opt/apps/hadoop-2.8.3/bin/hadoop job -kill job_1519545228015_0002
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 1
2018-02-25 08:17:31,666 Stage-1 map = 0%, reduce = 0%
2018-02-25 08:17:43,866 Stage-1 map = 20%, reduce = 0%, Cumulative CPU 10.58 sec
2018-02-25 08:17:46,045 Stage-1 map = 60%, reduce = 0%, Cumulative CPU 34.12 sec
2018-02-25 08:17:54,996 Stage-1 map = 80%, reduce = 0%, Cumulative CPU 41.73 sec
2018-02-25 08:17:57,126 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 51.37 sec
2018-02-25 08:17:58,192 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 53.72 sec
MapReduce Total cumulative CPU time: 53 seconds 720 msec
Ended Job = job_1519545228015_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 5  Reduce: 1   Cumulative CPU: 53.72 sec   HDFS Read: 8334197 HDFS Write: 107 SUCCESS
Total MapReduce CPU Time Spent: 53 seconds 720 msec
OK
2547716
Time taken: 321.138 seconds, Fetched: 1 row(s)
2.Tez
2.1.create table
hive> CREATE TABLE gbif.gbif_0004998
    > STORED AS ORC
    > TBLPROPERTIES("orc.compress"="snappy")
    > AS SELECT * FROM gbif.gbif_0004998_ori;
Query ID = gizmo_20180225075657_bae527a7-7cbd-46d9-afbf-70a5adcdee7c
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1519545228015_0001)
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........   container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 639.61 s
----------------------------------------------------------------------------------------------
Moving data to directory hdfs://master.c.ambari-195807.internal:9000/user/hive/warehouse/gbif.db/gbif_0004998
OK
Time taken: 664.817 seconds
2.2.query by one condition
hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE';
Query ID = gizmo_20180225080856_d1f13489-30b0-4045-bdeb-e3e5e085e736
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1519545228015_0001)
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........   container     SUCCEEDED      5          5        0        0       0       0
Reducer 2 ......   container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 17.91 s
----------------------------------------------------------------------------------------------
OK
2547716
Time taken: 19.255 seconds, Fetched: 1 row(s)
2.3.query by two conditions
hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE' and year > 1900;
Query ID = gizmo_20180225081200_0279f8e6-544b-4573-858b-33f48bf1fa35
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1519545228015_0001)
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........   container     SUCCEEDED      5          5        0        0       0       0
Reducer 2 ......   container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 16.96 s
----------------------------------------------------------------------------------------------
OK
2547716
Time taken: 17.635 seconds, Fetched: 1 row(s)
3.Summary
Rows: 28,872,974
Type  Create Table  Query By One Condition  Query By Two Conditions
MR    288.681s      32.292s                 321.138s
Tez   664.817s      19.255s                 17.635s

According to the results, MR is quicker than Tez on table creation but slower on queries, and as more query conditions are added MR's query performance gets worse.
Why MR is quicker than Tez on creation I currently don't know; it needs to be investigated later.
It may be related to how the output is stored: I checked the filesystem after the two creations and the results differ. MR produced many small files, while Tez produced one much bigger file (see the screenshots below and the merge sketch after them).
MR generated files
Tez generated files
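If the many small files from the MR run ever become a problem, Hive can merge them, either at write time or afterwards. A sketch of the knobs I believe apply (the sizes are arbitrary and not something I verified in this test):
hive> set hive.merge.mapfiles=true;               -- merge outputs of map-only jobs
hive> set hive.merge.mapredfiles=true;            -- merge outputs of map-reduce jobs
hive> set hive.merge.smallfiles.avgsize=128000000;
hive> set hive.merge.size.per.task=256000000;
-- or compact an existing ORC table in place
hive> ALTER TABLE gbif.gbif_0004998 CONCATENATE;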
-
Wang
[GCP] Install bigdata cluster
I applied for a Google Cloud trial which gives me $300 of credit, so I initialized 4 servers to run tests.
Servers:
Host                             OS        Memory  CPU                  Disk  Region
master.c.ambari-195807.internal  CentOS 7  13 GB   Intel Ivy Bridge: 2  200G  asia-east1-a
slave1.c.ambari-195807.internal  CentOS 7  13 GB   Intel Ivy Bridge: 2  200G  asia-east1-a
slave2.c.ambari-195807.internal  CentOS 7  13 GB   Intel Ivy Bridge: 2  200G  asia-east1-a
slave3.c.ambari-195807.internal  CentOS 7  13 GB   Intel Ivy Bridge: 2  200G  asia-east1-a
1.prepare
1.1.configure ssh keys so that the master can log in to each slave without a password (see the sketch after step 1.3)
1.2.install JDK 1.8 on each server: download it and set JAVA_HOME in the profile
1.3.configure hostnames in /etc/hosts on each server
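A sketch of steps 1.1 and 1.3 (the internal IPs below are placeholders; the user gizmo matches the scp commands later on):
# on the master: generate a key and push it to every slave
ssh-keygen -t rsa -b 4096
for host in slave1 slave2 slave3; do
  ssh-copy-id gizmo@${host}.c.ambari-195807.internal
done
# /etc/hosts on every server
10.140.0.2 master.c.ambari-195807.internal
10.140.0.3 slave1.c.ambari-195807.internal
10.140.0.4 slave2.c.ambari-195807.internal
10.140.0.5 slave3.c.ambari-195807.internal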
2.install hadoop
2.1.download hadoop 2.8.3
wget http://ftp.jaist.ac.jp/pub/apache/hadoop/common/hadoop-2.8.3/hadoop-2.8.3.tar.gz
tar -vzxf hadoop-2.8.3.tar.gz && cd hadoop-2.8.3
2.2.configure core-site.xml
<property> <name>fs.default.name</name> <value>hdfs://master.c.ambari-195807.internal:9000</value> </property>
<property> <name>hadoop.tmp.dir</name> <value>/data/hadoop/hdfs/tmp</value> </property>
<property> <name>hadoop.http.filter.initializers</name> <value>org.apache.hadoop.security.HttpCrossOriginFilterInitializer</value> </property>
2.3.configure hdfs-site.xml
<property> <name>dfs.name.dir</name> <value>/data/hadoop/dfs/name</value> </property>
<property> <name>dfs.data.dir</name> <value>/opt/hadoop/dfs/data</value> </property>
<property> <name>dfs.replication</name> <value>3</value> </property>
2.4.configure mapred-site.xml
<property> <name>mapred.job.tracker</name> <value>master.c.ambari-195807.internal:49001</value> </property>
<property> <name>mapreduce.framework.name</name> <value>yarn</value> </property>
<property> <name>mapred.local.dir</name> <value>/data/hadoop/mapred</value> </property>
<property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>2048</value> </property>
<property> <name>yarn.scheduler.maximum-allocation-mb</name> <value>4096</value> </property>
<property> <name>yarn.nodemanager.resource.memory-mb</name> <value>4096</value> </property>
<property> <name>mapreduce.map.memory.mb</name> <value>4096</value> </property>
<property> <name>mapreduce.reduce.memory.mb</name> <value>4096</value> </property>
<!-- note: the -Xmx below is larger than the 4096 MB containers above; heap sizes are normally kept below mapreduce.*.memory.mb -->
<property> <name>mapreduce.map.java.opts</name> <value>-Xmx6144m</value> </property>
<property> <name>mapreduce.reduce.java.opts</name> <value>-Xmx6144m</value> </property>
2.5.configure yarn-site.xml
<property> <name>yarn.resourcemanager.hostname</name> <value>master.c.ambari-195807.internal</value> </property>
<property> <name>yarn.resourcemanager.address</name> <value>${yarn.resourcemanager.hostname}:8032</value> </property>
<property> <name>yarn.resourcemanager.scheduler.address</name> <value>${yarn.resourcemanager.hostname}:8030</value> </property>
<property> <name>yarn.resourcemanager.webapp.address</name> <value>${yarn.resourcemanager.hostname}:8088</value> </property>
<property> <name>yarn.resourcemanager.webapp.https.address</name> <value>${yarn.resourcemanager.hostname}:8090</value> </property>
<property> <name>yarn.resourcemanager.resource-tracker.address</name> <value>${yarn.resourcemanager.hostname}:8031</value> </property>
<property> <name>yarn.resourcemanager.admin.address</name> <value>${yarn.resourcemanager.hostname}:8033</value> </property>
<property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property>
<property> <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name> <value>org.apache.hadoop.mapred.ShuffleHandler</value> </property>
<property> <name>yarn.timeline-service.enabled</name> <value>true</value> </property>
<property> <name>yarn.resourcemanager.system-metrics-publisher.enabled</name> <value>true</value> </property>
<property> <name>yarn.timeline-service.generic-application-history.enabled</name> <value>true</value> </property>
<property> <name>yarn.timeline-service.http-cross-origin.enabled</name> <value>true</value> </property>
<property> <name>yarn.timeline-service.hostname</name> <value>master.c.ambari-195807.internal</value> </property>
<property> <name>yarn.resourcemanager.webapp.cross-origin.enabled</name> <value>true</value> </property>
<property> <name>yarn.resourcemanager.address</name> <value>master.c.ambari-195807.internal:8032</value> </property>
<property> <name>yarn.resourcemanager.scheduler.address</name> <value>master.c.ambari-195807.internal:8030</value> </property>
<property> <name>yarn.resourcemanager.resource-tracker.address</name> <value>master.c.ambari-195807.internal:8031</value> </property>
2.6.set slaves
echo slave1.c.ambari-195807.internal >>slaves
echo slave2.c.ambari-195807.internal >>slaves
echo slave3.c.ambari-195807.internal >>slaves
2.7.copy hadoop from master to each slave
scp -r hadoop-2.8.3/ gizmo@slave1.c.ambari-195807.internal:/opt/apps/
scp -r hadoop-2.8.3/ gizmo@slave2.c.ambari-195807.internal:/opt/apps/
scp -r hadoop-2.8.3/ gizmo@slave3.c.ambari-195807.internal:/opt/apps/
2.8.configure hadoop env profile
echo 'export HADOOP_HOME=/opt/apps/hadoop-2.8.3' >>~/.bashrc
echo 'export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop' >>~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$JAVA_HOME/bin' >>~/.bashrc
2.9.start hdfs/yarn
start-dfs.sh
start-yarn.sh
2.10.check
hdfs, http://master.c.ambari-195807.internal:50070
yarn, http://master.c.ambari-195807.internal:8088
3.install hive
3.1.download hive 2.3.2
wget http://ftp.jaist.ac.jp/pub/apache/hive/hive-2.3.2/apache-hive-2.3.2-bin.tar.gz
tar -zvxf apache-hive-2.3.2-bin.tar.gz && cd apache-hive-2.3.2-bin
3.2.configure hive env profile
echo 'export HIVE_HOME=/opt/apps/apache-hive-2.3.2-bin' >>~/.bashrc
echo 'export PATH=$PATH:$HIVE_HOME/bin' >>~/.bashrc
3.3.install mysql to store metadata
rpm -ivh http://repo.mysql.com/mysql57-community-release-el7.rpm
yum install -y mysql-server
systemctl start mysqld
mysql_password="pa12ss34wo!@d#"
mysql_default_password=`grep 'temporary password' /var/log/mysqld.log | awk -F ': ' '{print $2}'`
mysql -u root -p${mysql_default_password} -e "set global validate_password_policy=0; set global validate_password_length=4;" --connect-expired-password
mysqladmin -u root -p${mysql_default_password} password ${mysql_password}
mysql -u root -p${mysql_password} -e "create database hive default charset 'utf8'; flush privileges;"
mysql -u root -p${mysql_password} -e "grant all privileges on hive.* to hive@'' identified by 'hive'; flush privileges;"
3.4.download mysql driver
wget http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.45/mysql-connector-java-5.1.45.jar -P $HIVE_HOME/lib
3.5.configure hive-site.xml
<configuration>
  <property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://localhost:3306/hive</value> </property>
  <property> <name>javax.jdo.option.ConnectionDriverName</name> <value>com.mysql.jdbc.Driver</value> </property>
  <property> <name>javax.jdo.option.ConnectionUserName</name> <value>hive</value> </property>
  <property> <name>javax.jdo.option.ConnectionPassword</name> <value>hive</value> </property>
</configuration>
3.6.initialize hive meta tables
schematool -dbType mysql -initSchema
3.7.test hive
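A quick smoke test (a sketch; the test database and table are throwaway names):
hive -e "create database if not exists smoke;
         create table if not exists smoke.t1 (id int);
         insert into smoke.t1 values (1);
         select count(*) from smoke.t1;"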
4.install tez
4.1.follow the instructions in "install tez on single server" on each server
5.install hbase
5.1.download hbase 1.2.6
wget http://ftp.jaist.ac.jp/pub/apache/hbase/1.2.6/hbase-1.2.6-bin.tar.gz
tar -vzxf hbase-1.2.6-bin.tar.gz && cd hbase-1.2.6
5.2.configure hbase-site.xml
<property> <name>hbase.rootdir</name> <value>hdfs://master.c.ambari-195807.internal:9000/hbase</value> </property>
<property> <name>hbase.master</name> <value>master</value> </property>
<property> <name>hbase.cluster.distributed</name> <value>true</value> </property>
<property> <name>hbase.zookeeper.property.clientPort</name> <value>2181</value> </property>
<property> <name>hbase.zookeeper.quorum</name> <value>slave1.c.ambari-195807.internal,slave2.c.ambari-195807.internal,slave3.c.ambari-195807.internal</value> </property>
<property> <name>dfs.support.append</name> <value>true</value> </property>
<property> <name>hbase.master.info.port</name> <value>60010</value> </property>
5.3.configure regionservers
echo slave1.c.ambari-195807.internal >>regionservers
echo slave2.c.ambari-195807.internal >>regionservers
echo slave3.c.ambari-195807.internal >>regionservers
5.4.copy hbase from master to each slave
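Same pattern as step 2.7 (a sketch):
scp -r hbase-1.2.6/ gizmo@slave1.c.ambari-195807.internal:/opt/apps/
scp -r hbase-1.2.6/ gizmo@slave2.c.ambari-195807.internal:/opt/apps/
scp -r hbase-1.2.6/ gizmo@slave3.c.ambari-195807.internal:/opt/apps/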
5.5.configure hbase env profile
echo 'export HBASE_HOME=/opt/apps/hbase-1.2.6' >>~/.bashrc
echo 'export PATH=$PATH:$HBASE_HOME/bin' >>~/.bashrc
5.6.start hbase
start-hbase.sh
5.7.check, http://35.194.253.162:60010
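Besides the web UI, a quick check from the shell (a sketch):
hbase shell
hbase(main):001:0> status
hbase(main):002:0> list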
Things done!
-
Wang
Import data from hive to hbase
Recently I needed to restore data from Hive to HBase. I found there is no direct way to do this with tools like Sqoop, so I converted it myself.
1.create the hbase namespace and a table containing one column family named basic
create_namespace 'gbif' create 'gbif.gbif_0004998', 'basic'
2.create an intermediate hive table which follows the hive/hbase tables' structure
CREATE EXTERNAL TABLE intermediate.hbase_gbif_0004998 (
  gbifid string, datasetkey string, occurrenceid string, kingdom string, phylum string, class string,
  orders string, family string, genus string, species string, infraspecificepithet string, taxonrank string,
  scientificname string, countrycode string, locality string, publishingorgkey string, decimallatitude string,
  decimallongitude string, coordinateuncertaintyinmeters string, coordinateprecision string, elevation string,
  elevationaccuracy string, depth string, depthaccuracy string, eventdate string, day string, month string,
  year string, taxonkey string, specieskey string, basisofrecord string, institutioncode string,
  collectioncode string, catalognumber string, recordnumber string, identifiedby string, license string,
  rightsholder string, recordedby string, typestatus string, establishmentmeans string, lastinterpreted string,
  mediatype string, issue string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,basic:datasetkey,basic:occurrenceid,basic:kingdom,basic:phylum,basic:class,basic:orders,basic:family,basic:genus,basic:species,basic:infraspecificepithet,basic:taxonrank,basic:scientificname,basic:countrycode,basic:locality,basic:publishingorgkey,basic:decimallatitude,basic:decimallongitude,basic:coordinateuncertaintyinmeters,basic:coordinateprecision,basic:elevation,basic:elevationaccuracy,basic:depth,basic:depthaccuracy,basic:eventdate,basic:day,basic:month,basic:year,basic:taxonkey,basic:specieskey,basic:basisofrecord,basic:institutioncode,basic:collectioncode,basic:catalognumber,basic:recordnumber,basic:identifiedby,basic:license,basic:rightsholder,basic:recordedby,basic:typestatus,basic:establishmentmeans,basic:lastinterpreted,basic:mediatype,basic:issue")
TBLPROPERTIES("hbase.table.name" = "gbif.gbif_0004998");
3.insert data into intermediate hive table
insert overwrite table intermediate.hbase_gbif_0004998 select * from gbif.gbif_0004998;
4.get intermediate hive table’s hdfs path
desc formatted intermediate.hbase_gbif_0004998;
5.import into hbase from hdfs (left commented out; the insert in step 3 already writes the rows through the HBase storage handler, so this bulk load was not needed)
# hbase --config config_dir org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
#   hdfs://localhost:9000/user/hive/warehouse/intermediate.db/hbase_gbif_0004998 gbif.gbif_0004998
6.check hbase’s data
count 'gbif.gbif_0004998'
... ...
Current count: 326000, row: 986217061
Current count: 327000, row: 991771339
327316 row(s) in 13.6890 seconds
=> 327316
7.get data from hbase table
hbase(main):008:0> get 'gbif.gbif_0004998', '1019778874'
COLUMN                                CELL
 basic:basisofrecord                  timestamp=1519452831179, value=LIVING_SPECIMEN
 basic:catalognumber                  timestamp=1519452831179, value=A0011
 basic:class                          timestamp=1519452831179, value=Liliopsida
 basic:collectioncode                 timestamp=1519452831179, value=Ar\xC3\xA1ceas
 basic:coordinateprecision            timestamp=1519452831179, value=
 basic:coordinateuncertaintyinmeters  timestamp=1519452831179, value=
 basic:countrycode                    timestamp=1519452831179, value=CO
 basic:datasetkey                     timestamp=1519452831179, value=fd5ae2bb-6ee6-4e5c-8428-6284fa385f9a
 basic:day                            timestamp=1519452831179, value=23
 basic:decimallatitude                timestamp=1519452831179, value=
 basic:decimallongitude               timestamp=1519452831179, value=
 basic:depth                          timestamp=1519452831179, value=
 basic:depthaccuracy                  timestamp=1519452831179, value=
 basic:elevation                      timestamp=1519452831179, value=
 basic:elevationaccuracy              timestamp=1519452831179, value=
 basic:establishmentmeans             timestamp=1519452831179, value=
 basic:eventdate                      timestamp=1519452831179, value=2007-08-23T02:00Z
 basic:family                         timestamp=1519452831179, value=Araceae
 basic:genus                          timestamp=1519452831179, value=Anthurium
 basic:identifiedby                   timestamp=1519452831179, value=
 basic:infraspecificepithet           timestamp=1519452831179, value=
 basic:institutioncode                timestamp=1519452831179, value=Corporaci\xC3\xB3n San Jorge
 basic:issue                          timestamp=1519452831179, value=
 basic:kingdom                        timestamp=1519452831179, value=Plantae
 basic:lastinterpreted                timestamp=1519452831179, value=2018-02-03T23:09Z
 basic:license                        timestamp=1519452831179, value=CC0_1_0
 basic:locality                       timestamp=1519452831179, value=
 basic:mediatype                      timestamp=1519452831179, value=
 basic:month                          timestamp=1519452831179, value=8
 basic:occurrenceid                   timestamp=1519452831179, value=JBSJ:Araceas:A0011
 basic:orders                         timestamp=1519452831179, value=Alismatales
 basic:phylum                         timestamp=1519452831179, value=Tracheophyta
 basic:publishingorgkey               timestamp=1519452831179, value=1904954c-81e7-4254-9778-ae3deed93de6
 basic:recordedby                     timestamp=1519452831179, value=Oyuela G.
 basic:recordnumber                   timestamp=1519452831179, value=
 basic:rightsholder                   timestamp=1519452831179, value=Corporaci\xC3\xB3n San Jorge
 basic:scientificname                 timestamp=1519452831179, value=Anthurium cabrerense Engl.
 basic:species                        timestamp=1519452831179, value=Anthurium cabrerense
 basic:specieskey                     timestamp=1519452831179, value=2872557
 basic:taxonkey                       timestamp=1519452831179, value=2872557
 basic:taxonrank                      timestamp=1519452831179, value=SPECIES
 basic:typestatus                     timestamp=1519452831179, value=
 basic:year                           timestamp=1519452831179, value=2007
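Counting the whole table in the shell is slow; for a quick sanity check a limited scan is enough (a sketch):
hbase(main):009:0> scan 'gbif.gbif_0004998', {LIMIT => 1}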