March | 2018 | Wang's Tech Blog

Updates from March, 2018

Wang 22:32 on 2018-03-28
Tags: API ( 18 ), Java ( 8 ), Micro-Service ( 9 ), Restful ( 10 ), Spring Boot ( 10 )

[Spring Boot2] Demo

Spring Boot recently released version 2.0.0.RELEASE, so I built a small demo that covers basic CRUD operations and uploaded the code to GitHub.

There are two branches: master is the regular branch, and the docker branch also creates a Docker image when you build.

Wang 20:12 on 2018-03-25
Tags: Ambari, BigData ( 37 ), Cluster ( 30 ), Presto ( 7 ), Tez ( 6 )

[Presto] Integrate with Ambari

A few days ago I installed Presto and Ambari separately. Ambari doesn't officially support Presto, so if you want to manage Presto from Ambari you have to download ambari-presto-service and configure it yourself.

So I tried this.

1.download hdp yum repository

wget -nv http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.6.3.0/hdp.repo -O /etc/yum.repos.d/HDP.repo

2.download ambari-presto-service and configure

version=`hdp-select status hadoop-client | sed 's/hadoop-client - \([0-9]\.[0-9]\).*/\1/'`
mkdir /var/lib/ambari-server/resources/stacks/HDP/$version/services/PRESTO
wget https://github.com/prestodb/ambari-presto-service/releases/download/v1.2/ambari-presto-1.2.tar.gz
tar -xvf ambari-presto-1.2.tar.gz -C /var/lib/ambari-server/resources/stacks/HDP/$version/services/PRESTO
mv /var/lib/ambari-server/resources/stacks/HDP/$version/services/PRESTO/ambari-presto-1.2/* /var/lib/ambari-server/resources/stacks/HDP/$version/services/PRESTO
rm -rf /var/lib/ambari-server/resources/stacks/HDP/$version/services/PRESTO/ambari-presto-1.2
chmod -R +x /var/lib/ambari-server/resources/stacks/HDP/$version/services/PRESTO/*

3.restart ambari-server

ambari-server restart

4.add the Presto service in Ambari. Set discovery.uri when you add the service, e.g. discovery.uri: http://coordinator:8285

After this, you can add catalogs and use Presto as a query engine.

I ran a simple query comparison between Tez and Presto. If you want accurate benchmark results, I think this benchmark test could help. The query sums three filtered row counts from a Hive table.

Presto: 4s

presto:test> select sum(count) as sum from (
          -> select count(*) as count from t0004998 where month = '6.5'
          -> union
          -> select count(*) as count from t0004998 where typestatus in ('VL2216','VL2217','VL2218','VL2219','VL2220')
          -> union
          -> select count(*) as count from t0004998 where countrycode in ('FAMILY','FORM','GENUS','KINGDOM','ORDER','PHYLUM','SPECIES')
          -> ) t;
  sum   
--------
 307374 
(1 row)

Query 20180317_102034_00040_sq83e, FINISHED, 1 node
Splits: 29 total, 29 done (100.00%)
0:04 [982K rows, 374MB] [231K rows/s, 87.8MB/s]

Tez: 29.77s

hive> select sum(count) from (
    > select count(*) as count from t0004998 where month = "6.5"
    > union
    > select count(*) as count from t0004998 where typestatus in ("VL2216","VL2217","VL2218","VL2219","VL2220")
    > union
    > select count(*) as count from t0004998 where countrycode in ("FAMILY","FORM","GENUS","KINGDOM","ORDER","PHYLUM","SPECIES")
    > ) t;
Query ID = hdfs_20180317102109_5fd30986-f840-450e-aedd-b51c5e3a48f1
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1521267007048_0012)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Map 10 .........   SUCCEEDED      1          1        0        0       1       0
Map 8 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 11 .....   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       1
Reducer 4 ......   SUCCEEDED      1          1        0        0       0       0
Reducer 6 ......   SUCCEEDED      1          1        0        0       0       0
Reducer 7 ......   SUCCEEDED      1          1        0        0       0       0
Reducer 9 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 09/09  [==========================>>] 100%  ELAPSED TIME: 29.77 s    
--------------------------------------------------------------------------------
OK
307374
Time taken: 30.732 seconds, Fetched: 1 row(s)
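
To put the two timings side by side, here is a tiny sketch that computes the ratio. It only uses the numbers printed above (the Presto CLI shows whole seconds, so 4 is approximate), and a single run like this is not a benchmark.

```shell
# Sketch: ratio of the two elapsed times reported above.
presto_s=4        # Presto CLI reported 0:04
tez_s=29.77       # Tez ELAPSED TIME

# awk does the floating point division; one decimal is plenty here.
speedup=$(awk -v a="$tez_s" -v b="$presto_s" 'BEGIN { printf "%.1f", a / b }')
echo "Presto was about ${speedup}x faster on this query"
```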

Wang 17:55 on 2018-03-24
Tags: SEO, Status ( 23 )
emmm, not bad..

Wang 21:36 on 2018-03-20
Tags: BigData ( 37 ), Cluster ( 30 ), Hive ( 15 ), Presto ( 7 )

[Presto] Build pseudo cluster

Presto is a distributed query engine developed by Facebook. For its concepts and advantages, please refer to the official documentation. Below are the steps I used to build a pseudo cluster on my Mac.

1.download presto

wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.196/presto-server-0.196.tar.gz
tar -zvxf presto-server-0.196.tar.gz && cd presto-server-0.196

2.create configuration files

mkdir etc

cat << 'EOF' > etc/jvm.config
-server
-Xmx16G
-Xms16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
EOF

cat << 'EOF' > etc/log.properties
com.facebook.presto=INFO
EOF

cat << 'EOF' > etc/config1.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8001
query.max-memory=24GB
query.max-memory-per-node=8GB
discovery-server.enabled=true
discovery.uri=http://localhost:8001
EOF

cat << 'EOF' > etc/config2.properties
coordinator=false
node-scheduler.include-coordinator=true
http-server.http.port=8002
query.max-memory=24GB
query.max-memory-per-node=8GB
discovery-server.enabled=true
discovery.uri=http://localhost:8001
EOF

cat << 'EOF' > etc/config3.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8003
query.max-memory=24GB
query.max-memory-per-node=8GB
discovery-server.enabled=true
discovery.uri=http://localhost:8001
EOF

cat << 'EOF' > etc/node1.properties
node.environment=test
node.id=671d18f9-dd0f-412d-b18c-fe6d7989b040
node.data-dir=/usr/local/Cellar/presto/0.196/data/node1
EOF

cat << 'EOF' > etc/node2.properties
node.environment=test
node.id=e72fdd91-a135-4936-9a3e-f888c5106ed9
node.data-dir=/usr/local/Cellar/presto/0.196/data/node2
EOF

cat << 'EOF' > etc/node3.properties
node.environment=test
node.id=6ab76715-1812-4093-95cf-1945f4cfefe3
node.data-dir=/usr/local/Cellar/presto/0.196/data/node3
EOF

P.S. If you want to restrict operations, add access-control.properties as below; it permits read operations only.

cat << 'EOF' > etc/access-control.properties
access-control.name=read-only
EOF

3.start presto server

bin/launcher start --config=etc/config1.properties --node-config=etc/node1.properties
bin/launcher start --config=etc/config2.properties --node-config=etc/node2.properties
bin/launcher start --config=etc/config3.properties --node-config=etc/node3.properties

4.download cli

wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.196/presto-cli-0.196-executable.jar -O bin/presto-cli
chmod +x bin/presto-cli

5.create catalogs

mkdir -p etc/catalog

cat << 'EOF' > etc/catalog/mysql.properties
connector.name=mysql
connection-url=jdbc:mysql://localhost:3306?useSSL=false
connection-user=presto
connection-password=presto
EOF

cat << 'EOF' > etc/catalog/hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
EOF

6.connect

bin/presto-cli --server localhost:8001 --catalog hive

presto> show catalogs;
 Catalog 
---------
 hive    
 mysql   
 system  
(3 rows)

Query 20180318_045410_00013_sq83e, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

P.S. If you build a real cluster, pay attention to the items below:

1.node.id in node.properties must be unique for every node in the cluster; you can generate it with uuid/uuidgen.

2.query.max-memory-per-node in config.properties should preferably be half of -Xmx in jvm.config.
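
Here is a small sketch of the two notes above. It assumes uuidgen is on the PATH (with a Linux fallback if it is not), writes into a temporary directory instead of etc/, and takes the 16 GB heap from the jvm.config shown earlier; adjust all of these for a real cluster.

```shell
# Sketch: one node.properties per node with a fresh node.id, plus
# query.max-memory-per-node derived as half of the JVM heap.
out_dir=$(mktemp -d)      # hypothetical output directory; use etc/ in a real setup
xmx_gb=16                 # matches -Xmx16G in jvm.config

new_uuid() {
  if command -v uuidgen >/dev/null 2>&1; then
    uuidgen
  else
    cat /proc/sys/kernel/random/uuid   # Linux fallback when uuidgen is missing
  fi
}

for i in 1 2 3; do
  # uuidgen prints upper case on macOS, so normalise to lower case.
  node_id=$(new_uuid | tr '[:upper:]' '[:lower:]')
  cat > "$out_dir/node$i.properties" <<EOF
node.environment=test
node.id=$node_id
node.data-dir=/usr/local/Cellar/presto/0.196/data/node$i
EOF
done

per_node_gb=$((xmx_gb / 2))
echo "query.max-memory-per-node=${per_node_gb}GB"

# Every node.id must be distinct; count the unique ids that were written.
unique_ids=$(grep -h '^node.id=' "$out_dir"/node*.properties | sort -u | wc -l | tr -d ' ')
echo "unique node ids: $unique_ids"
```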

Wang 22:39 on 2018-03-18
Tags: ELK ( 6 ), Logstash

[ELK] Configure logstash

Recently I needed to collect some statistics, so I chose ELK for the job. Here I describe how to clean logs and send them to Elasticsearch with Logstash.

Logstash Version: 5.6.6

First, add a new configuration file named xxx.conf under the config directory with the content below. Replace “xxx” with the names from your own business.

input {
    file {
        path => "/**/xxx.log"
        codec => plain {
            charset => "UTF-8"
        }
        tags => ["xxx"]
    }
    file {
        path => "/**/xxx.log"
        codec => plain {
            charset => "UTF-8"
        }
        tags => ["xxx"]
    }
}

filter {
    if "xxx" in [tags] {
        dissect {
            mapping => {
                "message" => "%{timestamp} - [%{thread}] - [%{level}] - [%{class}] - xxx:%{xxx}, xxx:%{xxx}, xxx:%{xxx}, xxx:%{xxx}, xxx:%{xxx}, xxx:%{xxx}, xxx:%{xxx}, xxx:%{xxx}"
            }
        }
    }
    if "xxx" in [tags] {
        dissect {
            mapping => {
                "message" => "%{timestamp} - [%{thread}] - [%{level}] - [%{class}] - xxx:%{xxx}, xxx:%{xxx}, xxx:%{xxx}, xxx:%{xxx}, xxx:%{xxx}, xxx:%{xxx}, xxx:%{xxx}, xxx:%{xxx}, xxx:%{xxx}, xxx:%{xxx}, xxx:%{xxx}, xxx:%{xxx}, xxx:%{xxx}"
            }
        }
    }
}

#replace @timestamp
#filter {
#date {
#match => ["timestamp", "yyyy-MM-dd HH:mm:ss,SSS"]
#target => ["@timestamp"]
#}
#}

output {
    if "xxx" in [tags] {
        elasticsearch {
            index => "xxx"
            hosts => ["http://xxx:9200"]
        }
    }
    if "xxx" in [tags] {
        elasticsearch {
            index => "xxx"
            hosts => ["http://xxx:9200"]
        }
    }
}

Then start Logstash with this configuration file:

bin/logstash -f config/xxx.conf

After this, configure a Kibana dashboard and you will get some cool charts.

P.S.

There are many kinds of input/filter/output plugins, such as jdbc/redis/kafka/mongodb; please refer to the official documentation.

If you are familiar with the grok filter, you can also parse the logs as below:

filter {
    grok {
        match => {
            "message" => "%{TIMESTAMP_ISO8601:timestamp}%{SPACE}-%{SPACE}\[.*\]%{SPACE}-%{SPACE}xxx:%{GREEDYDATA:xxx},%{SPACE}xxx:%{GREEDYDATA:xxx},%{SPACE}xxx:%{GREEDYDATA:xxx},%{SPACE}xxx:%{GREEDYDATA:xxx},%{SPACE}xxx:%{GREEDYDATA:xxx},%{SPACE}xxx:%{GREEDYDATA:xxx},%{SPACE}xxx:%{GREEDYDATA:xxx},%{SPACE}xxx:%{GREEDYDATA:xxx}"
        }
    }
}
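
Before starting Logstash, it can help to check by hand which parts of a log line the filter should extract. The sketch below mimics the dissect mapping with plain sed; it is not Logstash itself. The sample line and the field names user and action are made up, since "xxx" in the config stands for the real ones.

```shell
# Sketch: mimic the dissect mapping with a sed regular expression.
# The log line and the field names user/action are hypothetical.
line='2018-03-18 22:39:00,123 - [main] - [INFO] - [com.example.Demo] - user:alice, action:login'

# One capture group per %{field} in the mapping:
# timestamp, thread, level, class, then the key:value pairs.
pattern='^([^ ]+ [^ ]+) - \[([^]]*)\] - \[([^]]*)\] - \[([^]]*)\] - user:([^,]*), action:(.*)$'

# field N prints the N-th captured group of $line.
field() {
  printf '%s\n' "$line" | sed -E "s/$pattern/\\$1/"
}

echo "timestamp=$(field 1)"
echo "thread=$(field 2)"
echo "level=$(field 3)"
echo "class=$(field 4)"
echo "user=$(field 5)"
echo "action=$(field 6)"
```

If a field comes out wrong here, the delimiters in the mapping most likely don't match the log format.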

Wang 20:24 on 2018-03-16
Tags: Hadoop ( 14 ), Mysql ( 7 ), Sqoop ( 2 )

[Sqoop2] Notebook

Recently I tested Sqoop2, which has many new features compared to Sqoop1. For the comparison, you could check here and stackoverflow. Below is my operation manual.

1.install

wget http://ftp.jaist.ac.jp/pub/apache/sqoop/1.99.7/sqoop-1.99.7-bin-hadoop200.tar.gz
tar -vzxf sqoop-1.99.7-bin-hadoop200.tar.gz && cd sqoop-1.99.7-bin-hadoop200

2.replace @LOGDIR@/@BASEDIR@ in sqoop.properties

3.download mysql driver into server/lib

4.configure proxy user in core-site.xml

<property>
    <name>hadoop.proxyuser.sqoop2.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.sqoop2.groups</name>
    <value>*</value>
</property>

5.verify & start sqoop2 server

bin/sqoop2-tool verify
bin/sqoop2-server start

6.start client & test

bin/sqoop2-shell

7.show the connectors

sqoop:000> show connector
+------------------------+---------+------------------------------------------------------------+----------------------+
| Name                   | Version | Class                                                      | Supported Directions |
+------------------------+---------+------------------------------------------------------------+----------------------+
| generic-jdbc-connector | 1.99.7  | org.apache.sqoop.connector.jdbc.GenericJdbcConnector       | FROM/TO              |
| kite-connector         | 1.99.7  | org.apache.sqoop.connector.kite.KiteConnector              | FROM/TO              |
| oracle-jdbc-connector  | 1.99.7  | org.apache.sqoop.connector.jdbc.oracle.OracleJdbcConnector | FROM/TO              |
| ftp-connector          | 1.99.7  | org.apache.sqoop.connector.ftp.FtpConnector                | TO                   |
| hdfs-connector         | 1.99.7  | org.apache.sqoop.connector.hdfs.HdfsConnector              | FROM/TO              |
| kafka-connector        | 1.99.7  | org.apache.sqoop.connector.kafka.KafkaConnector            | TO                   |
| sftp-connector         | 1.99.7  | org.apache.sqoop.connector.sftp.SftpConnector              | TO                   |
+------------------------+---------+------------------------------------------------------------+----------------------+

8.create links & show links

sqoop:000> create link -connector generic-jdbc-connector
sqoop:000> create link -connector hdfs-connector

sqoop:000> show link
+-------------+------------------------+---------+
| Name        | Connector Name         | Enabled |
+-------------+------------------------+---------+
| mysql-local | generic-jdbc-connector | true    |
| hdfs-local  | hdfs-connector         | true    |
+-------------+------------------------+---------+

sqoop:000> show link --all
2 link(s) to show:
link with name mysql-local (Enabled: true, Created by hongmeng.wang at 3/1/18 10:56 AM, Updated by hongmeng.wang at 3/1/18 12:51 PM)
Using Connector generic-jdbc-connector with name {1}
Database connection
Driver class: com.mysql.jdbc.Driver
Connection String: jdbc:mysql://localhost:3306
Username: root
Password:
Fetch Size: 100
Connection Properties:
protocol = tcp
useUnicode = true
characterEncoding = utf-8
autoReconnect = true
SQL Dialect
Identifier enclose: (blank, if use default, will get error)
link with name hdfs-local (Enabled: true, Created by hongmeng.wang at 3/1/18 10:58 AM, Updated by hongmeng.wang at 3/1/18 12:54 PM)
Using Connector hdfs-connector with name {1}
HDFS cluster
URI: hdfs://localhost:9000
Conf directory: /usr/local/Cellar/hadoop/2.8.2/libexec/etc/hadoop
Additional configs::

9.create job & show job

sqoop:000> create job -f "mysql-local" -t "hdfs-local"

sqoop:000> show job
+----+----------------------+--------------------------------------+-----------------------------+---------+
| Id | Name                 | From Connector                       | To Connector                | Enabled |
+----+----------------------+--------------------------------------+-----------------------------+---------+
| 1  | mysql-2-hdfs-t1      | mysql-local (generic-jdbc-connector) | hdfs-local (hdfs-connector) | true    |
+----+----------------------+--------------------------------------+-----------------------------+---------+

sqoop:000> show job --all
1 job(s) to show:
Job with name mysql-2-hdfs-segment (Enabled: true, Created by hongmeng.wang at 3/1/18 11:06 AM, Updated by hongmeng.wang at 3/1/18 11:39 AM)
Throttling resources
Extractors:
Loaders:
Classpath configuration
Extra mapper jars:
From link: mysql-local
Database source
Schema name: test
Table name: t1
SQL statement:
Column names:
Partition column: id
Partition column nullable:
Boundary query:
Incremental read
Check column:
Last value:
To link: hdfs-local
Target configuration
Override null value: true
Null value:
File format: TEXT_FILE
Compression codec: NONE
Custom codec:
Output directory: /sqoop/mysql/test
Append mode:

10.start job & check job’s status

sqoop:000> start job -name mysql-2-hdfs-segment
Submission details
Job Name: mysql-2-hdfs-segment
Server URL: http://localhost:12000/sqoop/
Created by: sqoop2
Creation date: 2018-03-01 13:53:37 JST
Lastly updated by: sqoop2
External ID: job_1519869491258_0001
http://localhost:8088/proxy/application_1519869491258_0001/
2018-03-01 13:53:37 JST: BOOTING - Progress is not available

sqoop:000> status job -n mysql-2-hdfs-segment
Submission details
Job Name: mysql-2-hdfs-segment
Server URL: http://localhost:12000/sqoop/
Created by: sqoop2
Creation date: 2018-03-01 14:01:54 JST
Lastly updated by: sqoop2
External ID: job_1519869491258_0002
http://localhost:8088/proxy/application_1519869491258_0002/
2018-03-01 14:03:31 JST: BOOTING - 0.00 %

Issues

1.If you get the error below when executing bin/sqoop2-tool verify, set org.apache.sqoop.submission.engine.mapreduce.configuration.directory in conf/sqoop.properties to your Hadoop configuration directory.

Exception in thread "main" java.lang.RuntimeException: Failure in server initialization
at org.apache.sqoop.core.SqoopServer.initialize(SqoopServer.java:68)
at org.apache.sqoop.server.SqoopJettyServer.<init>(SqoopJettyServer.java:67)
at org.apache.sqoop.server.SqoopJettyServer.main(SqoopJettyServer.java:177)
Caused by: org.apache.sqoop.common.SqoopException: MAPREDUCE_0002:Failure on submission engine initialization - Invalid Hadoop configuration directory (not a directory or permission issues): /etc/hadoop/conf/
at org.apache.sqoop.submission.mapreduce.MapreduceSubmissionEngine.initialize(MapreduceSubmissionEngine.java:97)
at org.apache.sqoop.driver.JobManager.initialize(JobManager.java:257)
at org.apache.sqoop.core.SqoopServer.initialize(SqoopServer.java:64)
... 2 more

2.If you get the error below, check $CLASSPATH and $HADOOP_CLASSPATH; some jars probably conflict:

Caused by: java.lang.SecurityException: sealing violation: package org.apache.derby.impl.services.locks is sealed
at java.net.URLClassLoader.getAndVerifyPackage(URLClassLoader.java:399)
at java.net.URLClassLoader.definePackageInternal(URLClassLoader.java:419)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:451)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.derby.impl.services.monitor.BaseMonitor.getImplementations(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.getDefaultImplementations(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.runWithState(Unknown Source)
at org.apache.derby.impl.services.monitor.FileMonitor.<init>(Unknown Source)
at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown Source)
at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source)
at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source)
at org.apache.derby.jdbc.EmbeddedDriver.<clinit>(Unknown Source)
... 11 more
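
For the first issue, the property can be changed with sed. This is only a sketch: it edits a temporary stand-in file rather than the real conf/sqoop.properties, and it assumes the Hadoop configuration directory is the one shown for the hdfs-local link earlier in this post.

```shell
# Sketch: point the Sqoop2 submission engine at the Hadoop conf directory.
key='org.apache.sqoop.submission.engine.mapreduce.configuration.directory'
hadoop_conf='/usr/local/Cellar/hadoop/2.8.2/libexec/etc/hadoop'   # assumed; use your own path

props=$(mktemp)   # stand-in for conf/sqoop.properties
echo "$key=/etc/hadoop/conf/" > "$props"

# Write to a new file and move it back, which avoids the
# GNU/BSD differences of sed -i.
sed "s|^$key=.*|$key=$hadoop_conf|" "$props" > "$props.new"
mv "$props.new" "$props"

cat "$props"
```

After the change, bin/sqoop2-tool verify is the quickest way to confirm the directory is accepted.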

Wang 23:12 on 2018-03-14
Tags: Hadoop ( 14 ), Yarn

[Yarn] Configure queue and capacity

Modify capacity-scheduler.xml under $HADOOP_CONF_DIR. I configured three queues: default, business, and platform.

<configuration>
    <property>
        <name>yarn.scheduler.capacity.maximum-applications</name>
        <value>10000</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
        <value>0.1</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.resource-calculator</name>
        <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.queues</name>
        <value>default,business,platform</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.default.capacity</name>
        <value>50</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
        <value>1</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
        <value>50</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.default.state</name>
        <value>RUNNING</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
        <value>*</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
        <value>*</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.business.capacity</name>
        <value>30</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.business.user-limit-factor</name>
        <value>1</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.business.maximum-capacity</name>
        <value>30</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.business.state</name>
        <value>RUNNING</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.business.acl_submit_applications</name>
        <value>*</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.business.acl_administer_queue</name>
        <value>*</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.platform.capacity</name>
        <value>20</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.platform.user-limit-factor</name>
        <value>1</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.platform.maximum-capacity</name>
        <value>20</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.platform.state</name>
        <value>RUNNING</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.platform.acl_submit_applications</name>
        <value>*</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.platform.acl_administer_queue</name>
        <value>*</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.node-locality-delay</name>
        <value>40</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.queue-mappings</name>
        <value></value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments</name>
        <value>1</value>
    </property>
</configuration>
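
The capacities of the queues directly under root have to add up to 100 (here 50 + 30 + 20). Below is a rough sketch of a check with awk. It runs against a shortened stand-in file and assumes the same layout as above, with each <value> on the line right after its <name>; it is not a real XML parser.

```shell
# Sketch: sum the <queue>.capacity values under root.
conf=$(mktemp)   # stand-in; point this at $HADOOP_CONF_DIR/capacity-scheduler.xml instead
cat > "$conf" <<'EOF'
<configuration>
    <property>
        <name>yarn.scheduler.capacity.root.default.capacity</name>
        <value>50</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
        <value>50</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.business.capacity</name>
        <value>30</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.platform.capacity</name>
        <value>20</value>
    </property>
</configuration>
EOF

total=$(awk '
  # Only root.<queue>.capacity counts; maximum-capacity is skipped.
  /<name>yarn\.scheduler\.capacity\.root\.[^.]+\.capacity<\/name>/ { want = 1; next }
  want && /<value>/ {
    gsub(/[^0-9.]/, "")   # keep only the number inside <value>...</value>
    sum += $0
    want = 0
  }
  END { print sum + 0 }
' "$conf")

echo "total capacity under root: $total"
if [ "$total" != "100" ]; then
  echo "queue capacities must add up to 100" >&2
fi
rm -f "$conf"
```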

Wang 21:33 on 2018-03-11
Tags: BigData ( 37 ), Hadoop ( 14 ), HBase ( 5 ), Hive ( 15 ), Mysql ( 7 ), Sqoop ( 2 )

[Sqoop1] Interact MySQL with HDFS/Hive/HBase

install sqoop1 on mac

brew install sqoop

#if you have set env profiles, uncomment profiles in conf/sqoop-env.sh

1.MySQL -> HDFS

1.1.import table

sqoop import --connect jdbc:mysql://localhost/test --direct --username root -P --table t1 --warehouse-dir /mysql/test --fields-terminated-by ','

1.2.import schema

sqoop import-all-tables --connect jdbc:mysql://localhost/test --direct --username root -P --warehouse-dir /mysql/test --fields-terminated-by ','

2.MySQL -> Hive

2.1.import definition

sqoop create-hive-table --connect jdbc:mysql://localhost/test --table t1 --username root -P --hive-database test

2.2.import table

sqoop import --connect jdbc:mysql://localhost/test --username root -P --table t1 --hive-import --hive-database test --hive-table t1 --fields-terminated-by ','

2.3.import schema

sqoop import-all-tables --connect jdbc:mysql://localhost/test --username root -P --hive-import --hive-database test --fields-terminated-by ','

3.MySQL -> HBase

3.1.definition

sqoop import --connect jdbc:mysql://localhost/test --username root -P --table t1

3.2.import table (the table must be created in HBase first)

sqoop import --connect jdbc:mysql://localhost/test --username root -P --table t1 --hbase-bulkload --hbase-table test.t1 --column-family basic --fields-terminated-by ','

3.3.import table without creating the table in HBase first (pay attention to the HBase/Sqoop versions)

sqoop import --connect jdbc:mysql://localhost/test --username root -P --table t1 --hbase-bulkload --hbase-create-table --hbase-table test.t1 --column-family basic --fields-terminated-by ','

4.HDFS/Hive/HBase -> MySQL

sqoop export --connect jdbc:mysql://localhost/test --username root -P --table t1 --export-dir /user/hive/warehouse/test.db/t1 --fields-terminated-by ','

Wang 22:21 on 2018-03-09
Tags: BigData ( 37 ), HBase ( 5 )
[HBase] No columns to insert

When I loaded data from HDFS into HBase, I got this error:
```
Caused by: java.lang.IllegalArgumentException: No columns to insert
    at org.apache.hadoop.hbase.client.HTable.validatePut(HTable.java:1505)
    at org.apache.hadoop.hbase.client.BufferedMutatorImpl.validatePut(BufferedMutatorImpl.java:147)
    at org.apache.hadoop.hbase.client.BufferedMutatorImpl.doMutate(BufferedMutatorImpl.java:134)
    at org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:98)
    at org.apache.hadoop.hbase.client.HTable.put(HTable.java:1028)
    at org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat$MyRecordWriter.write(HiveHBaseTableOutputFormat.java:146)
    at org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat$MyRecordWriter.write(HiveHBaseTableOutputFormat.java:117)
    at org.apache.hadoop.hive.ql.io.HivePassThroughRecordWriter.write(HivePassThroughRecordWriter.java:40)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:148)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:547)
    ... 9 more
```
After reading the documentation, I learned that HBase doesn't support null values. I checked the HDFS files, and some fields did indeed contain null values.

So I fixed the data and reloaded it into HBase, and the error went away.
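
A rough sketch of how such a check and cleanup could look. It assumes comma-delimited text files where nulls are written as \N (Hive's default marker for text tables) and replaces them with a placeholder; the sample rows and the "-" placeholder are made up, and in a real setup the input would come from something like hdfs dfs -cat.

```shell
# Sketch: find and replace \N null markers in a delimited text file.
data=$(mktemp)    # stand-in for a file fetched from HDFS
clean=$(mktemp)
printf '%s\n' '1,alice,tokyo' '2,\N,osaka' '3,bob,\N' > "$data"

# Count the rows that have at least one \N field.
null_rows=$(awk -F',' '
  { for (i = 1; i <= NF; i++) if ($i == "\\N") { n++; break } }
  END { print n + 0 }
' "$data")
echo "rows with null fields: $null_rows"

# Replace every \N with a placeholder value (hypothetical choice).
sed 's/\\N/-/g' "$data" > "$clean"

# -F treats the pattern as a fixed string, so \N is matched literally.
remaining=$(grep -cF '\N' "$clean")
echo "null markers left after cleanup: $remaining"
```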

Wang 20:37 on 2018-03-06
Tags: BigData ( 37 ), Cluster ( 30 ), Hadoop ( 14 ), Hive ( 15 ), Tez ( 6 )

[Performance Test] MR vs Tez(2)

I tested the performance of MR vs Tez again, this time on a cluster. I created a new table that contains 28,872,974 rows. The cluster servers are listed below:

Host	OS	Memory	CPU	Disk	Region
master.c.ambari-195807.internal	CentOS 7	13 GB	Intel Ivy Bridge: 2	200G	asia-east1-a
slave1.c.ambari-195807.internal	CentOS 7	13 GB	Intel Ivy Bridge: 2	200G	asia-east1-a
slave2.c.ambari-195807.internal	CentOS 7	13 GB	Intel Ivy Bridge: 2	200G	asia-east1-a
slave3.c.ambari-195807.internal	CentOS 7	13 GB	Intel Ivy Bridge: 2	200G	asia-east1-a

1.MR

1.1.create table

hive> CREATE TABLE gbif.gbif_0004998
    > STORED AS ORC
    > TBLPROPERTIES("orc.compress"="snappy")
    > AS SELECT * FROM gbif.gbif_0004998_ori;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = gizmo_20180225064259_8df29800-b260-48f5-a409-80d6ea5200ad
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1519536795015_0001, Tracking URL = http://master.c.ambari-195807.internal:8088/proxy/application_1519536795015_0001/
Kill Command = /opt/apps/hadoop-2.8.3/bin/hadoop job  -kill job_1519536795015_0001
Hadoop job information for Stage-1: number of mappers: 43; number of reducers: 0
2018-02-25 06:43:15,110 Stage-1 map = 0%,  reduce = 0%
2018-02-25 06:44:15,419 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 231.6 sec
2018-02-25 06:44:36,386 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 380.45 sec
2018-02-25 06:44:37,810 Stage-1 map = 3%,  reduce = 0%, Cumulative CPU 386.09 sec
2018-02-25 06:44:41,695 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU 422.02 sec
...
...
2018-02-25 06:47:36,112 Stage-1 map = 97%,  reduce = 0%, Cumulative CPU 1388.9 sec
2018-02-25 06:47:38,185 Stage-1 map = 98%,  reduce = 0%, Cumulative CPU 1392.1 sec
2018-02-25 06:47:45,434 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1402.14 sec
MapReduce Total cumulative CPU time: 23 minutes 22 seconds 140 msec
Ended Job = job_1519536795015_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master.c.ambari-195807.internal:9000/user/hive/warehouse/gbif.db/.hive-staging_hive_2018-02-25_06-42-59_672_2925216554228494176-1/-ext-10002
Moving data to directory hdfs://master.c.ambari-195807.internal:9000/user/hive/warehouse/gbif.db/gbif_0004998
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 43   Cumulative CPU: 1402.14 sec   HDFS Read: 11519083564 HDFS Write: 1210708016 SUCCESS
Total MapReduce CPU Time Spent: 23 minutes 22 seconds 140 msec
OK
Time taken: 288.681 seconds

1.2.query by one condition

hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = gizmo_20180225065438_d2343424-5178-4c44-8b9d-0b28f8b701fa
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1519536795015_0002, Tracking URL = http://master.c.ambari-195807.internal:8088/proxy/application_1519536795015_0002/
Kill Command = /opt/apps/hadoop-2.8.3/bin/hadoop job  -kill job_1519536795015_0002
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 1
2018-02-25 06:54:50,078 Stage-1 map = 0%,  reduce = 0%
2018-02-25 06:55:02,485 Stage-1 map = 40%,  reduce = 0%, Cumulative CPU 21.01 sec
2018-02-25 06:55:03,544 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 38.51 sec
2018-02-25 06:55:06,704 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 49.23 sec
2018-02-25 06:55:09,881 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 51.88 sec
MapReduce Total cumulative CPU time: 51 seconds 880 msec
Ended Job = job_1519536795015_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 5  Reduce: 1   Cumulative CPU: 51.88 sec   HDFS Read: 1936305 HDFS Write: 107 SUCCESS
Total MapReduce CPU Time Spent: 51 seconds 880 msec
OK
2547716
Time taken: 32.292 seconds, Fetched: 1 row(s)

1.3.query by two conditions

hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE' and year > 1900;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = gizmo_20180225081238_766d3707-7eb4-4818-860e-887c48d507ce
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1519545228015_0002, Tracking URL = http://master.c.ambari-195807.internal:8088/proxy/application_1519545228015_0002/
Kill Command = /opt/apps/hadoop-2.8.3/bin/hadoop job  -kill job_1519545228015_0002
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 1
2018-02-25 08:17:31,666 Stage-1 map = 0%,  reduce = 0%
2018-02-25 08:17:43,866 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 10.58 sec
2018-02-25 08:17:46,045 Stage-1 map = 60%,  reduce = 0%, Cumulative CPU 34.12 sec
2018-02-25 08:17:54,996 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 41.73 sec
2018-02-25 08:17:57,126 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 51.37 sec
2018-02-25 08:17:58,192 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 53.72 sec
MapReduce Total cumulative CPU time: 53 seconds 720 msec
Ended Job = job_1519545228015_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 5  Reduce: 1   Cumulative CPU: 53.72 sec   HDFS Read: 8334197 HDFS Write: 107 SUCCESS
Total MapReduce CPU Time Spent: 53 seconds 720 msec
OK
2547716
Time taken: 321.138 seconds, Fetched: 1 row(s)
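
The 321.138 seconds reported here can be broken down using the timestamps already in the log. Below is a minimal sketch, assuming the timestamp embedded in the Query ID (20180225081238) marks roughly when the query was submitted; the function name phase_seconds is only illustrative and is not part of Hive.

```python
from datetime import datetime

def phase_seconds(query_id_stamp, first_progress, last_progress):
    """Return (seconds before the first progress line, seconds of map/reduce progress).

    query_id_stamp is the yyyyMMddHHmmss part of the Hive Query ID, assumed
    here to be close to the submission time.
    """
    submitted = datetime.strptime(query_id_stamp, "%Y%m%d%H%M%S")
    fmt = "%Y-%m-%d %H:%M:%S,%f"  # format of the Stage-1 progress lines
    first = datetime.strptime(first_progress, fmt)
    last = datetime.strptime(last_progress, fmt)
    return (first - submitted).total_seconds(), (last - first).total_seconds()

# Values copied from the 1.2 and 1.3 logs above.
one_condition = phase_seconds(
    "20180225065438", "2018-02-25 06:54:50,078", "2018-02-25 06:55:09,881")
two_conditions = phase_seconds(
    "20180225081238", "2018-02-25 08:17:31,666", "2018-02-25 08:17:58,192")

print("one condition  (wait, run):", one_condition)
print("two conditions (wait, run):", two_conditions)
```

For the two-condition query this gives roughly 294 seconds before the first progress line and about 27 seconds of actual map/reduce progress, so most of the elapsed time appears to be spent before the job started running. The log does not say why; one possibility is that the job was waiting for YARN resources.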

2.Tez

2.1.create table

hive> CREATE TABLE gbif.gbif_0004998
    > STORED AS ORC
    > TBLPROPERTIES("orc.compress"="snappy")
    > AS SELECT * FROM gbif.gbif_0004998_ori;
Query ID = gizmo_20180225075657_bae527a7-7cbd-46d9-afbf-70a5adcdee7c
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1519545228015_0001)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 639.61 s   
----------------------------------------------------------------------------------------------
Moving data to directory hdfs://master.c.ambari-195807.internal:9000/user/hive/warehouse/gbif.db/gbif_0004998
OK
Time taken: 664.817 seconds

2.2.query by one condition

hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE';
Query ID = gizmo_20180225080856_d1f13489-30b0-4045-bdeb-e3e5e085e736
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1519545228015_0001)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      5          5        0        0       0       0  
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 17.91 s    
----------------------------------------------------------------------------------------------
OK
2547716
Time taken: 19.255 seconds, Fetched: 1 row(s)

2.3.query by two conditions

hive> select count(*) as total from gbif_0004998 where mediatype = 'STILLIMAGE' and year > 1900;
Query ID = gizmo_20180225081200_0279f8e6-544b-4573-858b-33f48bf1fa35
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1519545228015_0001)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      5          5        0        0       0       0  
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 16.96 s    
----------------------------------------------------------------------------------------------
OK
2547716
Time taken: 17.635 seconds, Fetched: 1 row(s)

3.Summary

Rows: 28,872,974

Type	Create Table	Query By One Condition	Query By Two Conditions
MR	288.681s	32.292s	321.138s
Tez	664.817s	19.255s	17.635s
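
To make the table easier to compare, the timings can be turned into MR/Tez ratios. This is only a small sketch; the dictionary keys and the helper name mr_over_tez are made up for illustration, and the numbers are copied from the table above.

```python
# Timings in seconds, copied from the summary table.
timings = {
    "MR": {"create": 288.681, "one_condition": 32.292, "two_conditions": 321.138},
    "Tez": {"create": 664.817, "one_condition": 19.255, "two_conditions": 17.635},
}

def mr_over_tez(step):
    """Ratio above 1 means MR took longer than Tez for that step."""
    return timings["MR"][step] / timings["Tez"][step]

for step in timings["MR"]:
    print(f"{step}: MR/Tez = {mr_over_tez(step):.2f}")
```

A ratio below 1 means MR was faster (table creation), and a ratio above 1 means Tez was faster (both queries).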

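According to the results, MR is quicker than Tez on creation but slower than Tez on query. The MR query with two conditions reported 321.138s, much worse than with one condition, but its cumulative CPU time (53.72 sec) is almost the same as the one-condition query (51.88 sec), so the extra time may come from the job waiting to start rather than from the added condition.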

Why MR is quicker than Tez on creation I don't know yet; it needs to be investigated later.

It may be related to storage. I checked the filesystem after both kinds of creation and the results differ: MR produced many small files, while Tez produced one much bigger file. This matches the logs above, where the MR job ran 43 map tasks and the Tez job ran a single Map 1 task.

MR generated files

Tez generated files
