Tagged: BigData

  • Wang 19:34 on 2021-12-01
    Tags: BigData, HPC

    NVIDIA DGX 

    https://www.nvidia.com/en-us/data-center/dgx-systems/

     
  • Wang 23:20 on 2021-10-27
    Tags: BigData

    Spark Release 3.2.0

     
  • Wang 22:11 on 2021-04-02
    Tags: BigData

    A MORE POWERFUL FEATURE STORE

     
  • Wang 22:36 on 2021-03-02
    Tags: BigData

    Emerging Architectures for Modern Data Infrastructure

     
  • Wang 23:33 on 2021-02-05
    Tags: BigData

    In this light, here is a comparison of open-source NoSQL databases (Cassandra, MongoDB, CouchDB, Redis, Riak, RethinkDB, Couchbase (ex-Membase), Hypertable, ElasticSearch, Accumulo, VoltDB, Kyoto Tycoon, Scalaris, OrientDB, Aerospike, Neo4j, and HBase):

    https://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis/

     
  • Wang 21:48 on 2020-08-05
    Tags: BigData, kafka, Streaming

    ksqlDB 

    ksqlDB is the event streaming database purpose-built for stream processing applications.
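
    As a quick illustration of the idea, here is a minimal sketch that declares a stream over a Kafka topic and derives a filtered stream from it through ksqlDB's REST API; the server address (localhost:8088), topic name, and schema are assumptions for the example, not from the original post.

    # Minimal sketch: declare a stream over a Kafka topic, then derive a
    # filtered stream from it, via ksqlDB's REST API. The server address,
    # topic name, and schema are assumptions for illustration.
    import json
    import urllib.request

    KSQLDB = "http://localhost:8088/ksql"

    statements = """
        CREATE STREAM pageviews (user_id VARCHAR, page VARCHAR)
            WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');
        CREATE STREAM admin_views AS
            SELECT user_id, page FROM pageviews WHERE page LIKE '/admin%';
    """

    req = urllib.request.Request(
        KSQLDB,
        data=json.dumps({"ksql": statements, "streamsProperties": {}}).encode(),
        headers={"Content-Type": "application/vnd.ksql.v1+json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read()))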

     
  • Wang 22:22 on 2020-07-06
    Tags: BigData

    Spark 3.0

     
  • Wang 21:38 on 2020-06-13
    Tags: BigData

    Data Pipelines with Apache Airflow

     
  • Wang 20:23 on 2020-03-17
    Tags: BigData

    RCA for ES OOM

     
  • Wang 20:44 on 2019-12-24
    Tags: BigData

    PoC of Apache Druid 

    As we have some business requirements around data aggregation and online processing, we did a quick PoC of Apache Druid. Next I will show how to set up Druid quickly and start an ingestion task.

    1. Select a release version that is compatible with your existing system and download the package.

    2. Choose which kind of Druid service you want to start:

    • For a single node, just execute one of the scripts under the bin directory whose name starts with start-single-server-, or run start-micro-quickstart.
    • For a multi-node cluster, update the micro-quickstart configuration files on one node and sync them to the other nodes. If you want to connect to your Hadoop cluster, also copy the corresponding Hadoop XML files and the Kerberos keytab into the Druid directory.

    Then start the Druid service on every node by executing the start-cluster script.

    3. Visit the Druid console in a browser at http://IP:8888.


    Next, I load data from a local file, ingest it as a datasource, and finally query the data with SQL; a sketch for submitting the task and querying the result follows after the spec below.

    Task configuration

    {
      "type": "index_parallel",
      "id": "index_parallel_wikiticker-2015-09-12-sampled_2020-02-18T11:17:29.236Z",
      "resource": {
        "availabilityGroup": "index_parallel_wikiticker-2015-09-12-sampled_2020-02-18T11:17:29.236Z",
        "requiredCapacity": 1
      },
      "spec": {
        "dataSchema": {
          "dataSource": "wikiticker-2015-09-12-sampled",
          "parser": {
            "type": "string",
            "parseSpec": {
              "format": "json",
              "timestampSpec": {
                "column": "time",
                "format": "iso"
              },
              "dimensionsSpec": {
                "dimensions": [
                  "channel",
                  "cityName",
                  "comment",
                  "countryIsoCode",
                  "countryName",
                  "isAnonymous",
                  "isMinor",
                  "isNew",
                  "isRobot",
                  "isUnpatrolled",
                  "namespace",
                  "page",
                  "regionIsoCode",
                  "regionName",
                  "user"
                ]
              }
            }
          },
          "metricsSpec": [
            {
              "type": "count",
              "name": "count"
            },
            {
              "type": "longSum",
              "name": "sum_added",
              "fieldName": "added",
              "expression": null
            },
            {
              "type": "longSum",
              "name": "sum_deleted",
              "fieldName": "deleted",
              "expression": null
            },
            {
              "type": "longSum",
              "name": "sum_delta",
              "fieldName": "delta",
              "expression": null
            },
            {
              "type": "longSum",
              "name": "sum_metroCode",
              "fieldName": "metroCode",
              "expression": null
            }
          ],
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "DAY",
            "queryGranularity": "HOUR",
            "rollup": true,
            "intervals": null
          },
          "transformSpec": {
            "filter": null,
            "transforms": []
          }
        },
        "ioConfig": {
          "type": "index_parallel",
          "firehose": {
            "type": "local",
            "baseDir": "/opt/druid-0.16.0/quickstart/tutorial",
            "filter": "wikiticker-2015-09-12-sampled.json.gz",
            "parser": null
          },
          "appendToExisting": false
        },
        "tuningConfig": {
          "type": "index_parallel",
          "maxRowsPerSegment": null,
          "maxRowsInMemory": 1000000,
          "maxBytesInMemory": 0,
          "maxTotalRows": null,
          "numShards": null,
          "partitionsSpec": null,
          "indexSpec": {
            "bitmap": {
              "type": "concise"
            },
            "dimensionCompression": "lz4",
            "metricCompression": "lz4",
            "longEncoding": "longs"
          },
          "indexSpecForIntermediatePersists": {
            "bitmap": {
              "type": "concise"
            },
            "dimensionCompression": "lz4",
            "metricCompression": "lz4",
            "longEncoding": "longs"
          },
          "maxPendingPersists": 0,
          "forceGuaranteedRollup": false,
          "reportParseExceptions": false,
          "pushTimeout": 0,
          "segmentWriteOutMediumFactory": null,
          "maxNumConcurrentSubTasks": 1,
          "maxRetry": 3,
          "taskStatusCheckPeriodMs": 1000,
          "chatHandlerTimeout": "PT10S",
          "chatHandlerNumRetries": 5,
          "maxNumSegmentsToMerge": 100,
          "totalNumMergeTasks": 10,
          "logParseExceptions": false,
          "maxParseExceptions": 2147483647,
          "maxSavedParseExceptions": 0,
          "partitionDimensions": [],
          "buildV9Directly": true
        }
      },
      "context": {
        "forceTimeChunkLock": true
      },
      "groupId": "index_parallel_wikiticker-2015-09-12-sampled_2020-02-18T11:17:29.236Z",
      "dataSource": "wikiticker-2015-09-12-sampled"
    }
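
    With the spec above saved to a local file, one way to submit it is to POST it to Druid's task endpoint. The following is a minimal sketch; the router address (localhost:8888) and the file name task.json are assumptions for illustration, not from the original PoC.

    # Minimal sketch: submit the ingestion spec above to Druid's task endpoint.
    # The router address (localhost:8888) and the file name (task.json) are
    # assumptions for illustration.
    import json
    import urllib.request

    DRUID = "http://localhost:8888"

    with open("task.json") as f:
        spec = f.read()

    req = urllib.request.Request(
        f"{DRUID}/druid/indexer/v1/task",
        data=spec.encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print("submitted task:", json.loads(resp.read())["task"])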
    

    Task running status

    Once the task has finished, you can see the new item in the Datasources, Segments, and Query views.
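
    You can also check the task and query the new datasource programmatically. Here is a minimal sketch, again assuming the router at localhost:8888; the task ID is the one from the spec above, and the example query is illustrative.

    # Minimal sketch: poll the task status, then query the ingested datasource
    # over Druid's SQL endpoint. The router address and the example query are
    # assumptions for illustration.
    import json
    import time
    import urllib.request

    DRUID = "http://localhost:8888"
    task_id = "index_parallel_wikiticker-2015-09-12-sampled_2020-02-18T11:17:29.236Z"

    # Wait until the task leaves the RUNNING state.
    while True:
        with urllib.request.urlopen(
            f"{DRUID}/druid/indexer/v1/task/{task_id}/status"
        ) as resp:
            status = json.loads(resp.read())["status"]["statusCode"]
        if status != "RUNNING":
            break
        time.sleep(5)
    print("task status:", status)

    # Query the rolled-up datasource with Druid SQL.
    sql = {"query": """
        SELECT channel, SUM(sum_added) AS added
        FROM "wikiticker-2015-09-12-sampled"
        GROUP BY channel
        ORDER BY added DESC
        LIMIT 5
    """}
    req = urllib.request.Request(
        f"{DRUID}/druid/v2/sql",
        data=json.dumps(sql).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for row in json.loads(resp.read()):
            print(row)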

     