Tagged: bigdata

  • Wang 22:11 on 2021-04-02 Permalink | Reply
    Tags: bigdata

    A MORE POWERFUL FEATURE STORE

     
  • Wang 22:36 on 2021-03-02 Permalink | Reply
    Tags: bigdata

    Emerging Architectures for Modern Data Infrastructure

     
  • Wang 23:33 on 2021-02-05 Permalink | Reply
    Tags: bigdata

     In this light, here is a comparison of open-source NoSQL databases: Cassandra, MongoDB, CouchDB, Redis, Riak, RethinkDB, Couchbase (ex-Membase), Hypertable, Elasticsearch, Accumulo, VoltDB, Kyoto Tycoon, Scalaris, OrientDB, Aerospike, Neo4j and HBase:

    https://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis/

     
  • Wang 21:48 on 2020-08-05 Permalink | Reply
    Tags: bigdata, kafka, streaming

    ksqlDB 

    ksqlDB is the event streaming database purpose-built for stream processing applications

     
  • Wang 22:22 on 2020-07-06 Permalink | Reply
    Tags: bigdata, spark

    Spark 3.0

     
  • Wang 21:38 on 2020-06-13 Permalink | Reply
    Tags: bigdata

    Data Pipelines with Apache Airflow

     
  • Wang 20:23 on 2020-03-17 Permalink | Reply
    Tags: bigdata

    RCA for ES OOM

     
  • Wang 20:44 on 2019-12-24 Permalink | Reply
    Tags: bigdata

    PoC of Apache Druid 

     As we have some business requirements around data aggregation and online processing, we did a quick PoC on Apache Druid. Next I will show how to bring up Druid quickly and start an ingestion task.

     1. Select a release version that is compatible with your existing system and download the package.

     2. Choose which kind of Druid deployment you want to start:

    • For a single node, just execute one of the scripts under the bin directory whose names start with start-single-server-, or simply run start-micro-quickstart.
    • For a multi-node cluster, update the configuration files (the same set used by start-micro-quickstart) on one node and sync them to the other nodes. If you want to connect to your Hadoop cluster, copy the corresponding Hadoop XML files and Kerberos keytab into the Druid directory.

     Then start the Druid services on every node by executing the start-cluster scripts.

     3. Visit the Druid console through a browser at http://IP:8888.
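
     Before loading any data, it is worth confirming that the services are actually reachable. Here is a minimal sketch of such a check; the router address is a placeholder for your own host, and /status/health is the standard Druid health endpoint:

    import requests

    # Placeholder for the Druid router address from step 3; replace with your own host.
    DRUID_ROUTER = "http://IP:8888"

    # Every Druid process exposes /status/health and returns `true` when it is healthy.
    resp = requests.get(f"{DRUID_ROUTER}/status/health", timeout=10)
    resp.raise_for_status()
    print("Router healthy:", resp.json())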


     Next I load the data from a local file, ingest it as a datasource, and finally query the data with SQL.

    Task configuration

    {
      "type": "index_parallel",
      "id": "index_parallel_wikiticker-2015-09-12-sampled_2020-02-18T11:17:29.236Z",
      "resource": {
        "availabilityGroup": "index_parallel_wikiticker-2015-09-12-sampled_2020-02-18T11:17:29.236Z",
        "requiredCapacity": 1
      },
      "spec": {
        "dataSchema": {
          "dataSource": "wikiticker-2015-09-12-sampled",
          "parser": {
            "type": "string",
            "parseSpec": {
              "format": "json",
              "timestampSpec": {
                "column": "time",
                "format": "iso"
              },
              "dimensionsSpec": {
                "dimensions": [
                  "channel",
                  "cityName",
                  "comment",
                  "countryIsoCode",
                  "countryName",
                  "isAnonymous",
                  "isMinor",
                  "isNew",
                  "isRobot",
                  "isUnpatrolled",
                  "namespace",
                  "page",
                  "regionIsoCode",
                  "regionName",
                  "user"
                ]
              }
            }
          },
          "metricsSpec": [
            {
              "type": "count",
              "name": "count"
            },
            {
              "type": "longSum",
              "name": "sum_added",
              "fieldName": "added",
              "expression": null
            },
            {
              "type": "longSum",
              "name": "sum_deleted",
              "fieldName": "deleted",
              "expression": null
            },
            {
              "type": "longSum",
              "name": "sum_delta",
              "fieldName": "delta",
              "expression": null
            },
            {
              "type": "longSum",
              "name": "sum_metroCode",
              "fieldName": "metroCode",
              "expression": null
            }
          ],
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "DAY",
            "queryGranularity": "HOUR",
            "rollup": true,
            "intervals": null
          },
          "transformSpec": {
            "filter": null,
            "transforms": []
          }
        },
        "ioConfig": {
          "type": "index_parallel",
          "firehose": {
            "type": "local",
            "baseDir": "/opt/druid-0.16.0/quickstart/tutorial",
            "filter": "wikiticker-2015-09-12-sampled.json.gz",
            "parser": null
          },
          "appendToExisting": false
        },
        "tuningConfig": {
          "type": "index_parallel",
          "maxRowsPerSegment": null,
          "maxRowsInMemory": 1000000,
          "maxBytesInMemory": 0,
          "maxTotalRows": null,
          "numShards": null,
          "partitionsSpec": null,
          "indexSpec": {
            "bitmap": {
              "type": "concise"
            },
            "dimensionCompression": "lz4",
            "metricCompression": "lz4",
            "longEncoding": "longs"
          },
          "indexSpecForIntermediatePersists": {
            "bitmap": {
              "type": "concise"
            },
            "dimensionCompression": "lz4",
            "metricCompression": "lz4",
            "longEncoding": "longs"
          },
          "maxPendingPersists": 0,
          "forceGuaranteedRollup": false,
          "reportParseExceptions": false,
          "pushTimeout": 0,
          "segmentWriteOutMediumFactory": null,
          "maxNumConcurrentSubTasks": 1,
          "maxRetry": 3,
          "taskStatusCheckPeriodMs": 1000,
          "chatHandlerTimeout": "PT10S",
          "chatHandlerNumRetries": 5,
          "maxNumSegmentsToMerge": 100,
          "totalNumMergeTasks": 10,
          "logParseExceptions": false,
          "maxParseExceptions": 2147483647,
          "maxSavedParseExceptions": 0,
          "partitionDimensions": [],
          "buildV9Directly": true
        }
      },
      "context": {
        "forceTimeChunkLock": true
      },
      "groupId": "index_parallel_wikiticker-2015-09-12-sampled_2020-02-18T11:17:29.236Z",
      "dataSource": "wikiticker-2015-09-12-sampled"
    }
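
     The spec above was generated by the console's data loader, but the same JSON can also be submitted directly to the task API. A minimal sketch, assuming the spec is saved locally as task.json and the router at http://IP:8888 proxies requests to the overlord; the generated id, resource and groupId fields are stripped before resubmitting:

    import json
    import requests

    DRUID_ROUTER = "http://IP:8888"  # placeholder; replace with your router address

    # Load the ingestion spec shown above, saved locally as task.json.
    with open("task.json") as f:
        spec = json.load(f)

    # Keep only the fields needed for submission; id, resource, groupId and the
    # top-level dataSource were generated for the original run.
    payload = {"type": spec["type"], "spec": spec["spec"], "context": spec.get("context", {})}

    # Submit the task; the router proxies /druid/indexer/v1/task to the overlord.
    resp = requests.post(f"{DRUID_ROUTER}/druid/indexer/v1/task", json=payload)
    resp.raise_for_status()
    task_id = resp.json()["task"]

    # Check the task status (RUNNING, SUCCESS or FAILED).
    status = requests.get(f"{DRUID_ROUTER}/druid/indexer/v1/task/{task_id}/status")
    print(task_id, status.json())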
    

    Task running status

     Once the task finishes, you can see the new datasource under the Datasources, Segments and Query views, and query it with SQL.
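
     A minimal sketch of such a query over the Druid SQL HTTP endpoint, using column names taken from the spec above (the router address is again a placeholder):

    import requests

    DRUID_ROUTER = "http://IP:8888"  # placeholder; replace with your router address

    # Aggregate the ingested wikiticker data; dimension and metric names come from the spec above.
    query = """
    SELECT channel, SUM(sum_added) AS added, SUM("count") AS edits
    FROM "wikiticker-2015-09-12-sampled"
    GROUP BY channel
    ORDER BY added DESC
    LIMIT 5
    """

    # /druid/v2/sql accepts a JSON body with the SQL text and returns rows as JSON.
    resp = requests.post(f"{DRUID_ROUTER}/druid/v2/sql", json={"query": query})
    resp.raise_for_status()
    for row in resp.json():
        print(row)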

     
  • Wang 20:56 on 2019-11-11 Permalink | Reply
    Tags: bigdata

    Include Ranger to protect your hadoop ecosystem 

    Apache Ranger

    Apache Ranger™ is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.

     The vision with Ranger is to provide comprehensive security across the Apache Hadoop ecosystem. With the advent of Apache YARN, the Hadoop platform can now support a true data lake architecture. Enterprises can potentially run multiple workloads in a multi-tenant environment. Data security within Hadoop needs to evolve to support multiple use cases for data access, while also providing a framework for central administration of security policies and monitoring of user access.

     
  • Wang 22:21 on 2018-11-05 Permalink | Reply
    Tags: bigdata

    [Presto] Secure with LDAP 

     For security reasons we decided to enable LDAP in Presto. To deploy Presto into a Kubernetes cluster, we built the Presto image ourselves, including the Kerberos authentication and LDAP configurations.

     As you can see from the image structure, the configurations under etc and catalog (for hive) are very important, so please pay attention to them.

     krb5.conf and xxx.keytab are used to connect to Kerberos.

     password-authenticator.properties and ldap_server.pem under etc, together with hive.properties and hive-security.json under catalog, are used for the LDAP setup.

    password-authenticator.properties

    password-authenticator.name=ldap
    ldap.url=ldaps://<IP>:<PORT>
    ldap.user-bind-pattern=xxxxxx
    ldap.user-base-dn=xxxxxx
    

    hive.properties

    connector.name=hive-hadoop2
    hive.security=file
    security.config-file=<hive-security.json>
    hive.metastore.authentication.type=KERBEROS
    hive.metastore.uri=thrift://<IP>:<PORT>
    hive.metastore.service.principal=<SERVER-PRINCIPAL>
    hive.metastore.client.principal=<CLIENT-PRINCIPAL>
    hive.metastore.client.keytab=<KEYTAB>
    hive.config.resources=core-site.xml, hdfs-site.xml
    

    hive-security.json

    {
      "schemas": [{
        "user": "user_1",
        "schema": "db_1",
        "owner": false
      }, {
        "user": " ",
        "schema": "db_1",
        "owner": false
      }, {
        "user": "user_2",
        "schema": "db_2",
        "owner": false
      }],
      "tables": [{
        "user": "user_1",
        "schema": "db_1",
        "table": "table_1",
        "privileges": ["SELECT"]
      }, {
        "user": "user_1",
        "schema": "db_1",
        "table": "table_2",
        "privileges": ["SELECT"]
      }, {
        "user": "user_2",
        "schema": "db_1",
        "table": ".*",
        "privileges": ["SELECT"]
      }, {
        "user": "user_2",
        "schema": "db_2",
        "table": "table_1",
        "privileges": ["SELECT"]
      }, {
        "user": "user_2",
        "schema": "db_2",
        "table": "table_2",
        "privileges": ["SELECT"]
      }],
      "sessionProperties": [{
        "allow": false
      }]
    }
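
     With LDAP enabled, clients have to authenticate over HTTPS with a username and password. Here is a minimal client-side sketch, assuming the presto-python-client package and placeholder host, port and credentials:

    import prestodb

    # Placeholders: replace host, port and the password with your deployment's values.
    conn = prestodb.dbapi.connect(
        host="presto.example.com",
        port=8443,
        http_scheme="https",  # LDAP password authentication requires HTTPS
        auth=prestodb.auth.BasicAuthentication("user_1", "<ldap-password>"),
        user="user_1",
        catalog="hive",
        schema="db_1",
    )

    # Per hive-security.json, user_1 only has SELECT on db_1.table_1 and db_1.table_2,
    # so this query should succeed while anything outside those grants is rejected.
    cur = conn.cursor()
    cur.execute("SELECT * FROM table_1 LIMIT 10")
    print(cur.fetchall())

     The Presto CLI can run a similar check with its --server https://… and --password options.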
    
     