Tagged: cloud

  • Wang 19:59 on 2021-06-09 Permalink | Reply
    Tags: cloud

    Slurm 

    Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

    https://slurm.schedmd.com/
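
    For a feel of how jobs are submitted, here is a minimal batch script. This is a sketch only: the job name, resources, and command are placeholders, not from the original post.

    #!/bin/bash
    #SBATCH --job-name=hello          # job name shown in squeue
    #SBATCH --nodes=1                 # number of nodes to allocate
    #SBATCH --ntasks=4                # total number of tasks (processes)
    #SBATCH --time=00:10:00           # wall-clock time limit
    #SBATCH --output=hello_%j.out     # output file, %j expands to the job ID

    # run the command on the allocated resources
    srun hostname

    Submit it with sbatch hello.sh, watch the queue with squeue, and cancel it with scancel <jobid>.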

     
  • Wang 11:15 on 2020-09-21 Permalink | Reply
    Tags: cloud

    ML Infrastructure Tools for Production 

    https://towardsdatascience.com/ml-infrastructure-tools-for-production-part-2-model-deployment-and-serving-fcfc75c4a362

     
  • Wang 23:36 on 2020-09-11 Permalink | Reply
    Tags: cloud

    Turn a Git repo into a collection of interactive notebooks

     
  • Wang 11:31 on 2020-08-13 Permalink | Reply
    Tags: cloud

    Jupyter Gateway + JupyterHub 

    https://jupyter.org/enterprise_gateway/

     
  • Wang 21:38 on 2020-06-13 Permalink | Reply
    Tags: cloud

    Data Pipelines with Apache Airflow

     
  • Wang 23:22 on 2019-10-25 Permalink | Reply
    Tags: cloud

    SpringOne Platform 2019 in Austin, https://springoneplatform.io/

     
  • Wang 22:10 on 2019-07-29 Permalink | Reply
    Tags: cloud

    Pivotal – Create Customer Value Like FAANG: Continuous Delivery via Spinnaker

     
  • Wang 22:34 on 2019-05-10 Permalink | Reply
    Tags: cloud

    Kubernetes node in “NotReady” status 

    Recently I found that some k8s nodes had become “NotReady”. I checked disk and memory, and both seemed fine.

    [xxx@xxx-xxx ~]# kubectl describe node xxx-xxx
    ...
    ...
    Conditions:
      Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                    Message
      ----             ------    -----------------                 ------------------                ------                    -------
      ...
      PIDPressure      False     Fri, 10 May 2019 09:24:43 +0900   Fri, 10 May 2018 00:10:12 +0900   KubeletHasSufficientPID   kubelet has sufficient PID available
      ...
    

    Then I restarted kubelet on the server and checked the logs, where I found:

    [xxx@xxx-xxx ~]# systemctl status kubelet
    ● kubelet.service - Kubernetes Kubelet Server
       Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
    ...
    May 10 12:30:30 xxx-xxx kubelet[16776]: F0322 12:30:30.810434   16776 server.go:233] failed to run Kubelet: Running with swap on is not supported, plea...
    ...
    

    So I checked the server’s status and turned off swap, then restarted kubelet, and the nodes became Ready again.

    [xxx@xxx-xxx ~]# swapoff -a
    [xxx@xxx-xxx ~]# systemctl restart kubelet
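
    Note that swapoff -a only disables swap until the next reboot. To keep the node Ready after a restart, the swap entry should also be disabled in /etc/fstab; a sketch, assuming a standard fstab layout:

    [xxx@xxx-xxx ~]# sed -ri '/\sswap\s/ s/^#?/#/' /etc/fstab    # comment out any swap entry so it is not re-enabled on boot
    [xxx@xxx-xxx ~]# swapon --show                               # no output means swap is fully off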
    

     
  • Wang 22:12 on 2019-02-11 Permalink | Reply
    Tags: cloud

    Guarantee service availability in Kubernetes 

    A good service not only provides good functionality, but also ensures availability and uptime.

    We reinforce our services in terms of QoS, QPS, Throttling, Scaling, Throughput, and Monitoring.

    QoS

    There are 3 QoS classes in Kubernetes: Guaranteed, Burstable, and BestEffort. We usually use Guaranteed or Burstable for different services.

    #Guaranteed
    resources:
      requests:
        cpu: 1000m
        memory: 4Gi
      limits:
        cpu: 1000m
        memory: 4Gi
    
    #Burstable
    resources:
      requests:
        cpu: 1000m
        memory: 4Gi
      limits:
        cpu: 6000m
        memory: 8Gi
    
    QPS

    We ran a lot of stress tests on our APIs with Gatling before releasing them. We mainly care about mean response time, standard deviation, mean requests/sec, and error rate (API Testing Report); during testing we monitor server metrics with Datadog to find bottlenecks.
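
    For reference, a load test run can be launched from the open-source Gatling bundle roughly like this (a sketch; the simulation class name is a placeholder):

    [xxx@xxx ~]$ ./bin/gatling.sh -s simulations.ApiStressSimulation    # run one simulation class
    # the HTML report (percentiles, requests/sec, errors) is written under results/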

    We usually test APIs in two scenarios: internal and external. External testing results are much worse than internal ones because of network latency, network bandwidth, and so on.

    Internal testing result

    ================================================================================
    ---- Global Information --------------------------------------------------------
    > request count                                     246000 (OK=246000 KO=0     )
    > min response time                                     16 (OK=16     KO=-     )
    > max response time                                   5891 (OK=5891   KO=-     )
    > mean response time                                    86 (OK=86     KO=-     )
    > std deviation                                        345 (OK=345    KO=-     )
    > response time 50th percentile                         30 (OK=30     KO=-     )
    > response time 75th percentile                         40 (OK=40     KO=-     )
    > response time 95th percentile                         88 (OK=88     KO=-     )
    > response time 99th percentile                       1940 (OK=1940   KO=-     )
    > mean requests/sec                                817.276 (OK=817.276 KO=-     )
    ---- Response Time Distribution ------------------------------------------------
    > t < 800 ms                                        240565 ( 98%)
    > 800 ms < t < 1200 ms                                1110 (  0%)
    > t > 1200 ms                                         4325 (  2%)
    > failed                                                 0 (  0%)
    ================================================================================
    

    External testing result

    ================================================================================
    ---- Global Information --------------------------------------------------------
    > request count                                      33000 (OK=32999  KO=1     )
    > min response time                                    477 (OK=477    KO=60001 )
    > max response time                                  60001 (OK=41751  KO=60001 )
    > mean response time                                   600 (OK=599    KO=60001 )
    > std deviation                                        584 (OK=484    KO=0     )
    > response time 50th percentile                        497 (OK=497    KO=60001 )
    > response time 75th percentile                        506 (OK=506    KO=60001 )
    > response time 95th percentile                       1366 (OK=1366   KO=60001 )
    > response time 99th percentile                       2125 (OK=2122   KO=60001 )
    > mean requests/sec                                109.635 (OK=109.631 KO=0.003 )
    ---- Response Time Distribution ------------------------------------------------
    > t < 800 ms                                         29826 ( 90%)
    > 800 ms < t < 1200 ms                                1166 (  4%)
    > t > 1200 ms                                         2007 (  6%)
    > failed                                                 1 (  0%)
    ---- Errors --------------------------------------------------------------------
    > i.g.h.c.i.RequestTimeoutException: Request timeout after 60000      1 (100.0%)
     ms
    ================================================================================
    
    Throttling

    We throttle APIs with Nginx rate limits; we configured the ingress like this:

    annotations:
      nginx.ingress.kubernetes.io/limit-connections: '30'
      nginx.ingress.kubernetes.io/limit-rps: '60'
    

    And it will generate Nginx configuration dynamically like this:

    limit_conn_zone $limit_ZGVsaXZlcnktY2RuYV9kc2QtYXBpLWNkbmEtZ2F0ZXdheQ zone=xxx_conn:5m;
    limit_req_zone $limit_ZGVsaXZlcnktY2RuYV9kc2QtYXBpLWNkbmEtZ2F0ZXdheQ zone=xxx_rps:5m rate=60r/s;
    
    server {
        server_name xxx.xxx ;
        listen 80;
        
        location ~* "^/xxx/?(?<baseuri>.*)" {
            ...
            ...        
            limit_conn xxx_conn 30;
            limit_req zone=xxx_rps burst=300 nodelay;
            ...
            ...        
        }
    }
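
    The limits are easy to verify from a client: once the request rate exceeds limit-rps plus the burst, Nginx starts answering 503 (its default status for rejected requests). A rough sketch, with the URL as a placeholder:

    [xxx@xxx ~]$ # fire requests in a tight loop and count the returned status codes
    [xxx@xxx ~]$ for i in $(seq 1 200); do curl -s -o /dev/null -w '%{http_code}\n' http://xxx.xxx/xxx/; done | sort | uniq -c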
    
    Scaling

    We use HPA in Kubernetes to ensure automatic scaling (Auto scaling in kubernetes); you can check the HPA status on the server:

    [xxx@xxx ~]$ kubectl get hpa -n test-ns
    NAME       REFERENCE             TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
    api-demo   Deployment/api-demo   39%/30%, 0%/30%   3         10        3          126d
    
    [xxx@xxx ~]$ kubectl get pod -n test-ns
    NAME                           READY     STATUS    RESTARTS   AGE
    api-demo-76b9954f57-6hvzx      1/1       Running   0          126d
    api-demo-76b9954f57-mllsx      1/1       Running   0          126d
    api-demo-76b9954f57-s22k8      1/1       Running   0          126d
    
    
    Throughput & Monitoring

    We integrated Datadog for monitoring (Monitoring by Datadog), so we can check detailed API metrics on various dashboards.

    We can also calculate throughput from the number of users, requests, and request times.
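
    As a rough rule of thumb (Little’s Law), throughput ≈ concurrent in-flight requests / mean response time; for example, 60 concurrent requests with a mean response time of 600 ms give roughly 100 requests per second.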

     
  • Wang 21:26 on 2019-01-14 Permalink | Reply
    Tags: cloud, monitoring

    Monitoring by Datadog 

    We have thousands of containers running on hundreds of servers, so we need a comprehensive monitoring system to track service and server metrics.

    We investigated the popular cloud monitoring platforms New Relic and Datadog, and finally decided to use Datadog.

    Dashboards: Datadog can detect services and configure dashboards for you automatically.

    Containers & Processes: You can clearly see all your containers and processes across all environments.

    Monitors: Datadog creates monitors automatically according to the service type; if they don’t meet your requirements, you can create your own. It’s also convenient to send alert messages through Slack or email.

    APM: Datadog provides various charts for API analysis, and there’s also a Service Map where you can check service dependencies.

    Synthetics: A newer Datadog feature that tests your APIs from around the world to check availability and uptime.
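
    For the Kubernetes side, the agent is typically deployed cluster-wide; a minimal sketch with the official Helm chart (not necessarily how we deployed it, and the API key is a placeholder):

    [xxx@xxx ~]$ helm repo add datadog https://helm.datadoghq.com
    [xxx@xxx ~]$ helm install datadog datadog/datadog --set datadog.apiKey=<DATADOG_API_KEY>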

     