Guarantee service availability in Kubernetes

A good service not only provides good functionality but also ensures availability and uptime.

We reinforce our services in six areas: QoS, QPS, throttling, scaling, throughput, and monitoring.

QoS

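Kubernetes has three QoS classes: Guaranteed, Burstable, and BestEffort. A pod is Guaranteed when every container's CPU and memory requests equal its limits, and Burstable when at least one request or limit is set but the Guaranteed criteria are not met. We usually use Guaranteed or Burstable, depending on the service.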

# Guaranteed: requests and limits are identical
resources:
  requests:
    cpu: 1000m
    memory: 4Gi
  limits:
    cpu: 1000m
    memory: 4Gi

# Burstable: limits are higher than requests
resources:
  requests:
    cpu: 1000m
    memory: 4Gi
  limits:
    cpu: 6000m
    memory: 8Gi
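
To make the classification rule concrete, here is a minimal Python sketch of how the two `resources` blocks above map to a QoS class. It is a simplified model, not the kubelet's implementation: the `qos_class` helper is hypothetical, it handles a single container only, and it compares quantities as plain strings (so `1000m` and `1` would wrongly count as different).

```python
from typing import Dict, Optional

Resources = Dict[str, str]  # e.g. {"cpu": "1000m", "memory": "4Gi"}


def qos_class(requests: Optional[Resources], limits: Optional[Resources]) -> str:
    """Classify a single-container pod (simplified, hypothetical helper)."""
    requests = dict(requests or {})
    limits = limits or {}

    # Nothing set at all: the pod is BestEffort.
    if not requests and not limits:
        return "BestEffort"

    # Kubernetes defaults a missing request to the limit of the same resource.
    for name, value in limits.items():
        requests.setdefault(name, value)

    # Guaranteed needs both cpu and memory limits, each equal to its request.
    guaranteed = all(
        name in limits and requests.get(name) == limits[name]
        for name in ("cpu", "memory")
    )
    return "Guaranteed" if guaranteed else "Burstable"


same = {"cpu": "1000m", "memory": "4Gi"}
print(qos_class(same, same))                               # Guaranteed
print(qos_class(same, {"cpu": "6000m", "memory": "8Gi"}))  # Burstable
```

On a real cluster the assigned class can be read from the pod's `status.qosClass` field.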

QPS

We stress-test our APIs with Gatling before releasing them. We mainly care about mean response time, standard deviation, mean requests/sec, and error rate (API Testing Report). During testing, we monitor server metrics in Datadog to find bottlenecks.

We usually test APIs in two scenarios: internal and external. External results are much lower than internal ones because of network latency, limited network bandwidth, and so on.

Internal testing result

================================================================================
---- Global Information --------------------------------------------------------
> request count                                     246000 (OK=246000 KO=0     )
> min response time                                     16 (OK=16     KO=-     )
> max response time                                   5891 (OK=5891   KO=-     )
> mean response time                                    86 (OK=86     KO=-     )
> std deviation                                        345 (OK=345    KO=-     )
> response time 50th percentile                         30 (OK=30     KO=-     )
> response time 75th percentile                         40 (OK=40     KO=-     )
> response time 95th percentile                         88 (OK=88     KO=-     )
> response time 99th percentile                       1940 (OK=1940   KO=-     )
> mean requests/sec                                817.276 (OK=817.276 KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                        240565 ( 98%)
> 800 ms < t < 1200 ms                                1110 (  0%)
> t > 1200 ms                                         4325 (  2%)
> failed                                                 0 (  0%)
================================================================================

External testing result

================================================================================
---- Global Information --------------------------------------------------------
> request count                                      33000 (OK=32999  KO=1     )
> min response time                                    477 (OK=477    KO=60001 )
> max response time                                  60001 (OK=41751  KO=60001 )
> mean response time                                   600 (OK=599    KO=60001 )
> std deviation                                        584 (OK=484    KO=0     )
> response time 50th percentile                        497 (OK=497    KO=60001 )
> response time 75th percentile                        506 (OK=506    KO=60001 )
> response time 95th percentile                       1366 (OK=1366   KO=60001 )
> response time 99th percentile                       2125 (OK=2122   KO=60001 )
> mean requests/sec                                109.635 (OK=109.631 KO=0.003 )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                         29826 ( 90%)
> 800 ms < t < 1200 ms                                1166 (  4%)
> t > 1200 ms                                         2007 (  6%)
> failed                                                 1 (  0%)
---- Errors --------------------------------------------------------------------
> i.g.h.c.i.RequestTimeoutException: Request timeout after 60000      1 (100.0%)
 ms
================================================================================
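
As a quick sanity check on the two reports, the request count divided by the mean requests/sec gives the run duration, and the ratio of the two rates shows how far apart the scenarios are. The short Python sketch below uses only numbers from the reports above; the function and variable names are illustrative.

```python
def run_duration_seconds(request_count: int, mean_rps: float) -> float:
    """Duration implied by a Gatling report: total requests / mean requests per second."""
    return request_count / mean_rps


# Figures copied from the "Global Information" sections above.
internal = {"requests": 246000, "rps": 817.276}
external = {"requests": 33000, "rps": 109.635}

internal_duration = run_duration_seconds(internal["requests"], internal["rps"])
external_duration = run_duration_seconds(external["requests"], external["rps"])
rps_ratio = internal["rps"] / external["rps"]

print(round(internal_duration), round(external_duration))  # 301 301
print(round(rps_ratio, 2))                                  # 7.45
```

Both runs work out to roughly 301 seconds, so the two reports are comparable in length, and the internal scenario sustained about 7.5 times the request rate of the external one.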

Throttling

We throttle APIs with Nginx rate limiting, configured through ingress annotations like this:

annotations:
  nginx.ingress.kubernetes.io/limit-connections: '30'
  nginx.ingress.kubernetes.io/limit-rps: '60'

The ingress controller then generates Nginx configuration dynamically, like this:

limit_conn_zone $limit_ZGVsaXZlcnktY2RuYV9kc2QtYXBpLWNkbmEtZ2F0ZXdheQ zone=xxx_conn:5m;
limit_req_zone $limit_ZGVsaXZlcnktY2RuYV9kc2QtYXBpLWNkbmEtZ2F0ZXdheQ zone=xxx_rps:5m rate=60r/s;

server {
    server_name xxx.xxx ;
    listen 80;
    
    location ~* "^/xxx/?(?<baseuri>.*)" {
        ...
        ...        
        limit_conn xxx_conn 30;
        limit_req zone=xxx_rps burst=300 nodelay;
        ...
        ...        
    }
}
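
To illustrate what `limit_req zone=xxx_rps burst=300 nodelay` does at `rate=60r/s`, here is a rough Python model of the accounting. It is a simplified sketch, not Nginx's code: the `LimitReq` class is hypothetical, it tracks a single key, and it uses floating-point seconds instead of Nginx's millisecond arithmetic. The `burst=300` value is consistent with the ingress-nginx default burst multiplier of 5 applied to `limit-rps: '60'`, as far as the documentation describes it.

```python
from typing import Optional


class LimitReq:
    """Simplified model of Nginx limit_req with nodelay for one key."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate            # allowed requests per second
        self.burst = burst          # extra requests tolerated above the rate
        self.excess = 0.0           # requests currently "ahead" of the rate
        self.last: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.last is None:
            # First request for this key starts with no excess.
            self.last = now
            return True
        # Excess drains at `rate` per second and grows by one per request.
        excess = max(0.0, self.excess - self.rate * (now - self.last) + 1)
        if excess > self.burst:
            return False            # rejected; state is left unchanged
        self.excess = excess
        self.last = now
        return True


limiter = LimitReq(rate=60, burst=300)
spike = sum(limiter.allow(0.0) for _ in range(400))       # 400 requests at t=0
one_second_later = sum(limiter.allow(1.0) for _ in range(400))
print(spike, one_second_later)  # 301 60
```

In this model a sudden spike is served immediately up to the burst allowance, and after that the client is held to the steady 60 requests per second; rejected requests receive an error response from Nginx rather than being queued.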

Scaling

We use the HPA (Horizontal Pod Autoscaler) in Kubernetes for autoscaling (Auto scaling in kubernetes). You can check the HPA status on the server:

[xxx@xxx ~]$ kubectl get hpa -n test-ns
NAME       REFERENCE             TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
api-demo   Deployment/api-demo   39%/30%, 0%/30%   3         10        3          126d

[xxx@xxx ~]$ kubectl get pod -n test-ns
NAME                           READY     STATUS    RESTARTS   AGE
api-demo-76b9954f57-6hvzx      1/1       Running   0          126d
api-demo-76b9954f57-mllsx      1/1       Running   0          126d
api-demo-76b9954f57-s22k8      1/1       Running   0          126d
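
The `TARGETS` column reads as current/target utilization per metric. As a rough illustration, the documented HPA formula is `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, evaluated per metric with the largest result winning. The Python sketch below applies it to the numbers in the listing; the `desired_replicas` helper is hypothetical, and the 10% tolerance is the commonly documented default, assumed here rather than taken from this cluster.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,  # assumed default; configurable on the controller
) -> int:
    """Replica count the HPA formula suggests for one metric (simplified)."""
    ratio = current_utilization / target_utilization
    # Within tolerance of the target, the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # The result is always clamped to the MINPODS..MAXPODS range.
    return max(min_replicas, min(max_replicas, desired))


# Two metrics from the listing: 39%/30% and 0%/30%, with 3 replicas, min 3, max 10.
per_metric = [desired_replicas(3, current, 30, 3, 10) for current in (39, 0)]
print(per_metric, max(per_metric))  # [4, 3] 4
```

For the first metric the formula points to 4 replicas, while the listing shows 3. The listing only reflects the state at the moment it was captured, and the real controller also applies stabilization behavior that this sketch leaves out.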

Throughput & Monitoring

We integrated Datadog for monitoring (Monitoring by Datadog), so we can check detailed API metrics on various dashboards.

We can also calculate throughput from the number of users, the number of requests, and the request time.
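
One common way to relate these quantities is Little's Law: throughput equals the number of concurrent users divided by the time each request cycle takes. The text does not say which formula is used, so the Python sketch below is an assumption about the calculation, with hypothetical function names; the sample figures come from the internal Gatling report earlier in this post.

```python
def throughput_rps(
    concurrent_users: float,
    mean_response_time_s: float,
    think_time_s: float = 0.0,
) -> float:
    """Little's Law: throughput = users / (response time + think time)."""
    return concurrent_users / (mean_response_time_s + think_time_s)


def implied_concurrency(rps: float, mean_response_time_s: float) -> float:
    """Little's Law rearranged: requests in flight = throughput * response time."""
    return rps * mean_response_time_s


# Internal test: 817.276 requests/sec at a mean response time of 86 ms.
in_flight = implied_concurrency(817.276, 0.086)
print(round(in_flight, 1))        # 70.3

# Hypothetical example: 100 users, 0.5 s per request, no think time.
print(throughput_rps(100, 0.5))   # 200.0
```

Read this way, the internal test kept roughly 70 requests in flight on average, which can be compared against the number of virtual users configured in the load test.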