Xinhai Bay 2019!





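A good service not only provides good functionality but also ensures availability and uptime.
We reinforce our service in six areas: QoS, QPS, throttling, scaling, throughput, and monitoring.
Kubernetes has three QoS classes: Guaranteed, Burstable, and BestEffort. We usually use Guaranteed or Burstable, depending on the service. A pod is Guaranteed when every container's requests equal its limits for both CPU and memory, and Burstable when at least one request or limit is set but that condition is not met.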
# Guaranteed
resources:
  requests:
    cpu: 1000m
    memory: 4Gi
  limits:
    cpu: 1000m
    memory: 4Gi
# Burstable
resources:
  requests:
    cpu: 1000m
    memory: 4Gi
  limits:
    cpu: 6000m
    memory: 8Gi
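The difference between the two blocks above can be checked mechanically. The following is a minimal Python sketch, assuming a single container and comparing quantities as plain strings (so `1000m` and `1` would not be treated as equal, unlike in Kubernetes itself). The function name `qos_class` is illustrative and not part of any Kubernetes client library.

```python
def qos_class(resources: dict) -> str:
    """Classify one container's `resources` block (simplified sketch)."""
    requests = resources.get("requests") or {}
    limits = resources.get("limits") or {}

    # No requests and no limits at all: BestEffort.
    if not requests and not limits:
        return "BestEffort"

    for name in ("cpu", "memory"):
        limit = limits.get(name)
        # Kubernetes defaults a missing request to the limit when a limit is set.
        request = requests.get(name, limit)
        # Guaranteed needs a limit for both resources, equal to the request.
        if limit is None or request != limit:
            return "Burstable"
    return "Guaranteed"


guaranteed = {
    "requests": {"cpu": "1000m", "memory": "4Gi"},
    "limits": {"cpu": "1000m", "memory": "4Gi"},
}
burstable = {
    "requests": {"cpu": "1000m", "memory": "4Gi"},
    "limits": {"cpu": "6000m", "memory": "8Gi"},
}

print(qos_class(guaranteed))
print(qos_class(burstable))
print(qos_class({}))
```

For a pod with several containers, the pod is Guaranteed only if every container passes this check.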
We stress-test our APIs with Gatling before releasing them. We mainly care about mean response time, standard deviation, mean requests/sec, and error rate (API Testing Report). During testing we monitor server metrics in Datadog to find bottlenecks.
We usually test APIs in two scenarios: internal and external. External results are much lower than internal ones because of network latency, limited network bandwidth, and so on.
Internal testing result
================================================================================
---- Global Information --------------------------------------------------------
> request count 246000 (OK=246000 KO=0 )
> min response time 16 (OK=16 KO=- )
> max response time 5891 (OK=5891 KO=- )
> mean response time 86 (OK=86 KO=- )
> std deviation 345 (OK=345 KO=- )
> response time 50th percentile 30 (OK=30 KO=- )
> response time 75th percentile 40 (OK=40 KO=- )
> response time 95th percentile 88 (OK=88 KO=- )
> response time 99th percentile 1940 (OK=1940 KO=- )
> mean requests/sec 817.276 (OK=817.276 KO=- )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms 240565 ( 98%)
> 800 ms < t < 1200 ms 1110 ( 0%)
> t > 1200 ms 4325 ( 2%)
> failed 0 ( 0%)
================================================================================
External testing result
================================================================================
---- Global Information --------------------------------------------------------
> request count 33000 (OK=32999 KO=1 )
> min response time 477 (OK=477 KO=60001 )
> max response time 60001 (OK=41751 KO=60001 )
> mean response time 600 (OK=599 KO=60001 )
> std deviation 584 (OK=484 KO=0 )
> response time 50th percentile 497 (OK=497 KO=60001 )
> response time 75th percentile 506 (OK=506 KO=60001 )
> response time 95th percentile 1366 (OK=1366 KO=60001 )
> response time 99th percentile 2125 (OK=2122 KO=60001 )
> mean requests/sec 109.635 (OK=109.631 KO=0.003 )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms 29826 ( 90%)
> 800 ms < t < 1200 ms 1166 ( 4%)
> t > 1200 ms 2007 ( 6%)
> failed 1 ( 0%)
---- Errors --------------------------------------------------------------------
> i.g.h.c.i.RequestTimeoutException: Request timeout after 60000 1 (100.0%)
ms
================================================================================
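To compare the two runs on the metrics we care about, the headline figures from the reports above can be reduced to a few derived numbers. This is a small Python sketch that only re-uses values printed in the two reports. The `GatlingRun` class and its field names are made up for illustration and are not part of Gatling.

```python
from dataclasses import dataclass


@dataclass
class GatlingRun:
    requests: int      # "request count" total
    failed: int        # "request count" KO
    mean_rps: float    # "mean requests/sec"
    under_800ms: int   # "t < 800 ms" bucket

    @property
    def duration_s(self) -> float:
        # Mean requests/sec is total requests divided by run duration.
        return self.requests / self.mean_rps

    @property
    def error_rate(self) -> float:
        return self.failed / self.requests

    @property
    def fast_share(self) -> float:
        return self.under_800ms / self.requests


internal = GatlingRun(requests=246000, failed=0, mean_rps=817.276, under_800ms=240565)
external = GatlingRun(requests=33000, failed=1, mean_rps=109.635, under_800ms=29826)

for name, run in (("internal", internal), ("external", external)):
    print(
        f"{name}: duration={run.duration_s:.0f}s "
        f"error_rate={run.error_rate:.4%} under_800ms={run.fast_share:.0%}"
    )

print(f"internal/external throughput ratio: {internal.mean_rps / external.mean_rps:.2f}")
```

Both runs work out to roughly 301 seconds, so the difference in request count reflects throughput rather than test length.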
We throttle APIs with Nginx rate limits. We configure the ingress like this:
annotations:
  nginx.ingress.kubernetes.io/limit-connections: '30'
  nginx.ingress.kubernetes.io/limit-rps: '60'
The ingress controller then generates the Nginx configuration dynamically, like this:
limit_conn_zone $limit_ZGVsaXZlcnktY2RuYV9kc2QtYXBpLWNkbmEtZ2F0ZXdheQ zone=xxx_conn:5m;
limit_req_zone $limit_ZGVsaXZlcnktY2RuYV9kc2QtYXBpLWNkbmEtZ2F0ZXdheQ zone=xxx_rps:5m rate=60r/s;
server {
    server_name xxx.xxx ;
    listen 80;
    location ~* "^/xxx/?(?<baseuri>.*)" {
        ...
        ...
        limit_conn xxx_conn 30;
        limit_req zone=xxx_rps burst=300 nodelay;
        ...
        ...
    }
}
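To get a feel for what `rate=60r/s` with `burst=300 nodelay` means for a client, the request limit can be modelled as a leaky bucket. The Python sketch below is a simplified model of that accounting, not Nginx's actual implementation: the `LeakyBucket` class is hypothetical, time is passed in explicitly, and the connection limit (`limit_conn`) is not modelled. The burst of 300 is five times the configured `limit-rps` of 60, which is consistent with a burst multiplier of 5 in the ingress controller.

```python
class LeakyBucket:
    """Simplified model of a rate limit with a burst allowance and no delay."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate      # requests drained per second
        self.burst = burst    # excess requests tolerated above the rate
        self.excess = 0.0     # requests currently "in the bucket"
        self.last = None      # time of the last accepted request

    def allow(self, now: float) -> bool:
        if self.last is None:
            # First request for this key: the bucket starts empty.
            self.last = now
            return True
        # Drain at `rate` since the last accepted request, then add this one.
        excess = max(self.excess - self.rate * (now - self.last) + 1.0, 0.0)
        if excess > self.burst:
            # Rejected requests leave the bucket state unchanged.
            return False
        self.excess = excess
        self.last = now
        return True


bucket = LeakyBucket(rate=60, burst=300)

# 400 requests arriving at the same instant.
accepted_at_start = sum(bucket.allow(0.0) for _ in range(400))
# Another 400 requests arriving one second later.
accepted_after_1s = sum(bucket.allow(1.0) for _ in range(400))

print(accepted_at_start, accepted_after_1s)
```

In this model a sudden spike is absorbed up to the burst size, after which the client is held to the steady rate of 60 requests per second.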
We use the Horizontal Pod Autoscaler (HPA) in Kubernetes for auto scaling (Auto scaling in kubernetes). You can check the HPA status on the server:
[xxx@xxx ~]$ kubectl get hpa -n test-ns
NAME       REFERENCE             TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
api-demo   Deployment/api-demo   39%/30%, 0%/30%   3         10        3          126d
[xxx@xxx ~]$ kubectl get pod -n test-ns
NAME                        READY   STATUS    RESTARTS   AGE
api-demo-76b9954f57-6hvzx   1/1     Running   0          126d
api-demo-76b9954f57-mllsx   1/1     Running   0          126d
api-demo-76b9954f57-s22k8   1/1     Running   0          126d
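The TARGETS column can be related to the replica count through the scaling rule documented for the Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default), with the largest result across metrics winning. The Python sketch below is a minimal illustration of that rule only. The function names are illustrative, and it ignores stabilization windows, pod readiness, and missing metrics.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    """Replica count suggested by one metric, clamped to the HPA bounds."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def desired_for_metrics(current_replicas, metrics, min_replicas, max_replicas):
    """With several metrics, the largest suggested replica count is used."""
    return max(
        desired_replicas(current_replicas, current, target, min_replicas, max_replicas)
        for current, target in metrics
    )


# Values taken from the `kubectl get hpa` output: 39%/30% and 0%/30%,
# 3 current replicas, MINPODS=3, MAXPODS=10.
print(desired_for_metrics(3, [(39, 30), (0, 30)], min_replicas=3, max_replicas=10))
```

For the first metric (39% against a 30% target with 3 replicas) the rule gives 4 replicas, while the second metric alone would stay at the minimum of 3.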
We integrated Datadog for monitoring (Monitoring by Datadog), so we can check detailed API metrics on various dashboards.
We can also calculate throughput from the number of users, the number of requests, and the request time.
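One simple way to read that calculation, offered here as an assumption rather than the exact formula behind our dashboards, is total requests divided by elapsed time, where total requests is the number of users times the requests each user sends. The Python sketch below uses a hypothetical function name and made-up example numbers.

```python
def throughput_rps(users: int, requests_per_user: int, elapsed_seconds: float) -> float:
    """Requests per second for a load of `users` each sending `requests_per_user`."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total_requests = users * requests_per_user
    return total_requests / elapsed_seconds


# Hypothetical example: 100 users, 30 requests each, over 60 seconds.
print(throughput_rps(users=100, requests_per_user=30, elapsed_seconds=60))
```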