
Troubleshooting 502 and 500 status codes in Traefik

source link: http://dockone.io/article/10695

In a Kubernetes cluster, an Ingress solution is usually used as the unified entry point for traffic into the cluster, and the most common choices are Traefik and Nginx. Just like a traditional Nginx acting as an in-house reverse proxy and load balancer, different scenarios may need specific configuration to work properly; otherwise you end up with all sorts of strange status codes. In this article we look at the 500 and 502 errors we ran into with Traefik.

Preface

Before we start, let's review a few status codes that frequently show up when Nginx is used as a reverse proxy:

  • 499: the client closed the connection. Usually the request did not come back within the timeout the client had set, so the client gave up and closed the connection without receiving any data (Nginx logs this as 499). In short, a client-side timeout.
  • 500: internal server error. The server hit an unexpected error and could not finish handling the request, usually because of a bug in the backend business logic.
  • 502: bad gateway. The gateway did not get the response it expected from the upstream (the upstream returned nothing, or returned data that violates the protocol), so the gateway gives up and reports a gateway error. Usually the backend server is down or its business logic timed out; a short Go sketch of this failure mode follows the 502 notes below.
  • 504: gateway timeout. The gateway did not receive the upstream's response in time. Typically Nginx, acting as a client towards the upstream, hits its own timeout first; the upstream, meanwhile, keeps executing the request until it finishes. (This is the timeout of the Nginx gateway in its client role.)
# A 499 in practice: the client sets a timeout of N seconds, but the request actually needs M seconds
# on the server side (M > N), so the client loses patience and closes the connection itself.

# For 499s, the fix is to optimize the backend logic, or to adjust the Nginx parameter below.
$ cat nginx.conf
proxy_ignore_client_abort    on;

$ curl -i -m 3 http://127.0.0.1/hello.php





# A 502 usually means the server behind the Nginx gateway is simply down (so there is no upstream response to fetch).

# It can also happen when the upstream's real execution time exceeds the upstream's own timeout
# (e.g. php-fpm.conf sets request_terminate_timeout to 5s while the business logic needs 7s to finish):
# the upstream hits a "business logic timeout" and hands malformed data back to the Nginx gateway.

# Typical error logs on the Nginx side when a 502 occurs:
recv() failed (104: Connection reset by peer) while reading response header from upstream
upstream prematurely closed connection while reading response header from upstream
connect() failed (111: Connection refused) while connecting to upstream

# Overall, a 502 usually means the backend is down, or so overloaded that it temporarily cannot respond.

# Increasing proxy_read_timeout on the Nginx side can buy some temporary relief.
$ cat nginx.conf
proxy_read_timeout  20s;
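To make the 502 mechanics concrete, here is a minimal Go sketch (my own illustration, not from the original article; the addresses and ports are made up). Go's httputil.ReverseProxy, much like Nginx, answers 502 Bad Gateway by default when the round trip to the upstream fails, which is the same class of failure as the "connect() failed (111: Connection refused) while connecting to upstream" log above.

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Hypothetical upstream address with nothing listening on it.
	backend, err := url.Parse("http://127.0.0.1:9999")
	if err != nil {
		log.Fatal(err)
	}

	// A bare-bones gateway: ReverseProxy's default error handler answers
	// 502 Bad Gateway whenever the upstream round trip fails
	// (connection refused, reset by peer, unexpected EOF, ...).
	proxy := httputil.NewSingleHostReverseProxy(backend)

	log.Fatal(http.ListenAndServe("127.0.0.1:8000", proxy))
}

Running this and hitting curl -i http://127.0.0.1:8000/ should return a 502, which is exactly what a gateway does when its upstream is unreachable.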



# A 504 happens on the path client -> Nginx -> Backend: Nginx acts as a client towards the Backend,
# but before the Backend finishes, Nginx hits its own client-side timeout and returns 504
# (the end client, of course, gets that 504 back).

# For 504s, the usual fix is to optimize the Backend logic and shorten its execution time;
# alternatively, increase the timeouts Nginx uses when acting as a client.

# Remember that when Nginx acts as a client it does so in its proxy role, so the following parameters apply:
$ cat nginx.conf
uwsgi_connect_timeout 5;
uwsgi_send_timeout 5;
uwsgi_read_timeout 5;
fastcgi_read_timeout 5;
fastcgi_send_timeout 5;
proxy_connect_timeout      90;
proxy_send_timeout         90;
proxy_read_timeout         90;

Mysterious 500s and 502s in Traefik

Traefik deployment in the Kubernetes cluster

Our cluster currently runs Traefik with the following configuration:

# Traefik ConfigMap
$ cat traefik-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: traefik-config
  namespace: kube-system
data:
  traefik.toml: |
    defaultEntryPoints = ["http","https"]
    debug = false
    logLevel = "INFO"
    InsecureSkipVerify = true
    [entryPoints]
      [entryPoints.http]
      address = ":80"
      compress = true
      [entryPoints.https]
      address = ":443"
        [entryPoints.https.tls]
    [web]
      address = ":8080"
    [kubernetes]
    [metrics]
      [metrics.prometheus]
      buckets=[0.1,0.3,1.2,5.0]
      entryPoint = "traefik"
    [ping]
    entryPoint = "http"



# Traefik DaemonSet manifest
$ cat traefik-ds-v1.7.16.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: traefik-ingress-controller
  namespace: kube-system
---
kind: DaemonSet
apiVersion: extensions/v1beta1
metadata:
  name: traefik-ingress-controller
  namespace: kube-system
  labels:
    k8s-app: traefik-ingress-lb
spec:
  template:
    metadata:
      labels:
        k8s-app: traefik-ingress-lb
        name: traefik-ingress-lb
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: DoesNotExist
      serviceAccountName: traefik-ingress-controller
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      containers:
      - image: traefik:v1.7.16
        name: traefik-ingress-lb
        ports:
        - name: http
          containerPort: 80
          hostPort: 80
        - name: admin
          containerPort: 8080
        securityContext:
          capabilities:
            drop:
            - ALL
            add:
            - NET_BIND_SERVICE
        args:
        - --api
        - --kubernetes
        - --logLevel=INFO
        - --traefikLog.filePath=/logdata/traefik.log
        - --configfile=/config/traefik.toml
        - --accesslog.filepath=/logdata/access.log
        - --accesslog.bufferingsize=100
        volumeMounts:
        - mountPath: /config
          name: config
        - mountPath: /logdata
          name: access-log
      volumes:
      - configMap:
          name: traefik-config
        name: config
      - name: access-log
        hostPath:
          path: /opt/logs/ingress/
---
kind: Service
apiVersion: v1
metadata:
  name: traefik-ingress-service
  namespace: kube-system
  labels:
    k8s-app: traefik-ingress-lb
spec:
  selector:
    k8s-app: traefik-ingress-lb
  ports:
  - protocol: TCP
    port: 80
    name: web
  - protocol: TCP
    port: 8080
    name: admin

The externally exposed Python API

# The Ingress objects for the external API
$ kubectl  get ingress -n s-data
NAME                 HOSTS                ADDRESS   PORTS   AGE
data-api.bgbiao.cn   data-api.bgbiao.cn             80      236d
ops.bgbiao.cn        ops.bgbiao.cn                  80      236d

# Probe the external endpoint
$ curl data-api.bgbiao.cn  -i
HTTP/1.1 401 Unauthorized
Access-Control-Allow-Headers: Content-Type, X-TOKEN
Access-Control-Allow-Origin: *
Content-Length: 58
Content-Type: application/json
Vary: Accept-Encoding
Date: Sun, 28 Jun 2020 14:55:00 GMT

# The API requires a login, so we stress-test the login endpoint to reproduce the problem
$ curl -X POST  --data '@/root/login.json' -H 'Content-type:application/json' http://data-api.bgbiao.cn/account/users/login/   -i
HTTP/1.1 200 OK
Access-Control-Allow-Headers: Content-Type, X-TOKEN
Access-Control-Allow-Origin: *
Content-Length: 250
Content-Type: application/json
Vary: Accept-Encoding
Date: Sun, 28 Jun 2020 14:56:33 GMT

The mysterious 500s and 502s

After the service was deployed everything looked fine, but a simple load test showed that some requests were failing.

# Load test with ab
# With 20 concurrent clients and 200 requests in total, 7 requests failed during the run
$ ab -c 20 -n 200   -T 'application/json' -p /root/login.json http://data-api.bgbiao.cn/account/users/login/

...

Benchmarking data-api.bgbiao.cn (be patient)
Completed 100 requests
Completed 200 requests
Finished 200 requests


Server Software:
Server Hostname:        data-api.bgbiao.cn
Server Port:            80

Document Path:          /account/users/login/
Document Length:        250 bytes

Concurrency Level:      20
Time taken for tests:   1.340 seconds
Complete requests:      200
Failed requests:        7
   (Connect: 0, Receive: 0, Length: 7, Exceptions: 0)
Write errors:           0
Non-2xx responses:      7
Total transferred:      91371 bytes
Total body sent:        46400
HTML transferred:       48387 bytes
Requests per second:    149.21 [#/sec] (mean)
Time per request:       134.035 [ms] (mean)
Time per request:       6.702 [ms] (mean, across all concurrent requests)
Transfer rate:          66.57 [Kbytes/sec] received
                        33.81 kb/s sent
                        100.38 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        1    1   0.1      1       1
Processing:     2  116  27.8    114     179
Waiting:        2  116  27.8    114     179
Total:          3  117  27.8    116     180

Percentage of the requests served within a certain time (ms)
  50%    116
  66%    121
  75%    125
  80%    129
  90%    154
  95%    167
  98%    173
  99%    175
 100%    180 (longest request)





# Save the verbose output to a file for a quick analysis
# In this run of 200 requests, 4 failed: 2 with 500 and 2 with 502
$ ab -v 10 -c 20 -n 200   -T 'application/json' -p /root/login.json http://data-api.bgbiao.cn/account/users/login/  > ab-log.txt
$ cat ab-log.txt | grep HTTP | sort| uniq -c
    196 HTTP/1.0 200 OK
      2 HTTP/1.0 500 Internal Server Error
      2 HTTP/1.0 502 Bad Gateway
      1 POST /account/users/login/ HTTP/1.0

Tracking down the 500s and 502s in Traefik

Earlier we looked at the causes of 500 and 502 when Nginx is the reverse proxy and how to deal with them; in a Kubernetes cluster Traefik plays much the same role as Nginx.

As mentioned at the beginning, we already have the cluster's Traefik configuration in front of us. For an SRE, any production service must come with observability data, so by default we persist Traefik's access log and process log (access.log and traefik.log respectively) and expose Traefik's Prometheus metrics endpoint.

For the load test above, we caught the following anomalous entries in the access log:

$ tail -f access.log | grep data-api.bgbiao.cn | grep -v 'HTTP/1.0" 200'
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267376 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 0ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267385 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267410 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267418 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267484 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122267518 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:39 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267550 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 4ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122272696 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 2ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122272711 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 1ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122272836 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 0ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122272837 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 0ms

The status codes match the failed requests from the load test: 500 and 502. There are more than four such entries here, but that is not important for now.

For most people, the first reflex when seeing a 500 or 502 at the proxy layer is to blame the upstream service. That guess is easy to rule out, roughly as follows:

  • Keep requesting the upstream service directly while the load test is producing the 500s and 502s.
  • Run the load test with the same parameters directly against the upstream service.
  • Deploy the upstream service on a standalone ECS instance and load-test it with the same parameters.

After these three tests we had essentially ruled out the upstream service, and started to suspect Traefik itself, whether a performance limit or some missing parameter.

Set Traefik's log level to DEBUG:

$ cat traefik-ds-v1.7.6.yaml
....
    args:
    - --api
    - --kubernetes
    - --logLevel=DEBUG
....

With debug logging on, the logs showed the following:

# Log entries related to the 500s
time="2020-06-28T15:35:05Z" level=debug msg="'500 Internal Server Error' caused by: http: server closed idle connection"
time="2020-06-28T15:35:05Z" level=debug msg="vulcand/oxy/forward/http: Round trip: http://20.0.26.9:8080, code: 500, Length: 21, duration: 1.486276ms"

# Log entries related to the 502s
time="2020-06-28T15:35:05Z" level=debug msg="'502 Bad Gateway' caused by: EOF"
time="2020-06-28T15:35:05Z" level=debug msg="vulcand/oxy/forward/http: Round trip: http://20.0.26.9:8080, code: 502, Length: 11, duration: 1.530677ms"

The 500 message tells us, roughly, that an idle (keep-alive) connection Traefik was holding had already been closed on the server side of that connection, i.e. by the backend, before Traefik tried to reuse it. The 502 is caused by an EOF: the connection appears to have been cut off before the response was fully read, so Traefik returned 502. This situation is very common with Nginx and is easy to fix by tuning the parameters mentioned at the beginning of this article; with Traefik, however, understanding the deployment model and finding the right parameters takes some digging through the documentation.
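To see how an aggressively closed keep-alive connection turns into exactly these errors, here is a small Go sketch (my own illustration, not from the article; the port and timings are arbitrary). A backend with a very short idle timeout stands in for a uwsgi worker that does not keep connections alive, and a pooling HTTP client stands in for Traefik's forwarder; a POST that happens to reuse a connection the server is tearing down at that moment fails with an EOF or "server closed idle connection" style error, which a proxy then surfaces as 500 or 502.

package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

func main() {
	// Backend with a very short keep-alive window, standing in for an upstream
	// that closes (or never keeps) idle connections.
	srv := &http.Server{
		Addr:        "127.0.0.1:8080",
		IdleTimeout: 100 * time.Millisecond,
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			io.Copy(io.Discard, r.Body)
			io.WriteString(w, "ok")
		}),
	}
	go srv.ListenAndServe()
	time.Sleep(200 * time.Millisecond) // give the server a moment to start

	// A client that pools idle connections, roughly what Traefik's forwarder
	// does on top of net/http.
	client := &http.Client{Transport: &http.Transport{MaxIdleConnsPerHost: 200}}

	for i := 0; i < 50; i++ {
		// Pausing for about the server's idle timeout makes it likely that we
		// reuse a connection the server is closing at that very moment.
		time.Sleep(100 * time.Millisecond)
		resp, err := client.Post("http://127.0.0.1:8080/", "application/json", strings.NewReader(`{}`))
		if err != nil {
			// Errors like "EOF" or "server closed idle connection" are what the
			// proxy layer translates into 500/502 responses.
			fmt.Println("request failed:", err)
			continue
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
}

Because a POST with a body is generally not replayed transparently by net/http, the error reaches the caller instead of being retried, which matches what we saw on the login endpoint.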

We then searched Traefik's issues on GitHub and found that this problem has been reported many times.

One of them is about the 500s and another about the 502s, but the two problems usually show up together.

Fixing the 500s and 502s

The first issue above mentions that Traefik enables HTTP keep-alive by default for the connections it proxies, while the Python application did not have HTTP keep-alive enabled. Since our test service is indeed written in Python, we adjusted that parameter first.

# Adding the keep-alive parameter to uwsgi.ini is enough
$ cat uwsgi.ini
[uwsgi]
http = 0.0.0.0:8080
http-keepalive = 1
chdir = /opt/app/
wsgi-file = /opt/app/main.py
callable = app
stats = 0.0.0.0:8081
processes = 2
threads = 10



# Restart the application and run the load test again
# The 502 and 500 errors are gone
# 200 concurrent clients, 10,000 requests in total, 0 failed requests; the run took about a minute (QPS only around 160)
$ ab  -c 200 -n 10000   -T 'application/json' -p /root/login.json http://data-api.bgbiao.cn/account/users/login/

....
....

Concurrency Level:      200
Time taken for tests:   59.323 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Total transferred:      4670000 bytes
Total body sent:        2320000
HTML transferred:       2500000 bytes
Requests per second:    168.57 [#/sec] (mean)
Time per request:       1186.454 [ms] (mean)
Time per request:       5.932 [ms] (mean, across all concurrent requests)
Transfer rate:          76.88 [Kbytes/sec] received
                        38.19 kb/s sent
                        115.07 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        1    3  38.6      1    1035
Processing:   101  942 1457.7    857   32684
Waiting:      101  942 1457.7    857   32684
Total:        102  945 1458.0    861   32685

# p99 is around 7.4s
Percentage of the requests served within a certain time (ms)
  50%    861
  66%   1033
  75%   1136
  80%   1191
  90%   1886
  95%   2281
  98%   4209
  99%   7399
 100%  32685 (longest request)

Enabling HTTP keep-alive at the application layer worked around the 500s and 502s. Can the same problem also be addressed with Traefik-side parameters? The second issue mentioned above suggests how.

In short, modify the following Traefik parameters and redeploy the Traefik cluster:

# Disable Traefik's keep-alive towards the backends. The flag defaults to 200; if it is set to 0,
# the Go standard library's DefaultMaxIdleConnsPerHost is used instead.
# keep-alive exists to reuse connections and keep the number of open files down, but with a large
# number of short-lived connections this reuse is exactly what triggers the situation described above.
--maxidleconnsperhost=-1

# The overall idea: enable retries, raise the idle-connection timeout, and raise the timeout for
# forwarded responses. retry.attempts defaults to 0.
--retry.attempts=10

# This flag has been superseded by --respondingtimeouts.idletimeout; the default there is 3m0s.
--idletimeout=60s

# Defaults to 0s
--forwardingtimeouts.responseheadertimeout=60s



# Traefik idle-connection timeouts
$ ./traefik --help | grep idletimeout
--idletimeout                                 (Deprecated) maximum amount of time an idle (keep-alive) connection will remain  (default "0s")
--respondingtimeouts.idletimeout              IdleTimeout is the maximum amount duration an idle (keep-alive) connection will  (default "3m0s")

# Timeouts for responding to incoming requests
$ ./traefik --help | grep respondingtimeouts
--respondingtimeouts                          Timeouts for incoming requests to the Traefik instance                           (default "true")
--respondingtimeouts.idletimeout              IdleTimeout is the maximum amount duration an idle (keep-alive) connection will  (default "3m0s")
--respondingtimeouts.readtimeout              ReadTimeout is the maximum duration for reading the entire request, including    (default "0s")
--respondingtimeouts.writetimeout             WriteTimeout is the maximum duration before timing out writes of the response.   (default "0s")

# Timeouts for forwarding requests to the backends
$ ./traefik --help | grep forwardingtimeouts
--forwardingtimeouts                          Timeouts for requests forwarded to the backend servers                           (default "true")
--forwardingtimeouts.dialtimeout              The amount of time to wait until a connection to a backend server can be         (default "30s")
--forwardingtimeouts.responseheadertimeout    The amount of time to wait for a server's response headers after fully writing   (default "0s")
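As a rough orientation (my own sketch, not from the Traefik documentation; the helper names are made up), these flags map onto the net/http primitives Traefik 1.7 is built on: --forwardingtimeouts.* and --maxidleconnsperhost shape the http.Transport used towards the backends, while --respondingtimeouts.* configure the entrypoint's own http.Server.

package main

import (
	"net"
	"net/http"
	"time"
)

// newForwardingTransport is a hypothetical helper showing which standard-library
// knobs the forwarding-related flags roughly correspond to.
func newForwardingTransport() *http.Transport {
	return &http.Transport{
		// --forwardingtimeouts.dialtimeout: how long to wait for a TCP connection
		// to a backend.
		DialContext: (&net.Dialer{Timeout: 30 * time.Second}).DialContext,
		// --forwardingtimeouts.responseheadertimeout=60s: how long to wait for the
		// backend's response headers once the request has been written.
		ResponseHeaderTimeout: 60 * time.Second,
		// --maxidleconnsperhost: size of the keep-alive pool per backend; shrinking
		// it (or disabling keep-alive via DisableKeepAlives) lowers the chance of
		// reusing a connection the backend is about to close.
		MaxIdleConnsPerHost: 200,
		// Not driven by one of the flags above, but related: how long the client
		// side keeps a pooled connection idle. Keeping it below the backend's own
		// keep-alive timeout avoids the "server closed idle connection" race.
		IdleConnTimeout: 60 * time.Second,
	}
}

// newEntrypointServer mirrors the --respondingtimeouts.* side: these apply to the
// listener that accepts client traffic, not to the backend transport.
func newEntrypointServer(handler http.Handler) *http.Server {
	return &http.Server{
		Addr:        ":80",
		Handler:     handler,
		IdleTimeout: 180 * time.Second, // --respondingtimeouts.idletimeout=180s
	}
}

func main() {
	client := &http.Client{Transport: newForwardingTransport()}
	_ = client // a forwarder would use client.Do(...) against the chosen backend
	_ = newEntrypointServer(http.NotFoundHandler())
}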

The final set of Traefik arguments looks like this:

# Depending on the workload, consider also disabling keep-alive by adding: --maxidleconnsperhost=-1
    - --api
    - --kubernetes
    - --logLevel=INFO
    - --traefikLog.filePath=/logdata/traefik.log
    - --configfile=/config/traefik.toml
    - --accesslog.filepath=/logdata/access.log
    - --accesslog.bufferingsize=100
    - --forwardingtimeouts.responseheadertimeout=60s
    - --respondingtimeouts.idletimeout=180s
    - --retry.attempts=10
    - --idletimeout=180s

Mysterious 503s from the SLB

Once the domain was resolved directly to the Traefik nodes and the 502s and 500s were gone, we put the Traefik nodes behind an internal Alibaba Cloud SLB, and a mysterious 503 started showing up.

A quick load test after going through the SLB (the internal SLB is a free, low-spec instance):

# client->ali-slb->traefik->pods
$ ab -v 10 -c 200 -n 2000   -T 'application/json' -p postfile.json http://data-api.soulapp.cn/get_doc > slb-log.txt
$ cat slb-log.txt | grep  'HTTP/1.1 200' | wc -l
1322
$ cat slb-log.txt | grep  'HTTP/1.1 503' | wc -l
678

# client->traefik->pods
$ ab -v 10 -c 200 -n 2000   -T 'application/json' -p postfile.json http://data-api.c.bgbiao.cn/get_doc > traefik-log.txt
Completed 200 requests
Completed 400 requests
Completed 600 requests
Completed 800 requests
Completed 1000 requests
Completed 1200 requests
Completed 1400 requests
Completed 1600 requests
Completed 1800 requests
Completed 2000 requests
Finished 2000 requests

$ cat traefik-log.txt  | grep  'HTTP/1.0 200' | wc -l
2000

We later found the relevant Alibaba Cloud documentation: the SLB enforces rate limits that depend on the instance specification, and when the limit is exceeded it returns 503 to the client.


