

Node shutdown causes intermittent in-cluster DNS failures before pods are evicted
source link: https://zhangguanzhang.github.io/2021/02/02/node-shutdown-dns-unavailable/

These past few days we have been doing disaster-recovery testing for a new project internally; all the workloads run on K8S. Part of the drill is simply picking nodes and running shutdown -h now. After one was shut down, a colleague noticed errors on the pages, and the root cause turned out to be that in-cluster DNS resolution intermittently failed.
Following the SVC data path: after the node is powered off, its kubelet can no longer update its own status, so the node and its pods still look normal when you get them from the apiserver. Only after kube-controller-manager's --node-monitor-grace-period has elapsed, plus another --pod-eviction-timeout, does pod eviction begin. That is roughly the flow.
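For reference, these are the knobs involved; the values below are the upstream defaults for this version, not anything specific to my cluster:

# kubelet: how often the node posts its own status
--node-status-update-frequency=10s
# kube-controller-manager: how long an unreachable node keeps looking Ready
--node-monitor-grace-period=40s
# kube-controller-manager: how long after NotReady before its pods start getting evicted
--pod-eviction-timeout=5m0s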
Before the pods get evicted there is a window of roughly 5m by default. During that window, the IPs of all the pods on that node are still in the SVC endpoints, and the node my colleague shut down happened to be running a coredns replica. So for those 5 minutes, a lookup fails with a probability of roughly 1 in the number of coredns replicas.
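You can watch this from another node while the test runs; a rough check (the label and namespace match the manifests later in this post):

kubectl get node -o wide                                      # the powered-off node still shows Ready for a while
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide   # its coredns pod still shows Running
kubectl -n kube-system get endpoints kube-dns                 # and its pod IP is still listed as an endpoint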
This actually has nothing to do with the K8S version, because SVC and eviction behave this way everywhere. I did tune every parameter related to how the node reports its own status so that pods were evicted within about 20s, but lookups could still fail during those 20s. I also asked friends and community groups, and it seems nobody had ever tested shutting nodes down like this, presumably because everyone is on public clouds these days...
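Squeezing the window down to about 20s means tightening several flags together; the values below are only an illustrative sketch, not the exact ones I used:

# kubelet
--node-status-update-frequency=4s
# kube-controller-manager
--node-monitor-period=2s
--node-monitor-grace-period=16s
--pod-eviction-timeout=4s

For reference, the cluster version: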
$ kubectl version -o json
{
  "clientVersion": {
    "major": "1",
    "minor": "15",
    "gitVersion": "v1.15.5",
    "gitCommit": "20c265fef0741dd71a66480e35bd69f18351daea",
    "gitTreeState": "clean",
    "buildDate": "2019-10-15T19:16:51Z",
    "goVersion": "go1.12.10",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "serverVersion": {
    "major": "1",
    "minor": "15",
    "gitVersion": "v1.15.5",
    "gitCommit": "20c265fef0741dd71a66480e35bd69f18351daea",
    "gitTreeState": "clean",
    "buildDate": "2019-10-15T19:07:57Z",
    "goVersion": "go1.12.10",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}
Does node-local-dns really help?
The obvious first choice is the node-local-dns approach; plenty of write-ups cover it. In short, every node runs a hostNetwork node-cache process as a proxy, which uses a dummy interface plus NAT to intercept DNS requests headed for the kube-dns SVC IP and answer them from its cache.
In the official yaml, __PILLAR__LOCAL__DNS__ and __PILLAR__DNS__SERVER__ have to be replaced with the dummy interface IP and the kube-dns SVC IP, and __PILLAR__DNS__DOMAIN__ according to the docs. The remaining variables are substituted at startup; you can check the logs after it comes up.
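Rendering those placeholders is plain string substitution. With the values used in this cluster (dummy interface IP 169.254.20.10, kube-dns SVC IP 172.26.0.2, domain cluster1.local), and nodelocaldns.yaml being the upstream manifest, it is roughly:

kubedns=172.26.0.2      # or: kubectl -n kube-system get svc kube-dns -o jsonpath={.spec.clusterIP}
localdns=169.254.20.10  # the address the dummy interface will carry
domain=cluster1.local
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml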
But actual testing showed there were still problems. Going through the flow, the yaml contains this SVC plus the node-cache startup args:
apiVersion: v1
kind: Service
metadata:
  name: kube-dns-upstream
  namespace: kube-system
  ...
spec:
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 53
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 53
  selector:
    k8s-app: kube-dns
...
args: [ ..., "-upstreamsvc", "kube-dns-upstream" ]
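That extra SVC gets its own ClusterIP; a quick way to see which IP node-cache will actually forward to:

kubectl -n kube-system get svc kube-dns-upstream   # the CLUSTER-IP column is the 172.26.189.136 that shows up in the rendered config below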
In the startup logs you can see the rendered config file:
cluster1.local:53 {
    errors
    reload
    bind 169.254.20.10 172.26.0.2
    forward . 172.26.189.136 {
        force_tcp
    }
    prometheus :9253
    health 169.254.20.10:8080
}
in-addr.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.26.0.2
    forward . 172.26.189.136 {
        force_tcp
    }
    prometheus :9253
}
ip6.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.26.0.2
    forward . 172.26.189.136 {
        force_tcp
    }
    prometheus :9253
}
.:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.26.0.2
    forward . /etc/resolv.conf
    prometheus :9253
}
Since NAT hooks the requests going to the kube-dns SVC IP (172.26.0.2), node-cache itself still needs a way to reach kube-dns. So the yaml creates an extra SVC with the same selector as kube-dns and passes its name in the startup args; as you can see, node-cache forwards to that SVC's IP. And because enableServiceLinks is on by default, the pod gets environment variables like this:
$ docker exec dfa env | grep KUBE_DNS_UPSTREAM_SERVICE_HOST
KUBE_DNS_UPSTREAM_SERVICE_HOST=172.26.189.136
In the code you can see it simply converts the - in the SVC name passed as the argument into _, reads that environment variable, and renders it into the config file; that is how it gets the SVC IP:
func toSvcEnv(svcName string) string {
	envName := strings.Replace(svcName, "-", "_", -1)
	return "$" + strings.ToUpper(envName) + "_SERVICE_HOST"
}
So in the default config the cluster1.local:53 zone is still proxied through a SVC (the forward target 172.26.189.136 above is just the kube-dns-upstream ClusterIP), which has exactly the same stale-endpoint problem.
The only fundamental fix is to bypass the SVC entirely. So I switched coredns to port 153 + hostNetwork: true and pinned it to the three masters with a nodeSelector. The config then becomes:
cluster1.local:53 {
    errors
    reload
    bind 169.254.20.10 172.26.0.2
    forward . 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153 {
        force_tcp
    }
    prometheus :9253
    health 169.254.20.10:8080
}
...
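A quick sanity check here is to query one of those hostNetwork coredns instances directly, bypassing every SVC (master IPs and the test record are the ones used elsewhere in this post):

dig @10.11.86.107 -p 153 account-gateway.default.svc.cluster1.local +short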
Testing after that still showed occasional failures. I remembered 米开朗基杨 sharing dnsredir, a coredns plugin with failover, so I tried compiling it in.
After reading the docs and building it, the binary could not parse the config file, because upstream node-cache is not built as coredns plus extra plugins; it is its own codebase that pulls in coredns' built-in plugins.
The details are in this issue: include coredns plugin at node-cache don’t work expect
The bind plugin in node-cache is what handles the dummy interface and the iptables NAT part. That feature appealed to me, so I decided to keep trying to get this setup working.
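On a node you can see what that bind plugin sets up; the interface name and the raw-table NOTRACK rules below are what I'd expect from node-cache's defaults, so treat them as assumptions rather than gospel:

ip addr show nodelocaldns                                        # dummy interface carrying 169.254.20.10 and 172.26.0.2
iptables -t raw -S | grep -E '169\.254\.20\.10|172\.26\.0\.2'    # NOTRACK rules so these lookups skip conntrack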
While testing the dnsredir plugin, 米开朗基杨 asked me to try the most minimal config section to rule out interference, so I kept switching back and forth between these two:
Corefile: |
  cluster1.local:53 {
      errors
      reload
      dnsredir . {
          to 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153
          max_fails 1
          health_check 1s
          spray
      }
      #forward . 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153 {
      #    max_fails 1
      #    policy round_robin
      #    health_check 0.4s
      #}
      prometheus :9253
      health 169.254.20.10:8080
  }
#----------
Corefile: |
  cluster1.local:53 {
      errors
      reload
      #dnsredir . {
      #    to 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153
      #    max_fails 1
      #    health_check 1s
      #    spray
      #}
      forward . 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153 {
          max_fails 1
          policy round_robin
          health_check 0.4s
      }
      prometheus :9253
      health 169.254.20.10:8080
  }
And then, surprisingly, the requests stopped failing:
$ function d(){ while :;do sleep 0.2; date;dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short; done; }
$ d
2021年 02月 02日 星期二 12:54:43 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:44 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:44 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:44 CST <--- a master was shut down at this point
172.26.158.130
2021年 02月 02日 星期二 12:54:45 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:47 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:48 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:48 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:48 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:51 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:51 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:52 CST
172.26.158.130
At that point I didn't plan to keep fiddling with the dnsredir plugin. A colleague tested it and it looked fine, then asked me to apply the change to another environment so he could test again, and there it still happened:
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
; <<>> DiG 9.10.3-P4-Ubuntu <<>> @172.26.0.2 account-gateway +short
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
After repeatedly testing minimal zone configs and comparing, I tracked it down to reverse resolution: with reverse lookups disabled there was no problem at all. Comment out the following:
#in-addr.arpa:53 {
#    errors
#    cache 30
#    reload
#    loop
#    bind 169.254.20.10 172.26.0.2
#    forward . __PILLAR__CLUSTER__DNS__ {
#        force_tcp
#    }
#    prometheus :9253
#}
#ip6.arpa:53 {
#    errors
#    cache 30
#    reload
#    loop
#    bind 169.254.20.10 172.26.0.2
#    forward . __PILLAR__CLUSTER__DNS__ {
#        force_tcp
#    }
#    prometheus :9253
#}
With those zones commented out, shutting down any node running coredns while the resolution test was running no longer caused failures:
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
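One side effect to keep in mind: with the in-addr.arpa/ip6.arpa zones gone, PTR queries fall through to the .:53 block and go to the upstream resolver, so anything that relies on reverse lookups of cluster IPs should be checked, for example:

dig @169.254.20.10 -x 172.26.158.124 +short   # no longer answered by the in-cluster coredns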
The yaml manifests, roughly
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-local-dns
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: v1
kind: Service
metadata:
  name: kube-dns-upstream
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "KubeDNSUpstream"
spec:
  clusterIP: 172.26.0.3 # <---- just pin it; for testing you can query this IP directly, bypassing node-cache
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 153
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 153
  selector:
    k8s-app: kube-dns
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-local-dns
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
data:
  Corefile: |
    cluster1.local:53 {
        errors
        cache {
            success 9984 30
            denial 9984 5
        }
        reload
        loop
        bind 169.254.20.10 172.26.0.2
        forward . 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153 {
            force_tcp
            max_fails 1
            policy round_robin
            health_check 0.5s
        }
        prometheus :9253
        health 169.254.20.10:8070
    }
    #in-addr.arpa:53 {
    #    errors
    #    cache 30
    #    reload
    #    loop
    #    bind 169.254.20.10 172.26.0.2
    #    forward . __PILLAR__CLUSTER__DNS__ {
    #        force_tcp
    #    }
    #    prometheus :9253
    #}
    #ip6.arpa:53 {
    #    errors
    #    cache 30
    #    reload
    #    loop
    #    bind 169.254.20.10 172.26.0.2
    #    forward . __PILLAR__CLUSTER__DNS__ {
    #        force_tcp
    #    }
    #    prometheus :9253
    #}
    .:53 {
        errors
        cache 30
        reload
        loop
        bind 169.254.20.10 172.26.0.2
        forward . __PILLAR__UPSTREAM__SERVERS__
        prometheus :9253
    }
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-local-dns
  namespace: kube-system
  labels:
    k8s-app: node-local-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 10%
  selector:
    matchLabels:
      k8s-app: node-local-dns
  template:
    metadata:
      labels:
        k8s-app: node-local-dns
      annotations:
        prometheus.io/port: "9253"
        prometheus.io/scrape: "true"
    spec:
      imagePullSecrets:
      - name: regcred
      priorityClassName: system-node-critical
      serviceAccountName: node-local-dns
      hostNetwork: true
      dnsPolicy: Default # Don't use cluster DNS.
      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      - effect: "NoExecute"
        operator: "Exists"
      - effect: "NoSchedule"
        operator: "Exists"
      containers:
      - name: node-cache
        image: xxx.lan:5000/k8s-dns-node-cache:1.16.0
        resources:
          requests:
            cpu: 25m
            memory: 10Mi
        args: [ "-localip", "169.254.20.10,172.26.0.2", "-conf", "/etc/Corefile", "-upstreamsvc", "kube-dns-upstream", "-health-port","8070" ]
        securityContext:
          privileged: true
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        - containerPort: 9253
          name: metrics
          protocol: TCP
        livenessProbe:
          httpGet:
            host: 169.254.20.10
            path: /health
            port: 8070
          initialDelaySeconds: 40
          timeoutSeconds: 3
        volumeMounts:
        - mountPath: /run/xtables.lock
          name: xtables-lock
          readOnly: false
        - name: config-volume
          mountPath: /etc/coredns
        - name: kube-dns-config
          mountPath: /etc/kube-dns
      volumes:
      - name: xtables-lock
        hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
      - name: kube-dns-config
        configMap:
          name: kube-dns
          optional: true
      - name: config-volume
        configMap:
          name: node-local-dns
          items:
          - key: Corefile
            path: Corefile.base
---
# A headless service is a service with a service IP but instead of load-balancing it will return the IPs of our associated Pods.
# We use this to expose metrics to Prometheus.
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/port: "9253"
    prometheus.io/scrape: "true"
  labels:
    k8s-app: node-local-dns
  name: node-local-dns
  namespace: kube-system
spec:
  clusterIP: None
  ports:
  - name: metrics
    port: 9253
    targetPort: 9253
  selector:
    k8s-app: node-local-dns
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: coredns
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
    addonmanager.kubernetes.io/mode: Reconcile
  name: system:coredns
rules:
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  - pods
  - namespaces
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
    addonmanager.kubernetes.io/mode: EnsureExists
  name: system:coredns
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:coredns
subjects:
- kind: ServiceAccount
  name: coredns
  namespace: kube-system
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  Corefile: |
    .:153 {
        errors
        health :8180
        kubernetes cluster1.local. in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "CoreDNS"
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns
  template:
    metadata:
      labels:
        k8s-app: kube-dns
      annotations:
        seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values:
                  - kube-dns
              topologyKey: kubernetes.io/hostname
      hostNetwork: true
      priorityClassName: system-cluster-critical
      serviceAccountName: coredns
      nodeSelector:
        node-role.kubernetes.io/master: "true"
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      imagePullSecrets:
      - name: regcred
      containers:
      - name: coredns
        image: xxxx.lan:5000/coredns:1.7.1
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            memory: 270Mi
          requests:
            cpu: 100m
            memory: 150Mi
        args: [ "-conf", "/etc/coredns/Corefile" ]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/coredns
          readOnly: true
        ports:
        - containerPort: 153
          name: dns
          protocol: UDP
        - containerPort: 153
          name: dns-tcp
          protocol: TCP
        - containerPort: 9153
          name: metrics
          protocol: TCP
        livenessProbe:
          httpGet:
            path: /health
            port: 8180
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - all
          readOnlyRootFilesystem: true
      dnsPolicy: Default
      volumes:
      - name: config-volume
        configMap:
          name: coredns
          items:
          - key: Corefile
            path: Corefile
---
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9153"
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "CoreDNS"
spec:
  selector:
    k8s-app: kube-dns
  clusterIP: 172.26.0.2
  ports:
  - name: dns
    port: 53
    targetPort: 153
    protocol: UDP
  - name: dns-tcp
    port: 53
    targetPort: 153
    protocol: TCP
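After applying all of this, a rough smoke test (the commands are mine; the names match the manifests above):

kubectl -n kube-system rollout status ds/node-local-dns
kubectl -n kube-system rollout status deploy/coredns
# from any node or pod: one query through the intercepted SVC IP, one through the link-local cache address
dig @172.26.0.2 kubernetes.default.svc.cluster1.local +short
dig @169.254.20.10 kubernetes.default.svc.cluster1.local +short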