

Node shutdown causes intermittent in-cluster DNS failures before pods are evicted
source link: https://zhangguanzhang.github.io/2021/02/02/node-shutdown-dns-unavailable/

These past few days we have been doing disaster-recovery testing for a new project internally; all the workloads run on K8S. Part of the drill is simply picking nodes and running shutdown -h now. After one was shut down, a colleague noticed errors on the pages, and the root cause turned out to be that in-cluster DNS resolution intermittently failed.
Following the SVC data path: after the node is powered off, its kubelet can no longer update its own status, so the node and its pods still look normal when you get them from the apiserver. Only after kube-controller-manager's --node-monitor-grace-period has elapsed, plus another --pod-eviction-timeout, does pod eviction begin. That is roughly the flow.
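For reference, these are the knobs involved; the values below are the upstream defaults for this version, not anything specific to my cluster:

# kubelet: how often the node posts its own status
--node-status-update-frequency=10s
# kube-controller-manager: how long an unreachable node keeps looking Ready
--node-monitor-grace-period=40s
# kube-controller-manager: how long after NotReady before its pods start getting evicted
--pod-eviction-timeout=5m0s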
Before the pods get evicted there is a window of roughly 5m by default. During that window, the IPs of all the pods on that node are still in the SVC endpoints, and the node my colleague shut down happened to be running a coredns replica. So for those 5 minutes, a lookup fails with a probability of roughly 1 in the number of coredns replicas.
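You can watch this from another node while the test runs; a rough check (the label and namespace match the manifests later in this post):

kubectl get node -o wide                                      # the powered-off node still shows Ready for a while
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide   # its coredns pod still shows Running
kubectl -n kube-system get endpoints kube-dns                 # and its pod IP is still listed as an endpoint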
This actually has nothing to do with the K8S version, because SVC and eviction behave this way everywhere. I did tune every parameter related to how the node reports its own status so that pods were evicted within about 20s, but lookups could still fail during those 20s. I also asked friends and community groups, and it seems nobody had ever tested shutting nodes down like this, presumably because everyone is on public clouds these days...
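Squeezing the window down to about 20s means tightening several flags together; the values below are only an illustrative sketch, not the exact ones I used:

# kubelet
--node-status-update-frequency=4s
# kube-controller-manager
--node-monitor-period=2s
--node-monitor-grace-period=16s
--pod-eviction-timeout=4s

For reference, the cluster version: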
$ kubectl version -o json
{
  "clientVersion": {
    "major": "1",
    "minor": "15",
    "gitVersion": "v1.15.5",
    "gitCommit": "20c265fef0741dd71a66480e35bd69f18351daea",
    "gitTreeState": "clean",
    "buildDate": "2019-10-15T19:16:51Z",
    "goVersion": "go1.12.10",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "serverVersion": {
    "major": "1",
    "minor": "15",
    "gitVersion": "v1.15.5",
    "gitCommit": "20c265fef0741dd71a66480e35bd69f18351daea",
    "gitTreeState": "clean",
    "buildDate": "2019-10-15T19:07:57Z",
    "goVersion": "go1.12.10",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}
Does node-local-dns really help?
The obvious first choice is the node-local-dns approach; plenty of write-ups cover it. In short, every node runs a hostNetwork node-cache process as a proxy, which uses a dummy interface plus NAT to intercept DNS requests headed for the kube-dns SVC IP and answer them from its cache.
In the official yaml, __PILLAR__LOCAL__DNS__ and __PILLAR__DNS__SERVER__ have to be replaced with the dummy interface IP and the kube-dns SVC IP, and __PILLAR__DNS__DOMAIN__ according to the docs. The remaining variables are substituted at startup; you can check the logs after it comes up.
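Rendering those placeholders is plain string substitution. With the values used in this cluster (dummy interface IP 169.254.20.10, kube-dns SVC IP 172.26.0.2, domain cluster1.local), and nodelocaldns.yaml being the upstream manifest, it is roughly:

kubedns=172.26.0.2      # or: kubectl -n kube-system get svc kube-dns -o jsonpath={.spec.clusterIP}
localdns=169.254.20.10  # the address the dummy interface will carry
domain=cluster1.local
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml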
But actual testing showed there were still problems. Going through the flow, the yaml contains this SVC plus the node-cache startup args:
apiVersion: v1
kind: Service
metadata:
  name: kube-dns-upstream
  namespace: kube-system
  ...
spec:
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 53
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 53
  selector:
    k8s-app: kube-dns
...
args: [ ..., "-upstreamsvc", "kube-dns-upstream" ]
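That extra SVC gets its own ClusterIP; a quick way to see which IP node-cache will actually forward to:

kubectl -n kube-system get svc kube-dns-upstream   # the CLUSTER-IP column is the 172.26.189.136 that shows up in the rendered config below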
In the startup logs you can see the rendered config file:
cluster1.local:53 {
    errors
    reload
    bind 169.254.20.10 172.26.0.2
    forward . 172.26.189.136 {
        force_tcp
    }
    prometheus :9253
    health 169.254.20.10:8080
}
in-addr.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.26.0.2
    forward . 172.26.189.136 {
        force_tcp
    }
    prometheus :9253
}
ip6.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.26.0.2
    forward . 172.26.189.136 {
        force_tcp
    }
    prometheus :9253
}
.:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.26.0.2
    forward . /etc/resolv.conf
    prometheus :9253
}
Since NAT hooks the requests going to the kube-dns SVC IP (172.26.0.2), node-cache itself still needs a way to reach kube-dns. So the yaml creates an extra SVC with the same selector as kube-dns and passes its name in the startup args; as you can see, node-cache forwards to that SVC's IP. And because enableServiceLinks is on by default, the pod gets environment variables like this:
$ docker exec dfa env | grep KUBE_DNS_UPSTREAM_SERVICE_HOST
KUBE_DNS_UPSTREAM_SERVICE_HOST=172.26.189.136
In the code you can see it simply converts the - in the SVC name passed as the argument into _, reads that environment variable, and renders it into the config file; that is how it gets the SVC IP:
func toSvcEnv(svcName string) string {
	envName := strings.Replace(svcName, "-", "_", -1)
	return "$" + strings.ToUpper(envName) + "_SERVICE_HOST"
}
So in the default config the cluster1.local:53 zone is still proxied through a SVC (the forward target 172.26.189.136 above is just the kube-dns-upstream ClusterIP), which has exactly the same stale-endpoint problem.
The only fundamental fix is to bypass the SVC entirely. So I switched coredns to port 153 + hostNetwork: true and pinned it to the three masters with a nodeSelector. The config then becomes:
cluster1.local:53 {
    errors
    reload
    bind 169.254.20.10 172.26.0.2
    forward . 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153 {
        force_tcp
    }
    prometheus :9253
    health 169.254.20.10:8080
}
...
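A quick sanity check here is to query one of those hostNetwork coredns instances directly, bypassing every SVC (master IPs and the test record are the ones used elsewhere in this post):

dig @10.11.86.107 -p 153 account-gateway.default.svc.cluster1.local +short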
Testing after that still showed occasional failures. I remembered 米开朗基杨 sharing dnsredir, a coredns plugin with failover, so I tried compiling it in.
After reading the docs and building it, the binary could not parse the config file, because upstream node-cache is not built as coredns plus extra plugins; it is its own codebase that pulls in coredns' built-in plugins.
The details are in this issue: include coredns plugin at node-cache don’t work expect
The bind plugin in node-cache is what handles the dummy interface and the iptables NAT part. That feature appealed to me, so I decided to keep trying to get this setup working.
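On a node you can see what that bind plugin sets up; the interface name and the raw-table NOTRACK rules below are what I'd expect from node-cache's defaults, so treat them as assumptions rather than gospel:

ip addr show nodelocaldns                                        # dummy interface carrying 169.254.20.10 and 172.26.0.2
iptables -t raw -S | grep -E '169\.254\.20\.10|172\.26\.0\.2'    # NOTRACK rules so these lookups skip conntrack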
While testing the dnsredir plugin, 米开朗基杨 asked me to try the most minimal config section to rule out interference, so I kept switching back and forth between these two:
Corefile: |
  cluster1.local:53 {
      errors
      reload
      dnsredir . {
          to 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153
          max_fails 1
          health_check 1s
          spray
      }
      #forward . 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153 {
      #    max_fails 1
      #    policy round_robin
      #    health_check 0.4s
      #}
      prometheus :9253
      health 169.254.20.10:8080
  }
#----------
Corefile: |
  cluster1.local:53 {
      errors
      reload
      #dnsredir . {
      #    to 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153
      #    max_fails 1
      #    health_check 1s
      #    spray
      #}
      forward . 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153 {
          max_fails 1
          policy round_robin
          health_check 0.4s
      }
      prometheus :9253
      health 169.254.20.10:8080
  }
And then, surprisingly, the requests stopped failing:
$ function d(){ while :;do sleep 0.2; date;dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short; done; }
$ d
2021年 02月 02日 星期二 12:54:43 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:44 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:44 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:44 CST <--- a master was shut down at this point
172.26.158.130
2021年 02月 02日 星期二 12:54:45 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:47 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:48 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:48 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:48 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:51 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:51 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:52 CST
172.26.158.130
At that point I didn't plan to keep fiddling with the dnsredir plugin. A colleague tested it and it looked fine, then asked me to apply the change to another environment so he could test again, and there it still happened:
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
; <<>> DiG 9.10.3-P4-Ubuntu <<>> @172.26.0.2 account-gateway +short
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
After repeatedly testing minimal zone configs and comparing, I tracked it down to reverse resolution: with reverse lookups disabled there was no problem at all. Comment out the following:
#in-addr.arpa:53 {
#    errors
#    cache 30
#    reload
#    loop
#    bind 169.254.20.10 172.26.0.2
#    forward . __PILLAR__CLUSTER__DNS__ {
#        force_tcp
#    }
#    prometheus :9253
#}
#ip6.arpa:53 {
#    errors
#    cache 30
#    reload
#    loop
#    bind 169.254.20.10 172.26.0.2
#    forward . __PILLAR__CLUSTER__DNS__ {
#        force_tcp
#    }
#    prometheus :9253
#}
With those zones commented out, shutting down any node running coredns while the resolution test was running no longer caused failures:
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
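One side effect to keep in mind: with the in-addr.arpa/ip6.arpa zones gone, PTR queries fall through to the .:53 block and go to the upstream resolver, so anything that relies on reverse lookups of cluster IPs should be checked, for example:

dig @169.254.20.10 -x 172.26.158.124 +short   # no longer answered by the in-cluster coredns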
The yaml manifests, roughly
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-local-dns
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: v1
kind: Service
metadata:
  name: kube-dns-upstream
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "KubeDNSUpstream"
spec:
  clusterIP: 172.26.0.3 # <---- just pin it; for testing you can query this IP directly, bypassing node-cache
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 153
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 153
  selector:
    k8s-app: kube-dns
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-local-dns
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
data:
  Corefile: |
    cluster1.local:53 {
        errors
        cache {
            success 9984 30
            denial 9984 5
        }
        reload
        loop
        bind 169.254.20.10 172.26.0.2
        forward . 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153 {
            force_tcp
            max_fails 1
            policy round_robin
            health_check 0.5s
        }
        prometheus :9253
        health 169.254.20.10:8070
    }
    #in-addr.arpa:53 {
    #    errors
    #    cache 30
    #    reload
    #    loop
    #    bind 169.254.20.10 172.26.0.2
    #    forward . __PILLAR__CLUSTER__DNS__ {
    #        force_tcp
    #    }
    #    prometheus :9253
    #}
    #ip6.arpa:53 {
    #    errors
    #    cache 30
    #    reload
    #    loop
    #    bind 169.254.20.10 172.26.0.2
    #    forward . __PILLAR__CLUSTER__DNS__ {
    #        force_tcp
    #    }
    #    prometheus :9253
    #}
    .:53 {
        errors
        cache 30
        reload
        loop
        bind 169.254.20.10 172.26.0.2
        forward . __PILLAR__UPSTREAM__SERVERS__
        prometheus :9253
    }
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-local-dns
  namespace: kube-system
  labels:
    k8s-app: node-local-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 10%
  selector:
    matchLabels:
      k8s-app: node-local-dns
  template:
    metadata:
      labels:
        k8s-app: node-local-dns
      annotations:
        prometheus.io/port: "9253"
        prometheus.io/scrape: "true"
    spec:
      imagePullSecrets:
      - name: regcred
      priorityClassName: system-node-critical
      serviceAccountName: node-local-dns
      hostNetwork: true
      dnsPolicy: Default # Don't use cluster DNS.
      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      - effect: "NoExecute"
        operator: "Exists"
      - effect: "NoSchedule"
        operator: "Exists"
      containers:
      - name: node-cache
        image: xxx.lan:5000/k8s-dns-node-cache:1.16.0
        resources:
          requests:
            cpu: 25m
            memory: 10Mi
        args: [ "-localip", "169.254.20.10,172.26.0.2", "-conf", "/etc/Corefile", "-upstreamsvc", "kube-dns-upstream", "-health-port","8070" ]
        securityContext:
          privileged: true
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        - containerPort: 9253
          name: metrics
          protocol: TCP
        livenessProbe:
          httpGet:
            host: 169.254.20.10
            path: /health
            port: 8070
          initialDelaySeconds: 40
          timeoutSeconds: 3
        volumeMounts:
        - mountPath: /run/xtables.lock
          name: xtables-lock
          readOnly: false
        - name: config-volume
          mountPath: /etc/coredns
        - name: kube-dns-config
          mountPath: /etc/kube-dns
      volumes:
      - name: xtables-lock
        hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
      - name: kube-dns-config
        configMap:
          name: kube-dns
          optional: true
      - name: config-volume
        configMap:
          name: node-local-dns
          items:
          - key: Corefile
            path: Corefile.base
---
# A headless service is a service with a service IP but instead of load-balancing it will return the IPs of our associated Pods.
# We use this to expose metrics to Prometheus.
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/port: "9253"
    prometheus.io/scrape: "true"
  labels:
    k8s-app: node-local-dns
  name: node-local-dns
  namespace: kube-system
spec:
  clusterIP: None
  ports:
  - name: metrics
    port: 9253
    targetPort: 9253
  selector:
    k8s-app: node-local-dns
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: coredns
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
    addonmanager.kubernetes.io/mode: Reconcile
  name: system:coredns
rules:
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  - pods
  - namespaces
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
    addonmanager.kubernetes.io/mode: EnsureExists
  name: system:coredns
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:coredns
subjects:
- kind: ServiceAccount
  name: coredns
  namespace: kube-system
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  Corefile: |
    .:153 {
        errors
        health :8180
        kubernetes cluster1.local. in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "CoreDNS"
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns
  template:
    metadata:
      labels:
        k8s-app: kube-dns
      annotations:
        seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values:
                  - kube-dns
              topologyKey: kubernetes.io/hostname
      hostNetwork: true
      priorityClassName: system-cluster-critical
      serviceAccountName: coredns
      nodeSelector:
        node-role.kubernetes.io/master: "true"
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      imagePullSecrets:
      - name: regcred
      containers:
      - name: coredns
        image: xxxx.lan:5000/coredns:1.7.1
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            memory: 270Mi
          requests:
            cpu: 100m
            memory: 150Mi
        args: [ "-conf", "/etc/coredns/Corefile" ]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/coredns
          readOnly: true
        ports:
        - containerPort: 153
          name: dns
          protocol: UDP
        - containerPort: 153
          name: dns-tcp
          protocol: TCP
        - containerPort: 9153
          name: metrics
          protocol: TCP
        livenessProbe:
          httpGet:
            path: /health
            port: 8180
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - all
          readOnlyRootFilesystem: true
      dnsPolicy: Default
      volumes:
      - name: config-volume
        configMap:
          name: coredns
          items:
          - key: Corefile
            path: Corefile
---
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9153"
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "CoreDNS"
spec:
  selector:
    k8s-app: kube-dns
  clusterIP: 172.26.0.2
  ports:
  - name: dns
    port: 53
    targetPort: 153
    protocol: UDP
  - name: dns-tcp
    port: 53
    targetPort: 153
    protocol: TCP
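After applying all of this, a rough smoke test (the commands are mine; the names match the manifests above):

kubectl -n kube-system rollout status ds/node-local-dns
kubectl -n kube-system rollout status deploy/coredns
# from any node or pod: one query through the intercepted SVC IP, one through the link-local cache address
dig @172.26.0.2 kubernetes.default.svc.cluster1.local +short
dig @169.254.20.10 kubernetes.default.svc.cluster1.local +short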