
Recovering a k8s master after filesystem corruption

source link: https://zhangguanzhang.github.io/2020/07/23/fs-error-fix-k8s-master/

The dev team reported that in one of their clusters a master's filesystem was corrupted and the machine would not boot. Their setup is three VMs on OpenStack, each serving as both master and node, and a fault on the virtualization host had corrupted the VM's filesystem. I walked them through repairing it and booting the machine; the repair procedure is the same as in my earlier post on an openSUSE rescue.

After it came up I logged in to take a look. Since the cluster is HA, only this node had a problem and the cluster itself was unaffected:

[root@k8s-m1 ~]# kubectl get node -o wide
NAME             STATUS     ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
10.252.146.104   NotReady   <none>   30d   v1.16.9   10.252.146.104   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
10.252.146.105   Ready      <none>   30d   v1.16.9   10.252.146.105   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
10.252.146.106   Ready      <none>   30d   v1.16.9   10.252.146.106   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11

Let's try starting docker:

[root@k8s-m1 ~]# systemctl start docker
Job for docker.service canceled.

It refuses to start, so list the units that failed:

[root@k8s-m1 ~]# systemctl --failed
  UNIT               LOAD   ACTIVE SUB    DESCRIPTION                 
● containerd.service loaded failed failed containerd container runtime

Check containerd's logs:

[root@k8s-m1 ~]# journalctl -xe -u containerd
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481459735+08:00" level=info msg="loading plugin "io.containerd.service.v1.snapshots-service"..." type=io.containerd.service.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481472223+08:00" level=info msg="loading plugin "io.containerd.runtime.v1.linux"..." type=io.containerd.runtime.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481517630+08:00" level=info msg="loading plugin "io.containerd.runtime.v2.task"..." type=io.containerd.runtime.v2
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481562176+08:00" level=info msg="loading plugin "io.containerd.monitor.v1.cgroups"..." type=io.containerd.monitor.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481964349+08:00" level=info msg="loading plugin "io.containerd.service.v1.tasks-service"..." type=io.containerd.service.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481996158+08:00" level=info msg="loading plugin "io.containerd.internal.v1.restart"..." type=io.containerd.internal.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482048208+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.containers"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482081110+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.content"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482096598+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.diff"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482112263+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.events"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482123307+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.healthcheck"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482133477+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.images"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482142943+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.leases"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482151644+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.namespaces"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482160741+08:00" level=info msg="loading plugin "io.containerd.internal.v1.opt"..." type=io.containerd.internal.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482184201+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.snapshots"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482194643+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.tasks"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482206871+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.version"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482215454+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.introspection"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482365838+08:00" level=info msg=serving... address="/run/containerd/containerd.sock"
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482404139+08:00" level=info msg="containerd successfully booted in 0.003611s"
Jul 23 11:20:11 k8s-m1 containerd[9186]: panic: runtime error: invalid memory address or nil pointer dereference
Jul 23 11:20:11 k8s-m1 containerd[9186]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x5626b983c259]
Jul 23 11:20:11 k8s-m1 containerd[9186]: goroutine 55 [running]:
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*Bucket).Cursor(...)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bucket.go:84
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*Bucket).Get(0x0, 0x5626bb7e3f10, 0xb, 0xb, 0x0, 0x2, 0x4)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bucket.go:260 +0x39
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.scanRoots.func6(0x7fe557c63020, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x5626b95eec72)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/metadata/gc.go:222 +0xcb
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*Bucket).ForEach(0xc0003d1780, 0xc00057b640, 0xa, 0xa)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bucket.go:388 +0x100
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.scanRoots(0x5626bacedde0, 0xc0003d1680, 0xc0002ee2a0, 0xc00031a3c0, 0xc000527a60, 0x7fe586a43fff)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/metadata/gc.go:216 +0x4df
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).getMarked.func1(0xc0002ee2a0, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/metadata/db.go:359 +0x165
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*DB).View(0xc00000c1e0, 0xc00008b860, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/db.go:701 +0x92
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).getMarked(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x203000, 0x203000, 0x400)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/metadata/db.go:342 +0x7e
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).GarbageCollect(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x0, 0x1, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/metadata/db.go:257 +0xa3
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/gc/scheduler.(*gcScheduler).run(0xc0000a0b40, 0x5626bacede20, 0xc0000d6010)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:310 +0x511
Jul 23 11:20:11 k8s-m1 containerd[9186]: created by github.com/containerd/containerd/gc/scheduler.init.0.func1
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:132 +0x462
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Failed with result 'exit-code'.

Judging by the panic's stack trace, this looks very much like my earlier post about a docker startup panic: in both cases a boltdb file is corrupted. Let's grab the binary's git info so we can look at the code and find where that file lives.

[root@k8s-m1 ~]# systemctl cat containerd | grep ExecStart
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/containerd

[root@k8s-m1 ~]# /usr/bin/containerd --version
containerd containerd.io 1.2.13 7ad184331fa3e55e52b890ea95e65ba581ae3429

Next, find out what the ic.Root path is:

[root@k8s-m1 ~]# /usr/bin/containerd --help | grep config
     config    information on the containerd config
   --config value, -c value     path to the configuration file (default: "/etc/containerd/config.toml")

[root@k8s-m1 ~]# grep root /etc/containerd/config.toml
#root = "/var/lib/containerd"
[root@k8s-m1 ~]# find /var/lib/containerd -type f -name meta.db
/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
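
Before throwing the file away, you can confirm it is actually corrupt. A quick sketch using the bbolt CLI, which is not part of the containerd install (assuming a Go toolchain is available to fetch it):

# fetch the bbolt command-line tool (any recent version should do)
go install go.etcd.io/bbolt/cmd/bbolt@latest

# integrity-check containerd's metadata database: a healthy file
# prints "OK", a damaged one reports the broken pages
bbolt check /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db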

That's the boltdb file. Rename it out of the way and start containerd again:

[root@k8s-m1 ~]# mv /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db{,.bak}
[root@k8s-m1 ~]# systemctl status containerd.service
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2020-07-23 11:20:11 CST; 17min ago
     Docs: https://containerd.io
  Process: 9186 ExecStart=/usr/bin/containerd (code=exited, status=2)
  Process: 9182 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 9186 (code=exited, status=2)

Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).getMarked(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x203000, 0x203000, 0x400)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/metadata/db.go:342 +0x7e
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).GarbageCollect(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x0, 0x1, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/metadata/db.go:257 +0xa3
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/gc/scheduler.(*gcScheduler).run(0xc0000a0b40, 0x5626bacede20, 0xc0000d6010)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:310 +0x511
Jul 23 11:20:11 k8s-m1 containerd[9186]: created by github.com/containerd/containerd/gc/scheduler.init.0.func1
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:132 +0x462
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Failed with result 'exit-code'.
[root@k8s-m1 ~]# systemctl restart containerd.service
[root@k8s-m1 ~]# systemctl status containerd.service
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-07-23 11:25:37 CST; 1s ago
     Docs: https://containerd.io
  Process: 15661 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 15663 (containerd)
    Tasks: 16
   Memory: 28.6M
   CGroup: /system.slice/containerd.service
           └─15663 /usr/bin/containerd

Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496725460+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.images"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496734129+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.leases"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496742793+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.namespaces"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496751740+08:00" level=info msg="loading plugin "io.containerd.internal.v1.opt"..." type=io.containerd.internal.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496775185+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.snapshots"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496785498+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.tasks"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496794873+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.version"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496803178+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.introspection"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496944458+08:00" level=info msg=serving... address="/run/containerd/containerd.sock"
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496958031+08:00" level=info msg="containerd successfully booted in 0.003994s"

With containerd up, start docker:

[root@k8s-m1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─10-docker.conf
   Active: inactive (dead) since Thu 2020-07-23 11:20:13 CST; 18min ago
     Docs: https://docs.docker.com
  Process: 9398 ExecStopPost=/bin/bash -c /sbin/iptables -D FORWARD -s 0.0.0.0/0 -j ACCEPT &> /dev/null || : (code=exited, status=0/SUCCESS)
  Process: 9187 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=0/SUCCESS)
 Main PID: 9187 (code=exited, status=0/SUCCESS)

Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.956503485+08:00" level=error msg="Stop container error: Failed to stop container 68860c8d16b9ce7e74e8efd9db00e70a57eef1b752c2e6c703073c0bce5517d3 with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.954347116+08:00" level=error msg="Stop container error: Failed to stop container 5ec9922beed1276989f1866c3fd911f37cc26aae4e4b27c7ce78183a9a4725cc with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.953615411+08:00" level=info msg="Container failed to stop after sending signal 15 to the process, force killing"
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.956557179+08:00" level=error msg="Stop container error: Failed to stop container 6d0096fbcd4055f8bafb6b38f502a0186cd1dfca34219e9dd6050f512971aef5 with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.954601191+08:00" level=info msg="Container failed to stop after sending signal 15 to the process, force killing"
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.956600790+08:00" level=error msg="Stop container error: Failed to stop container 6d1175ba6c55cb05ad89f4134ba8e9d3495c5acb5f07938dc16339b7cca013bf with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.957188989+08:00" level=info msg="Daemon shutdown complete"
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.957212655+08:00" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.957209679+08:00" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=moby
Jul 23 11:20:13 k8s-m1 systemd[1]: Stopped Docker Application Container Engine.
[root@k8s-m1 ~]# systemctl start docker
[root@k8s-m1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─10-docker.conf
   Active: active (running) since Thu 2020-07-23 11:26:11 CST; 1s ago
     Docs: https://docs.docker.com
  Process: 9398 ExecStopPost=/bin/bash -c /sbin/iptables -D FORWARD -s 0.0.0.0/0 -j ACCEPT &> /dev/null || : (code=exited, status=0/SUCCESS)
  Process: 16156 ExecStartPost=/sbin/iptables -I FORWARD -s 0.0.0.0/0 -j ACCEPT (code=exited, status=0/SUCCESS)
 Main PID: 15974 (dockerd)
    Tasks: 62
   Memory: 89.1M
   CGroup: /system.slice/docker.service
           └─15974 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.851106564+08:00" level=error msg="cb4e16249cd8eac48ed734c71237195f04d63c56c55c0199b3cdf3d49461903d cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.860456898+08:00" level=error msg="d9bbcab186ccb59f96c95fc886ec1b66a52aa96e45b117cf7d12e3ff9b95db9f cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.872405757+08:00" level=error msg="07eb7a09bc8589abcb4d79af4b46798327bfb00624a7b9ceea457de392ad8f3d cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.877896618+08:00" level=error msg="f5867657025bd7c3951cbd3e08ad97338cf69df2a97967a419e0e78eda869b73 cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.143661583+08:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.198200760+08:00" level=info msg="Loading containers: done."
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.219959208+08:00" level=info msg="Docker daemon" commit=42e35e61f3 graphdriver(s)=overlay2 version=19.03.11
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.220049865+08:00" level=info msg="Daemon has completed initialization"
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.232373131+08:00" level=info msg="API listen on /var/run/docker.sock"
Jul 23 11:26:11 k8s-m1 systemd[1]: Started Docker Application Container Engine.
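
Because meta.db was recreated from scratch, containerd's metadata store started out empty and dockerd repopulated it as it brought containers back. If you want to sanity-check the rebuilt store, the ctr client that ships with containerd can list it (dockerd keeps its containers in the moby namespace):

# list containerd namespaces; dockerd registers under "moby"
ctr --address /run/containerd/containerd.sock namespaces ls

# list the containers dockerd has re-registered in the new metadata store
ctr --address /run/containerd/containerd.sock -n moby containers ls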

etcd also failed to start; check its state via journalctl:

[root@k8s-m1 ~]# journalctl -xe -u etcd
Jul 23 11:26:15 k8s-m1 etcd[18129]: Loading server configuration from "/etc/etcd/etcd.config.yml"
Jul 23 11:26:15 k8s-m1 etcd[18129]: etcd Version: 3.3.20
Jul 23 11:26:15 k8s-m1 etcd[18129]: Git SHA: 9fd7e2b80
Jul 23 11:26:15 k8s-m1 etcd[18129]: Go Version: go1.12.17
Jul 23 11:26:15 k8s-m1 etcd[18129]: Go OS/Arch: linux/amd64
Jul 23 11:26:15 k8s-m1 etcd[18129]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jul 23 11:26:15 k8s-m1 etcd[18129]: found invalid file/dir wal under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
Jul 23 11:26:15 k8s-m1 etcd[18129]: the server is already initialized as member before, starting as etcd member...
Jul 23 11:26:15 k8s-m1 etcd[18129]: ignoring peer auto TLS since certs given
Jul 23 11:26:15 k8s-m1 etcd[18129]: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = /etc/kubernetes/pki/etcd/ca.crt, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = fals>
Jul 23 11:26:15 k8s-m1 etcd[18129]: listening for peers on https://10.252.146.104:2380
Jul 23 11:26:15 k8s-m1 etcd[18129]: ignoring client auto TLS since certs given
Jul 23 11:26:15 k8s-m1 etcd[18129]: pprof is enabled under /debug/pprof
Jul 23 11:26:15 k8s-m1 etcd[18129]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 23 11:26:15 k8s-m1 etcd[18129]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 23 11:26:15 k8s-m1 etcd[18129]: listening for client requests on 127.0.0.1:2379
Jul 23 11:26:15 k8s-m1 etcd[18129]: listening for client requests on 10.252.146.104:2379
Jul 23 11:26:15 k8s-m1 etcd[18129]: skipped unexpected non snapshot file 000000000000002e-000000000052f2be.snap.broken
Jul 23 11:26:15 k8s-m1 etcd[18129]: recovered store from snapshot at index 5426092
Jul 23 11:26:15 k8s-m1 etcd[18129]: restore compact to 3967425
Jul 23 11:26:15 k8s-m1 etcd[18129]: cannot unmarshal event: proto: KeyValue: illegal tag 0 (wire type 0)
Jul 23 11:26:15 k8s-m1 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Jul 23 11:26:15 k8s-m1 systemd[1]: etcd.service: Failed with result 'exit-code'.
Jul 23 11:26:15 k8s-m1 systemd[1]: Failed to start Etcd Service.
[root@k8s-m1 ~]# ll /var/lib/etcd/member/snap/
total 8560
-rw-r--r-- 1 root root   13499 Jul 20 13:36 000000000000002e-000000000052cbac.snap
-rw-r--r-- 2 root root  128360 Jul 20 13:01 000000000000002e-000000000052f2be.snap.broken
-rw------- 1 root root 8617984 Jul 23 11:26 db
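
The "cannot unmarshal event" error means the local db file itself is damaged. You can double-check that before restoring; a sketch, assuming an etcdctl binary matching the 3.3 server:

# inspect the bolt database behind etcd: a corrupt file errors out here,
# a healthy one reports its hash, revision, key count and size
ETCDCTL_API=3 etcdctl snapshot status /var/lib/etcd/member/snap/db -w table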

This cluster was deployed with my Kubernetes-ansible playbooks (stars welcome), which ship with a backup script. But the machine went down three days ago, so the local backups stop at 07/20:

[root@k8s-m1 ~]# ll /opt/etcd_bak/
total 41524
-rw-r--r-- 1 root root 8618016 Jul 17 02:00 etcd-2020-07-17-02:00:01.db
-rw-r--r-- 1 root root 8618016 Jul 18 02:00 etcd-2020-07-18-02:00:01.db
-rw-r--r-- 1 root root 8323104 Jul 19 02:00 etcd-2020-07-19-02:00:01.db
-rw-r--r-- 1 root root 8618016 Jul 20 02:00 etcd-2020-07-20-02:00:01.db
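
The backup itself is just a daily v3 snapshot taken over the plain-HTTP loopback client URL seen in the logs above. Roughly what the cron job boils down to (a sketch; the actual script in the repo may differ):

# take a point-in-time snapshot of the v3 keyspace; no certs are needed
# on 127.0.0.1:2379 because that listener is plain HTTP
ETCDCTL_API=3 etcdctl --endpoints http://127.0.0.1:2379 \
    snapshot save "/opt/etcd_bak/etcd-$(date +%F-%T).db"

# prune snapshots older than a week
find /opt/etcd_bak/ -name 'etcd-*.db' -mtime +7 -delete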

There is a restore playbook, but it assumes the etcd v2 and v3 stores don't coexist; otherwise the backup can't be restored this way (on our production clusters the v2 store is disabled). The key part is lines 26 to 42 of the tasks file. I copied the 07/23 etcd backup file over from another master, changed the target host, and ran it:
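
For reference, those tasks are roughly the manual procedure below. A sketch for this member only, with the member name and peer URLs taken from the logs further down; etcd-002/etcd-003 stand in for the other members' real names, which the playbook derives from its inventory:

systemctl stop etcd
rm -rf /var/lib/etcd

# rebuild the data dir from the copied snapshot; add --skip-hash-check
# if the db was copied out of a member's data dir rather than produced
# by "etcdctl snapshot save"
ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd_bak/etcd-bak.db \
    --name etcd-001 \
    --data-dir /var/lib/etcd \
    --initial-advertise-peer-urls https://10.252.146.104:2380 \
    --initial-cluster etcd-001=https://10.252.146.104:2380,etcd-002=https://10.252.146.105:2380,etcd-003=https://10.252.146.106:2380

systemctl start etcd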

[root@k8s-m1 ~]# cd Kubernetes-ansible
[root@k8s-m1 Kubernetes-ansible]# ansible-playbook restoreETCD.yml -e 'db=/opt/etcd_bak/etcd-bak.db'

PLAY [10.252.146.104] **********************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] *********************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]

TASK [restoreETCD : fail] ******************************************************************************************************************************************************************************************************************
skipping: [10.252.146.104]

TASK [restoreETCD : 检测备份文件存在否] *************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]

TASK [restoreETCD : fail] ******************************************************************************************************************************************************************************************************************
skipping: [10.252.146.104]

TASK [restoreETCD : set_fact] **************************************************************************************************************************************************************************************************************
skipping: [10.252.146.104]

TASK [restoreETCD : set_fact] **************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]

TASK [restoreETCD : 停止etcd] ****************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]

TASK [restoreETCD : 删除etcd数据目录] ************************************************************************************************************************************************************************************************************
ok: [10.252.146.104] => (item=/var/lib/etcd)

TASK [restoreETCD : 分发备份文件] ****************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]

TASK [restoreETCD : 恢复备份] ******************************************************************************************************************************************************************************************************************
changed: [10.252.146.104]

TASK [restoreETCD : 启动etcd] ****************************************************************************************************************************************************************************************************************
fatal: [10.252.146.104]: FAILED! => {"changed": false, "msg": "Unable to start service etcd: Job for etcd.service failed because the control process exited with error code.\nSee \"systemctl status etcd.service\" and \"journalctl -xe\" for details.\n"}

PLAY RECAP *********************************************************************************************************************************************************************************************************************************
10.252.146.104             : ok=7    changed=1    unreachable=0    failed=1    skipped=3    rescued=0    ignored=0

Check the logs:

[root@k8s-m1 Kubernetes-ansible]# journalctl -xe -u etcd
Jul 23 11:27:46 k8s-m1 etcd[58954]: Loading server configuration from "/etc/etcd/etcd.config.yml"
Jul 23 11:27:46 k8s-m1 etcd[58954]: etcd Version: 3.3.20
Jul 23 11:27:46 k8s-m1 etcd[58954]: Git SHA: 9fd7e2b80
Jul 23 11:27:46 k8s-m1 etcd[58954]: Go Version: go1.12.17
Jul 23 11:27:46 k8s-m1 etcd[58954]: Go OS/Arch: linux/amd64
Jul 23 11:27:46 k8s-m1 etcd[58954]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jul 23 11:27:46 k8s-m1 etcd[58954]: the server is already initialized as member before, starting as etcd member...
Jul 23 11:27:46 k8s-m1 etcd[58954]: ignoring peer auto TLS since certs given
Jul 23 11:27:46 k8s-m1 etcd[58954]: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = /etc/kubernetes/pki/etcd/ca.crt, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = fals>
Jul 23 11:27:46 k8s-m1 etcd[58954]: listening for peers on https://10.252.146.104:2380
Jul 23 11:27:46 k8s-m1 etcd[58954]: ignoring client auto TLS since certs given
Jul 23 11:27:46 k8s-m1 etcd[58954]: pprof is enabled under /debug/pprof
Jul 23 11:27:46 k8s-m1 etcd[58954]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 23 11:27:46 k8s-m1 etcd[58954]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 23 11:27:46 k8s-m1 etcd[58954]: listening for client requests on 127.0.0.1:2379
Jul 23 11:27:46 k8s-m1 etcd[58954]: listening for client requests on 10.252.146.104:2379
Jul 23 11:27:47 k8s-m1 etcd[58954]: member ac2dcf6aed12e8f1 has already been bootstrapped
Jul 23 11:27:47 k8s-m1 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Jul 23 11:27:47 k8s-m1 systemd[1]: etcd.service: Failed with result 'exit-code'.
Jul 23 11:27:47 k8s-m1 systemd[1]: Failed to start Etcd Service.

The fix for "member xxxx has already been bootstrapped" is the following change in the config file; remember to change it back once etcd is up:

change initial-cluster-state: 'new' to initial-cluster-state: 'existing'
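
On this deployment the flag lives in /etc/etcd/etcd.config.yml (the file etcd loads in the logs above), so an in-place edit does it; just revert it once the member has rejoined:

# join the existing cluster instead of bootstrapping a new one
sed -i "s/initial-cluster-state: 'new'/initial-cluster-state: 'existing'/" /etc/etcd/etcd.config.yml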

After that it starts successfully:

[root@k8s-m1 Kubernetes-ansible]# systemctl start etcd
[root@k8s-m1 Kubernetes-ansible]# journalctl -xe -u etcd
Jul 23 11:27:55 k8s-m1 etcd[59889]: Loading server configuration from "/etc/etcd/etcd.config.yml"
Jul 23 11:27:55 k8s-m1 etcd[59889]: etcd Version: 3.3.20
Jul 23 11:27:55 k8s-m1 etcd[59889]: Git SHA: 9fd7e2b80
Jul 23 11:27:55 k8s-m1 etcd[59889]: Go Version: go1.12.17
Jul 23 11:27:55 k8s-m1 etcd[59889]: Go OS/Arch: linux/amd64
Jul 23 11:27:55 k8s-m1 etcd[59889]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jul 23 11:27:55 k8s-m1 etcd[59889]: found invalid file/dir wal under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
Jul 23 11:27:55 k8s-m1 etcd[59889]: the server is already initialized as member before, starting as etcd member...
Jul 23 11:27:55 k8s-m1 etcd[59889]: ignoring peer auto TLS since certs given
Jul 23 11:27:55 k8s-m1 etcd[59889]: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = /etc/kubernetes/pki/etcd/ca.crt, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = fals>
Jul 23 11:27:55 k8s-m1 etcd[59889]: listening for peers on https://10.252.146.104:2380
Jul 23 11:27:55 k8s-m1 etcd[59889]: ignoring client auto TLS since certs given
Jul 23 11:27:55 k8s-m1 etcd[59889]: pprof is enabled under /debug/pprof
Jul 23 11:27:55 k8s-m1 etcd[59889]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 23 11:27:55 k8s-m1 etcd[59889]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 23 11:27:55 k8s-m1 etcd[59889]: listening for client requests on 127.0.0.1:2379
Jul 23 11:27:55 k8s-m1 etcd[59889]: listening for client requests on 10.252.146.104:2379
Jul 23 11:27:55 k8s-m1 etcd[59889]: recovered store from snapshot at index 5952463
Jul 23 11:27:55 k8s-m1 etcd[59889]: restore compact to 4369703
Jul 23 11:27:55 k8s-m1 etcd[59889]: name = etcd-001
Jul 23 11:27:55 k8s-m1 etcd[59889]: data dir = /var/lib/etcd
Jul 23 11:27:55 k8s-m1 etcd[59889]: member dir = /var/lib/etcd/member
Jul 23 11:27:55 k8s-m1 etcd[59889]: dedicated WAL dir = /var/lib/etcd/wal
Jul 23 11:27:55 k8s-m1 etcd[59889]: heartbeat = 100ms
Jul 23 11:27:55 k8s-m1 etcd[59889]: election = 1000ms
Jul 23 11:27:55 k8s-m1 etcd[59889]: snapshot count = 5000
Jul 23 11:27:55 k8s-m1 etcd[59889]: advertise client URLs = https://10.252.146.104:2379
Jul 23 11:27:55 k8s-m1 etcd[59889]: restarting member ac2dcf6aed12e8f1 in cluster 367e2aebc6430cbe at commit index 5952491
Jul 23 11:27:55 k8s-m1 etcd[59889]: ac2dcf6aed12e8f1 became follower at term 47
Jul 23 11:27:55 k8s-m1 etcd[59889]: newRaft ac2dcf6aed12e8f1 [peers: [1e713be314744d53,8b1621b475555fd9,ac2dcf6aed12e8f1], term: 47, commit: 5952491, applied: 5952463, lastindex: 5952491, lastterm: 47]
Jul 23 11:27:55 k8s-m1 etcd[59889]: enabled capabilities for version 3.3
Jul 23 11:27:55 k8s-m1 etcd[59889]: added member 1e713be314744d53 [https://10.252.146.105:2380] to cluster 367e2aebc6430cbe from store
Jul 23 11:27:55 k8s-m1 etcd[59889]: added member 8b1621b475555fd9 [https://10.252.146.106:2380] to cluster 367e2aebc6430cbe from store
Jul 23 11:27:55 k8s-m1 etcd[59889]: added member ac2dcf6aed12e8f1 [https://10.252.146.104:2380] to cluster 367e2aebc6430cbe from store

Check the cluster status:

[root@k8s-m1 Kubernetes-ansible]# etcd-ha
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.252.146.104:2379 | ac2dcf6aed12e8f1 |  3.3.20 |  8.3 MB |     false |        47 |    5953557 |
| https://10.252.146.105:2379 | 1e713be314744d53 |  3.3.20 |  8.6 MB |     false |        47 |    5953557 |
| https://10.252.146.106:2379 | 8b1621b475555fd9 |  3.3.20 |  8.3 MB |      true |        47 |    5953557 |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+

Then I brought the three control-plane components (kube-apiserver, kube-controller-manager, kube-scheduler) and the kubelet back up.
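
A sketch of that step, assuming the components run as systemd units with these names (a binary deployment from the playbooks typically does; adjust to the real unit names):

# bring the control plane back, then the node agent
systemctl start kube-apiserver kube-controller-manager kube-scheduler
systemctl start kubelet

# confirm everything came up
systemctl is-active kube-apiserver kube-controller-manager kube-scheduler kubelet

With those running, the node went back to Ready: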

[root@k8s-m1 Kubernetes-ansible]# kubectl get node -o wide
NAME             STATUS   ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
10.252.146.104   Ready    <none>   30d   v1.16.9   10.252.146.104   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
10.252.146.105   Ready    <none>   30d   v1.16.9   10.252.146.105   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
10.252.146.106   Ready    <none>   30d   v1.16.9   10.252.146.106   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11

The pods were slowly self-healing too.
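
To watch the recovery, you can filter for pods that have not yet come back:

# list any pod across all namespaces that is not yet Running or Completed
kubectl get pods --all-namespaces | grep -Ev 'Running|Completed'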

