A record of handling a K8s cluster whose etcd pod died and lost its snapshot (no backup) after a forced VM power-off

 1 year ago
source link: https://liruilongs.github.io/2023/01/19/%E5%BE%85%E5%8F%91%E5%B8%83/%E4%BA%8C%E6%9C%9F/%E8%99%9A%E6%9C%BA%E5%BC%BA%E5%88%B6%E6%96%AD%E7%94%B5K8s-%E9%9B%86%E7%BE%A4-etcd-pod%E6%8C%82%E6%8E%89-%E9%97%AE%E9%A2%98%E8%A7%A3%E5%86%B3/

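"I wanted only to try to live in accord with the promptings which came from my true self. Why was that so very difficult?" (Hermann Hesse, "Demian")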


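  • I accidentally pulled the wrong power plug. The VMs were forcibly shut down, and after booting again the cluster was dead.
  • This post records how I dealt with it.
  • The power loss destroyed etcd snapshot data and there was no backup, so there is basically no way to recover.
  • You could ask a professional DBA to work on the data and see whether anything can be recovered.
  • The workaround in this post is to delete some of the files in the etcd data directory.
  • The cluster starts afterwards, but everything deployed on it is lost, including the CNI and the cluster's built-in DNS component.
  • If my understanding falls short anywhere, corrections are welcome.
  • Production or test, if you have no UPS, you must back up etcd for your k8s cluster. Back up etcd. Back up etcd. Important things are worth saying three times.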
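
The backup urged in the last bullet could look roughly like the sketch below. It assumes etcdctl (v3) is installed on the master and uses the endpoint and kubeadm certificate paths that appear in the etcd manifest later in this post. The function name etcd_backup, the DRY_RUN switch and the default output directory are hypothetical.

```shell
# Hypothetical helper: save an etcd snapshot to a timestamped file.
# Endpoint and certificate paths are taken from etcd.yaml; adjust for your cluster.
etcd_backup() {
  out_dir="${1:-/var/backups/etcd}"   # assumed location, not from the original post
  out_file="${out_dir}/etcd-$(date +%Y%m%d-%H%M%S).db"
  set -- etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    snapshot save "$out_file"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only print the command; nothing is contacted or written.
    echo "$*"
    return 0
  fi
  mkdir -p "$out_dir" && ETCDCTL_API=3 "$@"
}
```

Running it from cron on the master and copying the resulting file off the machine would have made the rest of this post unnecessary. Set DRY_RUN=1 first to see the exact command without touching etcd.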


Current state of the cluster

┌──[[email protected]]-[~]
└─$kubectl get nodes
The connection to the server 192.168.26.81:6443 was refused - did you specify the right host or port?

Restart docker and kubelet to try to bring the cluster back up

┌──[[email protected]]-[~]
└─$systemctl restart docker
┌──[[email protected]]-[~]
└─$systemctl restart kubelet.service

Still no luck. Look at the kubelet logs on the master node

┌──[[email protected]]-[~]
└─$journalctl -u kubelet.service -f
1月 19 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.703418 11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.804201 11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.905156 11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.005487 11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.105648 11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.186066 11344 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://192.168.26.81:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/vms81.liruilongs.github.io?timeout=10s": dial tcp 192.168.26.81:6443: connect: connection refused
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.205785 11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"

Use docker to see which pod containers currently exist

┌──[[email protected]]-[~]
└─$docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d9d6471ce936 b51ddc1014b0 "kube-scheduler --au…" 17 minutes ago Up 17 minutes k8s_kube-scheduler_kube-scheduler-vms81.liruilongs.github.io_kube-system_e1b874bfdef201d69db10b200b8f47d5_14
010c1b8c30c6 5425bcbd23c5 "kube-controller-man…" 17 minutes ago Up 17 minutes k8s_kube-controller-manager_kube-controller-manager-vms81.liruilongs.github.io_kube-system_49b7654103f80170bfe29d034f806256_15
7e215924a1dd registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 18 minutes ago Up About a minute k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
f557435d150e registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 18 minutes ago Up 18 minutes k8s_POD_kube-scheduler-vms81.liruilongs.github.io_kube-system_e1b874bfdef201d69db10b200b8f47d5_7
5deaffbc555a registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 18 minutes ago Up 18 minutes k8s_POD_kube-controller-manager-vms81.liruilongs.github.io_kube-system_49b7654103f80170bfe29d034f806256_7
a418c2ce33f2 registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 18 minutes ago Up 18 minutes k8s_POD_kube-apiserver-vms81.liruilongs.github.io_kube-system_a35cb37b6c90c72f607936b33161eefe_6

etcd is not running, and neither is kube-apiserver: only their pause containers show up.

┌──[[email protected]]-[~]
└─$docker ps -a | grep etcd
b5e18722315b 004811815584 "etcd --advertise-cl…" 5 minutes ago Exited (2) About a minute ago k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_19
7e215924a1dd registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 21 minutes ago Up 4 minutes k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
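
The long container names in these listings follow the naming scheme kubelet uses with docker: k8s, container name, pod name, namespace, pod UID and restart attempt, joined by underscores. As a small sketch (the function name parse_k8s_container_name is made up for illustration), the fields can be split out like this:

```shell
# Hypothetical helper: split a kubelet-managed docker container name into its fields.
parse_k8s_container_name() {
  (
    # Split on "_" inside a subshell so the caller's IFS is left untouched.
    IFS=_
    set -- $1
    # $1 is the "k8s" prefix and $5 is the pod UID; both are skipped here.
    echo "container=$2 pod=$3 namespace=$4 attempt=$6"
  )
}

parse_k8s_container_name \
  k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_19
# prints: container=etcd pod=etcd-vms81.liruilongs.github.io namespace=kube-system attempt=19
```

The trailing number is the restart attempt, so the etcd container here has already been restarted 19 times.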

Try restarting etcd

┌──[[email protected]]-[~]
└─$docker restart b5e18722315b
b5e18722315b

Check whether it started

┌──[[email protected]]-[~]
└─$docker ps -a | grep etcd
b5e18722315b 004811815584 "etcd --advertise-cl…" 5 minutes ago Exited (2) About a minute ago k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_19
7e215924a1dd registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 21 minutes ago Up 4 minutes k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[[email protected]]-[~]
└─$docker logs b5e18722315b

Look at the etcd logs

┌──[[email protected]]-[~]
└─$docker logs 8a53cbc545e4
..................................................
{"level":"info","ts":"2023-01-19T01:34:24.332Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"5.557212ms"}
{"level":"warn","ts":"2023-01-19T01:34:24.332Z","caller":"wal/util.go:90","msg":"ignored file in WAL directory","path":"0000000000000014-0000000000185aba.wal.broken"}
{"level":"info","ts":"2023-01-19T01:34:24.770Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":26912747,"snapshot-size":"42 kB"}
{"level":"warn","ts":"2023-01-19T01:34:24.771Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":26912747,"snapshot-file-path":"/var/lib/etcd/member/snap/00000000019aa7eb.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2023-01-19T01:43:31.738Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:244\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:227\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:122\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/home/remote/sbatsche/.gvm/gos/go1.16.3/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot

goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000114600, 0xc000588240, 0x1, 0x1)
/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000080960, 0x122e2fc, 0x2a, 0xc000588240, 0x1, 0x1)
/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/[email protected]/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0x7ffe54af1e25, 0x1a, 0x0, 0x0, 0x0, 0x0, 0xc0004cf830, 0x1, 0x1, 0xc0004cfa70, ...)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc0000ee000, 0xc0000ee600, 0x0, 0x0)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:244 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc0000ee000, 0x1202a6f, 0x6, 0xc000428401, 0x2)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:227 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a120, 0x12, 0x12)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:122 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a120, 0x12, 0x12)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/main.go:40 +0x11f
main.main()
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/main.go:32 +0x45

The key part of the panic:

"msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)"

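The power loss corrupted the data files. etcd tries to recover its v3 backend from a snapshot (it looks for /var/lib/etcd/member/snap/00000000019aa7eb.snap.db), but that file does not exist.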
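
The file name in the warning is just the snapshot-index from the same log entry, written as 16 hexadecimal digits with a .snap.db suffix. A minimal sketch of that mapping (the function name snap_db_name is made up for illustration):

```shell
# Hypothetical helper: turn a decimal snapshot index into the
# <16 hex digits>.snap.db file name that etcd looks for.
snap_db_name() {
  printf '%016x.snap.db\n' "$1"
}

snap_db_name 26912747   # prints 00000000019aa7eb.snap.db
```

The same hex index, 19aa7eb, appears in the newest .snap file in the data directory, which is how the log line can be matched to the files on disk.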

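Well, there is no backup here, so there is basically no way to repair it. The only option left is to reset the cluster with kubeadm.

Some remedial measures

If you want to get the cluster started some other way, for example to retrieve some of its current configuration, you can try the approach below. Be warned: when I used it on my cluster, all pod data was lost and in the end I had to reset the cluster anyway.

If you do use the approach below, be sure to back up the etcd data files before deleting them.

etcd on the master runs as a static pod, so look at its yaml file to see where the data directory is configured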

┌──[[email protected]]-[~]
└─$cd /etc/kubernetes/manifests/
┌──[[email protected]]-[/etc/kubernetes/manifests]
└─$ls
etcd.yaml kube-apiserver.yaml kube-controller-manager.yaml kube-scheduler.yaml

- --data-dir=/var/lib/etcd

┌──[[email protected]]-[/etc/kubernetes/manifests]
└─$cat etcd.yaml | grep -e "--"
- --advertise-client-urls=https://192.168.26.81:2379
- --cert-file=/etc/kubernetes/pki/etcd/server.crt
- --client-cert-auth=true
- --data-dir=/var/lib/etcd
- --initial-advertise-peer-urls=https://192.168.26.81:2380
- --initial-cluster=vms81.liruilongs.github.io=https://192.168.26.81:2380
- --key-file=/etc/kubernetes/pki/etcd/server.key
- --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.81:2379
- --listen-metrics-urls=http://127.0.0.1:2381
- --listen-peer-urls=https://192.168.26.81:2380
- --name=vms81.liruilongs.github.io
- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
- --peer-client-cert-auth=true
- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
- --snapshot-count=10000
- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

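These are the corresponding data files. You can try to repair them. If you just want the cluster to start again quickly, you can back them up and then delete the .snap and .wal files, as in the steps that follow.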

┌──[[email protected]]-[/var/lib/etcd/member]
└─$tree
.
├── snap
│   ├── 0000000000000058-00000000019a0ba7.snap
│   ├── 0000000000000058-00000000019a32b8.snap
│   ├── 0000000000000058-00000000019a59c9.snap
│   ├── 0000000000000058-00000000019a80da.snap
│   ├── 0000000000000058-00000000019aa7eb.snap
│   └── db
└── wal
    ├── 0000000000000014-0000000000185aba.wal.broken
    ├── 0000000000000142-0000000001963c0e.wal
    ├── 0000000000000143-0000000001977bbe.wal
    ├── 0000000000000144-0000000001986aa6.wal
    ├── 0000000000000145-0000000001995ef6.wal
    ├── 0000000000000146-00000000019a544d.wal
    └── 1.tmp

2 directories, 13 files

Back up the data files, then delete the .snap and .wal files

┌──[[email protected]]-[/var/lib/etcd]
└─$ls
member
┌──[[email protected]]-[/var/lib/etcd]
└─$tar -cvf member.tar member/
member/
member/snap/
member/snap/db
member/snap/0000000000000058-00000000019a0ba7.snap
member/snap/0000000000000058-00000000019a32b8.snap
member/snap/0000000000000058-00000000019a59c9.snap
member/snap/0000000000000058-00000000019a80da.snap
member/snap/0000000000000058-00000000019aa7eb.snap
member/wal/
member/wal/0000000000000142-0000000001963c0e.wal
member/wal/0000000000000144-0000000001986aa6.wal
member/wal/0000000000000014-0000000000185aba.wal.broken
member/wal/0000000000000145-0000000001995ef6.wal
member/wal/0000000000000146-00000000019a544d.wal
member/wal/1.tmp
member/wal/0000000000000143-0000000001977bbe.wal
┌──[[email protected]]-[/var/lib/etcd]
└─$ls
member member.tar
┌──[[email protected]]-[/var/lib/etcd]
└─$mv member.tar /tmp/
┌──[[email protected]]-[/var/lib/etcd]
└─$
┌──[[email protected]]-[/var/lib/etcd]
└─$rm -rf member/snap/*.snap
┌──[[email protected]]-[/var/lib/etcd]
└─$rm -rf member/wal/*.wal
┌──[[email protected]]-[/var/lib/etcd]
└─$

Restart the corresponding docker container, or restart kubelet.

┌──[[email protected]]-[/var/lib/etcd]
└─$docker ps -a | grep etcd
a3b97cb34d9b 004811815584 "etcd --advertise-cl…" 2 minutes ago Exited (2) 2 minutes ago k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_45
7e215924a1dd registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 3 hours ago Up 2 hours k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[[email protected]]-[/var/lib/etcd]
└─$docker start a3b97cb34d9b
a3b97cb34d9b
┌──[[email protected]]-[/var/lib/etcd]
└─$docker ps -a | grep etcd
e1fc068247af 004811815584 "etcd --advertise-cl…" 3 seconds ago Up 2 seconds k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_46
a3b97cb34d9b 004811815584 "etcd --advertise-cl…" 3 minutes ago Exited (2) 3 seconds ago k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_45
7e215924a1dd registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 3 hours ago Up 2 hours k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[[email protected]]-[/var/lib/etcd]
└─$

Check the node status

┌──[[email protected]]-[/var/lib/etcd]
└─$kubectl get nodes
NAME STATUS ROLES AGE VERSION
vms155.liruilongs.github.io Ready <none> 76s v1.22.2
vms81.liruilongs.github.io Ready <none> 76s v1.22.2
vms82.liruilongs.github.io Ready <none> 76s v1.22.2
vms83.liruilongs.github.io Ready <none> 76s v1.22.2
┌──[[email protected]]-[/var/lib/etcd]
└─$

List all pods currently in the cluster.

┌──[[email protected]]-[~/ansible/kubevirt]
└─$kubectl get pods -A
NAME READY STATUS RESTARTS AGE
etcd-vms81.liruilongs.github.io 1/1 Running 48 (3h35m ago) 3h53m
kube-apiserver-vms81.liruilongs.github.io 1/1 Running 48 (3h35m ago) 3h51m
kube-controller-manager-vms81.liruilongs.github.io 1/1 Running 17 (3h35m ago) 3h51m
kube-scheduler-vms81.liruilongs.github.io 1/1 Running 16 (3h35m ago) 3h52m

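The network pods are all gone, and the k8s DNS component has not come up either. The network would have to be configured again, which is a bit of a hassle. Normally, if the network components are not running, every node should be NotReady, so this looks a little odd. I was short on time and needed the cluster for experiments, so I reset it with kubeadm.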
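
A quick way to confirm which add-ons disappeared is to ask the API server for them by name. The sketch below assumes a kubeadm cluster (coredns Deployment, kube-proxy DaemonSet) with Calico installed from calico.yaml (calico-node DaemonSet); the resource list and the function name missing_addons are assumptions, not something taken from the original cluster.

```shell
# Hypothetical helper: print each expected kube-system add-on that no longer exists.
missing_addons() {
  # Assumed add-on names for a kubeadm + Calico cluster; edit to match yours.
  for res in deployment/coredns daemonset/kube-proxy daemonset/calico-node; do
    # kubectl get exits non-zero when the resource is not found.
    kubectl -n kube-system get "$res" >/dev/null 2>&1 || echo "$res"
  done
}
```

Anything it prints has to be reinstalled by hand, or comes back with a kubeadm reset followed by a fresh init.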

┌──[[email protected]]-[~/ansible]
└─$kubectl apply -f calico.yaml

https://github.com/etcd-io/etcd/issues/11949

