etcd Cluster Backup, Data Recovery, and Operational Tuning

source link: https://www.maideliang.com/index.php/archives/25/

2018.05.29, original article

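Scheduled snapshot backups

Run the backup script from crontab every half hour and keep both a local and an off-site copy (provisionally: keep the 10 most recent backups locally and one month of backups off-site).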

#!/bin/bash
# Save an etcd v3 snapshot and keep only the 10 most recent local copies.
ETCDCTL_PATH='/usr/etcd/bin/etcdctl'
ENDPOINTS='1.1.1.1:2379,1.1.1.2:2379,1.1.1.3:2379'
BACKUP_DIR='/home/apps/backup'
DATE=$(date +%Y%m%d-%H%M%S)
mkdir -p "$BACKUP_DIR"
export ETCDCTL_API=3
# Note: newer etcdctl releases accept only a single endpoint for "snapshot save".
"$ETCDCTL_PATH" --endpoints="$ENDPOINTS" snapshot save "$BACKUP_DIR/snapshot-$DATE.db"
# Delete everything except the 10 newest snapshots.
cd "$BACKUP_DIR" && ls -1t snapshot-*.db | tail -n +11 | xargs -r rm -f
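
The backup script only prunes the local copies. For the off-site copy with its one-month retention, an age-based cleanup could be sketched as follows. The function name prune_old_snapshots and the 30-day figure are assumptions for illustration, and the sketch relies on GNU find supporting -delete.

```shell
#!/bin/bash
# Hypothetical helper: delete snapshot files older than a given number of days.
# Usage: prune_old_snapshots <directory> <days>
prune_old_snapshots() {
    local dir=$1 days=$2
    # -mtime +N matches files whose age in whole days exceeds N.
    find "$dir" -maxdepth 1 -type f -name 'snapshot-*.db' -mtime +"$days" -delete
}

# Example: keep roughly one month of off-site backups.
# prune_old_snapshots /home/apps/backup 30
```

Pruning by age rather than by count suits the off-site copy, because the number of files per month depends on how often the backup job runs.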

Mirror-cluster backup

Deploy the mirror cluster in advance.

Real-time full synchronization

nohup etcdctl make-mirror <destination>  &> /apps/logs/etcdmirror.log &

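make-mirror prints the number of keys synced so far every 30 seconds; with the redirect above, that output is saved in /apps/logs/etcdmirror.log.

Periodic synchronization

Run the sync from crontab at intervals instead of continuously. If data is accidentally deleted or wiped, the mirror still holds the data as of the previous sync cycle, which avoids a catastrophic loss.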

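#!/bin/bash
# Run make-mirror for 90 seconds, then stop it; intended to be run from crontab.
ETCDCTL_PATH='/apps/svr/etcd/bin/etcdctl'
ENDPOINTS='1.1.1.1:2379,1.1.1.2:2379,1.1.1.3:2379'
DEST_ENDPOINT='1.1.1.1:2389'
BaseName=$(basename "${BASH_SOURCE[0]}")
PID_FILE="/tmp/$BaseName.pid"
export ETCDCTL_API=3
# Start the mirror in the background and record its PID.
"$ETCDCTL_PATH" make-mirror --endpoints="$ENDPOINTS" "$DEST_ENDPOINT" &
echo $! > "$PID_FILE"
# Give the sync 90 seconds, then terminate it.
sleep 90
kill "$(cat "$PID_FILE")"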

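etcd cluster recovery

Like a ZooKeeper cluster, an etcd cluster should be built from an odd number of members. A cluster of N members tolerates up to (N-1)/2 failures: with N = 3, at most one member can fail without affecting the cluster.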
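
The fault-tolerance arithmetic can be checked with a few lines of shell. This is only an illustrative sketch; the function names quorum and fault_tolerance are made up for this example.

```shell
#!/bin/bash
# Majority (quorum) needed for a cluster of N members.
quorum() { echo $(( $1 / 2 + 1 )); }
# Number of members that can fail while a quorum survives.
fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

for n in 1 2 3 4 5; do
    echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

A 4-member cluster tolerates the same single failure as a 3-member one, which is why odd sizes are recommended.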

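Leader failure

When the leader fails, the etcd cluster automatically elects a new leader. Failure detection is timeout-based (see the heartbeat-interval and election-timeout settings), so electing a new leader takes roughly one election timeout.
During the election the cluster cannot process any writes. Write requests sent during the election are queued until a new leader is elected.
Writes already sent to the old leader but not yet committed may be lost, because the new leader is allowed to overwrite any uncommitted entries of the old leader. From the user's point of view, some write requests may time out; however, no committed write is ever lost.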
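
Because writes can time out while an election is in progress, a client script may want to retry them. The following is a minimal sketch of such a retry loop, not something from the original article; the helper name retry and its arguments are assumptions, and retrying is only safe for idempotent writes such as putting the same value again.

```shell
#!/bin/bash
# Hypothetical helper: retry a command up to <attempts> times,
# pausing <delay> seconds between tries.
# Usage: retry <attempts> <delay> <command> [args...]
retry() {
    local attempts=$1 delay=$2 i
    shift 2
    for (( i = 1; i <= attempts; i++ )); do
        "$@" && return 0
        # Do not sleep after the final failed attempt.
        (( i < attempts )) && sleep "$delay"
    done
    return 1
}

# Example (assumes etcdctl is on PATH and ETCDCTL_API=3 is exported):
# retry 5 1 etcdctl put hello nihao
```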

  • Check the current cluster state
[root@ETCD-CLUSTER-001 bin]# export ETCDCTL_API=3;./etcdctl --write-out="table" --endpoints='10.201.46.112:2379,10.201.46.113:2379,10.201.46.114:2379' endpoint status 
+--------------------+------------------+---------+---------+-----------+-----------+------------+
|      ENDPOINT      |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+--------------------+------------------+---------+---------+-----------+-----------+------------+
| 10.201.46.112:2379 | fd86741fb271523a | 3.1.10  | 25 kB   | false     |         2 |         10 |
| 10.201.46.113:2379 | fe2dd3624258de7  | 3.1.10  | 25 kB   | false     |         2 |         10 |
| 10.201.46.114:2379 | c649fd5192da5ca1 | 3.1.10  | 25 kB   | true      |         2 |         10 |
+--------------------+------------------+---------+---------+-----------+-----------+------------+
[root@ETCD-CLUSTER-001 bin]# export ETCDCTL_API=3;./etcdctl put hello nihao    
OK
  • Stop the leader
  • Check the logs: a new leader had been elected by the third election cycle
    (log screenshots in the original article)

  • The cluster keeps working normally
[root@ETCD-CLUSTER-001 bin]# export ETCDCTL_API=3;./etcdctl get hello
hello
nihao

Follower failure

  • Check the current cluster state
    (screenshot in the original article)
  • Stop one follower
[root@ETCD-CLUSTER-003 ~]# sh /apps/sh/etcd.sh stop
Stopping etcd:                                             [  OK  ]

(screenshot in the original article)

  • The cluster keeps working normally
[root@ETCD-CLUSTER-001 bin]# export ETCDCTL_API=3;./etcdctl get hello
hello
nihao

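Disaster recovery

To recover from a disastrous failure, etcd v3 provides snapshot and restore facilities that recreate the cluster without losing v3 key data.
Restoring a cluster needs only a single snapshot "db" file. A restore with etcdctl snapshot restore creates new etcd data directories; all members should restore from the same snapshot. Restoring overwrites some snapshot metadata (specifically the member ID and cluster ID), so the member loses its former identity. This metadata overwrite prevents the new member from inadvertently joining an existing cluster. Therefore, to start a cluster from a snapshot, the restore must start a new logical cluster.

Snapshot integrity can optionally be verified at restore time. A snapshot taken with etcdctl snapshot save carries an integrity hash that etcdctl snapshot restore checks. A snapshot copied directly from the data directory (data-dir in the configuration file) has no integrity hash and can only be restored with --skip-hash-check.

Snapshotting the keyspace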

[root@ETCD-CLUSTER-001 bin]# export ETCDCTL_API=3;./etcdctl  --endpoints='10.201.46.112:2379,10.201.46.113:2379,10.201.46.114:2379' snapshot save ~/.snapshot.db
Snapshot saved at /root/.snapshot.db

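Restoring from the snapshot

On each machine, run the restore command that carries that member's own name and peer URL: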

$ etcdctl snapshot restore snapshot.db \
  --name ETCD-CLUSTER-001 \
  --initial-cluster ETCD-CLUSTER-001=http://10.201.46.112:2380,ETCD-CLUSTER-002=http://10.201.46.113:2380,ETCD-CLUSTER-003=http://10.201.46.114:2380 \
  --initial-advertise-peer-urls http://10.201.46.112:2380
$ etcdctl snapshot restore snapshot.db \
  --name ETCD-CLUSTER-002 \
  --initial-cluster ETCD-CLUSTER-001=http://10.201.46.112:2380,ETCD-CLUSTER-002=http://10.201.46.113:2380,ETCD-CLUSTER-003=http://10.201.46.114:2380 \
  --initial-advertise-peer-urls http://10.201.46.113:2380
$ etcdctl snapshot restore snapshot.db \
  --name ETCD-CLUSTER-003 \
  --initial-cluster ETCD-CLUSTER-001=http://10.201.46.112:2380,ETCD-CLUSTER-002=http://10.201.46.113:2380,ETCD-CLUSTER-003=http://10.201.46.114:2380 \
  --initial-advertise-peer-urls http://10.201.46.114:2380
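
The three restore commands differ only in the member name and peer URL, so they can be generated from a list of members. The following is a sketch under stated assumptions: the helper name print_restore_commands and its NAME=IP argument format are invented for this example, and peers are assumed to use plain http on port 2380 as in the commands above. It only prints the commands; it does not run them.

```shell
#!/bin/bash
# Hypothetical helper: print one "snapshot restore" command per member.
# Each member argument has the form NAME=IP; peers are assumed to use http on port 2380.
print_restore_commands() {
    local snapshot=$1 member cluster=""
    shift
    # Build the --initial-cluster value shared by all members.
    for member in "$@"; do
        cluster+="${cluster:+,}${member%%=*}=http://${member#*=}:2380"
    done
    # Emit the command for each member with its own name and peer URL.
    for member in "$@"; do
        echo "etcdctl snapshot restore $snapshot --name ${member%%=*}" \
             "--initial-cluster $cluster" \
             "--initial-advertise-peer-urls http://${member#*=}:2380"
    done
}

print_restore_commands snapshot.db \
    ETCD-CLUSTER-001=10.201.46.112 \
    ETCD-CLUSTER-002=10.201.46.113 \
    ETCD-CLUSTER-003=10.201.46.114
```

Generating the commands from one list keeps the --initial-cluster value identical on every machine, which is easy to get wrong when editing three copies by hand.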

Start etcd on each machine

sudo sh /apps/sh/etcd.sh start

Removing and adding members

  • List the current members
[root@ETCD-CLUSTER-001 etcd-v3.2.17-linux-amd64]# export ETCDCTL_API=3;/apps/svr/etcd/bin/etcdctl --write-out="table" --endpoints='1.1.1.1:2379,1.1.1.2:2379,1.1.1.3:2379' member list
+------------------+---------+------------------+---------------------+-----------------------+
|        ID        | STATUS  |       NAME       |     PEER ADDRS      |     CLIENT ADDRS      |
+------------------+---------+------------------+---------------------+-----------------------+
| 28c987fd1ef634f8 | started | ETCD-CLUSTER-003 | http://1.1.1.3:2380 | http://localhost:2379 |
| 635b8eabdf3280ef | started | ETCD-CLUSTER-002 | http://1.1.1.2:2380 | http://localhost:2379 |
| e9a434659e36d3bc | started | ETCD-CLUSTER-001 | http://1.1.1.1:2380 | http://localhost:2379 |
+------------------+---------+------------------+---------------------+-----------------------+
[root@ETCD-CLUSTER-001 etcd-v3.2.17-linux-amd64]# export ETCDCTL_API=3;/apps/svr/etcd/bin/etcdctl  --endpoints='1.1.1.1:2379,1.1.1.2:2379,1.1.1.3:2379' member remove e9a434659e36d3bc
Member e9a434659e36d3bc removed from cluster 7055108fef63cdab
[root@ETCD-CLUSTER-001 etcd-v3.2.17-linux-amd64]# export ETCDCTL_API=3;/apps/svr/etcd/bin/etcdctl --write-out="table" --endpoints='1.1.1.1:2379,1.1.1.2:2379,1.1.1.3:2379' member list
+------------------+---------+------------------+---------------------+-----------------------+
|        ID        | STATUS  |       NAME       |     PEER ADDRS      |     CLIENT ADDRS      |
+------------------+---------+------------------+---------------------+-----------------------+
| 28c987fd1ef634f8 | started | ETCD-CLUSTER-003 | http://1.1.1.3:2380 | http://localhost:2379 |
| 635b8eabdf3280ef | started | ETCD-CLUSTER-002 | http://1.1.1.2:2380 | http://localhost:2379 |
+------------------+---------+------------------+---------------------+-----------------------+
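
In a script, the member ID passed to member remove can be looked up by name instead of being copied from the table. This sketch assumes the default (non-table) member list output, in which each line is a comma-separated record of ID, status, name, peer URLs and client URLs; the helper name member_id_by_name is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read the default (comma-separated) "member list" output
# on stdin and print the ID of the member with the given name.
member_id_by_name() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Example (assumes ETCDCTL_API=3 is exported and the endpoints are reachable):
# id=$(etcdctl --endpoints="$ENDPOINTS" member list | member_id_by_name ETCD-CLUSTER-001)
# etcdctl --endpoints="$ENDPOINTS" member remove "$id"
```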

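Adding a member to the cluster

Mind the order of the steps: register the member with member add first, and only then start etcd on the new node.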

[root@ETCD-CLUSTER-001 ]# export ETCDCTL_API=3;/apps/svr/etcd/bin/etcdctl  --endpoints='1.1.1.2:2379,1.1.1.3:2379' member add ETCD-CLUSTER-001  --peer-urls=http://1.1.1.1:2380
Member 433fd69a958b8432 added to cluster 7055108fef63cdab
ETCD_NAME="ETCD-CLUSTER-001"
ETCD_INITIAL_CLUSTER="ETCD-CLUSTER-003=http://1.1.1.3:2380,ETCD-CLUSTER-001=http://1.1.1.1:2380,ETCD-CLUSTER-002=http://1.1.1.2:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
[root@ETCD-CLUSTER-001 ]# export ETCDCTL_API=3;/apps/svr/etcd/bin/etcdctl --write-out="table" --endpoints='1.1.1.2:2379,1.1.1.3:2379' member list            
+------------------+-----------+------------------+---------------------+-----------------------+
|        ID        |  STATUS   |       NAME       |     PEER ADDRS      |     CLIENT ADDRS      |
+------------------+-----------+------------------+---------------------+-----------------------+
| 28c987fd1ef634f8 | started   | ETCD-CLUSTER-003 | http://1.1.1.3:2380 | http://localhost:2379 |
| 433fd69a958b8432 | unstarted |                  | http://1.1.1.1:2380 |                       |
| 635b8eabdf3280ef | started   | ETCD-CLUSTER-002 | http://1.1.1.2:2380 | http://localhost:2379 |
+------------------+-----------+------------------+---------------------+-----------------------+

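On the new member, delete the old data directory, change the initial cluster state in the configuration file from new to existing, and then start the etcd service so that it joins the cluster.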

[root@ETCD-CLUSTER-001 ~]# vim /apps/conf/etcd/etcd.conf 
initial-cluster-state: "existing"


[root@ETCD-CLUSTER-001 ~]# export ETCDCTL_API=3;/apps/svr/etcd/bin/etcdctl --write-out="table" --endpoints='1.1.1.2:2379,1.1.1.3:2379' member list
+------------------+---------+------------------+---------------------+-----------------------+
|        ID        | STATUS  |       NAME       |     PEER ADDRS      |     CLIENT ADDRS      |
+------------------+---------+------------------+---------------------+-----------------------+
| 28c987fd1ef634f8 | started | ETCD-CLUSTER-003 | http://1.1.1.3:2380 | http://localhost:2379 |
| 433fd69a958b8432 | started | ETCD-CLUSTER-001 | http://1.1.1.1:2380 | http://localhost:2379 |
| 635b8eabdf3280ef | started | ETCD-CLUSTER-002 | http://1.1.1.2:2380 | http://localhost:2379 |
+------------------+---------+------------------+---------------------+-----------------------+
 
[root@ETCD-CLUSTER-001 ~]# export ETCDCTL_API=3;/apps/svr/etcd/bin/etcdctl --write-out="table" --endpoints='1.1.1.1:2379,1.1.1.2:2379,1.1.1.3:2379' endpoint status
+--------------+------------------+---------+---------+-----------+-----------+------------+
|   ENDPOINT   |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+--------------+------------------+---------+---------+-----------+-----------+------------+
| 1.1.1.1:2379 | 433fd69a958b8432 | 3.1.10  | 98 kB   | false     |         3 |         13 |
| 1.1.1.2:2379 | 635b8eabdf3280ef | 3.1.10  | 98 kB   | false     |         3 |         13 |
| 1.1.1.3:2379 | 28c987fd1ef634f8 | 3.1.10  | 98 kB   | true      |         3 |         13 |
+--------------+------------------+---------+---------+-----------+-----------+------------+

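Migrating etcd v2 data to v3

# Before migrating, stop the service on each machine in turn
sh /apps/sh/etcd.sh stop
# Migrate the data; set the data directory paths to match your deployment.
# migrate works offline on the data directory, so no --endpoints is needed.
export ETCDCTL_API=3
etcdctl migrate --data-dir="default.etcd" --wal-dir="default.etcd/member/wal"

# Start the service on each machine in turn
sh /apps/sh/etcd.sh start
# Confirm that the data has been migrated
export ETCDCTL_API=3;etcdctl --endpoints="$ENDPOINTS" get /foo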

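Tuning the snapshot count

--snapshot-count: the number of committed transactions that triggers a snapshot to disk. Before v3.2 the default was 10,000; from v3.2 onward it is 100,000.

This value should be neither too low nor too high. Too low causes frequent I/O pressure; too high increases memory usage and slows down etcd garbage collection. A value between 100,000 and 200,000 is recommended.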

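Compacting history

If the keyspace runs for a long time without compaction, it eventually reaches its storage limit. The cluster then enters a state in which only reads and deletes are allowed and writes are rejected. Compacting the history of the cluster is therefore necessary.

Compaction does not remove current data. It only removes historical versions of the data; after compaction those historical versions can no longer be read, but access to the latest data is unaffected.

Compacting with the client tool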

# Compact away the history older than revision 10
[apps@test ~]$ export ETCDCTL_API=3;/apps/svr/etcd/bin/etcdctl compaction 10
compacted revision 10
 
# Reading data from before revision 10 now reports that the revision has been compacted
[apps@test ~]$ export ETCDCTL_API=3;/apps/svr/etcd/bin/etcdctl get aa --rev=9
Error:  etcdserver: mvcc: required revision has been compacted
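
Instead of hard-coding the revision, a script can read the current one from the endpoint status. This is a sketch that assumes endpoint status --write-out=json prints compact JSON containing a "revision" field; the helper name current_revision is made up for illustration. Note that compacting at the current revision discards all earlier history.

```shell
#!/bin/bash
# Hypothetical helper: pull the first "revision" number out of the JSON
# printed by "etcdctl endpoint status --write-out=json" (read from stdin).
current_revision() {
    grep -o '"revision":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}

# Example (assumes ETCDCTL_API=3 is exported and a local endpoint is reachable):
# rev=$(etcdctl endpoint status --write-out=json | current_revision)
# etcdctl compaction "$rev"
```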

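Starting etcd with --auto-compaction-retention=1 enables automatic compaction that keeps one hour of key history.

After compaction, the removed revisions leave internal fragmentation in the backend database. Internal fragmentation is space that is free for the backend to reuse but still occupies disk storage. Defragmentation releases this space back to the file system.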

[apps@test ~]$ etcdctl defrag
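
Defragmentation acts on the endpoints it is given, and a member is busy while it rebuilds its database, so it can be gentler to defragment one member at a time. The loop below is a sketch of that idea, not a procedure from the original article; the helper name split_endpoints is an assumption.

```shell
#!/bin/bash
# Hypothetical helper: print each endpoint of a comma-separated list on its own line.
split_endpoints() {
    printf '%s\n' "$1" | tr ',' '\n'
}

# Example (assumes ETCDCTL_API=3 is exported and etcdctl is on PATH):
# for ep in $(split_endpoints '1.1.1.1:2379,1.1.1.2:2379,1.1.1.3:2379'); do
#     etcdctl --endpoints="$ep" defrag
# done
```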
