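During a project, the customer planned to migrate the production servers from CentOS to Red Hat. The migration failed on one of the k8s master nodes. A VM snapshot had been taken before the operation, but after rolling back to it the etcd service on that node no longer worked.
The logs showed: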
tocommit(xxx) is out of range [lastIndex]. was the raft log corrupted, truncated, or lost
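This error usually arises as follows:
- The faulty node did not take part in cluster replication for some time;
- When it rejoins, the leader finds that its log index is out of step with the rest of the cluster;
- The gap is too large for the leader to repair through normal append entries (the leader's snapshot is also too new for the node to catch up from);
- The follower ends up receiving a tocommit index beyond its own last log index and raises this error.
In this case the snapshot rollback restored an older copy of the etcd data directory, so the node's raft log was behind what the rest of the cluster had already recorded for it.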
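The check behind this message can be illustrated with a small shell sketch. It is a simplified illustration only, not etcd's actual code (etcd is written in Go); the function name check_commit and the sample index values are hypothetical.

```shell
# Simplified illustration of the follower-side check (hypothetical helper,
# not etcd code): a commit index beyond the local last log index is rejected.
check_commit() {
  local tocommit="$1" last_index="$2"
  if [ "$tocommit" -gt "$last_index" ]; then
    echo "tocommit($tocommit) is out of range [lastIndex($last_index)]"
    return 1
  fi
  echo "ok"
}
```

For example, check_commit 120 100 reports the out-of-range condition, which mirrors a node whose log was rolled back to index 100 while the leader asks it to commit up to 120.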
etcd runs in docker, and after the snapshot was restored the container kept exiting.
Approach
Check k8s status
kubectl get node -o wide
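On a larger cluster it may help to list only the nodes that are not healthy. The following is a minimal sketch; the helper name not_ready_nodes is hypothetical, and it assumes the STATUS value is the second column of the kubectl output.

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads `kubectl get node --no-headers` output on stdin.
# Note: a cordoned node ("Ready,SchedulingDisabled") is also printed.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}
```

Usage: kubectl get node --no-headers | not_ready_nodes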
Check etcd status. The relevant etcd startup flags are:
- --listen-client-urls=
- --cert-file=/etc/kubernetes/ssl/etcd/server.crt
- --key-file=/etc/kubernetes/ssl/etcd/server.key
- --peer-trusted-ca-file=/etc/kubernetes/ssl/etcd/ca.crt
View endpoint and member information
etcdctl --endpoints="https://10.17.21.16:2379,https://10.17.21.17:2379,https://10.17.21.18:2379" --cacert=/etc/kubernetes/ssl/etcd/ca.crt --cert=/etc/kubernetes/ssl/etcd/server.crt --key=/etc/kubernetes/ssl/etcd/server.key endpoint status -w table
etcdctl --endpoints="https://10.17.21.16:2379,https://10.17.21.17:2379,https://10.17.21.18:2379" --cacert=/etc/kubernetes/ssl/etcd/ca.crt --cert=/etc/kubernetes/ssl/etcd/server.crt --key=/etc/kubernetes/ssl/etcd/server.key member list -w table
Record the name and ID of the faulty member
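Since the same TLS flags recur in every etcdctl command, they can be wrapped once, and the member ID can be extracted by name instead of being copied by hand. This is a sketch under assumptions: the helper names etcdctl_tls and member_id_by_name are hypothetical, and the parser expects the default comma-separated output of member list (ID, status, name, ...) rather than the -w table format used above.

```shell
# Endpoints and certificate paths taken from the commands above.
ENDPOINTS="https://10.17.21.16:2379,https://10.17.21.17:2379,https://10.17.21.18:2379"

# Wrapper so the long flag list is written only once (hypothetical helper).
etcdctl_tls() {
  etcdctl --endpoints="$ENDPOINTS" \
    --cacert=/etc/kubernetes/ssl/etcd/ca.crt \
    --cert=/etc/kubernetes/ssl/etcd/server.crt \
    --key=/etc/kubernetes/ssl/etcd/server.key "$@"
}

# Read `member list` output (default comma-separated format) on stdin and
# print the ID of the member whose name equals $1.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}
```

Usage: etcdctl_tls member list | member_id_by_name <name>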
Stop kubelet
Stop the kubelet service on the faulty node
systemctl stop kubelet
Remove the etcd member
etcdctl --endpoints="https://10.17.21.16:2379,https://10.17.21.17:2379,https://10.17.21.18:2379" --cacert=/etc/kubernetes/ssl/etcd/ca.crt --cert=/etc/kubernetes/ssl/etcd/server.crt --key=/etc/kubernetes/ssl/etcd/server.key member remove <member-id>
Add the etcd member back
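etcdctl --endpoints="https://10.17.21.16:2379,https://10.17.21.17:2379,https://10.17.21.18:2379" --cacert=/etc/kubernetes/ssl/etcd/ca.crt --cert=/etc/kubernetes/ssl/etcd/server.crt --key=/etc/kubernetes/ssl/etcd/server.key member add <name> --peer-urls=https://<ip>:2380
Use the scheme that matches the node's own peer URLs; https is assumed here because the etcd flags above configure peer TLS (--peer-trusted-ca-file).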
Clean up etcd data. The data directory is set by this etcd startup flag:
- --data-dir=/var/lib/etcd/
Delete the member data on the faulty node (back it up first if you want a safety net)
rm -rf /var/lib/etcd/member/
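As a safer variant of the deletion above, the member directory can be moved aside rather than removed, so that it stays available if something goes wrong. A minimal sketch, assuming a hypothetical helper name backup_member_dir and a backup location of your choosing:

```shell
# Move <data_dir>/member into a timestamped directory under <backup_root>
# instead of deleting it outright; prints the backup path on success.
backup_member_dir() {
  local data_dir="$1" backup_root="$2"
  local dest="$backup_root/etcd-member-$(date +%Y%m%d%H%M%S)"
  if [ ! -d "$data_dir/member" ]; then
    echo "no member directory in $data_dir" >&2
    return 1
  fi
  mkdir -p "$backup_root" || return 1
  mv "$data_dir/member" "$dest" || return 1
  echo "$dest"
}
```

Usage: backup_member_dir /var/lib/etcd /root/etcd-backup (the path /root/etcd-backup is only an example). Once the node has rejoined and the cluster is healthy, the backup can be deleted.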
Restore the service
systemctl start kubelet
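After kubelet starts, the etcd container needs some time to come up and sync from the other members. One way to wait for that is to poll the health check with a small retry helper. This is a sketch; the helper name retry and the values 30 and 5 in the usage line are arbitrary assumptions.

```shell
# Run a command up to <tries> times, sleeping <delay> seconds between
# attempts; return 0 as soon as the command succeeds, 1 otherwise.
retry() {
  local tries="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$tries" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$tries" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

Usage: retry 30 5 etcdctl --endpoints="https://10.17.21.16:2379,https://10.17.21.17:2379,https://10.17.21.18:2379" --cacert=/etc/kubernetes/ssl/etcd/ca.crt --cert=/etc/kubernetes/ssl/etcd/server.crt --key=/etc/kubernetes/ssl/etcd/server.key endpoint health
Once it succeeds, the endpoint status and member list commands from earlier should show all three members again.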
References:
https://blog.csdn.net/test1280/article/details/123579775
https://lolicp.com/kubernetes/202423659.html