
Kubernetes StatefulSet vs Deployment: how the code logic differs when a Node goes NotReady

source link: https://www.tuicool.com/articles/iInEBfu

Pod status after a node goes offline

Depending on how the cluster is configured, a Kubernetes node often ends up NotReady because of one or more of the following:

  1. the kubelet process stops
  2. the apiserver process stops
  3. the etcd process stops
  4. the Kubernetes management network goes down

When this happens the node turns NotReady, and once the interval set by --pod-eviction-timeout on kube-controller-manager (5 minutes by default) has elapsed, pod eviction is triggered.
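As a rough mental model of that mechanism, the decision can be sketched as follows. This is a simplified illustration, not the actual kube-controller-manager code; shouldEvictPods and its parameters are invented for this sketch.

package main

import (
	"fmt"
	"time"
)

// shouldEvictPods sketches the decision described above: once a node has been
// NotReady for longer than --pod-eviction-timeout (5 minutes by default), the
// node controller starts evicting the pods scheduled on it.
func shouldEvictPods(nodeReady bool, notReadySince time.Time, podEvictionTimeout time.Duration) bool {
	if nodeReady {
		return false
	}
	return time.Since(notReadySince) >= podEvictionTimeout
}

func main() {
	notReadySince := time.Now().Add(-6 * time.Minute) // the node went NotReady 6 minutes ago
	fmt.Println(shouldEvictPods(false, notReadySince, 5*time.Minute)) // true: eviction is triggered
}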

For different workload types, the pods are handled differently because each controller in controller-manager has its own logic. In summary:

deployment: the pod is shown as Unknown, and a replacement pod is created on another node
statefulset: the pod is shown as Unknown, and no replacement is created
daemonSet: the pod is shown as NodeLost

As summarized above, for deployment and statefulSet resources the pod status displayed after the node goes NotReady is Unknown. The status actually stored in etcd is NodeLost; it is only translated at display time, which is what distinguishes these workloads from daemonSet. The corresponding code logic is:

### node controller

// When NodeEviction is triggered, DeletePods is called. The delete is a graceful delete:
// the apiserver REST endpoint only adds a DeletionTimestamp to the Pod object.
func DeletePods(kubeClient clientset.Interface, recorder record.EventRecorder, nodeName, nodeUID string, daemonStore extensionslisters.DaemonSetLister) (bool, error) {
...
        for _, pod := range pods.Items {
...
        // Set reason and message in the pod object.
        if _, err = SetPodTerminationReason(kubeClient, &pod, nodeName); err != nil {
            if apierrors.IsConflict(err) {
                updateErrList = append(updateErrList,
                    fmt.Errorf("update status failed for pod %q: %v", format.Pod(&pod), err))
                continue
            }
        }
        // if the pod has already been marked for deletion, we still return true that there are remaining pods.
        if pod.DeletionGracePeriodSeconds != nil {
            remaining = true
            continue
        }
        // if the pod is managed by a daemonset, ignore it
        _, err := daemonStore.GetPodDaemonSets(&pod)
        if err == nil { // No error means at least one daemonset was found
            continue
        }

        glog.V(2).Infof("Starting deletion of pod %v/%v", pod.Namespace, pod.Name)
        recorder.Eventf(&pod, v1.EventTypeNormal, "NodeControllerEviction", "Marking for deletion Pod %s from Node %s", pod.Name, nodeName)
        if err := kubeClient.CoreV1().Pods(pod.Namespace).Delete(pod.Name, nil); err != nil {
            return false, err
        }
        remaining = true
    }
...
}
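A detail worth noting: the Delete call above passes nil options, so the pod keeps its own terminationGracePeriodSeconds and the apiserver merely marks the object, as the next section shows. For comparison, here is a hypothetical helper that issues the same kind of graceful delete with an explicit grace period, using the old two-argument client-go signature that matches these excerpts (newer client-go versions take a context and a value DeleteOptions):

package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clientset "k8s.io/client-go/kubernetes"
)

// deleteWithGrace issues the same kind of graceful delete as DeletePods above,
// but spells out the grace period instead of relying on the pod's default.
func deleteWithGrace(kubeClient clientset.Interface, namespace, name string, seconds int64) error {
	return kubeClient.CoreV1().Pods(namespace).Delete(name, &metav1.DeleteOptions{
		GracePeriodSeconds: &seconds,
	})
}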

### staging apiserver REST API

// For a graceful delete, processing effectively stops here and nothing is removed yet;
// the rest is left to the kubelet, which watches the change and performs the actual delete.
func (e *Store) Delete(ctx genericapirequest.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
...
    if graceful || pendingFinalizers || shouldUpdateFinalizers {
            err, ignoreNotFound, deleteImmediately, out, lastExisting = e.updateForGracefulDeletionAndFinalizers(ctx, name, key, options, preconditions, obj)
        }
    // !deleteImmediately covers all cases where err != nil. We keep both to be future-proof.
    if !deleteImmediately || err != nil {
        return out, false, err
    }
...
}
        
// Called from the staging/apiserver REST path; sets DeletionTimestamp and DeletionGracePeriodSeconds on the object.
func (e *Store) updateForGracefulDeletionAndFinalizers(ctx genericapirequest.Context, name, key string, options *metav1.DeleteOptions, preconditions storage.Preconditions, in runtime.Object) (err error, ignoreNotFound, deleteImmediately bool, out, lastExisting runtime.Object) {
...

        if options.GracePeriodSeconds != nil {
            period := int64(*options.GracePeriodSeconds)
            if period >= *objectMeta.GetDeletionGracePeriodSeconds() {
                return false, true, nil
            }
            newDeletionTimestamp := metav1.NewTime(
                objectMeta.GetDeletionTimestamp().Add(-time.Second * time.Duration(*objectMeta.GetDeletionGracePeriodSeconds())).
                    Add(time.Second * time.Duration(*options.GracePeriodSeconds)))
            objectMeta.SetDeletionTimestamp(&newDeletionTimestamp)
            objectMeta.SetDeletionGracePeriodSeconds(&period)
            return true, false, nil
        }
...
}
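The timestamp arithmetic above is easier to follow in isolation: subtracting the previously granted grace period recovers the moment the delete was first requested, and adding the newly requested (shorter) period yields the new deadline. A standalone sketch with made-up values:

package main

import (
	"fmt"
	"time"
)

// newDeletionDeadline mirrors the computation in updateForGracefulDeletionAndFinalizers:
// strip off the old grace period to get back to the original deletion request time,
// then add the newly requested grace period.
func newDeletionDeadline(oldDeadline time.Time, oldGraceSeconds, newGraceSeconds int64) time.Time {
	return oldDeadline.
		Add(-time.Second * time.Duration(oldGraceSeconds)).
		Add(time.Second * time.Duration(newGraceSeconds))
}

func main() {
	requestedAt := time.Date(2018, 1, 1, 12, 0, 0, 0, time.UTC)
	oldDeadline := requestedAt.Add(30 * time.Second) // original grace period: 30s
	// A later delete request with a 5s grace period pulls the deadline in:
	fmt.Println(newDeletionDeadline(oldDeadline, 30, 5)) // 2018-01-01 12:00:05 +0000 UTC
}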

### node controller 

// SetPodTerminationReason attempts to set a termination reason and message on the Pod object.
func SetPodTerminationReason(kubeClient clientset.Interface, pod *v1.Pod, nodeName string) (*v1.Pod, error) {
    if pod.Status.Reason == nodepkg.NodeUnreachablePodReason {
        return pod, nil
    }

    pod.Status.Reason = nodepkg.NodeUnreachablePodReason
    pod.Status.Message = fmt.Sprintf(nodepkg.NodeUnreachablePodMessage, nodeName, pod.Name)

    var updatedPod *v1.Pod
    var err error
    if updatedPod, err = kubeClient.CoreV1().Pods(pod.Namespace).UpdateStatus(pod); err != nil {
        return nil, err
    }
    return updatedPod, nil
}
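This is also where the NodeLost value mentioned earlier comes from: NodeUnreachablePodReason is a plain string constant in pkg/util/node. In the Kubernetes versions these excerpts are taken from, the constants look roughly like this (the exact message wording may vary between versions):

// Constants from k8s.io/kubernetes/pkg/util/node (abridged).
const (
	// NodeUnreachablePodReason is the reason set on a pod when its node becomes unreachable.
	NodeUnreachablePodReason = "NodeLost"
	// NodeUnreachablePodMessage is the format string for the pod's status message.
	NodeUnreachablePodMessage = "Node %v which was running pod %v is unresponsive"
)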

### Command-line output

// Status translation at print time: if DeletionTimestamp is set and the pod's
// Status.Reason is NodeUnreachablePodReason (i.e. NodeLost), the displayed status is Unknown.
func printPod(pod *api.Pod, options printers.PrintOptions) ([]metav1alpha1.TableRow, error) {
...
    if pod.DeletionTimestamp != nil && pod.Status.Reason == node.NodeUnreachablePodReason {
        reason = "Unknown"
    } else if pod.DeletionTimestamp != nil {
        reason = "Terminating"
    }
...
}
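Pulled out of printPod, the display-time translation is just two checks. Below is a minimal self-contained version; displayReason is a name made up for this sketch, not a kubectl function:

package main

import (
	"fmt"
	"time"
)

// displayReason reproduces the printPod logic above: a pod marked for deletion
// whose node is unreachable (Status.Reason == "NodeLost") is shown as Unknown,
// any other pod marked for deletion is shown as Terminating, and otherwise the
// stored reason is returned unchanged (the real printPod falls back to the phase).
func displayReason(deletionTimestamp *time.Time, storedReason string) string {
	const nodeUnreachablePodReason = "NodeLost" // value of node.NodeUnreachablePodReason
	switch {
	case deletionTimestamp != nil && storedReason == nodeUnreachablePodReason:
		return "Unknown"
	case deletionTimestamp != nil:
		return "Terminating"
	default:
		return storedReason
	}
}

func main() {
	now := time.Now()
	fmt.Println(displayReason(&now, "NodeLost")) // Unknown
	fmt.Println(displayReason(&now, ""))         // Terminating
	fmt.Println(displayReason(nil, "NodeLost"))  // NodeLost
}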

Pod status after the node becomes Ready again

After the node recovers, the pod status changes also differ by workload type (a small watch sketch for observing this follows the summary):

deployment: as described in the previous section, a replacement pod is already running on another node; once the failed node recovers, its kubelet performs the graceful deletion and removes the old Pod object.

statefulset: the pod goes from Unknown to Terminating, the graceful deletion runs, the PV is detached, and the pod is then rescheduled and rebuilt.

daemonset: the pod goes straight from NodeLost back to Running; no rebuild is involved.
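One way to see these transitions first-hand is to watch the pods while the node goes down and comes back. Below is a sketch using client-go with the older, context-free Watch signature that matches the excerpts above (newer client-go versions add a context argument); the kubeconfig path and namespace are placeholders:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // placeholder kubeconfig path
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	// Watch all pods in the namespace and log phase / reason / deletionTimestamp,
	// which is enough to see the Unknown (NodeLost) -> Terminating / Running transitions.
	w, err := client.CoreV1().Pods("default").Watch(metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for event := range w.ResultChan() {
		pod, ok := event.Object.(*v1.Pod)
		if !ok {
			continue
		}
		fmt.Printf("%s %s phase=%s reason=%q deletionTimestamp=%v\n",
			event.Type, pod.Name, pod.Status.Phase, pod.Status.Reason, pod.DeletionTimestamp)
	}
}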

Why isn't the StatefulSet pod rebuilt?

Two questions usually come up here: why isn't the statefulset pod rebuilt, and how can a single-replica statefulset be kept highly available?

To explain why it is not rebuilt, let's first briefly go over the statefulset controller's logic.

The StatefulSet controller coordinates two modules, StatefulSetControl and StatefulPodControl, to handle status management (StatefulSetStatusUpdater) and scaling control (StatefulPodControl) for statefulSet workloads. In practice, StatefulSetControl calls into StatefulPodControl to create, delete, and update Pods.

When podManagementPolicy is left at its default, OrderedReady, a StatefulSet creates its Pods one at a time in monotonically increasing ordinal order; with Parallel, the ordinals are still assigned, but the Pods are scheduled and created simultaneously.
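For reference, the policy is a field on the StatefulSet spec. Below is a minimal sketch built with the Go API types; everything other than PodManagementPolicy (the name, labels, and image) is illustrative:

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	replicas := int32(3)
	sts := appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "web"},
		Spec: appsv1.StatefulSetSpec{
			Replicas:    &replicas,
			ServiceName: "web",
			// OrderedReadyPodManagement is the default: web-0, web-1, web-2 are created
			// one at a time, each waiting for the previous pod to be Running and Ready.
			// ParallelPodManagement keeps the ordinals but creates/deletes pods all at once.
			PodManagementPolicy: appsv1.ParallelPodManagement,
			Selector:            &metav1.LabelSelector{MatchLabels: map[string]string{"app": "web"}},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "web"}},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "web", Image: "nginx"}},
				},
			},
		},
	}
	fmt.Println(sts.Spec.PodManagementPolicy) // Parallel
}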

The detailed logic lives in the core method UpdateStatefulSet, shown in the figure below:

[Figure: UpdateStatefulSet control flow]

The reason the StatefulSet Pod we observe stays in Unknown is that this controller deliberately stops acting on it. As described in the first section, the NodeController's pod-eviction mechanism has already marked the Pod for deletion, so DeletionTimestamp is set on the Pod object; the StatefulSet controller's isTerminating check then matches and the sync simply returns.

// updateStatefulSet performs the update function for a StatefulSet. This method creates, updates, and deletes Pods in
// the set in order to conform the system to the target state for the set. The target state always contains
// set.Spec.Replicas Pods with a Ready Condition. If the UpdateStrategy.Type for the set is
// RollingUpdateStatefulSetStrategyType then all Pods in the set must be at set.Status.CurrentRevision.
// If the UpdateStrategy.Type for the set is OnDeleteStatefulSetStrategyType, the target state implies nothing about
// the revisions of Pods in the set. If the UpdateStrategy.Type for the set is PartitionStatefulSetStrategyType, then
// all Pods with ordinal less than UpdateStrategy.Partition.Ordinal must be at Status.CurrentRevision and all other
// Pods must be at Status.UpdateRevision. If the returned error is nil, the returned StatefulSetStatus is valid and the
// update must be recorded. If the error is not nil, the method should be retried until successful.
func (ssc *defaultStatefulSetControl) updateStatefulSet(
    ...
    for i := range replicas {
        ...
        // If we find a Pod that is currently terminating, we must wait until graceful deletion
        // completes before we continue to make progress.
        if isTerminating(replicas[i]) && monotonic {
            glog.V(4).Infof(
                "StatefulSet %s/%s is waiting for Pod %s to Terminate",
                set.Namespace,
                set.Name,
                replicas[i].Name)
            return &status, nil
        }
    ...
    }
}


// isTerminating returns true if pod's DeletionTimestamp has been set
func isTerminating(pod *v1.Pod) bool {
    return pod.DeletionTimestamp != nil
}
