Enabling preemption in Kubernetes

Pod priority and preemption

Pod priority and preemption was introduced in Kubernetes v1.8 as alpha, reached beta in v1.11, and went GA in v1.14, so it is now a mature feature.

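As the name suggests, Pod priority and preemption lets you sort applications into priority levels and hand resources to the higher-priority ones first. This improves resource utilization while protecting the quality of service of high-priority applications.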

Let's start with a quick walkthrough of the feature.

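My cluster runs v1.14, so the PodPriority feature gate is enabled by default. Using preemption takes two steps:

Define PriorityClasses. Each PriorityClass has its own value; the larger the value, the higher the priority.
Create Pods and set each Pod's priorityClassName field to the desired PriorityClass.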

Creating PriorityClasses

First I create two PriorityClasses, high-priority and low-priority, with values 1000000 and 10 respectively. The manifest follows the note below.

Note that globalDefault is set to true for low-priority, which makes it the cluster's default PriorityClass: any Pod that does not set priorityClassName gets low-priority's value, 10. A cluster can have only one default PriorityClass. If no default PriorityClass exists, Pods without a priorityClassName have priority 0.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "for high priority pod"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 10
globalDefault: true
description: "for low priority pod"
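
The defaulting rule described above can be sketched in a few lines of Python. This is only a simplified model for illustration, not the code the API server runs; the names PriorityClass, resolve_priority, and global_default are hypothetical and exist only in this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PriorityClass:
    # Hypothetical in-memory stand-in for a PriorityClass object.
    name: str
    value: int
    global_default: bool = False


def resolve_priority(priority_class_name: Optional[str],
                     classes: List[PriorityClass]) -> int:
    """Return the integer priority a new Pod would receive (simplified)."""
    by_name = {pc.name: pc for pc in classes}
    if priority_class_name is not None:
        # A Pod that names a PriorityClass gets that class's value.
        if priority_class_name not in by_name:
            raise ValueError(f"no PriorityClass named {priority_class_name!r}")
        return by_name[priority_class_name].value
    # No name given: fall back to the single global default, if any.
    defaults = [pc for pc in classes if pc.global_default]
    if len(defaults) > 1:
        raise ValueError("only one PriorityClass may be the global default")
    # Without a global default, the priority is 0.
    return defaults[0].value if defaults else 0


classes = [
    PriorityClass("high-priority", 1000000),
    PriorityClass("low-priority", 10, global_default=True),
]
print(resolve_priority(None, classes))             # Pod without priorityClassName
print(resolve_priority("high-priority", classes))  # Pod naming high-priority
```

With the two classes from the manifest, a Pod without priorityClassName resolves to low-priority's value, and the same Pod would resolve to 0 if low-priority were removed from the list.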

After creating them, list the PriorityClasses currently in the cluster.

kubectl get priorityclasses.scheduling.k8s.io
NAME                      VALUE        GLOBAL-DEFAULT   AGE
high-priority             1000000      false            47m
low-priority              10           true             47m
system-cluster-critical   2000000000   false            254d
system-node-critical      2000001000   false            254d

Besides the two PriorityClasses created above, the cluster ships with two built-in ones, system-cluster-critical and system-node-critical, reserved for high-priority system workloads.

Setting the Pod's priorityClassName

To make the behavior easy to verify, I use an extended resource here: node x1 advertises the extended resource example.com/foo with a capacity of 1.

curl -k --header "Authorization: Bearer ${token}" --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "add", "path": "/status/capacity/example.com~1foo", "value": "1"}]' \
https://{apiServerIP}:{apiServerPort}/api/v1/nodes/x1/status
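
The ~1 in the patch path above is JSON Pointer escaping (RFC 6901) for the / inside the resource name example.com/foo. As a small sketch of how such a patch body could be built programmatically, the Python below escapes the resource name and serializes the JSON Patch. The helper names escape_pointer_token and capacity_patch are hypothetical and only serve this illustration; sending the request is left out.

```python
import json


def escape_pointer_token(token: str) -> str:
    # RFC 6901: "~" must become "~0" first, then "/" becomes "~1".
    return token.replace("~", "~0").replace("/", "~1")


def capacity_patch(resource: str, quantity: str) -> str:
    """Build a JSON Patch body that adds `resource` to a node's status.capacity."""
    path = "/status/capacity/" + escape_pointer_token(resource)
    return json.dumps([{"op": "add", "path": path, "value": quantity}])


# Body equivalent to the --data argument of the curl command.
print(capacity_patch("example.com/foo", "1"))
```

The order of the two replacements matters: escaping / before ~ would turn the freshly produced ~1 into ~01.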

Check x1's Capacity and Allocatable (for example with kubectl describe node x1): x1 now has 1 unit of example.com/foo.

Capacity:
 cpu:                2
 example.com/foo:    1
 hugepages-2Mi:      0
 memory:             4040056Ki
 pods:               110
Allocatable:
 cpu:                2
 example.com/foo:    1
 hugepages-2Mi:      0
 memory:             3937656Ki
 pods:               110

First create the Deployment nginx. It requests 1 unit of example.com/foo but does not set priorityClassName, so its Pod gets the default priority of 10 from low-priority.

  template:
    spec:
      containers:
      - image: nginx
        imagePullPolicy: Always
        name: nginx
        resources:
          limits:
            example.com/foo: "1"
          requests:
            example.com/foo: "1"

Then create the Deployment debian, which does not request any example.com/foo yet.

  template:
    spec:
      containers:
      - args:
        - bash
        image: debian
        name: debian
        resources:
          limits:
            example.com/foo: "0"
          requests:
            example.com/foo: "0"
      # priorityClassName is not set yet; it is added in the next step

At this point both Pods start normally.

Triggering preemption

Now change the debian Deployment's example.com/foo request to 1 and set priorityClassName to high-priority.

  template:
    spec:
      containers:
      - args:
        - bash
        image: debian
        name: debian
        resources:
          limits:
            example.com/foo: "1"
          requests:
            example.com/foo: "1"
      priorityClassName: high-priority

Only x1 in the cluster has a unit of example.com/foo, and debian has the higher priority, so the scheduler starts preempting. Below is the sequence of Pod events observed.

kubectl get pods -o wide -w
NAME                      READY   STATUS    AGE     IP             NODE       NOMINATED NODE
debian-55d94c54cb-pdfmd   1/1     Running   3m53s   10.244.4.178   x201       <none>
nginx-58dc57fbff-g5fph    1/1     Running   2m4s    10.244.3.28    x1         <none>
// The debian Deployment starts its Recreate rollout
debian-55d94c54cb-pdfmd   1/1     Terminating   4m49s   10.244.4.178   x201       <none>
debian-55d94c54cb-pdfmd   0/1     Terminating   5m21s   10.244.4.178   x201       <none>
debian-55d94c54cb-pdfmd   0/1     Terminating   5m22s   10.244.4.178   x201       <none>
debian-55d94c54cb-pdfmd   0/1     Terminating   5m22s   10.244.4.178   x201       <none>
// example.com/foo is unavailable, so the new Pod stays Pending
debian-5bc46885dd-rvtwv   0/1     Pending       0s      <none>         <none>     <none>
debian-5bc46885dd-rvtwv   0/1     Pending       0s      <none>         <none>     <none>
// The scheduler determines that evicting a Pod from x1 would satisfy the debian Pod, and sets NOMINATED NODE to x1
debian-5bc46885dd-rvtwv   0/1     Pending       0s      <none>         <none>     x1    
// The scheduler starts evicting the nginx Pod
nginx-58dc57fbff-g5fph    1/1     Terminating   3m33s   10.244.3.28    x1         <none>
// The replacement nginx Pod waits. Its priority is low, so there is nothing it can do.
nginx-58dc57fbff-29rzw    0/1     Pending       0s      <none>         <none>     <none>
nginx-58dc57fbff-29rzw    0/1     Pending       0s      <none>         <none>     <none>
// Graceful termination period: the old nginx Pod shuts down gracefully
nginx-58dc57fbff-g5fph    0/1     Terminating   3m34s   10.244.3.28    x1         <none>
nginx-58dc57fbff-g5fph    0/1     Terminating   3m37s   10.244.3.28    x1         <none>
nginx-58dc57fbff-g5fph    0/1     Terminating   3m37s   10.244.3.28    x1         <none>
// The debian Pod is bound to node x1
debian-5bc46885dd-rvtwv   0/1     Pending       5s      <none>         x1         x1    
// Resource acquired through preemption; the Pod starts
debian-5bc46885dd-rvtwv   0/1     ContainerCreating   5s      <none>         x1         <none>
debian-5bc46885dd-rvtwv   1/1     Running             14s     10.244.3.29    x1         <none>
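
To make the decision in the trace easier to follow, here is a deliberately simplified Python model of victim selection on a single node with a single resource. It is a sketch under stated assumptions, not the kube-scheduler algorithm: the real scheduler evaluates every node and takes more factors into account. The names Pod and pick_victims are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Pod:
    # Hypothetical simplified Pod: one priority, one resource request.
    name: str
    priority: int
    request: int  # units of example.com/foo


def pick_victims(capacity: int, running: List[Pod],
                 incoming: Pod) -> Optional[List[Pod]]:
    """Return the Pods to evict so `incoming` fits, [] if it already fits,
    or None if preemption cannot help (simplified single-node model)."""
    free = capacity - sum(p.request for p in running)
    if incoming.request <= free:
        return []  # fits without preemption
    victims: List[Pod] = []
    # Only Pods with strictly lower priority may be evicted; lowest first.
    candidates = sorted(
        (p for p in running if p.priority < incoming.priority),
        key=lambda p: p.priority,
    )
    for pod in candidates:
        victims.append(pod)
        free += pod.request
        if incoming.request <= free:
            return victims
    return None


nginx = Pod("nginx", priority=10, request=1)
debian = Pod("debian", priority=1000000, request=1)
# Node x1 has capacity 1 and already runs nginx.
print(pick_victims(1, [nginx], debian))
```

In this model debian's request cannot be met from free capacity, nginx is the only lower-priority Pod on the node, and evicting it frees exactly the unit debian needs, which matches the trace. Swapping the roles (nginx arriving while debian runs) yields None, because a low-priority Pod cannot evict a higher-priority one.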

The gentleman: Non-preempting PriorityClasses

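Kubernetes v1.15 added the preemptionPolicy field to PriorityClass (alpha in that release, behind the NonPreemptingPriority feature gate). When it is set to Never, Pods of that class do not preempt lower-priority Pods; they are only placed ahead of lower-priority Pods in the scheduling queue, according to the PriorityClass value.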

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never
globalDefault: false

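That is why I call this kind of PriorityClass a "gentleman": it quietly waits in line on its own merit (its priority) and never grabs resources from others. The official documentation cites data science workloads as a good fit.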
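
The difference between queue ordering and preemption can be illustrated with a small Python sketch. It is only a toy model of the behavior described above, not scheduler code; PendingPod, scheduling_order, and may_preempt are hypothetical names, while "PreemptLowerPriority" and "Never" are the two values the preemptionPolicy field accepts.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PendingPod:
    # Hypothetical simplified view of a Pod waiting to be scheduled.
    name: str
    priority: int
    preemption_policy: str = "PreemptLowerPriority"  # default policy


def scheduling_order(pending: List[PendingPod]) -> List[PendingPod]:
    # Higher priority is tried first, regardless of preemption policy.
    # sorted() is stable, so equal priorities keep their arrival order.
    return sorted(pending, key=lambda p: -p.priority)


def may_preempt(pod: PendingPod) -> bool:
    # A "gentleman" Pod waits for free resources instead of evicting others.
    return pod.preemption_policy != "Never"


queue = [
    PendingPod("batch", 10),
    PendingPod("data-science", 1000000, preemption_policy="Never"),
    PendingPod("web", 1000),
]
for pod in scheduling_order(queue):
    print(pod.name, may_preempt(pod))
```

The non-preempting Pod still sits at the front of the queue because of its value; it simply never triggers eviction when no node has room for it.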

Comparison with Cluster Autoscaler

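When a cloud-hosted Kubernetes cluster runs short of resources, Cluster Autoscaler can scale it out automatically: it requests more nodes from the cloud provider and adds them to the cluster, which provides more resources.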

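This approach has some drawbacks:

It is hard to apply outside the cloud (on-premises).
Additional nodes cost more money.
It is not immediate; provisioning takes time.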

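If users can rank their applications by priority fairly clearly, preempting the resources of low-priority Pods when the cluster runs short is a better way to raise resource utilization and service quality.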
