The ins and outs of container resource control in Kubernetes

Kubernetes · Published 2018-12-13 20:37:02

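Why list the reference documents up front? Because each of them says quite a lot about resource requests and limits for containers in k8s, yet none of them gets to the bottom of it. Today I'll try to explain it more clearly, and complain a little along the way. This article is long; if you're impatient, skip straight to the summary in the last section.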

1. Where Pod resource control comes from: scheduling time and run time

apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: db
    image: mysql
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
  - name: wp
    image: wordpress
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

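k8s configures container resource control under pod.spec.containers[].resources, which is split into requests and limits. In other words, although the basic scheduling unit in k8s is the pod, resources are still controlled at the container level. So what is the difference between requests and limits?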

Unless pod.spec.nodeName is set explicitly, a Pod in k8s normally goes through the following steps:

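Pod (created, written to etcd) --> the scheduler watches the pod and fills in pod.spec.nodeName --> the kubelet on that node watches a pod assigned to itself, then creates and runs the containers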

See this code and this code in the scheduler:

func GetResourceRequest(pod *v1.Pod) *schedulercache.Resource {
	result := schedulercache.Resource{}
	for _, container := range pod.Spec.Containers {
		for rName, rQuantity := range container.Resources.Requests {
			// ... (the excerpt is cut off here)

	factory.RegisterPriorityFunction2("MostRequestedPriority", priorities.MostRequestedPriorityMap, nil, 1)
}

From this we can draw the first conclusion:

Scheduling uses only requests and never looks at limits, whereas at run time both are used.
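To make that conclusion concrete, here is a minimal sketch of the scheduling-side arithmetic. It is not the scheduler's real code: the types Resources and Container and the helpers podRequest and fits are hypothetical stand-ins, and the only thing it is meant to show is that limits never enter the calculation.

```go
package main

import "fmt"

// Resources is a simplified stand-in for a container's resource list
// (hypothetical type, not the real v1.ResourceRequirements).
type Resources struct {
	MilliCPU    int64 // CPU in millicores, e.g. 250 for "250m"
	MemoryBytes int64 // memory in bytes
}

// Container carries both requests and limits, like pod.spec.containers[].resources.
type Container struct {
	Name     string
	Requests Resources
	Limits   Resources
}

// podRequest sums the requests of all containers. Limits are deliberately
// ignored, mirroring the conclusion that scheduling only looks at requests.
func podRequest(containers []Container) Resources {
	var total Resources
	for _, c := range containers {
		total.MilliCPU += c.Requests.MilliCPU
		total.MemoryBytes += c.Requests.MemoryBytes
	}
	return total
}

// fits reports whether the summed request fits into what is still free on a node.
func fits(req, free Resources) bool {
	return req.MilliCPU <= free.MilliCPU && req.MemoryBytes <= free.MemoryBytes
}

func main() {
	const mi = 1024 * 1024
	// The "frontend" pod from the YAML above.
	pod := []Container{
		{Name: "db", Requests: Resources{250, 64 * mi}, Limits: Resources{500, 128 * mi}},
		{Name: "wp", Requests: Resources{250, 64 * mi}, Limits: Resources{500, 128 * mi}},
	}
	req := podRequest(pod)
	fmt.Println(req.MilliCPU, req.MemoryBytes/mi)    // 500 128
	fmt.Println(fits(req, Resources{600, 256 * mi})) // true
}
```

Note that the node in this example has only 600m of free CPU, while the pod's limits add up to 1000m. The pod still fits, because only the 500m of requests is counted.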

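Requests can also cover things like nvidiaGPU, which I won't go into. If these values are 0 at scheduling time, a default is filled in when priorities are computed, which I won't go into either. In short, this article cares about how things behave at run time, not at scheduling time.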

At run time, these parameters take effect along the following path:

k8s ---> docker ---> linux cgroup

In the end it is the operating system kernel that enforces the resource limits on these processes.

2. Run-time conversion: k8s -> docker

Again, code first. This snippet carries a lot of information:

func (m *kubeGenericRuntimeManager) generateLinuxContainerConfig(container *v1.Container, pod *v1.Pod, uid *int64, username string) *runtimeapi.LinuxContainerConfig {
	lc := &runtimeapi.LinuxContainerConfig{
		Resources:       &runtimeapi.LinuxContainerResources{},
		SecurityContext: m.determineEffectiveSecurityContext(pod, container, uid, username),
	}

	// set linux container resources
	var cpuShares int64
	cpuRequest := container.Resources.Requests.Cpu()
	cpuLimit := container.Resources.Limits.Cpu()
	memoryLimit := container.Resources.Limits.Memory().Value()
	oomScoreAdj := int64(qos.GetContainerOOMScoreAdjust(pod, container,
		int64(m.machineInfo.MemoryCapacity)))
	// If request is not specified, but limit is, we want request to default to limit.
	// API server does this for new containers, but we repeat this logic in Kubelet
	// for containers running on existing Kubernetes clusters.
	if cpuRequest.IsZero() && !cpuLimit.IsZero() {
		cpuShares = milliCPUToShares(cpuLimit.MilliValue())
	} else {
		// if cpuRequest.Amount is nil, then milliCPUToShares will return the minimal number
		// of CPU shares.
		cpuShares = milliCPUToShares(cpuRequest.MilliValue())
	}
	lc.Resources.CpuShares = cpuShares
	if memoryLimit != 0 {
		lc.Resources.MemoryLimitInBytes = memoryLimit
	}
	// Set OOM score of the container based on qos policy. Processes in lower-priority pods should
	// be killed first if the system runs out of memory.
	lc.Resources.OomScoreAdj = oomScoreAdj

	if m.cpuCFSQuota {
		// if cpuLimit.Amount is nil, then the appropriate default value is returned
		// to allow full usage of cpu resource.
		cpuQuota, cpuPeriod := milliCPUToQuota(cpuLimit.MilliValue())
		lc.Resources.CpuQuota = cpuQuota
		lc.Resources.CpuPeriod = cpuPeriod
	}

	return lc
}

And this one:

func milliCPUToShares(milliCPU int64) int64 {
	if milliCPU == 0 {
		// Return 2 here to really match kernel default for zero milliCPU.
		return minShares
	}
	// Conceptually (milliCPU / milliCPUToCPU) * sharesPerCPU, but factored to improve rounding.
	shares := (milliCPU * sharesPerCPU) / milliCPUToCPU
	if shares < minShares {
		return minShares
	}
	return shares
}

// milliCPUToQuota converts milliCPU to CFS quota and period values
func milliCPUToQuota(milliCPU int64) (quota int64, period int64) {
	// CFS quota is measured in two values:
	//  - cfs_period_us=100ms (the amount of time to measure usage across)
	//  - cfs_quota=20ms (the amount of cpu time allowed to be used across a period)
	// so in the above example, you are limited to 20% of a single CPU
	// for multi-cpu environments, you just scale equivalent amounts
	if milliCPU == 0 {
		return
	}

	// we set the period to 100ms by default
	period = quotaPeriod

	// we then convert your milliCPU to a value normalized over a period
	quota = (milliCPU * quotaPeriod) / milliCPUToCPU

	// quota needs to be a minimum of 1ms.
	if quota < minQuotaPeriod {
		quota = minQuotaPeriod
	}

	return
}

Here is a table summarizing these relationships:

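docker parameter	How it is derived
cpuShares	If requests.cpu is 0 and limits.cpu is non-zero, limits.cpu is the conversion input; otherwise requests.cpu is. Conversion: if the input is 0, the result is minShares == 2; otherwise it is milliCPU * 1024 / 1000, with a floor of minShares
oomScoreAdj	A rather complicated algorithm sorts containers into three classes, see reference 3. In descending priority they are Guaranteed, Burstable and Best-Effort. The value is negatively correlated with requests.memory; the more negative it is, the less likely the container is to be killed
cpuQuota, cpuPeriod	Converted from limits.cpu. cpuPeriod defaults to 100ms and cpuQuota is the number of cores in limits.cpu * 100ms. This is a hard limit
memoryLimit	== limits.memory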
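As a worked example of the table, the sketch below applies the same arithmetic to the db container of the frontend pod (requests.cpu 250m, limits.cpu 500m). The helper names cpuSharesFor and cpuQuotaFor are made up for illustration, and the constant values other than minShares == 2 are my reading of the text (1024 shares per CPU, a 100ms period, a 1ms minimum quota), so treat them as assumptions rather than a copy of the kubelet source.

```go
package main

import "fmt"

const (
	minShares      = 2      // floor used when no CPU value is set (from the article)
	sharesPerCPU   = 1024   // assumed: one full core maps to 1024 shares
	milliCPUToCPU  = 1000   // millicores per core
	quotaPeriod    = 100000 // assumed: 100ms CFS period, in microseconds
	minQuotaPeriod = 1000   // assumed: 1ms minimum quota, in microseconds
)

// cpuSharesFor picks the conversion input (request, or limit when the request
// is unset) and converts it to docker's --cpu-shares value.
func cpuSharesFor(requestMilli, limitMilli int64) int64 {
	input := requestMilli
	if requestMilli == 0 && limitMilli != 0 {
		input = limitMilli
	}
	shares := (input * sharesPerCPU) / milliCPUToCPU
	if shares < minShares {
		return minShares
	}
	return shares
}

// cpuQuotaFor converts limits.cpu to (--cpu-quota, --cpu-period) in microseconds.
// A zero limit returns (0, 0), meaning no hard cap.
func cpuQuotaFor(limitMilli int64) (quota, period int64) {
	if limitMilli == 0 {
		return 0, 0
	}
	quota = (limitMilli * quotaPeriod) / milliCPUToCPU
	if quota < minQuotaPeriod {
		quota = minQuotaPeriod
	}
	return quota, quotaPeriod
}

func main() {
	fmt.Println(cpuSharesFor(250, 500)) // 256: request 250m wins
	fmt.Println(cpuSharesFor(0, 500))   // 512: request unset, falls back to the limit
	fmt.Println(cpuSharesFor(0, 0))     // 2: nothing set, minShares
	fmt.Println(cpuQuotaFor(500))       // 50000 100000: half a core per 100ms period
}
```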

3. What these values mean in docker

So once k8s has handed this pile of memory and CPU values to docker, how does docker interpret them?

# docker help run | grep cpu
 --cpu-percent int CPU percent (Windows only)
 --cpu-period int Limit CPU CFS (Completely Fair Scheduler) period
 --cpu-quota int Limit CPU CFS (Completely Fair Scheduler) quota
 -c, --cpu-shares int CPU shares (relative weight)
 --cpuset-cpus string CPUs in which to allow execution (0-3, 0,1)
 --cpuset-mems string MEMs in which to allow execution (0-3, 0,1)
 
# docker help run | grep oom
 --oom-kill-disable Disable OOM Killer
 --oom-score-adj int Tune host's OOM preferences (-1000 to 1000)

# docker help run | grep memory
 --kernel-memory string Kernel memory limit
 -m, --memory string Memory limit
 --memory-reservation string Memory soft limit
 --memory-swap string Swap limit equal to memory plus swap: '-1' to enable unlimited swap
 --memory-swappiness int Tune container memory swappiness (0 to 100) (default -1)

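It's a bit of a jumble, so I'll go through the key values one by one. The content below is copied from this blog, which is very detailed; the original source is, of course, docker's official documentation.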

For a running container, you can also see these values with docker inspect:

# docker inspect c8dcd083baba | grep Cpu
 "CpuShares": 1024,
 "CpuPeriod": 0,
 "CpuQuota": 0,
 "CpusetCpus": "",
 "CpusetMems": "",
 "CpuCount": 0,
 "CpuPercent": 0,

3.1 CPU share constraint: `-c` or `--cpu-shares`

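By default all containers get the same proportion of CPU. -c or --cpu-shares sets the CPU weight. The default is 1024, and it can be set to 2 or higher (1024 for one CPU, 2048 for two, and so on). If the option is set to 0, the system ignores it and uses the default of 1024. The setting only shows when processes are CPU-intensive (busy). When one container is idle, the other containers can use its CPU. cpu-shares is a relative value; actual CPU usage depends on how many containers are running on the system.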

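Suppose a 1-core host runs 3 containers, one with cpu-shares set to 1024 and the other two with 512. When the processes in all 3 containers try to use 100% of the CPU (trying to use 100% is the important part, since only then do the settings show), the container set to 1024 gets 50% of CPU time. If another container with cpu-shares 1024 is added, the two containers set to 1024 each get about 33.3%, and the other two about 16.7% each. The simple rule: add up all the configured values, and each container's fraction of that sum is its CPU share. If there is only one container, it gets 100% of the CPU whether it is set to 512 or 1024. And if the host has 3 cores and runs 3 containers, two with cpu-shares 512 and one with 1024, each container can use 100% of one CPU.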
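The "add everything up and take each container's fraction" rule can be written down in a few lines. This is only a sketch of the arithmetic for the single-core, all-containers-busy case; cpuFractions is a name invented here and has nothing to do with docker's own code.

```go
package main

import "fmt"

// cpuFractions returns, for each cpu-shares value, the fraction of one
// contended CPU that the container gets when every container is busy.
func cpuFractions(shares []int64) []float64 {
	var total int64
	for _, s := range shares {
		total += s
	}
	out := make([]float64, len(shares))
	if total == 0 {
		return out
	}
	for i, s := range shares {
		out[i] = float64(s) / float64(total)
	}
	return out
}

func main() {
	// Three containers: 1024, 512, 512.
	for _, f := range cpuFractions([]int64{1024, 512, 512}) {
		fmt.Printf("%.1f%% ", f*100)
	}
	fmt.Println()
	// A fourth container with 1024 joins.
	for _, f := range cpuFractions([]int64{1024, 1024, 512, 512}) {
		fmt.Printf("%.1f%% ", f*100)
	}
	fmt.Println()
}
```

The first line prints 50.0% 25.0% 25.0% and the second 33.3% 33.3% 16.7% 16.7%, matching the figures in the paragraph above.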

On a test host (4 cores), when there is only one container it can use any of the CPUs:

~ docker run -it --rm --cpu-shares 512 ubuntu-stress:latest /bin/bash
root@4eb961147ba6:/# stress -c 4
stress: info: [17] dispatching hogs: 4 cpu, 0 io, 0 vm, 0 hdd
 ~ docker stats 4eb961147ba6
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O
4eb961147ba6 398.05% 741.4 kB / 8.297 GB 0.01% 4.88 kB /

3.2 CPU period constraint: `--cpu-period` & `--cpu-quota`

The default CPU CFS (Completely Fair Scheduler) period is 100ms. We can limit a container's CPU usage with --cpu-period, which is normally used together with --cpu-quota.

Setting cpu-period to 100ms and cpu-quota to 200ms means at most 2 CPUs can be used, as the following test shows:

~ docker run -it --rm --cpu-period=100000 --cpu-quota=200000 ubuntu-stress:latest /bin/bash
root@6b89f2bda5cd:/# stress -c 4 # stress test using 4 CPUs
stress: info: [17] dispatching hogs: 4 cpu, 0 io, 0 vm, 0 hdd
 ~ docker stats 6b89f2bda5cd # stats shows the container's CPU usage does not exceed 200%
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O
6b89f2bda5cd 200.68% 745.5 kB / 8.297 GB 0.01% 4.771 kB / 648 B 0 B / 0 B

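The test shows that the --cpu-period plus --cpu-quota configuration is a fixed cap: whether the CPUs are idle or busy, with the configuration above the container can use at most 2 CPUs at 100%.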
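The relationship between the two flags is just a ratio, which the following sketch spells out. Both helpers, maxCPUs and busyWindowUs, are hypothetical names; the second is a simplified model (it assumes all threads run in parallel and burn quota evenly) of how long a container can run within each period before being throttled.

```go
package main

import "fmt"

// maxCPUs returns how many CPUs' worth of time a container may use.
// limited is false when no quota is set (quota <= 0), i.e. there is no hard cap.
func maxCPUs(quotaUs, periodUs int64) (cores float64, limited bool) {
	if quotaUs <= 0 || periodUs <= 0 {
		return 0, false
	}
	return float64(quotaUs) / float64(periodUs), true
}

// busyWindowUs estimates the wall-clock time per period during which `threads`
// fully busy threads can run before the quota is used up (simplified model).
func busyWindowUs(quotaUs, periodUs, threads int64) int64 {
	if threads <= 0 {
		return 0
	}
	w := quotaUs / threads
	if w > periodUs {
		w = periodUs // fewer threads than the quota allows: never throttled
	}
	return w
}

func main() {
	// The test above: --cpu-period=100000 --cpu-quota=200000 with stress -c 4.
	cores, limited := maxCPUs(200000, 100000)
	fmt.Println(cores, limited)                  // 2 true
	fmt.Println(busyWindowUs(200000, 100000, 4)) // 50000
}
```

Under this model, 4 busy threads running for 50ms out of every 100ms average out to 200%, which is in line with the docker stats output above.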

CFS documentation on bandwidth limiting

3.3 --oom-score-adj

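oom-score-adj is a parameter that decides which process is killed first when the system runs out of memory. Processes with negative scores are the least likely to be killed and those with positive scores the most likely, which is the complete opposite of Terror Infinity (無限恐怖), hahaha.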

In the k8s code, the scores for the different kinds of container are as follows:

const (
	// PodInfraOOMAdj is very docker specific. For arbitrary runtime, it may not make
	// sense to set sandbox level oom score, e.g. a sandbox could only be a namespace
	// without a process.
	// TODO: Handle infra container oom score adj in a runtime agnostic way.
	PodInfraOOMAdj        int = -998
	KubeletOOMScoreAdj    int = -999
	DockerOOMScoreAdj     int = -999
	KubeProxyOOMScoreAdj  int = -999
	guaranteedOOMScoreAdj int = -998
	besteffortOOMScoreAdj int = 1000
)

So Guaranteed containers are the last to be killed, and Best-Effort ones the first.
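The constants only cover the two extremes. For Burstable containers the score is computed from requests.memory, and the sketch below is my recollection of that kubelet logic from around this time, so please take it as an assumption and check it against the source for your version. The function name burstableOOMScoreAdj is invented here; the 8GiB node capacity is just an example figure.

```go
package main

import "fmt"

const (
	guaranteedOOMScoreAdj = -998
	besteffortOOMScoreAdj = 1000
)

// burstableOOMScoreAdj maps a Burstable container's memory request to an
// oom_score_adj: the larger the share of node memory it requests, the lower
// (safer) the score. The result is kept strictly between the Guaranteed and
// Best-Effort extremes.
func burstableOOMScoreAdj(memoryRequest, memoryCapacity int64) int64 {
	adj := 1000 - (1000*memoryRequest)/memoryCapacity
	// Never score better than a Guaranteed pod would at 100% memory usage.
	if adj < 1000+guaranteedOOMScoreAdj {
		return 1000 + guaranteedOOMScoreAdj
	}
	// Always leave Burstable a slightly better chance than Best-Effort.
	if adj == besteffortOOMScoreAdj {
		return adj - 1
	}
	return adj
}

func main() {
	const mi = int64(1024 * 1024)
	capacity := 8192 * mi // example node with 8GiB of memory

	fmt.Println(burstableOOMScoreAdj(64*mi, capacity))   // 993
	fmt.Println(burstableOOMScoreAdj(4096*mi, capacity)) // 500
	fmt.Println(burstableOOMScoreAdj(0, capacity))       // 999
	fmt.Println(burstableOOMScoreAdj(capacity, capacity)) // 2
}
```

This is what "negatively correlated with requests.memory" means in practice: a 64Mi request on an 8GiB node barely moves the score away from Best-Effort, while a 4GiB request halves it.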

4. Testing --cpu-shares in practice

To see the real effect of k8s defaulting cpu-shares to 2, I wrote a script:

# cat cpu.sh
function aa {
x=0
while [ True ];do
 x=$x+1
done;
}

Copy it into a container image based on busybox and run it:

# docker run --cpu-shares=2 run_out_cpu /bin/sh /cpu.sh 
 
top - 15:41:23 up 57 days, 22:51, 3 users, load average: 4.66, 7.50, 6.16
Tasks: 396 total, 2 running, 394 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.2 us, 0.4 sy, 0.0 ni, 87.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 8154312 total, 7923816 used, 230496 free, 278512 buffers
KiB Swap: 3905532 total, 39060 used, 3866472 free. 6525112 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
39986 root 20 0 1472 364 188 R 99.6 0.0 1:10.02 sh
2098 root 20 0 116456 8948 1864 S 0.7 0.1 67:16.83 acc-snf

As you can see, with --cpu-shares set to 2 the container uses 100% of a single CPU. Next, outside any container, I ran a standalone script directly on the host:

# cat cpu8.sh
function aa {
x=0
while [ True ];do
 x=$x+1
done;
}

aa &
aa &
aa &
aa &
aa &
aa &
aa &
aa &

This machine has 8 cores, so it takes 8 processes to saturate it.

top now shows:

top - 15:44:30 up 57 days, 22:54, 3 users, load average: 4.21, 5.23, 5.47
Tasks: 404 total, 10 running, 394 sleeping, 0 stopped, 0 zombie
%Cpu(s): 82.6 us, 16.9 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.1 hi, 0.0 si, 0.5 st
KiB Mem: 8154312 total, 7932176 used, 222136 free, 278536 buffers
KiB Swap: 3905532 total, 39060 used, 3866472 free. 6525116 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
40374 root 20 0 15112 1376 496 R 100.0 0.0 0:28.67 bash
40373 root 20 0 15244 1380 496 R 99.9 0.0 0:28.92 bash
40376 root 20 0 15240 1376 496 R 99.9 0.0 0:28.80 bash
40375 root 20 0 15240 1376 496 R 99.6 0.0 0:13.15 bash
40379 root 20 0 15240 1376 496 R 99.6 0.0 0:28.75 bash
40378 root 20 0 15240 1376 496 R 99.3 0.0 0:28.79 bash
40380 root 20 0 15116 1380 496 R 98.9 0.0 0:28.84 bash
39986 root 20 0 1696 460 188 R 89.0 0.0 4:13.48 sh
40377 root 20 0 15240 1376 496 R 12.0 0.0 0:19.27 bash

As you can see, the sh process running in the container gets about 89% of a CPU. If --cpu-shares is set to 512 instead:

top - 15:50:25 up 57 days, 23:00, 3 users, load average: 8.93, 7.86, 6.60
Tasks: 407 total, 10 running, 397 sleeping, 0 stopped, 0 zombie
%Cpu(s): 80.2 us, 16.5 sy, 0.0 ni, 2.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.4 st
KiB Mem: 8154312 total, 7951320 used, 202992 free, 278748 buffers
KiB Swap: 3905532 total, 39060 used, 3866472 free. 6525176 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
41012 root 20 0 1316 328 188 R 100.0 0.0 0:07.89 sh
40376 root 20 0 17928 4084 496 R 89.0 0.1 5:46.70 bash
40374 root 20 0 17928 4084 496 R 88.6 0.1 6:21.40 bash
40380 root 20 0 17420 3300 496 R 87.3 0.0 6:23.22 bash
40375 root 20 0 17928 4084 496 R 86.7 0.1 6:07.00 bash
40377 root 20 0 17928 3556 496 R 85.3 0.0 5:51.28 bash
40373 root 20 0 17932 4088 496 R 81.7 0.1 6:22.83 bash
40379 root 20 0 17928 3296 496 R 81.0 0.0 2:16.05 bash
40378 root 20 0 17928 4084 496 R 75.7 0.1 6:21.70 bash

it gets 100% of a CPU, unaffected by the ordinary CPU-hungry processes on the host.

5. Summary

Thanks for reading this long and rambling document, whose only purpose is to explain how 4 parameters are used in k8s. To summarize:

The requests.cpu pitfall

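If some container on a machine already has this value configured, say 1 core, then its ability to compete for CPU against a container with no cpu configured is 1024:2, which is far too large a gap for the unconfigured container. So I don't really understand why k8s sets the default minShares to 2 rather than docker's 1024.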

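As a result, when deploying a pod to a machine that is already short on CPU: if you set this value to 1 core, the pod may not get scheduled there; if you don't set it, the pod's resources may not be guaranteed.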

The requests.memory pitfall

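The value itself isn't much of a pitfall. The main issue is that reference 3 sorts containers into Guaranteed, Burstable and Best-Effort, and the class is not specified explicitly but inferred from things like whether requests and limits have matching entries. Frankly this is a pain too, just a somewhat lesser one.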
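Since the class is inferred rather than declared, it may help to see the inference written out. The sketch below is a simplified version of the rules as I understand them (only cpu and memory are considered, and 0 stands for "not set"); the types Res and Container and the function qosClass are hypothetical, not the real k8s API.

```go
package main

import "fmt"

// Res holds cpu and memory values; 0 means "not set" in this simplified model.
type Res struct {
	MilliCPU    int64
	MemoryBytes int64
}

// Container mirrors pod.spec.containers[].resources.
type Container struct {
	Requests Res
	Limits   Res
}

// qosClass infers the QoS class of a pod from its containers:
//   - BestEffort: no container sets any request or limit
//   - Guaranteed: every container sets cpu and memory limits, and each
//     request is either unset or equal to its limit
//   - Burstable: everything else
func qosClass(containers []Container) string {
	anySet := false
	guaranteed := true
	for _, c := range containers {
		if c.Requests != (Res{}) || c.Limits != (Res{}) {
			anySet = true
		}
		if c.Limits.MilliCPU == 0 || c.Limits.MemoryBytes == 0 {
			guaranteed = false
		}
		if c.Requests.MilliCPU != 0 && c.Requests.MilliCPU != c.Limits.MilliCPU {
			guaranteed = false
		}
		if c.Requests.MemoryBytes != 0 && c.Requests.MemoryBytes != c.Limits.MemoryBytes {
			guaranteed = false
		}
	}
	if !anySet {
		return "BestEffort"
	}
	if guaranteed {
		return "Guaranteed"
	}
	return "Burstable"
}

func main() {
	const mi = 1024 * 1024
	// The frontend pod from section 1: requests are lower than limits.
	frontend := []Container{
		{Requests: Res{250, 64 * mi}, Limits: Res{500, 128 * mi}},
		{Requests: Res{250, 64 * mi}, Limits: Res{500, 128 * mi}},
	}
	fmt.Println(qosClass(frontend)) // Burstable
}
```

So the frontend pod used throughout this article is Burstable; raising its requests to match its limits would make it Guaranteed, and removing the resources blocks entirely would make it Best-Effort.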

Reposted from OSChina (開源中國): The ins and outs of container resource control in Kubernetes