
8. Deep Dive into k8s: Resource Control with QoS and Eviction, with Source Code Analysis

> Please credit the source when reprinting. This article was published on luozhiyun's blog: https://www.luozhiyun.com. The source code version analyzed is [1.19](https://github.com/kubernetes/kubernetes/tree/release-1.19)

![83980769_p0_master1200](https://img.luozhiyun.com/20200829221848.jpg)

Another weekend, another chance to sit down and quietly savor some source code. This piece deals with resource reclamation, which covers a lot of ground, so it will be a long one. We will see what k8s does when resources run short, what it weighs when reclaiming them, and why our pods sometimes get killed for no apparent reason.

## limit&request

In k8s, CPU and memory resources are constrained mainly through limits and requests, defined in the YAML file as follows:

```
spec.containers[].resources.limits.cpu
spec.containers[].resources.limits.memory
spec.containers[].resources.requests.cpu
spec.containers[].resources.requests.memory
```

When scheduling, kube-scheduler only takes the requests values into account; what actually caps resource usage is the limit.

Here is an example borrowed from the official documentation:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr
    image: vish/stress
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "0.5"
    args:
    - -cpus
    - "2"
```

In this example, the args pass cpus equal to 2, telling the container to stress two CPUs. The limit, however, is 1, and the request is 0.5.

After creating this pod, if we check resource usage with kubectl top, we see that CPU usage never exceeds 1:

```
NAME                        CPU(cores)   MEMORY(bytes)
cpu-demo                    974m         
```

This shows the pod's CPU is capped at one core; however much the container wants, it cannot get more.

When a container does not specify a request, the request defaults to the same value as the limit.

## QoS Model and Eviction

Different combinations of requests and limits give rise to different QoS classes. Kubernetes has three:

1. `Guaranteed`: every container in the pod has `limit` equal to `request`, and neither is 0, for every resource;
2. `Burstable`: the pod does not meet the Guaranteed criteria, but at least one of its containers has requests or limits set;
3. `BestEffort`: the pod has neither requests nor limits set.

When a node's resources run tight, the kubelet performs Eviction (i.e., resource reclamation) on pods in QoS order: BestEffort > Burstable > Guaranteed.

Eviction comes in two modes, Soft and Hard. Soft Eviction lets you configure a grace period and waits for that user-configured period before evicting, while Hard Eviction evicts immediately.

So when does Eviction happen? We can set thresholds for it. For instance, with a memory eviction hard threshold of 100M, once the machine's available memory drops below 100M the kubelet ranks all pods on the machine by their QoS class and memory usage, and evicts the top-ranked pods to free enough memory.

A threshold is defined as `[eviction-signal][operator][quantity]`

**eviction-signal**

Per the official documentation, the eviction signals are:

| Eviction Signal | Description |
| ------------------ | ------------------------------------------------------------ |
| memory.available | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet |
| nodefs.available | nodefs.available := node.stats.fs.available |
| nodefs.inodesFree | nodefs.inodesFree := node.stats.fs.inodesFree |
| imagefs.available | imagefs.available := node.stats.runtime.imagefs.available |
| imagefs.inodesFree | imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree |

nodefs and imagefs denote two filesystem partitions:

nodefs: the filesystem the kubelet uses for volumes, daemon logs, and so on.

imagefs: the filesystem the container runtime uses for images and container writable layers.

**operator**

The desired relational operator, such as "<".

**quantity**

The threshold amount, either an absolute quantity such as 1Gi or a percentage such as 10%.
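As a concrete example, these thresholds are configured through the kubelet's eviction flags. A minimal sketch, with placeholder values chosen purely for illustration:

```
--eviction-hard=memory.available<100Mi,nodefs.available<10%
--eviction-soft=memory.available<300Mi
--eviction-soft-grace-period=memory.available=1m30s
--eviction-minimum-reclaim=memory.available=500Mi
```

Here the hard threshold evicts as soon as available memory falls below 100Mi, while the soft threshold only evicts after memory has stayed below 300Mi for the full 1m30s grace period.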
If the kubelet cannot reclaim memory before the node experiences a system OOM, the oom_killer computes an oom_score from the percentage of node memory each container is using, then kills the highest-scoring container.

## QoS Source Code Analysis

The QoS code lives in the pkg/apis/core/v1/helper/qos package:

**qos#GetPodQOS**

```go
//pkg/apis/core/v1/helper/qos/qos.go
func GetPodQOS(pod *v1.Pod) v1.PodQOSClass {
	requests := v1.ResourceList{}
	limits := v1.ResourceList{}
	zeroQuantity := resource.MustParse("0")
	isGuaranteed := true
	allContainers := []v1.Container{}
	// collect the regular containers and all the init containers
	allContainers = append(allContainers, pod.Spec.Containers...)
	allContainers = append(allContainers, pod.Spec.InitContainers...)
	// iterate over the containers
	for _, container := range allContainers {
		// process requests
		// walk the cpu/memory entries in requests and accumulate their values
		for name, quantity := range container.Resources.Requests {
			if !isSupportedQoSComputeResource(name) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				delta := quantity.DeepCopy()
				if _, exists := requests[name]; !exists {
					requests[name] = delta
				} else {
					delta.Add(requests[name])
					requests[name] = delta
				}
			}
		}
		// process limits
		qosLimitsFound := sets.NewString()
		// walk the cpu/memory entries in limits and accumulate their values
		for name, quantity := range container.Resources.Limits {
			if !isSupportedQoSComputeResource(name) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				qosLimitsFound.Insert(string(name))
				delta := quantity.DeepCopy()
				if _, exists := limits[name]; !exists {
					limits[name] = delta
				} else {
					delta.Add(limits[name])
					limits[name] = delta
				}
			}
		}
		// if limits does not cover both cpu and memory, the pod cannot be Guaranteed
		if !qosLimitsFound.HasAll(string(v1.ResourceMemory), string(v1.ResourceCPU)) {
			isGuaranteed = false
		}
	}
	// neither requests nor limits set anywhere: BestEffort
	if len(requests) == 0 && len(limits) == 0 {
		return v1.PodQOSBestEffort
	}
	// Check if requests match limits for all resources.
	if isGuaranteed {
		for name, req := range requests {
			if lim, exists := limits[name]; !exists || lim.Cmp(req) != 0 {
				isGuaranteed = false
				break
			}
		}
	}
	// limits and requests are both fully set and equal: Guaranteed
	if isGuaranteed && len(requests) == len(limits) {
		return v1.PodQOSGuaranteed
	}
	return v1.PodQOSBurstable
}
```

The comments above say it all; this one is quite simple.
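To make the classification concrete, here is a minimal sketch of a pod that GetPodQOS would classify as Burstable (the pod name and values are made up for illustration): a request is set, so it is not BestEffort, but the limits do not cover both cpu and memory, so it cannot be Guaranteed.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable-demo    # hypothetical example pod
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "128Mi"       # a request is set, so not BestEffort
      limits:
        memory: "256Mi"       # no cpu limit, so isGuaranteed becomes false
```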
Next up is the QoS OOM scoring mechanism: pods are scored to decide which can be killed first, and the higher the score, the sooner a pod gets killed.

**policy**

```go
//pkg/kubelet/qos/policy.go
// the higher the score, the easier it is to get killed
const (
	// KubeletOOMScoreAdj is the OOM score adjustment for Kubelet
	KubeletOOMScoreAdj int = -999
	// KubeProxyOOMScoreAdj is the OOM score adjustment for kube-proxy
	KubeProxyOOMScoreAdj  int = -999
	guaranteedOOMScoreAdj int = -998
	besteffortOOMScoreAdj int = 1000
)
```

**policy#GetContainerOOMScoreAdjust**

```go
//pkg/kubelet/qos/policy.go
func GetContainerOOMScoreAdjust(pod *v1.Pod, container *v1.Container, memoryCapacity int64) int {
	// static pods, mirror pods, and critical-priority pods go straight to guaranteedOOMScoreAdj
	if types.IsCriticalPod(pod) {
		// Critical pods should be the last to get killed.
		return guaranteedOOMScoreAdj
	}
	// get the pod's QoS class; only Guaranteed and BestEffort are handled here
	switch v1qos.GetPodQOS(pod) {
	case v1.PodQOSGuaranteed:
		// Guaranteed containers should be the last to get killed.
		return guaranteedOOMScoreAdj
	case v1.PodQOSBestEffort:
		return besteffortOOMScoreAdj
	}
	memoryRequest := container.Resources.Requests.Memory().Value()
	// the less memory requested, the higher the score
	oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity
	// this guarantees a burstable pod always has a higher OOM score than a guaranteed one
	if int(oomScoreAdjust) < (1000 + guaranteedOOMScoreAdj) {
		return (1000 + guaranteedOOMScoreAdj)
	}
	if int(oomScoreAdjust) == besteffortOOMScoreAdj {
		return int(oomScoreAdjust - 1)
	}
	return int(oomScoreAdjust)
}
```

This method scores the various kinds of pods. Static pods, mirror pods, and high-priority (critical) pods are treated directly as guaranteed; otherwise it calls GetPodQOS to get the pod's QoS class, returning fixed scores for Guaranteed and BestEffort. If the pod is Burstable, it is scored from its memory request: the smaller the request, the higher the score. If the score would fall below 1000 + guaranteedOOMScoreAdj, i.e. 2, it is clamped to 2 so it can never sink as low as a Guaranteed pod; and if it would equal besteffortOOMScoreAdj, it is reduced by 1 so it stays below BestEffort.
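To see the Burstable formula in action, here is a small standalone sketch that re-implements just that branch; the 4Gi node capacity and the request sizes are made-up numbers for illustration:

```go
package main

import "fmt"

// mirrors the clamping constants from pkg/kubelet/qos/policy.go
const (
	guaranteedOOMScoreAdj = -998
	besteffortOOMScoreAdj = 1000
)

// burstableOOMScoreAdjust re-implements only the Burstable branch of
// GetContainerOOMScoreAdjust, for illustration.
func burstableOOMScoreAdjust(memoryRequest, memoryCapacity int64) int {
	oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity
	if int(oomScoreAdjust) < (1000 + guaranteedOOMScoreAdj) {
		return 1000 + guaranteedOOMScoreAdj // clamp to 2
	}
	if int(oomScoreAdjust) == besteffortOOMScoreAdj {
		return int(oomScoreAdjust) - 1 // stay just below BestEffort's 1000
	}
	return int(oomScoreAdjust)
}

func main() {
	capacity := int64(4 << 30) // a hypothetical 4Gi node
	// a 1Gi request scores 1000 - 250 = 750
	fmt.Println(burstableOOMScoreAdjust(1<<30, capacity)) // 750
	// a tiny 4Mi request rounds to 1000, which is then clamped to 999
	fmt.Println(burstableOOMScoreAdjust(4<<20, capacity)) // 999
}
```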
## Eviction Manager Source Code Analysis

When the kubelet instantiates its kubelet object, it calls `eviction.NewManager` to create an evictionManager. Then, when the kubelet starts working in its Run method, it spawns a goroutine that runs updateRuntimeUp every 5s. In updateRuntimeUp, once the runtime is confirmed to have started successfully, initializeRuntimeDependentModules is called to initialize the runtime-dependent modules, which in turn calls the evictionManager's Start method.

The code is below; we will walk through the overall kubelet flow another time:

```go
func NewMainKubelet(...){
	...
	evictionManager, evictionAdmitHandler := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.podManager.GetMirrorPodByPod, klet.imageManager, klet.containerGC, kubeDeps.Recorder, nodeRef, klet.clock, etcHostsPathFunc)

	klet.evictionManager = evictionManager
	...
}

func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
	...
	go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
	...
}

func (kl *Kubelet) updateRuntimeUp() {
	...
	kl.oneTimeInitializer.Do(kl.initializeRuntimeDependentModules)
	...
}

func (kl *Kubelet) initializeRuntimeDependentModules() {
	...
	kl.evictionManager.Start(kl.StatsProvider, kl.GetActivePods, kl.podResourcesAreReclaimed, evictionMonitoringPeriod)
	...
}
```

Now let's open pkg/kubelet/eviction/eviction_manager.go and see how the Start method implements eviction.

**managerImpl#Start**

```go
// Start a control loop to watch for and respond to low-resource conditions
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, podCleanedUpFunc PodCleanedUpFunc, monitoringInterval time.Duration) {
	thresholdHandler := func(message string) {
		klog.Infof(message)
		m.synchronize(diskInfoProvider, podFunc)
	}
	// whether to use kernel memcg notifications
	if m.config.KernelMemcgNotification {
		for _, threshold := range m.config.Thresholds {
			if threshold.Signal == evictionapi.SignalMemoryAvailable || threshold.Signal == evictionapi.SignalAllocatableMemoryAvailable {
				notifier, err := NewMemoryThresholdNotifier(threshold, m.config.PodCgroupRoot, &CgroupNotifierFactory{}, thresholdHandler)
				if err != nil {
					klog.Warningf("eviction manager: failed to create memory threshold notifier: %v", err)
				} else {
					go notifier.Start()
					m.thresholdNotifiers = append(m.thresholdNotifiers, notifier)
				}
			}
		}
	}
	// start the eviction manager monitoring
	// spawn a goroutine whose for loop runs synchronize once every monitoringInterval (10s)
	go func() {
		for {
			// synchronize is the main eviction control loop; it returns the evicted pods, or nil
			if evictedPods := m.synchronize(diskInfoProvider, podFunc); evictedPods != nil {
				klog.Infof("eviction manager: pods %s evicted, waiting for pod to be cleaned up", format.Pods(evictedPods))
				m.waitForPodsCleanup(podCleanedUpFunc, evictedPods)
			} else {
				time.Sleep(monitoringInterval)
			}
		}
	}()
}
```

The synchronize method below is long, so bear with me:

**managerImpl#synchronize**

1. Register a ranking function for each eviction signal introduced above, along with the node-level resource reclaim functions

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	...
	if m.dedicatedImageFs == nil {
		hasImageFs, ok := diskInfoProvider.HasDedicatedImageFs()
		if ok != nil {
			return nil
		}
		m.dedicatedImageFs = &hasImageFs
		// register the ranking function for each eviction signal
		m.signalToRankFunc = buildSignalToRankFunc(hasImageFs)
		// register the node-level reclaim functions; e.g. imagefs.available maps to deleting unused containers and images
		m.signalToNodeReclaimFuncs = buildSignalToNodeReclaimFuncs(m.imageGC, m.containerGC, hasImageFs)
	}
	...
}
```

Here is the implementation of buildSignalToRankFunc:

```go
func buildSignalToRankFunc(withImageFs bool) map[evictionapi.Signal]rankFunc {
	signalToRankFunc := map[evictionapi.Signal]rankFunc{
		evictionapi.SignalMemoryAvailable:            rankMemoryPressure,
		evictionapi.SignalAllocatableMemoryAvailable: rankMemoryPressure,
		evictionapi.SignalPIDAvailable:               rankPIDPressure,
	}
	if withImageFs {
		signalToRankFunc[evictionapi.SignalNodeFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsLogs, fsStatsLocalVolumeSource}, v1.ResourceEphemeralStorage)
		signalToRankFunc[evictionapi.SignalNodeFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsLogs, fsStatsLocalVolumeSource}, resourceInodes)
		signalToRankFunc[evictionapi.SignalImageFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot}, v1.ResourceEphemeralStorage)
		signalToRankFunc[evictionapi.SignalImageFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot}, resourceInodes)
	} else {
		signalToRankFunc[evictionapi.SignalNodeFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, v1.ResourceEphemeralStorage)
		signalToRankFunc[evictionapi.SignalNodeFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, resourceInodes)
		signalToRankFunc[evictionapi.SignalImageFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, v1.ResourceEphemeralStorage)
		signalToRankFunc[evictionapi.SignalImageFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, resourceInodes)
	}
	return signalToRankFunc
}
```

This method builds and returns a map from each eviction signal, such as MemoryAvailable, NodeFsAvailable, and ImageFsAvailable, to its ranking function.
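The rank functions order pods for eviction. Per the out-of-resource documentation, the memory ordering is: pods whose usage exceeds their requests first, then lower pod priority, then larger usage above requests. The following is a simplified sketch of that idea, not the actual rankMemoryPressure implementation; the podMemStat type and the sample pods are made up for illustration:

```go
package main

import (
	"fmt"
	"sort"
)

// podMemStat is a hypothetical, simplified view of a pod for ranking purposes.
type podMemStat struct {
	name     string
	usage    int64 // bytes of memory in use
	request  int64 // bytes of memory requested
	priority int32
}

// rankForMemoryEviction sketches the documented eviction order under memory pressure.
func rankForMemoryEviction(pods []podMemStat) {
	sort.Slice(pods, func(i, j int) bool {
		iExceeds := pods[i].usage > pods[i].request
		jExceeds := pods[j].usage > pods[j].request
		if iExceeds != jExceeds {
			return iExceeds // pods exceeding their request are evicted first
		}
		if pods[i].priority != pods[j].priority {
			return pods[i].priority < pods[j].priority // lower priority goes first
		}
		// finally, the pod consuming the most above its request goes first
		return pods[i].usage-pods[i].request > pods[j].usage-pods[j].request
	})
}

func main() {
	pods := []podMemStat{
		{"within-request", 100 << 20, 200 << 20, 0},
		{"over-request-low-prio", 300 << 20, 100 << 20, 0},
		{"over-request-high-prio", 300 << 20, 100 << 20, 1000},
	}
	rankForMemoryEviction(pods)
	fmt.Println(pods[0].name) // over-request-low-prio is evicted first
}
```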
2. Fetch all the active pods, plus the overall stats

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	...
	// fetch the currently active pods
	activePods := podFunc()
	updateStats := true
	// fetch the node's overall picture, i.e. nodeStats and podStats
	summary, err := m.summaryProvider.Get(updateStats)
	if err != nil {
		klog.Errorf("eviction manager: failed to get summary stats: %v", err)
		return nil
	}
	// if the notifiers have not been refreshed for more than 10s, update them
	if m.clock.Since(m.thresholdsLastUpdated) > notifierRefreshInterval {
		m.thresholdsLastUpdated = m.clock.Now()
		for _, notifier := range m.thresholdNotifiers {
			if err := notifier.UpdateThreshold(summary); err != nil {
				klog.Warningf("eviction manager: failed to update %s: %v", notifier.Description(), err)
			}
		}
	}
	...
}
```

3. Build the corresponding statistics from the summary into the observations object

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	...
	// build per-signal statistics, such as SignalMemoryAvailable and SignalNodeFsAvailable, from the summary into observations
	observations, statsFunc := makeSignalObservations(summary)
	...
}
```

Here is an excerpt of **makeSignalObservations**:

```go
func makeSignalObservations(summary *statsapi.Summary) (signalObservations, statsFunc) {
	...
	if memory := summary.Node.Memory; memory != nil && memory.AvailableBytes != nil && memory.WorkingSetBytes != nil {
		result[evictionapi.SignalMemoryAvailable] = signalObservation{
			available: resource.NewQuantity(int64(*memory.AvailableBytes), resource.BinarySI),
			capacity:  resource.NewQuantity(int64(*memory.AvailableBytes+*memory.WorkingSetBytes), resource.BinarySI),
			time:      memory.Time,
		}
	}
	...
}
```

This method packages the resource usage from the summary into the result, keyed by eviction signal.

4. Use the observations to determine which thresholds have been met

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	...
	// determine from the observations which thresholds have been reached, and return them
	thresholds = thresholdsMet(thresholds, observations, false)

	if len(m.thresholdsMet) > 0 {
		// minimum eviction reclaim policy
		thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
		thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
	}
	...
}
```

**thresholdsMet**

```go
func thresholdsMet(thresholds []evictionapi.Threshold, observations signalObservations, enforceMinReclaim bool) []evictionapi.Threshold {
	results := []evictionapi.Threshold{}
	for i := range thresholds {
		threshold := thresholds[i]
		observed, found := observations[threshold.Signal]
		if !found {
			klog.Warningf("eviction manager: no observation found for eviction signal %v", threshold.Signal)
			continue
		}
		thresholdMet := false
		// compute the threshold quantity from the resource capacity
		quantity := evictionapi.GetThresholdQuantity(threshold.Value, observed.capacity)
		// minimum eviction reclaim policy, see: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#minimum-eviction-reclaim
		if enforceMinReclaim && threshold.MinReclaim != nil {
			quantity.Add(*evictionapi.GetThresholdQuantity(*threshold.MinReclaim, observed.capacity))
		}
		// returns 1 if quantity is greater than observed.available
		thresholdResult := quantity.Cmp(*observed.available)
		// check the operator
		switch threshold.Operator {
		// for the "<" operator, the threshold is met when thresholdResult is greater than 0
		case evictionapi.OpLessThan:
			thresholdMet = thresholdResult > 0
		}
		// appending to results means the threshold has been reached
		if thresholdMet {
			results = append(results, threshold)
		}
	}
	return results
}
```

thresholdsMet walks the thresholds and looks up the resource situation for each eviction signal in the observations. As we said earlier, a threshold can be configured either as an absolute value like 1Gi or as a percentage, so GetThresholdQuantity is called to normalize it into a quantity. The minimum eviction reclaim policy is then applied to decide whether the amount that must be reclaimed should be raised; for details see https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#minimum-eviction-reclaim. Finally the quantity is compared against available, and any threshold that has been met is appended to the results.
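As a back-of-the-envelope check of this comparison logic, here is a self-contained sketch using the apimachinery resource package. The 10Gi capacity, 10% threshold, and 800Mi of available memory are made-up numbers; the percentage normalization that GetThresholdQuantity performs is done by hand here:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// a hypothetical node with 10Gi of memory capacity
	capacity := resource.MustParse("10Gi")
	// a 10% threshold normalizes to 1Gi of that capacity
	threshold := resource.NewQuantity(capacity.Value()/10, resource.BinarySI)

	// observed available memory: 800Mi
	available := resource.MustParse("800Mi")

	// Cmp returns 1 when threshold > available, i.e. the "<" threshold is met
	if threshold.Cmp(available) > 0 {
		fmt.Printf("threshold met: available %s < threshold %s\n",
			available.String(), threshold.String())
	}
}
```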
5. Record the first time each eviction signal was observed, and map the Eviction Signals to Node Conditions

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	...
	now := m.clock.Now()
	// record the first time each eviction signal was observed, defaulting to now
	thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)
	// the set of node conditions that are triggered by currently observed thresholds
	// the kubelet maps each Eviction Signal to the corresponding Node Condition
	nodeConditions := nodeConditions(thresholds)
	if len(nodeConditions) > 0 {
		klog.V(3).Infof("eviction manager: node conditions - observed: %v", nodeConditions)
	}
	...
}
```

**nodeConditions**

```go
func nodeConditions(thresholds []evictionapi.Threshold) []v1.NodeConditionType {
	results := []v1.NodeConditionType{}
	for _, threshold := range thresholds {
		if nodeCondition, found := signalToNodeCondition[threshold.Signal]; found {
			// check whether results already contains this nodeCondition
			if !hasNodeCondition(results, nodeCondition) {
				results = append(results, nodeCondition)
			}
		}
	}
	return results
}
```

The nodeConditions method simply maps each signal to its node condition via signalToNodeCondition, which is defined as follows:

```go
signalToNodeCondition = map[evictionapi.Signal]v1.NodeConditionType{}
signalToNodeCondition[evictionapi.SignalMemoryAvailable] = v1.NodeMemoryPressure
signalToNodeCondition[evictionapi.SignalAllocatableMemoryAvailable] = v1.NodeMemoryPressure
signalToNodeCondition[evictionapi.SignalImageFsAvailable] = v1.NodeDiskPressure
signalToNodeCondition[evictionapi.SignalNodeFsAvailable] = v1.NodeDiskPressure
signalToNodeCondition[evictionapi.SignalImageFsInodesFree] = v1.NodeDiskPressure
signalToNodeCondition[evictionapi.SignalNodeFsInodesFree] = v1.NodeDiskPressure
signalToNodeCondition[evictionapi.SignalPIDAvailable] = v1.NodePIDPressure
```

In other words, the Eviction Signals are mapped to MemoryPressure or DiskPressure. Summarized as a table:

| Node Condition | Eviction Signal | Description |
| -------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| MemoryPressure | memory.available | Available memory on the node has satisfied an eviction threshold |
| DiskPressure | nodefs.available, nodefs.inodesFree, imagefs.available, or imagefs.inodesFree | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |

6. Merge this round's node conditions with the last observed ones, keeping the most recent

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	...
	// merge this round's node conditions with the last observed ones, keeping the most recent
	nodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)
	...
}
```

7. Keep the Node Condition from flapping when resources oscillate around the threshold

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	...
	// the PressureTransitionPeriod parameter defaults to 5 minutes
	// this prevents the node's resources from oscillating around the threshold and flipping the Node Condition back and forth
	// for details see: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#oscillation-of-node-conditions
	nodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)
	if len(nodeConditions) > 0 {
		klog.V(3).Infof("eviction manager: node conditions - transition period not met: %v", nodeConditions)
	}
	...
}
```

**nodeConditionsObservedSince**

```go
func nodeConditionsObservedSince(observedAt nodeConditionsObservedAt, period time.Duration, now time.Time) []v1.NodeConditionType {
	results := []v1.NodeConditionType{}
	for nodeCondition, at := range observedAt {
		duration := now.Sub(at)
		if duration < period {
			results = append(results, nodeCondition)
		}
	}
	return results
}
```

Conditions last observed more than 5 minutes (one transition period) ago are filtered out.
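The transition period itself is tunable via a kubelet flag; for example (an illustrative value, matching the default):

```
--eviction-pressure-transition-period=5m0s
```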
8. Apply the eviction-soft grace periods

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	...
	// apply eviction-soft-grace-period: a soft threshold only stays in the set once its grace period has elapsed
	thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)
	...
}
```

**thresholdsMetGracePeriod**

```go
func thresholdsMetGracePeriod(observedAt thresholdsObservedAt, now time.Time) []evictionapi.Threshold {
	results := []evictionapi.Threshold{}
	for threshold, at := range observedAt {
		duration := now.Sub(at)
		// soft eviction thresholds must wait out their grace period before they can trigger
		if duration < threshold.GracePeriod {
			klog.V(2).Infof("eviction manager: eviction criteria not yet met for %v, duration: %v", formatThreshold(threshold), duration)
			continue
		}
		results = append(results, threshold)
	}
	return results
}
```

9. Update the internal state, then compare and refresh

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	...
	// update internal state
	m.Lock()
	m.nodeConditions = nodeConditions
	m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
	m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
	m.thresholdsMet = thresholds

	// keep only the thresholds whose stats have been refreshed since the last observation
	thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)
	debugLogThresholdsWithObservation("thresholds - updated stats", thresholds, observations)
	// store this round's observations as the previous observations
	m.lastObservations = observations
	m.Unlock()
	...
}
```

10. Sort, then find the first threshold to reclaim for, along with its corresponding resource

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	...
	// if the eviction signal set is empty, this round ends here
	if len(thresholds) == 0 {
		klog.V(3).Infof("eviction manager: no resources are starved")
		return nil
	}
	// sort, then take the first element of the thresholds set
	sort.Sort(byEvictionPriority(thresholds))
	thresholdToReclaim, resourceToReclaim, foundAny := getReclaimableThreshold(thresholds)
	if !foundAny {
		return nil
	}
	...
}
```

**getReclaimableThreshold**

```go
func getReclaimableThreshold(thresholds []evictionapi.Threshold) (evictionapi.Threshold, v1.ResourceName, bool) {
	// walk the thresholds and map each eviction signal to its resource
	for _, thresholdToReclaim := range thresholds {
		if resourceToReclaim, ok := signalToResource[thresholdToReclaim.Signal]; ok {
			return thresholdToReclaim, resourceToReclaim, true
		}
		klog.V(3).Infof("eviction manager: threshold %s was crossed, but reclaim is not implemented for this threshold.", thresholdToReclaim.Signal)
	}
	return evictionapi.Threshold{}, "", false
}
```

Here is the definition of signalToResource:

```go
signalToResource = map[evictionapi.Signal]v1.ResourceName{}
signalToResource[evictionapi.SignalMemoryAvailable] = v1.ResourceMemory
signalToResource[evictionapi.SignalAllocatableMemoryAvailable] = v1.ResourceMemory
signalToResource[evictionapi.SignalImageFsAvailable] = v1.ResourceEphemeralStorage
signalToResource[evictionapi.SignalImageFsInodesFree] = resourceInodes
signalToResource[evictionapi.SignalNodeFsAvailable] = v1.ResourceEphemeralStorage
signalToResource[evictionapi.SignalNodeFsInodesFree] = resourceInodes
signalToResource[evictionapi.SignalPIDAvailable] = resourcePids
```

signalToResource groups the Eviction Signals into memory, ephemeral-storage, inodes, and pids.
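byEvictionPriority ranks memory ahead of every other resource, so memory pressure is always relieved first. Below is a simplified sketch of that ordering, not the actual implementation (which operates on evictionapi.Threshold values); the signal type and sample values are made up for illustration:

```go
package main

import (
	"fmt"
	"sort"
)

// signal stands in for evictionapi.Signal in this sketch.
type signal string

// isMemorySignal captures the rule that memory signals sort ahead of all others.
func isMemorySignal(s signal) bool {
	return s == "memory.available" || s == "allocatableMemory.available"
}

func main() {
	signals := []signal{"nodefs.available", "memory.available", "imagefs.available"}
	// simplified byEvictionPriority: memory first, everything else keeps its order
	sort.SliceStable(signals, func(i, j int) bool {
		return isMemorySignal(signals[i]) && !isMemorySignal(signals[j])
	})
	fmt.Println(signals) // [memory.available nodefs.available imagefs.available]
}
```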
11. Reclaim node-level resources

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	...
	// try to reclaim node-level resources first
	if m.reclaimNodeLevelResources(thresholdToReclaim.Signal, resourceToReclaim) {
		klog.Infof("eviction manager: able to reduce %v pressure without evicting pods.", resourceToReclaim)
		return nil
	}
	...
}
```

**reclaimNodeLevelResources**

```go
func (m *managerImpl) reclaimNodeLevelResources(signalToReclaim evictionapi.Signal, resourceToReclaim v1.ResourceName) bool {
	// look up the functions registered in buildSignalToNodeReclaimFuncs
	nodeReclaimFuncs := m.signalToNodeReclaimFuncs[signalToReclaim]
	for _, nodeReclaimFunc := range nodeReclaimFuncs {
		// delete unused images, or pods and containers that are already dead
		if err := nodeReclaimFunc(); err != nil {
			klog.Warningf("eviction manager: unexpected error when attempting to reduce %v pressure: %v", resourceToReclaim, err)
		}
	}
	// after reclaiming, re-check resource usage; if we are back under the thresholds, stop here
	if len(nodeReclaimFuncs) > 0 {
		summary, err := m.summaryProvider.Get(true)
		if err != nil {
			klog.Errorf("eviction manager: failed to get summary stats after resource reclaim: %v", err)
			return false
		}
		observations, _ := makeSignalObservations(summary)
		debugLogObservations("observations after resource reclaim", observations)
		thresholds := thresholdsMet(m.config.Thresholds, observations, false)
		debugLogThresholdsWithObservation("thresholds after resource reclaim - ignoring grace period", thresholds, observations)
		if len(thresholds) == 0 {
			return true
		}
	}
	return false
}
```

First, the reclaim functions for the signal are looked up in signalToNodeReclaimFuncs; they were registered in buildSignalToNodeReclaimFuncs above, e.g.:

```
nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
```

These call the corresponding GC routines, deleting unused containers and images to free up resources. Afterwards it checks whether any threshold is still exceeded; if not, this round ends without evicting any pods.
12. Fetch the corresponding ranking function and sort the pods

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	...
	// fetch the signal's ranking function, registered in buildSignalToRankFunc above
	rank, ok := m.signalToRankFunc[thresholdToReclaim.Signal]
	if !ok {
		klog.Errorf("eviction manager: no ranking function for signal %s", thresholdToReclaim.Signal)
		return nil
	}
	// if there are no active pods, return right away
	if len(activePods) == 0 {
		klog.Errorf("eviction manager: eviction thresholds have been met, but no pods are active to evict")
		return nil
	}
	// rank the pods by the starved resource
	rank(activePods, statsFunc)
	...
}
```

13. Evict pods in ranked order and return

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	...
	for i := range activePods {
		pod := activePods[i]
		gracePeriodOverride := int64(0)
		if !isHardEvictionThreshold(thresholdToReclaim) {
			gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
		}
		message, annotations := evictionMessage(resourceToReclaim, pod, statsFunc)
		// kill the pod
		if m.evictPod(pod, gracePeriodOverride, message, annotations) {
			metrics.Evictions.WithLabelValues(string(thresholdToReclaim.Signal)).Inc()
			return []*v1.Pod{pod}
		}
	}
	...
}
```

As soon as one pod has been evicted, the method returns. Note the grace period here: a hard eviction kills the pod immediately (gracePeriodOverride of 0), while a soft eviction allows up to MaxPodGracePeriodSeconds.

And that completes our analysis of the eviction manager.
## Summary

This article explained how resource control works. The limit and request settings affect the priority with which pods are evicted, so setting sensible limits and requests makes our pods less likely to be killed; and as the source code showed, limit and request determine the QoS class, which in turn feeds into the OOM score and therefore into the order in which pods are killed.

We then walked through the source to see how thresholds are configured in k8s and by what criteria pods are killed when resources run short, which took up most of the article. The source also reveals how much thought k8s puts into eviction: how oscillating node conditions are handled, which kinds of resources are reclaimed first, how minimum-reclaim is implemented, and so on.

## Reference

https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/

https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/

https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/

https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/

https://zhuanlan.zhihu.com/p/38359775

https://cloud.tencent.com/developer/article/1097431

https://developer.aliyun.com/article/679216

https://my.oschina.net/jxcdwangtao/blog