
[kubernetes/k8s source code analysis] The eviction mechanism: principles and source code walkthrough

What?

 

Why?

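  Drawbacks of letting the kubelet rely on the OOM Killer to reclaim resources:

  • System OOM events are expensive and can stall the node until the OOM kill has completed
  • After the OOM Killer kills containers, the Scheduler may place new Pods on the Node, or the containers may simply be restarted on the node, which triggers the OOM Killer on that Node again; this cycle can repeat indefinitely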

How?

  Default eviction settings at kubelet startup

--eviction-hard="imagefs.available<15%,memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%"

--eviction-max-pod-grace-period="0"

--eviction-minimum-reclaim=""

--eviction-pressure-transition-period="5m0s"

--eviction-soft=""

--eviction-soft-grace-period=""

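      Note: thresholds come in two kinds, eviction-soft and eviction-hard. A soft threshold must stay exceeded for its grace period (--eviction-soft-grace-period) before eviction starts, and the evicted pods are then given time to exit gracefully, bounded by --eviction-max-pod-grace-period. A hard threshold kills pods immediately, with no chance of a graceful exit.

  eviction signals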
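The difference between the two kinds of threshold can be illustrated with a minimal sketch. It assumes that a soft threshold only triggers eviction once it has been exceeded for its whole grace period, while a hard threshold (modeled here as a zero grace period) triggers at once. The `threshold` type and the `gracePeriodMet` and `shouldEvict` helpers are hypothetical names for illustration, not the kubelet's own API.

```go
package main

import (
	"fmt"
	"time"
)

// threshold is a hypothetical, simplified eviction threshold.
// A zero gracePeriod marks it as a hard threshold.
type threshold struct {
	signal      string
	gracePeriod time.Duration
}

// gracePeriodMet reports whether a threshold that has been exceeded for
// exceededFor should now trigger eviction.
func gracePeriodMet(t threshold, exceededFor time.Duration) bool {
	if t.gracePeriod == 0 {
		return true // hard threshold: act immediately
	}
	return exceededFor >= t.gracePeriod
}

// shouldEvict derives the elapsed time from the moment the threshold was
// first observed, in the spirit of thresholdsFirstObservedAt.
func shouldEvict(t threshold, firstObserved, now time.Time) bool {
	return gracePeriodMet(t, now.Sub(firstObserved))
}

func main() {
	first := time.Date(2018, 1, 1, 0, 0, 0, 0, time.UTC)
	hard := threshold{signal: "memory.available"}
	soft := threshold{signal: "memory.available", gracePeriod: 90 * time.Second}

	fmt.Println("hard, just observed:", shouldEvict(hard, first, first))
	fmt.Println("soft, after 30s:", shouldEvict(soft, first, first.Add(30*time.Second)))
	fmt.Println("soft, after 2m:", shouldEvict(soft, first, first.Add(2*time.Minute)))
}
```

Tracking the first-observed time per threshold is what lets a single periodic check decide whether a soft threshold has been exceeded long enough.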

  • memory.available
  • nodefs.available
  • nodefs.inodesFree
  • imagefs.available
  • imagefs.inodesFree
  • allocatableMemory.available

Note:

  • nodefs: the node's own storage, which holds runtime logs and the like
  • imagefs: the storage dockerd uses for images and container writable layers

 

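The managerImpl struct

  • killPodFunc: assigned the killPodNow function
  • imageGC: under disk pressure, imageGC deletes unused images
  • thresholdsFirstObservedAt: records when each threshold was first observed
  • signalToRankFunc: maps each signal to the ranking function used to choose which pods to evict
  • nodeConditionsLastObservedAt: records when each node condition was last observed because a threshold was met
  • notifierInitialized: a bool indicating whether the threshold notifier has been initialized, which determines whether the kernel memcg notification feature can be used to speed up eviction response. It is false when the manager is created; whether kernel memcg notification is used depends entirely on the kubelet's --experimental-kernel-memcg-notification flag. (This field does not appear in the version of the struct listed below, which carries thresholdNotifiers and thresholdsLastUpdated.)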
// managerImpl implements Manager
type managerImpl struct {
	//  used to track time
	clock clock.Clock
	// config is how the manager is configured
	config Config
	// the function to invoke to kill a pod
	killPodFunc KillPodFunc
	// the interface that knows how to do image gc
	imageGC ImageGC
	// the interface that knows how to do container gc
	containerGC ContainerGC
	// protects access to internal state
	sync.RWMutex
	// node conditions are the set of conditions present
	nodeConditions []v1.NodeConditionType
	// captures when a node condition was last observed based on a threshold being met
	nodeConditionsLastObservedAt nodeConditionsObservedAt
	// nodeRef is a reference to the node
	nodeRef *v1.ObjectReference
	// used to record events about the node
	recorder record.EventRecorder
	// used to measure usage stats on system
	summaryProvider stats.SummaryProvider
	// records when a threshold was first observed
	thresholdsFirstObservedAt thresholdsObservedAt
	// records the set of thresholds that have been met (including graceperiod) but not yet resolved
	thresholdsMet []evictionapi.Threshold
	// signalToRankFunc maps a resource to ranking function for that resource.
	signalToRankFunc map[evictionapi.Signal]rankFunc
	// signalToNodeReclaimFuncs maps a resource to an ordered list of functions that know how to reclaim that resource.
	signalToNodeReclaimFuncs map[evictionapi.Signal]nodeReclaimFuncs
	// last observations from synchronize
	lastObservations signalObservations
	// dedicatedImageFs indicates if imagefs is on a separate device from the rootfs
	dedicatedImageFs *bool
	// thresholdNotifiers is a list of memory threshold notifiers which each notify for a memory eviction threshold
	thresholdNotifiers []ThresholdNotifier
	// thresholdsLastUpdated is the last time the thresholdNotifiers were updated.
	thresholdsLastUpdated time.Time
}

 

1. Initializing the eviction manager

  Path: pkg/kubelet/kubelet.go

  1.1 Eviction configuration parameters

     See the default kubelet eviction settings above

	thresholds, err := eviction.ParseThresholdConfig(enforceNodeAllocatable, kubeCfg.EvictionHard, kubeCfg.EvictionSoft, kubeCfg.EvictionSoftGracePeriod, kubeCfg.EvictionMinimumReclaim)
	if err != nil {
		return nil, err
	}
	evictionConfig := eviction.Config{
		PressureTransitionPeriod: kubeCfg.EvictionPressureTransitionPeriod.Duration,
		MaxPodGracePeriodSeconds: int64(kubeCfg.EvictionMaxPodGracePeriod),
		Thresholds:               thresholds,
		KernelMemcgNotification:  experimentalKernelMemcgNotification,
		PodCgroupRoot:            kubeDeps.ContainerManager.GetPodCgroupRoot(),
	}
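To give a feel for what turning the threshold configuration into structured values involves, here is a simplified sketch. It parses a flag-style string such as the `--eviction-hard` default into signal/value pairs and recognizes percentages. It is not the real `eviction.ParseThresholdConfig`, which takes different arguments and validates far more; the `threshold` type and `parseThresholds` function are hypothetical names for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// threshold is a hypothetical, simplified form of one parsed entry.
type threshold struct {
	Signal     string
	Value      string  // raw value, e.g. "100Mi" or "15%"
	Percentage float64 // fraction in (0,1] when Value is a percentage, else 0
}

// parseThresholds splits a value such as
// "memory.available<100Mi,nodefs.available<10%" into thresholds.
func parseThresholds(expr string) ([]threshold, error) {
	var out []threshold
	if expr == "" {
		return out, nil
	}
	for _, part := range strings.Split(expr, ",") {
		kv := strings.SplitN(part, "<", 2)
		if len(kv) != 2 || kv[0] == "" || kv[1] == "" {
			return nil, fmt.Errorf("invalid threshold %q", part)
		}
		t := threshold{Signal: kv[0], Value: kv[1]}
		if strings.HasSuffix(kv[1], "%") {
			p, err := strconv.ParseFloat(strings.TrimSuffix(kv[1], "%"), 64)
			if err != nil || p <= 0 || p > 100 {
				return nil, fmt.Errorf("invalid percentage in %q", part)
			}
			t.Percentage = p / 100
		}
		out = append(out, t)
	}
	return out, nil
}

func main() {
	ts, err := parseThresholds("imagefs.available<15%,memory.available<100Mi")
	if err != nil {
		panic(err)
	}
	for _, t := range ts {
		fmt.Printf("signal=%s value=%s percentage=%.2f\n", t.Signal, t.Value, t.Percentage)
	}
}
```

Absolute quantities such as `100Mi` are kept as raw strings here; parsing them into byte counts is left out to keep the sketch short.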

  1.2 Creating the eviction manager

	// setup eviction manager
	evictionManager, evictionAdmitHandler := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.imageManager, klet.containerGC, kubeDeps.Recorder, nodeRef, klet.clock)

	klet.evictionManager = evictionManager
	klet.admitHandlers.AddPodAdmitHandler(evictionAdmitHandler)

  1.3 Running the eviction manager

     The call is buried fairly deep:

  •      Run(updates <-chan kubetypes.PodUpdate)  -> 
  •      fastStatusUpdateOnce()  ->
  •      updateRuntimeUp()  ->
  •      initializeRuntimeDependentModules()  -> 
  •      kl.evictionManager.Start(kl.StatsProvider, kl.GetActivePods, kl.podResourcesAreReclaimed, evictionMonitoringPeriod)

 

2. The Start function

  Path: pkg/kubelet/eviction/eviction_manager.go
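As a rough sketch only: assuming Start launches a goroutine that calls synchronize in a loop, waits for cleanup when pods were evicted, and otherwise sleeps for the monitoring period passed in by the kubelet, the control flow might look like the following. The names `monitor`, `waitCleanup` and `stop` are hypothetical; the stop channel exists only so the example terminates.

```go
package main

import (
	"fmt"
	"time"
)

// monitor is a hypothetical stand-in for the loop assumed to run inside
// Start. It calls synchronize repeatedly: when pods were evicted it waits
// for their cleanup, otherwise it sleeps for interval before retrying.
func monitor(synchronize func() []string, waitCleanup func([]string), interval time.Duration, stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		default:
		}
		if evicted := synchronize(); len(evicted) > 0 {
			waitCleanup(evicted)
		} else {
			time.Sleep(interval)
		}
	}
}

func main() {
	stop := make(chan struct{})
	calls := 0
	synchronize := func() []string {
		calls++
		switch calls {
		case 1:
			return []string{"pod-a"} // pretend one pod was evicted
		case 3:
			close(stop) // end the demo after three rounds
		}
		return nil
	}
	waitCleanup := func(pods []string) {
		fmt.Println("waiting for cleanup of", pods)
	}
	monitor(synchronize, waitCleanup, time.Millisecond, stop)
	fmt.Println("synchronize calls:", calls)
}
```

In the kubelet the loop would run in its own goroutine for the life of the process; here it runs inline so the program exits on its own.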

 

 

3. The synchronize function

  3.1 buildSignalToRankFunc and buildSignalToNodeReclaimFuncs

  • buildSignalToRankFunc registers the pod ranking function for each signal
  • buildSignalToNodeReclaimFuncs registers the node-level reclaim functions for each signal
	// build the ranking functions (if not yet known)
	// TODO: have a function in cadvisor that lets us know if global housekeeping has completed
	if m.dedicatedImageFs == nil {
		hasImageFs, ok := diskInfoProvider.HasDedicatedImageFs()
		if ok != nil {
			return nil
		}
		glog.Infof("zzlin managerImpl synchronize m.dedicatedImageFs == nil  hasImageFs: %v", hasImageFs)
		m.dedicatedImageFs = &hasImageFs
		m.signalToRankFunc = buildSignalToRankFunc(hasImageFs)
		m.signalToNodeReclaimFuncs = buildSignalToNodeReclaimFuncs(m.imageGC, m.containerGC, hasImageFs)
	}
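The idea of a signal-to-rank-function map can be sketched as below. The ordering used for memory is an assumption modeled on the kubelet's documented behavior: pods whose usage exceeds their request come first, then lower-priority pods, then pods with the largest usage above their request. `podStats`, `rankFunc`, `rankMemory` and `buildRankFuncs` are hypothetical, simplified names, not the actual kubelet code.

```go
package main

import (
	"fmt"
	"sort"
)

// podStats is a hypothetical, flattened view of what ranking needs.
type podStats struct {
	Name     string
	Priority int32
	Request  int64 // requested memory in bytes
	Usage    int64 // current memory usage in bytes
}

// rankFunc reorders pods in place so that the first pod is evicted first.
type rankFunc func(pods []podStats)

// rankMemory orders pods for memory pressure under the assumed criteria.
func rankMemory(pods []podStats) {
	sort.SliceStable(pods, func(i, j int) bool {
		a, b := pods[i], pods[j]
		aOver, bOver := a.Usage > a.Request, b.Usage > b.Request
		if aOver != bOver {
			return aOver // pods exceeding their request go first
		}
		if a.Priority != b.Priority {
			return a.Priority < b.Priority // then lower priority
		}
		return a.Usage-a.Request > b.Usage-b.Request // then largest overage
	})
}

// buildRankFuncs registers one ranking function per signal.
func buildRankFuncs() map[string]rankFunc {
	return map[string]rankFunc{
		"memory.available":            rankMemory,
		"allocatableMemory.available": rankMemory,
	}
}

func main() {
	pods := []podStats{
		{Name: "A", Priority: 10, Request: 100, Usage: 50},
		{Name: "B", Priority: 0, Request: 100, Usage: 150},
		{Name: "C", Priority: 0, Request: 100, Usage: 300},
		{Name: "D", Priority: 5, Request: 100, Usage: 500},
	}
	buildRankFuncs()["memory.available"](pods)
	for _, p := range pods {
		fmt.Println(p.Name)
	}
}
```

Keeping the ranking functions in a map keyed by signal lets the eviction loop look up the right ordering for whichever threshold was crossed.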

  3.2 Get retrieves node and pod stats

  Path: pkg/kubelet/server/stats/summary.go

	activePods := podFunc()
	updateStats := true
	summary, err := m.summaryProvider.Get(updateStats)
	if err != nil {
		glog.Errorf("eviction manager: failed to get get summary stats: %v", err)
		return nil
	}

  3.3 The makeSignalObservations function

    The signals it reports observations for include:

  • imagefs.inodesFree
  • pid.available
  • memory.available
  • allocatableMemory.available
  • nodefs.available
  • nodefs.inodesFree
  • imagefs.available
	// make observations and get a function to derive pod usage stats relative to those observations.
	observations, statsFunc := makeSignalObservations(summary)
	debugLogObservations("observations", observations)
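A minimal sketch of how a stats summary can be turned into per-signal observations follows. It assumes that each observation carries an available amount and a capacity, and that memory capacity is taken as available plus working set. The `summary` and `observation` types and the `makeObservations`, `belowPercent` and `u64` helpers are hypothetical simplifications, not the kubelet's own types.

```go
package main

import "fmt"

// observation is a hypothetical, simplified signal observation.
type observation struct {
	Available uint64
	Capacity  uint64
}

// summary is a hypothetical, flattened stand-in for the stats summary.
// Pointer fields model stats that may be missing.
type summary struct {
	MemAvailable    *uint64
	MemWorkingSet   *uint64
	NodeFsAvailable *uint64
	NodeFsCapacity  *uint64
}

// makeObservations records one observation per signal whose stats exist.
func makeObservations(s summary) map[string]observation {
	obs := map[string]observation{}
	if s.MemAvailable != nil && s.MemWorkingSet != nil {
		obs["memory.available"] = observation{
			Available: *s.MemAvailable,
			Capacity:  *s.MemAvailable + *s.MemWorkingSet, // assumed definition
		}
	}
	if s.NodeFsAvailable != nil && s.NodeFsCapacity != nil {
		obs["nodefs.available"] = observation{
			Available: *s.NodeFsAvailable,
			Capacity:  *s.NodeFsCapacity,
		}
	}
	return obs
}

// belowPercent reports whether available has dropped below the given
// fraction of capacity, as in a threshold like "nodefs.available<10%".
func belowPercent(o observation, fraction float64) bool {
	return float64(o.Available) < float64(o.Capacity)*fraction
}

// u64 returns a pointer to v, for building summaries concisely.
func u64(v uint64) *uint64 { return &v }

func main() {
	obs := makeObservations(summary{
		MemAvailable:    u64(100),
		MemWorkingSet:   u64(900),
		NodeFsAvailable: u64(5),
		NodeFsCapacity:  u64(100),
	})
	for _, name := range []string{"memory.available", "nodefs.available"} {
		o := obs[name]
		fmt.Printf("%s: available=%d capacity=%d below10%%=%v\n",
			name, o.Available, o.Capacity, belowPercent(o, 0.10))
	}
}
```

Storing capacity alongside the available amount is what makes percentage thresholds comparable with absolute ones.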