[kubernetes/k8s source code analysis] The eviction mechanism: principles and source code walkthrough
阿新 • Published: 2019-01-04
What?
Why?
Drawbacks of letting kubelet rely on the OOM Killer to reclaim resources:
- A system OOM event is compute-intensive and can stall the node until the OOM kill has completed
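- After the OOM Killer kills containers, the Scheduler may place new Pods on that Node, or the containers may simply be restarted on the node. That triggers the Node's OOM Killer again, and the cycle can repeat indefinitely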
How?
Default eviction flags at kubelet startup
--eviction-hard="imagefs.available<15%,memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%"
--eviction-max-pod-grace-period="0"
--eviction-minimum-reclaim=""
--eviction-pressure-transition-period="5m0s"
--eviction-soft=""
--eviction-soft-grace-period=""
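Note: thresholds come in two kinds, eviction-soft and eviction-hard. A soft threshold must stay exceeded for its --eviction-soft-grace-period before eviction starts, and the evicted pods then get some time to exit gracefully. A hard threshold kills pods immediately, with no chance of a graceful exit.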
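To make the flag syntax above concrete, here is a minimal sketch of splitting an `--eviction-hard` style value into individual thresholds. The `threshold` type and `parseThresholds` function are hypothetical names made up for this illustration. This is not kubelet's `ParseThresholdConfig`, which also validates signals and parses quantities.

```go
package main

import (
	"fmt"
	"strings"
)

// threshold is a hypothetical, simplified stand-in for kubelet's
// evictionapi.Threshold; it keeps the value as a raw string.
type threshold struct {
	Signal     string // e.g. "memory.available"
	Value      string // raw value, e.g. "100Mi" or "10%"
	Percentage bool   // true when the value is relative to capacity
}

// parseThresholds splits a flag value such as
// "memory.available<100Mi,nodefs.available<10%" into thresholds.
// Only the "<" operator is accepted, matching the flag syntax shown above.
func parseThresholds(expr string) ([]threshold, error) {
	var out []threshold
	if strings.TrimSpace(expr) == "" {
		return out, nil // an empty flag means no thresholds
	}
	for _, part := range strings.Split(expr, ",") {
		fields := strings.SplitN(strings.TrimSpace(part), "<", 2)
		if len(fields) != 2 || fields[0] == "" || fields[1] == "" {
			return nil, fmt.Errorf("invalid threshold %q", part)
		}
		out = append(out, threshold{
			Signal:     fields[0],
			Value:      fields[1],
			Percentage: strings.HasSuffix(fields[1], "%"),
		})
	}
	return out, nil
}

func main() {
	ts, err := parseThresholds("imagefs.available<15%,memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%")
	if err != nil {
		panic(err)
	}
	for _, t := range ts {
		fmt.Printf("%s < %s (percentage=%v)\n", t.Signal, t.Value, t.Percentage)
	}
}
```

Keeping the value as a string avoids pulling in a quantity parser; a real implementation would convert "100Mi" into bytes before comparing.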
eviction signal
- memory.available
- nodefs.available
- nodefs.inodesFree
- imagefs.available
- imagefs.inodesFree
- allocatableMemory.available
Note:
- nodefs: the node's own storage, used for things such as runtime logs
- imagefs: the storage dockerd uses for images and container writable layers
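The managerImpl struct
- killPodFunc: set to the killPodNow function
- imageGC: when disk pressure occurs, imageGC deletes unused images
- thresholdsFirstObservedAt: records when each threshold was first observed
- signalToRankFunc (resourceToRankFunc in older versions): defines, per signal, the ranking function used to pick which pods to evict
- nodeConditionsLastObservedAt: records when each node condition was last observed
- notifierInitialized: a bool indicating whether the threshold notifier has been initialized, which determines whether the kernel memcg notification feature can be used to make eviction react faster. It is false when the manager is created, and whether kernel memcg notification is used depends entirely on kubelet's --experimental-kernel-memcg-notification flag. The struct listing below no longer has this field and carries thresholdNotifiers instead.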
// managerImpl implements Manager
type managerImpl struct {
    // used to track time
    clock clock.Clock
    // config is how the manager is configured
    config Config
    // the function to invoke to kill a pod
    killPodFunc KillPodFunc
    // the interface that knows how to do image gc
    imageGC ImageGC
    // the interface that knows how to do container gc
    containerGC ContainerGC
    // protects access to internal state
    sync.RWMutex
    // node conditions are the set of conditions present
    nodeConditions []v1.NodeConditionType
    // captures when a node condition was last observed based on a threshold being met
    nodeConditionsLastObservedAt nodeConditionsObservedAt
    // nodeRef is a reference to the node
    nodeRef *v1.ObjectReference
    // used to record events about the node
    recorder record.EventRecorder
    // used to measure usage stats on system
    summaryProvider stats.SummaryProvider
    // records when a threshold was first observed
    thresholdsFirstObservedAt thresholdsObservedAt
    // records the set of thresholds that have been met (including graceperiod) but not yet resolved
    thresholdsMet []evictionapi.Threshold
    // signalToRankFunc maps a resource to ranking function for that resource.
    signalToRankFunc map[evictionapi.Signal]rankFunc
    // signalToNodeReclaimFuncs maps a resource to an ordered list of functions that know how to reclaim that resource.
    signalToNodeReclaimFuncs map[evictionapi.Signal]nodeReclaimFuncs
    // last observations from synchronize
    lastObservations signalObservations
    // dedicatedImageFs indicates if imagefs is on a separate device from the rootfs
    dedicatedImageFs *bool
    // thresholdNotifiers is a list of memory threshold notifiers which each notify for a memory eviction threshold
    thresholdNotifiers []ThresholdNotifier
    // thresholdsLastUpdated is the last time the thresholdNotifiers were updated.
    thresholdsLastUpdated time.Time
}
1. Initializing the eviction manager
Path: pkg/kubelet/kubelet.go
1.1 Eviction configuration parameters
See the default kubelet eviction flags listed above
thresholds, err := eviction.ParseThresholdConfig(enforceNodeAllocatable, kubeCfg.EvictionHard, kubeCfg.EvictionSoft, kubeCfg.EvictionSoftGracePeriod, kubeCfg.EvictionMinimumReclaim)
if err != nil {
    return nil, err
}
evictionConfig := eviction.Config{
    PressureTransitionPeriod: kubeCfg.EvictionPressureTransitionPeriod.Duration,
    MaxPodGracePeriodSeconds: int64(kubeCfg.EvictionMaxPodGracePeriod),
    Thresholds:               thresholds,
    KernelMemcgNotification:  experimentalKernelMemcgNotification,
    PodCgroupRoot:            kubeDeps.ContainerManager.GetPodCgroupRoot(),
}
1.2 Creating the eviction manager
// setup eviction manager
evictionManager, evictionAdmitHandler := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.imageManager, klet.containerGC, kubeDeps.Recorder, nodeRef, klet.clock)
klet.evictionManager = evictionManager
klet.admitHandlers.AddPodAdmitHandler(evictionAdmitHandler)
1.3 Running the eviction manager
The call is buried fairly deep in the call chain:
- Run(updates <-chan kubetypes.PodUpdate) ->
- fastStatusUpdateOnce() ->
- updateRuntimeUp() ->
- initializeRuntimeDependentModules() ->
- kl.evictionManager.Start(kl.StatsProvider, kl.GetActivePods, kl.podResourcesAreReclaimed, evictionMonitoringPeriod)
2. The Start function
Path: pkg/kubelet/eviction/eviction_manager.go
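3. The synchronize function
3.1 The buildSignalToRankFunc and buildSignalToNodeReclaimFuncs functions
- buildSignalToRankFunc registers, for each signal, the function used to rank pods for eviction
- buildSignalToNodeReclaimFuncs registers, for each signal, the functions that reclaim that resource at node level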
// build the ranking functions (if not yet known)
// TODO: have a function in cadvisor that lets us know if global housekeeping has completed
if m.dedicatedImageFs == nil {
    // note: despite its name, ok is an error value here
    hasImageFs, ok := diskInfoProvider.HasDedicatedImageFs()
    if ok != nil {
        return nil
    }
    // debug log added by the author; log the value, not its address
    glog.Infof("zzlin managerImpl synchronize m.dedicatedImageFs == nil hasImageFs: %v", hasImageFs)
    m.dedicatedImageFs = &hasImageFs
    m.signalToRankFunc = buildSignalToRankFunc(hasImageFs)
    m.signalToNodeReclaimFuncs = buildSignalToNodeReclaimFuncs(m.imageGC, m.containerGC, hasImageFs)
}
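The idea of a per-signal ranking table that depends on hasImageFs can be sketched as follows. This is a simplified illustration with hypothetical types (`pod`, `rankBy`, `buildRankFuncs`): it ranks by raw usage only, whereas kubelet's real ranking functions take more into account, and it assumes that with a dedicated imagefs the container writable layers count against imagefs while logs count against nodefs.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical, flattened view of per-pod usage.
type pod struct {
	Name        string
	MemoryBytes uint64
	RootfsBytes uint64 // container writable layers
	LogsBytes   uint64
}

// rankFunc reorders pods in place, best eviction candidate first.
type rankFunc func(pods []pod)

// rankBy builds a rankFunc that sorts by the given usage, largest first.
func rankBy(usage func(pod) uint64) rankFunc {
	return func(pods []pod) {
		sort.SliceStable(pods, func(i, j int) bool {
			return usage(pods[i]) > usage(pods[j])
		})
	}
}

// buildRankFuncs returns one ranking function per signal. Which disk usage
// counts against which signal depends on whether images live on their own
// filesystem.
func buildRankFuncs(hasImageFs bool) map[string]rankFunc {
	funcs := map[string]rankFunc{
		"memory.available": rankBy(func(p pod) uint64 { return p.MemoryBytes }),
	}
	if hasImageFs {
		funcs["nodefs.available"] = rankBy(func(p pod) uint64 { return p.LogsBytes })
		funcs["imagefs.available"] = rankBy(func(p pod) uint64 { return p.RootfsBytes })
	} else {
		all := rankBy(func(p pod) uint64 { return p.RootfsBytes + p.LogsBytes })
		funcs["nodefs.available"] = all
		funcs["imagefs.available"] = all
	}
	return funcs
}

func main() {
	pods := []pod{
		{Name: "a", MemoryBytes: 100, RootfsBytes: 10, LogsBytes: 1},
		{Name: "b", MemoryBytes: 50, RootfsBytes: 1, LogsBytes: 5},
	}
	for _, hasImageFs := range []bool{true, false} {
		ranked := append([]pod(nil), pods...) // copy so each run starts from the same order
		buildRankFuncs(hasImageFs)["nodefs.available"](ranked)
		fmt.Printf("hasImageFs=%v: evict %s first\n", hasImageFs, ranked[0].Name)
	}
}
```

Building the table once and caching it mirrors the `m.dedicatedImageFs == nil` guard in the snippet above: the filesystem layout is not expected to change while kubelet runs.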
3.2 The Get function fetches node and pod stats
Path: pkg/kubelet/server/stats/summary.go
activePods := podFunc()
updateStats := true
summary, err := m.summaryProvider.Get(updateStats)
if err != nil {
    glog.Errorf("eviction manager: failed to get get summary stats: %v", err)
    return nil
}
3.3 The makeSignalObservations function
It reports the current state of each signal's resource, covering the following signals:
- imagefs.inodesFree
- pid.available
- memory.available
- allocatableMemory.available
- nodefs.available
- nodefs.inodesFree
- imagefs.available
// make observations and get a function to derive pod usage stats relative to those observations.
observations, statsFunc := makeSignalObservations(summary)
debugLogObservations("observations", observations)
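How a stats summary turns into per-signal observations can be illustrated with a reduced example. The sketch below covers only memory.available, nodefs.available and nodefs.inodesFree, and uses hypothetical flattened types (`nodeSummary`, `observation`) rather than kubelet's summary API. It assumes memory capacity is taken as available plus working set.

```go
package main

import "fmt"

// The stats types below are hypothetical, reduced versions of a node summary.
// Pointer fields model values that may be missing from a stats sample.
type memoryStats struct{ AvailableBytes, WorkingSetBytes *uint64 }
type fsStats struct{ AvailableBytes, CapacityBytes, InodesFree, Inodes *uint64 }
type nodeSummary struct {
	Memory *memoryStats
	Fs     *fsStats
}

// observation pairs what is left of a resource with its total.
type observation struct{ Available, Capacity uint64 }

// makeObservations builds one observation per signal for which the summary
// carries complete data; signals with missing data are simply left out.
func makeObservations(s nodeSummary) map[string]observation {
	out := map[string]observation{}
	if m := s.Memory; m != nil && m.AvailableBytes != nil && m.WorkingSetBytes != nil {
		out["memory.available"] = observation{
			Available: *m.AvailableBytes,
			Capacity:  *m.AvailableBytes + *m.WorkingSetBytes,
		}
	}
	if fs := s.Fs; fs != nil {
		if fs.AvailableBytes != nil && fs.CapacityBytes != nil {
			out["nodefs.available"] = observation{*fs.AvailableBytes, *fs.CapacityBytes}
		}
		if fs.InodesFree != nil && fs.Inodes != nil {
			out["nodefs.inodesFree"] = observation{*fs.InodesFree, *fs.Inodes}
		}
	}
	return out
}

// belowPercent reports whether less than pct percent of capacity is available,
// which is how a threshold such as "nodefs.available<10%" would be checked.
func belowPercent(o observation, pct uint64) bool {
	return o.Available*100 < o.Capacity*pct
}

// u64 returns a pointer to v, for building sample data.
func u64(v uint64) *uint64 { return &v }

func main() {
	summary := nodeSummary{
		Memory: &memoryStats{AvailableBytes: u64(200 << 20), WorkingSetBytes: u64(800 << 20)},
		Fs:     &fsStats{AvailableBytes: u64(5 << 30), CapacityBytes: u64(100 << 30)},
	}
	for signal, o := range makeObservations(summary) {
		fmt.Printf("%s: available=%d capacity=%d below10%%=%v\n",
			signal, o.Available, o.Capacity, belowPercent(o, 10))
	}
}
```

Carrying the capacity alongside the available amount is what makes percentage thresholds possible; absolute thresholds such as 100Mi only need the available value.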