
cgroup Source Code Analysis, Based on CentOS 3.10.0-693.25.4

  After a kernel upgrade, a colleague on the test team ran the LTP (ltprun) suite and found that cgroups stopped working afterwards. The system looked perfectly healthy and the kernel logged no errors. Not being familiar with the cgroup implementation, I dug through the code for quite a while before discovering that cgroup registers a notifier chain for CPU hotplug. Checking what LTP actually tests, CPU hotplug was right there on the list. Why didn't I check the LTP test contents first? A real waste of effort.
  This article won't rehash what cgroups are or what they are for; the focus here is analyzing the cgroup implementation. cgroup can be divided into three parts: the data structures describing subsystems (subsys), cgroups, and their attachment relationships; the filesystem interface exposed to user space; and the individual subsystems.

Data Structures

First, a look at the concepts involved in cgroups:

  • Task. In cgroups, a task is a process in the system.
  • Control group. A control group is a set of processes partitioned by some
    criterion. Resource control in cgroups is implemented in units of control
    groups. A process can join a control group, and can also migrate from one
    control group to another. The processes in a control group can use the
    resources that cgroups allocate to that group, and are subject to the
    limits that cgroups place on that group.
  • Hierarchy. Control groups can be organized hierarchically, i.e. into a
    tree of control groups. A child control group in the tree inherits
    specific attributes from its parent control group.
  • Subsystem (subsys). A subsystem is a resource controller; for example,
    the cpu subsystem is a controller for CPU time allocation. A subsystem
    must be attached to a hierarchy to take effect; once attached, every
    control group in that hierarchy is controlled by the subsystem.

These concepts relate to each other as follows:

  • Every time a new hierarchy is created, all tasks in the system start out
    as members of that hierarchy's default cgroup (called the root cgroup; it
    is created automatically along with the hierarchy, and every cgroup later
    created in that hierarchy is a descendant of it).
  • A subsystem can be attached to at most one hierarchy.
  • A hierarchy can have multiple subsystems attached.
  • A task can be a member of multiple cgroups, as long as those cgroups are
    in different hierarchies.
  • When a process (task) in the system forks a child (task), the child
    automatically becomes a member of its parent's cgroup. The child can then
    be moved into a different cgroup as needed, but it always starts out
    inheriting its parent's cgroup.

css_set

  Tasks and cgroups are many-to-many; a cgroup relates to subsystems one-to-many; and tasks and subsystems are again many-to-many (a task can be attached to multiple cgroups, and a cgroup may have several subsystems attached as well as many tasks). Describing these relationships is not easy. If a task had to go through each of its cgroups to reference each subsystem and then fetch the resource limits from there, access would be inefficient. From a task's point of view, however, the combination of limits the subsystems impose on it is fixed, so the kernel uses a css_set to describe one such combination of subsystem states. Through its css_set a task knows which limits apply to it, which speeds up access; and since the number of distinct subsystem combinations is bounded, it also keeps the kernel's data structures simpler.
  Now look at the css_set structure: a css_set represents one combination of resource limits (for example "cpu 20%, mem 40%" and "cpu 30%, mem 20%" are different combinations and use different css_sets) and connects processes to subsystems.

struct css_set {

	/* Reference count */
	atomic_t refcount;

	/*
	 * List running through all cgroup groups in the same hash
	 * slot. Protected by css_set_lock
	 */
	struct hlist_node hlist;	/* linked into the global css_set_table hash */

	/*
	 * List running through all tasks using this cgroup
	 * group. Protected by css_set_lock
	 */
	struct list_head tasks;	/* list of processes referencing this css_set */

	/*
	 * List of cg_cgroup_link objects on link chains from
	 * cgroups referenced from this css_set. Protected by
	 * css_set_lock
	 */
	struct list_head cg_links;	/* list of cgroups this css_set is associated with, linked via cg_cgroup_link */

	/*
	 * Set of subsystem states, one for each subsystem. This array
	 * is immutable after creation apart from the init_css_set
	 * during subsystem registration (at boot time) and modular subsystem
	 * loading/unloading.
	 */
	/* References into this css_set's concrete subsystem states. Every
	 * subsystem has a slot here; having a slot and actually using it
	 * are two different things. */
	struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];

	/* For RCU-protected deletion */
	struct rcu_head rcu_head;
};

/* the process descriptor */
struct task_struct {
...
#ifdef CONFIG_CGROUPS
	/* Control Group info protected by css_set_lock */
	struct css_set __rcu *cgroups;	/* points to the css_set this task is associated with */
	/* cg_list protected by css_set_lock and tsk->alloc_lock */
	struct list_head cg_list;	/* linked into the associated css_set->tasks list */
#endif
...
}
  • A css_set's tasks list holds all the processes that use this css_set; a process's task_struct->cgroups pointer points to its css_set, and the process is linked into that css_set's tasks list through cg_list.
  • A css_set is linked through hlist into the global css_set_table hash table, which makes css_set lookup fast (a sketch of the hash follows below).
  • cg_links links together all the cgroups related to this css_set; the cgroups are not linked to this list_head directly but through the cg_cgroup_link structure.
  • subsys is an array of pointers; every subsystem has an instance structure referenced in it.

  A css_set can be thought of as a "cgroup group", i.e. it stands for a set of cgroups; one css_set can be associated with many cgroups.
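How is a matching css_set found quickly? The bullets above mention the global css_set_table; in this kernel series the lookup key is essentially just the sum of the subsystem-state pointers, so two css_sets with the same subsys[] combination hash to the same slot. A lightly abridged sketch of that hash from kernel/cgroup.c:

/* Hash a subsys[] combination: the key is simply the sum of the
 * cgroup_subsys_state pointers, which identify the combination. */
static unsigned long css_set_hash(struct cgroup_subsys_state *css[])
{
	unsigned long key = 0UL;
	int i;

	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++)
		key += (unsigned long)css[i];

	return key;
}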

cgroup_subsys_state

  Next, the cgroup_subsys_state structure. It is the bridge from a css_set to a concrete subsystem instance: a css_set holds a cgroup_subsys_state pointer array of CGROUP_SUBSYS_COUNT elements, which means every subsystem has a cgroup_subsys_state instance referenced there. Yet this structure contains no actual control information, so where does that live? cgroup_subsys_state plays much the same role as a kobject: it carries the information common to all subsystems and is embedded in each subsystem's own instance structure. container_of on it yields the subsystem's concrete instance structure, and the subsystem's private control information lives there.

/* Per-subsystem/per-cgroup state maintained by the system. */
struct cgroup_subsys_state {
	/*
	 * The cgroup that this subsystem is attached to. Useful
	 * for subsystems that want to know about the cgroup
	 * hierarchy structure
	 */
	struct cgroup *cgroup;	/* the cgroup this subsystem state is attached to */

	/*
	 * State maintained by the cgroup system to allow subsystems
	 * to be "busy". Should be accessed via css_get(),
	 * css_tryget() and css_put().
	 */

	atomic_t refcnt;

	unsigned long flags;
	/* ID for this css, if possible */
	struct css_id __rcu *id;

	/* Used to put @cgroup->dentry on the last css_put() */
	struct work_struct dput_work;
};

A task can thus find the cgroup it is attached to by following task_struct->cgroups (its css_set), then subsys[i]->cgroup.
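As a concrete illustration, here is a minimal sketch of that pointer walk for the cpu subsystem. In this kernel series the helper task_subsys_state() performs the same RCU-checked dereference; the wrapper function below is ours, and rcu_read_lock() is assumed to be held:

/* Sketch: from a task to its cgroup in the cpu subsystem's hierarchy.
 * Assumes rcu_read_lock() is held by the caller. */
static struct cgroup *task_cpu_cgroup(struct task_struct *tsk)
{
	struct cgroup_subsys_state *css;

	css = task_subsys_state(tsk, cpu_cgroup_subsys_id);
	return css->cgroup;
}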

cgroup

  The cgroup structure describes a single cgroup (one control group). In the cgroup filesystem, every directory is a control group. The structure's main job is to tie the css_sets and the subsystems together; no direct link from task to cgroup is needed. Even though each directory contains a tasks file, a task reaches its cgroup through its css_set.

struct cgroup {
	unsigned long flags;		/* "unsigned long" so bitops work */

	/*
	 * count users of this cgroup. >0 means busy, but doesn't
	 * necessarily indicate the number of tasks in the cgroup
	 */
	atomic_t count;

	int id;				/* ida allocated in-hierarchy ID */

	/*
	 * We link our 'sibling' struct into our parent's 'children'.
	 * Our children link their 'sibling' into our 'children'.
	 */
	struct list_head sibling;	/* my parent's children */
	struct list_head children;	/* my children */
	struct list_head files;		/* my files */

	struct cgroup *parent;		/* my parent */
	struct dentry *dentry;		/* cgroup fs entry, RCU protected */

	/*
	 * This is a copy of dentry->d_name, and it's needed because
	 * we can't use dentry->d_name in cgroup_path().
	 *
	 * You must acquire rcu_read_lock() to access cgrp->name, and
	 * the only place that can change it is rename(), which is
	 * protected by parent dir's i_mutex.
	 *
	 * Normally you should use cgroup_name() wrapper rather than
	 * access it directly.
	 */
	struct cgroup_name __rcu *name;

	/* Private pointers for each registered subsystem */
	struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];

	struct cgroupfs_root *root;	/* points to the hierarchy (cgroupfs_root) this cgroup belongs to */

	/*
	 * List of cg_cgroup_links pointing at css_sets with
	 * tasks in this cgroup. Protected by css_set_lock
	 */
	struct list_head css_sets;

	struct list_head allcg_node;	/* cgroupfs_root->allcg_list */
	struct list_head cft_q_node;	/* used during cftype add/rm */

	/*
	 * Linked list running through all cgroups that can
	 * potentially be reaped by the release agent. Protected by
	 * release_list_lock
	 */
	struct list_head release_list;

	/*
	 * list of pidlists, up to two for each namespace (one for procs, one
	 * for tasks); created on demand.
	 */
	struct list_head pidlists;
	struct mutex pidlist_mutex;

	/* For RCU-protected deletion */
	struct rcu_head rcu_head;
	struct work_struct free_work;

	/* List of events which userspace want to receive */
	struct list_head event_list;
	spinlock_t event_list_lock;

	/* directory xattrs */
	struct simple_xattrs xattrs;
};

We will only look at a few key fields of the cgroup structure (frankly, I have not fully worked out the rest):

  • sibling, children, parent link a cgroup to its parent, siblings and children
  • files and dentry are used by the filesystem
  • name stores the cgroup's name, the same as dentry->d_name
  • subsys is the cgroup_subsys_state array, one element per subsystem
  • root points to the cgroupfs_root; each mounted cgroup filesystem has one cgroupfs_root
  • css_sets is the list of the css_sets this cgroup participates in
  • allcg_node links into cgroupfs_root->allcg_list
  • release_list relates to the release files in each cgroup directory and the cgroup release-agent mechanism; not analyzed here

cgroupfs_root

  cgroup is the structure describing one directory in the cgroup filesystem, and every cgroup belongs to a hierarchy. The hierarchy has a dedicated structure of its own, just as a filesystem has a super_block describing it. That structure is cgroupfs_root, and it is also associated with the superblock of the corresponding cgroup filesystem mount.

/*
 * A cgroupfs_root represents the root of a cgroup hierarchy, and may be
 * associated with a superblock to form an active hierarchy.  This is
 * internal to cgroup core.  Don't access directly from controllers.
 */
struct cgroupfs_root {
	struct super_block *sb;

	/*
	 * The bitmask of subsystems intended to be attached to this
	 * hierarchy
	 */
	unsigned long subsys_mask;

	/* Unique id for this hierarchy. */
	int hierarchy_id;

	/* The bitmask of subsystems currently attached to this hierarchy */
	unsigned long actual_subsys_mask;

	/* A list running through the attached subsystems */
	struct list_head subsys_list;

	/* The root cgroup for this hierarchy */
	struct cgroup top_cgroup;

	/* Tracks how many cgroups are currently defined in hierarchy.*/
	int number_of_cgroups;

	/* A list running through the active hierarchies */
	struct list_head root_list;

	/* All cgroups on this root, cgroup_mutex protected */
	struct list_head allcg_list;

	/* Hierarchy-specific flags */
	unsigned long flags;

	/* IDs for cgroups in this hierarchy */
	struct ida cgroup_ida;

	/* The path to use for release notifications. */
	char release_agent_path[PATH_MAX];

	/* The name for this hierarchy - may be empty */
	char name[MAX_CGROUP_ROOT_NAMELEN];
};
  • sb: the super_block of the filesystem associated with this hierarchy
  • subsys_mask, actual_subsys_mask: bitmasks of the subsystems intended to be / currently attached to this hierarchy; used together with the cgroup->subsys arrays in the hierarchy, so a cgroup knows which subsystems it carries (see the sketch after this list)
  • subsys_list: list of the subsystems attached to this cgroupfs_root at runtime
  • top_cgroup: the cgroup associated with the hierarchy's root directory
  • number_of_cgroups: total number of cgroups in this hierarchy
  • root_list: links this cgroupfs_root into the global roots list
  • allcg_list: list of all cgroups in this hierarchy
  • release_agent_path: related to the release-agent mechanism
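A minimal sketch of how these fields get used: the bit for a subsystem is 1UL << subsys_id, and the per-hierarchy iteration macro in this kernel simply walks root->subsys_list. The macro is adapted from kernel/cgroup.c; the helper function below is ours, for illustration:

/* Iterate the subsystems attached to a hierarchy (cf. kernel/cgroup.c). */
#define for_each_subsys(_root, _ss) \
	list_for_each_entry(_ss, &(_root)->subsys_list, sibling)

/* Illustrative helper: test a subsystem's bit in the attached mask,
 * as rebind_subsystems() does internally. */
static bool subsys_attached(struct cgroupfs_root *root, int subsys_id)
{
	return root->actual_subsys_mask & (1UL << subsys_id);
}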

cgroup_subsys

  cgroup_subsys describes a subsystem. The kernel's subsystems are defined statically in the code; subsystems cannot be added dynamically. All subsystems live in the array subsys, whose elements are declared in linux/cgroup_subsys.h, while each element is actually defined in the corresponding subsystem's own file; for example the memory controller's mem_cgroup_subsys is defined in mm/memcontrol.c.

static struct cgroup_subsys *subsys[CGROUP_SUBSYS_COUNT] = {
#include <linux/cgroup_subsys.h>
};
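The #include trick above works through the SUBSYS() macro: cgroup.h expands it once to generate the subsystem ID enum, and cgroup.c redefines it to generate the array initializers. A simplified sketch of the pattern (the real header additionally wraps entries in IS_SUBSYS_ENABLED() conditionals):

/* In include/linux/cgroup.h: expand each SUBSYS(x) into an enum value... */
#define SUBSYS(_x) _x ## _subsys_id,
enum cgroup_subsys_id {
#include <linux/cgroup_subsys.h>
	CGROUP_SUBSYS_COUNT,
};
#undef SUBSYS

/* ...then in kernel/cgroup.c: expand it again into array entries, so that
 * subsys[mem_cgroup_subsys_id] == &mem_cgroup_subsys, and so on. */
#define SUBSYS(_x) [_x ## _subsys_id] = &_x ## _subsys,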

struct cgroup_subsys {
	struct cgroup_subsys_state *(*css_alloc)(struct cgroup *cgrp);
	int (*css_online)(struct cgroup *cgrp);
	void (*css_offline)(struct cgroup *cgrp);
	void (*css_free)(struct cgroup *cgrp);

	int (*can_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
	void (*cancel_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
	void (*attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
	RH_KABI_REPLACE(void (*fork)(struct task_struct *task),
			void (*fork)(struct task_struct *task, void *priv))
	void (*exit)(struct cgroup *cgrp, struct cgroup *old_cgrp,
		     struct task_struct *task);
	void (*bind)(struct cgroup *root);

	int subsys_id;
	int disabled;
	int early_init;
	/*
	 * True if this subsys uses ID. ID is not available before cgroup_init()
	 * (not available in early_init time.)
	 */
	bool use_id;

	/*
	 * If %false, this subsystem is properly hierarchical -
	 * configuration, resource accounting and restriction on a parent
	 * cgroup cover those of its children.  If %true, hierarchy support
	 * is broken in some ways - some subsystems ignore hierarchy
	 * completely while others are only implemented half-way.
	 *
	 * It's now disallowed to create nested cgroups if the subsystem is
	 * broken and cgroup core will emit a warning message on such
	 * cases.  Eventually, all subsystems will be made properly
	 * hierarchical and this will go away.
	 */
	bool broken_hierarchy;
	bool warned_broken_hierarchy;

#define MAX_CGROUP_TYPE_NAMELEN 32
	const char *name;

	/*
	 * Link to parent, and list entry in parent's children.
	 * Protected by cgroup_lock()
	 */
	struct cgroupfs_root *root;
	struct list_head sibling;
	/* used when use_id == true */
	struct idr idr;
	spinlock_t id_lock;

	/* list of cftype_sets */
	struct list_head cftsets;

	/* base cftypes, automatically [de]registered with subsys itself */
	struct cftype *base_cftypes;
	struct cftype_set base_cftset;

	/* should be defined only by modular subsystems */
	struct module *module;

	RH_KABI_EXTEND(int (*can_fork)(struct task_struct *task, void **priv_p))
	RH_KABI_EXTEND(void (*cancel_fork)(struct task_struct *task, void *priv))
};

Every subsystem has attributes and operations of its own, and describing them all with one structure would be too hard, so cgroup_subsys describes only what subsystems have in common; each concrete subsystem (mem_cgroup, cpu, and so on) has its own structure as well. The fields of cgroup_subsys fall roughly into three groups: callback function pointers; fields used to organize the cgroup_subsys structures themselves; and fields related to cftype. Every cgroup directory contains subsystem files used to set subsystem attributes, and those attributes are described by cftype. Among the callbacks, css_alloc deserves attention: it allocates the concrete subsystem's structure and returns a cgroup_subsys_state, and a pointer to that structure becomes an element of the css_set->subsys array. Here is mem_cgroup_subsys's css_alloc, mem_cgroup_css_alloc:

static struct cgroup_subsys_state * __ref
mem_cgroup_css_alloc(struct cgroup *cont)
{
	struct mem_cgroup *memcg;
	long error = -ENOMEM;
	int node;

	memcg = mem_cgroup_alloc();
	if (!memcg)
		return ERR_PTR(error);

	for_each_node(node)
		if (alloc_mem_cgroup_per_zone_info(memcg, node))
			goto free_out;

	/* root ? */
	if (cont->parent == NULL) {
		root_mem_cgroup = memcg;
		page_counter_init(&memcg->memory, NULL);
		memcg->soft_limit = PAGE_COUNTER_MAX;
		page_counter_init(&memcg->memsw, NULL);
		page_counter_init(&memcg->kmem, NULL);
	}

	memcg->last_scanned_node = MAX_NUMNODES;
	INIT_LIST_HEAD(&memcg->oom_notify);
	atomic_set(&memcg->refcnt, 1);
	memcg->move_charge_at_immigrate = 0;
	mutex_init(&memcg->thresholds_lock);
	spin_lock_init(&memcg->move_lock);
	vmpressure_init(&memcg->vmpressure);

	return &memcg->css;

free_out:
	__mem_cgroup_free(memcg);
	return ERR_PTR(error);
}

struct mem_cgroup {
	struct cgroup_subsys_state css;
	...
}

Each subsystem's structure has a cgroup_subsys_state as its first member, which makes container_of straightforward. Taking the memory subsystem as an example: its structure is mem_cgroup, and from any css_set a container_of on css_set->subsys[mem_cgroup_subsys_id] yields the corresponding mem_cgroup instance. In contrast to mem_cgroup, which holds per-cgroup state, cgroup_subsys describes a subsystem globally.
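A minimal sketch of that container_of step; memcontrol.c in this kernel series has an equivalent helper, but take the exact name and placement here as illustrative:

/* Sketch: recover the mem_cgroup that embeds a given cgroup_subsys_state. */
static inline struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
{
	return container_of(css, struct mem_cgroup, css);
}

/* Usage, e.g. starting from a css_set *cg:
 *	struct mem_cgroup *memcg =
 *		mem_cgroup_from_css(cg->subsys[mem_cgroup_subsys_id]);
 */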

cftype

  In the cgroup filesystem, every directory contains files belonging to the various subsystems, and different subsystems expose different files. The kernel describes these files with the cftype structure.

struct cftype {
	/*
	 * By convention, the name should begin with the name of the
	 * subsystem, followed by a period.  Zero length string indicates
	 * end of cftype array.
	 */
	char name[MAX_CFTYPE_NAME];
	int private;
	/*
	 * If not 0, file mode is set to this value, otherwise it will
	 * be figured out automatically
	 */
	umode_t mode;

	/*
	 * If non-zero, defines the maximum length of string that can
	 * be passed to write_string; defaults to 64
	 */
	size_t max_write_len;

	/* CFTYPE_* flags */
	unsigned int flags;
	
	int (*open)(struct inode *inode, struct file *file);
	
	...
};

The cftype structure mainly holds file-operation callbacks and descriptive information about a file. Every operation on such a file ends up calling the callbacks in this structure. Each subsystem defines its subsystem-specific files in cgroup_subsys->base_cftypes, while the files that every cgroup directory has in common, such as tasks, are kept in the files array in kernel/cgroup.c; those appear in every directory.
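A minimal sketch of what such a definition looks like. The field names (name, read_u64, write_string, max_write_len) are real cftype members in this kernel series, but the file names and callbacks below are hypothetical:

/* Illustrative cftype array; a zero-length name terminates it. */
static struct cftype demo_files[] = {
	{
		.name = "demo.usage_in_bytes",	/* convention: subsys name, then a period */
		.read_u64 = demo_read_u64,	/* hypothetical read callback */
	},
	{
		.name = "demo.limit_in_bytes",
		.write_string = demo_write_string,	/* hypothetical write callback */
		.max_write_len = 64,
	},
	{ }	/* terminator */
};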

cg_cgroup_link

   cg_cgroup_link does not represent any cgroup object itself; it is merely a structure for linking one object to another. But with all the many-to-many relationships in cgroup, understanding how this structure is used helps in understanding how the cgroup structures are organized.

/* Link structure for associating css_set objects with cgroups */
struct cg_cgroup_link {
	/*
	 * List running through cg_cgroup_links associated with a
	 * cgroup, anchored on cgroup->css_sets
	 */
	struct list_head cgrp_link_list;
	struct cgroup *cgrp;
	/*
	 * List running through cg_cgroup_links pointing at a
	 * single css_set object, anchored on css_set->cg_links
	 */
	struct list_head cg_link_list;
	struct css_set *cg;
};

/**
 * link_css_set - a helper function to link a css_set to a cgroup
 * @tmp_cg_links: cg_cgroup_link objects allocated by allocate_cg_links()
 * @cg: the css_set to be linked
 * @cgrp: the destination cgroup
 */
static void link_css_set(struct list_head *tmp_cg_links,
			 struct css_set *cg, struct cgroup *cgrp)
{
	struct cg_cgroup_link *link;

	BUG_ON(list_empty(tmp_cg_links));
	link = list_first_entry(tmp_cg_links, struct cg_cgroup_link,
				cgrp_link_list);
	link->cg = cg;
	link->cgrp = cgrp;
	atomic_inc(&cgrp->count);
	list_move(&link->cgrp_link_list, &cgrp->css_sets);
	/*
	 * Always add links to the tail of the list so that the list
	 * is sorted by order of hierarchy creation
	 */
	list_add_tail(&link->cg_link_list, &cg->cg_links);
}

link_css_set links a css_set into a cgroup structure. It uses the cg and cgrp_link_list members of cg_cgroup_link to link the css_set onto the cgrp->css_sets list, the intent being to record all the css_sets that use this cgroup; at the same time it uses the cgrp and cg_link_list members to link the destination cgroup onto the css_set's cg_links list, the intent being to record all the cgroups that compose this css_set.
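With links in both directions, enumerating every task attached to a cgroup becomes a two-level list walk. A minimal sketch (this is essentially what the kernel's cgroup_iter_* helpers do internally; the function itself is ours, and css_set_lock is assumed to be held):

/* Sketch: visit every task in @cgrp by walking its css_set links.
 * Assumes css_set_lock is held; illustrative, not the kernel's iterator API. */
static void for_each_task_in_cgroup(struct cgroup *cgrp)
{
	struct cg_cgroup_link *link;
	struct task_struct *task;

	list_for_each_entry(link, &cgrp->css_sets, cgrp_link_list) {
		struct css_set *cg = link->cg;

		list_for_each_entry(task, &cg->tasks, cg_list) {
			/* @task is attached to @cgrp via @cg */
		}
	}
}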

Attaching a task to a cgroup: cgroup_attach_task

  cgroup_attach_task attaches a thread, or a whole thread group, to a cgroup; it is the function that implements writing a thread's pid into a cgroup's tasks file. Walking through it gives a good picture of how cgroup, css_set and subsys relate to each other.

/**
 * cgroup_attach_task - attach a task or a whole threadgroup to a cgroup
 * @cgrp: the cgroup to attach to
 * @tsk: the task or the leader of the threadgroup to be attached
 * @threadgroup: attach the whole threadgroup?
 *
 * Call holding cgroup_mutex and the group_rwsem of the leader. Will take
 * task_lock of @tsk or each thread in the threadgroup individually in turn.
 */
static int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk,
			      bool threadgroup)
{
	int retval, i, group_size;
	struct cgroup_subsys *ss, *failed_ss = NULL;
	struct cgroupfs_root *root = cgrp->root;
	/* threadgroup list cursor and array */
	struct task_struct *leader = tsk;
	struct task_and_cgroup *tc;
	struct flex_array *group;
	struct cgroup_taskset tset = { };

	/*
	 * step 0: in order to do expensive, possibly blocking operations for
	 * every thread, we cannot iterate the thread group list, since it needs
	 * rcu or tasklist locked. instead, build an array of all threads in the
	 * group - group_rwsem prevents new threads from appearing, and if
	 * threads exit, this will just be an over-estimate.
	 */
	if (threadgroup)
		group_size = get_nr_threads(tsk);
	else
		group_size = 1;
	/* flex_array supports very large thread-groups better than kmalloc. */
	group = flex_array_alloc(sizeof(*tc), group_size, GFP_KERNEL);
	if (!group)
		return -ENOMEM;
	/* pre-allocate to guarantee space while iterating in rcu read-side. */
	retval = flex_array_prealloc(group, 0, group_size, GFP_KERNEL);
	if (retval)
		goto out_free_group_list;

	i = 0;
	/*
	 * Prevent freeing of tasks while we take a snapshot. Tasks that are
	 * already PF_EXITING could be freed from underneath us unless we
	 * take an rcu_read_lock.
	 */
	rcu_read_lock();
	do {
		struct task_and_cgroup ent;

		/* @tsk either already exited or can't exit until the end */
		if (tsk->flags & PF_EXITING)
			goto next;

		/* as per above, nr_threads may decrease, but not increase. */
		BUG_ON(i >= group_size);
		ent.task = tsk;
		ent.cgrp = task_cgroup_from_root(tsk, root);
		/* nothing to do if this task is already in the cgroup */
		if (ent.cgrp == cgrp)
			goto next;
		/*
		 * saying GFP_ATOMIC has no effect here because we did prealloc
		 * earlier, but it's good form to communicate our expectations.
		 */
		retval = flex_array_put(group, i, &ent, GFP_ATOMIC);
		BUG_ON(retval != 0);
		i++;
	next:
		if (!threadgroup)
			break;
	} while_each_thread(leader, tsk);
	rcu_read_unlock();
	/* remember the number of threads in the array for later. */
	group_size = i;
	tset.tc_array = group;
	tset.tc_array_len = group_size;

	/* methods shouldn't be called if no task is actually migrating */
	retval = 0;
	if (!group_size)
		goto out_free_group_list;

	/*
	 * step 1: check that we can legitimately attach to the cgroup.
	 */
	for_each_subsys(root, ss) {
		if (ss->can_attach) {
			retval = ss->can_attach(cgrp, &tset);
			if (retval) {
				failed_ss = ss;
				goto out_cancel_attach;
			}
		}
	}

	/*
	 * step 2: make sure css_sets exist for all threads to be migrated.
	 * we use find_css_set, which allocates a new one if necessary.
	 */
	for (i = 0; i < group_size; i++) {
		tc = flex_array_get(group, i);
		tc->cg = find_css_set(tc->task->cgroups, cgrp);
		if (!tc->cg) {
			retval = -ENOMEM;
			goto out_put_css_set_refs;
		}
	}

	/*
	 * step 3: now that we're guaranteed success wrt the css_sets,
	 * proceed to move all tasks to the new cgroup.  There are no
	 * failure cases after here, so this is the commit point.
	 */
	for (i = 0; i < group_size; i++) {
		tc = flex_array_get(group, i);
		cgroup_task_migrate(tc->cgrp, tc->task, tc->cg);
	}
	/* nothing is sensitive to fork() after this point. */

	/*
	 * step 4: do subsystem attach callbacks.
	 */
	for_each_subsys(root, ss) {
		if (ss->attach)
			ss->attach(cgrp, &tset);
	}

	/*
	 * step 5: success! and cleanup
	 */
	retval = 0;
	...
}

The logic of cgroup_attach_task is fairly clear:

  • First, fetch each thread's current cgroup and fill the flex_array. Every process belongs to some cgroup by default; after mounting a cgroup filesystem you can see every process's pid in the root directory's tasks file.
  • Ask each subsystem whether the processes may be attached to it (can_attach).
  • Obtain the css_set each process is about to join, either finding an existing one or creating a new one (find_css_set, analyzed below).
  • Migrate the processes into the new cgroup, i.e. into the new css_set (a sketch of this step follows the list).
  • Invoke each subsystem's attach callback for the newly attached processes.
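The migration step itself is small: retarget the task's cgroups pointer and move the task onto the new css_set's tasks list. Abridged from cgroup_task_migrate() in the same file (locking commentary trimmed):

static void cgroup_task_migrate(struct cgroup *oldcgrp,
				struct task_struct *tsk, struct css_set *newcg)
{
	struct css_set *oldcg = tsk->cgroups;

	/* switch the task over to the new css_set */
	task_lock(tsk);
	rcu_assign_pointer(tsk->cgroups, newcg);
	task_unlock(tsk);

	/* move the task onto the new css_set's tasks list */
	write_lock(&css_set_lock);
	if (!list_empty(&tsk->cg_list))
		list_move(&tsk->cg_list, &newcg->tasks);
	write_unlock(&css_set_lock);

	/* drop the reference the task held on its old css_set */
	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
	put_css_set(oldcg);
}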
/*
 * find_css_set() takes an existing cgroup group and a
 * cgroup object, and returns a css_set object that's
 * equivalent to the old group, but with the given cgroup
 * substituted into the appropriate hierarchy. Must be called with
 * cgroup_mutex held
 */
static struct css_set *find_css_set(
	struct css_set *oldcg, struct cgroup *cgrp)
{
	struct css_set *res;
	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];

	struct list_head tmp_cg_links;

	struct cg_cgroup_link *link;
	unsigned long key;

	/* First see if we already have a cgroup group that matches
	 * the desired set */
	read_lock(&css_set_lock);
	/* look up an existing css_set; if the lookup fails, template holds the
	 * desired css_set's subsys array for the creation below */
	res = find_existing_css_set(oldcg, cgrp, template);
	if (res)
		get_css_set(res);
	read_unlock(&css_set_lock);

	if (res)
		return res;

	res = kzalloc(sizeof(*res), GFP_KERNEL);
	if (!res)
		return NULL;

	/* allocate the cg_cgroup_link objects we'll need, one per hierarchy */
	if (allocate_cg_links(root_count, &tmp_cg_links) < 0) {
		kfree(res);
		return NULL;
	}

	atomic_set(&res->refcount, 1);
	INIT_LIST_HEAD(&res->cg_links);
	INIT_LIST_HEAD(&res->tasks);
	INIT_HLIST_NODE(&res->hlist);

	/* copy the subsys array assembled by find_existing_css_set() */
	memcpy(res->subsys, template, sizeof(res->subsys));

	write_lock(&css_set_lock);
	/* link the new css_set to each cgroup that composes it, substituting
	 * the destination cgroup within its own hierarchy */
	list_for_each_entry(link, &oldcg->cg_links, cg_link_list) {
		struct cgroup *c = link->cgrp;

		if (c->root == cgrp->root)
			c = cgrp;
		link_css_set(&tmp_cg_links, res, c);
	}

	css_set_count++;

	/* add the new css_set to the global hash table */
	key = css_set_hash(res->subsys);
	hash_add(css_set_table, &res->hlist, key);
	write_unlock(&css_set_lock);

	return res;
}
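So find_css_set either reuses an existing css_set matching the desired subsystem combination or builds a new one: it copies the subsys template assembled by find_existing_css_set, links the new css_set via cg_cgroup_link to each cgroup composing it (substituting the destination cgrp for the old cgroup in the same hierarchy), and inserts it into css_set_table. With that, cgroup_attach_task has everything it needs to migrate tasks into the destination cgroup.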
