Netlink Kernel Implementation Analysis (1): Creation

阿新 • Published: 2019-01-28

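Netlink is an IPC (Inter-Process Communication) mechanism. It is used mainly for communication between the kernel and user space, and it can also be used between processes (Netlink is used mostly for kernel communication; processes more often talk to each other over Unix domain sockets). Ordinarily, user space and the kernel communicate through ioctl, sysfs attribute files, or procfs attribute files. All three are synchronous and are initiated by user space; the kernel cannot start a conversation on its own. Netlink, by contrast, is asynchronous and full-duplex, and it lets the kernel initiate communication. The kernel side uses a dedicated set of Netlink APIs, while user space uses the standard socket API. Data sent by the kernel is stored in the receive buffer of the receiving process's socket and is then handled by that process. Netlink has the following advantages: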

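1. Bidirectional, full-duplex, asynchronous transfer. The kernel can initiate a transfer itself instead of waiting for user space to start one (as it must with one-way mechanisms such as ioctl). A user-space program waiting for some kernel condition therefore does not need to poll; it simply receives kernel messages asynchronously.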

2. Multicast support. The kernel can send one message to multiple receiving processes, so each process does not have to query separately.

[Figure: Netlink architecture block diagram]

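In the mainline Linux 4.1.x kernel, many kernel modules already use the netlink mechanism; the uevent facility of the driver model, for example, is built on netlink. The netlink protocol family currently allows up to 32 protocol types (MAX_LINKS). They are defined in include/uapi/linux/netlink.h: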

#define NETLINK_ROUTE		0	/* Routing/device hook				*/
#define NETLINK_UNUSED		1	/* Unused number				*/
#define NETLINK_USERSOCK	2	/* Reserved for user mode socket protocols 	*/
#define NETLINK_FIREWALL	3	/* Unused number, formerly ip_queue		*/
#define NETLINK_SOCK_DIAG	4	/* socket monitoring				*/
#define NETLINK_NFLOG		5	/* netfilter/iptables ULOG */
#define NETLINK_XFRM		6	/* ipsec */
#define NETLINK_SELINUX		7	/* SELinux event notifications */
#define NETLINK_ISCSI		8	/* Open-iSCSI */
#define NETLINK_AUDIT		9	/* auditing */
#define NETLINK_FIB_LOOKUP	10	
#define NETLINK_CONNECTOR	11
#define NETLINK_NETFILTER	12	/* netfilter subsystem */
#define NETLINK_IP6_FW		13
#define NETLINK_DNRTMSG		14	/* DECnet routing messages */
#define NETLINK_KOBJECT_UEVENT	15	/* Kernel messages to userspace */
#define NETLINK_GENERIC		16
/* leave room for NETLINK_DM (DM Events) */
#define NETLINK_SCSITRANSPORT	18	/* SCSI Transports */
#define NETLINK_ECRYPTFS	19
#define NETLINK_RDMA		20
#define NETLINK_CRYPTO		21	/* Crypto layer */

#define NETLINK_INET_DIAG	NETLINK_SOCK_DIAG

#define MAX_LINKS 32

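In the 4.1.x kernel, protocol numbers 0 through 21 are already allocated (22 slots, with 17 left reserved for NETLINK_DM). Among them, NETLINK_ROUTE is used to configure and query core networking state such as the routing tables, NETLINK_KOBJECT_UEVENT carries uevent messages, and so on.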

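Real projects may have custom requirements that none of these dedicated protocol types can satisfy. In that case a new type can be added, as long as the total stays within the limit of 32. Doing so is usually not a good idea, however, so the kernel developers designed a generic netlink protocol type (Generic Netlink), NETLINK_GENERIC. It is a Netlink multiplexer that lets users add their own sub-protocol types (later I will write a sample program on top of Generic Netlink to demonstrate communication between the kernel and user space).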

The rest of this article uses the Linux 4.1.12 kernel source to analyze how Netlink sockets are created and how communication works.

I. Netlink subsystem initialization

The kernel initializes Netlink during system boot. The initialization code is the netlink_proto_init() function in af_netlink.c, and the whole flow is as follows:

Figure 1: netlink subsystem initialization

static int __init netlink_proto_init(void)
{
	int i;
	int err = proto_register(&netlink_proto, 0);

	if (err != 0)
		goto out;

	BUILD_BUG_ON(sizeof(struct netlink_skb_parms) > FIELD_SIZEOF(struct sk_buff, cb));

	nl_table = kcalloc(MAX_LINKS, sizeof(*nl_table), GFP_KERNEL);
	if (!nl_table)
		goto panic;

	for (i = 0; i < MAX_LINKS; i++) {
		if (rhashtable_init(&nl_table[i].hash,
				    &netlink_rhashtable_params) < 0) {
			while (--i > 0)
				rhashtable_destroy(&nl_table[i].hash);
			kfree(nl_table);
			goto panic;
		}
	}

	INIT_LIST_HEAD(&netlink_tap_all);

	netlink_add_usersock_entry();

	sock_register(&netlink_family_ops);
	register_pernet_subsys(&netlink_net_ops);
	/* The netlink device handler may be needed early. */
	rtnetlink_init();
out:
	return err;
panic:
	panic("netlink_init: Cannot allocate nl_table\n");
}

core_initcall(netlink_proto_init);

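The initialization function first registers the netlink protocol with the kernel. It then allocates and initializes the nl_table array. This table is the central data structure of the whole netlink implementation: each protocol type occupies one entry, and every netlink socket of any protocol type created later in the kernel is stored in and managed through it. A quick look at its definition gives a general impression: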

struct netlink_table {
	struct rhashtable	hash;
	struct hlist_head	mc_list;
	struct listeners __rcu	*listeners;
	unsigned int		flags;
	unsigned int		groups;
	struct mutex		*cb_mutex;
	struct module		*module;
	int			(*bind)(struct net *net, int group);
	void			(*unbind)(struct net *net, int group);
	bool			(*compare)(struct net *net, struct sock *sock);
	int			registered;
};

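Here hash (a hash table) indexes the different netlink socket instances of the same protocol type; mc_list is the hlist of sockets used for multicast; listeners is the listener bitmask; groups is the maximum number of multicast groups the protocol supports. The structure also holds several function pointers, which are assigned when the kernel first creates a netlink socket of that protocol and are called later when user space creates and binds sockets. Back in the initialization function: it next initializes the NETLINK_USERSOCK protocol type, which applications use for communication between user-space processes. It then calls sock_register() to register the protocol family handler, that is, the netlink socket creation function, so that it is called whenever an application creates a netlink socket. The netlink_family_ops structure is defined as follows: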

static const struct net_proto_family netlink_family_ops = {
	.family = PF_NETLINK,
	.create = netlink_create,
	.owner	= THIS_MODULE,	/* for consistency 8) */
};

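From then on, when an application calls socket() with PF_NETLINK (AF_NETLINK), netlink_create() handles the request. Back in the initialization function again: register_pernet_subsys() registers a "subsystem" init function and exit function with every network namespace in the kernel. "Subsystem" here does not mean the netlink subsystem; it is a generic mechanism. The registered init and exit functions are called whenever a network namespace is created or destroyed (and for namespaces that already exist, the init function is called during registration). The netlink sockets of the various protocol types described later are created the same way. netlink_net_ops is defined as follows: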

static struct pernet_operations __net_initdata netlink_net_ops = {
	.init = netlink_net_init,
	.exit = netlink_net_exit,
};

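netlink_net_init() creates a proc entry in the file system for each network namespace, and netlink_net_exit() removes it. Finally, at the end of netlink_proto_init(), rtnetlink_init() creates the netlink socket of protocol type NETLINK_ROUTE. This type is what netlink was originally designed for: it carries messages for the routing subsystem, the neighbour subsystem, interface configuration, the firewall, and so on. This completes the initialization of the netlink subsystem, which is fairly straightforward. The next question is how to use it to communicate.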

II. Kernel Netlink sockets

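The netlink sockets of the various protocol types are created and initialized in different kernel modules. I will use NETLINK_ROUTE from above as the example to analyze how a kernel netlink socket is created. First, a look at several key data structures used by kernel netlink: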

1. Kernel netlink configuration structure: struct netlink_kernel_cfg

/* optional Netlink kernel configuration parameters */
struct netlink_kernel_cfg {
	unsigned int	groups;
	unsigned int	flags;
	void		(*input)(struct sk_buff *skb);
	struct mutex	*cb_mutex;
	int		(*bind)(struct net *net, int group);
	void		(*unbind)(struct net *net, int group);
	bool		(*compare)(struct net *net, struct sock *sk);
};

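This structure holds the optional parameters of a kernel netlink socket. groups specifies the maximum multicast group. flags may be NL_CFG_F_NONROOT_RECV or NL_CFG_F_NONROOT_SEND: the former controls whether non-superusers may bind to multicast groups, the latter whether non-superusers may send multicast messages. input points to the callback that receives and processes messages from user space (it can be left unset if no messages from user space need to be received). The last three function pointers implement operations such as binding and unbinding a sock and are copied into the corresponding nl_table entry.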
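The effect of the two flags can be sketched as a small predicate. This is a simplified user-space model, not the kernel function: the SKETCH_* constants and sketch_netlink_allowed() are hypothetical names, the bit values are assumed to mirror NL_CFG_F_NONROOT_RECV and NL_CFG_F_NONROOT_SEND, and has_net_admin stands in for the kernel's privilege check.

```c
#include <stdbool.h>

/* Assumed to mirror NL_CFG_F_NONROOT_RECV / NL_CFG_F_NONROOT_SEND;
 * redefined here because the real macros live in a kernel-internal header. */
#define SKETCH_NONROOT_RECV (1u << 0)
#define SKETCH_NONROOT_SEND (1u << 1)

/* Hypothetical helper: may the caller perform the operation guarded by
 * `flag`? Either the protocol opted in through its flags, or the caller
 * is privileged (modelled by `has_net_admin`). */
static bool sketch_netlink_allowed(unsigned int table_flags, unsigned int flag,
                                   bool has_net_admin)
{
    return (table_flags & flag) != 0 || has_net_admin;
}
```

With flags set to SKETCH_NONROOT_RECV only, as NETLINK_ROUTE does with the real flag, an unprivileged caller passes the receive check and fails the send check.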

2. Netlink attribute header: struct nlattr

struct nlattr {
	__u16           nla_len;
	__u16           nla_type;
};

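The netlink message header is followed by the payload, which is encoded as type-length-value (TLV) attributes. The type and length are carried in the attribute header struct nlattr: nla_len is the attribute length (header included) and nla_type is the attribute type, an identifier defined by each netlink protocol. The data type of an attribute's payload is described separately by the following enum (defined in include/net/netlink.h), which the validation policy below refers to: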

enum {
	NLA_UNSPEC,
	NLA_U8,
	NLA_U16,
	NLA_U32,
	NLA_U64,
	NLA_STRING,
	NLA_FLAG,
	NLA_MSECS,
	NLA_NESTED,
	NLA_NESTED_COMPAT,
	NLA_NUL_STRING,
	NLA_BINARY,
	NLA_S8,
	NLA_S16,
	NLA_S32,
	NLA_S64,
	__NLA_TYPE_MAX,
};

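The most commonly used are NLA_UNSPEC (type and length unknown), NLA_U32 (unsigned 32-bit integer), NLA_STRING (variable-length string), and NLA_NESTED (a nested attribute, that is, one that contains another level of attributes).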
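To make the TLV layout concrete, here is a minimal user-space sketch that appends attributes to a buffer and searches them again. It relies on struct nlattr, NLA_ALIGN, and NLA_HDRLEN from the user-space header <linux/netlink.h>; put_attr() and find_attr() are hypothetical helper names, not kernel or library functions.

```c
#include <sys/socket.h>
#include <linux/netlink.h>   /* struct nlattr, NLA_ALIGN, NLA_HDRLEN */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append one attribute (header, payload, padding) at offset `off`.
 * Returns the new end offset, or 0 if the attribute does not fit. */
static size_t put_attr(unsigned char *buf, size_t cap, size_t off,
                       uint16_t type, const void *data, uint16_t len)
{
    struct nlattr hdr;
    size_t total;

    if (len > UINT16_MAX - NLA_HDRLEN)
        return 0;
    total = (size_t)NLA_ALIGN(NLA_HDRLEN + len);
    if (off > cap || total > cap - off)
        return 0;

    hdr.nla_len = (uint16_t)(NLA_HDRLEN + len);  /* padding is not counted */
    hdr.nla_type = type;
    memset(buf + off, 0, total);                 /* zero the padding bytes */
    memcpy(buf + off, &hdr, sizeof(hdr));
    memcpy(buf + off + NLA_HDRLEN, data, len);
    return off + total;
}

/* Return the payload of the first attribute of `type` in buf[0..end) and
 * store its length in *len; return NULL if absent or malformed. */
static const void *find_attr(const unsigned char *buf, size_t end,
                             uint16_t type, uint16_t *len)
{
    size_t off = 0;

    while (off + NLA_HDRLEN <= end) {
        struct nlattr hdr;

        memcpy(&hdr, buf + off, sizeof(hdr));
        if (hdr.nla_len < NLA_HDRLEN || off + hdr.nla_len > end)
            return NULL;                         /* malformed attribute */
        if (hdr.nla_type == type) {
            *len = (uint16_t)(hdr.nla_len - NLA_HDRLEN);
            return buf + off + NLA_HDRLEN;
        }
        off += (size_t)NLA_ALIGN(hdr.nla_len);   /* next header is 4-byte aligned */
    }
    return NULL;
}
```

A 5-byte string payload, for instance, occupies 4 + 5 = 9 bytes, padded to 12, so the next attribute starts at offset 12.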

3. Netlink validation policy: struct nla_policy

struct nla_policy {
	u16		type;
	u16		len;
};

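A netlink protocol can define a validation policy for its message attributes, stating for each attribute what data type is expected. After receiving a message, the kernel checks its attributes against the policy (if len is not set, no length check is performed). Only attributes that match the policy are considered valid; a message whose attributes fail the check is not processed. The policy is described with nla_policy and is normally defined as an array with one entry per attribute of the protocol (indexed by the user-defined attribute identifier, not by data type). In each entry, type is the expected data type (one of the NLA_* values above) and len describes the payload length of the attribute.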
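The idea of a policy array indexed by attribute identifier can be sketched as follows. This is a deliberately simplified model and not the kernel's validation code: the sketch_* names, the SK_* type tags, and the exact length rules are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical data-type tags, loosely modelled on the NLA_* enum. */
enum sketch_type { SK_UNSPEC, SK_U8, SK_U16, SK_U32, SK_U64, SK_STRING };

/* Same shape as struct nla_policy: expected data type plus a length. */
struct sketch_policy {
    uint16_t type;
    uint16_t len;   /* SK_STRING: max payload length; SK_UNSPEC: min length */
};

/* Check one attribute's payload length against policy[attr_id]. */
static bool sketch_validate(const struct sketch_policy *policy, size_t n,
                            uint16_t attr_id, size_t payload_len)
{
    const struct sketch_policy *pt;

    if (attr_id >= n)
        return true;                 /* unknown attribute: not checked here */
    pt = &policy[attr_id];

    switch (pt->type) {
    case SK_U8:     return payload_len >= sizeof(uint8_t);
    case SK_U16:    return payload_len >= sizeof(uint16_t);
    case SK_U32:    return payload_len >= sizeof(uint32_t);
    case SK_U64:    return payload_len >= sizeof(uint64_t);
    case SK_STRING: return payload_len >= 1 &&
                           (pt->len == 0 || payload_len <= pt->len);
    default:        return payload_len >= pt->len;  /* len 0: no check */
    }
}
```

A protocol would declare one entry per attribute it defines, for example a 32-bit MTU attribute and a name string limited to 15 bytes, and run every received attribute through the check before using it.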

4. Netlink socket structure: netlink_sock

struct netlink_sock {
	/* struct sock has to be the first member of netlink_sock */
	struct sock		sk;
	u32			portid;
	u32			dst_portid;
	u32			dst_group;
	u32			flags;
	u32			subscriptions;
	u32			ngroups;
	unsigned long		*groups;
	unsigned long		state;
	size_t			max_recvmsg_len;
	wait_queue_head_t	wait;
	bool			bound;
	bool			cb_running;
	struct netlink_callback	cb;
	struct mutex		*cb_mutex;
	struct mutex		cb_def_mutex;
	void			(*netlink_rcv)(struct sk_buff *skb);
	int			(*netlink_bind)(struct net *net, int group);
	void			(*netlink_unbind)(struct net *net, int group);
	struct module		*module;
#ifdef CONFIG_NETLINK_MMAP
	struct mutex		pg_vec_lock;
	struct netlink_ring	rx_ring;
	struct netlink_ring	tx_ring;
	atomic_t		mapped;
#endif /* CONFIG_NETLINK_MMAP */

	struct rhash_head	node;
	struct rcu_head		rcu;
};

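This structure describes a netlink socket. portid is the ID the socket itself is bound to (0 for the kernel); dst_portid is the destination ID; ngroups is the number of multicast groups the protocol supports; groups holds the group bitmask; netlink_rcv holds the function that processes data received from user space; netlink_bind and netlink_unbind are the bind and unbind handlers specific to the protocol.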

5. Creating a kernel netlink socket

We now analyze how a NETLINK_ROUTE netlink socket is created, which leads to rtnetlink_net_init(). Start again at the end of the subsystem initialization function netlink_proto_init() and follow rtnetlink_init():

Figure 2: kernel netlink socket creation flow

void __init rtnetlink_init(void)
{
	if (register_pernet_subsys(&rtnetlink_net_ops))
		panic("rtnetlink_init: cannot initialize rtnetlink\n");

	...
}

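This is the technique seen earlier: the rtnetlink init and exit functions are registered with every network namespace in the kernel, and for namespaces that already exist the init function is called right away. Here that function is rtnetlink_net_init().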

static struct pernet_operations rtnetlink_net_ops = {
	.init = rtnetlink_net_init,
	.exit = rtnetlink_net_exit,
};

static int __net_init rtnetlink_net_init(struct net *net)
{
	struct sock *sk;
	struct netlink_kernel_cfg cfg = {
		.groups		= RTNLGRP_MAX,
		.input		= rtnetlink_rcv,
		.cb_mutex	= &rtnl_mutex,
		.flags		= NL_CFG_F_NONROOT_RECV,
	};

	sk = netlink_kernel_create(net, NETLINK_ROUTE, &cfg);
	if (!sk)
		return -ENOMEM;
	net->rtnl = sk;
	return 0;
}

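First a netlink_kernel_cfg instance is defined. groups is set to RTNLGRP_MAX, the receive handler is set to rtnetlink_rcv, and flags is set to NL_CFG_F_NONROOT_RECV, so non-superusers may bind to multicast groups. NL_CFG_F_NONROOT_SEND is not set, so non-superusers cannot send multicast messages.
The init function then calls netlink_kernel_create() to create a NETLINK_ROUTE socket in the current network namespace, passing the cfg structure defined above. Inside netlink_kernel_create():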

static inline struct sock *
netlink_kernel_create(struct net *net, int unit, struct netlink_kernel_cfg *cfg)
{
	return __netlink_kernel_create(net, unit, THIS_MODULE, cfg);
}

It is merely a wrapper around __netlink_kernel_create(). That function is fairly long, so we analyze it in pieces:

/*
 *	We export these functions to other modules. They provide a
 *	complete set of kernel non-blocking support for message
 *	queueing.
 */

struct sock *
__netlink_kernel_create(struct net *net, int unit, struct module *module,
			struct netlink_kernel_cfg *cfg)
{
	struct socket *sock;
	struct sock *sk;
	struct netlink_sock *nlk;
	struct listeners *listeners = NULL;
	struct mutex *cb_mutex = cfg ? cfg->cb_mutex : NULL;
	unsigned int groups;

	BUG_ON(!nl_table);

	if (unit < 0 || unit >= MAX_LINKS)
		return NULL;

	if (sock_create_lite(PF_NETLINK, SOCK_DGRAM, unit, &sock))
		return NULL;

	/*
	 * We have to just have a reference on the net from sk, but don't
	 * get_net it. Besides, we cannot get and then put the net here.
	 * So we create one inside init_net and the move it to net.
	 */

	if (__netlink_create(&init_net, sock, cb_mutex, unit) < 0)
		goto out_sock_release_nosk;

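After some simple parameter checks, sock_create_lite() creates a socket of type SOCK_DGRAM in the PF_NETLINK address family; its protocol type is the NETLINK_ROUTE value passed in as unit. The function then calls the central __netlink_create() to initialize the netlink socket (as shown later, creating a netlink socket from user space ends up in the same function):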

static int __netlink_create(struct net *net, struct socket *sock,
			    struct mutex *cb_mutex, int protocol)
{
	struct sock *sk;
	struct netlink_sock *nlk;

	sock->ops = &netlink_ops;

	sk = sk_alloc(net, PF_NETLINK, GFP_KERNEL, &netlink_proto);
	if (!sk)
		return -ENOMEM;

	sock_init_data(sock, sk);

	nlk = nlk_sk(sk);
	if (cb_mutex) {
		nlk->cb_mutex = cb_mutex;
	} else {
		nlk->cb_mutex = &nlk->cb_def_mutex;
		mutex_init(nlk->cb_mutex);
	}
	init_waitqueue_head(&nlk->wait);
#ifdef CONFIG_NETLINK_MMAP
	mutex_init(&nlk->pg_vec_lock);
#endif

	sk->sk_destruct = netlink_sock_destruct;
	sk->sk_protocol = protocol;
	return 0;
}

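It first points the socket's operations at netlink_ops, which is analyzed in detail later together with message passing. It then allocates the sock structure and initializes it, mainly the send and receive queues, data buffers, wait queue, and mutex. Finally it sets the sk_destruct callback and the protocol type. Back in __netlink_kernel_create():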

	sk = sock->sk;
	sk_change_net(sk, net);

	if (!cfg || cfg->groups < 32)
		groups = 32;
	else
		groups = cfg->groups;

	listeners = kzalloc(sizeof(*listeners) + NLGRPSZ(groups), GFP_KERNEL);
	if (!listeners)
		goto out_sock_release;

	sk->sk_data_ready = netlink_data_ready;
	if (cfg && cfg->input)
		nlk_sk(sk)->netlink_rcv = cfg->input;

	if (netlink_insert(sk, 0))
		goto out_sock_release;

	nlk = nlk_sk(sk);
	nlk->flags |= NETLINK_KERNEL_SOCKET;

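One point worth noting: when __netlink_create() was called above, the struct sock instance was allocated in the init_net namespace. sk_change_net() now moves it to the target namespace net. (The comment explains why: roughly, get_net cannot be performed on net in this context. This may be to guard against the kernel not supporting the operation while it is still initializing; I do not fully understand the exact reason.)
The code then checks groups. At least 32 multicast groups are always supported (as shown later, user space can bind at most 32 multicast groups through the address), but the kernel may support more than 32 (Genetlink is such a case). It then allocates memory for listeners, which stores information about the listeners (listening sockets). Next it continues setting up function pointers: the rtnetlink_rcv defined earlier is stored in nlk_sk(sk)->netlink_rcv, which completes the setup of the kernel-side message handler. netlink_insert() then adds the newly created socket to nl_table (the real work is done by __netlink_insert()); registered sockets are managed through the hash table in nl_table. Finally the NETLINK_KERNEL_SOCKET flag marks this netlink socket as a kernel socket.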

	netlink_table_grab();
	if (!nl_table[unit].registered) {
		nl_table[unit].groups = groups;
		rcu_assign_pointer(nl_table[unit].listeners, listeners);
		nl_table[unit].cb_mutex = cb_mutex;
		nl_table[unit].module = module;
		if (cfg) {
			nl_table[unit].bind = cfg->bind;
			nl_table[unit].unbind = cfg->unbind;
			nl_table[unit].flags = cfg->flags;
			if (cfg->compare)
				nl_table[unit].compare = cfg->compare;
		}
		nl_table[unit].registered = 1;
	} else {
		kfree(listeners);
		nl_table[unit].registered++;
	}
	netlink_table_ungrab();
	return sk;

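Next, the nl_table entry for the protocol type passed in (NETLINK_ROUTE) is initialized. The code first checks whether a socket of the same protocol type has already been registered. If so, the entry is not initialized again: the listeners memory just allocated is freed, the registration count is incremented, and the function returns. Assuming this is the first NETLINK_ROUTE registration, the groups, listeners, cb_mutex, module, bind, unbind, flags, and compare fields of the entry are initialized in turn. Based on the cfg instance shown earlier, the resulting values are as follows (groups is the local value computed above, so it is raised to 32 if RTNLGRP_MAX is smaller than 32):

nl_table[NETLINK_ROUTE].groups = RTNLGRP_MAX;
nl_table[NETLINK_ROUTE].cb_mutex = &rtnl_mutex;
nl_table[NETLINK_ROUTE].module = THIS_MODULE;
nl_table[NETLINK_ROUTE].bind = NULL;
nl_table[NETLINK_ROUTE].unbind = NULL;
nl_table[NETLINK_ROUTE].compare = NULL;
nl_table[NETLINK_ROUTE].flags = NL_CFG_F_NONROOT_RECV;

These values are used later in the communication path. At the end, the function returns the sock pointer of the newly created netlink socket, which rtnetlink_net_init() stores in net->rtnl. Note that this step applies to NETLINK_ROUTE sockets because the network namespace reserves a sock pointer specifically for them. The kernel NETLINK_ROUTE socket is now created. Next we look at how user space creates a netlink socket.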

III. User-space Netlink sockets

User space can use Netlink through the standard socket API (socket(), sendto(), recv(), sendmsg(), recvmsg(), and so on). First, the basic data structures and the creation flow:

Figure 3: user-space netlink socket creation flow

1. Socket address structure: sockaddr_nl

struct sockaddr_nl {
	__kernel_sa_family_t	nl_family;	/* AF_NETLINK	*/
	unsigned short	nl_pad;		/* zero		*/
	__u32		nl_pid;		/* port ID	*/
	__u32		nl_groups;	/* multicast groups mask */
};

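In this structure: (1) nl_family is always AF_NETLINK; (2) nl_pad is always 0; (3) nl_pid is the unicast address of the netlink socket. When sending a message it identifies the destination socket. When binding in user space it can be set to the PID of the current process (for the kernel this value is 0) or simply left unset, in which case the kernel calls netlink_autobind() during bind() and uses the current process's PID. Note that when one process creates several netlink sockets, each must have a unique value (in multithreaded programs something like "pthread_self() << 16 | getpid()" is commonly used); (4) nl_groups is the multicast group mask. When sending it identifies the destination multicast groups; when binding it identifies the groups to join. nl_groups is a 32-bit unsigned number in which each bit represents one multicast group, so a netlink socket can join several groups (up to 32) and receive the multicast messages of all of them.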
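Because multicast groups are numbered from 1 while nl_groups is a bitmask, group N corresponds to bit N - 1. The sketch below shows the conversion; group_to_mask() is a hypothetical helper name, and the RTNLGRP_* and RTMGRP_* constants come from the user-space header <linux/rtnetlink.h>.

```c
#include <sys/socket.h>
#include <linux/rtnetlink.h>  /* RTNLGRP_* group numbers, RTMGRP_* mask bits */
#include <stdint.h>

/* Convert a 1-based multicast group number into its bit in nl_groups.
 * Returns 0 for group numbers that do not fit into the 32-bit mask. */
static uint32_t group_to_mask(unsigned int group)
{
    if (group < 1 || group > 32)
        return 0;
    return UINT32_C(1) << (group - 1);
}
```

For example, nl_groups = group_to_mask(RTNLGRP_LINK) | group_to_mask(RTNLGRP_IPV4_IFADDR) would request link and IPv4 address notifications in a single bind() call.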

2. Creating a Netlink socket

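User space creates a Netlink socket with the socket() system call. The first argument can be AF_NETLINK or PF_NETLINK (on Linux the two macros have the same value). The second can be SOCK_RAW or SOCK_DGRAM (raw socket or connectionless datagram socket). The last is one of the protocol types defined in netlink.h, so users can create whichever kind of socket they need.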

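For example, socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE) creates a Netlink socket of type NETLINK_ROUTE. Let us follow this system call to see how the kernel creates the socket for user space and what initialization it performs: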

SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
{
	int retval;
	struct socket *sock;
	int flags;

	/* Check the SOCK_* constants for consistency.  */
	BUILD_BUG_ON(SOCK_CLOEXEC != O_CLOEXEC);
	BUILD_BUG_ON((SOCK_MAX | SOCK_TYPE_MASK) != SOCK_TYPE_MASK);
	BUILD_BUG_ON(SOCK_CLOEXEC & SOCK_TYPE_MASK);
	BUILD_BUG_ON(SOCK_NONBLOCK & SOCK_TYPE_MASK);

	flags = type & ~SOCK_TYPE_MASK;
	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
		return -EINVAL;
	type &= SOCK_TYPE_MASK;

	if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
		flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;

	retval = sock_create(family, type, protocol, &sock);
	if (retval < 0)
		goto out;

	retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
	if (retval < 0)
		goto out_release;

out:
	/* It may be already another descriptor 8) Not kernel problem. */
	return retval;

out_release:
	sock_release(sock);
	return retval;
}

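After some parameter checks, the function calls sock_create() to create the socket, then obtains a file descriptor for it from the kernel and returns that descriptor. sock_create() is a thin wrapper around __sock_create() (in the kernel, the functions with two leading underscores are usually the ones that do the real work). Note that the call adds two arguments: the network namespace of the current process, and a final kern argument, which is 0 here to indicate that the socket is being created from user space. __sock_create() is fairly long, so we analyze it in pieces: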

int sock_create(int family, int type, int protocol, struct socket **res)
{
	return __sock_create(current->nsproxy->net_ns, family, type, protocol, res, 0);
}

int __sock_create(struct net *net, int family, int type, int protocol,
			 struct socket **res, int kern)
{
	int err;
	struct socket *sock;
	const struct net_proto_family *pf;

	/*
	 *      Check protocol is in range
	 */
	if (family < 0 || family >= NPROTO)
		return -EAFNOSUPPORT;
	if (type < 0 || type >= SOCK_MAX)
		return -EINVAL;

	/* Compatibility.

	   This uglymoron is moved from INET layer to here to avoid
	   deadlock in module load.
	 */
	if (family == PF_INET && type == SOCK_PACKET) {
		static int warned;
		if (!warned) {
			warned = 1;
			pr_info("%s uses obsolete (PF_INET,SOCK_PACKET)\n",
				current->comm);
		}
		family = PF_PACKET;
	}

These are again argument checks; they are straightforward and need no analysis. Moving on:

	err = security_socket_create(family, type, protocol, kern);
	if (err)
		return err;

	/*
	 *	Allocate the socket and allow the family to set things up. if
	 *	the protocol is 0, the family is instructed to select an appropriate
	 *	default.
	 */
	sock = sock_alloc();
	if (!sock) {
		net_warn_ratelimited("socket: no more sockets\n");
		return -ENFILE;	/* Not exactly a match, but its the
				   closest posix thing */
	}

	sock->type = type;

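First a security check is performed on the socket creation. When CONFIG_SECURITY_NETWORK is not enabled, security_socket_create() is an empty function that simply returns 0, so we ignore it here. Next, sock_alloc() allocates the socket instance and creates and initializes its inode. Then sock->type is set to the SOCK_RAW value passed in.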

#ifdef CONFIG_MODULES
	/* Attempt to load a protocol module if the find failed.
	 *
	 * 12/09/1996 Marcin: But! this makes REALLY only sense, if the user
	 * requested real, full-featured networking support upon configuration.
	 * Otherwise module support will break!
	 */
	if (rcu_access_pointer(net_families[family]) == NULL)
		request_module("net-pf-%d", family);
#endif

	rcu_read_lock();
	pf = rcu_dereference(net_families[family]);
	err = -EAFNOSUPPORT;
	if (!pf)
		goto out_release;

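When kernel modules are enabled, the code looks up the family (AF_NETLINK) in the kernel's net_families array to see whether it is registered, and if not it tries to load the corresponding network protocol module. In our case the netlink initialization function has already registered it through sock_register() (see above). The code then fetches the registered struct net_proto_family instance from net_families, which is the netlink_family_ops described in section I. Continuing: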

	/*
	 * We will call the ->create function, that possibly is in a loadable
	 * module, so we have to bump that loadable module refcnt first.
	 */
	if (!try_module_get(pf->owner))
		goto out_release;

	/* Now protected by module ref count */
	rcu_read_unlock();

	err = pf->create(net, sock, protocol, kern);
	if (err < 0)
		goto out_module_put;

	/*
	 * Now to bump the refcnt of the [loadable] module that owns this
	 * socket at sock_release time we decrement its refcnt.
	 */
	if (!try_module_get(sock->ops->owner))
		goto out_module_busy;

	/*
	 * Now that we're done with the ->create function, the [loadable]
	 * module can have its refcnt decremented
	 */
	module_put(pf->owner);
	err = security_socket_post_create(sock, family, type, protocol, kern);
	if (err)
		goto out_sock_release;
	*res = sock;

	return 0;

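Here the code takes a reference on the module that owns the protocol family and then drops the RCU read lock taken earlier. It then calls the family's create() hook to perform the remaining creation and initialization (here, the netlink_create() set in netlink_family_ops). Afterwards it takes a reference on the module that owns sock->ops, releases the reference on the family's module, and returns the newly created socket. Now into netlink_create():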

static int netlink_create(struct net *net, struct socket *sock, int protocol,
			  int kern)
{
	struct module *module = NULL;
	struct mutex *cb_mutex;
	struct netlink_sock *nlk;
	int (*bind)(struct net *net, int group);
	void (*unbind)(struct net *net, int group);
	int err = 0;

	sock->state = SS_UNCONNECTED;

	if (sock->type != SOCK_RAW && sock->type != SOCK_DGRAM)
		return -ESOCKTNOSUPPORT;

	if (protocol < 0 || protocol >= MAX_LINKS)
		return -EPROTONOSUPPORT;

	netlink_lock_table();
#ifdef CONFIG_MODULES
	if (!nl_table[protocol].registered) {
		netlink_unlock_table();
		request_module("net-pf-%d-proto-%d", PF_NETLINK, protocol);
		netlink_lock_table();
	}
#endif
	if (nl_table[protocol].registered &&
	    try_module_get(nl_table[protocol].module))
		module = nl_table[protocol].module;
	else
		err = -EPROTONOSUPPORT;
	cb_mutex = nl_table[protocol].cb_mutex;
	bind = nl_table[protocol].bind;
	unbind = nl_table[protocol].unbind;
	netlink_unlock_table();

	if (err < 0)
		goto out;

	err = __netlink_create(net, sock, cb_mutex, protocol);
	if (err < 0)
		goto out_module;

	local_bh_disable();
	sock_prot_inuse_add(net, &netlink_proto, 1);
	local_bh_enable();

	nlk = nlk_sk(sock->sk);
	nlk->module = module;
	nlk->netlink_bind = bind;
	nlk->netlink_unbind = unbind;
out:
	return err;

out_module:
	module_put(module);
	goto out;
}

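It first marks the socket state as unconnected and checks that the socket type is SOCK_RAW or SOCK_DGRAM; any other type is refused. It then checks whether this netlink protocol type has been registered. Because the kernel already created the NETLINK_ROUTE kernel socket and registered it in nl_table while initializing the netlink subsystem, the assignments here yield: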

cb_mutex = nl_table[NETLINK_ROUTE].cb_mutex = &rtnl_mutex;
module = nl_table[NETLINK_ROUTE].module = THIS_MODULE;
bind = nl_table[NETLINK_ROUTE].bind = NULL;
unbind = nl_table[NETLINK_ROUTE].unbind = NULL;

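Next, __netlink_create() performs the core creation and initialization. It was analyzed above, so we do not step into it again. After that, sock_prot_inuse_add() increments the protocol's in-use count, and the final assignments are: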

nlk->module = module = THIS_MODULE;
nlk->netlink_bind = bind = NULL;
nlk->netlink_unbind = unbind = NULL;

At this point the user-space NETLINK_ROUTE socket has been created.

3. Binding the socket

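After creating the socket, call bind() to bind it to a specific address or to join it to a multicast group. From then on, when the kernel or another user-space socket sends a unicast message to that address or a multicast message to that group, the message can be received with recv() or recvmsg(). Binding uses the sockaddr_nl address structure. For unicast, put the local address in nl_pid and set nl_groups to 0; for multicast, set nl_pid to 0 and put the multicast group mask in nl_groups. The following binds the PID of the current process as the unicast address: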

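#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

struct sockaddr_nl local;
int fd;

fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
memset(&local, 0, sizeof(local));
local.nl_family = AF_NETLINK;
local.nl_pid = getpid();	/* use the process ID as the unicast address */

if (bind(fd, (struct sockaddr *) &local, sizeof(local)) < 0)
	perror("bind");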

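The first argument of bind() is the descriptor of the Netlink socket just created, the second is the socket address to bind, and the last is the length of the address. Binding is similar to binding a TCP socket: a port has to be specified (or the kernel can assign one). Now into the bind() system call to analyze the whole binding process: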

Figure 4: user-space netlink socket bind flow

/*
 *	Bind a name to a socket. Nothing much to do here since it's
 *	the protocol's responsibility to handle the local address.
 *
 *	We move the socket address to kernel space before we call
 *	the protocol layer (having also checked the address is ok).
 */

SYSCALL_DEFINE3(bind, int, fd, struct sockaddr __user *, umyaddr, int, addrlen)
{
	struct socket *sock;
	struct sockaddr_storage address;
	int err, fput_needed;

	sock = sockfd_lookup_light(fd, &err, &fput_needed);
	if (sock) {
		err = move_addr_to_kernel(umyaddr, addrlen, &address);
		if (err >= 0) {
			err = security_socket_bind(sock,
						   (struct sockaddr *)&address,
						   addrlen);
			if (!err)
				err = sock->ops->bind(sock,
						      (struct sockaddr *)
						      &address, addrlen);
		}
		fput_light(sock->file, fput_needed);
	}
	return err;
}

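The function first looks up the socket structure for the fd passed in by the user, then copies the struct sockaddr address from user space into the kernel (using copy_from_user()). Skipping the security check security_socket_bind() again, the main work is handed to the registered sock->ops->bind() function. __netlink_create(), called when the socket was created, already set sock->ops to netlink_ops. Here is that structure instance: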

static const struct proto_ops netlink_ops = {
	.family =	PF_NETLINK,
	.owner =	THIS_MODULE,
	.release =	netlink_release,
	.bind =		netlink_bind,
	.connect =	netlink_connect,
	.socketpair =	sock_no_socketpair,
	.accept =	sock_no_accept,
	.getname =	netlink_getname,
	.poll =		netlink_poll,
	.ioctl =	sock_no_ioctl,
	.listen =	sock_no_listen,
	.shutdown =	sock_no_shutdown,
	.setsockopt =	netlink_setsockopt,
	.getsockopt =	netlink_getsockopt,
	.sendmsg =	netlink_sendmsg,
	.recvmsg =	netlink_recvmsg,
	.mmap =		netlink_mmap,
	.sendpage =	sock_no_sendpage,
};

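The function pointers in this structure are invoked indirectly by the system calls according to the socket's protocol family. In this case netlink_bind() is called. It is fairly long, so we analyze it in pieces: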

static int netlink_bind(struct socket *sock, struct sockaddr *addr,
			int addr_len)
{
	struct sock *sk = sock->sk;
	struct net *net = sock_net(sk);
	struct netlink_sock *nlk = nlk_sk(sk);
	struct sockaddr_nl *nladdr = (struct sockaddr_nl *)addr;
	int err;
	long unsigned int groups = nladdr->nl_groups;
	bool bound;

	if (addr_len < sizeof(struct sockaddr_nl))
		return -EINVAL;

	if (nladdr->nl_family != AF_NETLINK)
		return -EINVAL;

	/* Only superuser is allowed to listen multicasts */
	if (groups) {
		if (!netlink_allowed(sock, NL_CFG_F_NONROOT_RECV))
			return -EPERM;
		err = netlink_realloc_groups(sk);
		if (err)
			return err;
	}

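As shown, the address passed in by the user is cast to a sockaddr_nl structure, and a few arguments are checked. If the user specified multicast groups to bind, the code checks whether the protocol registered in nl_table has the NL_CFG_F_NONROOT_RECV flag set; if not, the user is refused permission to bind to multicast groups. As seen earlier, NETLINK_ROUTE sockets do set this flag, so netlink_realloc_groups() is called to allocate the group bitmap. A look inside: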

static int netlink_realloc_groups(struct sock *sk)
{
	struct netlink_sock *nlk = nlk_sk(sk);
	unsigned int groups;
	unsigned long *new_groups;
	int err = 0;

	netlink_table_grab();

	groups = nl_table[sk->sk_protocol].groups;
	if (!nl_table[sk->sk_protocol].registered) {
		err = -ENOENT;
		goto out_unlock;
	}

	if (nlk->ngroups >= groups)
		goto out_unlock;

	new_groups = krealloc(nlk->groups, NLGRPSZ(groups), GFP_ATOMIC);
	if (new_groups == NULL) {
		err = -ENOMEM;
		goto out_unlock;
	}
	memset((char *)new_groups + NLGRPSZ(nlk->ngroups), 0,
	       NLGRPSZ(groups) - NLGRPSZ(nlk->ngroups));

	nlk->groups = new_groups;
	nlk->ngroups = groups;
 out_unlock:
	netlink_table_ungrab();
	return err;
}

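The function compares the group count currently recorded in the socket with the maximum that the NETLINK_ROUTE protocol supports (the groups value stored in nl_table, derived from RTNLGRP_MAX). Since the socket was just created, nlk->ngroups = 0.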

Memory is then allocated for it, of size NLGRPSZ(groups) (a macro that rounds up for alignment). The newly added space is zeroed, the start address is stored in nlk->groups, and nlk->ngroups is updated. Back in netlink_bind():
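The rounding performed for the group bitmap can be sketched as below. The definition of NLGRPSZ() is not shown in this article, so the sketch assumes it rounds the number of group bits up to a whole number of unsigned long words and returns the size in bytes; nlgrpsz_sketch() is a hypothetical name.

```c
#include <limits.h>
#include <stddef.h>

/* Bytes needed for a bitmap of `groups` bits, rounded up to whole
 * unsigned longs (assumed to match what NLGRPSZ() computes). */
static size_t nlgrpsz_sketch(unsigned int groups)
{
    size_t bits_per_long = sizeof(unsigned long) * CHAR_BIT;
    size_t longs = (groups + bits_per_long - 1) / bits_per_long;

    return longs * sizeof(unsigned long);
}
```

Under this assumption, on a 64-bit system 32 groups need one 8-byte word, while 65 groups would need two. The memset() in netlink_realloc_groups() then only has to clear the bytes between the old size and the new size.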

	bound = nlk->bound;
	if (bound) {
		/* Ensure nlk->portid is up-to-date. */
		smp_rmb();

		if (nladdr->nl_pid != nlk->portid)
			return -EINVAL;
	}

	if (nlk->netlink_bind && groups) {
		int group;

		for (group = 0; group < nlk->ngroups; group++) {
			if (!test_bit(group, &groups))
				continue;
			err = nlk->netlink_bind(net, group + 1);
			if (!err)
				continue;
			netlink_undo_bind(group, groups, sk);
			return err;
		}
	}

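If the socket is already bound, the code checks that the ID to be bound equals the ID already bound and fails if it does not. Then, if the netlink protocol has its own bind function and the user specified multicast groups, that function is called for each requested group. NETLINK_ROUTE provides no nlk->netlink_bind() implementation, so nothing is called here.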

	/* No need for barriers here as we return to user-space without
	 * using any of the bound attributes.
	 */
	if (!bound) {
		err = nladdr->nl_pid ?
			netlink_insert(sk, nladdr->nl_pid) :
			netlink_autobind(sock);
		if (err) {
			netlink_undo_bind(nlk->ngroups, groups, sk);
			return err;
		}
	}

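If the socket has not been bound yet (the current case), different functions are called depending on whether the user specified a unicast address. Suppose first that user space did specify one: netlink_insert() inserts the socket into the hash table of the nl_table[NETLINK_ROUTE] entry and sets nlk_sk(sk)->bound = nlk_sk(sk)->portid = nladdr->nl_pid. Now suppose user space did not specify one: netlink_autobind() binds an address dynamically. A brief look at that function: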

static int netlink_autobind(struct socket *sock)
{
	struct sock *sk = sock->sk;
	struct net *net = sock_net(sk);
	struct netlink_table *table = &nl_table[sk->sk_protocol];
	s32 portid = task_tgid_vnr(current);
	int err;
	static s32 rover = -4097;

retry:
	cond_resched();
	rcu_read_lock();
	if (__netlink_lookup(table, portid, net)) {
		/* Bind collision, search negative portid values. */
		portid = rover--;
		if (rover > -4097)
			rover = -4097;
		rcu_read_unlock();
		goto retry;
	}
	rcu_read_unlock();

	err = netlink_insert(sk, portid);
	if (err == -EADDRINUSE)
		goto retry;

	/* If 2 threads race to autobind, that is fine.  */
	if (err == -EBUSY)
		err = 0;

	return err;
}

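It first tries the current process ID as the port address. If that ID is already bound by another socket of the same protocol, it picks a negative number as the ID (searching until a free one is found). Either way it ends by calling netlink_insert(). Back in netlink_bind():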
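The port ID selection can be modelled in user space as follows. This is a sketch of the search strategy only: pick_portid() and in_use() are hypothetical names, the array of used IDs stands in for the kernel's hash table lookup, and the rover is passed by pointer instead of being a static variable so the behaviour is easy to follow.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ROVER_START (-4097)   /* first negative candidate, as in netlink_autobind() */

/* Stand-in for the kernel's lookup: is `id` already bound? */
static bool in_use(int32_t id, const int32_t *used, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (used[i] == id)
            return true;
    return false;
}

/* Prefer `pid`; on a collision walk downwards through negative values,
 * continuing from *rover, which persists across calls. */
static int32_t pick_portid(int32_t pid, int32_t *rover,
                           const int32_t *used, size_t n)
{
    int32_t portid = pid;

    while (in_use(portid, used, n)) {
        portid = *rover;
        /* Step down; restart at ROVER_START rather than overflowing. */
        *rover = (*rover == INT32_MIN) ? ROVER_START : *rover - 1;
    }
    return portid;
}
```

If the process ID is free it is used directly and the rover is untouched. If both the process ID and -4097 are taken, the search moves on to -4098 and leaves the rover at -4099 for the next collision.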

	if (!groups && (nlk->groups == NULL || !(u32)nlk->groups[0]))
		return 0;

	netlink_table_grab();
	netlink_update_subscriptions(sk, nlk->subscriptions +
					 hweight32(groups) -
					 hweight32(nlk->groups[0]));
	nlk->groups[0] = (nlk->groups[0] & ~0xffffffffUL) | groups;
	netlink_update_listeners(sk);
	netlink_table_ungrab();

	return 0;

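If no multicast groups were specified and the socket holds no existing group subscriptions (no group bitmap was allocated, or its first word is empty), binding ends here and the function returns. Now suppose the user did specify multicast groups. netlink_update_subscriptions() links sk->sk_bind_node into nl_table[sk->sk_protocol].mc_list and records the number of joined groups in nlk->subscriptions. The group mask is stored in nlk->groups[0], and finally the netlink listener bitmask is updated. This ends the bind operation.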
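The subscription count passed to netlink_update_subscriptions() is simple bit counting: the old count, plus the number of bits in the new mask, minus the number of bits in the old mask. A minimal sketch, where popcount32() stands in for the kernel's hweight32() and rebind_subscriptions() is a hypothetical name:

```c
#include <stdint.h>

/* Portable population count, standing in for hweight32(). */
static unsigned int popcount32(uint32_t v)
{
    unsigned int n = 0;

    for (; v; v &= v - 1)   /* clear the lowest set bit each round */
        n++;
    return n;
}

/* Subscription count after the low 32 group bits change from
 * `old_mask` to `new_mask`, as computed at the end of netlink_bind(). */
static unsigned int rebind_subscriptions(unsigned int subscriptions,
                                         uint32_t old_mask, uint32_t new_mask)
{
    return subscriptions + popcount32(new_mask) - popcount32(old_mask);
}
```

Binding a fresh socket with mask 0x11 gives two subscriptions; rebinding the same socket with mask 0x1 brings the count down to one.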

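Having analyzed the creation of the netlink subsystem, the creation of kernel netlink sockets, and the creation and binding of user-space netlink sockets, the next article analyzes how messages are sent between the kernel and user space.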

Reference: "Linux Kernel Networking Implementation and Theory"
