BPF TCP Congestion Control Algorithms: Implementation Principles Explained

Article URL: https://www.ebpf.top/post/ebpf_struct_ops

1. Introduction

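The eBPF flywheel keeps spinning fast. Since version 5.6, the Linux kernel has allowed eBPF programs to replace the TCP congestion control algorithm: user space loads BPF programs that fill in the function pointers of the kernel's congestion control operations structure. Version 5.13 refined this further by letting this program type call selected kernel functions directly, so an eBPF program no longer has to reimplement TCP congestion control functions that already exist in the kernel.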

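Together these two features move Linux a step from a monolithic kernel toward a smarter, microkernel-like design. Although they currently focus only on TCP congestion control, they leave a lot of room for imagination. Many kernel features are built on structures of function pointers. Once a user-written eBPF program can redirect the functions in such a structure, the kernel can be extended and enhanced flexibly; combined with direct calls to kernel functions, this effectively gives ordinary users the ability to customize the kernel. It is a small step for eBPF, but it may become a big step for the kernel ecosystem.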

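This article focuses on the STRUCT_OPS capability introduced in 5.6 for customizing TCP congestion control. The ability of this eBPF program type to call Linux kernel functions will be covered in detail in the next article.

2. eBPF-Powered TCP Congestion Control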

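To let eBPF programs replace the TCP congestion control algorithm, Facebook engineer Martin KaFai Lau submitted a series of 11 small patches on 2020-01-08. The series adds a new map type, BPF_MAP_TYPE_STRUCT_OPS, and a new program type, BPF_PROG_TYPE_STRUCT_OPS. At this stage only the kernel's TCP congestion structure, tcp_congestion_ops, can be replaced.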

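Figure 1: Structures and code snippets involved in the overall implementation

Let's start with how the sample program is used (the complete code is linked here). Parts unrelated to the feature are omitted: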

SEC("struct_ops/dctcp_init")
void BPF_PROG(dctcp_init, struct sock *sk)
{
	const struct tcp_sock *tp = tcp_sk(sk);
	struct dctcp *ca = inet_csk_ca(sk);

	ca->prior_rcv_nxt = tp->rcv_nxt;
	ca->dctcp_alpha = min(dctcp_alpha_on_init, DCTCP_MAX_ALPHA);
	ca->loss_cwnd = 0;
	ca->ce_state = 0;

	dctcp_reset(tp, ca);
}

SEC("struct_ops/dctcp_ssthresh")
__u32 BPF_PROG(dctcp_ssthresh, struct sock *sk)
{
	struct dctcp *ca = inet_csk_ca(sk);
	struct tcp_sock *tp = tcp_sk(sk);

	ca->loss_cwnd = tp->snd_cwnd;
	return max(tp->snd_cwnd - ((tp->snd_cwnd * ca->dctcp_alpha) >> 11U), 2U);
}

// ....

SEC(".struct_ops")
struct tcp_congestion_ops dctcp_nouse = {
	.init		= (void *)dctcp_init,
	.set_state	= (void *)dctcp_state,
	.flags		= TCP_CONG_NEEDS_ECN,
	.name		= "bpf_dctcp_nouse",
};

/* The struct defined in the BPF program need not be identical to the one
 * used in the kernel; it may contain just the required fields. */
SEC(".struct_ops")
struct tcp_congestion_ops dctcp = {
	.init		= (void *)dctcp_init,
	.in_ack_event   = (void *)dctcp_update_alpha,
	.cwnd_event	= (void *)dctcp_cwnd_event,
	.ssthresh	= (void *)dctcp_ssthresh,
	.cong_avoid	= (void *)tcp_reno_cong_avoid,
	.undo_cwnd	= (void *)dctcp_cwnd_undo,
	.set_state	= (void *)dctcp_state,
	.flags		= TCP_CONG_NEEDS_ECN,
	.name		= "bpf_dctcp",
};
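To make the arithmetic in dctcp_ssthresh concrete, here is a small standalone sketch in plain user-space C (not a BPF program). It assumes alpha is a fixed-point fraction in which 1024 stands for 1.0, which is the value of DCTCP_MAX_ALPHA in the kernel's DCTCP implementation; the names ALPHA_ONE and ssthresh_sketch are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: alpha is fixed-point with 1024 == 1.0 (DCTCP_MAX_ALPHA). */
#define ALPHA_ONE 1024U

/* Hypothetical helper mirroring the return expression of dctcp_ssthresh:
 * shrink cwnd by cwnd * alpha / 2, never going below 2 segments. */
static uint32_t ssthresh_sketch(uint32_t cwnd, uint32_t alpha)
{
	/* >> 11 divides by 2048, i.e. by ALPHA_ONE and then by 2. */
	uint32_t reduced = cwnd - ((cwnd * alpha) >> 11U);

	return reduced > 2U ? reduced : 2U;
}
```

With cwnd = 100, a fully saturated alpha of ALPHA_ONE halves the window to 50 (the same cut Reno would make), alpha = 512 yields 75, and alpha = 0 leaves the window untouched.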

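Two points about the BPF program are worth noting:

The tcp_congestion_ops structure it uses is not the one from the kernel headers. It contains only the fields of the kernel structure that TCP CC algorithms use, so it is a subset of the kernel structure of the same name.

Some structures (such as tcp_sock) carry the preserve_access_index attribute. It means that when the eBPF bytecode is loaded, accesses to the structure's fields are relocated to match the field offsets of the same-named structure in the running kernel.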

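Note that the tcp_congestion_ops structure defined in the BPF program (the bpf_prog btf-type) may be identical to the structure defined in the kernel (the btf_vmlinux btf-type), or it may contain only some necessary fields of the kernel structure. The order of the members need not match the kernel's, but names, types, and function declarations (parameters and return values) must. A translation from the bpf_prog btf-type to the btf_vmlinux btf-type is therefore needed. It relies on BTF: members are currently looked up and matched by name, btf kind, and size, and libbpf returns an error when they do not match. The process resembles reflection in Go and is implemented mainly in the function bpf_map__init_kern_struct_ops (see the internals section for details).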
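The matching rule described above can be illustrated with a toy model in plain C. Everything here is hypothetical: member_desc, find_member, and match_struct are invented for the sketch and are not libbpf APIs, and real libbpf walks BTF type information rather than hand-written tables.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum kind { KIND_INT, KIND_PTR, KIND_ARRAY };	/* stand-in for btf kinds */

struct member_desc {
	const char *name;
	enum kind kind;
	size_t size;
};

/* Look a member up by name; its position in the table does not matter. */
static const struct member_desc *find_member(const struct member_desc *tbl,
					     size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(tbl[i].name, name) == 0)
			return &tbl[i];
	return NULL;
}

/* Return 0 when every program-side member has a kernel-side member with
 * the same name, kind, and size; return -1 on the first mismatch. */
static int match_struct(const struct member_desc *prog, size_t nprog,
			const struct member_desc *kern, size_t nkern)
{
	for (size_t i = 0; i < nprog; i++) {
		const struct member_desc *k =
			find_member(kern, nkern, prog[i].name);

		if (!k || k->kind != prog[i].kind || k->size != prog[i].size)
			return -1;
	}
	return 0;
}

/* Toy kernel-side layout: more members than the program declares. */
static const struct member_desc kern_ops[] = {
	{ "init",     KIND_PTR,   sizeof(void *) },
	{ "ssthresh", KIND_PTR,   sizeof(void *) },
	{ "flags",    KIND_INT,   4 },
	{ "name",     KIND_ARRAY, 16 },
};

/* Program-side layout: a subset, in a different order. */
static const struct member_desc prog_ok[] = {
	{ "name",     KIND_ARRAY, 16 },
	{ "ssthresh", KIND_PTR,   sizeof(void *) },
};

/* Same name as a kernel member but a different size: rejected. */
static const struct member_desc prog_bad[] = {
	{ "flags", KIND_INT, 8 },
};
```

Here match_struct accepts prog_ok even though it is reordered and incomplete, and rejects prog_bad because the size of flags differs.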

The section name .struct_ops declared in the eBPF program lets the BPF implementation identify which struct_ops structure is being implemented, in this case tcp_congestion_ops.

Several struct_ops structures can be defined under SEC(".struct_ops"), each as a global variable. libbpf creates one map per variable and names it after the variable: dctcp_nouse and dctcp in this example.

The complete user-space code is linked here, and the generated skeleton code here. The core code related to dctcp is as follows:

static void test_dctcp(void)
{
	struct bpf_dctcp *dctcp_skel;
	struct bpf_link *link;

	/* Function generated by the skeleton */
	dctcp_skel = bpf_dctcp__open_and_load();
	if (CHECK(!dctcp_skel, "bpf_dctcp__open_and_load", "failed\n"))
		return;

	/* bpf_map__attach_struct_ops() registers a struct_ops map with the
	 * kernel subsystem; here it is the struct tcp_congestion_ops dctcp
	 * variable defined above. */
	link = bpf_map__attach_struct_ops(dctcp_skel->maps.dctcp);
	if (CHECK(IS_ERR(link), "bpf_map__attach_struct_ops", "err:%ld\n",
		  PTR_ERR(link))) {
		bpf_dctcp__destroy(dctcp_skel);
		return;
	}

	do_test("bpf_dctcp");

	/* Destroy the related data structures */
	bpf_link__destroy(link);
	bpf_dctcp__destroy(dctcp_skel);
}

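The flow in detail:

In the bpf_object__open phase, libbpf looks for the SEC(".struct_ops") section and works out which btf type each struct_ops implements. Note that the btf-type here is a type in the BTF of bpf_prog.o. A "struct bpf_map" is added through bpf_object__add_map(), just like other map types. libbpf then collects (via SHT_REL) the locations of the bpf progs (functions defined with SEC("struct_ops/xyz")), which are where the func ptrs point. btf_vmlinux is not needed during the open phase.
In the bpf_object__load phase, the map fields that depend on btf_vmlinux are initialized by bpf_map__init_kern_struct_ops(). During load, libbpf also sets prog->type, prog->attach_btf_id, and prog->expected_attach_type, so the program's attributes do not depend on its section name.

For now, matching a bpf_prog btf-type to a btf_vmlinux btf-type is simple: member name match + btf-kind match + size match.

If any of these checks fails, libbpf rejects the object. The only supported target today is "struct tcp_congestion_ops", most of whose members are function pointers.

The member order of the bpf_prog btf-type may differ from that of the btf_vmlinux btf-type.

All obj->maps are then created as usual (in bpf_object__create_maps()). Once the maps are created and the program attributes are set, libbpf goes on to load all the programs.
bpf_map__attach_struct_ops() registers a struct_ops map with the kernel subsystem.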

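The complete patch set that adds TCP congestion control support is linked here.

3. Skeleton Code Generation

The skeleton is generated as follows (the commit that introduced skeletons is linked here; you can search for the relevant keywords there).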

$ cd tools/bpf/runqslower && make V=1  # the full sequence of commands follows
$ .output/sbin/bpftool btf dump file /sys/kernel/btf/vmlinux format c > .output/vmlinux.h
$ clang -g -O2 -target bpf -I.output -I.output -I/home/vagrant/linux-5.8/tools/lib -I/home/vagrant/linux-5.8/tools/include/uapi \
	 -c runqslower.bpf.c -o .output/runqslower.bpf.o
$ llvm-strip -g .output/runqslower.bpf.o
$ .output/sbin/bpftool gen skeleton .output/runqslower.bpf.o > .output/runqslower.skel.h
$ cc -g -Wall -I.output -I.output -I/home/vagrant/linux-5.8/tools/lib -I/home/vagrant/linux-5.8/tools/include/uapi -c runqslower.c -o .output/runqslower.o
$ cc -g -Wall .output/runqslower.o .output/libbpf.a -lelf -lz -o .output/runqslower

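4. How bpf struct_ops Works Under the Hood

The sections above covered the main flow in user space and in the kernel. If the kernel internals do not interest you, feel free to skip this part.

4.1 The ops structure in the kernel (bpf_tcp_ca.c)

As Figure 1 shows, the kernel has to provide the underlying support for this feature. The operations object (ops structure) that corresponds to the kernel structure is bpf_tcp_congestion_ops, defined in net/ipv4/bpf_tcp_ca.c (implementation linked here):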

/* Avoid sparse warning.  It is only used in bpf_struct_ops.c. */
extern struct bpf_struct_ops bpf_tcp_congestion_ops;

struct bpf_struct_ops bpf_tcp_congestion_ops = {
	.verifier_ops = &bpf_tcp_ca_verifier_ops,
	.reg = bpf_tcp_ca_reg,
	.unreg = bpf_tcp_ca_unreg,
	.check_member = bpf_tcp_ca_check_member,
	.init_member = bpf_tcp_ca_init_member,
	.init = bpf_tcp_ca_init,
	.name = "tcp_congestion_ops",
};

The members of bpf_tcp_congestion_ops are:

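init() is called first to perform any global setup that is needed;
init_member() validates the exact value of any field in the structure. In particular, it can validate non-function fields (for example, the flags field);
check_member() decides whether a particular member of the target structure may be implemented in BPF;
reg() actually registers the replacement structure once the checks have passed; for congestion control, it installs the tcp_congestion_ops structure (with the appropriate BPF trampolines for its function pointers) where the network stack will use it;
unreg() undoes the registration;
verifier_ops holds a set of functions used to verify that each replacement function is safe to run.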
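The ordering in this list can be sketched as a user-space toy. This is only a model under loud assumptions: ops_model, install, and the toy_* callbacks are invented names, the callback signatures are simplified, and the kernel does not drive these callbacks from a single function as install does here.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, much-reduced stand-in for struct bpf_struct_ops. */
struct ops_model {
	int (*init)(void);
	int (*check_member)(int member);
	int (*init_member)(int member);
	int (*reg)(void);
	void (*unreg)(void);
};

static char call_log[64];	/* records the order of callbacks */

static void log_call(char c)
{
	size_t n = strlen(call_log);

	if (n + 1 < sizeof(call_log)) {
		call_log[n] = c;
		call_log[n + 1] = '\0';
	}
}

static int toy_init(void) { log_call('I'); return 0; }
/* Toy policy: member 2 may not be implemented in BPF. */
static int toy_check_member(int m) { log_call('c'); return m == 2 ? -1 : 0; }
static int toy_init_member(int m) { (void)m; log_call('m'); return 0; }
static int toy_reg(void) { log_call('R'); return 0; }
static void toy_unreg(void) { log_call('U'); }

static const struct ops_model toy_ops = {
	.init = toy_init,
	.check_member = toy_check_member,
	.init_member = toy_init_member,
	.reg = toy_reg,
	.unreg = toy_unreg,
};

static const int good_members[] = { 0, 1 };
static const int bad_members[] = { 0, 2 };

/* Validate each member, then register; stop at the first failure. */
static int install(const struct ops_model *ops, const int *members, int n)
{
	for (int i = 0; i < n; i++) {
		if (ops->check_member(members[i]) ||
		    ops->init_member(members[i]))
			return -1;
	}
	return ops->reg();
}
```

In this model reg() runs only after every member has passed check_member() and init_member(); a rejected member aborts the installation before anything is registered.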

The verifier_ops structure is used mainly by the verifier. The functions it defines are:

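static const struct bpf_verifier_ops bpf_tcp_ca_verifier_ops = {
	/* Function prototypes used by the verifier to decide which in-kernel
	 * helper functions the eBPF program may BPF_CALL; after verification
	 * the imm32 field of the BPF_CALL instruction is adjusted. */
	.get_func_proto		= bpf_tcp_ca_get_func_proto,
	/* Whether an access is valid */
	.is_valid_access	= bpf_tcp_ca_is_valid_access,
	/* Whether a struct described in BTF may be accessed */
	.btf_struct_access	= bpf_tcp_ca_btf_struct_access,
};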

Finally, one line is added to kernel/bpf/bpf_struct_ops_types.h:

BPF_STRUCT_OPS_TYPE(tcp_congestion_ops)

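4.2 Definition and management of kernel ops objects (bpf_struct_ops.c)

bpf_struct_ops.c includes "bpf_struct_ops_types.h" four times, redefining the BPF_STRUCT_OPS_TYPE macro each time. This defines the structure of the map value, manages the array of ops objects defined by the kernel, and makes sure the BTF definitions of the corresponding data structures are emitted.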

/* bpf_struct_ops_##_name (e.g. bpf_struct_ops_tcp_congestion_ops) is
 * the map's value exposed to the userspace and its btf-type-id is
 * stored at the map->btf_vmlinux_value_type_id.
 *
 */
#define BPF_STRUCT_OPS_TYPE(_name)				\
extern struct bpf_struct_ops bpf_##_name;			\
								\
struct bpf_struct_ops_##_name {						\
	BPF_STRUCT_OPS_COMMON_VALUE;				\
	struct _name data ____cacheline_aligned_in_smp;		\
};
#include "bpf_struct_ops_types.h"   // ① generates struct bpf_struct_ops_tcp_congestion_ops
#undef BPF_STRUCT_OPS_TYPE

enum {
#define BPF_STRUCT_OPS_TYPE(_name) BPF_STRUCT_OPS_TYPE_##_name,
#include "bpf_struct_ops_types.h"  // ② generates one enum member
#undef BPF_STRUCT_OPS_TYPE
	__NR_BPF_STRUCT_OPS_TYPE,
};

static struct bpf_struct_ops * const bpf_struct_ops[] = {
#define BPF_STRUCT_OPS_TYPE(_name)				\
	[BPF_STRUCT_OPS_TYPE_##_name] = &bpf_##_name,
#include "bpf_struct_ops_types.h"    // ③ generates one array element:
                                     // [BPF_STRUCT_OPS_TYPE_tcp_congestion_ops] = &bpf_tcp_congestion_ops
#undef BPF_STRUCT_OPS_TYPE
};

void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log)
{
	/* Ensure BTF type is emitted for "struct bpf_struct_ops_##_name" */
#define BPF_STRUCT_OPS_TYPE(_name) BTF_TYPE_EMIT(struct bpf_struct_ops_##_name);
#include "bpf_struct_ops_types.h"  // ④ BTF_TYPE_EMIT(struct bpf_struct_ops_tcp_congestion_ops): BTF registration
#undef BPF_STRUCT_OPS_TYPE
	// ...
}

After full macro expansion, the relevant structures look like this:

extern struct bpf_struct_ops bpf_tcp_congestion_ops;

struct bpf_struct_ops_tcp_congestion_ops {		// ① stored as the value object of the map
	refcount_t refcnt;
	enum bpf_struct_ops_state state;
	struct tcp_congestion_ops data ____cacheline_aligned_in_smp;	// the kernel's tcp_congestion_ops object
};

enum {
	BPF_STRUCT_OPS_TYPE_tcp_congestion_ops,  // ② index declaration
	__NR_BPF_STRUCT_OPS_TYPE,
};

static struct bpf_struct_ops * const bpf_struct_ops[] = { // ③ the array variable
	// bpf_tcp_congestion_ops is the variable defined in net/ipv4/bpf_tcp_ca.c
	// (it holds the function pointers for the various operations)
	[BPF_STRUCT_OPS_TYPE_tcp_congestion_ops] = &bpf_tcp_congestion_ops,
};

void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log)
{
	// #define BTF_TYPE_EMIT(type) ((void)(type *)0)
	((void)(struct bpf_struct_ops_tcp_congestion_ops *)0); // ④ BTF type registration
	// ...
}

At this point the kernel has generated and registered the ops structure types and set up the array that manages the ops objects.
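The repeated-include trick can be tried outside the kernel. The sketch below is a variant, not the kernel code: a single self-contained file cannot include a header several times, so it uses an X-macro list (OPS_TYPES) in place of bpf_struct_ops_types.h. All names in it (OPS_TYPES, ops_model, second_ops, ops_table) are invented for illustration.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for bpf_struct_ops_types.h: one X(...) entry per ops type.
 * second_ops is a made-up entry so the list has more than one element. */
#define OPS_TYPES(X) \
	X(tcp_congestion_ops) \
	X(second_ops)

struct ops_model {
	const char *name;
};

/* Pass 1: define one object per entry. */
#define X(_name) static struct ops_model model_##_name = { #_name };
OPS_TYPES(X)
#undef X

/* Pass 2: one enum constant per entry, plus a trailing count. */
enum {
#define X(_name) OPS_TYPE_##_name,
	OPS_TYPES(X)
#undef X
	NR_OPS_TYPES,
};

/* Pass 3: an array indexed by the enum, pointing at each object. */
static struct ops_model * const ops_table[] = {
#define X(_name) [OPS_TYPE_##_name] = &model_##_name,
	OPS_TYPES(X)
#undef X
};
```

Adding a new type then takes a single extra X(...) line; the object, the enum constant, and the table slot all follow from it, which is the same property the one-line addition to bpf_struct_ops_types.h relies on.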

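4.3 Initializing the kernel structure value in the map

This step initializes the kernel-side variable from the variable defined in the BPF program. It is implemented in libbpf's bpf_map__init_kern_struct_ops function, whose prototype is: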

/* Init the map's fields that depend on kern_btf */
static int bpf_map__init_kern_struct_ops(struct bpf_map *map,
					 const struct btf *btf,
					 const struct btf *kern_btf)

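The main steps for initializing the map structure from the BPF program's structure are:

While loading the BPF program, libbpf identifies the BPF_MAP_TYPE_STRUCT_OPS map objects that it defines;
It takes the tcp_congestion_ops type from the struct ops variable (e.g. struct tcp_congestion_ops dctcp) and stores the resulting tname/type/type_id in the st_ops object of the map structure;
Using the tname set in the previous step, it looks up the type_id and type of the kernel's tcp_congestion_ops in the kernel BTF, and also obtains the vtype_id and vtype of the map value type bpf_struct_ops_tcp_congestion_ops;
At this point it has the variable defined in the BPF program together with its bpf_prog btf-type tcp_congestion_ops, the kernel type tcp_congestion_ops, and the map value type bpf_struct_ops_tcp_congestion_ops;
Next, following the BTF matching rules (name, call parameters, return type, and so on), it initializes a bpf_struct_ops_tcp_congestion_ops value from the bpf_prog btf-type variable and stores the initialized kernel-side value in st_ops->kern_vdata. (bpf_map__attach_struct_ops() later uses st_ops->kern_vdata to update the map value; the map key is always 0, meaning the first slot.)
Finally, it sets btf_vmlinux_value_type_id in the map structure to vtype_id for later checks and use: map->btf_vmlinux_value_type_id = kern_vtype_id.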
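The layout translation in these steps can be sketched in plain C. This is a toy under stated assumptions: prog_ops, kern_ops, kern_value, member_map, and init_kern_value are invented names, and the member table is written by hand, whereas libbpf derives the equivalent information from BTF by member name. In libbpf, function-pointer members are not copied like this; they are later tied to the loaded BPF programs. The sketch copies ordinary C function pointers only so that it stays runnable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*fn_t)(void);

/* Program-side struct: a subset of members, in its own order. */
struct prog_ops {
	char name[16];
	fn_t ssthresh;
	fn_t init;
};

/* Kernel-side struct: different order, plus a member the program omits. */
struct kern_ops {
	fn_t init;
	fn_t release;
	fn_t ssthresh;
	char name[16];
};

/* Toy map value: bookkeeping header followed by the kernel struct. */
struct kern_value {
	int refcnt;
	int state;
	struct kern_ops data;
};

struct member_map {
	size_t prog_off, kern_off, size;
};

#define MEMBER(m) { offsetof(struct prog_ops, m), \
		    offsetof(struct kern_ops, m), \
		    sizeof(((struct prog_ops *)0)->m) }

static const struct member_map members[] = {
	MEMBER(name), MEMBER(ssthresh), MEMBER(init),
};

/* Zero the value, then copy each member to its kernel-side offset. */
static void init_kern_value(struct kern_value *kv, const struct prog_ops *p)
{
	*kv = (struct kern_value){0};
	for (size_t i = 0; i < sizeof(members) / sizeof(members[0]); i++)
		memcpy((char *)&kv->data + members[i].kern_off,
		       (const char *)p + members[i].prog_off,
		       members[i].size);
}

static int toy_init(void) { return 1; }
static int toy_ssthresh(void) { return 2; }

static const struct prog_ops toy_prog = {
	.name = "toy_cc",
	.ssthresh = toy_ssthresh,
	.init = toy_init,
};

static struct kern_value toy_value;	/* plays the role of kern_vdata */
```

After init_kern_value(&toy_value, &toy_prog), every member the program declared sits at its kernel-side offset, and the member it left out (release) stays zero.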

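5. Summary

On the surface, congestion control is an important new feature for BPF, but the underlying implementation is far more general than that one feature. Richer uses will likely follow before long, and defining kernel functionality in software promises a rather different experience.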

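Specifically, this infrastructure can let a BPF program replace any "operations structure" of function pointers in the kernel, and a large part of the kernel code is invoked through at least one such structure. If all or part of the security_hook_heads structure could be replaced, security policy could be modified in arbitrary ways, along the lines of the KRSI proposal. Replacing a file_operations structure could rewire any part of the kernel's I/O subsystem.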

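Nobody has proposed doing these things yet, but the capability is bound to attract interested users. One day almost all kernel functionality may be hookable or replaceable by BPF code loaded from user space. In such a world users would have great power to change how their systems run, but what we think of as "the Linux kernel" would become more amorphous, since much functionality could depend on which code has been loaded from user space.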
