Introduction to DPDK: Architecture and Principles

Linux network stack performance has become increasingly relevant over the past few years. This is perfectly understandable: the amount of data that can be transferred over a network and the corresponding workload has been growing not by the day, but by the hour.

Not even the widespread use of 10 GE network cards has resolved this issue; this is because a lot of bottlenecks that prevent packets from being quickly processed are found in the Linux kernel itself.

There have been many attempts to circumvent these bottlenecks with techniques called kernel bypasses (a short description can be found here). They let you process packets without involving the Linux network stack and make it so that the application running in the user space communicates directly with the networking device. We’d like to discuss one of these solutions, the Intel DPDK (Data Plane Development Kit), in today’s article.

A lot of posts about the DPDK have already been published, in a variety of languages. Although many of these are fairly informative, they don’t answer the most important questions: How does the DPDK process packets, and what route does a packet take from the network device to the user?

Finding the answers to these questions was not easy; since we couldn’t find everything we needed in the official documentation, we had to look through a myriad of additional materials and thoroughly review their sources. But first things first: before talking about the DPDK and the issues it can help resolve, we should review how packets are processed in Linux.

Processing Packets in Linux: Main Stages

When a network card first receives a packet, it sends it to a receive queue, or RX. From there, it gets copied to the main memory via the DMA (Direct Memory Access) mechanism.

Afterwards, the system needs to be notified of the new packet and pass the data onto a specially allocated buffer (Linux allocates these buffers for every packet). To do this, Linux uses an interrupt mechanism: an interrupt is generated several times when a new packet enters the system. The packet then needs to be transferred to the user space.

One bottleneck is already apparent: as more packets have to be processed, more resources are consumed, which negatively affects the overall system performance.

As we’ve already said, these packets are saved to specially allocated buffers - more specifically, the sk_buff struct. This struct is allocated for each packet and becomes free when a packet enters the user space. This operation consumes a lot of bus cycles (i.e. cycles that transfer data from the CPU to the main memory).

There is another problem with the sk_buff struct: the Linux network stack was originally designed to be compatible with as many protocols as possible. As such, metadata for all of these protocols is included in the sk_buff struct, but that’s simply not necessary for processing specific packets. Because of this overly complicated struct, processing is slower than it could be.

Another factor that negatively affects performance is context switching. When an application in the user space needs to send or receive a packet, it executes a system call. The context is switched to kernel mode and then back to user mode. This consumes a significant amount of system resources.

To solve some of these problems, all Linux kernels since version 2.6 have included NAPI (New API), which combines interrupts with polling. Let’s take a quick look at how this works.

The network card first works in interrupt mode, but as soon as a packet enters the network interface, it registers itself in a poll queue and disables the interrupt. The system periodically checks the queue for new devices and gathers packets for further processing. As soon as the packets are processed, the card will be deleted from the queue and interrupts are again enabled.
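
To make this concrete, here is a schematic sketch of a NAPI-style poll callback, written in C against the kernel’s NAPI interface. It is an illustration rather than code from any real driver; my_hw_fetch_skb() and my_hw_irq_enable() are hypothetical stand-ins for hardware-specific logic.

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Hypothetical hardware-specific helpers, for illustration only. */
    struct sk_buff *my_hw_fetch_skb(struct napi_struct *napi);
    void my_hw_irq_enable(struct napi_struct *napi);

    /* Schematic NAPI poll callback, following the pattern real drivers use. */
    static int my_napi_poll(struct napi_struct *napi, int budget)
    {
        int work_done = 0;

        /* Drain up to `budget` packets from the device queue. */
        while (work_done < budget) {
            struct sk_buff *skb = my_hw_fetch_skb(napi);
            if (!skb)
                break;
            netif_receive_skb(skb); /* hand the packet to the network stack */
            work_done++;
        }

        /* Queue drained: leave polling mode and re-enable interrupts. */
        if (work_done < budget) {
            napi_complete(napi);
            my_hw_irq_enable(napi);
        }
        return work_done;
    }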

This has been just a cursory description of how packets are processed. A more detailed look at this process can be found in an article series from Private Internet Access. However, even a quick glance is enough to see the problems slowing down packet processing. In the next section, we’ll describe how these problems are solved using DPDK.

DPDK: How It Works

General Features

Let’s look at the following illustration:

[Figure: packet processing with the standard Linux network stack (left) versus with DPDK (right)]

On the left you see the traditional way packets are processed, and on the right - with DPDK. As we can see, the kernel in the second example doesn’t step in at all: interactions with the network card are performed via special drivers and libraries.

If you’ve already read about DPDK or have ever used it, then you know that the ports receiving incoming traffic on network cards need to be unbound from Linux (the kernel driver). This is done using the dpdk_nic_bind (or dpdk-devbind) command, or ./dpdk_nic_bind.py in earlier versions.

How are ports then managed by DPDK? Every driver in Linux has bind and unbind files. That includes network card drivers:

ls /sys/bus/pci/drivers/ixgbe
bind  module  new_id  remove_id  uevent  unbind

To unbind a device from a driver, the device’s bus number needs to be written to the unbind file. Similarly, to bind a device to another driver, the bus number needs to be written to its bind file. More detailed information about this can be found here.
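
As a minimal illustration, the same unbind operation can be done from C; the driver name (ixgbe) and the PCI address used here are arbitrary examples, and a simple shell redirect achieves the same thing.

    /* Minimal sketch: unbind a NIC from its kernel driver via sysfs.
     * The driver name and PCI address are example values. */
    #include <stdio.h>

    int main(void)
    {
        const char *pci_addr = "0000:01:00.0"; /* hypothetical device address */
        FILE *f = fopen("/sys/bus/pci/drivers/ixgbe/unbind", "w");

        if (f == NULL) {
            perror("fopen");
            return 1;
        }
        /* Writing the device's address to `unbind` detaches it. */
        fprintf(f, "%s\n", pci_addr);
        fclose(f);
        return 0;
    }
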
The DPDK installation instructions tell us that our ports need to be managed by the vfio_pci, igb_uio, or uio_pci_generic driver. (We won’t be getting into details here, but we suggest interested readers look at the following articles on kernel.org: 1 and 2.)

These drivers make it possible to interact with devices in the user space. Of course they include a kernel module, but that’s just to initialize devices and assign the PCI interface.

All further communication between the application and the network card is organized by the DPDK poll mode driver (PMD). DPDK has poll mode drivers for all supported network cards and virtual devices.

The DPDK also requires hugepages be configured. This is required for allocating large chunks of memory and writing data to them. We can say that hugepages does the same job in DPDK that DMA does in traditional packet processing.
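
Here is a rough sketch of reserving and mounting hugepages programmatically; the page count and mount point are arbitrary examples, the mount point must already exist, and the DPDK setup guide does the equivalent with echo and mount shell commands.

    /* Minimal sketch: reserve 2 MB hugepages and mount hugetlbfs.
     * Page count and mount point are arbitrary example values. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Ask the kernel for 1024 hugepages of 2 MB each (2 GB total). */
        FILE *f = fopen("/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages", "w");
        if (f == NULL) {
            perror("fopen");
            return 1;
        }
        fprintf(f, "%d\n", 1024);
        fclose(f);

        /* Mount a hugetlbfs instance so DPDK can map the pages. */
        if (mount("nodev", "/mnt/huge", "hugetlbfs", 0, NULL) != 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }
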
We’ll discuss all of its nuances in more detail, but for now, let’s go over the main stages of packet processing with the DPDK (a minimal receive-loop sketch follows the list):

  1. Incoming packets go to a ring buffer (we’ll look at its setup in the next section). The application periodically checks this buffer for new packets.
  2. If the buffer contains new packet descriptors, the application will refer to the DPDK packet buffers in the specially allocated memory pool using the pointers in the packet descriptors.
  3. If the ring buffer does not contain any packets, the application will queue the network devices under the DPDK and then refer to the ring again.
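
Below is a minimal sketch of that receive loop, assuming the EAL and the port have already been initialized; the port and queue numbers are arbitrary, and a real application adds actual packet processing and error handling.

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    /* Minimal sketch of the receive loop described above. */
    static void rx_loop(uint16_t port_id)
    {
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
            /* Steps 1 and 3: poll the ring for new packet descriptors. */
            uint16_t nb_rx = rte_eth_rx_burst(port_id, 0 /* queue */,
                                              bufs, BURST_SIZE);

            /* Step 2: each descriptor points at an mbuf in the memory pool. */
            for (uint16_t i = 0; i < nb_rx; i++) {
                /* ... process the packet here ... */
                rte_pktmbuf_free(bufs[i]); /* return the buffer to its pool */
            }
        }
    }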

Let’s take a closer look at the DPDK’s internal structure.

EAL: Environment Abstraction Layer

The EAL, or Environment Abstraction Layer, is the main concept behind the DPDK.

The EAL is a set of programming tools that let the DPDK work in a specific hardware environment and under a specific operating system. In the official DPDK repository, libraries and drivers that are part of the EAL are saved in the rte_eal directory.

Drivers and libraries for Linux and the BSD system are saved in this directory. It also contains a set of header files for various processor architectures: ARM, x86, TILE64, and PPC64.

We access software in the EAL when we compile the DPDK from the source code:

make config T=x86_64-native-linuxapp-gcc


One can guess that this command will compile the DPDK for Linux on the x86_64 architecture.

The EAL is what binds the DPDK to applications. All of the applications that use the DPDK (see here for examples) must include the EAL’s header files.

The most commonly used of these include (a minimal initialization sketch follows the list):

  • rte_lcore.h — manages processor cores and sockets;
  • rte_memory.h — manages memory;
  • rte_pci.h — provides an interface for accessing the PCI address space;
  • rte_debug.h — provides trace and debug functions (logging, dump_stack, and more);
  • rte_interrupts.h — processes interrupts.
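
For example, a minimal DPDK program starts by initializing the EAL. This sketch only assumes the standard rte_eal_init() entry point, which parses the EAL’s command-line options and sets up hugepage memory, cores, and devices.

    /* Minimal sketch: every DPDK application initializes the EAL first. */
    #include <rte_eal.h>
    #include <rte_debug.h>

    int main(int argc, char **argv)
    {
        /* Parses EAL options (core mask, hugepage settings, etc.). */
        int ret = rte_eal_init(argc, argv);
        if (ret < 0)
            rte_panic("Cannot init EAL\n");

        /* rte_eal_init() consumed `ret` arguments; the remainder
         * belong to the application itself. */
        argc -= ret;
        argv += ret;

        /* ... application logic goes here ... */
        return 0;
    }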

More details on this structure and EAL functions can be found in the official documentation.

Managing Queues: rte_ring
As we’ve already said, packets received by the network card are sent to a ring buffer, which acts as a receiving queue. Packets received in the DPDK are also sent to a queue implemented with the rte_ring library. The library’s description below comes from information gathered from the developer’s guide and comments in the source code.

The rte_ring was developed from the FreeBSD ring buffer. If you look at the source code, you’ll see the following comment: Derived from FreeBSD’s bufring.c.

The queue is a lockless ring buffer built on the FIFO (First In, First Out) principle. The ring buffer is a table of pointers for objects that can be saved to memory. The pointers fall into four categories: prod_tail, prod_head, cons_tail, and cons_head.

Prod is short for producer, and cons for consumer. The producer is the process that writes data to the buffer at a given time, and the consumer is the process that removes data from the buffer.

The tail is where writing takes place on the ring buffer. The place the buffer is read from at a given time is called the head.

The idea behind the process for adding and removing elements from the queue is as follows: when a new object is added to the queue, the ring->prod_tail indicator should end up pointing to the location where ring->prod_head previously pointed to.

This is just a brief description; a more detailed account of how the ring buffer scripts work can be found in the developer’s manual on the DPDK site.
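
Here is a minimal sketch of the library’s basic calls; the ring name and size are arbitrary examples (the size must be a power of two).

    /* Minimal sketch of rte_ring usage; name and size are arbitrary. */
    #include <stdint.h>
    #include <rte_ring.h>
    #include <rte_lcore.h>

    static void ring_example(void)
    {
        /* Single-producer/single-consumer flags skip the CAS logic
         * needed when several cores share one ring. */
        struct rte_ring *r = rte_ring_create("example_ring", 1024,
                                             rte_socket_id(),
                                             RING_F_SP_ENQ | RING_F_SC_DEQ);
        void *obj;

        if (r == NULL)
            return;

        obj = (void *)(uintptr_t)42;  /* any pointer-sized object */
        rte_ring_enqueue(r, obj);     /* producer: advances prod_head/prod_tail */
        rte_ring_dequeue(r, &obj);    /* consumer: advances cons_head/cons_tail */
    }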

This approach has a number of advantages. Firstly, data is written to the buffer extremely quickly. Secondly, when adding or removing a large number of objects from the queue, cache misses occur much less frequently since pointers are saved in a table.

The drawback to DPDK’s ring buffer is its fixed size, which cannot be increased on the fly. Additionally, much more memory is spent working with the ring structure than with a linked queue, since the buffer always uses the maximum number of pointers.

Memory Management: rte_mempool
We mentioned above that DPDK requires hugepages. The installation instructions recommend creating 2MB hugepages.

These pages are combined in segments, which are then divided into zones. Objects that are created by applications or other libraries, like queues and packet buffers, are placed in these zones.

These objects include memory pools, which are created by the rte_mempool library. These are fixed size object pools that use rte_ring for storing free objects and can be identified by a unique name.

Memory alignment techniques can be implemented to improve performance.

Even though access to free objects is built on a lockless ring buffer, consumption of system resources may still be very high. Since multiple cores have access to the ring, a compare-and-set (CAS) operation usually has to be performed each time it is accessed.

To prevent this from becoming a bottleneck, every core is given an additional local cache in the memory pool. Each core can take free objects from its own cache without locking; only when the cache is full or entirely empty does the memory pool exchange data with the shared ring buffer. This gives each core fast access to frequently used objects.
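
The following sketch creates a packet-buffer pool with such a per-core cache; all names and sizes are arbitrary example values.

    /* Minimal sketch: create an mbuf pool backed by rte_mempool with a
     * per-core cache of free objects; sizes are arbitrary examples. */
    #include <rte_mbuf.h>
    #include <rte_mempool.h>
    #include <rte_lcore.h>

    #define NUM_MBUFS  8191  /* the docs suggest 2^n - 1 for best performance */
    #define CACHE_SIZE 250   /* per-core local cache of free objects */

    static struct rte_mempool *create_pool(void)
    {
        return rte_pktmbuf_pool_create("mbuf_pool", NUM_MBUFS, CACHE_SIZE,
                                       0, /* no per-mbuf private area */
                                       RTE_MBUF_DEFAULT_BUF_SIZE,
                                       rte_socket_id());
    }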

Buffer Management: rte_mbuf
In the Linux network stack, all network packets are represented by the sk_buff data structure. In DPDK, this is done using the rte_mbuf struct, which is described in the rte_mbuf.h header file.

The buffer management approach in DPDK is reminiscent of the approach used in FreeBSD: instead of one big sk_buff struct, there are many smaller rte_mbuf buffers. The buffers are created before the DPDK application is launched and are saved in memory pools (memory is allocated by rte_mempool).

In addition to its own packet data, each buffer contains metadata (message type, length, data segment starting address). The buffer also contains a pointer to the next buffer. This is needed when handling packets with large amounts of data; in such cases, buffers can be chained together (as is done in FreeBSD; more detailed information about this can be found here).
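
A short sketch of reading this metadata and walking a chained packet; the function is illustrative, with field names taken from rte_mbuf.h.

    /* Minimal sketch: walk a multi-segment (chained) rte_mbuf. */
    #include <rte_mbuf.h>

    static uint32_t count_payload_bytes(struct rte_mbuf *m)
    {
        uint32_t total = 0;
        struct rte_mbuf *seg;

        /* pkt_len covers the whole packet; data_len covers one segment. */
        for (seg = m; seg != NULL; seg = seg->next) {
            /* rte_pktmbuf_mtod() yields this segment's data start. */
            const char *data = rte_pktmbuf_mtod(seg, const char *);
            (void)data;  /* a real application would parse it here */
            total += seg->data_len;
        }
        return total;  /* equals m->pkt_len for a well-formed chain */
    }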

Other Libraries: General Overview
In previous sections, we talked about the most basic DPDK libraries. There are a great many other libraries, but one article isn’t enough to describe them all. Thus, we’ll limit ourselves to just a brief overview.

With the LPM library, DPDK runs the Longest Prefix Match (LPM) algorithm, which can be used to forward packets based on their IPv4 address. The primary function of this library is to add and delete IP addresses as well as to search for new addresses using the LPM algorithm.

A similar function can be performed for IPv6 addresses using the LPM6 library.
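
As a sketch of the IPv4 variant, the basic calls look roughly like this. Note that the rte_lpm_create() signature has changed across DPDK releases, so this assumes the config-struct variant, and all sizes and routes are arbitrary examples.

    /* Minimal sketch of rte_lpm usage; sizes and routes are arbitrary,
     * and the config-struct variant of rte_lpm_create() is assumed. */
    #include <stdint.h>
    #include <rte_lpm.h>
    #include <rte_lcore.h>

    static void lpm_example(void)
    {
        struct rte_lpm_config config = {
            .max_rules = 1024,
            .number_tbl8s = 256,
        };
        struct rte_lpm *lpm = rte_lpm_create("example_lpm",
                                             rte_socket_id(), &config);
        /* Addresses are handled in host byte order. */
        uint32_t route = (10u << 24);              /* 10.0.0.0 */
        uint32_t addr  = (10u << 24) | 0x010203u;  /* 10.1.2.3 */
        uint32_t next_hop;

        if (lpm == NULL)
            return;

        rte_lpm_add(lpm, route, 8, 1);  /* 10.0.0.0/8 -> next hop 1 */
        if (rte_lpm_lookup(lpm, addr, &next_hop) == 0) {
            /* next_hop is now 1: the longest matching prefix won */
        }
    }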

Other libraries offer similar functionality based on hash functions. With rte_hash, you can search through a large record set using a unique key. This library can be used for classifying and distributing packets, for example.
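
A sketch of creating and probing such a table; the key layout and sizes are arbitrary examples, and rte_jhash is one of the hash functions the library accepts.

    /* Minimal sketch of rte_hash usage; key layout is an arbitrary example. */
    #include <rte_hash.h>
    #include <rte_jhash.h>
    #include <rte_lcore.h>

    struct flow_key {
        uint32_t src_ip;
        uint32_t dst_ip;
        uint16_t src_port;
        uint16_t dst_port;
    };

    static void hash_example(void)
    {
        struct rte_hash_parameters params = {
            .name      = "flow_table",
            .entries   = 1024,
            .key_len   = sizeof(struct flow_key),
            .hash_func = rte_jhash,
            .socket_id = rte_socket_id(),
        };
        struct rte_hash *h = rte_hash_create(&params);
        struct flow_key key = { 0 };

        if (h == NULL)
            return;

        rte_hash_add_key(h, &key);            /* classify: insert the flow */
        if (rte_hash_lookup(h, &key) >= 0) {  /* non-negative: key found */
            /* distribute the packet based on the returned index */
        }
    }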

The rte_timer library lets you execute functions asynchronously. The timer can run once or periodically.
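
A sketch of the basic calls; the period and callback are arbitrary examples, and some core must call rte_timer_manage() regularly or no callback ever fires.

    /* Minimal sketch of rte_timer usage; period and callback are examples. */
    #include <rte_timer.h>
    #include <rte_cycles.h>
    #include <rte_lcore.h>

    static void on_tick(struct rte_timer *tim, void *arg)
    {
        /* Called roughly once per second on the chosen core. */
        (void)tim;
        (void)arg;
    }

    static void timer_example(void)
    {
        static struct rte_timer tim;

        rte_timer_subsystem_init();
        rte_timer_init(&tim);

        /* PERIODICAL: re-arm automatically every rte_get_timer_hz() ticks. */
        rte_timer_reset(&tim, rte_get_timer_hz(), PERIODICAL,
                        rte_lcore_id(), on_tick, NULL);

        for (;;)
            rte_timer_manage();  /* runs expired timers on this core */
    }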

Conclusion
In this article we went over the internal structure and operating principles of DPDK. This is far from comprehensive, though; the subject is too complex and extensive to fit in one article. So sit tight: we will continue this topic in a future article, where we’ll discuss the practical aspects of using DPDK.

We’d be happy to answer your questions in the comments below. And if you’ve had any experience using DPDK, we’d love to hear your thoughts and impressions.

For anyone interested in learning more, please visit the following links:

  • http://dpdk.org/doc/guides/prog_guide/ — a detailed (but confusing in some places) description of all the DPDK libraries;
  • https://www.net.in.tum.de/fileadmin/TUM/NET/NET-2014-08-1/NET-2014-08-1_15.pdf — a brief overview of DPDK’s capabilities and comparison with other frameworks (netmap and PF_RING);
  • http://www.slideshare.net/garyachy/dpdk-44585840 — an introductory presentation to DPDK for beginners;
  • http://www.it-sobytie.ru/system/attachments/files/000/001/102/original/LinuxPiter-DPDK-2015.pdf — a presentation explaining the DPDK structure.


Andrej Yemelianov 24 November 2016 Tags: DPDK, linux, network, network stacks, packet processing

References

1. https://blog.selectel.com/introduction-dpdk-architecture-principles/

2. http://it-events.com/system/attachments/files/000/001/102/original/LinuxPiter-DPDK-2015.pdf
