
[SPDK/NVMe存儲技術分析]005 - DPDK概述


註:之所以要中英文對照翻譯下面這篇文章,是因為SPDK嚴重依賴於DPDK的實現。

Introduction to DPDK: Architecture and Principles
DPDK概論:體系結構與實現原理


Linux network stack performance has become increasingly relevant over the past few years. This is perfectly understandable: the amount of data that can be transferred over a network and the corresponding workload has been growing not by the day, but by the hour.
這幾年以來,Linux網絡棧的性能變得越來越重要。這很好理解:需要通過網絡傳輸的數據量以及相應的工作負載,不是按天、而是按小時在增長。

Not even the widespread use of 10 GE network cards has resolved this issue; this is because a lot of bottlenecks that prevent packets from being quickly processed are found in the Linux kernel itself.
即便廣泛使用10GbE網卡也解決不了這一性能問題(用重慶話說就是,10GbE的網卡mang起兒子家整,然並卵),因為在Linux內核中,阻止數據包被快速地處理掉的瓶頸實在是太多啦。

There have been many attempts to circumvent these bottlenecks with techniques called kernel bypasses (a short description can be found here). They let you process packets without involving the Linux network stack and make it so that the application running in the user space communicates directly with networking device. We’d like to discuss one of these solutions, the Intel DPDK (Data Plane Development Kit), in today’s article.
嘗試繞過這些瓶頸的技術有很多,統稱為kernel bypass(簡短的描述戳這裏)。kernel bypass技術允許在不經過Linux網絡棧的情況下處理數據包,讓運行在用戶空間中的應用程序直接與網絡設備打交道。在本文中,我們將討論眾多kernel bypass解決方案中的一種,那就是Intel的DPDK(數據平面開發套件)。

A lot of posts have already been published about the DPDK and in a variety of languages. Although many of these are fairly informative, they don’t answer the most important questions: How does the DPDK process packets and what route does the packet take from the network device to the user?
用各種語言撰寫的有關DPDK的文章已經發表了很多。雖然其中不少內容相當翔實,但是並沒有回答兩個最重要的問題。問題一:DPDK是如何處理數據包的?問題二:數據包從網絡設備到用戶程序,走的是一條什麽樣的路徑?



Finding the answers to these questions was not easy; since we couldn’t find everything we needed in the official documentation, we had to look through a myriad of additional materials and thoroughly review their sources. But first thing’s first: before talking about the DPDK and the issues it can help resolve, we should review how packets are processed in Linux.
找到上面兩個問題的答案並不是一件容易的事情。由於我們無法在官方文檔中找到所需要的所有東西,我們不得不查閱大量的額外資料,並徹底審查這些資料的來源。但是首要的事情是:在談論DPDK以及它能幫助解決的問題之前,我們應該先回顧一下數據包在Linux中是如何被處理的。

Processing Packets in Linux: Main Stages | Linux中數據包處理的幾個主要階段

When a network card first receives a packet, it sends it to a receive queue, or RX. From there, it gets copied to the main memory via the DMA (Direct Memory Access) mechanism.
網卡收到數據包後,首先將其送入接收隊列(RX);隨後,數據包通過DMA(直接內存訪問)機制被復制到主內存中。

Afterwards, the system needs to be notified of the new packet and pass the data onto a specially allocated buffer (Linux allocates these buffers for every packet). To do this, Linux uses an interrupt mechanism: an interrupt is generated several times when a new packet enters the system. The packet then needs to be transferred to the user space.
接下來,系統需要得到通知:有新的數據包到來了,然後系統把數據傳遞到一個專門分配的緩沖區中去(Linux為每一個數據包都分配這樣的緩沖區)。為了做到這一點,Linux使用了中斷機制:當一個新的數據包進入系統時,會多次產生中斷。之後,數據包還需要被傳遞到用戶空間中去。

One bottleneck is already apparent: as more packets have to be processed, more resources are consumed, which negatively affects the overall system performance.
在這裏,存在著一個很明顯的瓶頸:伴隨著更多的數據包需要被處理,更多的資源將被消耗掉,這無疑對整個系統的性能將產生負面的影響。

As we've already said, these packets are saved to specially allocated buffers - more specifically, the sk_buff struct. This struct is allocated for each packet and becomes free when a packet enters the user space. This operation consumes a lot of bus cycles (i.e. cycles that transfer data from the CPU to the main memory).
我們在前面已經說過,這些數據包被保存在專門分配的緩沖區中——更具體地說就是sk_buff結構體。系統給每一個數據包都分配一個這樣的結構體,一旦數據包到達用戶空間,該結構體就被系統釋放掉。這種操作消耗大量的總線周期(bus cycle,即把數據從CPU挪到內存的周期)。

There is another problem with the sk_buff struct: the Linux network stack was originally designed to be compatible with as many protocols as possible. As such, metadata for all of these protocols is included in the sk_buff struct, but that’s simply not necessary for processing specific packets. Because of this overly complicated struct, processing is slower than it could be.
與sk_buff struct密切相關的另一個問題是:設計Linux網絡協議棧的初衷是盡可能地兼容更多的協議。因此,所有協議的元數據都包含在sk_buff struct中,但是在處理某個特定的數據包時,這些(與該數據包無關的協議)元數據根本用不上。正是由於這個結構體過於復雜,處理速度也就比本來能達到的要慢。

Another factor that negatively affects performance is context switching. When an application in the user space needs to send or receive a packet, it executes a system call. The context is switched to kernel mode and then back to user mode. This consumes a significant amount of system resources.
對性能產生負面影響的另外一個因素就是上下文切換。當用戶空間的應用程序需要發送或接收一個數據包的時候,需要執行一次系統調用。上下文先切換到內核態,(系統調用在內核裏的活幹完後)再切換回用戶態。這無疑消耗了大量的系統資源。

To solve some of these problems, all Linux kernels since version 2.6 have included NAPI (New API), which combines interrupts with requests. Let’s take a quick look at how this works.
為了解決其中的一些問題,Linux內核從2.6版本開始都包含了NAPI(New API),它把中斷與(輪詢)請求結合了起來。接下來我們快速地看一看這是如何工作的。

The network card first works in interrupt mode, but as soon as a packet enters the network interface, it registers itself in a poll queue and disables the interrupt. The system periodically checks the queue for new devices and gathers packets for further processing. As soon as the packets are processed, the card will be deleted from the queue and interrupts are again enabled.
網卡起初工作在中斷模式下。但是,一旦有數據包進入網絡接口,網卡就會把自己註冊到一個輪詢隊列中,並禁用中斷。系統周期性地檢查這個隊列中是否有新的設備,並收集數據包以便做進一步的處理。一旦數據包處理完畢,系統就把對應的網卡從輪詢隊列中刪除,並重新啟用它的中斷(即讓網卡恢復到中斷模式下工作)。
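下面是一段示意性的內核側C代碼草圖,用來說明上述NAPI模式:中斷處理函數裏屏蔽RX中斷並把網卡交給輪詢隊列,輪詢函數處理完後再重新啟用中斷。其中以mydrv_開頭的名字都是假設的佔位符,並非真實驅動;napi_schedule()、napi_complete_done()等則是內核提供的真實接口。

#include <linux/netdevice.h>
#include <linux/interrupt.h>

struct mydrv_priv {
	struct napi_struct napi;   /* 通過netif_napi_add()註冊,並關聯mydrv_poll() */
};

/* 輪詢函數:每次最多處理budget個數據包 */
static int mydrv_poll(struct napi_struct *napi, int budget)
{
	int done = 0;

	/* done = mydrv_clean_rx_ring(napi, budget);  驅動相關的收包邏輯(假設的佔位符) */

	if (done < budget) {
		/* 隊列已清空:退出輪詢模式,重新啟用RX中斷 */
		napi_complete_done(napi, done);
		/* mydrv_enable_rx_irq(...);  假設的佔位符 */
	}
	return done;
}

/* RX中斷處理函數:屏蔽後續RX中斷,改由內核輪詢 */
static irqreturn_t mydrv_rx_irq(int irq, void *data)
{
	struct mydrv_priv *priv = data;

	/* mydrv_disable_rx_irq(priv);  假設的佔位符 */
	napi_schedule(&priv->napi);
	return IRQ_HANDLED;
}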

This has been just a cursory description of how packets are processed. A more detailed look at this process can be found in an article series from Private Internet Access. However, even a quick glance is enough to see the problems slowing down packet processing. In the next section, we’ll describe how these problems are solved using DPDK.
這只是對數據包處理過程的粗略描述。有關這一過程的更詳細的介紹,請參見Private Internet Access的系列文章。然而,即便只是粗略一瞥,也足以讓我們看到那些拖慢數據包處理速度的問題。在下一節中,我們將描述這些問題在使用DPDK之後是如何被解決的。

DPDK: How It Works | DPDK 是如何工作的



General Features | 一般特性

Let's look at the following illustration:
讓我們來看看下面的插圖:

[插圖:左側為傳統的數據包處理路徑(經由內核網絡棧),右側為使用DPDK之後的數據包處理路徑(經由用戶態驅動與庫)]

On the left you see the traditional way packets are processed, and on the right - with DPDK. As we can see, the kernel in the second example doesn’t step in at all: interactions with the network card are performed via special drivers and libraries.
如圖所示,在左邊的是傳統的數據包處理方式,在右邊的則是使用了DPDK之後的數據包處理方式。正如我們看到的一樣,右邊的例子中,內核根本不需要介入,與網卡的交互是通過特殊的驅動和庫函數來進行的。

If you've already read about DPDK or have ever used it, then you know that the ports receiving incoming traffic on network cards need to be unbound from Linux (the kernel driver). This is done using the dpdk_nic_bind (or dpdk-devbind) command, or ./dpdk_nic_bind.py in earlier versions.
如果你已經讀過有關DPDK的資料或者使用過DPDK,那麽你肯定知道:網卡上用於接收流量的網口需要先從Linux(內核驅動)上解除綁定(松綁)。這一步用dpdk_nic_bind(或dpdk-devbind)命令來完成,早期的版本中則使用./dpdk_nic_bind.py。

How are ports then managed by DPDK? Every driver in Linux has bind and unbind files. That includes network card drivers:
DPDK是如何管理網口的?每一個Linux內核驅動都有bind和unbind文件。 當然包括網卡驅動:

ls /sys/bus/pci/drivers/ixgbe
bind  module  new_id  remove_id  uevent  unbind

To unbind a device from a driver, the device's bus number needs to be written to the unbind file. Similarly, to bind a device to another driver, the bus number needs to be written to its bind file. More detailed information about this can be found here.
從內核驅動上給一個設備松綁,需要把設備的bus號寫入unbind文件。類似地,將設備綁定到另外一個驅動上,同樣需要將bus號寫入bind文件。更多詳細信息請參見這裏。
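作為補充,下面是一段把PCI地址寫入unbind文件的示意性C代碼,等價於在shell中執行 echo 0000:01:00.0 > /sys/bus/pci/drivers/ixgbe/unbind(其中的PCI地址0000:01:00.0只是舉例,實際操作需要root權限):

#include <stdio.h>

int main(void)
{
	/* 把設備的PCI地址寫入ixgbe驅動的unbind文件,即可解除綁定 */
	FILE *f = fopen("/sys/bus/pci/drivers/ixgbe/unbind", "w");
	if (f == NULL) {
		perror("fopen");
		return 1;
	}
	fputs("0000:01:00.0", f);   /* 示例用的PCI地址 */
	fclose(f);
	return 0;
}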

The DPDK installation instructions tell us that our ports need to be managed by the vfio_pci, igb_uio, or uio_pci_generic driver. (We won't be getting into details here, but we suggest interested readers look at the following articles on kernel.org: 1 and 2.)
DPDK的安裝指南告訴我們,這些網口需要交由vfio_pci、igb_uio或uio_pci_generic驅動來管理。(細節這裏就不談了,建議有興趣的讀者閱讀kernel.org上的文章:1和2。)

These drivers make it possible to interact with devices in the user space. Of course they include a kernel module, but that's just to initialize devices and assign the PCI interface.
有了這些驅動程序,就可以在用戶空間與設備進行交互。當然,它們包含了一個內核模塊,但是該內核模塊只負責設備初始化和分配PCI接口。

All further communication between the application and network card is organized by the DPDK poll mode driver (PMD). DPDK has poll mode drivers for all supported network cards and virtual devices.
在應用程序與網卡之間的所有後續通信,都是由DPDK的輪詢模式驅動(PMD)負責組織的。DPDK為所有它支持的網卡和虛擬設備都提供了輪詢模式驅動(PMD)。

The DPDK also requires hugepages be configured. This is required for allocating large chunks of memory and writing data to them. We can say that hugepages does the same job in DPDK that DMA does in traditional packet processing.
大內存頁(huge pages)的配置對DPDK來說是必須的,因為DPDK需要分配大塊內存並向其中寫入數據。可以這麽說,大內存頁在DPDK中所起的作用,相當於DMA在傳統數據包處理中所起的作用。
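作為補充,下面用一小段示意性C代碼展示一個2MB匿名大內存頁映射大致的樣子。它假設系統已經通過/proc/sys/vm/nr_hugepages預留了大內存頁;需要說明的是,DPDK實際上是由EAL通過hugetlbfs來管理大內存頁的,並不是像這樣直接調用mmap()。

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HUGE_2MB (2UL * 1024 * 1024)

int main(void)
{
	/* 申請一個匿名的2MB大內存頁映射 */
	void *p = mmap(NULL, HUGE_2MB, PROT_READ | PROT_WRITE,
	               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");   /* 若系統沒有預留大內存頁,這裏會失敗 */
		return 1;
	}
	printf("2MB hugepage mapped at %p\n", p);
	munmap(p, HUGE_2MB);
	return 0;
}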

We'll discuss all of its nuances in more detail, but for now, let's go over the main stages of packet processing with the DPDK:
我們稍後還會討論其中更多的細節。但是現在,先讓我們瀏覽一下使用DPDK做數據包處理的幾個主要階段(列表之後附有一段示意性的輪詢代碼草圖):

  1. Incoming packets go to a ring buffer (we'll look at its setup in the next section). The application periodically checks this buffer for new packets.
    傳入的數據包被放到環形緩沖區(ring buffer)中去。應用程序周期性檢查這個緩沖區(ring buffer)以獲取新的數據包。
  2. If the buffer contains new packet descriptors, the application will refer to the DPDK packet buffers in the specially allocated memory pool using the pointers in the packet descriptors.
    如果ring buffer包含有新的數據包描述符,應用程序就使用數據包描述符所包含的指針去做處理,該指針指向的是DPDK數據包緩沖區,該緩沖區位於專門的內存池中。
  3. If the ring buffer does not contain any packets, the application will queue the network devices under the DPDK and then refer to the ring again.
    如果ring buffer中沒有任何數據包,應用程序就會去查詢(輪詢)由DPDK接管的網絡設備,然後再次查看ring。
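下面是上述流程的一個極簡輪詢收包草圖(C語言),使用rte_ethdev提供的rte_eth_rx_burst()接口。這裏假設端口0、隊列0已經通過rte_eth_dev_configure()、rte_eth_rx_queue_setup()和rte_eth_dev_start()完成了初始化;BURST_SIZE、rx_loop等名稱只是示意:

#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void rx_loop(uint16_t port_id)
{
	struct rte_mbuf *bufs[BURST_SIZE];

	for (;;) {
		/* 從端口port_id的0號接收隊列中一次最多取出BURST_SIZE個數據包 */
		uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);

		for (uint16_t i = 0; i < nb_rx; i++) {
			/* 通過rte_pktmbuf_mtod(bufs[i], void *)可以取得包數據,在此處理 */
			rte_pktmbuf_free(bufs[i]);   /* 處理完畢,把mbuf歸還給內存池 */
		}
	}
}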

Let's take a closer look at the DPDK's internal structure.
接下來我們將近距離地深入到DPDK的內部結構中去。

EAL: Environment Abstraction Layer | EAL:環境抽象層

The EAL, or Environment Abstraction Layer, is the main concept behind the DPDK.
環境抽象層(EAL),是位於DPDK背後的主要概念。

The EAL is a set of programming tools that let the DPDK work in a specific hardware environment and under a specific operating system. In the official DPDK repository, libraries and drivers that are part of the EAL are saved in the rte_eal directory.
EAL是一套編程工具,讓DPDK能夠在特定的硬件環境和特定的操作系統下工作。在官方的DPDK代碼倉庫中,屬於EAL的庫和驅動保存在rte_eal目錄下。

Drivers and libraries for Linux and the BSD system are saved in this directory. It also contains a set of header files for various processor architectures: ARM, x86, TILE64, and PPC64.
為Linux和BSD系統編寫的庫和驅動就保存在這個目錄中。該目錄下同時還包含一組針對各種處理器架構(ARM、x86、TILE64和PPC64)的頭文件。

We access software in the EAL when we compile the DPDK from the source code:
在從源代碼編譯DPDK的時候,就會訪問到EAL中的軟件:

make config T=x86_64-native-linuxapp-gcc

The most commonly used of these include:
其中最常用的包括(列表之後附有一段EAL初始化的示意代碼):

  • rte_lcore.h -- manages processor cores and sockets; 管理處理器核和socket;
  • rte_memory.h -- manages memory; 管理內存;
  • rte_pci.h -- provides the interface access to PCI address space; 提供訪問PCI地址空間的接口;
  • rte_debug.h -- provides trace and debug functions (logging, dump_stack, and more); 提供trace和debug函數(logging、dump_stack等);
  • rte_interrupts.h -- processes interrupts. 中斷處理。
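下面是一個最小化的EAL初始化示意代碼,演示rte_eal_init()以及rte_lcore.h中rte_lcore_id()、rte_lcore_count()的基本用法(僅為示意,除了初始化和打印之外不做任何事情):

#include <stdio.h>
#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_debug.h>

int main(int argc, char **argv)
{
	/* rte_eal_init()負責解析EAL參數(core mask、大內存頁設置等),
	 * 並完成內存、PCI訪問以及lcore線程的初始化 */
	int ret = rte_eal_init(argc, argv);
	if (ret < 0)
		rte_panic("Cannot init EAL\n");

	printf("running on lcore %u, %u lcore(s) enabled\n",
	       rte_lcore_id(), rte_lcore_count());
	return 0;
}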

More details on this structure and EAL functions can be found in the official documentation.
有關EAL功能與結構的更多詳細信息請參見官方文檔。

Managing Queues: rte_ring | 隊列管理

As we've already said, packets received by the network card are sent to a ring buffer, which acts as a receiving queue. Packets received in the DPDK are also sent to a queue implemented on the rte_ring library. The library's description below comes from information gathered from the developer's guide and comments in the source code.
我們在前面已經說過,網卡接收到的數據包被發送到環形緩沖區(ring buffer),該環形緩沖區充當接收隊列的角色。DPDK接收到的數據包同樣會被送入一個用rte_ring函數庫實現的隊列中。註意:下面對該函數庫的描述來源於開發者指南和源代碼中的註釋。

The rte_ring was developed from the FreeBSD ring buffer. If you look at the source code, you'll see the following comment: Derived from FreeBSD's bufring.c.
rte_ring是在FreeBSD的ring buffer的基礎上開發而來的。如果你閱讀源代碼,就會看見如下的註釋:Derived from FreeBSD's bufring.c(源自FreeBSD的bufring.c)。

The queue is a lockless ring buffer built on the FIFO (First In, First Out) principle. The ring buffer is a table of pointers for objects that can be saved to the memory. Pointers can be divided into four categories: prod_tail, prod_head, cons_tail, cons_head.
DPDK的隊列是一個無鎖的環形緩沖區,基於FIFO(先進先出原理)構建。ring buffer本質上是一張表,表裏的每一個元素是可以保存在內存中的對象的指針。指針分為4類: prod_tail, prod_head, cons_tail, 和cons_head。

Prod is short for producer, and cons for consumer. The producer is the process that writes data to the buffer at a given time, and the consumer is the process that removes data from the buffer.
prod是producer(生產者)的縮寫,而cons是consumer(消費者)的縮寫。生產者(producer)是在給定的時間之內將數據寫入緩沖區的進程,而消費者(consumer)是從緩沖區中讀走數據的進程。

The tail is where writing takes place on the ring buffer. The place the buffer is read from at a given time is called the head.
tail(尾部)是寫入環形緩沖區的地方,而在給定的時間內從環形緩沖區讀取數據的地方稱之為head(頭部)。

The idea behind the process for adding and removing elements from the queue is as follows: when a new object is added to the queue, the ring->prod_tail indicator should end up pointing to the location where ring->prod_head previously pointed to.
給隊列添加或移除元素的過程,其背後的思路是:當一個新的對象被添加到隊列中之後,ring->prod_tail最終應當指向ring->prod_head先前所指向的位置。

This is just a brief description; a more detailed account of how the ring buffer scripts work can be found in the developer's manual on the DPDK site.
這裏只是做一個簡短的描述。有關ring buffer具體如何工作的更詳細的說明,請參見DPDK網站上的開發者手冊。
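作為補充,下面用一小段示意性C代碼展示rte_ring最基本的創建與入隊/出隊接口。demo_ring、ring_demo等名稱只是示意;創建時的count必須是2的冪,RING_F_SP_ENQ和RING_F_SC_DEQ用於選擇單生產者/單消費者的快速路徑:

#include <rte_lcore.h>
#include <rte_ring.h>

static struct rte_ring *r;

static int ring_demo(void *obj)
{
	void *out = NULL;

	/* 創建一個可容納1024個指針的無鎖環形隊列 */
	r = rte_ring_create("demo_ring", 1024, rte_socket_id(),
	                    RING_F_SP_ENQ | RING_F_SC_DEQ);
	if (r == NULL)
		return -1;

	rte_ring_enqueue(r, obj);    /* 成功返回0,隊列滿時返回-ENOBUFS */
	rte_ring_dequeue(r, &out);   /* 成功返回0,隊列空時返回-ENOENT  */
	return 0;
}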

This approach has a number of advantages. Firstly, data is written to the buffer extremely quickly. Secondly, when adding or removing a large number of objects from the queue, cache misses occur much less frequently since pointers are saved in a table.
這一方法有很多優點。首先,將數據寫入緩沖區非常快。其次,當給隊列添加大量對象或者從隊列中移除大量對象時,cache未命中發生的頻率要低得多,因為保存在表中的是對象的指針。

The drawback to DPDK's ring buffer is its fixed size, which cannot be increased on the fly. Additionally, much more memory is spent working with the ring structure than in a linked queue since the buffer always uses the maximum number of pointers.
DPDK的ring buffer的缺點是它的大小固定,不能在運行時動態擴大。另外,與鏈式隊列相比,使用ring結構要花費多得多的內存,因為ring總是要按它所能容納的指針數量的最大值來占用內存。

Memory Management: rte_mempool | 內存管理

We mentioned above that DPDK requires hugepages. The installation instructions recommend creating 2MB hugepages.
在上面我們有提到DPDK需要使用大內存頁。安裝說明建議創建2MB的大內存頁。

These pages are combined in segments, which are then divided into zones. Objects that are created by applications or other libraries, like queues and packet buffers, are placed in these zones.
這些大內存頁被合並成段(segment),段再被劃分成區(zone)。由應用程序或其他庫創建的對象(比如隊列和數據包緩沖區)就被放置在這些zone中。

These objects include memory pools, which are created by the rte_mempool library. These are fixed size object pools that use rte_ring for storing free objects and can be identified by a unique name.
這些對象就包括通過rte_mempool庫創建的內存池。這些內存池是大小固定的對象池,使用rte_ring來存放空閑對象,並且可以通過一個獨一無二的名稱來標識。

Memory alignment techniques can be implemented to improve performance.
內存對齊技術能夠被用來提高性能。

Even though access to free objects is designed on a lockless ring buffer, consumption of system resources may still be very high. As multiple cores have access to the ring, a compare-and-set (CAS) operation usually has to be performed each time it is accessed.
盡管對空閑對象的訪問是基於無鎖的ring buffer實現的,系統資源的消耗仍然可能很高。由於這個ring會被多個核訪問,每次訪問時通常都不得不執行一次CAS(compare-and-set)操作。

To prevent bottlenecking, every core is given an additional local cache in the memory pool. Using the locking mechanism, cores can fully access the free object cache. When the cache is full or entirely empty, the memory pool exchanges data with the ring buffer. This gives the core access to frequently used objects.
為了防止出現瓶頸,內存池為每一個CPU核都配備了額外的本地緩存。通過使用鎖機制,各個CPU核能夠完整地訪問空閑對象緩存。當緩存滿了或者完全空了的時候,內存池才與ring buffer交換數據。這使得CPU核能夠訪問那些被頻繁使用的對象。
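下面是一個創建mbuf內存池的示意性草圖,使用rte_pktmbuf_pool_create();其中的池名稱和各個數值(8191個mbuf、每核緩存250個對象等)僅為示意,並非官方推薦值。第三個參數cache_size對應的正是上文所說的每個核的本地緩存:

#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

static struct rte_mempool *make_mbuf_pool(void)
{
	return rte_pktmbuf_pool_create("mbuf_pool",
	                               8191,                       /* mbuf的個數          */
	                               250,                        /* 每個lcore的緩存大小 */
	                               0,                          /* 私有數據區大小      */
	                               RTE_MBUF_DEFAULT_BUF_SIZE,  /* 每個mbuf的數據空間  */
	                               rte_socket_id());
}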

Buffer Management: rte_mbuf | 緩沖區管理

In the Linux network stack, all network packets are represented by the sk_buff data structure. In DPDK, this is done using the rte_mbuf struct, which is described in the rte_mbuf.h header file.
在Linux網絡棧中,所有的網絡數據包用sk_buff結構體表示。而在DPDK中,使用rte_mbuf結構體來表示網絡數據包,該結構體的描述位於頭文件rte_mbuf.h中。

The buffer management approach in DPDK is reminiscent of the approach used in FreeBSD: instead of one big sk_buff struct, there are many smaller rte_mbuf buffers. The buffers are created before the DPDK application is launched and are saved in memory pools (memory is allocated by rte_mempool).
DPDK的緩沖區管理方法讓人聯想到FreeBSD的做法:用很多個較小的rte_mbuf緩沖區來代替一個龐大的sk_buff結構體。這些緩沖區在DPDK應用程序啟動之前就創建好了,並被保存在內存池中(內存由rte_mempool分配)。

In addition to its own packet data, each buffer contains metadata (message type, length, data segment starting address). The buffer also contains pointers for the next buffer. This is needed when handling packets with large amounts of data. In cases like these, packets can be combined (as is done in FreeBSD; more detailed information about this can be found here).
除了數據包數據本身之外,每一個緩沖區還包含元數據(消息類型、長度、數據段起始地址)。緩沖區中還包含指向下一個緩沖區的指針,在處理含有大量數據的數據包時會用到它:在這種情況下,多個緩沖區可以被鏈接合並起來(FreeBSD中也是這樣做的;更多詳細信息請參見這裏)。
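下面用一小段示意性代碼說明rte_mbuf的鏈式(多段)結構:各個段通過m->next鏈接,rte_pktmbuf_data_len()給出單個段的長度,rte_pktmbuf_pkt_len()給出整個數據包的總長度。這些函數/宏都是rte_mbuf.h裏的真實接口,count_bytes只是示意用的名字:

#include <rte_mbuf.h>

static unsigned int count_bytes(struct rte_mbuf *m)
{
	unsigned int total = 0;

	/* 沿著m->next遍歷所有數據段 */
	for (struct rte_mbuf *seg = m; seg != NULL; seg = seg->next) {
		const char *data = rte_pktmbuf_mtod(seg, const char *);
		(void)data;                            /* 本段的數據起始地址 */
		total += rte_pktmbuf_data_len(seg);    /* 本段的數據長度     */
	}
	return total;   /* 應等於rte_pktmbuf_pkt_len(m) */
}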

Other Libraries: General Overview | 鳥瞰其他庫

In previous sections, we talked about the most basic DPDK libraries. There's a great deal of other libraries, but one article isn't enough to describe them all. Thus, we'll be limiting ourselves to just a brief overview.
在前面的章節中,我們談到了最基本的DPDK庫。其實DPDK庫還有很多,用一篇文章將所有庫都描述到是不可能的。因此,我們只是做一個簡短的概述罷了。

With the LPM library, DPDK runs the Longest Prefix Match (LPM) algorithm, which can be used to forward packets based on their IPv4 address. The primary function of this library is to add and delete IP addresses as well as to search for new addresses using the LPM algorithm.
利用LPM庫,DPDK可以運行最長前綴匹配(LPM)算法,用來根據IPv4地址轉發數據包。這個庫的主要功能是添加和刪除IP地址(前綴),以及使用LPM算法查找地址。
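下面是一個示意性的LPM查找草圖。需要註意的是,rte_lpm_create()/rte_lpm_add()的具體簽名在不同DPDK版本之間有所變化(例如next_hop的類型),所以這裏只演示查找部分;route這個函數名只是示意:

#include <stdint.h>
#include <rte_lpm.h>

/* 先用rte_lpm_add(lpm, ip, depth, next_hop)添加諸如10.0.0.0/8之類的前綴 */
static uint32_t route(struct rte_lpm *lpm, uint32_t dst_ip)
{
	uint32_t next_hop = 0;

	if (rte_lpm_lookup(lpm, dst_ip, &next_hop) == 0)
		return next_hop;   /* 命中最長匹配前綴 */
	return UINT32_MAX;         /* 沒有匹配的路由   */
}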

A similar function can be performed for IPv6 addresses using the LPM6 library.
類似地,使用LPM6庫處理IPv6地址。

Other libraries offer similar functionality based on hash functions. With rte_hash, you can search through a large record set using a unique key. This library can be used for classifying and distributing packets, for example.
其他庫基於hash函數提供類似的功能。例如:使用rte_hash, 可以通過使用一個獨一無二的key來搜索大記錄集。這個庫可用來分類和分發數據包。
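下面是一個創建rte_hash哈希表的示意性草圖,以4字節的key(例如一個IPv4地址)為例;表名、表項數等參數僅為示意,rte_jhash則是DPDK自帶的哈希函數之一:

#include <stdint.h>
#include <rte_hash.h>
#include <rte_jhash.h>
#include <rte_lcore.h>

static struct rte_hash *make_flow_table(void)
{
	struct rte_hash_parameters params = {
		.name = "flow_table",
		.entries = 1024,                /* 最大表項數       */
		.key_len = sizeof(uint32_t),    /* key的長度(字節)*/
		.hash_func = rte_jhash,
		.socket_id = rte_socket_id(),
	};
	/* 之後可用rte_hash_add_key()/rte_hash_lookup()做插入與查找 */
	return rte_hash_create(&params);
}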

The rte_timer library lets you execute functions asynchronously. The timer can run once or periodically.
rte_timer庫允許執行異步函數調用。定時器可以運行一次,也可以周期性地運行。
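下面是一個示意性的周期定時器草圖:rte_timer_subsystem_init()、rte_timer_init()、rte_timer_reset()都是rte_timer庫的真實接口,on_tick等名稱只是示意;定時器回調要依靠主循環中周期性調用rte_timer_manage()來驅動:

#include <rte_timer.h>
#include <rte_cycles.h>
#include <rte_lcore.h>

static struct rte_timer tim;

static void on_tick(struct rte_timer *t, void *arg)
{
	(void)t;
	(void)arg;
	/* 周期性要做的工作放在這裏 */
}

static void setup_timer(void)
{
	rte_timer_subsystem_init();
	rte_timer_init(&tim);
	/* 在當前lcore上每秒觸發一次,PERIODICAL表示周期性運行 */
	rte_timer_reset(&tim, rte_get_timer_hz(), PERIODICAL,
	                rte_lcore_id(), on_tick, NULL);
	/* 之後需要在主循環中反復調用rte_timer_manage()以驅動回調 */
}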

Conclusion | 總結

In this article we went over the internal structure and principles of DPDK. This is far from comprehensive though; the subject is too complex and extensive to fit in one article. So sit tight, we will continue this topic in a future article, where we'll discuss the practical aspects of using DPDK.
本文我們考察了DPDK的內部構造和工作原理。但這遠遠談不上全面:這個主題過於復雜和寬泛,一篇文章根本講不完。所以請稍安勿躁,我們將在後續文章中繼續這個話題,屆時將討論使用DPDK的實踐問題。

We'd be happy to answer your questions in the comments below. And if you've had any experience using DPDK, we'd love to hear your thoughts and impressions.
我們非常樂意回答你在評論中提出的問題。如果你有任何使用DPDK的經驗,請跟我們分享你的想法與感受。

For anyone interested in learning more, please visit the following links:
如有興趣學習更多,請訪問下面的鏈接:

  • http://dpdk.org/doc/guides/prog_guide/ — a detailed (but confusing in some places) description of all the DPDK libraries;
  • https://www.net.in.tum.de/fileadmin/TUM/NET/NET-2014-08-1/NET-2014-08-1_15.pdf — a brief overview of DPDK's capabilities and comparison with other frameworks (netmap and PF_RING);
  • http://www.slideshare.net/garyachy/dpdk-44585840 — an introductory presentation to DPDK for beginners;
  • http://www.it-sobytie.ru/system/attachments/files/000/001/102/original/LinuxPiter-DPDK-2015.pdf — a presentation explaining the DPDK structure.

Andrej Yemelianov | 2016年11月24日 | Tags: DPDK, linux, network, network stacks, packet processing
