NIC Performance Analysis: Notes on Reading the Intel 8257x Chip Datasheets

阿新 • Posted: 2019-01-12

Reposted from: http://blog.csdn.net/dog250/article/details/6313854

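Intro: After writing "OpenVPN Performance", I went on to read about hardware solutions, hoping to pick up ideas that would improve my own design. Because it was convenient at work and relevant to what I actually do, I read the feature-description parts of the datasheets for Intel's 82571EB, 82574L, 82575 and related Ethernet controllers. I do not plan to write a driver myself, so I skipped the register and memory details; more to the point, I do not believe a driver of mine would be more efficient than the one written by Intel's engineers. The reading left a strong impression, and what follows are my excerpts and notes.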

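I. Overheads of network applications
1. Protocol stack processing: the cost of the OS implementation of each layer of the layered model
2. Memory copies: copies between the NIC, kernel memory and user-space memory
3. System-level overhead: interrupts, cache management, system calls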
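II. Partial solutions
1. DMA: targets memory copies. It has to lock the bus, and while it does so the CPU is as good as unplugged (the exact words of the book "Intel Microprocessors"). If the NIC's data-moving chip is less efficient than the CPU, then apart from saving one relay through the CPU, performance actually drops.
2. Hard-interrupt load balancing: targets the utilization of multiple CPUs (ideally every CPU takes part in protocol stack processing). It brings softirq load balancing with it, because the softirq is raised on the CPU that took the hard interrupt. Done badly it has no effect: one CPU is heavily loaded while the others sit idle. Done well, it reorders the packets of order-sensitive protocols. This is the conflict between CPU parallelism and the serial nature of TCP.
3. TOE (TCP offload engine): targets the high latency of PCI bus accesses. It is clearly a poor fit for traffic dominated by small packets, because doing heavy TCP processing on the NIC lowers the send and receive rate.
4. NAPI: targets overly frequent interrupts, whose constant switching ends up hurting the warmth of the processor cache. It depends, however, on DMA, on the NIC chip and on network conditions.

Many partial solutions end up curing deafness by causing muteness, because each one treats a single factor in isolation and ignores how the factors affect one another. What is needed is a global scheme in which the factors cooperate to reach the best overall result. Intel proposed one such I/O acceleration scheme: IOAT.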
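As a rough illustration of the NAPI idea in item 4 of the list above, the following toy model shows how one interrupt can start a polling run that drains a whole burst. It is a sketch under stated assumptions, not kernel code: the class `NapiLikeDevice`, its fields and the `budget` value are all hypothetical names invented for this example.

```python
from collections import deque


class NapiLikeDevice:
    """Toy receive path: one interrupt starts a polling run, as in NAPI."""

    def __init__(self, budget=4):
        self.rx_ring = deque()       # packets the NIC has placed in memory
        self.irq_enabled = True      # may the NIC raise an RX interrupt?
        self.poll_scheduled = False  # is the device waiting to be polled?
        self.budget = budget         # max packets handled per poll() call
        self.interrupts = 0          # interrupts actually taken by the CPU
        self.delivered = []          # packets handed to the protocol stack

    def packet_arrives(self, pkt):
        self.rx_ring.append(pkt)
        if self.irq_enabled:
            # Hard interrupt handler: mask further RX interrupts and
            # schedule polling instead of processing the packet here.
            self.interrupts += 1
            self.irq_enabled = False
            self.poll_scheduled = True

    def poll(self):
        """Softirq side: handle up to `budget` packets with interrupts masked."""
        if not self.poll_scheduled:
            return 0
        done = 0
        while self.rx_ring and done < self.budget:
            self.delivered.append(self.rx_ring.popleft())
            done += 1
        if not self.rx_ring:
            # Ring drained: leave polling mode and unmask the interrupt.
            self.poll_scheduled = False
            self.irq_enabled = True
        return done


dev = NapiLikeDevice(budget=4)
for i in range(10):      # a burst of 10 packets arrives back to back
    dev.packet_arrives(i)
while dev.poll():        # keep polling until nothing is left to do
    pass
print(dev.interrupts, len(dev.delivered))  # prints: 1 10
```

In this model the burst of ten packets costs a single interrupt. With sparse traffic every packet would find the interrupt unmasked again, so the scheme falls back to one interrupt per packet, which is one way to read the remark that NAPI depends on network conditions.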
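III. A global solution: IOAT
1. QuickData (QD): QD can run DMA operations asynchronously without blocking the CPU, whereas plain DMA occupies the bus exclusively in a synchronous fashion. Compared with DMA, either the chip is faster than the CPU or the processor does not need to be suspended at all.
2. RSS with a multi-queue NIC: different flows are spread over different CPUs, while any one flow always stays on the same CPU. This flow-based load balancing avoids the conflict between TCP's ordering requirement and CPU parallelism, and it also helps keep caches warm.
3. DCA: data is placed directly into the cache of the CPU that RSS has bound the flow to, which helps that CPU's later accesses to the data while it runs the protocol stack. Together with interrupt moderation it addresses cache warmth. DCA builds on PCIe. Because PCIe is based on protocol messages, it is easy to wrap up a complex message and hand it to the front-end decoder (the root complex), which only has to parse the message to trigger a prefetch by the processor. One can imagine that designing a mechanism this complex on a parallel bus would be impractical: think of all the timing that would have to be considered!
4. Interrupt moderation: while packets keep arriving, interrupts are accumulated. To keep any packet in the accumulated batch from being delayed too long, a programmable hardware timer with a carefully balanced value raises the interrupt when it expires. To give special service to certain special packets, a mechanism for interrupting immediately has to be retained as well. With several interrupt schemes available, CPU switching can be minimized and cache warmth maximized.
5. Header splitting: the packet header and the packet body are copied to separate DMA or QD memory regions, so the CPU can use the header directly instead of parsing it out of the packet. Combined with DCA, the header is loaded into the cache of the bound CPU, which makes that CPU's subsequent processing more efficient.

Items 1 to 5 are the components of IOAT. RSS and DCA can also be done in software. For RSS, on receiving a packet the software parses it, assigns it to a flow, and dispatches the softirq to the CPU bound to that flow. Parsing flows in software is not very efficient, though, so a hardware solution is preferable; Intel NICs from the 82575 onward implement RSS. The interrupt moderation idea is similar to the change made to the tun driver in "OpenVPN Performance: OpenVPN's First Bottleneck Is the tun Driver".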
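The software variant of RSS described above (parse the packet, assign it to a flow, hand it to the CPU bound to that flow) can be sketched as follows. This is only an illustration under assumptions: the helper names are hypothetical, CRC32 stands in for whatever flow hash a real stack would use, and a "CPU" is just an index. The sketch contrasts per-flow steering with naive per-packet round-robin.

```python
import struct
import zlib
from collections import defaultdict


def flow_key(src_ip, dst_ip, src_port, dst_port):
    """Pack the TCP/IPv4 4-tuple into bytes so it can be hashed."""
    return struct.pack("!4s4sHH", bytes(src_ip), bytes(dst_ip), src_port, dst_port)


def steer_by_flow(packets, n_cpus):
    """Pick a CPU per packet from a hash of its flow (CRC32 is a stand-in)."""
    return [zlib.crc32(flow_key(*pkt[:4])) % n_cpus for pkt in packets]


def steer_round_robin(packets, n_cpus):
    """Pick a CPU per packet in arrival order, ignoring flows."""
    return [i % n_cpus for i, _ in enumerate(packets)]


def cpus_per_flow(packets, assignment):
    """Map each flow to the set of CPUs that handled its packets."""
    seen = defaultdict(set)
    for pkt, cpu in zip(packets, assignment):
        seen[pkt[:4]].add(cpu)
    return dict(seen)


# Two interleaved flows, four packets each: (src_ip, dst_ip, sport, dport, seq)
FLOW_A = ((10, 0, 0, 1), (10, 0, 0, 2), 40000, 80)
FLOW_B = ((10, 0, 0, 3), (10, 0, 0, 2), 40001, 80)
packets = []
for seq in range(4):
    packets.append(FLOW_A + (seq,))
    packets.append(FLOW_B + (seq,))

by_flow = cpus_per_flow(packets, steer_by_flow(packets, 4))
by_rr = cpus_per_flow(packets, steer_round_robin(packets, 4))
print({k: sorted(v) for k, v in by_flow.items()})  # one CPU per flow
print({k: sorted(v) for k, v in by_rr.items()})    # each flow spread over CPUs
```

With per-flow steering every flow touches exactly one CPU, so its segments cannot overtake each other between CPUs. With round-robin each flow is spread over several CPUs, which is where reordering can appear. The per-packet parsing and hashing done here in software is the cost the article argues is better paid by the NIC.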
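IV. Hardware implementation:
The 82571/82572 have no RSS. The 82574 has RSS (2 queues) but relies on the driver to do the CPU binding. In the 82575, RSS (4 queues) does the CPU binding in hardware. I spent a lot of time reading the datasheets of the Intel 8257x NIC chips and got a lot out of them.
1. An excerpt from the 82575 datasheet describing interrupts: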
The 82575 implements interrupt moderation to reduce the number of interrupts software processes. The moderation scheme is based on EITR. Each time an interrupt event happens, the corresponding bit in the EICR is activated. However, an interrupt message is not sent out on the PCIe* interface until the EITR counter assigned to that EICR bit has counted down to zero. As soon as the interrupt is issued, the EITR counter is reloaded with its initial value and the process repeats again. The interrupt flow should follow as shown in Figure 20.
[Figure 20 of the 82575 datasheet: interrupt flow; image not available]
For cases where the 82575 is connected to a small number of clients, it is desirable to initiate the interrupt as soon as possible with minimum latency. For these cases, when the EITR counter counts down to zero and no interrupt event has happened, then the EITR counter is not reset but stays at zero. Therefore, the next interrupt event triggers an immediate interrupt (see Figure 21 and Figure 22).
[Figures 21 and 22 of the 82575 datasheet; images not available]
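The moderation behaviour described in the two datasheet excerpts above can be modelled in a few lines. This is a simplified sketch, not a register-accurate model: time is counted in abstract ticks, a single flag stands in for the EICR bit, and the function name `moderate` and its parameters are hypothetical.

```python
def moderate(event_times, interval, end_time):
    """Return the ticks at which an interrupt message is sent.

    event_times: ticks at which an interrupt cause occurs (EICR bit gets set)
    interval:    initial value loaded into the countdown (the EITR setting)
    end_time:    number of ticks to simulate
    """
    events = set(event_times)
    counter = interval   # the counter starts from its initial value
    pending = False      # stands in for the EICR bit
    fired = []
    for t in range(end_time):
        if t in events:
            pending = True
        if counter > 0:
            counter -= 1          # at zero the counter simply stays at zero
        if counter == 0 and pending:
            fired.append(t)       # interrupt message goes out
            pending = False
            counter = interval    # reload and start over
    return fired


# Heavy load: one event per tick for 10 ticks, interval of 5 ticks.
print(moderate(range(10), interval=5, end_time=20))   # [4, 9]
# Light load: the counter has already expired when the event arrives.
print(moderate([20, 22], interval=5, end_time=30))    # [20, 25]
```

Under load, ten events collapse into two interrupts. Under light load, the first event after an idle period is signalled in the same tick, which matches the low-latency case the datasheet describes; the follow-up event at tick 22 has to wait for the reloaded counter.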
2. The 82575 datasheet has a good description of header split:
This feature consists of splitting or replicating a packet’s header to a different memory space. This helps the host to fetch headers only for processing: headers are replicated through a regular snoop transaction, in order to be processed by the host processor. It is recommended to perform this transaction with the DCA feature enabled. The packet (header + payload) is stored in memory through an optional non-snoop transaction.
[Figure from the 82575 datasheet illustrating header split; image not available]
If that is not entirely clear, Microsoft's MSDN also describes it; the screenshot is below:
[MSDN screenshot; image not available]
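To make the header split idea above concrete, here is a minimal sketch of the boundary computation a splitter has to perform for an untagged Ethernet/IPv4/TCP frame. It models only the split into two buffers, not the snoop or non-snoop PCIe transactions, and the function name `split_header` is hypothetical; the hardware's actual rules for choosing the split point are not taken from the datasheet.

```python
import struct

ETH_HLEN = 14  # untagged Ethernet II header length in bytes


def split_header(frame: bytes):
    """Split an Ethernet/IPv4/TCP frame into (headers, payload)."""
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    if ethertype != 0x0800:
        raise ValueError("only IPv4 is handled in this sketch")
    ihl = (frame[ETH_HLEN] & 0x0F) * 4         # IPv4 header length in bytes
    if frame[ETH_HLEN + 9] != 6:
        raise ValueError("only TCP is handled in this sketch")
    tcp_off = ETH_HLEN + ihl
    tcp_hlen = (frame[tcp_off + 12] >> 4) * 4  # TCP data offset in bytes
    split_at = tcp_off + tcp_hlen
    return frame[:split_at], frame[split_at:]


# Build a small test frame: 14-byte Ethernet + 20-byte IPv4 + 20-byte TCP.
payload = b"hello"
eth = bytes(6) + bytes(6) + struct.pack("!H", 0x0800)
ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40 + len(payload), 0, 0, 64, 6, 0,
                 bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
tcp = struct.pack("!HHIIBBHHH", 40000, 80, 1, 0, 5 << 4, 0x18, 65535, 0, 0)
frame = eth + ip + tcp + payload

headers, body = split_header(frame)
print(len(headers), body)  # 54 b'hello'
```

With the headers in their own small buffer, the stack can read just those 54 bytes, which is the part that benefits from being pushed into the CPU cache by DCA.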
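3. The 82574L already integrates RSS, but in a rather incomplete form: a lot still has to be done in software, which in practice means in the driver. Either way, this step helped establish RSS's place within IOAT. The 82574L datasheet describes it as follows: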
The 82574L provides two hardware receive queues and filters each receive packet into one of the queues based on criteria that is described as follows. Classification of packets into receive queues have several uses, such as:
Receive Side Scaling (RSS)
Generic multiple receive queues
Priority receive queues.
...
When multiple receive queues are enabled, the 82574 provides software with several types of information. Some are requirements of Microsoft* RSS while others are provided for software device driver assistance:
A Dword result of the Microsoft* RSS hash function, to be used by the stack for flow classification, is written into the receive packet descriptor (required by Microsoft* RSS).
A 4-bit RSS Type field conveys the hash function used for the specific packet (required by Microsoft* RSS).
A mechanism to issue an interrupt to one or more CPUs (section 7.1.11). # note this sentence
Figure 33 shows the process of classifying a packet into a receive queue:
1. The receive packet is parsed into the header fields used by the hash operation (such as, IP addresses, TCP port, etc.).
2. A hash calculation is performed. The 82574L supports a single hash function, as defined by Microsoft* RSS. The 82574L therefore does not indicate to the software device driver which hash function is used. The 32-bit result is fed into the packet receive descriptor.
3. The seven LSBs of the hash result are used as an index into a 128-entries redirection table. Each entry in the table contains a 5-bit CPU number. This 5-bit value is fed into the packet receive descriptor. In addition, each entry provides a single bit queue number, which denotes that queue into which the packet is routed.
[Figure 33 of the 82574L datasheet: classifying a packet into a receive queue; image not available]
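The three classification steps quoted above (parse, hash, index a 128-entry redirection table with the seven LSBs) can be sketched like this. The hash follows the published Toeplitz construction used by Microsoft RSS; everything else is an assumption made for illustration: `EXAMPLE_KEY` is an arbitrary 40-byte key rather than a real device default, the table is filled round-robin, and the function names are hypothetical.

```python
import struct

EXAMPLE_KEY = bytes(range(1, 41))  # arbitrary 40-byte key, illustration only


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Toeplitz hash: XOR a sliding 32-bit window of the key per set input bit."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                shift = key_bits - 32 - (i * 8 + b)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result


def build_redirection_table(n_cpus, n_queues=2, size=128):
    """Each entry holds (5-bit CPU number, queue number); filled round-robin."""
    assert n_cpus <= 32, "a CPU number must fit in 5 bits"
    return [(i % n_cpus, i % n_queues) for i in range(size)]


def classify(key, table, src_ip, dst_ip, src_port, dst_port):
    """Return (32-bit hash, CPU number, queue number) for a TCP/IPv4 packet."""
    data = struct.pack("!4s4sHH", bytes(src_ip), bytes(dst_ip), src_port, dst_port)
    h = toeplitz_hash(key, data)
    cpu, queue = table[h & 0x7F]   # seven LSBs index the 128-entry table
    return h, cpu, queue


table = build_redirection_table(n_cpus=4)
h, cpu, queue = classify(EXAMPLE_KEY, table, (10, 0, 0, 1), (10, 0, 0, 2), 40000, 80)
print(hex(h), cpu, queue)
```

Because the result depends only on the header fields and the key, every packet of a flow lands in the same table entry and therefore in the same queue. Rebalancing load means rewriting table entries, not changing the hash.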
In the 82575, RSS is complete: automatic binding across multiple CPUs has been added. Here is a figure first; it makes the basic principle clear:
[Figure: RSS in the 82575; image not available]
The 82575 datasheet puts it this way:
It is assumed that each queue is associated with a specific processor, even when there are more processors than queues.
And here is a clearer figure:
[Figure; image not available]
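4. The 82571EB is where the speed-up began, although that generation is still traditional patching without any major move. A lot was added to TOE, so that many computations can be offloaded from the CPU.
5. For performance testing I looked at various materials, but since I have settled on Intel's solution I may as well use Intel's material. First, two overhead-analysis charts: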
[Two overhead-analysis charts; images not available]
Next come the IOAT results. No data is given, but the material shows which pain points IOAT goes after:
[Figure: the areas IOAT targets; image not available]
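Conclusion:
1. The ways of communicating with peripherals are:
1.1. Interrupts
1.2. Polling
1.3. DMA
These, however, are textbook methods and, like CISC versus RISC, rather dogmatic. In practice, interrupts and polling have been combined into NAPI, and Intel has combined DMA with multiple CPUs to form QuickData. At bottom, IOAT is NAPI plus QuickData plus some other mechanisms. We can see that IOAT often overturns the layered network model: how can a physical NIC handle TCP flows? To the guardians of traditional dogma that is unforgivable.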
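2. In awe of Intel
Back in school, when I built my PC, I used an Intel Pentium 4 CPU that cost me over 1,000. My roommates built theirs later and all chose AMD. Their 3200+ actually performed better than mine, was not picky about power supplies, and was much cheaper. Damn, I felt cheated at the time. My roommates all believed AMD was about to overtake Intel for good, and with AMD ahead in 64-bit technology I gradually came to think Intel was finished too. After working for quite a while, though, I found that Intel does not just make microprocessors; it is mainly in the business of setting standards, in both the PC and the server world. I have to admit that Intel's solutions are sometimes simply good. A big company can afford to keep formidable people, so it never does things by halves.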
