
What every programmer should know about memory (Part 2-0) (Translation)

What Every Programmer Should Know About Memory
Ulrich Drepper, Red Hat, Inc. ([email protected])
November 21, 2007

2 Commodity Hardware Today

Understanding commodity hardware is important because specialized hardware is in retreat. Scaling these days is most often achieved horizontally instead of vertically, meaning today it is more cost-effective to use many smaller, connected commodity computers instead of a few really large and exceptionally fast (and expensive) systems. This is the case because fast and inexpensive network hardware is widely available. There are still situations where the large specialized systems have their place and these systems still provide a business opportunity, but the overall market is dwarfed by the commodity hardware market. Red Hat, as of 2007, expects that for future products, the “standard building blocks” for most data centers will be a computer with up to four sockets, each filled with a quad core CPU that, in the case of Intel CPUs, will be hyper-threaded. {Hyper-threading enables a single processor core to be used for two or more concurrent executions with just a little extra hardware.}

This means the standard system in the data center will have up to 64 virtual processors. Bigger machines will be supported, but the quad socket, quad CPU core case is currently thought to be the sweet spot and most optimizations are targeted for such machines.
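
On a running system the number of virtual processors can simply be queried from the operating system. The following is a minimal sketch, assuming a Linux/glibc system where sysconf() provides the non-standard _SC_NPROCESSORS_ONLN and _SC_NPROCESSORS_CONF values; it is illustrative only.

```c
/* Minimal sketch: ask the OS how many logical ("virtual") processors it
 * exposes, i.e. sockets x cores x hardware threads as seen by software.
 * Assumes Linux/glibc, where these sysconf() names are available. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long online     = sysconf(_SC_NPROCESSORS_ONLN);  /* logical CPUs currently online */
    long configured = sysconf(_SC_NPROCESSORS_CONF);  /* logical CPUs configured */

    if (online < 0 || configured < 0) {
        perror("sysconf");
        return 1;
    }
    printf("logical (virtual) processors online: %ld, configured: %ld\n",
           online, configured);
    return 0;
}
```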

Large differences exist in the structure of commodity computers. That said, we will cover more than 90% of such hardware by concentrating on the most important differences. Note that these technical details tend to change rapidly, so the reader is advised to take the date of this writing into account.

Over the years, personal computers and smaller servers standardized on a chipset with two parts: the Northbridge and the Southbridge. Figure 2.1 shows this structure.

Figure 2.1: Structure with Northbridge and Southbridge

All CPUs (two in the previous example, but there can be more) are connected via a common bus (the Front Side Bus, FSB) to the Northbridge. The Northbridge contains, among other things, the memory controller, and its implementation determines the type of RAM chips used for the computer. Different types of RAM, such as DRAM, Rambus, and SDRAM, require different memory controllers.

To reach all other system devices, the Northbridge must communicate with the Southbridge. The Southbridge, often referred to as the I/O bridge, handles communication with devices through a variety of different buses. Today the PCI, PCI Express, SATA, and USB buses are of most importance, but PATA, IEEE 1394, serial, and parallel ports are also supported by the Southbridge. Older systems had AGP slots which were attached to the Northbridge. This was done for performance reasons related to insufficiently fast connections between the Northbridge and Southbridge. However, today the PCI-E slots are all connected to the Southbridge.

Such a system structure has a number of noteworthy consequences:

  • All data communication from one CPU to another must travel over the same bus used to communicate with the Northbridge.
  • All communication with RAM must pass through the Northbridge.
  • The RAM has only a single port. {We will not discuss multi-port RAM in this document as this type of RAM is not found in commodity hardware, at least not in places where the programmer has access to it. It can be found in specialized hardware such as network routers which depend on utmost speed.}
  • Communication between a CPU and a device attached to the Southbridge is routed through the Northbridge.

A couple of bottlenecks are immediately apparent in this design. One such bottleneck involves access to RAM for devices. In the earliest days of the PC, all communication with devices on either bridge had to pass through the CPU, negatively impacting overall system performance. To work around this problem some devices became capable of direct memory access (DMA). DMA allows devices, with the help of the Northbridge, to store and receive data in RAM directly without the intervention of the CPU (and its inherent performance cost). Today all high-performance devices attached to any of the buses can utilize DMA. While this greatly reduces the workload on the CPU, it also creates contention for the bandwidth of the Northbridge as DMA requests compete with RAM access from the CPUs. This problem, therefore, must be taken into account.

A second bottleneck involves the bus from the Northbridge to the RAM. The exact details of the bus depend on the memory types deployed. On older systems there is only one bus to all the RAM chips, so parallel access is not possible. Recent RAM types require two separate buses (or channels as they are called for DDR2, see Figure 2.8) which doubles the available bandwidth. The Northbridge interleaves memory access across the channels. More recent memory technologies (FB-DRAM, for instance) add more channels.
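
To make the idea of interleaving concrete, here is a small illustrative model of a controller policy in which consecutive cache lines alternate between two channels; the 64-byte line size, the channel count, and the mapping rule are assumptions chosen for the example, not the behaviour of any particular chipset.

```c
/* Illustrative model only: one possible way a memory controller could
 * interleave accesses across two channels so that a linear stream of
 * cache lines loads both channels equally. */
#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE_BYTES 64u
#define NUM_CHANNELS      2u     /* e.g. dual-channel DDR2 */

/* Map a physical address to a channel: consecutive cache lines
 * alternate between the channels. */
static unsigned channel_of(uint64_t phys_addr)
{
    return (unsigned)((phys_addr / CACHE_LINE_BYTES) % NUM_CHANNELS);
}

int main(void)
{
    for (uint64_t addr = 0; addr < 8 * CACHE_LINE_BYTES; addr += CACHE_LINE_BYTES)
        printf("address 0x%04llx -> channel %u\n",
               (unsigned long long)addr, channel_of(addr));
    return 0;
}
```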

With limited bandwidth available, it is important to schedule memory access in ways that minimize delays. As we will see, processors are much faster than memory and must wait to access it, despite the use of CPU caches. If multiple hyper-threads, cores, or processors access memory at the same time, the wait times for memory access are even longer. This is also true for DMA operations.
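
One way to observe this contention from user space is to let several threads stream through private buffers at the same time and watch the total time grow as the shared bus and memory controller saturate. The sketch below is such a rough experiment; it assumes Linux with POSIX threads and clock_gettime() (compile with -pthread), and the buffer size and thread count are arbitrary illustrative values.

```c
/* Sketch of memory-bandwidth contention: every thread streams through its own
 * private buffer, so any slowdown as THREADS grows comes from shared resources
 * (FSB / memory controller), not from data sharing between threads. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define THREADS   4
#define BUF_BYTES (64u * 1024u * 1024u)   /* 64 MiB per thread */

static void *stream(void *arg)
{
    unsigned char *buf = arg;
    unsigned long sum = 0;
    for (size_t i = 0; i < BUF_BYTES; i++)
        sum += buf[i];
    return (void *)sum;                   /* keep the loop from being optimized away */
}

int main(void)
{
    pthread_t tid[THREADS];
    unsigned char *buf[THREADS];
    struct timespec t0, t1;

    for (int t = 0; t < THREADS; t++) {
        buf[t] = malloc(BUF_BYTES);
        if (!buf[t])
            return 1;
        for (size_t i = 0; i < BUF_BYTES; i++)   /* fault the pages in up front */
            buf[t][i] = (unsigned char)i;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int t = 0; t < THREADS; t++)
        pthread_create(&tid[t], NULL, stream, buf[t]);
    for (int t = 0; t < THREADS; t++)
        pthread_join(tid[t], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("%d threads: %.3f s total\n", THREADS,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

    for (int t = 0; t < THREADS; t++)
        free(buf[t]);
    return 0;
}
```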

There is more to accessing memory than concurrency, however. Access patterns themselves also greatly influence the performance of the memory subsystem, especially with multiple memory channels. Refer to Section 2.2 for more details of RAM access patterns.
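
As a rough illustration of how much the pattern matters, the sketch below compares a fully sequential pass over a large buffer with a strided pass that touches one byte per assumed 64-byte cache line. The buffer size, stride, and timing method are assumptions, and a serious measurement would also have to control for compiler optimizations and warm-up effects.

```c
/* Rough microbenchmark sketch: sequential pass vs. one access per
 * (assumed) 64-byte cache line over a buffer larger than typical caches. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES (256u * 1024u * 1024u)   /* 256 MiB */
#define STRIDE    64u                      /* assumed cache-line size */

static double seconds(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    unsigned char *buf = malloc(BUF_BYTES);
    if (!buf)
        return 1;
    for (size_t i = 0; i < BUF_BYTES; i++)   /* fault every page in first */
        buf[i] = (unsigned char)i;

    volatile unsigned long sum = 0;
    struct timespec t0, t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < BUF_BYTES; i++)          /* sequential pass */
        sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (size_t i = 0; i < BUF_BYTES; i += STRIDE)  /* one access per line */
        sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("sequential: %.3f s, strided: %.3f s (sum=%lu)\n",
           seconds(t0, t1), seconds(t1, t2), (unsigned long)sum);
    free(buf);
    return 0;
}
```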

On some more expensive systems, the Northbridge does not actually contain the memory controller. Instead the Northbridge can be connected to a number of external memory controllers (in the following example, four of them).

Figure 2.2: Northbridge with External Controllers

The advantage of this architecture is that more than one memory bus exists and therefore total bandwidth increases. This design also supports more memory. Concurrent memory access patterns reduce delays by simultaneously accessing different memory banks. This is especially true when multiple processors are directly connected to the Northbridge, as in Figure 2.2. For such a design, the primary limitation is the internal bandwidth of the Northbridge, which is phenomenal for this architecture (from Intel). { For completeness it should be mentioned that such a memory controller arrangement can be used for other purposes such as “memory RAID” which is useful in combination with hotplug memory.}

Figure 2.3: Integrated Memory Controller

Using multiple external memory controllers is not the only way to increase memory bandwidth. One other increasingly popular way is to integrate memory controllers into the CPUs and attach memory to each CPU. This architecture is made popular by SMP systems based on AMD’s Opteron processor. Figure 2.3 shows such a system. Intel will have support for the Common System Interface (CSI) starting with the Nehalem processors; this is basically the same approach: an integrated memory controller with the possibility of local memory for each processor.

With an architecture like this there are as many memory banks available as there are processors. On a quad-CPU machine the memory bandwidth is quadrupled without the need for a complicated Northbridge with enormous bandwidth. Having a memory controller integrated into the CPU has some additional advantages; we will not dig deeper into this technology here.

There are disadvantages to this architecture, too. First of all, because the machine still has to make all the memory of the system accessible to all processors, the memory is not uniform anymore (hence the name NUMA – Non-Uniform Memory Architecture – for such an architecture). Local memory (memory attached to a processor) can be accessed with the usual speed. The situation is different when memory attached to another processor is accessed. In this case the interconnects between the processors have to be used. To access memory attached to CPU2 from CPU1 requires communication across one interconnect. When the same CPU accesses memory attached to CPU4 two interconnects have to be crossed.

Each such communication has an associated cost. We talk about “NUMA factors” when we describe the extra time needed to access remote memory. The example architecture in Figure 2.3 has two levels for each CPU: immediately adjacent CPUs and one CPU which is two interconnects away. With more complicated machines the number of levels can grow significantly. There are also machine architectures (for instance IBM’s x445 and SGI’s Altix series) where there is more than one type of connection. CPUs are organized into nodes; within a node the time to access the memory might be uniform or have only small NUMA factors. The connection between nodes can be very expensive, though, and the NUMA factor can be quite high.
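
On Linux, a rough estimate of the NUMA factor can be obtained by pinning a thread to one node and timing streams over memory allocated locally versus on another node. The sketch below assumes libnuma is installed (link with -lnuma) and that the machine actually has more than one node; the buffer size and node choices are illustrative.

```c
/* Hedged sketch: estimate the NUMA factor with libnuma.  The thread is
 * pinned to node 0, then a local buffer and a buffer on the highest-numbered
 * node are streamed through and timed.  Link with -lnuma. */
#include <numa.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define BUF_BYTES (64UL * 1024 * 1024)   /* 64 MiB, an arbitrary test size */

static double stream_seconds(unsigned char *buf)
{
    struct timespec t0, t1;
    volatile unsigned long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < BUF_BYTES; i++)
        sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "this system reports no NUMA support\n");
        return 1;
    }

    int last = numa_max_node();          /* 0 on a single-node (UMA) machine */
    numa_run_on_node(0);                 /* run on a CPU belonging to node 0 */

    unsigned char *local  = numa_alloc_onnode(BUF_BYTES, 0);
    unsigned char *remote = numa_alloc_onnode(BUF_BYTES, last);
    if (!local || !remote)
        return 1;
    memset(local, 1, BUF_BYTES);         /* fault the pages in before timing */
    memset(remote, 1, BUF_BYTES);

    double t_local  = stream_seconds(local);
    double t_remote = stream_seconds(remote);
    printf("node 0 (local): %.3f s, node %d: %.3f s, ratio %.2f\n",
           t_local, last, t_remote, t_remote / t_local);

    numa_free(local, BUF_BYTES);
    numa_free(remote, BUF_BYTES);
    return 0;
}
```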

Commodity NUMA machines exist today and will likely play an even greater role in the future. It is expected that, from late 2008 on, every SMP machine will use NUMA. The costs associated with NUMA make it important to recognize when a program is running on a NUMA machine. In Section 5 we will discuss more machine architectures and some technologies the Linux kernel provides for these programs.
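
As a small illustration of such run-time recognition, the sketch below asks libnuma whether NUMA support is present and on which node the current CPU sits. It assumes libnuma's version 2 API (for numa_node_of_cpu()) and glibc's sched_getcpu(); the kernel and numactl offer richer interfaces that are not shown here.

```c
/* Minimal detection sketch: report whether the machine is NUMA and which
 * node the current CPU belongs to.  Link with -lnuma. */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        printf("no NUMA support reported; treat memory as uniform\n");
        return 0;
    }
    int nodes = numa_max_node() + 1;     /* number of memory nodes */
    int cpu   = sched_getcpu();          /* CPU this thread runs on right now */
    printf("%d NUMA node(s); running on CPU %d, node %d\n",
           nodes, cpu, numa_node_of_cpu(cpu));
    return 0;
}
```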

Beyond the technical details described in the remainder of this section, there are several additional factors which influence the performance of RAM. They are not controllable by software, which is why they are not covered in this section. The interested reader can learn about some of these factors in Section 2.1. They are really only needed to get a more complete picture of RAM technology and possibly to make better decisions when purchasing computers.

The following two sections discuss hardware details at the gate level and the access protocol between the memory controller and the DRAM chips. Programmers will likely find this information enlightening since these details explain why RAM access works the way it does. It is optional knowledge, though, and the reader anxious to get to topics with more immediate relevance for everyday life can jump ahead to Section 2.2.5.

Commodity Hardware Today

Understanding commodity hardware is important because specialized hardware is in retreat. Scaling today is mostly achieved horizontally rather than vertically, meaning that it is now more cost-effective to use many smaller, interconnected commodity computers instead of a few very large, exceptionally fast (but expensive) systems. This is because fast, inexpensive network hardware has become widely available. Large specialized systems still have their place and still represent a business opportunity, but the overall market is dwarfed by the commodity hardware market. As of 2007, Red Hat expects that the data centers of the future will be built from computers with up to four sockets, each holding a quad-core CPU which, in the case of Intel CPUs, will be hyper-threaded. (Hyper-threading allows a single processor core to handle two or more concurrent tasks with only a little extra hardware.) This means the standard system in the data center will have up to 64 virtual processors. Bigger machines will be supported, but the quad-socket, quad-core configuration is generally considered the sweet spot, and most optimizations target such machines.

Large differences exist between commodity computer architectures. Even so, we can cover more than 90% of such hardware by concentrating on the most important differences. Note that these technical details change quickly, so the reader should take the time of writing into account.

For many years, personal computers and smaller servers were standardized on a chipset consisting of two parts: the Northbridge and the Southbridge. Figure 2-1 shows this structure.

Figure 2-1: Structure with Northbridge and Southbridge

All CPUs are connected to the Northbridge via a common bus (the Front Side Bus, FSB). Among other things, the Northbridge contains the memory controller, and its implementation determines the type of RAM chips used in the computer. Different RAM types, such as DRAM, Rambus, and SDRAM, require different memory controllers.

To reach the other devices in the system, the Northbridge must communicate with the Southbridge. The Southbridge, often called the I/O bridge, connects to devices through a variety of buses. Today the PCI, PCI Express, SATA, and USB buses are the most important, but PATA, IEEE 1394, serial, and parallel ports are also supported by the Southbridge. Older systems had AGP slots attached to the Northbridge, because the connection between the Northbridge and the Southbridge was not fast enough for good performance. Today, however, the PCI-E slots are all connected to the Southbridge.

Such a system structure has a number of consequences worth noting:

1. All data sent from one CPU to another must travel over the same bus that is used to communicate with the Northbridge.
2. All communication with RAM must pass through the Northbridge.
3. The RAM has only a single port (multi-port RAM is not discussed here because it is not used in commodity hardware; it can be found in specialized hardware such as network routers).
4. Communication between a CPU and a device attached to the Southbridge must be routed through the Northbridge.

A number of bottlenecks are immediately apparent in this design. One of them is device access to RAM. In the early days of the PC, communication with devices on either bridge had to pass through the CPU, which hurt overall system performance badly. To work around this, many devices gained support for direct memory access (DMA). DMA allows a device, with the help of the Northbridge, to store data in and read data from RAM directly, without involving the CPU. Today all high-performance devices attached to any of the buses can use DMA. While this greatly reduces the CPU's workload, it also creates competition for the Northbridge's bandwidth, because DMA requests compete with the CPUs for memory access. This problem must therefore be taken into account as well.

The second bottleneck is the bus between the Northbridge and the RAM. The exact details of this bus depend on the type of RAM. Older systems have only one bus to all the RAM chips, so parallel access is not possible. More recent RAM types require two separate buses (called channels in the case of DDR2), which doubles the available bandwidth. The Northbridge interleaves memory accesses across the channels. Newer memory technologies (such as FB-DRAM) add even more channels.

Because the available bandwidth is limited, it is important to schedule memory accesses in a way that minimizes delays and optimizes performance. As we will see, even with CPU caches, processors are still much faster than memory and have to wait for it. If multiple hyper-threads, cores, or processors access memory at the same time, the wait times become even longer. The same applies to DMA operations.

Beyond concurrent access, the access pattern itself also has a great influence on the performance of the memory subsystem, especially with multiple memory channels. See Section 2.2 for more on RAM access patterns.

Figure 2-2: Northbridge with External Controllers

On some more expensive systems, the Northbridge does not actually contain the memory controller. Instead, a number of external memory controllers are connected to the Northbridge, as shown in Figure 2-2.

Figure 2-3: Integrated Memory Controller

Using multiple external memory controllers is not the only way to increase memory bandwidth. Another increasingly popular approach is to integrate the memory controller into the CPU and attach memory directly to each CPU. SMP (symmetric multiprocessing) systems based on AMD's Opteron processor made this architecture popular; Figure 2.3 shows such a system. Intel will support the Common System Interface (CSI) starting with the Nehalem processors; this is basically the same approach: an integrated memory controller with the possibility of local memory for each processor.

With this architecture there are as many memory banks as there are processors. On a quad-CPU machine we get four times the memory bandwidth without needing a complicated Northbridge with enormous bandwidth. Integrating the memory controller into the CPU also has some additional advantages, which we will not explore further here.

This architecture also has disadvantages. First, because all of the system's memory must still be accessible to every processor, memory is no longer uniform (hence the name NUMA, Non-Uniform Memory Architecture). A processor can access its local memory at normal speed, but when it accesses memory attached to another processor, the interconnects between the processors must be used. For CPU1 to access memory attached to CPU2, one interconnect has to be crossed; to access memory attached to CPU4, two interconnects have to be crossed.

Each of these connections has an associated cost. We speak of "NUMA factors" to describe the extra time needed to access remote memory. In Figure 2.3 each CPU has two levels: immediately adjacent CPUs and a CPU that is two interconnects away. The more complicated the machine, the more levels there can be. There are also machine architectures, such as IBM's x445 and SGI's Altix series, with more than one type of connection. CPUs are organized into nodes; within a node, memory access times are uniform or have only small NUMA factors. The connections between nodes, however, can be very expensive, and the NUMA factor can be quite high.

Commodity NUMA machines already exist and will likely play an even greater role in the future. It is expected that from late 2008 on, every SMP machine will use NUMA. Because of the costs associated with NUMA, it is important for a program to recognize when it is running on a NUMA machine. In Section 5 we will discuss more machine architectures and some of the technologies the Linux kernel provides for such programs.

Besides the technical details described in this section, there are some additional factors which influence the performance of RAM. They cannot be controlled by software, so they are not covered here. The interested reader can learn about them in Section 2.1. They are only needed to get a more complete picture of RAM technology and perhaps to make better decisions when purchasing computers.

The following two sections discuss hardware details at the gate level and the access protocol between the memory controller and the DRAM chips. Programmers may find these details of why RAM access works the way it does enlightening. This knowledge is optional, though, and readers anxious to get to the more immediately relevant topics can jump ahead to Section 2.2.5.