What every programmer should know about memory (Part 1) (Translated)

What Every Programmer Should Know About Memory Ulrich Drepper Red Hat, Inc. [email protected] November 21, 2007

Abstract

As CPU cores become both faster and more numerous, the limiting factor for most programs is now, and will be for some time, memory access. Hardware designers have come up with ever more sophisticated memory handling and acceleration techniques, such as CPU caches, but these cannot work optimally without some help from the programmer. Unfortunately, neither the structure nor the cost of using the memory subsystem of a computer or the caches on CPUs is well understood by most programmers. This paper explains the structure of memory subsystems in use on modern commodity hardware, illustrating why CPU caches were developed, how they work, and what programs should do to achieve optimal performance by utilizing them.

1 Introduction

In the early days computers were much simpler. The various components of a system, such as the CPU, memory, mass storage, and network interfaces, were developed together and, as a result, were quite balanced in their performance. For example, the memory and network interfaces were not (much) faster than the CPU at providing data.

This situation changed once the basic structure of computers stabilized and hardware developers concentrated on optimizing individual subsystems. Suddenly the performance of some components of the computer fell significantly behind and bottlenecks developed. This was especially true for mass storage and memory subsystems which, for cost reasons, improved more slowly relative to other components.

The slowness of mass storage has mostly been dealt with using software techniques: operating systems keep most often used (and most likely to be used) data in main memory, which can be accessed at a rate orders of magnitude faster than the hard disk. Cache storage was added to the storage devices themselves, which requires no changes in the operating system to increase performance. {Changes are needed, however, to guarantee data integrity when using storage device caches.} For the purposes of this paper, we will not go into more details of software optimizations for the mass storage access.

Unlike storage subsystems, removing the main memory as a bottleneck has proven much more difficult and almost all solutions require changes to the hardware. Today these changes mainly come in the following forms:

  • RAM hardware design (speed and parallelism).
  • Memory controller designs.
  • CPU caches.
  • Direct memory access (DMA) for devices.

For the most part, this document will deal with CPU caches and some effects of memory controller design. In the process of exploring these topics, we will explore DMA and bring it into the larger picture. However, we will start with an overview of the design for today’s commodity hardware. This is a prerequisite to understanding the problems and the limitations of efficiently using memory subsystems. We will also learn about, in some detail, the different types of RAM and illustrate why these differences still exist.

This document is in no way all inclusive and final. It is limited to commodity hardware and further limited to a subset of that hardware. Also, many topics will be discussed in just enough detail for the goals of this paper. For such topics, readers are recommended to find more detailed documentation.

When it comes to operating-system-specific details and solutions, the text exclusively describes Linux. At no time will it contain any information about other OSes. The author has no interest in discussing the implications for other OSes. If the reader thinks s/he has to use a different OS they have to go to their vendors and demand they write documents similar to this one.

One last comment before the start. The text contains a number of occurrences of the term “usually” and other, similar qualifiers. The technology discussed here exists in many, many variations in the real world and this paper only addresses the most common, mainstream versions. It is rare that absolute statements can be made about this technology, thus the qualifiers.

1.1 Document Structure

This document is mostly for software developers. It does not go into enough technical details of the hardware to be useful for hardware-oriented readers. But before we can go into the practical information for developers a lot of groundwork must be laid.

To that end, the second section describes random-access memory (RAM) in technical detail. This section’s content is nice to know but not absolutely critical to be able to understand the later sections. Appropriate back references to the section are added in places where the content is required so that the anxious reader could skip most of this section at first.

The third section goes into a lot of details of CPU cache behavior. Graphs have been used to keep the text from being as dry as it would otherwise be. This content is essential for an understanding of the rest of the document. Section 4 describes briefly how virtual memory is implemented. This is also required groundwork for the rest.

Section 5 goes into a lot of detail about Non Uniform Memory Access (NUMA) systems.

Section 6 is the central section of this paper. It brings together all the previous sections’ information and gives programmers advice on how to write code which performs well in the various situations. The very impatient reader could start with this section and, if necessary, go back to the earlier sections to freshen up the knowledge of the underlying technology.

Section 7 introduces tools which can help the programmer do a better job. Even with a complete understanding of the technology it is far from obvious where in a non-trivial software project the problems are. Some tools are necessary.

In section 8 we finally give an outlook of technology which can be expected in the near future or which might just simply be good to have.
