
Understanding the Linux Kernel, 3rd Edition -- Notes, Part 1


        Understanding the Linux Kernel, 3rd Edition

Preface

The Audience for This Book

We try to go beyond superficial features. We offer a background, such as the history of major features and the reasons why they were used.

      Organization of the Material

We tried a bottom-up approach: start with topics that are hardware-dependent and end with those that are totally hardware-independent.

      Level of Description

    Overview of the Book

    Conventions in This Book

    How to Contact Us

Chapter 1. Introduction

1.1. Linux Versus Other Unix-Like Kernels

Linux regards lightweight processes as the basic execution context and handles them via the nonstandard clone( ) system call.
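
A hedged user-space sketch of creating a lightweight process with clone( ); the flag set, stack size, and worker function are illustrative, and the glibc clone wrapper is assumed:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int worker(void *arg)
    {
        /* shares the parent's address space, so it behaves like a thread */
        printf("lightweight process: pid=%d\n", getpid());
        return 0;
    }

    int main(void)
    {
        size_t stack_size = 64 * 1024;      /* illustrative size */
        char *stack = malloc(stack_size);

        /* CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND makes the child
         * share memory, filesystem information, open files, and signal
         * handlers: the sharing that thread libraries build on */
        pid_t pid = clone(worker, stack + stack_size,
                          CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
                          NULL);
        if (pid > 0)
            waitpid(pid, NULL, 0);
        free(stack);
        return 0;
    }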

    1.2. Hardware Dependency

    1.3. Linux Versions

    1.4. Basic Operating System Concepts

                 1.4.1. Multiuser Systems

                 1.4.2. Users and Groups

                 1.4.3. Processes

A process can be defined either as "an instance of a program in execution" or as the "execution context" of a running program.

                 1.4.4. Kernel Architecture

                            monolithic vs. microkernel (Linux: monolithic, with loadable modules)

    1.5. An Overview of the Unix Filesystem

1.5.1. Files

1.5.2. Hard and Soft Links

                 1.5.3. File Types

                 1.5.4. File Descriptor and Inode

                 1.5.5. Access Rights and File Mode

When a file is created by a process, its owner ID is the UID of the process. Its owner user group ID can be either the process group ID of the creator process or the user group ID of the parent directory, depending on the value of the sgid flag of the parent directory.

                 1.5.6. File-Handling System Calls

1.5.6.1. Opening a file

                            1.5.6.2. Accessing an opened file

                            1.5.6.3. Closing a file

                            1.5.6.4. Renaming and deleting a file

    1.6. An Overview of Unix Kernels (to revisit: needs a second reading)

1.6.1. The Process/Kernel Model

Kernel routines can be activated in several ways:

1. A process invokes a system call (see the sketch following this list).

2. The CPU executing the process signals an exception, which is an unusual condition such as an invalid instruction. The kernel handles the exception on behalf of the process that caused it.

3. A peripheral device issues an interrupt signal to the CPU to notify it of an event such as a request for attention, a status change, or the completion of an I/O operation. Each interrupt signal is dealt with by a kernel program called an interrupt handler. Because peripheral devices operate asynchronously with respect to the CPU, interrupts occur at unpredictable times.

4. A kernel thread is executed. Because it runs in Kernel Mode, the corresponding program must be considered part of the kernel.
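
As a minimal illustration of case 1, a user-space sketch that enters the kernel through a system call via the portable syscall(2) wrapper:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
        /* the CPU switches to Kernel Mode, the kernel routine runs on
         * behalf of this process, and control returns to User Mode */
        long pid = syscall(SYS_getpid);
        printf("getpid( ) returned %ld\n", pid);
        return 0;
    }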

                 1.6.2. Process Implementation

When the kernel stops the execution of a process, it saves the current contents of several processor registers in the process descriptor. These include:

1. The program counter (PC) and stack pointer (SP) registers

2. The general-purpose registers

3. The floating-point registers

4. The processor control registers (Processor Status Word) containing information about the CPU state

5. The memory management registers used to keep track of the RAM accessed by the process

1.6.3. Reentrant Kernels

                 1.6.4. Process Address Space

                 1.6.5. Synchronization and Critical Regions

1.6.5.1. Kernel preemption disabling

                                   1.6.5.2. Interrupt disabling

                                   1.6.5.3. Semaphores

                                   1.6.5.4. Spin locks

                                   1.6.5.5. Avoiding deadlocks

1.6.6. Signals and Interprocess Communication

                 1.6.7. Process Management    

1.6.7.1. Zombie processes

                                   1.6.7.2. Process groups and login sessions

1.6.8. Memory Management

1.6.8.1. Virtual memory

                            1.6.8.2. Random access memory usage

                            1.6.8.3. Kernel Memory Allocator

                            1.6.8.4. Process virtual address space handling

                            1.6.8.5. Caching

1.6.9. Device Drivers

Chapter 2. Memory Addressing

2.1. Memory Addresses

(1)Logical address

           (2)Linear address

           (3)Physical address

The Memory Management Unit (MMU) transforms a logical address into a linear address by means of a hardware circuit called the segmentation unit; a second hardware circuit called the paging unit transforms the linear address into a physical address.

           Figure 2-1. Logical address translation    

2.2. Segmentation in Hardware

2.2.1. Segment Selectors and Segmentation Registers

(1) Segment Selectors

(2) Segmentation Registers

To make it easy to retrieve segment selectors quickly, the processor provides segmentation registers whose only purpose is to hold Segment Selectors: cs, ss, ds, es, fs, and gs.

           2.2.2. Segment Descriptors

Global Descriptor Table (GDT)

                 Local Descriptor Table (LDT)

Code Segment Descriptor

                 Data Segment Descriptor

                 Task State Segment Descriptor (TSSD)

                 Local Descriptor Table Descriptor (LDTD)

2.2.3. Fast Access to Segment Descriptors

2.2.4. Segmentation Unit

      2.3. Segmentation in Linux

The 2.6 version of Linux uses segmentation only when required by the 80 x 86 architecture.

2.3.1. The Linux GDT

1. A Task State Segment (TSS)

2. Kernel code and data segments

3. A segment including the default Local Descriptor Table (LDT)

4. Three Thread-Local Storage (TLS) segments

5. Three segments related to Advanced Power Management (APM)

6. Five segments related to Plug and Play (PnP) BIOS services

7. A special TSS segment used by the kernel to handle "Double fault" exceptions

A few entries in the GDT may depend on the process that the CPU is executing (LDT and TLS Segment Descriptors).

           2.3.2. The Linux LDTs

2.4. Paging in Hardware

pages (fixed-length intervals of linear addresses) / page frames (fixed-length blocks of RAM)

2.4.1. Regular Paging

           2.4.2. Extended Paging

2.4.3. Hardware Protection Scheme

           2.4.4. An Example of Regular Paging

A simple example will help in clarifying how regular paging works. Let's assume that the kernel assigns the linear address space between 0x20000000 and 0x2003ffff to a running process.[*] This space consists of exactly 64 pages. We don't care about the physical addresses of the page frames containing the pages; in fact, some of them might not even be in main memory. We are interested only in the remaining fields of the Page Table entries.

[*] As we shall see in the following chapters, the 3 GB linear address space is an upper limit, but a User Mode process is allowed to reference only a subset of it.

Let's start with the 10 most significant bits of the linear addresses assigned to the process, which are interpreted as the Directory field by the paging unit. The addresses start with a 2 followed by zeros, so the 10 bits all have the same value, namely 0x080 or 128 decimal. Thus the Directory field in all the addresses refers to the 129th entry of the process Page Directory. The corresponding entry must contain the physical address of the Page Table assigned to the process (see Figure 2-9). If no other linear addresses are assigned to the process, all the remaining 1,023 entries of the Page Directory are filled with zeros.

The values assumed by the intermediate 10 bits (that is, the values of the Table field) range from 0 to 0x03f, or from 0 to 63 decimal. Thus, only the first 64 entries of the Page Table are valid. The remaining 960 entries are filled with zeros.

Suppose that the process needs to read the byte at linear address 0x20021406. This address is handled by the paging unit as follows:

1. The Directory field 0x80 is used to select entry 0x80 of the Page Directory, which points to the Page Table associated with the process's pages.

2. The Table field 0x21 is used to select entry 0x21 of the Page Table, which points to the page frame containing the desired page.

3. Finally, the Offset field 0x406 is used to select the byte at offset 0x406 in the desired page frame.

If the Present flag of the 0x21 entry of the Page Table is cleared, the page is not present in main memory; in this case, the paging unit issues a Page Fault exception while translating the linear address. The same exception is issued whenever the process attempts to access linear addresses outside of the interval delimited by 0x20000000 and 0x2003ffff, because the Page Table entries not assigned to the process are filled with zeros; in particular, their Present flags are all cleared.

Figure 2-9. An example of paging
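
To make the decomposition concrete, a small user-space sketch assuming the classic 32-bit 10/10/12-bit split into Directory, Table, and Offset fields:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t addr   = 0x20021406;           /* linear address from the example */
        uint32_t dir    = addr >> 22;           /* 10 most significant bits */
        uint32_t table  = (addr >> 12) & 0x3ff; /* intermediate 10 bits */
        uint32_t offset = addr & 0xfff;         /* 12 least significant bits */

        /* prints Directory=0x80 Table=0x21 Offset=0x406 */
        printf("Directory=0x%x Table=0x%x Offset=0x%x\n", dir, table, offset);
        return 0;
    }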

2.4.5. The Physical Address Extension (PAE) Paging Mechanism

2.4.6. Paging for 64-bit Architectures

2.4.7. Hardware Cache (L1 cache)

The cache memory stores the actual lines of memory. The cache controller stores an array of entries, one entry for each line of the cache memory. Each entry includes a tag and a few flags that describe the status of the cache line. The tag consists of some bits that allow the cache controller to recognize the memory location currently mapped by the line. The bits of the memory's physical address are usually split into three groups: the most significant ones correspond to the tag, the middle ones to the cache controller subset index, and the least significant ones to the offset within the line.

write-through: the controller always writes into both RAM and the cache line, effectively switching off the cache for write operations.

write-back: the cache line is updated and the contents of the RAM are left unchanged. After a write-back, of course, the RAM must eventually be updated. The cache controller writes the cache line back into RAM only when the CPU executes an instruction requiring a flush of cache entries or when a FLUSH hardware signal occurs (usually after a cache miss).

2.4.8. Translation Lookaside Buffers (TLB)

Translation Lookaside Buffers (TLB) speed up linear address translation. When a linear address is used for the first time, the corresponding physical address is computed through slow accesses to the Page Tables in RAM. The physical address is then stored in a TLB entry so that further references to the same linear address can be quickly translated.

2.5. Paging in Linux

Linux's handling of processes relies heavily on paging. In fact, the automatic translation of linear addresses into physical ones makes the following design objectives feasible:

1. Assign a different physical address space to each process, ensuring an efficient protection against addressing errors.

2. Distinguish pages (groups of data) from page frames (physical addresses in main memory). This allows the same page to be stored in a page frame, then saved to disk and later reloaded in a different page frame. This is the basic ingredient of the virtual memory mechanism (see Chapter 17).

pgd (Page Global Directory)

2.5.1. The Linear Address Fields

PAGE_SHIFT/PMD_SHIFT/PUD_SHIFT/PGDIR_SHIFT

           PTRS_PER_PTE, PTRS_PER_PMD, PTRS_PER_PUD, and PTRS_PER_PGD
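
A sketch of the values these macros take on 32-bit x86 without PAE, where paging is two-level and the pud/pmd levels are folded away; values follow the book's description and are illustrative rather than copied from kernel headers:

    #define PAGE_SHIFT   12                 /* Offset field: 12 bits, 4 KB pages */
    #define PMD_SHIFT    22                 /* bits covered by Offset + Table */
    #define PUD_SHIFT    22                 /* upper directory folded away */
    #define PGDIR_SHIFT  22                 /* bits covered by Offset + Table */

    #define PTRS_PER_PTE 1024               /* entries in a Page Table */
    #define PTRS_PER_PMD 1                  /* middle directory folded away */
    #define PTRS_PER_PUD 1                  /* upper directory folded away */
    #define PTRS_PER_PGD 1024               /* entries in the Page Global Directory */

    #define PAGE_SIZE    (1UL << PAGE_SHIFT)
    #define PAGE_MASK    (~(PAGE_SIZE - 1)) /* clears the Offset field */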

           2.5.2. Page Table Handling

                 (1):type-conversion macros
                
__pte, __pmd, __pud, __pgd, __pgprot // pgprot = page protection

            pte_val, pmd_val, pud_val, pgd_val, pgprot_val

                 (2):macros and functions to read or modify page table entries

pte_none, pmd_none, pud_none, pgd_none

            pte_clear, pmd_clear, pud_clear, pgd_clear

            set_pte, set_pmd, set_pud, set_pgd

            pte_same(a,b)

            pmd_large(e)

            pmd_bad, pud_bad, pgd_bad

            pte_present

The pmd_bad macro is used by functions to check Page Middle Directory entries passed as input parameters. It yields the value 1 if the entry points to a bad Page Table; that is, if at least one of the following conditions applies:

(1) The page is not in main memory (Present flag cleared).

(2) The page allows only Read access (Read/Write flag cleared).

(3) Either Accessed or Dirty is cleared (Linux always forces these flags to be set for every existing Page Table).

The pte_present macro yields the value 1 if either the Present flag or the Page Size flag of a Page Table entry is equal to 1, the value 0 otherwise. Recall that the Page Size flag in Page Table entries has no meaning for the paging unit of the microprocessor; the kernel, however, marks Present equal to 0 and Page Size equal to 1 for the pages present in main memory but without read, write, or execute privileges. In this way, any access to such pages triggers a Page Fault exception because Present is cleared, and the kernel can detect that the fault is not due to a missing page by checking the value of Page Size.

                 (3): Page flag reading/setting functions

pte_user( )

            pte_read( )

            pte_write( )

            pte_exec( )

            pte_dirty( )

            pte_young( )

            pte_file( )

            mk_pte_huge( )

            pte_wrprotect( )

            pte_rdprotect( )

            pte_exprotect( )

            pte_mkwrite( )

            pte_mkread( )

            pte_mkexec( )

            pte_mkclean( )

            pte_mkdirty( )

            pte_mkold( )

            pte_mkyoung( )

            pte_modify(p,v)

            ptep_set_wrprotect()

            ptep_set_access_flags()

            ptep_mkdirty()

            ptep_test_and_clear_dirty()

            ptep_test_and_clear_young()

            (4): Macros acting on Page Table entries

pgd_index(addr)

                 pgd_offset(mm, addr)

            pgd_offset_k(addr)

            pgd_page(pgd)

            pud_offset(pgd, addr)

            pud_page(pud)

pmd_index(addr)

                 pmd_offset(pud, addr)

            pmd_page(pmd)

            mk_pte(p,prot)

            pte_index(addr)

pte_offset_kernel(dir, addr)

            pte_offset_map(dir, addr)

            pte_to_pgoff(pte)

            pgoff_to_pte(offset)
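
A schematic sketch of how these offset macros compose into the canonical page-table walk on the 2.6 API; locking and corner cases are omitted, so treat it as an outline rather than a drop-in implementation:

    /* Translate a linear address into a pointer to its Page Table entry.
     * The caller must pte_unmap( ) the returned pointer when done. */
    static pte_t *walk_page_tables(struct mm_struct *mm, unsigned long addr)
    {
        pgd_t *pgd = pgd_offset(mm, addr);   /* Page Global Directory entry */
        pud_t *pud;
        pmd_t *pmd;

        if (pgd_none(*pgd) || pgd_bad(*pgd))
            return NULL;
        pud = pud_offset(pgd, addr);         /* Page Upper Directory entry */
        if (pud_none(*pud) || pud_bad(*pud))
            return NULL;
        pmd = pmd_offset(pud, addr);         /* Page Middle Directory entry */
        if (pmd_none(*pmd) || pmd_bad(*pmd))
            return NULL;
        return pte_offset_map(pmd, addr);    /* Page Table entry */
    }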

                 (5): Page allocation functions

pgd_alloc(mm)

            pgd_free(pgd)

            pud_alloc(mm, pgd, addr)

            pud_free(x)

            pmd_alloc(mm, pud, addr)

            pmd_free(x)

            pte_alloc_map(mm, pmd, addr)

            pte_alloc_kernel(mm, pmd, addr)

            pte_free(pte)

            pte_free_kernel(pte)

            clear_page_range(mmu, start, end)

           2.5.3. Physical Memory Layout

As a general rule, the Linux kernel is installed in RAM starting from the physical address 0x00100000, i.e., from the second megabyte.

                 Figure 2-13. The first 768 page frames (3 MB) in Linux 2.6

1. Page frame 0 is used by BIOS to store the system hardware configuration detected during the Power-On Self-Test (POST); the BIOS of many laptops, moreover, writes data on this page frame even after the system is initialized.

2. Physical addresses ranging from 0x000a0000 to 0x000fffff are usually reserved to BIOS routines and to map the internal memory of ISA graphics cards. This area is the well-known hole from 640 KB to 1 MB in all IBM-compatible PCs: the physical addresses exist but they are reserved, and the corresponding page frames cannot be used by the operating system.

3. Additional page frames within the first megabyte may be reserved by specific computer models. For example, the IBM ThinkPad maps the 0xa0 page frame into the 0x9f one.

Table 2-10. Variables describing the kernel's physical memory layout

Variable          Description
num_physpages     Page frame number of the highest usable page frame
totalram_pages    Total number of usable page frames
min_low_pfn       Page frame number of the first usable page frame after the kernel image in RAM
max_pfn           Page frame number of the last usable page frame
max_low_pfn       Page frame number of the last page frame directly mapped by the kernel (low memory)
totalhigh_pages   Total number of page frames not directly mapped by the kernel (high memory)
highstart_pfn     Page frame number of the first page frame not directly mapped by the kernel
highend_pfn       Page frame number of the last page frame not directly mapped by the kernel

           2.5.4. Process Page Tables

The linear address space of a process is divided into two parts:

1. Linear addresses from 0x00000000 to 0xbfffffff can be addressed when the process runs in either User or Kernel Mode.

2. Linear addresses from 0xc0000000 to 0xffffffff can be addressed only when the process runs in Kernel Mode.

The content of the first entries of the Page Global Directory that map linear addresses lower than 0xc0000000 (the first 768 entries with PAE disabled, or the first 3 entries with PAE enabled) depends on the specific process. Conversely, the remaining entries should be the same for all processes and equal to the corresponding entries of the master kernel Page Global Directory (see the following section). (to revisit)

2.5.5. Kernel Page Tables (to revisit)

In the first phase, the kernel creates a limited address space including the kernel's code and data segments, the initial Page Tables, and 128 KB for some dynamic data structures. This minimal address space is just large enough to install the kernel in RAM and to initialize its core data structures.

In the second phase, the kernel takes advantage of all of the existing RAM and sets up the page tables properly. Let us examine how this plan is executed.

2.5.5.1. Provisional kernel Page Tables

                     2.5.5.2. Final kernel Page Table when RAM size is less than 896 MB

                     2.5.5.3. Final kernel Page Table when RAM size is between 896 MB and 4096 MB

                     2.5.5.4. Final kernel Page Table when RAM size is more than 4096 MB

2.5.6. Fix-Mapped Linear Addresses

fix_to_virt( )
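
The book presents fix_to_virt( ) roughly as follows; because idx is always a compile-time constant, the compiler folds the computation into an immediate address and drops the never-taken check:

    inline unsigned long fix_to_virt(const unsigned int idx)
    {
        /* __this_fixmap_does_not_exist( ) is deliberately left undefined,
         * so an out-of-range or non-constant idx fails at link time */
        if (idx >= __end_of_fixed_addresses)
            __this_fixmap_does_not_exist();
        return (0xfffff000UL - (idx << PAGE_SHIFT));
    }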

2.5.7. Handling the Hardware Cache and the TLB

2.5.7.1. Handling the hardware cache

The L1_CACHE_BYTES macro yields the size of a cache line in bytes.
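
Cache-line size matters, for example, when laying out data so that two hot objects never share a line; a minimal hypothetical sketch (the structure name is made up):

    /* align each counter to a cache-line boundary so that updates on one
     * CPU never invalidate another CPU's line (avoids false sharing) */
    struct hot_counter {
        unsigned long count;
    } __attribute__((__aligned__(L1_CACHE_BYTES)));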

2.5.7.2. Handling the TLB

Chapter 3. Processes

3.1. Processes, Lightweight Processes, and Threads

3.2. Process Descriptor

struct task_struct

3.2.1. Process State

e.g., p->state = TASK_RUNNING;

3.2.2. Identifying a Process

process descriptor pointers

tgid (thread group ID) / pid (process ID)

                3.2.2.1. Process descriptors handling

Figure 3-2. Storing the thread_info structure and the process (task_struct) kernel stack in two page frames

union thread_union {
    struct thread_info thread_info;
    unsigned long stack[2048]; /* 1024 for 4KB stacks */
};

esp is the CPU stack pointer, used to address the stack's top location.

                            3.2.2.2. Identifying the current process

current_thread_info( ) denotes the thread_info structure pointer of the process running on the CPU that executes the instruction.

current_thread_info( )->task, or simply current, denotes the process descriptor pointer of the process running on the CPU.

#define task_stack_page(task)    ((task)->stack)    /* yields the stack base, i.e., the thread_info address, from a task_struct */

#define task_thread_info(task)   ((struct thread_info *)(task)->stack)    /* yields the thread_info pointer from a task_struct, cf. current_thread_info( ) */
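
On 80x86 with 8 KB kernel stacks, the 2.6 kernel derives current_thread_info( ) by masking the low 13 bits of the stack pointer, since thread_info sits at the base of the two page frames holding the stack; roughly:

    static inline struct thread_info *current_thread_info(void)
    {
        struct thread_info *ti;
        /* the thread_union is 8 KB aligned, so clearing the low 13 bits
         * of esp yields the address of the thread_info structure */
        __asm__("andl %%esp,%0" : "=r" (ti) : "0" (~8191UL));
        return ti;
    }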

                            3.2.2.3. Doubly linked lists

Note that the pointers in a list_head field store the addresses of other list_head fields rather than the addresses of the whole data structures in which the list_head structure is included.
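
A short kernel-side sketch of the convention (struct item and its list are hypothetical): the links live inside the payload, and list_for_each_entry( ) applies list_entry( ) to step from each embedded node back to its enclosing structure:

    struct item {
        int value;
        struct list_head node;   /* points at other node fields, not at items */
    };

    LIST_HEAD(items);            /* an empty circular doubly linked list */

    static void print_items(void)
    {
        struct item *it;

        list_for_each_entry(it, &items, node)
            printk("%d\n", it->value);
    }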

3.2.2.4. The process list

Another useful macro, called for_each_process, scans the whole process list.
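
Its typical use iterates over every process descriptor starting from init_task; a minimal sketch (real code would hold tasklist_lock or equivalent protection while scanning):

    struct task_struct *p;

    for_each_process(p)
        printk("pid %d\n", p->pid);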

                                3.2.2.5. The lists of TASK_RUNNING processes

enqueue_task(p, array) / dequeue_task(p, array)

3.2.3. Relationships Among Processes

3.2.3.1. The pidhash table and chained lists (to revisit)

Each hash table is stored in four page frames.

3.2.4. How Processes Are Organized

                                   3.2.4.1. Wait queues

wait_queue_head_t:

    struct __wait_queue_head {
        spinlock_t lock;
        struct list_head task_list;
    };
    typedef struct __wait_queue_head wait_queue_head_t;

Elements of a wait queue list are of type wait_queue_t:

    struct __wait_queue {
        unsigned int flags;
        struct task_struct *task;
        wait_queue_func_t func;
        struct list_head task_list;
    };
    typedef struct __wait_queue wait_queue_t;

                                   3.2.4.2. Handling wait queues

The prepare_to_wait( ), prepare_to_wait_exclusive( ), and finish_wait( ) functions, introduced in Linux 2.6, offer yet another way to put the current process to sleep in a wait queue. Typically, they are used as follows:

    DEFINE_WAIT(wait);
    prepare_to_wait_exclusive(&wq, &wait, TASK_INTERRUPTIBLE);
    /* wq is the head of the wait queue */
    ...
    if (!condition)
        schedule();
    finish_wait(&wq, &wait);

3.2.5. Process Resource Limits

The resource limits for the current process are stored in the current->signal->rlim field, that is, in a field of the process's signal descriptor.
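
From user space the same limits are visible through getrlimit(2); a minimal sketch reading the stack-size limit that the kernel keeps in one slot of current->signal->rlim:

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_STACK, &rl) == 0)
            printf("stack limit: soft=%ld hard=%ld\n",
                   (long)rl.rlim_cur, (long)rl.rlim_max);
        return 0;
    }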

3.3. Process Switch

3.3.1. Hardware Context

A part of the hardware context of a process is stored in the process descriptor, while the remaining part is saved in the Kernel Mode stack.

                      3.3.2. Task State Segment

The TSSDs created by Linux are stored in the Global Descriptor Table (GDT), whose base address is stored in the gdtr register of each CPU. The tr register of each CPU contains the TSSD Selector of the corresponding TSS. The register also includes two hidden, nonprogrammable fields: the Base and Limit fields of the TSSD. In this way, the processor can address the TSS directly without having to retrieve the TSS address from the GDT.

3.3.2.1. The thread field

Thus, each process descriptor includes a field called thread of type thread_struct, in which the kernel saves the hardware context whenever the process is being switched out.