Linux Kernel Study Notes (1): The Virtual Filesystem (VFS)

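What is VFS

The Virtual Filesystem (VFS) is the kernel subsystem that gives user-space programs a uniform interface to files and filesystems. Thanks to the VFS, filesystems of different types (for example NTFS and ext3) can interoperate, so files can be moved or copied between them.

  • From the point of view of a user-space program, the VFS provides a single abstraction and interface. A program can issue the same system calls against any filesystem without caring which filesystem type sits underneath.
  • From the point of view of a filesystem, the VFS provides a common file model, based on Unix-style filesystems, that can express the general features and operations of any filesystem type. A filesystem supports Linux by providing the interfaces and data structures the VFS expects.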
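To make the "same system calls on any filesystem" point concrete, here is a minimal user-space sketch. The helper name copy_file is made up for this note; the calls it uses (open, read, write, close) are the ordinary POSIX ones. Nothing in the code names a filesystem type, so it should behave the same whether the two paths live on ext3, NTFS, or anything else the kernel can mount.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Copy src to dst using only generic system calls.
 * The VFS routes each call to whichever filesystem holds the path.
 * Returns 0 on success, -1 on any error. */
int copy_file(const char *src, const char *dst)
{
    char buf[4096];
    ssize_t n = 0;
    int rc = 0;

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    while (rc == 0 && (n = read(in, buf, sizeof buf)) > 0) {
        /* write() may accept fewer bytes than requested, so loop. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0) {
                rc = -1;
                break;
            }
            off += w;
        }
    }
    if (n < 0)
        rc = -1;            /* read() failed */

    close(in);
    if (close(out) < 0)
        rc = -1;
    return rc;
}
```

Calling copy_file("/mnt/a/x", "/mnt/b/x") with the two mount points on different filesystem types needs no change to the code, which is exactly the abstraction the VFS is there to provide.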

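Data structures in the VFS

The VFS is object-oriented: its data structures hold both data and pointers to the functions that operate on that data. They are implemented as plain C structures, but the design follows the same ideas as object-oriented programming.

The common file model of the VFS has four primary object types:

  • The superblock object, which represents a specific mounted filesystem.
  • The inode object, which represents a specific file.
  • The dentry object, which represents a directory entry. Every individual component of a path is a dentry. The VFS has no directory object; a directory is just another kind of file.
  • The file object, which represents an open file as associated with a process.

Each object type has a matching table of operations (the equivalent of the object's methods).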

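Superblock object

Every filesystem type implements the superblock object, which stores information describing the filesystem. It usually corresponds to the filesystem superblock or filesystem control block on disk. Filesystems that are not disk-based (for example sysfs, an in-memory filesystem) generate the superblock on the fly and keep it in memory.

The code for creating, managing, and destroying superblock objects lives in fs/super.c.

The VFS stores the superblock object in struct super_block. The alloc_super() function allocates and initializes the object; at mount time the filesystem invokes it, reads its superblock from disk, and fills in the super_block structure.

struct super_block is defined in <linux/fs.h>. Only some of its fields are shown here: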

struct super_block 
{
    struct list_head        s_list;           /* list of all superblocks */ 
    dev_t                   s_dev;            /* identifier */ 
    unsigned long           s_blocksize;      /* block size in bytes */ 
    unsigned char           s_blocksize_bits; /* block size in bits */ 
    unsigned char           s_dirt;           /* dirty flag */ 
    unsigned long long      s_maxbytes;       /* max file size */ 
    struct file_system_type *s_type;          /* filesystem type */ 
    struct super_operations *s_op;            /* superblock methods */ 
    struct dquot_operations *dq_op;           /* quota methods */ 
    struct quotactl_ops     *s_qcop;          /* quota control methods */ 
    struct export_operations *s_export_op;    /* export methods */ 
    unsigned long            s_flags;         /* mount flags */ 
    unsigned long            s_magic;         /* filesystem’s magic number */ 
    struct dentry            *s_root;         /* directory mount point */ 
    struct rw_semaphore      s_umount;        /* unmount semaphore */ 
    struct semaphore         s_lock;          /* superblock semaphore */ 
    int                      s_count;         /* superblock ref count */ 
    int                      s_need_sync;     /* not-yet-synced flag */ 
    atomic_t                 s_active;        /* active reference count */ 
    void                     *s_security;     /* security module */ 
    struct xattr_handler  **s_xattr;  /* extended attribute handlers */
    struct list_head      s_inodes;        /* list of inodes */ 
    struct list_head      s_dirty;         /* list of dirty inodes */ 
    struct list_head      s_io;            /* list of writebacks */ 
    struct list_head      s_more_io;       /* list of more writeback */ 
    struct hlist_head     s_anon;          /* anonymous dentries */ 
    struct list_head      s_files;         /* list of assigned files */ 
    struct list_head      s_dentry_lru;    /* list of unused dentries */ 
    int                   s_nr_dentry_unused; /* number of dentries on list */ 
    struct block_device   *s_bdev;         /* associated block device */ 
    struct mtd_info       *s_mtd;          /* memory disk information */ 
    struct list_head      s_instances;     /* instances of this fs */ 
    struct quota_info     s_dquot;         /* quota-specific options */ 
    int                   s_frozen;        /* frozen status */ 
    wait_queue_head_t     s_wait_unfrozen; /* wait queue on freeze */ 
    char                  s_id[32];        /* text name */ 
    void                  *s_fs_info;      /* filesystem-specific info */ 
    fmode_t               s_mode;          /* mount permissions */ 
    struct semaphore      s_vfs_rename_sem; /* rename semaphore */ 
    u32                   s_time_gran;     /* granularity of timestamps */ 
    char                  *s_subtype;      /* subtype name */ 
    char                  *s_options;      /* saved mount options */
};

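Superblock operations

The most important member of the superblock object is the s_op pointer, which points to a super_operations table. struct super_operations is defined in <linux/fs.h>; only some of the operations are listed below: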

struct super_operations { 
    struct inode *(*alloc_inode)(struct super_block *sb); 
    void (*destroy_inode)(struct inode *); 
    void (*dirty_inode) (struct inode *); 
    int (*write_inode) (struct inode *, int); 
    void (*drop_inode) (struct inode *); 
    void (*delete_inode) (struct inode *); 
    void (*put_super) (struct super_block *); 
    void (*write_super) (struct super_block *); 
    int (*sync_fs)(struct super_block *sb, int wait); 
    int (*freeze_fs) (struct super_block *); 
    int (*unfreeze_fs) (struct super_block *);
    int (*statfs) (struct dentry *, struct kstatfs *); 
    int (*remount_fs) (struct super_block *, int *, char *); 
    void (*clear_inode) (struct inode *); 
    void (*umount_begin) (struct super_block *); 
    int (*show_options)(struct seq_file *, struct vfsmount *); 
    int (*show_stats)(struct seq_file *, struct vfsmount *); 
    ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); 
    ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); 
    int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
};

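This is a function table: each pointer refers to a function that operates on a superblock object (creating and destroying superblocks is not included; that code is in fs/super.c). These functions perform low-level operations on the filesystem and its inodes. When a filesystem needs to invoke one of them, for example to write its superblock, it follows the pointers from the superblock pointer sb: sb->s_op->write_super(sb). The sb pointer has to be passed explicitly because C has no object-oriented support (there is no equivalent of the C++ this pointer).

Some of the functions in the table are optional. A filesystem may leave such a pointer NULL, in which case the VFS either calls a generic function or does nothing, depending on the operation.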
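The calling convention and the NULL-means-optional rule can be tried out in user space. The sketch below is not kernel code: the mock_ types and functions are hypothetical stand-ins for super_block and super_operations, reduced to two methods, and the "do nothing and return 0" fallback is just one of the behaviours the VFS might choose for an unset method.

```c
#include <assert.h>
#include <stddef.h>

struct mock_super_block;                     /* stand-in for super_block */

struct mock_super_operations {               /* stand-in for super_operations */
    void (*write_super)(struct mock_super_block *sb);
    int  (*sync_fs)(struct mock_super_block *sb, int wait);   /* optional */
};

struct mock_super_block {
    const struct mock_super_operations *s_op;
    int writes;                              /* how often write_super ran */
};

static void demo_write_super(struct mock_super_block *sb)
{
    sb->writes++;
}

/* A filesystem that implements write_super and leaves sync_fs as NULL. */
const struct mock_super_operations demo_ops = {
    .write_super = demo_write_super,
};

/* Caller side: sb is passed explicitly because C has no implicit `this`. */
void mock_write_super(struct mock_super_block *sb)
{
    if (sb->s_op && sb->s_op->write_super)
        sb->s_op->write_super(sb);
}

/* Optional method: when the pointer is NULL, fall back to doing nothing. */
int mock_sync_fs(struct mock_super_block *sb, int wait)
{
    if (sb->s_op && sb->s_op->sync_fs)
        return sb->s_op->sync_fs(sb, wait);
    return 0;
}
```

With a superblock whose s_op is &demo_ops, mock_write_super() reaches demo_write_super() through the table, while mock_sync_fs() quietly returns 0 because that slot was never filled in.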

Descriptions of some of these functions are excerpted below:

struct inode *(*alloc_inode)(struct super_block *sb)
Creates and initializes a new inode object under the given superblock.

void (*destroy_inode)(struct inode *)
Deallocates the given inode.

int (*write_inode) (struct inode *, int)
Writes the given inode to disk

void (*delete_inode) (struct inode *)
Deletes the given inode from the disk.

void (*put_super) (struct super_block *)
Called by the VFS on unmount to release the given superblock object

void (*write_super) (struct super_block *)
Updates the on-disk superblock with the specified superblock.

int (*sync_fs)(struct super_block *sb, int wait)
Synchronizes filesystem metadata with the on-disk filesystem

int (*statfs) (struct dentry *, struct kstatfs *)
Called by the VFS to obtain filesystem statistics

void (*clear_inode) (struct inode *)
Called by the VFS to release the inode and clear any pages containing related data.

void (*umount_begin) (struct super_block *)
Called by the VFS to interrupt a mount operation. It is used by network filesystems, such as NFS.

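Inode object

The inode object holds all the information the kernel needs to manipulate a file or directory. For Unix-style filesystems this information is read directly from the on-disk inode. Filesystems without inodes must derive the information from whatever they store on disk and fill it into the in-memory inode object.

The inode object is represented by struct inode, defined in <linux/fs.h>: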

struct inode
{
    struct hlist_node       i_hash;              /* hash list */ 
    struct list_head        i_list;              /* list of inodes */ 
    struct list_head        i_sb_list;           /* list of superblocks */ 
    struct list_head        i_dentry;            /* list of dentries */ 
    unsigned long           i_ino;               /* inode number */ 
    atomic_t                i_count;             /* reference counter */ 
    unsigned int            i_nlink;             /* number of hard links */ 
    uid_t                   i_uid;               /* user id of owner */ 
    gid_t                   i_gid;               /* group id of owner */ 
    kdev_t                  i_rdev;              /* real device node */ 
    u64                     i_version;           /* versioning number */ 
    loff_t                  i_size;              /* file size in bytes */ 
    seqcount_t              i_size_seqcount;     /* serializer for i_size */ 
    struct timespec         i_atime;             /* last access time */ 
    struct timespec         i_mtime;             /* last modify time */ 
    struct timespec         i_ctime;             /* last change time */ 
    unsigned int            i_blkbits;           /* block size in bits */ 
    blkcnt_t                i_blocks;            /* file size in blocks */ 
    unsigned short          i_bytes;             /* bytes consumed */ 
    umode_t                 i_mode;              /* access permissions */ 
    spinlock_t              i_lock;              /* spinlock */ 
    struct rw_semaphore     i_alloc_sem;         /* nests inside of i_sem */ 
    struct semaphore        i_sem;               /* inode semaphore */ 
    struct inode_operations *i_op;               /* inode ops table */ 
    struct file_operations  *i_fop;              /* default inode ops */ 
    struct super_block      *i_sb;               /* associated superblock */ 
    struct file_lock  *i_flock;            /* file lock list */ 
    struct address_space    *i_mapping;          /* associated mapping */ 
    struct address_space    i_data;              /* mapping for device */ 
    struct dquot            *i_dquot[MAXQUOTAS]; /* disk quotas for inode */ 
    struct list_head        i_devices;           /* list of block devices */ 
    union 
    {
        struct pipe_inode_info  *i_pipe;         /* pipe information */ 
        struct block_device     *i_bdev;         /* block device driver */ 
        struct cdev             *i_cdev;         /* character device driver */
    }; 
    unsigned long           i_dnotify_mask;      /* directory notify mask */ 
    struct dnotify_struct   *i_dnotify;          /* dnotify */ 
    struct list_head        inotify_watches;     /* inotify watches */ 
    struct mutex  inotify_mutex;  /* protects inotify_watches */ 
    unsigned long           i_state;             /* state flags */ 
    unsigned long           dirtied_when;        /* first dirtying time */ 
    unsigned int            i_flags;             /* filesystem flags */ 
    atomic_t                i_writecount;        /* count of writers */ 
    void                    *i_security;         /* security module */ 
    void                    *i_private;          /* fs private pointer */
};

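Every file in a filesystem can be represented by an inode object, but the object is only constructed in memory when the file is accessed. Some fields of the inode relate to special files: i_pipe points to a named pipe data structure, i_bdev to a block device structure, and i_cdev to a character device structure. The three pointers share a union because a given inode refers to at most one of them (zero or one).
A filesystem may not support every property in the inode object; some filesystems, for example, do not record an access timestamp. In that case the filesystem is free to implement the feature however it sees fit (for example by storing zero in the timestamp).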
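The union of i_pipe, i_bdev, and i_cdev can be pictured with a small user-space mock. The struct and function names below are hypothetical; the sketch assumes the file-type bits of i_mode say which union member is meaningful, and tests them with the standard S_ISFIFO, S_ISBLK, and S_ISCHR macros from <sys/stat.h>.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <string.h>
#include <sys/stat.h>

/* Reduced stand-in for struct inode: a mode plus the special-file union. */
struct mock_inode {
    mode_t i_mode;
    union {                 /* the three pointers share the same storage */
        void *i_pipe;
        void *i_bdev;
        void *i_cdev;
    };
};

/* Name the union member that is meaningful for this inode, or "none"
 * for regular files, directories, and other non-special types. */
const char *special_member(const struct mock_inode *inode)
{
    if (S_ISFIFO(inode->i_mode))
        return "i_pipe";
    if (S_ISBLK(inode->i_mode))
        return "i_bdev";
    if (S_ISCHR(inode->i_mode))
        return "i_cdev";
    return "none";
}
```

Because the members overlap, the union costs the size of one pointer rather than three, which matters for a structure that exists once per in-memory file.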

Inode operations

The i_op pointer in the inode points to the table of functions that operate on the inode. The table is defined in <linux/fs.h>:

struct inode_operations 
{ 

    int (*create) (struct inode *,struct dentry *,int, struct nameidata *); 
    struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *); 
    int (*link) (struct dentry *,struct inode *,struct dentry *); 
    int (*unlink) (struct inode *,struct dentry *); 
    int (*symlink) (struct inode *,struct dentry *,const char *);
    int (*mkdir) (struct inode *,struct dentry *,int); 
    int (*rmdir) (struct inode *,struct dentry *); 
    int (*mknod) (struct inode *,struct dentry *,int,dev_t); 
    int (*rename) (struct inode *, struct dentry *,
                   struct inode *, struct dentry *); 
    int (*readlink) (struct dentry *, char __user *,int); 
    void * (*follow_link) (struct dentry *, struct nameidata *); 
    void (*put_link) (struct dentry *, struct nameidata *, void *); 
    void (*truncate) (struct inode *); 
    int (*permission) (struct inode *, int); 
    int (*setattr) (struct dentry *, struct iattr *); 
    int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *); 
    int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); 
    ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); 
    ssize_t (*listxattr) (struct dentry *, char *, size_t); 
    int (*removexattr) (struct dentry *, const char *); 
    void (*truncate_range)(struct inode *, loff_t, loff_t); 
    long (*fallocate)(struct inode *inode, int mode, loff_t offset,
                      loff_t len); 
    int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
    u64 len);
};

Descriptions of some of these functions are excerpted below:

int create(struct inode *dir, struct dentry *dentry, int mode)
The VFS calls this function from the creat() and open() system calls to create a new inode associated with the given dentry object with the specified initial access mode.

struct dentry* lookup(struct inode *dir, struct dentry *dentry)
This function searches a directory for an inode corresponding to a filename specified in the given dentry.

int link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
Invoked by the link() system call to create a hard link of the file old_dentry in the directory dir with the new filename dentry.

int unlink(struct inode *dir, struct dentry *dentry)
Called from the unlink() system call to remove the inode specified by the directory entry dentry from the directory dir.

void *follow_link(struct dentry *dentry, struct nameidata *nd)
Called by the VFS to translate a symbolic link to the inode to which it points.

int permission(struct inode *inode, int mask)
Checks whether the specified access mode is allowed for the file referenced by inode

Dentry object

Dentry is short for directory entry. A dentry is one specific component of a path, and every component of a path is a dentry. The path /bin/vi.txt, for example, contains three dentries: /, bin, and vi.txt.
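The decomposition of a path into dentries can be sketched as plain string splitting. The function path_components below is a made-up helper for this note, not how the kernel walks a path; it only shows which names would each get a dentry, with the root counted as the first component.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split an absolute path into its components, root first.
 * Stores at most max names (each truncated to 63 characters)
 * and returns how many were stored. */
size_t path_components(const char *path, char names[][64], size_t max)
{
    size_t n = 0;
    const char *p = path;

    if (path[0] == '/' && n < max)
        strcpy(names[n++], "/");            /* the root is a dentry too */

    while (n < max) {
        p += strspn(p, "/");                /* skip separators */
        size_t len = strcspn(p, "/");       /* length of next component */
        if (len == 0)
            break;                          /* reached the end of the path */
        size_t copy = len < 63 ? len : 63;
        memcpy(names[n], p, copy);
        names[n][copy] = '\0';
        n++;
        p += len;
    }
    return n;
}
```

For "/bin/vi.txt" the helper yields three names, "/", "bin", and "vi.txt", matching the three dentries listed above.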

The dentry object is represented by struct dentry, defined in <linux/dcache.h>:

struct dentry
{
    atomic_t                 d_count;      /* usage count */ 
    unsigned int             d_flags;      /* dentry flags */ 
    spinlock_t               d_lock;       /* per-dentry lock */ 
    int                      d_mounted;    /* is this a mount point? */ 
    struct inode             *d_inode;     /* associated inode */ 
    struct hlist_node        d_hash;       /* list of hash table entries */ 
    struct dentry            *d_parent;    /* dentry object of parent */ 
    struct qstr              d_name;       /* dentry name */ 
    struct list_head         d_lru;        /* unused list */ 
    union 
    {
        struct list_head     d_child;      /* list of dentries within */ 
        struct rcu_head      d_rcu;        /* RCU locking */
    } d_u; 
    struct list_head         d_subdirs;    /* subdirectories */ 
    struct list_head         d_alias;  /* list of alias inodes */ 
    unsigned long            d_time;       /* revalidate time */ 
    struct dentry_operations *d_op;        /* dentry operations table */ 
    struct super_block       *d_sb;        /* superblock of file */ 
    void                     *d_fsdata;    /* filesystem-specific data */ 
    unsigned char            d_iname[DNAME_INLINE_LEN_MIN]; /* short name */
};

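Because the dentry object has no physical representation on disk, struct dentry has no field that marks the object as modified (there is no need to track whether it is dirty and must be written back to disk).

Dentry states

A dentry is in one of three states: used, unused, or negative.

used:
The dentry corresponds to a valid inode (its d_inode field points to a valid inode) and d_count is positive, meaning one or more users are currently using the dentry.

unused:
The dentry corresponds to a valid inode (d_inode points to a valid inode) but d_count is zero, meaning the VFS is not currently using it. Because it still points to a valid inode object, the dentry is kept in the dentry cache in case it is needed again.

negative:
The dentry is not associated with a valid inode (d_inode is NULL), either because the inode object was destroyed or because the path name that was looked up was wrong. The dentry is still kept in the cache so that later lookups of the same path resolve quickly, straight from the dentry cache.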
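The three states follow mechanically from two fields, d_inode and d_count. A minimal sketch of that rule, using a hypothetical mock_dentry that keeps only those two fields (the enum and function names are likewise made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

enum dentry_state { DENTRY_USED, DENTRY_UNUSED, DENTRY_NEGATIVE };

/* Reduced stand-in for struct dentry. */
struct mock_dentry {
    int   d_count;      /* usage count */
    void *d_inode;      /* associated inode, NULL if there is none */
};

/* Classify a dentry according to the rules described above. */
enum dentry_state dentry_state(const struct mock_dentry *d)
{
    if (d->d_inode == NULL)
        return DENTRY_NEGATIVE;     /* no valid inode behind it */
    return d->d_count > 0 ? DENTRY_USED : DENTRY_UNUSED;
}
```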

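Dentry Cache

The dentry cache consists of three parts:

  • Lists of used dentries: each inode object has an i_dentry field, a doubly linked list of the dentry objects associated with that inode (one inode can have many dentries).
  • A least-recently-used doubly linked list that holds unused and negative dentries. New entries are added at the head, so the tail holds the least recently used entries; when dentries must be freed to reclaim memory, they are removed from the tail.
  • A hash table and hash function: the hash table maps a path to its dentry. It is stored in the dentry_hashtable array, each element of which points to a list of dentries with the same hash value. The hash function computes the hash value from the path; the exact computation is decided by the dentry operation d_hash(), which a filesystem may implement itself.

While a dentry sits in the cache, it keeps the usage count of its inode above zero, so the dentry pins the inode in memory. As long as a dentry is cached, its inode is cached too (in the inode cache, or icache), so whenever the path lookup function hits in the dentry cache, the corresponding inode is guaranteed to be in memory.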
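The hash-table part of the cache can be sketched as an array of chains. Everything here is illustrative: the cached_dentry type, the bucket count, and the djb2-style hash are assumptions made for this note, not the kernel's dentry_hashtable or its real hash function. The sketch keys each entry on the parent dentry plus the component name, so the same name under two different directories lands in distinct entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBUCKETS 64                         /* arbitrary size for the sketch */

struct cached_dentry {
    const char           *name;             /* component name, e.g. "bin" */
    struct cached_dentry *parent;           /* NULL for the root */
    struct cached_dentry *hash_next;        /* next entry in the same bucket */
};

static struct cached_dentry *buckets[NBUCKETS];

/* Illustrative hash: djb2 over the name, mixed with the parent pointer. */
static unsigned hash_name(const struct cached_dentry *parent, const char *name)
{
    unsigned long h = 5381;
    for (; *name; name++)
        h = h * 33 + (unsigned char)*name;
    h ^= (unsigned long)((uintptr_t)parent >> 4);
    return (unsigned)(h % NBUCKETS);
}

/* Insert a dentry at the front of its bucket's chain. */
void dcache_add(struct cached_dentry *d)
{
    unsigned b = hash_name(d->parent, d->name);
    d->hash_next = buckets[b];
    buckets[b] = d;
}

/* Return the cached dentry for (parent, name), or NULL on a miss. */
struct cached_dentry *dcache_lookup(struct cached_dentry *parent,
                                    const char *name)
{
    unsigned b = hash_name(parent, name);
    for (struct cached_dentry *d = buckets[b]; d; d = d->hash_next)
        if (d->parent == parent && strcmp(d->name, name) == 0)
            return d;
    return NULL;
}
```

A hit returns the dentry without touching the filesystem at all; a miss is the point where a real lookup would have to ask the filesystem and then add the result to the cache.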

Dentry operations

The d_op pointer in struct dentry points to the table of functions that operate on the dentry. The table is defined in <linux/dcache.h>:

struct dentry_operations 
{
    int (*d_revalidate) (struct dentry *, struct nameidata *);
    int (*d_hash) (struct dentry *, struct qstr *); 
    int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); 
    int (*d_delete) (struct dentry *); 
    void (*d_release) (struct dentry *); 
    void (*d_iput) (struct dentry *, struct inode *); 
    char *(*d_dname) (struct dentry *, char *, int);
};

Descriptions of some of these functions are excerpted below:

int d_revalidate(struct dentry *dentry, struct nameidata *)
Determines whether the given dentry object is valid.The VFS calls this function whenever it is preparing to use a dentry from the dcache. Most filesystems set this method to NULL because their dentry objects in the dcache are always valid.

int d_hash(struct dentry *dentry, struct qstr *name)
Creates a hash value from the given dentry.

int d_compare(struct dentry *dentry, struct qstr *name1, struct qstr *name2)
Called by the VFS to compare two filenames, name1 and name2. Most filesystems leave this at the VFS default, which is a simple string compare

int d_delete (struct dentry *dentry)
Called by the VFS when the specified dentry object’s d_count reaches zero.This function requires the dcache_lock and the dentry’s d_lock.

void d_release(struct dentry *dentry)
Called by the VFS when the specified dentry is going to be freed.The default function does nothing.

void d_iput(struct dentry *dentry, struct inode *inode)
Called by the VFS when a dentry object loses its associated inode (say, because the entry was deleted from the disk). By default, the VFS simply calls the iput() function to release the inode.

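File object

The file object is the in-memory representation of an open file and is how a process refers to that file. Processes deal directly with file objects and never touch superblocks, inodes, or dentries. Several processes can have the same file open at once, so a single file can correspond to multiple file objects in memory, whereas its inode and dentry are each unique.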
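One visible consequence of "one file, many file objects" is that every open() gets its own file offset. The user-space sketch below opens the same path twice, reads through the first descriptor only, and reports both offsets; offsets_after_read is a hypothetical helper name, while open, read, lseek, and close are the standard POSIX calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Open path twice, read `skip` bytes through the first descriptor,
 * then report the current offset of each descriptor.
 * Returns 0 on success, -1 on error. */
int offsets_after_read(const char *path, size_t skip,
                       off_t *first, off_t *second)
{
    char buf[64];
    int rc = -1;
    int a = open(path, O_RDONLY);       /* first file object  */
    int b = open(path, O_RDONLY);       /* second file object */

    if (a >= 0 && b >= 0 && skip <= sizeof buf &&
        read(a, buf, skip) == (ssize_t)skip) {
        *first  = lseek(a, 0, SEEK_CUR);    /* advanced by the read */
        *second = lseek(b, 0, SEEK_CUR);    /* untouched */
        rc = 0;
    }
    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
    return rc;
}
```

After reading 4 bytes through the first descriptor, the first offset is 4 and the second is still 0: the two descriptors refer to separate file objects, each with its own f_pos, even though both lead to the same inode.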

The file object is represented by struct file, defined in <linux/fs.h>:

struct file
{
    union
    {
        struct list_head   fu_list;       /* list of file objects */ 
        struct rcu_head    fu_rcuhead;    /* RCU list after freeing */
    } f_u;
    struct path            f_path;        /* contains the dentry */ 
    struct file_operations *f_op;         /* file operations table */ 
    spinlock_t             f_lock;        /* per-file struct lock */ 
    atomic_t               f_count;       /* file object’s usage count */ 
    unsigned int           f_flags;       /* flags specified on open */ 
    mode_t                 f_mode;        /* file access mode */ 
    loff_t                 f_pos;         /* file offset (file pointer) */ 
    struct fown_struct     f_owner;       /* owner data for signals */ 
    const struct cred      *f_cred;       /* file credentials */ 
    struct file_ra_state   f_ra;  /* read-ahead state */ 
    u64                    f_version;     /* version number */ 
    void                   *f_security;   /* security module */ 
    void                   *private_data; /* tty driver hook */
    struct list_head       f_ep_links;    /* list of epoll links */
    spinlock_t             f_ep_lock;     /* epoll lock */ 
    struct address_space   *f_mapping;    /* page cache mapping */ 
    unsigned long          f_mnt_write_state; /* debugging state */
};

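Like the dentry object, the file object has no corresponding data on disk, so it has no flag indicating whether it is dirty. The file object points to its dentry through f_path.dentry (also reachable as f_dentry), the dentry points to its inode, and the inode records whether the file itself is dirty.

File operations

The f_op pointer in struct file points to the table of functions that operate on the file. The table is defined in <linux/fs.h>: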

struct file_operations 
{ 
    struct module *owner; 
    loff_t (*llseek) (struct file *, loff_t, int); 
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); 
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); 
    ssize_t (*aio_read) (struct kiocb *, const struct iovec *,
                         unsigned long, loff_t); 
    ssize_t (*aio_write) (struct kiocb *, const struct iovec *,
                          unsigned long, loff_t); 
    int (*readdir) (struct file *, void *, filldir_t); 
    unsigned int (*poll) (struct file *, struct poll_table_struct *); 
    int (*ioctl) (struct inode *, struct file *, unsigned int,
                  unsigned long); 
    long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); 
    long (*compat_ioctl) (struct file *, unsigned int, unsigned long); 
    int (*mmap) (struct file *, struct vm_area_struct *); 
    int (*open) (struct inode *, struct file *); 
    int (*flush) (struct file *, fl_owner_t id); 
    int (*release) (struct inode *, struct file *); 
    int (*fsync) (struct file *, struct dentry *, int datasync); 
    int (*aio_fsync) (struct kiocb *, int datasync); 
    int (*fasync) (int, struct file *, int); 
    int (*lock) (struct file *, int, struct file_lock *); 
    ssize_t (*sendpage) (struct file *, struct page *,
                         int, size_t, loff_t *, int); 
    unsigned long (*get_unmapped_area) (struct file *,
                                        unsigned long,
                                        unsigned long, 
                                        unsigned long, 
                                        unsigned long);
    int (*check_flags) (int); 
    int (*flock) (struct file *, int, struct file_lock *); 
    ssize_t (*splice_write) (struct pipe_inode_info *,
                             struct file *, 
                             loff_t *, 
                             size_t, 
                             unsigned int);
    ssize_t (*splice_read) (struct file *, 
                            loff_t *, 
                            struct pipe_inode_info *, 
                            size_t, 
                            unsigned int);
    int (*setlease) (struct file *, long, struct file_lock **); 
};

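A filesystem can implement its own file operations or use the generic ones. The generic operations normally work correctly on standard Unix-based filesystems.

Descriptions of some of these functions are excerpted below: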

int open(struct inode *inode, struct file *file)
Creates a new file object and links it to the corresponding inode object. It is called by the open() system call.

loff_t llseek(struct file *file, loff_t offset, int origin)
Updates the file pointer to the given offset. It is called via the llseek() system call.

ssize_t read(struct file *file, char *buf, size_t count, loff_t *offset)
Reads count bytes from the given file at position offset into buf.The file pointer is then updated.This function is called by the read() system call.

ssize_t aio_read(struct kiocb *iocb, char *buf, size_t count, loff_t offset)
Begins an asynchronous read of count bytes into buf of the file described in iocb. This function is called by the aio_read() system call.

ssize_t write(struct file *file, const char *buf, size_t count, loff_t *offset)
Writes count bytes from buf into the given file at position offset.The file pointer is then updated.This function is called by the write() system call.

int readdir(struct file *file, void *dirent, filldir_t filldir)
Returns the next directory in a directory listing.This function is called by the readdir() system call.

unsigned int poll(struct file *file, struct poll_table_struct *poll_table)
Sleeps, waiting for activity on the given file. It is called by the poll() system call.

int ioctl(struct inode *inode, struct file *file, unsigned int cmd, unsigned long arg)
Sends a command and argument pair to a device. It is used when the file is an open device node.This function is called from the ioctl() system call. Callers must hold the BKL.

int mmap(struct file *file, struct vm_area_struct *vma)
Memory maps the given file onto the given address space and is called by the mmap() system call.

int flush(struct file *file)
Called by the VFS whenever the reference count of an open file decreases. Its purpose is filesystem-dependent.

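Data structures associated with filesystems

The kernel uses two data structures to manage filesystem-related data: struct file_system_type describes a type of filesystem, and struct vfsmount represents a mounted instance of a filesystem.

file_system_type

Because Linux supports many different filesystems, the kernel needs a dedicated structure to describe the capabilities and behavior of each one. That is the job of struct file_system_type.

file_system_type is defined in <linux/fs.h>: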

struct file_system_type 
{ 
    const char              *name;     /* filesystem’s name */ 
    int                     fs_flags;  /* filesystem type flags */
    struct super_block      *(*get_sb) (struct file_system_type *, int, char *, void *);
    void                    (*kill_sb) (struct super_block *);
    struct module           *owner;    /* module owning the filesystem */ 
    struct file_system_type *next;     /* next file_system_type in list */ 
    struct list_head        fs_supers; /* list of superblock objects */
    struct lock_class_key   s_lock_key; 
    struct lock_class_key   s_umount_key; 
    struct lock_class_key   i_lock_key; 
    struct lock_class_key   i_mutex_key; 
    struct lock_class_key   i_mutex_dir_key; 
    struct lock_class_key   i_alloc_sem_key;
};

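The get_sb() function reads the superblock from disk when the filesystem is mounted and uses the data to populate the in-memory superblock object. Each filesystem type has exactly one file_system_type, no matter how many instances of it are mounted (even zero).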
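The "exactly one file_system_type per type" rule can be illustrated with a tiny user-space registry. This is a mock, not the kernel's registration code: mock_fs_type, mock_register_fs, and mock_find_fs are hypothetical names, the list is a plain singly linked list chained through next, and returning -EBUSY for a duplicate name is an assumption modeled on how kernel interfaces usually report "already registered".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Reduced stand-in for struct file_system_type. */
struct mock_fs_type {
    const char          *name;      /* filesystem's name, e.g. "ext3" */
    struct mock_fs_type *next;      /* next type in the registry */
};

static struct mock_fs_type *fs_types;   /* head of the registry list */

/* Append fs to the registry unless a type with that name already exists. */
int mock_register_fs(struct mock_fs_type *fs)
{
    struct mock_fs_type **p;

    for (p = &fs_types; *p; p = &(*p)->next)
        if (strcmp((*p)->name, fs->name) == 0)
            return -EBUSY;          /* only one entry per filesystem type */
    fs->next = NULL;
    *p = fs;
    return 0;
}

/* Look a type up by name, as a mount request would; NULL if unknown. */
struct mock_fs_type *mock_find_fs(const char *name)
{
    for (struct mock_fs_type *p = fs_types; p; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}
```

The registry entry exists from registration onward, whether or not anything of that type is mounted; mounted instances are a separate matter, tracked per mount.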

vfsmount

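The vfsmount structure is created when a filesystem is mounted. It represents one specific instance of a filesystem, that is, a mount point.

struct vfsmount is defined in <linux/mount.h>: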

struct vfsmount 
{ 
    struct list_head   mnt_hash;        /* hash table list */
    struct vfsmount    *mnt_parent;     /* parent filesystem */ 
    struct dentry      *mnt_mountpoint; /* dentry of this mount point */ 
    struct dentry      *mnt_root;       /* dentry of root of this fs */ 
    struct super_block *mnt_sb;         /* superblock of this filesystem */ 
    struct list_head   mnt_mounts;      /* list of children */ 
    struct list_head   mnt_child;       /* list of children */ 
    int                mnt_flags;       /* mount flags */ 
    char               *mnt_devname;    /* device file name */ 
    struct list_head   mnt_list;        /* list of descriptors */ 
    struct list_head   mnt_expire;      /* entry in expiry list */ 
    struct list_head   mnt_share;       /* entry in shared mounts list */ 
    struct list_head   mnt_slave_list;  /* list of slave mounts */ 
    struct list_head   mnt_slave;       /* entry in slave list */ 
    struct vfsmount    *mnt_master;     /* slave’s master */ 
    struct mnt_namespace *mnt_namespace; /* associated namespace */ 
    int                mnt_id;           /* mount identifier */ 
    int                mnt_group_id;     /* peer group identifier */ 
    atomic_t           mnt_count;        /* usage count */ 
    int                mnt_expiry_mark;  /* is marked for expiration */ 
    int                mnt_pinned;       /* pinned count */ 
    int                mnt_ghosts;       /* ghosts count */ 
    atomic_t           __mnt_writers;    /* writers count */
};

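The vfsmount holds a pointer (mnt_sb) to the superblock object of the filesystem instance.

Data structures associated with a process

A process is tied to the VFS layer through three data structures, files_struct, fs_struct, and mnt_namespace, which record its list of open files, its root filesystem, its current working directory, and so on.

files_struct

The files pointer in the process descriptor points to the files_struct, which is defined in <linux/fdtable.h>: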

struct files_struct 
{ 
    atomic_t               count;              /* usage count */ 
    struct fdtable         *fdt;               /* pointer to other fd table */ 
    struct fdtable         fdtab;              /* base fd table */ 
    spinlock_t             file_lock;          /* per-file lock */ 
    int  next_fd;  /* cache of next available fd */ 
    struct embedded_fd_set close_on_exec_init; /* list of close-on-exec fds */ 
    struct embedded_fd_set open_fds_init;      /* list of open fds */ 
    struct file            *fd_array[NR_OPEN_DEFAULT]; /* base files array */
};

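fd_array points to the list of open files: fd_array[i] points to the file object for file descriptor i. NR_OPEN_DEFAULT is a constant, equal to 64 on a 64-bit machine. When a process opens more files than that, the kernel allocates a new fdtable and makes the fdt pointer refer to it.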
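The "embedded array first, bigger table later" scheme can be sketched in user space. The code is a simplification under stated assumptions: mock_files folds the fdtable into two fields (fd and max_fds), doubling is an arbitrary growth policy, and all names are hypothetical. What it keeps from the description above is that descriptor i indexes the table, and that the table pointer moves away from the embedded array only once the 64 built-in slots are full.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INLINE_FDS 64       /* mirrors NR_OPEN_DEFAULT on a 64-bit machine */

struct mock_files {
    void  **fd;                     /* current table: embedded or heap */
    size_t  max_fds;                /* capacity of the current table */
    void   *fd_array[INLINE_FDS];   /* embedded base array */
};

void files_init(struct mock_files *f)
{
    memset(f->fd_array, 0, sizeof f->fd_array);
    f->fd = f->fd_array;
    f->max_fds = INLINE_FDS;
}

/* Store file in the lowest free slot, growing the table if it is full.
 * Returns the descriptor number, or -1 if memory allocation fails. */
int files_install(struct mock_files *f, void *file)
{
    size_t i;

    for (i = 0; i < f->max_fds; i++)
        if (f->fd[i] == NULL)
            break;

    if (i == f->max_fds) {                      /* table is full: grow it */
        size_t new_max = f->max_fds * 2;
        void **bigger = calloc(new_max, sizeof *bigger);
        if (!bigger)
            return -1;
        memcpy(bigger, f->fd, f->max_fds * sizeof *bigger);
        if (f->fd != f->fd_array)
            free(f->fd);                        /* drop an older heap table */
        f->fd = bigger;
        f->max_fds = new_max;
    }
    f->fd[i] = file;
    return (int)i;
}

void files_destroy(struct mock_files *f)
{
    if (f->fd != f->fd_array)
        free(f->fd);
}
```

The embedded array means a typical process, which opens far fewer than 64 files, never pays for a separate allocation.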

fs_struct

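The fs_struct structure holds filesystem information related to a process. The fs pointer in the process descriptor points to the process's fs_struct.

fs_struct is defined in <linux/fs_struct.h>: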

struct fs_struct 
{ 
    int         users;    /* user count */ 
    rwlock_t    lock;     /* per-structure lock */ 
    int         umask;    /* umask */ 
    int         in_exec;  /* currently executing a file */ 
    struct path root;     /* root directory */ 
    struct path pwd;      /* current working directory */
};

root holds the root directory of the process, and pwd holds its current working directory.
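The two fields matter most at the start of a path lookup: an absolute path is resolved from root, a relative one from pwd. A minimal sketch of that choice, with hypothetical mock_ types standing in for struct path and struct fs_struct:

```c
#include <assert.h>
#include <stddef.h>

struct mock_path {              /* stand-in for struct path */
    const char *name;
};

struct mock_fs {                /* stand-in for struct fs_struct */
    struct mock_path root;      /* root directory of the process */
    struct mock_path pwd;       /* current working directory */
};

/* Pick where a lookup starts: root for absolute paths, pwd otherwise. */
const struct mock_path *lookup_start(const struct mock_fs *fs,
                                     const char *path)
{
    return path[0] == '/' ? &fs->root : &fs->pwd;
}
```

Because root is stored per process, changing it (as chroot() does) changes what "/" means for that process without affecting any other.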

mnt_namespace

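The mnt_namespace structure gives each process its own view of the mounted filesystems. The mnt_namespace field of the process descriptor points to the process's mnt_namespace.

By default all processes on Linux share one namespace. A new namespace is created only when the CLONE_NEWNS flag is passed to clone().

mnt_namespace is defined in <linux/mnt_namespace.h>: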

struct mnt_namespace 
{ 
    atomic_t            count; /* usage count */ 
    struct vfsmount     *root; /* root directory */
    struct list_head    list;  /* list of mount points */ 
    wait_queue_head_t   poll;  /* polling waitqueue */ 
    int                 event; /* event count */
};

list is a doubly linked list that ties together all the mounted filesystems making up the namespace.

References

Linux Kernel Development, 3rd Edition
Understanding the Linux Kernel, 3rd Edition