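系統技術非業餘研究 (Non-amateur Research on Systems Technology) » Misconceptions about clearing the page cache with posix_fadvise, and how to do it properly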

阿新 • Published: 2019-01-13

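In a typical IO-intensive database server such as MySQL, there is a great deal of file reading and writing. These files are usually accessed through buffered IO so that the Linux page cache can be put to full use.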

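With buffered IO, a read first checks whether the page cache already holds the requested data; if not, the data is read from the device and a copy is added to the cache as it is returned to the user. A write goes straight into the cache, and background threads flush it to disk periodically. This mechanism looks very good, and it also works well in practice.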

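But if your IO is very heavy, problems appear. First, because the page size is 4K, memory is used rather inefficiently. Second, the cache eviction algorithm is simple and is run by the operating system on its own, with little room for the user to take part. When you write a lot and exceed some upper limit of system memory, the background thread (kswapd) has to step in to reclaim pages, and once reclaim is slower than the incoming writes, behavior becomes unpredictable.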

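The biggest problem here is this: as long as the memory you use, cache included, stays below the limit set by the operating system, the operating system does nothing and lets the user make full use of the cache, which from its point of view is the most efficient choice. Yet it is exactly this policy that causes problems in practice.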

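Take a MySQL server as an example. We can send the data files through direct IO, but its logs go through buffered IO, because direct IO requires both the offset and the size of every write to be sector-aligned, which is too much trouble for a logging system. MySQL is transactional, so there is a lot of log activity: frequent writes, each followed by fsync. Once a log record has reached the disk, its buffer pages are no longer useful, yet they stay in memory until the memory limit is hit, at which point the operating system suddenly reclaims pages in bulk and causes IO stalls, swapping, and other side effects.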

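Now that we know where the difficulty lies, we can take the initiative and avoid it. There are two ways:
1. Send the logs through direct IO as well. This requires large-scale changes to the MySQL code; Percona, for example, has done this and provides a corresponding patch.
2. Keep the logs on buffered IO, but clear the useless page cache periodically.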

The first approach is not what we want to discuss; we focus on how to do the second:

In our program we know the file descriptor, so can't we simply use:

int posix_fadvise(int fd, off_t offset, off_t len, int advice);
POSIX_FADV_DONTNEED
The specified data will not be accessed in the near future.

to solve the problem?
For example, by writing code like posix_fadvise(fd, 0, len_of_file, POSIX_FADV_DONTNEED); to drop the cache that belongs to the file.

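The vmtouch tool introduced in an earlier post has exactly this feature: it evicts the cache of a given file.
You can try it with vmtouch -ve logfile, but you will find that memory usage does not go down at all. Why?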
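
To check whether an eviction attempt actually worked, it helps to count how many of a file's pages are still resident. The following is a minimal sketch, assuming Linux and the mincore(2) system call; the function name count_resident_pages is hypothetical and is not taken from vmtouch.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Count how many pages of the file at `path` are resident in the page cache.
 * Returns the number of resident pages, or -1 on error.
 * If total_pages is non-NULL it receives the file size in pages.
 * (Hypothetical helper, for illustration only.)
 */
long count_resident_pages(const char *path, long *total_pages)
{
    assert(path != NULL);
    long resident = -1;
    long page_size = sysconf(_SC_PAGESIZE);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    long pages = (long)((st.st_size + page_size - 1) / page_size);
    if (total_pages)
        *total_pages = pages;
    if (pages == 0) {           /* empty file: nothing can be cached */
        close(fd);
        return 0;
    }

    /* Mapping the file does not read it; mincore only reports residency. */
    void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map != MAP_FAILED) {
        unsigned char *vec = malloc((size_t)pages);
        if (vec && mincore(map, (size_t)st.st_size, vec) == 0) {
            resident = 0;
            for (long i = 0; i < pages; i++)
                resident += vec[i] & 1;   /* low bit set: page is resident */
        }
        free(vec);
        munmap(map, (size_t)st.st_size);
    }
    close(fd);
    return resident;
}
```

Calling it on the log file before and after an eviction attempt shows directly whether the resident page count went down.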

Let's look at the code to see how posix_fadvise works.
See mm/fadvise.c:

/*
 * POSIX_FADV_WILLNEED could set PG_Referenced, and POSIX_FADV_NOREUSE could
 * deactivate the pages and clear PG_Referenced.
 */
SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
{
...
	case POSIX_FADV_DONTNEED:
		if (!bdi_write_congested(mapping->backing_dev_info))
			filemap_flush(mapping);

		/* First and last FULL page! */
		start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
		end_index = (endbyte >> PAGE_CACHE_SHIFT);

		if (end_index >= start_index)
			invalidate_mapping_pages(mapping, start_index,
						end_index);
		break;
...
}
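
The index arithmetic in the DONTNEED case can be tried out in user space. The sketch below mirrors the two shift expressions from the excerpt. It assumes that endbyte is the inclusive last byte of the range (offset + len - 1), which the excerpt elides, and the names dontneed_page_range and page_shift are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive range of page indexes; the range is empty when end < start. */
struct page_range {
    int64_t start;
    int64_t end;
};

/*
 * Mirror the index arithmetic of the POSIX_FADV_DONTNEED case.
 * offset and len are in bytes (len must be > 0); page_shift is
 * log2(page size), for example 12 for 4K pages.
 */
struct page_range dontneed_page_range(int64_t offset, int64_t len,
                                      unsigned page_shift)
{
    assert(offset >= 0 && len > 0 && page_shift < 32);
    int64_t page_size = (int64_t)1 << page_shift;
    int64_t endbyte = offset + len - 1;   /* assumption: inclusive last byte */
    struct page_range r;

    r.start = (offset + (page_size - 1)) >> page_shift; /* round start up */
    r.end = endbyte >> page_shift;      /* page that holds the last byte */
    return r;
}
```

With 4K pages, a range starting at byte 100 skips page 0 because that page is only partly covered, and a range that lies entirely inside one page but does not start on a page boundary yields an empty result. With the end expression as excerpted, the page holding the last byte is included even when it is only partly covered.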

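We can see that if the backing device is not congested, filemap_flush(mapping) is called first to flush the dirty pages, and then invalidate_mapping_pages is called to drop the pages. Let's first see how the flush is done, in mm/filemap.c: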

/**
 * filemap_flush - mostly a non-blocking flush
 * @mapping:    target address_space
 *
 * This is a mostly non-blocking flush.  Not suitable for data-integrity
 * purposes - I/O may not be started against all dirty pages.
 */
int filemap_flush(struct address_space *mapping)
{
	return __filemap_fdatawrite(mapping, WB_SYNC_NONE);
}

/**
 * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
 * @mapping:    address space structure to write
 * @start:      offset in bytes where the range starts
 * @end:        offset in bytes where the range ends (inclusive)
 * @sync_mode:  enable synchronous operation
 *
 * Start writeback against all of a mapping's dirty pages that lie
 * within the byte offsets <start, end> inclusive.
 *
 * If sync_mode is WB_SYNC_ALL then this is a "data integrity" operation, as
 * opposed to a regular memory cleansing writeback.  The difference between
 * these two operations is that if a dirty page/buffer is encountered, it must
 * be waited upon, and not just skipped over.
 */
int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
				loff_t end, int sync_mode)
{
	int ret;
	struct writeback_control wbc = {
		.sync_mode = sync_mode,
		.nr_to_write = LONG_MAX,
		.range_start = start,
		.range_end = end,
	};

	if (!mapping_cap_writeback_dirty(mapping))
		return 0;

	ret = do_writepages(mapping, &wbc);
	return ret;
}

int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
				loff_t end)
{
	return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
}

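We can see that it flushes with the WB_SYNC_NONE argument, which means it does not wait synchronously for the page writeback to finish.
fsync and fdatasync, on the other hand, eventually call filemap_fdatawrite_range with WB_SYNC_ALL and return only once the writeback has completed.
Let's check the code in mm/page-writeback.c to confirm: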

int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
...
        if (mapping->a_ops->writepages)
                ret = mapping->a_ops->writepages(mapping, wbc);
        else
                ret = generic_writepages(mapping, wbc);
        return ret;
}

int generic_writepages(struct address_space *mapping,
                       struct writeback_control *wbc)
{
        /* deal with chardevs and other special file */
        if (!mapping->a_ops->writepage)
                return 0;

        return write_cache_pages(mapping, wbc, __writepage, mapping);
}

int write_cache_pages(struct address_space *mapping,
                      struct writeback_control *wbc, writepage_t writepage,
                      void *data)
{
...
                        /*
                         * We stop writing back only if we are not doing
                         * integrity sync. In case of integrity sync we have to
                         * keep going until we have written all the pages
                         * we tagged for writeback prior to entering this loop.
                         */
                        if (--wbc->nr_to_write <= 0 &&
                            wbc->sync_mode == WB_SYNC_NONE) {
                                done = 1;
                                break;
                        }
                }
                pagevec_release(&pvec);
                cond_resched();

...
}

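From the code and the comment we can see that in WB_SYNC_NONE mode the dirty pages are submitted for writing and the call then returns; it really does not wait for the writeback to finish.
That makes it clear how dirty pages are flushed. Now for the second step, dropping the pages from memory.
See the implementation in mm/truncate.c: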

/**
 * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
 * @mapping: the address_space which holds the pages to invalidate
 * @start: the offset 'from' which to invalidate
 * @end: the offset 'to' which to invalidate (inclusive)
 *
 * This function only removes the unlocked pages, if you want to
 * remove all the pages of one inode, you must call truncate_inode_pages.
 *
 * invalidate_mapping_pages() will not block on IO activity. It will not
 * invalidate pages which are dirty, locked, under writeback or mapped into
 * pagetables.
 */
unsigned long invalidate_mapping_pages(struct address_space *mapping,
				       pgoff_t start, pgoff_t end)
{
	...
	pagevec_init(&pvec, 0);
	while (next <= end &&
			pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
		mem_cgroup_uncharge_start();
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];
			...
			ret += invalidate_inode_page(page);
                       ...
		}
		pagevec_release(&pvec);
		mem_cgroup_uncharge_end();
		cond_resched();
	}
	return ret;
}

/*
 * Safely invalidate one page from its pagecache mapping.
 * It only drops clean, unused pages. The page must be locked.
 *
 * Returns 1 if the page is successfully invalidated, otherwise 0.
 */
int invalidate_inode_page(struct page *page)
{
	struct address_space *mapping = page_mapping(page);
	if (!mapping)
		return 0;
	if (PageDirty(page) || PageWriteback(page))
		return 0;
	if (page_mapped(page))
		return 0;
	return invalidate_complete_page(mapping, page);
}

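From the comment above we can see that a page must satisfy two conditions to be dropped: 1. it is not dirty (and not under writeback); 2. it is not in use (not mapped into any page table).
If both conditions hold, invalidate_complete_page is called to continue: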

/*
 * This is for invalidate_mapping_pages().  That function can be called at
 * any time, and is not supposed to throw away dirty pages.  But pages can
 * be marked dirty at any time too, so use remove_mapping which safely
 * discards clean, unused pages.
 *
 * Returns non-zero if the page was successfully invalidated.
 */
static int
invalidate_complete_page(struct address_space *mapping, struct page *page)
{
	int ret;

	if (page->mapping != mapping)
		return 0;

	if (page_has_private(page) && !try_to_release_page(page, 0))
		return 0;

	clear_page_mlock(page);
	ret = remove_mapping(mapping, page);

	return ret;
}

We see that invalidate_complete_page, once some further conditions are met, goes on to call remove_mapping:

/*
 * Attempt to detach a locked page from its ->mapping.  If it is dirty or if
 * someone else has a ref on the page, abort and return 0.  If it was
 * successfully detached, return 1.  Assumes the caller has a single ref on
 * this page.
 */
int remove_mapping(struct address_space *mapping, struct page *page)
{
	if (__remove_mapping(mapping, page)) {
		/*
		 * Unfreezing the refcount with 1 rather than 2 effectively
		 * drops the pagecache ref for us without requiring another
		 * atomic operation.
		 */
		page_unfreeze_refs(page, 1);
		return 1;
	}
	return 0;
}
/*
 * Same as remove_mapping, but if the page is removed from the mapping, it
 * gets returned with a refcount of 0.
 */
static int __remove_mapping(struct address_space *mapping, struct page *page)
{
	BUG_ON(!PageLocked(page));
	BUG_ON(mapping != page_mapping(page));

	spin_lock_irq(&mapping->tree_lock);
	/*
	 * The non racy check for a busy page.
	 *
	 * Must be careful with the order of the tests. When someone has
	 * a ref to the page, it may be possible that they dirty it then
	 * drop the reference. So if PageDirty is tested before page_count
	 * here, then the following race may occur:
	 *
	 * get_user_pages(&page);
	 * [user mapping goes away]
	 * write_to(page);
	 *				!PageDirty(page)    [good]
	 * SetPageDirty(page);
	 * put_page(page);
	 *				!page_count(page)   [good, discard it]
	 *
	 * [oops, our write_to data is lost]
	 *
	 * Reversing the order of the tests ensures such a situation cannot
	 * escape unnoticed. The smp_rmb is needed to ensure the page->flags
	 * load is not satisfied before that of page->_count.
	 *
	 * Note that if SetPageDirty is always performed via set_page_dirty,
	 * and thus under tree_lock, then this ordering is not required.
	 */
	if (!page_freeze_refs(page, 2))
		goto cannot_free;
	/* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
	if (unlikely(PageDirty(page))) {
		page_unfreeze_refs(page, 2);
		goto cannot_free;
	}

	if (PageSwapCache(page)) {
		swp_entry_t swap = { .val = page_private(page) };
		__delete_from_swap_cache(page);
		spin_unlock_irq(&mapping->tree_lock);
		swapcache_free(swap, page);
	} else {
		void (*freepage)(struct page *);

		freepage = mapping->a_ops->freepage;

		__remove_from_page_cache(page);
		spin_unlock_irq(&mapping->tree_lock);
		mem_cgroup_uncharge_cache_page(page);

		if (freepage != NULL)
			freepage(page);
	}

	return 1;

cannot_free:
	spin_unlock_irq(&mapping->tree_lock);
	return 0;
}

At this point it becomes clear why the memory was not released: the pages still being dirty is the key factor.

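But how do we make sure that none of the pages are dirty? fdatasync and fsync are both options, and so is sync_file_range, a newer Linux-specific system call, when it is called with its wait flags. These use WB_SYNC_ALL mode and return only after the writeback has completed.
For example: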

fdatasync(fd);
posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
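
The two calls above can be wrapped into a small helper with error handling. This is a minimal sketch; the name flush_and_drop_cache is hypothetical. Note that fdatasync reports failure through errno, while posix_fadvise returns the error number directly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write back a file's dirty pages, then ask the kernel to drop its cached
 * pages. Returns 0 on success or an errno-style error number on failure.
 * (Hypothetical helper, for illustration only.)
 */
int flush_and_drop_cache(int fd)
{
    /* Step 1: wait until the dirty pages have reached the disk, so that
     * the invalidation pass does not skip them. */
    if (fdatasync(fd) != 0)
        return errno;

    /* Step 2: offset 0 with len 0 covers the file through to its end. */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

A log writer could call this after each batch of fsync-ed records, or only every few megabytes to keep the overhead low. Pages that are mapped or otherwise still referenced may remain cached even after a successful call.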

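One more question remains: how can operations staff clear this cache? After all, they cannot write code to intervene in the program's behavior. My initial suggestions are:
1. Ideally, write a systemtap script that writes back the pages owned by these files with sync_file_range, and then evict them with vmtouch.
2. Simply run sudo sysctl vm.drop_caches=1 periodically to force out the unused page cache. This trick is rather risky, though, and is not recommended.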

For reference, vm.drop_caches is documented as follows:

Writing to this will cause the kernel to drop clean caches, dentries and inodes from memory, causing that memory to become free.
To free pagecache:
* echo 1 > /proc/sys/vm/drop_caches

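The code above is based on Linux 2.6.37, which differs a little from the 2.6.18 running on our production machines, but after checking, the overall logic is the same.
To verify the code analysis and conclusions above, we set up an experimental environment, construct the scenario, and let the data speak: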

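1. Our physical machine has 24G of memory. We create a 4G file and keep all of its pages in memory, dirty, without letting the operating system write them to disk on its own.
2. The file system is ext3, whose default is ordered mode. In that mode the file system starts kjournald, which writes our pages to disk, so on the advice of my colleague 伯瑜 we used writeback mode instead.
3. Raise the VM's dirty_ratio and dirty_background_ratio to 90, and dirty_expire_centisecs and dirty_writeback_centisecs to about one hour, so that pdflush does not come out and interfere.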

Here are the key parameters:

$uname -r
2.6.18-164.el5

$free -m
             total       used       free     shared    buffers     cached
Mem:         24098       5207      18890          0        119        467
-/+ buffers/cache:       4620      19477
Swap:         8189        582       7606

$sysctl -a|grep vm.dirty
vm.dirty_expire_centisecs = 359945
vm.dirty_writeback_centisecs = 359945
vm.dirty_ratio = 90
vm.dirty_background_ratio = 90

$mount
...
/dev/sda12 on /u02 type ext3 (rw,data=writeback)

$pwd
/u02

Next, clear all caches and buffers, then create a 4G data file and watch how system memory changes in the meantime:

$sudo sysctl vm.drop_caches=3 
vm.drop_caches = 3

$sudo dd if=/dev/zero of=large bs=4M count=1024
1024+0 records in
1024+0 records out
4294967296 bytes (4.3 GB) copied, 6.68751 seconds, 642 MB/s

Watch the memory in another terminal:

$ watch -n 1 'cat /proc/meminfo'

Every 1.0s: cat /proc/meminfo                                                                                                                   Tue Dec 13 18:09:06 2011

MemTotal:     24676836 kB
MemFree:      15294416 kB
Buffers:         59880 kB
Cached:        4466468 kB
SwapCached:        152 kB
Active:        4849772 kB
Inactive:      4246320 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:     24676836 kB
LowFree:      15294416 kB
SwapTotal:     8385760 kB
SwapFree:      7789424 kB
Dirty:         4382608 kB
Writeback:           0 kB
AnonPages:     4569712 kB
Mapped:          88080 kB
Slab:           196736 kB
PageTables:      33444 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  20724176 kB
Committed_AS: 26063060 kB
VmallocTotal: 34359738367 kB
VmallocUsed:    268664 kB
VmallocChunk: 34359469543 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

And watch the IO in yet another terminal:

$iostat -dx 1
Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda4              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda5              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda6              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda7              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda8              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda9              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda10             0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda11             0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda12             0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00    48.00  0.00  4.00     0.00   416.00   104.00     0.00    0.25   0.25   0.10
sdb1              0.00    48.00  0.00  4.00     0.00   416.00   104.00     0.00    0.25   0.25   0.10

While this is running you can also open top to check whether kjournald or pdflush comes out to interfere.

The numbers we can see now are:
Cached: 4466468 kB
Dirty: 4382608 kB
Writeback: 0 kB
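iostat shows no IO reads or writes on sda12, the device backing /u02, just as expected!
At this point we have confirmed that the environment is set up. To be continued!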
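
Instead of reading the watch output by eye, the three counters can be pulled out programmatically. The following is a minimal sketch, assuming the "Name:  value kB" layout shown in the /proc/meminfo output above; the function name meminfo_field_kb is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one field (for example "Dirty") in /proc/meminfo-style text
 * read from fp. Returns the value in kB, or -1 if the field is missing.
 * (Hypothetical helper, for illustration only.)
 */
long meminfo_field_kb(FILE *fp, const char *name)
{
    assert(fp != NULL && name != NULL);
    char line[256];
    size_t n = strlen(name);

    while (fgets(line, sizeof line, fp)) {
        long kb;
        /* Require the colon right after the name, so that "Cached"
         * does not match "SwapCached". */
        if (strncmp(line, name, n) == 0 && line[n] == ':' &&
            sscanf(line + n + 1, "%ld", &kb) == 1)
            return kb;
    }
    return -1;
}
```

To use it on a live system, open the file with fopen("/proc/meminfo", "r"), call the function once per field (rewinding in between), and compare Dirty and Cached before and after an eviction attempt.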

TODO!!!

Have fun!
