Linux poll and epoll: a kernel source analysis

阿新 • Published: 2018-11-19


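How to use poll and epoll needs no further introduction. When the number of fds is large, epoll is more efficient than poll.

Let us go through the kernel source to find out exactly why.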

poll analysis
The poll system call:
int poll(struct pollfd *fds, nfds_t nfds, int timeout);
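As a reminder of what this signature looks like from user space, here is a minimal sketch. The helper name poll_pipe_once is made up for this illustration, and a pipe stands in for whatever fd a real program would watch.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if poll() reports POLLIN on the read end
 * of a pipe that holds one unread byte, 0 if not, -1 on setup failure. */
int poll_pipe_once(void)
{
    int p[2];
    struct pollfd pfd;
    int n, result;

    if (pipe(p) != 0)
        return -1;
    if (write(p[1], "x", 1) != 1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    pfd.fd = p[0];
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* The pollfd array is handed to the kernel again on every call. */
    n = poll(&pfd, 1, 1000 /* timeout in ms */);
    assert(n <= 1);   /* poll never reports more fds than were passed */
    result = (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;

    close(p[0]);
    close(p[1]);
    return result;
}
```

Calling poll_pipe_once() from main should report the pipe as readable. The point to notice is that the array and its length travel with every call, which is what the kernel code that follows has to deal with.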
In kernel 2.6.9 the corresponding implementation is:
[fs/select.c -->sys_poll]
456 asmlinkage long sys_poll(struct pollfd __user * ufds, unsigned int nfds, long timeout)
457 {
458 struct poll_wqueues table;
459 int fdcount, err;
460 unsigned int i;
461 struct poll_list *head;
462 struct poll_list *walk;
463
464 /* Do a sanity check on nfds ... */ /* the user-supplied nfds is rejected when it exceeds both the fd-set capacity of the process's files structure and OPEN_MAX (256) */
465 if (nfds > current->files->max_fdset && nfds > OPEN_MAX)
466 return -EINVAL;
467
468 if (timeout) {
469 /* Careful about overflow in the intermediate values */
470 if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ)
471 timeout = (unsigned long)(timeout*HZ+999)/1000+1;
472 else /* Negative or overflow */
473 timeout = MAX_SCHEDULE_TIMEOUT;
474 }
475
476 poll_initwait(&table);
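poll_initwait is the key call here. As the name suggests, it initializes the variable table, and table plays a central role for the whole duration of the poll call.
table is a struct poll_wqueues; the struct poll_table embedded in it (the member pt) contains nothing but a function pointer:
[include/linux/poll.h]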
16 /*
17 * structures and helpers for f_op->poll implementations
18 */
19 typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct
poll_table_struct *);
20
21 typedef struct poll_table_struct {
22 poll_queue_proc qproc;
23 } poll_table;
Now let us see what poll_initwait actually does:
[fs/select.c]
57 void __pollwait(struct file *filp, wait_queue_head_t *wait_address, poll_table *p);
58
59 void poll_initwait(struct poll_wqueues *pwq)
60 {
61 pwq->pt.qproc = __pollwait; /* this line has been expanded by hand for readability */
62 pwq->error = 0;
63 pwq->table = NULL;
64 }
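Clearly, the main job of poll_initwait is to set the callback in table's poll_table member to __pollwait. __pollwait is needed not only by the poll system call; select uses the very same __pollwait. Put plainly, it is the kernel's standard callback for this kind of multiplexed waiting. epoll does not use it, however: it installs a callback of its own in order to run efficiently. More on that later.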
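The pattern here (a table that holds only a function pointer, embedded in a larger bookkeeping structure that the callback recovers from the pointer) can be sketched in user space. Everything prefixed sim_ or SIM_ is invented for this illustration and only mirrors the shape of poll_table, poll_wqueues, poll_initwait and __pollwait; it is not kernel code.

```c
#include <assert.h>
#include <stddef.h>

struct sim_poll_table;
typedef void (*sim_queue_proc)(int fd, struct sim_poll_table *pt);

/* Like poll_table: nothing but a function pointer. */
struct sim_poll_table {
    sim_queue_proc qproc;
};

/* Like poll_wqueues: embeds the table plus per-call bookkeeping. */
struct sim_wqueues {
    struct sim_poll_table pt;
    int error;
    int nqueued;
};

#define SIM_CONTAINER_OF(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for __pollwait: recover the enclosing structure and record
 * that the caller was queued on one more wait queue. */
static void sim_pollwait(int fd, struct sim_poll_table *pt)
{
    struct sim_wqueues *pwq = SIM_CONTAINER_OF(pt, struct sim_wqueues, pt);

    (void)fd;
    pwq->nqueued++;
}

/* Stand-in for poll_initwait. */
static void sim_initwait(struct sim_wqueues *pwq)
{
    pwq->pt.qproc = sim_pollwait;
    pwq->error = 0;
    pwq->nqueued = 0;
}

/* Stand-in for a driver's poll method calling poll_wait. */
static void sim_driver_poll(int fd, struct sim_poll_table *pt)
{
    if (pt)
        pt->qproc(fd, pt);
}

/* Runs the "driver poll" for nfds fds and returns how many times the
 * callback fired. */
int sim_register_all(int nfds)
{
    struct sim_wqueues table;
    int fd;

    sim_initwait(&table);
    for (fd = 0; fd < nfds; fd++)
        sim_driver_poll(fd, &table.pt);
    assert(table.error == 0);
    return table.nqueued;
}
```

Because the driver only ever sees the function pointer, swapping in a different callback changes what "waiting" means without touching any driver; this is the hook epoll relies on later.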
We set aside the implementation of __pollwait for now and continue with sys_poll:
[fs/select.c -->sys_poll]
478 head = NULL;
479 walk = NULL;
480 i = nfds;
481 err = -ENOMEM;
482 while(i!=0) {
483 struct poll_list *pp;
484 pp = kmalloc(sizeof(struct poll_list)+
485 sizeof(struct pollfd)*
486 (i>POLLFD_PER_PAGE?POLLFD_PER_PAGE:i),
487 GFP_KERNEL);
488 if(pp==NULL)
489 goto out_fds;
490 pp->next=NULL;
491 pp->len = (i>POLLFD_PER_PAGE?POLLFD_PER_PAGE:i);
492 if (head == NULL)
493 head = pp;
494 else
495 walk->next = pp;
496
497 walk = pp;
498 if (copy_from_user(pp->entries, ufds + nfds-i,
499 sizeof(struct pollfd)*pp->len)) {
500 err = -EFAULT;
501 goto out_fds;
502 }
503 i -= pp->len;
504 }
505 fdcount = do_poll(nfds, head, &table, timeout);
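All this code builds a linked list. Each node is at most one page in size (usually 4 KB) and is managed through a pointer to struct poll_list; the struct pollfd entries themselves are reached through the entries member of struct poll_list. The loop above copies the user-space struct pollfd array into these entries. A typical program polls only a handful of fds, so the list usually needs just one node, that is, one page. But when the user passes many fds, poll has to copy every struct pollfd into the kernel on every call, so argument copying and page allocation become the first performance bottleneck of the poll system call.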
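The chunking loop can be replayed in user space to see how many nodes a given nfds needs. This is a sketch under assumptions: sim_poll_list and sim_copy_in_chunks are names invented here, malloc and memcpy stand in for kmalloc and copy_from_user, and the per-node capacity is a parameter rather than the kernel's POLLFD_PER_PAGE.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>
#include <string.h>

struct sim_poll_list {
    struct sim_poll_list *next;
    unsigned int len;
    struct pollfd entries[];   /* trailing array, filled per node */
};

/* Copies nfds entries from src into a chain of nodes holding at most
 * per_node entries each. Returns the number of nodes used, or -1 on
 * allocation failure. The chain is freed before returning. */
int sim_copy_in_chunks(const struct pollfd *src, unsigned int nfds,
                       unsigned int per_node)
{
    struct sim_poll_list *head = NULL, *walk = NULL, *pp;
    unsigned int i = nfds;
    int nodes = 0;

    assert(per_node > 0);
    while (i != 0) {
        unsigned int len = i > per_node ? per_node : i;

        pp = malloc(sizeof(*pp) + sizeof(struct pollfd) * len);
        if (pp == NULL) {
            nodes = -1;
            break;
        }
        pp->next = NULL;
        pp->len = len;
        if (head == NULL)
            head = pp;
        else
            walk->next = pp;
        walk = pp;

        /* Same offset arithmetic as ufds + nfds - i in sys_poll. */
        memcpy(pp->entries, src + (nfds - i), sizeof(struct pollfd) * len);
        i -= len;
        nodes++;
    }

    while (head != NULL) {
        pp = head->next;
        free(head);
        head = pp;
    }
    return nodes;
}
```

With an illustrative capacity of 510 entries per node, 1000 fds need two nodes and a handful of fds need one. All of that allocation and copying is repeated on every poll call.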
The last statement calls do_poll, so we follow it:
[fs/select.c-->sys_poll()-->do_poll()]
395 static void do_pollfd(unsigned int num, struct pollfd * fdpage,
396 poll_table ** pwait, int *count)
397 {
398 int i;
399
400 for (i = 0; i < num; i++) {
401 int fd;
402 unsigned int mask;
403 struct pollfd *fdp;
404
405 mask = 0;
406 fdp = fdpage+i;
407 fd = fdp->fd;
408 if (fd >= 0) {
409 struct file * file = fget(fd);
410 mask = POLLNVAL;
411 if (file != NULL) {
412 mask = DEFAULT_POLLMASK;
413 if (file->f_op && file->f_op->poll)
414 mask = file->f_op->poll(file, *pwait);
415 mask &= fdp->events | POLLERR | POLLHUP;
416 fput(file);
417 }
418 if (mask) {
419 *pwait = NULL;
420 (*count)++;
421 }
422 }
423 fdp->revents = mask;
424 }
425 }
426
427 static int do_poll(unsigned int nfds, struct poll_list *list,
428 struct poll_wqueues *wait, long timeout)
429 {
430 int count = 0;
431 poll_table* pt = &wait->pt;
432
433 if (!timeout)
434 pt = NULL;
435
436 for (;;) {
437 struct poll_list *walk;
438 set_current_state(TASK_INTERRUPTIBLE);
439 walk = list;
440 while(walk != NULL) {
441 do_pollfd( walk->len, walk->entries, &pt, &count);
442 walk = walk->next;
443 }
444 pt = NULL;
445 if (count || !timeout || signal_pending(current))
446 break;
447 count = wait->error;
448 if (count)
449 break;
450 timeout = schedule_timeout(timeout); /* put current to sleep and let other processes run; current resumes when it is woken or the timeout expires */
451 }
452 __set_current_state(TASK_RUNNING);
453 return count;
454 }
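Note set_current_state on line 438 and signal_pending on line 445. Together they ensure that while a program is sleeping inside poll, sending it a signal makes it leave the poll call promptly; a task sleeping in TASK_UNINTERRUPTIBLE, by contrast, is not woken by signals.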
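Overall, do_poll waits in a loop and leaves it once count is greater than 0 (or on timeout, a pending signal, or an error), and count is maintained mainly by do_pollfd.
Note lines 440-443: when the user passes many fds (say 1000), do_pollfd has to walk every one of them on each pass through the loop, and this is the second reason poll becomes a bottleneck.
For each fd passed in, do_pollfd calls that fd's own poll method. Simplified, the call sequence is:
struct file *file = fget(fd);
file->f_op->poll(file, &(table->pt));
If the fd refers to a socket, do_pollfd calls the poll implemented by the network driver; if it refers to an open file on an ext3 file system, do_pollfd calls the poll implemented by the ext3 driver. In short, file->f_op->poll is implemented by the device driver. What does a driver's poll usually look like? The standard implementation calls poll_wait, that is, it invokes the callback in struct poll_table with the device's own wait queue as the argument (devices normally have their own wait queue; a device that offered no way to wait for events would be frustrating to use).
As a representative driver, consider the socket code for TCP:
[net/ipv4/tcp.c-->tcp_poll]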
329 unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
330 {
331 unsigned int mask;
332 struct sock *sk = sock->sk;
333 struct tcp_opt *tp = tcp_sk(sk);
334
335 poll_wait(file, sk->sk_sleep, wait);
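That is all the code we need; the rest merely checks state and returns the status mask. The core of tcp_poll is poll_wait, and poll_wait simply invokes the callback stored in struct poll_table. For the poll system call that callback is __pollwait, so here tcp_poll can almost be read as a single statement:
__pollwait(file, sk->sk_sleep, wait);
This also shows that every socket carries its own wait queue, sk_sleep, so the "device wait queue" mentioned above is in fact more than one queue.
Now let us look at the implementation of __pollwait:
[fs/select.c-->__pollwait()]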
89 void __pollwait(struct file *filp, wait_queue_head_t *wait_address, poll_table *_p)
90 {
91 struct poll_wqueues *p = container_of(_p, struct poll_wqueues, pt);
92 struct poll_table_page *table = p->table;
93
94 if (!table || POLL_TABLE_FULL(table)) {
95 struct poll_table_page *new_table;
96
97 new_table = (struct poll_table_page *) __get_free_page(GFP_KERNEL);
98 if (!new_table) {
99 p->error = -ENOMEM;
100 __set_current_state(TASK_RUNNING);
101 return;
102 }
103 new_table->entry = new_table->entries;
104 new_table->next = table;
105 p->table = new_table;
106 table = new_table;
107 }
108
109 /* Add a new entry */
110 {
111 struct poll_table_entry * entry = table->entry;
112 table->entry = entry+1;
113 get_file(filp);
114 entry->filp = filp;
115 entry->wait_address = wait_address;
116 init_waitqueue_entry(&entry->wait, current);
117 add_wait_queue(wait_address,&entry->wait);
118 }
119 }

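__pollwait builds the data structure shown in the figure that accompanied the original post (one call to __pollwait, i.e. one call to a device's poll, creates exactly one poll_table_entry), and through the wait member of struct poll_table_entry it puts current on the device's wait queue. The wait queue here is wait_address, which corresponds to sk->sk_sleep in tcp_poll.
We can now review how the poll system call works: it registers the callback __pollwait, initializes the table variable (of type struct poll_wqueues), copies in the struct pollfd array passed by the user (essentially the fds), and then calls the poll method of every fd in turn (putting current on the wait queue of the device behind each fd). After a device receives a message (network device) or finishes filling in file data (disk device), it wakes the processes on its wait queue, and current is woken. What current does after waking in order to leave sys_poll is comparatively simple and is not analyzed line by line here.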

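How epoll works, in brief

The analysis above identified the two bottlenecks of poll; the question now is how to remove them. First, copying 1000 fds into the kernel on every poll call is wasteful: why does the kernel not keep the fds it has already copied? That is exactly what epoll does, and its API already says so: fds are not passed in at epoll_wait time; they are handed to the kernel beforehand through epoll_ctl and then waited on together, which eliminates the redundant copying. Second, epoll_wait does not add current to the wait queue of each fd's device in turn. Instead, when a device wait queue is woken, a callback is invoked (which requires a "wakeup callback" mechanism) that puts the fd that produced the event on a list, and epoll_wait returns the fds on that list.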
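The difference in the API is easy to see in a small user-space sketch. The helper name is invented here, a pipe stands in for a real fd, and epoll_create1(0) is used in place of the older epoll_create(size) discussed in this article.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: registers the read end of a pipe once, then waits
 * twice. Returns how many of the two epoll_wait calls reported the fd,
 * or -1 on failure. */
int epoll_register_once_wait_twice(void)
{
    int p[2], epfd, hits = 0, round;
    struct epoll_event ev = {0}, got;
    char c;

    if (pipe(p) != 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    /* The fd is handed to the kernel exactly once. */
    ev.events = EPOLLIN;
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) != 0) {
        hits = -1;
        goto done;
    }

    for (round = 0; round < 2; round++) {
        if (write(p[1], "x", 1) != 1) {
            hits = -1;
            goto done;
        }
        /* No fd list is passed here: the kernel already has it. */
        if (epoll_wait(epfd, &got, 1, 1000) == 1 && got.data.fd == p[0])
            hits++;
        if (read(p[0], &c, 1) != 1) {
            hits = -1;
            goto done;
        }
    }
    assert(hits <= 2);

done:
    close(epfd);
    close(p[0]);
    close(p[1]);
    return hits;
}
```

Compare this with poll, where the full pollfd array would have to accompany both waits.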
epoll analysis
epoll is a module, so we start with the module entry point, eventpoll_init:
[fs/eventpoll.c-->eventpoll_init()]
1582 static int __init eventpoll_init(void)
1583 {
1584 int error;
1585
1586 init_MUTEX(&epsem);
1587
1588 /* Initialize the structure used to perform safe poll wait head wake ups */
1589 ep_poll_safewake_init(&psw);
1590
1591 /* Allocates slab cache used to allocate "struct epitem" items */
1592 epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
1593 0, SLAB_HWCACHE_ALIGN|EPI_SLAB_DEBUG|SLAB_PANIC,
1594 NULL, NULL);
1595
1596 /* Allocates slab cache used to allocate "struct eppoll_entry" */
1597 pwq_cache = kmem_cache_create("eventpoll_pwq",
1598 sizeof(struct eppoll_entry), 0,
1599 EPI_SLAB_DEBUG|SLAB_PANIC, NULL, NULL);
1600
1601 /*
1602 * Register the virtual file system that will be the source of inodes
1603 * for the eventpoll files
1604 */
1605 error = register_filesystem(&eventpoll_fs_type);
1606 if (error)
1607 goto epanic;
1608
1609 /* Mount the above commented virtual file system */
1610 eventpoll_mnt = kern_mount(&eventpoll_fs_type);
1611 error = PTR_ERR(eventpoll_mnt);
1612 if (IS_ERR(eventpoll_mnt))
1613 goto epanic;
1614
1615 DNPRINTK(3, (KERN_INFO "[%p] eventpoll: successfully initialized.\n",
1616 current));
1617 return 0;
1618
1619 epanic:
1620 panic("eventpoll_init() failed\n");
1621 }
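Interestingly, at initialization this module registers a new file system called "eventpollfs" (see the eventpoll_fs_type structure) and then mounts it. It also creates two kernel caches (in kernel programming, code that frequently allocates small blocks of memory should create a kmem_cache to serve as a memory pool), one for struct epitem and one for struct eppoll_entry. This code is a useful reference if you ever need to develop a new file system.
Now consider why epoll_create returns a new fd: it is because it creates a new file in this "eventpollfs" file system! As follows:
[fs/eventpoll.c-->sys_epoll_create()]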
476 asmlinkage long sys_epoll_create(int size)
477 {
478 int error, fd;
479 struct inode *inode;
480 struct file *file;
481
482 DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_create(%d)\n",
483 current, size));
484
485 /* Sanity check on the size parameter */
486 error = -EINVAL;
487 if (size <= 0)
488 goto eexit_1;
489
490 /*
491 * Creates all the items needed to setup an eventpoll file. That is,
492 * a file structure, and inode and a free file descriptor.
493 */
494 error = ep_getfd(&fd, &inode, &file);
495 if (error)
496 goto eexit_1;
497
498 /* Setup the file internal data structure ( "struct eventpoll" ) */
499 error = ep_file_init(file);
500 if (error)
501 goto eexit_2;
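The function is simple. ep_getfd looks like a "get", but when epoll_create is called it actually has to create a new inode, a new file, and a new fd. ep_file_init in turn creates a struct eventpoll and stores it in file->private_data; note that this private_data is used again later.
At this point one might ask why the epoll developers did not keep a giant kernel-side map of the epoll handles users create and have epoll_create return a pointer; that might seem more direct. But look closely: how many Linux system calls return a pointer? Almost none. (To be clear, malloc is not a system call; the brk that malloc calls is.) Linux, as the most successful heir of Unix, follows one of Unix's great strengths, "everything is a file": input and output are files, sockets are files, and programs on such a system can be very simple because everything is just a file operation. (Unix itself does not fully achieve this; Plan 9 does.) Using a file system has another benefit: epoll_create returns an fd rather than a pointer. A wrong pointer is practically impossible to detect, whereas an fd can be validated by looking it up in current->files->fd_array[].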
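One practical consequence of the epoll handle being an ordinary fd is that it can itself be passed to poll. The following is a small sketch of that behaviour with an invented helper name; a pipe again plays the watched fd.

```c
#include <assert.h>
#include <poll.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if poll() reports the epoll fd itself as
 * readable once a registered fd has a pending event, 0 if not, -1 on
 * failure. */
int epoll_fd_is_pollable(void)
{
    int p[2], epfd, n, result = -1;
    struct epoll_event ev = {0};
    struct pollfd pfd;

    if (pipe(p) != 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    ev.events = EPOLLIN;
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "x", 1) == 1) {
        /* Poll the epoll fd, not the pipe. */
        pfd.fd = epfd;
        pfd.events = POLLIN;
        pfd.revents = 0;
        n = poll(&pfd, 1, 1000);
        assert(n <= 1);
        result = (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
    }

    close(epfd);
    close(p[0]);
    close(p[1]);
    return result;
}
```

An epoll fd is also released with plain close, like any other file.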
With epoll_create done, on to epoll_ctl. We omit the validation code:
[fs/eventpoll.c-->sys_epoll_ctl()]
524 asmlinkage long
525 sys_epoll_ctl(int epfd, int op, int fd, struct epoll_event __user *event)
526 {
527 int error;
528 struct file *file, *tfile;
529 struct eventpoll *ep;
530 struct epitem *epi;
531 struct epoll_event epds;
....
575 epi = ep_find(ep, tfile, fd);
576
577 error = -EINVAL;
578 switch (op) {
579 case EPOLL_CTL_ADD:
580 if (!epi) {
581 epds.events |= POLLERR | POLLHUP;
582
583 error = ep_insert(ep, &epds, tfile, fd);
584 } else
585 error = -EEXIST;
586 break;
587 case EPOLL_CTL_DEL:
588 if (epi)
589 error = ep_remove(ep, epi);
590 else
591 error = -ENOENT;
592 break;
593 case EPOLL_CTL_MOD:
594 if (epi) {
595 epds.events |= POLLERR | POLLHUP;
596 error = ep_modify(ep, epi, &epds);
597 } else
598 error = -ENOENT;
599 break;
600 }
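So the code first does an ep_find in some large structure (never mind what it is for now). If a struct epitem is found and the operation is ADD, it returns -EEXIST; if the operation is DEL, it calls ep_remove. If no struct epitem is found and the operation is ADD, ep_insert creates and inserts one. Straightforward. What is this "large structure"? From the way ep_find is called, the ep argument must point to it, and ep = file->private_data shows that it is the struct eventpoll created in epoll_create. The implementation of ep_find searches the rbr member of struct eventpoll (a struct rb_root), which is the root of a red-black tree whose nodes are struct epitem.
So a newly created epoll file carries a struct eventpoll, that structure holds a red-black tree, and that tree is where the fds from each epoll_ctl call are stored.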
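The error codes in the switch statement are directly observable from user space. A minimal sketch, with an invented helper that records which of the three failure cases behaved as the listing predicts:

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns a bitmask of the error cases observed.
 *   bit 0: EPOLL_CTL_MOD on an unregistered fd failed with ENOENT
 *   bit 1: EPOLL_CTL_DEL on an unregistered fd failed with ENOENT
 *   bit 2: a second EPOLL_CTL_ADD of the same fd failed with EEXIST
 * Returns -1 on setup failure. */
int epoll_ctl_error_cases(void)
{
    int p[2], epfd, mask = 0;
    struct epoll_event ev = {0};

    if (pipe(p) != 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    ev.events = EPOLLIN;
    ev.data.fd = p[0];

    /* Nothing registered yet: the lookup finds no item. */
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, p[0], &ev) == -1 && errno == ENOENT)
        mask |= 1;
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, p[0], &ev) == -1 && errno == ENOENT)
        mask |= 2;

    /* First ADD inserts the item; the second finds it already present. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) != 0)
        mask = -1;
    else if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == -1 &&
             errno == EEXIST)
        mask |= 4;

    assert(mask <= 7);
    close(epfd);
    close(p[0]);
    close(p[1]);
    return mask;
}
```

Each branch corresponds to one outcome of the lookup: item absent for MOD and DEL, item present for the duplicate ADD.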
With the data structures clear, we come to the core:
[fs/eventpoll.c-->sys_epoll_wait()]
627 asmlinkage long sys_epoll_wait(int epfd, struct epoll_event __user *events,
628 int maxevents, int timeout)
629 {
630 int error;
631 struct file *file;
632 struct eventpoll *ep;
633
634 DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_wait(%d, %p, %d, %d)\n",
635 current, epfd, events, maxevents, timeout));
636
637 /* The maximum number of event must be greater than zero */
638 if (maxevents <= 0)
639 return -EINVAL;
640
641 /* Verify that the area passed by the user is writeable */
642 if ((error = verify_area(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event))))
643 goto eexit_1;
644
645 /* Get the "struct file *" for the eventpoll file */
646 error = -EBADF;
647 file = fget(epfd);
648 if (!file)
649 goto eexit_1;
650
651 /*
652 * We have to check that the file structure underneath the fd
653 * the user passed to us _is_ an eventpoll file.
654 */
655 error = -EINVAL;
656 if (!IS_FILE_EPOLL(file))
657 goto eexit_2;
658
659 /*
660 * At this point it is safe to assume that the "private_data" contains
661 * our own data structure.
662 */
663 ep = file->private_data;
664
665 /* Time to fish for events ... */
666 error = ep_poll(ep, events, maxevents, timeout);
667
668 eexit_2:
669 fput(file);
670 eexit_1:
671 DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_wait(%d, %p, %d, %d) = %d\n",
672 current, epfd, events, maxevents, timeout, error));
673
674 return error;
675 }
The same trick again: fetch the struct eventpoll from file->private_data, then call ep_poll:
[fs/eventpoll.c-->sys_epoll_wait()->ep_poll()]
1468 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
1469 int maxevents, long timeout)
1470 {
1471 int res, eavail;
1472 unsigned long flags;
1473 long jtimeout;
1474 wait_queue_t wait;
1475
1476 /*
1477 * Calculate the timeout by checking for the "infinite" value ( -1 )
1478 * and the overflow condition. The passed timeout is in milliseconds,
1479 * that why (t * HZ) / 1000.
1480 */
1481 jtimeout = timeout == -1 || timeout > (MAX_SCHEDULE_TIMEOUT - 1000) / HZ ?
1482 MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
1483
1484 retry:
1485 write_lock_irqsave(&ep->lock, flags);
1486
1487 res = 0;
1488 if (list_empty(&ep->rdllist)) {
1489 /*
1490 * We don't have any available event to return to the caller.
1491 * We need to sleep here, and we will be wake up by
1492 * ep_poll_callback() when events will become available.
1493 */
1494 init_waitqueue_entry(&wait, current);
1495 add_wait_queue(&ep->wq, &wait);
1496
1497 for (;;) {
1498 /*
1499 * We don't want to sleep if the ep_poll_callback() sends us
1500 * a wakeup in between. That's why we set the task state
1501 * to TASK_INTERRUPTIBLE before doing the checks.
1502 */
1503 set_current_state(TASK_INTERRUPTIBLE);
1504 if (!list_empty(&ep->rdllist) || !jtimeout)
1505 break;
1506 if (signal_pending(current)) {
1507 res = -EINTR;
1508 break;
1509 }
1510
1511 write_unlock_irqrestore(&ep->lock, flags);
1512 jtimeout = schedule_timeout(jtimeout);
1513 write_lock_irqsave(&ep->lock, flags);
1514 }
1515 remove_wait_queue(&ep->wq, &wait);
1516
1517 set_current_state(TASK_RUNNING);
1518 }
....
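Another big loop, but a better one than poll's: on closer inspection, it does nothing except sleep and check whether ep->rdllist is empty.
Doing nothing is of course efficient, but who makes ep->rdllist non-empty?
The answer is the callback installed by ep_insert:
[fs/eventpoll.c-->sys_epoll_ctl()-->ep_insert()]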
923 static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
924 struct file *tfile, int fd)
925 {
926 int error, revents, pwake = 0;
927 unsigned long flags;
928 struct epitem *epi;
929 struct ep_pqueue epq;
930
931 error = -ENOMEM;
932 if (!(epi = EPI_MEM_ALLOC()))
933 goto eexit_1;
934
935 /* Item initialization follow here ... */
936 EP_RB_INITNODE(&epi->rbn);
937 INIT_LIST_HEAD(&epi->rdllink);
938 INIT_LIST_HEAD(&epi->fllink);
939 INIT_LIST_HEAD(&epi->txlink);
940 INIT_LIST_HEAD(&epi->pwqlist);
941 epi->ep = ep;
942 EP_SET_FFD(&epi->ffd, tfile, fd);
943 epi->event = *event;
944 atomic_set(&epi->usecnt, 1);
945 epi->nwait = 0;
946
947 /* Initialize the poll table using the queue callback */
948 epq.epi = epi;
949 init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
950
951 /*
952 * Attach the item to the poll hooks and get current event bits.
953 * We can safely use the file* here because its usage count has
954 * been increased by the caller of this function.
955 */
956 revents = tfile->f_op->poll(tfile, &epq.pt);
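Note line 949, which amounts to
epq.pt.qproc = ep_ptable_queue_proc;
The call that follows, tfile->f_op->poll(tfile, &epq.pt), invokes the poll method of the monitored file (which epoll calls the "target file"). That poll method calls poll_wait (remember poll_wait? every driver that supports poll has to call it), which finally calls ep_ptable_queue_proc. This call chain is hard to follow because it is not a direct, language-level call.
ep_insert also puts the struct epitem on the f_ep_links list in struct file so that it can be found easily; the fllink member of struct epitem serves this purpose.
[fs/eventpoll.c-->ep_ptable_queue_proc()]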
883 static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
884 poll_table *pt)
885 {
886 struct epitem *epi = EP_ITEM_FROM_EPQUEUE(pt);
887 struct eppoll_entry *pwq;
888
889 if (epi->nwait >= 0 && (pwq = PWQ_MEM_ALLOC())) {
890 init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
891 pwq->whead = whead;
892 pwq->base = epi;
893 add_wait_queue(whead, &pwq->wait);
894 list_add_tail(&pwq->llink, &epi->pwqlist);
895 epi->nwait++;
896 } else {
897 /* We have to signal that an error occurred */
898 epi->nwait = -1;
899 }
900 }
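The code above is the most important thing ep_insert does: it creates a struct eppoll_entry, sets its wakeup callback to ep_poll_callback, and adds it to the device wait queue (whead here is the per-driver wait queue discussed in the poll section). Only then will ep_poll_callback be invoked when the device becomes ready and wakes the waiters on its queue. On every poll system call the kernel has to put current (the current process) on the wait queues of all the devices behind the fds; with thousands of fds this is expensive. epoll_wait has no such overhead: epoll hooks each fd's wait queue only once, at epoll_ctl time (this first pass cannot be avoided), and tells each fd to "call the callback when ready". When a device has an event, the callback puts the fd on rdllist, and each epoll_wait call only has to collect the fds on rdllist. By using callbacks this way, epoll implements a more efficient event-driven model.
We can now guess what ep_poll_callback does: it inserts the epitem (one per fd) on the red-black tree that received the event into ep->rdllist, so that when epoll_wait returns, rdllist contains exactly the ready fds.
[fs/eventpoll.c-->ep_poll_callback()]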
1206 static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
1207 {
1208 int pwake = 0;
1209 unsigned long flags;
1210 struct epitem *epi = EP_ITEM_FROM_WAIT(wait);
1211 struct eventpoll *ep = epi->ep;
1212
1213 DNPRINTK(3, (KERN_INFO "[%p] eventpoll: poll_callback(%p) epi=%p ep=%p\n",
1214 current, epi->file, epi, ep));
1215
1216 write_lock_irqsave(&ep->lock, flags);
1217
1218 /*
1219 * If the event mask does not contain any poll(2) event, we consider the
1220 * descriptor to be disabled. This condition is likely the effect of the
1221 * EPOLLONESHOT bit that disables the descriptor when an event is received,
1222 * until the next EPOLL_CTL_MOD will be issued.
1223 */
1224 if (!(epi->event.events & ~EP_PRIVATE_BITS))
1225 goto is_disabled;
1226
1227 /* If this file is already in the ready list we exit soon */
1228 if (EP_IS_LINKED(&epi->rdllink))
1229 goto is_linked;
1230
1231 list_add_tail(&epi->rdllink, &ep->rdllist);
1232
1233 is_linked:
1234 /*
1235 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
1236 * wait list.
1237 */
1238 if (waitqueue_active(&ep->wq))
1239 wake_up(&ep->wq);
1240 if (waitqueue_active(&ep->poll_wait))
1241 pwake++;
1242
1243 is_disabled:
1244 write_unlock_irqrestore(&ep->lock, flags);
1245
1246 /* We have to call this outside the lock */
1247 if (pwake)
1248 ep_poll_safewake(&psw, &ep->poll_wait);
1249
1250 return 1;
1251 }
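The only line that really matters is line 1231, which puts the struct epitem on the rdllist of struct eventpoll. With that, the core data structures of epoll can be drawn: a struct eventpoll holding the red-black tree (rbr) of struct epitem and the ready list (rdllist), with each epitem hooked onto its device wait queue through a struct eppoll_entry.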
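The ready-list behaviour of ep_poll_callback (append on wakeup, skip items that are already linked) can be mimicked with a few lines of plain C. All sim_ names are invented for this sketch, and a simple singly linked list replaces the kernel's list_head.

```c
#include <assert.h>
#include <stddef.h>

struct sim_epitem {
    int fd;
    int on_ready_list;               /* plays the role of EP_IS_LINKED */
    struct sim_epitem *next_ready;
};

struct sim_eventpoll {
    struct sim_epitem *ready_head;   /* plays the role of rdllist */
    struct sim_epitem *ready_tail;
    int nready;
};

/* Stand-in for ep_poll_callback: append the item to the ready list
 * unless it is already there. */
static void sim_poll_callback(struct sim_eventpoll *ep,
                              struct sim_epitem *epi)
{
    if (epi->on_ready_list)
        return;
    epi->on_ready_list = 1;
    epi->next_ready = NULL;
    if (ep->ready_tail)
        ep->ready_tail->next_ready = epi;
    else
        ep->ready_head = epi;
    ep->ready_tail = epi;
    ep->nready++;
}

/* Fires three wakeups on two items (one of them twice) and returns the
 * resulting ready-list length. */
int sim_ready_count_after_wakeups(void)
{
    struct sim_eventpoll ep = { NULL, NULL, 0 };
    struct sim_epitem a = { 3, 0, NULL };
    struct sim_epitem b = { 4, 0, NULL };

    sim_poll_callback(&ep, &a);
    sim_poll_callback(&ep, &b);
    sim_poll_callback(&ep, &a);      /* duplicate wakeup, ignored */

    assert(ep.ready_head == &a && ep.ready_tail == &b);
    return ep.nready;
}
```

The waiter then only has to look at this list; it never rescans the full set of registered items.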

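EPOLLET, unique to epoll

EPOLLET is a flag unique to the epoll system calls; ET stands for Edge Triggered. Its exact semantics and uses are well documented elsewhere. With EPOLLET, repeated events no longer keep disturbing the program's logic, so it is commonly used. How does EPOLLET work?
As described above, epoll attaches a callback to every fd; when the device behind an fd has a message, the fd is put on the rdllist list, so epoll_wait only has to check rdllist to know which fds have events. Here are the last few lines of ep_poll:
[fs/eventpoll.c->ep_poll()]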
1524
1525 /*
1526 * Try to transfer events to user space. In case we get 0 events and
1527 * there's still timeout left over, we go trying again in search of
1528 * more luck.
1529 */
1530 if (!res && eavail &&
1531 !(res = ep_events_transfer(ep, events, maxevents)) && jtimeout)
1532 goto retry;
1533
1534 return res;
1535 }
Copying the fds on rdllist to user space is the job of ep_events_transfer:
[fs/eventpoll.c->ep_events_transfer()]
1439 static int ep_events_transfer(struct eventpoll *ep,
1440 struct epoll_event __user *events, int maxevents)
1441 {
1442 int eventcnt = 0;
1443 struct list_head txlist;
1444
1445 INIT_LIST_HEAD(&txlist);
1446
1447 /*
1448 * We need to lock this because we could be hit by
1449 * eventpoll_release_file() and epoll_ctl(EPOLL_CTL_DEL).
1450 */
1451 down_read(&ep->sem);
1452
1453 /* Collect/extract ready items */
1454 if (ep_collect_ready_items(ep, &txlist, maxevents) > 0) {
1455 /* Build result set in userspace */
1456 eventcnt = ep_send_events(ep, &txlist, events);
1457
1458 /* Reinject ready items into the ready list */
1459 ep_reinject_items(ep, &txlist);
1460 }
1461
1462 up_read(&ep->sem);
1463
1464 return eventcnt;
1465 }
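There is not much code. ep_collect_ready_items moves the fds from rdllist to txlist (after which rdllist is empty), ep_send_events copies the fds on txlist to user space, and ep_reinject_items then "returns" some of the fds from txlist to rdllist so that they are found on rdllist again next time.
The implementation of ep_send_events:
[fs/eventpoll.c->ep_send_events()]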
1337 static int ep_send_events(struct eventpoll *ep, struct list_head *txlist,
1338 struct epoll_event __user *events)
1339 {
1340 int eventcnt = 0;
1341 unsigned int revents;
1342 struct list_head *lnk;
1343 struct epitem *epi;
1344
1345 /*
1346 * We can loop without lock because this is a task private list.
1347 * The test done during the collection loop will guarantee us that
1348 * another task will not try to collect this file. Also, items
1349 * cannot vanish during the loop because we are holding "sem".
1350 */
1351 list_for_each(lnk, txlist) {
1352 epi = list_entry(lnk, struct epitem, txlink);
1353
1354 /*
1355 * Get the ready file event set. We can safely use the file
1356 * because we are holding the "sem" in read and this will
1357 * guarantee that both the file and the item will not vanish.
1358 */
1359 revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL);
1360
1361 /*
1362 * Set the return event set for the current file descriptor.
1363 * Note that only the task task was successfully able to link
1364 * the item to its "txlist" will write this field.
1365 */
1366 epi->revents = revents & epi->event.events;
1367
1368 if (epi->revents) {
1369 if (__put_user(epi->revents,
1370 &events[eventcnt].events) ||
1371 __put_user(epi->event.data,
1372 &events[eventcnt].data))
1373 return -EFAULT;
1374 if (epi->event.events & EPOLLONESHOT)
1375 epi->event.events &= EP_PRIVATE_BITS;
1376 eventcnt++;
1377 }
1378 }
1379 return eventcnt;
1380 }
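The copy itself is unremarkable, but note line 1359: this poll call is subtle, passing NULL as its second argument. First let us see how a device driver typically implements poll: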
static unsigned int scull_p_poll(struct file *filp, poll_table *wait)
{
struct scull_pipe *dev = filp->private_data;
unsigned int mask = 0;
/*
* The buffer is circular; it is considered full
* if "wp" is right behind "rp" and empty if the
* two are equal.
*/
down(&dev->sem);
poll_wait(filp, &dev->inq, wait);
poll_wait(filp, &dev->outq, wait);
if (dev->rp != dev->wp)
mask |= POLLIN | POLLRDNORM; /* readable */
if (spacefree(dev))
mask |= POLLOUT | POLLWRNORM; /* writable */
up(&dev->sem);
return mask;
}
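The code above is taken from Linux Device Drivers, 3rd Edition, and is a classic. The device first puts current (the current process) on the two queues inq and outq (the queuing is done through the callback pointer in wait) and then waits for the device to wake it; after wakeup the event mask is available in mask (note the local variable mask, which carries the event mask). So what does poll_wait do if wait is NULL?
[include/linux/poll.h->poll_wait]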
25 static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address,
poll_table *p)
26 {
27 if (p && wait_address)
28 p->qproc(filp, wait_address, p);
29 }
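There it is: if the poll_table is NULL, nothing is done. Going back to ep_send_events, the poll call on line 1359 effectively means "I do not want to sleep, I only want the event mask". The mask obtained is then copied to user space.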
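User space has a close analogue of this "mask only" query: poll with a timeout of 0, which in the do_poll listing earlier sets pt to NULL before any driver is called. A minimal sketch with an invented helper name:

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: probes a pipe with a zero timeout before and after
 * a byte is written. Returns 1 if the first probe reports nothing and the
 * second reports POLLIN, 0 otherwise, -1 on setup failure. */
int poll_probe_without_sleeping(void)
{
    int p[2], before, after;
    struct pollfd pfd;

    if (pipe(p) != 0)
        return -1;

    pfd.fd = p[0];
    pfd.events = POLLIN;
    pfd.revents = 0;
    before = poll(&pfd, 1, 0);       /* empty pipe, returns immediately */

    if (write(p[1], "x", 1) != 1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    pfd.revents = 0;
    after = poll(&pfd, 1, 0);        /* data pending, still no sleep */
    assert(before <= 1 && after <= 1);

    close(p[0]);
    close(p[1]);
    return before == 0 && after == 1 && (pfd.revents & POLLIN) != 0;
}
```

In both probes the driver's poll method runs and computes its mask, but nobody is queued on a wait queue.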
After ep_send_events comes ep_reinject_items:
[fs/eventpoll.c->ep_reinject_items]
1389 static void ep_reinject_items(struct eventpoll *ep, struct list_head *txlist)
1390 {
1391 int ricnt = 0, pwake = 0;
1392 unsigned long flags;
1393 struct epitem *epi;
1394
1395 write_lock_irqsave(&ep->lock, flags);
1396
1397 while (!list_empty(txlist)) {
1398 epi = list_entry(txlist->next, struct epitem, txlink);
1399
1400 /* Unlink the current item from the transfer list */
1401 EP_LIST_DEL(&epi->txlink);
1402
1403 /*
1404 * If the item is no more linked to the interest set, we don't
1405 * have to push it inside the ready list because the following
1406 * ep_release_epitem() is going to drop it. Also, if the current
1407 * item is set to have an Edge Triggered behaviour, we don't have
1408 * to push it back either.
1409 */
1410 if (EP_RB_LINKED(&epi->rbn) && !(epi->event.events & EPOLLET) &&
1411 (epi->revents & epi->event.events) && !EP_IS_LINKED(&epi->rdllink)) {
1412 list_add_tail(&epi->rdllink, &ep->rdllist);
1413 ricnt++;
1414 }
1415 }
1416
1417 if (ricnt) {
1418 /*
1419 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
1420 * wait list.
1421 */
1422 if (waitqueue_active(&ep->wq))
1423 wake_up(&ep->wq);
1424 if (waitqueue_active(&ep->poll_wait))
1425 pwake++;
1426 }
1427
1428 write_unlock_irqrestore(&ep->lock, flags);
1429
1430 /* We have to call this outside the lock */
1431 if (pwake)
1432 ep_poll_safewake(&psw, &ep->poll_wait);
1433 }
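ep_reinject_items puts some of the fds on txlist back on rdllist. Which ones? See the test on lines 1410-1411: fds that are not flagged EPOLLET (the !(epi->event.events & EPOLLET) condition) and whose events are still of interest (the epi->revents & epi->event.events condition) are put back on rdllist. The next epoll_wait will then naturally pick them up from rdllist and copy them to the user again.
An example. Take a socket that has only been connected and has not yet sent or received data. Its poll event mask always contains POLLOUT (see the driver example above), so every epoll_wait returns a POLLOUT event (rather annoying), because its fd is always put back on rdllist. If someone then writes a large amount of data to the socket so that it fills up (is no longer writable), the revents test on line 1411 fails (POLLOUT is gone), the fd is not put back on rdllist, and epoll_wait no longer reports POLLOUT to the user. Now add EPOLLET to this socket, connect, and send or receive nothing. This time the EPOLLET test on line 1410 fails, so epoll_wait delivers the POLLOUT notification only once (because the fd never returns to rdllist), and subsequent epoll_wait calls report no events.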
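The same contrast can be reproduced with a pipe instead of a socket, using EPOLLIN on unread data in place of POLLOUT on a writable socket. This is a sketch with an invented helper name; the 100 ms timeout only bounds how long the second edge-triggered wait blocks.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: registers the read end of a pipe holding one
 * unread byte with EPOLLIN plus extra_flags, then calls epoll_wait twice
 * without reading. Returns how many of the two calls reported the fd,
 * or -1 on failure. */
int count_reports_without_reading(unsigned int extra_flags)
{
    int p[2], epfd, hits = -1, i;
    struct epoll_event ev = {0}, got;

    if (pipe(p) != 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = p[0];
    if (write(p[1], "x", 1) == 1 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0) {
        hits = 0;
        for (i = 0; i < 2; i++) {
            /* The byte is never read, so the fd stays readable. */
            if (epoll_wait(epfd, &got, 1, 100) == 1)
                hits++;
        }
        assert(hits <= 2);
    }

    close(epfd);
    close(p[0]);
    close(p[1]);
    return hits;
}
```

With extra_flags set to 0 (level triggered) both waits report the fd; with EPOLLET only the first one does, because no new event arrives to put the fd back on the ready list.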
