系統技術非業餘研究 » Is irqbalance actually useful? An in-depth analysis

阿新 • Published: 2019-01-13

The irqbalance project has its own home page.

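irqbalance optimizes interrupt distribution. It automatically collects system data to analyze usage patterns and, depending on system load, puts itself into Performance mode or Power-save mode. In Performance mode, irqbalance spreads interrupts as evenly as possible across the CPU cores to make full use of a multi-core CPU and improve performance.
In Power-save mode, irqbalance concentrates interrupts on the first CPU so that the other, idle CPUs can stay asleep longer, which reduces power consumption.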

On RHEL this daemon is enabled at boot by default. To check its status:

# service irqbalance status
irqbalance (pid PID) is running…

In practice, though, our dedicated applications are usually pinned to specific CPUs, so we do not really need it. If it is already running, we can stop it with:

# service irqbalance stop
Stopping irqbalance: [ OK ]

Or simply disable it at boot:

# chkconfig irqbalance off

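Next, let's look at how irqbalance works, so that we know exactly when to use it and when not to.

Since irqbalance is about optimizing interrupt distribution, we start with interrupts. This is a long post, so take a deep breath and let's go!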

SMP IRQ affinity is described in a separate article.

The key passage:

SMP affinity is controlled by manipulating files in the /proc/irq/ directory.
In /proc/irq/ are directories that correspond to the IRQs present on your
system (not all IRQs may be available). In each of these directories is
the “smp_affinity” file, and this is where we will work our magic.

Put plainly, you write the bitmask of the CPUs you want the interrupt bound to into /proc/irq/N/smp_affinity. I covered setting interrupt affinity by hand in earlier posts.
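As a small illustration of what "the mask" means, here is a sketch that turns a list of CPU numbers into the hexadecimal value smp_affinity expects. The function name cpus_to_mask is made up for this example, and it only handles CPU numbers that fit into the shell's integer arithmetic (fewer than 63 CPUs). Real masks on large machines are written as comma-separated 32-bit groups, which this sketch does not produce.

```shell
#!/bin/sh
# cpus_to_mask CPU...: print the hex affinity mask for the given CPU numbers.
# Bit N of the mask stands for CPU N. (Hypothetical helper, not part of
# irqbalance; limited to CPU numbers below 63.)
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$((mask | (1 << cpu)))
    done
    printf '%x\n' "$mask"
}

cpus_to_mask 0 2    # CPU0 and CPU2 -> 5
cpus_to_mask 3      # CPU3 only     -> 8
```

As root, the result would then be written with something like `cpus_to_mask 2 3 > /proc/irq/N/smp_affinity`, where N is the IRQ number.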

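Next, some more background: the CPU topology. First, how the parts of an Intel CPU relate to each other:

A NUMA node contains one or more sockets together with the local memory attached to them. A multi-core socket contains several cores. If the CPU supports Hyper-Threading (HT), the OS additionally sees each core as 2 logical processors.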

Many tools can show the topology, for example lscpu or Intel's cpu_topology64.

This time we use likwid-topology from the Likwid toolkit, which I introduced recently:

./likwid-topology

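CPU topology is something you have to understand before pinning anything to CPUs on a high-performance server. Starting to get a feel for it?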
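If none of those tools is at hand, the same socket/core/logical-processor relationship can be read straight from sysfs. The sketch below assumes the usual Linux layout under /sys/devices/system/cpu (the topology/physical_package_id and topology/core_id files). The function name list_topology and its optional root argument are made up for this example.

```shell
#!/bin/sh
# list_topology [ROOT]: print "cpuN package=P core=C" for every logical CPU
# found under ROOT (default: /sys/devices/system/cpu).
list_topology() {
    root=${1:-/sys/devices/system/cpu}
    for dir in "$root"/cpu[0-9]*; do
        # Skip entries without topology data (for example, offline CPUs).
        [ -r "$dir/topology/core_id" ] || continue
        printf '%s package=%s core=%s\n' \
            "${dir##*/}" \
            "$(cat "$dir/topology/physical_package_id")" \
            "$(cat "$dir/topology/core_id")"
    done
}

list_topology
```

Two logical CPUs that report the same package and the same core are Hyper-Threading siblings.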

With all of that background and vocabulary in place, we can look into how irqbalance works:

//irqbalance.c
int main(int argc, char** argv)
{
        /* ... */
        while (keep_going) {
                sleep_approx(SLEEP_INTERVAL); /* #define SLEEP_INTERVAL 10 */
                /* ... */
                clear_work_stats();
                parse_proc_interrupts();
                parse_proc_stat();
                /* ... */
                calculate_placement();
                activate_mappings();
                /* ... */
        }
        /* ... */
}

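The main loop makes the logic very clear. Until it exits, every 10 seconds it does the following:
1. Clear the statistics
2. Parse the interrupt counts
3. Parse the interrupt load
4. Work out how to balance the interrupts based on the load
5. Apply the interrupt affinity changes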

Now a quick look at how irqbalance is used:

man irqbalance

--oneshot
Causes irqbalance to be run once, after which the daemon exits
--debug
Causes irqbalance to run in the foreground and extra debug information to be printed

Running irqbalance in debug mode gives us a lot of detailed information:

# ./irqbalance --oneshot --debug

Have a sip of water, and let's go through each step in detail.

First, how interrupts are distributed across the CPUs:

$cat /proc/interrupts|tr -s ' ' '\t'|cut -f 1-3
        CPU0    CPU1
        0:      2622846291
        1:      7
        4:      234
        8:      1
        9:      0
        12:     4
        50:     6753
        66:     228
        90:     497
        98:     31
209:    2       0
217:    0       0
225:    29      556
233:    0       0
NMI:    7395302 4915439
LOC:    2622846035      2622833187
ERR:    0
MIS:    0

The first column is the IRQ number; the next two columns are the interrupt counts on CPU0 and CPU1.
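To get one number per interrupt instead of one per CPU, the columns can be summed. The sketch below reads /proc/interrupts-style text and prints the total for each row. It is only an illustration of the file layout, not what irqbalance's parse_proc_interrupts() does internally, and the function name irq_totals is made up.

```shell
#!/bin/sh
# irq_totals: read /proc/interrupts-style text on stdin and print
# "IRQ total" for each row, where total is the sum over all CPU columns.
irq_totals() {
    awk '
        NR == 1 { ncpu = NF; next }          # header row: one field per CPU
        {
            irq = $1; sub(/:$/, "", irq)     # "98:" becomes "98"
            total = 0
            # Rows such as ERR and MIS carry fewer counters than CPUs.
            for (i = 2; i <= ncpu + 1 && i <= NF; i++) {
                if ($i !~ /^[0-9]+$/) break  # stop at the description text
                total += $i
            }
            printf "%s %.0f\n", irq, total
        }
    '
}

if [ -r /proc/interrupts ]; then
    irq_totals < /proc/interrupts
fi
```

Taking two such snapshots a few seconds apart and subtracting them shows which interrupts are actually busy.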

But how do we know what kind of device, say, IRQ 98 belongs to? Straight to the code.

//classify.c
char *classes[] = {
        "other",
        "legacy",
        "storage",
        "timer",
        "ethernet",
        "gbit-ethernet",
        "10gbit-ethernet",
        0
};

#define MAX_CLASS 0x12
/*                                                                                                                        
 * Class codes lifted from pci spec, appendix D.                                                                          
 * and mapped to irqbalance types here                                                                                    
 */
static short class_codes[MAX_CLASS] = {
        IRQ_OTHER,
        IRQ_SCSI,
        IRQ_ETH,
        IRQ_OTHER,
        IRQ_OTHER,
        IRQ_OTHER,
        IRQ_LEGACY,
        IRQ_OTHER,
        IRQ_OTHER,
        IRQ_LEGACY,
        IRQ_OTHER,
        IRQ_OTHER,
        IRQ_LEGACY,
        IRQ_ETH,
        IRQ_SCSI,
        IRQ_OTHER,
        IRQ_OTHER,
        IRQ_OTHER,
};
int map_class_to_level[7] =
{ BALANCE_PACKAGE, BALANCE_CACHE, BALANCE_CACHE, BALANCE_NONE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE };

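irqbalance sorts interrupts into 7 classes, and each class is balanced at a different scope: some at the PACKAGE level, some at the CACHE level, some at the CORE level.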
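The two tables can be read as a two-step lookup: PCI major class code to irqbalance class, then irqbalance class to balancing scope. The sketch below spells that out. It assumes that the IRQ_* constants index classes[] in the order shown (other = 0 through 10gbit-ethernet = 6), which the excerpt does not show. It also treats out-of-range codes as "other", which is a simplification. Both function names are made up.

```shell
#!/bin/sh
# irq_class_for MAJOR: map a PCI major class code (decimal) to an irqbalance
# class name, following the class_codes[] table quoted above.
irq_class_for() {
    case "$1" in
        1|14)   echo storage ;;    # IRQ_SCSI
        2|13)   echo ethernet ;;   # IRQ_ETH
        6|9|12) echo legacy ;;     # IRQ_LEGACY
        *)      echo other ;;      # IRQ_OTHER (out-of-range codes too: assumption)
    esac
}

# balance_level_for CLASS: map an irqbalance class name to its balancing
# scope, following map_class_to_level[]. ASSUMPTION: IRQ_* values index
# classes[] in the order listed in the excerpt.
balance_level_for() {
    case "$1" in
        other)          echo BALANCE_PACKAGE ;;
        legacy|storage) echo BALANCE_CACHE ;;
        timer)          echo BALANCE_NONE ;;
        ethernet|gbit-ethernet|10gbit-ethernet) echo BALANCE_CORE ;;
        *)              echo unknown; return 1 ;;
    esac
}

balance_level_for "$(irq_class_for 2)"   # prints BALANCE_CORE
```

Under these assumptions a network card (major class 2) is balanced per core, while a storage controller is balanced per cache domain.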
So where does the class information come from? Straight to the code again.

//#define SYSDEV_DIR "/sys/bus/pci/devices"
static struct irq_info *add_one_irq_to_db(const char *devpath, int irq, struct user_irq_policy *pol)
{
...
        sprintf(path, "%s/class", devpath);

        fd = fopen(path, "r");

        if (!fd) {
                perror("Can't open class file: ");
                goto get_numa_node;
        }

        rc = fscanf(fd, "%x", &class);
        fclose(fd);

        if (!rc)
                goto get_numa_node;

        /*                                                                                                                
         * Restrict search to major class code                                                                            
         */
        class >>= 16;

        if (class >= MAX_CLASS)
                goto get_numa_node;

        new->class = class_codes[class];
        if (pol->level >= 0)
                new->level = pol->level;
        else
                new->level = map_class_to_level[class_codes[class]];
get_numa_node:
        numa_node = -1;
        sprintf(path, "%s/numa_node", devpath);
        fd = fopen(path, "r");
        if (!fd)
                goto assign_node;

        rc = fscanf(fd, "%d", &numa_node);
        fclose(fd);

assign_node:
        new->numa_node = get_numa_node(numa_node);

        sprintf(path, "%s/local_cpus", devpath);
        fd = fopen(path, "r");
        if (!fd) {
                cpus_setall(new->cpumask);
                goto assign_affinity_hint;
        }
        lcpu_mask = NULL;
        ret = getline(&lcpu_mask, &blen, fd);
        fclose(fd);
        if (ret <= 0) {
                cpus_setall(new->cpumask);
        } else {
                cpumask_parse_user(lcpu_mask, ret, new->cpumask);
        }
        free(lcpu_mask);

assign_affinity_hint:
        cpus_clear(new->affinity_hint);
        sprintf(path, "/proc/irq/%d/affinity_hint", irq);
        fd = fopen(path, "r");
        if (!fd)
                goto out;
        lcpu_mask = NULL;
        ret = getline(&lcpu_mask, &blen, fd);
        fclose(fd);
        if (ret <= 0)
            goto out;
        cpumask_parse_user(lcpu_mask, ret, new->affinity_hint);
        free(lcpu_mask);
out:
...
}

The C code above translates into the following script:

$cat>x.sh
SYSDEV_DIR="/sys/bus/pci/devices/"
for dev in `ls $SYSDEV_DIR`
do 
    IRQ=`cat $SYSDEV_DIR$dev/irq`
    CLASS=$(((`cat $SYSDEV_DIR$dev/class`)>>16))
    printf "irq %s: class[%s] " $IRQ $CLASS
    if [ -f "/proc/irq/$IRQ/affinity_hint" ]; then
        printf "affinity_hint[%s] " `cat /proc/irq/$IRQ/affinity_hint`
    fi
    if [ -f "$SYSDEV_DIR$dev/local_cpus" ]; then
        printf "local_cpus[%s] " `cat $SYSDEV_DIR$dev/local_cpus`
    fi
    if [ -f "$SYSDEV_DIR$dev/numa_node" ]; then
        printf "numa_node[%s]" `cat $SYSDEV_DIR$dev/numa_node`
    fi
    echo
done
CTRL+D
$ tree /sys/bus/pci/devices
/sys/bus/pci/devices
|-- 0000:00:00.0 -> ../../../devices/pci0000:00/0000:00:00.0
|-- 0000:00:01.0 -> ../../../devices/pci0000:00/0000:00:01.0
|-- 0000:00:03.0 -> ../../../devices/pci0000:00/0000:00:03.0
|-- 0000:00:07.0 -> ../../../devices/pci0000:00/0000:00:07.0
|-- 0000:00:09.0 -> ../../../devices/pci0000:00/0000:00:09.0
|-- 0000:00:13.0 -> ../../../devices/pci0000:00/0000:00:13.0
|-- 0000:00:14.0 -> ../../../devices/pci0000:00/0000:00:14.0
|-- 0000:00:14.1 -> ../../../devices/pci0000:00/0000:00:14.1
|-- 0000:00:14.2 -> ../../../devices/pci0000:00/0000:00:14.2
|-- 0000:00:14.3 -> ../../../devices/pci0000:00/0000:00:14.3
|-- 0000:00:1a.0 -> ../../../devices/pci0000:00/0000:00:1a.0
|-- 0000:00:1a.7 -> ../../../devices/pci0000:00/0000:00:1a.7
|-- 0000:00:1d.0 -> ../../../devices/pci0000:00/0000:00:1d.0
|-- 0000:00:1d.1 -> ../../../devices/pci0000:00/0000:00:1d.1
|-- 0000:00:1d.2 -> ../../../devices/pci0000:00/0000:00:1d.2
|-- 0000:00:1d.7 -> ../../../devices/pci0000:00/0000:00:1d.7
|-- 0000:00:1e.0 -> ../../../devices/pci0000:00/0000:00:1e.0
|-- 0000:00:1f.0 -> ../../../devices/pci0000:00/0000:00:1f.0
|-- 0000:00:1f.2 -> ../../../devices/pci0000:00/0000:00:1f.2
|-- 0000:00:1f.3 -> ../../../devices/pci0000:00/0000:00:1f.3
|-- 0000:00:1f.5 -> ../../../devices/pci0000:00/0000:00:1f.5
|-- 0000:01:00.0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0
|-- 0000:01:00.1 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.1
|-- 0000:04:00.0 -> ../../../devices/pci0000:00/0000:00:09.0/0000:04:00.0
`-- 0000:05:00.0 -> ../../../devices/pci0000:00/0000:00:1e.0/0000:05:00.0

$chmod +x x.sh
$./x.sh|grep 98
irq 98: class[2] local_cpus[00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000]

A quick look at the numbers: class_codes[2] = IRQ_ETH, which means this interrupt belongs to a network card.
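The "2" comes from the same shift the C code performs (class >>= 16): the sysfs class file holds a 24-bit value, and dropping the low 16 bits leaves the major class code. A minimal worked example follows; the function name pci_major_class is made up and the two input values are typical ones, not taken from the machine above.

```shell
#!/bin/sh
# pci_major_class VALUE: reduce a sysfs "class" value such as 0x020000 to its
# major class code by dropping the low 16 bits (subclass and interface).
pci_major_class() {
    echo $(( $1 >> 16 ))
}

pci_major_class 0x020000   # Ethernet controller     -> 2
pci_major_class 0x010601   # SATA controller (AHCI)  -> 1
```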

So how is the interrupt load calculated? On with the code!

//procinterrupts.c
void parse_proc_stat(void)
{
  ...
        file = fopen("/proc/stat", "r");
        if (!file) {
                log(TO_ALL, LOG_WARNING, "WARNING cant open /proc/stat.  balacing is broken\n");
                return;
        }

        /* first line is the header we don't need; nuke it */
        if (getline(&line, &size, file)==0) {
                free(line);
                log(TO_ALL, LOG_WARNING, "WARNING read /proc/stat. balancing is broken\n");
                fclose(file);
                return;
        }
        cpucount = 0;
        while (!feof(file)) {
                if (getline(&line, &size, file)==0)
                        break;

                if (!strstr(line, "cpu"))
                        break;

                cpunr = strtoul(&line[3], NULL, 10);

                if (cpu_isset(cpunr, banned_cpus))
                        continue;

                rc = sscanf(line, "%*s %*d %*d %*d %*d %*d %d %d", &irq_load, &softirq_load);
                if (rc < 2)
                        break;

                cpu = find_cpu_core(cpunr);

                if (!cpu)
                        break;

                cpucount++;
                /*
                 * For each cpu add the irq and softirq load and propagate that                                           
                 * all the way up the device tree                                                                         
                 */
                if (cycle_count) {
                        cpu->load = (irq_load + softirq_load) - (cpu->last_load);
                        /*                                                                                                
                         * the [soft]irq_load values are in jiffies, which are                                            
                         * units of 10ms, multiply by 1000 to convert that to                                             
                         * 1/10 milliseconds.  This give us a better integer                                              
                         * distribution of load between irqs                                                              
                         */
                        cpu->load *= 1000;
                }
                cpu->last_load = (irq_load + softirq_load);
        }
...
}

This is equivalent to the following command:

$ grep cpu15 /proc/stat
cpu15 30068830 85841 22995655 3212064899 536154 91145 2789328 0

The documentation describes the cpu line as follows:

cpu — Measures the number of jiffies (1/100 of a second for x86 systems) that the system has been in user mode, user mode with low priority (nice), system mode, idle task, I/O wait, IRQ (hardirq), and softirq respectively. The IRQ (hardirq) is the direct response to a hardware event. The IRQ takes minimal work for queuing the “heavy” work up for the softirq to execute. The softirq runs at a lower priority than the IRQ and therefore may be interrupted more frequently. The total for all CPUs is given at the top, while each individual CPU is listed below with its own statistics. The following example is a 4-way Intel Pentium Xeon configuration with multi-threading enabled, therefore showing four physical processors and four virtual processors totaling eight processors.

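So fields 7 and 8 of this line (counting the cpuN label as field 1) are the time spent handling hardirqs and softirqs, in jiffies. They are times, not interrupt counts, and their sum is what irqbalance treats as the CPU's load.
This agrees with what irqbalance itself reports (see the figure).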
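The same calculation can be tried by hand. The sketch below pulls fields 7 and 8 out of each per-CPU line and, separately, mirrors the interval step in parse_proc_stat(): the difference from the previous sample, multiplied by 1000. The function names irq_time and load_delta are made up for this example.

```shell
#!/bin/sh
# irq_time: read /proc/stat-style text on stdin and print, for each per-CPU
# line, the CPU name and hardirq + softirq time in jiffies (fields 7 and 8).
irq_time() {
    awk '/^cpu[0-9]+ / { printf "%s %.0f\n", $1, $7 + $8 }'
}

# load_delta PREV CUR: load for one interval as in parse_proc_stat(),
# that is, the increase since the previous sample scaled by 1000.
load_delta() {
    echo $(( ($2 - $1) * 1000 ))
}

echo 'cpu15 30068830 85841 22995655 3212064899 536154 91145 2789328 0' | irq_time
# prints: cpu15 2880473
```

Running irq_time against /proc/stat twice, ten seconds apart, and feeding the two values for one CPU to load_delta gives the kind of per-interval figure irqbalance works with.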

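Feeling a little dizzy? Have some water!
Now let's see how irqbalance computes load for a whole package. The figure below, read together with the CPU topology from earlier, makes it clear:

The load of each CORE is the sum of the loads of the interrupts attached to it, the load of each DOMAIN is the sum of its COREs, and the load of each PACKAGE is the sum of its DOMAINs, computed level by level like a tree.
Once the load of every CORE, DOMAIN and PACKAGE is known, all that remains is to find the least-loaded object within the scope that applies to the interrupt's class and migrate the interrupt there.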
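The tree-style roll-up is easy to imitate. The sketch below is a toy model, not irqbalance's own code. It reads one line per interrupt in a made-up "package domain core load" layout, sums the load at each level, and reports the lightest core. A line with load 0 can be used to declare an idle core. Ties for the lightest core are broken arbitrarily.

```shell
#!/bin/sh
# rollup_load: read "package domain core load" lines on stdin (one line per
# interrupt; the layout is invented for this sketch) and print the summed
# load of every core, cache domain and package, plus the lightest core.
rollup_load() {
    awk '
        {
            core[$1 "/" $2 "/" $3] += $4   # core    = sum of its interrupts
            domain[$1 "/" $2]      += $4   # domain  = sum of its cores
            pkg[$1]                += $4   # package = sum of its domains
        }
        END {
            lightest = ""
            for (k in core) {
                printf "core %s %.0f\n", k, core[k]
                if (lightest == "" || core[k] < core[lightest]) lightest = k
            }
            for (k in domain) printf "domain %s %.0f\n", k, domain[k]
            for (k in pkg)    printf "package %s %.0f\n", k, pkg[k]
            if (lightest != "") printf "lightest %s %.0f\n", lightest, core[lightest]
        }
    ' | LC_ALL=C sort
}

printf '%s\n' '0 0 0 500' '0 0 1 100' '0 1 2 300' | rollup_load
```

An interrupt balanced at CORE scope would go to the lightest core; one balanced at CACHE or PACKAGE scope would be compared against the domain or package totals instead.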

The scope for the migration comes from the table we saw earlier:

int map_class_to_level[7] =
{ BALANCE_PACKAGE, BALANCE_CACHE, BALANCE_CACHE, BALANCE_NONE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE };

Too much water. Quick break, back in a moment!

Finally, how does irqbalance actually apply the affinity changes? More code:

// activate.c
static void activate_mapping(struct irq_info *info, void *data __attribute__((unused)))
{
...
        if ((hint_policy == HINT_POLICY_EXACT) &&
            (!cpus_empty(info->affinity_hint))) {
                applied_mask = info->affinity_hint;
                valid_mask = 1;
        } else if (info->assigned_obj) {
                applied_mask = info->assigned_obj->mask;
                valid_mask = 1;
                if ((hint_policy == HINT_POLICY_SUBSET) &&
                    (!cpus_empty(info->affinity_hint)))
                        cpus_and(applied_mask, applied_mask, info->affinity_hint);
        }

        /*                                                                                                                
         * only activate mappings for irqs that have moved                                                                
         */
        if (!info->moved && (!valid_mask || check_affinity(info, applied_mask)))
                return;

        if (!info->assigned_obj)
                return;

        sprintf(buf, "/proc/irq/%i/smp_affinity", info->irq);
        file = fopen(buf, "w");
        if (!file)
                return;

        cpumask_scnprintf(buf, PATH_MAX, applied_mask);
        fprintf(file, "%s", buf);
        fclose(file);
        info->moved = 0; /*migration is done*/
}

void activate_mappings(void)
{
        for_each_irq(NULL, activate_mapping, NULL);
}

In shell terms, the code above boils down to:

#echo MASK > /proc/irq/N/smp_affinity

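If the user-selected policy is HINT_POLICY_EXACT and /proc/irq/N/affinity_hint is not empty, the hint itself is written.
If the policy is HINT_POLICY_SUBSET, the value written is applied_mask & affinity_hint: the code calls cpus_and(), so this is an intersection, not a union.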
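The mask selection in activate_mapping() can be mimicked with plain integer arithmetic. The sketch below takes a policy name plus two hex masks and prints the mask that would be written. The function name effective_mask and the lowercase policy names are made up. Masks are single hex numbers without the comma-separated groups the kernel uses on large machines, and a hint of 0 stands for "no hint".

```shell
#!/bin/sh
# effective_mask POLICY ASSIGNED HINT: print the hex mask that would be
# written to smp_affinity. ASSIGNED and HINT are hex masks without "0x";
# a HINT of 0 means the driver supplied no hint.
effective_mask() {
    policy=$1
    assigned=$((0x$2))
    hint=$((0x$3))
    case "$policy" in
        exact)
            # Use the hint as-is when there is one.
            if [ "$hint" -ne 0 ]; then result=$hint; else result=$assigned; fi ;;
        subset)
            # Keep only the assigned CPUs that the hint also allows.
            if [ "$hint" -ne 0 ]; then result=$((assigned & hint)); else result=$assigned; fi ;;
        *)
            # Any other policy: the hint is ignored.
            result=$assigned ;;
    esac
    printf '%x\n' "$result"
}

effective_mask exact  0f 30   # hint wins          -> 30
effective_mask subset 0f 0c   # 0x0f AND 0x0c      -> c
```

Note that with the subset policy the intersection can be empty if the hint and the assigned object share no CPU; the sketch just prints 0 in that case.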

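And that completes the analysis!

Summary:
irqbalance migrates interrupts automatically according to the system's interrupt load in order to keep them balanced, and it also takes factors such as power saving into account. In a real-time system, however, this makes interrupts drift between CPUs and becomes a source of unstable performance, so for high-performance workloads it is better turned off.

Have fun!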
