
Misconceptions About the Linux Load Average


I had long been puzzled why the system load also climbs when IO usage is high. I happened upon this article, which finally cleared up my confusion.

Commands such as uptime and top show the load average. From left to right, the three numbers are the 1-minute, 5-minute, and 15-minute load averages:

$ uptime 
 18:05:09 up 1534 days,  7:42,  1 user,  load average: 5.76, 5.54, 5.61
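
For reference, the numbers that uptime and top print come straight from /proc/loadavg. The following is a minimal C sketch (mine, not from the original article) that reads the three averages directly:

#include <stdio.h>

int main(void)
{
        double one, five, fifteen;
        FILE *fp = fopen("/proc/loadavg", "r");

        if (!fp)
                return 1;
        /* /proc/loadavg looks like "5.76 5.54 5.61 2/1230 12345":
         * the first three fields are the 1-, 5- and 15-minute load averages */
        if (fscanf(fp, "%lf %lf %lf", &one, &five, &fifteen) == 3)
                printf("load average: %.2f, %.2f, %.2f\n", one, five, fifteen);
        fclose(fp);
        return 0;
}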

The concept of load average comes from UNIX systems. Although the exact formula differs between implementations, it is always meant to measure the number of processes that are currently using a CPU plus the number waiting for a CPU — in one phrase, the number of runnable processes. The load average can therefore serve as a reference indicator for CPU bottlenecks: if it exceeds the number of CPUs, the CPUs may be overloaded.
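
As a rough illustration of that rule of thumb (again a sketch of mine, not part of the article), glibc's getloadavg() and sysconf() let a program compare the 1-minute load average against the number of online CPUs:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        double la[3];
        long ncpu = sysconf(_SC_NPROCESSORS_ONLN);

        if (getloadavg(la, 3) != 3 || ncpu < 1)
                return 1;
        printf("load average %.2f, %.2f, %.2f on %ld CPUs\n",
               la[0], la[1], la[2], ncpu);
        if (la[0] > (double) ncpu)
                /* on a classic UNIX this would suggest a CPU bottleneck;
                 * on Linux it may just as well be blocked IO, see below */
                printf("1-minute load exceeds the CPU count\n");
        return 0;
}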

But that is not how it works on Linux!

On Linux, the load average counts not only the processes that are using or waiting for a CPU, but also processes in uninterruptible sleep. Processes typically enter uninterruptible sleep while waiting for IO devices or for the network. The Linux designers' reasoning was that uninterruptible sleep should be very short-lived and the process will resume running almost immediately, so it can be treated as runnable. However, uninterruptible sleep is still sleep, no matter how brief, and in the real world it is not necessarily brief at all: a large number of, or long-lasting, uninterruptible sleeps usually means an IO device has hit a bottleneck.

As everyone knows, a sleeping process does not need the CPU: even if all CPUs are idle, a sleeping process still cannot run. So the number of sleeping processes is simply not a suitable measure of CPU load, and by counting uninterruptible-sleep processes into the load average, Linux has overturned the original meaning of the metric. As a result, on Linux the load average has largely lost its value as an indicator, because you no longer know what it represents: when you see a high load average, you cannot tell whether there are too many runnable processes or too many uninterruptible-sleep processes, and therefore cannot judge whether the CPUs are overloaded or an IO device is the bottleneck.
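
To actually tell the two situations apart, you have to look at process states yourself. Below is a minimal sketch of my own (not from the article) that walks /proc and counts tasks in the R (runnable) state versus the D (uninterruptible sleep) state — roughly the two populations that Linux lumps together in the load average:

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        DIR *proc = opendir("/proc");
        struct dirent *de;
        int runnable = 0, uninterruptible = 0;

        if (!proc)
                return 1;
        while ((de = readdir(proc)) != NULL) {
                char path[300], line[512], *p;
                FILE *fp;

                if (!isdigit((unsigned char) de->d_name[0]))
                        continue;       /* not a pid directory */
                snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
                fp = fopen(path, "r");
                if (!fp)
                        continue;       /* process may have exited already */
                if (fgets(line, sizeof(line), fp)) {
                        /* /proc/<pid>/stat: "pid (comm) state ...";
                         * comm may contain spaces, so find the last ')' */
                        p = strrchr(line, ')');
                        if (p && p[1] == ' ') {
                                if (p[2] == 'R')
                                        runnable++;
                                else if (p[2] == 'D')
                                        uninterruptible++;
                        }
                }
                fclose(fp);
        }
        closedir(proc);
        /* note: this only checks thread-group leaders; per-thread state
         * lives under /proc/<pid>/task/ */
        printf("runnable (R): %d, uninterruptible (D): %d\n",
               runnable, uninterruptible);
        return 0;
}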

Reference: https://en.wikipedia.org/wiki/Load_(computing)
"Most UNIX systems count only processes in the running (on CPU) or runnable (waiting for CPU) states. However, Linux also includes processes in uninterruptible sleep states (usually waiting for disk activity), which can lead to markedly different results if many processes remain blocked in I/O due to a busy or stalled I/O system."

Source code:

RHEL6
kernel/sched.c:
===============
 
static void calc_load_account_active(struct rq *this_rq)
{
        long nr_active, delta;
 
        nr_active = this_rq->nr_running;
        nr_active += (long) this_rq->nr_uninterruptible;
 
        if (nr_active != this_rq->calc_load_active) {
                delta = nr_active - this_rq->calc_load_active;
                this_rq->calc_load_active = nr_active;
                atomic_long_add(delta, &calc_load_tasks);
        }
}
RHEL7
kernel/sched/core.c:
====================
 
static long calc_load_fold_active(struct rq *this_rq)
{
        long nr_active, delta = 0;
 
        nr_active = this_rq->nr_running;
        nr_active += (long) this_rq->nr_uninterruptible;
 
        if (nr_active != this_rq->calc_load_active) {
                delta = nr_active - this_rq->calc_load_active;
                this_rq->calc_load_active = nr_active;
        }
 
        return delta;
}
RHEL7
kernel/sched/core.c:
====================

/*
 * Global load-average calculations
 *
 * We take a distributed and async approach to calculating the global load-avg
 * in order to minimize overhead.
 *
 * The global load average is an exponentially decaying average of nr_running +
 * nr_uninterruptible.
 *
 * Once every LOAD_FREQ:
 *
 *   nr_active = 0;
 *   for_each_possible_cpu(cpu)
 *      nr_active += cpu_of(cpu)->nr_running + cpu_of(cpu)->nr_uninterruptible;
 *
 *   avenrun[n] = avenrun[0] * exp_n + nr_active * (1 - exp_n)
 *
 * Due to a number of reasons the above turns in the mess below:
 *
 *  - for_each_possible_cpu() is prohibitively expensive on machines with
 *    serious number of cpus, therefore we need to take a distributed approach
 *    to calculating nr_active.
 *
 *        \Sum_i x_i(t) = \Sum_i x_i(t) - x_i(t_0) | x_i(t_0) := 0
 *                      = \Sum_i { \Sum_j=1 x_i(t_j) - x_i(t_j-1) }
 *
 *    So assuming nr_active := 0 when we start out -- true per definition, we
 *    can simply take per-cpu deltas and fold those into a global accumulate
 *    to obtain the same result. See calc_load_fold_active().
 *
 *    Furthermore, in order to avoid synchronizing all per-cpu delta folding
 *    across the machine, we assume 10 ticks is sufficient time for every
 *    cpu to have completed this task.
 *
 *    This places an upper-bound on the IRQ-off latency of the machine. Then
 *    again, being late doesn't loose the delta, just wrecks the sample.
 *
 *  - cpu_rq()->nr_uninterruptible isn't accurately tracked per-cpu because
 *    this would add another cross-cpu cacheline miss and atomic operation
 *    to the wakeup path. Instead we increment on whatever cpu the task ran
 *    when it went into uninterruptible state and decrement on whatever cpu
 *    did the wakeup. This means that only the sum of nr_uninterruptible over
 *    all cpus yields the correct result.
 *
 *  This covers the NO_HZ=n code, for extra head-aches, see the comment below.
 */
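
To make the comment above concrete, here is a sketch of how the folded nr_active feeds the three exponentially decaying averages once every LOAD_FREQ. The fixed-point constants (FSHIFT, EXP_1, EXP_5, EXP_15) follow the classic values used by the kernel's calc_load(); treat this as an illustration of the formula, not a verbatim excerpt:

#include <stdio.h>

#define FSHIFT   11                     /* bits of fixed-point precision */
#define FIXED_1  (1 << FSHIFT)          /* 1.0 in fixed point */
#define EXP_1    1884                   /* 1/exp(5sec/1min) in fixed point */
#define EXP_5    2014                   /* 1/exp(5sec/5min) */
#define EXP_15   2037                   /* 1/exp(5sec/15min) */

static unsigned long avenrun[3];        /* 1-, 5- and 15-minute averages */

/* avenrun[n] = avenrun[n] * exp_n + nr_active * (1 - exp_n), in fixed point */
static unsigned long calc_load(unsigned long load, unsigned long exp,
                               unsigned long active)
{
        load *= exp;
        load += active * (FIXED_1 - exp);
        return load >> FSHIFT;
}

int main(void)
{
        long nr_active = 6;     /* pretend the fold summed to 6 R+D tasks */
        unsigned long active = nr_active * FIXED_1;
        int i;

        /* one update per LOAD_FREQ (about 5 seconds); simulate 5 minutes */
        for (i = 0; i < 60; i++) {
                avenrun[0] = calc_load(avenrun[0], EXP_1, active);
                avenrun[1] = calc_load(avenrun[1], EXP_5, active);
                avenrun[2] = calc_load(avenrun[2], EXP_15, active);
        }
        printf("load average: %lu.%02lu, %lu.%02lu, %lu.%02lu\n",
               avenrun[0] >> FSHIFT, ((avenrun[0] & (FIXED_1 - 1)) * 100) >> FSHIFT,
               avenrun[1] >> FSHIFT, ((avenrun[1] & (FIXED_1 - 1)) * 100) >> FSHIFT,
               avenrun[2] >> FSHIFT, ((avenrun[2] & (FIXED_1 - 1)) * 100) >> FSHIFT);
        return 0;
}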

Reference:

http://linuxperf.com/?p=176
