內核調試-perf introduction

阿新 • • 發佈：2018-04-22

sta ecif ted odi event HA bit 搶占 --help

perf概念

perf_event

Perf_events是目前在Linux上使用廣泛的profiling/tracing工具，除了本身是內核（kernel）的組成部分以外，還提供了用戶空間（user-space）的命令行工具（“perf”，“perf-record”，“perf-stat”等等）。

perf_events提供兩種工作模式：

采樣模式（sampling）
計數模式（counting）

“perf record”命令工作在采樣模式：周期性地做事件采樣，並把信息記錄下來，默認保存在perf.data文件；而“perf stat”命令工作在計數模式：僅僅統計某個事件發生的次數。

我們經常看到類似這樣的命令：“perf record -a ...... sleep 10”。在這裏，“sleep”這個命令相當於一個“dummy”命令，沒有做任何有意義的工作，它的作用是讓“perf record”命令對整個系統進行采樣，並在10秒後自動結束采樣工作。

perf_event - PMU

Perf_events所處理的hardware event（硬件事件）需要CPU的支持，而目前主流的CPU基本都包含了PMU（Performance Monitoring Unit，性能監控單元）。PMU用來統計性能相關的參數，像cache命中率，指令周期等等。由於這些統計工作是硬件完成的，所以CPU開銷很小。

以X86體系結構為例，PMU包含了兩種MSRs（Model-Specific Registers，之所以稱之為Model-Specific，是因為不同model的CPU，有些register是不同的）：Performance Event Select Registers和Performance Monitoring Counters（PMC）。當想對某種性能事件（performance event）進行統計時，需要對Performance Event Select Register進行設置，統計結果會存在Performance Monitoring Counter中。

當perf_events工作在采樣模式（sampling，perf record命令即工作在這種模式）時，由於采樣事件發生時和實際處理采樣事件之間有時間上的delay，以及CPU流水線和亂序執行等因素，所以得到的指令地址IP(Instruction Pointer) 並不是當時產生采樣事件的IP，這個稱之為skid。為了改善這種狀況，使IP值更加準確，Intel使用PEBS（Precise Event-Based Sampling），而AMD則使用IBS（Instruction-Based Sampling）。

以PEBS為例：每次采樣事件發生時，會先把采樣數據存到一個緩沖區中（PEBS buffer），當緩沖區內容達到某一值時，再一次性處理，這樣可以很好地解決skid問題。

執行一下perf list --help命令，會看到下面內容：

The p modifier can be used for specifying how precise the instruction address should be. The p modifier can be specified multiple times:

       0 - SAMPLE_IP can have arbitrary skid
       1 - SAMPLE_IP must have constant skid
       2 - SAMPLE_IP requested to have 0 skid
       3 - SAMPLE_IP must have 0 skid

For Intel systems precise event sampling is implemented with PEBS which supports up to precise-level 2.

現在可以理解，經常看到的類似“perf record -e "cpu/mem-loads/pp" -a”命令中，pp就是指定IP精度的。

系統調用`perf_open_event`

代表一種事件資源，用戶態調用 perf_open_event 即會創建一個與之對應的 perf_event 對象，相應的一些重要數據都會以這個數據結構為維度存放包含 pmu ctx enabled_time

running_time count 等信息

include/linux/perf_event.h
struct perf_event {

}


./arch/arm64/kernel/perf_event.c

例子

下面我用 ls 命令來演示 sys_enter 這個 tracepoint 的使用：

perf stat -e raw_syscalls:sys_enter ls

指定pid, 采集1s:

[root@localhost /home/ahao.mah]
#perf stat -e syscalls:* -p 49770 sleep 1

perf stat的輸出簡單介紹

perf stat 的輸出

[root@localhost /home/ahao.mah]
#perf stat ls
perf.data  perf.data.old  test  test.c

 Performance counter stats for ‘ls‘:

          1.256036      task-clock (msec)         #    0.724 CPUs utilized
                 4      context-switches          #    0.003 M/sec
                 0      cpu-migrations            #    0.000 K/sec
               285      page-faults               #    0.227 M/sec
         2,506,596      cycles                    #    1.996 GHz                      (87.56%)
         1,891,085      stalled-cycles-frontend   #   75.44% frontend cycles idle
         1,526,425      stalled-cycles-backend    #   60.90% backend  cycles idle
         1,551,244      instructions              #    0.62  insns per cycle
                                                  #    1.22  stalled cycles per insn
           309,841      branches                  #  246.682 M/sec
            12,190      branch-misses             #    3.93% of all branches          (21.57%)

       0.001733685 seconds time elapsed

1. 執行時間： 1.256036ms
2. 持續時間： 0.001733685 seconds time ， 持續時間肯定大於執行時間， 因為cpu的調度策略，搶占等原因
3. cpu利用率：  #    0.724 CPUs utilized  等於 （ 執行時間/持續時間）

perf stat 實現

tools/perf/builtin-stat.c
run_perf_stat
   __run_perf_stat

print_stat

perf使用

#include <stdio.h>
void longa()
{
  int i,j;
  for(i = 0; i < 1000000; i++)
  j=i; //am I silly or crazy? I feel boring and desperate.
}

void foo2()
{
  int i;
  for(i=0 ; i < 10; i++)
       longa();
}

void foo1()
{
  int i;
  for(i = 0; i< 100; i++)
     longa();
}

int main(void)
{
  foo1();
  foo2();
}

#perf stat -e kmem:*  ./t1

 Performance counter stats for ‘./t1‘:

                 1      kmem:kmalloc
             1,443      kmem:kmem_cache_alloc
                85      kmem:kmalloc_node
                85      kmem:kmem_cache_alloc_node
             1,078      kmem:kfree
             1,472      kmem:kmem_cache_free
                37      kmem:mm_page_free
                35      kmem:mm_page_free_batched
                40      kmem:mm_page_alloc
                70      kmem:mm_page_alloc_zone_locked
                 0      kmem:mm_page_pcpu_drain
                 0      kmem:mm_page_alloc_extfrag

       0.382027010 seconds time elapsed

perf的開銷

環境：

kernel 3.10
上面運行java 滿載733.3%

針對java 單個pid 進行perf record 采集，啟動階段perf開銷100%，穩定後7.5%

#perf sched record  -p 49770

#ps -eo pmem,pcpu,args   | grep perf  | grep -v grep
 0.0  0.0 [perf]
 0.0  7.5 perf sched record -p 49770

使用perf全局采集syscall,針對單pid，開銷很大，穩定在40%

#perf stat -e syscalls:* -p 49770 sleep 10

 0.0 88.0 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 96.5 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 90.6 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 68.0 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 54.4 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 45.3 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 38.8 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 34.0 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 30.2 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 27.2 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 24.7 perf stat -e syscalls:* -p 49770 sleep 10

全局采集syscall，開銷稍微小點

#perf stat -e syscalls:*  sleep 10

 0.0  0.0 [perf]
 0.0  0.0 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  6.0 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  3.0 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  2.0 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  1.5 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  1.0 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.8 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.7 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.6 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.6 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.5 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.5 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.4 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.4 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.4 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.3 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.3 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.3 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.4 perf stat -e syscalls:* sleep 10

最簡單的perf stat，開銷比較低

#perf stat  -p 49770 sleep 10

 0.0  0.0 [perf]
 0.0  0.0 perf stat -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  3.0 perf stat -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  1.0 perf stat -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.7 perf stat -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.6 perf stat -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.5 perf stat -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.4 perf stat -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.3 perf stat -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.3 perf stat -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.3 perf stat -p 49770 sleep 10

perf采集kmem相關event的開銷

#perf stat -e kmem:*  -p 49770 sleep 10

 Performance counter stats for process id ‘49770‘:

           163,603      kmem:kmalloc                                                  (100.00%)
           484,012      kmem:kmem_cache_alloc                                         (100.00%)
           302,553      kmem:kmalloc_node                                             (100.00%)
           301,051      kmem:kmem_cache_alloc_node                                     (100.00%)
           263,768      kmem:kfree                                                    (100.00%)
           774,941      kmem:kmem_cache_free                                          (100.00%)
            83,850      kmem:mm_page_free                                             (100.00%)
               799      kmem:mm_page_free_batched                                     (100.00%)
            83,064      kmem:mm_page_alloc                                            (100.00%)
             1,088      kmem:mm_page_alloc_zone_locked                                     (100.00%)
               403      kmem:mm_page_pcpu_drain                                       (100.00%)
                 0      kmem:mm_page_alloc_extfrag

 0.0  7.0 perf stat -e kmem:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  3.5 perf stat -e kmem:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  2.3 perf stat -e kmem:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  1.7 perf stat -e kmem:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  1.4 perf stat -e kmem:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  1.1 perf stat -e kmem:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  1.0 perf stat -e kmem:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.8 perf stat -e kmem:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.7 perf stat -e kmem:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.7 perf stat -e kmem:* -p 49770 sleep 10

REF

使用 OProfile for Linux on POWER 識別性能瓶頸:
https://www.ibm.com/developerworks/cn/linux/l-pow-oprofile/

http://abcdxyzk.github.io/blog/2015/07/27/debug-perf/

perf-tools:

https://www.slideshare.net/brendangregg/linux-performance-analysis-new-tools-and-old-secrets

內核調試-perf introduction

sta ecif ted odi event HA bit 搶占 --help perf概念 perf_event Perf_events是目前在Linux上使用廣泛的profiling/tracing工具，除了本身是內核（kernel）的組成部分以外，還提供了用戶空間（u

內核調試-perf introduction

perf概念

perf_event

perf_event - PMU

系統調用`perf_open_event`

例子

perf stat的輸出簡單介紹

perf stat 的輸出

perf stat 實現

perf使用

perf的開銷

針對java 單個pid 進行perf record 采集，啟動階段perf開銷100%，穩定後7.5%

使用perf全局采集syscall,針對單pid，開銷很大，穩定在40%

全局采集syscall，開銷稍微小點

最簡單的perf stat，開銷比較低

perf采集kmem相關event的開銷

REF

內核調試-perf introduction

內核調試神器SystemTap — 簡單介紹與使用（一）

linux內核調試

MacOS內核調試環境搭建

linux內核調試技巧之一 dump_stack【轉】

linux內核調試環境搭建

Kernel 內核調試

linux內核調優

linux 下nginx應用場景的tcp內核調優

6、進程知識（參合內核調度知識）

linux內核調優tcp_max_syn_backlog和somaxconn的區別

valgrind 內存調試工具

linux內核調度算法（2）--CPU時間片如何分配

【轉載】linux內核調度算法（3）--多核系統的負載均衡

linux內核調度算法（1）--快速找到最高優先級進程

linux內核調優參考

Linux內核調用I2C驅動_以MPU6050為例

Vmware+gdb調試Linux內核——工欲善其事，必先利其器

Mac下調試Win內核 (雙虛擬機調試) 環境配置

用qemu+gdb tcp server+CDT調試linux內核啟動-起步

內核調試-perf introduction

perf概念

perf_event

perf_event - PMU

系統調用perf_open_event

例子

perf stat的輸出簡單介紹

perf stat 的輸出

perf stat 實現

perf使用

perf的開銷

針對java 單個pid 進行perf record 采集， 啟動階段perf開銷100%，穩定後7.5%

使用perf全局采集syscall,針對單pid， 開銷很大， 穩定在40%

全局采集syscall，開銷稍微小點

最簡單的perf stat，開銷比較低

perf采集kmem相關event的開銷

REF

相關推薦

系統調用`perf_open_event`

針對java 單個pid 進行perf record 采集，啟動階段perf開銷100%，穩定後7.5%

使用perf全局采集syscall,針對單pid，開銷很大，穩定在40%