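Source: https://www.brendangregg.com/perf.html
Introduction
perf is an indispensable performance analysis tool for C development on Linux.
Describing everything it can do would take a book of several hundred pages.
For day-to-day use, a ready-made list of common commands makes it much easier to take advantage of what perf offers.
Working with these commands also gradually builds a deeper understanding of how perf works and of Linux performance in general.
Basic commands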
-
perf
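The perf command supports the subcommands shown below.
The most commonly used ones are:
- list: list the events that can be measured
- record: collect event data
- stat: print performance counter statistics for a process
- report: report the collected performance data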
# perf

 usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]

 The most commonly used perf commands are:
   annotate        Read perf.data (created by perf record) and display annotated code
   archive         Create archive with object files with build-ids found in perf.data file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a perf.data file
   config          Get and set variables in a configuration file.
   data            Data file related processing
   diff            Read perf.data files and display the differential profile
   evlist          List the event names in a perf.data file
   inject          Filter to augment the events stream with additional information
   kmem            Tool to trace/measure kernel memory properties
   kvm             Tool to trace/measure kvm guest os
   list            List all symbolic event types
   lock            Analyze lock events
   mem             Profile memory accesses
   record          Run a command and record its profile into perf.data
   report          Read perf.data (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   script          Read perf.data (created by perf record) and display trace output
   stat            Run a command and gather performance counter statistics
   test            Runs sanity tests.
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   probe           Define new dynamic tracepoints
   trace           strace inspired tool

 See 'perf help COMMAND' for more information on a specific command.
-
perf list
Lists every event that can be measured.
root@ubuntu:~# perf list|wc
   2056    7433  135920
On this machine, perf can measure roughly 2,000 different events.
# Listing all currently known events:
# perf list

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  ref-cycles                                         [Hardware event]
  cpu-clock                                          [Software event]
  task-clock                                         [Software event]
  page-faults OR faults                              [Software event]
  context-switches OR cs                             [Software event]
  cpu-migrations OR migrations                       [Software event]
  minor-faults                                       [Software event]
  major-faults                                       [Software event]
  alignment-faults                                   [Software event]
  emulation-faults                                   [Software event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
[...]
  rNNN                                               [Raw hardware event descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]
   (see 'man perf-list' on how to encode it)
  mem:<addr>[:access]                                [Hardware breakpoint]
  probe:tcp_sendmsg                                  [Tracepoint event]
[...]
  sched:sched_process_exec                           [Tracepoint event]
  sched:sched_process_fork                           [Tracepoint event]
  sched:sched_process_wait                           [Tracepoint event]
  sched:sched_wait_task                              [Tracepoint event]
  sched:sched_process_exit                           [Tracepoint event]
[...]
As the listing shows, perf can measure several key areas:
- CPU instruction execution metrics
- CPU cache utilization metrics
- page faults
- context switches
- process scheduling
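To see how the roughly 2,000 events split across these areas, the listing can be grouped by the bracketed type at the end of each line. The following is a minimal sketch, not part of the original page: the function name `count_event_types` is hypothetical, and it assumes the older `perf list` layout shown above, where every event line ends in a `[... event]` tag (newer perf versions may lay the listing out differently).

```shell
#!/bin/sh
# count_event_types: read `perf list` output on stdin and print how many
# events carry each bracketed type, e.g. "[Hardware event]".
# Assumption: each event line ends with its type in square brackets.
count_event_types() {
    awk '
        # Only lines that end in a bracketed, purely alphabetic type count;
        # "[...]" elision markers and "[:access]" fragments are ignored.
        match($0, /\[[A-Za-z ]+\][ \t]*$/) {
            type = substr($0, RSTART, RLENGTH)
            gsub(/^\[|\][ \t]*$/, "", type)   # strip the brackets
            count[type]++
        }
        END { for (t in count) printf "%d %s\n", count[t], t }
    ' | sort -k2
}

# Typical use, on a machine where perf is installed:
#   perf list | count_event_types
```

Feeding it the listing above would give one line per type (hardware, software, hardware cache, tracepoint, and so on), which makes it easy to see that tracepoints usually account for most of the total.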
perf stat
# perf stat gzip file1
Performance counter stats for 'gzip file1':
1920.159821 task-clock # 0.991 CPUs utilized
13 context-switches # 0.007 K/sec
0 CPU-migrations # 0.000 K/sec
258 page-faults # 0.134 K/sec
5,649,595,479 cycles # 2.942 GHz [83.43%]
1,808,339,931 stalled-cycles-frontend # 32.01% frontend cycles idle [83.54%]
1,171,884,577 stalled-cycles-backend # 20.74% backend cycles idle [66.77%]
8,625,207,199 instructions # 1.53 insns per cycle
# 0.21 stalled cycles per insn [83.51%]
1,488,797,176 branches # 775.351 M/sec [82.58%]
53,395,139 branch-misses # 3.59% of all branches [83.78%]
1.936842598 seconds time elapsed
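The derived values in the right-hand column are plain ratios of the raw counts, so they can be recomputed by hand when comparing runs. The sketch below is illustrative only: `strip_commas` and `ratio` are hypothetical helper names, and the numbers are copied from the gzip run above.

```shell
#!/bin/sh
# strip_commas: turn "8,625,207,199" into 8625207199.
strip_commas() { printf '%s\n' "$1" | tr -d ','; }

# ratio NUM DEN [SCALE]: print NUM / DEN * SCALE with two decimals.
ratio() {
    awk -v n="$(strip_commas "$1")" -v d="$(strip_commas "$2")" -v s="${3:-1}" \
        'BEGIN {
            if (d + 0 == 0) { print "n/a"; exit }
            printf "%.2f\n", n / d * s
        }'
}

# Instructions per cycle: instructions / cycles (perf reports 1.53).
ratio '8,625,207,199' '5,649,595,479'
# Branch miss percentage: branch-misses / branches * 100 (perf reports 3.59%).
ratio '53,395,139' '1,488,797,176' 100
```

An IPC well below 1 often points at stalls (memory, branch misses), which is why the stalled-cycles and branch-misses lines are worth reading together with the instruction count.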
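perf stat counts the events of interest for a process or for the whole system, which provides the basis for further analysis.
For example, count the L1 data cache events for the gzip command
to check how good its locality is.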
# perf list | grep L1-dcache
L1-dcache-loads [Hardware cache event]
L1-dcache-load-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
L1-dcache-store-misses [Hardware cache event]
L1-dcache-prefetches [Hardware cache event]
L1-dcache-prefetch-misses [Hardware cache event]
# perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores gzip file1
Performance counter stats for 'gzip file1':
1,947,551,657 L1-dcache-loads
153,829,652 L1-dcache-misses
# 7.90% of all L1-dcache hits
1,171,475,286 L1-dcache-stores
1.538038091 seconds time elapsed
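When the same measurement is repeated across many runs, it helps to pull the miss percentage out of the text output automatically. This is a rough sketch under stated assumptions: `l1_miss_pct` is a hypothetical name, and the parser expects the human-readable layout shown above (count in the first column, event name in the second). The run above labels the miss counter `L1-dcache-misses`, so the sketch accepts that label as well as the event name `L1-dcache-load-misses`.

```shell
#!/bin/sh
# l1_miss_pct: read `perf stat` text output on stdin and print the
# L1 data cache load miss percentage (misses / loads * 100).
l1_miss_pct() {
    awk '
        { gsub(/,/, "", $1) }                  # drop thousands separators
        $2 == "L1-dcache-loads" { loads = $1 }
        $2 == "L1-dcache-misses" || $2 == "L1-dcache-load-misses" { misses = $1 }
        END {
            if (loads + 0 > 0) printf "%.2f\n", misses / loads * 100
            else print "n/a"
        }
    '
}

# Typical use (perf stat writes its report to stderr):
#   perf stat -e L1-dcache-loads,L1-dcache-load-misses gzip file1 2>&1 | l1_miss_pct
```

For the numbers above this reproduces the 7.90% that perf prints next to the miss count.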
-
perf record
Collects stack trace samples,
which can then be used to find the functions that consume the most time.
# perf record -F 99 -a -g -- sleep 30
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 3.135 MB perf.data (~136971 samples) ]
# ls -lh perf.data
-rw------- 1 root root 3.2M Jan 26 07:26 perf.data
Use perf report to analyze perf.data:
# perf report --stdio
# ========
# captured on: Mon Jan 26 07:26:40 2014
# hostname : dev2
# os release : 3.8.6-ubuntu-12-opt
# perf version : 3.8.6
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Xeon(R) CPU X5675 @ 3.07GHz
# cpuid : GenuineIntel,6,44,2
# total memory : 8182008 kB
# cmdline : /usr/bin/perf record -F 99 -a -g -- sleep 30
# event : name = cpu-clock, type = 1, config = 0x0, config1 = 0x0, config2 = ...
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: software = 1, breakpoint = 5
# ========
#
# Samples: 22K of event 'cpu-clock'
# Event count (approx.): 22751
#
# Overhead  Command      Shared Object                           Symbol
# ........  .......  .................  ...............................
#
    94.12%       dd  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
                 |
                 --- _raw_spin_unlock_irqrestore
                    |
                    |--96.67%-- extract_buf
                    |          extract_entropy_user
                    |          urandom_read
                    |          vfs_read
                    |          sys_read
                    |          system_call_fastpath
                    |          read
                    |
                    |--1.69%-- account
                    |          |
                    |          |--99.72%-- extract_entropy_user
                    |          |          urandom_read
                    |          |          vfs_read
                    |          |          sys_read
                    |          |          system_call_fastpath
                    |          |          read
                    |           --0.28%-- [...]
                    |
                    |--1.60%-- mix_pool_bytes.constprop.17
[...]
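In a long report, the per-symbol summary lines (percentage, command, shared object, symbol) are usually the first thing to look at; the call-graph lines underneath explain how each symbol was reached. As a hedged sketch, the hypothetical helper `hot_symbols` below filters `perf report --stdio` output down to the summary lines at or above a threshold. It assumes the column layout shown above and symbols without embedded spaces, so demangled C++ names would be truncated.

```shell
#!/bin/sh
# hot_symbols MIN_PCT: read `perf report --stdio` output on stdin and print
# "overhead symbol" for each summary line at or above MIN_PCT percent.
hot_symbols() {
    awk -v min="${1:-1}" '
        /^#/ { next }                          # skip header comment lines
        # Summary lines start with a percentage and carry a [k] or [.] marker;
        # call-graph lines start with "|" or "-" and are skipped.
        $1 ~ /^[0-9.]+%$/ && $0 ~ /\[[k.]\]/ {
            pct = $1
            sub(/%/, "", pct)
            if (pct + 0 >= min + 0) print $1, $NF
        }
    '
}

# Typical use:
#   perf report --stdio | hot_symbols 5
```

Applied to the report above with a threshold of 5, it would keep the `_raw_spin_unlock_irqrestore` line at 94.12%.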
Frequently used commands
-
Listing events
# Listing all currently known events:
perf list

# Listing sched tracepoints:
perf list 'sched:*'
-
Counting events
# CPU counter statistics for the specified command:
perf stat command

# Detailed CPU counter statistics (includes extras) for the specified command:
perf stat -d command

# CPU counter statistics for the specified PID, until Ctrl-C:
perf stat -p PID

# CPU counter statistics for the entire system, for 5 seconds:
perf stat -a sleep 5

# Various basic CPU statistics, system wide, for 10 seconds:
perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles -a sleep 10

# Various CPU level 1 data cache statistics for the specified command:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores command

# Various CPU data TLB statistics for the specified command:
perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses command

# Various CPU last level cache statistics for the specified command:
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command

# Using raw PMC counters, eg, counting unhalted core cycles:
perf stat -e r003c -a sleep 5

# PMCs: counting cycles and frontend stalls via raw specification:
perf stat -e cycles -e cpu/event=0x0e,umask=0x01,inv,cmask=0x01/ -a sleep 5

# Count syscalls per-second system-wide:
perf stat -e raw_syscalls:sys_enter -I 1000 -a

# Count system calls by type for the specified PID, until Ctrl-C:
perf stat -e 'syscalls:sys_enter_*' -p PID

# Count system calls by type for the entire system, for 5 seconds:
perf stat -e 'syscalls:sys_enter_*' -a sleep 5

# Count scheduler events for the specified PID, until Ctrl-C:
perf stat -e 'sched:*' -p PID

# Count scheduler events for the specified PID, for 10 seconds:
perf stat -e 'sched:*' -p PID sleep 10

# Count ext4 events for the entire system, for 10 seconds:
perf stat -e 'ext4:*' -a sleep 10

# Count block device I/O events for the entire system, for 10 seconds:
perf stat -e 'block:*' -a sleep 10

# Count all vmscan events, printing a report every second:
perf stat -e 'vmscan:*' -a -I 1000
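Several of these one-liners differ only in the comma-separated list passed to -e. When the event set is kept in a script, a small helper can build that list. The sketch below is an illustration only; `join_events` is a hypothetical name, and the command is printed rather than executed so that it can be inspected first.

```shell
#!/bin/sh
# join_events EVENT...: join event names with commas, as `perf stat -e` expects.
join_events() {
    # "$*" joins the arguments with the first character of IFS.
    ( IFS=,; printf '%s\n' "$*" )
}

events=$(join_events cycles instructions cache-references cache-misses bus-cycles)
# Print the resulting command line instead of running it.
echo "perf stat -e $events -a sleep 10"
```

The printed line is the same "basic CPU statistics" command listed above; passing separate -e options for each event is an equivalent alternative.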
-
Profiling
# Sample on-CPU functions for the specified command, at 99 Hertz:
perf record -F 99 command

# Sample on-CPU functions for the specified PID, at 99 Hertz, until Ctrl-C:
perf record -F 99 -p PID

# Sample on-CPU functions for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID sleep 10

# Sample CPU stack traces (via frame pointers) for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID -g -- sleep 10

# Sample CPU stack traces for the PID, using dwarf (dbg info) to unwind stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID --call-graph dwarf sleep 10

# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (< Linux 4.11):
perf record -F 99 -ag -- sleep 10

# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (>= Linux 4.11):
perf record -F 99 -g -- sleep 10

# If the previous command didn't work, try forcing perf to use the cpu-clock event:
perf record -F 99 -e cpu-clock -ag -- sleep 10

# Sample CPU stack traces for a container identified by its /sys/fs/cgroup/perf_event cgroup:
perf record -F 99 -e cpu-clock --cgroup=docker/1d567f4393190204...etc... -a -- sleep 10

# Sample CPU stack traces for the entire system, with dwarf stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -a --call-graph dwarf sleep 10

# Sample CPU stack traces for the entire system, using last branch record for stacks, ... (>= Linux 4.?):
perf record -F 99 -a --call-graph lbr sleep 10

# Sample CPU stack traces, once every 10,000 Level 1 data cache misses, for 5 seconds:
perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5

# Sample CPU stack traces, once every 100 last level cache misses, for 5 seconds:
perf record -e LLC-load-misses -c 100 -ag -- sleep 5

# Sample on-CPU kernel instructions, for 5 seconds:
perf record -e cycles:k -a -- sleep 5

# Sample on-CPU user instructions, for 5 seconds:
perf record -e cycles:u -a -- sleep 5

# Sample on-CPU user instructions precisely (using PEBS), for 5 seconds:
perf record -e cycles:up -a -- sleep 5

# Perform branch tracing (needs HW support), for 1 second:
perf record -b -a sleep 1

# Sample CPUs at 49 Hertz, and show top addresses and symbols, live (no perf.data file):
perf top -F 49

# Sample CPUs at 49 Hertz, and show top process names and segments, live:
perf top -F 49 -ns comm,dso
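With -F the sampling rate is fixed per CPU, so the number of samples, and with it the size of perf.data, can be estimated before a long capture. A minimal sketch, with the hypothetical helper name `max_samples`: it multiplies frequency, duration, and CPU count to get an upper bound. For the earlier `perf record -F 99 -a -g -- sleep 30` run on 8 CPUs the bound is 23760, and the report showed an event count of 22751, slightly below it.

```shell
#!/bin/sh
# max_samples FREQ_HZ SECONDS NCPUS: upper bound on the samples a
# system-wide timed profile can collect (one sample per CPU per tick).
max_samples() {
    echo $(( $1 * $2 * $3 ))
}

# 99 Hz, 30 seconds, 8 CPUs
max_samples 99 30 8
```

The actual count is normally at or below this bound; treat the estimate as a sizing aid rather than a prediction.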
-
Static tracing
# Trace new processes, until Ctrl-C:
perf record -e sched:sched_process_exec -a

# Sample (take a subset of) context-switches, until Ctrl-C:
perf record -e context-switches -a

# Trace all context-switches, until Ctrl-C:
perf record -e context-switches -c 1 -a

# Include raw settings used (see: man perf_event_open):
perf record -vv -e context-switches -a

# Trace all context-switches via sched tracepoint, until Ctrl-C:
perf record -e sched:sched_switch -a

# Sample context-switches with stack traces, until Ctrl-C:
perf record -e context-switches -ag

# Sample context-switches with stack traces, for 10 seconds:
perf record -e context-switches -ag -- sleep 10

# Sample CS, stack traces, and with timestamps (< Linux 3.17, -T now default):
perf record -e context-switches -ag -T

# Sample CPU migrations, for 10 seconds:
perf record -e migrations -a -- sleep 10

# Trace all connect()s with stack traces (outbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_connect -ag

# Trace all accepts()s with stack traces (inbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_accept* -ag

# Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:
perf record -e block:block_rq_insert -ag

# Sample at most 100 block device requests per second, until Ctrl-C:
perf record -F 100 -e block:block_rq_insert -a

# Trace all block device issues and completions (has timestamps), until Ctrl-C:
perf record -e block:block_rq_issue -e block:block_rq_complete -a

# Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'nr_sector > 200'

# Trace all block completions, synchronous writes only, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs == "WS"'

# Trace all block completions, all types of writes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"'

# Sample minor faults (RSS growth) with stack traces, until Ctrl-C:
perf record -e minor-faults -ag

# Trace all minor faults with stack traces, until Ctrl-C:
perf record -e minor-faults -c 1 -ag

# Sample page faults with stack traces, until Ctrl-C:
perf record -e page-faults -ag

# Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:
perf record -e 'ext4:*' -o /tmp/perf.data -a

# Trace kswapd wakeup events, until Ctrl-C:
perf record -e vmscan:mm_vmscan_wakeup_kswapd -ag

# Add Node.js USDT probes (Linux 4.10+):
perf buildid-cache --add `which node`

# Trace the node http__server__request USDT event (Linux 4.10+):
perf record -e sdt_node:http__server__request -a
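The `nr_sector > 200` filter in the block completion example follows from the block layer counting sectors in units of 512 bytes: 100 Kbytes is 100 * 1024 / 512 = 200 sectors. The sketch below shows the conversion for other sizes; `size_filter` is a hypothetical helper name, and it reuses the same `>` comparison as the command above.

```shell
#!/bin/sh
# size_filter KBYTES: build a --filter expression for block tracepoints,
# converting Kbytes to 512-byte sectors.
size_filter() {
    kbytes=$1
    sectors=$(( kbytes * 1024 / 512 ))
    printf 'nr_sector > %d\n' "$sectors"
}

# Reproduces the filter used above for 100 Kbytes.
size_filter 100

# Typical use:
#   perf record -e block:block_rq_complete --filter "$(size_filter 1024)"
```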
-
Dynamic tracing
# Add a tracepoint for the kernel tcp_sendmsg() function entry ("--add" is optional):
perf probe --add tcp_sendmsg

# Remove the tcp_sendmsg() tracepoint (or use "--del"):
perf probe -d tcp_sendmsg

# Add a tracepoint for the kernel tcp_sendmsg() function return:
perf probe 'tcp_sendmsg%return'

# Show available variables for the kernel tcp_sendmsg() function (needs debuginfo):
perf probe -V tcp_sendmsg

# Show available variables for the kernel tcp_sendmsg() function, plus external vars (needs debuginfo):
perf probe -V tcp_sendmsg --externs

# Show available line probes for tcp_sendmsg() (needs debuginfo):
perf probe -L tcp_sendmsg

# Show available variables for tcp_sendmsg() at line number 81 (needs debuginfo):
perf probe -V tcp_sendmsg:81

# Add a tracepoint for tcp_sendmsg(), with three entry argument registers (platform specific):
perf probe 'tcp_sendmsg %ax %dx %cx'

# Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for the %cx register (platform specific):
perf probe 'tcp_sendmsg bytes=%cx'

# Trace previously created probe when the bytes (alias) variable is greater than 100:
perf record -e probe:tcp_sendmsg --filter 'bytes > 100'

# Add a tracepoint for tcp_sendmsg() return, and capture the return value:
perf probe 'tcp_sendmsg%return $retval'

# Add a tracepoint for tcp_sendmsg(), and "size" entry argument (reliable, but needs debuginfo):
perf probe 'tcp_sendmsg size'

# Add a tracepoint for tcp_sendmsg(), with size and socket state (needs debuginfo):
perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'

# Tell me how on Earth you would do this, but don't actually do it (needs debuginfo):
perf probe -nv 'tcp_sendmsg size sk->__sk_common.skc_state'

# Trace previous probe when size is non-zero, and state is not TCP_ESTABLISHED(1) (needs debuginfo):
perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a

# Add a tracepoint for tcp_sendmsg() line 81 with local variable seglen (needs debuginfo):
perf probe 'tcp_sendmsg:81 seglen'

# Add a tracepoint for do_sys_open() with the filename as a string (needs debuginfo):
perf probe 'do_sys_open filename:string'

# Add a tracepoint for myfunc() return, and include the retval as a string:
perf probe 'myfunc%return +0($retval):string'

# Add a tracepoint for the user-level malloc() function from libc:
perf probe -x /lib64/libc.so.6 malloc

# Add a tracepoint for this user-level static probe (USDT, aka SDT event):
perf probe -x /usr/lib64/libpthread-2.24.so %sdt_libpthread:mutex_entry

# List currently available dynamic probes:
perf probe -l
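Dynamic probes stay registered until they are removed, so a typical session has three steps: add the probe, record it, delete it. The following sketch strings those steps together using only commands from the list above. The function names `run` and `trace_func` and the `DRY_RUN` variable are hypothetical; with `DRY_RUN=1` the commands are printed instead of executed, which is how the example is shown here, since real runs generally need root privileges.

```shell
#!/bin/sh
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# trace_func FUNC [SECONDS]: add a kernel probe, record it system-wide
# for SECONDS (default 10), then remove the probe again.
trace_func() {
    func=$1
    secs=${2:-10}
    run perf probe --add "$func" || return 1
    run perf record -e "probe:$func" -a -- sleep "$secs"
    status=$?
    run perf probe -d "$func"          # clean up even if recording failed
    return "$status"
}

# Show what would be executed for tcp_sendmsg, without touching the system.
DRY_RUN=1
trace_func tcp_sendmsg 5
```

Removing the probe in the same function keeps stale entries from accumulating in the output of `perf probe -l`.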
-
Reporting
# Show perf.data in an ncurses browser (TUI) if possible:
perf report

# Show perf.data with a column for sample count:
perf report -n

# Show perf.data as a text report, with data coalesced and percentages:
perf report --stdio

# Report, with stacks in folded format: one line per stack (needs 4.4):
perf report --stdio -n -g folded

# List all events from perf.data:
perf script

# List all perf.data events, with data header (newer kernels; was previously default):
perf script --header

# List all perf.data events, with customized fields (< Linux 4.1):
perf script -f time,event,trace

# List all perf.data events, with customized fields (>= Linux 4.1):
perf script -F time,event,trace

# List all perf.data events, with my recommended fields (needs record -a; newer kernels):
perf script --header -F comm,pid,tid,cpu,time,event,ip,sym,dso

# List all perf.data events, with my recommended fields (needs record -a; older kernels):
perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso

# Dump raw contents from perf.data as hex (for debugging):
perf script -D

# Disassemble and annotate instructions with percentages (needs some debuginfo):
perf annotate --stdio