stack walking scripts
author Brendan Gregg <brendan.d.gregg@gmail.com>
Sat, 16 Jan 2016 01:26:30 +0000 (17:26 -0800)
committer Brendan Gregg <brendan.d.gregg@gmail.com>
Sat, 16 Jan 2016 01:26:30 +0000 (17:26 -0800)
README.md
man/man8/offcputime.8 [new file with mode: 0644]
man/man8/stackcount.8 [new file with mode: 0644]
man/man8/stacksnoop.8 [new file with mode: 0644]
tools/offcputime [new file with mode: 0755]
tools/offcputime_example.txt [new file with mode: 0644]
tools/stackcount [new file with mode: 0755]
tools/stackcount_example.txt [new file with mode: 0644]
tools/stacksnoop [new file with mode: 0755]
tools/stacksnoop_example.txt [new file with mode: 0644]

index 4e2a7ab..afbe8a6 100644 (file)
--- a/README.md
+++ b/README.md
@@ -71,9 +71,12 @@ Tools:
 - tools/[funclatency](tools/funclatency): Time kernel functions and show their latency distribution. [Examples](tools/funclatency_example.txt).
 - tools/[hardirqs](tools/hardirqs):  Measure hard IRQ (hard interrupt) event time. [Examples](tools/hardirqs_example.txt).
 - tools/[killsnoop](tools/killsnoop): Trace signals issued by the kill() syscall. [Examples](tools/killsnoop_example.txt).
+- tools/[offcputime](tools/offcputime): Summarize off-CPU time by kernel stack trace. [Examples](tools/offcputime_example.txt).
 - tools/[opensnoop](tools/opensnoop): Trace open() syscalls. [Examples](tools/opensnoop_example.txt).
 - tools/[pidpersec](tools/pidpersec): Count new processes (via fork). [Examples](tools/pidpersec_example.txt).
 - tools/[softirqs](tools/softirqs):  Measure soft IRQ (soft interrupt) event time. [Examples](tools/softirqs_example.txt).
+- tools/[stackcount](tools/stackcount): Count kernel function calls and their stack traces. [Examples](tools/stackcount_example.txt).
+- tools/[stacksnoop](tools/stacksnoop): Trace a kernel function and print all kernel stack traces. [Examples](tools/stacksnoop_example.txt).
 - tools/[syncsnoop](tools/syncsnoop): Trace sync() syscall. [Examples](tools/syncsnoop_example.txt).
 - tools/[tcpaccept](tools/tcpaccept): Trace TCP passive connections (accept()). [Examples](tools/tcpaccept_example.txt).
 - tools/[tcpconnect](tools/tcpconnect): Trace TCP active connections (connect()). [Examples](tools/tcpconnect_example.txt).
diff --git a/man/man8/offcputime.8 b/man/man8/offcputime.8
new file mode 100644 (file)
index 0000000..0c34464
--- /dev/null
@@ -0,0 +1,88 @@
+.TH offcputime 8  "2016-01-14" "USER COMMANDS"
+.SH NAME
+offcputime \- Summarize off-CPU time by kernel stack trace. Uses Linux eBPF/bcc.
+.SH SYNOPSIS
+.B offcputime [\-h] [\-p PID] [\-v] [\-f] [duration]
+.SH DESCRIPTION
+This program shows kernel stack traces and task names that were blocked and
+"off-CPU", and the total duration they were blocked: their "off-CPU time".
+It works by tracing when threads block and when they return to CPU, recording
+the time they were blocked, the kernel stack trace at the time of blocking,
+and the task name. This data is summarized in the kernel using an eBPF map,
+by summing the blocked time for each unique stack trace and task name.
+
+The output summary will help you identify reasons why threads
+were blocking, and quantify the time they were blocked. This spans all types
+of blocking activity: disk I/O, network I/O, locks, page faults, involuntary
+context switches, etc.
+
+This is complementary to CPU profiling (e.g., CPU flame graphs), which shows
+the time spent on-CPU. offcputime shows the time spent off-CPU, and its output,
+especially the \-f format, can be used to generate an "off-CPU time flame graph".
+
+See http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
+
+The stack depth is currently limited to 20, and the stack traces are kernel
+mode only. Check for newer versions where either may be improved.
+
+This currently only works on x86_64. Check for future versions.
+.SH REQUIREMENTS
+CONFIG_BPF and bcc.
+.SH OPTIONS
+.TP
+\-h
+Print usage message.
+.TP
+\-v
+Show raw addresses.
+.TP
+\-f
+Print output in folded stack format (suitable for flame graphs).
+.TP
+\-p PID
+Trace this process ID only (filtered in-kernel).
+.TP
+duration
+Duration to trace, in seconds.
+.SH EXAMPLES
+.TP
+Trace all thread blocking events, and summarize (in-kernel) by kernel stack
+trace and total off-CPU time:
+#
+.B offcputime
+.TP
+Trace for 5 seconds only:
+#
+.B offcputime 5
+.TP
+Trace for 5 seconds, and emit output in folded stack format (suitable for
+flame graphs):
+#
+.B offcputime -f 5
+.TP
+Trace PID 185 only:
+#
+.B offcputime -p 185
+.SH OVERHEAD
+This summarizes unique stack traces in-kernel for efficiency, allowing it to
+trace a higher rate of events than methods that post-process in user space. The
+stack trace and time data are only copied to user space once, when the output is
+printed. While these techniques greatly lower overhead, scheduler events remain
+high frequency, as they can exceed 1 million events per second, so caution
+should still be used. Test before production use.
+
+If the overhead is still a problem, take a look at the MINBLOCK_US tunable in
+the code. If your aim is to chase down longer blocking events, then this could
+be increased to filter shorter blocking events, further lowering overhead.
+.SH SOURCE
+This is from bcc.
+.IP
+https://github.com/iovisor/bcc
+.PP
+Also look in the bcc distribution for a companion _examples.txt file containing
+example usage, output, and commentary for this tool.
+.SH OS
+Linux
+.SH STABILITY
+Unstable - in development.
+.SH AUTHOR
+Brendan Gregg
+.SH SEE ALSO
+stackcount(8)
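The in-kernel summarization the DESCRIPTION above refers to can be modeled in a few lines of Python. This is a conceptual sketch only: the `record` helper and the plain dict stand in for the eBPF hash map, and are not part of the tool's code.

```python
# Conceptual model of offcputime's in-kernel summary: an eBPF hash map
# keyed by (task name, kernel stack), accumulating blocked microseconds.
counts = {}

def record(name, stack, blocked_us):
    # One blocking event: add its duration to the entry for this
    # unique (task name, stack trace) pair.
    key = (name, tuple(stack))
    counts[key] = counts.get(key, 0) + blocked_us

record("dd", ["schedule", "io_schedule_timeout"], 1200)
record("dd", ["schedule", "io_schedule_timeout"], 800)
record("bash", ["schedule", "do_wait"], 50)
```

Only this aggregated map, not each individual event, is copied to user space, which is why the approach can sustain high event rates.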
diff --git a/man/man8/stackcount.8 b/man/man8/stackcount.8
new file mode 100644 (file)
index 0000000..1354190
--- /dev/null
@@ -0,0 +1,106 @@
+.TH stackcount 8  "2016-01-14" "USER COMMANDS"
+.SH NAME
+stackcount \- Count kernel function calls and their stack traces. Uses Linux eBPF/bcc.
+.SH SYNOPSIS
+.B stackcount [\-h] [\-p PID] [\-i INTERVAL] [\-T] [\-r] [\-s] [\-v] pattern
+.SH DESCRIPTION
+stackcount traces kernel functions and frequency counts them with their entire
+kernel stack trace, summarized in-kernel for efficiency. This allows higher
+frequency events to be studied. The output consists of unique stack traces,
+and their occurrence counts.
+
+The pattern is a string with optional '*' wildcards, similar to file globbing.
+If you'd prefer to use regular expressions, use the \-r option.
+
+The stack depth is currently limited to 10 (+1 for the current instruction
+pointer).
+
+This currently only works on x86_64. Check for future versions.
+.SH REQUIREMENTS
+CONFIG_BPF and bcc.
+.SH OPTIONS
+.TP
+\-h
+Print usage message.
+.TP
+\-r
+Allow regular expressions for the search pattern. The default allows "*"
+wildcards only.
+.TP
+\-s
+Show address offsets.
+.TP
+\-T
+Include a timestamp with interval output.
+.TP
+\-v
+Show raw addresses.
+.TP
+\-i interval
+Summary interval, in seconds.
+.TP
+\-p PID
+Trace this process ID only (filtered in-kernel).
+.TP
+pattern
+A kernel function name, or a search pattern. Can include wildcards ("*"). If the
+\-r option is used, can include regular expressions.
+.SH EXAMPLES
+.TP
+Count kernel stack traces for submit_bio():
+#
+.B stackcount submit_bio
+.TP
+Count kernel stack traces for ip_output():
+#
+.B stackcount ip_output
+.TP
+Show symbol offsets:
+#
+.B stackcount -s ip_output
+.TP
+Show offsets and raw addresses (verbose):
+#
+.B stackcount -sv ip_output
+.TP
+Count kernel stacks for kernel functions matching tcp_send*:
+#
+.B stackcount 'tcp_send*'
+.TP
+Same as previous, but using regular expressions:
+#
+.B stackcount -r '^tcp_send.*'
+.TP
+Output every 5 seconds, with timestamps:
+#
+.B stackcount -Ti 5 ip_output
+.TP
+Only count stacks when PID 185 is on-CPU:
+#
+.B stackcount -p 185 ip_output
+.SH OVERHEAD
+This summarizes unique stack traces in-kernel for efficiency, allowing it to
+trace a higher rate of function calls than methods that post-process in user
+space. The stack trace data is only copied to user space when the output is
+printed, which usually only happens once. Given these techniques, I'd suspect
+that call rates of < 10,000/sec would incur negligible overhead (for this
+current version; future versions may improve this). Beyond that,
+there will be a point where the overhead is measurable, as this does add
+a number of instructions to each function call to walk and save stacks.
+Test before production use. You can also use funccount to get a handle on
+function call rates first.
+.SH SOURCE
+This is from bcc.
+.IP
+https://github.com/iovisor/bcc
+.PP
+Also look in the bcc distribution for a companion _examples.txt file containing
+example usage, output, and commentary for this tool.
+.SH OS
+Linux
+.SH STABILITY
+Unstable - in development.
+.SH AUTHOR
+Brendan Gregg
+.SH SEE ALSO
+stacksnoop(8), funccount(8)
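The pattern handling described above (file-glob "*" wildcards by default, full regular expressions with \-r) can be sketched in Python; `glob_to_regex` is a hypothetical helper for illustration, not the tool's actual implementation.

```python
import re

def glob_to_regex(pattern, use_regex=False):
    # With -r, the pattern is taken as a regular expression as-is.
    if use_regex:
        return pattern
    # Default mode: "*" is the only wildcard. Escape everything else
    # and anchor the match to the whole kernel function name.
    return "^" + re.escape(pattern).replace(r"\*", ".*") + "$"

# e.g. 'tcp_send*' should match tcp_sendmsg but not udp_sendmsg
regex = glob_to_regex("tcp_send*")
```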
diff --git a/man/man8/stacksnoop.8 b/man/man8/stacksnoop.8
new file mode 100644 (file)
index 0000000..66a9407
--- /dev/null
@@ -0,0 +1,80 @@
+.TH stacksnoop 8  "2016-01-14" "USER COMMANDS"
+.SH NAME
+stacksnoop \- Print kernel stack traces for kernel functions. Uses Linux eBPF/bcc.
+.SH SYNOPSIS
+.B stacksnoop [\-h] [\-p PID] [\-s] [\-v] function
+.SH DESCRIPTION
+stacksnoop traces a given kernel function and for each call, prints the
+kernel stack back trace for that call. This shows the ancestry of function
+calls, and is a quick way to investigate low frequency kernel functions and
+their cause. For high frequency kernel functions, see stackcount.
+
+The stack depth is currently limited to 10 (+1 for the current instruction
+pointer).
+
+This currently only works on x86_64. Check for future versions.
+.SH REQUIREMENTS
+CONFIG_BPF and bcc.
+.SH OPTIONS
+.TP
+\-h
+Print usage message.
+.TP
+\-s
+Show address offsets.
+.TP
+\-v
+Print more fields.
+.TP
+\-p PID
+Trace this process ID only (filtered in-kernel).
+.TP
+function
+Kernel function name.
+.SH EXAMPLES
+.TP
+Print kernel stack traces for each call to ext4_sync_fs:
+#
+.B stacksnoop ext4_sync_fs
+.TP
+Also show the symbol offsets:
+#
+.B stacksnoop -s ext4_sync_fs
+.TP
+Show extra columns:
+#
+.B stacksnoop -v ext4_sync_fs
+.TP
+Only trace when PID 185 is on-CPU:
+#
+.B stacksnoop -p 185 ext4_sync_fs
+.SH FIELDS
+.TP
+TIME(s)
+Time of the call, in seconds.
+.TP
+STACK
+Kernel stack trace. The first column shows "ip" for instruction pointer, and
+"r#" for each return pointer in the stack. The second column is the stack trace
+as hexadecimal. The third column shows the translated kernel symbol names.
+.SH OVERHEAD
+This can have significant overhead if frequently called functions (> 1000/s) are
+traced, and is only intended for low frequency function calls. This is because
+details, including the stack trace for every call, are passed to user space and
+processed. See stackcount for higher frequency calls, which performs in-kernel
+summaries.
+.SH SOURCE
+This is from bcc.
+.IP
+https://github.com/iovisor/bcc
+.PP
+Also look in the bcc distribution for a companion _examples.txt file containing
+example usage, output, and commentary for this tool.
+.SH OS
+Linux
+.SH STABILITY
+Unstable - in development.
+.SH AUTHOR
+Brendan Gregg
+.SH SEE ALSO
+stackcount(8)
diff --git a/tools/offcputime b/tools/offcputime
new file mode 100755 (executable)
index 0000000..4ab9439
--- /dev/null
@@ -0,0 +1,201 @@
+#!/usr/bin/python
+#
+# offcputime    Summarize off-CPU time by kernel stack trace
+#               For Linux, uses BCC, eBPF.
+#
+# USAGE: offcputime [-h] [-p PID] [-v] [-f] [duration]
+#
+# The current implementation uses an unrolled loop for x86_64, and was written
+# as a proof of concept. This implementation should be replaced in the future
+# with an appropriate bpf_ call, when available.
+#
+# Currently limited to a stack trace depth of 20 (MAXDEPTH).
+#
+# Copyright 2016 Netflix, Inc.
+# Licensed under the Apache License, Version 2.0 (the "License")
+#
+# 13-Jan-2016  Brendan Gregg   Created this.
+
+from __future__ import print_function
+from bcc import BPF
+from time import sleep, strftime
+import argparse
+import signal
+
+# arguments
+examples = """examples:
+    ./offcputime             # trace off-CPU stack time until Ctrl-C
+    ./offcputime 5           # trace for 5 seconds only
+    ./offcputime -f 5        # 5 seconds, and output in folded format
+    ./offcputime -p 185      # trace for PID 185 only
+"""
+parser = argparse.ArgumentParser(
+    description="Summarize off-CPU time by kernel stack trace",
+    formatter_class=argparse.RawDescriptionHelpFormatter,
+    epilog=examples)
+parser.add_argument("-p", "--pid",
+    help="trace this PID only")
+parser.add_argument("-v", "--verbose", action="store_true",
+    help="show raw addresses")
+parser.add_argument("-f", "--folded", action="store_true",
+    help="output folded format")
+parser.add_argument("duration", nargs="?", default=99999999,
+    help="duration of trace, in seconds")
+args = parser.parse_args()
+folded = args.folded
+duration = int(args.duration)
+debug = 0
+maxdepth = 20    # must match MAXDEPTH in the BPF program
+
+# signal handler
+def signal_ignore(signal, frame):
+    print()
+
+# load BPF program
+bpf_text = """
+#include <uapi/linux/ptrace.h>
+#include <linux/sched.h>
+
+#define MAXDEPTH       20
+#define MINBLOCK_US    1
+
+struct key_t {
+    char name[TASK_COMM_LEN];
+    // Skip saving the ip
+    u64 ret[MAXDEPTH];
+};
+BPF_HASH(counts, struct key_t);
+BPF_HASH(start, u32);
+
+static u64 get_frame(u64 *bp) {
+    if (*bp) {
+        // The following stack walker is x86_64 specific
+        u64 ret = 0;
+        if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
+            return 0;
+        if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
+            bp = 0;
+        if (ret < __START_KERNEL_map)
+            return 0;
+        return ret;
+    }
+    return 0;
+}
+
+int offcpu(struct pt_regs *ctx) {
+    u32 pid = bpf_get_current_pid_tgid();
+    u64 ts = bpf_ktime_get_ns();
+    FILTER
+    start.update(&pid, &ts);
+    return 0;
+}
+
+int oncpu(struct pt_regs *ctx) {
+    u32 pid = bpf_get_current_pid_tgid();
+    FILTER
+    u64 ts = bpf_ktime_get_ns();
+    struct key_t key = {};
+    u64 zero = 0, *val, bp = 0, *tsp, delta;
+    int depth = 0;
+
+    // calculate delta time
+    tsp = start.lookup(&pid);
+    if (tsp == 0)
+        return 0;        // missed start
+    delta = bpf_ktime_get_ns() - *tsp;
+    start.delete(&pid);
+    delta = delta / 1000;
+    if (delta < MINBLOCK_US)
+        return 0;
+
+    bpf_get_current_comm(&key.name, sizeof(key.name));
+    bp = ctx->bp;
+
+    // unrolled loop (MAXDEPTH):
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+
+out:
+    val = counts.lookup_or_init(&key, &zero);
+    (*val) += delta;
+    return 0;
+}
+"""
+if args.pid:
+    bpf_text = bpf_text.replace('FILTER',
+        'if (pid != %s) { return 0; }' % (args.pid))
+else:
+    bpf_text = bpf_text.replace('FILTER', '')
+if debug:
+    print(bpf_text)
+b = BPF(text=bpf_text)
+b.attach_kprobe(event="schedule", fn_name="offcpu")
+b.attach_kprobe(event="finish_task_switch", fn_name="oncpu")
+matched = b.num_open_kprobes()
+if matched == 0:
+    print("0 functions traced. Exiting.")
+    exit()
+
+# header
+if not folded:
+    print("Tracing off-CPU time (us) by kernel stack", end="")
+    if duration < 99999999:
+        print(" for %d secs." % duration)
+    else:
+        print("... Hit Ctrl-C to end.")
+
+# output
+while (1):
+    try:
+        sleep(duration)
+    except KeyboardInterrupt:
+        # as cleanup can take many seconds, trap Ctrl-C:
+        signal.signal(signal.SIGINT, signal_ignore)
+
+    if not folded:
+        print()
+    counts = b.get_table("counts")
+    for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
+        if folded:
+            # print folded stack output
+            line = k.name + ";"
+            for i in reversed(range(0, maxdepth)):
+                if k.ret[i] == 0:
+                    continue
+                line = line + b.ksym(k.ret[i])
+                if i != 0:
+                    line = line + ";"
+            print("%s %d" % (line, v.value))
+        else:
+            # print default multi-line stack output
+            for i in range(0, maxdepth):
+                if k.ret[i] == 0:
+                    break
+                print("    %-16x %s" % (k.ret[i],
+                    b.ksym(k.ret[i])))
+            print("    %-16s %s" % ("-", k.name))
+            print("        %d\n" % v.value)
+    counts.clear()
+
+    if not folded:
+        print("Detaching...")
+    exit()
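For reference, the folded lines emitted by the -f branch above ("task;frame1;...;frameN total_us") are easy to post-process. A minimal sketch, where `top_stacks` is an illustrative helper rather than part of the tool:

```python
def top_stacks(folded_lines, n=3):
    # Each folded line is "task;frame1;...;frameN <total_us>":
    # split the count off the end and sort by total off-CPU time.
    stacks = []
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, us = line.rsplit(" ", 1)
        stacks.append((int(us), stack))
    return sorted(stacks, reverse=True)[:n]

sample = [
    "bash;entry_SYSCALL_64_fastpath;sys_wait4;do_wait;schedule 699320",
    "rcuos/4;ret_from_fork;kthread;rcu_nocb_kthread;schedule 45",
]
```

The same format is what flamegraph.pl consumes directly.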
diff --git a/tools/offcputime_example.txt b/tools/offcputime_example.txt
new file mode 100644 (file)
index 0000000..0883341
--- /dev/null
@@ -0,0 +1,744 @@
+Demonstrations of offcputime, the Linux eBPF/bcc version.
+
+
+This program shows stack traces that were blocked, and the total duration they
+were blocked. It works by tracing when threads block and when they return to
+CPU, measuring both the time they were blocked (aka the "off-CPU time") and the
+blocked kernel stack trace and the task name. This data is summarized in the
+kernel by summing the blocked time by unique stack trace and task name.
+
+Here is some example output. To explain what we are seeing: the very first
+stack trace looks like a page fault (do_page_fault() etc) from the "chmod"
+command, and in total was off-CPU for 13 microseconds.
+
+# ./offcputime 
+Tracing off-CPU time (us) by kernel stack... Hit Ctrl-C to end.
+^C
+    schedule
+    schedule_timeout
+    io_schedule_timeout
+    bit_wait_io
+    __wait_on_bit
+    wait_on_page_bit_killable
+    __lock_page_or_retry
+    filemap_fault
+    __do_fault
+    handle_mm_fault
+    __do_page_fault
+    do_page_fault
+    page_fault
+    chmod
+        13
+
+    schedule
+    rcu_nocb_kthread
+    kthread
+    ret_from_fork
+    ddebug_tables
+    rcuos/0
+        22
+
+    schedule
+    schedule_timeout
+    io_schedule_timeout
+    bit_wait_io
+    __wait_on_bit_lock
+    __lock_page
+    lock_page
+    __do_fault
+    handle_mm_fault
+    __do_page_fault
+    do_page_fault
+    page_fault
+    run
+        27
+
+    schedule
+    schedule_timeout
+    io_schedule_timeout
+    bit_wait_io
+    __wait_on_bit
+    wait_on_page_bit_killable
+    __lock_page_or_retry
+    filemap_fault
+    __do_fault
+    handle_mm_fault
+    __do_page_fault
+    do_page_fault
+    page_fault
+    clear_user
+    padzero
+    load_elf_binary
+    search_binary_handler
+    load_script
+    search_binary_handler
+    do_execveat_common.isra.27
+    run
+        28
+
+    schedule
+    schedule_timeout
+    io_schedule_timeout
+    bit_wait_io
+    __wait_on_bit
+    wait_on_page_bit_killable
+    __lock_page_or_retry
+    filemap_fault
+    __do_fault
+    handle_mm_fault
+    __do_page_fault
+    do_page_fault
+    page_fault
+    run
+        82
+
+    schedule
+    pipe_wait
+    pipe_read
+    __vfs_read
+    vfs_read
+    sys_read
+    entry_SYSCALL_64_fastpath
+    bash
+        94
+
+    schedule
+    rcu_gp_kthread
+    kthread
+    ret_from_fork
+    ddebug_tables
+    rcu_sched
+        104
+
+    schedule
+    schedule_timeout
+    io_schedule_timeout
+    bit_wait_io
+    __wait_on_bit
+    out_of_line_wait_on_bit
+    __wait_on_buffer
+    jbd2_journal_commit_transaction
+    kjournald2
+    kthread
+    ret_from_fork
+    mb_cache_list
+    jbd2/xvda1-8
+        986
+
+    schedule
+    schedule_timeout
+    io_schedule_timeout
+    bit_wait_io
+    __wait_on_bit
+    out_of_line_wait_on_bit
+    __wait_on_buffer
+    jbd2_journal_commit_transaction
+    kjournald2
+    kthread
+    ret_from_fork
+    mb_cache_list
+    jbd2/xvda1-8
+        6630
+
+    schedule
+    schedule_timeout
+    io_schedule_timeout
+    bit_wait_io
+    __wait_on_bit
+    out_of_line_wait_on_bit
+    do_get_write_access
+    jbd2_journal_get_write_access
+    __ext4_journal_get_write_access
+    ext4_mb_mark_diskspace_used
+    ext4_mb_new_blocks
+    ext4_ext_map_blocks
+    ext4_map_blocks
+    ext4_writepages
+    do_writepages
+    __filemap_fdatawrite_range
+    filemap_flush
+    ext4_alloc_da_blocks
+    ext4_rename
+    ext4_rename2
+    supervise
+        6645
+
+    schedule
+    schedule_timeout
+    io_schedule_timeout
+    bit_wait_io
+    __wait_on_bit
+    out_of_line_wait_on_bit
+    do_get_write_access
+    jbd2_journal_get_write_access
+    __ext4_journal_get_write_access
+    __ext4_new_inode
+    ext4_create
+    vfs_create
+    path_openat
+    do_filp_open
+    do_sys_open
+    sys_open
+    entry_SYSCALL_64_fastpath
+    supervise
+        12702
+
+    schedule
+    rcu_nocb_kthread
+    kthread
+    ret_from_fork
+    rcuos/2
+        16036
+
+    schedule
+    rcu_nocb_kthread
+    kthread
+    ret_from_fork
+    rcuos/4
+        24085
+
+    schedule
+    do_wait
+    sys_wait4
+    entry_SYSCALL_64_fastpath
+    run
+        233055
+
+    schedule
+    schedule_timeout
+    io_schedule_timeout
+    bit_wait_io
+    __wait_on_bit
+    wait_on_page_bit
+    truncate_inode_pages_range
+    truncate_inode_pages_final
+    ext4_evict_inode
+    evict
+    iput
+    __dentry_kill
+    dput
+    sys_rename
+    entry_SYSCALL_64_fastpath
+    supervise
+        297113
+
+    schedule
+    schedule_timeout
+    wait_woken
+    n_tty_read
+    tty_read
+    __vfs_read
+    vfs_read
+    sys_read
+    entry_SYSCALL_64_fastpath
+    bash
+        1789866
+
+    schedule
+    schedule_timeout
+    io_schedule_timeout
+    do_blockdev_direct_IO
+    __blockdev_direct_IO
+    blkdev_direct_IO
+    generic_file_read_iter
+    blkdev_read_iter
+    __vfs_read
+    vfs_read
+    sys_read
+    entry_SYSCALL_64_fastpath
+    dd
+        3310763
+
+    schedule
+    smpboot_thread_fn
+    kthread
+    ret_from_fork
+    watchdog/1
+        3999989
+
+    schedule
+    smpboot_thread_fn
+    kthread
+    ret_from_fork
+    watchdog/5
+        3999995
+
+    schedule
+    smpboot_thread_fn
+    kthread
+    ret_from_fork
+    watchdog/4
+        3999996
+
+    schedule
+    smpboot_thread_fn
+    kthread
+    ret_from_fork
+    watchdog/0
+        3999996
+
+    schedule
+    smpboot_thread_fn
+    kthread
+    ret_from_fork
+    watchdog/3
+        3999998
+
+    schedule
+    smpboot_thread_fn
+    kthread
+    ret_from_fork
+    watchdog/7
+        3999999
+
+    schedule
+    smpboot_thread_fn
+    kthread
+    ret_from_fork
+    watchdog/2
+        4000001
+
+    schedule
+    smpboot_thread_fn
+    kthread
+    ret_from_fork
+    watchdog/6
+        4000001
+
+    schedule
+    do_wait
+    sys_wait4
+    entry_SYSCALL_64_fastpath
+    bash
+        4039675
+
+    schedule
+    do_nanosleep
+    hrtimer_nanosleep
+    sys_nanosleep
+    entry_SYSCALL_64_fastpath
+    svscan
+        5000112
+
+    schedule
+    schedule_hrtimeout_range_clock
+    schedule_hrtimeout_range
+    poll_schedule_timeout
+    do_select
+    core_sys_select
+    sys_select
+    entry_SYSCALL_64_fastpath
+    snmpd
+        5998761
+
+    schedule
+    smpboot_thread_fn
+    kthread
+    ret_from_fork
+    migration/3
+        6149779
+
+    schedule
+    schedule_hrtimeout_range_clock
+    schedule_hrtimeout_range
+    poll_schedule_timeout
+    do_select
+    core_sys_select
+    sys_select
+    entry_SYSCALL_64_fastpath
+    ntpd
+        6999832
+
+    schedule
+    worker_thread
+    kthread
+    ret_from_fork
+    kworker/u16:2
+        7131941
+
+    schedule
+    worker_thread
+    kthread
+    ret_from_fork
+    kworker/3:0
+        7999844
+
+    schedule
+    worker_thread
+    kthread
+    ret_from_fork
+    kworker/1:1
+        7999872
+
+    schedule
+    worker_thread
+    kthread
+    ret_from_fork
+    kworker/2:1
+        7999889
+
+    schedule
+    worker_thread
+    kthread
+    ret_from_fork
+    kworker/5:1
+        7999936
+
+    schedule
+    worker_thread
+    kthread
+    ret_from_fork
+    kworker/7:1
+        7999938
+
+    schedule
+    worker_thread
+    kthread
+    ret_from_fork
+    kworker/6:1
+        7999940
+
+    schedule
+    do_nanosleep
+    hrtimer_nanosleep
+    sys_nanosleep
+    entry_SYSCALL_64_fastpath
+    tail
+        8000905
+
+    schedule
+    smpboot_thread_fn
+    kthread
+    ret_from_fork
+    migration/7
+        8197046
+
+    schedule
+    pipe_wait
+    pipe_read
+    __vfs_read
+    vfs_read
+    sys_read
+    entry_SYSCALL_64_fastpath
+    readproctitle
+        8197835
+
+    schedule
+    smpboot_thread_fn
+    kthread
+    ret_from_fork
+    migration/4
+        8201851
+
+    schedule
+    smpboot_thread_fn
+    kthread
+    ret_from_fork
+    migration/2
+        8203375
+
+    schedule
+    smpboot_thread_fn
+    kthread
+    ret_from_fork
+    migration/6
+        8208664
+
+    schedule
+    smpboot_thread_fn
+    kthread
+    ret_from_fork
+    migration/5
+        8209819
+
+    schedule
+    smpboot_thread_fn
+    kthread
+    ret_from_fork
+    ddebug_tables
+    migration/0
+        8211292
+
+    schedule
+    smpboot_thread_fn
+    kthread
+    ret_from_fork
+    migration/1
+        8212100
+
+    schedule
+    worker_thread
+    kthread
+    ret_from_fork
+    kworker/0:2
+        8270305
+
+    schedule
+    rcu_nocb_kthread
+    kthread
+    ret_from_fork
+    rcuos/3
+        8349697
+
+    schedule
+    rcu_nocb_kthread
+    kthread
+    ret_from_fork
+    rcuos/2
+        8363357
+
+    schedule
+    rcu_nocb_kthread
+    kthread
+    ret_from_fork
+    rcuos/1
+        8365338
+
+    schedule
+    schedule_timeout
+    xfs_buf_terminate
+    kthread
+    ret_from_fork
+    xfsaild/md0
+        8371514
+
+    schedule
+    rcu_nocb_kthread
+    kthread
+    ret_from_fork
+    rcuos/4
+        8384013
+
+    schedule
+    rcu_nocb_kthread
+    kthread
+    ret_from_fork
+    rcuos/5
+        8390016
+
+    schedule
+    rcu_nocb_kthread
+    kthread
+    ret_from_fork
+    ddebug_tables
+    rcuos/0
+        8405428
+
+    schedule
+    schedule_timeout
+    rcu_gp_kthread
+    kthread
+    ret_from_fork
+    ddebug_tables
+    rcu_sched
+        8406930
+
+    schedule
+    rcu_nocb_kthread
+    kthread
+    ret_from_fork
+    rcuos/7
+        8409575
+
+    schedule
+    rcu_nocb_kthread
+    kthread
+    ret_from_fork
+    rcuos/6
+        8415062
+
+    schedule
+    schedule_hrtimeout_range_clock
+    schedule_hrtimeout_range
+    poll_schedule_timeout
+    do_select
+    core_sys_select
+    sys_select
+    entry_SYSCALL_64_fastpath
+    offcputime
+        8421478
+
+    schedule
+    worker_thread
+    kthread
+    ret_from_fork
+    kworker/4:0
+        8421492
+
+    schedule
+    schedule_hrtimeout_range_clock
+    schedule_hrtimeout_range
+    poll_schedule_timeout
+    do_select
+    core_sys_select
+    sys_select
+    entry_SYSCALL_64_fastpath
+    sshd
+        14249005
+
+    schedule
+    schedule_hrtimeout_range_clock
+    schedule_hrtimeout_range
+    poll_schedule_timeout
+    do_sys_poll
+    sys_poll
+    entry_SYSCALL_64_fastpath
+    supervise
+        81670888
+
+Detaching...
+
+The last few stack traces aren't very interesting, since they are threads that
+are often blocked off-CPU, waiting for work.
+
+Do be somewhat careful with overhead: this is tracing scheduler functions, which
+can be called very frequently. While this uses in-kernel summaries for
+efficiency, the rate of scheduler functions can be very high (> 1,000,000/sec),
+and this is performing stack walks when threads return to CPU. At some point
+the overhead will be measurable.
+
+
+A -p option can be used to filter (in-kernel) on a single process ID. For
+example, only matching PID 26651, which is a running "dd" command:
+
+# ./offcputime -p 26651
+Tracing off-CPU time (us) by kernel stack... Hit Ctrl-C to end.
+^C
+    schedule
+    schedule_timeout
+    io_schedule_timeout
+    do_blockdev_direct_IO
+    __blockdev_direct_IO
+    blkdev_direct_IO
+    generic_file_read_iter
+    blkdev_read_iter
+    __vfs_read
+    vfs_read
+    sys_read
+    entry_SYSCALL_64_fastpath
+    dd
+        2405710
+
+Detaching...
+
+The stack trace shows "dd" is blocked waiting on disk I/O, as expected, for a
+total of 2.4 seconds during tracing.
+
+
+A duration can be added, for example, tracing for 5 seconds only:
+
+# ./offcputime -p 26651 5
+Tracing off-CPU time (us) by kernel stack for 5 secs.
+
+    schedule
+    schedule_timeout
+    io_schedule_timeout
+    do_blockdev_direct_IO
+    __blockdev_direct_IO
+    blkdev_direct_IO
+    generic_file_read_iter
+    blkdev_read_iter
+    __vfs_read
+    vfs_read
+    sys_read
+    entry_SYSCALL_64_fastpath
+    dd
+        4413909
+
+Detaching...
+
+Here, dd was blocked for 4.4 seconds out of 5: put differently, it was likely
+on-CPU for about 12% of the time, which matches the ratio seen by time(1):
+
+# time dd if=/dev/md0 iflag=direct of=/dev/null bs=1k 
+^C108115+0 records in
+108114+0 records out
+110708736 bytes (111 MB) copied, 13.7565 s, 8.0 MB/s
+
+real   0m13.760s
+user   0m0.000s
+sys    0m1.739s
+
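As a sanity check on that 12% figure (simple arithmetic on the numbers above, not tool output):

```python
# 4413909 us blocked over a 5 second trace, from the output above:
blocked_us = 4413909
trace_s = 5.0

off_cpu = blocked_us / 1e6 / trace_s   # fraction of trace spent blocked
on_cpu = 1.0 - off_cpu                 # remainder, spent on-CPU

# Cross-check against time(1): sys 1.739s of real 13.760s
time1_on_cpu = 1.739 / 13.760
```

Both estimates land in the same ballpark, as expected for a mostly I/O-bound dd.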
+
+A -f option will emit output using the "folded stacks" format, which can be
+read directly by flamegraph.pl from the FlameGraph open source software
+(https://github.com/brendangregg/FlameGraph). E.g.:
+
+# ./offcputime -f 5
+bash;entry_SYSCALL_64_fastpath;sys_read;vfs_read;__vfs_read;tty_read;n_tty_read;call_rwsem_down_read_failed;rwsem_down_read_failed;schedule 8
+yes;entry_SYSCALL_64_fastpath;sys_write;vfs_write;__vfs_write;tty_write;n_tty_write;call_rwsem_down_read_failed;rwsem_down_read_failed;schedule 14
+run;page_fault;do_page_fault;__do_page_fault;handle_mm_fault;__do_fault;filemap_fault;__lock_page_or_retry;wait_on_page_bit_killable;__wait_on_bit;bit_wait_io;io_schedule_timeout;schedule_timeout;schedule 33
+rcuos/4;ret_from_fork;kthread;rcu_nocb_kthread;schedule 45
+bash;entry_SYSCALL_64_fastpath;sys_read;vfs_read;__vfs_read;pipe_read;pipe_wait;schedule 88
+run;page_fault;do_page_fault;__do_page_fault;handle_mm_fault;__do_fault;filemap_fault;__lock_page_or_retry;wait_on_page_bit_killable;__wait_on_bit;bit_wait_io;io_schedule_timeout;schedule_timeout;schedule 108
+jbd2/xvda1-8;mb_cache_list;ret_from_fork;kthread;kjournald2;jbd2_journal_commit_transaction;__wait_on_buffer;out_of_line_wait_on_bit;__wait_on_bit;bit_wait_io;io_schedule_timeout;schedule_timeout;schedule 828
+jbd2/xvda1-8;mb_cache_list;ret_from_fork;kthread;kjournald2;jbd2_journal_commit_transaction;__wait_on_buffer;out_of_line_wait_on_bit;__wait_on_bit;bit_wait_io;io_schedule_timeout;schedule_timeout;schedule 6201
+supervise;entry_SYSCALL_64_fastpath;sys_rename;dput;__dentry_kill;iput;evict;ext4_evict_inode;truncate_inode_pages_final;truncate_inode_pages_range;wait_on_page_bit;__wait_on_bit;bit_wait_io;io_schedule_timeout;schedule_timeout;schedule 41049
+run;entry_SYSCALL_64_fastpath;sys_wait4;do_wait;schedule 120709
+bash;entry_SYSCALL_64_fastpath;sys_wait4;do_wait;schedule 699320
+ksoftirqd/0;ret_from_fork;kthread;smpboot_thread_fn;schedule 1077529
+bash;entry_SYSCALL_64_fastpath;sys_read;vfs_read;__vfs_read;tty_read;n_tty_read;wait_woken;schedule_timeout;schedule 1362045
+sshd;entry_SYSCALL_64_fastpath;sys_select;core_sys_select;do_select;poll_schedule_timeout;schedule_hrtimeout_range;schedule_hrtimeout_range_clock;schedule 1377627
+migration/0;ddebug_tables;ret_from_fork;kthread;smpboot_thread_fn;schedule 2040753
+snmpd;entry_SYSCALL_64_fastpath;sys_select;core_sys_select;do_select;poll_schedule_timeout;schedule_hrtimeout_range;schedule_hrtimeout_range_clock;schedule 2197568
+migration/5;ret_from_fork;kthread;smpboot_thread_fn;schedule 3079426
+migration/7;ret_from_fork;kthread;smpboot_thread_fn;schedule 3084746
+kworker/6:2;ret_from_fork;kthread;worker_thread;schedule 3940583
+kworker/5:1;ret_from_fork;kthread;worker_thread;schedule 3944892
+kworker/1:2;ret_from_fork;kthread;worker_thread;schedule 3999646
+ntpd;entry_SYSCALL_64_fastpath;sys_select;core_sys_select;do_select;poll_schedule_timeout;schedule_hrtimeout_range;schedule_hrtimeout_range_clock;schedule 3999904
+kworker/u16:0;ret_from_fork;kthread;worker_thread;schedule 3999967
+kworker/7:0;ret_from_fork;kthread;worker_thread;schedule 3999987
+tail;entry_SYSCALL_64_fastpath;sys_nanosleep;hrtimer_nanosleep;do_nanosleep;schedule 4000473
+migration/1;ret_from_fork;kthread;smpboot_thread_fn;schedule 4091150
+migration/4;ret_from_fork;kthread;smpboot_thread_fn;schedule 4095217
+readproctitle;entry_SYSCALL_64_fastpath;sys_read;vfs_read;__vfs_read;pipe_read;pipe_wait;schedule 4108470
+migration/3;ret_from_fork;kthread;smpboot_thread_fn;schedule 4109264
+migration/2;ret_from_fork;kthread;smpboot_thread_fn;schedule 4109280
+migration/6;ret_from_fork;kthread;smpboot_thread_fn;schedule 4111143
+kworker/4:0;ret_from_fork;kthread;worker_thread;schedule 4402350
+kworker/3:0;ret_from_fork;kthread;worker_thread;schedule 4433988
+kworker/2:1;ret_from_fork;kthread;worker_thread;schedule 4636142
+kworker/0:2;ret_from_fork;kthread;worker_thread;schedule 4832023
+rcuos/1;ret_from_fork;kthread;rcu_nocb_kthread;schedule 4974186
+rcuos/5;ret_from_fork;kthread;rcu_nocb_kthread;schedule 4977137
+rcuos/6;ret_from_fork;kthread;rcu_nocb_kthread;schedule 4987769
+rcuos/3;ret_from_fork;kthread;rcu_nocb_kthread;schedule 4992282
+rcuos/4;ret_from_fork;kthread;rcu_nocb_kthread;schedule 4992364
+rcuos/2;ret_from_fork;kthread;rcu_nocb_kthread;schedule 4992714
+rcuos/0;ddebug_tables;ret_from_fork;kthread;rcu_nocb_kthread;schedule 4996504
+rcuos/7;ret_from_fork;kthread;rcu_nocb_kthread;schedule 4998497
+rcu_sched;ddebug_tables;ret_from_fork;kthread;rcu_gp_kthread;schedule_timeout;schedule 5000686
+offcputime;entry_SYSCALL_64_fastpath;sys_select;core_sys_select;do_select;poll_schedule_timeout;schedule_hrtimeout_range;schedule_hrtimeout_range_clock;schedule 5005063
+dd;entry_SYSCALL_64_fastpath;sys_read;vfs_read;__vfs_read;blkdev_read_iter;generic_file_read_iter;blkdev_direct_IO;__blockdev_direct_IO;do_blockdev_direct_IO;io_schedule_timeout;schedule_timeout;schedule 8025599
+supervise;entry_SYSCALL_64_fastpath;sys_poll;do_sys_poll;poll_schedule_timeout;schedule_hrtimeout_range;schedule_hrtimeout_range_clock;schedule 40835611
+
+The stack traces are shown as single lines, with functions separated by
+semicolons. The first field is the task name, and the final field is the total
+off-CPU time.
+
+I'd save this output to a file, then move it to the system where you'll be
+creating your "off-CPU time flame graphs".
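As a sketch of such post-processing, here is plain Python that sums total off-CPU time per task name from folded lines (the three sample lines are copied from the output above):

```python
# Sum off-CPU time by task name from folded stacks. The task name is
# the first semicolon-separated field; the time is the final field.
from collections import defaultdict

folded = [
    "bash;entry_SYSCALL_64_fastpath;sys_wait4;do_wait;schedule 699320",
    "bash;entry_SYSCALL_64_fastpath;sys_read;vfs_read;__vfs_read;tty_read;"
    "n_tty_read;wait_woken;schedule_timeout;schedule 1362045",
    "tail;entry_SYSCALL_64_fastpath;sys_nanosleep;hrtimer_nanosleep;"
    "do_nanosleep;schedule 4000473",
]

totals = defaultdict(int)
for fline in folded:
    stack, time = fline.rsplit(" ", 1)      # time is the last field
    totals[stack.split(";", 1)[0]] += int(time)

for task, total in sorted(totals.items(), key=lambda kv: kv[1]):
    print(task, total)
```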
+
+
+USAGE message:
+
+./offcputime --help
+usage: offcputime [-h] [-p PID] [-v] [-f] [duration]
+
+Summarize off-CPU time by kernel stack trace
+
+positional arguments:
+  duration           duration of trace, in seconds
+
+optional arguments:
+  -h, --help         show this help message and exit
+  -p PID, --pid PID  trace this PID only
+  -v, --verbose      show raw addresses
+  -f, --folded       output folded format
+
+examples:
+    ./offcputime             # trace off-CPU stack time until Ctrl-C
+    ./offcputime 5           # trace for 5 seconds only
+    ./offcputime -f 5        # 5 seconds, and output in folded format
+    ./offcputime -p 185      # trace for PID 185 only
diff --git a/tools/stackcount b/tools/stackcount
new file mode 100755 (executable)
index 0000000..b6cc3b6
--- /dev/null
@@ -0,0 +1,179 @@
+#!/usr/bin/python
+#
+# stackcount    Count kernel function calls and their stack traces.
+#               For Linux, uses BCC, eBPF.
+#
+# USAGE: stackcount [-h] [-p PID] [-i INTERVAL] [-T] [-r] [-s] [-v] pattern
+#
+# The pattern is a string with optional '*' wildcards, similar to file
+# globbing. If you'd prefer to use regular expressions, use the -r option.
+#
+# The current implementation uses an unrolled loop for x86_64, and was written
+# as a proof of concept. This implementation should be replaced in the future
+# with an appropriate bpf_ call, when available.
+#
+# Currently limited to a stack trace depth of 11 (maxdepth + 1).
+#
+# Copyright 2016 Netflix, Inc.
+# Licensed under the Apache License, Version 2.0 (the "License")
+#
+# 12-Jan-2016  Brendan Gregg   Created this.
+
+from __future__ import print_function
+from bcc import BPF
+from time import sleep, strftime
+import argparse
+import signal
+
+# arguments
+examples = """examples:
+    ./stackcount submit_bio       # count kernel stack traces for submit_bio
+    ./stackcount ip_output        # count kernel stack traces for ip_output
+    ./stackcount -s ip_output     # show symbol offsets
+    ./stackcount -sv ip_output    # show offsets and raw addresses (verbose)
+    ./stackcount 'tcp_send*'      # count stacks for funcs matching tcp_send*
+    ./stackcount -r '^tcp_send.*' # same as above, using regular expressions
+    ./stackcount -Ti 5 ip_output  # output every 5 seconds, with timestamps
+    ./stackcount -p 185 ip_output # count ip_output stacks for PID 185 only
+"""
+parser = argparse.ArgumentParser(
+    description="Count kernel function calls and their stack traces",
+    formatter_class=argparse.RawDescriptionHelpFormatter,
+    epilog=examples)
+parser.add_argument("-p", "--pid",
+    help="trace this PID only")
+parser.add_argument("-i", "--interval", default=99999999,
+    help="summary interval, seconds")
+parser.add_argument("-T", "--timestamp", action="store_true",
+    help="include timestamp on output")
+parser.add_argument("-r", "--regexp", action="store_true",
+    help="use regular expressions. Default is \"*\" wildcards only.")
+parser.add_argument("-s", "--offset", action="store_true",
+    help="show address offsets")
+parser.add_argument("-v", "--verbose", action="store_true",
+    help="show raw addresses")
+parser.add_argument("pattern",
+    help="search expression for kernel functions")
+args = parser.parse_args()
+pattern = args.pattern
+if not args.regexp:
+    pattern = pattern.replace('*', '.*')
+    pattern = '^' + pattern + '$'
+offset = args.offset
+verbose = args.verbose
+debug = 0
+maxdepth = 10    # must match MAXDEPTH in the BPF program
+
+# signal handler
+def signal_ignore(signal, frame):
+    print()
+
+# load BPF program
+bpf_text = """
+#include <uapi/linux/ptrace.h>
+
+#define MAXDEPTH       10
+
+struct key_t {
+    u64 ip;
+    u64 ret[MAXDEPTH];
+};
+BPF_HASH(counts, struct key_t);
+
+static u64 get_frame(u64 *bp) {
+    if (*bp) {
+        // The following stack walker is x86_64 specific
+        u64 ret = 0;
+        if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
+            return 0;
+        if (!ret || ret < __START_KERNEL_map)
+            return 0;
+        if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
+            *bp = 0;    // terminate the walk on a failed read
+        return ret;
+    }
+    return 0;
+}
+
+int trace_count(struct pt_regs *ctx) {
+    FILTER
+    struct key_t key = {};
+    u64 zero = 0, *val, bp = 0;
+    int depth = 0;
+
+    key.ip = ctx->ip;
+    bp = ctx->bp;
+
+    // unrolled loop, 10 (MAXDEPTH) frames deep:
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+
+out:
+    val = counts.lookup_or_init(&key, &zero);
+    (*val)++;
+    return 0;
+}
+"""
+if args.pid:
+    bpf_text = bpf_text.replace('FILTER',
+        ('u32 pid; pid = bpf_get_current_pid_tgid(); ' +
+        'if (pid != %s) { return 0; }') % (args.pid))
+else:
+    bpf_text = bpf_text.replace('FILTER', '')
+if debug:
+    print(bpf_text)
+b = BPF(text=bpf_text)
+b.attach_kprobe(event_re=pattern, fn_name="trace_count")
+matched = b.num_open_kprobes()
+if matched == 0:
+    print("0 functions matched by \"%s\". Exiting." % args.pattern)
+    exit()
+
+# header
+print("Tracing %d functions for \"%s\"... Hit Ctrl-C to end." %
+    (matched, args.pattern))
+
+def print_frame(addr):
+    print("  ", end="")
+    if verbose:
+        print("%-16x " % addr, end="")
+    if offset:
+        print("%s" % b.ksymaddr(addr))
+    else:
+        print("%s" % b.ksym(addr))
+
+# output
+exiting = 0 if args.interval else 1
+while (1):
+    try:
+        sleep(int(args.interval))
+    except KeyboardInterrupt:
+        exiting = 1
+        # as cleanup can take many seconds, trap Ctrl-C:
+        signal.signal(signal.SIGINT, signal_ignore)
+
+    print()
+    if args.timestamp:
+        print("%-8s\n" % strftime("%H:%M:%S"), end="")
+
+    counts = b.get_table("counts")
+    for k, v in sorted(counts.items(), key=lambda kv: kv[1].value):
+        print_frame(k.ip)
+        for i in range(0, maxdepth):
+            if k.ret[i] == 0:
+                break
+            print_frame(k.ret[i])
+        print("    %d\n" % v.value)
+    counts.clear()
+
+    if exiting:
+        print("Detaching...")
+        exit()
diff --git a/tools/stackcount_example.txt b/tools/stackcount_example.txt
new file mode 100644 (file)
index 0000000..bf8b101
--- /dev/null
@@ -0,0 +1,466 @@
+Demonstrations of stackcount, the Linux eBPF/bcc version.
+
+
+This program traces kernel functions and frequency counts them with their entire
+kernel stack trace, summarized in-kernel for efficiency. For example, counting
+stack traces that led to submit_bio(), which creates block device I/O:
+
+# ./stackcount submit_bio
+Tracing 1 functions for "submit_bio"... Hit Ctrl-C to end.
+^C
+  submit_bio
+  submit_bh
+  journal_submit_commit_record.isra.13
+  jbd2_journal_commit_transaction
+  kjournald2
+  kthread
+  ret_from_fork
+  mb_cache_list
+    1
+
+  submit_bio
+  __block_write_full_page.constprop.39
+  block_write_full_page
+  blkdev_writepage
+  __writepage
+  write_cache_pages
+  generic_writepages
+  do_writepages
+  __writeback_single_inode
+  writeback_sb_inodes
+  __writeback_inodes_wb
+    2
+
+  submit_bio
+  __block_write_full_page.constprop.39
+  block_write_full_page
+  blkdev_writepage
+  __writepage
+  write_cache_pages
+  generic_writepages
+  do_writepages
+  __filemap_fdatawrite_range
+  filemap_fdatawrite
+  fdatawrite_one_bdev
+    36
+
+  submit_bio
+  submit_bh
+  jbd2_journal_commit_transaction
+  kjournald2
+  kthread
+  ret_from_fork
+  mb_cache_list
+    38
+
+  submit_bio
+  ext4_writepages
+  do_writepages
+  __filemap_fdatawrite_range
+  filemap_flush
+  ext4_alloc_da_blocks
+  ext4_rename
+  ext4_rename2
+  vfs_rename
+  sys_rename
+  entry_SYSCALL_64_fastpath
+    79
+
+Detaching...
+
+The output shows unique stack traces, in order from leaf (on-CPU) to root,
+followed by their occurrence count. The last stack trace in the above output
+shows syscall handling, ext4_rename(), and filemap_flush(): looks like an
+application-issued file rename caused back-end disk I/O due to ext4 block
+allocation and a filemap_flush(). I'd have to browse the code of those
+functions to confirm!
+
+The order of printed stack traces is from least to most frequent. The most
+frequent in this case, the ext4_rename() stack, was taken 79 times during
+tracing.
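The summarize-then-sort step can be mimicked in plain Python (the stack tuples below are hypothetical; the real tool keeps the counts in a kernel BPF map and reads them only at output time):

```python
# Frequency-count stack traces and print them least to most frequent,
# mirroring stackcount's output order. The stacks here are made up.
from collections import Counter

observed = [
    ("submit_bio", "submit_bh", "jbd2_journal_commit_transaction"),
    ("submit_bio", "ext4_writepages", "do_writepages"),
    ("submit_bio", "ext4_writepages", "do_writepages"),
]

counts = Counter(observed)
for stack, n in sorted(counts.items(), key=lambda kv: kv[1]):
    for func in stack:          # leaf (the traced function) first
        print(" ", func)
    print("    %d\n" % n)
```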
+
+It can be useful to trace the path to submit_bio to explain unusual rates of
+disk IOPS. These could have in-kernel origins (eg, background scrub).
+
+This version of stackcount truncates stacks to 10 levels deep (plus 1 for
+the traced function, so 11).
+
+
+As another example, here are the code paths that led to ip_output(), which
+sends a packet at the IP level:
+
+# ./stackcount ip_output
+Tracing 1 functions for "ip_output"... Hit Ctrl-C to end.
+^C
+  ip_output
+  ip_queue_xmit
+  tcp_transmit_skb
+  tcp_write_xmit
+  tcp_tsq_handler.part.32
+  tcp_tasklet_func
+  tasklet_action
+  __do_softirq
+  do_softirq_own_stack
+  do_softirq
+  __local_bh_enable_ip
+    1
+
+  ip_output
+  ip_queue_xmit
+  tcp_transmit_skb
+  tcp_send_ack
+  tcp_send_delayed_ack
+  __tcp_ack_snd_check
+  tcp_rcv_established
+  tcp_v4_do_rcv
+  tcp_v4_rcv
+  ip_local_deliver_finish
+  ip_local_deliver
+    1
+
+  ip_output
+  ip_queue_xmit
+  tcp_transmit_skb
+  tcp_send_ack
+  tcp_send_delayed_ack
+  __tcp_ack_snd_check
+  tcp_rcv_established
+  tcp_v4_do_rcv
+  tcp_v4_rcv
+  ip_local_deliver_finish
+  ip_local_deliver
+    1
+
+  ip_output
+  ip_queue_xmit
+  tcp_transmit_skb
+  tcp_send_ack
+  tcp_delack_timer_handler
+  tcp_delack_timer
+  call_timer_fn
+  run_timer_softirq
+  __do_softirq
+  irq_exit
+  xen_evtchn_do_upcall
+    1
+
+  ip_output
+  ip_queue_xmit
+  tcp_transmit_skb
+  tcp_write_xmit
+  __tcp_push_pending_frames
+  tcp_rcv_established
+  tcp_v4_do_rcv
+  release_sock
+  tcp_sendmsg
+  inet_sendmsg
+  sock_sendmsg
+    3
+
+  ip_output
+  ip_queue_xmit
+  tcp_transmit_skb
+  tcp_write_xmit
+  tcp_tsq_handler.part.32
+  tcp_tasklet_func
+  tasklet_action
+  __do_softirq
+  run_ksoftirqd
+  smpboot_thread_fn
+  kthread
+    3
+
+  ip_output
+  ip_queue_xmit
+  tcp_transmit_skb
+  tcp_write_xmit
+  __tcp_push_pending_frames
+  tcp_rcv_established
+  tcp_v4_do_rcv
+  tcp_v4_rcv
+  ip_local_deliver_finish
+  ip_local_deliver
+  ip_rcv_finish
+    4
+
+  ip_output
+  ip_queue_xmit
+  tcp_transmit_skb
+  tcp_send_ack
+  tcp_cleanup_rbuf
+  tcp_recvmsg
+  inet_recvmsg
+  sock_recvmsg
+  sock_read_iter
+  __vfs_read
+  vfs_read
+    5
+
+  ip_output
+  ip_queue_xmit
+  tcp_transmit_skb
+  tcp_write_xmit
+  __tcp_push_pending_frames
+  tcp_rcv_established
+  tcp_v4_do_rcv
+  release_sock
+  tcp_sendmsg
+  inet_sendmsg
+  sock_sendmsg
+    9
+
+  ip_output
+  ip_queue_xmit
+  tcp_transmit_skb
+  tcp_write_xmit
+  __tcp_push_pending_frames
+  tcp_push
+  tcp_sendmsg
+  inet_sendmsg
+  sock_sendmsg
+  sock_write_iter
+  __vfs_write
+    51
+
+  ip_output
+  ip_queue_xmit
+  tcp_transmit_skb
+  tcp_write_xmit
+  __tcp_push_pending_frames
+  tcp_rcv_established
+  tcp_v4_do_rcv
+  tcp_v4_rcv
+  ip_local_deliver_finish
+  ip_local_deliver
+  ip_rcv_finish
+    171
+
+  ip_output
+  ip_queue_xmit
+  tcp_transmit_skb
+  tcp_write_xmit
+  __tcp_push_pending_frames
+  tcp_rcv_established
+  tcp_v4_do_rcv
+  tcp_v4_rcv
+  ip_local_deliver_finish
+  ip_local_deliver
+  ip_rcv_finish
+    994
+
+  ip_output
+  ip_queue_xmit
+  tcp_transmit_skb
+  tcp_write_xmit
+  tcp_tsq_handler.part.32
+  tcp_tasklet_func
+  tasklet_action
+  __do_softirq
+  irq_exit
+  xen_evtchn_do_upcall
+  xen_do_hypervisor_callback
+    1002
+
+Detaching...
+
+The two most frequent stack traces are both via tcp_transmit_skb(). Note that
+the second most frequent originates from ip_rcv_finish(), a local receive that
+then becomes a transmit: likely pushing more frames while processing the newly
+received ones.
+
+As may be obvious, this is a great tool for quickly understanding kernel code
+flow.
+
+
+A -i option can be used to set an output interval, and -T to include a
+timestamp. For example:
+
+# ./stackcount -Ti 1 submit_bio
+Tracing 1 functions for "submit_bio"... Hit Ctrl-C to end.
+
+01:11:37
+  submit_bio
+  submit_bh
+  journal_submit_commit_record.isra.13
+  jbd2_journal_commit_transaction
+  kjournald2
+  kthread
+  ret_from_fork
+  mb_cache_list
+    1
+
+  submit_bio
+  ext4_writepages
+  do_writepages
+  __filemap_fdatawrite_range
+  filemap_flush
+  ext4_alloc_da_blocks
+  ext4_rename
+  ext4_rename2
+  vfs_rename
+  sys_rename
+  entry_SYSCALL_64_fastpath
+    20
+
+  submit_bio
+  submit_bh
+  jbd2_journal_commit_transaction
+  kjournald2
+  kthread
+  ret_from_fork
+  mb_cache_list
+    39
+
+
+01:11:38
+  submit_bio
+  ext4_writepages
+  do_writepages
+  __filemap_fdatawrite_range
+  filemap_flush
+  ext4_alloc_da_blocks
+  ext4_rename
+  ext4_rename2
+  vfs_rename
+  sys_rename
+  entry_SYSCALL_64_fastpath
+    20
+
+
+01:11:39
+  submit_bio
+  ext4_writepages
+  do_writepages
+  __filemap_fdatawrite_range
+  filemap_flush
+  ext4_alloc_da_blocks
+  ext4_rename
+  ext4_rename2
+  vfs_rename
+  sys_rename
+  entry_SYSCALL_64_fastpath
+    20
+
+^C
+01:11:39
+  submit_bio
+  ext4_writepages
+  do_writepages
+  __filemap_fdatawrite_range
+  filemap_flush
+  ext4_alloc_da_blocks
+  ext4_rename
+  ext4_rename2
+  vfs_rename
+  sys_rename
+  entry_SYSCALL_64_fastpath
+    20
+
+Detaching...
+
+
+The -s output prints the return instruction offset for each function (aka
+symbol offset). Eg:
+
+# ./stackcount -s tcp_sendmsg
+Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
+^C
+  tcp_sendmsg0x1
+  sock_sendmsg0x38
+  sock_write_iter0x78
+  __vfs_write0xaa
+  vfs_write0xa9
+  sys_write0x46
+  entry_SYSCALL_64_fastpath0x16
+    29
+
+Detaching...
+
+If it wasn't clear how one function called another, knowing the instruction
+offset can help you locate the lines of code from a disassembly dump.
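Resolving an address to symbol+offset is a floor lookup in a sorted symbol table (the kernel's lives in /proc/kallsyms). A self-contained sketch, using a made-up three-entry table rather than a live kernel:

```python
# Resolve an address to "symbol+0xoffset" by finding the nearest
# preceding symbol start. The table below is made up for illustration.
from bisect import bisect_right

# (start address, name) pairs, sorted by address -- hypothetical values
symtab = [
    (0xffffffff81280000, "ext4_sync_fs"),
    (0xffffffff81280470, "ext4_statfs"),
    (0xffffffff81280900, "ext4_remount"),
]
starts = [addr for addr, _ in symtab]

def ksymaddr(addr):
    i = bisect_right(starts, addr) - 1
    if i < 0:
        return hex(addr)            # below the first symbol
    base, name = symtab[i]
    return "%s+0x%x" % (name, addr - base)

print(ksymaddr(0xffffffff81280461))     # ext4_sync_fs+0x461
```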
+
+
+A wildcard can also be used. Eg, all functions beginning with "tcp_send":
+
+# ./stackcount -s 'tcp_send*'
+Tracing 14 functions for "tcp_send*"... Hit Ctrl-C to end.
+^C
+  tcp_send_delayed_ack0x1
+  tcp_rcv_established0x3b1
+  tcp_v4_do_rcv0x130
+  tcp_v4_rcv0x8e0
+  ip_local_deliver_finish0x9f
+  ip_local_deliver0x51
+  ip_rcv_finish0x8a
+  ip_rcv0x29d
+  __netif_receive_skb_core0x637
+  __netif_receive_skb0x18
+  netif_receive_skb_internal0x23
+    1
+
+  tcp_send_delayed_ack0x1
+  tcp_rcv_established0x222
+  tcp_v4_do_rcv0x130
+  tcp_v4_rcv0x8e0
+  ip_local_deliver_finish0x9f
+  ip_local_deliver0x51
+  ip_rcv_finish0x8a
+  ip_rcv0x29d
+  __netif_receive_skb_core0x637
+  __netif_receive_skb0x18
+  netif_receive_skb_internal0x23
+    4
+
+  tcp_send_mss0x1
+  inet_sendmsg0x67
+  sock_sendmsg0x38
+  sock_write_iter0x78
+  __vfs_write0xaa
+  vfs_write0xa9
+  sys_write0x46
+  entry_SYSCALL_64_fastpath0x16
+    7
+
+  tcp_sendmsg0x1
+  sock_sendmsg0x38
+  sock_write_iter0x78
+  __vfs_write0xaa
+  vfs_write0xa9
+  sys_write0x46
+  entry_SYSCALL_64_fastpath0x16
+    7
+
+Detaching...
+
+Use -r to allow regular expressions.
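When -r is not given, the script converts the wildcard pattern to an anchored regular expression itself: each "*" becomes ".*" and the result is wrapped in "^...$". The same conversion in isolation:

```python
# The glob-to-regex conversion stackcount performs when -r is not
# given: '*' becomes '.*' and the pattern is anchored at both ends.
import re

def glob_to_regex(pattern):
    return "^" + pattern.replace("*", ".*") + "$"

r = glob_to_regex("tcp_send*")
print(r)                                    # ^tcp_send.*$
print(bool(re.match(r, "tcp_sendmsg")))     # True
print(bool(re.match(r, "udp_sendmsg")))     # False
```

Note that other regex metacharacters are left unescaped, which is why plain function names also work unchanged as patterns.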
+
+
+USAGE message:
+
+# ./stackcount -h
+usage: stackcount [-h] [-p PID] [-i INTERVAL] [-T] [-r] [-s] [-v] pattern
+
+Count kernel function calls and their stack traces
+
+positional arguments:
+  pattern               search expression for kernel functions
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -p PID, --pid PID     trace this PID only
+  -i INTERVAL, --interval INTERVAL
+                        summary interval, seconds
+  -T, --timestamp       include timestamp on output
+  -r, --regexp          use regular expressions. Default is "*" wildcards
+                        only.
+  -s, --offset          show address offsets
+  -v, --verbose         show raw addresses
+
+examples:
+    ./stackcount submit_bio       # count kernel stack traces for submit_bio
+    ./stackcount ip_output        # count kernel stack traces for ip_output
+    ./stackcount -s ip_output     # show symbol offsets
+    ./stackcount -sv ip_output    # show offsets and raw addresses (verbose)
+    ./stackcount 'tcp_send*'      # count stacks for funcs matching tcp_send*
+    ./stackcount -r '^tcp_send.*' # same as above, using regular expressions
+    ./stackcount -Ti 5 ip_output  # output every 5 seconds, with timestamps
+    ./stackcount -p 185 ip_output # count ip_output stacks for PID 185 only
diff --git a/tools/stacksnoop b/tools/stacksnoop
new file mode 100755 (executable)
index 0000000..bd5c0c9
--- /dev/null
@@ -0,0 +1,130 @@
+#!/usr/bin/python
+#
+# stacksnoop    Trace a kernel function and print all kernel stack traces.
+#               For Linux, uses BCC, eBPF, and currently x86_64 only. Inline C.
+#
+# USAGE: stacksnoop [-h] [-p PID] [-s] [-v] function
+#
+# The current implementation uses an unrolled loop for x86_64, and was written
+# as a proof of concept. This implementation should be replaced in the future
+# with an appropriate bpf_ call, when available.
+#
+# The stack depth is limited to 10 (+1 for the current instruction pointer).
+# This could be tunable in a future version.
+#
+# Copyright 2016 Netflix, Inc.
+# Licensed under the Apache License, Version 2.0 (the "License")
+#
+# 12-Jan-2016   Brendan Gregg   Created this.
+
+from __future__ import print_function
+from bcc import BPF
+import argparse
+
+# arguments
+examples = """examples:
+    ./stacksnoop ext4_sync_fs    # print kernel stack traces for ext4_sync_fs
+    ./stacksnoop -s ext4_sync_fs    # ... also show symbol offsets
+    ./stacksnoop -v ext4_sync_fs    # ... show extra columns
+    ./stacksnoop -p 185 ext4_sync_fs    # ... only when PID 185 is on-CPU
+"""
+parser = argparse.ArgumentParser(
+    description="Trace and print kernel stack traces for a kernel function",
+    formatter_class=argparse.RawDescriptionHelpFormatter,
+    epilog=examples)
+parser.add_argument("-p", "--pid",
+    help="trace this PID only")
+parser.add_argument("-s", "--offset", action="store_true",
+    help="show address offsets")
+parser.add_argument("-v", "--verbose", action="store_true",
+    help="print more fields")
+parser.add_argument("function",
+    help="kernel function name")
+args = parser.parse_args()
+function = args.function
+offset = args.offset
+verbose = args.verbose
+debug = 0
+
+# define BPF program
+bpf_text = """
+#include <uapi/linux/ptrace.h>
+
+static int print_frame(u64 *bp, int *depth) {
+    if (*bp) {
+        // The following stack walker is x86_64 specific
+        u64 ret = 0;
+        if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
+            return 0;
+        if (!ret || ret < __START_KERNEL_map)
+            return 0;
+        bpf_trace_printk("r%d: %llx\\n", *depth, ret);
+        if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
+            return 0;
+        *depth += 1;
+        return 1;
+    }
+    return 0;
+}
+
+void trace_stack(struct pt_regs *ctx) {
+    FILTER
+    u64 bp = 0;
+    int depth = 0;
+
+    bpf_trace_printk("\\n");
+    if (ctx->ip)
+        bpf_trace_printk("ip: %llx\\n", ctx->ip);
+    bp = ctx->bp;
+
+    // unrolled loop, 10 frames deep:
+    if (!print_frame(&bp, &depth)) return;
+    if (!print_frame(&bp, &depth)) return;
+    if (!print_frame(&bp, &depth)) return;
+    if (!print_frame(&bp, &depth)) return;
+    if (!print_frame(&bp, &depth)) return;
+    if (!print_frame(&bp, &depth)) return;
+    if (!print_frame(&bp, &depth)) return;
+    if (!print_frame(&bp, &depth)) return;
+    if (!print_frame(&bp, &depth)) return;
+    if (!print_frame(&bp, &depth)) return;
+};
+"""
+if args.pid:
+    bpf_text = bpf_text.replace('FILTER',
+        ('u32 pid; pid = bpf_get_current_pid_tgid(); ' +
+        'if (pid != %s) { return; }') % (args.pid))
+else:
+    bpf_text = bpf_text.replace('FILTER', '')
+if debug:
+    print(bpf_text)
+
+# initialize BPF
+b = BPF(text=bpf_text)
+b.attach_kprobe(event=function, fn_name="trace_stack")
+matched = b.num_open_kprobes()
+if matched == 0:
+    print("Function \"%s\" not found. Exiting." % function)
+    exit()
+
+# header
+if verbose:
+    print("%-18s %-12s %-6s %-3s %s" % ("TIME(s)", "COMM", "PID", "CPU",
+        "STACK"))
+else:
+    print("%-18s %s" % ("TIME(s)", "STACK"))
+
+# format output
+while 1:
+    (task, pid, cpu, flags, ts, msg) = b.trace_fields()
+    if msg != "":
+        (reg, addr) = msg.split(" ")
+        if offset:
+            ip = b.ksymaddr(int(addr, 16))
+        else:
+            ip = b.ksym(int(addr, 16))
+        msg = msg + " " + ip
+    if verbose:
+        print("%-18.9f %-12.12s %-6d %-3d %s" % (ts, task, pid, cpu, msg))
+    else:
+        print("%-18.9f %s" % (ts, msg))
diff --git a/tools/stacksnoop_example.txt b/tools/stacksnoop_example.txt
new file mode 100644 (file)
index 0000000..d900025
--- /dev/null
@@ -0,0 +1,103 @@
+Demonstrations of stacksnoop, the Linux eBPF/bcc version.
+
+
+This program traces the given kernel function and prints the kernel stack trace
+for every call. This tool is useful for studying low frequency kernel functions,
+to see how they were invoked. For example, tracing the ext4_sync_fs() call:
+
+# ./stacksnoop ext4_sync_fs
+TIME(s)            STACK
+42005194.132250004 
+42005194.132253997 ip: ffffffff81280461 ext4_sync_fs
+42005194.132256001 r0: ffffffff811ed7f9 iterate_supers
+42005194.132257000 r1: ffffffff8121ba25 sys_sync
+42005194.132257000 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
+42005194.132275000 
+42005194.132275999 ip: ffffffff81280461 ext4_sync_fs
+42005194.132275999 r0: ffffffff811ed7f9 iterate_supers
+42005194.132276997 r1: ffffffff8121ba35 sys_sync
+42005194.132276997 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
+
+This shows that ext4_sync_fs() was called by iterate_supers(), which was called
+by sys_sync(), and so on. (It tells me that this was a syscall-invoked sync,
+so an application requested it.)
+
+The "ip" refers to the instruction pointer, and the "r#" refers to the return
+address for each stack frame.
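The symbol names are appended in user space: the tool splits each trace_pipe message (eg, "r0: ffffffff811ed7f9") and resolves the hex address. A sketch of that parsing step, with a fixed lookup table standing in for bcc's BPF.ksym():

```python
# Parse a stacksnoop trace message and append a resolved symbol name,
# as the tool's output loop does. The resolver here is a stub with a
# hypothetical fixed table instead of a live kernel symbol lookup.
def resolve(addr):
    table = {0xffffffff811ed7f9: "iterate_supers"}
    return table.get(addr, "[unknown]")

msg = "r0: ffffffff811ed7f9"
reg, addr = msg.split(" ")
line = msg + " " + resolve(int(addr, 16))
print(line)     # r0: ffffffff811ed7f9 iterate_supers
```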
+
+For high frequency functions, see stackcount, which summarizes in-kernel for
+efficiency. If you don't know if your function is low or high frequency, try
+funccount.
+
+
+The -v option includes more fields, including the on-CPU process (COMM and PID):
+
+# ./stacksnoop -v ext4_sync_fs
+TIME(s)            COMM         PID    CPU STACK
+42005557.056332998 sync         22352  1   
+42005557.056336999 sync         22352  1   ip: ffffffff81280461 ext4_sync_fs
+42005557.056339003 sync         22352  1   r0: ffffffff811ed7f9 iterate_supers
+42005557.056340002 sync         22352  1   r1: ffffffff8121ba25 sys_sync
+42005557.056340002 sync         22352  1   r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
+42005557.056358002 sync         22352  1   
+42005557.056358002 sync         22352  1   ip: ffffffff81280461 ext4_sync_fs
+42005557.056359001 sync         22352  1   r0: ffffffff811ed7f9 iterate_supers
+42005557.056359999 sync         22352  1   r1: ffffffff8121ba35 sys_sync
+42005557.056359999 sync         22352  1   r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
+
+This identifies the application issuing the sync syscall: the sync(1) command
+(COMM column).
+
+
+Here's another example, showing the path to second_overflow() and on-CPU
+process:
+
+# ./stacksnoop -v second_overflow
+TIME(s)            COMM         PID    CPU STACK
+42005696.529449999 <idle>       0      0   
+42005696.529457003 <idle>       0      0   ip: ffffffff810e5701 second_overflow
+42005696.529459000 <idle>       0      0   r0: ffffffff810ecb1b tick_do_update_jiffies64
+42005696.529459998 <idle>       0      0   r1: ffffffff810ed6e0 tick_irq_enter
+42005696.529459998 <idle>       0      0   r2: ffffffff8107a195 irq_enter
+42005696.529460996 <idle>       0      0   r3: ffffffff8146bb6f xen_evtchn_do_upcall
+42005696.529460996 <idle>       0      0   r4: ffffffff81777a2e xen_do_hypervisor_callback
+42005697.616295002 <idle>       0      0   
+42005697.616301000 <idle>       0      0   ip: ffffffff810e5701 second_overflow
+42005697.616302997 <idle>       0      0   r0: ffffffff810ecb1b tick_do_update_jiffies64
+42005697.616304003 <idle>       0      0   r1: ffffffff810ed6e0 tick_irq_enter
+42005697.616304003 <idle>       0      0   r2: ffffffff8107a195 irq_enter
+42005697.616305001 <idle>       0      0   r3: ffffffff8146bb6f xen_evtchn_do_upcall
+42005697.616305001 <idle>       0      0   r4: ffffffff81777a2e xen_do_hypervisor_callback
+42005698.556240998 <idle>       0      1   
+42005698.556247003 <idle>       0      1   ip: ffffffff810e5701 second_overflow
+42005698.556249000 <idle>       0      1   r0: ffffffff810ecb1b tick_do_update_jiffies64
+42005698.556249000 <idle>       0      1   r1: ffffffff810ed6e0 tick_irq_enter
+42005698.556249999 <idle>       0      1   r2: ffffffff8107a195 irq_enter
+42005698.556249999 <idle>       0      1   r3: ffffffff8146bb6f xen_evtchn_do_upcall
+42005698.556250997 <idle>       0      1   r4: ffffffff81777a2e xen_do_hypervisor_callback
+[...]
+
+This fires every second (see TIME(s)), and is from tick_do_update_jiffies64().
+
+
+USAGE message:
+
+# ./stacksnoop -h
+usage: stacksnoop [-h] [-p PID] [-s] [-v] function
+
+Trace and print kernel stack traces for a kernel function
+
+positional arguments:
+  function           kernel function name
+
+optional arguments:
+  -h, --help         show this help message and exit
+  -p PID, --pid PID  trace this PID only
+  -s, --offset       show address offsets
+  -v, --verbose      print more fields
+
+examples:
+    ./stacksnoop ext4_sync_fs    # print kernel stack traces for ext4_sync_fs
+    ./stacksnoop -s ext4_sync_fs    # ... also show symbol offsets
+    ./stacksnoop -v ext4_sync_fs    # ... show extra columns
+    ./stacksnoop -p 185 ext4_sync_fs    # ... only when PID 185 is on-CPU