- tools/[biosnoop](tools/biosnoop.py): Trace block device I/O with PID and latency. [Examples](tools/biosnoop_example.txt).
- tools/[bitesize](tools/bitesize.py): Show per process I/O size histogram. [Examples](tools/bitesize_example.txt).
- tools/[cachestat](tools/cachestat.py): Trace page cache hit/miss ratio. [Examples](tools/cachestat_example.txt).
+- tools/[fsslower](tools/fsslower.py): Trace slow file system synchronous reads and writes. [Examples](tools/fsslower_example.txt).
- tools/[funccount](tools/funccount.py): Count kernel function calls. [Examples](tools/funccount_example.txt).
- tools/[funclatency](tools/funclatency.py): Time kernel functions and show their latency distribution. [Examples](tools/funclatency_example.txt).
- tools/[gethostlatency](tools/gethostlatency.py): Show latency for getaddrinfo/gethostbyname[2] calls. [Examples](tools/gethostlatency_example.txt).
--- /dev/null
+.TH fsslower 8 "2016-02-07" "USER COMMANDS"
+.SH NAME
+fsslower \- Trace slow file system synchronous reads and writes.
+.SH SYNOPSIS
+.B fsslower [\-h] [\-p PID] [min_ms]
+.SH DESCRIPTION
+This script uses kernel dynamic tracing of synchronous reads and writes
+at the VFS interface, to identify slow file system I/O for any file system.
+
+This version traces __vfs_read() and __vfs_write(), showing only
+synchronous I/O (the path to new_sync_read() and new_sync_write()) and
+I/O with filenames. This approach provides a view of just two file
+system request types: reads and writes. There are typically many others:
+asynchronous I/O, directory operations, file handle operations, file open()s,
+fflush(), etc, that this tool does not currently instrument. This
+implementation is a workaround until we have suitable file system tracepoints.
+
+WARNING: See the OVERHEAD section.
+
+By default, a minimum millisecond threshold of 10 is used.
+
+Since this works by tracing various kernel __vfs_*() functions using dynamic
+tracing, it will need updating to match any changes to these functions. A
+future version should switch to using FS tracepoints instead.
+
+Since this uses BPF, only the root user can use this tool.
+.SH REQUIREMENTS
+CONFIG_BPF and bcc.
+.SH OPTIONS
+.TP
+\-h
+Print usage message.
+.TP
+\-p PID
+Trace this PID only.
+.TP
+min_ms
+Minimum I/O latency (duration) to trace, in milliseconds. Default is 10 ms.
+.SH EXAMPLES
+.TP
+Trace synchronous file system reads and writes slower than 10 ms:
+#
+.B fsslower
+.TP
+Trace slower than 1 ms:
+#
+.B fsslower 1
+.TP
+Trace slower than 1 ms, for PID 181 only:
+#
+.B fsslower \-p 181 1
+.SH FIELDS
+.TP
+TIME(s)
+Time of I/O completion since the first I/O seen, in seconds.
+.TP
+COMM
+Process name.
+.TP
+PID
+Process ID.
+.TP
+D
+Direction of I/O. R == read, W == write.
+.TP
+BYTES
+Size of I/O, in bytes.
+.TP
+LAT(ms)
+Latency (duration) of the I/O, measured from when the application issued it
+to VFS to when it completed. This time is inclusive of block device I/O, file
+system CPU cycles, file system locks, run queue latency, etc. It is a more
+accurate measure of the latency suffered by applications performing file
+system I/O than measuring latency down at the block device interface.
+.TP
+FILENAME
+A cached kernel file name (comes from dentry->d_iname).
+.SH OVERHEAD
+Depending on the frequency of application reads and writes, overhead can become
+severe, in the worst case slowing applications by 2x. In the best case, the
+overhead is negligible. Hopefully for real world workloads the overhead is
+often at the lower end of the spectrum -- test before use. The reason for
+high overhead is that this traces VFS reads and writes, which includes FS
+cache reads and writes, and can exceed one million events per second if the
+application is I/O heavy. While the instrumentation is extremely lightweight,
+and uses in-kernel eBPF maps for efficient timing and filtering, multiply that
+per-event cost by one million events per second and the total overhead becomes
+significant. You can get an idea of the possible cost by counting the
+instrumented events using the bcc funccount tool, eg:
+.PP
+# ./funccount.py -i 1 -r '^__vfs_(read|write)$'
+.PP
+This also incurs overhead, though somewhat less than that of fsslower.
+.SH SOURCE
+This is from bcc.
+.IP
+https://github.com/iovisor/bcc
+.PP
+Also look in the bcc distribution for a companion _examples.txt file containing
+example usage, output, and commentary for this tool.
+.SH OS
+Linux
+.SH STABILITY
+Unstable - in development.
+.SH AUTHOR
+Brendan Gregg
+.SH SEE ALSO
+biosnoop(8), funccount(8)
--- /dev/null
+#!/usr/bin/python
+# @lint-avoid-python-3-compatibility-imports
+#
+# fsslower Trace slow file system synchronous reads and writes.
+# For Linux, uses BCC, eBPF.
+#
+# USAGE: fsslower [-h] [-p PID] [min_ms]
+#
+# This script uses kernel dynamic tracing of synchronous reads and writes
+# at the VFS interface, to identify slow file system I/O for any file system.
+#
+# This works by tracing __vfs_read() and __vfs_write(), and filtering for
+# synchronous I/O (the path to new_sync_read() and new_sync_write()), and
+# for I/O with filenames. This approach provides a view of just two file
+# system request types. There are typically many others: asynchronous I/O,
+# directory operations, file handle operations, etc, that this tool does not
+# instrument. This implementation is a workaround until we have suitable fs
+# tracepoints.
+#
+# WARNING: This traces VFS reads and writes, which can be extremely frequent,
+# and so the overhead of this tool can become severe depending on the
+# workload.
+#
+# By default, a minimum millisecond threshold of 10 is used.
+#
+# Copyright 2016 Netflix, Inc.
+# Licensed under the Apache License, Version 2.0 (the "License")
+#
+# 06-Feb-2016 Brendan Gregg Created this.
+
+from __future__ import print_function
+from bcc import BPF
+import argparse
+import signal
+
+# arguments
+examples = """examples:
+ ./fsslower # trace sync I/O slower than 10 ms (default)
+ ./fsslower 1 # trace sync I/O slower than 1 ms
+ ./fsslower -p 185 # trace PID 185 only
+"""
+parser = argparse.ArgumentParser(
+ description="Trace slow file system sync reads and writes",
+ formatter_class=argparse.RawDescriptionHelpFormatter,
+ epilog=examples)
+parser.add_argument("-p", "--pid",
+ help="trace this PID only")
+parser.add_argument("min_ms", nargs="?", default='10',
+ help="minimum I/O duration to trace, in ms (default 10)")
+args = parser.parse_args()
+min_ms = int(args.min_ms)
+pid = args.pid
+debug = 0
+
+# signal handler
+def signal_ignore(signal, frame):
+ print()
+
+# define BPF program
+bpf_text = """
+#include <uapi/linux/ptrace.h>
+#include <linux/fs.h>
+
+#define TRACE_READ 0
+#define TRACE_WRITE 1
+
+struct val_t {
+ u32 sz;
+ u64 ts;
+ char name[DNAME_INLINE_LEN];
+};
+
+BPF_HASH(entryinfo, pid_t, struct val_t);
+
+// store timestamp and size on entry
+static int trace_rw_entry(struct pt_regs *ctx, struct file *file,
+ char __user *buf, size_t count)
+{
+ u32 pid;
+
+ pid = bpf_get_current_pid_tgid();
+ if (FILTER)
+ return 0;
+
+ // skip I/O lacking a filename
+ struct dentry *de = file->f_path.dentry;
+ if (de->d_iname[0] == 0)
+ return 0;
+
+ // store size and timestamp by pid
+ struct val_t val = {};
+ val.sz = count;
+ val.ts = bpf_ktime_get_ns();
+ __builtin_memcpy(&val.name, de->d_iname, sizeof(val.name));
+ entryinfo.update(&pid, &val);
+
+ return 0;
+}
+
+int trace_read_entry(struct pt_regs *ctx, struct file *file,
+ char __user *buf, size_t count)
+{
+ // skip non-sync I/O; see kernel code for __vfs_read()
+ if (!(file->f_op->read_iter))
+ return 0;
+ return trace_rw_entry(ctx, file, buf, count);
+}
+
+int trace_write_entry(struct pt_regs *ctx, struct file *file,
+ char __user *buf, size_t count)
+{
+ // skip non-sync I/O; see kernel code for __vfs_write()
+ if (!(file->f_op->write_iter))
+ return 0;
+ return trace_rw_entry(ctx, file, buf, count);
+}
+
+// output
+static int trace_rw_return(struct pt_regs *ctx, int type)
+{
+ struct val_t *valp;
+ u32 pid = bpf_get_current_pid_tgid();
+
+ valp = entryinfo.lookup(&pid);
+ if (valp == 0) {
+ // missed tracing issue or filtered
+ return 0;
+ }
+ u64 delta_us = (bpf_ktime_get_ns() - valp->ts) / 1000;
+ entryinfo.delete(&pid);
+ if (delta_us < MIN_US)
+ return 0;
+
+ if (type == TRACE_READ) {
+ bpf_trace_printk("R %d %d %s\\n", valp->sz, delta_us, valp->name);
+ } else {
+ bpf_trace_printk("W %d %d %s\\n", valp->sz, delta_us, valp->name);
+ }
+
+ return 0;
+}
+
+int trace_read_return(struct pt_regs *ctx)
+{
+ return trace_rw_return(ctx, TRACE_READ);
+}
+
+int trace_write_return(struct pt_regs *ctx)
+{
+ return trace_rw_return(ctx, TRACE_WRITE);
+}
+
+"""
+bpf_text = bpf_text.replace('MIN_US', str(min_ms * 1000))
+if args.pid:
+ bpf_text = bpf_text.replace('FILTER', 'pid != %s' % pid)
+else:
+ bpf_text = bpf_text.replace('FILTER', '0')
+if debug:
+ print(bpf_text)
+
+# initialize BPF
+b = BPF(text=bpf_text)
+
+# I'd rather trace these via new_sync_read/new_sync_write (which used to be
+# do_sync_read/do_sync_write), but those became static. So trace these from
+# the parent functions, at the cost of more overhead, instead.
+# Ultimately, we should be using [V]FS tracepoints.
+b.attach_kprobe(event="__vfs_read", fn_name="trace_read_entry")
+b.attach_kprobe(event="__vfs_write", fn_name="trace_write_entry")
+b.attach_kretprobe(event="__vfs_read", fn_name="trace_read_return")
+b.attach_kretprobe(event="__vfs_write", fn_name="trace_write_return")
+
+# header
+print("Tracing sync read/writes slower than %d ms" % min_ms)
+print("%-8s %-14s %-6s %1s %-7s %7s %s" % ("TIME(s)", "COMM", "PID", "D",
+ "BYTES", "LAT(ms)", "FILENAME"))
+
+start_ts = 0
+
+# format output
+while 1:
+ (task, pid, cpu, flags, ts, msg) = b.trace_fields()
+    fields = msg.split(" ", 3)
+    (type_s, sz, delta_us_s) = (fields[0], fields[1], fields[2])
+    try:
+        filename = fields[3]
+    except IndexError:
+        filename = "?"
+ if start_ts == 0:
+ start_ts = ts
+
+ ms = float(int(delta_us_s, 10)) / 1000
+
+ print("%-8.3f %-14.14s %-6s %1s %-7s %7.2f %s" % (
+ ts - start_ts, task, pid, type_s, sz, ms, filename))
--- /dev/null
+Demonstrations of fsslower, the Linux eBPF/bcc version.
+
+
+fsslower shows file-based synchronous reads and writes slower than a threshold.
+For example:
+
+# ./fsslower
+Tracing sync read/writes slower than 10 ms
+TIME(s) COMM PID D BYTES LAT(ms) FILENAME
+0.000 randread.pl 4762 R 8192 12.70 data1
+8.850 randread.pl 4762 R 8192 11.26 data1
+12.852 randread.pl 4762 R 8192 10.43 data1
+
+This output shows a few reads from a "randread.pl" program, each 8 Kbytes in
+size, all from a "data1" file. These all had over 10 ms latency.
+
+This "latency" is measured from when the read or write was issued at the VFS
+interface, to when it completed. This spans everything: block device I/O (disk
+I/O), file system CPU cycles, file system locks, run queue latency, etc. This
+is a better measure of the latency suffered by applications reading from the
+file system than measuring this down at the block device interface.
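
Concretely, the tool records a bpf_ktime_get_ns() timestamp at VFS entry and subtracts it at return. A minimal sketch of that arithmetic, with fabricated timestamps:

```python
# Fabricated entry/return timestamps, in nanoseconds (as returned by
# bpf_ktime_get_ns() in the real tool).
entry_ns = 1000000000
return_ns = 1012700000

delta_us = (return_ns - entry_ns) // 1000  # kernel side compares microseconds
lat_ms = delta_us / 1000.0                 # user side prints milliseconds
print(lat_ms)
# 12.7
```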
+
+Note that this only traces reads and writes: other file system operations
+(eg, directory operations, open(), fflush()) are not currently traced.
+
+
+The threshold can be provided as an argument. Eg, I/O slower than 1 ms:
+
+# ./fsslower 1
+Tracing sync read/writes slower than 1 ms
+TIME(s) COMM PID D BYTES LAT(ms) FILENAME
+0.000 randread.pl 6925 R 8192 1.06 data1
+0.082 randread.pl 6925 R 8192 2.42 data1
+0.116 randread.pl 6925 R 8192 1.78 data1
+0.153 randread.pl 6925 R 8192 2.31 data1
+0.330 randread.pl 6925 R 8192 1.14 data1
+0.345 randread.pl 6925 R 8192 1.52 data1
+0.359 randread.pl 6925 R 8192 1.04 data1
+0.532 randread.pl 6925 R 8192 2.56 data1
+0.609 supervise 1892 W 18 3.65 status.new
+0.610 randread.pl 6925 R 8192 1.37 data1
+0.614 randread.pl 6925 R 8192 3.04 data1
+0.729 randread.pl 6925 R 8192 2.90 data1
+0.755 randread.pl 6925 R 8192 1.12 data1
+0.762 randread.pl 6925 R 8192 2.62 data1
+0.771 randread.pl 6925 R 8192 1.07 data1
+0.816 randread.pl 6925 R 8192 10.50 data1
+0.983 randread.pl 6925 R 8192 1.73 data1
+0.989 randread.pl 6925 R 8192 2.12 data1
+0.992 randread.pl 6925 R 8192 2.17 data1
+1.001 randread.pl 6925 R 8192 1.93 data1
+1.007 randread.pl 6925 R 8192 2.03 data1
+1.210 randread.pl 6925 R 8192 1.82 data1
+1.213 randread.pl 6925 R 8192 2.58 data1
+1.219 randread.pl 6925 R 8192 2.20 data1
+1.430 randread.pl 6925 R 8192 1.01 data1
+1.448 randread.pl 6925 R 8192 2.22 data1
+[...]
+
+There's now much more output (this spans only 1.4 seconds, the previous output
+spanned 12 seconds), and the lower threshold is catching more I/O.
+
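The millisecond threshold given on the command line is compiled into the BPF program as a microsecond comparison. A standalone sketch of that substitution step (mirroring the script's string replacement, simplified):

```python
# Sketch of how the min_ms argument becomes an in-kernel filter.
# MIN_US is a placeholder in the BPF C text, replaced before compilation.
bpf_snippet = "if (delta_us < MIN_US) return 0;"

def apply_threshold(text, min_ms):
    return text.replace("MIN_US", str(min_ms * 1000))

print(apply_threshold(bpf_snippet, 1))
# if (delta_us < 1000) return 0;
```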
+
+In the following example, the file system caches were dropped before running
+fsslower, and then in another session a "man ls" was executed. The command
+and files read from disk can be seen:
+
+# echo 3 > /proc/sys/vm/drop_caches; ./fsslower 1
+Tracing sync read/writes slower than 1 ms
+TIME(s) COMM PID D BYTES LAT(ms) FILENAME
+0.000 bash 9647 R 128 5.83 man
+0.050 man 9647 R 832 19.52 libmandb-2.6.7.1.so
+0.066 man 9647 R 832 15.79 libman-2.6.7.1.so
+0.123 man 9647 R 832 56.36 libpipeline.so.1.3.0
+0.135 man 9647 R 832 9.79 libgdbm.so.3.0.0
+0.323 man 9647 R 4096 59.52 locale.alias
+0.540 man 9648 R 8192 11.11 ls.1.gz
+0.558 man 9647 R 72 6.97 index.db
+0.563 man 9647 R 4096 5.12 index.db
+0.723 man 9658 R 128 12.06 less
+0.725 man 9656 R 128 14.52 nroff
+0.779 man 9655 R 128 68.86 tbl
+0.814 nroff 9660 R 128 14.55 locale
+0.830 pager 9658 R 4096 28.27 .lesshst
+0.866 man 9654 R 128 163.12 preconv
+0.980 nroff 9684 R 128 13.80 groff
+0.999 groff 9684 R 4096 14.29 DESC
+1.036 groff 9685 R 128 5.94 troff
+1.038 groff 9686 R 128 7.76 grotty
+1.065 troff 9685 R 4096 6.33 R
+1.082 troff 9685 R 4096 10.52 BI
+1.096 troff 9685 R 4096 8.70 troffrc
+1.176 troff 9685 R 4096 80.12 composite.tmac
+1.195 troff 9685 R 4096 19.20 fallbacks.tmac
+1.202 troff 9685 R 4096 6.79 tty.tmac
+1.221 troff 9685 R 4096 7.87 man.local
+2.977 supervise 1876 W 18 4.23 status.new
+
+This caught an individual I/O reaching 163.12 ms, for the "preconv" file. While
+the file system cache was flushed, causing these files to be read from disk,
+the duration here may not be entirely disk I/O: it can also include file system
+locks, run queue latency, etc. These can be explored using other commands.
+
+
+USAGE message:
+
+# ./fsslower -h
+usage: fsslower [-h] [-p PID] [min_ms]
+
+Trace slow file system sync reads and writes
+
+positional arguments:
+ min_ms minimum I/O duration to trace, in ms (default 10)
+
+optional arguments:
+ -h, --help show this help message and exit
+ -p PID, --pid PID trace this PID only
+
+examples:
+ ./fsslower # trace sync I/O slower than 10 ms (default)
+ ./fsslower 1 # trace sync I/O slower than 1 ms
+ ./fsslower -p 185 # trace PID 185 only
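
The -p option works similarly: a PID predicate is compiled directly into the BPF program text before it is loaded. A standalone sketch of that step (names mirror the script, simplified):

```python
# Sketch of the PID filter substitution done before the BPF program
# is compiled. FILTER is a placeholder in the BPF C text.
def apply_pid_filter(text, pid=None):
    if pid is not None:
        # skip events from any other PID
        return text.replace("FILTER", "pid != %s" % pid)
    # no PID given: the filter never fires
    return text.replace("FILTER", "0")

snippet = "if (FILTER) return 0;"
print(apply_pid_filter(snippet, 185))
# if (pid != 185) return 0;
print(apply_pid_filter(snippet))
# if (0) return 0;
```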