From: Brendan Gregg
Date: Sun, 7 Feb 2016 20:14:37 +0000 (-0800)
Subject: fsslower
X-Git-Tag: v0.1.8~33^2~7
X-Git-Url: http://review.tizen.org/git/?a=commitdiff_plain;h=a32fbafb97d25b45bf92d55eccfeda7d2416d055;p=platform%2Fupstream%2Fbcc.git

fsslower
---

diff --git a/README.md b/README.md
index 98f002f..18e954c 100644
--- a/README.md
+++ b/README.md
@@ -71,6 +71,7 @@ Tools:
 - tools/[biosnoop](tools/biosnoop.py): Trace block device I/O with PID and latency. [Examples](tools/biosnoop_example.txt).
 - tools/[bitesize](tools/bitesize.py): Show per process I/O size histogram. [Examples](tools/bitesize_example.txt).
 - tools/[cachestat](tools/cachestat.py): Trace page cache hit/miss ratio. [Examples](tools/cachestat_example.txt).
+- tools/[fsslower](tools/fsslower.py): Trace slow file system synchronous reads and writes. [Examples](tools/fsslower_example.txt).
 - tools/[funccount](tools/funccount.py): Count kernel function calls. [Examples](tools/funccount_example.txt).
 - tools/[funclatency](tools/funclatency.py): Time kernel functions and show their latency distribution. [Examples](tools/funclatency_example.txt).
 - tools/[gethostlatency](tools/gethostlatency.py): Show latency for getaddrinfo/gethostbyname[2] calls. [Examples](tools/gethostlatency_example.txt).
diff --git a/man/man8/fsslower.8 b/man/man8/fsslower.8
new file mode 100644
index 0000000..11566ec
--- /dev/null
+++ b/man/man8/fsslower.8
@@ -0,0 +1,104 @@
+.TH fsslower 8  "2016-02-07" "USER COMMANDS"
+.SH NAME
+fsslower \- Trace slow file system synchronous reads and writes.
+.SH SYNOPSIS
+.B fsslower [\-h] [\-p PID] [min_ms]
+.SH DESCRIPTION
+This script uses kernel dynamic tracing of synchronous reads and writes
+at the VFS interface, to identify slow file system I/O for any file system.
+
+This version traces __vfs_read() and __vfs_write(), and shows only
+synchronous I/O (the path to new_sync_read() and new_sync_write()) and
+I/O with filenames. This approach provides a view of just two file
+system request types: reads and writes. There are typically many others:
+asynchronous I/O, directory operations, file handle operations, file open()s,
+fflush(), etc, that this tool does not currently instrument. This
+implementation is a workaround until we have suitable fs tracepoints.
+
+WARNING: See the OVERHEAD section.
+
+By default, a minimum millisecond threshold of 10 is used.
+
+Since this works by tracing various kernel __vfs_*() functions using dynamic
+tracing, it will need updating to match any changes to these functions. A
+future version should switch to using FS tracepoints instead.
+
+Since this uses BPF, only the root user can use this tool.
+.SH REQUIREMENTS
+CONFIG_BPF and bcc.
+.SH OPTIONS
+.TP
+\-p PID
+Trace this PID only.
+.TP
+min_ms
+Minimum I/O latency (duration) to trace, in milliseconds. Default is 10 ms.
+.SH EXAMPLES
+.TP
+Trace synchronous file system reads and writes slower than 10 ms:
+#
+.B fsslower
+.TP
+Trace I/O slower than 1 ms:
+#
+.B fsslower 1
+.TP
+Trace I/O slower than 1 ms, for PID 181 only:
+#
+.B fsslower \-p 181 1
+.SH FIELDS
+.TP
+TIME(s)
+Time of I/O completion since the first I/O seen, in seconds.
+.TP
+COMM
+Process name.
+.TP
+PID
+Process ID.
+.TP
+D
+Direction of I/O. R == read, W == write.
+.TP
+BYTES
+Size of I/O, in bytes.
+.TP
+LAT(ms)
+Latency (duration) of I/O, measured from when the application issued it to VFS
+to when it completed. This time is inclusive of block device I/O, file system
+CPU cycles, file system locks, run queue latency, etc. It is a more accurate
+measure of the latency suffered by applications performing file system I/O
+than measuring it down at the block device interface.
+.TP
+FILENAME
+A cached kernel file name (comes from dentry->d_iname).
+.SH OVERHEAD
+Depending on the frequency of application reads and writes, overhead can become
+severe, in the worst case slowing applications by 2x. In the best case, the
+overhead is negligible. 
Hopefully for real-world workloads the overhead is
+often at the lower end of the spectrum -- test before use. The reason the
+overhead can be high is that this traces VFS reads and writes, which include
+FS cache reads and writes, and can exceed one million events per second if the
+application is I/O heavy. While the instrumentation is extremely lightweight
+and uses in-kernel eBPF maps for efficient timing and filtering, multiply that
+small per-event cost by one million events per second and it can become
+significant. You can get an idea of the possible cost by just counting the
+instrumented events using the bcc funccount tool, eg:
+.PP
+# ./funccount.py -i 1 -r '^__vfs_(read|write)$'
+.PP
+This also incurs overhead, but somewhat less than fsslower does.
+.SH SOURCE
+This is from bcc.
+.IP
+https://github.com/iovisor/bcc
+.PP
+Also look in the bcc distribution for a companion _examples.txt file containing
+example usage, output, and commentary for this tool.
+.SH OS
+Linux
+.SH STABILITY
+Unstable - in development.
+.SH AUTHOR
+Brendan Gregg
+.SH SEE ALSO
+biosnoop(8), funccount(8)
diff --git a/tools/fsslower.py b/tools/fsslower.py
new file mode 100755
index 0000000..6e6a082
--- /dev/null
+++ b/tools/fsslower.py
@@ -0,0 +1,196 @@
+#!/usr/bin/python
+# @lint-avoid-python-3-compatibility-imports
+#
+# fsslower  Trace slow file system synchronous reads and writes.
+#           For Linux, uses BCC, eBPF.
+#
+# USAGE: fsslower [-h] [-p PID] [min_ms]
+#
+# This script uses kernel dynamic tracing of synchronous reads and writes
+# at the VFS interface, to identify slow file system I/O for any file system.
+#
+# This works by tracing __vfs_read() and __vfs_write(), and filtering for
+# synchronous I/O (the path to new_sync_read() and new_sync_write()), and
+# for I/O with filenames. This approach provides a view of just two file
+# system request types. There are typically many others: asynchronous I/O,
+# directory operations, file handle operations, etc, that this tool does not
+# instrument. This implementation is a workaround until we have suitable fs
+# tracepoints.
+#
+# WARNING: This traces VFS reads and writes, which can be extremely frequent,
+# and so the overhead of this tool can become severe depending on the
+# workload.
+#
+# By default, a minimum millisecond threshold of 10 is used.
+#
+# Copyright 2016 Netflix, Inc.
+# Licensed under the Apache License, Version 2.0 (the "License")
+#
+# 06-Feb-2016   Brendan Gregg   Created this.
+
+from __future__ import print_function
+from bcc import BPF
+import argparse
+import signal
+
+# arguments
+examples = """examples:
+    ./fsslower             # trace sync I/O slower than 10 ms (default)
+    ./fsslower 1           # trace sync I/O slower than 1 ms
+    ./fsslower -p 185      # trace PID 185 only
+"""
+parser = argparse.ArgumentParser(
+    description="Trace slow file system sync reads and writes",
+    formatter_class=argparse.RawDescriptionHelpFormatter,
+    epilog=examples)
+parser.add_argument("-p", "--pid",
+    help="trace this PID only")
+parser.add_argument("min_ms", nargs="?", default='10',
+    help="minimum I/O duration to trace, in ms (default 10)")
+args = parser.parse_args()
+min_ms = int(args.min_ms)
+pid = args.pid
+debug = 0
+
+# signal handler
+def signal_ignore(signal, frame):
+    print()
+
+# define BPF program
+bpf_text = """
+#include <uapi/linux/ptrace.h>
+#include <linux/fs.h>
+
+#define TRACE_READ  0
+#define TRACE_WRITE 1
+
+struct val_t {
+    u32 sz;
+    u64 ts;
+    char name[DNAME_INLINE_LEN];
+};
+
+BPF_HASH(entryinfo, pid_t, struct val_t);
+
+// store timestamp and size on entry
+static int trace_rw_entry(struct pt_regs *ctx, struct file *file,
+    char __user *buf, size_t count)
+{
+    u32 pid;
+
+    pid = bpf_get_current_pid_tgid();
+    if (FILTER)
+        return 0;
+
+    // skip I/O lacking a filename
+    struct dentry *de = file->f_path.dentry;
+    if (de->d_iname[0] == 0)
+        return 0;
+
+    // store size and timestamp by pid
+    struct val_t val = {};
+    val.sz = count;
+    val.ts = bpf_ktime_get_ns();
+    __builtin_memcpy(&val.name, de->d_iname, sizeof(val.name));
+    entryinfo.update(&pid, &val);
+
+    return 0;
+}
+
+int trace_read_entry(struct pt_regs *ctx, struct file *file,
+    char __user *buf, size_t count)
+{
+    // skip non-sync I/O; see kernel code for __vfs_read()
+    if (!(file->f_op->read_iter))
+        return 0;
+    return trace_rw_entry(ctx, file, buf, count);
+}
+
+int trace_write_entry(struct pt_regs *ctx, struct file *file,
+    char __user *buf, size_t count)
+{
+    // skip non-sync I/O; see kernel code for __vfs_write()
+    if (!(file->f_op->write_iter))
+        return 0;
+    return trace_rw_entry(ctx, file, buf, count);
+}
+
+// output
+static int trace_rw_return(struct pt_regs *ctx, int type)
+{
+    struct val_t *valp;
+    u32 pid = bpf_get_current_pid_tgid();
+
+    valp = entryinfo.lookup(&pid);
+    if (valp == 0) {
+        // missed tracing issue or filtered
+        return 0;
+    }
+    u64 delta_us = (bpf_ktime_get_ns() - valp->ts) / 1000;
+    entryinfo.delete(&pid);
+    if (delta_us < MIN_US)
+        return 0;
+
+    if (type == TRACE_READ) {
+        bpf_trace_printk("R %d %d %s\\n", valp->sz, delta_us, valp->name);
+    } else {
+        bpf_trace_printk("W %d %d %s\\n", valp->sz, delta_us, valp->name);
+    }
+
+    return 0;
+}
+
+int trace_read_return(struct pt_regs *ctx)
+{
+    return trace_rw_return(ctx, TRACE_READ);
+}
+
+int trace_write_return(struct pt_regs *ctx)
+{
+    return trace_rw_return(ctx, TRACE_WRITE);
+}
+
+"""
+bpf_text = bpf_text.replace('MIN_US', str(min_ms * 1000))
+if args.pid:
+    bpf_text = bpf_text.replace('FILTER', 'pid != %s' % pid)
+else:
+    bpf_text = bpf_text.replace('FILTER', '0')
+if debug:
+    print(bpf_text)
+
+# initialize BPF
+b = BPF(text=bpf_text)
+
+# I'd rather trace these via new_sync_read/new_sync_write (which used to be
+# do_sync_read/do_sync_write), but those became static. So trace these from
+# the parent functions, at the cost of more overhead, instead.
+# Ultimately, we should be using [V]FS tracepoints.
+b.attach_kprobe(event="__vfs_read", fn_name="trace_read_entry")
+b.attach_kprobe(event="__vfs_write", fn_name="trace_write_entry")
+b.attach_kretprobe(event="__vfs_read", fn_name="trace_read_return")
+b.attach_kretprobe(event="__vfs_write", fn_name="trace_write_return")
+
+# header
+print("Tracing sync read/writes slower than %d ms" % min_ms)
+print("%-8s %-14s %-6s %1s %-7s %7s %s" % ("TIME(s)", "COMM", "PID", "D",
+    "BYTES", "LAT(ms)", "FILENAME"))
+
+start_ts = 0
+
+# format output
+while 1:
+    (task, pid, cpu, flags, ts, msg) = b.trace_fields()
+    fields = msg.split(" ", 3)
+    (type_s, sz, delta_us_s) = (fields[0], fields[1], fields[2])
+    try:
+        filename = fields[3]
+    except:
+        filename = "?"
+    if start_ts == 0:
+        start_ts = ts
+
+    ms = float(int(delta_us_s, 10)) / 1000
+
+    print("%-8.3f %-14.14s %-6s %1s %-7s %7.2f %s" % (
+        ts - start_ts, task, pid, type_s, sz, ms, filename))
diff --git a/tools/fsslower_example.txt b/tools/fsslower_example.txt
new file mode 100644
index 0000000..9ff2617
--- /dev/null
+++ b/tools/fsslower_example.txt
@@ -0,0 +1,122 @@
+Demonstrations of fsslower, the Linux eBPF/bcc version.
+
+
+fsslower shows file-based synchronous reads and writes slower than a threshold.
+For example:
+
+# ./fsslower
+Tracing sync read/writes slower than 10 ms
+TIME(s)  COMM           PID    D BYTES   LAT(ms) FILENAME
+0.000    randread.pl    4762   R 8192      12.70 data1
+8.850    randread.pl    4762   R 8192      11.26 data1
+12.852   randread.pl    4762   R 8192      10.43 data1
+
+This showed a few reads from a "randread.pl" program, each 8 Kbytes in size,
+from a "data1" file. These all had latency over 10 ms.
+
+This "latency" is measured from when the read or write was issued at the VFS
+interface, to when it completed. This spans everything: block device I/O (disk
+I/O), file system CPU cycles, file system locks, run queue latency, etc. This
+is a better measure of the latency suffered by applications reading from the
+file system than measuring down at the block device interface.
+
+Note that this only traces reads and writes: other file system operations
+(eg, directory operations, open(), fflush()) are not currently traced.
+
+
+The threshold can be provided as an argument. Eg, I/O slower than 1 ms:
+
+# ./fsslower 1
+Tracing sync read/writes slower than 1 ms
+TIME(s)  COMM           PID    D BYTES   LAT(ms) FILENAME
+0.000    randread.pl    6925   R 8192       1.06 data1
+0.082    randread.pl    6925   R 8192       2.42 data1
+0.116    randread.pl    6925   R 8192       1.78 data1
+0.153    randread.pl    6925   R 8192       2.31 data1
+0.330    randread.pl    6925   R 8192       1.14 data1
+0.345    randread.pl    6925   R 8192       1.52 data1
+0.359    randread.pl    6925   R 8192       1.04 data1
+0.532    randread.pl    6925   R 8192       2.56 data1
+0.609    supervise      1892   W 18         3.65 status.new
+0.610    randread.pl    6925   R 8192       1.37 data1
+0.614    randread.pl    6925   R 8192       3.04 data1
+0.729    randread.pl    6925   R 8192       2.90 data1
+0.755    randread.pl    6925   R 8192       1.12 data1
+0.762    randread.pl    6925   R 8192       2.62 data1
+0.771    randread.pl    6925   R 8192       1.07 data1
+0.816    randread.pl    6925   R 8192      10.50 data1
+0.983    randread.pl    6925   R 8192       1.73 data1
+0.989    randread.pl    6925   R 8192       2.12 data1
+0.992    randread.pl    6925   R 8192       2.17 data1
+1.001    randread.pl    6925   R 8192       1.93 data1
+1.007    randread.pl    6925   R 8192       2.03 data1
+1.210    randread.pl    6925   R 8192       1.82 data1
+1.213    randread.pl    6925   R 8192       2.58 data1
+1.219    randread.pl    6925   R 8192       2.20 data1
+1.430    randread.pl    6925   R 8192       1.01 data1
+1.448    randread.pl    6925   R 8192       2.22 data1
+[...]
+
+There's now much more output (this spans only 1.4 seconds; the previous output
+spanned 12 seconds), as the lower threshold is catching more I/O.
+
+
+In the following example, the file system caches were dropped before running
+fsslower, and then in another session a "man ls" was executed. 
The command
+and files read from disk can be seen:
+
+# echo 3 > /proc/sys/vm/drop_caches; ./fsslower 1
+Tracing sync read/writes slower than 1 ms
+TIME(s)  COMM           PID    D BYTES   LAT(ms) FILENAME
+0.000    bash           9647   R 128        5.83 man
+0.050    man            9647   R 832       19.52 libmandb-2.6.7.1.so
+0.066    man            9647   R 832       15.79 libman-2.6.7.1.so
+0.123    man            9647   R 832       56.36 libpipeline.so.1.3.0
+0.135    man            9647   R 832        9.79 libgdbm.so.3.0.0
+0.323    man            9647   R 4096      59.52 locale.alias
+0.540    man            9648   R 8192      11.11 ls.1.gz
+0.558    man            9647   R 72         6.97 index.db
+0.563    man            9647   R 4096       5.12 index.db
+0.723    man            9658   R 128       12.06 less
+0.725    man            9656   R 128       14.52 nroff
+0.779    man            9655   R 128       68.86 tbl
+0.814    nroff          9660   R 128       14.55 locale
+0.830    pager          9658   R 4096      28.27 .lesshst
+0.866    man            9654   R 128      163.12 preconv
+0.980    nroff          9684   R 128       13.80 groff
+0.999    groff          9684   R 4096      14.29 DESC
+1.036    groff          9685   R 128        5.94 troff
+1.038    groff          9686   R 128        7.76 grotty
+1.065    troff          9685   R 4096       6.33 R
+1.082    troff          9685   R 4096      10.52 BI
+1.096    troff          9685   R 4096       8.70 troffrc
+1.176    troff          9685   R 4096      80.12 composite.tmac
+1.195    troff          9685   R 4096      19.20 fallbacks.tmac
+1.202    troff          9685   R 4096       6.79 tty.tmac
+1.221    troff          9685   R 4096       7.87 man.local
+2.977    supervise      1876   W 18         4.23 status.new
+
+This caught an individual I/O reaching 163.12 ms, for the "preconv" file. Since
+the file system cache had been flushed, these files needed to be read from
+disk; even so, the durations here may not be entirely disk I/O: they can
+include file system locks, run queue latency, etc. These can be explored using
+other commands.
+
+
+USAGE message:
+
+# ./fsslower -h
+usage: fsslower [-h] [-p PID] [min_ms]
+
+Trace slow file system sync reads and writes
+
+positional arguments:
+  min_ms             minimum I/O duration to trace, in ms (default 10)
+
+optional arguments:
+  -h, --help         show this help message and exit
+  -p PID, --pid PID  trace this PID only
+
+examples:
+    ./fsslower             # trace sync I/O slower than 10 ms (default)
+    ./fsslower 1           # trace sync I/O slower than 1 ms
+    ./fsslower -p 185      # trace PID 185 only
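A note on the output loop in fsslower.py above: the userspace side simply splits each message emitted by bpf_trace_printk() on spaces, treating the fourth field (the filename) as an optional remainder. The following is a minimal, self-contained sketch of just that parsing step; the sample messages and the helper name parse_fsslower_msg are hypothetical, shaped like the BPF program's output rather than read from a live trace pipe:

```python
def parse_fsslower_msg(msg):
    # BPF-side message format: "<R|W> <bytes> <delta_us> <filename>"
    # Split at most 3 times so the filename is the (optional) remainder.
    fields = msg.split(" ", 3)
    type_s, sz_s, delta_us_s = fields[0], fields[1], fields[2]
    filename = fields[3] if len(fields) > 3 else "?"
    ms = int(delta_us_s, 10) / 1000.0  # microseconds -> milliseconds
    return (type_s, int(sz_s), ms, filename)

# hypothetical sample messages, mimicking the BPF program's output
print(parse_fsslower_msg("R 8192 12700 data1"))
print(parse_fsslower_msg("W 18 3650"))  # filename missing, reported as "?"
```

This mirrors why the real script wraps the filename access in a try/except: a message with an empty name field yields only three fields after splitting.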