argdist, trace: Native tracepoint support (#724)
authorSasha Goldshtein <goldshtn@gmail.com>
Tue, 4 Oct 2016 16:49:57 +0000 (19:49 +0300)
committer4ast <alexei.starovoitov@gmail.com>
Tue, 4 Oct 2016 16:49:57 +0000 (09:49 -0700)
* Remove tracepoint.py

The `Tracepoint` class which implements the necessary
support for the tracepoint kprobe-based hack is no
longer needed and can be removed.

* argdist: Native tracepoint support

This commit migrates argdist to use the native bcc/BPF
tracepoint support instead of the hackish kprobe-
based approach. The resulting programs are cleaner
and likely more efficient.

As a result of this change, there is a slight API
change in how argdist is used with tracepoints. To
access fields from the tracepoint structure, the user
is expected to use `args->field` directly. This
leverages most of the built-in bcc support for
generating the tracepoint probe function.

* trace: Native tracepoint support

This commit migrates trace to use the native bcc/BPF
tracepoint support instead of the hackish kprobe-
based approach. The resulting programs are cleaner
and likely more efficient.

As with argdist, users are now expected to use the
`args` structure pointer to access the tracepoint's
arguments.

For example:

```
trace 't:irq:irq_handler_entry (args->irq != 27) "irq %d", args->irq'
```

man/man8/argdist.8
man/man8/trace.8
src/python/bcc/__init__.py
src/python/bcc/tracepoint.py [deleted file]
tools/argdist.py
tools/argdist_example.txt
tools/trace.py
tools/trace_example.txt

index bf6a293..d1c9404 100644 (file)
@@ -92,10 +92,9 @@ The expression(s) to capture.
 These are the values that are assigned to the histogram or raw event collection.
 You may use the parameters directly, or valid C expressions that involve the
 parameters, such as "size % 10".
-Tracepoints may access a special structure called "tp" that is formatted according
-to the tracepoint format (which you can obtain using tplist). For example, the
-block:block_rq_complete tracepoint can access tp.nr_sector. You may also use the
-members of the "tp" struct directly, e.g. "nr_sector" instead of "tp.nr_sector".
+Tracepoints may access a special structure called "args" that is formatted
+according to the tracepoint format (which you can obtain using tplist).
+For example, the block:block_rq_complete tracepoint can access args->nr_sector.
 USDT probes may access the arguments defined by the tracing program in the 
 special arg1, arg2, ... variables. To obtain their types, use the tplist tool.
 Return probes can use the argument values received by the
index 4c70bf6..2a97a99 100644 (file)
@@ -93,11 +93,9 @@ format specifier replacements may be any C expressions, and may refer to the
 same special keywords as in the predicate (arg1, arg2, etc.).
 
 In tracepoints, both the predicate and the arguments may refer to the tracepoint
-format structure, which is stored in the special "tp" variable. For example, the
-block:block_rq_complete tracepoint can print or filter by tp.nr_sector. To 
-discover the format of your tracepoint, use the tplist tool. Note that you can
-also use the members of the "tp" struct directly, e.g "nr_sector" instead of
-"tp.nr_sector".
+format structure, which is stored in the special "args" variable. For example, the
+block:block_rq_complete tracepoint can print or filter by args->nr_sector. To 
+discover the format of your tracepoint, use the tplist tool. 
 
 In USDT probes, the arg1, ..., argN variables refer to the probe's arguments.
 To determine which arguments your probe has, use the tplist tool.
@@ -126,7 +124,7 @@ Trace returns from the readline function in bash and print the return value as a
 .TP
 Trace the block:block_rq_complete tracepoint and print the number of sectors completed:
 #
-.B trace 't:block:block_rq_complete """%d sectors"", nr_sector'
+.B trace 't:block:block_rq_complete """%d sectors"", args->nr_sector'
 .TP
 Trace the pthread_create USDT probe from the pthread library and print the address of the thread's start function:
 #
index c0f95f1..1c60d1b 100644 (file)
@@ -27,7 +27,6 @@ basestring = (unicode if sys.version_info[0] < 3 else str)
 
 from .libbcc import lib, _CB_TYPE, bcc_symbol
 from .table import Table
-from .tracepoint import Tracepoint
 from .perf import Perf
 from .usyms import ProcessSymbols
 
diff --git a/src/python/bcc/tracepoint.py b/src/python/bcc/tracepoint.py
deleted file mode 100644 (file)
index 31abaaa..0000000
+++ /dev/null
@@ -1,143 +0,0 @@
-# Copyright 2016 Sasha Goldshtein
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import ctypes as ct
-import multiprocessing
-import os
-import re
-from .perf import Perf
-
-class Tracepoint(object):
-        enabled_tracepoints = []
-        trace_root = "/sys/kernel/debug/tracing"
-        event_root = os.path.join(trace_root, "events")
-
-        @classmethod
-        def _any_tracepoints_enabled(cls):
-                return len(cls.enabled_tracepoints) > 0
-
-        @classmethod
-        def generate_decl(cls):
-                if not cls._any_tracepoints_enabled():
-                        return ""
-                return "\nBPF_HASH(__trace_di, u64, u64);\n"
-
-        @classmethod
-        def generate_entry_probe(cls):
-                if not cls._any_tracepoints_enabled():
-                        return ""
-                return """
-int __trace_entry_update(struct pt_regs *ctx)
-{
-        u64 tid = bpf_get_current_pid_tgid();
-        u64 val = PT_REGS_PARM1(ctx);
-        __trace_di.update(&tid, &val);
-        return 0;
-}
-"""
-
-        def __init__(self, category, event, tp_id):
-                self.category = category
-                self.event = event
-                self.tp_id = tp_id
-                self._retrieve_struct_fields()
-
-        def _retrieve_struct_fields(self):
-                self.struct_fields = []
-                format_lines = Tracepoint.get_tpoint_format(self.category,
-                                                            self.event)
-                for line in format_lines:
-                        match = re.search(r'field:([^;]*);.*size:(\d+);', line)
-                        if match is None:
-                                continue
-                        parts = match.group(1).split()
-                        field_name = parts[-1:][0]
-                        field_type = " ".join(parts[:-1])
-                        field_size = int(match.group(2))
-                        if "__data_loc" in field_type:
-                                continue
-                        if field_name.startswith("common_"):
-                                continue
-                        self.struct_fields.append((field_type, field_name))
-
-        def _generate_struct_fields(self):
-                text = ""
-                for field_type, field_name in self.struct_fields:
-                        text += "        %s %s;\n" % (field_type, field_name)
-                return text
-
-        def generate_struct(self):
-                self.struct_name = self.event + "_trace_entry"
-                return """
-struct %s {
-        u64 __do_not_use__;
-%s
-};
-                """ % (self.struct_name, self._generate_struct_fields())
-
-        def _generate_struct_locals(self):
-                text = ""
-                for field_type, field_name in self.struct_fields:
-                        if field_type == "char" and field_name.endswith(']'):
-                                # Special case for 'char whatever[N]', should
-                                # be assigned to a 'char *'
-                                field_type = "char *"
-                                field_name = re.sub(r'\[\d+\]$', '', field_name)
-                        text += "        %s %s = tp.%s;\n" % (
-                                        field_type, field_name, field_name)
-                return text
-
-        def generate_get_struct(self):
-                return """
-        u64 tid = bpf_get_current_pid_tgid();
-        u64 *di = __trace_di.lookup(&tid);
-        if (di == 0) { return 0; }
-        struct %s tp = {};
-        bpf_probe_read(&tp, sizeof(tp), (void *)*di);
-%s
-                """ % (self.struct_name, self._generate_struct_locals())
-
-        @classmethod
-        def enable_tracepoint(cls, category, event):
-                tp_id = cls.get_tpoint_id(category, event)
-                if tp_id == -1:
-                        raise ValueError("no such tracepoint found: %s:%s" %
-                                         (category, event))
-                Perf.perf_event_open(tp_id, ptype=Perf.PERF_TYPE_TRACEPOINT)
-                new_tp = Tracepoint(category, event, tp_id)
-                cls.enabled_tracepoints.append(new_tp)
-                return new_tp
-
-        @staticmethod
-        def get_tpoint_id(category, event):
-                evt_dir = os.path.join(Tracepoint.event_root, category, event)
-                try:
-                        return int(
-                          open(os.path.join(evt_dir, "id")).read().strip())
-                except:
-                        return -1
-
-        @staticmethod
-        def get_tpoint_format(category, event):
-                evt_dir = os.path.join(Tracepoint.event_root, category, event)
-                try:
-                        return open(os.path.join(evt_dir, "format")).readlines()
-                except:
-                        return ""
-
-        @classmethod
-        def attach(cls, bpf):
-                if cls._any_tracepoints_enabled():
-                        bpf.attach_kprobe(event="tracing_generic_entry_update",
-                                          fn_name="__trace_entry_update")
index b49d549..a4aab70 100755 (executable)
@@ -12,7 +12,7 @@
 # Licensed under the Apache License, Version 2.0 (the "License")
 # Copyright (C) 2016 Sasha Goldshtein.
 
-from bcc import BPF, Tracepoint, Perf, USDT
+from bcc import BPF, USDT
 from time import sleep, strftime
 import argparse
 import re
@@ -195,9 +195,6 @@ u64 __time = bpf_ktime_get_ns();
                         self.library = ""       # kernel
                         self.tp_category = parts[1]
                         self.tp_event = self.function
-                        self.tp = Tracepoint.enable_tracepoint(
-                                        self.tp_category, self.tp_event)
-                        self.function = "perf_trace_" + self.function
                 elif self.probe_type == "u":
                         self.library = parts[1]
                         self.probe_func_name = "%s_probe%d" % \
@@ -329,8 +326,10 @@ u64 __time = bpf_ktime_get_ns();
                 program = ""
                 probe_text = """
 DATA_DECL
-
-int PROBENAME(struct pt_regs *ctx SIGNATURE)
+""" + (
+                "TRACEPOINT_PROBE(%s, %s)" % (self.tp_category, self.tp_event) \
+                        if self.probe_type == "t" \
+                        else "int PROBENAME(struct pt_regs *ctx SIGNATURE)") + """
 {
         PID_FILTER
         PREFIX
@@ -352,10 +351,7 @@ int PROBENAME(struct pt_regs *ctx SIGNATURE)
                         # value we collected when entering the function:
                         self._replace_entry_exprs()
 
-                if self.probe_type == "t":
-                        program += self.tp.generate_struct()
-                        prefix += self.tp.generate_get_struct()
-                elif self.probe_type == "p" and len(self.signature) > 0:
+                if self.probe_type == "p" and len(self.signature) > 0:
                         # Only entry uprobes/kprobes can have user-specified
                         # signatures. Other probes force it to ().
                         signature = ", " + self.signature
@@ -396,7 +392,9 @@ int PROBENAME(struct pt_regs *ctx SIGNATURE)
                                                pid=self.pid or -1)
 
         def _attach_k(self):
-                if self.probe_type == "r" or self.probe_type == "t":
+                if self.probe_type == "t":
+                        pass    # Nothing to do for tracepoints
+                elif self.probe_type == "r":
                         self.bpf.attach_kretprobe(event=self.function,
                                              fn_name=self.probe_func_name)
                 else:
@@ -537,10 +535,10 @@ argdist -C 'p:c:fork()#fork calls'
         Count fork() calls in libc across all processes
         Can also use funccount.py, which is easier and more flexible
 
-argdist -H 't:block:block_rq_complete():u32:tp.nr_sector'
+argdist -H 't:block:block_rq_complete():u32:args->nr_sector'
         Print histogram of number of sectors in completing block I/O requests
 
-argdist -C 't:irq:irq_handler_entry():int:tp.irq'
+argdist -C 't:irq:irq_handler_entry():int:args->irq'
         Aggregate interrupts by interrupt request (IRQ)
 
 argdist -C 'u:pthread:pthread_start():u64:arg2' -p 1337
@@ -613,8 +611,6 @@ struct __string_t { char s[%d]; };
                         bpf_source += "#include <%s>\n" % include
                 bpf_source += BPF.generate_auto_includes(
                                 map(lambda p: p.raw_spec, self.probes))
-                bpf_source += Tracepoint.generate_decl()
-                bpf_source += Tracepoint.generate_entry_probe()
                 for probe in self.probes:
                         bpf_source += probe.generate_text()
                 if self.args.verbose:
@@ -627,7 +623,6 @@ struct __string_t { char s[%d]; };
                 self.bpf = BPF(text=bpf_source, usdt_contexts=usdt_contexts)
 
         def _attach(self):
-                Tracepoint.attach(self.bpf)
                 for probe in self.probes:
                         probe.attach(self.bpf)
                 if self.args.verbose:
index 789bd3a..55fdc05 100644 (file)
@@ -264,24 +264,18 @@ certain kinds of allocations or visually group them together.
 
 argdist also has basic support for kernel tracepoints. It is sometimes more
 convenient to use tracepoints because they are documented and don't vary a lot
-between kernel versions like function signatures tend to. For example, let's
-trace the net:net_dev_start_xmit tracepoint and print the interface name that
-is transmitting:
+between kernel versions. For example, let's trace the net:net_dev_start_xmit
+tracepoint and print out the protocol field from the tracepoint structure:
 
-# argdist -c -C 't:net:net_dev_start_xmit(void *a, void *b, struct net_device *c):char*:c->name' -n 2
-[05:01:10]
-t:net:net_dev_start_xmit(void *a, void *b, struct net_device *c):char*:c->name
+# argdist -C 't:net:net_dev_start_xmit():u16:args->protocol'
+[13:01:49]
+t:net:net_dev_start_xmit():u16:args->protocol
         COUNT      EVENT
-        4          c->name = eth0
-[05:01:11]
-t:net:net_dev_start_xmit(void *a, void *b, struct net_device *c):char*:c->name
-        COUNT      EVENT
-        6          c->name = lo
-        92         c->name = eth0
+        8          args->protocol = 2048
+^C
 
-Note that to determine the necessary function signature you need to look at the
-TP_PROTO declaration in the kernel headers. For example, the net_dev_start_xmit
-tracepoint is defined in the include/trace/events/net.h header file.
+Note that to discover the format of the net:net_dev_start_xmit tracepoint, you
+use the tplist tool (tplist -v net:net_dev_start_xmit).
 
 Here's a final example that finds how many write() system calls are performed
 by each process on the system:
@@ -388,10 +382,10 @@ argdist -C 'p:c:fork()#fork calls'
         Count fork() calls in libc across all processes
         Can also use funccount.py, which is easier and more flexible 
 
-argdist -H 't:block:block_rq_complete():u32:tp.nr_sector'
+argdist -H 't:block:block_rq_complete():u32:args->nr_sector'
         Print histogram of number of sectors in completing block I/O requests
 
-argdist -C 't:irq:irq_handler_entry():int:tp.irq'
+argdist -C 't:irq:irq_handler_entry():int:args->irq'
         Aggregate interrupts by interrupt request (IRQ)
 
 argdist -C 'u:pthread:pthread_start():u64:arg2' -p 1337
index 742587b..0cb1c3e 100755 (executable)
@@ -9,7 +9,7 @@
 # Licensed under the Apache License, Version 2.0 (the "License")
 # Copyright (C) 2016 Sasha Goldshtein.
 
-from bcc import BPF, Tracepoint, Perf, USDT
+from bcc import BPF, USDT
 from functools import partial
 from time import sleep, strftime
 import argparse
@@ -138,10 +138,8 @@ class Probe(object):
                 if self.probe_type == "t":
                         self.tp_category = parts[1]
                         self.tp_event = parts[2]
-                        self.tp = Tracepoint.enable_tracepoint(
-                                        self.tp_category, self.tp_event)
                         self.library = ""       # kernel
-                        self.function = "perf_trace_%s" % self.tp_event
+                        self.function = ""      # generated from TRACEPOINT_PROBE
                 elif self.probe_type == "u":
                         self.library = parts[1]
                         self.usdt_name = parts[2]
@@ -357,9 +355,6 @@ BPF_PERF_OUTPUT(%s);
 
                 prefix = ""
                 signature = "struct pt_regs *ctx"
-                if self.probe_type == "t":
-                        data_decl += self.tp.generate_struct()
-                        prefix = self.tp.generate_get_struct()
 
                 data_fields = ""
                 for i, expr in enumerate(self.values):
@@ -377,8 +372,14 @@ BPF_PERF_OUTPUT(%s);
           ctx, BPF_F_REUSE_STACKID
         );""" % self.stacks_name
 
-                text = """
-int %s(%s)
+                if self.probe_type == "t":
+                        heading = "TRACEPOINT_PROBE(%s, %s)" % \
+                                  (self.tp_category, self.tp_event)
+                        ctx_name = "args"
+                else:
+                        heading = "int %s(%s)" % (self.probe_name, signature)
+                        ctx_name = "ctx"
+                text = heading + """
 {
         %s
         %s
@@ -391,15 +392,14 @@ int %s(%s)
         bpf_get_current_comm(&__data.comm, sizeof(__data.comm));
 %s
 %s
-        %s.perf_submit(ctx, &__data, sizeof(__data));
+        %s.perf_submit(%s, &__data, sizeof(__data));
         return 0;
 }
 """
-                text = text % (self.probe_name, signature,
-                               pid_filter, prefix,
+                text = text % (pid_filter, prefix,
                                self._generate_usdt_filter_read(), self.filter,
                                self.struct_name, data_fields,
-                               stack_trace, self.events_name)
+                               stack_trace, self.events_name, ctx_name)
 
                 return data_decl + "\n" + text
 
@@ -464,9 +464,10 @@ int %s(%s)
                 if self.probe_type == "r":
                         bpf.attach_kretprobe(event=self.function,
                                              fn_name=self.probe_name)
-                elif self.probe_type == "p" or self.probe_type == "t":
+                elif self.probe_type == "p":
                         bpf.attach_kprobe(event=self.function,
                                           fn_name=self.probe_name)
+                # Note that tracepoints don't need an explicit attach
 
         def _attach_u(self, bpf):
                 libpath = BPF.find_library(self.library)
@@ -511,7 +512,7 @@ trace 'r::__kmalloc (retval == 0) "kmalloc failed!"
         Trace returns from __kmalloc which returned a null pointer
 trace 'r:c:malloc (retval) "allocated = %p", retval
         Trace returns from malloc and print non-NULL allocated buffers
-trace 't:block:block_rq_complete "sectors=%d", tp.nr_sector'
+trace 't:block:block_rq_complete "sectors=%d", args->nr_sector'
         Trace the block_rq_complete kernel tracepoint and print # of tx sectors
 trace 'u:pthread:pthread_create (arg4 != 0)'
         Trace the USDT probe pthread_create when its 4th argument is non-zero
@@ -558,8 +559,6 @@ trace 'u:pthread:pthread_create (arg4 != 0)'
 """
                 self.program += BPF.generate_auto_includes(
                         map(lambda p: p.raw_probe, self.probes))
-                self.program += Tracepoint.generate_decl()
-                self.program += Tracepoint.generate_entry_probe()
                 for probe in self.probes:
                         self.program += probe.generate_program(
                                         self.args.include_self)
@@ -580,7 +579,6 @@ trace 'u:pthread:pthread_create (arg4 != 0)'
                                 print(probe.usdt.get_text())
                         usdt_contexts.append(probe.usdt)
                 self.bpf = BPF(text=self.program, usdt_contexts=usdt_contexts)
-                Tracepoint.attach(self.bpf)
                 for probe in self.probes:
                         if self.args.verbose:
                                 print(probe)
index dce72b9..dbb8cf1 100644 (file)
@@ -84,15 +84,15 @@ trace has also some basic support for kernel tracepoints. For example, let's
 trace the block:block_rq_complete tracepoint and print out the number of sectors
 transferred:
 
-# trace 't:block:block_rq_complete "sectors=%d", tp.nr_sector'
+# trace 't:block:block_rq_complete "sectors=%d", args->nr_sector'
 TIME     PID    COMM         FUNC             -
 01:23:51 0      swapper/0    block_rq_complete sectors=8
 01:23:55 10017  kworker/u64: block_rq_complete sectors=1
 01:23:55 0      swapper/0    block_rq_complete sectors=8
 ^C
 
-To discover the tracepoint structure format (which you can refer to as the "tp"
-variable), use the tplist tool. For example:
+To discover the tracepoint structure format (which you can refer to as the "args"
+pointer variable), use the tplist tool. For example:
 
 # tplist -v block:block_rq_complete
 block:block_rq_complete
@@ -102,7 +102,7 @@ block:block_rq_complete
     int errors;
     char rwbs[8];
 
-This output tells you that you can use "tp.dev", "tp.sector", etc. in your
+This output tells you that you can use "args->dev", "args->sector", etc. in your
 predicate and trace arguments.
 
 As a final example, let's trace open syscalls for a specific process. By 
@@ -169,7 +169,7 @@ trace 'r::__kmalloc (retval == 0) "kmalloc failed!"
         Trace returns from __kmalloc which returned a null pointer
 trace 'r:c:malloc (retval) "allocated = %p", retval
         Trace returns from malloc and print non-NULL allocated buffers
-trace 't:block:block_rq_complete "sectors=%d", tp.nr_sector'
+trace 't:block:block_rq_complete "sectors=%d", args->nr_sector'
         Trace the block_rq_complete kernel tracepoint and print # of tx sectors
 trace 'u:pthread:pthread_create (arg4 != 0)'
         Trace the USDT probe pthread_create when its 4th argument is non-zero