Daniel Borkmann [Mon, 14 May 2018 21:29:46 +0000 (23:29 +0200)]
Merge branch 'bpf-stackmap-nmi'
Song Liu says:
====================
Changes v2 -> v3:
Improve syntax based on suggestion by Tobin C. Harding.
Changes v1 -> v2:
1. Rename some variables to (hopefully) reduce confusion;
2. Check irq_work status with IRQ_WORK_BUSY (instead of work->sem);
3. In Kconfig, let BPF_SYSCALL select IRQ_WORK;
4. Add static to DEFINE_PER_CPU();
5. Remove pr_info() in stack_map_init().
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Song Liu [Mon, 7 May 2018 17:50:49 +0000 (10:50 -0700)]
bpf: add selftest for stackmap with build_id in NMI context
This new test captures stackmap with build_id with hardware event
PERF_COUNT_HW_CPU_CYCLES.
Because we only support one ips-to-build_id lookup per cpu in NMI
context, stack_amap will not be able to do the lookup in this test.
Therefore, we didn't do compare_stack_ips(), as it will alwasy fail.
urandom_read.c is extended to run configurable cycles so that it can be
caught by the perf event.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Song Liu [Mon, 7 May 2018 17:50:48 +0000 (10:50 -0700)]
bpf: enable stackmap with build_id in nmi context
Currently, we cannot parse build_id in nmi context because of
up_read(¤t->mm->mmap_sem), this makes stackmap with build_id
less useful. This patch enables parsing build_id in nmi by putting
the up_read() call in irq_work. To avoid memory allocation in nmi
context, we use per cpu variable for the irq_work. As a result, only
one irq_work per cpu is allowed. If the irq_work is in-use, we
fallback to only report ips.
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Thu, 10 May 2018 23:48:19 +0000 (01:48 +0200)]
Merge branch 'bpf-perf-rb-libbpf'
Jakub Kicinski says:
====================
This series started out as a follow up to the bpftool perf event dumping
patches.
As suggested by Daniel patch 1 makes use of PERF_SAMPLE_TIME to simplify
code and improve accuracy of timestamps.
Remaining patches are trying to move perf event loop into libbpf as
suggested by Alexei. One user for this new function is bpftool which
links with libbpf nicely, the other, unfortunately, is in samples/bpf.
Remaining patches make samples/bpf link against full libbpf.a (not just
a handful of objects). Once we have full power of libbpf at our disposal
we can convert some of XDP samples to use libbpf loader instead of
bpf_load.c. My understanding is that this is the desired direction,
at least for networking code.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Thu, 10 May 2018 17:24:43 +0000 (10:24 -0700)]
samples: bpf: convert some XDP samples from bpf_load to libbpf
Now that we can use full powers of libbpf in BPF samples, we
should perhaps make the simplest XDP programs not depend on
bpf_load helpers. This way newcomers will be exposed to the
recommended library from the start.
Use of bpf_prog_load_xattr() will also make it trivial to later
on request offload of the programs by simply adding ifindex to
the xattr.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Thu, 10 May 2018 17:24:42 +0000 (10:24 -0700)]
tools: bpf: don't complain about no kernel version for networking code
BPF programs only have to specify the target kernel version for
tracing related hooks, in networking world that requirement does
not really apply. Loosen the checks in libbpf to reflect that.
bpf_object__open() users will continue to see the error for backward
compatibility (and because prog_type is not available there).
Error code for NULL file name is changed from ENOENT to EINVAL,
as it seems more appropriate, hopefully, that's an OK change.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Thu, 10 May 2018 17:24:41 +0000 (10:24 -0700)]
tools: bpf: improve comments in libbpf.h
Fix spelling mistakes, improve and clarify the language of comments
in libbpf.h.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Thu, 10 May 2018 17:24:40 +0000 (10:24 -0700)]
tools: bpf: move the event reading loop to libbpf
There are two copies of event reading loop - in bpftool and
trace_helpers "library". Consolidate them and move the code
to libbpf. Return codes from trace_helpers are kept, but
renamed to include LIBBPF prefix.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Thu, 10 May 2018 17:24:39 +0000 (10:24 -0700)]
samples: bpf: compile and link against full libbpf
samples/bpf currently cherry-picks object files from tools/lib/bpf
to link against. Just compile the full library and link statically
against it.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Thu, 10 May 2018 17:24:38 +0000 (10:24 -0700)]
samples: bpf: rename struct bpf_map_def to avoid conflict with libbpf
Both tools/lib/bpf/libbpf.h and samples/bpf/bpf_load.h define their
own version of struct bpf_map_def. The version in bpf_load.h has
more fields. libbpf does not support inner maps and its definition
of struct bpf_map_def lacks the related fields. Rename the definition
in bpf_load.h (samples/bpf) to avoid conflicts.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Thu, 10 May 2018 17:24:37 +0000 (10:24 -0700)]
tools: bpftool: use PERF_SAMPLE_TIME instead of reading the clock
Ask the kernel to include sample time in each even instead of
reading the clock. This is also more accurate because our
clock reading was done when user space would dump the buffer,
not when sample was produced.
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Prashant Bhole [Wed, 9 May 2018 02:04:59 +0000 (11:04 +0900)]
bpf: sync tools bpf.h uapi header
Sync the header from include/uapi/linux/bpf.h which was updated to add
fib lookup helper function. This fixes selftests/bpf build failure.
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Joe Stringer [Thu, 10 May 2018 22:26:51 +0000 (15:26 -0700)]
selftests/bpf: Fix bash reference in Makefile
'|& ...' is a bash 4.0+ construct which is not guaranteed to be available
when using '$(shell ...)' in a Makefile. Fall back to the more portable
'2>&1 | ...'.
Fixes the following warning during compilation:
/bin/sh: 1: Syntax error: "&" unexpected
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Thu, 10 May 2018 22:10:58 +0000 (00:10 +0200)]
Merge branch 'bpf-fib-lookup-helper'
David Ahern says:
====================
Provide a helper for doing a FIB and neighbor lookup in the kernel
tables from an XDP program. The helper provides a fastpath for forwarding
packets. If the packet is a local delivery or for any reason is not a
simple lookup and forward, the packet is expected to continue up the stack
for full processing.
The response from a FIB and neighbor lookup is either the egress index
with the bpf_fib_lookup struct filled in with dmac and gateway or
0 meaning the packet should continue up the stack. In time we can
revisit this to return the FIB lookup result errno if it is one of the
special RTN_'s such as RTN_BLACKHOLE (-EINVAL) so that the XDP
programs can do an early drop if desired.
Patches 1-6 do some more refactoring to IPv6 with the end goal of
extracting a FIB lookup function that aligns with fib_lookup for IPv4,
basically returning a fib6_info without creating a dst based entry.
Patch 7 adds lookup functions to the ipv6 stub. These are needed since
bpf is built into the kernel and ipv6 may not be built or loaded.
Patch 8 adds the bpf helper and 9 adds a sample program.
v3
- remove ETH_ALEN and in6_addr from uapi header
v2
- removed pkt_access from bpf_func_proto as noticed by Daniel
- added check in that IPv6 forwarding is enabled
- added DaveM's ack on patches 1-7 and 9 based on v1 response and
fact that no changes were made to them in v2
v1
- updated commit messages and cover letter
- added comment to sample program noting lack of verification on
egress device supporting XDP
RFC v2
- fixed use of foward helper from cls_act as noted by Daniel
- in patch 1 rename fib6_lookup_1 as well for consistency
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David Ahern [Thu, 10 May 2018 03:34:27 +0000 (20:34 -0700)]
samples/bpf: Add example of ipv4 and ipv6 forwarding in XDP
Simple example of fast-path forwarding. It has a serious flaw
in not verifying the egress device index supports XDP forwarding.
If the egress device does not packets are dropped.
Take this only as a simple example of fast-path forwarding.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David Ahern [Thu, 10 May 2018 03:34:26 +0000 (20:34 -0700)]
bpf: Provide helper to do forwarding lookups in kernel FIB table
Provide a helper for doing a FIB and neighbor lookup in the kernel
tables from an XDP program. The helper provides a fastpath for forwarding
packets. If the packet is a local delivery or for any reason is not a
simple lookup and forward, the packet continues up the stack.
If it is to be forwarded, the forwarding can be done directly if the
neighbor is already known. If the neighbor does not exist, the first
few packets go up the stack for neighbor resolution. Once resolved, the
xdp program provides the fast path.
On successful lookup the nexthop dmac, current device smac and egress
device index are returned.
The API supports IPv4, IPv6 and MPLS protocols, but only IPv4 and IPv6
are implemented in this patch. The API includes layer 4 parameters if
the XDP program chooses to do deep packet inspection to allow compare
against ACLs implemented as FIB rules.
Header rewrite is left to the XDP program.
The lookup takes 2 flags:
- BPF_FIB_LOOKUP_DIRECT to do a lookup that bypasses FIB rules and goes
straight to the table associated with the device (expert setting for
those looking to maximize throughput)
- BPF_FIB_LOOKUP_OUTPUT to do a lookup from the egress perspective.
Default is an ingress lookup.
Initial performance numbers collected by Jesper, forwarded packets/sec:
Full stack XDP FIB lookup XDP Direct lookup
IPv4 1,947,969 7,074,156 7,415,333
IPv6 1,728,000 6,165,504 7,262,720
These number are single CPU core forwarding on a Broadwell
E5-1650 v4 @ 3.60GHz.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David Ahern [Thu, 10 May 2018 03:34:25 +0000 (20:34 -0700)]
net/ipv6: Add fib lookup stubs for use in bpf helper
Add stubs to retrieve a handle to an IPv6 FIB table, fib6_get_table,
a stub to do a lookup in a specific table, fib6_table_lookup, and
a stub for a full route lookup.
The stubs are needed for core bpf code to handle the case when the
IPv6 module is not builtin.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David Ahern [Thu, 10 May 2018 03:34:24 +0000 (20:34 -0700)]
net/ipv6: Update fib6 tracepoint to take fib6_info
Similar to IPv4, IPv6 should use the FIB lookup result in the
tracepoint.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David Ahern [Thu, 10 May 2018 03:34:23 +0000 (20:34 -0700)]
net/ipv6: Add fib6_lookup
Add IPv6 equivalent to fib_lookup. Does a fib lookup, including rules,
but returns a FIB entry, fib6_info, rather than a dst based rt6_info.
fib6_lookup is any where from 140% (MULTIPLE_TABLES config disabled)
to 60% faster than any of the dst based lookup methods (without custom
rules) and 25% faster with custom rules (e.g., l3mdev rule).
Since the lookup function has a completely different signature,
fib6_rule_action is split into 2 paths: the existing one is
renamed __fib6_rule_action and a new one for the fib6_info path
is added. fib6_rule_action decides which to call based on the
lookup_ptr. If it is fib6_table_lookup then the new path is taken.
Caller must hold rcu lock as no reference is taken on the returned
fib entry.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David Ahern [Thu, 10 May 2018 03:34:22 +0000 (20:34 -0700)]
net/ipv6: Refactor fib6_rule_action
Move source address lookup from fib6_rule_action to a helper. It will be
used in a later patch by a second variant for fib6_rule_action.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David Ahern [Thu, 10 May 2018 03:34:21 +0000 (20:34 -0700)]
net/ipv6: Extract table lookup from ip6_pol_route
ip6_pol_route is used for ingress and egress FIB lookups. Refactor it
moving the table lookup into a separate fib6_table_lookup that can be
invoked separately and export the new function.
ip6_pol_route now calls fib6_table_lookup and uses the result to generate
a dst based rt6_info.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David Ahern [Thu, 10 May 2018 03:34:20 +0000 (20:34 -0700)]
net/ipv6: Rename rt6_multipath_select
Rename rt6_multipath_select to fib6_multipath_select and export it.
A later patch wants access to it similar to IPv4's fib_select_path.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David Ahern [Thu, 10 May 2018 03:34:19 +0000 (20:34 -0700)]
net/ipv6: Rename fib6_lookup to fib6_node_lookup
Rename fib6_lookup to fib6_node_lookup to better reflect what it
returns. The fib6_lookup name will be used in a later patch for
an IPv6 equivalent to IPv4's fib_lookup.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Wang YanQing [Thu, 10 May 2018 03:09:21 +0000 (11:09 +0800)]
bpf, doc: clarification for the meaning of 'id'
For me, as a reader whose mother language isn't English, the
old words bring a little difficulty to catch the meaning, this
patch rewords the subsection in a more clarificatory way.
This patch also add blank lines as separator at two places
to improve readability.
Signed-off-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Sirio Balmelli [Tue, 8 May 2018 13:36:37 +0000 (15:36 +0200)]
selftests/bpf: ignore build products
Update .gitignore files.
Signed-off-by: Sirio Balmelli <sirio@b-ad.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Sirio Balmelli [Tue, 8 May 2018 13:36:12 +0000 (15:36 +0200)]
selftests/bpf: add architecture-agnostic headers
The BPF selftests fail to build with missing headers
'asm/bitsperlong.h' and 'asm/errno.h'.
These already exist in 'tools/arch/[arch]/include';
add architecture-agnostic header files in 'tools/include/uapi'
to reference them.
Signed-off-by: Sirio Balmelli <sirio@b-ad.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Björn Töpel [Mon, 7 May 2018 17:43:50 +0000 (19:43 +0200)]
xsk: fix 64-bit division
i386 builds report:
net/xdp/xdp_umem.o: In function `xdp_umem_reg':
xdp_umem.c:(.text+0x47e): undefined reference to `__udivdi3'
This fix uses div_u64 instead of the GCC built-in.
Fixes:
c0c77d8fb787 ("xsk: add user memory registration support sockopt")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Wed, 9 May 2018 16:04:38 +0000 (18:04 +0200)]
Merge branch 'bpf-nfp-programmable-rss'
Jakub Kicinski says:
====================
This small series adds a feature which extends BPF offload beyond
a pure host processing offload and firmly into the realm of
heterogeneous processing. Allowing offloaded XDP programs to set
the RX queue index opens the door for defining fully programmable
RSS/n-tuple filter replacement. In fact the device datapath will
skip the RSS processing completely if BPF decided on the queue
already, making the XDP program replace part of the standard NIC
datapath.
We hope some day the entire NIC datapath will be defined by BPF :)
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Jakub Kicinski [Wed, 9 May 2018 02:37:07 +0000 (19:37 -0700)]
nfp: bpf: support setting the RX queue index
BPF has access to all internal FW datapath structures. Including
the structure containing RX queue selection. With little coordination
with the datapath we can let the offloaded BPF select the RX queue.
We just need a way to tell the datapath that queue selection has already
been done and it shouldn't overwrite it. Define a bit to tell datapath
BPF already selected a queue (QSEL_SET), if the selected queue is not
enabled (>= number of enabled queues) datapath will perform normal RSS.
BPF queue selection on the NIC can be used to replace standard
datapath RSS with fully programmable BPF/XDP RSS.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Wed, 9 May 2018 02:37:06 +0000 (19:37 -0700)]
bpf: xdp: allow offloads to store into rx_queue_index
It's fairly easy for offloaded XDP programs to select the RX queue
packets go to. We need a way of expressing this in the software.
Allow write to the rx_queue_index field of struct xdp_md for
device-bound programs.
Skip convert_ctx_access callback entirely for offloads.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Wed, 9 May 2018 15:25:15 +0000 (17:25 +0200)]
Merge branch 'bpf-btf-id'
Martin KaFai Lau says:
====================
This series introduces BTF ID which is exposed through
the new BPF_BTF_GET_FD_BY_ID cmd, new "struct bpf_btf_info"
and new members in the "struct bpf_map_info".
Please see individual patch for details.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Fri, 4 May 2018 21:49:55 +0000 (14:49 -0700)]
bpf: btf: Tests for BPF_OBJ_GET_INFO_BY_FD and BPF_BTF_GET_FD_BY_ID
This patch adds test for BPF_BTF_GET_FD_BY_ID and the new
btf_id/btf_key_id/btf_value_id in the "struct bpf_map_info".
It also modifies the existing BPF_OBJ_GET_INFO_BY_FD test
to reflect the new "struct bpf_btf_info".
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Fri, 4 May 2018 21:49:54 +0000 (14:49 -0700)]
bpf: btf: Update tools/include/uapi/linux/btf.h with BTF ID
This patch sync the tools/include/uapi/linux/btf.h with
the newly introduced BTF ID support.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Fri, 4 May 2018 21:49:53 +0000 (14:49 -0700)]
bpf: btf: Some test_btf clean up
This patch adds a CHECK() macro for condition checking
and error report purpose. Something similar to test_progs.c
It also counts the number of tests passed/skipped/failed and
print them at the end of the test run.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Fri, 4 May 2018 21:49:52 +0000 (14:49 -0700)]
bpf: btf: Add struct bpf_btf_info
During BPF_OBJ_GET_INFO_BY_FD on a btf_fd, the current bpf_attr's
info.info is directly filled with the BTF binary data. It is
not extensible. In this case, we want to add BTF ID.
This patch adds "struct bpf_btf_info" which has the BTF ID as
one of its member. The BTF binary data itself is exposed through
the "btf" and "btf_size" members.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Fri, 4 May 2018 21:49:51 +0000 (14:49 -0700)]
bpf: btf: Introduce BTF ID
This patch gives an ID to each loaded BTF. The ID is allocated by
the idr like the existing prog-id and map-id.
The bpf_put(map->btf) is moved to __bpf_map_put() so that the
userspace can stop seeing the BTF ID ASAP when the last BTF
refcnt is gone.
It also makes BTF accessible from userspace through the
1. new BPF_BTF_GET_FD_BY_ID command. It is limited to CAP_SYS_ADMIN
which is inline with the BPF_BTF_LOAD cmd and the existing
BPF_[MAP|PROG]_GET_FD_BY_ID cmd.
2. new btf_id (and btf_key_id + btf_value_id) in "struct bpf_map_info"
Once the BTF ID handler is accessible from userspace, freeing a BTF
object has to go through a rcu period. The BPF_BTF_GET_FD_BY_ID cmd
can then be done under a rcu_read_lock() instead of taking
spin_lock.
[Note: A similar rcu usage can be done to the existing
bpf_prog_get_fd_by_id() in a follow up patch]
When processing the BPF_BTF_GET_FD_BY_ID cmd,
refcount_inc_not_zero() is needed because the BTF object
could be already in the rcu dead row . btf_get() is
removed since its usage is currently limited to btf.c
alone. refcount_inc() is used directly instead.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Fri, 4 May 2018 21:49:50 +0000 (14:49 -0700)]
bpf: btf: Avoid WARN_ON when CONFIG_REFCOUNT_FULL=y
If CONFIG_REFCOUNT_FULL=y, refcount_inc() WARN when refcount is 0.
When creating a new btf, the initial btf->refcnt is 0 and
triggered the following:
[ 34.855452] refcount_t: increment on 0; use-after-free.
[ 34.856252] WARNING: CPU: 6 PID: 1857 at lib/refcount.c:153 refcount_inc+0x26/0x30
....
[ 34.868809] Call Trace:
[ 34.869168] btf_new_fd+0x1af6/0x24d0
[ 34.869645] ? btf_type_seq_show+0x200/0x200
[ 34.870212] ? lock_acquire+0x3b0/0x3b0
[ 34.870726] ? security_capable+0x54/0x90
[ 34.871247] __x64_sys_bpf+0x1b2/0x310
[ 34.871761] ? __ia32_sys_bpf+0x310/0x310
[ 34.872285] ? bad_area_access_error+0x310/0x310
[ 34.872894] do_syscall_64+0x95/0x3f0
This patch uses refcount_set() instead.
Reported-by: Yonghong Song <yhs@fb.com>
Tested-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Fabio Estevam [Mon, 7 May 2018 12:17:51 +0000 (09:17 -0300)]
dt-bindings: dsa: Remove unnecessary #address/#size-cells
If the example binding is used on a real dts file, the following DTC
warning is seen with W=1:
arch/arm/boot/dts/imx6q-b450v3.dtb: Warning (avoid_unnecessary_addr_size): /mdio-gpio/switch@0: unnecessary #address-cells/#size-cells without "ranges" or child "reg" property
Remove unnecessary #address-cells/#size-cells to improve the binding
document examples.
Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Fri, 4 May 2018 15:21:03 +0000 (17:21 +0200)]
net: phy: sfp: handle cases where neither BR, min nor BR, max is given
When computing the bitrate using values read from an SFP module EEPROM,
we use the nominal BR plus BR,min and BR,max to determine the
boundaries. But in some cases BR,min and BR,max aren't provided, which
led the SFP code to end up having the nominal value for both the minimum
and maximum bitrate values. When using a passive cable, the nominal
value should be used as the maximum one, and there is no minimum one
so we should use 0.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 8 May 2018 14:14:22 +0000 (10:14 -0400)]
Merge branch 'bnxt_en-Fixes-for-net-next'
Michael Chan says:
====================
bnxt_en: Fixes for net-next.
This series includes a bug fix for a regression in firmware message polling
introduced recently on net-next. There are 3 additional minor fixes for
unsupported link speed checking, VF MAC address handling, and setting
PHY eeprom length.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Tue, 8 May 2018 07:18:41 +0000 (03:18 -0400)]
bnxt_en: Always forward VF MAC address to the PF.
The current code already forwards the VF MAC address to the PF, except
in one case. If the VF driver gets a valid MAC address from the firmware
during probe time, it will not forward the MAC address to the PF,
incorrectly assuming that the PF already knows the MAC address. This
causes "ip link show" to show zero VF MAC addresses for this case.
This assumption is not correct. Newer firmware remembers the VF MAC
address last used by the VF and provides it to the VF driver during
probe. So we need to always forward the VF MAC address to the PF.
The forwarded MAC address may now be the PF assigned MAC address and so we
need to make sure we approve it for this case.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasundhara Volam [Tue, 8 May 2018 07:18:40 +0000 (03:18 -0400)]
bnxt_en: Read phy eeprom A2h address only when optical diagnostics is supported.
For SFP+ modules, 0xA2 page is available only when Diagnostic Monitoring
Type [Address A0h, Byte 92] is implemented. Extend bnxt_get_module_info(),
to read optical diagnostics support at offset 92(0x5c) and set eeprom_len
length to ETH_MODULE_SFF_8436_LEN (to exclude A2 page), if dianostics is
not supported.
Also in bnxt_get_module_info(), module id is read from offset 0x5e which
is not correct. It was working by accident, as offset was not effective
without setting enables flag in the firmware request. SFP module id is
present at location 0. Fix this by removing the offset and read it
from location 0.
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Tue, 8 May 2018 07:18:39 +0000 (03:18 -0400)]
bnxt_en: Check unsupported speeds in bnxt_update_link() on PF only.
Only non-NPAR PFs need to actively check and manage unsupported link
speeds. NPAR functions and VFs do not control the link speed and
should skip the unsupported speed detection logic, to avoid warning
messages from firmware rejecting the unsupported firmware calls.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Tue, 8 May 2018 07:18:38 +0000 (03:18 -0400)]
bnxt_en: Fix firmware message delay loop regression.
A recent change to reduce delay granularity waiting for firmware
reponse has caused a regression. With a tighter delay loop,
the driver may see the beginning part of the response faster.
The original 5 usec delay to wait for the rest of the message
is not long enough and some messages are detected as invalid.
Increase the maximum wait time from 5 usec to 20 usec. Also, fix
the debug message that shows the total delay time for the response
when the message times out. With the new logic, the delay time
is not fixed per iteration of the loop, so we define a macro to
show the total delay time.
Fixes:
9751e8e71487 ("bnxt_en: reduce timeout on initial HWRM calls")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhao Chen [Mon, 7 May 2018 13:21:57 +0000 (09:21 -0400)]
net-next/hinic: add pci device ids for 25ge and 100ge card
This patch adds PCI device IDs to support 25GE and 100GE card:
1. Add device id 0x0201 for HINIC 100GE dual port card.
2. Add device id 0x0200 for HINIC 25GE dual port card.
3. Macro of device id 0x1822 is modified for HINIC 25GE quad port card.
Signed-off-by: Zhao Chen <zhaochen6@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Mon, 7 May 2018 10:06:03 +0000 (12:06 +0200)]
flow_dissector: do not rely on implicit casts
This change fixes a couple of type mismatch reported by the sparse
tool, explicitly using the requested type for the offending arguments.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 4 May 2018 09:32:59 +0000 (11:32 +0200)]
net: core: rework basic flow dissection helper
When the core networking needs to detect the transport offset in a given
packet and parse it explicitly, a full-blown flow_keys struct is used for
storage.
This patch introduces a smaller keys store, rework the basic flow dissect
helper to use it, and apply this new helper where possible - namely in
skb_probe_transport_header(). The used flow dissector data structures
are renamed to match more closely the new role.
The above gives ~50% performance improvement in micro benchmarking around
skb_probe_transport_header() and ~30% around eth_get_headlen(), mostly due
to the smaller memset. Small, but measurable improvement is measured also
in macro benchmarking.
v1 -> v2: use the new helper in eth_get_headlen() and skb_get_poff(),
as per DaveM suggestion
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 8 May 2018 03:56:32 +0000 (23:56 -0400)]
Merge branch 'master' of git://git./linux/kernel/git/klassert/ipsec-next
Minor conflict in ip_output.c, overlapping changes to
the body of an if() statement.
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 8 May 2018 03:50:28 +0000 (23:50 -0400)]
Merge branch 'ipv6-misc'
Tariq Toukan says:
====================
net/ipv6 misc
This patchset contains two patches for net/ipv6.
Patch 1 is a trivial typo fix in documentation.
Patch 2 by Eran is a re-spin. It adds GRO support for IPv6 GRE tunnel,
this significantly improves performance in case GRO in native interface
is disabled.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eran Ben Elisha [Mon, 7 May 2018 07:45:27 +0000 (10:45 +0300)]
net: ipv6/gre: Add GRO support
Add GRO capability for IPv6 GRE tunnel and ip6erspan tap, via gro_cells
infrastructure.
Performance testing: 55% higher badwidth.
Measuring bandwidth of 1 thread IPv4 TCP traffic over IPv6 GRE tunnel
while GRO on the physical interface is disabled.
CPU: Intel Xeon E312xx (Sandy Bridge)
NIC: Mellanox Technologies MT27700 Family [ConnectX-4]
Before (GRO not working in tunnel) : 2.47 Gbits/sec
After (GRO working in tunnel) : 3.85 Gbits/sec
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
CC: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tariq Toukan [Mon, 7 May 2018 07:45:26 +0000 (10:45 +0300)]
net: ipv6: Fix typo in ipv6_find_hdr() documentation
Fix 'an' into 'and', and use a comma instead of a period.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 8 May 2018 03:46:11 +0000 (23:46 -0400)]
Merge branch 'qed-Add-support-for-new-multi-partitioning-modes'
Sudarsana Reddy Kalluru says:
====================
qed*: Add support for new multi partitioning modes.
The patch series simplifies the multi function (MF) mode implementation of
qed/qede drivers, and adds support for new MF modes.
Please consider applying it to net-next branch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sudarsana Reddy Kalluru [Sun, 6 May 2018 01:43:02 +0000 (18:43 -0700)]
qed: Add support for Unified Fabric Port.
This patch adds driver changes for supporting the Unified Fabric Port
(UFP). This is a new paritioning mode wherein MFW provides the set of
parameters to be used by the device such as traffic class, outer-vlan
tag value, priority type etc. Drivers receives this info via notifications
from mfw and configures the hardware accordingly.
Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sudarsana Reddy Kalluru [Sun, 6 May 2018 01:43:01 +0000 (18:43 -0700)]
qed: Add support for multi function mode with 802.1ad tagging.
The patch adds support for new Multi function mode wherein the traffic
classification is done based on the 802.1ad tagging and the outer vlan tag
provided by the management firmware.
Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sudarsana Reddy Kalluru [Sun, 6 May 2018 01:43:00 +0000 (18:43 -0700)]
qed: Remove unused data member 'is_mf_default'.
The data member 'is_mf_default' is not used by the qed/qede drivers,
removing the same.
Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sudarsana Reddy Kalluru [Sun, 6 May 2018 01:42:59 +0000 (18:42 -0700)]
qed*: Refactor mf_mode to consist of bits.
`mf_mode' field indicates the multi-partitioning mode the device is
configured to. This method doesn't scale very well, adding a new MF mode
requires going over all the existing conditions, and deciding whether those
are needed for the new mode or not.
The patch defines a set of bit-fields for modes which are derived according
to the mode info shared by the MFW and all the configuration would be made
according to those. To add a new mode, there would be a single place where
we'll need to go and choose which bits apply and which don't.
Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sun Lianwen [Sat, 5 May 2018 03:29:16 +0000 (11:29 +0800)]
net/9p: correct the variable name in v9fs_get_trans_by_name() comment
The v9fs_get_trans_by_name(char *s) variable name is not "name" but "s".
Signed-off-by: Sun Lianwen <sunlw.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sun Lianwen [Sat, 5 May 2018 01:08:18 +0000 (09:08 +0800)]
vlan: correct the file path in vlan_dev_change_flags() comment
The vlan_flags enum is defined in include/uapi/linux/if_vlan.h file.
not in include/linux/if_vlan.h file.
Signed-off-by: Sun Lianwen <sunlw.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 8 May 2018 03:35:08 +0000 (23:35 -0400)]
Merge git://git./linux/kernel/git/bpf/bpf-next
Minor conflict, a CHECK was placed into an if() statement
in net-next, whilst a newline was added to that CHECK
call in 'net'. Thanks to Daniel for the merge resolution.
Signed-off-by: David S. Miller <davem@davemloft.net>
Weilin Chang [Fri, 4 May 2018 18:07:19 +0000 (11:07 -0700)]
liquidio: support use of ethtool to set link speed of CN23XX-225 cards
Support setting the link speed of CN23XX-225 cards (which can do 25Gbps or
10Gbps) via ethtool_ops.set_link_ksettings.
Also fix the function assigned to ethtool_ops.get_link_ksettings to use the
new link_ksettings api completely (instead of partially via
ethtool_convert_legacy_u32_to_link_mode).
Signed-off-by: Weilin Chang <weilin.chang@cavium.com>
Acked-by: Raghu Vatsavayi <raghu.vatsavayi@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 8 May 2018 03:25:25 +0000 (23:25 -0400)]
Merge branch '3c59x-patches-and-the-removal-of-an-unused-function'
Sebastian Andrzej Siewior says:
====================
3c59x patches and the removal of an unused function
The first patch removes an unused function. The goal of remaining three
patches is to get rid of the local_irq_save() usage in the driver which
benefits -RT.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Anna-Maria Gleixner [Fri, 4 May 2018 15:17:49 +0000 (17:17 +0200)]
net: 3com: 3c59x: irq save variant of ISR
When vortex_boomerang_interrupt() is invoked from vortex_tx_timeout() or
poll_vortex() interrupts must be disabled. This detaches the interrupt
disable logic from locking which requires patching for PREEMPT_RT.
The advantage of avoiding spin_lock_irqsave() in the interrupt handler is
minimal, but converting it removes all the extra code for callers which
come not from interrupt context.
Cc: Steffen Klassert <klassert@mathematik.tu-chemnitz.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Anna-Maria Gleixner [Fri, 4 May 2018 15:17:48 +0000 (17:17 +0200)]
net: 3com: 3c59x: Pull locking out of ISR
Locking is done in the same way in _vortex_interrupt() and
_boomerang_interrupt(). To prevent duplication, move the locking into the
calling vortex_boomerang_interrupt() function.
No functional change.
Cc: Steffen Klassert <klassert@mathematik.tu-chemnitz.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Anna-Maria Gleixner [Fri, 4 May 2018 15:17:47 +0000 (17:17 +0200)]
net: 3com: 3c59x: Move boomerang/vortex conditional into function
If vp->full_bus_master_tx is set, vp->full_bus_master_rx is set as well
(see vortex_probe1()). Therefore the conditionals for the decision if
boomerang or vortex ISR is executed have the same result. Instead of
repeating the explicit conditional execution of the boomerang/vortex ISR,
move it into an own function.
No functional change.
Cc: Steffen Klassert <klassert@mathematik.tu-chemnitz.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Anna-Maria Gleixner [Fri, 4 May 2018 15:17:46 +0000 (17:17 +0200)]
net: u64_stats_sync: Remove functions without user
Commit
67db3e4bfbc9 ("tcp: no longer hold ehash lock while calling
tcp_get_info()") removes the only users of u64_stats_update_end/begin_raw()
without removing the function in header file.
Remove no longer used functions.
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Anders Roxell [Fri, 4 May 2018 09:17:25 +0000 (11:17 +0200)]
selftests: net: add udpgso* to TEST_GEN_FILES
The generated files udpgso* shouldn't be part of TEST_PROGS, they are
used by udpgso.sh and udpgsp_bench.sh. They should be added to the
TEST_GEN_FILES to get installed without being added to the main
run_kselftest.sh script.
Fixes:
3a687bef148d ("selftests: udp gso benchmark")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 7 May 2018 01:51:37 +0000 (21:51 -0400)]
Merge git://git./linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next
tree, more relevant updates in this batch are:
1) Add Maglev support to IPVS. Moreover, store lastest server weight in
IPVS since this is needed by maglev, patches from from Inju Song.
2) Preparation works to add iptables flowtable support, patches
from Felix Fietkau.
3) Hand over flows back to conntrack slow path in case of TCP RST/FIN
packet is seen via new teardown state, also from Felix.
4) Add support for extended netlink error reporting for nf_tables.
5) Support for larger timeouts that 23 days in nf_tables, patch from
Florian Westphal.
6) Always set an upper limit to dynamic sets, also from Florian.
7) Allow number generator to make map lookups, from Laura Garcia.
8) Use hash_32() instead of opencode hashing in IPVS, from Vicent Bernat.
9) Extend ip6tables SRH match to support previous, next and last SID,
from Ahmed Abdelsalam.
10) Move Passive OS fingerprint nf_osf.c, from Fernando Fernandez.
11) Expose nf_conntrack_max through ctnetlink, from Florent Fourcot.
12) Several housekeeping patches for xt_NFLOG, x_tables and ebtables,
from Taehee Yoo.
13) Unify meta bridge with core nft_meta, then make nft_meta built-in.
Make rt and exthdr built-in too, again from Florian.
14) Missing initialization of tbl->entries in IPVS, from Cong Wang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Fri, 27 Apr 2018 20:37:43 +0000 (22:37 +0200)]
netfilter: nft_dynset: fix timeout updates on 32bit
This must now use a 64bit jiffies value, else we set
a bogus timeout on 32bit.
Fixes:
8e1102d5a1596 ("netfilter: nf_tables: support timeouts larger than 23 days")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florent Fourcot [Sun, 6 May 2018 14:30:14 +0000 (16:30 +0200)]
netfilter: ctnetlink: export nf_conntrack_max
IPCTNL_MSG_CT_GET_STATS netlink command allow to monitor current number
of conntrack entries. However, if one wants to compare it with the
maximum (and detect exhaustion), the only solution is currently to read
sysctl value.
This patch add nf_conntrack_max value in netlink message, and simplify
monitoring for application built on netlink API.
Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fernando Fernandez Mancera [Thu, 3 May 2018 12:05:40 +0000 (14:05 +0200)]
netfilter: extract Passive OS fingerprint infrastructure from xt_osf
Add nf_osf_ttl() and nf_osf_match() into nf_osf.c to prepare for
nf_tables support.
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Phil Sutter [Fri, 27 Apr 2018 10:47:01 +0000 (12:47 +0200)]
netfilter: nf_tables: Provide NFT_{RT,CT}_MAX for userspace
These macros allow conveniently declaring arrays which use NFT_{RT,CT}_*
values as indexes.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Thu, 26 Apr 2018 15:42:15 +0000 (17:42 +0200)]
netfilter: nf_nat: remove unused ct arg from lookup functions
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Ahmed Abdelsalam [Wed, 25 Apr 2018 10:30:24 +0000 (05:30 -0500)]
netfilter: ip6t_srh: extend SRH matching for previous, next and last SID
IPv6 Segment Routing Header (SRH) contains a list of SIDs to be crossed
by SR encapsulated packet. Each SID is encoded as an IPv6 prefix.
When a Firewall receives an SR encapsulated packet, it should be able
to identify which node previously processed the packet (previous SID),
which node is going to process the packet next (next SID), and which
node is the last to process the packet (last SID) which represent the
final destination of the packet in case of inline SR mode.
An example use-case of using these features could be SID list that
includes two firewalls. When the second firewall receives a packet,
it can check whether the packet has been processed by the first firewall
or not. Based on that check, it decides to apply all rules, apply just
subset of the rules, or totally skip all rules and forward the packet to
the next SID.
This patch extends SRH match to support matching previous SID, next SID,
and last SID.
Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Laura Garcia Liebana [Mon, 23 Apr 2018 10:48:07 +0000 (12:48 +0200)]
netfilter: nft_numgen: enable hashing of one element
The modulus in the hash function was limited to > 1 as initially
there was no sense to create a hashing of just one element.
Nevertheless, there are certain cases specially for load balancing
where this case needs to be addressed.
This patch fixes the following error.
Error: Could not process rule: Numerical result out of range
add rule ip nftlb lb01 dnat to jhash ip saddr mod 1 map { 0: 192.168.0.10 }
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The solution comes to force the hash to 0 when the modulus is 1.
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Laura Garcia Liebana [Sun, 22 Apr 2018 09:03:23 +0000 (11:03 +0200)]
netfilter: nft_numgen: add map lookups for numgen statements
This patch includes a new attribute in the numgen structure to allow
the lookup of an element based on the number generator as a key.
For this purpose, different ops have been included to extend the
current numgen inc functions.
Currently, only supported for numgen incremental operations, but
it will be supported for random in a follow-up patch.
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
David Ahern [Fri, 4 May 2018 20:54:24 +0000 (13:54 -0700)]
net/ipv6: rename rt6_next to fib6_next
This slipped through the cracks in the followup set to the fib6_info flip.
Rename rt6_next to fib6_next.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Fri, 4 May 2018 14:27:53 +0000 (16:27 +0200)]
bpf, xskmap: fix crash in xsk_map_alloc error path handling
If bpf_map_precharge_memlock() did not fail, then we set err to zero.
However, any subsequent failure from either alloc_percpu() or the
bpf_map_area_alloc() will return ERR_PTR(0) which in find_and_alloc_map()
will cause NULL pointer deref.
In devmap we have the convention that we return -EINVAL on page count
overflow, so keep the same logic here and just set err to -ENOMEM
after successful bpf_map_precharge_memlock().
Fixes:
fbfc504a24f5 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Björn Töpel <bjorn.topel@intel.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Fri, 4 May 2018 21:41:05 +0000 (23:41 +0200)]
Merge branch 'bpf-event-output-offload'
Jakub Kicinski says:
====================
This series centres on NFP offload of bpf_event_output(). The
first patch allows perf event arrays to be used by offloaded
programs. Next patch makes the nfp driver keep track of such
arrays to be able to filter FW events referring to maps.
Perf event arrays are not device bound. Having driver
reimplement and manage the perf array seems brittle and unnecessary.
Patch 4 moves slightly the verifier step which replaces map fds
with map pointers. This is useful for nfp JIT since we can then
easily replace host pointers with NFP table ids (patch 6). This
allows us to lift the limitation on map helpers having to be used
with the same map pointer on all paths. Second use of replacing
fds with real host map pointers is that we can use the host map
pointer as a key for FW events in perf event array offload.
Patch 5 adds perf event output offload support for the NFP.
There are some differences between bpf_event_output() offloaded
and non-offloaded version. The FW messages which carry events
may get dropped and reordered relatively easily. The return codes
from the helper are also not guaranteed to match the host. Users
are warned about some of those discrepancies with a one time
warning message to kernel logs.
bpftool gains an ability to dump perf ring events in a very simple
format. This was very useful for testing and simple debug, maybe
it will be useful to others?
Last patch is a trivial comment fix.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Fri, 4 May 2018 01:37:17 +0000 (18:37 -0700)]
bpf: fix references to free_bpf_prog_info() in comments
Comments in the verifier refer to free_bpf_prog_info() which
seems to have never existed in tree. Replace it with
free_used_maps().
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Fri, 4 May 2018 01:37:16 +0000 (18:37 -0700)]
tools: bpftool: add simple perf event output reader
Users of BPF sooner or later discover perf_event_output() helpers
and BPF_MAP_TYPE_PERF_EVENT_ARRAY. Dumping this array type is
not possible, however, we can add simple reading of perf events.
Create a new event_pipe subcommand for maps, this sub command
will only work with BPF_MAP_TYPE_PERF_EVENT_ARRAY maps.
Parts of the code from samples/bpf/trace_output_user.c.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Fri, 4 May 2018 01:37:15 +0000 (18:37 -0700)]
tools: bpftool: move get_possible_cpus() to common code
Move the get_possible_cpus() function to shared code. No functional
changes.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Fri, 4 May 2018 01:37:14 +0000 (18:37 -0700)]
tools: bpftool: fold hex keyword in command help
Instead of spelling [hex] BYTES everywhere use DATA as keyword
for generalized value. This will help us keep the messages
concise when longer command are added in the future. It will
also be useful once BTF support comes. We will only have to
change the definition of DATA.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Fri, 4 May 2018 01:37:13 +0000 (18:37 -0700)]
nfp: bpf: rewrite map pointers with NFP TIDs
Kernel will now replace map fds with actual pointer before
calling the offload prepare. We can identify those pointers
and replace them with NFP table IDs instead of loading the
table ID in code generated for CALL instruction.
This allows us to support having the same CALL being used with
different maps.
Since we don't want to change the FW ABI we still need to
move the TID from R1 to portion of R0 before the jump.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Fri, 4 May 2018 01:37:12 +0000 (18:37 -0700)]
nfp: bpf: perf event output helpers support
Add support for the perf_event_output family of helpers.
The implementation on the NFP will not match the host code exactly.
The state of the host map and rings is unknown to the device, hence
device can't return errors when rings are not installed. The device
simply packs the data into a firmware notification message and sends
it over to the host, returning success to the program.
There is no notion of a host CPU on the device when packets are being
processed. Device will only offload programs which set BPF_F_CURRENT_CPU.
Still, if map index doesn't match CPU no error will be returned (see
above).
Dropped/lost firmware notification messages will not cause "lost
events" event on the perf ring, they are only visible via device
error counters.
Firmware notification messages may also get reordered in respect
to the packets which caused their generation.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Fri, 4 May 2018 01:37:11 +0000 (18:37 -0700)]
bpf: replace map pointer loads before calling into offloads
Offloads may find host map pointers more useful than map fds.
Map pointers can be used to identify the map, while fds are
only valid within the context of loading process.
Jump to skip_full_check on error in case verifier log overflow
has to be handled (replace_map_fd_with_map_ptr() prints to the
log, driver prep may do that too in the future).
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Fri, 4 May 2018 01:37:10 +0000 (18:37 -0700)]
bpf: export bpf_event_output()
bpf_event_output() is useful for offloads to add events to BPF
event rings, export it. Note that export is placed near the stub
since tracing is optional and kernel/bpf/core.c is always going
to be built.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Fri, 4 May 2018 01:37:09 +0000 (18:37 -0700)]
nfp: bpf: record offload neutral maps in the driver
For asynchronous events originating from the device, like perf event
output, we need to be able to make sure that objects being referred
to by the FW message are valid on the host. FW events can get queued
and reordered. Even if we had a FW message "barrier" we should still
protect ourselves from bogus FW output.
Add a reverse-mapping hash table and record in it all raw map pointers
FW may refer to. Only record neutral maps, i.e. perf event arrays.
These are currently the only objects FW can refer to. Use RCU protection
on the read side, update side is under RTNL.
Since program vs map destruction order is slightly painful for offload
simply take an extra reference on all the recorded maps to make sure
they don't disappear.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Fri, 4 May 2018 01:37:08 +0000 (18:37 -0700)]
bpf: offload: allow offloaded programs to use perf event arrays
BPF_MAP_TYPE_PERF_EVENT_ARRAY is special as far as offload goes.
The map only holds glue to perf ring, not actual data. Allow
non-offloaded perf event arrays to be used in offloaded programs.
Offload driver can extract the events from HW and put them in
the map for user space to retrieve.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Fri, 4 May 2018 09:58:38 +0000 (11:58 +0200)]
Merge branch 'bpf-subprog-mgmt-cleanup'
Jiong Wang says:
====================
This patch set clean up some code logic related with managing subprog
information.
Part of the set are inspried by Edwin's code in his RFC:
"bpf/verifier: subprog/func_call simplifications"
but with clearer separation so it could be easier to review.
- Path 1 unifies main prog and subprogs. All of them are registered in
env->subprog_starts.
- After patch 1, it is clear that subprog_starts and subprog_stack_depth
could be merged as both of them now have main and subprog unified.
Patch 2 therefore does the merge, all subprog information are centred
at bpf_subprog_info.
- Patch 3 goes further to introduce a new fake "exit" subprog which
serves as an ending marker to the subprog list. We could then turn the
following code snippets across verifier:
if (env->subprog_cnt == cur_subprog + 1)
subprog_end = insn_cnt;
else
subprog_end = env->subprog_info[cur_subprog + 1].start;
into:
subprog_end = env->subprog_info[cur_subprog + 1].start;
There is no functional change by this patch set.
No bpf selftest (both non-jit and jit) regression found after this set.
v2:
- fixed adjust_subprog_starts to also update fake "exit" subprog start.
- for John's suggestion on renaming subprog to prog, I could work on
a follow-up patch if it is recognized as worth the change.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Eric Dumazet [Thu, 19 Apr 2018 15:49:29 +0000 (08:49 -0700)]
net/mlx4_en: optimizes get_fixed_ipv6_csum()
While trying to support CHECKSUM_COMPLETE for IPV6 fragments,
I had to experiments various hacks in get_fixed_ipv6_csum().
I must admit I could not find how to implement this :/
However, get_fixed_ipv6_csum() does a lot of redundant operations,
calling csum_partial() twice.
First csum_partial() computes the checksum of saddr and daddr,
put in @csum_pseudo_hdr. Undone later in the second csum_partial()
computed on whole ipv6 header.
Then nexthdr is added once, added a second time, then substracted.
payload_len is added once, then substracted.
Really all this can be reduced to two add_csum(), to add back 6 bytes
that were removed by mlx4 when providing hw_checksum in RX descriptor.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 4 May 2018 15:45:12 +0000 (11:45 -0400)]
Merge branch 'smc-splice-implementation'
Ursula Braun says:
====================
net/smc: splice implementation
Stefan comes up with an smc implementation for splice(). The first
three patches are preparational patches, the 4th patch implements
splice().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Raspl [Thu, 3 May 2018 16:12:39 +0000 (18:12 +0200)]
smc: add support for splice()
Provide an implementation for splice() when we are using SMC. See
smc_splice_read() for further details.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com><
Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Raspl [Thu, 3 May 2018 16:12:38 +0000 (18:12 +0200)]
smc: allocate RMBs as compound pages
Preparatory work for splice() support.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com><
Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Raspl [Thu, 3 May 2018 16:12:37 +0000 (18:12 +0200)]
smc: make smc_rx_wait_data() generic
Turn smc_rx_wait_data into a generic function that can be used at various
instances to wait on traffic to complete with varying criteria.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com><
Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Raspl [Thu, 3 May 2018 16:12:36 +0000 (18:12 +0200)]
smc: simplify abort logic
Some of the conditions to exit recv() are common in two pathes - cleaning up
code by moving the check up so we have it only once.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com><
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 4 May 2018 13:58:56 +0000 (09:58 -0400)]
Merge git://git./linux/kernel/git/davem/net
Overlapping changes in selftests Makefile.
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 4 May 2018 13:11:50 +0000 (09:11 -0400)]
Merge branch 'sh_eth-complain-on-access-to-unimplemented-TSU-registers'
Sergei Shtylyov says:
====================
sh_eth: complain on access to unimplemented TSU registers
Here's a set of 2 patches against DaveM's 'net-next.git' repo. The 1st patch
routes TSU_POST<n> register accesses thru sh_eth_tsu_{read|write}() and the 2nd
added WARN_ON() unimplemented register to those functions. I'm going to deal with
TSU_ADR{H|L}<n> registers in a later series...
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Wed, 2 May 2018 19:55:52 +0000 (22:55 +0300)]
sh_eth: WARN_ON() access to unimplemented TSU register
Commit
3365711df024 ("sh_eth: WARN on access to a register not implemented
in a particular chip") added WARN_ON() to sh_eth_{read|write}() but not
to sh_eth_tsu_{read|write}(). Now that we've routed almost all TSU register
accesses (except TSU_ADR{H|L}<n> -- which are special) thru the latter
pair of accessors, it makes sense to check for the unimplemented TSU
registers as well...
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Wed, 2 May 2018 19:54:48 +0000 (22:54 +0300)]
sh_eth: use TSU register accessors for TSU_POST<n>
There's no particularly good reason TSU_POST<n> registers get accessed
circumventing sh_eth_tsu_{read|write}() -- start using those, removing
(badly named) sh_eth_tsu_get_post_reg_offset(), while at it...
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiong Wang [Wed, 2 May 2018 20:17:19 +0000 (16:17 -0400)]
bpf: add faked "ending" subprog
There are quite a few code snippet like the following in verifier:
subprog_start = 0;
if (env->subprog_cnt == cur_subprog + 1)
subprog_end = insn_cnt;
else
subprog_end = env->subprog_info[cur_subprog + 1].start;
The reason is there is no marker in subprog_info array to tell the end of
it.
We could resolve this issue by introducing a faked "ending" subprog.
The special "ending" subprog is with "insn_cnt" as start offset, so it is
serving as the end mark whenever we iterate over all subprogs.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>