platform/kernel/linux-rpi.git
3 years agoselftests/bpf: fix bpf_testmod.ko recompilation logic
Andrii Nakryiko [Fri, 11 Dec 2020 01:59:46 +0000 (17:59 -0800)]
selftests/bpf: fix bpf_testmod.ko recompilation logic

bpf_testmod.ko build rule declared dependency on VMLINUX_BTF, but the variable
itself was initialized after the rule was declared, which often caused
bpf_testmod.ko to not be re-compiled. Fix by moving VMLINUX_BTF determination
sooner.

Also enforce bpf_testmod.ko recompilation when we detect that vmlinux image
changed by removing bpf_testmod/bpf_testmod.ko. This is necessary to generate
correct module's split BTF. Without it, Kbuild's module build logic might
determine that nothing changed on the kernel side and thus bpf_testmod.ko
shouldn't be rebuilt, so won't re-generate module BTF, which often leads to
module's BTF with wrong string offsets against vmlinux BTF. Removing .ko file
forces Kbuild to re-build the module.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Fixes: 9f7fa225894c ("selftests/bpf: Add bpf_testmod kernel module for testing")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20201211015946.4062098-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 years agosamples/bpf: Fix possible hang in xdpsock with multiple threads
Magnus Karlsson [Thu, 10 Dec 2020 16:34:07 +0000 (17:34 +0100)]
samples/bpf: Fix possible hang in xdpsock with multiple threads

Fix a possible hang in xdpsock that can occur when using multiple
threads. In this case, one or more of the threads might get stuck in
the while-loop in tx_only after the user has signaled the main thread
to stop execution. In this case, no more Tx packets will be sent, so a
thread might get stuck in the aforementioned while-loop. Fix this by
introducing a test inside the while-loop to check if the benchmark has
been terminated. If so, return from the function.

Fixes: cd9e72b6f210 ("samples/bpf: xdpsock: Add option to specify batch size")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201210163407.22066-1-magnus.karlsson@gmail.com
3 years agoselftests/bpf: Make selftest compilation work on clang 11
Jiri Olsa [Wed, 9 Dec 2020 14:29:12 +0000 (15:29 +0100)]
selftests/bpf: Make selftest compilation work on clang 11

We can't compile test_core_reloc_module.c selftest with clang 11, compile
fails with:

  CLNG-LLC [test_maps] test_core_reloc_module.o
  progs/test_core_reloc_module.c:57:21: error: use of unknown builtin \
  '__builtin_preserve_type_info' [-Wimplicit-function-declaration]
   out->read_ctx_sz = bpf_core_type_size(struct bpf_testmod_test_read_ctx);

Skipping these tests if __builtin_preserve_type_info() is not supported
by compiler.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201209142912.99145-1-jolsa@kernel.org
3 years agoselftests/bpf: Xsk selftests - adding xdpxceiver to .gitignore
Weqaar Janjua [Thu, 10 Dec 2020 11:54:35 +0000 (11:54 +0000)]
selftests/bpf: Xsk selftests - adding xdpxceiver to .gitignore

This patch adds *xdpxceiver* to selftests/bpf/.gitignore

Reported-by: Yonghong Song <yhs@fb.com>
Suggested-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Weqaar Janjua <weqaar.a.janjua@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201210115435.3995-1-weqaar.a.janjua@intel.com
3 years agoselftests/bpf: Drop tcp-{client,server}.py from Makefile
Veronika Kabatova [Thu, 10 Dec 2020 12:01:34 +0000 (13:01 +0100)]
selftests/bpf: Drop tcp-{client,server}.py from Makefile

The files don't exist anymore so this breaks generic kselftest builds
when using "make install" or "make gen_tar".

Fixes: 247f0ec361b7 ("selftests/bpf: Drop python client/server in favor of threads")
Signed-off-by: Veronika Kabatova <vkabatov@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201210120134.2148482-1-vkabatov@redhat.com
3 years agoMerge branch 'bpf-xsk-selftests'
Daniel Borkmann [Wed, 9 Dec 2020 15:44:45 +0000 (16:44 +0100)]
Merge branch 'bpf-xsk-selftests'

Weqaar Janjua says:

====================
This patch set adds AF_XDP selftests based on veth to selftests/bpf.

 # Topology:
 # ---------
 #                 -----------
 #               _ | Process | _
 #              /  -----------  \
 #             /        |        \
 #            /         |         \
 #      -----------     |     -----------
 #      | Thread1 |     |     | Thread2 |
 #      -----------     |     -----------
 #           |          |          |
 #      -----------     |     -----------
 #      |  xskX   |     |     |  xskY   |
 #      -----------     |     -----------
 #           |          |          |
 #      -----------     |     ----------
 #      |  vethX  | --------- |  vethY |
 #      -----------   peer    ----------
 #           |          |          |
 #      namespaceX      |     namespaceY

These selftests test AF_XDP SKB and Native/DRV modes using veth Virtual
Ethernet interfaces.

The test program contains two threads, each thread is single socket with
a unique UMEM. It validates in-order packet delivery and packet content
by sending packets to each other.

Prerequisites setup by script test_xsk.sh:

   Set up veth interfaces as per the topology shown ^^:
   * setup two veth interfaces and one namespace
   ** veth<xxxx> in root namespace
   ** veth<yyyy> in af_xdp<xxxx> namespace
   ** namespace af_xdp<xxxx>
   * create a spec file veth.spec that includes this run-time configuration
   *** xxxx and yyyy are randomly generated 4 digit numbers used to avoid
       conflict with any existing interface

   Adds xsk framework test to validate veth xdp DRV and SKB modes.

The following tests are provided:

1. AF_XDP SKB mode
   Generic mode XDP is driver independent, used when the driver does
   not have support for XDP. Works on any netdevice using sockets and
   generic XDP path. XDP hook from netif_receive_skb().
   a. nopoll - soft-irq processing
   b. poll - using poll() syscall
   c. Socket Teardown
      Create a Tx and a Rx socket, Tx from one socket, Rx on another.
      Destroy both sockets, then repeat multiple times. Only nopoll mode
  is used
   d. Bi-directional Sockets
      Configure sockets as bi-directional tx/rx sockets, sets up fill
  and completion rings on each socket, tx/rx in both directions.
  Only nopoll mode is used

2. AF_XDP DRV/Native mode
   Works on any netdevice with XDP_REDIRECT support, driver dependent.
   Processes packets before SKB allocation. Provides better performance
   than SKB. Driver hook available just after DMA of buffer descriptor.
   a. nopoll
   b. poll
   c. Socket Teardown
   d. Bi-directional Sockets
   * Only copy mode is supported because veth does not currently support
     zero-copy mode

Total tests: 8

Flow:
* Single process spawns two threads: Tx and Rx
* Each of these two threads attach to a veth interface within their
  assigned namespaces
* Each thread creates one AF_XDP socket connected to a unique umem
  for each veth interface
* Tx thread transmits 10k packets from veth<xxxx> to veth<yyyy>
* Rx thread verifies if all 10k packets were received and delivered
  in-order, and have the right content

v2 changes:
* Move selftests/xsk to selftests/bpf
* Remove Makefiles under selftests/xsk, and utilize selftests/bpf/Makefile
v3 changes:
* merge all test scripts test_xsk_*.sh into test_xsk.sh
v4 changes:
* merge xsk_env.sh into xsk_prereqs.sh
* test_xsk.sh add cliarg -c for color-coded output
* test_xsk.sh PREREQUISITES disables IPv6 on veth interfaces
* test_xsk.sh PREREQUISITES adds xsk framework test
* test_xsk.sh is independently executable
* xdpxceiver.c Tx/Rx validates only IPv4 packets with TOS 0x9, ignores
  others
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
3 years agoselftests/bpf: Xsk selftests - Bi-directional Sockets - SKB, DRV
Weqaar Janjua [Mon, 7 Dec 2020 21:53:33 +0000 (21:53 +0000)]
selftests/bpf: Xsk selftests - Bi-directional Sockets - SKB, DRV

Adds following tests:

1. AF_XDP SKB mode
   d. Bi-directional Sockets
      Configure sockets as bi-directional tx/rx sockets, sets up fill
      and completion rings on each socket, tx/rx in both directions.
      Only nopoll mode is used

2. AF_XDP DRV/Native mode
   d. Bi-directional Sockets
   * Only copy mode is supported because veth does not currently support
     zero-copy mode

Signed-off-by: Weqaar Janjua <weqaar.a.janjua@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Yonghong Song <yhs@fb.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201207215333.11586-6-weqaar.a.janjua@intel.com
3 years agoselftests/bpf: Xsk selftests - Socket Teardown - SKB, DRV
Weqaar Janjua [Mon, 7 Dec 2020 21:53:32 +0000 (21:53 +0000)]
selftests/bpf: Xsk selftests - Socket Teardown - SKB, DRV

Adds following tests:

1. AF_XDP SKB mode
   c. Socket Teardown
      Create a Tx and a Rx socket, Tx from one socket, Rx on another.
      Destroy both sockets, then repeat multiple times. Only nopoll mode
      is used

2. AF_XDP DRV/Native mode
   c. Socket Teardown
   * Only copy mode is supported because veth does not currently support
     zero-copy mode

Signed-off-by: Weqaar Janjua <weqaar.a.janjua@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Yonghong Song <yhs@fb.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201207215333.11586-5-weqaar.a.janjua@intel.com
3 years agoselftests/bpf: Xsk selftests - DRV POLL, NOPOLL
Weqaar Janjua [Mon, 7 Dec 2020 21:53:31 +0000 (21:53 +0000)]
selftests/bpf: Xsk selftests - DRV POLL, NOPOLL

Adds following tests:

2. AF_XDP DRV/Native mode
   Works on any netdevice with XDP_REDIRECT support, driver dependent.
   Processes packets before SKB allocation. Provides better performance
   than SKB. Driver hook available just after DMA of buffer descriptor.
   a. nopoll
   b. poll
   * Only copy mode is supported because veth does not currently support
     zero-copy mode

Signed-off-by: Weqaar Janjua <weqaar.a.janjua@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Yonghong Song <yhs@fb.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201207215333.11586-4-weqaar.a.janjua@intel.com
3 years agoselftests/bpf: Xsk selftests - SKB POLL, NOPOLL
Weqaar Janjua [Mon, 7 Dec 2020 21:53:30 +0000 (21:53 +0000)]
selftests/bpf: Xsk selftests - SKB POLL, NOPOLL

Adds following tests:

1. AF_XDP SKB mode
   Generic mode XDP is driver independent, used when the driver does
   not have support for XDP. Works on any netdevice using sockets and
   generic XDP path. XDP hook from netif_receive_skb().
   a. nopoll - soft-irq processing
   b. poll - using poll() syscall

Signed-off-by: Weqaar Janjua <weqaar.a.janjua@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Yonghong Song <yhs@fb.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201207215333.11586-3-weqaar.a.janjua@intel.com
3 years agoselftests/bpf: Xsk selftests framework
Weqaar Janjua [Mon, 7 Dec 2020 21:53:29 +0000 (21:53 +0000)]
selftests/bpf: Xsk selftests framework

This patch adds AF_XDP selftests framework under selftests/bpf.

Topology:
---------
     -----------           -----------
     |  xskX   | --------- |  xskY   |
     -----------     |     -----------
          |          |          |
     -----------     |     ----------
     |  vethX  | --------- |  vethY |
     -----------   peer    ----------
          |          |          |
     namespaceX      |     namespaceY

Prerequisites setup by script test_xsk.sh:

   Set up veth interfaces as per the topology shown ^^:
   * setup two veth interfaces and one namespace
   ** veth<xxxx> in root namespace
   ** veth<yyyy> in af_xdp<xxxx> namespace
   ** namespace af_xdp<xxxx>
   * create a spec file veth.spec that includes this run-time configuration
   *** xxxx and yyyy are randomly generated 4 digit numbers used to avoid
       conflict with any existing interface
   * tests the veth and xsk layers of the topology

Signed-off-by: Weqaar Janjua <weqaar.a.janjua@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Yonghong Song <yhs@fb.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201207215333.11586-2-weqaar.a.janjua@intel.com
3 years agobpf: Only provide bpf_sock_from_file with CONFIG_NET
Florent Revest [Tue, 8 Dec 2020 17:36:23 +0000 (18:36 +0100)]
bpf: Only provide bpf_sock_from_file with CONFIG_NET

This moves the bpf_sock_from_file definition into net/core/filter.c
which only gets compiled with CONFIG_NET and also moves the helper proto
usage next to other tracing helpers that are conditional on CONFIG_NET.

This avoids
  ld: kernel/trace/bpf_trace.o: in function `bpf_sock_from_file':
  bpf_trace.c:(.text+0xe23): undefined reference to `sock_from_file'
When compiling a kernel with BPF and without NET.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/bpf/20201208173623.1136863-1-revest@chromium.org
3 years agobpf: Return -ENOTSUPP when attaching to non-kernel BTF
Andrii Nakryiko [Tue, 8 Dec 2020 06:43:26 +0000 (22:43 -0800)]
bpf: Return -ENOTSUPP when attaching to non-kernel BTF

Return -ENOTSUPP if tracing BPF program is attempted to be attached with
specified attach_btf_obj_fd pointing to non-kernel (neither vmlinux nor
module) BTF object. This scenario might be supported in the future and isn't
outright invalid, so -EINVAL isn't the most appropriate error code.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201208064326.667389-1-andrii@kernel.org
3 years agoxsk: Validate socket state in xsk_recvmsg, prior touching socket members
Björn Töpel [Mon, 7 Dec 2020 08:20:08 +0000 (09:20 +0100)]
xsk: Validate socket state in xsk_recvmsg, prior touching socket members

In AF_XDP the socket state needs to be checked, prior touching the
members of the socket. This was not the case for the recvmsg
implementation. Fix that by moving the xsk_is_bound() call.

Fixes: 45a86681844e ("xsk: Add support for recvmsg()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20201207082008.132263-1-bjorn.topel@gmail.com
3 years agobpf: Propagate __user annotations properly
Lukas Bulwahn [Mon, 7 Dec 2020 12:37:20 +0000 (13:37 +0100)]
bpf: Propagate __user annotations properly

__htab_map_lookup_and_delete_batch() stores a user pointer in the local
variable ubatch and uses that in copy_{from,to}_user(), but ubatch misses a
__user annotation.

So, sparse warns in the various assignments and uses of ubatch:

  kernel/bpf/hashtab.c:1415:24: warning: incorrect type in initializer
    (different address spaces)
  kernel/bpf/hashtab.c:1415:24:    expected void *ubatch
  kernel/bpf/hashtab.c:1415:24:    got void [noderef] __user *

  kernel/bpf/hashtab.c:1444:46: warning: incorrect type in argument 2
    (different address spaces)
  kernel/bpf/hashtab.c:1444:46:    expected void const [noderef] __user *from
  kernel/bpf/hashtab.c:1444:46:    got void *ubatch

  kernel/bpf/hashtab.c:1608:16: warning: incorrect type in assignment
    (different address spaces)
  kernel/bpf/hashtab.c:1608:16:    expected void *ubatch
  kernel/bpf/hashtab.c:1608:16:    got void [noderef] __user *

  kernel/bpf/hashtab.c:1609:26: warning: incorrect type in argument 1
    (different address spaces)
  kernel/bpf/hashtab.c:1609:26:    expected void [noderef] __user *to
  kernel/bpf/hashtab.c:1609:26:    got void *ubatch

Add the __user annotation to repair this chain of propagating __user
annotations in __htab_map_lookup_and_delete_batch().

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20201207123720.19111-1-lukas.bulwahn@gmail.com
3 years agobpf: Avoid overflows involving hash elem_size
Eric Dumazet [Mon, 7 Dec 2020 18:28:21 +0000 (10:28 -0800)]
bpf: Avoid overflows involving hash elem_size

Use of bpf_map_charge_init() was making sure hash tables would not use more
than 4GB of memory.

Since the implicit check disappeared, we have to be more careful
about overflows, to support big hash tables.

syzbot triggers a panic using :

bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_LRU_HASH, key_size=16384, value_size=8,
                     max_entries=262200, map_flags=0, inner_map_fd=-1, map_name="",
                     map_ifindex=0, btf_fd=-1, btf_key_type_id=0, btf_value_type_id=0,
                     btf_vmlinux_value_type_id=0}, 64) = ...

BUG: KASAN: vmalloc-out-of-bounds in bpf_percpu_lru_populate kernel/bpf/bpf_lru_list.c:594 [inline]
BUG: KASAN: vmalloc-out-of-bounds in bpf_lru_populate+0x4ef/0x5e0 kernel/bpf/bpf_lru_list.c:611
Write of size 2 at addr ffffc90017e4a020 by task syz-executor.5/19786

CPU: 0 PID: 19786 Comm: syz-executor.5 Not tainted 5.10.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x107/0x163 lib/dump_stack.c:118
 print_address_description.constprop.0.cold+0x5/0x4c8 mm/kasan/report.c:385
 __kasan_report mm/kasan/report.c:545 [inline]
 kasan_report.cold+0x1f/0x37 mm/kasan/report.c:562
 bpf_percpu_lru_populate kernel/bpf/bpf_lru_list.c:594 [inline]
 bpf_lru_populate+0x4ef/0x5e0 kernel/bpf/bpf_lru_list.c:611
 prealloc_init kernel/bpf/hashtab.c:319 [inline]
 htab_map_alloc+0xf6e/0x1230 kernel/bpf/hashtab.c:507
 find_and_alloc_map kernel/bpf/syscall.c:123 [inline]
 map_create kernel/bpf/syscall.c:829 [inline]
 __do_sys_bpf+0xa81/0x5170 kernel/bpf/syscall.c:4336
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45deb9
Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fd93fbc0c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 0000000000001a40 RCX: 000000000045deb9
RDX: 0000000000000040 RSI: 0000000020000280 RDI: 0000000000000000
RBP: 000000000119bf60 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000119bf2c
R13: 00007ffc08a7be8f R14: 00007fd93fbc19c0 R15: 000000000119bf2c

Fixes: 755e5d55367a ("bpf: Eliminate rlimit-based memory accounting for hashtab maps")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Link: https://lore.kernel.org/bpf/20201207182821.3940306-1-eric.dumazet@gmail.com
3 years agoselftests/bpf: Test bpf_sk_storage_get in tcp iterators
Florent Revest [Fri, 4 Dec 2020 11:36:09 +0000 (12:36 +0100)]
selftests/bpf: Test bpf_sk_storage_get in tcp iterators

This extends the existing bpf_sk_storage_get test where a socket is
created and tagged with its creator's pid by a task_file iterator.

A TCP iterator is now also used at the end of the test to negate the
values already stored in the local storage. The test therefore expects
-getpid() to be stored in the local storage.

Signed-off-by: Florent Revest <revest@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201204113609.1850150-6-revest@google.com
3 years agoselftests/bpf: Add an iterator selftest for bpf_sk_storage_get
Florent Revest [Fri, 4 Dec 2020 11:36:08 +0000 (12:36 +0100)]
selftests/bpf: Add an iterator selftest for bpf_sk_storage_get

The eBPF program iterates over all files and tasks. For all socket
files, it stores the tgid of the last task it encountered with a handle
to that socket. This is a heuristic for finding the "owner" of a socket
similar to what's done by lsof, ss, netstat or fuser. Potentially, this
information could be used from a cgroup_skb/*gress hook to try to
associate network traffic with processes.

The test makes sure that a socket it created is tagged with prog_tests's
pid.

Signed-off-by: Florent Revest <revest@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201204113609.1850150-5-revest@google.com
3 years agoselftests/bpf: Add an iterator selftest for bpf_sk_storage_delete
Florent Revest [Fri, 4 Dec 2020 11:36:07 +0000 (12:36 +0100)]
selftests/bpf: Add an iterator selftest for bpf_sk_storage_delete

The eBPF program iterates over all entries (well, only one) of a socket
local storage map and deletes them all. The test makes sure that the
entry is indeed deleted.

Signed-off-by: Florent Revest <revest@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201204113609.1850150-4-revest@google.com
3 years agobpf: Expose bpf_sk_storage_* to iterator programs
Florent Revest [Fri, 4 Dec 2020 11:36:06 +0000 (12:36 +0100)]
bpf: Expose bpf_sk_storage_* to iterator programs

Iterators are currently used to expose kernel information to userspace
over fast procfs-like files but iterators could also be used to
manipulate local storage. For example, the task_file iterator could be
used to initialize a socket local storage with associations between
processes and sockets or to selectively delete local storage values.

Signed-off-by: Florent Revest <revest@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20201204113609.1850150-3-revest@google.com
3 years agobpf: Add a bpf_sock_from_file helper
Florent Revest [Fri, 4 Dec 2020 11:36:05 +0000 (12:36 +0100)]
bpf: Add a bpf_sock_from_file helper

While eBPF programs can check whether a file is a socket by file->f_op
== &socket_file_ops, they cannot convert the void private_data pointer
to a struct socket BTF pointer. In order to do this a new helper
wrapping sock_from_file is added.

This is useful to tracing programs but also other program types
inheriting this set of helpers such as iterators or LSM programs.

Signed-off-by: Florent Revest <revest@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: KP Singh <kpsingh@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201204113609.1850150-2-revest@google.com
3 years agonet: Remove the err argument from sock_from_file
Florent Revest [Fri, 4 Dec 2020 11:36:04 +0000 (12:36 +0100)]
net: Remove the err argument from sock_from_file

Currently, the sock_from_file prototype takes an "err" pointer that is
either not set or set to -ENOTSOCK IFF the returned socket is NULL. This
makes the error redundant and it is ignored by a few callers.

This patch simplifies the API by letting callers deduce the error based
on whether the returned socket is NULL or not.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Florent Revest <revest@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20201204113609.1850150-1-revest@google.com
3 years agoMerge branch 'Improve error handling of verifier tests'
Andrii Nakryiko [Fri, 4 Dec 2020 19:53:17 +0000 (11:53 -0800)]
Merge branch 'Improve error handling of verifier tests'

Florian Lehner says:

====================
These patches improve the error handling for verifier tests. With "Test
the 32bit narrow read" Krzesimir Nowak provided these patches first, but
they were never merged.
The improved error handling helps to implement and test BPF program types
that are not supported yet.

v3:
  - Add explicit fallthrough

v2:
  - Add unpriv check in error validation
====================

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
3 years agoselftests/bpf: Avoid errno clobbering
Florian Lehner [Fri, 4 Dec 2020 18:18:28 +0000 (19:18 +0100)]
selftests/bpf: Avoid errno clobbering

Print a message when the returned error is about a program type being
not supported or because of permission problems.
These messages are expected if the program to test was actually
executed.

Signed-off-by: Florian Lehner <dev@der-flo.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201204181828.11974-3-dev@der-flo.net
3 years agoselftests/bpf: Print reason when a tester could not run a program
Florian Lehner [Fri, 4 Dec 2020 18:18:27 +0000 (19:18 +0100)]
selftests/bpf: Print reason when a tester could not run a program

Commit 8184d44c9a57 ("selftests/bpf: skip verifier tests for unsupported
program types") added a check to skip unsupported program types. As
bpf_probe_prog_type can change errno, do_single_test should save it before
printing a reason why a supported BPF program type failed to load.

Fixes: 8184d44c9a57 ("selftests/bpf: skip verifier tests for unsupported program types")
Signed-off-by: Florian Lehner <dev@der-flo.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201204181828.11974-2-dev@der-flo.net
3 years agobpf: Remove trailing semicolon in macro definition
Tom Rix [Wed, 2 Dec 2020 21:28:10 +0000 (13:28 -0800)]
bpf: Remove trailing semicolon in macro definition

The macro use will already have a semicolon. Clean up escaped newlines.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201202212810.3774614-1-trix@redhat.com
3 years agoMerge tag 'wireless-drivers-next-2020-12-03' of git://git.kernel.org/pub/scm/linux...
Jakub Kicinski [Fri, 4 Dec 2020 18:56:37 +0000 (10:56 -0800)]
Merge tag 'wireless-drivers-next-2020-12-03' of git://git./linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers-next patches for v5.11

First set of patches for v5.11. rtw88 getting improvements to work
better with Bluetooth and other driver also getting some new features.
mhi-ath11k-immutable branch was pulled from mhi tree to avoid
conflicts with mhi tree.

Major changes:

rtw88
 * major bluetooth co-existance improvements
wilc1000
 * Wi-Fi Multimedia (WMM) support
ath11k
 * Fast Initial Link Setup (FILS) discovery and unsolicited broadcast
   probe response support
 * qcom,ath11k-calibration-variant Device Tree setting
 * cold boot calibration support
 * new DFS region: JP
wnc36xx
 * enable connection monitoring and keepalive in firmware
ath10k
 * firmware IRAM recovery feature
mhi
 * merge mhi-ath11k-immutable branch to make MHI API change go smoothly

* tag 'wireless-drivers-next-2020-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next: (180 commits)
  wl1251: remove trailing semicolon in macro definition
  airo: remove trailing semicolon in macro definition
  wilc1000: added queue support for WMM
  wilc1000: call complete() for failure in wilc_wlan_txq_add_cfg_pkt()
  wilc1000: free resource in wilc_wlan_txq_add_mgmt_pkt() for failure path
  wilc1000: free resource in wilc_wlan_txq_add_net_pkt() for failure path
  wilc1000: added 'ndo_set_mac_address' callback support
  brcmfmac: expose firmware config files through modinfo
  wlcore: Switch to using the new API kobj_to_dev()
  rtw88: coex: add feature to enhance HID coexistence performance
  rtw88: coex: upgrade coexistence A2DP mechanism
  rtw88: coex: add action for coexistence in hardware initial
  rtw88: coex: add function to avoid cck lock
  rtw88: coex: change the coexistence mechanism for WLAN connected
  rtw88: coex: change the coexistence mechanism for HID
  rtw88: coex: update AFH information while in free-run mode
  rtw88: coex: update the mechanism for A2DP + PAN
  rtw88: coex: add debug message
  rtw88: coex: run coexistence when WLAN entering/leaving LPS
  Revert "rtl8xxxu: Add Buffalo WI-U3-866D to list of supported devices"
  ...
====================

Link: https://lore.kernel.org/r/20201203185732.9CFA5C433ED@smtp.codeaurora.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agodpaa_eth: fix build errorr in dpaa_fq_init
Anders Roxell [Thu, 3 Dec 2020 14:43:43 +0000 (15:43 +0100)]
dpaa_eth: fix build errorr in dpaa_fq_init

When building FSL_DPAA_ETH the following build error shows up:

/tmp/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c: In function ‘dpaa_fq_init’:
/tmp/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c:1135:9: error: too few arguments to function ‘xdp_rxq_info_reg’
 1135 |   err = xdp_rxq_info_reg(&dpaa_fq->xdp_rxq, dpaa_fq->net_dev,
      |         ^~~~~~~~~~~~~~~~

Commit b02e5a0ebb17 ("xsk: Propagate napi_id to XDP socket Rx path")
added an extra argument to function xdp_rxq_info_reg and commit
d57e57d0cd04 ("dpaa_eth: add XDP_TX support") didn't know about that
extra argument.

Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/r/20201203144343.790719-1-anders.roxell@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoMerge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Jakub Kicinski [Fri, 4 Dec 2020 15:48:11 +0000 (07:48 -0800)]
Merge https://git./linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-12-03

The main changes are:

1) Support BTF in kernel modules, from Andrii.

2) Introduce preferred busy-polling, from Björn.

3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh.

4) Memcg-based memory accounting for bpf objects, from Roman.

5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits)
  selftests/bpf: Fix invalid use of strncat in test_sockmap
  libbpf: Use memcpy instead of strncpy to please GCC
  selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module
  selftests/bpf: Add tp_btf CO-RE reloc test for modules
  libbpf: Support attachment of BPF tracing programs to kernel modules
  libbpf: Factor out low-level BPF program loading helper
  bpf: Allow to specify kernel module BTFs when attaching BPF programs
  bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
  selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF
  selftests/bpf: Add support for marking sub-tests as skipped
  selftests/bpf: Add bpf_testmod kernel module for testing
  libbpf: Add kernel module BTF support for CO-RE relocations
  libbpf: Refactor CO-RE relocs to not assume a single BTF object
  libbpf: Add internal helper to load BTF data by FD
  bpf: Keep module's btf_data_size intact after load
  bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
  selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP
  bpf: Adds support for setting window clamp
  samples/bpf: Fix spelling mistake "recieving" -> "receiving"
  bpf: Fix cold build of test_progs-no_alu32
  ...
====================

Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoselftests/bpf: Fix invalid use of strncat in test_sockmap
Andrii Nakryiko [Thu, 3 Dec 2020 23:54:40 +0000 (15:54 -0800)]
selftests/bpf: Fix invalid use of strncat in test_sockmap

strncat()'s third argument is how many bytes will be added *in addition* to
already existing bytes in destination. Plus extra zero byte will be added
after that. So existing use in test_sockmap has many opportunities to overflow
the string and cause memory corruptions. And in this case, GCC complains for
a good reason.

Fixes: 16962b2404ac ("bpf: sockmap, add selftests")
Fixes: 73563aa3d977 ("selftests/bpf: test_sockmap, print additional test options")
Fixes: 1ade9abadfca ("bpf: test_sockmap, add options for msg_pop_data() helper")
Fixes: 463bac5f1ca7 ("bpf, selftests: Add test for ktls with skb bpf ingress policy")
Fixes: e9dd904708c4 ("bpf: add tls support for testing in test_sockmap")
Fixes: 753fb2ee0934 ("bpf: sockmap, add msg_peek tests to test_sockmap")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203235440.2302137-2-andrii@kernel.org
3 years agolibbpf: Use memcpy instead of strncpy to please GCC
Andrii Nakryiko [Thu, 3 Dec 2020 23:54:39 +0000 (15:54 -0800)]
libbpf: Use memcpy instead of strncpy to please GCC

Some versions of GCC are really nit-picky about strncpy() use. Use memcpy(),
as they are pretty much equivalent for the case of fixed length strings.

Fixes: e459f49b4394 ("libbpf: Separate XDP program load with xsk socket creation")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203235440.2302137-1-andrii@kernel.org
3 years agoMerge branch 'Support BTF-powered BPF tracing programs for kernel modules'
Alexei Starovoitov [Fri, 4 Dec 2020 01:38:22 +0000 (17:38 -0800)]
Merge branch 'Support BTF-powered BPF tracing programs for kernel modules'

Andrii Nakryiko says:

====================

This patch sets extends kernel and libbpf with support for attaching
BTF-powered raw tracepoint (tp_btf) and tracing (fentry/fexit/fmod_ret/lsm)
BPF programs to BPF hooks defined in kernel modules. As part of that, libbpf
now supports performing CO-RE relocations against types in kernel module BTFs,
in addition to existing vmlinux BTF support.

Kernel UAPI for BPF_PROG_LOAD now allows to specify kernel module (or vmlinux)
BTF object FD in attach_btf_obj_fd field, aliased to attach_prog_fd. This is
used to identify which BTF object needs to be used for finding BTF type by
provided attach_btf_id.

This patch set also sets up a convenient and fully-controlled custom kernel
module (called "bpf_testmod"), that is a predictable playground for all the
BPF selftests, that rely on module BTFs. Currently pahole doesn't generate
BTF_KIND_FUNC info for ftrace-able static functions in kernel modules, so
expose traced function in bpf_sidecar.ko. Once pahole is enhanced, we can go
back to static function.

From end user perspective there are no extra actions that need to happen.
Libbpf will continue searching across all kernel module BTFs, if desired
attach BTF type is not found in vmlinux. That way it doesn't matter if BPF
hook that user is trying to attach to is built into vmlinux image or is
loaded in kernel module.

v5->v6:
  - move btf_put() back to syscall.c (kernel test robot);
  - added close(fd) in patch #5 (John);
v4->v5:
  - use FD to specify BTF object (Alexei);
  - move prog->aux->attach_btf putting into bpf_prog_free() for consistency
    with putting prog->aux->dst_prog;
  - fix BTF FD leak(s) in libbpf;
v3->v4:
  - merge together patch sets [0] and [1];
  - avoid increasing bpf_reg_state by reordering fields (Alexei);
  - preserve btf_data_size in struct module;
v2->v3:
  - fix subtle uninitialized variable use in BTF ID iteration code;
v1->v2:
  - module_put() inside preempt_disable() region (Alexei);
  - bpf_sidecar -> bpf_testmod rename (Alexei);
  - test_progs more relaxed handling of bpf_testmod;
  - test_progs marks skipped sub-tests properly as SKIP now.

  [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=393677&state=*
  [1] https://patchwork.kernel.org/project/netdevbpf/list/?series=393679&state=*
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 years agoselftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module
Andrii Nakryiko [Thu, 3 Dec 2020 20:46:34 +0000 (12:46 -0800)]
selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module

Add new selftest checking attachment of fentry/fexit/fmod_ret (and raw
tracepoint ones for completeness) BPF programs to kernel module function.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-15-andrii@kernel.org
3 years agoselftests/bpf: Add tp_btf CO-RE reloc test for modules
Andrii Nakryiko [Thu, 3 Dec 2020 20:46:33 +0000 (12:46 -0800)]
selftests/bpf: Add tp_btf CO-RE reloc test for modules

Add another CO-RE relocation test for kernel module relocations. This time for
tp_btf with direct memory reads.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-14-andrii@kernel.org
3 years agolibbpf: Support attachment of BPF tracing programs to kernel modules
Andrii Nakryiko [Thu, 3 Dec 2020 20:46:32 +0000 (12:46 -0800)]
libbpf: Support attachment of BPF tracing programs to kernel modules

Teach libbpf to search for BTF types in kernel modules for tracing BPF
programs. This allows attachment of raw_tp/fentry/fexit/fmod_ret/etc BPF
program types to tracepoints and functions in kernel modules.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-13-andrii@kernel.org
3 years agolibbpf: Factor out low-level BPF program loading helper
Andrii Nakryiko [Thu, 3 Dec 2020 20:46:31 +0000 (12:46 -0800)]
libbpf: Factor out low-level BPF program loading helper

Refactor low-level API for BPF program loading to not rely on public API
types. This allows painless extension without constant efforts to cleverly not
break backwards compatibility.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-12-andrii@kernel.org
3 years agobpf: Allow to specify kernel module BTFs when attaching BPF programs
Andrii Nakryiko [Thu, 3 Dec 2020 20:46:30 +0000 (12:46 -0800)]
bpf: Allow to specify kernel module BTFs when attaching BPF programs

Add ability for user-space programs to specify non-vmlinux BTF when attaching
BTF-powered BPF programs: raw_tp, fentry/fexit/fmod_ret, LSM, etc. For this,
attach_prog_fd (now with the alias name attach_btf_obj_fd) should specify FD
of a module or vmlinux BTF object. For backwards compatibility reasons,
0 denotes vmlinux BTF. Only kernel BTF (vmlinux or module) can be specified.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-11-andrii@kernel.org
3 years agobpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
Andrii Nakryiko [Thu, 3 Dec 2020 20:46:29 +0000 (12:46 -0800)]
bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier

Remove a permeating assumption thoughout BPF verifier of vmlinux BTF. Instead,
wherever BTF type IDs are involved, also track the instance of struct btf that
goes along with the type ID. This allows to gradually add support for kernel
module BTFs and using/tracking module types across BPF helper calls and
registers.

This patch also renames btf_id() function to btf_obj_id() to minimize naming
clash with using btf_id to denote BTF *type* ID, rather than BTF *object*'s ID.

Also, altough btf_vmlinux can't get destructed and thus doesn't need
refcounting, module BTFs need that, so apply BTF refcounting universally when
BPF program is using BTF-powered attachment (tp_btf, fentry/fexit, etc). This
makes for simpler clean up code.

Now that BTF type ID is not enough to uniquely identify a BTF type, extend BPF
trampoline key to include BTF object ID. To differentiate that from target
program BPF ID, set 31st bit of type ID. BTF type IDs (at least currently) are
not allowed to take full 32 bits, so there is no danger of confusing that bit
with a valid BTF type ID.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-10-andrii@kernel.org
3 years agoselftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF
Andrii Nakryiko [Thu, 3 Dec 2020 20:46:28 +0000 (12:46 -0800)]
selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF

Add a self-tests validating libbpf is able to perform CO-RE relocations
against the type defined in kernel module BTF. if bpf_testmod.o is not
supported by the kernel (e.g., due to version mismatch), skip tests, instead
of failing.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-9-andrii@kernel.org
3 years agoselftests/bpf: Add support for marking sub-tests as skipped
Andrii Nakryiko [Thu, 3 Dec 2020 20:46:27 +0000 (12:46 -0800)]
selftests/bpf: Add support for marking sub-tests as skipped

Previously skipped sub-tests would be counted as passing with ":OK" appened
in the log. Change that to be accounted as ":SKIP".

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-8-andrii@kernel.org
3 years agoselftests/bpf: Add bpf_testmod kernel module for testing
Andrii Nakryiko [Thu, 3 Dec 2020 20:46:26 +0000 (12:46 -0800)]
selftests/bpf: Add bpf_testmod kernel module for testing

Add bpf_testmod module, which is conceptually out-of-tree module and provides
ways for selftests/bpf to test various kernel module-related functionality:
raw tracepoint, fentry/fexit/fmod_ret, etc. This module will be auto-loaded by
test_progs test runner and expected by some of selftests to be present and
loaded.

Pahole currently isn't able to generate BTF for static functions in kernel
modules, so make sure traced function is global.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-7-andrii@kernel.org
3 years agolibbpf: Add kernel module BTF support for CO-RE relocations
Andrii Nakryiko [Thu, 3 Dec 2020 20:46:25 +0000 (12:46 -0800)]
libbpf: Add kernel module BTF support for CO-RE relocations

Teach libbpf to search for candidate types for CO-RE relocations across kernel
modules BTFs, in addition to vmlinux BTF. If at least one candidate type is
found in vmlinux BTF, kernel module BTFs are not iterated. If vmlinux BTF has
no matching candidates, then find all kernel module BTFs and search for all
matching candidates across all of them.

Kernel's support for module BTFs are inferred from the support for BTF name
pointer in BPF UAPI.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-6-andrii@kernel.org
3 years agolibbpf: Refactor CO-RE relocs to not assume a single BTF object
Andrii Nakryiko [Thu, 3 Dec 2020 20:46:24 +0000 (12:46 -0800)]
libbpf: Refactor CO-RE relocs to not assume a single BTF object

Refactor CO-RE relocation candidate search to not expect a single BTF, rather
return all candidate types with their corresponding BTF objects. This will
allow to extend CO-RE relocations to accommodate kernel module BTFs.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-5-andrii@kernel.org
3 years agolibbpf: Add internal helper to load BTF data by FD
Andrii Nakryiko [Thu, 3 Dec 2020 20:46:23 +0000 (12:46 -0800)]
libbpf: Add internal helper to load BTF data by FD

Add a btf_get_from_fd() helper, which constructs struct btf from in-kernel BTF
data by FD. This is used for loading module BTFs.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-4-andrii@kernel.org
3 years agobpf: Keep module's btf_data_size intact after load
Andrii Nakryiko [Thu, 3 Dec 2020 20:46:22 +0000 (12:46 -0800)]
bpf: Keep module's btf_data_size intact after load

Having real btf_data_size stored in struct module is benefitial to quickly
determine which kernel modules have associated BTF object and which don't.
There is no harm in keeping this info, as opposed to keeping invalid pointer.

Fixes: 607c543f939d ("bpf: Sanitize BTF data pointer after module is loaded")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-3-andrii@kernel.org
3 years agobpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
Andrii Nakryiko [Thu, 3 Dec 2020 20:46:21 +0000 (12:46 -0800)]
bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()

__module_address() needs to be called with preemption disabled or with
module_mutex taken. preempt_disable() is enough for read-only uses, which is
what this fix does. Also, module_put() does internal check for NULL, so drop
it as well.

Fixes: a38d1107f937 ("bpf: support raw tracepoints in modules")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-2-andrii@kernel.org
3 years agoMerge branch 'Add support to set window_clamp from bpf setsockops'
Alexei Starovoitov [Fri, 4 Dec 2020 01:23:24 +0000 (17:23 -0800)]
Merge branch 'Add support to set window_clamp from bpf setsockops'

Prankur gupta says:

====================

This patch contains support to set tcp window_field field from bpf setsockops.

v2: Used TCP_WINDOW_CLAMP setsockopt logic for bpf_setsockopt (review comment addressed)

v3: Created a common function for duplicated code (review comment addressed)

v4: Removing logic to pass struct sock and struct tcp_sock together (review comment addressed)
====================

Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 years agoselftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP
Prankur gupta [Wed, 2 Dec 2020 21:31:52 +0000 (13:31 -0800)]
selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP

Adding selftests for new added functionality to set TCP_WINDOW_CLAMP
from bpf setsockopt.

Signed-off-by: Prankur gupta <prankgup@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201202213152.435886-3-prankgup@fb.com
3 years agobpf: Adds support for setting window clamp
Prankur gupta [Wed, 2 Dec 2020 21:31:51 +0000 (13:31 -0800)]
bpf: Adds support for setting window clamp

Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_WINDOW_CLAMP,
which sets the maximum receiver window size. It will be useful for
limiting receiver window based on RTT.

Signed-off-by: Prankur gupta <prankgup@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201202213152.435886-2-prankgup@fb.com
3 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 3 Dec 2020 23:42:13 +0000 (15:42 -0800)]
Merge git://git./linux/kernel/git/netdev/net

Conflicts:
drivers/net/ethernet/ibm/ibmvnic.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoMerge tag 'net-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 3 Dec 2020 21:10:11 +0000 (13:10 -0800)]
Merge tag 'net-5.10-rc7' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.10-rc7, including fixes from bpf, netfilter,
  wireless drivers, wireless mesh and can.

  Current release - regressions:

   - mt76: usb: fix crash on device removal

  Current release - always broken:

   - xsk: Fix umem cleanup from wrong context in socket destruct

  Previous release - regressions:

   - net: ip6_gre: set dev->hard_header_len when using header_ops

   - ipv4: Fix TOS mask in inet_rtm_getroute()

   - net, xsk: Avoid taking multiple skbuff references

  Previous release - always broken:

   - net/x25: prevent a couple of overflows

   - netfilter: ipset: prevent uninit-value in hash_ip6_add

   - geneve: pull IP header before ECN decapsulation

   - mpls: ensure LSE is pullable in TC and openvswitch paths

   - vxlan: respect needed_headroom of lower device

   - batman-adv: Consider fragmentation for needed packet headroom

   - can: drivers: don't count arbitration loss as an error

   - netfilter: bridge: reset skb->pkt_type after POST_ROUTING traversal

   - inet_ecn: Fix endianness of checksum update when setting ECT(1)

   - ibmvnic: fix various corner cases around reset handling

   - net/mlx5: fix rejecting unsupported Connect-X6DX SW steering

   - net/mlx5: Enforce HW TX csum offload with kTLS"

* tag 'net-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits)
  net/mlx5: DR, Proper handling of unsupported Connect-X6DX SW steering
  net/mlx5e: kTLS, Enforce HW TX csum offload with kTLS
  net: mlx5e: fix fs_tcp.c build when IPV6 is not enabled
  net/mlx5: Fix wrong address reclaim when command interface is down
  net/sched: act_mpls: ensure LSE is pullable before reading it
  net: openvswitch: ensure LSE is pullable before reading it
  net: skbuff: ensure LSE is pullable before decrementing the MPLS ttl
  net: mvpp2: Fix error return code in mvpp2_open()
  chelsio/chtls: fix a double free in chtls_setkey()
  rtw88: debug: Fix uninitialized memory in debugfs code
  vxlan: fix error return code in __vxlan_dev_create()
  net: pasemi: fix error return code in pasemi_mac_open()
  cxgb3: fix error return code in t3_sge_alloc_qset()
  net/x25: prevent a couple of overflows
  dpaa_eth: copy timestamp fields to new skb in A-050385 workaround
  net: ip6_gre: set dev->hard_header_len when using header_ops
  mt76: usb: fix crash on device removal
  iwlwifi: pcie: add some missing entries for AX210
  iwlwifi: pcie: invert values of NO_160 device config entries
  iwlwifi: pcie: add one missing entry for AX210
  ...

3 years agoMerge branch 'mptcp-reject-invalid-mp_join-requests-right-away'
Jakub Kicinski [Thu, 3 Dec 2020 20:56:05 +0000 (12:56 -0800)]
Merge branch 'mptcp-reject-invalid-mp_join-requests-right-away'

Florian Westphal says:

====================
mptcp: reject invalid mp_join requests right away

At the moment MPTCP can detect an invalid join request (invalid token,
max number of subflows reached, and so on) right away but cannot reject
the connection until the 3WHS has completed.
Instead the connection will complete and the subflow is reset afterwards.

To send the reset most information is already available, but we don't have
good spot where the reset could be sent:

1. The ->init_req callback is too early and also doesn't allow to return an
   error that could be used to inform the TCP stack that the SYN should be
   dropped.

2. The ->route_req callback lacks the skb needed to send a reset.

3. The ->send_synack callback is the best fit from the available hooks,
   but its called after the request socket has been inserted into the queue
   already. This means we'd have to remove it again right away.

From a technical point of view, the second hook would be best:
 1. Its before insertion into listener queue.
 2. If it returns NULL TCP will drop the packet for us.

Problem is that we'd have to pass the skb to the function just for MPTCP.

Paolo suggested to merge init_req and route_req callbacks instead:
This makes all info available to MPTCP -- a return value of NULL drops the
packet and MPTCP can send the reset if needed.

Because 'route_req' has a 'const struct sock *', this means either removal
of const qualifier, or a bit of code churn to pass 'const' in security land.

This does the latter; I did not find any spots that need write access to struct
sock.

To recap, the two alternatives are:
1. Solve it entirely in MPTCP: use the ->send_synack callback to
   unlink the request socket from the listener & drop it.
2. Avoid 'security' churn by removing the const qualifier.
====================

Link: https://lore.kernel.org/r/20201130153631.21872-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agomptcp: emit tcp reset when a join request fails
Florian Westphal [Mon, 30 Nov 2020 15:36:31 +0000 (16:36 +0100)]
mptcp: emit tcp reset when a join request fails

RFC 8684 says:
 If the token is unknown or the host wants to refuse subflow establishment
 (for example, due to a limit on the number of subflows it will permit),
 the receiver will send back a reset (RST) signal, analogous to an unknown
 port in TCP, containing an MP_TCPRST option (Section 3.6) with an
 "MPTCP specific error" reason code.

mptcp-next doesn't support MP_TCPRST yet, this can be added in another
change.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agotcp: merge 'init_req' and 'route_req' functions
Florian Westphal [Mon, 30 Nov 2020 15:36:30 +0000 (16:36 +0100)]
tcp: merge 'init_req' and 'route_req' functions

The Multipath-TCP standard (RFC 8684) says that an MPTCP host should send
a TCP reset if the token in a MP_JOIN request is unknown.

At this time we don't do this, the 3whs completes and the 'new subflow'
is reset afterwards.  There are two ways to allow MPTCP to send the
reset.

1. override 'send_synack' callback and emit the rst from there.
   The drawback is that the request socket gets inserted into the
   listeners queue just to get removed again right away.

2. Send the reset from the 'route_req' function instead.
   This avoids the 'add&remove request socket', but route_req lacks the
   skb that is required to send the TCP reset.

Instead of just adding the skb to that function for MPTCP sake alone,
Paolo suggested to merge init_req and route_req functions.

This saves one indirection from syn processing path and provides the skb
to the merged function at the same time.

'send reset on unknown mptcp join token' is added in next patch.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agosecurity: add const qualifier to struct sock in various places
Florian Westphal [Mon, 30 Nov 2020 15:36:29 +0000 (16:36 +0100)]
security: add const qualifier to struct sock in various places

A followup change to tcp_request_sock_op would have to drop the 'const'
qualifier from the 'route_req' function as the
'security_inet_conn_request' call is moved there - and that function
expects a 'struct sock *'.

However, it turns out its also possible to add a const qualifier to
security_inet_conn_request instead.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agosamples/bpf: Fix spelling mistake "recieving" -> "receiving"
Colin Ian King [Thu, 3 Dec 2020 11:44:52 +0000 (11:44 +0000)]
samples/bpf: Fix spelling mistake "recieving" -> "receiving"

There is a spelling mistake in an error message. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201203114452.1060017-1-colin.king@canonical.com
3 years agobpf: Fix cold build of test_progs-no_alu32
Brendan Jackman [Thu, 3 Dec 2020 12:08:50 +0000 (12:08 +0000)]
bpf: Fix cold build of test_progs-no_alu32

This object lives inside the trunner output dir,
i.e. tools/testing/selftests/bpf/no_alu32/btf_data.o

At some point it gets copied into the parent directory during another
part of the build, but that doesn't happen when building
test_progs-no_alu32 from clean.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/bpf/20201203120850.859170-1-jackmanb@google.com
3 years agolibbpf: Cap retries in sys_bpf_prog_load
Stanislav Fomichev [Wed, 2 Dec 2020 23:13:32 +0000 (15:13 -0800)]
libbpf: Cap retries in sys_bpf_prog_load

I've seen a situation, where a process that's under pprof constantly
generates SIGPROF which prevents program loading indefinitely.
The right thing to do probably is to disable signals in the upper
layers while loading, but it still would be nice to get some error from
libbpf instead of an endless loop.

Let's add some small retry limit to the program loading:
try loading the program 5 (arbitrary) times and give up.

v2:
* 10 -> 5 retires (Andrii Nakryiko)

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201202231332.3923644-1-sdf@google.com
3 years agoMerge tag 's390-5.10-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Linus Torvalds [Thu, 3 Dec 2020 19:58:26 +0000 (11:58 -0800)]
Merge tag 's390-5.10-6' of git://git./linux/kernel/git/s390/linux

Pull s390 fixes from Heiko Carstens:
 "One commit is fixing lockdep irq state tracing which broke with -rc6.

  The other one fixes logical vs physical CPU address mixup in our PCI
  code.

  Summary:

   - fix lockdep irq state tracing

   - fix logical vs physical CPU address confusion in PCI code"

* tag 's390-5.10-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390: fix irq state tracing
  s390/pci: fix CPU address in MSI for directed IRQ

3 years agolibbpf: Sanitise map names before pinning
Toke Høiland-Jørgensen [Thu, 3 Dec 2020 09:33:06 +0000 (10:33 +0100)]
libbpf: Sanitise map names before pinning

When we added sanitising of map names before loading programs to libbpf, we
still allowed periods in the name. While the kernel will accept these for
the map names themselves, they are not allowed in file names when pinning
maps. This means that bpf_object__pin_maps() will fail if called on an
object that contains internal maps (such as sections .rodata).

Fix this by replacing periods with underscores when constructing map pin
paths. This only affects the paths generated by libbpf when
bpf_object__pin_maps() is called with a path argument. Any pin paths set
by bpf_map__set_pin_path() are unaffected, and it will still be up to the
caller to avoid invalid characters in those.

Fixes: 113e6b7e15e2 ("libbpf: Sanitise internal map names so they are not rejected by the kernel")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201203093306.107676-1-toke@redhat.com
3 years agoMerge tag '9p-for-5.10-rc7' of git://github.com/martinetd/linux
Linus Torvalds [Thu, 3 Dec 2020 19:47:13 +0000 (11:47 -0800)]
Merge tag '9p-for-5.10-rc7' of git://github.com/martinetd/linux

Pull 9p fixes from Dominique Martinet:
 "Restore splice functionality for 9p"

* tag '9p-for-5.10-rc7' of git://github.com/martinetd/linux:
  fs: 9p: add generic splice_write file operation
  fs: 9p: add generic splice_read file operations

3 years agolibbpf: Fail early when loading programs with unspecified type
Andrei Matei [Thu, 3 Dec 2020 04:34:10 +0000 (23:34 -0500)]
libbpf: Fail early when loading programs with unspecified type

Before this patch, a program with unspecified type
(BPF_PROG_TYPE_UNSPEC) would be passed to the BPF syscall, only to have
the kernel reject it with an opaque invalid argument error. This patch
makes libbpf reject such programs with a nicer error message - in
particular libbpf now tries to diagnose bad ELF section names at both
open time and load time.

Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201203043410.59699-1-andreimatei1@gmail.com
3 years agoMerge branch 'Fixes for ima selftest'
Andrii Nakryiko [Thu, 3 Dec 2020 19:20:21 +0000 (11:20 -0800)]
Merge branch 'Fixes for ima selftest'

KP Singh says:

====================

From: KP Singh <kpsingh@google.com>

# v3 -> v4

* Fix typos.
* Update commit message for the indentation patch.
* Added Andrii's acks.

# v2 -> v3

* Added missing tags.
* Indentation fixes + some other fixes suggested by Andrii.
* Re-indent file to tabs.

The selftest for the bpf_ima_inode_hash helper uses a shell script to
setup the system for ima. While this worked without an issue on recent
desktop distros, it failed on environments with stripped out shells like
busybox which is also used by the bpf CI.

This series fixes the assumptions made on the availablity of certain
command line switches and the expectation that securityfs being mounted
by default.

It also adds the missing kernel config dependencies in
tools/testing/selftests/bpf and, lastly, changes the indentation of
ima_setup.sh to use tabs.
====================

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
3 years agoselftests/bpf: Indent ima_setup.sh with tabs.
KP Singh [Thu, 3 Dec 2020 19:14:37 +0000 (19:14 +0000)]
selftests/bpf: Indent ima_setup.sh with tabs.

The file was formatted with spaces instead of tabs and went unnoticed
as checkpatch.pl did not complain (probably because this is a shell
script). Re-indent it with tabs to be consistent with other scripts.

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201203191437.666737-5-kpsingh@chromium.org
3 years agoselftests/bpf: Add config dependency on BLK_DEV_LOOP
KP Singh [Thu, 3 Dec 2020 19:14:36 +0000 (19:14 +0000)]
selftests/bpf: Add config dependency on BLK_DEV_LOOP

The ima selftest restricts its scope to a test filesystem image
mounted on a loop device and prevents permanent ima policy changes for
the whole system.

Fixes: 34b82d3ac105 ("bpf: Add a selftest for bpf_ima_inode_hash")
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201203191437.666737-4-kpsingh@chromium.org
3 years agoselftests/bpf: Ensure securityfs mount before writing ima policy
KP Singh [Thu, 3 Dec 2020 19:14:35 +0000 (19:14 +0000)]
selftests/bpf: Ensure securityfs mount before writing ima policy

SecurityFS may not be mounted even if it is enabled in the kernel
config. So, check if the mount exists in /proc/mounts by parsing the
file and, if not, mount it on /sys/kernel/security.

Fixes: 34b82d3ac105 ("bpf: Add a selftest for bpf_ima_inode_hash")
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201203191437.666737-3-kpsingh@chromium.org
3 years agoselftests/bpf: Update ima_setup.sh for busybox
KP Singh [Thu, 3 Dec 2020 19:14:34 +0000 (19:14 +0000)]
selftests/bpf: Update ima_setup.sh for busybox

losetup on busybox does not output the name of loop device on using
-f with --show. It also doesn't support -j to find the loop devices
for a given backing file. losetup is updated to use "-a" which is
available on busybox.

blkid does not support options (-s and -o) to only display the uuid, so
parse the output instead.

Not all environments have mkfs.ext4, the test requires a loop device
with a backing image file which could formatted with any filesystem.
Update to using mkfs.ext2 which is available on busybox.

Fixes: 34b82d3ac105 ("bpf: Add a selftest for bpf_ima_inode_hash")
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201203191437.666737-2-kpsingh@chromium.org
3 years agoMerge branch 'mlx5-fixes-2020-12-01'
Jakub Kicinski [Thu, 3 Dec 2020 19:18:37 +0000 (11:18 -0800)]
Merge branch 'mlx5-fixes-2020-12-01'

Saeed Mahameed says:

====================
mlx5 fixes 2020-12-01

This series introduces some fixes to mlx5 driver.
====================

Link: https://lore.kernel.org/r/20201203043946.235385-1-saeedm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet/mlx5: DR, Proper handling of unsupported Connect-X6DX SW steering
Yevgeny Kliteynik [Thu, 3 Dec 2020 04:39:46 +0000 (20:39 -0800)]
net/mlx5: DR, Proper handling of unsupported Connect-X6DX SW steering

STEs format for Connect-X5 and Connect-X6DX different. Currently, on
Connext-X6DX the SW steering would break at some point when building STEs
w/o giving a proper error message. Fix this by checking the STE format of
the current device when initializing domain: add mlx5_ifc definitions for
Connect-X6DX SW steering, read FW capability to get the current format
version, and check this version when domain is being created.

Fixes: 26d688e33f88 ("net/mlx5: DR, Add Steering entry (STE) utilities")
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet/mlx5e: kTLS, Enforce HW TX csum offload with kTLS
Tariq Toukan [Thu, 3 Dec 2020 04:39:45 +0000 (20:39 -0800)]
net/mlx5e: kTLS, Enforce HW TX csum offload with kTLS

Checksum calculation cannot be done in SW for TX kTLS HW offloaded
packets.
Offload it to the device, disregard the declared state of the TX
csum offload feature.

Fixes: d2ead1f360e8 ("net/mlx5e: Add kTLS TX HW offload support")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: mlx5e: fix fs_tcp.c build when IPV6 is not enabled
Randy Dunlap [Thu, 3 Dec 2020 04:39:44 +0000 (20:39 -0800)]
net: mlx5e: fix fs_tcp.c build when IPV6 is not enabled

Fix build when CONFIG_IPV6 is not enabled by making a function
be built conditionally.

Fixes these build errors and warnings:

../drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c: In function 'accel_fs_tcp_set_ipv6_flow':
../include/net/sock.h:380:34: error: 'struct sock_common' has no member named 'skc_v6_daddr'; did you mean 'skc_daddr'?
  380 | #define sk_v6_daddr  __sk_common.skc_v6_daddr
      |                                  ^~~~~~~~~~~~
../drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c:55:14: note: in expansion of macro 'sk_v6_daddr'
   55 |         &sk->sk_v6_daddr, 16);
      |              ^~~~~~~~~~~
At top level:
../drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c:47:13: warning: 'accel_fs_tcp_set_ipv6_flow' defined but not used [-Wunused-function]
   47 | static void accel_fs_tcp_set_ipv6_flow(struct mlx5_flow_spec *spec, struct sock *sk)

Fixes: 5229a96e59ec ("net/mlx5e: Accel, Expose flow steering API for rules add/del")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet/mlx5: Fix wrong address reclaim when command interface is down
Eran Ben Elisha [Thu, 3 Dec 2020 04:39:43 +0000 (20:39 -0800)]
net/mlx5: Fix wrong address reclaim when command interface is down

When command interface is down, driver to reclaim all 4K page chucks that
were hold by the Firmeware. Fix a bug for 64K page size systems, where
driver repeatedly released only the first chunk of the page.

Define helper function to fill 4K chunks for a given Firmware pages.
Iterate over all unreleased Firmware pages and call the hepler per each.

Fixes: 5adff6a08862 ("net/mlx5: Fix incorrect page count when in internal error")
Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet/sched: act_mpls: ensure LSE is pullable before reading it
Davide Caratti [Thu, 3 Dec 2020 09:37:52 +0000 (10:37 +0100)]
net/sched: act_mpls: ensure LSE is pullable before reading it

when 'act_mpls' is used to mangle the LSE, the current value is read from
the packet dereferencing 4 bytes at mpls_hdr(): ensure that the label is
contained in the skb "linear" area.

Found by code inspection.

v2:
 - use MPLS_HLEN instead of sizeof(new_lse), thanks to Jakub Kicinski

Fixes: 2a2ea50870ba ("net: sched: add mpls manipulation actions to TC")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Link: https://lore.kernel.org/r/3243506cba43d14858f3bd21ee0994160e44d64a.1606987058.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: openvswitch: ensure LSE is pullable before reading it
Davide Caratti [Thu, 3 Dec 2020 09:46:06 +0000 (10:46 +0100)]
net: openvswitch: ensure LSE is pullable before reading it

when openvswitch is configured to mangle the LSE, the current value is
read from the packet dereferencing 4 bytes at mpls_hdr(): ensure that
the label is contained in the skb "linear" area.

Found by code inspection.

Fixes: d27cf5c59a12 ("net: core: add MPLS update core helper and use in OvS")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/aa099f245d93218b84b5c056b67b6058ccf81a66.1606987185.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: skbuff: ensure LSE is pullable before decrementing the MPLS ttl
Davide Caratti [Thu, 3 Dec 2020 09:58:21 +0000 (10:58 +0100)]
net: skbuff: ensure LSE is pullable before decrementing the MPLS ttl

skb_mpls_dec_ttl() reads the LSE without ensuring that it is contained in
the skb "linear" area. Fix this calling pskb_may_pull() before reading the
current ttl.

Found by code inspection.

Fixes: 2a2ea50870ba ("net: sched: add mpls manipulation actions to TC")
Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/53659f28be8bc336c113b5254dc637cc76bbae91.1606987074.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoMerge tag 'wireless-drivers-2020-12-03' of git://git.kernel.org/pub/scm/linux/kernel...
Jakub Kicinski [Thu, 3 Dec 2020 18:59:28 +0000 (10:59 -0800)]
Merge tag 'wireless-drivers-2020-12-03' of git://git./linux/kernel/git/kvalo/wireless-drivers

Kalle Valo says:

====================
wireless-drivers fixes for v5.10

Second, and most likely final, set of fixes for v5.10. Small fixes and
PCI id addtions.

iwlwifi
 * PCI id additions
mt76
 * fix a kernel crash during device removal
rtw88
 * fix uninitialized memory in debugfs code

* tag 'wireless-drivers-2020-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers:
  rtw88: debug: Fix uninitialized memory in debugfs code
  mt76: usb: fix crash on device removal
  iwlwifi: pcie: add some missing entries for AX210
  iwlwifi: pcie: invert values of NO_160 device config entries
  iwlwifi: pcie: add one missing entry for AX210
  iwlwifi: update MAINTAINERS entry
====================

Link: https://lore.kernel.org/r/20201203183408.EE88AC43461@smtp.codeaurora.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: mvpp2: Fix error return code in mvpp2_open()
Wang Hai [Thu, 3 Dec 2020 14:18:06 +0000 (22:18 +0800)]
net: mvpp2: Fix error return code in mvpp2_open()

Fix to return negative error code -ENOENT from invalid configuration
error handling case instead of 0, as done elsewhere in this function.

Fixes: 4bb043262878 ("net: mvpp2: phylink support")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20201203141806.37966-1-wanghai38@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agochelsio/chtls: fix a double free in chtls_setkey()
Dan Carpenter [Thu, 3 Dec 2020 08:44:31 +0000 (11:44 +0300)]
chelsio/chtls: fix a double free in chtls_setkey()

The "skb" is freed by the transmit code in cxgb4_ofld_send() and we
shouldn't use it again.  But in the current code, if we hit an error
later on in the function then the clean up code will call kfree_skb(skb)
and so it causes a double free.

Set the "skb" to NULL and that makes the kfree_skb() a no-op.

Fixes: d25f2f71f653 ("crypto: chtls - Program the TLS session Key")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/X8ilb6PtBRLWiSHp@mwanda
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoMerge branch 'libbpf: add support for privileged/unprivileged control separation'
Alexei Starovoitov [Thu, 3 Dec 2020 18:37:59 +0000 (10:37 -0800)]
Merge branch 'libbpf: add support for privileged/unprivileged control separation'

Mariusz Dudek says:

====================

From: Mariusz Dudek <mariuszx.dudek@intel.com>

This patch series adds support for separation of eBPF program
load and xsk socket creation. In for example a Kubernetes
environment you can have an AF_XDP CNI or daemonset that is
responsible for launching pods that execute an application
using AF_XDP sockets. It is desirable that the pod runs with
as low privileges as possible, CAP_NET_RAW in this case,
and that all operations that require privileges are contained
in the CNI or daemonset.

In this case, you have to be able separate ePBF program load from
xsk socket creation.

Currently, this will not work with the xsk_socket__create APIs
because you need to have CAP_NET_ADMIN privileges to load eBPF
program and CAP_SYS_ADMIN privileges to create update xsk_bpf_maps.
To be exact xsk_set_bpf_maps does not need those privileges but
it takes the prog_fd and xsks_map_fd and those are known only to
process that was loading eBPF program. The api bpf_prog_get_fd_by_id
that looks up the fd of the prog using an prog_id and
bpf_map_get_fd_by_id that looks for xsks_map_fd usinb map_id both
requires CAP_SYS_ADMIN.

With this patch, the pod can be run with CAP_NET_RAW capability
only. In case your umem is larger or equal process limit for
MEMLOCK you need either increase the limit or CAP_IPC_LOCK capability.
Without this patch in case of insufficient rights ENOPERM is
returned by xsk_socket__create.

To resolve this privileges issue two new APIs are introduced:
- xsk_setup_xdp_prog - loads the built in XDP program. It can
also return xsks_map_fd which is needed by unprivileged
process to update xsks_map with AF_XDP socket "fd"
- xsk_sokcet__update_xskmap - inserts an AF_XDP socket into an
xskmap for a particular xsk_socket

Usage example:
int xsk_setup_xdp_prog(int ifindex, int *xsks_map_fd)

int xsk_socket__update_xskmap(struct xsk_socket *xsk, int xsks_map_fd);

Inserts AF_XDP socket "fd" into the xskmap.

The first patch introduces the new APIs. The second patch provides
a new sample applications working as control and modification to
existing xdpsock application to work with less privileges.

This patch set is based on bpf-next commit 97306be45fbe
(Merge branch 'switch to memcg-based memory accounting')

Since v6
- rebase on 97306be45fbe to resolve RLIMIT conflicts

Since v5
- fixed sample/bpf/xdpsock_user.c to resolve merge conflicts

Since v4
- sample/bpf/Makefile issues fixed

Since v3:
- force_set_map flag removed
- leaking of xsk struct fixed
- unified function error returning policy implemented

Since v2:
- new APIs moved itto LIBBPF_0.3.0 section
- struct bpf_prog_cfg_opts removed
- loading own eBPF program via xsk_setup_xdp_prog functionality removed

Since v1:
- struct bpf_prog_cfg improved for backward/forward compatibility
- API xsk_update_xskmap renamed to xsk_socket__update_xskmap
- commit message formatting fixed
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 years agosamples/bpf: Sample application for eBPF load and socket creation split
Mariusz Dudek [Thu, 3 Dec 2020 09:05:46 +0000 (10:05 +0100)]
samples/bpf: Sample application for eBPF load and socket creation split

Introduce a sample program to demonstrate the control and data
plane split. For the control plane part a new program called
xdpsock_ctrl_proc is introduced. For the data plane part, some code
was added to xdpsock_user.c to act as the data plane entity.

Application xdpsock_ctrl_proc works as control entity with sudo
privileges (CAP_SYS_ADMIN and CAP_NET_ADMIN are sufficient) and the
extended xdpsock as data plane entity with CAP_NET_RAW capability
only.

Usage example:

sudo ./samples/bpf/xdpsock_ctrl_proc -i <interface>

sudo ./samples/bpf/xdpsock -i <interface> -q <queue_id>
-n <interval> -N -l -R

Signed-off-by: Mariusz Dudek <mariuszx.dudek@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20201203090546.11976-3-mariuszx.dudek@intel.com
3 years agolibbpf: Separate XDP program load with xsk socket creation
Mariusz Dudek [Thu, 3 Dec 2020 09:05:45 +0000 (10:05 +0100)]
libbpf: Separate XDP program load with xsk socket creation

Add support for separation of eBPF program load and xsk socket
creation.

This is needed for use-case when you want to privide as little
privileges as possible to the data plane application that will
handle xsk socket creation and incoming traffic.

With this patch the data entity container can be run with only
CAP_NET_RAW capability to fulfill its purpose of creating xsk
socket and handling packages. In case your umem is larger or
equal process limit for MEMLOCK you need either increase the
limit or CAP_IPC_LOCK capability.

To resolve privileges issue two APIs are introduced:

- xsk_setup_xdp_prog - loads the built in XDP program. It can
also return xsks_map_fd which is needed by unprivileged process
to update xsks_map with AF_XDP socket "fd"

- xsk_socket__update_xskmap - inserts an AF_XDP socket into an xskmap
for a particular xsk_socket

Signed-off-by: Mariusz Dudek <mariuszx.dudek@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20201203090546.11976-2-mariuszx.dudek@intel.com
3 years agotools/resolve_btfids: Fix some error messages
Brendan Jackman [Thu, 3 Dec 2020 10:22:34 +0000 (10:22 +0000)]
tools/resolve_btfids: Fix some error messages

Add missing newlines and fix polarity of strerror argument.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/bpf/20201203102234.648540-1-jackmanb@google.com
3 years agoselftests/bpf: Copy file using read/write in local storage test
Stanislav Fomichev [Wed, 2 Dec 2020 17:49:47 +0000 (09:49 -0800)]
selftests/bpf: Copy file using read/write in local storage test

Splice (copy_file_range) doesn't work on all filesystems. I'm running
test kernels on top of my read-only disk image and it uses plan9 under the
hood. This prevents test_local_storage from successfully passing.

There is really no technical reason to use splice, so lets do
old-school read/write to copy file; this should work in all
environments.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201202174947.3621989-1-sdf@google.com
3 years agoMerge branch 'bpftool: improve split BTF support'
Alexei Starovoitov [Thu, 3 Dec 2020 18:17:27 +0000 (10:17 -0800)]
Merge branch 'bpftool: improve split BTF support'

Andrii Nakryiko says:

====================

Few follow up improvements to bpftool for split BTF support:
  - emit "name <anon>" for non-named BTFs in `bpftool btf show` command;
  - when dumping /sys/kernel/btf/<module> use /sys/kernel/btf/vmlinux as the
    base BTF, unless base BTF is explicitly specified with -B flag.

This patch set also adds btf__base_btf() getter to access base BTF of the
struct btf.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 years agotools/bpftool: Auto-detect split BTFs in common cases
Andrii Nakryiko [Wed, 2 Dec 2020 06:52:43 +0000 (22:52 -0800)]
tools/bpftool: Auto-detect split BTFs in common cases

In case of working with module's split BTF from /sys/kernel/btf/*,
auto-substitute /sys/kernel/btf/vmlinux as the base BTF. This makes using
bpftool with module BTFs faster and simpler.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201202065244.530571-4-andrii@kernel.org
3 years agolibbpf: Add base BTF accessor
Andrii Nakryiko [Wed, 2 Dec 2020 06:52:42 +0000 (22:52 -0800)]
libbpf: Add base BTF accessor

Add ability to get base BTF. It can be also used to check if BTF is split BTF.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201202065244.530571-3-andrii@kernel.org
3 years agotools/bpftool: Emit name <anon> for anonymous BTFs
Andrii Nakryiko [Wed, 2 Dec 2020 06:52:41 +0000 (22:52 -0800)]
tools/bpftool: Emit name <anon> for anonymous BTFs

For consistency of output, emit "name <anon>" for BTFs without the name. This
keeps output more consistent and obvious.

Suggested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201202065244.530571-2-andrii@kernel.org
3 years agouapi: fix statx attribute value overlap for DAX & MOUNT_ROOT
Eric Sandeen [Tue, 1 Dec 2020 23:21:40 +0000 (17:21 -0600)]
uapi: fix statx attribute value overlap for DAX & MOUNT_ROOT

STATX_ATTR_MOUNT_ROOT and STATX_ATTR_DAX got merged with the same value,
so one of them needs fixing.  Move STATX_ATTR_DAX.

While we're in here, clarify the value-matching scheme for some of the
attributes, and explain why the value for DAX does not match.

Fixes: 80340fe3605c ("statx: add mount_root")
Fixes: 712b2698e4c0 ("fs/stat: Define DAX statx attribute")
Link: https://lore.kernel.org/linux-fsdevel/7027520f-7c79-087e-1d00-743bdefa1a1e@redhat.com/
Link: https://lore.kernel.org/lkml/20201202214629.1563760-1-ira.weiny@intel.com/
Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: <stable@vger.kernel.org> # 5.8
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agopwm: sl28cpld: fix getting driver data in pwm callbacks
Uwe Kleine-König [Thu, 3 Dec 2020 08:41:42 +0000 (09:41 +0100)]
pwm: sl28cpld: fix getting driver data in pwm callbacks

Currently .get_state() and .apply() use dev_get_drvdata() on the struct
device related to the pwm chip.  This only works after .probe() called
platform_set_drvdata() which in this driver happens only after
pwmchip_add() and so comes possibly too late.

Instead of setting the driver data earlier use the traditional
container_of approach as this way the driver data is conceptually and
computational nearer.

Fixes: 9db33d221efc ("pwm: Add support for sl28cpld PWM controller")
Tested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agolib/syscall: fix syscall registers retrieval on 32-bit platforms
Willy Tarreau [Mon, 30 Nov 2020 07:36:48 +0000 (08:36 +0100)]
lib/syscall: fix syscall registers retrieval on 32-bit platforms

Lilith >_> and Claudio Bozzato of Cisco Talos security team reported
that collect_syscall() improperly casts the syscall registers to 64-bit
values leaking the uninitialized last 24 bytes on 32-bit platforms, that
are visible in /proc/self/syscall.

The cause is that info->data.args are u64 while syscall_get_arguments()
uses longs, as hinted by the bogus pointer cast in the function.

Let's just proceed like the other call places, by retrieving the
registers into an array of longs before assigning them to the caller's
array.  This was successfully tested on x86_64, i386 and ppc32.

Reference: CVE-2020-28588, TALOS-2020-1211
Fixes: 631b7abacd02 ("ptrace: Remove maxargs from task_current_syscall()")
Cc: Greg KH <greg@kroah.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Michael Ellerman <mpe@ellerman.id.au> (ppc32)
Signed-off-by: Willy Tarreau <w@1wt.eu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomacvlan: Support for high multicast packet rate
Thomas Karlsson [Wed, 2 Dec 2020 18:49:58 +0000 (19:49 +0100)]
macvlan: Support for high multicast packet rate

Background:
Broadcast and multicast packages are enqueued for later processing.
This queue was previously hardcoded to 1000.

This proved insufficient for handling very high packet rates.
This resulted in packet drops for multicast.
While at the same time unicast worked fine.

The change:
This patch make the queue length adjustable to accommodate
for environments with very high multicast packet rate.
But still keeps the default value of 1000 unless specified.

The queue length is specified as a request per macvlan
using the IFLA_MACVLAN_BC_QUEUE_LEN parameter.

The actual used queue length will then be the maximum of
any macvlan connected to the same port. The actual used
queue length for the port can be retrieved (read only)
by the IFLA_MACVLAN_BC_QUEUE_LEN_USED parameter for verification.

This will be followed up by a patch to iproute2
in order to adjust the parameter from userspace.

Signed-off-by: Thomas Karlsson <thomas.karlsson@paneda.se>
Link: https://lore.kernel.org/r/dd4673b2-7eab-edda-6815-85c67ce87f63@paneda.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agortw88: debug: Fix uninitialized memory in debugfs code
Dan Carpenter [Thu, 3 Dec 2020 08:43:37 +0000 (11:43 +0300)]
rtw88: debug: Fix uninitialized memory in debugfs code

This code does not ensure that the whole buffer is initialized and none
of the callers check for errors so potentially none of the buffer is
initialized.  Add a memset to eliminate this bug.

Fixes: e3037485c68e ("rtw88: new Realtek 802.11ac driver")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/X8ilOfVz3pf0T5ec@mwanda
3 years agoMerge branch 'switch to memcg-based memory accounting'
Alexei Starovoitov [Thu, 3 Dec 2020 02:28:09 +0000 (18:28 -0800)]
Merge branch 'switch to memcg-based memory accounting'

Roman Gushchin says:

====================

Currently bpf is using the memlock rlimit for the memory accounting.
This approach has its downsides and over time has created a significant
amount of problems:

1) The limit is per-user, but because most bpf operations are performed
   as root, the limit has a little value.

2) It's hard to come up with a specific maximum value. Especially because
   the counter is shared with non-bpf use cases (e.g. memlock()).
   Any specific value is either too low and creates false failures
   or is too high and useless.

3) Charging is not connected to the actual memory allocation. Bpf code
   should manually calculate the estimated cost and charge the counter,
   and then take care of uncharging, including all fail paths.
   It adds to the code complexity and makes it easy to leak a charge.

4) There is no simple way of getting the current value of the counter.
   We've used drgn for it, but it's far from being convenient.

5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had
   a function to "explain" this case for users.

6) rlimits are generally considered as (at least partially) obsolete.
   They do not provide a comprehensive system for the control of physical
   resources: memory, cpu, io etc. All resource control developments
   in the recent years were related to cgroups.

In order to overcome these problems let's switch to the memory cgroup-based
memory accounting of bpf objects. With the recent addition of the percpu
memory accounting, now it's possible to provide a comprehensive accounting
of the memory used by bpf programs and maps.

This approach has the following advantages:
1) The limit is per-cgroup and hierarchical. It's way more flexible and allows
   a better control over memory usage by different workloads.

2) The actual memory consumption is taken into account. It happens automatically
   on the allocation time if __GFP_ACCOUNT flags is passed. Uncharging is also
   performed automatically on releasing the memory. So the code on the bpf side
   becomes simpler and safer.

3) There is a simple way to get the current value and statistics.

Cgroup-based accounting adds new requirements:
1) The kernel config should have CONFIG_CGROUPS and CONFIG_MEMCG_KMEM enabled.
   These options are usually enabled, maybe excluding tiny builds for embedded
   devices.
2) The system should have a configured cgroup hierarchy, including reasonable
   memory limits and/or guarantees. Modern systems usually delegate this task
   to systemd or similar task managers.

Without meeting these requirements there are no limits on how much memory bpf
can use and a non-root user is able to hurt the system by allocating too much.
But because per-user rlimits do not provide a functional system to protect
and manage physical resources anyway, anyone who seriously depends on it,
should use cgroups.

When a bpf map is created, the memory cgroup of the process which creates
the map is recorded. Subsequently all memory allocation related to the bpf map
are charged to the same cgroup. It includes allocations made from interrupts
and by any processes. Bpf program memory is charged to the memory cgroup of
a process which loads the program.

The patchset consists of the following parts:
1) 4 mm patches are required on the mm side, otherwise vmallocs cannot be mapped
   to userspace
2) memcg-based accounting for various bpf objects: progs and maps
3) removal of the rlimit-based accounting
4) removal of rlimit adjustments in userspace samples

v9:
  - always charge the saved memory cgroup, by Daniel, Toke and Alexei
  - added bpf_map_kzalloc()
  - rebase and minor fixes

v8:
  - extended the cover letter to be more clear on new requirements, by Daniel
  - an approximate value is provided by map memlock info, by Alexei

v7:
  - introduced bpf_map_kmalloc_node() and bpf_map_alloc_percpu(), by Alexei
  - switched allocations made from an interrupt context to new helpers,
    by Daniel
  - rebase and minor fixes

v6:
  - rebased to the latest version of the remote charging API
  - fixed signatures, added acks

v5:
  - rebased to the latest version of the remote charging API
  - implemented kmem accounting from an interrupt context, by Shakeel
  - rebased to latest changes in mm allowed to map vmallocs to userspace
  - fixed a build issue in kselftests, by Alexei
  - fixed a use-after-free bug in bpf_map_free_deferred()
  - added bpf line info coverage, by Shakeel
  - split bpf map charging preparations into a separate patch

v4:
  - covered allocations made from an interrupt context, by Daniel
  - added some clarifications to the cover letter

v3:
  - droped the userspace part for further discussions/refinements,
    by Andrii and Song

v2:
  - fixed build issue, caused by the remaining rlimit-based accounting
    for sockhash maps
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
3 years agobpf: samples: Do not touch RLIMIT_MEMLOCK
Roman Gushchin [Tue, 1 Dec 2020 21:59:00 +0000 (13:59 -0800)]
bpf: samples: Do not touch RLIMIT_MEMLOCK

Since bpf is not using rlimit memlock for the memory accounting
and control, do not change the limit in sample applications.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-35-guro@fb.com
3 years agobpf: Eliminate rlimit-based memory accounting for bpf progs
Roman Gushchin [Tue, 1 Dec 2020 21:58:59 +0000 (13:58 -0800)]
bpf: Eliminate rlimit-based memory accounting for bpf progs

Do not use rlimit-based memory accounting for bpf progs. It has been
replaced with memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-34-guro@fb.com
3 years agobpf: Eliminate rlimit-based memory accounting infra for bpf maps
Roman Gushchin [Tue, 1 Dec 2020 21:58:58 +0000 (13:58 -0800)]
bpf: Eliminate rlimit-based memory accounting infra for bpf maps

Remove rlimit-based accounting infrastructure code, which is not used
anymore.

To provide a backward compatibility, use an approximation of the
bpf map memory footprint as a "memlock" value, available to a user
via map info. The approximation is based on the maximal number of
elements and key and value sizes.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-33-guro@fb.com
3 years agobpf: Eliminate rlimit-based memory accounting for bpf local storage maps
Roman Gushchin [Tue, 1 Dec 2020 21:58:57 +0000 (13:58 -0800)]
bpf: Eliminate rlimit-based memory accounting for bpf local storage maps

Do not use rlimit-based memory accounting for bpf local storage maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-32-guro@fb.com
3 years agobpf: Eliminate rlimit-based memory accounting for xskmap maps
Roman Gushchin [Tue, 1 Dec 2020 21:58:56 +0000 (13:58 -0800)]
bpf: Eliminate rlimit-based memory accounting for xskmap maps

Do not use rlimit-based memory accounting for xskmap maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-31-guro@fb.com
3 years agobpf: Eliminate rlimit-based memory accounting for stackmap maps
Roman Gushchin [Tue, 1 Dec 2020 21:58:55 +0000 (13:58 -0800)]
bpf: Eliminate rlimit-based memory accounting for stackmap maps

Do not use rlimit-based memory accounting for stackmap maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-30-guro@fb.com
3 years agobpf: Eliminate rlimit-based memory accounting for sockmap and sockhash maps
Roman Gushchin [Tue, 1 Dec 2020 21:58:54 +0000 (13:58 -0800)]
bpf: Eliminate rlimit-based memory accounting for sockmap and sockhash maps

Do not use rlimit-based memory accounting for sockmap and sockhash maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-29-guro@fb.com