Stanislav Fomichev [Tue, 28 Jun 2022 17:43:10 +0000 (10:43 -0700)]
tools/bpf: Sync btf_ids.h to tools
Has been slowly getting out of sync, let's update it.
resolve_btfids usage has been updated to match the header changes.
Also bring new parts of tools/include/uapi/linux/bpf.h.
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-8-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Stanislav Fomichev [Tue, 28 Jun 2022 17:43:09 +0000 (10:43 -0700)]
bpf: expose bpf_{g,s}etsockopt to lsm cgroup
I don't see how to make it nice without introducing btf id lists
for the hooks where these helpers are allowed. Some LSM hooks
work on the locked sockets, some are triggering early and
don't grab any locks, so have two lists for now:
1. LSM hooks which trigger under socket lock - minority of the hooks,
but ideal case for us, we can expose existing BTF-based helpers
2. LSM hooks which trigger without socket lock, but they trigger
early in the socket creation path where it should be safe to
do setsockopt without any locks
3. The rest are prohibited. I'm thinking that this use-case might
be a good gateway to sleeping lsm cgroup hooks in the future.
We can either expose lock/unlock operations (and add tracking
to the verifier) or have another set of bpf_setsockopt
wrapper that grab the locks and might sleep.
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-7-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Stanislav Fomichev [Tue, 28 Jun 2022 17:43:08 +0000 (10:43 -0700)]
bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP
We have two options:
1. Treat all BPF_LSM_CGROUP the same, regardless of attach_btf_id
2. Treat BPF_LSM_CGROUP+attach_btf_id as a separate hook point
I was doing (2) in the original patch, but switching to (1) here:
* bpf_prog_query returns all attached BPF_LSM_CGROUP programs
regardless of attach_btf_id
* attach_btf_id is exported via bpf_prog_info
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-6-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Stanislav Fomichev [Tue, 28 Jun 2022 17:43:07 +0000 (10:43 -0700)]
bpf: minimize number of allocated lsm slots per program
Previous patch adds 1:1 mapping between all 211 LSM hooks
and bpf_cgroup program array. Instead of reserving a slot per
possible hook, reserve 10 slots per cgroup for lsm programs.
Those slots are dynamically allocated on demand and reclaimed.
struct cgroup_bpf {
struct bpf_prog_array * effective[33]; /* 0 264 */
/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
struct hlist_head progs[33]; /* 264 264 */
/* --- cacheline 8 boundary (512 bytes) was 16 bytes ago --- */
u8 flags[33]; /* 528 33 */
/* XXX 7 bytes hole, try to pack */
struct list_head storages; /* 568 16 */
/* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
struct bpf_prog_array * inactive; /* 584 8 */
struct percpu_ref refcnt; /* 592 16 */
struct work_struct release_work; /* 608 72 */
/* size: 680, cachelines: 11, members: 7 */
/* sum members: 673, holes: 1, sum holes: 7 */
/* last cacheline: 40 bytes */
};
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-5-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Stanislav Fomichev [Tue, 28 Jun 2022 17:43:06 +0000 (10:43 -0700)]
bpf: per-cgroup lsm flavor
Allow attaching to lsm hooks in the cgroup context.
Attaching to per-cgroup LSM works exactly like attaching
to other per-cgroup hooks. New BPF_LSM_CGROUP is added
to trigger new mode; the actual lsm hook we attach to is
signaled via existing attach_btf_id.
For the hooks that have 'struct socket' or 'struct sock' as its first
argument, we use the cgroup associated with that socket. For the rest,
we use 'current' cgroup (this is all on default hierarchy == v2 only).
Note that for some hooks that work on 'struct sock' we still
take the cgroup from 'current' because some of them work on the socket
that hasn't been properly initialized yet.
Behind the scenes, we allocate a shim program that is attached
to the trampoline and runs cgroup effective BPF programs array.
This shim has some rudimentary ref counting and can be shared
between several programs attaching to the same lsm hook from
different cgroups.
Note that this patch bloats cgroup size because we add 211
cgroup_bpf_attach_type(s) for simplicity sake. This will be
addressed in the subsequent patch.
Also note that we only add non-sleepable flavor for now. To enable
sleepable use-cases, bpf_prog_run_array_cg has to grab trace rcu,
shim programs have to be freed via trace rcu, cgroup_bpf.effective
should be also trace-rcu-managed + maybe some other changes that
I'm not aware of.
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-4-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Stanislav Fomichev [Tue, 28 Jun 2022 17:43:05 +0000 (10:43 -0700)]
bpf: convert cgroup_bpf.progs to hlist
This lets us reclaim some space to be used by new cgroup lsm slots.
Before:
struct cgroup_bpf {
struct bpf_prog_array * effective[23]; /* 0 184 */
/* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */
struct list_head progs[23]; /* 184 368 */
/* --- cacheline 8 boundary (512 bytes) was 40 bytes ago --- */
u32 flags[23]; /* 552 92 */
/* XXX 4 bytes hole, try to pack */
/* --- cacheline 10 boundary (640 bytes) was 8 bytes ago --- */
struct list_head storages; /* 648 16 */
struct bpf_prog_array * inactive; /* 664 8 */
struct percpu_ref refcnt; /* 672 16 */
struct work_struct release_work; /* 688 32 */
/* size: 720, cachelines: 12, members: 7 */
/* sum members: 716, holes: 1, sum holes: 4 */
/* last cacheline: 16 bytes */
};
After:
struct cgroup_bpf {
struct bpf_prog_array * effective[23]; /* 0 184 */
/* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */
struct hlist_head progs[23]; /* 184 184 */
/* --- cacheline 5 boundary (320 bytes) was 48 bytes ago --- */
u8 flags[23]; /* 368 23 */
/* XXX 1 byte hole, try to pack */
/* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */
struct list_head storages; /* 392 16 */
struct bpf_prog_array * inactive; /* 408 8 */
struct percpu_ref refcnt; /* 416 16 */
struct work_struct release_work; /* 432 72 */
/* size: 504, cachelines: 8, members: 7 */
/* sum members: 503, holes: 1, sum holes: 1 */
/* last cacheline: 56 bytes */
};
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-3-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Stanislav Fomichev [Tue, 28 Jun 2022 17:43:04 +0000 (10:43 -0700)]
bpf: add bpf_func_t and trampoline helpers
I'll be adding lsm cgroup specific helpers that grab
trampoline mutex.
No functional changes.
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-2-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov [Tue, 28 Jun 2022 20:13:33 +0000 (13:13 -0700)]
Merge branch 'libbpf: remove deprecated APIs'
Andrii Nakryiko says:
====================
This patch set removes all the deprecated APIs in preparation for 1.0 release.
It also makes libbpf_set_strict_mode() a no-op (but keeps it to let per-1.0
applications buildable and dynamically linkable against libbpf 1.0 if they
were already libbpf-1.0 ready) and starts enforcing all the
behaviors that were previously opt-in through libbpf_set_strict_mode().
xsk.{c,h} parts that are now properly provided by libxdp ([0]) are still used
by xdpxceiver.c in selftest/bpf, so I've moved xsk.{c,h} with barely any
changes to under selftests/bpf.
Other than that, apart from removing all the LIBBPF_DEPRECATED-marked APIs,
there is a bunch of internal clean ups allowed by that. I've also "restored"
libbpf.map inheritance chain while removing all the deprecated APIs. I think
that's the right way to do this, as applications using libbpf as shared
library but not relying on any deprecated APIs (i.e., "good citizens" that
prepared for libbpf 1.0 release ahead of time to minimize disruption) should
be able to link both against 0.x and 1.x versions. Either way, it doesn't seem
to break anything and preserve a history on when each "surviving" API was
added.
[0] https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp
v1->v2:
- rebase on latest bpf-next now that Jiri's perf patches landed;
- fix xsk.o dependency in Makefile to ensure libbpf headers are installed
reliably.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko [Mon, 27 Jun 2022 21:15:27 +0000 (14:15 -0700)]
libbpf: fix up few libbpf.map problems
Seems like we missed to add 2 APIs to libbpf.map and another API was
misspelled. Fix it in libbpf.map.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-16-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko [Mon, 27 Jun 2022 21:15:26 +0000 (14:15 -0700)]
libbpf: enforce strict libbpf 1.0 behaviors
Remove support for legacy features and behaviors that previously had to
be disabled by calling libbpf_set_strict_mode():
- legacy BPF map definitions are not supported now;
- RLIMIT_MEMLOCK auto-setting, if necessary, is always on (but see
libbpf_set_memlock_rlim());
- program name is used for program pinning (instead of section name);
- cleaned up error returning logic;
- entry BPF programs should have SEC() always.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-15-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko [Mon, 27 Jun 2022 21:15:25 +0000 (14:15 -0700)]
selftests/bpf: remove last tests with legacy BPF map definitions
Libbpf 1.0 stops support legacy-style BPF map definitions. Selftests has
been migrated away from using legacy BPF map definitions except for two
selftests, to make sure that legacy functionality still worked in
pre-1.0 libbpf. Now it's time to let those tests go as libbpf 1.0 is
imminent.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-14-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko [Mon, 27 Jun 2022 21:15:24 +0000 (14:15 -0700)]
libbpf: clean up SEC() handling
Get rid of sloppy prefix logic and remove deprecated xdp_{devmap,cpumap}
sections.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-13-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko [Mon, 27 Jun 2022 21:15:23 +0000 (14:15 -0700)]
libbpf: remove internal multi-instance prog support
Clean up internals that had to deal with the possibility of
multi-instance bpf_programs. Libbpf 1.0 doesn't support this, so all
this is not necessary now and can be simplified.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-12-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko [Mon, 27 Jun 2022 21:15:22 +0000 (14:15 -0700)]
libbpf: cleanup LIBBPF_DEPRECATED_SINCE supporting macros for v0.x
Keep the LIBBPF_DEPRECATED_SINCE macro "framework" for future
deprecations, but clean up 0.x related helper macros.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-11-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko [Mon, 27 Jun 2022 21:15:21 +0000 (14:15 -0700)]
libbpf: remove multi-instance and custom private data APIs
Remove all the public APIs that are related to creating multi-instance
bpf_programs through custom preprocessing callback and generally working
with them.
Also remove all the bpf_{object,map,program}__[set_]priv() APIs.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-10-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko [Mon, 27 Jun 2022 21:15:20 +0000 (14:15 -0700)]
libbpf: remove most other deprecated high-level APIs
Remove a bunch of high-level bpf_object/bpf_map/bpf_program related
APIs. All the APIs related to private per-object/map/prog state,
program preprocessing callback, and generally everything multi-instance
related is removed in a separate patch.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko [Mon, 27 Jun 2022 21:15:19 +0000 (14:15 -0700)]
libbpf: remove prog_info_linear APIs
Remove prog_info_linear-related APIs previously used by perf.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko [Mon, 27 Jun 2022 21:15:18 +0000 (14:15 -0700)]
libbpf: clean up perfbuf APIs
Remove deprecated perfbuf APIs and clean up opts structs.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko [Mon, 27 Jun 2022 21:15:17 +0000 (14:15 -0700)]
libbpf: remove deprecated BTF APIs
Get rid of deprecated BTF-related APIs.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko [Mon, 27 Jun 2022 21:15:16 +0000 (14:15 -0700)]
libbpf: remove deprecated probing APIs
Get rid of deprecated feature-probing APIs.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko [Mon, 27 Jun 2022 21:15:15 +0000 (14:15 -0700)]
libbpf: remove deprecated XDP APIs
Get rid of deprecated bpf_set_link*() and bpf_get_link*() APIs.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko [Mon, 27 Jun 2022 21:15:14 +0000 (14:15 -0700)]
libbpf: remove deprecated low-level APIs
Drop low-level APIs as well as high-level (and very confusingly named)
BPF object loading bpf_prog_load_xattr() and bpf_prog_load_deprecated()
APIs.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko [Mon, 27 Jun 2022 21:15:13 +0000 (14:15 -0700)]
libbpf: move xsk.{c,h} into selftests/bpf
Remove deprecated xsk APIs from libbpf. But given we have selftests
relying on this, move those files (with minimal adjustments to make them
compilable) under selftests/bpf.
We also remove all the removed APIs from libbpf.map, while overall
keeping version inheritance chain, as most APIs are backwards
compatible so there is no need to reassign them as LIBBPF_1.0.0 versions.
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220627211527.2245459-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
John Fastabend [Tue, 28 Jun 2022 03:58:03 +0000 (20:58 -0700)]
bpf: Fix sockmap calling sleepable function in teardown path
syzbot reproduced the bug ...
BUG: sleeping function called from invalid context at kernel/workqueue.c:3010
... with the following stack trace fragment ...
start_flush_work kernel/workqueue.c:3010 [inline]
__flush_work+0x109/0xb10 kernel/workqueue.c:3074
__cancel_work_timer+0x3f9/0x570 kernel/workqueue.c:3162
sk_psock_stop+0x4cb/0x630 net/core/skmsg.c:802
sock_map_destroy+0x333/0x760 net/core/sock_map.c:1581
inet_csk_destroy_sock+0x196/0x440 net/ipv4/inet_connection_sock.c:1130
__tcp_close+0xd5b/0x12b0 net/ipv4/tcp.c:2897
tcp_close+0x29/0xc0 net/ipv4/tcp.c:2909
... introduced by
d8616ee2affc. Do a quick trace of the code path and the
bug is obvious:
inet_csk_destroy_sock(sk)
sk_prot->destroy(sk); <--- sock_map_destroy
sk_psock_stop(, true); <--- true so cancel workqueue
cancel_work_sync() <--- splat, because *_bh_disable()
We can not call cancel_work_sync() from inside destroy path. So mark
the sk_psock_stop call to skip this cancel_work_sync(). This will avoid
the BUG, but means we may run sk_psock_backlog after or during the
destroy op. We zapped the ingress_skb queue in sk_psock_stop (safe to
do with local_bh_disable) so its empty and the sk_psock_backlog work
item will not find any pkts to process here. However, because we are
not going to wait for it or clear its ->state its possible it kicks off
or is already running. This should be 'safe' up until psock drops its
refcnt to psock->sk. The sock_put() that drops this reference is only
done at psock destroy time from sk_psock_destroy(). This is done through
workqueue when sk_psock_drop() is called on psock refnt reaches 0.
And importantly sk_psock_destroy() does a cancel_work_sync(). So trivial
fix works.
I've had hit or miss luck reproducing this caught it once or twice with
the provided reproducer when running with many runners. However, syzkaller
is very good at reproducing so relying on syzkaller to verify fix.
Fixes:
d8616ee2affc ("bpf, sockmap: Fix sk->sk_forward_alloc warn_on in sk_stream_kill_queues")
Reported-by: syzbot+140186ceba0c496183bc@syzkaller.appspotmail.com
Suggested-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Wang Yufen <wangyufen@huawei.com>
Link: https://lore.kernel.org/bpf/20220628035803.317876-1-john.fastabend@gmail.com
Daniel Müller [Thu, 23 Jun 2022 18:29:34 +0000 (18:29 +0000)]
bpf: Merge "types_are_compat" logic into relo_core.c
BPF type compatibility checks (bpf_core_types_are_compat()) are
currently duplicated between kernel and user space. That's a historical
artifact more than intentional doing and can lead to subtle bugs where
one implementation is adjusted but another is forgotten.
That happened with the enum64 work, for example, where the libbpf side
was changed (commit
23b2a3a8f63a ("libbpf: Add enum64 relocation
support")) to use the btf_kind_core_compat() helper function but the
kernel side was not (commit
6089fb325cf7 ("bpf: Add btf enum64
support")).
This patch addresses both the duplication issue, by merging both
implementations and moving them into relo_core.c, and fixes the alluded
to kind check (by giving preference to libbpf's already adjusted logic).
For discussion of the topic, please refer to:
https://lore.kernel.org/bpf/CAADnVQKbWR7oarBdewgOBZUPzryhRYvEbkhyPJQHHuxq=0K1gw@mail.gmail.com/T/#mcc99f4a33ad9a322afaf1b9276fb1f0b7add9665
Changelog:
v1 -> v2:
- limited libbpf recursion limit to 32
- changed name to __bpf_core_types_are_compat
- included warning previously present in libbpf version
- merged kernel and user space changes into a single patch
Signed-off-by: Daniel Müller <deso@posteo.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220623182934.2582827-1-deso@posteo.net
Shahab Vahedi [Fri, 24 Jun 2022 14:14:11 +0000 (14:14 +0000)]
bpf, docs: Fix the code formatting in instruction-set
A minor typo fix to include "| BPF_LD" into its previous
code phrase:
``BPF_IND`` | BPF_LD --> ``BPF_IND | BPF_LD``
Signed-off-by: Shahab Vahedi <shahab@synopsys.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/b6120b31-3d1d-bf2d-2f2a-aa768d91257b@synopsys.com
Andrii Nakryiko [Fri, 24 Jun 2022 20:36:23 +0000 (13:36 -0700)]
Merge branch 'perf tools: Fix prologue generation'
Jiri Olsa says:
====================
hi,
sending change we discussed some time ago [1] to get rid of
some deprecated functions we use in perf prologue code.
Despite the gloomy discussion I think the final code does
not look that bad ;-)
This patchset removes following libbpf functions from perf:
bpf_program__set_prep
bpf_program__nth_fd
struct bpf_prog_prep_result
v5 changes:
- squashed patches together so we don't break bisection [Arnaldo]
v4 changes:
- fix typo [Andrii]
v3 changes:
- removed R0/R1 zero init in libbpf_prog_prepare_load_fn,
because it's not needed [Andrii]
- rebased/post on top of bpf-next/master which now has
all the needed perf/core changes
v2 changes:
- use fallback section prog handler, so we don't need to
use section prefix [Andrii]
- realloc prog->insns array in bpf_program__set_insns [Andrii]
- squash patch 1 from previous version with
bpf_program__set_insns change [Daniel]
- patch 3 already merged [Arnaldo]
- added more comments
thanks,
jirka
[1] https://lore.kernel.org/bpf/CAEf4BzaiBO3_617kkXZdYJ8hS8YF--ZLgapNbgeeEJ-pY0H88g@mail.gmail.com/
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Jiri Olsa [Thu, 16 Jun 2022 20:22:14 +0000 (22:22 +0200)]
perf tools: Rework prologue generation code
Some functions we use for bpf prologue generation are going to be
deprecated. This change reworks current code not to use them.
We need to replace following functions/struct:
bpf_program__set_prep
bpf_program__nth_fd
struct bpf_prog_prep_result
Currently we use bpf_program__set_prep to hook perf callback before
program is loaded and provide new instructions with the prologue.
We replace this function/ality by taking instructions for specific
program, attaching prologue to them and load such new ebpf programs
with prologue using separate bpf_prog_load calls (outside libbpf
load machinery).
Before we can take and use program instructions, we need libbpf to
actually load it. This way we get the final shape of its instructions
with all relocations and verifier adjustments).
There's one glitch though.. perf kprobe program already assumes
generated prologue code with proper values in argument registers,
so loading such program directly will fail in the verifier.
That's where the fallback pre-load handler fits in and prepends
the initialization code to the program. Once such program is loaded
we take its instructions, cut off the initialization code and prepend
the prologue.
I know.. sorry ;-)
To have access to the program when loading this patch adds support to
register 'fallback' section handler to take care of perf kprobe programs.
The fallback means that it handles any section definition besides the
ones that libbpf handles.
The handler serves two purposes:
- allows perf programs to have special arguments in section name
- allows perf to use pre-load callback where we can attach init
code (zeroing all argument registers) to each perf program
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/bpf/20220616202214.70359-2-jolsa@kernel.org
Eduard Zingerman [Fri, 24 Jun 2022 02:06:13 +0000 (05:06 +0300)]
selftest/bpf: Test for use-after-free bug fix in inline_bpf_loop
This test verifies that bpf_loop() inlining works as expected when
address of `env->prog` is updated. This address is updated upon BPF
program reallocation.
Reallocation is handled by bpf_prog_realloc(), which reuses old memory
if page boundary is not crossed. The value of `len` in the test is
chosen to cross this boundary on bpf_loop() patching.
Verify that the use-after-free bug in inline_bpf_loop() reported by
Dan Carpenter is fixed.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220624020613.548108-3-eddyz87@gmail.com
Eduard Zingerman [Fri, 24 Jun 2022 02:06:12 +0000 (05:06 +0300)]
bpf: Fix for use-after-free bug in inline_bpf_loop
As reported by Dan Carpenter, the following statements in inline_bpf_loop()
might cause a use-after-free bug:
struct bpf_prog *new_prog;
// ...
new_prog = bpf_patch_insn_data(env, position, insn_buf, *cnt);
// ...
env->prog->insnsi[call_insn_offset].imm = callback_offset;
The bpf_patch_insn_data() might free the memory used by env->prog.
Fixes:
1ade23711971 ("bpf: Inline calls to bpf_loop when callback is known")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220624020613.548108-2-eddyz87@gmail.com
Simon Wang [Wed, 22 Jun 2022 03:19:23 +0000 (23:19 -0400)]
bpf: Replace hard-coded 0 with BPF_K in check_alu_op
Enhance readability a bit.
Signed-off-by: Simon Wang <wangchuanguo@inspur.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220622031923.65692-1-wangchuanguo@inspur.com
Jörn-Thorben Hinz [Tue, 21 Jun 2022 07:01:16 +0000 (09:01 +0200)]
selftests/bpf: Fix rare segfault in sock_fields prog test
test_sock_fields__detach() got called with a null pointer here when one
of the CHECKs or ASSERTs up to the test_sock_fields__open_and_load()
call resulted in a jump to the "done" label.
A skeletons *__detach() is not safe to call with a null pointer, though.
This led to a segfault.
Go the easy route and only call test_sock_fields__destroy() which is
null-pointer safe and includes detaching.
Came across this while looking[1] to introduce the usage of
bpf_tcp_helpers.h (included in progs/test_sock_fields.c) together with
vmlinux.h.
[1] https://lore.kernel.org/bpf/
629bc069dd807d7ac646f836e9dca28bbc1108e2.camel@mailbox.tu-berlin.de/
Fixes:
8f50f16ff39d ("selftests/bpf: Extend verifier and bpf_sock tests for dst_port loads")
Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220621070116.307221-1-jthinz@mailbox.tu-berlin.de
Alexei Starovoitov [Thu, 23 Jun 2022 16:49:58 +0000 (09:49 -0700)]
Merge branch 'Align BPF TCP CCs implementing cong_control() with non-BPF CCs'
Jörn-Thorben Hinz says:
====================
This series corrects some inconveniences for a BPF TCP CC that
implements and uses tcp_congestion_ops.cong_control(). Until now, such a
CC did not have all necessary write access to struct sock and
unnecessarily needed to implement cong_avoid().
v4:
- Remove braces around single statements after if
- Don’t check pointer passed to bpf_link__destroy()
v3:
- Add a selftest writing sk_pacing_*
- Add a selftest with incomplete tcp_congestion_ops
- Add a selftest with unsupported get_info()
- Remove an unused variable
- Reword a comment about reg() in bpf_struct_ops_map_update_elem()
v2:
- Drop redundant check for required functions and just rely on
tcp_register_congestion_control() (Martin KaFai Lau)
====================
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jörn-Thorben Hinz [Wed, 22 Jun 2022 19:12:27 +0000 (21:12 +0200)]
selftests/bpf: Test a BPF CC implementing the unsupported get_info()
Test whether a TCP CC implemented in BPF providing get_info() is
rejected correctly. get_info() is unsupported in a BPF CC. The check for
required functions in a BPF CC has moved, this test ensures unsupported
functions are still rejected correctly.
Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220622191227.898118-6-jthinz@mailbox.tu-berlin.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jörn-Thorben Hinz [Wed, 22 Jun 2022 19:12:26 +0000 (21:12 +0200)]
selftests/bpf: Test an incomplete BPF CC
Test whether a TCP CC implemented in BPF providing neither cong_avoid()
nor cong_control() is correctly rejected. This check solely depends on
tcp_register_congestion_control() now, which is invoked during
bpf_map__attach_struct_ops().
Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
Link: https://lore.kernel.org/r/20220622191227.898118-5-jthinz@mailbox.tu-berlin.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jörn-Thorben Hinz [Wed, 22 Jun 2022 19:12:25 +0000 (21:12 +0200)]
selftests/bpf: Test a BPF CC writing sk_pacing_*
Test whether a TCP CC implemented in BPF is allowed to write
sk_pacing_rate and sk_pacing_status in struct sock. This is needed when
cong_control() is implemented and used.
Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
Link: https://lore.kernel.org/r/20220622191227.898118-4-jthinz@mailbox.tu-berlin.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jörn-Thorben Hinz [Wed, 22 Jun 2022 19:12:24 +0000 (21:12 +0200)]
bpf: Require only one of cong_avoid() and cong_control() from a TCP CC
Remove the check for required and optional functions in a struct
tcp_congestion_ops from bpf_tcp_ca.c. Rely on
tcp_register_congestion_control() to reject a BPF CC that does not
implement all required functions, as it will do for a non-BPF CC.
When a CC implements tcp_congestion_ops.cong_control(), the alternate
cong_avoid() is not in use in the TCP stack. Previously, a BPF CC was
still forced to implement cong_avoid() as a no-op since it was
non-optional in bpf_tcp_ca.c.
Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220622191227.898118-3-jthinz@mailbox.tu-berlin.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jörn-Thorben Hinz [Wed, 22 Jun 2022 19:12:23 +0000 (21:12 +0200)]
bpf: Allow a TCP CC to write sk_pacing_rate and sk_pacing_status
A CC that implements tcp_congestion_ops.cong_control() should be able to
control sk_pacing_rate and set sk_pacing_status, since
tcp_update_pacing_rate() is never called in this case. A built-in CC or
one from a kernel module is already able to write to both struct sock
members. For a BPF program, write access has not been allowed, yet.
Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220622191227.898118-2-jthinz@mailbox.tu-berlin.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jian Shen [Wed, 22 Jun 2022 13:50:02 +0000 (21:50 +0800)]
test_bpf: fix incorrect netdev features
The prototype of .features is netdev_features_t, it should use
NETIF_F_LLTX and NETIF_F_HW_VLAN_STAG_TX, not NETIF_F_LLTX_BIT
and NETIF_F_HW_VLAN_STAG_TX_BIT.
Fixes:
cf204a718357 ("bpf, testing: Introduce 'gso_linear_no_head_frag' skb_segment test")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20220622135002.8263-1-shenjian15@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Mon, 20 Jun 2022 22:25:54 +0000 (15:25 -0700)]
selftests/bpf: Add benchmark for local_storage get
Add a benchmarks to demonstrate the performance cliff for local_storage
get as the number of local_storage maps increases beyond current
local_storage implementation's cache size.
"sequential get" and "interleaved get" benchmarks are added, both of
which do many bpf_task_storage_get calls on sets of task local_storage
maps of various counts, while considering a single specific map to be
'important' and counting task_storage_gets to the important map
separately in addition to normal 'hits' count of all gets. Goal here is
to mimic scenario where a particular program using one map - the
important one - is running on a system where many other local_storage
maps exist and are accessed often.
While "sequential get" benchmark does bpf_task_storage_get for map 0, 1,
..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4
bpf_task_storage_gets for the important map for every 10 map gets. This
is meant to highlight performance differences when important map is
accessed far more frequently than non-important maps.
A "hashmap control" benchmark is also included for easy comparison of
standard bpf hashmap lookup vs local_storage get. The benchmark is
similar to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
instead of local storage. Only one inner map is created - a hashmap
meant to hold tid -> data mapping for all tasks. Size of the hashmap is
hardcoded to my system's PID_MAX_LIMIT (4,194,304). The number of these
keys which are actually fetched as part of the benchmark is
configurable.
Addition of this benchmark is inspired by conversation with Alexei in a
previous patchset's thread [0], which highlighted the need for such a
benchmark to motivate and validate improvements to local_storage
implementation. My approach in that series focused on improving
performance for explicitly-marked 'important' maps and was rejected
with feedback to make more generally-applicable improvements while
avoiding explicitly marking maps as important. Thus the benchmark
reports both general and important-map-focused metrics, so effect of
future work on both is clear.
Regarding the benchmark results. On a powerful system (Skylake, 20
cores, 256gb ram):
Hashmap Control
===============
num keys: 10
hashmap (control) sequential get: hits throughput: 20.900 ± 0.334 M ops/s, hits latency: 47.847 ns/op, important_hits throughput: 20.900 ± 0.334 M ops/s
num keys: 1000
hashmap (control) sequential get: hits throughput: 13.758 ± 0.219 M ops/s, hits latency: 72.683 ns/op, important_hits throughput: 13.758 ± 0.219 M ops/s
num keys: 10000
hashmap (control) sequential get: hits throughput: 6.995 ± 0.034 M ops/s, hits latency: 142.959 ns/op, important_hits throughput: 6.995 ± 0.034 M ops/s
num keys: 100000
hashmap (control) sequential get: hits throughput: 4.452 ± 0.371 M ops/s, hits latency: 224.635 ns/op, important_hits throughput: 4.452 ± 0.371 M ops/s
num keys: 4194304
hashmap (control) sequential get: hits throughput: 3.043 ± 0.033 M ops/s, hits latency: 328.587 ns/op, important_hits throughput: 3.043 ± 0.033 M ops/s
Local Storage
=============
num_maps: 1
local_storage cache sequential get: hits throughput: 47.298 ± 0.180 M ops/s, hits latency: 21.142 ns/op, important_hits throughput: 47.298 ± 0.180 M ops/s
local_storage cache interleaved get: hits throughput: 55.277 ± 0.888 M ops/s, hits latency: 18.091 ns/op, important_hits throughput: 55.277 ± 0.888 M ops/s
num_maps: 10
local_storage cache sequential get: hits throughput: 40.240 ± 0.802 M ops/s, hits latency: 24.851 ns/op, important_hits throughput: 4.024 ± 0.080 M ops/s
local_storage cache interleaved get: hits throughput: 48.701 ± 0.722 M ops/s, hits latency: 20.533 ns/op, important_hits throughput: 17.393 ± 0.258 M ops/s
num_maps: 16
local_storage cache sequential get: hits throughput: 44.515 ± 0.708 M ops/s, hits latency: 22.464 ns/op, important_hits throughput: 2.782 ± 0.044 M ops/s
local_storage cache interleaved get: hits throughput: 49.553 ± 2.260 M ops/s, hits latency: 20.181 ns/op, important_hits throughput: 15.767 ± 0.719 M ops/s
num_maps: 17
local_storage cache sequential get: hits throughput: 38.778 ± 0.302 M ops/s, hits latency: 25.788 ns/op, important_hits throughput: 2.284 ± 0.018 M ops/s
local_storage cache interleaved get: hits throughput: 43.848 ± 1.023 M ops/s, hits latency: 22.806 ns/op, important_hits throughput: 13.349 ± 0.311 M ops/s
num_maps: 24
local_storage cache sequential get: hits throughput: 19.317 ± 0.568 M ops/s, hits latency: 51.769 ns/op, important_hits throughput: 0.806 ± 0.024 M ops/s
local_storage cache interleaved get: hits throughput: 24.397 ± 0.272 M ops/s, hits latency: 40.989 ns/op, important_hits throughput: 6.863 ± 0.077 M ops/s
num_maps: 32
local_storage cache sequential get: hits throughput: 13.333 ± 0.135 M ops/s, hits latency: 75.000 ns/op, important_hits throughput: 0.417 ± 0.004 M ops/s
local_storage cache interleaved get: hits throughput: 16.898 ± 0.383 M ops/s, hits latency: 59.178 ns/op, important_hits throughput: 4.717 ± 0.107 M ops/s
num_maps: 100
local_storage cache sequential get: hits throughput: 6.360 ± 0.107 M ops/s, hits latency: 157.233 ns/op, important_hits throughput: 0.064 ± 0.001 M ops/s
local_storage cache interleaved get: hits throughput: 7.303 ± 0.362 M ops/s, hits latency: 136.930 ns/op, important_hits throughput: 1.907 ± 0.094 M ops/s
num_maps: 1000
local_storage cache sequential get: hits throughput: 0.452 ± 0.010 M ops/s, hits latency: 2214.022 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
local_storage cache interleaved get: hits throughput: 0.542 ± 0.007 M ops/s, hits latency: 1843.341 ns/op, important_hits throughput: 0.136 ± 0.002 M ops/s
Looking at the "sequential get" results, it's clear that as the
number of task local_storage maps grows beyond the current cache size
(16), there's a significant reduction in hits throughput. Note that
current local_storage implementation assigns a cache_idx to maps as they
are created. Since "sequential get" is creating maps 0..n in order and
then doing bpf_task_storage_get calls in the same order, the benchmark
is effectively ensuring that a map will not be in cache when the program
tries to access it.
For "interleaved get" results, important-map hits throughput is greatly
increased as the important map is more likely to be in cache by virtue
of being accessed far more frequently. Throughput still reduces as #
maps increases, though.
To get a sense of the overhead of the benchmark program, I
commented out bpf_task_storage_get/bpf_map_lookup_elem in
local_storage_bench.c and ran the benchmark on the same host as the
'real' run. Results:
Hashmap Control
===============
num keys: 10
hashmap (control) sequential get: hits throughput: 54.288 ± 0.655 M ops/s, hits latency: 18.420 ns/op, important_hits throughput: 54.288 ± 0.655 M ops/s
num keys: 1000
hashmap (control) sequential get: hits throughput: 52.913 ± 0.519 M ops/s, hits latency: 18.899 ns/op, important_hits throughput: 52.913 ± 0.519 M ops/s
num keys: 10000
hashmap (control) sequential get: hits throughput: 53.480 ± 1.235 M ops/s, hits latency: 18.699 ns/op, important_hits throughput: 53.480 ± 1.235 M ops/s
num keys: 100000
hashmap (control) sequential get: hits throughput: 54.982 ± 1.902 M ops/s, hits latency: 18.188 ns/op, important_hits throughput: 54.982 ± 1.902 M ops/s
num keys: 4194304
hashmap (control) sequential get: hits throughput: 50.858 ± 0.707 M ops/s, hits latency: 19.662 ns/op, important_hits throughput: 50.858 ± 0.707 M ops/s
Local Storage
=============
num_maps: 1
local_storage cache sequential get: hits throughput: 110.990 ± 4.828 M ops/s, hits latency: 9.010 ns/op, important_hits throughput: 110.990 ± 4.828 M ops/s
local_storage cache interleaved get: hits throughput: 161.057 ± 4.090 M ops/s, hits latency: 6.209 ns/op, important_hits throughput: 161.057 ± 4.090 M ops/s
num_maps: 10
local_storage cache sequential get: hits throughput: 112.930 ± 1.079 M ops/s, hits latency: 8.855 ns/op, important_hits throughput: 11.293 ± 0.108 M ops/s
local_storage cache interleaved get: hits throughput: 115.841 ± 2.088 M ops/s, hits latency: 8.633 ns/op, important_hits throughput: 41.372 ± 0.746 M ops/s
num_maps: 16
local_storage cache sequential get: hits throughput: 115.653 ± 0.416 M ops/s, hits latency: 8.647 ns/op, important_hits throughput: 7.228 ± 0.026 M ops/s
local_storage cache interleaved get: hits throughput: 138.717 ± 1.649 M ops/s, hits latency: 7.209 ns/op, important_hits throughput: 44.137 ± 0.525 M ops/s
num_maps: 17
local_storage cache sequential get: hits throughput: 112.020 ± 1.649 M ops/s, hits latency: 8.927 ns/op, important_hits throughput: 6.598 ± 0.097 M ops/s
local_storage cache interleaved get: hits throughput: 128.089 ± 1.960 M ops/s, hits latency: 7.807 ns/op, important_hits throughput: 38.995 ± 0.597 M ops/s
num_maps: 24
local_storage cache sequential get: hits throughput: 92.447 ± 5.170 M ops/s, hits latency: 10.817 ns/op, important_hits throughput: 3.855 ± 0.216 M ops/s
local_storage cache interleaved get: hits throughput: 128.844 ± 2.808 M ops/s, hits latency: 7.761 ns/op, important_hits throughput: 36.245 ± 0.790 M ops/s
num_maps: 32
local_storage cache sequential get: hits throughput: 102.042 ± 1.462 M ops/s, hits latency: 9.800 ns/op, important_hits throughput: 3.194 ± 0.046 M ops/s
local_storage cache interleaved get: hits throughput: 126.577 ± 1.818 M ops/s, hits latency: 7.900 ns/op, important_hits throughput: 35.332 ± 0.507 M ops/s
num_maps: 100
local_storage cache sequential get: hits throughput: 111.327 ± 1.401 M ops/s, hits latency: 8.983 ns/op, important_hits throughput: 1.113 ± 0.014 M ops/s
local_storage cache interleaved get: hits throughput: 131.327 ± 1.339 M ops/s, hits latency: 7.615 ns/op, important_hits throughput: 34.302 ± 0.350 M ops/s
num_maps: 1000
local_storage cache sequential get: hits throughput: 101.978 ± 0.563 M ops/s, hits latency: 9.806 ns/op, important_hits throughput: 0.102 ± 0.001 M ops/s
local_storage cache interleaved get: hits throughput: 141.084 ± 1.098 M ops/s, hits latency: 7.088 ns/op, important_hits throughput: 35.430 ± 0.276 M ops/s
Adjusting for overhead, latency numbers for "hashmap control" and
"sequential get" are:
hashmap_control_1k: ~53.8ns
hashmap_control_10k: ~124.2ns
hashmap_control_100k: ~206.5ns
sequential_get_1: ~12.1ns
sequential_get_10: ~16.0ns
sequential_get_16: ~13.8ns
sequential_get_17: ~16.8ns
sequential_get_24: ~40.9ns
sequential_get_32: ~65.2ns
sequential_get_100: ~148.2ns
sequential_get_1000: ~2204ns
Clearly demonstrating a cliff.
In the discussion for v1 of this patch, Alexei noted that local_storage
was 2.5x faster than a large hashmap when initially implemented [1]. The
benchmark results show that local_storage is 5-10x faster: a
long-running BPF application putting some pid-specific info into a
hashmap for each pid it sees will probably see on the order of 10-100k
pids. Bench numbers for hashmaps of this size are ~10x slower than
sequential_get_16, but as the number of local_storage maps grows far
past local_storage cache size the performance advantage shrinks and
eventually reverses.
When running the benchmarks it may be necessary to bump 'open files'
ulimit for a successful run.
[0]: https://lore.kernel.org/all/
20220420002143.1096548-1-davemarchevsky@fb.com
[1]: https://lore.kernel.org/bpf/
20220511173305.ftldpn23m4ski3d3@MBP-
98dd607d3435.dhcp.thefacebook.com/
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20220620222554.270578-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andy Gospodarek [Tue, 21 Jun 2022 17:54:02 +0000 (17:54 +0000)]
samples/bpf: fixup some tools to be able to support xdp multibuffer
This changes the section name for the bpf program embedded in these
files to "xdp.frags" to allow the programs to be loaded on drivers that
are using an MTU greater than PAGE_SIZE. Rather than directly accessing
the buffers, the packet data is now accessed via xdp helper functions to
provide an example for those who may need to write more complex
programs.
v2: remove new unnecessary variable
Signed-off-by: Andy Gospodarek <gospo@broadcom.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/20220621175402.35327-1-gospo@broadcom.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jakub Sitnicki [Fri, 17 Jun 2022 10:57:35 +0000 (12:57 +0200)]
bpf, arm64: Keep tail call count across bpf2bpf calls
Today doing a BPF tail call after a BPF to BPF call, that is from a
subprogram, is allowed only by the x86-64 BPF JIT. Mixing these features
requires support from JIT. Tail call count has to be tracked through BPF to
BPF calls, as well as through BPF tail calls to prevent unbounded chains of
tail calls.
arm64 BPF JIT stores the tail call count (TCC) in a dedicated
register (X26). This makes it easier to support bpf2bpf calls mixed with
tail calls than on x86 platform.
In order to keep the tail call count in tact throughout bpf2bpf calls, all
we need to do is tweak the program prologue generator. When emitting
prologue for a subprogram, we skip the block that initializes the tail call
count and emits a jump pad for the tail call.
With this change, a sample execution flow where a bpf2bpf call is followed
by a tail call would look like so:
int entry(struct __sk_buff *skb):
0xffffffc0090151d4: paciasp
0xffffffc0090151d8: stp x29, x30, [sp, #-16]!
0xffffffc0090151dc: mov x29, sp
0xffffffc0090151e0: stp x19, x20, [sp, #-16]!
0xffffffc0090151e4: stp x21, x22, [sp, #-16]!
0xffffffc0090151e8: stp x25, x26, [sp, #-16]!
0xffffffc0090151ec: stp x27, x28, [sp, #-16]!
0xffffffc0090151f0: mov x25, sp
0xffffffc0090151f4: mov x26, #0x0 // <- init TCC only
0xffffffc0090151f8: bti j // in main prog
0xffffffc0090151fc: sub x27, x25, #0x0
0xffffffc009015200: sub sp, sp, #0x10
0xffffffc009015204: mov w1, #0x0
0xffffffc009015208: mov x10, #0xffffffffffffffff
0xffffffc00901520c: strb w1, [x25, x10]
0xffffffc009015210: mov x10, #0xffffffffffffd25c
0xffffffc009015214: movk x10, #0x902, lsl #16
0xffffffc009015218: movk x10, #0xffc0, lsl #32
0xffffffc00901521c: blr x10 -------------------. // bpf2bpf call
0xffffffc009015220: add x7, x0, #0x0 <-------------.
0xffffffc009015224: add sp, sp, #0x10 | |
0xffffffc009015228: ldp x27, x28, [sp], #16 | |
0xffffffc00901522c: ldp x25, x26, [sp], #16 | |
0xffffffc009015230: ldp x21, x22, [sp], #16 | |
0xffffffc009015234: ldp x19, x20, [sp], #16 | |
0xffffffc009015238: ldp x29, x30, [sp], #16 | |
0xffffffc00901523c: add x0, x7, #0x0 | |
0xffffffc009015240: autiasp | |
0xffffffc009015244: ret | |
| |
int subprog_tail(struct __sk_buff *skb): | |
0xffffffc00902d25c: paciasp <----------------------' |
0xffffffc00902d260: stp x29, x30, [sp, #-16]! |
0xffffffc00902d264: mov x29, sp |
0xffffffc00902d268: stp x19, x20, [sp, #-16]! |
0xffffffc00902d26c: stp x21, x22, [sp, #-16]! |
0xffffffc00902d270: stp x25, x26, [sp, #-16]! |
0xffffffc00902d274: stp x27, x28, [sp, #-16]! |
0xffffffc00902d278: mov x25, sp |
0xffffffc00902d27c: sub x27, x25, #0x0 |
0xffffffc00902d280: sub sp, sp, #0x10 | // <- end of prologue, notice:
0xffffffc00902d284: add x19, x0, #0x0 | // 1) TCC not touched, and
0xffffffc00902d288: mov w0, #0x1 | // 2) no tail call jump pad
0xffffffc00902d28c: mov x10, #0xfffffffffffffffc |
0xffffffc00902d290: str w0, [x25, x10] |
0xffffffc00902d294: mov x20, #0xffffff80ffffffff |
0xffffffc00902d298: movk x20, #0xc033, lsl #16 |
0xffffffc00902d29c: movk x20, #0x4e00 |
0xffffffc00902d2a0: add x0, x19, #0x0 |
0xffffffc00902d2a4: add x1, x20, #0x0 |
0xffffffc00902d2a8: mov x2, #0x0 |
0xffffffc00902d2ac: mov w10, #0x24 |
0xffffffc00902d2b0: ldr w10, [x1, x10] |
0xffffffc00902d2b4: add w2, w2, #0x0 |
0xffffffc00902d2b8: cmp w2, w10 |
0xffffffc00902d2bc: b.cs 0xffffffc00902d2f8 |
0xffffffc00902d2c0: mov w10, #0x21 |
0xffffffc00902d2c4: cmp x26, x10 | // TCC >= MAX_TAIL_CALL_CNT?
0xffffffc00902d2c8: b.cs 0xffffffc00902d2f8 |
0xffffffc00902d2cc: add x26, x26, #0x1 | // TCC++
0xffffffc00902d2d0: mov w10, #0x110 |
0xffffffc00902d2d4: add x10, x1, x10 |
0xffffffc00902d2d8: lsl x11, x2, #3 |
0xffffffc00902d2dc: ldr x11, [x10, x11] |
0xffffffc00902d2e0: cbz x11, 0xffffffc00902d2f8 |
0xffffffc00902d2e4: mov w10, #0x30 |
0xffffffc00902d2e8: ldr x10, [x11, x10] |
0xffffffc00902d2ec: add x10, x10, #0x24 |
0xffffffc00902d2f0: add sp, sp, #0x10 | // <- destroy just current
0xffffffc00902d2f4: br x10 ---------------------. | // BPF stack frame
0xffffffc00902d2f8: mov x10, #0xfffffffffffffffc | | // before the tail call
0xffffffc00902d2fc: ldr w7, [x25, x10] | |
0xffffffc00902d300: add sp, sp, #0x10 | |
0xffffffc00902d304: ldp x27, x28, [sp], #16 | |
0xffffffc00902d308: ldp x25, x26, [sp], #16 | |
0xffffffc00902d30c: ldp x21, x22, [sp], #16 | |
0xffffffc00902d310: ldp x19, x20, [sp], #16 | |
0xffffffc00902d314: ldp x29, x30, [sp], #16 | |
0xffffffc00902d318: add x0, x7, #0x0 | |
0xffffffc00902d31c: autiasp | |
0xffffffc00902d320: ret | |
| |
int classifier_0(struct __sk_buff *skb): | |
0xffffffc008ff5874: paciasp | |
0xffffffc008ff5878: stp x29, x30, [sp, #-16]! | |
0xffffffc008ff587c: mov x29, sp | |
0xffffffc008ff5880: stp x19, x20, [sp, #-16]! | |
0xffffffc008ff5884: stp x21, x22, [sp, #-16]! | |
0xffffffc008ff5888: stp x25, x26, [sp, #-16]! | |
0xffffffc008ff588c: stp x27, x28, [sp, #-16]! | |
0xffffffc008ff5890: mov x25, sp | |
0xffffffc008ff5894: mov x26, #0x0 | |
0xffffffc008ff5898: bti j <----------------------' |
0xffffffc008ff589c: sub x27, x25, #0x0 |
0xffffffc008ff58a0: sub sp, sp, #0x0 |
0xffffffc008ff58a4: mov x0, #0xffffffc0ffffffff |
0xffffffc008ff58a8: movk x0, #0x8fc, lsl #16 |
0xffffffc008ff58ac: movk x0, #0x6000 |
0xffffffc008ff58b0: mov w1, #0x1 |
0xffffffc008ff58b4: str w1, [x0] |
0xffffffc008ff58b8: mov w7, #0x0 |
0xffffffc008ff58bc: mov sp, sp |
0xffffffc008ff58c0: ldp x27, x28, [sp], #16 |
0xffffffc008ff58c4: ldp x25, x26, [sp], #16 |
0xffffffc008ff58c8: ldp x21, x22, [sp], #16 |
0xffffffc008ff58cc: ldp x19, x20, [sp], #16 |
0xffffffc008ff58d0: ldp x29, x30, [sp], #16 |
0xffffffc008ff58d4: add x0, x7, #0x0 |
0xffffffc008ff58d8: autiasp |
0xffffffc008ff58dc: ret -------------------------------'
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220617105735.733938-3-jakub@cloudflare.com
Tony Ambardar [Fri, 17 Jun 2022 10:57:34 +0000 (12:57 +0200)]
bpf, x64: Add predicate for bpf2bpf with tailcalls support in JIT
The BPF core/verifier is hard-coded to permit mixing bpf2bpf and tail
calls for only x86-64. Change the logic to instead rely on a new weak
function 'bool bpf_jit_supports_subprog_tailcalls(void)', which a capable
JIT backend can override.
Update the x86-64 eBPF JIT to reflect this.
Signed-off-by: Tony Ambardar <Tony.Ambardar@gmail.com>
[jakub: drop MIPS bits and tweak patch subject]
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220617105735.733938-2-jakub@cloudflare.com
Alexei Starovoitov [Tue, 21 Jun 2022 00:40:52 +0000 (17:40 -0700)]
Merge branch 'bpf_loop inlining'
Eduard Zingerman says:
====================
Hi Everyone,
This is the next iteration of the patch. It includes changes suggested
by Song, Joanne and Alexei. Please find updated intro message and
change log below.
This patch implements inlining of calls to bpf_loop helper function
when bpf_loop's callback is statically known. E.g. the rewrite does
the following transformation during BPF program processing:
bpf_loop(10, foo, NULL, 0);
->
for (int i = 0; i < 10; ++i)
foo(i, NULL);
The transformation leads to measurable latency change for simple
loops. Measurements using `benchs/run_bench_bpf_loop.sh` inside QEMU /
KVM on i7-4710HQ CPU show a drop in latency from 14 ns/op to 2 ns/op.
The change is split in five parts:
* Update to test_verifier.c to specify expected and unexpected
instruction sequences. This allows to check BPF program rewrites
applied by e.g. do_mix_fixups function.
* Update to test_verifier.c to specify BTF function infos and types
per test case. This is necessary for tests that load sub-program
addresses to a variable because of the checks applied by
check_ld_imm function.
* The update to verifier.c that tracks state of the parameters for
each bpf_loop call in a program and decides whether it could be
replaced by a loop.
* A set of test cases for `test_verifier` that use capabilities added
by the first two patches to verify instructions produced by inlining
logic.
* Two test cases for `test_prog` to check that possible corner cases
behave as expected.
Additional details are available in commit messages for each patch.
Changes since v7:
- Call to `mark_chain_precision` is added in `loop_flag_is_zero` to
avoid potential issues with state pruning and precision tracking.
- `flags non-zero` test_verifier test case is updated to have two
execution paths reaching `bpf_loop` call, one with flags = 0,
another with flags = 1. Potentially this test case should be able
to show that call to `mark_chain_precision` is necessary in
`loop_flag_is_zero` but not at the moment. Please refer to
discussion for [PATCH bpf-next v7 3/5] for additional details.
- `stack_depth_extra` computation is updated to guarantee that R6, R7
and R8 offsets are always aligned on 8 byte boundary.
- `stack locations for loop vars` test_verifier test case updated to
show that R6, R7, R8 offsets are indeed aligned when function stack
depth is not a multiple of 8.
- I removed Song Liu's ACK from commit message for [PATCH bpf-next v8
4/5] because I updated the patch. (Please let me know if I had to
keep the ACK tag).
Changes since v6:
- Return value of the `optimize_bpf_loop` function is no longer
ignored. This is necessary to properly propagate -ENOMEM error.
Changes since v5:
- Added function `loop_flag_is_zero` to skip a few checks in
`update_loop_inline_state` when loop instruction is not fit for
inline.
Changes since v4:
- Added missing `static` modifier for `update_loop_inline_state` and
`inline_bpf_loop` functions.
- `update_loop_inline_state` updated for better readability.
- Fields `initialized` and `fit_for_inline` of `struct
bpf_loop_inline_state` are changed back from `bool` to bitfields.
- Acks from Song Liu added to comments for patches 1/5, 2/5, 4/5,
5/5.
Changes since v3:
- Function `adjust_stack_depth_for_loop_inlining` is replaced by
function `optimize_bpf_loop`. Function `optimize_bpf_loop` is
responsible for both stack depth adjustment and call instruction
replacement.
- Changes in `do_misc_fixups` are reverted.
- Changes in `adjust_subprog_starts_after_remove` are reverted and
function `adjust_loop_inline_subprogno` is removed. This is
possible because call to `optimize_bpf_loop` is placed before the
dead code removal in `opt_remove_dead_code` (in contrast to the
position of `do_misc_fixups` where inlining was done in v3).
- Field `bpf_insn_aux_data.loop_inline_state` is now a part of
anonymous union at the start of the `bpf_insn_aux_data`.
- Data structure `bpf_loop_inline_state` is simplified to use single
flag field `fit_for_inline` instead of separate fields
`flags_is_zero` & `callback_is_constant`.
- Macro definition `BPF_MAX_LOOPS` is moved from
`include/linux/bpf_verifier.h` to `include/linux/bpf.h` to avoid
include of `include/linux/bpf_verifier.h` in `bpf_iter.c`.
- `inline_bpf_loop` changed back to use array initialization and hard
coded offsets as in v2.
- Style / formatting updates.
Changes since v2:
- fix for `stack_check` test case in `test_progs-no_alu32`, all tests
are passing now;
- v2 3/3 patch is split in three parts:
- kernel changes
- test_verifier changes
- test_prog changes
- updated `inline_bpf_loop` in `verifier.c` to calculate each offset
used in instructions to avoid "magic" numbers;
- removed newline handling logic in `fail_log` branch of
`do_single_test` in `test_verifier.c` to simplify the patch set;
- styling fixes suggested in review for v2 of this patch set.
Changes since v1:
- allow to use SKIP_INSNS in instruction pattern specification in
test_verifier tests;
- fix for a bug in spill offset assignement for loop vars when
bpf_loop is located in a non-main function.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Eduard Zingerman [Mon, 20 Jun 2022 23:53:44 +0000 (02:53 +0300)]
selftests/bpf: BPF test_prog selftests for bpf_loop inlining
Two new test BPF programs for test_prog selftests checking bpf_loop
behavior. Both are corner cases for bpf_loop inlinig transformation:
- check that bpf_loop behaves correctly when callback function is not
a compile time constant
- check that local function variables are not affected by allocating
additional stack storage for registers spilled by loop inlining
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Eduard Zingerman [Mon, 20 Jun 2022 23:53:43 +0000 (02:53 +0300)]
selftests/bpf: BPF test_verifier selftests for bpf_loop inlining
A number of test cases for BPF selftests test_verifier to check how
bpf_loop inline transformation rewrites the BPF program. The following
cases are covered:
- happy path
- no-rewrite when flags is non-zero
- no-rewrite when callback is non-constant
- subprogno in insn_aux is updated correctly when dead sub-programs
are removed
- check that correct stack offsets are assigned for spilling of R6-R8
registers
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Eduard Zingerman [Mon, 20 Jun 2022 23:53:42 +0000 (02:53 +0300)]
bpf: Inline calls to bpf_loop when callback is known
Calls to `bpf_loop` are replaced with direct loops to avoid
indirection. E.g. the following:
bpf_loop(10, foo, NULL, 0);
Is replaced by equivalent of the following:
for (int i = 0; i < 10; ++i)
foo(i, NULL);
This transformation could be applied when:
- callback is known and does not change during program execution;
- flags passed to `bpf_loop` are always zero.
Inlining logic works as follows:
- During execution simulation function `update_loop_inline_state`
tracks the following information for each `bpf_loop` call
instruction:
- is callback known and constant?
- are flags constant and zero?
- Function `optimize_bpf_loop` increases stack depth for functions
where `bpf_loop` calls can be inlined and invokes `inline_bpf_loop`
to apply the inlining. The additional stack space is used to spill
registers R6, R7 and R8. These registers are used as loop counter,
loop maximal bound and callback context parameter;
Measurements using `benchs/run_bench_bpf_loop.sh` inside QEMU / KVM on
i7-4710HQ CPU show a drop in latency from 14 ns/op to 2 ns/op.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Eduard Zingerman [Mon, 20 Jun 2022 23:53:41 +0000 (02:53 +0300)]
selftests/bpf: allow BTF specs and func infos in test_verifier tests
The BTF and func_info specification for test_verifier tests follows
the same notation as in prog_tests/btf.c tests. E.g.:
...
.func_info = { { 0, 6 }, { 8, 7 } },
.func_info_cnt = 2,
.btf_strings = "\0int\0",
.btf_types = {
BTF_TYPE_INT_ENC(1, BTF_INT_SIGNED, 0, 32, 4),
BTF_PTR_ENC(1),
},
...
The BTF specification is loaded only when specified.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Eduard Zingerman [Mon, 20 Jun 2022 23:53:40 +0000 (02:53 +0300)]
selftests/bpf: specify expected instructions in test_verifier tests
Allows to specify expected and unexpected instruction sequences in
test_verifier test cases. The instructions are requested from kernel
after BPF program loading, thus allowing to check some of the
transformations applied by BPF verifier.
- `expected_insn` field specifies a sequence of instructions expected
to be found in the program;
- `unexpected_insn` field specifies a sequence of instructions that
are not expected to be found in the program;
- `INSN_OFF_MASK` and `INSN_IMM_MASK` values could be used to mask
`off` and `imm` fields.
- `SKIP_INSNS` could be used to specify that some instructions in the
(un)expected pattern are not important (behavior similar to usage of
`\t` in `errstr` field).
The intended usage is as follows:
{
"inline simple bpf_loop call",
.insns = {
/* main */
BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
BPF_RAW_INSN(BPF_LD | BPF_IMM | BPF_DW, BPF_REG_2,
BPF_PSEUDO_FUNC, 0, 6),
...
BPF_EXIT_INSN(),
/* callback */
BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 1),
BPF_EXIT_INSN(),
},
.expected_insns = {
BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
SKIP_INSNS(),
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_CALL, 8, 1)
},
.unexpected_insns = {
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0,
INSN_OFF_MASK, INSN_IMM_MASK),
},
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
.result = ACCEPT,
.runs = 0,
},
Here it is expected that move of 1 to register 1 would remain in place
and helper function call instruction would be replaced by a relative
call instruction.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Delyan Kratunov [Mon, 20 Jun 2022 21:47:55 +0000 (21:47 +0000)]
uprobe: gate bpf call behind BPF_EVENTS
The call into bpf from uprobes needs to be gated now that it doesn't use
the trace_events.h helpers.
Randy found this as a randconfig build failure on linux-next [1].
[1]: https://lore.kernel.org/linux-next/
2de99180-7d55-2fdf-134d-
33198c27cc58@infradead.org/
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Delyan Kratunov <delyank@fb.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/cb8bfbbcde87ed5d811227a393ef4925f2aadb7b.camel@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Maxim Mikityanskiy [Mon, 20 Jun 2022 10:49:39 +0000 (13:49 +0300)]
selftests/bpf: Enable config options needed for xdp_synproxy test
This commit adds the kernel config options needed to run the recently
added xdp_synproxy test. Users without these options will hit errors
like this:
test_synproxy:FAIL:iptables -t raw -I PREROUTING -i tmp1 -p
tcp -m tcp --syn --dport 8080 -j CT --notrack unexpected error: 256
(errno 22)
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220620104939.4094104-1-maximmi@nvidia.com
Cong Wang [Wed, 15 Jun 2022 16:20:14 +0000 (09:20 -0700)]
skmsg: Get rid of unncessary memset()
We always allocate skmsg with kzalloc(), so there is no need
to call memset(0) on it, the only thing we need from
sk_msg_init() is sg_init_marker(). So introduce a new helper
which is just kzalloc()+sg_init_marker(), this saves an
unncessary memset(0) for skmsg on fast path.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220615162014.89193-5-xiyou.wangcong@gmail.com
Cong Wang [Wed, 15 Jun 2022 16:20:13 +0000 (09:20 -0700)]
skmsg: Get rid of skb_clone()
With ->read_skb() now we have an entire skb dequeued from
receive queue, now we just need to grab an addtional refcnt
before passing its ownership to recv actors.
And we should not touch them any more, particularly for
skb->sk. Fortunately, skb->sk is already set for most of
the protocols except UDP where skb->sk has been stolen,
so we have to fix it up for UDP case.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220615162014.89193-4-xiyou.wangcong@gmail.com
Cong Wang [Wed, 15 Jun 2022 16:20:12 +0000 (09:20 -0700)]
net: Introduce a new proto_ops ->read_skb()
Currently both splice() and sockmap use ->read_sock() to
read skb from receive queue, but for sockmap we only read
one entire skb at a time, so ->read_sock() is too conservative
to use. Introduce a new proto_ops ->read_skb() which supports
this sematic, with this we can finally pass the ownership of
skb to recv actors.
For non-TCP protocols, all ->read_sock() can be simply
converted to ->read_skb().
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220615162014.89193-3-xiyou.wangcong@gmail.com
Cong Wang [Wed, 15 Jun 2022 16:20:11 +0000 (09:20 -0700)]
tcp: Introduce tcp_read_skb()
This patch inroduces tcp_read_skb() based on tcp_read_sock(),
a preparation for the next patch which actually introduces
a new sock ops.
TCP is special here, because it has tcp_read_sock() which is
mainly used by splice(). tcp_read_sock() supports partial read
and arbitrary offset, neither of them is needed for sockmap.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220615162014.89193-2-xiyou.wangcong@gmail.com
David S. Miller [Mon, 20 Jun 2022 09:03:34 +0000 (10:03 +0100)]
Merge branch 'mlxsw-unified-bridge-conversion-part-1'
Ido Schimmel says:
====================
mlxsw: Unified bridge conversion - part 1/6
This set starts converting mlxsw to the unified bridge model and mainly
adds new device registers and extends existing ones that will be used in
follow-up patchsets.
High-level summary
==================
The unified bridge model is a new way of managing low-level device
objects such as filtering identifiers (FIDs). The conversion moves a lot
of logic out of the device's firmware towards the driver, but its main
selling point is that it allows to overcome various scalability issues
related to the amount of entries that need to be programmed to the
device.
The only (intended) user visible changes of the conversion are
improvement in resource utilization and ability to support more router
interfaces (RIFs) in Spectrum-{2,3}.
Details
=======
Commit
50853808ff4a ("Merge branch
'mlxsw-Prepare-for-VLAN-aware-bridge-w-VxLAN'") converted mlxsw to
emulate 802.1Q FIDs (represent VLANs in a VLAN-aware bridge) using
802.1D FIDs (represent VLAN-unaware bridges). This was necessary because
at that time VNI could not be assigned to 802.1Q FIDs, which effectively
meant that mlxsw could not support VXLAN with VLAN-aware bridges.
The downside of this approach is that multiple {Port,VID}->FID entries
are required in order to classify incoming traffic to a FID, as opposed
to a single VID->FID entry that can be used with actual 802.1Q FIDs.
For example, if 10 ports are members in the same VLAN-aware bridge and
the same 100 VLANs are configured on each port, then only 100 VID->FID
entries are required with 802.1Q FIDs, whereas 1000 {Port,VID}->FID
entries are required with emulated 802.1Q FIDs.
The above limitation is the result of various assumptions that were made
in the design of the API that was exposed to software. In the unified
bridge model the API is much more "raw" and therefore avoids these
assumptions, allowing software to configure the device in a more
efficient manner.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Amit Cohen [Sun, 19 Jun 2022 10:29:21 +0000 (13:29 +0300)]
mlxsw: reg: Add support for VLAN RIF as part of RITR register
Router interfaces (RIFs) constructed on top of VLAN-aware bridges are of
"VLAN" type, whereas RIFs constructed on top of VLAN-unaware bridges of
"FID" type.
In other words, the RIF type is derived from the underlying FID type.
VLAN RIFs are used on top of 802.1Q FIDs, whereas FID RIFs are used on
top of 802.1D FIDs.
Currently 802.1Q FIDs are emulated using 802.1D FIDs, and therefore VLAN
RIFs are emulated using FID RIFs.
As part of converting the driver to use unified bridge, 802.1Q FIDs and
VLAN RIFs will be used.
Add the relevant fields to RITR register, add pack() function for VLAN
RIF and rename one field to fit the internal name.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amit Cohen [Sun, 19 Jun 2022 10:29:20 +0000 (13:29 +0300)]
mlxsw: Add support for egress FID classification after decapsulation
As preparation for unified bridge model, add support for VNI->FID mapping
via SVFA register.
When performing VXLAN encapsulation, the VXLAN header needs to contain a
VNI. This VNI is derived from the FID classification performed on
ingress, through which the ingress RIF is also determined.
Similarly, when performing VXLAN decapsulation, the FID of the packet
needs to be determined. This FID is derived from VNI classification
performed during decapsulation.
In the old model, both entries (i.e., FID->VNI and VNI->FID) were
configured via SFMR.vni.
In the new model, where ingress is separated from egress, ingress
configuration (VNI->FID) is performed via SVFA, while SFMR only
configures egress (FID->VNI).
Add 'vni' field to SVFA, add new mapping table - VNI to FID, add new
pack() function for VNI mapping and edit the comment in SFMR.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amit Cohen [Sun, 19 Jun 2022 10:29:19 +0000 (13:29 +0300)]
mlxsw: reg: Add egress FID field to RITR register
RITR configures the router interface table. As preparation for unified
bridge model, add egress FID field to RITR.
After routing, a packet has to perform a layer-2 lookup using the
destination MAC it got from the routing and a FID.
In the new model, the egress FID is configured by RITR for both sub-port
and FID RIFs.
Add 'efid' field to sub-port router interface and update FID router
interface related comment.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amit Cohen [Sun, 19 Jun 2022 10:29:18 +0000 (13:29 +0300)]
mlxsw: reg: Add Router Egress Interface to VID Register
The REIV maps {egress router interface (eRIF), egress_port} -> {vlan ID}.
As preparation for unified bridge model, add REIV register for future use.
In the past, firmware would take care of the above mentioned mapping,
but in the new model this should be done by software using REIV register.
REIV register supports a simultaneous update of 256 ports using
'port_page' field. When 'port_page'=0 the records represent ports
0-255, when 'port_page'=1 the records represent ports 256-511 and so
on.
The register is reserved while using the legacy model.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amit Cohen [Sun, 19 Jun 2022 10:29:17 +0000 (13:29 +0300)]
mlxsw: reg: Replace MID related fields in SFGC register
SFGC register maps {packet type, bridge type} -> {MID base, table type}.
As preparation for unified bridge model, remove 'mid' field and add
'mid_base' field.
The MID index (index to PGT table which maps MID to local port list and
SMPE index) is a result of 'mid_base' + 'fid_offset'. Using the legacy
bridge model, firmware configures 'mid_base'. However, using the new model,
software is responsible to configure it via SFGC register.
The 'mid_base' is configured per {packet type, bridge type}, for
example, for {Unicast, .1Q}, {Broadcast, .1D}.
Add the field 'mid_base' to SFGC register and increase the length of the
register accordingly.
Remove the field 'mid' as currently it is ignored by the device, its use
is an old leftover.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amit Cohen [Sun, 19 Jun 2022 10:29:16 +0000 (13:29 +0300)]
mlxsw: reg: Add flood related field to SFMR register
SFMR register creates and configures FIDs. As preparation for unified
bridge model, add a required field for future use.
The PGT (Port Group) table maps multicast ID (MID) to
{local port list, SMPE index} on Spectrum-1 and to {local port list} on
the other ASICs.
In the legacy model, software did not interact with this table directly.
Instead, it was accessed by firmware in response to registers such as
SFTR and SMID.
In the new model, the SFTR register is deprecated and software has full
control over the PGT table using the SMID register.
The configuration of MDB entries (using SFD) is unchanged, but flooding
configuration is completely different.
SFGC register maps {packet type, bridge type} -> {MID base, table type},
then with FID and FID-offset which are configured via SFMR, the MID index
is obtained.
Add the field 'flood_bridge_type' to SFMR, software can separate between
802.1q FIDs and vFIDs using two types which are supported.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amit Cohen [Sun, 19 Jun 2022 10:29:15 +0000 (13:29 +0300)]
mlxsw: reg: Add VID related fields to SFD register
SFD register configures FDB table. As preparation for unified bridge model,
add some required fields for future use.
In the new model, firmware no longer configures the egress VID, this
responsibility is moved to software. For layer 2 this means that software
needs to determine the egress VID for both unicast and multicast.
For unicast FDB records and unicast LAG FDB records, the VID needs to be
set via new fields in SFD - 'set_vid' and 'vid'.
Add the two mentioned fields for future use.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amit Cohen [Sun, 19 Jun 2022 10:29:14 +0000 (13:29 +0300)]
mlxsw: reg: Add SMPE related fields to SFMR register
SFMR register creates and configures FIDs. As preparation unified bridge
model, add some required fields for future use.
The device includes two main tables to support layer 2 multicast (i.e.,
MDB and flooding). These are the PGT (Port Group Table) and the
MPE (Multicast Port Egress) table.
- PGT is {MID -> (bitmap of local_port, SPME index)}
- MPE is {(Local port, SMPE index) -> eVID}
In Spectrum-2 and later ASICs, the SMPE index is an attribute of the FID
and programmed via new fields in SFMR register - 'smpe_valid' and 'smpe'.
Add the two mentioned fields for future use and increase the length of
the register accordingly.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amit Cohen [Sun, 19 Jun 2022 10:29:13 +0000 (13:29 +0300)]
mlxsw: Add SMPE related fields to SMID2 register
SMID register maps multicast ID (MID) into a list of local ports.
As preparation for unified bridge model, add some required fields for
future use.
The device includes two main tables to support layer 2 multicast (i.e.,
MDB and flooding). These are the PGT (Port Group Table) and the
MPE (Multicast Port Egress) table.
- PGT is {MID -> (bitmap of local_port, SPME index)}
- MPE is {(Local port, SMPE index) -> eVID}
In Spectrum-1, both indexes into the MPE table (local port and SMPE) are
derived from the PGT table. Therefore, the SMPE index needs to be
programmed as part of the PGT entry via new fields in SMID - 'smpe_valid'
and 'smpe'.
Add the two mentioned fields for future use and align the callers of
mlxsw_reg_smid2_pack() to pass zeros for SMPE fields.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amit Cohen [Sun, 19 Jun 2022 10:29:12 +0000 (13:29 +0300)]
mlxsw: reg: Add Switch Multicast Port to Egress VID Register
The SMPE register maps {egress_port, SMPE index} -> VID.
The device includes two main tables to support layer 2 multicast (i.e.,
MDB and flooding). These are the PGT (Port Group Table) and the
MPE (Multicast Port Egress) table.
- PGT is {MID -> (bitmap of local_port, SPME index)}
- MPE is {(Local port, SMPE index) -> eVID}
In Spectrum-1, the index into the MPE table - called switch multicast to
port egress VID (SMPE) - is derived from the PGT entry, whereas in
Spectrum-2 and later ASICs it is derived from the FID.
In the legacy model, software did not interact with this table as it was
completely hidden in firmware. In the new model, software needs to
populate the table itself in order to map from {Local port, SMPE index} to
an egress VID. This is done using the SMPE register.
Add the register for future use.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amit Cohen [Sun, 19 Jun 2022 10:29:11 +0000 (13:29 +0300)]
mlxsw: reg: Add ingress RIF related fields to SVFA register
SVFA register controls the VID to FID mapping and {Port, VID} to FID
mapping for virtualized ports. As preparation for unified bridge model,
add some required fields for future use.
On ingress, after ingress ACL, a packet needs to be classified to a FID.
The key for this lookup can be one of:
1. VID. When port is not in virtual mode.
2. {RQ, VID}. When port is in virtual mode.
3. FID. When FID was set by ingress ACL.
Since RITR no longer performs ingress configuration, the ingress RIF for
the first two entry types needs to be set via new fields in SVFA -
'irif_v' and 'irif'.
Add the two mentioned fields for future use and increase the length of
the register accordingly.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amit Cohen [Sun, 19 Jun 2022 10:29:10 +0000 (13:29 +0300)]
mlxsw: reg: Add ingress RIF related fields to SFMR register
SFMR register creates and configures FIDs. As preparation for unified
bridge model, add some required fields for future use.
On ingress, after ingress ACL, a packet needs to be classified to a FID.
The key for this lookup can be one of:
1. VID. When port is not in virtual mode.
2. {RQ, VID}. When port is in virtual mode.
3. FID. When FID was set by ingress ACL.
For example, via VR_AND_FID_ACTION.
Since RITR no longer performs ingress configuration, the ingress RIF for
the last entry type needs to be set via new fields in SFMR - 'irif_v'
and 'irif'.
Add the two mentioned fields for future use.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amit Cohen [Sun, 19 Jun 2022 10:29:09 +0000 (13:29 +0300)]
mlxsw: reg: Add 'flood_rsp' field to SFMR register
SFMR register creates and configures FIDs. As preparation for unified
bridge model, add a field for future use.
In the new model, RITR no longer configures the rFID used for sub-port RIFs
and it has to be created by software via SFMR. Such FIDs need to be created
with special flood indication using 'flood_rsp' field. When set, this bit
instructs the device to manage the flooding entries for this FID in a
reserved part of the port group table (PGT).
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ronak Doshi [Mon, 20 Jun 2022 00:10:13 +0000 (17:10 -0700)]
vmxnet3: disable overlay offloads if UPT device does not support
'Commit
6f91f4ba046e ("vmxnet3: add support for capability registers")'
added support for capability registers. These registers are used
to advertize capabilities of the device.
The patch updated the dev_caps to disable outer checksum offload if
PTCR register does not support it. However, it missed to update
other overlay offloads. This patch fixes this issue.
Fixes:
6f91f4ba046e ("vmxnet3: add support for capability registers")
Signed-off-by: Ronak Doshi <doshir@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 20 Jun 2022 08:10:13 +0000 (09:10 +0100)]
Merge branch 'raw-rcu-fixes'
Kuniyuki Iwashima says:
====================
raw: Fix nits of RCU conversion series.
The first patch fixes a build error by commit
ba44f8182ec2 ("raw: use
more conventional iterators"), but it does not land in the net tree,
so this series is targeted to net-next. The second patch replaces some
hlist functions with sk's helper macros.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Sun, 19 Jun 2022 23:29:27 +0000 (16:29 -0700)]
raw: Use helpers for the hlist_nulls variant.
hlist_nulls_add_head_rcu() and hlist_nulls_for_each_entry() have dedicated
macros for sk.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Sun, 19 Jun 2022 23:29:26 +0000 (16:29 -0700)]
raw: Fix mixed declarations error in raw_icmp_error().
The trailing semicolon causes a compiler error, so let's remove it.
net/ipv4/raw.c: In function ‘raw_icmp_error’:
net/ipv4/raw.c:266:2: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
266 | struct hlist_nulls_head *hlist;
| ^~~~~~
Fixes:
ba44f8182ec2 ("raw: use more conventional iterators")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xiang wangx [Sat, 18 Jun 2022 13:22:41 +0000 (21:22 +0800)]
sfc/siena: Fix typo in comment
Delete the redundant word 'and'.
Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xiang wangx [Sat, 18 Jun 2022 13:19:14 +0000 (21:19 +0800)]
sfc: Fix typo in comment
Delete the redundant word 'and'.
Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xiang wangx [Sat, 18 Jun 2022 13:16:26 +0000 (21:16 +0800)]
net: emac: Fix typo in a comment
Delete the redundant word 'and'.
Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman [Sun, 19 Jun 2022 08:15:30 +0000 (10:15 +0200)]
Revert "nfp: update nfp_X logging definitions"
This reverts commit
9386ebccfc59 ("nfp: update nfp_X logging definitions")
The reverted patch was intended to improve logging for the NFP driver by
including information such as the source code file and number in log
messages.
Unfortunately our experience is that this has not improved things as
we had hoped. The resulting logs are inconsistent with (most) other
kernel log messages. And rely on knowledge of the source code version
in order for the extra information to be useful.
Thus, revert the change.
We acknowledge that Jakub Kicinski <kuba@kernel.org> foresaw this problem.
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 19 Jun 2022 09:38:27 +0000 (10:38 +0100)]
Merge branch 'mii_bmcr_encode_fixed'
Russell King says:
====================
net: introduce mii_bmcr_encode_fixed()
While converting the mv88e6xxx driver to phylink pcs, it has been
noticed that we've started to have repeated cases where we convert a
speed and duplex to a BMCR value.
Rather than open coding this in multiple locations, let's provide a
helper for this - in linux/mii.h. This helper not only takes care of
the standard 10, 100 and 1000Mbps encodings, but also includes
2500Mbps (which is the same as 1000Mbps) for those users who require
that encoding as well. Unknown speeds will be encoded to 10Mbps, and
non-full duplexes will be encoded as half duplex.
This series converts the existing users to the new helper, and the
mv88e6xxx conversion will add further users in the 6352 and 639x PCS
code.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King (Oracle) [Sat, 18 Jun 2022 10:28:47 +0000 (11:28 +0100)]
net: pcs: pcs-xpcs: use mii_bmcr_encode_fixed()
Use the newly introduced mii_bmcr_encode_fixed() for the xpcs driver.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King (Oracle) [Sat, 18 Jun 2022 10:28:42 +0000 (11:28 +0100)]
net: phy: marvell: use mii_bmcr_encode_fixed()
Make use of the newly introduced mii_bmcr_encode_fixed() to get the
BMCR value when setting loopback mode for the 88e1510.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King (Oracle) [Sat, 18 Jun 2022 10:28:37 +0000 (11:28 +0100)]
net: phy: use mii_bmcr_encode_fixed()
phylib can make use of the newly introduced mii_bmcr_encode_fixed()
macro, so let's convert it over.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King (Oracle) [Sat, 18 Jun 2022 10:28:32 +0000 (11:28 +0100)]
net: mii: add mii_bmcr_encode_fixed()
Add a function to encode a fixed speed/duplex to a BMCR value.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 19 Jun 2022 09:00:02 +0000 (10:00 +0100)]
Merge branch 'raw-RCU-conversion'
Eric Dumazet says:
====================
raw: RCU conversion
Using rwlock in networking code is extremely risky.
writers can starve if enough readers are constantly
grabing the rwlock.
I thought rwlock were at fault and sent this patch:
https://lkml.org/lkml/2022/6/17/272
But Peter and Linus essentially told me rwlock had to be unfair.
We need to get rid of rwlock in networking stacks.
Without this conversion, following script triggers soft lockups:
for i in {1..48}
do
ping -f -n -q 127.0.0.1 &
sleep 0.1
done
Next step will be to convert ping sockets to RCU as well.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 18 Jun 2022 03:47:05 +0000 (20:47 -0700)]
raw: convert raw sockets to RCU
Using rwlock in networking code is extremely risky.
writers can starve if enough readers are constantly
grabing the rwlock.
I thought rwlock were at fault and sent this patch:
https://lkml.org/lkml/2022/6/17/272
But Peter and Linus essentially told me rwlock had to be unfair.
We need to get rid of rwlock in networking code.
Without this fix, following script triggers soft lockups:
for i in {1..48}
do
ping -f -n -q 127.0.0.1 &
sleep 0.1
done
Fixes:
1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 18 Jun 2022 03:47:04 +0000 (20:47 -0700)]
raw: use more conventional iterators
In order to prepare the following patch,
I change raw v4 & v6 code to use more conventional
iterators.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xiaoliang Yang [Fri, 17 Jun 2022 03:24:23 +0000 (11:24 +0800)]
net: dsa: felix: update base time of time-aware shaper when adjusting PTP time
When adjusting the PTP clock, the base time of the TAS configuration
will become unreliable. We need reset the TAS configuration by using a
new base time.
For example, if the driver gets a base time 0 of Qbv configuration from
user, and current time is 20000. The driver will set the TAS base time
to be 20000. After the PTP clock adjustment, the current time becomes
10000. If the TAS base time is still 20000, it will be a future time,
and TAS entry list will stop running. Another example, if the current
time becomes to be
10000000 after PTP clock adjust, a large time offset
can cause the hardware to hang.
This patch introduces a tas_clock_adjust() function to reset the TAS
module by using a new base time after the PTP clock adjustment. This can
avoid issues above.
Due to PTP clock adjustment can occur at any time, it may conflict with
the TAS configuration. We introduce a new TAS lock to serialize the
access to the TAS registers.
Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christian Marangi [Thu, 16 Jun 2022 22:15:54 +0000 (00:15 +0200)]
net: ethernet: stmmac: remove select QCOM_SOCINFO and make it optional
QCOM_SOCINFO depends on QCOM_SMEM but is not selected, this cause some
problems with QCOM_SOCINFO getting selected with the dependency of
QCOM_SMEM not met.
To fix this remove the select in Kconfig and add additional info in the
DWMAC_IPQ806X config description.
Reported-by: kernel test robot <lkp@intel.com>
Fixes:
9ec092d2feb6 ("net: ethernet: stmmac: add missing sgmii configure for ipq806x")
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 18 Jun 2022 04:04:15 +0000 (21:04 -0700)]
ping: convert to RCU lookups, get rid of rwlock
Using rwlock in networking code is extremely risky.
writers can starve if enough readers are constantly
grabing the rwlock.
I thought rwlock were at fault and sent this patch:
https://lkml.org/lkml/2022/6/17/272
But Peter and Linus essentially told me rwlock had to be unfair.
We need to get rid of rwlock in networking code.
Fixes:
c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Peter Lafreniere [Thu, 16 Jun 2022 15:23:33 +0000 (11:23 -0400)]
ax25: use GFP_KERNEL in ax25_dev_device_up()
ax25_dev_device_up() is only called during device setup, which is
done in user context. In addition, ax25_dev_device_up()
unconditionally calls ax25_register_dev_sysctl(), which already
allocates with GFP_KERNEL.
Since it is allowed to sleep in this function, here we change
ax25_dev_device_up() to use GFP_KERNEL to reduce unnecessary
out-of-memory errors.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Lafreniere <pjlafren@mtu.edu>
Link: https://lore.kernel.org/r/20220616152333.9812-1-pjlafren@mtu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xiang wangx [Thu, 16 Jun 2022 16:41:55 +0000 (00:41 +0800)]
atm: iphase: Fix typo in comment
Delete the redundant word 'the'.
Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com>
Link: https://lore.kernel.org/r/20220616164155.11686-1-wangxiang@cdjrlc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xiang wangx [Thu, 16 Jun 2022 14:26:24 +0000 (22:26 +0800)]
ppp: Fix typo in comment
Delete the redundant word 'the'.
Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com>
Link: https://lore.kernel.org/r/20220616142624.3397-1-wangxiang@cdjrlc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yinjun Zhang [Thu, 16 Jun 2022 13:33:57 +0000 (15:33 +0200)]
nfp: add support for .get_pauseparam()
Show correct pause frame parameters for nfp. These parameters cannot
be configured, so .set_pauseparam() is not implemented. With this
change:
#ethtool --show-pause enp1s0np0
Pause parameters for enp1s0np0:
Autonegotiate: off
RX: on
TX: on
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20220616133358.135305-1-simon.horman@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Oleksij Rempel [Thu, 16 Jun 2022 11:25:50 +0000 (13:25 +0200)]
net: dsa: ar9331: fix potential dead lock on mdio access
Rework MDIO locking to avoid potential circular locking:
WARNING: possible circular locking dependency detected
5.19.0-rc1-ar9331-00017-g3ab364c7c48c #5 Not tainted
------------------------------------------------------
kworker/u2:4/68 is trying to acquire lock:
81f3c83c (ar9331:1005:(&ar9331_mdio_regmap_config)->lock){+.+.}-{4:4}, at: regmap_write+0x50/0x8c
but task is already holding lock:
81f60494 (&bus->mdio_lock){+.+.}-{4:4}, at: mdiobus_read+0x40/0x78
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&bus->mdio_lock){+.+.}-{4:4}:
lock_acquire+0x2d4/0x360
__mutex_lock+0xf8/0x384
mutex_lock_nested+0x2c/0x38
mdiobus_write+0x44/0x80
ar9331_sw_bus_write+0x50/0xe4
_regmap_raw_write_impl+0x604/0x724
_regmap_bus_raw_write+0x9c/0xb4
_regmap_write+0xdc/0x1a0
_regmap_update_bits+0xf4/0x118
_regmap_select_page+0x108/0x138
_regmap_raw_read+0x25c/0x288
_regmap_bus_read+0x60/0x98
_regmap_read+0xd4/0x1b0
_regmap_update_bits+0xc4/0x118
regmap_update_bits_base+0x64/0x8c
ar9331_sw_irq_bus_sync_unlock+0x40/0x6c
__irq_set_handler+0x7c/0xac
ar9331_sw_irq_map+0x48/0x7c
irq_domain_associate+0x174/0x208
irq_create_mapping_affinity+0x1a8/0x230
ar9331_sw_probe+0x22c/0x388
mdio_probe+0x44/0x70
really_probe+0x200/0x424
__driver_probe_device+0x290/0x298
driver_probe_device+0x54/0xe4
__device_attach_driver+0xe4/0x130
bus_for_each_drv+0xb4/0xd8
__device_attach+0x104/0x1a4
bus_probe_device+0x48/0xc4
device_add+0x600/0x800
mdio_device_register+0x68/0xa0
of_mdiobus_register+0x2bc/0x3c4
ag71xx_probe+0x6e4/0x984
platform_probe+0x78/0xd0
really_probe+0x200/0x424
__driver_probe_device+0x290/0x298
driver_probe_device+0x54/0xe4
__driver_attach+0x17c/0x190
bus_for_each_dev+0x8c/0xd0
bus_add_driver+0x110/0x228
driver_register+0xe4/0x12c
do_one_initcall+0x104/0x2a0
kernel_init_freeable+0x250/0x288
kernel_init+0x34/0x130
ret_from_kernel_thread+0x14/0x1c
-> #0 (ar9331:1005:(&ar9331_mdio_regmap_config)->lock){+.+.}-{4:4}:
check_noncircular+0x88/0xc0
__lock_acquire+0x10bc/0x18bc
lock_acquire+0x2d4/0x360
__mutex_lock+0xf8/0x384
mutex_lock_nested+0x2c/0x38
regmap_write+0x50/0x8c
ar9331_sw_mbus_read+0x74/0x1b8
__mdiobus_read+0x90/0xec
mdiobus_read+0x50/0x78
get_phy_device+0xa0/0x18c
fwnode_mdiobus_register_phy+0x120/0x1d4
of_mdiobus_register+0x244/0x3c4
devm_of_mdiobus_register+0xe8/0x100
ar9331_sw_setup+0x16c/0x3a0
dsa_register_switch+0x7dc/0xcc0
ar9331_sw_probe+0x370/0x388
mdio_probe+0x44/0x70
really_probe+0x200/0x424
__driver_probe_device+0x290/0x298
driver_probe_device+0x54/0xe4
__device_attach_driver+0xe4/0x130
bus_for_each_drv+0xb4/0xd8
__device_attach+0x104/0x1a4
bus_probe_device+0x48/0xc4
deferred_probe_work_func+0xf0/0x10c
process_one_work+0x314/0x4d4
worker_thread+0x2a4/0x354
kthread+0x134/0x13c
ret_from_kernel_thread+0x14/0x1c
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&bus->mdio_lock);
lock(ar9331:1005:(&ar9331_mdio_regmap_config)->lock);
lock(&bus->mdio_lock);
lock(ar9331:1005:(&ar9331_mdio_regmap_config)->lock);
*** DEADLOCK ***
5 locks held by kworker/u2:4/68:
#0:
81c04eb4 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1e4/0x4d4
#1:
81f0de78 (deferred_probe_work){+.+.}-{0:0}, at: process_one_work+0x1e4/0x4d4
#2:
81f0a880 (&dev->mutex){....}-{4:4}, at: __device_attach+0x40/0x1a4
#3:
80c8aee0 (dsa2_mutex){+.+.}-{4:4}, at: dsa_register_switch+0x5c/0xcc0
#4:
81f60494 (&bus->mdio_lock){+.+.}-{4:4}, at: mdiobus_read+0x40/0x78
stack backtrace:
CPU: 0 PID: 68 Comm: kworker/u2:4 Not tainted 5.19.0-rc1-ar9331-00017-g3ab364c7c48c #5
Workqueue: events_unbound deferred_probe_work_func
Stack :
00000056 800d4638 81f0d64c 00000004 00000018 00000000 80a20000 80a20000
80937590 81ef3858 81f0d760 3913578a 00000005 8045e824 81f0d600 a8db84cc
00000000 00000000 80937590 00000a44 00000000 00000002 00000001 ffffffff
81f0d6a4 80982d7c 0000000f 20202020 80a20000 00000001 80937590 81ef3858
81f0d760 3913578a 00000005 00000005 00000000 03bd0000 00000000 80e00000
...
Call Trace:
[<
80069db0>] show_stack+0x94/0x130
[<
8045e824>] dump_stack_lvl+0x54/0x8c
[<
800c7fac>] check_noncircular+0x88/0xc0
[<
800ca068>] __lock_acquire+0x10bc/0x18bc
[<
800cb478>] lock_acquire+0x2d4/0x360
[<
807b84c4>] __mutex_lock+0xf8/0x384
[<
807b877c>] mutex_lock_nested+0x2c/0x38
[<
804ea640>] regmap_write+0x50/0x8c
[<
80501e38>] ar9331_sw_mbus_read+0x74/0x1b8
[<
804fe9a0>] __mdiobus_read+0x90/0xec
[<
804feac4>] mdiobus_read+0x50/0x78
[<
804fcf74>] get_phy_device+0xa0/0x18c
[<
804ffeb4>] fwnode_mdiobus_register_phy+0x120/0x1d4
[<
805004f0>] of_mdiobus_register+0x244/0x3c4
[<
804f0c50>] devm_of_mdiobus_register+0xe8/0x100
[<
805017a0>] ar9331_sw_setup+0x16c/0x3a0
[<
807355c8>] dsa_register_switch+0x7dc/0xcc0
[<
80501468>] ar9331_sw_probe+0x370/0x388
[<
804ff0c0>] mdio_probe+0x44/0x70
[<
804d1848>] really_probe+0x200/0x424
[<
804d1cfc>] __driver_probe_device+0x290/0x298
[<
804d1d58>] driver_probe_device+0x54/0xe4
[<
804d2298>] __device_attach_driver+0xe4/0x130
[<
804cf048>] bus_for_each_drv+0xb4/0xd8
[<
804d200c>] __device_attach+0x104/0x1a4
[<
804d026c>] bus_probe_device+0x48/0xc4
[<
804d108c>] deferred_probe_work_func+0xf0/0x10c
[<
800a0ffc>] process_one_work+0x314/0x4d4
[<
800a17fc>] worker_thread+0x2a4/0x354
[<
800a9a54>] kthread+0x134/0x13c
[<
8006306c>] ret_from_kernel_thread+0x14/0x1c
[
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20220616112550.877118-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 18 Jun 2022 02:35:17 +0000 (19:35 -0700)]
Merge https://git./linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-06-17
We've added 72 non-merge commits during the last 15 day(s) which contain
a total of 92 files changed, 4582 insertions(+), 834 deletions(-).
The main changes are:
1) Add 64 bit enum value support to BTF, from Yonghong Song.
2) Implement support for sleepable BPF uprobe programs, from Delyan Kratunov.
3) Add new BPF helpers to issue and check TCP SYN cookies without binding to a
socket especially useful in synproxy scenarios, from Maxim Mikityanskiy.
4) Fix libbpf's internal USDT address translation logic for shared libraries as
well as uprobe's symbol file offset calculation, from Andrii Nakryiko.
5) Extend libbpf to provide an API for textual representation of the various
map/prog/attach/link types and use it in bpftool, from Daniel Müller.
6) Provide BTF line info for RV64 and RV32 JITs, and fix a put_user bug in the
core seen in 32 bit when storing BPF function addresses, from Pu Lehui.
7) Fix libbpf's BTF pointer size guessing by adding a list of various aliases
for 'long' types, from Douglas Raillard.
8) Fix bpftool to readd setting rlimit since probing for memcg-based accounting
has been unreliable and caused a regression on COS, from Quentin Monnet.
9) Fix UAF in BPF cgroup's effective program computation triggered upon BPF link
detachment, from Tadeusz Struk.
10) Fix bpftool build bootstrapping during cross compilation which was pointing
to the wrong AR process, from Shahab Vahedi.
11) Fix logic bug in libbpf's is_pow_of_2 implementation, from Yuze Chi.
12) BPF hash map optimization to avoid grabbing spinlocks of all CPUs when there
is no free element. Also add a benchmark as reproducer, from Feng Zhou.
13) Fix bpftool's codegen to bail out when there's no BTF, from Michael Mullin.
14) Various minor cleanup and improvements all over the place.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (72 commits)
bpf: Fix bpf_skc_lookup comment wrt. return type
bpf: Fix non-static bpf_func_proto struct definitions
selftests/bpf: Don't force lld on non-x86 architectures
selftests/bpf: Add selftests for raw syncookie helpers in TC mode
bpf: Allow the new syncookie helpers to work with SKBs
selftests/bpf: Add selftests for raw syncookie helpers
bpf: Add helpers to issue and check SYN cookies in XDP
bpf: Allow helpers to accept pointers with a fixed size
bpf: Fix documentation of th_len in bpf_tcp_{gen,check}_syncookie
selftests/bpf: add tests for sleepable (uk)probes
libbpf: add support for sleepable uprobe programs
bpf: allow sleepable uprobe programs to attach
bpf: implement sleepable uprobes by chaining gps
bpf: move bpf_prog to bpf.h
libbpf: Fix internal USDT address translation logic for shared libraries
samples/bpf: Check detach prog exist or not in xdp_fwd
selftests/bpf: Avoid skipping certain subtests
selftests/bpf: Fix test_varlen verification failure with latest llvm
bpftool: Do not check return value from libbpf_set_strict_mode()
Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"
...
====================
Link: https://lore.kernel.org/r/20220617220836.7373-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tobias Klauser [Fri, 17 Jun 2022 15:21:21 +0000 (17:21 +0200)]
bpf: Fix bpf_skc_lookup comment wrt. return type
The function no longer returns 'unsigned long' as of commit
edbf8c01de5a
("bpf: add skc_lookup_tcp helper").
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220617152121.29617-1-tklauser@distanz.ch
Joanne Koong [Thu, 16 Jun 2022 22:54:07 +0000 (15:54 -0700)]
bpf: Fix non-static bpf_func_proto struct definitions
This patch does two things:
1) Marks the dynptr bpf_func_proto structs that were added in [1]
as static, as pointed out by the kernel test robot in [2].
2) There are some bpf_func_proto structs marked as extern which can
instead be statically defined.
[1] https://lore.kernel.org/bpf/
20220523210712.3641569-1-joannelkoong@gmail.com/
[2] https://lore.kernel.org/bpf/
62ab89f2.Pko7sI08RAKdF8R6%25lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220616225407.1878436-1-joannelkoong@gmail.com
Hoang Le [Fri, 17 Jun 2022 01:47:51 +0000 (08:47 +0700)]
tipc: cleanup unused function
tipc_dest_list_len() is not being called anywhere. Clean it up.
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Oleksij Rempel [Fri, 17 Jun 2022 07:16:07 +0000 (09:16 +0200)]
net: macb: fix negative max_mtu size for sama5d3
JML register on probe will return zero . This register is configured
later on macb_init_hw() which is called on open.
Since we have zero, after header and FCS length subtraction we will get
negative max_mtu size. This issue was affecting DSA drivers with MTU support
(for example KSZ9477).
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Claudiu Beznea <claudiu.beznea@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kees Cook [Thu, 16 Jun 2022 05:23:12 +0000 (22:23 -0700)]
hinic: Replace memcpy() with direct assignment
Under CONFIG_FORTIFY_SOURCE=y and CONFIG_UBSAN_BOUNDS=y, Clang is bugged
here for calculating the size of the destination buffer (0x10 instead of
0x14). This copy is a fixed size (sizeof(struct fw_section_info_st)), with
the source and dest being struct fw_section_info_st, so the memcpy should
be safe, assuming the index is within bounds, which is UBSAN_BOUNDS's
responsibility to figure out.
Avoid the whole thing and just do a direct assignment. This results in
no change to the executable code.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Tom Rix <trix@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Jiri Pirko <jiri@nvidia.com>
Cc: Vladimir Oltean <olteanv@gmail.com>
Cc: Simon Horman <simon.horman@corigine.com>
Cc: netdev@vger.kernel.org
Cc: llvm@lists.linux.dev
Link: https://github.com/ClangBuiltLinux/linux/issues/1592
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org> # build
Signed-off-by: David S. Miller <davem@davemloft.net>
Oleksij Rempel [Thu, 16 Jun 2022 11:37:24 +0000 (13:37 +0200)]
net: ag71xx: fix discards 'const' qualifier warning
Current kernel will compile this driver with warnings. This patch will
fix it.
drivers/net/ethernet/atheros/ag71xx.c: In function 'ag71xx_fast_reset':
drivers/net/ethernet/atheros/ag71xx.c:996:31: warning: passing argument 2 of 'ag71xx_hw_set
_macaddr' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
996 | ag71xx_hw_set_macaddr(ag, dev->dev_addr);
| ~~~^~~~~~~~~~
drivers/net/ethernet/atheros/ag71xx.c:951:69: note: expected 'unsigned char *' but argument
is of type 'const unsigned char *'
951 | static void ag71xx_hw_set_macaddr(struct ag71xx *ag, unsigned char *mac)
| ~~~~~~~~~~~~~~~^~~
drivers/net/ethernet/atheros/ag71xx.c: In function 'ag71xx_open':
drivers/net/ethernet/atheros/ag71xx.c:1441:32: warning: passing argument 2 of 'ag71xx_hw_se
t_macaddr' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
1441 | ag71xx_hw_set_macaddr(ag, ndev->dev_addr);
| ~~~~^~~~~~~~~~
drivers/net/ethernet/atheros/ag71xx.c:951:69: note: expected 'unsigned char *' but argument
is of type 'const unsigned char *'
951 | static void ag71xx_hw_set_macaddr(struct ag71xx *ag, unsigned char *mac)
| ~~~~~~~~~~~~~~~^~~
Fixes:
adeef3e32146 ("net: constify netdev->dev_addr")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>