Yonghong Song [Sat, 10 Sep 2022 02:52:14 +0000 (19:52 -0700)]
libbpf: Improve BPF_PROG2 macro code quality and description
Commit
34586d29f8df ("libbpf: Add new BPF_PROG2 macro") added BPF_PROG2
macro for trampoline based programs with struct arguments. Andrii
made a few suggestions to improve code quality and description.
This patch implemented these suggestions including better internal
macro name, consistent usage pattern for __builtin_choose_expr(),
simpler macro definition for always-inline func arguments and
better macro description.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20220910025214.1536510-1-yhs@fb.com
Andrii Nakryiko [Wed, 21 Sep 2022 23:24:18 +0000 (16:24 -0700)]
Merge branch 'bpf: Add user-space-publisher ring buffer map type'
David Vernet says:
====================
This patch set defines a new map type, BPF_MAP_TYPE_USER_RINGBUF, which
provides single-user-space-producer / single-kernel-consumer semantics over
a ring buffer. Along with the new map type, a helper function called
bpf_user_ringbuf_drain() is added which allows a BPF program to specify a
callback with the following signature, to which samples are posted by the
helper:
void (struct bpf_dynptr *dynptr, void *context);
The program can then use the bpf_dynptr_read() or bpf_dynptr_data() helper
functions to safely read the sample from the dynptr. There are currently no
helpers available to determine the size of the sample, but one could easily
be added if required.
On the user-space side, libbpf has been updated to export a new
'struct ring_buffer_user' type, along with the following symbols:
struct ring_buffer_user *
ring_buffer_user__new(int map_fd,
const struct ring_buffer_user_opts *opts);
void ring_buffer_user__free(struct ring_buffer_user *rb);
void *ring_buffer_user__reserve(struct ring_buffer_user *rb,
uint32_t size);
void *ring_buffer_user__poll(struct ring_buffer_user *rb, uint32_t size,
int timeout_ms);
void ring_buffer_user__discard(struct ring_buffer_user *rb, void *sample);
void ring_buffer_user__submit(struct ring_buffer_user *rb, void *sample);
These symbols are exported for inclusion in libbpf version 1.0.0.
Signed-off-by: David Vernet <void@manifault.com>
---
v5 -> v6:
- Fixed s/BPF_MAP_TYPE_RINGBUF/BPF_MAP_TYPE_USER_RINGBUF typo in the
libbpf user ringbuf doxygen header comment for ring_buffer_user__new()
(Andrii).
- Specify that pointer returned from ring_buffer_user__reserve() and its
blocking counterpart is 8-byte aligned (Andrii).
- Renamed user_ringbuf__commit() to user_ringbuf_commit(), as it's static
(Andrii).
- Another slight reworking of user_ring_buffer__reserve_blocking() to
remove some extraneous nanosecond variables + checking (Andrii).
- Add a final check of user_ring_buffer__reserve() in
user_ring_buffer__reserve_blocking().
- Moved busy bit lock / unlock logic from __bpf_user_ringbuf_peek() to
bpf_user_ringbuf_drain() (Andrii).
- -ENOSPC -> -ENODATA for an empty ring buffer in
__bpf_user_ringbuf_peek() (Andrii).
- Updated BPF_RB_FORCE_WAKEUP to only force a wakeup notification to be
sent if even if no sample was drained.
- Changed a bit of the wording in the UAPI header for
bpf_user_ringbuf_drain() to mention the BPF_RB_FORCE_WAKEUP behavior.
- Remove extra space after return in ringbuf_map_poll_user() (Andrii).
- Removed now-extraneous paragraph from the commit summary of patch 2/4
(Andrii).
v4 -> v5:
- DENYLISTed the user-ringbuf test suite on s390x. We have a number of
functions in the progs/user_ringbuf_success.c prog that user-space
fires by invoking a syscall. Not all of these syscalls are available
on s390x. If and when we add the ability to kick the kernel from
user-space, or if we end up using iterators for that per Hao's
suggestion, we could re-enable this test suite on s390x.
- Fixed a few more places that needed ringbuffer -> ring buffer.
v3 -> v4:
- Update BPF_MAX_USER_RINGBUF_SAMPLES to not specify a bit, and instead
just specify a number of samples. (Andrii)
- Update "ringbuffer" in comments and commit summaries to say "ring
buffer". (Andrii)
- Return -E2BIG from bpf_user_ringbuf_drain() both when a sample can't
fit into the ring buffer, and when it can't fit into a dynptr. (Andrii)
- Don't loop over samples in __bpf_user_ringbuf_peek() if a sample was
discarded. Instead, return -EAGAIN so the caller can deal with it. Also
updated the caller to detect -EAGAIN and skip over it when iterating.
(Andrii)
- Removed the heuristic for notifying user-space when a sample is drained,
causing the ring buffer to no longer be full. This may be useful in the
future, but is being removed now because it's strictly a heuristic.
- Re-add BPF_RB_FORCE_WAKEUP flag to bpf_user_ringbuf_drain(). (Andrii)
- Remove helper_allocated_dynptr tracker from verifier. (Andrii)
- Add libbpf function header comments to tools/lib/bpf/libbpf.h, so that
they will be included in rendered libbpf docs. (Andrii)
- Add symbols to a new LIBBPF_1.1.0 section in linker version script,
rather than including them in LIBBPF_1.0.0. (Andrii)
- Remove libbpf_err() calls from static libbpf functions. (Andrii)
- Check user_ring_buffer_opts instead of ring_buffer_opts in
user_ring_buffer__new(). (Andrii)
- Avoid an extra if in the hot path in user_ringbuf__commit(). (Andrii)
- Use ENOSPC rather than ENODATA if no space is available in the ring
buffer. (Andrii)
- Don't round sample size in header to 8, but still round size that is
reserved and written to 8, and validate positions are multiples of 8
(Andrii).
- Use nanoseconds for most calculations in
user_ring_buffer__reserve_blocking(). (Andrii)
- Don't use CHECK() in testcases, instead use ASSERT_*. (Andrii)
- Use SEC("?raw_tp") instead of SEC("?raw_tp/sys_nanosleep") in negative
test. (Andrii)
- Move test_user_ringbuf.h header to live next to BPF program instead of
a directory up from both it and the user-space test program. (Andrii)
- Update bpftool help message / docs to also include user_ringbuf.
v2 -> v3:
- Lots of formatting fixes, such as keeping things on one line if they fit
within 100 characters, and removing some extraneous newlines. Applies
to all diffs in the patch-set. (Andrii)
- Renamed ring_buffer_user__* symbols to user_ring_buffer__*. (Andrii)
- Added a missing smb_mb__before_atomic() in
__bpf_user_ringbuf_sample_release(). (Hao)
- Restructure how and when notification events are sent from the kernel to
the user-space producers via the .map_poll() callback for the
BPF_MAP_TYPE_USER_RINGBUF map. Before, we only sent a notification when
the ringbuffer was fully drained. Now, we guarantee user-space that
we'll send an event at least once per bpf_user_ringbuf_drain(), as long
as at least one sample was drained, and BPF_RB_NO_WAKEUP was not passed.
As a heuristic, we also send a notification event any time a sample being
drained causes the ringbuffer to no longer be full. (Andrii)
- Continuing on the above point, updated
user_ring_buffer__reserve_blocking() to loop around epoll_wait() until a
sufficiently large sample is found. (Andrii)
- Communicate BPF_RINGBUF_BUSY_BIT and BPF_RINGBUF_DISCARD_BIT in sample
headers. The ringbuffer implementation still only supports
single-producer semantics, but we can now add synchronization support in
user_ring_buffer__reserve(), and will automatically get multi-producer
semantics. (Andrii)
- Updated some commit summaries, specifically adding more details where
warranted. (Andrii)
- Improved function documentation for bpf_user_ringbuf_drain(), more
clearly explaining all function arguments and return types, as well as
the semantics for waking up user-space producers.
- Add function header comments for user_ring_buffer__reserve{_blocking}().
(Andrii)
- Rounding-up all samples to 8-bytes in the user-space producer, and
enforcing that all samples are properly aligned in the kernel. (Andrii)
- Added testcases that verify that bpf_user_ringbuf_drain() properly
validates samples, and returns error conditions if any invalid samples
are encountered. (Andrii)
- Move atomic_t busy field out of the consumer page, and into the
struct bpf_ringbuf. (Andrii)
- Split ringbuf_map_{mmap, poll}_{kern, user}() into separate
implementations. (Andrii)
- Don't silently consume errors in bpf_user_ringbuf_drain(). (Andrii)
- Remove magic number of samples (4096) from bpf_user_ringbuf_drain(),
and instead use BPF_MAX_USER_RINGBUF_SAMPLES macro, which allows
128k samples. (Andrii)
- Remove MEM_ALLOC modifier from PTR_TO_DYNPTR register in verifier, and
instead rely solely on the register being PTR_TO_DYNPTR. (Andrii)
- Move freeing of atomic_t busy bit to before we invoke irq_work_queue() in
__bpf_user_ringbuf_sample_release(). (Andrii)
- Only check for BPF_RB_NO_WAKEUP flag in bpf_ringbuf_drain().
- Remove libbpf function names from kernel smp_{load, store}* comments in
the kernel. (Andrii)
- Don't use double-underscore naming convention in libbpf functions.
(Andrii)
- Use proper __u32 and __u64 for types where we need to guarantee their
size. (Andrii)
v1 -> v2:
- Following Joanne landing
883743422ced ("bpf: Fix ref_obj_id for dynptr
data slices in verifier") [0], removed [PATCH 1/5] bpf: Clear callee
saved regs after updating REG0 [1]. (Joanne)
- Following the above adjustment, updated check_helper_call() to not store
a reference for bpf_dynptr_data() if the register containing the dynptr
is of type MEM_ALLOC. (Joanne)
- Fixed casting issue pointed out by kernel test robot by adding a missing
(uintptr_t) cast. (lkp)
[0] https://lore.kernel.org/all/
20220809214055.4050604-1-joannelkoong@gmail.com/
[1] https://lore.kernel.org/all/
20220808155341.2479054-1-void@manifault.com/
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
David Vernet [Tue, 20 Sep 2022 00:01:00 +0000 (19:01 -0500)]
selftests/bpf: Add selftests validating the user ringbuf
This change includes selftests that validate the expected behavior and
APIs of the new BPF_MAP_TYPE_USER_RINGBUF map type.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220920000100.477320-5-void@manifault.com
David Vernet [Tue, 20 Sep 2022 00:00:59 +0000 (19:00 -0500)]
bpf: Add libbpf logic for user-space ring buffer
Now that all of the logic is in place in the kernel to support user-space
produced ring buffers, we can add the user-space logic to libbpf. This
patch therefore adds the following public symbols to libbpf:
struct user_ring_buffer *
user_ring_buffer__new(int map_fd,
const struct user_ring_buffer_opts *opts);
void *user_ring_buffer__reserve(struct user_ring_buffer *rb, __u32 size);
void *user_ring_buffer__reserve_blocking(struct user_ring_buffer *rb,
__u32 size, int timeout_ms);
void user_ring_buffer__submit(struct user_ring_buffer *rb, void *sample);
void user_ring_buffer__discard(struct user_ring_buffer *rb,
void user_ring_buffer__free(struct user_ring_buffer *rb);
A user-space producer must first create a struct user_ring_buffer * object
with user_ring_buffer__new(), and can then reserve samples in the
ring buffer using one of the following two symbols:
void *user_ring_buffer__reserve(struct user_ring_buffer *rb, __u32 size);
void *user_ring_buffer__reserve_blocking(struct user_ring_buffer *rb,
__u32 size, int timeout_ms);
With user_ring_buffer__reserve(), a pointer to a 'size' region of the ring
buffer will be returned if sufficient space is available in the buffer.
user_ring_buffer__reserve_blocking() provides similar semantics, but will
block for up to 'timeout_ms' in epoll_wait if there is insufficient space
in the buffer. This function has the guarantee from the kernel that it will
receive at least one event-notification per invocation to
bpf_ringbuf_drain(), provided that at least one sample is drained, and the
BPF program did not pass the BPF_RB_NO_WAKEUP flag to bpf_ringbuf_drain().
Once a sample is reserved, it must either be committed to the ring buffer
with user_ring_buffer__submit(), or discarded with
user_ring_buffer__discard().
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220920000100.477320-4-void@manifault.com
David Vernet [Tue, 20 Sep 2022 00:00:58 +0000 (19:00 -0500)]
bpf: Add bpf_user_ringbuf_drain() helper
In a prior change, we added a new BPF_MAP_TYPE_USER_RINGBUF map type which
will allow user-space applications to publish messages to a ring buffer
that is consumed by a BPF program in kernel-space. In order for this
map-type to be useful, it will require a BPF helper function that BPF
programs can invoke to drain samples from the ring buffer, and invoke
callbacks on those samples. This change adds that capability via a new BPF
helper function:
bpf_user_ringbuf_drain(struct bpf_map *map, void *callback_fn, void *ctx,
u64 flags)
BPF programs may invoke this function to run callback_fn() on a series of
samples in the ring buffer. callback_fn() has the following signature:
long callback_fn(struct bpf_dynptr *dynptr, void *context);
Samples are provided to the callback in the form of struct bpf_dynptr *'s,
which the program can read using BPF helper functions for querying
struct bpf_dynptr's.
In order to support bpf_ringbuf_drain(), a new PTR_TO_DYNPTR register
type is added to the verifier to reflect a dynptr that was allocated by
a helper function and passed to a BPF program. Unlike PTR_TO_STACK
dynptrs which are allocated on the stack by a BPF program, PTR_TO_DYNPTR
dynptrs need not use reference tracking, as the BPF helper is trusted to
properly free the dynptr before returning. The verifier currently only
supports PTR_TO_DYNPTR registers that are also DYNPTR_TYPE_LOCAL.
Note that while the corresponding user-space libbpf logic will be added
in a subsequent patch, this patch does contain an implementation of the
.map_poll() callback for BPF_MAP_TYPE_USER_RINGBUF maps. This
.map_poll() callback guarantees that an epoll-waiting user-space
producer will receive at least one event notification whenever at least
one sample is drained in an invocation of bpf_user_ringbuf_drain(),
provided that the function is not invoked with the BPF_RB_NO_WAKEUP
flag. If the BPF_RB_FORCE_WAKEUP flag is provided, a wakeup
notification is sent even if no sample was drained.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220920000100.477320-3-void@manifault.com
David Vernet [Tue, 20 Sep 2022 00:00:57 +0000 (19:00 -0500)]
bpf: Define new BPF_MAP_TYPE_USER_RINGBUF map type
We want to support a ringbuf map type where samples are published from
user-space, to be consumed by BPF programs. BPF currently supports a
kernel -> user-space circular ring buffer via the BPF_MAP_TYPE_RINGBUF
map type. We'll need to define a new map type for user-space -> kernel,
as none of the helpers exported for BPF_MAP_TYPE_RINGBUF will apply
to a user-space producer ring buffer, and we'll want to add one or
more helper functions that would not apply for a kernel-producer
ring buffer.
This patch therefore adds a new BPF_MAP_TYPE_USER_RINGBUF map type
definition. The map type is useless in its current form, as there is no
way to access or use it for anything until we one or more BPF helpers. A
follow-on patch will therefore add a new helper function that allows BPF
programs to run callbacks on samples that are published to the ring
buffer.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220920000100.477320-2-void@manifault.com
William Dean [Sat, 17 Sep 2022 08:42:48 +0000 (16:42 +0800)]
bpf: simplify code in btf_parse_hdr
It could directly return 'btf_check_sec_info' to simplify code.
Signed-off-by: William Dean <williamsukatube@163.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220917084248.3649-1-williamsukatube@163.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Xin Liu [Sat, 17 Sep 2022 08:48:09 +0000 (16:48 +0800)]
libbpf: Fix NULL pointer exception in API btf_dump__dump_type_data
We found that function btf_dump__dump_type_data can be called by the
user as an API, but in this function, the `opts` parameter may be used
as a null pointer.This causes `opts->indent_str` to trigger a NULL
pointer exception.
Fixes:
2ce8450ef5a3 ("libbpf: add bpf_object__open_{file, mem} w/ extensible opts")
Signed-off-by: Xin Liu <liuxin350@huawei.com>
Signed-off-by: Weibin Kong <kongweibin2@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220917084809.30770-1-liuxin350@huawei.com
Rong Tao [Sun, 11 Sep 2022 13:03:30 +0000 (21:03 +0800)]
samples/bpf: Replace blk_account_io_done() with __blk_account_io_done()
Since commit
be6bfe36db17 ("block: inline hot paths of blk_account_io_*()")
blk_account_io_*() become inline functions.
Signed-off-by: Rong Tao <rtoax@foxmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/tencent_1CC476835C219FACD84B6715F0D785517E07@qq.com
Martin KaFai Lau [Tue, 20 Sep 2022 21:30:35 +0000 (14:30 -0700)]
Merge branch 'bpf: Small nf_conn cleanups'
Daniel Xu says:
====================
This patchset cleans up a few small things:
* Delete unused stub
* Rename variable to be more descriptive
* Fix some `extern` declaration warnings
Past discussion:
- v2: https://lore.kernel.org/bpf/cover.
1663616584.git.dxu@dxuuu.xyz/
Changes since v2:
- Remove unused #include's
- Move #include <linux/filter.h> to .c
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Daniel Xu [Tue, 20 Sep 2022 14:15:24 +0000 (08:15 -0600)]
bpf: Move nf_conn extern declarations to filter.h
We're seeing the following new warnings on netdev/build_32bit and
netdev/build_allmodconfig_warn CI jobs:
../net/core/filter.c:8608:1: warning: symbol
'nf_conn_btf_access_lock' was not declared. Should it be static?
../net/core/filter.c:8611:5: warning: symbol 'nfct_bsa' was not
declared. Should it be static?
Fix by ensuring extern declaration is present while compiling filter.o.
Fixes:
864b656f82cc ("bpf: Add support for writing to nf_conn:mark")
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/2bd2e0283df36d8a4119605878edb1838d144174.1663683114.git.dxu@dxuuu.xyz
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Daniel Xu [Tue, 20 Sep 2022 14:15:23 +0000 (08:15 -0600)]
bpf: Rename nfct_bsa to nfct_btf_struct_access
The former name was a little hard to guess.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/73adc72385c8b162391fbfb404f0b6d4c5cc55d7.1663683114.git.dxu@dxuuu.xyz
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Daniel Xu [Tue, 20 Sep 2022 14:15:22 +0000 (08:15 -0600)]
bpf: Remove unused btf_struct_access stub
This stub was not being used anywhere.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/590e7bd6172ffe0f3d7b51cd40e8ded941aaf7e8.1663683114.git.dxu@dxuuu.xyz
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Hou Tao [Mon, 19 Sep 2022 14:48:11 +0000 (22:48 +0800)]
bpf: Check whether or not node is NULL before free it in free_bulk
llnode could be NULL if there are new allocations after the checking of
c-free_cnt > c->high_watermark in bpf_mem_refill() and before the
calling of __llist_del_first() in free_bulk (e.g. a PREEMPT_RT kernel
or allocation in NMI context). And it will incur oops as shown below:
BUG: kernel NULL pointer dereference, address:
0000000000000000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT_RT SMP
CPU: 39 PID: 373 Comm: irq_work/39 Tainted: G W 6.0.0-rc6-rt9+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:bpf_mem_refill+0x66/0x130
......
Call Trace:
<TASK>
irq_work_single+0x24/0x60
irq_work_run_list+0x24/0x30
run_irq_workd+0x18/0x20
smpboot_thread_fn+0x13f/0x2c0
kthread+0x121/0x140
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
Simply fixing it by checking whether or not llnode is NULL in free_bulk().
Fixes:
8d5a8011b35d ("bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20220919144811.3570825-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Hou Tao [Mon, 19 Sep 2022 03:57:14 +0000 (11:57 +0800)]
selftests/bpf: Add test result messages for test_task_storage_map_stress_lookup
Add test result message when test_task_storage_map_stress_lookup()
succeeds or is skipped. The test case can be skipped due to the choose
of preemption model in kernel config, so export skips in test_maps.c and
increase it when needed.
The following is the output of test_maps when the test case succeeds or
is skipped:
test_task_storage_map_stress_lookup:PASS
test_maps: OK, 0 SKIPPED
test_task_storage_map_stress_lookup SKIP (no CONFIG_PREEMPT)
test_maps: OK, 1 SKIPPED
Fixes:
73b97bc78b32 ("selftests/bpf: Test concurrent updates on bpf_task_storage_busy")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20220919035714.2195144-1-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Peilin Ye [Fri, 16 Sep 2022 20:28:00 +0000 (13:28 -0700)]
bpf/btf: Use btf_type_str() whenever possible
We have btf_type_str(). Use it whenever possible in btf.c, instead of
"btf_kind_str[BTF_INFO_KIND(t->info)]".
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Link: https://lore.kernel.org/r/20220916202800.31421-1-yepeilin.cs@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Xin Liu [Tue, 13 Sep 2022 07:36:43 +0000 (15:36 +0800)]
libbpf: Clean up legacy bpf maps declaration in bpf_helpers
Legacy BPF map declarations are no longer supported in libbpf v1.0 [0].
Only BTF-defined maps are supported starting from v1.0, so it is time to
remove the definition of bpf_map_def in bpf_helpers.h.
[0] https://github.com/libbpf/libbpf/wiki/Libbpf:-the-road-to-v1.0
Signed-off-by: Xin Liu <liuxin350@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20220913073643.19960-1-liuxin350@huawei.com
Andrii Nakryiko [Fri, 9 Sep 2022 19:30:53 +0000 (12:30 -0700)]
selftests/bpf: Add veristat tool for mass-verifying BPF object files
Add a small tool, veristat, that allows mass-verification of
a set of *libbpf-compatible* BPF ELF object files. For each such object
file, veristat will attempt to verify each BPF program *individually*.
Regardless of success or failure, it parses BPF verifier stats and
outputs them in human-readable table format. In the future we can also
add CSV and JSON output for more scriptable post-processing, if necessary.
veristat allows to specify a set of stats that should be output and
ordering between multiple objects and files (e.g., so that one can
easily order by total instructions processed, instead of default file
name, prog name, verdict, total instructions order).
This tool should be useful for validating various BPF verifier changes
or even validating different kernel versions for regressions.
Here's an example for some of the heaviest selftests/bpf BPF object
files:
$ sudo ./veristat -s insns,file,prog {pyperf,loop,test_verif_scale,strobemeta,test_cls_redirect,profiler}*.linked3.o
File Program Verdict Duration, us Total insns Total states Peak states
------------------------------------ ------------------------------------ ------- ------------ ----------- ------------ -----------
loop3.linked3.o while_true failure 350990 1000001 9663 9663
test_verif_scale3.linked3.o balancer_ingress success 115244 845499 8636 2141
test_verif_scale2.linked3.o balancer_ingress success 77688 773445 3048 788
pyperf600.linked3.o on_event success 2079872 624585 30335 30241
pyperf600_nounroll.linked3.o on_event success 353972 568128 37101 2115
strobemeta.linked3.o on_event success 455230 557149 15915 13537
test_verif_scale1.linked3.o balancer_ingress success 89880 554754 8636 2141
strobemeta_nounroll2.linked3.o on_event success 433906 501725 17087 1912
loop6.linked3.o trace_virtqueue_add_sgs success 282205 398057 8717 919
loop1.linked3.o nested_loops success 125630 361349 5504 5504
pyperf180.linked3.o on_event success 2511740 160398 11470 11446
pyperf100.linked3.o on_event success 744329 87681 6213 6191
test_cls_redirect.linked3.o cls_redirect success 54087 78925 4782 903
strobemeta_subprogs.linked3.o on_event success 57898 65420 1954 403
test_cls_redirect_subprogs.linked3.o cls_redirect success 54522 64965 4619 958
strobemeta_nounroll1.linked3.o on_event success 43313 57240 1757 382
pyperf50.linked3.o on_event success 194355 46378 3263 3241
profiler2.linked3.o tracepoint__syscalls__sys_enter_kill success 23869 43372 1423 542
pyperf_subprogs.linked3.o on_event success 29179 36358 2499 2499
profiler1.linked3.o tracepoint__syscalls__sys_enter_kill success 13052 27036 1946 936
profiler3.linked3.o tracepoint__syscalls__sys_enter_kill success 21023 26016 2186 915
profiler2.linked3.o kprobe__vfs_link success 5255 13896 303 271
profiler1.linked3.o kprobe__vfs_link success 7792 12687 1042 1041
profiler3.linked3.o kprobe__vfs_link success 7332 10601 865 865
profiler2.linked3.o kprobe_ret__do_filp_open success 3417 8900 216 199
profiler2.linked3.o kprobe__vfs_symlink success 3548 8775 203 186
pyperf_global.linked3.o on_event success 10007 7563 520 520
profiler3.linked3.o kprobe_ret__do_filp_open success 4708 6464 532 532
profiler1.linked3.o kprobe_ret__do_filp_open success 3090 6445 508 508
profiler3.linked3.o kprobe__vfs_symlink success 4477 6358 521 521
profiler1.linked3.o kprobe__vfs_symlink success 3381 6347 507 507
profiler2.linked3.o raw_tracepoint__sched_process_exec success 2464 5874 292 189
profiler3.linked3.o raw_tracepoint__sched_process_exec success 2677 4363 397 283
profiler2.linked3.o kprobe__proc_sys_write success 1800 4355 143 138
profiler1.linked3.o raw_tracepoint__sched_process_exec success 1649 4019 333 240
pyperf600_bpf_loop.linked3.o on_event success 2711 3966 306 306
profiler2.linked3.o raw_tracepoint__sched_process_exit success 1234 3138 83 66
profiler3.linked3.o kprobe__proc_sys_write success 1755 2623 223 223
profiler1.linked3.o kprobe__proc_sys_write success 1222 2456 193 193
loop2.linked3.o while_true success 608 1783 57 30
profiler3.linked3.o raw_tracepoint__sched_process_exit success 789 1680 146 146
profiler1.linked3.o raw_tracepoint__sched_process_exit success 592 1526 133 133
strobemeta_bpf_loop.linked3.o on_event success 1015 1512 106 106
loop4.linked3.o combinations success 165 524 18 17
profiler3.linked3.o raw_tracepoint__sched_process_fork success 196 299 25 25
profiler1.linked3.o raw_tracepoint__sched_process_fork success 109 265 19 19
profiler2.linked3.o raw_tracepoint__sched_process_fork success 111 265 19 19
loop5.linked3.o while_true success 47 84 9 9
------------------------------------ ------------------------------------ ------- ------------ ----------- ------------ -----------
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220909193053.577111-4-andrii@kernel.org
Andrii Nakryiko [Fri, 9 Sep 2022 19:30:52 +0000 (12:30 -0700)]
libbpf: Fix crash if SEC("freplace") programs don't have attach_prog_fd set
Fix SIGSEGV caused by libbpf trying to find attach type in vmlinux BTF
for freplace programs. It's wrong to search in vmlinux BTF and libbpf
doesn't even mark vmlinux BTF as required for freplace programs. So
trying to search anything in obj->vmlinux_btf might cause NULL
dereference if nothing else in BPF object requires vmlinux BTF.
Instead, error out if freplace (EXT) program doesn't specify
attach_prog_fd during at the load time.
Fixes:
91abb4a6d79d ("libbpf: Support attachment of BPF tracing programs to kernel modules")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220909193053.577111-3-andrii@kernel.org
Andrii Nakryiko [Fri, 9 Sep 2022 19:30:51 +0000 (12:30 -0700)]
selftests/bpf: Fix test_verif_scale{1,3} SEC() annotations
Use proper SEC("tc") for test_verif_scale{1,3} programs. It's not
a problem for selftests right now because we manually set type
programmatically, but not having correct SEC() definitions makes it
harded to generically load BPF object files.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220909193053.577111-2-andrii@kernel.org
Jiri Olsa [Sat, 3 Sep 2022 13:11:54 +0000 (15:11 +0200)]
bpf: Move bpf_dispatcher function out of ftrace locations
The dispatcher function is attached/detached to trampoline by
dispatcher update function. At the same time it's available as
ftrace attachable function.
After discussion [1] the proposed solution is to use compiler
attributes to alter bpf_dispatcher_##name##_func function:
- remove it from being instrumented with __no_instrument_function__
attribute, so ftrace has no track of it
- but still generate 5 nop instructions with patchable_function_entry(5)
attribute, which are expected by bpf_arch_text_poke used by
dispatcher update function
Enabling HAVE_DYNAMIC_FTRACE_NO_PATCHABLE option for x86, so
__patchable_function_entries functions are not part of ftrace/mcount
locations.
Adding attributes to bpf_dispatcher_XXX function on x86_64 so it's
kept out of ftrace locations and has 5 byte nop generated at entry.
These attributes need to be arch specific as pointed out by Ilya
Leoshkevic in here [2].
The dispatcher image is generated only for x86_64 arch, so the
code can stay as is for other archs.
[1] https://lore.kernel.org/bpf/
20220722110811.124515-1-jolsa@kernel.org/
[2] https://lore.kernel.org/bpf/
969a14281a7791c334d476825863ee449964dd0c.camel@linux.ibm.com/
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/20220903131154.420467-3-jolsa@kernel.org
Peter Zijlstra (Intel) [Sat, 3 Sep 2022 13:11:53 +0000 (15:11 +0200)]
ftrace: Add HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
x86 will shortly start using -fpatchable-function-entry for purposes
other than ftrace, make sure the __patchable_function_entry section
isn't merged in the mcount_loc section.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220903131154.420467-2-jolsa@kernel.org
Yauheni Kaliuta [Mon, 5 Sep 2022 09:01:49 +0000 (12:01 +0300)]
bpf: Use bpf_capable() instead of CAP_SYS_ADMIN for blinding decision
The full CAP_SYS_ADMIN requirement for blinding looks too strict nowadays.
These days given unprivileged BPF is disabled by default, the main users
for constant blinding coming from unprivileged in particular via cBPF -> eBPF
migration (e.g. old-style socket filters).
Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220831090655.156434-1-ykaliuta@redhat.com
Link: https://lore.kernel.org/bpf/20220905090149.61221-1-ykaliuta@redhat.com
Wang Yufen [Tue, 13 Sep 2022 08:40:33 +0000 (16:40 +0800)]
bpf: use kvmemdup_bpfptr helper
Use kvmemdup_bpfptr helper instead of open-coding to
simplify the code.
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/1663058433-14089-1-git-send-email-wangyufen@huawei.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Dave Marchevsky [Mon, 12 Sep 2022 15:45:44 +0000 (08:45 -0700)]
bpf: Add verifier check for BPF_PTR_POISON retval and arg
BPF_PTR_POISON was added in commit
c0a5a21c25f37 ("bpf: Allow storing
referenced kptr in map") to denote a bpf_func_proto btf_id which the
verifier will replace with a dynamically-determined btf_id at verification
time.
This patch adds verifier 'poison' functionality to BPF_PTR_POISON in
order to prepare for expanded use of the value to poison ret- and
arg-btf_id in ongoing work, namely rbtree and linked list patchsets
[0, 1]. Specifically, when the verifier checks helper calls, it assumes
that BPF_PTR_POISON'ed ret type will be replaced with a valid type before
- or in lieu of - the default ret_btf_id logic. Similarly for arg btf_id.
If poisoned btf_id reaches default handling block for either, consider
this a verifier internal error and fail verification. Otherwise a helper
w/ poisoned btf_id but no verifier logic replacing the type will cause a
crash as the invalid pointer is dereferenced.
Also move BPF_PTR_POISON to existing include/linux/posion.h header and
remove unnecessary shift.
[0]: lore.kernel.org/bpf/
20220830172759.4069786-1-davemarchevsky@fb.com
[1]: lore.kernel.org/bpf/
20220904204145.3089-1-memxor@gmail.com
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220912154544.1398199-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Marchevsky [Thu, 8 Sep 2022 23:07:16 +0000 (16:07 -0700)]
bpf: Add verifier support for custom callback return range
Verifier logic to confirm that a callback function returns 0 or 1 was
added in commit
69c087ba6225b ("bpf: Add bpf_for_each_map_elem() helper").
At the time, callback return value was only used to continue or stop
iteration.
In order to support callbacks with a broader return value range, such as
those added in rbtree series[0] and others, add a callback_ret_range to
bpf_func_state. Verifier's helpers which set in_callback_fn will also
set the new field, which the verifier will later use to check return
value bounds.
Default to tnum_range(0, 0) instead of using tnum_unknown as a sentinel
value as the latter would prevent the valid range (0, U64_MAX) being
used. Previous global default tnum_range(0, 1) is explicitly set for
extant callback helpers. The change to global default was made after
discussion around this patch in rbtree series [1], goal here is to make
it more obvious that callback_ret_range should be explicitly set.
[0]: lore.kernel.org/bpf/
20220830172759.4069786-1-davemarchevsky@fb.com/
[1]: lore.kernel.org/bpf/
20220830172759.4069786-2-davemarchevsky@fb.com/
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220908230716.2751723-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Lorenzo Bianconi [Thu, 8 Sep 2022 08:06:12 +0000 (10:06 +0200)]
selftests/bpf: fix ct status check in bpf_nf selftests
Check properly the connection tracking entry status configured running
bpf_ct_change_status kfunc.
Remove unnecessary IPS_CONFIRMED status configuration since it is
already done during entry allocation.
Fixes:
6eb7fba007a7 ("selftests/bpf: Add tests for new nf_conntrack kfuncs")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/813a5161a71911378dfac8770ec890428e4998aa.1662623574.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov [Sun, 11 Sep 2022 00:27:33 +0000 (17:27 -0700)]
Merge branch 'Support direct writes to nf_conn:mark'
Daniel Xu says:
====================
Support direct writes to nf_conn:mark from TC and XDP prog types. This
is useful when applications want to store per-connection metadata. This
is also particularly useful for applications that run both bpf and
iptables/nftables because the latter can trivially access this metadata.
One example use case would be if a bpf prog is responsible for advanced
packet classification and iptables/nftables is later used for routing
due to pre-existing/legacy code.
Past discussion:
- v4: https://lore.kernel.org/bpf/cover.
1661192455.git.dxu@dxuuu.xyz/
- v3: https://lore.kernel.org/bpf/cover.
1660951028.git.dxu@dxuuu.xyz/
- v2: https://lore.kernel.org/bpf/CAP01T74Sgn354dXGiFWFryu4vg+o8b9s9La1d9zEbC4LGvH4qg@mail.gmail.com/T/
- v1: https://lore.kernel.org/bpf/cover.
1660592020.git.dxu@dxuuu.xyz/
Changes since v4:
- Use exported function pointer + mutex to handle CONFIG_NF_CONNTRACK=m
case
Changes since v3:
- Use a mutex to protect module load/unload critical section
Changes since v2:
- Remove use of NOT_INIT for btf_struct_access write path
- Disallow nf_conn writing when nf_conntrack module not loaded
- Support writing to nf_conn___init:mark
Changes since v1:
- Add unimplemented stub for when !CONFIG_BPF_SYSCALL
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Xu [Wed, 7 Sep 2022 16:40:41 +0000 (10:40 -0600)]
selftests/bpf: Add tests for writing to nf_conn:mark
Add a simple extension to the existing selftest to write to
nf_conn:mark. Also add a failure test for writing to unsupported field.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/f78966b81b9349d2b8ebb4cee2caf15cb6b38ee2.1662568410.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Xu [Wed, 7 Sep 2022 16:40:40 +0000 (10:40 -0600)]
bpf: Add support for writing to nf_conn:mark
Support direct writes to nf_conn:mark from TC and XDP prog types. This
is useful when applications want to store per-connection metadata. This
is also particularly useful for applications that run both bpf and
iptables/nftables because the latter can trivially access this metadata.
One example use case would be if a bpf prog is responsible for advanced
packet classification and iptables/nftables is later used for routing
due to pre-existing/legacy code.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/ebca06dea366e3e7e861c12f375a548cc4c61108.1662568410.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Xu [Wed, 7 Sep 2022 16:40:39 +0000 (10:40 -0600)]
bpf: Export btf_type_by_id() and bpf_log()
These symbols will be used in nf_conntrack.ko to support direct writes
to `nf_conn`.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/3c98c19dc50d3b18ea5eca135b4fc3a5db036060.1662568410.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Xu [Wed, 7 Sep 2022 16:40:38 +0000 (10:40 -0600)]
bpf: Use 0 instead of NOT_INIT for btf_struct_access() writes
Returning a bpf_reg_type only makes sense in the context of a BPF_READ.
For writes, prefer to explicitly return 0 for clarity.
Note that is non-functional change as it just so happened that NOT_INIT
== 0.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/01772bc1455ae16600796ac78c6cc9fff34f95ff.1662568410.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Xu [Wed, 7 Sep 2022 16:40:37 +0000 (10:40 -0600)]
bpf: Add stub for btf_struct_access()
Add corresponding unimplemented stub for when CONFIG_BPF_SYSCALL=n
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/4021398e884433b1fef57a4d28361bb9fcf1bd05.1662568410.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Xu [Wed, 7 Sep 2022 16:40:36 +0000 (10:40 -0600)]
bpf: Remove duplicate PTR_TO_BTF_ID RO check
Since commit
27ae7997a661 ("bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS")
there has existed bpf_verifier_ops:btf_struct_access. When
btf_struct_access is _unset_ for a prog type, the verifier runs the
default implementation, which is to enforce read only:
if (env->ops->btf_struct_access) {
[...]
} else {
if (atype != BPF_READ) {
verbose(env, "only read is supported\n");
return -EACCES;
}
[...]
}
When btf_struct_access is _set_, the expectation is that
btf_struct_access has full control over accesses, including if writes
are allowed.
Rather than carve out an exception for each prog type that may write to
BTF ptrs, delete the redundant check and give full control to
btf_struct_access.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/962da2bff1238746589e332ff1aecc49403cd7ce.1662568410.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Punit Agrawal [Wed, 7 Sep 2022 15:57:46 +0000 (16:57 +0100)]
bpf: Simplify code by using for_each_cpu_wrap()
In the percpu freelist code, it is a common pattern to iterate over
the possible CPUs mask starting with the current CPU. The pattern is
implemented using a hand rolled while loop with the loop variable
increment being open-coded.
Simplify the code by using for_each_cpu_wrap() helper to iterate over
the possible cpus starting with the current CPU. As a result, some of
the special-casing in the loop also gets simplified.
No functional change intended.
Signed-off-by: Punit Agrawal <punit.agrawal@bytedance.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20220907155746.1750329-1-punit.agrawal@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tetsuo Handa [Sat, 10 Sep 2022 15:07:11 +0000 (00:07 +0900)]
bpf: add missing percpu_counter_destroy() in htab_map_alloc()
syzbot is reporting ODEBUG bug in htab_map_alloc() [1], for
commit
86fe28f7692d96d2 ("bpf: Optimize element count in non-preallocated
hash map.") added percpu_counter_init() to htab_map_alloc() but forgot to
add percpu_counter_destroy() to the error path.
Link: https://syzkaller.appspot.com/bug?extid=5d1da78b375c3b5e6c2b
Reported-by: syzbot <syzbot+5d1da78b375c3b5e6c2b@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes:
86fe28f7692d96d2 ("bpf: Optimize element count in non-preallocated hash map.")
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/e2e4cc0e-9d36-4ca1-9bfa-ce23e6f8310b@I-love.SAKURA.ne.jp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin KaFai Lau [Fri, 9 Sep 2022 17:40:45 +0000 (10:40 -0700)]
Merge branch 'cgroup/connect{4,6} programs for unprivileged ICMP ping'
YiFei Zhu says:
====================
Usually when a TCP/UDP connection is initiated, we can bind the socket
to a specific IP attached to an interface in a cgroup/connect hook.
But for pings, this is impossible, as the hook is not being called.
This series adds the invocation for cgroup/connect{4,6} programs to
unprivileged ICMP ping (i.e. ping sockets created with SOCK_DGRAM
IPPROTO_ICMP(V6) as opposed to SOCK_RAW). This also adds a test to
verify that the hooks are being called and invoking bpf_bind() from
within the hook actually binds the socket.
Patch 1 adds the invocation of the hook.
Patch 2 deduplicates write_sysctl in BPF test_progs.
Patch 3 adds the tests for this hook.
v1 -> v2:
* Added static to bindaddr_v6 in prog_tests/connect_ping.c
* Deduplicated much of the test logic in prog_tests/connect_ping.c
* Deduplicated write_sysctl() to test_progs.c
v2 -> v3:
* Renamed variable "obj" to "skel" for the BPF skeleton object in
prog_tests/connect_ping.c
v3 -> v4:
* Fixed error path to destroy skel in prog_tests/connect_ping.c
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
YiFei Zhu [Fri, 9 Sep 2022 00:49:41 +0000 (00:49 +0000)]
selftests/bpf: Ensure cgroup/connect{4,6} programs can bind unpriv ICMP ping
This tests that when an unprivileged ICMP ping socket connects,
the hooks are actually invoked. We also ensure that if the hook does
not call bpf_bind(), the bound address is unmodified, and if the
hook calls bpf_bind(), the bound address is exactly what we provided
to the helper.
A new netns is used to enable ping_group_range in the test without
affecting ouside of the test, because by default, not even root is
permitted to use unprivileged ICMP ping...
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Link: https://lore.kernel.org/r/086b227c1b97f4e94193e58aae7576d0261b68a4.1662682323.git.zhuyifei@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
YiFei Zhu [Fri, 9 Sep 2022 00:49:40 +0000 (00:49 +0000)]
selftests/bpf: Deduplicate write_sysctl() to test_progs.c
This helper is needed in multiple tests. Instead of copying it over
and over, better to deduplicate this helper to test_progs.c.
test_progs.c is chosen over testing_helpers.c because of this helper's
use of CHECK / ASSERT_*, and the CHECK was modified to use ASSERT_*
so it does not rely on a duration variable.
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Link: https://lore.kernel.org/r/9b4fc9a27bd52f771b657b4c4090fc8d61f3a6b5.1662682323.git.zhuyifei@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
YiFei Zhu [Fri, 9 Sep 2022 00:49:39 +0000 (00:49 +0000)]
bpf: Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping
Usually when a TCP/UDP connection is initiated, we can bind the socket
to a specific IP attached to an interface in a cgroup/connect hook.
But for pings, this is impossible, as the hook is not being called.
This adds the hook invocation to unprivileged ICMP ping (i.e. ping
sockets created with SOCK_DGRAM IPPROTO_ICMP(V6) as opposed to
SOCK_RAW. Logic is mirrored from UDP sockets where the hook is invoked
during pre_connect, after a check for suficiently sized addr_len.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Link: https://lore.kernel.org/r/5764914c252fad4cd134fb6664c6ede95f409412.1662682323.git.zhuyifei@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Daniel Borkmann [Fri, 9 Sep 2022 14:15:11 +0000 (16:15 +0200)]
libbpf: Remove gcc support for bpf_tail_call_static for now
This reverts commit
14e5ce79943a ("libbpf: Add GCC support for
bpf_tail_call_static"). Reason is that gcc invented their own BPF asm
which is not conform with LLVM one, and going forward this would be
more painful to maintain here and in other areas of the library. Thus
remove it; ask to gcc folks is to align with LLVM one to use exact
same syntax.
Fixes:
14e5ce79943a ("libbpf: Add GCC support for bpf_tail_call_static")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: James Hilliard <james.hilliard1@gmail.com>
Cc: Jose E. Marchesi <jose.marchesi@oracle.com>
Kumar Kartikeya Dwivedi [Sun, 4 Sep 2022 20:41:28 +0000 (22:41 +0200)]
bpf: Add helper macro bpf_for_each_reg_in_vstate
For a lot of use cases in future patches, we will want to modify the
state of registers part of some same 'group' (e.g. same ref_obj_id). It
won't just be limited to releasing reference state, but setting a type
flag dynamically based on certain actions, etc.
Hence, we need a way to easily pass a callback to the function that
iterates over all registers in current bpf_verifier_state in all frames
upto (and including) the curframe.
While in C++ we would be able to easily use a lambda to pass state and
the callback together, sadly we aren't using C++ in the kernel. The next
best thing to avoid defining a function for each case seems like
statement expressions in GNU C. The kernel already uses them heavily,
hence they can passed to the macro in the style of a lambda. The
statement expression will then be substituted in the for loop bodies.
Variables __state and __reg are set to current bpf_func_state and reg
for each invocation of the expression inside the passed in verifier
state.
Then, convert mark_ptr_or_null_regs, clear_all_pkt_pointers,
release_reference, find_good_pkt_pointers, find_equal_scalars to
use bpf_for_each_reg_in_vstate.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220904204145.3089-16-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Kumar Kartikeya Dwivedi [Sun, 4 Sep 2022 20:41:16 +0000 (22:41 +0200)]
bpf: Add zero_map_value to zero map value with special fields
We need this helper to skip over special fields (bpf_spin_lock,
bpf_timer, kptrs) while zeroing a map value. Use the same logic as
copy_map_value but memset instead of memcpy.
Currently, the code zeroing map value memory does not have to deal with
special fields, hence this is a prerequisite for introducing such
support.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220904204145.3089-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Kumar Kartikeya Dwivedi [Sun, 4 Sep 2022 20:41:15 +0000 (22:41 +0200)]
bpf: Support kptrs in percpu arraymap
Enable support for kptrs in percpu BPF arraymap by wiring up the freeing
of these kptrs from percpu map elements.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220904204145.3089-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Kumar Kartikeya Dwivedi [Sun, 4 Sep 2022 20:41:14 +0000 (22:41 +0200)]
bpf: Add copy_map_value_long to copy to remote percpu memory
bpf_long_memcpy is used while copying to remote percpu regions from BPF
syscall and helpers, so that the copy is atomic at word size
granularity.
This might not be possible when you copy from map value hosting kptrs
from or to percpu maps, as the alignment or size in disjoint regions may
not be multiple of word size.
Hence, to avoid complicating the copy loop, we only use bpf_long_memcpy
when special fields are not present, otherwise use normal memcpy to copy
the disjoint regions.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220904204145.3089-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jules Irenge [Wed, 7 Sep 2022 15:24:20 +0000 (16:24 +0100)]
bpf: Fix resetting logic for unreferenced kptrs
Sparse reported a warning at bpf_map_free_kptrs()
"warning: Using plain integer as NULL pointer"
During the process of fixing this warning, it was discovered that the current
code erroneously writes to the pointer variable instead of deferencing and
writing to the actual kptr. Hence, Sparse tool accidentally helped to uncover
this problem. Fix this by doing WRITE_ONCE(*p, 0) instead of WRITE_ONCE(p, 0).
Note that the effect of this bug is that unreferenced kptrs will not be cleared
during check_and_free_fields. It is not a problem if the clearing is not done
during map_free stage, as there is nothing to free for them.
Fixes:
14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Link: https://lore.kernel.org/r/Yxi3pJaK6UDjVJSy@playground
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov [Wed, 7 Sep 2022 17:57:29 +0000 (10:57 -0700)]
Merge branch 'bpf-core changes for preparation of HID-bpf'
Benjamin Tissoires says:
====================
Hi,
well, given that the HID changes haven't moved a lot in the past
revisions and that I am cc-ing a bunch of people, I have dropped them
while we focus on the last 2 requirements in bpf-core changes.
I'll submit a HID targeted series when we get these in tree, which
would make things a lore more independent.
For reference, the whole reasons for these 2 main changes are at
https://lore.kernel.org/bpf/
20220902132938.2409206-1-benjamin.tissoires@redhat.com/
Compared to v10 (in addition of dropping the HID changes), I have
changed the selftests so we can test both light skeletons and libbbpf
calls.
Cheers,
Benjamin
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Benjamin Tissoires [Tue, 6 Sep 2022 15:13:03 +0000 (17:13 +0200)]
selftests/bpf: Add tests for kfunc returning a memory pointer
We add 2 new kfuncs that are following the RET_PTR_TO_MEM
capability from the previous commit.
Then we test them in selftests:
the first tests are testing valid case, and are not failing,
and the later ones are actually preventing the program to be loaded
because they are wrong.
To work around that, we mark the failing ones as not autoloaded
(with SEC("?tc")), and we manually enable them one by one, ensuring
the verifier rejects them.
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Link: https://lore.kernel.org/r/20220906151303.2780789-8-benjamin.tissoires@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Benjamin Tissoires [Tue, 6 Sep 2022 15:13:02 +0000 (17:13 +0200)]
bpf/verifier: allow kfunc to return an allocated mem
For drivers (outside of network), the incoming data is not statically
defined in a struct. Most of the time the data buffer is kzalloc-ed
and thus we can not rely on eBPF and BTF to explore the data.
This commit allows to return an arbitrary memory, previously allocated by
the driver.
An interesting extra point is that the kfunc can mark the exported
memory region as read only or read/write.
So, when a kfunc is not returning a pointer to a struct but to a plain
type, we can consider it is a valid allocated memory assuming that:
- one of the arguments is either called rdonly_buf_size or
rdwr_buf_size
- and this argument is a const from the caller point of view
We can then use this parameter as the size of the allocated memory.
The memory is either read-only or read-write based on the name
of the size parameter.
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Link: https://lore.kernel.org/r/20220906151303.2780789-7-benjamin.tissoires@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Benjamin Tissoires [Tue, 6 Sep 2022 15:13:01 +0000 (17:13 +0200)]
bpf/btf: bump BTF_KFUNC_SET_MAX_CNT
net/bpf/test_run.c is already presenting 20 kfuncs.
net/netfilter/nf_conntrack_bpf.c is also presenting an extra 10 kfuncs.
Given that all the kfuncs are regrouped into one unique set, having
only 2 space left prevent us to add more selftests.
Bump it to 256.
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Link: https://lore.kernel.org/r/20220906151303.2780789-6-benjamin.tissoires@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Benjamin Tissoires [Tue, 6 Sep 2022 15:13:00 +0000 (17:13 +0200)]
selftests/bpf: add test for accessing ctx from syscall program type
We need to also export the kfunc set to the syscall program type,
and then add a couple of eBPF programs that are testing those calls.
The first one checks for valid access, and the second one is OK
from a static analysis point of view but fails at run time because
we are trying to access outside of the allocated memory.
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Link: https://lore.kernel.org/r/20220906151303.2780789-5-benjamin.tissoires@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Benjamin Tissoires [Tue, 6 Sep 2022 15:12:59 +0000 (17:12 +0200)]
bpf/verifier: allow all functions to read user provided context
When a function was trying to access data from context in a syscall eBPF
program, the verifier was rejecting the call unless it was accessing the
first element.
This is because the syscall context is not known at compile time, and
so we need to check this when actually accessing it.
Check for the valid memory access if there is no convert_ctx callback,
and allow such situation to happen.
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Link: https://lore.kernel.org/r/20220906151303.2780789-4-benjamin.tissoires@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Benjamin Tissoires [Tue, 6 Sep 2022 15:12:58 +0000 (17:12 +0200)]
bpf: split btf_check_subprog_arg_match in two
btf_check_subprog_arg_match() was used twice in verifier.c:
- when checking for the type mismatches between a (sub)prog declaration
and BTF
- when checking the call of a subprog to see if the provided arguments
are correct and valid
This is problematic when we check if the first argument of a program
(pointer to ctx) is correctly accessed:
To be able to ensure we access a valid memory in the ctx, the verifier
assumes the pointer to context is not null.
This has the side effect of marking the program accessing the entire
context, even if the context is never dereferenced.
For example, by checking the context access with the current code, the
following eBPF program would fail with -EINVAL if the ctx is set to null
from the userspace:
```
SEC("syscall")
int prog(struct my_ctx *args) {
return 0;
}
```
In that particular case, we do not want to actually check that the memory
is correct while checking for the BTF validity, but we just want to
ensure that the (sub)prog definition matches the BTF we have.
So split btf_check_subprog_arg_match() in two so we can actually check
for the memory used when in a call, and ignore that part when not.
Note that a further patch is in preparation to disentangled
btf_check_func_arg_match() from these two purposes, and so right now we
just add a new hack around that by adding a boolean to this function.
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220906151303.2780789-3-benjamin.tissoires@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Benjamin Tissoires [Tue, 6 Sep 2022 15:12:57 +0000 (17:12 +0200)]
selftests/bpf: regroup and declare similar kfuncs selftests in an array
Similar to tools/testing/selftests/bpf/prog_tests/dynptr.c:
we declare an array of tests that we run one by one in a for loop.
Followup patches will add more similar-ish tests, so avoid a lot of copy
paste by grouping the declaration in an array.
For light skeletons, we have to rely on the offsetof() macro so we can
statically declare which program we are using.
In the libbpf case, we can rely on bpf_object__find_program_by_name().
So also change the Makefile to generate both light skeletons and normal
ones.
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220906151303.2780789-2-benjamin.tissoires@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov [Wed, 7 Sep 2022 02:52:12 +0000 (19:52 -0700)]
Merge branch 'bpf: Support struct argument for trampoline base progs'
Yonghong Song says:
====================
Currently struct arguments are not supported for trampoline based progs.
One of major reason is that struct argument may pass by value which may
use more than one registers. This breaks trampoline progs where
each argument is assumed to take one register. bcc community reported the
issue ([1]) where struct argument is not supported for fentry program.
typedef struct {
uid_t val;
} kuid_t;
typedef struct {
gid_t val;
} kgid_t;
int security_path_chown(struct path *path, kuid_t uid, kgid_t gid);
Inside Meta, we also have a use case to attach to tcp_setsockopt()
typedef struct {
union {
void *kernel;
void __user *user;
};
bool is_kernel : 1;
} sockptr_t;
int tcp_setsockopt(struct sock *sk, int level, int optname,
sockptr_t optval, unsigned int optlen);
This patch added struct value support for bpf tracing programs which
uses trampoline. Only <= 16 byte struct size is supported for now
which covers use cases in the above. For x86/arm64/bpf, <= 16
struct value will be passed in registers instead of by reference.
Only x86_64 is supported in this patch. arm64 support can be
added later.
[1] https://github.com/iovisor/bcc/issues/3657
Changelog:
v3 -> v4:
- fix a test failure where no casting for
bpf_get_func_arg() value as the value type is 'int'.
- add tracing_struct test in DENYLIST.s390x
- simplify macro BPF_REG_CNT for BPF_PROG2.
v2 -> v3:
- previously struct arguments (<= 16 bytes) are passed
by reference for bpf programs. Suggested by Alexei,
it is passed by value now.
- in order to support passing <= 16 struct value, a
new macro BPF_PROG2 is invented.
rfc v1 -> v2:
- changed bpf_func_model struct info fields to
arg_flags[] to make it easy to iterate arguments
in arch specific {save|restore}_regs() functions.
- added fexit tests to test return values with
struct arguments.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Wed, 31 Aug 2022 15:27:23 +0000 (08:27 -0700)]
selftests/bpf: Add tracing_struct test in DENYLIST.s390x
Add tracing_struct test in DENYLIST.s390x since s390x does not
support trampoline now.
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220831152723.2081551-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Wed, 31 Aug 2022 15:27:18 +0000 (08:27 -0700)]
selftests/bpf: Use BPF_PROG2 for some fentry programs without struct arguments
Use BPF_PROG2 instead of BPF_PROG for programs in progs/timer.c
to test BPF_PROG2 for cases without struct arguments.
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220831152718.2081091-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Wed, 31 Aug 2022 15:27:13 +0000 (08:27 -0700)]
selftests/bpf: Add struct argument tests with fentry/fexit programs.
Add various struct argument tests with fentry/fexit programs.
Also add one test with a kernel func which does not have any
argument to test BPF_PROG2 macro in such situation.
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220831152713.2080039-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Wed, 31 Aug 2022 15:27:07 +0000 (08:27 -0700)]
libbpf: Add new BPF_PROG2 macro
To support struct arguments in trampoline based programs,
existing BPF_PROG doesn't work any more since
the type size is needed to find whether a parameter
takes one or two registers. So this patch added a new
BPF_PROG2 macro to support such trampoline programs.
The idea is suggested by Andrii. For example, if the
to-be-traced function has signature like
typedef struct {
void *x;
int t;
} sockptr;
int blah(sockptr x, char y);
In the new BPF_PROG2 macro, the argument can be
represented as
__bpf_prog_call(
({ union {
struct { __u64 x, y; } ___z;
sockptr x;
} ___tmp = { .___z = { ctx[0], ctx[1] }};
___tmp.x;
}),
({ union {
struct { __u8 x; } ___z;
char y;
} ___tmp = { .___z = { ctx[2] }};
___tmp.y;
}));
In the above, the values stored on the stack are properly
assigned to the actual argument type value by using 'union'
magic. Note that the macro also works even if no arguments
are with struct types.
Note that new BPF_PROG2 works for both llvm16 and pre-llvm16
compilers where llvm16 supports bpf target passing value
with struct up to 16 byte size and pre-llvm16 will pass
by reference by storing values on the stack. With static functions
with struct argument as always inline, the compiler is able
to optimize and remove additional stack saving of struct values.
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220831152707.2079473-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Wed, 31 Aug 2022 15:27:02 +0000 (08:27 -0700)]
bpf: arm64: No support of struct argument in trampoline programs
ARM64 does not support struct argument for trampoline based
bpf programs yet.
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220831152702.2079066-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Wed, 31 Aug 2022 15:26:57 +0000 (08:26 -0700)]
bpf: Update descriptions for helpers bpf_get_func_arg[_cnt]()
Now instead of the number of arguments, the number of registers
holding argument values are stored in trampoline. Update
the description of bpf_get_func_arg[_cnt]() helpers. Previous
programs without struct arguments should continue to work
as usual.
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220831152657.2078805-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Wed, 31 Aug 2022 15:26:52 +0000 (08:26 -0700)]
bpf: x86: Support in-register struct arguments in trampoline programs
In C, struct value can be passed as a function argument.
For small structs, struct value may be passed in
one or more registers. For trampoline based bpf programs,
this would cause complication since one-to-one mapping between
function argument and arch argument register is not valid
any more.
The latest llvm16 added bpf support to pass by values
for struct up to 16 bytes ([1]). This is also true for
x86_64 architecture where two registers will hold
the struct value if the struct size is >8 and <= 16.
This may not be true if one of struct member is 'double'
type but in current linux source code we don't have
such instance yet, so we assume all >8 && <= 16 struct
holds two general purpose argument registers.
Also change on-stack nr_args value to the number
of registers holding the arguments. This will
permit bpf_get_func_arg() helper to get all
argument values.
[1] https://reviews.llvm.org/D132144
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220831152652.2078600-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Wed, 31 Aug 2022 15:26:46 +0000 (08:26 -0700)]
bpf: Allow struct argument in trampoline based programs
Allow struct argument in trampoline based programs where
the struct size should be <= 16 bytes. In such cases, the argument
will be put into up to 2 registers for bpf, x86_64 and arm64
architectures.
To support arch-specific trampoline manipulation,
add arg_flags for additional struct information about arguments
in btf_func_model. Such information will be used in arch specific
function arch_prepare_bpf_trampoline() to prepare argument access
properly in trampoline.
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220831152646.2078089-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov [Wed, 7 Sep 2022 02:38:53 +0000 (19:38 -0700)]
bpf: Replace __ksize with ksize.
__ksize() was made private. Use ksize() instead.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Paolo Abeni [Tue, 6 Sep 2022 21:21:14 +0000 (23:21 +0200)]
Merge https://git./linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-09-05
The following pull-request contains BPF updates for your *net-next* tree.
We've added 106 non-merge commits during the last 18 day(s) which contain
a total of 159 files changed, 5225 insertions(+), 1358 deletions(-).
There are two small merge conflicts, resolve them as follows:
1) tools/testing/selftests/bpf/DENYLIST.s390x
Commit
27e23836ce22 ("selftests/bpf: Add lru_bug to s390x deny list") in
bpf tree was needed to get BPF CI green on s390x, but it conflicted with
newly added tests on bpf-next. Resolve by adding both hunks, result:
[...]
lru_bug # prog 'printk': failed to auto-attach: -524
setget_sockopt # attach unexpected error: -524 (trampoline)
cb_refs # expected error message unexpected error: -524 (trampoline)
cgroup_hierarchical_stats # JIT does not support calling kernel function (kfunc)
htab_update # failed to attach: ERROR: strerror_r(-524)=22 (trampoline)
[...]
2) net/core/filter.c
Commit
1227c1771dd2 ("net: Fix data-races around sysctl_[rw]mem_(max|default).")
from net tree conflicts with commit
29003875bd5b ("bpf: Change bpf_setsockopt(SOL_SOCKET)
to reuse sk_setsockopt()") from bpf-next tree. Take the code as it is from
bpf-next tree, result:
[...]
if (getopt) {
if (optname == SO_BINDTODEVICE)
return -EINVAL;
return sk_getsockopt(sk, SOL_SOCKET, optname,
KERNEL_SOCKPTR(optval),
KERNEL_SOCKPTR(optlen));
}
return sk_setsockopt(sk, SOL_SOCKET, optname,
KERNEL_SOCKPTR(optval), *optlen);
[...]
The main changes are:
1) Add any-context BPF specific memory allocator which is useful in particular for BPF
tracing with bonus of performance equal to full prealloc, from Alexei Starovoitov.
2) Big batch to remove duplicated code from bpf_{get,set}sockopt() helpers as an effort
to reuse the existing core socket code as much as possible, from Martin KaFai Lau.
3) Extend BPF flow dissector for BPF programs to just augment the in-kernel dissector
with custom logic. In other words, allow for partial replacement, from Shmulik Ladkani.
4) Add a new cgroup iterator to BPF with different traversal options, from Hao Luo.
5) Support for BPF to collect hierarchical cgroup statistics efficiently through BPF
integration with the rstat framework, from Yosry Ahmed.
6) Support bpf_{g,s}et_retval() under more BPF cgroup hooks, from Stanislav Fomichev.
7) BPF hash table and local storages fixes under fully preemptible kernel, from Hou Tao.
8) Add various improvements to BPF selftests and libbpf for compilation with gcc BPF
backend, from James Hilliard.
9) Fix verifier helper permissions and reference state management for synchronous
callbacks, from Kumar Kartikeya Dwivedi.
10) Add support for BPF selftest's xskxceiver to also be used against real devices that
support MAC loopback, from Maciej Fijalkowski.
11) Various fixes to the bpf-helpers(7) man page generation script, from Quentin Monnet.
12) Document BPF verifier's tnum_in(tnum_range(), ...) gotchas, from Shung-Hsi Yu.
13) Various minor misc improvements all over the place.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (106 commits)
bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.
bpf: Remove usage of kmem_cache from bpf_mem_cache.
bpf: Remove prealloc-only restriction for sleepable bpf programs.
bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
bpf: Remove tracing program restriction on map types
bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
bpf: Add percpu allocation support to bpf_mem_alloc.
bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
bpf: Adjust low/high watermarks in bpf_mem_cache
bpf: Optimize call_rcu in non-preallocated hash map.
bpf: Optimize element count in non-preallocated hash map.
bpf: Relax the requirement to use preallocated hash maps in tracing progs.
samples/bpf: Reduce syscall overhead in map_perf_test.
selftests/bpf: Improve test coverage of test_maps
bpf: Convert hash map to bpf_mem_alloc.
bpf: Introduce any context BPF specific memory allocator.
selftest/bpf: Add test for bpf_getsockopt()
bpf: Change bpf_getsockopt(SOL_IPV6) to reuse do_ipv6_getsockopt()
bpf: Change bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt()
bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt()
...
====================
Link: https://lore.kernel.org/r/20220905161136.9150-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Sergei Antonov [Fri, 2 Sep 2022 12:50:37 +0000 (15:50 +0300)]
net: moxa: fix endianness-related issues from 'sparse'
Sparse checker found two endianness-related issues:
.../moxart_ether.c:34:15: warning: incorrect type in assignment (different base types)
.../moxart_ether.c:34:15: expected unsigned int [usertype]
.../moxart_ether.c:34:15: got restricted __le32 [usertype]
.../moxart_ether.c:39:16: warning: cast to restricted __le32
Fix them by using __le32 type instead of u32.
Signed-off-by: Sergei Antonov <saproj@gmail.com>
Link: https://lore.kernel.org/r/20220902125037.1480268-1-saproj@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Sergei Antonov [Fri, 2 Sep 2022 11:37:49 +0000 (14:37 +0300)]
net: ftmac100: fix endianness-related issues from 'sparse'
Sparse found a number of endianness-related issues of these kinds:
.../ftmac100.c:192:32: warning: restricted __le32 degrades to integer
.../ftmac100.c:208:23: warning: incorrect type in assignment (different base types)
.../ftmac100.c:208:23: expected unsigned int rxdes0
.../ftmac100.c:208:23: got restricted __le32 [usertype]
.../ftmac100.c:249:23: warning: invalid assignment: &=
.../ftmac100.c:249:23: left side has type unsigned int
.../ftmac100.c:249:23: right side has type restricted __le32
.../ftmac100.c:527:16: warning: cast to restricted __le32
Change type of some fields from 'unsigned int' to '__le32' to fix it.
Signed-off-by: Sergei Antonov <saproj@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20220902113749.1408562-1-saproj@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Horatiu Vultur [Fri, 2 Sep 2022 11:15:48 +0000 (13:15 +0200)]
net: lan966x: Extend lan966x with RGMII support
Extend lan966x with RGMII support. The MAC supports all RGMII_* modes.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20220902111548.614525-1-horatiu.vultur@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Heiner Kallweit [Sat, 3 Sep 2022 11:15:13 +0000 (13:15 +0200)]
r8169: remove not needed net_ratelimit() check
We're not in a hot path and don't want to miss this message,
therefore remove the net_ratelimit() check.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kees Cook [Sat, 3 Sep 2022 04:37:49 +0000 (21:37 -0700)]
netlink: Bounds-check struct nlmsgerr creation
In preparation for FORTIFY_SOURCE doing bounds-check on memcpy(),
switch from __nlmsg_put to nlmsg_put(), and explain the bounds check
for dealing with the memcpy() across a composite flexible array struct.
Avoids this future run-time warning:
memcpy: detected field-spanning write (size 32) of single field "&errmsg->msg" at net/netlink/af_netlink.c:2447 (size 16)
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: syzbot <syzkaller@googlegroups.com>
Cc: netfilter-devel@vger.kernel.org
Cc: coreteam@netfilter.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220901071336.1418572-1-keescook@chromium.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Mon, 5 Sep 2022 13:33:07 +0000 (15:33 +0200)]
Merge branch 'bpf-allocator'
Alexei Starovoitov says:
====================
Introduce any context BPF specific memory allocator.
Tracing BPF programs can attach to kprobe and fentry. Hence they run in
unknown context where calling plain kmalloc() might not be safe. Front-end
kmalloc() with per-cpu cache of free elements. Refill this cache asynchronously
from irq_work.
Major achievements enabled by bpf_mem_alloc:
- Dynamically allocated hash maps used to be 10 times slower than fully
preallocated. With bpf_mem_alloc and subsequent optimizations the speed
of dynamic maps is equal to full prealloc.
- Tracing bpf programs can use dynamically allocated hash maps. Potentially
saving lots of memory. Typical hash map is sparsely populated.
- Sleepable bpf programs can used dynamically allocated hash maps.
Future work:
- Expose bpf_mem_alloc as uapi FD to be used in dynptr_alloc, kptr_alloc
- Convert lru map to bpf_mem_alloc
- Further cleanup htab code. Example: htab_use_raw_lock can be removed.
Changelog:
v5->v6:
- Debugged the reason for selftests/bpf/test_maps ooming in a small VM that BPF CI is using.
Added patch 16 that optimizes the usage of rcu_barrier-s between bpf_mem_alloc and
hash map. It drastically improved the speed of htab destruction.
v4->v5:
- Fixed missing migrate_disable in hash tab free path (Daniel)
- Replaced impossible "memory leak" with WARN_ON_ONCE (Martin)
- Dropped sysctl kernel.bpf_force_dyn_alloc patch (Daniel)
- Added Andrii's ack
- Added new patch 15 that removes kmem_cache usage from bpf_mem_alloc.
It saves memory, speeds up map create/destroy operations
while maintains hash map update/delete performance.
v3->v4:
- fix build issue due to missing local.h on 32-bit arch
- add Kumar's ack
- proposal for next steps from Delyan:
https://lore.kernel.org/bpf/
d3f76b27f4e55ec9e400ae8dcaecbb702a4932e8.camel@fb.com/
v2->v3:
- Rewrote the free_list algorithm based on discussions with Kumar. Patch 1.
- Allowed sleepable bpf progs use dynamically allocated maps. Patches 13 and 14.
- Added sysctl to force bpf_mem_alloc in hash map even if pre-alloc is
requested to reduce memory consumption. Patch 15.
- Fix: zero-fill percpu allocation
- Single rcu_barrier at the end instead of each cpu during bpf_mem_alloc destruction
v2 thread:
https://lore.kernel.org/bpf/
20220817210419.95560-1-alexei.starovoitov@gmail.com/
v1->v2:
- Moved unsafe direct call_rcu() from hash map into safe place inside bpf_mem_alloc. Patches 7 and 9.
- Optimized atomic_inc/dec in hash map with percpu_counter. Patch 6.
- Tuned watermarks per allocation size. Patch 8
- Adopted this approach to per-cpu allocation. Patch 10.
- Fully converted hash map to bpf_mem_alloc. Patch 11.
- Removed tracing prog restriction on map types. Combination of all patches and final patch 12.
v1 thread:
https://lore.kernel.org/bpf/
20220623003230.37497-1-alexei.starovoitov@gmail.com/
LWN article:
https://lwn.net/Articles/899274/
====================
Link: https://lore.kernel.org/r/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Alexei Starovoitov [Fri, 2 Sep 2022 21:10:58 +0000 (14:10 -0700)]
bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.
User space might be creating and destroying a lot of hash maps. Synchronous
rcu_barrier-s in a destruction path of hash map delay freeing of hash buckets
and other map memory and may cause artificial OOM situation under stress.
Optimize rcu_barrier usage between bpf hash map and bpf_mem_alloc:
- remove rcu_barrier from hash map, since htab doesn't use call_rcu
directly and there are no callback to wait for.
- bpf_mem_alloc has call_rcu_in_progress flag that indicates pending callbacks.
Use it to avoid barriers in fast path.
- When barriers are needed copy bpf_mem_alloc into temp structure
and wait for rcu barrier-s in the worker to let the rest of
hash map freeing to proceed.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220902211058.60789-17-alexei.starovoitov@gmail.com
Alexei Starovoitov [Fri, 2 Sep 2022 21:10:57 +0000 (14:10 -0700)]
bpf: Remove usage of kmem_cache from bpf_mem_cache.
For bpf_mem_cache based hash maps the following stress test:
for (i = 1; i <= 512; i <<= 1)
for (j = 1; j <= 1 << 18; j <<= 1)
fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, i, j, 2, 0);
creates many kmem_cache-s that are not mergeable in debug kernels
and consume unnecessary amount of memory.
Turned out bpf_mem_cache's free_list logic does batching well,
so usage of kmem_cache for fixes size allocations doesn't bring
any performance benefits vs normal kmalloc.
Hence get rid of kmem_cache in bpf_mem_cache.
That saves memory, speeds up map create/destroy operations,
while maintains hash map update/delete performance.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220902211058.60789-16-alexei.starovoitov@gmail.com
Alexei Starovoitov [Fri, 2 Sep 2022 21:10:56 +0000 (14:10 -0700)]
bpf: Remove prealloc-only restriction for sleepable bpf programs.
Since hash map is now converted to bpf_mem_alloc and it's waiting for rcu and
rcu_tasks_trace GPs before freeing elements into global memory slabs it's safe
to use dynamically allocated hash maps in sleepable bpf programs.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-15-alexei.starovoitov@gmail.com
Alexei Starovoitov [Fri, 2 Sep 2022 21:10:55 +0000 (14:10 -0700)]
bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
Then use call_rcu() to wait for normal progs to finish
and finally do free_one() on each element when freeing objects
into global memory pool.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-14-alexei.starovoitov@gmail.com
Alexei Starovoitov [Fri, 2 Sep 2022 21:10:54 +0000 (14:10 -0700)]
bpf: Remove tracing program restriction on map types
The hash map is now fully converted to bpf_mem_alloc. Its implementation is not
allocating synchronously and not calling call_rcu() directly. It's now safe to
use non-preallocated hash maps in all types of tracing programs including
BPF_PROG_TYPE_PERF_EVENT that runs out of NMI context.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-13-alexei.starovoitov@gmail.com
Alexei Starovoitov [Fri, 2 Sep 2022 21:10:53 +0000 (14:10 -0700)]
bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
Convert dynamic allocations in percpu hash map from alloc_percpu() to
bpf_mem_cache_alloc() from per-cpu bpf_mem_alloc. Since bpf_mem_alloc frees
objects after RCU gp the call_rcu() is removed. pcpu_init_value() now needs to
zero-fill per-cpu allocations, since dynamically allocated map elements are now
similar to full prealloc, since alloc_percpu() is not called inline and the
elements are reused in the freelist.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-12-alexei.starovoitov@gmail.com
Alexei Starovoitov [Fri, 2 Sep 2022 21:10:52 +0000 (14:10 -0700)]
bpf: Add percpu allocation support to bpf_mem_alloc.
Extend bpf_mem_alloc to cache free list of fixed size per-cpu allocations.
Once such cache is created bpf_mem_cache_alloc() will return per-cpu objects.
bpf_mem_cache_free() will free them back into global per-cpu pool after
observing RCU grace period.
per-cpu flavor of bpf_mem_alloc is going to be used by per-cpu hash maps.
The free list cache consists of tuples { llist_node, per-cpu pointer }
Unlike alloc_percpu() that returns per-cpu pointer
the bpf_mem_cache_alloc() returns a pointer to per-cpu pointer and
bpf_mem_cache_free() expects to receive it back.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-11-alexei.starovoitov@gmail.com
Alexei Starovoitov [Fri, 2 Sep 2022 21:10:51 +0000 (14:10 -0700)]
bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
SLAB_TYPESAFE_BY_RCU makes kmem_caches non mergeable and slows down
kmem_cache_destroy. All bpf_mem_cache are safe to share across different maps
and programs. Convert SLAB_TYPESAFE_BY_RCU to batched call_rcu. This change
solves the memory consumption issue, avoids kmem_cache_destroy latency and
keeps bpf hash map performance the same.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-10-alexei.starovoitov@gmail.com
Alexei Starovoitov [Fri, 2 Sep 2022 21:10:50 +0000 (14:10 -0700)]
bpf: Adjust low/high watermarks in bpf_mem_cache
The same low/high watermarks for every bucket in bpf_mem_cache consume
significant amount of memory. Preallocating 64 elements of 4096 bytes each in
the free list is not efficient. Make low/high watermarks and batching value
dependent on element size. This change brings significant memory savings.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-9-alexei.starovoitov@gmail.com
Alexei Starovoitov [Fri, 2 Sep 2022 21:10:49 +0000 (14:10 -0700)]
bpf: Optimize call_rcu in non-preallocated hash map.
Doing call_rcu() million times a second becomes a bottle neck.
Convert non-preallocated hash map from call_rcu to SLAB_TYPESAFE_BY_RCU.
The rcu critical section is no longer observed for one htab element
which makes non-preallocated hash map behave just like preallocated hash map.
The map elements are released back to kernel memory after observing
rcu critical section.
This improves 'map_perf_test 4' performance from 100k events per second
to 250k events per second.
bpf_mem_alloc + percpu_counter + typesafe_by_rcu provide 10x performance
boost to non-preallocated hash map and make it within few % of preallocated map
while consuming fraction of memory.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-8-alexei.starovoitov@gmail.com
Alexei Starovoitov [Fri, 2 Sep 2022 21:10:48 +0000 (14:10 -0700)]
bpf: Optimize element count in non-preallocated hash map.
The atomic_inc/dec might cause extreme cache line bouncing when multiple cpus
access the same bpf map. Based on specified max_entries for the hash map
calculate when percpu_counter becomes faster than atomic_t and use it for such
maps. For example samples/bpf/map_perf_test is using hash map with max_entries
1000. On a system with 16 cpus the 'map_perf_test 4' shows 14k events per
second using atomic_t. On a system with 15 cpus it shows 100k events per second
using percpu. map_perf_test is an extreme case where all cpus colliding on
atomic_t which causes extreme cache bouncing. Note that the slow path of
percpu_counter is 5k events per secound vs 14k for atomic, so the heuristic is
necessary. See comment in the code why the heuristic is based on
num_online_cpus().
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-7-alexei.starovoitov@gmail.com
Alexei Starovoitov [Fri, 2 Sep 2022 21:10:47 +0000 (14:10 -0700)]
bpf: Relax the requirement to use preallocated hash maps in tracing progs.
Since bpf hash map was converted to use bpf_mem_alloc it is safe to use
from tracing programs and in RT kernels.
But per-cpu hash map is still using dynamic allocation for per-cpu map
values, hence keep the warning for this map type.
In the future alloc_percpu_gfp can be front-end-ed with bpf_mem_cache
and this restriction will be completely lifted.
perf_event (NMI) bpf programs have to use preallocated hash maps,
because free_htab_elem() is using call_rcu which might crash if re-entered.
Sleepable bpf programs have to use preallocated hash maps, because
life time of the map elements is not protected by rcu_read_lock/unlock.
This restriction can be lifted in the future as well.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-6-alexei.starovoitov@gmail.com
Alexei Starovoitov [Fri, 2 Sep 2022 21:10:46 +0000 (14:10 -0700)]
samples/bpf: Reduce syscall overhead in map_perf_test.
Make map_perf_test for preallocated and non-preallocated hash map
spend more time inside bpf program to focus performance analysis
on the speed of update/lookup/delete operations performed by bpf program.
It makes 'perf report' of bpf_mem_alloc look like:
11.76% map_perf_test [k] _raw_spin_lock_irqsave
11.26% map_perf_test [k] htab_map_update_elem
9.70% map_perf_test [k] _raw_spin_lock
9.47% map_perf_test [k] htab_map_delete_elem
8.57% map_perf_test [k] memcpy_erms
5.58% map_perf_test [k] alloc_htab_elem
4.09% map_perf_test [k] __htab_map_lookup_elem
3.44% map_perf_test [k] syscall_exit_to_user_mode
3.13% map_perf_test [k] lookup_nulls_elem_raw
3.05% map_perf_test [k] migrate_enable
3.04% map_perf_test [k] memcmp
2.67% map_perf_test [k] unit_free
2.39% map_perf_test [k] lookup_elem_raw
Reduce default iteration count as well to make 'map_perf_test' quick enough
even on debug kernels.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-5-alexei.starovoitov@gmail.com
Alexei Starovoitov [Fri, 2 Sep 2022 21:10:45 +0000 (14:10 -0700)]
selftests/bpf: Improve test coverage of test_maps
Make test_maps more stressful with more parallelism in
update/delete/lookup/walk including different value sizes.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-4-alexei.starovoitov@gmail.com
Alexei Starovoitov [Fri, 2 Sep 2022 21:10:44 +0000 (14:10 -0700)]
bpf: Convert hash map to bpf_mem_alloc.
Convert bpf hash map to use bpf memory allocator.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-3-alexei.starovoitov@gmail.com
Alexei Starovoitov [Fri, 2 Sep 2022 21:10:43 +0000 (14:10 -0700)]
bpf: Introduce any context BPF specific memory allocator.
Tracing BPF programs can attach to kprobe and fentry. Hence they
run in unknown context where calling plain kmalloc() might not be safe.
Front-end kmalloc() with minimal per-cpu cache of free elements.
Refill this cache asynchronously from irq_work.
BPF programs always run with migration disabled.
It's safe to allocate from cache of the current cpu with irqs disabled.
Free-ing is always done into bucket of the current cpu as well.
irq_work trims extra free elements from buckets with kfree
and refills them with kmalloc, so global kmalloc logic takes care
of freeing objects allocated by one cpu and freed on another.
struct bpf_mem_alloc supports two modes:
- When size != 0 create kmem_cache and bpf_mem_cache for each cpu.
This is typical bpf hash map use case when all elements have equal size.
- When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
kmalloc/kfree. Max allocation size is 4096 in this case.
This is bpf_dynptr and bpf_kptr use case.
bpf_mem_alloc/bpf_mem_free are bpf specific 'wrappers' of kmalloc/kfree.
bpf_mem_cache_alloc/bpf_mem_cache_free are 'wrappers' of kmem_cache_alloc/kmem_cache_free.
The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-2-alexei.starovoitov@gmail.com
Sean Anderson [Fri, 2 Sep 2022 22:02:39 +0000 (18:02 -0400)]
net: phy: Add 1000BASE-KX interface mode
Add 1000BASE-KX interface mode. This 1G backplane ethernet as described in
clause 70. Clause 73 autonegotiation is mandatory, and only full duplex
operation is supported.
Although at the PMA level this interface mode is identical to
1000BASE-X, it uses a different form of in-band autonegation. This
justifies a separate interface mode, since the interface mode (along
with the MLO_AN_* autonegotiation mode) sets the type of autonegotiation
which will be used on a link. This results in more than just electrical
differences between the link modes.
With regard to 1000BASE-X, 1000BASE-KX holds a similar position to
SGMII: same signaling, but different autonegotiation. PCS drivers
(which typically handle in-band autonegotiation) may only support
1000BASE-X, and not 1000BASE-KX. Similarly, the phy mode is used to
configure serdes phys with phy_set_mode_ext. Due to the different
electrical standards (SFI or XFI vs Clause 70), they will likely want to
use different configuration. Adding a phy interface mode for
1000BASE-KX helps simplify configuration in these areas.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 5 Sep 2022 13:27:40 +0000 (14:27 +0100)]
Merge branch 'dpaa-cleanups'
Sean Anderson says:
====================
net: dpaa: Cleanups in preparation for phylink conversion (part 2)
This series contains several cleanup patches for dpaa/fman. While they
are intended to prepare for a phylink conversion, they stand on their
own. This series was originally submitted as part of [1].
[1] https://lore.kernel.org/netdev/
20220715215954.1449214-1-sean.anderson@seco.com
Changes in v5:
- Reduce line length of tgec_config
- Reduce line length of qman_update_cgr_safe
- Rebase onto net-next/master
Changes in v4:
- weer -> were
- tricy -> tricky
- Use mac_dev for calling change_addr
- qman_cgr_create -> qman_create_cgr
Changes in v2:
- Fix prototype for dtsec_initialization
- Fix warning if sizeof(void *) != sizeof(resource_size_t)
- Specify type of mac_dev for exception_cb
- Add helper for sanity checking cgr ops
- Add CGR update function
- Adjust queue depth on rate change
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson [Fri, 2 Sep 2022 21:57:36 +0000 (17:57 -0400)]
net: dpaa: Adjust queue depth on rate change
Instead of setting the queue depth once during probe, adjust it on the
fly whenever we configure the link. This is a bit unusal, since usually
the DPAA driver calls into the FMAN driver, but here we do the opposite.
We need to add a netdev to struct mac_device for this, but it will soon
live in the phylink config.
I haven't tested this extensively, but it doesn't seem to break
anything. We could possibly optimize this a bit by keeping track of the
last rate, but for now we just update every time. 10GEC probably doesn't
need to call into this at all, but I've added it for consistency.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson [Fri, 2 Sep 2022 21:57:35 +0000 (17:57 -0400)]
soc: fsl: qbman: Add CGR update function
This adds a function to update a CGR with new parameters. qman_create_cgr
can almost be used for this (with flags=0), but it's not suitable because
it also registers the callback function. The _safe variant was modeled off
of qman_cgr_delete_safe. However, we handle multiple arguments and a return
value.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson [Fri, 2 Sep 2022 21:57:34 +0000 (17:57 -0400)]
soc: fsl: qbman: Add helper for sanity checking cgr ops
This breaks out/combines get_affine_portal and the cgr sanity check in
preparation for the next commit. No functional change intended.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson [Fri, 2 Sep 2022 21:57:33 +0000 (17:57 -0400)]
net: dpaa: Use mac_dev variable in dpaa_netdev_init
There are several references to mac_dev in dpaa_netdev_init. Make things a
bit more concise by adding a local variable for it.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson [Fri, 2 Sep 2022 21:57:32 +0000 (17:57 -0400)]
net: fman: Change return type of disable to void
When disabling, there is nothing we can do about errors. In fact, the
only error which can occur is misuse of the API. Just warn in the mac
driver instead.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson [Fri, 2 Sep 2022 21:57:31 +0000 (17:57 -0400)]
net: fman: Clean up error handling
This removes the _return label, since something like
err = -EFOO;
goto _return;
can be replaced by the briefer
return -EFOO;
Additionally, this skips going to _return_of_node_put when dev_node has
already been put (preventing a double put).
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson [Fri, 2 Sep 2022 21:57:30 +0000 (17:57 -0400)]
net: fman: Specify type of mac_dev for exception_cb
Instead of using a void pointer for mac_dev, specify its type
explicitly.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson [Fri, 2 Sep 2022 21:57:29 +0000 (17:57 -0400)]
net: fman: Use mac_dev for some params
Some params are already present in mac_dev. Use them directly instead of
passing them through params.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson [Fri, 2 Sep 2022 21:57:28 +0000 (17:57 -0400)]
net: fman: Pass params directly to mac init
Instead of having the mac init functions call back into the fman core to
get their params, just pass them directly to the init functions.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson [Fri, 2 Sep 2022 21:57:27 +0000 (17:57 -0400)]
net: fman: Map the base address once
We don't need to remap the base address from the resource twice (once in
mac_probe() and again in set_fman_mac_params()). We still need the
resource to get the end address, but we can use a single function call
to get both at once.
While we're at it, use platform_get_mem_or_io and devm_request_resource
to map the resource. I think this is the more "correct" way to do things
here, since we use the pdev resource, instead of creating a new one.
It's still a bit tricky, since we need to ensure that the resource is a
child of the fman region when it gets requested.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson [Fri, 2 Sep 2022 21:57:26 +0000 (17:57 -0400)]
net: fman: Remove internal_phy_node from params
This member was used to pass the phy node between mac_probe and the
mac-specific initialization function. But now that the phy node is
gotten in the initialization function, this parameter does not serve a
purpose. Remove it, and do the grabbing of the node/grabbing of the phy
in the same place.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>