Martin KaFai Lau [Thu, 1 Sep 2022 19:11:45 +0000 (12:11 -0700)]
Merge branch 'Use this_cpu_xxx for preemption-safety'
Hou Tao says:
====================
From: Hou Tao <houtao1@huawei.com>
Hi,
The patchset aims to make the updates of the per-cpu prog->active and per-cpu
bpf_task_storage_busy preemption-safe. The problem is that on some
architectures (e.g. arm64), __this_cpu_{inc|dec|inc_return} are neither
preemption-safe nor IRQ-safe, so under a fully preemptible kernel the
concurrent updates on these per-cpu variables may be interleaved and the
final values of these variables may not be zero.
Patches 1 & 2 use the preemption-safe per-cpu helpers to manipulate
prog->active and bpf_task_storage_busy. Patches 3 & 4 add a test case to
map_tests to show that concurrent updates on the per-cpu
bpf_task_storage_busy via __this_cpu_{inc|dec} are not atomic.
Comments are always welcome.
Regards,
Tao
Change Log:
v2:
* Patch 1: update commit message to indicate the problem is only
possible for fully preemptible kernel
* Patch 2: a new patch which fixes the problem for prog->active
* Patch 3 & 4: move it to test_maps and make it depend on CONFIG_PREEMPT
v1: https://lore.kernel.org/bpf/20220829142752.330094-1-houtao@huaweicloud.com/
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Hou Tao [Thu, 1 Sep 2022 06:19:38 +0000 (14:19 +0800)]
selftests/bpf: Test concurrent updates on bpf_task_storage_busy
Under a fully preemptible kernel, task local storage lookup operations on
the same CPU may update the per-cpu bpf_task_storage_busy concurrently. If
the update of bpf_task_storage_busy is not preemption-safe, the final
value of bpf_task_storage_busy may remain non-zero forever and
bpf_task_storage_trylock() will always fail. So add a test case to
ensure the update of bpf_task_storage_busy is preemption-safe.
The test case is skipped when CONFIG_PREEMPT is disabled, and it can only
reproduce the problem probabilistically. By increasing
TASK_STORAGE_MAP_NR_LOOP and running it in an ARM64 VM with 4 CPUs, it
takes about four rounds to reproduce:
> test_maps is modified to only run test_task_storage_map_stress_lookup()
$ export TASK_STORAGE_MAP_NR_THREAD=256
$ export TASK_STORAGE_MAP_NR_LOOP=81920
$ export TASK_STORAGE_MAP_PIN_CPU=1
$ time ./test_maps
test_task_storage_map_stress_lookup(135):FAIL:bad bpf_task_storage_busy got -2
real 0m24.743s
user 0m6.772s
sys 0m17.966s
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20220901061938.3789460-5-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Hou Tao [Thu, 1 Sep 2022 06:19:37 +0000 (14:19 +0800)]
selftests/bpf: Move sys_pidfd_open() into task_local_storage_helpers.h
sys_pidfd_open() is defined in both test_bprm_opts.c and
test_local_storage.c, so move it to a common header file. It will also be
used in map_tests.
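A minimal sketch of what the shared helper can look like in
task_local_storage_helpers.h; the syscall-number fallback is an
assumption here, not necessarily the file's exact contents:

#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434	/* assumed pidfd_open() syscall number */
#endif

/* open a pid file descriptor for the given pid */
static inline int sys_pidfd_open(pid_t pid, unsigned int flags)
{
	return syscall(__NR_pidfd_open, pid, flags);
}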
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20220901061938.3789460-4-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Hou Tao [Thu, 1 Sep 2022 06:19:36 +0000 (14:19 +0800)]
bpf: Use this_cpu_{inc_return|dec} for prog->active
Both __this_cpu_inc_return() and __this_cpu_dec() are not preemption
safe, and migrate_disable() no longer disables preemption, so the update
of prog->active is not atomic and, in theory, recursion prevention may
not work under a fully preemptible kernel.
Fix it by using the preemption-safe and IRQ-safe variants.
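A sketch of the resulting enter handler in kernel/bpf/trampoline.c; the
shape follows this patch, though surrounding details are elided and may
differ:

u64 notrace __bpf_prog_enter(struct bpf_prog *prog,
			     struct bpf_tramp_run_ctx *run_ctx)
{
	rcu_read_lock();
	migrate_disable();

	run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);

	/* this_cpu_inc_return() instead of __this_cpu_inc_return():
	 * safe against preemption and IRQ on all architectures
	 */
	if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
		bpf_prog_inc_misses_counter(prog);
		return 0;
	}
	return bpf_prog_start_time();
}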
Fixes: ca06f55b9002 ("bpf: Add per-program recursion prevention mechanism")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20220901061938.3789460-3-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Hou Tao [Thu, 1 Sep 2022 06:19:35 +0000 (14:19 +0800)]
bpf: Use this_cpu_{inc|dec|inc_return} for bpf_task_storage_busy
Now migrate_disable() does not disable preemption, and on some
architectures (e.g. arm64) __this_cpu_{inc|dec|inc_return} are neither
preemption-safe nor IRQ-safe, so on a fully preemptible kernel concurrent
lookups or updates on the same task local storage and on the same CPU
may leave bpf_task_storage_busy imbalanced, and
bpf_task_storage_trylock() on that CPU will always fail.
Fix it by using this_cpu_{inc|dec|inc_return} when manipulating
bpf_task_storage_busy.
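A sketch of the fixed trylock helper in kernel/bpf/bpf_task_storage.c;
minor details may differ from the actual patch:

static bool bpf_task_storage_trylock(void)
{
	migrate_disable();
	/* this_cpu_inc_return() is preemption- and IRQ-safe, unlike the
	 * previous __this_cpu_inc_return()
	 */
	if (unlikely(this_cpu_inc_return(bpf_task_storage_busy) != 1)) {
		this_cpu_dec(bpf_task_storage_busy);
		migrate_enable();
		return false;
	}
	return true;
}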
Fixes: bc235cdb423a ("bpf: Prevent deadlock from recursive bpf_task_storage_[get|delete]")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20220901061938.3789460-2-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Martin KaFai Lau [Wed, 31 Aug 2022 20:35:47 +0000 (13:35 -0700)]
Merge branch 'fixes for concurrent htab updates'
Hou Tao says:
====================
From: Hou Tao <houtao1@huawei.com>
Hi,
The patchset aims to fix the issues found while investigating the
syzkaller problem reported in [0]. It turns out that concurrent updates
to the same hash-table bucket may fail, as shown in patch 1.
Patch 1 uses preempt_disable() to fix the problem for the
htab_use_raw_lock() case. For the !htab_use_raw_lock() case, the problem
is left to the "BPF specific memory allocator" patchset [1], in which
!htab_use_raw_lock() will be removed.
Patch 2 fixes the out-of-bounds memory read problem reported in [0]. The
problem has the same root cause as patch 1 and is fixed by handling -EBUSY
from htab_lock_bucket() correctly.
Patch 3 adds two test cases for hash-table update: one for the reentrancy
of bpf_map_update_elem(), and another for concurrent updates of the
same hash-table bucket.
Comments are always welcome.
Regards,
Tao
[0]: https://lore.kernel.org/bpf/CACkBjsbuxaR6cv0kXJoVnBfL9ZJXjjoUcMpw_Ogc313jSrg14A@mail.gmail.com/
[1]: https://lore.kernel.org/bpf/20220819214232.18784-1-alexei.starovoitov@gmail.com/
Change Log:
v4:
* rebased on bpf-next
* add htab_update to DENYLIST.s390x
v3: https://lore.kernel.org/bpf/20220829023709.1958204-1-houtao@huaweicloud.com/
* patch 1: update commit message and add Fixes tag
* patch 2: add Fixes tag
* patch 3: elaborate the description of test cases
v2: https://lore.kernel.org/bpf/bd60ef93-1c6a-2db2-557d-b09b92ad22bd@huaweicloud.com/
* Note the fix is for CONFIG_PREEMPT case in commit message and add
Reviewed-by tag for patch 1
* Drop patch "bpf: Allow normally concurrent map updates for !htab_use_raw_lock() case"
v1: https://lore.kernel.org/bpf/20220821033223.2598791-1-houtao@huaweicloud.com/
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Hou Tao [Wed, 31 Aug 2022 04:26:29 +0000 (12:26 +0800)]
selftests/bpf: Add test cases for htab update
One test demonstrates that a reentrant update of the same hash-map
bucket should fail, and the other shows that concurrent updates of
the same hash-map bucket should succeed and not fail due to
the reentrancy check on the bucket lock.
There is no trampoline support on s390x, so move htab_update to the
denylist.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20220831042629.130006-4-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Hou Tao [Wed, 31 Aug 2022 04:26:28 +0000 (12:26 +0800)]
bpf: Propagate error from htab_lock_bucket() to userspace
In __htab_map_lookup_and_delete_batch(), if htab_lock_bucket() returns
-EBUSY, it will go to the next bucket. Going to the next bucket may not
only silently skip the elements in the current bucket, but may also incur
an out-of-bounds memory access or expose kernel memory to userspace if
the current bucket_cnt is greater than bucket_size or zero.
Fix it by stopping the batch operation and returning -EBUSY when
htab_lock_bucket() fails; the application can then retry or skip the busy
batch as needed.
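A sketch of the control-flow change in
__htab_map_lookup_and_delete_batch(); the unlock/enable calls and the
'after_loop' label are inferred from context and may differ in detail:

	ret = htab_lock_bucket(htab, b, batch, &flags);
	if (ret) {
		/* was: goto next_batch, which silently skipped the bucket */
		rcu_read_unlock();
		bpf_enable_instrumentation();
		goto after_loop;	/* copy results so far, return -EBUSY */
	}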
Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked")
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20220831042629.130006-3-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Hou Tao [Wed, 31 Aug 2022 04:26:27 +0000 (12:26 +0800)]
bpf: Disable preemption when increasing per-cpu map_locked
Per-cpu htab->map_locked is used to prohibit concurrent accesses
from both NMI and non-NMI contexts. But since commit 74d862b682f5
("sched: Make migrate_disable/enable() independent of RT"),
migrate_disable() is preemptible under CONFIG_PREEMPT, so now
map_locked also unexpectedly disallows concurrent updates from normal
contexts (e.g. userspace processes), as shown below:
process A                      process B

htab_map_update_elem()
  htab_lock_bucket()
    migrate_disable()
    /* return 1 */
    __this_cpu_inc_return()
    /* preempted by B */

                               htab_map_update_elem()
                                 /* the same bucket as A */
                                 htab_lock_bucket()
                                   migrate_disable()
                                   /* return 2, so lock fails */
                                   __this_cpu_inc_return()
                                   return -EBUSY
A fix that seems feasible is to use in_nmi() in htab_lock_bucket() and
only check the value of map_locked for the NMI context. But it would
re-introduce a deadlock on the bucket lock if htab_lock_bucket() is
re-entered through a non-tracing program (e.g. a fentry program).
One cannot use preempt_disable() to fix this issue either, as
htab_use_raw_lock being false causes the bucket lock to be a spinlock
which can sleep and does not work with preempt_disable().
Therefore, use migrate_disable() instead of preempt_disable() when using
the sleepable spinlock, and defer fixing concurrent updates until the
kernel has its own BPF memory allocator.
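A sketch of htab_lock_bucket() after this change; the raw-lock branch
disables preemption as the subject says, while the sleepable-spinlock
branch keeps migrate_disable(). Details are hedged approximations of
the patch:

static inline int htab_lock_bucket(const struct bpf_htab *htab,
				   struct bucket *b, u32 hash,
				   unsigned long *pflags)
{
	unsigned long flags;
	bool use_raw_lock;

	hash = hash & HASHTAB_MAP_LOCK_MASK;

	use_raw_lock = htab_use_raw_lock(htab);
	if (use_raw_lock)
		preempt_disable();	/* serialize normal contexts on this cpu */
	else
		migrate_disable();	/* bucket lock may sleep here */
	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
		__this_cpu_dec(*(htab->map_locked[hash]));
		if (use_raw_lock)
			preempt_enable();
		else
			migrate_enable();
		return -EBUSY;
	}

	if (use_raw_lock)
		raw_spin_lock_irqsave(&b->raw_lock, flags);
	else
		spin_lock_irqsave(&b->lock, flags);
	*pflags = flags;
	return 0;
}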
Fixes: 74d862b682f5 ("sched: Make migrate_disable/enable() independent of RT")
Reviewed-by: Hao Luo <haoluo@google.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20220831042629.130006-2-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Martin KaFai Lau [Tue, 30 Aug 2022 23:19:53 +0000 (16:19 -0700)]
selftest/bpf: Ensure no module loading in bpf_setsockopt(TCP_CONGESTION)
This patch adds a test to ensure bpf_setsockopt(TCP_CONGESTION, "not_exist")
will not trigger the kernel module autoload.
Before the fix:
[ 40.535829] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274
[...]
[ 40.552134] tcp_ca_find_autoload.constprop.0+0xcb/0x200
[ 40.552689] tcp_set_congestion_control+0x99/0x7b0
[ 40.553203] do_tcp_setsockopt+0x3ed/0x2240
[...]
[ 40.556041] __bpf_setsockopt+0x124/0x640
Signed-off-by: Martin KaFai Lau <martin.lau@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220830231953.792412-1-martin.lau@linux.dev
Martin KaFai Lau [Tue, 30 Aug 2022 23:19:46 +0000 (16:19 -0700)]
bpf, net: Avoid loading module when calling bpf_setsockopt(TCP_CONGESTION)
When a bpf prog changes the tcp-cc by calling bpf_setsockopt(TCP_CONGESTION),
it should not try to load a module, which may be a blocking operation.
This detail was correct in v1 [0] but was missed by mistake in the
later revision in commit cb388e7ee3a8 ("bpf: net: Change do_tcp_setsockopt()
to use the sockopt's lock_sock() and capable()"). This patch fixes it by
checking has_current_bpf_ctx().
[0] https://lore.kernel.org/bpf/20220727060921.2373314-1-kafai@fb.com/
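A sketch of the fix in do_tcp_setsockopt(); the argument list follows
tcp_set_congestion_control(), but treat this as an approximation of the
diff rather than the exact code:

	/* has_current_bpf_ctx() is true when called via bpf_setsockopt():
	 * pass load == false so a missing cc module is not autoloaded.
	 */
	err = tcp_set_congestion_control(sk, name, !has_current_bpf_ctx(),
					 sockopt_ns_capable(sock_net(sk)->user_ns,
							    CAP_NET_ADMIN));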
Fixes: cb388e7ee3a8 ("bpf: net: Change do_tcp_setsockopt() to use the sockopt's lock_sock() and capable()")
Signed-off-by: Martin KaFai Lau <martin.lau@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220830231946.791504-1-martin.lau@linux.dev
James Hilliard [Mon, 29 Aug 2022 21:05:46 +0000 (15:05 -0600)]
libbpf: Add GCC support for bpf_tail_call_static
The bpf_tail_call_static function is currently not defined unless
using clang >= 8.
To support bpf_tail_call_static on GCC, we can check whether __clang__ is
defined and enable bpf_tail_call_static when it is not.
We need to use GCC assembly syntax when the compiler does not define
__clang__, as LLVM inline assembly is not fully compatible with GCC.
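A hedged sketch of the resulting dispatch in bpf_helpers.h; the clang
branch mirrors the pre-existing LLVM asm, while the exact GCC mnemonics
and operand order are assumptions here for illustration:

#if defined(__bpf__) && ((defined(__clang__) && __clang_major__ >= 8) || !defined(__clang__))
static __always_inline void
bpf_tail_call_static(void *ctx, const void *map, const __u32 slot)
{
	if (!__builtin_constant_p(slot))
		__bpf_unreachable();
#if defined(__clang__)
	/* LLVM-flavored BPF inline assembly */
	asm volatile("r1 = %[ctx]\n\t"
		     "r2 = %[map]\n\t"
		     "r3 = %[slot]\n\t"
		     "call 12"
		     :: [ctx]"r"(ctx), [map]"r"(map), [slot]"i"(slot)
		     : "r0", "r1", "r2", "r3", "r4", "r5");
#else
	/* GCC-flavored (GAS) BPF inline assembly; the operand order below
	 * is an assumption, not copied from the patch
	 */
	asm volatile("mov %%r1,%[ctx]\n\t"
		     "mov %%r2,%[map]\n\t"
		     "mov %%r3,%[slot]\n\t"
		     "call 12"
		     :: [ctx]"r"(ctx), [map]"r"(map), [slot]"i"(slot)
		     : "r0", "r1", "r2", "r3", "r4", "r5");
#endif
}
#endif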
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220829210546.755377-1-james.hilliard1@gmail.com
Hao Luo [Mon, 29 Aug 2022 23:18:28 +0000 (16:18 -0700)]
bpftool: Add support for querying cgroup_iter link
Support dumping info of a cgroup_iter link. This includes
showing the cgroup's id and the order for walking the cgroup
hierarchy. Example output is as follows:
> bpftool link show
1: iter prog 2 target_name bpf_map
2: iter prog 3 target_name bpf_prog
3: iter prog 12 target_name cgroup cgroup_id 72 order self_only
> bpftool -p link show
[{
"id": 1,
"type": "iter",
"prog_id": 2,
"target_name": "bpf_map"
},{
"id": 2,
"type": "iter",
"prog_id": 3,
"target_name": "bpf_prog"
},{
"id": 3,
"type": "iter",
"prog_id": 12,
"target_name": "cgroup",
"cgroup_id": 72,
"order": "self_only"
}
]
Signed-off-by: Hao Luo <haoluo@google.com>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220829231828.1016835-1-haoluo@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@linux.dev>
James Hilliard [Mon, 29 Aug 2022 15:47:09 +0000 (09:47 -0600)]
selftests/bpf: Fix connect4_prog tcp/socket header type conflict
There is a potential for us to hit a type conflict when including
netinet/tcp.h and sys/socket.h, we can replace both of these includes
with linux/tcp.h and bpf_tcp_helpers.h to avoid this conflict.
Fixes errors like the below when compiling with gcc BPF backend:
In file included from /usr/include/netinet/tcp.h:91,
from progs/connect4_prog.c:11:
/home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:34:23: error: conflicting types for 'int8_t'; have 'char'
34 | typedef __INT8_TYPE__ int8_t;
| ^~~~~~
In file included from /usr/include/x86_64-linux-gnu/sys/types.h:155,
from /usr/include/x86_64-linux-gnu/bits/socket.h:29,
from /usr/include/x86_64-linux-gnu/sys/socket.h:33,
from progs/connect4_prog.c:10:
/usr/include/x86_64-linux-gnu/bits/stdint-intn.h:24:18: note: previous declaration of 'int8_t' with type 'int8_t' {aka 'signed char'}
24 | typedef __int8_t int8_t;
| ^~~~~~
/home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:43:24: error: conflicting types for 'int64_t'; have 'long int'
43 | typedef __INT64_TYPE__ int64_t;
| ^~~~~~~
/usr/include/x86_64-linux-gnu/bits/stdint-intn.h:27:19: note: previous declaration of 'int64_t' with type 'int64_t' {aka 'long long int'}
27 | typedef __int64_t int64_t;
| ^~~~~~~
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220829154710.3870139-1-james.hilliard1@gmail.com
James Hilliard [Fri, 26 Aug 2022 05:29:22 +0000 (23:29 -0600)]
selftests/bpf: Fix bind{4,6} tcp/socket header type conflict
There is a potential for us to hit a type conflict when including
netinet/tcp.h with sys/socket.h, we can remove these as they are not
actually needed.
Fixes errors like the below when compiling with gcc BPF backend:
In file included from /usr/include/netinet/tcp.h:91,
from progs/bind4_prog.c:10:
/home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:34:23: error: conflicting types for 'int8_t'; have 'char'
34 | typedef __INT8_TYPE__ int8_t;
| ^~~~~~
In file included from /usr/include/x86_64-linux-gnu/sys/types.h:155,
from /usr/include/x86_64-linux-gnu/bits/socket.h:29,
from /usr/include/x86_64-linux-gnu/sys/socket.h:33,
from progs/bind4_prog.c:9:
/usr/include/x86_64-linux-gnu/bits/stdint-intn.h:24:18: note: previous declaration of 'int8_t' with type 'int8_t' {aka 'signed char'}
24 | typedef __int8_t int8_t;
| ^~~~~~
/home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:43:24: error: conflicting types for 'int64_t'; have 'long int'
43 | typedef __INT64_TYPE__ int64_t;
| ^~~~~~~
/usr/include/x86_64-linux-gnu/bits/stdint-intn.h:27:19: note: previous declaration of 'int64_t' with type 'int64_t' {aka 'long long int'}
27 | typedef __int64_t int64_t;
| ^~~~~~~
make: *** [Makefile:537: /home/buildroot/bpf-next/tools/testing/selftests/bpf/bpf_gcc/bind4_prog.o] Error 1
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220826052925.980431-1-james.hilliard1@gmail.com
Tiezhu Yang [Mon, 29 Aug 2022 03:05:09 +0000 (11:05 +0800)]
bpf, mips: No need to use min() to get MAX_TAIL_CALL_CNT
MAX_TAIL_CALL_CNT is 33, so min(MAX_TAIL_CALL_CNT, 0xffff) is always
MAX_TAIL_CALL_CNT, it is better to use MAX_TAIL_CALL_CNT directly.
At the same time, add BUILD_BUG_ON(MAX_TAIL_CALL_CNT > 0xffff) with a
comment on why the assertion is there.
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Suggested-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/1661742309-2320-1-git-send-email-yangtiezhu@loongson.cn
Quentin Monnet [Thu, 25 Aug 2022 22:08:06 +0000 (23:08 +0100)]
bpf: Fix a few typos in BPF helpers documentation
Address a few typos in the documentation for the BPF helper functions.
They were reported by Jakub [0], who ran spell checkers on the generated
man page [1].
[0] https://lore.kernel.org/linux-man/d22dcd47-023c-8f52-d369-7b5308e6c842@gmail.com/T/#mb02e7d4b7fb61d98fa914c77b581184e9a9537af
[1] https://lore.kernel.org/linux-man/eb6a1e41-c48e-ac45-5154-ac57a2c76108@gmail.com/T/#m4a8d1b003616928013ffcd1450437309ab652f9f
v3: Do not copy unrelated (and breaking) elements to tools/ header
v2: Turn a ',' into a ';'
Reported-by: Jakub Wilk <jwilk@jwilk.net>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220825220806.107143-1-quentin@isovalent.com
James Hilliard [Fri, 26 Aug 2022 03:51:39 +0000 (21:51 -0600)]
selftests/bpf: Declare subprog_noise as static in tailcall_bpf2bpf4
Due to bpf_map_lookup_elem being declared static, we need to also
declare subprog_noise as static.
Fixes the following error:
progs/tailcall_bpf2bpf4.c:26:9: error: 'bpf_map_lookup_elem' is static but used in inline function 'subprog_noise' which is not static [-Werror]
26 | bpf_map_lookup_elem(&nop_table, &key);
| ^~~~~~~~~~~~~~~~~~~
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20220826035141.737919-1-james.hilliard1@gmail.com
James Hilliard [Fri, 26 Aug 2022 05:06:59 +0000 (23:06 -0600)]
selftests/bpf: fix type conflict in test_tc_dtime
The sys/socket.h header isn't required to build test_tc_dtime and may
cause a type conflict.
Fixes the following error:
In file included from /usr/include/x86_64-linux-gnu/sys/types.h:155,
from /usr/include/x86_64-linux-gnu/bits/socket.h:29,
from /usr/include/x86_64-linux-gnu/sys/socket.h:33,
from progs/test_tc_dtime.c:18:
/usr/include/x86_64-linux-gnu/bits/stdint-intn.h:24:18: error: conflicting types for 'int8_t'; have '__int8_t' {aka 'signed char'}
24 | typedef __int8_t int8_t;
| ^~~~~~
In file included from progs/test_tc_dtime.c:5:
/home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:34:23: note: previous declaration of 'int8_t' with type 'int8_t' {aka 'char'}
34 | typedef __INT8_TYPE__ int8_t;
| ^~~~~~
/usr/include/x86_64-linux-gnu/bits/stdint-intn.h:27:19: error: conflicting types for 'int64_t'; have '__int64_t' {aka 'long long int'}
27 | typedef __int64_t int64_t;
| ^~~~~~~
/home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:43:24: note: previous declaration of 'int64_t' with type 'int64_t' {aka 'long int'}
43 | typedef __INT64_TYPE__ int64_t;
| ^~~~~~~
make: *** [Makefile:537: /home/buildroot/bpf-next/tools/testing/selftests/bpf/bpf_gcc/test_tc_dtime.o] Error 1
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Link: https://lore.kernel.org/r/20220826050703.869571-1-james.hilliard1@gmail.com
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Benjamin Tissoires [Wed, 24 Aug 2022 13:40:37 +0000 (15:40 +0200)]
libbpf: add map_get_fd_by_id and map_delete_elem in light skeleton
This allows better control over maps from the kernel when
preloading eBPF programs.
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Link: https://lore.kernel.org/r/20220824134055.1328882-8-benjamin.tissoires@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Benjamin Tissoires [Wed, 24 Aug 2022 13:40:36 +0000 (15:40 +0200)]
bpf: prepare for more bpf syscall to be used from kernel and user space.
Add BPF_MAP_GET_FD_BY_ID and BPF_MAP_DELETE_PROG.
Only BPF_MAP_GET_FD_BY_ID needs to be amended to be able
to access the bpf pointer from either userspace or the kernel.
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Link: https://lore.kernel.org/r/20220824134055.1328882-7-benjamin.tissoires@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Hao Luo [Thu, 25 Aug 2022 22:39:36 +0000 (15:39 -0700)]
bpf: Add CGROUP prefix to cgroup_iter_order
bpf_cgroup_iter_order is globally visible, but the entries do not have a
CGROUP prefix. As requested by Andrii, put a CGROUP prefix in the names
in bpf_cgroup_iter_order.
This patch fixes two previous commits: one introduced the API and
the other uses the API in a bpf selftest (that is, the selftest
cgroup_hierarchical_stats).
I tested this patch via the following command:
test_progs -t cgroup,iter,btf_dump
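For reference, a sketch of the renamed entries as they end up in the
UAPI header (comments are editorial, not quoted from the header):

enum bpf_cgroup_iter_order {
	BPF_CGROUP_ITER_ORDER_UNSPEC = 0,
	BPF_CGROUP_ITER_SELF_ONLY,		/* process only a single object */
	BPF_CGROUP_ITER_DESCENDANTS_PRE,	/* walk descendants in pre-order */
	BPF_CGROUP_ITER_DESCENDANTS_POST,	/* walk descendants in post-order */
	BPF_CGROUP_ITER_ANCESTORS_UP,		/* walk ancestors upward */
};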
Fixes: d4ccaf58a847 ("bpf: Introduce cgroup iter")
Fixes: 88886309d2e8 ("selftests/bpf: add a selftest for cgroup hierarchical stats collection")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/r/20220825223936.1865810-1-haoluo@google.com
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Eyal Birger [Wed, 24 Aug 2022 18:10:43 +0000 (21:10 +0300)]
bpf/scripts: Assert helper enum value is aligned with comment order
The helper value is ABI, as defined by enum bpf_func_id.
As bpf_helper_defs.h is used for the userspace part, it must be consistent
with this enum.
Before this change, the comment order was used by the bpf_doc script in
order to set the helper values defined in the helpers file.
When adding new helpers, it is very puzzling when the userspace application
breaks in weird places if the comment is inserted instead of appended,
because the generated helper ABI is incorrect and shifted.
This commit sets the helper value to the enum value.
In addition, it is currently the practice to have the comments appended
and kept in the same order as the enum. As such, add an assertion
validating that the comment order is consistent with the enum values.
In case a different comment ordering is desired, this assertion can
be lifted.
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20220824181043.1601429-1-eyal.birger@gmail.com
Lam Thai [Wed, 24 Aug 2022 22:59:00 +0000 (15:59 -0700)]
bpftool: Fix a wrong type cast in btf_dumper_int
When `data` points to a boolean value, casting it to `int *` is problematic
and could lead to a wrong value being passed to `jsonw_bool`. Change the
cast to `bool *` instead.
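A minimal before/after sketch of the cast in btf_dumper_int() for the
boolean case:

	/* before: reads sizeof(int) bytes even though only a bool was stored */
	jsonw_bool(jw, *(int *)data);

	/* after: reads exactly the boolean value */
	jsonw_bool(jw, *(bool *)data);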
Fixes: b12d6ec09730 ("bpf: btf: add btf print functionality")
Signed-off-by: Lam Thai <lamthai@arista.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220824225859.9038-1-lamthai@arista.com
Alexei Starovoitov [Thu, 25 Aug 2022 18:35:38 +0000 (11:35 -0700)]
Merge branch 'bpf: rstat: cgroup hierarchical'
Hao Luo says:
====================
This patch series allows for using bpf to collect hierarchical cgroup
stats efficiently by integrating with the rstat framework. The rstat
framework provides an efficient way to collect cgroup stats per-cpu and
propagate them through the cgroup hierarchy.
The stats are exposed to userspace in textual form by reading files in
bpffs, similar to cgroupfs stats, by using a cgroup_iter program.
cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes:
- walking a cgroup's descendants in pre-order.
- walking a cgroup's descendants in post-order.
- walking a cgroup's ancestors.
- process only a single object.
When attaching cgroup_iter, one needs to set a cgroup to the iter_link
created from attaching. This cgroup can be passed either as a file
descriptor or a cgroup id. That cgroup serves as the starting point of
the walk.
One can also terminate the walk early by returning 1 from the iter
program.
Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
program is called with cgroup_mutex held.
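A hedged userspace sketch of attaching such an iter with libbpf;
'skel', the program name 'dump_cgroup', and 'cgroup_fd' are illustrative
placeholders, not names from this series:

	union bpf_iter_link_info linfo = {};
	LIBBPF_OPTS(bpf_iter_attach_opts, opts);
	struct bpf_link *link;

	linfo.cgroup.cgroup_fd = cgroup_fd;	/* or set linfo.cgroup.cgroup_id */
	linfo.cgroup.order = BPF_CGROUP_ITER_DESCENDANTS_PRE;
	opts.link_info = &linfo;
	opts.link_info_len = sizeof(linfo);

	link = bpf_program__attach_iter(skel->progs.dump_cgroup, &opts);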
** Background on rstat for stats collection **
(I am using a subscriber analogy that is not commonly used)
The rstat framework maintains a tree of cgroups that have updates and
which cpus have updates. A subscriber to the rstat framework maintains
their own stats. The framework is used to tell the subscriber when
and what to flush, for the most efficient stats propagation. The
workflow is as follows:
- When a subscriber updates a cgroup on a cpu, it informs the rstat
framework by calling cgroup_rstat_updated(cgrp, cpu).
- When a subscriber wants to read some stats for a cgroup, it asks
the rstat framework to initiate a stats flush (propagation) by calling
cgroup_rstat_flush(cgrp).
- When the rstat framework initiates a flush, it makes callbacks to
subscribers to aggregate stats on cpus that have updates, and
propagate updates to their parent.
Currently, the main subscribers to the rstat framework are cgroup
subsystems (e.g. memory, block). This patch series allows bpf programs to
become subscribers as well.
Patches in this series are organized as follows:
* Patches 1-2 introduce cgroup_iter prog, and a selftest.
* Patches 3-5 allow bpf programs to integrate with rstat by adding the
necessary hook points and kfunc. A comprehensive selftest that
demonstrates the entire workflow for using bpf and rstat to
efficiently collect and output cgroup stats is added.
---
Changelog:
v8 -> v9:
- Make UNSPEC (an invalid option) the default order for cgroup_iter.
- Use an enum for specifying the cgroup_iter order, instead of u32.
- Add BPF_ITER_RESCHED to cgroup_iter.
- Add cgroup_hierarchical_stats to the s390x denylist.
v7 -> v8:
- Removed the confusing BPF_ITER_DEFAULT (Andrii)
- s/SELF/SELF_ONLY/g
- Fixed typo (e.g. outputing) (Andrii)
- Use "descendants_pre", "descendants_post" etc. instead of "pre",
"post" (Andrii)
v6 -> v7:
- Updated commit/comments in cgroup_iter for read() behavior (Yonghong)
- Extracted BPF_ITER_SELF and other options out of cgroup_iter, so
that they can be used in other iters. Also renamed them. (Andrii)
- Supports both cgroup_fd and cgroup_id when specifying target cgroup.
(Andrii)
- Avoided using macro for formatting expected output in cgroup_iter
selftest. (Andrii)
- Applied 'static' on all vars and functions in cgroup_iter selftest.
(Andrii)
- Fixed broken buf reading in cgroup_iter selftest. (Andrii)
- Switched to use bpf_link__destroy() unconditionally. (Andrii)
- Removed 'volatile' for non-const global vars in selftests. (Andrii)
- Started using bpf_core_enum_value() to get memory_cgrp_id. (Andrii)
v5 -> v6:
- Rebased on bpf-next
- Tidy up cgroup_hierarchical_stats test (Andrii)
* 'static' and 'inline'
* avoid using libbpf_get_error()
* string literals of cgroup paths.
- Rename patch 8/8 to 'selftests/bpf' (Yonghong)
- Fix cgroup_iter comments (e.g. PAGE_SIZE and uapi) (Yonghong)
- Make sure further read() returns OK after previous read() finished
properly (Yonghong)
- Release cgroup_mutex before the last call of show() (Kumar)
v4 -> v5:
- Rebased on top of new kfunc flags infrastructure, updated patch 1 and
patch 6 accordingly.
- Added docs for sleepable kfuncs.
v3 -> v4:
- cgroup_iter:
* reorder fields in bpf_link_info to avoid breaking uapi (Yonghong)
* comment the behavior when cgroup_fd=0 (Yonghong)
* comment on the limit of number of cgroups supported by cgroup_iter.
(Yonghong)
- cgroup_hierarchical_stats selftest:
* Do not return -1 if stats are not found (causes overflow in userspace).
* Check if child process failed to join cgroup.
* Make buf and path arrays in get_cgroup_vmscan_delay() static.
* Increase the test map sizes to accommodate cgroups that are not
created by the test.
v2 -> v3:
- cgroup_iter:
* Added conditional compilation of cgroup_iter.c in kernel/bpf/Makefile
(kernel test) and dropped the !CONFIG_CGROUP patch.
* Added validation of traversal_order when attaching (Yonghong).
* Fixed previous wording "two modes" to "three modes" (Yonghong).
* Fixed the btf_dump selftest broken by this patch (Yonghong).
* Fixed ctx_arg_info[0] to use "PTR_TO_BTF_ID_OR_NULL" instead of
"PTR_TO_BTF_ID", because the "cgroup" pointer passed to iter prog can
be null.
- Use __diag_push to eliminate __weak noinline warning in
bpf_rstat_flush().
- cgroup_hierarchical_stats selftest:
* Added write_cgroup_file_parent() helper.
* Added error handling for failed map updates.
* Added null check for cgroup in vmscan_flush.
* Fixed the signature of vmscan_[start/end].
* Correctly return error code when attaching trace programs fail.
* Make sure all links are destroyed correctly and not leaking in
cgroup_hierarchical_stats selftest.
* Use memory.reclaim instead of memory.high as a more reliable way to
invoke reclaim.
* Eliminated sleeps, the test now runs faster.
v1 -> v2:
- Redesign of cgroup_iter from v1, based on Alexei's idea [1]:
* supports walking cgroup subtree.
* supports walking ancestors of a cgroup. (Andrii)
* supports terminating the walk early.
* uses fd instead of cgroup_id as parameter for iter_link. Using fd is
a convention in bpf.
* gets cgroup's ref at attach time and deref at detach.
* brought back cgroup1 support for cgroup_iter.
- Squashed the patches adding the rstat flush hook points and kfuncs
(Tejun).
- Added a comment explaining why bpf_rstat_flush() needs to be weak
(Tejun).
- Updated the final selftest with the new cgroup_iter design.
- Changed CHECKs in the selftest with ASSERTs (Yonghong, Andrii).
- Removed empty line at the end of the selftest (Yonghong).
- Renamed test files to cgroup_hierarchical_stats.c.
- Reordered CGROUP_PATH params order to match struct declaration
in the selftest (Michal).
- Removed memory_subsys_enabled() and made sure memcg controller
enablement checks make sense and are documented (Michal).
RFC v2 -> v1:
- Instead of introducing a new program type for rstat flushing, add an
empty hook point, bpf_rstat_flush(), and use fentry bpf programs to
attach to it and flush bpf stats.
- Instead of using helpers, use kfuncs for rstat functions.
- These changes simplify the patchset greatly, with minimal changes to
uapi.
RFC v1 -> RFC v2:
- Instead of rstat flush programs attach to subsystems, they now attach
to rstat (global flushers, not per-subsystem), based on discussions
with Tejun. The first patch is entirely rewritten.
- Pass cgroup pointers to rstat flushers instead of cgroup ids. This gives
much more flexibility and is less likely to need a uapi update later.
- rstat helpers are now only defined if CGROUP_CONFIG.
- Most of the code is now only defined if CGROUP_CONFIG and
CONFIG_BPF_SYSCALL.
- Move rstat helper protos from bpf_base_func_proto() to
tracing_prog_func_proto().
- rstat helpers argument (cgroup pointer) is now ARG_PTR_TO_BTF_ID, not
ARG_ANYTHING.
- Rewrote the selftest to use the cgroup helpers.
- Dropped bpf_map_lookup_percpu_elem (already added by Feng).
- Dropped patch to support cgroup v1 for cgroup_iter.
- Dropped patch to define some cgroup_put() when !CONFIG_CGROUP. The
code that calls it is no longer compiled when !CONFIG_CGROUP.
cgroup_iter was originally introduced in a different patch series[2].
Hao and I agreed that it fits better as part of this series.
RFC v1 of this patch series had the following changes from [2]:
- Getting the cgroup's reference at the time at attaching, instead of
at the time when iterating. (Yonghong)
- Remove .init_seq_private and .fini_seq_private callbacks for
cgroup_iter. They are not needed now. (Yonghong)
[1] https://lore.kernel.org/bpf/20220520221919.jnqgv52k4ajlgzcl@MBP-98dd607d3435.dhcp.thefacebook.com/
[2] https://lore.kernel.org/lkml/20220225234339.2386398-9-haoluo@google.com/
Hao Luo (2):
bpf: Introduce cgroup iter
selftests/bpf: Test cgroup_iter.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yosry Ahmed [Wed, 24 Aug 2022 23:31:17 +0000 (16:31 -0700)]
selftests/bpf: add a selftest for cgroup hierarchical stats collection
Add a selftest that tests the whole workflow for collecting,
aggregating (flushing), and displaying cgroup hierarchical stats.
TL;DR:
- Userspace program creates a cgroup hierarchy and induces memcg reclaim
in parts of it.
- Whenever reclaim happens, vmscan_start and vmscan_end update
per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
have updates.
- When userspace tries to read the stats, vmscan_dump calls rstat to flush
the stats, and outputs the stats in text format to userspace (similar
to cgroupfs stats).
- rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
updates, vmscan_flush aggregates cpu readings and propagates updates
to parents.
- Userspace program makes sure the stats are aggregated and read
correctly.
Detailed explanation:
- The test loads tracing bpf programs, vmscan_start and vmscan_end, to
measure the latency of cgroup reclaim. Per-cgroup readings are stored in
percpu maps for efficiency. When a cgroup reading is updated on a cpu,
cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
rstat updated tree on that cpu.
- A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
each cgroup. Reading this file invokes the program, which calls
cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
the stats are exposed to the user. vmscan_dump returns 1 to terminate
iteration early, so that we only expose stats for one cgroup per read.
- An ftrace program, vmscan_flush, is also loaded and attached to
bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
once for each (cgroup, cpu) pair that has updates. cgroups are popped
from the rstat tree in a bottom-up fashion, so calls will always be
made for cgroups that have updates before their parents. The program
aggregates percpu readings to a total per-cgroup reading, and also
propagates them to the parent cgroup. After rstat flushing is over, all
cgroups will have correct updated hierarchical readings (including all
cpus and all their descendants).
- Finally, the test creates a cgroup hierarchy and induces memcg reclaim
in parts of it, and makes sure that the stats collection, aggregation,
and reading workflow works as expected.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/r/20220824233117.1312810-6-haoluo@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yosry Ahmed [Wed, 24 Aug 2022 23:31:16 +0000 (16:31 -0700)]
selftests/bpf: extend cgroup helpers
This patch extends the bpf selftest cgroup_helpers in various ways:
- Add enable_controllers() that allows tests to enable all or a
subset of controllers for a specific cgroup.
- Add join_cgroup_parent(). The cgroup workdir is based on the pid,
therefore a spawned child cannot join the same cgroup hierarchy of the
test through join_cgroup(). join_cgroup_parent() is used in child
processes to join a cgroup under the parent's workdir.
- Add write_cgroup_file() and write_cgroup_file_parent() (similar to
join_cgroup_parent() above).
- Add get_root_cgroup() for tests that need to do checks on root cgroup.
- Distinguish relative and absolute cgroup paths in function arguments.
Now relative paths are called relative_path, and absolute paths are
called cgroup_path.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/r/20220824233117.1312810-5-haoluo@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yosry Ahmed [Wed, 24 Aug 2022 23:31:15 +0000 (16:31 -0700)]
cgroup: bpf: enable bpf programs to integrate with rstat
Enable bpf programs to make use of rstat to collect cgroup hierarchical
stats efficiently:
- Add cgroup_rstat_updated() kfunc, for bpf progs that collect stats.
- Add cgroup_rstat_flush() sleepable kfunc, for bpf progs that read stats.
- Add an empty bpf_rstat_flush() hook that is called during rstat
flushing, for bpf progs that flush stats to attach to. Attaching a bpf
prog to this hook effectively registers it as a flush callback.
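A hedged sketch of a BPF-side flush callback registering on that hook;
the aggregation body and map plumbing are elided, and the program name
is illustrative:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("fentry/bpf_rstat_flush")
int BPF_PROG(my_flush_cb, struct cgroup *cgrp, struct cgroup *parent, int cpu)
{
	/* aggregate this cpu's per-cgroup readings into cgrp's total and
	 * propagate them to parent here
	 */
	return 0;
}

char _license[] SEC("license") = "GPL";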
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/r/20220824233117.1312810-4-haoluo@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Hao Luo [Wed, 24 Aug 2022 23:31:14 +0000 (16:31 -0700)]
selftests/bpf: Test cgroup_iter.
Add a selftest for cgroup_iter. The selftest creates a mini cgroup tree
of the following structure:
    ROOT (working cgroup)
     |
   PARENT
   /    \
CHILD1  CHILD2
and tests the following scenarios:
- invalid cgroup fd.
- pre-order walk over descendants from PARENT.
- post-order walk over descendants from PARENT.
- walk of ancestors from PARENT.
- process only a single object (i.e. PARENT).
- early termination.
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/r/20220824233117.1312810-3-haoluo@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Hao Luo [Wed, 24 Aug 2022 23:31:13 +0000 (16:31 -0700)]
bpf: Introduce cgroup iter
Cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes:
- walking a cgroup's descendants in pre-order.
- walking a cgroup's descendants in post-order.
- walking a cgroup's ancestors.
- process only the given cgroup.
When attaching cgroup_iter, one can set a cgroup to the iter_link
created from attaching. This cgroup is passed as a file descriptor
or cgroup id and serves as the starting point of the walk. If no
cgroup is specified, the starting point will be the root cgroup v2.
For walking descendants, one can specify the order: either pre-order or
post-order. For walking ancestors, the walk starts at the specified
cgroup and ends at the root.
One can also terminate the walk early by returning 1 from the iter
program.
Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
program is called with cgroup_mutex held.
Currently only one session is supported, which means, depending on the
volume of data the bpf program intends to send to user space, the number
of cgroups that can be walked is limited. For example, given that the
current buffer size is 8 * PAGE_SIZE, if the program sends 64B of data for
each cgroup, assuming PAGE_SIZE is 4kb, the total number of cgroups that
can be walked is 512. This is a limitation of cgroup_iter. If the output
data is larger than the kernel buffer size, then after all data in the
kernel buffer is consumed by user space, the subsequent read() syscall
will signal EOPNOTSUPP. To work around this, the user may have to
update their program to reduce the volume of data sent to output, for
example by skipping some uninteresting cgroups. In the future, we may
extend bpf_iter flags to allow customizing the buffer size.
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/r/20220824233117.1312810-2-haoluo@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yang Yingliang [Wed, 24 Aug 2022 01:39:07 +0000 (09:39 +0800)]
selftests/bpf: Fix wrong size passed to bpf_setsockopt()
sizeof(new_cc) is not the real size of the memory that new_cc points to;
introduce a new_cc_len to store the size and then pass it to bpf_setsockopt().
Fixes: 31123c0360e0 ("selftests/bpf: bpf_setsockopt tests")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220824013907.380448-1-yangyingliang@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Müller [Wed, 24 Aug 2022 16:39:06 +0000 (16:39 +0000)]
selftests/bpf: Add cb_refs test to s390x deny list
The cb_refs BPF selftest is failing execution on s390x machines. This is
a newly added test that requires a feature not presently supported on
this architecture.
Denylist the test for this architecture.
Fixes: 3cf7e7d8685c ("selftests/bpf: Add tests for reference state fixes for callbacks")
Signed-off-by: Daniel Müller <deso@posteo.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220824163906.1186832-1-deso@posteo.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov [Tue, 23 Aug 2022 23:22:00 +0000 (16:22 -0700)]
Merge branch 'Fix reference state management for synchronous callbacks'
Kumar Kartikeya Dwivedi says:
====================
These are patches 1 and 2 plus their individual tests, split into a separate
series from the RFC, so that they can be taken in while we continue working
towards a fix for handling stack access inside the callback.
Changelog:
----------
v1 -> v2:
v1: https://lore.kernel.org/bpf/20220822131923.21476-1-memxor@gmail.com
* Fix error for test_progs-no_alu32 due to distinct alloc_insn in errstr
RFC v1 -> v1:
RFC v1: https://lore.kernel.org/bpf/20220815051540.18791-1-memxor@gmail.com
* Fix up commit log to add more explanation (Alexei)
* Split reference state fix out into a separate series
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Kumar Kartikeya Dwivedi [Tue, 23 Aug 2022 01:32:26 +0000 (03:32 +0200)]
selftests/bpf: Add tests for reference state fixes for callbacks
These are regression tests to ensure we don't end up in invalid runtime
state for helpers that execute callbacks multiple times. It exercises
the fixes to verifier callback handling for reference state in previous
patches.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220823013226.24988-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Kumar Kartikeya Dwivedi [Tue, 23 Aug 2022 01:31:25 +0000 (03:31 +0200)]
bpf: Fix reference state management for synchronous callbacks
Currently, the verifier verifies callback functions (sync and async) as if
they will be executed once (i.e. it explores the execution state as if the
function were being called once). The next insn to explore is set to the
start of the subprog, and the exit from the nested frame is handled using
curframe > 0 and prepare_func_exit. In the async callback case it uses a
customized variant of push_stack, simulating a kind of branch to set up
custom state and an execution context for the async callback.
While this approach is simple and works when the callback really will be
executed only once, it is unsafe for all of our current helpers, which
are for_each style, i.e. they execute the callback multiple times.
A callback releasing acquired references of the caller may do so
multiple times, but currently the verifier sees it as one call inside the
frame, which then returns to the caller. Hence, it thinks it released some
reference that the cb, e.g., got access to through callback_ctx (a register
filled inside the cb from a spilled typed register on the stack).
Similarly, it may see that an acquire call is unpaired inside the
callback, so the caller will copy the reference state of the callback and
then will have to release the register with new ref_obj_ids. But again,
the callback may execute multiple times, while the verifier will only
account for acquired references for a single symbolic execution of the
callback, which will cause leaks.
Note that for async callback case, things are different. While currently
we have bpf_timer_set_callback which only executes it once, even for
multiple executions it would be safe, as reference state is NULL and
check_reference_leak would force program to release state before
BPF_EXIT. The state is also unaffected by analysis for the caller frame.
Hence async callback is safe.
Since we want the reference state to be accessible, e.g. for pointers
loaded from stack through callback_ctx's PTR_TO_STACK, we still have to
copy caller's reference_state to callback's bpf_func_state, but we
enforce that whatever references it adds to that reference_state has
been released before it hits BPF_EXIT. This requires introducing a new
callback_ref member in the reference state to distinguish between caller
vs callee references. Hence, check_reference_leak now errors out if it
sees we are in callback_fn and we have not released callback_ref refs.
Since there can be multiple nested callbacks, like frame 0 -> cb1 -> cb2
etc. we need to also distinguish between whether this particular ref
belongs to this callback frame or parent, and only error for our own, so
we store state->frameno (which is always non-zero for callbacks).
In short, callbacks can read parent reference_state, but cannot mutate
it, to be able to use pointers acquired by the caller. They must only
undo their changes (by releasing their own acquired_refs before
BPF_EXIT) on top of caller reference_state before returning (at which
point the caller and callback state will match anyway, so no need to
copy it back to caller).
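A sketch of how check_reference_leak() can distinguish caller vs callee
references after this change; field names follow the description above,
but treat the exact shape as a hedged reconstruction rather than the
literal diff:

static int check_reference_leak(struct bpf_verifier_env *env)
{
	struct bpf_func_state *state = cur_func(env);
	bool refs_lingering = false;
	int i;

	if (state->frameno && !state->in_callback_fn)
		return 0;

	for (i = 0; i < state->acquired_refs; i++) {
		/* only complain about refs this callback acquired itself */
		if (state->in_callback_fn &&
		    state->refs[i].callback_ref != state->frameno)
			continue;
		verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
			state->refs[i].id, state->refs[i].insn_idx);
		refs_lingering = true;
	}
	return refs_lingering ? -EINVAL : 0;
}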
Fixes: 69c087ba6225 ("bpf: Add bpf_for_each_map_elem() helper")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220823013125.24938-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Kumar Kartikeya Dwivedi [Tue, 23 Aug 2022 01:31:17 +0000 (03:31 +0200)]
bpf: Move bpf_loop and bpf_for_each_map_elem under CAP_BPF
They would require func_info, which needs prog BTF anyway. Loading BTF
and setting the prog btf_fd while loading the prog indirectly requires
CAP_BPF, so just to reduce confusion, move both of these callback-taking
helpers under bpf_capable() protection as well, since they cannot be
used without CAP_BPF.
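A sketch of the resulting shape of bpf_base_func_proto(); only the moved
cases are shown and the elided parts are indicated in comments:

const struct bpf_func_proto *
bpf_base_func_proto(enum bpf_func_id func_id)
{
	/* ... helpers that do not require CAP_BPF elided ... */

	if (!bpf_capable())
		return NULL;

	/* below this point, every helper requires CAP_BPF */
	switch (func_id) {
	case BPF_FUNC_for_each_map_elem:
		return &bpf_for_each_map_elem_proto;
	case BPF_FUNC_loop:
		return &bpf_loop_proto;
	/* ... other CAP_BPF-gated helpers elided ... */
	default:
		return NULL;
	}
}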
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220823013117.24916-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov [Tue, 23 Aug 2022 23:08:22 +0000 (16:08 -0700)]
Merge branch 'bpf: expose bpf_{g,s}et_retval to more cgroup hooks'
Stanislav Fomichev says:
====================
Apparently, only a small subset of cgroup hooks actually falls
back to cgroup_base_func_proto. This leads to the unexpected result
that not all cgroup hooks have bpf_{g,s}et_retval.
It's getting harder and harder to manage which helpers are exported
to which hooks. We now have the following call chains:
- cg_skb_func_proto
- sk_filter_func_proto
- bpf_sk_base_func_proto
- bpf_base_func_proto
So by looking at cg_skb_func_proto it's pretty hard to understand
what's going on.
For cgroup helpers, I'm proposing we do the following instead:
  func_proto = cgroup_common_func_proto();
  if (func_proto) return func_proto;

  /* optional, if hook has 'current' */
  func_proto = cgroup_current_func_proto();
  if (func_proto) return func_proto;

  ...

  switch (func_id) {
  /* hook specific helpers */
  case BPF_FUNC_hook_specific_helper:
          return &xyz;
  default:
          /* always fall back to plain bpf_base_func_proto */
          return bpf_base_func_proto(func_id);
  }
If this turns out more workable, we can follow up with converting
the rest to the same pattern.
v5:
- remove net/cls_cgroup.h include from patch 1/5 (Martin)
- move endif changes from patch 1/5 to 3/5 (Martin)
- don't define __weak protos, the ones in core.c suffice (Martin)
v4:
- don't touch existing helper.c helpers (Martin)
- drop unneeded CONFIG_CGROUP_BPF in bpf_lsm_func_proto (Martin)
v3:
- expose strtol/strtoul everywhere (Martin)
- move helpers declarations from bpf.h to bpf-cgroup.h (Martin)
- revise bpf_{g,s}et_retval documentation (Martin)
- don't expose bpf_{g,s}et_retval to cg_skb hooks (Martin)
v2:
- move everything into kernel/bpf/cgroup.c instead (Martin)
- use cgroup_common_func_proto in lsm (Martin)
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Stanislav Fomichev [Tue, 23 Aug 2022 22:25:55 +0000 (15:25 -0700)]
selftests/bpf: Make sure bpf_{g,s}et_retval is exposed everywhere
For each hook, have a simple bpf_set_retval(bpf_get_retval) program
and make sure it loads for the hooks we want. The exceptions are
the hooks which don't propagate the error to the callers:
- sockops
- recvmsg
- getpeername
- getsockname
- cg_skb ingress and egress
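A hedged sketch of one such probe program; one program exists per attach
type, the section name varies per hook under test, and loading is
expected to fail on hooks where the helpers are not exposed:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("cgroup/bind4")
int bind_v4_prog(struct bpf_sock_addr *ctx)
{
	/* round-trip the retval; the test only cares that this loads */
	bpf_set_retval(bpf_get_retval());
	return 1;
}

char _license[] SEC("license") = "GPL";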
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220823222555.523590-6-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Stanislav Fomichev [Tue, 23 Aug 2022 22:25:54 +0000 (15:25 -0700)]
bpf: update bpf_{g,s}et_retval documentation
* replace 'syscall' with 'upper layers', still mention that it's being
exported via syscall errno
* describe what happens in set_retval(-EPERM) + return 1
* describe what happens with bind's 'return 3'
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220823222555.523590-5-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Stanislav Fomichev [Tue, 23 Aug 2022 22:25:53 +0000 (15:25 -0700)]
bpf: expose bpf_strtol and bpf_strtoul to all program types
bpf_strncmp is already exposed everywhere. The motivation is to keep
those helpers in kernel/bpf/helpers.c. Otherwise it's tempting to move
them under kernel/bpf/cgroup.c because they are currently only used
by sysctl prog types.
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220823222555.523590-4-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Stanislav Fomichev [Tue, 23 Aug 2022 22:25:52 +0000 (15:25 -0700)]
bpf: Use cgroup_{common,current}_func_proto in more hooks
The following hooks are per-cgroup hooks but they are not
using cgroup_{common,current}_func_proto, fix it:
* BPF_PROG_TYPE_CGROUP_SKB (cg_skb)
* BPF_PROG_TYPE_CGROUP_SOCK_ADDR (cg_sock_addr)
* BPF_PROG_TYPE_CGROUP_SOCK (cg_sock)
* BPF_PROG_TYPE_LSM+BPF_LSM_CGROUP
Also:
* move common func_proto's into cgroup func_proto handlers
* make sure bpf_{g,s}et_retval are not accessible from recvmsg,
getpeername and getsockname (return/errno is ignored in these
places)
* as a side effect, expose get_current_pid_tgid, get_current_comm_proto,
get_current_ancestor_cgroup_id, get_cgroup_classid to more cgroup
hooks
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220823222555.523590-3-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Stanislav Fomichev [Tue, 23 Aug 2022 22:25:51 +0000 (15:25 -0700)]
bpf: Introduce cgroup_{common,current}_func_proto
Split cgroup_base_func_proto into the following:
* cgroup_common_func_proto - common helpers for all cgroup hooks
* cgroup_current_func_proto - common helpers for all cgroup hooks
running in the process context (== have meaningful 'current').
Move bpf_{g,s}et_retval and other cgroup-related helpers into
kernel/bpf/cgroup.c so they are closer to where they are being used.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220823222555.523590-2-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Quentin Monnet [Tue, 23 Aug 2022 15:53:27 +0000 (16:53 +0100)]
scripts/bpf: Set date attribute for bpf-helpers(7) man page
The bpf-helpers(7) manual page shipped in the man-pages project is
generated from the documentation contained in the BPF UAPI header, in
the Linux repository, parsed by script/bpf_doc.py and then fed to
rst2man.
The man page should contain the date of last modification of the
documentation. This commit adds the relevant date when generating the
page.
Before:
$ ./scripts/bpf_doc.py helpers | rst2man | grep '\.TH'
.TH BPF-HELPERS 7 "" "Linux v5.19-14022-g30d2a4d74e11" ""
After:
$ ./scripts/bpf_doc.py helpers | rst2man | grep '\.TH'
.TH BPF-HELPERS 7 "2022-08-15" "Linux v5.19-14022-g30d2a4d74e11" ""
We get the date by using "git log" to look for the commit date of the
latest change to the section of the BPF header containing the
documentation. If the command fails, we just skip the date field and
keep generating the page.
Reported-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alejandro Colomar <alx.manpages@gmail.com>
Link: https://lore.kernel.org/bpf/20220823155327.98888-2-quentin@isovalent.com
Quentin Monnet [Tue, 23 Aug 2022 15:53:26 +0000 (16:53 +0100)]
scripts/bpf: Set version attribute for bpf-helpers(7) man page
The bpf-helpers(7) manual page shipped in the man-pages project is
generated from the documentation contained in the BPF UAPI header, in
the Linux repository, parsed by script/bpf_doc.py and then fed to
rst2man.
After a recent update of that page [0], Alejandro reported that the
linter used to validate the man pages complains about the generated
document [1]. The header for the page is supposed to contain some
attributes that we do not set correctly with the script. This commit
updates the "project and version" field. We discussed the format of
those fields in [1] and [2].
Before:
$ ./scripts/bpf_doc.py helpers | rst2man | grep '\.TH'
.TH BPF-HELPERS 7 "" "" ""
After:
$ ./scripts/bpf_doc.py helpers | rst2man | grep '\.TH'
.TH BPF-HELPERS 7 "" "Linux v5.19-14022-g30d2a4d74e11" ""
We get the version from "git describe", but if unavailable, we fall back
on "make kernelversion". If neither works, for example because neither git
nor make is installed, we just set the field to "Linux" and keep
generating the page.
[0] https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/man7/bpf-helpers.7?id=19c7f78393f2b038e76099f87335ddf43a87f039
[1] https://lore.kernel.org/all/20220823084719.13613-1-quentin@isovalent.com/t/#m58a418a318642c6428e14ce9bb84eba5183b06e8
[2] https://lore.kernel.org/all/20220721110821.8240-1-alx.manpages@gmail.com/t/#m8e689a822e03f6e2530a0d6de9d128401916c5de
Reported-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alejandro Colomar <alx.manpages@gmail.com>
Link: https://lore.kernel.org/bpf/20220823155327.98888-1-quentin@isovalent.com
Shmulik Ladkani [Sun, 21 Aug 2022 11:35:19 +0000 (14:35 +0300)]
bpf, selftests: Test BPF_FLOW_DISSECTOR_CONTINUE
The dissector program returns BPF_FLOW_DISSECTOR_CONTINUE (and avoids
setting skb->flow_keys or the last_dissection map) in case it encounters
IP packets whose (outer) source address is 127.0.0.127.
An additional test is added to prog_tests/flow_dissector.c which sets
this address as the test's pkk.iph.saddr, with the expected retval of
BPF_FLOW_DISSECTOR_CONTINUE.
Also, the legacy test_flow_dissector.sh was similarly augmented.
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220821113519.116765-5-shmulik.ladkani@gmail.com
Shmulik Ladkani [Sun, 21 Aug 2022 11:35:18 +0000 (14:35 +0300)]
bpf, test_run: Propagate bpf_flow_dissect's retval to user's bpf_attr.test.retval
Formerly, a boolean denoting whether bpf_flow_dissect returned BPF_OK
was set into 'bpf_attr.test.retval'.
Augment this, so users can check the actual return code of the dissector
program under test.
Existing prog_tests/flow_dissector*.c tests were correspondingly changed
to check against each test's expected retval.
Also, tests' resulting 'flow_keys' are verified only in case the expected
retval is BPF_OK. This allows adding new tests that expect a non-BPF_OK retval.
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220821113519.116765-4-shmulik.ladkani@gmail.com
Shmulik Ladkani [Sun, 21 Aug 2022 11:35:17 +0000 (14:35 +0300)]
bpf, flow_dissector: Introduce BPF_FLOW_DISSECTOR_CONTINUE retcode for bpf progs
Currently, attaching BPF_PROG_TYPE_FLOW_DISSECTOR programs completely
replaces the flow-dissector logic with custom dissection logic. This
forces implementors to write programs that handle dissection for any
flows expected in the namespace.
It makes sense for flow-dissector BPF programs to just augment the
dissector with custom logic (e.g. dissecting certain flows or custom
protocols), while enjoying the broad capabilities of the standard
dissector for any other traffic.
Introduce the BPF_FLOW_DISSECTOR_CONTINUE retcode. Flow-dissector BPF
programs may return this to indicate that no dissection was made and
that fallback to the standard dissector is requested.
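A minimal sketch of a program using the new retcode, mirroring the
127.0.0.127 policy of the selftest above (IPv4-only for brevity; an
illustration, not the selftest's actual code):

#include <linux/bpf.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("flow_dissector")
int dissect(struct __sk_buff *skb)
{
	struct bpf_flow_keys *keys = skb->flow_keys;
	struct iphdr iph;

	if (bpf_skb_load_bytes(skb, keys->nhoff, &iph, sizeof(iph)) < 0)
		return BPF_FLOW_DISSECTOR_CONTINUE;

	/* The selftest's policy: hand packets from 127.0.0.127 back
	 * to the standard dissector. */
	if (iph.saddr == bpf_htonl(0x7f00007f))	/* 127.0.0.127 */
		return BPF_FLOW_DISSECTOR_CONTINUE;

	/* ... custom dissection filling *keys goes here ... */
	return BPF_OK;
}

char _license[] SEC("license") = "GPL";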
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220821113519.116765-3-shmulik.ladkani@gmail.com
Shmulik Ladkani [Sun, 21 Aug 2022 11:35:16 +0000 (14:35 +0300)]
flow_dissector: Make 'bpf_flow_dissect' return the bpf program retcode
Let 'bpf_flow_dissect' callers know the BPF program's retcode and act
accordingly.
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220821113519.116765-2-shmulik.ladkani@gmail.com
Martin KaFai Lau [Fri, 19 Aug 2022 19:21:55 +0000 (12:21 -0700)]
selftest/bpf: Add setget_sockopt to DENYLIST.s390x
Trampoline is not supported on s390.
Fixes:
31123c0360e0 ("selftests/bpf: bpf_setsockopt tests")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220819192155.91713-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Colin Ian King [Wed, 17 Aug 2022 21:32:42 +0000 (22:32 +0100)]
selftests/bpf: Fix spelling mistake.
There is a spelling mistake in an ASSERT_OK literal string. Fix it.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Acked-by: Mykola Lysenko <mykolal@fb.com>
Link: https://lore.kernel.org/r/20220817213242.101277-1-colin.i.king@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov [Fri, 19 Aug 2022 00:06:14 +0000 (17:06 -0700)]
Merge branch 'bpf: net: Remove duplicated code from bpf_setsockopt()'
Martin KaFai Lau says:
====================
The code in bpf_setsockopt() is mostly a copy-and-paste from
the sock_setsockopt(), do_tcp_setsockopt(), do_ipv6_setsockopt(),
and do_ip_setsockopt(). As the allowed optnames in bpf_setsockopt()
grow, so does the duplicated code. The code between the copies
has also slowly drifted.
This set is an effort to clean this up and reuse the existing
{sock,do_tcp,do_ipv6,do_ip}_setsockopt() as much as possible.
After the clean up, this set also adds a few allowed optnames
that we need to the bpf_setsockopt().
The initial attempt was to clean up both bpf_setsockopt() and
bpf_getsockopt() together. However, the patch set was getting
too long, so it is beneficial to leave bpf_getsockopt()
for another patch set. Thus, this set focuses
on bpf_setsockopt().
v4:
- This set now depends on the commit
f574f7f839fc ("net: bpf: Use the protocol's set_rcvlowat behavior if there is one")
in the net-next tree. The commit calls a specific protocol's
set_rcvlowat and it changed the bpf_setsockopt
which this set has also changed.
Because of this, patch 9 of this set has also been adjusted,
and a 'sock' NULL check is added to sk_setsockopt()
because some of the bpf hooks have a NULL sk->sk_socket.
This removes more dup code from the bpf_setsockopt() side.
- Avoid mentioning specific prog types in the comment of
the has_current_bpf_ctx(). (Andrii)
- Replace signed with unsigned int bitfield in the
patch 15 selftest. (Daniel)
v3:
- s/in_bpf/has_current_bpf_ctx/ (Andrii)
- Add comment to has_current_bpf_ctx() and sockopt_lock_sock()
(Stanislav)
- Use vmlinux.h in selftest and add defines to bpf_tracing_net.h
(Stanislav)
- Use bpf_getsockopt(SO_MARK) in selftest (Stanislav)
- Use BPF_CORE_READ_BITFIELD in selftest (Yonghong)
v2:
- A major change is to use in_bpf() to test if a setsockopt()
is called by a bpf prog and, if so, to skip the capable
check. Suggested by Stanislav.
- Instead of passing is_locked through sockptr_t or through an extra
argument to sk_setsockopt, v2 uses in_bpf() to also skip the lock_sock()
because the bpf prog already has the lock acquired.
- No change to the current sockptr_t in this revision
- s/codes/code/
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin KaFai Lau [Wed, 17 Aug 2022 06:18:47 +0000 (23:18 -0700)]
selftests/bpf: bpf_setsockopt tests
This patch adds tests to exercise optnames that are allowed
in bpf_setsockopt().
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061847.4182339-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin KaFai Lau [Wed, 17 Aug 2022 06:18:41 +0000 (23:18 -0700)]
bpf: Add a few optnames to bpf_setsockopt
This patch adds a few optnames for bpf_setsockopt:
SO_REUSEADDR, IPV6_AUTOFLOWLABEL, TCP_MAXSEG, TCP_NODELAY,
and TCP_THIN_LINEAR_TIMEOUTS.
Thanks to the previous patches of this set, all additions can reuse
the sk_setsockopt(), do_ipv6_setsockopt(), and do_tcp_setsockopt().
The only change here is to allow them in bpf_setsockopt.
The bpf prog has been able to read all members of a sk by
using PTR_TO_BTF_ID of a sk. The optname additions here can also be
read by the same approach. Meaning there is a way to read
the values back.
These optnames can also be added to bpf_getsockopt() later with
another patch set that makes bpf_getsockopt() reuse
the sock_getsockopt(), tcp_getsockopt(), and ip[v6]_getsockopt().
Thus, this patch does not add more duplicated code to
bpf_getsockopt() now.
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061841.4181642-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin KaFai Lau [Wed, 17 Aug 2022 06:18:34 +0000 (23:18 -0700)]
bpf: Change bpf_setsockopt(SOL_IPV6) to reuse do_ipv6_setsockopt()
After the prep work in the previous patches,
this patch removes the dup code from bpf_setsockopt(SOL_IPV6)
and reuses the implementation in do_ipv6_setsockopt().
ipv6 could be compiled as a module. Like how other code solved it
with stubs in ipv6_stubs.h, this patch adds the do_ipv6_setsockopt
to the ipv6_bpf_stub.
The current bpf_setsockopt(IPV6_TCLASS) does not take the
INET_ECN_MASK into account for tcp. The
do_ipv6_setsockopt(IPV6_TCLASS) will handle it correctly.
The existing optname white-list is refactored into a new
function sol_ipv6_setsockopt().
After this last SOL_IPV6 dup code removal, the __bpf_setsockopt()
is simplified enough that the extra "{ }" around the if statement
can be removed.
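The shape of the new function can be sketched as follows (the optname
list is abbreviated, and the stub member name is an assumption for
illustration, not necessarily the kernel's exact one):

static int sol_ipv6_setsockopt(struct sock *sk, int optname,
			       char *optval, int optlen)
{
	switch (optname) {
	case IPV6_TCLASS:
	case IPV6_AUTOFLOWLABEL:
		/* ... other allowed optnames ... */
		break;
	default:
		return -EINVAL;
	}

	/* ipv6 may be built as a module, so go through the stub. */
	return ipv6_bpf_stub->ipv6_setsockopt(sk, SOL_IPV6, optname,
					      KERNEL_SOCKPTR(optval), optlen);
}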
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061834.4181198-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin KaFai Lau [Wed, 17 Aug 2022 06:18:26 +0000 (23:18 -0700)]
bpf: Change bpf_setsockopt(SOL_IP) to reuse do_ip_setsockopt()
After the prep work in the previous patches,
this patch removes the dup code from bpf_setsockopt(SOL_IP)
and reuses the implementation in do_ip_setsockopt().
The existing optname white-list is refactored into a new
function sol_ip_setsockopt().
NOTE,
the current bpf_setsockopt(IP_TOS) is quite different from
do_ip_setsockopt(IP_TOS). For example, it does not take
the INET_ECN_MASK into account for tcp and also does not adjust
sk->sk_priority. It looks like the current bpf_setsockopt(IP_TOS)
was referencing the IPV6_TCLASS implementation instead of IP_TOS.
This patch tries to rectify that by using the do_ip_setsockopt(IP_TOS).
While this is a behavior change, the do_ip_setsockopt(IP_TOS) behavior
is arguably what the user is expecting. At least, the INET_ECN_MASK bits
should be masked out for tcp.
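For reference, the tcp-related handling that bpf_setsockopt(IP_TOS) now
inherits looks roughly like the kernel's __ip_sock_set_tos() logic,
with val being the new TOS value (a sketch, not the exact code):

	if (sk->sk_type == SOCK_STREAM) {
		/* Never clobber the ECN bits on a tcp socket. */
		val &= ~INET_ECN_MASK;
		val |= inet_sk(sk)->tos & INET_ECN_MASK;
	}
	if (inet_sk(sk)->tos != val) {
		inet_sk(sk)->tos = val;
		/* Keep sk_priority in sync with the new TOS value. */
		sk->sk_priority = rt_tos2priority(val);
		sk_dst_reset(sk);
	}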
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061826.4180990-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin KaFai Lau [Wed, 17 Aug 2022 06:18:19 +0000 (23:18 -0700)]
bpf: Change bpf_setsockopt(SOL_TCP) to reuse do_tcp_setsockopt()
After the prep work in the previous patches,
this patch removes all the dup code from bpf_setsockopt(SOL_TCP)
and reuses the do_tcp_setsockopt().
The existing optname white-list is refactored into a new
function sol_tcp_setsockopt(). The sol_tcp_setsockopt()
also calls the bpf_sol_tcp_setsockopt() to handle
the TCP_BPF_XXX specific optnames.
bpf_setsockopt(TCP_SAVE_SYN) now also allows a value of 2 to
save the eth header as well, and this comes for free from
do_tcp_setsockopt().
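The resulting dispatch can be sketched like this (optname list
abbreviated; a sketch of the pattern, not the exact kernel code):

static int sol_tcp_setsockopt(struct sock *sk, int optname,
			      char *optval, int optlen)
{
	switch (optname) {
	case TCP_NODELAY:
	case TCP_MAXSEG:
	case TCP_SAVE_SYN:
		/* ... other allowed standard TCP optnames ... */
		break;
	default:
		/* TCP_BPF_XXX optnames are handled in their own helper. */
		return bpf_sol_tcp_setsockopt(sk, optname, optval, optlen);
	}

	return do_tcp_setsockopt(sk, SOL_TCP, optname,
				 KERNEL_SOCKPTR(optval), optlen);
}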
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061819.4180146-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin KaFai Lau [Wed, 17 Aug 2022 06:18:12 +0000 (23:18 -0700)]
bpf: Refactor bpf specific tcp optnames to a new function
The patch moves all bpf specific tcp optnames (TCP_BPF_XXX)
to a new function bpf_sol_tcp_setsockopt(). This will make
the next patch easier to follow.
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061812.4179645-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin KaFai Lau [Wed, 17 Aug 2022 06:18:04 +0000 (23:18 -0700)]
bpf: Change bpf_setsockopt(SOL_SOCKET) to reuse sk_setsockopt()
After the prep work in the previous patches,
this patch removes most of the dup code from bpf_setsockopt(SOL_SOCKET)
and reuses the implementation in sk_setsockopt().
The sock ptr test is added to SO_RCVLOWAT because
sk->sk_socket could be NULL in some of the bpf hooks.
The existing optname white-list is refactored into a new
function sol_socket_setsockopt().
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061804.4178920-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin KaFai Lau [Wed, 17 Aug 2022 06:17:58 +0000 (23:17 -0700)]
bpf: Embed kernel CONFIG check into the if statement in bpf_setsockopt
This patch moves the "#ifdef CONFIG_XXX" check into the "if/else"
statement itself. The change is done for the bpf_setsockopt()
function only. It will make the latter patches easier to follow
without the surrounding ifdef macro.
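In other words, instead of wrapping whole branches in #ifdef blocks, the
config check becomes part of the condition itself, in the spirit of the
following sketch (using the sol_*_setsockopt() names that later patches
in this series introduce):

	if (level == SOL_SOCKET)
		return sol_socket_setsockopt(sk, optname, optval, optlen);
	else if (IS_ENABLED(CONFIG_INET) && level == SOL_IP)
		return sol_ip_setsockopt(sk, optname, optval, optlen);
	else if (IS_ENABLED(CONFIG_IPV6) && level == SOL_IPV6)
		return sol_ipv6_setsockopt(sk, optname, optval, optlen);
	else if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP)
		return sol_tcp_setsockopt(sk, optname, optval, optlen);

	return -EINVAL;

With IS_ENABLED(), the compiler still type-checks every branch but can
eliminate it as dead code when the config option is off.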
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061758.4178374-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin KaFai Lau [Wed, 17 Aug 2022 06:17:51 +0000 (23:17 -0700)]
bpf: Initialize the bpf_run_ctx in bpf_iter_run_prog()
The bpf-iter-prog for tcp and unix sk can do bpf_setsockopt()
which needs has_current_bpf_ctx() to decide if it is called by a
bpf prog. This patch initializes the bpf_run_ctx in
bpf_iter_run_prog() for the has_current_bpf_ctx() to use.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061751.4177657-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin KaFai Lau [Wed, 17 Aug 2022 06:17:44 +0000 (23:17 -0700)]
bpf: net: Change do_ipv6_setsockopt() to use the sockopt's lock_sock() and capable()
Similar to the earlier patch that avoids sk_setsockopt() from
taking sk lock and doing capable test when called by bpf. This patch
changes do_ipv6_setsockopt() to use the sockopt_{lock,release}_sock()
and sockopt_[ns_]capable().
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061744.4176893-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin KaFai Lau [Wed, 17 Aug 2022 06:17:37 +0000 (23:17 -0700)]
bpf: net: Change do_ip_setsockopt() to use the sockopt's lock_sock() and capable()
Similar to the earlier patch that avoids sk_setsockopt() from
taking sk lock and doing capable test when called by bpf. This patch
changes do_ip_setsockopt() to use the sockopt_{lock,release}_sock()
and sockopt_[ns_]capable().
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061737.4176402-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin KaFai Lau [Wed, 17 Aug 2022 06:17:30 +0000 (23:17 -0700)]
bpf: net: Change do_tcp_setsockopt() to use the sockopt's lock_sock() and capable()
Similar to the earlier patch that avoids sk_setsockopt() from
taking sk lock and doing capable test when called by bpf. This patch
changes do_tcp_setsockopt() to use the sockopt_{lock,release}_sock()
and sockopt_[ns_]capable().
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061730.4176021-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin KaFai Lau [Wed, 17 Aug 2022 06:17:23 +0000 (23:17 -0700)]
bpf: net: Consider has_current_bpf_ctx() when testing capable() in sk_setsockopt()
When a bpf program calls bpf_setsockopt(SOL_SOCKET),
it could be running in softirq context, where it does not make sense to
do the capable check. There was a similar situation in
bpf_setsockopt(TCP_CONGESTION).
In commit
8d650cdedaab ("tcp: fix tcp_set_congestion_control() use from bpf hook"),
tcp_set_congestion_control(..., cap_net_admin) was added to skip
the cap check for bpf prog.
This patch adds sockopt_ns_capable() and sockopt_capable() for
the sk_setsockopt() to use. They will consider the
has_current_bpf_ctx() before doing the ns_capable() and capable() test.
They are exported with EXPORT_SYMBOL for the ipv6 module to use in a later patch.
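A minimal sketch of these helpers, following the description above:

bool sockopt_ns_capable(struct user_namespace *ns, int cap)
{
	return has_current_bpf_ctx() || ns_capable(ns, cap);
}
EXPORT_SYMBOL(sockopt_ns_capable);

bool sockopt_capable(int cap)
{
	return has_current_bpf_ctx() || capable(cap);
}
EXPORT_SYMBOL(sockopt_capable);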
Suggested-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061723.4175820-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin KaFai Lau [Wed, 17 Aug 2022 06:17:17 +0000 (23:17 -0700)]
bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf
Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
sk_setsockopt(). The number of supported optnames is
ever increasing, and so is the duplicated code.
One issue in reusing sk_setsockopt() is that the bpf prog
has already acquired the sk lock. This patch adds a
has_current_bpf_ctx() to tell if the sk_setsockopt() is called from
a bpf prog. The bpf prog calling bpf_setsockopt() is either running
in_task() or in_serving_softirq(). Both cases have the current->bpf_ctx
initialized. Thus, the has_current_bpf_ctx() only needs to
test !!current->bpf_ctx.
This patch also adds sockopt_{lock,release}_sock() helpers
for sk_setsockopt() to use. These helpers will test
has_current_bpf_ctx() before acquiring/releasing the lock. They are
exported with EXPORT_SYMBOL for the ipv6 module to use in a later patch.
Note on the change in sock_setbindtodevice(). sockopt_lock_sock()
is done in sock_setbindtodevice() instead of doing the lock_sock
in sock_bindtoindex(..., lock_sk = true).
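Sketched out, the pieces described above amount to:

static inline bool has_current_bpf_ctx(void)
{
	/* Set for bpf progs running in_task() or in_serving_softirq(). */
	return !!current->bpf_ctx;
}

void sockopt_lock_sock(struct sock *sk)
{
	/* A bpf prog calling bpf_setsockopt() already holds the sk lock. */
	if (has_current_bpf_ctx())
		return;
	lock_sock(sk);
}

void sockopt_release_sock(struct sock *sk)
{
	if (has_current_bpf_ctx())
		return;
	release_sock(sk);
}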
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061717.4175589-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martin KaFai Lau [Wed, 17 Aug 2022 06:17:11 +0000 (23:17 -0700)]
net: Add sk_setsockopt() to take the sk ptr instead of the sock ptr
A later patch refactors bpf_setsockopt(SOL_SOCKET) with the
sock_setsockopt() to avoid code duplication and code
drift between the two duplicates.
The current sock_setsockopt() takes sock ptr as the argument.
The very first thing this function does is get back the sk ptr
by 'sk = sock->sk'.
bpf_setsockopt() could be called when the sk does not have
the sock ptr created. Meaning sk->sk_socket is NULL. For example,
when a passive tcp connection has just been established but has not
yet been accept()-ed. Thus, it cannot use sock_setsockopt(sk->sk_socket)
or else it will pass a NULL ptr.
This patch moves the entire sock_setsockopt implementation into the newly
added sk_setsockopt(). The new sk_setsockopt() takes a sk ptr
and immediately gets the sock ptr by 'sock = sk->sk_socket'.
The existing sock_setsockopt(sock) is changed to call
sk_setsockopt(sock->sk). All existing callers have both sock->sk
and sk->sk_socket pointer.
A later patch will make bpf_setsockopt(SOL_SOCKET) call
sk_setsockopt(sk) directly. The bpf_setsockopt(SOL_SOCKET) does
not use the optnames that require sk->sk_socket, so it will
be safe.
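The resulting split can be sketched as:

int sk_setsockopt(struct sock *sk, int level, int optname,
		  sockptr_t optval, unsigned int optlen)
{
	struct socket *sock = sk->sk_socket;	/* may be NULL for bpf callers */

	/* ... the former sock_setsockopt() body lives here ... */
	return 0;
}

int sock_setsockopt(struct socket *sock, int level, int optname,
		    sockptr_t optval, unsigned int optlen)
{
	return sk_setsockopt(sock->sk, level, optname, optval, optlen);
}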
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061711.4175048-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Maxime Chevallier [Wed, 17 Aug 2022 09:57:25 +0000 (11:57 +0200)]
net: ethernet: altera: Add use of ethtool_op_get_ts_info
Add the ethtool_op_get_ts_info() callback to ethtool ops, so that we can
at least use software timestamping.
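Wiring up the generic helper is a one-line addition to the driver's
ethtool ops (struct name abbreviated here for illustration):

static const struct ethtool_ops tse_ethtool_ops = {
	/* ... existing callbacks ... */
	.get_ts_info = ethtool_op_get_ts_info,	/* SW timestamping only */
};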
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://lore.kernel.org/r/20220817095725.97444-1-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wong Vee Khee [Wed, 17 Aug 2022 06:43:24 +0000 (14:43 +0800)]
stmmac: intel: remove unused 'has_crossts' flag
The 'has_crossts' flag was not used anywhere in the stmmac driver,
so remove it from both the header file and the dwmac-intel driver.
Signed-off-by: Wong Vee Khee <veekhee@apple.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Link: https://lore.kernel.org/r/20220817064324.10025-1-veekhee@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 18 Aug 2022 03:29:34 +0000 (20:29 -0700)]
Merge https://git./linux/kernel/git/bpf/bpf-next
Andrii Nakryiko says:
====================
bpf-next 2022-08-17
We've added 45 non-merge commits during the last 14 day(s) which contain
a total of 61 files changed, 986 insertions(+), 372 deletions(-).
The main changes are:
1) New bpf_ktime_get_tai_ns() BPF helper to access CLOCK_TAI, from Kurt
Kanzenbach and Jesper Dangaard Brouer.
2) Few clean ups and improvements for libbpf 1.0, from Andrii Nakryiko.
3) Expose crash_kexec() as kfunc for BPF programs, from Artem Savkov.
4) Add ability to define sleepable-only kfuncs, from Benjamin Tissoires.
5) Teach libbpf's bpf_prog_load() and bpf_map_create() to gracefully handle
unsupported names on old kernels, from Hangbin Liu.
6) Allow opting out from auto-attaching BPF programs by libbpf's BPF skeleton,
from Hao Luo.
7) Relax libbpf's requirement for shared libs to be marked executable, from
Hengqi Chen.
8) Improve bpf_iter internals handling of error returns, from Hao Luo.
9) Few accommodations in libbpf to support GCC-BPF quirks, from James Hilliard.
10) Fix BPF verifier logic around tracking dynptr ref_obj_id, from Joanne Koong.
11) bpftool improvements to handle full BPF program names better, from Manu
Bretelle.
12) bpftool fixes around libcap use, from Quentin Monnet.
13) BPF map internals clean ups and improvements around memory allocations,
from Yafang Shao.
14) Allow to use cgroup_get_from_file() on cgroupv1, allowing BPF cgroup
iterator to work on cgroupv1, from Yosry Ahmed.
15) BPF verifier internal clean ups, from Dave Marchevsky and Joanne Koong.
16) Various fixes and clean ups for selftests/bpf and vmtest.sh, from Daniel
Xu, Artem Savkov, Joanne Koong, Andrii Nakryiko, Shibin Koikkara Reeny.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits)
selftests/bpf: Few fixes for selftests/bpf built in release mode
libbpf: Clean up deprecated and legacy aliases
libbpf: Streamline bpf_attr and perf_event_attr initialization
libbpf: Fix potential NULL dereference when parsing ELF
selftests/bpf: Tests libbpf autoattach APIs
libbpf: Allows disabling auto attach
selftests/bpf: Fix attach point for non-x86 arches in test_progs/lsm
libbpf: Making bpf_prog_load() ignore name if kernel doesn't support
selftests/bpf: Update CI kconfig
selftests/bpf: Add connmark read test
selftests/bpf: Add existing connection bpf_*_ct_lookup() test
bpftool: Clear errno after libcap's checks
bpf: Clear up confusion in bpf_skb_adjust_room()'s documentation
bpftool: Fix a typo in a comment
libbpf: Add names for auxiliary maps
bpf: Use bpf_map_area_alloc consistently on bpf map creation
bpf: Make __GFP_NOWARN consistent in bpf map creation
bpf: Use bpf_map_area_free instead of kvfree
bpf: Remove unneeded memset in queue_stack_map creation
libbpf: preserve errno across pr_warn/pr_info/pr_debug
...
====================
Link: https://lore.kernel.org/r/20220817215656.1180215-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andrii Nakryiko [Tue, 16 Aug 2022 00:19:29 +0000 (17:19 -0700)]
selftests/bpf: Few fixes for selftests/bpf built in release mode
Fix a few issues found when building and running test_progs in
release mode.
First, a potentially uninitialized idx variable in xskxceiver is
force-initialized to zero to satisfy the compiler.
A few static uprobe trigger functions break in release mode
unless marked as noinline. Add noinline to make
sure everything works, as sketched below.
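The pattern in question, sketched (function name illustrative): without
noinline, a static trigger can be inlined away under -O2, leaving the
uprobe no address to attach to.

static __attribute__((noinline)) void uprobe_trigger_func(void)
{
	asm volatile ("");	/* keep the body from being optimized out */
}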
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/bpf/20220816001929.369487-5-andrii@kernel.org
Andrii Nakryiko [Tue, 16 Aug 2022 00:19:28 +0000 (17:19 -0700)]
libbpf: Clean up deprecated and legacy aliases
Remove three missed deprecated APIs that were aliased to new APIs:
bpf_object__unload, bpf_prog_attach_xattr and btf__load.
Also move legacy API libbpf_find_kernel_btf (aliased to
btf__load_vmlinux_btf) into libbpf_legacy.h.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/bpf/20220816001929.369487-4-andrii@kernel.org
Andrii Nakryiko [Tue, 16 Aug 2022 00:19:27 +0000 (17:19 -0700)]
libbpf: Streamline bpf_attr and perf_event_attr initialization
Make sure that the entire libbpf code base initializes bpf_attr and
perf_event_attr with memset(0). Also, for bpf_attr, make sure we
clear and pass to the kernel only the relevant parts. bpf_attr is
a huge union of independent sub-command attributes, so there is no need
to clear and pass the entire union, which grows quite a lot over time
even though, for most commands, this growth is completely irrelevant.
A few cases where we were relying on compiler initialization of BPF UAPI
structs (like bpf_prog_info, bpf_map_info, etc.) with `= {};` were
switched to the memset(0) pattern for future-proofing.
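The resulting pattern, sketched for BPF_MAP_CREATE (sys_bpf_fd() stands
in for libbpf's internal syscall wrapper):

	const size_t attr_sz = offsetofend(union bpf_attr, map_extra);
	union bpf_attr attr;
	int fd;

	/* Clear only this sub-command's slice of the union and pass
	 * exactly that size to the kernel, not sizeof(attr). */
	memset(&attr, 0, attr_sz);
	attr.map_type = BPF_MAP_TYPE_ARRAY;
	attr.key_size = 4;
	attr.value_size = 8;
	attr.max_entries = 1;

	fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, attr_sz);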
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/bpf/20220816001929.369487-3-andrii@kernel.org
Andrii Nakryiko [Tue, 16 Aug 2022 00:19:26 +0000 (17:19 -0700)]
libbpf: Fix potential NULL dereference when parsing ELF
Fix the if condition filtering empty ELF sections, to prevent a NULL
dereference.
Fixes:
47ea7417b074 ("libbpf: Skip empty sections in bpf_object__init_global_data_maps")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/bpf/20220816001929.369487-2-andrii@kernel.org
Jakub Kicinski [Wed, 17 Aug 2022 17:55:06 +0000 (10:55 -0700)]
Merge branch 'net-dsa-bcm_sf2-utilize-phylink-for-all-ports'
Florian Fainelli says:
====================
net: dsa: bcm_sf2: Utilize PHYLINK for all ports
This patch series has the bcm_sf2 driver utilize PHYLINK to configure
the CPU port link parameters to unify the configuration and pave the way
for DSA to utilize PHYLINK for all ports in the future.
Tested on BCM7445 and BCM7278
====================
Link: https://lore.kernel.org/r/20220815175009.2681932-1-f.fainelli@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Florian Fainelli [Mon, 15 Aug 2022 17:50:09 +0000 (10:50 -0700)]
net: dsa: bcm_sf2: Have PHYLINK configure CPU/IMP port(s)
Remove the artificial limitations imposed upon
bcm_sf2_sw_mac_link_{up,down} and allow us to override the link
parameters for IMP port(s) as well as regular ports by accounting for
the special differences that exist there.
Remove the code that did override the link parameters in
bcm_sf2_imp_setup().
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Florian Fainelli [Mon, 15 Aug 2022 17:50:08 +0000 (10:50 -0700)]
net: dsa: bcm_sf2: Introduce helper for port override offset
Depending upon the generation of switches, we have different offsets for
configuring a given port's status override where link parameters are
applied. Introduce a helper function that we re-use throughout the code
in order to let phylink callbacks configure the IMP/CPU port(s) in
subsequent changes.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Beniamin Sandu [Sat, 13 Aug 2022 20:46:58 +0000 (23:46 +0300)]
net: sfp: use simplified HWMON_CHANNEL_INFO macro
This makes the code look cleaner and easier to read.
Signed-off-by: Beniamin Sandu <beniaminsandu@gmail.com>
Link: https://lore.kernel.org/r/20220813204658.848372-1-beniaminsandu@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hao Luo [Tue, 16 Aug 2022 23:40:12 +0000 (16:40 -0700)]
selftests/bpf: Tests libbpf autoattach APIs
Adds test for libbpf APIs that toggle bpf program auto-attaching.
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220816234012.910255-2-haoluo@google.com
Hao Luo [Tue, 16 Aug 2022 23:40:11 +0000 (16:40 -0700)]
libbpf: Allows disabling auto attach
Adds libbpf APIs for disabling auto-attach for individual functions.
This is motivated by the use case of cgroup iter [1]. Some iter
types require their parameters to be non-zero, therefore applying
auto-attach on them will fail. With these two new APIs, users who
want to use auto-attach and these types of iters can disable
auto-attach on the program and perform manual attach.
[1] https://lore.kernel.org/bpf/CAEf4BzZ+a2uDo_t6kGBziqdz--m2gh2_EUwkGLDtMd65uwxUjA@mail.gmail.com/
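Usage can be sketched as follows (the skeleton and program names are
hypothetical):

	/* Opt this program out of skeleton-wide auto-attach. */
	bpf_program__set_autoattach(skel->progs.my_cgroup_iter, false);

	err = my_skel__attach(skel);	/* attaches everything else */
	if (err)
		goto cleanup;

	/* Then attach the iter manually with its required parameters. */
	link = bpf_program__attach_iter(skel->progs.my_cgroup_iter, &opts);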
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220816234012.910255-1-haoluo@google.com
David S. Miller [Wed, 17 Aug 2022 10:53:53 +0000 (11:53 +0100)]
Merge branch 'wwan-t7xx-fw-flashing-and-coredump-support'
M Chetan Kumar says:
====================
net: wwan: t7xx: fw flashing & coredump support
This patch series brings in support for FM350 wwan device firmware
flashing & coredump collection using the devlink interface.
Below is the high level description of individual patches.
Refer to individual patch commit message for details.
PATCH1: Enables AP CLDMA communication for firmware flashing &
coredump collection.
PATCH2: Enables the infrastructure & queue configuration required
for early port enumeration.
PATCH3: Implements device reset and rescan logic required to enter
or exit fastboot mode.
PATCH4: Implements devlink interface & uses the fastboot protocol for
fw flashing and coredump collection.
PATCH5: t7xx devlink commands documentation.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
M Chetan Kumar [Tue, 16 Aug 2022 04:24:17 +0000 (09:54 +0530)]
net: wwan: t7xx: Devlink documentation
Document the t7xx devlink command usage for fw flashing &
coredump collection.
Refer to t7xx.rst file for details.
Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com>
Signed-off-by: Devegowda Chandrashekar <chandrashekar.devegowda@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
M Chetan Kumar [Tue, 16 Aug 2022 04:24:05 +0000 (09:54 +0530)]
net: wwan: t7xx: Enable devlink based fw flashing and coredump collection
This patch brings in support for t7xx wwan device firmware flashing &
coredump collection using devlink.
The driver registers with the devlink framework, implements the devlink
ops flash_update callback that programs modem firmware, and creates the
region & snapshot required for device coredump log collection.
On early detection of the wwan device in fastboot mode, the driver sets
up CLDMA0 HW tx/rx queues for raw data transfer, then registers with the
devlink framework.
Upon receiving the firmware image & partition details, the driver sends
fastboot commands to flash the firmware.
In this flow, fastboot commands & responses are exchanged between the
driver and the device. Once firmware flashing succeeds, the completion
status is reported to the user space application.
Below is the devlink command usage for firmware flashing
$devlink dev flash pci/$BDF file ABC.img component ABC
Note: ABC.img is the firmware to be programmed to "ABC" partition.
As for coredump collection: when the wwan device encounters an exception,
it reboots & stays in fastboot mode so the host driver can collect the
coredump. On detecting the exception state, the driver collects the core
dump, creates the devlink region & reports an event to the user space
application, which then invokes the devlink region read command to
retrieve the dump.
Below are the devlink commands used for coredump collection.
devlink region new pci/$BDF/mr_dump
devlink region read pci/$BDF/mr_dump snapshot $ID address $ADD length $LEN
devlink region del pci/$BDF/mr_dump snapshot $ID
Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com>
Signed-off-by: Devegowda Chandrashekar <chandrashekar.devegowda@intel.com>
Signed-off-by: Mishra Soumya Prakash <soumya.prakash.mishra@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Haijun Liu [Tue, 16 Aug 2022 04:23:53 +0000 (09:53 +0530)]
net: wwan: t7xx: PCIe reset rescan
The PCI rescan module implements a "rescan work queue". In the firmware
flashing or coredump collection procedure, the WWAN device is programmed
to boot in fastboot mode and a work item is scheduled for removal &
detection. The WWAN device is reset using an ACPI call as part of the
driver removal flow. The work queue rescans the pci bus at a fixed
interval for device detection and exits once the device is detected.
Signed-off-by: Haijun Liu <haijun.liu@mediatek.com>
Co-developed-by: Madhusmita Sahu <madhusmita.sahu@intel.com>
Signed-off-by: Madhusmita Sahu <madhusmita.sahu@intel.com>
Signed-off-by: Ricardo Martinez <ricardo.martinez@linux.intel.com>
Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com>
Signed-off-by: Devegowda Chandrashekar <chandrashekar.devegowda@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Haijun Liu [Tue, 16 Aug 2022 04:23:40 +0000 (09:53 +0530)]
net: wwan: t7xx: Infrastructure for early port configuration
To support cases such as FW update or Core dump, the t7xx device
is capable of signaling the host that a special port needs
to be created before the handshake phase.
This patch adds the infrastructure required to create the
early ports which also requires a different configuration of
CLDMA queues.
Signed-off-by: Haijun Liu <haijun.liu@mediatek.com>
Co-developed-by: Madhusmita Sahu <madhusmita.sahu@intel.com>
Signed-off-by: Madhusmita Sahu <madhusmita.sahu@intel.com>
Signed-off-by: Ricardo Martinez <ricardo.martinez@linux.intel.com>
Signed-off-by: Devegowda Chandrashekar <chandrashekar.devegowda@intel.com>
Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Haijun Liu [Tue, 16 Aug 2022 04:23:28 +0000 (09:53 +0530)]
net: wwan: t7xx: Add AP CLDMA
The t7xx device contains two Cross Layer DMA (CLDMA) interfaces to
communicate with AP and Modem processors respectively. So far only
MD-CLDMA was being used; this patch enables AP-CLDMA.
Rename small Application Processor (sAP) to AP.
Signed-off-by: Haijun Liu <haijun.liu@mediatek.com>
Co-developed-by: Madhusmita Sahu <madhusmita.sahu@intel.com>
Signed-off-by: Madhusmita Sahu <madhusmita.sahu@intel.com>
Signed-off-by: Moises Veleta <moises.veleta@linux.intel.com>
Signed-off-by: Devegowda Chandrashekar <chandrashekar.devegowda@intel.com>
Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Mon, 15 Aug 2022 19:07:47 +0000 (12:07 -0700)]
net: phy: broadcom: Implement suspend/resume for AC131 and BCM5241
Implement the suspend/resume procedure for the Broadcom AC131 and BCM5241 type
of PHYs (10/100 only) by entering the standard power down followed by the
proprietary standby mode in the auxiliary mode 4 shadow register. On resume,
the PHY software reset is enough to make it come out of standby mode so we can
utilize brcm_fet_config_init() as the resume hook.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 17 Aug 2022 09:20:45 +0000 (10:20 +0100)]
Merge branch '100GbE' of git://git./linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
ice: detect and report PTP timestamp issues
Jacob Keller says:
This series fixes a few small issues with the cached PTP Hardware Clock
timestamp used for timestamp extension. It also introduces extra checks to
help detect issues with this logic, such as if the cached timestamp is not
updated within the 2 second window.
This introduces a few statistics similar to the ones already available in
other Intel drivers, including tx_hwtstamp_skipped and tx_hwtstamp_timeouts.
It is intended to aid in debugging issues we're seeing with some setups
which might be related to incorrect cached timestamp values.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jie Meng [Mon, 15 Aug 2022 20:29:00 +0000 (13:29 -0700)]
tcp: Make SYN ACK RTO tunable by BPF programs with TFO
Instead of the hardcoded TCP_TIMEOUT_INIT, this diff calls tcp_timeout_init
to initialize req->timeout like the non-TFO SYN ACK case.
Tested using the following packetdrill script, on a host with a BPF
program that sets the initial connect timeout to 10ms.
`../../common/defaults.sh`
// Initialize connection
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_TCP, TCP_FASTOPEN, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 32792 <mss 1000,sackOK,FO TFO_COOKIE>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.01 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.02 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.04 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.01 < . 1:1(0) ack 1 win 32792
+0 accept(3, ..., ...) = 4
Signed-off-by: Jie Meng <jmeng@fb.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhengchao Shao [Mon, 15 Aug 2022 06:10:23 +0000 (14:10 +0800)]
net: sched: delete unused input parameter in qdisc_create
The input parameter p is unused in qdisc_create. Delete it.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20220815061023.51318-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stefan Wahren [Mon, 15 Aug 2022 08:06:26 +0000 (10:06 +0200)]
net: vertexcom: mse102x: Update email address
in-tech smart charging is now chargebyte. So update the email address
accordingly.
Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
Link: https://lore.kernel.org/r/20220815080626.9688-2-stefan.wahren@i2se.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stefan Wahren [Mon, 15 Aug 2022 08:06:25 +0000 (10:06 +0200)]
dt-bindings: vertexcom-mse102x: Update email address
in-tech smart charging is now chargebyte. So update the email address
accordingly.
Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20220815080626.9688-1-stefan.wahren@i2se.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Zhengchao Shao [Mon, 15 Aug 2022 03:04:17 +0000 (11:04 +0800)]
net: sched: remove the unused return value of unregister_qdisc
The return value of unregister_qdisc is unused; remove it.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20220815030417.271894-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Artem Savkov [Tue, 16 Aug 2022 05:52:31 +0000 (07:52 +0200)]
selftests/bpf: Fix attach point for non-x86 arches in test_progs/lsm
Use SYS_PREFIX macro from bpf_misc.h instead of hard-coded '__x64_'
prefix for sys_setdomainname attach point in lsm test.
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220816055231.717006-1-asavkov@redhat.com
Jacob Keller [Wed, 27 Jul 2022 23:16:02 +0000 (16:16 -0700)]
ice: introduce ice_ptp_reset_cached_phctime function
If the PTP hardware clock is adjusted, the ice driver must update the
cached PHC timestamp. This is required in order to perform timestamp
extension on the shorter timestamps captured by the PHY.
Currently, we simply call ice_ptp_update_cached_phctime in the settime and
adjtime callbacks. This has a few issues:
1) if ICE_CFG_BUSY is set because another thread is updating the Rx rings,
we will exit with an error. This is not checked, and the functions do
not re-schedule the update. This could leave the cached timestamp
incorrect until the next scheduled work item execution.
2) even if we did handle an update, any currently outstanding Tx timestamp
would be extended using the wrong cached PHC time. This would produce
incorrect results.
To fix these issues, introduce a new ice_ptp_reset_cached_phctime function.
This function calls the ice_ptp_update_cached_phctime, and discards
outstanding Tx timestamps.
If the ice_ptp_update_cached_phctime function fails because ICE_CFG_BUSY is
set, we log a warning and schedule the thread to execute soon. The update
function is modified so that it always updates the cached copy in the PF
regardless. This ensures we have the most up to date values possible and
minimizes the risk of a packet timestamp being extended with the wrong
value.
It would be nice if we could skip reporting Rx timestamps until the cached
values are up to date. However, we can't access the Rx rings while
ICE_CFG_BUSY is set because they are actively being updated by another
thread.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Wed, 27 Jul 2022 23:16:01 +0000 (16:16 -0700)]
ice: re-arrange some static functions in ice_ptp.c
A following change will make use of ice_ptp_flush_tx_tracker
earlier in the ice_ptp.c file. To make this work, move the Tx timestamp
tracking functions higher up in the file, and pull the
ice_ptp_update_cached_timestamp function below them. This should have no
functional change.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Wed, 27 Jul 2022 23:16:00 +0000 (16:16 -0700)]
ice: track and warn when PHC update is late
The ice driver requires a cached copy of the PHC time in order to perform
timestamp extension on Tx and Rx hardware timestamp values. This cached PHC
time must always be updated at least once every 2 seconds. Otherwise, the
math used to perform the extension would produce invalid results.
The updates are supposed to occur periodically in the PTP kthread work
item, which is scheduled to run every half second. Thus, we do not expect
an update to be delayed for so long. However, there are error conditions
which can cause the update to be delayed.
Track this situation by using jiffies to determine approximately how long
ago the last update occurred. Add a new statistic and a dev_warn when we
have failed to update the cached PHC time. This makes the error case more
obvious.
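A sketch of such a check (the field, constant, and stat names here are
assumptions for illustration, not the driver's actual identifiers):

	/* Warn if the cached PHC time was not refreshed within the
	 * 2 second window that timestamp extension relies on. */
	if (time_is_before_jiffies(pf->ptp.cached_phc_jiffies + 2 * HZ)) {
		pf->ptp.late_cached_phc_updates++;
		dev_warn(ice_pf_to_dev(pf),
			 "cached PHC time is stale, timestamps may be invalid\n");
	}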
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Wed, 27 Jul 2022 23:15:59 +0000 (16:15 -0700)]
ice: track Tx timestamp stats similar to other Intel drivers
Several Intel networking drivers which support PTP track when Tx timestamps
are skipped or when they timeout without a timestamp from hardware. The
conditions which could cause these events are rare, but it can be useful to
know when and how often they occur.
Implement similar statistics for the ice driver, tx_hwtstamp_skipped,
tx_hwtstamp_timeouts, and tx_hwtstamp_flushed.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Wed, 27 Jul 2022 23:15:58 +0000 (16:15 -0700)]
ice: initialize cached_phctime when creating Rx rings
When we create new Rx rings, the cached_phctime field is zero initialized.
This could result in incorrect timestamp reporting due to the cached value
not yet being updated. Although a background task will periodically update
the cached value, ensure it matches the existing cached value in the PF
structure at ring initialization.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Wed, 27 Jul 2022 23:15:57 +0000 (16:15 -0700)]
ice: set tx_tstamps when creating new Tx rings via ethtool
When the user changes the number of queues via ethtool, the driver
allocates new rings. This allocation did not initialize tx_tstamps. This
results in the tx_tstamps field being zero (due to kcalloc allocation), and
would result in a NULL pointer dereference when attempting a transmit
timestamp on the new ring.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Hangbin Liu [Sat, 13 Aug 2022 00:09:36 +0000 (08:09 +0800)]
libbpf: Making bpf_prog_load() ignore name if kernel doesn't support
Similar to commit
10b62d6a38f7 ("libbpf: Add names for auxiliary maps"),
let's make bpf_prog_load() also ignore the name if the kernel doesn't
support program names.
To achieve this, we need to call sys_bpf_prog_load() directly in
probe_kern_prog_name() to avoid a circular dependency. sys_bpf_prog_load()
also needs to be exported in the libbpf_internal.h file.
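The probe can be sketched roughly as follows (simplified; ptr_to_u64(),
libbpf_strlcpy() and probe_fd() are libbpf internals):

static int probe_kern_prog_name(void)
{
	const size_t attr_sz = offsetofend(union bpf_attr, prog_name);
	struct bpf_insn insns[] = {
		BPF_MOV64_IMM(BPF_REG_0, 0),
		BPF_EXIT_INSN(),
	};
	union bpf_attr attr;
	int ret;

	memset(&attr, 0, attr_sz);
	attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
	attr.license = ptr_to_u64("GPL");
	attr.insns = ptr_to_u64(insns);
	attr.insn_cnt = (__u32)ARRAY_SIZE(insns);
	libbpf_strlcpy(attr.prog_name, "libbpf_nametest",
		       sizeof(attr.prog_name));

	/* Kernels without BPF_PROG_LOAD name support reject this load. */
	ret = sys_bpf_prog_load(&attr, attr_sz, 1);
	return probe_fd(ret);
}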
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20220813000936.6464-1-liuhangbin@gmail.com