Andrii Nakryiko [Sun, 7 Nov 2021 16:55:13 +0000 (08:55 -0800)]
selftests/bpf: Pass sanitizer flags to linker through LDFLAGS
When adding -fsanitize=address to SAN_CFLAGS, it has to be passed both
to compiler through CFLAGS as well as linker through LDFLAGS. Add
SAN_CFLAGS into LDFLAGS to allow building selftests with ASAN.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Hengqi Chen <hengqi.chen@gmail.com>
Link: https://lore.kernel.org/bpf/20211107165521.9240-2-andrii@kernel.org
Alexei Starovoitov [Sun, 7 Nov 2021 16:34:24 +0000 (08:34 -0800)]
Merge branch 'libbpf: add unified bpf_prog_load() low-level API'
Andrii Nakryiko says:
====================
This patch set adds unified OPTS-based low-level bpf_prog_load() API for
loading BPF programs directly into kernel without utilizing libbpf's
bpf_object abstractions. This OPTS-based interface allows for future
extensions without breaking backwards or forward API and ABI compatibility.
Similar approach will be used for other low-level APIs that require extensive
sets of parameters, like BPF_MAP_CREATE command.
First half of the patch set adds libbpf API, cleans up internal usage of
to-be-deprecated APIs, etc. Second half cleans up and converts selftests away
from using deprecated APIs. See individual patches for more details.
v1->v2:
- dropped exposing sys_bpf() into public API (Alexei, Daniel);
- also dropped bpftool/cgroup.c fix for unistd.h include because it's not
necessary due to sys_bpf() staying as is.
Cc: Hengqi Chen <hengqi.chen@gmail.com>
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko [Wed, 3 Nov 2021 22:08:45 +0000 (15:08 -0700)]
selftests/bpf: Use explicit bpf_test_load_program() helper calls
Remove the second part of prog loading testing helper re-definition:
-Dbpf_load_program=bpf_test_load_program
This completes the clean up of deprecated libbpf program loading APIs.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/bpf/20211103220845.2676888-13-andrii@kernel.org
Andrii Nakryiko [Wed, 3 Nov 2021 22:08:44 +0000 (15:08 -0700)]
selftests/bpf: Use explicit bpf_prog_test_load() calls everywhere
-Dbpf_prog_load_deprecated=bpf_prog_test_load trick is both ugly and
breaks when deprecation goes into effect due to macro magic. Convert all
the uses to explicit bpf_prog_test_load() calls which avoid deprecation
errors and makes everything less magical.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/bpf/20211103220845.2676888-12-andrii@kernel.org
Andrii Nakryiko [Wed, 3 Nov 2021 22:08:43 +0000 (15:08 -0700)]
selftests/bpf: Merge test_stub.c into testing_helpers.c
Move testing prog and object load wrappers (bpf_prog_test_load and
bpf_test_load_program) into testing_helpers.{c,h} and get rid of
otherwise useless test_stub.c. Make testing_helpers.c available to
non-test_progs binaries as well.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/bpf/20211103220845.2676888-11-andrii@kernel.org
Andrii Nakryiko [Wed, 3 Nov 2021 22:08:42 +0000 (15:08 -0700)]
selftests/bpf: Convert legacy prog load APIs to bpf_prog_load()
Convert all the uses of legacy low-level BPF program loading APIs
(mostly bpf_load_program_xattr(), but also some bpf_verify_program()) to
bpf_prog_load() uses.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211103220845.2676888-10-andrii@kernel.org
Andrii Nakryiko [Wed, 3 Nov 2021 22:08:41 +0000 (15:08 -0700)]
selftests/bpf: Fix non-strict SEC() program sections
Fix few more SEC() definitions that were previously missed.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/bpf/20211103220845.2676888-9-andrii@kernel.org
Andrii Nakryiko [Wed, 3 Nov 2021 22:08:40 +0000 (15:08 -0700)]
libbpf: Remove deprecation attribute from struct bpf_prog_prep_result
This deprecation annotation has no effect because for struct deprecation
attribute has to be declared after struct definition. But instead of
moving it to the end of struct definition, remove it. When deprecation
will go in effect at libbpf v0.7, this deprecation attribute will cause
libbpf's own source code compilation to trigger deprecation warnings,
which is unavoidable because libbpf still has to support that API.
So keep deprecation of APIs, but don't mark structs used in API as
deprecated.
Fixes:
e21d585cb3db ("libbpf: Deprecate multi-instance bpf_program APIs")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/bpf/20211103220845.2676888-8-andrii@kernel.org
Andrii Nakryiko [Wed, 3 Nov 2021 22:08:39 +0000 (15:08 -0700)]
bpftool: Stop using deprecated bpf_load_program()
Switch to bpf_prog_load() instead.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211103220845.2676888-7-andrii@kernel.org
Andrii Nakryiko [Wed, 3 Nov 2021 22:08:38 +0000 (15:08 -0700)]
libbpf: Stop using to-be-deprecated APIs
Remove all the internal uses of libbpf APIs that are slated to be
deprecated in v0.7.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211103220845.2676888-6-andrii@kernel.org
Andrii Nakryiko [Wed, 3 Nov 2021 22:08:37 +0000 (15:08 -0700)]
libbpf: Remove internal use of deprecated bpf_prog_load() variants
Remove all the internal uses of bpf_load_program_xattr(), which is
slated for deprecation in v0.7.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211103220845.2676888-5-andrii@kernel.org
Andrii Nakryiko [Wed, 3 Nov 2021 22:08:36 +0000 (15:08 -0700)]
libbpf: Unify low-level BPF_PROG_LOAD APIs into bpf_prog_load()
Add a new unified OPTS-based low-level API for program loading,
bpf_prog_load() ([0]). bpf_prog_load() accepts few "mandatory"
parameters as input arguments (program type, name, license,
instructions) and all the other optional (as in not required to specify
for all types of BPF programs) fields into struct bpf_prog_load_opts.
This makes all the other non-extensible APIs variant for BPF_PROG_LOAD
obsolete and they are slated for deprecation in libbpf v0.7:
- bpf_load_program();
- bpf_load_program_xattr();
- bpf_verify_program().
Implementation-wise, internal helper libbpf__bpf_prog_load is refactored
to become a public bpf_prog_load() API. struct bpf_prog_load_params used
internally is replaced by public struct bpf_prog_load_opts.
Unfortunately, while conceptually all this is pretty straightforward,
the biggest complication comes from the already existing bpf_prog_load()
*high-level* API, which has nothing to do with BPF_PROG_LOAD command.
We try really hard to have a new API named bpf_prog_load(), though,
because it maps naturally to BPF_PROG_LOAD command.
For that, we rename old bpf_prog_load() into bpf_prog_load_deprecated()
and mark it as COMPAT_VERSION() for shared library users compiled
against old version of libbpf. Statically linked users and shared lib
users compiled against new version of libbpf headers will get "rerouted"
to bpf_prog_deprecated() through a macro helper that decides whether to
use new or old bpf_prog_load() based on number of input arguments (see
___libbpf_overload in libbpf_common.h).
To test that existing
bpf_prog_load()-using code compiles and works as expected, I've compiled
and ran selftests as is. I had to remove (locally) selftest/bpf/Makefile
-Dbpf_prog_load=bpf_prog_test_load hack because it was conflicting with
the macro-based overload approach. I don't expect anyone else to do
something like this in practice, though. This is testing-specific way to
replace bpf_prog_load() calls with special testing variant of it, which
adds extra prog_flags value. After testing I kept this selftests hack,
but ensured that we use a new bpf_prog_load_deprecated name for this.
This patch also marks bpf_prog_load() and bpf_prog_load_xattr() as deprecated.
bpf_object interface has to be used for working with struct bpf_program.
Libbpf doesn't support loading just a bpf_program.
The silver lining is that when we get to libbpf 1.0 all these
complication will be gone and we'll have one clean bpf_prog_load()
low-level API with no backwards compatibility hackery surrounding it.
[0] Closes: https://github.com/libbpf/libbpf/issues/284
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211103220845.2676888-4-andrii@kernel.org
Andrii Nakryiko [Wed, 3 Nov 2021 22:08:35 +0000 (15:08 -0700)]
libbpf: Pass number of prog load attempts explicitly
Allow to control number of BPF_PROG_LOAD attempts from outside the
sys_bpf_prog_load() helper.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/bpf/20211103220845.2676888-3-andrii@kernel.org
Andrii Nakryiko [Wed, 3 Nov 2021 22:08:34 +0000 (15:08 -0700)]
libbpf: Rename DECLARE_LIBBPF_OPTS into LIBBPF_OPTS
It's confusing that libbpf-provided helper macro doesn't start with
LIBBPF. Also "declare" vs "define" is confusing terminology, I can never
remember and always have to look up previous examples.
Bypass both issues by renaming DECLARE_LIBBPF_OPTS into a short and
clean LIBBPF_OPTS. To avoid breaking existing code, provide:
#define DECLARE_LIBBPF_OPTS LIBBPF_OPTS
in libbpf_legacy.h. We can decide later if we ever want to remove it or
we'll keep it forever because it doesn't add any maintainability burden.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/bpf/20211103220845.2676888-2-andrii@kernel.org
Andrii Nakryiko [Fri, 5 Nov 2021 19:10:55 +0000 (12:10 -0700)]
libbpf: Fix non-C89 loop variable declaration in gen_loader.c
Fix the `int i` declaration inside the for statement. This is non-C89
compliant. See [0] for user report breaking BCC build.
[0] https://github.com/libbpf/libbpf/issues/403
Fixes:
18f4fccbf314 ("libbpf: Update gen_loader to emit BTF_KIND_FUNC relocations")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20211105191055.3324874-1-andrii@kernel.org
Andrii Nakryiko [Wed, 3 Nov 2021 05:14:49 +0000 (22:14 -0700)]
libbpf: Deprecate bpf_program__load() API
Mark bpf_program__load() as deprecated ([0]) since v0.6. Also rename few
internal program loading bpf_object helper functions to have more
consistent naming.
[0] Closes: https://github.com/libbpf/libbpf/issues/301
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211103051449.1884903-1-andrii@kernel.org
Alexei Starovoitov [Wed, 3 Nov 2021 20:25:37 +0000 (13:25 -0700)]
Merge branch 'libbpf ELF sanity checking improvements'
Andrii Nakryiko says:
====================
Few patches fixing various issues discovered by oss-fuzz project fuzzing
bpf_object__open() call. Fixes are mostly focused around additional simple
sanity checks of ELF format: symbols, relos, etc.
v1->v2:
- address Yonghong's feedback.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko [Wed, 3 Nov 2021 17:32:13 +0000 (10:32 -0700)]
libbpf: Improve ELF relo sanitization
Add few sanity checks for relocations to prevent div-by-zero and
out-of-bounds array accesses in libbpf.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211103173213.1376990-6-andrii@kernel.org
Andrii Nakryiko [Wed, 3 Nov 2021 17:32:12 +0000 (10:32 -0700)]
libbpf: Fix section counting logic
e_shnum does include section #0 and as such is exactly the number of ELF
sections that we need to allocate memory for to use section indices as
array indices. Fix the off-by-one error.
This is purely accounting fix, previously we were overallocating one
too many array items. But no correctness errors otherwise.
Fixes:
25bbbd7a444b ("libbpf: Remove assumptions about uniqueness of .rodata/.data/.bss maps")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211103173213.1376990-5-andrii@kernel.org
Andrii Nakryiko [Wed, 3 Nov 2021 17:32:11 +0000 (10:32 -0700)]
libbpf: Validate that .BTF and .BTF.ext sections contain data
.BTF and .BTF.ext ELF sections should have SHT_PROGBITS type and contain
data. If they are not, ELF is invalid or corrupted, so bail out.
Otherwise this can lead to data->d_buf being NULL and SIGSEGV later on.
Reported by oss-fuzz project.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211103173213.1376990-4-andrii@kernel.org
Andrii Nakryiko [Wed, 3 Nov 2021 17:32:10 +0000 (10:32 -0700)]
libbpf: Improve sanity checking during BTF fix up
If BTF is corrupted DATASEC's variable type ID might be incorrect.
Prevent this easy to detect situation with extra NULL check.
Reported by oss-fuzz project.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211103173213.1376990-3-andrii@kernel.org
Andrii Nakryiko [Wed, 3 Nov 2021 17:32:09 +0000 (10:32 -0700)]
libbpf: Detect corrupted ELF symbols section
Prevent divide-by-zero if ELF is corrupted and has zero sh_entsize.
Reported by oss-fuzz project.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211103173213.1376990-2-andrii@kernel.org
Andrii Nakryiko [Wed, 3 Nov 2021 18:21:42 +0000 (11:21 -0700)]
Merge branch 'libbpf: deprecate bpf_program__get_prog_info_linear'
Dave Marchevsky says:
====================
bpf_program__get_prog_info_linear is a helper which wraps the
bpf_obj_get_info_by_fd BPF syscall with some niceties that put
all dynamic-length bpf_prog_info in one buffer contiguous with struct
bpf_prog_info, and simplify the selection of which dynamic data to grab.
The resultant combined struct, bpf_prog_info_linear, is persisted to
file by 'perf' to enable later annotation of BPF prog data. libbpf
includes some vaddr <-> offset conversion helpers for
struct bpf_prog_info_linear to simplify this.
This functionality is heavily tailored to perf's usecase, so its use as
a general prog info API should be deemphasized in favor of just calling
bpf_obj_get_info_by_fd, which can be more easily fit to purpose. Some
examples from caller migrations in this series:
* Some callers weren't requesting or using dynamic-sized prog info and
are well served by a simple get_info_by_fd call (e.g.
dump_prog_id_as_func_ptr in bpftool)
* Some callers were requesting all of a specific dynamic info type but
only using the first record, so can avoid unnecessary malloc by
only requesting 1 (e.g. profile_target_name in bpftool)
* bpftool's do_dump saves some malloc/free by growing and reusing its
dynamic prog_info buf as it loops over progs to grab info and dump.
Perf does need the full functionality of
bpf_program__get_prog_info_linear and its accompanying structs +
helpers, so copy the code to its codebase, migrate all other uses in the
tree, and deprecate the helper in libbpf.
Since the deprecated symbols continue to be included in perf some
renaming was necessary in perf's copy, otherwise functionality is
unchanged.
This work was previously discussed in libbpf's issue tracker [0].
[0]: https://github.com/libbpf/libbpf/issues/313
v2->v3:
* Remove v2's patch 1 ("libbpf: Migrate internal use of
bpf_program__get_prog_info_linear"), which was applied [Andrii]
* Add new patch 1 migrating error checking of libbpf calls to
new scheme [Andrii, Quentin]
* In patch 2, fix != -1 error check of libbpf call, improper realloc
handling, and get rid of confusing macros [Andrii]
* In patch 4, deprecate starting from 0.6 instead of 0.7 [Andrii]
v1->v2: fix bpftool do_dump changes to clear bpf_prog_info after use and
correctly pass realloc'd ptr back (patch 2)
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Dave Marchevsky [Mon, 1 Nov 2021 22:43:57 +0000 (15:43 -0700)]
libbpf: Deprecate bpf_program__get_prog_info_linear
As part of the road to libbpf 1.0, and discussed in libbpf issue tracker
[0], bpf_program__get_prog_info_linear and its associated structs and
helper functions should be deprecated. The functionality is too specific
to the needs of 'perf', and there's little/no out-of-tree usage to
preclude introduction of a more general helper in the future.
[0] Closes: https://github.com/libbpf/libbpf/issues/313
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211101224357.2651181-5-davemarchevsky@fb.com
Dave Marchevsky [Mon, 1 Nov 2021 22:43:56 +0000 (15:43 -0700)]
perf: Pull in bpf_program__get_prog_info_linear
To prepare for impending deprecation of libbpf's
bpf_program__get_prog_info_linear, pull in the function and associated
helpers into the perf codebase and migrate existing uses to the perf
copy.
Since libbpf's deprecated definitions will still be visible to perf, it
is necessary to rename perf's definitions.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20211101224357.2651181-4-davemarchevsky@fb.com
Dave Marchevsky [Mon, 1 Nov 2021 22:43:55 +0000 (15:43 -0700)]
bpftool: Use bpf_obj_get_info_by_fd directly
To prepare for impending deprecation of libbpf's
bpf_program__get_prog_info_linear, migrate uses of this function to use
bpf_obj_get_info_by_fd.
Since the profile_target_name and dump_prog_id_as_func_ptr helpers were
only looking at the first func_info, avoid grabbing the rest to save a
malloc. For do_dump, add a more full-featured helper, but avoid
free/realloc of buffer when possible for multi-prog dumps.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20211101224357.2651181-3-davemarchevsky@fb.com
Dave Marchevsky [Mon, 1 Nov 2021 22:43:54 +0000 (15:43 -0700)]
bpftool: Migrate -1 err checks of libbpf fn calls
Per [0], callers of libbpf functions with LIBBPF_STRICT_DIRECT_ERRS set
should handle negative error codes of various values (e.g. -EINVAL).
Migrate two callsites which were explicitly checking for -1 only to
handle the new scheme.
[0]: https://github.com/libbpf/libbpf/wiki/Libbpf-1.0-migration-guide#direct-error-code-returning-libbpf_strict_direct_errs
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20211101224357.2651181-2-davemarchevsky@fb.com
Linus Torvalds [Tue, 2 Nov 2021 14:56:47 +0000 (07:56 -0700)]
Merge tag 'x86_core_for_v5.16_rc1' of git://git./linux/kernel/git/tip/tip
Pull x86 core updates from Borislav Petkov:
- Do not #GP on userspace use of CLI/STI but pretend it was a NOP to
keep old userspace from breaking. Adjust the corresponding iopl
selftest to that.
- Improve stack overflow warnings to say which stack got overflowed and
raise the exception stack sizes to 2 pages since overflowing the
single page of exception stack is very easy to do nowadays with all
the tracing machinery enabled. With that, rip out the custom mapping
of AMD SEV's too.
- A bunch of changes in preparation for FGKASLR like supporting more
than 64K section headers in the relocs tool, correct ORC lookup table
size to cover the whole kernel .text and other adjustments.
* tag 'x86_core_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
selftests/x86/iopl: Adjust to the faked iopl CLI/STI usage
vmlinux.lds.h: Have ORC lookup cover entire _etext - _stext
x86/boot/compressed: Avoid duplicate malloc() implementations
x86/boot: Allow a "silent" kaslr random byte fetch
x86/tools/relocs: Support >64K section headers
x86/sev: Make the #VC exception stacks part of the default stacks storage
x86: Increase exception stack sizes
x86/mm/64: Improve stack overflow warnings
x86/iopl: Fake iopl(3) CLI/STI usage
Linus Torvalds [Tue, 2 Nov 2021 13:20:58 +0000 (06:20 -0700)]
Merge tag 'net-next-for-5.16' of git://git./linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- Remove socket skb caches
- Add a SO_RESERVE_MEM socket op to forward allocate buffer space and
avoid memory accounting overhead on each message sent
- Introduce managed neighbor entries - added by control plane and
resolved by the kernel for use in acceleration paths (BPF / XDP
right now, HW offload users will benefit as well)
- Make neighbor eviction on link down controllable by userspace to
work around WiFi networks with bad roaming implementations
- vrf: Rework interaction with netfilter/conntrack
- fq_codel: implement L4S style ce_threshold_ect1 marking
- sch: Eliminate unnecessary RCU waits in mini_qdisc_pair_swap()
BPF:
- Add support for new btf kind BTF_KIND_TAG, arbitrary type tagging
as implemented in LLVM14
- Introduce bpf_get_branch_snapshot() to capture Last Branch Records
- Implement variadic trace_printk helper
- Add a new Bloomfilter map type
- Track <8-byte scalar spill and refill
- Access hw timestamp through BPF's __sk_buff
- Disallow unprivileged BPF by default
- Document BPF licensing
Netfilter:
- Introduce egress hook for looking at raw outgoing packets
- Allow matching on and modifying inner headers / payload data
- Add NFT_META_IFTYPE to match on the interface type either from
ingress or egress
Protocols:
- Multi-Path TCP:
- increase default max additional subflows to 2
- rework forward memory allocation
- add getsockopts: MPTCP_INFO, MPTCP_TCPINFO, MPTCP_SUBFLOW_ADDRS
- MCTP flow support allowing lower layer drivers to configure msg
muxing as needed
- Automatic Multicast Tunneling (AMT) driver based on RFC7450
- HSR support the redbox supervision frames (IEC-62439-3:2018)
- Support for the ip6ip6 encapsulation of IOAM
- Netlink interface for CAN-FD's Transmitter Delay Compensation
- Support SMC-Rv2 eliminating the current same-subnet restriction, by
exploiting the UDP encapsulation feature of RoCE adapters
- TLS: add SM4 GCM/CCM crypto support
- Bluetooth: initial support for link quality and audio/codec offload
Driver APIs:
- Add a batched interface for RX buffer allocation in AF_XDP buffer
pool
- ethtool: Add ability to control transceiver modules' power mode
- phy: Introduce supported interfaces bitmap to express MAC
capabilities and simplify PHY code
- Drop rtnl_lock from DSA .port_fdb_{add,del} callbacks
New drivers:
- WiFi driver for Realtek 8852AE 802.11ax devices (rtw89)
- Ethernet driver for ASIX AX88796C SPI device (x88796c)
Drivers:
- Broadcom PHYs
- support 72165, 7712 16nm PHYs
- support IDDQ-SR for additional power savings
- PHY support for QCA8081, QCA9561 PHYs
- NXP DPAA2: support for IRQ coalescing
- NXP Ethernet (enetc): support for software TCP segmentation
- Renesas Ethernet (ravb) - support DMAC and EMAC blocks of
Gigabit-capable IP found on RZ/G2L SoC
- Intel 100G Ethernet
- support for eswitch offload of TC/OvS flow API, including
offload of GRE, VxLAN, Geneve tunneling
- support application device queues - ability to assign Rx and Tx
queues to application threads
- PTP and PPS (pulse-per-second) extensions
- Broadcom Ethernet (bnxt)
- devlink health reporting and device reload extensions
- Mellanox Ethernet (mlx5)
- offload macvlan interfaces
- support HW offload of TC rules involving OVS internal ports
- support HW-GRO and header/data split
- support application device queues
- Marvell OcteonTx2:
- add XDP support for PF
- add PTP support for VF
- Qualcomm Ethernet switch (qca8k): support for QCA8328
- Realtek Ethernet DSA switch (rtl8366rb)
- support bridge offload
- support STP, fast aging, disabling address learning
- support for Realtek RTL8365MB-VC, a 4+1 port 10M/100M/1GE switch
- Mellanox Ethernet/IB switch (mlxsw)
- multi-level qdisc hierarchy offload (e.g. RED, prio and shaping)
- offload root TBF qdisc as port shaper
- support multiple routing interface MAC address prefixes
- support for IP-in-IP with IPv6 underlay
- MediaTek WiFi (mt76)
- mt7921 - ASPM, 6GHz, SDIO and testmode support
- mt7915 - LED and TWT support
- Qualcomm WiFi (ath11k)
- include channel rx and tx time in survey dump statistics
- support for 80P80 and 160 MHz bandwidths
- support channel 2 in 6 GHz band
- spectral scan support for QCN9074
- support for rx decapsulation offload (data frames in 802.3
format)
- Qualcomm phone SoC WiFi (wcn36xx)
- enable Idle Mode Power Save (IMPS) to reduce power consumption
during idle
- Bluetooth driver support for MediaTek MT7922 and MT7921
- Enable support for AOSP Bluetooth extension in Qualcomm WCN399x and
Realtek 8822C/8852A
- Microsoft vNIC driver (mana)
- support hibernation and kexec
- Google vNIC driver (gve)
- support for jumbo frames
- implement Rx page reuse
Refactor:
- Make all writes to netdev->dev_addr go thru helpers, so that we can
add this address to the address rbtree and handle the updates
- Various TCP cleanups and optimizations including improvements to
CPU cache use
- Simplify the gnet_stats, Qdisc stats' handling and remove
qdisc->running sequence counter
- Driver changes and API updates to address devlink locking
deficiencies"
* tag 'net-next-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2122 commits)
Revert "net: avoid double accounting for pure zerocopy skbs"
selftests: net: add arp_ndisc_evict_nocarrier
net: ndisc: introduce ndisc_evict_nocarrier sysctl parameter
net: arp: introduce arp_evict_nocarrier sysctl parameter
libbpf: Deprecate AF_XDP support
kbuild: Unify options for BTF generation for vmlinux and modules
selftests/bpf: Add a testcase for 64-bit bounds propagation issue.
bpf: Fix propagation of signed bounds from 64-bit min/max into 32-bit.
bpf: Fix propagation of bounds from 64-bit min/max into 32-bit and var_off.
net: vmxnet3: remove multiple false checks in vmxnet3_ethtool.c
net: avoid double accounting for pure zerocopy skbs
tcp: rename sk_wmem_free_skb
netdevsim: fix uninit value in nsim_drv_configure_vfs()
selftests/bpf: Fix also no-alu32 strobemeta selftest
bpf: Add missing map_delete_elem method to bloom filter map
selftests/bpf: Add bloom map success test for userspace calls
bpf: Add alignment padding for "map_extra" + consolidate holes
bpf: Bloom filter map naming fixups
selftests/bpf: Add test cases for struct_ops prog
bpf: Add dummy BPF STRUCT_OPS for test purpose
...
Jakub Kicinski [Tue, 2 Nov 2021 05:26:08 +0000 (22:26 -0700)]
Revert "net: avoid double accounting for pure zerocopy skbs"
This reverts commit
f1a456f8f3fc5828d8abcad941860380ae147b1d.
WARNING: CPU: 1 PID: 6819 at net/core/skbuff.c:5429 skb_try_coalesce+0x78b/0x7e0
CPU: 1 PID: 6819 Comm: xxxxxxx Kdump: loaded Tainted: G S 5.15.0-04194-gd852503f7711 #16
RIP: 0010:skb_try_coalesce+0x78b/0x7e0
Code: e8 2a bf 41 ff 44 8b b3 bc 00 00 00 48 8b 7c 24 30 e8 19 c0 41 ff 44 89 f0 48 03 83 c0 00 00 00 48 89 44 24 40 e9 47 fb ff ff <0f> 0b e9 ca fc ff ff 4c 8d 70 ff 48 83 c0 07 48 89 44 24 38 e9 61
RSP: 0018:
ffff88881f449688 EFLAGS:
00010282
RAX:
00000000fffffe96 RBX:
ffff8881566e4460 RCX:
ffffffff82079f7e
RDX:
0000000000000003 RSI:
dffffc0000000000 RDI:
ffff8881566e47b0
RBP:
ffff8881566e46e0 R08:
ffffed102619235d R09:
ffffed102619235d
R10:
ffff888130c91ae3 R11:
ffffed102619235c R12:
ffff88881f4498a0
R13:
0000000000000056 R14:
0000000000000009 R15:
ffff888130c91ac0
FS:
00007fec2cbb9700(0000) GS:
ffff88881f440000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00007fec1b060d80 CR3:
00000003acf94005 CR4:
00000000003706e0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Call Trace:
<IRQ>
tcp_try_coalesce+0xeb/0x290
? tcp_parse_options+0x610/0x610
? mark_held_locks+0x79/0xa0
tcp_queue_rcv+0x69/0x2f0
tcp_rcv_established+0xa49/0xd40
? tcp_data_queue+0x18a0/0x18a0
tcp_v6_do_rcv+0x1c9/0x880
? rt6_mtu_change_route+0x100/0x100
tcp_v6_rcv+0x1624/0x1830
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Tue, 2 Nov 2021 04:24:02 +0000 (21:24 -0700)]
Merge branch 'linus' of git://git./linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Delay boot-up self-test for built-in algorithms
Algorithms:
- Remove fallback path on arm64 as SIMD now runs with softirq off
Drivers:
- Add Keem Bay OCS ECC Driver"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (61 commits)
crypto: testmgr - fix wrong key length for pkcs1pad
crypto: pcrypt - Delay write to padata->info
crypto: ccp - Make use of the helper macro kthread_run()
crypto: sa2ul - Use the defined variable to clean code
crypto: s5p-sss - Add error handling in s5p_aes_probe()
crypto: keembay-ocs-ecc - Add Keem Bay OCS ECC Driver
dt-bindings: crypto: Add Keem Bay ECC bindings
crypto: ecc - Export additional helper functions
crypto: ecc - Move ecc.h to include/crypto/internal
crypto: engine - Add KPP Support to Crypto Engine
crypto: api - Do not create test larvals if manager is disabled
crypto: tcrypt - fix skcipher multi-buffer tests for 1420B blocks
hwrng: s390 - replace snprintf in show functions with sysfs_emit
crypto: octeontx2 - set assoclen in aead_do_fallback()
crypto: ccp - Fix whitespace in sev_cmd_buffer_len()
hwrng: mtk - Force runtime pm ops for sleep ops
crypto: testmgr - Only disable migration in crypto_disable_simd_for_test()
crypto: qat - share adf_enable_pf2vf_comms() from adf_pf2vf_msg.c
crypto: qat - extract send and wait from adf_vf2pf_request_version()
crypto: qat - add VF and PF wrappers to common send function
...
Linus Torvalds [Tue, 2 Nov 2021 04:17:39 +0000 (21:17 -0700)]
Merge tag 'audit-pr-
20211101' of git://git./linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"Add some additional audit logging to capture the openat2() syscall
open_how struct info.
Previous variations of the open()/openat() syscalls allowed audit
admins to inspect the syscall args to get the information contained in
the new open_how struct used in openat2()"
* tag 'audit-pr-
20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: return early if the filter rule has a lower priority
audit: add OPENAT2 record to list "how" info
audit: add support for the openat2 syscall
audit: replace magic audit syscall class numbers with macros
lsm_audit: avoid overloading the "key" audit field
audit: Convert to SPDX identifier
audit: rename struct node to struct audit_node to prevent future name collisions
Linus Torvalds [Tue, 2 Nov 2021 04:06:18 +0000 (21:06 -0700)]
Merge tag 'selinux-pr-
20211101' of git://git./linux/kernel/git/pcmoore/selinux
Pull selinux updates from Paul Moore:
- Add LSM/SELinux/Smack controls and auditing for io-uring.
As usual, the individual commit descriptions have more detail, but we
were basically missing two things which we're adding here:
+ establishment of a proper audit context so that auditing of
io-uring ops works similarly to how it does for syscalls (with
some io-uring additions because io-uring ops are *not* syscalls)
+ additional LSM hooks to enable access control points for some of
the more unusual io-uring features, e.g. credential overrides.
The additional audit callouts and LSM hooks were done in conjunction
with the io-uring folks, based on conversations and RFC patches
earlier in the year.
- Fixup the binder credential handling so that the proper credentials
are used in the LSM hooks; the commit description and the code
comment which is removed in these patches are helpful to understand
the background and why this is the proper fix.
- Enable SELinux genfscon policy support for securityfs, allowing
improved SELinux filesystem labeling for other subsystems which make
use of securityfs, e.g. IMA.
* tag 'selinux-pr-
20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
security: Return xattr name from security_dentry_init_security()
selinux: fix a sock regression in selinux_ip_postroute_compat()
binder: use cred instead of task for getsecid
binder: use cred instead of task for selinux checks
binder: use euid from cred instead of using task
LSM: Avoid warnings about potentially unused hook variables
selinux: fix all of the W=1 build warnings
selinux: make better use of the nf_hook_state passed to the NF hooks
selinux: fix race condition when computing ocontext SIDs
selinux: remove unneeded ipv6 hook wrappers
selinux: remove the SELinux lockdown implementation
selinux: enable genfscon labeling for securityfs
Smack: Brutalist io_uring support
selinux: add support for the io_uring access controls
lsm,io_uring: add LSM hooks to io_uring
io_uring: convert io_uring to the secure anon inode interface
fs: add anon_inode_getfile_secure() similar to anon_inode_getfd_secure()
audit: add filtering for io_uring records
audit,io_uring,io-wq: add some basic audit support to io_uring
audit: prepare audit_context for use in calling contexts beyond syscalls
Linus Torvalds [Tue, 2 Nov 2021 03:25:38 +0000 (20:25 -0700)]
Merge tag 'rcu.2021.11.01a' of git://git./linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Miscellaneous fixes
- Torture-test updates for smp_call_function(), most notably improved
checking of module parameters.
- Tasks-trace RCU updates that fix a number of rare but important
race-condition bugs.
- Other torture-test updates, most notably better checking of module
parameters. In addition, rcutorture may once again be run on
CONFIG_PREEMPT_RT kernels.
- Torture-test scripting updates, most notably specifying the new
CONFIG_KCSAN_STRICT kconfig option rather than maintaining an
ever-changing list of individual KCSAN kconfig options.
* tag 'rcu.2021.11.01a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (46 commits)
rcu: Fix rcu_dynticks_curr_cpu_in_eqs() vs noinstr
rcu: Always inline rcu_dynticks_task*_{enter,exit}()
torture: Make kvm-remote.sh print size of downloaded tarball
torture: Allot 1G of memory for scftorture runs
tools/rcu: Add an extract-stall script
scftorture: Warn on individual scf_torture_init() error conditions
scftorture: Count reschedule IPIs
scftorture: Account for weight_resched when checking for all zeroes
scftorture: Shut down if nonsensical arguments given
scftorture: Allow zero weight to exclude an smp_call_function*() category
rcu: Avoid unneeded function call in rcu_read_unlock()
rcu-tasks: Update comments to cond_resched_tasks_rcu_qs()
rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader
rcu-tasks: Fix read-side primitives comment for call_rcu_tasks_trace
rcu-tasks: Clarify read side section info for rcu_tasks_rude GP primitives
rcu-tasks: Correct comparisons for CPU numbers in show_stalled_task_trace
rcu-tasks: Correct firstreport usage in check_all_holdout_tasks_trace
rcu-tasks: Fix s/rcu_add_holdout/trc_add_holdout/ typo in comment
rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop
rcu-tasks: Fix s/instruction/instructions/ typo in comment
...
Linus Torvalds [Tue, 2 Nov 2021 03:05:19 +0000 (20:05 -0700)]
Merge tag 'trace-v5.16' of git://git./linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- kprobes: Restructured stack unwinder to show properly on x86 when a
stack dump happens from a kretprobe callback.
- Fix to bootconfig parsing
- Have tracefs allow owner and group permissions by default (only
denying others). There's been pressure to allow non root to tracefs
in a controlled fashion, and using groups is probably the safest.
- Bootconfig memory managament updates.
- Bootconfig clean up to have the tools directory be less dependent on
changes in the kernel tree.
- Allow perf to be traced by function tracer.
- Rewrite of function graph tracer to be a callback from the function
tracer instead of having its own trampoline (this change will happen
on an arch by arch basis, and currently only x86_64 implements it).
- Allow multiple direct trampolines (bpf hooks to functions) be batched
together in one synchronization.
- Allow histogram triggers to add variables that can perform
calculations against the event's fields.
- Use the linker to determine architecture callbacks from the ftrace
trampoline to allow for proper parameter prototypes and prevent
warnings from the compiler.
- Extend histogram triggers to key off of variables.
- Have trace recursion use bit magic to determine preempt context over
if branches.
- Have trace recursion disable preemption as all use cases do anyway.
- Added testing for verification of tracing utilities.
- Various small clean ups and fixes.
* tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (101 commits)
tracing/histogram: Fix semicolon.cocci warnings
tracing/histogram: Fix documentation inline emphasis warning
tracing: Increase PERF_MAX_TRACE_SIZE to handle Sentinel1 and docker together
tracing: Show size of requested perf buffer
bootconfig: Initialize ret in xbc_parse_tree()
ftrace: do CPU checking after preemption disabled
ftrace: disable preemption when recursion locked
tracing/histogram: Document expression arithmetic and constants
tracing/histogram: Optimize division by a power of 2
tracing/histogram: Covert expr to const if both operands are constants
tracing/histogram: Simplify handling of .sym-offset in expressions
tracing: Fix operator precedence for hist triggers expression
tracing: Add division and multiplication support for hist triggers
tracing: Add support for creating hist trigger variables from literal
selftests/ftrace: Stop tracing while reading the trace file by default
MAINTAINERS: Update KPROBES and TRACING entries
test_kprobes: Move it from kernel/ to lib/
docs, kprobes: Remove invalid URL and add new reference
samples/kretprobes: Fix return value if register_kretprobe() failed
lib/bootconfig: Fix the xbc_get_info kerneldoc
...
Jakub Kicinski [Tue, 2 Nov 2021 03:05:14 +0000 (20:05 -0700)]
Merge git://git./linux/kernel/git/netdev/net
Merge in the fixes we had queued in case there was another -rc.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 2 Nov 2021 02:59:45 +0000 (19:59 -0700)]
Merge https://git./linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-11-01
We've added 181 non-merge commits during the last 28 day(s) which contain
a total of 280 files changed, 11791 insertions(+), 5879 deletions(-).
The main changes are:
1) Fix bpf verifier propagation of 64-bit bounds, from Alexei.
2) Parallelize bpf test_progs, from Yucong and Andrii.
3) Deprecate various libbpf apis including af_xdp, from Andrii, Hengqi, Magnus.
4) Improve bpf selftests on s390, from Ilya.
5) bloomfilter bpf map type, from Joanne.
6) Big improvements to JIT tests especially on Mips, from Johan.
7) Support kernel module function calls from bpf, from Kumar.
8) Support typeless and weak ksym in light skeleton, from Kumar.
9) Disallow unprivileged bpf by default, from Pawan.
10) BTF_KIND_DECL_TAG support, from Yonghong.
11) Various bpftool cleanups, from Quentin.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (181 commits)
libbpf: Deprecate AF_XDP support
kbuild: Unify options for BTF generation for vmlinux and modules
selftests/bpf: Add a testcase for 64-bit bounds propagation issue.
bpf: Fix propagation of signed bounds from 64-bit min/max into 32-bit.
bpf: Fix propagation of bounds from 64-bit min/max into 32-bit and var_off.
selftests/bpf: Fix also no-alu32 strobemeta selftest
bpf: Add missing map_delete_elem method to bloom filter map
selftests/bpf: Add bloom map success test for userspace calls
bpf: Add alignment padding for "map_extra" + consolidate holes
bpf: Bloom filter map naming fixups
selftests/bpf: Add test cases for struct_ops prog
bpf: Add dummy BPF STRUCT_OPS for test purpose
bpf: Factor out helpers for ctx access checking
bpf: Factor out a helper to prepare trampoline for struct_ops prog
selftests, bpf: Fix broken riscv build
riscv, libbpf: Add RISC-V (RV64) support to bpf_tracing.h
tools, build: Add RISC-V to HOSTARCH parsing
riscv, bpf: Increase the maximum number of iterations
selftests, bpf: Add one test for sockmap with strparser
selftests, bpf: Fix test_txmsg_ingress_parser error
...
====================
Link: https://lore.kernel.org/r/20211102013123.9005-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 2 Nov 2021 02:57:16 +0000 (19:57 -0700)]
Merge branch 'make-neighbor-eviction-controllable-by-userspace'
James Prestwood says:
====================
Make neighbor eviction controllable by userspace
====================
Link: https://lore.kernel.org/r/20211101173630.300969-1-prestwoj@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
James Prestwood [Mon, 1 Nov 2021 17:36:30 +0000 (10:36 -0700)]
selftests: net: add arp_ndisc_evict_nocarrier
This tests the sysctl options for ARP/ND:
/net/ipv4/conf/<iface>/arp_evict_nocarrier
/net/ipv4/conf/all/arp_evict_nocarrier
/net/ipv6/conf/<iface>/ndisc_evict_nocarrier
/net/ipv6/conf/all/ndisc_evict_nocarrier
Signed-off-by: James Prestwood <prestwoj@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
James Prestwood [Mon, 1 Nov 2021 17:36:29 +0000 (10:36 -0700)]
net: ndisc: introduce ndisc_evict_nocarrier sysctl parameter
In most situations the neighbor discovery cache should be cleared on a
NOCARRIER event which is currently done unconditionally. But for wireless
roams the neighbor discovery cache can and should remain intact since
the underlying network has not changed.
This patch introduces a sysctl option ndisc_evict_nocarrier which can
be disabled by a wireless supplicant during a roam. This allows packets
to be sent after a roam immediately without having to wait for
neighbor discovery.
A user reported roughly a 1 second delay after a roam before packets
could be sent out (note, on IPv4). This delay was due to the ARP
cache being cleared. During testing of this same scenario using IPv6
no delay was noticed, but regardless there is no reason to clear
the ndisc cache for wireless roams.
Signed-off-by: James Prestwood <prestwoj@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
James Prestwood [Mon, 1 Nov 2021 17:36:28 +0000 (10:36 -0700)]
net: arp: introduce arp_evict_nocarrier sysctl parameter
This change introduces a new sysctl parameter, arp_evict_nocarrier.
When set (default) the ARP cache will be cleared on a NOCARRIER event.
This new option has been defaulted to '1' which maintains existing
behavior.
Clearing the ARP cache on NOCARRIER is relatively new, introduced by:
commit
859bd2ef1fc1110a8031b967ee656c53a6260a76
Author: David Ahern <dsahern@gmail.com>
Date: Thu Oct 11 20:33:49 2018 -0700
net: Evict neighbor entries on carrier down
The reason for this changes is to prevent the ARP cache from being
cleared when a wireless device roams. Specifically for wireless roams
the ARP cache should not be cleared because the underlying network has not
changed. Clearing the ARP cache in this case can introduce significant
delays sending out packets after a roam.
A user reported such a situation here:
https://lore.kernel.org/linux-wireless/CACsRnHWa47zpx3D1oDq9JYnZWniS8yBwW1h0WAVZ6vrbwL_S0w@mail.gmail.com/
After some investigation it was found that the kernel was holding onto
packets until ARP finished which resulted in this 1 second delay. It
was also found that the first ARP who-has was never responded to,
which is actually what caues the delay. This change is more or less
working around this behavior, but again, there is no reason to clear
the cache on a roam anyways.
As for the unanswered who-has, we know the packet made it OTA since
it was seen while monitoring. Why it never received a response is
unknown. In any case, since this is a problem on the AP side of things
all that can be done is to work around it until it is solved.
Some background on testing/reproducing the packet delay:
Hardware:
- 2 access points configured for Fast BSS Transition (Though I don't
see why regular reassociation wouldn't have the same behavior)
- Wireless station running IWD as supplicant
- A device on network able to respond to pings (I used one of the APs)
Procedure:
- Connect to first AP
- Ping once to establish an ARP entry
- Start a tcpdump
- Roam to second AP
- Wait for operstate UP event, and note the timestamp
- Start pinging
Results:
Below is the tcpdump after UP. It was recorded the interface went UP at
10:42:01.432875.
10:42:01.461871 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
10:42:02.497976 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
10:42:02.507162 ARP, Reply 192.168.254.1 is-at ac:86:74:55:b0:20, length 46
10:42:02.507185 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 1, length 64
10:42:02.507205 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 2, length 64
10:42:02.507212 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 3, length 64
10:42:02.507219 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 4, length 64
10:42:02.507225 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 5, length 64
10:42:02.507232 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 6, length 64
10:42:02.515373 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 1, length 64
10:42:02.521399 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 2, length 64
10:42:02.521612 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 3, length 64
10:42:02.521941 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 4, length 64
10:42:02.522419 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 5, length 64
10:42:02.523085 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 6, length 64
You can see the first ARP who-has went out very quickly after UP, but
was never responded to. Nearly a second later the kernel retries and
gets a response. Only then do the ping packets go out. If an ARP entry
is manually added prior to UP (after the cache is cleared) it is seen
that the first ping is never responded to, so its not only an issue with
ARP but with data packets in general.
As mentioned prior, the wireless interface was also monitored to verify
the ping/ARP packet made it OTA which was observed to be true.
Signed-off-by: James Prestwood <prestwoj@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Tue, 2 Nov 2021 02:16:49 +0000 (19:16 -0700)]
Merge tag 'hwmon-for-v5.16' of git://git./linux/kernel/git/groeck/linux-staging
Pull hwmon updates from Guenter Roeck:
"New driver:
- Maxim MAX6620
Notable functional enhancements:
- Add Asus WMI support to nct6775 driver, and list boards supporting
it
- Move TMP461 support from tm401 driver to lm90 driver
- Add support for fanX_min, fanX_max and fanX_target to dell-smm
driver, and clean it up while doing so
- Extend mlxreg-fan driver to support multiple cooling devices and
multiple PWM channels. Also increase number of supported fan
tachometers.
- Add a new customer ID (for ASRock) to nct6683 driver
- Make temperature/voltage sensors on nct7802 configurable
- Add mfg_id debugfs entry to pmbus/ibm-cffps driver
- Support configurable sense resistor values in pmbus/lm25066, and
fix various coefficients
- Use generic notification mechanism in raspberrypi driver
Notable cleanups:
- Convert various devicetree bindings to dtschema, and add missing
bindings
- Convert i5500_temp and tmp103 drivers to
devm_hwmon_device_register_with_info
- Clean up non-bool "valid" data fields
- Improve devicetree configurability for tmp421 driver"
* tag 'hwmon-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: (73 commits)
hwmon: (nct7802) Add of_node_put() before return
hwmon: (tmp401) Drop support for TMP461
hwmon: (lm90) Add basic support for TI TMP461
hwmon: (lm90) Introduce flag indicating extended temperature support
hwmon: (nct6775) add ProArt X570-CREATOR WIFI.
hwmon: (nct7802) Make temperature/voltage sensors configurable
dt-bindings: hwmon: Add nct7802 bindings
hwmon: (dell-smm) Speed up setting of fan speed
hwmon: (dell-smm) Add comment explaining usage of i8k_config_data[]
hwmon: (dell-smm) Return -ENOIOCTLCMD instead of -EINVAL
hwmon: (dell-smm) Use strscpy_pad()
hwmon: (dell-smm) Sort includes in alphabetical order
hwmon: (tmp421) Add of_node_put() before return
hwmon: (max31722) Warn about failure to put device in stand-by in .remove()
hwmon: (acpi_power_meter) Use acpi_bus_get_acpi_device()
hwmon: (dell-smm) Add support for fanX_min, fanX_max and fanX_target
dt-bindings: hwmon: allow specifying channels for tmp421
hwmon: (tmp421) ignore non-channel related DT nodes
hwmon: (tmp421) update documentation
hwmon: (tmp421) support HWMON_T_ENABLE
...
Linus Torvalds [Tue, 2 Nov 2021 02:09:04 +0000 (19:09 -0700)]
Merge tag 'spi-v5.16' of git://git./linux/kernel/git/broonie/spi
Pull spi updates from Mark Brown:
"This is quite a quiet release for SPI, there's been a bit of cleanup
to the core from Uwe but nothing functionality wise.
We have added several new drivers, Cadence XSPI, Ingenic JZ47xx,
Qualcomm SC7280 and SC7180 and Xilinx Versal OSPI"
* tag 'spi-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: (41 commits)
spi: Convert NXP flexspi to json schema
spi: spi-geni-qcom: Add support for GPI dma
spi: fsi: Fix contention in the FSI2SPI engine
spi: spi-rpc-if: Check return value of rpcif_sw_init()
spi: tegra210-quad: Put device into suspend on driver removal
spi: tegra20-slink: Put device into suspend on driver removal
spi: bcm-qspi: Fix missing clk_disable_unprepare() on error in bcm_qspi_probe()
spi: at91-usart: replacing legacy gpio interface for gpiod
spi: replace snprintf in show functions with sysfs_emit
spi: cadence: Add of_node_put() before return
spi: orion: Add of_node_put() before goto
spi: cadence-quadspi: fix dma_unmap_single() call
spi: tegra20: fix build with CONFIG_PM_SLEEP=n
spi: bcm-qspi: add support for 3-wire mode for half duplex transfer
spi: bcm-qspi: Add mspi spcr3 32/64-bits xfer mode
spi: Make several public functions private to spi.c
spi: Reorder functions to simplify the next commit
spi: Remove unused function spi_busnum_to_master()
spi: Move comment about chipselect check to the right place
spi: fsi: Print status on error
...
Linus Torvalds [Tue, 2 Nov 2021 02:04:47 +0000 (19:04 -0700)]
Merge tag 'regulator-v5.16' of git://git./linux/kernel/git/broonie/regulator
Pull regulator updates from Mark Brown:
"Thanks to the removal of the unused TPS80021 driver the regulator
updates for this cycle actually have a negative diffstat.
Otherwise it's been quite a quiet release, lots of fixes and small
improvements with the biggest individual changes being several
conversions of DT bindings to YAML format"
* tag 'regulator-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (34 commits)
regulator: Don't error out fixed regulator in regulator_sync_voltage()
regulator: tps80031: Remove driver
regulator: Fix SY7636A breakage
regulator: uniphier: Add binding for NX1 SoC
regulator: uniphier: Add USB-VBUS compatible string for NX1 SoC
regulator: qcom,rpmh: Add compatible for PM6350
regulator: qcom-rpmh: Add PM6350 regulators
regulator: sy7636a: Remove requirement on sy7636a mfd
regulator: tps62360: replacing legacy gpio interface for gpiod
regulator: lp872x: Remove lp872x_dvs_state
regulator: lp872x: replacing legacy gpio interface for gpiod
regulator: dt-bindings: samsung,s5m8767: convert to dtschema
regulator: dt-bindings: samsung,s2mpa01: convert to dtschema
regulator: dt-bindings: samsung,s2m: convert to dtschema
dt-bindings: clock: samsung,s2mps11: convert to dtschema
regulator: dt-bindings: samsung,s5m8767: correct s5m8767,pmic-buck-default-dvs-idx property
regulator: s5m8767: do not use reset value as DVS voltage if GPIO DVS is disabled
regulator: dt-bindings: maxim,max8973: convert to dtschema
regulator: dt-bindings: maxim,max8997: convert to dtschema
regulator: dt-bindings: maxim,max8952: convert to dtschema
...
Linus Torvalds [Tue, 2 Nov 2021 02:01:51 +0000 (19:01 -0700)]
Merge tag 'regmap-v5.16' of git://git./linux/kernel/git/broonie/regmap
Pull regmap update from Mark Brown:
"A single change to use the maximum transfer and message sizes
advertised by SPI controllers to configure limits within the
regmap core, ensuring better interoperation"
* tag 'regmap-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: spi: Set regmap max raw r/w from max_transfer_size
Linus Torvalds [Tue, 2 Nov 2021 01:58:13 +0000 (18:58 -0700)]
Merge tag 'mailbox-v5.16' of git://git.linaro.org/landing-teams/working/fujitsu/integration
Pull mailbox updates from Jassi Brar:
"qcom:
- add support for qcm2290
- consolidate msm8994 type apcs_data
mtk:
- fix clock id usage
apple:
- add driver for ASC/M3 controllers
pcc:
- reorganise PCC pcc_mbox_request_channel
- add support for PCCT extended PCC subspaces
misc:
- make use of devm_platform_ioremap_resource()
- change Altera, PCC and Apple mailbox maintainers"
* tag 'mailbox-v5.16' of git://git.linaro.org/landing-teams/working/fujitsu/integration: (38 commits)
mailbox: imx: support i.MX8ULP S4 MU
dt-bindings: mailbox: imx-mu: add i.MX8ULP S400 MU support
ACPI/PCC: Add maintainer for PCC mailbox driver
mailbox: pcc: Move bulk of PCCT parsing into pcc_mbox_probe
mailbox: pcc: Add support for PCCT extended PCC subspaces(type 3/4)
mailbox: pcc: Drop handling invalid bit-width in {read,write}_register
mailbox: pcc: Avoid accessing PCCT table in pcc_send_data and pcc_mbox_irq
mailbox: pcc: Add PCC register bundle and associated accessor functions
mailbox: pcc: Rename doorbell ack to platform interrupt ack register
mailbox: pcc: Use PCC mailbox channel pointer instead of standard
mailbox: pcc: Add pcc_mbox_chan structure to hold shared memory region info
mailbox: pcc: Consolidate subspace doorbell register parsing
mailbox: pcc: Consolidate subspace interrupt information parsing
mailbox: pcc: Refactor all PCC channel information into a structure
mailbox: pcc: Fix kernel doc warnings
mailbox: apple: Add driver for Apple mailboxes
dt-bindings: mailbox: Add Apple mailbox bindings
MAINTAINERS: Add Apple mailbox files
mailbox: mtk-cmdq: Fix local clock ID usage
mailbox: mtk-cmdq: Validate alias_id on probe
...
Linus Torvalds [Tue, 2 Nov 2021 01:55:12 +0000 (18:55 -0700)]
Merge tag 'mmc-v5.16' of git://git./linux/kernel/git/ulfh/mmc
Pull MMC and MEMSTICK updates from Ulf Hansson:
"MMC core:
- Update maintainer and URL for the mmc-utils
- Set default label for slot-gpio in case of no con-id
- Convert MMC card DT bindings to a schema
- Add optional host specific tuning support for eMMC HS400
- Add error handling of add_disk()
MMC host:
- mtk-sd: Add host specific tuning support for eMMC HS400
- mtk-sd: Make DMA handling more robust
- dw_mmc: Prevent hangs for some data writes
- dw_mmc: Move away from using the ->init_card() callback
- mxs-mmc: Manage the regulator in the error path and in ->remove()
- sdhci-cadence: Add support for the Microchip MPFS variant
- sdhci-esdhc-imx: Add support for the NXP S32G2 variant
- sdhci-of-arasan: Add support for the Intel Thunder Bay variant
- sdhci-omap: Prepare to support more SoCs
- sdhci-omap: Add support for omap3 and omap4 variants
- sdhci-omap: Add support for power management
- sdhci-omap: Add support for system wakeups
- sdhci-msm: Add support for the msm8226 variant
- sdhci-sprd: Verify that the DLL locks according to spec
MEMSTICK:
- Add error handling of add_disk()
- A couple of small fixes and improvements"
* tag 'mmc-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (60 commits)
docs: mmc: update maintainer name and URL
mmc: dw_mmc: exynos: Fix spelling mistake "candiates" -> candidates
MAINTAINERS: drop obsolete file pattern in SDHCI DRIVER section
mmc: sdhci-esdhc-imx: add NXP S32G2 support
dt-bindings: mmc: fsl-imx-esdhc: add NXP S32G2 support
mmc: dw_mmc: Drop use of ->init_card() callback
mmc: sdhci-omap: Fix build if CONFIG_PM_SLEEP is not set
mmc: sdhci-omap: Remove forward declaration of sdhci_omap_context_save()
memstick: r592: Fix a UAF bug when removing the driver
mmc: mxs-mmc: disable regulator on error and in the remove function
mmc: sdhci-omap: Configure optional wakeirq
mmc: sdhci-omap: Allow SDIO card power off and enable aggressive PM
mmc: sdhci-omap: Implement PM runtime functions
mmc: sdhci-omap: Add omap_offset to support omap3 and earlier
mmc: sdhci-omap: Handle voltages to add support omap4
dt-bindings: sdhci-omap: Update binding for legacy SoCs
mmc: sdhci-pci: Remove dead code (rst_n_gpio et al)
mmc: sdhci-pci: Remove dead code (cd_gpio, cd_irq et al)
mmc: sdhci-pci: Remove dead code (struct sdhci_pci_data et al)
mmc: sdhci: Remove unused prototype declaration in the header
...
Linus Torvalds [Tue, 2 Nov 2021 01:53:03 +0000 (18:53 -0700)]
Merge tag 'for-linus-5.16-1' of https://github.com/cminyard/linux-ipmi
Pull IPMI driver updates from Corey Minyard:
"A new type of low-level IPMI driver is added for direct communication
over the IPMI message bus without a BMC between the driver and the
bus.
Other than that, lots of little bug fixes and enhancements"
* tag 'for-linus-5.16-1' of https://github.com/cminyard/linux-ipmi:
ipmi: kcs_bmc: Fix a memory leak in the error handling path of 'kcs_bmc_serio_add_device()'
char: ipmi: replace snprintf in show functions with sysfs_emit
ipmi: ipmb: fix dependencies to eliminate build error
ipmi:ipmb: Add OF support
ipmi: bt: Add ast2600 compatible string
ipmi: bt-bmc: Use registers directly
ipmi: ipmb: Fix off-by-one size check on rcvlen
ipmi:ssif: Use depends on, not select, for I2C
ipmi: Add docs for the IPMI IPMB driver
ipmi: Add docs for IPMB direct addressing
ipmi:ipmb: Add initial support for IPMI over IPMB
ipmi: Add support for IPMB direct messages
ipmi: Export ipmb_checksum()
ipmi: Fix a typo
ipmi: Check error code before processing BMC response
ipmi:devintf: Return a proper error when recv buffer too small
ipmi: Disable some operations during a panic
ipmi:watchdog: Set panic count to proper value on a panic
Linus Torvalds [Tue, 2 Nov 2021 01:49:25 +0000 (18:49 -0700)]
Merge tag 'leds-5.16-rc1' of git://git./linux/kernel/git/pavel/linux-leds
Pull LED updates from Pavel Machek:
"Johannes pointed out that locking is still problematic with triggers
list, attempt to solve that by using RCU"
* tag 'leds-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/pavel/linux-leds:
leds: trigger: Disable CPU trigger on PREEMPT_RT
leds: trigger: use RCU to protect the led_cdevs list
led-class-flash: fix -Wrestrict warning
Linus Torvalds [Tue, 2 Nov 2021 01:45:08 +0000 (18:45 -0700)]
Merge tag 'media/v5.16-1' of git://git./linux/kernel/git/mchehab/linux-media
Pull media updates from Mauro Carvalho Chehab:
- New driver for SK Hynix Hi-846 8M pixel camera
- New driver for the ov13b10 camera
- New driver for Renesas R-Car ISP
- mtk-vcodec gained support for version 2 of decoder firmware ABI
- The legacy sir_ir driver got removed
- videobuf2: the vb2_mem_ops kAPI had some improvements
- lots of cleanups, fixes and new features at device drivers
* tag 'media/v5.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (328 commits)
media: venus: core: Add sdm660 DT compatible and resource struct
media: dt-bindings: media: venus: Add sdm660 dt schema
media: venus: vdec: decoded picture buffer handling during reconfig sequence
media: venus: Handle fatal errors during encoding and decoding
media: venus: helpers: Add helper to mark fatal vb2 error
media: venus: hfi: Check for sys error on session hfi functions
media: venus: Make sys_error flag an atomic bitops
media: venus: venc: Use pmruntime autosuspend
media: allegro: write vui parameters for HEVC
media: allegro: nal-hevc: implement generator for vui
media: allegro: write correct colorspace into SPS
media: allegro: extract nal value lookup functions to header
media: allegro: correctly scale the bit rate in SPS
media: allegro: remove external QP table
media: allegro: fix row and column in response message
media: allegro: add control to disable encoder buffer
media: allegro: add encoder buffer support
media: allegro: add pm_runtime support
media: allegro: lookup VCU settings
media: allegro: fix module removal if initialization failed
...
Magnus Karlsson [Fri, 29 Oct 2021 09:01:11 +0000 (11:01 +0200)]
libbpf: Deprecate AF_XDP support
Deprecate AF_XDP support in libbpf ([0]). This has been moved to
libxdp as it is a better fit for that library. The AF_XDP support only
uses the public libbpf functions and can therefore just use libbpf as
a library from libxdp. The libxdp APIs are exactly the same so it
should just be linking with libxdp instead of libbpf for the AF_XDP
functionality. If not, please submit a bug report. Linking with both
libraries is supported but make sure you link in the correct order so
that the new functions in libxdp are used instead of the deprecated
ones in libbpf.
Libxdp can be found at https://github.com/xdp-project/xdp-tools.
[0] Closes: https://github.com/libbpf/libbpf/issues/270
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20211029090111.4733-1-magnus.karlsson@gmail.com
Jiri Olsa [Fri, 29 Oct 2021 12:57:29 +0000 (14:57 +0200)]
kbuild: Unify options for BTF generation for vmlinux and modules
Using new PAHOLE_FLAGS variable to pass extra arguments to
pahole for both vmlinux and modules BTF data generation.
Adding new scripts/pahole-flags.sh script that detect and
prints pahole options.
[ fixed issues found by kernel test robot ]
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211029125729.70002-1-jolsa@kernel.org
Alexei Starovoitov [Mon, 1 Nov 2021 22:21:53 +0000 (15:21 -0700)]
selftests/bpf: Add a testcase for 64-bit bounds propagation issue.
./test_progs-no_alu32 -vv -t twfw
Before the 64-bit_into_32-bit fix:
19: (25) if r1 > 0x3f goto pc+6
R1_w=inv(id=0,umax_value=63,var_off=(0x0; 0xff),s32_max_value=255,u32_max_value=255)
and eventually:
invalid access to map value, value_size=8 off=7 size=8
R6 max value is outside of the allowed memory range
libbpf: failed to load object 'no_alu32/twfw.o'
After the fix:
19: (25) if r1 > 0x3f goto pc+6
R1_w=inv(id=0,umax_value=63,var_off=(0x0; 0x3f))
verif_twfw:OK
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211101222153.78759-3-alexei.starovoitov@gmail.com
Alexei Starovoitov [Mon, 1 Nov 2021 22:21:52 +0000 (15:21 -0700)]
bpf: Fix propagation of signed bounds from 64-bit min/max into 32-bit.
Similar to unsigned bounds propagation fix signed bounds.
The 'Fixes' tag is a hint. There is no security bug here.
The verifier was too conservative.
Fixes:
3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211101222153.78759-2-alexei.starovoitov@gmail.com
Alexei Starovoitov [Mon, 1 Nov 2021 22:21:51 +0000 (15:21 -0700)]
bpf: Fix propagation of bounds from 64-bit min/max into 32-bit and var_off.
Before this fix:
166: (b5) if r2 <= 0x1 goto pc+22
from 166 to 189: R2=invP(id=1,umax_value=1,var_off=(0x0; 0xffffffff))
After this fix:
166: (b5) if r2 <= 0x1 goto pc+22
from 166 to 189: R2=invP(id=1,umax_value=1,var_off=(0x0; 0x1))
While processing BPF_JLE the reg_set_min_max() would set true_reg->umax_value = 1
and call __reg_combine_64_into_32(true_reg).
Without the fix it would not pass the condition:
if (__reg64_bound_u32(reg->umin_value) && __reg64_bound_u32(reg->umax_value))
since umin_value == 0 at this point.
Before commit
10bf4e83167c the umin was incorrectly ingored.
The commit
10bf4e83167c fixed the correctness issue, but pessimized
propagation of 64-bit min max into 32-bit min max and corresponding var_off.
Fixes:
10bf4e83167c ("bpf: Fix propagation of 32 bit unsigned bounds from 64 bit bounds")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211101222153.78759-1-alexei.starovoitov@gmail.com
Linus Torvalds [Tue, 2 Nov 2021 00:34:02 +0000 (17:34 -0700)]
Merge tag 'Smack-for-5.16' of https://github.com/cschaufler/smack-next
Pull smack updates from Casey Schaufler:
"Multiple corrections to smackfs:
- a change for overlayfs support that corrects the initial attributes
on created files
- code clean-up for netlabel processing
- several fixes in smackfs for a variety of reasons
- Errors reported by W=1 have been addressed
All told, nothing challenging"
* tag 'Smack-for-5.16' of https://github.com/cschaufler/smack-next:
smackfs: use netlbl_cfg_cipsov4_del() for deleting cipso_v4_doi
smackfs: use __GFP_NOFAIL for smk_cipso_doi()
Smack: fix W=1 build warnings
smack: remove duplicated hook function
Smack:- Use overlay inode label in smack_inode_copy_up()
smack: Guard smack_ipv6_lock definition within a SMACK_IPV6_PORT_LABELING block
smackfs: Fix use-after-free in netlbl_catmap_walk()
Linus Torvalds [Tue, 2 Nov 2021 00:32:22 +0000 (17:32 -0700)]
Merge tag 'fallthrough-fixes-clang-5.16-rc1' of git://git./linux/kernel/git/gustavoars/linux
Pull fallthrough fixes from Gustavo A. R. Silva:
"Fix some fall-through warnings when building with Clang and
-Wimplicit-fallthrough"
* tag 'fallthrough-fixes-clang-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
pcmcia: db1xxx_ss: Fix fall-through warning for Clang
MIPS: Fix fall-through warnings for Clang
scsi: st: Fix fall-through warning for Clang
Linus Torvalds [Tue, 2 Nov 2021 00:29:10 +0000 (17:29 -0700)]
Merge tag 'kspp-misc-fixes-5.16-rc1' of git://git./linux/kernel/git/gustavoars/linux
Pull hardening fixes and cleanups from Gustavo A. R. Silva:
"Various hardening fixes and cleanups that I've been collecting during
the last development cycle:
Fix -Wcast-function-type error:
- firewire: Remove function callback casts (Oscar Carter)
Fix application of sizeof operator:
- firmware/psci: fix application of sizeof to pointer (jing yangyang)
Replace open coded instances with size_t saturating arithmetic
helpers:
- assoc_array: Avoid open coded arithmetic in allocator arguments
(Len Baker)
- writeback: prefer struct_size over open coded arithmetic (Len
Baker)
- aio: Prefer struct_size over open coded arithmetic (Len Baker)
- dmaengine: pxa_dma: Prefer struct_size over open coded arithmetic
(Len Baker)
Flexible array transformation:
- KVM: PPC: Replace zero-length array with flexible array member (Len
Baker)
Use 2-factor argument multiplication form:
- nouveau/svm: Use kvcalloc() instead of kvzalloc() (Gustavo A. R.
Silva)
- xfs: Use kvcalloc() instead of kvzalloc() (Gustavo A. R. Silva)"
* tag 'kspp-misc-fixes-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
firewire: Remove function callback casts
nouveau/svm: Use kvcalloc() instead of kvzalloc()
firmware/psci: fix application of sizeof to pointer
dmaengine: pxa_dma: Prefer struct_size over open coded arithmetic
KVM: PPC: Replace zero-length array with flexible array member
aio: Prefer struct_size over open coded arithmetic
writeback: prefer struct_size over open coded arithmetic
xfs: Use kvcalloc() instead of kvzalloc()
assoc_array: Avoid open coded arithmetic in allocator arguments
Linus Torvalds [Tue, 2 Nov 2021 00:25:09 +0000 (17:25 -0700)]
Merge tag 'seccomp-v5.16-rc1' of git://git./linux/kernel/git/kees/linux
Pull seccomp updates from Kees Cook:
"These are x86-specific, but I carried these since they're also
seccomp-specific.
This flips the defaults for spec_store_bypass_disable and
spectre_v2_user from "seccomp" to "prctl", as enough time has passed
to allow system owners to have updated the defensive stances of their
various workloads, and it's long overdue to unpessimize seccomp
threads.
Extensive rationale and details are in Andrea's main patch.
Summary:
- set spec_store_bypass_disable & spectre_v2_user to prctl (Andrea Arcangeli)"
* tag 'seccomp-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
x86: deduplicate the spectre_v2_user documentation
x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl
Linus Torvalds [Tue, 2 Nov 2021 00:12:56 +0000 (17:12 -0700)]
Merge tag 'overflow-v5.16-rc1' of git://git./linux/kernel/git/kees/linux
Pull overflow updates from Kees Cook:
"The end goal of the current buffer overflow detection work[0] is to
gain full compile-time and run-time coverage of all detectable buffer
overflows seen via array indexing or memcpy(), memmove(), and
memset(). The str*() family of functions already have full coverage.
While much of the work for these changes have been on-going for many
releases (i.e. 0-element and 1-element array replacements, as well as
avoiding false positives and fixing discovered overflows[1]), this
series contains the foundational elements of several related buffer
overflow detection improvements by providing new common helpers and
FORTIFY_SOURCE changes needed to gain the introspection required for
compiler visibility into array sizes. Also included are a handful of
already Acked instances using the helpers (or related clean-ups), with
many more waiting at the ready to be taken via subsystem-specific
trees[2].
The new helpers are:
- struct_group() for gaining struct member range introspection
- memset_after() and memset_startat() for clearing to the end of
structures
- DECLARE_FLEX_ARRAY() for using flex arrays in unions or alone in
structs
Also included is the beginning of the refactoring of FORTIFY_SOURCE to
support memcpy() introspection, fix missing and regressed coverage
under GCC, and to prepare to fix the currently broken Clang support.
Finishing this work is part of the larger series[0], but depends on
all the false positives and buffer overflow bug fixes to have landed
already and those that depend on this series to land.
As part of the FORTIFY_SOURCE refactoring, a set of both a
compile-time and run-time tests are added for FORTIFY_SOURCE and the
mem*()-family functions respectively. The compile time tests have
found a legitimate (though corner-case) bug[6] already.
Please note that the appearance of "panic" and "BUG" in the
FORTIFY_SOURCE refactoring are the result of relocating existing code,
and no new use of those code-paths are expected nor desired.
Finally, there are two tree-wide conversions for 0-element arrays and
flexible array unions to gain sane compiler introspection coverage
that result in no known object code differences.
After this series (and the changes that have now landed via netdev and
usb), we are very close to finally being able to build with
-Warray-bounds and -Wzero-length-bounds.
However, due corner cases in GCC[3] and Clang[4], I have not included
the last two patches that turn on these options, as I don't want to
introduce any known warnings to the build. Hopefully these can be
solved soon"
Link: https://lore.kernel.org/lkml/20210818060533.3569517-1-keescook@chromium.org/
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?qt=grep&q=FORTIFY_SOURCE
Link: https://lore.kernel.org/lkml/202108220107.3E26FE6C9C@keescook/
Link: https://lore.kernel.org/lkml/3ab153ec-2798-da4c-f7b1-81b0ac8b0c5b@roeck-us.net/
Link: https://bugs.llvm.org/show_bug.cgi?id=51682
Link: https://lore.kernel.org/lkml/202109051257.29B29745C0@keescook/
Link: https://lore.kernel.org/lkml/20211020200039.170424-1-keescook@chromium.org/
* tag 'overflow-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (30 commits)
fortify: strlen: Avoid shadowing previous locals
compiler-gcc.h: Define __SANITIZE_ADDRESS__ under hwaddress sanitizer
treewide: Replace 0-element memcpy() destinations with flexible arrays
treewide: Replace open-coded flex arrays in unions
stddef: Introduce DECLARE_FLEX_ARRAY() helper
btrfs: Use memset_startat() to clear end of struct
string.h: Introduce memset_startat() for wiping trailing members and padding
xfrm: Use memset_after() to clear padding
string.h: Introduce memset_after() for wiping trailing members/padding
lib: Introduce CONFIG_MEMCPY_KUNIT_TEST
fortify: Add compile-time FORTIFY_SOURCE tests
fortify: Allow strlen() and strnlen() to pass compile-time known lengths
fortify: Prepare to improve strnlen() and strlen() warnings
fortify: Fix dropped strcpy() compile-time write overflow check
fortify: Explicitly disable Clang support
fortify: Move remaining fortify helpers into fortify-string.h
lib/string: Move helper functions out of string.c
compiler_types.h: Remove __compiletime_object_size()
cm4000_cs: Use struct_group() to zero struct cm4000_dev region
can: flexcan: Use struct_group() to zero struct flexcan_regs regions
...
Linus Torvalds [Tue, 2 Nov 2021 00:09:03 +0000 (17:09 -0700)]
Merge tag 'hardening-v5.16-rc1' of git://git./linux/kernel/git/kees/linux
Pull compiler hardening updates from Kees Cook:
"These are various compiler-related hardening feature updates. Notable
is the addition of an explicit limited rationale for, and deprecation
schedule of, gcc-plugins.
gcc-plugins:
- remove support for GCC 4.9 and older (Ard Biesheuvel)
- remove duplicate include in gcc-common.h (Ye Guojin)
- Explicitly document purpose and deprecation schedule (Kees Cook)
- Remove cyc_complexity (Kees Cook)
instrumentation:
- Avoid harmless Clang option under CONFIG_INIT_STACK_ALL_ZERO (Kees Cook)
Clang LTO:
- kallsyms: strip LTO suffixes from static functions (Nick Desaulniers)"
* tag 'hardening-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
gcc-plugins: remove duplicate include in gcc-common.h
gcc-plugins: Remove cyc_complexity
gcc-plugins: Explicitly document purpose and deprecation schedule
kallsyms: strip LTO suffixes from static functions
gcc-plugins: remove support for GCC 4.9 and older
hardening: Avoid harmless Clang option under CONFIG_INIT_STACK_ALL_ZERO
Linus Torvalds [Tue, 2 Nov 2021 00:00:05 +0000 (17:00 -0700)]
Merge tag 'cpu-to-thread_info-v5.16-rc1' of git://git./linux/kernel/git/kees/linux
Pull thread_info update to move 'cpu' back from task_struct from Kees Cook:
"Cross-architecture update to move task_struct::cpu back into
thread_info on arm64, x86, s390, powerpc, and riscv. All Acked by arch
maintainers.
Quoting Ard Biesheuvel:
'Move task_struct::cpu back into thread_info
Keeping CPU in task_struct is problematic for architectures that
define raw_smp_processor_id() in terms of this field, as it
requires linux/sched.h to be included, which causes a lot of pain
in terms of circular dependencies (aka 'header soup')
This series moves it back into thread_info (where it came from)
for all architectures that enable THREAD_INFO_IN_TASK, addressing
the header soup issue as well as some pointless differences in the
implementations of task_cpu() and set_task_cpu()'"
* tag 'cpu-to-thread_info-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
riscv: rely on core code to keep thread_info::cpu updated
powerpc: smp: remove hack to obtain offset of task_struct::cpu
sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=y
powerpc: add CPU field to struct thread_info
s390: add CPU field to struct thread_info
x86: add CPU field to struct thread_info
arm64: add CPU field to struct thread_info
Linus Torvalds [Mon, 1 Nov 2021 23:57:36 +0000 (16:57 -0700)]
Merge tag 'm68k-for-v5.16-tag1' of git://git./linux/kernel/git/geert/linux-m68k
Pull m68k updates from Geert Uytterhoeven:
- A small comma vs semicolon cleanup
- defconfig updates
* tag 'm68k-for-v5.16-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
m68k: defconfig: Update defconfigs for v5.15-rc1
m68k: muldi3: Use semicolon instead of comma
Linus Torvalds [Mon, 1 Nov 2021 23:51:13 +0000 (16:51 -0700)]
Merge tag 'for-5.16/parisc-1' of git://git./linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:
"Lots of new features and fixes:
- Added TOC (table of content) support, which is a debugging feature
which is either initiated by pressing the TOC button or via command
in the BMC. If pressed the Linux built-in KDB/KGDB will be called
(Sven Schnelle)
- Fix CONFIG_PREEMPT (Sven)
- Fix unwinder on 64-bit kernels (Sven)
- Various kgdb fixes (Sven)
- Added KFENCE support (me)
- Switch to ARCH_STACKWALK implementation (me)
- Fix ptrace check on syscall return (me)
- Fix kernel crash with fixmaps on PA1.x machines (me)
- Move thread_info into task struct, aka CONFIG_THREAD_INFO_IN_TASK
(me)
- Updated defconfigs
- Smaller cleanups, including Makefile cleanups (Masahiro Yamada),
use kthread_run() macro (Cai Huoqing), use swap() macro (Yihao
Han)"
* tag 'for-5.16/parisc-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: (36 commits)
parisc: Fix set_fixmap() on PA1.x CPUs
parisc: Use swap() to swap values in setup_bootmem()
parisc: Update defconfigs
parisc: decompressor: clean up Makefile
parisc: decompressor: remove repeated depenency of misc.o
parisc: Remove unused constants from asm-offsets.c
parisc/ftrace: use static key to enable/disable function graph tracer
parisc/ftrace: set function trace function
parisc: Make use of the helper macro kthread_run()
parisc: mark xchg functions notrace
parisc: enhance warning regarding usage of O_NONBLOCK
parisc: Drop ifdef __KERNEL__ from non-uapi kernel headers
parisc: Use PRIV_USER and PRIV_KERNEL in ptrace.h
parisc: Use PRIV_USER in syscall.S
parisc/kgdb: add kgdb_roundup() to make kgdb work with idle polling
parisc: Move thread_info into task struct
parisc: add support for TOC (transfer of control)
parisc/firmware: add functions to retrieve TOC data
parisc: add PIM TOC data structures
parisc: move virt_map macro to assembly.h
...
Jean Sacren [Sun, 31 Oct 2021 01:27:28 +0000 (19:27 -0600)]
net: vmxnet3: remove multiple false checks in vmxnet3_ethtool.c
In one if branch, (ec->rx_coalesce_usecs != 0) is checked. When it is
checked again in two more places, it is always false and has no effect
on the whole check expression. We should remove it in both places.
In another if branch, (ec->use_adaptive_rx_coalesce != 0) is checked.
When it is checked again, it is always false. We should remove the
entire branch with it.
In addition we might as well let C precedence dictate by getting rid of
two pairs of parentheses in the neighboring lines in order to keep
expressions on both sides of '||' in balance with checkpatch warning
silenced.
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Link: https://lore.kernel.org/r/20211031012728.8325-1-sakiwit@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Mon, 1 Nov 2021 23:33:53 +0000 (16:33 -0700)]
Merge tag 'arm64-upstream' of git://git./linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"There's the usual summary below, but the highlights are support for
the Armv8.6 timer extensions, KASAN support for asymmetric MTE, the
ability to kexec() with the MMU enabled and a second attempt at
switching to the generic pfn_valid() implementation.
Summary:
- Support for the Arm8.6 timer extensions, including a
self-synchronising view of the system registers to elide some
expensive ISB instructions.
- Exception table cleanup and rework so that the fixup handlers
appear correctly in backtraces.
- A handful of miscellaneous changes, the main one being selection of
CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK.
- More mm and pgtable cleanups.
- KASAN support for "asymmetric" MTE, where tag faults are reported
synchronously for loads (via an exception) and asynchronously for
stores (via a register).
- Support for leaving the MMU enabled during kexec relocation, which
significantly speeds up the operation.
- Minor improvements to our perf PMU drivers.
- Improvements to the compat vDSO build system, particularly when
building with LLVM=1.
- Preparatory work for handling some Coresight TRBE tracing errata.
- Cleanup and refactoring of the SVE code to pave the way for SME
support in future.
- Ensure SCS pages are unpoisoned immediately prior to freeing them
when KASAN is enabled for the vmalloc area.
- Try moving to the generic pfn_valid() implementation again now that
the DMA mapping issue from last time has been resolved.
- Numerous improvements and additions to our FPSIMD and SVE
selftests"
[ armv8.6 timer updates were in a shared branch and already came in
through -tip in the timer pull - Linus ]
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (85 commits)
arm64: Select POSIX_CPU_TIMERS_TASK_WORK
arm64: Document boot requirements for FEAT_SME_FA64
arm64/sve: Fix warnings when SVE is disabled
arm64/sve: Add stub for sve_max_virtualisable_vl()
arm64: errata: Add detection for TRBE write to out-of-range
arm64: errata: Add workaround for TSB flush failures
arm64: errata: Add detection for TRBE overwrite in FILL mode
arm64: Add Neoverse-N2, Cortex-A710 CPU part definition
selftests: arm64: Factor out utility functions for assembly FP tests
arm64: vmlinux.lds.S: remove `.fixup` section
arm64: extable: add load_unaligned_zeropad() handler
arm64: extable: add a dedicated uaccess handler
arm64: extable: add `type` and `data` fields
arm64: extable: use `ex` for `exception_table_entry`
arm64: extable: make fixup_exception() return bool
arm64: extable: consolidate definitions
arm64: gpr-num: support W registers
arm64: factor out GPR numbering helpers
arm64: kvm: use kvm_exception_table_entry
arm64: lib: __arch_copy_to_user(): fold fixups into body
...
Jakub Kicinski [Mon, 1 Nov 2021 23:33:29 +0000 (16:33 -0700)]
Merge branch 'accurate-memory-charging-for-msg_zerocopy'
Talal Ahmad says:
====================
Accurate Memory Charging For MSG_ZEROCOPY
This series improves the accuracy of msg_zerocopy memory accounting.
At present, when msg_zerocopy is used memory is charged twice for the
data - once when user space allocates it, and then again within
__zerocopy_sg_from_iter. The memory charging in the kernel is excessive
because data is held in user pages and is never actually copied to skb
fragments. This leads to incorrectly inflated memory statistics for
programs passing MSG_ZEROCOPY.
We reduce this inaccuracy by introducing the notion of "pure" zerocopy
SKBs - where all the frags in the SKB are backed by pinned userspace
pages, and none are backed by copied pages. For such SKBs, tracked via
the new SKBFL_PURE_ZEROCOPY flag, we elide sk_mem_charge/uncharge
calls, leading to more accurate accounting.
However, SKBs can also be coalesced by the stack at present,
potentially leading to "impure" SKBs. We restrict this coalescing so
it can only happen within the sendmsg() system call itself, for the
most recently allocated SKB. While this can lead to a small degree of
double-charging of memory, this case does not arise often in practice
for workloads that set MSG_ZEROCOPY.
Testing verified that memory usage in the kernel is lowered.
Instrumentation with counters also showed that accounting at time
charging and uncharging is balanced.
====================
Link: https://lore.kernel.org/r/20211030020542.3870542-1-mailtalalahmad@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Talal Ahmad [Sat, 30 Oct 2021 02:05:42 +0000 (22:05 -0400)]
net: avoid double accounting for pure zerocopy skbs
Track skbs with only zerocopy data and avoid charging them to kernel
memory to correctly account the memory utilization for msg_zerocopy.
All of the data in such skbs is held in user pages which are already
accounted to user. Before this change, they are charged again in
kernel in __zerocopy_sg_from_iter. The charging in kernel is
excessive because data is not being copied into skb frags. This
excessive charging can lead to kernel going into memory pressure
state which impacts all sockets in the system adversely. Mark pure
zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
charge/uncharge for data in such skbs.
Initially, an skb is marked pure zerocopy when it is empty and in
zerocopy path. skb can then change from a pure zerocopy skb to mixed
data skb (zerocopy and copy data) if it is at tail of write queue and
there is room available in it and non-zerocopy data is being sent in
the next sendmsg call. At this time sk_mem_charge is done for the pure
zerocopied data and the pure zerocopy flag is unmarked. We found that
this happens very rarely on workloads that pass MSG_ZEROCOPY.
A pure zerocopy skb can later be coalesced into normal skb if they are
next to each other in queue but this patch prevents coalescing from
happening. This avoids complexity of charging when skb downgrades from
pure zerocopy to mixed. This is also rare.
In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
for SKB_TRUESIZE(MAX_TCP_HEADER) is done for sk_mem_charge in
tcp_skb_entail for an skb without data.
Testing with the msg_zerocopy.c benchmark between two hosts(100G nics)
with zerocopy showed that before this patch the 'sock' variable in
memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
change it is 0. This is due to no charge to sk_forward_alloc for
zerocopy data and shows memory utilization for kernel is lowered.
Signed-off-by: Talal Ahmad <talalahmad@google.com>
Acked-by: Arjun Roy <arjunroy@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Talal Ahmad [Sat, 30 Oct 2021 02:05:41 +0000 (22:05 -0400)]
tcp: rename sk_wmem_free_skb
sk_wmem_free_skb() is only used by TCP.
Rename it to make this clear, and move its declaration to
include/net/tcp.h
Signed-off-by: Talal Ahmad <talalahmad@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Mon, 1 Nov 2021 22:18:45 +0000 (15:18 -0700)]
netdevsim: fix uninit value in nsim_drv_configure_vfs()
Build bot points out that I missed initializing ret
after refactoring.
Reported-by: kernel test robot <lkp@intel.com>
Fixes:
1c401078bcf3 ("netdevsim: move details of vf config to dev")
Link: https://lore.kernel.org/r/20211101221845.3188490-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andrii Nakryiko [Mon, 1 Nov 2021 23:01:18 +0000 (16:01 -0700)]
selftests/bpf: Fix also no-alu32 strobemeta selftest
Previous fix aded bpf_clamp_umax() helper use to re-validate boundaries.
While that works correctly, it introduces more branches, which blows up
past 1 million instructions in no-alu32 variant of strobemeta selftests.
Switching len variable from u32 to u64 also fixes the issue and reduces
the number of validated instructions, so use that instead. Fix this
patch and bpf_clamp_umax() removed, both alu32 and no-alu32 selftests
pass.
Fixes:
0133c20480b1 ("selftests/bpf: Fix strobemeta selftest regression")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211101230118.1273019-1-andrii@kernel.org
Linus Torvalds [Mon, 1 Nov 2021 22:54:07 +0000 (15:54 -0700)]
Merge tag 'x86_sgx_for_v5.16_rc1' of git://git./linux/kernel/git/tip/tip
Pull x86 SGX updates from Borislav Petkov:
"Add a SGX_IOC_VEPC_REMOVE ioctl to the /dev/sgx_vepc virt interface
with which EPC pages can be put back into their uninitialized state
without having to reopen /dev/sgx_vepc, which could not be possible
anymore after startup due to security policies"
* tag 'x86_sgx_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sgx/virt: implement SGX_IOC_VEPC_REMOVE ioctl
x86/sgx/virt: extract sgx_vepc_remove_page
Linus Torvalds [Mon, 1 Nov 2021 22:52:26 +0000 (15:52 -0700)]
Merge tag 'x86_sev_for_v5.16_rc1' of git://git./linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:
- Export sev_es_ghcb_hv_call() so that HyperV Isolation VMs can use it
too
- Non-urgent fixes and cleanups
* tag 'x86_sev_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev: Expose sev_es_ghcb_hv_call() for use by HyperV
x86/sev: Allow #VC exceptions on the VC2 stack
x86/sev: Fix stack type check in vc_switch_off_ist()
x86/sme: Use #define USE_EARLY_PGTABLE_L5 in mem_encrypt_identity.c
x86/sev: Carve out HV call's return value verification
Linus Torvalds [Mon, 1 Nov 2021 22:45:14 +0000 (15:45 -0700)]
Merge tag 'x86_misc_for_v5.16_rc1' of git://git./linux/kernel/git/tip/tip
Pull misc x86 changes from Borislav Petkov:
- Use the proper interface for the job: get_unaligned() instead of
memcpy() in the insn decoder
- A randconfig build fix
* tag 'x86_misc_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/insn: Use get_unaligned() instead of memcpy()
x86/Kconfig: Fix an unused variable error in dell-smm-hwmon
Linus Torvalds [Mon, 1 Nov 2021 22:33:54 +0000 (15:33 -0700)]
Merge tag 'x86_cpu_for_v5.16_rc1' of git://git./linux/kernel/git/tip/tip
Pull x86 cpu updates from Borislav Petkov:
- Start checking a CPUID bit on AMD Zen3 which states that the CPU
clears the segment base when a null selector is written. Do the
explicit detection on older CPUs, zen2 and hygon specifically, which
have the functionality but do not advertize the CPUID bit. Factor in
the presence of a hypervisor underneath the kernel and avoid doing
the explicit check there which the HV might've decided to not
advertize for migration safety reasons, or similar.
- Add support for a new X86 CPU vendor: VORTEX. Needed for whitelisting
those CPUs in the hardware vulnerabilities detection
- Force the compiler to use rIP-relative addressing in the fallback
path of static_cpu_has(), in order to avoid unnecessary register
pressure
* tag 'x86_cpu_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Fix migration safety with X86_BUG_NULL_SEL
x86/CPU: Add support for Vortex CPUs
x86/umip: Downgrade warning messages to debug loglevel
x86/asm: Avoid adding register pressure for the init case in static_cpu_has()
x86/asm: Add _ASM_RIP() macro for x86-64 (%rip) suffix
Linus Torvalds [Mon, 1 Nov 2021 22:25:08 +0000 (15:25 -0700)]
Merge tag 'x86_cleanups_for_v5.16_rc1' of git://git./linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:
"The usual round of random minor fixes and cleanups all over the place"
* tag 'x86_cleanups_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/Makefile: Remove unneeded whitespaces before tabs
x86/of: Kill unused early_init_dt_scan_chosen_arch()
x86: Fix misspelled Kconfig symbols
x86/Kconfig: Remove references to obsolete Kconfig symbols
x86/smp: Remove unnecessary assignment to local var freq_scale
Linus Torvalds [Mon, 1 Nov 2021 22:16:52 +0000 (15:16 -0700)]
Merge tag 'x86_cc_for_v5.16_rc1' of git://git./linux/kernel/git/tip/tip
Pull generic confidential computing updates from Borislav Petkov:
"Add an interface called cc_platform_has() which is supposed to be used
by confidential computing solutions to query different aspects of the
system.
The intent behind it is to unify testing of such aspects instead of
having each confidential computing solution add its own set of tests
to code paths in the kernel, leading to an unwieldy mess"
* tag 'x86_cc_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
treewide: Replace the use of mem_encrypt_active() with cc_platform_has()
x86/sev: Replace occurrences of sev_es_active() with cc_platform_has()
x86/sev: Replace occurrences of sev_active() with cc_platform_has()
x86/sme: Replace occurrences of sme_active() with cc_platform_has()
powerpc/pseries/svm: Add a powerpc version of cc_platform_has()
x86/sev: Add an x86 version of cc_platform_has()
arch/cc: Introduce a function to check for confidential computing features
x86/ioremap: Selectively build arch override encryption functions
Linus Torvalds [Mon, 1 Nov 2021 22:14:20 +0000 (15:14 -0700)]
Merge tag 'x86_build_for_v5.16_rc1' of git://git./linux/kernel/git/tip/tip
Pull x86 build fix from Borislav Petkov:
- A single fix to hdimage when using older versions of mtools
* tag 'x86_build_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Fix make hdimage with older versions of mtools
Linus Torvalds [Mon, 1 Nov 2021 22:12:04 +0000 (15:12 -0700)]
Merge tag 'ras_core_for_v5.16_rc1' of git://git./linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:
- Get rid of a bunch of function pointers used in MCA land in favor of
normal functions. This is in preparation of making the MCA code
noinstr-aware
- When the kernel copies data from user addresses and it encounters a
machine check, a SIGBUS is sent to that process. Change this action
to either an -EFAULT which is returned to the user or a short write,
making the recovery action a lot more user-friendly
* tag 'ras_core_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Sort mca_config members to get rid of unnecessary padding
x86/mce: Get rid of the ->quirk_no_way_out() indirect call
x86/mce: Get rid of msr_ops
x86/mce: Get rid of machine_check_vector
x86/mce: Get rid of the mce_severity function pointer
x86/mce: Drop copyin special case for #MC
x86/mce: Change to not send SIGBUS error during copy from user
Linus Torvalds [Mon, 1 Nov 2021 22:05:48 +0000 (15:05 -0700)]
Merge tag 'efi-next-for-v5.16' of git://git./linux/kernel/git/tip/tip
Pull EFI updates from Borislav Petkov:
"The last EFI pull request which is forwarded through the tip tree, for
v5.16. From now on, Ard will be sending stuff directly.
Disable EFI runtime services by default on PREEMPT_RT, while adding
the ability to re-enable them on demand by passing efi=runtime on the
command line"
* tag 'efi-next-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: Allow efi=runtime
efi: Disable runtime services on RT
Linus Torvalds [Mon, 1 Nov 2021 22:02:49 +0000 (15:02 -0700)]
Merge tag 'edac_updates_for_v5.16' of git://git./linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
"A small pile of EDAC updates which the autumn wind blew my way. :)
- amd64_edac: Add support for three-rank interleaving mode which is
present on AMD zen2 servers
- The usual fixes and cleanups all over EDAC land"
* tag 'edac_updates_for_v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/sb_edac: Fix top-of-high-memory value for Broadwell/Haswell
EDAC/ti: Remove redundant error messages
EDAC/amd64: Handle three rank interleaving mode
EDAC/mc_sysfs: Print MC-scope sysfs counters unsigned
EDAC/al_mc: Make use of the helper function devm_add_action_or_reset()
EDAC/mc: Replace strcpy(), sprintf() and snprintf() with strscpy() or scnprintf()
Linus Torvalds [Mon, 1 Nov 2021 21:56:37 +0000 (14:56 -0700)]
mm: fix mismerge of folio page flag manipulators
I had missed a semantic conflict between commit
d389a4a81155 ("mm: Add
folio flag manipulation functions") from the folio tree, and commit
eac96c3efdb5 ("mm: filemap: check if THP has hwpoisoned subpage for PMD
page fault") that added a new set of page flags.
My build tests had too many options enabled, which hid this issue. But
if you didn't have MEMORY_FAILURE or TRANSPARENT_HUGEPAGE enabled, you'd
end up with build errors like this:
include/linux/page-flags.h:806:29: error: macro "PAGEFLAG_FALSE" requires 2 arguments, but only 1 given
806 | PAGEFLAG_FALSE(HasHWPoisoned)
| ^
due to the missing lowercase name used for folio function naming.
Fixes:
49f8275c7d92 ("Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Dumazet [Sun, 31 Oct 2021 17:13:53 +0000 (10:13 -0700)]
bpf: Add missing map_delete_elem method to bloom filter map
Without it, kernel crashes in map_delete_elem(), as reported
by syzbot.
BUG: kernel NULL pointer dereference, address:
0000000000000000
PGD
72c97067 P4D
72c97067 PUD
1e20c067 PMD 0
Oops: 0010 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 6518 Comm: syz-executor196 Not tainted 5.15.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:0x0
Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
RSP: 0018:
ffffc90002bafcb8 EFLAGS:
00010246
RAX:
dffffc0000000000 RBX:
1ffff92000575f9f RCX:
0000000000000000
RDX:
1ffffffff1327aba RSI:
0000000000000000 RDI:
ffff888025a30c00
RBP:
ffffc90002baff08 R08:
0000000000000000 R09:
0000000000000001
R10:
ffffffff818525d8 R11:
0000000000000000 R12:
ffffffff8993d560
R13:
ffff888025a30c00 R14:
ffff888024bc0000 R15:
0000000000000000
FS:
0000555557491300(0000) GS:
ffff8880b9c00000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
ffffffffffffffd6 CR3:
0000000070189000 CR4:
00000000003506f0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Call Trace:
map_delete_elem kernel/bpf/syscall.c:1220 [inline]
__sys_bpf+0x34f1/0x5ee0 kernel/bpf/syscall.c:4606
__do_sys_bpf kernel/bpf/syscall.c:4719 [inline]
__se_sys_bpf kernel/bpf/syscall.c:4717 [inline]
__x64_sys_bpf+0x75/0xb0 kernel/bpf/syscall.c:4717
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
Fixes:
9330986c0300 ("bpf: Add bloom filter map implementation")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211031171353.4092388-1-eric.dumazet@gmail.com
Alexei Starovoitov [Mon, 1 Nov 2021 21:16:03 +0000 (14:16 -0700)]
Merge branch '"map_extra" and bloom filter fixups'
Joanne Koong says:
====================
There are 3 patches in this patchset:
1/3 - Bloom filter naming fixups (kernel/bpf/bloom_filter.c)
2/3 - Add alignment padding for map_extra, rearrange fields in
bpf_map struct to consolidate holes
3/3 - Bloom filter tests (prog_tests/bloom_filter_map):
Add test for successful userspace calls, some refactoring to
use bpf_create_map instead of bpf_create_map_xattr
v1 -> v2:
* In prog_tests/bloom_filter_map: remove unneeded line break,
also change the inner_map_test to use bpf_create_map instead
of bpf_create_map_xattr.
* Add acked-bys to commit messages
====================
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Joanne Koong [Fri, 29 Oct 2021 22:49:09 +0000 (15:49 -0700)]
selftests/bpf: Add bloom map success test for userspace calls
This patch has two changes:
1) Adds a new function "test_success_cases" to test
successfully creating + adding + looking up a value
in a bloom filter map from the userspace side.
2) Use bpf_create_map instead of bpf_create_map_xattr in
the "test_fail_cases" and test_inner_map to make the
code look cleaner.
Signed-off-by: Joanne Koong <joannekoong@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211029224909.1721024-4-joannekoong@fb.com
Joanne Koong [Fri, 29 Oct 2021 22:49:08 +0000 (15:49 -0700)]
bpf: Add alignment padding for "map_extra" + consolidate holes
This patch makes 2 changes regarding alignment padding
for the "map_extra" field.
1) In the kernel header, "map_extra" and "btf_value_type_id"
are rearranged to consolidate the hole.
Before:
struct bpf_map {
...
u32 max_entries; /* 36 4 */
u32 map_flags; /* 40 4 */
/* XXX 4 bytes hole, try to pack */
u64 map_extra; /* 48 8 */
int spin_lock_off; /* 56 4 */
int timer_off; /* 60 4 */
/* --- cacheline 1 boundary (64 bytes) --- */
u32 id; /* 64 4 */
int numa_node; /* 68 4 */
...
bool frozen; /* 117 1 */
/* XXX 10 bytes hole, try to pack */
/* --- cacheline 2 boundary (128 bytes) --- */
...
struct work_struct work; /* 144 72 */
/* --- cacheline 3 boundary (192 bytes) was 24 bytes ago --- */
struct mutex freeze_mutex; /* 216 144 */
/* --- cacheline 5 boundary (320 bytes) was 40 bytes ago --- */
u64 writecnt; /* 360 8 */
/* size: 384, cachelines: 6, members: 26 */
/* sum members: 354, holes: 2, sum holes: 14 */
/* padding: 16 */
/* forced alignments: 2, forced holes: 1, sum forced holes: 10 */
} __attribute__((__aligned__(64)));
After:
struct bpf_map {
...
u32 max_entries; /* 36 4 */
u64 map_extra; /* 40 8 */
u32 map_flags; /* 48 4 */
int spin_lock_off; /* 52 4 */
int timer_off; /* 56 4 */
u32 id; /* 60 4 */
/* --- cacheline 1 boundary (64 bytes) --- */
int numa_node; /* 64 4 */
...
bool frozen /* 113 1 */
/* XXX 14 bytes hole, try to pack */
/* --- cacheline 2 boundary (128 bytes) --- */
...
struct work_struct work; /* 144 72 */
/* --- cacheline 3 boundary (192 bytes) was 24 bytes ago --- */
struct mutex freeze_mutex; /* 216 144 */
/* --- cacheline 5 boundary (320 bytes) was 40 bytes ago --- */
u64 writecnt; /* 360 8 */
/* size: 384, cachelines: 6, members: 26 */
/* sum members: 354, holes: 1, sum holes: 14 */
/* padding: 16 */
/* forced alignments: 2, forced holes: 1, sum forced holes: 14 */
} __attribute__((__aligned__(64)));
2) Add alignment padding to the bpf_map_info struct
More details can be found in commit
36f9814a494a ("bpf: fix uapi hole
for 32 bit compat applications")
Signed-off-by: Joanne Koong <joannekoong@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211029224909.1721024-3-joannekoong@fb.com
Joanne Koong [Fri, 29 Oct 2021 22:49:07 +0000 (15:49 -0700)]
bpf: Bloom filter map naming fixups
This patch has two changes in the kernel bloom filter map
implementation:
1) Change the names of map-ops functions to include the
"bloom_map" prefix.
As Martin pointed out on a previous patchset, having generic
map-ops names may be confusing in tracing and in perf-report.
2) Drop the "& 0xF" when getting nr_hash_funcs, since we
already ascertain that no other bits in map_extra beyond the
first 4 bits can be set.
Signed-off-by: Joanne Koong <joannekoong@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211029224909.1721024-2-joannekoong@fb.com
Alexei Starovoitov [Mon, 1 Nov 2021 21:10:00 +0000 (14:10 -0700)]
Merge branch 'introduce dummy BPF STRUCT_OPS'
Hou Tao says:
====================
Hi,
Currently the test of BPF STRUCT_OPS depends on the specific bpf
implementation (e.g, tcp_congestion_ops), but it can not cover all
basic functionalities (e.g, return value handling), so introduce
a dummy BPF STRUCT_OPS for test purpose.
Instead of loading a userspace-implemeted bpf_dummy_ops map into
kernel and calling the specific function by writing to sysfs provided
by bpf_testmode.ko, only loading bpf_dummy_ops related prog into
kernel and calling these prog by bpf_prog_test_run(). The latter
is more flexible and has no dependency on extra kernel module.
Now the return value handling is supported by test_1(...) ops,
and passing multiple arguments is supported by test_2(...) ops.
If more is needed, test_x(...) ops can be added afterwards.
Comments are always welcome.
Regards,
Hou
Change Log:
v4:
* add Acked-by tags in patch 1~4
* patch 2: remove unncessary comments and update commit message
accordingly
* patch 4: remove unnecessary nr checking in dummy_ops_init_args()
v3: https://www.spinics.net/lists/bpf/msg48303.html
* rebase on bpf-next
* address comments for Martin, mainly include: merge patch 3 &
patch 4 in v2, fix names of btf ctx access check helpers,
handle CONFIG_NET, fix leak in dummy_ops_init_args(), and
simplify bpf_dummy_init()
* patch 4: use a loop to check args in test_dummy_multiple_args()
v2: https://www.spinics.net/lists/bpf/msg47948.html
* rebase on bpf-next
* add test_2(...) ops to test the passing of multiple arguments
* a new patch (patch #2) is added to factor out ctx access helpers
* address comments from Martin & Andrii
v1: https://www.spinics.net/lists/bpf/msg46787.html
RFC: https://www.spinics.net/lists/bpf/msg46117.html
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Hou Tao [Mon, 25 Oct 2021 06:40:25 +0000 (14:40 +0800)]
selftests/bpf: Add test cases for struct_ops prog
Running a BPF_PROG_TYPE_STRUCT_OPS prog for dummy_st_ops::test_N()
through bpf_prog_test_run(). Four test cases are added:
(1) attach dummy_st_ops should fail
(2) function return value of bpf_dummy_ops::test_1() is expected
(3) pointer argument of bpf_dummy_ops::test_1() works as expected
(4) multiple arguments passed to bpf_dummy_ops::test_2() are correct
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20211025064025.2567443-5-houtao1@huawei.com
Hou Tao [Mon, 25 Oct 2021 06:40:24 +0000 (14:40 +0800)]
bpf: Add dummy BPF STRUCT_OPS for test purpose
Currently the test of BPF STRUCT_OPS depends on the specific bpf
implementation of tcp_congestion_ops, but it can not cover all
basic functionalities (e.g, return value handling), so introduce
a dummy BPF STRUCT_OPS for test purpose.
Loading a bpf_dummy_ops implementation from userspace is prohibited,
and its only purpose is to run BPF_PROG_TYPE_STRUCT_OPS program
through bpf(BPF_PROG_TEST_RUN). Now programs for test_1() & test_2()
are supported. The following three cases are exercised in
bpf_dummy_struct_ops_test_run():
(1) test and check the value returned from state arg in test_1(state)
The content of state is copied from userspace pointer and copied back
after calling test_1(state). The user pointer is saved in an u64 array
and the array address is passed through ctx_in.
(2) test and check the return value of test_1(NULL)
Just simulate the case in which an invalid input argument is passed in.
(3) test multiple arguments passing in test_2(state, ...)
5 arguments are passed through ctx_in in form of u64 array. The first
element of array is userspace pointer of state and others 4 arguments
follow.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20211025064025.2567443-4-houtao1@huawei.com
Hou Tao [Mon, 25 Oct 2021 06:40:23 +0000 (14:40 +0800)]
bpf: Factor out helpers for ctx access checking
Factor out two helpers to check the read access of ctx for raw tp
and BTF function. bpf_tracing_ctx_access() is used to check
the read access to argument is valid, and bpf_tracing_btf_ctx_access()
checks whether the btf type of argument is valid besides the checking
of argument read. bpf_tracing_btf_ctx_access() will be used by the
following patch.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20211025064025.2567443-3-houtao1@huawei.com
Hou Tao [Mon, 25 Oct 2021 06:40:22 +0000 (14:40 +0800)]
bpf: Factor out a helper to prepare trampoline for struct_ops prog
Factor out a helper bpf_struct_ops_prepare_trampoline() to prepare
trampoline for BPF_PROG_TYPE_STRUCT_OPS prog. It will be used by
.test_run callback in following patch.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20211025064025.2567443-2-houtao1@huawei.com
Linus Torvalds [Mon, 1 Nov 2021 21:03:56 +0000 (14:03 -0700)]
Merge tag 'x86-fpu-2021-11-01' of git://git./linux/kernel/git/tip/tip
Pull x86 fpu updates from Thomas Gleixner:
- Cleanup of extable fixup handling to be more robust, which in turn
allows to make the FPU exception fixups more robust as well.
- Change the return code for signal frame related failures from
explicit error codes to a boolean fail/success as that's all what the
calling code evaluates.
- A large refactoring of the FPU code to prepare for adding AMX
support:
- Distangle the public header maze and remove especially the
misnomed kitchen sink internal.h which is despite it's name
included all over the place.
- Add a proper abstraction for the register buffer storage (struct
fpstate) which allows to dynamically size the buffer at runtime
by flipping the pointer to the buffer container from the default
container which is embedded in task_struct::tread::fpu to a
dynamically allocated container with a larger register buffer.
- Convert the code over to the new fpstate mechanism.
- Consolidate the KVM FPU handling by moving the FPU related code
into the FPU core which removes the number of exports and avoids
adding even more export when AMX has to be supported in KVM.
This also removes duplicated code which was of course
unnecessary different and incomplete in the KVM copy.
- Simplify the KVM FPU buffer handling by utilizing the new
fpstate container and just switching the buffer pointer from the
user space buffer to the KVM guest buffer when entering
vcpu_run() and flipping it back when leaving the function. This
cuts the memory requirements of a vCPU for FPU buffers in half
and avoids pointless memory copy operations.
This also solves the so far unresolved problem of adding AMX
support because the current FPU buffer handling of KVM inflicted
a circular dependency between adding AMX support to the core and
to KVM. With the new scheme of switching fpstate AMX support can
be added to the core code without affecting KVM.
- Replace various variables with proper data structures so the
extra information required for adding dynamically enabled FPU
features (AMX) can be added in one place
- Add AMX (Advanced Matrix eXtensions) support (finally):
AMX is a large XSTATE component which is going to be available with
Saphire Rapids XEON CPUs. The feature comes with an extra MSR
(MSR_XFD) which allows to trap the (first) use of an AMX related
instruction, which has two benefits:
1) It allows the kernel to control access to the feature
2) It allows the kernel to dynamically allocate the large register
state buffer instead of burdening every task with the the extra
8K or larger state storage.
It would have been great to gain this kind of control already with
AVX512.
The support comes with the following infrastructure components:
1) arch_prctl() to
- read the supported features (equivalent to XGETBV(0))
- read the permitted features for a task
- request permission for a dynamically enabled feature
Permission is granted per process, inherited on fork() and
cleared on exec(). The permission policy of the kernel is
restricted to sigaltstack size validation, but the syscall
obviously allows further restrictions via seccomp etc.
2) A stronger sigaltstack size validation for sys_sigaltstack(2)
which takes granted permissions and the potentially resulting
larger signal frame into account. This mechanism can also be used
to enforce factual sigaltstack validation independent of dynamic
features to help with finding potential victims of the 2K
sigaltstack size constant which is broken since AVX512 support
was added.
3) Exception handling for #NM traps to catch first use of a extended
feature via a new cause MSR. If the exception was caused by the
use of such a feature, the handler checks permission for that
feature. If permission has not been granted, the handler sends a
SIGILL like the #UD handler would do if the feature would have
been disabled in XCR0. If permission has been granted, then a new
fpstate which fits the larger buffer requirement is allocated.
In the unlikely case that this allocation fails, the handler
sends SIGSEGV to the task. That's not elegant, but unavoidable as
the other discussed options of preallocation or full per task
permissions come with their own set of horrors for kernel and/or
userspace. So this is the lesser of the evils and SIGSEGV caused
by unexpected memory allocation failures is not a fundamentally
new concept either.
When allocation succeeds, the fpstate properties are filled in to
reflect the extended feature set and the resulting sizes, the
fpu::fpstate pointer is updated accordingly and the trap is
disarmed for this task permanently.
4) Enumeration and size calculations
5) Trap switching via MSR_XFD
The XFD (eXtended Feature Disable) MSR is context switched with
the same life time rules as the FPU register state itself. The
mechanism is keyed off with a static key which is default
disabled so !AMX equipped CPUs have zero overhead. On AMX enabled
CPUs the overhead is limited by comparing the tasks XFD value
with a per CPU shadow variable to avoid redundant MSR writes. In
case of switching from a AMX using task to a non AMX using task
or vice versa, the extra MSR write is obviously inevitable.
All other places which need to be aware of the variable feature
sets and resulting variable sizes are not affected at all because
they retrieve the information (feature set, sizes) unconditonally
from the fpstate properties.
6) Enable the new AMX states
Note, this is relatively new code despite the fact that AMX support
is in the works for more than a year now.
The big refactoring of the FPU code, which allowed to do a proper
integration has been started exactly 3 weeks ago. Refactoring of the
existing FPU code and of the original AMX patches took a week and has
been subject to extensive review and testing. The only fallout which
has not been caught in review and testing right away was restricted
to AMX enabled systems, which is completely irrelevant for anyone
outside Intel and their early access program. There might be dragons
lurking as usual, but so far the fine grained refactoring has held up
and eventual yet undetected fallout is bisectable and should be
easily addressable before the 5.16 release. Famous last words...
Many thanks to Chang Bae and Dave Hansen for working hard on this and
also to the various test teams at Intel who reserved extra capacity
to follow the rapid development of this closely which provides the
confidence level required to offer this rather large update for
inclusion into 5.16-rc1
* tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
Documentation/x86: Add documentation for using dynamic XSTATE features
x86/fpu: Include vmalloc.h for vzalloc()
selftests/x86/amx: Add context switch test
selftests/x86/amx: Add test cases for AMX state management
x86/fpu/amx: Enable the AMX feature in 64-bit mode
x86/fpu: Add XFD handling for dynamic states
x86/fpu: Calculate the default sizes independently
x86/fpu/amx: Define AMX state components and have it used for boot-time checks
x86/fpu/xstate: Prepare XSAVE feature table for gaps in state component numbers
x86/fpu/xstate: Add fpstate_realloc()/free()
x86/fpu/xstate: Add XFD #NM handler
x86/fpu: Update XFD state where required
x86/fpu: Add sanity checks for XFD
x86/fpu: Add XFD state to fpstate
x86/msr-index: Add MSRs for XFD
x86/cpufeatures: Add eXtended Feature Disabling (XFD) feature bit
x86/fpu: Reset permission and fpstate on exec()
x86/fpu: Prepare fpu_clone() for dynamically enabled features
x86/fpu/signal: Prepare for variable sigframe length
x86/signal: Use fpu::__state_user_size for sigalt stack validation
...
Linus Torvalds [Mon, 1 Nov 2021 21:01:35 +0000 (14:01 -0700)]
Merge tag 'x86-apic-2021-11-01' of git://git./linux/kernel/git/tip/tip
Pull x86/apic update from Thomas Gleixner:
"A single commit which reduces cache misses in __x2apic_send_IPI_mask()
significantly by converting x86_cpu_to_logical_apicid() to an array
instead of using per CPU storage.
This reduces the cost for a full broadcast on a dual socket system
with 256 CPUs from 33 down to 11 microseconds"
* tag 'x86-apic-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Reduce cache line misses in __x2apic_send_IPI_mask()
Linus Torvalds [Mon, 1 Nov 2021 20:48:52 +0000 (13:48 -0700)]
Merge tag 'sched-core-2021-11-01' of git://git./linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
- Revert the printk format based wchan() symbol resolution as it can
leak the raw value in case that the symbol is not resolvable.
- Make wchan() more robust and work with all kind of unwinders by
enforcing that the task stays blocked while unwinding is in progress.
- Prevent sched_fork() from accessing an invalid sched_task_group
- Improve asymmetric packing logic
- Extend scheduler statistics to RT and DL scheduling classes and add
statistics for bandwith burst to the SCHED_FAIR class.
- Properly account SCHED_IDLE entities
- Prevent a potential deadlock when initial priority is assigned to a
newly created kthread. A recent change to plug a race between cpuset
and __sched_setscheduler() introduced a new lock dependency which is
now triggered. Break the lock dependency chain by moving the priority
assignment to the thread function.
- Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
- Improve idle balancing in general and especially for NOHZ enabled
systems.
- Provide proper interfaces for live patching so it does not have to
fiddle with scheduler internals.
- Add cluster aware scheduling support.
- A small set of tweaks for RT (irqwork, wait_task_inactive(), various
scheduler options and delaying mmdrop)
- The usual small tweaks and improvements all over the place
* tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
sched/fair: Cleanup newidle_balance
sched/fair: Remove sysctl_sched_migration_cost condition
sched/fair: Wait before decaying max_newidle_lb_cost
sched/fair: Skip update_blocked_averages if we are defering load balance
sched/fair: Account update_blocked_averages in newidle_balance cost
x86: Fix __get_wchan() for !STACKTRACE
sched,x86: Fix L2 cache mask
sched/core: Remove rq_relock()
sched: Improve wake_up_all_idle_cpus() take #2
irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.
sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
sched: Add cluster scheduler level for x86
sched: Add cluster scheduler level in core and related Kconfig for ARM64
topology: Represent clusters of CPUs within a die
sched: Disable -Wunused-but-set-variable
sched: Add wrapper for get_wchan() to keep task blocked
x86: Fix get_wchan() to support the ORC unwinder
proc: Use task_is_running() for wchan in /proc/$pid/stat
...
Linus Torvalds [Mon, 1 Nov 2021 20:44:55 +0000 (13:44 -0700)]
Merge tag 'timers-core-2021-10-31' of git://git./linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Time, timers and timekeeping updates:
- No core updates
- No new clocksource/event driver
- A large rework of the ARM architected timer driver to prepare for
the support of the upcoming ARMv8.6 support
- Fix Kconfig options for Exynos MCT, Samsung PWM and TI DM timers
- Address a namespace collison in the ARC sp804 timer driver"
* tag 'timers-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/timer-ti-dm: Select TIMER_OF
clocksource/drivers/exynosy: Depend on sub-architecture for Exynos MCT and Samsung PWM
clocksource/drivers/arch_arm_timer: Move workaround synchronisation around
clocksource/drivers/arm_arch_timer: Fix masking for high freq counters
clocksource/drivers/arm_arch_timer: Drop unnecessary ISB on CVAL programming
clocksource/drivers/arm_arch_timer: Remove any trace of the TVAL programming interface
clocksource/drivers/arm_arch_timer: Work around broken CVAL implementations
clocksource/drivers/arm_arch_timer: Advertise 56bit timer to the core code
clocksource/drivers/arm_arch_timer: Move MMIO timer programming over to CVAL
clocksource/drivers/arm_arch_timer: Fix MMIO base address vs callback ordering issue
clocksource/drivers/arm_arch_timer: Move drop _tval from erratum function names
clocksource/drivers/arm_arch_timer: Move system register timer programming over to CVAL
clocksource/drivers/arm_arch_timer: Extend write side of timer register accessors to u64
clocksource/drivers/arm_arch_timer: Drop CNT*_TVAL read accessors
clocksource/arm_arch_timer: Add build-time guards for unhandled register accesses
clocksource/drivers/arc_timer: Eliminate redefined macro error
Linus Torvalds [Mon, 1 Nov 2021 20:24:43 +0000 (13:24 -0700)]
Merge tag 'objtool-core-2021-10-31' of git://git./linux/kernel/git/tip/tip
Pull objtool updates from Thomas Gleixner:
- Improve retpoline code patching by separating it from alternatives
which reduces memory footprint and allows to do better optimizations
in the actual runtime patching.
- Add proper retpoline support for x86/BPF
- Address noinstr warnings in x86/kvm, lockdep and paravirtualization
code
- Add support to handle pv_opsindirect calls in the noinstr analysis
- Classify symbols upfront and cache the result to avoid redundant
str*cmp() invocations.
- Add a CFI hash to reduce memory consumption which also reduces
runtime on a allyesconfig by ~50%
- Adjust XEN code to make objtool handling more robust and as a side
effect to prevent text fragmentation due to placement of the
hypercall page.
* tag 'objtool-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
bpf,x86: Respect X86_FEATURE_RETPOLINE*
bpf,x86: Simplify computing label offsets
x86,bugs: Unconditionally allow spectre_v2=retpoline,amd
x86/alternative: Add debug prints to apply_retpolines()
x86/alternative: Try inline spectre_v2=retpoline,amd
x86/alternative: Handle Jcc __x86_indirect_thunk_\reg
x86/alternative: Implement .retpoline_sites support
x86/retpoline: Create a retpoline thunk array
x86/retpoline: Move the retpoline thunk declarations to nospec-branch.h
x86/asm: Fixup odd GEN-for-each-reg.h usage
x86/asm: Fix register order
x86/retpoline: Remove unused replacement symbols
objtool,x86: Replace alternatives with .retpoline_sites
objtool: Shrink struct instruction
objtool: Explicitly avoid self modifying code in .altinstr_replacement
objtool: Classify symbols
objtool: Support pv_opsindirect calls for noinstr
x86/xen: Rework the xen_{cpu,irq,mmu}_opsarrays
x86/xen: Mark xen_force_evtchn_callback() noinstr
x86/xen: Make irq_disable() noinstr
...
Linus Torvalds [Mon, 1 Nov 2021 20:15:36 +0000 (13:15 -0700)]
Merge tag 'locking-core-2021-10-31' of git://git./linux/kernel/git/tip/tip
Pull locking updates from Thomas Gleixner:
- Move futex code into kernel/futex/ and split up the kitchen sink into
seperate files to make integration of sys_futex_waitv() simpler.
- Add a new sys_futex_waitv() syscall which allows to wait on multiple
futexes.
The main use case is emulating Windows' WaitForMultipleObjects which
allows Wine to improve the performance of Windows Games. Also native
Linux games can benefit from this interface as this is a common wait
pattern for this kind of applications.
- Add context to ww_mutex_trylock() to provide a path for i915 to
rework their eviction code step by step without making lockdep upset
until the final steps of rework are completed. It's also useful for
regulator and TTM to avoid dropping locks in the non contended path.
- Lockdep and might_sleep() cleanups and improvements
- A few improvements for the RT substitutions.
- The usual small improvements and cleanups.
* tag 'locking-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
locking: Remove spin_lock_flags() etc
locking/rwsem: Fix comments about reader optimistic lock stealing conditions
locking: Remove rcu_read_{,un}lock() for preempt_{dis,en}able()
locking/rwsem: Disable preemption for spinning region
docs: futex: Fix kernel-doc references
futex: Fix PREEMPT_RT build
futex2: Documentation: Document sys_futex_waitv() uAPI
selftests: futex: Test sys_futex_waitv() wouldblock
selftests: futex: Test sys_futex_waitv() timeout
selftests: futex: Add sys_futex_waitv() test
futex,arm: Wire up sys_futex_waitv()
futex,x86: Wire up sys_futex_waitv()
futex: Implement sys_futex_waitv()
futex: Simplify double_lock_hb()
futex: Split out wait/wake
futex: Split out requeue
futex: Rename mark_wake_futex()
futex: Rename: match_futex()
futex: Rename: hb_waiter_{inc,dec,pending}()
futex: Split out PI futex
...
Linus Torvalds [Mon, 1 Nov 2021 20:12:15 +0000 (13:12 -0700)]
Merge tag 'perf-core-2021-10-31' of git://git./linux/kernel/git/tip/tip
Pull perf updates from Thomas Gleixner:
"Core:
- Allow ftrace to instrument parts of the perf core code
- Add a new mem_hops field to perf_mem_data_src which allows to
represent intra-node/package or inter-node/off-package details to
prepare for next generation systems which have more hieararchy
within the node/pacakge level.
Tools:
- Update for the new mem_hops field in perf_mem_data_src
Arch:
- A set of constraints fixes for the Intel uncore PMU
- The usual set of small fixes and improvements for x86 and PPC"
* tag 'perf-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix ICL/SPR INST_RETIRED.PREC_DIST encodings
powerpc/perf: Fix data source encodings for L2.1 and L3.1 accesses
tools/perf: Add mem_hops field in perf_mem_data_src structure
perf: Add mem_hops field in perf_mem_data_src structure
perf: Add comment about current state of PERF_MEM_LVL_* namespace and remove an extra line
perf/core: Allow ftrace for functions in kernel/event/core.c
perf/x86: Add new event for AUX output counter index
perf/x86: Add compiler barrier after updating BTS
perf/x86/intel/uncore: Fix Intel SPR M3UPI event constraints
perf/x86/intel/uncore: Fix Intel SPR M2PCIE event constraints
perf/x86/intel/uncore: Fix Intel SPR IIO event constraints
perf/x86/intel/uncore: Fix Intel SPR CHA event constraints
perf/x86/intel/uncore: Fix Intel ICX IIO event constraints
perf/x86/intel/uncore: Fix invalid unit check
perf/x86/intel/uncore: Support extra IMC channel on Ice Lake server
Linus Torvalds [Mon, 1 Nov 2021 20:09:10 +0000 (13:09 -0700)]
Merge tag 'irq-core-2021-10-31' of git://git./linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Updates for the interrupt subsystem:
Core changes:
- Prevent a potential deadlock when initial priority is assigned to a
newly created interrupt thread. A recent change to plug a race
between cpuset and __sched_setscheduler() introduced a new lock
dependency which is now triggered. Break the lock dependency chain
by moving the priority assignment to the thread function.
- A couple of small updates to make the irq core RT safe.
- Confine the irq_cpu_online/offline() API to the only left unfixable
user Cavium Octeon so that it does not grow new usage.
- A small documentation update
Driver changes:
- A large cross architecture rework to move irq_enter/exit() into the
architecture code to make addressing the NOHZ_FULL/RCU issues
simpler.
- The obligatory new irq chip driver for Microchip EIC
- Modularize a few irq chip drivers
- Expand usage of devm_*() helpers throughout the driver code
- The usual small fixes and improvements all over the place"
* tag 'irq-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
h8300: Fix linux/irqchip.h include mess
dt-bindings: irqchip: renesas-irqc: Document r8a774e1 bindings
MIPS: irq: Avoid an unused-variable error
genirq: Hide irq_cpu_{on,off}line() behind a deprecated option
irqchip/mips-gic: Get rid of the reliance on irq_cpu_online()
MIPS: loongson64: Drop call to irq_cpu_offline()
irq: remove handle_domain_{irq,nmi}()
irq: remove CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
irq: riscv: perform irqentry in entry code
irq: openrisc: perform irqentry in entry code
irq: csky: perform irqentry in entry code
irq: arm64: perform irqentry in entry code
irq: arm: perform irqentry in entry code
irq: add a (temporary) CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
irq: nds32: avoid CONFIG_HANDLE_DOMAIN_IRQ
irq: arc: avoid CONFIG_HANDLE_DOMAIN_IRQ
irq: add generic_handle_arch_irq()
irq: unexport handle_irq_desc()
irq: simplify handle_domain_{irq,nmi}()
irq: mips: simplify do_domain_IRQ()
...