platform/kernel/linux-starfive.git
19 months agobpf: use check_ids() for active_lock comparison
Eduard Zingerman [Fri, 9 Dec 2022 13:57:31 +0000 (15:57 +0200)]
bpf: use check_ids() for active_lock comparison

An update for verifier.c:states_equal()/regsafe() to use check_ids()
for active spin lock comparisons. This fixes the issue reported by
Kumar Kartikeya Dwivedi in [1] using technique suggested by Edward Cree.

W/o this commit the verifier might be tricked to accept the following
program working with a map containing spin locks:

  0: r9 = map_lookup_elem(...)  ; Returns PTR_TO_MAP_VALUE_OR_NULL id=1.
  1: r8 = map_lookup_elem(...)  ; Returns PTR_TO_MAP_VALUE_OR_NULL id=2.
  2: if r9 == 0 goto exit       ; r9 -> PTR_TO_MAP_VALUE.
  3: if r8 == 0 goto exit       ; r8 -> PTR_TO_MAP_VALUE.
  4: r7 = ktime_get_ns()        ; Unbound SCALAR_VALUE.
  5: r6 = ktime_get_ns()        ; Unbound SCALAR_VALUE.
  6: bpf_spin_lock(r8)          ; active_lock.id == 2.
  7: if r6 > r7 goto +1         ; No new information about the state
                                ; is derived from this check, thus
                                ; produced verifier states differ only
                                ; in 'insn_idx'.
  8: r9 = r8                    ; Optionally make r9.id == r8.id.
  --- checkpoint ---            ; Assume is_state_visisted() creates a
                                ; checkpoint here.
  9: bpf_spin_unlock(r9)        ; (a,b) active_lock.id == 2.
                                ; (a) r9.id == 2, (b) r9.id == 1.
 10: exit(0)

Consider two verification paths:
(a) 0-10
(b) 0-7,9-10

The path (a) is verified first. If checkpoint is created at (8)
the (b) would assume that (8) is safe because regsafe() does not
compare register ids for registers of type PTR_TO_MAP_VALUE.

[1] https://lore.kernel.org/bpf/20221111202719.982118-1-memxor@gmail.com/

Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Suggested-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20221209135733.28851-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agoselftests/bpf: verify states_equal() maintains idmap across all frames
Eduard Zingerman [Fri, 9 Dec 2022 13:57:30 +0000 (15:57 +0200)]
selftests/bpf: verify states_equal() maintains idmap across all frames

A test case that would erroneously pass verification if
verifier.c:states_equal() maintains separate register ID mappings for
call frames.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20221209135733.28851-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: states_equal() must build idmap for all function frames
Eduard Zingerman [Fri, 9 Dec 2022 13:57:29 +0000 (15:57 +0200)]
bpf: states_equal() must build idmap for all function frames

verifier.c:states_equal() must maintain register ID mapping across all
function frames. Otherwise the following example might be erroneously
marked as safe:

main:
    fp[-24] = map_lookup_elem(...)  ; frame[0].fp[-24].id == 1
    fp[-32] = map_lookup_elem(...)  ; frame[0].fp[-32].id == 2
    r1 = &fp[-24]
    r2 = &fp[-32]
    call foo()
    r0 = 0
    exit

foo:
  0: r9 = r1
  1: r8 = r2
  2: r7 = ktime_get_ns()
  3: r6 = ktime_get_ns()
  4: if (r6 > r7) goto skip_assign
  5: r9 = r8

skip_assign:                ; <--- checkpoint
  6: r9 = *r9               ; (a) frame[1].r9.id == 2
                            ; (b) frame[1].r9.id == 1

  7: if r9 == 0 goto exit:  ; mark_ptr_or_null_regs() transfers != 0 info
                            ; for all regs sharing ID:
                            ;   (a) r9 != 0 => &frame[0].fp[-32] != 0
                            ;   (b) r9 != 0 => &frame[0].fp[-24] != 0

  8: r8 = *r8               ; (a) r8 == &frame[0].fp[-32]
                            ; (b) r8 == &frame[0].fp[-32]
  9: r0 = *r8               ; (a) safe
                            ; (b) unsafe

exit:
 10: exit

While processing call to foo() verifier considers the following
execution paths:

(a) 0-10
(b) 0-4,6-10
(There is also path 0-7,10 but it is not interesting for the issue at
 hand. (a) is verified first.)

Suppose that checkpoint is created at (6) when path (a) is verified,
next path (b) is verified and (6) is reached.

If states_equal() maintains separate 'idmap' for each frame the
mapping at (6) for frame[1] would be empty and
regsafe(r9)::check_ids() would add a pair 2->1 and return true,
which is an error.

If states_equal() maintains single 'idmap' for all frames the mapping
at (6) would be { 1->1, 2->2 } and regsafe(r9)::check_ids() would
return false when trying to add a pair 2->1.

This issue was suggested in the following discussion:
https://lore.kernel.org/bpf/CAEf4BzbFB5g4oUfyxk9rHy-PJSLQ3h8q9mV=rVoXfr_JVm8+1Q@mail.gmail.com/

Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20221209135733.28851-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agoselftests/bpf: test cases for regsafe() bug skipping check_id()
Eduard Zingerman [Fri, 9 Dec 2022 13:57:28 +0000 (15:57 +0200)]
selftests/bpf: test cases for regsafe() bug skipping check_id()

Under certain conditions it was possible for verifier.c:regsafe() to
skip check_id() call. This commit adds negative test cases previously
errorneously accepted as safe.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20221209135733.28851-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: regsafe() must not skip check_ids()
Eduard Zingerman [Fri, 9 Dec 2022 13:57:27 +0000 (15:57 +0200)]
bpf: regsafe() must not skip check_ids()

The verifier.c:regsafe() has the following shortcut:

equal = memcmp(rold, rcur, offsetof(struct bpf_reg_state, parent)) == 0;
...
if (equal)
return true;

Which is executed regardless old register type. This is incorrect for
register types that might have an ID checked by check_ids(), namely:
 - PTR_TO_MAP_KEY
 - PTR_TO_MAP_VALUE
 - PTR_TO_PACKET_META
 - PTR_TO_PACKET

The following pattern could be used to exploit this:

  0: r9 = map_lookup_elem(...)  ; Returns PTR_TO_MAP_VALUE_OR_NULL id=1.
  1: r8 = map_lookup_elem(...)  ; Returns PTR_TO_MAP_VALUE_OR_NULL id=2.
  2: r7 = ktime_get_ns()        ; Unbound SCALAR_VALUE.
  3: r6 = ktime_get_ns()        ; Unbound SCALAR_VALUE.
  4: if r6 > r7 goto +1         ; No new information about the state
                                ; is derived from this check, thus
                                ; produced verifier states differ only
                                ; in 'insn_idx'.
  5: r9 = r8                    ; Optionally make r9.id == r8.id.
  --- checkpoint ---            ; Assume is_state_visisted() creates a
                                ; checkpoint here.
  6: if r9 == 0 goto <exit>     ; Nullness info is propagated to all
                                ; registers with matching ID.
  7: r1 = *(u64 *) r8           ; Not always safe.

Verifier first visits path 1-7 where r8 is verified to be not null
at (6). Later the jump from 4 to 6 is examined. The checkpoint for (6)
looks as follows:
  R8_rD=map_value_or_null(id=2,off=0,ks=4,vs=8,imm=0)
  R9_rwD=map_value_or_null(id=2,off=0,ks=4,vs=8,imm=0)
  R10=fp0

The current state is:
  R0=... R6=... R7=... fp-8=...
  R8=map_value_or_null(id=2,off=0,ks=4,vs=8,imm=0)
  R9=map_value_or_null(id=1,off=0,ks=4,vs=8,imm=0)
  R10=fp0

Note that R8 states are byte-to-byte identical, so regsafe() would
exit early and skip call to check_ids(), thus ID mapping 2->2 will not
be added to 'idmap'. Next, states for R9 are compared: these are not
identical and check_ids() is executed, but 'idmap' is empty, so
check_ids() adds mapping 2->1 to 'idmap' and returns success.

This commit pushes the 'equal' down to register types that don't need
check_ids().

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20221209135733.28851-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agodocs/bpf: Add documentation for BPF_MAP_TYPE_SK_STORAGE
Donald Hunter [Fri, 9 Dec 2022 11:24:01 +0000 (11:24 +0000)]
docs/bpf: Add documentation for BPF_MAP_TYPE_SK_STORAGE

Add documentation for the BPF_MAP_TYPE_SK_STORAGE including
kernel version introduced, usage and examples.

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221209112401.69319-1-donald.hunter@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agoMerge branch 'Dynptr refactorings'
Alexei Starovoitov [Fri, 9 Dec 2022 02:25:32 +0000 (18:25 -0800)]
Merge branch 'Dynptr refactorings'

Kumar Kartikeya Dwivedi says:

====================

This is part 1 of https://lore.kernel.org/bpf/20221018135920.726360-1-memxor@gmail.com.
This thread also gives some background on why the refactor is being done:
https://lore.kernel.org/bpf/CAEf4Bzb4beTHgVo+G+jehSj8oCeAjRbRcm6MRe=Gr+cajRBwEw@mail.gmail.com

As requested in patch 6 by Alexei, it only includes patches which
refactors the code, on top of which further fixes will be made in part
2. The refactor itself fixes another issue as a side effect. No
functional change is intended (except a few modified log messages).

Changelog:
----------
v1 -> v2
v1: https://lore.kernel.org/bpf/20221115000130.1967465-1-memxor@gmail.com

 * Address feedback from Joanne and David, add acks

Fixes v1 -> v1
Fixes v1: https://lore.kernel.org/bpf/20221018135920.726360-1-memxor@gmail.com

 * Collect acks from Joanne and David
 * Fix misc nits pointed out by Joanne, David
 * Split move of reg->off alignment check for dynptr into separate
   change (Alexei)
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agoselftests/bpf: Add test for dynptr reinit in user_ringbuf callback
Kumar Kartikeya Dwivedi [Wed, 7 Dec 2022 20:41:41 +0000 (02:11 +0530)]
selftests/bpf: Add test for dynptr reinit in user_ringbuf callback

The original support for bpf_user_ringbuf_drain callbacks simply
short-circuited checks for the dynptr state, allowing users to pass
PTR_TO_DYNPTR (now CONST_PTR_TO_DYNPTR) to helpers that initialize a
dynptr. This bug would have also surfaced with other dynptr helpers in
the future that changed dynptr view or modified it in some way.

Include test cases for all cases, i.e. both bpf_dynptr_from_mem and
bpf_ringbuf_reserve_dynptr, and ensure verifier rejects both of them.
Without the fix, both of these programs load and pass verification.

While at it, remove sys_nanosleep target from failure cases' SEC
definition, as there is no such tracepoint.

Acked-by: David Vernet <void@manifault.com>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: Use memmove for bpf_dynptr_{read,write}
Kumar Kartikeya Dwivedi [Wed, 7 Dec 2022 20:41:40 +0000 (02:11 +0530)]
bpf: Use memmove for bpf_dynptr_{read,write}

It may happen that destination buffer memory overlaps with memory dynptr
points to. Hence, we must use memmove to correctly copy from dynptr to
destination buffer, or source buffer to dynptr.

This actually isn't a problem right now, as memcpy implementation falls
back to memmove on detecting overlap and warns about it, but we
shouldn't be relying on that.

Acked-by: Joanne Koong <joannelkoong@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: Move PTR_TO_STACK alignment check to process_dynptr_func
Kumar Kartikeya Dwivedi [Wed, 7 Dec 2022 20:41:39 +0000 (02:11 +0530)]
bpf: Move PTR_TO_STACK alignment check to process_dynptr_func

After previous commit, we are minimizing helper specific assumptions
from check_func_arg_reg_off, making it generic, and offloading checks
for a specific argument type to their respective functions called after
check_func_arg_reg_off has been called.

This allows relying on a consistent set of guarantees after that call
and then relying on them in code that deals with registers for each
argument type later. This is in line with how process_spin_lock,
process_timer_func, process_kptr_func check reg->var_off to be constant.
The same reasoning is used here to move the alignment check into
process_dynptr_func. Note that it also needs to check for constant
var_off, and accumulate the constant var_off when computing the spi in
get_spi, but that fix will come in later changes.

Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: Rework check_func_arg_reg_off
Kumar Kartikeya Dwivedi [Wed, 7 Dec 2022 20:41:38 +0000 (02:11 +0530)]
bpf: Rework check_func_arg_reg_off

While check_func_arg_reg_off is the place which performs generic checks
needed by various candidates of reg->type, there is some handling for
special cases, like ARG_PTR_TO_DYNPTR, OBJ_RELEASE, and
ARG_PTR_TO_RINGBUF_MEM.

This commit aims to streamline these special cases and instead leave
other things up to argument type specific code to handle. The function
will be restrictive by default, and cover all possible cases when
OBJ_RELEASE is set, without having to update the function again (and
missing to do that being a bug).

This is done primarily for two reasons: associating back reg->type to
its argument leaves room for the list getting out of sync when a new
reg->type is supported by an arg_type.

The other case is ARG_PTR_TO_RINGBUF_MEM. The problem there is something
we already handle, whenever a release argument is expected, it should
be passed as the pointer that was received from the acquire function.
Hence zero fixed and variable offset.

There is nothing special about ARG_PTR_TO_RINGBUF_MEM, where technically
its target register type PTR_TO_MEM | MEM_RINGBUF can already be passed
with non-zero offset to other helper functions, which makes sense.

Hence, lift the arg_type_is_release check for reg->off and cover all
possible register types, instead of duplicating the same kind of check
twice for current OBJ_RELEASE arg_types (alloc_mem and ptr_to_btf_id).

For the release argument, arg_type_is_dynptr is the special case, where
we go to actual object being freed through the dynptr, so the offset of
the pointer still needs to allow fixed and variable offset and
process_dynptr_func will verify them later for the release argument case
as well.

This is not specific to ARG_PTR_TO_DYNPTR though, we will need to make
this exception for any future object on the stack that needs to be
released. In this sense, PTR_TO_STACK as a candidate for object on stack
argument is a special case for release offset checks, and they need to
be done by the helper releasing the object on stack.

Since the check has been lifted above all register type checks, remove
the duplicated check that is being done for PTR_TO_BTF_ID.

Acked-by: Joanne Koong <joannelkoong@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: Rework process_dynptr_func
Kumar Kartikeya Dwivedi [Wed, 7 Dec 2022 20:41:37 +0000 (02:11 +0530)]
bpf: Rework process_dynptr_func

Recently, user ringbuf support introduced a PTR_TO_DYNPTR register type
for use in callback state, because in case of user ringbuf helpers,
there is no dynptr on the stack that is passed into the callback. To
reflect such a state, a special register type was created.

However, some checks have been bypassed incorrectly during the addition
of this feature. First, for arg_type with MEM_UNINIT flag which
initialize a dynptr, they must be rejected for such register type.
Secondly, in the future, there are plans to add dynptr helpers that
operate on the dynptr itself and may change its offset and other
properties.

In all of these cases, PTR_TO_DYNPTR shouldn't be allowed to be passed
to such helpers, however the current code simply returns 0.

The rejection for helpers that release the dynptr is already handled.

For fixing this, we take a step back and rework existing code in a way
that will allow fitting in all classes of helpers and have a coherent
model for dealing with the variety of use cases in which dynptr is used.

First, for ARG_PTR_TO_DYNPTR, it can either be set alone or together
with a DYNPTR_TYPE_* constant that denotes the only type it accepts.

Next, helpers which initialize a dynptr use MEM_UNINIT to indicate this
fact. To make the distinction clear, use MEM_RDONLY flag to indicate
that the helper only operates on the memory pointed to by the dynptr,
not the dynptr itself. In C parlance, it would be equivalent to taking
the dynptr as a point to const argument.

When either of these flags are not present, the helper is allowed to
mutate both the dynptr itself and also the memory it points to.
Currently, the read only status of the memory is not tracked in the
dynptr, but it would be trivial to add this support inside dynptr state
of the register.

With these changes and renaming PTR_TO_DYNPTR to CONST_PTR_TO_DYNPTR to
better reflect its usage, it can no longer be passed to helpers that
initialize a dynptr, i.e. bpf_dynptr_from_mem, bpf_ringbuf_reserve_dynptr.

A note to reviewers is that in code that does mark_stack_slots_dynptr,
and unmark_stack_slots_dynptr, we implicitly rely on the fact that
PTR_TO_STACK reg is the only case that can reach that code path, as one
cannot pass CONST_PTR_TO_DYNPTR to helpers that don't set MEM_RDONLY. In
both cases such helpers won't be setting that flag.

The next patch will add a couple of selftest cases to make sure this
doesn't break.

Fixes: 205715673844 ("bpf: Add bpf_user_ringbuf_drain() helper")
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: Propagate errors from process_* checks in check_func_arg
Kumar Kartikeya Dwivedi [Wed, 7 Dec 2022 20:41:36 +0000 (02:11 +0530)]
bpf: Propagate errors from process_* checks in check_func_arg

Currently, we simply ignore the errors in process_spin_lock,
process_timer_func, process_kptr_func, process_dynptr_func. Instead,
bubble up the error by storing and checking err variable.

Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: Refactor ARG_PTR_TO_DYNPTR checks into process_dynptr_func
Kumar Kartikeya Dwivedi [Wed, 7 Dec 2022 20:41:35 +0000 (02:11 +0530)]
bpf: Refactor ARG_PTR_TO_DYNPTR checks into process_dynptr_func

ARG_PTR_TO_DYNPTR is akin to ARG_PTR_TO_TIMER, ARG_PTR_TO_KPTR, where
the underlying register type is subjected to more special checks to
determine the type of object represented by the pointer and its state
consistency.

Move dynptr checks to their own 'process_dynptr_func' function so that
is consistent and in-line with existing code. This also makes it easier
to reuse this code for kfunc handling.

Then, reuse this consolidated function in kfunc dynptr handling too.
Note that for kfuncs, the arg_type constraint of DYNPTR_TYPE_LOCAL has
been lifted.

Acked-by: David Vernet <void@manifault.com>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agoMerge branch 'Misc optimizations for bpf mem allocator'
Alexei Starovoitov [Fri, 9 Dec 2022 01:50:17 +0000 (17:50 -0800)]
Merge branch 'Misc optimizations for bpf mem allocator'

Hou Tao says:

====================

From: Hou Tao <houtao1@huawei.com>

Hi,

The patchset is just misc optimizations for bpf mem allocator. Patch 1
fixes the OOM problem found during running hash-table update benchmark
from qp-trie patchset [0]. The benchmark will add htab elements in
batch and then delete elements in batch, so freed objects will stack on
free_by_rcu and wait for the expiration of RCU grace period. There can
be tens of thousands of freed objects and these objects are not
available for new allocation, so adding htab element will continue to do
new allocation.

For the benchmark commmand: "./bench -w3 -d10 -a htab-update -p 16",
even the maximum entries of htab is 16384, key_size is 255 and
value_size is 4, the peak memory usage will reach 14GB or more.
Increasing rcupdate.rcu_task_enqueue_lim will decrease the peak memory to
860MB, but it is still too many. Although the above case is contrived,
it is better to fix it and the fixing is simple: just reusing the freed
objects in free_by_rcu during allocation. After the fix, the peak memory
usage will decrease to 26MB. Beside above case, the memory blow-up
problem is also possible when allocation and freeing are done on total
different CPUs. I'm trying to fix the blow-up problem by using a global
per-cpu work to free these objects in free_by_rcu timely, but it doesn't
work very well and I am still digging into it.

Patch 2 is a left-over patch from rcu_trace_implies_rcu_gp() patchset
[1]. After disscussing with Paul [2], I think it is also safe to skip
rcu_barrier() when rcu_trace_implies_rcu_gp() returns true.

Comments are always welcome.

Change Log:
v2:
  * Patch 1: repharse the commit message (Suggested by Yonghong & Alexei)
  * Add Acked-by for both patch 1 and 2

v1: https://lore.kernel.org/bpf/20221206042946.686847-1-houtao@huaweicloud.com

[0]: https://lore.kernel.org/bpf/20220924133620.4147153-13-houtao@huaweicloud.com/
[1]: https://lore.kernel.org/bpf/20221014113946.965131-1-houtao@huaweicloud.com/
[2]: https://lore.kernel.org/bpf/20221021185002.GP5600@paulmck-ThinkPad-P17-Gen-1/
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true
Hou Tao [Fri, 9 Dec 2022 01:09:47 +0000 (09:09 +0800)]
bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true

If there are pending rcu callback, free_mem_alloc() will use
rcu_barrier_tasks_trace() and rcu_barrier() to wait for the pending
__free_rcu_tasks_trace() and __free_rcu() callback.

If rcu_trace_implies_rcu_gp() is true, there will be no pending
__free_rcu(), so it will be OK to skip rcu_barrier() as well.

Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221209010947.3130477-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: Reuse freed element in free_by_rcu during allocation
Hou Tao [Fri, 9 Dec 2022 01:09:46 +0000 (09:09 +0800)]
bpf: Reuse freed element in free_by_rcu during allocation

When there are batched freeing operations on a specific CPU, part of
the freed elements ((high_watermark - lower_watermark) / 2 + 1) will be
indirectly moved into waiting_for_gp list through free_by_rcu list.
After call_rcu_in_progress becomes false again, the remaining elements
in free_by_rcu list will be moved to waiting_for_gp list by the next
invocation of free_bulk(). However if the expiration of RCU tasks trace
grace period is relatively slow, none element in free_by_rcu list will
be moved.

So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to
allocate a new object, in alloc_bulk() just check whether or not there is
freed element in free_by_rcu list and reuse it if available.

Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221209010947.3130477-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agoselftests/bpf: Bring test_offload.py back to life
Stanislav Fomichev [Tue, 6 Dec 2022 23:27:39 +0000 (15:27 -0800)]
selftests/bpf: Bring test_offload.py back to life

Bpftool has new extra libbpf_det_bind probing map we need to exclude.
Also skip trying to load netdevsim modules if it's already loaded (builtin).

v2:
- drop iproute2->bpftool changes (Toke)

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221206232739.2504890-1-sdf@google.com
19 months agobpf: Fix comment error in fixup_kfunc_call function
Yang Jihong [Thu, 8 Dec 2022 01:37:24 +0000 (09:37 +0800)]
bpf: Fix comment error in fixup_kfunc_call function

insn->imm for kfunc is the relative address of __bpf_call_base,
instead of __bpf_base_call, Fix the comment error.

Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Link: https://lore.kernel.org/r/20221208013724.257848-1-yangjihong1@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: Do not zero-extend kfunc return values
Björn Töpel [Wed, 7 Dec 2022 10:35:40 +0000 (11:35 +0100)]
bpf: Do not zero-extend kfunc return values

In BPF all global functions, and BPF helpers return a 64-bit
value. For kfunc calls, this is not the case, and they can return
e.g. 32-bit values.

The return register R0 for kfuncs calls can therefore be marked as
subreg_def != DEF_NOT_SUBREG. In general, if a register is marked with
subreg_def != DEF_NOT_SUBREG, some archs (where bpf_jit_needs_zext()
returns true) require the verifier to insert explicit zero-extension
instructions.

For kfuncs calls, however, the caller should do sign/zero extension
for return values. In other words, the compiler is responsible to
insert proper instructions, not the verifier.

An example, provided by Yonghong Song:

$ cat t.c
extern unsigned foo(void);
unsigned bar1(void) {
     return foo();
}
unsigned bar2(void) {
     if (foo()) return 10; else return 20;
}

$ clang -target bpf -mcpu=v3 -O2 -c t.c && llvm-objdump -d t.o
t.o:    file format elf64-bpf

Disassembly of section .text:

0000000000000000 <bar1>:
0:       85 10 00 00 ff ff ff ff call -0x1
1:       95 00 00 00 00 00 00 00 exit

0000000000000010 <bar2>:
2:       85 10 00 00 ff ff ff ff call -0x1
3:       bc 01 00 00 00 00 00 00 w1 = w0
4:       b4 00 00 00 14 00 00 00 w0 = 0x14
5:       16 01 01 00 00 00 00 00 if w1 == 0x0 goto +0x1 <LBB1_2>
6:       b4 00 00 00 0a 00 00 00 w0 = 0xa

0000000000000038 <LBB1_2>:
7:       95 00 00 00 00 00 00 00 exit

If the return value of 'foo()' is used in the BPF program, the proper
zero-extension will be done.

Currently, the verifier correctly marks, say, a 32-bit return value as
subreg_def != DEF_NOT_SUBREG, but will fail performing the actual
zero-extension, due to a verifier bug in
opt_subreg_zext_lo32_rnd_hi32(). load_reg is not properly set to R0,
and the following path will be taken:

if (WARN_ON(load_reg == -1)) {
verbose(env, "verifier bug. zext_dst is set, but no reg is defined\n");
return -EFAULT;
}

A longer discussion from v1 can be found in the link below.

Correct the verifier by avoiding doing explicit zero-extension of R0
for kfunc calls. Note that R0 will still be marked as a sub-register
for return values smaller than 64-bit.

Fixes: 83a2881903f3 ("bpf: Account for BPF_FETCH in insn_has_def32()")
Link: https://lore.kernel.org/bpf/20221202103620.1915679-1-bjorn@kernel.org/
Suggested-by: Yonghong Song <yhs@meta.com>
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221207103540.396496-1-bjorn@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agoMerge branch 'Document some recent core kfunc additions'
Alexei Starovoitov [Thu, 8 Dec 2022 01:09:13 +0000 (17:09 -0800)]
Merge branch 'Document some recent core kfunc additions'

David Vernet says:

====================

A series of recent patch sets introduced kfuncs that allowed struct
task_struct and struct cgroup objects to be used as kptrs. These were
introduced in [0], [1], and [2].

[0]: https://lore.kernel.org/lkml/20221120051004.3605026-1-void@manifault.com/
[1]: https://lore.kernel.org/lkml/20221122145300.251210-2-void@manifault.com/T/
[2]: https://lore.kernel.org/lkml/20221122055458.173143-1-void@manifault.com/

These are "core" kfuncs, in that they may be used by a wide variety of
possible BPF tracepoint or struct_ops programs, and are defined in
kernel/bpf/helpers.c. Even though as kfuncs they have no ABI stability
guarantees, they should still be properly documented. This patch set
adds that documentation.

Some other kfuncs were added recently as well, such as
bpf_rcu_read_lock() and bpf_rcu_read_unlock(). Those could and should be
added to this "Core kfuncs" section as well in subsequent patch sets.

Note that this patch set does not contain documentation for
bpf_task_acquire_not_zero(), or bpf_task_kptr_get(). As discussed in
[3], those kfuncs currently always return NULL pending resolution on how
to properly protect their arguments using RCU.

[3]: https://lore.kernel.org/all/20221206210538.597606-1-void@manifault.com/
---
Changelog:
v2 -> v3:
- Don't document bpf_task_kptr_get(), and instead provide a more
  substantive example for bpf_cgroup_kptr_get().
- Further clarify expected behavior of bpf_task_from_pid() in comments
  (Alexei)

v1 -> v2:
- Expand comment to specify that a map holds a reference to a task kptr
  if we don't end up releasing it (Alexei)
- Just read task->pid instead of using a probed read (Alexei)
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf/docs: Document struct cgroup * kfuncs
David Vernet [Wed, 7 Dec 2022 20:49:11 +0000 (14:49 -0600)]
bpf/docs: Document struct cgroup * kfuncs

bpf_cgroup_acquire(), bpf_cgroup_release(), bpf_cgroup_kptr_get(), and
bpf_cgroup_ancestor(), are kfuncs that were recently added to
kernel/bpf/helpers.c. These are "core" kfuncs in that they're available
for use in any tracepoint or struct_ops BPF program. Though they have no
ABI stability guarantees, we should still document them. This patch adds
a struct cgroup * subsection to the Core kfuncs section which describes
each of these kfuncs.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221207204911.873646-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf/docs: Document struct task_struct * kfuncs
David Vernet [Wed, 7 Dec 2022 20:49:10 +0000 (14:49 -0600)]
bpf/docs: Document struct task_struct * kfuncs

bpf_task_acquire(), bpf_task_release(), and bpf_task_from_pid() are
kfuncs that were recently added to kernel/bpf/helpers.c. These are
"core" kfuncs in that they're available for use for any tracepoint or
struct_ops BPF program. Though they have no ABI stability guarantees, we
should still document them. This patch adds a new Core kfuncs section to
the BPF kfuncs doc, and adds entries for all of these task kfuncs.

Note that bpf_task_kptr_get() is not documented, as it still returns
NULL while we're working to resolve how it can use RCU to ensure struct
task_struct * lifetime.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221207204911.873646-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agoselftests/bpf: convert dynptr_fail and map_kptr_fail subtests to generic tester
Andrii Nakryiko [Wed, 7 Dec 2022 20:16:48 +0000 (12:16 -0800)]
selftests/bpf: convert dynptr_fail and map_kptr_fail subtests to generic tester

Convert big chunks of dynptr and map_kptr subtests to use generic
verification_tester. They are switched from using manually maintained
tables of test cases, specifying program name and expected error
verifier message, to btf_decl_tag-based annotations directly on
corresponding BPF programs: __failure to specify that BPF program is
expected to fail verification, and __msg() to specify expected log
message.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207201648.2990661-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agoselftests/bpf: add generic BPF program tester-loader
Andrii Nakryiko [Wed, 7 Dec 2022 20:16:47 +0000 (12:16 -0800)]
selftests/bpf: add generic BPF program tester-loader

It's become a common pattern to have a collection of small BPF programs
in one BPF object file, each representing one test case. On user-space
side of such tests we maintain a table of program names and expected
failure or success, along with optional expected verifier log message.

This works, but each set of tests reimplement this mundane code over and
over again, which is a waste of time for anyone trying to add a new set
of tests. Furthermore, it's quite error prone as it's way too easy to miss
some entries in these manually maintained test tables (as evidences by
dynptr_fail tests, in which ringbuf_release_uninit_dynptr subtest was
accidentally missed; this is fixed in next patch).

So this patch implements generic test_loader, which accepts skeleton
name and handles the rest of details: opens and loads BPF object file,
making sure each program is tested in isolation. Optionally each test
case can specify expected BPF verifier log message. In case of failure,
tester makes sure to report verifier log, but it also reports verifier
log in verbose mode unconditionally.

Now, the interesting deviation from existing custom implementations is
the use of btf_decl_tag attribute to specify expected-to-fail vs
expected-to-succeed markers and, optionally, expected log message
directly next to BPF program source code, eliminating the need to
manually create and update table of tests.

We define few macros wrapping btf_decl_tag with a convention that all
values of btf_decl_tag start with "comment:" prefix, and then utilizing
a very simple "just_some_text_tag" or "some_key_name=<value>" pattern to
define things like expected success/failure, expected verifier message,
extra verifier log level (if necessary). This approach is demonstrated
by next patch in which two existing sets of failure tests are converted.

Tester supports both expected-to-fail and expected-to-succeed programs,
though this patch set didn't convert any existing expected-to-succeed
programs yet, as existing tests couple BPF program loading with their
further execution through attach or test_prog_run. One way to allow
testing scenarios like this would be ability to specify custom callback,
executed for each successfully loaded BPF program. This is left for
follow up patches, after some more analysis of existing test cases.

This test_loader is, hopefully, a start of a test_verifier-like runner,
but integrated into test_progs infrastructure. It will allow much better
"user experience" of defining low-level verification tests that can take
advantage of all the libbpf-provided nicety features on BPF side: global
variables, declarative maps, etc.  All while having a choice of defining
it in C or as BPF assembly (through __attribute__((naked)) functions and
using embedded asm), depending on what makes most sense in each
particular case. This will be explored in follow up patches as well.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221207201648.2990661-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: Remove unused insn_cnt argument from visit_[func_call_]insn()
Andrii Nakryiko [Wed, 7 Dec 2022 19:55:34 +0000 (11:55 -0800)]
bpf: Remove unused insn_cnt argument from visit_[func_call_]insn()

Number of total instructions in BPF program (including subprogs) can and
is accessed from env->prog->len. visit_func_call_insn() doesn't do any
checks against insn_cnt anymore, relying on push_insn() to do this check
internally. So remove unnecessary insn_cnt input argument from
visit_func_call_insn() and visit_insn() functions.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20221207195534.2866030-1-andrii@kernel.org
19 months agoMerge "do not rely on ALLOW_ERROR_INJECTION for fmod_ret" into bpf-next
Alexei Starovoitov [Wed, 7 Dec 2022 21:49:11 +0000 (13:49 -0800)]
Merge "do not rely on ALLOW_ERROR_INJECTION for fmod_ret" into bpf-next

Merge commit 5b481acab4ce ("bpf: do not rely on ALLOW_ERROR_INJECTION for fmod_ret")
from hid tree into bpf-next.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: do not rely on ALLOW_ERROR_INJECTION for fmod_ret
Benjamin Tissoires [Tue, 6 Dec 2022 14:59:32 +0000 (15:59 +0100)]
bpf: do not rely on ALLOW_ERROR_INJECTION for fmod_ret

The current way of expressing that a non-bpf kernel component is willing
to accept that bpf programs can be attached to it and that they can change
the return value is to abuse ALLOW_ERROR_INJECTION.
This is debated in the link below, and the result is that it is not a
reasonable thing to do.

Reuse the kfunc declaration structure to also tag the kernel functions
we want to be fmodret. This way we can control from any subsystem which
functions are being modified by bpf without touching the verifier.

Link: https://lore.kernel.org/all/20221121104403.1545f9b5@gandalf.local.home/
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20221206145936.922196-2-benjamin.tissoires@redhat.com
19 months agonet: xsk: Don't include <linux/rculist.h>
Christophe JAILLET [Sat, 3 Dec 2022 16:51:04 +0000 (17:51 +0100)]
net: xsk: Don't include <linux/rculist.h>

There is no need to include <linux/rculist.h> here.

Prefer the less invasive <linux/types.h> which is needed for 'hlist_head'.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/88d6a1d88764cca328610854f890a9ca1f4b029e.1670086246.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agoMerge branch 'Refactor verifier prune and jump point handling'
Alexei Starovoitov [Wed, 7 Dec 2022 03:14:39 +0000 (19:14 -0800)]
Merge branch 'Refactor verifier prune and jump point handling'

Andrii Nakryiko says:

====================

Disentangle prune and jump points in BPF verifier code. They are conceptually
independent but currently coupled together. This small patch set refactors
related code and make it possible to have some instruction marked as pruning
or jump point independently.

Besides just conceptual cleanliness, this allows to remove unnecessary jump
points (saving a tiny bit of performance and memory usage, potentially), and
even more importantly it allows for clean extension of special pruning points,
similarly to how it's done for BPF_FUNC_timer_set_callback. This will be used
by future patches implementing open-coded BPF iterators.

v1->v2:
  - clarified path #3 commit message and a comment in the code (John);
  - added back mark_jmp_point() to right after subprog call to record
    non-linear implicit jump from BPF_EXIT to right after CALL <subprog>.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: remove unnecessary prune and jump points
Andrii Nakryiko [Tue, 6 Dec 2022 23:33:45 +0000 (15:33 -0800)]
bpf: remove unnecessary prune and jump points

Don't mark some instructions as jump points when there are actually no
jumps and instructions are just processed sequentially. Such case is
handled naturally by precision backtracking logic without the need to
update jump history. See get_prev_insn_idx(). It goes back linearly by
one instruction, unless current top of jmp_history is pointing to
current instruction. In such case we use `st->jmp_history[cnt - 1].prev_idx`
to find instruction from which we jumped to the current instruction
non-linearly.

Also remove both jump and prune point marking for instruction right
after unconditional jumps, as program flow can get to the instruction
right after unconditional jump instruction only if there is a jump to
that instruction from somewhere else in the program. In such case we'll
mark such instruction as prune/jump point because it's a destination of
a jump.

This change has no changes in terms of number of instructions or states
processes across Cilium and selftests programs.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20221206233345.438540-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: mostly decouple jump history management from is_state_visited()
Andrii Nakryiko [Tue, 6 Dec 2022 23:33:44 +0000 (15:33 -0800)]
bpf: mostly decouple jump history management from is_state_visited()

Jump history updating and state equivalence checks are conceptually
independent, so move push_jmp_history() out of is_state_visited(). Also
make a decision whether to perform state equivalence checks or not one
layer higher in do_check(), keeping is_state_visited() unconditionally
performing state checks.

push_jmp_history() should be performed after state checks. There is just
one small non-uniformity. When is_state_visited() finds already
validated equivalent state, it propagates precision marks to current
state's parent chain. For this to work correctly, jump history has to be
updated, so is_state_visited() is doing that internally.

But if no equivalent verified state is found, jump history has to be
updated in a newly cloned child state, so is_jmp_point()
+ push_jmp_history() is performed after is_state_visited() exited with
zero result, which means "proceed with validation".

This change has no functional changes. It's not strictly necessary, but
feels right to decouple these two processes.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221206233345.438540-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: decouple prune and jump points
Andrii Nakryiko [Tue, 6 Dec 2022 23:33:43 +0000 (15:33 -0800)]
bpf: decouple prune and jump points

BPF verifier marks some instructions as prune points. Currently these
prune points serve two purposes.

It's a point where verifier tries to find previously verified state and
check current state's equivalence to short circuit verification for
current code path.

But also currently it's a point where jump history, used for precision
backtracking, is updated. This is done so that non-linear flow of
execution could be properly backtracked.

Such coupling is coincidental and unnecessary. Some prune points are not
part of some non-linear jump path, so don't need update of jump history.
On the other hand, not all instructions which have to be recorded in
jump history necessarily are good prune points.

This patch splits prune and jump points into independent flags.
Currently all prune points are marked as jump points to minimize amount
of changes in this patch, but next patch will perform some optimization
of prune vs jmp point placement.

No functional changes are intended.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221206233345.438540-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: Loosen alloc obj test in verifier's reg_btf_record
Dave Marchevsky [Tue, 6 Dec 2022 23:09:48 +0000 (15:09 -0800)]
bpf: Loosen alloc obj test in verifier's reg_btf_record

btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
There, a BTF record is created for any type containing a spin_lock or
any next-gen datastructure node/head.

Currently, for non-MAP_VALUE types, reg_btf_record will only search for
a record using struct_meta_tab if the reg->type exactly matches
(PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
"allocated obj" type - returned from bpf_obj_new - might pick up other
flags while working its way through the program.

Loosen the check to be exact for base_type and just use MEM_ALLOC mask
for type_flag.

This patch is marked Fixes as the original intent of reg_btf_record was
unlikely to have been to fail finding btf_record for valid alloc obj
types with additional flags, some of which (e.g. PTR_UNTRUSTED)
are valid register type states for alloc obj independent of this series.
However, I didn't find a specific broken repro case outside of this
series' added functionality, so it's possible that nothing was
triggering this logic error before.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: 4e814da0d599 ("bpf: Allow locking bpf_spin_lock in allocated objects")
Link: https://lore.kernel.org/r/20221206231000.3180914-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: Don't use rcu_users to refcount in task kfuncs
David Vernet [Tue, 6 Dec 2022 21:05:38 +0000 (15:05 -0600)]
bpf: Don't use rcu_users to refcount in task kfuncs

A series of prior patches added some kfuncs that allow struct
task_struct * objects to be used as kptrs. These kfuncs leveraged the
'refcount_t rcu_users' field of the task for performing refcounting.
This field was used instead of 'refcount_t usage', as we wanted to
leverage the safety provided by RCU for ensuring a task's lifetime.

A struct task_struct is refcounted by two different refcount_t fields:

1. p->usage:     The "true" refcount field which task lifetime. The
 task is freed as soon as this refcount drops to 0.

2. p->rcu_users: An "RCU users" refcount field which is statically
 initialized to 2, and is co-located in a union with
 a struct rcu_head field (p->rcu). p->rcu_users
 essentially encapsulates a single p->usage
 refcount, and when p->rcu_users goes to 0, an RCU
 callback is scheduled on the struct rcu_head which
 decrements the p->usage refcount.

Our logic was that by using p->rcu_users, we would be able to use RCU to
safely issue refcount_inc_not_zero() a task's rcu_users field to
determine if a task could still be acquired, or was exiting.
Unfortunately, this does not work due to p->rcu_users and p->rcu sharing
a union. When p->rcu_users goes to 0, an RCU callback is scheduled to
drop a single p->usage refcount, and because the fields share a union,
the refcount immediately becomes nonzero again after the callback is
scheduled.

If we were to split the fields out of the union, this wouldn't be a
problem. Doing so should also be rather non-controversial, as there are
a number of places in struct task_struct that have padding which we
could use to avoid growing the structure by splitting up the fields.

For now, so as to fix the kfuncs to be correct, this patch instead
updates bpf_task_acquire() and bpf_task_release() to use the p->usage
field for refcounting via the get_task_struct() and put_task_struct()
functions. Because we can no longer rely on RCU, the change also guts
the bpf_task_acquire_not_zero() and bpf_task_kptr_get() functions
pending a resolution on the above problem.

In addition, the task fixes the kfunc and rcu_read_lock selftests to
expect this new behavior.

Fixes: 90660309b0c7 ("bpf: Add kfuncs for storing struct task_struct * as a kptr")
Fixes: fca1aa75518c ("bpf: Handle MEM_RCU type properly")
Reported-by: Matus Jokay <matus.jokay@stuba.sk>
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221206210538.597606-1-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agoMerge branch 'BPF selftests fixes'
Andrii Nakryiko [Wed, 7 Dec 2022 00:31:26 +0000 (16:31 -0800)]
Merge branch 'BPF selftests fixes'

Daan De Meyer says:

====================

This patch series fixes a few issues I've found while integrating the
bpf selftests into systemd's mkosi development environment.
====================

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
19 months agoselftests/bpf: Use CONFIG_TEST_BPF=m instead of CONFIG_TEST_BPF=y
Daan De Meyer [Mon, 5 Dec 2022 13:16:18 +0000 (14:16 +0100)]
selftests/bpf: Use CONFIG_TEST_BPF=m instead of CONFIG_TEST_BPF=y

CONFIG_TEST_BPF can only be a module, so let's indicate it as such in
the selftests config.

Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221205131618.1524337-4-daan.j.demeyer@gmail.com
19 months agoselftests/bpf: Use "is not set" instead of "=n"
Daan De Meyer [Mon, 5 Dec 2022 13:16:17 +0000 (14:16 +0100)]
selftests/bpf: Use "is not set" instead of "=n"

"=n" is not valid kconfig syntax. Use "is not set" instead to indicate
the option should be disabled.

Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221205131618.1524337-3-daan.j.demeyer@gmail.com
19 months agoselftests/bpf: Install all required files to run selftests
Daan De Meyer [Mon, 5 Dec 2022 13:16:16 +0000 (14:16 +0100)]
selftests/bpf: Install all required files to run selftests

When installing the selftests using
"make -C tools/testing/selftests install", we need to make sure
all the required files to run the selftests are installed. Let's
make sure this is the case.

Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221205131618.1524337-2-daan.j.demeyer@gmail.com
19 months agolibbpf: Parse usdt args without offset on x86 (e.g. 8@(%rsp))
Timo Hunziker [Sat, 3 Dec 2022 12:37:46 +0000 (12:37 +0000)]
libbpf: Parse usdt args without offset on x86 (e.g. 8@(%rsp))

Parse USDT arguments like "8@(%rsp)" on x86. These are emmited by
SystemTap. The argument syntax is similar to the existing "memory
dereference case" but the offset left out as it's zero (i.e. read
the value from the address in the register). We treat it the same
as the the "memory dereference case", but set the offset to 0.

I've tested that this fixes the "unrecognized arg #N spec: 8@(%rsp).."
error I've run into when attaching to a probe with such an argument.
Attaching and reading the correct argument values works.

Something similar might be needed for the other supported
architectures.

  [0] Closes: https://github.com/libbpf/libbpf/issues/559

Signed-off-by: Timo Hunziker <timo.hunziker@gmx.ch>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221203123746.2160-1-timo.hunziker@eclipso.ch
19 months agoselftests/bpf: Allow building bpf tests with CONFIG_XFRM_INTERFACE=[m|n]
Martin KaFai Lau [Tue, 6 Dec 2022 19:35:54 +0000 (11:35 -0800)]
selftests/bpf: Allow building bpf tests with CONFIG_XFRM_INTERFACE=[m|n]

It is useful to use vmlinux.h in the xfrm_info test like other kfunc
tests do.  In particular, it is common for kfunc bpf prog that requires
to use other core kernel structures in vmlinux.h

Although vmlinux.h is preferred, it needs a ___local flavor of
struct bpf_xfrm_info in order to build the bpf selftests
when CONFIG_XFRM_INTERFACE=[m|n].

Cc: Eyal Birger <eyal.birger@gmail.com>
Fixes: 90a3a05eb33f ("selftests/bpf: add xfrm_info tests")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20221206193554.1059757-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpftool: Fix memory leak in do_build_table_cb
Miaoqian Lin [Tue, 6 Dec 2022 07:19:06 +0000 (11:19 +0400)]
bpftool: Fix memory leak in do_build_table_cb

strdup() allocates memory for path. We need to release the memory in the
following error path. Add free() to avoid memory leak.

Fixes: 8f184732b60b ("bpftool: Switch to libbpf's hashmap for pinned paths of BPF objects")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20221206071906.806384-1-linmq006@gmail.com
19 months agoriscv, bpf: Emit fixed-length instructions for BPF_PSEUDO_FUNC
Pu Lehui [Tue, 6 Dec 2022 09:14:10 +0000 (17:14 +0800)]
riscv, bpf: Emit fixed-length instructions for BPF_PSEUDO_FUNC

For BPF_PSEUDO_FUNC instruction, verifier will refill imm with
correct addresses of bpf_calls and then run last pass of JIT.
Since the emit_imm of RV64 is variable-length, which will emit
appropriate length instructions accorroding to the imm, it may
broke ctx->offset, and lead to unpredictable problem, such as
inaccurate jump. So let's fix it with fixed-length instructions.

Fixes: 69c087ba6225 ("bpf: Add bpf_for_each_map_elem() helper")
Suggested-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Björn Töpel <bjorn@kernel.org>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20221206091410.1584784-1-pulehui@huaweicloud.com
19 months agoMerge branch 'xfrm: interface: Add unstable helpers for XFRM metadata'
Martin KaFai Lau [Tue, 6 Dec 2022 05:58:28 +0000 (21:58 -0800)]
Merge branch 'xfrm: interface: Add unstable helpers for XFRM metadata'

Eyal Birger says:

====================

This patch series adds xfrm metadata helpers using the unstable kfunc
call interface for the TC-BPF hooks.

This allows steering traffic towards different IPsec connections based
on logic implemented in bpf programs.

The helpers are integrated into the xfrm_interface module. For this
purpose the main functionality of this module is moved to
xfrm_interface_core.c.
---

changes in v6: fix sparse warning in patch 2
changes in v5:
  - avoid cleanup of percpu dsts as detailed in patch 2
changes in v3:
  - tag bpf-next tree instead of ipsec-next
  - add IFLA_XFRM_COLLECT_METADATA sync patch
====================

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
19 months agoselftests/bpf: add xfrm_info tests
Eyal Birger [Sat, 3 Dec 2022 08:46:59 +0000 (10:46 +0200)]
selftests/bpf: add xfrm_info tests

Test the xfrm_info kfunc helpers.

The test setup creates three name spaces - NS0, NS1, NS2.

XFRM tunnels are setup between NS0 and the two other NSs.

The kfunc helpers are used to steer traffic from NS0 to the other
NSs based on a userspace populated bpf global variable and validate
that the return traffic had arrived from the desired NS.

Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Link: https://lore.kernel.org/r/20221203084659.1837829-5-eyal.birger@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
19 months agotools: add IFLA_XFRM_COLLECT_METADATA to uapi/linux/if_link.h
Eyal Birger [Sat, 3 Dec 2022 08:46:58 +0000 (10:46 +0200)]
tools: add IFLA_XFRM_COLLECT_METADATA to uapi/linux/if_link.h

Needed for XFRM metadata tests.

Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Link: https://lore.kernel.org/r/20221203084659.1837829-4-eyal.birger@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
19 months agoxfrm: interface: Add unstable helpers for setting/getting XFRM metadata from TC-BPF
Eyal Birger [Sat, 3 Dec 2022 08:46:57 +0000 (10:46 +0200)]
xfrm: interface: Add unstable helpers for setting/getting XFRM metadata from TC-BPF

This change adds xfrm metadata helpers using the unstable kfunc call
interface for the TC-BPF hooks. This allows steering traffic towards
different IPsec connections based on logic implemented in bpf programs.

This object is built based on the availability of BTF debug info.

When setting the xfrm metadata, percpu metadata dsts are used in order
to avoid allocating a metadata dst per packet.

In order to guarantee safe module unload, the percpu dsts are allocated
on first use and never freed. The percpu pointer is stored in
net/core/filter.c so that it can be reused on module reload.

The metadata percpu dsts take ownership of the original skb dsts so
that they may be used as part of the xfrm transmission logic - e.g.
for MTU calculations.

Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Link: https://lore.kernel.org/r/20221203084659.1837829-3-eyal.birger@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
19 months agoxfrm: interface: rename xfrm_interface.c to xfrm_interface_core.c
Eyal Birger [Sat, 3 Dec 2022 08:46:56 +0000 (10:46 +0200)]
xfrm: interface: rename xfrm_interface.c to xfrm_interface_core.c

This change allows adding additional files to the xfrm_interface module.

Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Link: https://lore.kernel.org/r/20221203084659.1837829-2-eyal.birger@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
19 months agoselftests/bpf: Fix conflicts with built-in functions in bpf_iter_ksym
James Hilliard [Sat, 3 Dec 2022 01:08:44 +0000 (18:08 -0700)]
selftests/bpf: Fix conflicts with built-in functions in bpf_iter_ksym

Both tolower and toupper are built in c functions, we should not
redefine them as this can result in a build error.

Fixes the following errors:
progs/bpf_iter_ksym.c:10:20: error: conflicting types for built-in function 'tolower'; expected 'int(int)' [-Werror=builtin-declaration-mismatch]
   10 | static inline char tolower(char c)
      |                    ^~~~~~~
progs/bpf_iter_ksym.c:5:1: note: 'tolower' is declared in header '<ctype.h>'
    4 | #include <bpf/bpf_helpers.h>
  +++ |+#include <ctype.h>
    5 |
progs/bpf_iter_ksym.c:17:20: error: conflicting types for built-in function 'toupper'; expected 'int(int)' [-Werror=builtin-declaration-mismatch]
   17 | static inline char toupper(char c)
      |                    ^~~~~~~
progs/bpf_iter_ksym.c:17:20: note: 'toupper' is declared in header '<ctype.h>'

See background on this sort of issue:
https://stackoverflow.com/a/20582607
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=12213

(C99, 7.1.3p1) "All identifiers with external linkage in any of the
following subclauses (including the future library directions) are
always reserved for use as identifiers with external linkage."

This is documented behavior in GCC:
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-std-2

Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221203010847.2191265-1-james.hilliard1@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf, sockmap: fix race in sock_map_free()
Eric Dumazet [Fri, 2 Dec 2022 11:16:40 +0000 (11:16 +0000)]
bpf, sockmap: fix race in sock_map_free()

sock_map_free() calls release_sock(sk) without owning a reference
on the socket. This can cause use-after-free as syzbot found [1]

Jakub Sitnicki already took care of a similar issue
in sock_hash_free() in commit 75e68e5bf2c7 ("bpf, sockhash:
Synchronize delete from bucket list on map free")

[1]
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 0 PID: 3785 at lib/refcount.c:31 refcount_warn_saturate+0x17c/0x1a0 lib/refcount.c:31
Modules linked in:
CPU: 0 PID: 3785 Comm: kworker/u4:6 Not tainted 6.1.0-rc7-syzkaller-00103-gef4d3ea40565 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
Workqueue: events_unbound bpf_map_free_deferred
RIP: 0010:refcount_warn_saturate+0x17c/0x1a0 lib/refcount.c:31
Code: 68 8b 31 c0 e8 75 71 15 fd 0f 0b e9 64 ff ff ff e8 d9 6e 4e fd c6 05 62 9c 3d 0a 01 48 c7 c7 80 bb 68 8b 31 c0 e8 54 71 15 fd <0f> 0b e9 43 ff ff ff 89 d9 80 e1 07 80 c1 03 38 c1 0f 8c a2 fe ff
RSP: 0018:ffffc9000456fb60 EFLAGS: 00010246
RAX: eae59bab72dcd700 RBX: 0000000000000004 RCX: ffff8880207057c0
RDX: 0000000000000000 RSI: 0000000000000201 RDI: 0000000000000000
RBP: 0000000000000004 R08: ffffffff816fdabd R09: fffff520008adee5
R10: fffff520008adee5 R11: 1ffff920008adee4 R12: 0000000000000004
R13: dffffc0000000000 R14: ffff88807b1c6c00 R15: 1ffff1100f638dcf
FS: 0000000000000000(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b30c30000 CR3: 000000000d08e000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__refcount_dec include/linux/refcount.h:344 [inline]
refcount_dec include/linux/refcount.h:359 [inline]
__sock_put include/net/sock.h:779 [inline]
tcp_release_cb+0x2d0/0x360 net/ipv4/tcp_output.c:1092
release_sock+0xaf/0x1c0 net/core/sock.c:3468
sock_map_free+0x219/0x2c0 net/core/sock_map.c:356
process_one_work+0x81c/0xd10 kernel/workqueue.c:2289
worker_thread+0xb14/0x1330 kernel/workqueue.c:2436
kthread+0x266/0x300 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
</TASK>

Fixes: 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Song Liu <songliubraving@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20221202111640.2745533-1-edumazet@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: Add dummy type reference to nf_conn___init to fix type deduplication
Toke Høiland-Jørgensen [Thu, 1 Dec 2022 12:39:39 +0000 (13:39 +0100)]
bpf: Add dummy type reference to nf_conn___init to fix type deduplication

The bpf_ct_set_nat_info() kfunc is defined in the nf_nat.ko module, and
takes as a parameter the nf_conn___init struct, which is allocated through
the bpf_xdp_ct_alloc() helper defined in the nf_conntrack.ko module.
However, because kernel modules can't deduplicate BTF types between each
other, and the nf_conn___init struct is not referenced anywhere in vmlinux
BTF, this leads to two distinct BTF IDs for the same type (one in each
module). This confuses the verifier, as described here:

https://lore.kernel.org/all/87leoh372s.fsf@toke.dk/

As a workaround, add an explicit BTF_TYPE_EMIT for the type in
net/filter.c, so the type definition gets included in vmlinux BTF. This
way, both modules can refer to the same type ID (as they both build on top
of vmlinux BTF), and the verifier is no longer confused.

v2:

- Use BTF_TYPE_EMIT (which is a statement so it has to be inside a function
  definition; use xdp_func_proto() for this, since this is mostly
  xdp-related).

Fixes: 820dc0523e05 ("net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.c")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20221201123939.696558-1-toke@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: Add sleepable prog tests for cgrp local storage
Yonghong Song [Thu, 1 Dec 2022 05:04:49 +0000 (21:04 -0800)]
bpf: Add sleepable prog tests for cgrp local storage

Add three tests for cgrp local storage support for sleepable progs.
Two tests can load and run properly, one for cgroup_iter, another
for passing current->cgroups->dfl_cgrp to bpf_cgrp_storage_get()
helper. One test has bpf_rcu_read_lock() and failed to load.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221201050449.2785613-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: Enable sleeptable support for cgrp local storage
Yonghong Song [Thu, 1 Dec 2022 05:04:44 +0000 (21:04 -0800)]
bpf: Enable sleeptable support for cgrp local storage

Similar to sk/inode/task local storage, enable sleepable support for
cgrp local storage.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221201050444.2785007-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf, docs: BPF Iterator Document
Sreevani Sreejith [Fri, 2 Dec 2022 22:17:10 +0000 (14:17 -0800)]
bpf, docs: BPF Iterator Document

Document that describes how BPF iterators work, how to use iterators,
and how to pass parameters in BPF iterators.

Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Sreevani Sreejith <psreep@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221202221710.320810-2-ssreevani@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: Do not mark certain LSM hook arguments as trusted
Yonghong Song [Sat, 3 Dec 2022 20:49:54 +0000 (12:49 -0800)]
bpf: Do not mark certain LSM hook arguments as trusted

Martin mentioned that the verifier cannot assume arguments from
LSM hook sk_alloc_security being trusted since after the hook
is called, the sk ref_count is set to 1. This will overwrite
the ref_count changed by the bpf program and may cause ref_count
underflow later on.

I then further checked some other hooks. For example,
for bpf_lsm_file_alloc() hook in fs/file_table.c,

        f->f_cred = get_cred(cred);
        error = security_file_alloc(f);
        if (unlikely(error)) {
                file_free_rcu(&f->f_rcuhead);
                return ERR_PTR(error);
        }

        atomic_long_set(&f->f_count, 1);

The input parameter 'f' to security_file_alloc() cannot be trusted
as well.

Specifically, I investiaged bpf_map/bpf_prog/file/sk/task alloc/free
lsm hooks. Except bpf_map_alloc and task_alloc, arguments for all other
hooks should not be considered as trusted. This may not be a complete
list, but it covers common usage for sk and task.

Fixes: 3f00c5239344 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs")
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221203204954.2043348-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agoMerge branch 'bpf: Handle MEM_RCU type properly'
Alexei Starovoitov [Sun, 4 Dec 2022 20:52:40 +0000 (12:52 -0800)]
Merge branch 'bpf: Handle MEM_RCU type properly'

Yonghong Song says:

====================

Patch set [1] added rcu support for bpf programs. In [1], a rcu
pointer is considered to be trusted and not null. This is actually
not true in some cases. The rcu pointer could be null, and for non-null
rcu pointer, it may have reference count of 0. This small patch set
fixed this problem. Patch 1 is the kernel fix. Patch 2 adjusted
selftests properly. Patch 3 added documentation for newly-introduced
KF_RCU flag.

  [1] https://lore.kernel.org/all/20221124053201.2372298-1-yhs@fb.com/

Changelogs:
  v1 -> v2:
    - rcu ptr could be NULL.
    - non_null_rcu_ptr->rcu_field can be marked as MEM_RCU as well.
    - Adjust the code to avoid existing error message change.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agodocs/bpf: Add KF_RCU documentation
Yonghong Song [Sat, 3 Dec 2022 18:46:13 +0000 (10:46 -0800)]
docs/bpf: Add KF_RCU documentation

Add proper KF_RCU documentation in kfuncs.rst.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221203184613.478967-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agoselftests/bpf: Fix rcu_read_lock test with new MEM_RCU semantics
Yonghong Song [Sat, 3 Dec 2022 18:46:07 +0000 (10:46 -0800)]
selftests/bpf: Fix rcu_read_lock test with new MEM_RCU semantics

Add MEM_RCU pointer null checking for related tests. Also
modified task_acquire test so it takes a rcu ptr 'ptr' where
'ptr = rcu_ptr->rcu_field'.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221203184607.478314-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agobpf: Handle MEM_RCU type properly
Yonghong Song [Sat, 3 Dec 2022 18:46:02 +0000 (10:46 -0800)]
bpf: Handle MEM_RCU type properly

Commit 9bb00b2895cb ("bpf: Add kfunc bpf_rcu_read_lock/unlock()")
introduced MEM_RCU and bpf_rcu_read_lock/unlock() support. In that
commit, a rcu pointer is tagged with both MEM_RCU and PTR_TRUSTED
so that it can be passed into kfuncs or helpers as an argument.

Martin raised a good question in [1] such that the rcu pointer,
although being able to accessing the object, might have reference
count of 0. This might cause a problem if the rcu pointer is passed
to a kfunc which expects trusted arguments where ref count should
be greater than 0.

This patch makes the following changes related to MEM_RCU pointer:
  - MEM_RCU pointer might be NULL (PTR_MAYBE_NULL).
  - Introduce KF_RCU so MEM_RCU ptr can be acquired with
    a KF_RCU tagged kfunc which assumes ref count of rcu ptr
    could be zero.
  - For mem access 'b = ptr->a', say 'ptr' is a MEM_RCU ptr, and
    'a' is tagged with __rcu as well. Let us mark 'b' as
    MEM_RCU | PTR_MAYBE_NULL.

 [1] https://lore.kernel.org/bpf/ac70f574-4023-664e-b711-e0d3b18117fd@linux.dev/

Fixes: 9bb00b2895cb ("bpf: Add kfunc bpf_rcu_read_lock/unlock()")
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221203184602.477272-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
19 months agolibbpf: Improve usability of libbpf Makefile
Xin Liu [Fri, 2 Dec 2022 08:17:38 +0000 (16:17 +0800)]
libbpf: Improve usability of libbpf Makefile

Current libbpf Makefile does not contain the help command, which
is inconvenient to use. Similar to the Makefile help command of the
perf, a help command is provided to list the commands supported by
libbpf make and the functions of the commands.

Signed-off-by: Xin Liu <liuxin350@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221202081738.128513-1-liuxin350@huawei.com
19 months agoselftests/bpf: Add GCC compatible builtins to bpf_legacy.h
James Hilliard [Thu, 1 Dec 2022 19:09:39 +0000 (12:09 -0700)]
selftests/bpf: Add GCC compatible builtins to bpf_legacy.h

The bpf_legacy.h header uses llvm specific load functions, add
GCC compatible variants as well to fix tests using these functions
under GCC.

Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221201190939.3230513-1-james.hilliard1@gmail.com
19 months agobpf, docs: Correct the example of BPF_XOR
Zheng Yejian [Tue, 29 Nov 2022 13:45:58 +0000 (21:45 +0800)]
bpf, docs: Correct the example of BPF_XOR

Refer to description of BPF_XOR, dst_reg should be used but not src_reg
in the examples.

Fixes: be3193cded9d ("bpf, docs: Add subsections for ALU and JMP instructions")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20221129134558.2757043-1-zhengyejian1@huawei.com
20 months agoselftests/bpf: Validate multiple ref release_on_unlock logic
Dave Marchevsky [Thu, 1 Dec 2022 18:34:06 +0000 (10:34 -0800)]
selftests/bpf: Validate multiple ref release_on_unlock logic

Modify list_push_pop_multiple to alloc and insert nodes 2-at-a-time.
Without the previous patch's fix, this block of code:

  bpf_spin_lock(lock);
  bpf_list_push_front(head, &f[i]->node);
  bpf_list_push_front(head, &f[i + 1]->node);
  bpf_spin_unlock(lock);

would fail check_reference_leak check as release_on_unlock logic would miss
a ref that should've been released.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221201183406.1203621-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
20 months agobpf: Fix release_on_unlock release logic for multiple refs
Dave Marchevsky [Thu, 1 Dec 2022 18:34:05 +0000 (10:34 -0800)]
bpf: Fix release_on_unlock release logic for multiple refs

Consider a verifier state with three acquired references, all with
release_on_unlock = true:

            idx  0 1 2
  state->refs = [2 4 6]

(with 2, 4, and 6 being the ref ids).

When bpf_spin_unlock is called, process_spin_lock will loop through all
acquired_refs and, for each ref, if it's release_on_unlock, calls
release_reference on it. That function in turn calls
release_reference_state, which removes the reference from state->refs by
swapping the reference state with the last reference state in
refs array and decrements acquired_refs count.

process_spin_lock's loop logic, which is essentially:

  for (i = 0; i < state->acquired_refs; i++) {
    if (!state->refs[i].release_on_unlock)
      continue;
    release_reference(state->refs[i].id);
  }

will fail to release release_on_unlock references which are swapped from
the end. Running this logic on our example demonstrates:

  state->refs = [2 4 6] (start of idx=0 iter)
    release state->refs[0] by swapping w/ state->refs[2]

  state->refs = [6 4]   (start of idx=1)
    release state->refs[1], no need to swap as it's the last idx

  state->refs = [6]     (start of idx=2, loop terminates)

ref_id 6 should have been removed but was skipped.

Fix this by looping from back-to-front, which results in refs that are
candidates for removal being swapped with refs which have already been
examined and kept.

If we modify our initial example such that ref 6 is replaced with ref 7,
which is _not_ release_on_unlock, and loop from the back, we'd see:

  state->refs = [2 4 7] (start of idx=2)

  state->refs = [2 4 7] (start of idx=1)

  state->refs = [2 7]   (start of idx=0, refs 7 and 4 swapped)

  state->refs = [7]     (after idx=0, 7 and 2 swapped, loop terminates)

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: 534e86bc6c66 ("bpf: Add 'release on unlock' logic for bpf_list_push_{front,back}")
Link: https://lore.kernel.org/r/20221201183406.1203621-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
20 months agobpf: Fix a compilation failure with clang lto build
Yonghong Song [Wed, 30 Nov 2022 05:21:47 +0000 (21:21 -0800)]
bpf: Fix a compilation failure with clang lto build

When building the kernel with clang lto (CONFIG_LTO_CLANG_FULL=y), the
following compilation error will appear:

  $ make LLVM=1 LLVM_IAS=1 -j
  ...
  ld.lld: error: ld-temp.o <inline asm>:26889:1: symbol 'cgroup_storage_map_btf_ids' is already defined
  cgroup_storage_map_btf_ids:;
  ^
  make[1]: *** [/.../bpf-next/scripts/Makefile.vmlinux_o:61: vmlinux.o] Error 1

In local_storage.c, we have
  BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, bpf_local_storage_map)
Commit c4bcfb38a95e ("bpf: Implement cgroup storage available to
non-cgroup-attached bpf progs") added the above identical BTF_ID_LIST_SINGLE
definition in bpf_cgrp_storage.c. With duplicated definitions, llvm linker
complains with lto build.

Also, extracting btf_id of 'struct bpf_local_storage_map' is defined four times
for sk, inode, task and cgrp local storages. Let us define a single global one
with a different name than cgroup_storage_map_btf_ids, which also fixed
the lto compilation error.

Fixes: c4bcfb38a95e ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221130052147.1591625-1-yhs@fb.com
20 months agoselftests/bpf: Add ingress tests for txmsg with apply_bytes
Pengcheng Yang [Tue, 29 Nov 2022 10:40:41 +0000 (18:40 +0800)]
selftests/bpf: Add ingress tests for txmsg with apply_bytes

Currently, the ingress redirect is not covered in "txmsg test apply".

Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/1669718441-2654-5-git-send-email-yangpc@wangsu.com
20 months agobpf, sockmap: Fix data loss caused by using apply_bytes on ingress redirect
Pengcheng Yang [Tue, 29 Nov 2022 10:40:40 +0000 (18:40 +0800)]
bpf, sockmap: Fix data loss caused by using apply_bytes on ingress redirect

Use apply_bytes on ingress redirect, when apply_bytes is less than
the length of msg data, some data may be skipped and lost in
bpf_tcp_ingress().

If there is still data in the scatterlist that has not been consumed,
we cannot move the msg iter.

Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/1669718441-2654-4-git-send-email-yangpc@wangsu.com
20 months agobpf, sockmap: Fix missing BPF_F_INGRESS flag when using apply_bytes
Pengcheng Yang [Tue, 29 Nov 2022 10:40:39 +0000 (18:40 +0800)]
bpf, sockmap: Fix missing BPF_F_INGRESS flag when using apply_bytes

When redirecting, we use sk_msg_to_ingress() to get the BPF_F_INGRESS
flag from the msg->flags. If apply_bytes is used and it is larger than
the current data being processed, sk_psock_msg_verdict() will not be
called when sendmsg() is called again. At this time, the msg->flags is 0,
and we lost the BPF_F_INGRESS flag.

So we need to save the BPF_F_INGRESS flag in sk_psock and use it when
redirection.

Fixes: 8934ce2fd081 ("bpf: sockmap redirect ingress support")
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/1669718441-2654-3-git-send-email-yangpc@wangsu.com
20 months agobpf, sockmap: Fix repeated calls to sock_put() when msg has more_data
Pengcheng Yang [Tue, 29 Nov 2022 10:40:38 +0000 (18:40 +0800)]
bpf, sockmap: Fix repeated calls to sock_put() when msg has more_data

In tcp_bpf_send_verdict() redirection, the eval variable is assigned to
__SK_REDIRECT after the apply_bytes data is sent, if msg has more_data,
sock_put() will be called multiple times.

We should reset the eval variable to __SK_NONE every time more_data
starts.

This causes:

IPv4: Attempt to release TCP socket in state 1 00000000b4c925d7
------------[ cut here ]------------
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 5 PID: 4482 at lib/refcount.c:25 refcount_warn_saturate+0x7d/0x110
Modules linked in:
CPU: 5 PID: 4482 Comm: sockhash_bypass Kdump: loaded Not tainted 6.0.0 #1
Hardware name: Red Hat KVM, BIOS 1.11.0-2.el7 04/01/2014
Call Trace:
 <TASK>
 __tcp_transmit_skb+0xa1b/0xb90
 ? __alloc_skb+0x8c/0x1a0
 ? __kmalloc_node_track_caller+0x184/0x320
 tcp_write_xmit+0x22a/0x1110
 __tcp_push_pending_frames+0x32/0xf0
 do_tcp_sendpages+0x62d/0x640
 tcp_bpf_push+0xae/0x2c0
 tcp_bpf_sendmsg_redir+0x260/0x410
 ? preempt_count_add+0x70/0xa0
 tcp_bpf_send_verdict+0x386/0x4b0
 tcp_bpf_sendmsg+0x21b/0x3b0
 sock_sendmsg+0x58/0x70
 __sys_sendto+0xfa/0x170
 ? xfd_validate_state+0x1d/0x80
 ? switch_fpu_return+0x59/0xe0
 __x64_sys_sendto+0x24/0x30
 do_syscall_64+0x37/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: cd9733f5d75c ("tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function")
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/1669718441-2654-2-git-send-email-yangpc@wangsu.com
20 months agobpf: Tighten ptr_to_btf_id checks.
Alexei Starovoitov [Fri, 25 Nov 2022 22:06:17 +0000 (14:06 -0800)]
bpf: Tighten ptr_to_btf_id checks.

The networking programs typically don't require CAP_PERFMON, but through kfuncs
like bpf_cast_to_kern_ctx() they can access memory through PTR_TO_BTF_ID. In
such case enforce CAP_PERFMON.
Also make sure that only GPL programs can access kernel data structures.
All kfuncs require GPL already.

Also remove allow_ptr_to_map_access. It's the same as allow_ptr_leaks and
different name for the same check only causes confusion.

Fixes: fd264ca02094 ("bpf: Add a kfunc to type cast from bpf uapi ctx to kernel ctx")
Fixes: 50c6b8a9aea2 ("selftests/bpf: Add a test for btf_type_tag "percpu"")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221125220617.26846-1-alexei.starovoitov@gmail.com
20 months agoselftests/bpf: Add bench test to arm64 and s390x denylist
Daniel Borkmann [Wed, 30 Nov 2022 22:20:49 +0000 (23:20 +0100)]
selftests/bpf: Add bench test to arm64 and s390x denylist

BPF CI fails for arm64 and s390x each with the following result:

  [...]
  All error logs:

  serial_test_kprobe_multi_bench_attach:PASS:get_syms 0 nsec
  serial_test_kprobe_multi_bench_attach:PASS:kprobe_multi_empty__open_and_load 0 nsec
  libbpf: prog 'test_kprobe_empty': failed to attach: Operation not supported
  serial_test_kprobe_multi_bench_attach:FAIL:bpf_program__attach_kprobe_multi_opts unexpected error: -95
  #92      kprobe_multi_bench_attach:FAIL
  [...]

Add the test to the deny list.

Fixes: 5b6c7e5c4434 ("selftests/bpf: Add attach bench test")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
20 months agoselftests/bpf: Make sure enum-less bpf_enable_stats() API works in C++ mode
Andrii Nakryiko [Wed, 30 Nov 2022 20:00:13 +0000 (12:00 -0800)]
selftests/bpf: Make sure enum-less bpf_enable_stats() API works in C++ mode

Just a simple test to make sure we don't introduce unwanted compiler
warnings and API still supports passing enums as input argument.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20221130200013.2997831-2-andrii@kernel.org
20 months agolibbpf: Avoid enum forward-declarations in public API in C++ mode
Andrii Nakryiko [Wed, 30 Nov 2022 20:00:12 +0000 (12:00 -0800)]
libbpf: Avoid enum forward-declarations in public API in C++ mode

C++ enum forward declarations are fundamentally not compatible with pure
C enum definitions, and so libbpf's use of `enum bpf_stats_type;`
forward declaration in libbpf/bpf.h public API header is causing C++
compilation issues.

More details can be found in [0], but it comes down to C++ supporting
enum forward declaration only with explicitly specified backing type:

  enum bpf_stats_type: int;

In C (and I believe it's a GCC extension also), such forward declaration
is simply:

  enum bpf_stats_type;

Further, in Linux UAPI this enum is defined in pure C way:

enum bpf_stats_type { BPF_STATS_RUN_TIME = 0; }

And even though in both cases backing type is int, which can be
confirmed by looking at DWARF information, for C++ compiler actual enum
definition and forward declaration are incompatible.

To eliminate this problem, for C++ mode define input argument as int,
which makes enum unnecessary in libbpf public header. This solves the
issue and as demonstrated by next patch doesn't cause any unwanted
compiler warnings, at least with default warnings setting.

  [0] https://stackoverflow.com/questions/42766839/c11-enum-forward-causes-underlying-type-mismatch
  [1] Closes: https://github.com/libbpf/libbpf/issues/249

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20221130200013.2997831-1-andrii@kernel.org
20 months agoselftests/bpf: Avoid pinning prog when attaching to tc ingress in btf_skc_cls_ingress
Martin KaFai Lau [Tue, 29 Nov 2022 07:09:00 +0000 (23:09 -0800)]
selftests/bpf: Avoid pinning prog when attaching to tc ingress in btf_skc_cls_ingress

This patch removes the need to pin prog when attaching to tc ingress
in the btf_skc_cls_ingress test.  Instead, directly use the
bpf_tc_hook_create() and bpf_tc_attach().  The qdisc clsact
will go away together with the netns, so no need to
bpf_tc_hook_destroy().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20221129070900.3142427-8-martin.lau@linux.dev
20 months agoselftests/bpf: Remove serial from tests using {open,close}_netns
Martin KaFai Lau [Tue, 29 Nov 2022 07:08:59 +0000 (23:08 -0800)]
selftests/bpf: Remove serial from tests using {open,close}_netns

After removing the mount/umount dance from {open,close}_netns()
in the pervious patch, "serial_" can be removed from
the tests using {open,close}_netns().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20221129070900.3142427-7-martin.lau@linux.dev
20 months agoselftests/bpf: Remove the "/sys" mount and umount dance in {open,close}_netns
Martin KaFai Lau [Tue, 29 Nov 2022 07:08:58 +0000 (23:08 -0800)]
selftests/bpf: Remove the "/sys" mount and umount dance in {open,close}_netns

The previous patches have removed the need to do the mount and umount
dance when switching netns. In particular:
* Avoid remounting /sys/fs/bpf to have a clean start
* Avoid remounting /sys to get a ifindex of a particular netns

This patch can finally remove the mount and umount dance in
{open,close}_netns which is unnecessarily complicated and
error-prone.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20221129070900.3142427-6-martin.lau@linux.dev
20 months agoselftests/bpf: Avoid pinning bpf prog in the netns_load_bpf() callers
Martin KaFai Lau [Tue, 29 Nov 2022 07:08:57 +0000 (23:08 -0800)]
selftests/bpf: Avoid pinning bpf prog in the netns_load_bpf() callers

This patch removes the need to pin prog in the remaining tests in
tc_redirect.c by directly using the bpf_tc_hook_create() and
bpf_tc_attach().  The clsact qdisc will go away together with
the test netns, so no need to do bpf_tc_hook_destroy().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20221129070900.3142427-5-martin.lau@linux.dev
20 months agoselftests/bpf: Avoid pinning bpf prog in the tc_redirect_peer_l3 test
Martin KaFai Lau [Tue, 29 Nov 2022 07:08:56 +0000 (23:08 -0800)]
selftests/bpf: Avoid pinning bpf prog in the tc_redirect_peer_l3 test

This patch removes the need to pin prog in the tc_redirect_peer_l3
test by directly using the bpf_tc_hook_create() and bpf_tc_attach().
The clsact qdisc will go away together with the test netns, so
no need to do bpf_tc_hook_destroy().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20221129070900.3142427-4-martin.lau@linux.dev
20 months agoselftests/bpf: Avoid pinning bpf prog in the tc_redirect_dtime test
Martin KaFai Lau [Tue, 29 Nov 2022 07:08:55 +0000 (23:08 -0800)]
selftests/bpf: Avoid pinning bpf prog in the tc_redirect_dtime test

This patch removes the need to pin prog in the tc_redirect_dtime
test by directly using the bpf_tc_hook_create() and bpf_tc_attach().
The clsact qdisc will go away together with the test netns, so
no need to do bpf_tc_hook_destroy().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20221129070900.3142427-3-martin.lau@linux.dev
20 months agoselftests/bpf: Use if_nametoindex instead of reading the /sys/net/class/*/ifindex
Martin KaFai Lau [Tue, 29 Nov 2022 07:08:54 +0000 (23:08 -0800)]
selftests/bpf: Use if_nametoindex instead of reading the /sys/net/class/*/ifindex

When switching netns, the setns_by_fd() is doing dances in mount/umounting
the /sys directories.  One reason is the tc_redirect.c test is depending
on the /sys/net/class/*/ifindex instead of using the if_nametoindex().
if_nametoindex() uses ioctl() to get the ifindex.

This patch is to move all /sys/net/class/*/ifindex usages to
if_nametoindex().  The current code checks ifindex >= 0 which is
incorrect.  ifindex > 0 should be checked instead.  This patch also
stores ifindex_veth_src and ifindex_veth_dst since the latter patch
will need them.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20221129070900.3142427-2-martin.lau@linux.dev
20 months agoselftests/net: add csum offload test
Willem de Bruijn [Mon, 28 Nov 2022 14:02:10 +0000 (09:02 -0500)]
selftests/net: add csum offload test

Test NIC hardware checksum offload:

- Rx + Tx
- IPv4 + IPv6
- TCP + UDP

Optional features:

- zero checksum 0xFFFF
- checksum disable 0x0000
- transport encap headers
- randomization

See file header for detailed comments.

Expected results differ depending on NIC features:

- CHECKSUM_UNNECESSARY vs CHECKSUM_COMPLETE
- NETIF_F_HW_CSUM (csum_start/csum_off) vs NETIF_F_IP(V6)_CSUM

Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20221128140210.553391-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
20 months agoMerge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec...
Jakub Kicinski [Wed, 30 Nov 2022 04:50:50 +0000 (20:50 -0800)]
Merge branch 'master' of git://git./linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
ipsec-next 2022-11-26

1) Remove redundant variable in esp6.
   From Colin Ian King.

2) Update x->lastused for every packet. It was used only for
   outgoing mobile IPv6 packets, but showed to be usefull
   to check if the a SA is still in use in general.
   From Antony Antony.

3) Remove unused variable in xfrm_byidx_resize.
   From Leon Romanovsky.

4) Finalize extack support for xfrm.
   From Sabrina Dubroca.

* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
  xfrm: add extack to xfrm_set_spdinfo
  xfrm: add extack to xfrm_alloc_userspi
  xfrm: add extack to xfrm_do_migrate
  xfrm: add extack to xfrm_new_ae and xfrm_replay_verify_len
  xfrm: add extack to xfrm_del_sa
  xfrm: add extack to xfrm_add_sa_expire
  xfrm: a few coding style clean ups
  xfrm: Remove not-used total variable
  xfrm: update x->lastused for every packet
  esp6: remove redundant variable err
====================

Link: https://lore.kernel.org/r/20221126110303.1859238-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
20 months agoMerge branch 'net-pcs-altera-tse-simplify-and-clean-up-the-driver'
Jakub Kicinski [Wed, 30 Nov 2022 04:29:58 +0000 (20:29 -0800)]
Merge branch 'net-pcs-altera-tse-simplify-and-clean-up-the-driver'

Maxime Chevallier says:

====================
net: pcs: altera-tse: simplify and clean-up the driver

This small series does a bit of code cleanup in the altera TSE pcs
driver, removing unused register definitions, handling 1000BaseX speed
configuration correctly according to the datasheet, and making use of
proper poll_timeout helpers.

No functional change is introduced.
====================

Link: https://lore.kernel.org/r/20221125131801.64234-1-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
20 months agonet: pcs: altera-tse: remove unnecessary register definitions
Maxime Chevallier [Fri, 25 Nov 2022 13:18:01 +0000 (14:18 +0100)]
net: pcs: altera-tse: remove unnecessary register definitions

remove unused register definitions, left from the split with the
altera-tse mac driver.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
20 months agonet: pcs: altera-tse: don't set the speed for 1000BaseX
Maxime Chevallier [Fri, 25 Nov 2022 13:18:00 +0000 (14:18 +0100)]
net: pcs: altera-tse: don't set the speed for 1000BaseX

When disabling the SGMII mode bit, the PCS defaults to 1000BaseX mode.
In that mode, we don't need to set the speed since it's always 1000Mbps.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
20 months agonet: pcs: altera-tse: use read_poll_timeout to wait for reset
Maxime Chevallier [Fri, 25 Nov 2022 13:17:59 +0000 (14:17 +0100)]
net: pcs: altera-tse: use read_poll_timeout to wait for reset

Software resets on the TSE PCS don't clear registers, but rather reset
all internal state machines regarding AN, comma detection and
encoding/decoding. Use read_poll_timeout to wait for the reset to clear
instead of manually polling the register.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
20 months agoMerge branch 'mptcp-msg_fastopen-and-tfo-listener-side-support'
Jakub Kicinski [Wed, 30 Nov 2022 04:24:31 +0000 (20:24 -0800)]
Merge branch 'mptcp-msg_fastopen-and-tfo-listener-side-support'

Matthieu Baerts says:

====================
mptcp: MSG_FASTOPEN and TFO listener side support

Before this series, only the initiator of a connection was able to combine
both TCP FastOpen and MPTCP when using TCP_FASTOPEN_CONNECT socket option.

These new patches here add (in theory) the full support of TFO with MPTCP,
which means:

 - MSG_FASTOPEN sendmsg flag support (patch 1/8)
 - TFO support for the listener side (patches 2-5/8)
 - TCP_FASTOPEN socket option (patch 6/8)
 - TCP_FASTOPEN_KEY socket option (patch 7/8)

To support TFO for the server side, a few preparation patches are needed
(patches 2 to 5/8). Some of them were inspired by a previous work from
Benjamin Hesmans.

Note that TFO support with MPTCP has been validated with selftests
(patch 8/8) but also with Packetdrill tests running with a modified
but still very WIP version supporting MPTCP. Both the modified tool
and the tests are available online:

  https://github.com/multipath-tcp/packetdrill/
====================

Link: https://lore.kernel.org/r/20221125222958.958636-1-matthieu.baerts@tessares.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
20 months agoselftests: mptcp: mptfo Initiator/Listener
Dmytro Shytyi [Fri, 25 Nov 2022 22:29:54 +0000 (23:29 +0100)]
selftests: mptcp: mptfo Initiator/Listener

This patch first adds TFO support in mptcp_connect.c.

This can be enabled via a new option: -o MPTFO.

Once enabled, the TCP_FASTOPEN socket option is enabled for the server
side and a sendto() with MSG_FASTOPEN is used instead of a connect() for
the client side.

Note that the first SYN has a limit of bytes it can carry. In other
words, it is allowed to send less data than the provided one. We then
need to track more status info to properly allow the next sendmsg()
starting from the next part of the data to send the rest.

Also in TFO scenarios, we need to completely spool the partially xmitted
buffer -- and account for that -- before starting sendfile/mmap xmit,
otherwise the relevant tests will fail.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
20 months agomptcp: add support for TCP_FASTOPEN_KEY sockopt
Matthieu Baerts [Fri, 25 Nov 2022 22:29:53 +0000 (23:29 +0100)]
mptcp: add support for TCP_FASTOPEN_KEY sockopt

The goal of this socket option is to set different keys per listener,
see commit 1fba70e5b6be ("tcp: socket option to set TCP fast open key")
for more details about this socket option.

The only thing to do here with MPTCP is to relay the request to the
first subflow like it is already done for the other TCP_FASTOPEN* socket
options.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
20 months agomptcp: add TCP_FASTOPEN sock option
Dmytro Shytyi [Fri, 25 Nov 2022 22:29:52 +0000 (23:29 +0100)]
mptcp: add TCP_FASTOPEN sock option

The TCP_FASTOPEN socket option is one way for the application to tell
the kernel TFO support has to be enabled for the listener socket.

The only thing to do here with MPTCP is to relay the request to the
first subflow like it is already done for the other TCP_FASTOPEN* socket
options.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
20 months agomptcp: add subflow_v(4,6)_send_synack()
Dmytro Shytyi [Fri, 25 Nov 2022 22:29:51 +0000 (23:29 +0100)]
mptcp: add subflow_v(4,6)_send_synack()

The send_synack() needs to be overridden for MPTCP to support TFO for
two reasons:

- There is not be enough space in the TCP options if the TFO cookie has
  to be added in the SYN+ACK with other options: MSS (4), SACK OK (2),
  Timestamps (10), Window Scale (3+1), TFO (10+2), MP_CAPABLE (12).
  MPTCPv1 specs -- RFC 8684, section B.1 [1] -- suggest to drop the TCP
  timestamps option in this case.

- The data received in the SYN has to be handled: the SKB can be
  dequeued from the subflow sk and transferred to the MPTCP sk. Counters
  need to be updated accordingly and the application can be notified at
  the end because some bytes have been received.

[1] https://www.rfc-editor.org/rfc/rfc8684.html#section-b.1

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
20 months agomptcp: implement delayed seq generation for passive fastopen
Dmytro Shytyi [Fri, 25 Nov 2022 22:29:50 +0000 (23:29 +0100)]
mptcp: implement delayed seq generation for passive fastopen

With fastopen in place, the first subflow socket is created before the
MPC handshake completes, and we need to properly initialize the sequence
numbers at MPC ACK reception.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
20 months agomptcp: consolidate initial ack seq generation
Paolo Abeni [Fri, 25 Nov 2022 22:29:49 +0000 (23:29 +0100)]
mptcp: consolidate initial ack seq generation

Currently the initial ack sequence is generated on demand whenever
it's requested and the remote key is handy. The relevant code is
scattered in different places and can lead to multiple, unneeded,
crypto operations.

This change consolidates the ack sequence generation code in a single
helper, storing the sequence number at the subflow level.

The above additionally saves a few conditional in fast-path and will
simplify the upcoming fast-open implementation.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
20 months agomptcp: track accurately the incoming MPC suboption type
Paolo Abeni [Fri, 25 Nov 2022 22:29:48 +0000 (23:29 +0100)]
mptcp: track accurately the incoming MPC suboption type

Currently in the receive path we don't need to discriminate
between MPC SYN, MPC SYN-ACK and MPC ACK, but soon the fastopen
code will need that info to properly track the fully established
status.

Track the exact MPC suboption type into the receive opt bitmap.
No functional change intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
20 months agomptcp: add MSG_FASTOPEN sendmsg flag support
Dmytro Shytyi [Fri, 25 Nov 2022 22:29:47 +0000 (23:29 +0100)]
mptcp: add MSG_FASTOPEN sendmsg flag support

Since commit 54f1944ed6d2 ("mptcp: factor out mptcp_connect()"), all the
infrastructure is now in place to support the MSG_FASTOPEN flag, we
just need to call into the fastopen path in mptcp_sendmsg().

Co-developed-by: Benjamin Hesmans <benjamin.hesmans@tessares.net>
Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
20 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Tue, 29 Nov 2022 21:04:52 +0000 (13:04 -0800)]
Merge git://git./linux/kernel/git/netdev/net

tools/lib/bpf/ringbuf.c
  927cbb478adf ("libbpf: Handle size overflow for ringbuf mmap")
  b486d19a0ab0 ("libbpf: checkpatch: Fixed code alignments in ringbuf.c")
https://lore.kernel.org/all/20221121122707.44d1446a@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
20 months agoMerge tag 'net-6.1-rc8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Tue, 29 Nov 2022 17:52:10 +0000 (09:52 -0800)]
Merge tag 'net-6.1-rc8-2' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf, can and wifi.

  Current release - new code bugs:

   - eth: mlx5e:
      - use kvfree() in mlx5e_accel_fs_tcp_create()
      - MACsec, fix RX data path 16 RX security channel limit
      - MACsec, fix memory leak when MACsec device is deleted
      - MACsec, fix update Rx secure channel active field
      - MACsec, fix add Rx security association (SA) rule memory leak

  Previous releases - regressions:

   - wifi: cfg80211: don't allow multi-BSSID in S1G

   - stmmac: set MAC's flow control register to reflect current settings

   - eth: mlx5:
      - E-switch, fix duplicate lag creation
      - fix use-after-free when reverting termination table

  Previous releases - always broken:

   - ipv4: fix route deletion when nexthop info is not specified

   - bpf: fix a local storage BPF map bug where the value's spin lock
     field can get initialized incorrectly

   - tipc: re-fetch skb cb after tipc_msg_validate

   - wifi: wilc1000: fix Information Element parsing

   - packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE

   - sctp: fix memory leak in sctp_stream_outq_migrate()

   - can: can327: fix potential skb leak when netdev is down

   - can: add number of missing netdev freeing on error paths

   - aquantia: do not purge addresses when setting the number of rings

   - wwan: iosm:
      - fix incorrect skb length leading to truncated packet
      - fix crash in peek throughput test due to skb UAF"

* tag 'net-6.1-rc8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits)
  net: ethernet: renesas: ravb: Fix promiscuous mode after system resumed
  MAINTAINERS: Update maintainer list for chelsio drivers
  ionic: update MAINTAINERS entry
  sctp: fix memory leak in sctp_stream_outq_migrate()
  packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE
  net/mlx5: Lag, Fix for loop when checking lag
  Revert "net/mlx5e: MACsec, remove replay window size limitation in offload path"
  net: marvell: prestera: Fix a NULL vs IS_ERR() check in some functions
  net: tun: Fix use-after-free in tun_detach()
  net: mdiobus: fix unbalanced node reference count
  net: hsr: Fix potential use-after-free
  tipc: re-fetch skb cb after tipc_msg_validate
  mptcp: fix sleep in atomic at close time
  mptcp: don't orphan ssk in mptcp_close()
  dsa: lan9303: Correct stat name
  ipv4: Fix route deletion when nexthop info is not specified
  net: wwan: iosm: fix incorrect skb length
  net: wwan: iosm: fix crash in peek throughput test
  net: wwan: iosm: fix dma_alloc_coherent incompatible pointer type
  net: wwan: iosm: fix kernel test robot reported error
  ...

20 months agoudp_tunnel: Add checks for nla_nest_start() in __udp_tunnel_nic_dump_write()
Yuan Can [Tue, 29 Nov 2022 01:39:34 +0000 (01:39 +0000)]
udp_tunnel: Add checks for nla_nest_start() in __udp_tunnel_nic_dump_write()

As the nla_nest_start() may fail with NULL returned, the return value
should be checked.

Note that this is not a real bug, nothing will break here.
The next nla_put() will fail as well and we'll bail (and
nla_nest_cancel() can handle NULL). But we keep getting
those "fixes" so whatever.

Signed-off-by: Yuan Can <yuancan@huawei.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20221129013934.55184-1-yuancan@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
20 months agonet: ethernet: renesas: ravb: Fix promiscuous mode after system resumed
Yoshihiro Shimoda [Mon, 28 Nov 2022 06:56:04 +0000 (15:56 +0900)]
net: ethernet: renesas: ravb: Fix promiscuous mode after system resumed

After system resumed on some environment board, the promiscuous mode
is disabled because the SoC turned off. So, call ravb_set_rx_mode() in
the ravb_resume() to fix the issue.

Reported-by: Tho Vu <tho.vu.wh@renesas.com>
Fixes: 0184165b2f42 ("ravb: add sleep PM suspend/resume support")
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Link: https://lore.kernel.org/r/20221128065604.1864391-1-yoshihiro.shimoda.uh@renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
20 months agoMAINTAINERS: Update maintainer list for chelsio drivers
Ayush Sawal [Mon, 28 Nov 2022 23:13:48 +0000 (04:43 +0530)]
MAINTAINERS: Update maintainer list for chelsio drivers

This updates the maintainers for chelsio inline crypto drivers and
chelsio crypto drivers.

Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com>
Link: https://lore.kernel.org/r/20221128231348.8225-1-ayush.sawal@chelsio.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>