platform/kernel/linux-starfive.git
17 months ago  ice: Remove next_{dd,rs} fields from ice_tx_ring
Maciej Fijalkowski [Tue, 31 Jan 2023 20:45:05 +0000 (21:45 +0100)]
ice: Remove next_{dd,rs} fields from ice_tx_ring

Now that both ZC and standard XDP data paths stopped using Tx logic
based on next_dd and next_rs fields, we can safely remove these fields
and shrink Tx ring structure.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-13-maciej.fijalkowski@intel.com
17 months ago  ice: Add support for XDP multi-buffer on Tx side
Maciej Fijalkowski [Tue, 31 Jan 2023 20:45:04 +0000 (21:45 +0100)]
ice: Add support for XDP multi-buffer on Tx side

As with the Rx side in the previous patch, the XDP Tx logic in the ice
driver needs to be adjusted for multi-buffer support, specifically the
way HW Tx descriptors are produced and cleaned.

Currently, XDP_TX works on strict ring boundaries, meaning it sets the
RS bit (on the producer side) / looks up the DD bit (on the
consumer/cleaning side) every quarter of the ring. This means that if,
for example, a multi-buffer frame spanned a ring quarter boundary (say
the frame consists of 4 fragments and starts at descriptor 62 of a
256-entry ring), the RS bit would be produced in the middle of the
multi-buffer frame, which would be broken behavior, as it needs to be
set on the last descriptor of the frame.

To make it work, set RS bit at the last descriptor from the batch of
frames that XDP_TX action was used on and make the first entry remember
the index of last descriptor with RS bit set. This way, cleaning side
can take the index of descriptor with RS bit, look up DD bit's presence
and clean from first entry to last.
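
A minimal sketch of that producer/cleaner handshake, under loud
assumptions: the structs, field names and functions below are
simplified stand-ins invented for illustration and are not the ice
driver's definitions. The producer sets RS only on the last descriptor
of a batch and records that index in the first buffer entry; the cleaner
reads the index back and checks DD there.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 256u

/* Simplified stand-ins for a HW Tx descriptor and a driver Tx buffer. */
struct tx_desc {
	bool rs; /* "report status" requested by the producer */
	bool dd; /* "descriptor done" written back by HW */
};

struct tx_buf {
	uint16_t rs_idx; /* meaningful only in the first entry of a batch */
};

struct tx_ring {
	struct tx_desc desc[RING_SIZE];
	struct tx_buf buf[RING_SIZE];
	uint16_t next_to_use;
	uint16_t next_to_clean;
};

/*
 * Produce nr_descs descriptors (1 <= nr_descs < RING_SIZE) as one batch.
 * RS is set on the last descriptor only, never mid-frame, and the first
 * entry remembers where it was set. Returns the RS descriptor index.
 */
static uint16_t produce_batch(struct tx_ring *r, uint16_t nr_descs)
{
	uint16_t first = r->next_to_use;
	uint16_t last = (uint16_t)((first + nr_descs - 1) % RING_SIZE);

	r->desc[last].rs = true;
	r->buf[first].rs_idx = last;
	r->next_to_use = (uint16_t)((last + 1) % RING_SIZE);
	return last;
}

/*
 * Clean one batch. Returns the number of descriptors cleaned, or 0 when
 * the ring is empty or HW has not yet set DD on the RS descriptor.
 */
static uint16_t clean_batch(struct tx_ring *r)
{
	uint16_t first = r->next_to_clean;
	uint16_t last = r->buf[first].rs_idx;
	uint16_t cleaned;

	if (first == r->next_to_use || !r->desc[last].dd)
		return 0;

	cleaned = (uint16_t)((last + RING_SIZE - first) % RING_SIZE + 1);
	r->desc[last].rs = false;
	r->desc[last].dd = false;
	r->next_to_clean = (uint16_t)((last + 1) % RING_SIZE);
	return cleaned;
}
```

With a 4-descriptor frame starting at index 62, RS lands on index 65
rather than on the quarter boundary at 63, which is the case the
message describes.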

In order to clean up the code base, introduce the common
ice_set_rs_bit(), which returns the index of the descriptor that the RS
bit was set on, so that the standard driver can store it within the
proper ice_tx_buf and the ZC driver can simply ignore the return value.

Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
17 months ago  ice: Add support for XDP multi-buffer on Rx side
Maciej Fijalkowski [Tue, 31 Jan 2023 20:45:03 +0000 (21:45 +0100)]
ice: Add support for XDP multi-buffer on Rx side

The Rx data path of the ice driver needs some rework in order to
support multi-buffer XDP. On the skb path, the Rx ring currently carries
a pointer to the skb, so if the driver did not manage to combine a
fragmented frame in the current NAPI instance, it can restore the state
in the next instance and keep looking for the last fragment (i.e. the
descriptor with the EOP bit set). What needs to be achieved is for the
xdp_buff to be combined in this way (linear + frags part) in the first
place. The skb will then be ready to go in case of XDP_PASS or when no
BPF program is present on the interface. If a BPF program is there, it
will work on the multi-buffer XDP frame. At this point xdp_buff resides
directly on the Rx ring, so given that the skb will be built straight
from xdp_buff, there is no further need to carry the skb on the Rx ring.

Besides removing skb pointer from Rx ring, lots of members have been
moved around within ice_rx_ring. First and foremost reason was to place
rx_buf with xdp_buff on the same cacheline. This means that once we
touch rx_buf (which is a preceding step before touching xdp_buff),
xdp_buff will already be hot in cache. Second thing was that xdp_rxq is
used rather rarely and it occupies a separate cacheline, so maybe it is
better to have it at the end of ice_rx_ring.

Another change that affects ice_rx_ring is the introduction of
ice_rx_ring::first_desc. Its purpose is twofold. The first is to
propagate rx_buf->act to all the parts of the current xdp_buff after
running the XDP program, so that ice_put_rx_buf(), which got moved out
of the main Rx processing loop, is able to take an appropriate action
on each buffer. The second is for ice_construct_skb().

ice_construct_skb() has a copybreak mechanism which had an explicit
impact on the xdp_buff->skb conversion in the new approach when the
legacy Rx flag is toggled. It works such that the linear part is 256
bytes long; if the frame is bigger than that, the remaining bytes go as
a frag to skb_shared_info.
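
As a rough illustration of the split described above (a sketch only:
the 256-byte figure comes from the message, while the struct and
function names are hypothetical and the real driver's header-length
logic is not reproduced here):

```c
#include <stddef.h>

#define RX_HDR_SIZE 256u /* linear part length quoted in the message */

/* Hypothetical result type: how a frame is divided by copybreak. */
struct skb_split {
	size_t linear_len; /* bytes copied into the skb linear area */
	size_t frag_len;   /* bytes left in the page, attached as a frag */
};

/* Split a frame of frame_len bytes into a linear part and a frag. */
static struct skb_split copybreak_split(size_t frame_len)
{
	struct skb_split s;

	s.linear_len = frame_len < RX_HDR_SIZE ? frame_len : RX_HDR_SIZE;
	s.frag_len = frame_len - s.linear_len;
	return s;
}
```

Because the first buffer of a frame may already contribute a frag this
way, the frags copied from xdp_buff have to start at the next free slot
in the destination frag array.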

This means while memcpying frags from xdp_buff to newly allocated skb,
care needs to be taken when picking the destination frag array entry.
By the time ice_construct_skb() is called for a fragmented frame, the
current rx_buf points to the *last* fragment, but copybreak needs to be
done against the first one. That's where ice_rx_ring::first_desc helps.

When frame building spans across NAPI polls (the DD bit is not set on
the current descriptor and xdp->data is not NULL), the current Rx buffer
handling state can cause problems. Calls to ice_put_rx_buf() were
pulled out of the main Rx processing loop and scoped from cached_ntc to
the current ntc, and that function now relies on rx_buf->act, which is
set within ice_run_xdp(). ice_run_xdp() is called only once the EOP bit
has been found, so currently an Rx buffer could be put with rx_buf->act
still *uninitialized*. To address this, change the scoping to rely on
first_desc on both boundaries instead.

This also implies that cleaned_count which is used as an input to
ice_alloc_rx_buffers() and tells how many new buffers should be refilled
has to be adjusted. If it stayed as is, what could happen is a case
where ntc would go over ntu.

Therefore, remove cleaned_count altogether and, for the allocation
routine, use the newly introduced ICE_RX_DESC_UNUSED() macro, which is
the equivalent of ICE_DESC_UNUSED() dedicated to the Rx side and based
on struct ice_rx_ring::first_desc instead of next_to_clean.
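
A sketch of the idea, assuming the usual "unused descriptors" ring
arithmetic; the struct and function below are illustrative and not
copied from the driver. Counting from first_desc rather than
next_to_clean keeps the buffers of a not-yet-complete frame out of the
refill count.

```c
#include <stdint.h>

/* Illustrative ring state; field names mirror those in the message. */
struct rx_ring {
	uint16_t count;         /* number of descriptors in the ring */
	uint16_t next_to_use;   /* next slot the driver hands to HW */
	uint16_t next_to_clean; /* next descriptor to inspect */
	uint16_t first_desc;    /* first descriptor of the frame in progress */
};

/*
 * Number of descriptors that may be refilled, measured against
 * first_desc. One slot is always left unused so that a full ring can be
 * told apart from an empty one.
 */
static uint16_t rx_desc_unused(const struct rx_ring *r)
{
	uint16_t wrap = r->first_desc > r->next_to_use ? 0 : r->count;

	return (uint16_t)(wrap + r->first_desc - r->next_to_use - 1);
}
```

For a 256-entry ring with next_to_use at 100, a frame that started at
descriptor 5 and has been walked up to descriptor 10 leaves 160 slots to
refill; measuring from next_to_clean would report 165 and could hand
still-referenced buffers back to HW.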

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-11-maciej.fijalkowski@intel.com
17 months ago  ice: Use xdp->frame_sz instead of recalculating truesize
Maciej Fijalkowski [Tue, 31 Jan 2023 20:45:02 +0000 (21:45 +0100)]
ice: Use xdp->frame_sz instead of recalculating truesize

The skb path calculates truesize in three different functions, which
could be avoided, as xdp_buff already carries the calculated truesize in
xdp_buff::frame_sz. If ice_add_rx_frag() is adjusted to take the
xdp_buff as an input, just like the functions responsible for creating
the sk_buff initially, the codebase can be simplified by removing these
redundant recalculations and relying on xdp_buff::frame_sz instead.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-10-maciej.fijalkowski@intel.com
17 months ago  ice: Do not call ice_finalize_xdp_rx() unnecessarily
Maciej Fijalkowski [Tue, 31 Jan 2023 20:45:01 +0000 (21:45 +0100)]
ice: Do not call ice_finalize_xdp_rx() unnecessarily

Currently ice_finalize_xdp_rx() is called only when an xdp_prog is
present on the VSI, which is a good thing. However, this optimization
can be taken further by checking only whether any XDP_TX/XDP_REDIRECT
action took place in the current Rx processing. A non-zero value of
@xdp_xmit implies that an xdp_prog is present on the VSI. This way
XDP_DROP-based workloads will not suffer from unnecessary calls to an
external function.
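
The check can be pictured roughly as follows. This is a sketch under
stated assumptions: the verdict bit values and helper names are made up
for illustration and are not taken from the driver.

```c
#include <stdbool.h>

/* Illustrative verdict bits; the values are assumptions. */
#define VERDICT_PASS  0u
#define VERDICT_DROP  (1u << 0)
#define VERDICT_TX    (1u << 1)
#define VERDICT_REDIR (1u << 2)

/* OR together only the verdicts that leave work for a finalize step. */
static unsigned int accumulate_xmit(const unsigned int *verdicts, int n)
{
	unsigned int xdp_xmit = 0;

	for (int i = 0; i < n; i++)
		xdp_xmit |= verdicts[i] & (VERDICT_TX | VERDICT_REDIR);
	return xdp_xmit;
}

/* Finalize only if at least one frame was transmitted or redirected. */
static bool needs_finalize(unsigned int xdp_xmit)
{
	return xdp_xmit != 0;
}
```

A poll in which every frame was dropped accumulates zero and skips the
finalize call entirely, even though a program is attached.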

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-9-maciej.fijalkowski@intel.com
17 months ago  ice: Use ice_max_xdp_frame_size() in ice_xdp_setup_prog()
Maciej Fijalkowski [Tue, 31 Jan 2023 20:45:00 +0000 (21:45 +0100)]
ice: Use ice_max_xdp_frame_size() in ice_xdp_setup_prog()

This should have been used in there from day 1, let us address that
before introducing XDP multi-buffer support for Rx side.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-8-maciej.fijalkowski@intel.com
17 months ago  ice: Centrallize Rx buffer recycling
Maciej Fijalkowski [Tue, 31 Jan 2023 20:44:59 +0000 (21:44 +0100)]
ice: Centrallize Rx buffer recycling

Currently, calls to ice_put_rx_buf() are sprinkled throughout
ice_clean_rx_irq(): the first is for explicit flow director descriptor
handling, the second is after running the XDP prog, and the last one is
after taking care of the skb.

The first call site was actually only there to bump ntc, as the Rx
buffer to be recycled is not even passed to the function.

It is possible to walk through the Rx buffers processed in a particular
NAPI cycle by caching ntc at the beginning of ice_clean_rx_irq().

To do so, let us store the XDP verdict inside ice_rx_buf, so that the
action we need to take is known. When no XDP prog is present, just store
ICE_XDP_PASS as the verdict.
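
A simplified model of that flow, with hypothetical types and names
throughout: the processing loop only records a verdict per buffer, and a
second loop walks from the cached ntc to the new ntc to hand buffers
back in one place.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u

enum xdp_verdict { VERDICT_PASS, VERDICT_DROP, VERDICT_TX };

struct rx_buf {
	enum xdp_verdict act; /* verdict stored for the deferred put step */
	unsigned int puts;    /* how many times the buffer was handed back */
};

struct rx_ring {
	struct rx_buf buf[RING_SIZE];
	uint16_t next_to_clean;
};

/*
 * Process nr descriptors (nr < RING_SIZE). verdicts == NULL models the
 * case where no XDP program is attached. Returns the number of buffers
 * handed back by the centralized loop.
 */
static unsigned int clean_rx(struct rx_ring *r,
			     const enum xdp_verdict *verdicts,
			     unsigned int nr)
{
	uint16_t cached_ntc = r->next_to_clean;
	uint16_t ntc = cached_ntc;
	unsigned int put = 0;

	/* Main loop: record the verdict, bump ntc, do not recycle yet. */
	for (unsigned int i = 0; i < nr; i++) {
		r->buf[ntc].act = verdicts ? verdicts[i] : VERDICT_PASS;
		ntc = (uint16_t)((ntc + 1) % RING_SIZE);
	}
	r->next_to_clean = ntc;

	/* Centralized recycling: walk everything touched in this cycle. */
	for (uint16_t i = cached_ntc; i != ntc;
	     i = (uint16_t)((i + 1) % RING_SIZE)) {
		r->buf[i].puts++; /* a real driver would act on buf[i].act */
		put++;
	}
	return put;
}
```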

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-7-maciej.fijalkowski@intel.com
17 months ago  ice: Inline eop check
Maciej Fijalkowski [Tue, 31 Jan 2023 20:44:58 +0000 (21:44 +0100)]
ice: Inline eop check

This might be used by the ZC driver in the future and might yield a
minor performance boost. While at it, constify the arguments that
ice_is_non_eop() takes, since they are pointers and this will help the
compiler when generating asm.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-6-maciej.fijalkowski@intel.com
17 months ago  ice: Pull out next_to_clean bump out of ice_put_rx_buf()
Maciej Fijalkowski [Tue, 31 Jan 2023 20:44:57 +0000 (21:44 +0100)]
ice: Pull out next_to_clean bump out of ice_put_rx_buf()

The plan is to move ice_put_rx_buf() to the end of ice_clean_rx_irq(),
so in order to keep the ability to walk through HW Rx descriptors, pull
the next_to_clean handling out of ice_put_rx_buf().

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-5-maciej.fijalkowski@intel.com
17 months ago  ice: Store page count inside ice_rx_buf
Maciej Fijalkowski [Tue, 31 Jan 2023 20:44:56 +0000 (21:44 +0100)]
ice: Store page count inside ice_rx_buf

This will allow us to avoid carrying an additional auxiliary array of
page counts when dealing with XDP multi-buffer support. Previously,
combining a fragmented frame into an skb was not affected in the way XDP
would be, since for XDP the whole frame needs to be in place before the
XDP prog is executed. Therefore, when going through HW Rx descriptors
one by one, calls to ice_put_rx_buf() need to happen *after* running the
XDP prog on a potentially multi-buffer frame, so some additional storage
for the page count is needed.

Adding the page count to the Rx buf makes it easier to walk through the
processed entries at the end of the Rx cleaning routine and decide
whether or not buffers should be recycled.

While at it, bump ice_rx_buf::pagecnt_bias from u16 up to u32. It was
proven many times that calculations on variables smaller than standard
register size are harmful. This was also the case during experiments
with embedding page count to ice_rx_buf - when this was added as u16 it
had a performance impact.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-4-maciej.fijalkowski@intel.com
17 months ago  ice: Add xdp_buff to ice_rx_ring struct
Maciej Fijalkowski [Tue, 31 Jan 2023 20:44:55 +0000 (21:44 +0100)]
ice: Add xdp_buff to ice_rx_ring struct

In preparation for XDP multi-buffer support, let's store xdp_buff on
the Rx ring struct. This will allow us to combine fragmented frames
across separate NAPI cycles in the same way as skb fragments are
currently handled. This means that the skb pointer on the Rx ring will
become redundant and will be removed. For now it is kept, and the layout
of the Rx ring struct has not been inspected; some member movement will
be needed later on, so that will be the time to take care of it.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-3-maciej.fijalkowski@intel.com
17 months ago  ice: Prepare legacy-rx for upcoming XDP multi-buffer support
Maciej Fijalkowski [Tue, 31 Jan 2023 20:44:54 +0000 (21:44 +0100)]
ice: Prepare legacy-rx for upcoming XDP multi-buffer support

The Rx path is going to be modified so that a fragmented frame is
gathered within xdp_buff in the first place. This approach implies that
the underlying buffer has to provide tailroom for skb_shared_info. This
is currently the case when the ring uses build_skb, but not when the
legacy-rx knob is turned on. That case configures 2k Rx buffers and has
no way to provide either headroom or tailroom. FWIW, it currently uses
XDP_PACKET_HEADROOM, which is broken and is removed here. 2k Rx buffers
were used so that the driver in this setting could support a 9k MTU, as
it can chain up to 5 Rx buffers. With the offset configured, HW writing
2k of data went past the half of the page, which broke the assumption
behind our internal page recycling tricks.

Now, if the above got fixed and the legacy-rx path were left as is,
referring to skb_shared_info via xdp_get_shared_info_from_buff() would
corrupt the packet's content again. Hence the size of the Rx buffer
needs to be lowered, and with it the supported MTU. This operation
allows us to keep a unified data path, and with an 8k MTU, users of
legacy-rx (if any) would still be good to go. However, the tendency is
to drop support for this code path at some point.

Add ICE_RXBUF_1664 as vsi::rx_buf_len and ICE_MAX_FRAME_LEGACY_RX (8320)
as vsi::max_frame for legacy-rx. For bigger page sizes configure 3k Rx
buffers, not 2k.
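
The numbers above can be sanity-checked with a few lines of arithmetic.
In this sketch the macro names are local stand-ins, the 5-buffer chain
limit is the one quoted in the message, and the 320-byte tailroom is
purely an assumed figure for the aligned skb_shared_info size on a
typical 64-bit build, not a value stated by the commit.

```c
/* Values quoted in the message. */
#define RXBUF_1664           1664u
#define MAX_FRAME_LEGACY_RX  8320u
#define MAX_CHAINED_RX_BUFS  5u

/* Assumption for illustration only: aligned skb_shared_info size. */
#define ASSUMED_SHINFO_TAILROOM 320u

/* Largest frame that a chain of 1664-byte buffers can hold. */
static unsigned int legacy_rx_max_frame(void)
{
	return RXBUF_1664 * MAX_CHAINED_RX_BUFS;
}

/* Does buffer plus tailroom stay within one half of a page? */
static int fits_half_page(unsigned int buf_len, unsigned int tailroom,
			  unsigned int page_size)
{
	return buf_len + tailroom <= page_size / 2;
}
```

Under that assumption a 1664-byte buffer leaves room for the tailroom
inside half of a 4k page, whereas a 2048-byte buffer does not, and
5 x 1664 matches the 8320-byte maximum frame.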

Since headroom support is removed, disable data_meta support on legacy-rx.
When preparing XDP buff, rely on ice_rx_ring::rx_offset setting when
deciding whether to support data_meta or not.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-2-maciej.fijalkowski@intel.com
17 months ago  Merge branch 'Support bpf trampoline for s390x'
Alexei Starovoitov [Mon, 30 Jan 2023 03:16:29 +0000 (19:16 -0800)]
Merge branch 'Support bpf trampoline for s390x'

Ilya Leoshkevich says:

====================

v2: https://lore.kernel.org/bpf/20230128000650.1516334-1-iii@linux.ibm.com/#t
v2 -> v3:
- Make __arch_prepare_bpf_trampoline static.
  (Reported-by: kernel test robot <lkp@intel.com>)
- Support both old- and new- style map definitions in sk_assign. (Alexei)
- Trim DENYLIST.s390x. (Alexei)
- Adjust s390x vmlinux path in vmtest.sh.
- Drop merged fixes.

v1: https://lore.kernel.org/bpf/20230125213817.1424447-1-iii@linux.ibm.com/#t
v1 -> v2:
- Fix core_read_macros, sk_assign, test_profiler, test_bpffs (24/31;
  I'm not quite happy with the fix, but don't have better ideas),
  and xdp_synproxy. (Andrii)
- Prettify liburandom_read and verify_pkcs7_sig fixes. (Andrii)
- Fix bpf_usdt_arg using barrier_var(); prettify barrier_var(). (Andrii)
- Change BPF_MAX_TRAMP_LINKS to enum and query it using BTF. (Andrii)
- Improve bpf_jit_supports_kfunc_call() description. (Alexei)
- Always check sign_extend() return value.
- Cc: Alexander Gordeev.

Hi,

This series implements poke, trampoline, kfunc, and mixing subprogs
and tailcalls on s390x.

The following failures still remain:

#82      get_stack_raw_tp:FAIL
get_stack_print_output:FAIL:user_stack corrupted user stack
Known issue:
We cannot reliably unwind userspace on s390x without DWARF.

#101     ksyms_module:FAIL
address of kernel function bpf_testmod_test_mod_kfunc is out of range
Known issue:
Kernel and modules are too far away from each other on s390x.

#190     stacktrace_build_id:FAIL
Known issue:
We cannot reliably unwind userspace on s390x without DWARF.

#281     xdp_metadata:FAIL
See patch 6.

None of these seem to be due to the new changes.

Best regards,
Ilya
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  selftests/bpf: Trim DENYLIST.s390x
Ilya Leoshkevich [Sun, 29 Jan 2023 19:05:01 +0000 (20:05 +0100)]
selftests/bpf: Trim DENYLIST.s390x

Now that trampoline is implemented, enable a number of tests on s390x.
18 of the remaining failures have to do with either lack of rethook
(fixed by [1]) or syscall symbols missing from BTF (fixed by [2]).

Do not re-classify the remaining failures for now; wait until the
s390/for-next fixes are merged and re-classify only the remaining few.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=for-next&id=1a280f48c0e403903cf0b4231c95b948e664f25a
[2] https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=for-next&id=2213d44e140f979f4b60c3c0f8dd56d151cc8692

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230129190501.1624747-9-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  selftests/bpf: Fix s390x vmlinux path
Ilya Leoshkevich [Sun, 29 Jan 2023 19:05:00 +0000 (20:05 +0100)]
selftests/bpf: Fix s390x vmlinux path

After commit edd4a8667355 ("s390/boot: get rid of startup archive")
there is no more compressed/ subdirectory.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230129190501.1624747-8-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  s390/bpf: Implement bpf_jit_supports_kfunc_call()
Ilya Leoshkevich [Sun, 29 Jan 2023 19:04:59 +0000 (20:04 +0100)]
s390/bpf: Implement bpf_jit_supports_kfunc_call()

Implement calling kernel functions from eBPF. In general, the eBPF ABI
is fairly close to that of s390x, with one important difference: on
s390x callers should sign-extend signed arguments. Handle that by using
information returned by bpf_jit_find_kfunc_model().

Here is an example of how sign extension works. Suppose we need to
call the following function from BPF:

    ; long noinline bpf_kfunc_call_test4(signed char a, short b, int c, long d)
    0000000000936a78 <bpf_kfunc_call_test4>:
    936a78:       c0 04 00 00 00 00       jgnop bpf_kfunc_call_test4
    ;     return (long)a + (long)b + (long)c + d;
    936a7e:       b9 08 00 45             agr     %r4,%r5
    936a82:       b9 08 00 43             agr     %r4,%r3
    936a86:       b9 08 00 24             agr     %r2,%r4
    936a8a:       c0 f4 00 1e 3b 27       jg      <__s390_indirect_jump_r14>

As per the s390x ABI, bpf_kfunc_call_test4() has the right to assume
that a, b and c are sign-extended by the caller, which results in using
64-bit additions (agr) without any additional conversions. Without sign
extension we would have the following on the JITed code side:

    ; tmp = bpf_kfunc_call_test4(-3, -30, -200, -1000);
    ;        5:       b4 10 00 00 ff ff ff fd w1 = -3
    0x3ff7fdcdad4:       llilf   %r2,0xfffffffd
    ;        6:       b4 20 00 00 ff ff ff e2 w2 = -30
    0x3ff7fdcdada:       llilf   %r3,0xffffffe2
    ;        7:       b4 30 00 00 ff ff ff 38 w3 = -200
    0x3ff7fdcdae0:       llilf   %r4,0xffffff38
    ;       8:       b7 40 00 00 ff ff fc 18 r4 = -1000
    0x3ff7fdcdae6:       lgfi    %r5,-1000
    0x3ff7fdcdaec:       mvc     64(4,%r15),160(%r15)
    0x3ff7fdcdaf2:       lgrl    %r1,bpf_kfunc_call_test4@GOT
    0x3ff7fdcdaf8:       brasl   %r14,__s390_indirect_jump_r1

The first 3 llilfs are 32-bit loads whose values need to be
sign-extended to 64 bits.
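
The effect can be reproduced in plain C. This is a model of the
arithmetic only, not of the JIT: callee_sum() stands in for a function
that, as the s390x ABI permits, adds its arguments as full 64-bit
values, and the helper names are invented for this sketch.

```c
#include <stdint.h>

/* Callee that trusts the caller to have widened a, b and c already. */
static int64_t callee_sum(int64_t a, int64_t b, int64_t c, int64_t d)
{
	return a + b + c + d;
}

/* What a 32-bit load that clears the upper half leaves in the register. */
static int64_t zext32(int32_t v)
{
	return (int64_t)(uint64_t)(uint32_t)v;
}

/* What the callee expects for a signed argument. */
static int64_t sext32(int32_t v)
{
	return (int64_t)v;
}

/* Arguments from the example: -3, -30, -200, -1000. */
static int64_t call_without_sign_extension(void)
{
	return callee_sum(zext32(-3), zext32(-30), zext32(-200), -1000);
}

static int64_t call_with_sign_extension(void)
{
	return callee_sum(sext32(-3), sext32(-30), sext32(-200), -1000);
}
```

With sign extension the sum is -1233; with zero-extended arguments each
of the first three operands is off by 2^32, so the result is
3 * 2^32 - 1233 = 12884900655.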

Note: at the moment bpf_jit_find_kfunc_model() does not seem to play
nicely with XDP metadata functions: add_kfunc_call() adds an "abstract"
bpf_*() version to kfunc_btf_tab, but then fixup_kfunc_call() puts the
concrete version into insn->imm, which bpf_jit_find_kfunc_model() cannot
find. But this seems to be a common code problem.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230129190501.1624747-7-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  s390/bpf: Implement bpf_jit_supports_subprog_tailcalls()
Ilya Leoshkevich [Sun, 29 Jan 2023 19:04:58 +0000 (20:04 +0100)]
s390/bpf: Implement bpf_jit_supports_subprog_tailcalls()

Allow mixing subprogs and tail calls by passing the current tail
call count to subprogs.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230129190501.1624747-6-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  s390/bpf: Implement arch_prepare_bpf_trampoline()
Ilya Leoshkevich [Sun, 29 Jan 2023 19:04:57 +0000 (20:04 +0100)]
s390/bpf: Implement arch_prepare_bpf_trampoline()

arch_prepare_bpf_trampoline() is used for direct attachment of eBPF
programs to various places, bypassing kprobes. It's responsible for
calling a number of eBPF programs before, instead of, and/or after
whatever they are attached to.

Add a s390x implementation, paying attention to the following:

- Reuse the existing JIT infrastructure, where possible.
- Like the existing JIT, prefer making multiple passes instead of
  backpatching. Currently 2 passes are enough. If a literal pool is
  introduced, this needs to be raised to 3. However, at the moment
  adding a literal pool only makes the code larger. If branch
  shortening is introduced, the number of passes needs to be
  increased even further.
- Support both regular and ftrace calling conventions, depending on
  the trampoline flags.
- Use expolines for indirect calls.
- Handle the mismatch between the eBPF and the s390x ABIs.
- Sign-extend fmod_ret return values.
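
To illustrate the "multiple passes instead of backpatching" point in
the list above, here is a toy two-pass emitter. Everything in it is
hypothetical (the opcodes, the struct, the function names); it only
shows the general technique of running the same generator once to learn
sizes and label offsets and once more to emit real bytes.

```c
#include <stddef.h>
#include <stdint.h>

struct jit {
	uint8_t *buf;    /* NULL during the sizing pass */
	size_t len;      /* bytes emitted so far in the current pass */
	size_t exit_off; /* offset of the "exit" label from the last pass */
};

static void emit(struct jit *j, uint8_t byte)
{
	if (j->buf)
		j->buf[j->len] = byte;
	j->len++;
}

/*
 * Toy program: a forward branch to "exit", nr_fill filler bytes, then
 * the exit byte. The branch displacement uses exit_off as recorded by
 * the previous pass, so nothing is patched after the fact.
 * nr_fill must stay below 254 so that the displacement fits in a byte.
 */
static void gen(struct jit *j, size_t nr_fill)
{
	size_t branch_off = j->len;

	emit(j, 0xB0);                               /* toy branch opcode */
	emit(j, (uint8_t)(j->exit_off - branch_off)); /* displacement */
	for (size_t i = 0; i < nr_fill; i++)
		emit(j, 0x00);                       /* toy filler */
	j->exit_off = j->len;                        /* label: exit */
	emit(j, 0xE0);                               /* toy exit opcode */
}

/* Returns the number of bytes written, or 0 if out is too small. */
static size_t jit_compile(uint8_t *out, size_t out_size, size_t nr_fill)
{
	struct jit j = { .buf = NULL, .len = 0, .exit_off = 0 };

	gen(&j, nr_fill);  /* pass 1: sizes and label offsets only */
	if (j.len > out_size)
		return 0;
	j.buf = out;
	j.len = 0;
	gen(&j, nr_fill);  /* pass 2: emit with known offsets */
	return j.len;
}
```

If an optimization made instruction sizes depend on offsets learned in
an earlier pass, the offsets could shift again, which is why such
features raise the number of passes needed.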

invoke_bpf_prog() produces about 120 bytes; it might be possible to
slightly optimize this, but reaching 50 bytes, like on x86_64, looks
unrealistic: just loading cookie, __bpf_prog_enter, bpf_func, insnsi
and __bpf_prog_exit as literals already takes at least 5 * 12 = 60
bytes, and we can't use relative addressing for most of them.
Therefore, lower BPF_MAX_TRAMP_LINKS on s390x.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230129190501.1624747-5-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  s390/bpf: Implement bpf_arch_text_poke()
Ilya Leoshkevich [Sun, 29 Jan 2023 19:04:56 +0000 (20:04 +0100)]
s390/bpf: Implement bpf_arch_text_poke()

bpf_arch_text_poke() is used to hotpatch eBPF programs and trampolines.
s390x has a very strict hotpatching restriction: the only thing that is
allowed to be hotpatched is conditional branch mask.

Take the same approach as commit de5012b41e5c ("s390/ftrace: implement
hotpatching"): create a conditional jump to a "plt", which loads the
target address from memory and jumps to it; then first patch this
address, and then the mask.
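
The ordering matters because a concurrent executor may observe the
branch at any moment. The following user-space model uses C11 atomics
to show the idea; the struct, the mask values and the function names
are assumptions made for this sketch and do not reflect the kernel
implementation.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Model of one patch site: a branch mask plus the target the "plt" loads. */
struct patch_site {
	_Atomic uint8_t mask;         /* 0x0: never taken, 0xf: always taken */
	_Atomic uintptr_t plt_target; /* address the "plt" jumps to */
};

/* Enable the jump: publish the target first, flip the mask last. */
static void poke_enable(struct patch_site *s, uintptr_t new_target)
{
	atomic_store_explicit(&s->plt_target, new_target,
			      memory_order_release);
	atomic_store_explicit(&s->mask, 0xf, memory_order_release);
}

/* What an executor does on reaching the site: 0 means fall through. */
static uintptr_t site_dispatch(struct patch_site *s)
{
	if (atomic_load_explicit(&s->mask, memory_order_acquire) == 0)
		return 0;
	return atomic_load_explicit(&s->plt_target, memory_order_acquire);
}
```

An executor that sees the new mask is then guaranteed to also see the
new target; flipping the mask first would leave a window in which the
branch is taken through a stale address.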

Trampolines (introduced in the next patch) respect the ftrace calling
convention: the return address is in %r0, and %r1 is clobbered. With
that in mind, bpf_arch_text_poke() does not differentiate between jumps
and calls.

However, there is a simple optimization for jumps (for the epilogue_ip
case): if a jump already points to the destination, then there is no
"plt" and we can just flip the mask.

For simplicity, the "plt" template is defined in assembly, and its size
is used to define C arrays. There doesn't seem to be a way to convey
this size to C as a constant, so it's hardcoded and double-checked
during runtime.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230129190501.1624747-4-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  s390/bpf: Add expoline to tail calls
Ilya Leoshkevich [Sun, 29 Jan 2023 19:04:55 +0000 (20:04 +0100)]
s390/bpf: Add expoline to tail calls

All the indirect jumps in the eBPF JIT already use expolines, except
for the tail call one.

Fixes: de5cb6eb514e ("s390: use expoline thunks in the BPF JIT")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230129190501.1624747-3-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  selftests/bpf: Fix sk_assign on s390x
Ilya Leoshkevich [Sun, 29 Jan 2023 19:04:54 +0000 (20:04 +0100)]
selftests/bpf: Fix sk_assign on s390x

sk_assign is failing on an s390x machine running Debian "bookworm" for
2 reasons: legacy server_map definition and uninitialized addrlen in
recvfrom() call.

Fix by adding a new-style server_map definition and dropping addrlen
(recvfrom() allows NULL values for src_addr and addrlen).
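
For reference, the NULL form of the call looks like this. The wrapper
name is invented for this sketch; the recvfrom() usage itself follows
POSIX, which permits a null src_addr (in which case addrlen is not
used).

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Receive a datagram without asking for the sender's address. Passing
 * NULL for both src_addr and addrlen avoids having to initialize an
 * addrlen variable at all.
 */
static ssize_t recv_ignoring_sender(int fd, void *buf, size_t len)
{
	return recvfrom(fd, buf, len, 0, NULL, NULL);
}
```

If the sender's address is wanted instead, addrlen must be initialized
to the size of the address buffer before the call, which is the step
the failing test had omitted.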

Since the test should support tc built without libbpf, build the prog
twice: with the old-style definition and with the new-style definition,
then select the right one at runtime. This could be done at compile
time too, but this would not be cross-compilation friendly.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230129190501.1624747-2-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  s390/bpf: Fix a typo in a comment
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:45 +0000 (01:06 +0100)]
s390/bpf: Fix a typo in a comment

"desription" should be "description".

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-27-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  bpf: btf: Add BTF_FMODEL_SIGNED_ARG flag
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:44 +0000 (01:06 +0100)]
bpf: btf: Add BTF_FMODEL_SIGNED_ARG flag

s390x eBPF JIT needs to know whether a function return value is signed
and which function arguments are signed, in order to generate code
compliant with the s390x ABI.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-26-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  bpf: iterators: Split iterators.lskel.h into little- and big- endian versions
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:43 +0000 (01:06 +0100)]
bpf: iterators: Split iterators.lskel.h into little- and big- endian versions

iterators.lskel.h is little-endian, therefore bpf iterator is currently
broken on big-endian systems. Introduce a big-endian version and add
instructions regarding its generation. Unfortunately bpftool's
cross-endianness capabilities are limited to BTF right now, so the
procedure requires access to a big-endian machine or a configured
emulator.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-25-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  libbpf: Fix BPF_PROBE_READ{_STR}_INTO() on s390x
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:42 +0000 (01:06 +0100)]
libbpf: Fix BPF_PROBE_READ{_STR}_INTO() on s390x

BPF_PROBE_READ_INTO() and BPF_PROBE_READ_STR_INTO() should map to
bpf_probe_read() and bpf_probe_read_str() respectively in order to work
correctly on architectures with !ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-24-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  libbpf: Fix unbounded memory access in bpf_usdt_arg()
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:41 +0000 (01:06 +0100)]
libbpf: Fix unbounded memory access in bpf_usdt_arg()

Loading programs that use bpf_usdt_arg() on s390x fails with:

    ; if (arg_num >= BPF_USDT_MAX_ARG_CNT || arg_num >= spec->arg_cnt)
    128: (79) r1 = *(u64 *)(r10 -24)      ; frame1: R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
    129: (25) if r1 > 0xb goto pc+83      ; frame1: R1_w=scalar(umax=11,var_off=(0x0; 0xf))
    ...
    ; arg_spec = &spec->args[arg_num];
    135: (79) r1 = *(u64 *)(r10 -24)      ; frame1: R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
    ...
    ; switch (arg_spec->arg_type) {
    139: (61) r1 = *(u32 *)(r2 +8)
    R2 unbounded memory access, make sure to bounds check any such access

The reason is that, even though the C code enforces that
arg_num < BPF_USDT_MAX_ARG_CNT, the verifier cannot propagate this
constraint to the arg_spec assignment yet. Help it by forcing r1 back
to stack after comparison.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-23-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  libbpf: Simplify barrier_var()
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:40 +0000 (01:06 +0100)]
libbpf: Simplify barrier_var()

Use a single "+r" constraint instead of the separate "=r" and "0".
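
For readers unfamiliar with the idiom, the two spellings can be
compared side by side. This sketch assumes a GCC- or Clang-compatible
compiler; the macro bodies follow the description above, and
clamp_index() is a hypothetical user added for illustration.

```c
/* Older spelling: separate output operand and matching input operand. */
#define barrier_var_old(var) __asm__ __volatile__("" : "=r"(var) : "0"(var))

/* Simplified spelling: one read-write operand. */
#define barrier_var(var) __asm__ __volatile__("" : "+r"(var))

/*
 * The empty asm emits no instructions, but the compiler must assume the
 * variable may have changed, so it cannot carry earlier assumptions
 * about the value across the barrier or reorder uses around it.
 */
static unsigned long clamp_index(unsigned long idx, unsigned long max)
{
	if (idx >= max)
		return 0;
	barrier_var(idx);
	return idx;
}
```

Both forms leave the value untouched at run time; the difference is
purely in how the constraint is written.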

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-22-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  selftests/bpf: Fix profiler on s390x
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:39 +0000 (01:06 +0100)]
selftests/bpf: Fix profiler on s390x

Use bpf_probe_read_kernel() and bpf_probe_read_kernel_str() instead
of bpf_probe_read() and bpf_probe_read_str().

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-21-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  selftests/bpf: Fix xdp_synproxy/tc on s390x
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:38 +0000 (01:06 +0100)]
selftests/bpf: Fix xdp_synproxy/tc on s390x

Use the correct datatype for the values map values; currently the test
works by accident, since on little-endian machines it is sometimes
acceptable to access u64 as u32.
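
Why this works by accident on little-endian only can be shown with
explicit byte layouts. The helpers below are written for this sketch
and take the byte order as a parameter so that the result does not
depend on the machine running them.

```c
#include <stdint.h>

/* Store v into out[8] using the given byte order. */
static void store_u64(uint8_t out[8], uint64_t v, int big_endian)
{
	for (int i = 0; i < 8; i++) {
		int shift = big_endian ? 8 * (7 - i) : 8 * i;

		out[i] = (uint8_t)(v >> shift);
	}
}

/*
 * Read the first 4 bytes as a u32 in the same byte order, which is what
 * a u32 access at the address of the u64 would see.
 */
static uint32_t load_u32_prefix(const uint8_t in[8], int big_endian)
{
	uint32_t v = 0;

	for (int i = 0; i < 4; i++) {
		int shift = big_endian ? 8 * (3 - i) : 8 * i;

		v |= (uint32_t)in[i] << shift;
	}
	return v;
}
```

For a small value such as 42, the little-endian read returns 42 (the
low half comes first), while the big-endian read returns 0 (the high
half comes first).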

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-20-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  selftests/bpf: Fix vmlinux test on s390x
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:36 +0000 (01:06 +0100)]
selftests/bpf: Fix vmlinux test on s390x

Use a syscall macro to access the nanosleep()'s first argument;
currently the code uses gprs[2] instead of orig_gpr2.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-18-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  selftests/bpf: Fix test_xdp_adjust_tail_grow2 on s390x
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:35 +0000 (01:06 +0100)]
selftests/bpf: Fix test_xdp_adjust_tail_grow2 on s390x

s390x cache line size is 256 bytes, so skb_shared_info must be aligned
on a much larger boundary than for x86.
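
The size of the effect follows from rounding up to the cache line. In
this sketch align_up() is a generic helper, and the 320-byte structure
size used in the note below is only an assumed example value, not a
figure taken from the commit.

```c
#include <stddef.h>

/* Round x up to the next multiple of a, where a is a power of two. */
static size_t align_up(size_t x, size_t a)
{
	return (x + a - 1) & ~(a - 1);
}
```

With an assumed 320-byte structure, a 64-byte cache line keeps the
reserved tailroom at 320 bytes, while a 256-byte cache line raises it to
512 bytes, which changes how far the tail of a packet can grow.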

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-17-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months ago  selftests/bpf: Fix test_lsm on s390x
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:34 +0000 (01:06 +0100)]
selftests/bpf: Fix test_lsm on s390x

Use syscall macros to access the setdomainname() arguments; currently
the code uses gprs[2] instead of orig_gpr2 for the first argument.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-16-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months agoselftests/bpf: Add a sign-extension test for kfuncs
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:33 +0000 (01:06 +0100)]
selftests/bpf: Add a sign-extension test for kfuncs

s390x ABI requires the caller to zero- or sign-extend the arguments.
eBPF already deals with zero-extension (by definition of its ABI), but
not with sign-extension.

Add a test to cover that potentially problematic area.
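
To make the problem concrete, the sketch below contrasts zero- and
sign-extension of a negative 32-bit argument into a 64-bit register. The
function names are invented for this illustration and the callee is a toy
model, not an actual kfunc.

```c
#include <stdint.h>

/* What a caller does if it only zero-extends a 32-bit argument. */
static uint64_t zero_extend32(int32_t v)
{
	return (uint64_t)(uint32_t)v;
}

/* What the s390x ABI expects for a signed 32-bit argument. */
static uint64_t sign_extend32(int32_t v)
{
	return (uint64_t)(int64_t)v;
}

/* A callee that trusts the full 64-bit register to hold a signed value
 * and only checks whether it is negative. */
static int callee_sees_negative(uint64_t reg)
{
	return (reg >> 63) != 0;
}
```

With zero-extension only, the callee no longer sees -1 as negative.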

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-15-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months agoselftests/bpf: Increase SIZEOF_BPF_LOCAL_STORAGE_ELEM on s390x
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:32 +0000 (01:06 +0100)]
selftests/bpf: Increase SIZEOF_BPF_LOCAL_STORAGE_ELEM on s390x

sizeof(struct bpf_local_storage_elem) is 512 on s390x:

    struct bpf_local_storage_elem {
            struct hlist_node          map_node;             /*     0    16 */
            struct hlist_node          snode;                /*    16    16 */
            struct bpf_local_storage * local_storage;        /*    32     8 */
            struct callback_head       rcu __attribute__((__aligned__(8))); /*    40    16 */

            /* XXX 200 bytes hole, try to pack */

            /* --- cacheline 1 boundary (256 bytes) --- */
            struct bpf_local_storage_data sdata __attribute__((__aligned__(256))); /*   256     8 */

            /* size: 512, cachelines: 2, members: 5 */
            /* sum members: 64, holes: 1, sum holes: 200 */
            /* padding: 248 */
            /* forced alignments: 2, forced holes: 1, sum forced holes: 200 */
    } __attribute__((__aligned__(256)));

As the existing comment suggests, use a larger number in order to be
future-proof.
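
The layout above can be reproduced in userspace with a stand-in struct.
This is only a sketch: the member sizes are copied from the pahole output,
the member types are placeholders, and _Alignas(256) plays the role of the
s390x cache line alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: member sizes copied from the pahole output
 * above, with _Alignas(256) playing the role of the cache line
 * alignment on s390x. */
struct elem_sketch {
	unsigned char map_node[16];
	unsigned char snode[16];
	uint64_t local_storage;		/* stands in for an 8-byte pointer */
	_Alignas(8) unsigned char rcu[16];
	_Alignas(256) uint64_t sdata;
};
```

The 256-byte alignment pushes sdata to offset 256 and rounds the total
size up to 512, matching the numbers quoted above.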

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-14-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months agoselftests/bpf: Check stack_mprotect() return value
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:31 +0000 (01:06 +0100)]
selftests/bpf: Check stack_mprotect() return value

If stack_mprotect() succeeds, errno is not changed. This can produce
misleading error messages that show a stale errno.
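
A minimal sketch of the pattern, with hypothetical operations standing in
for stack_mprotect(): errno is consulted only after the return value has
signalled a failure.

```c
#include <errno.h>

/* Hypothetical operations standing in for stack_mprotect(). */
static int op_succeeds(void) { return 0; }
static int op_fails(void) { errno = EACCES; return -1; }

/* Return the error to report: errno is consulted only when the call
 * itself signalled failure, never when it returned success. */
static int error_to_report(int (*op)(void))
{
	if (op() == 0)
		return 0;
	return errno;
}
```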

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-13-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months agoselftests/bpf: Fix cgrp_local_storage on s390x
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:30 +0000 (01:06 +0100)]
selftests/bpf: Fix cgrp_local_storage on s390x

Sync the definition of socket_cookie between the eBPF program and the
test. Currently the test works by accident, since on little-endian it
is sometimes acceptable to access u64 as u32.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-12-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months agoselftests/bpf: Fix xdp_do_redirect on s390x
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:29 +0000 (01:06 +0100)]
selftests/bpf: Fix xdp_do_redirect on s390x

s390x cache line size is 256 bytes, so skb_shared_info must be aligned
on a much larger boundary than for x86. This makes the maximum packet
size smaller.
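
The effect on the packet size can be sketched with some arithmetic. All
three sizes below are illustrative assumptions, not the values used by the
kernel or by the test.

```c
#include <stddef.h>

/* Round x up to a multiple of a, where a is a power of two. */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((size_t)(a) - 1))

/* All three sizes are illustrative assumptions. */
#define SKETCH_PAGE_SIZE   4096u
#define SKETCH_HEADROOM     256u
#define SKETCH_SHINFO_SIZE  320u

/* Bytes left for packet data once the headroom and the cache line
 * aligned shared-info area have been subtracted from one page. */
static size_t max_payload(size_t cache_line)
{
	size_t tailroom = ALIGN_UP(SKETCH_SHINFO_SIZE, cache_line);

	return SKETCH_PAGE_SIZE - SKETCH_HEADROOM - tailroom;
}
```

Under these assumptions a 64-byte cache line leaves 3520 bytes, while a
256-byte cache line leaves 3328.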

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-11-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months agoselftests/bpf: Fix verify_pkcs7_sig on s390x
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:28 +0000 (01:06 +0100)]
selftests/bpf: Fix verify_pkcs7_sig on s390x

Use bpf_probe_read_kernel() instead of bpf_probe_read(), which is not
defined on all architectures.

While at it, improve the error handling: do not hide the verifier log,
and check the return values of bpf_probe_read_kernel() and
bpf_copy_from_user().

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-10-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months agoselftests/bpf: Fix decap_sanity_ns cleanup
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:27 +0000 (01:06 +0100)]
selftests/bpf: Fix decap_sanity_ns cleanup

decap_sanity prints the following on the 1st run:

    decap_sanity: sh: 1: Syntax error: Bad fd number

and the following on the 2nd run:

    Cannot create namespace file "/run/netns/decap_sanity_ns": File exists

The problem is that the cleanup command has a typo and does nothing.
Fix the typo.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-9-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months agoselftests/bpf: Set errno when urand_spawn() fails
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:26 +0000 (01:06 +0100)]
selftests/bpf: Set errno when urand_spawn() fails

The result of urand_spawn() is checked with ASSERT_OK_PTR, which treats
NULL as success if errno == 0.
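
The sketch below shows why a NULL return must come with a non-zero errno.
Both functions are simplified stand-ins written for this illustration;
they are not the selftest macro or the real urand_spawn().

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for an "error from pointer" check: a NULL
 * pointer is translated into -errno. Not the real selftest macro. */
static long ptr_error_sketch(const void *ptr)
{
	return ptr ? 0 : -errno;
}

/* Hypothetical spawn helper that sets errno on every failure path. */
static void *spawn_sketch(int should_fail)
{
	static int token;

	if (should_fail) {
		errno = ECHILD;
		return NULL;
	}
	return &token;
}
```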

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-8-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months agoselftests/bpf: Fix kfree_skb on s390x
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:25 +0000 (01:06 +0100)]
selftests/bpf: Fix kfree_skb on s390x

h_proto is big-endian; use htons() in order to make comparison work on
both little- and big-endian machines.
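
A minimal userspace sketch of an endian-neutral comparison, assuming a
plain 14-byte Ethernet header. The function and constant names are made up
for this example.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ETH_P_IPV6 0x86DD	/* EtherType for IPv6 */

/* Return 1 if the 14-byte Ethernet header at frame carries IPv6.
 * The EtherType sits at byte offset 12 in network byte order. */
static int frame_is_ipv6(const unsigned char *frame)
{
	uint16_t h_proto;

	memcpy(&h_proto, frame + 12, sizeof(h_proto));
	return h_proto == htons(SKETCH_ETH_P_IPV6);
}
```

Comparing against the bare constant would only match on big-endian hosts.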

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-7-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months agoselftests/bpf: Fix symlink creation error
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:24 +0000 (01:06 +0100)]
selftests/bpf: Fix symlink creation error

When building with O=, the following error occurs:

    ln: failed to create symbolic link 'no_alu32/bpftool': No such file or directory

Adjust the code to account for $(OUTPUT).

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-6-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months agoselftests/bpf: Fix liburandom_read.so linker error
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:23 +0000 (01:06 +0100)]
selftests/bpf: Fix liburandom_read.so linker error

When building with O=, the following linker error occurs:

    clang: error: no such file or directory: 'liburandom_read.so'

Fix by adding $(OUTPUT) to the linker search path.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-5-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months agoselftests/bpf: Query BPF_MAX_TRAMP_LINKS using BTF
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:22 +0000 (01:06 +0100)]
selftests/bpf: Query BPF_MAX_TRAMP_LINKS using BTF

Do not hard-code the value, since for s390x it will be smaller than
for x86.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-4-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months agobpf: Change BPF_MAX_TRAMP_LINKS to enum
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:21 +0000 (01:06 +0100)]
bpf: Change BPF_MAX_TRAMP_LINKS to enum

This way it's possible to query its value from testcases using BTF.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-3-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months agobpf: Use ARG_CONST_SIZE_OR_ZERO for 3rd argument of bpf_tcp_raw_gen_syncookie_ipv...
Ilya Leoshkevich [Sat, 28 Jan 2023 00:06:20 +0000 (01:06 +0100)]
bpf: Use ARG_CONST_SIZE_OR_ZERO for 3rd argument of bpf_tcp_raw_gen_syncookie_ipv{4,6}()

These functions already check that th_len < sizeof(*th), and
propagating the lower bound (th_len > 0) may be challenging
in complex code, e.g. as is the case with xdp_synproxy test on
s390x [1]. Switch to ARG_CONST_SIZE_OR_ZERO in order to make the
verifier accept code where it cannot prove that th_len > 0.

[1] https://lore.kernel.org/bpf/CAEf4Bzb3uiSHtUbgVWmkWuJ5Sw1UZd4c_iuS4QXtUkXmTTtXuQ@mail.gmail.com/
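
The reasoning can be sketched as follows: once the helper rejects every
length below the header size, a zero length is rejected at run time anyway
and needs no separate proof from the verifier. The helper below is a
hypothetical stand-in with an assumed 20-byte minimum, not the kernel
implementation.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_TCP_HDR_MIN 20u	/* size of a TCP header without options */

/* Hypothetical helper body: the length check rejects every value below
 * the minimum header size, and zero is just one such value. */
static int64_t gen_cookie_sketch(const void *th, uint32_t th_len)
{
	(void)th;
	if (th_len < SKETCH_TCP_HDR_MIN)
		return -EINVAL;
	return 0;	/* a real helper would compute a cookie here */
}
```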

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-2-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months agoDocumentation: bpf: correct spelling
Randy Dunlap [Sat, 28 Jan 2023 19:50:46 +0000 (11:50 -0800)]
Documentation: bpf: correct spelling

Correct spelling problems for Documentation/bpf/ as reported
by codespell.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: bpf@vger.kernel.org
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://lore.kernel.org/r/20230128195046.13327-1-rdunlap@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months agobpf: Build-time assert that cpumask offset is zero
David Vernet [Sat, 28 Jan 2023 14:15:37 +0000 (08:15 -0600)]
bpf: Build-time assert that cpumask offset is zero

The first element of a struct bpf_cpumask is a cpumask_t. This is done
to allow struct bpf_cpumask to be cast to a struct cpumask. If this
element were ever moved to another field, any BPF program passing a
struct bpf_cpumask * to a kfunc expecting a const struct cpumask * would
immediately fail to load. Add a build-time assertion so this
assumption is captured and verified.
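
The idea can be sketched in plain C11 with static_assert(). The two
structs are hypothetical stand-ins for struct cpumask and struct
bpf_cpumask, and the kernel itself may use a different assertion macro.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct cpumask and struct bpf_cpumask. */
struct mask_sketch {
	unsigned long bits[4];
};

struct wrapper_sketch {
	struct mask_sketch mask;	/* must stay the first member */
	int usage;
};

/* Fails the build if someone moves the mask away from offset 0. */
static_assert(offsetof(struct wrapper_sketch, mask) == 0,
	      "mask must be the first member");

/* The cast is only valid because of the assertion above. */
static const struct mask_sketch *as_mask(const struct wrapper_sketch *w)
{
	return (const struct mask_sketch *)w;
}
```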

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230128141537.100777-1-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
17 months agonet: netlink: recommend policy range validation
Johannes Berg [Fri, 27 Jan 2023 07:45:06 +0000 (08:45 +0100)]
net: netlink: recommend policy range validation

For large ranges (outside of s16) the documentation currently
recommends open-coding the validation, but it's better to use
the NLA_POLICY_FULL_RANGE() or NLA_POLICY_FULL_RANGE_SIGNED()
policy validation instead; recommend that.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20230127084506.09f280619d64.I5dece85f06efa8ab0f474ca77df9e26d3553d4ab@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agoMerge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf...
Jakub Kicinski [Sat, 28 Jan 2023 07:59:45 +0000 (23:59 -0800)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
bpf-next 2023-01-28

We've added 124 non-merge commits during the last 22 day(s) which contain
a total of 124 files changed, 6386 insertions(+), 1827 deletions(-).

The main changes are:

1) Implement XDP hints via kfuncs with initial support for RX hash and
   timestamp metadata kfuncs, from Stanislav Fomichev and
   Toke Høiland-Jørgensen.
   Measurements on overhead: https://lore.kernel.org/bpf/875yellcx6.fsf@toke.dk

2) Extend libbpf's bpf_tracing.h support for tracing arguments of
   kprobes/uprobes and syscall as a special case, from Andrii Nakryiko.

3) Significantly reduce the search time for module symbols by livepatch
   and BPF, from Jiri Olsa and Zhen Lei.

4) Enable cpumasks to be used as kptrs, which is useful for tracing
   programs tracking which tasks end up running on which CPUs
   in different time intervals, from David Vernet.

5) Fix several issues in the dynptr processing such as stack slot liveness
   propagation, missing checks for PTR_TO_STACK variable offset, etc,
   from Kumar Kartikeya Dwivedi.

6) Various performance improvements, fixes, and introduction of more
   than just one XDP program to XSK selftests, from Magnus Karlsson.

7) Big batch to BPF samples to reduce deprecated functionality,
   from Daniel T. Lee.

8) Enable struct_ops programs to be sleepable in verifier,
   from David Vernet.

9) Reduce pr_warn() noise on BTF mismatches when they are expected under
   the CONFIG_MODULE_ALLOW_BTF_MISMATCH config anyway, from Connor O'Brien.

10) Describe modulo and division by zero behavior of the BPF runtime
    in BPF's instruction specification document, from Dave Thaler.

11) Several improvements to libbpf API documentation in libbpf.h,
    from Grant Seltzer.

12) Improve resolve_btfids header dependencies related to subcmd and add
    proper support for HOSTCC, from Ian Rogers.

13) Add ipip6 and ip6ip decapsulation support for bpf_skb_adjust_room()
    helper along with BPF selftests, from Ziyang Xuan.

14) Simplify the parsing logic of structure parameters for BPF trampoline
    in the x86-64 JIT compiler, from Pu Lehui.

15) Get BTF working for kernels with CONFIG_RUST enabled by excluding
    Rust compilation units with pahole, from Martin Rodriguez Reboredo.

16) Get bpf_setsockopt() working for kTLS on top of TCP sockets,
    from Kui-Feng Lee.

17) Disable stack protection for BPF objects in bpftool given BPF backends
    don't support it, from Holger Hoffstätte.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (124 commits)
  selftest/bpf: Make crashes more debuggable in test_progs
  libbpf: Add documentation to map pinning API functions
  libbpf: Fix malformed documentation formatting
  selftests/bpf: Properly enable hwtstamp in xdp_hw_metadata
  selftests/bpf: Calls bpf_setsockopt() on a ktls enabled socket.
  bpf: Check the protocol of a sock to agree the calls to bpf_setsockopt().
  bpf/selftests: Verify struct_ops prog sleepable behavior
  bpf: Pass const struct bpf_prog * to .check_member
  libbpf: Support sleepable struct_ops.s section
  bpf: Allow BPF_PROG_TYPE_STRUCT_OPS programs to be sleepable
  selftests/bpf: Fix vmtest static compilation error
  tools/resolve_btfids: Alter how HOSTCC is forced
  tools/resolve_btfids: Install subcmd headers
  bpf/docs: Document the nocast aliasing behavior of ___init
  bpf/docs: Document how nested trusted fields may be defined
  bpf/docs: Document cpumask kfuncs in a new file
  selftests/bpf: Add selftest suite for cpumask kfuncs
  selftests/bpf: Add nested trust selftests suite
  bpf: Enable cpumasks to be queried and used as kptrs
  bpf: Disallow NULLable pointers for trusted kfuncs
  ...
====================

Link: https://lore.kernel.org/r/20230128004827.21371-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agonetpoll: Remove 4s sleep during carrier detection
Breno Leitao [Wed, 25 Jan 2023 18:52:30 +0000 (10:52 -0800)]
netpoll: Remove 4s sleep during carrier detection

This patch removes the msleep(4s) during netpoll_setup() if the carrier
appears instantly.

Here are some scenarios where this workaround is counterproductive on
modern systems:

Servers whose BMC communicates over NC-SI via the same NIC that is used
for netconsole. The BMC keeps the PHY up, hence the carrier appears
instantly.

The link is fibre, SERDES getting sync could happen within 0.1Hz, and
the carrier also appears instantly.

Other than that, if a driver is reporting instant carrier and then
losing it, this is probably a driver bug.

Reported-by: Michael van der Westhuizen <rmikey@meta.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20230125185230.3574681-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Sat, 28 Jan 2023 06:25:19 +0000 (22:25 -0800)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Conflicts:

drivers/net/ethernet/intel/ice/ice_main.c
  418e53401e47 ("ice: move devlink port creation/deletion")
  643ef23bd9dd ("ice: Introduce local var for readability")
https://lore.kernel.org/all/20230127124025.0dacef40@canb.auug.org.au/
https://lore.kernel.org/all/20230124005714.3996270-1-anthony.l.nguyen@intel.com/

drivers/net/ethernet/engleder/tsnep_main.c
  3d53aaef4332 ("tsnep: Fix TX queue stop/wake for multiple queues")
  25faa6a4c5ca ("tsnep: Replace TX spin_lock with __netif_tx_lock")
https://lore.kernel.org/all/20230127123604.36bb3e99@canb.auug.org.au/

net/netfilter/nf_conntrack_proto_sctp.c
  13bd9b31a969 ("Revert "netfilter: conntrack: add sctp DATA_SENT state"")
  a44b7651489f ("netfilter: conntrack: unify established states for SCTP paths")
  f71cb8f45d09 ("netfilter: conntrack: sctp: use nf log infrastructure for invalid packets")
https://lore.kernel.org/all/20230127125052.674281f9@canb.auug.org.au/
https://lore.kernel.org/all/d36076f3-6add-a442-6d4b-ead9f7ffff86@tessares.net/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agonet: dsa: mt7530: fix tristate and help description
Arınç ÜNAL [Thu, 26 Jan 2023 19:01:11 +0000 (22:01 +0300)]
net: dsa: mt7530: fix tristate and help description

Fix description for tristate and help sections which include inaccurate
information.

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Link: https://lore.kernel.org/r/20230126190110.9124-1-arinc.unal@arinc9.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agoMerge branch 'net-xdp-execute-xdp_do_flush-before-napi_complete_done'
Jakub Kicinski [Sat, 28 Jan 2023 06:28:01 +0000 (22:28 -0800)]
Merge branch 'net-xdp-execute-xdp_do_flush-before-napi_complete_done'

Magnus Karlsson says:

====================
net: xdp: execute xdp_do_flush() before napi_complete_done()

Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found in [1].

The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in [2].
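
The required ordering can be sketched as a userspace model. The three
stubs only record the order in which they are called; they are
hypothetical placeholders for xdp_do_redirect(), xdp_do_flush() and
napi_complete_done(), whose real signatures are not reproduced here.

```c
#include <stddef.h>

enum step { STEP_REDIRECT, STEP_FLUSH, STEP_COMPLETE };

static enum step trace[8];
static size_t trace_len;

static void record(enum step s)
{
	if (trace_len < sizeof(trace) / sizeof(trace[0]))
		trace[trace_len++] = s;
}

/* Hypothetical stubs; the real functions live in the kernel. */
static void redirect_stub(void) { record(STEP_REDIRECT); }
static void flush_stub(void)    { record(STEP_FLUSH); }
static void complete_stub(void) { record(STEP_COMPLETE); }

/* Poll routine in the required order: flush pending redirects first,
 * and only then report the poll as complete. */
static int poll_sketch(int budget, int pkts)
{
	int done = 0;

	while (done < budget && done < pkts) {
		redirect_stub();
		done++;
	}
	if (done > 0)
		flush_stub();
	if (done < budget)
		complete_stub();
	return done;
}
```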

The drivers have only been compile-tested since I do not own any of
the HW below. So if you are a maintainer, it would be great if you
could take a quick look to make sure I did not mess something up.

Note that these were the drivers I found that violated the ordering by
running a simple script and manually checking the ones that came up as
potential offenders. But the script was not perfect in any way. There
might still be offenders out there, since the script can generate
false negatives.

[1] https://lore.kernel.org/r/20221220185903.1105011-1-sbohrer@cloudflare.com
[2] https://lore.kernel.org/all/20210624160609.292325-1-toke@redhat.com/
====================

Link: https://lore.kernel.org/r/20230125074901.2737-1-magnus.karlsson@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agodpaa2-eth: execute xdp_do_flush() before napi_complete_done()
Magnus Karlsson [Wed, 25 Jan 2023 07:49:01 +0000 (08:49 +0100)]
dpaa2-eth: execute xdp_do_flush() before napi_complete_done()

Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found following the first link below.

The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in the second link below.

Fixes: d678be1dc1ec ("dpaa2-eth: add XDP_REDIRECT support")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20221220185903.1105011-1-sbohrer@cloudflare.com
Link: https://lore.kernel.org/all/20210624160609.292325-1-toke@redhat.com/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agodpaa_eth: execute xdp_do_flush() before napi_complete_done()
Magnus Karlsson [Wed, 25 Jan 2023 07:49:00 +0000 (08:49 +0100)]
dpaa_eth: execute xdp_do_flush() before napi_complete_done()

Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found following the first link below.

The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in the second link below.

Fixes: a1e031ffb422 ("dpaa_eth: add XDP_REDIRECT support")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20221220185903.1105011-1-sbohrer@cloudflare.com
Link: https://lore.kernel.org/all/20210624160609.292325-1-toke@redhat.com/
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agovirtio-net: execute xdp_do_flush() before napi_complete_done()
Magnus Karlsson [Wed, 25 Jan 2023 07:48:59 +0000 (08:48 +0100)]
virtio-net: execute xdp_do_flush() before napi_complete_done()

Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found following the first link below.

The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in the second link below.

Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20221220185903.1105011-1-sbohrer@cloudflare.com
Link: https://lore.kernel.org/all/20210624160609.292325-1-toke@redhat.com/
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agolan966x: execute xdp_do_flush() before napi_complete_done()
Magnus Karlsson [Wed, 25 Jan 2023 07:48:58 +0000 (08:48 +0100)]
lan966x: execute xdp_do_flush() before napi_complete_done()

Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found following the first link below.

The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in the second link below.

Fixes: a825b611c7c1 ("net: lan966x: Add support for XDP_REDIRECT")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Link: https://lore.kernel.org/r/20221220185903.1105011-1-sbohrer@cloudflare.com
Link: https://lore.kernel.org/all/20210624160609.292325-1-toke@redhat.com/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agoqede: execute xdp_do_flush() before napi_complete_done()
Magnus Karlsson [Wed, 25 Jan 2023 07:48:57 +0000 (08:48 +0100)]
qede: execute xdp_do_flush() before napi_complete_done()

Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found following the first link below.

The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in the second link below.

Fixes: d1b25b79e162b ("qede: add .ndo_xdp_xmit() and XDP_REDIRECT support")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20221220185903.1105011-1-sbohrer@cloudflare.com
Link: https://lore.kernel.org/all/20210624160609.292325-1-toke@redhat.com/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
17 months agoselftest/bpf: Make crashes more debuggable in test_progs
Stanislav Fomichev [Fri, 27 Jan 2023 21:57:05 +0000 (13:57 -0800)]
selftest/bpf: Make crashes more debuggable in test_progs

Reset stdio before printing verbose log of the SIGSEGV'ed test.
Otherwise, it's hard to understand what's going on in the cases like [0].

With the following patch applied:

--- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -392,6 +392,11 @@ void test_xdp_metadata(void)
         "generate freplace packet"))
  goto out;

+
+ ASSERT_EQ(1, 2, "oops");
+ int *x = 0;
+ *x = 1; /* die */
+
  while (!retries--) {
  if (bpf_obj2->bss->called)
  break;

Before:

 #281     xdp_metadata:FAIL
Caught signal #11!
Stack trace:
./test_progs(crash_handler+0x1f)[0x55c919d98bcf]
/lib/x86_64-linux-gnu/libc.so.6(+0x3bf90)[0x7f36aea5df90]
./test_progs(test_xdp_metadata+0x1db0)[0x55c919d8c6d0]
./test_progs(+0x23b438)[0x55c919d9a438]
./test_progs(main+0x534)[0x55c919d99454]
/lib/x86_64-linux-gnu/libc.so.6(+0x2718a)[0x7f36aea4918a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f36aea49245]
./test_progs(_start+0x21)[0x55c919b82ef1]

After:

test_xdp_metadata:PASS:ip netns add xdp_metadata 0 nsec
open_netns:PASS:malloc token 0 nsec
open_netns:PASS:open /proc/self/ns/net 0 nsec
open_netns:PASS:open netns fd 0 nsec
open_netns:PASS:setns 0 nsec
..
test_xdp_metadata:FAIL:oops unexpected oops: actual 1 != expected 2
 #281     xdp_metadata:FAIL
Caught signal #11!
Stack trace:
./test_progs(crash_handler+0x1f)[0x562714a76bcf]
/lib/x86_64-linux-gnu/libc.so.6(+0x3bf90)[0x7fa663f9cf90]
./test_progs(test_xdp_metadata+0x1db0)[0x562714a6a6d0]
./test_progs(+0x23b438)[0x562714a78438]
./test_progs(main+0x534)[0x562714a77454]
/lib/x86_64-linux-gnu/libc.so.6(+0x2718a)[0x7fa663f8818a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7fa663f88245]
./test_progs(_start+0x21)[0x562714860ef1]

0: https://github.com/kernel-patches/bpf/actions/runs/4019879316/jobs/6907358876

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230127215705.1254316-1-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
17 months agolibbpf: Add documentation to map pinning API functions
Grant Seltzer [Thu, 26 Jan 2023 02:42:25 +0000 (21:42 -0500)]
libbpf: Add documentation to map pinning API functions

This adds documentation for the following API functions:

- bpf_map__set_pin_path()
- bpf_map__pin_path()
- bpf_map__is_pinned()
- bpf_map__pin()
- bpf_map__unpin()
- bpf_object__pin_maps()
- bpf_object__unpin_maps()

Signed-off-by: Grant Seltzer <grantseltzer@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230126024225.520685-1-grantseltzer@gmail.com
17 months agolibbpf: Fix malformed documentation formatting
Grant Seltzer [Thu, 26 Jan 2023 02:47:49 +0000 (21:47 -0500)]
libbpf: Fix malformed documentation formatting

This fixes the doxygen format documentation above the
user_ring_buffer__* APIs. There has to be a newline
before the @brief, otherwise doxygen won't render them
for libbpf.readthedocs.org.

Signed-off-by: Grant Seltzer <grantseltzer@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230126024749.522278-1-grantseltzer@gmail.com
17 months agoMerge branch 'devlink-parama-cleanup'
David S. Miller [Fri, 27 Jan 2023 12:32:03 +0000 (12:32 +0000)]
Merge branch 'devlink-parama-cleanup'

Jiri Pirko says:

====================
devlink: Cleanup params usage

This patchset takes care of a small cleanup of devlink params usage.
Some of the patches (first 2/3) are cosmetic, but I would like to
point out a couple of interesting ones:

Patch 9 is the main one of this set and introduces devlink instance
locking for params, similar to other devlink objects. That allows params
to be registered/unregistered when devlink instance is registered.

Patches 10-12 change mlx5 code to register non-driverinit params in the
code they are related to, and thanks to patch 9 this might be when
devlink instance is registered - for example during devlink reload.

---
v1->v2:
- Just small fix in the last patch
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet/mlx5: Move eswitch port metadata devlink param to flow eswitch code
Jiri Pirko [Thu, 26 Jan 2023 07:58:38 +0000 (08:58 +0100)]
net/mlx5: Move eswitch port metadata devlink param to flow eswitch code

Move the param registration and handling code into the eswitch offloads
code as they are related to each other. No point in having the
devlink param registration done in a separate file.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet/mlx5: Move flow steering devlink param to flow steering code
Jiri Pirko [Thu, 26 Jan 2023 07:58:37 +0000 (08:58 +0100)]
net/mlx5: Move flow steering devlink param to flow steering code

Move the param registration and handling code into the flow steering
code as they are related to each other. No point in having the
devlink param registration done in a separate file.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months agonet/mlx5: Move fw reset devlink param to fw reset code
Jiri Pirko [Thu, 26 Jan 2023 07:58:36 +0000 (08:58 +0100)]
net/mlx5: Move fw reset devlink param to fw reset code

Move the param registration and handling code into the fw reset code
as they are related to each other. No point in having the devlink param
registration done in a separate file.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago devlink: protect devlink param list by instance lock
Jiri Pirko [Thu, 26 Jan 2023 07:58:35 +0000 (08:58 +0100)]
devlink: protect devlink param list by instance lock

Commit 1d18bb1a4ddd ("devlink: allow registering parameters after
the instance"), as the subject implies, introduced the possibility to
register devlink params even for an already registered devlink instance.
This is a bit problematic, as the consistency of the params list was
originally secured by the fact that it is static during the devlink
lifetime. So in order to protect the params list, take the devlink
instance lock during the params operations. Introduce unlocked function
variants and use them in drivers in locked context. Put lock assertions
in appropriate places.
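
The locked/unlocked split described above can be sketched in userspace
roughly as follows. This is only an illustration under stated assumptions:
struct instance, the lock_held flag and all function names are hypothetical
stand-ins, and a pthread mutex plays the role of the devlink instance lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a devlink instance; not the kernel's struct. */
struct instance {
	pthread_mutex_t lock;
	bool lock_held;		/* tracked only so the sketch can assert on it */
	int nparams;
};

static void instance_lock(struct instance *inst)
{
	pthread_mutex_lock(&inst->lock);
	inst->lock_held = true;
}

static void instance_unlock(struct instance *inst)
{
	inst->lock_held = false;
	pthread_mutex_unlock(&inst->lock);
}

/* Unlocked variant: the caller must already hold the instance lock. */
static void param_register_unlocked(struct instance *inst)
{
	assert(inst->lock_held);	/* plays the role of a lock assertion */
	inst->nparams++;
}

/* Locked variant: takes the lock around the unlocked one. */
static void param_register(struct instance *inst)
{
	instance_lock(inst);
	param_register_unlocked(inst);
	instance_unlock(inst);
}
```

A caller that already holds the lock (for example during a reload) would use
the unlocked variant directly; everyone else goes through the locked wrapper.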

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago devlink: put couple of WARN_ONs in devlink_param_driverinit_value_get()
Jiri Pirko [Thu, 26 Jan 2023 07:58:34 +0000 (08:58 +0100)]
devlink: put couple of WARN_ONs in devlink_param_driverinit_value_get()

Put a couple of WARN_ONs in the devlink_param_driverinit_value_get()
function to clearly indicate that it is a driver bug if it is used
without reload support or for a non-driverinit param.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago devlink: make devlink_param_driverinit_value_set() return void
Jiri Pirko [Thu, 26 Jan 2023 07:58:33 +0000 (08:58 +0100)]
devlink: make devlink_param_driverinit_value_set() return void

devlink_param_driverinit_value_set() currently returns int with a possible
error, but no user checks it anyway. The only reason for a failure is
a driver bug. So convert the function to return void and put WARN_ONs
on error paths.
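
As a rough userspace illustration of this conversion (not the kernel code):
warn_on() below is a hypothetical stand-in for WARN_ON(), and struct param
and its fields are made up for the sketch.

```c
#include <stdbool.h>
#include <stdio.h>

/* Userspace stand-in for WARN_ON(): report the problem, yield the condition. */
static int warn_count;

static bool warn_on(bool cond, const char *what)
{
	if (cond) {
		warn_count++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

enum param_mode { PARAM_RUNTIME, PARAM_DRIVERINIT };	/* hypothetical */

struct param {
	enum param_mode mode;
	int value;
	bool value_set;
};

/* Returns void: misuse is a driver bug, so warn instead of returning an error. */
static void param_driverinit_value_set(struct param *p, int value)
{
	if (warn_on(!p, "param not found"))
		return;
	if (warn_on(p->mode != PARAM_DRIVERINIT, "not a driverinit param"))
		return;
	p->value = value;
	p->value_set = true;
}
```

Callers no longer have an error code to ignore; a misuse shows up as a
warning in the log instead.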

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago qed: remove pointless call to devlink_param_driverinit_value_set()
Jiri Pirko [Thu, 26 Jan 2023 07:58:32 +0000 (08:58 +0100)]
qed: remove pointless call to devlink_param_driverinit_value_set()

devlink_param_driverinit_value_set() call makes sense only for
"driverinit" params. However here, the param is "runtime".
devlink_param_driverinit_value_set() returns -EOPNOTSUPP in such case
and does not do anything. So remove the pointless call to
devlink_param_driverinit_value_set() entirely.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago ice: remove pointless calls to devlink_param_driverinit_value_set()
Jiri Pirko [Thu, 26 Jan 2023 07:58:31 +0000 (08:58 +0100)]
ice: remove pointless calls to devlink_param_driverinit_value_set()

devlink_param_driverinit_value_set() call makes sense only for
"driverinit" params. However here, both params are "runtime".
devlink_param_driverinit_value_set() returns -EOPNOTSUPP in such case
and does not do anything. So remove the pointless calls to
devlink_param_driverinit_value_set() entirely.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago devlink: don't work with possible NULL pointer in devlink_param_unregister()
Jiri Pirko [Thu, 26 Jan 2023 07:58:30 +0000 (08:58 +0100)]
devlink: don't work with possible NULL pointer in devlink_param_unregister()

There is a WARN_ON checking the param_item for being NULL when the param
is not inserted in the list. That indicates a driver BUG. Instead of
continuing to work with a NULL pointer, with its consequences, return.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago devlink: make devlink_param_register/unregister static
Jiri Pirko [Thu, 26 Jan 2023 07:58:29 +0000 (08:58 +0100)]
devlink: make devlink_param_register/unregister static

There is no user outside the devlink code, so remove the export and make
the functions static. Move them before callers to avoid forward
declarations.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net/mlx5: Covert devlink params registration to use devlink_params_register/unregister()
Jiri Pirko [Thu, 26 Jan 2023 07:58:28 +0000 (08:58 +0100)]
net/mlx5: Covert devlink params registration to use devlink_params_register/unregister()

Since mlx5 is the only user of the devlink API to register/unregister a
single param, convert it to use the array registration functions. This
allows simplifying the devlink API by removing the single param
registration functions.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net/mlx5: Change devlink param register/unregister function names
Jiri Pirko [Thu, 26 Jan 2023 07:58:27 +0000 (08:58 +0100)]
net/mlx5: Change devlink param register/unregister function names

The functions are registering and unregistering devlink params, so
change the names accordingly.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago Merge branch 'ethtool-netlink-next'
David S. Miller [Fri, 27 Jan 2023 12:24:32 +0000 (12:24 +0000)]
Merge branch 'ethtool-netlink-next'

Jakub Kicinski says:

====================
ethtool: netlink: handle SET intro/outro in the common code

Factor out the boilerplate code from SET handlers to common code.

I volunteered to refactor the extack in GET in a conversation
with Vladimir but I gave up.

The handling of failures during dump in GET handlers is a bit
unclear to me. Some code uses presence of info as indication
of dump and tries to avoid reporting errors altogether
(including extack messages).

There's also the question of whether we should have a validation
callback (similar to .set_validate here) for GET. It looks like
.parse_request was expected to perform the validation. It takes
the extack and tb directly, not via info:

int (*parse_request)(struct ethnl_req_info *req_info,
     struct nlattr **tb,
     struct netlink_ext_ack *extack);

int (*prepare_data)(const struct ethnl_req_info *req_info,
    struct ethnl_reply_data *reply_data,
    struct genl_info *info);

so no crashes dereferencing info possible.

But .parse_request doesn't run under rtnl nor ethnl_ops_begin().
As a result some implementations defer validation until .prepare_data
where all the locks are held and they can call out to the driver.

All this makes me think that maybe we should refactor GET in the
same direction I'm refactoring SET. Split .prepare_data, take
more locks in the core, and add a validation helper which would
take extack directly:

    - ret = ops->prepare_data(req_info, reply_data, info);
    + ret = ops->prepare_data_validate(req_info, reply_data, attrs, extack);
    + if (ret < 1) // if 0 -> skip for dump; -EOPNOTSUPP in do
    +   goto err1;
    +
    + ret = ethnl_ops_begin(dev);
    + if (ret)
    +   goto err1;
    +
    + ret = ops->prepare_data(req_info, reply_data); // no extack
    + ethnl_ops_complete(dev);

I'll file that away as a TODO for posterity / older me.

v2:
 - invert checks for coalescing to avoid error code changes
 - rebase and convert MM as well

v1: https://lore.kernel.org/all/20230121054430.642280-1-kuba@kernel.org/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago ethtool: netlink: convert commands to common SET
Jakub Kicinski [Wed, 25 Jan 2023 23:05:19 +0000 (15:05 -0800)]
ethtool: netlink: convert commands to common SET

Convert all SET commands where new common code is applicable.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago ethtool: netlink: handle SET intro/outro in the common code
Jakub Kicinski [Wed, 25 Jan 2023 23:05:18 +0000 (15:05 -0800)]
ethtool: netlink: handle SET intro/outro in the common code

Most ethtool SET callbacks follow the same general structure.

  ethnl_parse_header_dev_get()
  rtnl_lock()
  ethnl_ops_begin()

  ... do stuff ...

  ethtool_notify()
  ethnl_ops_complete()
  rtnl_unlock()
  ethnl_parse_header_dev_put()

This leads to a lot of copy / pasted code and bugs when people
mis-handle the error path.

Add a generic implementation of this pattern with a .set callback
in struct ethnl_request_ops called to "do stuff".

Also add an optional .set_validate which is called before
ethnl_ops_begin() -- a lot of implementations do basic request
capability / sanity checking at that point.

Because we want to avoid generating the notification when
no change happened - adopt slightly hairy return values:
 - 0 means nothing to do (no notification)
 - 1 means done / continue
 - negative error codes on error

Reuse .hdr_attr from struct ethnl_request_ops; GET and SET
use the same attr spaces in all cases.

Convert pause as an example (and to avoid unused function warnings).
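
The pattern and its return value convention can be sketched in userspace
roughly like this. It is a simplified illustration, not the ethtool code:
struct request, struct device, struct set_ops, common_set() and the demo
callbacks are hypothetical, and plain counters stand in for the locking and
notification calls.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the request and the device. */
struct request { int new_value; };
struct device  { int value; int locked; int notifications; };

struct set_ops {
	int (*set_validate)(const struct request *req);	/* optional */
	int (*set)(struct device *dev, const struct request *req);
};

/* Callbacks return <0 on error, 0 for "nothing to do", 1 for "done / continue". */
static int common_set(const struct set_ops *ops, struct device *dev,
		      const struct request *req)
{
	int ret;

	if (ops->set_validate) {
		ret = ops->set_validate(req);	/* called before the begin step */
		if (ret <= 0)
			return ret;
	}

	dev->locked = 1;		/* stands in for rtnl_lock() + ethnl_ops_begin() */
	ret = ops->set(dev, req);
	if (ret > 0) {
		dev->notifications++;	/* stands in for ethtool_notify() */
		ret = 0;
	}
	dev->locked = 0;		/* stands in for ethnl_ops_complete() + rtnl_unlock() */
	return ret;
}

static int demo_validate(const struct request *req)
{
	return req->new_value < 0 ? -EINVAL : 1;
}

static int demo_set(struct device *dev, const struct request *req)
{
	if (dev->value == req->new_value)
		return 0;		/* no change, so no notification */
	dev->value = req->new_value;
	return 1;
}
```

Keeping the unlock on a single path in the wrapper is the point of the
exercise: individual callbacks can no longer forget it on an error path.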

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: dsa: qca8k: convert to regmap read/write API
Christian Marangi [Wed, 25 Jan 2023 20:35:17 +0000 (21:35 +0100)]
net: dsa: qca8k: convert to regmap read/write API

Convert qca8k to regmap read/write bulk API. The mgmt eth can write up
to 32 bytes of data at a time. Currently we use a custom function to do
it, but regmap now supports declaration of read/write bulk even without
a bus.

Drop the custom function and rework the regmap function to this new
implementation.

Rework the qca8k_fdb_read/write functions to use the new
regmap_bulk_read/write, as the old qca8k_bulk_read/write are now dropped.

Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: dsa: qca8k: add QCA8K_ATU_TABLE_SIZE define for fdb access
Christian Marangi [Wed, 25 Jan 2023 20:35:16 +0000 (21:35 +0100)]
net: dsa: qca8k: add QCA8K_ATU_TABLE_SIZE define for fdb access

Add and use QCA8K_ATU_TABLE_SIZE instead of hardcoding the ATU size with
a pure number and using sizeof on the array.

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago Merge branch 'net-skbuff-includes'
David S. Miller [Fri, 27 Jan 2023 11:19:46 +0000 (11:19 +0000)]
Merge branch 'net-skbuff-includes'

Jakub Kicinski says:

====================
net: skbuff: clean up unnecessary includes

skbuff.h is included in a significant portion of the tree.
Clean up unused dependencies to speed up builds.

This set only takes care of the most obvious cases.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: remove unnecessary includes from net/flow.h
Jakub Kicinski [Thu, 26 Jan 2023 07:14:24 +0000 (23:14 -0800)]
net: remove unnecessary includes from net/flow.h

This file is included by a lot of other commonly included
headers, and it doesn't need socket.h or flow_dissector.h.

This reduces the size of this file after pre-processing
from 28165 to 4663.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: skbuff: drop the linux/hrtimer.h include
Jakub Kicinski [Thu, 26 Jan 2023 07:14:23 +0000 (23:14 -0800)]
net: skbuff: drop the linux/hrtimer.h include

linux/hrtimer.h include was added because apparently it used
to contain ktime related code. This is no longer the case
and we include linux/time.h explicitly.

Sadly this change is currently a noop because linux/dma-mapping.h
and net/page_pool.h pull in half of the universe.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: skbuff: drop the linux/splice.h include
Jakub Kicinski [Thu, 26 Jan 2023 07:14:22 +0000 (23:14 -0800)]
net: skbuff: drop the linux/splice.h include

splice.h has been included since commit a60e3cc7c929 ("net: make
skb_splice_bits more configureable") but really even then
all we needed was some forward declarations. Most of that
code is now gone, and what remains has forward declarations.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: add missing includes of linux/splice.h
Jakub Kicinski [Thu, 26 Jan 2023 07:14:21 +0000 (23:14 -0800)]
net: add missing includes of linux/splice.h

A number of files depend on linux/splice.h getting included
by linux/skbuff.h which soon will no longer be the case.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: skbuff: drop the linux/sched.h include
Jakub Kicinski [Thu, 26 Jan 2023 07:14:20 +0000 (23:14 -0800)]
net: skbuff: drop the linux/sched.h include

linux/sched.h was added for skb_mstamp_* (all the way back
before linux/sched.h got split and linux/sched/clock.h created).
We don't need it in skbuff.h any more.

Sadly this change is currently a noop because linux/dma-mapping.h
and net/page_pool.h pull in half of the universe.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: skbuff: drop the linux/sched/clock.h include
Jakub Kicinski [Thu, 26 Jan 2023 07:14:19 +0000 (23:14 -0800)]
net: skbuff: drop the linux/sched/clock.h include

It used to be necessary for skb_mstamp_* static inlines,
but those are gone since we moved to usec timestamps in
TCP, in 2017.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: add missing includes of linux/sched/clock.h
Jakub Kicinski [Thu, 26 Jan 2023 07:14:18 +0000 (23:14 -0800)]
net: add missing includes of linux/sched/clock.h

A number of files depend on linux/sched/clock.h getting included
by linux/skbuff.h which soon will no longer be the case.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: skbuff: drop the linux/textsearch.h include
Jakub Kicinski [Thu, 26 Jan 2023 07:14:17 +0000 (23:14 -0800)]
net: skbuff: drop the linux/textsearch.h include

This include was added for skb_find_text() but all we need there
is a forward declaration of struct ts_config.
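
A minimal sketch of why a forward declaration is enough when code only
passes pointers around. The helper has_config() is hypothetical and exists
only for illustration; it is not part of skbuff.h.

```c
#include <stddef.h>

/* Forward declaration: the type stays incomplete, no header is pulled in. */
struct ts_config;

/* Hypothetical helper: it handles the pointer but never looks inside the struct. */
static int has_config(const struct ts_config *conf)
{
	return conf != NULL;
}
```

Only code that dereferences the pointer or needs the size of the struct has
to include the full definition.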

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: checksum: drop the linux/uaccess.h include
Jakub Kicinski [Thu, 26 Jan 2023 07:14:16 +0000 (23:14 -0800)]
net: checksum: drop the linux/uaccess.h include

net/checksum.h pulls in linux/uaccess.h which is large.

In the x86 header the include seems to not be needed at all.
ARM on the other hand does not include uaccess.h, even though
it calls access_ok().

In the generic implementation guard the include of linux/uaccess.h
with the same condition as the code that needs it.

With this change pre-processed net/checksum.h shrinks on x86
from 30616 lines to just 1193.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: skbuff: drop the linux/net.h include
Jakub Kicinski [Thu, 26 Jan 2023 07:14:15 +0000 (23:14 -0800)]
net: skbuff: drop the linux/net.h include

It appears nothing needs it. The kernel builds fine with this
include removed. Building an otherwise empty source file with:

 #include <linux/skbuff.h>
 #ifdef _LINUX_NET_H
 #error linux/net.h is back
 #endif

works too (meaning net.h is not just pulled in indirectly).

This gives us a slight 0.5% reduction in the pre-processed size
of skbuff.h.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: add missing includes of linux/net.h
Jakub Kicinski [Thu, 26 Jan 2023 07:14:14 +0000 (23:14 -0800)]
net: add missing includes of linux/net.h

linux/net.h will soon not be included by linux/skbuff.h.
Fix the cases where source files were depending on the implicit
include.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago Merge branch 'ipa-abstract-status'
David S. Miller [Fri, 27 Jan 2023 11:16:29 +0000 (11:16 +0000)]
Merge branch 'ipa-abstract-status'

Alex Elder says:

====================
net: ipa: abstract status parsing

Under some circumstances, IPA generates a "packet status" structure
that describes information about a packet.  This is used, for
example, when offload hardware detects an error in a packet, or
otherwise discovers a packet needs special handling.  In this case,
the status is delivered (along with the packet it describes) to a
"default" endpoint so that it can be handled by the AP.

Until now, the structure of this status information hasn't changed.
However, to support more than 32 endpoints, this structure required
some changes, such that some fields are rearranged in ways that are
tricky to represent using C code.

This series updates code related to the IPA status structure.  The
first patch uses a local variable to avoid computing a packet
length more than once.  The second stops using sizeof() to determine
the size of an IPA packet status structure.  Patches 3-5 extend the
definitions for values held in packet status fields.  Patch 6 does a
little general cleanup to make patch 7 simpler.  Patch 7 stops using
a C structure to represent packet status; instead, a new function
fetches values "by name" from a buffer containing such a structure.
The last patch updates this function so it also supports IPA v5.0+.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: ipa: add IPA v5.0 packet status support
Alex Elder [Wed, 25 Jan 2023 20:45:45 +0000 (14:45 -0600)]
net: ipa: add IPA v5.0 packet status support

Update ipa_status_extract() to support IPA v5.0 and beyond.  Because
the format of the IPA packet status depends on the version, pass an
IPA pointer to the function.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: ipa: introduce generalized status decoder
Alex Elder [Wed, 25 Jan 2023 20:45:44 +0000 (14:45 -0600)]
net: ipa: introduce generalized status decoder

Stop assuming the IPA packet status has a fixed format (defined by
a C structure).  Instead, use a function to extract each field from
a block of data interpreted as an IPA packet status.  Define an
enumerated type that identifies the fields that can be extracted.
The current function extracts fields based on the existing
ipa_status structure format (which is no longer used).

Define IPA_STATUS_RULE_MISS, to replace the calls to field_max() to
represent that condition; those depended on knowing the width of
a filter or router rule in the IPA packet status structure.
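
A rough userspace sketch of extracting fields "by name" from a raw buffer
instead of overlaying a C structure. The field IDs, byte offsets and widths
below are assumptions chosen for illustration and do not reflect the real
IPA packet status layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical field identifiers; the real driver defines its own enum. */
enum status_field { STATUS_OPCODE, STATUS_LENGTH, STATUS_ENDP_SRC };

/* Extract one field from a block of bytes interpreted as a packet status. */
static uint32_t status_extract(const void *data, enum status_field field)
{
	const uint8_t *bytes = data;

	switch (field) {
	case STATUS_OPCODE:	/* assumed: 8 bits at byte 0 */
		return bytes[0];
	case STATUS_LENGTH:	/* assumed: little-endian 16 bits at byte 2 */
		return (uint32_t)bytes[2] | ((uint32_t)bytes[3] << 8);
	case STATUS_ENDP_SRC:	/* assumed: low 5 bits of byte 4 */
		return bytes[4] & 0x1f;
	}
	return 0;
}
```

Because callers only name the field they want, a different layout can be
supported later by changing this one function.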

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: ipa: IPA status preparatory cleanups
Alex Elder [Wed, 25 Jan 2023 20:45:43 +0000 (14:45 -0600)]
net: ipa: IPA status preparatory cleanups

The next patch reworks how the IPA packet status structure is
interpreted.  This patch does some preparatory work, to make it
easier to see the effect of that change:
  - Change a few functions that access fields in an IPA packet status
    structure to store field values in local variables with names
    related to the field.
  - Pass a void pointer rather than an (equivalent) status pointer
    to two functions called by ipa_endpoint_status_parse().
  - Use "rule" rather than "val" as the name of a variable that
    holds a routing rule ID.
  - Consistently use "IPA packet status" rather than "status
    element" when referring to this data structure.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: ipa: define remaining IPA status field values
Alex Elder [Wed, 25 Jan 2023 20:45:42 +0000 (14:45 -0600)]
net: ipa: define remaining IPA status field values

Define the remaining values for opcode and exception fields in the
IPA packet status structure.  Most of these values are powers-of-2,
suggesting they are meant to be used as bitmasks, but that is not
the case.  Add comments to be clear about this, and express the
values in decimal format.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: ipa: rename the NAT enumerated type
Alex Elder [Wed, 25 Jan 2023 20:45:41 +0000 (14:45 -0600)]
net: ipa: rename the NAT enumerated type

Rename the ipa_nat_en enumerated type to be ipa_nat_type, and rename
its symbols accordingly.  Add a comment indicating those values are
also used in the IPA status nat_type field.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: ipa: define all IPA status mask bits
Alex Elder [Wed, 25 Jan 2023 20:45:40 +0000 (14:45 -0600)]
net: ipa: define all IPA status mask bits

There is a 16-bit status mask defined in the IPA packet status
structure, of which only one (TAG_VALID) is currently used.

Define all other IPA status mask values in an enumerated type whose
numeric values are bit mask values (in CPU byte order) in the status
mask.  Use the TAG_VALID value from that type rather than defining a
separate field mask.
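
As an illustration only: an enumerated type whose values are bit masks in
CPU byte order might look like the sketch below. All names and bit
positions here are assumptions (only the existence of a TAG_VALID bit comes
from the text above), and the mask is assumed to be little-endian on the
wire.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names and bit positions, for illustration only. */
enum status_mask {
	MASK_FRAG_PROCESS = 1u << 0,
	MASK_FILT_PROCESS = 1u << 1,
	MASK_TAG_VALID    = 1u << 4,
};

/* Convert the (assumed little-endian) wire bytes to CPU byte order. */
static uint16_t mask_from_wire(const uint8_t wire[2])
{
	return (uint16_t)(wire[0] | (wire[1] << 8));
}

/* Test a bit using the enum value directly, with no separate field mask. */
static bool tag_valid(uint16_t mask)
{
	return (mask & MASK_TAG_VALID) != 0;
}
```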

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
17 months ago net: ipa: stop using sizeof(status)
Alex Elder [Wed, 25 Jan 2023 20:45:39 +0000 (14:45 -0600)]
net: ipa: stop using sizeof(status)

The IPA packet status structure changes in IPA v5.0 in ways that are
difficult to represent cleanly.  As a small step toward redefining
it as a parsed block of data, use a constant to define its size,
rather than the size of the IPA status structure type.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>