review.tizen.org Git - platform/kernel/linux-rpi.git/log

Merge tag 'for-netdev' of https://git./linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2023-07-13

We've added 67 non-merge commits during the last 15 day(s) which contain
a total of 106 files changed, 4444 insertions(+), 619 deletions(-).

The main changes are:

1) Fix bpftool build in presence of stale vmlinux.h,
   from Alexander Lobakin.

2) Introduce bpf_me_mcache_free_rcu() and fix OOM under stress,
   from Alexei Starovoitov.

3) Teach verifier actual bounds of bpf_get_smp_processor_id()
   and fix perf+libbpf issue related to custom section handling,
   from Andrii Nakryiko.

4) Introduce bpf map element count, from Anton Protopopov.

5) Check skb ownership against full socket, from Kui-Feng Lee.

6) Support for up to 12 arguments in BPF trampoline, from Menglong Dong.

7) Export rcu_request_urgent_qs_task, from Paul E. McKenney.

8) Fix BTF walking of unions, from Yafang Shao.

9) Extend link_info for kprobe_multi and perf_event links,
   from Yafang Shao.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (67 commits)
  selftests/bpf: Add selftest for PTR_UNTRUSTED
  bpf: Fix an error in verifying a field in a union
  selftests/bpf: Add selftests for nested_trust
  bpf: Fix an error around PTR_UNTRUSTED
  selftests/bpf: add testcase for TRACING with 6+ arguments
  bpf, x86: allow function arguments up to 12 for TRACING
  bpf, x86: save/restore regs with BPF_DW size
  bpftool: Use "fallthrough;" keyword instead of comments
  bpf: Add object leak check.
  bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.
  bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
  selftests/bpf: Improve test coverage of bpf_mem_alloc.
  rcu: Export rcu_request_urgent_qs_task()
  bpf: Allow reuse from waiting_for_gp_ttrace list.
  bpf: Add a hint to allocated objects.
  bpf: Change bpf_mem_cache draining process.
  bpf: Further refactor alloc_bulk().
  bpf: Factor out inc/dec of active flag into helpers.
  bpf: Refactor alloc_bulk().
  bpf: Let free_all() return the number of freed elements.
  ...
====================

Link: https://lore.kernel.org/r/20230714020910.80794-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge git://git./linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/bpf: Add selftest for PTR_UNTRUSTED

Add a new selftest to check the PTR_UNTRUSTED condition. Below is the
result,

#160 ptr_untrusted:OK

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230713025642.27477-5-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Fix an error in verifying a field in a union

We are utilizing BPF LSM to monitor BPF operations within our container
environment. When we add support for raw_tracepoint, it hits below
error.

; (const void *)attr->raw_tracepoint.name);
27: (79) r3 = *(u64 *)(r2 +0)
access beyond the end of member map_type (mend:4) in struct (anon) with off 0 size 8

It can be reproduced with below BPF prog.

SEC("lsm/bpf")
int BPF_PROG(bpf_audit, int cmd, union bpf_attr *attr, unsigned int size)
{
switch (cmd) {
case BPF_RAW_TRACEPOINT_OPEN:
bpf_printk("raw_tracepoint is %s", attr->raw_tracepoint.name);
break;
default:
break;
}
return 0;
}

The reason is that when accessing a field in a union, such as bpf_attr,
if the field is located within a nested struct that is not the first
member of the union, it can result in incorrect field verification.

  union bpf_attr {
      struct {
          __u32 map_type; <<<< Actually it will find that field.
          __u32 key_size;
          __u32 value_size;
         ...
      };
      ...
      struct {
          __u64 name;    <<<< We want to verify this field.
          __u32 prog_fd;
      } raw_tracepoint;
  };

Considering the potential deep nesting levels, finding a perfect
solution to address this issue has proven challenging. Therefore, I
propose a solution where we simply skip the verification process if the
field in question is located within a union.

Fixes: 7e3617a72df3 ("bpf: Add array support to btf_struct_access")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230713025642.27477-4-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add selftests for nested_trust

Add selftests for nested_strust to check whehter PTR_UNTRUSTED is cleared
as expected, the result as follows:

#141/1   nested_trust/test_read_cpumask:OK
#141/2   nested_trust/test_skb_field:OK                    <<<<
#141/3   nested_trust/test_invalid_nested_user_cpus:OK
#141/4   nested_trust/test_invalid_nested_offset:OK
#141/5   nested_trust/test_invalid_skb_field:OK            <<<<
#141     nested_trust:OK

The #141/2 and #141/5 are newly added.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230713025642.27477-3-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Fix an error around PTR_UNTRUSTED

Per discussion with Alexei, the PTR_UNTRUSTED flag should not been
cleared when we start to walk a new struct, because the struct in
question may be a struct nested in a union. We should also check and set
this flag before we walk its each member, in case itself is a union.
We will clear this flag if the field is BTF_TYPE_SAFE_RCU_OR_NULL.

Fixes: 6fcd486b3a0a ("bpf: Refactor RCU enforcement in the verifier.")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230713025642.27477-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'bpf-x86-allow-function-arguments-up-to-12-for-tracing'

Menglong Dong says:

====================
bpf, x86: allow function arguments up to 12 for TRACING

From: Menglong Dong <imagedong@tencent.com>

For now, the BPF program of type BPF_PROG_TYPE_TRACING can only be used
on the kernel functions whose arguments count less than or equal to 6, if
not considering '> 8 bytes' struct argument. This is not friendly at all,
as too many functions have arguments count more than 6. According to the
current kernel version, below is a statistics of the function arguments
count:

argument count | function count
7              | 704
8              | 270
9              | 84
10             | 47
11             | 47
12             | 27
13             | 22
14             | 5
15             | 0
16             | 1

Therefore, let's enhance it by increasing the function arguments count
allowed in arch_prepare_bpf_trampoline(), for now, only x86_64.

In the 1st patch, we save/restore regs with BPF_DW size to make the code
in save_regs()/restore_regs() simpler.

In the 2nd patch, we make arch_prepare_bpf_trampoline() support to copy
function arguments in stack for x86 arch. Therefore, the maximum
arguments can be up to MAX_BPF_FUNC_ARGS for FENTRY, FEXIT and
MODIFY_RETURN. Meanwhile, we clean the potential garbage value when we
copy the arguments on-stack.

And the 3rd patch is for the testcases of the this series.

Changes since v9:
- fix the failed test cases of trampoline_count and get_func_args_test
  in the 3rd patch

Changes since v8:
- change the way to test fmod_ret in the 3rd patch

Changes since v7:
- split the testcases, and add fentry_many_args/fexit_many_args to
  DENYLIST.aarch64 in 3rd patch

Changes since v6:
- somit nits from commit message and comment in the 1st patch
- remove the inline in get_nr_regs() in the 1st patch
- rename some function and various in the 1st patch

Changes since v5:
- adjust the commit log of the 1st patch, avoiding confusing people that
  bugs exist in current code
- introduce get_nr_regs() to get the space that used to pass args on
  stack correct in the 2nd patch
- add testcases to tracing_struct.c instead of fentry_test.c and
  fexit_test.c

Changes since v4:
- consider the case of the struct in arguments can't be hold by regs
- add comment for some code
- add testcases for MODIFY_RETURN
- rebase to the latest

Changes since v3:
- try make the stack pointer 16-byte aligned. Not sure if I'm right :)
- introduce clean_garbage() to clean the grabage when argument count is 7
- use different data type in bpf_testmod_fentry_test{7,12}
- add testcase for grabage values in ctx

Changes since v2:
- keep MAX_BPF_FUNC_ARGS still
- clean garbage value in upper bytes in the 2nd patch
- move bpf_fentry_test{7,12} to bpf_testmod.c and rename them to
  bpf_testmod_fentry_test{7,12} meanwhile in the 3rd patch

Changes since v1:
- change the maximun function arguments to 14 from 12
- add testcases (Jiri Olsa)
- instead EMIT4 with EMIT3_off32 for "lea" to prevent overflow
====================

Link: https://lore.kernel.org/r/20230713040738.1789742-1-imagedong@tencent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: add testcase for TRACING with 6+ arguments

Add fentry_many_args.c and fexit_many_args.c to test the fentry/fexit
with 7/11 arguments. As this feature is not supported by arm64 yet, we
disable these testcases for arm64 in DENYLIST.aarch64. We can combine
them with fentry_test.c/fexit_test.c when arm64 is supported too.

Correspondingly, add bpf_testmod_fentry_test7() and
bpf_testmod_fentry_test11() to bpf_testmod.c

Meanwhile, add bpf_modify_return_test2() to test_run.c to test the
MODIFY_RETURN with 7 arguments.

Add bpf_testmod_test_struct_arg_7/bpf_testmod_test_struct_arg_7 in
bpf_testmod.c to test the struct in the arguments.

And the testcases passed on x86_64:

./test_progs -t fexit
Summary: 5/14 PASSED, 0 SKIPPED, 0 FAILED

./test_progs -t fentry
Summary: 3/2 PASSED, 0 SKIPPED, 0 FAILED

./test_progs -t modify_return
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

./test_progs -t tracing_struct
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230713040738.1789742-4-imagedong@tencent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, x86: allow function arguments up to 12 for TRACING

For now, the BPF program of type BPF_PROG_TYPE_TRACING can only be used
on the kernel functions whose arguments count less than or equal to 6, if
not considering '> 8 bytes' struct argument. This is not friendly at all,
as too many functions have arguments count more than 6.

According to the current kernel version, below is a statistics of the
function arguments count:

argument count | function count
7              | 704
8              | 270
9              | 84
10             | 47
11             | 47
12             | 27
13             | 22
14             | 5
15             | 0
16             | 1

Therefore, let's enhance it by increasing the function arguments count
allowed in arch_prepare_bpf_trampoline(), for now, only x86_64.

For the case that we don't need to call origin function, which means
without BPF_TRAMP_F_CALL_ORIG, we need only copy the function arguments
that stored in the frame of the caller to current frame. The 7th and later
arguments are stored in "$rbp + 0x18", and they will be copied to the
stack area following where register values are saved.

For the case with BPF_TRAMP_F_CALL_ORIG, we need prepare the arguments
in stack before call origin function, which means we need alloc extra
"8 * (arg_count - 6)" memory in the top of the stack. Note, there should
not be any data be pushed to the stack before calling the origin function.
So 'rbx' value will be stored on a stack position higher than where stack
arguments are stored for BPF_TRAMP_F_CALL_ORIG.

According to the research of Yonghong, struct members should be all in
register or all on the stack. Meanwhile, the compiler will pass the
argument on regs if the remaining regs can hold the argument. Therefore,
we need save the arguments in order. Otherwise, disorder of the args can
happen. For example:

  struct foo_struct {
      long a;
      int b;
  };
  int foo(char, char, char, char, char, struct foo_struct,
          char);

the arg1-5,arg7 will be passed by regs, and arg6 will by stack. Therefore,
we should save/restore the arguments in the same order with the
declaration of foo(). And the args used as ctx in stack will be like this:

  reg_arg6   -- copy from regs
  stack_arg2 -- copy from stack
  stack_arg1
  reg_arg5   -- copy from regs
  reg_arg4
  reg_arg3
  reg_arg2
  reg_arg1

We use EMIT3_off32() or EMIT4() for "lea" and "sub". The range of the
imm in "lea" and "sub" is [-128, 127] if EMIT4() is used. Therefore,
we use EMIT3_off32() instead if the imm out of the range.

It works well for the FENTRY/FEXIT/MODIFY_RETURN.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230713040738.1789742-3-imagedong@tencent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, x86: save/restore regs with BPF_DW size

As we already reserve 8 byte in the stack for each reg, it is ok to
store/restore the regs in BPF_DW size. This will make the code in
save_regs()/restore_regs() simpler.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230713040738.1789742-2-imagedong@tencent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge tag 'net-6.5-rc2' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter, wireless and ebpf.

  Current release - regressions:

   - netfilter: conntrack: gre: don't set assured flag for clash entries

   - wifi: iwlwifi: remove 'use_tfh' config to fix crash

  Previous releases - regressions:

   - ipv6: fix a potential refcount underflow for idev

   - icmp6: ifix null-ptr-deref of ip6_null_entry->rt6i_idev in
     icmp6_dev()

   - bpf: fix max stack depth check for async callbacks

   - eth: mlx5e:
      - check for NOT_READY flag state after locking
      - fix page_pool page fragment tracking for XDP

   - eth: igc:
      - fix tx hang issue when QBV gate is closed
      - fix corner cases for TSN offload

   - eth: octeontx2-af: Move validation of ptp pointer before its usage

   - eth: ena: fix shift-out-of-bounds in exponential backoff

  Previous releases - always broken:

   - core: prevent skb corruption on frag list segmentation

   - sched:
      - cls_fw: fix improper refcount update leads to use-after-free
      - sch_qfq: account for stab overhead in qfq_enqueue

   - netfilter:
      - report use refcount overflow
      - prevent OOB access in nft_byteorder_eval

   - wifi: mt7921e: fix init command fail with enabled device

   - eth: ocelot: fix oversize frame dropping for preemptible TCs

   - eth: fec: recycle pages for transmitted XDP frames"

* tag 'net-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits)
  selftests: tc-testing: add test for qfq with stab overhead
  net/sched: sch_qfq: account for stab overhead in qfq_enqueue
  selftests: tc-testing: add tests for qfq mtu sanity check
  net/sched: sch_qfq: reintroduce lmax bound check for MTU
  wifi: cfg80211: fix receiving mesh packets without RFC1042 header
  wifi: rtw89: debug: fix error code in rtw89_debug_priv_send_h2c_set()
  net: txgbe: fix eeprom calculation error
  net/sched: make psched_mtu() RTNL-less safe
  net: ena: fix shift-out-of-bounds in exponential backoff
  netdevsim: fix uninitialized data in nsim_dev_trap_fa_cookie_write()
  net/sched: flower: Ensure both minimum and maximum ports are specified
  MAINTAINERS: Add another mailing list for QUALCOMM ETHQOS ETHERNET DRIVER
  docs: netdev: update the URL of the status page
  wifi: iwlwifi: remove 'use_tfh' config to fix crash
  xdp: use trusted arguments in XDP hints kfuncs
  bpf: cpumap: Fix memory leak in cpu_map_update_elem
  wifi: airo: avoid uninitialized warning in airo_get_rate()
  octeontx2-pf: Add additional check for MCAM rules
  net: dsa: Removed unneeded of_node_put in felix_parse_ports_node
  net: fec: use netdev_err_once() instead of netdev_err()
  ...

Merge tag 'trace-v6.5-rc1-3' of git://git./linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

- Fix some missing-prototype warnings

- Fix user events struct args (did not include size of struct)

   When creating a user event, the "struct" keyword is to denote that
   the size of the field will be passed in. But the parsing failed to
   handle this case.

- Add selftest to struct sizes for user events

- Fix sample code for direct trampolines.

   The sample code for direct trampolines attached to handle_mm_fault().
   But the prototype changed and the direct trampoline sample code was
   not updated. Direct trampolines needs to have the arguments correct
   otherwise it can fail or crash the system.

- Remove unused ftrace_regs_caller_ret() prototype.

- Quiet false positive of FORTIFY_SOURCE

   Due to backward compatibility, the structure used to save stack
   traces in the kernel had a fixed size of 8. This structure is
   exported to user space via the tracing format file. A change was made
   to allow more than 8 functions to be recorded, and user space now
   uses the size field to know how many functions are actually in the
   stack.

   But the structure still has size of 8 (even though it points into the
   ring buffer that has the required amount allocated to hold a full
   stack.

   This was fine until the fortifier noticed that the
   memcpy(&entry->caller, stack, size) was greater than the 8 functions
   and would complain at runtime about it.

   Hide this by using a pointer to the stack location on the ring buffer
   instead of using the address of the entry structure caller field.

- Fix a deadloop in reading trace_pipe that was caused by a mismatch
   between ring_buffer_empty() returning false which then asked to read
   the data, but the read code uses rb_num_of_entries() that returned
   zero, and causing a infinite "retry".

- Fix a warning caused by not using all pages allocated to store ftrace
   functions, where this can happen if the linker inserts a bunch of
   "NULL" entries, causing the accounting of how many pages needed to be
   off.

- Fix histogram synthetic event crashing when the start event is
   removed and the end event is still using a variable from it

- Fix memory leak in freeing iter->temp in tracing_release_pipe()

* tag 'trace-v6.5-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix memory leak of iter->temp when reading trace_pipe
  tracing/histograms: Add histograms to hist_vars if they have referenced variables
  tracing: Stop FORTIFY_SOURCE complaining about stack trace caller
  ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()
  ring-buffer: Fix deadloop issue on reading trace_pipe
  tracing: arm64: Avoid missing-prototype warnings
  selftests/user_events: Test struct size match cases
  tracing/user_events: Fix struct arg size match check
  x86/ftrace: Remove unsued extern declaration ftrace_regs_caller_ret()
  arm64: ftrace: Add direct call trampoline samples support
  samples: ftrace: Save required argument registers in sample trampolines

Merge tag 'for-linus-6.5-rc2-tag' of git://git./linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:

- a cleanup of the Xen related ELF-notes

- a fix for virtio handling in Xen dom0 when running Xen in a VM

* tag 'for-linus-6.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/virtio: Fix NULL deref when a bridge of PCI root bus has no parent
x86/Xen: tidy xen-head.S

Merge tag 'sh-for-v6.5-tag2' of git://git./linux/kernel/git/glaubitz/sh-linux

Pull sh fixes from John Paul Adrian Glaubitz:
"The sh updates introduced multiple regressions.

  In particular, the change a8ac2961148e ("sh: Avoid using IRQ0 on SH3
  and SH4") causes several boards to hang during boot due to incorrect
  IRQ numbers.

  Geert Uytterhoeven has contributed patches that handle the virq offset
  in the IRQ code for the dreamcast, highlander and r2d boards while
  Artur Rojek has contributed a patch which handles the virq offset for
  the hd64461 companion chip"

* tag 'sh-for-v6.5-tag2' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux:
  sh: hd64461: Handle virq offset for offchip IRQ base and HD64461 IRQ
  sh: mach-dreamcast: Handle virq offset in cascaded IRQ demux
  sh: mach-highlander: Handle virq offset in cascaded IRL demux
  sh: mach-r2d: Handle virq offset in cascaded IRL demux

tracing: Fix memory leak of iter->temp when reading trace_pipe

kmemleak reports:
  unreferenced object 0xffff88814d14e200 (size 256):
    comm "cat", pid 336, jiffies 4294871818 (age 779.490s)
    hex dump (first 32 bytes):
      04 00 01 03 00 00 00 00 08 00 00 00 00 00 00 00  ................
      0c d8 c8 9b ff ff ff ff 04 5a ca 9b ff ff ff ff  .........Z......
    backtrace:
      [<ffffffff9bdff18f>] __kmalloc+0x4f/0x140
      [<ffffffff9bc9238b>] trace_find_next_entry+0xbb/0x1d0
      [<ffffffff9bc9caef>] trace_print_lat_context+0xaf/0x4e0
      [<ffffffff9bc94490>] print_trace_line+0x3e0/0x950
      [<ffffffff9bc95499>] tracing_read_pipe+0x2d9/0x5a0
      [<ffffffff9bf03a43>] vfs_read+0x143/0x520
      [<ffffffff9bf04c2d>] ksys_read+0xbd/0x160
      [<ffffffff9d0f0edf>] do_syscall_64+0x3f/0x90
      [<ffffffff9d2000aa>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8

when reading file 'trace_pipe', 'iter->temp' is allocated or relocated
in trace_find_next_entry() but not freed before 'trace_pipe' is closed.

To fix it, free 'iter->temp' in tracing_release_pipe().

Link: https://lore.kernel.org/linux-trace-kernel/20230713141435.1133021-1-zhengyejian1@huawei.com
Cc: stable@vger.kernel.org
Fixes: ff895103a84ab ("tracing: Save off entry when peeking at next entry")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Merge branch 'net-sched-fixes-for-sch_qfq'

Pedro Tammela says:

====================
net/sched: fixes for sch_qfq

Patch 1 fixes a regression introduced in 6.4 where the MTU size could be
bigger than 'lmax'.

Patch 3 fixes an issue where the code doesn't account for qdisc_pkt_len()
returning a size bigger then 'lmax'.

Patches 2 and 4 are selftests for the issues above.
====================

Link: https://lore.kernel.org/r/20230711210103.597831-1-pctammela@mojatatu.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: tc-testing: add test for qfq with stab overhead

A packet with stab overhead greater than QFQ_MAX_LMAX should be dropped
by the QFQ qdisc as it can't handle such lengths.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: sch_qfq: account for stab overhead in qfq_enqueue

Lion says:
-------
In the QFQ scheduler a similar issue to CVE-2023-31436
persists.

Consider the following code in net/sched/sch_qfq.c:

static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                struct sk_buff **to_free)
{
     unsigned int len = qdisc_pkt_len(skb), gso_segs;

    // ...

     if (unlikely(cl->agg->lmax < len)) {
         pr_debug("qfq: increasing maxpkt from %u to %u for class %u",
              cl->agg->lmax, len, cl->common.classid);
         err = qfq_change_agg(sch, cl, cl->agg->class_weight, len);
         if (err) {
             cl->qstats.drops++;
             return qdisc_drop(skb, sch, to_free);
         }

    // ...

     }

Similarly to CVE-2023-31436, "lmax" is increased without any bounds
checks according to the packet length "len". Usually this would not
impose a problem because packet sizes are naturally limited.

This is however not the actual packet length, rather the
"qdisc_pkt_len(skb)" which might apply size transformations according to
"struct qdisc_size_table" as created by "qdisc_get_stab()" in
net/sched/sch_api.c if the TCA_STAB option was set when modifying the qdisc.

A user may choose virtually any size using such a table.

As a result the same issue as in CVE-2023-31436 can occur, allowing heap
out-of-bounds read / writes in the kmalloc-8192 cache.
-------

We can create the issue with the following commands:

tc qdisc add dev $DEV root handle 1: stab mtu 2048 tsize 512 mpu 0 \
overhead 999999999 linklayer ethernet qfq
tc class add dev $DEV parent 1: classid 1:1 htb rate 6mbit burst 15k
tc filter add dev $DEV parent 1: matchall classid 1:1
ping -I $DEV 1.1.1.2

This is caused by incorrectly assuming that qdisc_pkt_len() returns a
length within the QFQ_MIN_LMAX < len < QFQ_MAX_LMAX.

Fixes: 462dbc9101ac ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
Reported-by: Lion <nnamrec@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: tc-testing: add tests for qfq mtu sanity check

QFQ only supports a certain bound of MTU size so make sure
we check for this requirement in the tests.

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: sch_qfq: reintroduce lmax bound check for MTU

25369891fcef deletes a check for the case where no 'lmax' is
specified which 3037933448f6 previously fixed as 'lmax'
could be set to the device's MTU without any bound checking
for QFQ_LMAX_MIN and QFQ_LMAX_MAX. Therefore, reintroduce the check.

Fixes: 25369891fcef ("net/sched: sch_qfq: refactor parsing of netlink parameters")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

sh: hd64461: Handle virq offset for offchip IRQ base and HD64461 IRQ

A recent change to start counting SuperH IRQ #s from 16 breaks support
for the Hitachi HD64461 companion chip.

Move the offchip IRQ base and HD64461 IRQ # by 16 in order to
accommodate for the new virq numbering rules.

Fixes: a8ac2961148e ("sh: Avoid using IRQ0 on SH3 and SH4")
Signed-off-by: Artur Rojek <contact@artur-rojek.eu>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Link: https://lore.kernel.org/r/20230710233132.69734-1-contact@artur-rojek.eu
Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>

sh: mach-dreamcast: Handle virq offset in cascaded IRQ demux

Take into account the virq offset when translating cascaded interrupts.

Fixes: a8ac2961148e8c72 ("sh: Avoid using IRQ0 on SH3 and SH4")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Link: https://lore.kernel.org/r/7d0cb246c9f1cd24bb1f637ec5cb67e799a4c3b8.1688908227.git.geert+renesas@glider.be
Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>

sh: mach-highlander: Handle virq offset in cascaded IRL demux

Take into account the virq offset when translating cascaded IRL
interrupts.

Fixes: a8ac2961148e8c72 ("sh: Avoid using IRQ0 on SH3 and SH4")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Link: https://lore.kernel.org/r/4fcb0d08a2b372431c41e04312742dc9e41e1be4.1688908186.git.geert+renesas@glider.be
Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>

sh: mach-r2d: Handle virq offset in cascaded IRL demux

When booting rts7751r2dplus_defconfig on QEMU, the system hangs due to
an interrupt storm on IRQ 20. IRQ 20 aka event 0x280 is a cascaded IRL
interrupt, which maps to IRQ_VOYAGER, the interrupt used by the Silicon
Motion SM501 multimedia companion chip. As rts7751r2d_irq_demux() does
not take into account the new virq offset, the interrupt is no longer
translated, leading to an unhandled interrupt.

Fix this by taking into account the virq offset when translating
cascaded IRL interrupts.

Fixes: a8ac2961148e8c72 ("sh: Avoid using IRQ0 on SH3 and SH4")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Closes: https://lore.kernel.org/r/fbfea3ad-d327-4ad5-ac9c-648c7ca3fe1f@roeck-us.net
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/2c99d5df41c40691f6c407b7b6a040d406bc81ac.1688901306.git.geert+renesas@glider.be
Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>

Merge tag 'for-netdev' of https://git./linux/kernel/git/bpf/bpf

Alexei Starovoitov says:

====================
pull-request: bpf 2023-07-12

We've added 5 non-merge commits during the last 7 day(s) which contain
a total of 7 files changed, 93 insertions(+), 28 deletions(-).

The main changes are:

1) Fix max stack depth check for async callbacks, from Kumar.

2) Fix inconsistent JIT image generation, from Björn.

3) Use trusted arguments in XDP hints kfuncs, from Larysa.

4) Fix memory leak in cpu_map_update_elem, from Pu.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  xdp: use trusted arguments in XDP hints kfuncs
  bpf: cpumap: Fix memory leak in cpu_map_update_elem
  riscv, bpf: Fix inconsistent JIT image generation
  selftests/bpf: Add selftest for check_stack_max_depth bug
  bpf: Fix max stack depth check for async callbacks
====================

Link: https://lore.kernel.org/r/20230712223045.40182-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

wifi: cfg80211: fix receiving mesh packets without RFC1042 header

Fix ethernet header length field after stripping the mesh header

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/CT5GNZSK28AI.2K6M69OXM9RW5@syracuse/
Fixes: 986e43b19ae9 ("wifi: mac80211: fix receiving A-MSDU frames on mesh interfaces")
Reported-and-tested-by: Nicolas Escande <nico.escande@gmail.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20230711115052.68430-1-nbd@nbd.name
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

wifi: rtw89: debug: fix error code in rtw89_debug_priv_send_h2c_set()

If there is a failure during rtw89_fw_h2c_raw() rtw89_debug_priv_send_h2c
should return negative error code instead of a positive value count.
Fix this bug by returning correct error code.

Fixes: e3ec7017f6a2 ("rtw89: add Realtek 802.11ax driver")
Signed-off-by: Zhang Shurong <zhang_shurong@foxmail.com>
Acked-by: Ping-Ke Shih <pkshih@realtek.com>
Link: https://lore.kernel.org/r/tencent_AD09A61BC4DA92AD1EB0790F5C850E544D07@qq.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: wwan: t7xx: Add AP CLDMA

At this moment with the current status, t7xx is not functional due to
problems like this after connection, if there is no activity:
[   57.370534] mtk_t7xx 0000:72:00.0: [PM] SAP suspend error: -110
[   57.370581] mtk_t7xx 0000:72:00.0: can't suspend
    (t7xx_pci_pm_runtime_suspend [mtk_t7xx] returned -110)
because after this, the traffic no longer works.

The complete series 'net: wwan: t7xx: fw flashing & coredump support'
was reverted because of issues with the pci implementation.
In order to have at least the modem working, it would be enough if just
the first commit of the series is re-applied:
d20ef656f994 net: wwan: t7xx: Add AP CLDMA
With that, the Application Processor would be controlled, correctly
suspended and the commented problems would be fixed (I am testing here
like this with no related issue).

This commit is independent of the others and not related to the
commented pci implementation for the new features: fw flashing and
coredump collection.

Use v2 patch version of d20ef656f994 as JinJian Song suggests
(https://patchwork.kernel.org/project/netdevbpf/patch/20230105154215.198828-1-m.chetan.kumar@linux.intel.com/).

Original text from the commit that would be re-applied:

    d20ef656f994 net: wwan: t7xx: Add AP CLDMA
    Author: Haijun Liu <haijun.liu@mediatek.com>
    Date:   Tue Aug 16 09:53:28 2022 +0530

    The t7xx device contains two Cross Layer DMA (CLDMA) interfaces to
    communicate with AP and Modem processors respectively. So far only
    MD-CLDMA was being used, this patch enables AP-CLDMA.

    Rename small Application Processor (sAP) to AP.

Signed-off-by: Haijun Liu <haijun.liu@mediatek.com>
Co-developed-by: Madhusmita Sahu <madhusmita.sahu@intel.com>
Signed-off-by: Madhusmita Sahu <madhusmita.sahu@intel.com>
Signed-off-by: Moises Veleta <moises.veleta@linux.intel.com>
Signed-off-by: Devegowda Chandrashekar <chandrashekar.devegowda@intel.com>
Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230711062817.6108-1-jtornosm@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: rpl: Remove redundant skb_dst_drop().

RPL code has a pattern where skb_dst_drop() is called before
ip6_route_input().

However, ip6_route_input() calls skb_dst_drop() internally,
so we need not call skb_dst_drop() before ip6_route_input().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230710213511.5364-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mlxsw-add-port-range-matching-support'

Petr Machata says:

====================
mlxsw: Add port range matching support

Ido Schimmel writes:

Add port range matching support in mlxsw as part of tc-flower offload.

Patches #1-#7 gradually add port range matching support in mlxsw. See
patch #3 to understand how port range matching is implemented in the
device.

Patches #8-#10 add selftests.
====================

Link: https://lore.kernel.org/r/cover.1689092769.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: Add test cases for flower port range matching

Add test cases to verify that flower port range matching works
correctly. Test both source and destination port ranges, with different
combinations of IPv4/IPv6 and TCP/UDP, on both ingress and egress.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/9d47c9cd4522b2d335b13ce8f6c9b33199298cee.1689092769.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mlxsw: Test port range registers' occupancy

Test that filters that match on the same port range, but with different
combination of IPv4/IPv6 and TCP/UDP all use the same port range
register by observing port range registers' occupancy via
devlink-resource.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/0a2eb63b234fb062ff011e80231868cc80000c81.1689092769.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mlxsw: Add scale test for port ranges

Query the maximum number of supported port range registers using
devlink-resource and test that this number can be reached by configuring
tc filters with different port ranges. Test that an error is returned in
case the maximum number is exceeded.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/48eee181270d9f291e09d1858c7b26a3f7fcc164.1689092769.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_flower: Add ability to match on port ranges

Add the ability to match on port ranges by utilizing the previously
added port range registers and the port range key element. Up to two
port range registers can be used for each filter, one for source port
and another for destination port.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/df4385a9592917e9a22ebff339e0463e4a8dfa82.1689092769.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_acl: Pass main driver structure to mlxsw_sp_acl_rulei_destroy()

The main driver structure will be needed in this function by a
subsequent patch, so pass it. No functional changes intended.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/24d96a4e21310e5de2951ace58263db35e44a0df.1689092769.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_acl: Add port range key element

Add the port range key element to supported key blocks so that it could
be used to match on the output of the port range registers. Each bit in
the element can be used to match on the output of the port range
register with the corresponding index.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/f0423f6ee9e36c6b0a426bc9995f42223c48f2db.1689092769.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_port_range: Add devlink resource support

Expose via devlink-resource the maximum number of port range registers
and their current occupancy. Besides the observability benefits, this
resource will be used by subsequent patches for scale and occupancy
tests.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/7945e0c715dc5efb1617f45f7560c1f1bd0bcf8a.1689092769.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_port_range: Add port range core

The Spectrum ASICs have a fixed number of port range registers, each of
which maintains the following parameters:

* Minimum and maximum port.
* Apply port range for source port, destination port or both.
* Apply port range for TCP, UDP or both.
* Apply port range for IPv4, IPv6 or both.

Implement a port range core which takes care of the allocation and
configuration of these registers and exposes an API that allows
in-driver consumers (e.g., the ACL code) to request matching on a range
of either source or destination port.

These registers are going to be used for port range matching in the
flower classifier that already matches on EtherType being IPv4 / IPv6 and
IP protocol being TCP / UDP. As such, there is no need to limit these
registers to a specific EtherType or IP protocol, which will increase
the likelihood of a register being shared by multiple flower filters.

It is unlikely that a filter will match on the same range of both source
and destination ports, which is why each register is only configured to
match on either source or destination port. If a filter requires
matching on a range of both source and destination ports, it will
utilize two port range registers and match on the output of both.

For efficient lookup and traversal, use XArray to store the allocated
port range registers. The XArray uses RCU and an internal spinlock to
synchronise access, so there is no need for a dedicate lock.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/674f00539a0072d455847663b5feb504db51a259.1689092769.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: resource: Add resource identifier for port range registers

Add a resource identifier for maximum number of layer 4 port range
register so that it could be later used to query the information from
firmware.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/59a8fec353d5ad9fbfb7612e4a7ff61eaedad445.1689092769.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: reg: Add Policy-Engine Port Range Register

Add the Policy-Engine Port Range Register that is used for configuring
port range identification.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/d1a1f53d758f7452cf5abfe006b23496076ec3e6.1689092769.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: txgbe: fix eeprom calculation error

For some device types like TXGBE_ID_XAUI, *checksum computed in
txgbe_calc_eeprom_checksum() is larger than TXGBE_EEPROM_SUM. Remove the
limit on the size of *checksum.

Fixes: 049fe5365324 ("net: txgbe: Add operations to interact with firmware")
Fixes: 5e2ea7801fac ("net: txgbe: Fix unsigned comparison to zero in txgbe_calc_eeprom_checksum()")
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://lore.kernel.org/r/20230711063414.3311-1-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'for-linus' of https://github.com/openrisc/linux

Pull OpenRISC fix from Stafford Horne:

- During the 6.4 cycle my fpu support work broke ABI compatibility in
   the sigcontext struct. This was noticed by musl libc developers after
   the release. This fix restores the ABI.

* tag 'for-linus' of https://github.com/openrisc/linux:
  openrisc: Union fpcsr and oldmask in sigcontext to unbreak userspace ABI

tracing/histograms: Add histograms to hist_vars if they have referenced variables

Hist triggers can have referenced variables without having direct
variables fields. This can be the case if referenced variables are added
for trigger actions. In this case the newly added references will not
have field variables. Not taking such referenced variables into
consideration can result in a bug where it would be possible to remove
hist trigger with variables being refenced. This will result in a bug
that is easily reproducable like so

$ cd /sys/kernel/tracing
$ echo 'synthetic_sys_enter char[] comm; long id' >> synthetic_events
$ echo 'hist:keys=common_pid.execname,id.syscall:vals=hitcount:comm=common_pid.execname' >> events/raw_syscalls/sys_enter/trigger
$ echo 'hist:keys=common_pid.execname,id.syscall:onmatch(raw_syscalls.sys_enter).synthetic_sys_enter($comm, id)' >> events/raw_syscalls/sys_enter/trigger
$ echo '!hist:keys=common_pid.execname,id.syscall:vals=hitcount:comm=common_pid.execname' >> events/raw_syscalls/sys_enter/trigger

[  100.263533] ==================================================================
[  100.264634] BUG: KASAN: slab-use-after-free in resolve_var_refs+0xc7/0x180
[  100.265520] Read of size 8 at addr ffff88810375d0f0 by task bash/439
[  100.266320]
[  100.266533] CPU: 2 PID: 439 Comm: bash Not tainted 6.5.0-rc1 #4
[  100.267277] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
[  100.268561] Call Trace:
[  100.268902]  <TASK>
[  100.269189]  dump_stack_lvl+0x4c/0x70
[  100.269680]  print_report+0xc5/0x600
[  100.270165]  ? resolve_var_refs+0xc7/0x180
[  100.270697]  ? kasan_complete_mode_report_info+0x80/0x1f0
[  100.271389]  ? resolve_var_refs+0xc7/0x180
[  100.271913]  kasan_report+0xbd/0x100
[  100.272380]  ? resolve_var_refs+0xc7/0x180
[  100.272920]  __asan_load8+0x71/0xa0
[  100.273377]  resolve_var_refs+0xc7/0x180
[  100.273888]  event_hist_trigger+0x749/0x860
[  100.274505]  ? kasan_save_stack+0x2a/0x50
[  100.275024]  ? kasan_set_track+0x29/0x40
[  100.275536]  ? __pfx_event_hist_trigger+0x10/0x10
[  100.276138]  ? ksys_write+0xd1/0x170
[  100.276607]  ? do_syscall_64+0x3c/0x90
[  100.277099]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  100.277771]  ? destroy_hist_data+0x446/0x470
[  100.278324]  ? event_hist_trigger_parse+0xa6c/0x3860
[  100.278962]  ? __pfx_event_hist_trigger_parse+0x10/0x10
[  100.279627]  ? __kasan_check_write+0x18/0x20
[  100.280177]  ? mutex_unlock+0x85/0xd0
[  100.280660]  ? __pfx_mutex_unlock+0x10/0x10
[  100.281200]  ? kfree+0x7b/0x120
[  100.281619]  ? ____kasan_slab_free+0x15d/0x1d0
[  100.282197]  ? event_trigger_write+0xac/0x100
[  100.282764]  ? __kasan_slab_free+0x16/0x20
[  100.283293]  ? __kmem_cache_free+0x153/0x2f0
[  100.283844]  ? sched_mm_cid_remote_clear+0xb1/0x250
[  100.284550]  ? __pfx_sched_mm_cid_remote_clear+0x10/0x10
[  100.285221]  ? event_trigger_write+0xbc/0x100
[  100.285781]  ? __kasan_check_read+0x15/0x20
[  100.286321]  ? __bitmap_weight+0x66/0xa0
[  100.286833]  ? _find_next_bit+0x46/0xe0
[  100.287334]  ? task_mm_cid_work+0x37f/0x450
[  100.287872]  event_triggers_call+0x84/0x150
[  100.288408]  trace_event_buffer_commit+0x339/0x430
[  100.289073]  ? ring_buffer_event_data+0x3f/0x60
[  100.292189]  trace_event_raw_event_sys_enter+0x8b/0xe0
[  100.295434]  syscall_trace_enter.constprop.0+0x18f/0x1b0
[  100.298653]  syscall_enter_from_user_mode+0x32/0x40
[  100.301808]  do_syscall_64+0x1a/0x90
[  100.304748]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  100.307775] RIP: 0033:0x7f686c75c1cb
[  100.310617] Code: 73 01 c3 48 8b 0d 65 3c 10 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 21 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 35 3c 10 00 f7 d8 64 89 01 48
[  100.317847] RSP: 002b:00007ffc60137a38 EFLAGS: 00000246 ORIG_RAX: 0000000000000021
[  100.321200] RAX: ffffffffffffffda RBX: 000055f566469ea0 RCX: 00007f686c75c1cb
[  100.324631] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 000000000000000a
[  100.328104] RBP: 00007ffc60137ac0 R08: 00007f686c818460 R09: 000000000000000a
[  100.331509] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000009
[  100.334992] R13: 0000000000000007 R14: 000000000000000a R15: 0000000000000007
[  100.338381]  </TASK>

We hit the bug because when second hist trigger has was created
has_hist_vars() returned false because hist trigger did not have
variables. As a result of that save_hist_vars() was not called to add
the trigger to trace_array->hist_vars. Later on when we attempted to
remove the first histogram find_any_var_ref() failed to detect it is
being used because it did not find the second trigger in hist_vars list.

With this change we wait until trigger actions are created so we can take
into consideration if hist trigger has variable references. Also, now we
check the return value of save_hist_vars() and fail trigger creation if
save_hist_vars() fails.

Link: https://lore.kernel.org/linux-trace-kernel/20230712223021.636335-1-mkhalfella@purestorage.com
Cc: stable@vger.kernel.org
Fixes: 067fe038e70f6 ("tracing: Add variable reference handling to hist triggers")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

net/sched: make psched_mtu() RTNL-less safe

Eric Dumazet says[1]:
-------
Speaking of psched_mtu(), I see that net/sched/sch_pie.c is using it
without holding RTNL, so dev->mtu can be changed underneath.
KCSAN could issue a warning.
-------

Annotate dev->mtu with READ_ONCE() so KCSAN don't issue a warning.

[1] https://lore.kernel.org/all/CANn89iJoJO5VtaJ-2=_d2aOQhb0Xw8iBT_Cxqp2HyuS-zj6azw@mail.gmail.com/

v1 -> v2: Fix commit message

Fixes: d4b36210c2e6 ("net: pkt_sched: PIE AQM scheme")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230711021634.561598-1-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ena: fix shift-out-of-bounds in exponential backoff

The ENA adapters on our instances occasionally reset.  Once recently
logged a UBSAN failure to console in the process:

  UBSAN: shift-out-of-bounds in build/linux/drivers/net/ethernet/amazon/ena/ena_com.c:540:13
  shift exponent 32 is too large for 32-bit type 'unsigned int'
  CPU: 28 PID: 70012 Comm: kworker/u72:2 Kdump: loaded not tainted 5.15.117
  Hardware name: Amazon EC2 c5d.9xlarge/, BIOS 1.0 10/16/2017
  Workqueue: ena ena_fw_reset_device [ena]
  Call Trace:
  <TASK>
  dump_stack_lvl+0x4a/0x63
  dump_stack+0x10/0x16
  ubsan_epilogue+0x9/0x36
  __ubsan_handle_shift_out_of_bounds.cold+0x61/0x10e
  ? __const_udelay+0x43/0x50
  ena_delay_exponential_backoff_us.cold+0x16/0x1e [ena]
  wait_for_reset_state+0x54/0xa0 [ena]
  ena_com_dev_reset+0xc8/0x110 [ena]
  ena_down+0x3fe/0x480 [ena]
  ena_destroy_device+0xeb/0xf0 [ena]
  ena_fw_reset_device+0x30/0x50 [ena]
  process_one_work+0x22b/0x3d0
  worker_thread+0x4d/0x3f0
  ? process_one_work+0x3d0/0x3d0
  kthread+0x12a/0x150
  ? set_kthread_struct+0x50/0x50
  ret_from_fork+0x22/0x30
  </TASK>

Apparently, the reset delays are getting so large they can trigger a
UBSAN panic.

Looking at the code, the current timeout is capped at 5000us.  Using a
base value of 100us, the current code will overflow after (1<<29).  Even
at values before 32, this function wraps around, perhaps
unintentionally.

Cap the value of the exponent used for this backoff at (1<<16) which is
larger than currently necessary, but large enough to support bigger
values in the future.

Cc: stable@vger.kernel.org
Fixes: 4bb7f4cf60e3 ("net: ena: reduce driver load time")
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shay Agroskin <shayagr@amazon.com>
Link: https://lore.kernel.org/r/20230711013621.GE1926@templeofstupid.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bpftool: Use "fallthrough;" keyword instead of comments

After using "__fallthrough;" in a switch/case block in bpftool's
btf_dumper.c [0], and then turning it into a comment [1] to prevent a
merge conflict in linux-next when the keyword was changed into just
"fallthrough;" [2], we can now drop the comment and use the new keyword,
no underscores.

Also update the other occurrence of "/* fallthrough */" in bpftool.

[0] commit 9fd496848b1c ("bpftool: Support inline annotations when dumping the CFG of a program")
[1] commit 4b7ef71ac977 ("bpftool: Replace "__fallthrough" by a comment to address merge conflict")
[2] commit f7a858bffcdd ("tools: Rename __fallthrough to fallthrough")

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230712152322.81758-1-quentin@isovalent.com

tracing: Stop FORTIFY_SOURCE complaining about stack trace caller

The stack_trace event is an event created by the tracing subsystem to
store stack traces. It originally just contained a hard coded array of 8
words to hold the stack, and a "size" to know how many entries are there.
This is exported to user space as:

name: kernel_stack
ID: 4
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;

field:int size; offset:8; size:4; signed:1;
field:unsigned long caller[8]; offset:16; size:64; signed:0;

print fmt: "\t=> %ps\n\t=> %ps\n\t=> %ps\n" "\t=> %ps\n\t=> %ps\n\t=> %ps\n" "\t=> %ps\n\t=> %ps\n",i
(void *)REC->caller[0], (void *)REC->caller[1], (void *)REC->caller[2],
(void *)REC->caller[3], (void *)REC->caller[4], (void *)REC->caller[5],
(void *)REC->caller[6], (void *)REC->caller[7]

Where the user space tracers could parse the stack. The library was
updated for this specific event to only look at the size, and not the
array. But some older users still look at the array (note, the older code
still checks to make sure the array fits inside the event that it read.
That is, if only 4 words were saved, the parser would not read the fifth
word because it will see that it was outside of the event size).

This event was changed a while ago to be more dynamic, and would save a
full stack even if it was greater than 8 words. It does this by simply
allocating more ring buffer to hold the extra words. Then it copies in the
stack via:

memcpy(&entry->caller, fstack->calls, size);

As the entry is struct stack_entry, that is created by a macro to both
create the structure and export this to user space, it still had the caller
field of entry defined as: unsigned long caller[8].

When the stack is greater than 8, the FORTIFY_SOURCE code notices that the
amount being copied is greater than the source array and complains about
it. It has no idea that the source is pointing to the ring buffer with the
required allocation.

To hide this from the FORTIFY_SOURCE logic, pointer arithmetic is used:

ptr = ring_buffer_event_data(event);
entry = ptr;
ptr += offsetof(typeof(*entry), caller);
memcpy(ptr, fstack->calls, size);

Link: https://lore.kernel.org/all/20230612160748.4082850-1-svens@linux.ibm.com/
Link: https://lore.kernel.org/linux-trace-kernel/20230712105235.5fc441aa@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()

As comments in ftrace_process_locs(), there may be NULL pointers in
mcount_loc section:
> Some architecture linkers will pad between
> the different mcount_loc sections of different
> object files to satisfy alignments.
> Skip any NULL pointers.

After commit 20e5227e9f55 ("ftrace: allow NULL pointers in mcount_loc"),
NULL pointers will be accounted when allocating ftrace pages but skipped
before adding into ftrace pages, this may result in some pages not being
used. Then after commit 706c81f87f84 ("ftrace: Remove extra helper
functions"), warning may occur at:
WARN_ON(pg->next);

To fix it, only warn for case that no pointers skipped but pages not used
up, then free those unused pages after releasing ftrace_lock.

Link: https://lore.kernel.org/linux-trace-kernel/20230712060452.3175675-1-zhengyejian1@huawei.com
Cc: stable@vger.kernel.org
Fixes: 706c81f87f84 ("ftrace: Remove extra helper functions")
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

tcp: add a scheduling point in established_get_first()

Kubernetes[1] is going to stick with /proc/net/tcp for a while.

This commit reduces the scheduling latency introduced by
established_get_first(), similar to commit acffb584cda7 ("net: diag:
add a scheduling point in inet_diag_dump_icsk()").

In our environment, the scheduling latency affects the performance of
latency-sensitive services like Redis.

Changes in V2 :
- call cond_resched() before checking if a bucket is empty as
suggested by Eric Dumazet
- removed the delay of synchronize_net() from the commit message

[1] https://github.com/google/cadvisor/blob/v0.47.2/container/libcontainer/handler.go#L130

Signed-off-by: Jian Wen <wenjian1@xiaomi.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230711032405.3253025-1-wenjian1@xiaomi.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'bpf-mem-cache-free-rcu'

Alexei Starovoitov says:

====================
v3->v4:
- extra patch 14 from Hou to check for object leaks.
- fixed the race/leak in free_by_rcu_ttrace. Extra hunk in patch 8.
- added Acks and fixed typos.

v2->v3:
- dropped _tail optimization for free_by_rcu_ttrace
- new patch 5 to refactor inc/dec of c->active
- change 'draining' logic in patch 7
- add rcu_barrier in patch 12
- __llist_add-> llist_add(waiting_for_gp_ttrace) in patch 9 to fix race
- David's Ack in patch 13 and explanation that migrate_disable cannot be removed just yet.

v1->v2:
- Fixed race condition spotted by Hou. Patch 7.

v1:

Introduce bpf_mem_cache_free_rcu() that is similar to kfree_rcu except
the objects will go through an additional RCU tasks trace grace period
before being freed into slab.

Patches 1-9 - a bunch of prep work
Patch 10 - a patch from Paul that exports rcu_request_urgent_qs_task().
Patch 12 - the main bpf_mem_cache_free_rcu patch.
Patch 13 - use it in bpf_cpumask.

bpf_local_storage, bpf_obj_drop, qp-trie will be other users eventually.

With additional hack patch to htab that replaces bpf_mem_cache_free with bpf_mem_cache_free_rcu
the following are benchmark results:
- map_perf_test 4 8 16348 1000000
drops from 800k to 600k. Waiting for RCU GP makes objects cache cold.

- bench htab-mem -a -p 8
20% drop in performance and big increase in memory. From 3 Mbyte to 50 Mbyte. As expected.

- bench htab-mem -a -p 16 --use-case add_del_on_diff_cpu
Same performance and better memory consumption.
Before these patches this bench would OOM (with or without 'reuse after GP')
Patch 8 addresses the issue.

At the end the performance drop and additional memory consumption due to _rcu()
were expected and came out to be within reasonable margin.
Without Paul's patch 10 the memory consumption in 'bench htab-mem' is in Gbytes
which wouldn't be acceptable.

Patch 8 is a heuristic to address 'alloc on one cpu, free on another' issue.
It works well in practice. One can probably construct an artificial benchmark
to make heuristic ineffective, but we have to trade off performance, code complexity,
and memory consumption.

The life cycle of objects:
alloc: dequeue free_llist
free: enqeueu free_llist
free_llist above high watermark -> free_by_rcu_ttrace
free_rcu: enqueue free_by_rcu -> waiting_for_gp
after RCU GP waiting_for_gp -> free_by_rcu_ttrace
free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: Add object leak check.

The object leak check is cheap. Do it unconditionally to spot difficult races
in bpf_mem_alloc.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230706033447.54696-15-alexei.starovoitov@gmail.com

bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.

Convert bpf_cpumask to bpf_mem_cache_free_rcu.
Note that migrate_disable() in bpf_cpumask_release() is still necessary, since
bpf_cpumask_release() is a dtor. bpf_obj_free_fields() can be converted to do
migrate_disable() there in a follow up.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-14-alexei.starovoitov@gmail.com

bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().

Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into
per-cpu free list the _rcu() flavor waits for RCU grace period and then moves
objects into free_by_rcu_ttrace list where they are waiting for RCU
task trace grace period to be freed into slab.

The life cycle of objects:
alloc: dequeue free_llist
free: enqeueu free_llist
free_rcu: enqueue free_by_rcu -> waiting_for_gp
free_llist above high watermark -> free_by_rcu_ttrace
after RCU GP waiting_for_gp -> free_by_rcu_ttrace
free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-13-alexei.starovoitov@gmail.com

selftests/bpf: Improve test coverage of bpf_mem_alloc.

bpf_obj_new() calls bpf_mem_alloc(), but doing alloc/free of 8 elements
is not triggering watermark conditions in bpf_mem_alloc.
Increase to 200 elements to make sure alloc_bulk/free_bulk is exercised.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-12-alexei.starovoitov@gmail.com

rcu: Export rcu_request_urgent_qs_task()

If a CPU is executing a long series of non-sleeping system calls,
RCU grace periods can be delayed for on the order of a couple hundred
milliseconds. This is normally not a problem, but if each system call
does a call_rcu(), those callbacks can stack up. RCU will eventually
notice this callback storm, but use of rcu_request_urgent_qs_task()
allows the code invoking call_rcu() to give RCU a heads up.

This function is not for general use, not yet, anyway.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230706033447.54696-11-alexei.starovoitov@gmail.com

bpf: Allow reuse from waiting_for_gp_ttrace list.

alloc_bulk() can reuse elements from free_by_rcu_ttrace.
Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230706033447.54696-10-alexei.starovoitov@gmail.com

bpf: Add a hint to allocated objects.

To address OOM issue when one cpu is allocating and another cpu is freeing add
a target bpf_mem_cache hint to allocated objects and when local cpu free_llist
overflows free to that bpf_mem_cache. The hint addresses the OOM while
maintaining the same performance for common case when alloc/free are done on the
same cpu.

Note that do_call_rcu_ttrace() now has to check 'draining' flag in one more case,
since do_call_rcu_ttrace() is called not only for current cpu.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-9-alexei.starovoitov@gmail.com

bpf: Change bpf_mem_cache draining process.

The next patch will introduce cross-cpu llist access and existing
irq_work_sync() + drain_mem_cache() + rcu_barrier_tasks_trace() mechanism will
not be enough, since irq_work_sync() + drain_mem_cache() on cpu A won't
guarantee that llist on cpu A are empty. The free_bulk() on cpu B might add
objects back to llist of cpu A. Add 'bool draining' flag.
The modified sequence looks like:
for_each_cpu:
  WRITE_ONCE(c->draining, true); // do_call_rcu_ttrace() won't be doing call_rcu() any more
  irq_work_sync(); // wait for irq_work callback (free_bulk) to finish
  drain_mem_cache(); // free all objects
rcu_barrier_tasks_trace(); // wait for RCU callbacks to execute

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-8-alexei.starovoitov@gmail.com

bpf: Further refactor alloc_bulk().

In certain scenarios alloc_bulk() might be taking free objects mainly from
free_by_rcu_ttrace list. In such case get_memcg() and set_active_memcg() are
redundant, but they show up in perf profile. Split the loop and only set memcg
when allocating from slab. No performance difference in this patch alone, but
it helps in combination with further patches.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-7-alexei.starovoitov@gmail.com

bpf: Factor out inc/dec of active flag into helpers.

Factor out local_inc/dec_return(&c->active) into helpers.
No functional changes.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-6-alexei.starovoitov@gmail.com

bpf: Refactor alloc_bulk().

Factor out inner body of alloc_bulk into separate helper.
No functional changes.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-5-alexei.starovoitov@gmail.com

bpf: Let free_all() return the number of freed elements.

Let free_all() helper return the number of freed elements.
It's not used in this patch, but helps in debug/development of bpf_mem_alloc.

For example this diff for __free_rcu():
-       free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size);
+       printk("cpu %d freed %d objs after tasks trace\n", raw_smp_processor_id(),
+        free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size));

would show how busy RCU tasks trace is.
In artificial benchmark where one cpu is allocating and different cpu is freeing
the RCU tasks trace won't be able to keep up and the list of objects
would keep growing from thousands to millions and eventually OOMing.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-4-alexei.starovoitov@gmail.com

bpf: Simplify code of destroy_mem_alloc() with kmemdup().

Use kmemdup() to simplify the code.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-3-alexei.starovoitov@gmail.com

bpf: Rename few bpf_mem_alloc fields.

Rename:
-       struct rcu_head rcu;
-       struct llist_head free_by_rcu;
-       struct llist_head waiting_for_gp;
-       atomic_t call_rcu_in_progress;
+       struct llist_head free_by_rcu_ttrace;
+       struct llist_head waiting_for_gp_ttrace;
+       struct rcu_head rcu_ttrace;
+       atomic_t call_rcu_ttrace_in_progress;
...
- static void do_call_rcu(struct bpf_mem_cache *c)
+ static void do_call_rcu_ttrace(struct bpf_mem_cache *c)

to better indicate intended use.

The 'tasks trace' is shortened to 'ttrace' to reduce verbosity.
No functional changes.

Later patches will add free_by_rcu/waiting_for_gp fields to be used with normal RCU.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-2-alexei.starovoitov@gmail.com

netdevsim: fix uninitialized data in nsim_dev_trap_fa_cookie_write()

The simple_write_to_buffer() function is designed to handle partial
writes.  It returns negatives on error, otherwise it returns the number
of bytes that were able to be copied.  This code doesn't check the
return properly.  We only know that the first byte is written, the rest
of the buffer might be uninitialized.

There is no need to use the simple_write_to_buffer() function.
Partial writes are prohibited by the "if (*ppos != 0)" check at the
start of the function.  Just use memdup_user() and copy the whole
buffer.

Fixes: d3cbb907ae57 ("netdevsim: add ACL trap reporting cookie as a metadata")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/7c1f950b-3a7d-4252-82a6-876e53078ef7@moroto.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'platform-drivers-x86-v6.5-2' of git://git./linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform driver fixes from Hans de Goede:
"Misc small fixes and hw-id additions"

* tag 'platform-drivers-x86-v6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  platform/x86: touchscreen_dmi: Add info for the Archos 101 Cesium Educ tablet
  platform/x86: dell-ddv: Fix mangled list in documentation
  platform/x86: dell-ddv: Improve error handling
  platform/x86/amd: pmf: Add new ACPI ID AMDI0103
  platform/x86/amd: pmc: Add new ACPI ID AMDI000A
  platform/x86/amd: pmc: Apply nvme quirk to HP 15s-eq2xxx
  platform/x86: Move s2idle quirk from thinkpad-acpi to amd-pmc
  platform/x86: int3472/discrete: set variable skl_int3472_regulator_second_sensor storage-class-specifier to static
  platform/x86/intel/tpmi: Prevent overflow for cap_offset
  platform/x86: wmi: Replace open coded guid_parse_and_compare()
  platform/x86: wmi: Break possible infinite loop when parsing GUID

Merge tag 'probes-fixes-v6.5-rc1' of git://git./linux/kernel/git/trace/linux-trace

Pull probes fixes from Masami Hiramatsu:

- Fix fprobe's rethook release issues:

     - Release rethook after ftrace_ops is unregistered so that the
       rethook is not accessed after free.

     - Stop rethook before ftrace_ops is unregistered so that the
       rethook is NOT used after exiting unregister_fprobe()

- Fix eprobe cleanup logic. If it attaches to multiple events and
   failes to enable one of them, rollback all enabled events correctly.

- Fix fprobe to unlock ftrace recursion lock correctly when it missed
   by another running kprobe.

- Cleanup kprobe to remove unnecessary NULL.

- Cleanup kprobe to remove unnecessary 0 initializations.

* tag 'probes-fixes-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  fprobe: Ensure running fprobe_exit_handler() finished before calling rethook_free()
  kernel: kprobes: Remove unnecessary ‘0’ values
  kprobes: Remove unnecessary ‘NULL’ values from correct_ret_addr
  fprobe: add unlock to match a succeeded ftrace_test_recursion_trylock
  kernel/trace: Fix cleanup logic of enable_trace_eprobe
  fprobe: Release rethook after the ftrace_ops is unregistered

Merge tag 'for-linus-2023071101' of git://git./linux/kernel/git/hid/hid

Pull HID fixes from Benjamin Tissoires:

- AMD SFH shift-out-of-bounds fix (Basavaraj Natikar)

- avoid struct memcpy overrun warning in the hid-hyperv module (Arnd
   Bergmann)

- a quick HID kselftests script fix for our CI to be happy (Benjamin
   Tissoires)

- various fixes and additions of device IDs

* tag 'for-linus-2023071101' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
  HID: amd_sfh: Fix for shift-out-of-bounds
  HID: amd_sfh: Rename the float32 variable
  HID: input: fix mapping for camera access keys
  HID: logitech-hidpp: Add wired USB id for Logitech G502 Lightspeed
  HID: nvidia-shield: Pack inner/related declarations in HOSTCMD reports
  HID: hyperv: avoid struct memcpy overrun warning
  selftests: hid: fix vmtests.sh not running make headers

ring-buffer: Fix deadloop issue on reading trace_pipe

Soft lockup occurs when reading file 'trace_pipe':

  watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [cat:4488]
  [...]
  RIP: 0010:ring_buffer_empty_cpu+0xed/0x170
  RSP: 0018:ffff88810dd6fc48 EFLAGS: 00000246
  RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffffffff93d1aaeb
  RDX: ffff88810a280040 RSI: 0000000000000008 RDI: ffff88811164b218
  RBP: ffff88811164b218 R08: 0000000000000000 R09: ffff88815156600f
  R10: ffffed102a2acc01 R11: 0000000000000001 R12: 0000000051651901
  R13: 0000000000000000 R14: ffff888115e49500 R15: 0000000000000000
  [...]
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f8d853c2000 CR3: 000000010dcd8000 CR4: 00000000000006e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   __find_next_entry+0x1a8/0x4b0
   ? peek_next_entry+0x250/0x250
   ? down_write+0xa5/0x120
   ? down_write_killable+0x130/0x130
   trace_find_next_entry_inc+0x3b/0x1d0
   tracing_read_pipe+0x423/0xae0
   ? tracing_splice_read_pipe+0xcb0/0xcb0
   vfs_read+0x16b/0x490
   ksys_read+0x105/0x210
   ? __ia32_sys_pwrite64+0x200/0x200
   ? switch_fpu_return+0x108/0x220
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x61/0xc6

Through the vmcore, I found it's because in tracing_read_pipe(),
ring_buffer_empty_cpu() found some buffer is not empty but then it
cannot read anything due to "rb_num_of_entries() == 0" always true,
Then it infinitely loop the procedure due to user buffer not been
filled, see following code path:

  tracing_read_pipe() {
    ... ...
    waitagain:
      tracing_wait_pipe() // 1. find non-empty buffer here
      trace_find_next_entry_inc()  // 2. loop here try to find an entry
        __find_next_entry()
          ring_buffer_empty_cpu();  // 3. find non-empty buffer
          peek_next_entry()  // 4. but peek always return NULL
            ring_buffer_peek()
              rb_buffer_peek()
                rb_get_reader_page()
                  // 5. because rb_num_of_entries() == 0 always true here
                  //    then return NULL
      // 6. user buffer not been filled so goto 'waitgain'
      //    and eventually leads to an deadloop in kernel!!!
  }

By some analyzing, I found that when resetting ringbuffer, the 'entries'
of its pages are not all cleared (see rb_reset_cpu()). Then when reducing
the ringbuffer, and if some reduced pages exist dirty 'entries' data, they
will be added into 'cpu_buffer->overrun' (see rb_remove_pages()), which
cause wrong 'overrun' count and eventually cause the deadloop issue.

To fix it, we need to clear every pages in rb_reset_cpu().

Link: https://lore.kernel.org/linux-trace-kernel/20230708225144.3785600-1-zhengyejian1@huawei.com
Cc: stable@vger.kernel.org
Fixes: a5fb833172eca ("ring-buffer: Fix uninitialized read_stamp")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

tracing: arm64: Avoid missing-prototype warnings

These are all tracing W=1 warnings in arm64 allmodconfig about missing
prototypes:

kernel/trace/trace_kprobe_selftest.c:7:5: error: no previous prototype for 'kprobe_trace_selftest_target' [-Werror=missing-pro
totypes]
kernel/trace/ftrace.c:329:5: error: no previous prototype for '__register_ftrace_function' [-Werror=missing-prototypes]
kernel/trace/ftrace.c:372:5: error: no previous prototype for '__unregister_ftrace_function' [-Werror=missing-prototypes]
kernel/trace/ftrace.c:4130:15: error: no previous prototype for 'arch_ftrace_match_adjust' [-Werror=missing-prototypes]
kernel/trace/fgraph.c:243:15: error: no previous prototype for 'ftrace_return_to_handler' [-Werror=missing-prototypes]
kernel/trace/fgraph.c:358:6: error: no previous prototype for 'ftrace_graph_sleep_time_control' [-Werror=missing-prototypes]
arch/arm64/kernel/ftrace.c:460:6: error: no previous prototype for 'prepare_ftrace_return' [-Werror=missing-prototypes]
arch/arm64/kernel/ptrace.c:2172:5: error: no previous prototype for 'syscall_trace_enter' [-Werror=missing-prototypes]
arch/arm64/kernel/ptrace.c:2195:6: error: no previous prototype for 'syscall_trace_exit' [-Werror=missing-prototypes]

Move the declarations to an appropriate header where they can be seen
by the caller and callee, and make sure the headers are included where
needed.

Link: https://lore.kernel.org/linux-trace-kernel/20230517125215.930689-1-arnd@kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Florent Revest <revest@chromium.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
[ Fixed ftrace_return_to_handler() to handle CONFIG_HAVE_FUNCTION_GRAPH_RETVAL case ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

selftests/user_events: Test struct size match cases

The self tests for user_events currently does not ensure that the edge
case for struct types work properly with size differences.

Add cases for mis-matching struct names and sizes to ensure they work
properly.

Link: https://lkml.kernel.org/r/20230629235049.581-3-beaub@linux.microsoft.com
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

selftests/bpf: extend existing map resize tests for per-cpu use case

Add a per-cpu array resizing use case and demonstrate how
bpf_get_smp_processor_id() can be used to directly access proper data
with no extra checks.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230711232400.1658562-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: teach verifier actual bounds of bpf_get_smp_processor_id() result

bpf_get_smp_processor_id() helper returns current CPU on which BPF
program runs. It can't return value that is bigger than maximum allowed
number of CPUs (minus one, due to zero indexing). Teach BPF verifier to
recognize that. This makes it possible to use bpf_get_smp_processor_id()
result to index into arrays without extra checks, as demonstrated in
subsequent selftests/bpf patch.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230711232400.1658562-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

net/sched: flower: Ensure both minimum and maximum ports are specified

The kernel does not currently validate that both the minimum and maximum
ports of a port range are specified. This can lead user space to think
that a filter matching on a port range was successfully added, when in
fact it was not. For example, with a patched (buggy) iproute2 that only
sends the minimum port, the following commands do not return an error:

# tc filter add dev swp1 ingress pref 1 proto ip flower ip_proto udp src_port 100-200 action pass

# tc filter add dev swp1 ingress pref 1 proto ip flower ip_proto udp dst_port 100-200 action pass

# tc filter show dev swp1 ingress
filter protocol ip pref 1 flower chain 0
filter protocol ip pref 1 flower chain 0 handle 0x1
   eth_type ipv4
   ip_proto udp
   not_in_hw
         action order 1: gact action pass
          random type none pass val 0
          index 1 ref 1 bind 1

filter protocol ip pref 1 flower chain 0 handle 0x2
   eth_type ipv4
   ip_proto udp
   not_in_hw
         action order 1: gact action pass
          random type none pass val 0
          index 2 ref 1 bind 1

Fix by returning an error unless both ports are specified:

# tc filter add dev swp1 ingress pref 1 proto ip flower ip_proto udp src_port 100-200 action pass
Error: Both min and max source ports must be specified.
We have an error talking to the kernel

# tc filter add dev swp1 ingress pref 1 proto ip flower ip_proto udp dst_port 100-200 action pass
Error: Both min and max destination ports must be specified.
We have an error talking to the kernel

Fixes: 5c72299fba9d ("net: sched: cls_flower: Classify packets using port ranges")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'stmmac-errors'

Andrew Halaney says:

====================
net: stmmac: dwmac-qcom-ethqos: Improve error handling

This series includes some very minor quality of life patches in the
error handling.

I recently ran into a few issues where these patches would have made my
life easier (messing with the devicetree, dependent driver of this
failing, and incorrect kernel configs resulting in this driver not
probing).

v1: https://lore.kernel.org/netdev/20230629191725.1434142-1-ahalaney@redhat.com/
Changes since v1:
    * Collect tags (Andrew Lunn)
    * Switch to of_get_phy_mode() (Andrew Lunn)
    * Follow netdev patch submission process (net-next subject, wait
      until merge window is open) (Simon)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: dwmac-qcom-ethqos: Log more errors in probe

These are useful to see when debugging a probe failure.

Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: dwmac-qcom-ethqos: Use dev_err_probe()

Using dev_err_probe() logs to devices_deferred which is helpful
when debugging.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: dwmac-qcom-ethqos: Use of_get_phy_mode() over device_get_phy_mode()

Since this driver only uses devicetree, let's move over to
of_get_phy_mode(). That API has an explicit error return and produces a
phy_interface_t directly instead of an int when representing the phy
mode.

Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '1GbE' of git://git./linux/kernel/git/tnguy/net-queue

Tony Nguyen says:

====================
igc: Fix corner cases for TSN offload

Florian Kauer says:

The igc driver supports several different offloading capabilities
relevant in the TSN context. Recent patches in this area introduced
regressions for certain corner cases that are fixed in this series.

Each of the patches (except the first one) addresses a different
regression that can be separately reproduced. Still, they have
overlapping code changes so they should not be separately applied.

Especially #4 and #6 address the same observation,
but both need to be applied to avoid TX hang occurrences in
the scenario described in the patches.
====================

Signed-off-by: Florian Kauer <florian.kauer@linutronix.de>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Reviewed-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

MAINTAINERS: Add another mailing list for QUALCOMM ETHQOS ETHERNET DRIVER

linux-arm-msm is the list most people subscribe to in order to receive
updates about Qualcomm related drivers. Make sure changes for the
Qualcomm ethernet driver make it there.

Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Acked-by: Vinod Koul <vkoul@kernel.org>
Link: https://lore.kernel.org/r/20230710195240.197047-1-ahalaney@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

docs: netdev: update the URL of the status page

Move the status page from vger to the same server as mailbot.

Link: https://lore.kernel.org/r/20230710174636.1174684-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

wifi: iwlwifi: remove 'use_tfh' config to fix crash

This is equivalent to 'gen2', and it was always confusing to have
two identical config entries. The split config patch actually had
been originally developed after removing 'use_tfh" and didn't add
the use_tfh in the new configs as they'd later been copied to the
new files. Thus the easiest way to fix the init crash here now is
to just remove use_tfh (which is erroneously unset in most of the
configs now) and use 'gen2' in the code instead.

There's possibly still an unwind error in iwl_txq_gen2_init() as
it crashes if TXQ 0 fails to initialize, but we can deal with it
later since the original failure is due to the use_tfh confusion.

Tested-by: Xi Ruoyao <xry111@xry111.site>
Reported-and-tested-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>
Reported-and-tested-by: Jeff Chua <jeff.chua.linux@gmail.com>
Reported-and-tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217622
Link: https://lore.kernel.org/all/9274d9bd3d080a457649ff5addcc1726f08ef5b2.camel@xry111.site/
Link: https://lore.kernel.org/all/CAAJw_Zug6VCS5ZqTWaFSr9sd85k%3DtyPm9DEE%2BmV%3DAKoECZM%2BsQ@mail.gmail.com/
Fixes: 19898ce9cf8a ("wifi: iwlwifi: split 22000.c into multiple files")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20230710145038.84186-2-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: use dev_consume_skb_any() in bnxt_tx_int

Replace dev_kfree_skb_any() with dev_consume_skb_any() in bnxt_tx_int()
to clear the unnecessary noise of "kfree_skb" event.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230710094747.943782-1-imagedong@tencent.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-freescale-convert-to-platform-remove-callback-returning-void'

Uwe Kleine-König says:

====================
net: freescale: Convert to platform remove callback returning void

v2 of this series was sent in June[1], code changes since then only affect
patch #1 where the dev_err invocation was adapted to emit the error code of
dpaa_fq_free(). Thanks for feedback by Maciej Fijalkowski and Russell King.
Other than that I added Reviewed-by tags for Simon Horman and Wei Fang and
rebased to v6.5-rc1.

There is only one dependency in this series: Patch #2 depends on patch

[1] https://lore.kernel.org/netdev/20230606162829.166226-1-u.kleine-koenig@pengutronix.de
====================

Link: https://lore.kernel.org/r/20230710071946.3470249-1-u.kleine-koenig@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ucc_geth: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://lore.kernel.org/r/20230710071946.3470249-9-u.kleine-koenig@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: gianfar: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://lore.kernel.org/r/20230710071946.3470249-8-u.kleine-koenig@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fsl_pq_mdio: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://lore.kernel.org/r/20230710071946.3470249-7-u.kleine-koenig@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fs_enet: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://lore.kernel.org/r/20230710071946.3470249-6-u.kleine-koenig@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fman: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://lore.kernel.org/r/20230710071946.3470249-5-u.kleine-koenig@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fec: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://lore.kernel.org/r/20230710071946.3470249-4-u.kleine-koenig@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dpaa: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://lore.kernel.org/r/20230710071946.3470249-3-u.kleine-koenig@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dpaa: Improve error reporting

Instead of the generic error message emitted by the driver core when a
remove callback returns an error code ("remove callback returned a
non-zero value. This will be ignored."), emit a message describing the
actual problem and return zero to suppress the generic message.

Note that apart from suppressing the generic error message there are no
side effects by changing the return value to zero. This prepares
changing the remove callback to return void.

Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://lore.kernel.org/r/20230710071946.3470249-2-u.kleine-koenig@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'bpf: Support ->fill_link_info for kprobe_multi and perf_event links'

Yafang Shao says:

====================
This patchset enhances the usability of kprobe_multi program by introducing
support for ->fill_link_info. This allows users to easily determine the
probed functions associated with a kprobe_multi program. While
`bpftool perf show` already provides information about functions probed by
perf_event programs, supporting ->fill_link_info ensures consistent access
to this information across all bpf links.

In addition, this patch extends support to generic perf events, which are
currently not covered by `bpftool perf show`. While userspace is exposed to
only the perf type and config, other attributes such as sample_period and
sample_freq are disregarded.

To ensure accurate identification of probed functions, it is preferable to
expose the address directly rather than relying solely on the symbol name.
However, this implementation respects the kptr_restrict setting and avoids
exposing the address if it is not permitted.

v6->v7:
- From Daniel
  - No need to explicitly cast in many places
  - Use ptr_to_u64() instead of the cast
  - return -ENOMEM when calloc fails
  - Simplify the code in bpf_get_kprobe_info() further
  - Squash #9 with #8
  - And other coding style improvement
- From Andrii
  - Comment improvement
  - Use ENOSPC instead of E2BIG
  - Use strlen only when buf in not NULL
- Clear probe_addr in bpf_get_uprobe_info()

v5->v6:
- From Andrii
  - if ucount is too less, copy ucount items and return -E2BIG
  - zero out kmulti_link->cnt elements if it is not permitted by kptr
  - avoid leaking information when ucount is greater than kmulti_link->cnt
  - drop the flags, and add BPF_PERF_EVENT_[UK]RETPROBE
- From Quentin
  - use jsonw_null instead when we have no module name
  - add explanation on perf_type_name in the commit log
  - avoid the unnecessary out lable

v4->v5:
- Print "func [module]" in the kprobe_multi header (Andrii)
- Remove MAX_BPF_PERF_EVENT_TYPE (Alexei)
- Add padding field for future reuse (Yonghong)

v3->v4:
- From Quentin
  - Rename MODULE_NAME_LEN to MODULE_MAX_NAME
  - Convert retprobe to boolean for json output
  - Trim the square brackets around module names for json output
  - Move perf names into link.c
  - Use a generic helper to get perf names
  - Show address before func name, for consistency
  - Use switch-case instead of if-else
  - Increase the buff len to PATH_MAX
  - Move macros to the top of the file
- From Andrii
  - kprobe_multi flags should always be returned
  - Keep it single line if it fits in under 100 characters
  - Change the output format when showing kprobe_multi
  - Imporve the format of perf_event names
  - Rename struct perf_link to struct perf_event, and change the names of
    the enum consequently
- From Yonghong
  - Avoid disallowing extensions for all structs in the big union
- From Jiri
  - Add flags to bpf_kprobe_multi_link
  - Report kprobe_multi selftests errors
  - Rename bpf_perf_link_fill_name and make it a separate patch
  - Avoid breaking compilation when CONFIG_KPROBE_EVENTS or
    CONFIG_UPROBE_EVENTS options are not defined

v2->v3:
- Expose flags instead of retporbe (Andrii)
- Simplify the check on kmulti_link->cnt (Andrii)
- Use kallsyms_show_value() instead (Andrii)
- Show also the module name for kprobe_multi (Andrii)
- Add new enum bpf_perf_link_type (Andrii)
- Move perf event names into bpftool (Andrii, Quentin, Jiri)
- Keep perf event names in sync with perf tools (Jiri)

v1->v2:
- Fix sparse warning (Stanislav, lkp@intel.com)
- Fix BPF CI build error
- Reuse kernel_syms_load() (Alexei)
- Print 'name' instead of 'func' (Alexei)
- Show whether the probe is retprobe or not (Andrii)
- Add comment for the meaning of perf_event name (Andrii)
- Add support for generic perf event
- Adhere to the kptr_restrict setting

RFC->v1:
- Use a single copy_to_user() instead (Jiri)
- Show also the symbol name in bpftool (Quentin, Alexei)
- Use calloc() instead of malloc() in bpftool (Quentin)
- Avoid having conditional entries in the JSON output (Quentin)
- Drop ->show_fdinfo (Alexei)
- Use __u64 instead of __aligned_u64 for the field addr (Alexei)
- Avoid the contradiction in perf_event name length (Alexei)
- Address a build warning reported by kernel test robot <lkp@intel.com>
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpftool: Show perf link info

Enhance bpftool to display comprehensive information about exposed
perf_event links, covering uprobe, kprobe, tracepoint, and generic perf
event. The resulting output will include the following details:

$ tools/bpf/bpftool/bpftool link show
3: perf_event  prog 14
        event software:cpu-clock
        bpf_cookie 0
        pids perf_event(19483)
4: perf_event  prog 14
        event hw-cache:LLC-load-misses
        bpf_cookie 0
        pids perf_event(19483)
5: perf_event  prog 14
        event hardware:cpu-cycles
        bpf_cookie 0
        pids perf_event(19483)
6: perf_event  prog 19
        tracepoint sched_switch
        bpf_cookie 0
        pids tracepoint(20947)
7: perf_event  prog 26
        uprobe /home/dev/waken/bpf/uprobe/a.out+0x1338
        bpf_cookie 0
        pids uprobe(21973)
8: perf_event  prog 27
        uretprobe /home/dev/waken/bpf/uprobe/a.out+0x1338
        bpf_cookie 0
        pids uprobe(21973)
10: perf_event  prog 43
        kprobe ffffffffb70a9660 kernel_clone
        bpf_cookie 0
        pids kprobe(35275)
11: perf_event  prog 41
        kretprobe ffffffffb70a9660 kernel_clone
        bpf_cookie 0
        pids kprobe(35275)

$ tools/bpf/bpftool/bpftool link show -j
[{"id":3,"type":"perf_event","prog_id":14,"event_type":"software","event_config":"cpu-clock","bpf_cookie":0,"pids":[{"pid":19483,"comm":"perf_event"}]},{"id":4,"type":"perf_event","prog_id":14,"event_type":"hw-cache","event_config":"LLC-load-misses","bpf_cookie":0,"pids":[{"pid":19483,"comm":"perf_event"}]},{"id":5,"type":"perf_event","prog_id":14,"event_type":"hardware","event_config":"cpu-cycles","bpf_cookie":0,"pids":[{"pid":19483,"comm":"perf_event"}]},{"id":6,"type":"perf_event","prog_id":19,"tracepoint":"sched_switch","bpf_cookie":0,"pids":[{"pid":20947,"comm":"tracepoint"}]},{"id":7,"type":"perf_event","prog_id":26,"retprobe":false,"file":"/home/dev/waken/bpf/uprobe/a.out","offset":4920,"bpf_cookie":0,"pids":[{"pid":21973,"comm":"uprobe"}]},{"id":8,"type":"perf_event","prog_id":27,"retprobe":true,"file":"/home/dev/waken/bpf/uprobe/a.out","offset":4920,"bpf_cookie":0,"pids":[{"pid":21973,"comm":"uprobe"}]},{"id":10,"type":"perf_event","prog_id":43,"retprobe":false,"addr":18446744072485508704,"func":"kernel_clone","offset":0,"bpf_cookie":0,"pids":[{"pid":35275,"comm":"kprobe"}]},{"id":11,"type":"perf_event","prog_id":41,"retprobe":true,"addr":18446744072485508704,"func":"kernel_clone","offset":0,"bpf_cookie":0,"pids":[{"pid":35275,"comm":"kprobe"}]}]

For generic perf events, the displayed information in bpftool is limited to
the type and configuration, while other attributes such as sample_period,
sample_freq, etc., are not included.

The kernel function address won't be exposed if it is not permitted by
kptr_restrict. The result as follows when kptr_restrict is 2.

$ tools/bpf/bpftool/bpftool link show
3: perf_event  prog 14
        event software:cpu-clock
4: perf_event  prog 14
        event hw-cache:LLC-load-misses
5: perf_event  prog 14
        event hardware:cpu-cycles
6: perf_event  prog 19
        tracepoint sched_switch
7: perf_event  prog 26
        uprobe /home/dev/waken/bpf/uprobe/a.out+0x1338
8: perf_event  prog 27
        uretprobe /home/dev/waken/bpf/uprobe/a.out+0x1338
10: perf_event  prog 43
        kprobe kernel_clone
11: perf_event  prog 41
        kretprobe kernel_clone

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-11-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpftool: Add perf event names

Add new functions and macros to get perf event names. These names except
the perf_type_name are all copied from
tool/perf/util/{parse-events,evsel}.c, so that in the future we will
have a good chance to use the same code.

Suggested-by: Jiri Olsa <olsajiri@gmail.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-10-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Support ->fill_link_info for perf_event

By introducing support for ->fill_link_info to the perf_event link, users
gain the ability to inspect it using `bpftool link show`. While the current
approach involves accessing this information via `bpftool perf show`,
consolidating link information for all link types in one place offers
greater convenience. Additionally, this patch extends support to the
generic perf event, which is not currently accommodated by
`bpftool perf show`. While only the perf type and config are exposed to
userspace, other attributes such as sample_period and sample_freq are
ignored. It's important to note that if kptr_restrict is not permitted, the
probed address will not be exposed, maintaining security measures.

A new enum bpf_perf_event_type is introduced to help the user understand
which struct is relevant.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-9-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add a common helper bpf_copy_to_user()

Add a common helper bpf_copy_to_user(), which will be used at multiple
places.
No functional change.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-8-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Expose symbol's respective address

Since different symbols can share the same name, it is insufficient to only
expose the symbol name. It is essential to also expose the symbol address
so that users can accurately identify which one is being probed.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-7-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Clear the probe_addr for uprobe

To avoid returning uninitialized or random values when querying the file
descriptor (fd) and accessing probe_addr, it is necessary to clear the
variable prior to its use.

Fixes: 41bdc4b40ed6 ("bpf: introduce bpf subcommand BPF_TASK_FD_QUERY")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-6-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Protect probed address based on kptr_restrict setting

The probed address can be accessed by userspace through querying the task
file descriptor (fd). However, it is crucial to adhere to the kptr_restrict
setting and refrain from exposing the address if it is not permitted.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-5-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>