review.tizen.org Git - platform/kernel/linux-starfive.git/log

samples/bpf: Detach xdp prog when program exits unexpectedly in xdp_rxq_info_user

When xdp_rxq_info_user program exits unexpectedly, it doesn't detach xdp
prog of device, and other xdp prog can't be attached to the device. So
call init_exit() to detach xdp prog when program exits unexpectedly.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220427062338.80173-1-shaozhengchao@huawei.com

bpf/selftests: Add granular subtest output for prog_test

Implement per subtest log collection for both parallel
and sequential test execution. This allows granular
per-subtest error output in the 'All error logs' section.
Add subtest log transfer into the protocol during the
parallel test execution.

Move all test log printing logic into dump_test_log
function. One exception is the output of test names when
verbose printing is enabled. Move test name/result
printing into separate functions to avoid repetition.

Print all successful subtest results in the log. Print
only failed test logs when test does not have subtests.
Or only failed subtests' logs when test has subtests.

Disable 'All error logs' output when verbose mode is
enabled. This functionality was already broken and is
causing confusion.

Signed-off-by: Mykola Lysenko <mykolal@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220427041353.246007-1-mykolal@fb.com

Merge https://git./linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2022-04-27

We've added 85 non-merge commits during the last 18 day(s) which contain
a total of 163 files changed, 4499 insertions(+), 1521 deletions(-).

The main changes are:

1) Teach libbpf to enhance BPF verifier log with human-readable and relevant
   information about failed CO-RE relocations, from Andrii Nakryiko.

2) Add typed pointer support in BPF maps and enable it for unreferenced pointers
   (via probe read) and referenced ones that can be passed to in-kernel helpers,
   from Kumar Kartikeya Dwivedi.

3) Improve xsk to break NAPI loop when rx queue gets full to allow for forward
   progress to consume descriptors, from Maciej Fijalkowski & Björn Töpel.

4) Fix a small RCU read-side race in BPF_PROG_RUN routines which dereferenced
   the effective prog array before the rcu_read_lock, from Stanislav Fomichev.

5) Implement BPF atomic operations for RV64 JIT, and add libbpf parsing logic
   for USDT arguments under riscv{32,64}, from Pu Lehui.

6) Implement libbpf parsing of USDT arguments under aarch64, from Alan Maguire.

7) Enable bpftool build for musl and remove nftw with FTW_ACTIONRETVAL usage
   so it can be shipped under Alpine which is musl-based, from Dominique Martinet.

8) Clean up {sk,task,inode} local storage trace RCU handling as they do not
   need to use call_rcu_tasks_trace() barrier, from KP Singh.

9) Improve libbpf API documentation and fix error return handling of various
   API functions, from Grant Seltzer.

10) Enlarge offset check for bpf_skb_{load,store}_bytes() helpers given data
    length of frags + frag_list may surpass old offset limit, from Liu Jian.

11) Various improvements to prog_tests in area of logging, test execution
    and by-name subtest selection, from Mykola Lysenko.

12) Simplify map_btf_id generation for all map types by moving this process
    to build time with help of resolve_btfids infra, from Menglong Dong.

13) Fix a libbpf bug in probing when falling back to legacy bpf_probe_read*()
    helpers; the probing caused always to use old helpers, from Runqing Yang.

14) Add support for ARCompact and ARCv2 platforms for libbpf's PT_REGS
    tracing macros, from Vladimir Isaev.

15) Cleanup BPF selftests to remove old & unneeded rlimit code given kernel
    switched to memcg-based memory accouting a while ago, from Yafang Shao.

16) Refactor of BPF sysctl handlers to move them to BPF core, from Yan Zhu.

17) Fix BPF selftests in two occasions to work around regressions caused by latest
    LLVM to unblock CI until their fixes are worked out, from Yonghong Song.

18) Misc cleanups all over the place, from various others.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (85 commits)
  selftests/bpf: Add libbpf's log fixup logic selftests
  libbpf: Fix up verifier log for unguarded failed CO-RE relos
  libbpf: Simplify bpf_core_parse_spec() signature
  libbpf: Refactor CO-RE relo human description formatting routine
  libbpf: Record subprog-resolved CO-RE relocations unconditionally
  selftests/bpf: Add CO-RE relos and SEC("?...") to linked_funcs selftests
  libbpf: Avoid joining .BTF.ext data with BPF programs by section name
  libbpf: Fix logic for finding matching program for CO-RE relocation
  libbpf: Drop unhelpful "program too large" guess
  libbpf: Fix anonymous type check in CO-RE logic
  bpf: Compute map_btf_id during build time
  selftests/bpf: Add test for strict BTF type check
  selftests/bpf: Add verifier tests for kptr
  selftests/bpf: Add C tests for kptr
  libbpf: Add kptr type tag macros to bpf_helpers.h
  bpf: Make BTF type match stricter for release arguments
  bpf: Teach verifier about kptr_get kfunc helpers
  bpf: Wire up freeing of referenced kptr
  bpf: Populate pairs of btf_id and destructor kfunc in btf
  bpf: Adapt copy_map_value for multiple offset case
  ...
====================

Link: https://lore.kernel.org/r/20220427224758.20976-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: ksz9477: move get_stats64 to ksz_common.c

The mib counters for the ksz9477 is same for the ksz9477 switch and
LAN937x switch. Hence moving it to ksz_common.c file in order to have it
generic function. The DSA hook get_stats64 now can call ksz_get_stats64.

Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/20220426091048.9311-1-arun.ramadoss@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'remove-virt_to_bus-drivers'

Jakub Kicinski says:

====================
net: remove non-Ethernet drivers using virt_to_bus()

Networking is currently the main offender in using virt_to_bus().
Frankly all the drivers which use it are super old and unlikely
to be used today. They are just an ongoing maintenance burden.

In other words this series is using virt_to_bus() as an excuse
to shed some old stuff. Having done the tree-wide dev_addr_set()
conversion recently I have limited sympathy for carrying dead
code.

Obviously please scream if any of these drivers _is_ in fact
still being used. Otherwise let's take the chance, we can always
apologize and revert if users show up later.

Also I should say thanks to everyone who contributed to this code!
The work continues to be appreciated although realistically in more
of a "history book" fashion...
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: hamradio: remove support for DMA SCC devices

Another incarnation of Z8530, looks like? Again, no real changes
in the git history, and it needs VIRT_TO_BUS. Unlikely to have
users, let's spend less time refactoring dead code...

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: wan: remove support for Z85230-based devices

Looks like all the changes to this driver had been automated
churn since git era begun. The driver is using virt_to_bus(),
it's just a maintenance burden unlikely to have any users.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: wan: remove support for COSA and SRP synchronous serial boards

Looks like all the changes to this driver had been automated
churn since git era begun. The driver is using virt_to_bus()
so it should be updated to a proper DMA API or removed. Given
the latest "news" entry on the website is from 1999 I'm opting
for the latter.

I'm marking the allocated char device major number as [REMOVED],
I reckon we can't reuse it in case some SW out there assumes its
COSA?

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: atm: remove support for ZeitNet ZN122x ATM devices

This driver received nothing but automated fixes in the last 15 years.
Since it's using virt_to_bus it's unlikely to be used on any modern
platform.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: atm: remove support for Madge Horizon ATM devices

This driver received nothing but automated fixes since git era begun.
Since it's using virt_to_bus it's unlikely to be used on any modern
platform.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: atm: remove support for Fujitsu FireStream ATM devices

This driver received nothing but automated fixes (mostly spelling
and compiler warnings) since git era begun. Since it's using
virt_to_bus it's unlikely to be used on any modern platform.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'lan966x-ptp-programmable-pins'

Horatiu Vultur says:

====================
net: lan966x: Add support for PTP programmable pins

Lan966x has 8 PTP programmable pins. The last pin is hardcoded to be used
by PHC0 and all the rest are shareable between the PHCs. The PTP pins can
implement both extts and perout functions.

v1->v2:
- use ptp_find_pin_unlocked instead of ptp_find_pin inside the irq handler.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: lan966x: Add support for PTP_PF_EXTTS

Extend the PTP programmable pins to implement also PTP_PF_EXTTS
function. The PTP pin can be configured to capture only on the rising
edge of the PPS signal. And once an event is seen then an interrupt is
generated and the local time counter is saved.
The interrupt is shared between all the pins.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: lan966x: Add support for PTP_PF_PEROUT

Lan966x has 8 PTP programmable pins, where the last pins is hardcoded to
be used by PHC0, which does the frame timestamping. All the rest of the
PTP pins can be shared between the PHCs and can have different functions
like perout or extts. For now add support for PTP_FS_PEROUT.
The HW is not able to support absolute start time but can use the nsec
for phase adjustment when generating PPS.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: lan966x: Add registers used to configure the PTP pin

Add registers that are used to configure the PTP pins. These registers
are used to enable the interrupts per PTP pin and to set the waveform
generated by the pin.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: lan966x: Change the PTP pin used to read/write the PHC.

To read/write a value to a PHC, it is required to use a PTP pin.
Currently it is used pin 5, but change to pin 7 as is the last pin.
All the other pins will have different functions.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: lan966x: Extend with the ptp external interrupt.

Extend dt-bindings for lan966x with ptp external interrupt. This is
generated when an external 1pps signal is received on the ptp pin.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mptcp-MP_FAIL-timeout'

Mat Martineau says:

====================
mptcp: Timeout for MP_FAIL response

When one peer sends an infinite mapping to coordinate fallback from
MPTCP to regular TCP, the other peer is expected to send a packet with
the MPTCP MP_FAIL option to acknowledge the infinite mapping. Rather
than leave the connection in some half-fallback state, this series adds
a timeout after which the infinite mapping sender will reset the
connection.

Patch 1 adds a fallback self test.

Patches 2-5 make use of the MPTCP socket's retransmit timer to reset the
MPTCP connection if no MP_FAIL was received.

Patches 6 and 7 extends the self test to check MP_FAIL-related MIBs.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: print extra msg in chk_csum_nr

When the multiple checksum errors occur in chk_csum_nr(), print the
numbers of the errors as an extra message.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: check MP_FAIL response mibs

This patch extends chk_fail_nr to check the MP_FAIL response mibs.

Add a new argument invert for chk_fail_nr to allow it can check the
MP_FAIL TX and RX mibs from the opposite direction.

When the infinite map is received before the MP_FAIL response, the
response will be lost. A '-' can be added into fail_tx or fail_rx to
represent that MP_FAIL response TX or RX can be lost when doing the
checks.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: reset subflow when MP_FAIL doesn't respond

This patch adds a new msk->flags bit MPTCP_FAIL_NO_RESPONSE, then reuses
sk_timer to trigger a check if we have not received a response from the
peer after sending MP_FAIL. If the peer doesn't respond properly, reset
the subflow.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: add MP_FAIL response support

This patch adds a new struct member mp_fail_response_expect in struct
mptcp_subflow_context to support MP_FAIL response. In the single subflow
with checksum error and contiguous data special case, a MP_FAIL is sent
in response to another MP_FAIL.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: add data lock for sk timers

mptcp_data_lock() needs to be held when manipulating the msk
retransmit_timer or the sk sk_timer. This patch adds the data
lock for the both timers.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: use mptcp_stop_timer

Use the helper mptcp_stop_timer() instead of using sk_stop_timer() to
stop icsk_retransmit_timer directly.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: add infinite map testcase

Add the single subflow test case for MP_FAIL, to test the infinite
mapping case. Use the test_linkfail value to make 128KB test files.

Add a new function reset_with_fail(), in it use 'iptables' and 'tc
action pedit' rules to produce the bit flips to trigger the checksum
failures. Set validate_checksum to enable checksums for the MP_FAIL
tests without passing the '-C' argument. Set check_invert flag to
enable the invert bytes check for the output data in check_transfer().
Instead of the file mismatch error, this test prints out the inverted
bytes.

Add a new function pedit_action_pkts() to get the numbers of the packets
edited by the tc pedit actions. Print this numbers to the output.

Also add the needed kernel configures in the selftests config file.

Suggested-by: Davide Caratti <dcaratti@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: dwmac-imx: comment spelling fix

Fix spelling in comment.

Fixes: 94abdad6974a ("net: ethernet: dwmac: add ethernet glue logic for NXP imx8 chip")
Signed-off-by: Marcel Ziswiler <marcel.ziswiler@toradex.com>
Link: https://lore.kernel.org/r/20220425154856.169499-1-marcel@ziswiler.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: remove comments that mention obsolete __SLOW_DOWN_IO

The only remaining definitions of __SLOW_DOWN_IO (for alpha and ia64) do
nothing, and the only mentions in networking are in comments. Remove these
mentions.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: wan: atp: remove unused eeprom_delay()

atp.h is included only by atp.c, which does not use eeprom_delay(). Remove
the unused definition.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: tls: fix async vs NIC crypto offload

When NIC takes care of crypto (or the record has already
been decrypted) we forget to update darg->async. ->async
is supposed to mean whether record is async capable on
input and whether record has been queued for async crypto
on output.

Reported-by: Gal Pressman <gal@nvidia.com>
Fixes: 3547a1f9d988 ("tls: rx: use async as an in-out argument")
Tested-by: Gal Pressman <gal@nvidia.com>
Link: https://lore.kernel.org/r/20220425233309.344858-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: mt753x: fix pcs conversion regression

Daniel Golle reports that the conversion of mt753x to phylink PCS caused
an oops as below.

The problem is with the placement of the PCS initialisation, which
occurs after mt7531_setup() has been called. However, burited in this
function is a call to setup the CPU port, which requires the PCS
structure to be already setup.

Fix this by changing the initialisation order.

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
Mem abort info:
  ESR = 0x96000005
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x05: level 1 translation fault
Data abort info:
  ISV = 0, ISS = 0x00000005
  CM = 0, WnR = 0
user pgtable: 4k pages, 39-bit VAs, pgdp=0000000046057000
[0000000000000020] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
Internal error: Oops: 96000005 [#1] SMP
Modules linked in:
CPU: 0 PID: 32 Comm: kworker/u4:1 Tainted: G S 5.18.0-rc3-next-20220422+ #0
Hardware name: Bananapi BPI-R64 (DT)
Workqueue: events_unbound deferred_probe_work_func
pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : mt7531_cpu_port_config+0xcc/0x1b0
lr : mt7531_cpu_port_config+0xc0/0x1b0
sp : ffffffc008d5b980
x29: ffffffc008d5b990 x28: ffffff80060562c8 x27: 00000000f805633b
x26: ffffff80001a8880 x25: 00000000000009c4 x24: 0000000000000016
x23: ffffff8005eb6470 x22: 0000000000003600 x21: ffffff8006948080
x20: 0000000000000000 x19: 0000000000000006 x18: 0000000000000000
x17: 0000000000000001 x16: 0000000000000001 x15: 02963607fcee069e
x14: 0000000000000000 x13: 0000000000000030 x12: 0101010101010101
x11: ffffffc037302000 x10: 0000000000000870 x9 : ffffffc008d5b800
x8 : ffffff800028f950 x7 : 0000000000000001 x6 : 00000000662b3000
x5 : 00000000000002f0 x4 : 0000000000000000 x3 : ffffff800028f080
x2 : 0000000000000000 x1 : ffffff800028f080 x0 : 0000000000000000
Call trace:
mt7531_cpu_port_config+0xcc/0x1b0
mt753x_cpu_port_enable+0x24/0x1f0
mt7531_setup+0x49c/0x5c0
mt753x_setup+0x20/0x31c
dsa_register_switch+0x8bc/0x1020
mt7530_probe+0x118/0x200
mdio_probe+0x30/0x64
really_probe.part.0+0x98/0x280
__driver_probe_device+0x94/0x140
driver_probe_device+0x40/0x114
__device_attach_driver+0xb0/0x10c
bus_for_each_drv+0x64/0xa0
__device_attach+0xa8/0x16c
device_initial_probe+0x10/0x20
bus_probe_device+0x94/0x9c
deferred_probe_work_func+0x80/0xb4
process_one_work+0x200/0x3a0
worker_thread+0x260/0x4c0
kthread+0xd4/0xe0
ret_from_fork+0x10/0x20
Code: 9409e911 937b7e60 8b0002a0 f9405800 (f9401005)
---[ end trace 0000000000000000 ]---

Reported-by: Daniel Golle <daniel@makrotopia.org>
Tested-by: Daniel Golle <daniel@makrotopia.org>
Fixes: cbd1f243bc41 ("net: dsa: mt7530: partially convert to phylink_pcs")
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/E1nj6FW-007WZB-5Y@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: generalize skb freeing deferral to per-cpu lists

Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows to move the cost of skbs
frees outside of critical section where socket lock was held.

But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.

For RPC traffic, recvmsg() has to return to user space right after
skb payload has been consumed, meaning that BH handler has no chance
to pick the skb before recvmsg() thread. This issue is more visible
with BIG TCP, as more RPC fit one skb.

For RFS, even if BH handler picks the skbs, they are still picked
from the cpu on which user thread is running.

Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.

This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.

This new per-cpu list is drained at the end of net_action_rx(),
after incoming packets have been processed, to lower latencies.

In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run net_action_rx() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.

Also, we do not bother draining the per-cpu list from dev_cpu_dead()
This is because skbs in this list have no requirement on how fast
they should be freed.

Note that we can add in the future a small per-cpu cache
if we see any contention on sd->defer_lock.

Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to number of skbs/pages held in sockets
receive queues)

Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workload.

10 runs of one TCP_STREAM flow

Before:
Average throughput: 49685 Mbit.

Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)

    57.81%  [kernel]       [k] copy_user_enhanced_fast_string
(*) 12.87%  [kernel]       [k] skb_release_data
(*)  4.25%  [kernel]       [k] __free_one_page
(*)  3.57%  [kernel]       [k] __list_del_entry_valid
     1.85%  [kernel]       [k] __netif_receive_skb_core
     1.60%  [kernel]       [k] __skb_datagram_iter
(*)  1.59%  [kernel]       [k] free_unref_page_commit
(*)  1.16%  [kernel]       [k] __slab_free
     1.16%  [kernel]       [k] _copy_to_iter
(*)  1.01%  [kernel]       [k] kfree
(*)  0.88%  [kernel]       [k] free_unref_page
     0.57%  [kernel]       [k] ip6_rcv_core
     0.55%  [kernel]       [k] ip6t_do_table
     0.54%  [kernel]       [k] flush_smp_call_function_queue
(*)  0.54%  [kernel]       [k] free_pcppages_bulk
     0.51%  [kernel]       [k] llist_reverse_order
     0.38%  [kernel]       [k] process_backlog
(*)  0.38%  [kernel]       [k] free_pcp_prepare
     0.37%  [kernel]       [k] tcp_recvmsg_locked
(*)  0.37%  [kernel]       [k] __list_add_valid
     0.34%  [kernel]       [k] sock_rfree
     0.34%  [kernel]       [k] _raw_spin_lock_irq
(*)  0.33%  [kernel]       [k] __page_cache_release
     0.33%  [kernel]       [k] tcp_v6_rcv
(*)  0.33%  [kernel]       [k] __put_page
(*)  0.29%  [kernel]       [k] __mod_zone_page_state
     0.27%  [kernel]       [k] _raw_spin_lock

After patch:
Average throughput: 73076 Mbit.

Kernel profiles on cpu running user thread recvmsg() looks better:

    81.35%  [kernel]       [k] copy_user_enhanced_fast_string
     1.95%  [kernel]       [k] _copy_to_iter
     1.95%  [kernel]       [k] __skb_datagram_iter
     1.27%  [kernel]       [k] __netif_receive_skb_core
     1.03%  [kernel]       [k] ip6t_do_table
     0.60%  [kernel]       [k] sock_rfree
     0.50%  [kernel]       [k] tcp_v6_rcv
     0.47%  [kernel]       [k] ip6_rcv_core
     0.45%  [kernel]       [k] read_tsc
     0.44%  [kernel]       [k] _raw_spin_lock_irqsave
     0.37%  [kernel]       [k] _raw_spin_lock
     0.37%  [kernel]       [k] native_irq_return_iret
     0.33%  [kernel]       [k] __inet6_lookup_established
     0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
     0.29%  [kernel]       [k] tcp_rcv_established
     0.29%  [kernel]       [k] llist_reverse_order

v2: kdoc issue (kernel bots)
    do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
    replace the sk_buff_head with a single-linked list (Jakub)
    add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'Teach libbpf to "fix up" BPF verifier log'

Andrii Nakryiko says:

====================

This patch set teaches libbpf to enhance BPF verifier log with human-readable
and relevant information about failed CO-RE relocation. Patch #9 is the main
one with the new logic. See relevant commit messages for some more details.

All the other patches are either fixing various bugs detected
while working on this feature, most prominently a bug with libbpf not handling
CO-RE relocations for SEC("?...") programs, or are refactoring libbpf
internals to allow for easier reuse of CO-RE relo lookup and formatting logic.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add libbpf's log fixup logic selftests

Add tests validating that libbpf is indeed patching up BPF verifier log
with CO-RE relocation details. Also test partial and full truncation
scenarios.

This test might be a bit fragile due to changing BPF verifier log
format. If that proves to be frequently breaking, we can simplify tests
or remove the truncation subtests. But for now it seems useful to test
it in those conditions that are otherwise rarely occuring in practice.

Also test CO-RE relo failure in a subprog as that excercises subprogram CO-RE
relocation mapping logic which doesn't work out of the box without extra
relo storage previously done only for gen_loader case.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220426004511.2691730-11-andrii@kernel.org

libbpf: Fix up verifier log for unguarded failed CO-RE relos

Teach libbpf to post-process BPF verifier log on BPF program load
failure and detect known error patterns to provide user with more
context.

Currently there is one such common situation: an "unguarded" failed BPF
CO-RE relocation. While failing CO-RE relocation is expected, it is
expected to be property guarded in BPF code such that BPF verifier
always eliminates BPF instructions corresponding to such failed CO-RE
relos as dead code. In cases when user failed to take such precautions,
BPF verifier provides the best log it can:

  123: (85) call unknown#195896080
  invalid func unknown#195896080

Such incomprehensible log error is due to libbpf "poisoning" BPF
instruction that corresponds to failed CO-RE relocation by replacing it
with invalid `call 0xbad2310` instruction (195896080 == 0xbad2310 reads
"bad relo" if you squint hard enough).

Luckily, libbpf has all the necessary information to look up CO-RE
relocation that failed and provide more human-readable description of
what's going on:

  5: <invalid CO-RE relocation>
  failed to resolve CO-RE relocation <byte_off> [6] struct task_struct___bad.fake_field_subprog (0:2 @ offset 8)

This hopefully makes it much easier to understand what's wrong with
user's BPF program without googling magic constants.

This BPF verifier log fixup is setup to be extensible and is going to be
used for at least one other upcoming feature of libbpf in follow up patches.
Libbpf is parsing lines of BPF verifier log starting from the very end.
Currently it processes up to 10 lines of code looking for familiar
patterns. This avoids wasting lots of CPU processing huge verifier logs
(especially for log_level=2 verbosity level). Actual verification error
should normally be found in last few lines, so this should work
reliably.

If libbpf needs to expand log beyond available log_buf_size, it
truncates the end of the verifier log. Given verifier log normally ends
with something like:

  processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0

... truncating this on program load error isn't too bad (end user can
always increase log size, if it needs to get complete log).

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220426004511.2691730-10-andrii@kernel.org

libbpf: Simplify bpf_core_parse_spec() signature

Simplify bpf_core_parse_spec() signature to take struct bpf_core_relo as
an input instead of requiring callers to decompose them into type_id,
relo, spec_str, etc. This makes using and reusing this helper easier.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220426004511.2691730-9-andrii@kernel.org

libbpf: Refactor CO-RE relo human description formatting routine

Refactor how CO-RE relocation is formatted. Now it dumps human-readable
representation, currently used by libbpf in either debug or error
message output during CO-RE relocation resolution process, into provided
buffer. This approach allows for better reuse of this functionality
outside of CO-RE relocation resolution, which we'll use in next patch
for providing better error message for BPF verifier rejecting BPF
program due to unguarded failed CO-RE relocation.

It also gets rid of annoying "stitching" of libbpf_print() calls, which
was the only place where we did this.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220426004511.2691730-8-andrii@kernel.org

libbpf: Record subprog-resolved CO-RE relocations unconditionally

Previously, libbpf recorded CO-RE relocations with insns_idx resolved
according to finalized subprog locations (which are appended at the end
of entry BPF program) to simplify the job of light skeleton generator.

This is necessary because once subprogs' instructions are appended to
main entry BPF program all the subprog instruction indices are shifted
and that shift is different for each entry (main) BPF program, so it's
generally impossible to map final absolute insn_idx of the finalized BPF
program to their original locations inside subprograms.

This information is now going to be used not only during light skeleton
generation, but also to map absolute instruction index to subprog's
instruction and its corresponding CO-RE relocation. So start recording
these relocations always, not just when obj->gen_loader is set.

This information is going to be freed at the end of bpf_object__load()
step, as before (but this can change in the future if there will be
a need for this information post load step).

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220426004511.2691730-7-andrii@kernel.org

selftests/bpf: Add CO-RE relos and SEC("?...") to linked_funcs selftests

Enhance linked_funcs selftest with two tricky features that might not
obviously work correctly together. We add CO-RE relocations to entry BPF
programs and mark those programs as non-autoloadable with SEC("?...")
annotation. This makes sure that libbpf itself handles .BTF.ext CO-RE
relocation data matching correctly for SEC("?...") programs, as well as
ensures that BPF static linker handles this correctly (this was the case
before, no changes are necessary, but it wasn't explicitly tested).

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220426004511.2691730-6-andrii@kernel.org

libbpf: Avoid joining .BTF.ext data with BPF programs by section name

Instead of using ELF section names as a joining key between .BTF.ext and
corresponding BPF programs, pre-build .BTF.ext section number to ELF
section index mapping during bpf_object__open() and use it later for
matching .BTF.ext information (func/line info or CO-RE relocations) to
their respective BPF programs and subprograms.

This simplifies corresponding joining logic and let's libbpf do
manipulations with BPF program's ELF sections like dropping leading '?'
character for non-autoloaded programs. Original joining logic in
bpf_object__relocate_core() (see relevant comment that's now removed)
was never elegant, so it's a good improvement regardless. But it also
avoids unnecessary internal assumptions about preserving original ELF
section name as BPF program's section name (which was broken when
SEC("?abc") support was added).

Fixes: a3820c481112 ("libbpf: Support opting out from autoloading BPF programs declaratively")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220426004511.2691730-5-andrii@kernel.org

libbpf: Fix logic for finding matching program for CO-RE relocation

Fix the bug in bpf_object__relocate_core() which can lead to finding
invalid matching BPF program when processing CO-RE relocation. IF
matching program is not found, last encountered program will be assumed
to be correct program and thus error detection won't detect the problem.

Fixes: 9c82a63cf370 ("libbpf: Fix CO-RE relocs against .text section")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220426004511.2691730-4-andrii@kernel.org

libbpf: Drop unhelpful "program too large" guess

libbpf pretends it knows actual limit of BPF program instructions based
on UAPI headers it compiled with. There is neither any guarantee that
UAPI headers match host kernel, nor BPF verifier actually uses
BPF_MAXINSNS constant anymore. Just drop unhelpful "guess", BPF verifier
will emit actual reason for failure in its logs anyways.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220426004511.2691730-3-andrii@kernel.org

libbpf: Fix anonymous type check in CO-RE logic

Use type name for checking whether CO-RE relocation is referring to
anonymous type. Using spec string makes no sense.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220426004511.2691730-2-andrii@kernel.org

bpf: Compute map_btf_id during build time

For now, the field 'map_btf_id' in 'struct bpf_map_ops' for all map
types are computed during vmlinux-btf init:

  btf_parse_vmlinux() -> btf_vmlinux_map_ids_init()

It will lookup the btf_type according to the 'map_btf_name' field in
'struct bpf_map_ops'. This process can be done during build time,
thanks to Jiri's resolve_btfids.

selftest of map_ptr has passed:

  $96 map_ptr:OK
  Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

net: usb: qmi_wwan: add support for Sierra Wireless EM7590

add support for Sierra Wireless EM7590 0xc081 composition.

Signed-off-by: Ethan Yang <etyang@sierrawireless.com>
Acked-by: Bjørn Mork <bjorn@mork.no>
Link: https://lore.kernel.org/r/20220425054028.5444-1-etyang@sierrawireless.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/af_packet: add VLAN support for AF_PACKET SOCK_RAW GSO

Currently, the kernel drops GSO VLAN tagged packet if it's created with
socket(AF_PACKET, SOCK_RAW, 0) plus virtio_net_hdr.

The reason is AF_PACKET doesn't adjust the skb network header if there is
a VLAN tag. Then after virtio_net_hdr_set_proto() called, the skb->protocol
will be set to ETH_P_IP/IPv6. And in later inet/ipv6_gso_segment() the skb
is dropped as network header position is invalid.

Let's handle VLAN packets by adjusting network header position in
packet_parse_headers(). The adjustment is safe and does not affect the
later xmit as tap device also did that.

In packet_snd(), packet_parse_headers() need to be moved before calling
virtio_net_hdr_set_proto(), so we can set correct skb->protocol and
network header first.

There is no need to update tpacket_snd() as it calls packet_parse_headers()
in tpacket_fill_skb(), which is already before calling virtio_net_hdr_*
functions.

skb->no_fcs setting is also moved upper to make all skb settings together
and keep consistency with function packet_sendmsg_spkt().

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20220425014502.985464-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: dsa: ksz: added the generic port_stp_state_set function

The ksz8795 and ksz9477 uses the same algorithm for the
port_stp_state_set function except the register address is different. So
moved the algorithm to the ksz_common.c and used the dev_ops for
register read and write. This function can also used for the lan937x
part. Hence making it generic for all the parts.

Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20220424112831.11504-1-arun.ramadoss@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: phy: LAN937x: add interrupt support for link detection

Added the config_intr and handle_interrupt for the LAN937x phy which is
same as the LAN87xx phy.

Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20220423154727.29052-1-arun.ramadoss@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'Introduce typed pointer support in BPF maps'

Kumar Kartikeya Dwivedi says:

====================

This set enables storing pointers of a certain type in BPF map, and extends the
verifier to enforce type safety and lifetime correctness properties.

The infrastructure being added is generic enough for allowing storing any kind
of pointers whose type is available using BTF (user or kernel) in the future
(e.g. strongly typed memory allocation in BPF program), which are internally
tracked in the verifier as PTR_TO_BTF_ID, but for now the series limits them to
two kinds of pointers obtained from the kernel.

Obviously, use of this feature depends on map BTF.

1. Unreferenced kernel pointer

In this case, there are very few restrictions. The pointer type being stored
must match the type declared in the map value. However, such a pointer when
loaded from the map can only be dereferenced, but not passed to any in-kernel
helpers or kernel functions available to the program. This is because while the
verifier's exception handling mechanism coverts BPF_LDX to PROBE_MEM loads,
which are then handled specially by the JIT implementation, the same liberty is
not available to accesses inside the kernel. The pointer by the time it is
passed into a helper has no lifetime related guarantees about the object it is
pointing to, and may well be referencing invalid memory.

2. Referenced kernel pointer

This case imposes a lot of restrictions on the programmer, to ensure safety. To
transfer the ownership of a reference in the BPF program to the map, the user
must use the bpf_kptr_xchg helper, which returns the old pointer contained in
the map, as an acquired reference, and releases verifier state for the
referenced pointer being exchanged, as it moves into the map.

This a normal PTR_TO_BTF_ID that can be used with in-kernel helpers and kernel
functions callable by the program.

However, if BPF_LDX is used to load a referenced pointer from the map, it is
still not permitted to pass it to in-kernel helpers or kernel functions. To
obtain a reference usable with helpers, the user must invoke a kfunc helper
which returns a usable reference (which also must be eventually released before
BPF_EXIT, or moved into a map).

Since the load of the pointer (preserving data dependency ordering) must happen
inside the RCU read section, the kfunc helper will take a pointer to the map
value, which must point to the actual pointer of the object whose reference is
to be raised. The type will be verified from the BTF information of the kfunc,
as the prototype must be:

T *func(T **, ... /* other arguments */);

Then, the verifier checks whether pointer at offset of the map value points to
the type T, and permits the call.

This convention is followed so that such helpers may also be called from
sleepable BPF programs, where RCU read lock is not necessarily held in the BPF
program context, hence necessiating the need to pass in a pointer to the actual
pointer to perform the load inside the RCU read section.

Notes
-----

* C selftests require https://reviews.llvm.org/D119799 to pass.
* Unlike BPF timers, kptr is not reset or freed on map_release_uref.
* Referenced kptr storage is always treated as unsigned long * on kernel side,
   as BPF side cannot mutate it. The storage (8 bytes) is sufficient for both
   32-bit and 64-bit platforms.
* Use of WRITE_ONCE to reset unreferenced kptr on 32-bit systems is fine, as
   the actual pointer is always word sized, so the store tearing into two 32-bit
   stores won't be a problem as the other half is always zeroed out.

Changelog:
----------
v5 -> v6
v5: https://lore.kernel.org/bpf/20220415160354.1050687-1-memxor@gmail.com

* Address comments from Alexei
   * Drop 'Revisit stack usage' comment
   * Rename off_btf to kernel_btf
   * Add comment about searching using type from map BTF
   * Do kmemdup + btf_get instead of get + kmemdup + put
   * Add comment for btf_struct_ids_match
   * Add comment for assigning non-zero id for mark_ptr_or_null_reg
   * Rename PTR_RELEASE to OBJ_RELEASE
   * Rename BPF_MAP_OFF_DESC_TYPE_XXX_KPTR to BPF_KPTR_XXX
   * Remove unneeded likely/unlikely in cold functions
   * Fix other misc nits
* Keep release_regno instead of replacing with bool + regno
* Add a patch to prevent type match for first member when off == 0 for
   release functions (kfunc + BPF helpers)
* Guard kptr/kptr_ref definition in libbpf header with __has_attribute
   to prevent selftests compilation error with old clang not support
   type tags

v4 -> v5
v4: https://lore.kernel.org/bpf/20220409093303.499196-1-memxor@gmail.com

* Address comments from Joanne
   * Move __btf_member_bit_offset before strcmp
   * Move strcmp conditional on name to unref kptr patch
   * Directly return from btf_find_struct in patch 1
   * Use enum btf_field_type vs int field_type
   * Put btf and btf_id in off_desc in named struct 'kptr'
   * Switch order for BTF_FIELD_IGNORE check
   * Drop dead tab->nr_off = 0 store
   * Use i instead of tab->nr_off to btf_put on failure
   * Replace kzalloc + memcpy with kmemdup (kernel test robot)
   * Reject both BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG
   * Add logging statement for reject BPF_MODE(insn->code) != BPF_MEM
   * Rename off_desc -> kptr_off_desc in check_mem_access
   * Drop check for err, fallthrough to end of function
   * Remove is_release_function, use meta.release_regno to detect release
     function, release reference state, and remove check_release_regno
   * Drop off_desc->flags, use off_desc->type
   * Update comment for ARG_PTR_TO_KPTR
* Distinguish between direct/indirect access to kptr
* Drop check_helper_mem_access from process_kptr_func, check_mem_reg in kptr_get
* Add verifier test for helper accessing kptr indirectly
* Fix other misc nits, add Acked-by for patch 2

v3 -> v4
v3: https://lore.kernel.org/bpf/20220320155510.671497-1-memxor@gmail.com

* Use btf_parse_kptrs, plural kptrs naming (Joanne, Andrii)
* Remove unused parameters in check_map_kptr_access (Joanne)
* Handle idx < info_cnt kludge using tmp variable (Andrii)
* Validate tags always precede modifiers in BTF (Andrii)
   * Split out into https://lore.kernel.org/bpf/20220406004121.282699-1-memxor@gmail.com
* Store u32 type_id in btf_field_info (Andrii)
* Use base_type in map_kptr_match_type (Andrii)
* Free kptr_off_tab when not bpf_capable (Martin)
* Use PTR_RELEASE flag instead of bools in bpf_func_proto (Joanne)
* Drop extra reg->off and reg->ref_obj_id checks in map_kptr_match_type (Martin)
* Use separate u32 and u8 arrays for offs and sizes in off_arr (Andrii)
* Simplify and remove map->value_size sentinel in copy_map_value (Andrii)
* Use sort_r to keep both arrays in sync while sorting (Andrii)
* Rename check_and_free_timers_and_kptr to check_and_free_fields (Andrii)
* Move dtor prototype checks to registration phase (Alexei)
* Use ret variable for checking ASSERT_XXX, use shorter strings (Andrii)
* Fix missing checks for other maps (Jiri)
* Fix various other nits, and bugs noticed during self review

v2 -> v3
v2: https://lore.kernel.org/bpf/20220317115957.3193097-1-memxor@gmail.com

* Address comments from Alexei
   * Set name, sz, align in btf_find_field
   * Do idx >= info_cnt check in caller of btf_find_field_*
     * Use extra element in the info_arr to make this safe
   * Remove while loop, reject extra tags
   * Remove cases of defensive programming
   * Move bpf_capable() check to map_check_btf
   * Put check_ptr_off_reg reordering hunk into separate patch
   * Warn for ref_ptr once
   * Make the meta.ref_obj_id == 0 case simpler to read
   * Remove kptr_percpu and kptr_user support, remove their tests
   * Store size of field at offset in off_arr
* Fix BPF_F_NO_PREALLOC set wrongly for hash map in C selftest
* Add missing check_mem_reg call for kptr_get kfunc arg#0 check

v1 -> v2
v1: https://lore.kernel.org/bpf/20220220134813.3411982-1-memxor@gmail.com

* Address comments from Alexei
   * Rename bpf_btf_find_by_name_kind_all to bpf_find_btf_id
   * Reduce indentation level in that function
   * Always take reference regardless of module or vmlinux BTF
   * Also made it the same for btf_get_module_btf
   * Use kptr, kptr_ref, kptr_percpu, kptr_user type tags
   * Don't reserve tag namespace
   * Refactor btf_find_field to be side effect free, allocate and populate
     kptr_off_tab in caller
   * Move module reference to dtor patch
   * Remove support for BPF_XCHG, BPF_CMPXCHG insn
   * Introduce bpf_kptr_xchg helper
   * Embed offset array in struct bpf_map, populate and sort it once
   * Adjust copy_map_value to memcpy directly using this offset array
   * Removed size member from offset array to save space
* Fix some problems pointed out by kernel test robot
* Tidy selftests
* Lots of other minor fixes
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add test for strict BTF type check

Ensure that the edge case where first member type was matched
successfully even if it didn't match BTF type of register is caught and
rejected by the verifier.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-14-memxor@gmail.com

selftests/bpf: Add verifier tests for kptr

Reuse bpf_prog_test functions to test the support for PTR_TO_BTF_ID in
BPF map case, including some tests that verify implementation sanity and
corner cases.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-13-memxor@gmail.com

selftests/bpf: Add C tests for kptr

This uses the __kptr and __kptr_ref macros as well, and tries to test
the stuff that is supposed to work, since we have negative tests in
test_verifier suite. Also include some code to test map-in-map support,
such that the inner_map_meta matches the kptr_off_tab of map added as
element.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-12-memxor@gmail.com

libbpf: Add kptr type tag macros to bpf_helpers.h

Include convenience definitions:
__kptr: Unreferenced kptr
__kptr_ref: Referenced kptr

Users can use them to tag the pointer type meant to be used with the new
support directly in the map value definition.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-11-memxor@gmail.com

bpf: Make BTF type match stricter for release arguments

The current of behavior of btf_struct_ids_match for release arguments is
that when type match fails, it retries with first member type again
(recursively). Since the offset is already 0, this is akin to just
casting the pointer in normal C, since if type matches it was just
embedded inside parent sturct as an object. However, we want to reject
cases for release function type matching, be it kfunc or BPF helpers.

An example is the following:

struct foo {
struct bar b;
};

struct foo *v = acq_foo();
rel_bar(&v->b); // btf_struct_ids_match fails btf_types_are_same, then
// retries with first member type and succeeds, while
// it should fail.

Hence, don't walk the struct and only rely on btf_types_are_same for
strict mode. All users of strict mode must be dealing with zero offset
anyway, since otherwise they would want the struct to be walked.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-10-memxor@gmail.com

bpf: Teach verifier about kptr_get kfunc helpers

We introduce a new style of kfunc helpers, namely *_kptr_get, where they
take pointer to the map value which points to a referenced kernel
pointer contained in the map. Since this is referenced, only
bpf_kptr_xchg from BPF side and xchg from kernel side is allowed to
change the current value, and each pointer that resides in that location
would be referenced, and RCU protected (this must be kept in mind while
adding kernel types embeddable as reference kptr in BPF maps).

This means that if do the load of the pointer value in an RCU read
section, and find a live pointer, then as long as we hold RCU read lock,
it won't be freed by a parallel xchg + release operation. This allows us
to implement a safe refcount increment scheme. Hence, enforce that first
argument of all such kfunc is a proper PTR_TO_MAP_VALUE pointing at the
right offset to referenced pointer.

For the rest of the arguments, they are subjected to typical kfunc
argument checks, hence allowing some flexibility in passing more intent
into how the reference should be taken.

For instance, in case of struct nf_conn, it is not freed until RCU grace
period ends, but can still be reused for another tuple once refcount has
dropped to zero. Hence, a bpf_ct_kptr_get helper not only needs to call
refcount_inc_not_zero, but also do a tuple match after incrementing the
reference, and when it fails to match it, put the reference again and
return NULL.

This can be implemented easily if we allow passing additional parameters
to the bpf_ct_kptr_get kfunc, like a struct bpf_sock_tuple * and a
tuple__sz pair.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-9-memxor@gmail.com

bpf: Wire up freeing of referenced kptr

A destructor kfunc can be defined as void func(type *), where type may
be void or any other pointer type as per convenience.

In this patch, we ensure that the type is sane and capture the function
pointer into off_desc of ptr_off_tab for the specific pointer offset,
with the invariant that the dtor pointer is always set when 'kptr_ref'
tag is applied to the pointer's pointee type, which is indicated by the
flag BPF_MAP_VALUE_OFF_F_REF.

Note that only BTF IDs whose destructor kfunc is registered, thus become
the allowed BTF IDs for embedding as referenced kptr. Hence it serves
the purpose of finding dtor kfunc BTF ID, as well acting as a check
against the whitelist of allowed BTF IDs for this purpose.

Finally, wire up the actual freeing of the referenced pointer if any at
all available offsets, so that no references are leaked after the BPF
map goes away and the BPF program previously moved the ownership a
referenced pointer into it.

The behavior is similar to BPF timers, where bpf_map_{update,delete}_elem
will free any existing referenced kptr. The same case is with LRU map's
bpf_lru_push_free/htab_lru_push_free functions, which are extended to
reset unreferenced and free referenced kptr.

Note that unlike BPF timers, kptr is not reset or freed when map uref
drops to zero.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-8-memxor@gmail.com

bpf: Populate pairs of btf_id and destructor kfunc in btf

To support storing referenced PTR_TO_BTF_ID in maps, we require
associating a specific BTF ID with a 'destructor' kfunc. This is because
we need to release a live referenced pointer at a certain offset in map
value from the map destruction path, otherwise we end up leaking
resources.

Hence, introduce support for passing an array of btf_id, kfunc_btf_id
pairs that denote a BTF ID and its associated release function. Then,
add an accessor 'btf_find_dtor_kfunc' which can be used to look up the
destructor kfunc of a certain BTF ID. If found, we can use it to free
the object from the map free path.

The registration of these pairs also serve as a whitelist of structures
which are allowed as referenced PTR_TO_BTF_ID in a BPF map, because
without finding the destructor kfunc, we will bail and return an error.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-7-memxor@gmail.com

bpf: Adapt copy_map_value for multiple offset case

Since now there might be at most 10 offsets that need handling in
copy_map_value, the manual shuffling and special case is no longer going
to work. Hence, let's generalise the copy_map_value function by using
a sorted array of offsets to skip regions that must be avoided while
copying into and out of a map value.

When the map is created, we populate the offset array in struct map,
Then, copy_map_value uses this sorted offset array is used to memcpy
while skipping timer, spin lock, and kptr. The array is allocated as
in most cases none of these special fields would be present in map
value, hence we can save on space for the common case by not embedding
the entire object inside bpf_map struct.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-6-memxor@gmail.com

bpf: Prevent escaping of kptr loaded from maps

While we can guarantee that even for unreferenced kptr, the object
pointer points to being freed etc. can be handled by the verifier's
exception handling (normal load patching to PROBE_MEM loads), we still
cannot allow the user to pass these pointers to BPF helpers and kfunc,
because the same exception handling won't be done for accesses inside
the kernel. The same is true if a referenced pointer is loaded using
normal load instruction. Since the reference is not guaranteed to be
held while the pointer is used, it must be marked as untrusted.

Hence introduce a new type flag, PTR_UNTRUSTED, which is used to mark
all registers loading unreferenced and referenced kptr from BPF maps,
and ensure they can never escape the BPF program and into the kernel by
way of calling stable/unstable helpers.

In check_ptr_to_btf_access, the !type_may_be_null check to reject type
flags is still correct, as apart from PTR_MAYBE_NULL, only MEM_USER,
MEM_PERCPU, and PTR_UNTRUSTED may be set for PTR_TO_BTF_ID. The first
two are checked inside the function and rejected using a proper error
message, but we still want to allow dereference of untrusted case.

Also, we make sure to inherit PTR_UNTRUSTED when chain of pointers are
walked, so that this flag is never dropped once it has been set on a
PTR_TO_BTF_ID (i.e. trusted to untrusted transition can only be in one
direction).

In convert_ctx_accesses, extend the switch case to consider untrusted
PTR_TO_BTF_ID in addition to normal PTR_TO_BTF_ID for PROBE_MEM
conversion for BPF_LDX.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-5-memxor@gmail.com

bpf: Allow storing referenced kptr in map

Extending the code in previous commits, introduce referenced kptr
support, which needs to be tagged using 'kptr_ref' tag instead. Unlike
unreferenced kptr, referenced kptr have a lot more restrictions. In
addition to the type matching, only a newly introduced bpf_kptr_xchg
helper is allowed to modify the map value at that offset. This transfers
the referenced pointer being stored into the map, releasing the
references state for the program, and returning the old value and
creating new reference state for the returned pointer.

Similar to unreferenced pointer case, return value for this case will
also be PTR_TO_BTF_ID_OR_NULL. The reference for the returned pointer
must either be eventually released by calling the corresponding release
function, otherwise it must be transferred into another map.

It is also allowed to call bpf_kptr_xchg with a NULL pointer, to clear
the value, and obtain the old value if any.

BPF_LDX, BPF_STX, and BPF_ST cannot access referenced kptr. A future
commit will permit using BPF_LDX for such pointers, but attempt at
making it safe, since the lifetime of object won't be guaranteed.

There are valid reasons to enforce the restriction of permitting only
bpf_kptr_xchg to operate on referenced kptr. The pointer value must be
consistent in face of concurrent modification, and any prior values
contained in the map must also be released before a new one is moved
into the map. To ensure proper transfer of this ownership, bpf_kptr_xchg
returns the old value, which the verifier would require the user to
either free or move into another map, and releases the reference held
for the pointer being moved in.

In the future, direct BPF_XCHG instruction may also be permitted to work
like bpf_kptr_xchg helper.

Note that process_kptr_func doesn't have to call
check_helper_mem_access, since we already disallow rdonly/wronly flags
for map, which is what check_map_access_type checks, and we already
ensure the PTR_TO_MAP_VALUE refers to kptr by obtaining its off_desc,
so check_map_access is also not required.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-4-memxor@gmail.com

bpf: Tag argument to be released in bpf_func_proto

Add a new type flag for bpf_arg_type that when set tells verifier that
for a release function, that argument's register will be the one for
which meta.ref_obj_id will be set, and which will then be released
using release_reference. To capture the regno, introduce a new field
release_regno in bpf_call_arg_meta.

This would be required in the next patch, where we may either pass NULL
or a refcounted pointer as an argument to the release function
bpf_kptr_xchg. Just releasing only when meta.ref_obj_id is set is not
enough, as there is a case where the type of argument needed matches,
but the ref_obj_id is set to 0. Hence, we must enforce that whenever
meta.ref_obj_id is zero, the register that is to be released can only
be NULL for a release function.

Since we now indicate whether an argument is to be released in
bpf_func_proto itself, is_release_function helper has lost its utitlity,
hence refactor code to work without it, and just rely on
meta.release_regno to know when to release state for a ref_obj_id.
Still, the restriction of one release argument and only one ref_obj_id
passed to BPF helper or kfunc remains. This may be lifted in the future.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-3-memxor@gmail.com

bpf: Allow storing unreferenced kptr in map

This commit introduces a new pointer type 'kptr' which can be embedded
in a map value to hold a PTR_TO_BTF_ID stored by a BPF program during
its invocation. When storing such a kptr, BPF program's PTR_TO_BTF_ID
register must have the same type as in the map value's BTF, and loading
a kptr marks the destination register as PTR_TO_BTF_ID with the correct
kernel BTF and BTF ID.

Such kptr are unreferenced, i.e. by the time another invocation of the
BPF program loads this pointer, the object which the pointer points to
may not longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
patched to PROBE_MEM loads by the verifier, it would safe to allow user
to still access such invalid pointer, but passing such pointers into
BPF helpers and kfuncs should not be permitted. A future patch in this
series will close this gap.

The flexibility offered by allowing programs to dereference such invalid
pointers while being safe at runtime frees the verifier from doing
complex lifetime tracking. As long as the user may ensure that the
object remains valid, it can ensure data read by it from the kernel
object is valid.

The user indicates that a certain pointer must be treated as kptr
capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using
a BTF type tag 'kptr' on the pointed to type of the pointer. Then, this
information is recorded in the object BTF which will be passed into the
kernel by way of map's BTF information. The name and kind from the map
value BTF is used to look up the in-kernel type, and the actual BTF and
BTF ID is recorded in the map struct in a new kptr_off_tab member. For
now, only storing pointers to structs is permitted.

An example of this specification is shown below:

#define __kptr __attribute__((btf_type_tag("kptr")))

struct map_value {
...
struct task_struct __kptr *task;
...
};

Then, in a BPF program, user may store PTR_TO_BTF_ID with the type
task_struct into the map, and then load it later.

Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL, as
the verifier cannot know whether the value is NULL or not statically, it
must treat all potential loads at that map value offset as loading a
possibly NULL pointer.

Only BPF_LDX, BPF_STX, and BPF_ST (with insn->imm = 0 to denote NULL)
are allowed instructions that can access such a pointer. On BPF_LDX, the
destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX,
it is checked whether the source register type is a PTR_TO_BTF_ID with
same BTF type as specified in the map BTF. The access size must always
be BPF_DW.

For the map in map support, the kptr_off_tab for outer map is copied
from the inner map's kptr_off_tab. It was chosen to do a deep copy
instead of introducing a refcount to kptr_off_tab, because the copy only
needs to be done when paramterizing using inner_map_fd in the map in map
case, hence would be unnecessary for all other users.

It is not permitted to use MAP_FREEZE command and mmap for BPF map
having kptrs, similar to the bpf_timer case. A kptr also requires that
BPF program has both read and write access to the map (hence both
BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG are disallowed).

Note that check_map_access must be called from both
check_helper_mem_access and for the BPF instructions, hence the kptr
check must distinguish between ACCESS_DIRECT and ACCESS_HELPER, and
reject ACCESS_HELPER cases. We rename stack_access_src to bpf_access_src
and reuse it for this purpose.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-2-memxor@gmail.com

bpf: Use bpf_prog_run_array_cg_flags everywhere

Rename bpf_prog_run_array_cg_flags to bpf_prog_run_array_cg and
use it everywhere. check_return_code already enforces sane
return ranges for all cgroup types. (only egress and bind hooks have
uncanonical return ranges, the rest is using [0, 1])

No functional changes.

v2:
- 'func_ret & 1' under explicit test (Andrii & Martin)

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220425220448.3669032-1-sdf@google.com

bpftool, musl compat: Replace sys/fcntl.h by fcntl.h

musl does not like including sys/fcntl.h directly:

    [...]
    1 | #warning redirecting incorrect #include <sys/fcntl.h> to <fcntl.h>
    [...]

Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20220424051022.2619648-5-asmadeus@codewreck.org

bpftool, musl compat: Replace nftw with FTW_ACTIONRETVAL

musl nftw implementation does not support FTW_ACTIONRETVAL. There have been
multiple attempts at pushing the feature in musl upstream, but it has been
refused or ignored all the times:

https://www.openwall.com/lists/musl/2021/03/26/1
https://www.openwall.com/lists/musl/2022/01/22/1

In this case we only care about /proc/<pid>/fd/<fd>, so it's not too difficult
to reimplement directly instead, and the new implementation makes 'bpftool perf'
slightly faster because it doesn't needlessly stat/readdir unneeded directories
(54ms -> 13ms on my machine).

Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20220424051022.2619648-4-asmadeus@codewreck.org

wwan_hwsim: Avoid flush_scheduled_work() usage

Flushing system-wide workqueues is dangerous and will be forbidden.
Replace system_wq with local wwan_wq.

While we are at it, make err_clean_devs: label of wwan_hwsim_init()
behave like wwan_hwsim_exit(), for it is theoretically possible to call
wwan_hwsim_debugfs_devcreate_write()/wwan_hwsim_debugfs_devdestroy_write()
by the moment wwan_hwsim_init_devs() returns.

Link: https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Reviewed-by: Loic Poulain <loic.poulain@linaro.org>
Link: https://lore.kernel.org/r/7390d51f-60e2-3cee-5277-b819a55ceabe@I-love.SAKURA.ne.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

libbpf: Remove unnecessary type cast

The link variable is already of type 'struct bpf_link *', casting it to
'struct bpf_link *' is redundant, drop it.

Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220424143420.457082-1-ytcoode@gmail.com

net: dsa: remove unused headers

Reduce a number of included headers to a necessary minimum.

Signed-off-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

arp: fix unused variable warnning when CONFIG_PROC_FS=n

net/ipv4/arp.c:1412:36: warning: unused variable 'arp_seq_ops' [-Wunused-const-variable]

Add #ifdef CONFIG_PROC_FS for 'arp_seq_ops'.

Fixes: e968b1b3e9b8 ("arp: Remove #ifdef CONFIG_PROC_FS")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: compute proper aggregation limit

The aggregation byte limit for an endpoint is currently computed
based on the endpoint's receive buffer size.

However, some bytes at the front of each receive buffer are reserved
on the assumption that--as with SKBs--it might be useful to insert
data (such as headers) before what lands in the buffer.

The aggregation byte limit currently doesn't take into account that
reserved space, and as a result, aggregation could require space
past that which is available in the buffer.

Fix this by reducing the size used to compute the aggregation byte
limit by the NET_SKB_PAD offset reserved for each receive buffer.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: mtk_eth_soc: add check for allocation failure

Check if the kzalloc() failed.

Fixes: 804775dfc288 ("net: ethernet: mtk_eth_soc: add support for Wireless Ethernet Dispatch (WED)")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethernet: broadcom/sb1250-mac: remove BUG_ON in sbmac_probe()

Replace the BUG_ON() with returning error code to handle
the fault more gracefully.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: Remove useless code

payload only memset but no use at all, so we drop them.

Signed-off-by: Haowen Bai <baihaowen@meizu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-line-card-model'

Ido Schimmel says:

====================
mlxsw: extend line card model by devices and info

Jiri says:

This patchset is extending the line card model by three items:
1) line card devices
2) line card info
3) line card device info

First three patches are introducing the necessary changes in devlink
core.

Then, all three extensions are implemented in mlxsw alongside with
selftest.

Examples:
$ devlink lc show pci/0000:01:00.0 lc 8
pci/0000:01:00.0:
  lc 8 state active type 16x100G
    supported_types:
      16x100G
    devices:
      device 0
      device 1
      device 2
      device 3
$ devlink lc info pci/0000:01:00.0 lc 8
pci/0000:01:00.0:
  lc 8
    versions:
        fixed:
          hw.revision 0
        running:
          ini.version 4
    devices:
      device 0
        versions:
            running:
              fw 19.2010.1310
      device 1
        versions:
            running:
              fw 19.2010.1310
      device 2
        versions:
            running:
              fw 19.2010.1310
      device 3
        versions:
            running:
              fw 19.2010.1310

Note that device FW flashing is going to be implemented in the follow-up
patchset.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: Check device info on activated line card

Once line card is activated, check the device FW version is exposed.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: core_linecards: Expose device FW version over device info

Extend MDDQ to obtain FW version of line card device and implement
device_info_get() op to fill up the info with that.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: reg: Extend MDDQ device_info by FW version fields

Add FW version fields to MDDQ device_info.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: Check line card info on provisioned line card

Once line card is provisioned, check if HW revision and INI version
are exposed.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: core_linecards: Expose HW revision and INI version

Implement info_get() to expose HW revision of a linecard and loaded INI
version.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: Check devices on provisioned line card

Once line card is provisioned, check the count of devices on it and
print them out.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: core_linecards: Probe provisioned line cards for devices and attach them

In case the line card is provisioned, go over all possible existing
devices (gearboxes) on it and attach them, so devlink core is aware of
them.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: reg: Extend MDDQ by device_info

Extend existing MDDQ register by possibility to query information about
devices residing on a line card.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: introduce line card device info infrastructure

Extend the line card info message with information (e.g., FW version)
about devices found on the line card.

Example:
$ devlink lc info pci/0000:01:00.0 lc 8
pci/0000:01:00.0:
  lc 8
    versions:
        fixed:
          hw.revision 0
        running:
          ini.version 4
    devices:
      device 0
        versions:
            running:
              fw 19.2010.1310
      device 1
        versions:
            running:
              fw 19.2010.1310
      device 2
        versions:
            running:
              fw 19.2010.1310
      device 3
        versions:
            running:
              fw 19.2010.1310

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: introduce line card info get message

Allow the driver to provide per line card info get op to fill-up info,
similar to the "devlink dev info".

Example:

$ devlink lc info pci/0000:01:00.0 lc 8
pci/0000:01:00.0:
  lc 8
    versions:
        fixed:
          hw.revision 0
        running:
          ini.version 4

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: introduce line card devices support

Line card can contain one or more devices that makes sense to make
visible to the user. For example, this can be a gearbox with
flash memory, which could be updated.

Provide the driver possibility to attach such devices to a line card
and expose those to user.

Example:
$ devlink lc show pci/0000:01:00.0 lc 8
pci/0000:01:00.0:
  lc 8 state active type 16x100G
    supported_types:
      16x100G
    devices:
      device 0
      device 1
      device 2
      device 3

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dsa-selftests'

Vladimir Oltean says:

====================
DSA selftests

When working on complex new features or reworks it becomes increasingly
difficult to ensure there aren't regressions being introduced, and
therefore it would be nice if we could go over the functionality we
already have and write some tests for it.

Verbally I know from Tobias Waldekranz that he has been working on some
selftests for DSA, yet I have never seen them, so here I am adding some
tests I have written which have been useful for me. The list is by no
means complete (it only covers elementary functionality), but it's still
good to have as a starting point. I also borrowed some refactoring
changes from Joachim Wiberg that he submitted for his "net: bridge:
forwarding of unknown IPv4/IPv6/MAC BUM traffic" series, but not the
entirety of his selftests. I now think that his selftests have some
overlap with bridge_vlan_unaware.sh and bridge_vlan_aware.sh and they
should be more tightly integrated with each other - yet I didn't do that
either :). Another issue I had with his selftests was that they jumped
straight ahead to configure brport flags on br0 (a radical new idea
still at RFC status) while we have bigger problems, and we don't have
nearly enough coverage for the *existing* functionality.

One idea introduced here which I haven't seen before is the symlinking
of relevant forwarding selftests to the selftests/drivers/net/<my-driver>/
folder, plus a forwarding.config file. I think there's some value in
having things structured this way, since the forwarding dir has so many
selftests that aren't relevant to DSA that it is a bit difficult to find
the ones that are.

While searching for applications that I could use for multicast testing
(not my domain of interest/knowledge really), I found Joachim Wiberg's
mtools, mcjoin and omping, and I tried them all with various degrees of
success. In particular, I was going to use mcjoin, but I faced some
issues getting IPv6 multicast traffic to work in a VRF, and I bothered
David Ahern about it here:
https://lore.kernel.org/netdev/97eaffb8-2125-834e-641f-c99c097b6ee2@gmail.com/t/
It seems that the problem is that this application should use
SO_BINDTODEVICE, yet it doesn't.

So I ended up patching the bare-bones mtools (msend, mreceive) forked by
Joachim from the University of Virginia's Multimedia Networks Group to
include IPv6 support, and to use SO_BINDTODEVICE. This is what I'm using
now for IPv6.

Note that mausezahn doesn't appear to do a particularly good job of
supporting IPv6 really, and I needed a program to emit the actual
IP_ADD_MEMBERSHIP calls, for dev_mc_add(), so I could test RX filtering.
Crafting the IGMP/MLD reports by hand doesn't really do the trick.
While extremely bare-bones, the mreceive application now seems to do
what I need it to.

Feedback appreciated, it is very likely that I could have done things in
a better way.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: drivers: dsa: add a subset of forwarding selftests

This adds an initial subset of forwarding selftests which I considered
to be relevant for DSA drivers, along with a forwarding.config that
makes it easier to run them (disables veth pair creation, makes sure MAC
addresses are unique and stable).

The intention is to request driver writers to run these selftests during
review and make sure that the tests pass, or at least that the problems
are known.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: add a test for local_termination.sh

This tests the capability of switch ports to filter out undesired
traffic. Different drivers are expected to have different capabilities
here (so some may fail and some may pass), yet the test still has some
value, for example to check for regressions.

There are 2 kinds of failures, one is when a packet which should have
been accepted isn't (and that should be fixed), and the other "failure"
(as reported by the test) is when a packet could have been filtered out
(for being unnecessary) yet it was received.

The bridge driver fares particularly badly at this test:

TEST: br0: Unicast IPv4 to primary MAC address                      [ OK ]
TEST: br0: Unicast IPv4 to macvlan MAC address                      [ OK ]
TEST: br0: Unicast IPv4 to unknown MAC address                      [FAIL]
        reception succeeded, but should have failed
TEST: br0: Unicast IPv4 to unknown MAC address, promisc             [ OK ]
TEST: br0: Unicast IPv4 to unknown MAC address, allmulti            [FAIL]
        reception succeeded, but should have failed
TEST: br0: Multicast IPv4 to joined group                           [ OK ]
TEST: br0: Multicast IPv4 to unknown group                          [FAIL]
        reception succeeded, but should have failed
TEST: br0: Multicast IPv4 to unknown group, promisc                 [ OK ]
TEST: br0: Multicast IPv4 to unknown group, allmulti                [ OK ]
TEST: br0: Multicast IPv6 to joined group                           [ OK ]
TEST: br0: Multicast IPv6 to unknown group                          [FAIL]
        reception succeeded, but should have failed
TEST: br0: Multicast IPv6 to unknown group, promisc                 [ OK ]
TEST: br0: Multicast IPv6 to unknown group, allmulti                [ OK ]

mainly because it does not implement IFF_UNICAST_FLT. Yet I still think
having the test (with the failures) is useful in case somebody wants to
tackle that problem in the future, to make an easy before-and-after
comparison.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: add a no_forwarding.sh test

Bombard a standalone switch port with various kinds of traffic to ensure
it is really standalone and doesn't leak packets to other switch ports.
Also check for switch ports in different bridges, and switch ports in a
VLAN-aware bridge but having different pvids. No forwarding should take
place in either case.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: add helper for retrieving IPv6 link-local address of interface

Pinging an IPv6 link-local multicast address selects the link-local
unicast address of the interface as source, and we'd like to monitor for
that in tcpdump.

Add a helper to the forwarding library which retrieves the link-local
IPv6 address of an interface, to make that task easier.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: add helpers for IP multicast group joins/leaves

Extend the forwarding library with calls to some small C programs which
join an IP multicast group and send some packets to it. Both IPv4 and
IPv6 groups are supported. Use cases range from testing IGMP/MLD
snooping, to RX filtering, to multicast routing.

Testing multicast traffic using msend/mreceive is intended to be done
using tcpdump.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: multiple instances in tcpdump helper

Extend tcpdump_start() & C:o to handle multiple instances. Useful when
observing bridge operation, e.g., unicast learning/flooding, and any
case of multicast distribution (to these ports but not that one ...).

This means the interface argument is now a mandatory argument to all
tcpdump_*() functions, hence the changes to the ocelot flower test.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: add TCPDUMP_EXTRA_FLAGS to lib.sh

For some use-cases we may want to change the tcpdump flags used in
tcpdump_start(). For instance, observing interfaces without the PROMISC
flag, e.g. to see what's really being forwarded to the bridge interface.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: add option to run tests with stable MAC addresses

By default, DSA switch ports inherit their MAC address from the DSA
master.

This works well for practical situations, but some selftests like
bridge_vlan_unaware.sh loop back 2 standalone DSA ports with 2 bridged
DSA ports, and require the bridge to forward packets between the
standalone ports.

Due to the bridge seeing that the MAC DA it needs to forward is present
as a local FDB entry (it coincides with the MAC address of the bridge
ports), the test packets are not forwarded, but terminated locally on
br0. In turn, this makes the ping and ping6 tests fail.

Address this by introducing an option to have stable MAC addresses.
When mac_addr_prepare is called, the current addresses of the netifs are
saved and replaced with 00:01:02:03:04:${netif number}. Then when
mac_addr_restore is called at the end of the test, the original MAC
addresses are restored. This ensures that the MAC addresses are unique,
which makes the test pass even for DSA ports.

The usage model is for the behavior to be opt-in via STABLE_MAC_ADDRS,
which DSA should set to true, all others behave as before. By hooking
the calls to mac_addr_prepare and mac_addr_restore within the forwarding
lib itself, we do not need to patch each individual selftest, the only
requirement is that pre_cleanup is called.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mptcp-tcp-fallback'

Mat Martineau says:

====================
mptcp: TCP fallback for established connections

RFC 8684 allows some MPTCP connections to fall back to regular TCP when
the MPTCP DSS checksum detects middlebox interference, there is only a
single subflow, and there is no unacknowledged out-of-sequence
data. When this condition is detected, the stack sends a MPTCP DSS
option with an "infinite mapping" to signal that a fallback is
happening, and the peers will stop sending MPTCP options in their TCP
headers. The Linux MPTCP stack has not yet supported this type of
fallback, instead closing the connection when the MPTCP checksum fails.

This series adds support for fallback to regular TCP in a more limited
scenario, for only MPTCP connections that have never connected
additional subflows or transmitted out-of-sequence data. The selftests
are also updated to check new MIBs that track infinite mappings.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: add infinite map mibs check

This patch adds a function chk_infi_nr() to check the mibs for the
infinite mapping. Invoke it in chk_join_nr() when validate_checksum
is set.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: dump infinite_map field in mptcp_dump_mpext

In trace event class mptcp_dump_mpext, dump the newly added infinite_map
field of struct mptcp_dump_mpext too.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: add mib for infinite map sending

This patch adds a new mib named MPTCP_MIB_INFINITEMAPTX, increase it
when a infinite mapping has been sent out.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: infinite mapping receiving

This patch adds the infinite mapping receiving logic. When the infinite
mapping is received, set the map_data_len of the subflow to 0.

In subflow_check_data_avail(), only reset the subflow when the map_data_len
of the subflow is non-zero.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: infinite mapping sending

This patch adds the infinite mapping sending logic.

Add a new flag send_infinite_map in struct mptcp_subflow_context. Set
it true when a single contiguous subflow is in use and the
allow_infinite_fallback flag is true in mptcp_pm_mp_fail_received().

In mptcp_sendmsg_frag(), if this flag is true, call the new function
mptcp_update_infinite_map() to set the infinite mapping.

Add a new flag infinite_map in struct mptcp_ext, set it true in
mptcp_update_infinite_map(), and check this flag in a new helper
mptcp_check_infinite_map().

In mptcp_update_infinite_map(), set data_len to 0, and clear the
send_infinite_map flag, then do fallback.

In mptcp_established_options(), use the helper mptcp_check_infinite_map()
to let the infinite mapping DSS can be sent out in the fallback mode.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: track and update contiguous data status

This patch adds a new member allow_infinite_fallback in mptcp_sock,
which is initialized to 'true' when the connection begins and is set
to 'false' on any retransmit or successful MP_JOIN. Only do infinite
mapping fallback if there is a single subflow AND there have been no
retransmissions AND there have never been any MP_JOINs.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>