review.tizen.org Git - platform/kernel/linux-rpi.git/log

tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH.

Commit
   9652dc2eb9e40 ("tcp: relax listening_hash operations")

removed the need to disable bottom half while acquiring
listening_hash.lock. There are still two callers left which disable
bottom half before the lock is acquired.

On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts
as a lock to ensure that resources, that are protected by disabling
bottom halves, remain protected.
This leads to a circular locking dependency if the lock acquired with
disabled bottom halves is also acquired with enabled bottom halves
followed by disabling bottom halves. This is the reverse locking order.
It has been observed with inet_listen_hashbucket::lock:

local_bh_disable() + spin_lock(&ilb->lock):
  inet_listen()
    inet_csk_listen_start()
      sk->sk_prot->hash() := inet_hash()
local_bh_disable()
__inet_hash()
  spin_lock(&ilb->lock);
    acquire(&ilb->lock);

Reverse order: spin_lock(&ilb2->lock) + local_bh_disable():
  tcp_seq_next()
    listening_get_next()
      spin_lock(&ilb2->lock);
acquire(&ilb2->lock);

  tcp4_seq_show()
    get_tcp4_sock()
      sock_i_ino()
read_lock_bh(&sk->sk_callback_lock);
  acquire(softirq_ctrl) // <---- whoops
  acquire(&sk->sk_callback_lock)

Drop local_bh_disable() around __inet_hash() which acquires
listening_hash->lock. Split inet_unhash() and acquire the
listen_hashbucket lock without disabling bottom halves; the inet_ehash
lock with disabled bottom halves.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/12d6f9879a97cd56c09fb53dee343cbb14f7f1f7.camel@gmx.de
Link: https://lkml.kernel.org/r/X9CheYjuXWc75Spa@hirez.programming.kicks-ass.net
Link: https://lore.kernel.org/r/YgQOebeZ10eNx1W6@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge https://git./linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2022-02-09

We've added 126 non-merge commits during the last 16 day(s) which contain
a total of 201 files changed, 4049 insertions(+), 2215 deletions(-).

The main changes are:

1) Add custom BPF allocator for JITs that pack multiple programs into a huge
   page to reduce iTLB pressure, from Song Liu.

2) Add __user tagging support in vmlinux BTF and utilize it from BPF
   verifier when generating loads, from Yonghong Song.

3) Add per-socket fast path check guarding from cgroup/BPF overhead when
   used by only some sockets, from Pavel Begunkov.

4) Continued libbpf deprecation work of APIs/features and removal of their
   usage from samples, selftests, libbpf & bpftool, from Andrii Nakryiko
   and various others.

5) Improve BPF instruction set documentation by adding byte swap
   instructions and cleaning up load/store section, from Christoph Hellwig.

6) Switch BPF preload infra to light skeleton and remove libbpf dependency
   from it, from Alexei Starovoitov.

7) Fix architecture-agnostic macros in libbpf for accessing syscall
   arguments from BPF progs for non-x86 architectures,
   from Ilya Leoshkevich.

8) Rework port members in struct bpf_sk_lookup and struct bpf_sock to be
   of 16-bit field with anonymous zero padding, from Jakub Sitnicki.

9) Add new bpf_copy_from_user_task() helper to read memory from a different
   task than current. Add ability to create sleepable BPF iterator progs,
   from Kenny Yu.

10) Implement XSK batching for ice's zero-copy driver used by AF_XDP and
    utilize TX batching API from XSK buffer pool, from Maciej Fijalkowski.

11) Generate temporary netns names for BPF selftests to avoid naming
    collisions, from Hangbin Liu.

12) Implement bpf_core_types_are_compat() with limited recursion for
    in-kernel usage, from Matteo Croce.

13) Simplify pahole version detection and finally enable CONFIG_DEBUG_INFO_DWARF5
    to be selected with CONFIG_DEBUG_INFO_BTF, from Nathan Chancellor.

14) Misc minor fixes to libbpf and selftests from various folks.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (126 commits)
  selftests/bpf: Cover 4-byte load from remote_port in bpf_sk_lookup
  bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide
  libbpf: Fix compilation warning due to mismatched printf format
  selftests/bpf: Test BPF_KPROBE_SYSCALL macro
  libbpf: Add BPF_KPROBE_SYSCALL macro
  libbpf: Fix accessing the first syscall argument on s390
  libbpf: Fix accessing the first syscall argument on arm64
  libbpf: Allow overriding PT_REGS_PARM1{_CORE}_SYSCALL
  selftests/bpf: Skip test_bpf_syscall_macro's syscall_arg1 on arm64 and s390
  libbpf: Fix accessing syscall arguments on riscv
  libbpf: Fix riscv register names
  libbpf: Fix accessing syscall arguments on powerpc
  selftests/bpf: Use PT_REGS_SYSCALL_REGS in bpf_syscall_macro
  libbpf: Add PT_REGS_SYSCALL_REGS macro
  selftests/bpf: Fix an endianness issue in bpf_syscall_macro test
  bpf: Fix bpf_prog_pack build HPAGE_PMD_SIZE
  bpf: Fix leftover header->pages in sparc and powerpc code.
  libbpf: Fix signedness bug in btf_dump_array_data()
  selftests/bpf: Do not export subtest as standalone test
  bpf, x86_64: Fail gracefully on bpf_jit_binary_pack_finalize failures
  ...
====================

Link: https://lore.kernel.org/r/20220209210050.8425-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: drop_monitor: support drop reason

In the commit c504e5c2f964 ("net: skb: introduce kfree_skb_reason()")
drop reason is introduced to the tracepoint of kfree_skb. Therefore,
drop_monitor is able to report the drop reason to users by netlink.

The drop reasons are reported as string to users, which is exactly
the same as what we do when reporting it to ftrace.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220209060838.55513-1-imagedong@tencent.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'Split bpf_sk_lookup remote_port field'

Jakub Sitnicki says:

====================

Following the recent split-up of the bpf_sock dst_port field, apply the same to
technique to the bpf_sk_lookup remote_port field to make uAPI more user
friendly.

v1 -> v2:
- Remove remote_port range check and cast to be16 in TEST_RUN for sk_lookup
(kernel test robot)
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Cover 4-byte load from remote_port in bpf_sk_lookup

Extend the context access tests for sk_lookup prog to cover the surprising
case of a 4-byte load from the remote_port field, where the expected value
is actually shifted by 16 bits.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220209184333.654927-3-jakub@cloudflare.com

bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide

remote_port is another case of a BPF context field documented as a 32-bit
value in network byte order for which the BPF context access converter
generates a load of a zero-padded 16-bit integer in network byte order.

First such case was dst_port in bpf_sock which got addressed in commit
4421a582718a ("bpf: Make dst_port field in struct bpf_sock 16-bit wide").

Loading 4-bytes from the remote_port offset and converting the value with
bpf_ntohl() leads to surprising results, as the expected value is shifted
by 16 bits.

Reduce the confusion by splitting the field in two - a 16-bit field holding
a big-endian integer, and a 16-bit zero-padding anonymous field that
follows it.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220209184333.654927-2-jakub@cloudflare.com

libbpf: Fix compilation warning due to mismatched printf format

On ppc64le architecture __s64 is long int and requires %ld. Cast to
ssize_t and use %zd to avoid architecture-specific specifiers.

Fixes: 4172843ed4a3 ("libbpf: Fix signedness bug in btf_dump_array_data()")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220209063909.1268319-1-andrii@kernel.org

net: usb: smsc95xx: add generic selftest support

Provide generic selftest support. Tested with LAN9500 and LAN9512.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: cavium: use div64_u64() instead of do_div()

do_div() does a 64-by-32 division.
When the divisor is u64, do_div() truncates it to 32 bits, this means it
can test non-zero and be truncated to zero for division.

fix do_div.cocci warning:
do_div() does a 64-by-32 division, please consider using div64_u64 instead.

Signed-off-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net:enetc: enetc qos using the CBDR dma alloc function

Now we can use the enetc_cbd_alloc_data_mem() to replace complicated DMA
data alloc method and CBDR memory basic seting.

Signed-off-by: Po Liu <po.liu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net:enetc: command BD ring data memory alloc as one function alone

Separate the CBDR data memory alloc standalone. It is convenient for
other part loading, for example the ENETC QOS part.

Reported-and-suggested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Po Liu <po.liu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net:enetc: allocate CBD ring data memory using DMA coherent methods

To replace the dma_map_single() stream DMA mapping with DMA coherent
method dma_alloc_coherent() which is more simple.

dma_map_single() found by Tim Gardner not proper. Suggested by Claudiu
Manoil and Jakub Kicinski to use dma_alloc_coherent(). Discussion at:

https://lore.kernel.org/netdev/AM9PR04MB8397F300DECD3C44D2EBD07796BD9@AM9PR04MB8397.eurprd04.prod.outlook.com/t/

Fixes: 888ae5a3952ba ("net: enetc: add tc flower psfp offload driver")
cc: Claudiu Manoil <claudiu.manoil@nxp.com>
Reported-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Po Liu <po.liu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dpaa2-eth-sw-TSO'

Ioana Ciornei says:

====================
dpaa2-eth: add support for software TSO

This series adds support for driver level TSO in the dpaa2-eth driver.

The first 5 patches lay the ground work for the actual feature:
rearrange some variable declaration, cleaning up the interraction with
the S/G Table buffer cache etc.

The 6th patch adds the actual driver level software TSO support by using
the usual tso_build_hdr()/tso_build_data() APIs and creates the S/G FDs.

With this patch set we can see the following improvement in a TCP flow
running on a single A72@2.2GHz of the LX2160A SoC:

before: 6.38Gbit/s
after: 8.48Gbit/s
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

soc: fsl: dpio: read the consumer index from the cache inhibited area

Once we added support in the dpaa2-eth for driver level software TSO we
observed the following situation: if the EQCR CI (consumer index) is
read from the cache-enabled area we sometimes end up with a computed
value of available enqueue entries bigger than the size of the ring.

This eventually will lead to the multiple enqueue of the same FD which
will determine the same FD to end up on the Tx confirmation path and the
same skb being freed twice.

Just read the consumer index from the cache inhibited area so that we
avoid this situation.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpaa2-eth: add support for software TSO

This patch adds support for driver level TSO in the enetc driver using
the TSO API.

There is not much to say about this specific implementation. We are
using the usual tso_build_hdr(), tso_build_data() to create each data
segment, we create an array of S/G FDs where the first S/G entry is
referencing the header data and the remaining ones the data portion.

For the S/G Table buffer we use the same cache of buffers used on the
other non-GSO cases - dpaa2_eth_sgt_get() and dpaa2_eth_sgt_recycle().

We cannot keep a DMA coherent buffer for all the TSO headers because the
DPAA2 architecture does not work in a ring based fashion so we just
allocate a buffer each time.

Even with these limitations we get the following improvement in TCP
termination on the LX2160A SoC, on a single A72 core running at 2.2GHz.

before: 6.38Gbit/s
after: 8.48Gbit/s

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpaa2-eth: work with an array of FDs

Up until now, the __dpaa2_eth_tx function used a single FD on the stack
to construct the structure to be enqueued. Since we are now preparing
the ground work to add support for TSO done in software at the driver
level, the same function needs to work with an array of FDs and enqueue
as many as the build_*_fd functions create.

Make the necessary adjustments in order to do this. These include:
keeping an array of FDs in a percpu structure, cleaning up the necessary
FDs before populating it and then, retrying the enqueue process up till
all the generated FDs were enqueued or until we reach the maximum number
retries.

This patch does not change the fact that only a single FD will result
from a __dpaa2_eth_tx call but rather just creates the necessary changes
for the next patch.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpaa2-eth: use the S/G table cache also for the normal S/G path

Instead of allocating memory for an S/G table each time a nonlinear skb
is processed, and then freeing it on the Tx confirmation path, use the
S/G table cache in order to reuse the memory.

For this to work we have to change the size of the cached buffers so
that it can hold the maximum number of scatterlist entries.

Other than that, each allocate/free call is replaced by a call to the
dpaa2_eth_sgt_get/dpaa2_eth_sgt_recycle functions, introduced in the
previous patch.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpaa2-eth: extract the S/G table buffer cache interaction into functions

The dpaa2-eth driver uses in certain circumstances a buffer cache for
the S/G tables needed in case of a S/G FD. At the moment, the
interraction with the cache is open-coded and couldn't be reused easily.

Add two new functions - dpaa2_eth_sgt_get and dpaa2_eth_sgt_recycle -
which help with code reusability.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpaa2-eth: allocate a fragment already aligned

Instead of allocating memory and then manually aligning it to the
desired value use napi_alloc_frag_align() directly to streamline the
process.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpaa2-eth: rearrange variable declaration in __dpaa2_eth_tx

In the next patches we'll be moving things arroung in the mentioned
function and also add some new variable declarations. Before all this,
cleanup the variable declaration order.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'octeontx2-af-priority-flow-control'

Hariprasad Kelam says:

====================
Priority flow control support for RVU netdev

In network congestion, instead of pausing all traffic on link
PFC allows user to selectively pause traffic according to its
class. This series of patches add support of PFC for RVU netdev
drivers.

Patch1 adds support to disable pause frames by default as
with PFC user can enable either PFC or 802.3 pause frames.
Patch2&3 adds resource management support for flow control
and configures necessary registers for PFC.
Patch4 adds dcb ops registration for netdev drivers.

V2 changes:
Fix compilation error by exporting required symbols 'otx2_config_pause_frm'
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

octeontx2-pf: PFC config support with DCBx

Data centric bridging designed to eliminate packet loss due to
queue overflow by adding enhancements to ethernet network such as
proprity flow control etc. This patch adds support for management
of Priority flow control(PFC) on Octeontx2 and CN10K interfaces.

To enable PFC for all priorities
dcb pfc set dev eth0 prio-pfc all:on/off

To enable PFC on selected priorites
dcb pfc set dev eth0 prio-pfc 0:on/off 1:on/off ..7:on/off

With the ntuple commands user can map Priority to receive queues.
On queue overflow NIX will assert backpressure such that PFC pause frames
are genarated with mapped priority.

To map priority 7 to Queue 1
ethtool -U eth0 flow-type ether dst xx:xx:xx:xx:xx:xx vlan 0xe00a
m 0x1fff queue 1

Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

octeontx2-af: Flow control resource management

CN10K MAC block (RPM) and Octeontx2 MAC block (CGX) both supports
PFC flow control and 802.3X flow control pause frames.

Each MAC block supports max 4 LMACS and AF driver assigns same
(MAC,LMAC) to PF and its VFs. As PF and its share same (MAC,LMAC)
pair we need resource management to address below scenarios

1. Maintain PFC and 8023X pause frames mutually exclusive.
2. Reject disable flow control request if other PF or Vfs
enabled it.

Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

octeontx2-af: Priority flow control configuration support

Prirority based flow control (802.1Qbb) mechanism is similar to
ethernet pause frames (802.3x) instead pausing all traffic on a link,
PFC allows user to selectively pause traffic according to its class.

Oceteontx2 MAC block (CGX) and CN10K Mac block (RPM) both supports
PFC. As upper layer mbox handler is same for both the MACs, this
patch configures PFC by calling apporopritate callbacks.

Signed-off-by: Sunil Kumar Kori <skori@marvell.com>
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

octeontx2-af: Don't enable Pause frames by default

Current implementation is such that 802.3x pause frames are
enabled by default. As CGX and RPM blocks support PFC
(priority flow control) also, instead of driver enabling one
between them enable them upon request from PF or its VFs.
Also add support to disable pause frames in driver unbind.

Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'MCTP-tag-control-interface'

Jeremy Kerr says:

====================
MCTP tag control interface

This series implements a small interface for userspace-controlled
message tag allocation for the MCTP protocol. Rather than leaving the
kernel to allocate per-message tag values, userspace can explicitly
allocate (and release) message tags through two new ioctls:
SIOCMCTPALLOCTAG and SIOCMCTPDROPTAG.

In order to do this, we first introduce some minor changes to the tag
handling, including a couple of new tests for the route input paths.

As always, any comments/queries/etc are most welcome.

v2:
- make mctp_lookup_prealloc_tag static
- minor checkpatch formatting fixes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mctp: Add SIOCMCTP{ALLOC,DROP}TAG ioctls for tag control

This change adds a couple of new ioctls for mctp sockets:
SIOCMCTPALLOCTAG and SIOCMCTPDROPTAG. These ioctls provide facilities
for explicit allocation / release of tags, overriding the automatic
allocate-on-send/release-on-reply and timeout behaviours. This allows
userspace more control over messages that may not fit a simple
request/response model.

In order to indicate a pre-allocated tag to the sendmsg() syscall, we
introduce a new flag to the struct sockaddr_mctp.smctp_tag value:
MCTP_TAG_PREALLOC.

Additional changes from Jeremy Kerr <jk@codeconstruct.com.au>.

Contains a fix that was:
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

mctp: Allow keys matching any local address

Currently, we require an exact match on an incoming packet's dest
address, and the key's local_addr field.

In a future change, we may want to set up a key before packets are
routed, meaning we have no local address to match on.

This change allows key lookups to match on local_addr = MCTP_ADDR_ANY.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

mctp: Add helper for address match checking

Currently, we have a couple of paths that check that an EID matches, or
the match value is MCTP_ADDR_ANY.

Rather than open coding this, add a little helper.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

mctp: tests: Add key state tests

This change adds a few more tests to check the key/tag lookups on route
input. We add a specific entry to the keys lists, route a packet with
specific header values, and check for key match/mismatch.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

mctp: tests: Rename FL_T macro to FL_TO

This is a definition for the tag-owner flag, which has TO as a standard
abbreviation. We'll want to add a helper for the actual tag value in a
future change.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '40GbE' of git://git./linux/kernel/git/tnguy/next
-queue

Tony Nguyen says:

====================
40GbE Intel Wired LAN Driver Updates 2022-02-08

Joe Damato says:

This patch set makes several updates to the i40e driver stats collection
and reporting code to help users of i40e get a better sense of how the
driver is performing and interacting with the rest of the kernel.

These patches include some new stats (like waived and busy) which were
inspired by other drivers that track stats using the same nomenclature.

The new stats and an existing stat, rx_reuse, are now accessible with
ethtool to make harvesting this data more convenient for users.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ip6_tunnel: fix possible NULL deref in ip6_tnl_xmit

Make sure to test that skb has a dst attached to it.

general protection fault, probably for non-canonical address 0xdffffc0000000011: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000088-0x000000000000008f]
CPU: 0 PID: 32650 Comm: syz-executor.4 Not tainted 5.17.0-rc2-next-20220204-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:ip6_tnl_xmit+0x2140/0x35f0 net/ipv6/ip6_tunnel.c:1127
Code: 4d 85 f6 0f 85 c5 04 00 00 e8 9c b0 66 f9 48 83 e3 fe 48 b8 00 00 00 00 00 fc ff df 48 8d bb 88 00 00 00 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 07 7f 05 e8 11 25 b2 f9 44 0f b6 b3 88 00 00
RSP: 0018:ffffc900141b7310 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc9000c77a000
RDX: 0000000000000011 RSI: ffffffff8811f854 RDI: 0000000000000088
RBP: ffffc900141b7480 R08: 0000000000000000 R09: 0000000000000008
R10: ffffffff8811f846 R11: 0000000000000008 R12: ffffc900141b7548
R13: ffff8880297c6000 R14: 0000000000000000 R15: ffff8880351c8dc0
FS: 00007f9827ba2700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b31322000 CR3: 0000000033a70000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
ipxip6_tnl_xmit net/ipv6/ip6_tunnel.c:1386 [inline]
ip6_tnl_start_xmit+0x71e/0x1830 net/ipv6/ip6_tunnel.c:1435
__netdev_start_xmit include/linux/netdevice.h:4683 [inline]
netdev_start_xmit include/linux/netdevice.h:4697 [inline]
xmit_one net/core/dev.c:3473 [inline]
dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3489
__dev_queue_xmit+0x2a24/0x3760 net/core/dev.c:4116
packet_snd net/packet/af_packet.c:3057 [inline]
packet_sendmsg+0x2265/0x5460 net/packet/af_packet.c:3084
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:725
sock_write_iter+0x289/0x3c0 net/socket.c:1061
call_write_iter include/linux/fs.h:2075 [inline]
do_iter_readv_writev+0x47a/0x750 fs/read_write.c:726
do_iter_write+0x188/0x710 fs/read_write.c:852
vfs_writev+0x1aa/0x630 fs/read_write.c:925
do_writev+0x27f/0x300 fs/read_write.c:968
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f9828c2d059

Fixes: c1f55c5e0482 ("ip6_tunnel: allow routing IPv4 traffic in NBMA mode")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Qing Deng <i@moy.cat>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Netvsc: Call hv_unmap_memory() in the netvsc_device_remove()

netvsc_device_remove() calls vunmap() inside which should not be
called in the interrupt context. Current code calls hv_unmap_memory()
in the free_netvsc_device() which is rcu callback and maybe called
in the interrupt context. This will trigger BUG_ON(in_interrupt())
in the vunmap(). Fix it via moving hv_unmap_memory() to netvsc_device_
remove().

Fixes: 846da38de0e8 ("net: netvsc: Add Isolation VM support for netvsc driver")
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '1GbE' of git://git./linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
1GbE Intel Wired LAN Driver Updates 2022-02-07

Corinna Vinschen says:

Fix the kernel warning "Missing unregister, handled but fix driver"
when running, e.g.,

$ ethtool -G eth0 rx 1024

on igc. Remove memset hack from igb and align igb code to igc.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: renesas,etheravb: Document RZ/G2UL SoC

Document Gigabit Ethernet IP found on RZ/G2UL SoC. Gigabit Ethernet
Interface is identical to one found on the RZ/G2L SoC. No driver changes
are required as generic compatible string "renesas,rzg2l-gbeth" will be
used as a fallback.

Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: renesas,etheravb: Document RZ/V2L SoC

Document Gigabit Ethernet IP found on RZ/V2L SoC. Gigabit Ethernet
Interface is identical to one found on the RZ/G2L SoC. No driver changes
are required as generic compatible string "renesas,rzg2l-gbeth" will be
used as a fallback.

Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com>
Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Acked-by: Rob Herring <robh@kernel.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'libbpf: Add syscall-specific variant of BPF_KPROBE'

Hengqi Chen says:

====================

Add new macro BPF_KPROBE_SYSCALL, which provides easy access to syscall
input arguments. See [0] and [1] for background.

  [0]: https://github.com/libbpf/libbpf-bootstrap/issues/57
  [1]: https://github.com/libbpf/libbpf/issues/425

v2->v3:
  - Use PT_REGS_SYSCALL_REGS
  - Move selftest to progs/bpf_syscall_macro.c

v1->v2:
  - Use PT_REGS_PARM2_CORE_SYSCALL instead
====================

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

selftests/bpf: Test BPF_KPROBE_SYSCALL macro

Add tests for the newly added BPF_KPROBE_SYSCALL macro.

Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220207143134.2977852-3-hengqi.chen@gmail.com

libbpf: Add BPF_KPROBE_SYSCALL macro

Add syscall-specific variant of BPF_KPROBE named BPF_KPROBE_SYSCALL ([0]).
The new macro hides the underlying way of getting syscall input arguments.
With the new macro, the following code:

    SEC("kprobe/__x64_sys_close")
    int BPF_KPROBE(do_sys_close, struct pt_regs *regs)
    {
        int fd;

        fd = PT_REGS_PARM1_CORE(regs);
        /* do something with fd */
    }

can be written as:

    SEC("kprobe/__x64_sys_close")
    int BPF_KPROBE_SYSCALL(do_sys_close, int fd)
    {
        /* do something with fd */
    }

  [0] Closes: https://github.com/libbpf/libbpf/issues/425

Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220207143134.2977852-2-hengqi.chen@gmail.com

Merge branch 'Fix accessing syscall arguments'

Ilya Leoshkevich says:

====================

libbpf now has macros to access syscall arguments in an
architecture-agnostic manner, but unfortunately they have a number of
issues on non-Intel arches, which this series aims to fix.

v1: https://lore.kernel.org/bpf/20220201234200.1836443-1-iii@linux.ibm.com/
v1 -> v2:
* Put orig_gpr2 in place of args[1] on s390 (Vasily).
* Fix arm64, powerpc and riscv (Heiko).

v2: https://lore.kernel.org/bpf/20220204041955.1958263-1-iii@linux.ibm.com/
v2 -> v3:
* Undo args[1] change (Andrii).
* Rename PT_REGS_SYSCALL to PT_REGS_SYSCALL_REGS (Andrii).
* Split the riscv patch (Andrii).

v3: https://lore.kernel.org/bpf/20220204145018.1983773-1-iii@linux.ibm.com/
v3 -> v4:
* Undo arm64's and s390's user_pt_regs changes.
* Use struct pt_regs when vmlinux.h is available (Andrii).
* Use offsetofend for accessing orig_gpr2 and orig_x0 (Andrii).
* Move libbpf's copy of offsetofend to a new header.
* Fix riscv's __PT_FP_REG.
* Use PT_REGS_SYSCALL_REGS in test_probe_user.c.
* Test bpf_syscall_macro with userspace headers.
* Use Naveen's suggestions and code in patches 5 and 6.
* Add warnings to arm64's and s390's ptrace.h (Andrii).

v4: https://lore.kernel.org/bpf/20220208051635.2160304-1-iii@linux.ibm.com/
v4 -> v5:
* Go back to v3.
* Do not touch arch headers.
* Use CO-RE struct flavors to access orig_x0 and orig_gpr2.
* Fail compilation if non-CO-RE macros are used to access the first
syscall parameter on arm64 and s390.
* Fix accessing frame pointer on riscv.
====================

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

libbpf: Fix accessing the first syscall argument on s390

On s390, the first syscall argument should be accessed via orig_gpr2
(see arch/s390/include/asm/syscall.h). Currently gpr[2] is used
instead, leading to bpf_syscall_macro test failure.

orig_gpr2 cannot be added to user_pt_regs, since its layout is a part
of the ABI. Therefore provide access to it only through
PT_REGS_PARM1_CORE_SYSCALL() by using a struct pt_regs flavor.

Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220209021745.2215452-11-iii@linux.ibm.com

libbpf: Fix accessing the first syscall argument on arm64

On arm64, the first syscall argument should be accessed via orig_x0
(see arch/arm64/include/asm/syscall.h). Currently regs[0] is used
instead, leading to bpf_syscall_macro test failure.

orig_x0 cannot be added to struct user_pt_regs, since its layout is a
part of the ABI. Therefore provide access to it only through
PT_REGS_PARM1_CORE_SYSCALL() by using a struct pt_regs flavor.

Reported-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220209021745.2215452-10-iii@linux.ibm.com

libbpf: Allow overriding PT_REGS_PARM1{_CORE}_SYSCALL

arm64 and s390 need a special way to access the first syscall argument.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220209021745.2215452-9-iii@linux.ibm.com

selftests/bpf: Skip test_bpf_syscall_macro's syscall_arg1 on arm64 and s390

These architectures can provide access to the first syscall argument
only through PT_REGS_PARM1_CORE_SYSCALL().

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220209021745.2215452-8-iii@linux.ibm.com

libbpf: Fix accessing syscall arguments on riscv

riscv does not select ARCH_HAS_SYSCALL_WRAPPER, so its syscall
handlers take "unpacked" syscall arguments. Indicate this to libbpf
using PT_REGS_SYSCALL_REGS macro.

Reported-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220209021745.2215452-7-iii@linux.ibm.com

libbpf: Fix riscv register names

riscv registers are accessed via struct user_regs_struct, not struct
pt_regs. The program counter member in this struct is called pc, not
epc. The frame pointer is called s0, not fp.

Fixes: 3cc31d794097 ("libbpf: Normalize PT_REGS_xxx() macro definitions")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220209021745.2215452-6-iii@linux.ibm.com

libbpf: Fix accessing syscall arguments on powerpc

powerpc does not select ARCH_HAS_SYSCALL_WRAPPER, so its syscall
handlers take "unpacked" syscall arguments. Indicate this to libbpf
using PT_REGS_SYSCALL_REGS macro.

Reported-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Link: https://lore.kernel.org/bpf/20220209021745.2215452-5-iii@linux.ibm.com

selftests/bpf: Use PT_REGS_SYSCALL_REGS in bpf_syscall_macro

Ensure that PT_REGS_SYSCALL_REGS works correctly.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220209021745.2215452-4-iii@linux.ibm.com

libbpf: Add PT_REGS_SYSCALL_REGS macro

Architectures that select ARCH_HAS_SYSCALL_WRAPPER pass a pointer to
struct pt_regs to syscall handlers, others unpack it into individual
function parameters. Introduce a macro to describe what a particular
arch does.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220209021745.2215452-3-iii@linux.ibm.com

selftests/bpf: Fix an endianness issue in bpf_syscall_macro test

bpf_syscall_macro reads a long argument into an int variable, which
produces a wrong value on big-endian systems. Fix by reading the
argument into an intermediate long variable first.

Fixes: 77fc0330dfe5 ("selftests/bpf: Add a test to confirm PT_REGS_PARM4_SYSCALL")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220209021745.2215452-2-iii@linux.ibm.com

net: dsa: typo in comment

Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20220208053210.14831-1-luizluca@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ptp_pch: Remove unused pch_pm_ops

The default values for hooks in the driver.pm are NULLs.
Hence drop unused pch_pm_ops.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20220207210730.75252-6-andriy.shevchenko@linux.intel.com
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ptp_pch: Convert to use managed functions pcim_* and devm_*

This makes the error handling much more simpler than open-coding everything
and in addition makes the probe function smaller an tidier.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20220207210730.75252-5-andriy.shevchenko@linux.intel.com
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ptp_pch: Switch to use module_pci_driver() macro

Eliminate some boilerplate code by using module_pci_driver() instead of
init/exit, and, if needed, moving the salient bits from init into probe.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20220207210730.75252-4-andriy.shevchenko@linux.intel.com
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ptp_pch: Use ioread64_hi_lo() / iowrite64_hi_lo()

There is already helper functions to do 64-bit I/O on 32-bit machines or
buses, thus we don't need to reinvent the wheel.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20220207210730.75252-3-andriy.shevchenko@linux.intel.com
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ptp_pch: Use ioread64_lo_hi() / iowrite64_lo_hi()

There is already helper functions to do 64-bit I/O on 32-bit machines or
buses, thus we don't need to reinvent the wheel.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20220207210730.75252-2-andriy.shevchenko@linux.intel.com
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ptp_pch: use mac_pton()

Use mac_pton() instead of custom approach.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Link: https://lore.kernel.org/r/20220207210730.75252-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-speedup-netns-dismantles'

Eric Dumazet says:

====================
net: speedup netns dismantles

From: Eric Dumazet <edumazet@google.com>

In this series, I made network namespace deletions more scalable,
by 4x on the little benchmark described in this cover letter.

- Remove bottleneck on ipv6 addrconf, by replacing a global
  hash table to a per netns one.

- Rework many (struct pernet_operations)->exit() handlers to
  exit_batch() ones. This removes many rtnl acquisitions,
  and gives to cleanup_net() kind of a priority over rtnl
  ownership.

Tested on a host with 24 cpus (48 HT)

Test script:

for nr in {1..10}
do
  (for i in {1..10000}; do unshare -n /bin/bash -c "ifconfig lo up"; done) &
done
wait

for i in {1..10}
do
  sleep 1
  echo 3 >/proc/sys/vm/drop_caches
  grep net_namespace /proc/slabinfo
done

Before: We can see host struggles to clean the netns, even after there are no new creations.
Memory cost is high, because each netns consumes a good amount of memory.

time ./unshare10.sh
net_namespace      82634  82634   3968    1    1 : tunables   24   12    8 : slabdata  82634  82634      0
net_namespace      82634  82634   3968    1    1 : tunables   24   12    8 : slabdata  82634  82634      0
net_namespace      82634  82634   3968    1    1 : tunables   24   12    8 : slabdata  82634  82634      0
net_namespace      82634  82634   3968    1    1 : tunables   24   12    8 : slabdata  82634  82634      0
net_namespace      82634  82634   3968    1    1 : tunables   24   12    8 : slabdata  82634  82634      0
net_namespace      82634  82634   3968    1    1 : tunables   24   12    8 : slabdata  82634  82634      0
net_namespace      82634  82634   3968    1    1 : tunables   24   12    8 : slabdata  82634  82634      0
net_namespace      82634  82634   3968    1    1 : tunables   24   12    8 : slabdata  82634  82634      0
net_namespace      82634  82634   3968    1    1 : tunables   24   12    8 : slabdata  82634  82634      0
net_namespace      37214  37792   3968    1    1 : tunables   24   12    8 : slabdata  37214  37792    192

real 6m57.766s
user 3m37.277s
sys 40m4.826s

After: We can see the script completes much faster,
the kernel thread doing the cleanup_net() keeps up just fine.
Memory cost is not too big.

time ./unshare10.sh
net_namespace       9945   9945   4096    1    1 : tunables   24   12    8 : slabdata   9945   9945      0
net_namespace       4087   4665   4096    1    1 : tunables   24   12    8 : slabdata   4087   4665    192
net_namespace       4082   4607   4096    1    1 : tunables   24   12    8 : slabdata   4082   4607    192
net_namespace        234    761   4096    1    1 : tunables   24   12    8 : slabdata    234    761    192
net_namespace        224    751   4096    1    1 : tunables   24   12    8 : slabdata    224    751    192
net_namespace        218    745   4096    1    1 : tunables   24   12    8 : slabdata    218    745    192
net_namespace        193    667   4096    1    1 : tunables   24   12    8 : slabdata    193    667    172
net_namespace        167    609   4096    1    1 : tunables   24   12    8 : slabdata    167    609    152
net_namespace        167    609   4096    1    1 : tunables   24   12    8 : slabdata    167    609    152
net_namespace        157    609   4096    1    1 : tunables   24   12    8 : slabdata    157    609    152

real    1m43.876s
user    3m39.728s
sys 7m36.342s
====================

Link: https://lore.kernel.org/r/20220208045038.2635826-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: remove default_device_exit()

For some reason default_device_ops kept two exit method:

1) default_device_exit() is called for each netns being dismantled in
a cleanup_net() round. This acquires rtnl for each invocation.

2) default_device_exit_batch() is called once with the list of all netns
int the batch, allowing for a single rtnl invocation.

Get rid of the .exit() method to handle the logic from
default_device_exit_batch(), to decrease the number of rtnl acquisition
to one.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bonding: switch bond_net_exit() to batch mode

cleanup_net() is competing with other rtnl users.

Batching bond_net_exit() factorizes all rtnl acquistions
to a single one, giving chance for cleanup_net()
to progress much faster, holding rtnl a bit longer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

can: gw: switch cangw_pernet_exit() to batch mode

cleanup_net() is competing with other rtnl users.

Avoiding to acquire rtnl for each netns before calling
cgw_remove_all_jobs() gives chance for cleanup_net()
to progress much faster, holding rtnl a bit longer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipmr: introduce ipmr_net_exit_batch()

cleanup_net() is competing with other rtnl users.

Avoiding to acquire rtnl for each netns before calling
ipmr_rules_exit() gives chance for cleanup_net()
to progress much faster, holding rtnl a bit longer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ip6mr: introduce ip6mr_net_exit_batch()

cleanup_net() is competing with other rtnl users.

Avoiding to acquire rtnl for each netns before calling
ip6mr_rules_exit() gives chance for cleanup_net()
to progress much faster, holding rtnl a bit longer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: change fib6_rules_net_exit() to batch mode

cleanup_net() is competing with other rtnl users.

fib6_rules_net_exit() seems a good candidate for exit_batch(),
as this gives chance for cleanup_net() to progress much faster,
holding rtnl a bit longer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: add fib_net_exit_batch()

cleanup_net() is competing with other rtnl users.

Instead of acquiring rtnl at each fib_net_exit() invocation,
add fib_net_exit_batch() so that rtnl is acquired once.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

nexthop: change nexthop_net_exit() to nexthop_net_exit_batch()

cleanup_net() is competing with other rtnl users.

nexthop_net_exit() seems a good candidate for exit_batch(),
as this gives chance for cleanup_net() to progress much faster,
holding rtnl a bit longer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6/addrconf: switch to per netns inet6_addr_lst hash table

IPv6 does not scale very well with the number of IPv6 addresses.
It uses a global (shared by all netns) hash table with 256 buckets.

Some functions like addrconf_verify_rtnl() and addrconf_ifdown()
have to iterate all addresses in the hash table.

I have seen addrconf_verify_rtnl() holding the cpu for 10ms or more.

Switch to the per netns hashtable (and spinlock) added
in prior patches.

This considerably speeds up netns dismantle times on hosts
with thousands of netns. This also has an impact
on regular (fast path) IPv6 processing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6/addrconf: use one delayed work per netns

Next step for using per netns inet6_addr_lst
is to have per netns work item to ultimately
call addrconf_verify_rtnl() and addrconf_verify()
with a new 'struct net*' argument.

Everything is still using the global inet6_addr_lst[] table.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6/addrconf: allocate a per netns hash table

Add a per netns hash table and a dedicated spinlock,
first step to get rid of the global inet6_addr_lst[] one.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: add dev->dev_registered_tracker

Convert one dev_hold()/dev_put() pair in register_netdevice()
and unregister_netdevice_many() to dev_hold_track()
and dev_put_track().

This would allow to detect a rogue dev_put() a bit earlier.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220207184107.1401096-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'fix bpf_prog_pack build errors'

Song Liu says:

====================

Fix build errors reported by kernel test robot.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Fix bpf_prog_pack build HPAGE_PMD_SIZE

Fix build with CONFIG_TRANSPARENT_HUGEPAGE=n with BPF_PROG_PACK_SIZE as
PAGE_SIZE.

Fixes: 57631054fae6 ("bpf: Introduce bpf_prog_pack allocator")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220208220509.4180389-3-song@kernel.org

et131x: support arbitrary MAX_SKB_FRAGS

This NIC does not support TSO, it is very unlikely it would
have to send packets with many fragments.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220208004855.1887345-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'iwl-next' of git://git./linux/kernel/git/tnguy/linux

Nguyen, Anthony L says:

====================
iwl-next Intel Wired LAN Driver Updates 2022-02-07

Dave adds support for ice driver to provide DSCP QoS mappings to irdma
driver.

[1] https://lore.kernel.org/netdev/20220202191921.1638-1-shiraz.saleem@intel.com/

* 'iwl-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux:
ice: add support for DSCP QoS for IDC
====================

Link: https://lore.kernel.org/r/20220207235921.1303522-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bpf: Fix leftover header->pages in sparc and powerpc code.

Replace header->pages * PAGE_SIZE with new header->size.

Fixes: ed2d9e1a26cc ("bpf: Use size instead of pages in bpf_binary_header")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220208220509.4180389-2-song@kernel.org

libbpf: Fix signedness bug in btf_dump_array_data()

The btf__resolve_size() function returns negative error codes so
"elem_size" must be signed for the error handling to work.

Fixes: 920d16af9b42 ("libbpf: BTF dumper support for typed data")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220208071552.GB10495@kili

selftests/bpf: Do not export subtest as standalone test

Two subtests in ksyms_module.c are not qualified as static, so these
subtests are exported as standalone tests in tests.h and lead to
confusion for the output of "./test_progs -t ksyms_module".

By using the following command ...

grep "^void $serial_$\?test_[a-zA-Z0-9_]\+($void$\?)" \
tools/testing/selftests/bpf/prog_tests/*.c | \
awk -F : '{print $1}' | sort | uniq -c | awk '$1 != 1'

... one finds out that other tests also have a similar problem, so
fix these tests by marking subtests in these tests as static.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220208065444.648778-1-houtao1@huawei.com

bpf, x86_64: Fail gracefully on bpf_jit_binary_pack_finalize failures

Instead of BUG_ON(), fail gracefully and return orig_prog.

Fixes: 1022a5498f6f ("bpf, x86_64: Use bpf_jit_binary_pack_alloc")
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220208062533.3802081-1-song@kernel.org

i40e: Add a stat for tracking busy rx pages

In some cases, pages cannot be reused by i40e because the page is busy. Add
a counter for this event.

Busy page count is accessible via ethtool.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Dave Switzer <david.switzer@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

i40e: Add a stat for tracking pages waived

In some cases, pages can not be reused because they are not associated with
the correct NUMA zone. Knowing how often pages are waived helps users to
understand the interaction between the driver's memory usage and their
system.

Pass rx_stats through to i40e_can_reuse_rx_page to allow tracking when
pages are waived.

The page waive count is accessible via ethtool.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Dave Switzer <david.switzer@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

i40e: Add a stat tracking new RX page allocations

Add a counter for new page allocations in the i40e RX path. This stat is
accessible with ethtool.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Dave Switzer <david.switzer@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

i40e: Aggregate and export RX page reuse stat

rx page reuse was already being tracked by the i40e driver per RX ring.
Aggregate the counts and make them accessible via ethtool.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Dave Switzer <david.switzer@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

i40e: Remove rx page reuse double count

Page reuse was being tracked from two locations:
- i40e_reuse_rx_page (via 40e_clean_rx_irq), and
- i40e_alloc_mapped_page

Remove the double count and only count reuse from i40e_alloc_mapped_page
when the page is about to be reused.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Dave Switzer <david.switzer@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

Merge branch 'inet-separate-dscp-from-ecn-bits-using-new-dscp_t-type'

Guillaume Nault says:

====================
inet: Separate DSCP from ECN bits using new dscp_t type

The networking stack currently doesn't clearly distinguish between DSCP
and ECN bits. The entire DSCP+ECN bits are stored in u8 variables (or
structure fields), and each part of the stack handles them in their own
way, using different macros. This has created several bugs in the past
and some uncommon code paths are still unfixed.

Such bugs generally manifest by selecting invalid routes because of ECN
bits interfering with FIB routes and rules lookups (more details in the
LPC 2021 talk[1] and in the RFC of this series[2]).

This patch series aims at preventing the introduction of such bugs (and
detecting existing ones), by introducing a dscp_t type, representing
"sanitised" DSCP values (that is, with no ECN information), as opposed
to plain u8 values that contain both DSCP and ECN information. dscp_t
makes it clear for the reader what we're working on, and Sparse can
flag invalid interactions between dscp_t and plain u8.

This series converts only a few variables and structures:

  * Patch 1 converts the tclass field of struct fib6_rule. It
    effectively forbids the use of ECN bits in the tos/dsfield option
    of ip -6 rule. Rules now match packets solely based on their DSCP
    bits, so ECN doesn't influence the result any more. This contrasts
    with the previous behaviour where all 8 bits of the Traffic Class
    field were used. It is believed that this change is acceptable as
    matching ECN bits wasn't usable for IPv4, so only IPv6-only
    deployments could be depending on it. Also the previous behaviour
    made DSCP-based ip6-rules fail for packets with both a DSCP and an
    ECN mark, which is another reason why any such deploy is unlikely.

  * Patch 2 converts the tos field of struct fib4_rule. This one too
    effectively forbids defining ECN bits, this time in ip -4 rule.
    Before that, setting ECN bit 1 was accepted, while ECN bit 0 was
    rejected. But even when accepted, the rule would never match, as
    the packets would have their ECN bits cleared before doing the
    rule lookup.

  * Patch 3 converts the fc_tos field of struct fib_config. This is
    equivalent to patch 2, but for IPv4 routes. Routes using a
    tos/dsfield option with any ECN bit set is now rejected. Before
    this patch, they were accepted but, as with ip4 rules, these routes
    couldn't match any packet, since their ECN bits are cleared before
    the lookup.

  * Patch 4 converts the fa_tos field of struct fib_alias. This one is
    pure internal u8 to dscp_t conversion. While patches 1-3 had user
    facing consequences, this patch shouldn't have any side effect and
    is there to give an overview of what future conversion patches will
    look like. Conversions are quite mechanical, but imply some code
    churn, which is the price for the extra clarity a possibility of
    type checking.

To summarise, all the behaviour changes required for the dscp_t type
approach to work should be contained in patches 1-3. These changes are
edge cases of ip-route and ip-rule that don't currently work properly.
So they should be safe. Also, a kernel selftest is added for each of
them.

Finally, this work also paves the way for allowing the usage of the 3
high order DSCP bits in IPv4 (a few call paths already handle them, but
in general the stack clears them before IPv4 rule and route lookups).

References:
  [1] LPC 2021 talk:
        - https://linuxplumbersconf.org/event/11/contributions/943/
        - Direct link to slide deck:
            https://linuxplumbersconf.org/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf
  [2] RFC version of this series:
      - https://lore.kernel.org/netdev/cover.1638814614.git.gnault@redhat.com/
====================

Link: https://lore.kernel.org/r/cover.1643981839.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: Use dscp_t in struct fib_alias

Use the new dscp_t type to replace the fa_tos field of fib_alias. This
ensures ECN bits are ignored and makes the field compatible with the
fc_dscp field of struct fib_config.

Converting old *tos variables and fields to dscp_t allows sparse to
flag incorrect uses of DSCP and ECN bits. This patch is entirely about
type annotation and shouldn't change any existing behaviour.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: Reject routes specifying ECN bits in rtm_tos

Use the new dscp_t type to replace the fc_tos field of fib_config, to
ensure IPv4 routes aren't influenced by ECN bits when configured with
non-zero rtm_tos.

Before this patch, IPv4 routes specifying an rtm_tos with some of the
ECN bits set were accepted. However they wouldn't work (never match) as
IPv4 normally clears the ECN bits with IPTOS_RT_MASK before doing a FIB
lookup (although a few buggy code paths don't).

After this patch, IPv4 routes specifying an rtm_tos with any ECN bit
set is rejected.

Note: IPv6 routes ignore rtm_tos altogether, any rtm_tos is accepted,
but treated as if it were 0.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: Stop taking ECN bits into account in fib4-rules

Use the new dscp_t type to replace the tos field of struct fib4_rule,
so that fib4-rules consistently ignore ECN bits.

Before this patch, fib4-rules did accept rules with the high order ECN
bit set (but not the low order one). Also, it relied on its callers
masking the ECN bits of ->flowi4_tos to prevent those from influencing
the result. This was brittle and a few call paths still do the lookup
without masking the ECN bits first.

After this patch fib4-rules only compare the DSCP bits. ECN can't
influence the result anymore, even if the caller didn't mask these
bits. Also, fib4-rules now must have both ECN bits cleared or they will
be rejected.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: Define dscp_t and stop taking ECN bits into account in fib6-rules

Define a dscp_t type and its appropriate helpers that ensure ECN bits
are not taken into account when handling DSCP.

Use this new type to replace the tclass field of struct fib6_rule, so
that fib6-rules don't get influenced by ECN bits anymore.

Before this patch, fib6-rules didn't make any distinction between the
DSCP and ECN bits. Therefore, rules specifying a DSCP (tos or dsfield
options in iproute2) stopped working as soon a packets had at least one
of its ECN bits set (as a work around one could create four rules for
each DSCP value to match, one for each possible ECN value).

After this patch fib6-rules only compare the DSCP bits. ECN doesn't
influence the result anymore. Also, fib6-rules now must have the ECN
bits cleared or they will be rejected.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: optimize locking around PTP clock reads

Reading the PTP clock is a simple operation requiring only 3 register
reads. Under a PREEMPT_RT kernel, protecting those reads by a spin_lock is
counter-productive: if the 2nd task preempting the 1st has a higher prio
but needs to read time as well, it will require 2 context switches, which
will pretty much always be more costly than just disabling preemption for
the duration of the reads. Moreover, with the code logic recently added
to get_systime(), disabling preemption is not even required anymore:
reads and writes just need to be protected from each other, to prevent a
clock read while the clock is being updated.

Improve the above situation by replacing the PTP spinlock by a rwlock, and
using read_lock for PTP clock reads so simultaneous reads do not block
each other.

Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
Link: https://lore.kernel.org/r/20220204135545.2770625-1-yannick.vignon@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: typhoon: include <net/vxlan.h>

We need this to get vxlan_features_check() definition.

Fixes: d2692eee05b8 ("net: typhoon: implement ndo_features_check method")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220208003502.1799728-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bpf: test_run: Fix overflow in bpf_test_finish frags parsing

This place also uses signed min_t and passes this singed int to
copy_to_user (which accepts unsigned argument). I don't think
there is an issue, but let's be consistent.

Fixes: 7855e0db150ad ("bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220204235849.14658-2-sdf@google.com

bpf: test_run: Fix overflow in xdp frags parsing

When kattr->test.data_size_in > INT_MAX, signed min_t will assign
negative value to data_len. This negative value then gets passed
over to copy_from_user where it is converted to (big) unsigned.

Use unsigned min_t to avoid this overflow.

usercopy: Kernel memory overwrite attempt detected to wrapped address
(offset 0, size 18446612140539162846)!
------------[ cut here ]------------
kernel BUG at mm/usercopy.c:102!
invalid opcode: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 0 PID: 3781 Comm: syz-executor226 Not tainted 4.15.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:usercopy_abort+0xbd/0xbf mm/usercopy.c:102
RSP: 0018:ffff8801e9703a38 EFLAGS: 00010286
RAX: 000000000000006c RBX: ffffffff84fc7040 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff816560a2 RDI: ffffed003d2e0739
RBP: ffff8801e9703a90 R08: 000000000000006c R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff84fc73a0
R13: ffffffff84fc7180 R14: ffffffff84fc7040 R15: ffffffff84fc7040
FS: 00007f54e0bec300(0000) GS:ffff8801f6600000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000280 CR3: 00000001e90ea000 CR4: 00000000003426f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
check_bogus_address mm/usercopy.c:155 [inline]
__check_object_size mm/usercopy.c:263 [inline]
__check_object_size.cold+0x8c/0xad mm/usercopy.c:253
check_object_size include/linux/thread_info.h:112 [inline]
check_copy_size include/linux/thread_info.h:143 [inline]
copy_from_user include/linux/uaccess.h:142 [inline]
bpf_prog_test_run_xdp+0xe57/0x1240 net/bpf/test_run.c:989
bpf_prog_test_run kernel/bpf/syscall.c:3377 [inline]
__sys_bpf+0xdf2/0x4a50 kernel/bpf/syscall.c:4679
SYSC_bpf kernel/bpf/syscall.c:4765 [inline]
SyS_bpf+0x26/0x50 kernel/bpf/syscall.c:4763
do_syscall_64+0x21a/0x3e0 arch/x86/entry/common.c:305
entry_SYSCALL_64_after_hwframe+0x46/0xbb

Fixes: 1c1949982524 ("bpf: introduce frags support to bpf_prog_test_run_xdp()")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220204235849.14658-1-sdf@google.com

Merge branch 'bpf_prog_pack allocator'

Song Liu says:

====================

Changes v8 => v9:
1. Fix an error with multi function program, in 4/9.

Changes v7 => v8:
1. Rebase and fix conflicts.
2. Lock text_mutex for text_poke_copy. (Daniel)

Changes v6 => v7:
1. Redesign the interface between generic and arch logic, based on feedback
   from Alexei and Ilya.
2. Split 6/7 of v6 to 7/9 and 8/9 in v7, for cleaner logic.
3. Add bpf_arch_text_copy in 6/9.

Changes v5 => v6:
1. Make jit_hole_buffer 128 byte long. Only fill the first and last 128
   bytes of header with INT3. (Alexei)
2. Use kvmalloc for temporary buffer. (Alexei)
3. Rename tmp_header/tmp_image => rw_header/rw_image. Remove tmp_image from
   x64_jit_data. (Alexei)
4. Change fall back round_up_to in bpf_jit_binary_alloc_pack() from
   BPF_PROG_MAX_PACK_PROG_SIZE to PAGE_SIZE.

Changes v4 => v5:
1. Do not use atomic64 for bpf_jit_current. (Alexei)

Changes v3 => v4:
1. Rename text_poke_jit() => text_poke_copy(). (Peter)
2. Change comment style. (Peter)

Changes v2 => v3:
1. Fix tailcall.

Changes v1 => v2:
1. Use text_poke instead of writing through linear mapping. (Peter)
2. Avoid making changes to non-x86_64 code.

Most BPF programs are small, but they consume a page each. For systems
with busy traffic and many BPF programs, this could also add significant
pressure to instruction TLB. High iTLB pressure usually causes slow down
for the whole system, which includes visible performance degradation for
production workloads.

This set tries to solve this problem with customized allocator that pack
multiple programs into a huge page.

Patches 1-6 prepare the work. Patch 7 contains key logic of bpf_prog_pack
allocator. Patch 8 contains bpf_jit_binary_pack_alloc logic on top of
bpf_prog_pack allocator. Patch 9 uses this allocator in x86_64 jit.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, x86_64: Use bpf_jit_binary_pack_alloc

Use bpf_jit_binary_pack_alloc in x86_64 jit. The jit engine first writes
the program to the rw buffer. When the jit is done, the program is copied
to the final location with bpf_jit_binary_pack_finalize.

Note that we need to do bpf_tail_call_direct_fixup after finalize.
Therefore, the text_live = false logic in __bpf_arch_text_poke is no
longer needed.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220204185742.271030-10-song@kernel.org

bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]

This is the jit binary allocator built on top of bpf_prog_pack.

bpf_prog_pack allocates RO memory, which cannot be used directly by the
JIT engine. Therefore, a temporary rw buffer is allocated for the JIT
engine. Once JIT is done, bpf_jit_binary_pack_finalize is used to copy
the program to the RO memory.

bpf_jit_binary_pack_alloc reserves 16 bytes of extra space for illegal
instructions, which is small than the 128 bytes space reserved by
bpf_jit_binary_alloc. This change is necessary for bpf_jit_binary_hdr
to find the correct header. Also, flag use_bpf_prog_pack is added to
differentiate a program allocated by bpf_jit_binary_pack_alloc.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220204185742.271030-9-song@kernel.org

bpf: Introduce bpf_prog_pack allocator

Most BPF programs are small, but they consume a page each. For systems
with busy traffic and many BPF programs, this could add significant
pressure to instruction TLB. High iTLB pressure usually causes slow down
for the whole system, which includes visible performance degradation for
production workloads.

Introduce bpf_prog_pack allocator to pack multiple BPF programs in a huge
page. The memory is then allocated in 64 byte chunks.

Memory allocated by bpf_prog_pack allocator is RO protected after initial
allocation. To write to it, the user (jit engine) need to use text poke
API.

Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220204185742.271030-8-song@kernel.org

bpf: Introduce bpf_arch_text_copy

This will be used to copy JITed text to RO protected module memory. On
x86, bpf_arch_text_copy is implemented with text_poke_copy.

bpf_arch_text_copy returns pointer to dst on success, and ERR_PTR(errno)
on errors.

Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220204185742.271030-7-song@kernel.org

x86/alternative: Introduce text_poke_copy

This will be used by BPF jit compiler to dump JITed binary to a RX huge
page, and thus allow multiple BPF programs sharing the a huge (2MB) page.

Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/20220204185742.271030-6-song@kernel.org

bpf: Use prog->jited_len in bpf_prog_ksym_set_addr()

Using prog->jited_len is simpler and more accurate than current
estimation (header + header->size).

Also, fix missing prog->jited_len with multi function program. This hasn't
been a real issue before this.

Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220204185742.271030-5-song@kernel.org