review.tizen.org Git - platform/kernel/linux-rpi.git/log

rds: ib: force endiannes annotation

While the endiannes is being handled correctly as indicated by the comment
above the offending line - sparse was unhappy with the missing annotation
as be64_to_cpu() expects a __be64 argument. To mitigate this annotation
all involved variables are changed to a consistent __le64 and the
conversion to uint64_t delayed to the call to rds_cong_map_updated().

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-mvpp2-cls-Add-classification'

Maxime Chevallier says:

====================
net: mvpp2: cls: Add classification

This series is a rework of the previously standalone patch adding
classification support for mvpp2 :

https://lore.kernel.org/netdev/20190423075031.26074-1-maxime.chevallier@bootlin.com/

This patch has been reworked according to Saeed's review, to make sure
that the location of the rule is always respected and serves as a way to
prioritize rules between each other. This the 3rd iteration of this
submission, but since it's now a series, I reset the revision numbering.

This series implements that in a limited configuration for now, since we
limit the total number of rules per port to 4.

The main factors for this limitation are that :
- We share the classification tables between all ports (4 max, although
   one is only used for internal loopback), hence we have to perform a
   logical separation between rules, which is done today by dedicated
   ranges for each port in each table

- The "Flow table", which dictates which lookups operations are
   performed for an ingress packet, in subdivided into 22 "sub flows",
   each corresponding to a traffic type based on the L3 proto, L4
   proto, the presence or not of a VLAN tag and the L3 fragmentation.

   This makes so that when adding a rule, it has to be added into each
   of these subflows, introducing duplications of entries and limiting
   our max number of entries.

These limitations can be overcomed in several ways, but for readability
sake, I'd rather submit basic classification offload support for now,
and improve it gradually.

This series also adds a small cosmetic cleanup patch (1), and also adds
support for the "Drop" action compared to the first submission of this
feature. It is simple enough to be added with this basic support.

Compared to the first submissions, the NETIF_F_NTUPLE flag was also
removed, following Saeed's comment.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvpp2: cls: Allow dropping packets with classification offload

This commit introduces support for the "Drop" action in classification
offload. This corresponds to the "-1" action with ethtool -N.

This is achieved using the color marking actions available in the C2
engine, which associate a color to a packet. These colors can be either
Green, Yellow or Red, Red meaning that the packet should be dropped.

Green and Yellow colors are interpreted by the Policer, which isn't
supported yet.

This method of dropping using the Classifier is different than the
already existing early-drop features, such as VLAN filtering and MAC
UC/MC filtering, which are performed during the Parsing step, and
therefore take precedence over classification actions.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvpp2: cls: Add Classification offload support

This commit introduces basic classification offloading support for the
PPv2 controller.

The PPv2 classifier has many classification engines, for now we only use
the C2 TCAM match engine.

This engine allows to perform ternary lookups on 64 bits keys (called
Header Extracted Key), that are built by extracting fields from the packet
header and concatenating them. At most 4 fields can be extracted for a
single lookup.

This basic implementation allows to build the HEK from the following
fields :
- L4 source and destination ports (for UDP and TCP)

More fields are to be added in the future.

Classification flows are added through the ethtool interface, using the
newly introduced flow_rule infrastructure as an internal rule
representation, allowing to more easily implement tc flower rules if
need be.

The internal design for now allocates one range of 4 rules per port
due to the internal design of the flow table, which uses 22 sub-flows.

When inserting a classification rule, the rule is created in every
relevant sub-flow.

This low rule-count is a very simple design which reaches quickly the
limitations of the flow table ordering, but guarantees that the rule
ordering will always be respected.

This commit only introduces support for the "steer to rxq" action.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvpp2: cls: Use a bitfield to represent the flow_type

As of today, the classification code is used only for RSS. We split the
incoming traffic into multiple flows, that correspond to the ethtool
flow_type parameter.

We don't want to use the ethtool flow definitions such as TCP_V4_FLOW,
for several reason :

- We want to decorrelate the driver code from ethtool as much as
possible, so that we can easily use other interfaces such as tc flower,

- We want the flow_type to be a bitfield, so that we can match flows
embedded into each other, such as TCP4 which is a subset of IP4.

This commit does the conversion to the newer type.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvpp2: cls: Remove extra whitespace in mvpp2_cls_flow_write

Cosmetic patch removing extra whitespaces when writing the flow_table
entries

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlx5-next' of git://git./linux/kernel/git/mellanox/linux

This merge commit includes some misc shared code updates from mlx5-next branch needed
for net-next.

1) From Aya: Enable general events on all physical link types and
   restrict general event handling of subtype DELAY_DROP_TIMEOUT in mlx5 rdma
   driver to ethernet links only as it was intended.

2) From Eli: Introduce low level bits for prio tag mode

3) From Maor: Low level steering updates to support RDMA RX flow
   steering and enables RoCE loopback traffic when switchdev is enabled.

4) From Vu and Parav: Two small mlx5 core cleanups

5) From Yevgeny add HW definitions of geneve offloads

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

Merge tag 'arc-5.1-final' of git://git./linux/kernel/git/vgupta/arc

Pull ARC fixes from Vineet Gupta:
"A few minor fixes for ARC.

   - regression in memset if line size !64

   - avoid panic if PAE and IOC"

* tag 'arc-5.1-final' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
  ARC: memset: fix build with L1_CACHE_SHIFT != 6
  ARC: [hsdk] Make it easier to add PAE40 region to DTB
  ARC: PAE40: don't panic and instead turn off hw ioc

net/mlx5: Fix broken hca cap offset

The cited commit broke the offsets of hca cap struct, fix it.
While at it, cleanup a white space introduced by the same commit.

Fixes: b169e64a2444 ("net/mlx5: Geneve, Add flow table capabilities for Geneve decap with TLV options")
Reported-by: Qian Cai <cai@lca.pw>
Cc: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

PCI/portdrv: Use shared MSI/MSI-X vector for Bandwidth Management

The Interrupt Message Number in the PCIe Capabilities register (PCIe r4.0,
sec 7.5.3.2) indicates which MSI/MSI-X vector is shared by interrupts
related to the PCIe Capability, including Link Bandwidth Management and
Link Autonomous Bandwidth Interrupts (Link Control, 7.5.3.7), Command
Completed and Hot-Plug Interrupts (Slot Control, 7.5.3.10), and the PME
Interrupt (Root Control, 7.5.3.12).

pcie_message_numbers() checked whether we want to enable PME or Hot-Plug
interrupts but neglected to check for Link Bandwidth Management, so if we
only wanted the Bandwidth Management interrupts, it decided we didn't need
any vectors at all. Then pcie_port_enable_irq_vec() tried to reallocate
zero vectors, which failed, resulting in fallback to INTx.

On some systems, e.g., an X79-based workstation, that INTx seems broken or
not handled correctly, so we got spurious IRQ16 interrupts for Bandwidth
Management events.

Change pcie_message_numbers() so that if we want Link Bandwidth Management
interrupts, we use the shared MSI/MSI-X vector from the PCIe Capabilities
register.

Fixes: e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth notification")
Link: https://lore.kernel.org/lkml/155597243666.19387.1205950870601742062.stgit@gimli.home
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

Merge tag 'acpi-5.1-rc8' of git://git./linux/kernel/git/rafael/linux-pm

Pull ACPI fix from Rafael Wysocki:
"Revert a recent ACPICA change that caused initialization to fail on
systems with Thunderbolt docking stations connected at the init time"

* tag 'acpi-5.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
Revert "ACPICA: Clear status of GPEs before enabling them"

gcc-9: don't warn about uninitialized btrfs extent_type variable

The 'extent_type' variable does seem to be reliably initialized, but
it's _very_ non-obvious, since there's a "goto next" case that jumps
over the normal initialization. That will then always trigger the
"start >= extent_end" test, which will end up never falling through to
the use of that variable.

But the code is certainly not obvious, and the compiler warning looks
reasonable. Make 'extent_type' an int, and initialize it to an invalid
negative value, which seems to be the common pattern in other places.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'net-ll_temac-x86_64-support'

Esben Haabendal says:

====================
net: ll_temac: x86_64 support

This patch series adds support for use of ll_temac driver with
platform_data configuration and fixes endianess and 64-bit problems so
that it can be used on x86_64 platform.

A few bugfixes are also included.

Changes since v2:
  - Fixed lp->indirect_mutex initialization regression for OF
    platforms introduced in v2

Changes since v1:
  - Make indirect_mutex specification mandatory when using platform_data
  - Move header to include/linux/platform_data
  - Enable COMPILE_TEST for XILINX_LL_TEMAC
  - Rebased to v5.1-rc7
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ll_temac: Enable DMA when ready, not before

As soon as TAILDESCR_PTR is written, DMA transfers might start.
Let's ensure we are ready to receive DMA IRQ's before doing that.

Signed-off-by: Esben Haabendal <esben@geanix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ll_temac: Allow configuration of IRQ coalescing

This allows custom setup of IRQ coalescing for platforms using legacy
platform_device. The irq timeout and count parameters can be used for
tuning cpu load vs. latency.

I have maintained the 0x00000400 bit in TX_CHNL_CTRL. It is specified as
unused in the documentation I have available. It does not make any
difference in the hardware I have available, so it is left in to not risk
breaking other platforms where it might be used.

Signed-off-by: Esben Haabendal <esben@geanix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ll_temac: Replace bad usage of msleep() with usleep_range()

Use usleep_range() to avoid problems with msleep() actually sleeping
much longer than expected.

Signed-off-by: Esben Haabendal <esben@geanix.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ll_temac: Fix bug causing buffer descriptor overrun

As we are actually using a BD for both the skb and each frag contained in
it, the oldest TX BD would be overwritten when there was exactly one BD
less than needed.

Signed-off-by: Esben Haabendal <esben@geanix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ll_temac: Fix iommu/swiotlb leak

Unmap the actual buffer length, not the amount of data received, avoiding
resource exhaustion of swiotlb (seen on x86_64 platform).

Signed-off-by: Esben Haabendal <esben@geanix.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ll_temac: Support indirect_mutex share within TEMAC IP

Indirect register access goes through a DCR bus bridge, which
allows only one outstanding transaction.  And to make matters
worse, each TEMAC IP block contains two Ethernet interfaces, and
although they seem to have separate registers for indirect access,
they actually share the registers.  Or to be more specific, MSW, LSW
and CTL registers are physically shared between Ethernet interfaces
in same TEMAC IP, with RDY register being (almost) specificic to
the Ethernet interface.  The 0x10000 bit in RDY reflects combined
bus ready state though.

So we need to take care to synchronize not only within a single
device, but also between devices in same TEMAC IP.

This commit allows to do that with legacy platform devices.

For OF devices, the xlnx,compound parent of the temac node should be
used to find siblings, and setup a shared indirect_mutex between them.
I will leave this work to somebody else, as I don't have hardware to
test that.  No regression is introduced by that, as before this commit
using two Ethernet interfaces in same TEMAC block is simply broken.

Signed-off-by: Esben Haabendal <esben@geanix.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ll_temac: Allow use on x86 platforms

With little-endian and 64-bit support in place, the ll_temac driver can
now be used on x86 and x86_64 platforms.

And while at it, enable COMPILE_TEST also.

Signed-off-by: Esben Haabendal <esben@geanix.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ll_temac: Fix support for little-endian platforms

Both TEMAC and SDMA is big-endian, so make sure that all values in SDMA
buffer descriptors (cmdac_bd) are handled as big-endian, independent of the
host endianness. With all currently supported platforms being big-endian,
this change does not make a change for any of them.

Note, when using app3 and app4 for piggybacking skb pointers there is no
need to care about endianness, as neither TEMAC nor SDMA access app3 and
app4 in TX buffer descriptors.

Signed-off-by: Esben Haabendal <esben@geanix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ll_temac: Add support for non-native register endianness

Replace the powerpc specific MMIO register access functions with the
generic big-endian mmio access functions, and add support for
little-endian access depending on configuration.

Big-endian access is maintained as the default, but little-endian can
be configured in device-tree binding or in platform data.

The temac_ior()/temac_iow() functions are replaced with macro wrappers
to avoid modifying existing code more than necessary.

Signed-off-by: Esben Haabendal <esben@geanix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ll_temac: Fix support for 64-bit platforms

The use of buffer descriptor APP4 field (32-bit) for storing skb pointer
obviously does not work on 64-bit platforms.
As APP3 is also unused, we can use that to store the other half of 64-bit
pointer values.

Contrary to what is hinted at in commit message of commit 15bfe05c8d63
("net: ethernet: xilinx: Mark XILINX_LL_TEMAC broken on 64-bit")
there are no other pointers stored in cdmac_bd.

Signed-off-by: Esben Haabendal <esben@geanix.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ll_temac: Extend support to non-device-tree platforms

Support initialization with platdata, so the driver can be used on
non-device-tree platforms.

For currently supported device-tree platforms, the driver should behave
as before.

Signed-off-by: Esben Haabendal <esben@geanix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ll_temac: Fix and simplify error handling by using devres functions

As a side effect, a few error cases are fixed.

If of_iomap() of sdma_regs failed, no error code was returned. Fixed to
return -ENOMEM similar to of_iomap() fail of regs.

If sysfs_create_group() or register_netdev() failed, lp->phy_node was not
released.

Finally, the order in remove function is corrected to be reverse order
of what is done in probe, i.e. calling temac_mdio_teardown() last, so we
unregister the netdev that most likely is using the mdio_bus first.

Signed-off-by: Esben Haabendal <esben@geanix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: fib_rule_tests: print the result and return 1 if any tests failed

Fixes: 65b2b4939a64 ("selftests: net: initial fib rule tests")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpsw: Fix inconsistent IS_ERR and PTR_ERR in cpsw_probe()

Fix inconsistent IS_ERR and PTR_ERR in cpsw_probe,
The proper pointer to use is clk instead of mode.

This issue was detected with the help of Coccinelle.

Fixes: 83a8471ba255 ("net: ethernet: ti: cpsw: refactor probe to group common hw initialization")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

gcc-9: properly declare the {pv,hv}clock_page storage

The pvlock_page and hvclock_page variables are (as the name implies)
addresses to pages, created by the linker script.

But we declared them as just "extern u8" variables, which _works_, but
now that gcc does some more bounds checking, it causes warnings like

warning: array subscript 1 is outside array bounds of ‘u8[1]’

when we then access more than one byte from those variables.

Fix this by simply making the declaration of the variables match
reality, which makes the compiler happy too.

Signed-off-by: Linus Torvalds <torvalds@-linux-foundation.org>

gcc-9: don't warn about uninitialized variable

I'm not sure what made gcc warn about this code now. The 'ret' variable
does end up initialized in all cases, but it's definitely not obvious,
so the compiler is quite reasonable to warn about this.

So just add initialization to make it all much more obvious both to
compilers and to humans.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

gcc-9: silence 'address-of-packed-member' warning

We already did this for clang, but now gcc has that warning too. Yes,
yes, the address may be unaligned. And that's kind of the point.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ipv4: ip_do_fragment: Preserve skb_iif during fragmentation

Previously, during fragmentation after forwarding, skb->skb_iif isn't
preserved, i.e. 'ip_copy_metadata' does not copy skb_iif from given
'from' skb.

As a result, ip_do_fragment's creates fragments with zero skb_iif,
leading to inconsistent behavior.

Assume for example an eBPF program attached at tc egress (post
forwarding) that examines __sk_buff->ingress_ifindex:
- the correct iif is observed if forwarding path does not involve
fragmentation/refragmentation
- a bogus iif is observed if forwarding path involves
fragmentation/refragmentatiom

Fix, by preserving skb_iif during 'ip_copy_metadata'.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

io_uring: avoid page allocation warnings

In io_sqe_buffer_register() we allocate a number of arrays based on the
iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
we can still attempt to allocate arrays exceeding MAX_ORDER.

On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
allocate a corresponding array of (16-byte) bio_vecs, requiring 4194320
bytes, which is greater than 4MiB. This results in SLUB warning that
we're trying to allocate greater than MAX_ORDER, and failing the
allocation.

Avoid this by using kvmalloc() for allocations dependent on the
user-provided iov_len. At the same time, fix a leak of imu->bvec when
registration fails.

Full splat from before this patch:

WARNING: CPU: 1 PID: 2314 at mm/page_alloc.c:4595 __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 2314 Comm: syz-executor326 Not tainted 5.1.0-rc7-dirty #4
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193
show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x110/0x190 lib/dump_stack.c:113
panic+0x384/0x68c kernel/panic.c:214
__warn+0x2bc/0x2c0 kernel/panic.c:571
report_bug+0x228/0x2d8 lib/bug.c:186
bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956
call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline]
brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316
do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831
el1_dbg+0x18/0x8c
__alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595
alloc_pages_current+0x164/0x278 mm/mempolicy.c:2132
alloc_pages include/linux/gfp.h:509 [inline]
kmalloc_order+0x20/0x50 mm/slab_common.c:1231
kmalloc_order_trace+0x30/0x2b0 mm/slab_common.c:1243
kmalloc_large include/linux/slab.h:480 [inline]
__kmalloc+0x3dc/0x4f0 mm/slub.c:3791
kmalloc_array include/linux/slab.h:670 [inline]
io_sqe_buffer_register fs/io_uring.c:2472 [inline]
__io_uring_register fs/io_uring.c:2962 [inline]
__do_sys_io_uring_register fs/io_uring.c:3008 [inline]
__se_sys_io_uring_register fs/io_uring.c:2990 [inline]
__arm64_sys_io_uring_register+0x9e0/0x1bc8 fs/io_uring.c:2990
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall arch/arm64/kernel/syscall.c:47 [inline]
el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83
el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129
el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948
SMP: stopping secondary CPUs
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled
CPU features: 0x002,23000438
Memory Limit: none
Rebooting in 1 seconds..

Fixes: edafccee56ff3167 ("io_uring: add support for pre-mapped user IO buffers")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'net-sched-taprio-change-schedules'

Vinicius Costa Gomes says:

====================
net/sched: taprio change schedules

Changes from RFC:
- Removed the patches for taprio offloading, because of the lack of
   in-tree users;
- Updated the links to point to the PATCH version of this series;

Original cover letter:

Overview
--------

This RFC has two objectives, it adds support for changing the running
schedules during "runtime", explained in more detail later, and
proposes an interface between taprio and the drivers for hardware
offloading.

These two different features are presented together so it's clear what
the "final state" would look like. But after the RFC stage, they can
be proposed (and reviewed) separately.

Changing the schedules without disrupting traffic is important for
handling dynamic use cases, for example, when streams are
added/removed and when the network configuration changes.

Hardware offloading support allows schedules to be more precise and
have lower resource usage.

Changing schedules
------------------

The same as the other interfaces we proposed, we try to use the same
concepts as the IEEE 802.1Q-2018 specification. So, for changing
schedules, there are an "oper" (operational) and an "admin" schedule.
The "admin" schedule is mutable and not in use, the "oper" schedule is
immutable and is in use.

That is, when the user first adds an schedule it is in the "admin"
state, and it becomes "oper" when its base-time (basically when it
starts) is reached.

What this means is that now it's possible to create taprio with a schedule:

$ tc qdisc add dev IFACE parent root handle 100 taprio \
      num_tc 3 \
      map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
      queues 1@0 1@1 2@2 \
      base-time 10000000 \
      sched-entry S 03 300000 \
      sched-entry S 02 300000 \
      sched-entry S 06 400000 \
      clockid CLOCK_TAI

And then, later, after the previous schedule is "promoted" to "oper",
add a new ("admin") schedule to be used some time later:

$ tc qdisc change dev IFACE parent root handle 100 taprio \
      base-time 1553121866000000000 \
      sched-entry S 02 500000 \
      sched-entry S 0f 400000 \
      clockid CLOCK_TAI

When enabling the ability to change schedules, it makes sense to add
two more defined knobs to schedules: "cycle-time" allows to truncate a
cycle to some value, so it repeats after a well-defined value;
"cycle-time-extension" controls how much an entry can be extended if
it's the last one before the change of schedules, the reason is to
avoid a very small cycle when transitioning from a schedule to
another.

With these, taprio in the software mode should provide a fairly
complete implementation of what's defined in the Enhancements for
Scheduled Traffic parts of the specification.

Hardware offload
----------------

Some workloads require better guarantees from their schedules than
what's provided by the software implementation. This series proposes
an interface for configuring schedules into compatible network
controllers.

This part is proposed together with the support for changing
schedules, because it raises questions like, should the "qdisc" side
be responsible of providing visibility into the schedules or should it
be the driver?

In this proposal, the driver is called passing the new schedule as
soon as it is validated, and the "core" qdisc takes care of displaying
(".dump()") the correct schedules at all times. It means that some
logic would need to be duplicated in the driver, if the hardware
doesn't have support for multiple schedules. But as taprio doesn't
have enough information about the underlying controller to know how
much in advance a schedule needs to be informed to the hardware, it
feels like a fair compromise.

The hardware offloading part of this proposal also tries to define an
interface for frame-preemption and how it interacts with the
scheduling of traffic, see Section 8.6.8.4 of IEEE 802.1Q-2018 for
more information.

One important difference between the qdisc interface and the
qdisc-driver interface, is that the "gate mask" on the qdisc side
references traffic classes, that is bit 0 of the gate mask means
Traffic Class 0, and in the driver interface, it specifies the queues,
that is bit 0 means queue 0. That is to say that taprio converts the
references to traffic classes to references to queues before sending
the offloading request to the driver.

Request for help
----------------

I would like that interested driver maintainers could take a look at
the proposed interface and see if it's going to be too awkward for any
particular device. Also, pointers to available documentation would be
appreciated. The idea here is to start a discussion so we can have an
interface that would work for multiple vendors.

Links
-----

kernel patches:
https://github.com/vcgomes/net-next/tree/taprio-add-support-for-change-v3

iproute2 patches:
https://github.com/vcgomes/iproute2/tree/taprio-add-support-for-change-v3
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

taprio: Add support for cycle-time-extension

IEEE 802.1Q-2018 defines the concept of a cycle-time-extension, so the
last entry of a schedule before the start of a new schedule can be
extended, so "too-short" entries can be avoided.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

taprio: Add support for setting the cycle-time manually

IEEE 802.1Q-2018 defines that a the cycle-time of a schedule may be
overridden, so the schedule is truncated to a determined "width".

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

taprio: Add support adding an admin schedule

The IEEE 802.1Q-2018 defines two "types" of schedules, the "Oper" (from
operational?) and "Admin" ones. Up until now, 'taprio' only had
support for the "Oper" one, added when the qdisc is created. This adds
support for the "Admin" one, which allows the .change() operation to
be supported.

Just for clarification, some quick (and dirty) definitions, the "Oper"
schedule is the currently (as in this instant) running one, and it's
read-only. The "Admin" one is the one that the system configurator has
installed, it can be changed, and it will be "promoted" to "Oper" when
it's 'base-time' is reached.

The idea behing this patch is that calling something like the below,
(after taprio is already configured with an initial schedule):

$ tc qdisc change taprio dev IFACE parent root      \
         base-time X                        \
         sched-entry <CMD> <GATES> <INTERVAL>      \
   ...

Will cause a new admin schedule to be created and programmed to be
"promoted" to "Oper" at instant X. If an "Admin" schedule already
exists, it will be overwritten with the new parameters.

Up until now, there was some code that was added to ease the support
of changing a single entry of a schedule, but was ultimately unused.
Now, that we have support for "change" with more well thought
semantics, updating a single entry seems to be less useful.

So we remove what is in practice dead code, and return a "not
supported" error if the user tries to use it. If changing a single
entry would make the user's life easier we may ressurrect this idea,
but at this point, removing it simplifies the code.

For now, only the schedule specific bits are allowed to be added for a
new schedule, that means that 'clockid', 'num_tc', 'map' and 'queues'
cannot be modified.

Example:

$ tc qdisc change dev IFACE parent root handle 100 taprio \
      base-time $BASE_TIME \
      sched-entry S 00 500000 \
      sched-entry S 0f 500000 \
      clockid CLOCK_TAI

The only change in the netlink API introduced by this change is the
introduction of an "admin" type in the response to a dump request,
that type allows userspace to separate the "oper" schedule from the
"admin" schedule. If userspace doesn't support the "admin" type, it
will only display the "oper" schedule.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

taprio: Fix potencial use of invalid memory during dequeue()

Right now, this isn't a problem, but the next commit allows schedules
to be added during runtime. When a new schedule transitions from the
inactive to the active state ("admin" -> "oper") the previous one can
be freed, if it's freed just after the RCU read lock is released, we
may access an invalid entry.

So, we should take care to protect the dequeue() flow, so all the
places that access the entries are protected by the RCU read lock.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tcp-undo-congestion'

Yuchung Cheng says:

====================
undo congestion window on spurious SYN or SYNACK timeout

Linux TCP currently uses the initial congestion window of 1 packet
if multiple SYN or SYNACK timeouts per RFC6298. However such
timeouts are often spurious on wireless or cellular networks that
experience high delay variances (e.g. ramping up dormant radios or
local link retransmission). Another case is when the underlying
path is longer than the default SYN timeout (e.g. 1 second). In
these cases starting the transfer with a minimal congestion window
is detrimental to the performance for short flows.

One naive approach is to simply ignore SYN or SYNACK timeouts and
always use a larger or default initial window. This approach however
risks pouring gas to the fire when the network is already highly
congested. This is particularly true in data center where application
could start thousands to millions of connections over a single or
multiple hosts resulting in high SYN drops (e.g. incast).

This patch-set detects spurious SYN and SYNACK timeouts upon
completing the handshake via the widely-supported TCP timestamp
options. Upon such events the sender reverts to the default
initial window to start the data transfer so it gets best of both
worlds. This patch-set supports this feature for both active and
passive as well as Fast Open or regular connections.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: refactor setting the initial congestion window

Relocate the congestion window initialization from tcp_init_metrics()
to tcp_init_transfer() to improve code readability.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: refactor to consolidate TFO passive open code

Use a helper to consolidate two identical code block for passive TFO.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: undo cwnd on Fast Open spurious SYNACK retransmit

This patch makes passive Fast Open reverts the cwnd to default
initial cwnd (10 packets) if the SYNACK timeout is spurious.

Passive Fast Open uses a full socket during handshake so it can
use the existing undo logic to detect spurious retransmission
by recording the first SYNACK timeout in key state variable
retrans_stamp. Upon receiving the ACK of the SYNACK, if the socket
has sent some data before the timeout, the spurious timeout
is detected by tcp_try_undo_recovery() in tcp_process_loss()
in tcp_ack().

But if the socket has not send any data yet, tcp_ack() does not
execute the undo code since no data is acknowledged. The fix is to
check such case explicitly after tcp_ack() during the ACK processing
in SYN_RECV state. In addition this is checked in FIN_WAIT_1 state
in case the server closes the socket before handshake completes.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: lower congestion window on Fast Open SYNACK timeout

TCP sender would use congestion window of 1 packet on the second SYN
and SYNACK timeout except passive TCP Fast Open. This makes passive
TFO too aggressive and unfair during congestion at handshake. This
patch fixes this issue so TCP (fast open or not, passive or active)
always conforms to the RFC6298.

Note that tcp_enter_loss() is called only once during recurring
timeouts. This is because during handshake, high_seq and snd_una
are the same so tcp_enter_loss() would incorrect set the undo state
variables multiple times.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: undo init congestion window on false SYNACK timeout

Linux implements RFC6298 and use an initial congestion window
of 1 upon establishing the connection if the SYNACK packet is
retransmitted 2 or more times. In cellular networks SYNACK timeouts
are often spurious if the wireless radio was dormant or idle. Also
some network path is longer than the default SYNACK timeout. In
both cases falsely starting with a minimal cwnd are detrimental
to performance.

This patch avoids doing so when the final ACK's TCP timestamp
indicates the original SYNACK was delivered. It remembers the
original SYNACK timestamp when SYNACK timeout has occurred and
re-uses the function to detect spurious SYN timeout conveniently.

Note that a server may receives multiple SYNs from and immediately
retransmits SYNACKs without any SYNACK timeout. This often happens
on when the client SYNs have timed out due to wireless delay
above. In this case since the server will still use the default
initial congestion (e.g. 10) because tp->undo_marker is reset in
tcp_init_metrics(). This is an intentional design because packets
are not lost but delayed.

This patch only covers regular TCP passive open. Fast Open is
supported in the next patch.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: better SYNACK sent timestamp

Detecting spurious SYNACK timeout using timestamp option requires
recording the exact SYNACK skb timestamp. Previously the SYNACK
sent timestamp was stamped slightly earlier before the skb
was transmitted. This patch uses the SYNACK skb transmission
timestamp directly.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: undo initial congestion window on false SYN timeout

Linux implements RFC6298 and use an initial congestion window of 1
upon establishing the connection if the SYN packet is retransmitted 2
or more times. In cellular networks SYN timeouts are often spurious
if the wireless radio was dormant or idle. Also some network path
is longer than the default SYN timeout. Having a minimal cwnd on
both cases are detrimental to TCP startup performance.

This patch extends TCP undo feature (RFC3522 aka TCP Eifel) to detect
spurious SYN timeout via TCP timestamps. Since tp->retrans_stamp
records the initial SYN timestamp instead of first retransmission, we
have to implement a different undo code additionally. The detection
also must happen before tcp_ack() as retrans_stamp is reset when
SYN is acknowledged.

Note this patch covers both active regular and fast open.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: avoid unconditional congestion window undo on SYN retransmit

Previously if an active TCP open has SYN timeout, it always undo the
cwnd upon receiving the SYNACK. This is because tcp_clean_rtx_queue
would reset tp->retrans_stamp when SYN is acked, which fools then
tcp_try_undo_loss and tcp_packet_delayed. Addressing this issue is
required to properly support undo for spurious SYN timeout.

Fixing this is tricky -- for active TCP open tp->retrans_stamp
records the time when the handshake starts, not the first
retransmission time as the name may suggest. The simplest fix is
for tcp_packet_delayed to ensure it is valid before comparing with
other timestamp.

One side effect of this change is active TCP Fast Open that incurred
SYN timeout. Upon receiving a SYN-ACK that only acknowledged
the SYN, it would immediately retransmit unacknowledged data in
tcp_ack() because the data is marked lost after SYN timeout. But
the retransmission would have an incorrect ack sequence number since
rcv_nxt has not been updated yet tcp_rcv_synsent_state_process(), the
retransmission needs to properly handed by tcp_rcv_fastopen_synack()
like before.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/tls: avoid NULL pointer deref on nskb->sk in fallback

update_chksum() accesses nskb->sk before it has been set
by complete_skb(), move the init up.

Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netdevsim: fix fall-through annotation

Replace "pass through" with a proper "fall through" annotation
in order to fix the following warning:

drivers/net/netdevsim/bus.c: In function ‘new_device_store’:
drivers/net/netdevsim/bus.c:170:14: warning: this statement may fall through [-Wimplicit-fallthrough=]
   port_count = 1;
   ~~~~~~~~~~~^~~
drivers/net/netdevsim/bus.c:172:2: note: here
  case 2:
  ^~~~

Warning level 3 was used: -Wimplicit-fallthrough=3

This fix is part of the ongoing efforts to enable
-Wimplicit-fallthrough

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: fib_rule_tests: Fix icmp proto with ipv6

A recent commit returns an error if icmp is used as the ip-proto for
IPv6 fib rules. Update fib_rule_tests to send ipv6-icmp instead of icmp.

Fixes: 5e1a99eae8499 ("ipv4: Add ICMPv6 support when parse route ipproto")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

packet: validate msg_namelen in send directly

Packet sockets in datagram mode take a destination address. Verify its
length before passing to dev_hard_header.

Prior to 2.6.14-rc3, the send code ignored sll_halen. This is
established behavior. Directly compare msg_namelen to dev->addr_len.

Change v1->v2: initialize addr in all paths

Fixes: 6b8d95f1795c4 ("packet: validate address length if non-zero")
Suggested-by: David Laight <David.Laight@aculab.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

packet: in recvmsg msg_name return at least sizeof sockaddr_ll

Packet send checks that msg_name is at least sizeof sockaddr_ll.
Packet recv must return at least this length, so that its output
can be passed unmodified to packet send.

This ceased to be true since adding support for lladdr longer than
sll_addr. Since, the return value uses true address length.

Always return at least sizeof sockaddr_ll, even if address length
is shorter. Zero the padding bytes.

Change v1->v2: do not overwrite zeroed padding again. use copy_len.

Fixes: 0fb375fb9b93 ("[AF_PACKET]: Allow for > 8 byte hardware addresses.")
Suggested-by: David Laight <David.Laight@aculab.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: mcdi_port: Mark expected switch fall-through

In preparation to enabling -Wimplicit-fallthrough, mark switch
cases where we are expecting to fall through.

This patch fixes the following warning:

drivers/net/ethernet/sfc/mcdi_port.c: In function ‘efx_mcdi_phy_decode_link’:
./include/linux/compiler.h:77:22: warning: this statement may fall through [-Wimplicit-fallthrough=]
# define unlikely(x) __builtin_expect(!!(x), 0)
                      ^~~~~~~~~~~~~~~~~~~~~~~~~~
./include/asm-generic/bug.h:125:2: note: in expansion of macro ‘unlikely’
  unlikely(__ret_warn_on);     \
  ^~~~~~~~
drivers/net/ethernet/sfc/mcdi_port.c:344:3: note: in expansion of macro ‘WARN_ON’
   WARN_ON(1);
   ^~~~~~~
drivers/net/ethernet/sfc/mcdi_port.c:345:2: note: here
  case MC_CMD_FCNTL_OFF:
  ^~~~

Warning level 3 was used: -Wimplicit-fallthrough=3

This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Change devlink health locking mechanism

The devlink health reporters create/destroy and user commands currently
use the devlink->lock as a locking mechanism. Different reporters have
different rules in the driver and are being created/destroyed during
different stages of driver load/unload/running. So during execution of a
reporter recover the flow can go through another reporter's destroy and
create. Such flow leads to deadlock trying to lock a mutex already
held.

With the new locking mechanism the different reporters share mutex lock
only to protect access to shared reporters list.
Added refcount per reporter, to protect the reporters from destroy while
being used.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

iov_iter: fix iov_iter_type

Commit 875f1d0769cd ("iov_iter: add ITER_BVEC_FLAG_NO_REF flag")
introduces one extra flag of ITER_BVEC_FLAG_NO_REF, and this flag
is stored into iter->type.

However, iov_iter_type() doesn't consider the new added flag, fix
it by masking this flag in iov_iter_type().

Fixes: 875f1d0769cd ("iov_iter: add ITER_BVEC_FLAG_NO_REF flag")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: fix handling for BIO_NO_PAGE_REF

Commit 399254aaf489211 ("block: add BIO_NO_PAGE_REF flag") introduces
BIO_NO_PAGE_REF, and once this flag is set for one bio, all pages
in the bio won't be get/put during IO.

However, if one bio is submitted via __blkdev_direct_IO_simple(),
even though BIO_NO_PAGE_REF is set, pages still may be put.

Fixes this issue by avoiding to put pages if BIO_NO_PAGE_REF is
set.

Fixes: 399254aaf489211 ("block: add BIO_NO_PAGE_REF flag")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: drop req submit reference always in async punt

If we don't end up actually calling submit in io_sq_wq_submit_work(),
we still need to drop the submit reference to the request. If we
don't, then we can leak the request. This can happen if we race
with ring shutdown while flushing the workqueue for requests that
require use of the mm_struct.

Fixes: e65ef56db494 ("io_uring: use regular request ref counts")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: free allocated io_memory once

If io_allocate_scq_urings() fails to allocate an sq_* region, it will
call io_mem_free() for any previously allocated regions, but leave
dangling pointers to these regions in the ctx. Any regions which have
not yet been allocated are left NULL. Note that when returning
-EOVERFLOW, the previously allocated sq_ring is not freed, which appears
to be an unintentional leak.

When io_allocate_scq_urings() fails, io_uring_create() will call
io_ring_ctx_wait_and_kill(), which calls io_mem_free() on all the sq_*
regions, assuming the pointers are valid and not NULL.

This can result in pages being freed multiple times, which has been
observed to corrupt the page state, leading to subsequent fun. This can
also result in virt_to_page() on NULL, resulting in the use of bogus
page addresses, and yet more subsequent fun. The latter can be detected
with CONFIG_DEBUG_VIRTUAL on arm64.

Adding a cleanup path to io_allocate_scq_urings() complicates the logic,
so let's leave it to io_ring_ctx_free() to consistently free these
pointers, and simplify the io_allocate_scq_urings() error paths.

Full splats from before this patch below. Note that the pointer logged
by the DEBUG_VIRTUAL "non-linear address" warning has been hashed, and
is actually NULL.

[   26.098129] page:ffff80000e949a00 count:0 mapcount:-128 mapping:0000000000000000 index:0x0
[   26.102976] flags: 0x63fffc000000()
[   26.104373] raw: 000063fffc000000 ffff80000e86c188 ffff80000ea3df08 0000000000000000
[   26.108917] raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000
[   26.137235] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
[   26.143960] ------------[ cut here ]------------
[   26.146020] kernel BUG at include/linux/mm.h:547!
[   26.147586] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[   26.149163] Modules linked in:
[   26.150287] Process syz-executor.21 (pid: 20204, stack limit = 0x000000000e9cefeb)
[   26.153307] CPU: 2 PID: 20204 Comm: syz-executor.21 Not tainted 5.1.0-rc7-00004-g7d30b2ea43d6 #18
[   26.156566] Hardware name: linux,dummy-virt (DT)
[   26.158089] pstate: 40400005 (nZcv daif +PAN -UAO)
[   26.159869] pc : io_mem_free+0x9c/0xa8
[   26.161436] lr : io_mem_free+0x9c/0xa8
[   26.162720] sp : ffff000013003d60
[   26.164048] x29: ffff000013003d60 x28: ffff800025048040
[   26.165804] x27: 0000000000000000 x26: ffff800025048040
[   26.167352] x25: 00000000000000c0 x24: ffff0000112c2820
[   26.169682] x23: 0000000000000000 x22: 0000000020000080
[   26.171899] x21: ffff80002143b418 x20: ffff80002143b400
[   26.174236] x19: ffff80002143b280 x18: 0000000000000000
[   26.176607] x17: 0000000000000000 x16: 0000000000000000
[   26.178997] x15: 0000000000000000 x14: 0000000000000000
[   26.181508] x13: 00009178a5e077b2 x12: 0000000000000001
[   26.183863] x11: 0000000000000000 x10: 0000000000000980
[   26.186437] x9 : ffff000013003a80 x8 : ffff800025048a20
[   26.189006] x7 : ffff8000250481c0 x6 : ffff80002ffe9118
[   26.191359] x5 : ffff80002ffe9118 x4 : 0000000000000000
[   26.193863] x3 : ffff80002ffefe98 x2 : 44c06ddd107d1f00
[   26.196642] x1 : 0000000000000000 x0 : 000000000000003e
[   26.198892] Call trace:
[   26.199893]  io_mem_free+0x9c/0xa8
[   26.201155]  io_ring_ctx_wait_and_kill+0xec/0x180
[   26.202688]  io_uring_setup+0x6c4/0x6f0
[   26.204091]  __arm64_sys_io_uring_setup+0x18/0x20
[   26.205576]  el0_svc_common.constprop.0+0x7c/0xe8
[   26.207186]  el0_svc_handler+0x28/0x78
[   26.208389]  el0_svc+0x8/0xc
[   26.209408] Code: aa0203e0 d0006861 9133a021 97fcdc3c (d4210000)
[   26.211995] ---[ end trace bdb81cd43a21e50d ]---

[   81.770626] ------------[ cut here ]------------
[   81.825015] virt_to_phys used for non-linear address: 000000000d42f2c7 (          (null))
[   81.827860] WARNING: CPU: 1 PID: 30171 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x48/0x68
[   81.831202] Modules linked in:
[   81.832212] CPU: 1 PID: 30171 Comm: syz-executor.20 Not tainted 5.1.0-rc7-00004-g7d30b2ea43d6 #19
[   81.835616] Hardware name: linux,dummy-virt (DT)
[   81.836863] pstate: 60400005 (nZCv daif +PAN -UAO)
[   81.838727] pc : __virt_to_phys+0x48/0x68
[   81.840572] lr : __virt_to_phys+0x48/0x68
[   81.842264] sp : ffff80002cf67c70
[   81.843858] x29: ffff80002cf67c70 x28: ffff800014358e18
[   81.846463] x27: 0000000000000000 x26: 0000000020000080
[   81.849148] x25: 0000000000000000 x24: ffff80001bb01f40
[   81.851986] x23: ffff200011db06c8 x22: ffff2000127e3c60
[   81.854351] x21: ffff800014358cc0 x20: ffff800014358d98
[   81.856711] x19: 0000000000000000 x18: 0000000000000000
[   81.859132] x17: 0000000000000000 x16: 0000000000000000
[   81.861586] x15: 0000000000000000 x14: 0000000000000000
[   81.863905] x13: 0000000000000000 x12: ffff1000037603e9
[   81.866226] x11: 1ffff000037603e8 x10: 0000000000000980
[   81.868776] x9 : ffff80002cf67840 x8 : ffff80001bb02920
[   81.873272] x7 : ffff1000037603e9 x6 : ffff80001bb01f47
[   81.875266] x5 : ffff1000037603e9 x4 : dfff200000000000
[   81.876875] x3 : ffff200010087528 x2 : ffff1000059ecf58
[   81.878751] x1 : 44c06ddd107d1f00 x0 : 0000000000000000
[   81.880453] Call trace:
[   81.881164]  __virt_to_phys+0x48/0x68
[   81.882919]  io_mem_free+0x18/0x110
[   81.886585]  io_ring_ctx_wait_and_kill+0x13c/0x1f0
[   81.891212]  io_uring_setup+0xa60/0xad0
[   81.892881]  __arm64_sys_io_uring_setup+0x2c/0x38
[   81.894398]  el0_svc_common.constprop.0+0xac/0x150
[   81.896306]  el0_svc_handler+0x34/0x88
[   81.897744]  el0_svc+0x8/0xc
[   81.898715] ---[ end trace b4a703802243cbba ]---

Fixes: 2b188cc1bb857a9d ("Add io_uring IO interface")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-block@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: fix SQPOLL cpu validation

In io_sq_offload_start(), we call cpu_possible() on an unbounded cpu
value from userspace. On v5.1-rc7 on arm64 with
CONFIG_DEBUG_PER_CPU_MAPS, this results in a splat:

  WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpu_max_bits_warn include/linux/cpumask.h:121 [inline]

There was an attempt to fix this in commit:

  917257daa0fea7a0 ("io_uring: only test SQPOLL cpu after we've verified it")

... by adding a check after the cpu value had been limited to NR_CPU_IDS
using array_index_nospec(). However, this left an unbound check at the
start of the function, for which the warning still fires.

Let's fix this correctly by checking that the cpu value is bound by
nr_cpu_ids before passing it to cpu_possible(). Note that only
nr_cpu_ids of a cpumask are guaranteed to exist at runtime, and
nr_cpu_ids can be significantly smaller than NR_CPUs. For example, an
arm64 defconfig has NR_CPUS=256, while my test VM has 4 vCPUs.

Following the intent from the commit message for 917257daa0fea7a0, the
check is moved under the SQ_AFF branch, which is the only branch where
the cpu values is consumed. The check is performed before bounding the
value with array_index_nospec() so that we don't silently accept bogus
cpu values from userspace, where array_index_nospec() would force these
values to 0.

I suspect we can remove the array_index_nospec() call entirely, but I've
conservatively left that in place, updated to use nr_cpu_ids to match
the prior check.

Tested on arm64 with the Syzkaller reproducer:

  https://syzkaller.appspot.com/bug?extid=cd714a07c6de2bc34293
  https://syzkaller.appspot.com/x/repro.syz?x=15d8b397200000

Full splat from before this patch:

WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpu_max_bits_warn include/linux/cpumask.h:121 [inline]
WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpumask_check include/linux/cpumask.h:128 [inline]
WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpumask_test_cpu include/linux/cpumask.h:344 [inline]
WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_sq_offload_start fs/io_uring.c:2244 [inline]
WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_uring_create fs/io_uring.c:2864 [inline]
WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_uring_setup+0x1108/0x15a0 fs/io_uring.c:2916
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 27601 Comm: syz-executor.0 Not tainted 5.1.0-rc7 #3
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193
show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x110/0x190 lib/dump_stack.c:113
panic+0x384/0x68c kernel/panic.c:214
__warn+0x2bc/0x2c0 kernel/panic.c:571
report_bug+0x228/0x2d8 lib/bug.c:186
bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956
call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline]
brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316
do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831
el1_dbg+0x18/0x8c
cpu_max_bits_warn include/linux/cpumask.h:121 [inline]
cpumask_check include/linux/cpumask.h:128 [inline]
cpumask_test_cpu include/linux/cpumask.h:344 [inline]
io_sq_offload_start fs/io_uring.c:2244 [inline]
io_uring_create fs/io_uring.c:2864 [inline]
io_uring_setup+0x1108/0x15a0 fs/io_uring.c:2916
__do_sys_io_uring_setup fs/io_uring.c:2929 [inline]
__se_sys_io_uring_setup fs/io_uring.c:2926 [inline]
__arm64_sys_io_uring_setup+0x50/0x70 fs/io_uring.c:2926
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall arch/arm64/kernel/syscall.c:47 [inline]
el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83
el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129
el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948
SMP: stopping secondary CPUs
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled
CPU features: 0x002,23000438
Memory Limit: none
Rebooting in 1 seconds..

Fixes: 917257daa0fea7a0 ("io_uring: only test SQPOLL cpu after we've verified it")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-block@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Simplied the logic

Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'aquantia-next'

Igor Russkikh says:

====================
net: atlantic: Aquantia driver updates 2019-04

This patchset contains various improvements:

- Work targeting link up speedups: link interrupt introduced, some other
  logic changes to imrove this.
- FW operations securing with mutex
- Counters and statistics logic improved by Dmitry
- read out of chip temperature via hwmon interface implemented by
  Yana and Nikita.

v4 changes:
- remove drvinfo_exit noop
- 64bit stats should be readed out sequentially (lsw, then msw)
  declare 64bit read ops for that

v3 changes:
- temp ops renamed to phy_temp ops
- mutex commits squashed for better structure

v2 changes:
- use threaded irq for link state handling
- rework hwmon via devm_hwmon_device_register_with_info
Extra comments on review from Andrew:
- direct device name pointer is used in hwmon registration.
  This causes hwmon device to derive possible interface name changes
- Will consider sanity checks for firmware mutex lock separately.
  Right now there is no single point exsists where such check could
  be easily added.
- There is no way now to fetch and configure min/max/crit temperatures
  via FW. Will investigate this separately.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: remove outdated device ids

Some device ids were never released and does not exist.
Cleanup these.

Signed-off-by: Nikita Danilov <ndanilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: fixups on 64bit dma counters

DMA counters are 64 bit and we can fetch that to reduce
counter overflow, espesially on byte counters.

Tested-by: Nikita Danilov <ndanilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: Dmitry Bogdanov <dmitry.bogdanov@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: get total counters from DMA block

aq_nic_update_ndev_stats pushes statistics to ndev->stats from
system interface. This is not always good because it counts packets/bytes
before any of rx filters (including mac filter).

Its better to report the packet/bytes statistics from DMA
counters which gives actual values of data transferred over pci.
System level stats is still available via ethtool.

Signed-off-by: Nikita Danilov <ndanilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: Dmitry Bogdanov <dmitry.bogdanov@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: fetch up to date statistics on ethtool request

This improves ethtool -S usage, where stats are now actual
on each request. Before that stats only were updated at service
timer period.

Tested-by: Nikita Danilov <ndanilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: Dmitry Bogdanov <dmitry.bogdanov@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: extract timer cb into work job

Service timer callback fetches statistics from FW and that may cause
a long delay in error cases. We also now need to use fw mutex
to prevent concurrent access to FW, thus - extract that logic
from timer callback into the job in the separate work queue.

Signed-off-by: Nikita Danilov <ndanilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: introduce fwreq mutex

Some of FW operations could be invoked simultaneously,
from f.e. ethtool context and from service service activity work.
Here we introduce a fw mutex to secure and serialize access
to FW logic.

Signed-off-by: Nikita Danilov <ndanilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: user correct MSI irq type

Typo in msi code. No much impact though.

Signed-off-by: Nikita Danilov <ndanilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: use macros for better visibility

Improve for better readability

Signed-off-by: Nikita Danilov <ndanilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: improve ifup link detection

Original code detected link only after 1 sec is passed after up.
Here we replace this with direct service callback which updates
link status immediately

Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: link status irq handling

Here we define and request an extra interrupt line,
assign it on link isr handler and restructure abit aq_pci code
to better support that.

We also remove logic for using different timer intervals
depending on link state, since thats now useless.

Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: create global service workqueue

We need this to schedule link interrupt handling and
various service tasks.

Signed-off-by: Nikita Danilov <ndanilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: link interrupt handling function

Define link interrupt handler

Signed-off-by: Nikita Danilov <ndanilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: add link interrupt fields

Declare macroes and nic fields to support link interrupt
handling

Signed-off-by: Nikita Danilov <ndanilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: implement hwmon api for chip temperature

Added support for hwmon api to fetch out chip temperature

Signed-off-by: Yana Esina <yana.esina@aquantia.com>
Signed-off-by: Nikita Danilov <nikita.danilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: add infrastructure to readout chip temperature

Ability to read the chip temperature from memory
via hwmon interface

Signed-off-by: Yana Esina <yana.esina@aquantia.com>
Signed-off-by: Nikita Danilov <nikita.danilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: avoid running the sctp state machine recursively

Ying triggered a call trace when doing an asconf testing:

  BUG: scheduling while atomic: swapper/12/0/0x10000100
  Call Trace:
   <IRQ>  [<ffffffffa4375904>] dump_stack+0x19/0x1b
   [<ffffffffa436fcaf>] __schedule_bug+0x64/0x72
   [<ffffffffa437b93a>] __schedule+0x9ba/0xa00
   [<ffffffffa3cd5326>] __cond_resched+0x26/0x30
   [<ffffffffa437bc4a>] _cond_resched+0x3a/0x50
   [<ffffffffa3e22be8>] kmem_cache_alloc_node+0x38/0x200
   [<ffffffffa423512d>] __alloc_skb+0x5d/0x2d0
   [<ffffffffc0995320>] sctp_packet_transmit+0x610/0xa20 [sctp]
   [<ffffffffc098510e>] sctp_outq_flush+0x2ce/0xc00 [sctp]
   [<ffffffffc098646c>] sctp_outq_uncork+0x1c/0x20 [sctp]
   [<ffffffffc0977338>] sctp_cmd_interpreter.isra.22+0xc8/0x1460 [sctp]
   [<ffffffffc0976ad1>] sctp_do_sm+0xe1/0x350 [sctp]
   [<ffffffffc099443d>] sctp_primitive_ASCONF+0x3d/0x50 [sctp]
   [<ffffffffc0977384>] sctp_cmd_interpreter.isra.22+0x114/0x1460 [sctp]
   [<ffffffffc0976ad1>] sctp_do_sm+0xe1/0x350 [sctp]
   [<ffffffffc097b3a4>] sctp_assoc_bh_rcv+0xf4/0x1b0 [sctp]
   [<ffffffffc09840f1>] sctp_inq_push+0x51/0x70 [sctp]
   [<ffffffffc099732b>] sctp_rcv+0xa8b/0xbd0 [sctp]

As it shows, the first sctp_do_sm() running under atomic context (NET_RX
softirq) invoked sctp_primitive_ASCONF() that uses GFP_KERNEL flag later,
and this flag is supposed to be used in non-atomic context only. Besides,
sctp_do_sm() was called recursively, which is not expected.

Vlad tried to fix this recursive call in Commit c0786693404c ("sctp: Fix
oops when sending queued ASCONF chunks") by introducing a new command
SCTP_CMD_SEND_NEXT_ASCONF. But it didn't work as this command is still
used in the first sctp_do_sm() call, and sctp_primitive_ASCONF() will
be called in this command again.

To avoid calling sctp_do_sm() recursively, we send the next queued ASCONF
not by sctp_primitive_ASCONF(), but by sctp_sf_do_prm_asconf() in the 1st
sctp_do_sm() directly.

Reported-by: Ying Xu <yinxu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: pci: Fix typo in IOT2000 comment

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Documentation: fix netdev-FAQ.rst markup warning

Fix ReST underline warning:

./Documentation/networking/netdev-FAQ.rst:135: WARNING: Title underline too short.

Q: I made changes to only a few patches in a patch series should I resend only those changed?
--------------------------------------------------------------------------------------------

Fixes: ffa91253739c ("Documentation: networking: Update netdev-FAQ regarding patches")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

io_uring: have submission side sqe errors post a cqe

Currently we only post a cqe if we get an error OUTSIDE of submission.
For submission, we return the error directly through io_uring_enter().
This is a bit awkward for applications, and it makes more sense to
always post a cqe with an error, if the error happens on behalf of an
sqe.

This changes submission behavior a bit. io_uring_enter() returns -ERROR
for an error, and > 0 for number of sqes submitted. Before this change,
if you wanted to submit 8 entries and had an error on the 5th entry,
io_uring_enter() would return 4 (for number of entries successfully
submitted) and rewind the sqring. The application would then have to
peek at the sqring and figure out what was wrong with the head sqe, and
then skip it itself. With this change, we'll return 5 since we did
consume 5 sqes, and the last sqe (with the error) will result in a cqe
being posted with the error.

This makes the logic easier to handle in the application, and it cleans
up the submission part.

Suggested-by: Stefan Bühler <source@stbuehler.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ipv6: fix races in ip6_dst_destroy()

We had many syzbot reports that seem to be caused by use-after-free
of struct fib6_info.

ip6_dst_destroy(), fib6_drop_pcpu_from() and rt6_remove_exception()
are writers vs rt->from, and use non consistent synchronization among
themselves.

Switching to xchg() will solve the issues with no possible
lockdep issues.

BUG: KASAN: user-memory-access in atomic_dec_and_test include/asm-generic/atomic-instrumented.h:747 [inline]
BUG: KASAN: user-memory-access in fib6_info_release include/net/ip6_fib.h:294 [inline]
BUG: KASAN: user-memory-access in fib6_info_release include/net/ip6_fib.h:292 [inline]
BUG: KASAN: user-memory-access in fib6_drop_pcpu_from net/ipv6/ip6_fib.c:927 [inline]
BUG: KASAN: user-memory-access in fib6_purge_rt+0x4f6/0x670 net/ipv6/ip6_fib.c:960
Write of size 4 at addr 0000000000ffffb4 by task syz-executor.1/7649

CPU: 0 PID: 7649 Comm: syz-executor.1 Not tainted 5.1.0-rc6+ #183
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
kasan_report.cold+0x5/0x40 mm/kasan/report.c:321
check_memory_region_inline mm/kasan/generic.c:185 [inline]
check_memory_region+0x123/0x190 mm/kasan/generic.c:191
kasan_check_write+0x14/0x20 mm/kasan/common.c:108
atomic_dec_and_test include/asm-generic/atomic-instrumented.h:747 [inline]
fib6_info_release include/net/ip6_fib.h:294 [inline]
fib6_info_release include/net/ip6_fib.h:292 [inline]
fib6_drop_pcpu_from net/ipv6/ip6_fib.c:927 [inline]
fib6_purge_rt+0x4f6/0x670 net/ipv6/ip6_fib.c:960
fib6_del_route net/ipv6/ip6_fib.c:1813 [inline]
fib6_del+0xac2/0x10a0 net/ipv6/ip6_fib.c:1844
fib6_clean_node+0x3a8/0x590 net/ipv6/ip6_fib.c:2006
fib6_walk_continue+0x495/0x900 net/ipv6/ip6_fib.c:1928
fib6_walk+0x9d/0x100 net/ipv6/ip6_fib.c:1976
fib6_clean_tree+0xe0/0x120 net/ipv6/ip6_fib.c:2055
__fib6_clean_all+0x118/0x2a0 net/ipv6/ip6_fib.c:2071
fib6_clean_all+0x2b/0x40 net/ipv6/ip6_fib.c:2082
rt6_sync_down_dev+0x134/0x150 net/ipv6/route.c:4057
rt6_disable_ip+0x27/0x5f0 net/ipv6/route.c:4062
addrconf_ifdown+0xa2/0x1220 net/ipv6/addrconf.c:3705
addrconf_notify+0x19a/0x2260 net/ipv6/addrconf.c:3630
notifier_call_chain+0xc7/0x240 kernel/notifier.c:93
__raw_notifier_call_chain kernel/notifier.c:394 [inline]
raw_notifier_call_chain+0x2e/0x40 kernel/notifier.c:401
call_netdevice_notifiers_info+0x3f/0x90 net/core/dev.c:1753
call_netdevice_notifiers_extack net/core/dev.c:1765 [inline]
call_netdevice_notifiers net/core/dev.c:1779 [inline]
dev_close_many+0x33f/0x6f0 net/core/dev.c:1522
rollback_registered_many+0x43b/0xfd0 net/core/dev.c:8177
rollback_registered+0x109/0x1d0 net/core/dev.c:8242
unregister_netdevice_queue net/core/dev.c:9289 [inline]
unregister_netdevice_queue+0x1ee/0x2c0 net/core/dev.c:9282
unregister_netdevice include/linux/netdevice.h:2658 [inline]
__tun_detach+0xd5b/0x1000 drivers/net/tun.c:727
tun_detach drivers/net/tun.c:744 [inline]
tun_chr_close+0xe0/0x180 drivers/net/tun.c:3443
__fput+0x2e5/0x8d0 fs/file_table.c:278
____fput+0x16/0x20 fs/file_table.c:309
task_work_run+0x14a/0x1c0 kernel/task_work.c:113
exit_task_work include/linux/task_work.h:22 [inline]
do_exit+0x90a/0x2fa0 kernel/exit.c:876
do_group_exit+0x135/0x370 kernel/exit.c:980
__do_sys_exit_group kernel/exit.c:991 [inline]
__se_sys_exit_group kernel/exit.c:989 [inline]
__x64_sys_exit_group+0x44/0x50 kernel/exit.c:989
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x458da9
Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffeafc2a6a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 000000000000001c RCX: 0000000000458da9
RDX: 0000000000412a80 RSI: 0000000000a54ef0 RDI: 0000000000000043
RBP: 00000000004be552 R08: 000000000000000c R09: 000000000004c0d1
R10: 0000000002341940 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00007ffeafc2a7f0 R14: 000000000004c065 R15: 00007ffeafc2a800

Fixes: a68886a69180 ("net/ipv6: Make from in rt6_info rcu protected")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: David Ahern <dsahern@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: remove manual autoneg restart workaround

According to Neil who reported the issue leading to this
workaround, the workaround is no longer needed since
version 5.0. So let's remove it.

This was the bug report leading to the workaround:
https://bugzilla.kernel.org/show_bug.cgi?id=201081

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Neil MacLeod <neil@nmacleod.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'r8169-improve-eri-function-handling'

Heiner Kallweit says:

====================
r8169: improve eri function handling

This series aims at improving and simplifying the eri functions.
No functional change intended.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: add rtl_reset_packet_filter

Fortunately in one place there's a comment explaining what toggling
this bit does. So let's create a helper for it.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: add helpers rtl_eri_set/clear_bits

Add helpers rtl_eri_set_bits and rtl_eri_clear_bits to improve
readability of the code.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: make ERIAR_EXGMAC the default in eri functions

In basically all eri function calls the type argument is ERIAR_EXGMAC.
Therefore make it the default.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Convert-mv88e6060-to-mdio-device'

Andrew Lunn says:

====================
Convert mv88e6060 to mdio device

This patchset builds upon the previous patches to mv88e6060. It adds
support for probing the switch as an MDIO device and then removes the
legacy probe method. Since this is the last device supporting legacy
probe, this allows legacy probe to be removed, originally planned to
be removed in 4.17, but took a bit longer.

This change to the mv88e6060 is more risky than the previous
patchset. Some attempts to test it have been made, by hacking the
driver to match on an mv88e6352 so that it probes. These changes are
all about probe, so it is a reasonable test. But testing on a real
mv88e6060 would be great.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: DSA: Remove legacy binding

Now that the code to support the legacy binding has been removed,
remove the documentation for it.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Remove legacy probing support

Now that all drivers can be probed using more traditional methods,
remove the legacy probe code.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6060: Remove support for legacy probing

Now that the driver can be probed as an mdio device, remove the legacy
DSA platform device probing.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6060: Support probing as an mdio device

Probing DSA devices as platform devices has been superseded by using
normal bus drivers. Add support for probing the mv88e6060 device as an
mdio device.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dsa-core-vlan'

Vladimir Oltean says:

====================
Improvements to DSA core VLAN manipulation

In preparation of submitting the NXP SJA1105 driver, the Broadcom b53
and Mediatek mt7530 drivers have been found to apply some VLAN
workarounds that are needed in the new driver as well.

Therefore this patchset is mostly simply promoting the DSA driver
workarounds for VLAN to the generic code.

The b53 driver was applying a few workarounds in order to convince DSA
that its vlan_filtering setting is not really per-port. This is now
simply set by the driver via a DSA variable at probe time. The sja1105
driver will be a second user of this.

The mt7530 was also keeping track of when the .port_vlan_filtering
callback was being called. Remove the kept state from this driver
and simplify dealing with vlan_filtering in the generic case.

TODO:

Find the best way to deal generically with the situation described below
(discussion at https://lkml.org/lkml/2019/4/16/1355):

> > +Segregating the switch ports in multiple bridges is supported (e.g. 2 + 2), but
> > +all bridges should have the same level of VLAN awareness (either both have
> > +``vlan_filtering`` 0, or both 1). Also an inevitable limitation of the fact
> > +that VLAN awareness is global at the switch level is that once a bridge with
> > +``vlan_filtering`` enslaves at least one switch port, the other un-bridged
> > +ports are no longer available for standalone traffic termination.
>
> That is quite a limitation that I don't think I had fully grasped until
> reading your different patches. Since enslaving ports into a bridge
> comes after the network device was already made available for use, maybe
> you should force the carrier down or something along those lines as soon
> as a port is enslaved into a bridge with vlan_filtering=1 to make this
> more predictable for the user?
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Add more convenient functions for installing port VLANs

This hides the need to perform a two-phase transaction and construct a
switchdev_obj_port_vlan struct.

Call graph (including a function that will be introduced in a follow-up
patch) looks like this now (same for the *_vlan_del function):

dsa_slave_vlan_rx_add_vid   dsa_port_setup_8021q_tagging
            |                        |
            |                        |
            |          +-------------+
            |          |
            v          v
           dsa_port_vid_add      dsa_slave_port_obj_add
                  |                         |
                  +-------+         +-------+
                          |         |
                          v         v
                       dsa_port_vlan_add

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: b53: Use vlan_filtering property from dsa_switch

While possible (and safe) to use the newly introduced
dsa_port_is_vlan_filtering helper, fabricating a dsa_port pointer is a
bit awkward, so simply retrieve this from the dsa_switch structure.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: b53: Let DSA call .port_vlan_filtering only when necessary

Since DSA has recently learned to treat better with drivers that set
vlan_filtering_is_global, doing this is no longer required.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Skip calling .port_vlan_filtering on no change

Even if VLAN filtering is global, DSA will call this callback once per
each port. Drivers should not have to compare the global state with the
requested change. So let DSA do it.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mt7530: Use the DSA vlan_filtering helper function

This was recently introduced, so keeping state inside the driver is no
longer necessary.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Add helper function to retrieve VLAN awareness setting

Since different types of hardware may or may not support this setting
per-port, DSA keeps it either in dsa_switch or in dsa_port.

While drivers may know the characteristics of their hardware and
retrieve it from the correct place without the need of helpers, it is
cumbersone to find out an unambigous answer from generic DSA code.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Keep the vlan_filtering setting in dsa_switch if it's global

The current behavior is not as obvious as one would assume (which is
that, if the driver set vlan_filtering_is_global = 1, then checking any
dp->vlan_filtering would yield the same result). Only the ports which
are actively enslaved into a bridge would have vlan_filtering set.

This makes it tricky for drivers to check what the global state is.
So fix this and make the struct dsa_switch hold this global setting.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mt7530: Let DSA handle the unsetting of vlan_filtering

The driver, recognizing that the .port_vlan_filtering callback was never
coming after the port left its parent bridge, decided to take that duty
in its own hands. DSA now takes care of this condition, so fix that.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Unset vlan_filtering when ports leave the bridge

When ports are standalone (after they left the bridge), they should have
no VLAN filtering semantics (they should pass all traffic to the CPU).
Currently this is not true for switchdev drivers, because the bridge
"forgets" to unset that.

Normally one would think that doing this at the bridge layer would be a
better idea, i.e. call br_vlan_filter_toggle() from br_del_if(), similar
to how nbp_vlan_init() is called from br_add_if().

However what complicates that approach, and makes this one preferable,
is the fact that for the bridge core, vlan_filtering is a per-bridge
setting, whereas for switchdev/DSA it is per-port. Also there are
switches where the setting is per the entire device, and unsetting
vlan_filtering one by one, for each leaving port, would not be possible
from the bridge core without a certain level of awareness. So do this in
DSA and let drivers be unaware of it.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: b53: Let DSA handle mismatched VLAN filtering settings

The DSA core is now able to do this check prior to calling the
.port_vlan_filtering callback, so tell it that VLAN filtering is global
for this particular hardware.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>