review.tizen.org Git - platform/kernel/linux-starfive.git/log

net/mlx5e: Offload flow rules to active lower representor

When a bond device is created over one or more non uplink representors,
and when a flow rule is offloaded to such bond device, offload a rule
to the active lower device.

Assuming that this is active-backup lag, the rules should be offloaded
to the active lower device which is the representor of the direct
path (not the failover).

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Support tc block sharing for representors

Currently offloading a rule over a tc block shared by multiple
representors fails because an e-switch global hashtable to keep
the mapping from tc cookies to mlx5e flow instances is used, and
tc block sharing offloads the same rule/cookie multiple times,
each time for different representor sharing the tc block.

Changing the implementation and behavior by acknowledging and returning
success if the same rule/cookie is offloaded again to other slave
representor sharing the tc block by setting, checking and comparing
the netdev that added the rule first.

Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Use netdev events to set/del egress acl forward-to-vport rule

Register a notifier block to handle netdev events for bond device
of non-uplink representors to support eswitch vports bonding.

When a non-uplink representor is a lower dev (slave) of bond and
becomes active, adding egress acl forward-to-vport rule of all slave
netdevs (active + standby) to forward to this representor's vport. Use
change lower netdev event to do this.

Use change upper event to detect slave representor unslaved from lag
device to delete its vport egress acl forward rule if any.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: E-Switch, Introduce APIs to enable egress acl forward-to-vport rule

By default, e-switch vport's egress acl just forward packets to its
counterpart NIC vport using existing egress acl table.

During port failover in bonding scenario where two VFs representors
are bonded, the egress acl forward-to-vport rule will be added to
the existing egress acl table of e-switch vport of passive/inactive
slave representor to forward packets to other NIC vport ie. the active
slave representor's NIC vport to handle egress "failover" traffic.

Enable egress acl and have APIs to create and destroy egress acl
forward-to-vport rule and group.

Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: E-Switch, Refactor eswitch ingress acl codes

Restructure the eswitch ingress acl codes into eswitch directory
and different files:
. Acl ingress helper functions to acl_helper.c/h
. Acl ingress functions used in offloads mode to acl_ingress_ofld.c
. Acl ingress functions used in legacy mode to acl_ingress_lgy.c

This patch does not change any functionality.

Signed-off-by: Vu Pham <vuhuong@mellanox.com>

net/mlx5: E-Switch, Refactor eswitch egress acl codes

Refactor the egress acl codes so that offloads and legacy modes
can configure specifically their own needs of egress acl table,
groups and rules. While at it, restructure the eswitch egress
acl codes into eswitch directory and different files:
. Acl egress helper functions to acl_helper.c/h
. Acl egress functions used in offloads mode to acl_egress_ofld.c
. Acl egress functions used in legacy mode to acl_egress_lgy.c

This patch does not change any functionality.

Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net: phy: at803x: add cable diagnostics support for ATH9331 and ATH8032

Add support for Atheros 100Base-T PHYs. The only difference seems to be
the ability to test 2 pairs instead of 4 and the lack of 1000Base-T
specific register.

Only the ATH9331 was tested with this patch.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Raw-PHY-TDR-data'

Andrew Lunn says:

====================
Raw PHY TDR data

Some ethernet PHYs allow access to raw TDR data in addition to summary
diagnostics information. Add support for retrieving this data via
netlink ethtool. The basic structure in the core is the same as for
normal phy diagnostics, the PHY driver simply uses different helpers
to fill the netlink message with different data.

There is a graphical tool under development, as well a ethtool(1)
which can dump the data as text and JSON.

A patched ethtool(1) can be found in
https://github.com/lunn/ethtool.git feature/cable-test-v5

Thanks for Chris Healy for lots of testing.

v2:
See the individual patches but:

Pass distances in centimeters, not meters

Allow the PHY to round distances to what it supports and report what
it actually used along with the results.

Make the Marvell PHY use steps a multiple of 0.805 meters, its native
step size.

v3:
Move the TDR configuration into a structure
Add a range check on step
Use NL_SET_ERR_MSG_ATTR() when appropriate
Move TDR configuration into a nest
Document attributes in the request
Unsquash the last two patches
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: marvell: Configure TDR pulse based on measurement length

When performing a TDR measurement for a short distance, the pulse
width should be low, to help differentiate between the outgoing pulse
and any reflection. For longer distances, the pulse should be wider,
to help with attenuation.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net : phy: marvell: Speedup TDR data retrieval by only changing page once

Getting the TDR data requires a large number of MDIO bus
transactions. The number can however be reduced if the page is only
changed once. Add the needed locking to allow this, and make use of
unlocked read/write methods where needed.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethtool: Allow PHY cable test TDR data to configured

Allow the user to configure where on the cable the TDR data should be
retrieved, in terms of first and last sample, and the step between
samples. Also add the ability to ask for TDR data for just one pair.

If this configuration is not provided, it defaults to 1-150m at 1m
intervals for all pairs.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
v3:
Move the TDR configuration into a structure
Add a range check on step
Use NL_SET_ERR_MSG_ATTR() when appropriate
Move TDR configuration into a nest
Document attributes in the request

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: marvell: Add support for amplitude graph

The Marvell PHYs can measure the amplitude of the returned signal for
a given distance. Implement this option of the cable test
infrastructure. When reporting the step, convert the distance into cm.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
v2:
Step based on the measurement resolution, and convert this to cm.

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethtool: Add helpers for cable test TDR data

Add helpers for returning raw TDR helpers in netlink messages.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethtool: Add generic parts of cable test TDR

Add the generic parts of the code used to trigger a cable test and
return raw TDR data. Any PHY driver which support this must implement
the new driver op.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
v2
Update nxp-tja11xx for API change.

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethtool: Add attributes for cable test TDR data

Some Ethernet PHYs can return the raw time domain reflectromatry data.
Add the attributes to allow this data to be requested and returned via
netlink ethtool.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
v2:
m -> cm
Report what the PHY actually used for start/stop/step.

Signed-off-by: David S. Miller <davem@davemloft.net>

ne2k-pci: Fix various coding-style issues and improve printk() usage

Fixed a ton of minor checkpatch errors/warnings and remove version
printing at module init/when device is found and use MODULE_VERSION
instead. Also modifying the RTL8029 PCI string to include the compatible
RTL8029AS nic.
The only mayor issue remaining is the missing SPDX tag, but since the
exact version of the GPL is not stated anywhere inside the file, its
impossible to add such a tag at the moment.
But maybe it is possible, since 8390.h states Donald Becker's 8390
drivers are licensed under GPL 2.2 only (= GPL-2.0-only ?).
The kernel module containing this patch compiles and runs without
problems on a RTL8029AS-based NE2000 clone card with kernel 5.7.0-rc6.

Signed-off-by: Armin Wolf <W_Armin@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

macvlan: Skip loopback packets in RX handler

Ignore loopback-originatig packets soon enough and don't try to process L2
header where it doesn't exist. The very similar br_handle_frame() in bridge
code performs exactly the same check.

This is an example of such ICMPv6 packet:

skb len=96 headroom=40 headlen=96 tailroom=56
mac=(40,0) net=(40,40) trans=80
shinfo(txflags=0 nr_frags=0 gso(size=0 type=0 segs=0))
csum(0xae2e9a2f ip_summed=1 complete_sw=0 valid=0 level=0)
hash(0xc97ebd88 sw=1 l4=1) proto=0x86dd pkttype=5 iif=24
dev name=etha01.212 feat=0x0x0000000040005000
skb headroom: 00000000: 00 7c 86 52 84 88 ff ff 00 00 00 00 00 00 08 00
skb headroom: 00000010: 45 00 00 9e 5d 5c 40 00 40 11 33 33 00 00 00 01
skb headroom: 00000020: 02 40 43 80 00 00 86 dd
skb linear:   00000000: 60 09 88 bd 00 38 3a ff fe 80 00 00 00 00 00 00
skb linear:   00000010: 00 40 43 ff fe 80 00 00 ff 02 00 00 00 00 00 00
skb linear:   00000020: 00 00 00 00 00 00 00 01 86 00 61 00 40 00 00 2d
skb linear:   00000030: 00 00 00 00 00 00 00 00 03 04 40 e0 00 00 01 2c
skb linear:   00000040: 00 00 00 78 00 00 00 00 fd 5f 42 68 23 87 a8 81
skb linear:   00000050: 00 00 00 00 00 00 00 00 01 01 02 40 43 80 00 00
skb tailroom: 00000000: ...
skb tailroom: 00000010: ...
skb tailroom: 00000020: ...
skb tailroom: 00000030: ...

Call Trace, how it happens exactly:
...
macvlan_handle_frame+0x321/0x425 [macvlan]
? macvlan_forward_source+0x110/0x110 [macvlan]
__netif_receive_skb_core+0x545/0xda0
? enqueue_task_fair+0xe5/0x8e0
? __netif_receive_skb_one_core+0x36/0x70
__netif_receive_skb_one_core+0x36/0x70
process_backlog+0x97/0x140
net_rx_action+0x1eb/0x350
? __hrtimer_run_queues+0x136/0x2e0
__do_softirq+0xe3/0x383
do_softirq_own_stack+0x2a/0x40
</IRQ>
do_softirq.part.4+0x4e/0x50
netif_rx_ni+0x60/0xd0
dev_loopback_xmit+0x83/0xf0
ip6_finish_output2+0x575/0x590 [ipv6]
? ip6_cork_release.isra.1+0x64/0x90 [ipv6]
? __ip6_make_skb+0x38d/0x680 [ipv6]
? ip6_output+0x6c/0x140 [ipv6]
ip6_output+0x6c/0x140 [ipv6]
ip6_send_skb+0x1e/0x60 [ipv6]
rawv6_sendmsg+0xc4b/0xe10 [ipv6]
? proc_put_long+0xd0/0xd0
? rw_copy_check_uvector+0x4e/0x110
? sock_sendmsg+0x36/0x40
sock_sendmsg+0x36/0x40
___sys_sendmsg+0x2b6/0x2d0
? proc_dointvec+0x23/0x30
? addrconf_sysctl_forward+0x8d/0x250 [ipv6]
? dev_forward_change+0x130/0x130 [ipv6]
? _raw_spin_unlock+0x12/0x30
? proc_sys_call_handler.isra.14+0x9f/0x110
? __call_rcu+0x213/0x510
? get_max_files+0x10/0x10
? trace_hardirqs_on+0x2c/0xe0
? __sys_sendmsg+0x63/0xa0
__sys_sendmsg+0x63/0xa0
do_syscall_64+0x6c/0x1e0
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-Various-trap-changes-part-2'

Ido Schimmel says:

====================
mlxsw: Various trap changes - part 2

This patch set contains another set of small changes in mlxsw trap
configuration. It is the last set before exposing control traps (e.g.,
IGMP query, ARP request) via devlink-trap.

Tested with existing devlink-trap selftests. Please see individual
patches for a detailed changelog.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Allow programming link-local prefix routes

The device has a trap for IPv6 packets that need be routed and have a
unicast link-local destination IP (i.e., fe80::/10). This allows mlxsw
to ignore link-local routes, as the packets will be trapped to the CPU
in any case.

However, since link-local routes are not programmed, it is possible for
routed packets to hit the default route which might also be programmed
to trap packets. This means that packets with a link-local destination
IP might be trapped for the wrong reason.

To overcome this, allow programming link-local prefix routes (usually
one fe80::/64 per-table), so that the packets will be forwarded until
reaching the link-local trap.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Add packet traps for BFD packets

Bidirectional Forwarding Detection (BFD) provides "low-overhead,
short-duration detection of failures in the path between adjacent
forwarding engines" (RFC 5880).

This is accomplished by exchanging BFD packets between the two
forwarding engines. Up until now these packets were trapped via the
general local delivery (i.e., IP2ME) trap which also traps a lot of
other packets that are not as time-sensitive as BFD packets.

Expose dedicated traps for BFD packets so that user space could
configure a dedicated policer for them.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Treat IPv6 link-local SIP as an exception

IPv6 packets that need to be forwarded and have a link-local source IP are
dropped by the kernel and an ICMPv6 "Destination unreachable" is sent to
the sending host.

As such, change the trap group of such packets so that they do not
interfere with IPv6 management packets. In the future this trap will be
exposed as an exception via devlink-trap.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Share one group for all locally delivered packets

Routed IP packets with the Router Alert option need to be trapped to
the CPU as they might need to be locally delivered to raw sockets with
the IP_ROUTER_ALERT / IPV6_ROUTER_ALERT socket option.

Move them to the same group with other packets that might need to be
trapped following route lookup.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: reg: Move all trap groups under the same enum

After the previous patch the split is no longer necessary and all the
trap groups can be moved under the same enum.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_trap: Do not hard code "thin" policer identifier

As explained in commit e612523041ab ("mlxsw: spectrum_trap: Introduce
dummy group with thin policer"), the purpose of the "thin" policer is to
pass as less packets as possible to the CPU.

The identifier of this policer is currently set according to the maximum
number of used trap groups, but this is fragile: On Spectrum-1 the
maximum number of policers is less than the maximum number of trap
groups, which might result in an invalid policer identifier in case the
number of used trap groups grows beyond the policer limit.

Solve this by dynamically allocating the policer identifier.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: switchx2: Move SwitchX-2 trap groups out of main enum

The number of Spectrum trap groups is not infinite, but two identifiers
are occupied by SwitchX-2 specific trap groups. Free these identifiers
by moving them out of the main enum.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Reduce priority of locally delivered packets

To align with recent recommended values. Will be configurable by future
patches.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Use same trap group for local routes and link-local destination

Packets with an IPv6 link-local destination (i.e., fe80::/10) should not
be forwarded and are therefore trapped to the CPU for local delivery.
Since these packets are trapped for the same logical reason as packets
hitting local routes, associate both traps with the same group.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Use separate trap group for FID miss

When a packet enters the device it is classified to a filtering
identifier (FID) based on the ingress port and VLAN. The FID miss trap
is used to trap packets for which a FID could not be found.

In mlxsw this trap should only be triggered when a port is enslaved to
an OVS bridge and a matching ACL rule could not be found, so as to
trigger learning.

These packets are therefore completely unrelated to packets hitting
local routes and should be in a different group. Move them.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Use same trap group for various IPv6 packets

Group these various IPv6 packets (e.g., router solicitations, router
advertisement) together and subject them to the same policer.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Rename IPv6 ND trap group

The IPv6 Neighbour Discovery (ND) group will be used for various IPv6
packets, not all of which fall under the definition of ND, so rename it
to "IPV6" which is more appropriate.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Use same switch case for identical groups

Trap groups that use the same policer settings can share the same switch
case.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Use dedicated trap group for ACL trap

Packets that are trapped via tc's trap action are currently subject to
the same policer as packets hitting local routes. The latter are
critical to the correct functioning of the control plane, while the
former are mainly used for traffic inspection.

Split the ACL trap to a separate group with its own policer. Use a
higher priority for these traps than for traps using mirror action
(e.g., ARP, IGMP). Otherwise, packets matching both traps will not be
forwarded in hardware (because of trap action) and also not forwarded in
software because they will be marked with 'offload_fwd_mark'.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: attempt coalescing when moving skbs to mptcp rx queue

We can try to coalesce skbs we take from the subflows rx queue with the
tail of the mptcp rx queue.

If successful, the skb head can be discarded early.

We can also free the skb extensions, we do not access them after this.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: improve rtl_remove_one

Don't call netif_napi_del() manually, free_netdev() does this for us.
In addition reorder calls to match reverse order of calls in probe().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-ethernet-fec-move-GPR-register-offset-and-bit-into-DT'

Fugang Duan says:

====================
net: ethernet: fec: move GPR register offset and bit into DT

The commit da722186f654 (net: fec: set GPR bit on suspend by
DT configuration) set the GPR reigster offset and bit in driver
for wol feature support.

It brings trouble to enable wol feature on imx6sx/imx6ul/imx7d
platforms that have multiple ethernet instances with different
GPR bit for stop mode control. So the patch set is to move GPR
register offset and bit define into DT, and enable imx6q/imx6dl
imx6qp/imx6sx/imx6ul/imx7d stop mode support.

Currently, below NXP i.MX boards support wol:
- imx6q/imx6dl/imx6qp sabresd
- imx6sx sabreauto
- imx7d sdb

imx6q/imx6dl/imx6qp sabresd board dts file miss the property
"fsl,magic-packet;", so patch#4 is to add the property for stop
mode support.

v1 -> v2:
- driver: switch back to store the quirks bitmask in driver_data
- dt-bindings: rename 'gpr' property string to 'fsl,stop-mode'
- imx6/7 dtsi: add imx6sx/imx6ul/imx7d ethernet stop mode property
v2 -> v3:
- driver: suggested by Sascha Hauer, use a struct fec_devinfo for
abstracting differences between different hardware variants,
it can give more freedom to describe the differences.
- imx6/7 dtsi: correct one typo pointed out by Andrew.

Thanks Martin, Andrew and Sascha Hauer for the review.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ARM: dts: imx6qdl-sabresd: enable fec wake-on-lan

Enable ethernet wake-on-lan feature for imx6q/dl/qp sabresd
boards since the PHY clock is supplied by external osc.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Fugang Duan <fugang.duan@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ARM: dts: imx: add ethernet stop mode property

- Update the imx6qdl gpr property to define gpr register
offset and bit in DT.
- Add imx6sx/imx6ul/imx7d ethernet stop mode property.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Fugang Duan <fugang.duan@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: fec: update the gpr property

- rename the 'gpr' property string to 'fsl,stop-mode'.
- Update the property to define gpr register offset and
bit in DT, since different instance have different gpr bit.

v2:
* rename 'gpr' property string to 'fsl,stop-mode'.

Signed-off-by: Fugang Duan <fugang.duan@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: fec: move GPR register offset and bit into DT

The commit da722186f654 (net: fec: set GPR bit on suspend by DT
configuration) set the GPR reigster offset and bit in driver for
wake on lan feature.

But it introduces two issues here:
- one SOC has two instances, they have different bit
- different SOCs may have different offset and bit

So to support wake-on-lan feature on other i.MX platforms, it should
configure the GPR reigster offset and bit from DT.

So the patch is to improve the commit da722186f654 (net: fec: set GPR
bit on suspend by DT configuration) to support multiple ethernet
instances on i.MX series.

v2:
* switch back to store the quirks bitmask in driver_data
v3:
* suggested by Sascha Hauer, use a struct fec_devinfo for
abstracting differences between different hardware variants,
it can give more freedom to describe the differences.

Signed-off-by: Fugang Duan <fugang.duan@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: mark smc_pnet_policy as const

Netlink policies are generally declared as const.
This is safer and prevents potential bugs.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mac80211-next-for-net-next-2020-04-25' of git://git./linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
One batch of changes, containing:
* hwsim improvements from Jouni and myself, to be able to
test more scenarios easily
* some more HE (802.11ax) support
* some initial S1G (sub 1 GHz) work for fractional MHz channels
* some (action) frame registration updates to help DPP support
* along with other various improvements/fixes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-phy-mscc-miim-reduce-waiting-time-between-MDIO-transactions'

Antoine Tenart says:

====================
net: phy: mscc-miim: reduce waiting time between MDIO transactions

This series aims at reducing the waiting time between MDIO transactions
when using the MSCC MIIM MDIO controller.

I'm not sure we need patch 4/4 and we could reasonably drop it from the
series. I'm including the patch as it could help to ensure the system
is functional with a non optimal configuration.

We needed to improve the driver's performances as when using a PHY
requiring lots of registers accesses (such as the VSC85xx family),
delays would add up and ended up to be quite large which would cause
issues such as: a slow initialization of the PHY, and issues when using
timestamping operations (this feature will be sent quite soon to the
mailing lists).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: mscc-miim: read poll when high resolution timers are disabled

The driver uses a read polling mechanism to check the status of the MDIO
bus, to know if it is ready to accept next commands. This polling
mechanism uses usleep_delay() under the hood between reads which is fine
as long as high resolution timers are enabled. Otherwise the delays will
end up to be much longer than expected.

This patch fixes this by using udelay() under the hood when
CONFIG_HIGH_RES_TIMERS isn't enabled. This increases CPU usage.

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: mscc-miim: improve waiting logic

The MSCC MIIM MDIO driver uses a waiting logic to wait for the MDIO bus
to be ready to accept next commands. It does so by polling the BUSY
status bit which indicates the MDIO bus has completed all pending
operations. This can take time, and the controller supports writing the
next command as soon as there are no pending commands (which happens
while the MDIO bus is busy completing its current command).

This patch implements this improved logic by adding an helper to poll
the PENDING status bit, and by adjusting where we should wait for the
bus to not be busy or to not be pending.

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: mscc-miim: remove redundant timeout check

readl_poll_timeout already returns -ETIMEDOUT if the condition isn't
satisfied, there's no need to check again the condition after calling
it. Remove the redundant timeout check.

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: mscc-miim: use more reasonable delays

The MSCC MIIM MDIO driver uses delays to read poll a status register. I
made multiple tests on a Ocelot PCS120 platform which led me to reduce
those delays. The delay in between which the polling function is allowed
to sleep is reduced from 100us to 50us which in almost all cases is a
good value to succeed at the first retry. The overall delay is also
lowered as the prior value was really way to high, 10000us is large
enough.

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mdiobus: add clause 45 mdiobus accessors

There is a recurring pattern throughout some of the PHY code converting
a devad and regnum to our packed clause 45 representation. Rather than
having this scattered around the code, let's put a common translation
function in mdio.h, and provide some register accessors.

Convert the phylib core, phylink, bcm87xx and cortina to use these.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'flow-mpls'

Guillaume Nault says:

====================
flow_dissector, cls_flower: Add support for multiple MPLS Label Stack Entries

Currently, the flow dissector and the Flower classifier can only handle
the first entry of an MPLS label stack. This patch series generalises
the code to allow parsing and matching the Label Stack Entries that
follow.

Patch 1 extends the flow dissector to parse MPLS LSEs until the Bottom
Of Stack bit is reached. The number of parsed LSEs is capped at
FLOW_DIS_MPLS_MAX (arbitrarily set to 7). Flower and the NFP driver
are updated to take into account the new layout of struct
flow_dissector_key_mpls.

Patch 2 extends Flower. It defines new netlink attributes, which are
independent from the previous MPLS ones. Mixing the old and the new
attributes in a same filter is not allowed. For backward compatibility,
the old attributes are used when dumping filters that don't require the
new ones.

Changes since v2:
  * Fix compilation with the new MLX5 bareudp tunnel code.

Changes since v1:
  * Fix compilation of NFP driver (kbuild test robot).
  * Fix sparse warning with entropy label (kbuild test robot).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

cls_flower: Support filtering on multiple MPLS Label Stack Entries

With struct flow_dissector_key_mpls now recording the first
FLOW_DIS_MPLS_MAX labels, we can extend Flower to filter on any of
these LSEs independently.

In order to avoid creating new netlink attributes for every possible
depth, let's define a new TCA_FLOWER_KEY_MPLS_OPTS nested attribute
that contains the list of LSEs to match. Each LSE is represented by
another attribute, TCA_FLOWER_KEY_MPLS_OPTS_LSE, which then contains
the attributes representing the depth and the MPLS fields to match at
this depth (label, TTL, etc.).

For each MPLS field, the mask is always set to all-ones, as this is
what the original API did. We could allow user configurable masks in
the future if there is demand for more flexibility.

The new API also allows to only specify an LSE depth. In that case,
Flower only verifies that the MPLS label stack depth is greater or
equal to the provided depth (that is, an LSE exists at this depth).

Filters that only match on one (or more) fields of the first LSE are
dumped using the old netlink attributes, to avoid confusing user space
programs that don't understand the new API.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

flow_dissector: Parse multiple MPLS Label Stack Entries

The current MPLS dissector only parses the first MPLS Label Stack
Entry (second LSE can be parsed too, but only to set a key_id).

This patch adds the possibility to parse several LSEs by making
__skb_flow_dissect_mpls() return FLOW_DISSECT_RET_PROTO_AGAIN as long
as the Bottom Of Stack bit hasn't been seen, up to a maximum of
FLOW_DIS_MPLS_MAX entries.

FLOW_DIS_MPLS_MAX is arbitrarily set to 7. This should be enough for
many practical purposes, without wasting too much space.

To record the parsed values, flow_dissector_key_mpls is modified to
store an array of stack entries, instead of just the values of the
first one. A bit field, "used_lses", is also added to keep track of
the LSEs that have been set. The objective is to avoid defining a
new FLOW_DISSECTOR_KEY_MPLS_XX for each level of the MPLS stack.

TC flower is adapted for the new struct flow_dissector_key_mpls layout.
Matching on several MPLS Label Stack Entries will be added in the next
patch.

The NFP and MLX5 drivers are also adapted: nfp_flower_compile_mac() and
mlx5's parse_tunnel() now verify that the rule only uses the first LSE
and fail if it doesn't.

Finally, the behaviour of the FLOW_DISSECTOR_KEY_MPLS_ENTROPY key is
slightly modified. Instead of recording the first Entropy Label, it
now records the last one. This shouldn't have any consequences since
there doesn't seem to have any user of FLOW_DISSECTOR_KEY_MPLS_ENTROPY
in the tree. We'd probably better do a hash of all parsed MPLS labels
instead (excluding reserved labels) anyway. That'd give better entropy
and would probably also simplify the code. But that's not the purpose
of this patch, so I'm keeping that as a future possible improvement.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'batadv-next-for-davem-20200526' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This cleanup patchset includes the following patches:

- Fix revert dynamic lockdep key changes for batman-adv,
by Sven Eckelmann

- use rcu_replace_pointer() where appropriate, by Antonio Quartulli

- Revert "disable ethtool link speed detection when auto negotiation
off", by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tipc-add-some-improvements'

Tuong Lien says:

====================
tipc: add some improvements

This series adds some improvements to TIPC.

The first patch improves the TIPC broadcast's performance with the 'Gap
ACK blocks' mechanism similar to unicast before, while the others give
support on tracing & statistics for broadcast links, and an alternative
to carry broadcast retransmissions via unicast which might be useful in
some cases.

Besides, the Nagle algorithm can now automatically 'adjust' itself
depending on the specific network condition a stream connection runs by
the last patch.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: add test for Nagle algorithm effectiveness

When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.

In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.

In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.

When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.

Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.

This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.

In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.

Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: add support for broadcast rcv stats dumping

This commit enables dumping the statistics of a broadcast-receiver link
like the traditional 'broadcast-link' one (which is for broadcast-
sender). The link dumping can be triggered via netlink (e.g. the
iproute2/tipc tool) by the link flag - 'TIPC_NLA_LINK_BROADCAST' as the
indicator.

The name of a broadcast-receiver link of a specific peer will be in the
format: 'broadcast-link:<peer-id>'.

For example:

Link <broadcast-link:1001002>
  Window:50 packets
  RX packets:7841 fragments:2408/440 bundles:0/0
  TX packets:0 fragments:0/0 bundles:0/0
  RX naks:0 defs:124 dups:0
  TX naks:21 acks:0 retrans:0
  Congestion link:0  Send queue max:0 avg:0

In addition, the broadcast-receiver link statistics can be reset in the
usual way via netlink by specifying that link name in command.

Note: the 'tipc_link_name_ext()' is removed because the link name can
now be retrieved simply via the 'l->name'.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: enable broadcast retrans via unicast

In some environment, broadcast traffic is suppressed at high rate (i.e.
a kind of bandwidth limit setting). When it is applied, TIPC broadcast
can still run successfully. However, when it comes to a high load, some
packets will be dropped first and TIPC tries to retransmit them but the
packet retransmission is intentionally broadcast too, so making things
worse and not helpful at all.

This commit enables the broadcast retransmission via unicast which only
retransmits packets to the specific peer that has really reported a gap
i.e. not broadcasting to all nodes in the cluster, so will prevent from
being suppressed, and also reduce some overheads on the other peers due
to duplicates, finally improve the overall TIPC broadcast performance.

Note: the functionality can be turned on/off via the sysctl file:

echo 1 > /proc/sys/net/tipc/bc_retruni
echo 0 > /proc/sys/net/tipc/bc_retruni

Default is '0', i.e. the broadcast retransmission still works as usual.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: add back link trace events

In the previous commit ("tipc: add Gap ACK blocks support for broadcast
link"), we have removed the following link trace events due to the code
changes:

- tipc_link_bc_ack
- tipc_link_retrans

This commit adds them back along with some minor changes to adapt to
the new code.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: introduce Gap ACK blocks for broadcast link

As achieved through commit 9195948fbf34 ("tipc: improve TIPC throughput
by Gap ACK blocks"), we apply the same mechanism for the broadcast link
as well. The 'Gap ACK blocks' data field in a 'PROTOCOL/STATE_MSG' will
consist of two parts built for both the broadcast and unicast types:

31                       16 15                        0
+-------------+-------------+-------------+-------------+
|  bgack_cnt  |  ugack_cnt  |            len            |
+-------------+-------------+-------------+-------------+  -
|            gap            |            ack            |   |
+-------------+-------------+-------------+-------------+    > bc gacks
:                           :                           :   |
+-------------+-------------+-------------+-------------+  -
|            gap            |            ack            |   |
+-------------+-------------+-------------+-------------+    > uc gacks
:                           :                           :   |
+-------------+-------------+-------------+-------------+  -

which is "automatically" backward-compatible.

We also increase the max number of Gap ACK blocks to 128, allowing upto
64 blocks per type (total buffer size = 516 bytes).

Besides, the 'tipc_link_advance_transmq()' function is refactored which
is applicable for both the unicast and broadcast cases now, so some old
functions can be removed and the code is optimized.

With the patch, TIPC broadcast is more robust regardless of packet loss
or disorder, latency, ... in the underlying network. Its performance is
boost up significantly.
For example, experiment with a 5% packet loss rate results:

$ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000
real    0m 42.46s
user    0m 1.16s
sys     0m 17.67s

Without the patch:

$ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000
real    8m 27.94s
user    0m 0.55s
sys     0m 2.38s

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Add EDPM mode type for user-fw compatibility

In older FW versions the completion flag was treated as the ack flag in
edpm messages. Expose the FW option of setting which mode the QP is in
by adding a flag to the qedr <-> qed API.

Flag is added for backward compatibility with libqedr.
This flag will be set by qedr after determining whether the libqedr is
using the updated version.

Fixes: f10939403352 ("qed: Add support for QP verbs")
Signed-off-by: Yuval Basson <yuval.bason@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: tcp_v4_err() icmp skb is named icmp_skb

I missed the fact that tcp_v4_err() differs from tcp_v6_err().

After commit 4d1a2d9ec1c1 ("Rename skb to icmp_skb in tcp_v4_err()")
the skb argument has been renamed to icmp_skb only in one function.

I will in a future patch reconciliate these functions to avoid
this kind of confusion.

Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

batman-adv: Revert "disable ethtool link speed detection when auto negotiation off"

The commit 8c46fcd78308 ("batman-adv: disable ethtool link speed detection
when auto negotiation off") disabled the usage of ethtool's link_ksetting
when auto negotation was enabled due to invalid values when used with
tun/tap virtual net_devices. According to the patch, automatic measurements
should be used for these kind of interfaces.

But there are major flaws with this argumentation:

* automatic measurements are not implemented
* auto negotiation has nothing to do with the validity of the retrieved
values

The first point has to be fixed by a longer patch series. The "validity"
part of the second point must be addressed in the same patch series by
dropping the usage of ethtool's link_ksetting (thus always doing automatic
measurements over ethernet).

Drop the patch again to have more default values for various net_device
types/configurations. The user can still overwrite them using the
batadv_hardif's BATADV_ATTR_THROUGHPUT_OVERRIDE.

Reported-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>

Merge branch 'r8169-sync-hw-config-for-few-chip-versions-with-r8168-vendor-driver'

Heiner Kallweit says:

====================
r8169: sync hw config for few chip versions with r8168 vendor driver

Sync hw config for few chip versions with r8168 vendor driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: sync RTL8168f/RTL8411 hw config with vendor driver

Sync hw config for RTL8168f/RTL8411 with r8168 vendor driver.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: sync RTL8168evl hw config with vendor driver

Sync hw config for RTL8168evl with r8168 vendor driver.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: sync RTL8168h hw config with vendor driver

Sync hw config for RTL8168h with r8168 vendor driver.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: sync RTL8168g hw config with vendor driver

Sync hw config for RTL8168g with r8168 vendor driver.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'wireless-drivers-next-2020-05-25' of git://git./linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers-next patches for v5.8

Second set of patches for v5.8. Lots of new features and new supported
hardware for mt76. Also rtw88 got new hardware support.

Major changes:

rtw88

* add support for Realtek 8723DE PCI adapter

* rename rtw88.ko/rtwpci.ko to rtw88_core.ko/rtw88_pci.ko

iwlwifi

* stop supporting swcrypto and bt_coex_active module parameters on
mvm devices

* enable A-AMSDU in low latency

mt76

* new devices for mt76x0/mt76x2

* support for non-offload firmware on mt7663

* hw/sched scan support for mt7663

* mt7615/mt7663 MSI support

* TDLS support

* mt7603/mt7615 rate control fixes

* new driver for mt7915

* wowlan support for mt7663

* suspend/resume support for mt7663
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: mrp: Fix out-of-bounds read in br_mrp_parse

The issue was reported by syzbot. When the function br_mrp_parse was
called with a valid net_bridge_port, the net_bridge was an invalid
pointer. Therefore the check br->stp_enabled could pass/fail
depending where it was pointing in memory.
The fix consists of setting the net_bridge pointer if the port is a
valid pointer.

Reported-by: syzbot+9c6f0f1f8e32223df9a4@syzkaller.appspotmail.com
Fixes: 6536993371fa ("bridge: mrp: Integrate MRP into the bridge")
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: ipa: print dev_err info accurately

Print certain name string instead of hard-coded "memory" for dev_err
output, which would be more accurate and helpful for debugging.

Signed-off-by: Wang Wenhu <wenhu.wang@vivo.com>
Cc: Alex Elder <elder@kernel.org>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp_clock: Let the ADJ_OFFSET interface respect the ADJ_NANO flag for PHC devices.

In commit 184ecc9eb260d5a3bcdddc5bebd18f285ac004e9 ("ptp: Add adjphase
function to support phase offset control.") the PTP Hardware Clock
interface expanded to support the ADJ_OFFSET offset mode. However,
the implementation did not respect the traditional yet pedantic
distinction between units of microseconds and nanoseconds signaled by
the ADJ_NANO flag. This patch fixes the issue by adding logic to
handle that flag.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Vincent Cheng <vincent.cheng.xh@renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: allow traceroute -Mtcp for unpriv users

Unpriv users can use traceroute over plain UDP sockets, but not TCP ones.

$ traceroute -Mtcp 8.8.8.8
You do not have enough privileges to use this traceroute method.

$ traceroute -n -Mudp 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
1  192.168.86.1  3.631 ms  3.512 ms  3.405 ms
2  10.1.10.1  4.183 ms  4.125 ms  4.072 ms
3  96.120.88.125  20.621 ms  19.462 ms  20.553 ms
4  96.110.177.65  24.271 ms  25.351 ms  25.250 ms
5  69.139.199.197  44.492 ms  43.075 ms  44.346 ms
6  68.86.143.93  27.969 ms  25.184 ms  25.092 ms
7  96.112.146.18  25.323 ms 96.112.146.22  25.583 ms 96.112.146.26  24.502 ms
8  72.14.239.204  24.405 ms 74.125.37.224  16.326 ms  17.194 ms
9  209.85.251.9  18.154 ms 209.85.247.55  14.449 ms 209.85.251.9  26.296 ms^C

We can easily support traceroute over TCP, by queueing an error message
into socket error queue.

Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
enable this feature, and that the error message is only queued
while in SYN_SNT state.

socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_TIMESTAMP_OLD, [1], 4) = 0
setsockopt(3, SOL_IPV6, IPV6_UNICAST_HOPS, [5], 4) = 0
connect(3, {sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
        inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0}, 28) = -1 EHOSTUNREACH (No route to host)
recvmsg(3, {msg_name={sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
        inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0},
        msg_namelen=1024->28, msg_iov=[{iov_base="`\r\337\320\0004\6\1&\7\370\260\200\231\16\27\0\0\0\0\0\0\0\0 \2\n\5f\10\2\227"..., iov_len=1024}],
        msg_iovlen=1, msg_control=[{cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=SO_TIMESTAMP_OLD, cmsg_data={tv_sec=1590340680, tv_usec=272424}},
                                   {cmsg_len=60, cmsg_level=SOL_IPV6, cmsg_type=IPV6_RECVERR}],
        msg_controllen=96, msg_flags=MSG_ERRQUEUE}, MSG_ERRQUEUE) = 144

Suggested-by: Maciej Żenczykowski <maze@google.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: allow bnx2x_bsc_read() to schedule

bnx2x_warpcore_read_sfp_module_eeprom() can call bnx2x_bsc_read()
three times before giving up.

This causes latency blips of at least 31 ms (58 ms being reported
by our teams)

Convert the long lasting loops of udelay() to usleep_range() ones,
and breaks the loops on precise time tracking.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ariel Elior <aelior@marvell.com>
Cc: Sudarsana Kalluru <skalluru@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: potential underflow in compat_ip_setsockopt()

The value of "n" is capped at 0x1ffffff but it checked for negative
values. I don't think this causes a problem but I'm not certain and
it's harmless to prevent it.

Fixes: 2e04172875c9 ("ipv4: do compat setsockopt for MCAST_MSFILTER directly")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mvneta: MVNETA_SKB_HEADROOM set last 3 bits to zero

For XDP the MVNETA_SKB_HEADROOM is used as an offset for
the received data.
The MVNETA manual states that the last 3 bits assumed to be 0.

This is currently the case but lets make it explicit in the definition
to prevent future problems.

Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

vxlan: Do not assume RTNL is held in vxlan_fdb_info()

vxlan_fdb_info() is not always called with RTNL held or from an RCU
read-side critical section. For example, in the following call path:

vxlan_cleanup()
  vxlan_fdb_destroy()
    vxlan_fdb_notify()
      __vxlan_fdb_notify()
        vxlan_fdb_info()

The use of rtnl_dereference() can therefore result in the following
splat [1].

Fix this by dereferencing the nexthop under RCU read-side critical
section.

[1]
[May24 22:56] =============================
[  +0.004676] WARNING: suspicious RCU usage
[  +0.004614] 5.7.0-rc5-custom-16219-g201392003491 #2772 Not tainted
[  +0.007116] -----------------------------
[  +0.004657] drivers/net/vxlan.c:276 suspicious rcu_dereference_check() usage!
[  +0.008164]
              other info that might help us debug this:

[  +0.009126]
              rcu_scheduler_active = 2, debug_locks = 1
[  +0.007504] 5 locks held by bash/6892:
[  +0.004392]  #0: ffff8881d47e3410 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: __do_execve_file.isra.27+0x392/0x23c0
[  +0.011795]  #1: ffff8881d47e34b0 (&sig->exec_update_mutex){+.+.}-{3:3}, at: flush_old_exec+0x510/0x2030
[  +0.010947]  #2: ffff8881a141b0b0 (ptlock_ptr(page)#2){+.+.}-{2:2}, at: unmap_page_range+0x9c0/0x2590
[  +0.010585]  #3: ffff888230009d50 ((&vxlan->age_timer)){+.-.}-{0:0}, at: call_timer_fn+0xe8/0x800
[  +0.010192]  #4: ffff888183729bc8 (&vxlan->hash_lock[h]){+.-.}-{2:2}, at: vxlan_cleanup+0x133/0x4a0
[  +0.010382]
              stack backtrace:
[  +0.005103] CPU: 1 PID: 6892 Comm: bash Not tainted 5.7.0-rc5-custom-16219-g201392003491 #2772
[  +0.009675] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
[  +0.010155] Call Trace:
[  +0.002775]  <IRQ>
[  +0.002313]  dump_stack+0xfd/0x178
[  +0.003895]  lockdep_rcu_suspicious+0x14a/0x153
[  +0.005157]  vxlan_fdb_info+0xe39/0x12a0
[  +0.004775]  __vxlan_fdb_notify+0xb8/0x160
[  +0.004672]  vxlan_fdb_notify+0x8e/0xe0
[  +0.004370]  vxlan_fdb_destroy+0x117/0x330
[  +0.004662]  vxlan_cleanup+0x1aa/0x4a0
[  +0.004329]  call_timer_fn+0x1c4/0x800
[  +0.004357]  run_timer_softirq+0x129d/0x17e0
[  +0.004762]  __do_softirq+0x24c/0xaef
[  +0.004232]  irq_exit+0x167/0x190
[  +0.003767]  smp_apic_timer_interrupt+0x1dd/0x6a0
[  +0.005340]  apic_timer_interrupt+0xf/0x20
[  +0.004620]  </IRQ>

Fixes: 1274e1cc4226 ("vxlan: ecmp support for mac fdb entries")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-Various-trap-changes-part-1'

Ido Schimmel says:

====================
mlxsw: Various trap changes - part 1

This patch set contains various changes in mlxsw trap configuration.
Another set will perform similar changes before exposing control traps
(e.g., IGMP query, ARP request) via devlink-trap.

Tested with existing devlink-trap selftests. Please see individual
patches for a detailed changelog.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Fix spelling mistake in trap's name

Fix incorrect spelling of "advertisement".

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Use dedicated trap group for sampled packets

The rate with which packets are sampled is determined by user space, so
there is no need to associate such packets with a policer.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Use same trap group for IPv6 ND and ARP packets

Both packet types are needed for the same reason (neighbour discovery),
so associate them with the same trap group.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Rename ARP trap group

The ARP trap group will be used for IPv6 ND traps in the next patch, so
rename it to "NEIGH_DISCOVERY" which is more appropriate.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_trap: Remove unnecessary field

Now that traffic class (TC) and priority are set to the same value,
there is no need to store both. Remove the first.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Align TC and trap priority

The traffic class (TC) attribute of packet traps determines through which
TC a packet trap will be scheduled through the CPU port.

The priority attribute determines which trap will be triggered in case
several packet traps match a packet.

We try to configure these attributes to the same value for all packet
traps as there is little reason not to.

Some packet traps did not use the same value, so rectify that now.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_buffers: Assign non-zero quotas to TC 0 of the CPU port

As explained in commit 9ffcc3725f09 ("mlxsw: spectrum: Allow packets to
be trapped from any PG"), incoming packets can be admitted to the shared
buffer and forwarded / trapped, if:

(Ingress{Port}.Usage < Thres && Ingress{Port,PG}.Usage < Thres &&
Egress{Port}.Usage < Thres && Egress{Port,TC}.Usage < Thres)
||
(Ingress{Port}.Usage < Min || Ingress{Port,PG} < Min ||
Egress{Port}.Usage < Min || Egress{Port,TC}.Usage < Min)

Trapped packets are scheduled to transmission through the CPU port.
Currently, the minimum and maximum quotas of traffic class (TC) 0 of the
CPU port are 0, which means it is not usable.

Assign non-zero quotas to TC 0 of the CPU port, so that it could be
utilized by subsequent patches.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Change default rate and priority of DHCP packets

Reduce the default acceptable rate of DHCP packets to 128 packets per
second and reduce their priority. This is reasonable given the Spectrum
ASICs are limited to 128 ports at the moment.

These are only the default values. Users will be able to modify them via
devlink-trap.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Trap IPv4 DHCP packets in router

Currently, IPv4 DHCP packets are trapped during L2 forwarding, which
means that packets might be trapped unnecessarily. Instead, only trap
the DHCP packets that reach the router. Either because they were flooded
to the router port or forwarded to it by the FDB. This is consistent
with the corresponding IPv6 trap.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Use same trap group for MLD and IGMP packets

Both packet types are needed for the same reason (multicast snooping),
so associate them with the same trap group.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Rename IGMP trap group

The IGMP trap group will be used for MLD traps in the next patch, so
rename it to "MC_SNOOPING" which is more appropriate.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/netdev/net

The MSCC bug fix in 'net' had to be slightly adjusted because the
register accesses are done slightly differently in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'efi-urgent-2020-05-24' of git://git./linux/kernel/git/tip/tip

Pull EFI fixes from Thomas Gleixner:
"A set of EFI fixes:

   - Don't return a garbage screen info when EFI framebuffer is not
     available

   - Make the early EFI console work properly with wider fonts instead
     of drawing garbage

   - Prevent a memory buffer leak in allocate_e820()

   - Print the firmware error record properly so it can be decoded by
     users

   - Fix a symbol clash in the host tool build which only happens with
     newer compilers.

   - Add a missing check for the event log version of TPM which caused
     boot failures on several Dell systems due to an attempt to decode
     SHA-1 format with the crypto agile algorithm"

* tag 'efi-urgent-2020-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tpm: check event log version before reading final events
  efi: Pull up arch-specific prototype efi_systab_show_arch()
  x86/boot: Mark global variables as static
  efi: cper: Add support for printing Firmware Error Record Reference
  efi/libstub/x86: Avoid EFI map buffer alloc in allocate_e820()
  efi/earlycon: Fix early printk for wider fonts
  efi/libstub: Avoid returning uninitialized data from setup_graphics()

Merge tag 'x86-urgent-2020-05-24' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
"Two fixes for x86:

   - Unbreak stack dumps for inactive tasks by interpreting the special
     first frame left by __switch_to_asm() correctly.

     The recent change not to skip the first frame so ORC and frame
     unwinder behave in the same way caused all entries to be
     unreliable, i.e. prepended with '?'.

   - Use cpumask_available() instead of an implicit NULL check of a
     cpumask_var_t in mmio trace to prevent a Clang build warning"

* tag 'x86-urgent-2020-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/unwind/orc: Fix unwind_get_return_address_ptr() for inactive tasks
  x86/mmiotrace: Use cpumask_available() for cpumask_var_t variables

Merge tag 'sched-urgent-2020-05-24' of git://git./linux/kernel/git/tip/tip

Pull scheduler fixes from Thomas Gleixner:
"A set of fixes for the scheduler:

   - Fix handling of throttled parents in enqueue_task_fair() completely.

     The recent fix overlooked a corner case where the first iteration
     terminates due to an entity already being on the runqueue which
     makes the list management incomplete and later triggers the
     assertion which checks for completeness.

   - Fix a similar problem in unthrottle_cfs_rq().

   - Show the correct uclamp values in procfs which prints the effective
     value twice instead of requested and effective"

* tag 'sched-urgent-2020-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list
  sched/debug: Fix requested task uclamp values shown in procfs
  sched/fair: Fix enqueue_task_fair() warning some more

Merge git://git./linux/kernel/git/netdev/net

Pull networking fixes from David Miller:

1) Fix RCU warnings in ipv6 multicast router code, from Madhuparna
    Bhowmik.

2) Nexthop attributes aren't being checked properly because of
    mis-initialized iterator, from David Ahern.

3) Revert iop_idents_reserve() change as it caused performance
    regressions and was just working around what is really a UBSAN bug
    in the compiler. From Yuqi Jin.

4) Read MAC address properly from ROM in bmac driver (double iteration
    proceeds past end of address array), from Jeremy Kerr.

5) Add Microsoft Surface device IDs to r8152, from Marc Payne.

6) Prevent reference to freed SKB in __netif_receive_skb_core(), from
    Boris Sukholitko.

7) Fix ACK discard behavior in rxrpc, from David Howells.

8) Preserve flow hash across packet scrubbing in wireguard, from Jason
    A. Donenfeld.

9) Cap option length properly for SO_BINDTODEVICE in AX25, from Eric
    Dumazet.

10) Fix encryption error checking in kTLS code, from Vadim Fedorenko.

11) Missing BPF prog ref release in flow dissector, from Jakub Sitnicki.

12) dst_cache must be used with BH disabled in tipc, from Eric Dumazet.

13) Fix use after free in mlxsw driver, from Jiri Pirko.

14) Order kTLS key destruction properly in mlx5 driver, from Tariq
    Toukan.

15) Check devm_platform_ioremap_resource() return value properly in
    several drivers, from Tiezhu Yang.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (71 commits)
  net: smsc911x: Fix runtime PM imbalance on error
  net/mlx4_core: fix a memory leak bug.
  net: ethernet: ti: cpsw: fix ASSERT_RTNL() warning during suspend
  net: phy: mscc: fix initialization of the MACsec protocol mode
  net: stmmac: don't attach interface until resume finishes
  net: Fix return value about devm_platform_ioremap_resource()
  net/mlx5: Fix error flow in case of function_setup failure
  net/mlx5e: CT: Correctly get flow rule
  net/mlx5e: Update netdev txq on completions during closure
  net/mlx5: Annotate mutex destroy for root ns
  net/mlx5: Don't maintain a case of del_sw_func being null
  net/mlx5: Fix cleaning unmanaged flow tables
  net/mlx5: Fix memory leak in mlx5_events_init
  net/mlx5e: Fix inner tirs handling
  net/mlx5e: kTLS, Destroy key object after destroying the TIS
  net/mlx5e: Fix allowed tc redirect merged eswitch offload cases
  net/mlx5: Avoid processing commands before cmdif is ready
  net/mlx5: Fix a race when moving command interface to events mode
  net/mlx5: Add command entry handling completion
  rxrpc: Fix a memory leak in rxkad_verify_response()
  ...

ethtool: propagate get_coalesce return value

get_coalesce returns 0 or ERRNO, but the return value isn't checked.
The returned coalesce data may be invalid if an ERRNO is set,
therefore better check and propagate the return value.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-provide-a-devres-variant-of-register_netdev'

Bartosz Golaszewski says:

====================
net: provide a devres variant of register_netdev()

Using devres helpers allows to shrink the probing code, avoid memory leaks in
error paths make sure the order in which resources are freed is the exact
opposite of their allocation. This series proposes to add a devres variant
of register_netdev() that will only work with net_device structures whose
memory is also managed.

First we add the missing documentation entry for the only other networking
devres helper: devm_alloc_etherdev().

Next we move devm_alloc_etherdev() into a separate source file.

We then use a proxy structure in devm_alloc_etherdev() to improve readability.

Last: we implement devm_register_netdev() and use it in mtk-eth-mac driver.

v1 -> v2:
- rebase on top of net-next after driver rename, no functional changes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: mtk_star_emac: use devm_register_netdev()

Use the new devres variant of register_netdev() in the mtk-star-emac
driver and shrink the code by a couple lines.

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: devres: provide devm_register_netdev()

Provide devm_register_netdev() - a device resource managed variant
of register_netdev(). This new helper will only work for net_device
structs that are also already managed by devres.

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: devres: define a separate devres structure for devm_alloc_etherdev()

Not using a proxy structure to store struct net_device doesn't save
anything in terms of compiled code size or memory usage but significantly
decreases the readability of the code with all the pointer casting.

Define struct net_device_devres and use it in devm_alloc_etherdev_mqs().

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: move devres helpers into a separate source file

There's currently only a single devres helper in net/ - devm variant
of alloc_etherdev. Let's move it to net/devres.c with the intention of
assing a second one: devm_register_netdev(). This new routine will need
to know the address of the release function of devm_alloc_etherdev() so
that it can verify (using devres_find()) that the struct net_device
that's being passed to it is also resource managed.

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Documentation: devres: add a missing section for networking helpers

Add a new section for networking devres helpers to devres.rst and list
the two existing devm functions.

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'r8169-remove-mask-argument-from-few-ERI-OCP-functions'

Heiner Kallweit says:

====================
r8169: remove mask argument from few ERI/OCP functions

Few ERI/OCP functions have a mask argument that isn't needed.
Remove it to simplify the functions.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: remove mask argument from r8168ep_ocp_read

Remove the mask argument as it's not used by r8168ep_ocp_read().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>