Leon Romanovsky [Wed, 21 Jul 2021 12:39:44 +0000 (15:39 +0300)]
ionic: cleanly release devlink instance
The failure to register devlink will leave the system with dangled
devlink resource, which is not cleaned if devlink_port_register() fails.
In order to remove access to ".registered" field of struct devlink_port,
require both devlink_register and devlink_port_register to success and
check it through device pointer.
Fixes:
fbfb8031533c ("ionic: Add hardware init and device commands")
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Wed, 21 Jul 2021 14:01:27 +0000 (17:01 +0300)]
net: bridge: multicast: add context support for host-joined groups
Adding bridge multicast context support for host-joined groups is easy
because we only need the proper timer value. We pass the already chosen
context and use its timer value.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Wed, 21 Jul 2021 14:01:26 +0000 (17:01 +0300)]
net: bridge: multicast: add mdb context support
Choose the proper bridge multicast context when user-spaces is adding
mdb entries. Currently we require the vlan to be configured on at least
one device (port or bridge) in order to add an mdb entry if vlan
mcast snooping is enabled (vlan snooping implies vlan filtering).
Note that we always allow deleting an entry, regardless of the vlan state.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joakim Zhang [Wed, 21 Jul 2021 10:12:20 +0000 (18:12 +0800)]
ARM: dts: imx6qdl: move phy properties into phy device node
This patch fixes issues found by dtbs_check:
make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- dtbs_check DT_SCHEMA_FILES=Documentation/devicetree/bindings/net/fsl,fec.yaml
According to the Micrel PHY dt-binding:
Documentation/devicetree/bindings/net/micrel-ksz90x1.txt,
Add clock delay in an Ethernet OF device node is deprecated, so move
these properties to PHY OF device node.
Suggested-by: Rob Herring <robh@kernel.org>
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joakim Zhang [Wed, 21 Jul 2021 10:12:19 +0000 (18:12 +0800)]
dt-bindings: net: fsl,fec: improve the binding a bit
This patch improves the yaml a bit according to Rob Herring comments:
1) normalize interrupt-names property, there is no reason to support
random order.
2) validate each string in clock-names property.
3) add constraints for fsl,num-tx-queues/fsl,num-rx-queues property.
4) change additionalProperties to false in order to do strict checking.
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 21 Jul 2021 10:15:28 +0000 (03:15 -0700)]
tcp: tweak len/truesize ratio for coalesce candidates
tcp_grow_window() is using skb->len/skb->truesize to increase tp->rcv_ssthresh
which has a direct impact on advertized window sizes.
We added TCP coalescing in linux-3.4 & linux-3.5:
Instead of storing skbs with one or two MSS in receive queue (or OFO queue),
we try to append segments together to reduce memory overhead.
High performance network drivers tend to cook skb with 3 parts :
1) sk_buff structure (256 bytes)
2) skb->head contains room to copy headers as needed, and skb_shared_info
3) page fragment(s) containing the ~1514 bytes frame (or more depending on MTU)
Once coalesced into a previous skb, 1) and 2) are freed.
We can therefore tweak the way we compute len/truesize ratio knowing
that skb->truesize is inflated by 1) and 2) soon to be freed.
This is done only for in-order skb, or skb coalesced into OFO queue.
The result is that low rate flows no longer pay the memory price of having
low GRO aggregation factor. Same result for drivers not using GRO.
This is critical to allow a big enough receiver window,
typically tcp_rmem[2] / 2.
We have been using this at Google for about 5 years, it is due time
to make it upstream.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Wed, 21 Jul 2021 10:06:24 +0000 (13:06 +0300)]
net: bridge: multicast: fix igmp/mld port context null pointer dereferences
With the recent change to use bridge/port multicast context pointers
instead of bridge/port I missed to convert two locations which pass the
port pointer as-is, but with the new model we need to verify the port
context is non-NULL first and retrieve the port from it. The first
location is when doing querier selection when a query is received, the
second location is when leaving a group. The port context will be null
if the packets originated from the bridge device (i.e. from the host).
The fix is simple just check if the port context exists and retrieve
the port pointer from it.
Fixes:
adc47037a7d5 ("net: bridge: multicast: use multicast contexts instead of bridge or port")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Leon Romanovsky [Wed, 21 Jul 2021 09:54:13 +0000 (12:54 +0300)]
ionic: drop useless check of PCI driver data validity
The driver core will call to .remove callback only if .probe succeeded
and it will ensure that driver data has pointer to struct ionic.
There is no need to check it again.
Fixes:
fbfb8031533c ("ionic: Add hardware init and device commands")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 21 Jul 2021 09:06:14 +0000 (02:06 -0700)]
tcp: avoid indirect call in tcp_new_space()
For tcp sockets, sk->sk_write_space is most probably sk_stream_write_space().
Other sk->sk_write_space() calls in TCP are slow path and do not deserve
any change.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Wed, 21 Jul 2021 08:20:58 +0000 (11:20 +0300)]
net: wwan: iosm: Switch to use module_pci_driver() macro
Eliminate some boilerplate code by using module_pci_driver() instead of
init/exit, moving the salient bits from init into probe.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dongliang Mu [Wed, 21 Jul 2021 08:14:57 +0000 (16:14 +0800)]
usb: hso: remove the bailout parameter
There are two invocation sites of hso_free_net_device. After
refactoring hso_create_net_device, this parameter is useless.
Remove the bailout in the hso_free_net_device and change the invocation
sites of this function.
Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dongliang Mu [Wed, 21 Jul 2021 08:14:56 +0000 (16:14 +0800)]
usb: hso: fix error handling code of hso_create_net_device
The current error handling code of hso_create_net_device is
hso_free_net_device, no matter which errors lead to. For example,
WARNING in hso_free_net_device [1].
Fix this by refactoring the error handling code of
hso_create_net_device by handling different errors by different code.
[1] https://syzkaller.appspot.com/bug?id=
66eff8d49af1b28370ad342787413e35bbe76efe
Reported-by: syzbot+44d53c7255bb1aea22d2@syzkaller.appspotmail.com
Fixes:
5fcfb6d0bfcd ("hso: fix bailout in error case of probe")
Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Piotr Kwapulinski [Tue, 20 Jul 2021 23:23:48 +0000 (16:23 -0700)]
i40e: add support for PTP external synchronization clock
Add support for external synchronization clock via GPIOs.
1PPS signals are handled via the dedicated 3 GPIOs: SDP3_2,
SDP3_3 and GPIO_4.
Previously it was not possible to use the external PTP
synchronization clock.
All possible HW configurations are supported.
SDP3_2, SDP3_3, GPIO_4
off, off, off
off, in_A, off
off, out_A, off
off, in_B, off
off, out_B, off
in_A, off, off
in_A, in_B, off
in_A, out_B, off
out_A, off, off
out_A, in_B, off
in_B, off, off
in_B, in_A, off
in_B, out_A, off
out_B, off, off
out_B, in_A, off
off, off, in_A
off, out_A, in_A
off, in_B, in_A
off, out_B, in_A
out_A, off, in_A
out_A, in_B, in_A
in_B, off, in_A
in_B, out_A, in_A
out_B, off, in_A
off, off, out_A
off, in_A, out_A
off, in_B, out_A
off, out_B, out_A
in_A, off, out_A
in_A, in_B, out_A
in_A, out_B, out_A
in_B, off, out_A
in_B, in_A, out_A
out_B, off, out_A
out_B, in_A, out_A
off, off, in_B
off, in_A, in_B
off, out_A, in_B
off, out_B, in_B
in_A, off, in_B
in_A, out_B, in_B
out_A, off, in_B
out_B, off, in_B
out_B, in_A, in_B
off, off, out_B
off, in_A, out_B
off, out_A, out_B
off, in_B, out_B
in_A, off, out_B
in_A, in_B, out_B
out_A, off, out_B
out_A, in_B, out_B
in_B, off, out_B
in_B, in_A, out_B
in_B, out_A, out_B
Tested with oscilloscope, 1PPS generator and ts2phc.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Tested-by: Ashish K <ashishx.k@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vadim Fedorenko [Tue, 20 Jul 2021 20:06:28 +0000 (23:06 +0300)]
net: ipv4: Consolidate ipv4_mtu and ip_dst_mtu_maybe_forward
Consolidate IPv4 MTU code the same way it is done in IPv6 to have code
aligned in both address families
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vadim Fedorenko [Tue, 20 Jul 2021 20:06:27 +0000 (23:06 +0300)]
net: ipv6: introduce ip6_dst_mtu_maybe_forward
Replace ip6_dst_mtu_forward with ip6_dst_mtu_maybe_forward and
reuse this code in ip6_mtu. Actually these two functions were
almost duplicates, this change will simplify the maintaince of
mtu calculation code.
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 21 Jul 2021 15:21:39 +0000 (08:21 -0700)]
Merge branch 'ipv6-ioam'
Justin Iurman says:
====================
Support for the IOAM Pre-allocated Trace with IPv6
v5:
- Refine types, min/max and default values for new sysctls
- Introduce a "_wide" sysctl for each "ioam6_id" sysctl
- Add more validation on headers before processing data
- RCU for sc <> ns pointers + appropriate accessors
- Generic Netlink policies are now per op, not per family anymore
- Address other comments/remarks from Jakub (thanks again)
- Revert "__packed" to "__attribute__((packed))" for uapi headers
- Add tests to cover the functionality added, as requested by David Ahern
v4:
- Address warnings from checkpatch (ignore errors related to unnamed bitfields
in the first patch)
- Use of hweight32 (thanks Jakub)
- Remove inline keyword from static functions in C files and let the compiler
decide what to do (thanks Jakub)
v3:
- Fix warning "unused label 'out_unregister_genl'" by adding conditional macro
- Fix lwtunnel output redirect bug: dst cache useless in this case, use
orig_output instead
v2:
- Fix warning with static for __ioam6_fill_trace_data
- Fix sparse warning with __force when casting __be64 to __be32
- Fix unchecked dereference when removing IOAM namespaces or schemas
- exthdrs.c: Don't drop by default (now: ignore) to match the act bits "00"
- Add control plane support for the inline insertion (lwtunnel)
- Provide uapi structures
- Use __net_timestamp if skb->tstamp is empty
- Add note about the temporary IANA allocation
- Remove support for "removable" TLVs
- Remove support for virtual/anonymous tunnel decapsulation
In-situ Operations, Administration, and Maintenance (IOAM) records
operational and telemetry information in a packet while it traverses
a path between two points in an IOAM domain. It is defined in
draft-ietf-ippm-ioam-data [1]. IOAM data fields can be encapsulated
into a variety of protocols. The IPv6 encapsulation is defined in
draft-ietf-ippm-ioam-ipv6-options [2], via extension headers. IOAM
can be used to complement OAM mechanisms based on e.g. ICMP or other
types of probe packets.
This patchset implements support for the Pre-allocated Trace, carried
by a Hop-by-Hop. Therefore, a new IPv6 Hop-by-Hop TLV option is
introduced, see IANA [3]. The three other IOAM options are not included
in this patchset (Incremental Trace, Proof-of-Transit and Edge-to-Edge).
The main idea behind the IOAM Pre-allocated Trace is that a node
pre-allocates some room in packets for IOAM data. Then, each IOAM node
on the path will insert its data. There exist several interesting use-
cases, e.g. Fast failure detection/isolation or Smart service selection.
Another killer use-case is what we have called Cross-Layer Telemetry,
see the demo video on its repository [4], that aims to make the entire
stack (L2/L3 -> L7) visible for distributed tracing tools (e.g. Jaeger),
instead of the current L5 -> L7 limited view. So, basically, this is a
nice feature for the Linux Kernel.
This patchset also provides support for the control plane part, but only for the
inline insertion (host-to-host use case), through lightweight tunnels. Indeed,
for in-transit traffic, the solution is to have an IPv6-in-IPv6 encapsulation,
which brings some difficulties and still requires a little bit of work and
discussion (ie anonymous tunnel decapsulation and multi egress resolution).
- Patch 1: IPv6 IOAM headers definition
- Patch 2: Data plane support for Pre-allocated Trace
- Patch 3: IOAM Generic Netlink API
- Patch 4: Support for IOAM injection with lwtunnels
- Patch 5: Documentation for new IOAM sysctls
- Patch 6: Test for the IOAM insertion with IPv6
[1] https://tools.ietf.org/html/draft-ietf-ippm-ioam-data
[2] https://tools.ietf.org/html/draft-ietf-ippm-ioam-ipv6-options
[3] https://www.iana.org/assignments/ipv6-parameters/ipv6-parameters.xhtml#ipv6-parameters-2
[4] https://github.com/iurmanj/cross-layer-telemetry
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Justin Iurman [Tue, 20 Jul 2021 19:43:01 +0000 (21:43 +0200)]
selftests: net: Test for the IOAM insertion with IPv6
This test evaluates the IOAM insertion for IPv6 by checking the IOAM data
integrity on the receiver.
The topology is formed by 3 nodes: Alpha (sender), Beta (router in-between)
and Gamma (receiver). An IOAM domain is configured from Alpha to Gamma only,
which means not on the reverse path. When Gamma is the destination, Alpha
adds an IOAM option (Pre-allocated Trace) inside a Hop-by-hop and fills the
trace with its own IOAM data. Beta and Gamma also fill the trace. The IOAM
data integrity is checked on Gamma, by comparing with the pre-defined IOAM
configuration (see below).
+-------------------+ +-------------------+
| | | |
| alpha netns | | gamma netns |
| | | |
| +-------------+ | | +-------------+ |
| | veth0 | | | | veth0 | |
| | db01::2/64 | | | | db02::2/64 | |
| +-------------+ | | +-------------+ |
| . | | . |
+-------------------+ +-------------------+
. .
. .
. .
+----------------------------------------------------+
| . . |
| +-------------+ +-------------+ |
| | veth0 | | veth1 | |
| | db01::1/64 | ................ | db02::1/64 | |
| +-------------+ +-------------+ |
| |
| beta netns |
| |
+--------------------------+-------------------------+
~~~~~~~~~~~~~~~~~~~~~~
| IOAM configuration |
~~~~~~~~~~~~~~~~~~~~~~
Alpha
+-----------------------------------------------------------+
| Type | Value |
+-----------------------------------------------------------+
| Node ID | 1 |
+-----------------------------------------------------------+
| Node Wide ID |
11111111 |
+-----------------------------------------------------------+
| Ingress ID | 0xffff (default value) |
+-----------------------------------------------------------+
| Ingress Wide ID | 0xffffffff (default value) |
+-----------------------------------------------------------+
| Egress ID | 101 |
+-----------------------------------------------------------+
| Egress Wide ID | 101101 |
+-----------------------------------------------------------+
| Namespace Data | 0xdeadbee0 |
+-----------------------------------------------------------+
| Namespace Wide Data | 0xcafec0caf00dc0de |
+-----------------------------------------------------------+
| Schema ID | 777 |
+-----------------------------------------------------------+
| Schema Data | something that will be 4n-aligned |
+-----------------------------------------------------------+
Note: When Gamma is the destination, Alpha adds an IOAM Pre-allocated Trace
option inside a Hop-by-hop, where 164 bytes are pre-allocated for the
trace, with 123 as the IOAM-Namespace and with 0xfff00200 as the trace
type (= all available options at this time). As a result, and based on
IOAM configurations here, only both Alpha and Beta should be capable of
inserting their IOAM data while Gamma won't have enough space and will
set the overflow bit.
Beta
+-----------------------------------------------------------+
| Type | Value |
+-----------------------------------------------------------+
| Node ID | 2 |
+-----------------------------------------------------------+
| Node Wide ID |
22222222 |
+-----------------------------------------------------------+
| Ingress ID | 201 |
+-----------------------------------------------------------+
| Ingress Wide ID | 201201 |
+-----------------------------------------------------------+
| Egress ID | 202 |
+-----------------------------------------------------------+
| Egress Wide ID | 202202 |
+-----------------------------------------------------------+
| Namespace Data | 0xdeadbee1 |
+-----------------------------------------------------------+
| Namespace Wide Data | 0xcafec0caf11dc0de |
+-----------------------------------------------------------+
| Schema ID | 0xffffff (= None) |
+-----------------------------------------------------------+
| Schema Data | |
+-----------------------------------------------------------+
Gamma
+-----------------------------------------------------------+
| Type | Value |
+-----------------------------------------------------------+
| Node ID | 3 |
+-----------------------------------------------------------+
| Node Wide ID |
33333333 |
+-----------------------------------------------------------+
| Ingress ID | 301 |
+-----------------------------------------------------------+
| Ingress Wide ID | 301301 |
+-----------------------------------------------------------+
| Egress ID | 0xffff (default value) |
+-----------------------------------------------------------+
| Egress Wide ID | 0xffffffff (default value) |
+-----------------------------------------------------------+
| Namespace Data | 0xdeadbee2 |
+-----------------------------------------------------------+
| Namespace Wide Data | 0xcafec0caf22dc0de |
+-----------------------------------------------------------+
| Schema ID | 0xffffff (= None) |
+-----------------------------------------------------------+
| Schema Data | |
+-----------------------------------------------------------+
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Justin Iurman [Tue, 20 Jul 2021 19:43:00 +0000 (21:43 +0200)]
ipv6: ioam: Documentation for new IOAM sysctls
Add documentation for new IOAM sysctls:
- ioam6_id and ioam6_id_wide: two per-namespace sysctls
- ioam6_enabled, ioam6_id and ioam6_id_wide: three per-interface sysctls
Example of IOAM configuration based on the following simple topology:
_____ _____ _____
| | eth0 eth0 | | eth1 eth0 | |
| A |.----------.| B |.----------.| C |
|_____| |_____| |_____|
1) Node and interface IDs can be configured for IOAM:
# IOAM ID of A = 1, IOAM ID of A.eth0 = 11
(A) sysctl -w net.ipv6.ioam6_id=1
(A) sysctl -w net.ipv6.conf.eth0.ioam6_id=11
# IOAM ID of B = 2, IOAM ID of B.eth0 = 21, IOAM ID of B.eth1 = 22
(B) sysctl -w net.ipv6.ioam6_id=2
(B) sysctl -w net.ipv6.conf.eth0.ioam6_id=21
(B) sysctl -w net.ipv6.conf.eth1.ioam6_id=22
# IOAM ID of C = 3, IOAM ID of C.eth0 = 31
(C) sysctl -w net.ipv6.ioam6_id=3
(C) sysctl -w net.ipv6.conf.eth0.ioam6_id=31
Note that "_wide" IDs equivalents can be configured the same way.
2) Each node can be configured to form an IOAM domain. For instance,
we allow IOAM from A to C only (not the reverse path), i.e. enable
IOAM on ingress for B.eth0 and C.eth0:
(B) sysctl -w net.ipv6.conf.eth0.ioam6_enabled=1
(C) sysctl -w net.ipv6.conf.eth0.ioam6_enabled=1
3) An IOAM domain (e.g. ID=123) is defined and made known to each node:
(A) ip ioam namespace add 123
(B) ip ioam namespace add 123
(C) ip ioam namespace add 123
4) Finally, an IOAM Pre-allocated Trace can be inserted in traffic sent
by A when C (e.g. db02::2) is the destination:
(A) ip -6 route add db02::2/128 encap ioam6 trace type 0x800000 ns 123
size 12 dev eth0
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Justin Iurman [Tue, 20 Jul 2021 19:42:59 +0000 (21:42 +0200)]
ipv6: ioam: Support for IOAM injection with lwtunnels
Add support for the IOAM inline insertion (only for the host-to-host use case)
which is per-route configured with lightweight tunnels. The target is iproute2
and the patch is ready. It will be posted as soon as this patchset is merged.
Here is an overview:
$ ip -6 ro ad fc00::1/128 encap ioam6 trace type 0x800000 ns 1 size 12 dev eth0
This example configures an IOAM Pre-allocated Trace option attached to the
fc00::1/128 prefix. The IOAM namespace (ns) is 1, the size of the pre-allocated
trace data block is 12 octets (size) and only the first IOAM data (bit 0:
hop_limit + node id) is included in the trace (type) represented as a bitfield.
The reason why the in-transit (IPv6-in-IPv6 encapsulation) use case is not
implemented is explained on the patchset cover.
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Justin Iurman [Tue, 20 Jul 2021 19:42:58 +0000 (21:42 +0200)]
ipv6: ioam: IOAM Generic Netlink API
Add Generic Netlink commands to allow userspace to configure IOAM
namespaces and schemas. The target is iproute2 and the patch is ready.
It will be posted as soon as this patchset is merged. Here is an overview:
$ ip ioam
Usage: ip ioam { COMMAND | help }
ip ioam namespace show
ip ioam namespace add ID [ data DATA32 ] [ wide DATA64 ]
ip ioam namespace del ID
ip ioam schema show
ip ioam schema add ID DATA
ip ioam schema del ID
ip ioam namespace set ID schema { ID | none }
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Justin Iurman [Tue, 20 Jul 2021 19:42:57 +0000 (21:42 +0200)]
ipv6: ioam: Data plane support for Pre-allocated Trace
Implement support for processing the IOAM Pre-allocated Trace with IPv6,
see [1] and [2]. Introduce a new IPv6 Hop-by-Hop TLV option, see IANA [3].
A new per-interface sysctl is introduced. The value is a boolean to accept (=1)
or ignore (=0, by default) IPv6 IOAM options on ingress for an interface:
- net.ipv6.conf.XXX.ioam6_enabled
Two other sysctls are introduced to define IOAM IDs, represented by an integer.
They are respectively per-namespace and per-interface:
- net.ipv6.ioam6_id
- net.ipv6.conf.XXX.ioam6_id
The value of the first one represents the IOAM ID of the node itself (u32; max
and default value = U32_MAX>>8, due to hop limit concatenation) while the other
represents the IOAM ID of an interface (u16; max and default value = U16_MAX).
Each "ioam6_id" sysctl has a "_wide" equivalent:
- net.ipv6.ioam6_id_wide
- net.ipv6.conf.XXX.ioam6_id_wide
The value of the first one represents the wide IOAM ID of the node itself (u64;
max and default value = U64_MAX>>8, due to hop limit concatenation) while the
other represents the wide IOAM ID of an interface (u32; max and default value
= U32_MAX).
The use of short and wide equivalents is not exclusive, a deployment could
choose to leverage both. For example, net.ipv6.conf.XXX.ioam6_id (short format)
could be an identifier for a physical interface, whereas
net.ipv6.conf.XXX.ioam6_id_wide (wide format) could be an identifier for a
logical sub-interface. Documentation about new sysctls is provided at the end
of this patchset.
Two relativistic hash tables are used: one for IOAM namespaces, the other for
IOAM schemas. A namespace can only have a single active schema and a schema
can only be attached to a single namespace (1:1 relationship).
[1] https://tools.ietf.org/html/draft-ietf-ippm-ioam-ipv6-options
[2] https://tools.ietf.org/html/draft-ietf-ippm-ioam-data
[3] https://www.iana.org/assignments/ipv6-parameters/ipv6-parameters.xhtml#ipv6-parameters-2
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Justin Iurman [Tue, 20 Jul 2021 19:42:56 +0000 (21:42 +0200)]
uapi: IPv6 IOAM headers definition
This patch provides the IPv6 IOAM option header [1] as well as the IOAM
Trace header [2]. An IOAM option must be 4n-aligned. Here is an overview of
a Hop-by-Hop with an IOAM Trace option:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Next header | Hdr Ext Len | Padding | Padding |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Option Type | Opt Data Len | Reserved | IOAM Type |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Namespace-ID | NodeLen | Flags | RemainingLen|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| IOAM-Trace-Type | Reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+
| | |
| node data [n] | |
| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ D
| | a
| node data [n-1] | t
| | a
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
~ ... ~ S
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ p
| | a
| node data [1] | c
| | e
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | |
| node data [0] | |
| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+
The IOAM option header starts at "Option Type" and ends after "IOAM
Type". The IOAM Trace header starts at "Namespace-ID" and ends after
"IOAM-Trace-Type/Reserved".
IOAM Type: either Pre-allocated Trace (=0), Incremental Trace (=1),
Proof-of-Transit (=2) or Edge-to-Edge (=3). Note that both the
Pre-allocated Trace and the Incremental Trace look the same. The two
others are not implemented.
Namespace-ID: IOAM namespace identifier, not to be confused with network
namespaces. It adds further context to IOAM options and associated data,
and allows devices which are IOAM capable to determine whether IOAM
options must be processed or ignored. It can also be used by an operator
to distinguish different operational domains or to identify different
sets of devices.
NodeLen: Length of data added by each node. It depends on the Trace
Type.
Flags: Only the Overflow (O) flag for now. The O flag is set by a
transit node when there are not enough octets left to record its data.
RemainingLen: Remaining free space to record data.
IOAM-Trace-Type: Bit field where each bit corresponds to a specific kind
of IOAM data. See [2] for a detailed list.
[1] https://tools.ietf.org/html/draft-ietf-ippm-ioam-ipv6-options
[2] https://tools.ietf.org/html/draft-ietf-ippm-ioam-data
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 20 Jul 2021 17:35:57 +0000 (20:35 +0300)]
net: switchdev: recurse into __switchdev_handle_fdb_del_to_device
The difference between __switchdev_handle_fdb_del_to_device and
switchdev_handle_del_to_device is that the former takes an extra
orig_dev argument, while the latter starts with dev == orig_dev.
We should recurse into the variant that does not lose the orig_dev along
the way. This is relevant when deleting FDB entries pointing towards a
bridge (dev changes to the lower interfaces, but orig_dev shouldn't).
The addition helper already recurses properly, just the deletion one
doesn't.
Fixes:
8ca07176ab00 ("net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 20 Jul 2021 17:35:56 +0000 (20:35 +0300)]
net: switchdev: remove stray semicolon in switchdev_handle_fdb_del_to_device shim
With the semicolon at the end, the compiler sees the shim function as a
declaration and not as a definition, and warns:
'switchdev_handle_fdb_del_to_device' declared 'static' but never defined
Reported-by: kernel test robot <lkp@intel.com>
Fixes:
8ca07176ab00 ("net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 20 Jul 2021 17:24:33 +0000 (20:24 +0300)]
net: phy: at803x: finish the phy id checking simplification
The blamed commit was probably not tested on net-next, since it did not
refactor the extra phy id check introduced in commit
b856150c8098 ("net:
phy: at803x: mask 1000 Base-X link mode").
Fixes:
8887ca5474bd ("net: phy: at803x: simplify custom phy id matching")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King (Oracle) [Tue, 20 Jul 2021 11:15:26 +0000 (12:15 +0100)]
net: phylink: cleanup ksettings_set
We only need to fiddle about with the supported mask after we have
validated the user's requested parameters. Simplify and streamline the
code by moving the linkmode copy and update of the autoneg bit after
validating the user's request.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 20 Jul 2021 23:57:06 +0000 (16:57 -0700)]
Merge branch '1GbE' of git://git./linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
1GbE Intel Wired LAN Driver Updates 2021-07-20
This series contains updates to e1000e and igc drivers.
Sasha adds initial S0ix support for devices with CSME and adds polling
for exiting of DPG. He sets the PHY to low power idle when in S0ix. He
also adds support for new device IDs for and adds a space to debug
messaging to help with readability for e1000e.
For igc, he ensures that q_vector array is not accessed beyond its
bounds and removes unneeded PHY related checks.
Tree Davies corrects a spelling mistake in e1000e.
Muhammad corrects the value written when there is no TSN offloading
and adjusts timeout value to avoid possible Tx hang for igc.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Joakim Zhang [Mon, 19 Jul 2021 07:18:21 +0000 (15:18 +0800)]
arm64: dts: imx8mp: change interrupt order per dt-binding
This patch changs interrupt order which found by dtbs_check.
$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- dtbs_check DT_SCHEMA_FILES=Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml
arch/arm64/boot/dts/freescale/imx8mp-evk.dt.yaml: ethernet@
30bf0000: interrupt-names:0: 'macirq' was expected
arch/arm64/boot/dts/freescale/imx8mp-evk.dt.yaml: ethernet@
30bf0000: interrupt-names:1: 'eth_wake_irq' was expected
According to Documentation/devicetree/bindings/net/snps,dwmac.yaml, we
should list interrupt in it's order.
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joakim Zhang [Mon, 19 Jul 2021 07:18:20 +0000 (15:18 +0800)]
dt-bindings: net: imx-dwmac: convert imx-dwmac bindings to yaml
In order to automate the verification of DT nodes covert imx-dwmac to
nxp,dwmac-imx.yaml, and pass below checking.
$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- dt_binding_check DT_SCHEMA_FILES=Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml
$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- dtbs_check DT_SCHEMA_FILES=Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joakim Zhang [Mon, 19 Jul 2021 07:18:19 +0000 (15:18 +0800)]
dt-bindings: net: snps,dwmac: add missing DWMAC IP version
Add missing DWMAC IP version in snps,dwmac.yaml which found by below
command, as NXP i.MX8 families support SNPS DWMAC 5.10a IP.
$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- dt_binding_check DT_SCHEMA_FILES=Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml
Documentation/devicetree/bindings/net/nxp,dwmac-imx.example.dt.yaml:
ethernet@
30bf0000: compatible: None of ['nxp,imx8mp-dwmac-eqos', 'snps,dwmac-5.10a'] are valid under the given schema
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Muhammad Husaini Zulkifli [Sat, 17 Jul 2021 16:12:22 +0000 (00:12 +0800)]
igc: Increase timeout value for Speed 100/1000/2500
As the cycle time is set to maximum of 1s, the TX Hang timeout need to
be increase to avoid possible TX Hang.
There is no dedicated number specific in data sheet for the timeout factor.
Timeout factor was determined during the debugging to solve the "Tx Hang"
issues that happen in some cases mainly during ETF(Earliest TxTime First).
This can be test by using TSN Schedule Tx Tools udp_tai sample application.
Signed-off-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>
Acked-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Muhammad Husaini Zulkifli [Fri, 9 Jul 2021 23:40:17 +0000 (07:40 +0800)]
igc: Set QBVCYCLET_S to 0 for TSN Basic Scheduling
According to datasheet section 8.12.19, when there's no TSN offloading
Shadow_QbvCycle bit[29:0] must be set to zero for basic scheduling.
Signed-off-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Sasha Neftin [Sat, 10 Jul 2021 17:57:50 +0000 (20:57 +0300)]
igc: Remove phy->type checking
i225 devices have only one phy->type: copper. There is no point checking
phy->type during the igc_has_link method from the watchdog that
invoked every 2 seconds.
This patch comes to clean up these pointless checkings.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Sasha Neftin [Wed, 7 Jul 2021 05:14:40 +0000 (08:14 +0300)]
igc: Remove _I_PHY_ID checking
i225 devices have only one PHY vendor. There is no point checking
_I_PHY_ID during the link establishment and auto-negotiation process.
This patch comes to clean up these pointless checkings.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Sasha Neftin [Mon, 14 Jun 2021 12:19:39 +0000 (15:19 +0300)]
igc: Check if num of q_vectors is smaller than max before array access
Ensure that the adapter->q_vector[MAX_Q_VECTORS] array isn't accessed
beyond its size. It was fixed by using a local variable num_q_vectors
as a limit for loop index, and ensure that num_q_vectors is not bigger
than MAX_Q_VECTORS.
Suggested-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tree Davies [Thu, 24 Jun 2021 12:06:01 +0000 (05:06 -0700)]
net/e1000e: Fix spelling mistake "The" -> "This"
There is a spelling mistake in the comment block.
Signed-off-by: Tree Davies <tdavies@darkphysics.net>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Sasha Neftin [Wed, 16 Jun 2021 04:19:30 +0000 (07:19 +0300)]
e1000e: Add space to the debug print
Minor fixes to allow debug prints more readable.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Sasha Neftin [Sat, 12 Jun 2021 17:02:20 +0000 (20:02 +0300)]
e1000e: Add support for the next LOM generation
Add devices IDs for the next LOM generations that will be
available on the next Intel Client platforms
This patch provides the initial support for these devices
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Sasha Neftin [Thu, 4 Mar 2021 07:38:13 +0000 (09:38 +0200)]
e1000e: Add support for Lunar Lake
Add devices IDs for the next LOM generations that will be
available on the next Intel Client platform (Lunar Lake)
This patch provides the initial support for these devices
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Sasha Neftin [Thu, 24 Jun 2021 08:19:08 +0000 (11:19 +0300)]
e1000e: Additional PHY power saving in S0ix
After transferring the MAC-PHY interface to the SMBus set the PHY
to S0ix low power idle mode.
Suggested-by: Dima Ruinskiy <dima.ruinskiy@intel.com>
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Sasha Neftin [Thu, 24 Jun 2021 08:18:46 +0000 (11:18 +0300)]
e1000e: Add polling mechanism to indicate CSME DPG exit
Per guidance from the CSME architecture team, it may take
up to 1 second for unconfiguring dynamic power gating mode.
Practically it can take more time. Wait up to 2.5 seconds to indicate
dynamic power gating exit from the S0ix configuration. Detect
scenarios that take more than 1 second but less than 2.5 seconds
will emit warning message.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Sasha Neftin [Thu, 24 Jun 2021 08:18:27 +0000 (11:18 +0300)]
e1000e: Add handshake with the CSME to support S0ix
On the corporate system, the driver will ask from the CSME
(manageability engine) to perform device settings are required
to allow S0ix residency.
This patch provides initial support.
Reviewed-by: Dima Ruinskiy <dima.ruinskiy@intel.com>
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Russell King [Tue, 20 Jul 2021 13:33:49 +0000 (14:33 +0100)]
net: phy: at803x: simplify custom phy id matching
The at803x driver contains a function, at803x_match_phy_id(), which
tests whether the PHY ID matches the value passed, comparing phy_id
with phydev->phy_id and testing all bits that in the driver's mask.
This is the same test that is used to match the driver, with phy_id
replaced with the driver specified ID, phydev->drv->phy_id.
Hence, we already know the value of the bits being tested if we look
at phydev->drv->phy_id directly, and we do not require a complicated
test to check them. Test directly against phydev->drv->phy_id instead.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Fabio Estevam <festevam@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Tue, 20 Jul 2021 13:03:11 +0000 (14:03 +0100)]
net: marvell: clean up trigraph warning on ??! string
The character sequence ??! is a trigraph and causes the following
clang warning:
drivers/net/ethernet/marvell/mvneta.c:2604:39: warning: trigraph ignored [-Wtrigraphs]
Clean this by replacing it with single ?.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Tue, 20 Jul 2021 12:48:13 +0000 (13:48 +0100)]
atm: idt77252: clean up trigraph warning on ??) string
The character sequence ??) is a trigraph and causes the following
clang warning:
drivers/atm/idt77252.c:3544:35: warning: trigraph ignored [-Wtrigraphs]
Clean this by replacing it with single ?.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Schiller [Tue, 20 Jul 2021 11:56:47 +0000 (13:56 +0200)]
net: phy: intel-xway: Add RGMII internal delay configuration
This adds the possibility to configure the RGMII RX/TX clock skew via
devicetree.
Simply set phy mode to "rgmii-id", "rgmii-rxid" or "rgmii-txid" and add
the "rx-internal-delay-ps" or "tx-internal-delay-ps" property to the
devicetree.
Furthermore, a warning is now issued if the phy mode is configured to
"rgmii" and an internal delay is set in the phy (e.g. by pin-strapping),
as in the dp83867 driver.
Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King (Oracle) [Tue, 20 Jul 2021 11:15:20 +0000 (12:15 +0100)]
net: phylink: add phy change pause mode debug
Augment the phy link debug prints with the pause state.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King (Oracle) [Tue, 20 Jul 2021 09:57:53 +0000 (10:57 +0100)]
net: mvpp2: deny disabling autoneg for 802.3z modes
The documentation for Armada 8040 says:
Bit 2 Field InBandAnEn In-band Auto-Negotiation enable. ...
When <PortType> = 1 (1000BASE-X) this field must be set to 1.
We presently ignore whether userspace requests autonegotiation or not
through the ethtool ksettings interface. However, we have some network
interfaces that wish to do this. To offer a consistent API across
network interfaces, deny the ability to disable autonegotiation on
mvpp2 hardware when in 1000BASE-X and 2500BASE-X.
This means the only way to switch between 2500BASE-X and 1000BASE-X
on SFPs that support this will be:
# ethtool -s ethX advertise 0x20000006000 # 1000BASE-X Pause AsymPause
# ethtool -s ethX advertise 0xe000 # 2500BASE-X Pause AsymPause
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Marek Behún <kabel@kernel.org>
Acked-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King (Oracle) [Tue, 20 Jul 2021 09:57:48 +0000 (10:57 +0100)]
net: mvneta: deny disabling autoneg for 802.3z modes
The documentation for Armada 38x says:
Bit 2 Field InBandAnEn In-band Auto-Negotiation enable. ...
When <PortType> = 1 (1000BASE-X) this field must be set to 1.
We presently ignore whether userspace requests autonegotiation or not
through the ethtool ksettings interface. However, we have some network
interfaces that wish to do this. To offer a consistent API across
network interfaces, deny the ability to disable autonegotiation on
mvneta hardware when in 1000BASE-X and 2500BASE-X.
This means the only way to switch between 2500BASE-X and 1000BASE-X
on SFPs that support this will be:
# ethtool -s ethX advertise 0x20000002000 # 1000BASE-X Pause
# ethtool -s ethX advertise 0xa000 # 2500BASE-X Pause
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Marek Behún <kabel@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yang Yang [Tue, 20 Jul 2021 01:43:28 +0000 (18:43 -0700)]
net: ipv4: add capability check for net administration
Root in init user namespace can modify /proc/sys/net/ipv4/ip_forward
without CAP_NET_ADMIN, this doesn't follow the principle of
capabilities. For example, let's take a look at netdev_store(),
root can't modify netdev attribute without CAP_NET_ADMIN.
So let's keep the consistency of permission check logic.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 20 Jul 2021 14:10:49 +0000 (07:10 -0700)]
Merge branch 'qcom-dts-updates'
Alex Elder says:
====================
arm64: dts: qcom: DTS updates
This series updates some IPA-related DT nodes.
Newer versions of IPA do not require an interconnect between IPA
and SoC internal memory. The first patch updates the DT binding
to reflect this.
The second patch adds IPA information to "sc7280.dtsi", using only
two interconnects. It includes the definition of the reserved
memory area used to hold IPA firmware.
The last patch defines the reserved IPA firmware memory area in
"sc7180.dtsi".
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Mon, 19 Jul 2021 21:24:56 +0000 (16:24 -0500)]
arm64: dts: qcom: sc7180: define ipa_fw_mem node
Define the reserved memory space used for IPA firmware for the
Qualcomm SC7180 SoC.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Mon, 19 Jul 2021 21:24:55 +0000 (16:24 -0500)]
arm64: dts: qcom: sc7280: add IPA information
Add IPA-related nodes and definitions to "sc7280.dtsi", including
the reserved memory area used for AP-based IPA firmware loading.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Mon, 19 Jul 2021 21:24:54 +0000 (16:24 -0500)]
dt-bindings: net: qcom,ipa: make imem interconnect optional
On some newer SoCs, the interconnect between IPA and SoC internal
memory (imem) is not used. Reflect this in the binding by moving
the definition of the "imem" interconnect to the end and defining
minItems to be 2 for both the interconnects and interconnect-names
properties.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Mon, 19 Jul 2021 20:23:33 +0000 (15:23 -0500)]
net: ipa: fix IPA v4.11 interconnect data
Currently three interconnects are defined for the Qualcomm SC7280
SoC, but this was based on a misunderstanding. There should only be
two interconnects defined: one between the IPA and system memory;
and another between the AP and IPA config space. The bandwidths
defined for the memory and config interconnects do not match what I
understand to be proper values, so update these.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fabio Estevam [Mon, 19 Jul 2021 23:26:39 +0000 (20:26 -0300)]
dt-bindings: net: fec: Fix indentation
The following warning is observed when running 'make dtbs_check':
Documentation/devicetree/bindings/net/fsl,fec.yaml:85:7: [warning] wrong indentation: expected 8 but found 6 (indentation)
Fix the indentation accordingly.
Signed-off-by: Fabio Estevam <festevam@gmail.com>
Reviewed-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 20 Jul 2021 14:04:37 +0000 (07:04 -0700)]
Merge branch 'fdb-fanout'
Vladimir Oltean says:
====================
Fan out FDB entries pointing towards the bridge to all switchdev member ports
The "DSA RX filtering" series has added some important support for
interpreting addresses towards the bridge device as host addresses and
installing them as FDB entries towards the CPU port, but it does not
cover all circumstances and needs further work.
To be precise, the mechanism introduced in that series only works as
long as the ports are fairly static and no port joins or leaves the
bridge once the configuration is done. If any port leaves, host FDB
entries that were installed during runtime (for example the user changes
the MAC address of the bridge device) will be prematurely deleted,
resulting in a broken setup.
I see this work as targeted for "net-next" because technically it was
not supposed to work. Also, there are still corner cases and holes to be
plugged. For example, today, FDB entries on foreign interfaces are not
covered by br_fdb_replay(), which means that there are cases where some
host addresses are either lost, or never deleted by DSA. That will be
resolved once more work gets accepted, in particular the "Allow
forwarding for the software bridge data path to be offloaded to capable
devices" series, which moves the br_fdb_replay() call to the bridge core
and therefore would be required to solve the problem in a generic way
for every switchdev driver and not just for DSA.
These patches also pave the way for a cleaner implementation for FDB
entries pointing towards a LAG upper interface in DSA (that code needs
only to be added, nothing changed), however this is not done here.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Mon, 19 Jul 2021 13:51:40 +0000 (16:51 +0300)]
net: dsa: use switchdev_handle_fdb_{add,del}_to_device
Using the new fan-out helper for FDB entries installed on the software
bridge, we can install host addresses with the proper refcount on the
CPU port, such that this case:
ip link set swp0 master br0
ip link set swp1 master br0
ip link set swp2 master br0
ip link set swp3 master br0
ip link set br0 address 00:01:02:03:04:05
ip link set swp3 nomaster
works properly and the br0 address remains installed as a host entry
with refcount 3 instead of getting deleted.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Mon, 19 Jul 2021 13:51:39 +0000 (16:51 +0300)]
net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
Currently DSA has an issue with FDB entries pointing towards the bridge
in the presence of br_fdb_replay() being called at port join and leave
time.
In particular, each bridge port will ask for a replay for the FDB
entries pointing towards the bridge when it joins, and for another
replay when it leaves.
This means that for example, a bridge with 4 switch ports will notify
DSA 4 times of the bridge MAC address.
But if the MAC address of the bridge changes during the normal runtime
of the system, the bridge notifies switchdev [ once ] of the deletion of
the old MAC address as a local FDB towards the bridge, and of the
insertion [ again once ] of the new MAC address as a local FDB.
This is a problem, because DSA keeps the old MAC address as a host FDB
entry with refcount 4 (4 ports asked for it using br_fdb_replay). So the
old MAC address will not be deleted. Additionally, the new MAC address
will only be installed with refcount 1, and when the first switch port
leaves the bridge (leaving 3 others as still members), it will delete
with it the new MAC address of the bridge from the local FDB entries
kept by DSA (because the br_fdb_replay call on deletion will bring the
entry's refcount from 1 to 0).
So the problem, really, is that the number of br_fdb_replay() calls is
not matched with the refcount that a host FDB is offloaded to DSA during
normal runtime.
An elegant way to solve the problem would be to make the switchdev
notification emitted by br_fdb_change_mac_address() result in a host FDB
kept by DSA which has a refcount exactly equal to the number of ports
under that bridge. Then, no matter how many DSA ports join or leave that
bridge, the host FDB entry will always be deleted when there are exactly
zero remaining DSA switch ports members of the bridge.
To implement the proposed solution, we remember that the switchdev
objects and port attributes have some helpers provided by switchdev,
which can be optionally called by drivers:
switchdev_handle_port_obj_{add,del} and switchdev_handle_port_attr_set.
These helpers:
- fan out a switchdev object/attribute emitted for the bridge towards
all the lower interfaces that pass the check_cb().
- fan out a switchdev object/attribute emitted for a bridge port that is
a LAG towards all the lower interfaces that pass the check_cb().
In other words, this is the model we need for the FDB events too:
something that will keep an FDB entry emitted towards a physical port as
it is, but translate an FDB entry emitted towards the bridge into N FDB
entries, one per physical port.
Of course, there are many differences between fanning out a switchdev
object (VLAN) on 3 lower interfaces of a LAG and fanning out an FDB
entry on 3 lower interfaces of a LAG. Intuitively, an FDB entry towards
a LAG should be treated specially, because FDB entries are unicast, we
can't just install the same address towards 3 destinations. It is
imaginable that drivers might want to treat this case specifically, so
create some methods for this case and do not recurse into the LAG lower
ports, just the bridge ports.
DSA also listens for FDB entries on "foreign" interfaces, aka interfaces
bridged with us which are not part of our hardware domain: think an
Ethernet switch bridged with a Wi-Fi AP. For those addresses, DSA
installs host FDB entries. However, there we have the same problem
(those host FDB entries are installed with a refcount of only 1) and an
even bigger one which we did not have with FDB entries towards the
bridge:
br_fdb_replay() is currently not called for FDB entries on foreign
interfaces, just for the physical port and for the bridge itself.
So when DSA sniffs an address learned by the software bridge towards a
foreign interface like an e1000 port, and then that e1000 leaves the
bridge, DSA remains with the dangling host FDB address. That will be
fixed separately by replaying all FDB entries and not just the ones
towards the port and the bridge.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Mon, 19 Jul 2021 13:51:38 +0000 (16:51 +0300)]
net: switchdev: introduce helper for checking dynamically learned FDB entries
It is a bit difficult to understand what DSA checks when it tries to
avoid installing dynamically learned addresses on foreign interfaces as
local host addresses, so create a generic switchdev helper that can be
reused and is generally more readable.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xu Liang [Mon, 19 Jul 2021 05:32:12 +0000 (13:32 +0800)]
net: phy: add Maxlinear GPY115/21x/24x driver
Add driver to support the Maxlinear GPY115, GPY211, GPY212, GPY215,
GPY241, GPY245 PHYs. Separate from XWAY PHY driver because this series
has different register layout and new features not supported in XWAY PHY.
Signed-off-by: Xu Liang <lxu@maxlinear.com>
Acked-by: Hauke Mehrtens <hmehrtens@maxlinear.com>
Tested-by: Wong Vee Khee <vee.khee.wong@linux.intel.com>
Tested-by: Wong Vee Khee <vee.khee.wong@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xu Liang [Mon, 19 Jul 2021 05:32:11 +0000 (13:32 +0800)]
net: phy: add API to read 802.3-c45 IDs
Add API to read 802.3-c45 IDs so that C22/C45 mixed device can use
C45 APIs without failing ID checks.
Signed-off-by: Xu Liang <lxu@maxlinear.com>
Acked-by: Hauke Mehrtens <hmehrtens@maxlinear.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 20 Jul 2021 13:37:12 +0000 (06:37 -0700)]
Merge branch 'tag_8021q-cross-chip'
Vladimir Olteans says:
====================
Proper cross-chip support for tag_8021q
The cross-chip bridging support for tag_8021q/sja1105 introduced here:
https://patchwork.ozlabs.org/project/netdev/cover/
20200510163743.18032-1-olteanv@gmail.com/
took some shortcuts and is not reusable in other topologies except for
the one it was written for: disjoint DSA trees. A diagram of this
topology can be seen here:
https://patchwork.ozlabs.org/project/netdev/patch/
20200510163743.18032-3-olteanv@gmail.com/
However there are sja1105 switches on other boards using other
topologies, most notably:
- Daisy chained:
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
- "H" topology:
eth0 eth1
| |
CPU port CPU port
| DSA link |
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 -------- sw1p4 sw1p3 sw1p2 sw1p1 sw1p0
| | | | | |
user user user user user user
port port port port port port
In fact, the current code for tag_8021q cross-chip links works for
neither of these 2 classes of topologies.
The main reasons are:
(a) The sja1105 driver does not treat DSA links. In the "disjoint trees"
topology, the routing port towards any other switch is also the CPU
port, and that was already configured so it already worked.
This series does not deal with enabling DSA links in the sja1105
driver, that is a fairly trivial task that will be dealt with
separately.
(b) The tag_8021q code for cross-chip links assumes that any 2 switches
between cross-chip forwarding needs to be enabled (i.e. which have
user ports part of the same bridge) are at most 1 hop away from each
other. This was true for the "disjoint trees" case because
once a packet reached the CPU port, VLAN-unaware bridging was done
by the DSA master towards the other switches based on destination
MAC address, so the tag_8021q header was not interpreted in any way.
However, in a daisy chain setup with 3 switches, all of them will
interpret the tag_8021q header, and all tag_8021q VLANs need to be
installed in all switches.
When looking at the O(n^2) real complexity of the problem, it is clear
that the current code had absolutely no chance of working in the general
case. So this patch series brings a redesign of tag_8021q, in light of
its new requirements. Anything with O(n^2) complexity (where n is the
number of switches in a DSA tree) is an obvious candidate for the DSA
cross-chip notifier support.
One by one, the patches are:
- The sja1105 driver is extremely entangled with tag_8021q, to be exact,
with that driver's best_effort_vlan_filtering support. We drop this
operating mode, which means that sja1105 temporarily loses network
stack termination for VLAN-aware bridges. That operating mode raced
itself to its own grave anyway due to some hardware limitations in
combination with PTP reported by NXP customers. I can't say a lot
more, but network stack termination for VLAN-aware bridges in sja1105
will be reimplemented soon with a much, much better solution.
- What remains of tag_8021q in sja1105 is support for standalone ports
mode and for VLAN-unaware bridging. We refactor the API surface of
tag_8021q to a single pair of dsa_tag_8021q_{register,unregister}
functions and we clean up everything else related to tag_8021q from
sja1105 and felix.
- Then we move tag_8021q into the DSA core. I thought about this a lot,
and there is really no other way to add a DSA_NOTIFIER_TAG_8021Q_VLAN_ADD
cross-chip notifier if DSA has no way to know if the individual
switches use tag_8021q or not. So it needs to be part of the core to
use notifiers.
- Then we modify tag_8021q to update dynamically on bridge_{join,leave}
events, instead of what we have today which is simply installing the
VLANs on all ports of a switch and leaving port isolation up to
somebody else. This change is necessary because port isolation over a
DSA link cannot be done in any other way except based on VLAN
membership, as opposed to bridging within the same switch which had 2
choices (at least on sja1105).
- Finally we add 2 new cross-chip notifiers for adding and deleting a
tag_8021q VLAN, which is properly refcounted similar to the bridge FDB
and MDB code, and complete cleanup is done on teardown (note that this
is unlike regular bridge VLANs, where we currently cannot do
refcounting because the user can run "bridge vlan add dev swp0 vid 100"
a gazillion times, and "bridge vlan del dev swp0 vid 100" just once,
and for some reason expect that the VLAN will be deleted. But I digress).
With this opportunity we remove a lot of hard-to-digest code and
replace it with much more idiomatic DSA-style code.
This series was regression-tested on:
- Single-switch boards with SJA1105T
- Disjoint-tree boards with SJA1105S and Felix (using ocelot-8021q)
- H topology boards using SJA1110A
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Mon, 19 Jul 2021 17:14:52 +0000 (20:14 +0300)]
net: dsa: tag_8021q: add proper cross-chip notifier support
The big problem which mandates cross-chip notifiers for tag_8021q is
this:
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
When the user runs:
ip link add br0 type bridge
ip link set sw0p0 master br0
ip link set sw2p0 master br0
It doesn't work.
This is because dsa_8021q_crosschip_bridge_join() assumes that "ds" and
"other_ds" are at most 1 hop away from each other, so it is sufficient
to add the RX VLAN of {ds, port} into {other_ds, other_port} and vice
versa and presto, the cross-chip link works. When there is another
switch in the middle, such as in this case switch 1 with its DSA links
sw1p3 and sw1p4, somebody needs to tell it about these VLANs too.
Which is exactly why the problem is quadratic: when a port joins a
bridge, for each port in the tree that's already in that same bridge we
notify a tag_8021q VLAN addition of that port's RX VLAN to the entire
tree. It is a very complicated web of VLANs.
It must be mentioned that currently we install tag_8021q VLANs on too
many ports (DSA links - to be precise, on all of them). For example,
when sw2p0 joins br0, and assuming sw1p0 was part of br0 too, we add the
RX VLAN of sw2p0 on the DSA links of switch 0 too, even though there
isn't any port of switch 0 that is a member of br0 (at least yet).
In theory we could notify only the switches which sit in between the
port joining the bridge and the port reacting to that bridge_join event.
But in practice that is impossible, because of the way 'link' properties
are described in the device tree. The DSA bindings require DT writers to
list out not only the real/physical DSA links, but in fact the entire
routing table, like for example switch 0 above will have:
sw0p3: port@3 {
link = <&sw1p4 &sw2p4>;
};
This was done because:
/* TODO: ideally DSA ports would have a single dp->link_dp member,
* and no dst->rtable nor this struct dsa_link would be needed,
* but this would require some more complex tree walking,
* so keep it stupid at the moment and list them all.
*/
but it is a perfect example of a situation where too much information is
actively detrimential, because we are now in the position where we
cannot distinguish a real DSA link from one that is put there to avoid
the 'complex tree walking'. And because DT is ABI, there is not much we
can change.
And because we do not know which DSA links are real and which ones
aren't, we can't really know if DSA switch A is in the data path between
switches B and C, in the general case.
So this is why tag_8021q RX VLANs are added on all DSA links, and
probably why it will never change.
On the other hand, at least the number of additions/deletions is well
balanced, and this means that once we implement reference counting at
the cross-chip notifier level a la fdb/mdb, there is absolutely zero
need for a struct dsa_8021q_crosschip_link, it's all self-managing.
In fact, with the tag_8021q notifiers emitted from the bridge join
notifiers, it becomes so generic that sja1105 does not need to do
anything anymore, we can just delete its implementation of the
.crosschip_bridge_{join,leave} methods.
Among other things we can simply delete is the home-grown implementation
of sja1105_notify_crosschip_switches(). The reason why that is wrong is
because it is not quadratic - it only covers remote switches to which we
have a cross-chip bridging link and that does not cover in-between
switches. This deletion is part of the same patch because sja1105 used
to poke deep inside the guts of the tag_8021q context in order to do
that. Because the cross-chip links went away, so needs the sja1105 code.
Last but not least, dsa_8021q_setup_port() is simplified (and also
renamed). Because our TAG_8021Q_VLAN_ADD notifier is designed to react
on the CPU port too, the four dsa_8021q_vid_apply() calls:
- 1 for RX VLAN on user port
- 1 for the user port's RX VLAN on the CPU port
- 1 for TX VLAN on user port
- 1 for the user port's TX VLAN on the CPU port
now get squashed into only 2 notifier calls via
dsa_port_tag_8021q_vlan_add.
And because the notifiers to add and to delete a tag_8021q VLAN are
distinct, now we finally break up the port setup and teardown into
separate functions instead of relying on a "bool enabled" flag which
tells us what to do. Arguably it should have been this way from the
get go.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Mon, 19 Jul 2021 17:14:51 +0000 (20:14 +0300)]
net: dsa: tag_8021q: manage RX VLANs dynamically at bridge join/leave time
There has been at least one wasted opportunity for tag_8021q to be used
by a driver:
https://patchwork.ozlabs.org/project/netdev/patch/
20200710113611.3398-3-kurt@linutronix.de/#2484272
because of a design decision: the declared purpose of tag_8021q is to
offer source port/switch identification for a tagging driver for packets
coming from a switch with no hardware DSA tagging support. It is not
intended to provide VLAN-based port isolation, because its first user,
sja1105, had another mechanism for bridging domain isolation, the L2
Forwarding Table. So even if 2 ports are in the same VLAN but they are
separated via the L2 Forwarding Table, they will not communicate with
one another. The L2 Forwarding Table is managed by the
sja1105_bridge_join() and sja1105_bridge_leave() methods.
As a consequence, today tag_8021q does not bother too much with hooking
into .port_bridge_join() and .port_bridge_leave() because that would
introduce yet another degree of freedom, it just iterates statically
through all ports of a switch and adds the RX VLAN of one port to all
the others. In this way, whenever .port_bridge_join() is called,
bridging will magically work because the RX VLANs are already installed
everywhere they need to be.
This is not to say that the reason for the change in this patch is to
satisfy the hellcreek and similar use cases, that is merely a nice side
effect. Instead it is to make sja1105 cross-chip links work properly
over a DSA link.
For context, sja1105 today supports a degenerate form of cross-chip
bridging, where the switches are interconnected through their CPU ports
("disjoint trees" topology). There is some code which has been
generalized into dsa_8021q_crosschip_link_{add,del}, but it is not
enough, and frankly it is impossible to build upon that.
Real multi-switch DSA trees, like daisy chains or H trees, which have
actual DSA links, do not work.
The problem is that sja1105 is unlike mv88e6xxx, and does not have a PVT
for cross-chip bridging, which is a table by which the local switch can
select the forwarding domain for packets from a certain ingress switch
ID and source port. The sja1105 switches cannot parse their own DSA
tags, because, well, they don't really have support for DSA tags, it's
all VLANs.
So to make something like cross-chip bridging between sw0p0 and sw1p0 to
work over the sw0p3/sw1p3 DSA link to work with sja1105 in the topology
below:
| |
sw0p0 sw0p1 sw0p2 sw0p3 sw1p3 sw1p2 sw1p1 sw1p0
[ user ] [ user ] [ cpu ] [ dsa ] ---- [ dsa ] [ cpu ] [ user ] [ user ]
we need to ask ourselves 2 questions:
(1) how should the L2 Forwarding Table be managed?
(2) how should the VLAN Lookup Table be managed?
i.e. what should prevent packets from going to unwanted ports?
Since as mentioned, there is no PVT, the L2 Forwarding Table only
contains forwarding rules for local ports. So we can say "all user ports
are allowed to forward to all CPU ports and all DSA links".
If we allow forwarding to DSA links unconditionally, this means we must
prevent forwarding using the VLAN Lookup Table. This is in fact
asymmetric with what we do for tag_8021q on ports local to the same
switch, and it matters because now that we are making tag_8021q a core
DSA feature, we need to hook into .crosschip_bridge_join() to add/remove
the tag_8021q VLANs. So for symmetry it makes sense to manage the VLANs
for local forwarding in the same way as cross-chip forwarding.
Note that there is a very precise reason why tag_8021q hooks into
dsa_switch_bridge_join() which acts at the cross-chip notifier level,
and not at a higher level such as dsa_port_bridge_join(). We need to
install the RX VLAN of the newly joining port into the VLAN table of all
the existing ports across the tree that are part of the same bridge, and
the notifier already does the iteration through the switches for us.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Mon, 19 Jul 2021 17:14:50 +0000 (20:14 +0300)]
net: dsa: tag_8021q: absorb dsa_8021q_setup into dsa_tag_8021q_{,un}register
Right now, setting up tag_8021q is a 2-step operation for a driver,
first the context structure needs to be created, then the VLANs need to
be installed on the ports. A similar thing is true for teardown.
Merge the 2 steps into the register/unregister methods, to be as
transparent as possible for the driver as to what tag_8021q does behind
the scenes. This also gets rid of the funny "bool setup == true means
setup, == false means teardown" API that tag_8021q used to expose.
Note that dsa_tag_8021q_register() must be called at least in the
.setup() driver method and never earlier (like in the driver probe
function). This is because the DSA switch tree is not initialized at
probe time, and the cross-chip notifiers will not work.
For symmetry with .setup(), the unregister method should be put in
.teardown().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Mon, 19 Jul 2021 17:14:49 +0000 (20:14 +0300)]
net: dsa: make tag_8021q operations part of the core
Make tag_8021q a more central element of DSA and move the 2 driver
specific operations outside of struct dsa_8021q_context (which is
supposed to hold dynamic data and not really constant function
pointers).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Mon, 19 Jul 2021 17:14:48 +0000 (20:14 +0300)]
net: dsa: let the core manage the tag_8021q context
The basic problem description is as follows:
Be there 3 switches in a daisy chain topology:
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
The CPU will not be able to ping through the user ports of the
bottom-most switch (like for example sw2p0), simply because tag_8021q
was not coded up for this scenario - it has always assumed DSA switch
trees with a single switch.
To add support for the topology above, we must admit that the RX VLAN of
sw2p0 must be added on some ports of switches 0 and 1 as well. This is
in fact a textbook example of thing that can use the cross-chip notifier
framework that DSA has set up in switch.c.
There is only one problem: core DSA (switch.c) is not able right now to
make the connection between a struct dsa_switch *ds and a struct
dsa_8021q_context *ctx. Right now, it is drivers who call into
tag_8021q.c and always provide a struct dsa_8021q_context *ctx pointer,
and tag_8021q.c calls them back with the .tag_8021q_vlan_{add,del}
methods.
But with cross-chip notifiers, it is possible for tag_8021q to call
drivers without drivers having ever asked for anything. A good example
is right above: when sw2p0 wants to set itself up for tag_8021q,
the .tag_8021q_vlan_add method needs to be called for switches 1 and 0,
so that they transport sw2p0's VLANs towards the CPU without dropping
them.
So instead of letting drivers manage the tag_8021q context, add a
tag_8021q_ctx pointer inside of struct dsa_switch, which will be
populated when dsa_tag_8021q_register() returns success.
The patch is fairly long-winded because we are partly reverting commit
5899ee367ab3 ("net: dsa: tag_8021q: add a context structure") which made
the driver-facing tag_8021q API use "ctx" instead of "ds". Now that we
can access "ctx" directly from "ds", this is no longer needed.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Mon, 19 Jul 2021 17:14:47 +0000 (20:14 +0300)]
net: dsa: build tag_8021q.c as part of DSA core
Upcoming patches will add tag_8021q related logic to switch.c and
port.c, in order to allow it to make use of cross-chip notifiers.
In addition, a struct dsa_8021q_context *ctx pointer will be added to
struct dsa_switch.
It seems fairly low-reward to #ifdef the *ctx from struct dsa_switch and
to provide shim implementations of the entire tag_8021q.c calling
surface (not even clear what to do about the tag_8021q cross-chip
notifiers to avoid compiling them). The runtime overhead for switches
which don't use tag_8021q is fairly small because all helpers will check
for ds->tag_8021q_ctx being a NULL pointer and stop there.
So let's make it part of dsa_core.o.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Mon, 19 Jul 2021 17:14:46 +0000 (20:14 +0300)]
net: dsa: tag_8021q: create dsa_tag_8021q_{register,unregister} helpers
In preparation of moving tag_8021q to core DSA, move all initialization
and teardown related to tag_8021q which is currently done by drivers in
2 functions called "register" and "unregister". These will gather more
functionality in future patches, which will better justify the chosen
naming scheme.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Mon, 19 Jul 2021 17:14:45 +0000 (20:14 +0300)]
net: dsa: tag_8021q: remove struct packet_type declaration
This is no longer necessary since tag_8021q doesn't register itself as a
full-blown tagger anymore.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Mon, 19 Jul 2021 17:14:44 +0000 (20:14 +0300)]
net: dsa: tag_8021q: use symbolic error names
Use %pe to give the user a string holding the error code instead of just
a number.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Mon, 19 Jul 2021 17:14:43 +0000 (20:14 +0300)]
net: dsa: tag_8021q: use "err" consistently instead of "rc"
Some of the tag_8021q code has been taken out of sja1105, which uses
"rc" for its return code variables, whereas the DSA core uses "err".
Change tag_8021q for consistency.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Mon, 19 Jul 2021 17:14:42 +0000 (20:14 +0300)]
net: dsa: sja1105: delete the best_effort_vlan_filtering mode
Simply put, the best-effort VLAN filtering mode relied on VLAN retagging
from a bridge VLAN towards a tag_8021q sub-VLAN in order to be able to
decode the source port in the tagger, but the VLAN retagging
implementation inside the sja1105 chips is not the best and we were
relying on marginal operating conditions.
The most notable limitation of the best-effort VLAN filtering mode is
its incapacity to treat this case properly:
ip link add br0 type bridge vlan_filtering 1
ip link set swp2 master br0
ip link set swp4 master br0
bridge vlan del dev swp4 vid 1
bridge vlan add dev swp4 vid 1 pvid
When sending an untagged packet through swp2, the expectation is for it
to be forwarded to swp4 as egress-tagged (so it will contain VLAN ID 1
on egress). But the switch will send it as egress-untagged.
There was an attempt to fix this here:
https://patchwork.kernel.org/project/netdevbpf/patch/
20210407201452.1703261-2-olteanv@gmail.com/
but it failed miserably because it broke PTP RX timestamping, in a way
that cannot be corrected due to hardware issues related to VLAN
retagging.
So with either PTP broken or pushing VLAN headers on egress for untagged
packets being broken, the sad reality is that the best-effort VLAN
filtering code is broken. Delete it.
Note that this means there will be a temporary loss of functionality in
this driver until it is replaced with something better (network stack
RX/TX capability for "mode 2" as described in
Documentation/networking/dsa/sja1105.rst, the "port under VLAN-aware
bridge" case). We simply cannot keep this code until that driver rework
is done, it is super bloated and tangled with tag_8021q.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 20 Jul 2021 13:23:50 +0000 (06:23 -0700)]
Merge branch 's390-next'
Julian Wiedmann says:
====================
s390/qeth: updates 2021-07-20
please apply the following patch series for qeth to netdev's net-next tree.
This removes the deprecated support for OSN-mode devices, and does some
follow-on cleanups.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 20 Jul 2021 13:23:13 +0000 (06:23 -0700)]
Merge branch 'veth-flexible-channel-numbers'
Paolo Abeni says:
====================
veth: more flexible channels number configuration
XDP setups can benefit from multiple veth RX/TX queues. Currently
veth allow setting such number only at creation time via the
'numrxqueues' and 'numtxqueues' parameters.
This series introduces support for the ethtool set_channel operation
and allows configuring the queue number via a new module parameter.
The veth default configuration is not changed.
Finally self-tests are updated to check the new features, with both
valid and invalid arguments.
This iteration is a rebase of the most recent RFC, it does not provide
a module parameter to configure the default number of queues, but I
think could be worthy
RFC v1 -> RFC v2:
- report more consistent 'combined' count
- make set_channel as resilient as possible to errors
- drop module parameter - but I would still consider it.
- more self-tests
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 20 Jul 2021 13:22:14 +0000 (06:22 -0700)]
Merge branch 'bridge-vlan-multicast'
Nikolay Aleksandrov says:
====================
net: bridge: multicast: add vlan support
This patchset adds initial per-vlan multicast support, most of the code
deals with moving to multicast context pointers from bridge/port pointers.
That allows us to switch them with the per-vlan contexts when a multicast
packet is being processed and vlan multicast snooping has been enabled.
That is controlled by a global bridge option added in patch 06 which is
off by default (BR_BOOLOPT_MCAST_VLAN_SNOOPING). It is important to note
that this option can change only under RTNL and doesn't require
multicast_lock, so we need to be careful when retrieving mcast contexts
in parallel. For packet processing they are switched only once in
br_multicast_rcv() and then used until the packet has been processed.
For the most part we need these contexts only to read config values and
check if they are disabled. The global mcast state which is maintained
consists of querier and router timers, the rest are config options.
The port mcast state which is maintained consists of query timer and
link to router port list if it's ever marked as a router port. Port
multicast contexts _must_ be used only with their respective global
contexts, that is a bridge port's mcast context must be used only with
bridge's global mcast context and a vlan/port's mcast context must be
used only with that vlan's global mcast context due to the router port
lists. This way a bridge port can be marked as a router in multiple
vlans, but might not be a router in some other vlan. Also this allows us
to have per-vlan querier elections, per-vlan queries and basically the
whole multicast state becomes per-vlan when the option is enabled.
One of the hardest parts is synchronization with vlan's memory
management, that is done through a new vlan flag: BR_VLFLAG_MCAST_ENABLED
which is changed only under multicast_lock. When a vlan is being
destroyed first that flag is removed under the lock, then the multicast
context is torn down which includes waiting for any outstanding context
timers. Since all of the vlan processing depends on BR_VLFLAG_MCAST_ENABLED
it must be checked first if the contexts are vlan and the multicast_lock
has been acquired. That is done by all IGMP/MLD packet processing
functions and timers. When processing a packet we have RCU so the vlan
memory won't be freed, but if the flag is missing we must not process it.
The timers are synchronized in the same way with the addition of waiting
for them to finish in case they are running after removing the flag
under multicast_lock (i.e. they were waiting for the lock). Multicast vlan
snooping requires vlan filtering to be enabled, if it's disabled then
snooping gets automatically disabled, too. BR_VLFLAG_GLOBAL_MCAST_ENABLED
controls if a vlan has BR_VLFLAG_MCAST_ENABLED set which is used in all
vlan disabled checks. We need both flags because one is controlled by
user-space globally (BR_VLFLAG_GLOBAL_MCAST_ENABLED) and the other is
for a particular bridge/vlan or port/vlan entry (BR_VLFLAG_MCAST_ENABLED).
Since the latter is also used for synchronization between the multicast
and vlan code, and also controlled by BR_VLFLAG_GLOBAL_MCAST_ENABLED we
rely on it when checking if a vlan context is disabled. The multicast
fast-path has 3 new bit tests on the cache-hot bridge flags field, I
didn't observe any measurable difference. I haven't forced either
context options to be always disabled when the other type is enabled
because the state consists of timers which either expire (router) or
don't affect the normal operation. Some options, like the mcast querier
one, won't be allowed to change for the disabled context type, that will
come with a future patch-set which adds per-vlan querier control.
Another important addition is the global vlan options, so far we had
only per bridge/port vlan options but in order to control vlan multicast
snooping globally we need to add a new type of global vlan options.
They can be changed only on the bridge device and are dumped only when a
special flag is set in the dump request. The first global option is vlan
mcast snooping control, it controls the vlan BR_VLFLAG_GLOBAL_MCAST_ENABLED
private flag. It can be set only on master vlan entries. There will be
many more global vlan options in the future both for multicast config
and other per-vlan options (e.g. STP).
There's a lot of room for improvements, I'll do some of the initial
ones but splitting the state to different contexts opens the door
for a lot more. Also any new multicast options become vlan-supported with
very little to no effort by using the same contexts.
Short patch description:
patches 01-04: initial mcast context add, no functional changes
patch 05: adds vlan mcast init and control helpers and uses them on
vlan create/destroy
patch 06: adds a global bridge mcast vlan snooping knob (default
off)
patches 07-08: add a helper for users which must derive the contexts
based on current bridge and vlan options (e.g. timers)
patch 09: adds checks for disabled vlan contexts in packet
processing and timers
patch 10: adds support for per-vlan querier and tagged queries
patch 11: adds router port vlan id in the notifications
patches 12-14: add global vlan options support (change, dump, notify)
patch 15: adds per-vlan global mcast snooping control
Future patch-sets which build on this one (in order):
- vlan state mcast handling
- user-space mdb contexts (currently only the bridge contexts are used
there)
- all bridge multicast config options added per-vlan global and per
vlan/port
- iproute2 support for all the new uAPIs
- selftests
This set has been stress-tested (deleting/adding ports/vlans while changing
vlan mcast snooping while processing IGMP/MLD packets), and also has
passed all bridge self-tests. I'm sending this set as early as possible
since there're a few more related sets that should go in the same
release to get proper and full mcast vlan snooping support.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Tue, 20 Jul 2021 06:38:49 +0000 (08:38 +0200)]
s390/qeth: clean up device_type management
qeth uses three device_type structs - a generic one, and one for each
sub-driver (which is used for fixed-layer devices only). Instead of
exporting these device_types back&forth between the driver's modules,
make all the logic self-contained within the sub-drivers.
On disc->setup() they either install their own device_type, or add the
sysfs attributes that are missing in the generic device_type. Later on
disc->remove() these attributes are removed again from any device that
has the generic device_type.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Tue, 20 Jul 2021 06:38:48 +0000 (08:38 +0200)]
s390/qeth: clean up QETH_PROT_* naming
The QETH_PROT_* naming is shared among two unrelated areas - one is
the MPC-level protocol identifiers, the other is the qeth_prot_version
enum.
Rename the MPC definitions to use QETH_MPC_PROT_*.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Tue, 20 Jul 2021 06:38:47 +0000 (08:38 +0200)]
s390/qeth: remove OSN support
Commit
fb64de1bc36c ("s390/qeth: phase out OSN support") spelled out
why the OSN support in qeth is in a bad shape, and put any remaining
interested parties on notice to speak up before it gets ripped out.
It's 2021 now, so make true on that promise and remove all the
OSN-specific parts from qeth. This also means that we no longer need to
export various parts of the cmd & data path internals to the L2 driver.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 20 Jul 2021 13:13:37 +0000 (06:13 -0700)]
Merge branch '40GbE' of git://git./linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
40GbE Intel Wired LAN Driver Updates 2021-07-19
This series contains updates to iavf and i40e drivers.
Stefan Assmann adds locking to a path that does not acquire a spinlock
where needed for i40e. He also adjusts locking of critical sections to
help avoid races and removes overriding of the adapter state during
pending reset for iavf driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 20 Jul 2021 13:11:28 +0000 (06:11 -0700)]
Merge branch 'veth-flexible-channel-numbers'
Paolo Abeni says:
====================
veth: more flexible channels number configuration
XDP setups can benefit from multiple veth RX/TX queues. Currently
veth allow setting such number only at creation time via the
'numrxqueues' and 'numtxqueues' parameters.
This series introduces support for the ethtool set_channel operation
and allows configuring the queue number via a new module parameter.
The veth default configuration is not changed.
Finally self-tests are updated to check the new features, with both
valid and invalid arguments.
This iteration is a rebase of the most recent RFC, it does not provide
a module parameter to configure the default number of queues, but I
think could be worthy
RFC v1 -> RFC v2:
- report more consistent 'combined' count
- make set_channel as resilient as possible to errors
- drop module parameter - but I would still consider it.
- more self-tests
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 20 Jul 2021 08:41:52 +0000 (10:41 +0200)]
selftests: net: veth: add tests for set_channel
Simple functional test for the newly exposed features.
Also add an optional stress test for the channel number
update under flood.
RFC v1 -> RFC v2:
- add the stress test
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 20 Jul 2021 08:41:51 +0000 (10:41 +0200)]
veth: create by default nr_possible_cpus queues
This allows easier XDP usage. The number of default active
queues is not changed: 1 RX and 1 TX so that this does
not introduce overhead on the datapath for queue selection.
v1 -> v2:
- drop the module parameter, force default to nr_possible_cpus - Toke
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 20 Jul 2021 08:41:50 +0000 (10:41 +0200)]
veth: implement support for set_channel ethtool op
This change implements the set_channel() ethtool operation,
preserving the current defaults values and allowing up set
the number of queues in the range set ad device creation
time.
The update operation tries hard to leave the device in a
consistent status in case of errors.
RFC v1 -> RFC v2:
- don't flip device status on set_channel()
- roll-back the changes if possible on error - Jackub
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 20 Jul 2021 08:41:49 +0000 (10:41 +0200)]
veth: factor out initialization helper
Extract in simpler helpers the code to enable and disable a
range of xdp/napi instance, with the common property that
"disable" helpers can't fail.
Will be used by the next patch. No functional change intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 20 Jul 2021 08:41:48 +0000 (10:41 +0200)]
veth: always report zero combined channels
veth get_channel currently reports for channels being both RX/TX and
combined. As Jakub noted:
"""
ethtool man page is relatively clear, unfortunately the kernel code
is not and few read the man page. A channel is approximately an IRQ,
not a queue, and IRQ can't be dedicated and combined simultaneously
"""
This patch changes the information exposed by veth_get_channels,
setting max_combined to zero, being more consistent with the above
statement. The ethtool_channels is always cleared by the caller, we just
need to avoid setting the 'combined' fields.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasily Averin [Mon, 19 Jul 2021 10:44:56 +0000 (13:44 +0300)]
memcg: enable accounting for scm_fp_list objects
unix sockets allows to send file descriptors via SCM_RIGHTS type messages.
Each such send call forces kernel to allocate up to 2Kb memory for
struct scm_fp_list.
It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasily Averin [Mon, 19 Jul 2021 10:44:50 +0000 (13:44 +0300)]
memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation
Author: Andrey Ryabinin <aryabinin@virtuozzo.com>
The size of the ip_tunnel_prl structs allocation is controllable from
user-space, thus it's better to avoid spam in dmesg if allocation failed.
Also add __GFP_ACCOUNT as this is a good candidate for per-memcg
accounting. Allocation is temporary and limited by 4GB.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasily Averin [Mon, 19 Jul 2021 10:44:44 +0000 (13:44 +0300)]
memcg: enable accounting for VLAN group array
vlan array consume up to 8 pages of memory per net device.
It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasily Averin [Mon, 19 Jul 2021 10:44:37 +0000 (13:44 +0300)]
memcg: enable accounting for inet_bin_bucket cache
net namespace can create up to 64K tcp and dccp ports and force kernel
to allocate up to several megabytes of memory per netns
for inet_bind_bucket objects.
It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasily Averin [Mon, 19 Jul 2021 10:44:31 +0000 (13:44 +0300)]
memcg: enable accounting for IP address and routing-related objects
An netadmin inside container can use 'ip a a' and 'ip r a'
to assign a large number of ipv4/ipv6 addresses and routing entries
and force kernel to allocate megabytes of unaccounted memory
for long-lived per-netdevice related kernel objects:
'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
'struct rt6_info', 'struct fib_rules' and ip_fib caches.
These objects can be manually removed, though usually they lives
in memory till destroy of its net namespace.
It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.
One of such objects is the 'struct fib6_node' mostly allocated in
net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
write_lock_bh(&table->tb6_lock);
err = fib6_add(&table->tb6_root, rt, info, mxc);
write_unlock_bh(&table->tb6_lock);
In this case it is not enough to simply add SLAB_ACCOUNT to corresponding
kmem cache. The proper memory cgroup still cannot be found due to the
incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
Obsoleted in_interrupt() does not describe real execution context properly.
>From include/linux/preempt.h:
The following macros are deprecated and should not be used in new code:
in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled
To verify the current execution context new macro should be used instead:
in_task() - We're in task context
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasily Averin [Mon, 19 Jul 2021 10:44:23 +0000 (13:44 +0300)]
memcg: enable accounting for net_device and Tx/Rx queues
Container netadmin can create a lot of fake net devices,
then create a new net namespace and repeat it again and again.
Net device can request the creation of up to 4096 tx and rx queues,
and force kernel to allocate up to several tens of megabytes memory
per net device.
It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 20 Jul 2021 12:41:29 +0000 (05:41 -0700)]
Merge branch 'bridge-vlan-multicast'
Nikolay Aleksandrov says:
====================
net: bridge: multicast: add vlan support
This patchset adds initial per-vlan multicast support, most of the code
deals with moving to multicast context pointers from bridge/port pointers.
That allows us to switch them with the per-vlan contexts when a multicast
packet is being processed and vlan multicast snooping has been enabled.
That is controlled by a global bridge option added in patch 06 which is
off by default (BR_BOOLOPT_MCAST_VLAN_SNOOPING). It is important to note
that this option can change only under RTNL and doesn't require
multicast_lock, so we need to be careful when retrieving mcast contexts
in parallel. For packet processing they are switched only once in
br_multicast_rcv() and then used until the packet has been processed.
For the most part we need these contexts only to read config values and
check if they are disabled. The global mcast state which is maintained
consists of querier and router timers, the rest are config options.
The port mcast state which is maintained consists of query timer and
link to router port list if it's ever marked as a router port. Port
multicast contexts _must_ be used only with their respective global
contexts, that is a bridge port's mcast context must be used only with
bridge's global mcast context and a vlan/port's mcast context must be
used only with that vlan's global mcast context due to the router port
lists. This way a bridge port can be marked as a router in multiple
vlans, but might not be a router in some other vlan. Also this allows us
to have per-vlan querier elections, per-vlan queries and basically the
whole multicast state becomes per-vlan when the option is enabled.
One of the hardest parts is synchronization with vlan's memory
management, that is done through a new vlan flag: BR_VLFLAG_MCAST_ENABLED
which is changed only under multicast_lock. When a vlan is being
destroyed first that flag is removed under the lock, then the multicast
context is torn down which includes waiting for any outstanding context
timers. Since all of the vlan processing depends on BR_VLFLAG_MCAST_ENABLED
it must be checked first if the contexts are vlan and the multicast_lock
has been acquired. That is done by all IGMP/MLD packet processing
functions and timers. When processing a packet we have RCU so the vlan
memory won't be freed, but if the flag is missing we must not process it.
The timers are synchronized in the same way with the addition of waiting
for them to finish in case they are running after removing the flag
under multicast_lock (i.e. they were waiting for the lock). Multicast vlan
snooping requires vlan filtering to be enabled, if it's disabled then
snooping gets automatically disabled, too. BR_VLFLAG_GLOBAL_MCAST_ENABLED
controls if a vlan has BR_VLFLAG_MCAST_ENABLED set which is used in all
vlan disabled checks. We need both flags because one is controlled by
user-space globally (BR_VLFLAG_GLOBAL_MCAST_ENABLED) and the other is
for a particular bridge/vlan or port/vlan entry (BR_VLFLAG_MCAST_ENABLED).
Since the latter is also used for synchronization between the multicast
and vlan code, and also controlled by BR_VLFLAG_GLOBAL_MCAST_ENABLED we
rely on it when checking if a vlan context is disabled. The multicast
fast-path has 3 new bit tests on the cache-hot bridge flags field, I
didn't observe any measurable difference. I haven't forced either
context options to be always disabled when the other type is enabled
because the state consists of timers which either expire (router) or
don't affect the normal operation. Some options, like the mcast querier
one, won't be allowed to change for the disabled context type, that will
come with a future patch-set which adds per-vlan querier control.
Another important addition is the global vlan options, so far we had
only per bridge/port vlan options but in order to control vlan multicast
snooping globally we need to add a new type of global vlan options.
They can be changed only on the bridge device and are dumped only when a
special flag is set in the dump request. The first global option is vlan
mcast snooping control, it controls the vlan BR_VLFLAG_GLOBAL_MCAST_ENABLED
private flag. It can be set only on master vlan entries. There will be
many more global vlan options in the future both for multicast config
and other per-vlan options (e.g. STP).
There's a lot of room for improvements, I'll do some of the initial
ones but splitting the state to different contexts opens the door
for a lot more. Also any new multicast options become vlan-supported with
very little to no effort by using the same contexts.
Short patch description:
patches 01-04: initial mcast context add, no functional changes
patch 05: adds vlan mcast init and control helpers and uses them on
vlan create/destroy
patch 06: adds a global bridge mcast vlan snooping knob (default
off)
patches 07-08: add a helper for users which must derive the contexts
based on current bridge and vlan options (e.g. timers)
patch 09: adds checks for disabled vlan contexts in packet
processing and timers
patch 10: adds support for per-vlan querier and tagged queries
patch 11: adds router port vlan id in the notifications
patches 12-14: add global vlan options support (change, dump, notify)
patch 15: adds per-vlan global mcast snooping control
Future patch-sets which build on this one (in order):
- vlan state mcast handling
- user-space mdb contexts (currently only the bridge contexts are used
there)
- all bridge multicast config options added per-vlan global and per
vlan/port
- iproute2 support for all the new uAPIs
- selftests
This set has been stress-tested (deleting/adding ports/vlans while changing
vlan mcast snooping while processing IGMP/MLD packets), and also has
passed all bridge self-tests. I'm sending this set as early as possible
since there're a few more related sets that should go in the same
release to get proper and full mcast vlan snooping support.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Mon, 19 Jul 2021 17:06:37 +0000 (20:06 +0300)]
net: bridge: vlan: add mcast snooping control
Add a new global vlan option which controls whether multicast snooping
is enabled or disabled for a single vlan. It controls the vlan private
flag: BR_VLFLAG_GLOBAL_MCAST_ENABLED.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Mon, 19 Jul 2021 17:06:36 +0000 (20:06 +0300)]
net: bridge: vlan: notify when global options change
Add support for global options notifications. They use only RTM_NEWVLAN
since global options can only be set and are contained in a separate
vlan global options attribute. Notifications are compressed in ranges
where possible, i.e. the sequential vlan options are equal.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Mon, 19 Jul 2021 17:06:35 +0000 (20:06 +0300)]
net: bridge: vlan: add support for dumping global vlan options
Add a new vlan options dump flag which causes only global vlan options
to be dumped. The dumps are done only with bridge devices, ports are
ignored. They support vlan compression if the options in sequential
vlans are equal (currently always true).
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Mon, 19 Jul 2021 17:06:34 +0000 (20:06 +0300)]
net: bridge: vlan: add support for global options
We can have two types of vlan options depending on context:
- per-device vlan options (split in per-bridge and per-port)
- global vlan options
The second type wasn't supported in the bridge until now, but we need
them for per-vlan multicast support, per-vlan STP support and other
options which require global vlan context. They are contained in the global
bridge vlan context even if the vlan is not configured on the bridge device
itself. This patch adds initial netlink attributes and support for setting
these global vlan options, they can only be set (RTM_NEWVLAN) and the
operation must use the bridge device. Since there are no such options yet
it shouldn't have any functional effect.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Mon, 19 Jul 2021 17:06:33 +0000 (20:06 +0300)]
net: bridge: multicast: include router port vlan id in notifications
Use the port multicast context to check if the router port is a vlan and
in case it is include its vlan id in the notification.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Mon, 19 Jul 2021 17:06:32 +0000 (20:06 +0300)]
net: bridge: multicast: add vlan querier and query support
Add basic vlan context querier support, if the contexts passed to
multicast_alloc_query are vlan then the query will be tagged. Also
handle querier start/stop of vlan contexts.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>