Ido Schimmel [Wed, 2 May 2018 07:17:35 +0000 (10:17 +0300)]
mlxsw: spectrum_router: Return an error for routes added after abort
We currently do not perform accounting in the driver and thus can't
reject routes before resources are exceeded.
However, in order to make users aware of the fact that routes are no
longer offloaded we can return an error for routes configured after the
abort mechanism was triggered.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 2 May 2018 07:17:34 +0000 (10:17 +0300)]
mlxsw: spectrum_router: Return an error for non-default FIB rules
Since commit
9776d32537d2 ("net: Move call_fib_rule_notifiers up in
fib_nl_newrule") it is possible to forbid the installation of
unsupported FIB rules.
Have mlxsw return an error for non-default FIB rules in addition to the
existing extack message.
Example:
# ip rule add from 198.51.100.1 table 10
Error: mlxsw_spectrum: FIB rules not supported.
Note that offload is only aborted when non-default FIB rules are already
installed and merely replayed during module initialization.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ganesh Goudar [Wed, 2 May 2018 06:17:15 +0000 (11:47 +0530)]
cxgb4: add new T5 device id's
Add device id's 0x5019, 0x501a and 0x501b for T5
cards.
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kees Cook [Tue, 1 May 2018 21:01:30 +0000 (14:01 -0700)]
net: stmmac: Avoid VLA usage
In the quest to remove all stack VLAs from the kernel[1], this switches
the "status" stack buffer to use the existing small (8) upper bound on
how many queues can be checked for DMA, and adds a sanity-check just to
make sure it doesn't operate under pathological conditions.
[1] http://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Jose Abreu <joabreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Raghu Vatsavayi [Tue, 1 May 2018 17:32:10 +0000 (10:32 -0700)]
liquidio VF: indicate that disabling rx vlan offload is not allowed
NIC firmware does not support disabling rx vlan offload, but the VF driver
incorrectly indicates that it is supported. The PF driver already does the
correct indication by clearing the NETIF_F_HW_VLAN_CTAG_RX bit in its
netdev->hw_features. So just do the same thing in the VF.
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@cavium.com>
Acked-by: Prasad Kanneganti <prasad.kanneganti@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Tranchetti [Tue, 1 May 2018 00:01:02 +0000 (18:01 -0600)]
udp: Complement partial checksum for GSO packet
Using the udp_v4_check() function to calculate the pseudo header
for the newly segmented UDP packets results in assigning the complement
of the value to the UDP header checksum field.
Always undo the complement the partial checksum value in order to
match the case where GSO is not used on the UDP transmit path.
Fixes:
ee80d1ebe5ba ("udp: add udp gso")
Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Soheil Hassas Yeganeh [Tue, 1 May 2018 19:39:16 +0000 (15:39 -0400)]
selftest: add test for TCP_INQ
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Soheil Hassas Yeganeh [Tue, 1 May 2018 19:39:15 +0000 (15:39 -0400)]
tcp: send in-queue bytes in cmsg upon read
Applications with many concurrent connections, high variance
in receive queue length and tight memory bounds cannot
allocate worst-case buffer size to drain sockets. Knowing
the size of receive queue length, applications can optimize
how they allocate buffers to read from the socket.
The number of bytes pending on the socket is directly
available through ioctl(FIONREAD/SIOCINQ) and can be
approximated using getsockopt(MEMINFO) (rmem_alloc includes
skb overheads in addition to application data). But, both of
these options add an extra syscall per recvmsg. Moreover,
ioctl(FIONREAD/SIOCINQ) takes the socket lock.
Add the TCP_INQ socket option to TCP. When this socket
option is set, recvmsg() relays the number of bytes available
on the socket for reading to the application via the
TCP_CM_INQ control message.
Calculate the number of bytes after releasing the socket lock
to include the processed backlog, if any. To avoid an extra
branch in the hot path of recvmsg() for this new control
message, move all cmsg processing inside an existing branch for
processing receive timestamps. Since the socket lock is not held
when calculating the size of receive queue, TCP_INQ is a hint.
For example, it can overestimate the queue size by one byte,
if FIN is received.
With this method, applications can start reading from the socket
using a small buffer, and then use larger buffers based on the
remaining data when needed.
V3 change-log:
As suggested by David Miller, added loads with barrier
to check whether we have multiple threads calling recvmsg
in parallel. When that happens we lock the socket to
calculate inq.
V4 change-log:
Removed inline from a static function.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 1 May 2018 19:08:38 +0000 (15:08 -0400)]
Merge branch 'hns3-fixes'
Salil Mehta says:
====================
Misc bug fixes for HNS3 Ethernet driver
This patch-set presents some miscellaneous bug fixs and cleanups for
HNS3 Ethernet Driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Xi Wang [Tue, 1 May 2018 18:56:05 +0000 (19:56 +0100)]
net: hns3: Remove packet statistics in the range of 8192~12287
Because the current statistics for size 8192~12287 are only valid for GE,
the ranges of 8192~9216 and 9217~12287 are valid only for LGE/CGE, and are
always 0 for GE interfaces. it is easy to cause confusion when viewing the
packet statistics using the command ethtool -S.
This patch removes the 8192~12287 range of packet statistics and uses the
8192~9216 and 9217~12287 ranges for statistics. This change depends on the
firmware upgrade.
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yunsheng Lin [Tue, 1 May 2018 18:56:04 +0000 (19:56 +0100)]
net: hns3: Fix for packet loss due wrong filter config in VLAN tbls
There are two level of vlan tables in hardware, one is port vlan
which is shared by all functions, the other one is function
vlan table, each function has it's own function vlan table.
Currently, PF sets the port vlan table, and vf sets the function
vlan table, which will cause packet lost problem.
This patch fixes this problem by setting both vlan table, and
use hdev->vlan_table to manage thet port vlan table.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Tue, 1 May 2018 18:56:03 +0000 (19:56 +0100)]
net: hns3: fix a dead loop in hclge_cmd_csq_clean
If head has invlid value then a dead loop can be triggered in
hclge_cmd_csq_clean. This patch adds sanity check for this case.
Fixes:
68c0a5c70614 ("net: hns3: Add HNS3 IMP(Integrated Mgmt Proc) Cmd
Interface Support")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Tue, 1 May 2018 18:56:02 +0000 (19:56 +0100)]
net: hns3: Fix to support autoneg only for port attached with phy
This patch adds a check to support autoneg(ethtool -A) only when PHY
is attached with the port.
Fixes:
e2cb1dec9779 ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility
Layer) Support")
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Tue, 1 May 2018 18:56:01 +0000 (19:56 +0100)]
net: hns3: fix for phy_addr error in hclge_mac_mdio_config
When phy exists, phy_addr must less than PHY_MAX_ADDR.
If not, hclge_mac_mdio_config should return error.
And for fiber(phy_addr=0xff), it does not need hclge_mac_mdio_config.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Tue, 1 May 2018 18:56:00 +0000 (19:56 +0100)]
net: hns3: Fixes the error legs in hclge_init_ae_dev function
This patch fixes some of the missed error legs in the initialization
function of the ae device. This might cause leaks in case of failure.
Fixes:
46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer
Support")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Tue, 1 May 2018 18:55:59 +0000 (19:55 +0100)]
net: hns3: Fixes the out of bounds access in hclge_map_tqp
This patch fixes the handling of the check when number of vports
are detected to be more than available TPQs. Current handling causes
an out of bounds access in hclge_map_tqp().
Fixes:
7df7dad633e2 ("net: hns3: Refactor the mapping of tqp to vport")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Tue, 1 May 2018 18:55:58 +0000 (19:55 +0100)]
net: hns3: fix to correctly fetch l4 protocol outer header
This patch fixes the function being used to fetch L4
protocol outer header. Mistakenly skb_inner_transport_header
API was being used earlier.
Fixes:
76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yunsheng Lin [Tue, 1 May 2018 18:55:57 +0000 (19:55 +0100)]
net: hns3: Remove error log when getting pfc stats fails
When mac supports DCB, but is in GE mode, it does not support
querying pfc stats, firmware returns error when trying to
query the pfc stats. this creates a lot of noise in the kernel
log when it prints the error log.
This patch fixes it by removing the error log, because it already
return the error to the user space, so the user should be aware of
the error.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Strogin [Mon, 30 Apr 2018 22:04:29 +0000 (01:04 +0300)]
connector: add parent pid and tgid to coredump and exit events
The intention is to get notified of process failures as soon
as possible, before a possible core dumping (which could be very long)
(e.g. in some process-manager). Coredump and exit process events
are perfect for such use cases (see
2b5faa4c553f "connector: Added
coredumping event to the process connector").
The problem is that for now the process-manager cannot know the parent
of a dying process using connectors. This could be useful if the
process-manager should monitor for failures only children of certain
parents, so we could filter the coredump and exit events by parent
process and/or thread ID.
Add parent pid and tgid to coredump and exit process connectors event
data.
Signed-off-by: Stefan Strogin <sstrogin@cisco.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Mon, 30 Apr 2018 21:20:05 +0000 (14:20 -0700)]
net: core: Inline netdev_features_size_check()
We do not require this inline function to be used in multiple different
locations, just inline it where it gets used in register_netdevice().
Suggested-by: David Miller <davem@davemloft.net>
Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Mon, 30 Apr 2018 19:58:36 +0000 (15:58 -0400)]
udp: disable gso with no_check_tx
Syzbot managed to send a udp gso packet without checksum offload into
the gso stack by disabling tx checksum (UDP_NO_CHECK6_TX). This
triggered the skb_warn_bad_offload.
RIP: 0010:skb_warn_bad_offload+0x2bc/0x600 net/core/dev.c:2658
skb_gso_segment include/linux/netdevice.h:4038 [inline]
validate_xmit_skb+0x54d/0xd90 net/core/dev.c:3120
__dev_queue_xmit+0xbf8/0x34c0 net/core/dev.c:3577
dev_queue_xmit+0x17/0x20 net/core/dev.c:3618
UDP_NO_CHECK6_TX sets skb->ip_summed to CHECKSUM_NONE just after the
udp gso integrity checks in udp_(v6_)send_skb. Extend those checks to
catch and fail in this case.
After the integrity checks jump directly to the CHECKSUM_PARTIAL case
to avoid reading the no_check_tx flags again (a TOCTTOU race).
Fixes:
bec1f6f69736 ("udp: generate gso with UDP_SEGMENT")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Blakey [Mon, 30 Apr 2018 11:28:30 +0000 (14:28 +0300)]
cls_flower: Support multiple masks per priority
Currently flower doesn't support inserting filters with different masks
on a single priority, even if the actual flows (key + mask) inserted
aren't overlapping, as with the use case of offloading openvswitch
datapath flows. Instead one must go up one level, and assign different
priorities for each mask, which will create a different flower
instances.
This patch opens flower to support more than one mask per priority,
and a single flower instance. It does so by adding another hash table
on top of the existing one which will store the different masks,
and the filters that share it.
The user is left with the responsibility of ensuring non overlapping
flows, otherwise precedence is not guaranteed.
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 1 May 2018 16:09:36 +0000 (12:09 -0400)]
Merge branch 'sctp-unify-sctp_make_op_error_fixed-and-sctp_make_op_error_space'
Marcelo Ricardo Leitner says:
====================
sctp: unify sctp_make_op_error_fixed and sctp_make_op_error_space
These two variants are very close to each other and can be merged
to avoid code duplication. That's what this patchset does.
First, we allow sctp_init_cause to return errors, which then allow us to
add sctp_make_op_error_limited that handles both situations.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner [Sun, 29 Apr 2018 15:56:32 +0000 (12:56 -0300)]
sctp: add sctp_make_op_error_limited and reuse inner functions
The idea is quite similar to the old functions, but note that the _fixed
function wasn't "fixed" as in that it would generate a packet with a fixed
size, but rather limited/bounded to PMTU.
Also, now with sctp_mtu_payload(), we have a more accurate limit.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner [Sun, 29 Apr 2018 15:56:31 +0000 (12:56 -0300)]
sctp: allow sctp_init_cause to return errors
And do so if the skb doesn't have enough space for the payload.
This is a preparation for the next patch.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 1 May 2018 15:30:00 +0000 (11:30 -0400)]
Merge branch 'net-stmmac-dwmac-meson-100M-phy-mode-support-for-AXG-SoC'
Yixun Lan says:
====================
net: stmmac: dwmac-meson: 100M phy mode support for AXG SoC
Due to the dwmac glue layer register changed, we need to
introduce a new compatible name for the Meson-AXG SoC
to support for the RMII 100M ethernet PHY.
Change since v1 at [1]:
- implement set_phy_mode() for each SoC
[1] https://lkml.kernel.org/r/
20180426160508.29380-1-yixun.lan@amlogic.com
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Yixun Lan [Sat, 28 Apr 2018 10:21:11 +0000 (10:21 +0000)]
net: stmmac: dwmac-meson: extend phy mode setting
In the Meson-AXG SoC, the phy mode setting of PRG_ETH0 in the glue layer
is extended from bit[0] to bit[2:0].
There is no problem if we configure it to the RGMII 1000M PHY mode,
since the register setting is coincidentally compatible with previous one,
but for the RMII 100M PHY mode, the configuration need to be changed to
value - b100.
This patch was verified with a RTL8201F 100M ethernet PHY.
Signed-off-by: Yixun Lan <yixun.lan@amlogic.com>
Acked-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yixun Lan [Sat, 28 Apr 2018 10:21:10 +0000 (10:21 +0000)]
dt-bindings: net: meson-dwmac: new compatible name for AXG SoC
We need to introduce a new compatible name for the Meson-AXG SoC
in order to support the RMII 100M ethernet PHY, since the PRG_ETH0
register of the dwmac glue layer is changed from previous old SoC.
Signed-off-by: Yixun Lan <yixun.lan@amlogic.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 1 May 2018 14:22:41 +0000 (10:22 -0400)]
Merge branch 'netns-uevent-filtering'
Christian Brauner says:
====================
netns: uevent filtering
This is the new approach to uevent filtering as discussed (see the
threads in [1], [2], and [3]). It only contains *non-functional
changes*.
This series deals with with fixing up uevent filtering logic:
- uevent filtering logic is simplified
- locking time on uevent_sock_list is minimized
- tagged and untagged kobjects are handled in separate codepaths
- permissions for userspace are fixed for network device uevents in
network namespaces owned by non-initial user namespaces
Udev is now able to see those events correctly which it wasn't before.
For example, moving a physical device into a network namespace not
owned by the initial user namespaces before gave:
root@xen1:~# udevadm --debug monitor -k
calling: monitor
monitor will print the received events for:
KERNEL - the kernel uevent
sender uid=65534, message ignored
sender uid=65534, message ignored
sender uid=65534, message ignored
sender uid=65534, message ignored
sender uid=65534, message ignored
and now after the discussion and solution in [3] correctly gives:
root@xen1:~# udevadm --debug monitor -k
calling: monitor
monitor will print the received events for:
KERNEL - the kernel uevent
KERNEL[625.301042] add /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/enp1s0f1 (net)
KERNEL[625.301109] move /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/enp1s0f1 (net)
KERNEL[625.301138] move /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/eth1 (net)
KERNEL[655.333272] remove /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/eth1 (net)
Thanks!
Christian
[1]: https://lkml.org/lkml/2018/4/4/739
[2]: https://lkml.org/lkml/2018/4/26/767
[3]: https://lkml.org/lkml/2018/4/26/738
====================
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christian Brauner [Sun, 29 Apr 2018 10:44:12 +0000 (12:44 +0200)]
netns: restrict uevents
commit
07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk. We have now reached the point where hotplug events for all devices
that carry a namespace tag are filtered according to that namespace.
Specifically, they are filtered whenever the namespace tag of the kobject
does not match the namespace tag of the netlink socket.
Currently, only network devices carry namespace tags (i.e. network
namespace tags). Hence, uevents for network devices only show up in the
network namespace such devices are created in or moved to.
However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will broadcast it into all
network namespaces. This behavior stopped making sense when user namespaces
were introduced.
This patch simplifies and fixes couple of things:
- Split codepath for sending uevents by kobject namespace tags:
1. Untagged kobjects - uevent_net_broadcast_untagged():
Untagged kobjects will be broadcast into all uevent sockets recorded
in uevent_sock_list, i.e. into all network namespacs owned by the
intial user namespace.
2. Tagged kobjects - uevent_net_broadcast_tagged():
Tagged kobjects will only be broadcast into the network namespace they
were tagged with.
Handling of tagged kobjects in 2. does not cause any semantic changes.
This is just splitting out the filtering logic that was handled by
kobj_bcast_filter() before.
Handling of untagged kobjects in 1. will cause a semantic change. The
reasons why this is needed and ok have been discussed in [1]. Here is a
short summary:
- Userspace ignores uevents from network namespaces that are not owned by
the intial user namespace:
Uevents are filtered by userspace in a user namespace because the
received uid != 0. Instead the uid associated with the event will be
65534 == "nobody" because the global root uid is not mapped.
This means we can safely and without introducing regressions modify the
kernel to not send uevents into all network namespaces whose owning
user namespace is not the initial user namespace because we know that
userspace will ignore the message because of the uid anyway.
I have a) verified that is is true for every udev implementation out
there b) that this behavior has been present in all udev
implementations from the very beginning.
- Thundering herd:
Broadcasting uevents into all network namespaces introduces significant
overhead.
All processes that listen to uevents running in non-initial user
namespaces will end up responding to uevents that will be meaningless
to them. Mainly, because non-initial user namespaces cannot easily
manage devices unless they have a privileged host-process helping them
out. This means that there will be a thundering herd of activity when
there shouldn't be any.
- Removing needless overhead/Increasing performance:
Currently, the uevent socket for each network namespace is added to the
global variable uevent_sock_list. The list itself needs to be protected
by a mutex. So everytime a uevent is generated the mutex is taken on
the list. The mutex is held *from the creation of the uevent (memory
allocation, string creation etc. until all uevent sockets have been
handled*. This is aggravated by the fact that for each uevent socket
that has listeners the mc_list must be walked as well which means we're
talking O(n^2) here. Given that a standard Linux workload usually has
quite a lot of network namespaces and - in the face of containers - a
lot of user namespaces this quickly becomes a performance problem (see
"Thundering herd" above). By just recording uevent sockets of network
namespaces that are owned by the initial user namespace we
significantly increase performance in this codepath.
- Injecting uevents:
There's a valid argument that containers might be interested in
receiving device events especially if they are delegated to them by a
privileged userspace process. One prime example are SR-IOV enabled
devices that are explicitly designed to be handed of to other users
such as VMs or containers.
This use-case can now be correctly handled since
commit
692ec06d7c92 ("netns: send uevent messages"). This commit
introduced the ability to send uevents from userspace. As such we can
let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
namespace of the network namespace of the netlink socket) userspace
process make a decision what uevents should be sent. This removes the
need to blindly broadcast uevents into all user namespaces and provides
a performant and safe solution to this problem.
- Filtering logic:
This patch filters by *owning user namespace of the network namespace a
given task resides in* and not by user namespace of the task per se.
This means if the user namespace of a given task is unshared but the
network namespace is kept and is owned by the initial user namespace a
listener that is opening the uevent socket in that network namespace
can still listen to uevents.
- Fix permission for tagged kobjects:
Network devices that are created or moved into a network namespace that
is owned by a non-initial user namespace currently are send with
INVALID_{G,U}ID in their credentials. This means that all current udev
implementations in userspace will ignore the uevent they receive for
them. This has lead to weird bugs whereby new devices showing up in such
network namespaces were not recognized and did not get IPs assigned etc.
This patch adjusts the permission to the appropriate {g,u}id in the
respective user namespace. This way udevd is able to correctly handle
such devices.
- Simplify filtering logic:
do_one_broadcast() already ensures that only listeners in mc_list receive
uevents that have the same network namespace as the uevent socket itself.
So the filtering logic in kobj_bcast_filter is not needed (see [3]). This
patch therefore removes kobj_bcast_filter() and replaces
netlink_broadcast_filtered() with the simpler netlink_broadcast()
everywhere.
[1]: https://lkml.org/lkml/2018/4/4/739
[2]: https://lkml.org/lkml/2018/4/26/767
[3]: https://lkml.org/lkml/2018/4/26/738
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christian Brauner [Sun, 29 Apr 2018 10:44:11 +0000 (12:44 +0200)]
uevent: add alloc_uevent_skb() helper
This patch adds alloc_uevent_skb() in preparation for follow up patches.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 1 May 2018 13:42:48 +0000 (09:42 -0400)]
Merge branch 'tls-offload-netdev-and-mlx5-support'
Boris Pismenny says:
====================
TLS offload, netdev & MLX5 support
The following series provides TLS TX inline crypto offload.
v1->v2:
- Added IS_ENABLED(CONFIG_TLS_DEVICE) and a STATIC_KEY for icsk_clean_acked
- File license fix
- Fix spelling, comment by DaveW
- Move memory allocations out of tls_set_device_offload and other misc fixes,
comments by Kiril.
v2->v3:
- Reversed xmas tree where needed and style fixes
- Removed the need for skb_page_frag_refill, per Eric's comment
- IPv6 dependency fixes
v3->v4:
- Remove "inline" from functions in C files
- Make clean_acked_data_enabled a static variable and add enable/disable functions to control it.
- Remove unnecessary variable initialization mentioned by ShannonN
- Rebase over TLS RX
- Refactor the tls_software_fallback to reduce the number of variables mentioned by KirilT
v4->v5:
- Add missing CONFIG_TLS_DEVICE
v5->v6:
- Move changes to the software implementation into a seperate patch
- Fix some checkpatch warnings
- GPL export the enable/disable clean_acked_data functions
v6->v7:
- Use the dst_entry to obtain the netdev in dev_get_by_index
- Remove the IPv6 patch since it is redundent now
v7->v8:
- Fix a merge conflict in mlx5 header
v8->v9:
- Fix false -Wmaybe-uninitialized warning
- Fix empty space in the end of new files
v9->v10:
- Remove default "n" in net/Kconfig
This series adds a generic infrastructure to offload TLS crypto to a
network devices. It enables the kernel TLS socket to skip encryption and
authentication operations on the transmit side of the data path. Leaving
those computationally expensive operations to the NIC.
The NIC offload infrastructure builds TLS records and pushes them to the
TCP layer just like the SW KTLS implementation and using the same API.
TCP segmentation is mostly unaffected. Currently the only exception is
that we prevent mixed SKBs where only part of the payload requires
offload. In the future we are likely to add a similar restriction
following a change cipher spec record.
The notable differences between SW KTLS and NIC offloaded TLS
implementations are as follows:
1. The offloaded implementation builds "plaintext TLS record", those
records contain plaintext instead of ciphertext and place holder bytes
instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence
number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
TLS socket, we can use the tls NIC offload infrastructure to obtain
enough context to encrypt the payload of the SKB.
A TLS record is released when the last byte of the record is ack'ed,
this is done through the new icsk_clean_acked callback.
The infrastructure should be extendable to support various NIC offload
implementations. However it is currently written with the
implementation below in mind:
The NIC assumes that packets from each offloaded stream are sent as
plaintext and in-order. It keeps track of the TLS records in the TCP
stream. When a packet marked for offload is transmitted, the NIC
encrypts the payload in-place and puts authentication tags in the
relevant place holders.
The responsibility for handling out-of-order packets (i.e. TCP
retransmission, qdisc drops) falls on the netdev driver.
The netdev driver keeps track of the expected TCP SN from the NIC's
perspective. If the next packet to transmit matches the expected TCP
SN, the driver advances the expected TCP SN, and transmits the packet
with TLS offload indication.
If the next packet to transmit does not match the expected TCP SN. The
driver calls the TLS layer to obtain the TLS record that includes the
TCP of the packet for transmission. Using this TLS record, the driver
posts a work entry on the transmit queue to reconstruct the NIC TLS
state required for the offload of the out-of-order packet. It updates
the expected TCP SN accordingly and transmit the now in-order packet.
The same queue is used for packet transmission and TLS context
reconstruction to avoid the need for flushing the transmit queue before
issuing the context reconstruction request.
Expected TCP SN is accessed without a lock, under the assumption that
TCP doesn't transmit SKBs from different TX queue concurrently.
If packets are rerouted to a different netdevice, then a software
fallback routine handles encryption.
Paper: https://www.netdevconf.org/1.2/papers/netdevconf-TLS.pdf
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Boris Pismenny [Mon, 30 Apr 2018 07:16:23 +0000 (10:16 +0300)]
MAINTAINERS: Update TLS maintainers
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Boris Pismenny [Mon, 30 Apr 2018 07:16:22 +0000 (10:16 +0300)]
MAINTAINERS: Update mlx5 innova driver maintainers
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilya Lesokhin [Mon, 30 Apr 2018 07:16:21 +0000 (10:16 +0300)]
net/mlx5e: TLS, Add error statistics
Add statistics for rare TLS related errors.
Since the errors are rare we have a counter per netdev
rather then per SQ.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilya Lesokhin [Mon, 30 Apr 2018 07:16:20 +0000 (10:16 +0300)]
net/mlx5e: TLS, Add Innova TLS TX offload data path
Implement the TLS tx offload data path according to the
requirements of the TLS generic NIC offload infrastructure.
Special metadata ethertype is used to pass information to
the hardware.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilya Lesokhin [Mon, 30 Apr 2018 07:16:19 +0000 (10:16 +0300)]
net/mlx5e: TLS, Add Innova TLS TX support
Add NETIF_F_HW_TLS_TX capability and expose tlsdev_ops to work with the
TLS generic NIC offload infrastructure.
The NETIF_F_HW_TLS_TX capability will be added in the next patch.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilya Lesokhin [Mon, 30 Apr 2018 07:16:18 +0000 (10:16 +0300)]
net/mlx5: Accel, Add TLS tx offload interface
Add routines for manipulating TLS TX offload contexts.
In Innova TLS, TLS contexts are added or deleted
via a command message over the SBU connection.
The HW then sends a response message over the same connection.
Add implementation for Innova TLS (FPGA-based) hardware.
These routines will be used by the TLS offload support in a later patch
mlx5/accel is a middle acceleration layer to allow mlx5e and other ULPs
to work directly with mlx5_core rather than Innova FPGA or other mlx5
acceleration providers.
In the future, when IPSec/TLS or any other acceleration gets integrated
into ConnectX chip, mlx5/accel layer will provide the integrated
acceleration, rather than the Innova one.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilya Lesokhin [Mon, 30 Apr 2018 07:16:17 +0000 (10:16 +0300)]
net/mlx5e: Move defines out of ipsec code
The defines are not IPSEC specific.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilya Lesokhin [Mon, 30 Apr 2018 07:16:16 +0000 (10:16 +0300)]
net/tls: Add generic NIC offload infrastructure
This patch adds a generic infrastructure to offload TLS crypto to a
network device. It enables the kernel TLS socket to skip encryption
and authentication operations on the transmit side of the data path.
Leaving those computationally expensive operations to the NIC.
The NIC offload infrastructure builds TLS records and pushes them to
the TCP layer just like the SW KTLS implementation and using the same
API.
TCP segmentation is mostly unaffected. Currently the only exception is
that we prevent mixed SKBs where only part of the payload requires
offload. In the future we are likely to add a similar restriction
following a change cipher spec record.
The notable differences between SW KTLS and NIC offloaded TLS
implementations are as follows:
1. The offloaded implementation builds "plaintext TLS record", those
records contain plaintext instead of ciphertext and place holder bytes
instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence
number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
TLS socket, we can use the tls NIC offload infrastructure to obtain
enough context to encrypt the payload of the SKB.
A TLS record is released when the last byte of the record is ack'ed,
this is done through the new icsk_clean_acked callback.
The infrastructure should be extendable to support various NIC offload
implementations. However it is currently written with the
implementation below in mind:
The NIC assumes that packets from each offloaded stream are sent as
plaintext and in-order. It keeps track of the TLS records in the TCP
stream. When a packet marked for offload is transmitted, the NIC
encrypts the payload in-place and puts authentication tags in the
relevant place holders.
The responsibility for handling out-of-order packets (i.e. TCP
retransmission, qdisc drops) falls on the netdev driver.
The netdev driver keeps track of the expected TCP SN from the NIC's
perspective. If the next packet to transmit matches the expected TCP
SN, the driver advances the expected TCP SN, and transmits the packet
with TLS offload indication.
If the next packet to transmit does not match the expected TCP SN. The
driver calls the TLS layer to obtain the TLS record that includes the
TCP of the packet for transmission. Using this TLS record, the driver
posts a work entry on the transmit queue to reconstruct the NIC TLS
state required for the offload of the out-of-order packet. It updates
the expected TCP SN accordingly and transmits the now in-order packet.
The same queue is used for packet transmission and TLS context
reconstruction to avoid the need for flushing the transmit queue before
issuing the context reconstruction request.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Boris Pismenny [Mon, 30 Apr 2018 07:16:15 +0000 (10:16 +0300)]
net/tls: Split conf to rx + tx
In TLS inline crypto, we can have one direction in software
and another in hardware. Thus, we split the TLS configuration to separate
structures for receive and transmit.
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilya Lesokhin [Mon, 30 Apr 2018 07:16:14 +0000 (10:16 +0300)]
net: Add TLS TX offload features
This patch adds a netdev feature to configure TLS TX offloads.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilya Lesokhin [Mon, 30 Apr 2018 07:16:13 +0000 (10:16 +0300)]
net: Add TLS offload netdev ops
Add new netdev ops to add and delete tls context
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilya Lesokhin [Mon, 30 Apr 2018 07:16:12 +0000 (10:16 +0300)]
net: Add Software fallback infrastructure for socket dependent offloads
With socket dependent offloads we rely on the netdev to transform
the transmitted packets before sending them to the wire.
When a packet from an offloaded socket is rerouted to a different
device we need to detect it and do the transformation in software.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilya Lesokhin [Mon, 30 Apr 2018 07:16:11 +0000 (10:16 +0300)]
net: Rename and export copy_skb_header
copy_skb_header is renamed to skb_copy_header and
exported. Exposing this function give more flexibility
in copying SKBs.
skb_copy and skb_copy_expand do not give enough control
over which parts are copied.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilya Lesokhin [Mon, 30 Apr 2018 07:16:10 +0000 (10:16 +0300)]
tcp: Add clean acked data hook
Called when a TCP segment is acknowledged.
Could be used by application protocols who hold additional
metadata associated with the stream data.
This is required by TLS device offload to release
metadata associated with acknowledged TLS records.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 1 May 2018 13:37:44 +0000 (09:37 -0400)]
Merge branch '40GbE' of git://git./linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
40GbE Intel Wired LAN Driver Updates 2018-04-30
This series contains updates to i40e and i40evf only.
Jia-Ju Bai replaces an instance of GFP_ATOMIC to GFP_KERNEL, since
i40evf is not in atomic context when i40evf_add_vlan() is called.
Jake cleans up function header comments to ensure that the function
parameter comments actually match the function parameters. Fixed a
possible overflow error in the PTP clock code. Fixed warnings regarding
restricted __be32 type usage.
Mariusz fixes the reading of the LLDP configuration, which moves from
using relative values to calculating the absolute address.
Jakub adds a check for 10G LR mode for i40e.
Paweł fixes an issue, where changing the MTU would turn on TSO, GSO and
GRO.
Alex fixes a couple of issues with the UDP tunnel filter configuration.
First being that the tunnels did not have mutual exclusion in place to
prevent a race condition between a user request to add/remove a port and
an update. The second issue was we were deleting filters that were not
associated with the actual filter we wanted to delete.
Harshitha ensures that the queue map sent by the VF is taken into
account when enabling/disabling queues in the VF VSI.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 30 Apr 2018 16:42:41 +0000 (12:42 -0400)]
Merge branch 'mlxsw-SPAN-Support-routes-pointing-at-bridges'
Ido Schimmel says:
====================
mlxsw: SPAN: Support routes pointing at bridges
Petr says:
When mirroring to a gretap or ip6gretap netdevice, the route that
directs the encapsulated packets can reference a bridge. In that case,
in the software model, the packet is switched.
Thus when offloading mirroring like that, take into consideration FDB,
STP, PVID configured at the bridge, and whether that VLAN ID should be
tagged on egress.
Patch #1 introduces functions to get bridge PVID, VLAN flags and to look
up an FDB entry.
Patches #2 and #3 refactor some existing code and introduce a new
accessor function.
With patches #4 and #5 mlxsw calls mlxsw_sp_span_respin() on switchdev
events as well. There is no impact yet, because bridge as an underlay
device is still not allowed.
That is implemented in patch #6, which uses the new interfaces to figure
out on which one port the mirroring should be configured, and whether
the mirrored packets should be VLAN-tagged and how.
Changes from v2 to v3:
- Rename the suite of bridge accessor function to br_vlan_get_pvid(),
br_vlan_get_info() and br_fdb_find_port(). The _get bit is to avoid
clashing with an existing static function.
Changes from v1 to v2:
- Change the suite of bridge accessor functions to br_vlan_pvid_rtnl(),
br_vlan_info_rtnl(), br_fdb_find_port_rtnl().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Sun, 29 Apr 2018 07:56:13 +0000 (10:56 +0300)]
mlxsw: spectrum_span: Allow bridge for gretap mirror
When handling mirroring to a gretap or ip6gretap netdevice in mlxsw, the
underlay address (i.e. the remote address of the tunnel) may be routed
to a bridge.
In that case, look up the resolved neighbor Ethernet address in that
bridge's FDB. Then configure the offload to direct the mirrored traffic
to that port, possibly with tagging.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Sun, 29 Apr 2018 07:56:12 +0000 (10:56 +0300)]
mlxsw: Respin SPAN on switchdev events
Changes to switchdev artifact can make a SPAN entry offloadable or
unoffloadable. To that end:
- Listen to SWITCHDEV_FDB_*_TO_BRIDGE notifications in addition to
the *_TO_DEVICE ones, to catch whatever activity is sent to the
bridge (likely by mlxsw itself).
On each FDB notification, respin SPAN to reconcile it with the FDB
changes.
- Also respin on switchdev port attribute changes (which currently
covers changes to STP state of ports) and port object additions and
removals.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Sun, 29 Apr 2018 07:56:11 +0000 (10:56 +0300)]
mlxsw: spectrum: Register SPAN before switchdev
Since switchdev events can trigger SPAN respin, it is necessary that the
data structures are available. Register SPAN first, with a commentary on
what the dependencies are.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Sun, 29 Apr 2018 07:56:10 +0000 (10:56 +0300)]
mlxsw: spectrum_switchdev: Publish two functions
Publish the existing function mlxsw_sp_bridge_port_find(), and add
another service accessor mlxsw_sp_bridge_port_stp_state(). Publish both
in a new file spectrum_switchdev.h.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Sun, 29 Apr 2018 07:56:09 +0000 (10:56 +0300)]
mlxsw: spectrum: Extract mlxsw_sp_stp_spms_state()
Instead of duplicating the decision regarding port forwarding state made
by mlxsw_sp_port_vid_stp_set(), extract the decision-making into a new
function and reuse.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Sun, 29 Apr 2018 07:56:08 +0000 (10:56 +0300)]
net: bridge: Publish bridge accessor functions
Add a couple new functions to allow querying FDB and vlan settings of a
bridge.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jacob Keller [Fri, 20 Apr 2018 08:41:40 +0000 (01:41 -0700)]
i40e: use %pI4b instead of byte swapping before dev_err
Fix warnings regarding restricted __be32 type usage by strictly
specifying the type of the ipv4 address being printed in the dev_err
statement.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Harshitha Ramamurthy [Fri, 20 Apr 2018 08:41:39 +0000 (01:41 -0700)]
i40e/i40evf: take into account queue map from vf when handling queues
The expectation of the ops VIRTCHNL_OP_ENABLE_QUEUES and
VIRTCHNL_OP_DISABLE_QUEUES is that the queue map sent by
the VF is taken into account when enabling/disabling
queues in the VF VSI. This patch makes sure that happens.
By breaking out the individual queue set up functions so
that they can be called directly from the i40e_virtchnl_pf.c
file, only the queues as specified by the queue bit map that
accompanies the enable/disable queues ops will be handled.
Signed-off-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Fri, 20 Apr 2018 08:41:38 +0000 (01:41 -0700)]
i40e: avoid overflow in i40e_ptp_adjfreq()
When operating at 1GbE, the base incval for the PTP clock is so large
that multiplying it by numbers close to the max_adj can overflow the
u64.
Rather than attempting to limit the max_adj to a value small enough to
avoid overflow, instead calculate the incvalue adjustment based on the
40GbE incvalue, and then multiply that by the scaling factor for the
link speed.
This sacrifices a small amount of precision in the adjustment but we
avoid erratic behavior of the clock due to the overflow caused if ppb is
very near the maximum adjustment.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Fri, 20 Apr 2018 08:41:37 +0000 (01:41 -0700)]
i40e: Fix multiple issues with UDP tunnel offload filter configuration
This fixes at least 2 issues I have found with the UDP tunnel filter
configuration.
The first issue is the fact that the tunnels didn't have any sort of mutual
exclusion in place to prevent an update from racing with a user request to
add/remove a port. As such you could request to add and remove a port
before the port update code had a chance to respond which would result in a
very confusing result. To address it I have added 2 changes. First I added
the RTNL mutex wrapper around our updating of the pending, port, and
filter_index bits. Second I added logic so that we cannot use a port that
has a pending deletion since we need to free the space in hardware before
we can allow software to reuse it.
The second issue addressed is the fact that we were not recording the
actual filter index provided to us by the admin queue. As a result we were
deleting filters that were not associated with the actual filter we wanted
to delete. To fix that I added a filter_index member to the UDP port
tracking structure.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Paweł Jabłoński [Fri, 20 Apr 2018 08:41:36 +0000 (01:41 -0700)]
i40evf: Fix turning TSO, GSO and GRO on after
This patch fixes the problem where each MTU change turns TSO,
GSO and GRO on from off state.
Now when TSO, GSO or GRO is turned off, MTU change does not
turn them on.
Signed-off-by: Paweł Jabłoński <pawel.jablonski@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jakub Pawlak [Fri, 20 Apr 2018 08:41:35 +0000 (01:41 -0700)]
i40e: Add advertising 10G LR mode
The advertising 10G LR mode should be possible to set
but in the function i40e_set_link_ksettings() check for this
is missed. This patch adds check for 10000baseLR_Full
flag for 10G modes.
Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Ahmed Abdelsalam [Sat, 28 Apr 2018 10:18:35 +0000 (12:18 +0200)]
ipv6: sr: extract the right key values for "seg6_make_flowlabel"
The seg6_make_flowlabel() is used by seg6_do_srh_encap() to compute the
flowlabel from a given skb. It relies on skb_get_hash() which eventually
calls __skb_flow_dissect() to extract the flow_keys struct values from
the skb.
In case of IPv4 traffic, calling seg6_make_flowlabel() after skb_push(),
skb_reset_network_header(), and skb_mac_header_rebuild() will results in
flow_keys struct of all key values set to zero.
This patch calls seg6_make_flowlabel() before resetting the headers of skb
to get the right key values.
Extracted Key values are based on the type inner packet as follows:
1) IPv6 traffic: src_IP, dst_IP, L4 proto, and flowlabel of inner packet.
2) IPv4 traffic: src_IP, dst_IP, L4 proto, src_port, and dst_port
3) L2 traffic: depends on what kind of traffic carried into the L2
frame. IPv6 and IPv4 traffic works as discussed 1) and 2)
Here a hex_dump of struct flow_keys for IPv4 and IPv6 traffic
10.100.1.100: 47302 > 30.0.0.2: 5001
00000000: 14 00 02 00 00 00 00 00 08 00 11 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 13 89 b8 c6 1e 00 00 02
00000020: 0a 64 01 64
fc00:a1:a > b2::2
00000000: 28 00 03 00 00 00 00 00 86 dd 11 00 99 f9 02 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 b2 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 02 fc 00 00 a1
00000030: 00 00 00 00 00 00 00 00 00 00 00 0a
Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mariusz Stachura [Fri, 20 Apr 2018 08:41:34 +0000 (01:41 -0700)]
i40e: fix reading LLDP configuration
Previous method for reading LLDP config was based on hard-coded
offsets. It happened to work, because of structured architecture of
the NVM memory. In the new approach, known as FLAT, we need to
calculate the absolute address, instead of using relative values.
Needed defines for memory location were added.
Signed-off-by: Mariusz Stachura <mariusz.stachura@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Fri, 20 Apr 2018 08:41:33 +0000 (01:41 -0700)]
i40e/i40evf: cleanup incorrect function doxygen comments
Recent versions of the Linux kernel now warn about incorrect parameter
definitions for function comments. Fix up several function comments to
correctly reflect the current function arguments. This cleans up the
warnings and helps ensure our documentation is accurate.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
YueHaibing [Sat, 28 Apr 2018 04:35:22 +0000 (12:35 +0800)]
libcxgb,cxgb4: use __skb_put_zero to simplfy code
use helper __skb_put_zero to replace the pattern of __skb_put() && memset()
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
William Tu [Fri, 27 Apr 2018 21:16:32 +0000 (14:16 -0700)]
erspan: auto detect truncated packets.
Currently the truncated bit is set only when the mirrored packet
is larger than mtu. For certain cases, the packet might already
been truncated before sending to the erspan tunnel. In this case,
the patch detect whether the IP header's total length is larger
than the actual skb->len. If true, this indicated that the
mirrored packet is truncated and set the erspan truncate bit.
I tested the patch using bpf_skb_change_tail helper function to
shrink the packet size and send to erspan tunnel.
Reported-by: Xiaoyan Jin <xiaoyanj@vmware.com>
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jia-Ju Bai [Wed, 11 Apr 2018 02:08:18 +0000 (10:08 +0800)]
i40evf: Replace GFP_ATOMIC with GFP_KERNEL in i40evf_add_vlan
i40evf_add_vlan() is never called in atomic context.
i40evf_add_vlan() is only called by i40evf_vlan_rx_add_vid(),
which is only set as ".ndo_vlan_rx_add_vid" in struct net_device_ops.
".ndo_vlan_rx_add_vid" is not called in atomic context.
Despite never getting called from atomic context,
i40evf_add_vlan() calls kzalloc() with GFP_ATOMIC,
which does not sleep for allocation.
GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
which can sleep and improve the possibility of sucessful allocation.
This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David S. Miller [Mon, 30 Apr 2018 13:38:20 +0000 (09:38 -0400)]
Merge branch 'r8169-further-improvements'
Heiner Kallweit says:
====================
r8169: further improvements w/o functional change
This series aims at further improving and simplifying the code w/o
any intended functional changes.
Series was tested on: RTL8169sb, RTL8168d, RTL8168e-vl
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Sat, 28 Apr 2018 20:19:47 +0000 (22:19 +0200)]
r8169: move common initializations to tp->hw_start
The chip-specific init code includes quite some calls which are
identical for all chips. So move these calls to tp->hw_start().
In addition move rtl_set_rx_max_size() a little to make sure it's
defined before it's used. Unfortunately the diff generated by git
is a little bit hard to read.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Sat, 28 Apr 2018 20:19:43 +0000 (22:19 +0200)]
r8169: remove calls to rtl_set_rx_mode
__dev_open() calls the ndo_set_rx_mode callback anyway, so we don't
have to do it here too.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Sat, 28 Apr 2018 20:19:40 +0000 (22:19 +0200)]
r8169: simplify rtl_hw_start_8169
Currently done:
- if mac_version in (01, 02, 03, 04)
RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
- if mac_version in (01, 02, 03, 04)
rtl_set_rx_tx_config_registers(tp);
- if mac_version not in (01, 02, 03, 04)
RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
rtl_set_rx_tx_config_registers(tp);
So we do exactly the same independent of chip version and can simplify
the code.
In addition remove the call to rtl_init_rxcfg(), it's called in
rtl_init_one() already and the set bits are never touched later.
rtl_init_8168/8101 don't include this call either.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Sat, 28 Apr 2018 20:19:30 +0000 (22:19 +0200)]
r8169: improve handling of CPCMD quirk mask
Both quirk masks are the same, so we can merge them. The quirk mask
includes most bits so it's actually easier to define a mask with
the bits to keep.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Sat, 28 Apr 2018 20:19:26 +0000 (22:19 +0200)]
r8169: improve CPlusCmd handling
tp->cp_cmd is supposed to reflect the current value of the CplusCmd
register. Several (quite old) changes however directly change this
register w/o updating tp->cp_cmd. Also we have places in the code
reading this register where we could use the cached value.
In addition:
- Properly initialize tp->cmd with the register value.
- In rtl_hw_start_8169 remove one setting of PCIMulRW because it's
set unconditionally anyway a few lines later.
- In rtl_hw_start_8168 properly mask out the INTT bits before
setting INTT_1. So far we rely on both bits being zero.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Sat, 28 Apr 2018 20:19:21 +0000 (22:19 +0200)]
r8169: replace magic number for INTT mask with a constant
Use a proper constant for INTT bit mask.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Sat, 28 Apr 2018 20:19:15 +0000 (22:19 +0200)]
r8169: improve rtl8169_set_features
__rtl8169_set_features is used in rtl8169_set_features only, so we
can inline it. In addition:
- Remove check (features ^ dev->features), __netdev_update_features
check's already that requested features differ from current ones.
- Don't mask out unsupported flags, there's no benefit in it.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Sat, 28 Apr 2018 20:19:08 +0000 (22:19 +0200)]
r8169: remove unneeded call to __rtl8169_set_features in rtl_open
RxChkSum and RxVlan aren't touched outside __rtl8169_set_features
(except in probe), so they are always in sync with dev->features.
And the RxConfig flags are set in rtl_set_rx_mode() which is
called via dev_set_rx_mode() from __dev_open().
Therefore we can safely remove this call.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Sat, 28 Apr 2018 09:52:16 +0000 (10:52 +0100)]
liquidio: fix spelling mistake: "mac_tx_multi_collison" -> "mac_tx_multi_collision"
Trivial fix to spelling mistake in oct_stats_strings text
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 30 Apr 2018 13:26:29 +0000 (09:26 -0400)]
Merge branch 'liquidio-enhanced-ethtool-set-channels-feature'
Intiyaz Basha says:
====================
liquidio: enhanced ethtool --set-channels feature
For the ethtool --set-channels feature, the liquidio driver currently
accepts max combined value as the queue count configured during driver
load time, where max combined count is the total count of input and output
queues. This limitation is applicable only when SR-IOV is enabled, that
is, when VFs are created for PF. If SR-IOV is not enabled, the driver can
configure max supported (64) queues.
This series of patches are for enhancing driver to accept
max supported queues for ethtool --set-channels.
Changes in V2:
Only patch #6 was changed to fix these Sparse warnings reported by kbuild
test robot:
lio_ethtool.c:848:5: warning: symbol 'lio_23xx_reconfigure_queue_count'
was not declared. Should it be static?
lio_ethtool.c:877:22: warning: incorrect type in assignment (different
base types)
lio_ethtool.c:878:22: warning: incorrect type in assignment (different
base types)
lio_ethtool.c:879:22: warning: incorrect type in assignment (different
base types)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 28 Apr 2018 06:32:57 +0000 (23:32 -0700)]
liquidio: enhanced ethtool --set-channels feature
Enhancing driver to accept max supported queues for ethtool --set-channels
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 28 Apr 2018 06:32:55 +0000 (23:32 -0700)]
liquidio: Moved common function setup_glists to lio_core.c
Moved common function setup_glists to lio_core.c
and reamed it to lio_setup_glists
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 28 Apr 2018 06:32:51 +0000 (23:32 -0700)]
liquidio: Moved common definition octnic_gather to octeon_network.h
Moving common definition octnic_gather to octeon_network.h
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 28 Apr 2018 06:32:49 +0000 (23:32 -0700)]
liquidio: Moved common function delete_glists to lio_core.c
Moved common function delete_glists to lio_core.c
and renamed it to lio_delete_glists
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 28 Apr 2018 06:32:45 +0000 (23:32 -0700)]
liquidio: Moved common function list_delete_head to octeon_network.h
Moved common function list_delete_head to octeon_network.h
and renamed it to lio_list_delete_head
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 28 Apr 2018 06:32:39 +0000 (23:32 -0700)]
liquidio: Moved common function if_cfg_callback to lio_core.c
Moved common function if_cfg_callback to lio_core.c
and renamed it to lio_if_cfg_callback.
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 27 Apr 2018 20:11:14 +0000 (13:11 -0700)]
net: core: Assert the size of netdev_featres_t
We have about 53 netdev_features_t bits defined and counting, add a
build time check to catch when an u64 type will not be enough and we
will have to convert that to a bitmap. This is done in
register_netdevice() for convenience.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 30 Apr 2018 02:01:33 +0000 (22:01 -0400)]
Merge branch 'net-cleanup-skb_tx_hash'
Alexander Duyck says:
====================
Clean up users of skb_tx_hash and __skb_tx_hash
I am in the process of doing some work to try and enable macvlan Tx queue
selection without using ndo_select_queue. As a part of that I will likely
need to make changes to skb_tx_hash. As such this is a clean up or refactor
of the two spots where he function has been used. In both cases it didn't
really seem like the function was being used correctly so I have updated
both code paths to not make use of the function.
My current development environment doesn't have an mlx4 or OPA vnic
available so the changes to those have been build tested only.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 27 Apr 2018 18:06:53 +0000 (14:06 -0400)]
net: Revoke export for __skb_tx_hash, update it to just be static skb_tx_hash
I am dropping the export of __skb_tx_hash as after my patches nobody is
using it outside of the net/core/dev.c file. In addition I am renaming and
repurposing it to just be a static declaration of skb_tx_hash since that
was the only user for it at this point. By doing this the compiler can
inline it into __netdev_pick_tx as that will improve performance.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 27 Apr 2018 18:06:45 +0000 (14:06 -0400)]
mlx4: Don't bother using skb_tx_hash in mlx4_en_select_queue
The code in the fallback path has supported XDP in conjunction with the Tx
traffic classification for TCs for over a year now. So instead of just
calling skb_tx_hash for every packet we are better off using the fallback
since that will record the Tx queue to the socket and then that can be used
instead of having to recompute the hash every time.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 27 Apr 2018 18:06:35 +0000 (14:06 -0400)]
opa_vnic: Just use skb_get_hash instead of skb_tx_hash
This patch is meant to clean up how the opa_vnic is obtaining entropy from
Tx packets.
The code as it was written was claiming to get 16 bits of hash, but from
what I can tell it was only ever actually getting 14 bits as it was limited
to 0 - (2^15 - 1). It then was folding the result to get a 8 bit value for
entropy.
Instead of throwing away all that input I am cutting out the middle man and
instead having the code call skb_get_hash directly and then folding the 32
bit value into a 8 bit value using a pair of shifts and XOR operations.
Execution wise this new approach should provide more entropy and be faster
since we are bypassing the reciprocal multiplication to reduce the 32b
value to 16b and instead just using a shift/XOR combination.
In addition we can drop the unneeded adapter value from the call to get the
entropy since the netdev itself isn't even needed.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 30 Apr 2018 01:41:01 +0000 (21:41 -0400)]
Merge branch 'lan78xx-fixed-phy'
Raghuram Chary J says:
====================
lan78xx updates along with Fixed phy Support
These series of patches handle few modifications in driver
and adds support for fixed phy.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Raghuram Chary J [Sat, 28 Apr 2018 06:03:16 +0000 (11:33 +0530)]
lan78xx: Modify error messages
Modify the error messages when phy registration fails.
Signed-off-by: Raghuram Chary J <raghuramchary.jallipalli@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Raghuram Chary J [Sat, 28 Apr 2018 06:03:15 +0000 (11:33 +0530)]
lan78xx: Remove DRIVER_VERSION for lan78xx driver
Remove driver version info from the lan78xx driver.
Signed-off-by: Raghuram Chary J <raghuramchary.jallipalli@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Raghuram Chary J [Sat, 28 Apr 2018 06:03:14 +0000 (11:33 +0530)]
lan78xx: Lan7801 Support for Fixed PHY
Adding Fixed PHY support to the lan78xx driver.
Signed-off-by: Raghuram Chary J <raghuramchary.jallipalli@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 30 Apr 2018 01:29:55 +0000 (21:29 -0400)]
Merge branch 'tcp-mmap-rework-zerocopy-receive'
Eric Dumazet says:
====================
tcp: mmap: rework zerocopy receive
syzbot reported a lockdep issue caused by tcp mmap() support.
I implemented Andy Lutomirski nice suggestions to resolve the
issue and increase scalability as well.
First patch is adding a new getsockopt() operation and changes mmap()
behavior.
Second patch changes tcp_mmap reference program.
v4: tcp mmap() support depends on CONFIG_MMU, as kbuild bot told us.
v3: change TCP_ZEROCOPY_RECEIVE to be a getsockopt() option
instead of setsockopt(), feedback from Ka-Cheon Poon
v2: Added a missing page align of zc->length in tcp_zerocopy_receive()
Properly clear zc->recv_skip_hint in case user request was completed.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 27 Apr 2018 15:58:09 +0000 (08:58 -0700)]
selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE
After prior kernel change, mmap() on TCP socket only reserves VMA.
We have to use getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...)
to perform the transfert of pages from skbs in TCP receive queue into such VMA.
struct tcp_zerocopy_receive {
__u64 address; /* in: address of mapping */
__u32 length; /* in/out: number of bytes to map/mapped */
__u32 recv_skip_hint; /* out: amount of bytes to skip */
};
After a successful getsockopt(...TCP_ZEROCOPY_RECEIVE...), @length contains
number of bytes that were mapped, and @recv_skip_hint contains number of bytes
that should be read using conventional read()/recv()/recvmsg() system calls,
to skip a sequence of bytes that can not be mapped, because not properly page
aligned.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 27 Apr 2018 15:58:08 +0000 (08:58 -0700)]
tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
When adding tcp mmap() implementation, I forgot that socket lock
had to be taken before current->mm->mmap_sem. syzbot eventually caught
the bug.
Since we can not lock the socket in tcp mmap() handler we have to
split the operation in two phases.
1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
This operation does not involve any TCP locking.
2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
the transfert of pages from skbs to one VMA.
This operation only uses down_read(¤t->mm->mmap_sem) after
holding TCP lock, thus solving the lockdep issue.
This new implementation was suggested by Andy Lutomirski with great details.
Benefits are :
- Better scalability, in case multiple threads reuse VMAS
(without mmap()/munmap() calls) since mmap_sem wont be write locked.
- Better error recovery.
The previous mmap() model had to provide the expected size of the
mapping. If for some reason one part could not be mapped (partial MSS),
the whole operation had to be aborted.
With the tcp_zerocopy_receive struct, kernel can report how
many bytes were successfuly mapped, and how many bytes should
be read to skip the problematic sequence.
- No more memory allocation to hold an array of page pointers.
16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/
- skbs are freed while mmap_sem has been released
Following patch makes the change in tcp_mmap tool to demonstrate
one possible use of mmap() and setsockopt(... TCP_ZEROCOPY_RECEIVE ...)
Note that memcg might require additional changes.
Fixes:
93ab6cc69162 ("tcp: implement mmap() for zero copy receive")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Cc: linux-mm@kvack.org
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 30 Apr 2018 00:36:50 +0000 (20:36 -0400)]
Merge branch 'dsa-mv88e6xxx-remove-Global-2-setup'
Vivien Didelot says:
====================
net: dsa: mv88e6xxx: remove Global 2 setup
Parts of the mv88e6xxx driver still write arbitrary registers of
different banks at setup time, which is misleading especially when
supporting multiple device models.
This patchset moves two features setup into the top lovel
mv88e6xxx_setup function and kills the old Global 2 register bank setup
function. It brings no functional changes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 27 Apr 2018 01:56:46 +0000 (21:56 -0400)]
net: dsa: mv88e6xxx: remove Global 2 setup
The remaining values written to the Switch Management Register in the
mv88e6xxx_g2_setup function are specific to 88E6352 and older, and are
the default values anyway.
Thus remove completely this function. The mv88e6xxx driver no more
contains setup code to access arbitrary Global 2 registers.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 27 Apr 2018 01:56:45 +0000 (21:56 -0400)]
net: dsa: mv88e6xxx: move device mapping setup
Move the Device Mapping setup out of the specific Global 2 code,
into the top level device setup function.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 27 Apr 2018 01:56:44 +0000 (21:56 -0400)]
net: dsa: mv88e6xxx: move trunk setup
Move the trunking setup out of Global 2 specific setup into the top
level mv88e6xxx_setup function.
Note that the 88E6390 family calls this LAG instead of Trunk and
supports 32 possible ID routing vectors, with LAG ID bit 4 being placed
in Global 2 register 0x1D...
We don't need Trunk (or LAG) IDs for the moment, thus keep it simple.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 27 Apr 2018 19:41:49 +0000 (12:41 -0700)]
net: phy: Fix modular PHYLIB build
After commit
c59530d0d5dc ("net: Move PHY statistics code into PHY
library helpers") we made net/core/ethtool.c reference symbols which are
part of the library which can be modular. David introduced a temporary
fix with
1ecd6e8ad996 ("phy: Temporary build fix after phylib changes.")
which would prevent such modularity.
This is not desireable of course, so instead, just inline the functions
into include/linux/phy.h to keep both options available.
Fixes:
c59530d0d5dc ("net: Move PHY statistics code into PHY library helpers")
Fixes:
1ecd6e8ad996 ("phy: Temporary build fix after phylib changes.")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>