Jiri Pirko [Thu, 3 Oct 2019 09:49:38 +0000 (11:49 +0200)]
netdevsim: take devlink net instead of init_net
Follow-up patch is going to allow to reload devlink instance into
different network namespace, so use devlink_net() helper instead
of init_net.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 3 Oct 2019 09:49:37 +0000 (11:49 +0200)]
netdevsim: register port netdevices into net of device
Register newly created port netdevice into net namespace
that the parent device belongs to.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 3 Oct 2019 09:49:36 +0000 (11:49 +0200)]
netdevsim: implement proper devlink reload
During devlink reload, all driver objects should be reinstantiated with
the exception of devlink instance and devlink resources and params.
Move existing devlink_resource_size_get() calls into fib_create() just
before fib notifier is registered. Also, make sure that extack is
propagated down to fib_notifier_register() call.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 3 Oct 2019 09:49:35 +0000 (11:49 +0200)]
netdevsim: add all ports in nsim_dev_create() and del them in destroy()
Currently the probe/remove function does this separately. Put the
addition an deletion of ports into nsim_dev_create() and
nsim_dev_destroy().
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 3 Oct 2019 09:49:34 +0000 (11:49 +0200)]
mlxsw: Propagate extack down to register_fib_notifier()
During the devlink reaload the extack is present, so propagate it all
the way down to register_fib_notifier() call in spectrum_router.c.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 3 Oct 2019 09:49:33 +0000 (11:49 +0200)]
mlxsw: Register port netdevices into net of core
When creating netdevices for ports, put them under network namespace
that the core/parent devlink belongs to.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 3 Oct 2019 09:49:32 +0000 (11:49 +0200)]
mlxsw: spectrum: Take devlink net instead of init_net
Follow-up patch is going to allow to reload devlink instance into
different network namespace, so use devlink_net() helper instead
of init_net.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 3 Oct 2019 09:49:31 +0000 (11:49 +0200)]
net: devlink: export devlink net getter
Allow drivers to get net struct for devlink instance.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 3 Oct 2019 09:49:30 +0000 (11:49 +0200)]
net: fib_notifier: propagate extack down to the notifier block callback
Since errors are propagated all the way up to the caller, propagate
possible extack of the caller all the way down to the notifier block
callback.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 3 Oct 2019 09:49:29 +0000 (11:49 +0200)]
mlxsw: spectrum_router: Don't rely on missing extack to symbolize dump
Currently if info->extack is NULL, mlxsw assumes that the event came
down from dump. Originally, the dump did not propagate the return value
back to the original caller (fib_notifier_register()). However, that is
now happening. So benefit from this and push the error up if it happened.
Remove rule cases in work handlers that are now dead code.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 3 Oct 2019 09:49:28 +0000 (11:49 +0200)]
net: fib_notifier: propagate possible error during fib notifier registration
Unlike events for registered notifier, during the registration, the
errors that happened for the block being registered are not propagated
up to the caller. Make sure the error is propagated for FIB rules and
entries.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 3 Oct 2019 09:49:27 +0000 (11:49 +0200)]
net: fib_notifier: make FIB notifier per-netns
Currently all users of FIB notifier only cares about events in init_net.
Later in this patchset, users get interested in other namespaces too.
However, for every registered block user is interested only about one
namespace. Make the FIB notifier registration per-netns and avoid
unnecessary calls of notifier block for other namespaces.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 3 Oct 2019 09:49:26 +0000 (11:49 +0200)]
netdevsim: change fib accounting and limitations to be per-device
Currently, the accounting is done per-namespace. However, devlink
instance is always in init_net namespace for now, so only the accounting
related to init_net is used. Limitations set using devlink resources
are only considered for init_net. nsim_devlink_net() always
returns init_net always.
Make the accounting per-device. This brings no functional change.
Per-device accounting has the same values as per-net.
For a single netdevsim instance, the behaviour is exactly the same
as before. When multiple netdevsim instances are created, each
can have different limits.
This is in prepare to implement proper devlink netns support. After
that, the devlink instance which would exist in particular netns would
account and limit that netns.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 3 Oct 2019 15:59:24 +0000 (08:59 -0700)]
net: propagate errors correctly in register_netdevice()
If netdev_name_node_head_alloc() fails to allocate
memory, we absolutely want register_netdevice() to return
-ENOMEM instead of zero :/
One of the syzbot report looked like :
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 8760 Comm: syz-executor839 Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:ovs_vport_add+0x185/0x500 net/openvswitch/vport.c:205
Code: 89 c6 e8 3e b6 3a fa 49 81 fc 00 f0 ff ff 0f 87 6d 02 00 00 e8 8c b4 3a fa 4c 89 e2 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 d3 02 00 00 49 8d 7c 24 08 49 8b 34 24 48 b8 00
RSP: 0018:
ffff88808fe5f4e0 EFLAGS:
00010247
RAX:
dffffc0000000000 RBX:
ffffffff89be8820 RCX:
ffffffff87385162
RDX:
0000000000000000 RSI:
ffffffff87385174 RDI:
0000000000000007
RBP:
ffff88808fe5f510 R08:
ffff8880933c6600 R09:
fffffbfff14ee13c
R10:
fffffbfff14ee13b R11:
ffffffff8a7709df R12:
0000000000000004
R13:
ffffffff89be8850 R14:
ffff88808fe5f5e0 R15:
0000000000000002
FS:
0000000001d71880(0000) GS:
ffff8880ae900000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000020000280 CR3:
0000000096e4c000 CR4:
00000000001406e0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Call Trace:
new_vport+0x1b/0x1d0 net/openvswitch/datapath.c:194
ovs_dp_cmd_new+0x5e5/0xe30 net/openvswitch/datapath.c:1644
genl_family_rcv_msg+0x74b/0xf90 net/netlink/genetlink.c:629
genl_rcv_msg+0xca/0x170 net/netlink/genetlink.c:654
netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
genl_rcv+0x29/0x40 net/netlink/genetlink.c:665
netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328
netlink_sendmsg+0x8a5/0xd60 net/netlink/af_netlink.c:1917
sock_sendmsg_nosec net/socket.c:637 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:657
___sys_sendmsg+0x803/0x920 net/socket.c:2311
__sys_sendmsg+0x105/0x1d0 net/socket.c:2356
__do_sys_sendmsg net/socket.c:2365 [inline]
__se_sys_sendmsg net/socket.c:2363 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2363
Fixes:
ff92741270bf ("net: introduce name_node struct to be used in hashlist")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 3 Oct 2019 19:27:06 +0000 (12:27 -0700)]
Merge branch 'phy-at803x-add-ar9331-support'
Oleksij Rempel says:
====================
phy: at803x: add ar9331 support
changes v3:
- use PHY_ID_MATCH_EXACT only for ATH9331 PHY
changes v2:
- use PHY_ID_MATCH_EXACT instead of leaky masking
- remove probe and struct at803x_priv
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Oleksij Rempel [Thu, 3 Oct 2019 08:21:13 +0000 (10:21 +0200)]
net: phy: at803x: remove probe and struct at803x_priv
struct at803x_priv is never used in this driver. So remove it
and the probe function allocating it.
Suggested-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Oleksij Rempel [Thu, 3 Oct 2019 08:21:12 +0000 (10:21 +0200)]
net: phy: at803x: add ar9331 support
Mostly this hardware can work with generic PHY driver, but this change
is needed to provided interrupt handling support.
Tested with dsa ar9331-switch driver.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Thu, 3 Oct 2019 05:44:49 +0000 (08:44 +0300)]
mlxsw: PCI: Send EMAD traffic on a separate queue
Currently mlxsw distributes sent traffic among all the available send
queues. That includes control traffic as well as EMADs, which are used for
configuration of the device.
However because all the queues have the same traffic class of 3, they all
end up being directed to the same traffic class buffer. If the control
traffic in the buffer cannot be serviced quickly enough, the EMAD traffic
might be shut out, which causes transient failures, typically in FDB
maintenance, counter upkeep and other periodic work.
To address this issue, dedicate SDQ 0 to EMAD traffic, with TC 0.
Distribute the control traffic among the remaining queues, which are left
with their current TC 3.
Suggested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ka-Cheong Poon [Thu, 3 Oct 2019 04:11:08 +0000 (21:11 -0700)]
net/rds: Use DMA memory pool allocation for rds_header
Currently, RDS calls ib_dma_alloc_coherent() to allocate a large piece
of contiguous DMA coherent memory to store struct rds_header for
sending/receiving packets. The memory allocated is then partitioned
into struct rds_header. This is not necessary and can be costly at
times when memory is fragmented. Instead, RDS should use the DMA
memory pool interface to handle this. The DMA addresses of the pre-
allocated headers are stored in an array. At send/receive ring
initialization and refill time, this arrary is de-referenced to get
the DMA addresses. This array is not accessed at send/receive packet
processing.
Suggested-by: HÃ¥kon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 3 Oct 2019 19:00:56 +0000 (12:00 -0700)]
Merge branch 'stmmac-eam'
Thierry Reding says:
====================
net: stmmac: Enhanced addressing mode for DWMAC 4.10
The DWMAC 4.10 supports the same enhanced addressing mode as later
generations. Parse this capability from the hardware feature registers
and set the EAME (Enhanced Addressing Mode Enable) bit when necessary.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Thierry Reding [Wed, 2 Oct 2019 14:52:58 +0000 (16:52 +0200)]
net: stmmac: Support enhanced addressing mode for DWMAC 4.10
The address width of the controller can be read from hardware feature
registers much like on XGMAC. Add support for parsing the ADDR64 field
so that the DMA mask can be set accordingly.
This avoids getting swiotlb involved for DMA on Tegra186 and later.
Also make sure that the upper 32 bits of the DMA address are written to
the DMA descriptors when enhanced addressing mode is used. Similarily,
for each channel, the upper 32 bits of the DMA descriptor ring's base
address also need to be programmed to make sure the correct memory can
be fetched when the DMA descriptor ring is located beyond the 32-bit
boundary.
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thierry Reding [Wed, 2 Oct 2019 14:52:57 +0000 (16:52 +0200)]
net: stmmac: Only enable enhanced addressing mode when needed
Enhanced addressing mode is only required when more than 32 bits need to
be addressed. Add a DMA configuration parameter to enable this mode only
when needed.
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prashant Malani [Wed, 2 Oct 2019 21:09:33 +0000 (14:09 -0700)]
r8152: Add identifier names for function pointers
Checkpatch throws warnings for function pointer declarations which lack
identifier names.
An example of such a warning is:
WARNING: function definition argument 'struct r8152 *' should
also have an identifier name
739: FILE: drivers/net/usb/r8152.c:739:
+ void (*init)(struct r8152 *);
So, fix those warnings by adding the identifier names.
While we are at it, also fix a character limit violation which was
causing another checkpatch warning.
Change-Id: Idec857ce2dc9592caf3173188be1660052c052ce
Signed-off-by: Prashant Malani <pmalani@chromium.org>
Reviewed-by: Grant Grundler <grundler@chromium.org>
Acked-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sudhakar Dindukurti [Tue, 1 Oct 2019 23:33:14 +0000 (16:33 -0700)]
net/rds: Log vendor error if send/recv Work requests fail
Log vendor error if work requests fail. Vendor error provides
more information that is used for debugging the issue.
Signed-off-by: Sudhakar Dindukurti <sudhakar.dindukurti@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Matteo Croce [Wed, 2 Oct 2019 21:49:04 +0000 (23:49 +0200)]
mvpp2: remove misleading comment
Recycling in mvpp2 has gone long time ago, but two comment still refers
to it. Remove those two misleading comments as they generate confusion.
Fixes:
7ef7e1d949cd ("net: mvpp2: drop useless fields in mvpp2_bm_pool and related code")
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 2 Oct 2019 21:47:52 +0000 (14:47 -0700)]
Merge branch 'CAIF-Kconfig-fixes'
Randy Dunlap says:
====================
CAIF Kconfig fixes
This series of patches cleans up the CAIF Kconfig menus in
net/caif/Kconfig and drivers/net/caif/Kconfig and also puts the
CAIF Transport drivers into their own sub-menu.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
rd.dunlab@gmail.com [Tue, 1 Oct 2019 23:04:01 +0000 (16:04 -0700)]
Minor fixes to the CAIF Transport drivers Kconfig file
Minor fixes to the CAIF Transport drivers Kconfig file:
- end sentence with period
- capitalize CAIF acronym
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
rd.dunlab@gmail.com [Tue, 1 Oct 2019 23:04:00 +0000 (16:04 -0700)]
Isolate CAIF transport drivers into their own menu
Isolate CAIF transport drivers into their own menu.
This cleans up the main Network device support menu,
makes it easier to find the CAIF drivers, and makes it
easier to enable/disable them as a group.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
rd.dunlab@gmail.com [Tue, 1 Oct 2019 23:03:59 +0000 (16:03 -0700)]
Clean up the net/caif/Kconfig menu
Clean up the net/caif/Kconfig menu:
- remove extraneous space
- minor language tweaks
- fix punctuation
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 1 Oct 2019 21:02:36 +0000 (14:02 -0700)]
net_sched: remove need_resched() from qdisc_run()
The introduction of this schedule point was done in commit
2ba2506ca7ca ("[NET]: Add preemption point in qdisc_run")
at a time the loop was not bounded.
Then later in commit
d5b8aa1d246f ("net_sched: fix dequeuer fairness")
we added a limit on the number of packets.
Now is the time to remove the schedule point, since the default
limit of 64 packets matches the number of packets a typical NAPI
poll can process in a row.
This solves a latency problem for most TCP receivers under moderate load :
1) host receives a packet.
NET_RX_SOFTIRQ is raised by NIC hard IRQ handler
2) __do_softirq() does its first loop, handling NET_RX_SOFTIRQ
and calling the driver napi->loop() function
3) TCP stores the skb in socket receive queue:
4) TCP calls sk->sk_data_ready() and wakeups a user thread
waiting for EPOLLIN (as a result, need_resched() might now be true)
5) TCP cooks an ACK and sends it.
6) qdisc_run() processes one packet from qdisc, and sees need_resched(),
this raises NET_TX_SOFTIRQ (even if there are no more packets in
the qdisc)
Then we go back to the __do_softirq() in 2), and we see that new
softirqs were raised. Since need_resched() is true, we end up waking
ksoftirqd in this path :
if (pending) {
if (time_before(jiffies, end) && !need_resched() &&
--max_restart)
goto restart;
wakeup_softirqd();
}
So we have many wakeups of ksoftirqd kernel threads,
and more calls to qdisc_run() with associated lock overhead.
Note that another way to solve the issue would be to change TCP
to first send the ACK packet, then signal the EPOLLIN,
but this changes P99 latencies, as sending the ACK packet
can add a long delay.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 1 Oct 2019 19:12:50 +0000 (22:12 +0300)]
net: dsa: Remove unused __DSA_SKB_CB macro
The struct __dsa_skb_cb is supposed to span the entire 48-byte skb
control block, while the struct dsa_skb_cb only the portion of it which
is used by the DSA core (the rest is available as private data to
drivers).
The DSA_SKB_CB and __DSA_SKB_CB helpers are supposed to help retrieve
this pointer based on a skb, but it turns out there is nobody directly
interested in the struct __dsa_skb_cb in the kernel. So remove it.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 2 Oct 2019 16:25:11 +0000 (12:25 -0400)]
Merge branch 'sja1105-cleanups'
Vladimir Oltean says:
====================
SJA1105 DSA coding style cleanup
This series provides some mechanical cleanup patches related to function
names and prototypes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 1 Oct 2019 19:18:01 +0000 (22:18 +0300)]
net: dsa: sja1105: Rename sja1105_spi_send_packed_buf to sja1105_xfer_buf
The most commonly called function in the driver is long due for a
rename. The "packed" word is redundant (it doesn't make sense to
transfer an unpacked structure, since that is in CPU endianness yadda
yadda), and the "spi" word is also redundant since argument 2 of the
function is SPI_READ or SPI_WRITE.
As for the sja1105_spi_send_long_packed_buf function, it is only being
used from sja1105_spi.c, so remove its global prototype.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 1 Oct 2019 19:18:00 +0000 (22:18 +0300)]
net: dsa: sja1105: Replace sja1105_spi_send_int with sja1105_xfer_{u32, u64}
Having a function that takes a variable number of unpacked bytes which
it generically calls an "int" is confusing and makes auditing patches
next to impossible.
We only use spi_send_int with the int sizes of 32 and 64 bits. So just
make the spi_send_int function less generic and replace it with the
appropriate two explicit functions, which can now type-check the int
pointer type.
Note that there is still a small weirdness in the u32 function, which
has to convert it to a u64 temporary. This is because of how the packing
API works at the moment, but the weirdness is at least hidden from
callers of sja1105_xfer_u32 now.
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 1 Oct 2019 19:17:59 +0000 (22:17 +0300)]
net: dsa: sja1105: Don't use "inline" function declarations in C files
Let the compiler decide.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 2 Oct 2019 16:15:15 +0000 (12:15 -0400)]
Merge branch 'SMB-rootfs'
Paulo Alcantara says:
====================
Experimental SMB rootfs support
This patch series enables Linux to mount root file systems over the
network by utilizing SMB protocol.
Upstream commit
8eecd1c2e5bc ("cifs: Add support for root file
systems") introduced a new CONFIG_CIFS_ROOT option, a virtual device
(Root_CIFS) and a kernel cmdline parameter "cifsroot=" which tells the
kernel to actually mount the root filesystem over a SMB share.
The feature relies on ipconfig to set up the network prior to mounting
the rootfs, so when it is set along with "cifsroot=" parameter:
(1) cifs_root_setup() parses all necessary data out of "cifsroot="
parameter for the init process know how to mount the SMB rootfs
(e.g. SMB server address, mount options).
(2) If DHCP failed for some reason in ipconfig, we keep retrying
forever as we have nowhere to go for NFS or SMB root
filesystems (see PATCH 2/2). Otherwise go to (3).
(3) mount_cifs_root() is then called by mount_root() (ROOT_DEV ==
Root_CIFS), retrieves early parsed data from (1), then attempt to
mount SMB rootfs by CIFSROOT_RETRY_MAX times at most (see PATCH
1/2).
(4) If all attempts failed, fall back to floppy drive, otherwise
continue the boot process with rootfs mounted over a SMB share.
My idea was to keep the same behavior of nfsroot - as it seems to work
for most users so far.
For more information on how this feature works, see
Documentation/filesystems/cifs/cifsroot.txt.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Paulo Alcantara (SUSE) [Tue, 1 Oct 2019 17:10:28 +0000 (14:10 -0300)]
ipconfig: Handle CONFIG_CIFS_ROOT option
The experimental root file system support in cifs.ko relies on
ipconfig to set up the network stack and then accessing the SMB share
that contains the rootfs files.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paulo Alcantara (SUSE) [Tue, 1 Oct 2019 17:10:27 +0000 (14:10 -0300)]
init: Support mounting root file systems over SMB
Add a new virtual device named /dev/cifs (0xfe) to tell the kernel to
mount the root file system over the network by using SMB protocol.
cifs_root_data() will be responsible to retrieve the parsed
information of the new command-line option (cifsroot=) and then call
do_mount_root() with the appropriate mount options for cifs.ko.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 2 Oct 2019 15:55:12 +0000 (11:55 -0400)]
Merge branch 'ionic-driver-updates'
Shannon Nelson says:
====================
ionic: driver updates
These patches are a few updates to clean up some code
issues and add an ethtool feature.
v3: drop the Fixes tags as they really aren't fixing bugs
simplify ionic_lif_quiesce() as no return is necessary
v2: add cover letter
edit a couple of patch descriptions for clarity
and add Fixes: tags
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Tue, 1 Oct 2019 03:03:26 +0000 (20:03 -0700)]
ionic: add lif_quiesce to wait for queue activity to stop
Even though we've already turned off the queue activity with
the ionic_qcq_disable(), we need to wait for any device queues
that are processing packets to drain down before we try to
flush our packets and tear down the queues.
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Tue, 1 Oct 2019 03:03:25 +0000 (20:03 -0700)]
ionic: implement ethtool set-fec
Wire up the --set-fec and --show-fec features in the ethtool
callbacks and pull the related code out of set_link_ksettings.
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Tue, 1 Oct 2019 03:03:24 +0000 (20:03 -0700)]
ionic: report users coalesce request
The user's request for an interrupt coalescing value gets
translated into a hardware value to be used with the NIC,
and was getting reported back based on the hw value, which,
due to hw tic resolution, could be reported as a different
number than what the user originally asked for. This code
now tracks both the user request and what was put into the
hardware so we can report back to the user what they
requested.
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Tue, 1 Oct 2019 03:03:23 +0000 (20:03 -0700)]
ionic: use wait_on_bit_lock() rather than open code
Replace the open-coded ionic_wait_for_bit() with the
kernel's wait_on_bit_lock().
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Tue, 1 Oct 2019 03:03:22 +0000 (20:03 -0700)]
ionic: simplify returns in devlink info
There is no need for a goto in this bit of code.
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 2 Oct 2019 15:48:44 +0000 (11:48 -0400)]
Merge branch 'per-netns-notifier'
Jiri Pirko says:
====================
net: introduce per-netns netdevice notifiers and use them in mlxsw
Some drivers, like mlxsw, are not interested in notifications coming in
for netdevices from other network namespaces. So introduce per-netns
notifiers and allow to reduce overhead by listening only for
notifications from the same netns.
This is also a preparation for upcoming patchset "devlink: allow devlink
instances to change network namespace". This resolves deadlock during
reload mlxsw into initial netns made possible by
328fbe747ad4 ("net: Close race between {un, }register_netdevice_notifier() and setup_net()/cleanup_net()").
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Mon, 30 Sep 2019 08:15:11 +0000 (10:15 +0200)]
mlxsw: spectrum: Use per-netns netdevice notifier registration
The mlxsw_sp instance is not interested in events happening in other
network namespaces. So use "_net" variants for netdevice notifier
registration/unregistration and get only events which are happening in
the net the instance is in.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Mon, 30 Sep 2019 08:15:10 +0000 (10:15 +0200)]
net: introduce per-netns netdevice notifiers
Often the code for example in drivers is interested in getting notifier
call only from certain network namespace. In addition to the existing
global netdevice notifier chain introduce per-netns chains and allow
users to register to that. Eventually this would eliminate unnecessary
overhead in case there are many netdevices in many network namespaces.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Mon, 30 Sep 2019 08:15:09 +0000 (10:15 +0200)]
net: push loops and nb calls into helper functions
Push iterations over net namespaces and netdevices from
register_netdevice_notifier() and unregister_netdevice_notifier()
into helper functions. Along with that introduce continue_reverse macros
to make the code a bit nicer allowing to get rid of "last" marks.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prashant Malani [Mon, 30 Sep 2019 19:38:18 +0000 (12:38 -0700)]
r8152: Use guard clause and fix comment typos
Use a guard clause in tx_bottom() to reduce the indentation of the
do-while loop.
Also, fix a couple of spelling and grammatical mistakes in the
r8152_csum_workaround() function comment.
Change-Id: I460befde150ad92248fd85b0f189ec2df2ab8431
Signed-off-by: Prashant Malani <pmalani@chromium.org>
Reviewed-by: Grant Grundler <grundler@chromium.org>
Acked-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Matias Ezequiel Vara Larsen [Mon, 30 Sep 2019 18:25:23 +0000 (18:25 +0000)]
vsock/virtio: add support for MSG_PEEK
This patch adds support for MSG_PEEK. In such a case, packets are not
removed from the rx_queue and credit updates are not sent.
Signed-off-by: Matias Ezequiel Vara Larsen <matiasevara@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hovold [Mon, 30 Sep 2019 15:12:41 +0000 (17:12 +0200)]
hso: fix NULL-deref on tty open
Fix NULL-pointer dereference on tty open due to a failure to handle a
missing interrupt-in endpoint when probing modem ports:
BUG: kernel NULL pointer dereference, address:
0000000000000006
...
RIP: 0010:tiocmget_submit_urb+0x1c/0xe0 [hso]
...
Call Trace:
hso_start_serial_device+0xdc/0x140 [hso]
hso_serial_open+0x118/0x1b0 [hso]
tty_open+0xf1/0x490
Fixes:
542f54823614 ("tty: Modem functions for the HSO driver")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman [Mon, 30 Sep 2019 14:03:52 +0000 (16:03 +0200)]
dt-bindings: sh_eth convert bindings to json-schema
Convert Renesas Electronics SH EtherMAC bindings documentation to
json-schema. Also name bindings documentation file according to the compat
string being documented.
Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Peter Fink [Mon, 30 Sep 2019 12:04:03 +0000 (14:04 +0200)]
net: usb: ax88179_178a: allow optionally getting mac address from device tree
Adopt and integrate the feature to pass the MAC address via device tree
from asix_device.c (03fc5d4) also to other ax88179 based asix chips.
E.g. the bootloader fills in local-mac-address and the driver will then
pick up and use this MAC address.
Signed-off-by: Peter Fink <pfink@christ-es.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel [Mon, 30 Sep 2019 12:02:16 +0000 (14:02 +0200)]
ipv6: minor code reorg in inet6_fill_ifla6_attrs()
Just put related code together to ease code reading: the memcpy() is
related to the nla_reserve().
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 1 Oct 2019 21:47:19 +0000 (14:47 -0700)]
Merge branch 'netdev-altnames'
Jiri Pirko says:
====================
net: introduce alternative names for network interfaces
In the past, there was repeatedly discussed the IFNAMSIZ (16) limit for
netdevice name length. Now when we have PF and VF representors
with port names like "pfXvfY", it became quite common to hit this limit:
0123456789012345
enp131s0f1npf0vf6
enp131s0f1npf0vf22
Udev cannot rename these interfaces out-of-the-box and user needs to
create custom rules to handle them.
Also, udev has multiple schemes of netdev names. From udev code:
* Type of names:
* b<number> - BCMA bus core number
* c<bus_id> - bus id of a grouped CCW or CCW device,
* with all leading zeros stripped [s390]
* o<index>[n<phys_port_name>|d<dev_port>]
* - on-board device index number
* s<slot>[f<function>][n<phys_port_name>|d<dev_port>]
* - hotplug slot index number
* x<MAC> - MAC address
* [P<domain>]p<bus>s<slot>[f<function>][n<phys_port_name>|d<dev_port>]
* - PCI geographical location
* [P<domain>]p<bus>s<slot>[f<function>][u<port>][..][c<config>][i<interface>]
* - USB port number chain
* v<slot> - VIO slot number (IBM PowerVM)
* a<vendor><model>i<instance> - Platform bus ACPI instance id
* i<addr>n<phys_port_name> - Netdevsim bus address and port name
One device can be often renamed by multiple patterns at the
same time (e.g. pci address/mac).
This patchset introduces alternative names for network interfaces.
Main goal is to:
1) Overcome the IFNAMSIZ limitation (altname limitation is 128 bytes)
2) Allow to have multiple names at the same time (multiple udev patterns)
3) Allow to use alternative names as handle for commands
The patchset introduces two new commands to add/delete list of properties.
Currently only alternative names are implemented but the ifrastructure
could be easily extended later on. This is very similar to the list of vlan
and tunnels being added/deleted to/from bridge ports.
See following examples.
$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ae:67:a9:67:46:86 brd ff:ff:ff:ff:ff:ff
-> Add alternative names for dummy0:
$ ip link prop add dummy0 altname someothername
$ ip link prop add dummy0 altname someotherveryveryveryverylongname
$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ae:67:a9:67:46:86 brd ff:ff:ff:ff:ff:ff
altname someothername
altname someotherveryveryveryverylongname
$ ip link show someotherveryveryveryverylongname
2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ae:67:a9:67:46:86 brd ff:ff:ff:ff:ff:ff
altname someothername
altname someotherveryveryveryverylongname
-> Add bridge brx, add it's alternative name and use alternative names to
do enslavement.
$ ip link add name brx type bridge
$ ip link prop add brx altname mypersonalsuperspecialbridge
$ ip link set someotherveryveryveryverylongname master mypersonalsuperspecialbridge
$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop master brx state DOWN mode DEFAULT group default qlen 1000
link/ether ae:67:a9:67:46:86 brd ff:ff:ff:ff:ff:ff
altname someothername
altname someotherveryveryveryverylongname
3: brx: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ae:67:a9:67:46:86 brd ff:ff:ff:ff:ff:ff
altname mypersonalsuperspecialbridge
-> Add ipv4 address to the bridge using alternative name:
$ ip addr add 192.168.0.1/24 dev mypersonalsuperspecialbridge
$ ip addr show mypersonalsuperspecialbridge
3: brx: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether ae:67:a9:67:46:86 brd ff:ff:ff:ff:ff:ff
altname mypersonalsuperspecialbridge
inet 192.168.0.1/24 scope global brx
valid_lft forever preferred_lft forever
-> Delete one of dummy0 alternative names:
$ ip link prop del dummy0 altname someotherveryveryveryverylongname
$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop master brx state DOWN mode DEFAULT group default qlen 1000
link/ether ae:67:a9:67:46:86 brd ff:ff:ff:ff:ff:ff
altname someothername
3: brx: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ae:67:a9:67:46:86 brd ff:ff:ff:ff:ff:ff
altname mypersonalsuperspecialbridge
-> Add multiple alternative names at once
$ ip link prop add dummy0 altname a altname b altname c altname d
$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop master brx state DOWN mode DEFAULT group default qlen 1000
link/ether ae:67:a9:67:46:86 brd ff:ff:ff:ff:ff:ff
altname someothername
altname a
altname b
altname c
altname d
3: brx: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ae:67:a9:67:46:86 brd ff:ff:ff:ff:ff:ff
altname mypersonalsuperspecialbridge
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Mon, 30 Sep 2019 09:48:20 +0000 (11:48 +0200)]
net: rtnetlink: add possibility to use alternative names as message handle
Extend the basic rtnetlink commands to use alternative interface names
as a handle instead of ifindex and ifname.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Mon, 30 Sep 2019 09:48:19 +0000 (11:48 +0200)]
net: rtnetlink: introduce helper to get net_device instance by ifname
Introduce helper function rtnl_get_dev() that gets net_device structure
instance pointer according to passed ifname or ifname attribute.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Mon, 30 Sep 2019 09:48:18 +0000 (11:48 +0200)]
net: rtnetlink: unify the code in __rtnl_newlink get dev with the rest
__rtnl_newlink() code flow is a bit different around tb[IFLA_IFNAME]
processing comparing to the other places. Change that to be unified with
the rest.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Mon, 30 Sep 2019 09:48:17 +0000 (11:48 +0200)]
net: rtnetlink: put alternative names to getlink message
Extend exiting getlink info message with list of properties. Now the
only ones are alternative names.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Mon, 30 Sep 2019 09:48:16 +0000 (11:48 +0200)]
net: rtnetlink: add linkprop commands to add and delete alternative ifnames
Add two commands to add and delete list of link properties. Implement
the first property type along - alternative ifnames.
Each net device can have multiple alternative names.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Mon, 30 Sep 2019 09:48:15 +0000 (11:48 +0200)]
net: introduce name_node struct to be used in hashlist
Introduce name_node structure to hold name of device and put it into
hashlist instead of putting there struct net_device directly. Add a
necessary infrastructure to manipulate the hashlist. This prepares
the code to use the same hashlist for alternative names introduced
later in this set.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Mon, 30 Sep 2019 09:48:14 +0000 (11:48 +0200)]
net: procfs: use index hashlist instead of name hashlist
Name hashlist is going to be used for more than just dev->name, so use
rather index hashlist for iteration over net_device instances.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 1 Oct 2019 17:49:06 +0000 (10:49 -0700)]
tcp: add ipv6_addr_v4mapped_loopback() helper
tcp_twsk_unique() has a hard coded assumption about ipv4 loopback
being 127/8
Lets instead use the standard ipv4_is_loopback() method,
in a new ipv6_addr_v4mapped_loopback() helper.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julio Faracco [Tue, 1 Oct 2019 14:39:04 +0000 (11:39 -0300)]
net: core: dev: replace state xoff flag comparison by netif_xmit_stopped method
Function netif_schedule_queue() has a hardcoded comparison between queue
state and any xoff flag. This comparison does the same thing as method
netif_xmit_stopped(). In terms of code clarity, it is better. See other
methods like: generic_xdp_tx() and dev_direct_xmit().
Signed-off-by: Julio Faracco <jcfaracco@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prashant Malani [Tue, 1 Oct 2019 08:35:57 +0000 (01:35 -0700)]
r8152: Factor out OOB link list waits
The same for-loop check for the LINK_LIST_READY bit of an OOB_CTRL
register is used in several places. Factor these out into a single
function to reduce the lines of code.
Change-Id: I20e8f327045a72acc0a83e2d145ae2993ab62915
Signed-off-by: Prashant Malani <pmalani@chromium.org>
Reviewed-by: Grant Grundler <grundler@chromium.org>
Acked-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 29 Sep 2019 00:47:33 +0000 (17:47 -0700)]
Merge git://git./linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
1) Sanity check URB networking device parameters to avoid divide by
zero, from Oliver Neukum.
2) Disable global multicast filter in NCSI, otherwise LLDP and IPV6
don't work properly. Longer term this needs a better fix tho. From
Vijay Khemka.
3) Small fixes to selftests (use ping when ping6 is not present, etc.)
from David Ahern.
4) Bring back rt_uses_gateway member of struct rtable, it's semantics
were not well understood and trying to remove it broke things. From
David Ahern.
5) Move usbnet snaity checking, ignore endpoints with invalid
wMaxPacketSize. From Bjørn Mork.
6) Missing Kconfig deps for sja1105 driver, from Mao Wenan.
7) Various small fixes to the mlx5 DR steering code, from Alaa Hleihel,
Alex Vesker, and Yevgeny Kliteynik
8) Missing CAP_NET_RAW checks in various places, from Ori Nimron.
9) Fix crash when removing sch_cbs entry while offloading is enabled,
from Vinicius Costa Gomes.
10) Signedness bug fixes, generally in looking at the result given by
of_get_phy_mode() and friends. From Dan Crapenter.
11) Disable preemption around BPF_PROG_RUN() calls, from Eric Dumazet.
12) Don't create VRF ipv6 rules if ipv6 is disabled, from David Ahern.
13) Fix quantization code in tcp_bbr, from Kevin Yang.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (127 commits)
net: tap: clean up an indentation issue
nfp: abm: fix memory leak in nfp_abm_u32_knode_replace
tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
sk_buff: drop all skb extensions on free and skb scrubbing
tcp_bbr: fix quantization code to not raise cwnd if not probing bandwidth
mlxsw: spectrum_flower: Fail in case user specifies multiple mirror actions
Documentation: Clarify trap's description
mlxsw: spectrum: Clear VLAN filters during port initialization
net: ena: clean up indentation issue
NFC: st95hf: clean up indentation issue
net: phy: micrel: add Asym Pause workaround for KSZ9021
net: socionext: ave: Avoid using netdev_err() before calling register_netdev()
ptp: correctly disable flags on old ioctls
lib: dimlib: fix help text typos
net: dsa: microchip: Always set regmap stride to 1
nfp: flower: fix memory leak in nfp_flower_spawn_vnic_reprs
nfp: flower: prevent memory leak in nfp_flower_spawn_phy_reprs
net/sched: Set default of CONFIG_NET_TC_SKB_EXT to N
vrf: Do not attempt to create IPv6 mcast rule if IPv6 is disabled
net: sched: sch_sfb: don't call qdisc_put() while holding tree lock
...
Linus Torvalds [Sat, 28 Sep 2019 21:26:47 +0000 (14:26 -0700)]
Merge branch 'hugepage-fallbacks' (hugepatch patches from David Rientjes)
Merge hugepage allocation updates from David Rientjes:
"We (mostly Linus, Andrea, and myself) have been discussing offlist how
to implement a sane default allocation strategy for hugepages on NUMA
platforms.
With these reverts in place, the page allocator will happily allocate
a remote hugepage immediately rather than try to make a local hugepage
available. This incurs a substantial performance degradation when
memory compaction would have otherwise made a local hugepage
available.
This series reverts those reverts and attempts to propose a more sane
default allocation strategy specifically for hugepages. Andrea
acknowledges this is likely to fix the swap storms that he originally
reported that resulted in the patches that removed __GFP_THISNODE from
hugepage allocations.
The immediate goal is to return 5.3 to the behavior the kernel has
implemented over the past several years so that remote hugepages are
not immediately allocated when local hugepages could have been made
available because the increased access latency is untenable.
The next goal is to introduce a sane default allocation strategy for
hugepages allocations in general regardless of the configuration of
the system so that we prevent thrashing of local memory when
compaction is unlikely to succeed and can prefer remote hugepages over
remote native pages when the local node is low on memory."
Note on timing: this reverts the hugepage VM behavior changes that got
introduced fairly late in the 5.3 cycle, and that fixed a huge
performance regression for certain loads that had been around since
4.18.
Andrea had this note:
"The regression of 4.18 was that it was taking hours to start a VM
where 3.10 was only taking a few seconds, I reported all the details
on lkml when it was finally tracked down in August 2018.
https://lore.kernel.org/linux-mm/
20180820032640.9896-2-aarcange@redhat.com/
__GFP_THISNODE in MADV_HUGEPAGE made the above enterprise vfio
workload degrade like in the "current upstream" above. And it still
would have been that bad as above until 5.3-rc5"
where the bad behavior ends up happening as you fill up a local node,
and without that change, you'd get into the nasty swap storm behavior
due to compaction working overtime to make room for more memory on the
nodes.
As a result 5.3 got the two performance fix reverts in rc5.
However, David Rientjes then noted that those performance fixes in turn
regressed performance for other loads - although not quite to the same
degree. He suggested reverting the reverts and instead replacing them
with two small changes to how hugepage allocations are done (patch
descriptions rephrased by me):
- "avoid expensive reclaim when compaction may not succeed": just admit
that the allocation failed when you're trying to allocate a huge-page
and compaction wasn't successful.
- "allow hugepage fallback to remote nodes when madvised": when that
node-local huge-page allocation failed, retry without forcing the
local node.
but by then I judged it too late to replace the fixes for a 5.3 release.
So 5.3 was released with behavior that harked back to the pre-4.18 logic.
But now we're in the merge window for 5.4, and we can see if this
alternate model fixes not just the horrendous swap storm behavior, but
also restores the performance regression that the late reverts caused.
Fingers crossed.
* emailed patches from David Rientjes <rientjes@google.com>:
mm, page_alloc: allow hugepage fallback to remote nodes when madvised
mm, page_alloc: avoid expensive reclaim when compaction may not succeed
Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
Revert "Revert "mm, thp: restore node-local hugepage allocations""
David Rientjes [Wed, 4 Sep 2019 19:54:25 +0000 (12:54 -0700)]
mm, page_alloc: allow hugepage fallback to remote nodes when madvised
For systems configured to always try hard to allocate transparent
hugepages (thp defrag setting of "always") or for memory that has been
explicitly madvised to MADV_HUGEPAGE, it is often better to fallback to
remote memory to allocate the hugepage if the local allocation fails
first.
The point is to allow the initial call to __alloc_pages_node() to attempt
to defragment local memory to make a hugepage available, if possible,
rather than immediately fallback to remote memory. Local hugepages will
always have a better access latency than remote (huge)pages, so an attempt
to make a hugepage available locally is always preferred.
If memory compaction cannot be successful locally, however, it is likely
better to fallback to remote memory. This could take on two forms: either
allow immediate fallback to remote memory or do per-zone watermark checks.
It would be possible to fallback only when per-zone watermarks fail for
order-0 memory, since that would require local reclaim for all subsequent
faults so remote huge allocation is likely better than thrashing the local
zone for large workloads.
In this case, it is assumed that because the system is configured to try
hard to allocate hugepages or the vma is advised to explicitly want to try
hard for hugepages that remote allocation is better when local allocation
and memory compaction have both failed.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 4 Sep 2019 19:54:22 +0000 (12:54 -0700)]
mm, page_alloc: avoid expensive reclaim when compaction may not succeed
Memory compaction has a couple significant drawbacks as the allocation
order increases, specifically:
- isolate_freepages() is responsible for finding free pages to use as
migration targets and is implemented as a linear scan of memory
starting at the end of a zone,
- failing order-0 watermark checks in memory compaction does not account
for how far below the watermarks the zone actually is: to enable
migration, there must be *some* free memory available. Per the above,
watermarks are not always suffficient if isolate_freepages() cannot
find the free memory but it could require hundreds of MBs of reclaim to
even reach this threshold (read: potentially very expensive reclaim with
no indication compaction can be successful), and
- if compaction at this order has failed recently so that it does not even
run as a result of deferred compaction, looping through reclaim can often
be pointless.
For hugepage allocations, these are quite substantial drawbacks because
these are very high order allocations (order-9 on x86) and falling back to
doing reclaim can potentially be *very* expensive without any indication
that compaction would even be successful.
Reclaim itself is unlikely to free entire pageblocks and certainly no
reliance should be put on it to do so in isolation (recall lumpy reclaim).
This means we should avoid reclaim and simply fail hugepage allocation if
compaction is deferred.
It is also not helpful to thrash a zone by doing excessive reclaim if
compaction may not be able to access that memory. If order-0 watermarks
fail and the allocation order is sufficiently large, it is likely better
to fail the allocation rather than thrashing the zone.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 4 Sep 2019 19:54:20 +0000 (12:54 -0700)]
Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
This reverts commit
92717d429b38e4f9f934eed7e605cc42858f1839.
Since commit
a8282608c88e ("Revert "mm, thp: restore node-local hugepage
allocations"") is reverted in this series, it is better to restore the
previous 5.2 behavior between the thp allocation and the page allocator
rather than to attempt any consolidation or cleanup for a policy that is
now reverted. It's less risky during an rc cycle and subsequent patches
in this series further modify the same policy that the pre-5.3 behavior
implements.
Consolidation and cleanup can be done subsequent to a sane default page
allocation strategy, so this patch reverts a cleanup done on a strategy
that is now reverted and thus is the least risky option.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 4 Sep 2019 19:54:18 +0000 (12:54 -0700)]
Revert "Revert "mm, thp: restore node-local hugepage allocations""
This reverts commit
a8282608c88e08b1782141026eab61204c1e533f.
The commit references the original intended semantic for MADV_HUGEPAGE
which has subsequently taken on three unique purposes:
- enables or disables thp for a range of memory depending on the system's
config (is thp "enabled" set to "always" or "madvise"),
- determines the synchronous compaction behavior for thp allocations at
fault (is thp "defrag" set to "always", "defer+madvise", or "madvise"),
and
- reverts a previous MADV_NOHUGEPAGE (there is no madvise mode to only
clear previous hugepage advice).
These are the three purposes that currently exist in 5.2 and over the
past several years that userspace has been written around. Adding a
NUMA locality preference adds a fourth dimension to an already conflated
advice mode.
Based on the semantic that MADV_HUGEPAGE has provided over the past
several years, there exist workloads that use the tunable based on these
principles: specifically that the allocation should attempt to
defragment a local node before falling back. It is agreed that remote
hugepages typically (but not always) have a better access latency than
remote native pages, although on Naples this is at parity for
intersocket.
The revert commit that this patch reverts allows hugepage allocation to
immediately allocate remotely when local memory is fragmented. This is
contrary to the semantic of MADV_HUGEPAGE over the past several years:
that is, memory compaction should be attempted locally before falling
back.
The performance degradation of remote hugepages over local hugepages on
Rome, for example, is 53.5% increased access latency. For this reason,
the goal is to revert back to the 5.2 and previous behavior that would
attempt local defragmentation before falling back. With the patch that
is reverted by this patch, we see performance degradations at the tail
because the allocator happily allocates the remote hugepage rather than
even attempting to make a local hugepage available.
zone_reclaim_mode is not a solution to this problem since it does not
only impact hugepage allocations but rather changes the memory
allocation strategy for *all* page allocations.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 28 Sep 2019 20:43:00 +0000 (13:43 -0700)]
Merge tag 'powerpc-5.4-2' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"An assortment of fixes that were either missed by me, or didn't arrive
quite in time for the first v5.4 pull.
- Most notable is a fix for an issue with tlbie (broadcast TLB
invalidation) on Power9, when using the Radix MMU. The tlbie can
race with an mtpid (move to PID register, essentially MMU context
switch) on another thread of the core, which can cause stores to
continue to go to a page after it's unmapped.
- A fix in our KVM code to add a missing barrier, the lack of which
has been observed to cause missed IPIs and subsequently stuck CPUs
in the host.
- A change to the way we initialise PCR (Processor Compatibility
Register) to make it forward compatible with future CPUs.
- On some older PowerVM systems our H_BLOCK_REMOVE support could
oops, fix it to detect such systems and fallback to the old
invalidation method.
- A fix for an oops seen on some machines when using KASAN on 32-bit.
- A handful of other minor fixes, and two new selftests.
Thanks to: Alistair Popple, Aneesh Kumar K.V, Christophe Leroy,
Gustavo Romero, Joel Stanley, Jordan Niethe, Laurent Dufour, Michael
Roth, Oliver O'Halloran"
* tag 'powerpc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/eeh: Fix eeh eeh_debugfs_break_device() with SRIOV devices
powerpc/nvdimm: use H_SCM_QUERY hcall on H_OVERLAP error
powerpc/nvdimm: Use HCALL error as the return value
selftests/powerpc: Add test case for tlbie vs mtpidr ordering issue
powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9
powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag
powerpc/book3s64/mm: Don't do tlbie fixup for some hardware revisions
powerpc/pseries: Call H_BLOCK_REMOVE when supported
powerpc/pseries: Read TLB Block Invalidate Characteristics
KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag
powerpc/mm: Fix an Oops in kasan_mmu_init()
powerpc/mm: Add a helper to select PAGE_KERNEL_RO or PAGE_READONLY
powerpc/64s: Set reserved PCR bits
powerpc: Fix definition of PCR bits to work with old binutils
powerpc/book3s64/radix: Remove WARN_ON in destroy_context()
powerpc/tm: Add tm-poison test
Linus Torvalds [Sat, 28 Sep 2019 20:37:41 +0000 (13:37 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fix from Ingo Molnar:
"A kexec fix for the case when GCC_PLUGIN_STACKLEAK=y is enabled"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/purgatory: Disable the stackleak GCC plugin for the purgatory
Linus Torvalds [Sat, 28 Sep 2019 19:39:07 +0000 (12:39 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
- Apply a number of membarrier related fixes and cleanups, which fixes
a use-after-free race in the membarrier code
- Introduce proper RCU protection for tasks on the runqueue - to get
rid of the subtle task_rcu_dereference() interface that was easy to
get wrong
- Misc fixes, but also an EAS speedup
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Avoid redundant EAS calculation
sched/core: Remove double update_max_interval() call on CPU startup
sched/core: Fix preempt_schedule() interrupt return comment
sched/fair: Fix -Wunused-but-set-variable warnings
sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
sched/membarrier: Return -ENOMEM to userspace on memory allocation failure
sched/membarrier: Skip IPIs when mm->mm_users == 1
selftests, sched/membarrier: Add multi-threaded test
sched/membarrier: Fix p->mm->membarrier_state racy load
sched/membarrier: Call sync_core only before usermode for same mm
sched/membarrier: Remove redundant check
sched/membarrier: Fix private expedited registration check
tasks, sched/core: RCUify the assignment of rq->curr
tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
tasks: Add a count of task RCU users
sched/core: Convert vcpu_is_preempted() from macro to an inline function
sched/fair: Remove unused cfs_rq_clock_task() function
Linus Torvalds [Sat, 28 Sep 2019 15:14:15 +0000 (08:14 -0700)]
Merge branch 'next-lockdown' of git://git./linux/kernel/git/jmorris/linux-security
Pull kernel lockdown mode from James Morris:
"This is the latest iteration of the kernel lockdown patchset, from
Matthew Garrett, David Howells and others.
From the original description:
This patchset introduces an optional kernel lockdown feature,
intended to strengthen the boundary between UID 0 and the kernel.
When enabled, various pieces of kernel functionality are restricted.
Applications that rely on low-level access to either hardware or the
kernel may cease working as a result - therefore this should not be
enabled without appropriate evaluation beforehand.
The majority of mainstream distributions have been carrying variants
of this patchset for many years now, so there's value in providing a
doesn't meet every distribution requirement, but gets us much closer
to not requiring external patches.
There are two major changes since this was last proposed for mainline:
- Separating lockdown from EFI secure boot. Background discussion is
covered here: https://lwn.net/Articles/751061/
- Implementation as an LSM, with a default stackable lockdown LSM
module. This allows the lockdown feature to be policy-driven,
rather than encoding an implicit policy within the mechanism.
The new locked_down LSM hook is provided to allow LSMs to make a
policy decision around whether kernel functionality that would allow
tampering with or examining the runtime state of the kernel should be
permitted.
The included lockdown LSM provides an implementation with a simple
policy intended for general purpose use. This policy provides a coarse
level of granularity, controllable via the kernel command line:
lockdown={integrity|confidentiality}
Enable the kernel lockdown feature. If set to integrity, kernel features
that allow userland to modify the running kernel are disabled. If set to
confidentiality, kernel features that allow userland to extract
confidential information from the kernel are also disabled.
This may also be controlled via /sys/kernel/security/lockdown and
overriden by kernel configuration.
New or existing LSMs may implement finer-grained controls of the
lockdown features. Refer to the lockdown_reason documentation in
include/linux/security.h for details.
The lockdown feature has had signficant design feedback and review
across many subsystems. This code has been in linux-next for some
weeks, with a few fixes applied along the way.
Stephen Rothwell noted that commit
9d1f8be5cf42 ("bpf: Restrict bpf
when kernel lockdown is in confidentiality mode") is missing a
Signed-off-by from its author. Matthew responded that he is providing
this under category (c) of the DCO"
* 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (31 commits)
kexec: Fix file verification on S390
security: constify some arrays in lockdown LSM
lockdown: Print current->comm in restriction messages
efi: Restrict efivar_ssdt_load when the kernel is locked down
tracefs: Restrict tracefs when the kernel is locked down
debugfs: Restrict debugfs when the kernel is locked down
kexec: Allow kexec_file() with appropriate IMA policy when locked down
lockdown: Lock down perf when in confidentiality mode
bpf: Restrict bpf when kernel lockdown is in confidentiality mode
lockdown: Lock down tracing and perf kprobes when in confidentiality mode
lockdown: Lock down /proc/kcore
x86/mmiotrace: Lock down the testmmiotrace module
lockdown: Lock down module params that specify hardware parameters (eg. ioport)
lockdown: Lock down TIOCSSERIAL
lockdown: Prohibit PCMCIA CIS storage when the kernel is locked down
acpi: Disable ACPI table override if the kernel is locked down
acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
ACPI: Limit access to custom_method when the kernel is locked down
x86/msr: Restrict MSR access when the kernel is locked down
x86: Lock down IO port access when the kernel is locked down
...
Linus Torvalds [Sat, 28 Sep 2019 02:37:27 +0000 (19:37 -0700)]
Merge branch 'next-integrity' of git://git./linux/kernel/git/zohar/linux-integrity
Pull integrity updates from Mimi Zohar:
"The major feature in this time is IMA support for measuring and
appraising appended file signatures. In addition are a couple of bug
fixes and code cleanup to use struct_size().
In addition to the PE/COFF and IMA xattr signatures, the kexec kernel
image may be signed with an appended signature, using the same
scripts/sign-file tool that is used to sign kernel modules.
Similarly, the initramfs may contain an appended signature.
This contained a lot of refactoring of the existing appended signature
verification code, so that IMA could retain the existing framework of
calculating the file hash once, storing it in the IMA measurement list
and extending the TPM, verifying the file's integrity based on a file
hash or signature (eg. xattrs), and adding an audit record containing
the file hash, all based on policy. (The IMA support for appended
signatures patch set was posted and reviewed 11 times.)
The support for appended signature paves the way for adding other
signature verification methods, such as fs-verity, based on a single
system-wide policy. The file hash used for verifying the signature and
the signature, itself, can be included in the IMA measurement list"
* 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
ima: ima_api: Use struct_size() in kzalloc()
ima: use struct_size() in kzalloc()
sefltest/ima: support appended signatures (modsig)
ima: Fix use after free in ima_read_modsig()
MODSIGN: make new include file self contained
ima: fix freeing ongoing ahash_request
ima: always return negative code for error
ima: Store the measurement again when appraising a modsig
ima: Define ima-modsig template
ima: Collect modsig
ima: Implement support for module-style appended signatures
ima: Factor xattr_verify() out of ima_appraise_measurement()
ima: Add modsig appraise_type option for module-style appended signatures
integrity: Select CONFIG_KEYS instead of depending on it
PKCS#7: Introduce pkcs7_get_digest()
PKCS#7: Refactor verify_pkcs7_signature()
MODSIGN: Export module signature definitions
ima: initialize the "template" field with the default template
Linus Torvalds [Sat, 28 Sep 2019 00:00:27 +0000 (17:00 -0700)]
Merge tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Highlights:
- Add a new knfsd file cache, so that we don't have to open and close
on each (NFSv2/v3) READ or WRITE. This can speed up read and write
in some cases. It also replaces our readahead cache.
- Prevent silent data loss on write errors, by treating write errors
like server reboots for the purposes of write caching, thus forcing
clients to resend their writes.
- Tweak the code that allocates sessions to be more forgiving, so
that NFSv4.1 mounts are less likely to hang when a server already
has a lot of clients.
- Eliminate an arbitrary limit on NFSv4 ACL sizes; they should now be
limited only by the backend filesystem and the maximum RPC size.
- Allow the server to enforce use of the correct kerberos credentials
when a client reclaims state after a reboot.
And some miscellaneous smaller bugfixes and cleanup"
* tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux: (34 commits)
sunrpc: clean up indentation issue
nfsd: fix nfs read eof detection
nfsd: Make nfsd_reset_boot_verifier_locked static
nfsd: degraded slot-count more gracefully as allocation nears exhaustion.
nfsd: handle drc over-allocation gracefully.
nfsd: add support for upcall version 2
nfsd: add a "GetVersion" upcall for nfsdcld
nfsd: Reset the boot verifier on all write I/O errors
nfsd: Don't garbage collect files that might contain write errors
nfsd: Support the server resetting the boot verifier
nfsd: nfsd_file cache entries should be per net namespace
nfsd: eliminate an unnecessary acl size limit
Deprecate nfsd fault injection
nfsd: remove duplicated include from filecache.c
nfsd: Fix the documentation for svcxdr_tmpalloc()
nfsd: Fix up some unused variable warnings
nfsd: close cached files prior to a REMOVE or RENAME that would replace target
nfsd: rip out the raparms cache
nfsd: have nfsd_test_lock use the nfsd_file cache
nfsd: hook up nfs4_preprocess_stateid_op to the nfsd_file cache
...
Linus Torvalds [Fri, 27 Sep 2019 22:54:24 +0000 (15:54 -0700)]
Merge tag 'virtio-fs-5.4' of git://git./linux/kernel/git/mszeredi/fuse
Pull fuse virtio-fs support from Miklos Szeredi:
"Virtio-fs allows exporting directory trees on the host and mounting
them in guest(s).
This isn't actually a new filesystem, but a glue layer between the
fuse filesystem and a virtio based back-end.
It's similar in functionality to the existing virtio-9p solution, but
significantly faster in benchmarks and has better POSIX compliance.
Further permformance improvements can be achieved by sharing the page
cache between host and guest, allowing for faster I/O and reduced
memory use.
Kata Containers have been including the out-of-tree virtio-fs (with
the shared page cache patches as well) since version 1.7 as an
experimental feature. They have been active in development and plan to
switch from virtio-9p to virtio-fs as their default solution. There
has been interest from other sources as well.
The userspace infrastructure is slated to be merged into qemu once the
kernel part hits mainline.
This was developed by Vivek Goyal, Dave Gilbert and Stefan Hajnoczi"
* tag 'virtio-fs-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
virtio-fs: add virtiofs filesystem
virtio-fs: add Documentation/filesystems/virtiofs.rst
fuse: reserve values for mapping protocol
Linus Torvalds [Fri, 27 Sep 2019 22:10:34 +0000 (15:10 -0700)]
Merge tag '9p-for-5.4' of git://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"Some of the usual small fixes and cleanup.
Small fixes all around:
- avoid overlayfs copy-up for PRIVATE mmaps
- KUMSAN uninitialized warning for transport error
- one syzbot memory leak fix in 9p cache
- internal API cleanup for v9fs_fill_super"
* tag '9p-for-5.4' of git://github.com/martinetd/linux:
9p/vfs_super.c: Remove unused parameter data in v9fs_fill_super
9p/cache.c: Fix memory leak in v9fs_cache_session_get_cookie
9p: Transport error uninitialized
9p: avoid attaching writeback_fid on mmap with type PRIVATE
Linus Torvalds [Fri, 27 Sep 2019 20:08:36 +0000 (13:08 -0700)]
Merge tag 'riscv/for-v5.4-rc1-b' of git://git./linux/kernel/git/riscv/linux
Pull more RISC-V updates from Paul Walmsley:
"Some additional RISC-V updates.
This includes one significant fix:
- Prevent interrupts from being unconditionally re-enabled during
exception handling if they were disabled in the context in which
the exception occurred
Also a few other fixes:
- Fix a build error when sparse memory support is manually enabled
- Prevent CPUs beyond CONFIG_NR_CPUS from being enabled in early boot
And a few minor improvements:
- DT improvements: in the FU540 SoC DT files, improve U-Boot
compatibility by adding an "ethernet0" alias, drop an unnecessary
property from the DT files, and add support for the PWM device
- KVM preparation: add a KVM-related macro for future RISC-V KVM
support, and export some symbols required to build KVM support as
modules
- defconfig additions: build more drivers by default for QEMU
configurations"
* tag 'riscv/for-v5.4-rc1-b' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Avoid interrupts being erroneously enabled in handle_exception()
riscv: dts: sifive: Drop "clock-frequency" property of cpu nodes
riscv: dts: sifive: Add ethernet0 to the aliases node
RISC-V: Export kernel symbols for kvm
KVM: RISC-V: Add KVM_REG_RISCV for ONE_REG interface
arch/riscv: disable excess harts before picking main boot hart
RISC-V: Enable VIRTIO drivers in RV64 and RV32 defconfig
RISC-V: Fix building error when CONFIG_SPARSEMEM_MANUAL=y
riscv: dts: Add DT support for SiFive FU540 PWM driver
Linus Torvalds [Fri, 27 Sep 2019 20:02:19 +0000 (13:02 -0700)]
Merge tag 'nios2-v5.4-rc1' of git://git./linux/kernel/git/lftan/nios2
Pull nios2 fix from Ley Foon Tan:
"Make sure the command line buffer is NUL-terminated"
* tag 'nios2-v5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2:
nios2: force the string buffer NULL-terminated
Linus Torvalds [Fri, 27 Sep 2019 19:44:26 +0000 (12:44 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull more KVM updates from Paolo Bonzini:
"x86 KVM changes:
- The usual accuracy improvements for nested virtualization
- The usual round of code cleanups from Sean
- Added back optimizations that were prematurely removed in 5.2 (the
bare minimum needed to fix the regression was in 5.3-rc8, here
comes the rest)
- Support for UMWAIT/UMONITOR/TPAUSE
- Direct L2->L0 TLB flushing when L0 is Hyper-V and L1 is KVM
- Tell Windows guests if SMT is disabled on the host
- More accurate detection of vmexit cost
- Revert a pvqspinlock pessimization"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (56 commits)
KVM: nVMX: cleanup and fix host 64-bit mode checks
KVM: vmx: fix build warnings in hv_enable_direct_tlbflush() on i386
KVM: x86: Don't check kvm_rebooting in __kvm_handle_fault_on_reboot()
KVM: x86: Drop ____kvm_handle_fault_on_reboot()
KVM: VMX: Add error handling to VMREAD helper
KVM: VMX: Optimize VMX instruction error and fault handling
KVM: x86: Check kvm_rebooting in kvm_spurious_fault()
KVM: selftests: fix ucall on x86
Revert "locking/pvqspinlock: Don't wait if vCPU is preempted"
kvm: nvmx: limit atomic switch MSRs
kvm: svm: Intercept RDPRU
kvm: x86: Add "significant index" flag to a few CPUID leaves
KVM: x86/mmu: Skip invalid pages during zapping iff root_count is zero
KVM: x86/mmu: Explicitly track only a single invalid mmu generation
KVM: x86/mmu: Revert "KVM: x86/mmu: Remove is_obsolete() call"
KVM: x86/mmu: Revert "Revert "KVM: MMU: reclaim the zapped-obsolete page first""
KVM: x86/mmu: Revert "Revert "KVM: MMU: collapse TLB flushes when zap all pages""
KVM: x86/mmu: Revert "Revert "KVM: MMU: zap pages in batch""
KVM: x86/mmu: Revert "Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages""
KVM: x86/mmu: Revert "Revert "KVM: MMU: show mmu_valid_gen in shadow page related tracepoints""
...
Linus Torvalds [Fri, 27 Sep 2019 19:19:47 +0000 (12:19 -0700)]
Merge tag 'pwm/for-5.4-rc1' of git://git./linux/kernel/git/thierry.reding/linux-pwm
Pull pwm updates from Thierry Reding:
"Besides one new driver being added for the PWM controller found in
various Spreadtrum SoCs, this series of changes brings a slew of,
mostly minor, fixes and cleanups for existing drivers, as well as some
enhancements to the core code.
Lastly, Uwe is added to the PWM subsystem entry of the MAINTAINERS
file, making official his role as a reviewer"
* tag 'pwm/for-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm: (34 commits)
MAINTAINERS: Add myself as reviewer for the PWM subsystem
MAINTAINERS: Add patchwork link for PWM entry
MAINTAINERS: Add a selection of PWM related keywords to the PWM entry
pwm: mediatek: Add MT7629 compatible string
dt-bindings: pwm: Update bindings for MT7629 SoC
pwm: mediatek: Update license and switch to SPDX tag
pwm: mediatek: Use pwm_mediatek as common prefix
pwm: mediatek: Allocate the clks array dynamically
pwm: mediatek: Remove the has_clks field
pwm: mediatek: Drop the check for of_device_get_match_data()
pwm: atmel: Consolidate driver data initialization
pwm: atmel: Remove unneeded check for match data
pwm: atmel: Remove platform_device_id and use only dt bindings
pwm: stm32-lp: Add check in case requested period cannot be achieved
pwm: Ensure pwm_apply_state() doesn't modify the state argument
pwm: fsl-ftm: Don't update the state for the caller of pwm_apply_state()
pwm: sun4i: Don't update the state for the caller of pwm_apply_state()
pwm: rockchip: Don't update the state for the caller of pwm_apply_state()
pwm: Let pwm_get_state() return the last implemented state
pwm: Introduce local struct pwm_chip in pwm_apply_state()
...
Linus Torvalds [Fri, 27 Sep 2019 19:08:24 +0000 (12:08 -0700)]
Merge tag 'for-5.4/io_uring-2019-09-27' of git://git.kernel.dk/linux-block
Pull more io_uring updates from Jens Axboe:
"Just two things in here:
- Improvement to the io_uring CQ ring wakeup for batched IO (me)
- Fix wrong comparison in poll handling (yangerkun)
I realize the first one is a little late in the game, but it felt
pointless to hold it off until the next release. Went through various
testing and reviews with Pavel and peterz"
* tag 'for-5.4/io_uring-2019-09-27' of git://git.kernel.dk/linux-block:
io_uring: make CQ ring wakeups be more efficient
io_uring: compare cached_cq_tail with cq.head in_io_uring_poll
Colin Ian King [Fri, 27 Sep 2019 09:40:39 +0000 (10:40 +0100)]
net: tap: clean up an indentation issue
There is a statement that is indented too deeply, remove
the extraneous tab.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 27 Sep 2019 18:58:03 +0000 (11:58 -0700)]
Merge tag 'for-linus-2019-09-27' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A few fixes/changes to round off this merge window. This contains:
- Small series making some functional tweaks to blk-iocost (Tejun)
- Elevator switch locking fix (Ming)
- Kill redundant call in blk-wbt (Yufen)
- Fix flush timeout handling (Yufen)"
* tag 'for-linus-2019-09-27' of git://git.kernel.dk/linux-block:
block: fix null pointer dereference in blk_mq_rq_timed_out()
rq-qos: get rid of redundant wbt_update_limits()
iocost: bump up default latency targets for hard disks
iocost: improve nr_lagging handling
iocost: better trace vrate changes
block: don't release queue's sysfs lock during switching elevator
blk-mq: move lockdep_assert_held() into elevator_exit
Navid Emamdoost [Fri, 27 Sep 2019 01:51:46 +0000 (20:51 -0500)]
nfp: abm: fix memory leak in nfp_abm_u32_knode_replace
In nfp_abm_u32_knode_replace if the allocation for match fails it should
go to the error handling instead of returning. Updated other gotos to
have correct errno returned, too.
Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 26 Sep 2019 22:42:51 +0000 (15:42 -0700)]
tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
Yuchung Cheng and Marek Majkowski independently reported a weird
behavior of TCP_USER_TIMEOUT option when used at connect() time.
When the TCP_USER_TIMEOUT is reached, tcp_write_timeout()
believes the flow should live, and the following condition
in tcp_clamp_rto_to_user_timeout() programs one jiffie timers :
remaining = icsk->icsk_user_timeout - elapsed;
if (remaining <= 0)
return 1; /* user timeout has passed; fire ASAP */
This silly situation ends when the max syn rtx count is reached.
This patch makes sure we honor both TCP_SYNCNT and TCP_USER_TIMEOUT,
avoiding these spurious SYN packets.
Fixes:
b701a99e431d ("tcp: Add tcp_clamp_rto_to_user_timeout() helper to improve accuracy")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yuchung Cheng <ycheng@google.com>
Reported-by: Marek Majkowski <marek@cloudflare.com>
Cc: Jon Maxwell <jmaxwell37@gmail.com>
Link: https://marc.info/?l=linux-netdev&m=156940118307949&w=2
Acked-by: Jon Maxwell <jmaxwell37@gmail.com>
Tested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Marek Majkowski <marek@cloudflare.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Thu, 26 Sep 2019 18:37:05 +0000 (20:37 +0200)]
sk_buff: drop all skb extensions on free and skb scrubbing
Now that we have a 3rd extension, add a new helper that drops the
extension space and use it when we need to scrub an sk_buff.
At this time, scrubbing clears secpath and bridge netfilter data, but
retains the tc skb extension, after this patch all three get cleared.
NAPI reuse/free assumes we can only have a secpath attached to skb, but
it seems better to clear all extensions there as well.
v2: add unlikely hint (Eric Dumazet)
Fixes:
95a7233c452a ("net: openvswitch: Set OvS recirc_id from tc chain index")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kevin(Yudong) Yang [Thu, 26 Sep 2019 14:30:05 +0000 (10:30 -0400)]
tcp_bbr: fix quantization code to not raise cwnd if not probing bandwidth
There was a bug in the previous logic that attempted to ensure gain cycling
gets inflight above BDP even for small BDPs. This code correctly raised and
lowered target inflight values during the gain cycle. And this code
correctly ensured that cwnd was raised when probing bandwidth. However, it
did not correspondingly ensure that cwnd was *not* raised in this way when
*not* probing for bandwidth. The result was that small-BDP flows that were
always cwnd-bound could go for many cycles with a fixed cwnd, and not probe
or yield bandwidth at all. This meant that multiple small-BDP flows could
fail to converge in their bandwidth allocations.
Fixes:
3c346b233c68 ("tcp_bbr: fix bw probing to raise in-flight data for very small BDPs")
Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 27 Sep 2019 18:35:13 +0000 (11:35 -0700)]
Merge branch 'for-5.4' of git://git./linux/kernel/git/rzhang/linux
Pull thermal management updates from Zhang Rui:
- Add Amit Kucheria as thermal subsystem Reviewer (Amit Kucheria)
- Fix a use after free bug when unregistering thermal zone devices (Ido
Schimmel)
- Fix thermal core framework to use put_device() when device_register()
fails (Yue Hu)
- Enable intel_pch_thermal and MMIO RAPL support for Intel Icelake
platform (Srinivas Pandruvada)
- Add clock operations in qorip thermal driver, for some platforms with
clock control like i.MX8MQ (Anson Huang)
- A couple of trivial fixes and cleanups for thermal core and different
soc thermal drivers (Amit Kucheria, Christophe JAILLET, Chuhong Yuan,
Fuqian Huang, Kelsey Skunberg, Nathan Huckleberry, Rishi Gupta,
Srinivas Kandagatla)
* 'for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux:
MAINTAINERS: Add Amit Kucheria as reviewer for thermal
thermal: Add some error messages
thermal: Fix use-after-free when unregistering thermal zone device
thermal/drivers/core: Use put_device() if device_register() fails
thermal_hwmon: Sanitize thermal_zone type
thermal: intel: Use dev_get_drvdata
thermal: intel: int3403: replace printk(KERN_WARN...) with pr_warn(...)
thermal: intel: int340x_thermal: Remove unnecessary acpi_has_method() uses
thermal: int340x: processor_thermal: Add Ice Lake support
drivers: thermal: qcom: tsens: Fix memory leak from qfprom read
thermal: tegra: Fix a typo
thermal: rcar_gen3_thermal: Replace devm_add_action() followed by failure action with devm_add_action_or_reset()
thermal: armada: Fix -Wshift-negative-value
dt-bindings: thermal: qoriq: Add optional clocks property
thermal: qoriq: Use __maybe_unused instead of #if CONFIG_PM_SLEEP
thermal: qoriq: Use devm_platform_ioremap_resource() instead of of_iomap()
thermal: qoriq: Fix error path of calling qoriq_tmu_register_tmu_zone fail
thermal: qoriq: Add clock operations
drivers: thermal: processor_thermal_device: Export sysfs interface for TCC offset
David S. Miller [Fri, 27 Sep 2019 18:33:19 +0000 (20:33 +0200)]
Merge branch 'mlxsw-Various-fixes'
Ido Schimmel says:
====================
mlxsw: Various fixes
This patchset includes two small fixes for the mlxsw driver and one
patch which clarifies recently introduced devlink-trap documentation.
Patch #1 clears the port's VLAN filters during port initialization. This
ensures that the drop reason reported to the user is consistent. The
problem is explained in detail in the commit message.
Patch #2 clarifies the description of one of the traps exposed via
devlink-trap.
Patch #3 from Danielle forbids the installation of a tc filter with
multiple mirror actions since this is not supported by the device. The
failure is communicated to the user via extack.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Danielle Ratson [Thu, 26 Sep 2019 11:43:40 +0000 (14:43 +0300)]
mlxsw: spectrum_flower: Fail in case user specifies multiple mirror actions
The ASIC can only mirror a packet to one port, but when user is trying
to set more than one mirror action, it doesn't fail.
Add a check if more than one mirror action was specified per rule and if so,
fail for not being supported.
Fixes:
d0d13c1858a11 ("mlxsw: spectrum_acl: Add support for mirror action")
Signed-off-by: Danielle Ratson <danieller@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Thu, 26 Sep 2019 11:43:39 +0000 (14:43 +0300)]
Documentation: Clarify trap's description
Alex noted that the below description might not be obvious to all users.
Clarify it by adding an example.
Fixes:
f3047ca01f12 ("Documentation: Add devlink-trap documentation")
Reported-by: Alex Kushnarov <alexanderk@mellanox.com>
Reviewed-by: Alex Kushnarov <alexanderk@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Thu, 26 Sep 2019 11:43:38 +0000 (14:43 +0300)]
mlxsw: spectrum: Clear VLAN filters during port initialization
When a port is created, its VLAN filters are not cleared by the
firmware. This causes tagged packets to be later dropped by the ingress
STP filters, which default to DISCARD state.
The above did not matter much until commit
b5ce611fd96e ("mlxsw:
spectrum: Add devlink-trap support") where we exposed the drop reason to
users.
Without this patch, the drop reason users will see is not consistent. If
a port is enslaved to a VLAN-aware bridge and a packet with an invalid
VLAN tries to ingress the bridge, it will be dropped due to ingress STP
filter. If the VLAN is later enabled and then disabled, the packet will
be dropped by the ingress VLAN filter despite the above being a
seemingly NOP operation.
Fix this by clearing all the VLAN filters during port initialization.
Adjust the test accordingly.
Fixes:
b5ce611fd96e ("mlxsw: spectrum: Add devlink-trap support")
Reported-by: Alex Kushnarov <alexanderk@mellanox.com>
Tested-by: Alex Kushnarov <alexanderk@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Thu, 26 Sep 2019 11:22:52 +0000 (12:22 +0100)]
net: ena: clean up indentation issue
There memset is indented incorrectly, remove the extraneous tabs.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Thu, 26 Sep 2019 11:13:06 +0000 (12:13 +0100)]
NFC: st95hf: clean up indentation issue
The return statement is indented incorrectly, add in a missing
tab and remove an extraneous space after the return
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hans Andersson [Thu, 26 Sep 2019 07:54:37 +0000 (09:54 +0200)]
net: phy: micrel: add Asym Pause workaround for KSZ9021
The Micrel KSZ9031 PHY may fail to establish a link when the Asymmetric
Pause capability is set. This issue is described in a Silicon Errata
(DS80000691D or DS80000692D), which advises to always disable the
capability.
Micrel KSZ9021 has no errata, but has the same issue with Asymmetric Pause.
This patch apply the same workaround as the one for KSZ9031.
Fixes:
3aed3e2a143c ("net: phy: micrel: add Asym Pause workaround")
Signed-off-by: Hans Andersson <hans.andersson@cellavision.se>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kunihiko Hayashi [Thu, 26 Sep 2019 06:35:10 +0000 (15:35 +0900)]
net: socionext: ave: Avoid using netdev_err() before calling register_netdev()
Until calling register_netdev(), ndev->dev_name isn't specified, and
netdev_err() displays "(unnamed net_device)".
ave
65000000.ethernet (unnamed net_device) (uninitialized): invalid phy-mode setting
ave: probe of
65000000.ethernet failed with error -22
This replaces netdev_err() with dev_err() before calling register_netdev().
Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jacob Keller [Thu, 26 Sep 2019 02:28:19 +0000 (19:28 -0700)]
ptp: correctly disable flags on old ioctls
Commit
415606588c61 ("PTP: introduce new versions of IOCTLs",
2019-09-13) introduced new versions of the PTP ioctls which actually
validate that the flags are acceptable values.
As part of this, it cleared the flags value using a bitwise
and+negation, in an attempt to prevent the old ioctl from accidentally
enabling new features.
This is incorrect for a couple of reasons. First, it results in
accidentally preventing previously working flags on the request ioctl.
By clearing the "valid" flags, we now no longer allow setting the
enable, rising edge, or falling edge flags.
Second, if we add new additional flags in the future, they must not be
set by the old ioctl. (Since the flag wasn't checked before, we could
potentially break userspace programs which sent garbage flag data.
The correct way to resolve this is to check for and clear all but the
originally valid flags.
Create defines indicating which flags are correctly checked and
interpreted by the original ioctls. Use these to clear any bits which
will not be correctly interpreted by the original ioctls.
In the future, new flags must be added to the VALID_FLAGS macros, but
*not* to the V1_VALID_FLAGS macros. In this way, new features may be
exposed over the v2 ioctls, but without breaking previous userspace
which happened to not clear the flags value properly. The old ioctl will
continue to behave the same way, while the new ioctl gains the benefit
of using the flags fields.
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Felipe Balbi <felipe.balbi@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Christopher Hall <christopher.s.hall@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>