platform/kernel/linux-rpi.git
6 years agonet: hns3: Remove tx ring BD len register in hns3_enet
Yunsheng Lin [Tue, 14 Aug 2018 16:13:17 +0000 (17:13 +0100)]
net: hns3: Remove tx ring BD len register in hns3_enet

There is no HNS3_RING_TX_RING_BD_LEN_REG register according
to UM, so this patch removes it.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: hns3: Fix desc num set to default when setting channel
Yunsheng Lin [Tue, 14 Aug 2018 16:13:16 +0000 (17:13 +0100)]
net: hns3: Fix desc num set to default when setting channel

When user set the channel num using "ethtool -L ethX", the desc num
of BD will set to default value, which will cause desc num set by
user lost problem.

This patch fixes it by restoring the desc num set by user when setting
channel num.

Fixes: 09f2af6405b8 ("net: hns3: add support to modify tqps number")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: hns3: Fix for phy link issue when using marvell phy driver
Jian Shen [Tue, 14 Aug 2018 16:13:15 +0000 (17:13 +0100)]
net: hns3: Fix for phy link issue when using marvell phy driver

For marvell phy m88e1510, bit SUPPORTED_FIBRE of phydev->supported
is default on. Both phy_resume() and phy_suspend() will check the
SUPPORTED_FIBRE bit and write register of fibre page.

Currently in hns3 driver, the SUPPORTED_FIBRE bit will be cleared
after phy_connect_direct() finished. Because phy_resume() is called
in phy_connect_direct(), and phy_suspend() is called when disconnect
phy device, so the operation for fibre page register is not symmetrical.
It will cause phy link issue when reload hns3 driver.

This patch fixes it by disable the SUPPORTED_FIBRE before connecting
phy.

Fixes: 256727da7395 ("net: hns3: Add MDIO support to HNS3 Ethernet driver for hip08 SoC")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: hns3: Fix for information of phydev lost problem when down/up
Fuyun Liang [Tue, 14 Aug 2018 16:13:14 +0000 (17:13 +0100)]
net: hns3: Fix for information of phydev lost problem when down/up

Function call of phy_connect_direct will reinitialize phydev. Some
information like advertising will be lost. Phy_connect_direct only
needs to be called once. And driver can run well. This patch adds
some functions to ensure that phy_connect_direct is called only once
to solve the information of phydev lost problem occurring when we stop
the net and open it again.

Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: hns3: Fix for command format parsing error in hclge_is_all_function_id_zero
Xi Wang [Tue, 14 Aug 2018 16:13:13 +0000 (17:13 +0100)]
net: hns3: Fix for command format parsing error in hclge_is_all_function_id_zero

According to the functional specification of hardware, the first
descriptor of response from command 'lookup vlan talbe' is not valid.
Currently, the first descriptor is parsed as normal value, which will
cause an expected error.

This patch fixes this problem by skipping the first descriptor.

Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: hns3: Add support for serdes loopback selftest
Peng Li [Tue, 14 Aug 2018 16:13:12 +0000 (17:13 +0100)]
net: hns3: Add support for serdes loopback selftest

This patch adds support for serdes loopback selftest in hns3
driver.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agobnxt_en: take coredump_record structure off stack
Arnd Bergmann [Mon, 13 Aug 2018 22:12:45 +0000 (00:12 +0200)]
bnxt_en: take coredump_record structure off stack

The bnxt_coredump_record structure is very long, causing a warning
about possible stack overflow on 32-bit architectures:

drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c: In function 'bnxt_get_coredump':
drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c:2989:1: error: the frame size of 1188 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]

I could not see any reason to operate on an on-stack copy of the
structure before copying it back into the caller-provided buffer, which
also simplifies the code here.

Fixes: 6c5657d085ae ("bnxt_en: Add support for ethtool get dump.")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: systemport: fix unused function warning
Arnd Bergmann [Mon, 13 Aug 2018 22:10:34 +0000 (00:10 +0200)]
net: systemport: fix unused function warning

The only remaining caller of this function is inside of an #ifdef
after another caller got removed. This causes a harmless warning
in some configurations:

drivers/net/ethernet/broadcom/bcmsysport.c:1068:13: error: 'bcm_sysport_resume_from_wol' defined but not used [-Werror=unused-function]

Removing the #ifdef around the PM functions simplifies the code
and avoids the problem but letting the compiler drop the unused
functions silently.

Fixes: 9e85e22713d6 ("net: systemport: Do not re-configure upon WoL interrupt")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: stmmac: mark PM functions as __maybe_unused
Arnd Bergmann [Mon, 13 Aug 2018 21:50:41 +0000 (23:50 +0200)]
net: stmmac: mark PM functions as __maybe_unused

The newly added suspend/resume functions cause a build warning
when CONFIG_PM is disabled:

drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c:324:12: error: 'stmmac_pci_resume' defined but not used [-Werror=unused-function]
drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c:306:12: error: 'stmmac_pci_suspend' defined but not used [-Werror=unused-function]

Mark them as __maybe_unused so gcc can drop them silently.

Fixes: b7d0f08e9129 ("net: stmmac: Fix WoL for PCI-based setups")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agol2tp: fix unused function warning
Arnd Bergmann [Mon, 13 Aug 2018 21:43:05 +0000 (23:43 +0200)]
l2tp: fix unused function warning

Removing one of the callers of pppol2tp_session_get_sock caused a harmless
warning in some configurations:

net/l2tp/l2tp_ppp.c:142:21: 'pppol2tp_session_get_sock' defined but not used [-Wunused-function]

Rather than adding another #ifdef here, using a proper IS_ENABLED()
check makes the code more readable and avoids those warnings while
letting the compiler figure out for itself which code is needed.

This adds one pointer for the unused show() callback in struct
l2tp_session, but that seems harmless.

Fixes: b0e29063dcb3 ("l2tp: remove pppol2tp_session_ioctl()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agobnxt_en: avoid string overflow for record->system_name
Arnd Bergmann [Mon, 13 Aug 2018 21:26:54 +0000 (23:26 +0200)]
bnxt_en: avoid string overflow for record->system_name

The utsname()->nodename string may be 64 bytes long, and it gets
copied without the trailing nul byte into the shorter record->system_name,
as gcc now warns:

In file included from include/linux/bitmap.h:9,
                 from include/linux/ethtool.h:16,
                 from drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c:13:
In function 'strncpy',
    inlined from 'bnxt_fill_coredump_record' at drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c:2863:2:
include/linux/string.h:254:9: error: '__builtin_strncpy' output truncated before terminating nul copying as many bytes from a string as its length [-Werror=stringop-truncation]

Using strlcpy() at least avoids overflowing the destination buffer
and adds proper nul-termination. It may still truncate long names
though, which probably can't be solved here.

Fixes: 6c5657d085ae ("bnxt_en: Add support for ethtool get dump.")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: lan743x: fix building without CONFIG_PTP_1588_CLOCK
Arnd Bergmann [Mon, 13 Aug 2018 21:19:22 +0000 (23:19 +0200)]
net: lan743x: fix building without CONFIG_PTP_1588_CLOCK

Building without CONFIG_PTP_1588_CLOCK results in multiple failures,
this was obviously not well tested:

drivers/net/ethernet/microchip/lan743x_ptp.c: In function 'lan743x_ptp_isr':
drivers/net/ethernet/microchip/lan743x_ptp.c:781:28: error: 'struct lan743x_ptp' has no member named 'ptp_clock'; did you mean 'tx_ts_lock'?
   ptp_schedule_worker(ptp->ptp_clock, 0);
                            ^~~~~~~~~
                            tx_ts_lock
drivers/net/ethernet/microchip/lan743x_ptp.c: In function 'lan743x_ptp_open':
drivers/net/ethernet/microchip/lan743x_ptp.c:879:6: error: unused variable 'ret' [-Werror=unused-variable]
  int ret = -ENODEV;
      ^~~
At top level:
drivers/net/ethernet/microchip/lan743x_ptp.c:63:13: error: 'lan743x_ptp_tx_ts_enqueue_ts' defined but not used [-Werror=unused-function]
 static void lan743x_ptp_tx_ts_enqueue_ts(struct lan743x_adapter *adapter,
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
drivers/net/ethernet/microchip/lan743x_ethtool.c: In function 'lan743x_ethtool_get_ts_info':
drivers/net/ethernet/microchip/lan743x_ethtool.c:558:19: error: 'struct lan743x_ptp' has no member named 'ptp_clock'; did you mean 'tx_ts_lock'?

Those #ifdef checks are hard to get right, replace them all with
IS_ENABLED() checks that leave the same code visible to the compiler
but let it optimize out the unused bits based on the configuration.

Fixes: 07624df1c9ef ("lan743x: lan743x: Add PTP support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: lan743x: select CRC16
Arnd Bergmann [Mon, 13 Aug 2018 21:19:21 +0000 (23:19 +0200)]
net: lan743x: select CRC16

lan743x now fails to build when CONFIG_CRC16 is disabled:

drivers/net/ethernet/microchip/lan743x_main.o: In function crc16'

Force it on like all other users do.

Fixes: 4d94282afd95 ("lan743x: Add power management support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'net_sched-Fix-two-tc_index-filter-init-issues'
David S. Miller [Tue, 14 Aug 2018 02:37:42 +0000 (19:37 -0700)]
Merge branch 'net_sched-Fix-two-tc_index-filter-init-issues'

Hangbin Liu says:

====================
net_sched: Fix two tc_index filter init issues

These two patches fix two tc_index filter init issues. The first one fixes
missing exts info in new filter, which will cause NULL pointer dereference
when delete tcindex filter. The second one fixes missing res info when create
new filter, which will make filter unbind failed.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet_sched: Fix missing res info when create new tc_index filter
Hangbin Liu [Mon, 13 Aug 2018 10:44:04 +0000 (18:44 +0800)]
net_sched: Fix missing res info when create new tc_index filter

Li Shuang reported the following warn:

[  733.484610] WARNING: CPU: 6 PID: 21123 at net/sched/sch_cbq.c:1418 cbq_destroy_class+0x5d/0x70 [sch_cbq]
[  733.495190] Modules linked in: sch_cbq cls_tcindex sch_dsmark rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat l
[  733.574155]  syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb ixgbe ahci libahci i2c_algo_bit libata i40e i2c_core dca mdio megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[  733.592500] CPU: 6 PID: 21123 Comm: tc Not tainted 4.18.0-rc8.latest+ #131
[  733.600169] Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.1.5 04/11/2016
[  733.608518] RIP: 0010:cbq_destroy_class+0x5d/0x70 [sch_cbq]
[  733.614734] Code: e7 d9 d2 48 8b 7b 48 e8 61 05 da d2 48 8d bb f8 00 00 00 e8 75 ae d5 d2 48 39 eb 74 0a 48 89 df 5b 5d e9 16 6c 94 d2 5b 5d c3 <0f> 0b eb b6 0f 1f 44 00 00 66 2e 0f 1f 84
[  733.635798] RSP: 0018:ffffbfbb066bb9d8 EFLAGS: 00010202
[  733.641627] RAX: 0000000000000001 RBX: ffff9cdd17392800 RCX: 000000008010000f
[  733.649588] RDX: ffff9cdd1df547e0 RSI: ffff9cdd17392800 RDI: ffff9cdd0f84c800
[  733.657547] RBP: ffff9cdd0f84c800 R08: 0000000000000001 R09: 0000000000000000
[  733.665508] R10: ffff9cdd0f84d000 R11: 0000000000000001 R12: 0000000000000001
[  733.673469] R13: 0000000000000000 R14: 0000000000000001 R15: ffff9cdd17392200
[  733.681430] FS:  00007f911890a740(0000) GS:ffff9cdd1f8c0000(0000) knlGS:0000000000000000
[  733.690456] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  733.696864] CR2: 0000000000b5544c CR3: 0000000859374002 CR4: 00000000001606e0
[  733.704826] Call Trace:
[  733.707554]  cbq_destroy+0xa1/0xd0 [sch_cbq]
[  733.712318]  qdisc_destroy+0x62/0x130
[  733.716401]  dsmark_destroy+0x2a/0x70 [sch_dsmark]
[  733.721745]  qdisc_destroy+0x62/0x130
[  733.725829]  qdisc_graft+0x3ba/0x470
[  733.729817]  tc_get_qdisc+0x2a6/0x2c0
[  733.733901]  ? cred_has_capability+0x7d/0x130
[  733.738761]  rtnetlink_rcv_msg+0x263/0x2d0
[  733.743330]  ? rtnl_calcit.isra.30+0x110/0x110
[  733.748287]  netlink_rcv_skb+0x4d/0x130
[  733.752576]  netlink_unicast+0x1a3/0x250
[  733.756949]  netlink_sendmsg+0x2ae/0x3a0
[  733.761324]  sock_sendmsg+0x36/0x40
[  733.765213]  ___sys_sendmsg+0x26f/0x2d0
[  733.769493]  ? handle_pte_fault+0x586/0xdf0
[  733.774158]  ? __handle_mm_fault+0x389/0x500
[  733.778919]  ? __sys_sendmsg+0x5e/0xa0
[  733.783099]  __sys_sendmsg+0x5e/0xa0
[  733.787087]  do_syscall_64+0x5b/0x180
[  733.791171]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  733.796805] RIP: 0033:0x7f9117f23f10
[  733.800791] Code: c3 48 8b 05 82 6f 2c 00 f7 db 64 89 18 48 83 cb ff eb dd 0f 1f 80 00 00 00 00 83 3d 8d d0 2c 00 00 75 10 b8 2e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8
[  733.821873] RSP: 002b:00007ffe96818398 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  733.830319] RAX: ffffffffffffffda RBX: 000000005b71244c RCX: 00007f9117f23f10
[  733.838280] RDX: 0000000000000000 RSI: 00007ffe968183e0 RDI: 0000000000000003
[  733.846241] RBP: 00007ffe968183e0 R08: 000000000000ffff R09: 0000000000000003
[  733.854202] R10: 00007ffe96817e20 R11: 0000000000000246 R12: 0000000000000000
[  733.862161] R13: 0000000000662ee0 R14: 0000000000000000 R15: 0000000000000000
[  733.870121] ---[ end trace 28edd4aad712ddca ]---

This is because we didn't update f->result.res when create new filter. Then in
tcindex_delete() -> tcf_unbind_filter(), we will failed to find out the res
and unbind filter, which will trigger the WARN_ON() in cbq_destroy_class().

Fix it by updating f->result.res when create new filter.

Fixes: 6e0565697a106 ("net_sched: fix another crash in cls_tcindex")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet_sched: fix NULL pointer dereference when delete tcindex filter
Hangbin Liu [Mon, 13 Aug 2018 10:44:03 +0000 (18:44 +0800)]
net_sched: fix NULL pointer dereference when delete tcindex filter

Li Shuang reported the following crash:

[   71.267724] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[   71.276456] PGD 800000085d9bd067 P4D 800000085d9bd067 PUD 859a0b067 PMD 0
[   71.284127] Oops: 0000 [#1] SMP PTI
[   71.288015] CPU: 12 PID: 2386 Comm: tc Not tainted 4.18.0-rc8.latest+ #131
[   71.295686] Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.1.5 04/11/2016
[   71.304037] RIP: 0010:tcindex_delete+0x72/0x280 [cls_tcindex]
[   71.310446] Code: 00 31 f6 48 87 75 20 48 85 f6 74 11 48 8b 47 18 48 8b 40 08 48 8b 40 50 e8 fb a6 f8 fc 48 85 db 0f 84 dc 00 00 00 48 8b 73 18 <8b> 56 04 48 8d 7e 04 85 d2 0f 84 7b 01 00
[   71.331517] RSP: 0018:ffffb45207b3f898 EFLAGS: 00010282
[   71.337345] RAX: ffff8ad3d72d6360 RBX: ffff8acc84393680 RCX: 000000000000002e
[   71.345306] RDX: ffff8ad3d72c8570 RSI: 0000000000000000 RDI: ffff8ad847a45800
[   71.353277] RBP: ffff8acc84393688 R08: ffff8ad3d72c8400 R09: 0000000000000000
[   71.361238] R10: ffff8ad3de786e00 R11: 0000000000000000 R12: ffffb45207b3f8c7
[   71.369199] R13: ffff8ad3d93bd2a0 R14: 000000000000002e R15: ffff8ad3d72c9600
[   71.377161] FS:  00007f9d3ec3e740(0000) GS:ffff8ad3df980000(0000) knlGS:0000000000000000
[   71.386188] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   71.392597] CR2: 0000000000000004 CR3: 0000000852f06003 CR4: 00000000001606e0
[   71.400558] Call Trace:
[   71.403299]  tcindex_destroy_element+0x25/0x40 [cls_tcindex]
[   71.409611]  tcindex_walk+0xbb/0x110 [cls_tcindex]
[   71.414953]  tcindex_destroy+0x44/0x90 [cls_tcindex]
[   71.420492]  ? tcindex_delete+0x280/0x280 [cls_tcindex]
[   71.426323]  tcf_proto_destroy+0x16/0x40
[   71.430696]  tcf_chain_flush+0x51/0x70
[   71.434876]  tcf_block_put_ext.part.30+0x8f/0x1b0
[   71.440122]  tcf_block_put+0x4d/0x70
[   71.444108]  cbq_destroy+0x4d/0xd0 [sch_cbq]
[   71.448869]  qdisc_destroy+0x62/0x130
[   71.452951]  dsmark_destroy+0x2a/0x70 [sch_dsmark]
[   71.458300]  qdisc_destroy+0x62/0x130
[   71.462373]  qdisc_graft+0x3ba/0x470
[   71.466359]  tc_get_qdisc+0x2a6/0x2c0
[   71.470443]  ? cred_has_capability+0x7d/0x130
[   71.475307]  rtnetlink_rcv_msg+0x263/0x2d0
[   71.479875]  ? rtnl_calcit.isra.30+0x110/0x110
[   71.484832]  netlink_rcv_skb+0x4d/0x130
[   71.489109]  netlink_unicast+0x1a3/0x250
[   71.493482]  netlink_sendmsg+0x2ae/0x3a0
[   71.497859]  sock_sendmsg+0x36/0x40
[   71.501748]  ___sys_sendmsg+0x26f/0x2d0
[   71.506029]  ? handle_pte_fault+0x586/0xdf0
[   71.510694]  ? __handle_mm_fault+0x389/0x500
[   71.515457]  ? __sys_sendmsg+0x5e/0xa0
[   71.519636]  __sys_sendmsg+0x5e/0xa0
[   71.523626]  do_syscall_64+0x5b/0x180
[   71.527711]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   71.533345] RIP: 0033:0x7f9d3e257f10
[   71.537331] Code: c3 48 8b 05 82 6f 2c 00 f7 db 64 89 18 48 83 cb ff eb dd 0f 1f 80 00 00 00 00 83 3d 8d d0 2c 00 00 75 10 b8 2e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8
[   71.558401] RSP: 002b:00007fff6f893398 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[   71.566848] RAX: ffffffffffffffda RBX: 000000005b71274d RCX: 00007f9d3e257f10
[   71.574810] RDX: 0000000000000000 RSI: 00007fff6f8933e0 RDI: 0000000000000003
[   71.582770] RBP: 00007fff6f8933e0 R08: 000000000000ffff R09: 0000000000000003
[   71.590729] R10: 00007fff6f892e20 R11: 0000000000000246 R12: 0000000000000000
[   71.598689] R13: 0000000000662ee0 R14: 0000000000000000 R15: 0000000000000000
[   71.606651] Modules linked in: sch_cbq cls_tcindex sch_dsmark xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_coni
[   71.685425]  libahci i2c_algo_bit i2c_core i40e libata dca mdio megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[   71.697075] CR2: 0000000000000004
[   71.700792] ---[ end trace f604eb1acacd978b ]---

Reproducer:
tc qdisc add dev lo handle 1:0 root dsmark indices 64 set_tc_index
tc filter add dev lo parent 1:0 protocol ip prio 1 tcindex mask 0xfc shift 2
tc qdisc add dev lo parent 1:0 handle 2:0 cbq bandwidth 10Mbit cell 8 avpkt 1000 mpu 64
tc class add dev lo parent 2:0 classid 2:1 cbq bandwidth 10Mbit rate 1500Kbit avpkt 1000 prio 1 bounded isolated allot 1514 weight 1 maxburst 10
tc filter add dev lo parent 2:0 protocol ip prio 1 handle 0x2e tcindex classid 2:1 pass_on
tc qdisc add dev lo parent 2:1 pfifo limit 5
tc qdisc del dev lo root

This is because in tcindex_set_parms, when there is no old_r, we set new
exts to cr.exts. And we didn't set it to filter when r == &new_filter_result.

Then in tcindex_delete() -> tcf_exts_get_net(), we will get NULL pointer
dereference as we didn't init exts.

Fix it by moving tcf_exts_change() after "if (old_r && old_r != r)" check.
Then we don't need "cr" as there is no errout after that.

Fixes: bf63ac73b3e13 ("net_sched: fix an oops in tcindex filter")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonfp: clean up return types in kdoc comments
Jakub Kicinski [Tue, 14 Aug 2018 01:31:05 +0000 (18:31 -0700)]
nfp: clean up return types in kdoc comments

Remove 'Return:' information from functions which no longer
return a value.  Also update name and return types of nfp_nffw_info
access functions.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge tag 'mlx5e-updates-2018-08-10' of git://git.kernel.org/pub/scm/linux/kernel...
David S. Miller [Mon, 13 Aug 2018 23:04:18 +0000 (16:04 -0700)]
Merge tag 'mlx5e-updates-2018-08-10' of git://git./linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5e-updates-2018-08-10

This series provides the following updates to mlx5e netdevice driver.

1) First 4 patches extends the support for ethtool rxnfc flow steering
   - Added ipv6 support
   - l4 proto ip field for both ip6 and ip4

2) Next 4 patches, reorganizing flow steering structures and declaration into
one header file, and add two Kconfig flags to allow disabling/enabling mlx5
netdevice rx flow steering at compile time:
CONFIG_MLX5_EN_ARFS for en_arfs.c
CONFIG_MLX5_EN_RXNFC for en_fs_ehtool.c

3) More kconfig flags dependencies
- vxlan.c depends on CONFIG_VXLAN
- clock.c depends on CONFIG_PTP_1588_CLOCK

4) Reorganize the Makefile
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet/mlx5: Improve argument name for add flow API
Eli Cohen [Sun, 29 Apr 2018 14:17:34 +0000 (09:17 -0500)]
net/mlx5: Improve argument name for add flow API

The last argument to mlx5_add_flow_rules passes the number of
destinations in the struct pointed to by the dest arg. Change the name
to better reflect this fact.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
6 years agonet/mlx5: Reorganize the makefile
Saeed Mahameed [Tue, 31 Jul 2018 21:44:00 +0000 (14:44 -0700)]
net/mlx5: Reorganize the makefile

Reorganize the Makefile and group files together according to their
functionality and importance.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
6 years agonet/mlx5e: clock.c depends on CONFIG_PTP_1588_CLOCK
Moshe Shemesh [Sun, 29 Jul 2018 10:29:45 +0000 (13:29 +0300)]
net/mlx5e: clock.c depends on CONFIG_PTP_1588_CLOCK

lib/clock.c includes clock related functions which require ptp support.
Thus compile out lib/clock.c and add the needed function stubs in case
kconfig CONFIG_PTP_1588_CLOCK is off.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
6 years agonet/mlx5e: vxlan.c depends on CONFIG_VXLAN
Saeed Mahameed [Fri, 13 Jul 2018 17:58:49 +0000 (10:58 -0700)]
net/mlx5e: vxlan.c depends on CONFIG_VXLAN

When vxlan is not enabled by kernel, no need to enable it in mlx5.
Compile out lib/vxlan.c if CONFIG_VXLAN is not selected.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
6 years agonet/mlx5e: Move flow steering declarations into en/fs.h
Saeed Mahameed [Thu, 12 Jul 2018 10:09:19 +0000 (03:09 -0700)]
net/mlx5e: Move flow steering declarations into en/fs.h

Move flow steering declarations and definitions into the dedicated
en/fs.h header file

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
6 years agonet/mlx5e: Add CONFIG_MLX5_EN_ARFS for accelerated flow steering support
Saeed Mahameed [Thu, 12 Jul 2018 10:01:26 +0000 (03:01 -0700)]
net/mlx5e: Add CONFIG_MLX5_EN_ARFS for accelerated flow steering support

Add new mlx5 Kconfig flag to allow selecting accelerated flow steering
support, and compile out en_arfs.c if not selected.

Move arfs declarations and definitions to en/fs.h header file.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
6 years agonet/mlx5e: Add CONFIG_MLX5_EN_RXNFC for ethtool rx nfc
Saeed Mahameed [Wed, 11 Jul 2018 19:02:42 +0000 (12:02 -0700)]
net/mlx5e: Add CONFIG_MLX5_EN_RXNFC for ethtool rx nfc

Add new mlx5 Kconfig flag to allow selecting ethtool rx nfc support,
and compile out en_fs_ehtool.c if not selected.

Add en/fs.h header file to host all steering declarations and
definitions.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
6 years agonet/mlx5e: Ethtool steering, move ethtool callbacks
Saeed Mahameed [Wed, 11 Jul 2018 00:04:49 +0000 (17:04 -0700)]
net/mlx5e: Ethtool steering, move ethtool callbacks

Move ethool rxnfc callback into en_fs_etthool file where they belong.
This will allow us to make many ethtool fs related helper functions
static.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
6 years agonet/mlx5e: Ethtool steering, l4 proto support
Saeed Mahameed [Sun, 8 Jul 2018 07:24:21 +0000 (00:24 -0700)]
net/mlx5e: Ethtool steering, l4 proto support

Add support for l4 proto ip field in ethtool flow steering.

Example: Redirect icmpv6 to rx queue #2

ethtool -U eth0 flow-type ip6 l4proto 58 action 2

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
6 years agonet/mlx5e: Ethtool steering, ip6 support
Saeed Mahameed [Fri, 6 Jul 2018 00:11:09 +0000 (17:11 -0700)]
net/mlx5e: Ethtool steering, ip6 support

Add ip6 support for ethtool flow steering.

New supported flow types: ip6|tcp6|udp6|
Supported fields: src-ip|dst-ip|src-port|dst-port

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
6 years agonet/mlx5e: Ethtool steering flow parsing refactoring
Saeed Mahameed [Fri, 6 Jul 2018 00:31:14 +0000 (17:31 -0700)]
net/mlx5e: Ethtool steering flow parsing refactoring

Have a parsing function per flow type, that converts from ethtool rx flow
spec to mlx5 flow spec.

Will be useful to add support for ip6 ethtool flow steering in the
next patch.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
6 years agonet/mlx5e: Ethtool steering flow validation refactoring
Saeed Mahameed [Fri, 6 Jul 2018 00:30:34 +0000 (17:30 -0700)]
net/mlx5e: Ethtool steering flow validation refactoring

Have a ethtool rx flow spec validation helper function per flow type.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
6 years agonet: sched: act_ife: disable bh when taking ife_mod_lock
Vlad Buslov [Mon, 13 Aug 2018 17:20:11 +0000 (20:20 +0300)]
net: sched: act_ife: disable bh when taking ife_mod_lock

Lockdep reports deadlock for following locking scenario in ife action:

Task one:
1) Executes ife action update.
2) Takes tcfa_lock.
3) Waits on ife_mod_lock which is already taken by task two.

Task two:

1) Executes any path that obtains ife_mod_lock without disabling bh (any
path that takes ife_mod_lock while holding tcfa_lock has bh disabled) like
loading a meta module, or creating new action.
2) Takes ife_mod_lock.
3) Task is preempted by rate estimator timer.
4) Timer callback waits on tcfa_lock which is taken by task one.

In described case tasks deadlock because they take same two locks in
different order. To prevent potential deadlock reported by lockdep, always
disable bh when obtaining ife_mod_lock.

Lockdep warning:

[  508.101192] =====================================================
[  508.107708] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[  508.114728] 4.18.0-rc8+ #646 Not tainted
[  508.119050] -----------------------------------------------------
[  508.125559] tc/5460 [HC0[0]:SC0[2]:HE1:SE0] is trying to acquire:
[  508.132025] 000000005a938c68 (ife_mod_lock){++++}, at: find_ife_oplist+0x1e/0xc0 [act_ife]
[  508.140996]
               and this task is already holding:
[  508.147548] 00000000d46f6c56 (&(&p->tcfa_lock)->rlock){+.-.}, at: tcf_ife_init+0x6ae/0xf40 [act_ife]
[  508.157371] which would create a new lock dependency:
[  508.162828]  (&(&p->tcfa_lock)->rlock){+.-.} -> (ife_mod_lock){++++}
[  508.169572]
               but this new dependency connects a SOFTIRQ-irq-safe lock:
[  508.178197]  (&(&p->tcfa_lock)->rlock){+.-.}
[  508.178201]
               ... which became SOFTIRQ-irq-safe at:
[  508.189771]   _raw_spin_lock+0x2c/0x40
[  508.193906]   est_fetch_counters+0x41/0xb0
[  508.198391]   est_timer+0x83/0x3c0
[  508.202180]   call_timer_fn+0x16a/0x5d0
[  508.206400]   run_timer_softirq+0x399/0x920
[  508.210967]   __do_softirq+0x157/0x97d
[  508.215102]   irq_exit+0x152/0x1c0
[  508.218888]   smp_apic_timer_interrupt+0xc0/0x4e0
[  508.223976]   apic_timer_interrupt+0xf/0x20
[  508.228540]   cpuidle_enter_state+0xf8/0x5d0
[  508.233198]   do_idle+0x28a/0x350
[  508.236881]   cpu_startup_entry+0xc7/0xe0
[  508.241296]   start_secondary+0x2e8/0x3f0
[  508.245678]   secondary_startup_64+0xa5/0xb0
[  508.250347]
               to a SOFTIRQ-irq-unsafe lock:  (ife_mod_lock){++++}
[  508.256531]
               ... which became SOFTIRQ-irq-unsafe at:
[  508.267279] ...
[  508.267283]   _raw_write_lock+0x2c/0x40
[  508.273653]   register_ife_op+0x118/0x2c0 [act_ife]
[  508.278926]   do_one_initcall+0xf7/0x4d9
[  508.283214]   do_init_module+0x18b/0x44e
[  508.287521]   load_module+0x4167/0x5730
[  508.291739]   __do_sys_finit_module+0x16d/0x1a0
[  508.296654]   do_syscall_64+0x7a/0x3f0
[  508.300788]   entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  508.306302]
               other info that might help us debug this:

[  508.315286]  Possible interrupt unsafe locking scenario:

[  508.322771]        CPU0                    CPU1
[  508.327681]        ----                    ----
[  508.332604]   lock(ife_mod_lock);
[  508.336300]                                local_irq_disable();
[  508.342608]                                lock(&(&p->tcfa_lock)->rlock);
[  508.349793]                                lock(ife_mod_lock);
[  508.355990]   <Interrupt>
[  508.358974]     lock(&(&p->tcfa_lock)->rlock);
[  508.363803]
                *** DEADLOCK ***

[  508.370715] 2 locks held by tc/5460:
[  508.374680]  #0: 00000000e27e4fa4 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x583/0x7b0
[  508.383366]  #1: 00000000d46f6c56 (&(&p->tcfa_lock)->rlock){+.-.}, at: tcf_ife_init+0x6ae/0xf40 [act_ife]
[  508.393648]
               the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
[  508.403505] -> (&(&p->tcfa_lock)->rlock){+.-.} ops: 1001553 {
[  508.409646]    HARDIRQ-ON-W at:
[  508.413136]                     _raw_spin_lock_bh+0x34/0x40
[  508.419059]                     gnet_stats_start_copy_compat+0xa2/0x230
[  508.426021]                     gnet_stats_start_copy+0x16/0x20
[  508.432333]                     tcf_action_copy_stats+0x95/0x1d0
[  508.438735]                     tcf_action_dump_1+0xb0/0x4e0
[  508.444795]                     tcf_action_dump+0xca/0x200
[  508.450673]                     tcf_exts_dump+0xd9/0x320
[  508.456392]                     fl_dump+0x1b7/0x4a0 [cls_flower]
[  508.462798]                     tcf_fill_node+0x380/0x530
[  508.468601]                     tfilter_notify+0xdf/0x1c0
[  508.474404]                     tc_new_tfilter+0x84a/0xc90
[  508.480270]                     rtnetlink_rcv_msg+0x5bd/0x7b0
[  508.486419]                     netlink_rcv_skb+0x184/0x220
[  508.492394]                     netlink_unicast+0x31b/0x460
[  508.507411]                     netlink_sendmsg+0x3fb/0x840
[  508.513390]                     sock_sendmsg+0x7b/0xd0
[  508.518907]                     ___sys_sendmsg+0x4c6/0x610
[  508.524797]                     __sys_sendmsg+0xd7/0x150
[  508.530510]                     do_syscall_64+0x7a/0x3f0
[  508.536201]                     entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  508.543301]    IN-SOFTIRQ-W at:
[  508.546834]                     _raw_spin_lock+0x2c/0x40
[  508.552522]                     est_fetch_counters+0x41/0xb0
[  508.558571]                     est_timer+0x83/0x3c0
[  508.563912]                     call_timer_fn+0x16a/0x5d0
[  508.569699]                     run_timer_softirq+0x399/0x920
[  508.575840]                     __do_softirq+0x157/0x97d
[  508.581538]                     irq_exit+0x152/0x1c0
[  508.586882]                     smp_apic_timer_interrupt+0xc0/0x4e0
[  508.593533]                     apic_timer_interrupt+0xf/0x20
[  508.599686]                     cpuidle_enter_state+0xf8/0x5d0
[  508.605895]                     do_idle+0x28a/0x350
[  508.611147]                     cpu_startup_entry+0xc7/0xe0
[  508.617097]                     start_secondary+0x2e8/0x3f0
[  508.623029]                     secondary_startup_64+0xa5/0xb0
[  508.629245]    INITIAL USE at:
[  508.632686]                    _raw_spin_lock_bh+0x34/0x40
[  508.638557]                    gnet_stats_start_copy_compat+0xa2/0x230
[  508.645491]                    gnet_stats_start_copy+0x16/0x20
[  508.651719]                    tcf_action_copy_stats+0x95/0x1d0
[  508.657992]                    tcf_action_dump_1+0xb0/0x4e0
[  508.663937]                    tcf_action_dump+0xca/0x200
[  508.669716]                    tcf_exts_dump+0xd9/0x320
[  508.675337]                    fl_dump+0x1b7/0x4a0 [cls_flower]
[  508.681650]                    tcf_fill_node+0x380/0x530
[  508.687366]                    tfilter_notify+0xdf/0x1c0
[  508.693031]                    tc_new_tfilter+0x84a/0xc90
[  508.698820]                    rtnetlink_rcv_msg+0x5bd/0x7b0
[  508.704869]                    netlink_rcv_skb+0x184/0x220
[  508.710758]                    netlink_unicast+0x31b/0x460
[  508.716627]                    netlink_sendmsg+0x3fb/0x840
[  508.722510]                    sock_sendmsg+0x7b/0xd0
[  508.727931]                    ___sys_sendmsg+0x4c6/0x610
[  508.733729]                    __sys_sendmsg+0xd7/0x150
[  508.739346]                    do_syscall_64 +0x7a/0x3f0
[  508.744943]                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  508.751930]  }
[  508.753964]  ... key      at: [<ffffffff916b3e20>] __key.61145+0x0/0x40
[  508.760946]  ... acquired at:
[  508.764294]    _raw_read_lock+0x2f/0x40
[  508.768513]    find_ife_oplist+0x1e/0xc0 [act_ife]
[  508.773692]    tcf_ife_init+0x82f/0xf40 [act_ife]
[  508.778785]    tcf_action_init_1+0x510/0x750
[  508.783468]    tcf_action_init+0x1e8/0x340
[  508.787938]    tcf_action_add+0xc5/0x240
[  508.792241]    tc_ctl_action+0x203/0x2a0
[  508.796550]    rtnetlink_rcv_msg+0x5bd/0x7b0
[  508.801200]    netlink_rcv_skb+0x184/0x220
[  508.805674]    netlink_unicast+0x31b/0x460
[  508.810129]    netlink_sendmsg+0x3fb/0x840
[  508.814611]    sock_sendmsg+0x7b/0xd0
[  508.818665]    ___sys_sendmsg+0x4c6/0x610
[  508.823029]    __sys_sendmsg+0xd7/0x150
[  508.827246]    do_syscall_64+0x7a/0x3f0
[  508.831483]    entry_SYSCALL_64_after_hwframe+0x49/0xbe

               the dependencies between the lock to be acquired
[  508.838945]  and SOFTIRQ-irq-unsafe lock:
[  508.851177] -> (ife_mod_lock){++++} ops: 95 {
[  508.855920]    HARDIRQ-ON-W at:
[  508.859478]                     _raw_write_lock+0x2c/0x40
[  508.865264]                     register_ife_op+0x118/0x2c0 [act_ife]
[  508.872071]                     do_one_initcall+0xf7/0x4d9
[  508.877947]                     do_init_module+0x18b/0x44e
[  508.883819]                     load_module+0x4167/0x5730
[  508.889595]                     __do_sys_finit_module+0x16d/0x1a0
[  508.896043]                     do_syscall_64+0x7a/0x3f0
[  508.901734]                     entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  508.908827]    HARDIRQ-ON-R at:
[  508.912359]                     _raw_read_lock+0x2f/0x40
[  508.918043]                     find_ife_oplist+0x1e/0xc0 [act_ife]
[  508.924692]                     tcf_ife_init+0x82f/0xf40 [act_ife]
[  508.931252]                     tcf_action_init_1+0x510/0x750
[  508.937393]                     tcf_action_init+0x1e8/0x340
[  508.943366]                     tcf_action_add+0xc5/0x240
[  508.949130]                     tc_ctl_action+0x203/0x2a0
[  508.954922]                     rtnetlink_rcv_msg+0x5bd/0x7b0
[  508.961024]                     netlink_rcv_skb+0x184/0x220
[  508.966970]                     netlink_unicast+0x31b/0x460
[  508.972915]                     netlink_sendmsg+0x3fb/0x840
[  508.978859]                     sock_sendmsg+0x7b/0xd0
[  508.984400]                     ___sys_sendmsg+0x4c6/0x610
[  508.990264]                     __sys_sendmsg+0xd7/0x150
[  508.995952]                     do_syscall_64+0x7a/0x3f0
[  509.001643]                     entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  509.008722]    SOFTIRQ-ON-W at:\
[  509.012242]                     _raw_write_lock+0x2c/0x40
[  509.018013]                     register_ife_op+0x118/0x2c0 [act_ife]
[  509.024841]                     do_one_initcall+0xf7/0x4d9
[  509.030720]                     do_init_module+0x18b/0x44e
[  509.036604]                     load_module+0x4167/0x5730
[  509.042397]                     __do_sys_finit_module+0x16d/0x1a0
[  509.048865]                     do_syscall_64+0x7a/0x3f0
[  509.054551]                     entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  509.061636]    SOFTIRQ-ON-R at:
[  509.065145]                     _raw_read_lock+0x2f/0x40
[  509.070854]                     find_ife_oplist+0x1e/0xc0 [act_ife]
[  509.077515]                     tcf_ife_init+0x82f/0xf40 [act_ife]
[  509.084051]                     tcf_action_init_1+0x510/0x750
[  509.090172]                     tcf_action_init+0x1e8/0x340
[  509.096124]                     tcf_action_add+0xc5/0x240
[  509.101891]                     tc_ctl_action+0x203/0x2a0
[  509.107671]                     rtnetlink_rcv_msg+0x5bd/0x7b0
[  509.113811]                     netlink_rcv_skb+0x184/0x220
[  509.119768]                     netlink_unicast+0x31b/0x460
[  509.125716]                     netlink_sendmsg+0x3fb/0x840
[  509.131668]                     sock_sendmsg+0x7b/0xd0
[  509.137167]                     ___sys_sendmsg+0x4c6/0x610
[  509.143010]                     __sys_sendmsg+0xd7/0x150
[  509.148718]                     do_syscall_64+0x7a/0x3f0
[  509.154443]                     entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  509.161533]    INITIAL USE at:
[  509.164956]                    _raw_read_lock+0x2f/0x40
[  509.170574]                    find_ife_oplist+0x1e/0xc0 [act_ife]
[  509.177134]                    tcf_ife_init+0x82f/0xf40 [act_ife]
[  509.183619]                    tcf_action_init_1+0x510/0x750
[  509.189674]                    tcf_action_init+0x1e8/0x340
[  509.195534]                    tcf_action_add+0xc5/0x240
[  509.201229]                    tc_ctl_action+0x203/0x2a0
[  509.206920]                    rtnetlink_rcv_msg+0x5bd/0x7b0
[  509.212936]                    netlink_rcv_skb+0x184/0x220
[  509.218818]                    netlink_unicast+0x31b/0x460
[  509.224699]                    netlink_sendmsg+0x3fb/0x840
[  509.230581]                    sock_sendmsg+0x7b/0xd0
[  509.235984]                    ___sys_sendmsg+0x4c6/0x610
[  509.241791]                    __sys_sendmsg+0xd7/0x150
[  509.247425]                    do_syscall_64+0x7a/0x3f0
[  509.253007]                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  509.259975]  }
[  509.261998]  ... key      at: [<ffffffffc1554258>] ife_mod_lock+0x18/0xffffffffffff8dc0 [act_ife]
[  509.271569]  ... acquired at:
[  509.274912]    _raw_read_lock+0x2f/0x40
[  509.279134]    find_ife_oplist+0x1e/0xc0 [act_ife]
[  509.284324]    tcf_ife_init+0x82f/0xf40 [act_ife]
[  509.289425]    tcf_action_init_1+0x510/0x750
[  509.294068]    tcf_action_init+0x1e8/0x340
[  509.298553]    tcf_action_add+0xc5/0x240
[  509.302854]    tc_ctl_action+0x203/0x2a0
[  509.307153]    rtnetlink_rcv_msg+0x5bd/0x7b0
[  509.311805]    netlink_rcv_skb+0x184/0x220
[  509.316282]    netlink_unicast+0x31b/0x460
[  509.320769]    netlink_sendmsg+0x3fb/0x840
[  509.325248]    sock_sendmsg+0x7b/0xd0
[  509.329290]    ___sys_sendmsg+0x4c6/0x610
[  509.333687]    __sys_sendmsg+0xd7/0x150
[  509.337902]    do_syscall_64+0x7a/0x3f0
[  509.342116]    entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  509.349601]
               stack backtrace:
[  509.354663] CPU: 6 PID: 5460 Comm: tc Not tainted 4.18.0-rc8+ #646
[  509.361216] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017

Fixes: ef6980b6becb ("introduce IFE action")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetoot...
David S. Miller [Mon, 13 Aug 2018 18:38:54 +0000 (11:38 -0700)]
Merge branch 'for-upstream' of git://git./linux/kernel/git/bluetooth/bluetooth-next

Johan Hedberg says:

====================
pull request: bluetooth-next 2018-08-13

There was one pretty bad bug that slipped into the MediaTek HCI driver
in the last bluetooth-next pull request. Would it be possible to get
this one-liner fix pulled to net-next before you make your first 4.19
pull request for Linus? Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
David S. Miller [Mon, 13 Aug 2018 17:07:23 +0000 (10:07 -0700)]
Merge git://git./linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2018-08-13

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add driver XDP support for veth. This can be used in conjunction with
   redirect of another XDP program e.g. sitting on NIC so the xdp_frame
   can be forwarded to the peer veth directly without modification,
   from Toshiaki.

2) Add a new BPF map type REUSEPORT_SOCKARRAY and prog type SK_REUSEPORT
   in order to provide more control and visibility on where a SO_REUSEPORT
   sk should be located, and the latter enables to directly select a sk
   from the bpf map. This also enables map-in-map for application migration
   use cases, from Martin.

3) Add a new BPF helper bpf_skb_ancestor_cgroup_id() that returns the id
   of cgroup v2 that is the ancestor of the cgroup associated with the
   skb at the ancestor_level, from Andrey.

4) Implement BPF fs map pretty-print support based on BTF data for regular
   hash table and LRU map, from Yonghong.

5) Decouple the ability to attach BTF for a map from the key and value
   pretty-printer in BPF fs, and enable further support of BTF for maps for
   percpu and LPM trie, from Daniel.

6) Implement a better BPF sample of using XDP's CPU redirect feature for
   load balancing SKB processing to remote CPU. The sample implements the
   same XDP load balancing as Suricata does which is symmetric hash based
   on IP and L4 protocol, from Jesper.

7) Revert adding NULL pointer check with WARN_ON_ONCE() in __xdp_return()'s
   critical path as it is ensured that the allocator is present, from Björn.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge tag 'wireless-drivers-next-for-davem-2018-08-12' of git://git.kernel.org/pub...
David S. Miller [Mon, 13 Aug 2018 17:06:11 +0000 (10:06 -0700)]
Merge tag 'wireless-drivers-next-for-davem-2018-08-12' of git://git./linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
pull-request: wireless-drivers-next 2018-08-12

wireless-drivers-next patches for 4.19

Last set of new features for 4.19. Most notable is simplifying SSB
debugging code with two Kconfig option removals and fixing mt76 USB
build problems.

Major changes:

ath10k

* add debugfs file warm_hw_reset

wil6210

* add debugfs files tx_latency, link_stats and link_stats_global

* add 3-MSI support

* allow scan on AP interface

* support max aggregation window size 64

ssb

* remove CONFIG_SSB_SILENT and CONFIG_SSB_DEBUG Kconfig options

mt76

* fix build problems with recently added USB support
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoliquidio: remove set but not used variable 'is25G'
YueHaibing [Mon, 13 Aug 2018 09:29:36 +0000 (17:29 +0800)]
liquidio: remove set but not used variable 'is25G'

Fixes gcc '-Wunused-but-set-variable' warning:

drivers/net/ethernet/cavium/liquidio/lio_ethtool.c: In function 'lio_set_link_ksettings':
drivers/net/ethernet/cavium/liquidio/lio_ethtool.c:392:6: warning:
 variable 'is25G' set but not used [-Wunused-but-set-variable]

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agocxgb4: remove set but not used variable 'spd'
YueHaibing [Mon, 13 Aug 2018 06:59:02 +0000 (14:59 +0800)]
cxgb4: remove set but not used variable 'spd'

Fixes gcc '-Wunused-but-set-variable' warning:
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c: In function 'print_port_info':
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:5147:14: warning:
 variable 'spd' set but not used [-Wunused-but-set-variable]

variable 'spd' is set but not used since
commit 547fd27241a8 ("cxgb4: Warn if device doesn't have enough PCI bandwidth")

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agolan743x: lan743x: Remove duplicated include from lan743x_ptp.c
Yue Haibing [Mon, 13 Aug 2018 06:39:21 +0000 (06:39 +0000)]
lan743x: lan743x: Remove duplicated include from lan743x_ptp.c

Remove duplicated include.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agovirtio_net: remove duplicated include from virtio_net.c
YueHaibing [Mon, 13 Aug 2018 06:13:15 +0000 (14:13 +0800)]
virtio_net: remove duplicated include from virtio_net.c

Remove duplicated include linux/netdevice.h

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agopacket: switch kvzalloc to allocate memory
Li RongQing [Mon, 13 Aug 2018 02:42:46 +0000 (10:42 +0800)]
packet: switch kvzalloc to allocate memory

The patches includes following change:

*Use modern kvzalloc()/kvfree() instead of custom allocations.

*Remove order argument for alloc_pg_vec, it can get from req.

*Remove order argument for free_pg_vec, free_pg_vec now uses
kvfree which does not need order argument.

*Remove pg_vec_order from struct packet_ring_buffer, no longer
need to save/restore 'order'

*Remove variable 'order' for packet_set_ring, it is now unused

Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: Change the layout of structure trace_event_raw_fib_table_lookup
Zong Li [Mon, 13 Aug 2018 02:26:52 +0000 (10:26 +0800)]
net: Change the layout of structure trace_event_raw_fib_table_lookup

There is an unalignment access about the structure
'trace_event_raw_fib_table_lookup'.

In include/trace/events/fib.h, there is a memory operation which casting
the 'src' data member to a pointer, and then store a value to this
pointer point to.

p32 = (__be32 *) __entry->src;
*p32 = flp->saddr;

The offset of 'src' in structure trace_event_raw_fib_table_lookup is not
four bytes alignment. On some architectures, they don't permit the
unalignment access, it need to pay the price to handle this situation in
exception handler.

Adjust the layout of structure to avoid this case.

Fixes: 9f323973c915 ("net/ipv4: Udate fib_table_lookup tracepoint")
Signed-off-by: Zong Li <zong@andestech.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'net-sched-actions-rename-for-grep-ability-and-consistency'
David S. Miller [Mon, 13 Aug 2018 16:06:22 +0000 (09:06 -0700)]
Merge branch 'net-sched-actions-rename-for-grep-ability-and-consistency'

Jamal Hadi Salim says:

====================
net: sched: actions rename for grep-ability and consistency

Having a structure (example tcf_mirred) and a function with the same name is
not good for readability or grepability.

This long overdue patchset improves it and make sure there is consistency
across all actions
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_mirred method rename for grep-ability and consistency
Jamal Hadi Salim [Sun, 12 Aug 2018 13:35:01 +0000 (09:35 -0400)]
net: sched: act_mirred method rename for grep-ability and consistency

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_vlan method rename for grep-ability and consistency
Jamal Hadi Salim [Sun, 12 Aug 2018 13:35:00 +0000 (09:35 -0400)]
net: sched: act_vlan method rename for grep-ability and consistency

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_skbmod method rename for grep-ability and consistency
Jamal Hadi Salim [Sun, 12 Aug 2018 13:34:59 +0000 (09:34 -0400)]
net: sched: act_skbmod method rename for grep-ability and consistency

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_skbedit method rename for grep-ability and consistency
Jamal Hadi Salim [Sun, 12 Aug 2018 13:34:58 +0000 (09:34 -0400)]
net: sched: act_skbedit method rename for grep-ability and consistency

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_simple method rename for grep-ability and consistency
Jamal Hadi Salim [Sun, 12 Aug 2018 13:34:57 +0000 (09:34 -0400)]
net: sched: act_simple method rename for grep-ability and consistency

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_police method rename for grep-ability and consistency
Jamal Hadi Salim [Sun, 12 Aug 2018 13:34:56 +0000 (09:34 -0400)]
net: sched: act_police method rename for grep-ability and consistency

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_pedit method rename for grep-ability and consistency
Jamal Hadi Salim [Sun, 12 Aug 2018 13:34:55 +0000 (09:34 -0400)]
net: sched: act_pedit method rename for grep-ability and consistency

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_nat method rename for grep-ability and consistency
Jamal Hadi Salim [Sun, 12 Aug 2018 13:34:54 +0000 (09:34 -0400)]
net: sched: act_nat method rename for grep-ability and consistency

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_ipt method rename for grep-ability and consistency
Jamal Hadi Salim [Sun, 12 Aug 2018 13:34:53 +0000 (09:34 -0400)]
net: sched: act_ipt method rename for grep-ability and consistency

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_gact method rename for grep-ability and consistency
Jamal Hadi Salim [Sun, 12 Aug 2018 13:34:52 +0000 (09:34 -0400)]
net: sched: act_gact method rename for grep-ability and consistency

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_sum method rename for grep-ability and consistency
Jamal Hadi Salim [Sun, 12 Aug 2018 13:34:51 +0000 (09:34 -0400)]
net: sched: act_sum method rename for grep-ability and consistency

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_bpf method rename for grep-ability and consistency
Jamal Hadi Salim [Sun, 12 Aug 2018 13:34:50 +0000 (09:34 -0400)]
net: sched: act_bpf method rename for grep-ability and consistency

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_connmark method rename for grep-ability and consistency
Jamal Hadi Salim [Sun, 12 Aug 2018 13:34:49 +0000 (09:34 -0400)]
net: sched: act_connmark method rename for grep-ability and consistency

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agocpumask: make cpumask_next_wrap available without smp
Willem de Bruijn [Sun, 12 Aug 2018 13:14:03 +0000 (09:14 -0400)]
cpumask: make cpumask_next_wrap available without smp

The kbuild robot shows build failure on machines without CONFIG_SMP:

  drivers/net/virtio_net.c:1916:10: error:
    implicit declaration of function 'cpumask_next_wrap'

cpumask_next_wrap is exported from lib/cpumask.o, which has

    lib-$(CONFIG_SMP) += cpumask.o

same as other functions, also define it as static inline in the
NR_CPUS==1 branch in include/linux/cpumask.h.

If wrap is true and next == start, return nr_cpumask_bits, or 1.
Else wrap across the range of valid cpus, here [0].

Fixes: 2ca653d607ce ("virtio_net: Stripe queue affinities across cores.")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Tested-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agor8169: don't use MSI-X on RTL8168g
Heiner Kallweit [Sun, 12 Aug 2018 11:26:26 +0000 (13:26 +0200)]
r8169: don't use MSI-X on RTL8168g

There have been two reports that network doesn't come back on resume
from suspend when using MSI-X. Both cases affect the same chip version
(RTL8168g - version 40), on different systems. Falling back to MSI
fixes the issue.
Even though we don't really have a proof yet that the network chip
version is to blame, let's disable MSI-X for this version.

Reported-by: Steve Dodd <steved424@gmail.com>
Reported-by: Lou Reed <gogen@disroot.org>
Tested-by: Steve Dodd <steved424@gmail.com>
Tested-by: Lou Reed <gogen@disroot.org>
Fixes: 6c6aa15fdea5 ("r8169: improve interrupt handling")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'nixge-Minor-cleanups'
David S. Miller [Mon, 13 Aug 2018 15:49:37 +0000 (08:49 -0700)]
Merge branch 'nixge-Minor-cleanups'

Moritz Fischer says:

====================
net: nixge: Minor cleanups

in preparation of my 64-bit support series, here's some
minor cleanup in preparation that gets rid of unneccesary
accesses to the descriptor application fields.

I've confirmed that the hardware does not access the fields
in all our configurations.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: nixge: Don't store skb in app4 field of descriptor
Moritz Fischer [Sat, 11 Aug 2018 01:19:41 +0000 (18:19 -0700)]
net: nixge: Don't store skb in app4 field of descriptor

Don't store skb in app4 field of descriptor since it is
not being used anywhere (including hardware).

Signed-off-by: Moritz Fischer <mdf@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: nixge: Do not zero application specific fields in desc
Moritz Fischer [Sat, 11 Aug 2018 01:19:40 +0000 (18:19 -0700)]
net: nixge: Do not zero application specific fields in desc

Do not zero application specific fields in DMA descriptors.
The hardware does ignore them, so should software.

Signed-off-by: Moritz Fischer <mdf@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agol2tp: use sk_dst_check() to avoid race on sk->sk_dst_cache
Wei Wang [Fri, 10 Aug 2018 18:14:56 +0000 (11:14 -0700)]
l2tp: use sk_dst_check() to avoid race on sk->sk_dst_cache

In l2tp code, if it is a L2TP_UDP_ENCAP tunnel, tunnel->sk points to a
UDP socket. User could call sendmsg() on both this tunnel and the UDP
socket itself concurrently. As l2tp_xmit_skb() holds socket lock and call
__sk_dst_check() to refresh sk->sk_dst_cache, while udpv6_sendmsg() is
lockless and call sk_dst_check() to refresh sk->sk_dst_cache, there
could be a race and cause the dst cache to be freed multiple times.
So we fix l2tp side code to always call sk_dst_check() to garantee
xchg() is called when refreshing sk->sk_dst_cache to avoid race
conditions.

Syzkaller reported stack trace:
BUG: KASAN: use-after-free in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: use-after-free in atomic_fetch_add_unless include/linux/atomic.h:575 [inline]
BUG: KASAN: use-after-free in atomic_add_unless include/linux/atomic.h:597 [inline]
BUG: KASAN: use-after-free in dst_hold_safe include/net/dst.h:308 [inline]
BUG: KASAN: use-after-free in ip6_hold_safe+0xe6/0x670 net/ipv6/route.c:1029
Read of size 4 at addr ffff8801aea9a880 by task syz-executor129/4829

CPU: 0 PID: 4829 Comm: syz-executor129 Not tainted 4.18.0-rc7-next-20180802+ #30
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x30d mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
 atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
 atomic_fetch_add_unless include/linux/atomic.h:575 [inline]
 atomic_add_unless include/linux/atomic.h:597 [inline]
 dst_hold_safe include/net/dst.h:308 [inline]
 ip6_hold_safe+0xe6/0x670 net/ipv6/route.c:1029
 rt6_get_pcpu_route net/ipv6/route.c:1249 [inline]
 ip6_pol_route+0x354/0xd20 net/ipv6/route.c:1922
 ip6_pol_route_output+0x54/0x70 net/ipv6/route.c:2098
 fib6_rule_lookup+0x283/0x890 net/ipv6/fib6_rules.c:122
 ip6_route_output_flags+0x2c5/0x350 net/ipv6/route.c:2126
 ip6_dst_lookup_tail+0x1278/0x1da0 net/ipv6/ip6_output.c:978
 ip6_dst_lookup_flow+0xc8/0x270 net/ipv6/ip6_output.c:1079
 ip6_sk_dst_lookup_flow+0x5ed/0xc50 net/ipv6/ip6_output.c:1117
 udpv6_sendmsg+0x2163/0x36b0 net/ipv6/udp.c:1354
 inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
 sock_sendmsg_nosec net/socket.c:622 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:632
 ___sys_sendmsg+0x51d/0x930 net/socket.c:2115
 __sys_sendmmsg+0x240/0x6f0 net/socket.c:2210
 __do_sys_sendmmsg net/socket.c:2239 [inline]
 __se_sys_sendmmsg net/socket.c:2236 [inline]
 __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2236
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x446a29
Code: e8 ac b8 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f4de5532db8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 00000000006dcc38 RCX: 0000000000446a29
RDX: 00000000000000b8 RSI: 0000000020001b00 RDI: 0000000000000003
RBP: 00000000006dcc30 R08: 00007f4de5533700 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dcc3c
R13: 00007ffe2b830fdf R14: 00007f4de55339c0 R15: 0000000000000001

Fixes: 71b1391a4128 ("l2tp: ensure sk->dst is still valid")
Reported-by: syzbot+05f840f3b04f211bad55@syzkaller.appspotmail.com
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Guillaume Nault <g.nault@alphalink.fr>
Cc: David Ahern <dsahern@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoipv6: Add icmp_echo_ignore_all support for ICMPv6
Virgile Jarry [Fri, 10 Aug 2018 15:48:15 +0000 (17:48 +0200)]
ipv6: Add icmp_echo_ignore_all support for ICMPv6

Preventing the kernel from responding to ICMP Echo Requests messages
can be useful in several ways. The sysctl parameter
'icmp_echo_ignore_all' can be used to prevent the kernel from
responding to IPv4 ICMP echo requests. For IPv6 pings, such
a sysctl kernel parameter did not exist.

Add the ability to prevent the kernel from responding to IPv6
ICMP echo requests through the use of the following sysctl
parameter : /proc/sys/net/ipv6/icmp/echo_ignore_all.
Update the documentation to reflect this change.

Signed-off-by: Virgile Jarry <virgile@acceis.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'net-tls-Combined-memory-allocation-for-decryption-request'
David S. Miller [Mon, 13 Aug 2018 15:41:09 +0000 (08:41 -0700)]
Merge branch 'net-tls-Combined-memory-allocation-for-decryption-request'

Vakul Garg says:

====================
net/tls: Combined memory allocation for decryption request

This patch does a combined memory allocation from heap for scatterlists,
aead_request, aad and iv for the tls record decryption path. In present
code, aead_request is allocated from heap, scatterlists on a conditional
basis are allocated on heap or on stack. This is inefficient as it may
requires multiple kmalloc/kfree.

The initialization vector passed in cryption request is allocated on
stack. This is a problem since the stack memory is not dma-able from
crypto accelerators.

Doing one combined memory allocation for each decryption request fixes
both the above issues. It also paves a way to be able to submit multiple
async decryption requests while the previous one is pending i.e. being
processed or queued.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet/tls: Combined memory allocation for decryption request
Vakul Garg [Fri, 10 Aug 2018 15:16:41 +0000 (20:46 +0530)]
net/tls: Combined memory allocation for decryption request

For preparing decryption request, several memory chunks are required
(aead_req, sgin, sgout, iv, aad). For submitting the decrypt request to
an accelerator, it is required that the buffers which are read by the
accelerator must be dma-able and not come from stack. The buffers for
aad and iv can be separately kmalloced each, but it is inefficient.
This patch does a combined allocation for preparing decryption request
and then segments into aead_req || sgin || sgout || iv || aad.

Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoBluetooth: mediatek: pass correct size to h4_recv_buf()
Dan Carpenter [Mon, 13 Aug 2018 09:32:35 +0000 (12:32 +0300)]
Bluetooth: mediatek: pass correct size to h4_recv_buf()

We're supposed to pass the number of elements in the mtk_recv_pkts, not
the number of bytes.

Fixes: 7237c4c9ec92 ("Bluetooth: mediatek: Add protocol support for MediaTek serial devices")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
6 years agoMerge branch 'bpf-ancestor-cgroup-id'
Daniel Borkmann [Sun, 12 Aug 2018 23:02:40 +0000 (01:02 +0200)]
Merge branch 'bpf-ancestor-cgroup-id'

Andrey Ignatov says:

====================
This patch set adds new BPF helper bpf_skb_ancestor_cgroup_id that returns
id of cgroup v2 that is ancestor of cgroup associated with the skb at the
ancestor_level.

The helper is useful to implement policies in TC based on cgroups that are
upper in hierarchy than immediate cgroup associated with skb.

v1->v2:
- more reliable check for testing IPv6 to become ready in selftest.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
6 years agoselftests/bpf: Selftest for bpf_skb_ancestor_cgroup_id
Andrey Ignatov [Sun, 12 Aug 2018 17:49:30 +0000 (10:49 -0700)]
selftests/bpf: Selftest for bpf_skb_ancestor_cgroup_id

Add selftests for bpf_skb_ancestor_cgroup_id helper.

test_skb_cgroup_id.sh prepares testing interface and adds tc qdisc and
filter for it using BPF object compiled from test_skb_cgroup_id_kern.c
program.

BPF program in test_skb_cgroup_id_kern.c gets ancestor cgroup id using
the new helper at different levels of cgroup hierarchy that skb belongs
to, including root level and non-existing level, and saves it to the map
where the key is the level of corresponding cgroup and the value is its
id.

To trigger BPF program, user space program test_skb_cgroup_id_user is
run. It adds itself into testing cgroup and sends UDP datagram to
link-local multicast address of testing interface. Then it reads cgroup
ids saved in kernel for different levels from the BPF map and compares
them with those in user space. They must be equal for every level of
ancestry.

Example of run:
  # ./test_skb_cgroup_id.sh
  Wait for testing link-local IP to become available ... OK
  Note: 8 bytes struct bpf_elf_map fixup performed due to size mismatch!
  [PASS]

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
6 years agoselftests/bpf: Add cgroup id helpers to bpf_helpers.h
Andrey Ignatov [Sun, 12 Aug 2018 17:49:29 +0000 (10:49 -0700)]
selftests/bpf: Add cgroup id helpers to bpf_helpers.h

Add bpf_skb_cgroup_id and bpf_skb_ancestor_cgroup_id helpers to
bpf_helpers.h to use them in tests and samples.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
6 years agobpf: Sync bpf.h to tools/
Andrey Ignatov [Sun, 12 Aug 2018 17:49:28 +0000 (10:49 -0700)]
bpf: Sync bpf.h to tools/

Sync skb_ancestor_cgroup_id() related bpf UAPI changes to tools/.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
6 years agobpf: Introduce bpf_skb_ancestor_cgroup_id helper
Andrey Ignatov [Sun, 12 Aug 2018 17:49:27 +0000 (10:49 -0700)]
bpf: Introduce bpf_skb_ancestor_cgroup_id helper

== Problem description ==

It's useful to be able to identify cgroup associated with skb in TC so
that a policy can be applied to this skb, and existing bpf_skb_cgroup_id
helper can help with this.

Though in real life cgroup hierarchy and hierarchy to apply a policy to
don't map 1:1.

It's often the case that there is a container and corresponding cgroup,
but there are many more sub-cgroups inside container, e.g. because it's
delegated to containerized application to control resources for its
subsystems, or to separate application inside container from infra that
belongs to containerization system (e.g. sshd).

At the same time it may be useful to apply a policy to container as a
whole.

If multiple containers like this are run on a host (what is often the
case) and many of them have sub-cgroups, it may not be possible to apply
per-container policy in TC with existing helpers such as
bpf_skb_under_cgroup or bpf_skb_cgroup_id:

* bpf_skb_cgroup_id will return id of immediate cgroup associated with
  skb, i.e. if it's a sub-cgroup inside container, it can't be used to
  identify container's cgroup;

* bpf_skb_under_cgroup can work only with one cgroup and doesn't scale,
  i.e. if there are N containers on a host and a policy has to be
  applied to M of them (0 <= M <= N), it'd require M calls to
  bpf_skb_under_cgroup, and, if M changes, it'd require to rebuild &
  load new BPF program.

== Solution ==

The patch introduces new helper bpf_skb_ancestor_cgroup_id that can be
used to get id of cgroup v2 that is an ancestor of cgroup associated
with skb at specified level of cgroup hierarchy.

That way admin can place all containers on one level of cgroup hierarchy
(what is a good practice in general and already used in many
configurations) and identify specific cgroup on this level no matter
what sub-cgroup skb is associated with.

E.g. if there is a cgroup hierarchy:
  root/
  root/container1/
  root/container1/app11/
  root/container1/app11/sub-app-a/
  root/container1/app12/
  root/container2/
  root/container2/app21/
  root/container2/app22/
  root/container2/app22/sub-app-b/

, then having skb associated with root/container1/app11/sub-app-a/ it's
possible to get ancestor at level 1, what is container1 and apply policy
for this container, or apply another policy if it's container2.

Policies can be kept e.g. in a hash map where key is a container cgroup
id and value is an action.

Levels where container cgroups are created are usually known in advance
whether cgroup hierarchy inside container may be hard to predict
especially in case when its creation is delegated to containerized
application.

== Implementation details ==

The helper gets ancestor by walking parents up to specified level.

Another option would be to get different kind of "id" from
cgroup->ancestor_ids[level] and use it with idr_find() to get struct
cgroup for ancestor. But that would require radix lookup what doesn't
seem to be better (at least it's not obviously better).

Format of return value of the new helper is same as that of
bpf_skb_cgroup_id.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
6 years agobpf: decouple btf from seq bpf fs dump and enable more maps
Daniel Borkmann [Sat, 11 Aug 2018 23:59:17 +0000 (01:59 +0200)]
bpf: decouple btf from seq bpf fs dump and enable more maps

Commit a26ca7c982cb ("bpf: btf: Add pretty print support to
the basic arraymap") and 699c86d6ec21 ("bpf: btf: add pretty
print for hash/lru_hash maps") enabled support for BTF and
dumping via BPF fs for array and hash/lru map. However, both
can be decoupled from each other such that regular BPF maps
can be supported for attaching BTF key/value information,
while not all maps necessarily need to dump via map_seq_show_elem()
callback.

The basic sanity check which is a prerequisite for all maps
is that key/value size has to match in any case, and some maps
can have extra checks via map_check_btf() callback, e.g.
probing certain types or indicating no support in general. With
that we can also enable retrieving BTF info for per-cpu map
types and lpm.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
6 years agoMerge branch 'ip-faster-in-order-IP-fragments'
David S. Miller [Sun, 12 Aug 2018 00:54:18 +0000 (17:54 -0700)]
Merge branch 'ip-faster-in-order-IP-fragments'

Peter Oskolkov says:

====================
ip: faster in-order IP fragments

Added "Signed-off-by" in v2.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoip: process in-order fragments efficiently
Peter Oskolkov [Sat, 11 Aug 2018 20:27:25 +0000 (20:27 +0000)]
ip: process in-order fragments efficiently

This patch changes the runtime behavior of IP defrag queue:
incoming in-order fragments are added to the end of the current
list/"run" of in-order fragments at the tail.

On some workloads, UDP stream performance is substantially improved:

RX: ./udp_stream -F 10 -T 2 -l 60
TX: ./udp_stream -c -H <host> -F 10 -T 5 -l 60

with this patchset applied on a 10Gbps receiver:

  throughput=9524.18
  throughput_units=Mbit/s

upstream (net-next):

  throughput=4608.93
  throughput_units=Mbit/s

Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoip: add helpers to process in-order fragments faster.
Peter Oskolkov [Sat, 11 Aug 2018 20:27:24 +0000 (20:27 +0000)]
ip: add helpers to process in-order fragments faster.

This patch introduces several helper functions/macros that will be
used in the follow-up patch. No runtime changes yet.

The new logic (fully implemented in the second patch) is as follows:

* Nodes in the rb-tree will now contain not single fragments, but lists
  of consecutive fragments ("runs").

* At each point in time, the current "active" run at the tail is
  maintained/tracked. Fragments that arrive in-order, adjacent
  to the previous tail fragment, are added to this tail run without
  triggering the re-balancing of the rb-tree.

* If a fragment arrives out of order with the offset _before_ the tail run,
  it is inserted into the rb-tree as a single fragment.

* If a fragment arrives after the current tail fragment (with a gap),
  it starts a new "tail" run, as is inserted into the rb-tree
  at the end as the head of the new run.

skb->cb is used to store additional information
needed here (suggested by Eric Dumazet).

Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
David S. Miller [Sun, 12 Aug 2018 00:52:00 +0000 (17:52 -0700)]
Merge ra./pub/scm/linux/kernel/git/davem/net

6 years agoMerge branch 'Remove-rtnl-lock-dependency-from-all-action-implementations'
David S. Miller [Sat, 11 Aug 2018 19:37:10 +0000 (12:37 -0700)]
Merge branch 'Remove-rtnl-lock-dependency-from-all-action-implementations'

Vlad Buslov says:

====================
Remove rtnl lock dependency from all action implementations

Currently, all netlink protocol handlers for updating rules, actions and
qdiscs are protected with single global rtnl lock which removes any
possibility for parallelism. This patch set is a second step to remove
rtnl lock dependency from TC rules update path.

Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
Handlers registered with this flag are called without RTNL taken. End
goal is to have rule update handlers(RTM_NEWTFILTER, RTM_DELTFILTER,
etc.) to be registered with UNLOCKED flag to allow parallel execution.
However, there is no intention to completely remove or split rtnl lock
itself. This patch set addresses specific problems in implementation of
tc actions that prevent their control path from being executed
concurrently. Additional changes are required to refactor classifiers
API and individual classifiers for parallel execution. This patch set
lays groundwork to eventually register rule update handlers as
rtnl-unlocked.

Action API is already prepared for parallel execution with previous
patch set, which means that action ops that use action API for their
implementation do not require additional modifications. (delete, search,
etc.) Action API implements concurrency-safe reference counting and
guarantees that cleanup/delete is called only once, after last reference
to action is released.

The goal of this change is to update specific actions APIs that access
action private state directly, in order to be independent from external
locking. General approach is to re-use existing tcf_lock spinlock (used
by some action implementation to synchronize control path with data
path) to protect action private state from concurrent modification. If
action has rcu-protected pointer, tcf spinlock is used to protect its
update code, instead of relying on rtnl lock.

Some actions need to determine rtnl mutex status in order to release it.
For example, ife action can load additional kernel modules(meta ops) and
must make sure that no locks are held during module load. In such cases
'rtnl_held' argument is used to conditionally release rtnl mutex.

Changes from V1 to V2:
- Patch 12:
  - new patch
- Patch 14:
  - refactor gen_new_estimator() to reuse stats_lock when re-assigning
    rate estimator statistics pointer
- Remove mirred and tunnel_key helper function changes. (to be submitted
  and standalone patch)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_police: remove dependency on rtnl lock
Vlad Buslov [Fri, 10 Aug 2018 17:51:55 +0000 (20:51 +0300)]
net: sched: act_police: remove dependency on rtnl lock

Use tcf spinlock to protect police action private data from concurrent
modification during dump. (init already uses tcf spinlock when changing
police action state)

Pass tcf spinlock as estimator lock argument to gen_replace_estimator()
during action init.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: core: protect rate estimator statistics pointer with lock
Vlad Buslov [Fri, 10 Aug 2018 17:51:54 +0000 (20:51 +0300)]
net: core: protect rate estimator statistics pointer with lock

Extend gen_new_estimator() to also take stats_lock when re-assigning rate
estimator statistics pointer. (to be used by unlocked actions)

Rename 'stats_lock' to 'lock' and change argument description to explain
that it is now also used for control path.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_mirred: remove dependency on rtnl lock
Vlad Buslov [Fri, 10 Aug 2018 17:51:53 +0000 (20:51 +0300)]
net: sched: act_mirred: remove dependency on rtnl lock

Re-introduce mirred list spinlock, that was removed some time ago, in order
to protect it from concurrent modifications, instead of relying on rtnl
lock.

Use tcf spinlock to protect mirred action private data from concurrent
modification in init and dump. Rearrange access to mirred data in order to
be performed only while holding the lock.

Rearrange net dev access to always hold reference while working with it,
instead of relying on rntl lock.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: extend action ops with put_dev callback
Vlad Buslov [Fri, 10 Aug 2018 17:51:52 +0000 (20:51 +0300)]
net: sched: extend action ops with put_dev callback

As a preparation for removing dependency on rtnl lock from rules update
path, all users of shared objects must take reference while working with
them.

Extend action ops with put_dev() API to be used on net device returned by
get_dev().

Modify mirred action (only action that implements get_dev callback):
- Take reference to net device in get_dev.
- Implement put_dev API that releases reference to net device.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_vlan: remove dependency on rtnl lock
Vlad Buslov [Fri, 10 Aug 2018 17:51:51 +0000 (20:51 +0300)]
net: sched: act_vlan: remove dependency on rtnl lock

Use tcf spinlock to protect vlan action private data from concurrent
modification during dump and init. Use rcu swap operation to reassign
params pointer under protection of tcf lock. (old params value is not used
by init, so there is no need of standalone rcu dereference step)

Remove rtnl assertion that is no longer necessary.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_tunnel_key: remove dependency on rtnl lock
Vlad Buslov [Fri, 10 Aug 2018 17:51:50 +0000 (20:51 +0300)]
net: sched: act_tunnel_key: remove dependency on rtnl lock

Use tcf lock to protect tunnel key action struct private data from
concurrent modification in init and dump. Use rcu swap operation to
reassign params pointer under protection of tcf lock. (old params value is
not used by init, so there is no need of standalone rcu dereference step)

Remove rtnl lock assertion that is no longer required.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_skbmod: remove dependency on rtnl lock
Vlad Buslov [Fri, 10 Aug 2018 17:51:49 +0000 (20:51 +0300)]
net: sched: act_skbmod: remove dependency on rtnl lock

Move read of skbmod_p rcu pointer to be protected by tcf spinlock. Use tcf
spinlock to protect private skbmod data from concurrent modification during
dump.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_simple: remove dependency on rtnl lock
Vlad Buslov [Fri, 10 Aug 2018 17:51:48 +0000 (20:51 +0300)]
net: sched: act_simple: remove dependency on rtnl lock

Use tcf spinlock to protect private simple action data from concurrent
modification during dump. (simple init already uses tcf spinlock when
changing action state)

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_sample: remove dependency on rtnl lock
Vlad Buslov [Fri, 10 Aug 2018 17:51:47 +0000 (20:51 +0300)]
net: sched: act_sample: remove dependency on rtnl lock

Use tcf spinlock to protect private sample action data from concurrent
modification during dump and init.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_pedit: remove dependency on rtnl lock
Vlad Buslov [Fri, 10 Aug 2018 17:51:46 +0000 (20:51 +0300)]
net: sched: act_pedit: remove dependency on rtnl lock

Rearrange pedit init code to only access pedit action data while holding
tcf spinlock. Change keys allocation type to atomic to allow it to execute
while holding tcf spinlock. Take tcf spinlock in dump function when
accessing pedit action data.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_ipt: remove dependency on rtnl lock
Vlad Buslov [Fri, 10 Aug 2018 17:51:45 +0000 (20:51 +0300)]
net: sched: act_ipt: remove dependency on rtnl lock

Use tcf spinlock to protect ipt action private data from concurrent
modification during dump. Ipt init already takes tcf spinlock when
modifying ipt state.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_ife: remove dependency on rtnl lock
Vlad Buslov [Fri, 10 Aug 2018 17:51:44 +0000 (20:51 +0300)]
net: sched: act_ife: remove dependency on rtnl lock

Use tcf spinlock and rcu to protect params pointer from concurrent
modification during dump and init. Use rcu swap operation to reassign
params pointer under protection of tcf lock. (old params value is not used
by init, so there is no need of standalone rcu dereference step)

Ife action has meta-actions that are compiled as standalone modules. Rtnl
mutex must be released while loading a kernel module. In order to support
execution without rtnl mutex, propagate 'rtnl_held' argument to meta action
loading functions. When requesting meta action module, conditionally
release rtnl lock depending on 'rtnl_held' argument.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_gact: remove dependency on rtnl lock
Vlad Buslov [Fri, 10 Aug 2018 17:51:43 +0000 (20:51 +0300)]
net: sched: act_gact: remove dependency on rtnl lock

Use tcf spinlock to protect gact action private state from concurrent
modification during dump and init. Remove rtnl assertion that is no longer
necessary.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_csum: remove dependency on rtnl lock
Vlad Buslov [Fri, 10 Aug 2018 17:51:42 +0000 (20:51 +0300)]
net: sched: act_csum: remove dependency on rtnl lock

Use tcf lock to protect csum action struct private data from concurrent
modification in init and dump. Use rcu swap operation to reassign params
pointer under protection of tcf lock. (old params value is not used by
init, so there is no need of standalone rcu dereference step)

Remove rtnl assertion that is no longer necessary.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: act_bpf: remove dependency on rtnl lock
Vlad Buslov [Fri, 10 Aug 2018 17:51:41 +0000 (20:51 +0300)]
net: sched: act_bpf: remove dependency on rtnl lock

Use tcf spinlock to protect bpf action private data from concurrent
modification during dump and init. Remove rtnl lock assertion that is no
longer necessary.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'net-sctp-Avoid-allocating-high-order-memory-with-kmalloc'
David S. Miller [Sat, 11 Aug 2018 19:25:15 +0000 (12:25 -0700)]
Merge branch 'net-sctp-Avoid-allocating-high-order-memory-with-kmalloc'

Konstantin Khorenko says:

====================
net/sctp: Avoid allocating high order memory with kmalloc()

Each SCTP association can have up to 65535 input and output streams.
For each stream type an array of sctp_stream_in or sctp_stream_out
structures is allocated using kmalloc_array() function. This function
allocates physically contiguous memory regions, so this can lead
to allocation of memory regions of very high order, i.e.:

  sizeof(struct sctp_stream_out) == 24,
  ((65535 * 24) / 4096) == 383 memory pages (4096 byte per page),
  which means 9th memory order.

This can lead to a memory allocation failures on the systems
under a memory stress.

We actually do not need these arrays of memory to be physically
contiguous. Possible simple solution would be to use kvmalloc()
instread of kmalloc() as kvmalloc() can allocate physically scattered
pages if contiguous pages are not available. But the problem
is that the allocation can happed in a softirq context with
GFP_ATOMIC flag set, and kvmalloc() cannot be used in this scenario.

So the other possible solution is to use flexible arrays instead of
contiguios arrays of memory so that the memory would be allocated
on a per-page basis.

This patchset replaces kvmalloc() with flex_array usage.
It consists of two parts:

  * First patch is preparatory - it mechanically wraps all direct
    access to assoc->stream.out[] and assoc->stream.in[] arrays
    with SCTP_SO() and SCTP_SI() wrappers so that later a direct
    array access could be easily changed to an access to a
    flex_array (or any other possible alternative).
  * Second patch replaces kmalloc_array() with flex_array usage.

v2 changes:
 sctp_stream_in() users are updated to provide stream as an argument,
 sctp_stream_{in,out}_ptr() are now just sctp_stream_{in,out}().

v3 changes:
 Move type chages struct sctp_stream_out -> flex_array to next patch.
 Make sctp_stream_{in,out}() static incline and move them to a header.

Performance results (single stream):
====================================
  * Kernel: v4.18-rc6 - stock and with 2 patches from Oleg (earlier in this thread)
  * Node: CPU (8 cores): Intel(R) Xeon(R) CPU E31230 @ 3.20GHz
          RAM: 32 Gb

  * netperf: taken from https://github.com/HewlettPackard/netperf.git,
     compiled from sources with sctp support
  * netperf server and client are run on the same node
  * ip link set lo mtu 1500

The script used to run tests:
 # cat run_tests.sh
 #!/bin/bash

for test in SCTP_STREAM SCTP_STREAM_MANY SCTP_RR SCTP_RR_MANY; do
  echo "TEST: $test";
  for i in `seq 1 3`; do
    echo "Iteration: $i";
    set -x
    netperf -t $test -H localhost -p 22222 -S 200000,200000 -s 200000,200000 \
            -l 60 -- -m 1452;
    set +x
  done
done
================================================

Results (a bit reformatted to be more readable):
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

v4.18-rc7 v4.18-rc7 + fixes
TEST: SCTP_STREAM
212992 212992   1452    60.21 1125.52 1247.04
212992 212992   1452    60.20 1376.38 1149.95
212992 212992   1452    60.20 1131.40 1163.85
TEST: SCTP_STREAM_MANY
212992 212992   1452    60.00 1111.00 1310.05
212992 212992   1452    60.00 1188.55 1130.50
212992 212992   1452    60.00 1108.06 1162.50

===========
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

v4.18-rc7 v4.18-rc7 + fixes
TEST: SCTP_RR
212992 212992 1        1       60.00 45486.98 46089.43
212992 212992 1        1       60.00 45584.18 45994.21
212992 212992 1        1       60.00 45703.86 45720.84
TEST: SCTP_RR_MANY
212992 212992 1        1       60.00 40.75 40.77
212992 212992 1        1       60.00 40.58 40.08
212992 212992 1        1       60.00 39.98 39.97

Performance results for many streams:
=====================================
   * Kernel: v4.18-rc8 - stock and with 2 patches v3
   * Node: CPU (8 cores): Intel(R) Xeon(R) CPU E31230 @ 3.20GHz
           RAM: 32 Gb

   * sctp_test: https://github.com/sctp/lksctp-tools
   * both server and client are run on the same node
   * ip link set lo mtu 1500
   * sysctl -w vm.max_map_count=65530000 (need it to make memory fragmented)

The script used to run tests:
=============================
 # cat run_sctp_test.sh
 #!/bin/bash

set -x

uname -r
ip link set lo mtu 1500
swapoff -a

free
cat /proc/buddyinfo

./src/apps/sctp_test -H 127.0.0.1 -P 22222 -l -d 0 &
sleep 3

time ./src/apps/sctp_test -H 127.0.0.1 -P 22221 -h 127.0.0.1 -p 22222 \
         -s -c 1 -M 65535 -T -t 1 -x 100000 -d 0 1>/dev/null

killall -9 lt-sctp_test
===============================

Results (a bit reformatted to be more readable):

1) ms stock kernel v4.18-rc8, no memory fragmentation
test 1 test 2 test 3
real    0m14.715s 0m14.593s 0m15.954s
user    0m0.954s 0m0.955s 0m0.854s
sys     0m13.388s 0m12.537s 0m13.749s

2) kernel with fixes, no memory fragmentation
test 1 test 2 test 3
real    0m14.959s 0m14.693s 0m14.762s
user    0m0.948s 0m0.921s 0m0.929s
sys     0m13.538s 0m13.225s 0m13.217s

3) kernel with fixes, memory fragmented
'free':
               total        used        free      shared  buff/cache   available
Mem:       32906008    30555200      302740         764     2048068      266452
Mem:       32906008    30379948      541436         764     1984624      442376
Mem:       32906008    30717312      262380         764     1926316      109908

/proc/buddyinfo:
Node 0, zone   Normal  40773     37     34     29      0      0      0      0      0      0      0
Node 0, zone   Normal 100332     68      8      4      2      1      1      0      0      0      0
Node 0, zone   Normal  31113      7      2      1      0      0      0      0      0      0      0

test 1 test 2 test 3
real    0m14.159s 0m15.252s 0m15.826s
user    0m0.839s 0m1.004s 0m1.048s
sys     0m11.827s 0m14.240s 0m14.778s
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet/sctp: Replace in/out stream arrays with flex_array
Konstantin Khorenko [Fri, 10 Aug 2018 17:11:43 +0000 (20:11 +0300)]
net/sctp: Replace in/out stream arrays with flex_array

This path replaces physically contiguous memory arrays
allocated using kmalloc_array() with flexible arrays.
This enables to avoid memory allocation failures on the
systems under a memory stress.

Signed-off-by: Oleg Babin <obabin@virtuozzo.com>
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet/sctp: Make wrappers for accessing in/out streams
Konstantin Khorenko [Fri, 10 Aug 2018 17:11:42 +0000 (20:11 +0300)]
net/sctp: Make wrappers for accessing in/out streams

This patch introduces wrappers for accessing in/out streams indirectly.
This will enable to replace physically contiguous memory arrays
of streams with flexible arrays (or maybe any other appropriate
mechanism) which do memory allocation on a per-page basis.

Signed-off-by: Oleg Babin <obabin@virtuozzo.com>
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotc: Update README and add config
Keara Leibovitz [Fri, 10 Aug 2018 14:09:41 +0000 (10:09 -0400)]
tc: Update README and add config

Updated README.

Added config file that contains the minimum required features enabled to
run the tests currently present in the kernel.
This must be updated when new unittests are created and require their own
modules.

Signed-off-by: Keara Leibovitz <kleib@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'l2tp-rework-pppol2tp-ioctl-handling'
David S. Miller [Sat, 11 Aug 2018 19:13:49 +0000 (12:13 -0700)]
Merge branch 'l2tp-rework-pppol2tp-ioctl-handling'

Guillaume Nault says:

====================
l2tp: rework pppol2tp ioctl handling

The current ioctl() handling code can be simplified. It tests for
non-relevant conditions and uselessly holds sockets. Once useless
code is removed, it becomes even simpler to let pppol2tp_ioctl() handle
commands directly, rather than dispatch them to pppol2tp_tunnel_ioctl()
or pppol2tp_session_ioctl(). That is the approach taken by this series.

Patch #1 and #2 define helper functions aimed at simplifying the rest
of the patch set.

Patch #3 drops useless tests in pppol2p_ioctl() and avoid holding a
refcount on the socket.

Patches #4, #5 and #6 are the core of the series. They let
pppol2tp_ioctl() handle all ioctls and drop the tunnel and session
specific functions.

Then patch #6 brings a little bit of consolidation.

Finally, patch #7 takes advantage of the simplified code to make
pppol2tp sockets compatible with dev_ioctl(). Certainly not a killer
feature, but it is trivial and it is always nice to see l2tp getting
better integration with the rest of the stack.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agol2tp: let pppol2tp_ioctl() fallback to dev_ioctl()
Guillaume Nault [Fri, 10 Aug 2018 11:22:03 +0000 (13:22 +0200)]
l2tp: let pppol2tp_ioctl() fallback to dev_ioctl()

Return -ENOIOCTLCMD for unknown ioctl commands. This lets dev_ioctl()
handle generic socket ioctls like SIOCGIFNAME or SIOCGIFINDEX.
PF_PPPOX/PX_PROTO_OL2TP was one of the few socket types not honouring
this mechanism.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agol2tp: zero out stats in pppol2tp_copy_stats()
Guillaume Nault [Fri, 10 Aug 2018 11:22:02 +0000 (13:22 +0200)]
l2tp: zero out stats in pppol2tp_copy_stats()

Integrate memset(0) in pppol2tp_copy_stats() to avoid calling it
manually every time.

While there, constify 'stats'.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agol2tp: remove pppol2tp_session_ioctl()
Guillaume Nault [Fri, 10 Aug 2018 11:22:01 +0000 (13:22 +0200)]
l2tp: remove pppol2tp_session_ioctl()

pppol2tp_ioctl() has everything in place for handling PPPIOCGL2TPSTATS
on session sockets. We just need to copy the stats and set ->session_id.

As a side effect of sharing session and tunnel code, ->using_ipsec is
properly set even when the request was made using a session socket.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agol2tp: remove pppol2tp_tunnel_ioctl()
Guillaume Nault [Fri, 10 Aug 2018 11:22:00 +0000 (13:22 +0200)]
l2tp: remove pppol2tp_tunnel_ioctl()

Handle PPPIOCGL2TPSTATS in pppol2tp_ioctl() if the socket represents a
tunnel. This one is a bit special because the caller may use the tunnel
socket to retrieve statistics of one of its sessions. If the session_id
is set, the corresponding session's statistics are returned, instead of
those of the tunnel. This is handled by the new
pppol2tp_tunnel_copy_stats() helper function.

Set ->tunnel_id and ->using_ipsec out of the conditional, so
that it can be used by the 'else' branch in the following patch.
We cannot do that for ->session_id, because tunnel sockets have to
report the value that was originally passed in 'stats.session_id',
while session sockets have to report their own session_id.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agol2tp: handle PPPIOC[GS]MRU and PPPIOC[GS]FLAGS in pppol2tp_ioctl()
Guillaume Nault [Fri, 10 Aug 2018 11:21:58 +0000 (13:21 +0200)]
l2tp: handle PPPIOC[GS]MRU and PPPIOC[GS]FLAGS in pppol2tp_ioctl()

Let pppol2tp_ioctl() handle ioctl commands directly. It still relies on
pppol2tp_{session,tunnel}_ioctl() for PPPIOCGL2TPSTATS.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>