David S. Miller [Sat, 22 Jun 2019 02:05:42 +0000 (22:05 -0400)]
Merge branch 'PCI-let-pci_disable_link_state-propagate-errors'
Heiner Kallweit says:
====================
PCI: let pci_disable_link_state propagate errors
Drivers like r8169 rely on pci_disable_link_state() having disabled
certain ASPM link states. If OS can't control ASPM then
pci_disable_link_state() turns into a no-op w/o informing the caller.
The driver therefore may falsely assume the respective ASPM link
states are disabled. Let pci_disable_link_state() propagate errors
to the caller, enabling the caller to react accordingly.
I'd propose to let this series go through the netdev tree if the PCI
core extension is acked by the PCI people.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Tue, 18 Jun 2019 21:14:50 +0000 (23:14 +0200)]
r8169: don't activate ASPM in chip if OS can't control ASPM
Certain chip version / board combinations have massive problems if
ASPM is active. If BIOS enables ASPM and doesn't let OS control it,
then we may have a problem with the current code. Therefore check the
return code of pci_disable_link_state() and don't enable ASPM in the
network chip if OS can't control ASPM.
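The check described above can be sketched in userspace roughly as follows. This is not the r8169 code: `struct fake_dev`, `fake_pci_disable_link_state()` and `chip_aspm_allowed()` are hypothetical stand-ins, and the `-EPERM` error code is an assumption made for the sketch. The only behaviour taken from the commit message is that the helper reports failure and that the driver then leaves ASPM off in the chip.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a PCI device; not a kernel structure. */
struct fake_dev {
	bool os_controls_aspm;   /* false when firmware keeps ASPM control */
	bool link_aspm_disabled; /* set once the link states are really off */
};

/* Stand-in for a link-state helper that reports failure instead of
 * silently turning into a no-op. */
static int fake_pci_disable_link_state(struct fake_dev *dev)
{
	assert(dev != NULL);
	if (!dev->os_controls_aspm)
		return -EPERM; /* assumed error code for this sketch */
	dev->link_aspm_disabled = true;
	return 0;
}

/* Driver-side decision: only let the chip use ASPM if the OS could
 * actually disable the problematic link states. */
static bool chip_aspm_allowed(struct fake_dev *dev)
{
	return fake_pci_disable_link_state(dev) == 0;
}
```

With this shape the caller never has to guess whether the link states were changed; the return value is the single source of truth.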
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Tue, 18 Jun 2019 21:13:48 +0000 (23:13 +0200)]
PCI: let pci_disable_link_state propagate errors
Drivers may rely on pci_disable_link_state() having disabled certain
ASPM link states. If OS can't control ASPM then pci_disable_link_state()
turns into a no-op w/o informing the caller. The driver therefore may
falsely assume the respective ASPM link states are disabled.
Let pci_disable_link_state() propagate errors to the caller, enabling
the caller to react accordingly.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 20 Jun 2019 04:06:27 +0000 (00:06 -0400)]
Merge git://git./linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
====================
pull-request: bpf-next 2019-06-19
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) new SO_REUSEPORT_DETACH_BPF setsockopt, from Martin.
2) BTF based map definition, from Andrii.
3) support bpf_map_lookup_elem for xskmap, from Jonathan.
4) bounded loops and scalar precision logic in the verifier, from Alexei.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Wed, 19 Jun 2019 22:15:52 +0000 (00:15 +0200)]
page_pool: fix compile warning when CONFIG_PAGE_POOL is disabled
Kbuild test robot reported compile warning:
warning: no return statement in function returning non-void
in function page_pool_request_shutdown, when CONFIG_PAGE_POOL is disabled.
The fix makes the code a little more verbose, with a descriptive variable.
Fixes: 99c07c43c4ea ("xdp: tracking page_pool resources and safe removal")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Blumenstingl [Tue, 18 Jun 2019 20:39:27 +0000 (22:39 +0200)]
net: stmmac: initialize the reset delay array
Commit ce4ab73ab0c27c ("net: stmmac: drop the reset delays from struct
stmmac_mdio_bus_data") moved the reset delay array from struct
stmmac_mdio_bus_data to a stack variable.
The values from the array inside struct stmmac_mdio_bus_data were
previously initialized to 0 because the struct was allocated using
devm_kzalloc(). The array on the stack has to be initialized
explicitly, else we might be reading garbage values.
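The difference between the zeroed allocation and the stack array can be illustrated with a small userspace sketch. The array length `NUM_RESET_DELAYS`, the helper `read_optional_delays()` and the function `total_reset_delay()` are hypothetical names chosen for illustration; they are not the stmmac code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_RESET_DELAYS 3 /* assumed array length for this sketch */

/* Hypothetical stand-in for reading an optional property. Like a
 * failed property read, it leaves `out` untouched when the property
 * is absent (prop == NULL). */
static bool read_optional_delays(const uint32_t *prop, uint32_t *out)
{
	assert(out != NULL);
	if (!prop)
		return false;
	for (size_t i = 0; i < NUM_RESET_DELAYS; i++)
		out[i] = prop[i];
	return true;
}

/* Returns the total delay that would be applied during reset. */
static uint32_t total_reset_delay(const uint32_t *prop)
{
	/* Explicit zero-initialization: without "= { 0 }" this stack
	 * array would hold indeterminate values when prop is NULL. */
	uint32_t delays[NUM_RESET_DELAYS] = { 0 };

	read_optional_delays(prop, delays);
	return delays[0] + delays[1] + delays[2];
}
```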
Initialize all reset delays to 0 to ensure that the values are 0 if the
"snps,reset-delays-us" property is not defined.
This fixes booting at least two boards (MIPS pistachio marduk and ARM
sun8i H2+ Orange Pi Zero). These are hanging during boot when
initializing the stmmac Ethernet controller (as found by Kernel CI).
Both have in common that they don't define the "snps,reset-delays-us"
property.
Fixes: ce4ab73ab0c27c ("net: stmmac: drop the reset delays from struct stmmac_mdio_bus_data")
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Tue, 18 Jun 2019 20:03:04 +0000 (16:03 -0400)]
selftests/net: make udpgso_bench skip unsupported testcases
Kselftest can be run against older kernels. Instead of failing hard
when a feature is unsupported, return the KSFT_SKIP exit code.
Specifically, do not fail hard on missing udp zerocopy.
The udp gso bench test runs multiple test cases from a single script.
Fail if any case fails, else return skip if any test is skipped.
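The aggregation rule can be sketched in C as below. The real test is driven by a script; `combine_results()` is a hypothetical helper written only to show the precedence (fail over skip over pass). The numeric exit codes are the ones the kselftest framework uses.

```c
#include <assert.h>
#include <stddef.h>

/* Exit codes as used by the kselftest framework. */
#define KSFT_PASS 0
#define KSFT_FAIL 1
#define KSFT_SKIP 4

/* Combine per-testcase results: any failure wins, otherwise a skip
 * is reported if at least one case was skipped. */
static int combine_results(const int *results, size_t n)
{
	int overall = KSFT_PASS;

	assert(results != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (results[i] == KSFT_SKIP)
			overall = KSFT_SKIP;
		else if (results[i] != KSFT_PASS)
			return KSFT_FAIL;
	}
	return overall;
}
```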
Link: https://lore.kernel.org/lkml/20190618171516.GA17547@kroah.com/
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Matthias Kaehlcke [Tue, 18 Jun 2019 21:14:40 +0000 (14:14 -0700)]
net/ipv4: fib_trie: Avoid cryptic ternary expressions
empty_child_inc/dec() use the ternary operator for conditional
operations. The conditions involve the post/pre in/decrement
operator and the operation is only performed when the condition
is *not* true. This is hard for humans to parse; use a regular
'if' construct instead and perform the in/decrement separately.
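A minimal sketch of the rewritten form, using a hypothetical `struct node_info` in place of the trie node info. The field names follow the warning quoted in this commit message; the decrement variant is assumed to mirror the increment one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-node counters. */
struct node_info {
	uint32_t empty_children;
	uint32_t full_children;
};

/* "++a ? : ++b" increments b only when ++a yields 0.
 * Spelled out with a regular 'if': */
static void empty_child_inc(struct node_info *n)
{
	assert(n != NULL);
	n->empty_children++;
	if (!n->empty_children) /* counter wrapped around to 0 */
		n->full_children++;
}

/* Mirror image: "a-- ? : b--" decrements b only when a was 0
 * before the decrement. */
static void empty_child_dec(struct node_info *n)
{
	assert(n != NULL);
	if (!n->empty_children)
		n->full_children--;
	n->empty_children--;
}
```

Separating the test from the in/decrement also removes the unused expression value that the compiler warns about.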
This also fixes two warnings that are emitted about the value
of the ternary expression being unused, when building the kernel
with clang + "kbuild: Remove unnecessary -Wno-unused-value"
(https://lore.kernel.org/patchwork/patch/1089869/):
CC net/ipv4/fib_trie.o
net/ipv4/fib_trie.c:351:2: error: expression result unused [-Werror,-Wunused-value]
++tn_info(n)->empty_children ? : ++tn_info(n)->full_children;
Fixes: 95f60ea3e99a ("fib_trie: Add collapse() and should_collapse() to resize")
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kevin Darbyshire-Bryant [Wed, 19 Jun 2019 17:41:10 +0000 (18:41 +0100)]
net: sched: act_ctinfo: tidy UAPI definition
Remove some enums from the UAPI definition that were only used
internally and are NOT part of the UAPI.
Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 19 Jun 2019 15:37:48 +0000 (11:37 -0400)]
Merge branch 'inet-fix-defrag-units-dismantle-races'
Eric Dumazet says:
====================
inet: fix defrag units dismantle races
This series adds a new pre_exit() method to struct pernet_operations
to solve a race in defrag units dismantle, without adding extra
delays to netns dismantles.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 18 Jun 2019 18:09:00 +0000 (11:09 -0700)]
inet: fix various use-after-free in defrags units
syzbot reported another issue caused by my recent patches. [1]
The issue here is that fqdir_exit() is initiating a work queue
and immediately returns. A bit later cleanup_net() was able
to free the MIB (percpu data) and the whole struct net was freed,
but we had active frag timers that fired and triggered use-after-free.
We need to make sure that timers can catch fqdir->dead being set,
to bail out.
Since RCU is used for the reader side, this means
we want to respect an RCU grace period between these operations :
1) fqdir->dead = 1;
2) netns dismantle (freeing of various data structure)
This patch uses the new (struct pernet_operations)->pre_exit
infrastructure to ensure a full RCU grace period
happens between fqdir_pre_exit() and fqdir_exit().
This also means we can use a regular work queue, we no
longer need rcu_work.
Tested:
$ time for i in {1..1000}; do unshare -n /bin/false;done
real 0m2.585s
user 0m0.160s
sys 0m2.214s
[1]
BUG: KASAN: use-after-free in ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
Read of size 8 at addr ffff88808b9fe330 by task syz-executor.4/11860
CPU: 1 PID: 11860 Comm: syz-executor.4 Not tainted 5.2.0-rc2+ #22
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
__kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
kasan_report+0x12/0x20 mm/kasan/common.c:614
__asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
call_timer_fn+0x193/0x720 kernel/time/timer.c:1322
expire_timers kernel/time/timer.c:1366 [inline]
__run_timers kernel/time/timer.c:1685 [inline]
__run_timers kernel/time/timer.c:1653 [inline]
run_timer_softirq+0x66f/0x1740 kernel/time/timer.c:1698
__do_softirq+0x25c/0x94c kernel/softirq.c:293
invoke_softirq kernel/softirq.c:374 [inline]
irq_exit+0x180/0x1d0 kernel/softirq.c:414
exiting_irq arch/x86/include/asm/apic.h:536 [inline]
smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806
</IRQ>
RIP: 0010:tomoyo_domain_quota_is_ok+0x131/0x540 security/tomoyo/util.c:1035
Code: 24 4c 3b 65 d0 0f 84 9c 00 00 00 e8 19 1d 73 fe 49 8d 7c 24 18 48 ba 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 0f b6 04 10 <48> 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 69 03 00 00 41 0f b6 5c
RSP: 0018:ffff88806ae079c0 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000000 RBX: 0000000000000010 RCX: ffffc9000e655000
RDX: dffffc0000000000 RSI: ffffffff82fd88a7 RDI: ffff888086202398
RBP: ffff88806ae07a00 R08: ffff88808b6c8700 R09: ffffed100d5c0f4d
R10: ffffed100d5c0f4c R11: 0000000000000000 R12: ffff888086202380
R13: 0000000000000030 R14: 00000000000000d3 R15: 0000000000000000
tomoyo_supervisor+0x2e8/0xef0 security/tomoyo/common.c:2087
tomoyo_audit_path_number_log security/tomoyo/file.c:235 [inline]
tomoyo_path_number_perm+0x42f/0x520 security/tomoyo/file.c:734
tomoyo_file_ioctl+0x23/0x30 security/tomoyo/tomoyo.c:335
security_file_ioctl+0x77/0xc0 security/security.c:1370
ksys_ioctl+0x57/0xd0 fs/ioctl.c:711
__do_sys_ioctl fs/ioctl.c:720 [inline]
__se_sys_ioctl fs/ioctl.c:718 [inline]
__x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4592c9
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f8db5e44c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004592c9
RDX: 0000000020000080 RSI: 00000000000089f1 RDI: 0000000000000006
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8db5e456d4
R13: 00000000004cc770 R14: 00000000004d5cd8 R15: 00000000ffffffff
Allocated by task 9047:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_kmalloc mm/kasan/common.c:489 [inline]
__kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
slab_post_alloc_hook mm/slab.h:437 [inline]
slab_alloc mm/slab.c:3326 [inline]
kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3488
kmem_cache_zalloc include/linux/slab.h:732 [inline]
net_alloc net/core/net_namespace.c:386 [inline]
copy_net_ns+0xed/0x340 net/core/net_namespace.c:426
create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
ksys_unshare+0x440/0x980 kernel/fork.c:2692
__do_sys_unshare kernel/fork.c:2760 [inline]
__se_sys_unshare kernel/fork.c:2758 [inline]
__x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Freed by task 2541:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
__cache_free mm/slab.c:3432 [inline]
kmem_cache_free+0x86/0x260 mm/slab.c:3698
net_free net/core/net_namespace.c:402 [inline]
net_drop_ns.part.0+0x70/0x90 net/core/net_namespace.c:409
net_drop_ns net/core/net_namespace.c:408 [inline]
cleanup_net+0x538/0x960 net/core/net_namespace.c:571
process_one_work+0x989/0x1790 kernel/workqueue.c:2269
worker_thread+0x98/0xe40 kernel/workqueue.c:2415
kthread+0x354/0x420 kernel/kthread.c:255
ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
The buggy address belongs to the object at ffff88808b9fe100
which belongs to the cache net_namespace of size 6784
The buggy address is located 560 bytes inside of
6784-byte region [ffff88808b9fe100, ffff88808b9ffb80)
The buggy address belongs to the page:
page:ffffea00022e7f80 refcount:1 mapcount:0 mapping:ffff88821b6f60c0 index:0x0 compound_mapcount: 0
flags: 0x1fffc0000010200(slab|head)
raw: 01fffc0000010200 ffffea000256f288 ffffea0001bbef08 ffff88821b6f60c0
raw: 0000000000000000 ffff88808b9fe100 0000000100000001 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
 ffff88808b9fe200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88808b9fe280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88808b9fe300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                     ^
 ffff88808b9fe380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88808b9fe400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 3c8fc8782044 ("inet: frags: rework rhashtable dismantle")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 18 Jun 2019 18:08:59 +0000 (11:08 -0700)]
netns: add pre_exit method to struct pernet_operations
Current struct pernet_operations exit() handlers are strongly
discouraged from calling synchronize_rcu().
There are cases where we need them, and exit_batch() does
not help the common case where a single netns is dismantled.
This patch leverages the existing synchronize_rcu() call
in cleanup_net().
Calling the optional ->pre_exit() method before ->exit() or
->exit_batch() lets all handlers benefit from a single
synchronize_rcu() call.
Note that the synchronize_rcu() calls added in this patch
are only in error paths or slow paths.
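The ordering this buys can be modelled in userspace as below. This is a reduced model, not the kernel code: `struct ops_model`, `fake_synchronize_rcu()`, `cleanup_model()` and the two sample handlers are hypothetical, and list ordering and batching are ignored. It only shows that every pre_exit() runs first, then one shared grace period, then every exit().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-reduced model of per-netns operations. */
struct ops_model {
	void (*pre_exit)(void); /* optional */
	void (*exit)(void);     /* optional */
};

static int grace_periods;       /* counts simulated grace periods */
static int dead_flag;           /* set by the sample pre_exit() */
static int exit_ran_after_sync; /* 1 if exit() saw flag and grace period */

/* Stand-in for waiting one RCU grace period; it only counts. */
static void fake_synchronize_rcu(void)
{
	grace_periods++;
}

/* Sample handlers: pre_exit() marks the object dead, exit() checks
 * that a grace period has elapsed since then. */
static void sample_pre_exit(void)
{
	dead_flag = 1;
}

static void sample_exit(void)
{
	exit_ran_after_sync = dead_flag && grace_periods == 1;
}

/* Run every pre_exit(), wait once, then run every exit(). */
static void cleanup_model(const struct ops_model *ops, size_t n)
{
	assert(ops != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (ops[i].pre_exit)
			ops[i].pre_exit();

	fake_synchronize_rcu();

	for (size_t i = 0; i < n; i++)
		if (ops[i].exit)
			ops[i].exit();
}
```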
Tested:
$ time for i in {1..1000}; do unshare -n /bin/false;done
real 0m2.612s
user 0m0.171s
sys 0m2.216s
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 19 Jun 2019 15:23:14 +0000 (11:23 -0400)]
Merge branch 'xdp-page_pool-fixes-and-in-flight-accounting'
Jesper Dangaard Brouer says:
====================
xdp: page_pool fixes and in-flight accounting
This patchset fixes the page_pool API and its users, such that drivers can use
it for DMA-mapping. A number of places exist where the DMA-mapping would not
get released/unmapped; all of these are fixed. This occurs e.g. when an
xdp_frame gets converted to an SKB, as the network stack doesn't have any
callback for XDP memory models.
The patchset also addresses a shutdown race condition. Today removing an XDP
memory model, based on page_pool, is only delayed one RCU grace period. This
isn't enough as redirected xdp_frames can still be in-flight on different
queues (remote driver TX, cpumap or veth).
We stress that when drivers use page_pool for DMA-mapping, then they MUST
use one packet per page. This might change in the future, but more work lies
ahead, before we can lift this restriction.
This patchset changes the page_pool API to be more strict, as in-flight page
accounting is added.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Tue, 18 Jun 2019 13:06:08 +0000 (15:06 +0200)]
page_pool: make sure struct device is stable
For the DMA mapping use-case the page_pool keeps a pointer
to the struct device, which is used in DMA map/unmap calls.
For our in-flight handling, we also need to make sure that
the struct device has not disappeared. This is assured
by using the get_device/put_device API.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reported-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Tue, 18 Jun 2019 13:06:03 +0000 (15:06 +0200)]
page_pool: add tracepoints for page_pool with details need by XDP
The xdp tracepoints for mem id disconnect don't carry information about why
it was not safe_to_remove. The tracepoint page_pool:page_pool_inflight in
this patch can be used to extract this info for further debugging.
This patchset also adds a tracepoint for the pages_state_* release/hold
transitions, including a pointer to the page. This can be used for stats
about in-flight pages, or used to debug page leakage via keeping track of
page pointer and combining this with kprobe for __put_page().
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Tue, 18 Jun 2019 13:05:58 +0000 (15:05 +0200)]
xdp: add tracepoints for XDP mem
These tracepoints make it easier to troubleshoot XDP mem id disconnect.
The xdp:mem_disconnect tracepoint cannot be replaced via kprobe. It is
placed at the last stable place for the pointer to struct xdp_mem_allocator,
just before it's scheduled for RCU removal. It also extracts info on
'safe_to_remove' and 'force'.
Detailed info about in-flight pages is not available at this layer. The next
patch will add the tracepoints needed at the page_pool layer for this.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Tue, 18 Jun 2019 13:05:53 +0000 (15:05 +0200)]
xdp: force mem allocator removal and periodic warning
If bugs exist or are introduced later, e.g. by drivers misusing the API,
then we want to warn about the issue, such that developers notice. This patch
will generate a bit of noise in the form of a periodic pr_warn every 30 seconds.
It is not nice to have this stall warning running forever. Thus, this patch
will (after 120 attempts) force disconnect the mem id (from the rhashtable)
and free the page_pool object. This will cause a fallback to put_page() as
before, which only potentially leaks DMA-mappings, if objects are really
stuck for this long. In that unlikely case, a WARN_ONCE should show us the
call stack.
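The retry policy can be sketched as a small state machine. The limit of 120 attempts and the 30-second warning period come from the text above; the retry interval is not stated here, so the sketch assumes one attempt per second (hence a warning every 30th attempt). `struct disconnect_model` and `disconnect_try()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ATTEMPTS 120 /* from the commit message */
#define WARN_EVERY    30 /* assumes one retry per second */

struct disconnect_model {
	int attempts; /* failed retries so far */
	int warnings; /* periodic warnings emitted */
	bool forced;  /* true if removal was forced */
};

/* One retry of the deferred disconnect. Returns true when the
 * allocator is gone (cleanly or by force), false to retry later. */
static bool disconnect_try(struct disconnect_model *m, bool safe_to_remove)
{
	assert(m != NULL);
	if (safe_to_remove)
		return true;

	m->attempts++;
	if (m->attempts % WARN_EVERY == 0)
		m->warnings++; /* stand-in for the periodic warning */

	if (m->attempts >= MAX_ATTEMPTS) {
		m->forced = true; /* give up waiting, force removal */
		return true;
	}
	return false;
}
```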
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Tue, 18 Jun 2019 13:05:47 +0000 (15:05 +0200)]
xdp: tracking page_pool resources and safe removal
This patch is needed before we can allow drivers to use page_pool for
DMA-mappings. Today with page_pool and the XDP return API, it is possible to
remove the page_pool object (from rhashtable) while there are still
in-flight packet-pages. This is safely handled via RCU, and failed lookups in
__xdp_return() fall back to calling put_page() when the page_pool object is
gone. In case the page is still DMA mapped, this will result in the page not
getting correctly DMA unmapped.
To solve this, the page_pool is extended with tracking of in-flight pages, and
the XDP disconnect system queries page_pool and waits, via workqueue, for all
in-flight pages to be returned.
To avoid killing performance when tracking in-flight pages, the implementation
uses two (unsigned) counters that are placed on different cache lines and
can be used to deduce the number of in-flight packets. This is done by mapping
the unsigned "sequence" counters onto signed two's complement arithmetic
operations. This is e.g. used by the kernel's time_after macros, described in
kernel commits 1ba3aab3033b and 5a581b367b5, and also explained in RFC1982.
The trick is that these two incrementing counters only need to be read and
compared when checking if it's safe to free the page_pool structure, which
will only happen when the driver has disconnected the RX/alloc side, i.e. on
a non-fast-path.
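The counter arithmetic can be sketched as follows. `struct inflight_model` and `inflight()` are hypothetical names for illustration, and the sketch ignores cache-line placement and atomicity. The subtraction relies on unsigned wrap-around modulo 2^32; converting the result to a signed type assumes the usual two's complement representation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter pair; in a real implementation the two
 * fields would live on different cache lines. */
struct inflight_model {
	uint32_t hold_cnt;    /* bumped on the alloc side */
	uint32_t release_cnt; /* bumped on the return side */
};

/* Serial-number style distance: unsigned subtraction wraps modulo
 * 2^32, and the result is reinterpreted as a signed value. */
static int32_t inflight(const struct inflight_model *m)
{
	assert(m != NULL);
	return (int32_t)(m->hold_cnt - m->release_cnt);
}
```

The result stays correct when the hold counter has already wrapped past zero and the release counter has not, as long as the true distance fits in 31 bits.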
It is chosen that page_pool tracking is also enabled for the non-DMA
use-case, as this can be used for statistics later.
After this patch, using page_pool requires more strict resource "release",
e.g. via page_pool_release_page() that was introduced in this patchset, and
previous patches implement/fix this more strict requirement.
Drivers no longer call page_pool_destroy(). Drivers already call
xdp_rxq_info_unreg(), which calls xdp_rxq_info_unreg_mem_model(), which will
attempt to disconnect the mem id and, if the attempt fails, schedule the
disconnect for later via a delayed workqueue.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Tue, 18 Jun 2019 13:05:42 +0000 (15:05 +0200)]
mlx5: more strict use of page_pool API
The mlx5 driver is using page_pool, but not for DMA-mapping (currently), and
is a little too relaxed about returning or releasing page resources, as it
is not strictly necessary, when not using DMA-mappings.
As this patchset is working towards tracking page_pool resources, to know
about in-flight frames on shutdown, fix the places where mlx5 leaks
page_pool resources.
In case of dma_mapping_error, recycle the page into page_pool.
In mlx5e_free_rq(), move the page_pool_destroy() call to after the
mlx5e_page_release() calls, as it is more correct.
In mlx5e_page_release(), when no recycle was requested, release the page
from the page_pool via page_pool_release_page().
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Tue, 18 Jun 2019 13:05:37 +0000 (15:05 +0200)]
page_pool: introduce page_pool_free and use in mlx5
In case the driver fails to register the page_pool with the XDP return API
(via xdp_rxq_info_reg_mem_model()), then the driver can free the page_pool
resources more directly than calling page_pool_destroy(), which does an
unnecessary RCU free procedure.
This patch is preparing for removing page_pool_destroy(), from driver
invocation.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Tue, 18 Jun 2019 13:05:32 +0000 (15:05 +0200)]
veth: use xdp_release_frame for XDP_PASS
Like cpumap, use xdp_release_frame() when an xdp_frame gets
converted into an SKB and sent towards the network stack.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Tue, 18 Jun 2019 13:05:27 +0000 (15:05 +0200)]
xdp: page_pool related fix to cpumap
When converting an xdp_frame into an SKB, and sending this into the network
stack, the underlying XDP memory model needs to release associated
resources, because the network stack doesn't have callbacks for XDP memory
models. The only memory model that needs this is page_pool, when a driver
uses the DMA-mapping feature.
Introduce page_pool_release_page(), which basically does the same as
page_pool_unmap_page(). Add xdp_release_frame() as the XDP memory model
interface for calling it, if the memory model matches MEM_TYPE_PAGE_POOL, to
save the function call overhead for others. Have cpumap call
xdp_release_frame() before xdp_scrub_frame().
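The "check the type inline, call out only when needed" shape can be sketched like this. All names except MEM_TYPE_PAGE_POOL are hypothetical (`enum mem_type_model`, `struct frame_model`, `release_page_model()`, `release_frame_model()`), and the `dma_mapped` flag merely stands in for whatever resources the memory model holds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced model; only MEM_TYPE_PAGE_POOL is named in
 * the commit message, the other enumerator is a placeholder. */
enum mem_type_model { MEM_TYPE_OTHER, MEM_TYPE_PAGE_POOL };

struct frame_model {
	enum mem_type_model type;
	int dma_mapped; /* 1 while a DMA mapping is held */
};

static int release_calls; /* how often the slow path ran */

/* Out-of-line slow path: drop the resources held for the page. */
static void release_page_model(struct frame_model *f)
{
	release_calls++;
	f->dma_mapped = 0;
}

/* Cheap inline check so other memory models skip the function call. */
static inline void release_frame_model(struct frame_model *f)
{
	assert(f != NULL);
	if (f->type == MEM_TYPE_PAGE_POOL)
		release_page_model(f);
}
```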
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Tue, 18 Jun 2019 13:05:22 +0000 (15:05 +0200)]
xdp: fix leak of IDA cyclic id if rhashtable_insert_slow fails
Fix error handling case, where inserting ID with rhashtable_insert_slow
fails in xdp_rxq_info_reg_mem_model, which leads to never releasing the IDA
ID, as the lookup in xdp_rxq_info_unreg_mem_model fails and thus
ida_simple_remove() is never called.
Fix by releasing ID via ida_simple_remove(), and mark xdp_rxq->mem.id with
zero, which is already checked in xdp_rxq_info_unreg_mem_model().
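The error path can be sketched with a toy ID allocator. Nothing here is the kernel IDA or rhashtable API: `id_alloc()`, `id_release()` and `reg_mem_model()` are hypothetical, and the `insert_ok` parameter simulates whether the hash table insert succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature ID allocator standing in for an IDA. */
#define MAX_IDS 8
static bool id_used[MAX_IDS + 1]; /* index 0 is never handed out */

static int id_alloc(void)
{
	for (int id = 1; id <= MAX_IDS; id++) {
		if (!id_used[id]) {
			id_used[id] = true;
			return id;
		}
	}
	return -1;
}

static void id_release(int id)
{
	assert(id >= 1 && id <= MAX_IDS);
	id_used[id] = false;
}

/* Register a memory model. On insert failure the ID is released and
 * *mem_id is reset to 0, so a later unregister sees nothing to undo. */
static int reg_mem_model(int *mem_id, bool insert_ok)
{
	int id = id_alloc();

	assert(mem_id != NULL);
	if (id < 0)
		return -1;
	*mem_id = id;
	if (!insert_ok) {
		id_release(id);
		*mem_id = 0;
		return -1;
	}
	return 0;
}
```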
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilias Apalodimas [Tue, 18 Jun 2019 13:05:17 +0000 (15:05 +0200)]
net: page_pool: add helper function to unmap dma addresses
In a previous patch the DMA address was stored in 'struct page'.
Use that to unmap DMA addresses used by network drivers.
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilias Apalodimas [Tue, 18 Jun 2019 13:05:12 +0000 (15:05 +0200)]
net: page_pool: add helper function to retrieve dma addresses
In a previous patch the DMA address was stored in 'struct page'.
Use that to retrieve DMA addresses used by network drivers.
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilias Apalodimas [Wed, 19 Jun 2019 10:04:01 +0000 (13:04 +0300)]
net: netsec: remove loops in napi Rx process
netsec_process_rx was running in a loop trying to process as many packets
as possible before re-enabling interrupts. With the recent DMA changes
this is not needed anymore as we manage to consume all the budget without
looping over the function.
Since it has no performance penalty let's remove that and simplify the Rx
path a bit
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilias Apalodimas [Wed, 19 Jun 2019 10:04:00 +0000 (13:04 +0300)]
net: netsec: initialize tx ring on ndo_open
Since we changed the Tx ring handling and now depend on bit31 to figure
out the owner of the descriptor, we should initialize this every time
the device goes down-up instead of doing it once on driver init. If the
value is not correctly initialized the device won't have any available
descriptors.
Changes since v1:
- Typo fixes
Fixes: 35e07d234739 ("net: socionext: remove mmio reads on Tx")
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rasmus Villemoes [Wed, 19 Jun 2019 10:02:38 +0000 (10:02 +0000)]
net: dsa: mv88e6xxx: fix shift of FID bits in mv88e6250_g1_vtu_loadpurge()
The comment is correct, but the code ends up moving the bits four
places too far, into the VTUOp field.
Fixes: bec8e5725281 (net: dsa: mv88e6xxx: implement vtu_getnext and vtu_loadpurge for mv88e6250)
Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 19 Jun 2019 14:12:58 +0000 (10:12 -0400)]
act_ctinfo: Don't use BIT() in UAPI headers.
Use _BITUL() instead.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 19 Jun 2019 14:09:22 +0000 (10:09 -0400)]
Merge branch 'mlxsw-Implement-flower-ingress-device-matching-offload'
Ido Schimmel says:
====================
mlxsw: Implement flower ingress device matching offload
Jiri says:
In case of using shared block, user might find it handy to be able to insert
filters to match on particular ingress device. This patchset exposes the
ingress ifindex through flow_dissector and flow_offload so mlxsw can use it to
push down to HW. See the selftests for examples of usage.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 19 Jun 2019 06:41:09 +0000 (09:41 +0300)]
selftests: tc: add ingress device matching support
Extend tc_flower to test plain ingress device matching and also
tc_shblock to test ingress device matching on shared block.
Add new tc_flower_router.sh where ingress device matching on egress
(after routing) is done.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 19 Jun 2019 06:41:08 +0000 (09:41 +0300)]
mlxsw: spectrum_flower: Implement support for ingress device matching
Benefit from the previously extended flow_dissector infrastructure and
offload matching on ingress port.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 19 Jun 2019 06:41:07 +0000 (09:41 +0300)]
mlxsw: spectrum_acl: Fix SRC_SYS_PORT element size
Fix the size of the SRC_SYS_PORT element to be 16.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 19 Jun 2019 06:41:06 +0000 (09:41 +0300)]
mlxsw: spectrum_acl: Avoid size check for RX_ACL_SYSTEM_PORT element
RX_ACL_SYSTEM_PORT is 8 bit but SRC_SYS_PORT is 16 bits. Internally,
SRC_SYS_PORT is used to carry the value. Relax the checker in case of
RX_ACL_SYSTEM_PORT and allow different size.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 19 Jun 2019 06:41:05 +0000 (09:41 +0300)]
mlxsw: spectrum_acl: Write RX_ACL_SYSTEM_PORT acl element correctly
RX_ACL_SYSTEM_PORT is equal to SRC_SYS_PORT - 1. So before write to
block we need to adjust the key value. Introduce new "EXT" helper to
implement this.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 19 Jun 2019 06:41:04 +0000 (09:41 +0300)]
net: flow_offload: implement support for meta key
Implement support for previously added flow dissector meta key.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 19 Jun 2019 06:41:03 +0000 (09:41 +0300)]
net: sched: cls_flower: use flow_dissector for ingress ifindex
Use previously introduced infra to obtain and store ingress ifindex
instead of doing it locally.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 19 Jun 2019 06:41:02 +0000 (09:41 +0300)]
flow_dissector: add support for ingress ifindex dissection
Add new key meta that contains ingress ifindex value and add a function
to dissect this from skb. The key and function are prepared to cover
other potential skb metadata values dissection.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Tue, 18 Jun 2019 15:15:10 +0000 (16:15 +0100)]
net/mlx5: add missing void argument to function mlx5_devlink_alloc
Function mlx5_devlink_alloc is missing a void argument, add it
to clean up the non-ANSI function declaration.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 19 Jun 2019 02:26:05 +0000 (22:26 -0400)]
Merge branch 'net-mvpp2-cls-Allow-steering-based-on-vlan-tag'
Maxime Chevallier says:
====================
net: mvpp2: cls: Allow steering based on vlan tag
The PPv2 classifier can perform flow steering based on keys extracted
from the VLAN tag. This series adds support for using the vlan id and
the vlan prio as keys, using the ethtool interface.
Patch 1 is a preparatory patch that prevents false-positive matches,
using a dedicated lookup id for the RSS C2 lookup.
Patch 2 allows separating the flows based on the header fields they
contain. The main goal is to be able to separate tagged traffic from
untagged traffic for flow steering, just as we already do for RSS.
Patch 3 solves an issue we have when extracting fields that aren't full
bytes, such as the vlan tag which is 12 bits wide, or the priority which
is 3 bits wide.
Finally, patch 4 adds support for steering based on both vlan id and
priority, extracted from the outermost tag.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Maxime Chevallier [Tue, 18 Jun 2019 14:55:19 +0000 (16:55 +0200)]
net: mvpp2: cls: Add steering based on vlan Id and priority.
This commit allows using the vlan Id and priority as parts of the key
for classification offload. These fields are extracted from the
outermost tag, if multiple tags are present.
Vlan Id and priority are considered as 2 different fields by the
classifier, however the fields are both appended in the Header Extracted
Key in the same layout as they are found in the tags. This means that
when steering only based on the prio, a 16-bit slot is still taken in
the HEK.
The classifier doesn't allow extracting the DEI bit from the tag, so we
explicitly prevent user from using this bit in the key.
This commit adds the vlan priority as a compatible HEK field for
tagged traffic, meaning that we limit the possibility of extracting this
field only to the flows that contain tagged traffic.
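For reference, a minimal sketch of how the three fields share the 16-bit tag control word, following the standard 802.1Q layout (priority in bits 15-13, DEI in bit 12, vlan Id in bits 11-0). The helper names are hypothetical and are not taken from the mvpp2 driver:

```c
#include <stdint.h>

#define DEMO_TCI_VID_MASK   0x0fffu /* bits 11-0: vlan Id */
#define DEMO_TCI_DEI_MASK   0x1000u /* bit 12: drop eligible indicator */
#define DEMO_TCI_PRIO_SHIFT 13      /* bits 15-13: priority */

/* Extract the 12-bit vlan Id from a tag control word. */
unsigned int demo_tci_vid(uint16_t tci)
{
	return tci & DEMO_TCI_VID_MASK;
}

/* Extract the 3-bit priority from a tag control word. */
unsigned int demo_tci_prio(uint16_t tci)
{
	return (unsigned int)tci >> DEMO_TCI_PRIO_SHIFT;
}

/* A user-supplied key mask is acceptable only if it leaves the DEI bit out. */
int demo_tci_mask_ok(uint16_t mask)
{
	return (mask & DEMO_TCI_DEI_MASK) == 0;
}
```

Both fields come from the same 16-bit word, which is why a key that only uses the priority still occupies a 16-bit slot.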
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maxime Chevallier [Tue, 18 Jun 2019 14:55:18 +0000 (16:55 +0200)]
net: mvpp2: cls: right-justify the C2 TCAM keys
The C2 TCAM used for classification uses a key (Header Extracted Key)
built by concatenating several fields extracted from the packet header.
After a lot of trial-and-error and some guess work, it seems the HEK is
right justified, with the first fields being stored in the MSB, then
concatenated up until the LSB.
Until now, this hasn't caused any issue, since all HEK fields we use are
full bytes. However, this is an issue for the upcoming VLAN id and pri
extraction, which aren't full bytes.
Rework the way we build that TCAM key, by changing the order in which we
append the fields.
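The layout described above can be sketched as follows. This is only a toy model under stated assumptions: the key is limited to 64 bits, and the struct and function names are hypothetical rather than the driver's. Each appended field is placed below the ones already present, so the first field ends up nearest the MSB and the whole key stays right-justified:

```c
#include <stdint.h>

/* Toy key of at most 64 bits; the real HEK size is not modeled here. */
struct demo_hek {
	uint64_t bits;    /* right-justified key contents */
	unsigned int len; /* number of valid bits */
};

/* Append a field below the existing ones; earlier fields move toward the MSB. */
int demo_hek_append(struct demo_hek *k, uint32_t val, unsigned int width)
{
	uint64_t mask;

	if (width == 0 || width > 32 || k->len + width > 64)
		return -1;
	mask = (1ULL << width) - 1;
	k->bits = (k->bits << width) | (val & mask);
	k->len += width;
	return 0;
}

/* Example: a 3-bit priority of 5 followed by a 12-bit vlan id of 100. */
uint64_t demo_hek_prio_then_vid(void)
{
	struct demo_hek k = { 0, 0 };

	demo_hek_append(&k, 5, 3);
	demo_hek_append(&k, 100, 12);
	return k.bits;
}
```

With byte-sized fields the append order makes no visible difference to alignment, which matches the observation that the problem only shows up for fields that aren't full bytes.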
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maxime Chevallier [Tue, 18 Jun 2019 14:55:17 +0000 (16:55 +0200)]
net: mvpp2: cls: Only select applicable flows of classification offload
The way we currently handle classification offload and RSS is by having
dedicated lookup sequences in the flow table, each being selected
depending on several fields being present in the packet header.
We need to make sure the classification operation we want to perform can
be done in each flow we want to insert it into. As an example,
classifying on VLAN tag can only be done on flows used for tagged
traffic.
This commit makes sure we don't insert rules in flows we aren't
compatible with.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maxime Chevallier [Tue, 18 Jun 2019 14:55:16 +0000 (16:55 +0200)]
net: mvpp2: cls: Use a dedicated lu_type for the RSS lookup
When performing a TCAM lookup in the C2 engine, it's possible that
multiple entries match the packet. To make sure the correct entry matches
when performing a lookup, the Flow Table can set a lookup type, which
will be used in the TCAM lookup, thus preventing such false-positives.
We need to make sure the RSS match doesn't interfere with other
classification lookups, hence we use a dedicated lookup_type for it.
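A simplified model of the idea, assuming a ternary value/mask match and first-match-wins ordering (neither is spelled out above, and all names are hypothetical): two entries can match the same key, and the lookup type decides which one is eligible.

```c
#include <stddef.h>
#include <stdint.h>

enum { DEMO_LU_STEER = 0, DEMO_LU_RSS = 1 };

struct demo_tcam_entry {
	uint32_t value;
	uint32_t mask;   /* only bits set in the mask are compared */
	uint8_t lu_type; /* lookup type the entry belongs to */
	int action;
};

/* Return the action of the first entry matching both lu_type and key, or -1. */
int demo_tcam_lookup(const struct demo_tcam_entry *tbl, size_t n,
		     uint32_t key, uint8_t lu_type)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].lu_type != lu_type)
			continue;
		if (((key ^ tbl[i].value) & tbl[i].mask) == 0)
			return tbl[i].action;
	}
	return -1;
}

static const struct demo_tcam_entry demo_tbl[] = {
	{ 0x0000, 0x0000, DEMO_LU_RSS,   10 }, /* catch-all, matches every key */
	{ 0x0064, 0x0fff, DEMO_LU_STEER, 20 }, /* matches keys ending in 0x064 */
};

int demo_tcam_run(uint32_t key, uint8_t lu_type)
{
	return demo_tcam_lookup(demo_tbl, sizeof(demo_tbl) / sizeof(demo_tbl[0]),
				key, lu_type);
}
```

Without the lu_type check, the catch-all entry would shadow the steering entry for key 0x064.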
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 19 Jun 2019 02:02:27 +0000 (22:02 -0400)]
Merge branch 'macb-SiFive-FU540-C000'
Yash Shah says:
====================
Add macb support for SiFive FU540-C000
On FU540, the management IP block is tightly coupled with the Cadence
MACB IP block. It manages many of the boundary signals from the MACB IP.
This patchset controls the tx_clk input signal to the MACB IP. It
switches between the local TX clock (125MHz) and PHY TX clocks. This
is necessary to toggle between 1Gb and 100/10Mb speeds.
Future patches may add support for monitoring or controlling other IP
boundary signals.
This patchset is mostly based on work done by
Wesley Terpstra <wesley@sifive.com>
This patchset is based on Linux v5.2-rc1 and tested on the HiFive Unleashed
board. Additional board-related patches needed for testing can be
found in the dev/yashs/ethernet_v3 branch of:
https://github.com/yashshah7/riscv-linux.git
Change History:
V3:
- Revert "MACB_SIFIVE_FU540" config changes in Kconfig and driver code.
The driver does not depend on SiFive GPIO driver.
V2:
- Change compatible string from "cdns,fu540-macb" to "sifive,fu540-macb"
- Add "MACB_SIFIVE_FU540" in Kconfig to support SiFive FU540 in macb
driver. This is needed because on FU540, the macb driver depends on
SiFive GPIO driver.
- Avoid writing the result of a comparison to a register.
- Fix the issue of probe fail on reloading the module reported by:
Andreas Schwab <schwab@suse.de>
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Yash Shah [Tue, 18 Jun 2019 07:56:08 +0000 (13:26 +0530)]
macb: Add support for SiFive FU540-C000
The management IP block is tightly coupled with the Cadence MACB IP
block on the FU540, and manages many of the boundary signals from the
MACB IP. This patch only controls the tx_clk input signal to the MACB
IP. Future patches may add support for monitoring or controlling other
IP boundary signals.
Signed-off-by: Yash Shah <yash.shah@sifive.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yash Shah [Tue, 18 Jun 2019 07:56:07 +0000 (13:26 +0530)]
macb: bindings doc: add sifive fu540-c000 binding
Add the compatibility string documentation for SiFive FU540-C000
interface.
On the FU540, this driver also needs to read and write registers in a
management IP block that monitors or drives boundary signals for the
GEMGXL IP block that are not directly mapped to GEMGXL registers.
Therefore, add additional range to "reg" property for SiFive GEMGXL
management IP registers.
Signed-off-by: Yash Shah <yash.shah@sifive.com>
Reviewed-by: Paul Walmsley <paul.walmsley@sifive.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 19 Jun 2019 01:52:27 +0000 (21:52 -0400)]
Merge branch 'hinic-add-rss-support-and-rss-parameters-configuration'
Xue Chaojing says:
====================
hinic: add rss support and rss parameters configuration
This series adds RSS support for the HINIC driver and implements the ethtool
interface related to RSS parameter configuration. Users can use ethtool
to configure or show RSS parameters.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Xue Chaojing [Tue, 18 Jun 2019 06:20:53 +0000 (06:20 +0000)]
hinic: add support for rss parameters with ethtool
This patch adds support for RSS parameters with ethtool.
Users can change the hash key, hash indirection table, and hash
function with ethtool -X, and show RSS parameters with ethtool -x.
Signed-off-by: Xue Chaojing <xuechaojing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xue Chaojing [Tue, 18 Jun 2019 06:20:52 +0000 (06:20 +0000)]
hinic: move ethtool code into hinic_ethtool
This patch moves ethtool code from hinic_main.c to hinic_ethtool.c
Signed-off-by: Xue Chaojing <xuechaojing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xue Chaojing [Tue, 18 Jun 2019 06:20:51 +0000 (06:20 +0000)]
hinic: add rss support
This patch adds rss support for the HINIC driver.
Signed-off-by: Xue Chaojing <xuechaojing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Mon, 17 Jun 2019 16:58:36 +0000 (17:58 +0100)]
net: stmmac: add sanity check to device_property_read_u32_array call
Currently the call to device_property_read_u32_array is not error checked
leading to potential garbage values in the delays array that are then used
in msleep delays. Add a sanity check to the property fetching.
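The pattern can be sketched in plain C as below. The reader function is a hypothetical stand-in (it is not device_property_read_u32_array()), and the three-element array is an assumption made for the example:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a property read: on failure it returns -1 and
 * leaves out[] untouched, which is what makes an unchecked call dangerous.
 */
int demo_read_u32_array(const uint32_t *prop, size_t have,
			uint32_t *out, size_t want)
{
	if (!prop || have < want)
		return -1;
	for (size_t i = 0; i < want; i++)
		out[i] = prop[i];
	return 0;
}

/* Sum of the delays, or -1 if the property could not be read. */
long demo_total_delay(const uint32_t *prop, size_t have)
{
	uint32_t delays[3];

	/* Bail out before delays[] is used; it is uninitialized on failure. */
	if (demo_read_u32_array(prop, have, delays, 3))
		return -1;
	return (long)delays[0] + (long)delays[1] + (long)delays[2];
}

static const uint32_t demo_prop[3] = { 10, 20, 30 };

long demo_total_delay_present(void)
{
	return demo_total_delay(demo_prop, 3);
}
```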
Addresses-Coverity: ("Uninitialized scalar variable")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geert Uytterhoeven [Mon, 17 Jun 2019 14:38:36 +0000 (16:38 +0200)]
net: hns3: Add missing newline at end of file
"git diff" says:
\ No newline at end of file
after modifying the file.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 19 Jun 2019 00:22:53 +0000 (02:22 +0200)]
Merge branch 'bpf-bounded-loops'
Alexei Starovoitov says:
====================
v2->v3: fixed issues in backtracking pointed out by Andrii.
The next step is to add a lot more tests for backtracking.
v1->v2: addressed Andrii's feedback.
this patch set introduces verifier support for bounded loops and
adds several other improvements.
Ideally they would be introduced one at a time,
but to support bounded loop the verifier needs to 'step back'
in the patch 1. That patch introduces tracking of spill/fill
of constants through the stack. Though it's a useful feature
it hurts cilium tests.
Patch 3 introduces another feature by extending is_branch_taken
logic to 'if rX op rY' conditions. This feature is also
necessary to support bounded loops.
Then patch 4 adds support for the loops while adding
key heuristics with jmp_processed.
Introduction of parentage chain of verifier states in patch 4
allows patch 9 to add backtracking of precise scalar registers
which finally resolves degradation from patch 1.
The end result is much faster verifier for existing programs
and new support for loops.
See patch 8 for many kinds of loops that are now validated.
Patch 9 is the most tricky one and could be rewritten with
a different algorithm in the future.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Alexei Starovoitov [Sat, 15 Jun 2019 19:12:25 +0000 (12:12 -0700)]
bpf: precise scalar_value tracking
Introduce precision tracking logic that
helps cilium programs the most:
                     old clang  old clang         new clang  new clang
                                with all patches             with all patches
bpf_lb-DLB_L3.o      1838       2283              1923       1863
bpf_lb-DLB_L4.o      3218       2657              3077       2468
bpf_lb-DUNKNOWN.o    1064       545               1062       544
bpf_lxc-DDROP_ALL.o  26935      23045             166729     22629
bpf_lxc-DUNKNOWN.o   34439      35240             174607     28805
bpf_netdev.o         9721       8753              8407       6801
bpf_overlay.o        6184       7901              5420       4754
bpf_lxc_jit.o        39389      50925             39389      50925
Consider code:
654: (85) call bpf_get_hash_recalc#34
655: (bf) r7 = r0
656: (15) if r8 == 0x0 goto pc+29
657: (bf) r2 = r10
658: (07) r2 += -48
659: (18) r1 = 0xffff8881e41e1b00
661: (85) call bpf_map_lookup_elem#1
662: (15) if r0 == 0x0 goto pc+23
663: (69) r1 = *(u16 *)(r0 +0)
664: (15) if r1 == 0x0 goto pc+21
665: (bf) r8 = r7
666: (57) r8 &= 65535
667: (bf) r2 = r8
668: (3f) r2 /= r1
669: (2f) r2 *= r1
670: (bf) r1 = r8
671: (1f) r1 -= r2
672: (57) r1 &= 255
673: (25) if r1 > 0x1e goto pc+12
R0=map_value(id=0,off=0,ks=20,vs=64,imm=0) R1_w=inv(id=0,umax_value=30,var_off=(0x0; 0x1f))
674: (67) r1 <<= 1
675: (0f) r0 += r1
At this point the verifier will notice that scalar R1 is used in map pointer adjustment.
R1 has to be precise for later operations on R0 to be validated properly.
The verifier will backtrack the above code in the following way:
last_idx 675 first_idx 664
regs=2 stack=0 before 675: (0f) r0 += r1 // started backtracking R1 regs=2 is a bitmask
regs=2 stack=0 before 674: (67) r1 <<= 1
regs=2 stack=0 before 673: (25) if r1 > 0x1e goto pc+12
regs=2 stack=0 before 672: (57) r1 &= 255
regs=2 stack=0 before 671: (1f) r1 -= r2 // now both R1 and R2 has to be precise -> regs=6 mask
regs=6 stack=0 before 670: (bf) r1 = r8 // after this insn R8 and R2 has to be precise
regs=104 stack=0 before 669: (2f) r2 *= r1 // after this one R8, R2, and R1
regs=106 stack=0 before 668: (3f) r2 /= r1
regs=106 stack=0 before 667: (bf) r2 = r8
regs=102 stack=0 before 666: (57) r8 &= 65535
regs=102 stack=0 before 665: (bf) r8 = r7
regs=82 stack=0 before 664: (15) if r1 == 0x0 goto pc+21
// this is the end of verifier state. The following regs will be marked precised:
R1_rw=invP(id=0,umax_value=65535,var_off=(0x0; 0xffff)) R7_rw=invP(id=0)
parent didn't have regs=82 stack=0 marks // so backtracking continues into parent state
last_idx 663 first_idx 655
regs=82 stack=0 before 663: (69) r1 = *(u16 *)(r0 +0) // R1 was assigned no need to track it further
regs=80 stack=0 before 662: (15) if r0 == 0x0 goto pc+23 // keep tracking R7
regs=80 stack=0 before 661: (85) call bpf_map_lookup_elem#1 // keep tracking R7
regs=80 stack=0 before 659: (18) r1 = 0xffff8881e41e1b00
regs=80 stack=0 before 658: (07) r2 += -48
regs=80 stack=0 before 657: (bf) r2 = r10
regs=80 stack=0 before 656: (15) if r8 == 0x0 goto pc+29
regs=80 stack=0 before 655: (bf) r7 = r0 // here the assignment into R7
// mark R0 to be precise:
R0_rw=invP(id=0)
parent didn't have regs=1 stack=0 marks // regs=1 -> tracking R0
last_idx 654 first_idx 644
regs=1 stack=0 before 654: (85) call bpf_get_hash_recalc#34 // and in the parent frame it was a return value
// nothing further to backtrack
Two scalar registers not marked precise are equivalent from state pruning point of view.
More details in the patch comments.
It doesn't support bpf2bpf calls yet and is enabled for root only.
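The register bitmask walk in the log above can be reproduced with a much simplified model. The sketch below is not the verifier's code: the instruction classes, struct layout and names are hypothetical, and stack slots, calls and state boundaries are ignored. It walks instructions 674 down to 664 backwards, starting with only R1 (mask 0x2) needing precision:

```c
#include <stddef.h>
#include <stdint.h>

enum demo_op {
	DEMO_MOV_REG, /* dst = src */
	DEMO_ALU_REG, /* dst op= src */
	DEMO_ALU_IMM, /* dst op= constant */
	DEMO_LOAD,    /* dst = value from memory or an immediate */
	DEMO_JMP_IMM, /* if dst op constant goto ... */
};

struct demo_insn {
	enum demo_op op;
	uint8_t dst, src;
};

/* Walk prog[] backwards and return the registers that must be precise before it. */
uint32_t demo_backtrack(const struct demo_insn *prog, size_t n, uint32_t regs)
{
	while (n--) {
		uint32_t dst = 1u << prog[n].dst;
		uint32_t src = 1u << prog[n].src;

		if (!(regs & dst))
			continue; /* insn does not touch a tracked register */
		switch (prog[n].op) {
		case DEMO_MOV_REG: /* precision requirement moves to the source */
			regs = (regs & ~dst) | src;
			break;
		case DEMO_ALU_REG: /* both operands now need to be precise */
			regs |= src;
			break;
		case DEMO_LOAD:    /* dst was assigned; stop tracking it */
			regs &= ~dst;
			break;
		case DEMO_ALU_IMM:
		case DEMO_JMP_IMM: /* no new register dependency */
			break;
		}
	}
	return regs;
}

/* Instructions 664..674 from the example above. */
static const struct demo_insn demo_prog[] = {
	{ DEMO_JMP_IMM, 1, 0 }, /* 664: if r1 == 0x0 goto ... */
	{ DEMO_MOV_REG, 8, 7 }, /* 665: r8 = r7 */
	{ DEMO_ALU_IMM, 8, 0 }, /* 666: r8 &= 65535 */
	{ DEMO_MOV_REG, 2, 8 }, /* 667: r2 = r8 */
	{ DEMO_ALU_REG, 2, 1 }, /* 668: r2 /= r1 */
	{ DEMO_ALU_REG, 2, 1 }, /* 669: r2 *= r1 */
	{ DEMO_MOV_REG, 1, 8 }, /* 670: r1 = r8 */
	{ DEMO_ALU_REG, 1, 2 }, /* 671: r1 -= r2 */
	{ DEMO_ALU_IMM, 1, 0 }, /* 672: r1 &= 255 */
	{ DEMO_JMP_IMM, 1, 0 }, /* 673: if r1 > 0x1e goto ... */
	{ DEMO_ALU_IMM, 1, 0 }, /* 674: r1 <<= 1 */
};

uint32_t demo_backtrack_r1(void)
{
	return demo_backtrack(demo_prog, sizeof(demo_prog) / sizeof(demo_prog[0]),
			      1u << 1);
}
```

The result is 0x82 (R1 and R7), the same mask the log shows before instruction 664. Note that the log prints the masks in hex without a 0x prefix.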
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Alexei Starovoitov [Sat, 15 Jun 2019 19:12:24 +0000 (12:12 -0700)]
selftests/bpf: add realistic loop tests
Add a bunch of loop tests. Most of them are created by replacing
'#pragma unroll' with '#pragma clang loop unroll(disable)'
Several tests are artificially large:
/* partial unroll. llvm will unroll loop ~150 times.
* C loop count -> 600.
* Asm loop count -> 4.
* 16k insns in loop body.
* Total of 5 such loops. Total program size ~82k insns.
*/
"./pyperf600.o",
/* no unroll at all.
* C loop count -> 600.
* ASM loop count -> 600.
* ~110 insns in loop body.
* Total of 5 such loops. Total program size ~1500 insns.
*/
"./pyperf600_nounroll.o",
/* partial unroll. 19k insn in a loop.
* Total program size 20.8k insn.
* ~350k processed_insns
*/
"./strobemeta.o",
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Alexei Starovoitov [Sat, 15 Jun 2019 19:12:23 +0000 (12:12 -0700)]
selftests/bpf: add basic verifier tests for loops
This set of tests is a rewrite of Edward's earlier tests:
https://patchwork.ozlabs.org/patch/877221/
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Alexei Starovoitov [Sat, 15 Jun 2019 19:12:22 +0000 (12:12 -0700)]
selftests/bpf: fix tests
Fix tests that assumed no loops.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Alexei Starovoitov [Sat, 15 Jun 2019 19:12:21 +0000 (12:12 -0700)]
bpf: fix callees pruning callers
The commit
7640ead93924 partially resolved the issue of callees
incorrectly pruning the callers.
With the introduction of bounded loops and the jmps_processed heuristic,
a single verifier state may contain multiple branches and calls.
It's possible that new verifier state (for future pruning) will be
allocated inside callee. Then callee will exit (still within the same
verifier state). It will go back to the caller and there R6-R9 registers
will be read and will trigger mark_reg_read. But the reg->live for all frames
but the top frame is not set to LIVE_NONE. Hence mark_reg_read will fail
to propagate liveness into parent and future walking will incorrectly
conclude that the states are equivalent because LIVE_READ is not set.
In other words the rule for parent/live should be:
whenever register parentage chain is set the reg->live should be set to LIVE_NONE.
is_state_visited logic already follows this rule for spilled registers.
Fixes:
7640ead93924 ("bpf: verifier: make sure callees don't prune with caller differences")
Fixes:
f4d7e40a5b71 ("bpf: introduce function calls (verification)")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Alexei Starovoitov [Sat, 15 Jun 2019 19:12:20 +0000 (12:12 -0700)]
bpf: introduce bounded loops
Allow the verifier to validate the loops by simulating their execution.
Existing programs have used '#pragma unroll' to unroll the loops
by the compiler. Instead let the verifier simulate all iterations
of the loop.
In order to do that introduce parentage chain of bpf_verifier_state and
'branches' counter for the number of branches left to explore.
See more detailed algorithm description in bpf_verifier.h
This algorithm borrows the key idea from Edward Cree's approach:
https://patchwork.ozlabs.org/patch/877222/
Additional state pruning heuristics make such brute force loop walk
practical even for large loops.
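A toy model of what "simulating all iterations" costs, assuming a counted loop with a known bound and a fixed instruction budget. The function, its parameters and the budget are hypothetical, and state pruning is deliberately not modeled:

```c
/*
 * Walk a loop of the form "for (i = 0; i < bound; i++) { body }" one
 * iteration at a time. Returns the number of instructions processed, or
 * -1 if the budget runs out first (the program would be rejected).
 * Assumes bound, body_insns and budget are non-negative.
 */
long demo_walk_loop(long bound, long body_insns, long budget)
{
	long processed = 0;

	for (long i = 0; ; i++) {
		processed++; /* the loop condition check itself */
		if (processed > budget)
			return -1;
		if (i >= bound)
			return processed; /* exit branch taken, loop is done */
		if (body_insns > budget - processed)
			return -1;
		processed += body_insns;
	}
}
```

The cost grows with the iteration count, which is why the pruning heuristics mentioned above matter for large loops.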
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Alexei Starovoitov [Sat, 15 Jun 2019 19:12:19 +0000 (12:12 -0700)]
bpf: extend is_branch_taken to registers
This patch extends is_branch_taken() logic from JMP+K instructions
to JMP+X instructions.
Conditional branches are often done when src and dst registers
contain known scalars. In such case the verifier can follow
the branch that is going to be taken when program executes.
That speeds up the verification and is an essential feature to support
bounded loops.
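A minimal sketch of the decision, assuming both operands are either fully known constants or fully unknown, and using a 1 / 0 / -1 result for taken / not taken / unknown. The types and names are hypothetical, only unsigned comparisons are covered, and the value ranges the verifier tracks are ignored:

```c
#include <stdint.h>

enum demo_jmp { DEMO_JEQ, DEMO_JNE, DEMO_JGT, DEMO_JLT };

struct demo_scalar {
	int known;    /* non-zero if val is a known constant */
	uint64_t val;
};

struct demo_scalar demo_const(uint64_t v)
{
	struct demo_scalar s = { 1, v };
	return s;
}

struct demo_scalar demo_unknown(void)
{
	struct demo_scalar s = { 0, 0 };
	return s;
}

/* 1: branch always taken, 0: never taken, -1: cannot tell, explore both. */
int demo_branch_taken(struct demo_scalar dst, struct demo_scalar src,
		      enum demo_jmp op)
{
	if (!dst.known || !src.known)
		return -1;
	switch (op) {
	case DEMO_JEQ: return dst.val == src.val;
	case DEMO_JNE: return dst.val != src.val;
	case DEMO_JGT: return dst.val > src.val;
	case DEMO_JLT: return dst.val < src.val;
	}
	return -1;
}
```

When the answer is 1 or 0, only one successor needs to be explored, which is what lets a loop's exit condition be resolved per iteration.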
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Alexei Starovoitov [Sat, 15 Jun 2019 19:12:18 +0000 (12:12 -0700)]
selftests/bpf: fix tests due to const spill/fill
fix tests that incorrectly assumed that the verifier
cannot track constants through stack.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Alexei Starovoitov [Sat, 15 Jun 2019 19:12:17 +0000 (12:12 -0700)]
bpf: track spill/fill of constants
Compilers often spill induction variables into the stack,
hence it is necessary for the verifier to track scalar values
of the registers through stack slots.
Also, a few bpf programs were incorrectly rejected in the past,
since the verifier was not able to track such constants while
they were used to compute offsets into packet headers.
Tracking constants through the stack significantly decreases
the chances of state pruning, since two different constants
are considered to be different by state equivalency.
The end result is that cilium tests suffer serious degradation in the number
of states processed and a corresponding verification time increase.
                     before  after
bpf_lb-DLB_L3.o      1838    6441
bpf_lb-DLB_L4.o      3218    5908
bpf_lb-DUNKNOWN.o    1064    1064
bpf_lxc-DDROP_ALL.o  26935   93790
bpf_lxc-DUNKNOWN.o   34439   123886
bpf_netdev.o         9721    31413
bpf_overlay.o        6184    18561
bpf_lxc_jit.o        39389   359445
After further debugging, it turned out that cilium progs are
getting hurt by clang due to the same constant tracking issue.
Newer clang generates better code by spilling less to the stack.
Instead it keeps more constants in the registers which
hurts state pruning since the verifier already tracks constants
in the registers:
                     old clang  new clang
                     (no spill/fill tracking introduced by this patch)
bpf_lb-DLB_L3.o      1838       1923
bpf_lb-DLB_L4.o      3218       3077
bpf_lb-DUNKNOWN.o    1064       1062
bpf_lxc-DDROP_ALL.o  26935      166729
bpf_lxc-DUNKNOWN.o   34439      174607
bpf_netdev.o         9721       8407
bpf_overlay.o        6184       5420
bpf_lxc_jit.o        39389      39389
The final table is depressing:
                     old clang  old clang         new clang  new clang
                                const spill/fill             const spill/fill
bpf_lb-DLB_L3.o      1838       6441              1923       8128
bpf_lb-DLB_L4.o      3218       5908              3077       6707
bpf_lb-DUNKNOWN.o    1064       1064              1062       1062
bpf_lxc-DDROP_ALL.o  26935      93790             166729     380712
bpf_lxc-DUNKNOWN.o   34439      123886            174607     440652
bpf_netdev.o         9721       31413             8407       31904
bpf_overlay.o        6184       18561             5420       23569
bpf_lxc_jit.o        39389      359445            39389      359445
Tracking constants in the registers hurts state pruning already.
Adding tracking of constants through stack hurts pruning even more.
A later patch addresses this general constant tracking issue
with coarse/precise logic.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Andrii Nakryiko [Mon, 17 Jun 2019 22:48:58 +0000 (15:48 -0700)]
libbpf: constify getter APIs
Add const qualifiers to bpf_object/bpf_program/bpf_map arguments for
getter APIs. There is no need for them to not be const pointers.
Verified that
make -C tools/lib/bpf
make -C tools/testing/selftests/bpf
make -C tools/perf
all build without warnings.
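A small sketch of what a constified getter looks like. The struct and function are hypothetical and only modeled loosely on the libbpf naming style; they are not the library's definitions:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical object type; libbpf's real structs are opaque to callers. */
struct demo_map {
	const char *name;
	int fd;
};

/* A getter only reads the object, so it can take a pointer to const. */
const char *demo_map__name(const struct demo_map *map)
{
	return map ? map->name : NULL;
}

/* Callers holding a const object can now use the getter without a cast. */
static const struct demo_map demo_counts = { "counts", 3 };

int demo_name_is_counts(void)
{
	return strcmp(demo_map__name(&demo_counts), "counts") == 0;
}
```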
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Arnd Bergmann [Mon, 17 Jun 2019 13:04:49 +0000 (15:04 +0200)]
qed: Fix -Wmaybe-uninitialized false positive
A previous attempt to shut up the uninitialized variable use
warning was apparently insufficient. When CONFIG_PROFILE_ANNOTATED_BRANCHES
is set, gcc-8 still warns, because the unlikely() check in DP_NOTICE()
causes it to no longer track the state of all variables correctly:
drivers/net/ethernet/qlogic/qed/qed_dev.c: In function 'qed_llh_set_ppfid_affinity':
drivers/net/ethernet/qlogic/qed/qed_dev.c:798:47: error: 'abs_ppfid' may be used uninitialized in this function [-Werror=maybe-uninitialized]
addr = NIG_REG_PPF_TO_ENGINE_SEL + abs_ppfid * 0x4;
~~~~~~~~~~^~~~~
This is not a nice workaround, but always initializing the output from
qed_llh_abs_ppfid() at least shuts up the false positive reliably.
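The shape of such a workaround, sketched with hypothetical names and values (this is not the qed function): the output parameter is written before any early return, so the compiler sees a defined value on every path.

```c
/*
 * Hypothetical translation helper. Writing *abs_out first means callers
 * never observe an uninitialized value, even when an error is returned.
 */
int demo_abs_id(int rel, unsigned char *abs_out)
{
	*abs_out = 0;

	if (rel < 0 || rel >= 8)
		return -1;
	*abs_out = (unsigned char)(rel + 16);
	return 0;
}

/* Returns the translated id, or 0 together with an ignored error. */
int demo_abs_id_value(int rel)
{
	unsigned char abs_id;

	(void)demo_abs_id(rel, &abs_id);
	return abs_id;
}
```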
Fixes:
79284adeb99e ("qed: Add llh ppfid interface and 100g support for offload protocols")
Fixes:
8e2ea3ea9625 ("qed: Fix static checker warning")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geert Uytterhoeven [Mon, 17 Jun 2019 11:50:44 +0000 (13:50 +0200)]
ps3_gelic: Use [] to denote a flexible array member
Flexible array members should be denoted using [] instead of [0], else
gcc will not warn when they are no longer at the end of a struct.
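For illustration, a flexible array member declared with [] and allocated together with its payload. The struct is hypothetical and unrelated to the ps3_gelic descriptors:

```c
#include <stdlib.h>
#include <string.h>

struct demo_descr {
	size_t len;
	unsigned char data[]; /* flexible array member: must be the last member */
};

/* Allocate the header and len payload bytes in one block. */
struct demo_descr *demo_descr_new(const void *src, size_t len)
{
	struct demo_descr *d = malloc(sizeof(*d) + len);

	if (!d)
		return NULL;
	d->len = len;
	if (len)
		memcpy(d->data, src, len);
	return d;
}

/* Returns the third payload byte of a small test descriptor, or -1. */
int demo_descr_third_byte(void)
{
	static const unsigned char payload[4] = { 1, 2, 3, 4 };
	struct demo_descr *d = demo_descr_new(payload, sizeof(payload));
	int val;

	if (!d)
		return -1;
	val = d->data[2];
	free(d);
	return val;
}
```

If another member were added after data[], the compiler would reject the declaration, whereas the [0] spelling would be accepted silently.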
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Denis Kirjanov [Mon, 17 Jun 2019 08:53:41 +0000 (10:53 +0200)]
ipoib: show VF broadcast address
In the IPoIB case we can't see a VF broadcast address, but we
can see one for the PF.
Before:
11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast
state UP mode DEFAULT group default qlen 256
link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
vf 0 MAC 14:80:00:00:66:fe, spoof checking off, link-state disable,
trust off, query_rss off
...
After:
11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast
state UP mode DEFAULT group default qlen 256
link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
vf 0 link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof
checking off, link-state disable, trust off, query_rss off
v1->v2: add the IFLA_VF_BROADCAST constant
v2->v3: put IFLA_VF_BROADCAST at the end
to avoid KABI breakage and set NLA_REJECT
dev_setlink
Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Acked-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Denis Kirjanov [Mon, 17 Jun 2019 08:53:40 +0000 (10:53 +0200)]
ipoib: correctly show a VF hardware address
In the case of IPoIB with SRIOV-enabled hardware, the
ip link show command incorrectly prints
0 instead of a VF hardware address.
Before:
11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast
state UP mode DEFAULT group default qlen 256
link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
vf 0 MAC 00:00:00:00:00:00, spoof checking off, link-state disable,
trust off, query_rss off
...
After:
11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast
state UP mode DEFAULT group default qlen 256
link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
vf 0 link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof
checking off, link-state disable, trust off, query_rss off
v1->v2: just copy an address without modifying ifla_vf_mac
v2->v3: update the changelog
Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Acked-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 18 Jun 2019 16:45:37 +0000 (09:45 -0700)]
Merge branch 'mlxsw-Improve-IPv6-route-insertion-rate'
Ido Schimmel says:
====================
mlxsw: Improve IPv6 route insertion rate
Unlike IPv4, an IPv6 multipath route in the kernel is composed from
multiple sibling routes, each representing a single nexthop.
Therefore, an addition of a multipath route with N nexthops translates
to N in-kernel notifications. This is inefficient for device drivers
that need to program the route to the underlying device. Each time a new
nexthop is appended, a new nexthop group needs to be constructed and the
old one deleted.
This patchset improves the situation by sending a single notification
for a multipath route addition / deletion instead of one per-nexthop.
When adding thousands of multipath routes with 16 nexthops, I measured
an improvement of about x10 in the insertion rate.
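The cost difference can be made concrete with a small counting model. It assumes, as described above, that every appended nexthop forces a new group to be built and the previous one to be deleted. The names are hypothetical and nothing here is mlxsw code:

```c
struct demo_cost {
	unsigned int groups_created;
	unsigned int groups_deleted;
};

/* One notification per nexthop: the group is rebuilt on every append. */
struct demo_cost demo_per_nexthop_cost(unsigned int nexthops)
{
	struct demo_cost c = { 0, 0 };

	for (unsigned int i = 0; i < nexthops; i++) {
		c.groups_created++;       /* new group including nexthop i */
		if (i > 0)
			c.groups_deleted++; /* previous, smaller group is dropped */
	}
	return c;
}

/* One notification for the whole route: the final group is built once. */
struct demo_cost demo_single_notification_cost(unsigned int nexthops)
{
	struct demo_cost c = { nexthops ? 1u : 0u, 0 };

	return c;
}
```

For a route with 16 nexthops the model gives 16 group constructions and 15 deletions in the per-nexthop case, against a single construction otherwise.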
Patches #1-#3 add a flag that indicates that in-kernel notifications
need to be suppressed and extend the IPv6 FIB notification info with
information about the number of sibling routes that are being notified.
Patches #4-#5 adjust the two current listeners to these notifications to
ignore notifications about IPv6 multipath routes.
Patches #6-#7 add add / delete notifications for IPv6 multipath routes.
Patches #8-#14 do the same for mlxsw.
Patch #15 finally removes the limitations added in patches #4-#5 and
stops the kernel from sending a notification for each added / deleted
nexthop.
Patch #16 adds test cases.
v2 (David Ahern):
* Remove patch adjusting netdevsim to consume resources for each
fib6_info. Instead, consume one resource for the entire multipath
route
* Remove 'multipath_rt' usage in patch #10
* Remove 'multipath_rt' from 'struct fib6_entry_notifier_info' in patch
#15. The member is only removed in this patch to prevent drivers from
processing multipath routes twice during the series
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 18 Jun 2019 15:12:58 +0000 (18:12 +0300)]
selftests: mlxsw: Add a test for FIB offload indication
Test that the offload indication for unicast routes is correctly set in
different scenarios. IPv4 support will be added in the future.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 18 Jun 2019 15:12:57 +0000 (18:12 +0300)]
ipv6: Stop sending in-kernel notifications for each nexthop
Both listeners - mlxsw and netdevsim - of IPv6 FIB notifications are now
ready to handle IPv6 multipath notifications.
Therefore, stop ignoring such notifications in both drivers and stop
sending notification for each added / deleted nexthop.
v2:
* Remove 'multipath_rt' from 'struct fib6_entry_notifier_info'
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 18 Jun 2019 15:12:56 +0000 (18:12 +0300)]
mlxsw: spectrum_router: Create IPv6 multipath routes in one go
Allow the driver to create an IPv6 multipath route in one go by passing
an array of sibling routes and iterating over them.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 18 Jun 2019 15:12:55 +0000 (18:12 +0300)]
mlxsw: spectrum_router: Add / delete multiple IPv6 nexthops
Currently, the functions that take care of populating IPv6 nexthop
groups only add / delete a single nexthop.
Prepare them to handle multiple routes in one notification by passing an
array of routes and adding / deleting all of them.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 18 Jun 2019 15:12:54 +0000 (18:12 +0300)]
mlxsw: spectrum_router: Pass array of routes to route handling functions
Prepare the driver to handle multiple routes in a single notification by
passing an array of routes to the functions that actually add / delete a
route.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 18 Jun 2019 15:12:53 +0000 (18:12 +0300)]
mlxsw: spectrum_router: Adjust IPv6 replace logic to new notifications
Previously, IPv6 replace notifications were only sent from
fib6_add_rt2node(). The function only emitted such notifications if a
route actually replaced another route.
A previous patch added another call site in ip6_route_multipath_add()
from which such notification can be emitted even if a route was merely
added and did not replace another route.
Adjust the driver to take this into account and potentially set the
'replace' flag to 'false' if the notified route did not replace an
existing route.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 18 Jun 2019 15:12:52 +0000 (18:12 +0300)]
mlxsw: spectrum_router: Pass multiple routes to work item
Prepare the driver to process IPv6 multipath notifications by passing an
array of 'struct fib6_info' instead of just one route.
A reference is taken on each sibling route in order to prevent them from
being freed until they are processed by the workqueue.
v2:
* Remove 'multipath_rt' usage
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 18 Jun 2019 15:12:51 +0000 (18:12 +0300)]
mlxsw: spectrum_router: Prepare function to return errors
The function mlxsw_sp_router_fib6_event() takes care of preparing the
needed information for the work item that actually inserts the route
into the device.
When processing an IPv6 multipath route, the function will need to
allocate an array to store pointers to all the sibling routes.
Change the function's signature to return an error code and adjust the
single call site.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 18 Jun 2019 15:12:50 +0000 (18:12 +0300)]
mlxsw: spectrum_router: Remove processing of IPv6 append notifications
No such notifications are sent by the IPv6 code, so remove them.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 18 Jun 2019 15:12:49 +0000 (18:12 +0300)]
ipv6: Add IPv6 multipath notification for route delete
If all the nexthops of a multipath route are being deleted, send one
notification for the entire route, instead of one per-nexthop.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 18 Jun 2019 15:12:48 +0000 (18:12 +0300)]
ipv6: Add IPv6 multipath notifications for add / replace
Emit a notification when a multipath route is added or replaced.
Note that unlike the replace notifications sent from fib6_add_rt2node(),
it is possible we are sending a 'FIB_EVENT_ENTRY_REPLACE' when a route
was merely added and not replaced.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 18 Jun 2019 15:12:47 +0000 (18:12 +0300)]
netdevsim: Ignore IPv6 multipath notifications
In a similar fashion to the previous patch, have netdevsim ignore IPv6
multipath notifications for now.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 18 Jun 2019 15:12:46 +0000 (18:12 +0300)]
mlxsw: spectrum_router: Ignore IPv6 multipath notifications
IPv6 multipath notifications are about to be sent, but mlxsw is not
ready to process them, so ignore them.
The limitation will be lifted by a subsequent patch which will also stop
the kernel from sending a notification for each nexthop.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 18 Jun 2019 15:12:45 +0000 (18:12 +0300)]
ipv6: Extend notifier info for multipath routes
Extend the IPv6 FIB notifier info with number of sibling routes being
notified.
This will later allow listeners to process one notification for a
multipath route instead of N, where N is the number of nexthops.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 18 Jun 2019 15:12:44 +0000 (18:12 +0300)]
netlink: Add field to skip in-kernel notifications
'struct nl_info' includes a 'skip_notify' flag that indicates if netlink
notifications to user space should be suppressed. As explained in commit
3b1137fe7482 ("net: ipv6: Change notifications for multipath add to
RTA_MULTIPATH"), this is useful to suppress per-nexthop RTM_NEWROUTE
notifications when an IPv6 multipath route is added / deleted. Instead,
one notification is sent for the entire multipath route.
This concept is also useful for in-kernel notifications. Sending one
in-kernel notification for the addition / deletion of an IPv6 multipath
route - instead of one per-nexthop - provides a significant increase in
the insertion / deletion rate to underlying devices.
Add a 'skip_notify_kernel' flag to suppress in-kernel notifications.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
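The effect of the new flag on the number of in-kernel notifications can be
sketched in plain userspace C. This is only an illustration under stated
assumptions: 'struct sketch_nl_info', sketch_insert_nexthop() and
sketch_add_multipath() are hypothetical names, not the kernel's data
structures or functions, and only the 'skip_notify_kernel' flag name is
taken from the commit message above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the netlink info passed down the add path. */
struct sketch_nl_info {
    unsigned int skip_notify_kernel:1;  /* suppress per-nexthop in-kernel events */
};

/* Insert one nexthop; returns how many in-kernel notifications were sent. */
unsigned int sketch_insert_nexthop(const struct sketch_nl_info *info)
{
    assert(info != NULL);
    return info->skip_notify_kernel ? 0 : 1;
}

/*
 * Add a multipath route with 'nexthops' nexthops.
 * With 'batch' set, per-nexthop notifications are suppressed and a single
 * notification is sent for the whole route once all nexthops are in place.
 * Returns the total number of in-kernel notifications sent.
 */
unsigned int sketch_add_multipath(unsigned int nexthops, int batch)
{
    struct sketch_nl_info info = { 0 };
    unsigned int sent = 0;
    unsigned int i;

    info.skip_notify_kernel = batch ? 1 : 0;

    for (i = 0; i < nexthops; i++)
        sent += sketch_insert_nexthop(&info);

    if (batch && nexthops > 0)
        sent += 1;  /* one notification covering the entire route */

    return sent;
}
```

In this sketch a 16-nexthop route costs 16 notifications without the flag
and one with it, which is where the improved insertion / deletion rate
described above would come from.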
Ido Schimmel [Tue, 18 Jun 2019 15:12:43 +0000 (18:12 +0300)]
netlink: Document all fields of 'struct nl_info'
Some fields were not documented. Add documentation.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 18 Jun 2019 16:33:15 +0000 (09:33 -0700)]
Merge branch '40GbE' of git://git./linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2019-06-17
This series contains updates to the iavf driver only.
Akeem updates the driver to change how VLAN tags are being populated and
programmed into the hardware by starting from the first member of the
list until the number of allowed VLAN tags is exhausted.
Mitch changes a counter variable to a signed integer, since it starts out
negative and climbs to zero. He also increases the timeout to avoid
spurious errors, and fixes the driver to handle the case where the
hardware hands us a null receive descriptor that has no data attached,
yet is still valid.
Aleksandr fixes the driver to use GFP_ATOMIC when allocating memory in
atomic context.
Avinash updates the driver to fix a calculation error in virtchnl
regarding the valid length.
Jakub refactors the command processing in the watchdog state machine to
reduce the length and complexity of the function. He also declares the
watchdog task as delayed work and uses a dedicated work queue to
service the driver tasks.
Paul updated the iavf_process_aq_command to call the necessary functions
to be able to clear cloud filter bits that need to be cleared.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Shalom Toledo [Tue, 18 Jun 2019 12:45:35 +0000 (12:45 +0000)]
mlxsw: spectrum_ptp: Fix compilation on 32-bit ARM
Compilation on 32-bit ARM fails after commit 992aa864dca0 ("mlxsw:
spectrum_ptp: Add implementation for physical hardware clock operations")
because of 64-bit division:
arm-linux-gnueabi-ld:
drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.o: in function
`mlxsw_sp1_ptp_phc_settime': spectrum_ptp.c:(.text+0x39c): undefined
reference to `__aeabi_uldivmod'
Fix by using div_u64().
Fixes: 992aa864dca0 ("mlxsw: spectrum_ptp: Add implementation for physical hardware clock operations")
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
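A rough userspace sketch of the idea behind the fix follows. On 32-bit ARM
a plain '/' with a 64-bit dividend makes the compiler emit a call to
`__aeabi_uldivmod', which the link error above shows is not available to
the driver. The kernel's div_u64() helper takes a 64-bit dividend and a
32-bit divisor instead. The functions sketch_div_u64() and
sketch_ns_to_sec() are hypothetical stand-ins written for illustration;
they are not the kernel implementation and not the code changed in
spectrum_ptp.c.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a 64-by-32 division helper: divide using only
 * shifts, compares and subtraction, so that no 64-bit '/' operator (and
 * therefore no compiler division helper) is needed.
 */
uint64_t sketch_div_u64(uint64_t dividend, uint32_t divisor)
{
    uint64_t quotient = 0;
    uint64_t remainder = 0;
    int bit;

    assert(divisor != 0);

    /* Classic restoring long division, one dividend bit per step. */
    for (bit = 63; bit >= 0; bit--) {
        remainder = (remainder << 1) | ((dividend >> bit) & 1);
        if (remainder >= divisor) {
            remainder -= divisor;
            quotient |= (uint64_t)1 << bit;
        }
    }
    return quotient;
}

/* Example use: convert nanoseconds to whole seconds (1e9 fits in 32 bits). */
uint64_t sketch_ns_to_sec(uint64_t ns)
{
    return sketch_div_u64(ns, 1000000000U);
}
```

The remainder stays below the 32-bit divisor before each shift, so it can
never overflow 64 bits inside the loop.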
David S. Miller [Tue, 18 Jun 2019 02:48:13 +0000 (19:48 -0700)]
Merge git://git./linux/kernel/git/davem/net
Honestly all the conflicts were simple overlapping changes,
nothing really interesting to report.
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 17 Jun 2019 23:30:38 +0000 (16:30 -0700)]
Merge branch 'UDP-GSO-audit-tests'
Fred Klassen says:
====================
UDP GSO audit tests
Updates to UDP GSO selftests to optionally stress test the CMSG
subsystem, and report the reliability and performance of both
TX Timestamping and ZEROCOPY messages.
====================
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fred Klassen [Mon, 17 Jun 2019 19:08:37 +0000 (12:08 -0700)]
net/udpgso_bench.sh test fails on error
Ensure that failure on any individual test results in an overall
failure of the test script.
Signed-off-by: Fred Klassen <fklassen@appneta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fred Klassen [Mon, 17 Jun 2019 19:08:36 +0000 (12:08 -0700)]
net/udpgso_bench.sh add UDP GSO audit tests
Audit tests count the total number of messages sent and compare it
with the total number of CMSGs received on the error queue. Example:
udp gso zerocopy timestamp audit
udp rx: 1599 MB/s 1166414 calls/s
udp tx: 1615 MB/s 27395 calls/s 27395 msg/s
udp rx: 1634 MB/s 1192261 calls/s
udp tx: 1633 MB/s 27699 calls/s 27699 msg/s
udp rx: 1633 MB/s 1191358 calls/s
udp tx: 1631 MB/s 27678 calls/s 27678 msg/s
Summary over 4.000 seconds...
sum udp tx: 1665 MB/s 82772 calls (27590/s) 82772 msgs (27590/s)
Tx Timestamps: 82772 received 0 errors
Zerocopy acks: 82772 received
Errors are thrown if CMSG count does not equal send count,
example:
Summary over 4.000 seconds...
sum tcp tx: 7451 MB/s 493706 calls (123426/s) 493706 msgs (123426/s)
./udpgso_bench_tx: Unexpected number of Zerocopy completions: 493706 expected 493704 received
Also reduce individual test time from 4 to 3 seconds so that
overall test time does not increase significantly.
v3: Enhancements as per Willem de Bruijn <willemb@google.com>
- document -P option for TCP audit
Signed-off-by: Fred Klassen <fklassen@appneta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fred Klassen [Mon, 17 Jun 2019 19:08:35 +0000 (12:08 -0700)]
net/udpgso_bench_tx: options to exercise TX CMSG
This enhancement adds options that facilitate load testing with
additional TX CMSG options, and to optionally print results of
various send CMSG operations.
These options are especially useful in isolating situations
where error-queue messages are lost when combined with other
CMSG operations (e.g. SO_ZEROCOPY).
New options:
-a - count all CMSG messages and match to sent messages
-T - add TX CMSG that requests TX software timestamps
-H - similar to -T except request TX hardware timestamps
-P - call poll() before reading error queue
-v - print detailed results
v2: Enhancements as per Willem de Bruijn <willemb@google.com>
- Updated control and buffer parameters for recvmsg
- poll() parameter cleanup
- fail on bad audit results
- remove TOS options
- improved reporting
v3: Enhancements as per Willem de Bruijn <willemb@google.com>
- add SOF_TIMESTAMPING_OPT_TSONLY to eliminate MSG_TRUNC
- general code cleanup
Signed-off-by: Fred Klassen <fklassen@appneta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Mon, 17 Jun 2019 23:28:28 +0000 (16:28 -0700)]
Merge branch 'fixes' of git://git./linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
"MS_MOVE regression fix + breakage in fsmount(2) (also introduced in
this cycle, along with fsmount(2) itself).
I'm still digging through the piles of mail, so there might be more
fixes to follow, but these two are obvious and self-contained, so
there's no point delaying those..."
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/namespace: fix unprivileged mount propagation
vfs: fsmount: add missing mntget()
David S. Miller [Mon, 17 Jun 2019 23:27:43 +0000 (16:27 -0700)]
Merge branch 'net-ipv4-remove-erroneous-advancement-of-list-pointer'
Florian Westphal says:
====================
net: ipv4: remove erroneous advancement of list pointer
Tariq reported a soft lockup on net-next that Mellanox was able to
bisect to 2638eb8b50cf ("net: ipv4: provide __rcu annotation for ifa_list").
While reviewing the above patch I found a regression when addresses have a
lifetime specified.
The second patch extends rtnetlink.sh to trigger the crash
(without the first patch applied).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Mon, 17 Jun 2019 14:02:28 +0000 (16:02 +0200)]
selftests: rtnetlink: add addresses with fixed life time
This exercises kernel code paths that deal with addresses that have
a limited lifetime.
Without the previous fix, this triggers the following crash on net-next:
BUG: KASAN: null-ptr-deref in check_lifetime+0x403/0x670
Read of size 8 at addr 0000000000000010 by task kworker [..]
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Mon, 17 Jun 2019 14:02:27 +0000 (16:02 +0200)]
net: ipv4: remove erroneous advancement of list pointer
Causes a crash when the lifetime expires on an address, as garbage is
dereferenced soon after.
This used to look like this:
for (ifap = &ifa->ifa_dev->ifa_list;
*ifap != NULL; ifap = &(*ifap)->ifa_next) {
if (*ifap == ifa) ...
but this was changed to:
struct in_ifaddr *tmp;
ifap = &ifa->ifa_dev->ifa_list;
tmp = rtnl_dereference(*ifap);
while (tmp) {
tmp = rtnl_dereference(tmp->ifa_next); // Bogus
if (rtnl_dereference(*ifap) == ifa) {
...
ifap = &tmp->ifa_next; // Can be NULL
tmp = rtnl_dereference(*ifap); // Dereference
}
}
Remove the bogus assignment/list entry skip.
Fixes: 2638eb8b50cf ("net: ipv4: provide __rcu annotation for ifa_list")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
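The traversal pattern at issue can be sketched in userspace C, without the
RCU accessors. The point is that the link pointer and the current entry
advance together, exactly once per iteration, and only after the NULL
check in the loop condition. 'struct sketch_ifaddr' and sketch_unlink()
are hypothetical names used for illustration; this is not the kernel code
from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an address list entry; only the link matters. */
struct sketch_ifaddr {
    struct sketch_ifaddr *next;
    int id;
};

/*
 * Unlink 'target' from the singly linked list rooted at *head.
 * Returns 1 if the entry was found and unlinked, 0 otherwise.
 */
int sketch_unlink(struct sketch_ifaddr **head, struct sketch_ifaddr *target)
{
    struct sketch_ifaddr **linkp;
    struct sketch_ifaddr *cur;

    assert(head != NULL);

    linkp = head;
    cur = *linkp;

    while (cur) {
        if (cur == target) {
            *linkp = cur->next;  /* bypass the entry being removed */
            cur->next = NULL;
            return 1;
        }
        /* Single advancement step: 'cur' is known to be non-NULL here. */
        linkp = &cur->next;
        cur = *linkp;
    }
    return 0;
}
```

Advancing 'cur' a second time at the top of the loop, as in the broken
version quoted above, would let it become NULL before '&cur->next' is
computed, which matches the null-ptr-deref reported by KASAN.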
Arnd Bergmann [Mon, 17 Jun 2019 13:14:10 +0000 (15:14 +0200)]
net: dsa: sja1105: fix ptp link error
Due to a reversed dependency, it is possible to build
the lower ptp driver as a loadable module and the actual
driver using it as built-in, causing a link error:
drivers/net/dsa/sja1105/sja1105_spi.o: In function `sja1105_static_config_upload':
sja1105_spi.c:(.text+0x6f0): undefined reference to `sja1105_ptp_reset'
drivers/net/dsa/sja1105/sja1105_spi.o:(.data+0x2d4): undefined reference to `sja1105et_ptp_cmd'
drivers/net/dsa/sja1105/sja1105_spi.o:(.data+0x604): undefined reference to `sja1105pqrs_ptp_cmd'
drivers/net/dsa/sja1105/sja1105_main.o: In function `sja1105_remove':
sja1105_main.c:(.text+0x8d4): undefined reference to `sja1105_ptp_clock_unregister'
drivers/net/dsa/sja1105/sja1105_main.o: In function `sja1105_rxtstamp_work':
sja1105_main.c:(.text+0x964): undefined reference to `sja1105_tstamp_reconstruct'
drivers/net/dsa/sja1105/sja1105_main.o: In function `sja1105_setup':
sja1105_main.c:(.text+0xb7c): undefined reference to `sja1105_ptp_clock_register'
drivers/net/dsa/sja1105/sja1105_main.o: In function `sja1105_port_deferred_xmit':
sja1105_main.c:(.text+0x1fa0): undefined reference to `sja1105_ptpegr_ts_poll'
sja1105_main.c:(.text+0x1fc4): undefined reference to `sja1105_tstamp_reconstruct'
drivers/net/dsa/sja1105/sja1105_main.o:(.rodata+0x5b0): undefined reference to `sja1105_get_ts_info'
Change the Makefile logic to always build the ptp module
the same way as the rest. Another option would be to
just add it to the same module and remove the exports,
but I don't know if there was a good reason to keep them
separate.
Fixes: bb77f36ac21d ("net: dsa: sja1105: Add support for the PTP clock")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Mon, 17 Jun 2019 13:13:03 +0000 (15:13 +0200)]
net: stmmac: fix unused-variable warning
When building without CONFIG_OF, we get a harmless build warning:
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c: In function 'stmmac_phy_setup':
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:973:22: error: unused variable 'node' [-Werror=unused-variable]
struct device_node *node = priv->plat->phy_node;
Reword it so we always use the local variable, by making it the
fwnode pointer instead of the device_node.
Fixes: 74371272f97f ("net: stmmac: Convert to phylink and remove phylib logic")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Mon, 17 Jun 2019 22:55:34 +0000 (15:55 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
"Lots of bug fixes here:
1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer.
2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John
Crispin.
3) Use after free in psock backlog workqueue, from John Fastabend.
4) Fix source port matching in fdb peer flow rule of mlx5, from Raed
Salem.
5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet.
6) Network header needs to be set for packet redirect in nfp, from
John Hurley.
7) Fix udp zerocopy refcnt, from Willem de Bruijn.
8) Don't assume linear buffers in vxlan and geneve error handlers,
from Stefano Brivio.
9) Fix TOS matching in mlxsw, from Jiri Pirko.
10) More SCTP cookie memory leak fixes, from Neil Horman.
11) Fix VLAN filtering in rtl8366, from Linus Walleij.
12) Various TCP SACK payload size and fragmentation memory limit fixes
from Eric Dumazet.
13) Use after free in pneigh_get_next(), also from Eric Dumazet.
14) LAPB control block leak fix from Jeremy Sowden"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits)
lapb: fixed leak of control-blocks.
tipc: purge deferredq list for each grp member in tipc_group_delete
ax25: fix inconsistent lock state in ax25_destroy_timer
neigh: fix use-after-free read in pneigh_get_next
tcp: fix compile error if !CONFIG_SYSCTL
hv_sock: Suppress bogus "may be used uninitialized" warnings
be2net: Fix number of Rx queues used for flow hashing
net: handle 802.1P vlan 0 packets properly
tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
tcp: add tcp_min_snd_mss sysctl
tcp: tcp_fragment() should apply sane memory limits
tcp: limit payload size of sacked skbs
Revert "net: phylink: set the autoneg state in phylink_phy_change"
bpf: fix nested bpf tracepoints with per-cpu data
bpf: Fix out of bounds memory access in bpf_sk_storage
vsock/virtio: set SOCK_DONE on peer shutdown
net: dsa: rtl8366: Fix up VLAN filtering
net: phylink: set the autoneg state in phylink_phy_change
net: add high_order_alloc_disable sysctl/static key
tcp: add tcp_tx_skb_cache sysctl
...
Mitch Williams [Tue, 14 May 2019 17:37:09 +0000 (10:37 -0700)]
iavf: allow null RX descriptors
In some circumstances, the hardware can hand us a null receive
descriptor, with no data attached but otherwise valid. Unfortunately,
the driver was ill-equipped to handle such an event, and would stop
processing packets at that point.
To fix this, use the Descriptor Done bit instead of the size to
determine whether or not a descriptor is ready to be processed. Add some
checks to allow for unused buffers.
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
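The difference between keying on the size and keying on the Descriptor
Done bit can be sketched as follows. The descriptor layout here (bit 0 as
the done bit, a 14-bit length field at bit 38) and the names
sketch_make_desc() and sketch_clean_rx() are assumptions made for
illustration only; they are not taken from the commit or from the iavf
headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor layout, for illustration only. */
#define SKETCH_RX_DD        (1ULL << 0)   /* "Descriptor Done" status bit */
#define SKETCH_RX_LEN_SHIFT 38
#define SKETCH_RX_LEN_MASK  (0x3FFFULL << SKETCH_RX_LEN_SHIFT)

/* Build a descriptor word with the given done state and payload length. */
uint64_t sketch_make_desc(int done, unsigned int len)
{
    uint64_t q = ((uint64_t)len << SKETCH_RX_LEN_SHIFT) & SKETCH_RX_LEN_MASK;

    return done ? (q | SKETCH_RX_DD) : q;
}

/*
 * Walk the ring until the first descriptor the hardware has not finished.
 * Returns the number of descriptors consumed; *packets counts only the
 * descriptors that actually carried data.
 */
unsigned int sketch_clean_rx(const uint64_t *ring, unsigned int count,
                             unsigned int *packets)
{
    unsigned int i;

    assert(ring != NULL && packets != NULL);
    *packets = 0;

    for (i = 0; i < count; i++) {
        unsigned int len;

        /* Readiness comes from the DD bit, not from a non-zero size. */
        if (!(ring[i] & SKETCH_RX_DD))
            break;

        len = (unsigned int)((ring[i] & SKETCH_RX_LEN_MASK) >>
                             SKETCH_RX_LEN_SHIFT);
        if (len == 0)
            continue;  /* null descriptor: consume it, but no packet */

        (*packets)++;
    }
    return i;
}
```

A loop that stopped on the first zero-length descriptor would stall at a
null descriptor and never reach the valid ones behind it, which is the
behavior the commit describes.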