Petr Machata [Mon, 16 Oct 2017 14:26:39 +0000 (16:26 +0200)]
mlxsw: spectrum: Drop refcounting of IPIP entries
Formerly, IPIP entries were created lazily by next hops that referenced
an offloadable IP-in-IP netdevice. However now that they are created
eagerly as a reaction to events on such netdevices, the reference
counting is useless. Hence drop it.
The routes whose next hops reference an offloaded IP-in-IP netdevice
actually linger around a bit after their device is unregistered.
However, mlxsw_sp_ipip_entry_destroy() also destroys the backing
loopback, and mlxsw_sp_rif_destroy() transitively (via
mlxsw_sp_nexthop_rif_gone_sync()) calls mlxsw_sp_nexthop_ipip_fini(),
which unlinks the IPIP entry from a next hop. Thus no dangling pointers
are left behind for the brief window after netdevice is gone, but routes
not yet.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Mon, 16 Oct 2017 14:26:38 +0000 (16:26 +0200)]
mlxsw: spectrum: Support IPIP overlay VRF migration
IPIP entries are created as soon as an offloadable device is created.
That means that when such a device is later moved to a different VRF,
the loopback device that backs the tunnel is wrong.
Thus when an offloadable encapsulating netdevice moves from one VRF to
another, make sure that the loopback is updated as necessary.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Mon, 16 Oct 2017 14:26:37 +0000 (16:26 +0200)]
mlxsw: spectrum: Support decap-only IP-in-IP tunnels
Current code for offloading IP-in-IP tunneling assumes that there is no
decap without encap. But that's never true for IPv6 overlays, and is not
true for IPv4 ones either, if net.ipv4.conf.*.rp_filter is unset.
To support decap-only tunnels, an IPIP entry is now created as soon as
an offloadable tunneling device is created. When that netdevice is up'd,
a decap route is looked up and possibly offloaded. Thus decap is not
handled implicitly as part of mlxsw_sp_ipip_entry_get() call anymore,
but needs to be done explicitly after the get, if desired.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Mon, 16 Oct 2017 14:26:36 +0000 (16:26 +0200)]
mlxsw: spectrum_router: Move mlxsw_sp_netdev_ipip_type()
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Mon, 16 Oct 2017 14:26:35 +0000 (16:26 +0200)]
mlxsw: spectrum: Move netdevice NB to struct mlxsw_sp
So far, all netdevice notifications that the driver cared about were
related to its own ports, and mlxsw_sp could be retrieved from the
netdevice's private data. For IP-in-IP offloading however, the driver
cares about events on foreign netdevices, and getting at mlxsw_sp or
router data structures from the handler is inconvenient.
Therefore move the netdevice notifier blocks from global scope to struct
mlxsw_sp to allow retrieval from the notifier block pointer itself.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Mon, 16 Oct 2017 14:04:51 +0000 (16:04 +0200)]
tipc: fix rebasing error
In commit
2f487712b893 ("tipc: guarantee that group broadcast doesn't
bypass group unicast") there was introduced a last-minute rebasing
error that broke non-group communication.
We fix this here.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 16 Oct 2017 20:26:40 +0000 (21:26 +0100)]
Merge branch 'net-core-rcuify-rtnl-af_ops'
Florian Westphal says:
====================
net: core: rcuify rtnl af_ops
None of the rtnl af_ops callbacks sleep, so they can be called while
holding rcu read lock.
Switch handling of af_ops to rcu.
This would allow to later call af_ops functions without holding
the rtnl mutex anymore.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Mon, 16 Oct 2017 13:44:36 +0000 (15:44 +0200)]
net: core: rcu-ify rtnl af_ops
rtnl af_ops currently rely on rtnl mutex: unregister (called from module
exit functions) takes the rtnl mutex and all users that do af_ops lookup
also take the rtnl mutex. IOW, parallel rmmod will block until doit()
callback is done.
As none of the af_ops implementation sleep we can use rcu instead.
doit functions that need the af_ops can now use rcu instead of the
rtnl mutex provided the mutex isn't needed for other reasons.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Mon, 16 Oct 2017 13:44:35 +0000 (15:44 +0200)]
rtnetlink: place link af dump into own helper
next patch will rcu-ify rtnl af_ops, i.e. allow af_ops
lookup and function calls with rcu read lock held instead
of rtnl mutex.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Mon, 16 Oct 2017 13:33:21 +0000 (14:33 +0100)]
tcp: cdg: make struct tcp_cdg static
The structure tcp_cdg is local to the source and
does not need to be in global scope, so make it static.
Cleans up sparse warning:
symbol 'tcp_cdg' was not declared. Should it be static?
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Mon, 16 Oct 2017 11:32:36 +0000 (13:32 +0200)]
net: systemport: add NET_DSA dependency
The notifier cause a link error when NET_DSA is a loadable
module:
drivers/net/ethernet/broadcom/bcmsysport.o: In function `bcm_sysport_remove':
bcmsysport.c:(.text+0x1582): undefined reference to `unregister_dsa_notifier'
drivers/net/ethernet/broadcom/bcmsysport.o: In function `bcm_sysport_probe':
bcmsysport.c:(.text+0x278d): undefined reference to `register_dsa_notifier'
This adds a dependency that forces the systemport driver to be
a loadable module as well when that happens, but otherwise
allows it to be built normally when DSA is either built-in or
completely disabled.
Fixes:
d156576362c0 ("net: systemport: Establish lower/upper queue mapping")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sudip Mukherjee [Sun, 15 Oct 2017 21:00:52 +0000 (22:00 +0100)]
hamradio: baycom_par: use new parport device model
Modify baycom driver to use the new parallel port device model.
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gustavo A. R. Silva [Sun, 15 Oct 2017 18:22:10 +0000 (13:22 -0500)]
net: dccp: mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Notice that for options.c file, I placed the "fall through" comment
on its own line, which is what GCC is expecting to find.
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Sat, 14 Oct 2017 14:04:40 +0000 (17:04 +0300)]
pch_gbe: Switch to new PCI IRQ allocation API
This removes custom flag handling.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steven Rostedt (VMware) [Thu, 12 Oct 2017 22:40:02 +0000 (18:40 -0400)]
tracing: bpf: Hide bpf trace events when they are not used
All the trace events defined in include/trace/events/bpf.h are only
used when CONFIG_BPF_SYSCALL is defined. But this file gets included by
include/linux/bpf_trace.h which is included by the networking code with
CREATE_TRACE_POINTS defined.
If a trace event is created but not used it still has data structures
and functions created for its use, even though nothing is using them.
To not waste space, do not define the BPF trace events in bpf.h unless
CONFIG_BPF_SYSCALL is defined.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Wang [Fri, 13 Oct 2017 22:08:07 +0000 (15:08 -0700)]
ipv6: only update __use and lastusetime once per jiffy at most
In order to not dirty the cacheline too often, we try to only update
dst->__use and dst->lastusetime at most once per jiffy.
As dst->lastusetime is only used by ipv6 garbage collector, it should
be good enough time resolution.
And __use is only used in ipv6_route_seq_show() to show how many times a
dst has been used. And as __use is not atomic_t right now, it does not
show the precise number of usage times anyway. So we think it should be
OK to only update it at most once per jiffy.
According to my latest syn flood test on a machine with intel Xeon 6th
gen processor and 2 10G mlx nics bonded together, each with 8 rx queues
on 2 NUMA nodes:
With this patch, the packet process rate increases from ~3.49Mpps to
~3.75Mpps with a 7% increase rate.
Note: dst_use() is being renamed to dst_hold_and_use() to better specify
the purpose of the function.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@googl.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Wang [Fri, 13 Oct 2017 22:01:08 +0000 (15:01 -0700)]
ipv6: check fn before doing FIB6_SUBTREE(fn)
In fib6_locate(), we need to first make sure fn is not NULL before doing
FIB6_SUBTREE(fn) to avoid crash.
This fixes the following static checker warning:
net/ipv6/ip6_fib.c:1462 fib6_locate()
warn: variable dereferenced before check 'fn' (see line 1459)
net/ipv6/ip6_fib.c
1458 if (src_len) {
1459 struct fib6_node *subtree = FIB6_SUBTREE(fn);
^^^^^^^^^^^^^^^^
We shifted this dereference
1460
1461 WARN_ON(saddr == NULL);
1462 if (fn && subtree)
^^
before the check for NULL.
1463 fn = fib6_locate_1(subtree, saddr, src_len,
1464 offsetof(struct rt6_info, rt6i_src)
Fixes:
66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Abhijit Ayarekar [Fri, 13 Oct 2017 19:24:06 +0000 (12:24 -0700)]
bpf: Add -target to clang switch while cross compiling.
Update to llvm excludes assembly instructions.
llvm git revision is below
commit
65fad7c26569 ("bpf: add inline-asm support")
This change will be part of llvm release 6.0
__ASM_SYSREG_H define is not required for native compile.
-target switch includes appropriate target specific files
while cross compiling
Tested on x86 and arm64.
Signed-off-by: Abhijit Ayarekar <abhijit.ayarekar@caviumnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 16 Oct 2017 20:00:41 +0000 (21:00 +0100)]
Merge branch 'sched-tp_q-remove'
Jiri Pirko <jiri@mellanox.com>
====================
net: sched: remove some tp->q usage
In order to prepare for block sharing, tcf_proto instances need to be
independent on particular qdisc instances. This patchset takes care of
removal of couple occurrences of tp->q usage.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 13 Oct 2017 12:01:05 +0000 (14:01 +0200)]
net: sched: propagate q and parent from caller down to tcf_fill_node
The callers have this info, they will pass it down to tcf_fill_node.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 13 Oct 2017 12:01:04 +0000 (14:01 +0200)]
net: sched: use tcf_block_q helper to get q pointer for sch_tree_lock
Use tcf_block_q helper to get q pointer to be used for direct call of
sch_tree_lock/unlock instead of tcf_tree_lock/unlock.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 13 Oct 2017 12:01:03 +0000 (14:01 +0200)]
net: sched: tcindex, fw, flow: use tcf_block_q helper to get struct Qdisc
Use helper to get q pointer per block.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 13 Oct 2017 12:01:02 +0000 (14:01 +0200)]
net: sched: cls_u32: use block instead of q in tc_u_common
tc_u_common is now per-q. With blocks, it has to be converted to be
per-block.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 13 Oct 2017 12:01:01 +0000 (14:01 +0200)]
net: sched: ematch: obtain net pointer from blocks
Instead of using tp->q, use block to get the net pointer.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 13 Oct 2017 12:01:00 +0000 (14:01 +0200)]
net: sched: teach tcf_bind/unbind_filter to use block->q
Whenever the block->q is set, it can be used instead of tp->q as it
contains the same value. When it is not set, which can't happen now but
it might happen with the follow-up shared blocks introduction, the class
is not set in the result. That would lead to a class lookup instead
of direct class pointer use for classful qdiscs. However, it is not
planned to support classful qdisqs sharing filter blocks, so that may
never happen.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 13 Oct 2017 12:00:59 +0000 (14:00 +0200)]
net: sched: introduce tcf_block_q and tcf_block_dev helpers
These helpers allows to get a q and netdev pointers
for given block easily.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 13 Oct 2017 12:00:58 +0000 (14:00 +0200)]
net: sched: store net pointer in block and introduce qdisc_net helper
Store net pointer in the block structure. Along the way, introduce
qdisc_net helper which allows to easily obtain net pointer for
qdisc instance.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 13 Oct 2017 12:00:57 +0000 (14:00 +0200)]
net: sched: store Qdisc pointer in struct block
Prepare for removal of tp->q and store Qdisc pointer in the block
structure.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Thu, 12 Oct 2017 18:38:45 +0000 (11:38 -0700)]
mqprio: Reserve last 32 classid values for HW traffic classes and misc IDs
This patch makes a slight tweak to mqprio in order to bring the
classid values used back in line with what is used for mq. The general idea
is to reserve values :ffe0 - :ffef to identify hardware traffic classes
normally reported via dev->num_tc. By doing this we can maintain a
consistent behavior with mq for classid where :1 - :ffdf will represent a
physical qdisc mapped onto a Tx queue represented by classid - 1, and the
traffic classes will be mapped onto a known subset of classid values
reserved for our virtual qdiscs.
Note I reserved the range from :fff0 - :ffff since this way we might be
able to reuse these classid values with clsact and ingress which would mean
that for mq, mqprio, ingress, and clsact we should be able to maintain a
similar classid layout.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 16 Oct 2017 04:42:41 +0000 (05:42 +0100)]
Merge tag 'mlx5-updates-2017-10-11' of git://git./linux/kernel/git/mellanox/linux
Saeed Mahameed says:
====================
mlx5-updates-2017-10-11: IPoIB Multi Pkey support
This series provides the support for IPoIB Multi Pkey.
InfiniBand Pkeys are the equivalent of Ethernet vlans.
Currently IPoIB device driver supports only default Pkey and IPoIB Pkey child
interfaces are not supported with IPoIB offloads mode, this series will add
the support for that by allowing creating mlx5 multiple IPoIB netdevices with
a non-default Pkey.
mlx5 IPoIB Pkey child interface is smaller version of mlx5i IPoIB interfaces and shares
most of its resources with the parent IPoIB interface, namely RX steering and ring
queue resources.
The only mlx5 resources a child Pkey interface will be creating are the TX rings,
since they should be assigned to a specific Pkey.
mlx5i Pkey netdev is implemented via new mlx5e netdev profile implemented in
mlx5/core/ipoib/ipoib_vlan.c.
The series starts with a refactoring of mlx5e PTP and mlx5 clock implementation
to move the code to be part of mlx5 core rather than mlx5e netdevice, in order to
make mlx5 clock and PTP registration part of the core to be shared with mlx5e
master Ethernet netdev/IPoIB parent netdev and mlx5_ib in the near future.
Add the support for attaching multiple underlay QPs for the different Pkeys
in mlx5 core RX steering.
Add Pkey index to rdma_netdev to add the ability to set PKEY index to lower
IPoIB offload netdev.
Use hash-table to map between DQPN (Destination QP number) to child netdev
for the IPoIB parent netdev to forward RX packets to the corresponding
child Pkey netdev, since the RX rings are shared.
The reset of the series adds the ipoib child Pkey: mlx5e netdev profile,
netdev nods implementation and minimal set of ethtool callbacks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 15 Oct 2017 01:49:42 +0000 (18:49 -0700)]
Merge branch '40GbE' of git://git./linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
40GbE Intel Wired LAN Driver Updates 2017-10-13
This series contains updates to mqprio and i40e.
Amritha introduces a new hardware offload mode in tc/mqprio where the TCs,
the queue configurations and bandwidth rate limits are offloaded to the
hardware. The existing mqprio framework is extended to configure the queue
counts and layout and also added support for rate limiting. This is
achieved through new netlink attributes for the 'mode' option which takes
values such as 'dcb' (default) and 'channel' and a 'shaper' option for
QoS attributes such as bandwidth rate limits in hw mode 1. Legacy devices
can fall back to the existing setup supporting hw mode 1 without these
additional options where only the TCs are offloaded and then the 'mode'
and 'shaper' options defaults to DCB support. The i40e driver enables the
new mqprio hardware offload mechanism factoring the TCs, queue
configuration and bandwidth rates by creating HW channel VSIs.
In this new mode, the priority to traffic class mapping and the user
specified queue ranges are used to configure the traffic class when the
'mode' option is set to 'channel'. This is achieved by creating HW
channels(VSI). A new channel is created for each of the traffic class
configuration offloaded via mqprio framework except for the first TC (TC0)
which is for the main VSI. TC0 for the main VSI is also reconfigured as
per user provided queue parameters. Finally, bandwidth rate limits are set
on these traffic classes through the shaper attribute by sending these
rates in addition to the number of TCs and the queue configurations.
Colin Ian King makes an array of constant values "constant".
Alan fixes and issue where on some firmware versions, we were failing to
actually fill out the phy_types which caused ethtool to not report any
link types. Also hardened against a potentially malicious VF by not
letting the VF to reset itself after requesting to change the number of
queues (via ethtool), let the PF reset the VF to institute the requested
changes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 15 Oct 2017 01:47:53 +0000 (18:47 -0700)]
Merge branch 'tc-testing-updates'
Lucas Bates says:
====================
tc-testing: Test suite updates
This patch series is a roundup of changes to the tc-testing
suite:
- Add test cases for police and mirred modules and some coverage
in already-submitted test categories
- Break the test case files down into more user-friendly sizes
- Bug fix to the tdc.py script's handling of the -l argument
v2: fix the lack of final newlines in two new files (thanks David)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Lucas Bates [Fri, 13 Oct 2017 21:51:25 +0000 (17:51 -0400)]
tc-testing: fix the -l argument bug in tdc.py
This patch fixes a bug in the tdc script, where executing tdc
with the -l argument would cause the tests to start running
as opposed to listing all the known test cases.
Signed-off-by: Lucas Bates <lucasb@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lucas Bates [Fri, 13 Oct 2017 21:51:24 +0000 (17:51 -0400)]
tc-testing: Add test cases for police and skbmod
Add basic unit tests for police and skbmod actions in tc.
Signed-off-by: Lucas Bates <lucasb@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lucas Bates [Fri, 13 Oct 2017 21:51:23 +0000 (17:51 -0400)]
tc-testing: Split test case files into smaller chunks
The original submission had the test cases stored in one
monolithic file. This can be unwieldy to edit, especially as more
test cases are added. This patch removes the original tests.json
file in favour of individual ones broken down by category.
Signed-off-by: Lucas Bates <lucasb@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lucas Bates [Fri, 13 Oct 2017 21:51:22 +0000 (17:51 -0400)]
tc-testing: Add test cases for flushing actions
Tests for flushing gact and mirred were missing. This patch
adds test cases to explicitly test the flush of any installed
gact/mirred actions.
Signed-off-by: Lucas Bates <lucasb@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 15 Oct 2017 01:46:36 +0000 (18:46 -0700)]
Merge branch 'macvlan-cleanups'
Alexander Duyck says:
====================
net: Minor macvlan source mode cleanups
So this patch series is just a few minor cleanups for macvlan source mode.
The first patch addresses double receives when a packet is being routed to
the macvlan destination address, and the other addresses the pkt_type being
updated in cases where it most likely should not be.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 13 Oct 2017 20:40:31 +0000 (13:40 -0700)]
macvlan: Only update pkt_type if destination MAC address matches
This patch updates the pkt_type to PACKET_HOST only if the destination MAC
address matches on the on the source based macvlan. It didn't make sense to
be updating broadcast, multicast, and non-local destined frames with
PACKET_HOST.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 13 Oct 2017 20:40:24 +0000 (13:40 -0700)]
macvlan: Only deliver one copy of the frame to the macvlan interface
This patch intoduces a slight adjustment for macvlan to address the fact
that in source mode I was seeing two copies of any packet addressed to the
macvlan interface being delivered where there should have been only one.
The issue appears to be that one copy was delivered based on the source MAC
address and then the second copy was being delivered based on the
destination MAC address. To fix it I am just treating a unicast address
match as though it is not a match since source based macvlan isn't supposed
to be matching based on the destination MAC anyway.
Fixes:
79cf79abce71 ("macvlan: add source mode")
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 13 Oct 2017 20:03:16 +0000 (13:03 -0700)]
tcp: add a tracepoint for tcp retransmission
We need a real-time notification for tcp retransmission
for monitoring.
Of course we could use ftrace to dynamically instrument this
kernel function too, however we can't retrieve the connection
information at the same time, for example perf-tools [1] reads
/proc/net/tcp for socket details, which is slow when we have
a lots of connections.
Therefore, this patch adds a tracepoint for __tcp_retransmit_skb()
and exposes src/dst IP addresses and ports of the connection.
This also makes it easier to integrate into perf.
Note, I expose both IPv4 and IPv6 addresses at the same time:
for a IPv4 socket, v4 mapped address is used as IPv6 addresses,
for a IPv6 socket, LOOPBACK4_IPV6 is already filled by kernel.
Also, add sk and skb pointers as they are useful for BPF.
1. https://github.com/brendangregg/perf-tools/blob/master/net/tcpretrans
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Brendan Gregg <bgregg@netflix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 13 Oct 2017 19:58:13 +0000 (12:58 -0700)]
net_sched: fix a compile warning in act_ife
Apparently ife_meta_id2name() is only called when
CONFIG_MODULES is defined.
This fixes:
net/sched/act_ife.c:251:20: warning: ‘ife_meta_id2name’ defined but not used [-Wunused-function]
static const char *ife_meta_id2name(u32 metaid)
^~~~~~~~~~~~~~~~
Fixes:
d3f24ba895f0 ("net sched actions: fix module auto-loading")
Cc: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 15 Oct 2017 01:42:56 +0000 (18:42 -0700)]
Merge branch 'hv_netvsc-Add-init-of-send-table-and-var-renames'
Haiyang Zhang says:
====================
hv_netvsc: Add init of send table and var renames
Add initialization of send indirection table. Otherwise it may contain
old info of previous device with different number of channels.
Also, did some variable renaming for easier reading.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Haiyang Zhang [Fri, 13 Oct 2017 19:28:05 +0000 (12:28 -0700)]
hv_netvsc: Add initialization of tx_table in netvsc_device_add()
tx_table is part of the private data of kernel net_device. It is only
zero-ed out when allocating net_device.
We may recreate netvsc_device w/o recreating net_device, so the private
netdev data, including tx_table, are not zeroed. It may contain channel
numbers for the older netvsc_device.
This patch adds initialization of tx_table each time we recreate
netvsc_device.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Haiyang Zhang [Fri, 13 Oct 2017 19:28:04 +0000 (12:28 -0700)]
hv_netvsc: Rename tx_send_table to tx_table
Simplify the variable name: tx_send_table
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Haiyang Zhang [Fri, 13 Oct 2017 19:28:03 +0000 (12:28 -0700)]
hv_netvsc: Rename ind_table to rx_table
Rename this variable because it is the Receive indirection
table.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Fri, 13 Oct 2017 16:29:00 +0000 (17:29 +0100)]
cxgb4: fix missing break in switch and indent return statements
The break statement for the Macronix case is missing and will
fall through to the Winbond case and re-assign the size setting.
Fix this by adding the missing break statement. Also correctly
indent the return statements.
Detected by CoverityScan, CID#1458020 ("Missing break in switch")
Fixes:
96ac18f14a5a ("cxgb4: Add support for new flash parts")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 15 Oct 2017 01:36:46 +0000 (18:36 -0700)]
Merge tag 'mac80211-next-for-davem-2017-10-13' of git://git./linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Three fixes for the recently added new code:
* make "make -s" silent for the certs file (Arnd)
* fix missing CONFIG_ in extra certs symbol (Arnd)
* use crypto_aead_authsize() to use the proper API
and two other changes:
* remove a set-but-unused variable
* don't track HT *capability* changes, capabilities
are supposed to be constant (HT operation changes)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 15 Oct 2017 01:35:15 +0000 (18:35 -0700)]
Merge branch 'cxgb4-hw-debug-logs'
Rahul Lakkireddy says:
====================
cxgb4: add support to get hardware debug logs via ethtool
This series of patches add support to collect hardware debug logs
via ethtool --get-dump facility.
Currently supports:
Memory dumps - Collects on-chip EDC0 and EDC1 dumps.
Hardware dumps - Collects firmware and hardware dumps.
Patch 1 adds ethtool set/get dump data. It also adds template header
that precedes dump data. This template header gives additional
information needed for extracting and decoding the collected dump
data.
Patch 2 adds base to collect dumps. Also collects regdump.
Patch 3 collects on-chip EDC0 and EDC1 memory dumps.
Patch 4 collects firmware mbox log and device log.
Patch 5 updates base API for accessing TP indirect registers.
Patch 6 collects hardware TP module dump.
Patch 7 collects hardware SGE, PCIE, PM, UP CIM, MA, and HMA
module dumps.
Patch 8 collects hardware IBQ and OBQ dump.
Thanks,
Rahul
---
v2:
- Prefix symbols that pollute global namespace in files starting
with cxgb4_* with "cxgb4_"
- Prefix symbols that pollute global namespace in files starting
with cudbg_* with "cudbg_"
- Make cudbg_collect_mem_info() static.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Lakkireddy [Fri, 13 Oct 2017 13:18:20 +0000 (18:48 +0530)]
cxgb4: collect IBQ and OBQ dumps
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Lakkireddy [Fri, 13 Oct 2017 13:18:19 +0000 (18:48 +0530)]
cxgb4: collect hardware module dumps
Collect SGE, PCIE, PM, UP CIM, MA and HMA dumps.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Lakkireddy [Fri, 13 Oct 2017 13:18:18 +0000 (18:48 +0530)]
cxgb4: collect TP dump
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Lakkireddy [Fri, 13 Oct 2017 13:18:17 +0000 (18:48 +0530)]
cxgb4: update API for TP indirect register access
Try to access TP indirect registers via firmware first. If this fails,
fallback and access them directly. This ensures that driver and
firmware do not conflict each other while accessing the TP indirect
registers.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Lakkireddy [Fri, 13 Oct 2017 13:18:16 +0000 (18:48 +0530)]
cxgb4: collect firmware mbox and device log dump
Collect firmware mbox and device logs before collecting the rest of
the hardware dumps to snap the firmware state before the mailbox logs
are updated by other hardware dumps.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Lakkireddy [Fri, 13 Oct 2017 13:18:15 +0000 (18:48 +0530)]
cxgb4: collect on-chip memory dump
Collect EDC0 and EDC1 memory dump.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Lakkireddy [Fri, 13 Oct 2017 13:18:14 +0000 (18:48 +0530)]
cxgb4: collect register dump
Add base to collect dump entities. Collect register dump and
update template header accordingly.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Lakkireddy [Fri, 13 Oct 2017 13:18:13 +0000 (18:48 +0530)]
cxgb4: implement ethtool dump data operations
Implement operations to set/get dump data via ethtool. Also add
template header that precedes dump data, which helps in decoding
and extracting the dump data.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mark Brown [Fri, 13 Oct 2017 02:50:35 +0000 (03:50 +0100)]
nfp: Explicitly include linux/bug.h
Today's -next build encountered an error due to a missing definition of
WARN_ON(), caused by some header reorganization removing an implicit
inclusion of linux/bug.h. Fix this with an explicit inclusion.
Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 15 Oct 2017 01:30:07 +0000 (18:30 -0700)]
Merge branch 'dsa-remove-set_addr'
Vivien Didelot says:
====================
net: dsa: remove .set_addr
An Ethernet switch may support having a MAC address, which can be used
as the switch's source address in transmitted full-duplex Pause frames.
If a DSA switch supports the related .set_addr operation, the DSA core
sets the master's MAC address on the switch.
This won't make sense anymore in a multi-CPU ports system, because there
won't be a unique master device assigned to a switch tree.
Moreover this operation is confusing because it makes the user think
that it could be used to program the switch with the MAC address of the
CPU/management port such that MAC address learning can be disabled on
said port, but in fact, that's not how it is currently used.
To fix this, assign a random MAC address at setup time in the mv88e6060
and mv88e6xxx drivers before removing .set_addr completely from DSA.
Changes in v3:
- include fix for mv88e6060 switch MAC address setter.
Changes in v2:
- remove .set_addr implementation from drivers and use a random MAC.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 13 Oct 2017 18:18:09 +0000 (14:18 -0400)]
net: dsa: remove .set_addr
Now that there is no user for the .set_addr function, remove it from
DSA. If a switch supports this feature (like mv88e6xxx), the
implementation can be done in the driver setup.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 13 Oct 2017 18:18:08 +0000 (14:18 -0400)]
net: dsa: dsa_loop: remove .set_addr
The .set_addr function does nothing, remove the dsa_loop implementation
before getting rid of it completely in DSA.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 13 Oct 2017 18:18:07 +0000 (14:18 -0400)]
net: dsa: mv88e6060: setup random mac address
As for mv88e6xxx, setup the switch from within the mv88e6060 driver with
a random MAC address, and remove the .set_addr implementation.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 13 Oct 2017 18:18:06 +0000 (14:18 -0400)]
net: dsa: mv88e6060: fix switch MAC address
The 88E6060 Ethernet switch always transmits the multicast bit of the
switch MAC address as a zero. It re-uses the corresponding bit 8 of the
register "Switch MAC Address Register Bytes 0 & 1" for "DiffAddr".
If the "DiffAddr" bit is 0, then all ports transmit the same source
address. If it is set to 1, then bit 2:0 are used for the port number.
The mv88e6060 driver is currently wrongly shifting the MAC address byte
0 by 9. To fix this, shift it by 8 as usual and clear its bit 0.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 13 Oct 2017 18:18:05 +0000 (14:18 -0400)]
net: dsa: mv88e6xxx: setup random mac address
An Ethernet switch may support having a MAC address, which can be used
as the switch's source address in transmitted full-duplex Pause frames.
If a DSA switch supports the related .set_addr operation, the DSA core
sets the master's MAC address on the switch. This won't make sense
anymore in a multi-CPU ports system, because there won't be a unique
master device assigned to a switch tree.
Instead, setup the switch from within the Marvell driver with a random
MAC address, and remove the .set_addr implementation.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gustavo A. R. Silva [Thu, 12 Oct 2017 21:11:32 +0000 (16:11 -0500)]
atm: fore200e: mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Vesker [Thu, 14 Sep 2017 15:22:50 +0000 (18:22 +0300)]
net/mlx5e: IPoIB, Modify rdma netdev allocate and free to support PKEY
Resources such as FT, QPN HT and mdev resources should be allocated
only by parent netdev. Shared resources are allocated and freed by the
parent interface since the parent is always present and created
before the IPoIB PKEY sub-interface.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Alex Vesker [Thu, 14 Sep 2017 15:02:31 +0000 (18:02 +0300)]
net/mlx5e: IPoIB, Add PKEY child interface ethtool ops
Similar to VLAN interfaces child interfaces have limited ethtool
support. In current code the main limitation that does not
allow child interface ethtool configuration is due to shared
resources which are managed by the parent.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Alex Vesker [Thu, 14 Sep 2017 13:33:35 +0000 (16:33 +0300)]
net/mlx5e: IPoIB, Add PKEY child interface ndos
Child interface ndos will be called to support child interface
specific behaviour.
ndo_init flow:
-Acquire shared QPN to net-device HT from parent
-Continue with the same flow as parent interface
ndo_open flow:
-Initialize child underlay QP and connect to shared FT
-Create child send TIS
-Open child send channels
Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Alex Vesker [Thu, 14 Sep 2017 11:08:39 +0000 (14:08 +0300)]
net/mlx5e: IPoIB, Add PKEY child interface nic profile
Child interface profile will be called to support child interface
specific behaviour. The child code is sparse compared to the parent
since the RX channels are shared between the interfaces.
Creating a septate profile for child and parent will make a smother
code with a better ability for future expansion.
The profile stuct is exposed to the parent using a getter function.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Alex Vesker [Thu, 14 Sep 2017 07:27:25 +0000 (10:27 +0300)]
net/mlx5e: IPoIB, Use hash-table to map between QPN to child netdev
This change is needed for PKEY support, since the RQs are shared
between the child interface and the parent. The parent is responsible
for NAPI and the precessing of RX completions. Using the dqpn in the
completion descriptor we set the corresponding child IPoIB netdevice
on the SKB.
The mapping between the dqpn and the netdevice is done using a HT,
each mlx5 IPoIB interface registers its mapping on creation.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Alex Vesker [Wed, 13 Sep 2017 09:17:50 +0000 (12:17 +0300)]
net/mlx5e: IPoIB, Support for setting PKEY index to underlay QP
Added a function to set PKEY index to IPoIB device driver using the
already present set_id function. PKEY index is attached to the QP
during state modification.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Alex Vesker [Wed, 14 Jun 2017 19:18:48 +0000 (22:18 +0300)]
IB/ipoib: Add ability to set PKEY index to lower device driver
To support passing child interfaces to the lower device a new
rdma_netdev function was used, set_id. This will allow us to
attach the PKEY index lower device resources such as TIS/QP.
For devices that do not support offloads in IPoIB same logic
will be used, setting the PKEY index to priv struct.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Alex Vesker [Tue, 10 Oct 2017 07:36:41 +0000 (10:36 +0300)]
IB/ipoib: Grab rtnl lock on heavy flush when calling ndo_open/stop
When ndo_open and ndo_stop are called RTNL lock should be held.
In this specific case ipoib_ib_dev_open calls the offloaded ndo_open
which re-sets the number of TX queue assuming RTNL lock is held.
Since RTNL lock is not held, RTNL assert will fail.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Alex Vesker [Wed, 13 Sep 2017 08:37:02 +0000 (11:37 +0300)]
net/mlx5: Support for attaching multiple underlay QPs to root flow table
Previous support allowed connecting only a single QPN to the FT.
Now using a linked list multiple QPNs can be attached to the same FT.
Supporting attaching multiple underlay QPs is required for PKEY
support in which child and parent share the same FT.
The actual attaching/detaching FW commands will be called inside the
function symmetrically.
This change requires a change in IPoIB open and close functions, the
attaching/detaching to/from the FT is done each time we open/close.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Alex Vesker [Tue, 12 Sep 2017 11:11:29 +0000 (14:11 +0300)]
net/mlx5e: IPoIB, Move underlay QP init/uninit to separate functions
During the creation of the underlay QP the PKEY index is unknown, the
PKEY index is known only when calling ndo_open.
PKEY index attached to the QP during state modification.
Splitting the functions will also make the code symmetric and more
readable. This split is also required for later PKEY support to be
called with the PKEY index during ndo_open.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Feras Daoud [Tue, 15 Aug 2017 10:46:04 +0000 (13:46 +0300)]
net/mlx5: PTP code migration to driver core section
PTP code is moved to core section of mlx5 driver in order to share
it between ethernet and infiniband. This movement involves the following
changes:
- Change mlx5e_ prefix to be mlx5_
- Add clock structs to Core
- Add clock object to mlx5_core_dev
- Call Init/Uninit clock from core init/cleanup
- Rename mlx5e_tstamp to be mlx5_clock
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Eitan Rabin <rabin@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Feras Daoud [Mon, 14 Aug 2017 08:23:27 +0000 (11:23 +0300)]
net/mlx5: File renaming towards ptp core implementation
en_clock.c renamed clock.c and moved to lib/ as first step
towards relocating code to core part of the driver to allow
sharing between Ethernet and Infiniband.
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Eitan Rabin <rabin@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
David S. Miller [Sat, 14 Oct 2017 18:13:29 +0000 (11:13 -0700)]
Merge branch 'nfp-bpf-support-direct-packet-access'
Jakub Kicinski says:
====================
nfp: bpf: support direct packet access
The core of this series is direct packet access support. With a
small change to the verifier, the offloaded code can now make
use of DPA. We need to be careful to use kernel (after initial
translation) offsets in our JIT. Direct packet access also brings
us to the problem of eBPF endianness. After considering the
changes necessary we decided to not support translation on both
BE and LE hosts, for now.
This series contains two fixes - one for compare instructions and
one for ineffective jne optimization. I chose to include fixes
in this set because the code in -net works only with unreleased
PoC FW (ABI version 1) and therefore nobody outside of Netronome
can exercise it anyway.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 12 Oct 2017 17:34:18 +0000 (10:34 -0700)]
nfp: bpf: support direct packet access in TC
Add support for direct packet access in TC, note that because
writing the packet will cause the verifier to generate a csum
fixup prologue we won't be able to offload packet writes from
TC, just yet, only the reads will work.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 12 Oct 2017 17:34:17 +0000 (10:34 -0700)]
nfp: bpf: direct packet access - write
This patch adds ability to write packet contents using pre-validated
packet pointers (direct packet access).
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 12 Oct 2017 17:34:16 +0000 (10:34 -0700)]
nfp: bpf: add support for direct packet access - read
In direct packet access bound checks are already done, we can
simply dereference the packet pointer.
Verifier/parser logic needs to record pointer type. Note that
although verifier does protect us from CTX vs other pointer
changes we will also want to differentiate between PACKET vs
MAP_VALUE or STACK, so we can add the check already.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 12 Oct 2017 17:34:15 +0000 (10:34 -0700)]
nfp: bpf: separate I/O from checks for legacy data load
Move data load into a separate function and separate it from
packet length checks of legacy I/O. This makes the code more
readable and easier to reuse.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 12 Oct 2017 17:34:14 +0000 (10:34 -0700)]
nfp: bpf: fix context accesses
Sizes of fields in struct xdp_md/xdp_buff and some in sk_buff depend
on target architecture. Take that into account and use struct xdp_buff,
not struct xdp_md.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 12 Oct 2017 17:34:13 +0000 (10:34 -0700)]
nfp: bpf: support BPF offload only on little endian
eBPF is host-endian specific. Translating both BE and LE eBPF
to the NFP is feasible, but would require quite a bit of indirection.
The fact that I don't have access to any BE hosts that would fit
a 25G/40G/100G NIC is also limiting my ability to test big endian.
For now restrict the offload to little endian hosts only.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 12 Oct 2017 17:34:12 +0000 (10:34 -0700)]
nfp: bpf: implement byte swap instruction
Implement byte swaps with rotations, shifts and byte loads.
Remember to clear upper parts of the 64 bit registers.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 12 Oct 2017 17:34:11 +0000 (10:34 -0700)]
nfp: bpf: add mov helper
Register move operation is encoded as alu no op. This means
that one has to specify number of unused/none parameters to
the emit_alu(). Add a helper.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 12 Oct 2017 17:34:10 +0000 (10:34 -0700)]
nfp: bpf: fix compare instructions
Now that we have BPF assemebler support in LLVM 6 we can easily
test all compare instructions (LLVM 4 didn't generate most of them
from C). Fix the compare to immediates and refactor the order
of compare to regs to make sure they both follow the same pattern.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 12 Oct 2017 17:34:09 +0000 (10:34 -0700)]
nfp: bpf: add missing return in jne_imm optimization
We optimize comparisons to immediate 0 as if (reg.lo | reg.hi).
The early return statement was missing, however, which means we
would generate two comparisons - optimized one followed by a
normal 2x 32 bit compare.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 12 Oct 2017 17:34:08 +0000 (10:34 -0700)]
nfp: bpf: reorder arguments to emit_ld_field_any()
ld_field instruction has the following format in NFP assembler:
ld_field[dst, 1000, src, <<24]
reoder parameters to emit_ld_field_any() to make it closer to
the familiar assembler order.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 12 Oct 2017 17:34:07 +0000 (10:34 -0700)]
bpf: verifier: set reg_type on context accesses in second pass
Use a simplified is_valid_access() callback when verifier
is used for program analysis by non-host JITs. This allows
us to teach the verifier about packet start and packet end
offsets for direct packet access.
We can extend the callback as needed but for most packet
processing needs there isn't much more the offloads may
require.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 14 Oct 2017 18:12:08 +0000 (11:12 -0700)]
Merge branch 'stmmac-Improvements-for-multi-queuing-and-for-AVB'
Jose Abreu says:
====================
net: stmmac: Improvements for multi-queuing and for AVB
Two improvements for stmmac: First one corrects the available fifo
size per queue, second one corrects enabling of AVB queues. More info
in commit log.
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Changes from v1:
- Fix typo in second patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Jose Abreu [Fri, 13 Oct 2017 09:58:37 +0000 (10:58 +0100)]
net: stmmac: Disable flow ctrl for RX AVB queues and really enable TX AVB queues
Flow control must be disabled for AVB enabled queues and TX
AVB queues must be enabled by setting BIT(2) of TXQEN.
Correct this by passing the queue mode to DMA callbacks
and by checking in these functions wether we are in AVB
performing the necessary adjustments.
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jose Abreu [Fri, 13 Oct 2017 09:58:36 +0000 (10:58 +0100)]
net: stmmac: Use correct values in TQS/RQS fields
Currently we are using all the available fifo size in RQS and
TQS fields. This will not work correctly in multi-queues IP's
because total fifo size must be splitted to the enabled queues.
Correct this by computing the available fifo size per queue and
setting the right value in TQS and RQS fields.
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Matteo Croce [Thu, 12 Oct 2017 14:12:37 +0000 (16:12 +0200)]
icmp: don't fail on fragment reassembly time exceeded
The ICMP implementation currently replies to an ICMP time exceeded message
(type 11) with an ICMP host unreachable message (type 3, code 1).
However, time exceeded messages can either represent "time to live exceeded
in transit" (code 0) or "fragment reassembly time exceeded" (code 1).
Unconditionally replying to "fragment reassembly time exceeded" with
host unreachable messages might cause unjustified connection resets
which are now easily triggered as UFO has been removed, because, in turn,
sending large buffers triggers IP fragmentation.
The issue can be easily reproduced by running a lot of UDP streams
which is likely to trigger IP fragmentation:
# start netserver in the test namespace
ip netns add test
ip netns exec test netserver
# create a VETH pair
ip link add name veth0 type veth peer name veth0 netns test
ip link set veth0 up
ip -n test link set veth0 up
for i in $(seq 20 29); do
# assign addresses to both ends
ip addr add dev veth0 192.168.$i.1/24
ip -n test addr add dev veth0 192.168.$i.2/24
# start the traffic
netperf -L 192.168.$i.1 -H 192.168.$i.2 -t UDP_STREAM -l 0 &
done
# wait
send_data: data send error: No route to host (errno 113)
netperf: send_omni: send_data failed: No route to host
We need to differentiate instead: if fragment reassembly time exceeded
is reported, we need to silently drop the packet,
if time to live exceeded is reported, maintain the current behaviour.
In both cases increment the related error count "icmpInTimeExcds".
While at it, fix a typo in a comment, and convert the if statement
into a switch to mate it more readable.
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alan Brady [Wed, 11 Oct 2017 21:49:43 +0000 (14:49 -0700)]
i40e/i40evf: don't trust VF to reset itself
When using 'ethtool -L' on a VF to change number of requested queues
from PF, we shouldn't trust the VF to reset itself after making the
request. Doing it that way opens the door for a potentially malicious
VF to do nasty things to the PF which should never be the case.
This makes it such that after VF makes a successful request, PF will
then reset the VF to institute required changes. Only if the request
fails will PF send a message back to VF letting it know the request was
unsuccessful.
Testing-hints:
There should be no real functional changes. This is simply hardening
against a potentially malicious VF.
Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alan Brady [Wed, 11 Oct 2017 21:49:42 +0000 (14:49 -0700)]
i40e: fix link reporting
When querying the NVM for supported phy_types, on some firmware
versions, we were failing to actually fill out the phy_types which means
ethtool wouldn't report any link types.
Testing-hints:
Check 'ethtool <iface>' if you have the right (wrong?) firmware.
Without this patch, no link modes will be reported.
Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Colin Ian King [Fri, 22 Sep 2017 14:11:38 +0000 (15:11 +0100)]
i40e: make const array patterns static, reduces object code size
Don't populate const array patterns on the stack, instead make it
static. Makes the object code smaller by over 60 bytes:
Before:
text data bss dec hex filename
1953 496 0 2449 991 i40e_diag.o
After:
text data bss dec hex filename
1798 584 0 2382 94e i40e_diag.o
(gcc 6.3.0, x86-64)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Amritha Nambiar [Thu, 7 Sep 2017 11:00:32 +0000 (04:00 -0700)]
i40e: Add support setting TC max bandwidth rates
This patch enables setting up maximum Tx rates for the traffic
classes in i40e. The maximum rate is offloaded to the hardware through
the mqprio framework by specifying the mode option as 'channel' and
shaper option as 'bw_rlimit' and is configured for the VSI. Configuring
minimum Tx rate limit is not supported in the device. The minimum
usable value for Tx rate is 50Mbps.
Example:
# tc qdisc add dev eth0 root mqprio num_tc 2 map 0 0 0 0 1 1 1 1\
queues 4@0 4@4 hw 1 mode channel shaper bw_rlimit\
max_rate 4Gbit 5Gbit
To dump the bandwidth rates:
# tc qdisc show dev eth0
qdisc mqprio 804a: root tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
queues:(0:3) (4:7)
mode:channel
shaper:bw_rlimit max_rate:4Gbit 5Gbit
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Amritha Nambiar [Thu, 7 Sep 2017 11:00:27 +0000 (04:00 -0700)]
i40e: Refactor VF BW rate limiting
This patch refactors the BW rate limiting for Tx traffic
on the VF to be reused in the next patch for rate limiting Tx
traffic for the VSIs on the PF as well.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Amritha Nambiar [Thu, 7 Sep 2017 11:00:22 +0000 (04:00 -0700)]
i40e: Enable 'channel' mode in mqprio for TC configs
The i40e driver is modified to enable the new mqprio hardware
offload mode and factor the TCs and queue configuration by
creating channel VSIs. In this mode, the priority to traffic
class mapping and the user specified queue ranges are used
to configure the traffic classes by setting the mode option to
'channel'.
Example:
map 0 0 0 0 1 2 2 3 queues 2@0 2@2 1@4 1@5\
hw 1 mode channel
qdisc mqprio 8038: root tc 4 map 0 0 0 0 1 2 2 3 0 0 0 0 0 0 0 0
queues:(0:1) (2:3) (4:4) (5:5)
mode:channel
shaper:dcb
The HW channels created are removed and all the queue configuration
is set to default when the qdisc is detached from the root of the
device.
This patch also disables setting up channels via ethtool (ethtool -L)
when the TCs are configured using mqprio scheduler.
The patch also limits setting ethtool Rx flow hash indirection
(ethtool -X eth0 equal N) to max queues configured via mqprio.
The Rx flow hash indirection input through ethtool should be
validated so that it is within in the queue range configured via
tc/mqprio. The bound checking is achieved by reporting the current
rss size to the kernel when queues are configured via mqprio.
Example:
map 0 0 0 1 0 2 3 0 queues 2@0 4@2 8@6 11@14\
hw 1 mode channel
Cannot set RX flow hash configuration: Invalid argument
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Amritha Nambiar [Thu, 7 Sep 2017 11:00:17 +0000 (04:00 -0700)]
i40e: Add infrastructure for queue channel support
This patch sets up the infrastructure for offloading TCs and
queue configurations to the hardware by creating HW channels(VSI).
A new channel is created for each of the traffic class
configuration offloaded via mqprio framework except for the first TC
(TC0). TC0 for the main VSI is also reconfigured as per user provided
queue parameters. Queue counts that are not power-of-2 are handled by
reconfiguring RSS by reprogramming LUTs using the queue count value.
This patch also handles configuring the TX rings for the channels,
setting up the RX queue map for channel.
Also, the channels so created are removed and all the queue
configuration is set to default when the qdisc is detached from the
root of the device.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>