Maor Gottlieb [Thu, 28 Apr 2016 22:36:39 +0000 (01:36 +0300)]
net/mlx5: Initializing CPU reverse mapping
Allocate a CPU rmap and add an entry for each IRQ.
CPU rmap is used in aRFS to get the RX queue number
of the RX completion interrupts.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maor Gottlieb [Thu, 28 Apr 2016 22:36:38 +0000 (01:36 +0300)]
net/mlx5e: Split the main flow steering table
Currently, the main flow table is used for two purposes:
One is to do mac filtering and the other is to classify
the packet l3-l4 header in order to steer the packet to
the right RSS TIR.
This design is very complex: for each configured MAC address we
have to add eleven rules (one rule per traffic type), and the same
applies when the device is put into promiscuous/allmulti mode.
This scheme does not scale to future features like aRFS.
In order to simplify it, the main flow table is split into two flow
tables:
1. l2 table - filter the packet dmac address, if there is a match
we forward to the ttc flow table.
2. TTC (Traffic Type Classifier) table - classify the traffic
type of the packet and steer the packet to the right TIR.
In this new design, when a new MAC address is added, the driver adds
only one flow rule instead of eleven.
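The saving can be sketched with a little arithmetic. This is only an illustration, not driver code: the function names are hypothetical, and it assumes the TTC table holds one shared rule per traffic type (eleven, matching the figure above).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: eleven traffic types, taken from the "eleven rules" figure. */
#define NUM_TRAFFIC_TYPES 11

/* Single-table design: every MAC address needs one rule per traffic type. */
size_t rules_single_table(size_t num_macs)
{
	assert(num_macs <= SIZE_MAX / NUM_TRAFFIC_TYPES); /* no overflow */
	return num_macs * NUM_TRAFFIC_TYPES;
}

/*
 * Split design: one l2 rule per MAC address, plus (assumed) one TTC rule
 * per traffic type that is shared by all addresses.
 */
size_t rules_split_tables(size_t num_macs)
{
	assert(num_macs <= SIZE_MAX - NUM_TRAFFIC_TYPES); /* no overflow */
	return num_macs + NUM_TRAFFIC_TYPES;
}
```

Under these assumptions, eight MAC addresses cost 88 rules in the single table but 19 rules across the two tables.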
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maor Gottlieb [Thu, 28 Apr 2016 22:36:37 +0000 (01:36 +0300)]
net/mlx5e: Refactor mlx5e flow steering structs
Slightly refactor and re-order the flow steering structs,
tables and data-bases for better self-containment and
flexibility to add more future steering phases
(tables/rules/data bases) e.g: aRFS.
Changes:
1. Move the vlan DB and address DB into their table structs.
2. Rename steering table structs to unique format: mlx5e_*_table,
e.g: mlx5e_vlan_table.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maor Gottlieb [Thu, 28 Apr 2016 22:36:36 +0000 (01:36 +0300)]
net/mlx5: Support different attributes for priorities in namespace
Currently, a namespace can only be initialized
with priorities that share the same attributes.
Add support for initializing a namespace with priorities
that have different attributes (e.g. a different number of levels).
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maor Gottlieb [Thu, 28 Apr 2016 22:36:35 +0000 (01:36 +0300)]
net/mlx5: Add user chosen levels when allocating flow tables
Currently, consumers of the flow steering infrastructure can't
choose their own flow table levels and are limited to one
flow table per level. This just wastes levels.
Instead, we introduce here the possibility to use multiple
flow tables in a level. The user is free to connect these
flow tables, while following the rule (FTEs in FT of level x
could only point to FTs of level y where y > x).
In addition, this patch switches the order of the create/destroy
flow tables of the NIC (vlan and main).
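The connection rule quoted above (level x may only point to level y where y > x) can be sketched as a simple check. The struct and function names here are hypothetical stand-ins, not the mlx5 flow steering API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a flow table; only the level matters here. */
struct ft_sketch {
	unsigned int level;
};

/* A rule in 'src' may forward to 'dst' only if dst sits at a higher level. */
bool ft_link_allowed(const struct ft_sketch *src, const struct ft_sketch *dst)
{
	assert(src != NULL && dst != NULL);
	return dst->level > src->level;
}
```

Note that two tables sharing a level can coexist under this rule, but neither may point at the other.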
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maor Gottlieb [Thu, 28 Apr 2016 22:36:34 +0000 (01:36 +0300)]
net/mlx5: Set number of allowed levels in priority
Refactors the flow steering namespace creation,
by changing the name num_fts to num_levels.
When a new flow table is created, the driver assigns a new level
to it, so the two names currently mean the same thing.
Since downstream patches will introduce the ability to create more
than one flow table per level, the name num_fts is no
longer accurate.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maor Gottlieb [Thu, 28 Apr 2016 22:36:33 +0000 (01:36 +0300)]
net/mlx5: Introduce modify flow rule destination
This API is used for modifying the flow rule destination.
This is needed so that the traffic type classifier rules can be
changed to point to the aRFS tables instead of their current flow table.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tariq Toukan [Thu, 28 Apr 2016 22:36:32 +0000 (01:36 +0300)]
net/mlx5e: Direct TIR per RQ
Introduce new TIRs for direct access per RQ.
Now we have 2 available kinds of TIRs:
- indirect TIR per traffic type, each points to one RQT (RSS RQT)
same as before.
- New direct TIR per RQ, each points to RQT with a size of one
that forwards packets to that RQ only.
The driver will open max channels (num cores) direct TIRs by default;
they will be filled with the actual RQs once channels are allocated.
Needed for downstream aRFS and ethtool direct steering functionalities.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Matthew Finlay [Thu, 28 Apr 2016 22:36:31 +0000 (01:36 +0300)]
net/mlx5e: Call vxlan_get_rx_port() with rtnl lock
Hold the rtnl lock when calling vxlan_get_rx_port().
Fixes:
b7aade15485a ("vxlan: break dependency with netdev drivers")
Signed-off-by: Matthew Finlay <matt@mellanox.com>
Reported-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 29 Apr 2016 20:23:03 +0000 (16:23 -0400)]
Merge branch 'enc28j60-small-improvements'
Michael Heimpold says:
====================
net: ethernet: enc28j60: small improvements
This series of two patches adds the following improvements to the driver:
1) Rework the central SPI read function so that it is compatible with
SPI masters which only support half duplex transfers.
2) Add a device tree binding for the driver.
Changelog:
v3: * renamed and improved binding documentation as
suggested by Rob Herring
v2: * took care of Arnd Bergmann's review comments
- allow to specify MAC address via DT
- unconditionally define DT id table
* increased the driver version minor number
* driver author's email address bounces, removed from address list
v1: * Initial submission
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Heimpold [Thu, 28 Apr 2016 20:06:15 +0000 (22:06 +0200)]
net: ethernet: enc28j60: add device tree support
This patch adds the required match table for device tree support
(and, while at it, fixes the indentation). It's also possible to specify the
MAC address in the DT blob.
Also add the corresponding binding documentation file.
Signed-off-by: Michael Heimpold <mhei@heimpold.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Heimpold [Thu, 28 Apr 2016 20:06:14 +0000 (22:06 +0200)]
net: ethernet: enc28j60: support half-duplex SPI controllers
The current spi_read_buf function fails on SPI host masters which
are only half-duplex capable. Splitting the Tx and Rx part solves
this issue.
Tested on Raspberry Pi (full duplex) and I2SE Duckbill (half duplex).
Signed-off-by: Michael Heimpold <mhei@heimpold.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Thu, 28 Apr 2016 15:59:28 +0000 (17:59 +0200)]
net: constify is_skb_forwardable's arguments
is_skb_forwardable is not supposed to change anything, so constify its
arguments.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 29 Apr 2016 20:09:45 +0000 (16:09 -0400)]
Merge branch 'ppp-rtnetlink'
Guillaume Nault says:
====================
ppp: add rtnetlink support
PPP devices lack the ability to be customised at creation time. In
particular they can't be created in a given netns or with a particular
name. Moving or renaming the device after creation is possible, but
creates undesirable transient effects on servers where PPP devices are
constantly created and removed, as users connect and disconnect.
Implementing rtnetlink support solves this problem.
The rtnetlink handlers implemented in this series are minimal, and can
only replace the PPPIOCNEWUNIT ioctl. The rest of PPP ioctls remains
necessary for any other operation on channels and units.
It is perfectly possible to mix PPP devices created by rtnl
and by ioctl(PPPIOCNEWUNIT). Devices will behave in the same way.
mutex_trylock() is used to resolve the locking issue wrt. locking
dependency between rtnl_lock() and ppp_mutex (see ppp_nl_newlink() in
patch #2).
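The trylock idea can be sketched in userspace as follows. This is a toy model, not the code in ppp_nl_newlink(): the lock type is built on a C11 atomic flag, all names are hypothetical, and returning -EBUSY on contention is an assumption made for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy try-lock built on a C11 atomic flag; stands in for a kernel mutex. */
struct toy_lock {
	atomic_flag held;
};

bool toy_trylock(struct toy_lock *l)
{
	/* test_and_set returns the previous value: false means we took it. */
	return !atomic_flag_test_and_set(&l->held);
}

void toy_unlock(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

/*
 * Called with an "outer" lock already held (think rtnl_lock). Other paths
 * take 'inner' first and the outer lock second, so blocking on 'inner'
 * here could deadlock. Try once and report failure instead of waiting.
 */
int newlink_sketch(struct toy_lock *inner)
{
	assert(inner != NULL);
	if (!toy_trylock(inner))
		return -EBUSY; /* assumed error code; the caller may retry */
	/* ... work that needs both locks would go here ... */
	toy_unlock(inner);
	return 0;
}
```

The design trade-off is that the caller must cope with a spurious failure, in exchange for never sleeping on a lock in the inverted order.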
A user visible difference brought by this series is that old PPP
interfaces (those created with ioctl(PPPIOCNEWUNIT)), can now be
removed by "ip link del", just like new rtnl based PPP devices.
Changes since v3:
- Rebase on net-next.
- Not an RFC anymore.
Changes since v2:
- Define ->rtnl_link_ops for ioctl based PPP devices, so they can
handle rtnl messages just like rtnl based ones (suggested by
Stephen Hemminger).
- Move back to original lock ordering between ppp_mutex and rtnl_lock
to simplify patch series. Handle lock inversion issue using
mutex_trylock() (suggested by Stephen Hemminger).
- Do file descriptor lookup directly in ppp_nl_newlink(), to simplify
ppp_dev_configure().
Changes since v1:
- Rebase on net-next.
- Invert locking order wrt. ppp_mutex and rtnl_lock and protect
file->private_data with ppp_mutex.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Guillaume Nault [Thu, 28 Apr 2016 15:55:30 +0000 (17:55 +0200)]
ppp: add rtnetlink device creation support
Define PPP device handler for use with rtnetlink.
The only PPP specific attribute is IFLA_PPP_DEV_FD. It is mandatory and
contains the file descriptor of the associated /dev/ppp instance (the
file descriptor which would have been used for ioctl(PPPIOCNEWUNIT) in
the ioctl-based API). The PPP device is removed when this file
descriptor is released (same behaviour as with ioctl based PPP
devices).
PPP devices created with the rtnetlink API behave like the ones created
with ioctl(PPPIOCNEWUNIT). In particular existing ioctls work the same
way, no matter how the PPP device was created.
The rtnl callbacks are also assigned to ioctl based PPP devices. This
way, rtnl messages have the same effect on any PPP devices.
The immediate effect is that all PPP devices, even ioctl-based
ones, can now be removed with "ip link del".
A minor difference still exists between ioctl and rtnl based PPP
interfaces: in the device name, the number following the "ppp" prefix
corresponds to the PPP unit number for ioctl based devices, while it is
just an unrelated incrementing index for rtnl ones.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Guillaume Nault [Thu, 28 Apr 2016 15:55:28 +0000 (17:55 +0200)]
ppp: define reusable device creation functions
Move PPP device initialisation and registration out of
ppp_create_interface().
This prepares code for device registration with rtnetlink.
While there, simplify the prototype of ppp_create_interface():
* Since ppp_dev_configure() takes care of setting file->private_data,
there's no need to return a ppp structure to ppp_unattached_ioctl()
anymore.
* The unit parameter is made read/write so that ppp_create_interface()
can tell which unit number has been assigned.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexandre TORGUE [Thu, 28 Apr 2016 13:56:45 +0000 (15:56 +0200)]
net: ethernet: stmmac: update MDIO support for GMAC4
On new GMAC4 IP, MAC_MDIO_address register has been updated, and bitmaps
changed. This patch takes into account those changes.
Signed-off-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Thu, 28 Apr 2016 14:36:30 +0000 (16:36 +0200)]
vxlan: fix initialization with custom link parameters
Commit
0c867c9bf84c ("vxlan: move Ethernet initialization to a separate
function") changed initialization order and as an unintended result, when the
user specifies additional link parameters (such as IFLA_ADDRESS) while
creating vxlan interface, those are overwritten by vxlan_ether_setup later.
It's necessary to call ether_setup from within the ->setup callback. That
way, the correct parameters are set by rtnl_create_link later. This is done
also for VXLAN-GPE, as we don't know the interface type yet at that point;
the device is changed to the correct interface type later.
Fixes:
0c867c9bf84c ("vxlan: move Ethernet initialization to a separate function")
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 29 Apr 2016 18:26:32 +0000 (14:26 -0400)]
Merge branch 'samples-bpf-user-experience'
Jesper Dangaard Brouer says:
====================
samples/bpf: Improve user experience
It is a steep learning curve getting started with using the eBPF
examples in samples/bpf/. There are several dependencies, and
specific versions of these dependencies. Invoking make in the correct
manner is also slightly obscure.
This patchset cleans up, documents and hopefully improves the first time
user experience with the eBPF samples directory by auto-detecting
certain scenarios.
V4:
- Address Naveen's nitpicks
- Handle/fail if extra args are passed in LLC or CLANG (David Laight)
V3:
- Add Alexei's ACKs
- Remove README paragraph about LLVM experimental BPF target
as it only existed between LLVM version 3.6 to 3.7.
V2:
- Adjusted recommended minimum versions to 3.7.1
- Included clang build instructions
- New patch adding CLANG variable and validation of command
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Thu, 28 Apr 2016 12:21:14 +0000 (14:21 +0200)]
samples/bpf: like LLC also verify and allow redefining CLANG command
Users are likely to manually compile both LLVM 'llc' and 'clang'
tools. Thus, also allow redefining CLANG and verify that the command exists.
Makefile implementation wise, the target that verifies the command has
been generalized.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Thu, 28 Apr 2016 12:21:09 +0000 (14:21 +0200)]
samples/bpf: allow make to be run from samples/bpf/ directory
It is not intuitive that 'make' must be run from the top level
directory with argument "samples/bpf/" to compile these eBPF samples.
Introduce a kbuild make file trick that allows make to be run from the
"samples/bpf/" directory itself. It basically changes to the top level
directory and calls "make samples/bpf/" with the "/" slash after the
directory name.
Also add a clean target that only cleans this directory, by taking
advantage of the kbuild external module setting M=$PWD.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Thu, 28 Apr 2016 12:21:04 +0000 (14:21 +0200)]
samples/bpf: add a README file to get users started
Getting started with using examples in samples/bpf/ is not
straightforward. There are several dependencies, and specific
versions of these dependencies.
Just compiling the example tool is also slightly obscure, e.g. one
needs to call make like:
make samples/bpf/
Do notice the "/" slash after the directory name.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Thu, 28 Apr 2016 12:20:58 +0000 (14:20 +0200)]
samples/bpf: Makefile verify LLVM compiler avail and bpf target is supported
Make compiling samples/bpf more user friendly, by detecting if LLVM
compiler tool 'llc' is available, and also detect if the 'bpf' target
is available in this version of LLVM.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Thu, 28 Apr 2016 12:20:53 +0000 (14:20 +0200)]
samples/bpf: add back functionality to redefine LLC command
It is practical to be able to redefine the location of the LLVM
command 'llc', because not all distros have an LLVM version with bpf
target support. Thus, it is sometimes required to compile LLVM from
source, and sometimes it is not desired to overwrite the distro's
default LLVM version.
This feature was removed with
128d1514be35 ("samples/bpf: Use llc in
PATH, rather than a hardcoded value").
Add this feature back. Note that it is possible to redefine LLC
on the make command line like:
make samples/bpf/ LLC=~/git/llvm/build/bin/llc
Fixes:
128d1514be35 ("samples/bpf: Use llc in PATH, rather than a hardcoded value")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 29 Apr 2016 17:41:47 +0000 (13:41 -0400)]
Merge branch 'cxgb4-mbox-cmd-logging'
Hariprasad Shenai says:
====================
cxgb4/cxgb4vf: add support for mbox cmd logging
This patch series adds support for logging mailbox commands and
replies for debugging purpose for both PF and VF driver.
This patch series has been created against net-next tree and includes
patches on cxgb4 and cxgb4vf driver.
We have included all the maintainers of respective drivers. Kindly
review the change and let us know in case of any review comments.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Thu, 28 Apr 2016 07:53:19 +0000 (13:23 +0530)]
cxgb4vf: Add support to enable logging of firmware mailbox commands for VF
Add new /sys/kernel/debug/ support to dump firmware mailbox commands
and replies for debugging purpose.
Based on original work by Casey Leedom <leedom@chelsio.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Thu, 28 Apr 2016 07:53:18 +0000 (13:23 +0530)]
cxgb4: Add support to enable logging of firmware mailbox commands
Add new /sys/kernel/debug/ support to dump a firmware mailbox command
issued and replies for debugging purpose.
Based on original work by Casey Leedom <leedom@chelsio.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 29 Apr 2016 17:39:04 +0000 (13:39 -0400)]
Merge branch 'hns-props'
Yisen Zhuang says:
====================
net: hns: update DT properties according to Rob's comments
There are some inappropriate property definitions in the hns DT. We
update the definitions according to Rob's review comments and fix some
typos in the binding.
For more details, please see individual patches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Yisen.Zhuang(Zhuangyuzeng) [Thu, 28 Apr 2016 07:09:04 +0000 (15:09 +0800)]
dts: hisi: update hns dst for changing property port-id to reg
Indexes should generally be avoided. This patch changes property port-id
to reg in dsaf port node.
Signed-off-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yisen.Zhuang(Zhuangyuzeng) [Thu, 28 Apr 2016 07:09:03 +0000 (15:09 +0800)]
Documentation: Bindings: Update DT binding for hns dsaf node
This patch changes property port-id to reg in dsaf port node,
removes property cpld-ctrl-reg, and fixes some typos.
Signed-off-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yisen.Zhuang(Zhuangyuzeng) [Thu, 28 Apr 2016 07:09:02 +0000 (15:09 +0800)]
net: hns: change port-id property to reg property in dsaf port node
Indexes should generally be avoided. So we use reg rather than port-id to
index ports.
Signed-off-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yisen.Zhuang(Zhuangyuzeng) [Thu, 28 Apr 2016 07:09:01 +0000 (15:09 +0800)]
net: hns: remove cpld-ctrl-reg and add cell in the cpld-syscon property
Because the cpld-ctrl-reg property is an offset based on the cpld-syscon
property, we make it a cell in the cpld-syscon property.
Signed-off-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mahesh Bandewar [Wed, 27 Apr 2016 21:59:27 +0000 (14:59 -0700)]
ipvlan: Fix failure path in dev registration during link creation
When newlink creation fails at device registration, the port->count
is decremented twice. Francesco Ruggeri (fruggeri@arista.com) found
this issue in Macvlan, and the same issue exists in the IPvlan driver too.
While fixing this issue I noticed another one, a missing unregister
in the failure path, so this fix adds it as well. The fix is similar to the
macvlan fix by Francesco in commit
308379607548 ("macvlan: fix failure
during registration v3")
Reported-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Eric Dumazet <edumazet@google.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
françois romieu [Wed, 27 Apr 2016 21:29:44 +0000 (23:29 +0200)]
pch_gbe: replace private tx ring lock with common netif_tx_lock
pch_gbe_tx_ring.tx_lock is only used in the hard_xmit handler and
in the transmit completion reaper called from NAPI context.
Compile-tested only. Potential victims Cced.
Someone more knowledgeable may check if pch_gbe_tx_queue could
have some use for a mmiowb.
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Andy Cress <andy.cress@us.kontron.com>
Cc: bryan@fossetcon.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Wed, 27 Apr 2016 18:45:14 +0000 (11:45 -0700)]
net: dsa: Provide CPU port statistics to master netdev
This patch overloads the DSA master netdev, aka CPU Ethernet MAC to also
include switch-side statistics, which is useful for debugging purposes,
when the switch is not properly connected to the Ethernet MAC (duplex
mismatch, (RG)MII electrical issues etc.).
We accomplish this by retaining the original copy of the master netdev's
ethtool_ops, and just overload the 3 operations we care about:
get_sset_count, get_strings and get_ethtool_stats so as to intercept
these calls and call into the original master_netdev ethtool_ops, plus
our own.
We take this approach as opposed to providing a set of DSA helper
functions that would retrieve the CPU port's statistics, because the
entire purpose of DSA is to allow unmodified Ethernet MAC drivers to be
used as CPU conduit interfaces, therefore, statistics overlay in such
drivers would simply not scale.
The new ethtool -S <iface> output would therefore look like this now:
<iface> statistics
p<2 digits cpu port number>_<switch MIB counter names>
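As an illustration of the naming scheme above, the prefix can be produced with a two-digit zero-padded format. This is a sketch only: the helper name is hypothetical and it does not claim to mirror how the DSA code builds the strings.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Writes "p<2-digit port>_<counter>" into buf.
 * Returns 0 on success, -1 if the result did not fit.
 */
int cpu_stat_name(char *buf, size_t len, unsigned int cpu_port,
		  const char *counter)
{
	int n;

	assert(buf != NULL && counter != NULL);
	n = snprintf(buf, len, "p%02u_%s", cpu_port, counter);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For example, CPU port 5 and a made-up counter name "rx_bytes" would give the string "p05_rx_bytes".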
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 27 Apr 2016 17:12:25 +0000 (10:12 -0700)]
tcp: give prequeue mode some care
The goal of the TCP prequeue is to defer processing of incoming packets
to a user space thread currently blocked in a recvmsg() system call.
Intent is to spend less time processing these packets on behalf
of softirq handler, as softirq handler is unfair to normal process
scheduler decisions, as it might interrupt threads that do not
even use networking.
The current prequeue implementation has the following issues:
1) It only checks the size of the prequeue against sk_rcvbuf.
It was fine 15 years ago when sk_rcvbuf was in the 64KB vicinity.
But we now have ~8MB values to cope with modern networking needs.
We have to add sk_rmem_alloc in the equation, since out of order
packets can definitely use up to sk_rcvbuf memory themselves.
2) Even with a fixed memory truesize check, prequeue can be filled
by thousands of packets. When prequeue needs to be flushed, either
from softirq context (in tcp_prequeue() or timer code), or process
context (in tcp_prequeue_process()), this adds a latency spike
which is often not desirable.
I added a fixed limit of 32 packets, as this translated to a max
flush time of 60 us on my test hosts.
Also note that all packets in prequeue are not accounted for tcp_mem,
since they are not charged against sk_forward_alloc at this point.
This is probably not a big deal.
Note that this might increase LINUX_MIB_TCPPREQUEUEDROPPED counts,
which is misnamed, as packets are not dropped at all, but rather pushed
to the stack (where they can be either consumed or dropped)
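The two limits described above (prequeue memory plus sk_rmem_alloc against sk_rcvbuf, and a cap of 32 packets) can be modelled in userspace roughly like this. The struct and function are hypothetical, and checking the limits after the packet has been queued is an assumption of the sketch, not a statement about the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Limit quoted in the text: at most 32 packets sit in the prequeue. */
#define PREQUEUE_MAX_PACKETS 32

/* Hypothetical userspace model of the socket fields the text mentions. */
struct sock_model {
	size_t rcvbuf;         /* models sk_rcvbuf */
	size_t rmem_alloc;     /* models sk_rmem_alloc (e.g. out-of-order data) */
	size_t prequeue_bytes; /* truesize currently held in the prequeue */
	size_t prequeue_len;   /* packets currently held in the prequeue */
};

/*
 * Accounts one packet of 'truesize' bytes and returns true when the
 * prequeue should be flushed: either the packet cap is reached, or
 * prequeue memory plus already-charged receive memory exceeds rcvbuf.
 */
bool prequeue_add_needs_flush(struct sock_model *sk, size_t truesize)
{
	assert(sk != NULL);
	sk->prequeue_bytes += truesize;
	sk->prequeue_len++;
	return sk->prequeue_len >= PREQUEUE_MAX_PACKETS ||
	       sk->prequeue_bytes + sk->rmem_alloc > sk->rcvbuf;
}
```

With a large rcvbuf the packet cap is what triggers the flush; with a nearly full receive buffer the memory check fires first.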
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kazior [Wed, 27 Apr 2016 10:59:13 +0000 (12:59 +0200)]
fq: split out backlog update logic
mac80211 (which will be the first user of the
fq.h) recently started to support software A-MSDU
aggregation. It glues skbuffs together into a
single one so the backlog accounting needs to be
more fine-grained.
To avoid backlog sorting logic duplication split
it up for re-use.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Wed, 27 Apr 2016 08:05:28 +0000 (11:05 +0300)]
tipc: remove an unnecessary NULL check
This is never called with a NULL "buf" and anyway, we dereference 's' on
the lines before so it would Oops before we reach the check.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Tue, 26 Apr 2016 15:52:33 +0000 (17:52 +0200)]
net/mlx5e: avoid stack overflow in mlx5e_open_channels
struct mlx5e_channel_param is a large structure that is allocated
on the stack of mlx5e_open_channels, and with a recent change
it has grown beyond the warning size for the maximum stack
that a single function should use:
mellanox/mlx5/core/en_main.c: In function 'mlx5e_open_channels':
mellanox/mlx5/core/en_main.c:1325:1: error: the frame size of 1072 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
The function is already using dynamic allocation and is not in
a fast path, so the easiest workaround is to use another kzalloc
for allocating the channel parameters.
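The workaround pattern, moving a large parameter block from the stack to a zeroed heap allocation, looks roughly like this in userspace. calloc() plays the role of kzalloc() here; the struct, its size and the function name are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical parameter block, large enough to be unwelcome on the stack. */
struct channel_param_sketch {
	unsigned char blob[1100];
};

int open_channels_sketch(int num_channels)
{
	struct channel_param_sketch *cparam;
	int i;

	assert(num_channels >= 0);
	/* calloc zeroes the memory, like kzalloc does in the kernel. */
	cparam = calloc(1, sizeof(*cparam));
	if (!cparam)
		return -ENOMEM;

	for (i = 0; i < num_channels; i++)
		cparam->blob[0] = (unsigned char)i; /* placeholder for real setup */

	free(cparam);
	return 0;
}
```

Only a pointer lives in the stack frame, so the frame size no longer depends on how large the parameter struct grows.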
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes:
d3c9bc2743dc ("net/mlx5e: Added ICO SQs")
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Tue, 26 Apr 2016 03:13:42 +0000 (23:13 -0400)]
tuntap: calculate rps hash only when needed
There's no need to calculate the rps hash if it is not enabled. So this
patch exports rps_needed and checks it before trying to get the rps
hash. Tests (using pktgen to inject packets into the guest) show this can
improve pps by about 13% (when rps is disabled).
Before:
~1150000 pps
After:
~1300000 pps
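The quoted "about 13%" follows from the two figures above; a small helper (hypothetical name) makes the arithmetic explicit.

```c
#include <assert.h>

/* Relative gain of 'after' over 'before', in percent. */
double gain_percent(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

Calling gain_percent(1150000.0, 1300000.0) evaluates 150000 / 1150000, which is a little over 13 percent.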
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
----
Changes from V1:
- Fix build when CONFIG_RPS is not set
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 28 Apr 2016 20:14:20 +0000 (16:14 -0400)]
Merge branch 'tcp-eor'
Martin KaFai Lau says:
====================
tcp: Make use of MSG_EOR in tcp_sendmsg
v4:
~ Do not set eor bit in do_tcp_sendpages() since there is
no way to pass MSG_EOR from the userland now.
~ Avoid rmw by testing MSG_EOR first in tcp_sendmsg().
~ Move TCP_SKB_CB(skb)->eor test to a new helper
tcp_skb_can_collapse_to() (suggested by Soheil).
~ Add some packetdrill tests.
v3:
~ Separate EOR marking from the SKBTX_ANY_TSTAMP logic.
~ Move the eor bit test back to the loop in tcp_sendmsg and
tcp_sendpage because there could be >1 threads doing
sendmsg.
~ Thanks to Eric Dumazet's suggestions on v2.
~ The TCP timestamp bug fixes are separated into other threads.
v2:
~ Rework based on the recent work
"add TX timestamping via cmsg" by
Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
~ This version takes the MSG_EOR bit as a signal of
end-of-response-message and leave the selective
timestamping job to the cmsg
~ Changes based on the v1 feedback (like avoid
unlikely check in a loop and adding tcp_sendpage
support)
~ The first 3 patches are bug fixes. The fixes in this
series depend on the newly introduced txstamp_ack in
net-next. I will make relevant patches against net after
getting some feedback.
~ The test results are based on the recently posted net fix:
"tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks"
One potential use case is to use MSG_EOR with
SOF_TIMESTAMPING_TX_ACK to get a more accurate
TCP ack timestamping on application protocol with
multiple outgoing response messages (e.g. HTTP2).
One of our use cases is at the webserver. The webserver tracks
the HTTP2 response latency by measuring when the webserver sends
the first byte to the socket till the TCP ACK of the last byte
is received. In the cases where we don't have client side
measurement, measuring from the server side is the only option.
In the cases where we do have client side measurement, the server side
data can also be used to cross-check the client
side data.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Mon, 25 Apr 2016 21:44:50 +0000 (14:44 -0700)]
tcp: Handle eor bit when fragmenting a skb
When fragmenting a skb, the next_skb should carry
the eor from prev_skb. The eor of prev_skb should
also be reset.
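A minimal model of that rule, assuming a simplified segment with just a length and an eor flag (both names are hypothetical and stand in for the skb and its control block):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb; only length and the eor mark are kept. */
struct frag_seg {
	size_t len;
	bool eor;
};

/*
 * Split 'prev' so that it keeps 'keep' bytes. The tail goes to 'next',
 * which inherits the eor mark because it now holds the last byte of the
 * message; 'prev' loses the mark so more data may follow it.
 */
void seg_fragment(struct frag_seg *prev, struct frag_seg *next, size_t keep)
{
	assert(prev != NULL && next != NULL && keep <= prev->len);
	next->len = prev->len - keep;
	prev->len = keep;
	next->eor = prev->eor;
	prev->eor = false;
}
```

Splitting a 15330-byte segment marked eor at 14600 bytes, for instance, leaves a 730-byte tail that carries the mark.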
Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
0.200 sendto(4, ..., 15330, MSG_EOR, ..., ...) = 15330
0.200 sendto(4, ..., 730, 0, ..., ...) = 730
0.200 > . 1:7301(7300) ack 1
0.200 > . 7301:14601(7300) ack 1
0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1
0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Mon, 25 Apr 2016 21:44:49 +0000 (14:44 -0700)]
tcp: Handle eor bit when coalescing skb
This patch:
1. Prevents next_skb from coalescing into prev_skb if
TCP_SKB_CB(prev_skb)->eor is set
2. Updates TCP_SKB_CB(prev_skb)->eor if coalescing is
allowed
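The two points above can be sketched on a simplified segment type. The names are hypothetical, and the assumption that prev takes over next's eor value after a merge is this sketch's reading of point 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb; only length and the eor mark are kept. */
struct coal_seg {
	size_t len;
	bool eor;
};

/*
 * Try to merge 'next' into 'prev'. Returns false and changes nothing when
 * 'prev' ends a message (point 1). Otherwise 'prev' absorbs the bytes and
 * takes over the eor mark of 'next' (point 2, as assumed here).
 */
bool seg_try_coalesce(struct coal_seg *prev, const struct coal_seg *next)
{
	assert(prev != NULL && next != NULL);
	if (prev->eor)
		return false;
	prev->len += next->len;
	prev->eor = next->eor;
	return true;
}
```

In this model, two back-to-back 730-byte segments that are both marked eor stay separate, which matches the intent of keeping message boundaries intact.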
Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 write(4, ..., 11680) = 11680
0.200 > P. 1:731(730) ack 1
0.200 > P. 731:1461(730) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:13141(4380) ack 1
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:13141,nop,nop>
0.300 > P. 1:731(730) ack 1
0.300 > P. 731:1461(730) ack 1
0.400 < . 1:1(0) ack 13141 win 257
0.400 close(4) = 0
0.400 > F. 13141:13141(0) ack 1
0.500 < F. 1:1(0) ack 13142 win 257
0.500 > . 13142:13142(0) ack 2
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Mon, 25 Apr 2016 21:44:48 +0000 (14:44 -0700)]
tcp: Make use of MSG_EOR in tcp_sendmsg
This patch adds an eor bit to the TCP_SKB_CB. When MSG_EOR
is passed to tcp_sendmsg, the eor bit will be set at the skb
containing the last byte of the userland's msg. The eor bit
will prevent data from appending to that skb in the future.
The change in do_tcp_sendpages is to honor the eor set
during the previous tcp_sendmsg(MSG_EOR) call.
This patch handles the tcp_sendmsg case. The followup patches
will handle other skb coalescing and fragment cases.
One potential use case is to use MSG_EOR with
SOF_TIMESTAMPING_TX_ACK to get a more accurate
TCP ack timestamping on application protocol with
multiple outgoing response messages (e.g. HTTP2).
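A minimal sketch of the append rule described above, under stated assumptions: `WriteQueue`, `Skb` and `SKB_CAPACITY` are hypothetical names, and the 7300-byte capacity is picked only so that the model produces the segment sizes seen in the script below. It is not the tcp_sendmsg implementation.

```python
from dataclasses import dataclass, field
from typing import List

SKB_CAPACITY = 7300  # hypothetical per-skb limit, not a kernel constant


@dataclass
class Skb:
    length: int = 0
    eor: bool = False


@dataclass
class WriteQueue:
    skbs: List[Skb] = field(default_factory=list)

    def sendmsg(self, size: int, msg_eor: bool = False) -> None:
        remaining = size
        while remaining > 0:
            tail = self.skbs[-1] if self.skbs else None
            # A tail skb with eor set is closed: new data starts a new skb.
            if tail is None or tail.eor or tail.length >= SKB_CAPACITY:
                tail = Skb()
                self.skbs.append(tail)
            chunk = min(remaining, SKB_CAPACITY - tail.length)
            tail.length += chunk
            remaining -= chunk
        if msg_eor and size > 0:
            # Mark the skb that holds the last byte of this message.
            self.skbs[-1].eor = True
```

Without the flag, two 730-byte sends end up in a single skb in this model; with it, each message keeps its own skb. From userspace the flag would be passed on the send call itself, for example `sock.send(payload, socket.MSG_EOR)` in Python on Linux.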
Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
0.200 write(4, ..., 14600) = 14600
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 > . 1:7301(7300) ack 1
0.200 > P. 7301:14601(7300) ack 1
0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1
0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 28 Apr 2016 20:06:11 +0000 (16:06 -0400)]
Merge branch 'tcp-redundant-checks'
Soheil Hassas Yeganeh says:
====================
tcp: simplify ack tx timestamps
v2:
- Fully remove SKBTX_ACK_TSTAMP, as suggested by Willem de Bruijn.
This patch series aims at removing redundant checks and fields
for ack timestamps for TCP.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Soheil Hassas Yeganeh [Thu, 28 Apr 2016 03:39:01 +0000 (23:39 -0400)]
tcp: remove SKBTX_ACK_TSTAMP since it is redundant
The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when
the timestamp of the TCP acknowledgement should be reported on
error queue. Since accessing skb_shinfo is likely to incur a
cache-line miss at the time of receiving the ack, the
txstamp_ack bit was added in tcp_skb_cb, which is set iff
the SKBTX_ACK_TSTAMP flag is set for an skb. This makes
SKBTX_ACK_TSTAMP flag redundant.
Remove the SKBTX_ACK_TSTAMP and instead use the txstamp_ack bit
everywhere.
Note that this frees one bit in shinfo->tx_flags.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Suggested-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Soheil Hassas Yeganeh [Thu, 28 Apr 2016 03:39:00 +0000 (23:39 -0400)]
tcp: remove an unnecessary check in tcp_tx_timestamp
Remove the redundant check for sk->sk_tsflags in tcp_tx_timestamp.
tcp_tx_timestamp() receives the tsflags as a parameter. As a
result the "sk->sk_tsflags || tsflags" is redundant, since
tsflags already includes sk->sk_tsflags plus overrides from
control messages.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 28 Apr 2016 13:33:24 +0000 (06:33 -0700)]
net: snmp: fix 64bit stats on 32bit arches
I accidentally replaced BH disabling by preemption disabling
in SNMP_ADD_STATS64() and SNMP_UPD_PO_STATS64() on 32bit builds.
For 64bit stats on 32bit arch, we really need to disable BH,
since the "struct u64_stats_sync syncp" might be manipulated
both from process and BH contexts.
Fixes:
6aef70a851ac ("net: snmp: kill various STATS_USER() helpers")
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 28 Apr 2016 03:08:41 +0000 (23:08 -0400)]
Merge branch 'socket-space-optimizations'
Eric Dumazet says:
====================
net: avoid some atomic ops when FASYNC is not used
We can avoid some atomic operations on sockets not using FASYNC
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 25 Apr 2016 17:39:34 +0000 (10:39 -0700)]
net: SOCKWQ_ASYNC_WAITDATA optimizations
SOCKWQ_ASYNC_WAITDATA is set/cleared in sk_wait_data()
and equivalent functions, so that sock_wake_async() can send
a SIGIO only when necessary.
Since these atomic operations are really not needed unless
socket expressed interest in FASYNC, we can omit them in most
cases.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 25 Apr 2016 17:39:32 +0000 (10:39 -0700)]
net: SOCKWQ_ASYNC_NOSPACE optimizations
SOCKWQ_ASYNC_NOSPACE is tested in sock_wake_async()
so that a SIGIO signal is sent when needed.
tcp_sendmsg() clears the bit.
tcp_poll() sets the bit when stream is not writeable.
We can avoid two atomic operations by first checking if socket
is actually interested in the FASYNC business (most sockets in
real applications do not use AIO, but select()/poll()/epoll())
This also removes one cache line miss to access sk->sk_wq->flags
in tcp_sendmsg()
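The idea can be sketched as follows. This is a simplified model, not the kernel helpers: `Sock`, `sk_set_bit`, `sk_clear_bit`, `poll_not_writeable` and `sendmsg` are illustrative names, the bit number is assumed, and `atomic_ops` merely counts how many simulated atomic bit operations were performed.

```python
from dataclasses import dataclass

SOCKWQ_ASYNC_NOSPACE = 0  # bit number assumed for illustration


@dataclass
class Sock:
    fasync: bool = False  # True once the application asked for SIGIO delivery
    wq_flags: int = 0     # models the wait queue flags word
    atomic_ops: int = 0   # number of simulated atomic bit operations


def sk_set_bit(sk: Sock, nr: int) -> None:
    # Cheap test first: sockets without FASYNC never need the flag.
    if not sk.fasync:
        return
    sk.atomic_ops += 1
    sk.wq_flags |= 1 << nr


def sk_clear_bit(sk: Sock, nr: int) -> None:
    if not sk.fasync:
        return
    sk.atomic_ops += 1
    sk.wq_flags &= ~(1 << nr)


def poll_not_writeable(sk: Sock) -> None:
    """Models the poll path when the stream is not writeable."""
    sk_set_bit(sk, SOCKWQ_ASYNC_NOSPACE)


def sendmsg(sk: Sock) -> None:
    """Models the send path, which clears the bit."""
    sk_clear_bit(sk, SOCKWQ_ASYNC_NOSPACE)
```

A socket that never enabled FASYNC goes through both paths without touching the flags word at all, while a FASYNC socket still pays for both bit operations.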
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 28 Apr 2016 02:48:25 +0000 (22:48 -0400)]
Merge branch 'snmp-stats-update'
Eric Dumazet says:
====================
net: snmp: update SNMP methods
In the old days (before linux-3.0), SNMP counters were duplicated,
one set for user context, and another one for BH context.
After commit
8f0ea0fe3a03 ("snmp: reduce percpu needs by 50%")
we have a single copy, and what really matters is preemption being
enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc()
respectively.
This patch series kills the obsolete STATS_USER() helpers,
and renames all XXX_BH() helpers to __XXX() ones, to more
closely match conventions used to update per cpu variables.
This is probably going to hurt maintainers' job for a while,
since cherry-picks will not be clean, but this had to be
cleaned at one point. I am so sorry guys.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 27 Apr 2016 23:44:43 +0000 (16:44 -0700)]
net: snmp: kill STATS_BH macros
There is nothing related to BH in SNMP counters anymore,
since linux-3.0.
Rename helpers to use __ prefix instead of _BH suffix,
for contexts where preemption is disabled.
This more closely matches convention used to update
percpu variables.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 27 Apr 2016 23:44:42 +0000 (16:44 -0700)]
ipv6: kill ICMP6MSGIN_INC_STATS_BH()
IPv6 ICMP stats are atomics anyway.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 27 Apr 2016 23:44:41 +0000 (16:44 -0700)]
ipv6: rename IP6_UPD_PO_STATS_BH()
Rename IP6_UPD_PO_STATS_BH() to __IP6_UPD_PO_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 27 Apr 2016 23:44:40 +0000 (16:44 -0700)]
ipv6: rename IP6_INC_STATS_BH()
Rename IP6_INC_STATS_BH() to __IP6_INC_STATS()
and IP6_ADD_STATS_BH() to __IP6_ADD_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 27 Apr 2016 23:44:39 +0000 (16:44 -0700)]
net: rename NET_{ADD|INC}_STATS_BH()
Rename NET_INC_STATS_BH() to __NET_INC_STATS()
and NET_ADD_STATS_BH() to __NET_ADD_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 27 Apr 2016 23:44:38 +0000 (16:44 -0700)]
net: rename IP_UPD_PO_STATS_BH()
Rename IP_UPD_PO_STATS_BH() to __IP_UPD_PO_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 27 Apr 2016 23:44:37 +0000 (16:44 -0700)]
net: rename IP_ADD_STATS_BH()
Rename IP_ADD_STATS_BH() to __IP_ADD_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 27 Apr 2016 23:44:36 +0000 (16:44 -0700)]
net: rename ICMP6_INC_STATS_BH()
Rename ICMP6_INC_STATS_BH() to __ICMP6_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 27 Apr 2016 23:44:35 +0000 (16:44 -0700)]
net: rename IP_INC_STATS_BH()
Rename IP_INC_STATS_BH() to __IP_INC_STATS(), to
better express this is used in non preemptible context.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 27 Apr 2016 23:44:34 +0000 (16:44 -0700)]
net: sctp: rename SCTP_INC_STATS_BH()
Rename SCTP_INC_STATS_BH() to __SCTP_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 27 Apr 2016 23:44:33 +0000 (16:44 -0700)]
net: icmp: rename ICMPMSGIN_INC_STATS_BH()
Remove misleading _BH suffix.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 27 Apr 2016 23:44:32 +0000 (16:44 -0700)]
net: tcp: rename TCP_INC_STATS_BH
Rename TCP_INC_STATS_BH() to __TCP_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 27 Apr 2016 23:44:31 +0000 (16:44 -0700)]
net: xfrm: kill XFRM_INC_STATS_BH()
Not used anymore.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 27 Apr 2016 23:44:30 +0000 (16:44 -0700)]
net: udp: rename UDP_INC_STATS_BH()
Rename UDP_INC_STATS_BH() to __UDP_INC_STATS(),
and UDP6_INC_STATS_BH() to __UDP6_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 27 Apr 2016 23:44:29 +0000 (16:44 -0700)]
net: rename ICMP_INC_STATS_BH()
Rename ICMP_INC_STATS_BH() to __ICMP_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 27 Apr 2016 23:44:28 +0000 (16:44 -0700)]
dccp: rename DCCP_INC_STATS_BH()
Rename DCCP_INC_STATS_BH() to __DCCP_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 27 Apr 2016 23:44:27 +0000 (16:44 -0700)]
net: snmp: kill various STATS_USER() helpers
In the old days (before linux-3.0), SNMP counters were duplicated,
one for user context, and one for BH context.
After commit
8f0ea0fe3a03 ("snmp: reduce percpu needs by 50%")
we have a single copy, and what really matters is preemption being
enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc()
respectively.
We therefore kill SNMP_INC_STATS_USER(), SNMP_ADD_STATS_USER(),
NET_INC_STATS_USER(), NET_ADD_STATS_USER(), SCTP_INC_STATS_USER(),
SNMP_INC_STATS64_USER(), SNMP_ADD_STATS64_USER(), TCP_ADD_STATS_USER(),
UDP_INC_STATS_USER(), UDP6_INC_STATS_USER(), and XFRM_INC_STATS_USER()
Following patches will rename _BH helpers to make clear their
usage is not tied to BH being disabled.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 28 Apr 2016 01:59:08 +0000 (21:59 -0400)]
Merge branch '40GbE' of git://git./linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
40GbE Intel Wired LAN Driver Updates 2016-04-27
This series contains updates to i40e and i40evf.
Alex Duyck cleans up the feature flags since they are becoming pretty
"massive", the primary change being that we now build our features list
around hw_encap_features. Added support for IPIP and SIT offloads,
which should improve throughput for IPIP and SIT tunnels with
the offload enabled.
Mitch adds support for configuring RSS on behalf of the VFs, which removes
the burden of dealing with different hardware interfaces from the VF
drivers and improves future compatibility. Fix to ensure that we do not
panic by checking that the vsi_res pointer is valid before dereferencing
it, after which we can drink beer and eat peanuts.
Shannon does some housekeeping in i40e_add_fdir_ethtool() in preparation
for more cloud filter work. Added flexibility to the nvmupdate
facility by adding the ability to specify an AQ event opcode to wait on
after Exec_AQ request.
Michal adds device capability which defines if an update is available and
if a security check is needed during the update process.
Kamil just adds a device id to support X722 QSFP+ device.
Greg fixes an issue where a mirror rule ID may be zero, so do not return
invalid parameter when the user passes in a zero for a rule ID. Adds
support to steer packets to VSIs by VLAN tag alone while being in
promiscuous mode for multicast and unicast MAC addresses.
Jesse stops the driver from offloading the VLAN tag into the skb any
time there was a VLAN tag and the hardware stripping was enabled, by
making sure it is enabled before put_tag.
v2: Dropped patch 8 ("i40e: Allow user to change input set mask for flow
director") while Kiran reworks a more generalized solution based
on feedback from David Miller.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 26 Apr 2016 22:30:07 +0000 (15:30 -0700)]
net-rfs: fix false sharing accessing sd->input_queue_head
sd->input_queue_head is incremented for each processed packet
in process_backlog(), and read from other cpus performing
Out Of Order avoidance in get_rps_cpu()
Moving this field in a separate cache line keeps it mostly
hot for the cpu in process_backlog(), as other cpus will
only read it.
In a stress test, process_backlog() was consuming 6.80 % of cpu cycles,
and the patch reduced the cost to 0.65 %
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Akinobu Mita [Tue, 26 Apr 2016 20:43:48 +0000 (05:43 +0900)]
net: w5100: support W5500
This adds support for W5500 chip.
W5500 has a register and memory organization similar to W5100 and W5200.
There are a few important differences listed below but it is still
possible to share common code with W5100 and W5200.
* W5500 register and memory are organized by multiple blocks. Each one
is selected by a 16-bit offset address and 5 block select bits.
But the existing register access operations take a u16 address. This change
extends the address to a u32 and puts the offset address in the lower 16 bits
and the block select bits in the upper 16 bits.
This change also adds the offset addresses for socket register and TX/RX
memory blocks to the driver private data structure in order to reduce
conditional switches for each chip.
* W5500 has a different register offset for the socket interrupt mask
register. Newly added internal functions w5100_enable_intr() and
w5100_disable_intr() take care of the difference.
* W5500 has a different register offset for the retry time-value register.
But this register is only used to verify that the reset value is correctly
read at initialization. So move the verification to w5100_hw_reset()
which already does different things for different chips.
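The u32 address layout from the first item above can be sketched like this. The helper names `w5500_addr` and `w5500_split` are hypothetical and do not come from the driver; only the bit layout (offset in the lower 16 bits, 5 block select bits in the upper half) is taken from the description.

```python
from typing import Tuple

BLOCK_SHIFT = 16     # block select bits are stored above the 16-bit offset
BLOCK_MASK = 0x1F    # 5 block select bits
OFFSET_MASK = 0xFFFF


def w5500_addr(block: int, offset: int) -> int:
    """Pack block select bits and a 16-bit offset into one u32 address."""
    if not 0 <= block <= BLOCK_MASK:
        raise ValueError("block select must fit in 5 bits")
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset must fit in 16 bits")
    return (block << BLOCK_SHIFT) | offset


def w5500_split(addr: int) -> Tuple[int, int]:
    """Recover (block, offset) from a packed address."""
    return (addr >> BLOCK_SHIFT) & BLOCK_MASK, addr & OFFSET_MASK
```

With this layout a W5100 or W5200 style 16-bit address is simply a packed address whose block part is zero, which is why the existing access functions can be shared.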
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Mike Sinkovsky <msink@permonline.ru>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Anjali Singhai Jain [Tue, 12 Apr 2016 15:30:52 +0000 (08:30 -0700)]
i40evf: Add driver support for promiscuous mode
Add necessary Linux Ethernet driver support for promiscuous mode
operation. Add a flag so the VF knows it is in promiscuous mode
and two state flags to discretely track multicast and unicast
promiscuous states.
Change-Id: Ib2f2dc7a7582304fec90fc917ebb7ded21ba1de4
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anjali Singhai Jain [Tue, 12 Apr 2016 15:30:51 +0000 (08:30 -0700)]
i40e: Add VF promiscuous mode driver support
Add infrastructure for Network Function Virtualization VLAN tagged
packet steering feature.
Change-Id: I9b873d8fcc253858e6baba65ac68ec5b9363944e
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Greg Rose [Tue, 12 Apr 2016 15:30:50 +0000 (08:30 -0700)]
i40e: Add promiscuous on VLAN support
NFV use cases require the ability to steer packets to VSIs by VLAN tag
alone while being in promiscuous mode for multicast and unicast MAC
addresses. These two new functions support that ability.
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jesse Brandeburg [Tue, 12 Apr 2016 15:30:49 +0000 (08:30 -0700)]
i40e/i40evf: Only offload VLAN tag if enabled
The driver was offloading the VLAN tag into the skb
any time there was a VLAN tag and the hardware stripping was
enabled. Just check to make sure it's enabled before put_tag.
Change-Id: Ife95290c06edd9a616393b38679923938b382241
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Greg Rose [Tue, 12 Apr 2016 15:30:48 +0000 (08:30 -0700)]
i40e: Remove zero check
A mirror rule ID may be zero so do not return invalid parameter when the
user passes in a zero value for a rule ID.
Change-ID: I261b8c24725ce2c6ed32f859da81093dfcbe2970
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Kamil Krawczyk [Tue, 12 Apr 2016 15:30:47 +0000 (08:30 -0700)]
i40e: Add DeviceID for X722 QSFP+
Change-ID: I1370fbc7774e815ac1ad56561e97488e829592fc
Signed-off-by: Kamil Krawczyk <kamil.krawczyk@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Michal Kosiarz [Tue, 12 Apr 2016 15:30:46 +0000 (08:30 -0700)]
i40e: Add device capability which defines if update is available
Add device capability which defines if update is available and security
check is needed during update process.
Change-ID: I380787c878275e1df18b39198df3ee3666342282
Signed-off-by: Michal Kosiarz <michal.kosiarz@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David S. Miller [Wed, 27 Apr 2016 19:43:10 +0000 (15:43 -0400)]
Merge git://git./linux/kernel/git/davem/net
Minor overlapping changes in the conflicts.
In the macsec case, the change of the default ID macro
name overlapped with the 64-bit netlink attribute alignment
fixes in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Mon, 25 Apr 2016 04:26:04 +0000 (21:26 -0700)]
net: ipv6: Use passed in table for nexthop lookups
Similar to
3bfd847203c6 ("net: Use passed in table for nexthop lookups")
for IPv4, if the route spec contains a table id use that to lookup the
next hop first and fall back to a full lookup if it fails (per the fix
4c9bcd117918b ("net: Fix nexthop lookups")).
Example:
root@kenny:~# ip -6 ro ls table red
local 2100:1::1 dev lo proto none metric 0 pref medium
2100:1::/120 dev eth1 proto kernel metric 256 pref medium
local 2100:2::1 dev lo proto none metric 0 pref medium
2100:2::/120 dev eth2 proto kernel metric 256 pref medium
local fe80::e0:f9ff:fe09:3cac dev lo proto none metric 0 pref medium
local fe80::e0:f9ff:fe1c:b974 dev lo proto none metric 0 pref medium
fe80::/64 dev eth1 proto kernel metric 256 pref medium
fe80::/64 dev eth2 proto kernel metric 256 pref medium
ff00::/8 dev red metric 256 pref medium
ff00::/8 dev eth1 metric 256 pref medium
ff00::/8 dev eth2 metric 256 pref medium
unreachable default dev lo metric 240 error -113 pref medium
root@kenny:~# ip -6 ro add table red 2100:3::/64 via 2100:1::64
RTNETLINK answers: No route to host
Route add fails even though 2100:1::64 is a reachable next hop:
root@kenny:~# ping6 -I red 2100:1::64
ping6: Warning: source address might be selected on device other than red.
PING 2100:1::64(2100:1::64) from 2100:1::1 red: 56 data bytes
64 bytes from 2100:1::64: icmp_seq=1 ttl=64 time=1.33 ms
With this patch:
root@kenny:~# ip -6 ro add table red 2100:3::/64 via 2100:1::64
root@kenny:~# ip -6 ro ls table red
local 2100:1::1 dev lo proto none metric 0 pref medium
2100:1::/120 dev eth1 proto kernel metric 256 pref medium
local 2100:2::1 dev lo proto none metric 0 pref medium
2100:2::/120 dev eth2 proto kernel metric 256 pref medium
2100:3::/64 via 2100:1::64 dev eth1 metric 1024 pref medium
local fe80::e0:f9ff:fe09:3cac dev lo proto none metric 0 pref medium
local fe80::e0:f9ff:fe1c:b974 dev lo proto none metric 0 pref medium
fe80::/64 dev eth1 proto kernel metric 256 pref medium
fe80::/64 dev eth2 proto kernel metric 256 pref medium
ff00::/8 dev red metric 256 pref medium
ff00::/8 dev eth1 metric 256 pref medium
ff00::/8 dev eth2 metric 256 pref medium
unreachable default dev lo metric 240 error -113 pref medium
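The lookup order described above (the table from the route spec first, then a full lookup) can be sketched as below. This is a toy model under stated assumptions: tables are plain lists of (prefix, device) pairs, `full_lookup` simply walks tables in a given order as a stand-in for the normal rule-based lookup, and all function names are hypothetical.

```python
import ipaddress
from typing import Dict, List, Optional, Tuple

Route = Tuple[ipaddress.IPv6Network, str]  # (prefix, output device)


def table_lookup(table: List[Route], addr: ipaddress.IPv6Address) -> Optional[str]:
    """Longest-prefix match in one table; returns the device or None."""
    best: Optional[Route] = None
    for net, dev in table:
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, dev)
    return best[1] if best else None


def full_lookup(tables: Dict[str, List[Route]], order: List[str],
                addr: ipaddress.IPv6Address) -> Optional[str]:
    """Stand-in for the regular lookup: try tables in the given order."""
    for name in order:
        dev = table_lookup(tables.get(name, []), addr)
        if dev is not None:
            return dev
    return None


def resolve_nexthop(tables: Dict[str, List[Route]], order: List[str],
                    gateway: str, table_id: Optional[str] = None) -> Optional[str]:
    addr = ipaddress.IPv6Address(gateway)
    if table_id is not None:
        # Use the table from the route spec first ...
        dev = table_lookup(tables.get(table_id, []), addr)
        if dev is not None:
            return dev
    # ... and fall back to a full lookup if that fails.
    return full_lookup(tables, order, addr)
```

With the prefixes from table red above, resolving 2100:1::64 with the table id finds eth1, while a lookup that never consults table red finds nothing, which corresponds to the "No route to host" failure.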
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel [Wed, 27 Apr 2016 15:53:08 +0000 (17:53 +0200)]
taskstats: fix nl parsing in accounting/getdelays.c
The type TASKSTATS_TYPE_NULL should always be ignored.
When jumping to the next attribute, only the length of the current
attribute should be added, not the length of all nested attributes.
This last bug was not visible before commit
80df554275c2, because the
kernel didn't put more than two nested attributes.
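Both points can be illustrated with a small netlink attribute walker. This is a sketch, not the getdelays.c code: the helper names are hypothetical and the numeric attribute types are assumed for the example. The header layout (16-bit length, 16-bit type, payload padded to 4 bytes) is the standard nlattr format.

```python
import struct
from typing import Iterator, List, Tuple

NLA_HDRLEN = 4
NLA_ALIGNTO = 4

# Attribute type values assumed for illustration.
TYPE_PID, TYPE_STATS, TYPE_AGGR_PID, TYPE_NULL = 1, 3, 4, 6


def nla_align(length: int) -> int:
    return (length + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def nla_put(attr_type: int, payload: bytes) -> bytes:
    """Build one attribute: header, payload, then padding to 4 bytes."""
    length = NLA_HDRLEN + len(payload)
    pad = nla_align(length) - length
    return struct.pack("=HH", length, attr_type) + payload + b"\0" * pad


def nla_walk(buf: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (type, payload) for each attribute in buf."""
    offset = 0
    while offset + NLA_HDRLEN <= len(buf):
        nla_len, nla_type = struct.unpack_from("=HH", buf, offset)
        if nla_len < NLA_HDRLEN or offset + nla_len > len(buf):
            break
        yield nla_type, buf[offset + NLA_HDRLEN:offset + nla_len]
        # Advance by the length of the current attribute only.
        offset += nla_align(nla_len)


def parse_nested(buf: bytes) -> List[Tuple[int, bytes]]:
    # The NULL type is padding and is always ignored.
    return [(t, p) for t, p in nla_walk(buf) if t != TYPE_NULL]
```

A nested attribute that carries a padding entry between its pid and stats entries is then parsed into exactly those two entries, regardless of how many attributes the kernel nests.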
Fixes:
a3baf649ca9c ("[PATCH] per-task-delay-accounting: documentation")
Fixes:
80df554275c2 ("taskstats: use the libnl API to align nlattr on 64-bit")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Tue, 26 Apr 2016 23:25:51 +0000 (16:25 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Handle v4/v6 mixed sockets properly in soreuseport, from Craig
Gallak.
2) Bug fixes for the new macsec facility (missing kmalloc NULL checks,
missing locking around netdev list traversal, etc.) from Sabrina
Dubroca.
3) Fix handling of host routes on ifdown in ipv6, from David Ahern.
4) Fix double-fdput in bpf verifier. From Jann Horn.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (31 commits)
bpf: fix double-fdput in replace_map_fd_with_map_ptr()
net: ipv6: Delete host routes on an ifdown
Revert "ipv6: Revert optional address flusing on ifdown."
net/mlx4_en: fix spurious timestamping callbacks
net: dummy: remove note about being Y by default
cxgbi: fix uninitialized flowi6
ipv6: Revert optional address flusing on ifdown.
ipv4/fib: don't warn when primary address is missing if in_dev is dead
net/mlx5: Add pci shutdown callback
net/mlx5_core: Remove static from local variable
net/mlx5e: Use vport MTU rather than physical port MTU
net/mlx5e: Fix minimum MTU
net/mlx5e: Device's mtu field is u16 and not int
net/mlx5_core: Add ConnectX-5 to list of supported devices
net/mlx5e: Fix MLX5E_100BASE_T define
net/mlx5_core: Fix soft lockup in steering error flow
qlcnic: Update version to 5.3.64
net: stmmac: socfpga: Remove re-registration of reset controller
macsec: fix netlink attribute validation
macsec: add missing macsec prefix in uapi
...
Linus Torvalds [Tue, 26 Apr 2016 23:17:01 +0000 (16:17 -0700)]
Merge tag 'fixes-for-linus' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Arnd Bergmann:
"Here are the latest bug fixes for ARM SoCs, mostly addressing recent
regressions. Changes are across several platforms, so I'm listing
every change separately here.
Regressions since 4.5:
- A correction of the psci firmware DT binding, to prevent users from
relying on unintended semantics
- Actually getting the newly merged clock driver for some OMAP
platforms to work
- A revert of patches for the Qualcomm BAM, these need to be reworked
for 4.7 to avoid breaking boards other than the one they were
intended for
- A correction for the I2C device nodes on the Socionext Uniphier
platform
- i.MX SDHCI was broken for non-DT platforms due to a change with the
setting of the DMA mask
- A revert of a patch that accidentally added a nonexisting clock on
the Renesas "Porter" board
- A couple of OMAP fixes that are all related to suspend after the
power domain changes for dra7
- On Mediatek, revert part of the power domain initialization changes
that broke mt8173-evb
Fixes for older bugs:
- Workaround for an "external abort" in the omap34xx suspend/resume
code.
- The USB1/eSATA should not be listed as an extcon device on
am57xx-beagle-x15 (broken since v4.0)
- A v4.5 regression in the TI AM33xx and AM43XX DT specifying
incorrect DMA request lines for the GPMC
- The jiffies calibration on Renesas platforms was incorrect for some
modern CPU cores.
- A hardware errata workaround for clockdomains on TI DRA7"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
drivers: firmware: psci: unify enable-method binding on ARM {64,32}-bit systems
arm64: dts: uniphier: fix I2C nodes of PH1-LD20
ARM: shmobile: timer: Fix preset_lpj leading to too short delays
Revert "ARM: dts: porter: Enable SCIF_CLK frequency and pins"
ARM: dts: r8a7791: Don't disable referenced optional clocks
Revert "ARM: OMAP: Catch callers of revision information prior to it being populated"
ARM: OMAP3: Fix external abort on 36xx waking from off mode idle
ARM: dts: am57xx-beagle-x15: remove extcon_usb1
ARM: dts: am437x: Fix GPMC dma properties
ARM: dts: am33xx: Fix GPMC dma properties
Revert "soc: mediatek: SCPSYS: Fix double enabling of regulators"
ARM: mach-imx: sdhci-esdhc-imx: initialize DMA mask
ARM: DRA7: clockdomain: Implement timer workaround for errata i874
ARM: OMAP: Catch callers of revision information prior to it being populated
ARM: dts: dra7: Correct clock tree for sys_32k_ck
ARM: OMAP: DRA7: Provide proper class to omap2_set_globals_tap
ARM: OMAP: DRA7: wakeupgen: Skip SAR save for wakeupgen
Revert "dts: msm8974: Add dma channels for blsp2_i2c1 node"
Revert "dts: msm8974: Add blsp2_bam dma node"
ARM: dts: Add clocks for dm814x ADPLL
Linus Torvalds [Tue, 26 Apr 2016 03:04:08 +0000 (20:04 -0700)]
devpts: more pty driver interface cleanups
This is more prep-work for the upcoming pty changes. Still just code
cleanup with no actual semantic changes.
This removes a bunch of pointless complexity by just having the slave pty
side remember the dentry associated with the devpts slave rather than
the inode. That allows us to remove all the "look up the dentry" code
for when we want to remove it again.
Together with moving the tty pointer from "inode->i_private" to
"dentry->d_fsdata" and getting rid of pointless inode locking, this
removes about 30 lines of code. Not only is the end result smaller,
it's simpler and easier to understand.
The old code, for example, depended on the d_find_alias() to not just
find the dentry, but also to check that it is still hashed, which in
turn validated the tty pointer in the inode.
That is a _very_ roundabout way to say "invalidate the cached tty
pointer when the dentry is removed".
The new code just does
dentry->d_fsdata = NULL;
in devpts_pty_kill() instead, invalidating the tty pointer rather more
directly and obviously. Don't do something complex and subtle when the
obvious straightforward approach will do.
The rest of the patch (ie apart from code deletion and the above tty
pointer clearing) is just switching the calling convention to pass the
dentry or file pointer around instead of the inode.
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Aurelien Jarno <aurelien@aurel32.net>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Jann Horn <jann@thejh.net>
Cc: Greg KH <greg@kroah.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jann Horn [Tue, 26 Apr 2016 20:26:26 +0000 (22:26 +0200)]
bpf: fix double-fdput in replace_map_fd_with_map_ptr()
When bpf(BPF_PROG_LOAD, ...) was invoked with a BPF program whose bytecode
references a non-map file descriptor as a map file descriptor, the error
handling code called fdput() twice instead of once (in __bpf_map_get() and
in replace_map_fd_with_map_ptr()). If the file descriptor table of the
current task is shared, this causes f_count to be decremented too much,
allowing the struct file to be freed while it is still in use
(use-after-free). This can be exploited to gain root privileges by an
unprivileged user.
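The reference counting problem can be modelled in a few lines. This is a toy model of the shared file descriptor table case only: `File`, `map_get`, `replace_buggy` and `replace_fixed` are hypothetical names, and the single initial reference stands for the one held by the descriptor table.

```python
class File:
    """Toy struct file with a reference count."""

    def __init__(self) -> None:
        self.f_count = 1      # reference held by the descriptor table
        self.freed = False

    def get(self) -> None:    # models fdget() on a shared table
        self.f_count += 1

    def put(self) -> None:    # models fdput()
        self.f_count -= 1
        if self.f_count == 0:
            self.freed = True


def map_get(f: File, is_map: bool) -> bool:
    """Models the lookup helper: on error it already drops the reference."""
    if not is_map:
        f.put()
        return False
    return True


def replace_buggy(f: File, is_map: bool) -> bool:
    f.get()
    if not map_get(f, is_map):
        f.put()               # second put on the error path: the bug
        return False
    f.put()
    return True


def replace_fixed(f: File, is_map: bool) -> bool:
    f.get()
    if not map_get(f, is_map):
        return False          # the helper has already dropped the reference
    f.put()
    return True
```

In the buggy variant a non-map descriptor ends with a count of zero while the descriptor table still points at the file, which is the use-after-free condition described above.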
This bug was introduced in
commit
0246e64d9a5f ("bpf: handle pseudo BPF_LD_IMM64 insn"), but is only
exploitable since
commit
1be7f75d1668 ("bpf: enable non-root eBPF programs") because
previously, CAP_SYS_ADMIN was required to reach the vulnerable code.
(posted publicly according to request by maintainer)
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Tue, 26 Apr 2016 21:14:30 +0000 (23:14 +0200)]
pch_gbe: fix bogus trylock conversion
Should have converted 'if (trylock)' to 'lock'.
Fixes:
a6086a893718db ("drivers: net: remove NETDEV_TX_LOCKED")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 26 Apr 2016 20:07:21 +0000 (16:07 -0400)]
Merge branch 'sh_eth-next'
Sergei Shtylyov says:
====================
sh_eth: couple of software reset bit cleanups
Here's a set of 2 patches against DaveM's 'net-next.git' repo. We clean up
the use of the software reset bits...
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Sun, 24 Apr 2016 20:46:15 +0000 (23:46 +0300)]
sh_eth: rename ARSTR register bit
The Renesas RZ/A1H manual names the software reset bit in the software reset
register (ARSTR) ARST which makes a bit more sense than the ARSTR_ARSTR name
used now by the driver -- rename the latter to ARSTR_ARST.
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Sun, 24 Apr 2016 20:45:23 +0000 (23:45 +0300)]
sh_eth: use EDMR_SRST_GETHER in sh_eth_check_reset()
sh_eth_check_reset() uses a bare number where EDMR_SRST_GETHER would fit,
i.e. the receive/transmit software reset bits that comprise EDMR_SRST_GETHER
read as 1 while the corresponding reset is in progress and thus, when both
are 0, the reset is complete.
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
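The reset check described in the sh_eth commit above amounts to polling a register until both reset bits read as 0. A minimal user-space sketch follows. The numeric value given to EDMR_SRST_GETHER, the edmr_read_fn accessor type, the fake_edmr helper and the name check_reset_done() are all assumptions made for illustration; the driver's real accessor, delay and timeout handling are not shown in this log.

```c
#include <stdint.h>

/* Assumed value: the receive and transmit software reset bits together. */
#define EDMR_SRST_GETHER 0x03u

/* Hypothetical register accessor: returns the current EDMR value. */
typedef uint32_t (*edmr_read_fn)(void *ctx);

/*
 * Poll until both reset bits read as 0 or the poll budget runs out.
 * Returns 0 once the reset has completed, -1 on timeout.
 */
int check_reset_done(edmr_read_fn read_edmr, void *ctx, int max_polls)
{
	while (max_polls-- > 0) {
		if (!(read_edmr(ctx) & EDMR_SRST_GETHER))
			return 0;
	}
	return -1;
}

/* Fake register for illustration: reports "reset in progress" for a while. */
struct fake_edmr {
	int busy_reads;	/* how many reads still return the reset bits set */
};

uint32_t fake_edmr_read(void *ctx)
{
	struct fake_edmr *reg = ctx;

	if (reg->busy_reads > 0) {
		reg->busy_reads--;
		return EDMR_SRST_GETHER;
	}
	return 0;
}
```

Using the named mask instead of a bare number makes it visible that the loop waits for both reset bits, not for an arbitrary constant.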
David S. Miller [Tue, 26 Apr 2016 19:58:03 +0000 (15:58 -0400)]
Merge branch 'mlx5-next'
Saeed Mahameed says:
====================
Mellanox 100G extending mlx5 ethtool support
Changes from V0:
- Dropped: net/mlx5e: Disable link up on INIT HCA command
Due to Ido's and Or's requests we will submit this patch to net and will need it for -stable.
- Rebased to:
11afbff86168 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next")
This series is centered on extending and improving mlx5 ethernet driver ethtool
support. We've done some code refactoring for ethtool statistics reporting, making it
more scalable and robust. Each reported ethtool counter now belongs to a group and has
its own descriptor within that group; the descriptor holds the counter name and its
offset within that group's memory block.
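As a rough illustration of the descriptor scheme described above, each counter can be described by a name and a byte offset into its group's memory block. This is a sketch only: the sw_stats group, its fields, the DECLARE_STAT macro and read_counter() are hypothetical names chosen for the example and are not taken from the driver.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical counter group: one contiguous block of 64-bit counters. */
struct sw_stats {
	uint64_t rx_packets;
	uint64_t rx_bytes;
	uint64_t tx_packets;
};

/* Descriptor: counter name plus its byte offset inside the group's block. */
struct counter_desc {
	const char *name;
	size_t offset;
};

/* Build a descriptor from a struct type and a field name. */
#define DECLARE_STAT(type, fld) { #fld, offsetof(type, fld) }

const struct counter_desc sw_stats_desc[] = {
	DECLARE_STAT(struct sw_stats, rx_packets),
	DECLARE_STAT(struct sw_stats, rx_bytes),
	DECLARE_STAT(struct sw_stats, tx_packets),
};

#define NUM_SW_COUNTERS (sizeof(sw_stats_desc) / sizeof(sw_stats_desc[0]))

/* Read counter i of a group through its descriptor table. */
uint64_t read_counter(const void *block, const struct counter_desc *desc,
		      size_t i)
{
	uint64_t value;

	memcpy(&value, (const char *)block + desc[i].offset, sizeof(value));
	return value;
}
```

With this layout, reporting code can loop over the descriptor table to emit names and values, so adding a counter means adding one field and one descriptor entry.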
Added new counters:
- Reporting more error and drop counters in ifconfig/ip tool.
- Per priority pause and traffic counters in ethtool.
- link down events counter in ethtool.
Set features handling was also refactored a little bit to be more resilient and generic.
Setting more than one feature will no longer stop on the first failed one; instead
it will try to continue setting the others. We made it generic to make it simpler to add
support for more features: this is now done by introducing only a handler function for
the newly supported netdev feature and letting the generic handler do the job.
New netdev features and ethtool support:
- Netdev feature RXALL, set on/off FCS check offload.
- Netdev feature HW_VLAN_CTAG_RX, set on/off rx-vlan stripping offload.
- Ethtool interface identify.
- Ethtool dump module EEPROM.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed [Sun, 24 Apr 2016 19:51:56 +0000 (22:51 +0300)]
net/mlx5e: Fix checksum handling for non-stripped vlan packets
Now that rx-vlan offload can be disabled, packets can be received
with the vlan tag not stripped, which means is_first_ethertype_ip will
return false. For those packets we need to check whether the hardware
reported csum OK, so that we report CHECKSUM_UNNECESSARY for them.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
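The decision described in the checksum commit above can be sketched as a small pure function. Everything here is hypothetical: the rx_info struct, the csum_result enum and pick_csum() are illustration names, and the CSUM_COMPLETE branch for untagged IP packets is an assumption about the surrounding receive path rather than something stated in the commit message.

```c
#include <stdbool.h>

/* Toy equivalents of the checksum states a receive path can report. */
enum csum_result {
	CSUM_NONE,		/* stack must verify the checksum itself */
	CSUM_UNNECESSARY,	/* hardware already validated the checksum */
	CSUM_COMPLETE		/* hardware supplied a full packet checksum */
};

/* Hypothetical per-packet receive information. */
struct rx_info {
	bool first_ethertype_ip;	/* false when a vlan tag was not stripped */
	bool hw_csum_ok;		/* hardware reported csum OK */
};

enum csum_result pick_csum(const struct rx_info *rx)
{
	/* Assumed pre-existing path: untagged IP packet. */
	if (rx->first_ethertype_ip)
		return CSUM_COMPLETE;

	/* Path for non-stripped vlan packets: trust the hardware verdict. */
	if (rx->hw_csum_ok)
		return CSUM_UNNECESSARY;

	return CSUM_NONE;
}
```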
Gal Pressman [Sun, 24 Apr 2016 19:51:55 +0000 (22:51 +0300)]
net/mlx5e: Add ethtool support for rxvlan-offload (vlan stripping)
Use ethtool -K <interface> rxvlan <on/off> to enable/disable
C-TAG vlan stripping by hardware.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gal Pressman [Sun, 24 Apr 2016 19:51:54 +0000 (22:51 +0300)]
net/mlx5e: Add ethtool support for dump module EEPROM
Add query MCIA, PMLP registers infrastructure and commands.
Add ethtool support for get_module_info() and get_module_eeprom()
callbacks.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gal Pressman [Sun, 24 Apr 2016 19:51:53 +0000 (22:51 +0300)]
net/mlx5e: Add ethtool support for interface identify (LED blinking)
Add the needed hardware command and mlx5_ifc structs for managing LED
control.
Add set_phys_id ethtool callback to support ethtool -p flag.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eran Ben Elisha [Sun, 24 Apr 2016 19:51:52 +0000 (22:51 +0300)]
net/mlx5e: Add support for RXALL netdev feature
Introduce a new access register named Ports Check Mask Register (PCMR) to
control all HW checks on the port. With this register, the driver can
enable/disable hardware FCS validation.
When RXALL is enabled/disabled using ndo_set_features, enable/disable
the FCS check in HW.
The user can change the HW configuration using the rx-all flag in ethtool.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gal Pressman [Sun, 24 Apr 2016 19:51:51 +0000 (22:51 +0300)]
net/mlx5e: Improve set features ndo resiliency
In the current mlx5e ndo_set_features implementation, setting some features
can succeed while others can fail. Today, we return one error code which
doesn't reflect the current features status of the netdev at the end of
the ndo callback.
Set netdev->features with the features which were successfully set in order
to keep the current status in case of failure. For this purpose, define
a new macro to set/unset a specific feature in netdev->features.
This patch introduces a mechanism that uses feature handlers for each
feature.
Set features will call a generic handler, which will then call a specific
handler in its turn and update netdev->features according to its return
value. Each specific handler is responsible for performing driver specific
actions and updating params if needed.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
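The handler mechanism described in the set features commit above can be sketched in user space as follows. This is a minimal model under stated assumptions: toy_netdev, features_t, the FEAT_* bits, SET_FEATURE, handle_feature(), set_features() and the two specific handlers are hypothetical names, and the rxvlan_broken flag exists only to simulate a failing handler.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t features_t;

/* Hypothetical feature bits. */
#define FEAT_RXALL	(1ull << 0)
#define FEAT_RXVLAN	(1ull << 1)

/* Toy device: current feature set plus a switch to force a failure. */
struct toy_netdev {
	features_t features;
	bool rxvlan_broken;
};

typedef int (*feature_handler)(struct toy_netdev *dev, bool enable);

/* Set or clear one feature bit in dev->features. */
#define SET_FEATURE(dev, feat, enable)			\
	do {						\
		if (enable)				\
			(dev)->features |= (feat);	\
		else					\
			(dev)->features &= ~(feat);	\
	} while (0)

/* Generic handler: call the specific handler only if the bit changes. */
int handle_feature(struct toy_netdev *dev, features_t wanted,
		   features_t feat, feature_handler handler)
{
	features_t changes = wanted ^ dev->features;
	bool enable = (wanted & feat) != 0;
	int err;

	if (!(changes & feat))
		return 0;

	err = handler(dev, enable);
	if (err)
		return err;	/* dev->features keeps the old state */

	SET_FEATURE(dev, feat, enable);
	return 0;
}

/* Specific handlers: driver specific work would go here. */
int set_rxall(struct toy_netdev *dev, bool enable)
{
	(void)dev;
	(void)enable;
	return 0;
}

int set_rxvlan(struct toy_netdev *dev, bool enable)
{
	(void)enable;
	return dev->rxvlan_broken ? -1 : 0;
}

/* Try every feature; a failure in one does not stop the others. */
int set_features(struct toy_netdev *dev, features_t wanted)
{
	int err = 0;

	err |= handle_feature(dev, wanted, FEAT_RXVLAN, set_rxvlan);
	err |= handle_feature(dev, wanted, FEAT_RXALL, set_rxall);
	return err ? -1 : 0;
}
```

In this model a failed handler leaves its bit untouched while the remaining features are still applied, so dev->features reflects what actually took effect even when set_features() returns an error.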
Gal Pressman [Sun, 24 Apr 2016 19:51:50 +0000 (22:51 +0300)]
net/mlx5e: Add link down events counter
Expose link_down_events counter through ethtool -S.
This counter is read from PPort statistics, then processed and stored as
a special handling software counter.
This counter is stored along with the software counters since it is the only
PPort counter whose size is not 64 bits.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gal Pressman [Sun, 24 Apr 2016 19:51:49 +0000 (22:51 +0300)]
net/mlx5e: Add per priority group to PPort counters
Expose counters providing information for each priority level (PCP) through
ethtool -S option and DCBNL.
This includes rx/tx bytes, frames, and pause counters.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gal Pressman [Sun, 24 Apr 2016 19:51:48 +0000 (22:51 +0300)]
net/mlx5e: Rename VPort counters
VPort and software counter names are confusing and may be unclear; all
VPort counters now have a prefix of rx/tx_vport_*.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>