David S. Miller [Thu, 21 Jul 2016 05:07:24 +0000 (22:07 -0700)]
Merge branch 'xdp-cleanups'
Brenden Blanco says:
====================
misc cleanups for xdp
This addresses several of the non-blocking comments left over from the
xdp patch set. See individual patches for details.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenden Blanco [Thu, 21 Jul 2016 00:22:35 +0000 (17:22 -0700)]
bpf: make xdp sample variable names more meaningful
The naming choice of index is not terribly descriptive, and dropcnt is
in fact incorrect for xdp2. Pick better names for these: ipproto and
rxcnt.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenden Blanco [Thu, 21 Jul 2016 00:22:34 +0000 (17:22 -0700)]
rtnl: protect do_setlink from IFLA_XDP_ATTACHED
The IFLA_XDP_ATTACHED nested attribute is meant for read-only, and while
do_setlink properly ignores it, it should be more paranoid and reject
commands that try to set it.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenden Blanco [Thu, 21 Jul 2016 00:22:33 +0000 (17:22 -0700)]
net/mlx4_en: use READ_ONCE when freeing xdp_prog
For consistency, and in order to hint at the synchronous nature of the
xdp_prog field, use READ_ONCE in the destroy path of the ring. All
occurrences should now use either READ_ONCE or xchg.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 21 Jul 2016 04:36:55 +0000 (21:36 -0700)]
Merge branch '100GbE' of git://git./linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
100GbE Intel Wired LAN Driver Updates 2016-07-20
This series contains updates to fm10k only.
Ngai-Mint provides a fix to clear PCIE_GMBX bits to ensure the proper
functioning of the mailbox global interrupt after a data path reset.
Jake provides most of the patches in the series, starting with a early
return from fm10k_down() if we are already down to prevent conflict with
other threads. Fixed an issue where fm10k_update_stats() could cause
a null pointer dereference, specifically if it is called when we are going
down and the rings have been removed. Cleans up and fixes the data path
reset flow, Tx hang routine and stop_hw(). Re-worked the fm10k_reinit()
to be more maintainable and fixed several inconsistencies with the work
flow. Implemented fm10k_prepare_suspend() and fm10k_handle_resume()
which abstract around the now existing fm10k_prepare_for_reset and
fm10k_handle_reset. The new functions also handle stopping the service
task, which is something that the original re-init flow does not need.
Fixed an issue where if an FLR occurs, VF devices will be knocked out of
bus master mode, and the driver will be unable to recover from the reset
properly, so ensure bus master is enabled after every reset. Fixed an
issue where a reset will occur as if for no reason, regularly every few
minutes until the switch manager software is loaded, which is caused
by continuously requesting the lport map so only do the request after
we have verified the switch mailbox is tx_ready.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 21 Jul 2016 04:10:55 +0000 (21:10 -0700)]
Merge branch 'mv88r6xxx-eeprom-rework'
Vivien Didelot says:
====================
net: dsa: mv88e6xxx: rework EEPROM code
Some switches can access an optional external EEPROM via its registers.
The 88E6352 family of switches have 8-bit address / 16-bit data access.
The new 88E6390 family has 16-bit address / 8-bit data access.
This patchset cleans up the EEPROM code with 16-suffixed Global2 helpers
and makes it easy to add future support for 8-bit data EEPROM access.
It also removes unnecessary mutexes and a few locked access functions.
Changes in v2:
- add missing Signed-off-by tag
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Wed, 20 Jul 2016 22:18:36 +0000 (18:18 -0400)]
net: dsa: mv88e6xxx: kill last locked reg_read
Get rid of the last usage of the locked mv88e6xxx_reg_read function with
a new mv88e6xxx_port_read helper, useful later for chips with different
port registers base address.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Wed, 20 Jul 2016 22:18:35 +0000 (18:18 -0400)]
net: dsa: mv88e6xxx: rework EEPROM access
The 6352 family of switches and compatibles provide a 8-bit address and
16-bit data access to an optional EEPROM.
Newer chip such as the 6390 family slightly changed the access to 16-bit
address and 8-bit data.
This commit cleans up the EEPROM access code for 16-bit access and makes
it easy to eventually introduce future support for 8-bit access.
Here's a list of notable changes brought by this patch:
- provide Global2 unlocked helpers for EEPROM commands
- remove eeprom_mutex, only reg_lock is necessary for driver functions
- eeprom_len is 0 for chip without EEPROM, so return it directly
- the Running bit must be 0 before r/w, so wait for Busy *and* Running
- remove now unused mv88e6xxx_wait and mv88e6xxx_reg_write
- other than that, the logic (in _{get,set}_eeprom16) didn't change
Chips with an 8-bit EEPROM access will require to implement the
8-suffixed variant of G2 helpers and the related flag:
#define MV88E6XXX_FLAGS_EEPROM8 \
(MV88E6XXX_FLAG_G2_EEPROM_CMD | \
MV88E6XXX_FLAG_G2_EEPROM_ADDR)
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Wed, 20 Jul 2016 22:18:34 +0000 (18:18 -0400)]
net: dsa: mv88e6xxx: remove unused phy_mutex
Only reg_lock is necessary now and phy_mutex is dead. Remove it.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gavin Shan [Thu, 21 Jul 2016 01:42:54 +0000 (11:42 +1000)]
net/faraday: Disallow using reversed MAC address from hardware
The initial MAC address is retrieved from hardware if it's not
provided by device-tree. The reserved MAC address from hardware
will be used if non-reserved MAC address is invalid. It will
cause mismatched MAC address seen by hardware and software.
This disallows using the reserved hardware MAC address to avoid
the mismatched MAC address seen by hardware and software.
Fixes:
113ce107afe9 ("net/faraday: Read MAC address from chip")
Suggested-by: David Laight <David.Laight@ACULAB.COM>
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jacob Keller [Tue, 7 Jun 2016 23:09:02 +0000 (16:09 -0700)]
fm10k: bump version number
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 7 Jun 2016 23:09:01 +0000 (16:09 -0700)]
fm10k: return proper error code when pci_enable_msix_range fails
The pci_enable_msix_range() function returns a positive value of the
number of allocated vectors if it succeeds. On failure it returns
a negative error code. Return this code properly so that the error
message printed by the driver will show the actual error code instead of
being masked by -ENOMEM.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 7 Jun 2016 23:09:00 +0000 (16:09 -0700)]
fm10k: force link to remain down for at least a second on resume events
When we resume from an AER recovery with many active VFs, the PF sees
many spurious link up and link down events. Prevent this by delaying
link down for at least one second after the resume event.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 7 Jun 2016 23:08:59 +0000 (16:08 -0700)]
fm10k: implement request_lport_map pointer
If the fm10k interface is brought up, but the switch manager software is
not running, the driver will continuously request the lport map every
few seconds in the base driver watchdog routine. Eventually after
several minutes the switch mailbox Tx fifo will fill up and the mailbox
will timeout, resulting in a reset. This reset will appear as if for no
reason, and occurs regularly every few minutes until the switch manager
software is loaded.
Prevent this from happening by only requesting the lport map after we've
verified the switch mailbox is tx_ready. In order to simplify code logic
and reduce code duplication, implement this as a new function pointer
"mac.ops.request_lport_map" which the VF will not implement. Otherwise,
we have to duplicate the tx_ready check outside of
fm10k_get_host_state_generic, or re-implement most of
fm10k_get_host_state_generic in the pf version.
The resulting code is simpler and easier to understand, and prevents the
PF from continuously requesting lport map and filling the Tx fifo of
a switch mailbox that isn't ready.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 7 Jun 2016 23:08:58 +0000 (16:08 -0700)]
fm10k: check if PCIe link is restored
Sometimes, a VF driver will lose PCIe address access, such as due to
a PF FLR event. In fm10k_detach_subtask, poll and check whether the
PCIe register space is active again and restore the device when it has.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 7 Jun 2016 23:08:57 +0000 (16:08 -0700)]
fm10k: enable bus master after every reset
If an FLR occurs, VF devices will be knocked out of bus master mode, and
the driver will be unable to recover from the reset properly, resulting
in malicious driver events and an infinite reset loop. In the normal
case, the bus master mode will already be enabled and this call will
essentially be a no-op. Since we're doing this every reset, it is
possible we could remove the other calls to pci_set_master() but it
seems not harmful to just leave them in place.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 7 Jun 2016 23:08:56 +0000 (16:08 -0700)]
fm10k: use common flow for suspend and resume
Continuing the effort to commonize the similar suspend/resume flows,
finish up by using the new fm10k_handle_suspand and fm10k_handle_resume
functions for the standard suspend/resume flow.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 7 Jun 2016 23:08:55 +0000 (16:08 -0700)]
fm10k: implement reset_notify handler for PCIe FLR events
When a function level PCI reset is triggered using sysfs, it calls the
driver's .reset_notify error handler. Implement a handler based on the
now split fm10k_prepare_for_reset and fm10k_handle_reset functions, so
that we fully reset the driver when the PCI function level reset occurs.
This also ensures the reset is handled in a clean way by first disabling
all the driver bits first and then restoring them after the function
reset. Previously the stack simply performed a blind function reset and
our driver didn't take any part in the process.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 7 Jun 2016 23:08:54 +0000 (16:08 -0700)]
fm10k: use common reset flow when handling io errors from PCI stack
Now that we have extracted the necessary steps for a split
suspend/resume flow, re-use these functions instead of using the current
open coded flow. This ensures that we don't miss any steps. It also
ensures that we have the correct driver states set.
Since we'll be handling all of the reset flow ourselves, we no longer
need to request a reset in the io_slot_reset() function.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 7 Jun 2016 23:08:53 +0000 (16:08 -0700)]
fm10k: implement prepare_suspend and handle_resume
Implement fm10k_prepare_suspend and fm10k_handle_resume functions which
abstract around the now existing fm10k_prepare_for_reset and
fm10k_handle_reset. The new functions also handle stopping the service
task, which is something that the original re-init flow does not need.
Every other location that does a suspend/resume type flow is expected to
use these functions, because otherwise they may have conflicts with the
running watchdog routines. This also has the effect of preventing
possible surprise remove events during handling of FLR events and PCIe
errors.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 7 Jun 2016 23:08:52 +0000 (16:08 -0700)]
fm10k: split fm10k_reinit into two functions
There are several flows in the driver which perform the similar function
of tearing down software and restoring software to recover from certain
errors or PCIe events, including:
* fm10k_reinit
* fm10k_suspend/resume
* fm10k_io_error_detected/fm10k_io_resume
In addition, we want to implement a .reset_notify() handler as well
which will also perform similar function.
Rework how the driver codes reset and resume flows by separating out the
reinit logic into two functions "fm10k_prepare_for_reset" and
"fm10k_handle_reset". This first step will allow us to re-use this
functionality in the similar blocks of code instead of re-coding the
same sequence of events slightly different.
The end result should be more maintainable and correct, fixing several
inconsistencies with the work flow.
The new functions expect to take the rtnl_lock() themselves, and it does
have the unfortunate side effect of having the reinit flow take then
release then take the rtnl_lock. However, this minor downside is
out weighted by the benefits of code reduction and reducing needless
difference between these flows.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 7 Jun 2016 23:08:51 +0000 (16:08 -0700)]
fm10k: wait for queues to drain if stop_hw() fails once
It turns out that sometimes during a reset the Tx queues will be
temporarily stuck longer than .stop_hw() expects. Work around this issue
by attempting to .stop_hw() first. If it tails, wait a number of
attempts until the Tx queues appear to be drained. After this, attempt
stop_hw() again. This ensures that we avoid waiting if we don't need to,
such as during the first initialization of a VF, and give the proper
amount of time necessary to recover from most situations. It is possible
that the hardware is actually stuck. For PFs, this is usually fixed by
a datapath reset. Unfortunately the VF cannot request a similar reset
for itself.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 7 Jun 2016 23:08:50 +0000 (16:08 -0700)]
fm10k: only warn when stop_hw fails with FM10K_ERR_REQUESTS_PENDING
When stop_hw() routine fails with FM10K_ERR_REQUESTS_PENDING, this
indicates that the Tx or Rx queues did not shutdown within the time
limit. Print a more suitable message at the dev_info level instead of
dev_err.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 7 Jun 2016 23:08:49 +0000 (16:08 -0700)]
fm10k: use actual hardware registers when checking for pending Tx
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 7 Jun 2016 23:08:48 +0000 (16:08 -0700)]
fm10k: perform data path reset even when switch is not ready
A while ago, an additional check for the switch being ready was added to
reset_hw. A recent refactor accidentally made this check return an error
code on failure which caused fm10k_probe to fail when the switch wasn't
brought up first. The original reasoning for the check was to prevent
additional data path reset when the fabric wasn't ready yet. However,
there isn't a compelling reason to keep the check, as the data path
reset will restore hardware to a known good state. Remove the check and
perform the data path reset regardless of the switch manager state.
An alternative fix is to return FM10K_SUCCESS instead, and bypass the
actual data path reset. This should be fine as we will perform
a reset_hw once the switch is active. However, since data path reset
will reset many parts of the hardware it seems better to just perform
the reset regardless of switch state.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 7 Jun 2016 23:08:47 +0000 (16:08 -0700)]
fm10k: don't stop reset due to FM10K_ERR_REQUESTS_PENDING
Don't report FM10K_ERR_REQUESTS_PENDING when we fail to disable queues
within the timeout. This can occur due to a hardware Tx hang, or when
the switch ethernet fabric is resetting while we are transmitting
traffic. It can sometimes take up to 500ms before the Tx DMA engine
gives up. Instead, just skip the DMA engine check and perform
a data-path reset anyways. Add a statistic counter to keep track of the
number of resets occurring while we have pending DMA on the rings.
In order to prevent having to re-assign err to 0, re-order the
last few items of the reset_hw_pf function so that we don't perform
"return err" at the end.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Ngai-Mint Kwan [Tue, 7 Jun 2016 23:08:46 +0000 (16:08 -0700)]
fm10k: Reset mailbox global interrupts
When a data path reset is initiated, write control to the PCIE_GMBX is
yanked from the switch manager. The switch manager writes to this
register to clear mailbox global interrupt bits as part of its mailbox
interrupt handling routine. When the device recovers from the data path
reset and these bits are not cleared, it will prevent future mailbox
global interrupts from being triggered. Upon confirming that the device
has exited from a data path reset, clear these bits to ensure the proper
functioning of the mailbox global interrupt.
Signed-off-by: Ngai-Mint Kwan <ngai-mint.kwan@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 7 Jun 2016 23:08:45 +0000 (16:08 -0700)]
fm10k: prevent multiple threads updating statistics
Also prevent updating stats while the interface is down. If we're
already updating stats, just return doing nothing. When we take the
device down, block stat updates until we come back up. This ensures that
we avoid tearing down rings when we're updating statistics, and prevents
updating statistics until we're up.
We can't re-use the __FM10K_DOWN for this because it wouldn't prevent
multiple threads from accessing statistics. Neither does it prevent the
case where we start updating stats and then start going down in another
thread.
The fm10k_get_stats64 is except from this, because it has a completely
different flow which does not suffer from the same issues as
fm10k_update_stats might.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Fri, 3 Jun 2016 22:42:12 +0000 (15:42 -0700)]
fm10k: avoid possible null pointer dereference in fm10k_update_stats
It's currently possible for fm10k_update_stats to be called during the
window when we go down and the rings are removed. This can result in
a null pointer dereference. In fm10k_get_stats64 we work around this by
using ACCESS_ONCE and a null pointer check inside the loop. Use this
same flow in the fm10k_update_stats to avoid the potential null pointer.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Fri, 3 Jun 2016 22:42:11 +0000 (15:42 -0700)]
fm10k: no need to continue in fm10k_down if __FM10K_DOWN already set
Return early from fm10k_down() when we are already down, since that
means another thread is either already finished or has started going
down, so shouldn't conflict with them.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David S. Miller [Wed, 20 Jul 2016 21:53:57 +0000 (14:53 -0700)]
Merge branch 'mlxsw-per-prio-tc-counters'
Jiri Pirko says:
====================
mlxsw: Add per-{Prio,TC} counters
Ido says:
Add per-priority and per-tc counters, which are very useful for debugging
purposes and fine-tuning.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 19 Jul 2016 13:35:54 +0000 (15:35 +0200)]
mlxsw: spectrum: Expose per-tc counters via ethtool
Expose the transmit queue length of each traffic class and the amount of
unicast packets discarded due to insufficient room in the shared buffer.
The first counter allows us to debug user priority to traffic class
mapping, whereas the drop counter is useful when determining shared buffer
configuration.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 19 Jul 2016 13:35:53 +0000 (15:35 +0200)]
mlxsw: spectrum: Expose per-priority counters via ethtool
Expose per-priority bytes / packets / PFC packets counters via ethtool.
These counters are very useful when debugging QoS functionality and
provide a better insight into the device's forwarding plane.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Tue, 19 Jul 2016 12:37:53 +0000 (12:37 +0000)]
net: cpmac: fix error handling of cpmac_probe()
Add the missing free_netdev() before return from function
cpmac_probe() in the error handling case.
This patch revert commit
0465be8f4f1d ("net: cpmac: fix in
releasing resources"), which changed to only free_netdev
while register_netdev failed.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Tue, 19 Jul 2016 11:35:46 +0000 (11:35 +0000)]
net/mlx5: Use PTR_ERR_OR_ZERO() to simplify the code
Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR.
Generated by: scripts/coccinelle/api/ptr_ret.cocci
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Tue, 19 Jul 2016 11:33:10 +0000 (11:33 +0000)]
net: ethernet: nb8800: fix error handling of nb8800_probe()
In ops->reset() error handling case, clk_disable_unprepare() is missed
before return from this function.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Mans Rullgard <mans@mansr.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Tue, 19 Jul 2016 11:25:16 +0000 (11:25 +0000)]
wan/fsl_ucc_hdlc: use module_platform_driver to simplify the code
module_platform_driver() makes the code simpler by eliminating
boilerplate code.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Tue, 19 Jul 2016 11:25:03 +0000 (11:25 +0000)]
wan/fsl_ucc_hdlc: remove .owner field for driver
Remove .owner field if calls are used which set it automatically.
Generated by: scripts/coccinelle/api/platform_no_drv_owner.cocci
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Tue, 19 Jul 2016 11:23:24 +0000 (11:23 +0000)]
net: axienet: Fix return value check in axienet_probe()
In case of error, the function of_parse_phandle() returns NULL
pointer not ERR_PTR(). The IS_ERR() test in the return value
check should be replaced with NULL test.
Fixes:
46aa27df8853 ('net: axienet: Use devm_* calls')
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 20 Jul 2016 21:42:28 +0000 (14:42 -0700)]
Merge branch 'for-upstream' of git://git./linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-07-19
Here's likely the last bluetooth-next pull request for the 4.8 kernel:
- Fix for L2CAP setsockopt
- Fix for is_suspending flag handling in btmrvl driver
- Addition of Bluetooth HW & FW info fields to debugfs
- Fix to use int instead of char for callback status.
The last one (from Geert Uytterhoeven) is actually not purely a
Bluetooth (or 802.15.4) patch, but it was agreed with other maintainers
that we take it through the bluetooth-next tree.
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 20 Jul 2016 18:17:47 +0000 (20:17 +0200)]
bpf, elf: add official ELF machine define for eBPF
Add the official BPF ELF e_machine value that was assigned recently [1,2]
and will be propagated to glibc, et al. LLVM is switching to it in 3.9
release.
[1] https://github.com/llvm-mirror/llvm/commit/
36b9c09330bfb5e771914cfe307588f30d5510d2
[2] http://lists.iovisor.org/pipermail/iovisor-dev/2016-June/000266.html
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenden Blanco [Wed, 20 Jul 2016 14:55:52 +0000 (07:55 -0700)]
bpf: fix implicit declaration of bpf_prog_add
For the ifndef case of CONFIG_BPF_SYSCALL, an inline version of
bpf_prog_add needs to exist otherwise the build breaks on some configs.
drivers/net/ethernet/mellanox/mlx4/en_netdev.c:2544:10: error: implicit declaration of function 'bpf_prog_add'
prog = bpf_prog_add(prog, priv->rx_ring_num - 1);
The function is introduced in
59d3656d5bf50 ("bpf: add bpf_prog_add api for bulk prog refcnt")
and first used in
47f1afdba2b87 ("net/mlx4_en: add support for fast rx drop bpf program").
Fixes:
47f1afdba2b87 ("net/mlx4_en: add support for fast rx drop bpf program")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Tariq Toukan <ttoukan.linux@gmail.com>
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 20 Jul 2016 04:46:34 +0000 (21:46 -0700)]
Merge branch 'xdp'
Brenden Blanco says:
====================
Add driver bpf hook for early packet drop and forwarding
This patch set introduces new infrastructure for programmatically
processing packets in the earliest stages of rx, as part of an effort
others are calling eXpress Data Path (XDP) [1]. Start this effort by
introducing a new bpf program type for early packet filtering, before
even an skb has been allocated.
Extend on this with the ability to modify packet data and send back out
on the same port.
Patch 1 adds an API for bulk bpf prog refcnt incrememnt.
Patch 2 introduces the new prog type and helpers for validating the bpf
program. A new userspace struct is defined containing only data and
data_end as fields, with others to follow in the future.
In patch 3, create a new ndo to pass the fd to supported drivers.
In patch 4, expose a new rtnl option to userspace.
In patch 5, enable support in mlx4 driver.
In patch 6, create a sample drop and count program. With single core,
achieved ~20 Mpps drop rate on a 40G ConnectX3-Pro. This includes
packet data access, bpf array lookup, and increment.
In patch 7, add a page recycle facility to mlx4 rx, enabled when xdp is
active.
In patch 8, add the XDP_TX type to bpf.h
In patch 9, add helper in tx patch for writing tx_desc
In patch 10, add support in mlx4 for packet data write and forwarding
In patch 11, turn on packet write support in the bpf verifier
In patch 12, add a sample program for packet write and forwarding. With
single core, achieved ~10 Mpps rewrite and forwarding.
[1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
v10:
1/12: Add bulk refcnt api.
5/12: Move prog from priv to ring. This attribute is still only set
globally, but the path to finer granularity should be clear. No lock
is taken, so some rings may operate on older programs for a time (one
napi loop). Looked into options such as napi_synchronize, but they
were deemed too slow (calls to msleep).
Rename prog to xdp_prog. Add xdp_ring_num to help with accounting,
used more heavily in later patches.
7/12: Adjust to use per-ring xdp prog. Use priv->xdp_ring_num where
before priv->prog was used to determine buffer allocations.
9/12: Add cpu_to_be16 to vlan_tag in mxl4_en_xmit(). Remove unused variable
from mlx4_en_xmit and unused params from build_inline_wqe.
v9:
4/11: Add missing newline in en_err message.
6/11: Move page_cache cleanup from mlx4_en_destroy_rx_ring to
mlx4_en_deactivate_rx_ring. Move mlx4_en_moderation_update back to
static. Remove calls to mlx4_en_alloc/free_resources in mlx4_xdp_set.
Adopt instead the approach of mlx4_en_change_mtu to use a watchdog.
9/11: Use a per-ring function pointer in tx to separate out the code
for regular and recycle paths of tx completion handling. Add a helper
function to init the recycle ring and callback, called just after
activating tx. Remove extra tx ring resource requirement, and instead
steal from the upper rings. This helps to avoid needing
mlx4_en_alloc_resources. Add some hopefully meaningful error
messages for the various error cases. Reverted some of the
hard-to-follow logic that was accounting for the extra tx rings.
v8:
1/11: Reduce WARN_ONCE to single line. Also, change act param of that
function to u32 to match return type of bpf_prog_run_xdp.
2/11: Clarify locking semantics in ndo comment.
4/11: Add en_err warning in mlx4_xdp_set on num_frags/mtu violation.
v7:
Addressing two of the major discussion points: return codes and ndo.
The rest will be taken as todo items for separate patches.
Add an XDP_ABORTED type, which explicitly falls through to DROP. The
same result must be taken for the default case as well, as it is now
well-defined API behavior.
Merge ndo_xdp_* into a single ndo. The style is similar to
ndo_setup_tc, but with less unidirectional naming convention. The IFLA
parameter names are unchanged.
TODOs:
Add ethtool per-ring stats for aborted, default cases, maybe even drop
and tx as well.
Avoid duplicate dma sync operation in XDP_PASS case as mentioned by
Saeed.
1/12: Add XDP_ABORTED enum, reword API comment, and update commit
message.
2/12: Rewrite ndo_xdp_*() into single ndo_xdp() with type/union style
calling convention.
3/12: Switch to ndo_xdp callback.
4/12: Add XDP_ABORTED case as a fall-through to XDP_DROP. Implement
ndo_xdp.
12/12: Dropped, this will need some more work.
v6:
2/12: drop unnecessary netif_device_present check
4/12, 6/12, 9/12: Reorder default case statement above drop case to
remove some copy/paste.
v5:
0/12: Rebase and remove previous 1/13 patch
1/12: Fix nits from Daniel. Left the (void *) cast as-is, to be fixed
in future. Add bpf_warn_invalid_xdp_action() helper, to be used when
out of bounds action is returned by the program. Add a comment to
bpf.h denoting the undefined nature of out of bounds returns.
2/12: Switch to using bpf_prog_get_type(). Rename ndo_xdp_get() to
ndo_xdp_attached().
3/12: Add IFLA_XDP as a nested type, and add the associated nla_policy
for the new subtypes IFLA_XDP_FD and IFLA_XDP_ATTACHED.
4/12: Fixup the use of READ_ONCE in the ndos. Add a user of
bpf_warn_invalid_xdp_action helper.
5/12: Adjust to using the nested netlink options.
6/12: kbuild was complaining about overflow of u16 on tile
architecture...bump frag_stride to u32. The page_offset member that
is computed from this was already u32.
v4:
2/12: Add inline helper for calling xdp bpf prog under rcu
3/12: Add detail to ndo comments
5/12: Remove mlx4_call_xdp and use inline helper instead.
6/12: Fix checkpatch complaints
9/12: Introduce new patch 9/12 with common helper for tx_desc write
Refactor to use common tx_desc write helper
11/12: Fix checkpatch complaints
v3:
Rewrite from v2 trying to incorporate feedback from multiple sources.
Specifically, add ability to forward packets out the same port and
allow packet modification.
For packet forwarding, the driver reserves a dedicated set of tx rings
for exclusive use by xdp. Upon completion, the pages on this ring are
recycled directly back to a small per-rx-ring page cache without
being dma unmapped.
Use of the percpu skb is dropped in favor of a lightweight struct
xdp_buff. The direct packet access feature is leveraged to remove
dependence on the skb.
The mlx4 driver implementation allocates a page-per-packet and maps it
in PCI_DMA_BIDIRECTIONAL mode when the bpf program is activated.
Naming is converted to use "xdp" instead of "phys_dev".
v2:
1/5: Drop xdp from types, instead consistently use bpf_phys_dev_
Introduce enum for return values from phys_dev hook
2/5: Move prog->type check to just before invoking ndo
Change ndo to take a bpf_prog * instead of fd
Add ndo_bpf_get rather than keeping a bool in the netdev struct
3/5: Use ndo_bpf_get to fetch bool
4/5: Enforce that only 1 frag is ever given to bpf prog by disallowing
mtu to increase beyond FRAG_SZ0 when bpf prog is running, or conversely
to set a bpf prog when priv->num_frags > 1
Rename pseudo_skb to bpf_phys_dev_md
Implement ndo_bpf_get
Add dma sync just before invoking prog
Check for explicit bpf return code rather than nonzero
Remove increment of rx_dropped
5/5: Use explicit bpf return code in example
Update commit log with higher pps numbers
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenden Blanco [Tue, 19 Jul 2016 19:16:57 +0000 (12:16 -0700)]
bpf: add sample for xdp forwarding and rewrite
Add a sample that rewrites and forwards packets out on the same
interface. Observed single core forwarding performance of ~10Mpps.
Since the mlx4 driver under test recycles every single packet page, the
perf output shows almost exclusively just the ring management and bpf
program work. Slowdowns are likely occurring due to cache misses.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenden Blanco [Tue, 19 Jul 2016 19:16:56 +0000 (12:16 -0700)]
bpf: enable direct packet data write for xdp progs
For forwarding to be effective, XDP programs should be allowed to
rewrite packet data.
This requires that the drivers supporting XDP must all map the packet
memory as TODEVICE or BIDIRECTIONAL before invoking the program.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenden Blanco [Tue, 19 Jul 2016 19:16:55 +0000 (12:16 -0700)]
net/mlx4_en: add xdp forwarding and data write support
A user will now be able to loop packets back out of the same port using
a bpf program attached to xdp hook. Updates to the packet contents from
the bpf program is also supported.
For the packet write feature to work, the rx buffers are now mapped as
bidirectional when the page is allocated. This occurs only when the xdp
hook is active.
When the program returns a TX action, enqueue the packet directly to a
dedicated tx ring, so as to avoid completely any locking. This requires
the tx ring to be allocated 1:1 for each rx ring, as well as the tx
completion running in the same softirq.
Upon tx completion, this dedicated tx ring recycles pages without
unmapping directly back to the original rx ring. In steady state tx/drop
workload, effectively 0 page allocs/frees will occur.
In order to separate out the paths between free and recycle, a
free_tx_desc func pointer is introduced that is optionally updated
whenever recycle_ring is activated. By default the original free
function is always initialized.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenden Blanco [Tue, 19 Jul 2016 19:16:54 +0000 (12:16 -0700)]
net/mlx4_en: break out tx_desc write into separate function
In preparation for writing the tx descriptor from multiple functions,
create a helper for both normal and blueflame access.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenden Blanco [Tue, 19 Jul 2016 19:16:53 +0000 (12:16 -0700)]
bpf: add XDP_TX xdp_action for direct forwarding
XDP enabled drivers must transmit received packets back out on the same
port they were received on when a program returns this action.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenden Blanco [Tue, 19 Jul 2016 19:16:52 +0000 (12:16 -0700)]
net/mlx4_en: add page recycle to prepare rx ring for tx support
The mlx4 driver by default allocates order-3 pages for the ring to
consume in multiple fragments. When the device has an xdp program, this
behavior will prevent tx actions since the page must be re-mapped in
TODEVICE mode, which cannot be done if the page is still shared.
Start by making the allocator configurable based on whether xdp is
running, such that order-0 pages are always used and never shared.
Since this will stress the page allocator, add a simple page cache to
each rx ring. Pages in the cache are left dma-mapped, and in drop-only
stress tests the page allocator is eliminated from the perf report.
Note that setting an xdp program will now require the rings to be
reconfigured.
Before:
26.91% ksoftirqd/0 [mlx4_en] [k] mlx4_en_process_rx_cq
17.88% ksoftirqd/0 [mlx4_en] [k] mlx4_en_alloc_frags
6.00% ksoftirqd/0 [mlx4_en] [k] mlx4_en_free_frag
4.49% ksoftirqd/0 [kernel.vmlinux] [k] get_page_from_freelist
3.21% swapper [kernel.vmlinux] [k] intel_idle
2.73% ksoftirqd/0 [kernel.vmlinux] [k] bpf_map_lookup_elem
2.57% swapper [mlx4_en] [k] mlx4_en_process_rx_cq
After:
31.72% swapper [kernel.vmlinux] [k] intel_idle
8.79% swapper [mlx4_en] [k] mlx4_en_process_rx_cq
7.54% swapper [kernel.vmlinux] [k] poll_idle
6.36% swapper [mlx4_core] [k] mlx4_eq_int
4.21% swapper [kernel.vmlinux] [k] tasklet_action
4.03% swapper [kernel.vmlinux] [k] cpuidle_enter_state
3.43% swapper [mlx4_en] [k] mlx4_en_prepare_rx_desc
2.18% swapper [kernel.vmlinux] [k] native_irq_return_iret
1.37% swapper [kernel.vmlinux] [k] menu_select
1.09% swapper [kernel.vmlinux] [k] bpf_map_lookup_elem
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenden Blanco [Tue, 19 Jul 2016 19:16:51 +0000 (12:16 -0700)]
Add sample for adding simple drop program to link
Add a sample program that only drops packets at the BPF_PROG_TYPE_XDP_RX
hook of a link. With the drop-only program, observed single core rate is
~20Mpps.
Other tests were run, for instance without the dropcnt increment or
without reading from the packet header, the packet rate was mostly
unchanged.
$ perf record -a samples/bpf/xdp1 $(</sys/class/net/eth0/ifindex)
proto 17:
20403027 drops/s
./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
Running... ctrl^C to stop
Device: eth4@0
Result: OK:
11791017(
c11788327+d2689) usec,
59622913 (60byte,0frags)
5056638pps 2427Mb/sec (2427186240bps) errors: 0
Device: eth4@1
Result: OK:
11791012(
c11787906+d3106) usec,
60526944 (60byte,0frags)
5133311pps 2463Mb/sec (2463989280bps) errors: 0
Device: eth4@2
Result: OK:
11791019(
c11788249+d2769) usec,
59868091 (60byte,0frags)
5077431pps 2437Mb/sec (2437166880bps) errors: 0
Device: eth4@3
Result: OK:
11795039(
c11792403+d2636) usec,
59483181 (60byte,0frags)
5043067pps 2420Mb/sec (2420672160bps) errors: 0
perf report --no-children:
26.05% ksoftirqd/0 [mlx4_en] [k] mlx4_en_process_rx_cq
17.84% ksoftirqd/0 [mlx4_en] [k] mlx4_en_alloc_frags
5.52% ksoftirqd/0 [mlx4_en] [k] mlx4_en_free_frag
4.90% swapper [kernel.vmlinux] [k] poll_idle
4.14% ksoftirqd/0 [kernel.vmlinux] [k] get_page_from_freelist
2.78% ksoftirqd/0 [kernel.vmlinux] [k] __free_pages_ok
2.57% ksoftirqd/0 [kernel.vmlinux] [k] bpf_map_lookup_elem
2.51% swapper [mlx4_en] [k] mlx4_en_process_rx_cq
1.94% ksoftirqd/0 [kernel.vmlinux] [k] percpu_array_map_lookup_elem
1.45% swapper [mlx4_en] [k] mlx4_en_alloc_frags
1.35% ksoftirqd/0 [kernel.vmlinux] [k] free_one_page
1.33% swapper [kernel.vmlinux] [k] intel_idle
1.04% ksoftirqd/0 [mlx4_en] [k] 0x000000000001c5c5
0.96% ksoftirqd/0 [mlx4_en] [k] 0x000000000001c58d
0.93% ksoftirqd/0 [mlx4_en] [k] 0x000000000001c6ee
0.92% ksoftirqd/0 [mlx4_en] [k] 0x000000000001c6b9
0.89% ksoftirqd/0 [kernel.vmlinux] [k] __alloc_pages_nodemask
0.83% ksoftirqd/0 [mlx4_en] [k] 0x000000000001c686
0.83% ksoftirqd/0 [mlx4_en] [k] 0x000000000001c5d5
0.78% ksoftirqd/0 [mlx4_en] [k] mlx4_alloc_pages.isra.23
0.77% ksoftirqd/0 [mlx4_en] [k] 0x000000000001c5b4
0.77% ksoftirqd/0 [kernel.vmlinux] [k] net_rx_action
machine specs:
receiver - Intel E5-1630 v3 @ 3.70GHz
sender - Intel E5645 @ 2.40GHz
Mellanox ConnectX-3 @40G
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenden Blanco [Tue, 19 Jul 2016 19:16:50 +0000 (12:16 -0700)]
net/mlx4_en: add support for fast rx drop bpf program
Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
In tc/socket bpf programs, helpers linearize skb fragments as needed
when the program touches the packet data. However, in the pursuit of
speed, XDP programs will not be allowed to use these slower functions,
especially if it involves allocating an skb.
Therefore, disallow MTU settings that would produce a multi-fragment
packet that XDP programs would fail to access. Future enhancements could
be done to increase the allowable MTU.
The xdp program is present as a per-ring data structure, but as of yet
it is not possible to set at that granularity through any ndo.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenden Blanco [Tue, 19 Jul 2016 19:16:49 +0000 (12:16 -0700)]
rtnl: add option for setting link xdp prog
Sets the bpf program represented by fd as an early filter in the rx path
of the netdev. The fd must have been created as BPF_PROG_TYPE_XDP.
Providing a negative value as fd clears the program. Getting the fd back
via rtnl is not possible, therefore reading of this value merely
provides a bool whether the program is valid on the link or not.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenden Blanco [Tue, 19 Jul 2016 19:16:48 +0000 (12:16 -0700)]
net: add ndo to setup/query xdp prog in adapter rx
Add one new netdev op for drivers implementing the BPF_PROG_TYPE_XDP
filter. The single op is used for both setup/query of the xdp program,
modelled after ndo_setup_tc.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenden Blanco [Tue, 19 Jul 2016 19:16:47 +0000 (12:16 -0700)]
bpf: add XDP prog type for early driver filter
Add a new bpf prog type that is intended to run in early stages of the
packet rx path. Only minimal packet metadata will be available, hence a
new context type, struct xdp_md, is exposed to userspace. So far only
expose the packet start and end pointers, and only in read mode.
An XDP program must return one of the well known enum values, all other
return codes are reserved for future use. Unfortunately, this
restriction is hard to enforce at verification time, so take the
approach of warning at runtime when such programs are encountered. Out
of bounds return codes should alias to XDP_ABORTED.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenden Blanco [Tue, 19 Jul 2016 19:16:46 +0000 (12:16 -0700)]
bpf: add bpf_prog_add api for bulk prog refcnt
A subsystem may need to store many copies of a bpf program, each
deserving its own reference. Rather than requiring the caller to loop
one by one (with possible mid-loop failure), add a bulk bpf_prog_add
api.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 20 Jul 2016 03:49:18 +0000 (20:49 -0700)]
Merge branch 'ncsi'
Gavin Shan says:
====================
NCSI Support
This series rebases on David's linux-net git repo ("master" branch). It's
to support NCSI stack on drivers/net/ethernet/faraday/ftgmac100.c. The
implementation is based on NCSI spec (version: 1.1.0):
https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.1.0.pdf
As the following figure shows and defined in NCSI spec:
* The NC-SI (aka NCSI) is defined as the interface between a (Base)
Management Controller (BMC) and one or multiple Network Interface
Controlers (NIC) on host side. The interface is responsible for providing
external network connectivity for BMC.
* Each BMC can connect to multiple packages, up to 8. Each package can have
multiple channels, up to 32. Every package and channel are identified by
3-bits and 5-bits in NCSI packet.
* NCSI packet, encapsulated in ethernet frame, has 0x88F8 in the protocol
field. The destination MAC address should be 0xFF's while the source MAC
address can be arbitrary one.
* NCSI packets are classified to command, response, AEN (Asynchronous Event Notification).
Commands are sent from BMC to host (NIC) for configuration and
information retrival. Responses, corresponding to commands, are sent from
host to BMC for confirmation and requested information. One command should
have one and only one response. AEN is sent from host to BMC for notification
(e.g. link down on active channel) so that BMC can take appropriate action.
+------------------+ +----------------------------------------------+
| | | Host |
| BMC | | |
| | | +-------------------+ +-------------------+ |
| +---------+ | | | Package-A | | Package-B | |
| | | | | +---------+---------+ +-------------------+ |
| |ftgmac100| | | | Channel | Channel | | Channel | Channel | |
+----+----+----+---+ +-+---------+---------+--+---------+---------+-+
| | |
| | |
+-----------------------------+----------------------+
The series of patches is highlighted as:
The design for the patchset is highlighted as below:
* The network driver uses 3 interfaces exported from NCSI stack:
ncsi_register_dev() - Register (create) a associated NCSI device.
ncsi_start_dev() - Bring up the NCSI device.
ncsi_unregister_dev() - Destroy the registered NCSI device.
* There are several data structures introduced for different objects:
struct ncsi_dev - NCSI device seen by network device driver.
struct ncsi_dev_priv - NCSI device seen by NCSI stack.
struct ncsi_package - NCSI package which can have multiple channels.
struct ncsi_channel - NCSI channel.
* The NCSI stack is driven by workqueue and state machine internally.
* The all available NCSI packages and channels are enumerated (probed) on
the first call to ncsi_start_dev(). The NCSI topology won't change until
the NCSI device is destroyed.
* All available channels will be brought up When the hardware arbitration
is enabled. Otherwise, only one channel is selected as active one. The
NCSI internal is driven by state machine with help of a workqueue. In
the meanwhile, there are 3 states for each channel which can be put into
a queue requesting for configuration or suspending. Channels in the queue
with inactive state set will be configured (bringup) while channels in
the queue with active state will be suspended (teardown). The request
configuration or suspending is being applied on the channel if it's in
invisible state.
* Failover, another inactive channel is selected as active, can happen when
the hardware arbitration is disabled. The failover can be caused by timeout
on link monitor and AEN.
* NCSI stack should be configurable through netlink or another mechanism, it's
not implemented in this patchset. It's something TBD.
* The first NIC driver that is aware of NCSI: drivers/net/ethernet/faraday/ftgmac100.c
Changelog
=========
v2 -> v3:
* Include (one line) change in include/uapi/linux/if_ether.h to fix build
error.
v1 -> v2:
* Support NCSI spec v1.1.0 (3 more commands and 4 hardware arbitration
modes added).
* Enable AEN packets according to the supported list.
* Introduce NCSI channel states and processing queue in order to support
the hardware arbitration.
* The hardware arbitration is supported (tested with emulated environment).
* Introduce link monitor with GLS (Get Link Status) command/response as part
of the error handling defined in NCSI spec.
* Support IPv6 address discovery when CONFIG_IPV6 is enabled.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Gavin Shan [Tue, 19 Jul 2016 01:54:25 +0000 (11:54 +1000)]
net/faraday: Mask PHY interrupt with NCSI mode
Bogus PHY interrupts are observed. This masks the PHY interrupt
when the interface works in NCSI mode as there is no attached
PHY under the circumstance.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gavin Shan [Tue, 19 Jul 2016 01:54:24 +0000 (11:54 +1000)]
net/faraday: Match driver according to compatible property
This matches the driver with devices compatible with "faraday,ftgmac100"
declared in the device tree. Originally, device's name from device
tree for it.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gavin Shan [Tue, 19 Jul 2016 01:54:23 +0000 (11:54 +1000)]
net/faraday: Support NCSI mode
This makes ftgmac100 driver support NCSI mode. The NCSI is enabled
on the interface if property "use-nc-si" or "use-ncsi" is found from
the device node in device tree.
* No PHY device is used when NCSI mode is enabled.
* The NCSI device (struct ncsi_dev) is created when probing the
device while it's enabled/started when the interface is brought
up.
* Hardware IP checksum dosn't work when NCSI mode is enabled. It
is disabled on enabled NCSI.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gavin Shan [Tue, 19 Jul 2016 01:54:22 +0000 (11:54 +1000)]
net/faraday: Read MAC address from chip
The device is assigned with random MAC address. It isn't reasonable.
An valid MAC address might have been provided by (uboot) firmware by
device-tree or in chip. It's reasonable to use it to maintain consistency.
This uses the MAC address from device-tree or that in the chip if it's
valid. Otherwise, a random MAC address is given as before.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gavin Shan [Tue, 19 Jul 2016 01:54:21 +0000 (11:54 +1000)]
net/faraday: Helper functions to create or destroy MDIO interface
This introduces two helper functions to create or destroy MDIO
interface. No logical changes introduced except the proper MDIO
names are given when having more than one MDIO bus.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gavin Shan [Tue, 19 Jul 2016 01:54:20 +0000 (11:54 +1000)]
net/ncsi: NCSI AEN packet handler
This introduces NCSI AEN packet handlers that result in (A) the
currently active channel is reconfigured; (B) Currently active
channel is deconfigured and disabled, another channel is chosen
as active one and configured. Case (B) won't happen if hardware
arbitration has been enabled, the channel that was in active
state is suspended simply.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gavin Shan [Tue, 19 Jul 2016 01:54:19 +0000 (11:54 +1000)]
net/ncsi: Package and channel management
This manages NCSI packages and channels:
* The available packages and channels are enumerated in the first
time of calling ncsi_start_dev(). The channels' capabilities are
probed in the meanwhile. The NCSI network topology won't change
until the NCSI device is destroyed.
* There in a queue in every NCSI device. The element in the queue,
channel, is waiting for configuration (bringup) or suspending
(teardown). The channel's state (inactive/active) indicates the
futher action (configuration or suspending) will be applied on the
channel. Another channel's state (invisible) means the requested
action is being applied.
* The hardware arbitration will be enabled if all available packages
and channels support it. All available channels try to provide
service when hardware arbitration is enabled. Otherwise, one channel
is selected as the active one at once.
* When channel is in active state, meaning it's providing service, a
timer started to retrieve the channe's link status. If the channel's
link status fails to be updated in the determined period, the channel
is going to be reconfigured. It's the error handling implementation
as defined in NCSI spec.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gavin Shan [Tue, 19 Jul 2016 01:54:18 +0000 (11:54 +1000)]
net/ncsi: NCSI response packet handler
The NCSI response packets are sent to MC (Management Controller)
from the remote end. They are responses of NCSI command packets
for multiple purposes: completion status of NCSI command packets,
return NCSI channel's capability or configuration etc.
This defines struct to represent NCSI response packets and introduces
function ncsi_rcv_rsp() which will be used to receive NCSI response
packets and parse them.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gavin Shan [Tue, 19 Jul 2016 01:54:17 +0000 (11:54 +1000)]
net/ncsi: NCSI command packet handler
The NCSI command packets are sent from MC (Management Controller)
to remote end. They are used for multiple purposes: probe existing
NCSI package/channel, retrieve NCSI channel's capability, configure
NCSI channel etc.
This defines struct to represent NCSI command packets and introduces
function ncsi_xmit_cmd(), which will be used to transmit NCSI command
packet according to the request. The request is represented by struct
ncsi_cmd_arg.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gavin Shan [Tue, 19 Jul 2016 01:54:16 +0000 (11:54 +1000)]
net/ncsi: Resource management
NCSI spec (DSP0222) defines several objects: package, channel, mode,
filter, version and statistics etc. This introduces the data structs
to represent those objects and implement functions to manage them.
Also, this introduces CONFIG_NET_NCSI for the newly implemented NCSI
stack.
* The user (e.g. netdev driver) dereference NCSI device by
"struct ncsi_dev", which is embedded to "struct ncsi_dev_priv".
The later one is used by NCSI stack internally.
* Every NCSI device can have multiple packages simultaneously, up
to 8 packages. It's represented by "struct ncsi_package" and
identified by 3-bits ID.
* Every NCSI package can have multiple channels, up to 32. It's
represented by "struct ncsi_channel" and identified by 5-bits ID.
* Every NCSI channel has version, statistics, various modes and
filters. They are represented by "struct ncsi_channel_version",
"struct ncsi_channel_stats", "struct ncsi_channel_mode" and
"struct ncsi_channel_filter" separately.
* Apart from AEN (Asynchronous Event Notification), the NCSI stack
works in terms of command and response. This introduces "struct
ncsi_req" to represent a complete NCSI transaction made of NCSI
request and response.
link: https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.1.0.pdf
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 20 Jul 2016 02:42:02 +0000 (19:42 -0700)]
Merge branch 'dsa-mv88e6xxx-g2-cleanup-stp'
Vivien Didelot says:
====================
net: dsa: mv88e6xxx: Global2 cleanup and STP
The Marvell switches registers are organized in distinct internal SMI
devices, such as PHY, Port, Global 1 or Global 2 registers sets.
Since not all chips support every registers sets or have slightly
differences in them (such as old 88E6060 or new 88E6390 likely to be
supported soon), make the setup code clearer now by removing a few
family checks and adding flags to describe the Global 2 registers map.
This patchset enables basic STP support and bridging on most chips when
getting rid of a few inconsistencies in chip descriptions (patch 1) and
add bridge Ageing Time support to DSA and the mv88e6xxx driver.
Changes v2 -> v3:
- rename mv88e6xxx_update_write to mv88e6xxx_update
- set fastest ageing time in use in the chip for multiple bridges,
tested with a few printk
Changes v1 -> v2:
- add a write helper for pointer-data Update registers
- add ageing time support
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Tue, 19 Jul 2016 00:45:40 +0000 (20:45 -0400)]
net: dsa: mv88e6xxx: add support for DSA ageing time
Implement the DSA driver function to configure the bridge ageing time.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Tue, 19 Jul 2016 00:45:39 +0000 (20:45 -0400)]
net: dsa: mv88e6xxx: add G1 helper for ageing time
All Marvell switch chips from (88E6060 to 88E6390) have a ATU Control
register containing bits 11:4 to configure an ATU Age Time quotient.
However the coefficient used to calculate the ATU Age Time vary with the
models. E.g. 88E6060, 88E6352 and 88E6390 use respectively 16, 15 and
3.75 seconds.
Add a age_time_coeff to the info structure to handle this and a Global 1
helper to set the default age time of 5 minutes in the setup code.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Tue, 19 Jul 2016 00:45:38 +0000 (20:45 -0400)]
net: dsa: support switchdev ageing time attr
Add a new function for DSA drivers to handle the switchdev
SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME attribute.
The ageing time is passed as milliseconds.
Also because we can have multiple logical bridges on top of a physical
switch and ageing time are switch-wide, call the driver function with
the fastest ageing time in use on the chip instead of the requested one.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Tue, 19 Jul 2016 00:45:37 +0000 (20:45 -0400)]
net: dsa: mv88e6xxx: add cap for IRL
Add capability flags to describe the presence of Ingress Rate Limit unit
registers and an helper function to clear it.
In the meantime, fix a few harmless issues:
- 6185 and 6095 don't have such registers (reserved)
- the previous code didn't wait for the IRL operation to complete
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Tue, 19 Jul 2016 00:45:36 +0000 (20:45 -0400)]
net: dsa: mv88e6xxx: add cap for Priority Override
Add flags and helpers to describe the presence of Priority Override
Table (POT) related registers and simplify the setup of Global 2.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Tue, 19 Jul 2016 00:45:35 +0000 (20:45 -0400)]
net: dsa: mv88e6xxx: add cap for PVT
Add flags to describe the presence of Cross-chip Port VLAN Table (PVT)
related registers and simplify the setup of Global 2.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Tue, 19 Jul 2016 00:45:34 +0000 (20:45 -0400)]
net: dsa: mv88e6xxx: rework Switch MAC setter
Switches such as 88E6185 as 3 Switch MAC registers in Global 1. Newer
chips such as 88E6352 have freed these registers in favor of an indirect
access in a Switch MAC/WoL/WoF register in Global 2.
Explicit this difference with G1 and G2 helpers and flags.
Also, note that this indirect access is a single-register which doesn't
require to wait for the operation to complete (like Switch MAC, Trunk
Mapping, etc.), in contrary to multi-registers indirect accesses with
several operations and a busy bit.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Tue, 19 Jul 2016 00:45:33 +0000 (20:45 -0400)]
net: dsa: mv88e6xxx: add cap for MGMT Enables bits
Some switches provide a Rsvd2CPU mechanism used to choose which of the
16 reserved multicast destination addresses matching 01:80:c2:00:00:0x
should be considered as MGMT and thus forwarded to the CPU port.
Other switches extend this mechanism to also configure as MGMT the
additional 16 reserved multicast addresses matching 01:80:c2:00:00:2x.
This mechanism is exposed via two registers in Global 2, and an Rsvd2CPU
enable bit in the management register.
Newer chip (such as 88E6390) has replaced these registers with a new
indirect MGMT mechanism in Global 1.
The patch adds two MV88E6XXX_FLAG_G2_MGMT_EN_{0,2}X flags to describe
the presence of these Global 2 registers. If 88E6390 support is added, a
MV88E6XXX_FLAG_G1_MGMT_CTRL flag will be needed to setup Rsvd2CPU.
Note: all switches still support in parallel the ATU Load operation with
an MGMT Entry State to forward such frames in a less convenient way.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Tue, 19 Jul 2016 00:45:32 +0000 (20:45 -0400)]
net: dsa: mv88e6xxx: extract trunk mapping
The Trunk Mask and Trunk Mapping registers are two Global 2 indirect
accesses to trunking configuration.
Add helpers for these tables and simplify the Global 2 setup.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Tue, 19 Jul 2016 00:45:31 +0000 (20:45 -0400)]
net: dsa: mv88e6xxx: extract device mapping
The Device Mapping register is an indirect table access.
Provide helpers to access this table and explicit the checking of the
new DSA_RTABLE_NONE routing table value.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Tue, 19 Jul 2016 00:45:30 +0000 (20:45 -0400)]
net: dsa: mv88e6xxx: split setup of Global 1 and 2
Separate the setup of Global 1 and Global 2 internal SMI devices and add
a flag to describe the presence of this second registers set.
Also rearrange the G1 setup in the registers order.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Tue, 19 Jul 2016 00:45:29 +0000 (20:45 -0400)]
net: dsa: mv88e6xxx: remove basic function flags
All 88E6xxx Marvell switches (even the old not supported yet 88E6060)
have at least an ATU, per-port STP states and VLAN map, to run basic
switch functions such as Spanning Tree and port based VLANs.
Get rid of the related MV88E6XXX_FLAG_{ATU,PORTSTATE,VLANTABLE} flags,
as they are defaults to every chip.
This enables STP on 6185 and removes many inconsistencies on others.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Morton [Mon, 18 Jul 2016 22:50:58 +0000 (15:50 -0700)]
kernel/trace/bpf_trace.c: work around gcc-4.4.4 anon union initialization bug
kernel/trace/bpf_trace.c: In function 'bpf_event_output':
kernel/trace/bpf_trace.c:312: error: unknown field 'next' specified in initializer
kernel/trace/bpf_trace.c:312: warning: missing braces around initializer
kernel/trace/bpf_trace.c:312: warning: (near initialization for 'raw.frag.<anonymous>')
Fixes:
555c8a8623a3a87 ("bpf: avoid stack copy and use skb ctx for event output")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Lutomirski [Mon, 18 Jul 2016 22:34:49 +0000 (15:34 -0700)]
virtio-net: Remove more stack DMA
VLAN and MQ control was doing DMA from the stack. Fix it.
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: "netdev@vger.kernel.org" <netdev@vger.kernel.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Mon, 18 Jul 2016 20:02:47 +0000 (13:02 -0700)]
bnxt_en: Remove locking around txr->dev_state
txr->dev_state was not consistently manipulated with the acquisition of
the per-queue lock, after further inspection the lock does not seem
necessary, either the value is read as BNXT_DEV_STATE_CLOSING or 0.
Reported-by: coverity (CID 1339583)
Fixes:
c0c050c58d840 ("bnxt_en: New Broadcom ethernet driver.")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 19 Jul 2016 23:40:22 +0000 (16:40 -0700)]
Merge branch 'frag-udp-tunneled-skbs'
Shmulik Ladkani says:
====================
net: Consider fragmentation of udp tunneled skbs in 'ip_finish_output_gso'
Currently IP fragmentation of GSO segments that exceed dst mtu is
considered only in the ipv4 forwarding case.
There are cases where GSO skbs that are bridged and then udp-tunneled
may have gso_size exceeding the egress device mtu.
It makes sense to fragment them, as in the non GSOed code path.
The exact cases where this behavior is needed is described and addressed
in the 2nd patch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Shmulik Ladkani [Mon, 18 Jul 2016 11:49:34 +0000 (14:49 +0300)]
net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs
Given:
- tap0 and vxlan0 are bridged
- vxlan0 stacked on eth0, eth0 having small mtu (e.g. 1400)
Assume GSO skbs arriving from tap0 having a gso_size as determined by
user-provided virtio_net_hdr (e.g. 1460 corresponding to VM mtu of 1500).
After encapsulation these skbs have skb_gso_network_seglen that exceed
eth0's ip_skb_dst_mtu.
These skbs are accidentally passed to ip_finish_output2 AS IS.
Alas, each final segment (segmented either by validate_xmit_skb or by
hardware UFO) would be larger than eth0 mtu.
As a result, those above-mtu segments get dropped on certain networks.
This behavior is not aligned with the NON-GSO case:
Assume a non-gso 1500-sized IP packet arrives from tap0. After
encapsulation, the vxlan datagram is fragmented normally at the
ip_finish_output-->ip_fragment code path.
The expected behavior for the GSO case would be segmenting the
"gso-oversized" skb first, then fragmenting each segment according to
dst mtu, and finally passing the resulting fragments to ip_finish_output2.
'ip_finish_output_gso' already supports this "Slowpath" behavior,
according to the IPSKB_FRAG_SEGS flag, which is only set during ipv4
forwarding (not set in the bridged case).
In order to support the bridged case, we'll mark skbs arriving from an
ingress interface that get udp-encaspulated as "allowed to be fragmented",
causing their network_seglen to be validated by 'ip_finish_output_gso'
(and fragment if needed).
Note the TUNNEL_DONT_FRAGMENT tun_flag is still honoured (both in the
gso and non-gso cases), which serves users wishing to forbid
fragmentation at the udp tunnel endpoint.
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shmulik Ladkani [Mon, 18 Jul 2016 11:49:33 +0000 (14:49 +0300)]
net/ipv4: Introduce IPSKB_FRAG_SEGS bit to inet_skb_parm.flags
This flag indicates whether fragmentation of segments is allowed.
Formerly this policy was hardcoded according to IPSKB_FORWARDED (set by
either ip_forward or ipmr_forward).
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 19 Jul 2016 23:29:41 +0000 (16:29 -0700)]
Merge branch 'bnxt_en-NS2-Nitro'
Michael Chan says:
====================
bnxt_en: Add support for NS2 Nitro.
This series adds support for the embedded version of the
ethernet controller (Nitro) in the North Star 2 SoC. There are a number
of features not supported and a software workaround for a hardware rx
bug is required for Nitro A0. Please review.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Prashant Sreedharan [Mon, 18 Jul 2016 11:15:25 +0000 (07:15 -0400)]
bnxt_en: Add BCM58700 PCI device ID for NS2 Nitro.
A bridge device in NS2 has the same device ID as the ethernet controller.
Add check to avoid probing the bridge device.
Signed-off-by: Prashant Sreedharan <prashant.sreedharan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prashant Sreedharan [Mon, 18 Jul 2016 11:15:24 +0000 (07:15 -0400)]
bnxt_en: Workaround Nitro A0 RX hardware bug (part 4).
Allocate special vnic for dropping packets not matching the RX filters.
First vnic is for normal RX packets and the driver will drop all
packets on the 2nd vnic.
Signed-off-by: Prashant Sreedharan <prashant.sreedharan@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prashant Sreedharan [Mon, 18 Jul 2016 11:15:23 +0000 (07:15 -0400)]
bnxt_en: Workaround Nitro A0 hardware RX bug (part 3).
Allocate napi for special vnic, packets arriving on this
napi will simply be dropped and the buffers will be replenished back
to the HW.
Signed-off-by: Prashant Sreedharan <prashant.sreedharan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prashant Sreedharan [Mon, 18 Jul 2016 11:15:22 +0000 (07:15 -0400)]
bnxt_en: Workaround Nitro A0 hardware RX bug (part 2).
The hardware is unable to drop rx packets not matching the RX filters. To
workaround it, we create a special VNIC and configure the hardware to
direct all packets not matching the filters to it. We then setup the
driver to drop packets received on this VNIC.
This patch creates the infrastructure for this VNIC, reserves a
completion ring, and rx rings. Only shared completion ring mode is
supported. The next 2 patches add a NAPI to handle packets from this
VNIC and the setup of the VNIC.
Signed-off-by: Prashant Sreedharan <prashant.sreedharan@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prashant Sreedharan [Mon, 18 Jul 2016 11:15:21 +0000 (07:15 -0400)]
bnxt_en: Workaround Nitro A0 hardware RX bug (part 1).
Nitro A0 has a hardware bug in the rx path. The workaround is to create
a special COS context as a path for non-RSS (non-IP) packets. Without this
workaround, the chip may stall when receiving RSS and non-RSS packets.
Add infrastructure to allow 2 contexts (RSS and CoS) per VNIC. Allocate
and configure the CoS context for Nitro A0.
Signed-off-by: Prashant Sreedharan <prashant.sreedharan@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prashant Sreedharan [Mon, 18 Jul 2016 11:15:20 +0000 (07:15 -0400)]
bnxt_en: Add basic support for Nitro in North Star 2.
Nitro is the embedded version of the ethernet controller in the North
Star 2 SoC. Add basic code to recognize the chip ID and disable
the features (ntuple, TPA, ring and port statistics) not supported on
Nitro A0.
Signed-off-by: Prashant Sreedharan <prashant.sreedharan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 19 Jul 2016 23:05:57 +0000 (16:05 -0700)]
Merge branch 'marvell-phy'
Charles-Antoine Couret says:
====================
Marvell phy: fiber interface configuration
Another patchset to manage correctly the fiber link for some concerned Marvell's
phy like 88E1512.
This patchset fixed the commit log for the third and last commits and a comment
in the first commit.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Charles-Antoine Couret [Tue, 19 Jul 2016 09:13:13 +0000 (11:13 +0200)]
Marvell phy: add functions to suspend and resume both interfaces: fiber and copper links.
These functions used standards registers in a different page
for both interfaces: copper and fiber.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Charles-Antoine Couret <charles-antoine.couret@nexvision.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Charles-Antoine Couret [Tue, 19 Jul 2016 09:13:12 +0000 (11:13 +0200)]
Marvell phy: add configuration of autonegociation for fiber link.
To be correctly initilized, the fiber interface needs
to be configured via autonegociation registers which use
some customs options or registers.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Charles-Antoine Couret <charles-antoine.couret@nexvision.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Charles-Antoine Couret [Tue, 19 Jul 2016 09:13:11 +0000 (11:13 +0200)]
Marvell phy: add field to get errors from fiber link.
Add support for the fiber receiver error counter in the
statistics. Rename the current counter which is for copper errors to
phy_receive_errors_copper, so it is easy to distinguish copper from
fiber.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Charles-Antoine Couret <charles-antoine.couret@nexvision.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Charles-Antoine Couret [Tue, 19 Jul 2016 09:13:10 +0000 (11:13 +0200)]
Marvell phy: check link status in case of fiber link.
For concerned phy, the fiber link is checked before the copper link.
According to datasheet, the link which is up is enabled.
If both links are down, copper link would be used.
To detect fiber link status, we used the real time status
because of troubles with the copper method.
Tested with Marvell 88E1512.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Charles-Antoine Couret <charles-antoine.couret@nexvision.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 19 Jul 2016 23:01:34 +0000 (16:01 -0700)]
Merge branch 'renesas-dma-channel'
Sergei Shtylyov says:
====================
Fix DMA channel misreporting for the Renesas Ethernet drivers
Here's a set of 2 patches against DaveM's 'net.git' repo fixing up the DMA
channel reporting by 'ifconfig'...
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Sun, 17 Jul 2016 13:09:19 +0000 (16:09 +0300)]
sh_eth: fix DMA channel misreporting
Currently 'ifconfig' for the Ethernet devices handled by this driver shows
"DMA chan: ff" while the driver doesn't use any DMA channels. Not assigning
a value to 'net_device::dma' causes 'ifconfig' to correctly not report a
DMA channel.
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Sun, 17 Jul 2016 13:08:43 +0000 (16:08 +0300)]
ravb: fix DMA channel misreporting
Currently 'ifconfig' for the Ethernet devices handled by this driver shows
"DMA chan: ff" while the driver doesn't use any DMA channels. Not assigning
a value to 'net_device::dma' causes 'ifconfig' to correctly not report a
DMA channel.
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>