review.tizen.org Git - platform/kernel/linux-rpi.git/log

rhashtable: remove insecure_max_entries param

no users in the tree, insecure_max_entries is always set to
ht->p.max_size * 2 in rhtashtable_init().

Replace only spot that uses it with a ht->p.max_size check.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: move xdp_prog field in RX cache lines

(struct net_device, xdp_prog) field should be moved in RX cache lines,
reducing latencies when a single packet is received on idle host,
since netif_elide_gro() needs it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: xgene-v2: Fix error return code in xge_mdio_config()

Fix to return error code -ENODEV from the no PHY found error
handling case instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Generic XDP

This provides a generic SKB based non-optimized XDP path which is used
if either the driver lacks a specific XDP implementation, or the user
requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE.

It is arguable that perhaps I should have required something like
this as part of the initial XDP feature merge.

I believe this is critical for two reasons:

1) Accessibility.  More people can play with XDP with less
   dependencies.  Yes I know we have XDP support in virtio_net, but
   that just creates another depedency for learning how to use this
   facility.

   I wrote this to make life easier for the XDP newbies.

2) As a model for what the expected semantics are.  If there is a pure
   generic core implementation, it serves as a semantic example for
   driver folks adding XDP support.

One thing I have not tried to address here is the issue of
XDP_PACKET_HEADROOM, thanks to Daniel for spotting that.  It seems
incredibly expensive to do a skb_cow(skb, XDP_PACKET_HEADROOM) or
whatever even if the XDP program doesn't try to push headers at all.
I think we really need the verifier to somehow propagate whether
certain XDP helpers are used or not.

v5:
- Handle both negative and positive offset after running prog
- Fix mac length in XDP_TX case (Alexei)
- Use rcu_dereference_protected() in free_netdev (kbuild test robot)

v4:
- Fix MAC header adjustmnet before calling prog (David Ahern)
- Disable LRO when generic XDP is installed (Michael Chan)
- Bypass qdisc et al. on XDP_TX and record the event (Alexei)
- Do not perform generic XDP on reinjected packets (DaveM)

v3:
- Make sure XDP program sees packet at MAC header, push back MAC
   header if we do XDP_TX.  (Alexei)
- Elide GRO when generic XDP is in use.  (Alexei)
- Add XDP_FLAG_SKB_MODE flag which the user can use to request generic
   XDP even if the driver has an XDP implementation.  (Alexei)
- Report whether SKB mode is in use in rtnl_xdp_fill() via XDP_FLAGS
   attribute.  (Daniel)

v2:
- Add some "fall through" comments in switch statements based
   upon feedback from Andrew Lunn
- Use RCU for generic xdp_prog, thanks to Johannes Berg.

Tested-by: Andy Gospodarek <andy@greyhouse.net>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: fix invalid use of sizeof in qed_alloc_qm_data()

sizeof() when applied to a pointer typed expression gives the
size of the pointer, not that of the pointed data.

Fixes: b5a9ee7cf3be ("qed: Revise QM configuration")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: map_get_next_key to return first key on NULL

When iterating through a map, we need to find a key that does not exist
in the map so map_get_next_key will give us the first key of the map.
This often requires a lot of guessing in production systems.

This patch makes map_get_next_key return the first key when the key
pointer in the parameter is NULL.

Signed-off-by: Teng Qin <qinteng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/net: Fix broken test case in psock_fanout

The error return falue form sock_fanout_open is -1, not zero. One test
case was checking for 0 instead of -1.

Tested: Built and tested in clean client.
Signed-off-by: Mike Maloney <maloney@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: netcp_core: remove unused compl queue mapping

This code is unused and probably was unintentionally left while
moving completion queue mapping in submit function.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'qed-vf-tunnel'

Manish Chopra says:

====================
qed/qede: VF tunnelling support

With this series VFs can run vxlan/geneve/gre tunnels over it.
Please consider applying this series to "net-next"
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

qed - VF tunnelling support [VXLAN/GENEVE/GRE]

This patch adds hardware channel APIs support between
VF and PF for tunnelling configuration for the VFs.
According to that configuration VFs can run VXLAN/GENEVE/GRE
tunnels over it with tunnel features offloaded.

Using these APIs VF can also request for UDP ports configuration
to the PF, although PF and it's child VFs share the same port.

Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed/qede: Add UDP ports in bulletin board

This patch adds support for UDP ports in bulletin board
to notify UDP ports change to the VFs

Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Configure UDP ports in local context.

This patch configures UDP ports locally instead of
configuring them in deferred context which would be
helpful in synchronizing UDP ports configuration for VFs
which will be enabled in further patches.

Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Disable tunnel offloads for non offloaded UDP ports

This patch disables tunnel offloads via ndo_features_check()
if given UDP port is not offloaded to hardware. This in turn
allows to run multiple tunnel interfaces using different UDP ports.

Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed/qede: Enable tunnel offloads based on hw configuration

This patch enables tunnel feature offloads based on hw configuration
at initialization time instead of enabling them always.

Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: refactor tunnelling - API/Structs

This patch changes the tunnel APIs to use per tunnel
info instead of using bitmasks for all tunnels and also
uses single struct to hold the data to prepare multiple
variant of tunnel configuration ramrods to be sent to the hardware.

Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'l2tpeth-info'

Guillaume Nault says:

====================
l2tp: add informations about l2tpeth interfaces in /sys

Patch #1 lets userspace retrieve the naming scheme of an l2tpeth
interface, using /sys/class/net/<iface>/name_assign_type.

Patch #2 adds the DEVTYPE field in /sys/class/net/<iface>/uevent so
that userspace can reliably know if a device is an l2tpeth interface.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: define "l2tpeth" device type

Export type of l2tpeth interfaces to userspace
(/sys/class/net/<iface>/uevent).

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Acked-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: set name_assign_type for devices created by l2tp_eth.c

Export naming scheme used when creating l2tpeth interfaces
(/sys/class/net/<iface>/name_assign_type). This let userspace know if
the device's name has been generated automatically or defined manually.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Acked-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net sched actions: Complete the JUMPX opcode

per discussion at netconf/netdev:
When we have an action that is capable of branching (example a policer),
we can achieve a continuation of the action graph by programming a
"continue" where we find an exact replica of the same filter rule with a lower
priority and the remainder of the action graph. When you have 100s of thousands
of filters which require such a feature it gets very inefficient to do two
lookups.

This patch completes a leftover feature of action codes. Its time has come.

Example below where a user labels packets with a different skbmark on ingress
of a port depending on whether they have/not exceeded the configured rate.
This mark is then used to make further decisions on some egress port.

#rate control, very low so we can easily see the effect
sudo $TC actions add action police rate 1kbit burst 90k \
conform-exceed pipe/jump 2 index 10
# skbedit index 11 will be used if the user conforms
sudo $TC actions add action skbedit mark 11 ok index 11
# skbedit index 12 will be used if the user does not conform
sudo $TC actions add action skbedit mark 12 ok index 12

#lets bind the user ..
sudo $TC filter add dev $ETH parent ffff: protocol ip prio 8 u32 \
match ip dst 127.0.0.8/32 flowid 1:10 \
action police index 10 \
action skbedit index 11 \
action skbedit index 12

#run a ping -f and see what happens..
#
jhs@foobar:~$ sudo $TC -s filter ls dev $ETH parent ffff: protocol ip
filter pref 8 u32
filter pref 8 u32 fh 800: ht divisor 1
filter pref 8 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:10  (rule hit 2800 success 1005)
  match 7f000008/ffffffff at 16 (success 1005 )
action order 1:  police 0xa rate 1Kbit burst 23440b mtu 2Kb action pipe/jump 2 overhead 0b
ref 2 bind 1 installed 207 sec used 122 sec
Action statistics:
Sent 84420 bytes 1005 pkt (dropped 0, overlimits 721 requeues 0)
backlog 0b 0p requeues 0

action order 2:  skbedit mark 11 pass
index 11 ref 2 bind 1 installed 204 sec used 122 sec
Action statistics:
Sent 60564 bytes 721 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

action order 3:  skbedit mark 12 pass
index 12 ref 2 bind 1 installed 201 sec used 122 sec
Action statistics:
Sent 23856 bytes 284 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

Not bad, about 28% non-conforming packets..

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'linux-can-next-for-4.12-20170425' of git://git./linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2017-04-25

this is a pull request of 21 patches for net-next/master.

There are 4 patches by Stephane Grosjean for the PEAK PCAN-PCIe FD
CAN-FD boards. The next 7 patches are by Mario Huettel, which add
support for M_CAN IP version >= v3.1.x to the m_can driver. A patch by
Remigiusz Kołłątaj adds support for the Microchip CAN BUS Analyzer. 8
patches by Oliver Hartkopp complete the initial CAN network namespace
support. Wei Yongjun's patch for the ti_hecc driver fixes the return
value check in the probe function.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ipvlan: use pernet operations and restrict l3s hooks to master netns

commit 4fbae7d83c98c30efc ("ipvlan: Introduce l3s mode") added
registration of netfilter hooks via nf_register_hooks().

This API provides the illusion of 'global' netfilter hooks by placing the
hooks in all current and future network namespaces.

In case of ipvlan the hook appears to be only needed in the namespace
that contains the ipvlan master device (i.e., usually init_net), so
placing them in all namespaces is not needed.

This switches ipvlan driver to pernet operations, and then only registers
hooks in namespaces where a ipvlan master device is set to l3s mode.

Extra care has to be taken when the master device is moved to another
namespace, as we might have to 'move' the netfilter hooks too.

This is done by storing the namespace the ipvlan port was created in.
On REGISTER event, do (un)register operations in the old/new namespaces.

This will also allow removal of the nf_register_hooks() in a future patch.

Cc: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

can: ti_hecc: fix return value check in ti_hecc_probe()

In case of error, the function devm_ioremap_resource() returns ERR_PTR()
and never returns NULL. The NULL test in the return value check should
be replaced with IS_ERR().

Fixes: dabf54dd1c63 ("can: ti_hecc: Convert TI HECC driver to DT only driver")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: enable module auto loading for virtual CAN interfaces

Autoload the vcan module when a vcan instance is to be created by
'ip link add type vcan'

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: add Virtual CAN Tunnel driver (vxcan)

Similar to the virtual ethernet driver veth, vxcan implements a
local CAN traffic tunnel between two virtual CAN network devices.
See Kconfig entry for details.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: network namespace support for CAN gateway

The CAN gateway was not implemented as per-net in the initial network
namespace support by Mario Kicherer (8e8cda6d737d).
This patch enables the CAN gateway to be used in different namespaces.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: network namespace support for CAN_BCM protocol

The CAN_BCM protocol and its procfs entries were not implemented as per-net
in the initial network namespace support by Mario Kicherer (8e8cda6d737d).
This patch adds the missing per-net functionality for the CAN BCM.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: complete initial namespace support

The statistics and its proc output was not implemented as per-net in the
initial network namespace support by Mario Kicherer (8e8cda6d737d).
This patch adds the missing per-net statistics for the CAN subsystem.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: remove obsolete definitions

can_rx_alldev_list is a per-net data structure now. Remove it's definition
here and can_rx_dev_list too.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: remove obsolete pernet_operations definitions

The namespace support for the CAN subsystem does not need any additional
memory. So when ".size = 0" there's no extra memory allocated by the system.
And therefore ".id" is obsolete too.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: fix memory leak in initial namespace support

The can_rx_alldev_list is a per-net data structure now and allocated in
can_pernet_init(). Make sure the memory is free'd in can_pernet_exit() too.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcba_usb: Add support for Microchip CAN BUS Analyzer

SocketCAN driver for Microchip CAN BUS Analyzer
(http://www.microchip.com/development-tools/)

Changes in v4:
- possible memory leak fixed in mcba_usb_write_bulk_callback
- LED support added
- failure handling in mcba_usb_probe improved
- C99 initializers for structs on stack

Changes in v3:
- improved/simplified CAN ID conversion
- functions for transmission of skb and cmd separated
- fixed/improved netif_stop_queue handling
- style/cosmetic corrections

Changes in v2:
- Termination handling reimplemented to fit new netlink API
(IFLA_CAN_TERMINATION)
- Bitrate handling reimplemented to fit new netlink API
(IFLA_CAN_BITRATE)
- CAN ID conversion refactored (changed from macro to inline functions)
- CAN DLC handling using get_can_dlc()
- Endianness handling for can_speed introduced
- Debugging removed
- Redundant error prints removed
- Style/cosmetic corrections (i.e. macro names, redefs, inits etc.)

Signed-off-by: Remigiusz Kołłątaj <remigiusz.kollataj@mobica.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: m_can: Enable TX FIFO Handling for M_CAN IP version >= v3.1.x

* Added defines for TX Event FIFO Element
* Adapted ndo_start_xmit function.
  For versions >= v3.1.x it uses the TX FIFO to optimize the data
  throughput. It stores the echo skb at the same index as in the
  M_CAN's TX FIFO. The frame's message marker is set to this index.
  This message marker is received in the TX Event FIFO after
  the message was successfully transmitted. It is used to echo the
  correct echo skb back to the network stack.
* Added m_can_echo_tx_event function. It reads all received
  message markers in the TX Event FIFO and loops back the
  corresponding echo skbs.
* ISR checks for new TX Event Entry interrupt for version >= 3.1.x.

Signed-off-by: Mario Huettel <mario.huettel@gmx.net>
Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net>
Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: m_can: Configuration for TX and TX event FIFOs

* TX/TX Event FIFO sizes are configured for version >= v3.1.x

Signed-off-by: Mario Huettel <mario.huettel@gmx.net>
Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: m_can: Enable M_CAN version dependent initialization

This patch adapts the initialization of the M_CAN. So it can be used
with all versions >= 3.0.x.

Changes:
* Added version element to m_can_priv structure to hold M_CAN version.
* Renamed bittiming structs for version 3.0.x
* Added new bittiming structs for version >= 3.1.x
* Function alloc_m_can_dev takes 2 new arguments. The TX FIFO size and the
  base address of the module.
* Chip configuration for CAN_CTRLMODE_LOOPBACK is changed: Enabled
  CCCR_MON bit. In combination with TEST_LBCK it activates the internal
  loopback mode. Leaving CCCR_MON '0' results in external loopback mode.
* Clocks are temporarily enabled by platform_propbe function in order to
  allow read access to the Core Release register and the Control Register.
  Registers are used to detect M_CAN version and optional Non-ISO Feature.

Initialization of M_CAN for version >= 3.1.x:
* TX FIFO of M_CAN is used to transmit frames. The driver does not need to
  stop the tx queue after each frame sent.
* Initialization of TX Event FIFO is added.
* NON-ISO is fixed for all M_CAN versions < 3.2.x. Version 3.2.x _can_ have
  the NISO (Non-ISO) bit which can switch the mode of the M_CAN to Non-ISO
  mode. This bit does not have to be writeable. Therefore it is checked.
  If it is writable Non-ISO support is added to the controllers supported
  CAN modes.

New Functions:
* Function to check the Core Release version. The read value determines the
  behaviour of the driver.
* Function to check if the NISO bit for version >= 3.2.x is implemented.

Signed-off-by: Mario Huettel <mario.huettel@gmx.net>
Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net>
Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: m_can: Updated register defines to newest version

* Updated register defines to newest M_CAN version (v3.2.1).
* Changed defines in the whole code.

Signed-off-by: Mario Huettel <mario.huettel@gmx.net>
Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net>
Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: m_can: Removed virtual address from print

The virtual address of the device was printed. I removed it because it
leaks internal information.

Signed-off-by: Mario Huettel <mario.huettel@gmx.net>
Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: m_can: Removed initialization of FIFO water marks

FIFO water marks disabled because the driver doesn't handle water mark
events.

Signed-off-by: Mario Huettel <mario.huettel@gmx.net>
Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net>
Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: m_can: Disabled Interrupt Line 1

* Disabled interrupt line 1. The driver didn't use it.

Signed-off-by: Mario Huettel <mario.huettel@gmx.net>
Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: peak: add support for PEAK PCAN-PCIe FD CAN-FD boards

This patch adds the support of the PCAN-PCI Express FD boards made
by PEAK-System, for computers using the PCI Express slot.

The PCAN-PCI Express FD has one or two CAN FD channels, depending
on the model. A galvanic isolation of the CAN ports protects
the electronics of the card and the respective computer against
disturbances of up to 500 Volts. The PCAN-PCI Express FD can be operated
with ambient temperatures in a range of -40 to +85 °C.

Such boards run an extented version of the CAN-FD IP running into USB
CAN-FD interfaces from PEAK-System, so this patch adds several new commands
and their corresponding data types to the PEAK CAN-FD common definitions
header file too.

Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: peak: move header file to new can common subdir

The CAN-FD IP from PEAK-System runs into several kinds of PC CAN-FD
interfaces. Up to now, only the USB CAN-FD adapters were supported by
the Kernel. In order to prepare the adding of some new non-USB CAN-FD
interfaces, this patch moves - and rename - the IP definitions file
from its private (usb) sub-directory into a - newly created - CAN specific
one.

Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: peak: fix usage of const qualifier in pointers args

Fixes the usage of the const qualifier in the memory pointer arguments
of the declared inline functions. By changing the line containing "const",
this patch also changes the name of the arg into a more usual one.

Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: peak: fix usage of usb specific data type

This patch fixes the wrong usage of a specific USB data type into a common
header file. This common header file is intended to define the common data
types and values that define access to the PEAK-System CAN-FD IP, whatever
the PC interface is.

Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

Merge branch 'virtio-net-tx-napi'

Willem de Bruijn says:

====================
virtio-net tx napi

Add napi for virtio-net transmit completion processing.

Changes:
  v2 -> v3:
    - convert __netif_tx_trylock to __netif_tx_lock on tx napi poll
          ensure that the handler always cleans, to avoid deadlock
    - unconditionally clean in start_xmit
          avoid adding an unnecessary "if (use_napi)" branch
    - remove virtqueue_disable_cb in patch 5/5
          a noop in the common event_idx based loop
    - document affinity_hint_set constraint

  v1 -> v2:
    - disable by default
    - disable unless affinity_hint_set
          because cache misses add up to a third higher cycle cost,
  e.g., in TCP_RR tests. This is not limited to the patch
  that enables tx completion cleaning in rx napi.
    - use trylock to avoid contention between tx and rx napi
    - keep interrupts masked during xmit_more (new patch 5/5)
          this improves cycles especially for multi UDP_STREAM, which
  does not benefit from cleaning tx completions on rx napi.
    - move free_old_xmit_skbs (new patch 3/5)
          to avoid forward declaration

    not changed:
    - deduplicate virnet_poll_tx and virtnet_poll_txclean
          they look similar, but have differ too much to make it
  worthwhile.
    - delay netif_wake_subqueue for more than 2 + MAX_SKB_FRAGS
          evaluated, but made no difference
    - patch 1/5

  RFC -> v1:
    - dropped vhost interrupt moderation patch:
          not needed and likely expensive at light load
    - remove tx napi weight
        - always clean all tx completions
        - use boolean to toggle tx-napi, instead
    - only clean tx in rx if tx-napi is enabled
        - then clean tx before rx
    - fix: add missing braces in virtnet_freeze_down
    - testing: add 4KB TCP_RR + UDP test results

Based on previous patchsets by Jason Wang:

  [RFC V7 PATCH 0/7] enable tx interrupts for virtio-net
  http://lkml.iu.edu/hypermail/linux/kernel/1505.3/00245.html

Before commit b0c39dbdc204 ("virtio_net: don't free buffers in xmit
ring") the virtio-net driver would free transmitted packets on
transmission of new packets in ndo_start_xmit and, to catch the edge
case when no new packet is sent, also in a timer at 10HZ.

A timer can cause long stalls. VIRTIO_F_NOTIFY_ON_EMPTY avoids stalls
due to low free descriptor count. It does not address a stalls due to
low socket SO_SNDBUF. Increasing timer frequency decreases that stall
time, but increases interrupt rate and, thus, cycle count.

Currently, with no timer, packets are freed only at ndo_start_xmit.
Latency of consume_skb is now unbounded. To avoid a deadlock if a sock
reaches SO_SNDBUF, packets are orphaned on tx. This breaks TCP small
queues.

Reenable TCP small queues by removing the orphan. Instead of using a
timer, convert the driver to regular tx napi. This does not have the
unresolved stall issue and does not have any frequency to tune.

By keeping interrupts enabled by default, napi increases tx
interrupt rate. VIRTIO_F_EVENT_IDX avoids sending an interrupt if
one is already unacknowledged, so makes this more feasible today.
Combine that with an optimization that brings interrupt rate
back in line with the existing version for most workloads:

Tx completion cleaning on rx interrupts elides most explicit tx
interrupts by relying on the fact that many rx interrupts fire.

Tested by running {1, 10, 100} {TCP, UDP} STREAM, RR, 4K_RR benchmarks
from a guest to a server on the host, on an x86_64 Haswell. The guest
runs 4 vCPUs pinned to 4 cores. vhost and the test server are
pinned to a core each.

All results are the median of 5 runs, with variance well < 10%.
Used neper (github.com/google/neper) as test process.

Napi increases single stream throughput, but increases cycle cost.
The optimizations bring this down. The previous patchset saw a
regression with UDP_STREAM, which does not benefit from cleaning tx
interrupts in rx napi. This regression is now gone for 10x, 100x.
Remaining difference is higher 1x TCP_STREAM, lower 1x UDP_STREAM.

The latest results are with process, rx napi and tx napi affine to
the same core. All numbers are lower than the previous patchset.

             upstream     napi
TCP_STREAM:
1x:
  Mbps          27816    39805
  Gcycles         274      285

10x:
  Mbps          42947    42531
  Gcycles         300      296

100x:
  Mbps          31830    28042
  Gcycles         279      269

TCP_RR Latency (us):
1x:
  p50              21       21
  p99              27       27
  Gcycles         180      167

10x:
  p50              40       39
  p99              52       52
  Gcycles         214      211

100x:
  p50             281      241
  p99             411      337
  Gcycles         218      226

TCP_RR 4K:
1x:
  p50              28       29
  p99              34       36
  Gcycles         177      167

10x:
  p50              70       71
  p99              85      134
  Gcycles         213      214

100x:
  p50             442      611
  p99             802      785
  Gcycles         237      216

UDP_STREAM:
1x:
  Mbps          29468    26800
  Gcycles         284      293

10x:
  Mbps          29891    29978
  Gcycles         285      312

100x:
  Mbps          30269    30304
  Gcycles         318      316

UDP_RR:
1x:
  p50              19       19
  p99              23       23
  Gcycles         180      173

10x:
  p50              35       40
  p99              54       64
  Gcycles         245      237

100x:
  p50             234      286
  p99             484      473
  Gcycles         224      214

Note that GSO is enabled, so 4K RR still translates to one packet
per request.

Lower throughput at 100x vs 10x can be (at least in part)
explained by looking at bytes per packet sent (nstat). It likely
also explains the lower throughput of 1x for some variants.

upstream:

N=1   bytes/pkt=16581
N=10  bytes/pkt=61513
N=100 bytes/pkt=51558

at_rx:

N=1   bytes/pkt=65204
N=10  bytes/pkt=65148
N=100 bytes/pkt=56840
====================

Acked-by: Michael S. Tsirkin <mst@redhat.com>

virtio-net: keep tx interrupts disabled unless kick

Tx napi mode increases the rate of transmit interrupts. Suppress some
by masking interrupts while more packets are expected. The interrupts
will be reenabled before the last packet is sent.

This optimization reduces the througput drop with tx napi for
unidirectional flows such as UDP_STREAM that do not benefit from
cleaning tx completions in the the receive napi handler.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

virtio-net: clean tx descriptors from rx napi

Amortize the cost of virtual interrupts by doing both rx and tx work
on reception of a receive interrupt if tx napi is enabled. With
VIRTIO_F_EVENT_IDX, this suppresses most explicit tx completion
interrupts for bidirectional workloads.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

virtio-net: move free_old_xmit_skbs

An upcoming patch will call free_old_xmit_skbs indirectly from
virtnet_poll. Move the function above this to avoid having to
introduce a forward declaration.

This is a pure move: no code changes.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

virtio-net: transmit napi

Convert virtio-net to a standard napi tx completion path. This enables
better TCP pacing using TCP small queues and increases single stream
throughput.

The virtio-net driver currently cleans tx descriptors on transmission
of new packets in ndo_start_xmit. Latency depends on new traffic, so
is unbounded. To avoid deadlock when a socket reaches its snd limit,
packets are orphaned on tranmission. This breaks socket backpressure,
including TSQ.

Napi increases the number of interrupts generated compared to the
current model, which keeps interrupts disabled as long as the ring
has enough free descriptors. Keep tx napi optional and disabled for
now. Follow-on patches will reduce the interrupt cost.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

virtio-net: napi helper functions

Prepare virtio-net for tx napi by converting existing napi code to
use helper functions. This also deduplicates some logic.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sparc64: Improve 64-bit constant loading in eBPF JIT.

Doing a full 64-bit decomposition is really stupid especially for
simple values like 0 and -1.

But if we are going to optimize this, go all the way and try for all 2
and 3 instruction sequences not requiring a temporary register as
well.

First we do the easy cases where it's a zero or sign extended 32-bit
number (sethi+or, sethi+xor, respectively).

Then we try to find a range of set bits we can load simply then shift
up into place, in various ways.

Then we try negating the constant and see if we can do a simple
sequence using that with a xor at the end.  (f.e. the range of set
bits can't be loaded simply, but for the negated value it can)

The final optimized strategy involves 4 instructions sequences not
needing a temporary register.

Otherwise we sadly fully decompose using a temp..

Example, from ALU64_XOR_K: 0x0000ffffffff0000 ^ 0x0 = 0x0000ffffffff0000:

0000000000000000 <foo>:
   0:   9d e3 bf 50     save  %sp, -176, %sp
   4:   01 00 00 00     nop
   8:   90 10 00 18     mov  %i0, %o0
   c:   13 3f ff ff     sethi  %hi(0xfffffc00), %o1
  10:   92 12 63 ff     or  %o1, 0x3ff, %o1     ! ffffffff <foo+0xffffffff>
  14:   93 2a 70 10     sllx  %o1, 0x10, %o1
  18:   15 3f ff ff     sethi  %hi(0xfffffc00), %o2
  1c:   94 12 a3 ff     or  %o2, 0x3ff, %o2     ! ffffffff <foo+0xffffffff>
  20:   95 2a b0 10     sllx  %o2, 0x10, %o2
  24:   92 1a 60 00     xor  %o1, 0, %o1
  28:   12 e2 40 8a     cxbe  %o1, %o2, 38 <foo+0x38>
  2c:   9a 10 20 02     mov  2, %o5
  30:   10 60 00 03     b,pn   %xcc, 3c <foo+0x3c>
  34:   01 00 00 00     nop
  38:   9a 10 20 01     mov  1, %o5     ! 1 <foo+0x1>
  3c:   81 c7 e0 08     ret
  40:   91 eb 40 00     restore  %o5, %g0, %o0

Signed-off-by: David S. Miller <davem@davemloft.net>

sparc64: Support cbcond instructions in eBPF JIT.

cbcond combines a compare with a branch into a single instruction.

The limitations are:

1) Only newer chips support it

2) For immediate compares we are limited to 5-bit signed immediate
values

3) The branch displacement is limited to 10-bit signed

4) We cannot use it for JSET

Also, cbcond (unlike all other sparc control transfers) lacks a delay
slot.

Currently we don't have a useful instruction we can push into the
delay slot of normal branches. So using cbcond pretty much always
increases code density, and is therefore a win.

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bpf-misc-cleanups'

Alexander Alemayhu says:

====================
Misc BPF cleanup

while looking into making the Makefile in samples/bpf better handle O= I saw
several warnings when running `make clean && make samples/bpf/`. This series
reduces those warnings.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

samples/bpf: check before defining offsetof

Fixes the following warning

samples/bpf/test_lru_dist.c:28:0: warning: "offsetof" redefined
#define offsetof(TYPE, MEMBER) ((size_t)&((TYPE *)0)->MEMBER)

In file included from ./tools/lib/bpf/bpf.h:25:0,
from samples/bpf/libbpf.h:5,
from samples/bpf/test_lru_dist.c:24:
/usr/lib/gcc/x86_64-redhat-linux/6.3.1/include/stddef.h:417:0: note: this is the location of the previous definition
#define offsetof(TYPE, MEMBER) __builtin_offsetof (TYPE, MEMBER)

Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

samples/bpf: add static to function with no prototype

Fixes the following warning

samples/bpf/cookie_uid_helper_example.c: At top level:
samples/bpf/cookie_uid_helper_example.c:276:6: warning: no previous prototype for ‘finish’ [-Wmissing-prototypes]
void finish(int ret)
^~~~~~
HOSTLD samples/bpf/per_socket_stats_example

Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

samples/bpf: add -Wno-unknown-warning-option to clang

I was initially going to remove '-Wno-address-of-packed-member' because I
thought it was not supposed to be there but Daniel suggested using
'-Wno-unknown-warning-option'.

This silences several warnings similiar to the one below

warning: unknown warning option '-Wno-address-of-packed-member' [-Wunknown-warning-option]
1 warning generated.
clang  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/6.3.1/include -I./arch/x86/include -I./arch/x86/include/generated/uapi -I./arch/x86/include/generated  -I./include
-I./arch/x86/include/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h  \
        -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
        -Wno-compare-distinct-pointer-types \
        -Wno-gnu-variable-sized-type-not-at-end \
        -Wno-address-of-packed-member -Wno-tautological-compare \
        -O2 -emit-llvm -c samples/bpf/xdp_tx_iptunnel_kern.c -o -| llc -march=bpf -filetype=obj -o samples/bpf/xdp_tx_iptunnel_kern.o

$ clang --version

clang version 3.9.1 (tags/RELEASE_391/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: make bpf_xdp_adjust_head support mandatory

Now that also the last in-tree user of the xdp_adjust_head bit has
been removed, we can remove the flag from struct bpf_prog altogether.

This, at the same time, also makes sure that any future driver for
XDP comes with bpf_xdp_adjust_head() support right away.

A rejection based on this flag would also mean that tail calls
couldn't be used with such driver as per c2002f983767 ("bpf: fix
checking xdp_adjust_head on tail calls") fix, thus lets not allow
for it in the first place.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

qlcnic: fix unchecked return value

Function pci_find_ext_capability() may return 0, which is an invalid
address. In function qlcnic_sriov_virtid_fn(), its return value is used
without validation. This may result in invalid memory access bugs. This
patch fixes the bug.

Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

wan: pc300too: abort path on failure

In function pc300_pci_init_one(), on the ioremap error path, function
pc300_pci_remove_one() is called to free the allocated memory. However,
the path is not terminated, and the freed memory will be used later,
resulting in use-after-free bugs. This path fixes the bug.

Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: check return value of nlmsg_new

Function nlmsg_new() will return a NULL pointer if there is no enough
memory, and its return value should be checked before it is used.
However, in function tipc_nl_node_get_monitor(), the validation of the
return value of function nlmsg_new() is missed. This patch fixes the
bug.

Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

lwtunnel: check return value of nla_nest_start

Function nla_nest_start() may return a NULL pointer on error. However,
in function lwtunnel_fill_encap(), the return value of nla_nest_start()
is not validated before it is used. This patch checks the return value
of nla_nest_start() against NULL.

Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'nfp-dma-adjust_head-fixes'

Jakub Kicinski says:

====================
nfp: DMA flags, adjust head and fixes

This series takes advantage of Alex's DMA_ATTR_SKIP_CPU_SYNC to make
XDP packet modifications "correct" from DMA API point of view.  It
also allows us to parse the metadata before we run XDP at no additional
DMA sync cost.  That way we can get rid of the metadata memcpy, and
remove the last upstream user of bpf_prog->xdp_adjust_head.

David's patch adds a way to read capabilities from the management
firmware.

There are also two net-next fixes.  Patch 4 which fixes what seems to
be a result of a botched rebase on my part.  Patch 5 corrects locking
when state of ethernet ports is being refreshed.

v3: move the sync from alloc func to the actual give to hw func
v2: sync rx buffers before giving them to the card (Alex)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: remove the refresh of all ports optimization

The code refreshing the eth port state was trying to update state
of all ports of the card. Unfortunately to safely walk the port
list we would have to hold the port lock, which we can't due to
lock ordering constraints against rtnl.

Make the per-port sync refresh and async refresh of all ports
completely separate routines.

Fixes: 172f638c93dd ("nfp: add port state refresh")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: fix free list buffer size reporting

XDP headroom should not be included in free list buffer size.

Fixes: 6fe0c3b43804 ("nfp: add support for xdp_adjust_head()")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: add NSP routine to get static information

Retrieve identifying information from the NSP. For now it only
contains versions of firmware subcomponents.

Signed-off-by: David Brunecz <david.brunecz@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: parse metadata prepend before XDP runs

Calling memcpy to shift metadata out of the way for XDP to run
seems like an overkill. The most common metadata contents are
8 bytes containing type and flow hash. Simply parse the metadata
before we run XDP.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: make use of the DMA_ATTR_SKIP_CPU_SYNC attr

DMA unmap may destroy changes CPU made to the buffer. To make XDP
run correctly on non-x86 platforms we should use the
DMA_ATTR_SKIP_CPU_SYNC attribute.

Thanks to using the attribute we can now push the sync operation to the
common code path from XDP handler.

A little bit of variable name reshuffling is required to bring the
code back to readable state.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'cls_flower-MPLS'

Benjamin LaHaise says:

====================
flower: add MPLS matching support

This patch series adds support for parsing MPLS flows in the flow dissector
and the flower classifier. Each of the MPLS TTL, BOS, TC and Label fields
can be used for matching.

v2: incorporate style feedback, move #defines to linux/include/mpls.h
Note: this omits Jiri's request to remove tabs between the type and
field names in struct declarations. This would be inconsistent with
numerous other struct definitions.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

cls_flower: add support for matching MPLS fields (v2)

Add support to the tc flower classifier to match based on fields in MPLS
labels (TTL, Bottom of Stack, TC field, Label).

Signed-off-by: Benjamin LaHaise <benjamin.lahaise@netronome.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Simon Horman <simon.horman@netronome.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Hadar Hen Zion <hadarh@mellanox.com>
Cc: Gao Feng <fgao@ikuai8.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

flow_dissector: add mpls support (v2)

Add support for parsing MPLS flows to the flow dissector in preparation for
adding MPLS match support to cls_flower.

Signed-off-by: Benjamin LaHaise <benjamin.lahaise@netronome.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Simon Horman <simon.horman@netronome.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Eric Dumazet <jhs@mojatatu.com>
Cc: Hadar Hen Zion <hadarh@mellanox.com>
Cc: Gao Feng <fgao@ikuai8.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tcp-fastopen-middlebox-fixes'

Wei Wang says:

====================
net/tcp_fastopen: Fix for various TFO firewall issues

Currently there are still some firewall issues in the middlebox
which make the middlebox drop packets silently for TFO sockets.
This kind of issue is hard to be detected by the end client.

This patch series tries to detect such issues in the kernel and disable
TFO temporarily.
More details about the issues and the fixes are included in the following
patches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/tcp_fastopen: Remove mss check in tcp_write_timeout()

Christoph Paasch from Apple found another firewall issue for TFO:
After successful 3WHS using TFO, server and client starts to exchange
data. Afterwards, a 10s idle time occurs on this connection. After that,
firewall starts to drop every packet on this connection.

The fix for this issue is to extend existing firewall blackhole detection
logic in tcp_write_timeout() by removing the mss check.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/tcp_fastopen: Add snmp counter for blackhole detection

This counter records the number of times the firewall blackhole issue is
detected and active TFO is disabled.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/tcp_fastopen: Disable active side TFO in certain scenarios

Middlebox firewall issues can potentially cause server's data being
blackholed after a successful 3WHS using TFO. Following are the related
reports from Apple:
https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
Slide 31 identifies an issue where the client ACK to the server's data
sent during a TFO'd handshake is dropped.
C ---> syn-data ---> S
C <--- syn/ack ----- S
C (accept & write)
C <---- data ------- S
C ----- ACK -> X     S
[retry and timeout]

https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
Slide 5 shows a similar situation that the server's data gets dropped
after 3WHS.
C ---- syn-data ---> S
C <--- syn/ack ----- S
C ---- ack --------> S
S (accept & write)
C?  X <- data ------ S
[retry and timeout]

This is the worst failure b/c the client can not detect such behavior to
mitigate the situation (such as disabling TFO). Failing to proceed, the
application (e.g., SSL library) may simply timeout and retry with TFO
again, and the process repeats indefinitely.

The proposed solution is to disable active TFO globally under the
following circumstances:
1. client side TFO socket detects out of order FIN
2. client side TFO socket receives out of order RST

We disable active side TFO globally for 1hr at first. Then if it
happens again, we disable it for 2h, then 4h, 8h, ...
And we reset the timeout to 1hr if a client side TFO sockets not opened
on loopback has successfully received data segs from server.
And we examine this condition during close().

The rational behind it is that when such firewall issue happens,
application running on the client should eventually close the socket as
it is not able to get the data it is expecting. Or application running
on the server should close the socket as it is not able to receive any
response from client.
In both cases, out of order FIN or RST will get received on the client
given that the firewall will not block them as no data are in those
frames.
And we want to disable active TFO globally as it helps if the middle box
is very close to the client and most of the connections are likely to
fail.

Also, add a debug sysctl:
  tcp_fastopen_blackhole_detect_timeout_sec:
    the initial timeout to use when firewall blackhole issue happens.
    This can be set and read.
    When setting it to 0, it means to disable the active disable logic.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mlx5-updates-2017-04-22' of git://git./linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2017-04-22

Sparse and compiler warnings fixes from Stephen Hemminger.

From Roi Dayan and Or Gerlitz, Add devlink and mlx5 support for controlling
E-Switch encapsulation mode, this knob will enable HW support for applying
encapsulation/decapsulation to VF traffic as part of SRIOV e-switch offloading.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: add rcu locking when changing early demux

systemd-sysctl is triggering a suspicious RCU usage message when
net.ipv4.tcp_early_demux or net.ipv4.udp_early_demux is changed via
a sysctl config file:

[   33.896184] ===============================
[   33.899558] [ ERR: suspicious RCU usage.  ]
[   33.900624] 4.11.0-rc7+ #104 Not tainted
[   33.901698] -------------------------------
[   33.903059] /home/dsa/kernel-2.git/net/ipv4/sysctl_net_ipv4.c:305 suspicious rcu_dereference_check() usage!
[   33.905724]
other info that might help us debug this:

[   33.907656]
rcu_scheduler_active = 2, debug_locks = 0
[   33.909288] 1 lock held by systemd-sysctl/143:
[   33.910373]  #0:  (sb_writers#5){.+.+.+}, at: [<ffffffff8123a370>] file_start_write+0x45/0x48
[   33.912407]
stack backtrace:
[   33.914018] CPU: 0 PID: 143 Comm: systemd-sysctl Not tainted 4.11.0-rc7+ #104
[   33.915631] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[   33.917870] Call Trace:
[   33.918431]  dump_stack+0x81/0xb6
[   33.919241]  lockdep_rcu_suspicious+0x10f/0x118
[   33.920263]  proc_configure_early_demux+0x65/0x10a
[   33.921391]  proc_udp_early_demux+0x3a/0x41

add rcu locking to proc_configure_early_demux.

Fixes: dddb64bcb3461 ("net: Add sysctl to toggle early demux for tcp and udp")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'for-upstream' of git://git./linux/kernel/git/bluetooth/bluetooth-next

Johan Hedberg says:

====================
pull request: bluetooth-next 2017-04-22

Here are some more Bluetooth patches (and one 802.15.4 patch) in the
bluetooth-next tree targeting the 4.12 kernel. Most of them are pure
fixes.

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: netcp: fix spelling mistake: "memomry" -> "memory"

Trivial fix to spelling mistake in dev_err message and rejoin
line.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: atheros: atl1: use offset_in_page() macro

Use offset_in_page() macro instead of open-coding.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bnxt_en-misc-next'

Michael Chan says:

====================
bnxt_en: Updates for net-next.

Miscellaneous updates include passing DCBX RoCE VLAN priority to firmware,
checking one more new firmware flag before allowing DCBX to run on the host,
adding 100Gbps speed support, adding check to disallow speed settings on
Multi-host NICs, and a minor fix for reporting VF attributes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Restrict a PF in Multi-Host mode from changing port PHY configuration

This change restricts the PF in multi-host mode from setting any port
level PHY configuration. The settings are controlled by firmware in
Multi-Host mode.

Signed-off-by: Deepak Khungar <deepak.khungar@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Check the FW_LLDP_AGENT flag before allowing DCBX host agent.

Check the additional flag in bnxt_hwrm_func_qcfg() before allowing
DCBX to be done in host mode.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Add 100G link speed reporting for BCM57454 ASIC in ethtool

Added support for 100G link speed reporting for Broadcom BCM57454
ASIC in ethtool command.

Signed-off-by: Deepak Khungar <deepak.khungar@broadcom.com>
Signed-off-by: Ray Jui <ray.jui@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Fix VF attributes reporting.

The .ndo_get_vf_config() is returning the wrong qos attribute. Fix
the code that checks and reports the qos and spoofchk attributes. The
BNXT_VF_QOS and BNXT_VF_LINK_UP flags should not be set by default
during init. time.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Pass DCB RoCE app priority to firmware.

When the driver gets the RoCE app priority set/delete call through DCBNL,
the driver will send the information to the firmware to set up the
priority VLAN tag for RDMA traffic.

[ New version using the common ETH_P_IBOE constant in if_ether.h ]

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: Add eventmask support to CT action.

Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK,
which can be used in conjunction with the commit flag
(OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which
conntrack events (IPCT_*) should be delivered via the Netfilter
netlink multicast groups.  Default behavior depends on the system
configuration, but typically a lot of events are delivered.  This can be
very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some
types of events are of interest.

Netfilter core init_conntrack() adds the event cache extension, so we
only need to set the ctmask value.  However, if the system is
configured without support for events, the setting will be skipped due
to extension not being found.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: Typo fix.

Fix typo in a comment.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ibmvnic-updates-and-fixes'

Nathan Fontenot says:

====================
ibmvnic: Additional updates and bug fixes

This set of patches is an additional set of updates and bug fixes to
the ibmvnic driver which applies on top of the previous set of updates
sent out on 4/19.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ibmvnic: Free skb's in cases of failure in transmit

When an error is encountered during transmit we need to free the
skb instead of returning TX_BUSY.

Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ibmvnic: Validate napi exist before disabling them

Validate that the napi structs exist before trying to disable them
at driver close.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ibmvnic: Add set_link_state routine for setting adapter link state

Create a common routine for setting the link state for the vnic adapter.
This update moves the sending of the crq and waiting for the link state
response to a common place. The new routine also adds handling of
resending the crq in cases of getting a partial success response.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ibmvnic: Move initialization of the stats token to ibmvnic_open

We should be initializing the stats token in the same place we
initialize the other resources for the driver.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ibmvnic: Only retrieve error info if present

When handling a fatal error in the driver, there can be additional
error information provided by the vios. This information is not
always present, so only retrieve the additional error information
when present.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ibmvnic: Insert header on VLAN tagged received frame

This patch addresses a modification in the PAPR+ specification which now
defines a previously reserved value for vNIC capabilities. It indicates
whether the system firmware performs a VLAN header stripping on all VLAN
tagged received frames, in case it does, the behavior expected is for
the ibmvnic driver to be responsible for inserting the VLAN header.

Reported-by: Manvanthara B. Puttashankar <mputtash@in.ibm.com>
Signed-off-by: Murilo Fossa Vicentini <muvic@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ibmvnic: Set real number of rx queues

Along with 5 TX queues, 5 RX queues are allocated at the beginning of
device probe. However, only the real number of TX queues is set. Configure
the real number of RX queues as well.

Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'packet-fanout-unique-id'

Mike Maloney says:

====================
packet: Add option to create new fanout group with unique id.

Fanout uses a per net global namespace. A process that intends to create a
new fanout group can accidentally join an existing group. It is
not possible to detect this.

Add a socket option to specify on the first call to
setsockopt(..., PACKET_FANOUT, ...) to ensure that a new group is created.
Also add tests.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/net: add tests for PACKET_FANOUT_FLAG_UNIQUEID

Create two groups with PACKET_FANOUT_FLAG_UNIQUEID, add a socket to one.
Ensure that the groups can only be joined if all options are consistent
with the original except for this flag.

Signed-off-by: Mike Maloney <maloney@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

packet: add PACKET_FANOUT_FLAG_UNIQUEID to assign new fanout group id.

Fanout uses a per net global namespace. A process that intends to create
a new fanout group can accidentally join an existing group. It is not
possible to detect this.

Add socket option PACKET_FANOUT_FLAG_UNIQUEID.  When specified the
supplied fanout group id must be set to 0, and the kernel chooses an id
that is not already in use.  This is an ephemeral flag so that
other sockets can be added to this group using setsockopt, but NOT
specifying this flag.  The current getsockopt(..., PACKET_FANOUT, ...)
can be used to retrieve the new group id.

We assume that there are not a lot of fanout groups and that this is not
a high frequency call.

The method assigns ids starting at zero and increases until it finds an
unused id.  It keeps track of the last assigned id, and uses it as a
starting point to find new ids.

Signed-off-by: Mike Maloney <maloney@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/net: cleanup unused parameter in psock_fanout

sock_fanout_open no longer sets the size of packet_socket ring, so stop
passing the parameter.

Signed-off-by: Mike Maloney <maloney@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mdio_bus: Issue GPIO RESET to PHYs.

Some boards [1] leave the PHYs at an invalid state
during system power-up or reset thus causing unreliability
issues with the PHY which manifests as PHY not being detected
or link not functional. To fix this, these PHYs need to be RESET
via a GPIO connected to the PHY's RESET pin.

Some boards have a single GPIO controlling the PHY RESET pin of all
PHYs on the bus whereas some others have separate GPIOs controlling
individual PHY RESETs.

In both cases, the RESET de-assertion cannot be done in the PHY driver
as the PHY will not probe till its reset is de-asserted.
So do the RESET de-assertion in the MDIO bus driver.

[1] - am572x-idk, am571x-idk, a437x-idk

Signed-off-by: Roger Quadros <rogerq@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'VSOCK-add-vsockmon'

Stefan Hajnoczi says:

====================
VSOCK: vsockmon virtual device to monitor AF_VSOCK sockets.

v5:
* Change vsock_deliver_tap() API to avoid unnecessary skb creation
   [Jorgen]
* Fix skb leak when no taps are registered [Jorgen]
* s/cpu_to_le16(pkt->hdr.op)/le16_to_cpu(pkt->hdr.op)/ [Michael]
* Add af_vsock_tap.c and vsockmon.[ch] to MAINTAINERS
* checkpatch.pl and sparse fixes

v4:
* Add explicit reserved padding field to struct af_vsockmon_hdr and
   drop __attribute__((packed)) [Michael, DaveM]
* Call synchronize_net() before module_put() [Michael]

v3:
* Hook virtio_transport.c (guest driver), not just drivers/vhost/vsock.c (host
   driver)
* Fix DEFAULT_MTU macro definition [Zhu Yanjun]
* Rename af_vsockmon_hdr->t field ->transport for clarity
* Update .ndo_get_stats64() return type since it has changed
* Include missing <linux/module.h> header in af_vsock_tap.c

This is a continuation of Gerard Garcia's work on the vsockmon packet capture
interface for AF_VSOCK.  Packet capture is an essential feature for network
communication.  Gerard began addressing this feature gap in his Google Summer
of Code 2016 project.  I have cleaned up, rebased, and retested the v2 series
he posted previously.

The design follows the nlmon packet capture interface closely.  This is because
vsock has the same problem as netlink: there is no netdev on which packets can
be captured.  The nlmon driver is a synthetic netdev purely for the purpose of
enabling packet capture.  We follow the same approach here with vsockmon.

See include/uapi/linux/vsockmon.h in this series for details on the packet
layout.

How to try it:

1. Build tcpdump with vsockmon patches:

  $ git clone -b vsock https://github.com/stefanha/libpcap
  $ (cd libcap && ./configure && make)
  $ git clone -b vsock https://github.com/stefanha/tcpdump
  $ (cd tcpdump && ./configure && make)

2. Build nc-vsock (a netcat-like tool):

  $ git clone https://github.com/stefanha/nc-vsock
  $ (cd nc-vsock && make)

3. Launch a virtual machine:

  # modprobe vhost_vsock
  # qemu-system-x86_64 -M accel=kvm -m 1024 -cpu host \
      -drive if=virtio,file=test.img,format=raw \
      -device vhost-vsock-pci,guest-cid=3

  (Assumes guest is running a kernel with this patch)

4. Capture AF_VSOCK traffic in guest and/or host:

  # modprobe vsockmon
  # ip link add type vsockmon
  # ip link set vsockmon0 up
  # tcpdump -i vsockmon0 -vvv

5. Communicate!

  (host)$ nc-vsock -l 1234
  (guest)$ nc-vsock 2 1234
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

VSOCK: Add virtio vsock vsockmon hooks

The virtio drivers deal with struct virtio_vsock_pkt. Add
virtio_transport_deliver_tap_pkt(pkt) for handing packets to the
vsockmon device.

We call virtio_transport_deliver_tap_pkt(pkt) from
net/vmw_vsock/virtio_transport.c and drivers/vhost/vsock.c instead of
common code. This is because the drivers may drop packets before
handing them to common code - we still want to capture them.

Signed-off-by: Gerard Garcia <ggarcia@deic.uab.cat>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>