review.tizen.org Git - platform/kernel/linux-starfive.git/log

ixgbe: Match on multiple headers for cls_u32 offloads

Adds support to set filters with multiple header fields (L3,L4)to match on.
This is achieved in the following order:
1. Create a leaf hash table for the next header.
2. Create a link to the leaf hash table from the base hash table with
   matches on next header type and current header fields.
3. Add filter in leaf hash table with match on next header fields and
   action.

Verified with the following filters :

Match TCP and DIP:
        handle 1: u32 divisor 1
        u32 ht 800: order 1 link 1: \
        offset at 0 mask 0f00 shift 6 plus 0 eat \
        match ip protocol 6 ff match ip dst 10.0.0.1/32
        match tcp src 28 ffff action drop

Delete the filter:

Match on DIP, SIP, UDP (SPort, DPort):
        handle 2: u32 divisor 1
        u32 ht 800: order 2 link 2: \
        offset at 0 mask 0f00 shift 6 plus 0 eat \
        match ip dst 15.0.0.2/32 match ip protocol 17 ff \
        match ip src 15.0.0.1/32
        match udp src 30 ffff match udp dst 32 ffff action drop

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Add support for redirect action to cls_u32 offloads

This patch enables 'redirect' to a SRIOV VF or a offloaded macvlan
device queue via tc 'mirred' action.

Verified with the following script that creates SRIOV VFs,  offloaded
macvlan and adds tc u32 filters with redirect action to the associated
netdevs.

# add ingress qdisc.
tc qdisc add dev p4p1 ingress

# enable hw tc offload.
ethtool -K p4p1 hw-tc-offload on

# create 4 sriov VFs and bring up the first one.
echo 4 > /sys/class/net/p4p1/device/sriov_numvfs
sleep 1
ip link set p4p1 up
ip link set p4p1_0 up

# create a offloaded macvlan device and bring it up.
ethtool -K p4p1 l2-fwd-offload on
ip link add link p4p1 name mvlan_1 type macvlan
ip link set mvlan_1 up

# add u32 filter with action to redirect to VF netdev
tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
    handle 800:0:1 u32 ht 800: \
    match ip src 192.168.1.3/32 \
    action mirred egress redirect dev p4p1_0

# add u32 filter with action to redirect to macvlan netdev
tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
    handle 800:0:2 u32 ht 800: \
    match ip src 192.168.2.3/32 \
    action mirred egress redirect dev mvlan_1

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

net_sched: act_mirred: add helper inlines to access tcf_mirred info

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ipv6: add new struct ipcm6_cookie

In the sendmsg function of UDP, raw, ICMP and l2tp sockets, we use local
variables like hlimits, tclass, opt and dontfrag and pass them to corresponding
functions like ip6_make_skb, ip6_append_data and xxx_push_pending_frames.
This is not a good practice and makes it hard to add new parameters.
This fix introduces a new struct ipcm6_cookie similar to ipcm_cookie in
ipv4 and include the above mentioned variables. And we only pass the
pointer to this structure to corresponding functions. This makes it easier
to add new parameters in the future and makes the function cleaner.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: add __sock_wfree() helper

Hosts sending lot of ACK packets exhibit high sock_wfree() cost
because of cache line miss to test SOCK_USE_WRITE_QUEUE

We could move this flag close to sk_wmem_alloc but it is better
to perform the atomic_sub_and_test() on a clean cache line,
as it avoid one extra bus transaction.

skb_orphan_partial() can also have a fast track for packets that either
are TCP acks, or already went through another skb_orphan_partial()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tipc-next'

Jon Maloy says:

====================
tipc: redesign socket-level flow control

The socket-level flow control in TIPC has long been due for a major
overhaul. This series fixes this.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: redesign connection-level flow control

There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.

Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.

A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.

This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.

This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.

In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: propagate peer node capabilities to socket layer

During neighbor discovery, nodes advertise their capabilities as a bit
map in a dedicated 16-bit field in the discovery message header. This
bit map has so far only be stored in the node structure on the peer
nodes, but we now see the need to keep a copy even in the socket
structure.

This commit adds this functionality.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: re-enable compensation for socket receive buffer double counting

In the refactoring commit d570d86497ee ("tipc: enqueue arrived buffers
in socket in separate function") we did by accident replace the test

if (sk->sk_backlog.len == 0)
atomic_set(&tsk->dupl_rcvcnt, 0);

with

if (sk->sk_backlog.len)
atomic_set(&tsk->dupl_rcvcnt, 0);

This effectively disables the compensation we have for the double
receive buffer accounting that occurs temporarily when buffers are
moved from the backlog to the socket receive queue. Until now, this
has gone unnoticed because of the large receive buffer limits we are
applying, but becomes indispensable when we reduce this buffer limit
later in this series.

We now fix this by inverting the mentioned condition.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rtl8152: correct speed testing

Allow for SS+ USB

Signed-off-by: Oliver Neukum <ONeukum@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

usbnet: correct speed testing

Allow for SS+ USB

Signed-off-by: Oliver Neukum <ONeukum@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

brcm80211: correct speed testing

Allow for SS+ USB

Signed-off-by: Oliver Neukum <ONeukum@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Apply tunnel configurations after PF start

Configure and enable various tunnels on the
adapter after PF start.

This change was missed as a part of
'commit 464f664501816ef5fbbc00b8de96f4ae5a1c9325
("qed: Add infrastructure support for tunneling")'

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <yuval.mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'stmmac-dwmac-socfpga-cleanup'

Joachim Eastwood says:

====================
stmmac: dwmac-socfpga refactor+cleanup

This patch aims to remove the init/exit callbacks from the dwmac-
socfpga driver and instead use standard PM callbacks. Doing this
will also allow us to cleanup the driver.

Eventually the init/exit callbacks will be deprecated and removed
from all drivers dwmac-* except for dwmac-generic. Drivers will be
refactored to use standard PM and remove callbacks.

This patch set should not change the behavior of the driver itself,
it only moves code around. The only exception to this is patch
number 4 which restores the resume callback behavior which was
changed in the "net: stmmac: socfpga: Remove re-registration of
reset controller" patch. I belive calling phy_resume() only
from the resume callback and not probe is the right thing to do.

Changes from v1:
- Rebase on net-next

One heads-up here:
The first patch changes the prototype of a couple of
functions used in Alexandre's "add Ethernet glue logic for
stm32 chip" patch [1] and will cause build failures for
dwmac-stm32.c if not fixed up!
If Alexandre's patch set is applied first I will gladly
rebase my patch set to account for his driver as well.

[1] https://patchwork.ozlabs.org/patch/614405/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: dwmac-socfpga: kill init() and rename setup() to set_phy_mode()

Remove old init callback which now contains only a call to
socfpga_dwmac_setup(). Also rename socfpga_dwmac_setup() to indicate
what the function really does.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Tested-by: Marek Vasut <marex@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: dwmac-socfpga: call phy_resume() only in resume callback

Calling phy_resume() should only be need during driver resume to
workaround a hardware errata.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Tested-by: Marek Vasut <marex@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: dwmac-socfpga: keep a copy of stmmac_rst in driver priv data

The dwmac-socfpga driver needs to control the reset usually managed
by the core driver to set the PHY mode. Take a copy of the reset
handle from core priv data so it can be used by the driver later.

This also allow us to move reset handling into socfpga_dwmac_setup()
where the code that needs it is located.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Tested-by: Marek Vasut <marex@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: dwmac-socfpga: add PM ops and resume function

Implement the needed PM callbacks in the driver instead of
relying on the init/exit hooks in stmmac_platform. This gives
the driver more flexibility in how the code is organized.

Eventually the init/exit callbacks will be deprecated in favor
of the standard PM callbacks and driver remove function.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Tested-by: Marek Vasut <marex@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: let remove/resume/suspend functions take device pointer

Change stmmac_remove/resume/suspend to take a device pointer so
they can be used directly by drivers that doesn't need to perform
anything device specific.

This lets us remove the PCI pm functions and later simplifiy the
platform drivers.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Tested-by: Marek Vasut <marex@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: fec_mpc52xx: move to new ethtool api {get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move the fec_mpc52xx driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: fs-enet: move to new ethtool api {get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move the fs-enet driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ucc: move to new ethtool api {get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move the ucc driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: gianfar: move to new ethtool api {get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move the gianfar driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

VSOCK: constify vsock_transport structure

The vsock_transport structure is never modified, so declare it as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: xgene: constify xgene_cle_ops structure

The xgene_cle_ops structure is never modified, so declare it as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fq_codel: add batch ability to fq_codel_drop()

In presence of inelastic flows and stress, we can call
fq_codel_drop() for every packet entering fq_codel qdisc.

fq_codel_drop() is quite expensive, as it does a linear scan
of 4 KB of memory to find a fat flow.
Once found, it drops the oldest packet of this flow.

Instead of dropping a single packet, try to drop 50% of the backlog
of this fat flow, with a configurable limit of 64 packets per round.

TCA_FQ_CODEL_DROP_BATCH_SIZE is the new attribute to make this
limit configurable.

With this strategy the 4 KB search is amortized to a single cache line
per drop [1], so fq_codel_drop() no longer appears at the top of kernel
profile in presence of few inelastic flows.

[1] Assuming a 64byte cache line, and 1024 buckets

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dave Taht <dave.taht@gmail.com>
Cc: Jonathan Morton <chromatix99@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Dave Taht
Signed-off-by: David S. Miller <davem@davemloft.net>

ravb: Remove rx buffer ALIGN

Aligning the reception data size is not required.

Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
Signed-off-by: Yoshihiro Kaneko <ykaneko0929@gmail.com>
Tested-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'wireless-drivers-next-for-davem-2016-05-02' of git://git./linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers patches for 4.7

Major changes:

brcmfmac

* add support for nl80211 BSS_SELECT feature

mwifiex

* add platform specific wakeup interrupt support

ath10k

* implement set_tsf() for 10.2.4 branch
* remove rare MSI range support
* remove deprecated firmware API 1 support

ath9k

* add module parameter to invert LED polarity

wcn36xx

* fixes to get the driver properly working on Dragonboard 410c
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: relax expensive skb_unclone() in iptunnel_handle_offloads()

Locally generated TCP GSO packets having to go through a GRE/SIT/IPIP
tunnel have to go through an expensive skb_unclone()

Reallocating skb->head is a lot of work.

Test should really check if a 'real clone' of the packet was done.

TCP does not care if the original gso_type is changed while the packet
travels in the stack.

This adds skb_header_unclone() which is a variant of skb_clone()
using skb_header_cloned() check instead of skb_cloned().

This variant can probably be used from other points.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netdevice: shrink size of struct netdev_queue

- trans_timeout is incremented when tx queue timed out (tx watchdog).
- tx_maxrate is set via sysfs

Moving tx_maxrate to read-mostly part shrinks the struct by 64 bytes.
While at it, also move trans_timeout (it is out-of-place in the
'write-mostly' part).

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bridge-per-vlan-stats'

Nikolay Aleksandrov says:

====================
bridge: per-vlan stats

This set adds support for bridge per-vlan statistics.
In order to be able to dump statistics for many vlans we need a way to
continue dumping after reaching maximum size, thus patches 01 and 02 extend
the new stats API with a per-device extended link stats attribute and
callback which can save its local state and continue where it left off
afterwards. I considered using the already existing "fill_xstats" callback
but it gets confusing since we need to separate the linkinfo dump from the
new stats api dump and adding a flag/argument to do that just looks messy.
I don't think the rtnl_link_ops size is an issue, so adding these seemed
like the cleaner approach.

Patches 03 and 04 add the stats support and netlink dump support
respectively. The stats accounting is controlled via a bridge option which
is default off, thus the performance impact is kept minimal.
I've tested this set with both old and modified iproute2, kmemleak on and
some traffic stress tests while adding/removing vlans and ports.

v3:
- drop the RCU pvid patch and remove one pointer fetch as requested
- make stats accounting optional with default to off, the option is in the
   same cache line as vlan_proto and vlan_enabled, so it is already fetched
   before the fast path check thus the performance impact is minimal, this
   also allows us to avoid one vlan lookup and return early when using pvid
- rebased and retested

v2:
- Improve the error checking, rename lidx to prividx and save the current
   idx user instead of restricting it to one in patch 01
- squash patch 02 into 01 and remove the restriction
- add callback descriptions, improve the size calculation and change the
   xstats message structure to have an embedding level per rtnl link type
   so we can avoid one call to get the link type (and thus filter on it)
   and also each link type can now have any number of private attributes
   inside
- fix a problem where the vlan stats are not dumped if the bridge has 0
   vlans on it but has vlans on the ports, add bridge link type private
   attributes and also add paddings for future extensions to avoid at least
   a few netlink attributes and improve struct alignment
- drop the is_skb_forwardable argument constifying patch as it's not
   needed anymore, but it's a nice cleanup which I'll send separately
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: export per-vlan stats

Add a new LINK_XSTATS_TYPE_BRIDGE attribute and implement the
RTM_GETSTATS callbacks for IFLA_STATS_LINK_XSTATS (fill_linkxstats and
get_linkxstats_size) in order to export the per-vlan stats.
The paddings were added because soon these fields will be needed for
per-port per-vlan stats (or something else if someone beats me to it) so
avoiding at least a few more netlink attributes.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: vlan: learn to count

Add support for per-VLAN Tx/Rx statistics. Every global vlan context gets
allocated a per-cpu stats which is then set in each per-port vlan context
for quick access. The br_allowed_ingress() common function is used to
account for Rx packets and the br_handle_vlan() common function is used
to account for Tx packets. Stats accounting is performed only if the
bridge-wide vlan_stats_enabled option is set either via sysfs or netlink.
A struct hole between vlan_enabled and vlan_proto is used for the new
option so it is in the same cache line. Currently it is binary (on/off)
but it is intentionally restricted to exactly 0 and 1 since other values
will be used in the future for different purposes (e.g. per-port stats).

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: rtnetlink: add linkxstats callbacks and attribute

Add callbacks to calculate the size and fill link extended statistics
which can be split into multiple messages and are dumped via the new
rtnl stats API (RTM_GETSTATS) with the IFLA_STATS_LINK_XSTATS attribute.
Also add that attribute to the idx mask check since it is expected to
be able to save state and resume dumping (e.g. future bridge per-vlan
stats will be dumped via this attribute and callbacks).
Each link type should nest its private attributes under the per-link type
attribute. This allows to have any number of separated private attributes
and to avoid one call to get the dev link type.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: rtnetlink: allow rtnl_fill_statsinfo to save private state counter

The new prividx argument allows the current dumping device to save a
private state counter which would enable it to continue dumping from
where it left off. And the idxattr is used to save the current idx user
so multiple prividx using attributes can be requested at the same time
as suggested by Roopa Prabhu.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ipv6-tunnel-cleanups'

Tom Herbert says:

====================
net: Cleanup IPv6 ip tunnels

The IPv6 tunnel code is very different from IPv4 code. There is a lot
of redundancy with the IPv4 code, particularly in the GRE tunneling.

This patch set cleans up the tunnel code to make the IPv6 code look
more like the IPv4 code and use common functions between the two
stacks where possible.

This work should make it easier to maintain and extend the IPv6 ip
tunnels.

Items in this patch set:
  - Cleanup IPv6 tunnel receive path (ip6_tnl_rcv). Includes using
    gro_cells and exporting ip6_tnl_rcv so the ip6_gre can call it
  - Move GRE functions to common header file (tx functions) or
    gre_demux.c (rx functions like gre_parse_header)
  - Call common GRE functions from IPv6 GRE
  - Create ip6_tnl_xmit (to be like ip_tunnel_xmit)

Tested:
  Ran super_netperf tests for TCP_RR and TCP_STREAM for:
    - IPv4 over gre, gretap, gre6, gre6tap
    - IPv6 over gre, gretap, gre6, gre6tap
    - ipip
    - ip6ip6
    - ipip/gue
    - IPv6 over gre/gue
    - IPv4 over gre/gue
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

gre6: Cleanup GREv6 transmit path, call common GRE functions

Changes in GREv6 transmit path:
  - Call gre_checksum, remove gre6_checksum
  - Rename ip6gre_xmit2 to __gre6_xmit
  - Call gre_build_header utility function
  - Call ip6_tnl_xmit common function
  - Call ip6_tnl_change_mtu, eliminate ip6gre_tunnel_change_mtu

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Generic tunnel cleanup

A few generic changes to generalize tunnels in IPv6:
- Export ip6_tnl_change_mtu so that it can be called by ip6_gre
- Add tun_hlen to ip6_tnl structure.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gre: Create common functions for transmit

Create common functions for both IPv4 and IPv6 GRE in transmit. These
are put into gre.h.

Common functions are for:
  - GRE checksum calculation. Move gre_checksum to gre.h.
  - Building a GRE header. Move GRE build_header and rename
    gre_build_header.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Create ip6_tnl_xmit

This patch renames ip6_tnl_xmit2 to ip6_tnl_xmit and exports it. Other
users like GRE will be able to call this. The original ip6_tnl_xmit
function is renamed to ip6_tnl_start_xmit (this is an ndo_start_xmit
function).

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gre6: Cleanup GREv6 receive path, call common GRE functions

- Create gre_rcv function. This calls gre_parse_header and ip6gre_rcv.
- Call ip6_tnl_rcv. Doing this and using gre_parse_header eliminates
most of the code in ip6gre_rcv.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gre: Move utility functions to common headers

Several of the GRE functions defined in net/ipv4/ip_gre.c are usable
for IPv6 GRE implementation (that is they are protocol agnostic).

These include:
  - GRE flag handling functions are move to gre.h
  - GRE build_header is moved to gre.h and renamed gre_build_header
  - parse_gre_header is moved to gre_demux.c and renamed gre_parse_header
  - iptunnel_pull_header is taken out of gre_parse_header. This is now
    done by caller. The header length is returned from gre_parse_header
    in an int* argument.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Cleanup IPv6 tunnel receive path

Some basic changes to make IPv6 tunnel receive path look more like
IPv4 path:
  - Make ip6_tnl_rcv non-static so that GREv6 and others can call it
  - Make ip6_tnl_rcv look like ip_tunnel_rcv
  - Switch to gro_cells_receive
  - Make ip6_tnl_rcv non-static and export it

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tcp-preempt'

Eric Dumazet says:

====================
net: make TCP preemptible

Most of TCP stack assumed it was running from BH handler.

This is great for most things, as TCP behavior is very sensitive
to scheduling artifacts.

However, the prequeue and backlog processing are problematic,
as they need to be flushed with BH being blocked.

To cope with modern needs, TCP sockets have big sk_rcvbuf values,
in the order of 16 MB, and soon 32 MB.
This means that backlog can hold thousands of packets, and things
like TCP coalescing or collapsing on this amount of packets can
lead to insane latency spikes, since BH are blocked for too long.

It is time to make UDP/TCP stacks preemptible.

Note that fast path still runs from BH handler.

v2: Added "tcp: make tcp_sendmsg() aware of socket backlog"
to reduce latency problems of large sends.

v3: Fixed a typo in tcp_cdg.c
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: make tcp_sendmsg() aware of socket backlog

Large sendmsg()/write() hold socket lock for the duration of the call,
unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
are parked into socket backlog for a long time.
Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out of order queue with additional cpu
overhead, and also possible stalls in TX once windows are full.

Bidirectional flows are particularly hurt since the backlog can become
quite big if the copy from user space triggers IO (page faults)

Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.

Kernel should know better, right ?

Add a generic sk_flush_backlog() helper and use it right
before a new skb is allocated. Typically we put 64KB of payload
per skb (unless MSG_EOR is requested) and checking socket backlog
every 64KB gives good results.

As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and smaller
perceived rtt.

Note that sk_flush_backlog() maintains socket ownership,
so is not equivalent to a {release_sock(sk); lock_sock(sk);},
to ensure implicit atomicity rules that sendmsg() was
giving to (possibly buggy) applications.

In this simple implementation, I chose to not call tcp_release_cb(),
but we might consider this later.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: do not block BH while processing socket backlog

Socket backlog processing is a major latency source.

With current TCP socket sk_rcvbuf limits, I have sampled __release_sock()
holding cpu for more than 5 ms, and packets being dropped by the NIC
once ring buffer is filled.

All users are now ready to be called from process context,
we can unblock BH and let interrupts be serviced faster.

cond_resched_softirq() could be removed, as it has no more user.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: prepare for socket backlog behavior change

sctp_inq_push() will soon be called without BH being blocked
when generic socket code flushes the socket backlog.

It is very possible SCTP can be converted to not rely on BH,
but this needs to be done by SCTP experts.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: prepare for non BH masking at backlog processing

UDP uses the generic socket backlog code, and this will soon
be changed to not disable BH when protocol is called back.

We need to use appropriate SNMP accessors.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dccp: do not assume DCCP code is non preemptible

DCCP uses the generic backlog code, and this will soon
be changed to not disable BH when protocol is called back.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: do not block bh during prequeue processing

AFAIK, nothing in current TCP stack absolutely wants BH
being disabled once socket is owned by a thread running in
process context.

As mentioned in my prior patch ("tcp: give prequeue mode some care"),
processing a batch of packets might take time, better not block BH
at all.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: do not assume TCP code is non preemptible

We want to to make TCP stack preemptible, as draining prequeue
and backlog queues can take lot of time.

Many SNMP updates were assuming that BH (and preemption) was disabled.

Need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
and some __TCP_INC_STATS() to TCP_INC_STATS()

Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
and tcp_v4_send_ack(), we add an explicit preempt disabled section.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'xgene-channel-number'

Iyappan Subramanian says:

====================
drivers: net: xgene: fix: Get channel number from device binding

This patch set adds 'channel' property to get ethernet to CPU channel number,
thus decoupling the Linux driver from static resource selection.

v2: Address review comments from v1
- removed irq reference from Linux driver
- added 'channel' property to get ethernet to CPU channel number

v1:
- Initial version
====================

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dtb: xgene: Add channel property

Added 'channel' property, describing ethernet to CPU channel number.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Documentation: dtb: xgene: Add channel property

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: xgene: Get channel number from device binding

This patch gets ethernet to CPU channel (prefetch buffer number) from
the newly added 'channel' property, thus decoupling Linux driver from
resource management.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'qed-selftests'

Sudarsana Reddy Kalluru says:

====================
qed/qede: ethtool selftests support.

This series adds the driver support for following selftests:
1. Register test
2. Memory test
3. Clock test
4. Interrupt test
5. Internal loopback test
Patch (1) adds the qed driver infrastructure for selftests. Patches (2) and
(3) add qede driver support for ethtool selftests.

Please consider applying this series to "net-next".
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

qede: add implementation for internal loopback test.

This patch adds the qede implementation for internal loopback test.

Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: add support for selftests.

This patch adds the qede ethtool support for the following tests:
- interrupt test
- memory test
- register test
- clock test

Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: add infrastructure for device self tests.

This patch adds the functionality and APIs needed for selftests.
It adds the ability to configure the link-mode which is required for the
implementation of loopback tests. It adds the APIs for clock test,
register test, interrupt test and memory test.

Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: replace ds with ps where possible

The dsa_switch structure ds is actually needed in very few places,
mostly during setup of the switch. The private structure ps is however
needed nearly everywhere. Pass ps, not ds internally.

[vd: rebased Andrew's patch.]

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '40GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2016-05-01

This series contains updates to i40e and i40evf.

The theme of this series is code reduction, with several code cleanups in
this series.  Starting with Neerav's removal of the code that implemented
the HMC AQ APIs and calls, since they are now obsolete and not supported
by firmware.

Anjali changes the default of VFs to make sure they are not trusted or
privileged until its explicitly set for trust through the new NDO op
interface.  Also limited the number of MAC and VLAN addresses a VF can
add if it is untrusted/privileged.

Carolyn syncs the VF code for the changes made to the PF for the RSS
hash tuple settings, which ends up cleaning up much of the existing code.

Jesse cleans up compiler warnings which were found with gcc's W=2 option.
Then removed duplicate code, especially since only one copy was actually
being used.

Jacob addresses an issue which was found when testing GCC 6's which
happens to produce new warnings when you left shift a signed value
beyond the storage sizeof the type.  The converts i40e & i40evf to use
the BIT() macro more consistently.

Alex actually bucks the trend of code removal by adding support for
both drivers to use GSO_PARTIAL so that segmentation of frames with
checksums enabled in outer headers is supported.  Fortunately it does
not take much to add this support!
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: signal sk_data_ready earlier on data chunks reception

Dave Miller pointed out that fb586f25300f ("sctp: delay calls to
sk_data_ready() as much as possible") may insert latency specially if
the receiving application is running on another CPU and that it would be
better if we signalled as early as possible.

This patch thus basically inverts the logic on fb586f25300f and signals
it as early as possible, similar to what we had before.

Fixes: fb586f25300f ("sctp: delay calls to sk_data_ready() as much as possible")
Reported-by: Dave Miller <davem@davemloft.net>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mdio_bus: Fix MDIO bus scanning in __mdiobus_register()

Since commit b74766a0a0fe ("phylib: don't return NULL
from get_phy_device()") in linux-next, phy_get_device() will return
ERR_PTR(-ENODEV) instead of NULL if the PHY device ID is all ones.

This causes problem with stmmac driver and likely some other drivers
which call mdiobus_register(). I triggered this bug on SoCFPGA MCVEVK
board with linux-next 20160427 and 20160428. In case of the stmmac, if
there is no PHY node specified in the DT for the stmmac block, the stmmac
driver ( drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c function
stmmac_mdio_register() ) will call mdiobus_register() , which will
register the MDIO bus and probe for the PHY.

The mdiobus_register() resp. __mdiobus_register() iterates over all of
the addresses on the MDIO bus and calls mdiobus_scan() for each of them,
which invokes get_phy_device(). Before the aforementioned patch, the
mdiobus_scan() would return NULL if no PHY was found on a given address
and mdiobus_register() would continue and try the next PHY address. Now,
mdiobus_scan() returns ERR_PTR(-ENODEV), which is caught by the
'if (IS_ERR(phydev))' condition and the loop exits immediately if the
PHY address does not contain PHY.

Repair this by explicitly checking for the ERR_PTR(-ENODEV) and if this
error comes around, continue with the next PHY address.

Signed-off-by: Marek Vasut <marex@denx.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@opensource.altera.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

i40e/i40evf: Add support for GSO partial with UDP_TUNNEL_CSUM and GRE_CSUM

This patch makes it so that i40e and i40evf can use GSO_PARTIAL to support
segmentation for frames with checksums enabled in outer headers. As a
result we can now send data over these types of tunnels at over 20Gb/s
versus the 12Gb/s that was previously possible on my system.

The advantage with the i40e parts is that this offload is mostly
transparent as the hardware still deals with the inner and/or outer IPv4
headers so the IP ID is still incrementing for both when this offload is
performed.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40evf: make use of BIT() macro to avoid signed left shift

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: make use of BIT() macro to prevent left shift of signed values

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e/i40evf: fix I40E_MASK signed shift overflow warnings

GCC 6 has a new warning which will display when you attempt to left
shift a signed value beyond the storage size of the type. I40E_MASK
generates a mask value for 32bit registers. Properly typecast the mask
value and place the values in parenthesis to prevent macro expansion
issues.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e/i40evf : Bump driver version from 1.5.5 to 1.5.10

Signed-off-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Update device ids for X722

Add a device ID for X722.

Change-Id: I574f2345ab341de98a6a1c212d0603af853e48b0
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Drop extra copy of function

i40e_release_rx_desc was in two files, but was only used
and needed in txrx.c. Get rid of the extra copy.

Change-Id: I86e18239aa03531fc198b6c052847475084a9200
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Use consistent type for vf_id

The driver was all over the place using signed or unsigned types
for vf_id, when it should always be signed.

This fixes warnings of type unsafe comparisons from gcc with W=2.

Change-Id: I2cb681f83d0f68ca124d2e4131e4ac0d9f8a6b22
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: PTP - avoid aggregate return warnings

Aggregate return warnings are when struct types are returned
and must be copied to the lvalue with a struct copy by the compiler.

This fixes warnings of type aggregate-return from gcc with W=2.

Change-Id: I896b1bf514544bf0faeb458869d79914b9f1b168
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Fix uninitialized variable

We have an uninitialized variable warning for valid_len for one case in
validate_vf_mesg. To fix this, just initialize it to 0 at the top of the
function and remove all of the now redundant assignments to 0 in the
individual cases.

Change-Id: Iacbd97f4c521ed8d662eef803a598d8707708cfd
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40evf: RSS Hash Option parameters

This patch syncs the VF code for the changes made to the PF for the RSS
hash tuple settings. Since the VF still cannot change the RSS hash
settings, change the code to make this clear to the user. Previously,
the default settings were returned in this function. However, the
default can be changed by the PF so this does not make sense anymore.

Change-Id: I085eaf005fc7978b440d2a1bf2b2dd7cadaff39b
Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Remove HMC AQ API implementation

Remove the code that implements the HMC AQ APIs and call these APIs.
This is done because these are obsolete APIs and are not supported
by firmware.

Change-ID: I5d771d8f37c3e16e7b0a972ff9b27e75aa2d05d4
Signed-off-by: Neerav Parikh <neerav.parikh@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Prevent falling to promiscuous if the VF is not trusted

With this change a non trusted VF can never fall to promiscuous
mode when there is no room for a MAC/VLAN filter.

Change-Id: I8a155aa25c0bcdc6093414920c9ade4ee0bd20e8
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Limit the number of MAC and VLAN addresses that can be added for VFs

If the VF is privileged/trusted it can do as it may please including
but not limited to hogging resources and playing unfair.
But if the VF is not privileged/trusted it still can add some number
(8) of MAC and VLAN addresses.
Other restrictions with respect to Port VLAN and normal VLAN still apply
to not privileged/trusted VF.

Change-Id: I3a9529201b184c8873e1ad2e300aff468c9e6296
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

tipc: set 'active' state correctly for first established link

When we are displaying statistics for the first link established between
two peers, it will always be presented as STANDBY although it in reality
is ACTIVE.

This happens because we forget to set the 'active' flag in the link
instance at the moment it is established. Although this is a bug, it only
has impact on the presentation view of the link, not on its actual
functionality.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

of: of_mdio: Check if MDIO bus controller is available

Add a check whether the 'struct device_node' pointer passed to
of_mdiobus_register() is an available (aka enabled) node in the Device
Tree.

Rationale for doing this are cases where an Ethernet MAC provides a MDIO
bus controller and node, and an additional Ethernet MAC might be
connecting its PHY/switches to that first MDIO bus controller, while
still embedding one internally which is therefore marked as "disabled".

Instead of sprinkling checks like these in callers of
of_mdiobus_register(), do this in a central location.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

i40e: Change the default for VFs to be not privileged

Make sure a VF is not trusted/privileged until its explicitly
set for trust through the new NDO op interface.

Change-Id: I476385c290d2b4901d8fceb29de43546accdc499
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Merge branch 'mlx5-aRFS'

Saeed Mahameed says:

====================
Mellanox 100G mlx5 ethernet aRFS support

This series adds accelerated RFS support for the mlx5e driver.
I have added one patch non-related to aRFS that fixes the rtnl_lock
warning mlx5 driver been getting since b7aade15485a ('vxlan: break dependency with netdev drivers')

aRFS support in details:

A direct TIR per RQ is now required in order to have the essential building blocks
for aRFS.  Today the driver has one direct TIR that forwards traffic to RQ[0] (core 0),
and one indirect TIR for RSS indirection table.  For that we've added one direct TIR
per RQ, e.g.: TIR[i] -> RQ[i] (core i).

Publicize Modify flow rule destination and reveal it in flow steering API, to have the
ability to dynamically modify the destination TIR(core) for aRFS rules from the
ethernet driver.

Initializing CPU reverse mapping to notify upper layer on internal receive queue cpu
mappings.

Some design refactoring for mlx5e ethernet driver flow tables and flow steering API.
Now the caller of create_flow_table can choose the level of the flow table, this way
we will create the mlx5e flow tables in a reversed order and connect them as we go,
we create flow table[i+1] before flow table[i] to be able to set flow table[i + 1] as
a destination of flow table[i] once flow table[i] is created.
also we have split the main flow table in the following manner:
    - From before: RX packet had to visit two flow tables until it is delivered to its receive queue:
        RX packet -> vlan filter flow table -> main flow table.
        > vlan filter will check the packet vlan field is allowed.
        > main flow will check if the dest mac is allowed and will check the l3/l4 headers to
        retrieve the RSS hash for steering the packet into its final receive queue.

    - Now main flow table is split into l2 dst mac steering table and ttc (traffic type classifier) table:
        RX packet -> vlan filter -> l2 table -> ttc table
        > vlan filter - same as before
        > L2 filter - filter packets according their destination mac address
        > ttc table - classify packet headers for RSS steering
            - L3/L4 classification rules to steer the packet according to thier headers hash
            - in case of none of the rules applies the packet is steered to RQ[0]

After the above refactoring all left to-do is to create aRFS flow table which will manage
aRFS steering rules to forward traffic to the desired RQ (core) and just connect the ttc
table rules destinations to aRFS flow table.

aRFS flow table in case of a miss will deliver the traffic to the core where the original
ttc hash would have chosen.

TTC table is not initialized and enabled until the user explicitly asks to, i.e. setting the NETIF_F_NTUPLE
to ON.  This way there is no need for ttc table to forward traffic to aRFS table unless required.
When setting back to OFF aRFS flow table is disabled and disconnected.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Enabling aRFS mechanism

Accelerated RFS requires that ntuple filtering is enabled via
ethtool and driver supports ndo_rx_flow_steer.
When the ntuple filtering is enabled, we modify the l3_l4 ttc
rules to point on the aRFS flow tables and when the filtering
is disabled, we modify the l3_l4 ttc rules to point on the RSS
TIRs.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Add accelerated RFS support

Implement ndo_rx_flow_steer ndo.
A new flow steering rule will be composed from the
skb 4-tuple and added to the hardware aRFS flow table.

Each rule is stored in an internal hash table, if such
skb 4-tuple rule already exists we update the corresponding
hardware steering rule with the new destination.

For garbage collection rps_may_expire_flow will be
invoked for a limited amount of old rules upon any
ndo_rx_flow_steer invocation.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Create aRFS flow tables

Create the following four flow tables for aRFS usage:
1. IPv4 TCP - filtering 4-tuple of IPv4 TCP packets.
2. IPv6 TCP - filtering 4-tuple of IPv6 TCP packets.
3. IPv4 UDP - filtering 4-tuple of IPv4 UDP packets.
4. IPv6 UDP - filtering 4-tuple of IPv6 UDP packets.

Each flow table has two flow groups: one for the 4-tuple
filtering (full match) and the other contains * rule for miss rule.

Full match rule means a hit for aRFS and packet will be forwarded
to the dedicated RQ/Core, miss rule packets will be forwarded to
default RSS hashing.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Initializing CPU reverse mapping

Allocating CPU rmap and add entry for each IRQ.
CPU rmap is used in aRFS to get the RX queue number
of the RX completion interrupts.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Split the main flow steering table

Currently, the main flow table is used for two purposes:
One is to do mac filtering and the other is to classify
the packet l3-l4 header in order to steer the packet to
the right RSS TIR.

This design is very complex, for each configured mac address we
have to add eleven rules (rule for each traffic type), the same if the
device is put to promiscuous/allmulti mode.
This scheme isn't scalable for future features like aRFS.

In order to simplify it, the main flow table is split to two flow
tables:
1. l2 table - filter the packet dmac address, if there is a match
we forward to the ttc flow table.

2. TTC (Traffic Type Classifier) table - classify the traffic
type of the packet and steer the packet to the right TIR.

In this new design, when new mac address is added, the driver adds
only one flow rule instead of eleven.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Refactor mlx5e flow steering structs

Slightly refactor and re-order the flow steering structs,
tables and data-bases for better self-containment and
flexibility to add more future steering phases
(tables/rules/data bases) e.g: aRFS.

Changes:
1. Move the vlan DB and address DB into their table structs.
2. Rename steering table structs to unique format: mlx5e_*_table,
e.g: mlx5e_vlan_table.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Support different attributes for priorities in namespace

Currently, namespace could be initialized only
with priorities with the same attributes.
Add support to initialize namespace with priorities
with different attributes(e.g. different number of levels).

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Add user chosen levels when allocating flow tables

Currently, consumers of the flow steering infrastructure can't
choose their own flow table levels and are limited to one
flow table per level. This just waste levels.
Instead, we introduce here the possibility to use multiple
flow tables in a level. The user is free to connect these
flow tables, while following the rule (FTEs in FT of level x
could only point to FTs of level y where y > x).

In addition this patch switch the order of the create/destroy
flow tables of the NIC(vlan and main).

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Set number of allowed levels in priority

Refactors the flow steering namespace creation,
by changing the name num_fts to num_levels.
When new flow table is created, the driver assign new level
to this flow table therefore the meaning is equivalent.
Since downstream patches will introduce the ability to create more
than one flow table per level, the name num_fts is no
longer accurate.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Introduce modify flow rule destination

This API is used for modifying the flow rule destination.
This is needed for modifying the pointed flow table by the
traffic type classifier rules to point on the aRFS tables.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Direct TIR per RQ

Introduce new TIRs for direct access per RQ.
Now we have 2 available kinds of TIRs:
- indirect TIR per traffic type, each points to one RQT (RSS RQT)
same as before.
- New direct TIR per RQ, each points to RQT with a size of one
that forwards packets to that RQ only.

Driver will open max channels (num cores) direct TIRs by default,
they will be filled with the actual RQs once channels are allocated.

Needed for downstream aRFS and ethtool direct steering functionalities.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Call vxlan_get_rx_port() with rtnl lock

Hold the rtnl lock when calling vxlan_get_rx_port().

Fixes: b7aade15485a ("vxlan: break dependency with netdev drivers")
Signed-off-by: Matthew Finlay <matt@mellanox.com>
Reported-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'enc28j60-small-improvements'

Michael Heimpold says:

====================
net: ethernet: enc28j60: small improvements

This series of two patches adds the following improvements to the driver:

1) Rework the central SPI read function so that it is compatible with
   SPI masters which only support half duplex transfers.

2) Add a device tree binding for the driver.

Changelog:

v3: * renamed and improved binding documentation as
      suggested by Rob Herring

v2: * took care of Arnd Bergmann's review comments
      - allow to specify MAC address via DT
      - unconditionally define DT id table
    * increased the driver version minor number
    * driver author's email address bounces, removed from address list

v1: * Initial submission
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: enc28j60: add device tree support

The following patch adds the required match table for device tree support
(and while at, fix the indent). It's also possible to specify the
MAC address in the DT blob.

Also add the corresponding binding documentation file.

Signed-off-by: Michael Heimpold <mhei@heimpold.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: enc28j60: support half-duplex SPI controllers

The current spi_read_buf function fails on SPI host masters which
are only half-duplex capable. Splitting the Tx and Rx part solves
this issue.

Tested on Raspberry Pi (full duplex) and I2SE Duckbill (half duplex).

Signed-off-by: Michael Heimpold <mhei@heimpold.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: constify is_skb_forwardable's arguments

is_skb_forwardable is not supposed to change anything so constify its
arguments

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ppp-rtnetlink'

Guillaume Nault says:

====================
ppp: add rtnetlink support

PPP devices lack the ability to be customised at creation time. In
particular they can't be created in a given netns or with a particular
name. Moving or renaming the device after creation is possible, but
creates undesirable transient effects on servers where PPP devices are
constantly created and removed, as users connect and disconnect.
Implementing rtnetlink support solves this problem.

The rtnetlink handlers implemented in this series are minimal, and can
only replace the PPPIOCNEWUNIT ioctl. The rest of PPP ioctls remains
necessary for any other operation on channels and units.
It is perfectly possible to mix PPP devices created by rtnl
and by ioctl(PPPIOCNEWUNIT). Devices will behave in the same way.

mutex_trylock() is used to resolve the locking issue wrt. locking
dependency between rtnl_lock() and ppp_mutex (see ppp_nl_newlink() in
patch #2).

A user visible difference brought by this series is that old PPP
interfaces (those created with ioctl(PPPIOCNEWUNIT)), can now be
removed by "ip link del", just like new rtnl based PPP devices.

Changes since v3:
  - Rebase on net-next.
  - Not an RFC anymore.

Changes since v2:
  - Define ->rtnl_link_ops for ioctl based PPP devices, so they can
    handle rtnl messages just like rtnl based ones (suggested by
    Stephen Hemminger).
  - Move back to original lock ordering between ppp_mutex and rtnl_lock
    to simplify patch series. Handle lock inversion issue using
    mutex_trylock() (suggested by Stephen Hemminger).
  - Do file descriptor lookup directly in ppp_nl_newlink(), to simplify
    ppp_dev_configure().

Changes since v1:
  - Rebase on net-next.
  - Invert locking order wrt. ppp_mutex and rtnl_lock and protect
    file->private_data with ppp_mutex.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ppp: add rtnetlink device creation support

Define PPP device handler for use with rtnetlink.
The only PPP specific attribute is IFLA_PPP_DEV_FD. It is mandatory and
contains the file descriptor of the associated /dev/ppp instance (the
file descriptor which would have been used for ioctl(PPPIOCNEWUNIT) in
the ioctl-based API). The PPP device is removed when this file
descriptor is released (same behaviour as with ioctl based PPP
devices).

PPP devices created with the rtnetlink API behave like the ones created
with ioctl(PPPIOCNEWUNIT). In particular existing ioctls work the same
way, no matter how the PPP device was created.
The rtnl callbacks are also assigned to ioctl based PPP devices. This
way, rtnl messages have the same effect on any PPP devices.
The immediate effect is that all PPP devices, even ioctl-based
ones, can now be removed with "ip link del".

A minor difference still exists between ioctl and rtnl based PPP
interfaces: in the device name, the number following the "ppp" prefix
corresponds to the PPP unit number for ioctl based devices, while it is
just an unrelated incrementing index for rtnl ones.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

ppp: define reusable device creation functions

Move PPP device initialisation and registration out of
ppp_create_interface().
This prepares code for device registration with rtnetlink.

While there, simplify the prototype of ppp_create_interface():

  * Since ppp_dev_configure() takes care of setting file->private_data,
    there's no need to return a ppp structure to ppp_unattached_ioctl()
    anymore.

  * The unit parameter is made read/write so that ppp_create_interface()
    can tell which unit number has been assigned.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>