review.tizen.org Git - platform/kernel/linux-rpi.git/log

packet: free packet_rollover after synchronize_net

Destruction of the po->rollover must be delayed until there are no
more packets in flight that can access it. The field is destroyed in
packet_release, before synchronize_net. Delay using rcu.

Fixes: 0648ab70afe6 ("packet: rollover prepare: per-socket state")

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/pablo/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains a final Netfilter pull request for net-next
4.2. This mostly addresses some fallout from the previous pull request, small
netns updates and a couple of new features for nfnetlink_log and the socket
match that didn't get in time for the previous pull request. More specifically
they are:

1) Add security context information to nfnetlink_queue, from Roman Kubiak.

2) Add support to restore the sk_mark into skb->mark through xt_socket,
   from Harout Hedeshian.

3) Force alignment of 16 bytes of per cpu xt_counters, from Eric Dumazet.

4) Rename br_netfilter.c to br_netfilter_hooks.c to prepare split of IPv6 code
   into a separated file.

5) Move the IPv6 code in br_netfilter into a separated file.

6) Remove unused RCV_SKB_FAIL() in nfnetlink_queue and nfetlink_log, from Eric
   Biederman.

7) Two liner to simplify netns logic in em_ipset_match().

8) Add missing includes to net/net_namespace.h to avoid compilation problems
   that result from not including linux/netfilter.h in netns headers.

9) Use a forward declaration instead of including linux/proc_fs.h from
   netns/netfilter.h

10) Add a new linux/netfilter_defs.h to replace the linux/netfilter.h inclusion
    in netns headers.

11) Remove spurious netfilter.h file included in the net tree, also from Eric
    Biederman.

12) Fix x_tables compilation warnings on 32 bits platforms that resulted from
    recent changes in x_tables counters, from Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

netfilter: xtables: fix warnings on 32bit platforms

On 32bit archs gcc complains due to cast from void* to u64.
Add intermediate casts to long to silence these warnings.

include/linux/netfilter/x_tables.h:376:10: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
include/linux/netfilter/x_tables.h:384:15: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
include/linux/netfilter/x_tables.h:391:23: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
include/linux/netfilter/x_tables.h:400:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]

Fixes: 71ae0dff02d756e ("netfilter: xtables: use percpu rule counters")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: Remove spurios included of netfilter.h

While testing my netfilter changes I noticed several files where
recompiling unncessarily because they unncessarily included
netfilter.h.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: don't pull include/linux/netfilter.h from netns headers

This pulls the full hook netfilter definitions from all those that include
net_namespace.h.

Instead let's just include the bare minimum required in the new
linux/netfilter_defs.h file, and use it from the netfilter netns header files.

I also needed to include in.h and in6.h from linux/netfilter.h otherwise we hit
this compilation error:

In file included from include/linux/netfilter_defs.h:4:0,
                 from include/net/netns/netfilter.h:4,
                 from include/net/net_namespace.h:22,
                 from include/linux/netdevice.h:43,
                 from net/netfilter/nfnetlink_queue_core.c:23:
include/uapi/linux/netfilter.h:76:17: error: field ‘in’ has incomplete type struct in_addr in;

And also explicit include linux/netfilter.h in several spots.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

netfilter: use forward declaration instead of including linux/proc_fs.h

We don't need to pull the full definitions in that file, a simple forward
declaration is enough.

Moreover, include linux/procfs.h from nf_synproxy_core, otherwise this hits a
compilation error due to missing declarations, ie.

net/netfilter/nf_synproxy_core.c: In function ‘synproxy_proc_init’:
net/netfilter/nf_synproxy_core.c:326:2: error: implicit declaration of function ‘proc_create’ [-Werror=implicit-function-declaration]
if (!proc_create("synproxy", S_IRUGO, net->proc_net_stat,
^

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

net: include missing headers in net/net_namespace.h

Include linux/idr.h and linux/skbuff.h since they are required by objects that
are declared in the net structure.

struct net {
...
struct idr netns_ids;
...
struct sk_buff_head wext_nlevents;
...

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

net: sched: Simplify em_ipset_match

em->net is always set and always available, use it in preference
to dev_net(skb->dev).

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: Kill unused copies of RCV_SKB_FAIL

This appears to have been a dead macro in both nfnetlink_log.c and
nfnetlink_queue_core.c since these pieces of code were added in 2005.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: bridge: split ipv6 code into separated file

Resolve compilation breakage when CONFIG_IPV6 is not set by moving the IPv6
code into a separated br_netfilter_ipv6.c file.

Fixes: efb6de9b4ba0 ("netfilter: bridge: forward IPv6 fragmented packets")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: bridge: rename br_netfilter.c to br_netfilter_hooks.c

To prepare separation of the IPv6 code into different file.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: x_tables: align per cpu xt_counter

Let's force a 16 bytes alignment on xt_counter percpu allocations,
so that bytes and packets sit in same cache line.

xt_counter being exported to user space, we cannot add __align(16) on
the structure itself.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: xt_socket: add XT_SOCKET_RESTORESKMARK flag

xt_socket is useful for matching sockets with IP_TRANSPARENT and
taking some action on the matching packets. However, it lacks the
ability to match only a small subset of transparent sockets.

Suppose there are 2 applications, each with its own set of transparent
sockets. The first application wants all matching packets dropped,
while the second application wants them forwarded somewhere else.

Add the ability to retore the skb->mark from the sk_mark. The mark
is only restored if a matching socket is found and the transparent /
nowildcard conditions are satisfied.

Now the 2 hypothetical applications can differentiate their sockets
based on a mark value set with SO_MARK.

iptables -t mangle -I PREROUTING -m socket --transparent \
--restore-skmark -j action
iptables -t mangle -A action -m mark --mark 10 -j action2
iptables -t mangle -A action -m mark --mark 11 -j action3

Signed-off-by: Harout Hedeshian <harouth@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nfnetlink_queue: add security context information

This patch adds an additional attribute when sending
packet information via netlink in netfilter_queue module.
It will send additional security context data, so that
userspace applications can verify this context against
their own security databases.

Signed-off-by: Roman Kubiak <r.kubiak@samsung.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Merge branch 'master' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
This series contains updates to fm10k only.

Alex provides two fixes for the fm10k, first folds the fm10k_pull_tail()
call into fm10k_add_rx_frag(), this way the fragment does not have to be
modified after it is added to the skb.  The second fixes missing braces
to an if statement.

The remaining patches are from Jacob which contain improvements and fixes
for fm10k.  First fix makes it so that invalid address will simply be
skipped and allows synchronizing the full list to proceed with using
iproute2 tool.  Fixed a possible kernel panic by using the correct
transmit timestamp function.  Simplified the code flow for setting the
IN_PROGRESS bit of the shinfo for an skb that we will be timestamping.
Fix a bug in the timestamping transmit enqueue code responsible for a
NULL pointer dereference and invalid access of the skb list by freeing
the clone in the cases where we did not add it to the queue.  Update the
PF code so that it resets the empty TQMAP/RQMAP regirsters post-VFLR to
prevent innocent VF drivers from triggering malicious driver events.
The SYSTIME_CFG.Adjust direction bit is actually supposed to indicate
that the adjustment is positive, so fix the code to align correctly with
the hardware and documentation.  Cleanup local variable that is no longer
used after a previous refactor of the code.  Fix the code flow so that we
actually clear the enabled flag as part of our removal of the LPORT.

v2:
- updated patch 07 description based on feedback from Sergei Shtylyov
- updated patch 09 & 10 to use %d in error message based on feedback
   from Sergei Shtylyov
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: dwmac-rk: Don't add function name in info or err messages

These kind of informations are only useful for debugging and should not be
displayed in normal modules message.

Signed-off-by: Romain Perier <romain.perier@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fm10k: Fix missing braces after if statement

While reviewing the code I noticed that one of the commits added an if
statement followed by a for loop, but the if statement was missing the
braces around the loop. This change corrects the coding style error.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: fix iov_msg_lport_state_pf issue

When a VF issues an LPORT_STATE request to enable a port that is already
enabled, the PF will first disable the VF LPORT. Then it should
re-enable the VF again with the new requested settings. This ensures
that any switch rules are cleared by deleting the LPORT on the switch.
However, the flow is bugged because we actually check if the VF is
enabled at the end, and thus don't re-enable it. Fix the flow so that we
actually clear the enabled flags as part of our removal of the LPORT.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: remove err_no reference in fm10k_mbx.c

The reference to err_no was left around after a previous code refactor.
We never use the value, and it doesn't seem to be used in side a hidden
macro reference. Discovered via cppcheck.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: fix incorrect DIR_NEVATIVE bit in 1588 code

The SYSTIME_CFG.Adjust Direction bit is actually supposed to indicate
that the adjustment is positive. Fix the code to align correctly with
hardware and documentation.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: pack TLV overlay structures

This patch adds the __attribute__((packed)) indicator to some structures
which are overlayed onto a TLV message. These structures must be packed
as small as possible in order to correctly align when copied into the
mailbox buffer. Without doing so, the receiving mailbox code incorrectly
parses the values and we get invalid message responses from the switch
manager software.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: re-map all possible VF queues after a VFLR

During initialization, the VF counts its rings by walking the TQDLOC
registers. This works only if the TQMAP/RQMAP registers are set to map
all of the out-of-bound rings back to the first one. This allows the VF
to cleanly detect when it has run out of queues. Update the PF code so
that it resets the empty TQMAP/RQMAP registers post-VFLR to prevent
innocent VF drivers from triggering malicious driver events.

Signed-off-by: Matthew Vick <matthew.vick@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: force LPORT delete when updating VLAN or MAC address

Currently, we don't notify the switch at all when the PF
administratively sets a new VLAN or MAC address. This causes the old
addresses to remain valid on the switch table. Since the PF is
overriding any configuration done directly by the VF, we choose to
simply re-create the LPORT for the VF. This does mean that all rules for
the VF will be dropped when we set something directly via the PF, but it
prevents some weird issues where the MAC/VLAN table retains some stale
configuration.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: use dma_set_mask_and_coherent in fm10k_probe

This patch cleans up the use of dma_get_required_mask and uses the
simpler dma_set_mask_and_coherent function instead of doing these as
separate steps.

I removed the dma_get_required_mask call because based on some minimal
testing it appears that either (a) we're not doing the right thing with
the call or (b) we don't need it anyways. If the value returned is
<48bits, we'll end up trying with 48 bits anyways. If it's over 48bits,
fm10k can't support that anyways, and we should try 48bits. If 48bits
fails, we'll fallback to 32bits. This cleans up some very funky code.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: trivial fixup message style to include a colon

Also use %d for error values, since printing in hexadecimal is probably
not helpful.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: remove extraneous NULL check on l2_accel

l2_accel was checked for NULL at the top of fm10k_dfwd_del_station, and
we return if it is not defined. Due to this, we already know it can't be
null here so a separate check is meaningless. Discovered via cppcheck.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: use an unsigned int for i in ethtool_get_strings

The value will never be negative, and we use the %u print format. Thus,
use unsigned int for the loop counter. Issue found using cppcheck.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: add call to fm10k_clean_all_rx_rings in fm10k_down

This prevents a memory leak in fm10k_set_ringparams. The leak occurs
because we go down, change ring parameters, and then come up. However,
fm10k_down on its own is not clearing the Rx rings. Since fm10k_up
assumes the rings are clean we basically drop the buffers and leak a
bunch of memory. Eventually we hit dirty page faults and reboot the
system. This issue does not occur elsewhere because other flows that
involve fm10k_down go through fm10k_close which immediately called
fm10k_free_all_rx_resources which properly cleans the rings.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: fix incorrect free on skb in ts_tx_enqueue

This patch resolves a bug in the ts_tx_enqueue code responsible for a
NULL pointer dereference and invalid access of the skb list. We
incorrectly freed the actual skb we found instead of our copy. Thus the
skb queue is essentially invalidated. Resolve this by freeing our clone
in the cases where we did not add it to the queue. This also avoids the
skb memory leak caused by failure to free the clone.

[  589.719320] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  589.722344] IP: [<ffffffffa0310e60>] fm10k_ts_tx_subtask+0xb0/0x160 [fm10k]
[  589.723796] PGD 0
[  589.725228] Oops: 0000 [#1] SMP

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: move setting shinfo inside ts_tx_enqueue

This patch simplifies the code flow for setting the IN_PROGRESS bit of
the shinfo for an skb we will be timestamping.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: use correct ethernet driver Tx timestamp function

skb_complete_tx_timestamp is intended for use by PHY drivers which
implement a different method of returning timestamps. This method is
intended to be used after a PHY driver accepts a cloned packet via its
phy_driver.txtstamp function. It is not correct to use in the standard
ethernet driver such as fm10k. This patch fixes the following possible
kernel panic.

[ 2744.552896] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W  OE  3.19.3-200.fc21.x86_64 #1
[ 2744.552899] Hardware name: Intel Corporation S2600CO/S2600CO, BIOS SE5C600.86B.02.03.8x23.060520140825 06/05/2014
[ 2744.552901]  0000000000000000 2f4c8b10ea3f9848 ffff88081ee03a38 ffffffff8176e215
[ 2744.552906]  0000000000000000 0000000000000000 ffff88081ee03a78 ffffffff8109bc1a
[ 2744.552910]  ffff88081ee03c50 ffff88080e55fc00 ffff88080e55fc00 ffffffff81647c50
[ 2744.552914] Call Trace:
[ 2744.552917]  <IRQ>  [<ffffffff8176e215>] dump_stack+0x45/0x57
[ 2744.552931]  [<ffffffff8109bc1a>] warn_slowpath_common+0x8a/0xc0
[ 2744.552936]  [<ffffffff81647c50>] ? skb_queue_purge+0x20/0x40
[ 2744.552941]  [<ffffffff8109bd4a>] warn_slowpath_null+0x1a/0x20
[ 2744.552946]  [<ffffffff81646911>] skb_release_head_state+0xe1/0xf0
[ 2744.552950]  [<ffffffff81647b26>] skb_release_all+0x16/0x30
[ 2744.552954]  [<ffffffff81647ba6>] kfree_skb+0x36/0x90
[ 2744.552958]  [<ffffffff81647c50>] skb_queue_purge+0x20/0x40
[ 2744.552964]  [<ffffffff81751f8d>] packet_sock_destruct+0x1d/0x90
[ 2744.552968]  [<ffffffff81642053>] __sk_free+0x23/0x140
[ 2744.552973]  [<ffffffff81642189>] sk_free+0x19/0x20
[ 2744.552977]  [<ffffffff81647d60>] skb_complete_tx_timestamp+0x50/0x60
[ 2744.552988]  [<ffffffffa02eee40>] fm10k_ts_tx_hwtstamp+0xd0/0x100 [fm10k]
[ 2744.552994]  [<ffffffffa02e054e>] fm10k_1588_msg_pf+0x12e/0x140 [fm10k]
[ 2744.553002]  [<ffffffffa02edf1d>] fm10k_tlv_msg_parse+0x8d/0xc0 [fm10k]
[ 2744.553010]  [<ffffffffa02eb2d0>] fm10k_mbx_dequeue_rx+0x60/0xb0 [fm10k]
[ 2744.553016]  [<ffffffffa02ebf98>] fm10k_sm_mbx_process+0x178/0x3c0 [fm10k]
[ 2744.553022]  [<ffffffffa02e09ca>] fm10k_msix_mbx_pf+0xfa/0x360 [fm10k]
[ 2744.553030]  [<ffffffff811030a7>] ? get_next_timer_interrupt+0x1f7/0x270
[ 2744.553036]  [<ffffffff810f2a47>] handle_irq_event_percpu+0x77/0x1a0
[ 2744.553041]  [<ffffffff810f2bab>] handle_irq_event+0x3b/0x60
[ 2744.553045]  [<ffffffff810f5d6e>] handle_edge_irq+0x6e/0x120
[ 2744.553054]  [<ffffffff81017414>] handle_irq+0x74/0x140
[ 2744.553061]  [<ffffffff810bb54a>] ? atomic_notifier_call_chain+0x1a/0x20
[ 2744.553066]  [<ffffffff8177777f>] do_IRQ+0x4f/0xf0
[ 2744.553072]  [<ffffffff8177556d>] common_interrupt+0x6d/0x6d
[ 2744.553074]  <EOI>  [<ffffffff81609b16>] ? cpuidle_enter_state+0x66/0x160
[ 2744.553084]  [<ffffffff81609b01>] ? cpuidle_enter_state+0x51/0x160
[ 2744.553087]  [<ffffffff81609cf7>] cpuidle_enter+0x17/0x20
[ 2744.553092]  [<ffffffff810de101>] cpu_startup_entry+0x321/0x3c0
[ 2744.553098]  [<ffffffff81764497>] rest_init+0x77/0x80
[ 2744.553103]  [<ffffffff81d4f02c>] start_kernel+0x4a4/0x4c5
[ 2744.553107]  [<ffffffff81d4e120>] ? early_idt_handlers+0x120/0x120
[ 2744.553110]  [<ffffffff81d4e4d7>] x86_64_start_reservations+0x2a/0x2c
[ 2744.553114]  [<ffffffff81d4e62b>] x86_64_start_kernel+0x152/0x175

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: ignore invalid multicast address entries

This change fixes an issue with adding an invalid multicast address
using the iproute2 tool (ip maddr add <MADDR> dev <dev>). The iproute2
tool and the kernel do not validate or filter the multicast addresses
when adding them to the multicast list. Thus, when synchronizing this
list with an invalid entry, the action will be aborted with an error
since the fm10k driver currently validates the list. Consequently,
multicast entries beyond the invalid one will not be processed and
communicated with the switch via the mailbox. This change makes it so
that invalid addresses will simply be skipped and allows synchronizing
the full list to proceed.

Signed-off-by: Ngai-Mint Kwan <ngai-mint.kwan@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: fold fm10k_pull_tail into fm10k_add_rx_frag

This change folds the fm10k_pull_tail call into fm10k_add_rx_frag. The
advantage to doing this is that the fragment doesn't have to be modified
after it is added to the skb.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

bpf: disallow bpf tc programs access current->pid,uid

Accessing current->pid/uid from cls_bpf may lead to misleading results and
should not be used when TC classifiers need accurate information about pid/uid.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: mark state UNINIT after unregister

Without this change, modprobe -r sfc hits the BUG_ON() in
efx_pci_remove_main().

Fixes: e7fef9b45ae1 ("sfc: add sysfs entry to control MCDI tracing")
Reported-by: Jarod Wilson <jarod@redhat.com>
Reviewed-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'sock_diag_destruction_events'

Craig Gallek says:

====================
Socket destruction events via netlink sock_diag

This series extends the netlink sock_diag interface to broadcast
socket information as they are being destroyed. The current
interface is poll based and can not be used to retreive information
about sockets that are destroyed between poll intervals.

Only inet sockets are broadcast in this implementation, but other
families could easily be added as needed in the future.

If this patch set is accepted, a follow-up patch to the ss utility
in the iproute2 suite will also be submitted.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

sock_diag: implement a get_info handler for inet

This get_info handler will simply dispatch to the appropriate
existing inet protocol handler.

This patch also includes a new netlink attribute
(INET_DIAG_PROTOCOL).  This attribute is currently only used
for multicast messages.  Without this attribute, there is no
way of knowing the IP protocol used by the socket information
being broadcast.  This attribute is not necessary in the 'dump'
variant of this protocol (though it could easily be added)
because dump requests are issued for specific family/protocol
pairs.

Tested: ss -E (note, the -E option has not yet been merged into
the upstream version of ss).

Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sock_diag: specify info_size per inet protocol

Previously, there was no clear distinction between the inet protocols
that used struct tcp_info to report information and those that didn't.
This change adds a specific size attribute to the inet_diag_handler
struct which defines these interfaces. This will make dispatching
sock_diag get_info requests identical for all inet protocols in a
following patch.

Tested: ss -au
Tested: ss -at
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sock_diag: define destruction multicast groups

These groups will contain socket-destruction events for
AF_INET/AF_INET6, IPPROTO_TCP/IPPROTO_UDP.

Near the end of socket destruction, a check for listeners is
performed. In the presence of a listener, rather than completely
cleanup the socket, a unit of work will be added to a private
work queue which will first broadcast information about the socket
and then finish the cleanup operation.

Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlx4-vf-counters'

Or Gerlitz says:

====================
mlx4 driver update (+ new VF ndo)

This series from Eran and Hadar is further dealing with traffic
counters in the mlx4 driver, this time mostly around SRIOV.

We added a new ndo to read the VF counters through the PF netdev
netlink infrastructure plus mlx4 implementation for that ndo.

changes from V0:
  - applied feedback from John to use nested netlink encoding
    for the VF counters so we can extend it later
  - add handling of single ported VFs in the mlx4_en driver new ndo
  - avoid chopping the FW counters from 64 to 32 bits in mlx4_en PF flow
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_en: Support ndo_get_vf_stats

Implement the ndo to gather VF statistics through the PF.

All counters related to this VF are stored in a per slave
list, run over the slave's list and collect all statistics.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/core: Add reading VF statistics through the PF netdevice

Add ndo_get_vf_stats where the PF retrieves and fills the VFs traffic
statistics. We encode the VF stats in a nested manner to allow for
future extensions.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_en: Show PF own statistics via ethtool

Allow the user to observe the PF own statistics using ethtool with pf_
prefixed counter names.

Those counters are the PF statistics out of the overall port statistics.
Every PF QP is attached to a counter and the summary of those counters
is the PF statistics.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_core: Add helper to query counters

This is an infrastructure step for querying VF and PF counters.

This code was in the IB driver, move it to the mlx4 core driver
so it will be accessible for more use cases.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

IB/mlx4: Set VF to read from QP counters

As IB VFs are not capable to read the port counters through MADs,
move there to read their own QP counters to gather statistics.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

IB/mlx4: Add RoCE/IB dedicated counters

This is an infrastructure step to attach all the QPs opened from the
IB driver to a counter in order to collect VF stats from the PF using
those counters.

If the port's type is Ethernet, the counter policy demands two counters
per port (one for RoCE and one for Ethernet). The port default counter
(allocated in mlx4_core) is used for the Ethernet netdev QPs and we
allocate another counter for RoCE.

If the port's traffic is Infiniband, the counter policy demands
one counter per port, so it can use the port's default counter.

Also, Add 'allocated' flag for each counter in order to clean it at
unload.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_core: Allocate default counter per port

Default counter per port will be allocated at the mlx4 core driver load.

Every QP opened by the Ethernet driver will be attached to the port's default
counter. This is an infrastructure step to collect VF statistics from the PF.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_core: Add port attribute when tracking counters

Counter will get its port attribute within the resource tracker when
the first QP attached to it is modified to RTR. If a QP is counter-less,
an attempt to create a new counter with assigned port will be made.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_core: Adjust counter grant policy in the resource tracker

Each physical function has a guarantee of two counters per port, one
for a default counter and one for the IB driver.

Each virtual function has a guarantee of one counter per port.
All other counters are free and can be obtained on demand.

This is a preparation step for supporting a get_vf_stats ndo call,
so we can promise a counter for every VF in order to collect their
statistics from the PF context.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_core: Remove counters table allocation from VF flow

Since virtual functions get their counters indices allocation from the PF,
allocate counters indices bitmap only in case the function isn't virtual.

Also, check that the device has counters to allocate before creating the
indices bitmap table.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_core: Add sink counter

Reserve the last valid counter index for "sink" counter, when a
new counter cannot be allocated, the driver will use this counter.

In order to avoid allocating this counter on any other flow, fix the
indices bitmap allocation range, and reserve the sink counter index.

Add macro for the sink counter index and replace all appearences of the
index with the macro.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_core: Reset counters data when freed

Add resetting the counter data to the free counter flow, so the counter's
data won't be accessible anymore if querying the counter. Also, on next
counter allocation (to another VM for example), it will be fresh and clear.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_core: Check before cleaning counters bitmap

If counters are not supported by the device. The indices bitmap table is not
allocated during initialization. Add the symmetrical check before cleaning
the counters bitmap table or freeing a counter.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: del external_learned fdbs from device on flush or ageout

We need to delete from offload the device externally learnded fdbs when any
one of these events happen:

1) Bridge ages out fdb. (When bridge is doing ageing vs. device doing
ageing. If device is doing ageing, it would send SWITCHDEV_FDB_DEL
directly).

2) STP state change flushes fdbs on port.

3) User uses sysfs interface to flush fdbs from bridge or bridge port:

echo 1 >/sys/class/net/BR_DEV/bridge/flush
echo 1 >/sys/class/net/BR_PORT/brport/flush

4) Offload driver send event SWITCHDEV_FDB_DEL to delete fdb entry.

For rocker, we can now get called to delete fdb entry in wait and nowait
contexts, so set NOWAIT flag when deleting fdb entry.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'nfc-next-4.2-1' of git://git./linux/kernel/git/sameo/nfc-next

Samuel Ortiz says:

====================
NFC 4.2 pull request

This is the NFC pull request for 4.2.

- NCI drivers can now define their own handlers for processing
  proprietary NCI responses and notifications.

- NFC vendors can use a dedicated netlink API to send their own
  proprietary commands, like e.g. all commands needed to implement
  vendor specific manufacturing tools.

- A new generic NCI over UART driver against which any NCI chipset
  running on top of a serial interface can register.

- The st21nfcb driver is renamed to st-nci as it can and will support
  most of ST Microelectronics NCI chipsets.

- The st21nfcb driver can put its CLF in hibernate mode and save
  significant amount of power.

- A few st21nfcb minor fixes.

- The NXP NCI driver now supports ACPI enumeration.

- The Marvell NCI driver now supports both USB and serial
  physical interfaces.

- The Marvell NCI drivers also supports NCI frames being muxed
  over HCI. This is a setting that can be defined by a DT property.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bond-netlink-3ad-attrs'

Nikolay Aleksandrov says:

====================
bonding: extend the 3ad exported attributes

These are two small patches that export actor_oper_port_state and
partner_oper_port_state via netlink and sysfs, until now they were only
exported via bond's proc entry. If this set gets accepted I have an iproute2
patch prepared that will export them with which I tested these changes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bonding: export slave's partner_oper_port_state via sysfs and netlink

Export the partner_oper_port_state of each port via sysfs and netlink.
In 802.3ad mode it is valuable for the user to be able to check the
partner_oper state, it is already exported via bond's proc entry.

Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bonding: export slave's actor_oper_port_state via sysfs and netlink

Export the actor_oper_port_state of each port via sysfs and netlink.
In 802.3ad mode it is valuable for the user to be able to check the
actor_oper state, it is already exported via bond's proc entry.

Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'rocker-no-wait'

Scott Feldman says:

====================
rocker: revert back to support for nowait processes

One of the items removed from the rocker driver in the Spring Cleanup patch
series was the ability to mark processing in the driver as "no wait" for
those contexts where we cannot sleep.  Turns out, we have "no wait"
contexts where we want to program the device and we don't want to defer the
processing to a process context.  So re-add the ROCKER_OP_FLAG_NOWAIT flag
to mark such processes, and propagate flags to mem allocator and to the
device cmd executor.  With NOWAIT, mem allocs are GFP_ATOMIC and device
cmds are queued to the device, but the driver will not wait (sleep) for the
response back from the device.

My bad for removing NOWAIT support in the first place; I thought we could
swing non-sleep contexts to process context using a work queue, for
example, but there is push-back to keep processing in original context.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

rocker: move port stop to 'no wait' processing

rocker_port_stop can be called from atomic and non-atomic contexts. Since
we can't test what context we're getting called in, do the processing as
'no wait', which will cover all cases.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rocker: move MAC learn event back to 'no wait' processing

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rocker: mark STP update as 'no wait' processing

We can get STP updates from the bridge driver in atomic and non-atomic
contexts. Since we can't test what context we're getting called in,
do the STP processing as 'no wait', which will cover all cases.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rocker: mark neigh update event processing as 'no wait'

Neigh update event handler runs in a context where we can't sleep, so mark
processing in driver with ROCKER_OP_FLAG_NOWAIT. NOWAIT will use
GFP_ATOMIC for allocations and will queue cmds to the device's cmd ring but
will not wait (sleep) for cmd response back from device.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rocker: revert back to support for nowait processes

One of the items removed from the rocker driver in the Spring Cleanup patch
series was the ability to mark processing in the driver as "no wait" for
those contexts where we cannot sleep.  Turns out, we have "no wait"
contexts where we want to program the device.  So re-add the
ROCKER_OP_FLAG_NOWAIT flag to mark such processes, and propagate flags to
mem allocator and to the device cmd executor.  With NOWAIT, mem allocs are
GFP_ATOMIC and device cmds are queued to the device, but the driver will
not wait (sleep) for the response back from the device.

My bad for removing NOWAIT support in the first place; I thought we could
swing non-sleep contexts to process context using a work queue, for
example, but there is push-back to keep processing in original context.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rocker: fix neigh tbl index increment race

rocker->neigh_tbl_next_index is used to generate unique indices for neigh
entries programmed into the device.  The way new indices were generated was
racy with the new prepare-commit transaction model.  A simple fix here
removes the race.  The race was with two processes getting the same index,
one process using prepare-commit, the other not:

Proc A Proc B

PREPARE phase
get neigh_tbl_next_index

NONE phase
get neigh_tbl_next_index
neigh_tbl_next_index++

COMMIT phase
neigh_tbl_next_index++

Both A and B got the same index.  The fix is to store and increment
neigh_tbl_next_index in the PREPARE (or NONE) phase and use value in COMMIT
phase:

Proc A Proc B

PREPARE phase
get neigh_tbl_next_index
neigh_tbl_next_index++

NONE phase
get neigh_tbl_next_index
neigh_tbl_next_index++

COMMIT phase
// use value stashed in PREPARE phase

Reported-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rocker: gaurd against NULL rocker_port when removing ports

The ports array is filled in as ports are probed, but if probing doesn't
finish, we need to stop only those ports that where probed successfully.
Check the ports array for NULL to skip un-probed ports when stopping.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: make u64_stats_init() a function

Using a function instead of a macro is cleaner and remove
following W=1 warnings (extract)

In file included from net/ipv6/ip6_vti.c:29:0:
net/ipv6/ip6_vti.c: In function ‘vti6_dev_init_gen’:
include/linux/netdevice.h:2029:18: warning: variable ‘stat’ set but not
used [-Wunused-but-set-variable]
    typeof(type) *stat;   \
                  ^
net/ipv6/ip6_vti.c:862:16: note: in expansion of macro
‘netdev_alloc_pcpu_stats’
  dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
                ^
  CC [M]  net/ipv6/sit.o
In file included from net/ipv6/sit.c:30:0:
net/ipv6/sit.c: In function ‘ipip6_tunnel_init’:
include/linux/netdevice.h:2029:18: warning: variable ‘stat’ set but not
used [-Wunused-but-set-variable]
    typeof(type) *stat;   \
                  ^

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: use either ndo VLAN ops or switchdev VLAN ops to install MASTER vlans

v2:

Move struct switchdev_obj automatics to inner scope where there used.

v1:

To maintain backward compatibility with the existing iproute2 "bridge vlan"
command, let bridge's setlink/dellink handler call into either the port
driver's 8021q ndo ops or the port driver's bridge_setlink/dellink ops.

This allows port driver to choose 8021q ops or the newer
bridge_setlink/dellink ops when implementing VLAN add/del filtering on the
device.  The iproute "bridge vlan" command does not need to be modified.

To summarize using the "bridge vlan" command examples, we have:

1) bridge vlan add|del vid VID dev DEV

Here iproute2 sets MASTER flag.  Bridge's bridge_setlink/dellink is called.
Vlan is set on bridge for port.  If port driver implements ndo 8021q ops,
call those to port driver can install vlan filter on device.  Otherwise, if
port driver implements bridge_setlink/dellink ops, call those to install
vlan filter to device.  This option only works if port is bridged.

2) bridge vlan add|del vid VID dev DEV master

Same as 1)

3) bridge vlan add|del vid VID dev DEV self

Bridge's bridge_setlink/dellink isn't called.  Port driver's
bridge_setlink/dellink is called, if implemented.  This option works if
port is bridged or not.  If port is not bridged, a VLAN can still be
added/deleted to device filter using this variant.

4) bridge vlan add|del vid VID dev DEV master self

This is a combination of 1) and 3), but will only work if port is bridged.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bpf-share-helpers'

Alexei Starovoitov says:

====================
v1->v2: switched to init_user_ns from current_user_ns as suggested by Andy

Introduce new helpers to access 'struct task_struct'->pid, tgid, uid, gid, comm
fields in tracing and networking.

Share bpf_trace_printk() and bpf_get_smp_processor_id() helpers between
tracing and networking.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: let kprobe programs use bpf_get_smp_processor_id() helper

It's useful to do per-cpu histograms.

Suggested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: allow networking programs to use bpf_trace_printk() for debugging

bpf_trace_printk() is a helper function used to debug eBPF programs.
Let socket and TC programs use it as well.
Note, it's DEBUG ONLY helper. If it's used in the program,
the kernel will print warning banner to make sure users don't use
it in production.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: introduce current->pid, tgid, uid, gid, comm accessors

eBPF programs attached to kprobes need to filter based on
current->pid, uid and other fields, so introduce helper functions:

u64 bpf_get_current_pid_tgid(void)
Return: current->tgid << 32 | current->pid

u64 bpf_get_current_uid_gid(void)
Return: current_gid << 32 | current_uid

bpf_get_current_comm(char *buf, int size_of_buf)
stores current->comm into buf

They can be used from the programs attached to TC as well to classify packets
based on current task fields.

Update tracex2 example to print histogram of write syscalls for each process
instead of aggregated for all.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/pablo/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

This a bit large (and late) patchset that contains Netfilter updates for
net-next. Most relevantly br_netfilter fixes, ipset RCU support, removal of
x_tables percpu ruleset copy and rework of the nf_tables netdev support. More
specifically, they are:

1) Warn the user when there is a better protocol conntracker available, from
   Marcelo Ricardo Leitner.

2) Fix forwarding of IPv6 fragmented traffic in br_netfilter, from Bernhard
   Thaler. This comes with several patches to prepare the change in first place.

3) Get rid of special mtu handling of PPPoE/VLAN frames for br_netfilter. This
   is not needed anymore since now we use the largest fragment size to
   refragment, from Florian Westphal.

4) Restore vlan tag when refragmenting in br_netfilter, also from Florian.

5) Get rid of the percpu ruleset copy in x_tables, from Florian. Plus another
   follow up patch to refine it from Eric Dumazet.

6) Several ipset cleanups, fixes and finally RCU support, from Jozsef Kadlecsik.

7) Get rid of parens in Netfilter Kconfig files.

8) Attach the net_device to the basechain as opposed to the initial per table
   approach in the nf_tables netdev family.

9) Subscribe to netdev events to detect the removal and registration of a
   device that is referenced by a basechain.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

netfilter: nf_tables_netdev: unregister hooks on net_device removal

In case the net_device is gone, we have to unregister the hooks and put back
the reference on the net_device object. Once it comes back, register them
again. This also covers the device rename case.

This patch also adds a new flag to indicate that the basechain is disabled, so
their hooks are not registered. This flag is used by the netdev family to
handle the case where the net_device object is gone. Currently this flag is not
exposed to userspace.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add nft_register_basechain() and nft_unregister_basechain()

This wrapper functions take care of hook registration for basechains.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: attach net_device to basechain

The device is part of the hook configuration, so instead of a global
configuration per table, set it to each of the basechain that we create.

This patch reworks ebddf1a8d78a ("netfilter: nf_tables: allow to bind table to
net_device").

Note that this adds a dev_name field in the nft_base_chain structure which is
required the netdev notification subscription that follows up in a patch to
handle gone net_devices.

Suggested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: x_tables: remove XT_TABLE_INFO_SZ and a dereference.

After Florian patches, there is no need for XT_TABLE_INFO_SZ anymore :
Only one copy of table is kept, instead of one copy per cpu.

We also can avoid a dereference if we put table data right after
xt_table_info. It reduces register pressure and helps compiler.

Then, we attempt a kmalloc() if total size is under order-3 allocation,
to reduce TLB pressure, as in many cases, rules fit in 32 KB.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Merge branch 'master' of git://blackhole.kfki.hu/nf-next

Jozsef Kadlecsik says:

====================
ipset patches for nf-next

Please consider to apply the next bunch of patches for ipset. First
comes the small changes, then the bugfixes and at the end the RCU
related patches.

* Use MSEC_PER_SEC consistently instead of the number.
* Use SET_WITH_*() helpers to test set extensions from Sergey Popovich.
* Check extensions attributes before getting extensions from Sergey Popovich.
* Permit CIDR equal to the host address CIDR in IPv6 from Sergey Popovich.
* Make sure we always return line number on batch in the case of error
  from Sergey Popovich.
* Check CIDR value only when attribute is given from Sergey Popovich.
* Fix cidr handling for hash:*net* types, reported by Jonathan Johnson.
* Fix parallel resizing and listing of the same set so that the original
  set is kept for the whole dumping.
* Make sure listing doesn't grab a set which is just being destroyed.
* Remove rbtree from ip_set_hash_netiface.c in order to introduce RCU.
* Replace rwlock_t with spinlock_t in "struct ip_set", change the locking
  in the core and simplifications in the timeout routines.
* Introduce RCU locking in bitmap:* types with a slight modification in the
  logic on how an element is added.
* Introduce RCU locking in hash:* types. This is the most complex part of
  the changes.
* Introduce RCU locking in list type where standard rculist is used.
* Fix coding styles reported by checkpatch.pl.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: Kconfig: get rid of parens around depends on

According to the reporter, they are not needed.

Reported-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

tcp: cdg: use div_u64()

Fixes cross-compile to mips.

Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: David S. Miller <davem@davemloft.net>

netfilter: ipset: Fix coding styles reported by checkpatch.pl

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ipset: Introduce RCU locking in list type

Standard rculist is used.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ipset: Introduce RCU locking in hash:* types

Three types of data need to be protected in the case of the hash types:

a. The hash buckets: standard rcu pointer operations are used.
b. The element blobs in the hash buckets are stored in an array and
   a bitmap is used for book-keeping to tell which elements in the array
   are used or free.
c. Networks per cidr values and the cidr values themselves are stored
   in fix sized arrays and need no protection. The values are modified
   in such an order that in the worst case an element testing is repeated
   once with the same cidr value.

The ipset hash approach uses arrays instead of lists and therefore is
incompatible with rhashtable.

Performance is tested by Jesper Dangaard Brouer:

Simple drop in FORWARD
~~~~~~~~~~~~~~~~~~~~~~

Dropping via simple iptables net-mask match::

iptables -t raw -N simple || iptables -t raw -F simple
iptables -t raw -I simple  -s 198.18.0.0/15 -j DROP
iptables -t raw -D PREROUTING -j simple
iptables -t raw -I PREROUTING -j simple

Drop performance in "raw": 11.3Mpps

Generator: sending 12.2Mpps (tx:12264083 pps)

Drop via original ipset in RAW table
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Create a set with lots of elements::

sudo ./ipset destroy test
echo "create test hash:ip hashsize 65536" > test.set
for x in `seq 0 255`; do
    for y in `seq 0 255`; do
        echo "add test 198.18.$x.$y" >> test.set
    done
done
sudo ./ipset restore < test.set

Dropping via ipset::

iptables -t raw -F
iptables -t raw -N net198 || iptables -t raw -F net198
iptables -t raw -I net198 -m set --match-set test src -j DROP
iptables -t raw -I PREROUTING -j net198

Drop performance in "raw" with ipset: 8Mpps

Perf report numbers ipset drop in "raw"::

+   24.65%  ksoftirqd/1  [ip_set]           [k] ip_set_test
-   21.42%  ksoftirqd/1  [kernel.kallsyms]  [k] _raw_read_lock_bh
    - _raw_read_lock_bh
       + 99.88% ip_set_test
-   19.42%  ksoftirqd/1  [kernel.kallsyms]  [k] _raw_read_unlock_bh
    - _raw_read_unlock_bh
       + 99.72% ip_set_test
+    4.31%  ksoftirqd/1  [ip_set_hash_ip]   [k] hash_ip4_kadt
+    2.27%  ksoftirqd/1  [ixgbe]            [k] ixgbe_fetch_rx_buffer
+    2.18%  ksoftirqd/1  [ip_tables]        [k] ipt_do_table
+    1.81%  ksoftirqd/1  [ip_set_hash_ip]   [k] hash_ip4_test
+    1.61%  ksoftirqd/1  [kernel.kallsyms]  [k] __netif_receive_skb_core
+    1.44%  ksoftirqd/1  [kernel.kallsyms]  [k] build_skb
+    1.42%  ksoftirqd/1  [kernel.kallsyms]  [k] ip_rcv
+    1.36%  ksoftirqd/1  [kernel.kallsyms]  [k] __local_bh_enable_ip
+    1.16%  ksoftirqd/1  [kernel.kallsyms]  [k] dev_gro_receive
+    1.09%  ksoftirqd/1  [kernel.kallsyms]  [k] __rcu_read_unlock
+    0.96%  ksoftirqd/1  [ixgbe]            [k] ixgbe_clean_rx_irq
+    0.95%  ksoftirqd/1  [kernel.kallsyms]  [k] __netdev_alloc_frag
+    0.88%  ksoftirqd/1  [kernel.kallsyms]  [k] kmem_cache_alloc
+    0.87%  ksoftirqd/1  [xt_set]           [k] set_match_v3
+    0.85%  ksoftirqd/1  [kernel.kallsyms]  [k] inet_gro_receive
+    0.83%  ksoftirqd/1  [kernel.kallsyms]  [k] nf_iterate
+    0.76%  ksoftirqd/1  [kernel.kallsyms]  [k] put_compound_page
+    0.75%  ksoftirqd/1  [kernel.kallsyms]  [k] __rcu_read_lock

Drop via ipset in RAW table with RCU-locking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With RCU locking, the RW-lock is gone.

Drop performance in "raw" with ipset with RCU-locking: 11.3Mpps

Performance-tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ipset: Introduce RCU locking in bitmap:* types

There's nothing much required because the bitmap types use atomic
bit operations. However the logic of adding elements slightly changed:
first the MAC address updated (which is not atomic), then the element
activated (added). The extensions may call kfree_rcu() therefore we
call rcu_barrier() at module removal.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ipset: Prepare the ipset core to use RCU at set level

Replace rwlock_t with spinlock_t in "struct ip_set" and change the locking
accordingly. Convert the comment extension into an rcu-avare object. Also,
simplify the timeout routines.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter:ipset Remove rbtree from hash:net,iface

Remove rbtree in order to introduce RCU instead of rwlock in ipset

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ipset: Make sure listing doesn't grab a set which is just being destroyed.

There was a small window when all sets are destroyed and a concurrent
listing of all sets could grab a set which is just being destroyed.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ipset: Fix parallel resizing and listing of the same set

When elements added to a hash:* type of set and resizing triggered,
parallel listing could start to list the original set (before resizing)
and "continue" with listing the new set. Fix it by references and
using the original hash table for listing. Therefore the destroying of
the original hash table may happen from the resizing or listing functions.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ipset: Fix cidr handling for hash:*net* types

Commit "Simplify cidr handling for hash:*net* types" broke the cidr
handling for the hash:*net* types when the sets were used by the SET
target: entries with invalid cidr values were added to the sets.
Reported by Jonathan Johnson.

Testsuite entry is added to verify the fix.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ipset: Check CIDR value only when attribute is given

There is no reason to check CIDR value regardless attribute
specifying CIDR is given.

Initialize cidr array in element structure on element structure
declaration to let more freedom to the compiler to optimize
initialization right before element structure is used.

Remove local variables cidr and cidr2 for netnet and netportnet
hashes as we do not use packed cidr value for such set types and
can store value directly in e.cidr[].

Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ipset: Make sure we always return line number on batch

Even if we return with generic IPSET_ERR_PROTOCOL it is good idea
to return line number if we called in batch mode.

Moreover we are not always exiting with IPSET_ERR_PROTOCOL. For
example hash:ip,port,net may return IPSET_ERR_HASH_RANGE_UNSUPPORTED
or IPSET_ERR_INVALID_CIDR.

Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ipset: Permit CIDR equal to the host address CIDR in IPv6

Permit userspace to supply CIDR length equal to the host address CIDR
length in netlink message. Prohibit any other CIDR length for IPv6
variant of the set.

Also return -IPSET_ERR_HASH_RANGE_UNSUPPORTED instead of generic
-IPSET_ERR_PROTOCOL in IPv6 variant of hash:ip,port,net when
IPSET_ATTR_IP_TO attribute is given.

Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ipset: Check extensions attributes before getting extensions.

Make all extensions attributes checks within ip_set_get_extensions()
and reduce number of duplicated code.

Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ipset: Use SET_WITH_*() helpers to test set extensions

Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ipset: Use MSEC_PER_SEC consistently

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

Merge git://git./linux/kernel/git/davem/net

Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Fix uninitialized struct station_info in cfg80211_wireless_stats(),
    from Johannes Berg.

2) Revert commit attempt to fix ipv6 protocol resubmission, it adds
    regressions.

3) Endless loops can be created in bridge port lists, fix from Nikolay
    Aleksandrov.

4) Don't WARN_ON() if sk->sk_forward_alloc is non-zero in
    sk_clear_memalloc, it is a legal situation during swap deactivation.
    Fix from Mel Gorman.

5) Fix order of disabling interrupts and unlocking NAPI in enic driver
    to avoid a race.  From Govindarajulu Varadarajan.

6) High and low register writes are swapped when programming the start
    of periodic output in igb driver.  From Richard Cochran.

7) Fix device rename handling in mpls stack, from Robert Shearman.

8) Do not trigger compaction synchronously when optimistically trying
    to allocate an order 3 page in alloc_skb_with_frags() and
    skb_page_frag_refill().  From Shaohua Li.

9) Authentication with COOKIE_ECHO is not handled properly in SCTP, fix
    from Marcelo Ricardo Leitner.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  Doc: networking: Fix URL for wiki.wireshark.org in udplite.txt
  sctp: allow authenticating DATA chunks that are bundled with COOKIE_ECHO
  net: don't wait for order-3 page allocation
  mpls: handle device renames for per-device sysctls
  net: igb: fix the start time for periodic output signals
  enic: fix memory leak in rq_clean
  enic: check return value for stat dump
  enic: unlock napi busy poll before unmasking intr
  net, swap: Remove a warning and clarify why sk_mem_reclaim is required when deactivating swap
  bridge: fix multicast router rlist endless loop
  tipc: disconnect socket directly after probe failure
  Revert "ipv6: Fix protocol resubmission"
  cfg80211: wext: clear sinfo struct before calling driver

tcp: tcp_v6_connect() cleanup

Remove dead code from tcp_v6_connect()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

flow_dissector: fix ipv6 dst, hop-by-hop and routing ext hdrs

__skb_header_pointer() returns a pointer that must be checked.

Fixes infinite loop reported by Alexei, and add __must_check to
catch these errors earlier.

Fixes: 6a74fcf426f5 ("flow_dissector: add support for dst, hop-by-hop and routing ext hdrs")
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Fix Cavium Liquidio build related errors and warnings

1) Fixed following sparse warnings:
    lio_main.c:213:6: warning: symbol 'octeon_droq_bh' was not
    declared. Should it be static?
    lio_main.c:233:5: warning: symbol 'lio_wait_for_oq_pkts' was
    not declared. Should it be static?
    lio_main.c:3083:5: warning: symbol 'lio_nic_info' was not
    declared. Should it be static?
    lio_main.c:2618:16: warning: cast from restricted __be16
    octeon_device.c:466:6: warning: symbol 'oct_set_config_info'
    was not declared. Should it be static?
    octeon_device.c:573:25: warning: cast to restricted __be32
    octeon_device.c:582:29: warning: cast to restricted __be32
    octeon_device.c:584:39: warning: cast to restricted __be32
    octeon_device.c:594:13: warning: cast to restricted __be32
    octeon_device.c:596:25: warning: cast to restricted __be32
    octeon_device.c:613:25: warning: cast to restricted __be32
    octeon_device.c:614:29: warning: cast to restricted __be64
    octeon_device.c:615:29: warning: cast to restricted __be32
    octeon_device.c:619:37: warning: cast to restricted __be32
    octeon_device.c:623:33: warning: cast to restricted __be32
    cn66xx_device.c:540:6: warning: symbol
    'lio_cn6xxx_get_pcie_qlmport' was not declared. Should it be s
    octeon_mem_ops.c:181:16: warning: cast to restricted __be64
    octeon_mem_ops.c:190:16: warning: cast to restricted __be32
    octeon_mem_ops.c:196:17: warning: incorrect type in initializer
2) Fix build errors corresponding to vmalloc on linux-next 4.1.
3) Liquidio now supports 64 bit only, modified Kconfig accordingly.
4) Fix some code alignment issues based on kernel build warnings.

Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>