review.tizen.org Git - platform/kernel/linux-exynos.git/log

projects / platform / kernel / linux-exynos.git / log

David S. Miller [Thu, 6 Nov 2014 21:33:13 +0000 (16:33 -0500)]

Merge branch 'net_next_ovs' of git://git./linux/kernel/git/pshelar/openvswitch

Pravin B Shelar says:

====================
Open vSwitch

First two patches are related to OVS MPLS support. Rest of patches
are mostly refactoring and minor improvements to openvswitch.

v1-v2:
- Fix conflicts due to "gue: Remote checksum offload"
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Thu, 6 Nov 2014 20:16:44 +0000 (15:16 -0500)]

Merge branch 'sunvnet-next'

Sowmini Varadhan says:

====================
sunvnet: bug fixes

This patch series has a coding-style fix and a bug fix.

The coding style fix (patch 1) is the extra indentation flagged by
Ben Hutchings:
  http://marc.info/?l=linux-netdev&m=141529243409594&w=2

The bugfix (patch 2) is the following:
when vnet_event_napi() is  called as part of napi_resume
(i.e., continuation of a previous NAPI read that was truncated
due to budget constraints), and then finds no more packets to read,
the code was trying to avoid an additional trip through ldc_rx
as an optimization. However, when this corner case happens, we would
need to reset a number of dring state bits such as rcv_nxt carefully,
which quickly becomes complex and hacky.  The cleaner solution
is to just roll back to vnet_poll, re-enable interrupts and set up
dring state as was done in the pre-NAPI version of the driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Sowmini Varadhan [Thu, 6 Nov 2014 19:51:08 +0000 (14:51 -0500)]

sunvnet: Return from vnet_napi_event() if no packets to read

vnet_event_napi() may be called as part of the NAPI ->poll,
to resume reading descriptor rings. When no data is available,
descriptor ring state (e.g., rcv_nxt) needs to be reset
carefully to stay in lock-step with ldc_read(). In the interest
of simplicity, the best way to do this is to return from
vnet_event_napi() when there are no more packets to read.
The next trip through ldc_rx will correctly set up the dring state.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: David Stevens <david.stevens@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Sowmini Varadhan [Thu, 6 Nov 2014 19:51:02 +0000 (14:51 -0500)]

sunvnet: Fix indentation in maybe_tx_wakeup()

remove redundant tab.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Thu, 6 Nov 2014 20:14:36 +0000 (15:14 -0500)]

Merge branch 'r8152-next'

Hayes Wang says:

====================
r8152: rtl_ops_init modify

Initialize the ops through tp->version. This could skip checking
each VID/PID.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

hayeswang [Thu, 6 Nov 2014 04:47:40 +0000 (12:47 +0800)]

r8152: remove the definitions of the PID

The PIDs are only used in the id table, so the definitions are
unnacessary. Remove them wouldn't have confusion.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

hayeswang [Thu, 6 Nov 2014 04:47:39 +0000 (12:47 +0800)]

r8152: modify rtl_ops_init

Replace using VID/PID with using tp->version to initialize the ops.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

hayeswang [Thu, 6 Nov 2014 04:47:38 +0000 (12:47 +0800)]

r8152: move r8152b_get_version

Move r8152b_get_version() to the location before rtl_ops_init().
Then, the rtl_ops_init() could use tp->version.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Joe Perches [Wed, 5 Nov 2014 23:42:09 +0000 (15:42 -0800)]

sock.h: Remove unused NETDEBUG macro

It's unused now, just delete it.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Joe Perches [Wed, 5 Nov 2014 23:36:08 +0000 (15:36 -0800)]

net: esp: Convert NETDEBUG to pr_info

Commit 64ce207306de ("[NET]: Make NETDEBUG pure printk wrappers")
originally had these NETDEBUG printks as always emitting.

Commit a2a316fd068c ("[NET]: Replace CONFIG_NET_DEBUG with sysctl")
added a net_msg_warn sysctl to these NETDEBUG uses.

Convert these NETDEBUG uses to normal pr_info calls.

This changes the output prefix from "ESP: " to include
"IPSec: " for the ipv4 case and "IPv6: " for the ipv6 case.

These output lines are now like the other messages in the files.

Other miscellanea:

Neaten the arithmetic spacing to be consistent with other
arithmetic spacing in the files.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Joe Perches [Wed, 5 Nov 2014 22:39:21 +0000 (14:39 -0800)]

net; ipv[46] - Remove 2 unnecessary NETDEBUG OOM messages

These messages aren't useful as there's a generic dump_stack()
on OOM.

Neaten the comment and if test above the OOM by separating the
assign in if into an allocation then if test.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Andrew Lunn [Wed, 5 Nov 2014 19:01:59 +0000 (20:01 +0100)]

dsa: mv88e6171: Add support for mv88e6172

The mv88e6172 is very similar to the mv88e6171. So extend the
mv88e6171 driver to support it.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Jiri Pirko [Wed, 5 Nov 2014 19:51:51 +0000 (20:51 +0100)]

sched: fix act file names in header comment

Fixes: 4bba3925 ("[PKT_SCHED]: Prefix tc actions with act_")
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Thu, 6 Nov 2014 19:43:13 +0000 (14:43 -0500)]

Merge branch 'sfc-next'

Shradha Shah says:

====================
sfc: Clean up Siena SR-IOV support in preparation for EF10 SR-IOV support

This patch series provides a base and clean up for the upcoming
EF10 SRIOV patches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Shradha Shah [Wed, 5 Nov 2014 12:16:46 +0000 (12:16 +0000)]

sfc: Add NIC type operations to replace direct calls from efx.c into siena_sriov.c

Also add dummy functions where required to avoid NULL pointer dereference.

Signed-off-by: Shradha Shah <sshah@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Shradha Shah [Wed, 5 Nov 2014 12:16:32 +0000 (12:16 +0000)]

sfc: Rename implementations in siena_sriov.c to have a 'siena' prefix

Patch in preparation for the upcoming EF10 sriov support.

Signed-off-by: Shradha Shah <sshah@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Shradha Shah [Wed, 5 Nov 2014 12:16:18 +0000 (12:16 +0000)]

sfc: Move the current VF state from efx_nic into siena_nic_data

This patch series provides a base and cleanup for the
upcoming EF10 SRIOV support.

This patch moves the VF state into siena_nic_data as a basis to
save the VF state based on nic type.

Signed-off-by: Shradha Shah <sshah@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Malcolm Crossley [Wed, 5 Nov 2014 10:50:22 +0000 (10:50 +0000)]

xen-netback: remove unconditional __pskb_pull_tail() in guest Tx path

Unconditionally pulling 128 bytes into the linear area is not required
for:

- security: Every protocol demux starts with pskb_may_pull() to pull
  frag data into the linear area, if necessary, before looking at
  headers.

- performance: Netback has already grant copied up-to 128 bytes from
  the first slot of a packet into the linear area. The first slot
  normally contain all the IPv4/IPv6 and TCP/UDP headers.

The unconditional pull would often copy frag data unnecessarily.  This
is a performance problem when running on a version of Xen where grant
unmap avoids TLB flushes for pages which are not accessed.  TLB
flushes can now be avoided for > 99% of unmaps (it was 0% before).

Grant unmap TLB flush avoidance will be available in a future version
of Xen (probably 4.6).

Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Thu, 6 Nov 2014 19:39:01 +0000 (14:39 -0500)]

Merge branch 'stmmac-next'

Andy Shevchenko says:

====================
stmmac: pci: various cleanups and fixes

There are few cleanups and fixes regarding to stmmac PCI driver.
This has been tested on Intel Galileo board with recent net-next tree.

Since v2:
- drop patch 5/5 since it will be part of a big change across entire subsystem

Since v1:
- remove already applied patch
- append patch 1/5
- rework patch 3/5 to be functional compatible with original code
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Andy Shevchenko [Wed, 5 Nov 2014 10:27:29 +0000 (12:27 +0200)]

stmmac: pci: convert to use dev_* macros

Instead of pr_* macros let's use dev_* macros which provide device name.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Andy Shevchenko [Wed, 5 Nov 2014 10:27:28 +0000 (12:27 +0200)]

stmmac: pci: use managed resources

Migrate pci driver to managed resources to reduce boilerplate error handling
code.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Andy Shevchenko [Wed, 5 Nov 2014 10:27:27 +0000 (12:27 +0200)]

stmmac: pci: convert to use dev_pm_ops

Convert system PM callbacks to use dev_pm_ops. In addition remove the PCI calls
related to a power state since the bus code cares about this already.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Andy Shevchenko [Wed, 5 Nov 2014 10:27:26 +0000 (12:27 +0200)]

stmmac: pci: use defined constant instead of magic number

The last standard PCI resource is defined as PCI_STD_RESOURCE_END. Thus, we
could use it instead of plain integer.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Andy Shevchenko [Wed, 5 Nov 2014 09:45:32 +0000 (11:45 +0200)]

stmmac: fix sparse warnings

This patch fixes the following sparse warnings.

drivers/net/ethernet/stmicro/stmmac/enh_desc.c:381:30: warning: symbol 'enh_desc_ops' was not declared. Should it be static?
drivers/net/ethernet/stmicro/stmmac/norm_desc.c:253:30: warning: symbol 'ndesc_ops' was not declared. Should it be static?
drivers/net/ethernet/stmicro/stmmac/stmmac_hwtstamp.c:141:33: warning: symbol 'stmmac_ptp' was not declared. Should it be static?

There is no functional change.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Giuseppe CAVALLARO <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Steffen Klassert [Wed, 5 Nov 2014 07:03:50 +0000 (08:03 +0100)]

ip6_tunnel: Add support for wildcard tunnel endpoints.

This patch adds support for tunnels with local or
remote wildcard endpoints. With this we get a
NBMA tunnel mode like we have it for ipv4 and
sit tunnels.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Steffen Klassert [Wed, 5 Nov 2014 07:02:48 +0000 (08:02 +0100)]

ipv6: Allow sending packets through tunnels with wildcard endpoints

Currently we need the IP6_TNL_F_CAP_XMIT capabiltiy to transmit
packets through an ipv6 tunnel. This capability is set when the
tunnel gets configured, based on the tunnel endpoint addresses.

On tunnels with wildcard tunnel endpoints, we need to do the
capabiltiy checking on a per packet basis like it is done in
the receive path.

This patch extends ip6_tnl_xmit_ctl() to take local and remote
addresses as parameters to allow for per packet capabiltiy
checking.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Pravin B Shelar [Sun, 19 Oct 2014 19:03:40 +0000 (12:03 -0700)]

openvswitch: Avoid NULL mask check while building mask

OVS does mask validation even if it does not need to convert
netlink mask attributes to mask structure.  ovs_nla_get_match()
caller can pass NULL mask structure pointer if the caller does
not need mask.  Therefore NULL check is required in SW_FLOW_KEY*
macros.  Following patch does not convert mask netlink attributes
if mask pointer is NULL, so we do not need these checks in
SW_FLOW_KEY* macro.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Daniele Di Proietto <ddiproietto@vmware.com>
Acked-by: Andy Zhou <azhou@nicira.com>

commit | commitdiff | tree

Pravin B Shelar [Sun, 19 Oct 2014 18:19:51 +0000 (11:19 -0700)]

openvswitch: Refactor action alloc and copy api.

There are two separate API to allocate and copy actions list. Anytime
OVS needs to copy action list, it needs to call both functions.
Following patch moves action allocation to copy function to avoid
code duplication.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>

commit | commitdiff | tree

Joe Stringer [Sat, 18 Oct 2014 23:14:14 +0000 (16:14 -0700)]

openvswitch: Move key_attr_size() to flow_netlink.h.

flow-netlink has netlink related code.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>

commit | commitdiff | tree

Lorand Jakab [Mon, 6 Oct 2014 12:45:32 +0000 (05:45 -0700)]

openvswitch: Remove flow member from struct ovs_skb_cb

The 'flow' memeber was chosen for removal because it's only used
in ovs_execute_actions() we can pass it as argument to this
function.

Signed-off-by: Lorand Jakab <lojakab@cisco.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>

commit | commitdiff | tree

Jarno Rajahalme [Tue, 30 Sep 2014 17:52:32 +0000 (10:52 -0700)]

openvswitch: Fix the type of struct ovs_key_nd nd_target field.

Should be the same as other IPv6 address fields.

Current master produces sparse warnings without this change.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>

commit | commitdiff | tree

Chunhe Li [Mon, 8 Sep 2014 20:17:21 +0000 (13:17 -0700)]

openvswitch: Drop packets when interdev is not up

If the internal device is not up, it should drop received
packets. Sometimes it receive the broadcast or multicast
packets, and the ip protocol stack will casue more cpu
usage wasted.

Signed-off-by: Chunhe Li <lichunhe@huawei.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>

commit | commitdiff | tree

Andy Zhou [Mon, 8 Sep 2014 20:14:22 +0000 (13:14 -0700)]

openvswitch: Refactor get_dp() function into multiple access APIs.

Avoid recursive read_rcu_lock() by using the lighter weight
get_dp_rcu() API. Add proper locking assertions to get_dp().

Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>

commit | commitdiff | tree

Joe Stringer [Mon, 8 Sep 2014 20:09:37 +0000 (13:09 -0700)]

openvswitch: Refactor ovs_flow_cmd_fill_info().

Split up ovs_flow_cmd_fill_info() to make it easier to cache parts of a
dump reply. This will be used to streamline flow_dump in a future patch.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Thomas Graf <tgraf@noironetworks.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>

commit | commitdiff | tree

Andy Zhou [Mon, 8 Sep 2014 07:35:02 +0000 (00:35 -0700)]

openvswitch: refactor do_output() to move NULL check out of fast path

skb_clone() NULL check is implemented in do_output(), as past of the
common (fast) path. Refactoring so that NULL check is done in the
slow path, immediately after skb_clone() is called.

Besides optimization, this change also improves code readability by
making the skb_clone() NULL check consistent within OVS datapath
module.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>

commit | commitdiff | tree

Jesse Gross [Mon, 6 Oct 2014 12:08:38 +0000 (05:08 -0700)]

openvswitch: Additional logging for -EINVAL on flow setups.

There are many possible ways that a flow can be invalid so we've
added logging for most of them. This adds logs for the remaining
possible cases so there isn't any ambiguity while debugging.

CC: Federico Iezzi <fiezzi@enter.it>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Thomas Graf <tgraf@noironetworks.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>

commit | commitdiff | tree

Joe Stringer [Mon, 8 Sep 2014 05:11:08 +0000 (22:11 -0700)]

openvswitch: Remove redundant tcp_flags code.

These two cases used to be treated differently for IPv4/IPv6,
but they are now identical.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>

commit | commitdiff | tree

Pravin B Shelar [Wed, 7 May 2014 01:41:20 +0000 (18:41 -0700)]

openvswitch: Move table destroy to dp-rcu callback.

Ths simplifies flow-table-destroy API. No need to pass explicit
parameter about context.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@redhat.com>

commit | commitdiff | tree

Simon Horman [Mon, 6 Oct 2014 12:05:13 +0000 (05:05 -0700)]

openvswitch: Add basic MPLS support to kernel

Allow datapath to recognize and extract MPLS labels into flow keys
and execute actions which push, pop, and set labels on packets.

Based heavily on work by Leo Alterman, Ravi K, Isaku Yamahata and Joe Stringer.

Cc: Ravi K <rkerur@gmail.com>
Cc: Leo Alterman <lalterman@nicira.com>
Cc: Isaku Yamahata <yamahata@valinux.co.jp>
Cc: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>

commit | commitdiff | tree

Pravin B Shelar [Wed, 5 Nov 2014 23:27:48 +0000 (15:27 -0800)]

net: Remove MPLS GSO feature.

Device can export MPLS GSO support in dev->mpls_features same way
it export vlan features in dev->vlan_features. So it is safe to
remove NETIF_F_GSO_MPLS redundant flag.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>

commit | commitdiff | tree

Tom Herbert [Thu, 6 Nov 2014 00:49:38 +0000 (16:49 -0800)]

fou: Fix typo in returning flags in netlink

When filling netlink info, dport is being returned as flags. Fix
instances to return correct value.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

hayeswang [Wed, 5 Nov 2014 02:17:02 +0000 (10:17 +0800)]

r8152: disable the tasklet by default

Let the tasklet only be enabled after open(), and be disabled for
the other situation. The tasklet is only necessary after open() for
tx/rx, so it could be disabled by default.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Daniel Borkmann [Wed, 5 Nov 2014 19:27:38 +0000 (20:27 +0100)]

ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs

It has been reported that generating an MLD listener report on
devices with large MTUs (e.g. 9000) and a high number of IPv6
addresses can trigger a skb_over_panic():

skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
dev:port1
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:100!
invalid opcode: 0000 [#1] SMP
Modules linked in: ixgbe(O)
CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
[...]
Call Trace:
<IRQ>
[<ffffffff80578226>] ? skb_put+0x3a/0x3b
[<ffffffff80612a5d>] ? add_grhead+0x45/0x8e
[<ffffffff80612e3a>] ? add_grec+0x394/0x3d4
[<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
[<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
[<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
[<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
[<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
[<ffffffff8025112b>] ? irq_exit+0x4e/0xd3
[<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
[<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70

mld_newpack() skb allocations are usually requested with dev->mtu
in size, since commit 72e09ad107e7 ("ipv6: avoid high order allocations")
we have changed the limit in order to be less likely to fail.

However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
macros, which determine if we may end up doing an skb_put() for
adding another record. To avoid possible fragmentation, we check
the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
assumption as the actual max allocation size can be much smaller.

The IGMP case doesn't have this issue as commit 57e1ab6eaddc
("igmp: refine skb allocations") stores the allocation size in
the cb[].

Set a reserved_tailroom to make it fit into the MTU and use
skb_availroom() helper instead. This also allows to get rid of
igmp_skb_size().

Reported-by: Wei Liu <lw1a2.jing@gmail.com>
Fixes: 72e09ad107e7 ("ipv6: avoid high order allocations")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: David L Stevens <david.stevens@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Joe Perches [Tue, 4 Nov 2014 23:37:03 +0000 (15:37 -0800)]

net: Convert SEQ_START_TOKEN/seq_printf to seq_puts

Using a single fixed string is smaller code size than using
a format and many string arguments.

Reduces overall code size a little.

$ size net/ipv4/igmp.o* net/ipv6/mcast.o* net/ipv6/ip6_flowlabel.o*
   text    data     bss     dec     hex filename
  34269    7012   14824   56105    db29 net/ipv4/igmp.o.new
  34315    7012   14824   56151    db57 net/ipv4/igmp.o.old
  30078    7869   13200   51147    c7cb net/ipv6/mcast.o.new
  30105    7869   13200   51174    c7e6 net/ipv6/mcast.o.old
  11434    3748    8580   23762    5cd2 net/ipv6/ip6_flowlabel.o.new
  11491    3748    8580   23819    5d0b net/ipv6/ip6_flowlabel.o.old

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Hannes Frederic Sowa [Tue, 4 Nov 2014 23:23:04 +0000 (00:23 +0100)]

fast_hash: avoid indirect function calls

By default the arch_fast_hash hashing function pointers are initialized
to jhash(2). If during boot-up a CPU with SSE4.2 is detected they get
updated to the CRC32 ones. This dispatching scheme incurs a function
pointer lookup and indirect call for every hashing operation.

rhashtable as a user of arch_fast_hash e.g. stores pointers to hashing
functions in its structure, too, causing two indirect branches per
hashing operation.

Using alternative_call we can get away with one of those indirect branches.

Acked-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Thu, 6 Nov 2014 02:50:43 +0000 (21:50 -0500)]

Merge branch 'amd-xgbe-next'

Tom Lendacky says:

====================
amd-xgbe: AMD XGBE driver updates 2014-11-04

The following series of patches includes functional updates to the
driver as well as some trivial changes for function renaming and
spelling fixes.

- Move channel and ring structure allocation into the device open path
- Rename the pre_xmit function to dev_xmit
- Explicitly use the u32 data type for the device descriptors
- Use page allocation for the receive buffers
- Add support for split header/payload receive
- Add support for per DMA channel interrupts
- Add support for receive side scaling (RSS)
- Add support for ethtool receive side scaling commands
- Fix the spelling of descriptors
- After a PCS reset, sync the PCS and PHY modes
- Add dependency on HAS_IOMEM to both the amd-xgbe and amd-xgbe-phy
drivers

This patch series is based on net-next.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Lendacky, Thomas [Tue, 4 Nov 2014 22:07:46 +0000 (16:07 -0600)]

amd-xgbe-phy: Let AMD_XGBE_PHY depend on HAS_IOMEM

The amd-xgbe-phy driver needs to perform ioremap calls, so add HAS_IOMEM
to its build dependency.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Lendacky, Thomas [Tue, 4 Nov 2014 22:07:41 +0000 (16:07 -0600)]

amd-xgbe: Let AMD_XGBE depend on HAS_IOMEM

The amd-xgbe driver needs to perform ioremap calls, so add HAS_IOMEM
to its build dependency.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Lendacky, Thomas [Tue, 4 Nov 2014 22:07:35 +0000 (16:07 -0600)]

amd-xgbe-phy: Sync PCS and PHY modes after reset

This patch adds support to sync the states of the PCS and the PHY
after a reset is performed. If the PCS and the PHY are not in the
same state after reset an extra mode change would be performed. This
extra mode change might not be needed if the PCS and the PHY are
synced up after reset.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Lendacky, Thomas [Tue, 4 Nov 2014 22:07:29 +0000 (16:07 -0600)]

amd-xgbe: Fix a spelling error

This patch fixes the spelling of the word "descriptor" in a couple
of locations.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Lendacky, Thomas [Tue, 4 Nov 2014 22:07:23 +0000 (16:07 -0600)]

amd-xgbe: Add receive side scaling ethtool support

This patch adds support for ethtool receive side scaling (RSS) commands.
Support is added to get/set the RSS hash key and the RSS lookup table.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Lendacky, Thomas [Tue, 4 Nov 2014 22:07:02 +0000 (16:07 -0600)]

amd-xgbe: Provide support for receive side scaling

This patch provides support for receive side scaling (RSS). RSS allows
for spreading incoming network packets across the Rx queues. When used
in conjunction with the per DMA channel interrupt support, this allows
the receive processing to be spread across multiple processors.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Lendacky, Thomas [Tue, 4 Nov 2014 22:06:56 +0000 (16:06 -0600)]

amd-xgbe: Add support for per DMA channel interrupts

This patch provides support for interrupts that are generated by the
Tx/Rx DMA channel pairs of the device. This allows for Tx and Rx
processing to run across multiple processsors.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Lendacky, Thomas [Tue, 4 Nov 2014 22:06:50 +0000 (16:06 -0600)]

amd-xgbe: Implement split header receive support

Provide support for splitting IP packets so that the header and
payload can be sent to different DMA addresses. This will allow
the IP header to be put into the linear part of the skb while the
payload can be added as frags.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Lendacky, Thomas [Tue, 4 Nov 2014 22:06:44 +0000 (16:06 -0600)]

amd-xgbe: Use page allocations for Rx buffers

Use page allocations for Rx buffers instead of pre-allocating skbs
of a set size.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Lendacky, Thomas [Tue, 4 Nov 2014 22:06:37 +0000 (16:06 -0600)]

amd-xgbe: Use the u32 data type for descriptors

The Tx and Rx descriptors are unsigned 32 bit values. Use the u32
type, rather than unsigned int, to map these descriptors.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Lendacky, Thomas [Tue, 4 Nov 2014 22:06:32 +0000 (16:06 -0600)]

amd-xgbe: Rename pre_xmit function to dev_xmit

The pre_xmit function name implies that it performs operations prior
to transmitting the packet when in fact it is responsible for setting
up the descriptors and initiating the transmit. Rename this to
function from pre_xmit to dev_xmit, which is consistent with the name
used during receive processing - dev_read.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Lendacky, Thomas [Tue, 4 Nov 2014 22:06:26 +0000 (16:06 -0600)]

amd-xgbe: Move ring allocation to device open

Move the channel and ring tracking structures allocation to device
open. This will allow for future support to vary the number of Tx/Rx
queues without unloading the module.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

WANG Cong [Tue, 4 Nov 2014 18:59:47 +0000 (10:59 -0800)]

ipv6: move INET6_MATCH() to include/net/inet6_hashtables.h

It is only used in net/ipv6/inet6_hashtables.c.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Wed, 5 Nov 2014 21:46:40 +0000 (16:46 -0500)]

net: Add and use skb_copy_datagram_msg() helper.

This encapsulates all of the skb_copy_datagram_iovec() callers
with call argument signature "skb, offset, msghdr->msg_iov, length".

When we move to iov_iters in the networking, the iov_iter object will
sit in the msghdr.

Having a helper like this means there will be less places to touch
during that transformation.

Based upon descriptions and patch from Al Viro.

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Wed, 5 Nov 2014 21:34:47 +0000 (16:34 -0500)]

Merge branch 'gue-next'

Tom Herbert says:

====================
gue: Remote checksum offload

This patch set implements remote checksum offload for
GUE, which is a mechanism that provides checksum offload of
encapsulated packets using rudimentary offload capabilities found in
most Network Interface Card (NIC) devices. The outer header checksum
for UDP is enabled in packets and, with some additional meta
information in the GUE header, a receiver is able to deduce the
checksum to be set for an inner encapsulated packet. Effectively this
offloads the computation of the inner checksum. Enabling the outer
checksum in encapsulation has the additional advantage that it covers
more of the packet than the inner checksum including the encapsulation
headers.

Remote checksum offload is described in:
http://tools.ietf.org/html/draft-herbert-remotecsumoffload-01

The GUE transmit and receive paths are modified to support the
remote checksum offload option. The option contains a checksum
offset and checksum start which are directly derived from values
set in stack when doing CHECKSUM_PARTIAL. On receipt of the option, the
operation is to calculate the packet checksum from "start" to end of
the packet (normally derived for checksum complete), and then set
the resultant value at checksum "offset" (the checksum field has
already been primed with the pseudo header). This emulates a NIC
that implements NETIF_F_HW_CSUM.

The primary purpose of this feature is to eliminate cost of performing
checksum calculation over a packet when encpasulating.

In this patch set:
  - Move fou_build_header into fou.c and split it into a couple of
    functions
  - Enable offloading of outer UDP checksum in encapsulation
  - Change udp_offload to support remote checksum offload, includes
    new GSO type and ensuring encapsulated layers (TCP) doesn't try to
    set a checksum covered by RCO
  - TX support for RCO with GUE. This is configured through ip_tunnel
    and set the option on transmit when packet being encapsulated is
    CHECKSUM_PARTIAL
  - RX support for RCO with GUE for normal and GRO paths. Includes
    resolving the offloaded checksum

v2:
  Address comments from davem: Move accounting for private option
  field in gue_encap_hlen to patch in which we add the remote checksum
  offload option.

Testing:

I ran performance numbers using netperf TCP_STREAM and TCP_RR with 200
streams, comparing GUE with and without remote checksum offload (doing
checksum-unnecessary to complete conversion in both cases). These
were run on mlnx4 and bnx2x. Some mlnx4 results are below.

GRE/GUE
    TCP_STREAM
      IPv4, with remote checksum offload
        9.71% TX CPU utilization
        7.42% RX CPU utilization
        36380 Mbps
      IPv4, without remote checksum offload
        12.40% TX CPU utilization
        7.36% RX CPU utilization
        36591 Mbps
    TCP_RR
      IPv4, with remote checksum offload
        77.79% CPU utilization
91/144/216 90/95/99% latencies
        1.95127e+06 tps
      IPv4, without remote checksum offload
        78.70% CPU utilization
        89/152/297 90/95/99% latencies
        1.95458e+06 tps

IPIP/GUE
    TCP_STREAM
      With remote checksum offload
        10.30% TX CPU utilization
        7.43% RX CPU utilization
        36486 Mbps
      Without remote checksum offload
        12.47% TX CPU utilization
        7.49% RX CPU utilization
        36694 Mbps
    TCP_RR
      With remote checksum offload
        77.80% CPU utilization
        87/153/270 90/95/99% latencies
        1.98735e+06 tps
      Without remote checksum offload
        77.98% CPU utilization
        87/150/287 90/95/99% latencies
        1.98737e+06 tps

SIT/GUE
    TCP_STREAM
      With remote checksum offload
        9.68% TX CPU utilization
        7.36% RX CPU utilization
        35971 Mbps
      Without remote checksum offload
        12.95% TX CPU utilization
        8.04% RX CPU utilization
        36177 Mbps
    TCP_RR
      With remote checksum offload
        79.32% CPU utilization
        94/158/295 90/95/99% latencies
        1.88842e+06 tps
      Without remote checksum offload
        80.23% CPU utilization
        94/149/226 90/95/99% latencies
        1.90338e+06 tps

VXLAN
    TCP_STREAM
        35.03% TX CPU utilization
        20.85% RX CPU utilization
        36230 Mbps
    TCP_RR
        77.36% CPU utilization
        84/146/270 90/95/99% latencies
        2.08063e+06 tps

We can also look at CPU time in csum_partial using perf (with bnx2x
setup). For GRE with TCP_STREAM I see:

    With remote checksum offload
        0.33% TX
        1.81% RX
    Without remote checksum offload
        6.00% TX
        0.51% RX

I suspect the fact that time in csum_partial noticably increases
with remote checksum offload for RX is due to taking the cache miss on
the encapsulated header in that function. By similar reasoning, if on
the TX side the packet were not in cache (say we did a splice from a
file whose data was never touched by the CPU) the CPU savings for TX
would probably be more pronounced.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Tom Herbert [Tue, 4 Nov 2014 17:06:57 +0000 (09:06 -0800)]

gue: Receive side of remote checksum offload

Add processing of the remote checksum offload option in both the normal
path as well as the GRO path. The implements patching the affected
checksum to derive the offloaded checksum.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Tom Herbert [Tue, 4 Nov 2014 17:06:56 +0000 (09:06 -0800)]

gue: TX support for using remote checksum offload option

Add if_tunnel flag TUNNEL_ENCAP_FLAG_REMCSUM to configure
remote checksum offload on an IP tunnel. Add logic in gue_build_header
to insert remote checksum offload option.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Tom Herbert [Tue, 4 Nov 2014 17:06:55 +0000 (09:06 -0800)]

gue: Protocol constants for remote checksum offload

Define a private flag for remote checksun offload as well as a length
for the option.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Tom Herbert [Tue, 4 Nov 2014 17:06:54 +0000 (09:06 -0800)]

udp: Changes to udp_offload to support remote checksum offload

Add a new GSO type, SKB_GSO_TUNNEL_REMCSUM, which indicates remote
checksum offload being done (in this case inner checksum must not
be offloaded to the NIC).

Added logic in __skb_udp_tunnel_segment to handle remote checksum
offload case.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Tom Herbert [Tue, 4 Nov 2014 17:06:53 +0000 (09:06 -0800)]

gue: Add infrastructure for flags and options

Add functions and basic definitions for processing standard flags,
private flags, and control messages. This includes definitions
to compute length of optional fields corresponding to a set of flags.
Flag validation is in validate_gue_flags function. This checks for
unknown flags, and that length of optional fields is <= length
in guehdr hlen.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Tom Herbert [Tue, 4 Nov 2014 17:06:52 +0000 (09:06 -0800)]

udp: Offload outer UDP tunnel csum if available

In __skb_udp_tunnel_segment if outer UDP checksums are enabled and
ip_summed is not already CHECKSUM_PARTIAL, set up checksum offload
if device features allow it.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Tom Herbert [Tue, 4 Nov 2014 17:06:51 +0000 (09:06 -0800)]

net: Move fou_build_header into fou.c and refactor

Move fou_build_header out of ip_tunnel.c and into fou.c splitting
it up into fou_build_header, gue_build_header, and fou_build_udp.
This allows for other users for TX of FOU or GUE. Change ip_tunnel_encap
to call fou_build_header or gue_build_header based on the tunnel
encapsulation type. Similarly, added fou_encap_hlen and gue_encap_hlen
functions which are called by ip_encap_hlen. New net/fou.h has
prototypes and defines for this.

Added NET_FOU_IP_TUNNELS configuration. When this is set, IP tunnels
can use FOU/GUE and fou module is also selected.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Wed, 5 Nov 2014 21:14:54 +0000 (16:14 -0500)]

Merge branch 'stmmac-next'

Giuseppe Cavallaro says:

====================
stmmac: review driver Koptions

Recently many Koption options have been added to have new glue logic on several
platforms.

The main goal behind this work is to guarantee that the driver built
fine on all the branches where it is present independently of which
glue logic is selected.

IMHO, it is better to remove all the not necessary Koption(s) that can hide
build problems when something changes in the driver and especially when
the DT compatibility allows us to manage all the platform data.

I compiled the driver w/o any issue on net-next Git for:

x86, arm and sh4.

In case of there are build problems on some repos now it will be
easy to catch them and cherry-pick patches from mainstream.

For sure, do not hesitate to contact me in case of issue.

Also this set removes STMMAC_DEBUG_FS and BUS_MODE_DA. The latter is useless
and the former can be replaced by DEBUG_FS (always to make safe the build).

V2: patch-set re-based on top of the latest updates for net-next
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Giuseppe CAVALLARO [Tue, 4 Nov 2014 14:49:34 +0000 (15:49 +0100)]

stmmac: remove BUS_MODE_DA

This is a very old and often unused option to configure
a bit in a register inside the DMA. This support should
not stay under Koption and should be extended for new chips too.
This will be do later maybe via device-tree parameters.
Also no performance impact when remove this setting on STi platforms.

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Giuseppe CAVALLARO [Tue, 4 Nov 2014 14:49:33 +0000 (15:49 +0100)]

stmmac: remove STMMAC_DEBUG_FS

the STMMAC_DEBUG_FS Koption is now removed from the
driver configuration and this support will be built
by default when DEBUG_FS is present. This can also be
useful on building driver verification.

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Giuseppe CAVALLARO [Tue, 4 Nov 2014 14:49:32 +0000 (15:49 +0100)]

stmmac: remove specific SoC Koption from platform.

This patch removes all the Koptions added to build the glue-logic files
for all different architectures: DWMAC_MESON, DWMAC_SUNXI, DWMAC_STI ...
Nowadays the stmmac needs to be compiled on several platforms; in some
case it very convenient to guarantee that its build is always completed
with success on all the branches where the driver is present.

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Vladimir Zapolskiy [Mon, 3 Nov 2014 23:25:09 +0000 (01:25 +0200)]

net: phy: spi_ks8995: remove sysfs bin file by registered attribute

When a sysfs binary file is asked to be removed, it is found by
attribute name, so strictly speaking this change is not a fix, but
just in case when attribute name is changed in the driver or sysfs
internals are changed, it might be better to remove the previously
created file using right the same binary attribute.

Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Fabian Frederick [Tue, 4 Nov 2014 22:11:19 +0000 (23:11 +0100)]

udp: remove blank line between set and test

Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Florent Fourcot [Tue, 4 Nov 2014 22:01:00 +0000 (23:01 +0100)]

ipv6: trivial, add bracket for the if block

The "else" block is on several lines and use bracket.

Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Fabian Frederick [Tue, 4 Nov 2014 21:54:36 +0000 (22:54 +0100)]

esp4: remove assignment in if condition

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Tue, 4 Nov 2014 21:06:46 +0000 (16:06 -0500)]

Merge branch 'ecn_via_routing_table'

Florian Westphal says:

====================
net: allow setting ecn via routing table

Here is v4 of the patchset, its exactly the same as v3 except in patch3/3
where I added the missing 'const' qualifier to a function argument that
Eric spotted during review.

I preserved Erics Acks so that he doesn't have to resend them.

v3 cover letter:

When using syn cookies, then do not simply trust that the echoed timestamp
was not modified to make sure that ecn is not turned on magically when it
is disabled on the host.

The first two patches, which were not part of earlier series, prepare
the cookie code for the ecn route metrics change by allowing is to
more easily use the existing dst object for ecn validation.

The 3rd patch adds the ecn route metric feature support.
It is almost the same as in v2, except that we'll now also test the
dst_features when decoding a syn cookie timestamp that indicates ecn support.

These three patches then allow turning on explicit congestion notification
based on the destination network.

For example, assuming the default tcp_ecn sysctl '2', the following will
enable ecn (tcp_ecn=1 behaviour, i.e. request ecn to be enabled for a
tcp connection) for all connections to hosts inside the 192.168.2/24 network:

ip route change 192.168.2.0/24 dev eth0 features ecn

Having a more fine-grained per-route setting can be beneficial for
various reasons, for example 1) within data centers, or 2) local ISPs
may deploy ECN support for their own video/streaming services [1], etc.

Joint work with Daniel Borkmann, feature suggested by Hannes Frederic Sowa.

The patch to enable this in iproute2 will be posted shortly, it is currently
also available here:
http://git.breakpoint.cc/cgit/fw/iproute2.git/commit/?h=iproute_features&id=8843d2d8973fb81c78a7efe6d42e3a17d739003e

[1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Florian Westphal [Mon, 3 Nov 2014 16:35:03 +0000 (17:35 +0100)]

net: allow setting ecn via routing table

This patch allows to set ECN on a per-route basis in case the sysctl
tcp_ecn is not set to 1. In other words, when ECN is set for specific
routes, it provides a tcp_ecn=1 behaviour for that route while the rest
of the stack acts according to the global settings.

One can use 'ip route change dev $dev $net features ecn' to toggle this.

Having a more fine-grained per-route setting can be beneficial for various
reasons, for example, 1) within data centers, or 2) local ISPs may deploy
ECN support for their own video/streaming services [1], etc.

There was a recent measurement study/paper [2] which scanned the Alexa's
publicly available top million websites list from a vantage point in US,
Europe and Asia:

Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely
blamed to commit 255cac91c3 ("tcp: extend ECN sysctl to allow server-side
only ECN") ;)); the break in connectivity on-path was found is about
1 in 10,000 cases. Timeouts rather than receiving back RSTs were much
more common in the negotiation phase (and mostly seen in the Alexa
middle band, ranks around 50k-150k): from 12-thousand hosts on which
there _may_ be ECN-linked connection failures, only 79 failed with RST
when _not_ failing with RST when ECN is not requested.

It's unclear though, how much equipment in the wild actually marks CE
when buffers start to fill up.

We thought about a fallback to non-ECN for retransmitted SYNs as another
global option (which could perhaps one day be made default), but as Eric
points out, there's much more work needed to detect broken middleboxes.

Two examples Eric mentioned are buggy firewalls that accept only a single
SYN per flow, and middleboxes that successfully let an ECN flow establish,
but later mark CE for all packets (so cwnd converges to 1).

[1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
[2] http://ecn.ethz.ch/

Joint work with Daniel Borkmann.

Reference: http://thread.gmane.org/gmane.linux.network/335797
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Florian Westphal [Mon, 3 Nov 2014 16:35:02 +0000 (17:35 +0100)]

syncookies: split cookie_check_timestamp() into two functions

The function cookie_check_timestamp(), both called from IPv4/6 context,
is being used to decode the echoed timestamp from the SYN/ACK into TCP
options used for follow-up communication with the peer.

We can remove ECN handling from that function, split it into a separate
one, and simply rename the original function into cookie_decode_options().
cookie_decode_options() just fills in tcp_option struct based on the
echoed timestamp received from the peer. Anything that fails in this
function will actually discard the request socket.

While this is the natural place for decoding options such as ECN which
commit 172d69e63c7f ("syncookies: add support for ECN") added, we argue
that in particular for ECN handling, it can be checked at a later point
in time as the request sock would actually not need to be dropped from
this, but just ECN support turned off.

Therefore, we split this functionality into cookie_ecn_ok(), which tells
us if the timestamp indicates ECN support AND the tcp_ecn sysctl is enabled.

This prepares for per-route ECN support: just looking at the tcp_ecn sysctl
won't be enough anymore at that point; if the timestamp indicates ECN
and sysctl tcp_ecn == 0, we will also need to check the ECN dst metric.

This would mean adding a route lookup to cookie_check_timestamp(), which
we definitely want to avoid. As we already do a route lookup at a later
point in cookie_{v4,v6}_check(), we can simply make use of that as well
for the new cookie_ecn_ok() function w/o any additional cost.

Joint work with Daniel Borkmann.

Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Florian Westphal [Mon, 3 Nov 2014 16:35:01 +0000 (17:35 +0100)]

syncookies: avoid magic values and document which-bit-is-what-option

Was a bit more difficult to read than needed due to magic shifts;
add defines and document the used encoding scheme.

Joint work with Daniel Borkmann.

Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Fabian Frederick [Tue, 4 Nov 2014 19:52:14 +0000 (20:52 +0100)]

igmp: remove camel case definitions

use standard uppercase for definitions

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Fabian Frederick [Tue, 4 Nov 2014 19:48:41 +0000 (20:48 +0100)]

udp: remove else after return

else is unnecessary after return 0 in __udp4_lib_rcv()

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Fabian Frederick [Tue, 4 Nov 2014 19:44:04 +0000 (20:44 +0100)]

inet: frags: remove inline on static in c file

remove __inline__ / inline and let compiler decide what to do
with static functions

Inspired-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Fabian Frederick [Tue, 4 Nov 2014 19:37:46 +0000 (20:37 +0100)]

ipv4: remove 0/NULL assignment on static

static values are automatically initialized to 0

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Fabian Frederick [Tue, 4 Nov 2014 19:31:26 +0000 (20:31 +0100)]

ipv4: use seq_puts instead of seq_printf where possible

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Fabian Frederick [Tue, 4 Nov 2014 19:25:38 +0000 (20:25 +0100)]

tcp: spelling s/plugable/pluggable

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Fabian Frederick [Tue, 4 Nov 2014 19:19:19 +0000 (20:19 +0100)]

cipso: remove NULL assignment on static

Also add blank line after structure declarations

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Fabian Frederick [Tue, 4 Nov 2014 19:13:50 +0000 (20:13 +0100)]

ipv4: include linux/bug.h instead of asm/bug.h

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Fabian Frederick [Tue, 4 Nov 2014 19:10:01 +0000 (20:10 +0100)]

cipso: kerneldoc warning fix

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Eric Dumazet [Mon, 3 Nov 2014 16:19:53 +0000 (08:19 -0800)]

net: add rbnode to struct sk_buff

Yaogong replaces TCP out of order receive queue by an RB tree.

As netem already does a private skb->{next/prev/tstamp} union
with a 'struct rb_node', lets do this in a cleaner way.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yaogong Wang <wygivan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Mon, 3 Nov 2014 21:10:11 +0000 (16:10 -0500)]

Merge branch 'master' of git://git./linux/kernel/git/jkirsher/net-next

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2014-11-03

This series contains updates to i40e and i40evf.

Akeem adds a check for i40e so that flow director flush and reinit are
not done when flow director is not enabled.

Mitch fixes the i40evf driver to properly handle multiple admin queue
messages, by reinit the msg_size field each time we go through the loop.
Without this, we may receive truncated messages due to the firmware
thinking we have insufficient buffer size.  Also fixes the link checking
logic to only check the carrier state if the interface is actually
open, which allows link changes to be reported correctly without spamming
the VFs.  Updates i40e to inset the VSI ID in the QTX_CTL register
when configuring queues for VMDq VSIs.

Paul adds support for 10G-base-T in i40evf.

Jesse fixes i40e where the call to irq_dynamic_disable() was turning off
the interrupt completely when trying to set ITR to 0 (for lowest
moderation).

Shannon removes debugfs dump stats function, since it was not being
kept up-to-date and was redundant with the ethtool output.  Also, scales
back the LAN MSIx usage to force queue/vector sharing and leave some
vectors for Flow Director, VMDq, etc. when there are more cores than
vectors available to the PF.  Cleans up the error reporting for
get_lump() resource tracking errors.  Also adds a check for the
debug module parameter earlier to be able to catch the early configuration
phase admin queue messages.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Sudip Mukherjee [Mon, 3 Nov 2014 12:12:29 +0000 (17:42 +0530)]

hamradio: 6pack: remove unnecessary check

this is check for dev is unnecessary, as we are already checking dev
after allocating it via alloc_netdev, and jumping to label: out
if it is NULL.

Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Denis Kirjanov [Thu, 30 Oct 2014 06:12:15 +0000 (09:12 +0300)]

PPC: bpf_jit_comp: add SKF_AD_PKTTYPE instruction

Add BPF extension SKF_AD_PKTTYPE to ppc JIT to load
skb->pkt_type field.

Before:
[   88.262622] test_bpf: #11 LD_IND_NET 86 97 99 PASS
[   88.265740] test_bpf: #12 LD_PKTTYPE 109 107 PASS

After:
[   80.605964] test_bpf: #11 LD_IND_NET 44 40 39 PASS
[   80.607370] test_bpf: #12 LD_PKTTYPE 9 9 PASS

CC: Alexei Starovoitov<alexei.starovoitov@gmail.com>
CC: Michael Ellerman<mpe@ellerman.id.au>
Cc: Matt Evans <matt@ozlabs.org>
Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
v2: Added test rusults
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Mon, 3 Nov 2014 17:28:21 +0000 (12:28 -0500)]

Merge branch 'mlx4-next'

Or Gerlitz says:

====================
Mellanox ethernet driver update Oct-30-2014

The 1st patch from Saeed fixes a bug in the last net-next batch where
a VF could get access to set port configuration, the next patch from Amir
fixes a race in the port VPI logic. Next are two performance patches from Ido.

The patch to add checksum complete status on GRE and such packets was
preceded with a patch that converted the driver to only use napi_gro_receive
vs. the current code which goes through napi_gro_frags on it's usual track.
Eric D. has some thoughts and suggestions on that change for which we
want to take the time and consider, so for the time being dropped that
patch and the ones that depend on it.

Changes from V0:
  - have the caller to provide the __GFP_COLD hint to the service function
  - dropped the patch that changes the GRO logic and the subsequent dependent
    patches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Matan Barak [Sun, 2 Nov 2014 14:26:17 +0000 (16:26 +0200)]

net/mlx4_core: Add retrieval of CONFIG_DEV parameters

Add code to issue CONFIG_DEV "get" firmware command.

This command is used in order to obtain certain parameters used for
supporting various RX checksumming options and vxlan UDP port.

The GET operation is allowed for VFs too.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ido Shamay [Sun, 2 Nov 2014 14:26:16 +0000 (16:26 +0200)]

net/mlx4_en: Add __GFP_COLD gfp flags in alloc_pages

Needed in order to get cache cold pages (L3 flushed) for HW scatter.

Otherwise memory may flush those entries when the packet comes from
PCI, causing back pressure resulting in BW decrease.

Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ido Shamay [Sun, 2 Nov 2014 14:26:15 +0000 (16:26 +0200)]

net/mlx4_en: Remove RX buffers alignment to IP_ALIGN

When IP_ALIGN has a non zero value, hardware will write to a non aligned
address. The only reader from this address is when copying the header
from the first frag into the linear buffer (further access to the IP
address will be from the linear buffer, in which the headers are
aligned). Since the penalty of non align access by the hardware is
greater than the software memcpy, changing the frag_align to always be 0.

Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Amir Vadai [Sun, 2 Nov 2014 14:26:14 +0000 (16:26 +0200)]

net/mlx4_core: Protect port type setting by mutex

We need to protect set_port_type() for concurrency, as the sysfs code could
call it from mutliple contexts in parallel.

The port_mutex is not enough because we need to protect from concurrent
modification of 'info' and stopping of the port sensing work.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Saeed Mahameed [Sun, 2 Nov 2014 14:26:13 +0000 (16:26 +0200)]

net/mlx4_core: Prevent VF from changing port configuration

Added wrapper to the ACCESS_REG command for handling guest HW
registers access, preventing write operations, but do allow reads.

This will prevent SRIOV guests to change port PTYS configuration,
such as speed/advertised link modes.

Fixes: adbc7ac5c15e ('net/mlx4_core: Introduce ACCESS_REG CMD [...]')
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Eric Dumazet [Sun, 2 Nov 2014 14:19:33 +0000 (06:19 -0800)]

net: less interrupt masking in NAPI

net_rx_action() can mask irqs a single time to transfert sd->poll_list
into a private list, for a very short duration.

Then, napi_complete() can avoid masking irqs again,
and net_rx_action() only needs to mask irq again in slow path.

This patch removes 2 couples of irq mask/unmask per typical NAPI run,
more if multiple napi were triggered.

Note this also allows to give control back to caller (do_softirq())
more often, so that other softirq handlers can be called a bit earlier,
or ksoftirqd can be wakeup earlier under pressure.

This was developed while testing an alternative to RX interrupt
mitigation to reduce latencies while keeping or improving GRO
aggregation on fast NIC.

Idea is to test napi->gro_list at the end of a napi->poll() and
reschedule one NAPI poll, but after servicing a full round of
softirqs (timers, TX, rcu, ...). This will be allowed only if softirq
is currently serviced by idle task or ksoftirqd, and resched not needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Domain: System / Kernel;

RSS Atom