Bruce Allan [Tue, 8 Dec 2015 23:51:04 +0000 (15:51 -0800)]
fm10k: remove unused struct element
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Bruce Allan [Tue, 8 Dec 2015 23:50:39 +0000 (15:50 -0800)]
fm10k: constify fm10k_mac_ops, fm10k_iov_ops and fm10k_info structures
These structures never change so declare them as const.
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Bruce Allan [Tue, 8 Dec 2015 23:50:34 +0000 (15:50 -0800)]
fm10k: address operator not needed when declaring function pointers
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Mon, 16 Nov 2015 23:33:34 +0000 (15:33 -0800)]
fm10k: use ether_addr_equal instead of memcmp
When comparing MAC addresses, use ether_addr_equal instead of memcmp to
ETH_ALEN length. Found and replaced using the following sed:
sed -e 's/memcmp\x28\(.*\), ETH_ALEN\x29/!ether_addr_equal\x28\1\x29/'
Reported-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Tue, 10 Nov 2015 17:40:30 +0000 (09:40 -0800)]
fm10k: Cleanup exception handling for changing queues
This patch is meant to cleanup the exception handling for the paths where
we reset the interrupts and then reconfigure them. In all of these paths
we had very different levels of exception handling. I have updated the
driver so that all of the paths should result in a similar state if we
fail.
Specifically the driver will now unload the mailbox interrupt, free the
queue vectors and MSI-X, and then detach the interface.
In addition for any of the PCIe related resets I have added a check with
the hw_ready function to just make sure the registers are in a readable
state prior to reopening the interface.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Mon, 9 Nov 2015 22:04:08 +0000 (14:04 -0800)]
fm10k: correctly pack TLV structures and explain reasoning
The TLV format for little endian structures is actually 4 byte aligned
copy. To this end, we need to add an additional __aligned(4) marker
along with __packed to ensure that these structures are actually 4 byte
aligned and packed correctly. Use of just __packed will not work as this
will result in 1byte alignment which is incorrect. Add a comment
explaining the reasoning behind why these structures need the special
treatment.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Bruce Allan [Tue, 3 Nov 2015 19:35:02 +0000 (11:35 -0800)]
fm10k: don't initialize fm10k_workqueue at global level
Cleans up checkpatch GLOBAL_INITIALIZERS error
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Nicholas Mc Guire [Sun, 20 Dec 2015 14:06:18 +0000 (15:06 +0100)]
ibmveth: consolidate kmalloc of array, memset 0 to kcalloc
This is an API consolidation only. The use of kmalloc + memset to 0
is equivalent to kcalloc in this case as it is allocating an array
of elements.
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Fri, 18 Dec 2015 14:18:08 +0000 (15:18 +0100)]
netcp: fix regression in receive processing
A cleanup patch I did was unfortunately wrong and introduced
multiple serious bugs in the netcp rx processing, as indicated
by these correct gcc warnings:
drivers/net/ethernet/ti/netcp_core.c:776:14: warning: 'buf_ptr' may be used uninitialized in this function [-Wuninitialized]
drivers/net/ethernet/ti/netcp_core.c:687:14: warning: 'ptr' may be used uninitialized in this function [-Wuninitialized]
I have checked the patch once more and found that a call to
get_pkt_info() accidentally got removed in netcp_free_rx_desc_chain,
and netcp_process_one_rx_packet no longer retrieved the correct
buffer length. This patch should fix all the known problems,
but I did not test on real hardware.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes:
899077791403 ("netcp: try to reduce type confusion in descriptors")
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Fri, 18 Dec 2015 01:51:16 +0000 (17:51 -0800)]
asix: silence log message from oversize packet
Since it is possible for an external system to send oversize packets
at anytime, it is best for driver not to print a message and spam
the log (potential external DoS).
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=109471
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 18 Dec 2015 00:14:11 +0000 (16:14 -0800)]
tcp: diag: add support for request sockets to tcp_abort()
Adding support for SYN_RECV request sockets to tcp_abort()
is quite easy after our tcp listener rewrite.
Note that we also need to better handle listeners, or we might
leak not yet accepted children, because of a missing
inet_csk_listen_stop() call.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Tested-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 18 Dec 2015 21:04:52 +0000 (16:04 -0500)]
Merge branch 'bpf-misc-updates'
Daniel Borkmann says:
====================
Misc BPF updates
This series contains a couple of misc updates to the BPF code, besides
others a new helper bpf_skb_load_bytes(), moving clearing of A/X to the
classic converter, etc. Please see individual patches for details.
Thanks!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Thu, 17 Dec 2015 22:51:57 +0000 (23:51 +0100)]
bpf, test: add couple of test cases
Add couple of test cases for interpreter but also JITs, f.e. to test that
when imm32 moves are being done, upper 32bits of the regs are being zero
extended.
Without JIT:
[...]
[ 1114.129301] test_bpf: #43 MOV REG64 jited:0 128 PASS
[ 1114.130626] test_bpf: #44 MOV REG32 jited:0 139 PASS
[ 1114.132055] test_bpf: #45 LD IMM64 jited:0 124 PASS
[...]
With JIT (generated code can as usual be nicely verified with the help of
bpf_jit_disasm tool):
[...]
[ 1062.726782] test_bpf: #43 MOV REG64 jited:1 6 PASS
[ 1062.726890] test_bpf: #44 MOV REG32 jited:1 6 PASS
[ 1062.726993] test_bpf: #45 LD IMM64 jited:1 6 PASS
[...]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Thu, 17 Dec 2015 22:51:56 +0000 (23:51 +0100)]
bpf, x86: detect/optimize loading 0 immediates
When sometimes structs or variables need to be initialized/'memset' to 0 in
an eBPF C program, the x86 BPF JIT converts this to use immediates. We can
however save a couple of bytes (f.e. even up to 7 bytes on a single emmission
of BPF_LD | BPF_IMM | BPF_DW) in the image by detecting such case and use xor
on the dst register instead.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Thu, 17 Dec 2015 22:51:55 +0000 (23:51 +0100)]
bpf: fix misleading comment in bpf_convert_filter
Comment says "User BPF's register A is mapped to our BPF register 6",
which is actually wrong as the mapping is on register 0. This can
already be inferred from the code itself. So just remove it before
someone makes assumptions based on that. Only code tells truth. ;)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Thu, 17 Dec 2015 22:51:54 +0000 (23:51 +0100)]
bpf: move clearing of A/X into classic to eBPF migration prologue
Back in the days where eBPF (or back then "internal BPF" ;->) was not
exposed to user space, and only the classic BPF programs internally
translated into eBPF programs, we missed the fact that for classic BPF
A and X needed to be cleared. It was fixed back then via
83d5b7ef99c9
("net: filter: initialize A and X registers"), and thus classic BPF
specifics were added to the eBPF interpreter core to work around it.
This added some confusion for JIT developers later on that take the
eBPF interpreter code as an example for deriving their JIT. F.e. in
f75298f5c3fe ("s390/bpf: clear correct BPF accumulator register"), at
least X could leak stack memory. Furthermore, since this is only needed
for classic BPF translations and not for eBPF (verifier takes care
that read access to regs cannot be done uninitialized), more complexity
is added to JITs as they need to determine whether they deal with
migrations or native eBPF where they can just omit clearing A/X in
their prologue and thus reduce image size a bit, see f.e.
cde66c2d88da
("s390/bpf: Only clear A and X for converted BPF programs"). In other
cases (x86, arm64), A and X is being cleared in the prologue also for
eBPF case, which is unnecessary.
Lets move this into the BPF migration in bpf_convert_filter() where it
actually belongs as long as the number of eBPF JITs are still few. It
can thus be done generically; allowing us to remove the quirk from
__bpf_prog_run() and to slightly reduce JIT image size in case of eBPF,
while reducing code duplication on this matter in current(/future) eBPF
JITs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Zi Shen Lim <zlim.lnx@gmail.com>
Cc: Yang Shi <yang.shi@linaro.org>
Acked-by: Yang Shi <yang.shi@linaro.org>
Acked-by: Zi Shen Lim <zlim.lnx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Thu, 17 Dec 2015 22:51:53 +0000 (23:51 +0100)]
bpf: add bpf_skb_load_bytes helper
When hacking tc programs with eBPF, one of the issues that come up
from time to time is to load addresses from headers. In eBPF as in
classic BPF, we have BPF_LD | BPF_ABS | BPF_{B,H,W} instructions that
extract a byte, half-word or word out of the skb data though helpers
such as bpf_load_pointer() (interpreter case).
F.e. extracting a whole IPv6 address could possibly look like ...
union v6addr {
struct {
__u32 p1;
__u32 p2;
__u32 p3;
__u32 p4;
};
__u8 addr[16];
};
[...]
a.p1 = htonl(load_word(skb, off));
a.p2 = htonl(load_word(skb, off + 4));
a.p3 = htonl(load_word(skb, off + 8));
a.p4 = htonl(load_word(skb, off + 12));
[...]
/* access to a.addr[...] */
This work adds a complementary helper bpf_skb_load_bytes() (we also
have bpf_skb_store_bytes()) as an alternative where the same call
would look like from an eBPF program:
ret = bpf_skb_load_bytes(skb, off, addr, sizeof(addr));
Same verifier restrictions apply as in
ffeedafbf023 ("bpf: introduce
current->pid, tgid, uid, gid, comm accessors") case, where stack memory
access needs to be statically verified and thus guaranteed to be
initialized in first use (otherwise verifier cannot tell whether a
subsequent access to it is valid or not as it's runtime dependent).
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 18 Dec 2015 20:37:42 +0000 (15:37 -0500)]
Merge git://git./linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains the first batch of Netfilter updates for
the upcoming 4.5 kernel. This batch contains userspace netfilter header
compilation fixes, support for packet mangling in nf_tables, the new
tracing infrastructure for nf_tables and cgroup2 support for iptables.
More specifically, they are:
1) Two patches to include dependencies in our netfilter userspace
headers to resolve compilation problems, from Mikko Rapeli.
2) Four comestic cleanup patches for the ebtables codebase, from Ian Morris.
3) Remove duplicate include in the netfilter reject infrastructure,
from Stephen Hemminger.
4) Two patches to simplify the netfilter defragmentation code for IPv6,
patch from Florian Westphal.
5) Fix root ownership of /proc/net netfilter for unpriviledged net
namespaces, from Philip Whineray.
6) Get rid of unused fields in struct nft_pktinfo, from Florian Westphal.
7) Add mangling support to our nf_tables payload expression, from
Patrick McHardy.
8) Introduce a new netlink-based tracing infrastructure for nf_tables,
from Florian Westphal.
9) Change setter functions in nfnetlink_log to be void, from
Rami Rosen.
10) Add netns support to the cttimeout infrastructure.
11) Add cgroup2 support to iptables, from Tejun Heo.
12) Introduce nfnl_dereference_protected() in nfnetlink, from Florian.
13) Add support for mangling pkttype in the nf_tables meta expression,
also from Florian.
BTW, I need that you pull net into net-next, I have another batch that
requires changes that I don't yet see in net.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 17 Dec 2015 14:18:44 +0000 (14:18 +0000)]
nfp: call netif_carrier_off() during init
Netdevs default to carrier on, we should call netif_carrier_off()
during initialization since we handle carrier state changes in the
driver.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Rolf Neugebauer <rolf.neugebauer@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 18 Dec 2015 19:43:39 +0000 (14:43 -0500)]
Merge branch 'l3mdev-accept'
David Ahern says:
====================
net: Allow accepted sockets to be bound to l3mdev domain
Allow accepted sockets to derive their sk_bound_dev_if setting from the
l3mdev domain in which the packets originated. This version adds a sysctl
to control whether the setting is inherited, making the functionality
similar to sk_mark and its sysctl_tcp_fwmark_accept setting.
This effectively allow a process to have a "VRF-global" listen socket,
with child sockets bound to the VRF device in which the packet originated.
A similar behavior can be achieved using sk_mark, but a solution using marks
is incomplete as it does not handle duplicate addresses in different L3
domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev
domain provides a complete solution.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Wed, 16 Dec 2015 21:20:44 +0000 (13:20 -0800)]
net: Allow accepted sockets to be bound to l3mdev domain
Allow accepted sockets to derive their sk_bound_dev_if setting from the
l3mdev domain in which the packets originated. A sysctl setting is added
to control the behavior which is similar to sk_mark and
sysctl_tcp_fwmark_accept.
This effectively allow a process to have a "VRF-global" listen socket,
with child sockets bound to the VRF device in which the packet originated.
A similar behavior can be achieved using sk_mark, but a solution using marks
is incomplete as it does not handle duplicate addresses in different L3
domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev
domain provides a complete solution.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Wed, 16 Dec 2015 21:20:43 +0000 (13:20 -0800)]
net: l3mdev: Add master device lookup by index
Add helper to lookup l3mdev master index given a device index.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bjørn Mork [Wed, 16 Dec 2015 15:44:38 +0000 (16:44 +0100)]
ipv6: addrconf: use stable address generator for ARPHRD_NONE
Add a new address generator mode, using the stable address generator
with an automatically generated secret. This is intended as a default
address generator mode for device types with no EUI64 implementation.
The new generator is used for ARPHRD_NONE interfaces initially, adding
default IPv6 autoconf support to e.g. tun interfaces.
If the addrgenmode is set to 'random', either by default or manually,
and no stable secret is available, then a random secret is used as
input for the stable-privacy address generator. The secret can be
read and modified like manually configured secrets, using the proc
interface. Modifying the secret will change the addrgen mode to
'stable-privacy' to indicate that it operates on a known secret.
Existing behaviour of the 'stable-privacy' mode is kept unchanged. If
a known secret is available when the device is created, then the mode
will default to 'stable-privacy' as before. The mode can be manually
set to 'random' but it will behave exactly like 'stable-privacy' in
this case. The secret will not change.
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: 吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Fri, 18 Dec 2015 14:37:37 +0000 (15:37 +0100)]
ila: add NETFILTER dependency
The recently added generic ILA translation facility fails to
build when CONFIG_NETFILTER is disabled:
net/ipv6/ila/ila_xlat.c:229:20: warning: 'struct nf_hook_state' declared inside parameter list
net/ipv6/ila/ila_xlat.c:235:27: error: array type has incomplete element type 'struct nf_hook_ops'
static struct nf_hook_ops ila_nf_hook_ops[] __read_mostly = {
This adds an explicit Kconfig dependency to avoid that case.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes:
7f00feaf1076 ("ila: Add generic ILA translation facility")
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Thu, 10 Dec 2015 17:04:07 +0000 (18:04 +0100)]
netfilter: meta: add support for setting skb->pkttype
This allows to redirect bridged packets to local machine:
ether type ip ether daddr set aa:53:08:12:34:56 meta pkttype set unicast
Without 'set unicast', ip stack discards PACKET_OTHERHOST skbs.
It is also useful to add support for a '-m cluster like' nft rule
(where switch floods packets to several nodes, and each cluster node
node processes a subset of packets for load distribution).
Mangling is restricted to HOST/OTHER/BROAD/MULTICAST, i.e. you cannot set
skb->pkt_type to PACKET_KERNEL or change PACKET_LOOPBACK to PACKET_HOST.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
David S. Miller [Fri, 18 Dec 2015 03:08:28 +0000 (22:08 -0500)]
Merge git://git./linux/kernel/git/davem/net
Conflicts:
drivers/net/geneve.c
Here we had an overlapping change, where in 'net' the extraneous stats
bump was being removed whilst in 'net-next' the final argument to
udp_tunnel6_xmit_skb() was being changed.
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 17 Dec 2015 22:05:22 +0000 (14:05 -0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Fix uninitialized variable warnings in nfnetlink_queue, a lot of
people reported this... From Arnd Bergmann.
2) Don't init mutex twice in i40e driver, from Jesse Brandeburg.
3) Fix spurious EBUSY in rhashtable, from Herbert Xu.
4) Missing DMA unmaps in mvpp2 driver, from Marcin Wojtas.
5) Fix race with work structure access in pppoe driver causing
corruptions, from Guillaume Nault.
6) Fix OOPS due to sh_eth_rx() not checking whether netdev_alloc_skb()
actually succeeded or not, from Sergei Shtylyov.
7) Don't lose flags when settifn IFA_F_OPTIMISTIC in ipv6 code, from
Bjørn Mork.
8) VXLAN_HD_RCO defined incorrectly, fix from Jiri Benc.
9) Fix clock source used for cookies in SCTP, from Marcelo Ricardo
Leitner.
10) aurora driver needs HAS_DMA dependency, from Geert Uytterhoeven.
11) ndo_fill_metadata_dst op of vxlan has to handle ipv6 tunneling
properly as well, from Jiri Benc.
12) Handle request sockets properly in xfrm layer, from Eric Dumazet.
13) Double stats update in ipv6 geneve transmit path, fix from Pravin B
Shelar.
14) sk->sk_policy[] needs RCU protection, and as a result
xfrm_policy_destroy() needs to free policies using an RCU grace
period, from Eric Dumazet.
15) SCTP needs to clone ipv6 tx options in order to avoid use after
free, from Eric Dumazet.
16) Missing kbuild export if ila.h, from Stephen Hemminger.
17) Missing mdiobus_alloc() return value checking in mdio-mux.c, from
Tobias Klauser.
18) Validate protocol value range in ->create() methods, from Hannes
Frederic Sowa.
19) Fix early socket demux races that result in illegal dst reuse, from
Eric Dumazet.
20) Validate socket address length in pptp code, from WANG Cong.
21) skb_reorder_vlan_header() uses incorrect offset and can corrupt
packets, from Vlad Yasevich.
22) Fix memory leaks in nl80211 registry code, from Ola Olsson.
23) Timeout loop count handing fixes in mISDN, xgbe, qlge, sfc, and
qlcnic. From Dan Carpenter.
24) msg.msg_iocb needs to be cleared in recvfrom() otherwise, for
example, AF_ALG will interpret it as an async call. From Tadeusz
Struk.
25) inetpeer_set_addr_v4 forgets to initialize the 'vif' field, from
Eric Dumazet.
26) rhashtable enforces the minimum table size not early enough,
breaking how we calculate the per-cpu lock allocations. From
Herbert Xu.
27) Fix FCC port lockup in 82xx driver, from Martin Roth.
28) FOU sockets need to be freed using RCU, from Hannes Frederic Sowa.
29) Fix out-of-bounds access in __skb_complete_tx_timestamp() and
sock_setsockopt() wrt. timestamp handling. From WANG Cong.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (117 commits)
net: check both type and procotol for tcp sockets
drivers: net: xgene: fix Tx flow control
tcp: restore fastopen with no data in SYN packet
af_unix: Revert 'lock_interruptible' in stream receive code
fou: clean up socket with kfree_rcu
82xx: FCC: Fixing a bug causing to FCC port lock-up
gianfar: Don't enable RX Filer if not supported
net: fix warnings in 'make htmldocs' by moving macro definition out of field declaration
rhashtable: Fix walker list corruption
rhashtable: Enforce minimum size on initial hash table
inet: tcp: fix inetpeer_set_addr_v4()
ipv6: automatically enable stable privacy mode if stable_secret set
net: fix uninitialized variable issue
bluetooth: Validate socket address length in sco_sock_bind().
net_sched: make qdisc_tree_decrease_qlen() work for non mq
ser_gigaset: remove unnecessary kfree() calls from release method
ser_gigaset: fix deallocation of platform device structure
ser_gigaset: turn nonsense checks into WARN_ON
ser_gigaset: fix up NULL checks
qlcnic: fix a timeout loop
...
Eran Ben Elisha [Thu, 17 Dec 2015 14:11:55 +0000 (16:11 +0200)]
team: Advertise tunneling offload features
When the underlying device supports offloads encapulated traffic,
we need to reflect that through the hw_enc_features field of the
team net-device.
This will cause the xmit path in the core networking stack to provide
team with encapsulated GSO frames to offload into the HW etc.
Using this over Mellanox ConnectX3-pro (mlx4 driver) card that supports
VXLAN offloads we got 36.0 Gbits/sec using eight iperf streams.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bjørn Mork [Thu, 17 Dec 2015 11:44:04 +0000 (12:44 +0100)]
net: qmi_wwan: ignore bogus CDC Union descriptors
The CDC descriptors found on these vendor specific functions should
not be considered authoritative. They seem to be ignored by drivers
for other systems, and the quality is therefore low.
One device (1e0e:9001) has been reported to have such a bogus union
descriptor on the QMI function, making it fail probing even if the
device id was dynamically added. The report was not complete enough
to allow adding a device entry for this modem. But this should at
least fix the dynamic id probing problem.
Reported-by: Kanerva Topi <Topi.Kanerva@cinia.fi>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gregory CLEMENT [Thu, 17 Dec 2015 09:51:04 +0000 (10:51 +0100)]
net/macb: Update device tree binding for resetting PHY using GPIO
Instead of being at the MAC level the reset gpio preperty is moved at the
PHY child node level. It is still managed by the MAC, but from the point
of view of the binding it make more sense to be part of the PHY node.
This commit also fixes a build errors if GPIOLIB is not selected.
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 17 Dec 2015 20:51:05 +0000 (15:51 -0500)]
Merge branch 'cxgb4-l2-table-enhancements'
Hariprasad Shenai says:
====================
Few l2 table related enhancements for cxgb4
This series adds a new API to allocate and update l2t entry, replaces
arpq_head/arpq_tail with double skb double linked list. Use t4_mgmt_tx()
to send control packets of l2t write request. Use symbolic constants
while calculating vlan priority.
This patch series has been created against net-next tree and includes
patches on cxgb4 driver.
We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
Thanks
V2: Remove unnecessary MAS operation while calculating vlan prio in
PATCH 1/4 ("cxgb4: Use symbolic constant for VLAN priority calculation")
based on review comment by David Miller
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Thu, 17 Dec 2015 08:15:10 +0000 (13:45 +0530)]
cxgb4: Replace arpq_head/arpq_tail with SKB double link-list code
Based on original work by Michael Werner <werner@chelsio.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Thu, 17 Dec 2015 08:15:09 +0000 (13:45 +0530)]
cxgb4: Use t4_mgmt_tx() API for sending write l2t request ctrl packets.
Based on original work by Michael Werner <werner@chelsio.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Thu, 17 Dec 2015 08:15:08 +0000 (13:45 +0530)]
cxgb4: Add API to alloc l2t entry; also update existing ones
Based on original work by Kumar Sanghvi <kumaras@chelsio.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Thu, 17 Dec 2015 08:15:07 +0000 (13:45 +0530)]
cxgb4: Use symbolic constant for VLAN priority calculation
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 17 Dec 2015 07:39:04 +0000 (23:39 -0800)]
net: check both type and procotol for tcp sockets
Dmitry reported the following out-of-bound access:
Call Trace:
[<
ffffffff816cec2e>] __asan_report_load4_noabort+0x3e/0x40
mm/kasan/report.c:294
[<
ffffffff84affb14>] sock_setsockopt+0x1284/0x13d0 net/core/sock.c:880
[< inline >] SYSC_setsockopt net/socket.c:1746
[<
ffffffff84aed7ee>] SyS_setsockopt+0x1fe/0x240 net/socket.c:1729
[<
ffffffff85c18c76>] entry_SYSCALL_64_fastpath+0x16/0x7a
arch/x86/entry/entry_64.S:185
This is because we mistake a raw socket as a tcp socket.
We should check both sk->sk_type and sk->sk_protocol to ensure
it is a tcp socket.
Willem points out __skb_complete_tx_timestamp() needs to fix as well.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Iyappan Subramanian [Thu, 17 Dec 2015 06:26:05 +0000 (22:26 -0800)]
drivers: net: xgene: fix Tx flow control
Currently the Tx flow control is based on reading the hardware state,
which is not accurate since it may not reflect the descriptors that
are not yet reached the memory.
To accurately control the Tx flow, changing it to be software based.
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 16 Dec 2015 21:53:10 +0000 (13:53 -0800)]
tcp: restore fastopen with no data in SYN packet
Yuchung tracked a regression caused by commit
57be5bdad759 ("ip: convert
tcp_sendmsg() to iov_iter primitives") for TCP Fast Open.
Some Fast Open users do not actually add any data in the SYN packet.
Fixes:
57be5bdad759 ("ip: convert tcp_sendmsg() to iov_iter primitives")
Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rainer Weikusat [Wed, 16 Dec 2015 20:09:25 +0000 (20:09 +0000)]
af_unix: Revert 'lock_interruptible' in stream receive code
With
b3ca9b02b00704053a38bfe4c31dbbb9c13595d0, the AF_UNIX SOCK_STREAM
receive code was changed from using mutex_lock(&u->readlock) to
mutex_lock_interruptible(&u->readlock) to prevent signals from being
delayed for an indefinite time if a thread sleeping on the mutex
happened to be selected for handling the signal. But this was never a
problem with the stream receive code (as opposed to its datagram
counterpart) as that never went to sleep waiting for new messages with the
mutex held and thus, wouldn't cause secondary readers to block on the
mutex waiting for the sleeping primary reader. As the interruptible
locking makes the code more complicated in exchange for no benefit,
change it back to using mutex_lock.
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 16 Dec 2015 19:08:52 +0000 (19:08 +0000)]
nfp: clear ring delayed kick counters
We need to clear delayed kick counters when we free rings otherwise
after ndo_close()/ndo_open() we could kick HW by more entries than
actually written to rings.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Rolf Neugebauer <rolf.neugebauer@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 16 Dec 2015 16:57:37 +0000 (08:57 -0800)]
tun: honor IFF_UP in tun_get_user()
If a tun interface is turned down, we should not allow packet injection
into the kernel.
Kernel does not send packets to the tun already.
TUNATTACHFILTER can not be used as only tun_net_xmit() is taking care
of it.
Reported-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hannes Frederic Sowa [Wed, 16 Dec 2015 16:22:47 +0000 (17:22 +0100)]
ipv6: add IPV6_HDRINCL option for raw sockets
Same as in Windows, we miss IPV6_HDRINCL for SOL_IPV6 and SOL_RAW.
The SOL_IP/IP_HDRINCL is not available for IPv6 sockets.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Wed, 16 Dec 2015 09:50:11 +0000 (17:50 +0800)]
ipv6: allow routes to be configured with expire values
Add the support for adding expire value to routes, requested by
Tom Gundersen <teg@jklm.no> for systemd-networkd, and NetworkManager
wants it too.
implement it by adding the new RTNETLINK attribute RTA_EXPIRES.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 17 Dec 2015 19:55:29 +0000 (11:55 -0800)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Some i915 fixes, one omap fix, one core regression fix.
Not even enough fixes for a twelve days of xmas song, which seemms
good"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm: Don't overwrite UNVERFIED mode status to OK
drm/omap: fix fbdev pix format to support all platforms
drm/i915: Do a better job at disabling primary plane in the noatomic case.
drm/i915/skl: Double RC6 WRL always on
drm/i915/skl: Disable coarse power gating up until F0
drm/i915: Remove incorrect warning in context cleanup
Will Deacon [Fri, 11 Dec 2015 17:46:41 +0000 (17:46 +0000)]
locking/osq: Fix ordering of node initialisation in osq_lock
The Cavium guys reported a soft lockup on their arm64 machine, caused by
commit
c55a6ffa6285 ("locking/osq: Relax atomic semantics"):
mutex_optimistic_spin+0x9c/0x1d0
__mutex_lock_slowpath+0x44/0x158
mutex_lock+0x54/0x58
kernfs_iop_permission+0x38/0x70
__inode_permission+0x88/0xd8
inode_permission+0x30/0x6c
link_path_walk+0x68/0x4d4
path_openat+0xb4/0x2bc
do_filp_open+0x74/0xd0
do_sys_open+0x14c/0x228
SyS_openat+0x3c/0x48
el0_svc_naked+0x24/0x28
This is because in osq_lock we initialise the node for the current CPU:
node->locked = 0;
node->next = NULL;
node->cpu = curr;
and then publish the current CPU in the lock tail:
old = atomic_xchg_acquire(&lock->tail, curr);
Once the update to lock->tail is visible to another CPU, the node is
then live and can be both read and updated by concurrent lockers.
Unfortunately, the ACQUIRE semantics of the xchg operation mean that
there is no guarantee the contents of the node will be visible before
lock tail is updated. This can lead to lock corruption when, for
example, a concurrent locker races to set the next field.
Fixes:
c55a6ffa6285 ("locking/osq: Relax atomic semantics"):
Reported-by: David Daney <ddaney@caviumnetworks.com>
Reported-by: Andrew Pinski <andrew.pinski@caviumnetworks.com>
Tested-by: Andrew Pinski <andrew.pinski@caviumnetworks.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1449856001-21177-1-git-send-email-will.deacon@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 17 Dec 2015 19:20:13 +0000 (11:20 -0800)]
Merge branch 'libnvdimm-fixes' of git://git./linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
- Two bug fixes for misuse of PAGE_MASK in scatterlist and dma-debug.
These are tagged for -stable. The scatterlist impact is potentially
corrupted dma addresses on HIGHMEM enabled platforms.
- A minor locking fix for the NFIT hot-add implementation that is new
in 4.4-rc. This would only trigger in the case a hot-add raced
driver removal.
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dma-debug: Fix dma_debug_entry offset calculation
Revert "scatterlist: use sg_phys()"
nfit: acpi_nfit_notify(): Do not leave device locked
Hannes Frederic Sowa [Tue, 15 Dec 2015 20:01:53 +0000 (21:01 +0100)]
fou: clean up socket with kfree_rcu
fou->udp_offloads is managed by RCU. As it is actually included inside
the fou sockets, we cannot let the memory go out of scope before a grace
period. We either can synchronize_rcu or switch over to kfree_rcu to
manage the sockets. kfree_rcu seems appropriate as it is used by vxlan
and geneve.
Fixes:
23461551c00628c ("fou: Support for foo-over-udp RX path")
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hubert Sokolowski [Tue, 15 Dec 2015 13:20:30 +0000 (13:20 +0000)]
net: Pass ndm_state to route netlink FDB notifications.
Before this change applications monitoring FDB notifications
were not able to determine whether a new FDB entry is permament
or not:
bridge fdb add f1:f2:f3:f4:f5:f8 dev sw0p1 temp self
bridge fdb add f1:f2:f3:f4:f5:f9 dev sw0p1 self
bridge monitor fdb
f1:f2:f3:f4:f5:f8 dev sw0p1 self permanent
f1:f2:f3:f4:f5:f9 dev sw0p1 self permanent
With this change ndm_state from the original netlink message
is passed to the new netlink message sent as notification.
bridge fdb add f1:f2:f3:f4:f5:f6 dev sw0p1 self
bridge fdb add f1:f2:f3:f4:f5:f7 dev sw0p1 temp self
bridge monitor fdb
f1:f2:f3:f4:f5:f6 dev sw0p1 self permanent
f1:f2:f3:f4:f5:f7 dev sw0p1 self static
Signed-off-by: Hubert Sokolowski <hubert.sokolowski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 Dec 2015 23:33:38 +0000 (18:33 -0500)]
Merge tag 'mac80211-for-davem-2015-12-15' of git://git./linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Another set of fixes:
* memory leak fixes (from Ola)
* operating mode notification spec compliance fix (from Eyal)
* copy rfkill names in case pointer becomes invalid (myself)
* two hardware restart fixes (myself)
* get rid of "limiting TX power" log spam (myself)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Roth [Tue, 15 Dec 2015 02:17:53 +0000 (04:17 +0200)]
82xx: FCC: Fixing a bug causing to FCC port lock-up
The patch fixes FCC port lock-up, which occurs as a result of a bug
during underrun/collision handling. Within the tx_startup() function
in mac-fcc.c, the address of last BD is not calculated correctly.
As a result of wrong calculation of the last BD address, the next
transmitted BD may be set to an area out of the transmit BD ring.
This actually causes to port lock-up and it is not recoverable.
Signed-off-by: Martin Roth <martin.roth@motorolasolutions.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hamish Martin [Tue, 15 Dec 2015 01:14:50 +0000 (14:14 +1300)]
gianfar: Don't enable RX Filer if not supported
After commit
15bf176db1fb ("gianfar: Don't enable the Filer w/o the
Parser"), 'TSEC' model controllers (for example as seen on MPC8541E)
always have 8 bytes stripped from the front of received frames.
Only 'eTSEC' gianfar controllers have the RX Filer capability (amongst
other enhancements). Previously this was treated as always enabled
for both 'TSEC' and 'eTSEC' controllers.
In commit
15bf176db1fb ("gianfar: Don't enable the Filer w/o the Parser")
a subtle change was made to the setting of 'uses_rxfcb' to effectively
always set it (since 'rx_filer_enable' was always true). This had the
side-effect of always stripping 8 bytes from the front of received frames
on 'TSEC' type controllers.
We now only enable the RX Filer capability on controller types that
support it, thereby avoiding the issue for 'TSEC' type controllers.
Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reviewed-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
Signed-off-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
Reviewed-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Mentz [Wed, 16 Dec 2015 01:38:48 +0000 (17:38 -0800)]
dma-debug: Fix dma_debug_entry offset calculation
dma-debug uses struct dma_debug_entry to keep track of dma coherent
memory allocation requests. The virtual address is converted into a pfn
and an offset. Previously, the offset was calculated using an incorrect
bit mask. As a result, we saw incorrect error messages from dma-debug
like the following:
"DMA-API: exceeded 7 overlapping mappings of cacheline 0x03e00000"
Cacheline 0x03e00000 does not exist on our platform.
Cc: <stable@vger.kernel.org>
Fixes:
0abdd7a81b7e ("dma-debug: introduce debug_dma_assert_idle()")
Signed-off-by: Daniel Mentz <danielmentz@google.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Linus Torvalds [Wed, 16 Dec 2015 18:57:24 +0000 (10:57 -0800)]
Merge branch 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm
Pull ARM fixes from Russell King:
"Further ARM fixes:
- Anson Huang noticed that we were corrupting a register we shouldn't
be during suspend on some CPUs.
- Shengjiu Wang spotted a bug in the 'swp' instruction emulation.
- Will Deacon fixed a bug in the ASID allocator.
- Laura Abbott fixed the kernel permission protection to apply to all
threads running in the system.
- I've fixed two bugs with the domain access control register
handling, one to do with printing an appropriate value at oops
time, and the other to further fix the uaccess_with_memcpy code"
* 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm:
ARM: 8475/1: SWP emulation: Restore original *data when failed
ARM: 8471/1: need to save/restore arm register(r11) when it is corrupted
ARM: fix uaccess_with_memcpy() with SW_DOMAIN_PAN
ARM: report proper DACR value in oops dumps
ARM: 8464/1: Update all mm structures with section adjustments
ARM: 8465/1: mm: keep reserved ASIDs in sync with mm after multiple rollovers
Hannes Frederic Sowa [Mon, 14 Dec 2015 22:30:43 +0000 (23:30 +0100)]
net: fix warnings in 'make htmldocs' by moving macro definition out of field declaration
Docbook does not like the definition of macros inside a field declaration
and adds a warning. Move the definition out.
Fixes:
79462ad02e86180 ("net: add validation for the socket syscall protocol argument")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 16 Dec 2015 08:45:54 +0000 (16:45 +0800)]
rhashtable: Fix walker list corruption
The commit
ba7c95ea3870fe7b847466d39a049ab6f156aa2c ("rhashtable:
Fix sleeping inside RCU critical section in walk_stop") introduced
a new spinlock for the walker list. However, it did not convert
all existing users of the list over to the new spin lock. Some
continued to use the old mutext for this purpose. This obviously
led to corruption of the list.
The fix is to use the spin lock everywhere where we touch the list.
This also allows us to do rcu_rad_lock before we take the lock in
rhashtable_walk_start. With the old mutex this would've deadlocked
but it's safe with the new spin lock.
Fixes:
ba7c95ea3870 ("rhashtable: Fix sleeping inside RCU...")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 Dec 2015 16:09:40 +0000 (11:09 -0500)]
Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
Included changes:
- change my email in MAINTAINERS and Doc files
- create and export list of single hop neighs per interface
- protect CRC in the BLA code by means of its own lock
- minor fixes and code cleanups
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 Dec 2015 15:59:05 +0000 (10:59 -0500)]
Merge branch 'geneve-udp-port-offload'
Anjali Singhai Jain says:
====================
Add support for Geneve udp port offload
This patch series adds new ndo ops for Geneve add/del port, so as
to help offload Geneve tunnel functionalities such as RX checksum,
RSS, filters etc.
i40e driver has been tested with the changes to make sure the offloads
happen.
We do understand that this is not the ideal solution and most likely
will be redone with a more generic offload framework.
But this certainly will enable us to start seeing benefits of the
accelerations for Geneve tunnels.
As a side note, we did find an existing issue in i40e driver where a
service task can modify tunnel data structures with no locks held to
help linearize access. A separate patch will be taking care of that issue.
A question out to the community is regarding the driver Kconfig parameters
for VxLAN and Geneve, it would be ideal to drop those if there is a way
to help resolve vxlan/geneve_get_rx_port symbols while the tunnel modules
are not loaded.
Performance numbers:
With the offloads enable on X722 devices with remote checksum enabled
and no other tuning in terms of cpu governer etc on my test machine:
With offload
Throughput: 5527Mbits/sec with a single thread
%cpu: ~43% per core with 4 threads
Without offload
Throughput: 2364Mbits/sec with a single thread
%cpu: ~99% per core with 4 threads
These numbers will get better for X722 as it is being worked. But
this does bring out the delta in terms of when the stack is notified
with csum_level 1 and CHECKSUM_UNNECESSARY vs not without the RX offload.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Singhai, Anjali [Mon, 14 Dec 2015 20:21:21 +0000 (12:21 -0800)]
i40e: Call geneve_get_rx_port to get the existing Geneve ports
This patch adds a call to geneve_get_rx_port in i40e so that when it
comes up it can learn about the existing geneve tunnels.
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Singhai, Anjali [Mon, 14 Dec 2015 20:21:20 +0000 (12:21 -0800)]
geneve: Add geneve_get_rx_port support
This patch adds an op that the drivers can call into to get existing
geneve ports.
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Singhai, Anjali [Mon, 14 Dec 2015 20:21:19 +0000 (12:21 -0800)]
i40e: Kernel dependency update for i40e to support geneve offload
Update the Kconfig file with dependency for supporting GENEVE tunnel
offloads.
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Singhai, Anjali [Mon, 14 Dec 2015 20:21:18 +0000 (12:21 -0800)]
i40e: geneve tunnel offload support
This patch adds driver hooks to implement ndo_ops to add/del
udp port in the HW to identify GENEVE tunnels.
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Singhai, Anjali [Mon, 14 Dec 2015 20:21:17 +0000 (12:21 -0800)]
geneve: Add geneve udp port offload for ethernet devices
Add ndo_ops to add/del UDP ports to a device that supports geneve
offload.
v2: Comment fix.
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhu Yanjun [Wed, 16 Dec 2015 05:55:04 +0000 (13:55 +0800)]
net: sctp: dynamically enable or disable pf state
As we all know, the value of pf_retrans >= max_retrans_path can
disable pf state. The variables of pf_retrans and max_retrans_path
can be changed by the userspace application.
Sometimes the user expects to disable pf state while the 2
variables are changed to enable pf state. So it is necessary to
introduce a new variable to disable pf state.
According to the suggestions from Vlad Yasevich, extra1 and extra2
are removed. The initialization of pf_enable is added.
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 16 Dec 2015 10:13:14 +0000 (18:13 +0800)]
rhashtable: Enforce minimum size on initial hash table
William Hua <william.hua@canonical.com> wrote:
>
> I wasn't aware there was an enforced minimum size. I simply set the
> nelem_hint in the rhastable_params struct to 1, expecting it to grow as
> needed. This caused a segfault afterwards when trying to insert an
> element.
OK we're doing the size computation before we enforce the limit
on min_size.
---8<---
We need to do the initial hash table size computation after we
have obtained the correct min_size/max_size parameters. Otherwise
we may end up with a hash table whose size is outside the allowed
envelope.
Fixes:
a998f712f77e ("rhashtable: Round up/down min/max_size to...")
Reported-by: William Hua <william.hua@canonical.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Wunderlich [Fri, 11 Sep 2015 16:04:13 +0000 (18:04 +0200)]
batman-adv: lock crc access in bridge loop avoidance
We have found some networks in which nodes were constantly requesting
other nodes BLA claim tables to synchronize, just to ask for that again
once completed. The reason was that the crc checksum of the asked nodes
were out of sync due to missing locking and multiple writes to the same
crc checksum when adding/removing entries. Therefore the asked nodes
constantly reported the wrong crc, which caused repeating requests.
To avoid multiple functions changing a backbone gateways crc entry at
the same time, lock it using a spinlock.
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Tested-by: Alfons Name <AlfonsName@web.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Sven Eckelmann [Wed, 26 Aug 2015 08:31:51 +0000 (10:31 +0200)]
batman-adv: Fix typo 'wether' -> 'whether'
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Sven Eckelmann [Wed, 26 Aug 2015 08:31:50 +0000 (10:31 +0200)]
batman-adv: Use chain pointer when purging fragments
The chain pointer was already created in batadv_frag_purge_orig to make the
checks more readable. Just use the chain pointer everywhere instead of
having the same dereference + array access in the most lines of this
function.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Martin Hundebøll <martin@hundeboll.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Simon Wunderlich [Wed, 26 Aug 2015 14:33:34 +0000 (16:33 +0200)]
batman-adv: unify flags access style in tt global add
This should slightly improve readability
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Simon Wunderlich [Wed, 2 Sep 2015 18:09:56 +0000 (20:09 +0200)]
batman-adv: detect local excess vlans in TT request
If the local representation of the global TT table of one originator has
more VLAN entries than the respective TT update, there is some
inconsistency present. By detecting and reporting this inconsistency,
the global table gets updated and the excess VLAN will get removed in
the process.
Reported-by: Alessandro Bolletta <alessandro@mediaspot.net>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Acked-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Eric Dumazet [Wed, 16 Dec 2015 04:56:44 +0000 (20:56 -0800)]
inet: tcp: fix inetpeer_set_addr_v4()
David Ahern added a vif field in the a4 part of inetpeer_addr struct.
This broke IPv4 TCP fast open client side and more generally tcp metrics
cache, because inetpeer_addr_cmp() is now comparing two u32 instead of
one.
inetpeer_set_addr_v4() needs to properly init vif field, otherwise
the comparison result depends on uninitialized data.
Fixes:
192132b9a034 ("net: Add support for VRFs to inetpeer cache")
Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hannes Frederic Sowa [Tue, 15 Dec 2015 21:59:12 +0000 (22:59 +0100)]
ipv6: automatically enable stable privacy mode if stable_secret set
Bjørn reported that while we switch all interfaces to privacy stable mode
when setting the secret, we don't set this mode for new interfaces. This
does not make sense, so change this behaviour.
Fixes:
622c81d57b392cc ("ipv6: generation of stable privacy addresses for link-local and autoconf")
Reported-by: Bjørn Mork <bjorn@mork.no>
Cc: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 15 Dec 2015 23:33:39 +0000 (15:33 -0800)]
sctp: use GFP_KERNEL in sctp_init()
modules init functions being called from process context, we better
use GFP_KERNEL allocations to increase our chances to get these
high-order pages we want for SCTP hash tables.
This mostly matters if SCTP module is loaded once memory got fragmented.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 Dec 2015 04:26:52 +0000 (23:26 -0500)]
Merge branch 'sock-diag-destroy'
Lorenzo Colitti says:
====================
Support administratively closing application sockets
This patchset adds the ability to administratively close a socket
without any action from the process owning the socket or the
socket protocol.
It implements this by adding a new diag_destroy function pointer
to struct proto. In-kernel callers can access this functionality
directly by calling sk->sk_prot->diag_destroy(sk, err).
It also exposes this functionality to userspace via a new
SOCK_DESTROY operation in the NETLINK_SOCK_DIAG sockets. This
allows a privileged userspace process, such as a connection
manager or system administration tool, to close sockets belonging
to other apps when the network they were established on has
disconnected. It is needed on laptops and mobile hosts to ensure
that network switches / disconnects do not result in applications
being blocked for long periods of time (minutes) in read or
connect calls on TCP sockets that will never succeed because the
IP address they are bound to is no longer on the system. Closing
the sockets causes these calls to fail fast and allows the apps
to reconnect on another network.
Userspace intervention is necessary because in many cases the
kernel does not have enough information to know that a connection
is now inoperable. The kernel can know if a packet can't be
routed, but in general it won't know if a TCP connection is stuck
because it is now routed to a network where its source address is
no longer valid [5][6].
Many other operating systems offer similar functionality:
- FreeBSD has had this since 5.4 in 2005 [2]. It is available
to privileged userspace and there is a tool to use it [3].
- The FreeBSD commit description states that the idea came
from OpenBSD.
- iOS has been administratively closing app sockets since
iOS 4 - see [4], which states that a socket "might get
reclaimed by the kernel" and after that will return EBADF].
For many years Android kernels have supported this via an
out-of-tree SIOCKILLADDR ioctl that is called on every
RTM_DELADDR event, but this solution is cleaner, more robust
and more flexible: the connection manager can iterate over all
connections on the deleted IP address and close all of them.
It can also be used to close all sockets opened by a given app
process, for example if the user has restricted that app from
using the network, if a secure network such as a VPN has
connected and security policy requires all of an application's
connections to be routed via the VPN, etc.
- For many years Android kernels have supported an out-of-tree
SIOCKILLADDR ioctl that is called when a network disconnects
or an RTM_DELADDR event is received. This solution is cleaner,
more robust and more flexible. The connection manager can
implement SIOCKILLADDR by iterating over all connections on
the deleted IP address and close all of them, but it can also
close all sockets opened by a given app process (for example
if the user has restricted that app from), close all of a
user's TCP connections if a user has connected a secure
network such as a VPN and expects all of an application's
connections to be routed via the VPN, etc.
Alternative schemes such as TCP keepalives in combination with
"iptables -j REJECT --reject-with tcp-reset", could be used to
achieve similar results, but on mobile devices TCP keepalives are
very expensive, and in such a scheme detecting stuck connections
has to wait for a keepalive to be sent or the application to
perform a write. An explicit notification from userspace is
cheaper and faster in the common case where an application is
blocked on read.
SOCK_DESTROY is placed behind an INET_DIAG_DESTROY configuration
option, which is currently off by default.
The TCP implementation of diag_destroy causes a TCP ABORT as
specified by RFC 793 [1]: immediately send a RST and clear local
connection state. This is what happens today if an application
enables SO_LINGER with a timeout of 0 and then calls close.
The first versions of the patchset did not send a RST, but that
is not graceful/correct TCP behaviour. tcp_abort now does a
proper RFC 793 ABORT and sends a RST to the peer. This is
consistent with BSD's tcpdrop, and is more correct in general,
even though in many use cases tcp_abort will only be called when
sending a RST is no longer possible (e.g., the network has
disconnected).
The original patchset also behaved like SIOCKILADDR and closed
TCP sockets with ETIMEDOUT. Tom Herbert pointed out that it would
be better if applications could distinguish between a timeout and
an administrative close. ECONNABORTED was chosen because it is
consistent with BSD.
[1] http://tools.ietf.org/html/rfc793#page-50
[2] http://svnweb.freebsd.org/base?view=revision&revision=141381
[3] https://www.freebsd.org/cgi/man.cgi?query=tcpdrop&sektion=8&manpath=FreeBSD+5.4-RELEASE
[4] https://developer.apple.com/library/ios/technotes/tn2277/_index.html#//apple_ref/doc/uid/DTS40010841-CH1-SUBSECTION3
[5] http://www.spinics.net/lists/netdev/msg352775.html
[6] http://www.spinics.net/lists/netdev/msg352952.html
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Colitti [Wed, 16 Dec 2015 03:30:05 +0000 (12:30 +0900)]
net: diag: Support destroying TCP sockets.
This implements SOCK_DESTROY for TCP sockets. It causes all
blocking calls on the socket to fail fast with ECONNABORTED and
causes a protocol close of the socket. It informs the other end
of the connection by sending a RST, i.e., initiating a TCP ABORT
as per RFC 793. ECONNABORTED was chosen for consistency with
FreeBSD.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Colitti [Wed, 16 Dec 2015 03:30:04 +0000 (12:30 +0900)]
net: diag: Support SOCK_DESTROY for inet sockets.
This passes the SOCK_DESTROY operation to the underlying protocol
diag handler, or returns -EOPNOTSUPP if that handler does not
define a destroy operation.
Most of this patch is just renaming functions. This is not
strictly necessary, but it would be fairly counterintuitive to
have the code to destroy inet sockets be in a function whose name
starts with inet_diag_get.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Colitti [Wed, 16 Dec 2015 03:30:03 +0000 (12:30 +0900)]
net: diag: Add the ability to destroy a socket.
This patch adds a SOCK_DESTROY operation, a destroy function
pointer to sock_diag_handler, and a diag_destroy function
pointer. It does not include any implementation code.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Colitti [Wed, 16 Dec 2015 03:30:02 +0000 (12:30 +0900)]
net: diag: split inet_diag_dump_one_icsk into two
Currently, inet_diag_dump_one_icsk finds a socket and then dumps
its information to userspace. Split it into a part that finds the
socket and a part that dumps the information.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 Dec 2015 04:25:20 +0000 (23:25 -0500)]
Merge branch 'ila-early-demux'
Tom Herbert says:
====================
ila: Optimization to preserve value of early demux
In the current implementation of ILA, LWT is used to perform
translation on both the input and output paths. This is functional,
however there is a big performance hit in the receive path. Early
demux occurs before the routing lookup (a hit actually obviates the
route lookup). Therefore the stack currently performs early
demux before translation so that a local connection with ILA
addresses is never matched. Note that this issue is not just
with ILA, but pretty much any translated or encapsulated packet
handled by LWT would miss the opportunity for early demux. Solving
the general problem seems non trivial since we would need to move
the route lookup before early demx thereby mitigating the value.
This patch set addresses the issue for ILA by adding a fast locator
lookup that occurs before early demux. This done by hooking in to
NF_INET_PRE_ROUTING
For the backend we implement an rhashtable that contains identifier
to locator to mappings. The table also allows more specific matches
that include original locator and interface.
This patch set:
- Add an rhashtable function to atomically replace and element.
This is useful to implement sub-trees from a table entry
without needing to use a special anchor structure as the
table entry.
- Add a start callback for starting a netlink dump.
- Creates an ila directory under net/ipv6 and moves ila.c to it.
ila.c is split into ila_common.c and ila_lwt.c.
- Implement a table to do identifier->locator mapping. This is
an rhashtable (in ila_xlat.c).
- Configuration for the table with netlink.
- Add a hook into NF_INET_PRE_ROUTING to perform ILA translation
before early demux.
Changes in v2:
- Use iptables targets instead of a new xfrm function
Changes in v3:
- Add __rcu to next pointer in struct ila_map
Changes in v4:
- Use hook for NF_INET_PRE_ROUTING
Changed in v5:
- Register hooks per namespace using nf_register_net_hooks
- Only register hooks when first mapping is actually added
Changed in v6:
- Remove gfp argument in alloc_ila_locks, it is unnecessary
- Set registered_hooks properly when hooks are registered
Testing:
Running 200 netperf TCP_RR streams
No ILA, baseline
79.26% CPU utilization
1678282 tps
104/189/390 50/90/99% latencies
ILA before fix (LWT on both input and output)
81.91% CPU utilization
1464723 tps (-14.5% from baseline)
121/215/411 50/90/99% latencies
ILA after fix
80.62% CPU utilization
1622985 (-3.4% from baseline)
110/191/347 50/90/99% latencies
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 15 Dec 2015 23:41:38 +0000 (15:41 -0800)]
ila: Add generic ILA translation facility
This patch implements an ILA tanslation table. This table can be
configured with identifier to locator mappings, and can be be queried
to resolve a mapping. Queries can be parameterized based on interface,
direction (incoming or outoing), and matching locator. The table is
implemented using rhashtable and is configured via netlink (through
"ip ila .." in iproute).
The table may be used as alternative means to do do ILA tanslations
other than the lw tunnels
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 15 Dec 2015 23:41:37 +0000 (15:41 -0800)]
netlink: add a start callback for starting a netlink dump
The start callback allows the caller to set up a context for the
dump callbacks. Presumably, the context can then be destroyed in
the done callback.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 15 Dec 2015 23:41:36 +0000 (15:41 -0800)]
rhashtable: add function to replace an element
Add the rhashtable_replace_fast function. This replaces one object in
the table with another atomically. The hashes of the new and old objects
must be equal.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 15 Dec 2015 23:41:35 +0000 (15:41 -0800)]
ila: Create net/ipv6/ila directory
Create ila directory in preparation for supporting other hooks in the
kernel than LWT for doing ILA. This includes:
- Moving ila.c to ila/ila_lwt.c
- Splitting out some common functions into ila_common.c
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 Dec 2015 04:21:05 +0000 (23:21 -0500)]
Merge branch 'stmmac-mdio-compat'
Merge branch 'stmmac-mdio-compat'
Phil Reid says:
====================
stmmac: create of compatible mdio bus for stmacc driver
Provide ability to specify a fixed phy in the device tree and
retain the mdio bus if no phy is found. This is needed where
a dsa is connected via a fixed phy and uses the mdio bus for config.
Fixed ptp ref clock calculatins for the stmmac when ptp ref clock
is running at <= 50Mhz. Also add device tree setting to config
ptp clk source on socfpga platforms.
Changes from V5:
- Restore behaviour of unregister mdio bus when no phys found
if there is no device tree node create the bus.
- Modify condition to allocate mdio_base_data conditional
on fixed phy presece as well. Maintains existing behaviour
in conditions where a fixed phy is not present.
Changes from V4:
- Restore #ifdef CONFIG_OF around setting of reset_gpio.
Member doesn't exist when this isn't defined.
Changes from V3:
- Use if (IS_ENABLED(CONFIG_OF)) instead of #if.
Reorder some code to reduce if statements.
- of_mdiobus_register already falls back to mdiobus_register
- Tested on system with CONFIG_OF
Changes from V2:
- Formatting, spaces & lines > 80 chars. Using checkpatch
- Drop PTP register debugfs patch.
Changes from V1:
- Fixed mismatch doc / code for ptp_ref_clk dt node.
- Remove unit address from doc example.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Phil Reid [Mon, 14 Dec 2015 03:32:02 +0000 (11:32 +0800)]
stmmac: socfpga: Provide dt node to config ptp clk source.
Provides an options to use the ptp clock routed from the Altera FPGA
fabric. Instead of the defalt eosc1 clock connected to the ARM HPS core.
This setting affects all emacs in the core as the ptp clock is common.
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Phil Reid <preid@electromag.com.au>
Acked-by: Dinh Nguyen <dinguyen@opensource.altera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Phil Reid [Mon, 14 Dec 2015 03:32:01 +0000 (11:32 +0800)]
stmmac: Fix calculations for ptp counters when clock input = 50Mhz.
stmmac_config_sub_second_increment set the sub second increment to 20ns.
Driver is configured to use the fine adjustment method where the sub second
register is incremented when the acculumator incremented by the addend
register wraps overflows. This accumulator is update on every ptp clk
cycle. If a ptp clk with a period of greater than 20ns was used the
sub second register would not get updated correctly.
Instead set the sub sec increment to twice the period of the ptp clk.
This result in the addend register being set mid range and overflow
the accumlator every 2 clock cycles.
Signed-off-by: Phil Reid <preid@electromag.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Phil Reid [Mon, 14 Dec 2015 03:32:00 +0000 (11:32 +0800)]
stmmac: Correct documentation on stmmac clocks.
devm_get_clk looks in clock-name property for matching clock.
the ptp_ref_clk property is ignored.
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Phil Reid <preid@electromag.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Phil Reid [Mon, 14 Dec 2015 03:31:59 +0000 (11:31 +0800)]
stmmac: create of compatible mdio bus for stmmac driver
The DSA driver needs to be passed a reference to an mdio bus. Typically
the mac is configured to use a fixed link but the mdio bus still needs
to be registered so that it con configure the switch.
This patch follows the same process as the altera tse ethernet driver for
creation of the mdio bus.
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Phil Reid <preid@electromag.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 15 Dec 2015 21:50:28 +0000 (16:50 -0500)]
Merge branch 'end-of-ip-csum'
Tom Herbert says:
====================
net: The beginning of the end for NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM
Background:
This patch set starts to address one front in the battle against
protocol ossification. Protocol ossification describes the state
that we have arrived at in the evolution of the Internet where we are
materially limited to only using a very narrow range of protocols
and protocol features. For instance, only TCP and UDP is sufficiently
supported on the Internet so that deploying alternative protocols,
such as SCTP and DCCP, are non-starters. Similarly, IP options and IPv6
extension headers are typically not considered feasible for wide
deployment, so we have loss the extensibility of IP protocols.
Protocol ossification is not only a problem on the Internet, but in
the data center as well. A root cause of this seems to be narrow,
protocol specific optimizations implemented in switches (for doing
EMCP) and in NICs (NIC offloads). These tend to be performance
optimization around TCP and UDP packets, and these have become
requirements to implement performant network solutions at scale.
Attempts to deal with protocol ossification in data center have yielded
ad hoc, sub-optimal solutions. A main driver of foo-over-UDP (e.g.
GRE/UDP, MPLS/UDP) is to leverage the existing EMCP and RSS support for
UDP by setting the source port as an entropy value. This has seen some
success, but the cost of additional overhead and layering limits its
usefulness. An even more extreme solution is STT where non-TCP packets
are spoofed as TCP to leverage NIC offloads.
This patch set endeavours to address protocol ossification caused by
techniques used in transmit checksum offload for NICs. Future work
will address protocol ossification in the other primary NIC offloads--
namely receive checksum offload, LSO, LRO, and RSS.
NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM:
NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM exemplify the problem of protocol
ossification. These features are relics from a simpler time in the
Internet, before encapsulation, before GRE and IPIP. Many hardware
vendors only saw the need to provide checksum offload for simple UDP and
TCP packets over IPv4 (IPv6 support is an afterthought also). In today's
Internet and data centers, checksum offload is well established as a
valuable feature, but we can no longer afford to be contsrained to
use a handful of protocols and features that are supported at the
discretion of NIC vendors. Generic and protocol agnostic methods are
needed.
The actual interface that the stack uses with drivers for checksum
offload is CHECKSUM_PARTIAL. This is a generic and protocol agnostic
interface. A driver for a device that supports this generic
interface advertises NETIF_F_HW_CSUM.
Goals of this patch set:
We propose that drivers advertise NETIF_F_HW_CSUM instead of protocol
specific values of NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM. If the
driver's device is constrained (for instance it can only offlaod simple
IPv4 and IPv6 packets) then these constraints can be checked in the
transmit path and skb_checksum_help would be called for packets that the
driver is unable to offload. In order to facilitate this, we add some
helper functions that takes a specification argument indicating the
type of packets a device is able to offload. If a packet does not match
the specification, the helper function calls skb_checksum_help.
Benefits of this approach are:
- Simplify the stack and clarify the interface for checksum offload
- Encourage NIC vendors to implement the generic. protocol agnostic
checksum offload methods in hardware
- Encourage feature parity in NIC offloads for IPv4 and IPv6
Many drivers advertise NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM and it
probably isn't feasible to convert them all in a given time frame
(although if we could this would be a great simplification to the
stack). A reasonable direction may be to declare that new drivers must
use NETIF_F_HW_CSUM as NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM are
considered deprecated.
There is a class of drivers that should now be converted to advertise
NETIF_F_HW_CSUM, namely those that support offload of ecapsulated
checksums. These drivers have to date been using skb->encapsulation
to infer that checksum offload is being performed for an encapsulated
checksum. This is strictly not correct. skb->encapsulation
indicates that the inner headers are valid in the skbuff, whereas
the stack indicates checksum offload arguments exclusively in csum_start
and csum_offset. At some point we may want to set the inner headers for
an skbuff but offload the outer transport checksum, so this needs to be
fixed.
In this patch set:
- Rename some of constants involved in checksum offload to be more
reflective of their function
- Eliminate NETIF_F_GEN_CSUM and NETIF_F_V[46]_CSUM entirely as
unnecessary convolutions
- Fix conditions in tcp_sendpage and tcp_sendmsg to take IP protocol
into account when determining if checksum offload can be done
- Add driver helper functions for determining if a checksum can
be offloaded to a device. If not, the helper function can call
skb_checksum_help
- Document the checksum offload interface between the stack and
drivers with detail and specifics
Testing:
Have been testing ixgbe and mlx4. No noticeable regressions seen yet.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 14 Dec 2015 19:19:47 +0000 (11:19 -0800)]
net: Elaborate on checksum offload interface description
Add specifics and details the description of the interface between
the stack and drivers for doing checksum offload. This description
is meant to be as specific and complete as possible.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 14 Dec 2015 19:19:46 +0000 (11:19 -0800)]
net: Add driver helper functions to determine checksum offloadability
Add skb_csum_offload_chk driver helper function to determine if a
device with limited checksum offload capabilities is able to offload the
checksum for a given packet.
This patch includes:
- The skb_csum_offload_chk function. Returns true if checksum is
offloadable, else false. Optionally, in the case that the checksum
is not offloable, the function can call skb_checksum_help to resolve
the checksum. skb_csum_offload_chk also returns whether the checksum
refers to an encapsulated checksum.
- Definition of skb_csum_offl_spec structure that caller uses to
indicate rules about what it can offload (e.g. IPv4/v6, TCP/UDP only,
whether encapsulated checksums can be offloaded, whether checksum with
IPv6 extension headers can be offloaded).
- Ancilary functions called skb_csum_offload_chk_help,
skb_csum_off_chk_help_cmn, skb_csum_off_chk_help_cmn_v4_only.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 14 Dec 2015 19:19:45 +0000 (11:19 -0800)]
tcp: Fix conditions to determine checksum offload
In tcp_send_sendpage and tcp_sendmsg we check the route capabilities to
determine if checksum offload can be performed. This check currently
does not take the IP protocol into account for devices that advertise
only one of NETIF_F_IPV6_CSUM or NETIF_F_IP_CSUM. This patch adds a
function to check capabilities for checksum offload with a socket
called sk_check_csum_caps. This function checks for specific IPv4 or
IPv6 offload support based on the family of the socket.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 14 Dec 2015 19:19:44 +0000 (11:19 -0800)]
net: Eliminate NETIF_F_GEN_CSUM and NETIF_F_V[46]_CSUM
These netif flags are unnecessary convolutions. It is more
straightforward to just use NETIF_F_HW_CSUM, NETIF_F_IP_CSUM,
and NETIF_F_IPV6_CSUM directly.
This patch also:
- Cleans up can_checksum_protocol
- Simplifies netdev_intersect_features
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 14 Dec 2015 19:19:43 +0000 (11:19 -0800)]
net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK
The name NETIF_F_ALL_CSUM is a misnomer. This does not correspond to the
set of features for offloading all checksums. This is a mask of the
checksum offload related features bits. It is incorrect to set both
NETIF_F_HW_CSUM and NETIF_F_IP_CSUM or NETIF_F_IPV6 at the same time for
features of a device.
This patch:
- Changes instances of NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK (where
NETIF_F_ALL_CSUM is being used as a mask).
- Changes bonding, sfc/efx, ipvlan, macvlan, vlan, and team drivers to
use NEITF_F_HW_CSUM in features list instead of NETIF_F_ALL_CSUM.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 14 Dec 2015 19:19:42 +0000 (11:19 -0800)]
fcoe: Use CHECKSUM_PARTIAL to indicate CRC offload
When setting up CRC offload set ip_summed to CHECKSUM_PARTIAL
instead of CHECKSUM_UNNECESSARY. This is consistent with the
definition of CHECKSUM_PARTIAL.
The only driver that seems to be advertising NETIF_F_FCOE_CRC is
ixgbe. AFICT the driver does not look at ip_summed for FCOE and
just assumes that CRC is being offloaded.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 14 Dec 2015 19:19:41 +0000 (11:19 -0800)]
sctp: Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC
The SCTP checksum is really a CRC and is very different from the
standards 1's complement checksum that serves as the checksum
for IP protocols. This offload interface is also very different.
Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC to highlight these
differences. The term CSUM should be reserved in the stack to refer
to the standard 1's complement IP checksum.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 14 Dec 2015 19:19:40 +0000 (11:19 -0800)]
net: Add skb_inner_transport_offset function
Same thing as skb_transport_offset but returns the offset of the inner
transport header (when skb->encpasulation is set).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Williams [Tue, 15 Dec 2015 20:54:06 +0000 (12:54 -0800)]
Revert "scatterlist: use sg_phys()"
commit
db0fa0cb0157 "scatterlist: use sg_phys()" did replacements of
the form:
phys_addr_t phys = page_to_phys(sg_page(s));
phys_addr_t phys = sg_phys(s) & PAGE_MASK;
However, this breaks platforms where sizeof(phys_addr_t) >
sizeof(unsigned long). Revert for 4.3 and 4.4 to make room for a
combined helper in 4.5.
Cc: <stable@vger.kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes:
db0fa0cb0157 ("scatterlist: use sg_phys()")
Suggested-by: Joerg Roedel <joro@8bytes.org>
Reported-by: Vitaly Lavrov <vel21ripn@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
tadeusz.struk@intel.com [Tue, 15 Dec 2015 18:46:17 +0000 (10:46 -0800)]
net: fix uninitialized variable issue
msg_iocb needs to be initialized on the recv/recvfrom path.
Otherwise afalg will wrongly interpret it as an async call.
Cc: stable@vger.kernel.org
Reported-by: Harald Freudenberger <freude@linux.vnet.ibm.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 15 Dec 2015 20:39:08 +0000 (15:39 -0500)]
bluetooth: Validate socket address length in sco_sock_bind().
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Tue, 15 Dec 2015 18:56:39 +0000 (10:56 -0800)]
Merge tag 'dmaengine-fix-4.4-rc6' of git://git.infradead.org/users/vkoul/slave-dma
Pull dmaengine fixes from Vinod Koul:
"This has fixes spread thru driver, notably among them:
- edma fixes for recent edma DT changes which went into 4.4
- odd fixes for at_hdmac
- minor fixes on bc dma and mic dma"
* tag 'dmaengine-fix-4.4-rc6' of git://git.infradead.org/users/vkoul/slave-dma:
dmaengine: at_xdmac: fix at_xdmac_prep_dma_memcpy()
dmaengine: edma: DT: Change reserved slot array from 16bit to 32bit type
dmaengine: edma: DT: Change memcpy channel array from 16bit to 32bit type
dmaengine: mic_x100: add missing spin_unlock
dmaengine: bcm2835-dma: Convert to use DMA pool
dmaengine: at_xdmac: fix bad behavior in interleaved mode
dmaengine: at_xdmac: fix false condition for memset_sg transfers
dmaengine: at_xdmac: fix macro typo