Baruch Siach [Thu, 10 Nov 2016 11:21:42 +0000 (13:21 +0200)]
net: bpqether.h: remove if_ether.h guard
__LINUX_IF_ETHER_H is not defined anywhere, and if_ether.h can keep itself from
double inclusion, though it uses a single underscore prefix.
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 10 Nov 2016 00:04:46 +0000 (16:04 -0800)]
net: __skb_flow_dissect() must cap its return value
After Tom patch, thoff field could point past the end of the buffer,
this could fool some callers.
If an skb was provided, skb->len should be the upper limit.
If not, hlen is supposed to be the upper limit.
Fixes:
a6e544b0a88b ("flow_dissector: Jump to exit code in __skb_flow_dissect")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yibin Yang <yibyang@cisco.com
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 13 Nov 2016 04:38:08 +0000 (23:38 -0500)]
Merge branch 'fix-bpf_redirect'
Martin KaFai Lau says:
====================
bpf: Fix bpf_redirect to an ipip/ip6tnl dev
This patch set fixes a bug in bpf_redirect(dev, flags) when dev is an
ipip/ip6tnl. The current problem is IP-EthHdr-IP is sent out instead of
IP-IP.
Patch 1 adds a dev->type test similar to dev_is_mac_header_xmit()
in act_mirred.c which is only available in net-next. We can consider to
refactor it once this patch is pulled into net-next from net.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Wed, 9 Nov 2016 23:36:34 +0000 (15:36 -0800)]
bpf: Add test for bpf_redirect to ipip/ip6tnl
The test creates two netns, ns1 and ns2. The host (the default netns)
has an ipip or ip6tnl dev configured for tunneling traffic to the ns2.
ping VIPS from ns1 <----> host <--tunnel--> ns2 (VIPs at loopback)
The test is to have ns1 pinging VIPs configured at the loopback
interface in ns2.
The VIPs are 10.10.1.102 and 2401:face::66 (which are configured
at lo@ns2). [Note: 0x66 => 102].
At ns1, the VIPs are routed _via_ the host.
At the host, bpf programs are installed at the veth to redirect packets
from a veth to the ipip/ip6tnl. The test is configured in a way so
that both ingress and egress can be tested.
At ns2, the ipip/ip6tnl dev is configured with the local and remote address
specified. The return path is routed to the dev ipip/ip6tnl.
During egress test, the host also locally tests pinging the VIPs to ensure
that bpf_redirect at egress also works for the direct egress (i.e. not
forwarding from dev ve1 to ve2).
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Wed, 9 Nov 2016 23:36:33 +0000 (15:36 -0800)]
bpf: Fix bpf_redirect to an ipip/ip6tnl dev
If the bpf program calls bpf_redirect(dev, 0) and dev is
an ipip/ip6tnl, it currently includes the mac header.
e.g. If dev is ipip, the end result is IP-EthHdr-IP instead
of IP-IP.
The fix is to pull the mac header. At ingress, skb_postpull_rcsum()
is not needed because the ethhdr should have been pulled once already
and then got pushed back just before calling the bpf_prog.
At egress, this patch calls skb_postpull_rcsum().
If bpf_redirect(dev, BPF_F_INGRESS) is called,
it also fails now because it calls dev_forward_skb() which
eventually calls eth_type_trans(skb, dev). The eth_type_trans()
will set skb->type = PACKET_OTHERHOST because the mac address
does not match the redirecting dev->dev_addr. The PACKET_OTHERHOST
will eventually cause the ip_rcv() errors out. To fix this,
____dev_forward_skb() is added.
Joint work with Daniel Borkmann.
Fixes:
cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel")
Fixes:
8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 10 Nov 2016 18:02:25 +0000 (13:02 -0500)]
Merge branch 'mlxsw-fixes'
Jiri Pirko says:
====================
mlxsw: Couple of router fixes
v1->v2:
- patch2:
- use net_eq
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 10 Nov 2016 11:31:05 +0000 (12:31 +0100)]
mlxsw: spectrum_router: Ignore FIB notification events for non-init namespaces
Since now, the table with same id in multiple netnamespaces were squashed
to a single virtual router. That is not only incorrect, it also causes
error messages when trying to use RALUE register to do double remove
of FIB entries, like this one:
mlxsw_spectrum 0000:03:00.0: EMAD reg access failed (tid=
facb831c00007b20,reg_id=8013(ralue),type=write,status=7(bad parameter))
Since we don't allow ports to change namespaces (NETIF_F_NETNS_LOCAL),
and the infrastructure is not yet prepared to handle netnamespaces, just
ignore FIB notification events for non-init namespaces. That is clear to
do since we don't need to offload them.
Fixes:
b45f64d16d45 ("mlxsw: spectrum_router: Use FIB notifications instead of switchdev calls")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 10 Nov 2016 11:31:04 +0000 (12:31 +0100)]
mlxsw: spectrum_router: Fix handling of neighbour structure
__neigh_create function works in a different way than assumed.
It passes "n" as a parameter to ndo_neigh_construct. But this "n" might
be destroyed right away before __neigh_create() returns in case there is
already another neighbour struct in the hashtable with the same dev and
primary key. That is not expected by mlxsw_sp_router_neigh_construct()
and the stored "n" points to freed memory, eventually leading to crash.
Fix this by doing tight 1:1 coupling between neighbour struct and
internal driver neigh_entry. That allows to narrow down the key in
internal driver hashtable to do lookups by "n" only.
Fixes:
6cf3c971dc84 ("mlxsw: spectrum_router: Add private neigh table")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 10 Nov 2016 17:55:26 +0000 (12:55 -0500)]
Merge branch 'qed-fixes'
Yuval Mintz says:
====================
qed: Fix RoCE infrastructure
This series fixes 2 basic issues with RoCE support,
one handles a missing configuration in the initial infrastructure
support while the other is a regression introduced by one of the
initial fix submissions.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ram Amrani [Wed, 9 Nov 2016 20:48:44 +0000 (22:48 +0200)]
qed: Correct rdma params configuration
Previous fix has broken RoCE support as the rdma_pf_params are now
being set into the parameters only after the params are alrady assigned
into the hw-function.
Fixes:
0189efb8f4f8 ("qed*: Fix Kconfig dependencies with INFINIBAND_QEDR")
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ram Amrani [Wed, 9 Nov 2016 20:48:43 +0000 (22:48 +0200)]
qed: configure ll2 RoCE v1/v2 flavor correctly
Currently RoCE v2 won't operate with RDMA CM due to missing setting of
the roce-flavour in the ll2 configuration.
This patch properly sets the flavour, and deletes incorrect HSI
that doesn't [yet] exist.
Fixes:
abd49676c707 ("qed: Add RoCE ll2 & GSI support")
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lance Richardson [Wed, 9 Nov 2016 20:04:39 +0000 (15:04 -0500)]
ipv4: update comment to document GSO fragmentation cases.
This is a follow-up to commit
9ee6c5dc816a ("ipv4: allow local
fragmentation in ip_finish_output_gso()"), updating the comment
documenting cases in which fragmentation is needed for egress
GSO packets.
Suggested-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Wed, 9 Nov 2016 17:07:26 +0000 (09:07 -0800)]
net: tcp response should set oif only if it is L3 master
Lorenzo noted an Android unit test failed due to
e0d56fdd7342:
"The expectation in the test was that the RST replying to a SYN sent to a
closed port should be generated with oif=0. In other words it should not
prefer the interface where the SYN came in on, but instead should follow
whatever the routing table says it should do."
Revert the change to ip_send_unicast_reply and tcp_v6_send_response such
that the oif in the flow is set to the skb_iif only if skb_iif is an L3
master.
Fixes:
e0d56fdd7342 ("net: l3mdev: remove redundant calls")
Reported-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allan Chou [Tue, 8 Nov 2016 22:08:01 +0000 (16:08 -0600)]
Net Driver: Add Cypress GX3 VID=04b4 PID=3610.
Add support for Cypress GX3 SuperSpeed to Gigabit Ethernet
Bridge Controller (Vendor=04b4 ProdID=3610).
Patch verified on x64 linux kernel 4.7.4, 4.8.6, 4.9-rc4 systems
with the Kensington SD4600P USB-C Universal Dock with Power,
which uses the Cypress GX3 SuperSpeed to Gigabit Ethernet Bridge
Controller.
A similar patch was signed-off and tested-by Allan Chou
<allan@asix.com.tw> on 2015-12-01.
Allan verified his similar patch on x86 Linux kernel 4.1.6 system
with Cypress GX3 SuperSpeed to Gigabit Ethernet Bridge Controller.
Tested-by: Allan Chou <allan@asix.com.tw>
Tested-by: Chris Roth <chris.roth@usask.ca>
Tested-by: Artjom Simon <artjom.simon@gmail.com>
Signed-off-by: Allan Chou <allan@asix.com.tw>
Signed-off-by: Chris Roth <chris.roth@usask.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 10 Nov 2016 01:38:18 +0000 (20:38 -0500)]
Merge git://git./pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains a larger than usual batch of Netfilter
fixes for your net tree. This series contains a mixture of old bugs and
recently introduced bugs, they are:
1) Fix a crash when using nft_dynset with nft_set_rbtree, which doesn't
support the set element updates from the packet path. From Liping
Zhang.
2) Fix leak when nft_expr_clone() fails, from Liping Zhang.
3) Fix a race when inserting new elements to the set hash from the
packet path, also from Liping.
4) Handle segmented TCP SIP packets properly, basically avoid that the
INVITE in the allow header create bogus expectations by performing
stricter SIP message parsing, from Ulrich Weber.
5) nft_parse_u32_check() should return signed integer for errors, from
John Linville.
6) Fix wrong allocation instead of connlabels, allocate 16 instead of
32 bytes, from Florian Westphal.
7) Fix compilation breakage when building the ip_vs_sync code with
CONFIG_OPTIMIZE_INLINING on x86, from Arnd Bergmann.
8) Destroy the new set if the transaction object cannot be allocated,
also from Liping Zhang.
9) Use device to route duplicated packets via nft_dup only when set by
the user, otherwise packets may not follow the right route, again
from Liping.
10) Fix wrong maximum genetlink attribute definition in IPVS, from
WANG Cong.
11) Ignore untracked conntrack objects from xt_connmark, from Florian
Westphal.
12) Allow to use conntrack helpers that are registered NFPROTO_UNSPEC
via CT target, otherwise we cannot use the h.245 helper, from
Florian.
13) Revisit garbage collection heuristic in the new workqueue-based
timer approach for conntrack to evict objects earlier, again from
Florian.
14) Fix crash in nf_tables when inserting an element into a verdict map,
from Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Mathias Krause [Mon, 7 Nov 2016 22:22:19 +0000 (23:22 +0100)]
rtnl: reset calcit fptr in rtnl_unregister()
To avoid having dangling function pointers left behind, reset calcit in
rtnl_unregister(), too.
This is no issue so far, as only the rtnl core registers a netlink
handler with a calcit hook which won't be unregistered, but may become
one if new code makes use of the calcit hook.
Fixes:
c7ac8679bec9 ("rtnetlink: Compute and store minimum ifinfo...")
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Mon, 7 Nov 2016 21:09:07 +0000 (22:09 +0100)]
vxlan: hide unused local variable
A bugfix introduced a harmless warning in v4.9-rc4:
drivers/net/vxlan.c: In function 'vxlan_group_used':
drivers/net/vxlan.c:947:21: error: unused variable 'sock6' [-Werror=unused-variable]
This hides the variable inside of the same #ifdef that is
around its user. The extraneous initialization is removed
at the same time, it was accidentally introduced in the
same commit.
Fixes:
c6fcc4fc5f8b ("vxlan: avoid using stale vxlan socket.")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Allen [Mon, 7 Nov 2016 20:27:28 +0000 (14:27 -0600)]
ibmvnic: Start completion queue negotiation at server-provided optimum values
Use the opt_* fields to determine the starting point for negotiating the
number of tx/rx completion queues with the vnic server. These contain the
number of queues that the vnic server estimates that it will be able to
allocate. While renegotiation may still occur, using the opt_* fields will
reduce the number of times this needs to happen and will prevent driver
probe timeout on systems using large numbers of ibmvnic client devices per
vnic port.
Signed-off-by: John Allen <jallen@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Mon, 7 Nov 2016 20:03:09 +0000 (12:03 -0800)]
net: icmp_route_lookup should use rt dev to determine L3 domain
icmp_send is called in response to some event. The skb may not have
the device set (skb->dev is NULL), but it is expected to have an rt.
Update icmp_route_lookup to use the rt on the skb to determine L3
domain.
Fixes:
613d09b30f8b ("net: Use VRF device index for lookups on TX")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 9 Nov 2016 23:45:36 +0000 (18:45 -0500)]
Merge branch 'qcom-emac-pause'
Timur Tabi says:
====================
net: qcom/emac: ensure that pause frames are enabled
The qcom emac driver experiences significant packet loss (through frame
check sequence errors) if flow control is not enabled and the phy is
not configured to allow pause frames to pass through it. Therefore, we
need to enable flow control and force the phy to pass pause frames.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Timur Tabi [Mon, 7 Nov 2016 16:51:41 +0000 (10:51 -0600)]
net: qcom/emac: enable flow control if requested
If the PHY has been configured to allow pause frames, then the MAC
should be configured to generate and/or accept those frames.
Signed-off-by: Timur Tabi <timur@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Timur Tabi [Mon, 7 Nov 2016 16:51:40 +0000 (10:51 -0600)]
net: qcom/emac: configure the external phy to allow pause frames
Pause frames are used to enable flow control. A MAC can send and
receive pause frames in order to throttle traffic. However, the PHY
must be configured to allow those frames to pass through.
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Timur Tabi <timur@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rafał Miłecki [Mon, 7 Nov 2016 12:53:27 +0000 (13:53 +0100)]
net: bgmac: fix reversed checks for clock control flag
This fixes regression introduced by patch adding feature flags. It was
already reported and patch followed (it got accepted) but it appears it
was incorrect. Instead of fixing reversed condition it broke a good one.
This patch was verified to actually fix SoC hanges caused by bgmac on
BCM47186B0.
Fixes:
db791eb2970b ("net: ethernet: bgmac: convert to feature flags")
Fixes:
4af1474e6198 ("net: bgmac: Fix errant feature flag check")
Cc: Jon Mason <jon.mason@broadcom.com>
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Benjamin Poirier [Mon, 7 Nov 2016 09:57:56 +0000 (17:57 +0800)]
bna: Add synchronization for tx ring.
We received two reports of BUG_ON in bnad_txcmpl_process() where
hw_consumer_index appeared to be ahead of producer_index. Out of order
write/read of these variables could explain these reports.
bnad_start_xmit(), as a producer of tx descriptors, has a few memory
barriers sprinkled around writes to producer_index and the device's
doorbell but they're not paired with anything in bnad_txcmpl_process(), a
consumer.
Since we are synchronizing with a device, we must use mandatory barriers,
not smp_*. Also, I didn't see the purpose of the last smp_mb() in
bnad_start_xmit().
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tariq Toukan [Sun, 6 Nov 2016 16:05:06 +0000 (18:05 +0200)]
Revert "net/mlx4_en: Fix panic during reboot"
This reverts commit
9d2afba058722d40cc02f430229c91611c0e8d16.
The original issue would possibly exist if an external module
tried calling our "ethtool_ops" without checking if it still
exists.
The right way of solving it is by simply doing the check in
the caller side.
Currently, no action is required as there's no such use case.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maciej Żenczykowski [Fri, 4 Nov 2016 21:51:54 +0000 (14:51 -0700)]
net-ipv6: on device mtu change do not add mtu to mtu-less routes
Routes can specify an mtu explicitly or inherit the mtu from
the underlying device - this inheritance is implemented in
dst->ops->mtu handlers ip6_mtu() and ip6_blackhole_mtu().
Currently changing the mtu of a device adds mtu explicitly
to routes using that device.
ie.
# ip link set dev lo mtu 65536
# ip -6 route add local 2000::1 dev lo
# ip -6 route get 2000::1
local 2000::1 dev lo table local src ... metric 1024 pref medium
# ip link set dev lo mtu 65535
# ip -6 route get 2000::1
local 2000::1 dev lo table local src ... metric 1024 mtu 65535 pref medium
# ip link set dev lo mtu 65536
# ip -6 route get 2000::1
local 2000::1 dev lo table local src ... metric 1024 mtu 65536 pref medium
# ip -6 route del local 2000::1
After this patch the route entry no longer changes unless it already has an mtu.
There is no need: this inheritance is already done in ip6_mtu()
# ip link set dev lo mtu 65536
# ip -6 route add local 2000::1 dev lo
# ip -6 route add local 2000::2 dev lo mtu 2000
# ip -6 route get 2000::1; ip -6 route get 2000::2
local 2000::1 dev lo table local src ... metric 1024 pref medium
local 2000::2 dev lo table local src ... metric 1024 mtu 2000 pref medium
# ip link set dev lo mtu 65535
# ip -6 route get 2000::1; ip -6 route get 2000::2
local 2000::1 dev lo table local src ... metric 1024 pref medium
local 2000::2 dev lo table local src ... metric 1024 mtu 2000 pref medium
# ip link set dev lo mtu 1501
# ip -6 route get 2000::1; ip -6 route get 2000::2
local 2000::1 dev lo table local src ... metric 1024 pref medium
local 2000::2 dev lo table local src ... metric 1024 mtu 1501 pref medium
# ip link set dev lo mtu 65536
# ip -6 route get 2000::1; ip -6 route get 2000::2
local 2000::1 dev lo table local src ... metric 1024 pref medium
local 2000::2 dev lo table local src ... metric 1024 mtu 65536 pref medium
# ip -6 route del local 2000::1
# ip -6 route del local 2000::2
This is desirable because changing device mtu and then resetting it
to the previous value shouldn't change the user visible routing table.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
CC: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Soheil Hassas Yeganeh [Fri, 4 Nov 2016 19:36:49 +0000 (15:36 -0400)]
sock: fix sendmmsg for partial sendmsg
Do not send the next message in sendmmsg for partial sendmsg
invocations.
sendmmsg assumes that it can continue sending the next message
when the return value of the individual sendmsg invocations
is positive. It results in corrupting the data for TCP,
SCTP, and UNIX streams.
For example, sendmmsg([["abcd"], ["efgh"]]) can result in a stream
of "aefgh" if the first sendmsg invocation sends only the first
byte while the second sendmsg goes through.
Datagram sockets either send the entire datagram or fail, so
this patch affects only sockets of type SOCK_STREAM and
SOCK_SEQPACKET.
Fixes:
228e548e6020 ("net: Add sendmmsg socket system call")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gao Feng [Fri, 4 Nov 2016 02:28:49 +0000 (10:28 +0800)]
driver: macvlan: Destroy new macvlan port if macvlan_common_newlink failed.
When there is no existing macvlan port in lowdev, one new macvlan port
would be created. But it doesn't be destoried when something failed later.
It casues some memleak.
Now add one flag to indicate if new macvlan port is created.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Liping Zhang [Sun, 6 Nov 2016 06:40:01 +0000 (14:40 +0800)]
netfilter: nf_tables: fix oops when inserting an element into a verdict map
Dalegaard says:
The following ruleset, when loaded with 'nft -f bad.txt'
----snip----
flush ruleset
table ip inlinenat {
map sourcemap {
type ipv4_addr : verdict;
}
chain postrouting {
ip saddr vmap @sourcemap accept
}
}
add chain inlinenat test
add element inlinenat sourcemap { 100.123.10.2 : jump test }
----snip----
results in a kernel oops:
BUG: unable to handle kernel paging request at
0000000000001344
IP: [<
ffffffffa07bf704>] nf_tables_check_loops+0x114/0x1f0 [nf_tables]
[...]
Call Trace:
[<
ffffffffa07c2aae>] ? nft_data_init+0x13e/0x1a0 [nf_tables]
[<
ffffffffa07c1950>] nft_validate_register_store+0x60/0xb0 [nf_tables]
[<
ffffffffa07c74b5>] nft_add_set_elem+0x545/0x5e0 [nf_tables]
[<
ffffffffa07bfdd0>] ? nft_table_lookup+0x30/0x60 [nf_tables]
[<
ffffffff8132c630>] ? nla_strcmp+0x40/0x50
[<
ffffffffa07c766e>] nf_tables_newsetelem+0x11e/0x210 [nf_tables]
[<
ffffffff8132c400>] ? nla_validate+0x60/0x80
[<
ffffffffa030d9b4>] nfnetlink_rcv+0x354/0x5a7 [nfnetlink]
Because we forget to fill the net pointer in bind_ctx, so dereferencing
it may cause kernel crash.
Reported-by: Dalegaard <dalegaard@gmail.com>
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Fri, 4 Nov 2016 15:54:58 +0000 (16:54 +0100)]
netfilter: conntrack: refine gc worker heuristics
Nicolas Dichtel says:
After commit
b87a2f9199ea ("netfilter: conntrack: add gc worker to
remove timed-out entries"), netlink conntrack deletion events may be
sent with a huge delay.
Nicolas further points at this line:
goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);
and indeed, this isn't optimal at all. Rationale here was to ensure that
we don't block other work items for too long, even if
nf_conntrack_htable_size is huge. But in order to have some guarantee
about maximum time period where a scan of the full conntrack table
completes we should always use a fixed slice size, so that once every
N scans the full table has been examined at least once.
We also need to balance this vs. the case where the system is either idle
(i.e., conntrack table (almost) empty) or very busy (i.e. eviction happens
from packet path).
So, after some discussion with Nicolas:
1. want hard guarantee that we scan entire table at least once every X s
-> need to scan fraction of table (get rid of upper bound)
2. don't want to eat cycles on idle or very busy system
-> increase interval if we did not evict any entries
3. don't want to block other worker items for too long
-> make fraction really small, and prefer small scan interval instead
4. Want reasonable short time where we detect timed-out entry when
system went idle after a burst of traffic, while not doing scans
all the time.
-> Store next gc scan in worker, increasing delays when no eviction
happened and shrinking delay when we see timed out entries.
The old gc interval is turned into a max number, scans can now happen
every jiffy if stale entries are present.
Longest possible time period until an entry is evicted is now 2 minutes
in worst case (entry expires right after it was deemed 'not expired').
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Thu, 3 Nov 2016 13:44:42 +0000 (14:44 +0100)]
netfilter: conntrack: fix CT target for UNSPEC helpers
Thomas reports its not possible to attach the H.245 helper:
iptables -t raw -A PREROUTING -p udp -j CT --helper H.245
iptables: No chain/target/match by that name.
xt_CT: No such helper "H.245"
This is because H.245 registers as NFPROTO_UNSPEC, but the CT target
passes NFPROTO_IPV4/IPV6 to nf_conntrack_helper_try_module_get.
We should treat UNSPEC as wildcard and ignore the l3num instead.
Reported-by: Thomas Woerner <twoerner@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Sat, 29 Oct 2016 01:01:50 +0000 (03:01 +0200)]
netfilter: connmark: ignore skbs with magic untracked conntrack objects
The (percpu) untracked conntrack entries can end up with nonzero connmarks.
The 'untracked' conntrack objects are merely a way to distinguish INVALID
(i.e. protocol connection tracker says payload doesn't meet some
requirements or packet was never seen by the connection tracking code)
from packets that are intentionally not tracked (some icmpv6 types such as
neigh solicitation, or by using 'iptables -j CT --notrack' option).
Untracked conntrack objects are implementation detail, we might as well use
invalid magic address instead to tell INVALID and UNTRACKED apart.
Check skb->nfct for untracked dummy and behave as if skb->nfct is NULL.
Reported-by: XU Tianwen <evan.xu.tianwen@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
WANG Cong [Fri, 4 Nov 2016 00:14:03 +0000 (17:14 -0700)]
ipvs: use IPVS_CMD_ATTR_MAX for family.maxattr
family.maxattr is the max index for policy[], the size of
ops[] is determined with ARRAY_SIZE().
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Alexander Duyck [Fri, 4 Nov 2016 19:11:57 +0000 (15:11 -0400)]
fib_trie: Correct /proc/net/route off by one error
The display of /proc/net/route has had a couple issues due to the fact that
when I originally rewrote most of fib_trie I made it so that the iterator
was tracking the next value to use instead of the current.
In addition it had an off by 1 error where I was tracking the first piece
of data as position 0, even though in reality that belonged to the
SEQ_START_TOKEN.
This patch updates the code so the iterator tracks the last reported
position and key instead of the next expected position and key. In
addition it shifts things so that all of the leaves start at 1 instead of
trying to report leaves starting with offset 0 as being valid. With these
two issues addressed this should resolve any off by one errors that were
present in the display of /proc/net/route.
Fixes:
25b97c016b26 ("ipv4: off-by-one in continuation handling in /proc/net/route")
Cc: Andy Whitcroft <apw@canonical.com>
Reported-by: Jason Baron <jbaron@akamai.com>
Tested-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fabian Mewes [Fri, 4 Nov 2016 12:16:14 +0000 (13:16 +0100)]
Documentation: networking: dsa: Update tagging protocols
Add Qualcomm QCA tagging introduced in
cafdc45c9 to the
list of supported protocols.
Signed-off-by: Fabian Mewes <architekt@coding4coffee.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael S. Tsirkin [Fri, 4 Nov 2016 10:55:36 +0000 (12:55 +0200)]
virtio-net: drop legacy features in virtio 1 mode
Virtio 1.0 spec says VIRTIO_F_ANY_LAYOUT and VIRTIO_NET_F_GSO are
legacy-only feature bits. Do not negotiate them in virtio 1 mode. Note
this is a spec violation so we need to backport it to stable/downstream
kernels.
Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Thu, 3 Nov 2016 23:17:26 +0000 (16:17 -0700)]
net: icmp6_send should use dst dev to determine L3 domain
icmp6_send is called in response to some event. The skb may not have
the device set (skb->dev is NULL), but it is expected to have a dst set.
Update icmp6_send to use the dst on the skb to determine L3 domain.
Fixes:
ca254490c8dfd ("net: Add VRF support to IPv6 stack")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Thu, 3 Nov 2016 23:56:31 +0000 (00:56 +0100)]
bpf: fix map not being uncharged during map creation failure
In map_create(), we first find and create the map, then once that
suceeded, we charge it to the user's RLIMIT_MEMLOCK, and then fetch
a new anon fd through anon_inode_getfd(). The problem is, once the
latter fails f.e. due to RLIMIT_NOFILE limit, then we only destruct
the map via map->ops->map_free(), but without uncharging the previously
locked memory first. That means that the user_struct allocation is
leaked as well as the accounted RLIMIT_MEMLOCK memory not released.
Make the label names in the fix consistent with bpf_prog_load().
Fixes:
aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Thu, 3 Nov 2016 23:01:19 +0000 (00:01 +0100)]
bpf: fix htab map destruction when extra reserve is in use
Commit
a6ed3ea65d98 ("bpf: restore behavior of bpf_map_update_elem")
added an extra per-cpu reserve to the hash table map to restore old
behaviour from pre prealloc times. When non-prealloc is in use for a
map, then problem is that once a hash table extra element has been
linked into the hash-table, and the hash table is destroyed due to
refcount dropping to zero, then htab_map_free() -> delete_all_elements()
will walk the whole hash table and drop all elements via htab_elem_free().
The problem is that the element from the extra reserve is first fed
to the wrong backend allocator and eventually freed twice.
Fixes:
a6ed3ea65d98 ("bpf: restore behavior of bpf_map_update_elem")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner [Thu, 3 Nov 2016 19:03:41 +0000 (17:03 -0200)]
sctp: assign assoc_id earlier in __sctp_connect
sctp_wait_for_connect() currently already holds the asoc to keep it
alive during the sleep, in case another thread release it. But Andrey
Konovalov and Dmitry Vyukov reported an use-after-free in such
situation.
Problem is that __sctp_connect() doesn't get a ref on the asoc and will
do a read on the asoc after calling sctp_wait_for_connect(), but by then
another thread may have closed it and the _put on sctp_wait_for_connect
will actually release it, causing the use-after-free.
Fix is, instead of doing the read after waiting for the connect, do it
before so, and avoid this issue as the socket is still locked by then.
There should be no issue on returning the asoc id in case of failure as
the application shouldn't trust on that number in such situations
anyway.
This issue doesn't exist in sctp_sendmsg() path.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 7 Nov 2016 18:17:31 +0000 (13:17 -0500)]
Merge branch 'phy-ref-leaks'
Johan Hovold says:
====================
net: fix device reference leaks
This series fixes a number of device reference leaks (and one of_node
leak) due to failure to drop the references taken by bus_find_device()
and friends.
Note that the final two patches have been compile tested only.
v2
- hold reference to cpsw-phy-sel device while accessing private data as
requested by David. Also update the commit message. (patch 1/4)
- add linux-omap on CC where appropriate
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hovold [Thu, 3 Nov 2016 17:40:22 +0000 (18:40 +0100)]
net: hns: fix device reference leaks
Make sure to drop the reference taken by class_find_device() in
hnae_get_handle() on errors and when later releasing the handle.
Fixes:
6fe6611ff275 ("net: add Hisilicon Network Subsystem...")
Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
Cc: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hovold [Thu, 3 Nov 2016 17:40:21 +0000 (18:40 +0100)]
net: ethernet: ti: davinci_emac: fix device reference leak
Make sure to drop the references taken by bus_find_device() before
returning from emac_dev_open().
Note that phy_connect still takes a reference to the phy device.
Fixes:
5d69e0076a72 ("net: davinci_emac: switch to new mdio")
Cc: Mugunthan V N <mugunthanvnm@ti.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: linux-omap@vger.kernel.org
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hovold [Thu, 3 Nov 2016 17:40:20 +0000 (18:40 +0100)]
net: ethernet: ti: cpsw: fix device and of_node leaks
Make sure to drop the references taken by of_get_child_by_name() and
bus_find_device() before returning from cpsw_phy_sel().
Note that holding a reference to the cpsw-phy-sel device does not
prevent the devres-managed private data from going away.
Fixes:
5892cd135e16 ("drivers: net: cpsw-phy-sel: Add new driver...")
Cc: Mugunthan V N <mugunthanvnm@ti.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: linux-omap@vger.kernel.org
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hovold [Thu, 3 Nov 2016 17:40:19 +0000 (18:40 +0100)]
phy: fix device reference leaks
Make sure to drop the reference taken by bus_find_device_by_name()
before returning from phy_connect() and phy_attach().
Note that both function still take a reference to the phy device
through phy_attach_direct().
Fixes:
e13934563db0 ("[PATCH] PHY Layer fixup")
Cc: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 4 Nov 2016 18:59:38 +0000 (14:59 -0400)]
Merge branch 'mlx5-fixes'
Saeed Mahameed says:
====================
Mellanox 100G mlx5 fixes 2016-11-04
This series contains six hot fixes of the mlx5 core and mlx5e driver.
Huy fixed an invalid pointer dereference on initialization flow for when
the selected mlx5 load profile is out of range.
Or provided three eswitch offloads related fixes
- Prevent changing NS of a VF representor.
- Handle matching on vlan priority for offloaded TC rules
- Set the actions for offloaded rules properly
On my part I here addressed the error flow related issues in
mlx5e_open_channel reported by Jesper just this week.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Huy Nguyen [Thu, 3 Nov 2016 23:48:47 +0000 (01:48 +0200)]
net/mlx5: Fix invalid pointer reference when prof_sel parameter is invalid
When prof_sel is invalid, mlx5_core_warn is called but the
mlx5_core_dev is not initialized yet. Solution is moving the prof_sel code
after dev->pdev assignment
Fixes:
2974ab6e8bd8 ('net/mlx5: Improve driver log messages')
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Thu, 3 Nov 2016 23:48:46 +0000 (01:48 +0200)]
net/mlx5: E-Switch, Set the actions for offloaded rules properly
As for the current generation of the mlx5 HW (CX4/CX4-Lx) per flow vlan
push/pop actions are emulated, we must not program them to the firmware.
Fixes:
f5f82476090f ('net/mlx5: E-Switch, Support VLAN actions in the offloads mode')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Thu, 3 Nov 2016 23:48:45 +0000 (01:48 +0200)]
net/mlx5e: Handle matching on vlan priority for offloaded TC rules
We ignored the vlan priority in offloaded TC rules matching part,
fix that.
Fixes:
095b6cfd69ce ('net/mlx5e: Add TC vlan match parsing')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Thu, 3 Nov 2016 23:48:44 +0000 (01:48 +0200)]
net/mlx5e: Disallow changing name-space for VF representors
VF reps should be altogether on the same NS as they were created.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed [Thu, 3 Nov 2016 23:48:43 +0000 (01:48 +0200)]
net/mlx5e: Re-arrange XDP SQ/CQ creation
In mlx5e_open_channel CQs must be created before napi is enabled.
Here we move the XDP CQ creation to satisfy that fact.
mlx5e_close_channel is already working according to the right order.
Fixes:
b5503b994ed5 ("net/mlx5e: XDP TX forwarding support")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed [Thu, 3 Nov 2016 23:48:42 +0000 (01:48 +0200)]
net/mlx5e: Fix XDP error path of mlx5e_open_channel()
In case of mlx5e_open_rq fails the error handling will jump to
label err_close_xdp_sq and will try to close the xdp_sq unconditionally.
xdp_sq is valid only in case of XDP use cases, i.e priv->xdp_prog is
not null.
To fix this in this patch we test xdp_sq validity prior to closing it.
In addition we now close the xdp_sq.cq as well.
Fixes:
b5503b994ed5 ("net/mlx5e: XDP TX forwarding support")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 3 Nov 2016 16:42:36 +0000 (09:42 -0700)]
taskstats: fix the length of cgroupstats_cmd_get_policy
cgroupstats_cmd_get_policy is [CGROUPSTATS_CMD_ATTR_MAX+1],
taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1],
but their family.maxattr is TASKSTATS_CMD_ATTR_MAX.
CGROUPSTATS_CMD_ATTR_MAX is less than TASKSTATS_CMD_ATTR_MAX,
so we could end up accessing out-of-bound.
Change cgroupstats_cmd_get_policy to TASKSTATS_CMD_ATTR_MAX+1,
this is safe because the rest are initialized to 0's.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 3 Nov 2016 16:42:35 +0000 (09:42 -0700)]
genetlink: fix a memory leak on error path
In __genl_register_family(), when genl_validate_assign_mc_groups()
fails, we forget to free the memory we possibly allocate for
family->attrbuf.
Note, some callers call genl_unregister_family() to clean up
on error path, it doesn't work because the family is inserted
to the global list in the nearly last step.
Cc: Jakub Kicinski <kubakici@wp.pl>
Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 3 Nov 2016 15:59:46 +0000 (08:59 -0700)]
ipv6: dccp: add missing bind_conflict to dccp_ipv6_mapped
While fuzzing kernel with syzkaller, Andrey reported a nasty crash
in inet6_bind() caused by DCCP lacking a required method.
Fixes:
ab1e0a13d7029 ("[SOCK] proto: Add hashinfo member to struct proto")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Guilherme G. Piccoli [Thu, 3 Nov 2016 10:16:20 +0000 (08:16 -0200)]
ehea: fix operation state report
Currently the ehea driver is missing a call to netif_carrier_off()
before the interface bring-up; this is necessary in order to
initialize the __LINK_STATE_NOCARRIER bit in the net_device state
field. Otherwise, we observe state UNKNOWN on "ip address" command
output.
This patch adds a call to netif_carrier_off() on ehea's net device
open callback.
Reported-by: Xiong Zhou <zhou@redhat.com>
Reference-ID: IBM bz #137702, Red Hat bz #1089134
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Signed-off-by: Douglas Miller <dougmill@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 3 Nov 2016 03:30:48 +0000 (20:30 -0700)]
ipv6: dccp: fix out of bound access in dccp_v6_err()
dccp_v6_err() does not use pskb_may_pull() and might access garbage.
We only need 4 bytes at the beginning of the DCCP header, like TCP,
so the 8 bytes pulled in icmpv6_notify() are more than enough.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 3 Nov 2016 03:21:20 +0000 (20:21 -0700)]
netlink: netlink_diag_dump() runs without locks
A recent commit removed locking from netlink_diag_dump() but forgot
one error case.
=====================================
[ BUG: bad unlock balance detected! ]
4.9.0-rc3+ #336 Not tainted
-------------------------------------
syz-executor/4018 is trying to release lock ([ 36.220068] nl_table_lock
) at:
[<
ffffffff82dc8683>] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182
but there are no more locks to release!
other info that might help us debug this:
3 locks held by syz-executor/4018:
#0: [ 36.220068] (
sock_diag_mutex[ 36.220068] ){+.+.+.}
, at: [ 36.220068] [<
ffffffff82c3873b>] sock_diag_rcv+0x1b/0x40
#1: [ 36.220068] (
sock_diag_table_mutex[ 36.220068] ){+.+.+.}
, at: [ 36.220068] [<
ffffffff82c38e00>] sock_diag_rcv_msg+0x140/0x3a0
#2: [ 36.220068] (
nlk->cb_mutex[ 36.220068] ){+.+.+.}
, at: [ 36.220068] [<
ffffffff82db6600>] netlink_dump+0x50/0xac0
stack backtrace:
CPU: 1 PID: 4018 Comm: syz-executor Not tainted 4.9.0-rc3+ #336
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
ffff8800645df688 ffffffff81b46934 ffffffff84eb3e78 ffff88006ad85800
ffffffff82dc8683 ffffffff84eb3e78 ffff8800645df6b8 ffffffff812043ca
dffffc0000000000 ffff88006ad85ff8 ffff88006ad85fd0 00000000ffffffff
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<
ffffffff81b46934>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
[<
ffffffff812043ca>] print_unlock_imbalance_bug+0x17a/0x1a0
kernel/locking/lockdep.c:3388
[< inline >] __lock_release kernel/locking/lockdep.c:3512
[<
ffffffff8120cfd8>] lock_release+0x8e8/0xc60 kernel/locking/lockdep.c:3765
[< inline >] __raw_read_unlock ./include/linux/rwlock_api_smp.h:225
[<
ffffffff83fc001a>] _raw_read_unlock+0x1a/0x30 kernel/locking/spinlock.c:255
[<
ffffffff82dc8683>] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182
[<
ffffffff82db6947>] netlink_dump+0x397/0xac0 net/netlink/af_netlink.c:2110
Fixes:
ad202074320c ("netlink: Use rhashtable walk interface in diag dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 3 Nov 2016 02:00:40 +0000 (19:00 -0700)]
dccp: fix out of bound access in dccp_v4_err()
dccp_v4_err() does not use pskb_may_pull() and might access garbage.
We only need 4 bytes at the beginning of the DCCP header, like TCP,
so the 8 bytes pulled in icmp_socket_deliver() are more than enough.
This patch might allow to process more ICMP messages, as some routers
are still limiting the size of reflected bytes to 28 (RFC 792), instead
of extended lengths (RFC 1812 4.3.2.3)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 3 Nov 2016 01:04:24 +0000 (18:04 -0700)]
dccp: do not send reset to already closed sockets
Andrey reported following warning while fuzzing with syzkaller
WARNING: CPU: 1 PID: 21072 at net/dccp/proto.c:83 dccp_set_state+0x229/0x290
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 21072 Comm: syz-executor Not tainted 4.9.0-rc1+ #293
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
ffff88003d4c7738 ffffffff81b474f4 0000000000000003 dffffc0000000000
ffffffff844f8b00 ffff88003d4c7804 ffff88003d4c7800 ffffffff8140c06a
0000000041b58ab3 ffffffff8479ab7d ffffffff8140beae ffffffff8140cd00
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<
ffffffff81b474f4>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
[<
ffffffff8140c06a>] panic+0x1bc/0x39d kernel/panic.c:179
[<
ffffffff8111125c>] __warn+0x1cc/0x1f0 kernel/panic.c:542
[<
ffffffff8111144c>] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
[<
ffffffff8389e5d9>] dccp_set_state+0x229/0x290 net/dccp/proto.c:83
[<
ffffffff838a0aa2>] dccp_close+0x612/0xc10 net/dccp/proto.c:1016
[<
ffffffff8316bf1f>] inet_release+0xef/0x1c0 net/ipv4/af_inet.c:415
[<
ffffffff82b6e89e>] sock_release+0x8e/0x1d0 net/socket.c:570
[<
ffffffff82b6e9f6>] sock_close+0x16/0x20 net/socket.c:1017
[<
ffffffff815256ad>] __fput+0x29d/0x720 fs/file_table.c:208
[<
ffffffff81525bb5>] ____fput+0x15/0x20 fs/file_table.c:244
[<
ffffffff811727d8>] task_work_run+0xf8/0x170 kernel/task_work.c:116
[< inline >] exit_task_work include/linux/task_work.h:21
[<
ffffffff8111bc53>] do_exit+0x883/0x2ac0 kernel/exit.c:828
[<
ffffffff811221fe>] do_group_exit+0x10e/0x340 kernel/exit.c:931
[<
ffffffff81143c94>] get_signal+0x634/0x15a0 kernel/signal.c:2307
[<
ffffffff81054aad>] do_signal+0x8d/0x1a30 arch/x86/kernel/signal.c:807
[<
ffffffff81003a05>] exit_to_usermode_loop+0xe5/0x130
arch/x86/entry/common.c:156
[< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190
[<
ffffffff81006298>] syscall_return_slowpath+0x1a8/0x1e0
arch/x86/entry/common.c:259
[<
ffffffff83fc1a62>] entry_SYSCALL_64_fastpath+0xc0/0xc2
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled
Fix this the same way we did for TCP in commit
565b7b2d2e63
("tcp: do not send reset to already closed sockets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 3 Nov 2016 00:14:41 +0000 (17:14 -0700)]
dccp: do not release listeners too soon
Andrey Konovalov reported following error while fuzzing with syzkaller :
IPv4: Attempt to release alive inet socket
ffff880068e98940
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task:
ffff88006b9e0000 task.stack:
ffff880068770000
RIP: 0010:[<
ffffffff819ead5f>] [<
ffffffff819ead5f>]
selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639
RSP: 0018:
ffff8800687771c8 EFLAGS:
00010202
RAX:
ffff88006b9e0000 RBX:
1ffff1000d0eee3f RCX:
1ffff1000d1d312a
RDX:
1ffff1000d1d31a6 RSI:
dffffc0000000000 RDI:
0000000000000010
RBP:
ffff880068777360 R08:
0000000000000000 R09:
0000000000000002
R10:
dffffc0000000000 R11:
0000000000000006 R12:
ffff880068e98940
R13:
0000000000000002 R14:
ffff880068777338 R15:
0000000000000000
FS:
00007f00ff760700(0000) GS:
ffff88006cd00000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000020008000 CR3:
000000006a308000 CR4:
00000000000006e0
Stack:
ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007
ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480
ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0
Call Trace:
[<
ffffffff819c8dd5>] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317
[<
ffffffff82c2a9e7>] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81
[<
ffffffff82b81e60>] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460
[<
ffffffff838bbf12>] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873
[<
ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0
net/ipv4/ip_input.c:216
[< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
[< inline >] NF_HOOK ./include/linux/netfilter.h:255
[<
ffffffff8306abd2>] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
[< inline >] dst_input ./include/net/dst.h:507
[<
ffffffff83068500>] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
[< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
[< inline >] NF_HOOK ./include/linux/netfilter.h:255
[<
ffffffff8306b82f>] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
[<
ffffffff82bd9fb7>] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
[<
ffffffff82bdb19a>] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
[<
ffffffff82bdb493>] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
[<
ffffffff82bdb6b8>] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
[<
ffffffff8241fc75>] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
[<
ffffffff82421b5a>] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
[< inline >] new_sync_write fs/read_write.c:499
[<
ffffffff8151bd44>] __vfs_write+0x334/0x570 fs/read_write.c:512
[<
ffffffff8151f85b>] vfs_write+0x17b/0x500 fs/read_write.c:560
[< inline >] SYSC_write fs/read_write.c:607
[<
ffffffff81523184>] SyS_write+0xd4/0x1a0 fs/read_write.c:599
[<
ffffffff83fc02c1>] entry_SYSCALL_64_fastpath+0x1f/0xc2
It turns out DCCP calls __sk_receive_skb(), and this broke when
lookups no longer took a reference on listeners.
Fix this issue by adding a @refcounted parameter to __sk_receive_skb(),
so that sock_put() is used only when needed.
Fixes:
3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 2 Nov 2016 21:41:50 +0000 (14:41 -0700)]
tcp: fix return value for partial writes
After my commit, tcp_sendmsg() might restart its loop after
processing socket backlog.
If sk_err is set, we blindly return an error, even though we
copied data to user space before.
We should instead return number of bytes that could be copied,
otherwise user space might resend data and corrupt the stream.
This might happen if another thread is using recvmsg(MSG_ERRQUEUE)
to process timestamps.
Issue was diagnosed by Soheil and Willem, big kudos to them !
Fixes:
d41a69f1d390f ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lance Richardson [Wed, 2 Nov 2016 20:36:17 +0000 (16:36 -0400)]
ipv4: allow local fragmentation in ip_finish_output_gso()
Some configurations (e.g. geneve interface with default
MTU of 1500 over an ethernet interface with 1500 MTU) result
in the transmission of packets that exceed the configured MTU.
While this should be considered to be a "bad" configuration,
it is still allowed and should not result in the sending
of packets that exceed the configured MTU.
Fix by dropping the assumption in ip_finish_output_gso() that
locally originated gso packets will never need fragmentation.
Basic testing using iperf (observing CPU usage and bandwidth)
have shown no measurable performance impact for traffic not
requiring fragmentation.
Fixes:
c7ba65d7b649 ("net: ip: push gso skb forwarding handling down the stack")
Reported-by: Jan Tluka <jtluka@redhat.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Wed, 2 Nov 2016 19:08:25 +0000 (12:08 -0700)]
net: tcp: check skb is non-NULL for exact match on lookups
Andrey reported the following error report while running the syzkaller
fuzzer:
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 648 Comm: syz-executor Not tainted 4.9.0-rc3+ #333
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task:
ffff8800398c4480 task.stack:
ffff88003b468000
RIP: 0010:[<
ffffffff83091106>] [< inline >]
inet_exact_dif_match include/net/tcp.h:808
RIP: 0010:[<
ffffffff83091106>] [<
ffffffff83091106>]
__inet_lookup_listener+0xb6/0x500 net/ipv4/inet_hashtables.c:219
RSP: 0018:
ffff88003b46f270 EFLAGS:
00010202
RAX:
0000000000000004 RBX:
0000000000004242 RCX:
0000000000000001
RDX:
0000000000000000 RSI:
ffffc90000e3c000 RDI:
0000000000000054
RBP:
ffff88003b46f2d8 R08:
0000000000004000 R09:
ffffffff830910e7
R10:
0000000000000000 R11:
000000000000000a R12:
ffffffff867fa0c0
R13:
0000000000004242 R14:
0000000000000003 R15:
dffffc0000000000
FS:
00007fb135881700(0000) GS:
ffff88003ec00000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000020cc3000 CR3:
000000006d56a000 CR4:
00000000000006f0
Stack:
0000000000000000 000000000601a8c0 0000000000000000 ffffffff00004242
424200003b9083c2 ffff88003def4041 ffffffff84e7e040 0000000000000246
ffff88003a0911c0 0000000000000000 ffff88003a091298 ffff88003b9083ae
Call Trace:
[<
ffffffff831100f4>] tcp_v4_send_reset+0x584/0x1700 net/ipv4/tcp_ipv4.c:643
[<
ffffffff83115b1b>] tcp_v4_rcv+0x198b/0x2e50 net/ipv4/tcp_ipv4.c:1718
[<
ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0
net/ipv4/ip_input.c:216
...
MD5 has a code path that calls __inet_lookup_listener with a null skb,
so inet{6}_exact_dif_match needs to check skb against null before pulling
the flag.
Fixes:
a04a480d4392 ("net: Require exact match for TCP socket lookups if
dif is l3mdev")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 2 Nov 2016 14:53:17 +0000 (07:53 -0700)]
tcp: fix potential memory corruption
Imagine initial value of max_skb_frags is 17, and last
skb in write queue has 15 frags.
Then max_skb_frags is lowered to 14 or smaller value.
tcp_sendmsg() will then be allowed to add additional page frags
and eventually go past MAX_SKB_FRAGS, overflowing struct
skb_shared_info.
Fixes:
5f74f82ea34c ("net:Add sysctl_max_skb_frags")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Cc: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mintz, Yuval [Wed, 2 Nov 2016 14:36:46 +0000 (16:36 +0200)]
qede: Correctly map aggregation replacement pages
Driver allocates replacement buffers before-hand to make
sure whenever an aggregation begins there would be a replacement
for the Rx buffers, as we can't release the buffer until
aggregation is terminated and driver logic assumes the Rx rings
are always full.
For every other Rx page that's being allocated [I.e., regular]
the page is being completely mapped while for the replacement
buffers only the first portion of the page is being mapped.
This means that:
a. Once replacement buffer replenishes the regular Rx ring,
assuming there's more than a single packet on page we'd post unmapped
memory toward HW [assuming mapping is actually done in granularity
smaller than page].
b. Unmaps are being done for the entire page, which is incorrect.
Fixes:
55482edc25f06 ("qede: Add slowpath/fastpath support and enable hardware GRO")
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Wed, 2 Nov 2016 05:22:53 +0000 (10:52 +0530)]
cxgb4: correct device ID of T6 adapter
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Tue, 1 Nov 2016 23:04:36 +0000 (16:04 -0700)]
inet: fix sleeping inside inet_wait_for_connect()
Andrey reported this kernel warning:
WARNING: CPU: 0 PID: 4608 at kernel/sched/core.c:7724
__might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
do not call blocking ops when !TASK_RUNNING; state=1 set at
[<
ffffffff811f5a5c>] prepare_to_wait+0xbc/0x210
kernel/sched/wait.c:178
Modules linked in:
CPU: 0 PID: 4608 Comm: syz-executor Not tainted 4.9.0-rc2+ #320
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
ffff88006625f7a0 ffffffff81b46914 ffff88006625f818 0000000000000000
ffffffff84052960 0000000000000000 ffff88006625f7e8 ffffffff81111237
ffff88006aceac00 ffffffff00001e2c ffffed000cc4beff ffffffff84052960
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<
ffffffff81b46914>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
[<
ffffffff81111237>] __warn+0x1a7/0x1f0 kernel/panic.c:550
[<
ffffffff8111132c>] warn_slowpath_fmt+0xac/0xd0 kernel/panic.c:565
[<
ffffffff811922fc>] __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
[< inline >] slab_pre_alloc_hook mm/slab.h:393
[< inline >] slab_alloc_node mm/slub.c:2634
[< inline >] slab_alloc mm/slub.c:2716
[<
ffffffff81508da0>] __kmalloc_track_caller+0x150/0x2a0 mm/slub.c:4240
[<
ffffffff8146be14>] kmemdup+0x24/0x50 mm/util.c:113
[<
ffffffff8388b2cf>] dccp_feat_clone_sp_val.part.5+0x4f/0xe0 net/dccp/feat.c:374
[< inline >] dccp_feat_clone_sp_val net/dccp/feat.c:1141
[< inline >] dccp_feat_change_recv net/dccp/feat.c:1141
[<
ffffffff8388d491>] dccp_feat_parse_options+0xaa1/0x13d0 net/dccp/feat.c:1411
[<
ffffffff83894f01>] dccp_parse_options+0x721/0x1010 net/dccp/options.c:128
[<
ffffffff83891280>] dccp_rcv_state_process+0x200/0x15b0 net/dccp/input.c:644
[<
ffffffff838b8a94>] dccp_v4_do_rcv+0xf4/0x1a0 net/dccp/ipv4.c:681
[< inline >] sk_backlog_rcv ./include/net/sock.h:872
[<
ffffffff82b7ceb6>] __release_sock+0x126/0x3a0 net/core/sock.c:2044
[<
ffffffff82b7d189>] release_sock+0x59/0x1c0 net/core/sock.c:2502
[< inline >] inet_wait_for_connect net/ipv4/af_inet.c:547
[<
ffffffff8316b2a2>] __inet_stream_connect+0x5d2/0xbb0 net/ipv4/af_inet.c:617
[<
ffffffff8316b8d5>] inet_stream_connect+0x55/0xa0 net/ipv4/af_inet.c:656
[<
ffffffff82b705e4>] SYSC_connect+0x244/0x2f0 net/socket.c:1533
[<
ffffffff82b72dd4>] SyS_connect+0x24/0x30 net/socket.c:1514
[<
ffffffff83fbf701>] entry_SYSCALL_64_fastpath+0x1f/0xc2
arch/x86/entry/entry_64.S:209
Unlike commit
26cabd31259ba43f68026ce3f62b78094124333f
("sched, net: Clean up sk_wait_event() vs. might_sleep()"), the
sleeping function is called before schedule_timeout(), this is indeed
a bug. Fix this by moving the wait logic to the new API, it is similar
to commit
ff960a731788a7408b6f66ec4fd772ff18833211
("netdev, sched/wait: Fix sleeping inside wait event").
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dongli Zhang [Wed, 2 Nov 2016 01:04:33 +0000 (09:04 +0800)]
xen-netfront: cast grant table reference first to type int
IS_ERR_VALUE() in commit
87557efc27f6a50140fb20df06a917f368ce3c66
("xen-netfront: do not cast grant table reference to signed short") would
not return true for error code unless we cast ref first to type int.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eli Cooper [Tue, 1 Nov 2016 15:45:13 +0000 (23:45 +0800)]
ip6_udp_tunnel: remove unused IPCB related codes
Some IPCB fields are currently set in udp_tunnel6_xmit_skb(), which are
never used before it reaches ip6tunnel_xmit(), and past that point the
control buffer is no longer interpreted as IPCB.
This clears these unused IPCB related codes. Currently there is no skb
scrubbing in ip6_udp_tunnel, otherwise IPCB(skb)->opt might need to be
cleared for IPv4 packets, as shown in
5146d1f1511
("tunnel: Clear IPCB(skb)->opt before dst_link_failure called").
Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eli Cooper [Tue, 1 Nov 2016 15:45:12 +0000 (23:45 +0800)]
ip6_tunnel: Clear IP6CB in ip6tunnel_xmit()
skb->cb may contain data from previous layers. In the observed scenario,
the garbage data were misinterpreted as IP6CB(skb)->frag_max_size, so
that small packets sent through the tunnel are mistakenly fragmented.
This patch unconditionally clears the control buffer in ip6tunnel_xmit(),
which affects ip6_tunnel, ip6_udp_tunnel and ip6_gre. Currently none of
these tunnels set IP6CB(skb)->flags, otherwise it needs to be done earlier.
Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed [Tue, 1 Nov 2016 13:09:58 +0000 (15:09 +0200)]
MAINTAINERS: Update MELLANOX MLX5 core VPI driver maintainers
Add myself as a maintainer for mlx5 core driver as well.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King [Tue, 1 Nov 2016 10:50:01 +0000 (10:50 +0000)]
net: mv643xx_eth: ensure coalesce settings survive read-modify-write
The coalesce settings behave badly when changing just one value:
... # ethtool -c eth0
rx-usecs: 249
... # ethtool -C eth0 tx-usecs 250
... # ethtool -c eth0
rx-usecs: 248
This occurs due to rounding errors when calculating the microseconds
value - the divisons round down. This causes (eg) the rx-usecs to
decrease by one every time the tx-usecs value is set as per the above.
Fix this by making the divison round-to-nearest.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christophe Jaillet [Tue, 1 Nov 2016 07:10:53 +0000 (08:10 +0100)]
net/mlx5: Simplify a test
'create_root_ns()' does not return an error pointer, so the test can be
simplified to be more consistent.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Isaac Boukris [Tue, 1 Nov 2016 00:41:35 +0000 (02:41 +0200)]
unix: escape all null bytes in abstract unix domain socket
Abstract unix domain socket may embed null characters,
these should be translated to '@' when printed out to
proc the same way the null prefix is currently being
translated.
This helps for tools such as netstat, lsof and the proc
based implementation in ss to show all the significant
bytes of the name (instead of getting cut at the first
null occurrence).
Signed-off-by: Isaac Boukris <iboukris@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Timur Tabi [Mon, 31 Oct 2016 23:18:42 +0000 (18:18 -0500)]
net: qcom/emac: use correct value for SGMII_LN_UCDR_SO_GAIN_MODE0
The documentation says that SGMII_LN_UCDR_SO_GAIN_MODE0 should be
set to 0, not 6, on the Qualcomm Technologies QDF2432.
Signed-off-by: Timur Tabi <timur@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 1 Nov 2016 16:04:53 +0000 (12:04 -0400)]
Merge branch 'xgene-coalescing-bugs'
Iyappan Subramanian says:
====================
drivers: net: xgene: Fix coalescing bugs
This patch set fixes the following,
1. Since ethernet v1 hardware has a bug related to coalescing,
disabling this feature
2. Fixing ethernet v2 hardware, interrupt trigger region
id to 2, to kickoff coalescing
====================
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Iyappan Subramanian [Mon, 31 Oct 2016 23:00:27 +0000 (16:00 -0700)]
drivers: net: xgene: fix: Coalescing values for v2 hardware
Changing the interrupt trigger region id to 2 and the
corresponding threshold set0/set1 values to 8/16.
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Toan Le <toanle@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Iyappan Subramanian [Mon, 31 Oct 2016 23:00:26 +0000 (16:00 -0700)]
drivers: net: xgene: fix: Disable coalescing on v1 hardware
Since ethernet v1 hardware has a bug related to coalescing, disabling
this feature.
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Toan Le <toanle@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 1 Nov 2016 01:01:03 +0000 (21:01 -0400)]
Merge tag 'linux-can-fixes-for-4.9-
20161031' of git://git./linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2016-10-31
this is a pull request of two patches for the upcoming v4.9 release.
The first patch is by Lukas Resch for the sja1000 plx_pci driver that adds
support for Moxa CAN devices. The second patch is by Oliver Hartkopp and fixes
a potential kernel panic in the CAN broadcast manager.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Gospodarek [Mon, 31 Oct 2016 17:32:03 +0000 (13:32 -0400)]
bgmac: stop clearing DMA receive control register right after it is set
Current bgmac code initializes some DMA settings in the receive control
register for some hardware and then immediately clears those settings.
Not clearing those settings results in ~420Mbps *improvement* in
throughput; this system can now receive frames at line-rate on Broadcom
5871x hardware compared to ~520Mbps today. I also tested a few other
values but found there to be no discernible difference in CPU
utilization even if burst size and prefetching values are different.
On the hardware tested there was no need to keep the code that cleared
all but bits 16-17, but since there is a wide variety of hardware that
used this driver (I did not look at all hardware docs for hardware using
this IP block), I find it wise to move this call up and clear bits just
after reading the default value from the hardware rather than completely
removing it.
This is a good candidate for -stable >=3.14 since that is when the code
that was supposed to improve performance (but did not) was introduced.
Signed-off-by: Andy Gospodarek <gospo@broadcom.com>
Fixes:
56ceecde1f29 ("bgmac: initialize the DMA controller of core...")
Cc: Hauke Mehrtens <hauke@hauke-m.de>
Acked-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 31 Oct 2016 20:20:33 +0000 (16:20 -0400)]
Merge branch 'sctp-hold-transport-fixes'
Xin Long says:
====================
sctp: a bunch of fixes by holding transport
There are several places where it holds assoc after getting transport by
searching from transport rhashtable, it may cause use-after-free issue.
This patchset is to fix them by holding transport instead.
v1->v2:
Fix the changelog of patch 2/3
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 31 Oct 2016 12:32:33 +0000 (20:32 +0800)]
sctp: hold transport instead of assoc when lookup assoc in rx path
Prior to this patch, in rx path, before calling lock_sock, it needed to
hold assoc when got it by __sctp_lookup_association, in case other place
would free/put assoc.
But in __sctp_lookup_association, it lookup and hold transport, then got
assoc by transport->assoc, then hold assoc and put transport. It means
it didn't hold transport, yet it was returned and later on directly
assigned to chunk->transport.
Without the protection of sock lock, the transport may be freed/put by
other places, which would cause a use-after-free issue.
This patch is to fix this issue by holding transport instead of assoc.
As holding transport can make sure to access assoc is also safe, and
actually it looks up assoc by searching transport rhashtable, to hold
transport here makes more sense.
Note that the function will be renamed later on on another patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 31 Oct 2016 12:32:32 +0000 (20:32 +0800)]
sctp: return back transport in __sctp_rcv_init_lookup
Prior to this patch, it used a local variable to save the transport that is
looked up by __sctp_lookup_association(), and didn't return it back. But in
sctp_rcv, it is used to initialize chunk->transport. So when hitting this,
even if it found the transport, it was still initializing chunk->transport
with null instead.
This patch is to return the transport back through transport pointer
that is from __sctp_rcv_lookup_harder().
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 31 Oct 2016 12:32:31 +0000 (20:32 +0800)]
sctp: hold transport instead of assoc in sctp_diag
In sctp_transport_lookup_process(), Commit
1cceda784980 ("sctp: fix
the issue sctp_diag uses lock_sock in rcu_read_lock") moved cb() out
of rcu lock, but it put transport and hold assoc instead, and ignore
that cb() still uses transport. It may cause a use-after-free issue.
This patch is to hold transport instead of assoc there.
Fixes:
1cceda784980 ("sctp: fix the issue sctp_diag uses lock_sock in rcu_read_lock")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dongli Zhang [Mon, 31 Oct 2016 05:38:29 +0000 (13:38 +0800)]
xen-netfront: do not cast grant table reference to signed short
While grant reference is of type uint32_t, xen-netfront erroneously casts
it to signed short in BUG_ON().
This would lead to the xen domU panic during boot-up or migration when it
is attached with lots of paravirtual devices.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Oliver Hartkopp [Mon, 24 Oct 2016 19:11:26 +0000 (21:11 +0200)]
can: bcm: fix warning in bcm_connect/proc_register
Andrey Konovalov reported an issue with proc_register in bcm.c.
As suggested by Cong Wang this patch adds a lock_sock() protection and
a check for unsuccessful proc_create_data() in bcm_connect().
Reference: http://marc.info/?l=linux-netdev&m=
147732648731237
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Lukas Resch [Mon, 10 Oct 2016 08:07:32 +0000 (08:07 +0000)]
can: sja1000: plx_pci: Add support for Moxa CAN devices
This patch adds support for Moxa CAN devices.
Signed-off-by: Lukas Resch <l.resch@incubedit.com>
Signed-off-by: Christoph Zehentner <c.zehentner@incubedit.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
David S. Miller [Mon, 31 Oct 2016 19:37:30 +0000 (15:37 -0400)]
Merge tag 'wireless-drivers-for-davem-2016-10-30' of git://git./linux/kernel/git/kvalo/wireless-drivers
Kalle Valo says:
====================
wireless-drivers fixes for 4.9
iwlwifi
* some fixes for suspend/resume with unified FW images
* a fix for a false-positive lockdep report
* a fix for multi-queue that caused an unnecessary 1 second latency
* a fix for an ACPI parsing bug that caused a misleading error message
brcmfmac
* fix a variable uninitialised warning in brcmf_cfg80211_start_ap()
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sun, 30 Oct 2016 09:09:22 +0000 (10:09 +0100)]
mlxsw: spectrum: Fix incorrect reuse of MID entries
In the device, a MID entry represents a group of local ports, which can
later be bound to a MDB entry.
The lookup of an existing MID entry is currently done using the provided
MC MAC address and VID, from the Linux bridge. However, this can result
in an incorrect reuse of the same MID index in different VLAN-unaware
bridges (same IP MC group and VID 0).
Fix this by performing the lookup based on FID instead of VID, which is
unique across different bridges.
Fixes:
3a49b4fde2a1 ("mlxsw: Adding layer 2 multicast support")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mintz, Yuval [Sun, 30 Oct 2016 08:25:42 +0000 (10:25 +0200)]
qede: Fix statistics' strings for Tx/Rx queues
When an interface is configured to use Tx/Rx-only queues,
the length of the statistics would be shortened to accomodate only the
statistics required per-each queue, and the values would be provided
accordingly.
However, the strings provided would still contain both Tx and Rx strings
for each one of the queues [regardless of its configuration], which might
lead to out-of-bound access when filling the buffers as well as incorrect
statistics presented.
Fixes:
9a4d7e86acf3 ("qede: Add support for Tx/Rx-only queues.")
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 29 Oct 2016 18:02:36 +0000 (11:02 -0700)]
net: mangle zero checksum in skb_checksum_help()
Sending zero checksum is ok for TCP, but not for UDP.
UDPv6 receiver should by default drop a frame with a 0 checksum,
and UDPv4 would not verify the checksum and might accept a corrupted
packet.
Simply replace such checksum by 0xffff, regardless of transport.
This error was caught on SIT tunnels, but seems generic.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 28 Oct 2016 20:40:24 +0000 (13:40 -0700)]
net: clear sk_err_soft in sk_clone_lock()
At accept() time, it is possible the parent has a non zero
sk_err_soft, leftover from a prior error.
Make sure we do not leave this value in the child, as it
makes future getsockopt(SO_ERROR) calls quite unreliable.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Fri, 28 Oct 2016 16:43:11 +0000 (18:43 +0200)]
dctcp: avoid bogus doubling of cwnd after loss
If a congestion control module doesn't provide .undo_cwnd function,
tcp_undo_cwnd_reduction() will set cwnd to
tp->snd_cwnd = max(tp->snd_cwnd, tp->snd_ssthresh << 1);
... which makes sense for reno (it sets ssthresh to half the current cwnd),
but it makes no sense for dctcp, which sets ssthresh based on the current
congestion estimate.
This can cause severe growth of cwnd (eventually overflowing u32).
Fix this by saving last cwnd on loss and restore cwnd based on that,
similar to cubic and other algorithms.
Fixes:
e3118e8359bb7c ("net: tcp: add DCTCP congestion control algorithm")
Cc: Lawrence Brakmo <brakmo@fb.com>
Cc: Andrew Shewmaker <agshew@gmail.com>
Cc: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 28 Oct 2016 10:18:01 +0000 (18:18 +0800)]
ipv6: add mtu lock check in __ip6_rt_update_pmtu
Prior to this patch, ipv6 didn't do mtu lock check in ip6_update_pmtu.
It leaded to that mtu lock doesn't really work when receiving the pkt
of ICMPV6_PKT_TOOBIG.
This patch is to add mtu lock check in __ip6_rt_update_pmtu just as ipv4
did in __ip_rt_update_pmtu.
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Sitnicki [Wed, 26 Oct 2016 09:21:14 +0000 (11:21 +0200)]
ipv6: Don't use ufo handling on later transformed packets
Similar to commit
c146066ab802 ("ipv4: Don't use ufo handling on later
transformed packets"), don't perform UFO on packets that will be IPsec
transformed. To detect it we rely on the fact that headerlen in
dst_entry is non-zero only for transformation bundles (xfrm_dst
objects).
Unwanted segmentation can be observed with a NETIF_F_UFO capable device,
such as a dummy device:
DEV=dum0 LEN=1493
ip li add $DEV type dummy
ip addr add fc00::1/64 dev $DEV nodad
ip link set $DEV up
ip xfrm policy add dir out src fc00::1 dst fc00::2 \
tmpl src fc00::1 dst fc00::2 proto esp spi 1
ip xfrm state add src fc00::1 dst fc00::2 \
proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
tcpdump -n -nn -i $DEV -t &
socat /dev/zero,readbytes=$LEN udp6:[fc00::2]:$LEN
tcpdump output before:
IP6 fc00::1 > fc00::2: frag (0|1448) ESP(spi=0x00000001,seq=0x1), length 1448
IP6 fc00::1 > fc00::2: frag (1448|48)
IP6 fc00::1 > fc00::2: ESP(spi=0x00000001,seq=0x2), length 88
... and after:
IP6 fc00::1 > fc00::2: frag (0|1448) ESP(spi=0x00000001,seq=0x1), length 1448
IP6 fc00::1 > fc00::2: frag (1448|80)
Fixes:
e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach")
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Liping Zhang [Sat, 29 Oct 2016 14:09:51 +0000 (22:09 +0800)]
netfilter: nft_dup: do not use sreg_dev if the user doesn't specify it
The NFTA_DUP_SREG_DEV attribute is not a must option, so we should use it
in routing lookup only when the user specify it.
Fixes:
d877f07112f1 ("netfilter: nf_tables: add nft_dup expression")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Liping Zhang [Sat, 29 Oct 2016 14:03:05 +0000 (22:03 +0800)]
netfilter: nf_tables: destroy the set if fail to add transaction
When the memory is exhausted, then we will fail to add the NFT_MSG_NEWSET
transaction. In such case, we should destroy the set before we free it.
Fixes:
958bee14d071 ("netfilter: nf_tables: use new transaction infrastructure to handle sets")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Mark Lord [Sun, 30 Oct 2016 23:28:27 +0000 (19:28 -0400)]
r8152: Fix broken RX checksums.
The r8152 driver has been broken since (approx) 3.16.xx
when support was added for hardware RX checksums
on newer chip versions. Symptoms include random
segfaults and silent data corruption over NFS.
The hardware checksum logig does not work on the VER_02
dongles I have here when used with a slow embedded system CPU.
Google reveals others reporting similar issues on Raspberry Pi.
So, disable hardware RX checksum support for VER_02, and fix
an obvious coding error for IPV6 checksums in the same function.
Because this bug results in silent data corruption,
it is a good candidate for back-porting to -stable >= 3.16.xx.
Signed-off-by: Mark Lord <mlord@pobox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 30 Oct 2016 03:33:20 +0000 (20:33 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
"Lots of fixes, mostly drivers as is usually the case.
1) Don't treat zero DMA address as invalid in vmxnet3, from Alexey
Khoroshilov.
2) Fix element timeouts in netfilter's nft_dynset, from Anders K.
Pedersen.
3) Don't put aead_req crypto struct on the stack in mac80211, from
Ard Biesheuvel.
4) Several uninitialized variable warning fixes from Arnd Bergmann.
5) Fix memory leak in cxgb4, from Colin Ian King.
6) Fix bpf handling of VLAN header push/pop, from Daniel Borkmann.
7) Several VRF semantic fixes from David Ahern.
8) Set skb->protocol properly in ip6_tnl_xmit(), from Eli Cooper.
9) Socket needs to be locked in udp_disconnect(), from Eric Dumazet.
10) Div-by-zero on 32-bit fix in mlx4 driver, from Eugenia Emantayev.
11) Fix stale link state during failover in NCSCI driver, from Gavin
Shan.
12) Fix netdev lower adjacency list traversal, from Ido Schimmel.
13) Propvide proper handle when emitting notifications of filter
deletes, from Jamal Hadi Salim.
14) Memory leaks and big-endian issues in rtl8xxxu, from Jes Sorensen.
15) Fix DESYNC_FACTOR handling in ipv6, from Jiri Bohac.
16) Several routing offload fixes in mlxsw driver, from Jiri Pirko.
17) Fix broadcast sync problem in TIPC, from Jon Paul Maloy.
18) Validate chunk len before using it in SCTP, from Marcelo Ricardo
Leitner.
19) Revert a netns locking change that causes regressions, from Paul
Moore.
20) Add recursion limit to GRO handling, from Sabrina Dubroca.
21) GFP_KERNEL in irq context fix in ibmvnic, from Thomas Falcon.
22) Avoid accessing stale vxlan/geneve socket in data path, from
Pravin Shelar"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (189 commits)
geneve: avoid using stale geneve socket.
vxlan: avoid using stale vxlan socket.
qede: Fix out-of-bound fastpath memory access
net: phy: dp83848: add dp83822 PHY support
enic: fix rq disable
tipc: fix broadcast link synchronization problem
ibmvnic: Fix missing brackets in init_sub_crq_irqs
ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context
Revert "ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context"
arch/powerpc: Update parameters for csum_tcpudp_magic & csum_tcpudp_nofold
net/mlx4_en: Save slave ethtool stats command
net/mlx4_en: Fix potential deadlock in port statistics flow
net/mlx4: Fix firmware command timeout during interrupt test
net/mlx4_core: Do not access comm channel if it has not yet been initialized
net/mlx4_en: Fix panic during reboot
net/mlx4_en: Process all completions in RX rings after port goes up
net/mlx4_en: Resolve dividing by zero in 32-bit system
net/mlx4_core: Change the default value of enable_qos
net/mlx4_core: Avoid setting ports to auto when only one port type is supported
net/mlx4_core: Fix the resource-type enum in res tracker to conform to FW spec
...