Wei Yongjun [Thu, 28 Dec 2017 03:46:48 +0000 (03:46 +0000)]
xen/pvcalls: use GFP_ATOMIC under spin lock
A spin lock is taken here so we should use GFP_ATOMIC.
Fixes:
9774c6cca266 ("xen/pvcalls: implement accept command")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Boris Ostrovsky [Tue, 12 Dec 2017 20:08:21 +0000 (15:08 -0500)]
xen/balloon: Mark unallocated host memory as UNUSABLE
Commit
f5775e0b6116 ("x86/xen: discard RAM regions above the maximum
reservation") left host memory not assigned to dom0 as available for
memory hotplug.
Unfortunately this also meant that those regions could be used by
others. Specifically, commit
fa564ad96366 ("x86/PCI: Enable a 64bit BAR
on AMD Family 15h (Models 00-1f, 30-3f, 60-7f)") may try to map those
addresses as MMIO.
To prevent this mark unallocated host memory as E820_TYPE_UNUSABLE (thus
effectively reverting
f5775e0b6116) and keep track of that region as
a hostmem resource that can be used for the hotplug.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Mon, 18 Dec 2017 16:37:45 +0000 (09:37 -0700)]
x86-64/Xen: eliminate W+X mappings
A few thousand such pages are usually left around due to the re-use of
L1 tables having been provided by the hypervisor (Dom0) or tool stack
(DomU). Set NX in the direct map variant, which needs to be done in L2
due to the dual use of the re-used L1s.
For x86_configure_nx() to actually do what it is supposed to do, call
get_cpu_cap() first. This was broken by commit
4763ed4d45 ("x86, mm:
Clean up and simplify NX enablement") when switching away from the
direct EFER read.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Jan Beulich [Tue, 12 Dec 2017 10:18:11 +0000 (03:18 -0700)]
xen: XEN_ACPI_PROCESSOR is Dom0-only
Add a respective dependency.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Jan Beulich [Fri, 8 Dec 2017 11:17:28 +0000 (04:17 -0700)]
x86/Xen: don't report ancient LAPIC version
Unconditionally reporting a value seen on the P4 or older invokes
functionality like io_apic_get_unique_id() on 32-bit builds, resulting
in a panic() with sufficiently many CPUs and/or IO-APICs. Doing what
that function does would be the hypervisor's responsibility anyway, so
makes no sense to be used when running on Xen. Uniformly report a more
modern version; this shouldn't matter much as both LAPIC and IO-APIC are
being managed entirely / mostly by the hypervisor.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Dan Carpenter [Tue, 5 Dec 2017 14:38:54 +0000 (17:38 +0300)]
xen/pvcalls: Fix a check in pvcalls_front_remove()
bedata->ref can't be less than zero because it's unsigned. This affects
certain error paths in probe. We first set ->ref = -1 and then we set
it to a valid value later.
Fixes:
219681909913 ("xen/pvcalls: connect to the backend")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Dan Carpenter [Tue, 5 Dec 2017 14:38:43 +0000 (17:38 +0300)]
xen/pvcalls: check for xenbus_read() errors
Smatch complains that "len" is uninitialized if xenbus_read() fails so
let's add some error handling.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Stefano Stabellini [Wed, 15 Nov 2017 21:20:21 +0000 (13:20 -0800)]
xen/pvcalls: fix potential endless loop in pvcalls-front.c
mutex_trylock() returns 1 if you take the lock and 0 if not. Assume you
take in_mutex on the first try, but you can't take out_mutex. Next times
you call mutex_trylock() in_mutex is going to fail. It's an endless
loop.
Solve the problem by waiting until the global refcount is 1 instead (the
refcount is 1 when the only active pvcalls frontend function is
pvcalls_front_release).
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Boris Ostrovsky [Wed, 15 Nov 2017 16:24:02 +0000 (11:24 -0500)]
xen/pvcalls: Add MODULE_LICENSE()
Since commit
ba1029c9cbc5 ("modpost: detect modules without a
MODULE_LICENSE") modules without said macro will generate
WARNING: modpost: missing MODULE_LICENSE() in <filename>
While at it, also add module description and attribution.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Joao Martins [Wed, 8 Nov 2017 17:19:58 +0000 (17:19 +0000)]
MAINTAINERS: xen, kvm: track pvclock-abi.h changes
This file defines an ABI shared between guest and hypervisor(s)
(KVM, Xen) and as such there should be an correspondent entry in
MAINTAINERS file. Notice that there's already a text notice at the
top of the header file, hence this commit simply enforces it more
explicitly and have both peers noticed when such changes happen.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Acked-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Joao Martins [Wed, 8 Nov 2017 17:19:57 +0000 (17:19 +0000)]
x86/xen/time: setup vcpu 0 time info page
In order to support pvclock vdso on xen we need to setup the time
info page for vcpu 0 and register the page with Xen using the
VCPUOP_register_vcpu_time_memory_area hypercall. This hypercall
will also forcefully update the pvti which will set some of the
necessary flags for vdso. Afterwards we check if it supports the
PVCLOCK_TSC_STABLE_BIT flag which is mandatory for having
vdso/vsyscall support. And if so, it will set the cpu 0 pvti that
will be later on used when mapping the vdso image.
The xen headers are also updated to include the new hypercall for
registering the secondary vcpu_time_info struct.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Joao Martins [Wed, 8 Nov 2017 17:19:56 +0000 (17:19 +0000)]
x86/xen/time: set pvclock flags on xen_time_init()
Specifically check for PVCLOCK_TSC_STABLE_BIT and if this bit is set,
then set it too on pvclock flags. This allows Xen clocksource to use it
and thus speeding up xen_clocksource_read() callers (i.e. sched_clock())
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Joao Martins [Wed, 8 Nov 2017 17:19:55 +0000 (17:19 +0000)]
x86/pvclock: add setter for pvclock_pvti_cpu0_va
Right now there is only a pvclock_pvti_cpu0_va() which is defined
on kvmclock since:
commit
dac16fba6fc5
("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")
The only user of this interface so far is kvm. This commit adds a
setter function for the pvti page and moves pvclock_pvti_cpu0_va
to pvclock, which is a more generic place to have it; and would
allow other PV clocksources to use it, such as Xen.
While moving pvclock_pvti_cpu0_va into pvclock, rename also this
function to pvclock_get_pvti_cpu0_va (including its call sites)
to be symmetric with the setter (pvclock_set_pvti_cpu0_va).
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Joao Martins [Wed, 8 Nov 2017 17:19:54 +0000 (17:19 +0000)]
ptp_kvm: probe for kvm guest availability
In the event of moving pvclock_pvti_cpu0_va() definition to common
pvclock code, this function would return a value on non KVM guests.
Later on this would fail with a GPF on ptp_kvm_init when running on a
Xen guest. Therefore, ptp_kvm_init() should check whether it is running
in a KVM guest.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Acked-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Colin Ian King [Wed, 8 Nov 2017 13:00:30 +0000 (13:00 +0000)]
xen/privcmd: remove unused variable pageidx
Variable pageidx is assigned a value but it is never read, hence it
is redundant and can be removed. Cleans up clang warning:
drivers/xen/privcmd.c:199:2: warning: Value stored to 'pageidx'
is never read
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Juergen Gross [Thu, 2 Nov 2017 09:19:21 +0000 (10:19 +0100)]
xen: select grant interface version
Grant v2 will be needed in cases where a frame number in the grant
table can exceed 32 bits. For PV guests this is a host feature, while
for HVM guests this is a guest feature.
So select grant v2 in case frame numbers can be larger than 32 bits
and grant v1 else.
For testing purposes add a way to specify the grant interface version
via a boot parameter.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Juergen Gross [Thu, 2 Nov 2017 09:19:20 +0000 (10:19 +0100)]
xen: update arch/x86/include/asm/xen/cpuid.h
Update arch/x86/include/asm/xen/cpuid.h from the Xen tree to get newest
definitions.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Juergen Gross [Thu, 2 Nov 2017 09:19:19 +0000 (10:19 +0100)]
xen: add grant interface version dependent constants to gnttab_ops
Instead of having multiple variables with constants like
grant_table_version or grefs_per_grant_frame add those to struct
gnttab_ops and access them just via the gnttab_interface pointer.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Juergen Gross [Thu, 2 Nov 2017 09:19:18 +0000 (10:19 +0100)]
xen: limit grant v2 interface to the v1 functionality
As there is currently no user for sub-page grants or transient grants
remove that functionality. This at once makes it possible to switch
from grant v2 to grant v1 without restrictions, as there is no loss of
functionality other than the limited frame number width related to
the switch.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Juergen Gross [Thu, 2 Nov 2017 09:19:17 +0000 (10:19 +0100)]
xen: re-introduce support for grant v2 interface
The grant v2 support was removed from the kernel with
commit
438b33c7145ca8a5131a30c36d8f59bce119a19a ("xen/grant-table:
remove support for V2 tables") as the higher memory footprint of v2
grants resulted in less grants being possible for a kernel compared
to the v1 grant interface.
As machines with more than 16TB of memory are expected to be more
common in the near future support of grant v2 is mandatory in order
to be able to run a Xen pv domain at any memory location.
So re-add grant v2 support basically by reverting above commit.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Paul Durrant [Fri, 3 Nov 2017 17:04:11 +0000 (17:04 +0000)]
xen: support priv-mapping in an HVM tools domain
If the domain has XENFEAT_auto_translated_physmap then use of the PV-
specific HYPERVISOR_mmu_update hypercall is clearly incorrect.
This patch adds checks in xen_remap_domain_gfn_array() and
xen_unmap_domain_gfn_array() which call through to the approprate
xlate_mmu function if the feature is present. A check is also added
to xen_remap_domain_gfn_range() to fail with -EOPNOTSUPP since this
should not be used in an HVM tools domain.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Colin Ian King [Fri, 3 Nov 2017 09:20:47 +0000 (09:20 +0000)]
xen/pvcalls: remove redundant check for irq >= 0
This is a moot point, but irq is always less than zero at the label
out_error, so the check for irq >= 0 is redundant and can be removed.
Detected by CoverityScan, CID#1460371 ("Logically dead code")
Fixes:
cb1c7d9bbc87 ("xen/pvcalls: implement connect command")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Colin Ian King [Fri, 3 Nov 2017 08:42:02 +0000 (08:42 +0000)]
xen/pvcalls: fix unsigned less than zero error check
The check on bedata->ref is never true because ref is an unsigned
integer. Fix this by assigning signed int ret to the return of the
call to gnttab_claim_grant_reference so the -ve return can be checked.
Detected by CoverityScan, CID#1460358 ("Unsigned compared against 0")
Fixes:
219681909913 ("xen/pvcalls: connect to the backend")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Boris Ostrovsky [Thu, 2 Nov 2017 22:18:03 +0000 (18:18 -0400)]
xen/time: Return -ENODEV from xen_get_wallclock()
For any other error sync_cmos_clock() will reschedule itself
every second or so, for no good reason.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Gustavo A. R. Silva [Thu, 2 Nov 2017 18:51:22 +0000 (13:51 -0500)]
xen/pvcalls-front: mark expected switch fall-through
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Notice that in this particular case I placed the "fall through" comment
on its own line, which is what GCC is expecting to find.
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Gustavo A. R. Silva [Thu, 2 Nov 2017 18:41:07 +0000 (13:41 -0500)]
xen: xenbus_probe_frontend: mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Addresses-Coverity-ID: 146562
Addresses-Coverity-ID: 146563
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Dongli Zhang [Wed, 1 Nov 2017 01:46:33 +0000 (09:46 +0800)]
xen/time: do not decrease steal time after live migration on xen
After guest live migration on xen, steal time in /proc/stat
(cpustat[CPUTIME_STEAL]) might decrease because steal returned by
xen_steal_lock() might be less than this_rq()->prev_steal_time which is
derived from previous return value of xen_steal_clock().
For instance, steal time of each vcpu is 335 before live migration.
cpu 198 0 368 200064 1962 0 0 1340 0 0
cpu0 38 0 81 50063 492 0 0 335 0 0
cpu1 65 0 97 49763 634 0 0 335 0 0
cpu2 38 0 81 50098 462 0 0 335 0 0
cpu3 56 0 107 50138 374 0 0 335 0 0
After live migration, steal time is reduced to 312.
cpu 200 0 370 200330 1971 0 0 1248 0 0
cpu0 38 0 82 50123 500 0 0 312 0 0
cpu1 65 0 97 49832 634 0 0 312 0 0
cpu2 39 0 82 50167 462 0 0 312 0 0
cpu3 56 0 107 50207 374 0 0 312 0 0
Since runstate times are cumulative and cleared during xen live migration
by xen hypervisor, the idea of this patch is to accumulate runstate times
to global percpu variables before live migration suspend. Once guest VM is
resumed, xen_get_runstate_snapshot_cpu() would always return the sum of new
runstate times and previously accumulated times stored in global percpu
variables.
Comment above HYPERVISOR_suspend() has been removed as it is inaccurate:
the call can return an error code (e.g., possibly -EPERM in the future).
Similar and more severe issue would impact prior linux 4.8-4.10 as
discussed by Michael Las at
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,
which would overflow steal time and lead to 100% st usage in top command
for linux 4.8-4.10. A backport of this patch would fix that issue.
[boris: added linux/slab.h to driver/xen/time.c, slightly reformatted
commit message]
References: https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Juergen Gross [Fri, 27 Oct 2017 17:49:37 +0000 (19:49 +0200)]
xen: support 52 bit physical addresses in pv guests
Physical addresses on processors supporting 5 level paging can be up to
52 bits wide. For a Xen pv guest running on such a machine those
physical addresses have to be supported in order to be able to use any
memory on the machine even if the guest itself does not support 5 level
paging.
So when reading/writing a MFN from/to a pte don't use the kernel's
PTE_PFN_MASK but a new XEN_PTE_MFN_MASK allowing full 40 bit wide MFNs.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Stefano Stabellini [Mon, 30 Oct 2017 22:41:03 +0000 (15:41 -0700)]
xen: introduce a Kconfig option to enable the pvcalls frontend
Also add pvcalls-front to the Makefile.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
CC: boris.ostrovsky@oracle.com
CC: jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Stefano Stabellini [Mon, 30 Oct 2017 22:41:02 +0000 (15:41 -0700)]
xen/pvcalls: implement release command
Send PVCALLS_RELEASE to the backend and wait for a reply. Take both
in_mutex and out_mutex to avoid concurrent accesses. Then, free the
socket.
For passive sockets, check whether we have already pre-allocated an
active socket for the purpose of being accepted. If so, free that as
well.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
CC: boris.ostrovsky@oracle.com
CC: jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Stefano Stabellini [Mon, 30 Oct 2017 22:41:01 +0000 (15:41 -0700)]
xen/pvcalls: implement poll command
For active sockets, check the indexes and use the inflight_conn_req
waitqueue to wait.
For passive sockets if an accept is outstanding
(PVCALLS_FLAG_ACCEPT_INFLIGHT), check if it has been answered by looking
at bedata->rsp[req_id]. If so, return POLLIN. Otherwise use the
inflight_accept_req waitqueue.
If no accepts are inflight, send PVCALLS_POLL to the backend. If we have
outstanding POLL requests awaiting for a response use the inflight_req
waitqueue: inflight_req is awaken when a new response is received; on
wakeup we check whether the POLL response is arrived by looking at the
PVCALLS_FLAG_POLL_RET flag. We set the flag from
pvcalls_front_event_handler, if the response was for a POLL command.
In pvcalls_front_event_handler, get the struct sock_mapping from the
poll id (we previously converted struct sock_mapping* to uintptr_t and
used it as id).
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
CC: boris.ostrovsky@oracle.com
CC: jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Stefano Stabellini [Mon, 30 Oct 2017 22:41:00 +0000 (15:41 -0700)]
xen/pvcalls: implement recvmsg
Implement recvmsg by copying data from the "in" ring. If not enough data
is available and the recvmsg call is blocking, then wait on the
inflight_conn_req waitqueue. Take the active socket in_mutex so that
only one function can access the ring at any given time.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
CC: boris.ostrovsky@oracle.com
CC: jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Stefano Stabellini [Mon, 30 Oct 2017 22:40:59 +0000 (15:40 -0700)]
xen/pvcalls: implement sendmsg
Send data to an active socket by copying data to the "out" ring. Take
the active socket out_mutex so that only one function can access the
ring at any given time.
If not enough room is available on the ring, rather than returning
immediately or sleep-waiting, spin for up to 5000 cycles. This small
optimization turns out to improve performance significantly.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
CC: boris.ostrovsky@oracle.com
CC: jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Stefano Stabellini [Mon, 30 Oct 2017 22:40:58 +0000 (15:40 -0700)]
xen/pvcalls: implement accept command
Introduce a waitqueue to allow only one outstanding accept command at
any given time and to implement polling on the passive socket. Introduce
a flags field to keep track of in-flight accept and poll commands.
Send PVCALLS_ACCEPT to the backend. Allocate a new active socket. Make
sure that only one accept command is executed at any given time by
setting PVCALLS_FLAG_ACCEPT_INFLIGHT and waiting on the
inflight_accept_req waitqueue.
Convert the new struct sock_mapping pointer into an uintptr_t and use it
as id for the new socket to pass to the backend.
Check if the accept call is non-blocking: in that case after sending the
ACCEPT command to the backend store the sock_mapping pointer of the new
struct and the inflight req_id then return -EAGAIN (which will respond
only when there is something to accept). Next time accept is called,
we'll check if the ACCEPT command has been answered, if so we'll pick up
where we left off, otherwise we return -EAGAIN again.
Note that, differently from the other commands, we can use
wait_event_interruptible (instead of wait_event) in the case of accept
as we are able to track the req_id of the ACCEPT response that we are
waiting.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
CC: boris.ostrovsky@oracle.com
CC: jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Stefano Stabellini [Mon, 30 Oct 2017 22:40:57 +0000 (15:40 -0700)]
xen/pvcalls: implement listen command
Send PVCALLS_LISTEN to the backend.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
CC: boris.ostrovsky@oracle.com
CC: jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Stefano Stabellini [Mon, 30 Oct 2017 22:40:56 +0000 (15:40 -0700)]
xen/pvcalls: implement bind command
Send PVCALLS_BIND to the backend. Introduce a new structure, part of
struct sock_mapping, to store information specific to passive sockets.
Introduce a status field to keep track of the status of the passive
socket.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
CC: boris.ostrovsky@oracle.com
CC: jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Stefano Stabellini [Mon, 30 Oct 2017 22:40:55 +0000 (15:40 -0700)]
xen/pvcalls: implement connect command
Send PVCALLS_CONNECT to the backend. Allocate a new ring and evtchn for
the active socket.
Introduce fields in struct sock_mapping to keep track of active sockets.
Introduce a waitqueue to allow the frontend to wait on data coming from
the backend on the active socket (recvmsg command).
Two mutexes (one of reads and one for writes) will be used to protect
the active socket in and out rings from concurrent accesses.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
CC: boris.ostrovsky@oracle.com
CC: jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Stefano Stabellini [Mon, 30 Oct 2017 22:40:54 +0000 (15:40 -0700)]
xen/pvcalls: implement socket command and handle events
Send a PVCALLS_SOCKET command to the backend, use the masked
req_prod_pvt as req_id. This way, req_id is guaranteed to be between 0
and PVCALLS_NR_REQ_PER_RING. We already have a slot in the rsp array
ready for the response, and there cannot be two outstanding responses
with the same req_id.
Wait for the response by waiting on the inflight_req waitqueue and
check for the req_id field in rsp[req_id]. Use atomic accesses and
barriers to read the field. Note that the barriers are simple smp
barriers (as opposed to virt barriers) because they are for internal
frontend synchronization, not frontend<->backend communication.
Once a response is received, clear the corresponding rsp slot by setting
req_id to PVCALLS_INVALID_ID. Note that PVCALLS_INVALID_ID is invalid
only from the frontend point of view. It is not part of the PVCalls
protocol.
pvcalls_front_event_handler is in charge of copying responses from the
ring to the appropriate rsp slot. It is done by copying the body of the
response first, then by copying req_id atomically. After the copies,
wake up anybody waiting on waitqueue.
socket_lock protects accesses to the ring.
Convert the pointer to sock_mapping into an uintptr_t and use it as
id for the new socket to pass to the backend. The struct will be fully
initialized later on connect or bind.
sock->sk->sk_send_head is not used for ip sockets: reuse the field to
store a pointer to the struct sock_mapping corresponding to the socket.
This way, we can easily get the struct sock_mapping from the struct
socket.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
CC: boris.ostrovsky@oracle.com
CC: jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Stefano Stabellini [Mon, 30 Oct 2017 22:40:53 +0000 (15:40 -0700)]
xen/pvcalls: connect to the backend
Implement the probe function for the pvcalls frontend. Read the
supported versions, max-page-order and function-calls nodes from
xenstore.
Only one frontend<->backend connection is supported at any given time
for a guest. Store the active frontend device to a static pointer.
Introduce a stub functions for the event handler.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
CC: boris.ostrovsky@oracle.com
CC: jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Stefano Stabellini [Mon, 30 Oct 2017 22:40:52 +0000 (15:40 -0700)]
xen/pvcalls: implement frontend disconnect
Introduce a data structure named pvcalls_bedata. It contains pointers to
the command ring, the event channel, a list of active sockets and a list
of passive sockets. Lists accesses are protected by a spin_lock.
Introduce a waitqueue to allow waiting for a response on commands sent
to the backend.
Introduce an array of struct xen_pvcalls_response to store commands
responses.
Introduce a new struct sock_mapping to keep track of sockets. In this
patch the struct sock_mapping is minimal, the fields will be added by
the next patches.
pvcalls_refcount is used to keep count of the outstanding pvcalls users.
Only remove connections once the refcount is zero.
Implement pvcalls frontend removal function. Go through the list of
active and passive sockets and free them all, one at a time.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
CC: boris.ostrovsky@oracle.com
CC: jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Stefano Stabellini [Mon, 30 Oct 2017 22:40:51 +0000 (15:40 -0700)]
xen/pvcalls: introduce the pvcalls xenbus frontend
Introduce a xenbus frontend for the pvcalls protocol, as defined by
https://xenbits.xen.org/docs/unstable/misc/pvcalls.html.
This patch only adds the stubs, the code will be added by the following
patches.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
CC: boris.ostrovsky@oracle.com
CC: jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Linus Torvalds [Sun, 29 Oct 2017 20:58:38 +0000 (13:58 -0700)]
Linux 4.14-rc7
Linus Torvalds [Sun, 29 Oct 2017 15:11:49 +0000 (08:11 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Fix route leak in xfrm_bundle_create().
2) In mac80211, validate user rate mask before configuring it. From
Johannes Berg.
3) Properly enforce memory limits in fair queueing code, from Toke
Hoiland-Jorgensen.
4) Fix lockdep splat in inet_csk_route_req(), from Eric Dumazet.
5) Fix TSO header allocation and management in mvpp2 driver, from Yan
Markman.
6) Don't take socket lock in BH handler in strparser code, from Tom
Herbert.
7) Don't show sockets from other namespaces in AF_UNIX code, from
Andrei Vagin.
8) Fix double free in error path of tap_open(), from Girish Moodalbail.
9) Fix TX map failure path in igb and ixgbe, from Jean-Philippe Brucker
and Alexander Duyck.
10) Fix DCB mode programming in stmmac driver, from Jose Abreu.
11) Fix err_count handling in various tunnels (ipip, ip6_gre). From Xin
Long.
12) Properly align SKB head before building SKB in tuntap, from Jason
Wang.
13) Avoid matching qdiscs with a zero handle during lookups, from Cong
Wang.
14) Fix various endianness bugs in sctp, from Xin Long.
15) Fix tc filter callback races and add selftests which trigger the
problem, from Cong Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits)
selftests: Introduce a new test case to tc testsuite
selftests: Introduce a new script to generate tc batch file
net_sched: fix call_rcu() race on act_sample module removal
net_sched: add rtnl assertion to tcf_exts_destroy()
net_sched: use tcf_queue_work() in tcindex filter
net_sched: use tcf_queue_work() in rsvp filter
net_sched: use tcf_queue_work() in route filter
net_sched: use tcf_queue_work() in u32 filter
net_sched: use tcf_queue_work() in matchall filter
net_sched: use tcf_queue_work() in fw filter
net_sched: use tcf_queue_work() in flower filter
net_sched: use tcf_queue_work() in flow filter
net_sched: use tcf_queue_work() in cgroup filter
net_sched: use tcf_queue_work() in bpf filter
net_sched: use tcf_queue_work() in basic filter
net_sched: introduce a workqueue for RCU callbacks of tc filter
sctp: fix some type cast warnings introduced since very beginning
sctp: fix a type cast warnings that causes a_rwnd gets the wrong value
sctp: fix some type cast warnings introduced by transport rhashtable
sctp: fix some type cast warnings introduced by stream reconf
...
David S. Miller [Sun, 29 Oct 2017 13:49:32 +0000 (22:49 +0900)]
Merge branch 'net_sched-fix-races-with-RCU-callbacks'
Cong Wang says:
====================
net_sched: fix races with RCU callbacks
Recently, the RCU callbacks used in TC filters and TC actions keep
drawing my attention, they introduce at least 4 race condition bugs:
1. A simple one fixed by Daniel:
commit
c78e1746d3ad7d548bdf3fe491898cc453911a49
Author: Daniel Borkmann <daniel@iogearbox.net>
Date: Wed May 20 17:13:33 2015 +0200
net: sched: fix call_rcu() race on classifier module unloads
2. A very nasty one fixed by me:
commit
1697c4bb5245649a23f06a144cc38c06715e1b65
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date: Mon Sep 11 16:33:32 2017 -0700
net_sched: carefully handle tcf_block_put()
3. Two more bugs found by Chris:
https://patchwork.ozlabs.org/patch/826696/
https://patchwork.ozlabs.org/patch/826695/
Usually RCU callbacks are simple, however for TC filters and actions,
they are complex because at least TC actions could be destroyed
together with the TC filter in one callback. And RCU callbacks are
invoked in BH context, without locking they are parallel too. All of
these contribute to the cause of these nasty bugs.
Alternatively, we could also:
a) Introduce a spinlock to serialize these RCU callbacks. But as I
said in commit
1697c4bb5245 ("net_sched: carefully handle
tcf_block_put()"), it is very hard to do because of tcf_chain_dump().
Potentially we need to do a lot of work to make it possible (if not
impossible).
b) Just get rid of these RCU callbacks, because they are not
necessary at all, callers of these call_rcu() are all on slow paths
and holding RTNL lock, so blocking is allowed in their contexts.
However, David and Eric dislike adding synchronize_rcu() here.
As suggested by Paul, we could defer the work to a workqueue and
gain the permission of holding RTNL again without any performance
impact, however, in tcf_block_put() we could have a deadlock when
flushing workqueue while hodling RTNL lock, the trick here is to
defer the work itself in workqueue and make it queued after all
other works so that we keep the same ordering to avoid any
use-after-free. Please see the first patch for details.
Patch 1 introduces the infrastructure, patch 2~12 move each
tc filter to the new tc filter workqueue, patch 13 adds
an assertion to catch potential bugs like this, patch 14
closes another rcu callback race, patch 15 and patch 16 add
new test cases.
====================
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Chris Mi [Fri, 27 Oct 2017 01:24:43 +0000 (18:24 -0700)]
selftests: Introduce a new test case to tc testsuite
In this patchset, we fixed a tc bug. This patch adds the test case
that reproduces the bug. To run this test case, user should specify
an existing NIC device:
# sudo ./tdc.py -d enp4s0f0
This test case belongs to category "flower". If user doesn't specify
a NIC device, the test cases belong to "flower" will not be run.
In this test case, we create 1M filters and all filters share the same
action. When destroying all filters, kernel should not panic. It takes
about 18s to run it.
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Lucas Bates <lucasb@mojatatu.com>
Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Chris Mi [Fri, 27 Oct 2017 01:24:42 +0000 (18:24 -0700)]
selftests: Introduce a new script to generate tc batch file
# ./tdc_batch.py -h
usage: tdc_batch.py [-h] [-n NUMBER] [-o] [-s] [-p] device file
TC batch file generator
positional arguments:
device device name
file batch file name
optional arguments:
-h, --help show this help message and exit
-n NUMBER, --number NUMBER
how many lines in batch file
-o, --skip_sw skip_sw (offload), by default skip_hw
-s, --share_action all filters share the same action
-p, --prio all filters have different prio
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Lucas Bates <lucasb@mojatatu.com>
Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:41 +0000 (18:24 -0700)]
net_sched: fix call_rcu() race on act_sample module removal
Similar to commit
c78e1746d3ad
("net: sched: fix call_rcu() race on classifier module unloads"),
we need to wait for flying RCU callback tcf_sample_cleanup_rcu().
Cc: Yotam Gigi <yotamg@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:40 +0000 (18:24 -0700)]
net_sched: add rtnl assertion to tcf_exts_destroy()
After previous patches, it is now safe to claim that
tcf_exts_destroy() is always called with RTNL lock.
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:39 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in tcindex filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:38 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in rsvp filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:37 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in route filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:36 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in u32 filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:35 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in matchall filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:34 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in fw filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:33 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in flower filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:32 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in flow filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:31 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in cgroup filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:30 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in bpf filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:29 +0000 (18:24 -0700)]
net_sched: use tcf_queue_work() in basic filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 27 Oct 2017 01:24:28 +0000 (18:24 -0700)]
net_sched: introduce a workqueue for RCU callbacks of tc filter
This patch introduces a dedicated workqueue for tc filters
so that each tc filter's RCU callback could defer their
action destroy work to this workqueue. The helper
tcf_queue_work() is introduced for them to use.
Because we hold RTNL lock when calling tcf_block_put(), we
can not simply flush works inside it, therefore we have to
defer it again to this workqueue and make sure all flying RCU
callbacks have already queued their work before this one, in
other words, to ensure this is the last one to execute to
prevent any use-after-free.
On the other hand, this makes tcf_block_put() ugly and
harder to understand. Since David and Eric strongly dislike
adding synchronize_rcu(), this is probably the only
solution that could make everyone happy.
Please also see the code comments below.
Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 29 Oct 2017 09:03:25 +0000 (18:03 +0900)]
Merge branch 'sctp-endianness-fixes'
Xin Long says:
====================
sctp: a bunch of fixes for some sparse warnings
As Eric noticed, when running 'make C=2 M=net/sctp/', a plenty of
warnings or errors checked by sparse appear. They are all problems
about Endian and type cast.
Most of them are just warnings by which no issues could be caused
while some might be bugs.
This patchset fixes them with four patches basically according to
how they are introduced.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Sat, 28 Oct 2017 11:43:57 +0000 (19:43 +0800)]
sctp: fix some type cast warnings introduced since very beginning
These warnings were found by running 'make C=2 M=net/sctp/'.
They are there since very beginning.
Note after this patch, there still one warning left in
sctp_outq_flush():
sctp_chunk_fail(chunk, SCTP_ERROR_INV_STRM)
Since it has been moved to sctp_stream_outq_migrate on net-next,
to avoid the extra job when merging net-next to net, I will post
the fix for it after the merging is done.
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Sat, 28 Oct 2017 11:43:56 +0000 (19:43 +0800)]
sctp: fix a type cast warnings that causes a_rwnd gets the wrong value
These warnings were found by running 'make C=2 M=net/sctp/'.
Commit
d4d6fb5787a6 ("sctp: Try not to change a_rwnd when faking a
SACK from SHUTDOWN.") expected to use the peers old rwnd and add
our flight size to the a_rwnd. But with the wrong Endian, it may
not work as well as expected.
So fix it by converting to the right value.
Fixes:
d4d6fb5787a6 ("sctp: Try not to change a_rwnd when faking a SACK from SHUTDOWN.")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Sat, 28 Oct 2017 11:43:55 +0000 (19:43 +0800)]
sctp: fix some type cast warnings introduced by transport rhashtable
These warnings were found by running 'make C=2 M=net/sctp/'.
They are introduced by not aware of Endian for the port when
coding transport rhashtable patches.
Fixes:
7fda702f9315 ("sctp: use new rhlist interface on sctp transport rhashtable")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Sat, 28 Oct 2017 11:43:54 +0000 (19:43 +0800)]
sctp: fix some type cast warnings introduced by stream reconf
These warnings were found by running 'make C=2 M=net/sctp/'.
They are introduced by not aware of Endian when coding stream
reconf patches.
Since commit
c0d8bab6ae51 ("sctp: add get and set sockopt for
reconf_enable") enabled stream reconf feature for users, the
Fixes tag below would use it.
Fixes:
c0d8bab6ae51 ("sctp: add get and set sockopt for reconf_enable")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Sat, 28 Oct 2017 05:08:56 +0000 (22:08 -0700)]
net_sched: avoid matching qdisc with zero handle
Davide found the following script triggers a NULL pointer
dereference:
ip l a name eth0 type dummy
tc q a dev eth0 parent :1 handle 1: htb
This is because for a freshly created netdevice noop_qdisc
is attached and when passing 'parent :1', kernel actually
tries to match the major handle which is 0 and noop_qdisc
has handle 0 so is matched by mistake. Commit
69012ae425d7
tries to fix a similar bug but still misses this case.
Handle 0 is not a valid one, should be just skipped. In
fact, kernel uses it as TC_H_UNSPEC.
Fixes:
69012ae425d7 ("net: sched: fix handling of singleton qdiscs with qdisc_hash")
Fixes:
59cc1f61f09c ("net: sched:convert qdisc linked list to hashtable")
Reported-by: Davide Caratti <dcaratti@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 27 Oct 2017 18:13:29 +0000 (02:13 +0800)]
sctp: reset owner sk for data chunks on out queues when migrating a sock
Now when migrating sock to another one in sctp_sock_migrate(), it only
resets owner sk for the data in receive queues, not the chunks on out
queues.
It would cause that data chunks length on the sock is not consistent
with sk sk_wmem_alloc. When closing the sock or freeing these chunks,
the old sk would never be freed, and the new sock may crash due to
the overflow sk_wmem_alloc.
syzbot found this issue with this series:
r0 = socket$inet_sctp()
sendto$inet(r0)
listen(r0)
accept4(r0)
close(r0)
Although listen() should have returned error when one TCP-style socket
is in connecting (I may fix this one in another patch), it could also
be reproduced by peeling off an assoc.
This issue is there since very beginning.
This patch is to reset owner sk for the chunks on out queues so that
sk sk_wmem_alloc has correct value after accept one sock or peeloff
an assoc to one sock.
Note that when resetting owner sk for chunks on outqueue, it has to
sctp_clear_owner_w/skb_orphan chunks before changing assoc->base.sk
first and then sctp_set_owner_w them after changing assoc->base.sk,
due to that sctp_wfree and it's callees are using assoc->base.sk.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 29 Oct 2017 02:18:49 +0000 (11:18 +0900)]
Merge branch 'sockmap-fixes'
John Fastabend says:
====================
net: sockmap fixes
Last two fixes (as far as I know) for sockmap code this round.
First, we are using the qdisc cb structure when making the data end
calculation. This is really just wrong so, store it with the other
metadata in the correct tcp_skb_cb sturct to avoid breaking things.
Next, with recent work to attach multiple programs to a cgroup a
specific enumeration of return codes was agreed upon. However,
I wrote the sk_skb program types before seeing this work and used
a different convention. Patch 2 in the series aligns the return
codes to avoid breaking with this infrastructure and also aligns
with other programming conventions to avoid being the odd duck out
forcing programs to remember SK_SKB programs are different. Pusing
to net because its a user visible change. With this SK_SKB program
return codes are the same as other cgroup program types.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Fri, 27 Oct 2017 16:45:53 +0000 (09:45 -0700)]
bpf: rename sk_actions to align with bpf infrastructure
Recent additions to support multiple programs in cgroups impose
a strict requirement, "all yes is yes, any no is no". To enforce
this the infrastructure requires the 'no' return code, SK_DROP in
this case, to be 0.
To apply these rules to SK_SKB program types the sk_actions return
codes need to be adjusted.
This fix adds SK_PASS and makes 'SK_DROP = 0'. Finally, remove
SK_ABORTED to remove any chance that the API may allow aborted
program flows to be passed up the stack. This would be incorrect
behavior and allow programs to break existing policies.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Fri, 27 Oct 2017 16:45:34 +0000 (09:45 -0700)]
bpf: bpf_compute_data uses incorrect cb structure
SK_SKB program types use bpf_compute_data to store the end of the
packet data. However, bpf_compute_data assumes the cb is stored in the
qdisc layer format. But, for SK_SKB this is the wrong layer of the
stack for this type.
It happens to work (sort of!) because in most cases nothing happens
to be overwritten today. This is very fragile and error prone.
Fortunately, we have another hole in tcp_skb_cb we can use so lets
put the data_end value there.
Note, SK_SKB program types do not use data_meta, they are failed by
sk_skb_is_valid_access().
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 28 Oct 2017 18:01:57 +0000 (11:01 -0700)]
Merge tag 'kbuild-fixes-v4.14-2' of git://git./linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- fix O= building on dash
- remove unused dependency in Makefile
- fix default of a choice in Kconfig
- fix typos and documentation style
- fix command options unrecognized by sparse
* tag 'kbuild-fixes-v4.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: clang: fix build failures with sparse check
kbuild doc: a bundle of fixes on makefiles.txt
Makefile: kselftest: fix grammar typo
kbuild: Fix optimization level choice default
kbuild: drop unused symverfile in Makefile.modpost
kbuild: revert $(realpath ...) to $(shell cd ... && /bin/pwd)
Linus Torvalds [Sat, 28 Oct 2017 17:56:13 +0000 (10:56 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:
- fix gtco tablet driver, tightening parsing of HID descriptors
- add ACPI ID added to Elan driver to be able to handle touchpads found
in Lenovo Ideapad 320/520
- fix the Symaptics RMI4 driver to adjust handling of buttons
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: synaptics-rmi4 - limit the range of what GPIOs are buttons
Input: gtco - fix potential out-of-bound access
Input: elan_i2c - add ELAN0611 to the ACPI table
Linus Torvalds [Sat, 28 Oct 2017 17:53:24 +0000 (10:53 -0700)]
Merge tag 'pci-v4.14-fixes-6' of git://git./linux/kernel/git/helgaas/pci
Pull PCI fix from Bjorn Helgaas:
"Move alpha PCI IRQ map/swizzle functions out of initdata to fix
regression from PCI core IRQ mapping changes (Lorenzo Pieralisi)"
* tag 'pci-v4.14-fixes-6' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
alpha/PCI: Move pci_map_irq()/pci_swizzle() out of initdata
Linus Torvalds [Sat, 28 Oct 2017 17:50:38 +0000 (10:50 -0700)]
Merge tag 'drm-fixes-for-v4.14-rc7' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Two amd fixes, one i915 core and a few i915 GVT fixes, things seem
fairly quiet"
* tag 'drm-fixes-for-v4.14-rc7' of git://people.freedesktop.org/~airlied/linux:
drm/i915/gvt: Adding ACTHD mmio read handler
drm/i915/gvt: Extract mmio_read_from_hw() common function
drm/i915/gvt: Refine MMIO_RING_F()
drm/i915/gvt: properly check per_ctx bb valid state
drm/i915/perf: fix perf enable/disable ioctls with 32bits userspace
drm/amd/amdgpu: Remove workaround check for UVD6 on APUs
drm/amd/powerplay: fix uninitialized variable
Linus Torvalds [Sat, 28 Oct 2017 17:46:20 +0000 (10:46 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Six fixes for mostly minor issues, most of which have small race
windows for occurring"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: Suppress a kernel warning in case the prep function returns BLKPREP_DEFER
scsi: sg: Re-fix off by one in sg_fill_request_table()
scsi: aacraid: Fix controller initialization failure
scsi: hpsa: Fix configured_logical_drive_count·check
scsi: qla2xxx: Initialize Work element before requesting IRQs
scsi: zfcp: fix erp_action use-before-initialize in REC action trace
David Howells [Wed, 11 Oct 2017 22:32:27 +0000 (23:32 +0100)]
assoc_array: Fix a buggy node-splitting case
This fixes CVE-2017-12193.
Fix a case in the assoc_array implementation in which a new leaf is
added that needs to go into a node that happens to be full, where the
existing leaves in that node cluster together at that level to the
exclusion of new leaf.
What needs to happen is that the existing leaves get moved out to a new
node, N1, at level + 1 and the existing node needs replacing with one,
N0, that has pointers to the new leaf and to N1.
The code that tries to do this gets this wrong in two ways:
(1) The pointer that should've pointed from N0 to N1 is set to point
recursively to N0 instead.
(2) The backpointer from N0 needs to be set correctly in the case N0 is
either the root node or reached through a shortcut.
Fix this by removing this path and using the split_node path instead,
which achieves the same end, but in a more general way (thanks to Eric
Biggers for spotting the redundancy).
The problem manifests itself as:
BUG: unable to handle kernel NULL pointer dereference at
0000000000000010
IP: assoc_array_apply_edit+0x59/0xe5
Fixes:
3cb989501c26 ("Add a generic associative array implementation.")
Reported-and-tested-by: WU Fan <u3536072@connect.hku.hk>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: stable@vger.kernel.org [v3.13-rc1+]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 28 Oct 2017 15:39:35 +0000 (08:39 -0700)]
Merge tag '4.14-smb3-fixes-for-stable' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Various SMB3 fixes for 4.14 and stable"
* tag '4.14-smb3-fixes-for-stable' of git://git.samba.org/sfrench/cifs-2.6:
SMB3: Validate negotiate request must always be signed
SMB: fix validate negotiate info uninitialised memory use
SMB: fix leak of validate negotiate info response buffer
CIFS: Fix NULL pointer deref on SMB2_tcon() failure
CIFS: do not send invalid input buffer on QUERY_INFO requests
cifs: Select all required crypto modules
CIFS: SMBD: Fix the definition for SMB2_CHANNEL_RDMA_V1_INVALIDATE
cifs: handle large EA requests more gracefully in smb2+
Fix encryption labels and lengths for SMB3.1.1
Linus Torvalds [Sat, 28 Oct 2017 15:29:29 +0000 (08:29 -0700)]
Merge branch 'overlayfs-linus' of git://git./linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"Fix several issues, most of them introduced in the last release"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: do not cleanup unsupported index entries
ovl: handle ENOENT on index lookup
ovl: fix EIO from lookup of non-indexed upper
ovl: Return -ENOMEM if an allocation fails ovl_lookup()
ovl: add NULL check in ovl_alloc_inode
Linus Torvalds [Sat, 28 Oct 2017 15:27:46 +0000 (08:27 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/mszeredi/fuse
Pull fuse fix from Miklos Szeredi:
"This fixes a longstanding bug, which can be triggered by interrupting
a directory reading syscall"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix READDIRPLUS skipping an entry
Girish Moodalbail [Fri, 27 Oct 2017 07:00:16 +0000 (00:00 -0700)]
tap: reference to KVA of an unloaded module causes kernel panic
The commit
9a393b5d5988 ("tap: tap as an independent module") created a
separate tap module that implements tap functionality and exports
interfaces that will be used by macvtap and ipvtap modules to create
create respective tap devices.
However, that patch introduced a regression wherein the modules macvtap
and ipvtap can be removed (through modprobe -r) while there are
applications using the respective /dev/tapX devices. These applications
cause kernel to hold reference to /dev/tapX through 'struct cdev
macvtap_cdev' and 'struct cdev ipvtap_dev' defined in macvtap and ipvtap
modules respectively. So, when the application is later closed the
kernel panics because we are referencing KVA that is present in the
unloaded modules.
----------8<------- Example ----------8<----------
$ sudo ip li add name mv0 link enp7s0 type macvtap
$ sudo ip li show mv0 |grep mv0| awk -e '{print $1 $2}'
14:mv0@enp7s0:
$ cat /dev/tap14 &
$ lsmod |egrep -i 'tap|vlan'
macvtap 16384 0
macvlan 24576 1 macvtap
tap 24576 3 macvtap
$ sudo modprobe -r macvtap
$ fg
cat /dev/tap14
^C
<...system panics...>
BUG: unable to handle kernel paging request at
ffffffffa038c500
IP: cdev_put+0xf/0x30
----------8<-----------------8<----------
The fix is to set cdev.owner to the module that creates the tap device
(either macvtap or ipvtap). With this set, the operations (in
fs/char_dev.c) on char device holds and releases the module through
cdev_get() and cdev_put() and will not allow the module to unload
prematurely.
Fixes:
9a393b5d5988ea4e (tap: tap as an independent module)
Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 27 Oct 2017 04:21:40 +0000 (21:21 -0700)]
tcp: refresh tp timestamp before tcp_mtu_probe()
In the unlikely event tcp_mtu_probe() is sending a packet, we
want tp->tcp_mstamp being as accurate as possible.
This means we need to call tcp_mstamp_refresh() a bit earlier in
tcp_write_xmit().
Fixes:
385e20706fac ("tcp: use tp->tcp_mstamp in output path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Fri, 27 Oct 2017 03:05:44 +0000 (11:05 +0800)]
tuntap: properly align skb->head before building skb
An unaligned alloc_frag->offset caused by previous allocation will
result an unaligned skb->head. This will lead unaligned
skb_shared_info and then unaligned dataref which requires to be
aligned for accessing on some architecture. Fix this by aligning
alloc_frag->offset before the frag refilling.
Fixes:
0bbd7dad34f8 ("tun: make tun_build_skb() thread safe")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Wei Wei <dotweiba@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Reported-by: Wei Wei <dotweiba@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 28 Oct 2017 03:41:05 +0000 (20:41 -0700)]
Merge tag 'for-linus-4.14c-rc7-tag' of git://git./linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- a fix for the Xen gntdev device repairing an issue in case of partial
failure of mapping multiple pages of another domain
- a fix of a regression in the Xen balloon driver introduced in 4.13
- a build fix for Xen on ARM which will trigger e.g. for Linux RT
- a maintainers update for pvops (not really Xen, but carrying through
this tree just for convenience)
* tag 'for-linus-4.14c-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
maintainers: drop Chris Wright from pvops
arm/xen: don't inclide rwlock.h directly.
xen: fix booting ballooned down hvm guest
xen/gntdev: avoid out of bounds access in case of partial gntdev_mmap()
Linus Torvalds [Sat, 28 Oct 2017 03:38:47 +0000 (20:38 -0700)]
Merge tag 'arc-4.14-rc7' of git://git./linux/kernel/git/vgupta/arc
Pull ARC fixes from Vineet Gupta:
- Fixes for HSDK platform
- module build error for !LLSC config
* tag 'arc-4.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: unbork module link errors with !CONFIG_ARC_HAS_LLSC
ARC: [plat-hsdk] Increase SDIO CIU frequency to 50000000Hz
ARC: [plat-hsdk] select CONFIG_RESET_HSDK from Kconfig
Linus Torvalds [Sat, 28 Oct 2017 03:35:31 +0000 (20:35 -0700)]
Fix tracing sample code warning.
Commit
6575257c60e1 ("tracing/samples: Fix creation and deletion of
simple_thread_fn creation") introduced a new warning due to using a
boolean as a counter.
Just make it "int".
Fixes:
6575257c60e1 ("tracing/samples: Fix creation and deletion of simple_thread_fn creation")
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 28 Oct 2017 03:32:24 +0000 (20:32 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/s390/linux
Pull s390 fix from Martin Schwidefsky:
"A fix for a regression in regard to machine check handling in KVM.
Keeping my fingers crossed that this is the last s390 fix for v4.14"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/kvm: fix detection of guest machine checks
Linus Torvalds [Sat, 28 Oct 2017 00:19:39 +0000 (17:19 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- revert a /dev/mem restriction change that crashes with certain boot
parameters
- an AMD erratum fix for cases where the BIOS doesn't apply it
- fix unwinder debuginfo
- improve ORC unwinder warning printouts"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "x86/mm: Limit mmap() of /dev/mem to valid physical addresses"
x86/unwind: Show function name+offset in ORC error messages
x86/entry: Fix idtentry unwind hint
x86/cpu/AMD: Apply the Erratum 688 fix when the BIOS doesn't
Linus Torvalds [Sat, 28 Oct 2017 00:17:25 +0000 (17:17 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Update the <linux/swait.h> documentation to discourage their use"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/swait: Document it clearly that the swait facilities are special and shouldn't be used
Linus Torvalds [Sat, 28 Oct 2017 00:15:49 +0000 (17:15 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fix from Ingo Molnar:
"A fix for a misplaced permission check that can leave perf PT or LBR
disabled (on Intel CPUs) permanently until the next reboot"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/bts: Fix exclusive event reference leak
Linus Torvalds [Sat, 28 Oct 2017 00:14:32 +0000 (17:14 -0700)]
Merge branch 'efi-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull EFI fixes from Ingo Molnar:
"Two fixes: an ARM fix for KASLR interaction with hibernation, plus an
efi_test crash fix"
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/libstub/arm: Don't randomize runtime regions when CONFIG_HIBERNATION=y
efi/efi_test: Prevent an Oops in efi_runtime_query_capsulecaps()
Andrew Duggan [Wed, 25 Oct 2017 16:30:16 +0000 (09:30 -0700)]
Input: synaptics-rmi4 - limit the range of what GPIOs are buttons
By convention the first 6 bits of F30 Ctrl 2 and 3 are used to signify
GPIOs which are connected to buttons. Additional GPIOs may be used as
input GPIOs to signal the touch controller of some event
(ie disable touchpad). These additional GPIOs may meet the criteria of
a button in rmi_f30_is_valid_button() but should not be considered
buttons. This patch limits the GPIOs which are mapped to buttons to just
the first 6.
Signed-off-by: Andrew Duggan <aduggan@synaptics.com>
Reported-by: Daniel Martin <consume.noise@gmail.com>
Tested-by: Daniel Martin <consume.noise@gmail.com>
Acked-By: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Dmitry Torokhov [Mon, 23 Oct 2017 23:46:00 +0000 (16:46 -0700)]
Input: gtco - fix potential out-of-bound access
parse_hid_report_descriptor() has a while (i < length) loop, which
only guarantees that there's at least 1 byte in the buffer, but the
loop body can read multiple bytes which causes out-of-bounds access.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
David S. Miller [Fri, 27 Oct 2017 15:05:34 +0000 (00:05 +0900)]
Merge branch '40GbE' of git://git./linux/kernel/git/jkirsher/net-queue
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2017-10-26
This series contains fixes to e1000, igb, ixgbe and i40e.
Vincenzo Maffione fixes a potential race condition which would result in
the interface being up but transmits are disabled in the hardware.
Colin Ian King fixes a possible NULL pointer dereference in e1000, which
was found by Coverity.
Jean-Philippe Brucker fixes a possible kernel panic when a driver cannot
map a transmit buffer, which is caused by an erroneous test.
Alex provides a fix for ixgbe, which is a partial revert of the commit
ffed21bcee7a ("ixgbe: Don't bother clearing buffer memory for descriptor rings")
because the previous commit messed up the exception handling path by
adding the count back in when we did not need to. Also fixed a typo,
where the transmit ITR setting was being used to determine if we were
using adaptive receive interrupt moderation or not. Lastly, fixed a
memory leak by including programming descriptors in the cleaned count.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Thu, 26 Oct 2017 11:27:17 +0000 (19:27 +0800)]
ip6_gre: update dst pmtu if dev mtu has been updated by toobig in __gre6_xmit
When receiving a Toobig icmpv6 packet, ip6gre_err would just set
tunnel dev's mtu, that's not enough. For skb_dst(skb)'s pmtu may
still be using the old value, it has no chance to be updated with
tunnel dev's mtu.
Jianlin found this issue by reducing route's mtu while running
netperf, the performance went to 0.
ip6ip6 and ip4ip6 tunnel can work well with this, as they lookup
the upper dst and update_pmtu it's pmtu or icmpv6_send a Toobig
to upper socket after setting tunnel dev's mtu.
We couldn't do that for ip6_gre, as gre's inner packet could be
any protocol, it's difficult to handle them (like lookup upper
dst) in a good way.
So this patch is to fix it by updating skb_dst(skb)'s pmtu when
dev->mtu < skb_dst(skb)'s pmtu in tx path. It's safe to do this
update there, as usually dev->mtu <= skb_dst(skb)'s pmtu and no
performance regression can be caused by this.
Fixes:
c12b395a4664 ("gre: Support GRE over IPv6")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Thu, 26 Oct 2017 11:23:27 +0000 (19:23 +0800)]
ip6_gre: only increase err_count for some certain type icmpv6 in ip6gre_err
The similar fix in patch 'ipip: only increase err_count for some
certain type icmp in ipip_err' is needed for ip6gre_err.
In Jianlin's case, udp netperf broke even when receiving a TooBig
icmpv6 packet.
Fixes:
c12b395a4664 ("gre: Support GRE over IPv6")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Thu, 26 Oct 2017 11:19:56 +0000 (19:19 +0800)]
ipip: only increase err_count for some certain type icmp in ipip_err
t->err_count is used to count the link failure on tunnel and an err
will be reported to user socket in tx path if t->err_count is not 0.
udp socket could even return EHOSTUNREACH to users.
Since commit
fd58156e456d ("IPIP: Use ip-tunneling code.") removed
the 'switch check' for icmp type in ipip_err(), err_count would be
increased by the icmp packet with ICMP_EXC_FRAGTIME code. an link
failure would be reported out due to this.
In Jianlin's case, when receiving ICMP_EXC_FRAGTIME a icmp packet,
udp netperf failed with the err:
send_data: data send error: No route to host (errno 113)
We expect this error reported from tunnel to socket when receiving
some certain type icmp, but not ICMP_EXC_FRAGTIME, ICMP_SR_FAILED
or ICMP_PARAMETERPROB ones.
This patch is to bring 'switch check' for icmp type back to ipip_err
so that it only reports link failure for the right type icmp, just as
in ipgre_err() and ipip6_err().
Fixes:
fd58156e456d ("IPIP: Use ip-tunneling code.")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jose Abreu [Thu, 26 Oct 2017 09:07:12 +0000 (10:07 +0100)]
net: stmmac: First Queue must always be in DCB mode
According to DWMAC databook the first queue operating mode
must always be in DCB.
As MTL_QUEUE_DCB = 1, we need to always set the first queue
operating mode to DCB otherwise driver will think that queue
is in AVB mode (because MTL_QUEUE_AVB = 0).
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jose Abreu [Thu, 26 Oct 2017 08:51:33 +0000 (09:51 +0100)]
net: stmmac: dwc-qos-eth: Fix typo in DT bindings parsing
According to DT bindings documentation we are expecting a
property called "snps,read-requests" but we are parsing
instead a property called "read,read-requests".
This is clearly a typo. Fix it.
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 27 Oct 2017 13:23:41 +0000 (22:23 +0900)]
Merge tag 'mlx5-fixes-2017-10-26' of git://git./linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
Mellanox, mlx5 fixes 2017-10-26
The series includes some misc fixes for mlx5 core and etherent driver.
Please pull and let me know if there's any problem.
For -Stable:
net/mlx5e: Properly deal with encap flows add/del under neigh update (kernels >= 4.12)
net/mlx5: Fix health work queue spin lock to IRQ safe (kernels >= 4.13)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ingo Molnar [Fri, 27 Oct 2017 08:03:13 +0000 (10:03 +0200)]
Revert "x86/mm: Limit mmap() of /dev/mem to valid physical addresses"
This reverts commit
ce56a86e2ade45d052b3228cdfebe913a1ae7381.
There's unanticipated interaction with some boot parameters like 'mem=',
which now cause the new checks via valid_mmap_phys_addr_range() to be too
restrictive, crashing a Qemu bootup in fact, as reported by Fengguang Wu.
So while the motivation of the change is still entirely valid, we
need a few more rounds of testing to get it right - it's way too late
after -rc6, so revert it for now.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Craig Bergstrom <craigb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: dsafonov@virtuozzo.com
Cc: kirill.shutemov@linux.intel.com
Cc: mhocko@suse.com
Cc: oleg@redhat.com
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>