Paul Mackerras [Mon, 30 Jan 2017 10:21:47 +0000 (21:21 +1100)]
KVM: PPC: Book3S HV: MMU notifier callbacks for radix guests
This adapts our implementations of the MMU notifier callbacks
(unmap_hva, unmap_hva_range, age_hva, test_age_hva, set_spte_hva)
to call radix functions when the guest is using radix. These
implementations are much simpler than for HPT guests because we
have only one PTE to deal with, so we don't need to traverse
rmap chains.
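Here is a toy model of the difference. It uses made-up structures: struct toy_kvm, struct toy_hpte and the toy_* helpers are hypothetical, not KVM's real types. A radix guest has a single PTE to clear. For an HPT guest, the host page may be referenced by a chain of HPTEs, and each one must be invalidated.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins for KVM structures. */
struct toy_hpte {
	uint64_t v;               /* nonzero while the entry is valid */
	struct toy_hpte *next;    /* next entry in this page's rmap chain */
};

struct toy_kvm {
	bool radix;                  /* guest uses the radix MMU */
	uint64_t *radix_pte;         /* the single PTE mapping the page (radix) */
	struct toy_hpte *rmap_head;  /* chain of HPTEs mapping the page (HPT) */
};

static inline bool toy_kvm_is_radix(const struct toy_kvm *kvm)
{
	return kvm->radix;
}

/* Radix: only one PTE to clear. Returns the number of entries invalidated. */
static int toy_unmap_radix(struct toy_kvm *kvm)
{
	if (kvm->radix_pte == NULL || *kvm->radix_pte == 0)
		return 0;
	*kvm->radix_pte = 0;
	return 1;
}

/* HPT: walk the rmap chain, invalidate every HPTE on it, then empty it. */
static int toy_unmap_hpt(struct toy_kvm *kvm)
{
	int n = 0;

	for (struct toy_hpte *e = kvm->rmap_head; e != NULL; e = e->next) {
		if (e->v != 0) {
			e->v = 0;
			n++;
		}
	}
	kvm->rmap_head = NULL;
	return n;
}

/* Notifier-style entry point: dispatch on the guest's MMU mode. */
int toy_unmap_hva(struct toy_kvm *kvm)
{
	return toy_kvm_is_radix(kvm) ? toy_unmap_radix(kvm) : toy_unmap_hpt(kvm);
}
```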
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Paul Mackerras [Mon, 30 Jan 2017 10:21:46 +0000 (21:21 +1100)]
KVM: PPC: Book3S HV: Page table construction and page faults for radix guests
This adds the code to construct the second-level ("partition-scoped" in
architecturese) page tables for guests using the radix MMU. Apart from
the PGD level, which is allocated when the guest is created, the rest
of the tree is all constructed in response to hypervisor page faults.
In addition to hypervisor page faults for missing pages, we also get faults
when the reference/change (RC) bits need to be set, and for various
other error conditions. For now, we only set the R or C bit in the
guest page table if the same bit is set in the host PTE for the
backing page.
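Here is a minimal sketch of that rule. The bit values are illustrative, and TOY_PTE_R, TOY_PTE_C and toy_update_rc() are hypothetical names, not the kernel's definitions. Only R/C bits that are already set in the host PTE get copied into the guest PTE. Anything else is left for a slower path.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values for the reference (R) and change (C) bits. */
#define TOY_PTE_R 0x100ULL
#define TOY_PTE_C 0x080ULL

/*
 * Try to satisfy an RC-update fault by copying bits from the host PTE.
 * 'want' is the set of R/C bits the guest access needs. Only bits that
 * are already set in the host PTE are propagated into the guest PTE.
 * Returns true if every wanted bit could be set this way; false means the
 * caller must take a slower path.
 */
bool toy_update_rc(uint64_t *guest_pte, uint64_t host_pte, uint64_t want)
{
	want &= TOY_PTE_R | TOY_PTE_C;
	uint64_t ok = want & host_pte;

	*guest_pte |= ok;
	return ok == want;
}
```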
This code can take advantage of the guest being backed with either
transparent or ordinary 2MB huge pages, and insert 2MB page entries
into the guest page tables. There is no support for 1GB huge pages
yet.
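Here is a sketch of the kind of check involved in choosing a 2MB mapping. The exact conditions and the name toy_can_map_2m() are assumptions made for illustration, not the code in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PMD_SIZE (2ULL << 20)            /* 2 MiB */
#define TOY_PMD_MASK (~(TOY_PMD_SIZE - 1))

/*
 * Decide whether a fault at guest physical address 'gpa', backed by host
 * virtual address 'hva', can be mapped with a single 2 MiB entry.
 * Assumed conditions (illustrative, not KVM's exact checks):
 *  - the host backs the address with a page of at least 2 MiB,
 *  - gpa and hva have the same offset within a 2 MiB block,
 *  - the whole 2 MiB guest block lies inside the memslot
 *    [slot_base, slot_base + slot_size).
 */
bool toy_can_map_2m(uint64_t gpa, uint64_t hva, uint64_t host_page_size,
		    uint64_t slot_base, uint64_t slot_size)
{
	uint64_t block = gpa & TOY_PMD_MASK;

	if (host_page_size < TOY_PMD_SIZE)
		return false;
	if ((gpa & ~TOY_PMD_MASK) != (hva & ~TOY_PMD_MASK))
		return false;
	return block >= slot_base &&
	       block + TOY_PMD_SIZE <= slot_base + slot_size;
}
```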
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Paul Mackerras [Mon, 30 Jan 2017 10:21:45 +0000 (21:21 +1100)]
KVM: PPC: Book3S HV: Modify guest entry/exit paths to handle radix guests
This adds code to branch around the parts that radix guests don't
need - clearing and loading the SLB with the guest SLB contents,
saving the guest SLB contents on exit, and restoring the host SLB
contents.
Since the host is now using radix, we need to save and restore the
host value for the PID register.
On hypervisor data/instruction storage interrupts, we don't do the
guest HPT lookup on radix, but just save the guest physical address
for the fault (from the ASDR register) in the vcpu struct.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Paul Mackerras [Mon, 30 Jan 2017 10:21:44 +0000 (21:21 +1100)]
KVM: PPC: Book3S HV: Add basic infrastructure for radix guests
This adds a field in struct kvm_arch and an inline helper to
indicate whether a guest is a radix guest or not, plus a new file
to contain the radix MMU code, which currently contains just a
translate function which knows how to traverse the guest page
tables to translate an address.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Paul Mackerras [Mon, 30 Jan 2017 10:21:43 +0000 (21:21 +1100)]
KVM: PPC: Book3S HV: Use ASDR for HPT guests on POWER9
POWER9 adds a register called ASDR (Access Segment Descriptor
Register), which is set by hypervisor data/instruction storage
interrupts to contain the segment descriptor for the address
being accessed, assuming the guest is using HPT translation.
(For radix guests, it contains the guest real address of the
access.)
Thus, for HPT guests on POWER9, we can use this register rather
than looking up the SLB with the slbfee. instruction.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Paul Mackerras [Mon, 30 Jan 2017 10:21:42 +0000 (21:21 +1100)]
KVM: PPC: Book3S HV: Set process table for HPT guests on POWER9
This adds the implementation of the KVM_PPC_CONFIGURE_V3_MMU ioctl
for HPT guests on POWER9. With this, we can return 1 for the
KVM_CAP_PPC_MMU_HASH_V3 capability.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Paul Mackerras [Mon, 30 Jan 2017 10:21:41 +0000 (21:21 +1100)]
KVM: PPC: Book3S HV: Add userspace interfaces for POWER9 MMU
This adds two capabilities and two ioctls to allow userspace to
find out about and configure the POWER9 MMU in a guest. The two
capabilities tell userspace whether KVM can support a guest using
the radix MMU, or using the hashed page table (HPT) MMU with a
process table and segment tables. (Note that the MMUs in the
POWER9 processor cores do not use the process and segment tables
when in HPT mode, but the nest MMU does).
The KVM_PPC_CONFIGURE_V3_MMU ioctl allows userspace to specify
whether a guest will use the radix MMU or the HPT MMU, and to
specify the size and location (in guest space) of the process
table.
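Here is a hypothetical encoder for the process table's location and size. It assumes the value is laid out like the second doubleword of a partition table entry: the table base sits in the upper bits, and a 5-bit PRTS field in the low bits gives a table size of 2^(12+PRTS) bytes. The helper name is made up; the KVM API documentation describes the real structure.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical encoder: process table base in the upper bits, PRTS in the
 * low 5 bits, table size = 2^(12 + PRTS) bytes. Returns false if the size
 * is out of range or the base is not aligned to the table size.
 */
bool toy_encode_proc_tbl(uint64_t base, unsigned int log2_size, uint64_t *out)
{
	/* PRTS is 5 bits wide: sizes from 2^12 up to 2^43 bytes. */
	if (log2_size < 12 || log2_size > 12 + 31)
		return false;
	/* The base must be naturally aligned to the table size. */
	if (base & ((1ULL << log2_size) - 1))
		return false;
	*out = base | (uint64_t)(log2_size - 12);
	return true;
}
```

For example, a 64KB table (log2 size 16, PRTS 4) at guest address 0x10000000 would encode as 0x10000004.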
The KVM_PPC_GET_RMMU_INFO ioctl gives userspace information about
the radix MMU. It returns a list of supported radix tree geometries
(base page size and number of bits indexed at each level of the
radix tree) and the encoding used to specify the various page
sizes for the TLB invalidate entry instruction.
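Here is a sketch that makes "number of bits indexed at each level" concrete. It uses a hypothetical struct toy_radix_geom, not the real ioctl structure. Take a geometry with 4KB base pages (page_shift 12) and 13, 9, 9 and 9 bits at the four levels, top first. It translates 12 + 13 + 9 + 9 + 9 = 52 bits of address.

```c
#include <stdint.h>

#define TOY_MAX_LEVELS 4

/* Hypothetical mirror of one radix tree geometry: base page size (as a
 * shift) and the number of address bits indexed at each level, top first. */
struct toy_radix_geom {
	uint8_t page_shift;
	uint8_t level_bits[TOY_MAX_LEVELS];
};

/* Total number of address bits translated by a tree with this geometry. */
unsigned int toy_geom_addr_bits(const struct toy_radix_geom *g)
{
	unsigned int bits = g->page_shift;

	for (int i = 0; i < TOY_MAX_LEVELS; i++)
		bits += g->level_bits[i];
	return bits;
}

/* Index into the table at 'level' (0 = top) for address 'addr'. */
uint64_t toy_geom_index(const struct toy_radix_geom *g, uint64_t addr, int level)
{
	unsigned int shift = g->page_shift;

	/* Levels below 'level' consume the bits just above the page offset. */
	for (int i = TOY_MAX_LEVELS - 1; i > level; i--)
		shift += g->level_bits[i];
	return (addr >> shift) & ((1ULL << g->level_bits[level]) - 1);
}
```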
Initially, both capabilities return 0 and the ioctls return -EINVAL,
until the necessary infrastructure for them to operate correctly
is added.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Paul Mackerras [Mon, 30 Jan 2017 10:21:40 +0000 (21:21 +1100)]
powerpc/64: Allow for relocation-on interrupts from guest to host
With host and guest both using radix translation, it is feasible
for the host to take interrupts that come from the guest with
relocation on, and that is in fact what the POWER9 hardware will
do when LPCR[AIL] = 3. All such interrupts use HSRR0/1, not SRR0/1,
except for the system call with LEV=1 (hcall).
Therefore this adds the KVM tests to the _HV variants of the
relocation-on interrupt handlers, and adds the KVM test to the
relocation-on system call entry point.
We also instantiate the relocation-on versions of the hypervisor
data storage and instruction interrupt handlers, since these can
occur with relocation on in radix guests.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Paul Mackerras [Mon, 30 Jan 2017 10:21:39 +0000 (21:21 +1100)]
powerpc/64: Make type of partition table flush depend on partition type
When changing a partition table entry on POWER9, we do a particular
form of the tlbie instruction which flushes all TLBs and caches of
the partition table for a given logical partition ID (LPID).
This instruction has a field in the instruction word, labelled R
(radix), which should be 1 if the partition was previously a radix
partition and 0 if it was an HPT partition. This implements that
logic.
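Here is a sketch of that choice. It assumes the partition's previous mode is read from a host-radix (HR) bit at the top of the old partition table entry's first doubleword. TOY_PATB_HR and toy_flush_r_field() are illustrative names.

```c
#include <stdint.h>

/* Assumed position of the "host radix" (HR) bit in the first doubleword of
 * a partition table entry: the most significant bit (IBM bit 0). */
#define TOY_PATB_HR (1ULL << 63)

/*
 * Value for the tlbie R field when flushing a partition table entry.
 * It describes the partition as it was *before* the change, so it is
 * derived from the old entry, not the new one.
 */
unsigned int toy_flush_r_field(uint64_t old_patb0)
{
	return (old_patb0 & TOY_PATB_HR) ? 1 : 0;
}
```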
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Paul Mackerras [Mon, 30 Jan 2017 10:21:38 +0000 (21:21 +1100)]
powerpc/64: Export pgtable_cache and pgtable_cache_add for KVM
This exports the pgtable_cache array and the pgtable_cache_add
function so that HV KVM can use them for allocating radix page
tables for guests.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Paul Mackerras [Mon, 30 Jan 2017 10:21:37 +0000 (21:21 +1100)]
powerpc/64: More definitions for POWER9
This adds definitions for bits in the DSISR register which are used
by POWER9 for various translation-related exception conditions, and
for some more bits in the partition table entry that will be needed
by KVM.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Paul Mackerras [Mon, 30 Jan 2017 10:21:36 +0000 (21:21 +1100)]
powerpc/64: Enable use of radix MMU under hypervisor on POWER9
To use radix as a guest, we first need to tell the hypervisor, via
the ibm,client-architecture-support call, that we support POWER9 and
architecture v3.00, and that we can do either radix or hash and
that we would like to choose later using an hcall (the
H_REGISTER_PROC_TBL hcall).
Then we need to check whether the hypervisor agreed to us using
radix. We need to do this very early on in the kernel boot process
before any of the MMU initialization is done. If the hypervisor
doesn't agree, we can't use radix and therefore clear the radix
MMU feature bit.
Later, when we have set up our process table, which points to the
radix tree for each process, we need to install that using the
H_REGISTER_PROC_TBL hcall.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Paul Mackerras [Mon, 30 Jan 2017 10:21:35 +0000 (21:21 +1100)]
powerpc/pseries: Fixes for the "ibm,architecture-vec-5" options
This fixes the byte index values for some of the option bits in
the "ibm,architecture-vec-5" property. The "platform facilities options"
bits are in byte 17 not byte 14, so the upper 8 bits of their
definitions need to be 0x11 not 0x0E. The "sub processor support" option
is in byte 21 not byte 15.
Note none of these options are actually looked up in
"ibm,architecture-vec-5" at this time, so there is no bug.
When checking whether option bits are set, we should check that
the offset of the byte being checked is less than the vector
length that we got from the hypervisor.
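Here is a sketch of such a bounded check. As described above, an option definition carries the byte index in its upper 8 bits. This sketch assumes the lower 8 bits are the bit mask within that byte. The TOY_OV5_* macros and toy_ov5_has() are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Byte index in the upper 8 bits, bit mask in the lower 8 bits; for
 * example, a hypothetical option 0x1140 is byte 0x11 (17), mask 0x40. */
#define TOY_OV5_INDX(opt) ((unsigned int)(opt) >> 8)
#define TOY_OV5_FEAT(opt) ((uint8_t)((opt) & 0xff))

/* Test an option bit, treating bytes beyond the vector length as unset. */
bool toy_ov5_has(const uint8_t *vec, size_t len, uint16_t opt)
{
	unsigned int idx = TOY_OV5_INDX(opt);

	if (idx >= len)
		return false;   /* vector too short: don't read past its end */
	return (vec[idx] & TOY_OV5_FEAT(opt)) != 0;
}
```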
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Paul Mackerras [Mon, 30 Jan 2017 10:21:34 +0000 (21:21 +1100)]
powerpc/64: Don't try to use radix MMU under a hypervisor
Currently, if the kernel is running on a POWER9 processor under a
hypervisor, it will try to use the radix MMU even though it doesn't have
the necessary code to use radix under a hypervisor (it doesn't negotiate
use of radix, and it doesn't do the H_REGISTER_PROC_TBL hcall). The
result is that the guest kernel will crash when it tries to turn on the
MMU.
This fixes it by looking for the /chosen/ibm,architecture-vec-5
property, and if it exists, clears the radix MMU feature bit, before we
decide whether to initialize for radix or HPT. This property is created
by the hypervisor as a result of the guest calling the
ibm,client-architecture-support method to indicate its capabilities, so
it will indicate whether the hypervisor agreed to us using radix.
Systems without a hypervisor may have this property also (for example,
skiboot creates it), so we check the HV bit in the MSR to see whether we
are running as a guest or not. If we are in hypervisor mode, then we can
do whatever we like including using the radix MMU.
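The decision described above fits in a small truth table. toy_use_radix() and its boolean inputs are hypothetical stand-ins for the radix MMU feature bit, MSR[HV] and the presence of /chosen/ibm,architecture-vec-5.

```c
#include <stdbool.h>

/*
 * Hypothetical summary of the early-boot decision:
 *  cpu_has_radix: the radix MMU feature bit is currently set
 *  hv_mode:       MSR[HV] is set, i.e. we are running as the hypervisor
 *  have_vec5:     /chosen/ibm,architecture-vec-5 exists
 * Returns whether to go on and initialize for radix.
 */
bool toy_use_radix(bool cpu_has_radix, bool hv_mode, bool have_vec5)
{
	if (!cpu_has_radix)
		return false;
	if (hv_mode)
		return true;    /* in hypervisor mode we may use radix */
	/* As a guest, the property's presence clears the radix feature bit. */
	return !have_vec5;
}
```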
The reason for using this property is that in future, when we have
support for using radix under a hypervisor, we will need to check this
property to see whether the hypervisor agreed to us using radix.
Fixes: 2bfd65e45e87 ("powerpc/mm/radix: Add radix callbacks for early init routines")
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Fri, 27 Jan 2017 04:00:34 +0000 (14:00 +1000)]
KVM: PPC: Book3S: 64-bit CONFIG_RELOCATABLE support for interrupts
64-bit Book3S exception handlers must find the dynamic kernel base
to add to the target address when branching beyond __end_interrupts,
in order to support kernel running at non-0 physical address.
Support this in KVM by branching with CTR, similarly to regular
interrupt handlers. The guest CTR is saved in HSTATE_SCRATCH1 and
restored after the branch.
Without this, the host kernel hangs and crashes randomly when it is
running at a non-0 address and a KVM guest is started.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Wed, 21 Dec 2016 18:29:26 +0000 (04:29 +1000)]
KVM: PPC: Book3S: Move 64-bit KVM interrupt handler out from alt section
A subsequent patch to make KVM handlers relocation-safe makes them
unusable from within alt section "else" cases (due to the way fixed
addresses are taken from within fixed section head code).
Stop open-coding the KVM handlers, and add them both as normal. A more
optimal fix may be to allow some level of alternate feature patching in
the exception macros themselves, but for now this will do.
The TRAMP_KVM handlers must be moved to the "virt" fixed section area
(name is arbitrary) in order to be closer to .text and avoid the dreaded
"relocation truncated to fit" error.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Wed, 21 Dec 2016 18:29:25 +0000 (04:29 +1000)]
KVM: PPC: Book3S: Change interrupt call to reduce scratch space use on HV
Change the calling convention to put the trap number together with
CR in two halves of r12, which frees up HSTATE_SCRATCH2 in the HV
handler.
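Here is a sketch of the packing. It assumes CR occupies the upper 32 bits of r12 and the trap number the lower 32 bits. The toy_* helper names are illustrative, and the real code does this in assembly.

```c
#include <stdint.h>

/* Assumed layout: CR in the upper 32 bits, trap (vector) number in the
 * lower 32 bits, so a single register carries both values. */
static inline uint64_t toy_pack_r12(uint32_t cr, uint32_t trap)
{
	return ((uint64_t)cr << 32) | trap;
}

static inline uint32_t toy_r12_cr(uint64_t r12)
{
	return (uint32_t)(r12 >> 32);
}

static inline uint32_t toy_r12_trap(uint64_t r12)
{
	return (uint32_t)r12;
}
```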
The 64-bit PR handler entry translates the calling convention back
to match the previous call convention (i.e., shared with 32-bit), for
simplicity.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Linus Torvalds [Mon, 16 Jan 2017 00:21:59 +0000 (16:21 -0800)]
Linux 4.10-rc4
Linus Torvalds [Mon, 16 Jan 2017 00:09:50 +0000 (16:09 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/ebiederm/user-namespace
Pull namespace fixes from Eric Biederman:
"This tree contains 4 fixes.
The first is a fix for a race that can cause oopses under the right
circumstances, and that someone just recently encountered.
Past that are several small, trivially correct fixes: a real issue that
was blocking development of an out-of-tree driver, but does not appear
to have caused any actual problems for in-tree code; a potential
deadlock that was reported by lockdep; and a deadlock that people
experienced and took the time to track down, caused by a cleanup that
removed the code to drop a reference count"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
sysctl: Drop reference added by grab_header in proc_sys_readdir
pid: fix lockdep deadlock warning due to ucount_lock
libfs: Modify mount_pseudo_xattr to be clear it is not a userspace mount
mnt: Protect the mountpoint hashtable with mount_lock
Linus Torvalds [Sun, 15 Jan 2017 20:40:53 +0000 (12:40 -0800)]
Merge tag 'char-misc-4.10-rc4' of git://git./linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are some small char/misc driver fixes for 4.10-rc4 that resolve
some reported issues.
The MEI driver issue resolves a lot of problems that people have been
having, as does the mem driver fix. The other minor fixes resolve
other reported issues.
All of these have been in linux-next for a while"
* tag 'char-misc-4.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
vme: Fix wrong pointer utilization in ca91cx42_slave_get
auxdisplay: fix new ht16k33 build errors
ppdev: don't print a free'd string
extcon: return error code on failure
drivers: char: mem: Fix thinkos in kmem address checks
mei: bus: enable OS version only for SPT and newer
Linus Torvalds [Sun, 15 Jan 2017 20:38:53 +0000 (12:38 -0800)]
Merge tag 'driver-core-4.10-rc4' of git://git./linux/kernel/git/gregkh/driver-core
Pull driver core fix from Greg KH:
"Here is a single patch being reverted to remove a feature that was
added in 4.10-rc1 that isn't quite ready for release.
It will be redone as a debugfs file instead of a sysfs file in the
future"
* tag 'driver-core-4.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
Revert "driver core: Add deferred_probe attribute to devices in sysfs"
Linus Torvalds [Sun, 15 Jan 2017 20:36:32 +0000 (12:36 -0800)]
Merge tag 'tty-4.10-rc4' of git://git./linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are some small tty/serial driver fixes for 4.10-rc4 to resolve a
number of reported issues.
Nothing major here at all, one revert of a problematic patch, and some
other tiny bugfixes. Full details are in the shortlog below.
All have been in linux-next with no reported issues"
* tag 'tty-4.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
sysrq: attach sysrq handler correctly for 32-bit kernel
Revert "tty: serial: 8250: add CON_CONSDEV to flags"
Clearing FIFOs in RS485 emulation mode causes subsequent transmits to break
8250_pci: Fix potential use-after-free in error path
tty/serial: atmel: RS485 half duplex w/DMA: enable RX after TX is done
tty/serial: atmel_serial: BUG: stop DMA from transmitting in stop_tx
Linus Torvalds [Sun, 15 Jan 2017 20:34:35 +0000 (12:34 -0800)]
Merge tag 'usb-4.10-rc4' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are a few small USB driver fixes for 4.10-rc4 to resolve some
reported issues.
The "largest" here is a number of bugs being fixed in the ch341
usb-serial driver, to hopefully resolve the mess of different devices
floating around that use this driver that have been having problems
with the 4.10-rc1 release.
There's also a tiny musb fix that I missed in the last pull request,
as well as the traditional xhci fix rounding out the batch.
All have been in linux-next with no reported issues"
* tag 'usb-4.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
xhci: fix deadlock at host remove by running watchdog correctly
USB: serial: ch341: fix control-message error handling
usb: musb: fix runtime PM in debugfs
wusbcore: Fix one more crypto-on-the-stack bug
USB: serial: kl5kusb105: fix line-state error handling
USB: serial: ch341: fix baud rate and line-control handling
USB: serial: ch341: fix line settings after reset-resume
USB: serial: ch341: fix resume after reset
USB: serial: ch341: fix open error handling
USB: serial: ch341: fix modem-control and B0 handling
USB: serial: ch341: fix open and resume after B0
USB: serial: ch341: fix initial modem-control state
Linus Torvalds [Sun, 15 Jan 2017 20:28:14 +0000 (12:28 -0800)]
Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Bugfixes for I2C. Mostly core this time which is a bit unusual but
nothing really scary in there"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: piix4: Avoid race conditions with IMC
i2c: fix spelling mistake: "insufficent" -> "insufficient"
i2c: print correct device invalid address
i2c: do not enable fall back to Host Notify by default
i2c: fix kernel memory disclosure in dev interface
Linus Torvalds [Sun, 15 Jan 2017 20:03:11 +0000 (12:03 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- unwinder fixes
- AMD CPU topology enumeration fixes
- microcode loader fixes
- x86 embedded platform fixes
- fix for a bootup crash that may trigger when clearcpuid= is used
with invalid values"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mpx: Use compatible types in comparison to fix sparse error
x86/tsc: Add the Intel Denverton Processor to native_calibrate_tsc()
x86/entry: Fix the end of the stack for newly forked tasks
x86/unwind: Include __schedule() in stack traces
x86/unwind: Disable KASAN checks for non-current tasks
x86/unwind: Silence warnings for non-current tasks
x86/microcode/intel: Use correct buffer size for saving microcode data
x86/microcode/intel: Fix allocation size of struct ucode_patch
x86/microcode/intel: Add a helper which gives the microcode revision
x86/microcode: Use native CPUID to tickle out microcode revision
x86/CPU: Add native CPUID variants returning a single datum
x86/boot: Add missing declaration of string functions
x86/CPU/AMD: Fix Bulldozer topology
x86/platform/intel-mid: Rename 'spidev' to 'mrfld_spidev'
x86/cpu: Fix typo in the comment for Anniedale
x86/cpu: Fix bootup crashes by sanitizing the argument of the 'clearcpuid=' command-line option
Linus Torvalds [Sun, 15 Jan 2017 20:00:37 +0000 (12:00 -0800)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull NOHZ fix from Ingo Molnar:
"This fixes an old NOHZ race where we incorrectly calculate the next
timer interrupt in certain circumstances where hrtimers are pending,
that can cause hard to reproduce stalled-values artifacts in
/proc/stat"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
nohz: Fix collision between tick and other hrtimers
Linus Torvalds [Sun, 15 Jan 2017 19:37:43 +0000 (11:37 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Misc race fixes uncovered by fuzzing efforts, a Sparse fix, two PMU
driver fixes, plus miscellaneous tooling fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Reject non sampling events with precise_ip
perf/x86/intel: Account interrupts for PEBS errors
perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race
perf/core: Fix sys_perf_event_open() vs. hotplug
perf/x86/intel: Use ULL constant to prevent undefined shift behaviour
perf/x86/intel/uncore: Fix hardcoded socket 0 assumption in the Haswell init code
perf/x86: Set pmu->module in Intel PMU modules
perf probe: Fix to probe on gcc generated symbols for offline kernel
perf probe: Fix --funcs to show correct symbols for offline module
perf symbols: Robustify reading of build-id from sysfs
perf tools: Install tools/lib/traceevent plugins with install-bin
tools lib traceevent: Fix prev/next_prio for deadline tasks
perf record: Fix --switch-output documentation and comment
perf record: Make __record_options static
tools lib subcmd: Add OPT_STRING_OPTARG_SET option
perf probe: Fix to get correct modname from elf header
samples/bpf trace_output_user: Remove duplicate sys/ioctl.h include
samples/bpf sock_example: Avoid getting ethhdr from two includes
perf sched timehist: Show total scheduling time
Linus Torvalds [Sun, 15 Jan 2017 18:54:39 +0000 (10:54 -0800)]
Merge branch 'efi-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull EFI fixes from Ingo Molnar:
"A number of regression fixes:
- Fix a boot hang on machines that have somewhat unusual memory map
entries of phys_addr=0x0 num_pages=0, which broke due to a recent
commit. This commit got cherry-picked from the v4.11 queue because
the bug is affecting real machines.
- Fix a boot hang also reported by KASAN, caused by incorrect init
ordering introduced by a recent optimization.
- Fix a recent robustification fix to allocate_new_fdt_and_exit_boot()
that introduced an invalid assumption. Neither bug was seen in
the wild AFAIK"
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/x86: Prune invalid memory map entries and fix boot regression
x86/efi: Don't allocate memmap through memblock after mm_init()
efi/libstub/arm*: Pass latest memory map to the kernel
Linus Torvalds [Sun, 15 Jan 2017 01:13:28 +0000 (17:13 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro.
The most notable fix here is probably the fix for a splice regression
("fix a fencepost error in pipe_advance()") noticed by Alan Wylie.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix a fencepost error in pipe_advance()
coredump: Ensure proper size of sparse core files
aio: fix lock dep warning
tmpfs: clear S_ISGID when setting posix ACLs
Linus Torvalds [Sun, 15 Jan 2017 01:07:04 +0000 (17:07 -0800)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- the virtio_blk stack DMA corruption fix from Christoph, fixing an
issue with VMAP stacks.
- O_DIRECT blkbits calculation fix from Chandan.
- discard regression fix from Christoph.
- queue init error handling fixes for nbd and virtio_blk, from Omar and
Jeff.
- two small nvme fixes, from Christoph and Guilherme.
- rename of blk_queue_zone_size and bdev_zone_size to _sectors instead,
to more closely follow what we do in other places in the block layer.
This interface is new for this series, so let's get the naming right
before releasing a kernel with this feature. From Damien.
* 'for-linus' of git://git.kernel.dk/linux-block:
block: don't try to discard from __blkdev_issue_zeroout
sd: remove __data_len hack for WRITE SAME
nvme: use blk_rq_payload_bytes
scsi: use blk_rq_payload_bytes
block: add blk_rq_payload_bytes
block: Rename blk_queue_zone_size and bdev_zone_size
nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too
nvme-rdma: fix nvme_rdma_queue_is_ready
virtio_blk: fix panic in initialization error path
nbd: blk_mq_init_queue returns an error code on failure, not NULL
virtio_blk: avoid DMA to stack for the sense buffer
do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned
Al Viro [Sun, 15 Jan 2017 00:33:08 +0000 (19:33 -0500)]
fix a fencepost error in pipe_advance()
The logic in pipe_advance() used to release all buffers past the new
position failed when the number of buffers to release was equal
to pipe->buffers. If that happened, none of them were released,
leaving the pipe full. Worse, it was trivial to trigger, and we ended up
with a pipe full of uninitialized pages. IOW, it's an infoleak.
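The fencepost can be reproduced with a toy ring buffer. struct toy_pipe and both functions are hypothetical; this is not the kernel's pipe_advance(). When every buffer has to be released, the start and end indices coincide. A loop that runs until the index reaches the end therefore never executes. Counting the buffers instead avoids the problem.

```c
#include <stdbool.h>

#define TOY_NBUFS 4   /* ring size; must be a power of two */

struct toy_pipe {
	bool used[TOY_NBUFS];
	unsigned int curbuf;   /* index of the first occupied slot */
	unsigned int nrbufs;   /* number of occupied slots */
};

/* Buggy shape: release slots after the first 'keep' until the index
 * reaches the end position. If all TOY_NBUFS slots must be released,
 * start == end and the loop body never runs, leaving the ring full. */
void toy_release_buggy(struct toy_pipe *p, unsigned int keep)
{
	unsigned int mask = TOY_NBUFS - 1;
	unsigned int end = (p->curbuf + p->nrbufs) & mask;

	for (unsigned int i = (p->curbuf + keep) & mask; i != end; i = (i + 1) & mask) {
		p->used[i] = false;
		p->nrbufs--;
	}
}

/* Fixed shape: count the buffers to release instead of comparing indices. */
void toy_release_fixed(struct toy_pipe *p, unsigned int keep)
{
	unsigned int mask = TOY_NBUFS - 1;
	unsigned int i = (p->curbuf + keep) & mask;

	while (p->nrbufs > keep) {
		p->used[i] = false;
		p->nrbufs--;
		i = (i + 1) & mask;
	}
}
```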
Cc: stable@vger.kernel.org # v4.9
Reported-by: "Alan J. Wylie" <alan@wylie.me.uk>
Tested-by: "Alan J. Wylie" <alan@wylie.me.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Kleikamp [Wed, 11 Jan 2017 19:25:00 +0000 (13:25 -0600)]
coredump: Ensure proper size of sparse core files
If the last section of a core file ends with an unmapped or zero page,
the size of the file does not correspond with the last dump_skip() call.
gdb complains that the file is truncated, which can be confusing to users.
After all of the vma sections are written, make sure that the file size
is no smaller than the current file position.
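Here is a userspace analogue of the fix. It uses POSIX calls rather than the kernel's internal helpers. After seeking past a trailing hole, it extends the file with ftruncate() if the size is still smaller than the current offset. toy_finish_sparse() is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * After skipping over a trailing hole with lseek() (analogous to
 * dump_skip() for unmapped or zero pages), extend the file so that its
 * size is no smaller than the current position.
 * Returns 0 on success, -1 on error.
 */
int toy_finish_sparse(int fd)
{
	struct stat st;
	off_t pos = lseek(fd, 0, SEEK_CUR);

	if (pos < 0 || fstat(fd, &st) < 0)
		return -1;
	if (st.st_size < pos && ftruncate(fd, pos) < 0)
		return -1;
	return 0;
}
```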
This problem can be demonstrated with gdb's bigcore testcase on the
sparc architecture.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Shaohua Li [Tue, 13 Dec 2016 20:09:56 +0000 (12:09 -0800)]
aio: fix lock dep warning
lockdep reports a warning. file_start_write()/file_end_write() only
acquire/release the lock for regular files, so check the file type on
the aio side too.
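Here is a toy model of the imbalance. toy_depth stands in for lockdep's held-lock depth, and all names are hypothetical. The start side takes the lock only for regular files. Any code that releases it separately must therefore apply the same S_ISREG() test. Otherwise the depth underflows for pipes and other non-regular files, which corresponds to the DEBUG_LOCKS_WARN_ON(depth <= 0) in the trace below.

```c
#define _XOPEN_SOURCE 700   /* for S_IFREG/S_IFIFO in the usage checks */
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical counter standing in for lockdep's held-lock depth. */
static int toy_depth;

/* Like file_start_write(): only regular files take freeze protection. */
static void toy_start_write(mode_t mode)
{
	if (S_ISREG(mode))
		toy_depth++;
}

/* Release done directly by the caller. With 'checked' false this drops
 * the "lock" for every file type, which underflows for non-regular files. */
static void toy_caller_release(mode_t mode, bool checked)
{
	if (!checked || S_ISREG(mode))
		toy_depth--;
}

/* One start/release pair; returns the resulting depth (0 when balanced). */
int toy_submit(mode_t mode, bool checked)
{
	toy_depth = 0;
	toy_start_write(mode);
	toy_caller_release(mode, checked);
	return toy_depth;
}
```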
[ 453.532141] ------------[ cut here ]------------
[ 453.533011] WARNING: CPU: 1 PID: 1298 at ../kernel/locking/lockdep.c:3514 lock_release+0x434/0x670
[ 453.533011] DEBUG_LOCKS_WARN_ON(depth <= 0)
[ 453.533011] Modules linked in:
[ 453.533011] CPU: 1 PID: 1298 Comm: fio Not tainted 4.9.0+ #964
[ 453.533011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
[ 453.533011] ffff8803a24b7a70 ffffffff8196cffb ffff8803a24b7ae8 0000000000000000
[ 453.533011] ffff8803a24b7ab8 ffffffff81091ee1 ffff8803a5dba700 00000dba00000008
[ 453.533011] ffffed0074496f59 ffff8803a5dbaf54 ffff8803ae0f8488 fffffffffffffdef
[ 453.533011] Call Trace:
[ 453.533011] [<ffffffff8196cffb>] dump_stack+0x67/0x9c
[ 453.533011] [<ffffffff81091ee1>] __warn+0x111/0x130
[ 453.533011] [<ffffffff81091f97>] warn_slowpath_fmt+0x97/0xb0
[ 453.533011] [<ffffffff81091f00>] ? __warn+0x130/0x130
[ 453.533011] [<ffffffff8191b789>] ? blk_finish_plug+0x29/0x60
[ 453.533011] [<ffffffff811205d4>] lock_release+0x434/0x670
[ 453.533011] [<ffffffff8198af94>] ? import_single_range+0xd4/0x110
[ 453.533011] [<ffffffff81322195>] ? rw_verify_area+0x65/0x140
[ 453.533011] [<ffffffff813aa696>] ? aio_write+0x1f6/0x280
[ 453.533011] [<ffffffff813aa6c9>] aio_write+0x229/0x280
[ 453.533011] [<ffffffff813aa4a0>] ? aio_complete+0x640/0x640
[ 453.533011] [<ffffffff8111df20>] ? debug_check_no_locks_freed+0x1a0/0x1a0
[ 453.533011] [<ffffffff8114793a>] ? debug_lockdep_rcu_enabled.part.2+0x1a/0x30
[ 453.533011] [<ffffffff81147985>] ? debug_lockdep_rcu_enabled+0x35/0x40
[ 453.533011] [<ffffffff812a92be>] ? __might_fault+0x7e/0xf0
[ 453.533011] [<ffffffff813ac9bc>] do_io_submit+0x94c/0xb10
[ 453.533011] [<ffffffff813ac2ae>] ? do_io_submit+0x23e/0xb10
[ 453.533011] [<ffffffff813ac070>] ? SyS_io_destroy+0x270/0x270
[ 453.533011] [<ffffffff8111d7b3>] ? mark_held_locks+0x23/0xc0
[ 453.533011] [<ffffffff8100201a>] ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 453.533011] [<ffffffff813acb90>] SyS_io_submit+0x10/0x20
[ 453.533011] [<ffffffff824f96aa>] entry_SYSCALL_64_fastpath+0x18/0xad
[ 453.533011] [<ffffffff81119190>] ? trace_hardirqs_off_caller+0xc0/0x110
[ 453.533011] ---[ end trace b2fbe664d1cc0082 ]---
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Sat, 14 Jan 2017 19:09:24 +0000 (11:09 -0800)]
Merge tag 'dmaengine-fix-4.10-rc4' of git://git.infradead.org/users/vkoul/slave-dma
Pull dmaengine fixes from Vinod Koul:
"The fixes this time around are spread over drivers, pretty normal
update:
- PCI ID for SKL ioatdma, workaround for SKX and
ioat_alloc_chan_resources sleepy allocation fix
- dw kconfig typo fix
- null pointer deref for stm32
- MAINTAINERS Update for at_hdmac
- pl330 runtime pm fixes
- omap-dma port window fix
- rcar-dmac unmap slave resource fix"
* tag 'dmaengine-fix-4.10-rc4' of git://git.infradead.org/users/vkoul/slave-dma:
dmaengine: rcar-dmac: unmap slave resource when channel is freed
dmaengine: omap-dma: Fix the port_window support
dmaengine: iota: ioat_alloc_chan_resources should not perform sleeping allocations.
dmaengine: pl330: Fix runtime PM support for terminated transfers
MAINTAINERS: dmaengine: Update + Hand over the at_hdmac driver to Ludovic
dmaengine: omap-dma: Fix dynamic lch_map allocation
dmaengine: ti-dma-crossbar: Add some 'of_node_put()' in error path.
dmaengine: stm32-dma: Fix null pointer dereference in stm32_dma_tx_status
dmaengine: stm32-dma: Set correct args number for DMA request from DT
dmaengine: dw: fix typo in Kconfig
dmaengine: ioatdma: workaround SKX ioatdma version
dmaengine: ioatdma: Add Skylake PCI Dev ID
Peter Jones [Mon, 12 Dec 2016 23:42:28 +0000 (18:42 -0500)]
efi/x86: Prune invalid memory map entries and fix boot regression
Some machines, such as the Lenovo ThinkPad W541 with firmware GNET80WW
(2.28), include memory map entries with phys_addr=0x0 and num_pages=0.
These machines fail to boot after the following commit:
  commit 8e80632fb23f ("efi/esrt: Use efi_mem_reserve() and avoid a kmalloc()")
Fix this by removing such bogus entries from the memory map.
Furthermore, currently the log output for this case (with efi=debug)
looks like:
[ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[0x0000000000000000-0xffffffffffffffff] (0MB)
This is clearly wrong, and also not as informative as it could be. This
patch changes it so that if we find obviously invalid memory map
entries, we print an error and skip those entries. It also detects
overflow in the address range calculation used for the display, so the new output is:
[ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
[ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[0x0000000000000000-0x0000000000000000] (invalid)
It also detects memory map sizes that would overflow the physical
address, for example phys_addr=0xfffffffffffff000 and
num_pages=0x0200000000000001, and prints:
[ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
[ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[phys_addr=0xfffffffffffff000-0x20ffffffffffffffff] (invalid)
It then removes these entries from the memory map.
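Here is a sketch of the two validity checks described above, assuming 4KB EFI pages. An entry must cover at least one page, and its last byte must not wrap past the top of the 64-bit address space. toy_efi_entry_valid() is a hypothetical helper, not the kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_EFI_PAGE_SHIFT 12   /* EFI memory map pages are 4 KiB */

/*
 * Return true if an EFI memory map entry is plausible: it must describe at
 * least one page, and its last byte must not wrap past the top of the
 * 64-bit physical address space. On success, *end is the last byte.
 */
bool toy_efi_entry_valid(uint64_t phys_addr, uint64_t num_pages, uint64_t *end)
{
	if (num_pages == 0)
		return false;
	/* num_pages << 12 itself must not overflow 64 bits. */
	if (num_pages > (UINT64_MAX >> TOY_EFI_PAGE_SHIFT))
		return false;
	uint64_t size = num_pages << TOY_EFI_PAGE_SHIFT;
	/* phys_addr + size - 1 must not wrap around. */
	if (size - 1 > UINT64_MAX - phys_addr)
		return false;
	*end = phys_addr + size - 1;
	return true;
}
```

Both examples from the log above are rejected: phys_addr=0x0 with num_pages=0 fails the first check, and phys_addr=0xfffffffffffff000 with num_pages=0x0200000000000001 fails the size-overflow check.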
Signed-off-by: Peter Jones <pjones@redhat.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
[ardb: refactor for clarity with no functional changes, avoid PAGE_SHIFT]
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
[Matt: Include bugzilla info in commit log]
Cc: <stable@vger.kernel.org> # v4.9+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=191121
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Greg Kroah-Hartman [Sat, 14 Jan 2017 13:09:03 +0000 (14:09 +0100)]
Revert "driver core: Add deferred_probe attribute to devices in sysfs"
This reverts commit 6751667a29d6fd64afb9ce30567ad616b68ed789.
Rob Herring objected to it, and a replacement for it will be added using
debugfs in the future.
Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
Reported-by: Rob Herring <robh@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jiri Olsa [Tue, 3 Jan 2017 14:24:54 +0000 (15:24 +0100)]
perf/x86: Reject non sampling events with precise_ip
As Peter suggested [1], reject non-sampling PEBS events, because they
don't make any sense and could cause bugs in the NMI handler [2].
[1] http://lkml.kernel.org/r/20170103094059.GC3093@worktop
[2] http://lkml.kernel.org/r/1482931866-6018-3-git-send-email-jolsa@kernel.org
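The rule fits in a few lines. The sketch below is a simplified userspace model, not the actual x86 code. struct attr_sketch is a hypothetical stand-in for the two relevant perf_event_attr fields. is_sampling() mirrors the kernel's is_sampling_event() test: a non-zero sample_period (which shares storage with sample_freq) marks a sampling event.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical subset of struct perf_event_attr, for illustration only. */
struct attr_sketch {
	uint64_t sample_period;  /* 0 means a counting (non-sampling) event */
	unsigned int precise_ip; /* > 0 requests PEBS on Intel */
};

/* A non-zero period (or frequency, in the real union) means sampling. */
static int is_sampling(const struct attr_sketch *attr)
{
	return attr->sample_period != 0;
}

/* Reject precise (PEBS) events that never produce samples. */
static int check_precise(const struct attr_sketch *attr)
{
	if (attr->precise_ip && !is_sampling(attr))
		return -EINVAL;
	return 0;
}
```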
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20170103142454.GA26251@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Jiri Olsa [Wed, 28 Dec 2016 13:31:03 +0000 (14:31 +0100)]
perf/x86/intel: Account interrupts for PEBS errors
It's possible to set up PEBS events to get only errors and not
any data, like on SNB-X (model 45) and IVB-EP (model 62)
via 2 perf commands running simultaneously:
taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10
This leads to a soft lockup, because the error path of
intel_pmu_drain_pebs_nhm() does not account event->hw.interrupt
for error PEBS interrupts. So if you're getting ONLY errors,
there's no way to stop the event when it's over the
max_samples_per_tick limit:
NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
...
RIP: 0010:[<ffffffff81159232>]  [<ffffffff81159232>] smp_call_function_single+0xe2/0x140
...
Call Trace:
? trace_hardirqs_on_caller+0xf5/0x1b0
? perf_cgroup_attach+0x70/0x70
perf_install_in_context+0x199/0x1b0
? ctx_resched+0x90/0x90
SYSC_perf_event_open+0x641/0xf90
SyS_perf_event_open+0x9/0x10
do_syscall_64+0x6c/0x1f0
entry_SYSCALL64_slow_path+0x25/0x25
Add perf_event_account_interrupt() which does the interrupt
and frequency checks and call it from intel_pmu_drain_pebs_nhm()'s
error path.
We keep the pending_kill and pending_wakeup logic only in the
__perf_event_overflow() path, because they make sense only if
there's any data to deliver.
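Below is a simplified model of the per-tick accounting that perf_event_account_interrupt() is described as doing, and of calling it from the error path. All names here are hypothetical. The sketch uses a plain tick sequence number in place of the kernel's per-CPU throttling state. For brevity, it also folds unthrottling into the first interrupt of the next tick.

```c
#include <stdint.h>

/* Hypothetical per-event state, loosely modelled on hw_perf_event. */
struct hw_sketch {
	uint64_t interrupts_seq; /* tick the count below belongs to */
	unsigned int interrupts; /* interrupts seen during that tick */
	int throttled;           /* set once the per-tick limit is hit */
};

/*
 * Account one interrupt against tick tick_seq. Returns 1 if the event is
 * (now) throttled and should be stopped, 0 otherwise. Simplification: the
 * kernel unthrottles from the timer tick; here a new tick_seq does it.
 */
static int account_interrupt(struct hw_sketch *hw, uint64_t tick_seq,
			     unsigned int max_per_tick)
{
	if (hw->interrupts_seq != tick_seq) {
		hw->interrupts_seq = tick_seq; /* first interrupt in a new tick */
		hw->interrupts = 1;
		hw->throttled = 0;
		return 0;
	}
	if (hw->throttled)
		return 1;
	if (++hw->interrupts >= max_per_tick) {
		hw->throttled = 1;
		return 1;
	}
	return 0;
}

/*
 * Hypothetical PEBS error path: there is no sample data to deliver, but
 * the interrupt is still accounted, so error-only events get throttled.
 */
static int drain_error_record(struct hw_sketch *hw, uint64_t tick_seq,
			      unsigned int max_per_tick)
{
	return account_interrupt(hw, tick_seq, max_per_tick);
}
```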
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Wed, 11 Jan 2017 20:09:50 +0000 (21:09 +0100)]
perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race
Di Shen reported a race between two concurrent sys_perf_event_open()
calls where both try to move the same pre-existing software group
into a hardware context.
The problem is exactly that described in commit:
f63a8daa5812 ("perf: Fix event->ctx locking")
... where, while we wait for a ctx->mutex acquisition, the event->ctx
relation can have changed under us.
That very same commit failed to recognise sys_perf_event_open() as an
external access vector to the events and thereby didn't apply the
established locking rules correctly.
So while one sys_perf_event_open() call is stuck waiting on
mutex_lock_double(), the other (which owns said locks) moves the group
about. So by the time the former sys_perf_event_open() acquires the
locks, the context we've acquired is stale (and possibly dead).
Apply the established locking rules as per perf_event_ctx_lock_nested()
to the mutex_lock_double() for the 'move_group' case. This obviously means
we need to validate state after we acquire the locks.
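The lock-then-revalidate pattern can be sketched in userspace with pthreads. This is an illustration under stated assumptions, not the kernel code. struct ctx, struct event, lock_double() and lock_leader_ctx() are hypothetical. The sketch assumes that any thread moving an event to another context holds both contexts' mutexes while it updates event->ctx. It also omits the reference counting needed to keep the old context alive while we wait for its mutex.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

struct ctx {
	pthread_mutex_t mutex;
};

struct event {
	_Atomic(struct ctx *) ctx; /* may be changed by a concurrent mover */
};

/* Take two mutexes in a fixed (address) order to avoid ABBA deadlocks. */
static void lock_double(pthread_mutex_t *a, pthread_mutex_t *b)
{
	if ((uintptr_t)b < (uintptr_t)a) {
		pthread_mutex_t *tmp = a;
		a = b;
		b = tmp;
	}
	pthread_mutex_lock(a);
	if (b != a)
		pthread_mutex_lock(b);
}

static void unlock_double(pthread_mutex_t *a, pthread_mutex_t *b)
{
	pthread_mutex_unlock(a);
	if (b != a)
		pthread_mutex_unlock(b);
}

/*
 * Lock the group leader's current context together with ctx. While we
 * sleep in lock_double(), another thread may move the leader elsewhere,
 * so re-check after acquiring the locks and retry if it changed.
 */
static struct ctx *lock_leader_ctx(struct event *leader, struct ctx *ctx)
{
	for (;;) {
		struct ctx *gctx = atomic_load(&leader->ctx);

		lock_double(&gctx->mutex, &ctx->mutex);
		if (atomic_load(&leader->ctx) == gctx)
			return gctx; /* still current: both locks held */
		unlock_double(&gctx->mutex, &ctx->mutex);
	}
}
```

Ordering the two mutexes by address lets two concurrent callers lock the same pair without deadlocking. The re-check after locking catches a group that was moved while we slept.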
Reported-by: Di Shen (Keen Lab)
Tested-by: John Dias <joaodias@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Min Chong <mchong@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: f63a8daa5812 ("perf: Fix event->ctx locking")
Link: http://lkml.kernel.org/r/20170106131444.GZ3174@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Fri, 9 Dec 2016 13:59:00 +0000 (14:59 +0100)]
perf/core: Fix sys_perf_event_open() vs. hotplug
There is a problem with installing an event in a task that is 'stuck' on
an offline CPU.
Blocked tasks are not disassociated from offlined CPUs; after all, a
blocked task doesn't run and doesn't require a CPU etc. Only on
wakeup do we amend the situation and place the task on an available
CPU.
If we hit such a task with perf_install_in_context(), we'll loop until
either that task wakes up or the CPU comes back online. If the task
waking depends on the event being installed, we're stuck.
While looking into this issue, I also spotted another problem: if we
hit a task with perf_install_in_context() that is in the middle of
being migrated (that is, we observe the old CPU before sending the IPI,
but run the IPI on the old CPU while the task is already running on
the new CPU), things also go sideways.
Rework things to rely on task_curr() -- outside of rq->lock -- which
is rather tricky. Imagine the following scenario where we're trying to
install the first event into our task 't':
  CPU0                          CPU1                        CPU2
                                (current == t)
  t->perf_event_ctxp[] = ctx;
  smp_mb();
  cpu = task_cpu(t);
                                switch(t, n);
                                                            migrate(t, 2);
                                                            switch(p, t);
                                                            ctx = t->perf_event_ctxp[]; // must not be NULL
  smp_function_call(cpu, ..);
                                generic_exec_single()
                                  func();
                                    spin_lock(ctx->lock);
                                    if (task_curr(t)) // false
                                    add_event_to_ctx();
                                    spin_unlock(ctx->lock);
                                                            perf_event_context_sched_in();
                                                              spin_lock(ctx->lock);
                                                              // sees event
So it's CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
If CPU2's load of that variable were to observe NULL, it would not try
to schedule the ctx, and we'd have a task running without its
counter, which would be 'bad'.
As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
first and do not see the event yet, then CPU0 must observe task_curr()
and retry. If the install happens first, then we must see the event on
sched-in and all is well.
I think we can translate the first part (until the 'must not be NULL')
of the scenario to a litmus test like:
C C-peterz
{
}
P0(int *x, int *y)
{
	int r1;
	WRITE_ONCE(*x, 1);
	smp_mb();
	r1 = READ_ONCE(*y);
}
P1(int *y, int *z)
{
	WRITE_ONCE(*y, 1);
	smp_store_release(z, 1);
}
P2(int *x, int *z)
{
	int r1;
	int r2;
	r1 = smp_load_acquire(z);
	smp_mb();
	r2 = READ_ONCE(*x);
}
exists
(0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
Where:
x is perf_event_ctxp[],
y is our task's CPU, and
z is our task being placed on the rq of CPU2.
The P0 smp_mb() is the one added by this patch, ordering the store to
perf_event_ctxp[] from find_get_context() and the load of task_cpu()
in task_function_call().
The smp_store_release/smp_load_acquire model the RCpc locking of the
rq->lock and the smp_mb() of P2 is the context switch switching from
whatever CPU2 was running to our task 't'.
This litmus test evaluates to:
Test C-peterz Allowed
States 7
0:r1=0; 2:r1=0; 2:r2=0;
0:r1=0; 2:r1=0; 2:r2=1;
0:r1=0; 2:r1=1; 2:r2=1;
0:r1=1; 2:r1=0; 2:r2=0;
0:r1=1; 2:r1=0; 2:r2=1;
0:r1=1; 2:r1=1; 2:r2=0;
0:r1=1; 2:r1=1; 2:r2=1;
No
Witnesses
Positive: 0 Negative: 7
Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
Observation C-peterz Never 0 7
Hash=e427f41d9146b2a5445101d3e2fcaa34
And the strong and weak models agree.
Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: jeremy.linton@arm.com
Link: http://lkml.kernel.org/r/20161209135900.GU3174@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tobias Klauser [Thu, 12 Jan 2017 15:53:11 +0000 (16:53 +0100)]
x86/mpx: Use compatible types in comparison to fix sparse error
info->si_addr is of type void __user *, so it should be compared against
something from the same address space.
This fixes the following sparse error:
arch/x86/mm/mpx.c:296:27: error: incompatible types in comparison expression (different address spaces)
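Below is a small sketch of the address-space issue that sparse reports. The __user definition follows the usual sparse convention: it expands to an address_space attribute under __CHECKER__ and to nothing for a normal compiler. struct siginfo_sketch and the -1 sentinel are assumptions for illustration; they are not taken from mpx.c.

```c
#include <stdbool.h>

/*
 * Sparse-style annotation: under sparse (__CHECKER__), __user puts a
 * pointer in a separate address space; a normal compiler sees nothing.
 */
#ifdef __CHECKER__
# define __user __attribute__((noderef, address_space(1)))
#else
# define __user
#endif

/* Hypothetical stand-in for the siginfo field being compared. */
struct siginfo_sketch {
	void __user *si_addr;
};

static bool addr_is_sentinel(const struct siginfo_sketch *info)
{
	/*
	 * Comparing against a plain (void *)-1 would mix address spaces and
	 * make sparse complain; casting the constant to (void __user *)
	 * keeps both operands in the user address space.
	 */
	return info->si_addr == (void __user *)-1;
}
```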
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Len Brown [Fri, 13 Jan 2017 06:11:18 +0000 (01:11 -0500)]
x86/tsc: Add the Intel Denverton Processor to native_calibrate_tsc()
The Intel Denverton microserver uses a 25 MHz TSC crystal,
so we can derive its exact [*] TSC frequency
using CPUID and some arithmetic, e.g.:
  TSC: 1800 MHz (25000000 Hz * 216 / 3 / 1000000)
[*] 'exact' is only as good as the crystal, which should be +/- 20ppm
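The arithmetic can be written out as a small helper. The helper assumes the CPUID leaf 0x15 layout: EAX and EBX hold the denominator and numerator of the TSC/crystal ratio, and ECX holds the crystal frequency in Hz, which may be 0. tsc_hz_from_cpuid15() is hypothetical. The register values 3 and 216 are the ones implied by the example above, not values read from hardware.

```c
#include <stdint.h>

/*
 * TSC frequency from CPUID leaf 0x15: TSC = crystal * EBX / EAX. If ECX
 * (crystal Hz) is 0, fall back to a per-model crystal frequency, roughly
 * what adding Denverton (25 MHz) to native_calibrate_tsc() provides.
 */
static uint64_t tsc_hz_from_cpuid15(uint32_t eax_denominator,
				    uint32_t ebx_numerator,
				    uint32_t ecx_crystal_hz,
				    uint32_t model_crystal_hz)
{
	uint64_t crystal_hz = ecx_crystal_hz ? ecx_crystal_hz : model_crystal_hz;

	if (eax_denominator == 0 || ebx_numerator == 0 || crystal_hz == 0)
		return 0; /* ratio or crystal unknown: cannot compute */

	/* Multiply in 64 bits before dividing: 25 MHz * 216 exceeds 2^32. */
	return crystal_hz * ebx_numerator / eax_denominator;
}
```

With a ±20 ppm crystal, the 1800 MHz result is accurate to about ±36 kHz.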
Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/306899f94804aece6d8fa8b4223ede3b48dbb59c.1484287748.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linus Torvalds [Sat, 14 Jan 2017 01:40:22 +0000 (17:40 -0800)]
Merge branch 'for-linus-4.10' of git://git./linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"These are all over the place.
The tracepoint part of the pull fixes a crash and adds a little more
information to two tracepoints, while the rest are good old fashioned
fixes"
* 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: make tracepoint format strings more compact
Btrfs: add truncated_len for ordered extent tracepoints
Btrfs: add 'inode' for extent map tracepoint
btrfs: fix crash when tracepoint arguments are freed by wq callbacks
Btrfs: adjust outstanding_extents counter properly when dio write is split
Btrfs: fix lockdep warning about log_mutex
Btrfs: use down_read_nested to make lockdep silent
btrfs: fix locking when we put back a delayed ref that's too new
btrfs: fix error handling when run_delayed_extent_op fails
btrfs: return the actual error value from from btrfs_uuid_tree_iterate
Linus Torvalds [Sat, 14 Jan 2017 01:38:05 +0000 (17:38 -0800)]
Merge tag 'ceph-for-4.10-rc4' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Two small fixups for the filesystem changes that went into this merge
window"
* tag 'ceph-for-4.10-rc4' of git://github.com/ceph/ceph-client:
ceph: fix get_oldest_context()
ceph: fix mds cluster availability check
Linus Torvalds [Sat, 14 Jan 2017 01:35:43 +0000 (17:35 -0800)]
Merge tag 'vfio-v4.10-rc4' of git://github.com/awilliam/linux-vfio
Pull VFIO fixes from Alex Williamson:
- Cleanups and bug fixes for the mtty sample driver (Dan Carpenter)
- Export and make use of has_capability() to fix incorrect use of
ns_capable() for testing task capabilities (Jike Song)
* tag 'vfio-v4.10-rc4' of git://github.com/awilliam/linux-vfio:
vfio/type1: Remove pid_namespace.h include
vfio iommu type1: fix the testing of capability for remote task
capability: export has_capability
vfio-mdev: remove some dead code
vfio-mdev: buffer overflow in ioctl()
vfio-mdev: return -EFAULT if copy_to_user() fails
Linus Torvalds [Sat, 14 Jan 2017 01:06:24 +0000 (17:06 -0800)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
- fix for module unload vs deferred jump labels (note: there might be
other buggy modules!)
- two NULL pointer dereferences from syzkaller
- also syzkaller: fix emulation of fxsave/fxrstor/sgdt/sidt, problem
made worse during this merge window, "just" kernel memory leak on
releases
- fix emulation of "mov ss" - somewhat serious on AMD, less so on Intel
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: fix emulation of "MOV SS, null selector"
KVM: x86: fix NULL deref in vcpu_scan_ioapic
KVM: eventfd: fix NULL deref irqbypass consumer
KVM: x86: Introduce segmented_write_std
KVM: x86: flush pending lapic jump label updates on module unload
jump_labels: API for flushing deferred jump label updates
Linus Torvalds [Sat, 14 Jan 2017 01:00:42 +0000 (17:00 -0800)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Fix huge_ptep_set_access_flags() to return "changed" when any of the
ptes in the contiguous range is changed, not just the last one
- Fix the adr_l assembly macro to work in modules under KASLR
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: assembler: make adr_l work in modules under KASLR
arm64: hugetlb: fix the wrong return value for huge_ptep_set_access_flags
Christoph Hellwig [Fri, 13 Jan 2017 22:18:16 +0000 (15:18 -0700)]
block: don't try to discard from __blkdev_issue_zeroout
Discard can return -EIO asynchronously if the alignment for the request
isn't suitable for the driver, which makes a proper fallback to other
methods in __blkdev_issue_zeroout impossible. Thus only issue a sync
discard from blkdev_issue_zeroout and don't try discard at all from
__blkdev_issue_zeroout as a non-invasive workaround.
One more reason why abusing discard for zeroing must die..
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Eryu Guan <eguan@redhat.com>
Fixes: e73c23ff ("block: add async variant of blkdev_issue_zeroout")
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Fri, 13 Jan 2017 11:29:13 +0000 (12:29 +0100)]
sd: remove __data_len hack for WRITE SAME
Now that we have the blk_rq_payload_bytes helper available to determine
the actual I/O size we don't need to mess around with __data_len for
WRITE SAME.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Fri, 13 Jan 2017 11:29:12 +0000 (12:29 +0100)]
nvme: use blk_rq_payload_bytes
The new blk_rq_payload_bytes generalizes the payload length hacks
that nvme_map_len did before.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Fri, 13 Jan 2017 11:29:11 +0000 (12:29 +0100)]
scsi: use blk_rq_payload_bytes
Without that we'll pass a wrong payload size in cmd->sdb, which
can lead to hangs with drivers that need the total transfer size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Chris Valean <v-chvale@microsoft.com>
Reported-by: Dexuan Cui <decui@microsoft.com>
Fixes: f9d03f96 ("block: improve handling of the magic discard payload")
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Fri, 13 Jan 2017 11:29:10 +0000 (12:29 +0100)]
block: add blk_rq_payload_bytes
Add a helper to calculate the actual data transfer size for special
payload requests.
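Below is a minimal model of what such a helper computes. It assumes, as the surrounding commits suggest, that a special-payload request such as a discard carries its payload in a single buffer. That buffer's length differs from the range the request covers on the device. struct request_sketch and rq_payload_bytes() are hypothetical stand-ins for the block layer's types.

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down request: only what the helper needs. */
struct bvec_sketch {
	unsigned int bv_len;
};

struct request_sketch {
	bool special_payload;           /* e.g. a discard with a driver-built payload */
	unsigned int data_len;          /* bytes the request covers on the device */
	struct bvec_sketch special_vec; /* the single payload buffer, if any */
};

/*
 * Bytes actually transferred to or from memory: for special-payload
 * requests this is the payload buffer, not the range on the device.
 */
static unsigned int rq_payload_bytes(const struct request_sketch *rq)
{
	if (rq->special_payload)
		return rq->special_vec.bv_len;
	return rq->data_len;
}
```

For a 1 MiB discard whose driver-built payload is 16 bytes, rq_payload_bytes() returns 16 rather than 1048576. The 16-byte figure is purely illustrative.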
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Linus Torvalds [Fri, 13 Jan 2017 20:38:36 +0000 (12:38 -0800)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"The major fix is the bfa firmware, since the latest 10Gb cards fail
probing with the current firmware.
The rest is a set of minor fixes: one missed Kconfig dependency
causing randconfig failures, a missed error return on an error leg, a
change for how multiqueue waits on a blocked device and a don't reset
while in reset fix"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: bfa: Increase requested firmware version to 3.2.5.1
scsi: snic: Return error code on memory allocation failure
scsi: fnic: Avoid sending reset to firmware when another reset is in progress
scsi: qedi: fix build, depends on UIO
scsi: scsi-mq: Wait for .queue_rq() if necessary
Linus Torvalds [Fri, 13 Jan 2017 19:49:34 +0000 (11:49 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
"Small driver fixups"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: elants_i2c - avoid divide by 0 errors on bad touchscreen data
Input: adxl34x - make it enumerable in ACPI environment
Input: ALPS - fix TrackStick Y axis handling for SS5 hardware
Input: synaptics-rmi4 - fix F03 build error when serio is module
Input: xpad - use correct product id for x360w controllers
Input: synaptics_i2c - change msleep to usleep_range for small msecs
Input: i8042 - add Pegatron touchpad to noloop table
Input: joydev - remove unused linux/miscdevice.h include
Alex Williamson [Thu, 12 Jan 2017 15:24:16 +0000 (08:24 -0700)]
vfio/type1: Remove pid_namespace.h include
Now that we use has_capability() rather than ns_capable(), we no
longer need this header.
Cc: Jike Song <jike.song@intel.com>
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Niklas Söderlund [Wed, 11 Jan 2017 14:39:31 +0000 (15:39 +0100)]
dmaengine: rcar-dmac: unmap slave resource when channel is freed
The slave mapping should be removed together with other channel
resources when the channel is freed. If it's not unmapped it will hang
around forever after the channel is freed.
Fixes: 9f878603dbdb7db3 ("dmaengine: rcar-dmac: add iommu support for slave transfers")
Reported-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Jike Song [Thu, 12 Jan 2017 08:52:03 +0000 (16:52 +0800)]
vfio iommu type1: fix the testing of capability for remote task
Before the mdev enhancement, the type1 IOMMU driver used capable() to
test the capability of the current task. During mdev development, a new
requirement arose: testing a task other than current. ns_capable() was
used for this purpose; however, it still tests current, the only
difference being that it does so in a specified namespace.
Fix it by using has_capability() instead, which tests the capability of
the specified task in init_user_ns, the same namespace as capable().
Cc: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Jike Song <jike.song@intel.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Linus Torvalds [Thu, 12 Jan 2017 22:45:59 +0000 (14:45 -0800)]
Merge tag 'sound-4.10-rc4' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"This time we got a few more fixes than the previous rc's, and most of
commits were about ASoC.
The only significant change in the core side is the regression fix wrt
the aux device list handling, and all the rest are driver-specific
small / trivial fixes"
* tag 'sound-4.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: usb-audio: Add a quirk for Plantronics BT600
ASoC: rt5645: set sel_i2s_pre_div1 to 2
ASoC: dpcm: Avoid putting stream state to STOP when FE stream is paused
ASoC: Intel: Skylake: Release FW ctx in cleanup
ASoC: Intel: bytcr-rt5640: fix settings in internal clock mode
ASoC: fsl_ssi: set fifo watermark to more reliable value
ASoC: nau8825: fix invalid configuration in Pre-Scalar of FLL
ASoC: nau8825: correct the function name of register
ASoC: Intel: Skylake: Fix to fail safely if module not available in path
ASoC: tlv320aic3x: Mark the RESET register as volatile
ASoC: Fix binding and probing of auxiliary components
ASoC: wm_adsp: Don't overrun firmware file buffer when reading region data
ASoC: Intel: bytcr_rt5640: fallback mechanism if MCLK is not enabled
ASoC: hdmi-codec: use unsigned type to structure members with bit-field
ASoC: topology: kfree kcontrol->private_value before freeing kcontrol
ASoC: rsnd: don't double free kctrl
ASoC: dwc: Fix PIO mode initialization
Ricardo Ribalda [Wed, 11 Jan 2017 09:11:44 +0000 (10:11 +0100)]
i2c: piix4: Avoid race conditions with IMC
On AMD's SB800 and upwards, the SMBus is shared with the Integrated
Micro Controller (IMC).
The platform provides a hardware semaphore to avoid race conditions
among them. (Check page 288 of the SB800-Series Southbridges Register
Reference Guide http://support.amd.com/TechDocs/45482.pdf)
Without this patch, many accesses to the SMBus end with an invalid
transaction or even with the bus stalled.
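A driver typically uses a request, read-back and retry handshake with such a hardware semaphore. The pattern can be sketched as below. Everything here is hypothetical and only illustrates the pattern. The register accessors, the SEM_REQUEST/SEM_RELEASE bit values and the simulated IMC are assumptions; they are not the SB800 register layout from the AMD guide.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEM_REQUEST 0x10 /* hypothetical "host requests the bus" bit */
#define SEM_RELEASE 0x20 /* hypothetical "host releases the bus" bit */

/* Register accessors, so the sketch can run against a simulated device. */
struct sem_reg {
	uint8_t (*read)(void *hw);
	void (*write)(void *hw, uint8_t val);
	void *hw;
};

/* Set the request bit and read it back; retry while the IMC owns the bus. */
static bool sem_acquire(const struct sem_reg *r, int retries)
{
	while (retries-- > 0) {
		r->write(r->hw, (uint8_t)(r->read(r->hw) | SEM_REQUEST));
		if (r->read(r->hw) & SEM_REQUEST)
			return true; /* request granted: the host owns the bus */
		/* a real driver would sleep briefly here before retrying */
	}
	return false; /* still owned by the IMC: caller gives up, e.g. -EBUSY */
}

static void sem_release(const struct sem_reg *r)
{
	r->write(r->hw, (uint8_t)(r->read(r->hw) | SEM_RELEASE));
}

/* Simulated device: the IMC holds the bus for the first few requests. */
struct sim_hw {
	uint8_t reg;
	int imc_busy_polls;
};

static uint8_t sim_read(void *p)
{
	return ((struct sim_hw *)p)->reg;
}

static void sim_write(void *p, uint8_t val)
{
	struct sim_hw *hw = p;

	if (val & SEM_RELEASE) {
		hw->reg &= (uint8_t)~(SEM_REQUEST | SEM_RELEASE);
		return;
	}
	if ((val & SEM_REQUEST) && hw->imc_busy_polls > 0) {
		hw->imc_busy_polls--; /* IMC owns the bus: request not latched */
		return;
	}
	hw->reg = val;
}
```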
Reported-by: Alexandre Desnoyers <alex@qtec.com>
Signed-off-by: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Colin Ian King [Thu, 29 Dec 2016 22:27:33 +0000 (22:27 +0000)]
i2c: fix spelling mistake: "insufficent" -> "insufficient"
Trivial fix to spelling mistake in WARN message, insufficient has
an insufficient number of i's in the spelling.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Linus Torvalds [Thu, 12 Jan 2017 19:06:26 +0000 (11:06 -0800)]
Merge tag 'xfs-for-linus-4.10-rc4-1' of git://git./fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"As promised last week, here's some stability fixes from Christoph and
Jan Kara:
- fix free space request handling when low on disk space
- remove redundant log failure error messages
- free truncated dirty pages instead of letting them build up
forever"
* tag 'xfs-for-linus-4.10-rc4-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: Timely free truncated dirty pages
xfs: don't print warnings when xfs_log_force fails
xfs: don't rely on ->total in xfs_alloc_space_available
xfs: adjust allocation length in xfs_alloc_space_available
xfs: fix bogus minleft manipulations
xfs: bump up reserved blocks in xfs_alloc_set_aside
John Garry [Fri, 6 Jan 2017 11:02:57 +0000 (19:02 +0800)]
i2c: print correct device invalid address
In of_i2c_register_device(), when the check for
device address validity fails we print the info.addr,
which has not been assigned properly.
Fix this by printing the actual invalid address.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Vladimir Zapolskiy <vz@mleia.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Fixes: b4e2f6ac1281 ("i2c: apply DT flags when probing")
Cc: stable@kernel.org
Dmitry Torokhov [Thu, 5 Jan 2017 04:57:22 +0000 (20:57 -0800)]
i2c: do not enable fall back to Host Notify by default
Falling back unconditionally to HostNotify as primary client's interrupt
breaks some drivers which alter their functionality depending on whether
interrupt is present or not, so let's introduce a board flag telling I2C
core explicitly if we want wired interrupt or HostNotify-based one:
I2C_CLIENT_HOST_NOTIFY.
For DT-based systems we introduce "host-notify" property that we convert
to I2C_CLIENT_HOST_NOTIFY board flag.
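Below is a simplified model of the new opt-in behaviour. CLIENT_HOST_NOTIFY stands in for I2C_CLIENT_HOST_NOTIFY with a made-up value. struct board_info_sketch and pick_client_irq() are hypothetical. The real I2C core can find a wired interrupt in more places than this sketch checks.

```c
#define CLIENT_HOST_NOTIFY 0x40 /* hypothetical value for I2C_CLIENT_HOST_NOTIFY */

/* Hypothetical subset of the board info a client is created from. */
struct board_info_sketch {
	unsigned short flags;
	int irq; /* wired interrupt from firmware/board code, 0 if none */
};

/*
 * Pick the client's interrupt. The old behaviour fell back to Host Notify
 * whenever no wired IRQ existed; now that only happens on explicit request.
 */
static int pick_client_irq(const struct board_info_sketch *info,
			   int host_notify_irq)
{
	if (info->irq > 0)
		return info->irq;
	if (info->flags & CLIENT_HOST_NOTIFY)
		return host_notify_irq;
	return 0; /* no interrupt: drivers can adapt, e.g. by polling */
}
```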
Tested-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Acked-by: Pali Rohár <pali.rohar@gmail.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Vlad Tsyrklevich [Mon, 9 Jan 2017 15:53:36 +0000 (22:53 +0700)]
i2c: fix kernel memory disclosure in dev interface
i2c_smbus_xfer() does not always fill an entire block, allowing
kernel stack memory disclosure through the temp variable. Clear
it before the transfer reads data into it.
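Below is a userspace sketch of the idea behind the fix: zero the temporary buffer before a transfer that may only partially fill it. union smbus_data_sketch, partial_xfer() and smbus_read_to_user() are hypothetical. BLOCK_MAX mirrors the 32-byte SMBus block limit, and memcpy() stands in for copy_to_user().

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_MAX 32 /* mirrors the SMBus 32-byte block limit */

/* Hypothetical stand-in for the SMBus data union used by the dev interface. */
union smbus_data_sketch {
	unsigned char byte;
	unsigned short word;
	unsigned char block[BLOCK_MAX + 2]; /* length byte, data, one spare */
};

/* Hypothetical transfer that writes only n (<= BLOCK_MAX) data bytes. */
static void partial_xfer(union smbus_data_sketch *data, size_t n)
{
	data->block[0] = (unsigned char)n;
	for (size_t i = 1; i <= n; i++)
		data->block[i] = 0xAA;
}

/*
 * Copy the whole block out to the caller. Without the memset, bytes the
 * transfer did not write would expose whatever was on the stack.
 */
static void smbus_read_to_user(unsigned char out[BLOCK_MAX + 2], size_t n)
{
	union smbus_data_sketch temp;

	memset(&temp, 0, sizeof(temp)); /* the fix: clear before use */
	partial_xfer(&temp, n);
	memcpy(out, temp.block, sizeof(temp.block)); /* stands in for copy_to_user() */
}
```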
Signed-off-by: Vlad Tsyrklevich <vlad@tsyrklevich.net>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Cc: stable@kernel.org
Linus Torvalds [Thu, 12 Jan 2017 19:00:22 +0000 (11:00 -0800)]
Merge tag 'rproc-v4.10-fixes' of git://github.com/andersson/remoteproc
Pull remoteproc fixes from Bjorn Andersson:
"This fixes two regressions that have been reported to be introduced in
v4.10-rc1.
- correct an incorrect usage of the kref api
- revert the change to make the resource table read-only. As the
space of each vdev resource is used as virtio device config space, it
must be shared with the remote"
* tag 'rproc-v4.10-fixes' of git://github.com/andersson/remoteproc:
Revert "remoteproc: Merge table_ptr and cached_table pointers"
remoteproc: fix vdev reference management
Linus Torvalds [Thu, 12 Jan 2017 18:58:16 +0000 (10:58 -0800)]
Merge tag 'rpmsg-v4.10-fixes' of git://github.com/andersson/remoteproc
Pull rpmsg fixes from Bjorn Andersson:
"This fixes a regression introduced in v4.10-rc1 that prohibits
multiple channels with the same name but different endpoint addresses
from being used"
* tag 'rpmsg-v4.10-fixes' of git://github.com/andersson/remoteproc:
rpmsg: virtio_rpmsg_bus: fix channel creation
Linus Torvalds [Thu, 12 Jan 2017 18:55:28 +0000 (10:55 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jikos/hid
Pull HID fixes from Jiri Kosina:
- device descriptor length validation fix to hid-cypress driver from
Greg
- introduction of a short delay into i2c-hid, which is not really
mandated by the spec, but fixes Asus Touchpads
- Petzl USB connectable flashlight quirk from myself
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
HID: i2c-hid: Add sleep between POWER ON and RESET
HID: hid-cypress: validate length of report
HID: ignore Petzl USB headlamp
Linus Torvalds [Thu, 12 Jan 2017 18:41:20 +0000 (10:41 -0800)]
Merge branch 'scsi-target-for-v4.10' of git://git./linux/kernel/git/bvanassche/linux
Pull scsi target fixes from Bart Van Assche:
- a series of bug fixes for the XCOPY implementation from David
Disseldorp
- one bug fix for the ibmvscsis driver, a driver that is used for
communication between partitions on IBM POWER systems.
* 'scsi-target-for-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/bvanassche/linux:
ibmvscsis: Fix srp_transfer_data fail return code
target: support XCOPY requests without parameters
target: check for XCOPY parameter truncation
target: use XCOPY segment descriptor CSCD IDs
target: check XCOPY segment descriptor CSCD IDs
target: simplify XCOPY wwn->se_dev lookup helper
target: return UNSUPPORTED TARGET/SEGMENT DESC TYPE CODE sense
target: bounds check XCOPY total descriptor list length
target: bounds check XCOPY segment descriptor list
target: use XCOPY TOO MANY TARGET DESCRIPTORS sense
target: add XCOPY target/segment desc sense codes
Geng, Jichao [Thu, 5 Jan 2017 08:50:39 +0000 (16:50 +0800)]
ceph: fix get_oldest_context()
For no snapshot case, we should use ci->truncate_{seq,size}.
Fixes: 5f743e456606 ("ceph: record truncate size/seq for snap data writeback")
Signed-off-by: Geng, Jichao <geng.jichao@h3c.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Wed, 4 Jan 2017 08:21:58 +0000 (16:21 +0800)]
ceph: fix mds cluster availability check
We should apply the check after getting the initial mdsmap.
Fixes: e9e427f0a14f ("ceph: check availability of mds cluster on mount")
Link: http://tracker.ceph.com/issues/18161
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Linus Torvalds [Thu, 12 Jan 2017 18:17:59 +0000 (10:17 -0800)]
Merge tag 'md/4.10-rc3' of git://git./linux/kernel/git/shli/md
Pull md fixes from Shaohua Li:
"Basically one fix for raid5 cache which is merged in this cycle,
others are trivial fixes"
* tag 'md/4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md/raid5: Use correct IS_ERR() variation on pointer check
md: cleanup mddev flag clear for takeover
md/r5cache: fix spelling mistake on "recoverying"
md/r5cache: assign conf->log before r5l_load_log()
md/r5cache: simplify handling of sh->log_start in recovery
md/raid5-cache: removes unnecessary write-through mode judgments
md/raid10: Refactor raid10_make_request
md/raid1: Refactor raid1_make_request
Ard Biesheuvel [Wed, 11 Jan 2017 14:54:53 +0000 (14:54 +0000)]
arm64: assembler: make adr_l work in modules under KASLR
When CONFIG_RANDOMIZE_MODULE_REGION_FULL=y, the offset between loaded
modules and the core kernel may exceed 4 GB, putting symbols exported
by the core kernel out of the reach of the ordinary adrp/add instruction
pairs used to generate relative symbol references. So make the adr_l
macro emit a movz/movk sequence instead when executing in module context.
While at it, remove the pointless special case for the stack pointer.
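Below is a sketch of the two addressing approaches, written in C rather than assembly. adrp_reachable() and the MOVZ/MOVK helpers are hypothetical. The reach check assumes that ADRP takes a signed 21-bit page offset with 4 KiB pages, which gives a reach of about ±4 GB. The sketch fills the 16-bit chunks in an arbitrary order, which is not necessarily the order the adr_l macro uses.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * ADRP forms a 4 KiB page address from a signed 21-bit page offset
 * relative to the PC's page, so it reaches roughly +/-4 GiB.
 */
static bool adrp_reachable(uint64_t pc, uint64_t target)
{
	/* Page numbers are below 2^52, so this subtraction cannot overflow. */
	int64_t page_delta = (int64_t)(target >> 12) - (int64_t)(pc >> 12);

	return page_delta >= -(INT64_C(1) << 20) && page_delta < (INT64_C(1) << 20);
}

/* Split an absolute 64-bit address into four 16-bit immediates. */
static void movz_movk_chunks(uint64_t addr, uint16_t imm[4])
{
	for (int i = 0; i < 4; i++)
		imm[i] = (uint16_t)(addr >> (16 * i));
}

/*
 * Emulate the sequence: MOVZ writes chunk 0 and clears the rest, then
 * each MOVK replaces one 16-bit field and keeps the others.
 */
static uint64_t apply_movz_movk(const uint16_t imm[4])
{
	uint64_t reg = imm[0]; /* MOVZ Xd, #imm[0] */

	for (int i = 1; i < 4; i++) {
		unsigned int shift = 16u * (unsigned int)i;

		/* MOVK Xd, #imm[i], LSL #shift */
		reg = (reg & ~(UINT64_C(0xffff) << shift)) | ((uint64_t)imm[i] << shift);
	}
	return reg;
}
```

The absolute MOVZ/MOVK form does not depend on the distance between the module and the core kernel. It costs four instructions instead of two.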
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Greg Kroah-Hartman [Thu, 12 Jan 2017 17:17:38 +0000 (18:17 +0100)]
Merge tag 'usb-serial-4.10-rc4' of git://git./linux/kernel/git/johan/usb-serial into usb-linus
Johan writes:
USB-serial fixes for v4.10-rc4
These fixes address a number of issues in the ch341 driver and include
a partial revert of a change in how we set the line settings that went
into 4.10-rc1 but which turned out to have undesired side effects. This
included deasserting the modem-control lines when configuring the
device, but also prevented a certain class of CH340 devices from working
with the driver.
Included are also two fixes for two minor information leaks in
kl5kusb105 and ch341 due to failures to detect short control transfers.
Signed-off-by: Johan Hovold <johan@kernel.org>
Damien Le Moal [Thu, 12 Jan 2017 14:58:32 +0000 (07:58 -0700)]
block: Rename blk_queue_zone_size and bdev_zone_size
All block device data fields and functions returning a number of 512B
sectors are by convention named xxx_sectors while names in the form
xxx_size are generally used for a number of bytes. The blk_queue_zone_size
and bdev_zone_size functions were not following this convention so rename
them.
No functional change is introduced by this patch.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Collapsed the two patches, they were nonsensically split and broke
bisection.
Signed-off-by: Jens Axboe <axboe@fb.com>
Paolo Bonzini [Thu, 12 Jan 2017 14:02:32 +0000 (15:02 +0100)]
KVM: x86: fix emulation of "MOV SS, null selector"
This is CVE-2017-2583. On Intel this causes a failed vmentry because
SS's type is neither 3 nor 7 (even though the manual says this check is
only done for usable SS, and the dmesg splat says that SS is unusable!).
On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb.
The fix fabricates a data segment descriptor when SS is set to a null
selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb.
Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3;
this in turn ensures CPL < 3 because RPL must be equal to CPL.
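The selector arithmetic behind the new restriction can be sketched in C. The helper names below are hypothetical and model only the rule stated above, not the actual emulator code: a selector's low two bits hold its RPL, and a selector whose index and table-indicator bits are all zero is a null selector.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: the RPL is stored in bits 1..0 of a selector. */
unsigned int selector_rpl(uint16_t sel)
{
	return sel & 3;
}

/* Hypothetical helper: a null selector has index 0 and TI 0 (bits 15..2 clear). */
bool selector_is_null(uint16_t sel)
{
	return (sel & ~3u) == 0;
}

/*
 * Hypothetical check: reject a null SS selector whose RPL is 3. Because RPL
 * must equal CPL for an SS load, this also rules out CPL 3. Non-null
 * selectors are not judged here and simply return true.
 */
bool null_ss_rpl_ok(uint16_t sel)
{
	return !selector_is_null(sel) || selector_rpl(sel) < 3;
}
```

For example, 0x0003 is a null selector with RPL 3 and is rejected, while 0x0000 (RPL 0) is allowed.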
Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing
the bug and deciphering the manuals.
Reported-by: Xiaohan Zhang <zhangxiaohan1@huawei.com>
Fixes: 79d5b4c3cd809c770d4bf9812635647016c56011
Cc: stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jike Song [Thu, 12 Jan 2017 08:52:02 +0000 (16:52 +0800)]
capability: export has_capability
has_capability() is sometimes needed by modules to test a capability
for a specified task other than current, so export it.
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Jike Song <jike.song@intel.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Wanpeng Li [Wed, 4 Jan 2017 02:56:19 +0000 (18:56 -0800)]
KVM: x86: fix NULL deref in vcpu_scan_ioapic
Reported by syzkaller:
BUG: unable to handle kernel NULL pointer dereference at 00000000000001b0
IP: _raw_spin_lock+0xc/0x30
PGD 3e28eb067
PUD 3f0ac6067
PMD 0
Oops: 0002 [#1] SMP
CPU: 0 PID: 2431 Comm: test Tainted: G OE 4.10.0-rc1+ #3
Call Trace:
? kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
kvm_arch_vcpu_ioctl_run+0x10a8/0x15f0 [kvm]
? pick_next_task_fair+0xe1/0x4e0
? kvm_arch_vcpu_load+0xea/0x260 [kvm]
kvm_vcpu_ioctl+0x33a/0x600 [kvm]
? hrtimer_try_to_cancel+0x29/0x130
? do_nanosleep+0x97/0xf0
do_vfs_ioctl+0xa1/0x5d0
? __hrtimer_init+0x90/0x90
? do_nanosleep+0x5b/0xf0
SyS_ioctl+0x79/0x90
do_syscall_64+0x6e/0x180
entry_SYSCALL64_slow_path+0x25/0x25
RIP: _raw_spin_lock+0xc/0x30 RSP: ffffa43688973cc0
The syzkaller folks reported a NULL pointer dereference due to
ENABLE_CAP succeeding even without an irqchip. The Hyper-V
synthetic interrupt controller is activated, resulting in a
wrong request to rescan the ioapic and a NULL pointer dereference.
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <linux/kvm.h>
#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#ifndef KVM_CAP_HYPERV_SYNIC
#define KVM_CAP_HYPERV_SYNIC 123
#endif
/* Enable the Hyper-V SynIC on the vCPU, racing with KVM_RUN in main(). */
void *thr(void *arg)
{
	struct kvm_enable_cap cap;
	memset(&cap, 0, sizeof(cap));
	cap.cap = KVM_CAP_HYPERV_SYNIC;
	ioctl((long)arg, KVM_ENABLE_CAP, &cap);
	return 0;
}
int main(void)
{
	/* One page of guest memory whose first byte is HLT (0xf4). */
	unsigned char *host_mem = mmap(0, 0x1000, PROT_READ|PROT_WRITE,
				       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	int kvmfd = open("/dev/kvm", 0);
	int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
	struct kvm_userspace_memory_region memreg;
	memreg.slot = 0;
	memreg.flags = 0;
	memreg.guest_phys_addr = 0;
	memreg.memory_size = 0x1000;
	memreg.userspace_addr = (unsigned long)host_mem;
	host_mem[0] = 0xf4;
	ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &memreg);
	int cpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);
	/* Real mode (CR0.PE clear), executing from guest physical address 0. */
	struct kvm_sregs sregs;
	ioctl(cpufd, KVM_GET_SREGS, &sregs);
	sregs.cr0 = 0;
	sregs.cr4 = 0;
	sregs.efer = 0;
	sregs.cs.selector = 0;
	sregs.cs.base = 0;
	ioctl(cpufd, KVM_SET_SREGS, &sregs);
	struct kvm_regs regs = { .rflags = 2 };
	ioctl(cpufd, KVM_SET_REGS, &regs);
	/* Issued after KVM_CREATE_VCPU, so this is expected to leave the VM without an irqchip. */
	ioctl(vmfd, KVM_CREATE_IRQCHIP, 0);
	pthread_t th;
	pthread_create(&th, 0, thr, (void *)(long)cpufd);
	usleep(rand() % 10000);
	ioctl(cpufd, KVM_RUN, 0);
	pthread_join(th, 0);
	return 0;
}
This patch fixes it by failing ENABLE_CAP when there is no in-kernel irqchip.
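A minimal sketch of the guard, assuming hypothetical vm_state/vcpu_state structures in place of KVM's real ones. It only shows the ordering of the check: the SynIC capability is refused before any SynIC state is activated when the VM has no irqchip.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified VM state; not KVM's real structures. */
struct vm_state {
	bool has_irqchip;	/* true once an in-kernel irqchip exists */
};

struct vcpu_state {
	struct vm_state *vm;
	bool synic_active;
};

/*
 * Hypothetical ENABLE_CAP(KVM_CAP_HYPERV_SYNIC) handler: fail with -EINVAL
 * when there is no irqchip, instead of activating the SynIC and later
 * requesting an ioapic rescan that dereferences missing state.
 */
int enable_cap_synic(struct vcpu_state *vcpu)
{
	if (!vcpu->vm->has_irqchip)
		return -EINVAL;
	vcpu->synic_active = true;
	return 0;
}
```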
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 5c919412fe61 (kvm/x86: Hyper-V synthetic interrupt controller)
Cc: stable@vger.kernel.org # 4.5+
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wanpeng Li [Fri, 6 Jan 2017 01:39:42 +0000 (17:39 -0800)]
KVM: eventfd: fix NULL deref irqbypass consumer
Reported by syzkaller:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: irq_bypass_unregister_consumer+0x9d/0xb70 [irqbypass]
PGD 0
Oops: 0002 [#1] SMP
CPU: 1 PID: 125 Comm: kworker/1:1 Not tainted 4.9.0+ #1
Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm]
task: ffff9bbe0dfbb900 task.stack: ffffb61802014000
RIP: 0010:irq_bypass_unregister_consumer+0x9d/0xb70 [irqbypass]
Call Trace:
irqfd_shutdown+0x66/0xa0 [kvm]
process_one_work+0x16b/0x480
worker_thread+0x4b/0x500
kthread+0x101/0x140
? process_one_work+0x480/0x480
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x25/0x30
RIP: irq_bypass_unregister_consumer+0x9d/0xb70 [irqbypass] RSP: ffffb61802017e20
CR2: 0000000000000008
The syzkaller folks reported a NULL pointer dereference caused by
unregistering a consumer whose registration had previously failed. The
syzkaller program occasionally creates two VMs with the same eventfd, so
the second VM fails to register an irqbypass consumer. It then marks the
irqfd as inactive and queues a work item to shut down the irqfd and
unregister the irqbypass consumer when the eventfd is closed. However, the
second consumer has been initialized even though its registration failed.
So its token (the same as the first VM's) is used to unregister the
consumer from the workqueue; the first VM's consumer is found and
unregistered, and a NULL dereference then occurs in the path that deletes
the consumer from the consumers list.
This patch fixes it by making irq_bypass_register/unregister_consumer()
look up the consumer entry by the consumer pointer itself instead of by
token matching.
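The difference between token matching and pointer matching can be shown with a toy registry. Everything here (struct consumer, register_consumer(), unregister_consumer()) is a hypothetical stand-in for the irqbypass code, assuming a singly linked list.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified consumer; not the irqbypass structures. */
struct consumer {
	void *token;		/* e.g. an eventfd context shared by two VMs */
	struct consumer *next;
};

struct consumer *consumers;	/* registered consumers, newest first */

/* Refuse a duplicate token or a consumer that is already on the list. */
int register_consumer(struct consumer *c)
{
	for (struct consumer *p = consumers; p; p = p->next)
		if (p->token == c->token || p == c)
			return -EBUSY;
	c->next = consumers;
	consumers = c;
	return 0;
}

/*
 * Match on the consumer pointer, not the token: a consumer whose
 * registration failed is simply not found, so another VM's consumer
 * with the same token is left untouched.
 */
void unregister_consumer(struct consumer *c)
{
	for (struct consumer **pp = &consumers; *pp; pp = &(*pp)->next) {
		if (*pp == c) {
			*pp = c->next;
			c->next = NULL;
			return;
		}
	}
}
```

With token matching, unregistering the failed second consumer would have removed the first VM's entry instead.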
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Cc: stable@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Steve Rutherford [Thu, 12 Jan 2017 02:28:29 +0000 (18:28 -0800)]
KVM: x86: Introduce segmented_write_std
Introduce segmented_write_std.
Switch fxsave, fxrstor, sgdt, and sidt from emulated reads/writes to
standard reads/writes. This fixes CVE-2017-2584, a longstanding kernel
memory leak.
Since commit 283c95d0e389 ("KVM: x86: emulate FXSAVE and FXRSTOR",
2016-11-09), which is luckily not yet in any final release, this would
also be an exploitable kernel memory *write*!
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: 96051572c819194c37a8367624b285be10297eca
Fixes: 283c95d0e3891b64087706b344a4b545d04a6e62
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Fri, 16 Dec 2016 22:30:36 +0000 (14:30 -0800)]
KVM: x86: flush pending lapic jump label updates on module unload
KVM's lapic emulation uses static_key_deferred (apic_{hw,sw}_disabled).
These are implemented with delayed_work structs which can still be
pending when the KVM module is unloaded. We've seen this cause kernel
panics when the kvm_intel module is quickly reloaded.
Use the new static_key_deferred_flush() API to flush pending updates on
module unload.
Signed-off-by: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Fri, 16 Dec 2016 22:30:35 +0000 (14:30 -0800)]
jump_labels: API for flushing deferred jump label updates
Modules that use static_key_deferred need a way to synchronize with
any delayed work that is still pending when the module is unloaded.
Introduce static_key_deferred_flush() which flushes any pending
jump label updates.
Signed-off-by: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Josh Poimboeuf [Mon, 9 Jan 2017 18:00:25 +0000 (12:00 -0600)]
x86/entry: Fix the end of the stack for newly forked tasks
When unwinding a task, the end of the stack is always at the same offset
right below the saved pt_regs, regardless of which syscall was used to
enter the kernel. That convention allows the unwinder to verify that a
stack is sane.
However, newly forked tasks don't always follow that convention, as
reported by the following unwinder warning seen by Dave Jones:
WARNING: kernel stack frame pointer at ffffc90001443f30 in kworker/u8:8:30468 has bad value (null)
The warning was due to the following call chain:
(ftrace handler)
call_usermodehelper_exec_async+0x5/0x140
ret_from_fork+0x22/0x30
The problem is that ret_from_fork() doesn't create a stack frame before
calling other functions. Fix that by carefully using the frame pointer
macros.
In addition to conforming to the end of stack convention, this also
makes related stack traces more sensible by making it clear to the user
that ret_from_fork() was involved.
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/8854cdaab980e9700a81e9ebf0d4238e4bbb68ef.1483978430.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Josh Poimboeuf [Mon, 9 Jan 2017 18:00:24 +0000 (12:00 -0600)]
x86/unwind: Include __schedule() in stack traces
In the following commit:
0100301bfdf5 ("sched/x86: Rewrite the switch_to() code")
... the layout of the 'inactive_task_frame' struct was designed to have
a frame pointer header embedded in it, so that the unwinder could use
the 'bp' and 'ret_addr' fields to report __schedule() on the stack (or
ret_from_fork() for newly forked tasks which haven't actually run yet).
Finish the job by changing get_frame_pointer() to return a pointer to
inactive_task_frame's 'bp' field rather than 'bp' itself. This allows
the unwinder to start one frame higher on the stack, so that it properly
reports __schedule().
Reported-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/598e9f7505ed0aba86e8b9590aa528c6c7ae8dcd.1483978430.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Josh Poimboeuf [Mon, 9 Jan 2017 18:00:23 +0000 (12:00 -0600)]
x86/unwind: Disable KASAN checks for non-current tasks
There are a handful of callers to save_stack_trace_tsk() and
show_stack() which try to unwind the stack of a task other than current.
In such cases, it's remotely possible that the task is running on one
CPU while the unwinder is reading its stack from another CPU, causing
the unwinder to see stack corruption.
These cases seem to be mostly harmless. The unwinder has checks which
prevent it from following bad pointers beyond the bounds of the stack.
So it's not really a bug as long as the caller understands that
unwinding another task will not always succeed.
In such cases, it's possible that the unwinder may read a KASAN-poisoned
region of the stack. Account for that by using READ_ONCE_NOCHECK() when
reading the stack of another task.
Use READ_ONCE() when reading the stack of the current task, since KASAN
warnings can still be useful for finding bugs in that case.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/4c575eb288ba9f73d498dfe0acde2f58674598f1.1483978430.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Josh Poimboeuf [Mon, 9 Jan 2017 18:00:22 +0000 (12:00 -0600)]
x86/unwind: Silence warnings for non-current tasks
There are a handful of callers to save_stack_trace_tsk() and
show_stack() which try to unwind the stack of a task other than current.
In such cases, it's remotely possible that the task is running on one
CPU while the unwinder is reading its stack from another CPU, causing
the unwinder to see stack corruption.
These cases seem to be mostly harmless. The unwinder has checks which
prevent it from following bad pointers beyond the bounds of the stack.
So it's not really a bug as long as the caller understands that
unwinding another task will not always succeed.
Since stack "corruption" on another task's stack isn't necessarily a
bug, silence the warnings when unwinding tasks other than current.
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/00d8c50eea3446c1524a2a755397a3966629354c.1483978430.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Brendan McGrath [Fri, 6 Jan 2017 21:01:38 +0000 (08:01 +1100)]
HID: i2c-hid: Add sleep between POWER ON and RESET
Support for the Asus Touchpad was recently added. It turns out this
device can fail initialisation (and become unusable) when the RESET
command is sent too soon after the POWER ON command.
Unfortunately the i2c-hid specification does not specify the need for
a delay between these two commands. But it was discovered the Windows
driver has a 1ms delay.
As a result, this patch modifies the i2c-hid module to add a sleep of
between 1ms and 5ms between the POWER ON and RESET commands.
See https://github.com/vlasenko/hid-asus-dkms/issues/24 for further
details.
Signed-off-by: Brendan McGrath <redmcg@redmandi.dyndns.org>
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Linus Torvalds [Wed, 11 Jan 2017 19:15:15 +0000 (11:15 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
"27 fixes.
There are three patches that aren't actually fixes. They're simple
function renamings which are nice-to-have in mainline as ongoing net
development depends on them."
* akpm: (27 commits)
timerfd: export defines to userspace
mm/hugetlb.c: fix reservation race when freeing surplus pages
mm/slab.c: fix SLAB freelist randomization duplicate entries
zram: support BDI_CAP_STABLE_WRITES
zram: revalidate disk under init_lock
mm: support anonymous stable page
mm: add documentation for page fragment APIs
mm: rename __page_frag functions to __page_frag_cache, drop order from drain
mm: rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
mm, memcg: fix the active list aging for lowmem requests when memcg is enabled
mm: don't dereference struct page fields of invalid pages
mailmap: add codeaurora.org names for nameless email commits
signal: protect SIGNAL_UNKILLABLE from unintentional clearing.
mm: pmd dirty emulation in page fault handler
ipc/sem.c: fix incorrect sem_lock pairing
lib/Kconfig.debug: fix frv build failure
mm: get rid of __GFP_OTHER_NODE
mm: fix remote numa hits statistics
mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}
ocfs2: fix crash caused by stale lvb with fsdlm plugin
...
Dan Carpenter [Sat, 7 Jan 2017 06:30:08 +0000 (09:30 +0300)]
vfio-mdev: remove some dead code
We set info.count to 1 in mtty_get_irq_info() so static checkers
complain that, "Why do we have impossible conditions?" The answer is
that it seems to be left over dead code that can be safely removed.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Dan Carpenter [Sat, 7 Jan 2017 06:28:40 +0000 (09:28 +0300)]
vfio-mdev: buffer overflow in ioctl()
This is a sample driver for documentation so the impact is probably
pretty low. But we should check that bar_index is valid so we
don't write beyond the end of the mdev_state->region_info[] array.
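A sketch of the bounds check, assuming a hypothetical fixed-size region_info array and an unsigned index (a signed index would also need a check for negative values). NUM_REGIONS and the struct names are illustrative only, not the sample driver's definitions.

```c
#include <errno.h>

#define NUM_REGIONS 9	/* hypothetical array size for this sketch */

struct region_info {
	unsigned long long start;
	unsigned long long size;
};

struct mdev_state_sketch {
	struct region_info region_info[NUM_REGIONS];
};

/* Validate the index supplied by userspace before writing the array. */
int set_region_info(struct mdev_state_sketch *st, unsigned int bar_index,
		    unsigned long long start, unsigned long long size)
{
	if (bar_index >= NUM_REGIONS)
		return -EINVAL;
	st->region_info[bar_index].start = start;
	st->region_info[bar_index].size = size;
	return 0;
}
```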
Fixes: 9d1a546c53b4 ("docs: Sample driver to demonstrate how to use Mediated device framework.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Dan Carpenter [Sat, 7 Jan 2017 06:27:49 +0000 (09:27 +0300)]
vfio-mdev: return -EFAULT if copy_to_user() fails
The copy_to_user() function returns the number of bytes which it wasn't
able to copy but we want to return a negative error code.
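To illustrate the return-value convention, the sketch below uses fake_copy_to_user(), a hypothetical user-space stand-in that, like copy_to_user(), returns the number of bytes left uncopied. The fail_after parameter exists only to simulate a fault partway through the copy.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in: returns bytes NOT copied (0 on success). */
size_t fake_copy_to_user(void *dst, const void *src, size_t n, size_t fail_after)
{
	size_t ok = n < fail_after ? n : fail_after;

	memcpy(dst, src, ok);
	return n - ok;
}

/* Buggy pattern: a short copy is reported as a positive byte count. */
long ioctl_get_info_buggy(void *ubuf, const void *info, size_t n, size_t fail_after)
{
	return fake_copy_to_user(ubuf, info, n, fail_after);
}

/* Fixed pattern: any uncopied bytes become -EFAULT. */
long ioctl_get_info(void *ubuf, const void *info, size_t n, size_t fail_after)
{
	if (fake_copy_to_user(ubuf, info, n, fail_after))
		return -EFAULT;
	return 0;
}
```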
Fixes: 9d1a546c53b4 ("docs: Sample driver to demonstrate how to use Mediated device framework.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Takashi Iwai [Wed, 11 Jan 2017 18:49:27 +0000 (19:49 +0100)]
Merge tag 'asoc-fix-v4.10-rc3' of git://git./linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v4.10
As well as the usual smattering of driver specific fixes collected since
the merge window this has one particularly important fix to the core for
handling of aux_devs which was broken during the merge window by some of
the componentization refactoring.
Jan Kara [Wed, 11 Jan 2017 18:20:04 +0000 (10:20 -0800)]
xfs: Timely free truncated dirty pages
Commit 99579ccec4e2 "xfs: skip dirty pages in ->releasepage()" started
to skip dirty pages in xfs_vm_releasepage(), which also has the effect
that if a dirty page is truncated, it does not get freed by
block_invalidatepage() and lingers on the LRU list waiting for reclaim.
So a simple loop like:
while true; do
dd if=/dev/zero of=file bs=1M count=100
rm file
done
will keep using more and more memory until we hit low watermarks and
start pagecache reclaim, which will eventually also reclaim the truncated
pages. Keeping these truncated (and thus never usable) pages in memory
is just a waste of memory, unnecessarily stresses page cache reclaim,
and reportedly also leads to anonymous mmap(2) returning ENOMEM
prematurely.
So instead of just skipping dirty pages in xfs_vm_releasepage(), return
to the old behavior of skipping them only if they have delalloc or
unwritten buffers, and fix the spurious warnings by warning only if the
page is clean.
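The releasepage decision described above can be sketched as a pure function over a hypothetical page summary. The real code inspects the page's buffers and then attempts to free them, which this sketch does not model; struct page_sketch and may_release_page() are illustrative names only.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical page summary; not XFS's real types. */
struct page_sketch {
	bool dirty;
	bool has_delalloc;	/* any delayed-allocation buffers */
	bool has_unwritten;	/* any unwritten-extent buffers */
};

/*
 * Returns true if releasing the page's buffers may be attempted.
 * Pages with delalloc or unwritten buffers are skipped, and a warning
 * is printed only if such a page is clean. Dirty pages without such
 * buffers (e.g. truncated pages) are no longer skipped.
 */
bool may_release_page(const struct page_sketch *p)
{
	if (p->has_delalloc || p->has_unwritten) {
		if (!p->dirty)
			fprintf(stderr, "warning: delalloc/unwritten buffers on clean page\n");
		return false;
	}
	return true;
}
```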
CC: stable@vger.kernel.org
CC: Brian Foster <bfoster@redhat.com>
CC: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Petr Tůma <petr.tuma@d3s.mff.cuni.cz>
Fixes: 99579ccec4e271c3d4d4e7c946058766812afdab
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Linus Torvalds [Wed, 11 Jan 2017 17:52:12 +0000 (09:52 -0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Fix rtlwifi crash, from Larry Finger.
2) Memory disclosure in appletalk ipddp routing code, from Vlad
Tsyrklevich.
3) r8152 can erroneously split an RX packet into multiple URBs if the
Rx FIFO is not empty when we suspend. Fix this by waiting for the
FIFO to empty before suspending. From Hayes Wang.
4) Two GRO fixes (enter slow path when not enough SKB tail room exists,
disable frag0 optimizations when there are IPv6 extension headers)
from Eric Dumazet and Herbert Xu.
5) A series of mlx5e bug fixes (do source UDP port offloading for
tunnels properly, IP fragment matching fixes, handling firmware
errors properly when installing TC rules, etc.) from Saeed Mahameed,
Or Gerlitz, Roi Dayan, Hadar Hen Zion, Gil Rockah, and Daniel
Jurgens.
6) Two VRF fixes from David Ahern (don't skip multipath selection for
VRF paths, disallow VRF to be configured with table ID 0).
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits)
net: vrf: do not allow table id 0
net: phy: marvell: fix Marvell 88E1512 used in SGMII mode
sctp: Fix spelling mistake: "Atempt" -> "Attempt"
net: ipv4: Fix multipath selection with vrf
cgroup: move CONFIG_SOCK_CGROUP_DATA to init/Kconfig
gro: use min_t() in skb_gro_reset_offset()
net/mlx5: Only cancel recovery work when cleaning up device
net/mlx5e: Remove WARN_ONCE from adaptive moderation code
net/mlx5e: Un-register uplink representor on nic_disable
net/mlx5e: Properly handle FW errors while adding TC rules
net/mlx5e: Fix kbuild warnings for uninitialized parameters
net/mlx5e: Set inline mode requirements for matching on IP fragments
net/mlx5e: Properly get address type of encapsulation IP headers
net/mlx5e: TC ipv4 tunnel encap offload error flow fixes
net/mlx5e: Warn when rejecting offload attempts of IP tunnels
net/mlx5e: Properly handle offloading of source udp port for IP tunnels
gro: Disable frag0 optimization on IPv6 ext headers
gro: Enter slow-path if there is no tailroom
mlx4: Return EOPNOTSUPP instead of ENOTSUPP
net/af_iucv: don't use paged skbs for TX on HiperSockets
...
Linus Torvalds [Wed, 11 Jan 2017 17:28:13 +0000 (09:28 -0800)]
Merge branch 'linus' of git://git./linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
"This fixes a regression in aesni that renders it useless if it's
built-in with a modular pcbc configuration"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: aesni - Fix failure when built-in with modular pcbc
Guilherme G. Piccoli [Thu, 29 Dec 2016 00:13:15 +0000 (22:13 -0200)]
nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too
Commit 54adc01055b7 ("nvme/quirk: Add a delay before checking for adapter
readiness") introduced a quirk for adapters that cannot read the bit
NVME_CSTS_RDY right after register NVME_REG_CC is set; these adapters
need a delay, or else reading the bit NVME_CSTS_RDY could somehow
corrupt the adapter's register state, from which it never recovers.
When this quirk was added, we checked ctrl->tagset in order to avoid
applying the quirk at probe time, supposing we would never require such
a delay during probe. Well, that was too optimistic; we in fact need this
quirk at probe time in some cases, like after a kexec.
In some experiments, after an abnormal shutdown of the machine (aka power
cord unplug), we booted into our bootloader on Power, which is a Linux
kernel, and kexec'ed into another distro. If this kexec is too quick, we
end up reaching the probe of the NVMe adapter in that distro while the
adapter is in a bad state (not fully initialized in our bootloader). What
happens next is that nvme_wait_ready() is unable to complete unless the
quirk is enabled.
So, this patch removes the original ctrl->tagset check in order to
enable the quirk at probe time as well.
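The change to the condition can be sketched as follows, assuming a hypothetical controller struct whose tagset pointer is still NULL during probe. The quirk bit value and all names are illustrative, not the nvme driver's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUIRK_DELAY_BEFORE_CHK_RDY (1u << 0)	/* hypothetical bit value */

/* Hypothetical, simplified controller. */
struct ctrl_sketch {
	unsigned int quirks;
	void *tagset;		/* assumed NULL during probe */
};

/* Old condition: the delay was skipped while tagset was still unset. */
bool needs_ready_delay_old(const struct ctrl_sketch *c)
{
	return (c->quirks & QUIRK_DELAY_BEFORE_CHK_RDY) && c->tagset;
}

/* New condition: the quirk alone decides, so probe (e.g. after kexec) is covered. */
bool needs_ready_delay(const struct ctrl_sketch *c)
{
	return c->quirks & QUIRK_DELAY_BEFORE_CHK_RDY;
}
```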
Fixes: 54adc01055b7 ("nvme/quirk: Add a delay before checking for adapter readiness")
Reported-by: Andrew Byrne <byrneadw@ie.ibm.com>
Reported-by: Jaime A. H. Gomez <jahgomez@mx1.ibm.com>
Reported-by: Zachary D. Myers <zdmyers@us.ibm.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Acked-by: Jeffrey Lien <Jeff.Lien@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Tue, 3 Jan 2017 11:29:02 +0000 (14:29 +0300)]
nvme-rdma: fix nvme_rdma_queue_is_ready
Now that we don't abuse the cmd field in struct request for nvme command
passthrough this function needs to be converted to the proper accessor
as well.
Fixes: d49187e97e ("nvme: introduce struct nvme_request")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Mathias Nyman [Wed, 11 Jan 2017 15:10:34 +0000 (17:10 +0200)]
xhci: fix deadlock at host remove by running watchdog correctly
If a URB is killed while the host is removed we can end up in a situation
where the hub thread takes the roothub device lock, and waits for
the URB to be given back by xhci-hcd, blocking the host remove code.
xhci-hcd tries to stop the endpoint and give back the urb, but can't
as the host is removed from PCI bus at the same time, preventing the normal
way of giving back urb.
Instead we need to rely on the stop command timeout function to give back
the urb. This xhci_stop_endpoint_command_watchdog() timeout function
used an XHCI_STATE_DYING flag to indicate if the timeout function is already
running, but later this flag was also taken into use in other places to
mark that xhci is dying.
Remove checks for XHCI_STATE_DYING in xhci_urb_dequeue. We are still
checking that reading from pci state does not return 0xffffffff or that
host is not halted before trying to stop the endpoint.
This whole area of stopping endpoints, giving back URBs, and the watchdog
timeout needs rework; this fix focuses on solving a specific deadlock
issue that we can then send to stable before any major rework.
Cc: <stable@vger.kernel.org>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Colin King [Wed, 11 Jan 2017 11:43:10 +0000 (11:43 +0000)]
perf/x86/intel: Use ULL constant to prevent undefined shift behaviour
When x86_pmu.num_counters is 32, shifting the int constant 1 left by 32
bits exceeds its 32-bit width and is therefore undefined behaviour.
Fix this by shifting 1ULL instead of 1.
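A sketch of the corrected shift, assuming a hypothetical helper that builds an enable mask with the low num_counters bits set. Because 1ULL is at least 64 bits wide, the shift is well defined for any count from 0 to 63.

```c
#include <stdint.h>

/*
 * Hypothetical helper: mask with the low num_counters bits set.
 * Valid for num_counters in 0..63; shifting a 64-bit constant by 32
 * is well defined, unlike shifting the int constant 1 by 32.
 */
uint64_t counters_mask(unsigned int num_counters)
{
	return (1ULL << num_counters) - 1;
}
```

For num_counters == 32 this yields 0xffffffff.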
Reported-by: CoverityScan CID#1192105 ("Bad bit shift operation")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: http://lkml.kernel.org/r/20170111114310.17928-1-colin.king@canonical.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
David Ahern [Tue, 10 Jan 2017 23:22:25 +0000 (15:22 -0800)]
net: vrf: do not allow table id 0
Frank reported that vrf devices can be created with a table id of 0.
This breaks many of the run time table id checks and should not be
allowed. Detect this condition at create time and fail with EINVAL.
Fixes: 193125dbd8eb ("net: Introduce VRF device driver")
Reported-by: Frank Kellermann <frank.kellermann@atos.net>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King [Tue, 10 Jan 2017 23:13:45 +0000 (23:13 +0000)]
net: phy: marvell: fix Marvell 88E1512 used in SGMII mode
When a Marvell 88E1512 PHY is connected to a NIC in SGMII mode, the
fiber page is used for the SGMII host-side connection. The PHY driver
notices that SUPPORTED_FIBRE is set, so it tries reading the fiber page
for the link status, and ends up reading the MAC-side status instead of
that of the outgoing (copper) link. This leads to incorrect results being
reported via ethtool.
If the PHY is connected via SGMII to the host, ignore the fiber page.
However, continue to allow the existing power management code to
suspend and resume the fiber page.
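The link-status rule can be sketched as a small predicate. The enum and read_status_from_fiber() are hypothetical names used only for illustration, not the marvell driver's API, and the sketch does not model the power-management path that still touches the fiber page.

```c
#include <stdbool.h>

/* Hypothetical host interface modes. */
enum iface_sketch { IFACE_RGMII, IFACE_SGMII };

/*
 * Hypothetical: decide whether link status should be read from the
 * fiber page. In SGMII mode the fiber page carries the host-side (MAC)
 * connection, so its status must not be reported as the copper link.
 */
bool read_status_from_fiber(bool supports_fibre, enum iface_sketch iface)
{
	return supports_fibre && iface != IFACE_SGMII;
}
```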
Fixes: 6cfb3bcc0641 ("Marvell phy: check link status in case of fiber link.")
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>