platform/kernel/linux-rpi.git
4 years agofix up iter on short count in fuse_direct_io()
Miklos Szeredi [Thu, 6 Feb 2020 15:39:28 +0000 (16:39 +0100)]
fix up iter on short count in fuse_direct_io()

commit f658adeea45e430a24c7a157c3d5448925ac2038 upstream.

fuse_direct_io() can end up advancing the iterator by more than the amount
of data read or written.  This case is handled by the generic code if going
through ->direct_IO(), but not in the FOPEN_DIRECT_IO case.

Fix by reverting the extra bytes from the iterator in case of error or a
short count.

To test: install lxcfs, then the following testcase
  int fd = open("/var/lib/lxcfs/proc/uptime", O_RDONLY);
  sendfile(1, fd, NULL, 16777216);
  sendfile(1, fd, NULL, 16777216);
will spew WARN_ON() in iov_iter_pipe().

Reported-by: Peter Geis <pgwipeout@gmail.com>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 3c3db095b68c ("fuse: use iov_iter based generic splice helpers")
Cc: <stable@vger.kernel.org> # v5.1
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agovirtio-pci: check name when counting MSI-X vectors
Daniel Verkamp [Fri, 3 Jan 2020 18:40:45 +0000 (10:40 -0800)]
virtio-pci: check name when counting MSI-X vectors

commit 303090b513fd1ee45aa1536b71a3838dc054bc05 upstream.

VQs without a name specified are not valid; they are skipped in the
later loop that assigns MSI-X vectors to queues, but the per_vq_vectors
loop above that counts the required number of vectors previously still
counted any queue with a non-NULL callback as needing a vector.

Add a check to the per_vq_vectors loop so that vectors with no name are
not counted to make the two loops consistent.  This prevents
over-counting unnecessary vectors (e.g. for features which were not
negotiated with the device).

Cc: stable@vger.kernel.org
Fixes: 86a559787e6f ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT")
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Wang, Wei W <wei.w.wang@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agovirtio-balloon: initialize all vq callbacks
Daniel Verkamp [Fri, 3 Jan 2020 18:40:43 +0000 (10:40 -0800)]
virtio-balloon: initialize all vq callbacks

commit 5790b53390e18fdd21e70776e46d058c05eda2f2 upstream.

Ensure that elements of the callbacks array that correspond to
unavailable features are set to NULL; previously, they would be left
uninitialized.

Since the corresponding names array elements were explicitly set to
NULL, the uninitialized callback pointers would not actually be
dereferenced; however, the uninitialized callbacks elements would still
be read in vp_find_vqs_msix() and used to calculate the number of MSI-X
vectors required.

Cc: stable@vger.kernel.org
Fixes: 86a559787e6f ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT")
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agodrm/amd/dm/mst: Ignore payload update failures
Lyude Paul [Fri, 24 Jan 2020 19:10:46 +0000 (14:10 -0500)]
drm/amd/dm/mst: Ignore payload update failures

commit 58fe03d6dec908a1bec07eea7e94907af5c07eec upstream.

Disabling a display on MST can potentially happen after the entire MST
topology has been removed, which means that we can't communicate with
the topology at all in this scenario. Likewise, this also means that we
can't properly update payloads on the topology and as such, it's a good
idea to ignore payload update failures when disabling displays.
Currently, amdgpu makes the mistake of halting the payload update
process when any payload update failures occur, resulting in leaving
DC's local copies of the payload tables out of date.

This ends up causing problems with hotplugging MST topologies, and
causes modesets on the second hotplug to fail like so:

[drm] Failed to updateMST allocation table forpipe idx:1
------------[ cut here ]------------
WARNING: CPU: 5 PID: 1511 at
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_link.c:2677
update_mst_stream_alloc_table+0x11e/0x130 [amdgpu]
Modules linked in: cdc_ether usbnet fuse xt_conntrack nf_conntrack
nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4
nft_counter nft_compat nf_tables nfnetlink tun bridge stp llc sunrpc
vfat fat wmi_bmof uvcvideo snd_hda_codec_realtek snd_hda_codec_generic
snd_hda_codec_hdmi videobuf2_vmalloc snd_hda_intel videobuf2_memops
videobuf2_v4l2 snd_intel_dspcfg videobuf2_common crct10dif_pclmul
snd_hda_codec videodev crc32_pclmul snd_hwdep snd_hda_core
ghash_clmulni_intel snd_seq mc joydev pcspkr snd_seq_device snd_pcm
sp5100_tco k10temp i2c_piix4 snd_timer thinkpad_acpi ledtrig_audio snd
wmi soundcore video i2c_scmi acpi_cpufreq ip_tables amdgpu(O)
rtsx_pci_sdmmc amd_iommu_v2 gpu_sched mmc_core i2c_algo_bit ttm
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm
crc32c_intel serio_raw hid_multitouch r8152 mii nvme r8169 nvme_core
rtsx_pci pinctrl_amd
CPU: 5 PID: 1511 Comm: gnome-shell Tainted: G           O      5.5.0-rc7Lyude-Test+ #4
Hardware name: LENOVO FA495SIT26/FA495SIT26, BIOS R12ET22W(0.22 ) 01/31/2019
RIP: 0010:update_mst_stream_alloc_table+0x11e/0x130 [amdgpu]
Code: 28 00 00 00 75 2b 48 8d 65 e0 5b 41 5c 41 5d 41 5e 5d c3 0f b6 06
49 89 1c 24 41 88 44 24 08 0f b6 46 01 41 88 44 24 09 eb 93 <0f> 0b e9
2f ff ff ff e8 a6 82 a3 c2 66 0f 1f 44 00 00 0f 1f 44 00
RSP: 0018:ffffac428127f5b0 EFLAGS: 00010202
RAX: 0000000000000002 RBX: ffff8d1e166eee80 RCX: 0000000000000000
RDX: ffffac428127f668 RSI: ffff8d1e166eee80 RDI: ffffac428127f610
RBP: ffffac428127f640 R08: ffffffffc03d94a8 R09: 0000000000000000
R10: ffff8d1e24b02000 R11: ffffac428127f5b0 R12: ffff8d1e1b83d000
R13: ffff8d1e1bea0b08 R14: 0000000000000002 R15: 0000000000000002
FS:  00007fab23ffcd80(0000) GS:ffff8d1e28b40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f151f1711e8 CR3: 00000005997c0000 CR4: 00000000003406e0
Call Trace:
 ? mutex_lock+0xe/0x30
 dc_link_allocate_mst_payload+0x9a/0x210 [amdgpu]
 ? dm_read_reg_func+0x39/0xb0 [amdgpu]
 ? core_link_enable_stream+0x656/0x730 [amdgpu]
 core_link_enable_stream+0x656/0x730 [amdgpu]
 dce110_apply_ctx_to_hw+0x58e/0x5d0 [amdgpu]
 ? dcn10_verify_allow_pstate_change_high+0x1d/0x280 [amdgpu]
 ? dcn10_wait_for_mpcc_disconnect+0x3c/0x130 [amdgpu]
 dc_commit_state+0x292/0x770 [amdgpu]
 ? add_timer+0x101/0x1f0
 ? ttm_bo_put+0x1a1/0x2f0 [ttm]
 amdgpu_dm_atomic_commit_tail+0xb59/0x1ff0 [amdgpu]
 ? amdgpu_move_blit.constprop.0+0xb8/0x1f0 [amdgpu]
 ? amdgpu_bo_move+0x16d/0x2b0 [amdgpu]
 ? ttm_bo_handle_move_mem+0x118/0x570 [ttm]
 ? ttm_bo_validate+0x134/0x150 [ttm]
 ? dm_plane_helper_prepare_fb+0x1b9/0x2a0 [amdgpu]
 ? _cond_resched+0x15/0x30
 ? wait_for_completion_timeout+0x38/0x160
 ? _cond_resched+0x15/0x30
 ? wait_for_completion_interruptible+0x33/0x190
 commit_tail+0x94/0x130 [drm_kms_helper]
 drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
 drm_atomic_helper_set_config+0x70/0xb0 [drm_kms_helper]
 drm_mode_setcrtc+0x194/0x6a0 [drm]
 ? _cond_resched+0x15/0x30
 ? mutex_lock+0xe/0x30
 ? drm_mode_getcrtc+0x180/0x180 [drm]
 drm_ioctl_kernel+0xaa/0xf0 [drm]
 drm_ioctl+0x208/0x390 [drm]
 ? drm_mode_getcrtc+0x180/0x180 [drm]
 amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
 do_vfs_ioctl+0x458/0x6d0
 ksys_ioctl+0x5e/0x90
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x55/0x1b0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fab2121f87b
Code: 0f 1e fa 48 8b 05 0d 96 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff
ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 8b 0d dd 95 2c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffd045f9068 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffd045f90a0 RCX: 00007fab2121f87b
RDX: 00007ffd045f90a0 RSI: 00000000c06864a2 RDI: 000000000000000b
RBP: 00007ffd045f90a0 R08: 0000000000000000 R09: 000055dbd2985d10
R10: 000055dbd2196280 R11: 0000000000000246 R12: 00000000c06864a2
R13: 000000000000000b R14: 0000000000000000 R15: 000055dbd2196280
---[ end trace 6ea888c24d2059cd ]---

Note as well, I have only been able to reproduce this on setups with 2
MST displays.

Changes since v1:
* Don't return false when part 1 or part 2 of updating the payloads
  fails, we don't want to abort at any step of the process even if
  things fail

Reviewed-by: Mikita Lipski <Mikita.Lipski@amd.com>
Signed-off-by: Lyude Paul <lyude@redhat.com>
Acked-by: Harry Wentland <harry.wentland@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoclk: tegra: Mark fuse clock as critical
Stephen Warren [Thu, 3 Oct 2019 20:50:30 +0000 (14:50 -0600)]
clk: tegra: Mark fuse clock as critical

commit bf83b96f87ae2abb1e535306ea53608e8de5dfbb upstream.

For a little over a year, U-Boot on Tegra124 has configured the flow
controller to perform automatic RAM re-repair on off->on power
transitions of the CPU rail[1]. This is mandatory for correct operation
of Tegra124. However, RAM re-repair relies on certain clocks, which the
kernel must enable and leave running. The fuse clock is one of those
clocks. Mark this clock as critical so that LP1 power mode (system
suspend) operates correctly.

[1] 3cc7942a4ae5 ARM: tegra: implement RAM repair

Reported-by: Jonathan Hunter <jonathanh@nvidia.com>
Cc: stable@vger.kernel.org
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agomm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush
Peter Zijlstra [Tue, 4 Feb 2020 01:36:49 +0000 (17:36 -0800)]
mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush

commit 0ed1325967ab5f7a4549a2641c6ebe115f76e228 upstream.

Architectures for which we have hardware walkers of Linux page table
should flush TLB on mmu gather batch allocation failures and batch flush.
Some architectures like POWER supports multiple translation modes (hash
and radix) and in the case of POWER only radix translation mode needs the
above TLBI.  This is because for hash translation mode kernel wants to
avoid this extra flush since there are no hardware walkers of linux page
table.  With radix translation, the hardware also walks linux page table
and with that, kernel needs to make sure to TLB invalidate page walk cache
before page table pages are freed.

More details in commit d86564a2f085 ("mm/tlb, x86/mm: Support invalidating
TLB caches for RCU_TABLE_FREE")

The changes to sparc are to make sure we keep the old behavior since we
are now removing HAVE_RCU_TABLE_NO_INVALIDATE.  The default value for
tlb_needs_table_invalidate is to always force an invalidate and sparc can
avoid the table invalidate.  Hence we define tlb_needs_table_invalidate to
false for sparc architecture.

Link: http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.kumar@linux.ibm.com
Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: <stable@vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoarm64: dts: qcom: qcs404-evb: Set vdd_apc regulator in high power mode
Niklas Cassel [Mon, 14 Oct 2019 12:09:20 +0000 (14:09 +0200)]
arm64: dts: qcom: qcs404-evb: Set vdd_apc regulator in high power mode

commit eac8ce86cb90ba96cb4bcbf2549d7a8b6938aa30 upstream.

vdd_apc is the regulator that supplies the main CPU cluster.

At sudden CPU load changes, we have noticed invalid page faults on
addresses with all bits shifted, as well as on addresses with individual
bits flipped.

By putting the vdd_apc regulator in high power mode, the voltage drops
during sudden load changes will be less severe, and we have not been able
to reproduce the invalid page faults with the regulator in this mode.

Fixes: 8faea8edbb35 ("arm64: dts: qcom: qcs404-evb: add spmi regulators")
Cc: stable@vger.kernel.org
Suggested-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Niklas Cassel <niklas.cassel@linaro.org>
Reviewed-by: Vinod Koul <vkoul@kernel.org>
Link: https://lore.kernel.org/r/20191014120920.12691-1-niklas.cassel@linaro.org
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agomm/page_alloc.c: fix uninitialized memmaps on a partially populated last section
David Hildenbrand [Tue, 4 Feb 2020 01:33:48 +0000 (17:33 -0800)]
mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section

commit e822969cab48b786b64246aad1a3ba2a774f5d23 upstream.

Patch series "mm: fix max_pfn not falling on section boundary", v2.

Playing with different memory sizes for a x86-64 guest, I discovered that
some memmaps (highest section if max_mem does not fall on the section
boundary) are marked as being valid and online, but contain garbage.  We
have to properly initialize these memmaps.

Looking at /proc/kpageflags and friends, I found some more issues,
partially related to this.

This patch (of 3):

If max_pfn is not aligned to a section boundary, we can easily run into
BUGs.  This can e.g., be triggered on x86-64 under QEMU by specifying a
memory size that is not a multiple of 128MB (e.g., 4097MB, but also
4160MB).  I was told that on real HW, we can easily have this scenario
(esp., one of the main reasons sub-section hotadd of devmem was added).

The issue is, that we have a valid memmap (pfn_valid()) for the whole
section, and the whole section will be marked "online".
pfn_to_online_page() will succeed, but the memmap contains garbage.

E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m
4160M" - (see tools/vm/page-types.c):

[  200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
[  200.477500] #PF: supervisor read access in kernel mode
[  200.478334] #PF: error_code(0x0000) - not-present page
[  200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
[  200.479557] Oops: 0000 [#4] SMP NOPTI
[  200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G      D W         5.5.0-rc1-next-20191209 #93
[  200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
[  200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
[  200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
[  200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
[  200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
[  200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
[  200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
[  200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
[  200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
[  200.487130] FS:  00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
[  200.487804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
[  200.488897] Call Trace:
[  200.489115]  kpageflags_read+0xe9/0x140
[  200.489447]  proc_reg_read+0x3c/0x60
[  200.489755]  vfs_read+0xc2/0x170
[  200.490037]  ksys_pread64+0x65/0xa0
[  200.490352]  do_syscall_64+0x5c/0xa0
[  200.490665]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

But it can be triggered much easier via "cat /proc/kpageflags > /dev/null"
after cold/hot plugging a DIMM to such a system:

[root@localhost ~]# cat /proc/kpageflags > /dev/null
[  111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
[  111.517907] #PF: supervisor read access in kernel mode
[  111.518333] #PF: error_code(0x0000) - not-present page
[  111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0

This patch fixes that by at least zero-ing out that memmap (so e.g.,
page_to_pfn() will not crash).  Commit 907ec5fca3dc ("mm: zero remaining
unavailable struct pages") tried to fix a similar issue, but forgot to
consider this special case.

After this patch, there are still problems to solve.  E.g., not all of
these pages falling into a memory hole will actually get initialized later
and set PageReserved - they are only zeroed out - but at least the
immediate crashes are gone.  A follow-up patch will take care of this.

Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org> [4.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoocfs2: fix oops when writing cloned file
Gang He [Tue, 4 Feb 2020 01:33:45 +0000 (17:33 -0800)]
ocfs2: fix oops when writing cloned file

commit 2d797e9ff95ecbcf0a83d657928ed20579444857 upstream.

Writing a cloned file triggers a kernel oops and the user-space command
process is also killed by the system.  The bug can be reproduced stably
via:

1) create a file under ocfs2 file system directory.

  journalctl -b > aa.txt

2) create a cloned file for this file.

  reflink aa.txt bb.txt

3) write the cloned file with dd command.

  dd if=/dev/zero of=bb.txt bs=512 count=1 conv=notrunc

The dd command is killed by the kernel, then you can see the oops message
via dmesg command.

[  463.875404] BUG: kernel NULL pointer dereference, address: 0000000000000028
[  463.875413] #PF: supervisor read access in kernel mode
[  463.875416] #PF: error_code(0x0000) - not-present page
[  463.875418] PGD 0 P4D 0
[  463.875425] Oops: 0000 [#1] SMP PTI
[  463.875431] CPU: 1 PID: 2291 Comm: dd Tainted: G           OE     5.3.16-2-default
[  463.875433] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  463.875500] RIP: 0010:ocfs2_refcount_cow+0xa4/0x5d0 [ocfs2]
[  463.875505] Code: 06 89 6c 24 38 89 eb f6 44 24 3c 02 74 be 49 8b 47 28
[  463.875508] RSP: 0018:ffffa2cb409dfce8 EFLAGS: 00010202
[  463.875512] RAX: ffff8b1ebdca8000 RBX: 0000000000000001 RCX: ffff8b1eb73a9df0
[  463.875515] RDX: 0000000000056a01 RSI: 0000000000000000 RDI: 0000000000000000
[  463.875517] RBP: 0000000000000001 R08: ffff8b1eb73a9de0 R09: 0000000000000000
[  463.875520] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[  463.875522] R13: ffff8b1eb922f048 R14: 0000000000000000 R15: ffff8b1eb922f048
[  463.875526] FS:  00007f8f44d15540(0000) GS:ffff8b1ebeb00000(0000) knlGS:0000000000000000
[  463.875529] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  463.875532] CR2: 0000000000000028 CR3: 000000003c17a000 CR4: 00000000000006e0
[  463.875546] Call Trace:
[  463.875596]  ? ocfs2_inode_lock_full_nested+0x18b/0x960 [ocfs2]
[  463.875648]  ocfs2_file_write_iter+0xaf8/0xc70 [ocfs2]
[  463.875672]  new_sync_write+0x12d/0x1d0
[  463.875688]  vfs_write+0xad/0x1a0
[  463.875697]  ksys_write+0xa1/0xe0
[  463.875710]  do_syscall_64+0x60/0x1f0
[  463.875743]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  463.875758] RIP: 0033:0x7f8f4482ed44
[  463.875762] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 80 00 00 00
[  463.875765] RSP: 002b:00007fff300a79d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  463.875769] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8f4482ed44
[  463.875771] RDX: 0000000000000200 RSI: 000055f771b5c000 RDI: 0000000000000001
[  463.875774] RBP: 0000000000000200 R08: 00007f8f44af9c78 R09: 0000000000000003
[  463.875776] R10: 000000000000089f R11: 0000000000000246 R12: 000055f771b5c000
[  463.875779] R13: 0000000000000200 R14: 0000000000000000 R15: 000055f771b5c000

This regression problem was introduced by commit e74540b28556 ("ocfs2:
protect extent tree in ocfs2_prepare_inode_for_write()").

Link: http://lkml.kernel.org/r/20200121050153.13290-1-ghe@suse.com
Fixes: e74540b28556 ("ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()").
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: s390: do not clobber registers during guest reset/store status
Christian Borntraeger [Fri, 31 Jan 2020 10:02:00 +0000 (05:02 -0500)]
KVM: s390: do not clobber registers during guest reset/store status

commit 55680890ea78be0df5e1384989f1be835043c084 upstream.

The initial CPU reset clobbers the userspace fpc and the store status
ioctl clobbers the guest acrs + fpr.  As these calls are only done via
ioctl (and not via vcpu_run), no CPU context is loaded, so we can (and
must) act directly on the sync regs, not on the thread context.

Cc: stable@kernel.org
Fixes: e1788bb995be ("KVM: s390: handle floating point registers in the run ioctl not in vcpu_put/load")
Fixes: 31d8b8d41a7e ("KVM: s390: handle access registers in the run ioctl not in vcpu_put/load")
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20200131100205.74720-2-frankja@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: x86: Revert "KVM: X86: Fix fpu state crash in kvm guest"
Sean Christopherson [Fri, 17 Jan 2020 19:30:51 +0000 (11:30 -0800)]
KVM: x86: Revert "KVM: X86: Fix fpu state crash in kvm guest"

commit 2620fe268e80d667a94553cd37a94ccaa2cb8c83 upstream.

Reload the current thread's FPU state, which contains the guest's FPU
state, to the CPU registers if necessary during vcpu_enter_guest().
TIF_NEED_FPU_LOAD can be set any time control is transferred out of KVM,
e.g. if I/O is triggered during a KVM call to get_user_pages() or if a
softirq occurs while KVM is scheduled in.

Moving the handling of TIF_NEED_FPU_LOAD from vcpu_enter_guest() to
kvm_arch_vcpu_load(), effectively kvm_sched_in(), papered over a bug
where kvm_put_guest_fpu() failed to account for TIF_NEED_FPU_LOAD.  The
easiest way to the kvm_put_guest_fpu() bug was to run with involuntary
preemption enable, thus handling TIF_NEED_FPU_LOAD during kvm_sched_in()
made the bug go away.  But, removing the handling in vcpu_enter_guest()
exposed KVM to the rare case of a softirq triggering kernel_fpu_begin()
between vcpu_load() and vcpu_enter_guest().

Now that kvm_{load,put}_guest_fpu() correctly handle TIF_NEED_FPU_LOAD,
revert the commit to both restore the vcpu_enter_guest() behavior and
eliminate the superfluous switch_fpu_return() in kvm_arch_vcpu_load().

Note, leaving the handling in kvm_arch_vcpu_load() isn't wrong per se,
but it is unnecessary, and most critically, makes it extremely difficult
to find bugs such as the kvm_put_guest_fpu() issue due to shrinking the
window where a softirq can corrupt state.

A sample trace triggered by warning if TIF_NEED_FPU_LOAD is set while
vcpu state is loaded:

 <IRQ>
  gcmaes_crypt_by_sg.constprop.12+0x26e/0x660
  ? 0xffffffffc024547d
  ? __qdisc_run+0x83/0x510
  ? __dev_queue_xmit+0x45e/0x990
  ? ip_finish_output2+0x1a8/0x570
  ? fib4_rule_action+0x61/0x70
  ? fib4_rule_action+0x70/0x70
  ? fib_rules_lookup+0x13f/0x1c0
  ? helper_rfc4106_decrypt+0x82/0xa0
  ? crypto_aead_decrypt+0x40/0x70
  ? crypto_aead_decrypt+0x40/0x70
  ? crypto_aead_decrypt+0x40/0x70
  ? esp_output_tail+0x8f4/0xa5a [esp4]
  ? skb_ext_add+0xd3/0x170
  ? xfrm_input+0x7a6/0x12c0
  ? xfrm4_rcv_encap+0xae/0xd0
  ? xfrm4_transport_finish+0x200/0x200
  ? udp_queue_rcv_one_skb+0x1ba/0x460
  ? udp_unicast_rcv_skb.isra.63+0x72/0x90
  ? __udp4_lib_rcv+0x51b/0xb00
  ? ip_protocol_deliver_rcu+0xd2/0x1c0
  ? ip_local_deliver_finish+0x44/0x50
  ? ip_local_deliver+0xe0/0xf0
  ? ip_protocol_deliver_rcu+0x1c0/0x1c0
  ? ip_rcv+0xbc/0xd0
  ? ip_rcv_finish_core.isra.19+0x380/0x380
  ? __netif_receive_skb_one_core+0x7e/0x90
  ? netif_receive_skb_internal+0x3d/0xb0
  ? napi_gro_receive+0xed/0x150
  ? 0xffffffffc0243c77
  ? net_rx_action+0x149/0x3b0
  ? __do_softirq+0xe4/0x2f8
  ? handle_irq_event_percpu+0x6a/0x80
  ? irq_exit+0xe6/0xf0
  ? do_IRQ+0x7f/0xd0
  ? common_interrupt+0xf/0xf
  </IRQ>
  ? irq_entries_start+0x20/0x660
  ? vmx_get_interrupt_shadow+0x2f0/0x710 [kvm_intel]
  ? kvm_set_msr_common+0xfc7/0x2380 [kvm]
  ? recalibrate_cpu_khz+0x10/0x10
  ? ktime_get+0x3a/0xa0
  ? kvm_arch_vcpu_ioctl_run+0x107/0x560 [kvm]
  ? kvm_init+0x6bf/0xd00 [kvm]
  ? __seccomp_filter+0x7a/0x680
  ? do_vfs_ioctl+0xa4/0x630
  ? security_file_ioctl+0x32/0x50
  ? ksys_ioctl+0x60/0x90
  ? __x64_sys_ioctl+0x16/0x20
  ? do_syscall_64+0x5f/0x1a0
  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace 9564a1ccad733a90 ]---

This reverts commit e751732486eb3f159089a64d1901992b1357e7cc.

Fixes: e751732486eb3 ("KVM: X86: Fix fpu state crash in kvm guest")
Reported-by: Derek Yerger <derek@djy.llc>
Reported-by: kernel@najdan.com
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Thomas Lambertz <mail@thomaslambertz.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: x86: Ensure guest's FPU state is loaded when accessing for emulation
Sean Christopherson [Fri, 17 Jan 2020 19:30:50 +0000 (11:30 -0800)]
KVM: x86: Ensure guest's FPU state is loaded when accessing for emulation

commit a7baead7e312f5a05381d68585fb6dc68e19e90f upstream.

Lock the FPU regs and reload the current thread's FPU state, which holds
the guest's FPU state, to the CPU registers if necessary prior to
accessing guest FPU state as part of emulation.  kernel_fpu_begin() can
be called from softirq context, therefore KVM must ensure softirqs are
disabled (locking the FPU regs disables softirqs) when touching CPU FPU
state.

Note, for all intents and purposes this reverts commit 6ab0b9feb82a7
("x86,kvm: remove KVM emulator get_fpu / put_fpu"), but at the time it
was applied, removing get/put_fpu() was correct.  The re-introduction
of {get,put}_fpu() is necessitated by the deferring of FPU state load.

Fixes: 5f409e20b7945 ("x86/fpu: Defer FPU state load until return to userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: x86: Handle TIF_NEED_FPU_LOAD in kvm_{load,put}_guest_fpu()
Sean Christopherson [Fri, 17 Jan 2020 19:30:49 +0000 (11:30 -0800)]
KVM: x86: Handle TIF_NEED_FPU_LOAD in kvm_{load,put}_guest_fpu()

commit c9aef3b85f425d1f6635382ec210ee5a7ef55d7d upstream.

Handle TIF_NEED_FPU_LOAD similar to how fpu__copy() handles the flag
when duplicating FPU state to a new task struct.  TIF_NEED_FPU_LOAD can
be set any time control is transferred out of KVM, be it voluntarily,
e.g. if I/O is triggered during a KVM call to get_user_pages, or
involuntarily, e.g. if softirq runs after an IRQ occurs.  Therefore,
KVM must account for TIF_NEED_FPU_LOAD whenever it is (potentially)
accessing CPU FPU state.

Fixes: 5f409e20b7945 ("x86/fpu: Defer FPU state load until return to userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: x86: Free wbinvd_dirty_mask if vCPU creation fails
Sean Christopherson [Wed, 18 Dec 2019 21:54:48 +0000 (13:54 -0800)]
KVM: x86: Free wbinvd_dirty_mask if vCPU creation fails

commit 16be9ddea268ad841457a59109963fff8c9de38d upstream.

Free the vCPU's wbinvd_dirty_mask if vCPU creation fails after
kvm_arch_vcpu_init(), e.g. when installing the vCPU's file descriptor.
Do the freeing by calling kvm_arch_vcpu_free() instead of open coding
the freeing.  This adds a likely superfluous, but ultimately harmless,
call to kvmclock_reset(), which only clears vcpu->arch.pv_time_enabled.
Using kvm_arch_vcpu_free() allows for additional cleanup in the future.

Fixes: f5f48ee15c2ee ("KVM: VMX: Execute WBINVD to keep data consistency with assigned devices")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: x86: Don't let userspace set host-reserved cr4 bits
Sean Christopherson [Tue, 10 Dec 2019 22:44:13 +0000 (14:44 -0800)]
KVM: x86: Don't let userspace set host-reserved cr4 bits

commit b11306b53b2540c6ba068c4deddb6a17d9f8d95b upstream.

Calculate the host-reserved cr4 bits at runtime based on the system's
capabilities (using logic similar to __do_cpuid_func()), and use the
dynamically generated mask for the reserved bit check in kvm_set_cr4()
instead using of the static CR4_RESERVED_BITS define.  This prevents
userspace from "enabling" features in cr4 that are not supported by the
system, e.g. by ignoring KVM_GET_SUPPORTED_CPUID and specifying a bogus
CPUID for the vCPU.

Allowing userspace to set unsupported bits in cr4 can lead to a variety
of undesirable behavior, e.g. failed VM-Enter, and in general increases
KVM's attack surface.  A crafty userspace can even abuse CR4.LA57 to
induce an unchecked #GP on a WRMSR.

On a platform without LA57 support:

  KVM_SET_CPUID2 // CPUID_7_0_ECX.LA57 = 1
  KVM_SET_SREGS  // CR4.LA57 = 1
  KVM_SET_MSRS   // KERNEL_GS_BASE = 0x0004000000000000
  KVM_RUN

leads to a #GP when writing KERNEL_GS_BASE into hardware:

  unchecked MSR access error: WRMSR to 0xc0000102 (tried to write 0x0004000000000000)
  at rIP: 0xffffffffa00f239a (vmx_prepare_switch_to_guest+0x10a/0x1d0 [kvm_intel])
  Call Trace:
   kvm_arch_vcpu_ioctl_run+0x671/0x1c70 [kvm]
   kvm_vcpu_ioctl+0x36b/0x5d0 [kvm]
   do_vfs_ioctl+0xa1/0x620
   ksys_ioctl+0x66/0x70
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x4c/0x170
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7fc08133bf47

Note, the above sequence fails VM-Enter due to invalid guest state.
Userspace can allow VM-Enter to succeed (after the WRMSR #GP) by adding
a KVM_SET_SREGS w/ CR4.LA57=0 after KVM_SET_MSRS, in which case KVM will
technically leak the host's KERNEL_GS_BASE into the guest.  But, as
KERNEL_GS_BASE is a userspace-defined value/address, the leak is largely
benign as a malicious userspace would simply be exposing its own data to
the guest, and attacking a benevolent userspace would require multiple
bugs in the userspace VMM.

Cc: stable@vger.kernel.org
Cc: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: VMX: Add non-canonical check on writes to RTIT address MSRs
Sean Christopherson [Tue, 10 Dec 2019 23:24:32 +0000 (15:24 -0800)]
KVM: VMX: Add non-canonical check on writes to RTIT address MSRs

commit fe6ed369fca98e99df55c932b85782a5687526b5 upstream.

Reject writes to RTIT address MSRs if the data being written is a
non-canonical address as the MSRs are subject to canonical checks, e.g.
KVM will trigger an unchecked #GP when loading the values to hardware
during pt_guest_enter().

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agox86/KVM: Clean up host's steal time structure
Boris Ostrovsky [Fri, 6 Dec 2019 15:36:12 +0000 (15:36 +0000)]
x86/KVM: Clean up host's steal time structure

commit a6bd811f1209fe1c64c9f6fd578101d6436c6b6e upstream.

Now that we are mapping kvm_steal_time from the guest directly we
don't need keep a copy of it in kvm_vcpu_arch.st. The same is true
for the stime field.

This is part of CVE-2019-3016.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agox86/kvm: Cache gfn to pfn translation
Boris Ostrovsky [Thu, 5 Dec 2019 01:30:51 +0000 (01:30 +0000)]
x86/kvm: Cache gfn to pfn translation

commit 917248144db5d7320655dbb41d3af0b8a0f3d589 upstream.

__kvm_map_gfn()'s call to gfn_to_pfn_memslot() is
* relatively expensive
* in certain cases (such as when done from atomic context) cannot be called

Stashing gfn-to-pfn mapping should help with both cases.

This is part of CVE-2019-3016.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agox86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed
Boris Ostrovsky [Thu, 5 Dec 2019 03:45:32 +0000 (03:45 +0000)]
x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed

commit b043138246a41064527cf019a3d51d9f015e9796 upstream.

There is a potential race in record_steal_time() between setting
host-local vcpu->arch.st.steal.preempted to zero (i.e. clearing
KVM_VCPU_PREEMPTED) and propagating this value to the guest with
kvm_write_guest_cached(). Between those two events the guest may
still see KVM_VCPU_PREEMPTED in its copy of kvm_steal_time, set
KVM_VCPU_FLUSH_TLB and assume that hypervisor will do the right
thing. Which it won't.

Instad of copying, we should map kvm_steal_time and that will
guarantee atomicity of accesses to @preempted.

This is part of CVE-2019-3016.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agox86/kvm: Introduce kvm_(un)map_gfn()
Boris Ostrovsky [Tue, 12 Nov 2019 16:35:06 +0000 (16:35 +0000)]
x86/kvm: Introduce kvm_(un)map_gfn()

commit 1eff70a9abd46f175defafd29bc17ad456f398a7 upstream.

kvm_vcpu_(un)map operates on gfns from any current address space.
In certain cases we want to make sure we are not mapping SMRAM
and for that we can use kvm_(un)map_gfn() that we are introducing
in this patch.

This is part of CVE-2019-3016.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agox86/kvm: Be careful not to clear KVM_VCPU_FLUSH_TLB bit
Boris Ostrovsky [Wed, 30 Oct 2019 19:01:31 +0000 (19:01 +0000)]
x86/kvm: Be careful not to clear KVM_VCPU_FLUSH_TLB bit

commit 8c6de56a42e0c657955e12b882a81ef07d1d073e upstream.

kvm_steal_time_set_preempted() may accidentally clear KVM_VCPU_FLUSH_TLB
bit if it is called more than once while VCPU is preempted.

This is part of CVE-2019-3016.

(This bug was also independently discovered by Jim Mattson
<jmattson@google.com>)

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agokvm/svm: PKU not currently supported
John Allen [Thu, 19 Dec 2019 20:17:59 +0000 (14:17 -0600)]
kvm/svm: PKU not currently supported

commit a47970ed74a535b1accb4bc73643fd5a93993c3e upstream.

Current SVM implementation does not have support for handling PKU. Guests
running on a host with future AMD cpus that support the feature will read
garbage from the PKRU register and will hit segmentation faults on boot as
memory is getting marked as protected that should not be. Ensure that cpuid
from SVM does not advertise the feature.

Signed-off-by: John Allen <john.allen@amd.com>
Cc: stable@vger.kernel.org
Fixes: 0556cbdc2fbc ("x86/pkeys: Don't check if PKRU is zero before writing it")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: PPC: Book3S PR: Free shared page if mmu initialization fails
Sean Christopherson [Wed, 18 Dec 2019 21:54:47 +0000 (13:54 -0800)]
KVM: PPC: Book3S PR: Free shared page if mmu initialization fails

commit cb10bf9194f4d2c5d830eddca861f7ca0fecdbb4 upstream.

Explicitly free the shared page if kvmppc_mmu_init() fails during
kvmppc_core_vcpu_create(), as the page is freed only in
kvmppc_core_vcpu_free(), which is not reached via kvm_vcpu_uninit().

Fixes: 96bc451a15329 ("KVM: PPC: Introduce shared page")
Cc: stable@vger.kernel.org
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: PPC: Book3S HV: Uninit vCPU if vcore creation fails
Sean Christopherson [Wed, 18 Dec 2019 21:54:46 +0000 (13:54 -0800)]
KVM: PPC: Book3S HV: Uninit vCPU if vcore creation fails

commit 1a978d9d3e72ddfa40ac60d26301b154247ee0bc upstream.

Call kvm_vcpu_uninit() if vcore creation fails to avoid leaking any
resources allocated by kvm_vcpu_init(), i.e. the vcpu->run page.

Fixes: 371fefd6f2dc4 ("KVM: PPC: Allow book3s_hv guests to use SMT processor modes")
Cc: stable@vger.kernel.org
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: x86: Fix potential put_fpu() w/o load_fpu() on MPX platform
Sean Christopherson [Mon, 9 Dec 2019 20:19:31 +0000 (12:19 -0800)]
KVM: x86: Fix potential put_fpu() w/o load_fpu() on MPX platform

commit f958bd2314d117f8c29f4821401bc1925bc2e5ef upstream.

Unlike most state managed by XSAVE, MPX is initialized to zero on INIT.
Because INITs are usually recognized in the context of a VCPU_RUN call,
kvm_vcpu_reset() puts the guest's FPU so that the FPU state is resident
in memory, zeros the MPX state, and reloads FPU state to hardware.  But,
in the unlikely event that an INIT is recognized during
kvm_arch_vcpu_ioctl_get_mpstate() via kvm_apic_accept_events(),
kvm_vcpu_reset() will call kvm_put_guest_fpu() without a preceding
kvm_load_guest_fpu() and corrupt the guest's FPU state (and possibly
userspace's FPU state as well).

Given that MPX is being removed from the kernel[*], fix the bug with the
simple-but-ugly approach of loading the guest's FPU during
KVM_GET_MP_STATE.

[*] See commit f240652b6032b ("x86/mpx: Remove MPX APIs").

Fixes: f775b13eedee2 ("x86,kvm: move qemu/guest FPU switching out to vcpu_run")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: x86: Protect MSR-based index computations in fixed_msr_to_seg_unit() from Spectr...
Marios Pomonis [Wed, 11 Dec 2019 20:47:47 +0000 (12:47 -0800)]
KVM: x86: Protect MSR-based index computations in fixed_msr_to_seg_unit() from Spectre-v1/L1TF attacks

commit 25a5edea71b7c154b6a0b8cec14c711cafa31d26 upstream.

This fixes a Spectre-v1/L1TF vulnerability in fixed_msr_to_seg_unit().
This function contains index computations based on the
(attacker-controlled) MSR number.

Fixes: de9aef5e1ad6 ("KVM: MTRR: introduce fixed_mtrr_segment table")

Signed-off-by: Nick Finco <nifi@google.com>
Signed-off-by: Marios Pomonis <pomonis@google.com>
Reviewed-by: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: x86: Protect x86_decode_insn from Spectre-v1/L1TF attacks
Marios Pomonis [Wed, 11 Dec 2019 20:47:41 +0000 (12:47 -0800)]
KVM: x86: Protect x86_decode_insn from Spectre-v1/L1TF attacks

commit 3c9053a2cae7ba2ba73766a34cea41baa70f57f7 upstream.

This fixes a Spectre-v1/L1TF vulnerability in x86_decode_insn().
kvm_emulate_instruction() (an ancestor of x86_decode_insn()) is an exported
symbol, so KVM should treat it conservatively from a security perspective.

Fixes: 045a282ca415 ("KVM: emulator: implement fninit, fnstsw, fnstcw")

Signed-off-by: Nick Finco <nifi@google.com>
Signed-off-by: Marios Pomonis <pomonis@google.com>
Reviewed-by: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: x86: Protect MSR-based index computations from Spectre-v1/L1TF attacks in x86.c
Marios Pomonis [Wed, 11 Dec 2019 20:47:49 +0000 (12:47 -0800)]
KVM: x86: Protect MSR-based index computations from Spectre-v1/L1TF attacks in x86.c

commit 6ec4c5eee1750d5d17951c4e1960d953376a0dda upstream.

This fixes a Spectre-v1/L1TF vulnerability in set_msr_mce() and
get_msr_mce().
Both functions contain index computations based on the
(attacker-controlled) MSR number.

Fixes: 890ca9aefa78 ("KVM: Add MCE support")

Signed-off-by: Nick Finco <nifi@google.com>
Signed-off-by: Marios Pomonis <pomonis@google.com>
Reviewed-by: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: x86: Protect ioapic_read_indirect() from Spectre-v1/L1TF attacks
Marios Pomonis [Wed, 11 Dec 2019 20:47:44 +0000 (12:47 -0800)]
KVM: x86: Protect ioapic_read_indirect() from Spectre-v1/L1TF attacks

commit 8c86405f606ca8508b8d9280680166ca26723695 upstream.

This fixes a Spectre-v1/L1TF vulnerability in ioapic_read_indirect().
This function contains index computations based on the
(attacker-controlled) IOREGSEL register.

Fixes: a2c118bfab8b ("KVM: Fix bounds checking in ioapic indirect register reads (CVE-2013-1798)")

Signed-off-by: Nick Finco <nifi@google.com>
Signed-off-by: Marios Pomonis <pomonis@google.com>
Reviewed-by: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: x86: Protect MSR-based index computations in pmu.h from Spectre-v1/L1TF attacks
Marios Pomonis [Wed, 11 Dec 2019 20:47:48 +0000 (12:47 -0800)]
KVM: x86: Protect MSR-based index computations in pmu.h from Spectre-v1/L1TF attacks

commit 13c5183a4e643cc2b03a22d0e582c8e17bb7457d upstream.

This fixes a Spectre-v1/L1TF vulnerability in the get_gp_pmc() and
get_fixed_pmc() functions.
They both contain index computations based on the (attacker-controlled)
MSR number.

Fixes: 25462f7f5295 ("KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch")

Signed-off-by: Nick Finco <nifi@google.com>
Signed-off-by: Marios Pomonis <pomonis@google.com>
Reviewed-by: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: x86: Protect ioapic_write_indirect() from Spectre-v1/L1TF attacks
Marios Pomonis [Wed, 11 Dec 2019 20:47:45 +0000 (12:47 -0800)]
KVM: x86: Protect ioapic_write_indirect() from Spectre-v1/L1TF attacks

commit 670564559ca35b439c8d8861fc399451ddf95137 upstream.

This fixes a Spectre-v1/L1TF vulnerability in ioapic_write_indirect().
This function contains index computations based on the
(attacker-controlled) IOREGSEL register.

This patch depends on patch
"KVM: x86: Protect ioapic_read_indirect() from Spectre-v1/L1TF attacks".

Fixes: 70f93dae32ac ("KVM: Use temporary variable to shorten lines.")

Signed-off-by: Nick Finco <nifi@google.com>
Signed-off-by: Marios Pomonis <pomonis@google.com>
Reviewed-by: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: x86: Protect kvm_hv_msr_[get|set]_crash_data() from Spectre-v1/L1TF attacks
Marios Pomonis [Wed, 11 Dec 2019 20:47:42 +0000 (12:47 -0800)]
KVM: x86: Protect kvm_hv_msr_[get|set]_crash_data() from Spectre-v1/L1TF attacks

commit 8618793750071d66028584a83ed0b4fa7eb4f607 upstream.

This fixes Spectre-v1/L1TF vulnerabilities in kvm_hv_msr_get_crash_data()
and kvm_hv_msr_set_crash_data().
These functions contain index computations that use the
(attacker-controlled) MSR number.

Fixes: e7d9513b60e8 ("kvm/x86: added hyper-v crash msrs into kvm hyperv context")

Signed-off-by: Nick Finco <nifi@google.com>
Signed-off-by: Marios Pomonis <pomonis@google.com>
Reviewed-by: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: x86: Protect kvm_lapic_reg_write() from Spectre-v1/L1TF attacks
Marios Pomonis [Wed, 11 Dec 2019 20:47:46 +0000 (12:47 -0800)]
KVM: x86: Protect kvm_lapic_reg_write() from Spectre-v1/L1TF attacks

commit 4bf79cb089f6b1c6c632492c0271054ce52ad766 upstream.

This fixes a Spectre-v1/L1TF vulnerability in kvm_lapic_reg_write().
This function contains index computations based on the
(attacker-controlled) MSR number.

Fixes: 0105d1a52640 ("KVM: x2apic interface to lapic")

Signed-off-by: Nick Finco <nifi@google.com>
Signed-off-by: Marios Pomonis <pomonis@google.com>
Reviewed-by: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: x86: Protect DR-based index computations from Spectre-v1/L1TF attacks
Marios Pomonis [Wed, 11 Dec 2019 20:47:52 +0000 (12:47 -0800)]
KVM: x86: Protect DR-based index computations from Spectre-v1/L1TF attacks

commit ea740059ecb37807ba47b84b33d1447435a8d868 upstream.

This fixes a Spectre-v1/L1TF vulnerability in __kvm_set_dr() and
kvm_get_dr().
Both kvm_get_dr() and kvm_set_dr() (a wrapper of __kvm_set_dr()) are
exported symbols so KVM should tream them conservatively from a security
perspective.

Fixes: 020df0794f57 ("KVM: move DR register access handling into generic code")

Signed-off-by: Nick Finco <nifi@google.com>
Signed-off-by: Marios Pomonis <pomonis@google.com>
Reviewed-by: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: x86: Protect pmu_intel.c from Spectre-v1/L1TF attacks
Marios Pomonis [Wed, 11 Dec 2019 20:47:53 +0000 (12:47 -0800)]
KVM: x86: Protect pmu_intel.c from Spectre-v1/L1TF attacks

commit 66061740f1a487f4ed54fde75e724709f805da53 upstream.

This fixes Spectre-v1/L1TF vulnerabilities in intel_find_fixed_event()
and intel_rdpmc_ecx_to_pmc().
kvm_rdpmc() (ancestor of intel_find_fixed_event()) and
reprogram_fixed_counter() (ancestor of intel_rdpmc_ecx_to_pmc()) are
exported symbols so KVM should treat them conservatively from a security
perspective.

Fixes: 25462f7f5295 ("KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch")

Signed-off-by: Nick Finco <nifi@google.com>
Signed-off-by: Marios Pomonis <pomonis@google.com>
Reviewed-by: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: x86: Refactor prefix decoding to prevent Spectre-v1/L1TF attacks
Marios Pomonis [Wed, 11 Dec 2019 20:47:50 +0000 (12:47 -0800)]
KVM: x86: Refactor prefix decoding to prevent Spectre-v1/L1TF attacks

commit 125ffc5e0a56a3eded608dc51e09d5ebf72cf652 upstream.

This fixes Spectre-v1/L1TF vulnerabilities in
vmx_read_guest_seg_selector(), vmx_read_guest_seg_base(),
vmx_read_guest_seg_limit() and vmx_read_guest_seg_ar().  When
invoked from emulation, these functions contain index computations
based on the (attacker-influenced) segment value.  Using constants
prevents the attack.

Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoKVM: x86: Refactor picdev_write() to prevent Spectre-v1/L1TF attacks
Marios Pomonis [Wed, 11 Dec 2019 20:47:43 +0000 (12:47 -0800)]
KVM: x86: Refactor picdev_write() to prevent Spectre-v1/L1TF attacks

commit 14e32321f3606e4b0970200b6e5e47ee6f1e6410 upstream.

This fixes a Spectre-v1/L1TF vulnerability in picdev_write().
It replaces index computations based on the (attacked-controlled) port
number with constants through a minor refactoring.

Fixes: 85f455f7ddbe ("KVM: Add support for in-kernel PIC emulation")

Signed-off-by: Nick Finco <nifi@google.com>
Signed-off-by: Marios Pomonis <pomonis@google.com>
Reviewed-by: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoaio: prevent potential eventfd recursion on poll
Jens Axboe [Mon, 3 Feb 2020 17:33:42 +0000 (10:33 -0700)]
aio: prevent potential eventfd recursion on poll

commit 01d7a356872eec22ef34a33a5f9cfa917d145468 upstream.

If we have nested or circular eventfd wakeups, then we can deadlock if
we run them inline from our poll waitqueue wakeup handler. It's also
possible to have very long chains of notifications, to the extent where
we could risk blowing the stack.

Check the eventfd recursion count before calling eventfd_signal(). If
it's non-zero, then punt the signaling to async context. This is always
safe, as it takes us out-of-line in terms of stack and locking context.

Cc: stable@vger.kernel.org # 4.19+
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoeventfd: track eventfd_signal() recursion depth
Jens Axboe [Sun, 2 Feb 2020 15:23:03 +0000 (08:23 -0700)]
eventfd: track eventfd_signal() recursion depth

commit b5e683d5cab8cd433b06ae178621f083cabd4f63 upstream.

eventfd use cases from aio and io_uring can deadlock due to circular
or resursive calling, when eventfd_signal() tries to grab the waitqueue
lock. On top of that, it's also possible to construct notification
chains that are deep enough that we could blow the stack.

Add a percpu counter that tracks the percpu recursion depth, warn if we
exceed it. The counter is also exposed so that users of eventfd_signal()
can do the right thing if it's non-zero in the context where it is
called.

Cc: stable@vger.kernel.org # 4.19+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agobcache: add readahead cache policy options via sysfs interface
Coly Li [Sat, 1 Feb 2020 14:42:33 +0000 (22:42 +0800)]
bcache: add readahead cache policy options via sysfs interface

commit 038ba8cc1bffc51250add4a9b9249d4331576d8f upstream.

In year 2007 high performance SSD was still expensive, in order to
save more space for real workload or meta data, the readahead I/Os
for non-meta data was bypassed and not cached on SSD.

In now days, SSD price drops a lot and people can find larger size
SSD with more comfortable price. It is unncessary to alway bypass
normal readahead I/Os to save SSD space for now.

This patch adds options for readahead data cache policies via sysfs
file /sys/block/bcache<N>/readahead_cache_policy, the options are,
- "all": cache all readahead data I/Os.
- "meta-only": only cache meta data, and bypass other regular I/Os.

If users want to make bcache continue to only cache readahead request
for metadata and bypass regular data readahead, please set "meta-only"
to this sysfs file. By default, bcache will back to cache all read-
ahead requests now.

Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Acked-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agowatchdog: fix UAF in reboot notifier handling in watchdog core code
Vladis Dronov [Wed, 8 Jan 2020 12:53:47 +0000 (13:53 +0100)]
watchdog: fix UAF in reboot notifier handling in watchdog core code

commit 69503e585192fdd84b240f18a0873d20e18a2e0a upstream.

After the commit 44ea39420fc9 ("drivers/watchdog: make use of
devm_register_reboot_notifier()") the struct notifier_block reboot_nb in
the struct watchdog_device is removed from the reboot notifiers chain at
the time watchdog's chardev is closed. But at least in i6300esb.c case
reboot_nb is embedded in the struct esb_dev which can be freed on its
device removal and before the chardev is closed, thus UAF at reboot:

[    7.728581] esb_probe: esb_dev.watchdog_device ffff91316f91ab28
ts# uname -r                            note the address ^^^
5.5.0-rc5-ae6088-wdog
ts# ./openwdog0 &
[1] 696
ts# opened /dev/watchdog0, sleeping 10s...
ts# echo 1 > /sys/devices/pci0000\:00/0000\:00\:09.0/remove
[  178.086079] devres:rel_nodes: dev ffff91317668a0b0 data ffff91316f91ab28
           esb_dev.watchdog_device.reboot_nb memory is freed here ^^^
ts# ...woken up
[  181.459010] devres:rel_nodes: dev ffff913171781000 data ffff913174a1dae8
[  181.460195] devm_unreg_reboot_notifier: res ffff913174a1dae8 nb ffff91316f91ab78
                                     attempt to use memory already freed ^^^
[  181.461063] devm_unreg_reboot_notifier: nb->call 6b6b6b6b6b6b6b6b
[  181.461243] devm_unreg_reboot_notifier: nb->next 6b6b6b6b6b6b6b6b
                freed memory is filled with a slub poison ^^^
[1]+  Done                    ./openwdog0
ts# reboot
[  229.921862] systemd-shutdown[1]: Rebooting.
[  229.939265] notifier_call_chain: nb ffffffff9c6c2f20 nb->next ffffffff9c6d50c0
[  229.943080] notifier_call_chain: nb ffffffff9c6d50c0 nb->next 6b6b6b6b6b6b6b6b
[  229.946054] notifier_call_chain: nb 6b6b6b6b6b6b6b6b INVAL
[  229.957584] general protection fault: 0000 [#1] SMP
[  229.958770] CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 5.5.0-rc5-ae6088-wdog
[  229.960224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
[  229.963288] RIP: 0010:notifier_call_chain+0x66/0xd0
[  229.969082] RSP: 0018:ffffb20dc0013d88 EFLAGS: 00010246
[  229.970812] RAX: 000000000000002e RBX: 6b6b6b6b6b6b6b6b RCX: 00000000000008b3
[  229.972929] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffffffff9ccc46ac
[  229.975028] RBP: 0000000000000001 R08: 0000000000000000 R09: 00000000000008b3
[  229.977039] R10: 0000000000000001 R11: ffffffff9c26c740 R12: 0000000000000000
[  229.979155] R13: 6b6b6b6b6b6b6b6b R14: 0000000000000000 R15: 00000000fffffffa
...   slub_debug=FZP poison ^^^
[  229.989089] Call Trace:
[  229.990157]  blocking_notifier_call_chain+0x43/0x59
[  229.991401]  kernel_restart_prepare+0x14/0x30
[  229.992607]  kernel_restart+0x9/0x30
[  229.993800]  __do_sys_reboot+0x1d2/0x210
[  230.000149]  do_syscall_64+0x3d/0x130
[  230.001277]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  230.002639] RIP: 0033:0x7f5461bdd177
[  230.016402] Modules linked in: i6300esb
[  230.050261] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

Fix the crash by reverting 44ea39420fc9 so unregister_reboot_notifier()
is called when watchdog device is removed. This also makes handling of
the reboot notifier unified with the handling of the restart handler,
which is freed with unregister_restart_handler() in the same place.

Fixes: 44ea39420fc9 ("drivers/watchdog: make use of devm_register_reboot_notifier()")
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20200108125347.6067-1-vdronov@redhat.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoxen/balloon: Support xend-based toolstack take two
Juergen Gross [Fri, 17 Jan 2020 13:49:31 +0000 (14:49 +0100)]
xen/balloon: Support xend-based toolstack take two

commit eda4eabf86fd6806eaabc23fb90dd056fdac037b upstream.

Commit 3aa6c19d2f38be ("xen/balloon: Support xend-based toolstack")
tried to fix a regression with running on rather ancient Xen versions.
Unfortunately the fix was based on the assumption that xend would
just use another Xenstore node, but in reality only some downstream
versions of xend are doing that. The upstream xend does not write
that Xenstore node at all, so the problem must be fixed in another
way.

The easiest way to achieve that is to fall back to the behavior
before commit 96edd61dcf4436 ("xen/balloon: don't online new memory
initially") in case the static memory maximum can't be read.

This is achieved by setting static_max to the current number of
memory pages known by the system resulting in target_diff becoming
zero.

Fixes: 3aa6c19d2f38be ("xen/balloon: Support xend-based toolstack")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: <stable@vger.kernel.org> # 4.13
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agotools/kvm_stat: Fix kvm_exit filter name
Gavin Shan [Tue, 10 Dec 2019 04:48:29 +0000 (15:48 +1100)]
tools/kvm_stat: Fix kvm_exit filter name

commit 5fcf3a55a62afb0760ccb6f391d62f20bce4a42f upstream.

The filter name is fixed to "exit_reason" for some kvm_exit events, no
matter what architect we have. Actually, the filter name ("exit_reason")
is only applicable to x86, meaning it's broken on other architects
including aarch64.

This fixes the issue by providing various kvm_exit filter names, depending
on architect we're on. Afterwards, the variable filter name is picked and
applied through ioctl(fd, SET_FILTER).

Reported-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agomedia: rc: ensure lirc is initialized before registering input device
Sean Young [Thu, 21 Nov 2019 10:10:47 +0000 (11:10 +0100)]
media: rc: ensure lirc is initialized before registering input device

commit 080d89f522e2baddb4fbbd1af4b67b5f92537ef8 upstream.

Once rc_open is called on the input device, lirc events can be delivered.
Ensure lirc is ready to do so else we might get this:

Registered IR keymap rc-hauppauge
rc rc0: Hauppauge WinTV PVR-350 as
/devices/pci0000:00/0000:00:1e.0/0000:04:00.0/i2c-0/0-0018/rc/rc0
input: Hauppauge WinTV PVR-350 as
/devices/pci0000:00/0000:00:1e.0/0000:04:00.0/i2c-0/0-0018/rc/rc0/input9
BUG: kernel NULL pointer dereference, address: 0000000000000038
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 1 PID: 17 Comm: kworker/1:0 Not tainted 5.3.11-300.fc31.x86_64 #1
Hardware name:  /DG43NB, BIOS NBG4310H.86A.0096.2009.0903.1845 09/03/2009
Workqueue: events ir_work [ir_kbd_i2c]
RIP: 0010:ir_lirc_scancode_event+0x3d/0xb0
Code: a6 b4 07 00 00 49 81 c6 b8 07 00 00 55 53 e8 ba a7 9d ff 4c 89
e7 49 89 45 00 e8 5e 7a 25 00 49 8b 1e 48 89 c5 4c 39 f3 74 58 <8b> 43
38 8b 53 40 89 c1 2b 4b 3c 39 ca 72 41 21 d0 49 8b 7d 00 49
RSP: 0018:ffffaae2000b3d88 EFLAGS: 00010017
RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000019
RDX: 0000000000000001 RSI: 006e801b1f26ce6a RDI: ffff9e39797c37b4
RBP: 0000000000000002 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000001 R12: ffff9e39797c37b4
R13: ffffaae2000b3db8 R14: ffff9e39797c37b8 R15: ffff9e39797c33d8
FS:  0000000000000000(0000) GS:ffff9e397b680000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000038 CR3: 0000000035844000 CR4: 00000000000006e0
Call Trace:
ir_do_keydown+0x8e/0x2b0
rc_keydown+0x52/0xc0
ir_work+0xb8/0x130 [ir_kbd_i2c]
process_one_work+0x19d/0x340
worker_thread+0x50/0x3b0
kthread+0xfb/0x130
? process_one_work+0x340/0x340
? kthread_park+0x80/0x80
ret_from_fork+0x35/0x40
Modules linked in: rc_hauppauge tuner msp3400 saa7127 saa7115 ivtv(+)
tveeprom cx2341x v4l2_common videodev mc i2c_algo_bit ir_kbd_i2c
ip_tables firewire_ohci e1000e serio_raw firewire_core ata_generic
crc_itu_t pata_acpi pata_jmicron fuse
CR2: 0000000000000038
---[ end trace c67c2697a99fa74b ]---
RIP: 0010:ir_lirc_scancode_event+0x3d/0xb0
Code: a6 b4 07 00 00 49 81 c6 b8 07 00 00 55 53 e8 ba a7 9d ff 4c 89
e7 49 89 45 00 e8 5e 7a 25 00 49 8b 1e 48 89 c5 4c 39 f3 74 58 <8b> 43
38 8b 53 40 89 c1 2b 4b 3c 39 ca 72 41 21 d0 49 8b 7d 00 49
RSP: 0018:ffffaae2000b3d88 EFLAGS: 00010017
RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000019
RDX: 0000000000000001 RSI: 006e801b1f26ce6a RDI: ffff9e39797c37b4
RBP: 0000000000000002 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000001 R12: ffff9e39797c37b4
R13: ffffaae2000b3db8 R14: ffff9e39797c37b8 R15: ffff9e39797c33d8
FS:  0000000000000000(0000) GS:ffff9e397b680000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000038 CR3: 0000000035844000 CR4: 00000000000006e0
rc rc0: lirc_dev: driver ir_kbd_i2c registered at minor = 0, scancode
receiver, no transmitter
tuner-simple 0-0061: creating new instance
tuner-simple 0-0061: type set to 2 (Philips NTSC (FI1236,FM1236 and
compatibles))
ivtv0: Registered device video0 for encoder MPG (4096 kB)
ivtv0: Registered device video32 for encoder YUV (2048 kB)
ivtv0: Registered device vbi0 for encoder VBI (1024 kB)
ivtv0: Registered device video24 for encoder PCM (320 kB)
ivtv0: Registered device radio0 for encoder radio
ivtv0: Registered device video16 for decoder MPG (1024 kB)
ivtv0: Registered device vbi8 for decoder VBI (64 kB)
ivtv0: Registered device vbi16 for decoder VOUT
ivtv0: Registered device video48 for decoder YUV (1024 kB)

Cc: stable@vger.kernel.org
Tested-by: Nick French <nickfrench@gmail.com>
Reported-by: Nick French <nickfrench@gmail.com>
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agomedia: iguanair: fix endpoint sanity check
Johan Hovold [Fri, 3 Jan 2020 16:35:13 +0000 (17:35 +0100)]
media: iguanair: fix endpoint sanity check

commit 1b257870a78b0a9ce98fdfb052c58542022ffb5b upstream.

Make sure to use the current alternate setting, which need not be the
first one by index, when verifying the endpoint descriptors and
initialising the URBs.

Failing to do so could cause the driver to misbehave or trigger a WARN()
in usb_submit_urb() that kernels with panic_on_warn set would choke on.

Fixes: 26ff63137c45 ("[media] Add support for the IguanaWorks USB IR Transceiver")
Fixes: ab1cbdf159be ("media: iguanair: add sanity checks")
Cc: stable <stable@vger.kernel.org> # 3.6
Cc: Oliver Neukum <oneukum@suse.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agodrm/rect: Avoid division by zero
Ville Syrjälä [Fri, 22 Nov 2019 17:56:20 +0000 (19:56 +0200)]
drm/rect: Avoid division by zero

commit 433480c1afd44f3e1e664b85063d98cefeefa0ed upstream.

Check for zero width/height destination rectangle in
drm_rect_clip_scaled() to avoid a division by zero.

Cc: stable@vger.kernel.org
Fixes: f96bdf564f3e ("drm/rect: Handle rounding errors in drm_rect_clip_scaled, v3.")
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Benjamin Gaignard <benjamin.gaignard@st.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Testcase: igt/kms_selftest/drm_rect_clip_scaled_div_by_zero
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191122175623.13565-2-ville.syrjala@linux.intel.com
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Benjamin Gaignard <benjamin.gaignard@st.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agodrm: atmel-hlcdc: prefer a lower pixel-clock than requested
Peter Rosin [Wed, 18 Dec 2019 12:28:28 +0000 (14:28 +0200)]
drm: atmel-hlcdc: prefer a lower pixel-clock than requested

commit 51a19d150b520f6cb42143f3bdffacd3c33d7ac5 upstream.

The intention was to only select a higher pixel-clock rate than the
requested, if a slight overclocking would result in a rate significantly
closer to the requested rate than if the conservative lower pixel-clock
rate is selected. The fixed patch has the logic the other way around and
actually prefers the higher frequency. Fix that.

Signed-off-by: Peter Rosin <peda@axentia.se>
Signed-off-by: Claudiu Beznea <claudiu.beznea@microchip.com>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Fixes: 9946a3a9dbed ("drm/atmel-hlcdc: allow selecting a higher pixel-clock than requested")
Reported-by: Claudiu Beznea <claudiu.beznea@microchip.com>
Tested-by: Claudiu Beznea <claudiu.beznea@microchip.com>
Cc: Boris Brezillon <boris.brezillon@bootlin.com>
Cc: <stable@vger.kernel.org> # v4.20+
Link: https://patchwork.freedesktop.org/patch/msgid/1576672109-22707-6-git-send-email-claudiu.beznea@microchip.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agodrm: atmel-hlcdc: enable clock before configuring timing engine
Claudiu Beznea [Wed, 18 Dec 2019 12:28:25 +0000 (14:28 +0200)]
drm: atmel-hlcdc: enable clock before configuring timing engine

commit 2c1fb9d86f6820abbfaa38a6836157c76ccb4e7b upstream.

Changing pixel clock source without having this clock source enabled
will block the timing engine and the next operations after (in this case
setting ATMEL_HLCDC_CFG(5) settings in atmel_hlcdc_crtc_mode_set_nofb()
will fail). It is recomended (although in datasheet this is not present)
to actually enabled pixel clock source before doing any changes on timing
enginge (only SAM9X60 datasheet specifies that the peripheral clock and
pixel clock must be enabled before using LCD controller).

Fixes: 1a396789f65a ("drm: add Atmel HLCDC Display Controller support")
Signed-off-by: Claudiu Beznea <claudiu.beznea@microchip.com>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
Cc: <stable@vger.kernel.org> # v4.0+
Link: https://patchwork.freedesktop.org/patch/msgid/1576672109-22707-3-git-send-email-claudiu.beznea@microchip.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agodrm: atmel-hlcdc: use double rate for pixel clock only if supported
Claudiu Beznea [Wed, 18 Dec 2019 12:28:24 +0000 (14:28 +0200)]
drm: atmel-hlcdc: use double rate for pixel clock only if supported

commit 07acf4bafe81dd37eff3fbcfbbdbc48084bc202b upstream.

Doubled system clock should be used as pixel cock source only if this
is supported. This is emphasized by the value of
atmel_hlcdc_crtc::dc::desc::fixed_clksrc.

Fixes: a6eca2abdd42 ("drm: atmel-hlcdc: add config option for clock selection")
Signed-off-by: Claudiu Beznea <claudiu.beznea@microchip.com>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Boris Brezillon <bbrezillon@kernel.org>
Cc: <stable@vger.kernel.org> # v5.3+
Link: https://patchwork.freedesktop.org/patch/msgid/1576672109-22707-2-git-send-email-claudiu.beznea@microchip.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agogfs2: fix O_SYNC write handling
Andreas Gruenbacher [Tue, 14 Jan 2020 16:12:18 +0000 (17:12 +0100)]
gfs2: fix O_SYNC write handling

commit 6e5e41e2dc4e4413296d5a4af54ac92d7cd52317 upstream.

In gfs2_file_write_iter, for direct writes, the error checking in the buffered
write fallback case is incomplete.  This can cause inode write errors to go
undetected.  Fix and clean up gfs2_file_write_iter along the way.

Based on a proposed fix by Christoph Hellwig <hch@lst.de>.

Fixes: 967bcc91b044 ("gfs2: iomap direct I/O support")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agogfs2: move setting current->backing_dev_info
Christoph Hellwig [Wed, 15 Jan 2020 15:38:29 +0000 (16:38 +0100)]
gfs2: move setting current->backing_dev_info

commit 4c0e8dda608a51855225c611b5c6b442f95fbc56 upstream.

Set current->backing_dev_info just around the buffered write calls to
prepare for the next fix.

Fixes: 967bcc91b044 ("gfs2: iomap direct I/O support")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agogfs2: fix gfs2_find_jhead that returns uninitialized jhead with seq 0
Abhi Das [Tue, 4 Feb 2020 20:14:56 +0000 (14:14 -0600)]
gfs2: fix gfs2_find_jhead that returns uninitialized jhead with seq 0

commit 7582026f6f3588ecebd281965c8a71aff6fb6158 upstream.

When the first log header in a journal happens to have a sequence
number of 0, a bug in gfs2_find_jhead() causes it to prematurely exit,
and return an uninitialized jhead with seq 0. This can cause failures
in the caller. For instance, a mount fails in one test case.

The correct behavior is for it to continue searching through the journal
to find the correct journal head with the highest sequence number.

Fixes: f4686c26ecc3 ("gfs2: read journal in large chunks")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agosunrpc: expiry_time should be seconds not timeval
Roberto Bergantinos Corpas [Tue, 4 Feb 2020 10:32:56 +0000 (11:32 +0100)]
sunrpc: expiry_time should be seconds not timeval

commit 3d96208c30f84d6edf9ab4fac813306ac0d20c10 upstream.

When upcalling gssproxy, cache_head.expiry_time is set as a
timeval, not seconds since boot. As such, RPC cache expiry
logic will not clean expired objects created under
auth.rpcsec.context cache.

This has proven to cause kernel memory leaks on field. Using
64 bit variants of getboottime/timespec

Expiration times have worked this way since 2010's c5b29f885afe "sunrpc:
use seconds since boot in expiry cache".  The gssproxy code introduced
in 2012 added gss_proxy_save_rsc and introduced the bug.  That's a while
for this to lurk, but it required a bit of an extreme case to make it
obvious.

Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 030d794bf498 "SUNRPC: Use gssproxy upcall for server..."
Tested-By: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agomwifiex: fix unbalanced locking in mwifiex_process_country_ie()
Brian Norris [Mon, 6 Jan 2020 22:42:12 +0000 (14:42 -0800)]
mwifiex: fix unbalanced locking in mwifiex_process_country_ie()

commit 65b1aae0d9d5962faccc06bdb8e91a2a0b09451c upstream.

We called rcu_read_lock(), so we need to call rcu_read_unlock() before
we return.

Fixes: 3d94a4a8373b ("mwifiex: fix possible heap overflow in mwifiex_process_country_ie()")
Cc: stable@vger.kernel.org
Cc: huangwen <huangwenabc@gmail.com>
Cc: Ganapathi Bhat <ganapathi.bhat@nxp.com>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Acked-by: Ganapathi Bhat <ganapathi.bhat@nxp.com>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoiwlwifi: don't throw error when trying to remove IGTK
Luca Coelho [Fri, 31 Jan 2020 13:45:25 +0000 (15:45 +0200)]
iwlwifi: don't throw error when trying to remove IGTK

commit 197288d5ba8a5289f22d3aeb4fca3824bfd9b4af upstream.

The IGTK keys are only removed by mac80211 after it has already
removed the AP station.  This causes the driver to throw an error
because mac80211 is trying to remove the IGTK when the station doesn't
exist anymore.

The firmware is aware that the station has been removed and can deal
with it the next time we try to add an IGTK for a station, so we
shouldn't try to remove the key if the station ID is
IWL_MVM_INVALID_STA.  Do this by removing the check for mvm_sta before
calling iwl_mvm_send_sta_igtk() and check return from that function
gracefully if the station ID is invalid.

Cc: stable@vger.kernel.org # 4.12+
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoARM: tegra: Enable PLLP bypass during Tegra124 LP1
Stephen Warren [Thu, 3 Oct 2019 20:50:31 +0000 (14:50 -0600)]
ARM: tegra: Enable PLLP bypass during Tegra124 LP1

commit 1a3388d506bf5b45bb283e6a4c4706cfb4897333 upstream.

For a little over a year, U-Boot has configured the flow controller to
perform automatic RAM re-repair on off->on power transitions of the CPU
rail[1]. This is mandatory for correct operation of Tegra124. However,
RAM re-repair relies on certain clocks, which the kernel must enable and
leave running. PLLP is one of those clocks. This clock is shut down
during LP1 in order to save power. Enable bypass (which I believe routes
osc_div_clk, essentially the crystal clock, to the PLL output) so that
this clock signal toggles even though the PLL is not active. This is
required so that LP1 power mode (system suspend) operates correctly.

The bypass configuration must then be undone when resuming from LP1, so
that all peripheral clocks run at the expected rate. Without this, many
peripherals won't work correctly; for example, the UART baud rate would
be incorrect.

NVIDIA's downstream kernel code only does this if not compiled for
Tegra30, so the added code is made conditional upon the chip ID.
NVIDIA's downstream code makes this change conditional upon the active
CPU cluster. The upstream kernel currently doesn't support cluster
switching, so this patch doesn't test the active CPU cluster ID.

[1] 3cc7942a4ae5 ARM: tegra: implement RAM repair

Reported-by: Jonathan Hunter <jonathanh@nvidia.com>
Cc: stable@vger.kernel.org
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agobtrfs: Correctly handle empty trees in find_first_clear_extent_bit
Nikolay Borisov [Mon, 27 Jan 2020 09:59:26 +0000 (11:59 +0200)]
btrfs: Correctly handle empty trees in find_first_clear_extent_bit

commit 5750c37523a2c8cbb450b9ef31e21c2ba876b05e upstream.

Raviu reported that running his regular fs_trim segfaulted with the
following backtrace:

[  237.525947] assertion failed: prev, in ../fs/btrfs/extent_io.c:1595
[  237.525984] ------------[ cut here ]------------
[  237.525985] kernel BUG at ../fs/btrfs/ctree.h:3117!
[  237.525992] invalid opcode: 0000 [#1] SMP PTI
[  237.525998] CPU: 4 PID: 4423 Comm: fstrim Tainted: G     U     OE     5.4.14-8-vanilla #1
[  237.526001] Hardware name: ASUSTeK COMPUTER INC.
[  237.526044] RIP: 0010:assfail.constprop.58+0x18/0x1a [btrfs]
[  237.526079] Call Trace:
[  237.526120]  find_first_clear_extent_bit+0x13d/0x150 [btrfs]
[  237.526148]  btrfs_trim_fs+0x211/0x3f0 [btrfs]
[  237.526184]  btrfs_ioctl_fitrim+0x103/0x170 [btrfs]
[  237.526219]  btrfs_ioctl+0x129a/0x2ed0 [btrfs]
[  237.526227]  ? filemap_map_pages+0x190/0x3d0
[  237.526232]  ? do_filp_open+0xaf/0x110
[  237.526238]  ? _copy_to_user+0x22/0x30
[  237.526242]  ? cp_new_stat+0x150/0x180
[  237.526247]  ? do_vfs_ioctl+0xa4/0x640
[  237.526278]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[  237.526283]  do_vfs_ioctl+0xa4/0x640
[  237.526288]  ? __do_sys_newfstat+0x3c/0x60
[  237.526292]  ksys_ioctl+0x70/0x80
[  237.526297]  __x64_sys_ioctl+0x16/0x20
[  237.526303]  do_syscall_64+0x5a/0x1c0
[  237.526310]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

That was due to btrfs_fs_device::aloc_tree being empty. Initially I
thought this wasn't possible and as a percaution have put the assert in
find_first_clear_extent_bit. Turns out this is indeed possible and could
happen when a file system with SINGLE data/metadata profile has a 2nd
device added. Until balance is run or a new chunk is allocated on this
device it will be completely empty.

In this case find_first_clear_extent_bit should return the full range
[0, -1ULL] and let the caller handle this i.e for trim the end will be
capped at the size of actual device.

Link: https://lore.kernel.org/linux-btrfs/izW2WNyvy1dEDweBICizKnd2KDwDiDyY2EYQr4YCwk7pkuIpthx-JRn65MPBde00ND6V0_Lh8mW0kZwzDiLDv25pUYWxkskWNJnVP0kgdMA=@protonmail.com/
Fixes: 45bfcfc168f8 ("btrfs: Implement find_first_clear_extent_bit")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agobtrfs: flush write bio if we loop in extent_write_cache_pages
Josef Bacik [Thu, 23 Jan 2020 20:33:02 +0000 (15:33 -0500)]
btrfs: flush write bio if we loop in extent_write_cache_pages

commit 42ffb0bf584ae5b6b38f72259af1e0ee417ac77f upstream.

There exists a deadlock with range_cyclic that has existed forever.  If
we loop around with a bio already built we could deadlock with a writer
who has the page locked that we're attempting to write but is waiting on
a page in our bio to be written out.  The task traces are as follows

  PID: 1329874  TASK: ffff889ebcdf3800  CPU: 33  COMMAND: "kworker/u113:5"
   #0 [ffffc900297bb658] __schedule at ffffffff81a4c33f
   #1 [ffffc900297bb6e0] schedule at ffffffff81a4c6e3
   #2 [ffffc900297bb6f8] io_schedule at ffffffff81a4ca42
   #3 [ffffc900297bb708] __lock_page at ffffffff811f145b
   #4 [ffffc900297bb798] __process_pages_contig at ffffffff814bc502
   #5 [ffffc900297bb8c8] lock_delalloc_pages at ffffffff814bc684
   #6 [ffffc900297bb900] find_lock_delalloc_range at ffffffff814be9ff
   #7 [ffffc900297bb9a0] writepage_delalloc at ffffffff814bebd0
   #8 [ffffc900297bba18] __extent_writepage at ffffffff814bfbf2
   #9 [ffffc900297bba98] extent_write_cache_pages at ffffffff814bffbd

  PID: 2167901  TASK: ffff889dc6a59c00  CPU: 14  COMMAND:
  "aio-dio-invalid"
   #0 [ffffc9003b50bb18] __schedule at ffffffff81a4c33f
   #1 [ffffc9003b50bba0] schedule at ffffffff81a4c6e3
   #2 [ffffc9003b50bbb8] io_schedule at ffffffff81a4ca42
   #3 [ffffc9003b50bbc8] wait_on_page_bit at ffffffff811f24d6
   #4 [ffffc9003b50bc60] prepare_pages at ffffffff814b05a7
   #5 [ffffc9003b50bcd8] btrfs_buffered_write at ffffffff814b1359
   #6 [ffffc9003b50bdb0] btrfs_file_write_iter at ffffffff814b5933
   #7 [ffffc9003b50be38] new_sync_write at ffffffff8128f6a8
   #8 [ffffc9003b50bec8] vfs_write at ffffffff81292b9d
   #9 [ffffc9003b50bf00] ksys_pwrite64 at ffffffff81293032

I used drgn to find the respective pages we were stuck on

page_entry.page 0xffffea00fbfc7500 index 8148 bit 15 pid 2167901
page_entry.page 0xffffea00f9bb7400 index 7680 bit 0 pid 1329874

As you can see the kworker is waiting for bit 0 (PG_locked) on index
7680, and aio-dio-invalid is waiting for bit 15 (PG_writeback) on index
8148.  aio-dio-invalid has 7680, and the kworker epd looks like the
following

  crash> struct extent_page_data ffffc900297bbbb0
  struct extent_page_data {
    bio = 0xffff889f747ed830,
    tree = 0xffff889eed6ba448,
    extent_locked = 0,
    sync_io = 0
  }

Probably worth mentioning as well that it waits for writeback of the
page to complete while holding a lock on it (at prepare_pages()).

Using drgn I walked the bio pages looking for page
0xffffea00fbfc7500 which is the one we're waiting for writeback on

  bio = Object(prog, 'struct bio', address=0xffff889f747ed830)
  for i in range(0, bio.bi_vcnt.value_()):
      bv = bio.bi_io_vec[i]
      if bv.bv_page.value_() == 0xffffea00fbfc7500:
  print("FOUND IT")

which validated what I suspected.

The fix for this is simple, flush the epd before we loop back around to
the beginning of the file during writeout.

Fixes: b293f02e1423 ("Btrfs: Add writepages support")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoBtrfs: fix race between adding and putting tree mod seq elements and nodes
Filipe Manana [Wed, 22 Jan 2020 12:23:20 +0000 (12:23 +0000)]
Btrfs: fix race between adding and putting tree mod seq elements and nodes

commit 7227ff4de55d931bbdc156c8ef0ce4f100c78a5b upstream.

There is a race between adding and removing elements to the tree mod log
list and rbtree that can lead to use-after-free problems.

Consider the following example that explains how/why the problems happens:

1) Task A has mod log element with sequence number 200. It currently is
   the only element in the mod log list;

2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to
   access the tree mod log. When it enters the function, it initializes
   'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock'
   before checking if there are other elements in the mod seq list.
   Since the list it empty, 'min_seq' remains set to (u64)-1. Then it
   unlocks the lock 'tree_mod_seq_lock';

3) Before task A acquires the lock 'tree_mod_log_lock', task B adds
   itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a
   sequence number of 201;

4) Some other task, name it task C, modifies a btree and because there
   elements in the mod seq list, it adds a tree mod elem to the tree
   mod log rbtree. That node added to the mod log rbtree is assigned
   a sequence number of 202;

5) Task B, which is doing fiemap and resolving indirect back references,
   calls btrfs get_old_root(), with 'time_seq' == 201, which in turn
   calls tree_mod_log_search() - the search returns the mod log node
   from the rbtree with sequence number 202, created by task C;

6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating
   the mod log rbtree and finds the node with sequence number 202. Since
   202 is less than the previously computed 'min_seq', (u64)-1, it
   removes the node and frees it;

7) Task B still has a pointer to the node with sequence number 202, and
   it dereferences the pointer itself and through the call to
   __tree_mod_log_rewind(), resulting in a use-after-free problem.

This issue can be triggered sporadically with the test case generic/561
from fstests, and it happens more frequently with a higher number of
duperemove processes. When it happens to me, it either freezes the VM or
it produces a trace like the following before crashing:

  [ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  [ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1
  [ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
  [ 1245.321287] RIP: 0010:rb_next+0x16/0x50
  [ 1245.321307] Code: ....
  [ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202
  [ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b
  [ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80
  [ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000
  [ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038
  [ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8
  [ 1245.321539] FS:  00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000
  [ 1245.321591] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0
  [ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [ 1245.321706] Call Trace:
  [ 1245.321798]  __tree_mod_log_rewind+0xbf/0x280 [btrfs]
  [ 1245.321841]  btrfs_search_old_slot+0x105/0xd00 [btrfs]
  [ 1245.321877]  resolve_indirect_refs+0x1eb/0xc60 [btrfs]
  [ 1245.321912]  find_parent_nodes+0x3dc/0x11b0 [btrfs]
  [ 1245.321947]  btrfs_check_shared+0x115/0x1c0 [btrfs]
  [ 1245.321980]  ? extent_fiemap+0x59d/0x6d0 [btrfs]
  [ 1245.322029]  extent_fiemap+0x59d/0x6d0 [btrfs]
  [ 1245.322066]  do_vfs_ioctl+0x45a/0x750
  [ 1245.322081]  ksys_ioctl+0x70/0x80
  [ 1245.322092]  ? trace_hardirqs_off_thunk+0x1a/0x1c
  [ 1245.322113]  __x64_sys_ioctl+0x16/0x20
  [ 1245.322126]  do_syscall_64+0x5c/0x280
  [ 1245.322139]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [ 1245.322155] RIP: 0033:0x7fdee3942dd7
  [ 1245.322177] Code: ....
  [ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7
  [ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004
  [ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44
  [ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48
  [ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50
  [ 1245.322423] Modules linked in: ....
  [ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]---

Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum
sequence number and iterates the rbtree while holding the lock
'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock'
lock, since it is now redundant.

Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions")
Fixes: 097b8a7c9e48e2 ("Btrfs: join tree mod log code with the code holding back delayed refs")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agobtrfs: drop log root for dropped roots
Josef Bacik [Fri, 17 Jan 2020 14:12:45 +0000 (09:12 -0500)]
btrfs: drop log root for dropped roots

commit 889bfa39086e86b52fcfaa04d72c95eaeb12f9a5 upstream.

If we fsync on a subvolume and create a log root for that volume, and
then later delete that subvolume we'll never clean up its log root.  Fix
this by making switch_commit_roots free the log for any dropped roots we
encounter.  The extra churn is because we need a btrfs_trans_handle, not
the btrfs_transaction.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agobtrfs: set trans->drity in btrfs_commit_transaction
Josef Bacik [Fri, 17 Jan 2020 13:57:51 +0000 (08:57 -0500)]
btrfs: set trans->drity in btrfs_commit_transaction

commit d62b23c94952e78211a383b7d90ef0afbd9a3717 upstream.

If we abort a transaction we have the following sequence

if (!trans->dirty && list_empty(&trans->new_bgs))
return;
WRITE_ONCE(trans->transaction->aborted, err);

The idea being if we didn't modify anything with our trans handle then
we don't really need to abort the whole transaction, maybe the other
trans handles are fine and we can carry on.

However in the case of create_snapshot we add a pending_snapshot object
to our transaction and then commit the transaction.  We don't actually
modify anything.  sync() behaves the same way, attach to an existing
transaction and commit it.  This means that if we have an IO error in
the right places we could abort the committing transaction with our
trans->dirty being not set and thus not set transaction->aborted.

This is a problem because in the create_snapshot() case we depend on
pending->error being set to something, or btrfs_commit_transaction
returning an error.

If we are not the trans handle that gets to commit the transaction, and
we're waiting on the commit to happen we get our return value from
cur_trans->aborted.  If this was not set to anything because sync() hit
an error in the transaction commit before it could modify anything then
cur_trans->aborted would be 0.  Thus we'd return 0 from
btrfs_commit_transaction() in create_snapshot.

This is a problem because we then try to do things with
pending_snapshot->snap, which will be NULL because we didn't create the
snapshot, and then we'll get a NULL pointer dereference like the
following

"BUG: kernel NULL pointer dereference, address: 00000000000001f0"
RIP: 0010:btrfs_orphan_cleanup+0x2d/0x330
Call Trace:
 ? btrfs_mksubvol.isra.31+0x3f2/0x510
 btrfs_mksubvol.isra.31+0x4bc/0x510
 ? __sb_start_write+0xfa/0x200
 ? mnt_want_write_file+0x24/0x50
 btrfs_ioctl_snap_create_transid+0x16c/0x1a0
 btrfs_ioctl_snap_create_v2+0x11e/0x1a0
 btrfs_ioctl+0x1534/0x2c10
 ? free_debug_processing+0x262/0x2a3
 do_vfs_ioctl+0xa6/0x6b0
 ? do_sys_open+0x188/0x220
 ? syscall_trace_enter+0x1f8/0x330
 ksys_ioctl+0x60/0x90
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x4a/0x1b0

In order to fix this we need to make sure anybody who calls
commit_transaction has trans->dirty set so that they properly set the
trans->transaction->aborted value properly so any waiters know bad
things happened.

This was found while I was running generic/475 with my modified
fsstress, it reproduced within a few runs.  I ran with this patch all
night and didn't see the problem again.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoBtrfs: fix infinite loop during fsync after rename operations
Filipe Manana [Wed, 15 Jan 2020 13:21:35 +0000 (13:21 +0000)]
Btrfs: fix infinite loop during fsync after rename operations

commit b5e4ff9d465da1233a2d9a47ebce487c70d8f4ab upstream.

Recently fsstress (from fstests) sporadically started to trigger an
infinite loop during fsync operations. This turned out to be because
support for the rename exchange and whiteout operations was added to
fsstress in fstests. These operations, unlike any others in fsstress,
cause file names to be reused, whence triggering this issue. However
it's not necessary to use rename exchange and rename whiteout operations
trigger this issue, simple rename operations and file creations are
enough to trigger the issue.

The issue boils down to when we are logging inodes that conflict (that
had the name of any inode we need to log during the fsync operation), we
keep logging them even if they were already logged before, and after
that we check if there's any other inode that conflicts with them and
then add it again to the list of inodes to log. Skipping already logged
inodes fixes the issue.

Consider the following example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/testdir                           # inode 257

  $ touch /mnt/testdir/zz                        # inode 258
  $ ln /mnt/testdir/zz /mnt/testdir/zz_link

  $ touch /mnt/testdir/a                         # inode 259

  $ sync

  # The following 3 renames achieve the same result as a rename exchange
  # operation (<rename_exchange> /mnt/testdir/zz_link to /mnt/testdir/a).

  $ mv /mnt/testdir/a /mnt/testdir/a/tmp
  $ mv /mnt/testdir/zz_link /mnt/testdir/a
  $ mv /mnt/testdir/a/tmp /mnt/testdir/zz_link

  # The following rename and file creation give the same result as a
  # rename whiteout operation (<rename_whiteout> zz to a2).

  $ mv /mnt/testdir/zz /mnt/testdir/a2
  $ touch /mnt/testdir/zz                        # inode 260

  $ xfs_io -c fsync /mnt/testdir/zz
    --> results in the infinite loop

The following steps happen:

1) When logging inode 260, we find that its reference named "zz" was
   used by inode 258 in the previous transaction (through the commit
   root), so inode 258 is added to the list of conflicting indoes that
   need to be logged;

2) After logging inode 258, we find that its reference named "a" was
   used by inode 259 in the previous transaction, and therefore we add
   inode 259 to the list of conflicting inodes to be logged;

3) After logging inode 259, we find that its reference named "zz_link"
   was used by inode 258 in the previous transaction - we add inode 258
   to the list of conflicting inodes to log, again - we had already
   logged it before at step 3. After logging it again, we find again
   that inode 259 conflicts with him, and we add again 259 to the list,
   etc - we end up repeating all the previous steps.

So fix this by skipping logging of conflicting inodes that were already
logged.

Fixes: 6b5fc433a7ad67 ("Btrfs: fix fsync after succession of renames of different files")
CC: stable@vger.kernel.org # 5.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoBtrfs: make deduplication with range including the last block work
Filipe Manana [Mon, 16 Dec 2019 18:26:56 +0000 (18:26 +0000)]
Btrfs: make deduplication with range including the last block work

commit 831d2fa25ab8e27592b1b0268dae6f2dfaf7cc43 upstream.

Since btrfs was migrated to use the generic VFS helpers for clone and
deduplication, it stopped allowing for the last block of a file to be
deduplicated when the source file size is not sector size aligned (when
eof is somewhere in the middle of the last block). There are two reasons
for that:

1) The generic code always rounds down, to a multiple of the block size,
   the range's length for deduplications. This means we end up never
   deduplicating the last block when the eof is not block size aligned,
   even for the safe case where the destination range's end offset matches
   the destination file's size. That rounding down operation is done at
   generic_remap_check_len();

2) Because of that, the btrfs specific code does not expect anymore any
   non-aligned range length's for deduplication and therefore does not
   work if such nona-aligned length is given.

This patch addresses that second part, and it depends on a patch that
fixes generic_remap_check_len(), in the VFS, which was submitted ealier
and has the following subject:

  "fs: allow deduplication of eof block into the end of the destination file"

These two patches address reports from users that started seeing lower
deduplication rates due to the last block never being deduplicated when
the file size is not aligned to the filesystem's block size.

Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoBtrfs: fix missing hole after hole punching and fsync when using NO_HOLES
Filipe Manana [Tue, 19 Nov 2019 12:07:33 +0000 (12:07 +0000)]
Btrfs: fix missing hole after hole punching and fsync when using NO_HOLES

commit 0e56315ca147b3e60c7bf240233a301d3c7fb508 upstream.

When using the NO_HOLES feature, if we punch a hole into a file and then
fsync it, there are cases where a subsequent fsync will miss the fact that
a hole was punched, resulting in the holes not existing after replaying
the log tree.

Essentially these cases all imply that, tree-log.c:copy_items(), is not
invoked for the leafs that delimit holes, because nothing changed those
leafs in the current transaction. And it's precisely copy_items() where
we currenly detect and log holes, which works as long as the holes are
between file extent items in the input leaf or between the beginning of
input leaf and the previous leaf or between the last item in the leaf
and the next leaf.

First example where we miss a hole:

  *) The extent items of the inode span multiple leafs;

  *) The punched hole covers a range that affects only the extent items of
     the first leaf;

  *) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC
     is set in the inode's runtime flags).

  That results in the hole not existing after replaying the log tree.

  For example, if the fs/subvolume tree has the following layout for a
  particular inode:

      Leaf N, generation 10:

      [ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ]

      Leaf N + 1, generation 10:

      [ EXTENT_ITEM (128K 64K) ... ]

  If at transaction 11 we punch a hole coverting the range [0, 128K[, we end
  up dropping the two extent items from leaf N, but we don't touch the other
  leaf, so we end up in the following state:

      Leaf N, generation 11:

      [ ... INODE_ITEM INODE_REF ]

      Leaf N + 1, generation 10:

      [ EXTENT_ITEM (128K 64K) ... ]

  A full fsync after punching the hole will only process leaf N because it
  was modified in the current transaction, but not leaf N + 1, since it
  was not modified in the current transaction (generation 10 and not 11).
  As a result the fsync will not log any holes, because it didn't process
  any leaf with extent items.

Second example where we will miss a hole:

  *) An inode as its items spanning 5 (or more) leafs;

  *) A hole is punched and it covers only the extents items of the 3rd
     leaf. This resulsts in deleting the entire leaf and not touching any
     of the other leafs.

  So the only leaf that is modified in the current transaction, when
  punching the hole, is the first leaf, which contains the inode item.
  During the full fsync, the only leaf that is passed to copy_items()
  is that first leaf, and that's not enough for the hole detection
  code in copy_items() to determine there's a hole between the last
  file extent item in the 2nd leaf and the first file extent item in
  the 3rd leaf (which was the 4th leaf before punching the hole).

Fix this by scanning all leafs and punch holes as necessary when doing a
full fsync (less common than a non-full fsync) when the NO_HOLES feature
is enabled. The lack of explicit file extent items to mark holes makes it
necessary to scan existing extents to determine if holes exist.

A test case for fstests follows soon.

Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoext4: fix race conditions in ->d_compare() and ->d_hash()
Eric Biggers [Fri, 24 Jan 2020 04:12:34 +0000 (20:12 -0800)]
ext4: fix race conditions in ->d_compare() and ->d_hash()

commit ec772f01307a2c06ebf6cdd221e6b518a71ddae7 upstream.

Since ->d_compare() and ->d_hash() can be called in RCU-walk mode,
->d_parent and ->d_inode can be concurrently modified, and in
particular, ->d_inode may be changed to NULL.  For ext4_d_hash() this
resulted in a reproducible NULL dereference if a lookup is done in a
directory being deleted, e.g. with:

int main()
{
if (fork()) {
for (;;) {
mkdir("subdir", 0700);
rmdir("subdir");
}
} else {
for (;;)
access("subdir/file", 0);
}
}

... or by running the 't_encrypted_d_revalidate' program from xfstests.
Both repros work in any directory on a filesystem with the encoding
feature, even if the directory doesn't actually have the casefold flag.

I couldn't reproduce a crash in ext4_d_compare(), but it appears that a
similar crash is possible there.

Fix these bugs by reading ->d_parent and ->d_inode using READ_ONCE() and
falling back to the case sensitive behavior if the inode is NULL.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: b886ee3e778e ("ext4: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.2+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20200124041234.159740-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoext4: fix deadlock allocating crypto bounce page from mempool
Eric Biggers [Tue, 31 Dec 2019 18:11:49 +0000 (12:11 -0600)]
ext4: fix deadlock allocating crypto bounce page from mempool

commit 547c556f4db7c09447ecf5f833ab6aaae0c5ab58 upstream.

ext4_writepages() on an encrypted file has to encrypt the data, but it
can't modify the pagecache pages in-place, so it encrypts the data into
bounce pages and writes those instead.  All bounce pages are allocated
from a mempool using GFP_NOFS.

This is not correct use of a mempool, and it can deadlock.  This is
because GFP_NOFS includes __GFP_DIRECT_RECLAIM, which enables the "never
fail" mode for mempool_alloc() where a failed allocation will fall back
to waiting for one of the preallocated elements in the pool.

But since this mode is used for all a bio's pages and not just the
first, it can deadlock waiting for pages already in the bio to be freed.

This deadlock can be reproduced by patching mempool_alloc() to pretend
that pool->alloc() always fails (so that it always falls back to the
preallocations), and then creating an encrypted file of size > 128 KiB.

Fix it by only using GFP_NOFS for the first page in the bio.  For
subsequent pages just use GFP_NOWAIT, and if any of those fail, just
submit the bio and start a new one.

This will need to be fixed in f2fs too, but that's less straightforward.

Fixes: c9af28fdd449 ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191231181149.47619-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agojbd2_seq_info_next should increase position index
Vasily Averin [Thu, 23 Jan 2020 09:05:10 +0000 (12:05 +0300)]
jbd2_seq_info_next should increase position index

commit 1a8e9cf40c9a6a2e40b1e924b13ed303aeea4418 upstream.

if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.

Script below generates endless output
 $ q=;while read -r r;do echo "$((++q)) $r";done </proc/fs/jbd2/DEV/info

https://bugzilla.kernel.org/show_bug.cgi?id=206283

Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface")
Cc: stable@kernel.org
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/d13805e5-695e-8ac3-b678-26ca2313629f@virtuozzo.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agonfsd: fix filecache lookup
Trond Myklebust [Mon, 6 Jan 2020 18:18:03 +0000 (13:18 -0500)]
nfsd: fix filecache lookup

commit 28c7d86bb6172ffbb1a1237c6388e77f9fe5f181 upstream.

If the lookup keeps finding a nfsd_file with an unhashed open file,
then retry once only.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org
Fixes: 65294c1f2c5e "nfsd: add a new struct file caching facility to nfsd"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoNFS: Directory page cache pages need to be locked when read
Trond Myklebust [Sun, 2 Feb 2020 22:53:54 +0000 (17:53 -0500)]
NFS: Directory page cache pages need to be locked when read

commit 114de38225d9b300f027e2aec9afbb6e0def154b upstream.

When a NFS directory page cache page is removed from the page cache,
its contents are freed through a call to nfs_readdir_clear_array().
To prevent the removal of the page cache entry until after we've
finished reading it, we must take the page lock.

Fixes: 11de3b11e08c ("NFS: Fix a memory leak in nfs_readdir")
Cc: stable@vger.kernel.org # v2.6.37+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoNFS: Fix memory leaks and corruption in readdir
Trond Myklebust [Sun, 2 Feb 2020 22:53:53 +0000 (17:53 -0500)]
NFS: Fix memory leaks and corruption in readdir

commit 4b310319c6a8ce708f1033d57145e2aa027a883c upstream.

nfs_readdir_xdr_to_array() must not exit without having initialised
the array, so that the page cache deletion routines can safely
call nfs_readdir_clear_array().
Furthermore, we should ensure that if we exit nfs_readdir_filler()
with an error, we free up any page contents to prevent a leak
if we try to fill the page again.

Fixes: 11de3b11e08c ("NFS: Fix a memory leak in nfs_readdir")
Cc: stable@vger.kernel.org # v2.6.37+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoscsi: qla2xxx: Fix unbound NVME response length
Arun Easi [Fri, 24 Jan 2020 04:50:14 +0000 (20:50 -0800)]
scsi: qla2xxx: Fix unbound NVME response length

commit 00fe717ee1ea3c2979db4f94b1533c57aed8dea9 upstream.

On certain cases when response length is less than 32, NVME response data
is supplied inline in IOCB. This is indicated by some combination of state
flags. There was an instance when a high, and incorrect, response length
was indicated causing driver to overrun buffers. Fix this by checking and
limiting the response payload length.

Fixes: 7401bc18d1ee3 ("scsi: qla2xxx: Add FC-NVMe command handling")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200124045014.23554-1-hmadhani@marvell.com
Signed-off-by: Arun Easi <aeasi@marvell.com>
Signed-off-by: Himanshu Madhani <hmadhani@marvell.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agopowerpc/futex: Fix incorrect user access blocking
Michael Ellerman [Fri, 7 Feb 2020 11:15:46 +0000 (22:15 +1100)]
powerpc/futex: Fix incorrect user access blocking

commit 9dc086f1e9ef39dd823bd27954b884b2062f9e70 upstream.

The early versions of our kernel user access prevention (KUAP) were
written by Russell and Christophe, and didn't have separate
read/write access.

At some point I picked up the series and added the read/write access,
but I failed to update the usages in futex.h to correctly allow read
and write.

However we didn't notice because of another bug which was causing the
low-level code to always enable read and write. That bug was fixed
recently in commit 1d8f739b07bd ("powerpc/kuap: Fix set direction in
allow/prevent_user_access()").

futex_atomic_cmpxchg_inatomic() is passed the user address as %3 and
does:

  1:     lwarx   %1,  0, %3
         cmpw    0,  %1, %4
         bne-    3f
  2:     stwcx.  %5,  0, %3

Which clearly loads and stores from/to %3. The logic in
arch_futex_atomic_op_inuser() is similar, so fix both of them to use
allow_read_write_user().

Without this fix, and with PPC_KUAP_DEBUG=y, we see eg:

  Bug: Read fault blocked by AMR!
  WARNING: CPU: 94 PID: 149215 at arch/powerpc/include/asm/book3s/64/kup-radix.h:126 __do_page_fault+0x600/0xf30
  CPU: 94 PID: 149215 Comm: futex_requeue_p Tainted: G        W         5.5.0-rc7-gcc9x-g4c25df5640ae #1
  ...
  NIP [c000000000070680] __do_page_fault+0x600/0xf30
  LR [c00000000007067c] __do_page_fault+0x5fc/0xf30
  Call Trace:
  [c00020138e5637e0] [c00000000007067c] __do_page_fault+0x5fc/0xf30 (unreliable)
  [c00020138e5638c0] [c00000000000ada8] handle_page_fault+0x10/0x30
  --- interrupt: 301 at cmpxchg_futex_value_locked+0x68/0xd0
      LR = futex_lock_pi_atomic+0xe0/0x1f0
  [c00020138e563bc0] [c000000000217b50] futex_lock_pi_atomic+0x80/0x1f0 (unreliable)
  [c00020138e563c30] [c00000000021b668] futex_requeue+0x438/0xb60
  [c00020138e563d60] [c00000000021c6cc] do_futex+0x1ec/0x2b0
  [c00020138e563d90] [c00000000021c8b8] sys_futex+0x128/0x200
  [c00020138e563e20] [c00000000000b7ac] system_call+0x5c/0x68

Fixes: de78a9c42a79 ("powerpc: Add a framework for Kernel Userspace Access Protection")
Cc: stable@vger.kernel.org # v5.2+
Reported-by: syzbot+e808452bad7c375cbee6@syzkaller-ppc64.appspotmail.com
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Link: https://lore.kernel.org/r/20200207122145.11928-1-mpe@ellerman.id.au
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agocrypto: picoxcell - adjust the position of tasklet_init and fix missed tasklet_kill
Chuhong Yuan [Mon, 9 Dec 2019 16:21:44 +0000 (00:21 +0800)]
crypto: picoxcell - adjust the position of tasklet_init and fix missed tasklet_kill

commit 7f8c36fe9be46862c4f3c5302f769378028a34fa upstream.

Since tasklet is needed to be initialized before registering IRQ
handler, adjust the position of tasklet_init to fix the wrong order.

Besides, to fix the missed tasklet_kill, this patch adds a helper
function and uses devm_add_action to kill the tasklet automatically.

Fixes: ce92136843cb ("crypto: picoxcell - add support for the picoxcell crypto engines")
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agocrypto: api - Fix race condition in crypto_spawn_alg
Herbert Xu [Sat, 7 Dec 2019 14:15:15 +0000 (22:15 +0800)]
crypto: api - Fix race condition in crypto_spawn_alg

commit 73669cc556462f4e50376538d77ee312142e8a8a upstream.

The function crypto_spawn_alg is racy because it drops the lock
before shooting the dying algorithm.  The algorithm could disappear
altogether before we shoot it.

This patch fixes it by moving the shooting into the locked section.

Fixes: 6bfd48096ff8 ("[CRYPTO] api: Added spawns")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agocrypto: atmel-aes - Fix counter overflow in CTR mode
Tudor Ambarus [Thu, 5 Dec 2019 09:54:01 +0000 (09:54 +0000)]
crypto: atmel-aes - Fix counter overflow in CTR mode

commit 781a08d9740afa73357f1a60d45d7c93d7cca2dd upstream.

32 bit counter is not supported by neither of our AES IPs, all implement
a 16 bit block counter. Drop the 32 bit block counter logic.

Fixes: fcac83656a3e ("crypto: atmel-aes - fix the counter overflow in CTR mode")
Signed-off-by: Tudor Ambarus <tudor.ambarus@microchip.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agocrypto: pcrypt - Do not clear MAY_SLEEP flag in original request
Herbert Xu [Fri, 29 Nov 2019 08:40:24 +0000 (16:40 +0800)]
crypto: pcrypt - Do not clear MAY_SLEEP flag in original request

commit e8d998264bffade3cfe0536559f712ab9058d654 upstream.

We should not be modifying the original request's MAY_SLEEP flag
upon completion.  It makes no sense to do so anyway.

Reported-by: Eric Biggers <ebiggers@kernel.org>
Fixes: 5068c7a883d1 ("crypto: pcrypt - Add pcrypt crypto...")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agocrypto: arm64/ghash-neon - bump priority to 150
Ard Biesheuvel [Thu, 28 Nov 2019 12:55:31 +0000 (13:55 +0100)]
crypto: arm64/ghash-neon - bump priority to 150

commit 5441c6507bc84166e9227e9370a56c57ba13794a upstream.

The SIMD based GHASH implementation for arm64 is typically much faster
than the generic one, and doesn't use any lookup tables, so it is
clearly preferred when available. So bump the priority to reflect that.

Fixes: 5a22b198cd527447 ("crypto: arm64/ghash - register PMULL variants ...")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agocrypto: ccp - set max RSA modulus size for v3 platform devices as well
Ard Biesheuvel [Wed, 27 Nov 2019 12:01:36 +0000 (13:01 +0100)]
crypto: ccp - set max RSA modulus size for v3 platform devices as well

commit 11548f5a5747813ff84bed6f2ea01100053b0d8d upstream.

AMD Seattle incorporates a non-PCI version of the v3 CCP crypto
accelerator, and this version was left behind when the maximum
RSA modulus size was parameterized in order to support v5 hardware
which supports larger moduli than v3 hardware does. Due to this
oversight, RSA acceleration no longer works at all on these systems.

Fix this by setting the .rsamax property to the appropriate value
for v3 platform hardware.

Fixes: e28c190db66830c0 ("csrypto: ccp - Expand RSA support for a v5 ccp")
Cc: Gary R Hook <gary.hook@amd.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Gary R Hook <gary.hook@amd.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agocrypto: hisilicon - Use the offset fields in sqe to avoid need to split scatterlists
Jonathan Cameron [Tue, 19 Nov 2019 05:42:57 +0000 (13:42 +0800)]
crypto: hisilicon - Use the offset fields in sqe to avoid need to split scatterlists

commit 484a897ffa3005f16cd9a31efd747bcf8155826f upstream.

We can configure sgl offset fields in ZIP sqe to let ZIP engine read/write
sgl data with skipped data. Hence no need to splite the sgl.

Fixes: 62c455ca853e (crypto: hisilicon - add HiSilicon ZIP accelerator support)
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agocrypto: api - fix unexpectedly getting generic implementation
Herbert Xu [Wed, 11 Dec 2019 02:50:11 +0000 (10:50 +0800)]
crypto: api - fix unexpectedly getting generic implementation

commit 2bbb3375d967155bccc86a5887d4a6e29c56b683 upstream.

When CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y, the first lookup of an
algorithm that needs to be instantiated using a template will always get
the generic implementation, even when an accelerated one is available.

This happens because the extra self-tests for the accelerated
implementation allocate the generic implementation for comparison
purposes, and then crypto_alg_tested() for the generic implementation
"fulfills" the original request (i.e. sets crypto_larval::adult).

This patch fixes this by only fulfilling the original request if
we are currently the best outstanding larval as judged by the
priority.  If we're not the best then we will ask all waiters on
that larval request to retry the lookup.

Note that this patch introduces a behaviour change when the module
providing the new algorithm is unregistered during the process.
Previously we would have failed with ENOENT, after the patch we
will instead redo the lookup.

Fixes: 9a8a6b3f0950 ("crypto: testmgr - fuzz hashes against...")
Fixes: d435e10e67be ("crypto: testmgr - fuzz skciphers against...")
Fixes: 40153b10d91c ("crypto: testmgr - fuzz AEADs against...")
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoselftests: bpf: Ignore FIN packets for reuseport tests
Lorenz Bauer [Fri, 24 Jan 2020 11:27:52 +0000 (11:27 +0000)]
selftests: bpf: Ignore FIN packets for reuseport tests

commit 8bec4f665e0baecb5f1b683379fc10b3745eb612 upstream.

The reuseport tests currently suffer from a race condition: FIN
packets count towards DROP_ERR_SKB_DATA, since they don't contain
a valid struct cmd. Tests will spuriously fail depending on whether
check_results is called before or after the FIN is processed.

Exit the BPF program early if FIN is set.

Fixes: 91134d849a0e ("bpf: Test BPF_PROG_TYPE_SK_REUSEPORT")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200124112754.19664-3-lmb@cloudflare.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoselftests: bpf: Use a temporary file in test_sockmap
Lorenz Bauer [Fri, 24 Jan 2020 11:27:51 +0000 (11:27 +0000)]
selftests: bpf: Use a temporary file in test_sockmap

commit c31dbb1e41d1857b403f9bf58c87f5898519a0bc upstream.

Use a proper temporary file for sendpage tests. This means that running
the tests doesn't clutter the working directory, and allows running the
test on read-only filesystems.

Fixes: 16962b2404ac ("bpf: sockmap, add selftests")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200124112754.19664-2-lmb@cloudflare.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoselftests/bpf: Skip perf hw events test if the setup disabled it
Hangbin Liu [Fri, 17 Jan 2020 10:06:56 +0000 (18:06 +0800)]
selftests/bpf: Skip perf hw events test if the setup disabled it

commit f1c3656c6d9c147d07d16614455aceb34932bdeb upstream.

The same with commit 4e59afbbed96 ("selftests/bpf: skip nmi test when perf
hw events are disabled"), it would make more sense to skip the
test_stacktrace_build_id_nmi test if the setup (e.g. virtual machines) has
disabled hardware perf events.

Fixes: 13790d1cc72c ("bpf: add selftest for stackmap with build_id in NMI context")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200117100656.10359-1-liuhangbin@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoselftests/bpf: Fix test_attach_probe
Alexei Starovoitov [Thu, 19 Dec 2019 02:04:42 +0000 (18:04 -0800)]
selftests/bpf: Fix test_attach_probe

commit 580205dd4fe800b1e95be8b6df9e2991f975a8ad upstream.

Fix two issues in test_attach_probe:

1. it was not able to parse /proc/self/maps beyond the first line,
   since %s means parse string until white space.
2. offset has to be accounted for otherwise uprobed address is incorrect.

Fixes: 1e8611bbdfc9 ("selftests/bpf: add kprobe/uprobe selftests")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20191219020442.1922617-1-ast@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agosamples/bpf: Xdp_redirect_cpu fix missing tracepoint attach
Jesper Dangaard Brouer [Fri, 20 Dec 2019 16:19:36 +0000 (17:19 +0100)]
samples/bpf: Xdp_redirect_cpu fix missing tracepoint attach

commit f9e6bfdbaf0cf304d72c70a05d81acac01a04f48 upstream.

When sample xdp_redirect_cpu was converted to use libbpf, the
tracepoints used by this sample were not getting attached automatically
like with bpf_load.c. The BPF-maps was still getting loaded, thus
nobody notice that the tracepoints were not updating these maps.

This fix doesn't use the new skeleton code, as this bug was introduced
in v5.1 and stable might want to backport this. E.g. Red Hat QA uses
this sample as part of their testing.

Fixes: bbaf6029c49c ("samples/bpf: Convert XDP samples to libbpf usage")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/157685877642.26195.2798780195186786841.stgit@firesoul
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agosamples/bpf: Don't try to remove user's homedir on clean
Toke Høiland-Jørgensen [Mon, 20 Jan 2020 13:06:41 +0000 (14:06 +0100)]
samples/bpf: Don't try to remove user's homedir on clean

commit b2e5e93ae8af6a34bca536cdc4b453ab1e707b8b upstream.

The 'clean' rule in the samples/bpf Makefile tries to remove backup
files (ending in ~). However, if no such files exist, it will instead try
to remove the user's home directory. While the attempt is mostly harmless,
it does lead to a somewhat scary warning like this:

rm: cannot remove '~': Is a directory

Fix this by using find instead of shell expansion to locate any actual
backup files that need to be removed.

Fixes: b62a796c109c ("samples/bpf: allow make to be run from samples/bpf/ directory")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/157952560126.1683545.7273054725976032511.stgit@toke.dk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agotc-testing: fix eBPF tests failure on linux fresh clones
Davide Caratti [Mon, 3 Feb 2020 15:29:29 +0000 (16:29 +0100)]
tc-testing: fix eBPF tests failure on linux fresh clones

commit 7145fcfffef1fad4266aaf5ca96727696916edb7 upstream.

when the following command is done on a fresh clone of the kernel tree,

 [root@f31 tc-testing]# ./tdc.py -c bpf

test cases that need to build the eBPF sample program fail systematically,
because 'buildebpfPlugin' is unable to install the kernel headers (i.e, the
'khdr' target fails). Pass the correct environment to 'make', in place of
ENVIR, to allow running these tests.

Fixes: 4c2d39bd40c1 ("tc-testing: use a plugin to build eBPF program")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agolibbpf: Fix realloc usage in bpf_core_find_cands
Andrii Nakryiko [Fri, 24 Jan 2020 20:18:46 +0000 (12:18 -0800)]
libbpf: Fix realloc usage in bpf_core_find_cands

commit 35b9211c0a2427e8f39e534f442f43804fc8d5ca upstream.

Fix bug requesting invalid size of reallocated array when constructing CO-RE
relocation candidate list. This can cause problems if there are many potential
candidates and a very fine-grained memory allocator bucket sizes are used.

Fixes: ddc7c3042614 ("libbpf: implement BPF CO-RE offset relocation algorithm")
Reported-by: William Smith <williampsmith@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200124201847.212528-1-andriin@fb.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agobpf, devmap: Pass lockdep expression to RCU lists
Amol Grover [Thu, 23 Jan 2020 12:04:38 +0000 (17:34 +0530)]
bpf, devmap: Pass lockdep expression to RCU lists

commit 485ec2ea9cf556e9c120e07961b7b459d776a115 upstream.

head is traversed using hlist_for_each_entry_rcu outside an RCU
read-side critical section but under the protection of dtab->index_lock.

Hence, add corresponding lockdep expression to silence false-positive
lockdep warnings, and harden RCU lists.

Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200123120437.26506-1-frextrite@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoselftests/bpf: Fix perf_buffer test on systems w/ offline CPUs
Andrii Nakryiko [Thu, 12 Dec 2019 01:36:20 +0000 (17:36 -0800)]
selftests/bpf: Fix perf_buffer test on systems w/ offline CPUs

commit 91cbdf740a476cf2c744169bf407de2e3ac1f3cf upstream.

Fix up perf_buffer.c selftest to take into account offline/missing CPUs.

Fixes: ee5cf82ce04a ("selftests/bpf: test perf buffer API")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191212013621.1691858-1-andriin@fb.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoriscv, bpf: Fix broken BPF tail calls
Björn Töpel [Mon, 16 Dec 2019 09:13:35 +0000 (10:13 +0100)]
riscv, bpf: Fix broken BPF tail calls

commit f1003b787c00fbaa4b11619c6b23a885bfce8f07 upstream.

The BPF JIT incorrectly clobbered the a0 register, and did not flag
usage of s5 register when BPF stack was being used.

Fixes: 2353ecc6f91f ("bpf, riscv: add BPF JIT for RV64G")
Signed-off-by: Björn Töpel <bjorn.topel@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191216091343.23260-2-bjorn.topel@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agobtrfs: Handle another split brain scenario with metadata uuid feature
Nikolay Borisov [Fri, 10 Jan 2020 12:11:34 +0000 (14:11 +0200)]
btrfs: Handle another split brain scenario with metadata uuid feature

commit 05840710149c7d1a78ea85a2db5723f706e97d8f upstream.

There is one more cases which isn't handled by the original metadata
uuid work. Namely, when a filesystem has METADATA_UUID incompat bit and
the user decides to change the FSID to the original one e.g. have
metadata_uuid and fsid match. In case of power failure while this
operation is in progress we could end up in a situation where some of
the disks have the incompat bit removed and the other half have both
METADATA_UUID_INCOMPAT and FSID_CHANGING_IN_PROGRESS flags.

This patch handles the case where a disk that has successfully changed
its FSID such that it equals METADATA_UUID is scanned first.
Subsequently when a disk with both
METADATA_UUID_INCOMPAT/FSID_CHANGING_IN_PROGRESS flags is scanned
find_fsid_changed won't be able to find an appropriate btrfs_fs_devices.
This is done by extending find_fsid_changed to correctly find
btrfs_fs_devices whose metadata_uuid/fsid are the same and they match
the metadata_uuid of the currently scanned device.

Fixes: cc5de4e70256 ("btrfs: Handle final split-brain possibility during fsid change")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reported-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agobtrfs: fix improper setting of scanned for range cyclic write cache pages
Josef Bacik [Fri, 3 Jan 2020 15:38:44 +0000 (10:38 -0500)]
btrfs: fix improper setting of scanned for range cyclic write cache pages

commit 556755a8a99be8ca3cd9fbe36aaf9b3b0339a00d upstream.

We noticed that we were having regular CG OOM kills in cases where there
was still enough dirty pages to avoid OOM'ing.  It turned out there's
this corner case in btrfs's handling of range_cyclic where files that
were being redirtied were not getting fully written out because of how
we do range_cyclic writeback.

We unconditionally were setting scanned = 1; the first time we found any
pages in the inode.  This isn't actually what we want, we want it to be
set if we've scanned the entire file.  For range_cyclic we could be
starting in the middle or towards the end of the file, so we could write
one page and then not write any of the other dirty pages in the file
because we set scanned = 1.

Fix this by not setting scanned = 1 if we find pages.  The rules for
setting scanned should be

1) !range_cyclic.  In this case we have a specified range to write out.
2) range_cyclic && index == 0.  In this case we've started at the
   beginning and there is no need to loop around a second time.
3) range_cyclic && we started at index > 0 and we've reached the end of
   the file without satisfying our nr_to_write.

This patch fixes both of our writepages implementations to make sure
these rules hold true.  This fixed our over zealous CG OOMs in
production.

Fixes: d1310b2e0cd9 ("Btrfs: Split the extent_map code into two parts")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agocrypto: pcrypt - Avoid deadlock by using per-instance padata queues
Herbert Xu [Tue, 26 Nov 2019 07:58:45 +0000 (15:58 +0800)]
crypto: pcrypt - Avoid deadlock by using per-instance padata queues

commit bbefa1dd6a6d53537c11624752219e39959d04fb upstream.

If the pcrypt template is used multiple times in an algorithm, then a
deadlock occurs because all pcrypt instances share the same
padata_instance, which completes requests in the order submitted.  That
is, the inner pcrypt request waits for the outer pcrypt request while
the outer request is already waiting for the inner.

This patch fixes this by allocating a set of queues for each pcrypt
instance instead of using two global queues.  In order to maintain
the existing user-space interface, the pinst structure remains global
so any sysfs modifications will apply to every pcrypt instance.

Note that when an update occurs we have to allocate memory for
every pcrypt instance.  Should one of the allocations fail we
will abort the update without rolling back changes already made.

The new per-instance data structure is called padata_shell and is
essentially a wrapper around parallel_data.

Reproducer:

#include <linux/if_alg.h>
#include <sys/socket.h>
#include <unistd.h>

int main()
{
struct sockaddr_alg addr = {
.salg_type = "aead",
.salg_name = "pcrypt(pcrypt(rfc4106-gcm-aesni))"
};
int algfd, reqfd;
char buf[32] = { 0 };

algfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
bind(algfd, (void *)&addr, sizeof(addr));
setsockopt(algfd, SOL_ALG, ALG_SET_KEY, buf, 20);
reqfd = accept(algfd, 0, 0);
write(reqfd, buf, 32);
read(reqfd, buf, 16);
}

Reported-by: syzbot+56c7151cad94eec37c521f0e47d2eee53f9361c4@syzkaller.appspotmail.com
Fixes: 5068c7a883d1 ("crypto: pcrypt - Add pcrypt crypto parallelization wrapper")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 years agoftrace: Protect ftrace_graph_hash with ftrace_sync
Steven Rostedt (VMware) [Wed, 5 Feb 2020 14:20:32 +0000 (09:20 -0500)]
ftrace: Protect ftrace_graph_hash with ftrace_sync

[ Upstream commit 54a16ff6f2e50775145b210bcd94d62c3c2af117 ]

As function_graph tracer can run when RCU is not "watching", it can not be
protected by synchronize_rcu() it requires running a task on each CPU before
it can be freed. Calling schedule_on_each_cpu(ftrace_sync) needs to be used.

Link: https://lore.kernel.org/r/20200205131110.GT2935@paulmck-ThinkPad-P72
Cc: stable@vger.kernel.org
Fixes: b9b0c831bed26 ("ftrace: Convert graph filter to use hash tables")
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
4 years agoftrace: Add comment to why rcu_dereference_sched() is open coded
Steven Rostedt (VMware) [Wed, 5 Feb 2020 07:17:57 +0000 (02:17 -0500)]
ftrace: Add comment to why rcu_dereference_sched() is open coded

[ Upstream commit 16052dd5bdfa16dbe18d8c1d4cde2ddab9d23177 ]

Because the function graph tracer can execute in sections where RCU is not
"watching", the rcu_dereference_sched() for the has needs to be open coded.
This is fine because the RCU "flavor" of the ftrace hash is protected by
its own RCU handling (it does its own little synchronization on every CPU
and does not rely on RCU sched).

Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
4 years agotracing: Annotate ftrace_graph_notrace_hash pointer with __rcu
Amol Grover [Wed, 5 Feb 2020 05:57:02 +0000 (11:27 +0530)]
tracing: Annotate ftrace_graph_notrace_hash pointer with __rcu

[ Upstream commit fd0e6852c407dd9aefc594f54ddcc21d84803d3b ]

Fix following instances of sparse error
kernel/trace/ftrace.c:5667:29: error: incompatible types in comparison
kernel/trace/ftrace.c:5813:21: error: incompatible types in comparison
kernel/trace/ftrace.c:5868:36: error: incompatible types in comparison
kernel/trace/ftrace.c:5870:25: error: incompatible types in comparison

Use rcu_dereference_protected to dereference the newly annotated pointer.

Link: http://lkml.kernel.org/r/20200205055701.30195-1-frextrite@gmail.com
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
4 years agotracing: Annotate ftrace_graph_hash pointer with __rcu
Amol Grover [Sat, 1 Feb 2020 07:27:04 +0000 (12:57 +0530)]
tracing: Annotate ftrace_graph_hash pointer with __rcu

[ Upstream commit 24a9729f831462b1d9d61dc85ecc91c59037243f ]

Fix following instances of sparse error
kernel/trace/ftrace.c:5664:29: error: incompatible types in comparison
kernel/trace/ftrace.c:5785:21: error: incompatible types in comparison
kernel/trace/ftrace.c:5864:36: error: incompatible types in comparison
kernel/trace/ftrace.c:5866:25: error: incompatible types in comparison

Use rcu_dereference_protected to access the __rcu annotated pointer.

Link: http://lkml.kernel.org/r/20200201072703.17330-1-frextrite@gmail.com
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
4 years agoASoC: SOF: core: release resources on errors in probe_continue
Pierre-Louis Bossart [Fri, 24 Jan 2020 21:36:21 +0000 (15:36 -0600)]
ASoC: SOF: core: release resources on errors in probe_continue

[ Upstream commit 410e5e55c9c1c9c0d452ac5b9adb37b933a7747e ]

The initial intent of releasing resources in the .remove does not work
well with HDaudio codecs. If the probe_continue() fails in a work
queue, e.g. due to missing firmware or authentication issues, we don't
release any resources, and as a result the kernel oopses during
suspend operations.

The suggested fix is to release all resources during errors in
probe_continue(), and use fw_state to track resource allocation
state, so that .remove does not attempt to release the same
hardware resources twice. PM operations are also modified so that
no action is done if DSP resources have been freed due to
an error at probe.

Reported-by: Takashi Iwai <tiwai@suse.de>
Co-developed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Bugzilla:  http://bugzilla.suse.com/show_bug.cgi?id=1161246
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Takashi Iwai <tiwai@suse.de>
Link: https://lore.kernel.org/r/20200124213625.30186-4-pierre-louis.bossart@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
4 years agoASoC: SOF: Introduce state machine for FW boot
Ranjani Sridharan [Wed, 18 Dec 2019 00:26:09 +0000 (18:26 -0600)]
ASoC: SOF: Introduce state machine for FW boot

[ Upstream commit 6ca5cecbd1c1758666ab79446f19e0e61ed11444 ]

Add a state machine for FW boot to track the
different stages of FW boot and replace the boot_complete
field with fw_state field in struct snd_sof_dev.
This will be used to determine the actions to be performed
during system suspend.

One of the main motivations for adding this change is the
fact that errors during the top-level SOF device probe cannot
be propagated and therefore suspending the SOF device normally
during system suspend could potentially run into errors.
For example, with the current flow, if the FW boot failed
for some reason and the system suspends, the SOF device
suspend could fail because the CTX_SAVE IPC would be attempted
even though the FW never really booted successfully causing it
to time out. Another scenario that the state machine fixes
is when the runtime suspend for the SOF device fails and
the DSP is powered down nevertheless, the CTX_SAVE IPC during
system suspend would timeout because the DSP is already
powered down.

Reviewed-by: Curtis Malainey <cujomalainey@chromium.org>
Reviewed-by: Daniel Baluta <daniel.baluta@nxp.com>
Signed-off-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Link: https://lore.kernel.org/r/20191218002616.7652-2-pierre-louis.bossart@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>