platform/kernel/linux-rpi.git
5 years agodrm/i915: Skip vcpi allocation for MSTB ports that are gone
Lyude Paul [Mon, 8 Oct 2018 23:24:33 +0000 (19:24 -0400)]
drm/i915: Skip vcpi allocation for MSTB ports that are gone

commit c02ba4ef16eefe663fdefcccaa57fad32d5481bf upstream.

Since we need to be able to allow DPMS on->off prop changes after an MST
port has disappeared from the system, we need to be able to make sure we
can compute a config for the resulting atomic commit. Currently this is
impossible when the port has disappeared, since the VCPI slot searching
we try to do in intel_dp_mst_compute_config() will fail with -EINVAL.

Since the only commits we want to allow on no-longer-present MST ports
are ones that shut off display hardware, we already know that no VCPI
allocations are needed. So, hardcode the VCPI slot count to 0 when
intel_dp_mst_compute_config() is called on an MST port that's gone.

Changes since V4:
- Don't use mst_port_gone at all, just check whether or not the drm
  connector is registered - Daniel Vetter

Signed-off-by: Lyude Paul <lyude@redhat.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: stable@vger.kernel.org
Link: https://patchwork.freedesktop.org/patch/msgid/20181008232437.5571-5-lyude@redhat.com
(cherry picked from commit f67207d78ceaf98b7531bc22df6f21328559c8d4)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agodrm/i915: Don't unset intel_connector->mst_port
Lyude Paul [Mon, 8 Oct 2018 23:24:32 +0000 (19:24 -0400)]
drm/i915: Don't unset intel_connector->mst_port

commit 80c188695a77eddaa6e8885510ff4ef59fd478c3 upstream.

Currently we set intel_connector->mst_port to NULL to signify that the
MST port has been removed from the system so that we can prevent further
action on the port such as connector probes, mode probing, etc.
However, we're going to need access to intel_connector->mst_port in
order to fixup ->best_encoder() so that it can always return the correct
encoder for an MST port to prevent legacy DPMS prop changes from
failing. This should be safe, so instead keep intel_connector->mst_port
always set and instead just check the status of
drm_connector->regustered to signify whether or not the connector has
disappeared from the system.

Changes since v2:
- Add a comment to mst_port_gone (Jani Nikula)
- Change mst_port_gone to a u8 instead of a bool, per the kernel bot.
  Apparently bool is discouraged in structs these days
Changes since v4:
- Don't use mst_port_gone at all! Just check if the connector is
  registered or not - Daniel Vetter

Signed-off-by: Lyude Paul <lyude@redhat.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: stable@vger.kernel.org
Link: https://patchwork.freedesktop.org/patch/msgid/20181008232437.5571-4-lyude@redhat.com
(cherry picked from commit 6ed5bb1fbad34382c8cfe9a9bf737e9a43053df5)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agodrm/i915: Restore vblank interrupts earlier
Ville Syrjälä [Wed, 3 Oct 2018 14:49:51 +0000 (17:49 +0300)]
drm/i915: Restore vblank interrupts earlier

commit 7cada4d0b7a0fb813dbc9777fec092e9ed0546e9 upstream.

Plane sanitation needs vblank interrupts (on account of CxSR disable).
So let's restore vblank interrupts earlier.

v2: Make it actually build
v3: Add comment to explain why we need this (Daniel)

Cc: stable@vger.kernel.org
Cc: Dennis <dennis.nezic@utoronto.ca>
Tested-by: Dennis <dennis.nezic@utoronto.ca>
Tested-by: Peter Nowee <peter.nowee@gmail.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105637
Fixes: b1e01595a66d ("drm/i915: Redo plane sanitation during readout")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20181003144951.4397-1-ville.syrjala@linux.intel.com
(cherry picked from commit 68bc30deac625b8be8d3950b30dc93d09a3645f5)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agodrm/i915: Use the correct crtc when sanitizing plane mapping
Ville Syrjälä [Wed, 3 Oct 2018 14:50:17 +0000 (17:50 +0300)]
drm/i915: Use the correct crtc when sanitizing plane mapping

commit 9b27390139dbe0dc10d1899545248862fe826b61 upstream.

When we decide that a plane is attached to the wrong pipe we try
to turn off said plane. However we are passing around the crtc we
think that the plane is supposed to be using rather than the crtc
it is currently using. That doesn't work all that well because
we may have to do vblank waits etc. and the other pipe might
not even be enabled here. So let's pass the plane's current crtc to
intel_plane_disable_noatomic() so that it can its job correctly.

To do that semi-cleanly we also have to change the plane readout
to record the plane's visibility into the bitmasks of the crtc
where the plane is currently enabled rather than to the crtc
we want to use for the plane.

One caveat here is that our active_planes bitmask will get confused
if both planes are enabled on the same pipe. Fortunately we can use
plane_mask to reconstruct active_planes sufficiently since
plane_mask still has the same meaning (is the plane visible?)
during readout. We also have to do the same during the initial
plane readout as the second plane could clear the active_planes
bit the first plane had already set.

v2: Rely on fixup_active_planes() to populate active_planes fully (Daniel)
    Add Daniel's proposed comment to better document why we do this
    Drop the redundant intel_set_plane_visible() call

Cc: stable@vger.kernel.org # fcba862e8428 drm/i915: Have plane->get_hw_state() return the current pipe
Cc: stable@vger.kernel.org
Cc: Dennis <dennis.nezic@utoronto.ca>
Cc: Daniel Vetter <daniel@ffwll.ch>
Tested-by: Dennis <dennis.nezic@utoronto.ca>
Tested-by: Peter Nowee <peter.nowee@gmail.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105637
Fixes: b1e01595a66d ("drm/i915: Redo plane sanitation during readout")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20181003145017.4527-1-ville.syrjala@linux.intel.com
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
(cherry picked from commit 62358aa4ee86481ce044bef04859820e1bc7c1d9)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agodrm/i915/dp: Link train Fallback on eDP only if fallback link BW can fit panel's...
Manasi Navare [Tue, 9 Oct 2018 21:28:04 +0000 (14:28 -0700)]
drm/i915/dp: Link train Fallback on eDP only if fallback link BW can fit panel's native mode

commit 041444458835d7fb2c9f042598bfe16bf375b15d upstream.

This patch fixes the original commit c0cfb10d9e1de49 ("drm/i915/edp:
Do not do link training fallback or prune modes on EDP") that causes
a blank screen in case of certain eDP panels (Eg: seen on Dell XPS13 9350)
where first link training fails and a retraining is required by falling
back to lower link rate/lane count.
In case of some panels they advertise higher link rate/lane count
than whats required for supporting the panel's native mode.
But we always link train at highest link rate/lane count for eDP
and if that fails we can still fallback to lower link rate/lane count
as long as the fallback link BW still fits the native mode to avoid
pruning the panel's native mode yet retraining at fallback values
to recover from a blank screen.

v3:
* Add const for fixed_mode (Ville)
v2:
* Send uevent if link failure on eDP unconditionally

Fixes: c0cfb10d9e1d ("drm/i915/edp: Do not do link training fallback or prune modes on EDP")
Cc: Clinton Taylor <clinton.a.taylor@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: <stable@vger.kernel.org> # v4.17+
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107489
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105338
Signed-off-by: Manasi Navare <manasi.d.navare@intel.com>
Tested-by: Alexander Wilson <alexander.wilson@ncf.edu>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20181009212804.702-1-manasi.d.navare@intel.com
(cherry picked from commit 1e712535c51ab025ebc776d4405683d81521996d)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agodrm: panel-orientation-quirks: Add quirk for Acer One 10 (S1003)
Hans de Goede [Fri, 12 Oct 2018 10:16:10 +0000 (12:16 +0200)]
drm: panel-orientation-quirks: Add quirk for Acer One 10 (S1003)

commit 0e8afefd5da4875ddea9aa4ad17a2540a2bf9736 upstream.

The Acer One 10 uses a clamshell design with a detachable keyboard.
As such in normal operating mode, with the keyboard attach the device
is in landscape mode (and the Acer logo at boot also shows in landscape
mode).

But the device uses a portrait screen rotated 90 degrees (sigh). This
commit adds a quirk for this device so that we shown the fbcon the
right way up and that we hint userspace to also show e.g. plymouth and
gdm the right way up.

Cc: stable@vger.kernel.org
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20181012101610.29100-1-hdegoede@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agodrm/dp_mst: Check if primary mstb is null
Stanislav Lisovskiy [Fri, 9 Nov 2018 09:00:12 +0000 (11:00 +0200)]
drm/dp_mst: Check if primary mstb is null

commit 23d8003907d094f77cf959228e2248d6db819fa7 upstream.

Unfortunately drm_dp_get_mst_branch_device which is called from both
drm_dp_mst_handle_down_rep and drm_dp_mst_handle_up_rep seem to rely
on that mgr->mst_primary is not NULL, which seem to be wrong as it can be
cleared with simultaneous mode set, if probing fails or in other case.
mgr->lock mutex doesn't protect against that as it might just get
assigned to NULL right before, not simultaneously.

There are currently bugs 107738, 108616 bugs which crash in
drm_dp_get_mst_branch_device, caused by this issue.

v2: Refactored the code, as it was nicely noticed.
    Fixed Bugzilla bug numbers(second was 108616, but not 108816)
    and added links.

[changed title and added stable cc]
Signed-off-by: Lyude Paul <lyude@redhat.com>
Signed-off-by: Stanislav Lisovskiy <stanislav.lisovskiy@intel.com>
Cc: stable@vger.kernel.org
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108616
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107738
Link: https://patchwork.freedesktop.org/patch/msgid/20181109090012.24438-1-stanislav.lisovskiy@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agodrm/etnaviv: fix bogus fence complete check in timeout handler
Lucas Stach [Thu, 4 Oct 2018 09:37:00 +0000 (11:37 +0200)]
drm/etnaviv: fix bogus fence complete check in timeout handler

commit 6fce3a406108ee6c8a61e2a33e52e9198a626ea0 upstream.

The GPU hardware fences and the job out-fences are on different timelines
so it's wrong to compare them. Fix this by only looking at the out-fence.

Cc: <stable@vger.kernel.org>
Fixes: 2c83a726d6fb (drm/etnaviv: bring back progress check in job
                     timeout handler)
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agodrm/amd/powerplay: Enable/Disable NBPSTATE on On/OFF of UVD
Akshu Agrawal [Mon, 24 Sep 2018 10:18:02 +0000 (15:48 +0530)]
drm/amd/powerplay: Enable/Disable NBPSTATE on On/OFF of UVD

commit 51ef434a15b450bfbef1e06cc87ee4e98a224486 upstream.

We observe black lines (underflow) on display when playing a
4K video with UVD. On Disabling Low memory P state this issue is
not seen.
Multiple runs of power measurement shows no imapct.

Signed-off-by: Akshu Agrawal <akshu.agrawal@amd.com>
Signed-off-by: Satyajit Sahu <satyajit.sahu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agodrm/nouveau: Fix nv50_mstc->best_encoder()
Lyude Paul [Mon, 8 Oct 2018 23:24:31 +0000 (19:24 -0400)]
drm/nouveau: Fix nv50_mstc->best_encoder()

commit 7b0f61e91b6056c71649efa3204112a4b6cf5fc8 upstream.

As mentioned in the previous commit, we currently prevent new modesets
on recently-removed MST connectors by returning no encoder from our
->best_encoder() callback once the MST port has disappeared. This is
wrong however, because it prevents legacy modesetting users from being
able to disable CRTCs on MST connectors after the connector's respective
topology has disappeared.

So, fix this by instead by just always returning a valid encoder.

Changes since v2:
- Remove usage of atomic MST helper for now, since that got replaced
  with a much simpler solution

Signed-off-by: Lyude Paul <lyude@redhat.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Ben Skeggs <bskeggs@redhat.com>
Cc: stable@vger.kernel.org
Link: https://patchwork.freedesktop.org/patch/msgid/20181008232437.5571-3-lyude@redhat.com
(cherry picked from commit e87b0bbc9f0380d403f8f2f6abba0d51c74d944f)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agodrm/nouveau: Check backlight IDs are >= 0, not > 0
Lyude Paul [Thu, 6 Sep 2018 21:43:21 +0000 (17:43 -0400)]
drm/nouveau: Check backlight IDs are >= 0, not > 0

commit dc854914999d5d52ac1b31740cb0ea8d89d0372e upstream.

Remember, ida IDs start at 0, not 1!

Signed-off-by: Lyude Paul <lyude@redhat.com>
Reviewed-by: Karol Herbst <kherbst@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agodrm/amdgpu: Suppress keypresses from ACPI_VIDEO events
Lyude Paul [Sat, 22 Sep 2018 00:43:44 +0000 (20:43 -0400)]
drm/amdgpu: Suppress keypresses from ACPI_VIDEO events

commit 582f58de36834096a91cc1de2540c2f7269f850d upstream.

Currently we return NOTIFY_DONE for any event which we don't think is
ours. However, many laptops will send more then just an ATIF event and
will also send an ACPI_VIDEO_NOTIFY_PROBE event as well. Since we don't
check for this, we return NOTIFY_DONE which causes a keypress for the
ACPI event to be propogated to userspace. This is the equivalent of
someone pressing the display key on a laptop every time there's a
hotplug event.

So, check for ACPI_VIDEO_NOTIFY_PROBE events and suppress keypresses
from them.

Signed-off-by: Lyude Paul <lyude@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agodrm/amdgpu: add missing CHIP_HAINAN in amdgpu_ucode_get_load_type
Alex Deucher [Tue, 28 Aug 2018 19:16:23 +0000 (14:16 -0500)]
drm/amdgpu: add missing CHIP_HAINAN in amdgpu_ucode_get_load_type

commit d9997b64c52b70bd98c48f443f068253621d1ffc upstream.

This caused a confusing error message, but there is functionally
no problem since the default method is DIRECT.

Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agodrm/amdgpu: Fix typo in amdgpu_vmid_mgr_init
Rex Zhu [Fri, 12 Oct 2018 14:26:11 +0000 (22:26 +0800)]
drm/amdgpu: Fix typo in amdgpu_vmid_mgr_init

commit 3df27645395e8f79c0dc20a15cf1da61f376000d upstream.

fix a typo in for loop: i->j

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Rex Zhu <Rex.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agodrm/rockchip: Allow driver to be shutdown on reboot/kexec
Marc Zyngier [Sun, 5 Aug 2018 12:48:07 +0000 (13:48 +0100)]
drm/rockchip: Allow driver to be shutdown on reboot/kexec

commit 7f3ef5dedb146e3d5063b6845781ad1bb59b92b5 upstream.

Leaving the DRM driver enabled on reboot or kexec has the annoying
effect of leaving the display generating transactions whilst the
IOMMU has been shut down.

In turn, the IOMMU driver (which shares its interrupt line with
the VOP) starts warning either on shutdown or when entering the
secondary kernel in the kexec case (nothing is expected on that
front).

A cheap way of ensuring that things are nicely shut down is to
register a shutdown callback in the platform driver.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Vicente Bergas <vicencb@gmail.com>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20180805124807.18169-1-marc.zyngier@arm.com
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoscripts/spdxcheck.py: make python3 compliant
Uwe Kleine-König [Fri, 16 Nov 2018 23:08:43 +0000 (15:08 -0800)]
scripts/spdxcheck.py: make python3 compliant

commit 6f4d29df66acd49303a99025046b85cabe7aa17a upstream.

Without this change the following happens when using Python3 (3.6.6):

$ echo "GPL-2.0" | python3 scripts/spdxcheck.py -
FAIL: 'str' object has no attribute 'decode'
Traceback (most recent call last):
  File "scripts/spdxcheck.py", line 253, in <module>
    parser.parse_lines(sys.stdin, args.maxlines, '-')
  File "scripts/spdxcheck.py", line 171, in parse_lines
    line = line.decode(locale.getpreferredencoding(False), errors='ignore')
AttributeError: 'str' object has no attribute 'decode'

So as the line is already a string, there is no need to decode it and
the line can be dropped.

/usr/bin/python on Arch is Python 3.  So this would indeed be worth
going into 4.19.

Link: http://lkml.kernel.org/r/20181023070802.22558-1-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Joe Perches <joe@perches.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agomm: don't reclaim inodes with many attached pages
Roman Gushchin [Fri, 16 Nov 2018 23:08:18 +0000 (15:08 -0800)]
mm: don't reclaim inodes with many attached pages

commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0 upstream.

Spock reported that commit 172b06c32b94 ("mm: slowly shrink slabs with a
relatively small number of objects") leads to a regression on his setup:
periodically the majority of the pagecache is evicted without an obvious
reason, while before the change the amount of free memory was balancing
around the watermark.

The reason behind is that the mentioned above change created some
minimal background pressure on the inode cache.  The problem is that if
an inode is considered to be reclaimed, all belonging pagecache page are
stripped, no matter how many of them are there.  So, if a huge
multi-gigabyte file is cached in the memory, and the goal is to reclaim
only few slab objects (unused inodes), we still can eventually evict all
gigabytes of the pagecache at once.

The workload described by Spock has few large non-mapped files in the
pagecache, so it's especially noticeable.

To solve the problem let's postpone the reclaim of inodes, which have
more than 1 attached page.  Let's wait until the pagecache pages will be
evicted naturally by scanning the corresponding LRU lists, and only then
reclaim the inode structure.

Link: http://lkml.kernel.org/r/20181023164302.20436-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Spock <dairinin@gmail.com>
Tested-by: Spock <dairinin@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@vger.kernel.org> [4.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoefi/arm/libstub: Pack FDT after populating it
Ard Biesheuvel [Wed, 14 Nov 2018 17:55:42 +0000 (09:55 -0800)]
efi/arm/libstub: Pack FDT after populating it

commit 72a58a63a164b4e9d2d914e65caeb551846883f1 upstream.

Commit:

  24d7c494ce46 ("efi/arm-stub: Round up FDT allocation to mapping size")

increased the allocation size for the FDT image created by the stub to a
fixed value of 2 MB, to simplify the former code that made several
attempts with increasing values for the size. This is reasonable
given that the allocation is of type EFI_LOADER_DATA, which is released
to the kernel unless it is explicitly memblock_reserve()d by the early
boot code.

However, this allocation size leaked into the 'size' field of the FDT
header metadata, and so the entire allocation remains occupied by the
device tree binary, even if most of it is not used to store device tree
information.

So call fdt_pack() to shrink the FDT data structure to its minimum size
after populating all the fields, so that the remaining memory is no
longer wasted.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: <stable@vger.kernel.org> # v4.12+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Fixes: 24d7c494ce46 ("efi/arm-stub: Round up FDT allocation to mapping size")
Link: http://lkml.kernel.org/r/20181114175544.12860-4-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agomm/swapfile.c: use kvzalloc for swap_info_struct allocation
Vasily Averin [Fri, 16 Nov 2018 23:08:11 +0000 (15:08 -0800)]
mm/swapfile.c: use kvzalloc for swap_info_struct allocation

commit 873d7bcfd066663e3e50113dc4a0de19289b6354 upstream.

Commit a2468cc9bfdf ("swap: choose swap device according to numa node")
changed 'avail_lists' field of 'struct swap_info_struct' to an array.
In popular linux distros it increased size of swap_info_struct up to 40
Kbytes and now swap_info_struct allocation requires order-4 page.
Switch to kvzmalloc allows to avoid unexpected allocation failures.

Link: http://lkml.kernel.org/r/fc23172d-3c75-21e2-d551-8b1808cbe593@virtuozzo.com
Fixes: a2468cc9bfdf ("swap: choose swap device according to numa node")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agohugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
Mike Kravetz [Fri, 16 Nov 2018 23:08:04 +0000 (15:08 -0800)]
hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!

commit 5e41540c8a0f0e98c337dda8b391e5dda0cde7cf upstream.

This bug has been experienced several times by the Oracle DB team.  The
BUG is in remove_inode_hugepages() as follows:

/*
 * If page is mapped, it was faulted in after being
 * unmapped in caller.  Unmap (again) now after taking
 * the fault mutex.  The mutex will prevent faults
 * until we finish removing the page.
 *
 * This race can only happen in the hole punch case.
 * Getting here in a truncate operation is a bug.
 */
if (unlikely(page_mapped(page))) {
BUG_ON(truncate_op);

In this case, the elevated map count is not the result of a race.
Rather it was incorrectly incremented as the result of a bug in the huge
pmd sharing code.  Consider the following:

 - Process A maps a hugetlbfs file of sufficient size and alignment
   (PUD_SIZE) that a pmd page could be shared.

 - Process B maps the same hugetlbfs file with the same size and
   alignment such that a pmd page is shared.

 - Process B then calls mprotect() to change protections for the mapping
   with the shared pmd. As a result, the pmd is 'unshared'.

 - Process B then calls mprotect() again to chage protections for the
   mapping back to their original value. pmd remains unshared.

 - Process B then forks and process C is created. During the fork
   process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
   tables. Copying page tables for hugetlb mappings is done in the
   routine copy_hugetlb_page_range.

In copy_hugetlb_page_range(), the destination pte is obtained by:

dst_pte = huge_pte_alloc(dst, addr, sz);

If pmd sharing is possible, the returned pointer will be to a pte in an
existing page table.  In the situation above, process C could share with
either process A or process B.  Since process A is first in the list,
the returned pte is a pointer to a pte in process A's page table.

However, the check for pmd sharing in copy_hugetlb_page_range is:

/* If the pagetables are shared don't copy or take references */
if (dst_pte == src_pte)
continue;

Since process C is sharing with process A instead of process B, the
above test fails.  The code in copy_hugetlb_page_range which follows
assumes dst_pte points to a huge_pte_none pte.  It copies the pte entry
from src_pte to dst_pte and increments this map count of the associated
page.  This is how we end up with an elevated map count.

To solve, check the dst_pte entry for huge_pte_none.  If !none, this
implies PMD sharing so do not copy.

Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com
Fixes: c5c99429fa57 ("fix hugepages leak due to pagetable page sharing")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agolib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn
Arnd Bergmann [Fri, 16 Nov 2018 23:08:35 +0000 (15:08 -0800)]
lib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn

commit 1c23b4108d716cc848b38532063a8aca4f86add8 upstream.

gcc-8 complains about the prototype for this function:

  lib/ubsan.c:432:1: error: ignoring attribute 'noreturn' in declaration of a built-in function '__ubsan_handle_builtin_unreachable' because it conflicts with attribute 'const' [-Werror=attributes]

This is actually a GCC's bug. In GCC internals
__ubsan_handle_builtin_unreachable() declared with both 'noreturn' and
'const' attributes instead of only 'noreturn':

   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84210

Workaround this by removing the noreturn attribute.

[aryabinin: add information about GCC bug in changelog]
Link: http://lkml.kernel.org/r/20181107144516.4587-1-aryabinin@virtuozzo.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Olof Johansson <olof@lixom.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agocrypto: user - fix leaking uninitialized memory to userspace
Eric Biggers [Sat, 3 Nov 2018 21:56:00 +0000 (14:56 -0700)]
crypto: user - fix leaking uninitialized memory to userspace

commit f43f39958beb206b53292801e216d9b8a660f087 upstream.

All bytes of the NETLINK_CRYPTO report structures must be initialized,
since they are copied to userspace.  The change from strncpy() to
strlcpy() broke this.  As a minimal fix, change it back.

Fixes: 4473710df1f8 ("crypto: user - Prepare for CRYPTO_MAX_ALG_NAME expansion")
Cc: <stable@vger.kernel.org> # v4.12+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agolibata: blacklist SAMSUNG MZ7TD256HAFV-000L9 SSD
Diego Viola [Mon, 12 Nov 2018 19:22:52 +0000 (17:22 -0200)]
libata: blacklist SAMSUNG MZ7TD256HAFV-000L9 SSD

commit 410b5c7b48368317af95f0113692561d01d8144e upstream.

med_power_with_dipm still causes freezes after updating the firmware to
the latest version (DXT04L5Q).

Set model_rev to NULL and blacklist the device.

Cc: stable@vger.kernel.org
Signed-off-by: Diego Viola <diego.viola@gmail.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agogfs2: Fix metadata read-ahead during truncate (2)
Andreas Gruenbacher [Thu, 8 Nov 2018 20:14:29 +0000 (20:14 +0000)]
gfs2: Fix metadata read-ahead during truncate (2)

commit e7445ceddfc220c1aede6d42758a5acb8844e9c3 upstream.

The previous attempt to fix for metadata read-ahead during truncate was
incorrect: for files with a height > 2 (1006989312 bytes with a block
size of 4096 bytes), read-ahead requests were not being issued for some
of the indirect blocks discovered while walking the metadata tree,
leading to significant slow-downs when deleting large files.  Fix that.

In addition, only issue read-ahead requests in the first pass through
the meta-data tree, while deallocating data blocks.

Fixes: c3ce5aa9b0 ("gfs2: Fix metadata read-ahead during truncate")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agogfs2: Put bitmap buffers in put_super
Andreas Gruenbacher [Mon, 5 Nov 2018 22:57:24 +0000 (22:57 +0000)]
gfs2: Put bitmap buffers in put_super

commit 10283ea525d30f2e99828978fd04d8427876a7ad upstream.

gfs2_put_super calls gfs2_clear_rgrpd to destroy the gfs2_rgrpd objects
attached to the resource group glocks.  That function should release the
buffers attached to the gfs2_bitmap objects (bi_bh), but the call to
gfs2_rgrp_brelse for doing that is missing.

When gfs2_releasepage later runs across these buffers which are still
referenced, it refuses to free them.  This causes the pages the buffers
are attached to to remain referenced as well.  With enough mount/unmount
cycles, the system will eventually run out of memory.

Fix this by adding the missing call to gfs2_rgrp_brelse in
gfs2_clear_rgrpd.

(Also fix a gfs2_rgrp_relse -> gfs2_rgrp_brelse typo in a comment.)

Fixes: 39b0f1e92908 ("GFS2: Don't brelse rgrp buffer_heads every allocation")
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoselinux: check length properly in SCTP bind hook
Ondrej Mosnacek [Tue, 13 Nov 2018 15:16:08 +0000 (16:16 +0100)]
selinux: check length properly in SCTP bind hook

commit c138325fb8713472d5a0c3c7258b9131bab40725 upstream.

selinux_sctp_bind_connect() must verify if the address buffer has
sufficient length before accessing the 'sa_family' field. See
__sctp_connect() for a similar check.

The length of the whole address ('len') is already checked in the
callees.

Reported-by: Qian Cai <cai@gmx.us>
Fixes: d452930fd3b9 ("selinux: Add SCTP support")
Cc: <stable@vger.kernel.org> # 4.17+
Cc: Richard Haines <richard_c_haines@btinternet.com>
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Tested-by: Qian Cai <cai@gmx.us>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agofuse: fix possibly missed wake-up after abort
Miklos Szeredi [Fri, 9 Nov 2018 14:52:16 +0000 (15:52 +0100)]
fuse: fix possibly missed wake-up after abort

commit 2d84a2d19b6150c6dbac1e6ebad9c82e4c123772 upstream.

In current fuse_drop_waiting() implementation it's possible that
fuse_wait_aborted() will not be woken up in the unlikely case that
fuse_abort_conn() + fuse_wait_aborted() runs in between checking
fc->connected and calling atomic_dec(&fc->num_waiting).

Do the atomic_dec_and_test() unconditionally, which also provides the
necessary barrier against reordering with the fc->connected check.

The explicit smp_mb() in fuse_wait_aborted() is not actually needed, since
the spin_unlock() in fuse_abort_conn() provides the necessary RELEASE
barrier after resetting fc->connected.  However, this is not a performance
sensitive path, and adding the explicit barrier makes it easier to
document.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: b8f95e5d13f5 ("fuse: umount should wait for all requests")
Cc: <stable@vger.kernel.org> #v4.19
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agofuse: fix leaked notify reply
Miklos Szeredi [Fri, 9 Nov 2018 14:52:16 +0000 (15:52 +0100)]
fuse: fix leaked notify reply

commit 7fabaf303458fcabb694999d6fa772cc13d4e217 upstream.

fuse_request_send_notify_reply() may fail if the connection was reset for
some reason (e.g. fs was unmounted).  Don't leak request reference in this
case.  Besides leaking memory, this resulted in fc->num_waiting not being
decremented and hence fuse_wait_aborted() left in a hanging and unkillable
state.

Fixes: 2d45ba381a74 ("fuse: add retrieve request")
Fixes: b8f95e5d13f5 ("fuse: umount should wait for all requests")
Reported-and-tested-by: syzbot+6339eda9cb4ebbc4c37b@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> #v2.6.36
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agofuse: fix use-after-free in fuse_direct_IO()
Lukas Czerner [Fri, 9 Nov 2018 13:51:46 +0000 (14:51 +0100)]
fuse: fix use-after-free in fuse_direct_IO()

commit ebacb81273599555a7a19f7754a1451206a5fc4f upstream.

In async IO blocking case the additional reference to the io is taken for
it to survive fuse_aio_complete(). In non blocking case this additional
reference is not needed, however we still reference io to figure out
whether to wait for completion or not. This is wrong and will lead to
use-after-free. Fix it by storing blocking information in separate
variable.

This was spotted by KASAN when running generic/208 fstest.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 744742d692e3 ("fuse: Add reference counting for fuse_io_priv")
Cc: <stable@vger.kernel.org> # v4.6
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agortc: hctosys: Add missing range error reporting
Maciej W. Rozycki [Mon, 5 Nov 2018 03:48:25 +0000 (03:48 +0000)]
rtc: hctosys: Add missing range error reporting

commit 7ce9a992ffde8ce93d5ae5767362a5c7389ae895 upstream.

Fix an issue with the 32-bit range error path in `rtc_hctosys' where no
error code is set and consequently the successful preceding call result
from `rtc_read_time' is propagated to `rtc_hctosys_ret'.  This in turn
makes any subsequent call to `hctosys_show' incorrectly report in sysfs
that the system time has been set from this RTC while it has not.

Set the error to ERANGE then if we can't express the result due to an
overflow.

Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Fixes: b3a5ac42ab18 ("rtc: hctosys: Ensure system time doesn't overflow time_t")
Cc: stable@vger.kernel.org # 4.17+
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agonfsd: COPY and CLONE operations require the saved filehandle to be set
Scott Mayhew [Thu, 8 Nov 2018 16:11:36 +0000 (11:11 -0500)]
nfsd: COPY and CLONE operations require the saved filehandle to be set

commit 01310bb7c9c98752cc763b36532fab028e0f8f81 upstream.

Make sure we have a saved filehandle, otherwise we'll oops with a null
pointer dereference in nfs4_preprocess_stateid_op().

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoNFSv4: Don't exit the state manager without clearing NFS4CLNT_MANAGER_RUNNING
Trond Myklebust [Mon, 5 Nov 2018 16:10:50 +0000 (11:10 -0500)]
NFSv4: Don't exit the state manager without clearing NFS4CLNT_MANAGER_RUNNING

commit 21a446cf186570168b7281b154b1993968598aca upstream.

If we exit the NFSv4 state manager due to a umount, then we can end up
leaving the NFS4CLNT_MANAGER_RUNNING flag set. If another mount causes
the nfs4_client to be rereferenced before it is destroyed, then we end
up never being able to recover state.

Fixes: 47c2199b6eb5 ("NFSv4.1: Ensure state manager thread dies on last ...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agosunrpc: correct the computation for page_ptr when truncating
Frank Sorenson [Tue, 30 Oct 2018 20:10:40 +0000 (15:10 -0500)]
sunrpc: correct the computation for page_ptr when truncating

commit 5d7a5bcb67c70cbc904057ef52d3fcfeb24420bb upstream.

When truncating the encode buffer, the page_ptr is getting
advanced, causing the next page to be skipped while encoding.
The page is still included in the response, so the response
contains a page of bogus data.

We need to adjust the page_ptr backwards to ensure we encode
the next page into the correct place.

We saw this triggered when concurrent directory modifications caused
nfsd4_encode_direct_fattr() to return nfserr_noent, and the resulting
call to xdr_truncate_encode() corrupted the READDIR reply.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agokdb: print real address of pointers instead of hashed addresses
Christophe Leroy [Thu, 27 Sep 2018 17:17:57 +0000 (17:17 +0000)]
kdb: print real address of pointers instead of hashed addresses

commit 568fb6f42ac6851320adaea25f8f1b94de14e40a upstream.

Since commit ad67b74d2469 ("printk: hash addresses printed with %p"),
all pointers printed with %p are printed with hashed addresses
instead of real addresses in order to avoid leaking addresses in
dmesg and syslog. But this applies to kdb too, with is unfortunate:

    Entering kdb (current=0x(ptrval), pid 329) due to Keyboard Entry
    kdb> ps
    15 sleeping system daemon (state M) processes suppressed,
    use 'ps A' to see all.
    Task Addr       Pid   Parent [*] cpu State Thread     Command
    0x(ptrval)      329      328  1    0   R  0x(ptrval) *sh

    0x(ptrval)        1        0  0    0   S  0x(ptrval)  init
    0x(ptrval)        3        2  0    0   D  0x(ptrval)  rcu_gp
    0x(ptrval)        4        2  0    0   D  0x(ptrval)  rcu_par_gp
    0x(ptrval)        5        2  0    0   D  0x(ptrval)  kworker/0:0
    0x(ptrval)        6        2  0    0   D  0x(ptrval)  kworker/0:0H
    0x(ptrval)        7        2  0    0   D  0x(ptrval)  kworker/u2:0
    0x(ptrval)        8        2  0    0   D  0x(ptrval)  mm_percpu_wq
    0x(ptrval)       10        2  0    0   D  0x(ptrval)  rcu_preempt

The whole purpose of kdb is to debug, and for debugging real addresses
need to be known. In addition, data displayed by kdb doesn't go into
dmesg.

This patch replaces all %p by %px in kdb in order to display real
addresses.

Fixes: ad67b74d2469 ("printk: hash addresses printed with %p")
Cc: <stable@vger.kernel.org>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agokdb: use correct pointer when 'btc' calls 'btt'
Christophe Leroy [Thu, 27 Sep 2018 17:17:49 +0000 (17:17 +0000)]
kdb: use correct pointer when 'btc' calls 'btt'

commit dded2e159208a9edc21dd5c5f583afa28d378d39 upstream.

On a powerpc 8xx, 'btc' fails as follows:

Entering kdb (current=0x(ptrval), pid 282) due to Keyboard Entry
kdb> btc
btc: cpu status: Currently on cpu 0
Available cpus: 0
kdb_getarea: Bad address 0x0

when booting the kernel with 'debug_boot_weak_hash', it fails as well

Entering kdb (current=0xba99ad80, pid 284) due to Keyboard Entry
kdb> btc
btc: cpu status: Currently on cpu 0
Available cpus: 0
kdb_getarea: Bad address 0xba99ad80

On other platforms, Oopses have been observed too, see
https://github.com/linuxppc/linux/issues/139

This is due to btc calling 'btt' with %p pointer as an argument.

This patch replaces %p by %px to get the real pointer value as
expected by 'btt'

Fixes: ad67b74d2469 ("printk: hash addresses printed with %p")
Cc: <stable@vger.kernel.org>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoARM: cpuidle: Don't register the driver when back-end init returns -ENXIO
Ulf Hansson [Thu, 1 Nov 2018 12:22:38 +0000 (13:22 +0100)]
ARM: cpuidle: Don't register the driver when back-end init returns -ENXIO

commit 763f191af51f127cf8e69cd361f50bf6180768a5 upstream.

There's no point to register the cpuidle driver for the current CPU, when
the initialization of the arch specific back-end data fails by returning
-ENXIO.

Instead, let's re-order the sequence to its original flow, by first trying
to initialize the back-end part and then act accordingly on the returned
error code. Additionally, let's print the error message, no matter of what
error code that was returned.

Fixes: a0d46a3dfdc3 (ARM: cpuidle: Register per cpuidle device)
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: 4.19+ <stable@vger.kernel.org> # v4.19+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agouapi: fix linux/kfd_ioctl.h userspace compilation errors
Dmitry V. Levin [Thu, 1 Nov 2018 11:03:08 +0000 (14:03 +0300)]
uapi: fix linux/kfd_ioctl.h userspace compilation errors

commit aba118389a6fb2ad7958de0f37b5869852bd38cf upstream.

Consistently use types provided by <linux/types.h> via <drm/drm.h>
to fix the following linux/kfd_ioctl.h userspace compilation errors:

/usr/include/linux/kfd_ioctl.h:250:2: error: unknown type name 'uint32_t'
  uint32_t reset_type;
/usr/include/linux/kfd_ioctl.h:251:2: error: unknown type name 'uint32_t'
  uint32_t reset_cause;
/usr/include/linux/kfd_ioctl.h:252:2: error: unknown type name 'uint32_t'
  uint32_t memory_lost;
/usr/include/linux/kfd_ioctl.h:253:2: error: unknown type name 'uint32_t'
  uint32_t gpu_id;

Fixes: 0c119abad7f0d ("drm/amd: Add kfd ioctl defines for hw_exception event")
Cc: <stable@vger.kernel.org> # v4.19
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agomnt: fix __detach_mounts infinite loop
Benjamin Coddington [Wed, 3 Oct 2018 14:18:33 +0000 (10:18 -0400)]
mnt: fix __detach_mounts infinite loop

commit 1e9c75fb9c47a75a9aec0cd17db5f6dc36b58e00 upstream.

Since commit ff17fa561a04 ("d_invalidate(): unhash immediately")
immediately unhashes the dentry, we'll never return the mountpoint in
lookup_mountpoint(), which can lead to an unbreakable loop in
d_invalidate().

I have reports of NFS clients getting into this condition after the server
removes an export of an existing mount created through follow_automount(),
but I suspect there are various other ways to produce this problem if we
hunt down users of d_invalidate().  For example, it is possible to get into
this state by using XFS' d_invalidate() call in xfs_vn_unlink():

truncate -s 100m img{1,2}

mkfs.xfs -q -n version=ci img1
mkfs.xfs -q -n version=ci img2

mkdir -p /mnt/xfs
mount img1 /mnt/xfs

mkdir /mnt/xfs/sub1
mount img2 /mnt/xfs/sub1

cat > /mnt/xfs/sub1/foo &
umount -l /mnt/xfs/sub1
mount img2 /mnt/xfs/sub1

mount --make-private /mnt/xfs

mkdir /mnt/xfs/sub2
mount --move /mnt/xfs/sub1 /mnt/xfs/sub2
rmdir /mnt/xfs/sub1

Fix this by moving the check for an unlinked dentry out of the
detach_mounts() path.

Fixes: ff17fa561a04 ("d_invalidate(): unhash immediately")
Cc: stable@vger.kernel.org
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agomount: Prevent MNT_DETACH from disconnecting locked mounts
Eric W. Biederman [Thu, 25 Oct 2018 17:05:11 +0000 (12:05 -0500)]
mount: Prevent MNT_DETACH from disconnecting locked mounts

commit 9c8e0a1b683525464a2abe9fb4b54404a50ed2b4 upstream.

Timothy Baldwin <timbaldwin@fastmail.co.uk> wrote:
> As per mount_namespaces(7) unprivileged users should not be able to look under mount points:
>
>   Mounts that come as a single unit from more privileged mount are locked
>   together and may not be separated in a less privileged mount namespace.
>
> However they can:
>
> 1. Create a mount namespace.
> 2. In the mount namespace open a file descriptor to the parent of a mount point.
> 3. Destroy the mount namespace.
> 4. Use the file descriptor to look under the mount point.
>
> I have reproduced this with Linux 4.16.18 and Linux 4.18-rc8.
>
> The setup:
>
> $ sudo sysctl kernel.unprivileged_userns_clone=1
> kernel.unprivileged_userns_clone = 1
> $ mkdir -p A/B/Secret
> $ sudo mount -t tmpfs hide A/B
>
>
> "Secret" is indeed hidden as expected:
>
> $ ls -lR A
> A:
> total 0
> drwxrwxrwt 2 root root 40 Feb 12 21:08 B
>
> A/B:
> total 0
>
>
> The attack revealing "Secret":
>
> $ unshare -Umr sh -c "exec unshare -m ls -lR /proc/self/fd/4/ 4<A"
> /proc/self/fd/4/:
> total 0
> drwxr-xr-x 3 root root 60 Feb 12 21:08 B
>
> /proc/self/fd/4/B:
> total 0
> drwxr-xr-x 2 root root 40 Feb 12 21:08 Secret
>
> /proc/self/fd/4/B/Secret:
> total 0

I tracked this down to put_mnt_ns running passing UMOUNT_SYNC and
disconnecting all of the mounts in a mount namespace.  Fix this by
factoring drop_mounts out of drop_collected_mounts and passing
0 instead of UMOUNT_SYNC.

There are two possible behavior differences that result from this.
- No longer setting UMOUNT_SYNC will no longer set MNT_SYNC_UMOUNT on
  the vfsmounts being unmounted.  This effects the lazy rcu walk by
  kicking the walk out of rcu mode and forcing it to be a non-lazy
  walk.
- No longer disconnecting locked mounts will keep some mounts around
  longer as they stay because the are locked to other mounts.

There are only two users of drop_collected mounts: audit_tree.c and
put_mnt_ns.

In audit_tree.c the mounts are private and there are no rcu lazy walks
only calls to iterate_mounts. So the changes should have no effect
except for a small timing effect as the connected mounts are disconnected.

In put_mnt_ns there may be references from process outside the mount
namespace to the mounts.  So the mounts remaining connected will
be the bug fix that is needed.  That rcu walks are allowed to continue
appears not to be a problem especially as the rcu walk change was about
an implementation detail not about semantics.

Cc: stable@vger.kernel.org
Fixes: 5ff9d8a65ce8 ("vfs: Lock in place mounts from more privileged users")
Reported-by: Timothy Baldwin <timbaldwin@fastmail.co.uk>
Tested-by: Timothy Baldwin <timbaldwin@fastmail.co.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agomount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts
Eric W. Biederman [Thu, 25 Oct 2018 14:04:18 +0000 (09:04 -0500)]
mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts

commit df7342b240185d58d3d9665c0bbf0a0f5570ec29 upstream.

Jonathan Calmels from NVIDIA reported that he's able to bypass the
mount visibility security check in place in the Linux kernel by using
a combination of the unbindable property along with the private mount
propagation option to allow a unprivileged user to see a path which
was purposefully hidden by the root user.

Reproducer:
  # Hide a path to all users using a tmpfs
  root@castiana:~# mount -t tmpfs tmpfs /sys/devices/
  root@castiana:~#

  # As an unprivileged user, unshare user namespace and mount namespace
  stgraber@castiana:~$ unshare -U -m -r

  # Confirm the path is still not accessible
  root@castiana:~# ls /sys/devices/

  # Make /sys recursively unbindable and private
  root@castiana:~# mount --make-runbindable /sys
  root@castiana:~# mount --make-private /sys

  # Recursively bind-mount the rest of /sys over to /mnnt
  root@castiana:~# mount --rbind /sys/ /mnt

  # Access our hidden /sys/device as an unprivileged user
  root@castiana:~# ls /mnt/devices/
  breakpoint cpu cstate_core cstate_pkg i915 intel_pt isa kprobe
  LNXSYSTM:00 msr pci0000:00 platform pnp0 power software system
  tracepoint uncore_arb uncore_cbox_0 uncore_cbox_1 uprobe virtual

Solve this by teaching copy_tree to fail if a mount turns out to be
both unbindable and locked.

Cc: stable@vger.kernel.org
Fixes: 5ff9d8a65ce8 ("vfs: Lock in place mounts from more privileged users")
Reported-by: Jonathan Calmels <jcalmels@nvidia.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agomount: Retest MNT_LOCKED in do_umount
Eric W. Biederman [Mon, 22 Oct 2018 15:21:38 +0000 (10:21 -0500)]
mount: Retest MNT_LOCKED in do_umount

commit 25d202ed820ee347edec0bf3bf553544556bf64b upstream.

It was recently pointed out that the one instance of testing MNT_LOCKED
outside of the namespace_sem is in ksys_umount.

Fix that by adding a test inside of do_umount with namespace_sem and
the mount_lock held.  As it helps to fail fails the existing test is
maintained with an additional comment pointing out that it may be racy
because the locks are not held.

Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Fixes: 5ff9d8a65ce8 ("vfs: Lock in place mounts from more privileged users")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoext4: fix buffer leak in __ext4_read_dirblock() on error path
Vasily Averin [Thu, 8 Nov 2018 03:36:23 +0000 (22:36 -0500)]
ext4: fix buffer leak in __ext4_read_dirblock() on error path

commit de59fae0043f07de5d25e02ca360f7d57bfa5866 upstream.

Fixes: dc6982ff4db1 ("ext4: refactor code to read directory blocks ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 3.9
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoext4: fix buffer leak in ext4_expand_extra_isize_ea() on error path
Vasily Averin [Wed, 7 Nov 2018 16:14:35 +0000 (11:14 -0500)]
ext4: fix buffer leak in ext4_expand_extra_isize_ea() on error path

commit 53692ec074d00589c2cf1d6d17ca76ad0adce6ec upstream.

Fixes: de05ca852679 ("ext4: move call to ext4_error() into ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.17
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoext4: fix buffer leak in ext4_xattr_move_to_block() on error path
Vasily Averin [Wed, 7 Nov 2018 16:10:21 +0000 (11:10 -0500)]
ext4: fix buffer leak in ext4_xattr_move_to_block() on error path

commit 6bdc9977fcdedf47118d2caf7270a19f4b6d8a8f upstream.

Fixes: 3f2571c1f91f ("ext4: factor out xattr moving")
Fixes: 6dd4ee7cab7e ("ext4: Expand extra_inodes space per ...")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 2.6.23
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoext4: release bs.bh before re-using in ext4_xattr_block_find()
Vasily Averin [Wed, 7 Nov 2018 16:07:01 +0000 (11:07 -0500)]
ext4: release bs.bh before re-using in ext4_xattr_block_find()

commit 45ae932d246f721e6584430017176cbcadfde610 upstream.

bs.bh was taken in previous ext4_xattr_block_find() call,
it should be released before re-using

Fixes: 7e01c8e5420b ("ext3/4: fix uninitialized bs in ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 2.6.26
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoext4: fix buffer leak in ext4_xattr_get_block() on error path
Vasily Averin [Wed, 7 Nov 2018 16:01:33 +0000 (11:01 -0500)]
ext4: fix buffer leak in ext4_xattr_get_block() on error path

commit ecaaf408478b6fb4d9986f9b6652f3824e374f4c upstream.

Fixes: dec214d00e0d ("ext4: xattr inode deduplication")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.13
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoext4: fix possible leak of s_journal_flag_rwsem in error path
Vasily Averin [Wed, 7 Nov 2018 15:56:28 +0000 (10:56 -0500)]
ext4: fix possible leak of s_journal_flag_rwsem in error path

commit af18e35bfd01e6d65a5e3ef84ffe8b252d1628c5 upstream.

Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.7
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoext4: fix possible leak of sbi->s_group_desc_leak in error path
Theodore Ts'o [Wed, 7 Nov 2018 15:32:53 +0000 (10:32 -0500)]
ext4: fix possible leak of sbi->s_group_desc_leak in error path

commit 9e463084cdb22e0b56b2dfbc50461020409a5fd3 upstream.

Fixes: bfe0a5f47ada ("ext4: add more mount time checks of the superblock")
Reported-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.18
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoext4: avoid possible double brelse() in add_new_gdb() on error path
Theodore Ts'o [Tue, 6 Nov 2018 22:18:17 +0000 (17:18 -0500)]
ext4: avoid possible double brelse() in add_new_gdb() on error path

commit 4f32c38b4662312dd3c5f113d8bdd459887fb773 upstream.

Fixes: b40971426a83 ("ext4: add error checking to calls to ...")
Reported-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 2.6.38
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoext4: fix missing cleanup if ext4_alloc_flex_bg_array() fails while resizing
Vasily Averin [Tue, 6 Nov 2018 21:16:01 +0000 (16:16 -0500)]
ext4: fix missing cleanup if ext4_alloc_flex_bg_array() fails while resizing

commit f348e2241fb73515d65b5d77dd9c174128a7fbf2 upstream.

Fixes: 117fff10d7f1 ("ext4: grow the s_flex_groups array as needed ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 3.7
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoext4: avoid buffer leak in ext4_orphan_add() after prior errors
Vasily Averin [Tue, 6 Nov 2018 22:01:36 +0000 (17:01 -0500)]
ext4: avoid buffer leak in ext4_orphan_add() after prior errors

commit feaf264ce7f8d54582e2f66eb82dd9dd124c94f3 upstream.

Fixes: d745a8c20c1f ("ext4: reduce contention on s_orphan_lock")
Fixes: 6e3617e579e0 ("ext4: Handle non empty on-disk orphan link")
Cc: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 2.6.34
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoext4: avoid buffer leak on shutdown in ext4_mark_iloc_dirty()
Vasily Averin [Tue, 6 Nov 2018 21:49:50 +0000 (16:49 -0500)]
ext4: avoid buffer leak on shutdown in ext4_mark_iloc_dirty()

commit a6758309a005060b8297a538a457c88699cb2520 upstream.

ext4_mark_iloc_dirty() callers expect that it releases iloc->bh
even if it returns an error.

Fixes: 0db1ff222d40 ("ext4: add shutdown bit and check for it")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.11
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoext4: fix possible inode leak in the retry loop of ext4_resize_fs()
Vasily Averin [Tue, 6 Nov 2018 21:20:40 +0000 (16:20 -0500)]
ext4: fix possible inode leak in the retry loop of ext4_resize_fs()

commit db6aee62406d9fbb53315fcddd81f1dc271d49fa upstream.

Fixes: 1c6bd7173d66 ("ext4: convert file system to meta_bg if needed ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 3.7
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoext4: missing !bh check in ext4_xattr_inode_write()
Vasily Averin [Fri, 9 Nov 2018 16:34:40 +0000 (11:34 -0500)]
ext4: missing !bh check in ext4_xattr_inode_write()

commit eb6984fa4ce2837dcb1f66720a600f31b0bb3739 upstream.

According to Ted Ts'o ext4_getblk() called in ext4_xattr_inode_write()
should not return bh = NULL

The only time that bh could be NULL, then, would be in the case of
something really going wrong; a programming error elsewhere (perhaps a
wild pointer dereference) or I/O error causing on-disk file system
corruption (although that would be highly unlikely given that we had
*just* allocated the blocks and so the metadata blocks in question
probably would still be in the cache).

Fixes: e50e5129f384 ("ext4: xattr-in-inode support")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.13
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoext4: avoid potential extra brelse in setup_new_flex_group_blocks()
Vasily Averin [Sat, 3 Nov 2018 20:13:17 +0000 (16:13 -0400)]
ext4: avoid potential extra brelse in setup_new_flex_group_blocks()

commit 9e4028935cca3f9ef9b6a90df9da6f1f94853536 upstream.

Currently bh is set to NULL only during first iteration of for cycle,
then this pointer is not cleared after end of using.
Therefore rollback after errors can lead to extra brelse(bh) call,
decrements bh counter and later trigger an unexpected warning in __brelse()

Patch moves brelse() calls in body of cycle to exclude requirement of
brelse() call in rollback.

Fixes: 33afdcc5402d ("ext4: add a function which sets up group blocks ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 3.3+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoext4: add missing brelse() add_new_gdb_meta_bg()'s error path
Vasily Averin [Sat, 3 Nov 2018 20:50:08 +0000 (16:50 -0400)]
ext4: add missing brelse() add_new_gdb_meta_bg()'s error path

commit 61a9c11e5e7a0dab5381afa5d9d4dd5ebf18f7a0 upstream.

Fixes: 01f795f9e0d6 ("ext4: add online resizing support for meta_bg ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 3.7
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoext4: add missing brelse() in set_flexbg_block_bitmap()'s error path
Vasily Averin [Sat, 3 Nov 2018 20:22:10 +0000 (16:22 -0400)]
ext4: add missing brelse() in set_flexbg_block_bitmap()'s error path

commit cea5794122125bf67559906a0762186cf417099c upstream.

Fixes: 33afdcc5402d ("ext4: add a function which sets up group blocks ...")
Cc: stable@kernel.org # 3.3
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoext4: add missing brelse() update_backups()'s error path
Vasily Averin [Sat, 3 Nov 2018 21:11:19 +0000 (17:11 -0400)]
ext4: add missing brelse() update_backups()'s error path

commit ea0abbb648452cdb6e1734b702b6330a7448fcf8 upstream.

Fixes: ac27a0ec112a ("ext4: initial copy of files from ext3")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 2.6.19
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoclockevents/drivers/i8253: Add support for PIT shutdown quirk
Michael Kelley [Sun, 4 Nov 2018 03:48:54 +0000 (03:48 +0000)]
clockevents/drivers/i8253: Add support for PIT shutdown quirk

commit 35b69a420bfb56b7b74cb635ea903db05e357bec upstream.

Add support for platforms where pit_shutdown() doesn't work because of a
quirk in the PIT emulation. On these platforms setting the counter register
to zero causes the PIT to start running again, negating the shutdown.

Provide a global variable that controls whether the counter register is
zero'ed, which platform specific code can override.

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
Cc: "devel@linuxdriverproject.org" <devel@linuxdriverproject.org>
Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>
Cc: "virtualization@lists.linux-foundation.org" <virtualization@lists.linux-foundation.org>
Cc: "jgross@suse.com" <jgross@suse.com>
Cc: "akataria@vmware.com" <akataria@vmware.com>
Cc: "olaf@aepfle.de" <olaf@aepfle.de>
Cc: "apw@canonical.com" <apw@canonical.com>
Cc: vkuznets <vkuznets@redhat.com>
Cc: "jasowang@redhat.com" <jasowang@redhat.com>
Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>
Cc: KY Srinivasan <kys@microsoft.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1541303219-11142-2-git-send-email-mikelley@microsoft.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agobtrfs: tree-checker: Fix misleading group system information
Shaokun Zhang [Mon, 5 Nov 2018 10:49:09 +0000 (18:49 +0800)]
btrfs: tree-checker: Fix misleading group system information

commit 761333f2f50ccc887aa9957ae829300262c0d15b upstream.

block_group_err shows the group system as a decimal value with a '0x'
prefix, which is somewhat misleading.

Fix it to print hexadecimal, as was intended.

Fixes: fce466eab7ac6 ("btrfs: tree-checker: Verify block_group_item")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoBtrfs: fix data corruption due to cloning of eof block
Filipe Manana [Mon, 5 Nov 2018 11:14:17 +0000 (11:14 +0000)]
Btrfs: fix data corruption due to cloning of eof block

commit ac765f83f1397646c11092a032d4f62c3d478b81 upstream.

We currently allow cloning a range from a file which includes the last
block of the file even if the file's size is not aligned to the block
size. This is fine and useful when the destination file has the same size,
but when it does not and the range ends somewhere in the middle of the
destination file, it leads to corruption because the bytes between the EOF
and the end of the block have undefined data (when there is support for
discard/trimming they have a value of 0x00).

Example:

 $ mkfs.btrfs -f /dev/sdb
 $ mount /dev/sdb /mnt

 $ export foo_size=$((256 * 1024 + 100))
 $ xfs_io -f -c "pwrite -S 0x3c 0 $foo_size" /mnt/foo
 $ xfs_io -f -c "pwrite -S 0xb5 0 1M" /mnt/bar

 $ xfs_io -c "reflink /mnt/foo 0 512K $foo_size" /mnt/bar

 $ od -A d -t x1 /mnt/bar
 0000000 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
 *
 0524288 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c
 *
 0786528 3c 3c 3c 3c 00 00 00 00 00 00 00 00 00 00 00 00
 0786544 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 *
 0790528 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
 *
 1048576

The bytes in the range from 786532 (512Kb + 256Kb + 100 bytes) to 790527
(512Kb + 256Kb + 4Kb - 1) got corrupted, having now a value of 0x00 instead
of 0xb5.

This is similar to the problem we had for deduplication that got recently
fixed by commit de02b9f6bb65 ("Btrfs: fix data corruption when
deduplicating between different files").

Fix this by not allowing such operations to be performed and return the
errno -EINVAL to user space. This is what XFS is doing as well at the VFS
level. This change however now makes us return -EINVAL instead of
-EOPNOTSUPP for cases where the source range maps to an inline extent and
the destination range's end is smaller then the destination file's size,
since the detection of inline extents is done during the actual process of
dropping file extent items (at __btrfs_drop_extents()). Returning the
-EINVAL error is done early on and solely based on the input parameters
(offsets and length) and destination file's size. This makes us consistent
with XFS and anyone else supporting cloning since this case is now checked
at a higher level in the VFS and is where the -EINVAL will be returned
from starting with kernel 4.20 (the VFS changed was introduced in 4.20-rc1
by commit 07d19dc9fbe9 ("vfs: avoid problematic remapping requests into
partial EOF block"). So this change is more geared towards stable kernels,
as it's unlikely the new VFS checks get removed intentionally.

A test case for fstests follows soon, as well as an update to filter
existing tests that expect -EOPNOTSUPP to accept -EINVAL as well.

CC: <stable@vger.kernel.org> # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoBtrfs: fix infinite loop on inode eviction after deduplication of eof block
Filipe Manana [Mon, 5 Nov 2018 11:14:05 +0000 (11:14 +0000)]
Btrfs: fix infinite loop on inode eviction after deduplication of eof block

commit 11023d3f5fdf89bba5e1142127701ca6e6014587 upstream.

If we attempt to deduplicate the last block of a file A into the middle of
a file B, and file A's size is not a multiple of the block size, we end
rounding the deduplication length to 0 bytes, to avoid the data corruption
issue fixed by commit de02b9f6bb65 ("Btrfs: fix data corruption when
deduplicating between different files"). However a length of zero will
cause the insertion of an extent state with a start value greater (by 1)
then the end value, leading to a corrupt extent state that will trigger a
warning and cause chaos such as an infinite loop during inode eviction.
Example trace:

 [96049.833585] ------------[ cut here ]------------
 [96049.833714] WARNING: CPU: 0 PID: 24448 at fs/btrfs/extent_io.c:436 insert_state+0x101/0x120 [btrfs]
 [96049.833767] CPU: 0 PID: 24448 Comm: xfs_io Not tainted 4.19.0-rc7-btrfs-next-39 #1
 [96049.833768] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
 [96049.833780] RIP: 0010:insert_state+0x101/0x120 [btrfs]
 [96049.833783] RSP: 0018:ffffafd2c3707af0 EFLAGS: 00010282
 [96049.833785] RAX: 0000000000000000 RBX: 000000000004dfff RCX: 0000000000000006
 [96049.833786] RDX: 0000000000000007 RSI: ffff99045c143230 RDI: ffff99047b2168a0
 [96049.833787] RBP: ffff990457851cd0 R08: 0000000000000001 R09: 0000000000000000
 [96049.833787] R10: ffffafd2c3707ab8 R11: 0000000000000000 R12: ffff9903b93b12c8
 [96049.833788] R13: 000000000004e000 R14: ffffafd2c3707b80 R15: ffffafd2c3707b78
 [96049.833790] FS:  00007f5c14e7d700(0000) GS:ffff99047b200000(0000) knlGS:0000000000000000
 [96049.833791] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [96049.833792] CR2: 00007f5c146abff8 CR3: 0000000115f4c004 CR4: 00000000003606f0
 [96049.833795] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [96049.833796] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [96049.833796] Call Trace:
 [96049.833809]  __set_extent_bit+0x46c/0x6a0 [btrfs]
 [96049.833823]  lock_extent_bits+0x6b/0x210 [btrfs]
 [96049.833831]  ? _raw_spin_unlock+0x24/0x30
 [96049.833841]  ? test_range_bit+0xdf/0x130 [btrfs]
 [96049.833853]  lock_extent_range+0x8e/0x150 [btrfs]
 [96049.833864]  btrfs_double_extent_lock+0x78/0xb0 [btrfs]
 [96049.833875]  btrfs_extent_same_range+0x14e/0x550 [btrfs]
 [96049.833885]  ? rcu_read_lock_sched_held+0x3f/0x70
 [96049.833890]  ? __kmalloc_node+0x2b0/0x2f0
 [96049.833899]  ? btrfs_dedupe_file_range+0x19a/0x280 [btrfs]
 [96049.833909]  btrfs_dedupe_file_range+0x270/0x280 [btrfs]
 [96049.833916]  vfs_dedupe_file_range_one+0xd9/0xe0
 [96049.833919]  vfs_dedupe_file_range+0x131/0x1b0
 [96049.833924]  do_vfs_ioctl+0x272/0x6e0
 [96049.833927]  ? __fget+0x113/0x200
 [96049.833931]  ksys_ioctl+0x70/0x80
 [96049.833933]  __x64_sys_ioctl+0x16/0x20
 [96049.833937]  do_syscall_64+0x60/0x1b0
 [96049.833939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 [96049.833941] RIP: 0033:0x7f5c1478ddd7
 [96049.833943] RSP: 002b:00007ffe15b196a8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
 [96049.833945] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5c1478ddd7
 [96049.833946] RDX: 00005625ece322d0 RSI: 00000000c0189436 RDI: 0000000000000004
 [96049.833947] RBP: 0000000000000000 R08: 00007f5c14a46f48 R09: 0000000000000040
 [96049.833948] R10: 0000000000000541 R11: 0000000000000202 R12: 0000000000000000
 [96049.833949] R13: 0000000000000000 R14: 0000000000000004 R15: 00005625ece322d0
 [96049.833954] irq event stamp: 6196
 [96049.833956] hardirqs last  enabled at (6195): [<ffffffff91b00663>] console_unlock+0x503/0x640
 [96049.833958] hardirqs last disabled at (6196): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
 [96049.833959] softirqs last  enabled at (6114): [<ffffffff92600370>] __do_softirq+0x370/0x421
 [96049.833964] softirqs last disabled at (6095): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0
 [96049.833965] ---[ end trace db7b05f01b7fa10c ]---
 [96049.935816] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910
 [96049.935822] irq event stamp: 6584
 [96049.935823] hardirqs last  enabled at (6583): [<ffffffff91b00663>] console_unlock+0x503/0x640
 [96049.935825] hardirqs last disabled at (6584): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
 [96049.935827] softirqs last  enabled at (6328): [<ffffffff92600370>] __do_softirq+0x370/0x421
 [96049.935828] softirqs last disabled at (6313): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0
 [96049.935829] ---[ end trace db7b05f01b7fa123 ]---
 [96049.935840] ------------[ cut here ]------------
 [96049.936065] WARNING: CPU: 1 PID: 24463 at fs/btrfs/extent_io.c:436 insert_state+0x101/0x120 [btrfs]
 [96049.936107] CPU: 1 PID: 24463 Comm: umount Tainted: G        W         4.19.0-rc7-btrfs-next-39 #1
 [96049.936108] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
 [96049.936117] RIP: 0010:insert_state+0x101/0x120 [btrfs]
 [96049.936119] RSP: 0018:ffffafd2c3637bc0 EFLAGS: 00010282
 [96049.936120] RAX: 0000000000000000 RBX: 000000000004dfff RCX: 0000000000000006
 [96049.936121] RDX: 0000000000000007 RSI: ffff990445cf88e0 RDI: ffff99047b2968a0
 [96049.936122] RBP: ffff990457851cd0 R08: 0000000000000001 R09: 0000000000000000
 [96049.936123] R10: ffffafd2c3637b88 R11: 0000000000000000 R12: ffff9904574301e8
 [96049.936124] R13: 000000000004e000 R14: ffffafd2c3637c50 R15: ffffafd2c3637c48
 [96049.936125] FS:  00007fe4b87e72c0(0000) GS:ffff99047b280000(0000) knlGS:0000000000000000
 [96049.936126] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [96049.936128] CR2: 00005562e52618d8 CR3: 00000001151c8005 CR4: 00000000003606e0
 [96049.936129] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [96049.936131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [96049.936131] Call Trace:
 [96049.936141]  __set_extent_bit+0x46c/0x6a0 [btrfs]
 [96049.936154]  lock_extent_bits+0x6b/0x210 [btrfs]
 [96049.936167]  btrfs_evict_inode+0x1e1/0x5a0 [btrfs]
 [96049.936172]  evict+0xbf/0x1c0
 [96049.936174]  dispose_list+0x51/0x80
 [96049.936176]  evict_inodes+0x193/0x1c0
 [96049.936180]  generic_shutdown_super+0x3f/0x110
 [96049.936182]  kill_anon_super+0xe/0x30
 [96049.936189]  btrfs_kill_super+0x13/0x100 [btrfs]
 [96049.936191]  deactivate_locked_super+0x3a/0x70
 [96049.936193]  cleanup_mnt+0x3b/0x80
 [96049.936195]  task_work_run+0x93/0xc0
 [96049.936198]  exit_to_usermode_loop+0xfa/0x100
 [96049.936201]  do_syscall_64+0x17f/0x1b0
 [96049.936202]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 [96049.936204] RIP: 0033:0x7fe4b80cfb37
 [96049.936206] RSP: 002b:00007ffff092b688 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 [96049.936207] RAX: 0000000000000000 RBX: 00005562e5259060 RCX: 00007fe4b80cfb37
 [96049.936208] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005562e525faa0
 [96049.936209] RBP: 00005562e525faa0 R08: 00005562e525f770 R09: 0000000000000015
 [96049.936210] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fe4b85d1e64
 [96049.936211] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910
 [96049.936211] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910
 [96049.936216] irq event stamp: 6616
 [96049.936219] hardirqs last  enabled at (6615): [<ffffffff91b00663>] console_unlock+0x503/0x640
 [96049.936219] hardirqs last disabled at (6616): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
 [96049.936222] softirqs last  enabled at (6328): [<ffffffff92600370>] __do_softirq+0x370/0x421
 [96049.936222] softirqs last disabled at (6313): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0
 [96049.936223] ---[ end trace db7b05f01b7fa124 ]---

The second stack trace, from inode eviction, is repeated forever due to
the infinite loop during eviction.

This is the same type of problem fixed way back in 2015 by commit
113e8283869b ("Btrfs: fix inode eviction infinite loop after extent_same
ioctl") and commit ccccf3d67294 ("Btrfs: fix inode eviction infinite loop
after cloning into it").

So fix this by returning immediately if the deduplication range length
gets rounded down to 0 bytes, as there is nothing that needs to be done in
such case.

Example reproducer:

 $ mkfs.btrfs -f /dev/sdb
 $ mount /dev/sdb /mnt

 $ xfs_io -f -c "pwrite -S 0xe6 0 100" /mnt/foo
 $ xfs_io -f -c "pwrite -S 0xe6 0 1M" /mnt/bar

 # Unmount the filesystem and mount it again so that we start without any
 # extent state records when we ask for the deduplication.
 $ umount /mnt
 $ mount /dev/sdb /mnt

 $ xfs_io -c "dedupe /mnt/foo 0 500K 100" /mnt/bar

 # This unmount triggers the infinite loop.
 $ umount /mnt

A test case for fstests will follow soon.

Fixes: de02b9f6bb65 ("Btrfs: fix data corruption when deduplicating between different files")
CC: <stable@vger.kernel.org> # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoBtrfs: fix cur_offset in the error case for nocow
Robbie Ko [Tue, 30 Oct 2018 10:04:04 +0000 (18:04 +0800)]
Btrfs: fix cur_offset in the error case for nocow

commit 506481b20e818db40b6198815904ecd2d6daee64 upstream.

When the cow_file_range fails, the related resources are unlocked
according to the range [start..end), so the unlock cannot be repeated in
run_delalloc_nocow.

In some cases (e.g. cur_offset <= end && cow_start != -1), cur_offset is
not updated correctly, so move the cur_offset update before
cow_file_range.

  kernel BUG at mm/page-writeback.c:2663!
  Internal error: Oops - BUG: 0 [#1] SMP
  CPU: 3 PID: 31525 Comm: kworker/u8:7 Tainted: P O
  Hardware name: Realtek_RTD1296 (DT)
  Workqueue: writeback wb_workfn (flush-btrfs-1)
  task: ffffffc076db3380 ti: ffffffc02e9ac000 task.ti: ffffffc02e9ac000
  PC is at clear_page_dirty_for_io+0x1bc/0x1e8
  LR is at clear_page_dirty_for_io+0x14/0x1e8
  pc : [<ffffffc00033c91c>] lr : [<ffffffc00033c774>] pstate: 40000145
  sp : ffffffc02e9af4f0
  Process kworker/u8:7 (pid: 31525, stack limit = 0xffffffc02e9ac020)
  Call trace:
  [<ffffffc00033c91c>] clear_page_dirty_for_io+0x1bc/0x1e8
  [<ffffffbffc514674>] extent_clear_unlock_delalloc+0x1e4/0x210 [btrfs]
  [<ffffffbffc4fb168>] run_delalloc_nocow+0x3b8/0x948 [btrfs]
  [<ffffffbffc4fb948>] run_delalloc_range+0x250/0x3a8 [btrfs]
  [<ffffffbffc514c0c>] writepage_delalloc.isra.21+0xbc/0x1d8 [btrfs]
  [<ffffffbffc516048>] __extent_writepage+0xe8/0x248 [btrfs]
  [<ffffffbffc51630c>] extent_write_cache_pages.isra.17+0x164/0x378 [btrfs]
  [<ffffffbffc5185a8>] extent_writepages+0x48/0x68 [btrfs]
  [<ffffffbffc4f5828>] btrfs_writepages+0x20/0x30 [btrfs]
  [<ffffffc00033d758>] do_writepages+0x30/0x88
  [<ffffffc0003ba0f4>] __writeback_single_inode+0x34/0x198
  [<ffffffc0003ba6c4>] writeback_sb_inodes+0x184/0x3c0
  [<ffffffc0003ba96c>] __writeback_inodes_wb+0x6c/0xc0
  [<ffffffc0003bac20>] wb_writeback+0x1b8/0x1c0
  [<ffffffc0003bb0f0>] wb_workfn+0x150/0x250
  [<ffffffc0002b0014>] process_one_work+0x1dc/0x388
  [<ffffffc0002b02f0>] worker_thread+0x130/0x500
  [<ffffffc0002b6344>] kthread+0x10c/0x110
  [<ffffffc000284590>] ret_from_fork+0x10/0x40
  Code: d503201f a9025bb5 a90363b7 f90023b9 (d4210000)

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoBtrfs: fix missing data checksums after a ranged fsync (msync)
Filipe Manana [Mon, 29 Oct 2018 09:42:06 +0000 (09:42 +0000)]
Btrfs: fix missing data checksums after a ranged fsync (msync)

commit 008c6753f7e070c77c70d708a6bf0255b4381763 upstream.

Recently we got a massive simplification for fsync, where for the fast
path we no longer log new extents while their respective ordered extents
are still running.

However that simplification introduced a subtle regression for the case
where we use a ranged fsync (msync). Consider the following example:

               CPU 0                                    CPU 1

                                            mmap write to range [2Mb, 4Mb[
  mmap write to range [512Kb, 1Mb[
  msync range [512K, 1Mb[
    --> triggers fast fsync
        (BTRFS_INODE_NEEDS_FULL_SYNC
         not set)
    --> creates extent map A for this
        range and adds it to list of
        modified extents
    --> starts ordered extent A for
        this range
    --> waits for it to complete

                                            writeback triggered for range
                                            [2Mb, 4Mb[
                                              --> create extent map B and
                                                  adds it to the list of
                                                  modified extents
                                              --> creates ordered extent B

    --> start looking for and logging
        modified extents
    --> logs extent maps A and B
    --> finds checksums for extent A
        in the csum tree, but not for
        extent B
  fsync (msync) finishes

                                              --> ordered extent B
                                                  finishes and its
                                                  checksums are added
                                                  to the csum tree

                                <power cut>

After replaying the log, we have the extent covering the range [2Mb, 4Mb[
but do not have the data checksum items covering that file range.

This happens because at the very beginning of an fsync (btrfs_sync_file())
we start and wait for IO in the given range [512Kb, 1Mb[ and therefore
wait for any ordered extents in that range to complete before we start
logging the extents. However if right before we start logging the extent
in our range [512Kb, 1Mb[, writeback is started for any other dirty range,
such as the range [2Mb, 4Mb[ due to memory pressure or a concurrent fsync
or msync (btrfs_sync_file() starts writeback before acquiring the inode's
lock), an ordered extent is created for that other range and a new extent
map is created to represent that range and added to the inode's list of
modified extents.

That means that we will see that other extent in that list when collecting
extents for logging (done at btrfs_log_changed_extents()) and log the
extent before the respective ordered extent finishes - namely before the
checksum items are added to the checksums tree, which is where
log_extent_csums() looks for the checksums, therefore making us log an
extent without logging its checksums. Before that massive simplification
of fsync, this wasn't a problem because besides looking for checkums in
the checksums tree, we also looked for them in any ordered extent still
running.

The consequence of data checksums missing for a file range is that users
attempting to read the affected file range will get -EIO errors and dmesg
reports the following:

 [10188.358136] BTRFS info (device sdc): no csum found for inode 297 start 57344
 [10188.359278] BTRFS warning (device sdc): csum failed root 5 ino 297 off 57344 csum 0x98f94189 expected csum 0x00000000 mirror 1

So fix this by skipping extents outside of our logging range at
btrfs_log_changed_extents() and leaving them on the list of modified
extents so that any subsequent ranged fsync may collect them if needed.
Also, if we find a hole extent outside of the range still log it, just
to prevent having gaps between extent items after replaying the log,
otherwise fsck will complain when we are not using the NO_HOLES feature
(fstest btrfs/056 triggers such case).

Fixes: e7175a692765 ("btrfs: remove the wait ordered logic in the log_one_extent path")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agobtrfs: fix pinned underflow after transaction aborted
Lu Fengqi [Wed, 24 Oct 2018 12:24:03 +0000 (20:24 +0800)]
btrfs: fix pinned underflow after transaction aborted

commit fcd5e74288f7d36991b1f0fb96b8c57079645e38 upstream.

When running generic/475, we may get the following warning in dmesg:

[ 6902.102154] WARNING: CPU: 3 PID: 18013 at fs/btrfs/extent-tree.c:9776 btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
[ 6902.109160] CPU: 3 PID: 18013 Comm: umount Tainted: G        W  O      4.19.0-rc8+ #8
[ 6902.110971] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 6902.112857] RIP: 0010:btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
[ 6902.118921] RSP: 0018:ffffc9000459bdb0 EFLAGS: 00010286
[ 6902.120315] RAX: ffff880175050bb0 RBX: ffff8801124a8000 RCX: 0000000000170007
[ 6902.121969] RDX: 0000000000000002 RSI: 0000000000170007 RDI: ffffffff8125fb74
[ 6902.123716] RBP: ffff880175055d10 R08: 0000000000000000 R09: 0000000000000000
[ 6902.125417] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880175055d88
[ 6902.127129] R13: ffff880175050bb0 R14: 0000000000000000 R15: dead000000000100
[ 6902.129060] FS:  00007f4507223780(0000) GS:ffff88017ba00000(0000) knlGS:0000000000000000
[ 6902.130996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6902.132558] CR2: 00005623599cac78 CR3: 000000014b700001 CR4: 00000000003606e0
[ 6902.134270] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6902.135981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6902.137836] Call Trace:
[ 6902.138939]  close_ctree+0x171/0x330 [btrfs]
[ 6902.140181]  ? kthread_stop+0x146/0x1f0
[ 6902.141277]  generic_shutdown_super+0x6c/0x100
[ 6902.142517]  kill_anon_super+0x14/0x30
[ 6902.143554]  btrfs_kill_super+0x13/0x100 [btrfs]
[ 6902.144790]  deactivate_locked_super+0x2f/0x70
[ 6902.146014]  cleanup_mnt+0x3b/0x70
[ 6902.147020]  task_work_run+0x9e/0xd0
[ 6902.148036]  do_syscall_64+0x470/0x600
[ 6902.149142]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 6902.150375]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 6902.151640] RIP: 0033:0x7f45077a6a7b
[ 6902.157324] RSP: 002b:00007ffd589f3e68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 6902.159187] RAX: 0000000000000000 RBX: 000055e8eec732b0 RCX: 00007f45077a6a7b
[ 6902.160834] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055e8eec73490
[ 6902.162526] RBP: 0000000000000000 R08: 000055e8eec734b0 R09: 00007ffd589f26c0
[ 6902.164141] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e8eec73490
[ 6902.165815] R13: 00007f4507ac61a4 R14: 0000000000000000 R15: 00007ffd589f40d8
[ 6902.167553] irq event stamp: 0
[ 6902.168998] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
[ 6902.170731] hardirqs last disabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
[ 6902.172773] softirqs last  enabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
[ 6902.174671] softirqs last disabled at (0): [<0000000000000000>]           (null)
[ 6902.176407] ---[ end trace 463138c2986b275c ]---
[ 6902.177636] BTRFS info (device dm-3): space_info 4 has 273465344 free, is not full
[ 6902.179453] BTRFS info (device dm-3): space_info total=276824064, used=4685824, pinned=18446744073708158976, reserved=0, may_use=0, readonly=65536

In the above line there's "pinned=18446744073708158976" which is an
unsigned u64 value of -1392640, an obvious underflow.

When transaction_kthread is running cleanup_transaction(), another
fsstress is running btrfs_commit_transaction(). The
btrfs_finish_extent_commit() may get the same range as
btrfs_destroy_pinned_extent() got, which causes the pinned underflow.

Fixes: d4b450cd4b33 ("Btrfs: fix race between transaction commit and empty block group removal")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agowatchdog/core: Add missing prototypes for weak functions
Mathieu Malaterre [Wed, 6 Jun 2018 19:42:32 +0000 (21:42 +0200)]
watchdog/core: Add missing prototypes for weak functions

commit 81bd415c91eb966118d773dddf254aebf3022411 upstream.

The split out of the hard lockup detector exposed two new weak functions,
but no prototypes for them, which triggers the build warning:

  kernel/watchdog.c:109:12: warning: no previous prototype for ‘watchdog_nmi_enable’ [-Wmissing-prototypes]
  kernel/watchdog.c:115:13: warning: no previous prototype for ‘watchdog_nmi_disable’ [-Wmissing-prototypes]

Add the prototypes.

Fixes: 73ce0511c436 ("kernel/watchdog.c: move hardlockup detector to separate file")
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Babu Moger <babu.moger@oracle.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180606194232.17653-1-malat@debian.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoarch/alpha, termios: implement BOTHER, IBSHIFT and termios2
H. Peter Anvin (Intel) [Mon, 22 Oct 2018 16:19:05 +0000 (09:19 -0700)]
arch/alpha, termios: implement BOTHER, IBSHIFT and termios2

commit d0ffb805b729322626639336986bc83fc2e60871 upstream.

Alpha has had c_ispeed and c_ospeed, but still set speeds in c_cflags
using arbitrary flags. Because BOTHER is not defined, the general
Linux code doesn't allow setting arbitrary baud rates, and because
CBAUDEX == 0, we can have an array overrun of the baud_rate[] table in
drivers/tty/tty_baudrate.c if (c_cflags & CBAUD) == 037.

Resolve both problems by #defining BOTHER to 037 on Alpha.

However, userspace still needs to know if setting BOTHER is actually
safe given legacy kernels (does anyone actually care about that on
Alpha anymore?), so enable the TCGETS2/TCSETS*2 ioctls on Alpha, even
though they use the same structure. Define struct termios2 just for
compatibility; it is the exact same structure as struct termios. In a
future patchset, this will be cleaned up so the uapi headers are
usable from libc.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: <linux-alpha@vger.kernel.org>
Cc: <linux-serial@vger.kernel.org>
Cc: Johan Hovold <johan@kernel.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agotermios, tty/tty_baudrate.c: fix buffer overrun
H. Peter Anvin [Mon, 22 Oct 2018 16:19:04 +0000 (09:19 -0700)]
termios, tty/tty_baudrate.c: fix buffer overrun

commit 991a25194097006ec1e0d2e0814ff920e59e3465 upstream.

On architectures with CBAUDEX == 0 (Alpha and PowerPC), the code in tty_baudrate.c does
not do any limit checking on the tty_baudrate[] array, and in fact a
buffer overrun is possible on both architectures. Add a limit check to
prevent that situation.

This will be followed by a much bigger cleanup/simplification patch.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Requested-by: Cc: Johan Hovold <johan@kernel.org>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agox86/hyper-v: Enable PIT shutdown quirk
Michael Kelley [Sun, 4 Nov 2018 03:48:57 +0000 (03:48 +0000)]
x86/hyper-v: Enable PIT shutdown quirk

commit 1de72c706488b7be664a601cf3843bd01e327e58 upstream.

Hyper-V emulation of the PIT has a quirk such that the normal PIT shutdown
path doesn't work, because clearing the counter register restarts the
timer.

Disable the counter clearing on PIT shutdown.

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
Cc: "devel@linuxdriverproject.org" <devel@linuxdriverproject.org>
Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>
Cc: "virtualization@lists.linux-foundation.org" <virtualization@lists.linux-foundation.org>
Cc: "jgross@suse.com" <jgross@suse.com>
Cc: "akataria@vmware.com" <akataria@vmware.com>
Cc: "olaf@aepfle.de" <olaf@aepfle.de>
Cc: "apw@canonical.com" <apw@canonical.com>
Cc: vkuznets <vkuznets@redhat.com>
Cc: "jasowang@redhat.com" <jasowang@redhat.com>
Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>
Cc: KY Srinivasan <kys@microsoft.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1541303219-11142-3-git-send-email-mikelley@microsoft.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agox86/cpu/vmware: Do not trace vmware_sched_clock()
Steven Rostedt (VMware) [Fri, 9 Nov 2018 20:22:07 +0000 (15:22 -0500)]
x86/cpu/vmware: Do not trace vmware_sched_clock()

commit 15035388439f892017d38b05214d3cda6578af64 upstream.

When running function tracing on a Linux guest running on VMware
Workstation, the guest would crash. This is due to tracing of the
sched_clock internal call of the VMware vmware_sched_clock(), which
causes an infinite recursion within the tracing code (clock calls must
not be traced).

Make vmware_sched_clock() not traced by ftrace.

Fixes: 80e9a4f21fd7c ("x86/vmware: Add paravirt sched clock")
Reported-by: GwanYeong Kim <gy741.kim@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Alok Kataria <akataria@vmware.com>
CC: GwanYeong Kim <gy741.kim@gmail.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
CC: Thomas Gleixner <tglx@linutronix.de>
CC: virtualization@lists.linux-foundation.org
CC: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/20181109152207.4d3e7d70@gandalf.local.home
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoof, numa: Validate some distance map rules
John Garry [Thu, 8 Nov 2018 10:17:03 +0000 (18:17 +0800)]
of, numa: Validate some distance map rules

commit 89c38422e072bb453e3045b8f1b962a344c3edea upstream.

Currently the NUMA distance map parsing does not validate the distance
table for the distance-matrix rules 1-2 in [1].

However the arch NUMA code may enforce some of these rules, but not all.
Such is the case for the arm64 port, which does not enforce the rule that
the distance between separates nodes cannot equal LOCAL_DISTANCE.

The patch adds the following rules validation:
- distance of node to self equals LOCAL_DISTANCE
- distance of separate nodes > LOCAL_DISTANCE

This change avoids a yet-unresolved crash reported in [2].

A note on dealing with symmetrical distances between nodes:

Validating symmetrical distances between nodes is difficult. If it were
mandated in the bindings that every distance must be recorded in the
table, then it would be easy. However, it isn't.

In addition to this, it is also possible to record [b, a] distance only
(and not [a, b]). So, when processing the table for [b, a], we cannot
assert that current distance of [a, b] != [b, a] as invalid, as [a, b]
distance may not be present in the table and current distance would be
default at REMOTE_DISTANCE.

As such, we maintain the policy that we overwrite distance [a, b] = [b, a]
for b > a. This policy is different to kernel ACPI SLIT validation, which
allows non-symmetrical distances (ACPI spec SLIT rules allow it). However,
the distance debug message is dropped as it may be misleading (for a distance
which is later overwritten).

Some final notes on semantics:

- It is implied that it is the responsibility of the arch NUMA code to
  reset the NUMA distance map for an error in distance map parsing.

- It is the responsibility of the FW NUMA topology parsing (whether OF or
  ACPI) to enforce NUMA distance rules, and not arch NUMA code.

[1] Documents/devicetree/bindings/numa.txt
[2] https://www.spinics.net/lists/arm-kernel/msg683304.html

Cc: stable@vger.kernel.org # 4.7
Signed-off-by: John Garry <john.garry@huawei.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoperf intel-pt: Insert callchain context into synthesized callchains
Adrian Hunter [Wed, 31 Oct 2018 09:10:42 +0000 (11:10 +0200)]
perf intel-pt: Insert callchain context into synthesized callchains

commit 242483068b4b9ad02f1653819b6e683577681e0e upstream.

In the absence of a fallback, callchains must encode also the callchain
context. Do that now there is no fallback.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: stable@vger.kernel.org # 4.19
Link: http://lkml.kernel.org/r/100ea2ec-ed14-b56d-d810-e0a6d2f4b069@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoperf intel-pt/bts: Calculate cpumode for synthesized samples
Adrian Hunter [Wed, 31 Oct 2018 09:10:43 +0000 (11:10 +0200)]
perf intel-pt/bts: Calculate cpumode for synthesized samples

commit 5d4f0edaa3ac4f1844ed7c64cd2bae6f1912bac5 upstream.

In the absence of a fallback, samples must provide a correct cpumode for
the 'ip'. Do that now there is no fallback.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: stable@vger.kernel.org # 4.19
Link: http://lkml.kernel.org/r/20181031091043.23465-6-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoperf callchain: Honour the ordering of PERF_CONTEXT_{USER,KERNEL,etc}
David S. Miller [Tue, 30 Oct 2018 15:12:26 +0000 (12:12 -0300)]
perf callchain: Honour the ordering of PERF_CONTEXT_{USER,KERNEL,etc}

commit e9024d519d892b38176cafd46f68a7cdddd77412 upstream.

When processing using 'perf report -g caller', which is the default, we
ended up reverting the callchain entries received from the kernel, but
simply reverting throws away the information that tells that from a
point onwards the addresses are for userspace, kernel, guest kernel,
guest user, hypervisor.

The idea is that if we are walking backwards, for each cluster of
non-cpumode entries we have to first scan backwards for the next one and
use that for the cluster.

This seems silly and more expensive than it needs to be but it is enough
for a initial fix.

The code here is really complicated because it is intimately intertwined
with the lbr and branch handling, as well as this callchain order,
further fixes will be needed to properly take into account the cpumode
in those cases.

Another problem with ORDER_CALLER is that the NULL "0" IP that is at the
end of most callchains shows up at the top of the histogram because
every callchain contains it and with ORDER_CALLER it is the first entry.

Signed-off-by: David S. Miller <davem@davemloft.net>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Souvik Banerjee <souvik1997@gmail.com>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: stable@vger.kernel.org # 4.19
Link: https://lkml.kernel.org/n/tip-2wt3ayp6j2y2f2xowixa8y6y@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoperf stat: Handle different PMU names with common prefix
Thomas Richter [Tue, 23 Oct 2018 15:16:16 +0000 (17:16 +0200)]
perf stat: Handle different PMU names with common prefix

commit ea1fa48c055f833eb25f0c33188feecb7002ada5 upstream.

On s390 the CPU Measurement Facility for counters now supports
2 PMUs named cpum_cf (CPU Measurement Facility for counters) and
cpum_cf_diag (CPU Measurement Facility for diagnostic counters)
for one and the same CPU.

Running command

 [root@s35lp76 perf]# ./perf stat -e tx_c_tend \
 -- ~/mytests/cf-tx-events 1

 Measuring transactions
 TX_C_TABORT_NO_SPECIAL: 0 expected:0
 TX_C_TABORT_SPECIAL: 0 expected:0
 TX_C_TEND: 1 expected:1
 TX_NC_TABORT: 11 expected:11
 TX_NC_TEND: 1 expected:1

 Performance counter stats for '/root/mytests/cf-tx-events 1':

  2      tx_c_tend

      0.002120091 seconds time elapsed

      0.000121000 seconds user
      0.002127000 seconds sys

 [root@s35lp76 perf]#

displays output which is unexpected (and wrong):

  2      tx_c_tend

The test program definitely triggers only one transaction, as shown
in line 'TX_C_TEND: 1 expected:1'.

This is caused by the following call sequence:

pmu_lookup() scans and installs a PMU.
+--> pmu_aliases() parses all aliases in directory
.../<pmu-name>/events/* which are file names.
     +--> pmu_aliases_parse() Read each file in directory and create
                      an new alias entry. This is done with
          +--> perf_pmu__new_alias() and
       +--> __perf_pmu__new_alias() which also check for
                   identical alias names.

After pmu_aliases() returns, a complete list of event names
for this pmu has been created. Now function

pmu_add_cpu_aliases()   is called to add the events listed in the json
|                       files to the alias list of the cpu.
+--> perf_pmu__find_map()  Returns a pointer to the json events.

Now function pmu_add_cpu_aliases() scans through all events listed
in the JSON files for this CPU.
Each json event pmu name is compared with the current PMU being
built up and if they mismatch, the json event is added to the
current PMUs alias list.
To avoid duplicate entries the following comparison is done:

if (!is_arm_pmu_core(name)) {
     pname = pe->pmu ? pe->pmu : "cpu";
     if (strncmp(pname, name, strlen(pname)))
     continue;
     }

The culprit is the strncmp() function.

Using current s390 PMU naming, the first PMU is 'cpum_cf'
and a long list of events is added, among them 'tx_c_tend'

When the second PMU named 'cpum_cf_diag' is added, only one event
named 'CF_DIAG' is added by the pmu_aliases()  function.

Now function pmu_add_cpu_aliases() is invoked for PMU 'cpum_cf_diag'.
Since the CPUID string is the same for both PMUs, json file events
for PMU named 'cpum_cf' are added to the PMU 'cpm_cf_diag'

This happens because the strncmp() actually compares:

     strncmp("cpum_cf", "cpum_cf_diag", 6);

The first parameter is the pmu name taken from the event in
the json file. The second parameter is the pmu name of the PMU
currently being built.
They are different, but the length of the compare only tests the
common prefix and this returns 0(true) when it should return false.

Now all events for PMU cpum_cf are added to the alias list for pmu
cpum_cf_diag.

Later on in function parse_events_add_pmu() the event 'tx_c_end' is
searched in all available PMUs and found twice, adding it two
times to the evsel_list global variable which is the root
of all events. This results in a counter value of 2 instead
of 1.

Output with this patch:

 [root@s35lp76 perf]# ./perf stat -e tx_c_tend \
-- ~/mytests/cf-tx-events 1
 Measuring transactions
 TX_C_TABORT_NO_SPECIAL: 0 expected:0
 TX_C_TABORT_SPECIAL: 0 expected:0
 TX_C_TEND: 1 expected:1
 TX_NC_TABORT: 11 expected:11
 TX_NC_TEND: 1 expected:1

 Performance counter stats for '/root/mytests/cf-tx-events 1':

                  1      tx_c_tend

      0.001815365 seconds time elapsed

      0.000123000 seconds user
      0.001756000 seconds sys

 [root@s35lp76 perf]#

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Reviewed-by: Sebastien Boisvert <sboisvert@gydle.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: stable@vger.kernel.org
Fixes: 292c34c10249 ("perf pmu: Fix core PMU alias list for X86 platform")
Link: http://lkml.kernel.org/r/20181023151616.78193-1-tmricht@linux.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoperf cs-etm: Correct CPU mode for samples
Leo Yan [Tue, 30 Oct 2018 07:18:28 +0000 (15:18 +0800)]
perf cs-etm: Correct CPU mode for samples

commit d6c9c05fe1eb4b213b183d8a1e79416256dc833a upstream.

Since commit edeb0c90df35 ("perf tools: Stop fallbacking to kallsyms for
vdso symbols lookup"), the kernel address cannot be properly parsed to
kernel symbol with command 'perf script -k vmlinux'.  The reason is
CoreSight samples is always to set CPU mode as PERF_RECORD_MISC_USER,
thus it fails to find corresponding map/dso in below flows:

  process_sample_event()
    `-> machine__resolve()
  `-> thread__find_map(thread, sample->cpumode, sample->ip, al);

In this flow it needs to pass argument 'sample->cpumode' to tell what's
the CPU mode, before it always passed PERF_RECORD_MISC_USER but without
any failure until the commit edeb0c90df35 ("perf tools: Stop fallbacking
to kallsyms for vdso symbols lookup") has been merged.  The reason is
even with the wrong CPU mode the function thread__find_map() firstly
fails to find map but it will rollback to find kernel map for vdso
symbols lookup.  In the latest code it has removed the fallback code,
thus if CPU mode is PERF_RECORD_MISC_USER then it cannot find map
anymore with kernel address.

This patch is to correct samples CPU mode setting, it creates a new
helper function cs_etm__cpu_mode() to tell what's the CPU mode based on
the address with the info from machine structure; this patch has a bit
extension to check not only kernel and user mode, but also check for
host/guest and hypervisor mode.  Finally this patch uses the function in
instruction and branch samples and also apply in cs_etm__mem_access()
for a minor polishing.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: coresight@lists.linaro.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: stable@kernel.org # v4.19
Link: http://lkml.kernel.org/r/1540883908-17018-1-git-send-email-leo.yan@linaro.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agohwmon: (core) Fix double-free in __hwmon_device_register()
Dmitry Osipenko [Wed, 24 Oct 2018 19:37:13 +0000 (22:37 +0300)]
hwmon: (core) Fix double-free in __hwmon_device_register()

commit 74e3512731bd5c9673176425a76a7cc5efa8ddb6 upstream.

Fix double-free that happens when thermal zone setup fails, see KASAN log
below.

==================================================================
BUG: KASAN: double-free or invalid-free in __hwmon_device_register+0x5dc/0xa7c

CPU: 0 PID: 132 Comm: kworker/0:2 Tainted: G    B             4.19.0-rc8-next-20181016-00042-gb52cd80401e9-dirty #41
Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
Workqueue: events deferred_probe_work_func
Backtrace:
[<c0110540>] (dump_backtrace) from [<c0110944>] (show_stack+0x20/0x24)
[<c0110924>] (show_stack) from [<c105cb08>] (dump_stack+0x9c/0xb0)
[<c105ca6c>] (dump_stack) from [<c02fdaec>] (print_address_description+0x68/0x250)
[<c02fda84>] (print_address_description) from [<c02fd4ac>] (kasan_report_invalid_free+0x68/0x88)
[<c02fd444>] (kasan_report_invalid_free) from [<c02fc85c>] (__kasan_slab_free+0x1f4/0x200)
[<c02fc668>] (__kasan_slab_free) from [<c02fd0c0>] (kasan_slab_free+0x14/0x18)
[<c02fd0ac>] (kasan_slab_free) from [<c02f9c6c>] (kfree+0x90/0x294)
[<c02f9bdc>] (kfree) from [<c0b41bbc>] (__hwmon_device_register+0x5dc/0xa7c)
[<c0b415e0>] (__hwmon_device_register) from [<c0b421e8>] (hwmon_device_register_with_info+0xa0/0xa8)
[<c0b42148>] (hwmon_device_register_with_info) from [<c0b42324>] (devm_hwmon_device_register_with_info+0x74/0xb4)
[<c0b422b0>] (devm_hwmon_device_register_with_info) from [<c0b4481c>] (lm90_probe+0x414/0x578)
[<c0b44408>] (lm90_probe) from [<c0aeeff4>] (i2c_device_probe+0x35c/0x384)
[<c0aeec98>] (i2c_device_probe) from [<c08776cc>] (really_probe+0x290/0x3e4)
[<c087743c>] (really_probe) from [<c0877a2c>] (driver_probe_device+0x80/0x1c4)
[<c08779ac>] (driver_probe_device) from [<c0877da8>] (__device_attach_driver+0x104/0x11c)
[<c0877ca4>] (__device_attach_driver) from [<c0874dd8>] (bus_for_each_drv+0xa4/0xc8)
[<c0874d34>] (bus_for_each_drv) from [<c08773b0>] (__device_attach+0xf0/0x15c)
[<c08772c0>] (__device_attach) from [<c0877e24>] (device_initial_probe+0x1c/0x20)
[<c0877e08>] (device_initial_probe) from [<c08762f4>] (bus_probe_device+0xdc/0xec)
[<c0876218>] (bus_probe_device) from [<c0876a08>] (deferred_probe_work_func+0xa8/0xd4)
[<c0876960>] (deferred_probe_work_func) from [<c01527c4>] (process_one_work+0x3dc/0x96c)
[<c01523e8>] (process_one_work) from [<c01541e0>] (worker_thread+0x4ec/0x8bc)
[<c0153cf4>] (worker_thread) from [<c015b238>] (kthread+0x230/0x240)
[<c015b008>] (kthread) from [<c01010bc>] (ret_from_fork+0x14/0x38)
Exception stack(0xcf743fb0 to 0xcf743ff8)
3fa0:                                     00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000

Allocated by task 132:
 kasan_kmalloc.part.1+0x58/0xf4
 kasan_kmalloc+0x90/0xa4
 kmem_cache_alloc_trace+0x90/0x2a0
 __hwmon_device_register+0xbc/0xa7c
 hwmon_device_register_with_info+0xa0/0xa8
 devm_hwmon_device_register_with_info+0x74/0xb4
 lm90_probe+0x414/0x578
 i2c_device_probe+0x35c/0x384
 really_probe+0x290/0x3e4
 driver_probe_device+0x80/0x1c4
 __device_attach_driver+0x104/0x11c
 bus_for_each_drv+0xa4/0xc8
 __device_attach+0xf0/0x15c
 device_initial_probe+0x1c/0x20
 bus_probe_device+0xdc/0xec
 deferred_probe_work_func+0xa8/0xd4
 process_one_work+0x3dc/0x96c
 worker_thread+0x4ec/0x8bc
 kthread+0x230/0x240
 ret_from_fork+0x14/0x38
   (null)

Freed by task 132:
 __kasan_slab_free+0x12c/0x200
 kasan_slab_free+0x14/0x18
 kfree+0x90/0x294
 hwmon_dev_release+0x1c/0x20
 device_release+0x4c/0xe8
 kobject_put+0xac/0x11c
 device_unregister+0x2c/0x30
 __hwmon_device_register+0xa58/0xa7c
 hwmon_device_register_with_info+0xa0/0xa8
 devm_hwmon_device_register_with_info+0x74/0xb4
 lm90_probe+0x414/0x578
 i2c_device_probe+0x35c/0x384
 really_probe+0x290/0x3e4
 driver_probe_device+0x80/0x1c4
 __device_attach_driver+0x104/0x11c
 bus_for_each_drv+0xa4/0xc8
 __device_attach+0xf0/0x15c
 device_initial_probe+0x1c/0x20
 bus_probe_device+0xdc/0xec
 deferred_probe_work_func+0xa8/0xd4
 process_one_work+0x3dc/0x96c
 worker_thread+0x4ec/0x8bc
 kthread+0x230/0x240
 ret_from_fork+0x14/0x38
   (null)

Cc: <stable@vger.kernel.org> # v4.15+
Fixes: 47c332deb8e8 ("hwmon: Deal with errors from the thermal subsystem")
Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agomtd: docg3: don't set conflicting BCH_CONST_PARAMS option
Arnd Bergmann [Thu, 11 Oct 2018 11:06:16 +0000 (13:06 +0200)]
mtd: docg3: don't set conflicting BCH_CONST_PARAMS option

commit be2e1c9dcf76886a83fb1c433a316e26d4ca2550 upstream.

I noticed during the creation of another bugfix that the BCH_CONST_PARAMS
option that is set by DOCG3 breaks setting variable parameters for any
other users of the BCH library code.

The only other user we have today is the MTD_NAND software BCH
implementation (most flash controllers use hardware BCH these days
and are not affected). I considered removing BCH_CONST_PARAMS entirely
because of the inherent conflict, but according to the description in
lib/bch.c there is a significant performance benefit in keeping it.

To avoid the immediate problem of the conflict between MTD_NAND_BCH
and DOCG3, this only sets the constant parameters if MTD_NAND_BCH
is disabled, which should fix the problem for all cases that
are affected. This should also work for all stable kernels.

Note that there is only one machine that actually seems to use the
DOCG3 driver (arch/arm/mach-pxa/mioa701.c), so most users should have
the driver disabled, but it almost certainly shows up if we wanted
to test random kernels on machines that use software BCH in MTD.

Fixes: d13d19ece39f ("mtd: docg3: add ECC correction code")
Cc: stable@vger.kernel.org
Cc: Robert Jarzmik <robert.jarzmik@free.fr>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agomtd: nand: Fix nanddev_neraseblocks()
Boris Brezillon [Sun, 28 Oct 2018 11:29:55 +0000 (12:29 +0100)]
mtd: nand: Fix nanddev_neraseblocks()

commit d098093ba06eb032057d1aca1c2e45889e099d00 upstream.

nanddev_neraseblocks() currently returns the number pages per LUN
instead of the total number of eraseblocks.

Fixes: 9c3736a3de21 ("mtd: nand: Add core infrastructure to deal with NAND devices")
Cc: <stable@vger.kernel.org>
Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agomtd: spi-nor: cadence-quadspi: Return error code in cqspi_direct_read_execute()
Christophe JAILLET [Tue, 16 Oct 2018 07:13:46 +0000 (09:13 +0200)]
mtd: spi-nor: cadence-quadspi: Return error code in cqspi_direct_read_execute()

commit 91d7b67000c6e9bd605624079fee5a084238ad92 upstream.

We return 0 unconditionally in 'cqspi_direct_read_execute()'.
However, 'ret' is set to some error codes in several error handling
paths.

Return 'ret' instead to propagate the error code.

Fixes: ffa639e069fb ("mtd: spi-nor: cadence-quadspi: Add DMA support for direct mode reads")
Cc: <stable@vger.kernel.org>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agobonding/802.3ad: fix link_failure_count tracking
Jarod Wilson [Sun, 4 Nov 2018 19:59:46 +0000 (14:59 -0500)]
bonding/802.3ad: fix link_failure_count tracking

commit ea53abfab960909d622ca37bcfb8e1c5378d21cc upstream.

Commit 4d2c0cda07448ea6980f00102dc3964eb25e241c set slave->link to
BOND_LINK_DOWN for 802.3ad bonds whenever invalid speed/duplex values
were read, to fix a problem with slaves getting into weird states, but
in the process, broke tracking of link failures, as going straight to
BOND_LINK_DOWN when a link is indeed down (cable pulled, switch rebooted)
means we broke out of bond_miimon_inspect()'s BOND_LINK_DOWN case because
!link_state was already true, we never incremented commit, and never got
a chance to call bond_miimon_commit(), where slave->link_failure_count
would be incremented. I believe the simple fix here is to mark the slave
as BOND_LINK_FAIL, and let bond_miimon_inspect() transition the link from
_FAIL to either _UP or _DOWN, and in the latter case, we now get proper
incrementing of link_failure_count again.

Fixes: 4d2c0cda0744 ("bonding: speed/duplex update at NETDEV_UP event")
CC: Mahesh Bandewar <maheshb@google.com>
CC: David S. Miller <davem@davemloft.net>
CC: netdev@vger.kernel.org
CC: stable@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoARM: 8809/1: proc-v7: fix Thumb annotation of cpu_v7_hvc_switch_mm
Ard Biesheuvel [Mon, 5 Nov 2018 13:54:56 +0000 (14:54 +0100)]
ARM: 8809/1: proc-v7: fix Thumb annotation of cpu_v7_hvc_switch_mm

commit 6282e916f774e37845c65d1eae9f8c649004f033 upstream.

Due to what appears to be a copy/paste error, the opening ENTRY()
of cpu_v7_hvc_switch_mm() lacks a matching ENDPROC(), and instead,
the one for cpu_v7_smc_switch_mm() is duplicated.

Given that it is ENDPROC() that emits the Thumb annotation, the
cpu_v7_hvc_switch_mm() routine will be called in ARM mode on a
Thumb2 kernel, resulting in the following splat:

  Internal error: Oops - undefined instruction: 0 [#1] SMP THUMB2
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.18.0-rc1-00030-g4d28ad89189d-dirty #488
  Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
  PC is at cpu_v7_hvc_switch_mm+0x12/0x18
  LR is at flush_old_exec+0x31b/0x570
  pc : [<c0316efe>]    lr : [<c04117c7>]    psr: 00000013
  sp : ee899e50  ip : 00000000  fp : 00000001
  r10: eda28f34  r9 : eda31800  r8 : c12470e0
  r7 : eda1fc00  r6 : eda53000  r5 : 00000000  r4 : ee88c000
  r3 : c0316eec  r2 : 00000001  r1 : eda53000  r0 : 6da6c000
  Flags: nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none

Note the 'ISA ARM' in the last line.

Fix this by using the correct name in ENDPROC().

Cc: <stable@vger.kernel.org>
Fixes: 10115105cb3a ("ARM: spectre-v2: add firmware based hardening")
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agonetfilter: conntrack: fix calculation of next bucket number in early_drop
Vasily Khoruzhick [Thu, 25 Oct 2018 19:15:43 +0000 (12:15 -0700)]
netfilter: conntrack: fix calculation of next bucket number in early_drop

commit f393808dc64149ccd0e5a8427505ba2974a59854 upstream.

If there's no entry to drop in bucket that corresponds to the hash,
early_drop() should look for it in other buckets. But since it increments
hash instead of bucket number, it actually looks in the same bucket 8
times: hsize is 16k by default (14 bits) and hash is 32-bit value, so
reciprocal_scale(hash, hsize) returns the same value for hash..hash+7 in
most cases.

Fix it by increasing bucket number instead of hash and rename _hash
to bucket to avoid future confusion.

Fixes: 3e86638e9a0b ("netfilter: conntrack: consider ct netns in early_drop logic")
Cc: <stable@vger.kernel.org> # v4.7+
Signed-off-by: Vasily Khoruzhick <vasilykh@arista.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agomemory_hotplug: cond_resched in __remove_pages
Michal Hocko [Fri, 2 Nov 2018 22:48:46 +0000 (15:48 -0700)]
memory_hotplug: cond_resched in __remove_pages

commit dd33ad7b251f900481701b2a82d25de583867708 upstream.

We have received a bug report that unbinding a large pmem (>1TB) can
result in a soft lockup:

  NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [ndctl:4365]
  [...]
  Supported: Yes
  CPU: 9 PID: 4365 Comm: ndctl Not tainted 4.12.14-94.40-default #1 SLE12-SP4
  Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.01.00.0833.051120182255 05/11/2018
  task: ffff9cce7d4410c0 task.stack: ffffbe9eb1bc4000
  RIP: 0010:__put_page+0x62/0x80
  Call Trace:
   devm_memremap_pages_release+0x152/0x260
   release_nodes+0x18d/0x1d0
   device_release_driver_internal+0x160/0x210
   unbind_store+0xb3/0xe0
   kernfs_fop_write+0x102/0x180
   __vfs_write+0x26/0x150
   vfs_write+0xad/0x1a0
   SyS_write+0x42/0x90
   do_syscall_64+0x74/0x150
   entry_SYSCALL_64_after_hwframe+0x3d/0xa2
  RIP: 0033:0x7fd13166b3d0

It has been reported on an older (4.12) kernel but the current upstream
code doesn't cond_resched in the hot remove code at all and the given
range to remove might be really large.  Fix the issue by calling
cond_resched once per memory section.

Link: http://lkml.kernel.org/r/20181031125840.23982-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Dan Williams <dan.j.williams@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agomm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
Andrea Arcangeli [Fri, 2 Nov 2018 22:47:59 +0000 (15:47 -0700)]
mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

commit ac5b2c18911ffe95c08d69273917f90212cf5659 upstream.

THP allocation might be really disruptive when allocated on NUMA system
with the local node full or hard to reclaim.  Stefan has posted an
allocation stall report on 4.12 based SLES kernel which suggests the
same issue:

  kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
  kvm cpuset=/ mems_allowed=0-1
  CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased)
  Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
  Call Trace:
   dump_stack+0x5c/0x84
   warn_alloc+0xe0/0x180
   __alloc_pages_slowpath+0x820/0xc90
   __alloc_pages_nodemask+0x1cc/0x210
   alloc_pages_vma+0x1e5/0x280
   do_huge_pmd_wp_page+0x83f/0xf00
   __handle_mm_fault+0x93d/0x1060
   handle_mm_fault+0xc6/0x1b0
   __do_page_fault+0x230/0x430
   do_page_fault+0x2a/0x70
   page_fault+0x7b/0x80
   [...]
  Mem-Info:
  active_anon:126315487 inactive_anon:1612476 isolated_anon:5
   active_file:60183 inactive_file:245285 isolated_file:0
   unevictable:15657 dirty:286 writeback:1 unstable:0
   slab_reclaimable:75543 slab_unreclaimable:2509111
   mapped:81814 shmem:31764 pagetables:370616 bounce:0
   free:32294031 free_pcp:6233 free_cma:0
  Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
  Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for MADV_HUGEPAGA vma.

Andrea has identified that the main source of the problem is
__GFP_THISNODE usage:

: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition, is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard to
: unsuccessfully swapout get_user_pages pinned pages as result of the
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without vfio type1 GUP pins of course, except it'd
: be swapping heavily instead).

Fix this by removing __GFP_THISNODE for THP requests which are
requesting the direct reclaim.  This effectivelly reverts 5265047ac301
on the grounds that the zone/node reclaim was known to be disruptive due
to premature reclaim when there was memory free.  While it made sense at
the time for HPC workloads without NUMA awareness on rare machines, it
was ultimately harmful in the majority of cases.  The existing behaviour
is similar, if not as widespare as it applies to a corner case but
crucially, it cannot be tuned around like zone_reclaim_mode can.  The
default behaviour should always be to cause the least harm for the
common case.

If there are specialised use cases out there that want zone_reclaim_mode
in specific cases, then it can be built on top.  Longterm we should
consider a memory policy which allows for the node reclaim like behavior
for the specific memory ranges which would allow a

[1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

Mel said:

: Both patches look correct to me but I'm responding to this one because
: it's the fix.  The change makes sense and moves further away from the
: severe stalling behaviour we used to see with both THP and zone reclaim
: mode.
:
: I put together a basic experiment with usemem configured to reference a
: buffer multiple times that is 80% the size of main memory on a 2-socket
: box with symmetric node sizes and defrag set to "always".  The defrag
: setting is not the default but it would be functionally similar to
: accessing a buffer with madvise(MADV_HUGEPAGE).  Usemem is configured to
: reference the buffer multiple times and while it's not an interesting
: workload, it would be expected to complete reasonably quickly as it fits
: within memory.  The results were;
:
: usemem
:                                   vanilla           noreclaim-v1
: Amean     Elapsd-1       42.78 (   0.00%)       26.87 (  37.18%)
: Amean     Elapsd-3       27.55 (   0.00%)        7.44 (  73.00%)
: Amean     Elapsd-4        5.72 (   0.00%)        5.69 (   0.45%)
:
: This shows the elapsed time in seconds for 1 thread, 3 threads and 4
: threads referencing buffers 80% the size of memory.  With the patches
: applied, it's 37.18% faster for the single thread and 73% faster with two
: threads.  Note that 4 threads showing little difference does not indicate
: the problem is related to thread counts.  It's simply the case that 4
: threads gets spread so their workload mostly fits in one node.
:
: The overall view from /proc/vmstats is more startling
:
:                          4.19.0-rc1  4.19.0-rc1
:                             vanillanoreclaim-v1r1
: Minor Faults               35593425      708164
: Major Faults                 484088          36
: Swap Ins                    3772837           0
: Swap Outs                   3932295           0
:
: Massive amounts of swap in/out without the patch
:
: Direct pages scanned        6013214           0
: Kswapd pages scanned              0           0
: Kswapd pages reclaimed            0           0
: Direct pages reclaimed      4033009           0
:
: Lots of reclaim activity without the patch
:
: Kswapd efficiency              100%        100%
: Kswapd velocity               0.000       0.000
: Direct efficiency               67%        100%
: Direct velocity           11191.956       0.000
:
: Mostly from direct reclaim context as you'd expect without the patch.
:
: Page writes by reclaim  3932314.000       0.000
: Page writes file                 19           0
: Page writes anon            3932295           0
: Page reclaim immediate        42336           0
:
: Writes from reclaim context is never good but the patch eliminates it.
:
: We should never have default behaviour to thrash the system for such a
: basic workload.  If zone reclaim mode behaviour is ever desired but on a
: single task instead of a global basis then the sensible option is to build
: a mempolicy that enforces that behaviour.

This was a severe regression compared to previous kernels that made
important workloads unusable and it starts when __GFP_THISNODE was
added to THP allocations under MADV_HUGEPAGE.  It is not a significant
risk to go to the previous behavior before __GFP_THISNODE was added, it
worked like that for years.

This was simply an optimization to some lucky workloads that can fit in
a single node, but it ended up breaking the VM for others that can't
possibly fit in a single node, so going back is safe.

[mhocko@suse.com: rewrote the changelog based on the one from Andrea]
Link: http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org
Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: <stable@vger.kernel.org> [4.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoocfs2: free up write context when direct IO failed
Wengang Wang [Fri, 16 Nov 2018 23:08:25 +0000 (15:08 -0800)]
ocfs2: free up write context when direct IO failed

commit 5040f8df56fb90c7919f1c9b0b6e54c843437456 upstream.

The write context should also be freed even when direct IO failed.
Otherwise a memory leak is introduced and entries remain in
oi->ip_unwritten_list causing the following BUG later in unlink path:

  ERROR: bug expression: !list_empty(&oi->ip_unwritten_list)
  ERROR: Clear inode of 215043, inode has unwritten extents
  ...
  Call Trace:
  ? __set_current_blocked+0x42/0x68
  ocfs2_evict_inode+0x91/0x6a0 [ocfs2]
  ? bit_waitqueue+0x40/0x33
  evict+0xdb/0x1af
  iput+0x1a2/0x1f7
  do_unlinkat+0x194/0x28f
  SyS_unlinkat+0x1b/0x2f
  do_syscall_64+0x79/0x1ae
  entry_SYSCALL_64_after_hwframe+0x151/0x0

This patch also logs, with frequency limit, direct IO failures.

Link: http://lkml.kernel.org/r/20181102170632.25921-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoocfs2: fix a misuse a of brelse after failing ocfs2_check_dir_entry
Changwei Ge [Fri, 2 Nov 2018 22:48:15 +0000 (15:48 -0700)]
ocfs2: fix a misuse a of brelse after failing ocfs2_check_dir_entry

commit 29aa30167a0a2e6045a0d6d2e89d8168132333d5 upstream.

Somehow, file system metadata was corrupted, which causes
ocfs2_check_dir_entry() to fail in function ocfs2_dir_foreach_blk_el().

According to the original design intention, if above happens we should
skip the problematic block and continue to retrieve dir entry.  But
there is obviouse misuse of brelse around related code.

After failure of ocfs2_check_dir_entry(), current code just moves to
next position and uses the problematic buffer head again and again
during which the problematic buffer head is released for multiple times.
I suppose, this a serious issue which is long-lived in ocfs2.  This may
cause other file systems which is also used in a the same host insane.

So we should also consider about bakcporting this patch into linux
-stable.

Link: http://lkml.kernel.org/r/HK2PR06MB045211675B43EED794E597B6D56E0@HK2PR06MB0452.apcprd06.prod.outlook.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Suggested-by: Changkuo Shi <shi.changkuo@h3c.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agosoc: ti: QMSS: Fix usage of irq_set_affinity_hint
Marc Zyngier [Wed, 31 Oct 2018 08:41:34 +0000 (08:41 +0000)]
soc: ti: QMSS: Fix usage of irq_set_affinity_hint

commit 832ad0e3da4510fd17f98804abe512ea9a747035 upstream.

The Keystone QMSS driver is pretty damaged, in the sense that it
does things like this:

irq_set_affinity_hint(irq, to_cpumask(&cpu_map));

where cpu_map is a local variable. As we leave the function, this
will point to nowhere-land, and things will end-up badly.

Instead, let's use a proper cpumask that gets allocated, giving
the driver a chance to actually work with things like irqbalance
as well as have a hypothetical 64bit future.

Cc: stable@vger.kernel.org
Acked-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoRevert "powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP"
Christophe Leroy [Fri, 19 Oct 2018 06:54:54 +0000 (06:54 +0000)]
Revert "powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP"

commit cc4ebf5c0a3440ed0a32d25c55ebdb6ce5f3c0bc upstream.

This reverts commit 4f94b2c7462d9720b2afa7e8e8d4c19446bb31ce.

That commit was buggy, as it used rlwinm instead of rlwimi.
Instead of fixing that bug, we revert the previous commit in order to
reduce the dependency between L1 entries and L2 entries

Fixes: 4f94b2c7462d9 ("powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP")
Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoSCSI: fix queue cleanup race before queue initialization is done
Ming Lei [Wed, 14 Nov 2018 08:25:51 +0000 (16:25 +0800)]
SCSI: fix queue cleanup race before queue initialization is done

commit 8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a upstream.

c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has
already fixed this race, however the implied synchronize_rcu()
in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused
performance regression.

Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
tried to quiesce queue for avoiding unnecessary synchronize_rcu()
only when queue initialization is done, because it is usual to see
lots of inexistent LUNs which need to be probed.

However, turns out it isn't safe to quiesce queue only when queue
initialization is done. Because when one SCSI command is completed,
the user of sending command can be waken up immediately, then the
scsi device may be removed, meantime the run queue in scsi_end_request()
is still in-progress, so kernel panic can be caused.

In Red Hat QE lab, there are several reports about this kind of kernel
panic triggered during kernel booting.

This patch tries to address the issue by grabing one queue usage
counter during freeing one request and the following run queue.

Fixes: 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
Cc: Andrew Jones <drjones@redhat.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
Cc: stable <stable@vger.kernel.org>
Cc: jianchao.wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoscsi: qla2xxx: Initialize port speed to avoid setting lower speed
Quinn Tran [Tue, 6 Nov 2018 08:51:21 +0000 (00:51 -0800)]
scsi: qla2xxx: Initialize port speed to avoid setting lower speed

commit f635e48e866ee1a47d2d42ce012fdcc07bf55853 upstream.

This patch initializes port speed so that firmware does not set lower
operating speed. Setting lower speed in firmware impacts WRITE perfomance.

Fixes: 726b85487067 ("qla2xxx: Add framework for async fabric discovery")
Cc: <stable@vger.kernel.org>
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agovhost/scsi: truncate T10 PI iov_iter to prot_bytes
Greg Edwards [Wed, 22 Aug 2018 19:21:53 +0000 (13:21 -0600)]
vhost/scsi: truncate T10 PI iov_iter to prot_bytes

commit 4542d623c7134bc1738f8a68ccb6dd546f1c264f upstream.

Commands with protection information included were not truncating the
protection iov_iter to the number of protection bytes in the command.
This resulted in vhost_scsi mis-calculating the size of the protection
SGL in vhost_scsi_calc_sgls(), and including both the protection and
data SG entries in the protection SGL.

Fixes: 09b13fa8c1a1 ("vhost/scsi: Add ANY_LAYOUT support in vhost_scsi_handle_vq")
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 09b13fa8c1a1093e9458549ac8bb203a7c65c62a
Cc: stable@vger.kernel.org
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agocrypto: hisilicon - Fix reference after free of memories on error path
John Garry [Mon, 5 Nov 2018 12:35:15 +0000 (20:35 +0800)]
crypto: hisilicon - Fix reference after free of memories on error path

commit 0b0cf6af3f3151c26c27e8e51def5527091c3e69 upstream.

coccicheck currently warns of the following issues in the driver:
drivers/crypto/hisilicon/sec/sec_algs.c:864:51-66: ERROR: reference preceded by free on line 812
drivers/crypto/hisilicon/sec/sec_algs.c:864:40-49: ERROR: reference preceded by free on line 813
drivers/crypto/hisilicon/sec/sec_algs.c:861:8-24: ERROR: reference preceded by free on line 814
drivers/crypto/hisilicon/sec/sec_algs.c:860:41-51: ERROR: reference preceded by free on line 815
drivers/crypto/hisilicon/sec/sec_algs.c:867:7-18: ERROR: reference preceded by free on line 816

It would appear than on certain error paths that we may attempt reference-
after-free some memories.

This patch fixes those issues. The solution doesn't look perfect, but
having same memories free'd possibly from separate functions makes it
tricky.

Fixes: 915e4e8413da ("crypto: hisilicon - SEC security accelerator driver")
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agocrypto: hisilicon - Fix NULL dereference for same dst and src
John Garry [Mon, 5 Nov 2018 12:35:14 +0000 (20:35 +0800)]
crypto: hisilicon - Fix NULL dereference for same dst and src

commit 68a031d22c57b94870ba13513c9d93b8a8119ab2 upstream.

When the source and destination addresses for the cipher are the same, we
will get a NULL dereference from accessing the split destination
scatterlist memories, as shown:

[   56.565719] tcrypt:
[   56.565719] testing speed of async ecb(aes) (hisi_sec_aes_ecb) encryption
[   56.574683] tcrypt: test 0 (128 bit key, 16 byte blocks):
[   56.587585] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[   56.596361] Mem abort info:
[   56.599151]   ESR = 0x96000006
[   56.602196]   Exception class = DABT (current EL), IL = 32 bits
[   56.608105]   SET = 0, FnV = 0
[   56.611149]   EA = 0, S1PTW = 0
[   56.614280] Data abort info:
[   56.617151]   ISV = 0, ISS = 0x00000006
[   56.620976]   CM = 0, WnR = 0
[   56.623930] user pgtable: 4k pages, 48-bit VAs, pgdp = (____ptrval____)
[   56.630533] [0000000000000000] pgd=0000041fc7e4d003, pud=0000041fcd9bf003, pmd=0000000000000000
[   56.639224] Internal error: Oops: 96000006 [#1] PREEMPT SMP
[   56.644782] Modules linked in: tcrypt(+)
[   56.648695] CPU: 21 PID: 2326 Comm: insmod Tainted: G        W         4.19.0-rc6-00001-g3fabfb8-dirty #716
[   56.658420] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 IT17 Nemo 2.0 RC0 10/05/2018
[   56.667537] pstate: 20000005 (nzCv daif -PAN -UAO)
[   56.672322] pc : sec_alg_skcipher_crypto+0x318/0x748
[   56.677274] lr : sec_alg_skcipher_crypto+0x178/0x748
[   56.682224] sp : ffff0000118e3840
[   56.685525] x29: ffff0000118e3840 x28: ffff841fbb3f8118
[   56.690825] x27: 0000000000000000 x26: 0000000000000000
[   56.696125] x25: ffff841fbb3f8080 x24: ffff841fbadc0018
[   56.701425] x23: ffff000009119000 x22: ffff841fbb24e280
[   56.706724] x21: ffff841ff212e780 x20: ffff841ff212e700
[   56.712023] x19: 0000000000000001 x18: ffffffffffffffff
[   56.717322] x17: 0000000000000000 x16: 0000000000000000
[   56.722621] x15: ffff0000091196c8 x14: 72635f7265687069
[   56.727920] x13: 636b735f676c615f x12: ffff000009119940
[   56.733219] x11: 0000000000000000 x10: 00000000006080c0
[   56.738519] x9 : 0000000000000000 x8 : ffff841fbb24e480
[   56.743818] x7 : ffff841fbb24e500 x6 : ffff841ff00cdcc0
[   56.749117] x5 : 0000000000000010 x4 : 0000000000000000
[   56.754416] x3 : ffff841fbb24e380 x2 : ffff841fbb24e480
[   56.759715] x1 : 0000000000000000 x0 : ffff000008f682c8
[   56.765016] Process insmod (pid: 2326, stack limit = 0x(____ptrval____))
[   56.771702] Call trace:
[   56.774136]  sec_alg_skcipher_crypto+0x318/0x748
[   56.778740]  sec_alg_skcipher_encrypt+0x10/0x18
[   56.783259]  test_skcipher_speed+0x2a0/0x700 [tcrypt]
[   56.788298]  do_test+0x18f8/0x48c8 [tcrypt]
[   56.792469]  tcrypt_mod_init+0x60/0x1000 [tcrypt]
[   56.797161]  do_one_initcall+0x5c/0x178
[   56.800985]  do_init_module+0x58/0x1b4
[   56.804721]  load_module+0x1da4/0x2150
[   56.808456]  __se_sys_init_module+0x14c/0x1e8
[   56.812799]  __arm64_sys_init_module+0x18/0x20
[   56.817231]  el0_svc_common+0x60/0xe8
[   56.820880]  el0_svc_handler+0x2c/0x80
[   56.824615]  el0_svc+0x8/0xc
[   56.827483] Code: a94c87a3 910b2000 f87b7842 f9004ba2 (b87b7821)
[   56.833564] ---[ end trace 0f63290590e93d94 ]---
Segmentation fault

Fix this by only accessing these memories when we have different src and
dst.

Fixes: 915e4e8413da ("crypto: hisilicon - SEC security accelerator driver")
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoreset: hisilicon: fix potential NULL pointer dereference
Gustavo A. R. Silva [Thu, 26 Jul 2018 00:47:19 +0000 (19:47 -0500)]
reset: hisilicon: fix potential NULL pointer dereference

commit e9a2310fb689151166df7fd9971093362d34bd79 upstream.

There is a potential execution path in which function
platform_get_resource() returns NULL. If this happens,
we will end up having a NULL pointer dereference.

Fix this by replacing devm_ioremap with devm_ioremap_resource,
which has the NULL check and the memory region request.

This code was detected with the help of Coccinelle.

Cc: stable@vger.kernel.org
Fixes: 97b7129cd2af ("reset: hisilicon: change the definition of hisi_reset_init")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoacpi, nfit: Fix ARS overflow continuation
Dan Williams [Thu, 1 Nov 2018 07:30:22 +0000 (00:30 -0700)]
acpi, nfit: Fix ARS overflow continuation

commit 3fa58dcab50a0aa16817f16a8d38aee869eb3fb9 upstream.

When the platform BIOS is unable to report all the media error records
it requires the OS to restart the scrub at a prescribed location. The
driver detects the overflow condition, but then fails to report it to
the ARS state machine after reaping the records. Propagate -ENOSPC
correctly to continue the ARS operation.

Cc: <stable@vger.kernel.org>
Fixes: 1cf03c00e7c1 ("nfit: scrub and register regions in a workqueue")
Reported-by: Jacek Zloch <jacek.zloch@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoacpi/nfit, x86/mce: Validate a MCE's address before using it
Vishal Verma [Fri, 26 Oct 2018 00:37:29 +0000 (18:37 -0600)]
acpi/nfit, x86/mce: Validate a MCE's address before using it

commit e8a308e5f47e545e0d41d0686c00f5f5217c5f61 upstream.

The NFIT machine check handler uses the physical address from the mce
structure, and compares it against information in the ACPI NFIT table
to determine whether that location lies on an NVDIMM. The mce->addr
field however may not always be valid, and this is indicated by the
MCI_STATUS_ADDRV bit in the status field.

Export mce_usable_address() which already performs validation for the
address, and use it in the NFIT handler.

Fixes: 6839a6d96f4e ("nfit: do an ARS scrub on hitting a latent media error")
Reported-by: Robert Elliott <elliott@hpe.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Arnd Bergmann <arnd@arndb.de>
Cc: Dan Williams <dan.j.williams@intel.com>
CC: Dave Jiang <dave.jiang@intel.com>
CC: elliott@hpe.com
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Len Brown <lenb@kernel.org>
CC: linux-acpi@vger.kernel.org
CC: linux-edac <linux-edac@vger.kernel.org>
CC: linux-nvdimm@lists.01.org
CC: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
CC: "Rafael J. Wysocki" <rjw@rjwysocki.net>
CC: Ross Zwisler <zwisler@kernel.org>
CC: stable <stable@vger.kernel.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tony Luck <tony.luck@intel.com>
CC: x86-ml <x86@kernel.org>
CC: Yazen Ghannam <yazen.ghannam@amd.com>
Link: http://lkml.kernel.org/r/20181026003729.8420-2-vishal.l.verma@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoacpi/nfit, x86/mce: Handle only uncorrectable machine checks
Vishal Verma [Fri, 26 Oct 2018 00:37:28 +0000 (18:37 -0600)]
acpi/nfit, x86/mce: Handle only uncorrectable machine checks

commit 5d96c9342c23ee1d084802dcf064caa67ecaa45b upstream.

The MCE handler for nfit devices is called for memory errors on a
Non-Volatile DIMM and adds the error location to a 'badblocks' list.
This list is used by the various NVDIMM drivers to avoid consuming known
poison locations during IO.

The MCE handler gets called for both corrected and uncorrectable errors.
Until now, both kinds of errors have been added to the badblocks list.
However, corrected memory errors indicate that the problem has already
been fixed by hardware, and the resulting interrupt is merely a
notification to Linux.

As far as future accesses to that location are concerned, it is
perfectly fine to use, and thus doesn't need to be included in the above
badblocks list.

Add a check in the nfit MCE handler to filter out corrected mce events,
and only process uncorrectable errors.

Fixes: 6839a6d96f4e ("nfit: do an ARS scrub on hitting a latent media error")
Reported-by: Omar Avelar <omar.avelar@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Arnd Bergmann <arnd@arndb.de>
CC: Dan Williams <dan.j.williams@intel.com>
CC: Dave Jiang <dave.jiang@intel.com>
CC: elliott@hpe.com
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Len Brown <lenb@kernel.org>
CC: linux-acpi@vger.kernel.org
CC: linux-edac <linux-edac@vger.kernel.org>
CC: linux-nvdimm@lists.01.org
CC: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
CC: "Rafael J. Wysocki" <rjw@rjwysocki.net>
CC: Ross Zwisler <zwisler@kernel.org>
CC: stable <stable@vger.kernel.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tony Luck <tony.luck@intel.com>
CC: x86-ml <x86@kernel.org>
CC: Yazen Ghannam <yazen.ghannam@amd.com>
Link: http://lkml.kernel.org/r/20181026003729.8420-1-vishal.l.verma@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agomach64: fix image corruption due to reading accelerator registers
Mikulas Patocka [Mon, 8 Oct 2018 10:57:35 +0000 (12:57 +0200)]
mach64: fix image corruption due to reading accelerator registers

commit c09bcc91bb94ed91f1391bffcbe294963d605732 upstream.

Reading the registers without waiting for engine idle returns
unpredictable values. These unpredictable values result in display
corruption - if atyfb_imageblit reads the content of DP_PIX_WIDTH with the
bit DP_HOST_TRIPLE_EN set (from previous invocation), the driver would
never ever clear the bit, resulting in display corruption.

We don't want to wait for idle because it would degrade performance, so
this patch modifies the driver so that it never reads accelerator
registers.

HOST_CNTL doesn't have to be read, we can just write it with
HOST_BYTE_ALIGN because no other part of the driver cares if
HOST_BYTE_ALIGN is set.

DP_PIX_WIDTH is written in the functions atyfb_copyarea and atyfb_fillrect
with the default value and in atyfb_imageblit with the value set according
to the source image data.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Ville Syrjälä <syrjala@sci.fi>
Cc: stable@vger.kernel.org
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agomach64: fix display corruption on big endian machines
Mikulas Patocka [Mon, 8 Oct 2018 10:57:34 +0000 (12:57 +0200)]
mach64: fix display corruption on big endian machines

commit 3c6c6a7878d00a3ac997a779c5b9861ff25dfcc8 upstream.

The code for manual bit triple is not endian-clean. It builds the variable
"hostdword" using byte accesses, therefore we must read the variable with
"le32_to_cpu".

The patch also enables (hardware or software) bit triple only if the image
is monochrome (image->depth). If we want to blit full-color image, we
shouldn't use the triple code.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Ville Syrjälä <syrjala@sci.fi>
Cc: stable@vger.kernel.org
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>