platform/kernel/linux-rpi.git
5 years agoMerge tag 'drm-fixes-2019-10-11' of git://anongit.freedesktop.org/drm/drm
Linus Torvalds [Fri, 11 Oct 2019 16:02:33 +0000 (09:02 -0700)]
Merge tag 'drm-fixes-2019-10-11' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
 "The regular fixes pull for rc3. The i915 team found some fixes they
  (or I) missed for rc1, which is why this is a bit bigger than usual,
  otherwise there is a single amdgpu fix, some spi panel aliases, and a
  bridge fix.

  i915:
   - execlist access fixes
   - list deletion fix
   - CML display fix
   - HSW workaround extension to GT2
   - chicken bit whitelist
   - GGTT resume issue
   - SKL GPU hangs for Vulkan compute

  amdgpu:
   - memory leak fix

  panel:
   - spi aliases

  tc358767:
   - bridge artifacts fix"

* tag 'drm-fixes-2019-10-11' of git://anongit.freedesktop.org/drm/drm: (22 commits)
  drm/bridge: tc358767: fix max_tu_symbol value
  drm/i915/gt: execlists->active is serialised by the tasklet
  drm/i915/execlists: Protect peeking at execlists->active
  drm/i915: Fixup preempt-to-busy vs reset of a virtual request
  drm/i915: Only enqueue already completed requests
  drm/i915/execlists: Drop redundant list_del_init(&rq->sched.link)
  drm/i915/cml: Add second PCH ID for CMP
  drm/amdgpu: fix memory leak
  drm/panel: tpo-td043mtea1: Fix SPI alias
  drm/panel: tpo-td028ttec1: Fix SPI alias
  drm/panel: sony-acx565akm: Fix SPI alias
  drm/panel: nec-nl8048hl11: Fix SPI alias
  drm/panel: lg-lb035q02: Fix SPI alias
  drm/i915: Mark contents as dirty on a write fault
  drm/i915: Prevent bonded requests from overtaking each other on preemption
  drm/i915: Bump skl+ max plane width to 5k for linear/x-tiled
  drm/i915: Verify the engine after acquiring the active.lock
  drm/i915: Extend Haswell GT1 PSMI workaround to all
  drm/i915: Don't mix srcu tag and negative error codes
  drm/i915: Whitelist COMMON_SLICE_CHICKEN2
  ...

5 years agoMerge tag 'for-linus-20191010' of git://git.kernel.dk/linux-block
Linus Torvalds [Fri, 11 Oct 2019 15:45:32 +0000 (08:45 -0700)]
Merge tag 'for-linus-20191010' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Fix wbt performance regression introduced with the blk-rq-qos
   refactoring (Harshad)

 - Fix io_uring fileset removal inadvertently killing the workqueue (me)

 - Fix io_uring typo in linked command nonblock submission (Pavel)

 - Remove spurious io_uring wakeups on request free (Pavel)

 - Fix null_blk zoned command error return (Keith)

 - Don't use freezable workqueues for backing_dev, also means we can
   revert a previous libata hack (Mika)

 - Fix nbd sysfs mutex dropped too soon at removal time (Xiubo)

* tag 'for-linus-20191010' of git://git.kernel.dk/linux-block:
  nbd: fix possible sysfs duplicate warning
  null_blk: Fix zoned command return code
  io_uring: only flush workqueues on fileset removal
  io_uring: remove wait loop spurious wakeups
  blk-wbt: fix performance regression in wbt scale_up/scale_down
  Revert "libata, freezer: avoid block device removal while system is frozen"
  bdi: Do not use freezable workqueue
  io_uring: fix reversed nonblock flag for link submission

5 years agoMerge tag 'drm-intel-fixes-2019-10-10' of git://anongit.freedesktop.org/drm/drm-intel...
Dave Airlie [Fri, 11 Oct 2019 00:09:01 +0000 (10:09 +1000)]
Merge tag 'drm-intel-fixes-2019-10-10' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes

- Fix CML display by adding a missing ID.
- Drop redundant list_del_init
- Only enqueue already completed requests to avoid races
- Fixup preempt-to-busy vs reset of a virtual request
- Protect peeking at execlists->active
- execlists->active is serialised by the tasklet

drm-intel-next-fixes-2019-09-19:
- Extend old HSW workaround to fix some GPU hangs on Haswell GT2
- Fix return error code on GEM mmap.
- White list a chicken bit register for push constants legacy mode on Mesa
- Fix resume issue related to GGTT restore
- Remove incorrect BUG_ON on execlist's schedule-out
- Fix unrecoverable GPU hangs with Vulkan compute workloads on SKL

drm-intel-next-fixes-2019-09-26:
- Fix concurrence on cases where requests where getting retired at same time as resubmitted to HW
- Fix gen9 display resolutions by setting the right max plane width
- Fix GPU hang on preemption
- Mark contents as dirty on a write fault. This was breaking cursor sprite with dumb buffers.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191010143039.GA15313@intel.com
5 years agoMerge tag 'drm-fixes-5.4-2019-10-09' of git://people.freedesktop.org/~agd5f/linux...
Dave Airlie [Fri, 11 Oct 2019 00:08:32 +0000 (10:08 +1000)]
Merge tag 'drm-fixes-5.4-2019-10-09' of git://people.freedesktop.org/~agd5f/linux into drm-fixes

drm-fixes-5.4-2019-10-09:

amdgpu:
- fix memory leak in bo_list ioctl error path

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexdeucher@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191010031023.23359-1-alexander.deucher@amd.com
5 years agoMerge tag 'drm-misc-fixes-2019-10-10' of git://anongit.freedesktop.org/drm/drm-misc...
Dave Airlie [Fri, 11 Oct 2019 00:08:02 +0000 (10:08 +1000)]
Merge tag 'drm-misc-fixes-2019-10-10' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

Short summary of fixes pull (less than what git shortlog provides):
- SPI Aliases fixes for panels
- One fix for the tc358767 bridge dealing with visual artifacts

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <mripard@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20191010105137.j6juxht5dsobgxph@gilmour
5 years agoMerge tag 'xfs-5.4-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Linus Torvalds [Thu, 10 Oct 2019 18:47:16 +0000 (11:47 -0700)]
Merge tag 'xfs-5.4-fixes-3' of git://git./fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "A couple of small code cleanups and bug fixes for rounding errors,
  metadata logging errors, and an extra layer of safeguards against
  leaking memory contents.

   - Fix a rounding error in the fallocate code

   - Minor code cleanups

   - Make sure to zero memory buffers before formatting metadata blocks

   - Fix a few places where we forgot to log an inode metadata update

   - Remove broken error handling that tried to clean up after a failure
     but still got it wrong"

* tag 'xfs-5.4-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: move local to extent inode logging into bmap helper
  xfs: remove broken error handling on failed attr sf to leaf change
  xfs: log the inode on directory sf to block format change
  xfs: assure zeroed memory buffers for certain kmem allocations
  xfs: removed unused error variable from xchk_refcountbt_rec
  xfs: remove unused flags arg from xfs_get_aghdr_buf()
  xfs: Fix tail rounding in xfs_alloc_file_space()

5 years agonbd: fix possible sysfs duplicate warning
Xiubo Li [Thu, 19 Sep 2019 06:14:27 +0000 (11:44 +0530)]
nbd: fix possible sysfs duplicate warning

1. nbd_put takes the mutex and drops nbd->ref to 0. It then does
idr_remove and drops the mutex.

2. nbd_genl_connect takes the mutex. idr_find/idr_for_each fails
to find an existing device, so it does nbd_dev_add.

3. just before the nbd_put could call nbd_dev_remove or not finished
totally, but if nbd_dev_add try to add_disk, we can hit:

debugfs: Directory 'nbd1' with parent 'block' already present!

This patch will make sure all the disk add/remove stuff are done
by holding the nbd_index_mutex lock.

Reported-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoMerge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Linus Torvalds [Thu, 10 Oct 2019 15:39:00 +0000 (08:39 -0700)]
Merge branch 'linus' of git://git./linux/kernel/git/herbert/crypto-2.6

Pull crypto fixes from Herbert Xu:
 "Fix build issues in arm/aes-ce"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: arm/aes-ce - add dependency on AES library
  crypto: arm/aes-ce - build for v8 architecture explicitly

5 years agoMerge tag 'for-5.4-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave...
Linus Torvalds [Thu, 10 Oct 2019 15:30:51 +0000 (08:30 -0700)]
Merge tag 'for-5.4-rc2-tag' of git://git./linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more stabitly fixes, one build warning fix.

   - fix inode allocation under NOFS context

   - fix leak in fiemap due to concurrent append writes

   - fix log-root tree updates

   - fix balance convert of single profile on 32bit architectures

   - silence false positive warning on old GCCs (code moved in rc1)"

* tag 'for-5.4-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: silence maybe-uninitialized warning in clone_range
  btrfs: fix uninitialized ret in ref-verify
  btrfs: allocate new inode in NOFS context
  btrfs: fix balance convert to single on 32-bit host CPUs
  btrfs: fix incorrect updating of log root tree
  Btrfs: fix memory leak due to concurrent append writes with fiemap

5 years agoMerge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Thu, 10 Oct 2019 15:26:58 +0000 (08:26 -0700)]
Merge branch 'work.dcache' of git://git./linux/kernel/git/viro/vfs

Pull dcache_readdir() fixes from Al Viro:
 "The couple of patches you'd been OK with; no hlist conversion yet, and
  cursors are still in the list of children"

[ Al is referring to future work to avoid some nasty O(n**2) behavior
  with the readdir cursors when you have lots of concurrent readdirs.

  This is just a fix for a race with a trivial cleanup   - Linus ]

* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  libfs: take cursors out of list when moving past the end of directory
  Fix the locking in dcache_readdir() and friends

5 years agoMerge branch 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Thu, 10 Oct 2019 15:16:44 +0000 (08:16 -0700)]
Merge branch 'work.mount3' of git://git./linux/kernel/git/viro/vfs

Pull mount fixes from Al Viro:
 "A couple of regressions from the mount series"

* 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: add missing blkdev_put() in get_tree_bdev()
  shmem: fix LSM options parsing

5 years agoMAINTAINERS: Remove Simon as Renesas SoC Co-Maintainer
Geert Uytterhoeven [Thu, 10 Oct 2019 12:30:46 +0000 (14:30 +0200)]
MAINTAINERS: Remove Simon as Renesas SoC Co-Maintainer

At the end of the v5.3 upstream kernel development cycle, Simon stepped
down from his role as Renesas SoC maintainer.

Remove his maintainership, git repository, and branch from the
MAINTAINERS file, and add an entry to the CREDITS file to honor his
work.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agodrm/bridge: tc358767: fix max_tu_symbol value
Tomi Valkeinen [Tue, 24 Sep 2019 13:17:02 +0000 (16:17 +0300)]
drm/bridge: tc358767: fix max_tu_symbol value

max_tu_symbol was programmed to TU_SIZE_RECOMMENDED - 1, which is not
what the spec says. The spec says:

roundup ((input active video bandwidth in bytes/output active video
bandwidth in bytes) * tu_size)

It is not quite clear what the above means, but calculating
max_tu_symbol = (input Bps / output Bps) * tu_size seems to work and
fixes the issues seen.

This fixes artifacts in some videomodes (e.g. 1024x768@60 on 2-lanes &
1.62Gbps was pretty bad for me).

Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
Tested-by: Jyri Sarha <jsarha@ti.com>
Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190924131702.9988-1-tomi.valkeinen@ti.com
5 years agonull_blk: Fix zoned command return code
Keith Busch [Wed, 9 Oct 2019 15:38:13 +0000 (00:38 +0900)]
null_blk: Fix zoned command return code

The return code from null_handle_zoned() sets the cmd->error value.
Returning OK status when an error occured overwrites the intended
cmd->error. Return the appropriate error code instead of setting the
error in the cmd.

Fixes: fceb5d1b19cbe626 ("null_blk: create a helper for zoned devices")
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agolibfs: take cursors out of list when moving past the end of directory
Al Viro [Fri, 20 Sep 2019 20:32:42 +0000 (16:32 -0400)]
libfs: take cursors out of list when moving past the end of directory

that eliminates the last place where we accessed the tail of ->d_subdirs

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
5 years agovfs: add missing blkdev_put() in get_tree_bdev()
Ian Kent [Wed, 2 Oct 2019 09:56:33 +0000 (17:56 +0800)]
vfs: add missing blkdev_put() in get_tree_bdev()

Is there are a couple of missing blkdev_put() in get_tree_bdev()?

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
5 years agoshmem: fix LSM options parsing
Al Viro [Thu, 10 Oct 2019 02:48:01 +0000 (22:48 -0400)]
shmem: fix LSM options parsing

->parse_monolithic() there forgets to call security_sb_eat_lsm_opts()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
5 years agodrm/i915/gt: execlists->active is serialised by the tasklet
Chris Wilson [Wed, 9 Oct 2019 16:09:06 +0000 (17:09 +0100)]
drm/i915/gt: execlists->active is serialised by the tasklet

The active/pending execlists is no longer protected by the
engine->active.lock, but is serialised by the tasklet instead. Update
the locking around the debug and stats to follow suit.

v2: local_bh_disable() to prevent recursing into the tasklet in case we
trigger a softirq (Tvrtko)

Fixes: df403069029d ("drm/i915/execlists: Lift process_csb() out of the irq-off spinlock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191009160906.16195-1-chris@chris-wilson.co.uk
(cherry picked from commit c36eebd9ba5d70b84e1e7408ccc7632566f285c4)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
5 years agodrm/i915/execlists: Protect peeking at execlists->active
Chris Wilson [Wed, 9 Oct 2019 10:09:54 +0000 (11:09 +0100)]
drm/i915/execlists: Protect peeking at execlists->active

Now that we dropped the engine->active.lock serialisation from around
process_csb(), direct submission can run concurrently to the interrupt
handler. As such execlists->active may be advanced as we dequeue,
dropping the reference to the request. We need to employ our RCU request
protection to ensure that the request is not freed too early.

Fixes: df403069029d ("drm/i915/execlists: Lift process_csb() out of the irq-off spinlock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191009100955.21477-1-chris@chris-wilson.co.uk
(cherry picked from commit c949ae431467764277cdd88d7c26ff963a9db40a)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
5 years agodrm/i915: Fixup preempt-to-busy vs reset of a virtual request
Chris Wilson [Mon, 23 Sep 2019 15:28:43 +0000 (16:28 +0100)]
drm/i915: Fixup preempt-to-busy vs reset of a virtual request

Due to the nature of preempt-to-busy the execlists active tracking and
the schedule queue may become temporarily desync'ed (between resubmission
to HW and its ack from HW). This means that we may have unwound a
request and passed it back to the virtual engine, but it is still
inflight on the HW and may even result in a GPU hang. If we detect that
GPU hang and try to reset, the hanging request->engine will no longer
match the current engine, which means that the request is not on the
execlists active list and we should not try to find an older incomplete
request. Given that we have deduced this must be a request on a virtual
engine, it is the single active request in the context and so must be
guilty (as the context is still inflight, it is prevented from being
executed on another engine as we process the reset).

Fixes: 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190923152844.8914-2-chris@chris-wilson.co.uk
(cherry picked from commit cb2377a919bbe8107af269c5a31a8d5cfb27d867)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
5 years agoio_uring: only flush workqueues on fileset removal
Jens Axboe [Wed, 9 Oct 2019 20:40:13 +0000 (14:40 -0600)]
io_uring: only flush workqueues on fileset removal

We should not remove the workqueue, we just need to ensure that the
workqueues are synced. The workqueues are torn down on ctx removal.

Cc: stable@vger.kernel.org
Fixes: 6b06314c47e1 ("io_uring: add file set registration")
Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agodrm/i915: Only enqueue already completed requests
Chris Wilson [Mon, 23 Sep 2019 11:00:55 +0000 (12:00 +0100)]
drm/i915: Only enqueue already completed requests

If we are asked to submit a completed request, just move it onto the
active-list without modifying it's payload. If we try to emit the
modified payload of a completed request, we risk racing with the
ring->head update during retirement which may advance the head past our
breadcrumb and so we generate a warning for the emission being behind
the RING_HEAD.

v2: Commentary for the sneaky, shared responsibility between functions.
v3: Spelling mistakes and bonus assertion

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190923110056.15176-3-chris@chris-wilson.co.uk
(cherry picked from commit c0bb487dc19fc45dbeede7dcf8f513df51a3cd33)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
5 years agodrm/i915/execlists: Drop redundant list_del_init(&rq->sched.link)
Chris Wilson [Mon, 23 Sep 2019 11:00:54 +0000 (12:00 +0100)]
drm/i915/execlists: Drop redundant list_del_init(&rq->sched.link)

Since amalgamating the queued and active lists in commit 422d7df4f090
("drm/i915: Replace engine->timeline with a plain list"), performing a
i915_request_submit() will remove the request from the execlists
priority queue.

References: 422d7df4f090 ("drm/i915: Replace engine->timeline with a plain list")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190923110056.15176-2-chris@chris-wilson.co.uk
(cherry picked from commit 3231f8c01121ee1febfd82398ee22f7ff9dc5d76)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
5 years agodrm/i915/cml: Add second PCH ID for CMP
Matt Roper [Mon, 16 Sep 2019 23:32:51 +0000 (16:32 -0700)]
drm/i915/cml: Add second PCH ID for CMP

The CMP PCH ID we have in the driver is correct for the CML-U machines we have
in our CI system, but the CML-S and CML-H CI machines appear to use a
different PCH ID, leading our driver to detect no PCH for them.

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Anusha Srivatsa <anusha.srivatsa@intel.com>
References: 729ae330a0f2e2 ("drm/i915/cml: Introduce Comet Lake PCH")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111461
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20190916233251.387-1-matthew.d.roper@intel.com
Fixes: 729ae330a0f2e2 ("drm/i915/cml: Introduce Comet Lake PCH")
(cherry picked from commit 8698ba53cd7173c32320ebbef4d389d41ebb5780)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
5 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Linus Torvalds [Wed, 9 Oct 2019 16:46:46 +0000 (09:46 -0700)]
Merge tag 'for-linus' of git://git./linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:
 "The usual collection of driver bug fixes, and a few regressions from
  the merge window. Nothing particularly worrisome.

   - Various missed memory frees and error unwind bugs

   - Fix regressions in a few iwarp drivers from 5.4 patches

   - A few regressions added in past kernels

   - Squash a number of races in mlx5 ODP code"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  RDMA/mlx5: Add missing synchronize_srcu() for MW cases
  RDMA/mlx5: Put live in the correct place for ODP MRs
  RDMA/mlx5: Order num_pending_prefetch properly with synchronize_srcu
  RDMA/odp: Lift umem_mutex out of ib_umem_odp_unmap_dma_pages()
  RDMA/mlx5: Fix a race with mlx5_ib_update_xlt on an implicit MR
  RDMA/mlx5: Do not allow rereg of a ODP MR
  IB/core: Fix wrong iterating on ports
  RDMA/nldev: Reshuffle the code to avoid need to rebind QP in error path
  RDMA/cxgb4: Do not dma memory off of the stack
  RDMA/cm: Fix memory leak in cm_add/remove_one
  RDMA/core: Fix an error handling path in 'res_get_common_doit()'
  RDMA/i40iw: Associate ibdev to netdev before IB device registration
  RDMA/iwcm: Fix a lock inversion issue
  RDMA/iw_cxgb4: fix SRQ access from dump_qp()
  RDMA/hfi1: Prevent memory leak in sdma_init
  RDMA/core: Fix use after free and refcnt leak on ndev in_device in iwarp_query_port
  RDMA/siw: Fix serialization issue in write_space()
  RDMA/vmw_pvrdma: Free SRQ only once

5 years agodrm/amdgpu: fix memory leak
Nirmoy Das [Fri, 4 Oct 2019 09:53:37 +0000 (11:53 +0200)]
drm/amdgpu: fix memory leak

cleanup error handling code and make sure temporary info array
with the handles are freed by amdgpu_bo_list_put() on
idr_replace()'s failure.

Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
5 years agoMerge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Linus Torvalds [Wed, 9 Oct 2019 16:27:22 +0000 (09:27 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux

Pull arm64 fixes from Will Deacon:
 "A larger-than-usual batch of arm64 fixes for -rc3.

  The bulk of the fixes are dealing with a bunch of issues with the
  build system from the compat vDSO, which unfortunately led to some
  significant Makefile rework to manage the horrible combinations of
  toolchains that we can end up needing to drive simultaneously.

  We came close to disabling the thing entirely, but Vincenzo was quick
  to spin up some patches and I ended up picking up most of the bits
  that were left [*]. Future work will look at disentangling the header
  files properly.

  Other than that, we have some important fixes all over, including one
  papering over the miscompilation fallout from forcing
  CONFIG_OPTIMIZE_INLINING=y, which I'm still unhappy about. Harumph.

  We've still got a couple of open issues, so I'm expecting to have some
  more fixes later this cycle.

  Summary:

   - Numerous fixes to the compat vDSO build system, especially when
     combining gcc and clang

   - Fix parsing of PAR_EL1 in spurious kernel fault detection

   - Partial workaround for Neoverse-N1 erratum #1542419

   - Fix IRQ priority masking on entry from compat syscalls

   - Fix advertisment of FRINT HWCAP to userspace

   - Attempt to workaround inlining breakage with '__always_inline'

   - Fix accidental freeing of parent SVE state on fork() error path

   - Add some missing NULL pointer checks in instruction emulation init

   - Some formatting and comment fixes"

[*] Will's final fixes were

Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
    but they were already in linux-next by then and he didn't rebase
    just to add those.

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (21 commits)
  arm64: armv8_deprecated: Checking return value for memory allocation
  arm64: Kconfig: Make CONFIG_COMPAT_VDSO a proper Kconfig option
  arm64: vdso32: Rename COMPATCC to CC_COMPAT
  arm64: vdso32: Pass '--target' option to clang via VDSO_CAFLAGS
  arm64: vdso32: Don't use KBUILD_CPPFLAGS unconditionally
  arm64: vdso32: Move definition of COMPATCC into vdso32/Makefile
  arm64: Default to building compat vDSO with clang when CONFIG_CC_IS_CLANG
  lib: vdso: Remove CROSS_COMPILE_COMPAT_VDSO
  arm64: vdso32: Remove jump label config option in Makefile
  arm64: vdso32: Detect binutils support for dmb ishld
  arm64: vdso: Remove stale files from old assembly implementation
  arm64: vdso32: Fix broken compat vDSO build warnings
  arm64: mm: fix spurious fault detection
  arm64: ftrace: Ensure synchronisation in PLT setup for Neoverse-N1 #1542419
  arm64: Fix incorrect irqflag restore for priority masking for compat
  arm64: mm: avoid virt_to_phys(init_mm.pgd)
  arm64: cpufeature: Effectively expose FRINT capability to userspace
  arm64: Mark functions using explicit register variables as '__always_inline'
  docs: arm64: Fix indentation and doc formatting
  arm64/sve: Fix wrong free for task->thread.sve_state
  ...

5 years agoxfs: move local to extent inode logging into bmap helper
Brian Foster [Mon, 7 Oct 2019 19:54:16 +0000 (12:54 -0700)]
xfs: move local to extent inode logging into bmap helper

The callers of xfs_bmap_local_to_extents_empty() log the inode
external to the function, yet this function is where the on-disk
format value is updated. Push the inode logging down into the
function itself to help prevent future mistakes.

Note that internal bmap callers track the inode logging flags
independently and thus may log the inode core twice due to this
change. This is harmless, so leave this code around for consistency
with the other attr fork conversion functions.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
5 years agoxfs: remove broken error handling on failed attr sf to leaf change
Brian Foster [Mon, 7 Oct 2019 19:54:15 +0000 (12:54 -0700)]
xfs: remove broken error handling on failed attr sf to leaf change

xfs_attr_shortform_to_leaf() attempts to put the shortform fork back
together after a failed attempt to convert from shortform to leaf
format. While this code reallocates and copies back the shortform
attr fork data, it never resets the inode format field back to local
format. Further, now that the inode is properly logged after the
initial switch from local format, any error that triggers the
recovery code will eventually abort the transaction and shutdown the
fs. Therefore, remove the broken and unnecessary error handling
code.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
5 years agoxfs: log the inode on directory sf to block format change
Brian Foster [Mon, 7 Oct 2019 19:54:15 +0000 (12:54 -0700)]
xfs: log the inode on directory sf to block format change

When a directory changes from shortform (sf) to block format, the sf
format is copied to a temporary buffer, the inode format is modified
and the updated format filled with the dentries from the temporary
buffer. If the inode format is modified and attempt to grow the
inode fails (due to I/O error, for example), it is possible to
return an error while leaving the directory in an inconsistent state
and with an otherwise clean transaction. This results in corruption
of the associated directory and leads to xfs_dabuf_map() errors as
subsequent lookups cannot accurately determine the format of the
directory. This problem is reproduced occasionally by generic/475.

The fundamental problem is that xfs_dir2_sf_to_block() changes the
on-disk inode format without logging the inode. The inode is
eventually logged by the bmapi layer in the common case, but error
checking introduces the possibility of failing the high level
request before this happens.

Update both of the dir2 and attr callers of
xfs_bmap_local_to_extents_empty() to log the inode core as
consistent with the bmap local to extent format change codepath.
This ensures that any subsequent errors after the format has changed
cause the transaction to abort.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
5 years agoMerge tag 'led-fixes-for-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Tue, 8 Oct 2019 22:36:04 +0000 (15:36 -0700)]
Merge tag 'led-fixes-for-5.4-rc3' of git://git./linux/kernel/git/j.anaszewski/linux-leds

Pull LED fixes from Jacek Anaszewski:

 - fix a leftover from earlier stage of development in the documentation
   of recently added led_compose_name() and fix old mistake in the
   documentation of led_set_brightness_sync() parameter name.

  - MAINTAINERS: add pointer to Pavel Machek's linux-leds.git tree.
    Pavel is going to take over LED tree maintainership from myself.

* tag 'led-fixes-for-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds:
  Add my linux-leds branch to MAINTAINERS
  leds: core: Fix leds.h structure documentation

5 years agoAdd my linux-leds branch to MAINTAINERS
Pavel Machek [Tue, 8 Oct 2019 18:58:35 +0000 (20:58 +0200)]
Add my linux-leds branch to MAINTAINERS

Add pointer to my git tree to MAINTAINERS. I'd like to maintain
linux-leds for-next branch for 5.5.

Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jacek Anaszewski <jacek.anaszewski@gmail.com>
5 years agoleds: core: Fix leds.h structure documentation
Dan Murphy [Wed, 2 Oct 2019 12:40:42 +0000 (07:40 -0500)]
leds: core: Fix leds.h structure documentation

Update the leds.h structure documentation to define the
correct arguments.

Signed-off-by: Dan Murphy <dmurphy@ti.com>
Signed-off-by: Jacek Anaszewski <jacek.anaszewski@gmail.com>
5 years agoMerge tag 'gpio-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux...
Linus Torvalds [Tue, 8 Oct 2019 17:55:22 +0000 (10:55 -0700)]
Merge tag 'gpio-v5.4-2' of git://git./linux/kernel/git/linusw/linux-gpio

Pull GPIO fixes from Linus Walleij:

 - don't clear FLAG_IS_OUT when emulating open drain/source in gpiolib

 - fix up the usage of nonexclusive GPIO descriptors from device trees

 - fix the incorrect IEC offset when toggling trigger edge in the
   Spreadtrum driver

 - use the correct unit for debounce settings in the MAX77620 driver

* tag 'gpio-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
  gpio: max77620: Use correct unit for debounce times
  gpio: eic: sprd: Fix the incorrect EIC offset when toggling
  gpio: fix getting nonexclusive gpiods from DT
  gpiolib: don't clear FLAG_IS_OUT when emulating open-drain/open-source

5 years agoMerge tag 'selinux-pr-20191007' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Tue, 8 Oct 2019 17:51:37 +0000 (10:51 -0700)]
Merge tag 'selinux-pr-20191007' of git://git./linux/kernel/git/pcmoore/selinux

Pull selinuxfix from Paul Moore:
 "One patch to ensure we don't copy bad memory up into userspace"

* tag 'selinux-pr-20191007' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  selinux: fix context string corruption in convert_context()

5 years agoMerge tag 'linux-kselftest-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Tue, 8 Oct 2019 17:49:05 +0000 (10:49 -0700)]
Merge tag 'linux-kselftest-5.4-rc3' of git://git./linux/kernel/git/shuah/linux-kselftest

Pull Kselftest fixes from Shuah Khan:
 "Fixes for existing tests and the framework.

  Cristian Marussi's patches add the ability to skip targets (tests) and
  exclude tests that didn't build from run-list. These patches improve
  the Kselftest results. Ability to skip targets helps avoid running
  tests that aren't supported in certain environments. As an example,
  bpf tests from mainline aren't supported on stable kernels and have
  dependency on bleeding edge llvm. Being able to skip bpf on systems
  that can't meet this llvm dependency will be helpful.

  Kselftest can be built and installed from the main Makefile. This
  change help simplify Kselftest use-cases which addresses request from
  users.

  Kees Cook added per test timeout support to limit individual test
  run-time"

* tag 'linux-kselftest-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests: watchdog: Add command line option to show watchdog_info
  selftests: watchdog: Validate optional file argument
  selftests/kselftest/runner.sh: Add 45 second timeout per test
  kselftest: exclude failed TARGETS from runlist
  kselftest: add capability to skip chosen TARGETS
  selftests: Add kselftest-all and kselftest-install targets

5 years agoarm64: armv8_deprecated: Checking return value for memory allocation
Yunfeng Ye [Sun, 29 Sep 2019 04:44:17 +0000 (12:44 +0800)]
arm64: armv8_deprecated: Checking return value for memory allocation

There are no return value checking when using kzalloc() and kcalloc() for
memory allocation. so add it.

Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Signed-off-by: Will Deacon <will@kernel.org>
5 years agobtrfs: silence maybe-uninitialized warning in clone_range
Austin Kim [Tue, 3 Sep 2019 03:30:19 +0000 (12:30 +0900)]
btrfs: silence maybe-uninitialized warning in clone_range

GCC throws warning message as below:

‘clone_src_i_size’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
 #define IS_ALIGNED(x, a)  (((x) & ((typeof(x))(a) - 1)) == 0)
                       ^
fs/btrfs/send.c:5088:6: note: â€˜clone_src_i_size’ was declared here
 u64 clone_src_i_size;
   ^
The clone_src_i_size is only used as call-by-reference
in a call to get_inode_info().

Silence the warning by initializing clone_src_i_size to 0.

Note that the warning is a false positive and reported by older versions
of GCC (eg. 7.x) but not eg 9.x. As there have been numerous people, the
patch is applied. Setting clone_src_i_size to 0 does not otherwise make
sense and would not do any action in case the code changes in the future.

Signed-off-by: Austin Kim <austindh.kim@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
5 years agodrm/panel: tpo-td043mtea1: Fix SPI alias
Laurent Pinchart [Mon, 7 Oct 2019 17:08:01 +0000 (20:08 +0300)]
drm/panel: tpo-td043mtea1: Fix SPI alias

The panel-tpo-td043mtea1 driver incorrectly includes the OF vendor
prefix in its SPI alias. Fix it, and move the manual alias to an SPI
module device table.

Fixes: dc2e1e5b2799 ("drm/panel: Add driver for the Toppoly TD043MTEA1 panel")
Reported-by: H. Nikolaus Schaller <hns@goldelico.com>
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191007170801.27647-6-laurent.pinchart@ideasonboard.com
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Tested-by: H. Nikolaus Schaller <hns@goldelico.com>
5 years agodrm/panel: tpo-td028ttec1: Fix SPI alias
Laurent Pinchart [Mon, 7 Oct 2019 17:08:00 +0000 (20:08 +0300)]
drm/panel: tpo-td028ttec1: Fix SPI alias

The panel-tpo-td028ttec1 driver incorrectly includes the OF vendor
prefix in its SPI alias. Fix it.

Fixes: 415b8dd08711 ("drm/panel: Add driver for the Toppoly TD028TTEC1 panel")
Reported-by: H. Nikolaus Schaller <hns@goldelico.com>
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191007170801.27647-5-laurent.pinchart@ideasonboard.com
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Tested-by: H. Nikolaus Schaller <hns@goldelico.com>
Tested-by: Andreas Kemnade <andreas@kemnade.info>
5 years agodrm/panel: sony-acx565akm: Fix SPI alias
Laurent Pinchart [Mon, 7 Oct 2019 17:07:59 +0000 (20:07 +0300)]
drm/panel: sony-acx565akm: Fix SPI alias

The panel-sony-acx565akm driver incorrectly includes the OF vendor
prefix in its SPI alias. Fix it, and move the manual alias to an SPI
module device table.

Fixes: 1c8fc3f0c5d2 ("drm/panel: Add driver for the Sony ACX565AKM panel")
Reported-by: H. Nikolaus Schaller <hns@goldelico.com>
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191007170801.27647-4-laurent.pinchart@ideasonboard.com
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com>
5 years agodrm/panel: nec-nl8048hl11: Fix SPI alias
Laurent Pinchart [Mon, 7 Oct 2019 17:07:58 +0000 (20:07 +0300)]
drm/panel: nec-nl8048hl11: Fix SPI alias

The panel-nec-nl8048hl11 driver incorrectly includes the OF vendor
prefix in its SPI alias. Fix it, and move the manual alias to an SPI
module device table.

Fixes: df439abe6501 ("drm/panel: Add driver for the NEC NL8048HL11 panel")
Reported-by: H. Nikolaus Schaller <hns@goldelico.com>
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191007170801.27647-3-laurent.pinchart@ideasonboard.com
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com>
5 years agodrm/panel: lg-lb035q02: Fix SPI alias
Laurent Pinchart [Mon, 7 Oct 2019 17:07:57 +0000 (20:07 +0300)]
drm/panel: lg-lb035q02: Fix SPI alias

The panel-lg-lb035q02 driver incorrectly includes the OF vendor prefix
in its SPI alias. Fix it, and move the manual alias to an SPI module
device table.

Fixes: f5b0c6542476 ("drm/panel: Add driver for the LG Philips LB035Q02 panel")
Reported-by: H. Nikolaus Schaller <hns@goldelico.com>
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191007170801.27647-2-laurent.pinchart@ideasonboard.com
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com>
5 years agoio_uring: remove wait loop spurious wakeups
Pavel Begunkov [Mon, 7 Oct 2019 23:18:42 +0000 (02:18 +0300)]
io_uring: remove wait loop spurious wakeups

Any changes interesting to tasks waiting in io_cqring_wait() are
commited with io_cqring_ev_posted(). However, io_ring_drop_ctx_refs()
also tries to do that but with no reason, that means spurious wakeups
every io_free_req() and io_uring_enter().

Just use percpu_ref_put() instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Mon, 7 Oct 2019 23:04:19 +0000 (16:04 -0700)]
Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
 "The usual shower of hotfixes.

  Chris's memcg patches aren't actually fixes - they're mature but a few
  niggling review issues were late to arrive.

  The ocfs2 fixes are quite old - those took some time to get reviewer
  attention.

  Subsystems affected by this patch series: ocfs2, hotfixes, mm/memcg,
  mm/slab-generic"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)
  mm, sl[ou]b: improve memory accounting
  mm, memcg: make scan aggression always exclude protection
  mm, memcg: make memory.emin the baseline for utilisation determination
  mm, memcg: proportional memory.{low,min} reclaim
  mm/vmpressure.c: fix a signedness bug in vmpressure_register_event()
  mm/page_alloc.c: fix a crash in free_pages_prepare()
  mm/z3fold.c: claim page in the beginning of free
  kernel/sysctl.c: do not override max_threads provided by userspace
  memcg: only record foreign writebacks with dirty pages when memcg is not disabled
  mm: fix -Wmissing-prototypes warnings
  writeback: fix use-after-free in finish_writeback_work()
  mm/memremap: drop unused SECTION_SIZE and SECTION_MASK
  panic: ensure preemption is disabled during panic()
  fs: ocfs2: fix a possible null-pointer dereference in ocfs2_info_scan_inode_alloc()
  fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock()
  fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()
  ocfs2: clear zero in unaligned direct IO

5 years agomm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)
Vlastimil Babka [Mon, 7 Oct 2019 00:58:45 +0000 (17:58 -0700)]
mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)

In most configurations, kmalloc() happens to return naturally aligned
(i.e.  aligned to the block size itself) blocks for power of two sizes.

That means some kmalloc() users might unknowingly rely on that
alignment, until stuff breaks when the kernel is built with e.g.
CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned.  Then
developers have to devise workaround such as own kmem caches with
specified alignment [1], which is not always practical, as recently
evidenced in [2].

The topic has been discussed at LSF/MM 2019 [3].  Adding a
'kmalloc_aligned()' variant would not help with code unknowingly relying
on the implicit alignment.  For slab implementations it would either
require creating more kmalloc caches, or allocate a larger size and only
give back part of it.  That would be wasteful, especially with a generic
alignment parameter (in contrast with a fixed alignment to size).

Ideally we should provide to mm users what they need without difficult
workarounds or own reimplementations, so let's make the kmalloc()
alignment to size explicitly guaranteed for power-of-two sizes under all
configurations.  What this means for the three available allocators?

* SLAB object layout happens to be mostly unchanged by the patch.  The
  implicitly provided alignment could be compromised with
  CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for
  caches with alignment larger than unsigned long long.  Practically on at
  least x86 this includes kmalloc caches as they use cache line alignment,
  which is larger than that.  Still, this patch ensures alignment on all
  arches and cache sizes.

* SLUB layout is also unchanged unless redzoning is enabled through
  CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
  With this patch, explicit alignment is guaranteed with redzoning as
  well.  This will result in more memory being wasted, but that should be
  acceptable in a debugging scenario.

* SLOB has no implicit alignment so this patch adds it explicitly for
  kmalloc().  The potential downside is increased fragmentation.  While
  pathological allocation scenarios are certainly possible, in my testing,
  after booting a x86_64 kernel+userspace with virtme, around 16MB memory
  was consumed by slab pages both before and after the patch, with
  difference in the noise.

[1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
[2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
[3] https://lwn.net/Articles/787740/

[akpm@linux-foundation.org: documentation fixlet, per Matthew]
Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: David Sterba <dsterba@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm, sl[ou]b: improve memory accounting
Vlastimil Babka [Mon, 7 Oct 2019 00:58:42 +0000 (17:58 -0700)]
mm, sl[ou]b: improve memory accounting

Patch series "guarantee natural alignment for kmalloc()", v2.

This patch (of 2):

SLOB currently doesn't account its pages at all, so in /proc/meminfo the
Slab field shows zero.  Modifying a counter on page allocation and
freeing should be acceptable even for the small system scenarios SLOB is
intended for.  Since reclaimable caches are not separated in SLOB,
account everything as unreclaimable.

SLUB currently doesn't account kmalloc() and kmalloc_node() allocations
larger than order-1 page, that are passed directly to the page
allocator.  As they also don't appear in /proc/slabinfo, it might look
like a memory leak.  For consistency, account them as well.  (SLAB
doesn't actually use page allocator directly, so no change there).

Ideally SLOB and SLUB would be handled in separate patches, but due to
the shared kmalloc_order() function and different kfree()
implementations, it's easier to patch both at once to prevent
inconsistencies.

Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm, memcg: make scan aggression always exclude protection
Chris Down [Mon, 7 Oct 2019 00:58:38 +0000 (17:58 -0700)]
mm, memcg: make scan aggression always exclude protection

This patch is an incremental improvement on the existing
memory.{low,min} relative reclaim work to base its scan pressure
calculations on how much protection is available compared to the current
usage, rather than how much the current usage is over some protection
threshold.

This change doesn't change the experience for the user in the normal
case too much.  One benefit is that it replaces the (somewhat arbitrary)
100% cutoff with an indefinite slope, which makes it easier to ballpark
a memory.low value.

As well as this, the old methodology doesn't quite apply generically to
machines with varying amounts of physical memory.  Let's say we have a
top level cgroup, workload.slice, and another top level cgroup,
system-management.slice.  We want to roughly give 12G to
system-management.slice, so on a 32GB machine we set memory.low to 20GB
in workload.slice, and on a 64GB machine we set memory.low to 52GB.
However, because these are relative amounts to the total machine size,
while the amount of memory we want to generally be willing to yield to
system.slice is absolute (12G), we end up putting more pressure on
system.slice just because we have a larger machine and a larger workload
to fill it, which seems fairly unintuitive.  With this new behaviour, we
don't end up with this unintended side effect.

Previously the way that memory.low protection works is that if you are
50% over a certain baseline, you get 50% of your normal scan pressure.
This is certainly better than the previous cliff-edge behaviour, but it
can be improved even further by always considering memory under the
currently enforced protection threshold to be out of bounds.  This means
that we can set relatively low memory.low thresholds for variable or
bursty workloads while still getting a reasonable level of protection,
whereas with the previous version we may still trivially hit the 100%
clamp.  The previous 100% clamp is also somewhat arbitrary, whereas this
one is more concretely based on the currently enforced protection
threshold, which is likely easier to reason about.

There is also a subtle issue with the way that proportional reclaim
worked previously -- it promotes having no memory.low, since it makes
pressure higher during low reclaim.  This happens because we base our
scan pressure modulation on how far memory.current is between memory.min
and memory.low, but if memory.low is unset, we only use the overage
method.  In most cromulent configurations, this then means that we end
up with *more* pressure than with no memory.low at all when we're in low
reclaim, which is not really very usable or expected.

With this patch, memory.low and memory.min affect reclaim pressure in a
more understandable and composable way.  For example, from a user
standpoint, "protected" memory now remains untouchable from a reclaim
aggression standpoint, and users can also have more confidence that
bursty workloads will still receive some amount of guaranteed
protection.

Link: http://lkml.kernel.org/r/20190322160307.GA3316@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm, memcg: make memory.emin the baseline for utilisation determination
Chris Down [Mon, 7 Oct 2019 00:58:35 +0000 (17:58 -0700)]
mm, memcg: make memory.emin the baseline for utilisation determination

Roman points out that when when we do the low reclaim pass, we scale the
reclaim pressure relative to position between 0 and the maximum
protection threshold.

However, if the maximum protection is based on memory.elow, and
memory.emin is above zero, this means we still may get binary behaviour
on second-pass low reclaim.  This is because we scale starting at 0, not
starting at memory.emin, and since we don't scan at all below emin, we
end up with cliff behaviour.

This should be a fairly uncommon case since usually we don't go into the
second pass, but it makes sense to scale our low reclaim pressure
starting at emin.

You can test this by catting two large sparse files, one in a cgroup
with emin set to some moderate size compared to physical RAM, and
another cgroup without any emin.  In both cgroups, set an elow larger
than 50% of physical RAM.  The one with emin will have less page
scanning, as reclaim pressure is lower.

Rebase on top of and apply the same idea as what was applied to handle
cgroup_memory=disable properly for the original proportional patch
http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name ("mm,
memcg: Handle cgroup_disable=memory when getting memcg protection").

Link: http://lkml.kernel.org/r/20190201051810.GA18895@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Suggested-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm, memcg: proportional memory.{low,min} reclaim
Chris Down [Mon, 7 Oct 2019 00:58:32 +0000 (17:58 -0700)]
mm, memcg: proportional memory.{low,min} reclaim

cgroup v2 introduces two memory protection thresholds: memory.low
(best-effort) and memory.min (hard protection).  While they generally do
what they say on the tin, there is a limitation in their implementation
that makes them difficult to use effectively: that cliff behaviour often
manifests when they become eligible for reclaim.  This patch implements
more intuitive and usable behaviour, where we gradually mount more
reclaim pressure as cgroups further and further exceed their protection
thresholds.

This cliff edge behaviour happens because we only choose whether or not
to reclaim based on whether the memcg is within its protection limits
(see the use of mem_cgroup_protected in shrink_node), but we don't vary
our reclaim behaviour based on this information.  Imagine the following
timeline, with the numbers the lruvec size in this zone:

1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
   scanned. (?!)

* Of course, we won't usually scan all available pages in the zone even
  without this patch because of scan control priority, over-reclaim
  protection, etc.  However, as shown by the tests at the end, these
  techniques don't sufficiently throttle such an extreme change in input,
  so cliff-like behaviour isn't really averted by their existence alone.

Here's an example of how this plays out in practice.  At Facebook, we are
trying to protect various workloads from "system" software, like
configuration management tools, metric collectors, etc (see this[0] case
study).  In order to find a suitable memory.low value, we start by
determining the expected memory range within which the workload will be
comfortable operating.  This isn't an exact science -- memory usage deemed
"comfortable" will vary over time due to user behaviour, differences in
composition of work, etc, etc.  As such we need to ballpark memory.low,
but doing this is currently problematic:

1. If we end up setting it too low for the workload, it won't have
   *any* effect (see discussion above).  The group will receive the full
   weight of reclaim and won't have any priority while competing with the
   less important system software, as if we had no memory.low configured
   at all.

2. Because of this behaviour, we end up erring on the side of setting
   it too high, such that the comfort range is reliably covered.  However,
   protected memory is completely unavailable to the rest of the system,
   so we might cause undue memory and IO pressure there when we *know* we
   have some elasticity in the workload.

3. Even if we get the value totally right, smack in the middle of the
   comfort zone, we get extreme jumps between no pressure and full
   pressure that cause unpredictable pressure spikes in the workload due
   to the current binary reclaim behaviour.

With this patch, we can set it to our ballpark estimation without too much
worry.  Any undesirable behaviour, such as too much or too little reclaim
pressure on the workload or system will be proportional to how far our
estimation is off.  This means we can set memory.low much more
conservatively and thus waste less resources *without* the risk of the
workload falling off a cliff if we overshoot.

As a more abstract technical description, this unintuitive behaviour
results in having to give high-priority workloads a large protection
buffer on top of their expected usage to function reliably, as otherwise
we have abrupt periods of dramatically increased memory pressure which
hamper performance.  Having to set these thresholds so high wastes
resources and generally works against the principle of work conservation.
In addition, having proportional memory reclaim behaviour has other
benefits.  Most notably, before this patch it's basically mandatory to set
memory.low to a higher than desirable value because otherwise as soon as
you exceed memory.low, all protection is lost, and all pages are eligible
to scan again.  By contrast, having a gradual ramp in reclaim pressure
means that you now still get some protection when thresholds are exceeded,
which means that one can now be more comfortable setting memory.low to
lower values without worrying that all protection will be lost.  This is
important because workingset size is really hard to know exactly,
especially with variable workloads, so at least getting *some* protection
if your workingset size grows larger than you expect increases user
confidence in setting memory.low without a huge buffer on top being
needed.

Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
assistance in thinking about how to make this work better.

In testing these changes, I intended to verify that:

1. Changes in page scanning become gradual and proportional instead of
   binary.

   To test this, I experimented stepping further and further down
   memory.low protection on a workload that floats around 19G workingset
   when under memory.low protection, watching page scan rates for the
   workload cgroup:

   +------------+-----------------+--------------------+--------------+
   | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
   +------------+-----------------+--------------------+--------------+
   |        21G |               0 |                  0 | N/A          |
   |        17G |             867 |               3799 | 23%          |
   |        12G |            1203 |               3543 | 34%          |
   |         8G |            2534 |               3979 | 64%          |
   |         4G |            3980 |               4147 | 96%          |
   |          0 |            3799 |               3980 | 95%          |
   +------------+-----------------+--------------------+--------------+

   As you can see, the test kernel (with a kernel containing this
   patch) ramps up page scanning significantly more gradually than the
   control kernel (without this patch).

2. More gradual ramp up in reclaim aggression doesn't result in
   premature OOMs.

   To test this, I wrote a script that slowly increments the number of
   pages held by stress(1)'s --vm-keep mode until a production system
   entered severe overall memory contention.  This script runs in a highly
   protected slice taking up the majority of available system memory.
   Watching vmstat revealed that page scanning continued essentially
   nominally between test and control, without causing forward reclaim
   progress to become arrested.

[0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project

[akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
[chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/vmpressure.c: fix a signedness bug in vmpressure_register_event()
Dan Carpenter [Mon, 7 Oct 2019 00:58:28 +0000 (17:58 -0700)]
mm/vmpressure.c: fix a signedness bug in vmpressure_register_event()

The "mode" and "level" variables are enums and in this context GCC will
treat them as unsigned ints so the error handling is never triggered.

I also removed the bogus initializer because it isn't required any more
and it's sort of confusing.

[akpm@linux-foundation.org: reduce implicit and explicit typecasting]
[akpm@linux-foundation.org: fix return value, add comment, per Matthew]
Link: http://lkml.kernel.org/r/20190925110449.GO3264@mwanda
Fixes: 3cadfa2b9497 ("mm/vmpressure.c: convert to use match_string() helper")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Enrico Weigelt <info@metux.net>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/page_alloc.c: fix a crash in free_pages_prepare()
Qian Cai [Mon, 7 Oct 2019 00:58:25 +0000 (17:58 -0700)]
mm/page_alloc.c: fix a crash in free_pages_prepare()

On architectures like s390, arch_free_page() could mark the page unused
(set_page_unused()) and any access later would trigger a kernel panic.
Fix it by moving arch_free_page() after all possible accessing calls.

 Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
 Krnl PSW : 0404e00180000000 0000000026c2b96e (__free_pages_ok+0x34e/0x5d8)
            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
 Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c 000000000000000f
            000003d080012100 000003d080013fc0 0000000000000000 0000000000100000
            00000000275cca48 0000000000000100 0000000000000008 000003d080010000
            00000000000001d0 000003d000000000 0000000026c2b78a 000000002717fdb0
 Krnl Code: 0000000026c2b95cec1100b30659 risbgn %r1,%r1,0,179,6
            0000000026c2b962e32014000036 pfd 2,1024(%r1)
           #0000000026c2b968d7ff10001000 xc 0(256,%r1),0(%r1)
           >0000000026c2b96e41101100  la %r1,256(%r1)
            0000000026c2b972a737fff8  brctg %r3,26c2b962
            0000000026c2b976d7ff10001000 xc 0(256,%r1),0(%r1)
            0000000026c2b97ce31003400004 lg %r1,832
            0000000026c2b982ebff1430016a asi 5168(%r1),-1
 Call Trace:
 __free_pages_ok+0x16a/0x5d8)
 memblock_free_all+0x206/0x290
 mem_init+0x58/0x120
 start_kernel+0x2b0/0x570
 startup_continue+0x6a/0xc0
 INFO: lockdep is turned off.
 Last Breaking-Event-Address:
 __free_pages_ok+0x372/0x5d8
 Kernel panic - not syncing: Fatal exception: panic_on_oops
 00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 26A2379C

In the past, only kernel_poison_pages() would trigger this but it needs
"page_poison=on" kernel cmdline, and I suspect nobody tested that on
s390.  Recently, kernel_init_free_pages() (commit 6471384af2a6 ("mm:
security: introduce init_on_alloc=1 and init_on_free=1 boot options"))
was added and could trigger this as well.

[akpm@linux-foundation.org: add comment]
Link: http://lkml.kernel.org/r/1569613623-16820-1-git-send-email-cai@lca.pw
Fixes: 8823b1dbc05f ("mm/page_poison.c: enable PAGE_POISONING as a separate option")
Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: <stable@vger.kernel.org> [5.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/z3fold.c: claim page in the beginning of free
Vitaly Wool [Mon, 7 Oct 2019 00:58:22 +0000 (17:58 -0700)]
mm/z3fold.c: claim page in the beginning of free

There's a really hard to reproduce race in z3fold between z3fold_free()
and z3fold_reclaim_page().  z3fold_reclaim_page() can claim the page
after z3fold_free() has checked if the page was claimed and
z3fold_free() will then schedule this page for compaction which may in
turn lead to random page faults (since that page would have been
reclaimed by then).

Fix that by claiming page in the beginning of z3fold_free() and not
forgetting to clear the claim in the end.

[vitalywool@gmail.com: v2]
Link: http://lkml.kernel.org/r/20190928113456.152742cf@bigdell
Link: http://lkml.kernel.org/r/20190926104844.4f0c6efa1366b8f5741eaba9@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Reported-by: Markus Linnala <markus.linnala@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Markus Linnala <markus.linnala@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agokernel/sysctl.c: do not override max_threads provided by userspace
Michal Hocko [Mon, 7 Oct 2019 00:58:19 +0000 (17:58 -0700)]
kernel/sysctl.c: do not override max_threads provided by userspace

Partially revert 16db3d3f1170 ("kernel/sysctl.c: threads-max observe
limits") because the patch is causing a regression to any workload which
needs to override the auto-tuning of the limit provided by kernel.

set_max_threads is implementing a boot time guesstimate to provide a
sensible limit of the concurrently running threads so that runaways will
not deplete all the memory.  This is a good thing in general but there
are workloads which might need to increase this limit for an application
to run (reportedly WebSpher MQ is affected) and that is simply not
possible after the mentioned change.  It is also very dubious to
override an admin decision by an estimation that doesn't have any direct
relation to correctness of the kernel operation.

Fix this by dropping set_max_threads from sysctl_max_threads so any
value is accepted as long as it fits into MAX_THREADS which is important
to check because allowing more threads could break internal robust futex
restriction.  While at it, do not use MIN_THREADS as the lower boundary
because it is also only a heuristic for automatic estimation and admin
might have a good reason to stop new threads to be created even when
below this limit.

This became more severe when we switched x86 from 4k to 8k kernel
stacks.  Starting since 6538b8ea886e ("x86_64: expand kernel stack to
16K") (3.16) we use THREAD_SIZE_ORDER = 2 and that halved the auto-tuned
value.

In the particular case

  3.12
  kernel.threads-max = 515561

  4.4
  kernel.threads-max = 200000

Neither of the two values is really insane on 32GB machine.

I am not sure we want/need to tune the max_thread value further.  If
anything the tuning should be removed altogether if proven not useful in
general.  But we definitely need a way to override this auto-tuning.

Link: http://lkml.kernel.org/r/20190922065801.GB18814@dhcp22.suse.cz
Fixes: 16db3d3f1170 ("kernel/sysctl.c: threads-max observe limits")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomemcg: only record foreign writebacks with dirty pages when memcg is not disabled
Baoquan He [Mon, 7 Oct 2019 00:58:15 +0000 (17:58 -0700)]
memcg: only record foreign writebacks with dirty pages when memcg is not disabled

In kdump kernel, memcg usually is disabled with 'cgroup_disable=memory'
for saving memory.  Now kdump kernel will always panic when dump vmcore
to local disk:

  BUG: kernel NULL pointer dereference, address: 0000000000000ab8
  Oops: 0000 [#1] SMP NOPTI
  CPU: 0 PID: 598 Comm: makedumpfile Not tainted 5.3.0+ #26
  Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 10/02/2018
  RIP: 0010:mem_cgroup_track_foreign_dirty_slowpath+0x38/0x140
  Call Trace:
   __set_page_dirty+0x52/0xc0
   iomap_set_page_dirty+0x50/0x90
   iomap_write_end+0x6e/0x270
   iomap_write_actor+0xce/0x170
   iomap_apply+0xba/0x11e
   iomap_file_buffered_write+0x62/0x90
   xfs_file_buffered_aio_write+0xca/0x320 [xfs]
   new_sync_write+0x12d/0x1d0
   vfs_write+0xa5/0x1a0
   ksys_write+0x59/0xd0
   do_syscall_64+0x59/0x1e0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

And this will corrupt the 1st kernel too with 'cgroup_disable=memory'.

Via the trace and with debugging, it is pointing to commit 97b27821b485
("writeback, memcg: Implement foreign dirty flushing") which introduced
this regression.  Disabling memcg causes the null pointer dereference at
uninitialized data in function mem_cgroup_track_foreign_dirty_slowpath().

Fix it by returning directly if memcg is disabled, but not trying to
record the foreign writebacks with dirty pages.

Link: http://lkml.kernel.org/r/20190924141928.GD31919@MiWiFi-R3L-srv
Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing")
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: fix -Wmissing-prototypes warnings
Yi Wang [Mon, 7 Oct 2019 00:58:12 +0000 (17:58 -0700)]
mm: fix -Wmissing-prototypes warnings

We get two warnings when build kernel W=1:

  mm/shuffle.c:36:12: warning: no previous prototype for `shuffle_show' [-Wmissing-prototypes]
  mm/sparse.c:220:6: warning: no previous prototype for `subsection_mask_set' [-Wmissing-prototypes]

Make the functions static to fix this.

Link: http://lkml.kernel.org/r/1566978161-7293-1-git-send-email-wang.yi59@zte.com.cn
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agowriteback: fix use-after-free in finish_writeback_work()
Tejun Heo [Mon, 7 Oct 2019 00:58:09 +0000 (17:58 -0700)]
writeback: fix use-after-free in finish_writeback_work()

finish_writeback_work() reads @done->waitq after decrementing
@done->cnt.  However, once @done->cnt reaches zero, @done may be freed
(from stack) at any moment and @done->waitq can contain something
unrelated by the time finish_writeback_work() tries to read it.  This
led to the following crash.

  "BUG: kernel NULL pointer dereference, address: 0000000000000002"
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 0 P4D 0
  Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
  CPU: 40 PID: 555153 Comm: kworker/u98:50 Kdump: loaded Not tainted
  ...
  Workqueue: writeback wb_workfn (flush-btrfs-1)
  RIP: 0010:_raw_spin_lock_irqsave+0x10/0x30
  Code: 48 89 d8 5b c3 e8 50 db 6b ff eb f4 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 9c 5b fa 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 05 48 89 d8 5b c3 89 c6 e8 fe ca 6b ff eb f2 66 90
  RSP: 0018:ffffc90049b27d98 EFLAGS: 00010046
  RAX: 0000000000000000 RBX: 0000000000000246 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: 0000000000000003 RDI: 0000000000000002
  RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
  R10: ffff889fff407600 R11: ffff88ba9395d740 R12: 000000000000e300
  R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
  FS:  0000000000000000(0000) GS:ffff88bfdfa00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000002 CR3: 0000000002409005 CR4: 00000000001606e0
  Call Trace:
   __wake_up_common_lock+0x63/0xc0
   wb_workfn+0xd2/0x3e0
   process_one_work+0x1f5/0x3f0
   worker_thread+0x2d/0x3d0
   kthread+0x111/0x130
   ret_from_fork+0x1f/0x30

Fix it by reading and caching @done->waitq before decrementing
@done->cnt.

Link: http://lkml.kernel.org/r/20190924010631.GH2233839@devbig004.ftw2.facebook.com
Fixes: 5b9cce4c7eb069 ("writeback: Generalize and expose wb_completion")
Signed-off-by: Tejun Heo <tj@kernel.org>
Debugged-by: Chris Mason <clm@fb.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org> [5.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/memremap: drop unused SECTION_SIZE and SECTION_MASK
Anshuman Khandual [Mon, 7 Oct 2019 00:58:05 +0000 (17:58 -0700)]
mm/memremap: drop unused SECTION_SIZE and SECTION_MASK

SECTION_SIZE and SECTION_MASK macros are not getting used anymore.  But
they do conflict with existing definitions on arm64 platform causing
following warning during build.  Lets drop these unused macros.

  mm/memremap.c:16: warning: "SECTION_MASK" redefined
   #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
  arch/arm64/include/asm/pgtable-hwdef.h:79: note: this is the location of the previous definition
   #define SECTION_MASK  (~(SECTION_SIZE-1))

  mm/memremap.c:17: warning: "SECTION_SIZE" redefined
   #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
  arch/arm64/include/asm/pgtable-hwdef.h:78: note: this is the location of the previous definition
   #define SECTION_SIZE  (_AC(1, UL) << SECTION_SHIFT)

Link: http://lkml.kernel.org/r/1569312010-31313-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agopanic: ensure preemption is disabled during panic()
Will Deacon [Mon, 7 Oct 2019 00:58:00 +0000 (17:58 -0700)]
panic: ensure preemption is disabled during panic()

Calling 'panic()' on a kernel with CONFIG_PREEMPT=y can leave the
calling CPU in an infinite loop, but with interrupts and preemption
enabled.  From this state, userspace can continue to be scheduled,
despite the system being "dead" as far as the kernel is concerned.

This is easily reproducible on arm64 when booting with "nosmp" on the
command line; a couple of shell scripts print out a periodic "Ping"
message whilst another triggers a crash by writing to
/proc/sysrq-trigger:

  | sysrq: Trigger a crash
  | Kernel panic - not syncing: sysrq triggered crash
  | CPU: 0 PID: 1 Comm: init Not tainted 5.2.15 #1
  | Hardware name: linux,dummy-virt (DT)
  | Call trace:
  |  dump_backtrace+0x0/0x148
  |  show_stack+0x14/0x20
  |  dump_stack+0xa0/0xc4
  |  panic+0x140/0x32c
  |  sysrq_handle_reboot+0x0/0x20
  |  __handle_sysrq+0x124/0x190
  |  write_sysrq_trigger+0x64/0x88
  |  proc_reg_write+0x60/0xa8
  |  __vfs_write+0x18/0x40
  |  vfs_write+0xa4/0x1b8
  |  ksys_write+0x64/0xf0
  |  __arm64_sys_write+0x14/0x20
  |  el0_svc_common.constprop.0+0xb0/0x168
  |  el0_svc_handler+0x28/0x78
  |  el0_svc+0x8/0xc
  | Kernel Offset: disabled
  | CPU features: 0x0002,24002004
  | Memory Limit: none
  | ---[ end Kernel panic - not syncing: sysrq triggered crash ]---
  |  Ping 2!
  |  Ping 1!
  |  Ping 1!
  |  Ping 2!

The issue can also be triggered on x86 kernels if CONFIG_SMP=n,
otherwise local interrupts are disabled in 'smp_send_stop()'.

Disable preemption in 'panic()' before re-enabling interrupts.

Link: http://lkml.kernel.org/r/20191002123538.22609-1-will@kernel.org
Link: https://lore.kernel.org/r/BX1W47JXPMR8.58IYW53H6M5N@dragonstone
Signed-off-by: Will Deacon <will@kernel.org>
Reported-by: Xogium <contact@xogium.me>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agofs: ocfs2: fix a possible null-pointer dereference in ocfs2_info_scan_inode_alloc()
Jia-Ju Bai [Mon, 7 Oct 2019 00:57:57 +0000 (17:57 -0700)]
fs: ocfs2: fix a possible null-pointer dereference in ocfs2_info_scan_inode_alloc()

In ocfs2_info_scan_inode_alloc(), there is an if statement on line 283
to check whether inode_alloc is NULL:

    if (inode_alloc)

When inode_alloc is NULL, it is used on line 287:

    ocfs2_inode_lock(inode_alloc, &bh, 0);
        ocfs2_inode_lock_full_nested(inode, ...)
            struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);

Thus, a possible null-pointer dereference may occur.

To fix this bug, inode_alloc is checked on line 286.

This bug is found by a static analysis tool STCheck written by us.

Link: http://lkml.kernel.org/r/20190726033717.32359-1-baijiaju1990@gmail.com
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agofs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock()
Jia-Ju Bai [Mon, 7 Oct 2019 00:57:54 +0000 (17:57 -0700)]
fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock()

In ocfs2_write_end_nolock(), there are an if statement on lines 1976,
2047 and 2058, to check whether handle is NULL:

    if (handle)

When handle is NULL, it is used on line 2045:

ocfs2_update_inode_fsync_trans(handle, inode, 1);
        oi->i_sync_tid = handle->h_transaction->t_tid;

Thus, a possible null-pointer dereference may occur.

To fix this bug, handle is checked before calling
ocfs2_update_inode_fsync_trans().

This bug is found by a static analysis tool STCheck written by us.

Link: http://lkml.kernel.org/r/20190726033705.32307-1-baijiaju1990@gmail.com
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agofs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()
Jia-Ju Bai [Mon, 7 Oct 2019 00:57:50 +0000 (17:57 -0700)]
fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()

In ocfs2_xa_prepare_entry(), there is an if statement on line 2136 to
check whether loc->xl_entry is NULL:

    if (loc->xl_entry)

When loc->xl_entry is NULL, it is used on line 2158:

    ocfs2_xa_add_entry(loc, name_hash);
        loc->xl_entry->xe_name_hash = cpu_to_le32(name_hash);
        loc->xl_entry->xe_name_offset = cpu_to_le16(loc->xl_size);

and line 2164:

    ocfs2_xa_add_namevalue(loc, xi);
        loc->xl_entry->xe_value_size = cpu_to_le64(xi->xi_value_len);
        loc->xl_entry->xe_name_len = xi->xi_name_len;

Thus, possible null-pointer dereferences may occur.

To fix these bugs, if loc-xl_entry is NULL, ocfs2_xa_prepare_entry()
abnormally returns with -EINVAL.

These bugs are found by a static analysis tool STCheck written by us.

[akpm@linux-foundation.org: remove now-unused ocfs2_xa_add_entry()]
Link: http://lkml.kernel.org/r/20190726101447.9153-1-baijiaju1990@gmail.com
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoocfs2: clear zero in unaligned direct IO
Jia Guo [Mon, 7 Oct 2019 00:57:47 +0000 (17:57 -0700)]
ocfs2: clear zero in unaligned direct IO

Unused portion of a part-written fs-block-sized block is not set to zero
in unaligned append direct write.This can lead to serious data
inconsistencies.

Ocfs2 manage disk with cluster size(for example, 1M), part-written in
one cluster will change the cluster state from UN-WRITTEN to WRITTEN,
VFS(function dio_zero_block) doesn't do the cleaning because bh's state
is not set to NEW in function ocfs2_dio_wr_get_block when we write a
WRITTEN cluster.  For example, the cluster size is 1M, file size is 8k
and we direct write from 14k to 15k, then 12k~14k and 15k~16k will
contain dirty data.

We have to deal with two cases:
 1.The starting position of direct write is outside the file.
 2.The starting position of direct write is located in the file.

We need set bh's state to NEW in the first case.  In the second case, we
need mapped twice because bh's state of area out file should be set to
NEW while area in file not.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/5292e287-8f1a-fd4a-1a14-661e555e0bed@huawei.com
Signed-off-by: Jia Guo <guojia12@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agouaccess: implement a proper unsafe_copy_to_user() and switch filldir over to it
Linus Torvalds [Mon, 7 Oct 2019 19:56:48 +0000 (12:56 -0700)]
uaccess: implement a proper unsafe_copy_to_user() and switch filldir over to it

In commit 9f79b78ef744 ("Convert filldir[64]() from __put_user() to
unsafe_put_user()") I made filldir() use unsafe_put_user(), which
improves code generation on x86 enormously.

But because we didn't have a "unsafe_copy_to_user()", the dirent name
copy was also done by hand with unsafe_put_user() in a loop, and it
turns out that a lot of other architectures didn't like that, because
unlike x86, they have various alignment issues.

Most non-x86 architectures trap and fix it up, and some (like xtensa)
will just fail unaligned put_user() accesses unconditionally.  Which
makes that "copy using put_user() in a loop" not work for them at all.

I could make that code do explicit alignment etc, but the architectures
that don't like unaligned accesses also don't really use the fancy
"user_access_begin/end()" model, so they might just use the regular old
__copy_to_user() interface.

So this commit takes that looping implementation, turns it into the x86
version of "unsafe_copy_to_user()", and makes other architectures
implement the unsafe copy version as __copy_to_user() (the same way they
do for the other unsafe_xyz() accessor functions).

Note that it only does this for the copying _to_ user space, and we
still don't have a unsafe version of copy_from_user().

That's partly because we have no current users of it, but also partly
because the copy_from_user() case is slightly different and cannot
efficiently be implemented in terms of a unsafe_get_user() loop (because
gcc can't do asm goto with outputs).

It would be trivial to do this using "rep movsb", which would work
really nicely on newer x86 cores, but really badly on some older ones.

Al Viro is looking at cleaning up all our user copy routines to make
this all a non-issue, but for now we have this simple-but-stupid version
for x86 that works fine for the dirent name copy case because those
names are short strings and we simply don't need anything fancier.

Fixes: 9f79b78ef744 ("Convert filldir[64]() from __put_user() to unsafe_put_user()")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Reported-and-tested-by: Tony Luck <tony.luck@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agodrm/i915: Mark contents as dirty on a write fault
Chris Wilson [Fri, 20 Sep 2019 12:18:21 +0000 (13:18 +0100)]
drm/i915: Mark contents as dirty on a write fault

Since dropping the set-to-gtt-domain in commit a679f58d0510 ("drm/i915:
Flush pages on acquisition"), we no longer mark the contents as dirty on
a write fault. This has the issue of us then not marking the pages as
dirty on releasing the buffer, which means the contents are not written
out to the swap device (should we ever pick that buffer as a victim).
Notably, this is visible in the dumb buffer interface used for cursors.
Having updated the cursor contents via mmap, and swapped away, if the
shrinker should evict the old cursor, upon next reuse, the cursor would
be invisible.

E.g. echo 80 > /proc/sys/kernel/sysrq ; echo f > /proc/sysrq-trigger

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111541
Fixes: a679f58d0510 ("drm/i915: Flush pages on acquisition")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.william.auld@gmail.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: <stable@vger.kernel.org> # v5.2+
Reviewed-by: Matthew Auld <matthew.william.auld@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190920121821.7223-1-chris@chris-wilson.co.uk
(cherry picked from commit 5028851cdfdf78dc22eacbc44a0ab0b3f599ee4a)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
5 years agodrm/i915: Prevent bonded requests from overtaking each other on preemption
Chris Wilson [Mon, 23 Sep 2019 15:28:44 +0000 (16:28 +0100)]
drm/i915: Prevent bonded requests from overtaking each other on preemption

Force bonded requests to run on distinct engines so that they cannot be
shuffled onto the same engine where timeslicing will reverse the order.
A bonded request will often wait on a semaphore signaled by its master,
creating an implicit dependency -- if we ignore that implicit dependency
and allow the bonded request to run on the same engine and before its
master, we will cause a GPU hang. [Whether it will hang the GPU is
debatable, we should keep on timeslicing and each timeslice should be
"accidentally" counted as forward progress, in which case it should run
but at one-half to one-third speed.]

We can prevent this inversion by restricting which engines we allow
ourselves to jump to upon preemption, i.e. baking in the arrangement
established at first execution. (We should also consider capturing the
implicit dependency using i915_sched_add_dependency(), but first we need
to think about the constraints that requires on the execution/retirement
ordering.)

Fixes: 8ee36e048c98 ("drm/i915/execlists: Minimalistic timeslicing")
References: ee1136908e9b ("drm/i915/execlists: Virtual engine bonding")
Testcase: igt/gem_exec_balancer/bonded-slice
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190923152844.8914-3-chris@chris-wilson.co.uk
(cherry picked from commit e2144503bf3b22275dd33cef2880e1cb5fb200c5)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
5 years agodrm/i915: Bump skl+ max plane width to 5k for linear/x-tiled
Ville Syrjälä [Thu, 5 Sep 2019 13:50:43 +0000 (16:50 +0300)]
drm/i915: Bump skl+ max plane width to 5k for linear/x-tiled

The officially validated plane width limit is 4k on skl+, however
we already had people using 5k displays before we started to enforce
the limit. Also it seems Windows allows 5k resolutions as well
(though not sure if they do it with one plane or two).

According to hw folks 5k should work with the possible
exception of the following features:
- Ytile (already limited to 4k)
- FP16 (already limited to 4k)
- render compression (already limited to 4k)
- KVMR sprite and cursor (don't care)
- horizontal panning (need to verify this)
- pipe and plane scaling (need to verify this)

So apart from last two items on that list we are already
fine. We should really verify what happens with those last
two items but I don't have a 5k display on hand atm so it'll
have to wait.

In the meantime let's just bump the limit back up to 5k since
several users have already been using it without apparent issues.
At least we'll be no worse off than we were prior to lowering
the limits.

Cc: stable@vger.kernel.org
Cc: Sean Paul <sean@poorly.run>
Cc: José Roberto de Souza <jose.souza@intel.com>
Tested-by: Leho Kraav <leho@kraav.com>
Fixes: 372b9ffb5799 ("drm/i915: Fix skl+ max plane width")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111501
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190905135044.2001-1-ville.syrjala@linux.intel.com
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: Sean Paul <sean@poorly.run>
(cherry picked from commit bed34ef544f9ab37ab349c04cf4142282c4dcf5d)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
5 years agodrm/i915: Verify the engine after acquiring the active.lock
Chris Wilson [Wed, 18 Sep 2019 14:54:50 +0000 (15:54 +0100)]
drm/i915: Verify the engine after acquiring the active.lock

When using virtual engines, the rq->engine is not stable until we hold
the engine->active.lock (as the virtual engine may be exchanged with the
sibling). Since commit 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy")
we may retire a request concurrently with resubmitting it to HW, we need
to be extra careful to verify we are holding the correct lock for the
request's active list. This is similar to the issue we saw with
rescheduling the virtual requests, see sched_lock_engine().

Or else:

<4> [876.736126] list_add corruption. prev->next should be next (ffff8883f931a1f8), but was dead000000000100. (prev=ffff888361ffa610).
<4> [876.736136] WARNING: CPU: 2 PID: 21 at lib/list_debug.c:28 __list_add_valid+0x4d/0x70
<4> [876.736137] Modules linked in: i915(+) amdgpu gpu_sched ttm vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic mei_hdcp x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul snd_intel_nhlt snd_hda_codec snd_hwdep snd_hda_core ghash_clmulni_intel e1000e cdc_ether usbnet mii snd_pcm ptp pps_core mei_me mei prime_numbers btusb btrtl btbcm btintel bluetooth ecdh_generic ecc [last unloaded: i915]
<4> [876.736154] CPU: 2 PID: 21 Comm: ksoftirqd/2 Tainted: G     U            5.3.0-CI-CI_DRM_6898+ #1
<4> [876.736156] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3183.A00.1905020411 05/02/2019
<4> [876.736157] RIP: 0010:__list_add_valid+0x4d/0x70
<4> [876.736159] Code: c3 48 89 d1 48 c7 c7 20 33 0e 82 48 89 c2 e8 4a 4a bc ff 0f 0b 31 c0 c3 48 89 c1 4c 89 c6 48 c7 c7 70 33 0e 82 e8 33 4a bc ff <0f> 0b 31 c0 c3 48 89 f2 4c 89 c1 48 89 fe 48 c7 c7 c0 33 0e 82 e8
<4> [876.736160] RSP: 0018:ffffc9000018bd30 EFLAGS: 00010082
<4> [876.736162] RAX: 0000000000000000 RBX: ffff888361ffc840 RCX: 0000000000000104
<4> [876.736163] RDX: 0000000080000104 RSI: 0000000000000000 RDI: 00000000ffffffff
<4> [876.736164] RBP: ffffc9000018bd68 R08: 0000000000000000 R09: 0000000000000001
<4> [876.736165] R10: 00000000aed95de3 R11: 000000007fe927eb R12: ffff888361ffca10
<4> [876.736166] R13: ffff888361ffa610 R14: ffff888361ffc880 R15: ffff8883f931a1f8
<4> [876.736168] FS:  0000000000000000(0000) GS:ffff88849fd00000(0000) knlGS:0000000000000000
<4> [876.736169] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [876.736170] CR2: 00007f093a9173c0 CR3: 00000003bba08005 CR4: 0000000000760ee0
<4> [876.736171] PKRU: 55555554
<4> [876.736172] Call Trace:
<4> [876.736226]  __i915_request_submit+0x152/0x370 [i915]
<4> [876.736263]  __execlists_submission_tasklet+0x6da/0x1f50 [i915]
<4> [876.736293]  ? execlists_submission_tasklet+0x29/0x50 [i915]
<4> [876.736321]  execlists_submission_tasklet+0x34/0x50 [i915]
<4> [876.736325]  tasklet_action_common.isra.5+0x47/0xb0
<4> [876.736328]  __do_softirq+0xd8/0x4ae
<4> [876.736332]  ? smpboot_thread_fn+0x23/0x280
<4> [876.736334]  ? smpboot_thread_fn+0x6b/0x280
<4> [876.736336]  run_ksoftirqd+0x2b/0x50
<4> [876.736338]  smpboot_thread_fn+0x1d3/0x280
<4> [876.736341]  ? sort_range+0x20/0x20
<4> [876.736343]  kthread+0x119/0x130
<4> [876.736345]  ? kthread_park+0xa0/0xa0
<4> [876.736347]  ret_from_fork+0x24/0x50
<4> [876.736353] irq event stamp: 2290145
<4> [876.736356] hardirqs last  enabled at (2290144): [<ffffffff8123cde8>] __slab_free+0x3e8/0x500
<4> [876.736358] hardirqs last disabled at (2290145): [<ffffffff819cfb4d>] _raw_spin_lock_irqsave+0xd/0x50
<4> [876.736360] softirqs last  enabled at (2290114): [<ffffffff81c0033e>] __do_softirq+0x33e/0x4ae
<4> [876.736361] softirqs last disabled at (2290119): [<ffffffff810b815b>] run_ksoftirqd+0x2b/0x50
<4> [876.736363] WARNING: CPU: 2 PID: 21 at lib/list_debug.c:28 __list_add_valid+0x4d/0x70
<4> [876.736364] ---[ end trace 3e58d6c7356c65bf ]---
<4> [876.736406] ------------[ cut here ]------------
<4> [876.736415] list_del corruption. prev->next should be ffff888361ffca10, but was ffff88840ac2c730
<4> [876.736421] WARNING: CPU: 2 PID: 5490 at lib/list_debug.c:53 __list_del_entry_valid+0x79/0x90
<4> [876.736422] Modules linked in: i915(+) amdgpu gpu_sched ttm vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic mei_hdcp x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul snd_intel_nhlt snd_hda_codec snd_hwdep snd_hda_core ghash_clmulni_intel e1000e cdc_ether usbnet mii snd_pcm ptp pps_core mei_me mei prime_numbers btusb btrtl btbcm btintel bluetooth ecdh_generic ecc [last unloaded: i915]
<4> [876.736433] CPU: 2 PID: 5490 Comm: i915_selftest Tainted: G     U  W         5.3.0-CI-CI_DRM_6898+ #1
<4> [876.736435] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3183.A00.1905020411 05/02/2019
<4> [876.736436] RIP: 0010:__list_del_entry_valid+0x79/0x90
<4> [876.736438] Code: 0b 31 c0 c3 48 89 fe 48 c7 c7 30 34 0e 82 e8 ae 49 bc ff 0f 0b 31 c0 c3 48 89 f2 48 89 fe 48 c7 c7 68 34 0e 82 e8 97 49 bc ff <0f> 0b 31 c0 c3 48 c7 c7 a8 34 0e 82 e8 86 49 bc ff 0f 0b 31 c0 c3
<4> [876.736439] RSP: 0018:ffffc900003ef758 EFLAGS: 00010086
<4> [876.736440] RAX: 0000000000000000 RBX: ffff888361ffc840 RCX: 0000000000000002
<4> [876.736442] RDX: 0000000080000002 RSI: 0000000000000000 RDI: 00000000ffffffff
<4> [876.736443] RBP: ffffc900003ef780 R08: 0000000000000000 R09: 0000000000000001
<4> [876.736444] R10: 000000001418e4b7 R11: 000000007f0ea93b R12: ffff888361ffcab8
<4> [876.736445] R13: ffff88843b6d0000 R14: 000000000000217c R15: 0000000000000001
<4> [876.736447] FS:  00007f4e6f255240(0000) GS:ffff88849fd00000(0000) knlGS:0000000000000000
<4> [876.736448] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [876.736449] CR2: 00007f093a9173c0 CR3: 00000003bba08005 CR4: 0000000000760ee0
<4> [876.736450] PKRU: 55555554
<4> [876.736451] Call Trace:
<4> [876.736488]  i915_request_retire+0x224/0x8e0 [i915]
<4> [876.736521]  i915_request_create+0x4b/0x1b0 [i915]
<4> [876.736550]  nop_virtual_engine+0x230/0x4d0 [i915]

Fixes: 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111695
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190918145453.8800-1-chris@chris-wilson.co.uk
(cherry picked from commit 37fa0de3c137d5f54f7e64f53495c9d501d42a4d)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
5 years agodrm/i915: Extend Haswell GT1 PSMI workaround to all
Chris Wilson [Tue, 17 Sep 2019 19:47:46 +0000 (20:47 +0100)]
drm/i915: Extend Haswell GT1 PSMI workaround to all

A few times in CI, we have detected a GPU hang on our Haswell GT2
systems with the characteristic IPEHR of 0x780c0000. When the PSMI w/a
was first introducted, it was applied to all Haswell, but later on we
found an erratum that supposedly restricted the issue to GT1 and so
constrained it only be applied on GT1. That may have been a mistake...

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111692
Fixes: 167bc759e823 ("drm/i915: Restrict PSMI context load w/a to Haswell GT1")
References: 2c550183476d ("drm/i915: Disable PSMI sleep messages on all rings around context switches")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190917194746.26710-1-chris@chris-wilson.co.uk
(cherry picked from commit 56c05de6bd773b96deca379370965c49042b5fbf)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
5 years agodrm/i915: Don't mix srcu tag and negative error codes
Chris Wilson [Thu, 12 Sep 2019 16:08:34 +0000 (17:08 +0100)]
drm/i915: Don't mix srcu tag and negative error codes

While srcu may use an integer tag, it does not exclude potential error
codes and so may overlap with our own use of -EINTR. Use a separate
outparam to store the tag, and report the error code separately.

Fixes: 2caffbf11762 ("drm/i915: Revoke mmaps and prevent access to fence registers across reset")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190912160834.30601-1-chris@chris-wilson.co.uk
(cherry picked from commit eebab60f224fcfd560957715d08c31564d8672ed)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
5 years agodrm/i915: Whitelist COMMON_SLICE_CHICKEN2
Kenneth Graunke [Wed, 11 Sep 2019 01:48:01 +0000 (18:48 -0700)]
drm/i915: Whitelist COMMON_SLICE_CHICKEN2

This allows userspace to use "legacy" mode for push constants, where
they are committed at 3DPRIMITIVE or flush time, rather than being
committed at 3DSTATE_BINDING_TABLE_POINTERS_XS time.  Gen6-8 and Gen11
both use the "legacy" behavior - only Gen9 works in the "new" way.

Conflating push constants with binding tables is painful for userspace,
we would like to be able to avoid doing so.

Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Cc: stable@vger.kernel.org
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20190911014801.26821-1-kenneth@whitecape.org
(cherry picked from commit 0606259e3b3a1220a0f04a92a1654a3f674f47ee)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
5 years agodrm/i915: Perform GGTT restore much earlier during resume
Chris Wilson [Mon, 9 Sep 2019 11:00:08 +0000 (12:00 +0100)]
drm/i915: Perform GGTT restore much earlier during resume

As soon as we re-enable the various functions within the HW, they may go
off and read data via a GGTT offset. Hence, if we have not yet restored
the GGTT PTE before then, they may read and even *write* random locations
in memory.

Detected by DMAR faults during resume.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Martin Peres <martin.peres@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: stable@vger.kernel.org
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190909110011.8958-4-chris@chris-wilson.co.uk
(cherry picked from commit cec5ca08e36fd18d2939b98055346b3b06f56c6c)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
5 years agodrm/i915/execlists: Remove incorrect BUG_ON for schedule-out
Chris Wilson [Sat, 7 Sep 2019 10:50:46 +0000 (11:50 +0100)]
drm/i915/execlists: Remove incorrect BUG_ON for schedule-out

As we may unwind incomplete requests (for preemption) prior to
processing the CSB and the schedule-out events, we may update rq->engine
(resetting it to point back to the parent virtual engine) prior to
calling execlists_schedule_out(), invalidating the assertion that the
request still points to the inflight engine. (The likelihood of this is
increased if the CSB interrupt processing is pushed to the ksoftirqd for
being too slow and direct submission overtakes it.)

Tvrtko summarised it as:
"So unwind from direct submission resets rq->engine and races with
process_csb from the tasklet which notices request has actually
completed."

Reported-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Fixes: df403069029d ("drm/i915/execlists: Lift process_csb() out of the irq-off spinlock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190907105046.19934-1-chris@chris-wilson.co.uk
(cherry picked from commit d810583fc2fcf139cc766eb2303500b2d9cf064d)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
5 years agoarm64: Kconfig: Make CONFIG_COMPAT_VDSO a proper Kconfig option
Will Deacon [Mon, 7 Oct 2019 12:03:12 +0000 (13:03 +0100)]
arm64: Kconfig: Make CONFIG_COMPAT_VDSO a proper Kconfig option

CONFIG_COMPAT_VDSO is defined by passing '-DCONFIG_COMPAT_VDSO' to the
compiler when the generic compat vDSO code is in use. It's much cleaner
and simpler to expose this as a proper Kconfig option (like x86 does),
so do that and remove the bodge.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
5 years agoarm64: vdso32: Rename COMPATCC to CC_COMPAT
Will Deacon [Fri, 4 Oct 2019 13:20:06 +0000 (14:20 +0100)]
arm64: vdso32: Rename COMPATCC to CC_COMPAT

For consistency with CROSS_COMPILE_COMPAT, mechanically rename COMPATCC
to CC_COMPAT so that specifying aspects of the compat vDSO toolchain in
the environment isn't needlessly confusing.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
5 years agoarm64: vdso32: Pass '--target' option to clang via VDSO_CAFLAGS
Will Deacon [Mon, 7 Oct 2019 11:27:59 +0000 (12:27 +0100)]
arm64: vdso32: Pass '--target' option to clang via VDSO_CAFLAGS

Directly passing the '--target' option to clang by appending to
COMPATCC does not work if COMPATCC has been specified explicitly as
an argument to Make unless the 'override' directive is used, which is
ugly and different to what is done in the top-level Makefile.

Move the '--target' option for clang out of COMPATCC and into
VDSO_CAFLAGS, where it will be picked up when compiling and assembling
the 32-bit vDSO under clang.

Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
5 years agoarm64: vdso32: Don't use KBUILD_CPPFLAGS unconditionally
Will Deacon [Fri, 4 Oct 2019 14:44:45 +0000 (15:44 +0100)]
arm64: vdso32: Don't use KBUILD_CPPFLAGS unconditionally

KBUILD_CPPFLAGS is defined differently depending on whether the main
compiler is clang or not. This means that it is not possible to build
the compat vDSO with GCC if the rest of the kernel is built with clang.

Define VDSO_CPPFLAGS directly to break this dependency and allow a clang
kernel to build a compat vDSO with GCC:

  $ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- \
    CROSS_COMPILE_COMPAT=arm-linux-gnueabihf- CC=clang \
    COMPATCC=arm-linux-gnueabihf-gcc

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
5 years agoarm64: vdso32: Move definition of COMPATCC into vdso32/Makefile
Will Deacon [Fri, 4 Oct 2019 14:43:53 +0000 (15:43 +0100)]
arm64: vdso32: Move definition of COMPATCC into vdso32/Makefile

There's no need to export COMPATCC, so just define it locally in the
vdso32/Makefile, which is the only place where it is used.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
5 years agoarm64: Default to building compat vDSO with clang when CONFIG_CC_IS_CLANG
Will Deacon [Fri, 4 Oct 2019 13:08:13 +0000 (14:08 +0100)]
arm64: Default to building compat vDSO with clang when CONFIG_CC_IS_CLANG

Rather than force the use of GCC for the compat cross-compiler, instead
extract the target from CROSS_COMPILE_COMPAT and pass it to clang if the
main compiler is clang.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
5 years agolib: vdso: Remove CROSS_COMPILE_COMPAT_VDSO
Vincenzo Frascino [Thu, 3 Oct 2019 17:48:38 +0000 (18:48 +0100)]
lib: vdso: Remove CROSS_COMPILE_COMPAT_VDSO

arm64 was the last architecture using CROSS_COMPILE_COMPAT_VDSO config
option. With this patch series the dependency in the architecture has
been removed.

Remove CROSS_COMPILE_COMPAT_VDSO from the Unified vDSO library code.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
5 years agoarm64: vdso32: Remove jump label config option in Makefile
Vincenzo Frascino [Thu, 3 Oct 2019 17:48:36 +0000 (18:48 +0100)]
arm64: vdso32: Remove jump label config option in Makefile

The jump labels are not used in vdso32 since it is not possible to run
runtime patching on them.

Remove the configuration option from the Makefile.

Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
5 years agoarm64: vdso32: Detect binutils support for dmb ishld
Vincenzo Frascino [Thu, 3 Oct 2019 17:48:34 +0000 (18:48 +0100)]
arm64: vdso32: Detect binutils support for dmb ishld

Older versions of binutils (prior to 2.24) do not support the "ISHLD"
option for memory barrier instructions, which leads to a build failure
when assembling the vdso32 library.

Add a compilation time mechanism that detects if binutils supports those
instructions and configure the kernel accordingly.

Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Will Deacon <will@kernel.org>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
5 years agoarm64: vdso: Remove stale files from old assembly implementation
Vincenzo Frascino [Thu, 3 Oct 2019 17:48:35 +0000 (18:48 +0100)]
arm64: vdso: Remove stale files from old assembly implementation

Moving over to the generic C implementation of the vDSO inadvertently
left some stale files behind which are no longer used. Remove them.

Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
5 years agoarm64: vdso32: Fix broken compat vDSO build warnings
Vincenzo Frascino [Thu, 3 Oct 2019 17:48:33 +0000 (18:48 +0100)]
arm64: vdso32: Fix broken compat vDSO build warnings

The .config file and the generated include/config/auto.conf can
end up out of sync after a set of commands since
CONFIG_CROSS_COMPILE_COMPAT_VDSO is not updated correctly.

The sequence can be reproduced as follows:

$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- defconfig
[...]
$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- menuconfig
[set CONFIG_CROSS_COMPILE_COMPAT_VDSO="arm-linux-gnueabihf-"]
$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-

Which results in:

arch/arm64/Makefile:62: CROSS_COMPILE_COMPAT not defined or empty,
the compat vDSO will not be built

even though the compat vDSO has been built:

$ file arch/arm64/kernel/vdso32/vdso.so
arch/arm64/kernel/vdso32/vdso.so: ELF 32-bit LSB pie executable, ARM,
EABI5 version 1 (SYSV), dynamically linked,
BuildID[sha1]=c67f6c786f2d2d6f86c71f708595594aa25247f6, stripped

A similar case that involves changing the configuration parameter
multiple times can be reconducted to the same family of problems.

Remove the use of CONFIG_CROSS_COMPILE_COMPAT_VDSO altogether and
instead rely on the cross-compiler prefix coming from the environment
via CROSS_COMPILE_COMPAT, much like we do for the rest of the kernel.

Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Will Deacon <will@kernel.org>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
5 years agoarm64: mm: fix spurious fault detection
Mark Rutland [Fri, 4 Oct 2019 13:58:47 +0000 (14:58 +0100)]
arm64: mm: fix spurious fault detection

When detecting a spurious EL1 translation fault, we attempt to compare
ESR_EL1.DFSC with PAR_EL1.FST. We erroneously use FIELD_PREP() to
extract PAR_EL1.FST, when we should be using FIELD_GET().

In the wise words of Robin Murphy:

| FIELD_GET() is a UBFX, FIELD_PREP() is a BFI

Using FIELD_PREP() means that that dfsc & ESR_ELx_FSC_TYPE is always
zero, and hence not equal to ESR_ELx_FSC_FAULT. Thus we detect any
unhandled translation fault as spurious.

... so let's use FIELD_GET() to ensure we don't decide all translation
faults are spurious. ESR_EL1.DFSC occupies bits [5:0], and requires no
shifting.

Fixes: 42f91093b043332a ("arm64: mm: Ignore spurious translation faults taken from the kernel")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Robin Murphy <robin.murphy@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will.deacon@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
5 years agoxfs: assure zeroed memory buffers for certain kmem allocations
Bill O'Donnell [Fri, 4 Oct 2019 23:38:44 +0000 (16:38 -0700)]
xfs: assure zeroed memory buffers for certain kmem allocations

Guarantee zeroed memory buffers for cases where potential memory
leak to disk can occur. In these cases, kmem_alloc is used and
doesn't zero the buffer, opening the possibility of information
leakage to disk.

Use existing infrastucture (xfs_buf_allocate_memory) to obtain
the already zeroed buffer from kernel memory.

This solution avoids the performance issue that would occur if a
wholesale change to replace kmem_alloc with kmem_zalloc was done.

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
[darrick: fix bitwise complaint about kmflag_mask]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
5 years agoxfs: removed unused error variable from xchk_refcountbt_rec
Aliasgar Surti [Mon, 30 Sep 2019 18:31:48 +0000 (11:31 -0700)]
xfs: removed unused error variable from xchk_refcountbt_rec

Removed unused error variable. Instead of using error variable,
returned the value directly as it wasn't updated.

Signed-off-by: Aliasgar Surti <aliasgar.surti500@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
5 years agoxfs: remove unused flags arg from xfs_get_aghdr_buf()
Eric Sandeen [Mon, 30 Sep 2019 18:31:12 +0000 (11:31 -0700)]
xfs: remove unused flags arg from xfs_get_aghdr_buf()

The flags arg is always passed as zero, so remove it.

(xfs_buf_get_uncached takes flags to support XBF_NO_IOACCT for
the sb, but that should never be relevant for xfs_get_aghdr_buf)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
5 years agoxfs: Fix tail rounding in xfs_alloc_file_space()
Max Reitz [Mon, 30 Sep 2019 18:29:44 +0000 (11:29 -0700)]
xfs: Fix tail rounding in xfs_alloc_file_space()

To ensure that all blocks touched by the range [offset, offset + count)
are allocated, we need to calculate the block count from the difference
of the range end (rounded up) and the range start (rounded down).

Before this patch, we just round up the byte count, which may lead to
unaligned ranges not being fully allocated:

$ touch test_file
$ block_size=$(stat -fc '%S' test_file)
$ fallocate -o $((block_size / 2)) -l $block_size test_file
$ xfs_bmap test_file
test_file:
        0: [0..7]: 1396264..1396271
        1: [8..15]: hole

There should not be a hole there.  Instead, the first two blocks should
be fully allocated.

With this patch applied, the result is something like this:

$ touch test_file
$ block_size=$(stat -fc '%S' test_file)
$ fallocate -o $((block_size / 2)) -l $block_size test_file
$ xfs_bmap test_file
test_file:
        0: [0..15]: 11024..11039

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
5 years agoLinux 5.4-rc2 v5.4-rc2
Linus Torvalds [Sun, 6 Oct 2019 21:27:30 +0000 (14:27 -0700)]
Linux 5.4-rc2

5 years agoelf: don't use MAP_FIXED_NOREPLACE for elf executable mappings
Linus Torvalds [Sun, 6 Oct 2019 20:53:27 +0000 (13:53 -0700)]
elf: don't use MAP_FIXED_NOREPLACE for elf executable mappings

In commit 4ed28639519c ("fs, elf: drop MAP_FIXED usage from elf_map") we
changed elf to use MAP_FIXED_NOREPLACE instead of MAP_FIXED for the
executable mappings.

Then, people reported that it broke some binaries that had overlapping
segments from the same file, and commit ad55eac74f20 ("elf: enforce
MAP_FIXED on overlaying elf segments") re-instated MAP_FIXED for some
overlaying elf segment cases.  But only some - despite the summary line
of that commit, it only did it when it also does a temporary brk vma for
one obvious overlapping case.

Now Russell King reports another overlapping case with old 32-bit x86
binaries, which doesn't trigger that limited case.  End result: we had
better just drop MAP_FIXED_NOREPLACE entirely, and go back to MAP_FIXED.

Yes, it's a sign of old binaries generated with old tool-chains, but we
do pride ourselves on not breaking existing setups.

This still leaves MAP_FIXED_NOREPLACE in place for the load_elf_interp()
and the old load_elf_library() use-cases, because nobody has reported
breakage for those. Yet.

Note that in all the cases seen so far, the overlapping elf sections
seem to be just re-mapping of the same executable with different section
attributes.  We could possibly introduce a new MAP_FIXED_NOFILECHANGE
flag or similar, which acts like NOREPLACE, but allows just remapping
the same executable file using different protection flags.

It's not clear that would make a huge difference to anything, but if
people really hate that "elf remaps over previous maps" behavior, maybe
at least a more limited form of remapping would alleviate some concerns.

Alternatively, we should take a look at our elf_map() logic to see if we
end up not mapping things properly the first time.

In the meantime, this is the minimal "don't do that then" patch while
people hopefully think about it more.

Reported-by: Russell King <linux@armlinux.org.uk>
Fixes: 4ed28639519c ("fs, elf: drop MAP_FIXED usage from elf_map")
Fixes: ad55eac74f20 ("elf: enforce  MAP_FIXED on overlaying elf segments")
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoMerge tag 'dma-mapping-5.4-1' of git://git.infradead.org/users/hch/dma-mapping
Linus Torvalds [Sun, 6 Oct 2019 18:10:15 +0000 (11:10 -0700)]
Merge tag 'dma-mapping-5.4-1' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping regression fix from Christoph Hellwig:
 "Revert an incorret hunk from a patch that caused problems on various
  arm boards (Andrey Smirnov)"

* tag 'dma-mapping-5.4-1' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: fix false positive warnings in dma_common_free_remap()

5 years agoblk-wbt: fix performance regression in wbt scale_up/scale_down
Harshad Shirwadkar [Sat, 5 Oct 2019 18:59:27 +0000 (11:59 -0700)]
blk-wbt: fix performance regression in wbt scale_up/scale_down

scale_up wakes up waiters after scaling up. But after scaling max, it
should not wake up more waiters as waiters will not have anything to
do. This patch fixes this by making scale_up (and also scale_down)
return when threshold is reached.

This bug causes increased fdatasync latency when fdatasync and dd
conv=sync are performed in parallel on 4.19 compared to 4.14. This
bug was introduced during refactoring of blk-wbt code.

Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
Cc: stable@vger.kernel.org
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoRevert "libata, freezer: avoid block device removal while system is frozen"
Mika Westerberg [Fri, 4 Oct 2019 10:00:25 +0000 (13:00 +0300)]
Revert "libata, freezer: avoid block device removal while system is frozen"

This reverts commit 85fbd722ad0f5d64d1ad15888cd1eb2188bfb557.

The commit was added as a quick band-aid for a hang that happened when a
block device was removed during system suspend. Now that bdi_wq is not
freezable anymore the hang should not be possible and we can get rid of
this hack by reverting it.

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agobdi: Do not use freezable workqueue
Mika Westerberg [Fri, 4 Oct 2019 10:00:24 +0000 (13:00 +0300)]
bdi: Do not use freezable workqueue

A removable block device, such as NVMe or SSD connected over Thunderbolt
can be hot-removed any time including when the system is suspended. When
device is hot-removed during suspend and the system gets resumed, kernel
first resumes devices and then thaws the userspace including freezable
workqueues. What happens in that case is that the NVMe driver notices
that the device is unplugged and removes it from the system. This ends
up calling bdi_unregister() for the gendisk which then schedules
wb_workfn() to be run one more time.

However, since the bdi_wq is still frozen flush_delayed_work() call in
wb_shutdown() blocks forever halting system resume process. User sees
this as hang as nothing is happening anymore.

Triggering sysrq-w reveals this:

  Workqueue: nvme-wq nvme_remove_dead_ctrl_work [nvme]
  Call Trace:
   ? __schedule+0x2c5/0x630
   ? wait_for_completion+0xa4/0x120
   schedule+0x3e/0xc0
   schedule_timeout+0x1c9/0x320
   ? resched_curr+0x1f/0xd0
   ? wait_for_completion+0xa4/0x120
   wait_for_completion+0xc3/0x120
   ? wake_up_q+0x60/0x60
   __flush_work+0x131/0x1e0
   ? flush_workqueue_prep_pwqs+0x130/0x130
   bdi_unregister+0xb9/0x130
   del_gendisk+0x2d2/0x2e0
   nvme_ns_remove+0xed/0x110 [nvme_core]
   nvme_remove_namespaces+0x96/0xd0 [nvme_core]
   nvme_remove+0x5b/0x160 [nvme]
   pci_device_remove+0x36/0x90
   device_release_driver_internal+0xdf/0x1c0
   nvme_remove_dead_ctrl_work+0x14/0x30 [nvme]
   process_one_work+0x1c2/0x3f0
   worker_thread+0x48/0x3e0
   kthread+0x100/0x140
   ? current_work+0x30/0x30
   ? kthread_park+0x80/0x80
   ret_from_fork+0x35/0x40

This is not limited to NVMes so exactly same issue can be reproduced by
hot-removing SSD (over Thunderbolt) while the system is suspended.

Prevent this from happening by removing WQ_FREEZABLE from bdi_wq.

Reported-by: AceLan Kao <acelan.kao@canonical.com>
Link: https://marc.info/?l=linux-kernel&m=138695698516487
Link: https://bugzilla.kernel.org/show_bug.cgi?id=204385
Link: https://lore.kernel.org/lkml/20191002122136.GD2819@lahna.fi.intel.com/#t
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoMerge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Linus Torvalds [Sun, 6 Oct 2019 00:18:43 +0000 (17:18 -0700)]
Merge tag 'armsoc-fixes' of git://git./linux/kernel/git/soc/soc

Pull ARM SoC fixes from Olof Johansson:
 "A few fixes this time around:

   - Fixup of some clock specifications for DRA7 (device-tree fix)

   - Removal of some dead/legacy CPU OPP/PM code for OMAP that throws
     warnings at boot

   - A few more minor fixups for OMAPs, most around display

   - Enable STM32 QSPI as =y since their rootfs sometimes comes from
     there

   - Switch CONFIG_REMOTEPROC to =y since it went from tristate to bool

   - Fix of thermal zone definition for ux500 (5.4 regression)"

* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
  ARM: multi_v7_defconfig: Fix SPI_STM32_QSPI support
  ARM: dts: ux500: Fix up the CPU thermal zone
  arm64/ARM: configs: Change CONFIG_REMOTEPROC from m to y
  ARM: dts: am4372: Set memory bandwidth limit for DISPC
  ARM: OMAP2+: Fix warnings with broken omap2_set_init_voltage()
  ARM: OMAP2+: Add missing LCDC midlemode for am335x
  ARM: OMAP2+: Fix missing reset done flag for am3 and am43
  ARM: dts: Fix gpio0 flags for am335x-icev2
  ARM: omap2plus_defconfig: Enable more droid4 devices as loadable modules
  ARM: omap2plus_defconfig: Enable DRM_TI_TFP410
  DTS: ARM: gta04: introduce legacy spi-cs-high to make display work again
  ARM: dts: Fix wrong clocks for dra7 mcasp
  clk: ti: dra7: Fix mcasp8 clock bits

5 years agoMerge tag 'kbuild-fixes-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahi...
Linus Torvalds [Sat, 5 Oct 2019 19:56:59 +0000 (12:56 -0700)]
Merge tag 'kbuild-fixes-v5.4' of git://git./linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - remove unneeded ar-option and KBUILD_ARFLAGS

 - remove long-deprecated SUBDIRS

 - fix modpost to suppress false-positive warnings for UML builds

 - fix namespace.pl to handle relative paths to ${objtree}, ${srctree}

 - make setlocalversion work for /bin/sh

 - make header archive reproducible

 - fix some Makefiles and documents

* tag 'kbuild-fixes-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kheaders: make headers archive reproducible
  kbuild: update compile-test header list for v5.4-rc2
  kbuild: two minor updates for Documentation/kbuild/modules.rst
  scripts/setlocalversion: clear local variable to make it work for sh
  namespace: fix namespace.pl script to support relative paths
  video/logo: do not generate unneeded logo C files
  video/logo: remove unneeded *.o pattern from clean-files
  integrity: remove pointless subdir-$(CONFIG_...)
  integrity: remove unneeded, broken attempt to add -fshort-wchar
  modpost: fix static EXPORT_SYMBOL warnings for UML build
  kbuild: correct formatting of header in kbuild module docs
  kbuild: remove SUBDIRS support
  kbuild: remove ar-option and KBUILD_ARFLAGS

5 years agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Sat, 5 Oct 2019 19:53:27 +0000 (12:53 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "Twelve patches mostly small but obvious fixes or cosmetic but small
  updates"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: qla2xxx: Fix Nport ID display value
  scsi: qla2xxx: Fix N2N link up fail
  scsi: qla2xxx: Fix N2N link reset
  scsi: qla2xxx: Optimize NPIV tear down process
  scsi: qla2xxx: Fix stale mem access on driver unload
  scsi: qla2xxx: Fix unbound sleep in fcport delete path.
  scsi: qla2xxx: Silence fwdump template message
  scsi: hisi_sas: Make three functions static
  scsi: megaraid: disable device when probe failed after enabled device
  scsi: storvsc: setup 1:1 mapping between hardware queue and CPU queue
  scsi: qedf: Remove always false 'tmp_prio < 0' statement
  scsi: ufs: skip shutdown if hba is not powered
  scsi: bnx2fc: Handle scope bits when array returns BUSY or TSF

5 years agoMerge branch 'readdir' (readdir speedup and sanity checking)
Linus Torvalds [Sat, 5 Oct 2019 19:03:27 +0000 (12:03 -0700)]
Merge branch 'readdir' (readdir speedup and sanity checking)

This makes getdents() and getdents64() do sanity checking on the
pathname that it gives to user space.  And to mitigate the performance
impact of that, it first cleans up the way it does the user copying, so
that the code avoids doing the SMAP/PAN updates between each part of the
dirent structure write.

I really wanted to do this during the merge window, but didn't have
time.  The conversion of filldir to unsafe_put_user() is something I've
had around for years now in a private branch, but the extra pathname
checking finally made me clean it up to the point where it is mergable.

It's worth noting that the filename validity checking really should be a
bit smarter: it would be much better to delay the error reporting until
the end of the readdir, so that non-corrupted filenames are still
returned.  But that involves bigger changes, so let's see if anybody
actually hits the corrupt directory entry case before worrying about it
further.

* branch 'readdir':
  Make filldir[64]() verify the directory entry filename is valid
  Convert filldir[64]() from __put_user() to unsafe_put_user()

5 years agoMake filldir[64]() verify the directory entry filename is valid
Linus Torvalds [Sat, 5 Oct 2019 18:32:52 +0000 (11:32 -0700)]
Make filldir[64]() verify the directory entry filename is valid

This has been discussed several times, and now filesystem people are
talking about doing it individually at the filesystem layer, so head
that off at the pass and just do it in getdents{64}().

This is partially based on a patch by Jann Horn, but checks for NUL
bytes as well, and somewhat simplified.

There's also commentary about how it might be better if invalid names
due to filesystem corruption don't cause an immediate failure, but only
an error at the end of the readdir(), so that people can still see the
filenames that are ok.

There's also been discussion about just how much POSIX strictly speaking
requires this since it's about filesystem corruption.  It's really more
"protect user space from bad behavior" as pointed out by Jann.  But
since Eric Biederman looked up the POSIX wording, here it is for context:

 "From readdir:

   The readdir() function shall return a pointer to a structure
   representing the directory entry at the current position in the
   directory stream specified by the argument dirp, and position the
   directory stream at the next entry. It shall return a null pointer
   upon reaching the end of the directory stream. The structure dirent
   defined in the <dirent.h> header describes a directory entry.

  From definitions:

   3.129 Directory Entry (or Link)

   An object that associates a filename with a file. Several directory
   entries can associate names with the same file.

  ...

   3.169 Filename

   A name consisting of 1 to {NAME_MAX} bytes used to name a file. The
   characters composing the name may be selected from the set of all
   character values excluding the slash character and the null byte. The
   filenames dot and dot-dot have special meaning. A filename is
   sometimes referred to as a 'pathname component'."

Note that I didn't bother adding the checks to any legacy interfaces
that nobody uses.

Also note that if this ends up being noticeable as a performance
regression, we can fix that to do a much more optimized model that
checks for both NUL and '/' at the same time one word at a time.

We haven't really tended to optimize 'memchr()', and it only checks for
one pattern at a time anyway, and we really _should_ check for NUL too
(but see the comment about "soft errors" in the code about why it
currently only checks for '/')

See the CONFIG_DCACHE_WORD_ACCESS case of hash_name() for how the name
lookup code looks for pathname terminating characters in parallel.

Link: https://lore.kernel.org/lkml/20190118161440.220134-2-jannh@google.com/
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jann Horn <jannh@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>