review.tizen.org Git - platform/kernel/linux-rpi.git/log

btrfs: add free space tree to the cow-only list

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add free space tree to lockdep classes

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tweak free space tree bitmap allocation

The requested bitmap size varies, observed numbers were < 4K up to 16K.
Using vmalloc unconditionally would be too heavy, we'll try contiguous
allocations first and fall back to vmalloc if there's no contig memory.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tests: switch to GFP_KERNEL

There's no reason to do GFP_NOFS in tests, it's not data-heavy and
memory allocation failures would affect only developers or testers.

Signed-off-by: David Sterba <dsterba@suse.com>

Merge branch 'for-chris-4.5' of git://git./linux/kernel/git/fdmanana/linux into for-linus-4.5

Signed-off-by: Chris Mason <clm@fb.com>

Merge branch 'misc-cleanups-4.5' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5

Signed-off-by: Chris Mason <clm@fb.com>

Merge branch 'misc-for-4.5' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5

Btrfs: fix fitrim discarding device area reserved for boot loader's use

As of the 4.3 kernel release, the fitrim ioctl can now discard any region
of a disk that is not allocated to any chunk/block group, including the
first megabyte which is used for our primary superblock and by the boot
loader (grub for example).

Fix this by not allowing to trim/discard any region in the device starting
with an offset not greater than min(alloc_start_mount_option, 1Mb), just
as it was not possible before 4.3.

A reproducer test case for xfstests follows.

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1 # failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      cd /
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1

  # Write to the [0, 64Kb[ and [68Kb, 1Mb[ ranges of the device. These ranges are
  # reserved for a boot loader to use (GRUB for example) and btrfs should never
  # use them - neither for allocating metadata/data nor should trim/discard them.
  # The range [64Kb, 68Kb[ is used for the primary superblock of the filesystem.
  $XFS_IO_PROG -c "pwrite -S 0xfd 0 64K" $SCRATCH_DEV | _filter_xfs_io
  $XFS_IO_PROG -c "pwrite -S 0xfd 68K 956K" $SCRATCH_DEV | _filter_xfs_io

  # Now mount the filesystem and perform a fitrim against it.
  _scratch_mount
  _require_batched_discard $SCRATCH_MNT
  $FSTRIM_PROG $SCRATCH_MNT

  # Now unmount the filesystem and verify the content of the ranges was not
  # modified (no trim/discard happened on them).
  _scratch_unmount
  echo "Content of the ranges [0, 64Kb] and [68Kb, 1Mb[ after fitrim:"
  od -t x1 -N $((64 * 1024)) $SCRATCH_DEV
  od -t x1 -j $((68 * 1024)) -N $((956 * 1024)) $SCRATCH_DEV

  status=0
  exit

Reported-by: Vincent Petry <PVince81@yahoo.fr>
Reported-by: Andrei Borzenkov <arvidjaar@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341
Fixes: 499f377f49f0 (btrfs: iterate over unused chunk space in FITRIM)
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Filipe Manana <fdmanana@suse.com>

Btrfs: Check metadata redundancy on balance

When converting a filesystem via balance check that metadata mode
is at least as redundant as the data mode. For example give warning
when:
-dconvert=raid1 -mconvert=single

Signed-off-by: Sam Tygier <samtygier@yahoo.co.uk>
[ minor message reformatting ]
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: statfs: report zero available if metadata are exhausted

There is one ENOSPC case that's very confusing. There's Available
greater than zero but no file operation succeds (besides removing
files). This happens when the metadata are exhausted and there's no
possibility to allocate another chunk.

In this scenario it's normal that there's still some space in the data
chunk and the calculation in df reflects that in the Avail value.

To at least give some clue about the ENOSPC situation, let statfs report
zero value in Avail, even if there's still data space available.

Current:
/dev/sdb1 4.0G 3.3G 719M 83% /mnt/test

New:
/dev/sdb1 4.0G 3.3G 0 100% /mnt/test

We calculate the remaining metadata space minus global reserve. If this
is (supposedly) smaller than zero, there's no space. But this does not
hold in practice, the exhausted state happens where's still some
positive delta. So we apply some guesswork and compare the delta to a 4M
threshold. (Practically observed delta was 2M.)

We probably cannot calculate the exact threshold value because this
depends on the internal reservations requested by various operations, so
some operations that consume a few metadata will succeed even if the
Avail is zero. But this is better than the other way around.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: preallocate path for snapshot creation at ioctl time

We can also preallocate btrfs_path that's used during pending snapshot
creation and avoid another late ENOMEM failure.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: allocate root item at snapshot ioctl time

The actual snapshot creation is delayed until transaction commit. If we
cannot get enough memory for the root item there, we have to fail the
whole transaction commit which is bad. So we'll allocate the memory at
the ioctl call and pass it along with the pending_snapshot struct. The
potential ENOMEM will be returned to the caller of snapshot ioctl.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: do an allocation earlier during snapshot creation

We can allocate pending_snapshot earlier and do not have to do cleanup
in case of failure.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use smaller type for btrfs_path locks

The values of btrfs_path::locks are 0 to 4, fit into a u8. Let's see:

* overall size of btrfs_path drops down from 136 to 112 (-24 bytes),
* better packing in a slab page +6 objects
* the whole structure now fits to 2 cachelines
* slight decrease in code size:

   text    data     bss     dec     hex filename
938731   43670   23144 1005545   f57e9 fs/btrfs/btrfs.ko.before
938203   43670   23144 1005017   f55d9 fs/btrfs/btrfs.ko.after

(and the generated assembly does not change much)

The main purpose is to decrease the size of the structure without
affecting performance. The byte access is usually well behaving accross
arches, the locks are not accessed frequently and sometimes just
compared to zero.

Note for further size reduction attempts: the slots could be made u16
but this might generate worse code on some arches (non-byte and non-int
access). Also the range of operations on slots is wider compared to
locks and the potential performance drop should be evaluated first.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use smaller type for btrfs_path lowest_level

The level is 0..7, we can use smaller type. The size of btrfs_path is now
136 bytes from 144, which is +2 objects that fit into a 4k slab.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use smaller type for btrfs_path reada

The possible values for reada are all positive and bounded, we can later
save some bytes by storing it in u8.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: cleanup, use enum values for btrfs_path reada

Replace the integers by enums for better readability. The value 2 does
not have any meaning since a717531942f488209dded30f6bc648167bcefa72
"Btrfs: do less aggressive btree readahead" (2009-01-22).

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: constify static arrays

There are a few statically initialized arrays that can be made const.
The remaining (like file_system_type, sysfs attributes or prop handlers)
do not allow that due to type mismatch when passed to the APIs or
because the structures are modified through other members.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: constify remaining structs with function pointers

* struct extent_io_ops
* struct btrfs_free_space_op

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs tests: replace whole ops structure for free space tests

Preparatory work for making btrfs_free_space_op constant. In
test_steal_space_from_bitmap_to_extent, we substitute use_bitmap with
own version thus preventing constification. We can rework it so we
replace the whole structure with the correct function pointers.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use list_for_each_entry* in backref.c

Use list_for_each_entry*() to simplify the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use list_for_each_entry_safe in free-space-cache.c

Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use list_for_each_entry* in check-integrity.c

Use list_for_each_entry*() instead of list_for_each*() to simplify
the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: use linux/sizes.h to represent constants

We use many constants to represent size and offset value. And to make
code readable we use '256 * 1024 * 1024' instead of '268435456' to
represent '256MB'. However we can make far more readable with 'SZ_256MB'
which is defined in the 'linux/sizes.h'.

So this patch replaces 'xxx * 1024 * 1024' kind of expression with
single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is
not a power of 2. And I haven't touched to '4096' & '8192' because it's
more intuitive than 'SZ_4KB' & 'SZ_8KB'.

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: cleanup, remove stray return statements

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: zero out delayed node upon allocation

It's slightly cleaner to zero-out the delayed node upon allocation
than to do it by hand in btrfs_init_delayed_node() for a few members

Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: pass proper enum type to start_transaction()

Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: switch __btrfs_fs_incompat return type from int to bool

Conform to __btrfs_fs_incompat() cast-to-bool (!!) by explicitly
returning boolean not int.

Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove unused inode argument from uncompress_inline()

The inode argument is never used from the beginning, so remove it.

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: don't use slab cache for struct btrfs_delalloc_work

Although we prefer to use separate caches for various structs, it seems
better not to do that for struct btrfs_delalloc_work. Objects of this
type are allocated rarely, when transaction commit calls
btrfs_start_delalloc_roots, requesting delayed iputs.

The objects are temporary (with some IO involved) but still allocated
and freed within __start_delalloc_inodes. Memory allocation failure is
handled.

The slab cache is empty most of the time (observed on several systems),
so if we need to allocate a new slab object, the first one has to
allocate a full page. In a potential case of low memory conditions this
might fail with higher probability compared to using the generic slab
caches.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: drop duplicate prefix from scrub workqueues

The helper btrfs_alloc_workqueue will add the "btrfs-" prefix.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: verbose error when we find an unexpected item in sys_array

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: handle invalid num_stripes in sys_array

We can handle the special case of num_stripes == 0 directly inside
btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to
catch other unhandled cases where we fail to validate external data.

A crafted or corrupted image crashes at mount time:

BTRFS: device fsid 9006933e-2a9a-44f0-917f-514252aeec2c devid 1 transid 7 /dev/loop0
BTRFS info (device loop0): disk space caching is enabled
BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()!
Kernel panic - not syncing: BUG!
CPU: 0 PID: 313 Comm: mount Not tainted 4.2.5-00657-ge047887-dirty #25
Stack:
637af890 60062489 602aeb2e 604192ba
60387961 00000011 637af8a0 6038a835
637af9c0 6038776b 634ef32b 00000000
Call Trace:
[<6001c86d>] show_stack+0xfe/0x15b
[<6038a835>] dump_stack+0x2a/0x2c
[<6038776b>] panic+0x13e/0x2b3
[<6020f099>] btrfs_read_sys_array+0x25d/0x2ff
[<601cfbbe>] open_ctree+0x192d/0x27af
[<6019c2c1>] btrfs_mount+0x8f5/0xb9a
[<600bc9a7>] mount_fs+0x11/0xf3
[<600d5167>] vfs_kern_mount+0x75/0x11a
[<6019bcb0>] btrfs_mount+0x2e4/0xb9a
[<600bc9a7>] mount_fs+0x11/0xf3
[<600d5167>] vfs_kern_mount+0x75/0x11a
[<600d710b>] do_mount+0xa35/0xbc9
[<600d7557>] SyS_mount+0x95/0xc8
[<6001e884>] handle_syscall+0x6b/0x8e

Reported-by: Jiri Slaby <jslaby@suse.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
CC: stable@vger.kernel.org # 3.19+
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: better packing of btrfs_delayed_extent_op

btrfs_delayed_extent_op can be packed in a better way, it's 40 bytes now
and has 8 unused bytes. Reducing the level type to u8 makes it possible
to squeeze it to the padding byte after key. The bitfields were switched
to bool as there's space to store the full byte without increasing the
whole structure, besides that the generated assembly is smaller.

struct btrfs_delayed_extent_op {
struct btrfs_disk_key      key;                  /*     0    17 */
u8                         level;                /*    17     1 */
bool                       update_key;           /*    18     1 */
bool                       update_flags;         /*    19     1 */
bool                       is_data;              /*    20     1 */

/* XXX 3 bytes hole, try to pack */

u64                        flags_to_set;         /*    24     8 */

/* size: 32, cachelines: 1, members: 6 */
/* sum members: 29, holes: 1, sum holes: 3 */
/* last cacheline: 32 bytes */
};

The final size is 32 bytes which gives +26 object per slab page.

   text    data     bss     dec     hex filename
938811   43670   23144 1005625   f5839 fs/btrfs/btrfs.ko.before
938747   43670   23144 1005561   f57f9 fs/btrfs/btrfs.ko.after

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: put delayed item hook into inode

Inodes for delayed iput allocate a trivial helper structure, let's place
the list hook directly into the inode and save a kmalloc (killing a
__GFP_NOFAIL as a bonus) at the cost of increasing size of btrfs_inode.

The inode can be put into the delayed_iputs list more than once and we
have to keep the count. This means we can't use the list_splice to
process a bunch of inodes because we'd lost track of the count if the
inode is put into the delayed iputs again while it's processed.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Support convert to -d dup for btrfs-convert

Since we will add support for -d dup for non-mixed filesystem,
kernel need to support converting to this raid-type.

This patch remove limitation of above case.

Tested by following script:
(combination of dup conversion with fsck):

export TEST_DEV='/dev/vdc'
export TEST_DIR='/var/ltf/tester/mnt'

do_dup_test()
{
    local m_from="$1"
    local d_from="$2"
    local m_to="$3"
    local d_to="$4"

    echo "Convert from -m $m_from -d $d_from to -m $m_to -d $d_to"

    umount "$TEST_DIR" &>/dev/null
    ./mkfs.btrfs -f -m "$m_from" -d "$d_from" "$TEST_DEV" >/dev/null || return 1
    mount "$TEST_DEV" "$TEST_DIR" || return 1

    cp -a /sbin/* "$TEST_DIR"

    [[ "$m_from" != "$m_to" ]] && {
        ./btrfs balance start -f -mconvert="$m_to" "$TEST_DIR" || return 1
    }

    [[ "$d_from" != "$d_to" ]] && {
local opt=()
[[ "$d_to" == single ]] && opt+=("-f")
        ./btrfs balance start "${opt[@]}" -dconvert="$d_to" "$TEST_DIR" || return 1
    }

    umount "$TEST_DIR" || return 1
    ./btrfsck "$TEST_DEV" || return 1
    echo

    return 0
}

test_all()
{
    for m_from in single dup; do
    for d_from in single dup; do
    for m_to in single dup; do
    for d_to in single dup; do
    do_dup_test "$m_from" "$d_from" "$m_to" "$d_to" || return 1
    done
    done
    done
    done
}

test_all

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: igrab inode in writepage

We hit this panic on a few of our boxes this week where we have an
ordered_extent with an NULL inode.  We do an igrab() of the inode in writepages,
but weren't doing it in writepage which can be called directly from the VM on
dirty pages.  If the inode has been unlinked then we could have I_FREEING set
which means igrab() would return NULL and we get this panic.  Fix this by trying
to igrab in btrfs_writepage, and if it returns NULL then just redirty the page
and return AOP_WRITEPAGE_ACTIVATE; so the VM knows it wasn't successful.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: add missing brelse when superblock checksum fails

Looks like oversight, call brelse() when checksum fails. Further down the
code, in the non error path, we do call brelse() and so we don't see
brelse() in the goto error paths.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: fix transaction handle leak on failure to create hard link

If we failed to create a hard link we were not always releasing the
the transaction handle we got before, resulting in a memory leak and
preventing any other tasks from being able to commit the current
transaction.
Fix this by always releasing our transaction handle.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Btrfs: fix number of transaction units required to create symlink

We weren't accounting for the insertion of an inline extent item for the
symlink inode nor that we need to update the parent inode item (through
the call to btrfs_add_nondir()). So fix this by including two more
transaction units.

Signed-off-by: Filipe Manana <fdmanana@suse.com>

Btrfs: don't leave dangling dentry if symlink creation failed

When we are creating a symlink we might fail with an error after we
created its inode and added the corresponding directory indexes to its
parent inode. In this case we end up never removing the directory indexes
because the inode eviction handler, called for our symlink inode on the
final iput(), only removes items associated with the symlink inode and
not with the parent inode.

Example:

  $ mkfs.btrfs -f /dev/sdi
  $ mount /dev/sdi /mnt
  $ touch /mnt/foo
  $ ln -s /mnt/foo /mnt/bar
  ln: failed to create symbolic link ‘bar’: Cannot allocate memory
  $ umount /mnt
  $ btrfsck /dev/sdi
  Checking filesystem on /dev/sdi
  UUID: d5acb5ba-31bd-42da-b456-89dca2e716e1
  checking extents
  checking free space cache
  checking fs roots
  root 5 inode 258 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 3 namelen 3 name bar filetype 7 errors 4, no inode ref
  found 131073 bytes used err is 1
  total csum bytes: 0
  total tree bytes: 131072
  total fs tree bytes: 32768
  total extent tree bytes: 16384
  btree space waste bytes: 124305
  file data blocks allocated: 262144
   referenced 262144
  btrfs-progs v4.2.3

So fix this by adding the directory index entries as the very last
step of symlink creation.

Signed-off-by: Filipe Manana <fdmanana@suse.com>

Btrfs: send, don't BUG_ON() when an empty symlink is found

When a symlink is successfully created it always has an inline extent
containing the source path. However if an error happens when creating
the symlink, we can leave in the subvolume's tree a symlink inode without
any such inline extent item - this happens if after btrfs_symlink() calls
btrfs_end_transaction() and before it calls the inode eviction handler
(through the final iput() call), the transaction gets committed and a
crash happens before the eviction handler gets called, or if a snapshot
of the subvolume is made before the eviction handler gets called. Sadly
we can't just avoid this by making btrfs_symlink() call
btrfs_end_transaction() after it calls the eviction handler, because the
later can commit the current transaction before it removes any items from
the subvolume tree (if it encounters ENOSPC errors while reserving space
for removing all the items).

So make send fail more gracefully, with an -EIO error, and print a
message to dmesg/syslog informing that there's an empty symlink inode,
so that the user can delete the empty symlink or do something else
about it.

Reported-by: Stephen R. van den Berg <srb@cuci.nl>
Signed-off-by: Filipe Manana <fdmanana@suse.com>

Btrfs: fix race between free space endio workers and space cache writeout

While running a stress test I ran into the following trace/transaction
abort:

[471626.672243] ------------[ cut here ]------------
[471626.673322] WARNING: CPU: 9 PID: 19107 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]()
[471626.675492] BTRFS: Transaction aborted (error -2)
[471626.676748] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix
[471626.688802] CPU: 14 PID: 19107 Comm: fsstress Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
[471626.690148] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[471626.691901]  0000000000000000 ffff880016037cf0 ffffffff812566f4 ffff880016037d38
[471626.695009]  ffff880016037d28 ffffffff8104d0a6 ffffffffa040c84e 00000000fffffffe
[471626.697490]  ffff88011fe855f8 ffff88000c484cb0 ffff88000d195000 ffff880016037d90
[471626.699201] Call Trace:
[471626.699804]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
[471626.701049]  [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[471626.702542]  [<ffffffffa040c84e>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
[471626.704326]  [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
[471626.705636]  [<ffffffffa0403717>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs]
[471626.707048]  [<ffffffffa040c84e>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
[471626.708616]  [<ffffffffa048a50a>] commit_cowonly_roots+0x1d7/0x25a [btrfs]
[471626.709950]  [<ffffffffa041e34a>] btrfs_commit_transaction+0x4c4/0x991 [btrfs]
[471626.711286]  [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31
[471626.712611]  [<ffffffffa03f6df4>] btrfs_sync_fs+0x145/0x1ad [btrfs]
[471626.715610]  [<ffffffff811962a2>] ? SyS_tee+0x226/0x226
[471626.716718]  [<ffffffff811962c2>] sync_fs_one_sb+0x20/0x22
[471626.717672]  [<ffffffff8116fc01>] iterate_supers+0x75/0xc2
[471626.718800]  [<ffffffff8119669a>] sys_sync+0x52/0x80
[471626.719990]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[471626.721835] ---[ end trace baf57f43d76693f4 ]---
[471626.722954] BTRFS: error (device sdc) in btrfs_write_dirty_block_groups:3740: errno=-2 No such entry

This is a very rare situation and it happened due to a race between a free
space endio worker and writing the space caches for dirty block groups at
a transaction's commit critical section. The steps leading to this are:

1) A task calls btrfs_commit_transaction() and starts the writeout of the
   space caches for all currently dirty block groups (i.e. it calls
   btrfs_start_dirty_block_groups());

2) The previous step starts writeback for space caches;

3) When the writeback finishes it queues jobs for free space endio work
   queue (fs_info->endio_freespace_worker) that execute
   btrfs_finish_ordered_io();

4) The task committing the transaction sets the transaction's state
   to TRANS_STATE_COMMIT_DOING and shortly after calls
   btrfs_write_dirty_block_groups();

5) A free space endio job joins the transaction, through
   btrfs_join_transaction_nolock(), and updates a free space inode item
   in the root tree through btrfs_update_inode_fallback();

6) Updating the free space inode item resulted in COWing one or more
   nodes/leaves of the root tree, and that resulted in creating a new
   metadata block group, which gets added to the transaction's list
   of dirty block groups (this is a very rare case);

7) The free space endio job has not released yet its transaction handle
   at this point, so the new metadata block group was not yet fully
   created (didn't go through btrfs_create_pending_block_groups() yet);

8) The transaction commit task sees the new metadata block group in
   the transaction's list of dirty block groups and processes it.
   When it attempts to update the block group's block group item in
   the extent tree, through write_one_cache_group(), it isn't able
   to find it and aborts the transaction with error -ENOENT - this
   is because the free space endio job hasn't yet released its
   transaction handle (which calls btrfs_create_pending_block_groups())
   and therefore the block group item was not yet added to the extent
   tree.

Fix this waiting for free space endio jobs if we fail to find a block
group item in the extent tree and then retry once updating the block
group item.

Signed-off-by: Filipe Manana <fdmanana@suse.com>

btrfs: don't run delayed references while we are creating the free space tree

This is a short term solution to make sure btrfs_run_delayed_refs()
doesn't change the extent tree while we are scanning it to create the
free space tree.

Longer term we need to synchronize scanning the block groups one by one,
similar to what happens during a balance.

Signed-off-by: Chris Mason <clm@fb.com>

btrfs: fix compiling with CONFIG_BTRFS_DEBUG enabled.

Merging in the free space tree deleted a variable needed when
CONFIG_BTRFS_DEBUG=y

Signed-off-by: Chris Mason <clm@fb.com>

btrfs: fix warning on uninit variable in btrfs_finish_chunk_alloc

map->num_stripes really can't be zero, but just in case.

Signed-off-by: Chris Mason <clm@fb.com>

Merge branch 'freespace-4.5' into for-linus-4.5

Merge branch 'for-chris-4.5' of git://git./linux/kernel/git/fdmanana/linux into for-linus-4.5

Merge branch 'dev/simplify-set-bit' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5

Signed-off-by: Chris Mason <clm@fb.com>

Merge branch 'dev/gfp-flags' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5

Merge branch 'cleanup/misc-simplify' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5

Btrfs: fix unprotected list operations at btrfs_write_dirty_block_groups

We call btrfs_write_dirty_block_groups() in the critical section of a
transaction's commit, when no other tasks can join the transaction and
add more block groups to the transaction's list of dirty block groups,
so we not taking the dirty block groups spinlock when checking for the
list's emptyness, grabbing its first element or deleting elements from
it.

However there's a special and rare case where we can have a concurrent
task adding elements to this list. We trigger writeback for space
caches before at btrfs_start_dirty_block_groups() and in past iterations
of the loop at btrfs_write_dirty_block_groups(), this means that when
the writeback finishes (which happens asynchronously) it creates a
task for the endio free space work queue that executes
btrfs_finish_ordered_io() - this function is able to join the transaction,
through btrfs_join_transaction_nolock(), and update the free space cache's
inode item in the root tree, which can result in COWing nodes of this tree
and therefore allocation of a new block group can happen, which gets added
to the transaction's list of dirty block groups while the transaction
commit task is operating on it concurrently.

So fix this by taking the dirty block groups spinlock before doing
operations on the dirty block groups list at
btrfs_write_dirty_block_groups().

Signed-off-by: Filipe Manana <fdmanana@suse.com>

Linux 4.4-rc6

Merge tag 'rtc-4.4-3' of git://git./linux/kernel/git/abelloni/linux

Pull RTC fixes from Alexandre Belloni:
"Late fixes for the RTC subsystem for 4.4:

  A fix for a nasty hardware bug in rk808 and an initialization
  reordering in da9063 to fix a possible crash"

* tag 'rtc-4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
  rtc: da9063: fix access ordering error during RTC interrupt at system power on
  rtc: rk808: Compensate for Rockchip calendar deviation on November 31st

rtc: da9063: fix access ordering error during RTC interrupt at system power on

This fix alters the ordering of the IRQ and device registrations in the RTC
driver probe function. This change will apply to the RTC driver that supports
both DA9063 and DA9062 PMICs.

A problem could occur with the existing RTC driver if:

A system is started from a cold boot using the PMIC RTC IRQ to initiate a
power on operation. For instance, if an RTC alarm is used to start a
platform from power off.
The existing driver IRQ is requested before the device has been properly
registered.
i.e.
    ret = devm_request_threaded_irq()
comes before
    rtc->rtc_dev = devm_rtc_device_register();

In this case, the interrupt can be called before the device has been
registered and the handler can be called immediately. The IRQ handler
da9063_alarm_event() contains the function call

    rtc_update_irq(rtc->rtc_dev, 1, RTC_IRQF | RTC_AF);

which in turn tries to access the unavailable rtc->rtc_dev.

The fix is to reorder the functions inside the RTC probe. The IRQ is
requested after the RTC device resource has been registered so that
get_irq_byname is the last thing to happen.

Signed-off-by: Steve Twiss <stwiss.opensource@diasemi.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>

rtc: rk808: Compensate for Rockchip calendar deviation on November 31st

In A.D. 1582 Pope Gregory XIII found that the existing Julian calendar
insufficiently represented reality, and changed the rules about
calculating leap years to account for this. Similarly, in A.D. 2013
Rockchip hardware engineers found that the new Gregorian calendar still
contained flaws, and that the month of November should be counted up to
31 days instead. Unfortunately it takes a long time for calendar changes
to gain widespread adoption, and just like more than 300 years went by
before the last Protestant nation implemented Greg's proposal, we will
have to wait a while until all religions and operating system kernels
acknowledge the inherent advantages of the Rockchip system. Until then
we need to translate dates read from (and written to) Rockchip hardware
back to the Gregorian format.

This patch works by defining Jan 1st, 2016 as the arbitrary anchor date
on which Rockchip and Gregorian calendars are in sync. From that we can
translate arbitrary later dates back and forth by counting the number
of November/December transitons since the anchor date to determine the
offset between the calendars. We choose this method (rather than trying
to regularly "correct" the date stored in hardware) since it's the only
way to ensure perfect time-keeping even if the system may be shut down
for an unknown number of years. The drawback is that other software
reading the same hardware (e.g. mainboard firmware) must use the same
translation convention (including the same anchor date) to be able to
read and write correct timestamps from/to the RTC.

Signed-off-by: Julius Werner <jwerner@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>

Merge tag 'tty-4.4-rc6' of git://git./linux/kernel/git/gregkh/tty

Pull tty/serial fixes from Greg KH:
"Here are some tty/serial driver fixes for 4.4-rc6 that resolve some
  reported problems.  All of these have been in linux-next.  The details
  are in the shortlog"

* tag 'tty-4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  tty: Fix GPF in flush_to_ldisc()
  serial: earlycon: Add missing spinlock initialization
  serial: sh-sci: Fix length of scatterlist
  n_tty: Fix poll() after buffer-limited eof push read
  serial: 8250_uniphier: fix dl_read and dl_write functions

Merge tag 'usb-4.4-rc6' of git://git./linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
"Here are some USB and PHY fixes for 4.4-rc6.  All of them resolve some
  reported problems.  Full details in the shortlog"

* tag 'usb-4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  USB: fix invalid memory access in hub_activate()
  USB: ipaq.c: fix a timeout loop
  phy: core: Get a refcount to phy in devm_of_phy_get_by_index()
  phy: cygnus: pcie: add missing of_node_put
  phy: miphy365x: add missing of_node_put
  phy: miphy28lp: add missing of_node_put
  phy: rockchip-usb: add missing of_node_put
  phy: berlin-sata: add missing of_node_put
  phy: mt65xx-usb3: add missing of_node_put
  phy: brcmstb-sata: add missing of_node_put
  phy: sun9i-usb: add USB dependency

Merge tag 'md/4.4-rc5-fixes' of git://neil.brown.name/md

Pull md fixes from Neil Brown:
"Four fixes for md:

   - two recently introduced regressions fixed.
   - one older bug in RAID10 - tagged for -stable since 4.2
   - one minor sysfs api improvement"

* tag 'md/4.4-rc5-fixes' of git://neil.brown.name/md:
  Fix remove_and_add_spares removes drive added as spare in slot_store
  md: fix bug due to nested suspend
  MD: change journal disk role to disk 0
  md/raid10: fix data corruption and crash during resync

Merge tag 'powerpc-4.4-5' of git://git./linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
- Partial revert of "powerpc: Individual System V IPC system calls"
- pr_warn_once on unsupported OPAL_MSG type from Stewart
- Fix deadlock in opal-irqchip introduced by "Fix double endian
   conversion" from Alistair

* tag 'powerpc-4.4-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/opal-irqchip: Fix deadlock introduced by "Fix double endian conversion"
  powerpc/powernv: pr_warn_once on unsupported OPAL_MSG type
  Partial revert of "powerpc: Individual System V IPC system calls"

Merge tag 'spi-fix-v4.4-rc5' of git://git./linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
"A couple of reference counting bugs here, one in spidev and one with
  holding an extra reference in the core that we never freed if we
  removed a device, plus a driver specific fix.  Both of the refcounting
  bugs are very old but they've only been found by observation so
  hopefully their impact has been low"

* tag 'spi-fix-v4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: fix parent-device reference leak
  spi: spidev: Hold spi_lock over all defererences of spi in release()
  spi-fsl-dspi: Fix CTAR Register access

Merge tag 'gpio-v4.4-3' of git://git./linux/kernel/git/linusw/linux-gpio

Pull GPIO fixes from Linus Walleij:
"Some GPIO fixes for the v4.4 series.  Most prominent: I revert the
  error propagation from the .get() function until we can fix up all the
  drivers properly for v4.5.

   - Revert the error number propagation from the .get() vtable entry
     temporarily, until we make the proper fixes to all drivers.
   - Fix the clamping behaviour in the generic GPIO driver.
   - Driver fix for the ath79 driver"

* tag 'gpio-v4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
  gpio: revert get() to non-errorprogating behaviour
  gpio: generic: clamp values from bgpio_get_set()
  gpio: ath79: Fix the logic to clear offset bit of AR71XX_GPIO_REG_OE register

Merge tag 'pinctrl-v4.4-3' of git://git./linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:
- Driver fixes for Freescale i.MX7D, Intel, Broadcom 2835
- One MAINTAINERS entry

* tag 'pinctrl-v4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  MAINTAINERS: pinctrl: Add maintainers for pinctrl-single
  pinctrl: bcm2835: Fix initial value for direction_output
  pinctrl: intel: fix offset calculation issue of register PAD_OWN
  pinctrl: intel: fix bug of register offset calculation
  pinctrl: freescale: add ZERO_OFFSET_VALID flag for vf610 pinctrl

Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:
"A set of 'usual' driver bugfixes for the I2C subsystem"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: rcar: disable runtime PM correctly in slave mode
  i2c: designware: Keep pm_runtime_enable/_disable calls in sync
  i2c: designware: fix IO timeout issue for AMD controller
  i2c: imx: init bus recovery info before adding i2c adapter
  i2c: do not use 0x in front of %pa
  i2c: davinci: Increase module clock frequency
  i2c: mv64xxx: The n clockdiv factor is 0 based on sunxi SoCs
  i2c: rk3x: populate correct variable for sda_falling_time

Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input

Pull input fixes from Dmitry Torokhov:
"Just a few assorted driver fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: elants_i2c - fix wake-on-touch
  Input: elan_i2c - set input device's vendor and product IDs
  Input: sun4i-lradc-keys - fix typo in binding documentation
  Input: atmel_mxt_ts - add maxtouch to I2C table for module autoload
  Input: arizona-haptic - fix disabling of haptics device
  Input: aiptek - fix crash on detecting device without endpoints
  Input: atmel_mxt_ts - add generic platform data for Chromebooks
  Input: parkbd - clear unused function pointers
  Input: walkera0701 - clear unused function pointers
  Input: turbografx - clear unused function pointers
  Input: gamecon - clear unused function pointers
  Input: db9 - clear unused function pointers

i2c: rcar: disable runtime PM correctly in slave mode

When we also are I2C slave, we need to disable runtime PM because the
address detection mechanism needs to be active all the time. However, we
can reenable runtime PM once the slave instance was unregistered. So,
use pm_runtime_get_sync/put to achieve this, since it has proper
refcounting. pm_runtime_allow/forbid is like a global knob controllable
from userspace which is unsuitable here.

Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Cc: stable@kernel.org

Merge tag 'pm+acpi-4.4-rc6' of git://git./linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
"These fix a potential regression introduced during the 4.3 cycle
  (generic power domains framework), a nasty bug that has been present
  forever (power capping RAPL driver), a build issue (Tegra cpufreq
  driver) and a minor ugliness introduced recently (intel_pstate).

  Specifics:

   - Fix a potential regression in the generic power domains framework
     introduced during the 4.3 development cycle that may lead to
     spurious failures of system suspend in certain situations (Ulf
     Hansson).

   - Fix a problem in the power capping RAPL (Running Average Power
     Limits) driver that causes it to initialize successfully on some
     systems where it is not supposed to do that which is due to an
     incorrect check in an initialization routine (Prarit Bhargava).

   - Fix a build problem in the cpufreq Tegra driver that depends on the
     regulator framework, but that dependency is not reflected in
     Kconfig (Arnd Bergmann).

   - Fix a recent mistake in the intel_pstate driver where a numeric
     constant is used directly instead of a symbol defined specifically
     for the case in question (Prarit Bhargava)"

* tag 'pm+acpi-4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  powercap / RAPL: fix BIOS lock check
  cpufreq: intel_pstate: Minor cleanup for FRAC_BITS
  cpufreq: tegra: add regulator dependency for T124
  PM / Domains: Allow runtime PM callbacks to be re-used during system PM

Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"Three fixes this time, two in SES picked up by KASAN for various types
  of buffer overrun.  The first is a USB array which returns page 8
  whatever is asked for and causes us to overrun with incorrect data
  format assumptions and the second is an invalid iteration of page 10
  (the additional information page).

  The final fix is a reversion of a NULL deref fix which caused
  suspend/resume not to be called in pairs leading to incorrect device
  operation (Jens has queued a more proper fix for the problem in
  block)"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  ses: fix additional element traversal bug
  Revert "SCSI: Fix NULL pointer dereference in runtime PM"
  ses: Fix problems with simple enclosures

Input: elants_i2c - fix wake-on-touch

When sending "SLEEP" command to the controller it ceases scanning
completely and is unable to wake the system up from sleep, so if it is
configured as a wakeup source we should simply configure interrupt for
wakeup and rely on idle logic within the controller to reduce power
consumption while it is not used.

Signed-off-by: James Chen <james.chen@emc.com.tw>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Merge tag 'media/v4.4-3' of git://git./linux/kernel/git/mchehab/linux-media

Pull media fixes from Mauro Carvalho Chehab.

* tag 'media/v4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  [media] airspy: increase USB control message buffer size
  [media] hackrf: move RF gain ctrl enable behind module parameter
  [media] hackrf: fix possible null ptr on debug printing
  [media] Revert "[media] ivtv: avoid going past input/audio array"

Merge branch 'for-linus-4.4' of git://git./linux/kernel/git/mason/linux-btrfs

Pull btrfs fixes from Chris Mason:
"A couple of small fixes"

* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: check prepare_uptodate_page() error code earlier
  Btrfs: check for empty bitmap list in setup_cluster_bitmaps
  btrfs: fix misleading warning when space cache failed to load
  Btrfs: fix transaction handle leak in balance
  Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list

Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
"Three patches"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  include/linux/mmdebug.h: should include linux/bug.h
  mm/zswap: change incorrect strncmp use to strcmp
  proc: fix -ESRCH error when writing to /proc/$pid/coredump_filter

include/linux/mmdebug.h: should include linux/bug.h

mmdebug.h uses BUILD_BUG_ON_INVALID(), assuming someone else included
linux/bug.h.  Include it ourselves.

This saves build-failures such as:

  arch/arm64/include/asm/pgtable.h: In function 'set_pte_at':
  arch/arm64/include/asm/pgtable.h:281:3: error: implicit declaration of function 'BUILD_BUG_ON_INVALID' [-Werror=implicit-function-declaration]
   VM_WARN_ONCE(!pte_young(pte),

Fixes: 02602a18c32d7 ("bug: completely remove code generated by disabled VM_BUG_ON()")
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/zswap: change incorrect strncmp use to strcmp

Change the use of strncmp in zswap_pool_find_get() to strcmp.

The use of strncmp is no longer correct, now that zswap_zpool_type is
not an array; sizeof() will return the size of a pointer, which isn't
the right length to compare. We don't need to use strncmp anyway,
because the existing params and the passed in params are all guaranteed
to be null terminated, so strcmp should be used.

Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Reported-by: Weijie Yang <weijie.yang@samsung.com>
Cc: Seth Jennings <sjennings@variantweb.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

proc: fix -ESRCH error when writing to /proc/$pid/coredump_filter

Writing to /proc/$pid/coredump_filter always returns -ESRCH because commit
774636e19ed51 ("proc: convert to kstrto*()/kstrto*_from_user()") removed
the setting of ret after the get_proc_task call and incorrectly left it as
-ESRCH.  Instead, return 0 when successful.

Example breakage:

  echo 0 > /proc/self/coredump_filter
  bash: echo: write error: No such process

Fixes: 774636e19ed51 ("proc: convert to kstrto*()/kstrto*_from_user()")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org> [4.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'hwmon-for-linus-v4.4-rc6' of git://git./linux/kernel/git/groeck/linux-staging

Pull hwmon fixes from Guenter Roeck:

- Select CONFIG_BITREVERSE for sht15 driver to avoid build failure if
   it is not configured.

- Force wait for conversion time for the first valid data in tmp102
   driver to avoid reporting erroneous data to the thermal subsystem.

* tag 'hwmon-for-linus-v4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  hwmon: (sht15) Select CONFIG_BITREVERSE
  hwmon: (tmp102) Force wait for conversion time for the first valid data

Merge tag 'iommu-fixes-v4.4-rc5' of git://git./linux/kernel/git/joro/iommu

Pull IOMMU fixes from Joerg Roedel:
"Two similar fixes for the Intel and AMD IOMMU drivers to add proper
  access checks before calling handle_mm_fault"

* tag 'iommu-fixes-v4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/vt-d: Do access checks before calling handle_mm_fault()
  iommu/amd: Do proper access checking before calling handle_mm_fault()

Merge tag 'for-linus-4.4-rc5-tag' of git://git./linux/kernel/git/xen/tip

Pull xen bug fixes from David Vrabel:
- XSA-155 security fixes to backend drivers.
- XSA-157 security fixes to pciback.

* tag 'for-linus-4.4-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen-pciback: fix up cleanup path when alloc fails
  xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.
  xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.
  xen/pciback: Do not install an IRQ handler for MSI interrupts.
  xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled
  xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled
  xen/pciback: Save xen_pci_op commands before processing it
  xen-scsiback: safely copy requests
  xen-blkback: read from indirect descriptors only once
  xen-blkback: only read request operation from shared ring once
  xen-netback: use RING_COPY_REQUEST() throughout
  xen-netback: don't use last request to determine minimum Tx credit
  xen: Add RING_COPY_REQUEST()
  xen/x86/pvh: Use HVM's flush_tlb_others op
  xen: Resume PMU from non-atomic context
  xen/events/fifo: Consume unprocessed events when a CPU dies

Merge tag 'arc-fixes-for-4.4-rc6' of git://git./linux/kernel/git/vgupta/arc

Pull ARC architecture fixes from Vineet Gupta:
"Fixes for:

- perf interrupts on SMP: Not enabled (at boot) and disabled (at runtime)
- stack unwinder regression (for modules, ignoring dwarf3)
- nsim hosed for non default kernel link base builds"

* tag 'arc-fixes-for-4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
  ARC: smp: Rename platform hook @init_cpu_smp -> @init_per_cpu
  ARC: rename smp operation init_irq_cpu() to init_per_cpu()
  ARC: dw2 unwind: Ignore CIE version !=1 gracefully instead of bailing
  ARC: dw2 unwind: Reinstante unwinding out of modules
  ARC: [plat-sim] unbork non default CONFIG_LINUX_LINK_BASE
  ARC: intc: Document arc_request_percpu_irq() better
  ARCv2: perf: Ensure perf intr gets enabled on all cores
  ARC: intc: No need to clear IRQ_NOAUTOEN
  ARCv2: intc: Fix random perf irq disabling in SMP setup
  ARC: [axs10x] cap ethernet phy to 100 Mbit/sec

Merge tag 'sound-4.4-rc6' of git://git./linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"As usual in rc6, this update contains only a few HD-audio and
  USB-audio device-specific quirks: yet another Thinkpad noise fixes,
  Dell headphone mic fixes, and AudioQuest DragonFly fixes"

* tag 'sound-4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda - Add a fixup for Thinkpad X1 Carbon 2nd
  ALSA: hda - Set codec to D3 at reboot/shutdown on Thinkpads
  ALSA: hda - Apply click noise workaround for Thinkpads generically
  ALSA: hda - Fix headphone mic input on a few Dell ALC293 machines
  ALSA: usb-audio: Add sample rate inquiry quirk for AudioQuest DragonFly
  ALSA: usb-audio: Add a more accurate volume quirk for AudioQuest DragonFly

Merge tag 'for-linus-20151217' of git://git.infradead.org/linux-mtd

Pull MTD fixes from Brian Norris:
"I was holding out on this pull request for a bit, since there are a
  few other small issues being discussed that look like 4.4-rc
  regressions.  Hopefully I can get those stabilized soon, but these are
  ready at any rate:

   - A little bit of a last-minute change for the device tree "fixed
     partition" binding.  This is needed because we might want to reuse
     the 'partitions' subnode for other sorts of partitioning
     descriptions -- e.g., for describing which on-flash partition
     format(s) might be used on the system.

   - Also tone down a warning message, since it is probably going to
     show up on a lot of systems where it should just be ignored"

* tag 'for-linus-20151217' of git://git.infradead.org/linux-mtd:
  doc: dt: mtd: partitions: add compatible property to "partitions" node
  mtd: ofpart: don't complain about missing 'partitions' node too loudly

Merge branch 'freespace-tree' into for-linus-4.5

Signed-off-by: Chris Mason <clm@fb.com>

USB: fix invalid memory access in hub_activate()

Commit 8520f38099cc ("USB: change hub initialization sleeps to
delayed_work") changed the hub_activate() routine to make part of it
run in a workqueue.  However, the commit failed to take a reference to
the usb_hub structure or to lock the hub interface while doing so.  As
a result, if a hub is plugged in and quickly unplugged before the work
routine can run, the routine will try to access memory that has been
deallocated.  Or, if the hub is unplugged while the routine is
running, the memory may be deallocated while it is in active use.

This patch fixes the problem by taking a reference to the usb_hub at
the start of hub_activate() and releasing it at the end (when the work
is finished), and by locking the hub interface while the work routine
is running.  It also adds a check at the start of the routine to see
if the hub has already been disconnected, in which nothing should be
done.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Reported-by: Alexandru Cornea <alexandru.cornea@intel.com>
Tested-by: Alexandru Cornea <alexandru.cornea@intel.com>
Fixes: 8520f38099cc ("USB: change hub initialization sleeps to delayed_work")
CC: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

USB: ipaq.c: fix a timeout loop

The code expects the loop to end with "retries" set to zero but, because
it is a post-op, it will end set to -1. I have fixed this by moving the
decrement inside the loop.

Fixes: 014aa2a3c32e ('USB: ipaq: minor ipaq_open() cleanup.')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

[media] airspy: increase USB control message buffer size

Driver requested device firmware version string during probe using
only 24 byte long buffer. That buffer is too small for newer firmware
versions, which causes device firmware hang - device stops responding
to any commands after that. Increase buffer size to 128 which should
be enough for any current and future version strings.

Link: https://github.com/airspy/host/issues/27
Cc: <stable@vger.kernel.org> # 3.17+
Reported-by: Benjamin Vernoux <bvernoux@gmail.com>
Signed-off-by: Antti Palosaari <crope@iki.fi>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>

[media] hackrf: move RF gain ctrl enable behind module parameter

Used Avago MGA-81563 RF amplifier could be destroyed pretty easily
with too strong signal or transmitting to bad antenna.
Add module parameter 'enable_rf_gain_ctrl' which allows enabling
RF gain control - otherwise, default without the module parameter,
RF gain control is set to 'grabbed' state which prevents setting
value to the control.

Signed-off-by: Antti Palosaari <crope@iki.fi>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>

[media] hackrf: fix possible null ptr on debug printing

drivers/media/usb/hackrf/hackrf.c:1533 hackrf_probe()
error: we previously assumed 'dev' could be null (see line 1366)

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Antti Palosaari <crope@iki.fi>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>

[media] Revert "[media] ivtv: avoid going past input/audio array"

This patch broke ivtv logic, as reported at
https://bugzilla.redhat.com/show_bug.cgi?id=1278942

This reverts commit 09290cc885937cab3b2d60a6d48fe3d2d3e04061.

Cc: stable@vger.kernel.org # for v4.1 and upper
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>

Merge tag 'phy-for-4.4-rc' of git://git./linux/kernel/git/kishon/linux-phy into usb-linus

Kishon writes:

phy: for 4.4 -rc

*) Add missing of_node_put in a bunch of PHY drivers
*) Add get_device in devm_of_phy_get_by_index()
*) Fix randconfig build error in sun9i usb driver

xen-pciback: fix up cleanup path when alloc fails

When allocating a pciback device fails, clear the private
field. This could lead to an use-after free, however
the 'really_probe' takes care of setting
dev_set_drvdata(dev, NULL) in its failure path (which we would
exercise if the ->probe function failed), so we we
are OK. However lets be defensive as the code can change.

Going forward we should clean up the pci_set_drvdata(dev, NULL)
in the various code-base. That will be for another day.

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reported-by: Jonathan Creekmore <jonathan.creekmore@gmail.com>
Signed-off-by: Doug Goldstein <cardoe@cardoe.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

hwmon: (sht15) Select CONFIG_BITREVERSE

If CONFIG_BITREVERSE is not built-in, the sht15 driver fails to link:

drivers/built-in.o: In function `sht15_crc8':
drivers/hwmon/sht15.c:195: undefined reference to `byte_rev_table'

This adds a Kconfig 'select' statement, like all other users of
bitrev.h have it.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 33836ee98533 ("hwmon:change sht15_reverse()")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>

xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.

commit f598282f51 ("PCI: Fix the NIU MSI-X problem in a better way")
teaches us that dealing with MSI-X can be troublesome.

Further checks in the MSI-X architecture shows that if the
PCI_COMMAND_MEMORY bit is turned of in the PCI_COMMAND we
may not be able to access the BAR (since they are memory regions).

Since the MSI-X tables are located in there.. that can lead
to us causing PCIe errors. Inhibit us performing any
operation on the MSI-X unless the MEMORY bit is set.

Note that Xen hypervisor with:
"x86/MSI-X: access MSI-X table only after having enabled MSI-X"
will return:
xen_pciback: 0000:0a:00.1: error -6 enabling MSI-X for guest 3!

When the generic MSI code tries to setup the PIRQ without
MEMORY bit set. Which means with later versions of Xen
(4.6) this patch is not neccessary.

This is part of XSA-157

CC: stable@vger.kernel.org
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.

Otherwise just continue on, returning the same values as
previously (return of 0, and op->result has the PIRQ value).

This does not change the behavior of XEN_PCI_OP_disable_msi[|x].

The pci_disable_msi or pci_disable_msix have the checks for
msi_enabled or msix_enabled so they will error out immediately.

However the guest can still call these operations and cause
us to disable the 'ack_intr'. That means the backend IRQ handler
for the legacy interrupt will not respond to interrupts anymore.

This will lead to (if the device is causing an interrupt storm)
for the Linux generic code to disable the interrupt line.

Naturally this will only happen if the device in question
is plugged in on the motherboard on shared level interrupt GSI.

This is part of XSA-157

CC: stable@vger.kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pciback: Do not install an IRQ handler for MSI interrupts.

Otherwise an guest can subvert the generic MSI code to trigger
an BUG_ON condition during MSI interrupt freeing:

for (i = 0; i < entry->nvec_used; i++)
BUG_ON(irq_has_action(entry->irq + i));

Xen PCI backed installs an IRQ handler (request_irq) for
the dev->irq whenever the guest writes PCI_COMMAND_MEMORY
(or PCI_COMMAND_IO) to the PCI_COMMAND register. This is
done in case the device has legacy interrupts the GSI line
is shared by the backend devices.

To subvert the backend the guest needs to make the backend
to change the dev->irq from the GSI to the MSI interrupt line,
make the backend allocate an interrupt handler, and then command
the backend to free the MSI interrupt and hit the BUG_ON.

Since the backend only calls 'request_irq' when the guest
writes to the PCI_COMMAND register the guest needs to call
XEN_PCI_OP_enable_msi before any other operation. This will
cause the generic MSI code to setup an MSI entry and
populate dev->irq with the new PIRQ value.

Then the guest can write to PCI_COMMAND PCI_COMMAND_MEMORY
and cause the backend to setup an IRQ handler for dev->irq
(which instead of the GSI value has the MSI pirq). See
'xen_pcibk_control_isr'.

Then the guest disables the MSI: XEN_PCI_OP_disable_msi
which ends up triggering the BUG_ON condition in 'free_msi_irqs'
as there is an IRQ handler for the entry->irq (dev->irq).

Note that this cannot be done using MSI-X as the generic
code does not over-write dev->irq with the MSI-X PIRQ values.

The patch inhibits setting up the IRQ handler if MSI or
MSI-X (for symmetry reasons) code had been called successfully.

P.S.
Xen PCIBack when it sets up the device for the guest consumption
ends up writting 0 to the PCI_COMMAND (see xen_pcibk_reset_device).
XSA-120 addendum patch removed that - however when upstreaming said
addendum we found that it caused issues with qemu upstream. That
has now been fixed in qemu upstream.

This is part of XSA-157

CC: stable@vger.kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled

The guest sequence of:

a) XEN_PCI_OP_enable_msix
b) XEN_PCI_OP_enable_msix

results in hitting an NULL pointer due to using freed pointers.

The device passed in the guest MUST have MSI-X capability.

The a) constructs and SysFS representation of MSI and MSI groups.
The b) adds a second set of them but adding in to SysFS fails (duplicate entry).
'populate_msi_sysfs' frees the newly allocated msi_irq_groups (note that
in a) pdev->msi_irq_groups is still set) and also free's ALL of the
MSI-X entries of the device (the ones allocated in step a) and b)).

The unwind code: 'free_msi_irqs' deletes all the entries and tries to
delete the pdev->msi_irq_groups (which hasn't been set to NULL).
However the pointers in the SysFS are already freed and we hit an
NULL pointer further on when 'strlen' is attempted on a freed pointer.

The patch adds a simple check in the XEN_PCI_OP_enable_msix to guard
against that. The check for msi_enabled is not stricly neccessary.

This is part of XSA-157

CC: stable@vger.kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled

The guest sequence of:

a) XEN_PCI_OP_enable_msi
b) XEN_PCI_OP_enable_msi
c) XEN_PCI_OP_disable_msi

results in hitting an BUG_ON condition in the msi.c code.

The MSI code uses an dev->msi_list to which it adds MSI entries.
Under the above conditions an BUG_ON() can be hit. The device
passed in the guest MUST have MSI capability.

The a) adds the entry to the dev->msi_list and sets msi_enabled.
The b) adds a second entry but adding in to SysFS fails (duplicate entry)
and deletes all of the entries from msi_list and returns (with msi_enabled
is still set). c) pci_disable_msi passes the msi_enabled checks and hits:

BUG_ON(list_empty(dev_to_msi_list(&dev->dev)));

and blows up.

The patch adds a simple check in the XEN_PCI_OP_enable_msi to guard
against that. The check for msix_enabled is not stricly neccessary.

This is part of XSA-157.

CC: stable@vger.kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pciback: Save xen_pci_op commands before processing it

Double fetch vulnerabilities that happen when a variable is
fetched twice from shared memory but a security check is only
performed the first time.

The xen_pcibk_do_op function performs a switch statements on the op->cmd
value which is stored in shared memory. Interestingly this can result
in a double fetch vulnerability depending on the performed compiler
optimization.

This patch fixes it by saving the xen_pci_op command before
processing it. We also use 'barrier' to make sure that the
compiler does not perform any optimization.

This is part of XSA155.

CC: stable@vger.kernel.org
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-scsiback: safely copy requests

The copy of the ring request was lacking a following barrier(),
potentially allowing the compiler to optimize the copy away.

Use RING_COPY_REQUEST() to ensure the request is copied to local
memory.

This is part of XSA155.

CC: stable@vger.kernel.org
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-blkback: read from indirect descriptors only once

Since indirect descriptors are in memory shared with the frontend, the
frontend could alter the first_sect and last_sect values after they have
been validated but before they are recorded in the request. This may
result in I/O requests that overflow the foreign page, possibly
overwriting local pages when the I/O request is executed.

When parsing indirect descriptors, only read first_sect and last_sect
once.

This is part of XSA155.

CC: stable@vger.kernel.org
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-blkback: only read request operation from shared ring once

A compiler may load a switch statement value multiple times, which could
be bad when the value is in memory shared with the frontend.

When converting a non-native request to a native one, ensure that
src->operation is only loaded once by using READ_ONCE().

This is part of XSA155.

CC: stable@vger.kernel.org
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>