platform/kernel/linux-rpi.git
Ilya Dryomov [Wed, 25 Jan 2017 17:16:23 +0000 (18:16 +0100)]
rbd: store and use obj_request->object_no

object_no can be trivially formatted into an object name.  We already
store object names in OSD requests with special care to avoid dynamic
allocations for short names.  Storing a name in obj_request, obtained
as below (!), is a waste and will be removed in the next commit.

    name = kmem_cache_alloc(rbd_segment_name_cache, ...);
    snprintf(name, ...);
    obj_request->object_name = kstrdup(name);
    kmem_cache_free(rbd_segment_name_cache, name);
    ...
    ceph_oid_aprintf(..., "%s", obj_request->object_name);

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
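As an illustration of "trivially formatted" (not part of the commit), here is a minimal user-space sketch of building an object name from a prefix and object_no. The format strings, the 12 and 16 hex digit widths, and every sketch_ name are assumptions of this example, not code taken from the driver.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed data object name formats: prefix, dot, zero-padded hex number. */
#define SKETCH_V1_DATA_FORMAT "%s.%012" PRIx64
#define SKETCH_V2_DATA_FORMAT "%s.%016" PRIx64

/*
 * Format an object name straight into a caller-supplied buffer, so no
 * intermediate copy of the name has to be allocated and stored.
 * Returns the snprintf() result, i.e. the length the full name needs.
 */
static int sketch_object_name(char *buf, size_t len, int image_format,
                              const char *prefix, uint64_t object_no)
{
    assert(image_format == 1 || image_format == 2);
    return snprintf(buf, len,
                    image_format == 1 ? SKETCH_V1_DATA_FORMAT
                                      : SKETCH_V2_DATA_FORMAT,
                    prefix, object_no);
}
```

Because the name is a pure function of the prefix and the number, keeping only the number in the request and formatting on demand loses nothing.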
Ilya Dryomov [Wed, 25 Jan 2017 17:16:23 +0000 (18:16 +0100)]
rbd: RBD_V{1,2}_DATA_FORMAT macros

... and also fix up the comment -- format 1 data objects have always
been 12 hex digits long.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Ilya Dryomov [Wed, 25 Jan 2017 17:16:23 +0000 (18:16 +0100)]
rbd: factor out __rbd_osd_req_create()

Factor OSD request allocation and initialization code out into
__rbd_osd_req_create().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Ilya Dryomov [Wed, 25 Jan 2017 17:16:22 +0000 (18:16 +0100)]
rbd: set offset and length outside of rbd_obj_request_create()

The allocation doesn't depend on offset and length.  Both offset and
length can be changed after obj_request is allocated, too.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Ilya Dryomov [Wed, 25 Jan 2017 17:16:22 +0000 (18:16 +0100)]
rbd: support for data-pool feature

Add support for RBD_FEATURE_DATA_POOL feature.  rbd_dev->layout.pool_id
now stores the data pool id.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Ilya Dryomov [Wed, 25 Jan 2017 17:16:22 +0000 (18:16 +0100)]
rbd: introduce rbd_init_layout()

Rather than initializing layout fields with some made up values in
__rbd_dev_create(), move the initialization into rbd_init_layout() and
call it after the header is actually populated.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Ilya Dryomov [Wed, 25 Jan 2017 17:16:22 +0000 (18:16 +0100)]
rbd: use rbd_obj_bytes() more

Returning u64 doesn't make sense: max header->obj_order is 25 and
ceph_file_layout::object_size is u32.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
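The arithmetic behind that remark can be sketched as follows (an illustration, not the driver's code): with an order of at most 25, the object size 2^order is at most 32 MiB and fits comfortably in 32 bits. The sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_OBJ_ORDER 25  /* maximum order quoted in the commit message */

/* Object size in bytes for a given order: 2^order, always below 2^32. */
static uint32_t sketch_obj_bytes(uint8_t obj_order)
{
    assert(obj_order <= SKETCH_MAX_OBJ_ORDER);
    return UINT32_C(1) << obj_order;
}
```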
Ilya Dryomov [Wed, 25 Jan 2017 17:16:22 +0000 (18:16 +0100)]
rbd: remove now unused rbd_obj_request_wait() and helpers

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Ilya Dryomov [Wed, 25 Jan 2017 17:16:21 +0000 (18:16 +0100)]
rbd: switch rbd_obj_method_sync() to ceph_osdc_call()

As explained in the previous commit, rbd_obj_request machinery (and
rbd_osd_req_create() in particular) shouldn't be used for working with
metadata objects.

Switch to the recently added ceph_osdc_call().  It assumes single pages
for outbound and inbound buffers, but that's OK - none of the callers
need more than that.  These pages need to be allocated (messenger is in
dire need of proper iterator interface!), but we are swapping for
pages[] and pagelist allocations in the existing code.

Kill class_name argument - all rbd methods are under "rbd".

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Ilya Dryomov [Wed, 25 Jan 2017 17:16:21 +0000 (18:16 +0100)]
libceph: pass reply buffer length through ceph_osdc_call()

To spare checking for "this reply fits into a page, but does it fit
into my buffer?" in some callers, osd_req_op_cls_response_data_pages()
needs to know how big it is.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Ilya Dryomov [Wed, 25 Jan 2017 17:16:21 +0000 (18:16 +0100)]
rbd: do away with obj_request in rbd_obj_read_sync()

rbd_obj_request machinery is completely unnecessary here; all that's
being done is fetching a metadata object - no striping, cloning, etc.
More importantly, rbd_osd_req_create() grabs pool id from layout and
that is becoming a data pool id.

Kill offset argument - all metadata objects are small and read in full.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Ilya Dryomov [Wed, 25 Jan 2017 17:16:21 +0000 (18:16 +0100)]
rbd: initialize rbd_dev->header_oloc early

No reason to delay it until image_id is known.  This will be required
by some rbd_obj_method_sync() callers, after rbd_obj_method_sync() is
changed to take oloc.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Ilya Dryomov [Wed, 25 Jan 2017 17:16:21 +0000 (18:16 +0100)]
rbd: kill rbd_image_header::{crypt_type,comp_type}

Image format 1 is deprecated and format 2 doesn't have these.  Also,
__rbd_dev_create() takes care of zeroing (or otherwise initializing)
format 2 specific fields.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Ilya Dryomov [Wed, 25 Jan 2017 17:16:21 +0000 (18:16 +0100)]
rbd: use kstrndup() in rbd_header_from_disk()

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Ilya Dryomov [Thu, 9 Feb 2017 15:14:52 +0000 (16:14 +0100)]
libceph: bump CEPH_PG_MAX_SIZE to 32

... to accommodate potentially very wide EC pools.  This increases the
size of a typical rbd ceph_osd_request by ~12% (from 1040 to 1168 bytes),
but I'd rather go future proof here.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Ilya Dryomov [Wed, 8 Feb 2017 17:57:48 +0000 (18:57 +0100)]
libceph: don't go through with the mapping if the PG is too wide

With EC overwrites maturing, the kernel client will be getting exposed
to potentially very wide EC pools.  While "min(pi->size, X)" works fine
when the cluster is stable and happy, truncating OSD sets interferes
with resend logic (ceph_is_new_interval(), etc).  Abort the mapping if
the pool is too wide, assigning the request to the homeless session.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
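The difference between truncating and aborting can be sketched in user space as below. This is an illustration under stated assumptions, not the libceph implementation: the structure, the function name and the -ERANGE error code are hypothetical, and the limit of 32 is the value named in the "bump CEPH_PG_MAX_SIZE to 32" commit.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_PG_MAX_SIZE 32  /* limit named in the CEPH_PG_MAX_SIZE commit */

struct sketch_osd_set {
    int osds[SKETCH_PG_MAX_SIZE];
    int size;
};

/*
 * Copy a raw mapping of pool_size OSDs into set.  Rather than silently
 * keeping only min(pool_size, SKETCH_PG_MAX_SIZE) entries, refuse the
 * mapping so that the caller can park the request instead of sending it
 * to a truncated (and therefore wrong) OSD set.
 */
static int sketch_map_pg(const int *raw, int pool_size,
                         struct sketch_osd_set *set)
{
    set->size = 0;
    if (pool_size > SKETCH_PG_MAX_SIZE)
        return -ERANGE;  /* hypothetical "PG too wide" error */
    for (int i = 0; i < pool_size; i++)
        set->osds[i] = raw[i];
    set->size = pool_size;
    return 0;
}
```

A caller that sees the error would leave the request without a target, which corresponds to the "homeless session" mentioned above.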
Ilya Dryomov [Tue, 31 Jan 2017 14:55:06 +0000 (15:55 +0100)]
crush: merge working data and scratch

Much like Arlo Guthrie, I decided that one big pile is better than two
little piles.

Reflects ceph.git commit 95c2df6c7e0b22d2ea9d91db500cf8b9441c73ba.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Tue, 31 Jan 2017 14:55:06 +0000 (15:55 +0100)]
crush: remove mutable part of CRUSH map

Then add it to the working state. It would be very nice if we didn't
have to take a lock to calculate a crush placement. By moving the
permutation array into the working data, we can treat the CRUSH map as
immutable.

Reflects ceph.git commit cbcd039651c0569551cb90d26ce27e1432671f2a.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
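The idea of keeping the map read-only and moving the permutation cache into caller-owned working data can be sketched like this. It is a toy illustration, not the CRUSH code: the structures, the hash function and all sketch_ names are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_ITEMS 8

/* Read-only after construction: can be shared by many callers without a lock. */
struct sketch_bucket {
    uint32_t size;
    int items[SKETCH_MAX_ITEMS];
};

/* Per-caller scratch state that would otherwise live inside the bucket. */
struct sketch_work {
    uint32_t perm_x;                  /* seed the cached permutation is for */
    uint32_t perm_n;                  /* number of valid entries in perm[] */
    uint32_t perm[SKETCH_MAX_ITEMS];
};

static void sketch_work_init(struct sketch_work *w)
{
    w->perm_x = 0;
    w->perm_n = 0;
}

/* Toy mixing function, a stand-in for the real hash. */
static uint32_t sketch_hash(uint32_t a, uint32_t b)
{
    uint32_t h = (a * 2654435761u) ^ (b + 0x9e3779b9u);
    h ^= h >> 15;
    return h * 2246822519u;
}

/*
 * Return the r-th item of a pseudo-random permutation of the bucket,
 * extending the cached prefix lazily (Fisher-Yates).  Only *w is
 * written; the bucket itself is never modified.
 */
static int sketch_perm_choose(const struct sketch_bucket *b,
                              struct sketch_work *w,
                              uint32_t seed, uint32_t r)
{
    assert(b->size > 0 && b->size <= SKETCH_MAX_ITEMS);
    uint32_t pr = r % b->size;

    if (w->perm_n == 0 || w->perm_x != seed) {
        for (uint32_t i = 0; i < b->size; i++)
            w->perm[i] = i;
        w->perm_x = seed;
        w->perm_n = 0;
    }
    while (w->perm_n <= pr) {
        uint32_t p = w->perm_n;
        uint32_t i = p + sketch_hash(seed, p) % (b->size - p);
        uint32_t tmp = w->perm[p];
        w->perm[p] = w->perm[i];
        w->perm[i] = tmp;
        w->perm_n++;
    }
    return b->items[w->perm[pr]];
}
```

Each thread would own its own struct sketch_work, so concurrent placements never write to shared memory.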
Ilya Dryomov [Tue, 31 Jan 2017 14:55:06 +0000 (15:55 +0100)]
libceph: add osdmap_set_crush() helper

Simplify osdmap_decode() and osdmap_apply_incremental() a bit.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Stafford Horne [Sun, 5 Feb 2017 07:07:32 +0000 (16:07 +0900)]
libceph: remove unneeded stddef.h include

This was causing a build failure for openrisc when using musl and
gcc 5.4.0 since the file is not available in the toolchain.

It doesn't seem this is needed, and removing it does not cause any build
warnings for me.

Signed-off-by: Stafford Horne <shorne@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Mon, 30 Jan 2017 14:47:25 +0000 (09:47 -0500)]
ceph: do a LOOKUP in d_revalidate instead of GETATTR

In commit c3f4688a08f (ceph: don't set req->r_locked_dir in
ceph_d_revalidate), we changed the code to do a GETATTR instead of a
LOOKUP as the parent info isn't strictly necessary to revalidate the
dentry. What we missed there though is that in order to update the lease
on the dentry after revalidating it, we _do_ need parent info.

Change ceph_d_revalidate back to doing a LOOKUP instead of a GETATTR so
that we can get the parent info in order to update the lease from
ceph_fill_trace. Note that we set req->r_parent here, but we cannot set
the CEPH_MDS_R_PARENT_LOCKED flag as we can't guarantee that it is.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Fri, 27 Jan 2017 18:07:10 +0000 (13:07 -0500)]
ceph: call update_dentry_lease even when r_locked dir is not set

We don't really require that the parent be locked in order to update the
lease on a dentry. Lease info is protected by the d_lock. In the event
that the parent is not locked in ceph_fill_trace, and we have both
parent and target info, go ahead and update the dentry lease.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Fri, 27 Jan 2017 14:13:57 +0000 (09:13 -0500)]
ceph: vet the target and parent inodes before updating dentry lease

In a later patch, we're going to need to allow ceph_fill_trace to
update the dentry's lease when the parent is not locked. This is
potentially racy though -- by the time we get around to processing the
trace, the parent may have already changed.

Change update_dentry_lease to take a ceph_vino pointer and use that to
ensure that the dentry's parent still matches it before updating the
lease.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
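The check described above can be sketched as follows. This is a simplified user-space illustration with hypothetical structures; locking (the d_lock) and the real lease bookkeeping are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an (inode number, snapshot id) pair. */
struct sketch_vino {
    uint64_t ino;
    uint64_t snap;
};

struct sketch_dentry {
    struct sketch_vino parent;   /* identity of the current parent directory */
    unsigned long lease_expiry;
};

/*
 * Update the lease only if the dentry still hangs off the parent that
 * the reply was generated for.  Returns true when the lease was updated.
 */
static bool sketch_update_lease(struct sketch_dentry *dn,
                                const struct sketch_vino *expected_parent,
                                unsigned long expiry)
{
    if (dn->parent.ino != expected_parent->ino ||
        dn->parent.snap != expected_parent->snap)
        return false;  /* parent changed under us: skip the update */
    dn->lease_expiry = expiry;
    return true;
}
```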
Jeff Layton [Thu, 26 Jan 2017 21:14:18 +0000 (16:14 -0500)]
ceph: don't update_dentry_lease unless we actually got one

This if block updates the dentry lease even in the case where
the MDS didn't grant one.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Tue, 31 Jan 2017 15:28:26 +0000 (10:28 -0500)]
ceph: add a new flag to indicate whether parent is locked

struct ceph_mds_request has an r_locked_dir pointer, which is set to
indicate the parent inode and that its i_rwsem is locked.  In some
critical places, we need to be able to indicate the parent inode to the
request handling code, even when its i_rwsem may not be locked.

Most of the code that operates on r_locked_dir doesn't require that the
i_rwsem be locked. We only really need it to handle manipulation of the
dcache. The rest (filling of the inode, updating dentry leases, etc.)
already has its own locking.

Add a new r_req_flags bit that indicates whether the parent is locked
when doing the request, and rename the pointer to "r_parent". For now,
all the places that set r_parent also set this flag, but that will
change in a later patch.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Wed, 1 Feb 2017 18:49:09 +0000 (13:49 -0500)]
ceph: convert bools in ceph_mds_request to a new r_req_flags field

Currently, we have a bunch of bool flags in struct ceph_mds_request. We
need more flags though, but each bool takes (at least) a byte. Those
add up over time.

Merge all of the existing bools in this struct into a single unsigned
long, and use the set/test/clear_bit macros to manipulate them. These
are atomic operations, but that is required here to prevent
load/modify/store races. The existing flags are protected by different
locks, so we can't rely on them for that purpose.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
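A user-space sketch of the same pattern, with C11 atomics standing in for the kernel's set_bit()/clear_bit()/test_bit(). The flag names and helper names are hypothetical and only illustrate why an atomic read-modify-write is needed when different flags are updated under different locks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit numbers, for illustration only. */
enum {
    SKETCH_R_ABORTED    = 0,
    SKETCH_R_GOT_UNSAFE = 1,
    SKETCH_R_GOT_SAFE   = 2,
};

struct sketch_request {
    atomic_ulong r_req_flags;  /* one word replaces several bool members */
};

/* Atomic read-modify-write: concurrent updates to other bits are not lost. */
static void sketch_set_bit(int nr, atomic_ulong *addr)
{
    atomic_fetch_or(addr, 1UL << nr);
}

static void sketch_clear_bit(int nr, atomic_ulong *addr)
{
    atomic_fetch_and(addr, ~(1UL << nr));
}

static bool sketch_test_bit(int nr, atomic_ulong *addr)
{
    return (atomic_load(addr) >> nr) & 1UL;
}
```

With a plain "flags |= bit" two writers holding different locks could each load the old word and overwrite the other's update; the atomic operations rule that out without a common lock.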
Jeff Layton [Tue, 31 Jan 2017 16:06:13 +0000 (11:06 -0500)]
ceph: drop session argument to ceph_fill_trace

Just get it from r_session since that's what's always passed in.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Tue, 31 Jan 2017 15:55:38 +0000 (10:55 -0500)]
ceph: remove "Debugging hook" from ceph_fill_trace

Keeping around commented out code is just asking for it to bitrot and
makes viewing the code under cscope more confusing.  If
we really need this, then we can revert this patch and put it under a
Kconfig option.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Yan, Zheng [Sun, 29 Jan 2017 14:15:47 +0000 (22:15 +0800)]
ceph: avoid calling ceph_renew_caps() infinitely

__ceph_caps_mds_wanted() ignores caps from stale sessions, so its
return value can stay the same across ceph_renew_caps(). This causes
try_get_cap_refs() to keep calling ceph_renew_caps(). The fix is to
skip the session validity check in the try_get_cap_refs() case. If
the session is stale, just let the caps requester sleep.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Tue, 24 Jan 2017 02:02:32 +0000 (10:02 +0800)]
ceph: make sure flushing inode in proper session's cap_flushing list

When a flushing inode's auth cap changes, we need to move the inode into
the new auth cap session's cap_flushing list.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Yan, Zheng [Thu, 19 Jan 2017 03:21:29 +0000 (11:21 +0800)]
ceph: update readpages osd request according to size of pages

add_to_page_cache_lru() can fail, so the actual number of pages to read
can be smaller than the initial size of the OSD request. We need to
update the OSD request size in that case.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
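The adjustment can be sketched in user space as below. This is an illustration only: the request structure, the page size, and the choice to stop at the first page that cannot be added are assumptions of the sketch, not details stated in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size */

struct sketch_read_req {
    uint64_t offset;
    uint64_t length;  /* initially sized for all planned pages */
};

/*
 * can_add[i] says whether page i could be inserted into the page cache.
 * Queue pages until the first failure, then shrink the request so that
 * it covers only the pages that were actually queued.
 * Returns the number of pages queued.
 */
static unsigned int sketch_queue_pages(struct sketch_read_req *req,
                                       const bool *can_add,
                                       unsigned int nr_pages)
{
    unsigned int got = 0;

    while (got < nr_pages && can_add[got])
        got++;
    if (got < nr_pages)
        req->length = (uint64_t)got * SKETCH_PAGE_SIZE;
    return got;
}
```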
Jeff Layton [Thu, 12 Jan 2017 19:42:40 +0000 (14:42 -0500)]
ceph: fix bogus endianness change in ceph_ioctl_set_layout

sparse says:

    fs/ceph/ioctl.c:100:28: warning: cast to restricted __le64

preferred_osd is a __s64 so we don't need to do any conversion. Also,
just remove the cast in ceph_ioctl_get_layout as it's not needed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2017 13:35:17 +0000 (14:35 +0100)]
libceph: include linux/sched.h into crypto.c directly

Currently crypto.c gets linux/sched.h indirectly through linux/slab.h
from linux/kasan.h.  Include it directly for memalloc_noio_*() inlines.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Arnd Bergmann [Mon, 16 Jan 2017 11:06:09 +0000 (12:06 +0100)]
libceph: use BUG() instead of BUG_ON(1)

I ran into this compile warning, which is the result of BUG_ON(1)
not always leading to the compiler treating the code path as
unreachable:

    include/linux/ceph/osdmap.h: In function 'ceph_can_shift_osds':
    include/linux/ceph/osdmap.h:62:1: error: control reaches end of non-void function [-Werror=return-type]

Using BUG() here avoids the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
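A user-space sketch of why an unconditional BUG() silences the warning: abort() plays the role of BUG() here because it is declared noreturn, so the compiler knows that control cannot fall off the end of the function. The enum values and sketch_ names are hypothetical and not taken from osdmap.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for BUG(): abort() never returns, and the compiler knows it. */
#define SKETCH_BUG() abort()

/* Hypothetical pool types for the illustration. */
enum sketch_pool_type {
    SKETCH_POOL_REPLICATED = 1,
    SKETCH_POOL_EC = 3,
};

static bool sketch_can_shift_osds(enum sketch_pool_type type)
{
    switch (type) {
    case SKETCH_POOL_REPLICATED:
        return true;
    case SKETCH_POOL_EC:
        return false;
    default:
        SKETCH_BUG();  /* no return statement needed after this */
    }
}
```

A conditional form such as "if (1) abort();" relies on the compiler folding the condition before it analyses reachability, which is the kind of dependency the commit removes.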
Yan, Zheng [Thu, 12 Jan 2017 09:18:00 +0000 (17:18 +0800)]
ceph: avoid updating mds_wanted too frequently

User space may open/close a single file frequently. It's not good
to send a clientcaps message to the MDS for each open/close syscall.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Andreas Gerstmayr [Tue, 10 Jan 2017 13:17:56 +0000 (14:17 +0100)]
ceph: set io_pages bdi hint

This patch sets the io_pages bdi hint based on the rsize mount option.
Without this patch large buffered reads (request size > max readahead)
are processed sequentially in chunks of the readahead size (i.e. read
requests are sent out up to the readahead size, then the
do_generic_file_read() function waits until the first page is received).

With this patch read requests are sent out at once up to the size
specified in the rsize mount option (default: 64 MB).

Signed-off-by: Andreas Gerstmayr <andreas.gerstmayr@catalysts.cc>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
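How a page-count hint could be derived from a byte-sized option is sketched below. The rounding-up expression, the 4 KiB page size and the function name are assumptions of this illustration; the commit message does not show the expression the patch uses.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096UL  /* assumed page size */

/* Number of pages needed to cover rsize bytes, rounded up. */
static unsigned long sketch_io_pages(unsigned long rsize)
{
    return (rsize + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

Under these assumptions the default rsize of 64 MB corresponds to 16384 pages.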
Colin Ian King [Thu, 29 Dec 2016 20:19:32 +0000 (20:19 +0000)]
ceph: fix spelling mistake: "enabing" -> "enabling"

Trivial fix to a spelling mistake in a debug message.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Seraphime Kirkovski [Mon, 26 Dec 2016 09:26:34 +0000 (10:26 +0100)]
ceph: cleanup ACCESS_ONCE -> READ_ONCE

This removes the uses of ACCESS_ONCE in favor of READ_ONCE

Signed-off-by: Seraphime Kirkovski <kirkseraph@gmail.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Jeff Layton [Thu, 15 Dec 2016 13:37:59 +0000 (08:37 -0500)]
ceph: pass parent inode info to ceph_encode_dentry_release if we have it

If we have a parent inode reference already, then we don't need to
go back up the directory tree to find one.

Link: http://tracker.ceph.com/issues/18148
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Thu, 15 Dec 2016 13:37:58 +0000 (08:37 -0500)]
ceph: fix unsafe dcache access in ceph_encode_dentry_release

Accessing d_parent requires some sort of locking or it could vanish
out from under us. Since we take the d_lock anyway, use that to fetch
d_parent and take a reference to it, and then use that reference to
call ceph_encode_inode_release.

Link: http://tracker.ceph.com/issues/18148
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Thu, 15 Dec 2016 13:37:58 +0000 (08:37 -0500)]
ceph: pass parent dir ino info to build_dentry_path

In the event that we have a parent inode reference in the request, we
can use that instead of mucking about in the dcache. Pass any parent
inode info we have down to build_dentry_path so it can make use of it.

Link: http://tracker.ceph.com/issues/18148
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Thu, 15 Dec 2016 13:37:57 +0000 (08:37 -0500)]
ceph: clean up unsafe d_parent accesses in build_dentry_path

While we hold a reference to the dentry when build_dentry_path is
called, we could end up racing with a rename that changes d_parent.
Handle that situation correctly, by using the rcu_read_lock to
ensure that the parent dentry and inode stick around long enough
to safely check ceph_snap and ceph_ino.

Link: http://tracker.ceph.com/issues/18148
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Thu, 15 Dec 2016 13:37:56 +0000 (08:37 -0500)]
ceph: clean up unsafe d_parent access in __choose_mds

__choose_mds exists to pick an MDS to use when issuing a call. Doing
that typically involves picking an inode and using the authoritative
MDS for it. In most cases, that's pretty straightforward, as we are
using an inode to which we hold a reference (usually represented by
r_dentry or r_inode in the request).

In the case of a snapshotted directory however, we need to fetch
the non-snapped parent, which involves walking back up the parents
in the tree. The dentries in the snapshot dir are effectively frozen
but the overall parent is _not_, and could vanish if a concurrent
rename were to occur.

Clean this code up and take special care to ensure the validity of
the entries we're working with. First, try to use the inode in
r_locked_dir if one exists. If not and all we have is r_dentry,
then we have to walk back up the tree. Use the rcu_read_lock for
this so we can ensure that any d_parent we find won't go away, and
take extra care to deal with the possibility that the dentries could
go negative.

Change get_nonsnap_parent to return an inode, and take a reference to
that inode before returning (if any). Change all of the other places
where we set "inode" in __choose_mds to also take a reference, and then
call iput on that inode before exiting the function.

Link: http://tracker.ceph.com/issues/18148
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Linus Torvalds [Sun, 19 Feb 2017 22:34:00 +0000 (14:34 -0800)]
Linux 4.10

Al Viro [Sun, 19 Feb 2017 07:15:27 +0000 (07:15 +0000)]
Fix missing sanity check in /dev/sg

What happens is that a write to /dev/sg is given a request with non-zero
->iovec_count combined with zero ->dxfer_len.  Or with ->dxferp pointing
to an array full of empty iovecs.

Having write permission to /dev/sg shouldn't be equivalent to the
ability to trigger BUG_ON() while holding spinlocks...

Found by Dmitry Vyukov and syzkaller.

[ The BUG_ON() got changed to a WARN_ON_ONCE(), but this fixes the
  underlying issue.  - Linus ]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Thumshirn [Tue, 31 Jan 2017 09:16:00 +0000 (10:16 +0100)]
scsi: don't BUG_ON() empty DMA transfers

Don't crash the machine just because of an empty transfer. Use WARN_ON()
combined with returning an error.

Found by Dmitry Vyukov and syzkaller.

[ Changed to "WARN_ON_ONCE()". Al has a patch that should fix the root
  cause, but a BUG_ON() is not acceptable in any case, and a WARN_ON()
  might still be a cause of excessive log spamming.

  NOTE! If this warning ever triggers, we may end up leaking resources,
  since this doesn't bother to try to clean the command up. So this
  WARN_ON_ONCE() triggering does imply real problems. But BUG_ON() is
  much worse.

  People really need to stop using BUG_ON() for "this shouldn't ever
  happen". It makes pretty much any bug worse.     - Linus ]

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: James Bottomley <jejb@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
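The "warn once and return an error" pattern can be sketched in user space as follows. This is an illustration, not the SCSI code: sketch_warn_once() is a simplified stand-in for WARN_ON_ONCE() (it keeps one global "already warned" flag instead of one per call site), and the mapping function and its -EIO error code are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Report the first time cond is true, stay silent afterwards, and hand
 * the condition back so it can be used directly in an if statement.
 */
static bool sketch_warn_once(bool cond, const char *what)
{
    static bool warned;

    if (cond && !warned) {
        warned = true;
        fprintf(stderr, "WARNING: %s\n", what);
    }
    return cond;
}

/* Hypothetical mapping step: reject an empty transfer instead of crashing. */
static int sketch_dma_map(unsigned int nr_segments)
{
    if (sketch_warn_once(nr_segments == 0, "empty DMA transfer"))
        return -EIO;  /* error code chosen for illustration */
    return (int)nr_segments;
}
```

The caller gets an error it can handle, the log gets a single line, and the machine keeps running.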
Willem de Bruijn [Sun, 19 Feb 2017 00:00:45 +0000 (19:00 -0500)]
ipv6: release dst on error in ip6_dst_lookup_tail

If ip6_dst_lookup_tail has acquired a dst and fails the IPv4-mapped
check, release the dst before returning an error.

Fixes: ec5e3b0a1d41 ("ipv6: Inhibit IPv4-mapped src address on the wire.")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
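The shape of the fix can be sketched with a toy reference count. This is an illustration under assumptions: the structure, the helpers and the -EAFNOSUPPORT error code are hypothetical stand-ins, not the networking code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_dst {
    int refcnt;
};

static struct sketch_dst *sketch_dst_hold(struct sketch_dst *d)
{
    d->refcnt++;
    return d;
}

static void sketch_dst_release(struct sketch_dst *d)
{
    if (d)
        d->refcnt--;
}

/*
 * Look up a route.  On success *out holds a reference that the caller
 * must drop.  On failure the reference taken here is released and *out
 * is set to NULL, so nothing leaks.
 */
static int sketch_lookup(struct sketch_dst *route, bool src_is_v4mapped,
                         struct sketch_dst **out)
{
    *out = sketch_dst_hold(route);
    if (src_is_v4mapped) {
        sketch_dst_release(*out);  /* the release the fix adds */
        *out = NULL;
        return -EAFNOSUPPORT;      /* error code chosen for illustration */
    }
    return 0;
}
```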
Linus Torvalds [Sun, 19 Feb 2017 01:38:09 +0000 (17:38 -0800)]
Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Pull ARM SoC fixes from Arnd Bergmann:
 "Two more bugfixes that came in during this week:

   - a defconfig change to enable a vital driver used on some Qualcomm
     based phones. This was already queued for 4.11, but the maintainer
     asked to have it in 4.10 after all.

   - a regression fix for the reset controller framework, this got
     broken by a typo in the 4.10 merge window"

* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
  ARM: multi_v7_defconfig: enable Qualcomm RPMCC
  reset: fix shared reset triggered_count decrement on error

Linus Torvalds [Sun, 19 Feb 2017 01:36:15 +0000 (17:36 -0800)]
Merge branch 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm

Pull ARM fixes from Russell King:
 "A couple of fixes from Kees concerning problems he spotted with our
  user access support"

* 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm:
  ARM: 8658/1: uaccess: fix zeroing of 64-bit get_user()
  ARM: 8657/1: uaccess: consistently check object sizes

Linus Torvalds [Sun, 19 Feb 2017 01:34:56 +0000 (17:34 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fix from Thomas Gleixner:
 "Make the build clean by working around yet another GCC stupidity"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vm86: Fix unused variable warning if THP is disabled

Linus Torvalds [Sun, 19 Feb 2017 01:33:17 +0000 (17:33 -0800)]
Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Thomas Gleixner:
 "Move the futex init function to core initcall so user mode helper does
  not run into an uninitialized futex syscall"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Move futex_init() to core_initcall

Linus Torvalds [Sun, 19 Feb 2017 01:30:36 +0000 (17:30 -0800)]
Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Thomas Gleixner:
 "Two small fixes:

   - Prevent deadlock on the tick broadcast lock. Found and fixed by
     Mike.

   - Stop using printk() in the timekeeping debug code to prevent a
     deadlock against the scheduler"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Use deferred printk() in debug code
  tick/broadcast: Prevent deadlock on tick_broadcast_lock

Linus Torvalds [Sun, 19 Feb 2017 01:29:00 +0000 (17:29 -0800)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

 1) Fix leak in dpaa_eth error paths, from Dan Carpenter.

 2) Use after free when using IPV6_RECVPKTINFO, from Andrey Konovalov.

 3) fanout_release() cannot be invoked from atomic contexts, from Anoob
    Soman.

 4) Fix bogus attempt at lockdep annotation in IRDA.

 5) dev_fill_metadata_dst() can OOP on a NULL dst cache pointer, from
    Paolo Abeni.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  irda: Fix lockdep annotations in hashbin_delete().
  vxlan: fix oops in dev_fill_metadata_dst
  dccp: fix freeing skb too early for IPV6_RECVPKTINFO
  dpaa_eth: small leak on error
  packet: Do not call fanout_release from atomic contexts

Sergey Senozhatsky [Sat, 18 Feb 2017 11:42:54 +0000 (03:42 -0800)]
printk: use rcuidle console tracepoint

Use rcuidle console tracepoint because, apparently, it may be issued
from an idle CPU:

  hw-breakpoint: Failed to enable monitor mode on CPU 0.
  hw-breakpoint: CPU 0 failed to disable vector catch

  ===============================
  [ ERR: suspicious RCU usage.  ]
  4.10.0-rc8-next-20170215+ #119 Not tainted
  -------------------------------
  ./include/trace/events/printk.h:32 suspicious rcu_dereference_check() usage!

  other info that might help us debug this:

  RCU used illegally from idle CPU!
  rcu_scheduler_active = 2, debug_locks = 0
  RCU used illegally from extended quiescent state!
  2 locks held by swapper/0/0:
   #0:  (cpu_pm_notifier_lock){......}, at: [<c0237e2c>] cpu_pm_exit+0x10/0x54
   #1:  (console_lock){+.+.+.}, at: [<c01ab350>] vprintk_emit+0x264/0x474

  stack backtrace:
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-rc8-next-20170215+ #119
  Hardware name: Generic OMAP4 (Flattened Device Tree)
    console_unlock
    vprintk_emit
    vprintk_default
    printk
    reset_ctrl_regs
    dbg_cpu_pm_notify
    notifier_call_chain
    cpu_pm_exit
    omap_enter_idle_coupled
    cpuidle_enter_state
    cpuidle_enter_state_coupled
    do_idle
    cpu_startup_entry
    start_kernel

This RCU warning, however, is suppressed by lockdep_off() in printk().
lockdep_off() increments the ->lockdep_recursion counter and thus
disables RCU_LOCKDEP_WARN() and debug_lockdep_rcu_enabled(), which want
lockdep to be enabled "current->lockdep_recursion == 0".

Link: http://lkml.kernel.org/r/20170217015932.11898-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Tony Lindgren <tony@atomide.com>
Tested-by: Tony Lindgren <tony@atomide.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Russell King <rmk@armlinux.org.uk>
Cc: <stable@vger.kernel.org> [3.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andy Gross [Mon, 2 Jan 2017 20:35:05 +0000 (14:35 -0600)]
ARM: multi_v7_defconfig: enable Qualcomm RPMCC

This patch enables the Qualcomm RPM based Clock Controller present on
A-family boards.

Signed-off-by: Andy Gross <andy.gross@linaro.org>
Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
7 years agoirda: Fix lockdep annotations in hashbin_delete().
David S. Miller [Fri, 17 Feb 2017 21:19:39 +0000 (16:19 -0500)]
irda: Fix lockdep annotations in hashbin_delete().

A nested lock depth was added to the hashbin_delete() code but it
doesn't actually work so well and results in tons of lockdep splats.

Fix the code instead to properly drop the lock around the operation
and just keep peeking the head of the hashbin queue.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'for-linus' of git://git.kernel.dk/linux-block
Linus Torvalds [Fri, 17 Feb 2017 21:01:58 +0000 (13:01 -0800)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block layer fix from Jens Axboe:
 "A single fix for a lockdep splat reported by Thomas and Gabriel"

* 'for-linus' of git://git.kernel.dk/linux-block:
  cfq-iosched: don't call wbt_disable_default() with IRQs disabled

7 years agovxlan: fix oops in dev_fill_metadata_dst
Paolo Abeni [Fri, 17 Feb 2017 18:14:27 +0000 (19:14 +0100)]
vxlan: fix oops in dev_fill_metadata_dst

Since the commit 0c1d70af924b ("net: use dst_cache for vxlan device")
vxlan_fill_metadata_dst() calls vxlan_get_route() passing a NULL
dst_cache pointer, so the latter should explicitly check for
valid dst_cache ptr. Unfortunately the commit d71785ffc7e7 ("net: add
dst_cache to ovs vxlan lwtunnel") removed said check.

As a result, it is possible to trigger a null pointer access by calling
vxlan_fill_metadata_dst(), e.g. with:

ovs-vsctl add-br ovs-br0
ovs-vsctl add-port ovs-br0 vxlan0 -- set interface vxlan0 \
type=vxlan options:remote_ip=192.168.1.1 \
options:key=1234 options:dst_port=4789 ofport_request=10
ip address add dev ovs-br0 172.16.1.2/24
ovs-vsctl set Bridge ovs-br0 ipfix=@i -- --id=@i create IPFIX \
targets=\"172.16.1.1:1234\" sampling=1
iperf -c 172.16.1.1 -u -l 1000 -b 10M -t 1 -p 1234

This commit addresses the issue by passing to vxlan_get_route() the
dst_cache already available in the lwt info processed by
vxlan_fill_metadata_dst().

Fixes: d71785ffc7e7 ("net: add dst_cache to ovs vxlan lwtunnel")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodccp: fix freeing skb too early for IPV6_RECVPKTINFO
Andrey Konovalov [Thu, 16 Feb 2017 16:22:46 +0000 (17:22 +0100)]
dccp: fix freeing skb too early for IPV6_RECVPKTINFO

In the current DCCP implementation an skb for a DCCP_PKT_REQUEST packet
is forcibly freed via __kfree_skb in dccp_rcv_state_process if
dccp_v6_conn_request successfully returns.

However, if IPV6_RECVPKTINFO is set on a socket, the address of the skb
is saved to ireq->pktopts and the ref count for skb is incremented in
dccp_v6_conn_request, so skb is still in use. Nevertheless, it gets freed
in dccp_rcv_state_process.

Fix by calling consume_skb instead of doing goto discard and therefore
calling __kfree_skb.

Similar fixes for TCP:

fb7e2399ec17f1004c0e0ccfd17439f8759ede01 [TCP]: skb is unexpectedly freed.
0aea76d35c9651d55bbaf746e7914e5f9ae5a25d tcp: SYN packets are now
simply consumed

Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge tag 'powerpc-4.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc...
Linus Torvalds [Fri, 17 Feb 2017 17:58:32 +0000 (09:58 -0800)]
Merge tag 'powerpc-4.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fix from Michael Ellerman:
 "One fix from Paul: we can not use the radix MMU under a hypervisor for
  now.

  Although the code checked if the processor supports radix, that is not
  sufficient"

* tag 'powerpc-4.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64: Disable use of radix under a hypervisor

7 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Linus Torvalds [Fri, 17 Feb 2017 17:56:34 +0000 (09:56 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

Pull input fix from Dmitry Torokhov:
 "Just a single change to Elan touchpad driver to recognize a new ACPI
  ID"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: elan_i2c - add ELAN0605 to the ACPI table

7 years agoMerge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa...
Linus Torvalds [Fri, 17 Feb 2017 17:53:59 +0000 (09:53 -0800)]
Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c fix from Wolfram Sang:
 "I2C has a revert to fix a regression"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  Revert "i2c: designware: detect when dynamic tar update is possible"

7 years agoMerge tag 'mmc-v4.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Linus Torvalds [Fri, 17 Feb 2017 17:52:33 +0000 (09:52 -0800)]
Merge tag 'mmc-v4.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc

Pull MMC fix from Ulf Hansson:
 "Fix multi-bit bus width without high-speed mode for MMC"

* tag 'mmc-v4.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: core: fix multi-bit bus width without high-speed mode

7 years agoMerge tag 'ntb-4.10-bugfixes' of git://github.com/jonmason/ntb
Linus Torvalds [Fri, 17 Feb 2017 17:51:05 +0000 (09:51 -0800)]
Merge tag 'ntb-4.10-bugfixes' of git://github.com/jonmason/ntb

Pull NTB bugfixes from Jon Mason:
 "NTB bug fixes to address a crash when unloading the ntb module, a DMA
  engine unmap leak, allowing the proper queue choice, and clearing the
  SKX irq bit"

* tag 'ntb-4.10-bugfixes' of git://github.com/jonmason/ntb:
  ntb: ntb_hw_intel: link_poll isn't clearing the pending status properly
  ntb_transport: Pick an unused queue
  ntb: ntb_perf missing dmaengine_unmap_put
  NTB: ntb_transport: fix debugfs_remove_recursive

7 years agodpaa_eth: small leak on error
Dan Carpenter [Thu, 16 Feb 2017 09:56:10 +0000 (12:56 +0300)]
dpaa_eth: small leak on error

This should be >= instead of > here.  It means that we don't increment
the free count enough so it becomes off by one.

Fixes: 9ad1a3749333 ("dpaa_eth: add support for DPAA Ethernet")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge tag 'reset-for-4.10-fixes' of https://git.pengutronix.de/git/pza/linux into...
Arnd Bergmann [Fri, 17 Feb 2017 16:25:15 +0000 (17:25 +0100)]
Merge tag 'reset-for-4.10-fixes' of https://git.pengutronix.de/git/pza/linux into fixes

Pull "Reset controller fixes for v4.10" from Philipp Zabel:

- Remove erroneous negation of the error check of the reset function
  to decrement triggered_count in the error case, not on success. This
  fixes shared resets to actually only trigger once, as intended.

* tag 'reset-for-4.10-fixes' of https://git.pengutronix.de/git/pza/linux:
  reset: fix shared reset triggered_count decrement on error

7 years agopacket: Do not call fanout_release from atomic contexts
Anoob Soman [Wed, 15 Feb 2017 20:25:39 +0000 (20:25 +0000)]
packet: Do not call fanout_release from atomic contexts

Commit 6664498280cf ("packet: call fanout_release, while UNREGISTERING a
netdev"), unfortunately, introduced the following issues.

1. calling mutex_lock(&fanout_mutex) (fanout_release()) from inside an
RCU read-side critical section. rcu_read_lock() usually disables preemption,
which prohibits calling sleeping functions.

[  ] include/linux/rcupdate.h:560 Illegal context switch in RCU read-side critical section!
[  ]
[  ] rcu_scheduler_active = 1, debug_locks = 0
[  ] 4 locks held by ovs-vswitchd/1969:
[  ]  #0:  (cb_lock){++++++}, at: [<ffffffff8158a6c9>] genl_rcv+0x19/0x40
[  ]  #1:  (ovs_mutex){+.+.+.}, at: [<ffffffffa04878ca>] ovs_vport_cmd_del+0x4a/0x100 [openvswitch]
[  ]  #2:  (rtnl_mutex){+.+.+.}, at: [<ffffffff81564157>] rtnl_lock+0x17/0x20
[  ]  #3:  (rcu_read_lock){......}, at: [<ffffffff81614165>] packet_notifier+0x5/0x3f0
[  ]
[  ] Call Trace:
[  ]  [<ffffffff813770c1>] dump_stack+0x85/0xc4
[  ]  [<ffffffff810c9077>] lockdep_rcu_suspicious+0x107/0x110
[  ]  [<ffffffff810a2da7>] ___might_sleep+0x57/0x210
[  ]  [<ffffffff810a2fd0>] __might_sleep+0x70/0x90
[  ]  [<ffffffff8162e80c>] mutex_lock_nested+0x3c/0x3a0
[  ]  [<ffffffff810de93f>] ? vprintk_default+0x1f/0x30
[  ]  [<ffffffff81186e88>] ? printk+0x4d/0x4f
[  ]  [<ffffffff816106dd>] fanout_release+0x1d/0xe0
[  ]  [<ffffffff81614459>] packet_notifier+0x2f9/0x3f0

2. calling mutex_lock(&fanout_mutex) inside spin_lock(&po->bind_lock).
"sleeping function called from invalid context"

[  ] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
[  ] in_atomic(): 1, irqs_disabled(): 0, pid: 1969, name: ovs-vswitchd
[  ] INFO: lockdep is turned off.
[  ] Call Trace:
[  ]  [<ffffffff813770c1>] dump_stack+0x85/0xc4
[  ]  [<ffffffff810a2f52>] ___might_sleep+0x202/0x210
[  ]  [<ffffffff810a2fd0>] __might_sleep+0x70/0x90
[  ]  [<ffffffff8162e80c>] mutex_lock_nested+0x3c/0x3a0
[  ]  [<ffffffff816106dd>] fanout_release+0x1d/0xe0
[  ]  [<ffffffff81614459>] packet_notifier+0x2f9/0x3f0

3. calling dev_remove_pack(&fanout->prot_hook), from inside
spin_lock(&po->bind_lock) or rcu_read-side critical-section. dev_remove_pack()
-> synchronize_net(), which might sleep.

[  ] BUG: scheduling while atomic: ovs-vswitchd/1969/0x00000002
[  ] INFO: lockdep is turned off.
[  ] Call Trace:
[  ]  [<ffffffff813770c1>] dump_stack+0x85/0xc4
[  ]  [<ffffffff81186274>] __schedule_bug+0x64/0x73
[  ]  [<ffffffff8162b8cb>] __schedule+0x6b/0xd10
[  ]  [<ffffffff8162c5db>] schedule+0x6b/0x80
[  ]  [<ffffffff81630b1d>] schedule_timeout+0x38d/0x410
[  ]  [<ffffffff810ea3fd>] synchronize_sched_expedited+0x53d/0x810
[  ]  [<ffffffff810ea6de>] synchronize_rcu_expedited+0xe/0x10
[  ]  [<ffffffff8154eab5>] synchronize_net+0x35/0x50
[  ]  [<ffffffff8154eae3>] dev_remove_pack+0x13/0x20
[  ]  [<ffffffff8161077e>] fanout_release+0xbe/0xe0
[  ]  [<ffffffff81614459>] packet_notifier+0x2f9/0x3f0

4. fanout_release() races with calls from a different CPU.

To fix the above problems, remove the call to fanout_release() under
rcu_read_lock(). Instead, call __dev_remove_pack(&fanout->prot_hook) and
netdev_run_todo will be happy that &dev->ptype_specific list is empty. In order
to achieve this, I moved dev_{add,remove}_pack() out of fanout_{add,release} to
__fanout_{link,unlink}. So, call to {,__}unregister_prot_hook() will make sure
fanout->prot_hook is removed as well.

Fixes: 6664498280cf ("packet: call fanout_release, while UNREGISTERING a netdev")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Anoob Soman <anoob.soman@citrix.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoreset: fix shared reset triggered_count decrement on error
Jerome Brunet [Wed, 15 Feb 2017 18:15:51 +0000 (19:15 +0100)]
reset: fix shared reset triggered_count decrement on error

For a shared reset, when the reset is successful, the triggered_count is
incremented when trying to call the reset callback, so that another device
sharing the same reset line won't trigger it again. If the reset has not
been triggered successfully, the triggered_count should be decremented.

The code does the opposite, and decrements the triggered_count on success.
As a consequence, another device sharing the reset will be able to trigger
it again.

Fixed by removing the negation in front of the error code of the reset function.
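
The counting rule described above can be sketched in user space as follows.
This is only an illustration, not the kernel code: struct shared_rst,
shared_reset() and the two callbacks are hypothetical names, and C11 atomics
stand in for the kernel's atomic helpers.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a shared reset control. */
struct shared_rst {
	atomic_int triggered_count;
	int (*do_reset)(void);	/* 0 on success, negative value on failure */
	int calls;		/* how often the callback really ran */
};

/* Trigger the reset line at most once across all sharers. */
static int shared_reset(struct shared_rst *r)
{
	int ret;

	/* Only the caller that moves the count from 0 to 1 pulls the line. */
	if (atomic_fetch_add(&r->triggered_count, 1) + 1 != 1)
		return 0;

	r->calls++;
	ret = r->do_reset();

	/* Give the claim back on failure only; testing !ret here is the bug. */
	if (ret)
		atomic_fetch_sub(&r->triggered_count, 1);

	return ret;
}

/* Hypothetical callbacks used to exercise both paths. */
static int reset_ok(void)   { return 0; }
static int reset_fail(void) { return -5; }
```

With the check inverted, a successful reset would drop the count back to
zero, so the next sharer would pull the line again.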

Fixes: 7da33a37b48f ("reset: allow using reset_control_reset with shared reset")

Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Acked-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
7 years agontb: ntb_hw_intel: link_poll isn't clearing the pending status properly
Dave Jiang [Thu, 16 Feb 2017 23:22:36 +0000 (16:22 -0700)]
ntb: ntb_hw_intel: link_poll isn't clearing the pending status properly

On Skylake hardware, link_poll isn't clearing the pending interrupt
bit.  Add a new function for SKX that clears the status bit the
right way.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Fixes: 783dfa6c ("ntb: Adding Skylake Xeon NTB support")
Signed-off-by: Jon Mason <jdmason@kudzu.us>
7 years agontb_transport: Pick an unused queue
Thomas VanSelus [Mon, 13 Feb 2017 22:46:26 +0000 (16:46 -0600)]
ntb_transport: Pick an unused queue

Fix typo causing ntb_transport_create_queue to select the first
queue every time, instead of using the next free queue.
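
The commit text does not show the typo itself. As a rough illustration of
what "use the next free queue" means, here is a user-space sketch that claims
the lowest set bit of a free-queue bitmap. The name claim_free_queue() and the
64-bit bitmap layout are assumptions made for this example only.

```c
#include <stdint.h>

/*
 * Return the index of the lowest free queue and mark it as used,
 * or -1 when no queue is free. A set bit means "queue is free".
 */
static int claim_free_queue(uint64_t *free_bitmap)
{
	for (int i = 0; i < 64; i++) {
		uint64_t bit = (uint64_t)1 << i;

		if (*free_bitmap & bit) {
			*free_bitmap &= ~bit;	/* no longer free */
			return i;
		}
	}

	return -1;
}
```

Searching a bitmap that is never updated by the claim would hand out the
same index on every call, which matches the symptom described above.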

Signed-off-by: Thomas VanSelus <tvanselus@xes-inc.com>
Signed-off-by: Aaron Sierra <asierra@xes-inc.com>
Acked-by: Allen Hubbe <Allen.Hubbe@dell.com>
Fixes: fce8a7bb5 ("PCI-Express Non-Transparent Bridge Support")
Signed-off-by: Jon Mason <jdmason@kudzu.us>
7 years agontb: ntb_perf missing dmaengine_unmap_put
Dave Jiang [Mon, 30 Jan 2017 21:21:17 +0000 (14:21 -0700)]
ntb: ntb_perf missing dmaengine_unmap_put

In the normal I/O execution path, ntb_perf is missing a call to
dmaengine_unmap_put() after submission. That causes us to leak
unmap objects.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Fixes: 8a7b6a77 ("ntb: ntb perf tool")
Signed-off-by: Jon Mason <jdmason@kudzu.us>
7 years agoNTB: ntb_transport: fix debugfs_remove_recursive
Allen Hubbe [Tue, 27 Dec 2016 22:57:04 +0000 (17:57 -0500)]
NTB: ntb_transport: fix debugfs_remove_recursive

The call to debugfs_remove_recursive(qp->debugfs_dir) of the sub-level
directory must not be later than
debugfs_remove_recursive(nt_debugfs_dir) of the top-level directory.
Otherwise, the sub-level directory will not exist, and it would be
invalid (panic) to attempt to remove it.  This removes the top-level
directory last, after sub-level directories have been cleaned up.

Signed-off-by: Allen Hubbe <Allen.Hubbe@dell.com>
Fixes: e26a5843f ("NTB: Split ntb_hw_intel and ntb_transport drivers")
Signed-off-by: Jon Mason <jdmason@kudzu.us>
7 years agoMerge tag 'drm-fixes-for-v4.10-final' of git://people.freedesktop.org/~airlied/linux
Linus Torvalds [Fri, 17 Feb 2017 02:44:38 +0000 (18:44 -0800)]
Merge tag 'drm-fixes-for-v4.10-final' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:
 "Just two last minute fixes, one for DP MST oopses and one for a radeon
  regression"

* tag 'drm-fixes-for-v4.10-final' of git://people.freedesktop.org/~airlied/linux:
  drm/radeon: Use mode h/vdisplay fields to hide out of bounds HW cursor
  drm/dp/mst: fix kernel oops when turning off secondary monitor

7 years agoMerge branch 'drm-fixes-4.10' of git://people.freedesktop.org/~agd5f/linux into drm...
Dave Airlie [Fri, 17 Feb 2017 01:13:17 +0000 (11:13 +1000)]
Merge branch 'drm-fixes-4.10' of git://people.freedesktop.org/~agd5f/linux into drm-fixes

One regression fix for interlaced modes on radeon

* 'drm-fixes-4.10' of git://people.freedesktop.org/~agd5f/linux:
  drm/radeon: Use mode h/vdisplay fields to hide out of bounds HW cursor

7 years agoRevert "nohz: Fix collision between tick and other hrtimers"
Linus Torvalds [Thu, 16 Feb 2017 20:19:18 +0000 (12:19 -0800)]
Revert "nohz: Fix collision between tick and other hrtimers"

This reverts commit 24b91e360ef521a2808771633d76ebc68bd5604b and commit
7bdb59f1ad47 ("tick/nohz: Fix possible missing clock reprog after tick
soft restart") that depends on it.

Pavel reports that it causes occasional boot hangs for him that seem to
depend on just how the machine was booted.  In particular, his machine
hangs at around the PCI fixups of the EHCI USB host controller, but only
hangs from cold boot, not from a warm boot.

Thomas Gleixner suspects it's a CPU hotplug interaction, particularly
since Pavel also saw suspend/resume issues that seem to be related.
We're reverting for now while trying to figure out the root cause.

Reported-bisected-and-tested-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org # reverted commits were marked for stable
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agoMerge tag 'media/v4.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab...
Linus Torvalds [Thu, 16 Feb 2017 18:22:41 +0000 (10:22 -0800)]
Merge tag 'media/v4.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media fix from Mauro Carvalho Chehab:
 "A regression fix that makes the Siano driver work again after the
  CONFIG_VMAP_STACK change"

* tag 'media/v4.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  [media] siano: make it work again with CONFIG_VMAP_STACK

7 years agovfs: fix uninitialized flags in splice_to_pipe()
Miklos Szeredi [Thu, 16 Feb 2017 16:49:02 +0000 (17:49 +0100)]
vfs: fix uninitialized flags in splice_to_pipe()

Flags (PIPE_BUF_FLAG_PACKET, PIPE_BUF_FLAG_GIFT) could remain on the
unused part of the pipe ring buffer.  Previously splice_to_pipe() left
the flags value alone, which could result in incorrect behavior.

Uninitialized flags appear to have been there from the introduction of
the splice syscall.
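
A simplified user-space model of the problem, for illustration only: slots in
a ring are reused, so any field the producer does not write keeps its old
value. The struct layout, ring_push() and the flag values below are
hypothetical and much simpler than the real pipe code.

```c
#include <stddef.h>

#define RING_SLOTS	4	/* must be a power of two */
#define BUF_FLAG_PACKET	0x1u	/* hypothetical flag bits */
#define BUF_FLAG_GIFT	0x2u

struct slot {
	const void *data;
	size_t len;
	unsigned int flags;
};

struct ring {
	struct slot slots[RING_SLOTS];
	unsigned int head;	/* next slot to fill */
};

/* Fill the next slot, writing every field so nothing stale survives. */
static struct slot *ring_push(struct ring *r, const void *data, size_t len)
{
	struct slot *s = &r->slots[r->head++ & (RING_SLOTS - 1)];

	s->data = data;
	s->len = len;
	s->flags = 0;	/* without this, flags from an earlier use remain */

	return s;
}
```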

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # 2.6.17+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi...
Linus Torvalds [Thu, 16 Feb 2017 17:05:34 +0000 (09:05 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse fixes from Miklos Szeredi:
 "Fix a use after free bug introduced in 4.2 and using an uninitialized
  value introduced in 4.9"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix uninitialized flags in pipe_buffer
  fuse: fix use after free issue in fuse_dev_do_read()

7 years agoMerge tag 'pci-v4.10-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaa...
Linus Torvalds [Thu, 16 Feb 2017 17:03:37 +0000 (09:03 -0800)]
Merge tag 'pci-v4.10-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI fix from Bjorn Helgaas:
 "Add back pcie_pme_remove() so we free the IRQ when removing PCIe port
  devices; previously the leaked IRQ caused an MSI BUG_ON"

* tag 'pci-v4.10-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
  PCI/PME: Restore pcie_pme_driver.remove

7 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Thu, 16 Feb 2017 16:37:18 +0000 (08:37 -0800)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

 1) In order to avoid problems in the future, make cgroup bpf overriding
    explicit using BPF_F_ALLOW_OVERRIDE. From Alexei Starovoitov.

 2) LLC sets skb->sk without proper skb->destructor and this explodes,
    fix from Eric Dumazet.

 3) Make sure when we have an ipv4 mapped source address, the
    destination is either also an ipv4 mapped address or
    ipv6_addr_any(). Fix from Jonathan T. Leighton.

 4) Avoid packet loss in fec driver by programming the multicast filter
    more intelligently. From Rui Sousa.

 5) Handle multiple threads invoking fanout_add(), fix from Eric
    Dumazet.

 6) Since we can invoke the TCP input path in process context, without
    BH being disabled, we have to accommodate that in the locking of the
    TCP probe. Also from Eric Dumazet.

 7) Fix erroneous emission of NETEVENT_DELAY_PROBE_TIME_UPDATE when we
    aren't even updating that sysctl value. From Marcus Huewe.

 8) Fix endian bugs in ibmvnic driver, from Thomas Falcon.

[ This is the second version of the pull that reverts the nested
  rhashtable changes that looked a bit too scary for this late in the
  release  - Linus ]

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (27 commits)
  rhashtable: Revert nested table changes.
  ibmvnic: Fix endian errors in error reporting output
  ibmvnic: Fix endian error when requesting device capabilities
  net: neigh: Fix netevent NETEVENT_DELAY_PROBE_TIME_UPDATE notification
  net: xilinx_emaclite: fix freezes due to unordered I/O
  net: xilinx_emaclite: fix receive buffer overflow
  bpf: kernel header files need to be copied into the tools directory
  tcp: tcp_probe: use spin_lock_bh()
  uapi: fix linux/if_pppol2tp.h userspace compilation errors
  packet: fix races in fanout_add()
  ibmvnic: Fix initial MTU settings
  net: ethernet: ti: cpsw: fix cpsw assignment in resume
  kcm: fix a null pointer dereference in kcm_sendmsg()
  net: fec: fix multicast filtering hardware setup
  ipv6: Handle IPv4-mapped src to in6addr_any dst.
  ipv6: Inhibit IPv4-mapped src address on the wire.
  net/mlx5e: Disable preemption when doing TC statistics upcall
  rhashtable: Add nested tables
  tipc: Fix tipc_sk_reinit race conditions
  gfs2: Use rhashtable walk interface in glock_hash_walk
  ...

7 years agodrm/radeon: Use mode h/vdisplay fields to hide out of bounds HW cursor
Michel Dänzer [Wed, 15 Feb 2017 02:28:45 +0000 (11:28 +0900)]
drm/radeon: Use mode h/vdisplay fields to hide out of bounds HW cursor

The crtc_h/vdisplay fields may not match the CRTC viewport dimensions
with special modes such as interlaced ones.

Fixes the HW cursor disappearing in the bottom half of the screen with
interlaced modes.

Fixes: 6b16cf7785a4 ("drm/radeon: Hide the HW cursor while it's out of bounds")
Cc: stable@vger.kernel.org
Reported-by: Ashutosh Kumar <ashutosh.kumar@amd.com>
Tested-by: Sonny Jiang <sonny.jiang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Michel Dänzer <michel.daenzer@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
7 years agoARM: 8658/1: uaccess: fix zeroing of 64-bit get_user()
Kees Cook [Thu, 16 Feb 2017 00:44:37 +0000 (01:44 +0100)]
ARM: 8658/1: uaccess: fix zeroing of 64-bit get_user()

The 64-bit get_user() wasn't clearing the high word due to a typo in the
error handler. The exception handler entry was already correct, though.
Noticed during recent usercopy test additions in lib/test_user_copy.c.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
7 years agoARM: 8657/1: uaccess: consistently check object sizes
Kees Cook [Thu, 16 Feb 2017 00:43:58 +0000 (01:43 +0100)]
ARM: 8657/1: uaccess: consistently check object sizes

In commit 76624175dcae ("arm64: uaccess: consistently check object sizes"),
the object size checks are moved outside the access_ok() so that bad
destinations are detected before hitting the "memset(dest, 0, size)" in the
copy_from_user() failure path.

This makes the same change for arm, with attention given to possibly
extracting the uaccess routines into a common header file for all
architectures in the future.

Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
7 years agocfq-iosched: don't call wbt_disable_default() with IRQs disabled
Jens Axboe [Thu, 16 Feb 2017 14:57:33 +0000 (07:57 -0700)]
cfq-iosched: don't call wbt_disable_default() with IRQs disabled

wbt_disable_default() calls del_timer_sync() to wait for the wbt
timer to finish before disabling throttling. We can't do this with
IRQs disabled. This fixes a lockdep splat on boot, if non-root
cgroups are used.

Reported-by: Gabriel C <nix.or.die@gmail.com>
Fixes: 87760e5eef35 ("block: hook up writeback throttling")
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agofuse: fix uninitialized flags in pipe_buffer
Miklos Szeredi [Thu, 16 Feb 2017 14:08:20 +0000 (15:08 +0100)]
fuse: fix uninitialized flags in pipe_buffer

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: d82718e348fe ("fuse_dev_splice_read(): switch to add_to_pipe()")
Cc: <stable@vger.kernel.org> # 4.9+
7 years agorhashtable: Revert nested table changes.
David S. Miller [Thu, 16 Feb 2017 03:29:51 +0000 (22:29 -0500)]
rhashtable: Revert nested table changes.

This reverts commits:

6a25478077d987edc5e2f880590a2bc5fcab4441
9dbbfb0ab6680c6a85609041011484e6658e7d3c
40137906c5f55c252194ef5834130383e639536f

It's too risky to put in this late in the release
cycle.  We'll put these changes into the next merge
window instead.

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge tag 'drm-misc-fixes-2017-02-15' of git://anongit.freedesktop.org/git/drm-misc...
Dave Airlie [Thu, 16 Feb 2017 03:26:41 +0000 (13:26 +1000)]
Merge tag 'drm-misc-fixes-2017-02-15' of git://anongit.freedesktop.org/git/drm-misc into drm-fixes

dp/mst oops fix for v4.10

* tag 'drm-misc-fixes-2017-02-15' of git://anongit.freedesktop.org/git/drm-misc:
  drm/dp/mst: fix kernel oops when turning off secondary monitor

7 years agopowerpc/64: Disable use of radix under a hypervisor
Paul Mackerras [Thu, 16 Feb 2017 02:49:21 +0000 (13:49 +1100)]
powerpc/64: Disable use of radix under a hypervisor

Currently, if the kernel is running on a POWER9 processor under a
hypervisor, it may try to use the radix MMU even though it doesn't have
the necessary code to do so (it doesn't negotiate use of radix, and it
doesn't do the H_REGISTER_PROC_TBL hcall).  If the hypervisor supports
both radix and HPT, then it will set up the guest to use HPT (since the
guest doesn't request radix in the CAS call), but if the radix feature
bit is set in the ibm,pa-features property (which is valid, since
ibm,pa-features is defined to represent the capabilities of the
processor) the guest will try to use radix, resulting in a crash when
it turns the MMU on.

This makes the minimal fix for the current code, which is to disable
radix unless we are running in hypervisor mode.

Fixes: 2bfd65e45e87 ("powerpc/mm/radix: Add radix callbacks for early init routines")
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
7 years agoibmvnic: Fix endian errors in error reporting output
Thomas Falcon [Wed, 15 Feb 2017 16:33:33 +0000 (10:33 -0600)]
ibmvnic: Fix endian errors in error reporting output

Error reports received from firmware were not being converted from
big endian values, leading to bogus error codes reported on little
endian systems.

Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoibmvnic: Fix endian error when requesting device capabilities
Thomas Falcon [Wed, 15 Feb 2017 16:32:11 +0000 (10:32 -0600)]
ibmvnic: Fix endian error when requesting device capabilities

When a vNIC client driver requests a faulty device setting, the
server returns an acceptable value for the client to request.
This 64 bit value was incorrectly being swapped as a 32 bit value,
resulting in loss of data. This patch corrects that by using
the 64 bit swap function.
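
To illustrate why the width of the swap matters, here is a small user-space
sketch. load_be64() and load_be32() are hypothetical helpers written for this
example; they decode byte by byte so the result does not depend on the host's
endianness.

```c
#include <stdint.h>

/* Decode a big-endian 64-bit field from its wire representation. */
static uint64_t load_be64(const unsigned char *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];

	return v;
}

/* The mistake in miniature: reading the same field as 32 bits wide. */
static uint32_t load_be32(const unsigned char *p)
{
	uint32_t v = 0;

	for (int i = 0; i < 4; i++)
		v = (v << 8) | p[i];

	return v;
}
```

For a small value such as 9000 stored as a 64-bit big-endian field, the
meaningful bytes are the last ones, so a 32-bit read of the start of the field
sees only zeroes.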

Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: neigh: Fix netevent NETEVENT_DELAY_PROBE_TIME_UPDATE notification
Marcus Huewe [Wed, 15 Feb 2017 00:00:36 +0000 (01:00 +0100)]
net: neigh: Fix netevent NETEVENT_DELAY_PROBE_TIME_UPDATE notification

When setting a neigh related sysctl parameter, we always send a
NETEVENT_DELAY_PROBE_TIME_UPDATE netevent. For instance, when
executing

sysctl net.ipv6.neigh.wlp3s0.retrans_time_ms=2000

a NETEVENT_DELAY_PROBE_TIME_UPDATE netevent is generated.

This is caused by commit 2a4501ae18b5 ("neigh: Send a
notification when DELAY_PROBE_TIME changes"). According to the
commit's description, it was intended to generate such an event
when setting the "delay_first_probe_time" sysctl parameter.

In order to fix this, only generate this event when actually
setting the "delay_first_probe_time" sysctl parameter. This fix
should not have any unintended side-effects, because all but one
registered netevent callbacks check for other netevent event
types (the registered callbacks were obtained by grepping for
"register_netevent_notifier"). The only callback that uses the
NETEVENT_DELAY_PROBE_TIME_UPDATE event is
mlxsw_sp_router_netevent_event() (in
drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c): in case
of this event, it only accesses the DELAY_PROBE_TIME of the
passed neigh_parms.
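
The intended behaviour can be sketched as below. This is a user-space model
with hypothetical names (struct neigh_parms_sketch, set_neigh_var(), the enum
values), not the sysctl handler itself; a counter stands in for sending the
netevent.

```c
/* Hypothetical subset of the per-device neighbour parameters. */
enum neigh_var {
	VAR_RETRANS_TIME_MS,
	VAR_DELAY_PROBE_TIME,
	VAR_MAX
};

struct neigh_parms_sketch {
	int data[VAR_MAX];
	int delay_probe_events;	/* stands in for the emitted netevent */
};

/* Store a parameter and notify only when the delay probe time was set. */
static void set_neigh_var(struct neigh_parms_sketch *p, enum neigh_var idx,
			  int val)
{
	p->data[idx] = val;

	if (idx == VAR_DELAY_PROBE_TIME)
		p->delay_probe_events++;
}
```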

Fixes: 2a4501ae18b5 ("neigh: Send a notification when DELAY_PROBE_TIME changes")
Signed-off-by: Marcus Huewe <suse-tux@gmx.de>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: xilinx_emaclite: fix freezes due to unordered I/O
Anssi Hannula [Tue, 14 Feb 2017 17:11:45 +0000 (19:11 +0200)]
net: xilinx_emaclite: fix freezes due to unordered I/O

The xilinx_emaclite uses __raw_writel and __raw_readl for register
accesses. Those functions do not imply any kind of memory barriers and
they may be reordered.

The driver does not seem to take that into account, though, and the
driver does not satisfy the ordering requirements of the hardware.
For clear examples, see xemaclite_mdio_write() and xemaclite_mdio_read()
which try to set MDIO address before initiating the transaction.

I'm seeing system freezes with the driver with GCC 5.4 and current
Linux kernels on Zynq-7000 SoC immediately when trying to use the
interface.

In commit 123c1407af87 ("net: emaclite: Do not use microblaze and ppc
IO functions") the driver was switched from non-generic
in_be32/out_be32 (memory barriers, big endian) to
__raw_readl/__raw_writel (no memory barriers, native endian), so
apparently the device follows system endianness and the driver was
originally written with the assumption of memory barriers.

Rather than try to hunt for each case of missing barrier, just switch
the driver to use iowrite32/ioread32/iowrite32be/ioread32be depending
on endianness instead.

Tested on little-endian Zynq-7000 ARM SoC FPGA.

Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Fixes: 123c1407af87 ("net: emaclite: Do not use microblaze and ppc IO
functions")
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: xilinx_emaclite: fix receive buffer overflow
Anssi Hannula [Tue, 14 Feb 2017 17:11:44 +0000 (19:11 +0200)]
net: xilinx_emaclite: fix receive buffer overflow

xilinx_emaclite looks at the received data to try to determine the
Ethernet packet length but does not properly clamp it if
proto_type == ETH_P_IP or 1500 < proto_type <= 1518, causing a buffer
overflow and a panic via skb_panic() as the length exceeds the allocated
skb size.

Fix those cases.

Also add an additional unconditional check with WARN_ON() at the end.

Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Fixes: bb81b2ddfa19 ("net: add Xilinx emac lite device driver")
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoPCI/PME: Restore pcie_pme_driver.remove
Yinghai Lu [Wed, 15 Feb 2017 05:17:48 +0000 (21:17 -0800)]
PCI/PME: Restore pcie_pme_driver.remove

In addition to making PME non-modular, d7def2040077 ("PCI/PME: Make
explicitly non-modular") removed the pcie_pme_driver .remove() method,
pcie_pme_remove().

pcie_pme_remove() freed the PME IRQ that was requested in pcie_pme_probe().
The fact that we don't free the IRQ after d7def2040077 causes the following
crash when removing a PCIe port device via /sys:

  ------------[ cut here ]------------
  kernel BUG at drivers/pci/msi.c:370!
  invalid opcode: 0000 [#1] SMP
  Modules linked in:
  CPU: 1 PID: 14509 Comm: sh Tainted: G    W  4.8.0-rc1-yh-00012-gd29438d
  RIP: 0010:[<ffffffff9758bbf5>]  free_msi_irqs+0x65/0x190
  ...
  Call Trace:
   [<ffffffff9758cda4>] pci_disable_msi+0x34/0x40
   [<ffffffff97583817>] cleanup_service_irqs+0x27/0x30
   [<ffffffff97583e9a>] pcie_port_device_remove+0x2a/0x40
   [<ffffffff97584250>] pcie_portdrv_remove+0x40/0x50
   [<ffffffff97576d7b>] pci_device_remove+0x4b/0xc0
   [<ffffffff9785ebe6>] __device_release_driver+0xb6/0x150
   [<ffffffff9785eca5>] device_release_driver+0x25/0x40
   [<ffffffff975702e4>] pci_stop_bus_device+0x74/0xa0
   [<ffffffff975704ea>] pci_stop_and_remove_bus_device_locked+0x1a/0x30
   [<ffffffff97578810>] remove_store+0x50/0x70
   [<ffffffff9785a378>] dev_attr_store+0x18/0x30
   [<ffffffff97260b64>] sysfs_kf_write+0x44/0x60
   [<ffffffff9725feae>] kernfs_fop_write+0x10e/0x190
   [<ffffffff971e13f8>] __vfs_write+0x28/0x110
   [<ffffffff970b0fa4>] ? percpu_down_read+0x44/0x80
   [<ffffffff971e53a7>] ? __sb_start_write+0xa7/0xe0
   [<ffffffff971e53a7>] ? __sb_start_write+0xa7/0xe0
   [<ffffffff971e1f04>] vfs_write+0xc4/0x180
   [<ffffffff971e3089>] SyS_write+0x49/0xa0
   [<ffffffff97001a46>] do_syscall_64+0xa6/0x1b0
   [<ffffffff9819201e>] entry_SYSCALL64_slow_path+0x25/0x25
  ...
   RIP  [<ffffffff9758bbf5>] free_msi_irqs+0x65/0x190
   RSP <ffff89ad3085bc48>
  ---[ end trace f4505e1dac5b95d3 ]---
  Segmentation fault

Restore pcie_pme_remove().
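
The underlying rule, that whatever probe acquires must be released by
a matching remove before the parent tears down its interrupt vectors,
can be modelled in a few lines of userspace C. All names here
(svc_dev, svc_driver, svc_unbind() and the pme_*_model() callbacks) are
hypothetical and only illustrate the pairing; this is not the port
driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct svc_dev {
	bool irq_requested;	/* true while a handler is installed */
};

struct svc_driver {
	int (*probe)(struct svc_dev *dev);
	void (*remove)(struct svc_dev *dev);	/* may be NULL */
};

/* Probe acquires the interrupt. */
static int pme_probe_model(struct svc_dev *dev)
{
	dev->irq_requested = true;
	return 0;
}

/* Remove releases what probe acquired. */
static void pme_remove_model(struct svc_dev *dev)
{
	dev->irq_requested = false;
}

/*
 * Unbind the service driver. Returns true when no handler is left
 * behind, i.e. when the caller may safely free the vectors.
 */
static bool svc_unbind(const struct svc_driver *drv, struct svc_dev *dev)
{
	assert(drv != NULL && dev != NULL);
	if (drv->remove)
		drv->remove(dev);
	return !dev->irq_requested;
}
```

With the remove callback missing, svc_unbind() reports that a handler
is still installed, which corresponds to the situation that ends in
the BUG shown in the trace above.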

[bhelgaas: changelog]
Fixes: d7def2040077 ("PCI/PME: Make explicitly non-modular")
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
CC: stable@vger.kernel.org # v4.9+
7 years agotimekeeping: Use deferred printk() in debug code
Sergey Senozhatsky [Wed, 15 Feb 2017 04:43:32 +0000 (13:43 +0900)]
timekeeping: Use deferred printk() in debug code

We cannot do printk() from tk_debug_account_sleep_time(), because
tk_debug_account_sleep_time() is called under tk_core seq lock.
printk() is unsafe there because releasing console_sem may call into
the scheduler (up()->wake_up_process()->activate_task()), which, in
turn, can re-enter the timekeeping code, for instance, via
get_time()->ktime_get(), deadlocking the system on tk_core seq lock.

[   48.950592] ======================================================
[   48.950622] [ INFO: possible circular locking dependency detected ]
[   48.950622] 4.10.0-rc7-next-20170213+ #101 Not tainted
[   48.950622] -------------------------------------------------------
[   48.950622] kworker/0:0/3 is trying to acquire lock:
[   48.950653]  (tk_core){----..}, at: [<c01cc624>] retrigger_next_event+0x4c/0x90
[   48.950683]
               but task is already holding lock:
[   48.950683]  (hrtimer_bases.lock){-.-...}, at: [<c01cc610>] retrigger_next_event+0x38/0x90
[   48.950714]
               which lock already depends on the new lock.

[   48.950714]
               the existing dependency chain (in reverse order) is:
[   48.950714]
               -> #5 (hrtimer_bases.lock){-.-...}:
[   48.950744]        _raw_spin_lock_irqsave+0x50/0x64
[   48.950775]        lock_hrtimer_base+0x28/0x58
[   48.950775]        hrtimer_start_range_ns+0x20/0x5c8
[   48.950775]        __enqueue_rt_entity+0x320/0x360
[   48.950805]        enqueue_rt_entity+0x2c/0x44
[   48.950805]        enqueue_task_rt+0x24/0x94
[   48.950836]        ttwu_do_activate+0x54/0xc0
[   48.950836]        try_to_wake_up+0x248/0x5c8
[   48.950836]        __setup_irq+0x420/0x5f0
[   48.950836]        request_threaded_irq+0xdc/0x184
[   48.950866]        devm_request_threaded_irq+0x58/0xa4
[   48.950866]        omap_i2c_probe+0x530/0x6a0
[   48.950897]        platform_drv_probe+0x50/0xb0
[   48.950897]        driver_probe_device+0x1f8/0x2cc
[   48.950897]        __driver_attach+0xc0/0xc4
[   48.950927]        bus_for_each_dev+0x6c/0xa0
[   48.950927]        bus_add_driver+0x100/0x210
[   48.950927]        driver_register+0x78/0xf4
[   48.950958]        do_one_initcall+0x3c/0x16c
[   48.950958]        kernel_init_freeable+0x20c/0x2d8
[   48.950958]        kernel_init+0x8/0x110
[   48.950988]        ret_from_fork+0x14/0x24
[   48.950988]
               -> #4 (&rt_b->rt_runtime_lock){-.-...}:
[   48.951019]        _raw_spin_lock+0x40/0x50
[   48.951019]        rq_offline_rt+0x9c/0x2bc
[   48.951019]        set_rq_offline.part.2+0x2c/0x58
[   48.951049]        rq_attach_root+0x134/0x144
[   48.951049]        cpu_attach_domain+0x18c/0x6f4
[   48.951049]        build_sched_domains+0xba4/0xd80
[   48.951080]        sched_init_smp+0x68/0x10c
[   48.951080]        kernel_init_freeable+0x160/0x2d8
[   48.951080]        kernel_init+0x8/0x110
[   48.951080]        ret_from_fork+0x14/0x24
[   48.951110]
               -> #3 (&rq->lock){-.-.-.}:
[   48.951110]        _raw_spin_lock+0x40/0x50
[   48.951141]        task_fork_fair+0x30/0x124
[   48.951141]        sched_fork+0x194/0x2e0
[   48.951141]        copy_process.part.5+0x448/0x1a20
[   48.951171]        _do_fork+0x98/0x7e8
[   48.951171]        kernel_thread+0x2c/0x34
[   48.951171]        rest_init+0x1c/0x18c
[   48.951202]        start_kernel+0x35c/0x3d4
[   48.951202]        0x8000807c
[   48.951202]
               -> #2 (&p->pi_lock){-.-.-.}:
[   48.951232]        _raw_spin_lock_irqsave+0x50/0x64
[   48.951232]        try_to_wake_up+0x30/0x5c8
[   48.951232]        up+0x4c/0x60
[   48.951263]        __up_console_sem+0x2c/0x58
[   48.951263]        console_unlock+0x3b4/0x650
[   48.951263]        vprintk_emit+0x270/0x474
[   48.951293]        vprintk_default+0x20/0x28
[   48.951293]        printk+0x20/0x30
[   48.951324]        kauditd_hold_skb+0x94/0xb8
[   48.951324]        kauditd_thread+0x1a4/0x56c
[   48.951324]        kthread+0x104/0x148
[   48.951354]        ret_from_fork+0x14/0x24
[   48.951354]
               -> #1 ((console_sem).lock){-.....}:
[   48.951385]        _raw_spin_lock_irqsave+0x50/0x64
[   48.951385]        down_trylock+0xc/0x2c
[   48.951385]        __down_trylock_console_sem+0x24/0x80
[   48.951385]        console_trylock+0x10/0x8c
[   48.951416]        vprintk_emit+0x264/0x474
[   48.951416]        vprintk_default+0x20/0x28
[   48.951416]        printk+0x20/0x30
[   48.951446]        tk_debug_account_sleep_time+0x5c/0x70
[   48.951446]        __timekeeping_inject_sleeptime.constprop.3+0x170/0x1a0
[   48.951446]        timekeeping_resume+0x218/0x23c
[   48.951477]        syscore_resume+0x94/0x42c
[   48.951477]        suspend_enter+0x554/0x9b4
[   48.951477]        suspend_devices_and_enter+0xd8/0x4b4
[   48.951507]        enter_state+0x934/0xbd4
[   48.951507]        pm_suspend+0x14/0x70
[   48.951507]        state_store+0x68/0xc8
[   48.951538]        kernfs_fop_write+0xf4/0x1f8
[   48.951538]        __vfs_write+0x1c/0x114
[   48.951538]        vfs_write+0xa0/0x168
[   48.951568]        SyS_write+0x3c/0x90
[   48.951568]        __sys_trace_return+0x0/0x10
[   48.951568]
               -> #0 (tk_core){----..}:
[   48.951599]        lock_acquire+0xe0/0x294
[   48.951599]        ktime_get_update_offsets_now+0x5c/0x1d4
[   48.951629]        retrigger_next_event+0x4c/0x90
[   48.951629]        on_each_cpu+0x40/0x7c
[   48.951629]        clock_was_set_work+0x14/0x20
[   48.951660]        process_one_work+0x2b4/0x808
[   48.951660]        worker_thread+0x3c/0x550
[   48.951660]        kthread+0x104/0x148
[   48.951690]        ret_from_fork+0x14/0x24
[   48.951690]
               other info that might help us debug this:

[   48.951690] Chain exists of:
                 tk_core --> &rt_b->rt_runtime_lock --> hrtimer_bases.lock

[   48.951721]  Possible unsafe locking scenario:

[   48.951721]        CPU0                    CPU1
[   48.951721]        ----                    ----
[   48.951721]   lock(hrtimer_bases.lock);
[   48.951751]                                lock(&rt_b->rt_runtime_lock);
[   48.951751]                                lock(hrtimer_bases.lock);
[   48.951751]   lock(tk_core);
[   48.951782]
                *** DEADLOCK ***

[   48.951782] 3 locks held by kworker/0:0/3:
[   48.951782]  #0:  ("events"){.+.+.+}, at: [<c0156590>] process_one_work+0x1f8/0x808
[   48.951812]  #1:  (hrtimer_work){+.+...}, at: [<c0156590>] process_one_work+0x1f8/0x808
[   48.951843]  #2:  (hrtimer_bases.lock){-.-...}, at: [<c01cc610>] retrigger_next_event+0x38/0x90
[   48.951843]   stack backtrace:
[   48.951873] CPU: 0 PID: 3 Comm: kworker/0:0 Not tainted 4.10.0-rc7-next-20170213+
[   48.951904] Workqueue: events clock_was_set_work
[   48.951904] [<c0110208>] (unwind_backtrace) from [<c010c224>] (show_stack+0x10/0x14)
[   48.951934] [<c010c224>] (show_stack) from [<c04ca6c0>] (dump_stack+0xac/0xe0)
[   48.951934] [<c04ca6c0>] (dump_stack) from [<c019b5cc>] (print_circular_bug+0x1d0/0x308)
[   48.951965] [<c019b5cc>] (print_circular_bug) from [<c019d2a8>] (validate_chain+0xf50/0x1324)
[   48.951965] [<c019d2a8>] (validate_chain) from [<c019ec18>] (__lock_acquire+0x468/0x7e8)
[   48.951995] [<c019ec18>] (__lock_acquire) from [<c019f634>] (lock_acquire+0xe0/0x294)
[   48.951995] [<c019f634>] (lock_acquire) from [<c01d0ea0>] (ktime_get_update_offsets_now+0x5c/0x1d4)
[   48.952026] [<c01d0ea0>] (ktime_get_update_offsets_now) from [<c01cc624>] (retrigger_next_event+0x4c/0x90)
[   48.952026] [<c01cc624>] (retrigger_next_event) from [<c01e4e24>] (on_each_cpu+0x40/0x7c)
[   48.952056] [<c01e4e24>] (on_each_cpu) from [<c01cafc4>] (clock_was_set_work+0x14/0x20)
[   48.952056] [<c01cafc4>] (clock_was_set_work) from [<c015664c>] (process_one_work+0x2b4/0x808)
[   48.952087] [<c015664c>] (process_one_work) from [<c0157774>] (worker_thread+0x3c/0x550)
[   48.952087] [<c0157774>] (worker_thread) from [<c015d644>] (kthread+0x104/0x148)
[   48.952087] [<c015d644>] (kthread) from [<c0107830>] (ret_from_fork+0x14/0x24)

Replace printk() with printk_deferred(), which does not call into
the scheduler.
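
The general idea of deferring output can be sketched in userspace: a
message produced while a lock is held is only copied into a buffer,
and the actual output happens later from a context where it is safe.
This is a simplified model with hypothetical names (log_deferred(),
log_flush(), a fixed-size pending array); it does not reproduce how
printk_deferred() is implemented.

```c
#include <assert.h>
#include <stdio.h>

#define LOG_SLOTS	8
#define LOG_LEN		64

static char pending[LOG_SLOTS][LOG_LEN];
static int npending;

/*
 * May be called with a lock held: it only copies the message and
 * never takes another lock or wakes another task. Messages beyond
 * LOG_SLOTS are dropped in this sketch.
 */
static void log_deferred(const char *msg)
{
	if (npending < LOG_SLOTS) {
		snprintf(pending[npending], LOG_LEN, "%s", msg);
		npending++;
	}
}

/* Call after the lock is released; returns the number of lines emitted. */
static int log_flush(FILE *out)
{
	int n = npending;

	assert(out != NULL);
	for (int i = 0; i < n; i++)
		fprintf(out, "%s\n", pending[i]);
	npending = 0;
	return n;
}
```

A caller would invoke log_deferred() inside the critical section and
log_flush() once the lock has been dropped.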

Fixes: 0bf43f15db85 ("timekeeping: Prints the amounts of time spent during suspend")
Reported-and-tested-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J . Wysocki" <rjw@rjwysocki.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: "[4.9+]" <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170215044332.30449-1-sergey.senozhatsky@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 years agodrm/dp/mst: fix kernel oops when turning off secondary monitor
Pierre-Louis Bossart [Tue, 14 Feb 2017 12:49:21 +0000 (14:49 +0200)]
drm/dp/mst: fix kernel oops when turning off secondary monitor

100% reproducible issue found on SKL SkullCanyon NUC with two external
DP daisy-chained monitors in DP/MST mode. When turning off or changing
the input of the second monitor the machine stops with a kernel
oops. This issue happened with 4.8.8 as well as drm/drm-intel-nightly.

This issue is traced to an inconsistent control flow in
drm_dp_update_payload_part1(): the 'port' pointer is set to NULL at the
same time as 'req_payload.num_slots' is set to zero, but the pointer is
dereferenced even when req_payload.num_slots is zero.

The problematic dereference was introduced in commit dfda0df34
("drm/mst: rework payload table allocation to conform better") and may
impact all versions since v3.18.

The fix suggested by Chris Wilson removes the kernel oops and was found to
work well after 10 minutes of monkey-testing with the second monitor's
power and input buttons.
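
One way to avoid this class of bug is to never dereference a pointer
that is allowed to be NULL on the path being taken. The sketch below
is an illustration only, not the actual patch: the struct layouts and
the vcpi_to_destroy() helper are hypothetical, and falling back to an
id cached in the payload entry is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

struct port {
	int vcpi;	/* id as recorded on the port */
};

struct payload {
	int num_slots;	/* slots currently allocated */
	int vcpi;	/* id cached with the allocation */
};

/*
 * Returns the id of the allocation to tear down, or -1 if nothing
 * needs to change. port may be NULL when req_slots is zero.
 */
static int vcpi_to_destroy(const struct payload *cur,
			   const struct port *port, int req_slots)
{
	assert(cur != NULL);
	if (cur->num_slots == 0 || cur->num_slots == req_slots)
		return -1;

	/* Only touch port when it exists; otherwise use the cached id. */
	return port ? port->vcpi : cur->vcpi;
}
```

The important property is that the teardown path works even when
port is NULL, which is exactly the case that arises when the
requested slot count drops to zero.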

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98990
Fixes: dfda0df34264 ("drm/mst: rework payload table allocation to conform better.")
Cc: Dave Airlie <airlied@redhat.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Nathan D Ciobanu <nathan.d.ciobanu@linux.intel.com>
Cc: Dhinakaran Pandiyan <dhinakaran.pandiyan@intel.com>
Cc: Sean Paul <seanpaul@chromium.org>
Cc: <stable@vger.kernel.org> # v3.18+
Tested-by: Nathan D Ciobanu <nathan.d.ciobanu@linux.intel.com>
Reviewed-by: Dhinakaran Pandiyan <dhinakaran.pandiyan@intel.com>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1487076561-2169-1-git-send-email-jani.nikula@intel.com
7 years agofuse: fix use after free issue in fuse_dev_do_read()
Sahitya Tummala [Wed, 8 Feb 2017 15:00:56 +0000 (20:30 +0530)]
fuse: fix use after free issue in fuse_dev_do_read()

There is a potential race between fuse_dev_do_write()
and request_wait_answer() contexts as shown below:

TASK 1:
__fuse_request_send():
  |--spin_lock(&fiq->waitq.lock);
  |--queue_request();
  |--spin_unlock(&fiq->waitq.lock);
  |--request_wait_answer():
       |--if (test_bit(FR_SENT, &req->flags))
       <gets pre-empted after it is validated true>
                                   TASK 2:
                                   fuse_dev_do_write():
                                     |--clears bit FR_SENT,
                                     |--request_end():
                                        |--sets bit FR_FINISHED
                                        |--spin_lock(&fiq->waitq.lock);
                                        |--list_del_init(&req->intr_entry);
                                        |--spin_unlock(&fiq->waitq.lock);
                                        |--fuse_put_request();
       |--queue_interrupt();
       <request gets queued to interrupts list>
            |--wake_up_locked(&fiq->waitq);
       |--wait_event_freezable();
       <as FR_FINISHED is set, it returns and then
       the caller frees this request>

Now the next fuse_dev_do_read() sees that the interrupts list is not
empty and calls fuse_read_interrupt(), which tries to access the
request that has already been freed, resulting in the crash below:

[11432.401266] Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b6b6b
...
[11432.418518] Kernel BUG at ffffff80083720e0
[11432.456168] PC is at __list_del_entry+0x6c/0xc4
[11432.463573] LR is at fuse_dev_do_read+0x1ac/0x474
...
[11432.679999] [<ffffff80083720e0>] __list_del_entry+0x6c/0xc4
[11432.687794] [<ffffff80082c65e0>] fuse_dev_do_read+0x1ac/0x474
[11432.693180] [<ffffff80082c6b14>] fuse_dev_read+0x6c/0x78
[11432.699082] [<ffffff80081d5638>] __vfs_read+0xc0/0xe8
[11432.704459] [<ffffff80081d5efc>] vfs_read+0x90/0x108
[11432.709406] [<ffffff80081d67f0>] SyS_read+0x58/0x94

As the FR_FINISHED bit is set before intr_entry is deleted under the
input queue lock in the request completion path, test this flag and
queue the request atomically under the same lock in queue_interrupt().
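
The locking rule can be modelled in userspace as follows. This is a
sketch under stated assumptions, not the fuse code: the struct and
function names are hypothetical, a C11 atomic_flag spinlock stands in
for fiq->waitq.lock, and a counter stands in for the interrupts list.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct req {
	atomic_bool finished;	/* stands in for the FR_FINISHED bit */
	bool on_intr_list;	/* stands in for intr_entry being linked */
};

struct iqueue {
	atomic_flag lock;	/* stands in for fiq->waitq.lock */
	int nr_interrupts;	/* length of the interrupts list */
};

static void iq_lock(struct iqueue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock,
						 memory_order_acquire))
		;	/* spin */
}

static void iq_unlock(struct iqueue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Test the flag and link the request within one lock hold. */
static bool queue_interrupt_model(struct iqueue *q, struct req *r)
{
	bool queued = false;

	iq_lock(q);
	if (!atomic_load(&r->finished) && !r->on_intr_list) {
		r->on_intr_list = true;
		q->nr_interrupts++;
		queued = true;
	}
	iq_unlock(q);
	return queued;
}

/* Completion: set the flag first, then unlink under the lock. */
static void request_end_model(struct iqueue *q, struct req *r)
{
	atomic_store(&r->finished, true);
	iq_lock(q);
	if (r->on_intr_list) {
		assert(q->nr_interrupts > 0);
		r->on_intr_list = false;
		q->nr_interrupts--;
	}
	iq_unlock(q);
}
```

Because the flag is set before the completion path takes the lock,
any queue_interrupt_model() call that runs after the unlink is
guaranteed to see the flag and leave the list untouched.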

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: fd22d62ed0c3 ("fuse: no fc->lock for iqueue parts")
Cc: <stable@vger.kernel.org> # 4.2+
7 years agobpf: kernel header files need to be copied into the tools directory
Stephen Rothwell [Mon, 13 Feb 2017 21:22:20 +0000 (08:22 +1100)]
bpf: kernel header files need to be copied into the tools directory

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotcp: tcp_probe: use spin_lock_bh()
Eric Dumazet [Wed, 15 Feb 2017 01:11:14 +0000 (17:11 -0800)]
tcp: tcp_probe: use spin_lock_bh()

tcp_rcv_established() can now run in process context.

We need to disable BH while acquiring tcp probe spinlock,
or risk a deadlock.

Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Ricardo Nabinger Sanchez <rnsanchez@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agouapi: fix linux/if_pppol2tp.h userspace compilation errors
Dmitry V. Levin [Wed, 15 Feb 2017 02:23:26 +0000 (05:23 +0300)]
uapi: fix linux/if_pppol2tp.h userspace compilation errors

Because of <linux/libc-compat.h> interface limitations, <netinet/in.h>
provided by libc cannot be included after <linux/in.h>, therefore any
header that includes <netinet/in.h> cannot be included after <linux/in.h>.

Change uapi/linux/l2tp.h, the last uapi header that includes
<netinet/in.h>, to include <linux/in.h> and <linux/in6.h> instead of
<netinet/in.h> and use __SOCK_SIZE__ instead of sizeof(struct sockaddr)
the same way as uapi/linux/in.h does, to fix linux/if_pppol2tp.h userspace
compilation errors like this:

In file included from /usr/include/linux/l2tp.h:12:0,
                 from /usr/include/linux/if_pppol2tp.h:21,
/usr/include/netinet/in.h:31:8: error: redefinition of 'struct in_addr'

Fixes: 47c3e7783be4 ("net: l2tp: deprecate PPPOL2TP_MSG_* in favour of L2TP_MSG_*")
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>