Ilya Dryomov [Mon, 27 Jan 2014 15:40:20 +0000 (17:40 +0200)]
libceph: rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid}
Rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid} before
introducing r_target_{oloc,oid} needed for redirects.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Mon, 27 Jan 2014 15:40:19 +0000 (17:40 +0200)]
libceph: follow {read,write}_tier fields on osd request submission
Overwrite ceph_osd_request::r_oloc.pool with read_tier for read ops and
write_tier for write and read+write ops (aka basic tiering support).
{read,write}_tier are part of pg_pool_t since v9. This commit bumps
our pg_pool_t decode compat version from v7 to v9, all new fields
except for {read,write}_tier are ignored.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Mon, 27 Jan 2014 15:40:19 +0000 (17:40 +0200)]
libceph: add ceph_pg_pool_by_id()
"Lookup pool info by ID" function is hidden in osdmap.c. Expose it to
the rest of libceph.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Mon, 27 Jan 2014 15:40:19 +0000 (17:40 +0200)]
libceph: CEPH_OSD_FLAG_* enum update
Update CEPH_OSD_FLAG_* enum. (We need CEPH_OSD_FLAG_IGNORE_OVERLAY to
support tiering).
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Mon, 27 Jan 2014 15:40:19 +0000 (17:40 +0200)]
libceph: replace ceph_calc_ceph_pg() with ceph_oloc_oid_to_pg()
Switch ceph_calc_ceph_pg() to new oloc and oid abstractions and rename
it to ceph_oloc_oid_to_pg() to make its purpose more clear.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Mon, 27 Jan 2014 15:40:18 +0000 (17:40 +0200)]
libceph: introduce and start using oid abstraction
In preparation for tiering support, which would require having two
(base and target) object names for each osd request and also copying
those names around, introduce struct ceph_object_id (oid) and a couple
helpers to facilitate those copies and encapsulate the fact that object
name is not necessarily a NUL-terminated string.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Mon, 27 Jan 2014 15:40:18 +0000 (17:40 +0200)]
libceph: rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LEN
In preparation for adding oid abstraction, rename MAX_OBJ_NAME_SIZE to
CEPH_MAX_OID_NAME_LEN.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Mon, 27 Jan 2014 15:40:18 +0000 (17:40 +0200)]
libceph: move ceph_file_layout helpers to ceph_fs.h
Move ceph_file_layout helper macros and inline functions to ceph_fs.h.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Mon, 27 Jan 2014 15:40:18 +0000 (17:40 +0200)]
libceph: start using oloc abstraction
Instead of relying on pool fields in ceph_file_layout (for mapping) and
ceph_pg (for enconding), start using ceph_object_locator (oloc)
abstraction. Note that userspace oloc currently consists of pool, key,
nspace and hash fields, while this one contains only a pool. This is
OK, because at this point we only send (i.e. encode) olocs and never
have to receive (i.e. decode) them.
This makes keeping a copy of ceph_file_layout in every osd request
unnecessary, so ceph_osd_request::r_file_layout field is nuked.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Thu, 16 Jan 2014 17:18:27 +0000 (19:18 +0200)]
libceph: dout() is missing a newline
Add a missing newline to a dout() in __reset_osd().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Ilya Dryomov [Thu, 9 Jan 2014 18:08:21 +0000 (20:08 +0200)]
libceph: add ceph_kv{malloc,free}() and switch to them
Encapsulate kmalloc vs vmalloc memory allocation and freeing logic into
two helpers, ceph_kvmalloc() and ceph_kvfree(), and switch to them.
ceph_kvmalloc() kmalloc()'s a maximum of 8 pages, anything bigger is
vmalloc()'ed with __GFP_HIGHMEM set. This changes the existing
behaviour:
- for buffers (ceph_buffer_new()), from trying to kmalloc() everything
and using vmalloc() just as a fallback
- for messages (ceph_msg_new()), from going to vmalloc() for anything
bigger than a page
- for messages (ceph_msg_new()), from disallowing vmalloc() to use high
memory
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Yan, Zheng [Tue, 21 Jan 2014 03:07:16 +0000 (11:07 +0800)]
libceph: support CEPH_FEATURE_EXPORT_PEER
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Sun, 24 Nov 2013 06:44:38 +0000 (14:44 +0800)]
ceph: add imported caps when handling cap export message
Version 3 cap export message includes information about the imported
caps. It allows us to add the imported caps if the corresponding cap
import message still hasn't been received.
This allow us to handle situation that the importer MDS crashes and
the cap import message is missing.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Sun, 24 Nov 2013 06:33:01 +0000 (14:33 +0800)]
ceph: add open export target session helper
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Sun, 24 Nov 2013 06:43:46 +0000 (14:43 +0800)]
ceph: remove exported caps when handling cap import message
Version 3 cap import message includes the ID of the exported
caps. It allow us to remove the exported caps if we still haven't
received the corresponding cap export message.
We remove the exported caps because they are stale, keeping them
can compromise consistence.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Fri, 22 Nov 2013 06:48:37 +0000 (14:48 +0800)]
ceph: handle session flush message
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Sat, 30 Nov 2013 04:47:41 +0000 (12:47 +0800)]
ceph: check inode caps in ceph_d_revalidate
Some inodes in readdir reply may have no caps. Getattr mds request
for these inodes can return -ESTALE. The fix is consider dentry that
links to inode with no caps as invalid. Invalid dentry causes a
lookup request to send to the mds, the MDS will send caps back.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Fri, 22 Nov 2013 06:21:44 +0000 (14:21 +0800)]
ceph: handle -ESTALE reply
Send requests that operate on path to directory's auth MDS if
mode == USE_AUTH_MDS. Always retry using the auth MDS if got
-ESTALE reply from non-auth MDS. Also clean up the code that
handles auth MDS change.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Fri, 22 Nov 2013 05:56:24 +0000 (13:56 +0800)]
ceph: fix trim caps
- don't trim auth cap if there are flusing caps
- don't trim auth cap if any 'write' cap is wanted
- allow trimming non-auth cap even if the inode is dirty
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Fri, 22 Nov 2013 05:50:45 +0000 (13:50 +0800)]
ceph: fix cache revoke race
handle following sequence of events:
- non-auth MDS revokes Fc cap. queue invalidate work
- auth MDS issues Fc cap through request reply. i_rdcache_gen gets
increased.
- invalidate work runs. it finds i_rdcache_revoking != i_rdcache_gen,
so it does nothing.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Wed, 13 Nov 2013 06:47:19 +0000 (14:47 +0800)]
ceph: use ceph_seq_cmp() to compare migrate_seq
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Yan, Zheng [Thu, 31 Oct 2013 08:44:14 +0000 (16:44 +0800)]
ceph: handle cap export race in try_flush_caps()
auth cap may change after releasing the i_ceph_lock
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
J. Bruce Fields [Thu, 16 Jan 2014 22:42:53 +0000 (17:42 -0500)]
ceph: trivial comment fix
"disconnected" is too easily confused with "DCACHE_DISCONNECTED". I
think "unhashed" is the more precise term here.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Thu, 9 Jan 2014 18:08:21 +0000 (20:08 +0200)]
libceph: fix preallocation check in get_reply()
The check that makes sure that we have enough memory allocated to read
in the entire header of the message in question is currently busted.
It compares front_len of the incoming message with iov_len field of
ceph_msg::front structure, which is used primarily to indicate the
amount of data already read in, and not the size of the allocated
buffer. Under certain conditions (e.g. a short read from a socket
followed by that socket's shutdown and owning ceph_connection reset)
this results in a warning similar to
[85688.975866] libceph: get_reply front 198 > preallocated 122 (4#0)
and, through another bug, leads to forever hung tasks and forced
reboots. Fix this by comparing front_len with front_alloc_len field of
struct ceph_msg, which stores the actual size of the buffer.
Fixes: http://tracker.ceph.com/issues/5425
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Thu, 9 Jan 2014 18:08:21 +0000 (20:08 +0200)]
libceph: rename front to front_len in get_reply()
Rename front local variable to front_len in get_reply() to make its
purpose more clear.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Thu, 9 Jan 2014 18:08:21 +0000 (20:08 +0200)]
libceph: rename ceph_msg::front_max to front_alloc_len
Rename front_max field of struct ceph_msg to front_alloc_len to make
its purpose more clear.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Mon, 30 Dec 2013 17:21:29 +0000 (19:21 +0200)]
libceph: use CEPH_MON_PORT when the specified port is 0
Similar to userspace, don't bail with "parse_ips bad ip ..." if the
specified port is port 0, instead use port CEPH_MON_PORT (6789, the
default monitor port).
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:27 +0000 (21:19 +0200)]
crush: support new indep mode and SET_* steps (crush v2) by default
Add CRUSH_V2 feature (new indep mode and SET_* steps) to a set of
features supported by default.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:27 +0000 (21:19 +0200)]
crush: fix crush_choose_firstn comment
Reflects ceph.git commit
8b38f10bc2ee3643a33ea5f9545ad5c00e4ac5b4.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:27 +0000 (21:19 +0200)]
crush: attempts -> tries
Reflects ceph.git commit
ea3a0bb8b773360d73b8b77fa32115ef091c9857.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:27 +0000 (21:19 +0200)]
crush: add set_choose_local_[fallback_]tries steps
This allows all of the tunables to be overridden by a specific rule.
Reflects ceph.git commits
d129e09e57fbc61cfd4f492e3ee77d0750c9d292,
0497db49e5973b50df26251ed0e3f4ac7578e66e.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:26 +0000 (21:19 +0200)]
crush: generalize descend_once
The legacy behavior is to make the normal number of tries for the
recursive chooseleaf call. The descend_once tunable changed this to
making a single try and bail if we get a reject (note that it is
impossible to collide in the recursive case).
The new set_chooseleaf_tries lets you select the number of recursive
chooseleaf attempts for indep mode, or default to 1. Use the same
behavior for firstn, except default to total_tries when the legacy
tunables are set (for compatibility). This makes the rule step
override the (new) default of 1 recursive attempt, keeping behavior
consistent with indep mode.
Reflects ceph.git commit
685c6950ef3df325ef04ce7c986e36ca2514c5f1.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:26 +0000 (21:19 +0200)]
crush: CHOOSE_LEAF -> CHOOSELEAF throughout
This aligns the internal identifier names with the user-visible names in
the decompiled crush map language.
Reflects ceph.git commit
caa0e22e15e4226c3671318ba1f61314bf6da2a6.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:26 +0000 (21:19 +0200)]
crush: add SET_CHOOSE_TRIES rule step
Since we can specify the recursive retries in a rule, we may as well also
specify the non-recursive tries too for completeness.
Reflects ceph.git commit
d1b97462cffccc871914859eaee562f2786abfd1.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:26 +0000 (21:19 +0200)]
crush: apply chooseleaf_tries to firstn mode too
Parameterize the attempts for the _firstn choose method, and apply the
rule-specified tries count to firstn mode as well. Note that we have
slightly different behavior here than with indep:
If the firstn value is not specified for firstn, we pass through the
normal attempt count. This maintains compatibility with legacy behavior.
Note that this is usually *not* actually N^2 work, though, because of the
descend_once tunable. However, descend_once is unfortunately *not* the
same thing as 1 chooseleaf try because it is only checked on a reject but
not on a collision. Sigh.
In contrast, for indep, if tries is not specified we default to 1
recursive attempt, because that is simply more sane, and we have the
option to do so. The descend_once tunable has no effect for indep.
Reflects ceph.git commit
64aeded50d80942d66a5ec7b604ff2fcbf5d7b63.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:26 +0000 (21:19 +0200)]
crush: new SET_CHOOSE_LEAF_TRIES command
Explicitly control the number of sample attempts, and allow the number of
tries in the recursive call to be explicitly controlled via the rule. This
is important because the amount of time we want to spend looking for a
solution may be rule dependent (e.g., higher for the wide indep pool than
the rep pools).
(We should do the same for the other tunables, by the way!)
Reflects ceph.git commit
c43c893be872f709c787bc57f46c0e97876ff681.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:25 +0000 (21:19 +0200)]
crush: pass parent r value for indep call
Pass down the parent's 'r' value so that we will sample different values in
the recursive call when the parent tries multiple times. This avoids doing
useless work (calling multiple times and trying the same values).
Reflects ceph.git commit
2731d3030d7a3e80922b7f1b7756f9a4a124bac5.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:25 +0000 (21:19 +0200)]
crush: clarify numrep vs endpos
Pass numrep (the width of the result) separately from the number of results
we want *this* iteration. This makes things less awkward when we do a
recursive call (for chooseleaf) and want only one item.
Reflects ceph.git commit
1b567ee08972f268c11b43fc881e57b5984dd08b.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:25 +0000 (21:19 +0200)]
crush: strip firstn conditionals out of crush_choose, rename
Now that indep is handled by crush_choose_indep, rename crush_choose to
crush_choose_firstn and remove all the conditionals. This ends up
stripping out *lots* of code.
Note that it *also* makes it obvious that the shenanigans we were playing
with r' for uniform buckets were broken for firstn mode. This appears to
have happened waaaay back in commit
dae8bec9 (or earlier)... 2007.
Reflects ceph.git commit
94350996cb2035850bcbece6a77a9b0394177ec9.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:25 +0000 (21:19 +0200)]
crush: add note about r in recursive choose
Reflects ceph.git commit
4551fee9ad89d0427ed865d766d0d44004d3e3e1.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:25 +0000 (21:19 +0200)]
crush: use breadth-first search for indep mode
Reflects ceph.git commit
86e978036a4ecbac4c875e7c00f6c5bbe37282d3.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:25 +0000 (21:19 +0200)]
crush: return CRUSH_ITEM_UNDEF for failed placements with indep
For firstn mode, if we fail to make a valid placement choice, we just
continue and return a short result to the caller. For indep mode, however,
we need to make the position stable, and return an undefined value on
failed placements to avoid shifting later results to the left.
Reflects ceph.git commit
b1d4dd4eb044875874a1d01c01c7d766db5d0a80.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:24 +0000 (21:19 +0200)]
crush: eliminate CRUSH_MAX_SET result size limitation
This is only present to size the temporary scratch arrays that we put on
the stack. Let the caller allocate them as they wish and remove the
limitation.
Reflects ceph.git commit
1cfe140bf2dab99517589a82a916f4c75b9492d1.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:24 +0000 (21:19 +0200)]
crush: fix some comments
Reflects ceph.git commit
3cef755428761f2481b1dd0e0fbd0464ac483fc5.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:24 +0000 (21:19 +0200)]
crush: reduce scope of some local variables
Reflects ceph.git commit
e7d47827f0333c96ad43d257607fb92ed4176550.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:24 +0000 (21:19 +0200)]
crush: factor out (trivial) crush_destroy_rule()
Reflects ceph.git commit
43a01c9973c4b83f2eaa98be87429941a227ddde.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:24 +0000 (21:19 +0200)]
crush: pass weight vector size to map function
Pass the size of the weight vector into crush_do_rule() to ensure that we
don't access values past the end. This can happen if the caller misbehaves
and passes a weight vector that is smaller than max_devices.
Currently the monitor tries to prevent that from happening, but this will
gracefully tolerate previous bad osdmaps that got into this state. It's
also a bit more defensive.
Reflects ceph.git commit
5922e2c2b8335b5e46c9504349c3a55b7434c01a.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:24 +0000 (21:19 +0200)]
libceph: update ceph_features.h
This updates ceph_features.h so that it has all feature bits defined in
ceph.git. In the interim since the last update, ceph.git crossed the
"32 feature bits" point, and, the addition of the 33rd bit wasn't
handled correctly. The work-around is squashed into this commit and
reflects ceph.git commit
053659d05e0349053ef703b414f44965f368b9f0.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Tue, 24 Dec 2013 19:19:23 +0000 (21:19 +0200)]
libceph: all features fields must be u64
In preparation for ceph_features.h update, change all features fields
from unsigned int/u32 to u64. (ceph.git has ~40 feature bits at this
point.)
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Mon, 16 Dec 2013 16:02:41 +0000 (18:02 +0200)]
rbd: tear down watch request if rbd_dev_device_setup() fails
Tear down watch request if rbd_dev_device_setup() fails.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Ilya Dryomov [Mon, 16 Dec 2013 16:02:40 +0000 (18:02 +0200)]
rbd: introduce rbd_dev_header_unwatch_sync() and switch to it
Rename rbd_dev_header_watch_sync() to __rbd_dev_header_watch_sync() and
introduce two helpers: rbd_dev_header_{,un}watch_sync() to make it more
clear what is going on.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Thu, 26 Dec 2013 14:37:43 +0000 (08:37 -0600)]
MAINTAINERS: update an e-mail address
I no longer have direct access to my Inktank e-mail. I still pay
attention to rbd, so update its entry in MAINTAINERS accordingly.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Mon, 16 Dec 2013 17:26:32 +0000 (19:26 +0200)]
rbd: enable extended devt in single-major mode
If single-major device number allocation scheme is turned on, instead
of reserving 256 minors per device, which imposes a limit of 4096
images mapped at once, reserve 16 minors per device and enable extended
devt feature. This results in a theoretical limit of 65536 images
mapped at once, and still allows to have more than 15 partititions:
partitions starting with 16th are mapped under major 259 (Block
Extended Major):
$ rbd showmapped
id pool image snap device
0 rbd b5 - /dev/rbd0 # no partitions
1 rbd b2 - /dev/rbd1 # 40 partitions
2 rbd b3 - /dev/rbd2 # 2 partitions
$ cat /proc/partitions
251 0 1024 rbd0
251 16 1024 rbd1
251 17 0 rbd1p1
251 18 0 rbd1p2
...
251 30 0 rbd1p14
251 31 0 rbd1p15
259 0 0 rbd1p16
259 1 0 rbd1p17
...
259 23 0 rbd1p39
259 24 0 rbd1p40
251 32 1024 rbd2
251 33 0 rbd2p1
251 34 0 rbd2p2
(major 251 was assigned dynamically at module load time)
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Li Wang [Thu, 19 Dec 2013 14:03:49 +0000 (06:03 -0800)]
ceph fscache: Uncaching no data page from fscache in readpage()
Currently, if one new page allocated into fscache in readpage(), however,
with no data read into due to error encountered during reading from OSDs,
the slot in fscache is not uncached. This patch fixes this.
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Reviewed-by: Milosz Tanski <milosz@adfin.com>
Li Wang [Thu, 19 Dec 2013 14:03:48 +0000 (06:03 -0800)]
ceph fscache: Introduce a routine for uncaching single no data page from fscache
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Reviewed-by: Milosz Tanski <milosz@adfin.com>
Guangliang Zhao [Mon, 11 Nov 2013 07:18:03 +0000 (15:18 +0800)]
ceph: add acl for cephfs
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Li Wang <li.wang@ubuntykylin.com>
Reviewed-by: Zheng Yan <zheng.z.yan@intel.com>
Yan, Zheng [Thu, 28 Nov 2013 06:28:14 +0000 (14:28 +0800)]
ceph: check caps in filemap_fault and page_mkwrite
Adds cap check to the page fault handler. The check prevents page
fault handler from adding new page to the page cache while Fcb caps
are being revoked. This solves Fc revoking hang in multiple clients
mmap IO workload.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Ilya Dryomov [Fri, 13 Dec 2013 13:28:57 +0000 (15:28 +0200)]
rbd: add support for single-major device number allocation scheme
Currently each rbd device is allocated its own major number, which
leads to a hard limit of 230-250 images mapped at once. This commit
adds support for a new single-major device number allocation scheme,
which is hidden behind a new single_major boolean module parameter and
is disabled by default for backwards compatibility reasons. (Old
userspace cannot correctly unmap images mapped under single-major
scheme and would essentially just unmap a random image, if that.)
$ rbd showmapped
id pool image snap device
0 rbd b100 - /dev/rbd0
1 rbd b101 - /dev/rbd1
2 rbd b102 - /dev/rbd2
3 rbd b103 - /dev/rbd3
Old scheme (modprobe rbd):
$ ls -l /dev/rbd*
brw-rw---- 1 root disk 253, 0 Dec 10 12:24 /dev/rbd0
brw-rw---- 1 root disk 252, 0 Dec 10 12:28 /dev/rbd1
brw-rw---- 1 root disk 252, 1 Dec 10 12:28 /dev/rbd1p1
brw-rw---- 1 root disk 252, 2 Dec 10 12:28 /dev/rbd1p2
brw-rw---- 1 root disk 252, 3 Dec 10 12:28 /dev/rbd1p3
brw-rw---- 1 root disk 251, 0 Dec 10 12:28 /dev/rbd2
brw-rw---- 1 root disk 251, 1 Dec 10 12:28 /dev/rbd2p1
brw-rw---- 1 root disk 250, 0 Dec 10 12:24 /dev/rbd3
New scheme (modprobe rbd single_major=Y):
$ ls -l /dev/rbd*
brw-rw---- 1 root disk 253, 0 Dec 10 12:30 /dev/rbd0
brw-rw---- 1 root disk 253, 256 Dec 10 12:30 /dev/rbd1
brw-rw---- 1 root disk 253, 257 Dec 10 12:30 /dev/rbd1p1
brw-rw---- 1 root disk 253, 258 Dec 10 12:30 /dev/rbd1p2
brw-rw---- 1 root disk 253, 259 Dec 10 12:30 /dev/rbd1p3
brw-rw---- 1 root disk 253, 512 Dec 10 12:30 /dev/rbd2
brw-rw---- 1 root disk 253, 513 Dec 10 12:30 /dev/rbd2p1
brw-rw---- 1 root disk 253, 768 Dec 10 12:30 /dev/rbd3
(major 253 was assigned dynamically at module load time)
The new limit is 4096 images mapped at once, and it comes from the fact
that, as before, 256 minor numbers are reserved for each mapping.
(A follow-up commit changes the number of minors reserved and the way
we deal with partitions over that number.)
If single_major is set to true, two new sysfs interfaces show up:
/sys/bus/rbd/{add,remove}_single_major. These are to be used instead
of /sys/bus/rbd/{add,remove}, which are disabled for backwards
compatibility reasons outlined above.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Ilya Dryomov [Fri, 13 Dec 2013 13:28:57 +0000 (15:28 +0200)]
rbd: wire up is_visible() sysfs callback for rbd bus
In preparation for single-major device number allocation scheme, wire
up attribute_group::is_visible() callback for rbd bus. This allows us
to make the new single-major attributes conditional.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Ilya Dryomov [Fri, 13 Dec 2013 13:28:57 +0000 (15:28 +0200)]
rbd: add 'minor' sysfs rbd device attribute
Introduce /sys/bus/rbd/devices/<id>/minor sysfs attribute for exporting
rbd whole disk minor numbers. This is a step towards single-major
device number allocation scheme, but also a good thing on its own.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Ilya Dryomov [Fri, 13 Dec 2013 13:28:57 +0000 (15:28 +0200)]
rbd: switch to ida for rbd id assignments
Currently rbd ids are allocated using an atomic variable that keeps
track of the highest id currently in use and each new id is simply one
more than the value of that variable. That's nice and cheap, but it
does mean that rbd ids are allowed to grow boundlessly, and, more
importantly, it's completely unpredictable. So, in preparation for
single-major device number allocation scheme, which is going to
establish and rely on a constant mapping between rbd ids and device
numbers, switch to ida for rbd id assignments.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Ilya Dryomov [Fri, 13 Dec 2013 13:28:57 +0000 (15:28 +0200)]
rbd: refactor rbd_init() a bit
Refactor rbd_init() a bit to make it more clear what's going on.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Ilya Dryomov [Fri, 13 Dec 2013 13:28:56 +0000 (15:28 +0200)]
rbd: tweak "loaded" message and module description
Tweak "loaded" message, so that it looks like
[ 30.184235] rbd: loaded
instead of
[ 38.056564] rbd: loaded rbd (rados block device)
Also move (and slightly tweak) MODULE_DESCRIPTION so that all authors
are next to each other in modinfo output.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Ilya Dryomov [Fri, 13 Dec 2013 13:28:56 +0000 (15:28 +0200)]
rbd: rbd_device::dev_id is an int, format it as such
rbd_device::dev_id is an int, format it as such.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Josh Durgin [Tue, 10 Dec 2013 17:35:13 +0000 (09:35 -0800)]
libceph: resend all writes after the osdmap loses the full flag
With the current full handling, there is a race between osds and
clients getting the first map marked full. If the osd wins, it will
return -ENOSPC to any writes, but the client may already have writes
in flight. This results in the client getting the error and
propagating it up the stack. For rbd, the block layer turns this into
EIO, which can cause corruption in filesystems above it.
To avoid this race, osds are being changed to drop writes that came
from clients with an osdmap older than the last osdmap marked full.
In order for this to work, clients must resend all writes after they
encounter a full -> not full transition in the osdmap. osds will wait
for an updated map instead of processing a request from a client with
a newer map, so resent writes will not be dropped by the osd unless
there is another not full -> full transition.
This approach requires both osds and clients to be fixed to avoid the
race. Old clients talking to osds with this fix may hang instead of
returning EIO and potentially corrupting an fs. New clients talking to
old osds have the same behavior as before if they encounter this race.
Fixes: http://tracker.ceph.com/issues/6938
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Josh Durgin [Tue, 3 Dec 2013 03:11:48 +0000 (19:11 -0800)]
libceph: block I/O when PAUSE or FULL osd map flags are set
The PAUSEWR and PAUSERD flags are meant to stop the cluster from
processing writes and reads, respectively. The FULL flag is set when
the cluster determines that it is out of space, and will no longer
process writes. PAUSEWR and PAUSERD are purely client-side settings
already implemented in userspace clients. The osd does nothing special
with these flags.
When the FULL flag is set, however, the osd responds to all writes
with -ENOSPC. For cephfs, this makes sense, but for rbd the block
layer translates this into EIO. If a cluster goes from full to
non-full quickly, a filesystem on top of rbd will not behave well,
since some writes succeed while others get EIO.
Fix this by blocking any writes when the FULL flag is set in the osd
client. This is the same strategy used by userspace, so apply it by
default. A follow-on patch makes this configurable.
__map_request() is called to re-target osd requests in case the
available osds changed. Add a paused field to a ceph_osd_request, and
set it whenever an appropriate osd map flag is set. Avoid queueing
paused requests in __map_request(), but force them to be resent if
they become unpaused.
Also subscribe to the next osd map from the monitor if any of these
flags are set, so paused requests can be unblocked as soon as
possible.
Fixes: http://tracker.ceph.com/issues/6079
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Libo Chen [Wed, 11 Dec 2013 05:49:11 +0000 (13:49 +0800)]
fs: ceph: new helper: file_inode(file)
Signed-off-by: Libo Chen <clbchenlibo.chen@huawei.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Li Wang [Wed, 27 Nov 2013 14:28:14 +0000 (22:28 +0800)]
ceph: Add necessary clean up if invalid reply received in handle_reply()
Wake up possible waiters, invoke the call back if any, unregister the request
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Li Wang [Wed, 27 Nov 2013 14:28:13 +0000 (22:28 +0800)]
ceph: Clean up if error occurred in finish_read()
Clean up if error occurred rather than going through normal process
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
majianpeng [Thu, 26 Sep 2013 06:42:17 +0000 (14:42 +0800)]
ceph: implement readv/preadv for sync operation
For readv/preadv sync-operatoin, ceph only do the first iov.
Now implement this.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
majianpeng [Thu, 12 Sep 2013 05:54:26 +0000 (13:54 +0800)]
ceph: Implement writev/pwritev for sync operation.
For writev/pwritev sync-operatoin, ceph only do the first iov.
I divided the write-sync-operation into two functions. One for
direct-write, other for none-direct-sync-write. This is because for
none-direct-sync-write we can merge iovs to one. But for direct-write,
we can't merge iovs.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Yan, Zheng [Fri, 20 Sep 2013 11:55:31 +0000 (19:55 +0800)]
ceph: drop unconnected inodes
Positve dentry and corresponding inode are always accompanied in MDS reply.
So no need to keep inode in the cache after dropping all its aliases.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Li Wang [Wed, 13 Nov 2013 07:22:14 +0000 (15:22 +0800)]
ceph: Avoid data inconsistency due to d-cache aliasing in readpage()
If the length of data to be read in readpage() is exactly
PAGE_CACHE_SIZE, the original code does not flush d-cache
for data consistency after finishing reading. This patches fixes
this.
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Yan, Zheng [Thu, 5 Dec 2013 04:38:59 +0000 (12:38 +0800)]
ceph: initialize inode before instantiating dentry
commit
b18825a7c8 (Put a small type field into struct dentry::d_flags)
put a type field into struct dentry::d_flags. __d_instantiate() set the
field by checking inode->i_mode. So we should initialize inode before
instantiating dentry when handling mds reply.
Fixes: http://tracker.ceph.com/issues/6930
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Linus Torvalds [Fri, 6 Dec 2013 17:34:04 +0000 (09:34 -0800)]
Linux 3.13-rc3
Linus Torvalds [Fri, 6 Dec 2013 16:34:16 +0000 (08:34 -0800)]
Merge tag 'trace-fixes-3.13-rc2' of git://git./linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"A regression showed up that there's a large delay when enabling all
events. This was prevalent when FTRACE_SELFTEST was enabled which
enables all events several times, and caused the system bootup to
pause for over a minute.
This was tracked down to an addition of a synchronize_sched()
performed when system call tracepoints are unregistered.
The synchronize_sched() is needed between the unregistering of the
system call tracepoint and a deletion of a tracing instance buffer.
But placing the synchronize_sched() in the unreg of *every* system
call tracepoint is a bit overboard. A single synchronize_sched()
before the deletion of the instance is sufficient"
* tag 'trace-fixes-3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Only run synchronize_sched() at instance deletion time
Linus Torvalds [Fri, 6 Dec 2013 16:32:59 +0000 (08:32 -0800)]
Merge git://git.kvack.org/~bcrl/aio-next
Pull aio fix from Benjamin LaHaise:
"AIO fix from Gu Zheng that fixes a GPF that Dave Jones uncovered with
trinity"
* git://git.kvack.org/~bcrl/aio-next:
aio: clean up aio ring in the fail path
Linus Torvalds [Fri, 6 Dec 2013 16:30:18 +0000 (08:30 -0800)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"This is a set of nine fixes (and one author update).
The libsas one should fix discovery in eSATA devices, the WRITE_SAME
one is the largest, but it should fix a lot of problems we've been
getting with the emulated RAID devices (they've been effectively lying
about support and then firmware has been choking on the commands).
The rest are various crash, hang or warn driver fixes"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
[SCSI] bfa: Fix crash when symb name set for offline vport
[SCSI] enclosure: fix WARN_ON in dual path device removing
[SCSI] pm80xx: Tasklets synchronization fix.
[SCSI] pm80xx: Resetting the phy state.
[SCSI] pm80xx: Fix for direct attached device.
[SCSI] pm80xx: Module author addition
[SCSI] hpsa: return 0 from driver probe function on success, not 1
[SCSI] hpsa: do not discard scsi status on aborted commands
[SCSI] Disable WRITE SAME for RAID and virtual host adapter drivers
[SCSI] libsas: fix usage of ata_tf_to_fis
Linus Torvalds [Fri, 6 Dec 2013 16:28:35 +0000 (08:28 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jmorris/linux-security
Pull IMA fixes from James Morris:
"Here are two more fixes for IMA"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
ima: properly free ima_template_entry structures
ima: Do not free 'entry' before it is initialized
Linus Torvalds [Fri, 6 Dec 2013 16:27:47 +0000 (08:27 -0800)]
Merge tag 'dt-fixes-for-3.13' of git://git./linux/kernel/git/robh/linux
Pull devicetree fixes from Rob Herring:
- Various DT binding documentation updates
- Add Kumar Gala and remove Stephen Warren as DT binding maintainers
* tag 'dt-fixes-for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
dt: binding: reword PowerPC 8xxx GPIO documentation
ARM: tegra: delete nvidia,tegra20-spi.txt binding
hwmon: ntc_thermistor: Fix typo (pullup-uV -> pullup-uv)
of: add vendor prefix for GMT
clk: exynos: Fix typos in DT bindings documentation
of: Add vendor prefix for LG Corporation
Documentation: net: fsl-fec.txt: Add phy-supply entry
ARM: dts: doc: Document missing binding for omap5-mpu
dt-bindings: add ARMv8 PMU binding
MAINTAINERS: remove swarren from DT bindings
MAINTAINERS: Add Kumar to Device Tree Binding maintainers group
Gu Zheng [Wed, 4 Dec 2013 10:19:06 +0000 (18:19 +0800)]
aio: clean up aio ring in the fail path
Clean up the aio ring file in the fail path of aio_setup_ring
and ioctx_alloc. And maybe it can fix the GPF issue reported by
Dave Jones:
https://lkml.org/lkml/2013/11/25/898
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
James Morris [Fri, 6 Dec 2013 14:21:02 +0000 (01:21 +1100)]
Merge branch 'free-memory' of git://git./linux/kernel/git/zohar/linux-integrity into for-linus
Linus Torvalds [Fri, 6 Dec 2013 02:26:40 +0000 (18:26 -0800)]
Merge tag 'pm-3.13-rc3' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
- cpufreq regression fix from Bjørn Mork restoring the pre-3.12
behavior of the framework during system suspend/hibernation to avoid
garbage sysfs files from being left behind in case of a suspend error
- PNP regression fix to restore the correct states of devices after
resume from hibernation broken in 3.12. From Dmitry Torokhov.
- cpuidle fix to prevent cpuidle device unregistration from crashing
due to a NULL pointer dereference if cpuidle has been disabled from
the kernel command line. From Konrad Rzeszutek Wilk.
- intel_idle fix for the C6 state definition on Intel Avoton/Rangeley
processors from Arne Bockholdt.
- Power capping framework fix to make the energy_uj sysfs attribute
work in accordance with the documentation. From Srinivas Pandruvada.
- epoll fix to make it ignore the EPOLLWAKEUP flag if the kernel has
been compiled with CONFIG_PM_SLEEP unset (in which case that flag
should not have any effect). From Amit Pundir.
- cpufreq fix to prevent governor sysfs files from being lost over
system suspend/resume in some (arguably unusual) situations. From
Viresh Kumar.
* tag 'pm-3.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PowerCap: Fix mode for energy counter
PNP: fix restoring devices after hibernation
cpuidle: Check for dev before deregistering it.
epoll: drop EPOLLWAKEUP if PM_SLEEP is disabled
cpufreq: fix garbage kobjects on errors during suspend/resume
cpufreq: suspend governors on system suspend/hibernate
intel_idle: Fixed C6 state on Avoton/Rangeley processors
Rafael J. Wysocki [Fri, 6 Dec 2013 01:18:28 +0000 (02:18 +0100)]
Merge branches 'pm-epoll', 'pnp' and 'powercap'
* pm-epoll:
epoll: drop EPOLLWAKEUP if PM_SLEEP is disabled
* pnp:
PNP: fix restoring devices after hibernation
* powercap:
PowerCap: Fix mode for energy counter
Rafael J. Wysocki [Fri, 6 Dec 2013 01:17:59 +0000 (02:17 +0100)]
Merge branches 'pm-cpuidle' and 'pm-cpufreq'
* pm-cpuidle:
cpuidle: Check for dev before deregistering it.
intel_idle: Fixed C6 state on Avoton/Rangeley processors
* pm-cpufreq:
cpufreq: fix garbage kobjects on errors during suspend/resume
cpufreq: suspend governors on system suspend/hibernate
Linus Torvalds [Thu, 5 Dec 2013 23:37:44 +0000 (15:37 -0800)]
Merge branch 'stable' of git://git./linux/kernel/git/cmetcalf/linux-tile
Pull arch/tile ftrace bug fix from Chris Metcalf:
"This fixes a build failure with allyesconfig reported by Fengguang Wu
and fixed by Tony Lu"
* 'stable' of git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
ftrace: default to tilegx if ARCH=tile is specified
Linus Torvalds [Thu, 5 Dec 2013 23:33:27 +0000 (15:33 -0800)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
"A small collection of fixes for the current series. It contains:
- A fix for a use-after-free of a request in blk-mq. From Ming Lei
- A fix for a blk-mq bug that could attempt to dereference a NULL rq
if allocation failed
- Two xen-blkfront small fixes
- Cleanup of submit_bio_wait() type uses in the kernel, unifying
that. From Kent
- A fix for 32-bit blkg_rwstat reading. I apologize for this one
looking mangled in the shortlog, it's entirely my fault for missing
an empty line between the description and body of the text"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: fix use-after-free of request
blk-mq: fix dereference of rq->mq_ctx if allocation fails
block: xen-blkfront: Fix possible NULL ptr dereference
xen-blkfront: Silence pfn maybe-uninitialized warning
block: submit_bio_wait() conversions
Update of blkg_stat and blkg_rwstat may happen in bh context
Linus Torvalds [Thu, 5 Dec 2013 21:05:48 +0000 (13:05 -0800)]
Merge tag 'nfs-for-3.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- Stable fix for a NFSv4.1 delegation and state recovery deadlock
- Stable fix for a loop on irrecoverable errors when returning
delegations
- Fix a 3-way deadlock between layoutreturn, open, and state recovery
- Update the MAINTAINERS file with contact information for Trond
Myklebust
- Close needs to handle NFS4ERR_ADMIN_REVOKED
- Enabling v4.2 should not recompile nfsd and lockd
- Fix a couple of compile warnings
* tag 'nfs-for-3.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
nfs: fix do_div() warning by instead using sector_div()
MAINTAINERS: Update contact information for Trond Myklebust
NFSv4.1: Prevent a 3-way deadlock between layoutreturn, open and state recovery
SUNRPC: do not fail gss proc NULL calls with EACCES
NFSv4: close needs to handle NFS4ERR_ADMIN_REVOKED
NFSv4: Update list of irrecoverable errors on DELEGRETURN
NFSv4 wait on recovery for async session errors
NFS: Fix a warning in nfs_setsecurity
NFS: Enabling v4.2 should not recompile nfsd and lockd
Tony Lu [Thu, 5 Dec 2013 20:36:54 +0000 (15:36 -0500)]
ftrace: default to tilegx if ARCH=tile is specified
This matches the existing behavior in arch/tile/Makefile for defconfig.
Reported-by: fengguang.wu@intel.com
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Tony Lu <zlu@tilera.com>
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Steven Rostedt [Tue, 3 Dec 2013 17:41:20 +0000 (12:41 -0500)]
tracing: Only run synchronize_sched() at instance deletion time
It has been reported that boot up with FTRACE_SELFTEST enabled can take a
very long time. There can be stalls of over a minute.
This was tracked down to the synchronize_sched() called when a system call
event is disabled. As the self tests enable and disable thousands of events,
this makes the synchronize_sched() get called thousands of times.
The synchornize_sched() was added with
d562aff93bfb53 "tracing: Add support
for SOFT_DISABLE to syscall events" which caused this regression (added
in 3.13-rc1).
The synchronize_sched() is to protect against the events being accessed
when a tracer instance is being deleted. When an instance is being deleted
all the events associated to it are unregistered. The synchronize_sched()
makes sure that no more users are running when it finishes.
Instead of calling synchronize_sched() for all syscall events, we only
need to call it once, after the events are unregistered and before the
instance is deleted. The event_mutex is held during this action to
prevent new users from enabling events.
Link: http://lkml.kernel.org/r/20131203124120.427b9661@gandalf.local.home
Reported-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Acked-by: Petr Mladek <pmladek@suse.cz>
Tested-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Linus Torvalds [Thu, 5 Dec 2013 18:48:40 +0000 (10:48 -0800)]
Merge tag 'for-linus' of git://git./linux/kernel/git/mason/linux-btrfs
Pull btrfs MAINTAINERS file update:
"I'm still getting settled into new devel hardware etc, but I do have
one commit for the next rc.
This changes my email over to fb.com, and adds a MAINTAINERS entry for
Josef as well"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: update the MAINTAINERS file
Linus Torvalds [Thu, 5 Dec 2013 17:55:20 +0000 (09:55 -0800)]
Merge tag 'fbdev-fixes-3.13' of git://git./linux/kernel/git/tomba/linux
Pull minor fbdev fixes from Tomi Valkeinen.
* tag 'fbdev-fixes-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux:
video: vt8500: fix error handling in probe()
atmel_lcdfb: fix module autoload
fbdev: sh_mobile_meram: Fix defined but not used compiler warnings
video: kyro: fix incorrect sizes when copying to userspace
ARM: OMAPFB: panel-sony-acx565akm: fix bad unlock balance
Linus Torvalds [Thu, 5 Dec 2013 17:54:35 +0000 (09:54 -0800)]
Merge tag 'sound-3.13-rc3' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A usual pattern of half ASoC and half HD-audio fixes, although
HD-audio fixups have more volumes, in addition to a couple of trivial
fixes. Nothing to worry much is found here.
For ASoC side: a few fixes for PCM rate constraints calculations,
regmap byte-order fix, the rest driver specific fixes (atmel, fsl,
omap, kirkwood, wm codecs).
For HD-audio: Dell headset and mono out fix, ELD update in polling
mode, ALC283 Chromebook fixes, a few fixes for old AD codecs and
MBA2, one regression fix"
* tag 'sound-3.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (30 commits)
ALSA: hda - Fix silent output on MacBook Air 2,1
ALSA: hda - Fix missing ELD info when using jackpoll_ms parameter
ALSA: hda/realtek - remove hp_automute_hook from alc283_fixup_chromebook
ASoC: wm8731: fix dsp mode configuration
ALSA: hda/realtek - Independent of model for HP
ALSA: hda - Fix headset mic input after muted internal mic (Dell/Realtek)
ALSA: hda - Use always amps for auto-mute on
AD1986A codec
ALSA: hda/analog - Handle inverted EAPD properly in vmaster hook
ALSA: hda - Another fixup for ASUS laptop with ALC660 codec
ALSA: atmel: Fix possible array overflow
ALSA: hda - Fix complete_all() timing in deferred probes
ALSA: hda - Fix bad EAPD setup for HP machines with
AD1984A
ASoC: core: fix devres parameter in devm_snd_soc_register_card()
ASoC: omap: n810: Convert to clk_prepare_enable/clk_disable_unprepare
ASoC: fsl: set correct platform drvdata in pcm030_fabric_probe()
ASoC: fsl: imx-pcm-fiq: Remove unused 'runtime' variable
ASoC: fsl: imx-pcm-fiq: remove bogus period delta calculation
ALSA: hda - Fix silent output on ASUS W7J laptop
ASoC: core: Use consistent byte ordering in snd_soc_bytes_get
ALSA: dice: fix array limits in dice_proc_read()
...
Linus Torvalds [Thu, 5 Dec 2013 17:53:59 +0000 (09:53 -0800)]
Merge tag 'pinctrl-v3.13-2' of git://git./linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
- Minor bug fixes for the Rockchip, ST-Ericsson abx500, Renesas PFC
r8a7740 and sh7372.
- Compilation warning fixes.
* tag 'pinctrl-v3.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
sh-pfc: sh7372: Fix pin bias setup
sh-pfc: r8a7740: Fix pin bias setup
pinctrl: abx500: Fix header file include guard
pinctrl: rockchip: missing unlock on error in rockchip_set_pull()
pinctrl: abx500: fix some more bitwise AND tests
pinctrl: rockchip: testing the wrong variable
Ming Lei [Thu, 5 Dec 2013 17:50:39 +0000 (10:50 -0700)]
blk-mq: fix use-after-free of request
If accounting is on, we will do the IO completion accounting after
we have freed the request. Fix that by moving it sooner instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Thu, 5 Dec 2013 05:45:21 +0000 (21:45 -0800)]
Merge branch 'x86/urgent' of git://git./linux/kernel/git/tip/tip
Pull x86 and EFI fixes from Peter Anvin:
"Half of these are EFI-related:
The by far biggest change is the change to hold off the deletion of a
sysfs entry while a backend scan is in progress. This is to avoid
calling kmemdup() while under a spinlock.
The other major change is for each entry in the EFI pstore backend to
get a unique identifier, as required by the pstore filesystem proper.
The other changes are:
A fix to the recent consolidation and optimization of using "asm goto"
with read-modify-write operation, which broke the bitops; specifically
in such a way that we could end up generating invalid code.
A build hack to make sure we compile with -mno-sse. icc, and most
likely future versions of gcc, can generate SSE instructions unless we
tell it not to.
A comment-only patch to a change the was due in part to an unpublished
erratum; now when the erratum is published we want to add a comment
explaining why"
* 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic, doc: Justification for disabling IO APIC before Local APIC
x86, bitops: Correct the assembly constraints to testing bitops
x86-64, build: Always pass in -mno-sse
efi-pstore: Make efi-pstore return a unique id
x86/efi: Fix earlyprintk off-by-one bug
efivars, efi-pstore: Hold off deletion of sysfs entry until the scan is completed
Fenghua Yu [Thu, 5 Dec 2013 00:07:49 +0000 (16:07 -0800)]
x86/apic, doc: Justification for disabling IO APIC before Local APIC
Since erratum AVR31 in "Intel Atom Processor C2000 Product Family
Specification Update" is now published, I added a justification
comment for disabling IO APIC before Local APIC, as changed in commit:
522e66464467 x86/apic: Disable I/O APIC before shutdown of the local APIC
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1386202069-51515-1-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Srinivas Pandruvada [Wed, 4 Dec 2013 19:12:59 +0000 (11:12 -0800)]
PowerCap: Fix mode for energy counter
As per the documentation of powercap sysfs, energy_uj field is read only,
if it can't be reset. Currently it always allows write but will fail,
if there is no reset callback.
Changing mode field, to read only if there is no reset callback.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Reported-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Dmitry Torokhov [Thu, 5 Dec 2013 01:01:55 +0000 (02:01 +0100)]
PNP: fix restoring devices after hibernation
On returning from hibernation 'restore' callback is called,
not 'resume'. Fix it.
Fixes: eaf140b60ec9 (PNP: convert PNP driver bus legacy pm_ops to dev_pm_ops)
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: 3.12+ <stable@vger.kernel.org> # 3.12+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
H. Peter Anvin [Wed, 4 Dec 2013 22:31:28 +0000 (14:31 -0800)]
x86, bitops: Correct the assembly constraints to testing bitops
In checkin:
0c44c2d0f459 x86: Use asm goto to implement better modify_and_test() functions
the various functions which do modify and test were unified and
optimized using "asm goto". However, this change missed the detail
that the bitops require an "Ir" constraint rather than an "er"
constraint ("I" = integer constant from 0-31, "e" = signed 32-bit
integer constant). This would cause code to miscompile if these
functions were used on constant bit positions 32-255 and the build to
fail if used on constant bit positions above 255.
Add the constraints as a parameter to the GEN_BINARY_RMWcc() macro to
avoid this problem.
Reported-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/529E8719.4070202@zytor.com