review.tizen.org Git - platform/kernel/linux-starfive.git/log

libceph: support CEPH_FEATURE_OSD_CACHEPOOL feature

Announce our (limited, see previous commit) support for CACHEPOOL
feature.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: follow redirect replies from osds

Follow redirect replies from osds, for details see ceph.git commit
fbbe3ad1220799b7bb00ea30fce581c5eadaf034.

v1 (current) version of redirect reply consists of oloc and oid, which
expands to pool, key, nspace, hash and oid. However, server-side code
that would populate anything other than pool doesn't exist yet, and
hence this commit adds support for pool redirects only. To make sure
that future server-side updates don't break us, we decode all fields
and, if any of key, nspace, hash or oid have a non-default value, error
out with "corrupt osd_op_reply ..." message.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid}

Rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid} before
introducing r_target_{oloc,oid} needed for redirects.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: follow {read,write}_tier fields on osd request submission

Overwrite ceph_osd_request::r_oloc.pool with read_tier for read ops and
write_tier for write and read+write ops (aka basic tiering support).
{read,write}_tier are part of pg_pool_t since v9. This commit bumps
our pg_pool_t decode compat version from v7 to v9, all new fields
except for {read,write}_tier are ignored.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: add ceph_pg_pool_by_id()

"Lookup pool info by ID" function is hidden in osdmap.c. Expose it to
the rest of libceph.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: CEPH_OSD_FLAG_* enum update

Update CEPH_OSD_FLAG_* enum. (We need CEPH_OSD_FLAG_IGNORE_OVERLAY to
support tiering).

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: replace ceph_calc_ceph_pg() with ceph_oloc_oid_to_pg()

Switch ceph_calc_ceph_pg() to new oloc and oid abstractions and rename
it to ceph_oloc_oid_to_pg() to make its purpose more clear.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: introduce and start using oid abstraction

In preparation for tiering support, which would require having two
(base and target) object names for each osd request and also copying
those names around, introduce struct ceph_object_id (oid) and a couple
helpers to facilitate those copies and encapsulate the fact that object
name is not necessarily a NUL-terminated string.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LEN

In preparation for adding oid abstraction, rename MAX_OBJ_NAME_SIZE to
CEPH_MAX_OID_NAME_LEN.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: move ceph_file_layout helpers to ceph_fs.h

Move ceph_file_layout helper macros and inline functions to ceph_fs.h.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: start using oloc abstraction

Instead of relying on pool fields in ceph_file_layout (for mapping) and
ceph_pg (for enconding), start using ceph_object_locator (oloc)
abstraction. Note that userspace oloc currently consists of pool, key,
nspace and hash fields, while this one contains only a pool. This is
OK, because at this point we only send (i.e. encode) olocs and never
have to receive (i.e. decode) them.

This makes keeping a copy of ceph_file_layout in every osd request
unnecessary, so ceph_osd_request::r_file_layout field is nuked.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: dout() is missing a newline

Add a missing newline to a dout() in __reset_osd().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>

libceph: add ceph_kv{malloc,free}() and switch to them

Encapsulate kmalloc vs vmalloc memory allocation and freeing logic into
two helpers, ceph_kvmalloc() and ceph_kvfree(), and switch to them.

ceph_kvmalloc() kmalloc()'s a maximum of 8 pages, anything bigger is
vmalloc()'ed with __GFP_HIGHMEM set.  This changes the existing
behaviour:

- for buffers (ceph_buffer_new()), from trying to kmalloc() everything
  and using vmalloc() just as a fallback

- for messages (ceph_msg_new()), from going to vmalloc() for anything
  bigger than a page

- for messages (ceph_msg_new()), from disallowing vmalloc() to use high
  memory

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: support CEPH_FEATURE_EXPORT_PEER

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: add imported caps when handling cap export message

Version 3 cap export message includes information about the imported
caps. It allows us to add the imported caps if the corresponding cap
import message still hasn't been received.

This allow us to handle situation that the importer MDS crashes and
the cap import message is missing.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: add open export target session helper

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: remove exported caps when handling cap import message

Version 3 cap import message includes the ID of the exported
caps. It allow us to remove the exported caps if we still haven't
received the corresponding cap export message.

We remove the exported caps because they are stale, keeping them
can compromise consistence.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: handle session flush message

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: check inode caps in ceph_d_revalidate

Some inodes in readdir reply may have no caps. Getattr mds request
for these inodes can return -ESTALE. The fix is consider dentry that
links to inode with no caps as invalid. Invalid dentry causes a
lookup request to send to the mds, the MDS will send caps back.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: handle -ESTALE reply

Send requests that operate on path to directory's auth MDS if
mode == USE_AUTH_MDS. Always retry using the auth MDS if got
-ESTALE reply from non-auth MDS. Also clean up the code that
handles auth MDS change.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: fix trim caps

- don't trim auth cap if there are flusing caps
- don't trim auth cap if any 'write' cap is wanted
- allow trimming non-auth cap even if the inode is dirty

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: fix cache revoke race

handle following sequence of events:

- non-auth MDS revokes Fc cap. queue invalidate work
- auth MDS issues Fc cap through request reply. i_rdcache_gen gets
increased.
- invalidate work runs. it finds i_rdcache_revoking != i_rdcache_gen,
so it does nothing.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: use ceph_seq_cmp() to compare migrate_seq

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: handle cap export race in try_flush_caps()

auth cap may change after releasing the i_ceph_lock

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: trivial comment fix

"disconnected" is too easily confused with "DCACHE_DISCONNECTED". I
think "unhashed" is the more precise term here.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: fix preallocation check in get_reply()

The check that makes sure that we have enough memory allocated to read
in the entire header of the message in question is currently busted.
It compares front_len of the incoming message with iov_len field of
ceph_msg::front structure, which is used primarily to indicate the
amount of data already read in, and not the size of the allocated
buffer. Under certain conditions (e.g. a short read from a socket
followed by that socket's shutdown and owning ceph_connection reset)
this results in a warning similar to

[85688.975866] libceph: get_reply front 198 > preallocated 122 (4#0)

and, through another bug, leads to forever hung tasks and forced
reboots. Fix this by comparing front_len with front_alloc_len field of
struct ceph_msg, which stores the actual size of the buffer.

Fixes: http://tracker.ceph.com/issues/5425

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: rename front to front_len in get_reply()

Rename front local variable to front_len in get_reply() to make its
purpose more clear.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: rename ceph_msg::front_max to front_alloc_len

Rename front_max field of struct ceph_msg to front_alloc_len to make
its purpose more clear.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: use CEPH_MON_PORT when the specified port is 0

Similar to userspace, don't bail with "parse_ips bad ip ..." if the
specified port is port 0, instead use port CEPH_MON_PORT (6789, the
default monitor port).

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: support new indep mode and SET_* steps (crush v2) by default

Add CRUSH_V2 feature (new indep mode and SET_* steps) to a set of
features supported by default.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: fix crush_choose_firstn comment

Reflects ceph.git commit 8b38f10bc2ee3643a33ea5f9545ad5c00e4ac5b4.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: attempts -> tries

Reflects ceph.git commit ea3a0bb8b773360d73b8b77fa32115ef091c9857.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: add set_choose_local_[fallback_]tries steps

This allows all of the tunables to be overridden by a specific rule.

Reflects ceph.git commits d129e09e57fbc61cfd4f492e3ee77d0750c9d292,
0497db49e5973b50df26251ed0e3f4ac7578e66e.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: generalize descend_once

The legacy behavior is to make the normal number of tries for the
recursive chooseleaf call.  The descend_once tunable changed this to
making a single try and bail if we get a reject (note that it is
impossible to collide in the recursive case).

The new set_chooseleaf_tries lets you select the number of recursive
chooseleaf attempts for indep mode, or default to 1.  Use the same
behavior for firstn, except default to total_tries when the legacy
tunables are set (for compatibility).  This makes the rule step
override the (new) default of 1 recursive attempt, keeping behavior
consistent with indep mode.

Reflects ceph.git commit 685c6950ef3df325ef04ce7c986e36ca2514c5f1.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: CHOOSE_LEAF -> CHOOSELEAF throughout

This aligns the internal identifier names with the user-visible names in
the decompiled crush map language.

Reflects ceph.git commit caa0e22e15e4226c3671318ba1f61314bf6da2a6.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: add SET_CHOOSE_TRIES rule step

Since we can specify the recursive retries in a rule, we may as well also
specify the non-recursive tries too for completeness.

Reflects ceph.git commit d1b97462cffccc871914859eaee562f2786abfd1.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: apply chooseleaf_tries to firstn mode too

Parameterize the attempts for the _firstn choose method, and apply the
rule-specified tries count to firstn mode as well.  Note that we have
slightly different behavior here than with indep:

If the firstn value is not specified for firstn, we pass through the
normal attempt count.  This maintains compatibility with legacy behavior.
Note that this is usually *not* actually N^2 work, though, because of the
descend_once tunable.  However, descend_once is unfortunately *not* the
same thing as 1 chooseleaf try because it is only checked on a reject but
not on a collision.  Sigh.

In contrast, for indep, if tries is not specified we default to 1
recursive attempt, because that is simply more sane, and we have the
option to do so.  The descend_once tunable has no effect for indep.

Reflects ceph.git commit 64aeded50d80942d66a5ec7b604ff2fcbf5d7b63.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: new SET_CHOOSE_LEAF_TRIES command

Explicitly control the number of sample attempts, and allow the number of
tries in the recursive call to be explicitly controlled via the rule. This
is important because the amount of time we want to spend looking for a
solution may be rule dependent (e.g., higher for the wide indep pool than
the rep pools).

(We should do the same for the other tunables, by the way!)

Reflects ceph.git commit c43c893be872f709c787bc57f46c0e97876ff681.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: pass parent r value for indep call

Pass down the parent's 'r' value so that we will sample different values in
the recursive call when the parent tries multiple times. This avoids doing
useless work (calling multiple times and trying the same values).

Reflects ceph.git commit 2731d3030d7a3e80922b7f1b7756f9a4a124bac5.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: clarify numrep vs endpos

Pass numrep (the width of the result) separately from the number of results
we want *this* iteration. This makes things less awkward when we do a
recursive call (for chooseleaf) and want only one item.

Reflects ceph.git commit 1b567ee08972f268c11b43fc881e57b5984dd08b.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: strip firstn conditionals out of crush_choose, rename

Now that indep is handled by crush_choose_indep, rename crush_choose to
crush_choose_firstn and remove all the conditionals. This ends up
stripping out *lots* of code.

Note that it *also* makes it obvious that the shenanigans we were playing
with r' for uniform buckets were broken for firstn mode. This appears to
have happened waaaay back in commit dae8bec9 (or earlier)... 2007.

Reflects ceph.git commit 94350996cb2035850bcbece6a77a9b0394177ec9.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: add note about r in recursive choose

Reflects ceph.git commit 4551fee9ad89d0427ed865d766d0d44004d3e3e1.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: use breadth-first search for indep mode

Reflects ceph.git commit 86e978036a4ecbac4c875e7c00f6c5bbe37282d3.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: return CRUSH_ITEM_UNDEF for failed placements with indep

For firstn mode, if we fail to make a valid placement choice, we just
continue and return a short result to the caller. For indep mode, however,
we need to make the position stable, and return an undefined value on
failed placements to avoid shifting later results to the left.

Reflects ceph.git commit b1d4dd4eb044875874a1d01c01c7d766db5d0a80.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: eliminate CRUSH_MAX_SET result size limitation

This is only present to size the temporary scratch arrays that we put on
the stack. Let the caller allocate them as they wish and remove the
limitation.

Reflects ceph.git commit 1cfe140bf2dab99517589a82a916f4c75b9492d1.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: fix some comments

Reflects ceph.git commit 3cef755428761f2481b1dd0e0fbd0464ac483fc5.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: reduce scope of some local variables

Reflects ceph.git commit e7d47827f0333c96ad43d257607fb92ed4176550.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: factor out (trivial) crush_destroy_rule()

Reflects ceph.git commit 43a01c9973c4b83f2eaa98be87429941a227ddde.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

crush: pass weight vector size to map function

Pass the size of the weight vector into crush_do_rule() to ensure that we
don't access values past the end. This can happen if the caller misbehaves
and passes a weight vector that is smaller than max_devices.

Currently the monitor tries to prevent that from happening, but this will
gracefully tolerate previous bad osdmaps that got into this state. It's
also a bit more defensive.

Reflects ceph.git commit 5922e2c2b8335b5e46c9504349c3a55b7434c01a.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: update ceph_features.h

This updates ceph_features.h so that it has all feature bits defined in
ceph.git. In the interim since the last update, ceph.git crossed the
"32 feature bits" point, and, the addition of the 33rd bit wasn't
handled correctly. The work-around is squashed into this commit and
reflects ceph.git commit 053659d05e0349053ef703b414f44965f368b9f0.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: all features fields must be u64

In preparation for ceph_features.h update, change all features fields
from unsigned int/u32 to u64. (ceph.git has ~40 feature bits at this
point.)

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

rbd: tear down watch request if rbd_dev_device_setup() fails

Tear down watch request if rbd_dev_device_setup() fails.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

rbd: introduce rbd_dev_header_unwatch_sync() and switch to it

Rename rbd_dev_header_watch_sync() to __rbd_dev_header_watch_sync() and
introduce two helpers: rbd_dev_header_{,un}watch_sync() to make it more
clear what is going on.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

MAINTAINERS: update an e-mail address

I no longer have direct access to my Inktank e-mail. I still pay
attention to rbd, so update its entry in MAINTAINERS accordingly.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Sage Weil <sage@inktank.com>

rbd: enable extended devt in single-major mode

If single-major device number allocation scheme is turned on, instead
of reserving 256 minors per device, which imposes a limit of 4096
images mapped at once, reserve 16 minors per device and enable extended
devt feature.  This results in a theoretical limit of 65536 images
mapped at once, and still allows to have more than 15 partititions:
partitions starting with 16th are mapped under major 259 (Block
Extended Major):

$ rbd showmapped
id pool image snap device
0  rbd  b5    -    /dev/rbd0    # no partitions
1  rbd  b2    -    /dev/rbd1    # 40 partitions
2  rbd  b3    -    /dev/rbd2    #  2 partitions

$ cat /proc/partitions
251        0       1024 rbd0
251       16       1024 rbd1
251       17          0 rbd1p1
251       18          0 rbd1p2
...
251       30          0 rbd1p14
251       31          0 rbd1p15
259        0          0 rbd1p16
259        1          0 rbd1p17
...
259       23          0 rbd1p39
259       24          0 rbd1p40
251       32       1024 rbd2
251       33          0 rbd2p1
251       34          0 rbd2p2

(major 251 was assigned dynamically at module load time)

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

ceph fscache: Uncaching no data page from fscache in readpage()

Currently, if one new page allocated into fscache in readpage(), however,
with no data read into due to error encountered during reading from OSDs,
the slot in fscache is not uncached. This patch fixes this.

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Reviewed-by: Milosz Tanski <milosz@adfin.com>

ceph fscache: Introduce a routine for uncaching single no data page from fscache

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Reviewed-by: Milosz Tanski <milosz@adfin.com>

ceph: add acl for cephfs

Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Li Wang <li.wang@ubuntykylin.com>
Reviewed-by: Zheng Yan <zheng.z.yan@intel.com>

ceph: check caps in filemap_fault and page_mkwrite

Adds cap check to the page fault handler. The check prevents page
fault handler from adding new page to the page cache while Fcb caps
are being revoked. This solves Fc revoking hang in multiple clients
mmap IO workload.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>

rbd: add support for single-major device number allocation scheme

Currently each rbd device is allocated its own major number, which
leads to a hard limit of 230-250 images mapped at once.  This commit
adds support for a new single-major device number allocation scheme,
which is hidden behind a new single_major boolean module parameter and
is disabled by default for backwards compatibility reasons.  (Old
userspace cannot correctly unmap images mapped under single-major
scheme and would essentially just unmap a random image, if that.)

$ rbd showmapped
id pool image snap device
0  rbd  b100  -    /dev/rbd0
1  rbd  b101  -    /dev/rbd1
2  rbd  b102  -    /dev/rbd2
3  rbd  b103  -    /dev/rbd3

Old scheme (modprobe rbd):

$ ls -l /dev/rbd*
brw-rw---- 1 root disk 253, 0 Dec 10 12:24 /dev/rbd0
brw-rw---- 1 root disk 252, 0 Dec 10 12:28 /dev/rbd1
brw-rw---- 1 root disk 252, 1 Dec 10 12:28 /dev/rbd1p1
brw-rw---- 1 root disk 252, 2 Dec 10 12:28 /dev/rbd1p2
brw-rw---- 1 root disk 252, 3 Dec 10 12:28 /dev/rbd1p3
brw-rw---- 1 root disk 251, 0 Dec 10 12:28 /dev/rbd2
brw-rw---- 1 root disk 251, 1 Dec 10 12:28 /dev/rbd2p1
brw-rw---- 1 root disk 250, 0 Dec 10 12:24 /dev/rbd3

New scheme (modprobe rbd single_major=Y):

$ ls -l /dev/rbd*
brw-rw---- 1 root disk 253,   0 Dec 10 12:30 /dev/rbd0
brw-rw---- 1 root disk 253, 256 Dec 10 12:30 /dev/rbd1
brw-rw---- 1 root disk 253, 257 Dec 10 12:30 /dev/rbd1p1
brw-rw---- 1 root disk 253, 258 Dec 10 12:30 /dev/rbd1p2
brw-rw---- 1 root disk 253, 259 Dec 10 12:30 /dev/rbd1p3
brw-rw---- 1 root disk 253, 512 Dec 10 12:30 /dev/rbd2
brw-rw---- 1 root disk 253, 513 Dec 10 12:30 /dev/rbd2p1
brw-rw---- 1 root disk 253, 768 Dec 10 12:30 /dev/rbd3

(major 253 was assigned dynamically at module load time)

The new limit is 4096 images mapped at once, and it comes from the fact
that, as before, 256 minor numbers are reserved for each mapping.
(A follow-up commit changes the number of minors reserved and the way
we deal with partitions over that number.)

If single_major is set to true, two new sysfs interfaces show up:
/sys/bus/rbd/{add,remove}_single_major.  These are to be used instead
of /sys/bus/rbd/{add,remove}, which are disabled for backwards
compatibility reasons outlined above.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

rbd: wire up is_visible() sysfs callback for rbd bus

In preparation for single-major device number allocation scheme, wire
up attribute_group::is_visible() callback for rbd bus. This allows us
to make the new single-major attributes conditional.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

rbd: add 'minor' sysfs rbd device attribute

Introduce /sys/bus/rbd/devices/<id>/minor sysfs attribute for exporting
rbd whole disk minor numbers. This is a step towards single-major
device number allocation scheme, but also a good thing on its own.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

rbd: switch to ida for rbd id assignments

Currently rbd ids are allocated using an atomic variable that keeps
track of the highest id currently in use and each new id is simply one
more than the value of that variable. That's nice and cheap, but it
does mean that rbd ids are allowed to grow boundlessly, and, more
importantly, it's completely unpredictable. So, in preparation for
single-major device number allocation scheme, which is going to
establish and rely on a constant mapping between rbd ids and device
numbers, switch to ida for rbd id assignments.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

rbd: refactor rbd_init() a bit

Refactor rbd_init() a bit to make it more clear what's going on.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

rbd: tweak "loaded" message and module description

Tweak "loaded" message, so that it looks like

[ 30.184235] rbd: loaded

instead of

[ 38.056564] rbd: loaded rbd (rados block device)

Also move (and slightly tweak) MODULE_DESCRIPTION so that all authors
are next to each other in modinfo output.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

rbd: rbd_device::dev_id is an int, format it as such

rbd_device::dev_id is an int, format it as such.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

libceph: resend all writes after the osdmap loses the full flag

With the current full handling, there is a race between osds and
clients getting the first map marked full. If the osd wins, it will
return -ENOSPC to any writes, but the client may already have writes
in flight. This results in the client getting the error and
propagating it up the stack. For rbd, the block layer turns this into
EIO, which can cause corruption in filesystems above it.

To avoid this race, osds are being changed to drop writes that came
from clients with an osdmap older than the last osdmap marked full.
In order for this to work, clients must resend all writes after they
encounter a full -> not full transition in the osdmap. osds will wait
for an updated map instead of processing a request from a client with
a newer map, so resent writes will not be dropped by the osd unless
there is another not full -> full transition.

This approach requires both osds and clients to be fixed to avoid the
race. Old clients talking to osds with this fix may hang instead of
returning EIO and potentially corrupting an fs. New clients talking to
old osds have the same behavior as before if they encounter this race.

Fixes: http://tracker.ceph.com/issues/6938

Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

libceph: block I/O when PAUSE or FULL osd map flags are set

The PAUSEWR and PAUSERD flags are meant to stop the cluster from
processing writes and reads, respectively. The FULL flag is set when
the cluster determines that it is out of space, and will no longer
process writes.  PAUSEWR and PAUSERD are purely client-side settings
already implemented in userspace clients. The osd does nothing special
with these flags.

When the FULL flag is set, however, the osd responds to all writes
with -ENOSPC. For cephfs, this makes sense, but for rbd the block
layer translates this into EIO.  If a cluster goes from full to
non-full quickly, a filesystem on top of rbd will not behave well,
since some writes succeed while others get EIO.

Fix this by blocking any writes when the FULL flag is set in the osd
client. This is the same strategy used by userspace, so apply it by
default.  A follow-on patch makes this configurable.

__map_request() is called to re-target osd requests in case the
available osds changed.  Add a paused field to a ceph_osd_request, and
set it whenever an appropriate osd map flag is set.  Avoid queueing
paused requests in __map_request(), but force them to be resent if
they become unpaused.

Also subscribe to the next osd map from the monitor if any of these
flags are set, so paused requests can be unblocked as soon as
possible.

Fixes: http://tracker.ceph.com/issues/6079

Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

fs: ceph: new helper: file_inode(file)

Signed-off-by: Libo Chen <clbchenlibo.chen@huawei.com>
Signed-off-by: Sage Weil <sage@inktank.com>

ceph: Add necessary clean up if invalid reply received in handle_reply()

Wake up possible waiters, invoke the call back if any, unregister the request

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Sage Weil <sage@inktank.com>

ceph: Clean up if error occurred in finish_read()

Clean up if error occurred rather than going through normal process

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Sage Weil <sage@inktank.com>

ceph: implement readv/preadv for sync operation

For readv/preadv sync-operatoin, ceph only do the first iov.
Now implement this.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: Implement writev/pwritev for sync operation.

For writev/pwritev sync-operatoin, ceph only do the first iov.

I divided the write-sync-operation into two functions. One for
direct-write, other for none-direct-sync-write. This is because for
none-direct-sync-write we can merge iovs to one. But for direct-write,
we can't merge iovs.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>

ceph: drop unconnected inodes

Positve dentry and corresponding inode are always accompanied in MDS reply.
So no need to keep inode in the cache after dropping all its aliases.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>

ceph: Avoid data inconsistency due to d-cache aliasing in readpage()

If the length of data to be read in readpage() is exactly
PAGE_CACHE_SIZE, the original code does not flush d-cache
for data consistency after finishing reading. This patches fixes
this.

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Sage Weil <sage@inktank.com>

ceph: initialize inode before instantiating dentry

commit b18825a7c8 (Put a small type field into struct dentry::d_flags)
put a type field into struct dentry::d_flags. __d_instantiate() set the
field by checking inode->i_mode. So we should initialize inode before
instantiating dentry when handling mds reply.

Fixes: http://tracker.ceph.com/issues/6930
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>

Linux 3.13-rc3

Merge tag 'trace-fixes-3.13-rc2' of git://git./linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
"A regression showed up that there's a large delay when enabling all
  events.  This was prevalent when FTRACE_SELFTEST was enabled which
  enables all events several times, and caused the system bootup to
  pause for over a minute.

  This was tracked down to an addition of a synchronize_sched()
  performed when system call tracepoints are unregistered.

  The synchronize_sched() is needed between the unregistering of the
  system call tracepoint and a deletion of a tracing instance buffer.
  But placing the synchronize_sched() in the unreg of *every* system
  call tracepoint is a bit overboard.  A single synchronize_sched()
  before the deletion of the instance is sufficient"

* tag 'trace-fixes-3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Only run synchronize_sched() at instance deletion time

Merge git://git.kvack.org/~bcrl/aio-next

Pull aio fix from Benjamin LaHaise:
"AIO fix from Gu Zheng that fixes a GPF that Dave Jones uncovered with
trinity"

* git://git.kvack.org/~bcrl/aio-next:
aio: clean up aio ring in the fail path

Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"This is a set of nine fixes (and one author update).

  The libsas one should fix discovery in eSATA devices, the WRITE_SAME
  one is the largest, but it should fix a lot of problems we've been
  getting with the emulated RAID devices (they've been effectively lying
  about support and then firmware has been choking on the commands).

  The rest are various crash, hang or warn driver fixes"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  [SCSI] bfa: Fix crash when symb name set for offline vport
  [SCSI] enclosure: fix WARN_ON in dual path device removing
  [SCSI] pm80xx: Tasklets synchronization fix.
  [SCSI] pm80xx: Resetting the phy state.
  [SCSI] pm80xx: Fix for direct attached device.
  [SCSI] pm80xx: Module author addition
  [SCSI] hpsa: return 0 from driver probe function on success, not 1
  [SCSI] hpsa: do not discard scsi status on aborted commands
  [SCSI] Disable WRITE SAME for RAID and virtual host adapter drivers
  [SCSI] libsas: fix usage of ata_tf_to_fis

Merge branch 'for-linus' of git://git./linux/kernel/git/jmorris/linux-security

Pull IMA fixes from James Morris:
"Here are two more fixes for IMA"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
ima: properly free ima_template_entry structures
ima: Do not free 'entry' before it is initialized

Merge tag 'dt-fixes-for-3.13' of git://git./linux/kernel/git/robh/linux

Pull devicetree fixes from Rob Herring:
- Various DT binding documentation updates
- Add Kumar Gala and remove Stephen Warren as DT binding maintainers

* tag 'dt-fixes-for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  dt: binding: reword PowerPC 8xxx GPIO documentation
  ARM: tegra: delete nvidia,tegra20-spi.txt binding
  hwmon: ntc_thermistor: Fix typo (pullup-uV -> pullup-uv)
  of: add vendor prefix for GMT
  clk: exynos: Fix typos in DT bindings documentation
  of: Add vendor prefix for LG Corporation
  Documentation: net: fsl-fec.txt: Add phy-supply entry
  ARM: dts: doc: Document missing binding for omap5-mpu
  dt-bindings: add ARMv8 PMU binding
  MAINTAINERS: remove swarren from DT bindings
  MAINTAINERS: Add Kumar to Device Tree Binding maintainers group

aio: clean up aio ring in the fail path

Clean up the aio ring file in the fail path of aio_setup_ring
and ioctx_alloc. And maybe it can fix the GPF issue reported by
Dave Jones:
https://lkml.org/lkml/2013/11/25/898

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

Merge branch 'free-memory' of git://git./linux/kernel/git/zohar/linux-integrity into for-linus

Merge tag 'pm-3.13-rc3' of git://git./linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:

- cpufreq regression fix from Bjørn Mork restoring the pre-3.12
   behavior of the framework during system suspend/hibernation to avoid
   garbage sysfs files from being left behind in case of a suspend error

- PNP regression fix to restore the correct states of devices after
   resume from hibernation broken in 3.12.  From Dmitry Torokhov.

- cpuidle fix to prevent cpuidle device unregistration from crashing
   due to a NULL pointer dereference if cpuidle has been disabled from
   the kernel command line.  From Konrad Rzeszutek Wilk.

- intel_idle fix for the C6 state definition on Intel Avoton/Rangeley
   processors from Arne Bockholdt.

- Power capping framework fix to make the energy_uj sysfs attribute
   work in accordance with the documentation.  From Srinivas Pandruvada.

- epoll fix to make it ignore the EPOLLWAKEUP flag if the kernel has
   been compiled with CONFIG_PM_SLEEP unset (in which case that flag
   should not have any effect).  From Amit Pundir.

- cpufreq fix to prevent governor sysfs files from being lost over
   system suspend/resume in some (arguably unusual) situations.  From
   Viresh Kumar.

* tag 'pm-3.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PowerCap: Fix mode for energy counter
  PNP: fix restoring devices after hibernation
  cpuidle: Check for dev before deregistering it.
  epoll: drop EPOLLWAKEUP if PM_SLEEP is disabled
  cpufreq: fix garbage kobjects on errors during suspend/resume
  cpufreq: suspend governors on system suspend/hibernate
  intel_idle: Fixed C6 state on Avoton/Rangeley processors

Merge branches 'pm-epoll', 'pnp' and 'powercap'

* pm-epoll:
  epoll: drop EPOLLWAKEUP if PM_SLEEP is disabled

* pnp:
  PNP: fix restoring devices after hibernation

* powercap:
  PowerCap: Fix mode for energy counter

Merge branches 'pm-cpuidle' and 'pm-cpufreq'

* pm-cpuidle:
  cpuidle: Check for dev before deregistering it.
  intel_idle: Fixed C6 state on Avoton/Rangeley processors

* pm-cpufreq:
  cpufreq: fix garbage kobjects on errors during suspend/resume
  cpufreq: suspend governors on system suspend/hibernate

Merge branch 'stable' of git://git./linux/kernel/git/cmetcalf/linux-tile

Pull arch/tile ftrace bug fix from Chris Metcalf:
"This fixes a build failure with allyesconfig reported by Fengguang Wu
and fixed by Tony Lu"

* 'stable' of git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
ftrace: default to tilegx if ARCH=tile is specified

Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block layer fixes from Jens Axboe:
"A small collection of fixes for the current series. It contains:

   - A fix for a use-after-free of a request in blk-mq.  From Ming Lei

   - A fix for a blk-mq bug that could attempt to dereference a NULL rq
     if allocation failed

   - Two xen-blkfront small fixes

   - Cleanup of submit_bio_wait() type uses in the kernel, unifying
     that.  From Kent

   - A fix for 32-bit blkg_rwstat reading.  I apologize for this one
     looking mangled in the shortlog, it's entirely my fault for missing
     an empty line between the description and body of the text"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: fix use-after-free of request
  blk-mq: fix dereference of rq->mq_ctx if allocation fails
  block: xen-blkfront: Fix possible NULL ptr dereference
  xen-blkfront: Silence pfn maybe-uninitialized warning
  block: submit_bio_wait() conversions
  Update of blkg_stat and blkg_rwstat may happen in bh context

Merge tag 'nfs-for-3.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
- Stable fix for a NFSv4.1 delegation and state recovery deadlock
- Stable fix for a loop on irrecoverable errors when returning
   delegations
- Fix a 3-way deadlock between layoutreturn, open, and state recovery
- Update the MAINTAINERS file with contact information for Trond
   Myklebust
- Close needs to handle NFS4ERR_ADMIN_REVOKED
- Enabling v4.2 should not recompile nfsd and lockd
- Fix a couple of compile warnings

* tag 'nfs-for-3.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: fix do_div() warning by instead using sector_div()
  MAINTAINERS: Update contact information for Trond Myklebust
  NFSv4.1: Prevent a 3-way deadlock between layoutreturn, open and state recovery
  SUNRPC: do not fail gss proc NULL calls with EACCES
  NFSv4: close needs to handle NFS4ERR_ADMIN_REVOKED
  NFSv4: Update list of irrecoverable errors on DELEGRETURN
  NFSv4 wait on recovery for async session errors
  NFS: Fix a warning in nfs_setsecurity
  NFS: Enabling v4.2 should not recompile nfsd and lockd

ftrace: default to tilegx if ARCH=tile is specified

This matches the existing behavior in arch/tile/Makefile for defconfig.

Reported-by: fengguang.wu@intel.com
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Tony Lu <zlu@tilera.com>
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>

tracing: Only run synchronize_sched() at instance deletion time

It has been reported that boot up with FTRACE_SELFTEST enabled can take a
very long time. There can be stalls of over a minute.

This was tracked down to the synchronize_sched() called when a system call
event is disabled. As the self tests enable and disable thousands of events,
this makes the synchronize_sched() get called thousands of times.

The synchornize_sched() was added with d562aff93bfb53 "tracing: Add support
for SOFT_DISABLE to syscall events" which caused this regression (added
in 3.13-rc1).

The synchronize_sched() is to protect against the events being accessed
when a tracer instance is being deleted. When an instance is being deleted
all the events associated to it are unregistered. The synchronize_sched()
makes sure that no more users are running when it finishes.

Instead of calling synchronize_sched() for all syscall events, we only
need to call it once, after the events are unregistered and before the
instance is deleted. The event_mutex is held during this action to
prevent new users from enabling events.

Link: http://lkml.kernel.org/r/20131203124120.427b9661@gandalf.local.home
Reported-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Acked-by: Petr Mladek <pmladek@suse.cz>
Tested-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Merge tag 'for-linus' of git://git./linux/kernel/git/mason/linux-btrfs

Pull btrfs MAINTAINERS file update:
"I'm still getting settled into new devel hardware etc, but I do have
  one commit for the next rc.

  This changes my email over to fb.com, and adds a MAINTAINERS entry for
  Josef as well"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: update the MAINTAINERS file

Merge tag 'fbdev-fixes-3.13' of git://git./linux/kernel/git/tomba/linux

Pull minor fbdev fixes from Tomi Valkeinen.

* tag 'fbdev-fixes-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux:
  video: vt8500: fix error handling in probe()
  atmel_lcdfb: fix module autoload
  fbdev: sh_mobile_meram: Fix defined but not used compiler warnings
  video: kyro: fix incorrect sizes when copying to userspace
  ARM: OMAPFB: panel-sony-acx565akm: fix bad unlock balance

Merge tag 'sound-3.13-rc3' of git://git./linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"A usual pattern of half ASoC and half HD-audio fixes, although
  HD-audio fixups have more volumes, in addition to a couple of trivial
  fixes.  Nothing to worry much is found here.

  For ASoC side: a few fixes for PCM rate constraints calculations,
  regmap byte-order fix, the rest driver specific fixes (atmel, fsl,
  omap, kirkwood, wm codecs).

  For HD-audio: Dell headset and mono out fix, ELD update in polling
  mode, ALC283 Chromebook fixes, a few fixes for old AD codecs and
  MBA2, one regression fix"

* tag 'sound-3.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (30 commits)
  ALSA: hda - Fix silent output on MacBook Air 2,1
  ALSA: hda - Fix missing ELD info when using jackpoll_ms parameter
  ALSA: hda/realtek - remove hp_automute_hook from alc283_fixup_chromebook
  ASoC: wm8731: fix dsp mode configuration
  ALSA: hda/realtek - Independent of model for HP
  ALSA: hda - Fix headset mic input after muted internal mic (Dell/Realtek)
  ALSA: hda - Use always amps for auto-mute on AD1986A codec
  ALSA: hda/analog - Handle inverted EAPD properly in vmaster hook
  ALSA: hda - Another fixup for ASUS laptop with ALC660 codec
  ALSA: atmel: Fix possible array overflow
  ALSA: hda - Fix complete_all() timing in deferred probes
  ALSA: hda - Fix bad EAPD setup for HP machines with AD1984A
  ASoC: core: fix devres parameter in devm_snd_soc_register_card()
  ASoC: omap: n810: Convert to clk_prepare_enable/clk_disable_unprepare
  ASoC: fsl: set correct platform drvdata in pcm030_fabric_probe()
  ASoC: fsl: imx-pcm-fiq: Remove unused 'runtime' variable
  ASoC: fsl: imx-pcm-fiq: remove bogus period delta calculation
  ALSA: hda - Fix silent output on ASUS W7J laptop
  ASoC: core: Use consistent byte ordering in snd_soc_bytes_get
  ALSA: dice: fix array limits in dice_proc_read()
  ...

Merge tag 'pinctrl-v3.13-2' of git://git./linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:

- Minor bug fixes for the Rockchip, ST-Ericsson abx500, Renesas PFC
   r8a7740 and sh7372.

- Compilation warning fixes.

* tag 'pinctrl-v3.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  sh-pfc: sh7372: Fix pin bias setup
  sh-pfc: r8a7740: Fix pin bias setup
  pinctrl: abx500: Fix header file include guard
  pinctrl: rockchip: missing unlock on error in rockchip_set_pull()
  pinctrl: abx500: fix some more bitwise AND tests
  pinctrl: rockchip: testing the wrong variable

blk-mq: fix use-after-free of request

If accounting is on, we will do the IO completion accounting after
we have freed the request. Fix that by moving it sooner instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'x86/urgent' of git://git./linux/kernel/git/tip/tip

Pull x86 and EFI fixes from Peter Anvin:
"Half of these are EFI-related:

  The by far biggest change is the change to hold off the deletion of a
  sysfs entry while a backend scan is in progress.  This is to avoid
  calling kmemdup() while under a spinlock.

  The other major change is for each entry in the EFI pstore backend to
  get a unique identifier, as required by the pstore filesystem proper.

  The other changes are:

  A fix to the recent consolidation and optimization of using "asm goto"
  with read-modify-write operation, which broke the bitops; specifically
  in such a way that we could end up generating invalid code.

  A build hack to make sure we compile with -mno-sse.  icc, and most
  likely future versions of gcc, can generate SSE instructions unless we
  tell it not to.

  A comment-only patch to a change the was due in part to an unpublished
  erratum; now when the erratum is published we want to add a comment
  explaining why"

* 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/apic, doc: Justification for disabling IO APIC before Local APIC
  x86, bitops: Correct the assembly constraints to testing bitops
  x86-64, build: Always pass in -mno-sse
  efi-pstore: Make efi-pstore return a unique id
  x86/efi: Fix earlyprintk off-by-one bug
  efivars, efi-pstore: Hold off deletion of sysfs entry until the scan is completed

x86/apic, doc: Justification for disabling IO APIC before Local APIC

Since erratum AVR31 in "Intel Atom Processor C2000 Product Family
Specification Update" is now published, I added a justification
comment for disabling IO APIC before Local APIC, as changed in commit:

522e66464467 x86/apic: Disable I/O APIC before shutdown of the local APIC

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1386202069-51515-1-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>

PowerCap: Fix mode for energy counter

As per the documentation of powercap sysfs, energy_uj field is read only,
if it can't be reset. Currently it always allows write but will fail,
if there is no reset callback.
Changing mode field, to read only if there is no reset callback.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Reported-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>