platform/kernel/linux-stable.git
Sage Weil [Wed, 27 Jun 2012 19:24:08 +0000 (12:24 -0700)]
libceph: set peer name on con_open, not init

(cherry picked from commit b7a9e5dd40f17a48a72f249b8bbc989b63bae5fd)

The peer name may change on each open attempt, even when the connection is
reused.

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Thu, 21 Jun 2012 02:53:53 +0000 (21:53 -0500)]
libceph: add some fine ASCII art

(cherry picked from commit bc18f4b1c850ab355e38373fbb60fd28568d84b5)

Sage liked the state diagram I put in my commit description so
I'm putting it in with the code.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Mon, 11 Jun 2012 19:57:13 +0000 (14:57 -0500)]
libceph: small changes to messenger.c

(cherry picked from commit 5821bd8ccdf5d17ab2c391c773756538603838c3)

This patch gathers a few small changes in "net/ceph/messenger.c":
  out_msg_pos_next()
    - small logic change that mostly affects indentation
  write_partial_msg_pages()
    - use a local variable trail_off to represent the offset into
      a message of the trail portion of the data (if present)
    - once we are in the trail portion we will always be there, so we
      don't always need to check against our data position
    - avoid computing len twice after we've reached the trail
    - get rid of the variable tmpcrc, which is not needed
    - trail_off and trail_len never change so mark them const
    - update some comments
  read_partial_message_bio()
    - bio_iovec_idx() will never return an error, so don't bother
      checking for it

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Thu, 24 May 2012 16:55:03 +0000 (11:55 -0500)]
libceph: distinguish two phases of connect sequence

(cherry picked from commit 7593af920baac37752190a0db703d2732bed4a3b)

Currently a ceph connection enters a "CONNECTING" state when it
begins the process of (re-)connecting with its peer.  Once the two
ends have successfully exchanged their banner and addresses, an
additional NEGOTIATING bit is set in the ceph connection's state to
indicate the connection information exchange has begun.  The
CONNECTING bit/state continues to be set during this phase.

Rather than have the CONNECTING state continue while the NEGOTIATING
bit is set, interpret these two phases as distinct states.  In other
words, when NEGOTIATING is set, clear CONNECTING.  That way only
one of them will be active at a time.
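
As a rough illustration of the intent (not the messenger code itself),
the transition can be sketched with two bits where entering the second
phase clears the first.  The SK_* names and bit values below are
hypothetical and chosen only for this example.

```c
#include <stdbool.h>

/* Hypothetical bit values; the real definitions live in the messenger code. */
enum { SK_CONNECTING = 1u << 0, SK_NEGOTIATING = 1u << 1 };

/* Enter the negotiating phase: set NEGOTIATING and clear CONNECTING. */
static unsigned int sk_begin_negotiation(unsigned int state)
{
	state &= ~(unsigned int)SK_CONNECTING;
	state |= SK_NEGOTIATING;
	return state;
}

/* True when exactly one of the two phase bits is set. */
static bool sk_single_phase(unsigned int state)
{
	unsigned int phase = state & (SK_CONNECTING | SK_NEGOTIATING);

	return phase != 0 && (phase & (phase - 1)) == 0;
}
```

With this arrangement a check such as sk_single_phase() holds throughout
the connect sequence, which is the property the patch is after.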

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Thu, 31 May 2012 16:37:29 +0000 (11:37 -0500)]
libceph: separate banner and connect writes

(cherry picked from commit ab166d5aa3bc036fba7efaca6e4e43a7e9510acf)

There are two phases in the process of linking together the two ends
of a ceph connection.  The first involves exchanging a banner and
IP addresses, and if that is successful a second phase exchanges
some detail about each side's connection capabilities.

When initiating a connection, the client side now queues to send
its information for both phases of this process at the same time.
This is probably a bit more efficient, but it is slightly messier
from a layering perspective in the code.

So rearrange things so that the client doesn't send the connection
information until it has received and processed the response in the
initial banner phase (in process_banner()).

Move the code (in the (con->sock == NULL) case in try_write()) that
prepares for writing the connection information, delaying doing that
until the banner exchange has completed.  Move the code that begins
the transition to this second "NEGOTIATING" phase out of
process_banner() and into its caller, so preparing to write the
connection information and preparing to read the response are
adjacent to each other.

Finally, preparing to write the connection information now requires
the output kvec to be reset in all cases, so move that into the
prepare_write_connect() and delete it from all callers.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Wed, 23 May 2012 19:35:23 +0000 (14:35 -0500)]
libceph: define and use an explicit CONNECTED state

(cherry picked from commit e27947c767f5bed15048f4e4dad3e2eb69133697)

There is no state explicitly defined when a ceph connection is fully
operational.  So define one.

It's set when the connection sequence completes successfully, and is
cleared when the connection gets closed.

Be a little more careful when examining the old state when a socket
disconnect event is reported.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Wed, 23 May 2012 19:35:23 +0000 (14:35 -0500)]
libceph: clear NEGOTIATING when done

(cherry picked from commit 3ec50d1868a9e0493046400bb1fdd054c7f64ebd)

A connection state's NEGOTIATING bit gets set while in CONNECTING
state after we have successfully exchanged a ceph banner and IP
addresses with the connection's peer (the server).  But that bit
is not cleared again--at least not until another connection attempt
is initiated.

Instead, clear it as soon as the connection is fully established.
Also, clear it when a socket connection gets prematurely closed
in the midst of establishing a ceph connection (in case we had
reached the point where it was set).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Thu, 21 Jun 2012 02:53:53 +0000 (21:53 -0500)]
libceph: clear CONNECTING in ceph_con_close()

(cherry picked from commit bb9e6bba5d8b85b631390f8dbe8a24ae1ff5b48a)

A connection that is closed will no longer be connecting.  So
clear the CONNECTING state bit in ceph_con_close().  Similarly,
if the socket has been closed we no longer are in connecting
state (a new connect sequence will need to be initiated).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Thu, 21 Jun 2012 02:53:53 +0000 (21:53 -0500)]
libceph: don't touch con state in con_close_socket()

(cherry picked from commit 456ea46865787283088b23a8a7f69244513b95f0)

In con_close_socket(), a connection's SOCK_CLOSED flag gets set and
then cleared while its shutdown method is called and its reference
gets dropped.

Previously, that flag got set only if it had not already been set,
so setting it in con_close_socket() might have prevented additional
processing being done on a socket being shut down.  We no longer set
SOCK_CLOSED in the socket event routine conditionally, so setting
that bit here no longer provides whatever benefit it might have
provided before.

A race condition could still leave the SOCK_CLOSED bit set even
after we've issued the call to con_close_socket(), so we still clear
that bit after shutting the socket down.  Add a comment explaining
the reason for this.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Thu, 21 Jun 2012 02:53:53 +0000 (21:53 -0500)]
libceph: just set SOCK_CLOSED when state changes

(cherry picked from commit d65c9e0b9eb43d14ece9dd843506ccba06162ee7)

When a TCP_CLOSE or TCP_CLOSE_WAIT event occurs, the SOCK_CLOSED
connection flag bit is set, and if it had not been previously set
queue_con() is called to ensure con_work() will get a chance to
handle the changed state.

con_work() atomically checks the SOCK_CLOSED bit and, if it is set,
clears it.  This means that even if the bit were set repeatedly, the
related processing in con_work() only gets called once per
transition of the bit from 0 to 1.

What's important then is that we ensure con_work() gets called *at
least* once when a socket close event occurs, not that it gets
called *exactly* once.

The work queue mechanism already takes care of queueing work
only if it is not already queued, so there's no need for us
to call queue_con() conditionally.

So this patch just makes it so the SOCK_CLOSED flag gets set
unconditionally in ceph_sock_state_change().
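
The pattern described above can be sketched in userspace C11 as
follows.  This is only an analogy: the kernel uses its own bit
operations, and the function names and the flag value here are
hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SK_FLAG_SOCK_CLOSED 0x1u	/* hypothetical flag bit */

/* Socket callback side: set the flag every time, with no prior test. */
static void sk_note_sock_closed(atomic_uint *flags)
{
	atomic_fetch_or(flags, SK_FLAG_SOCK_CLOSED);
}

/*
 * Worker side: atomically test and clear the flag.  Returns true only
 * once per 0 -> 1 transition, however many times the flag was set.
 */
static bool sk_take_sock_closed(atomic_uint *flags)
{
	return atomic_fetch_and(flags, ~SK_FLAG_SOCK_CLOSED) & SK_FLAG_SOCK_CLOSED;
}
```

Setting the flag twice before the worker runs still yields a single
"true" from sk_take_sock_closed(), which is why the setter does not
need to be conditional.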

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Thu, 21 Jun 2012 02:53:53 +0000 (21:53 -0500)]
libceph: don't change socket state on sock event

(cherry picked from commit 188048bce311ee41e5178bc3255415d0eae28423)

Currently the socket state change event handler records an error
message on a connection to distinguish a close while connecting from
a close while a connection was already established.

Changing connection information during handling of a socket event is
not very clean, so instead move this assignment inside con_work(),
where it can be done during normal connection-level processing (and
under protection of the connection mutex as well).

Move the handling of a socket closed event up to the top of the
processing loop in con_work(); there's no point in handling backoff
etc. if we have a newly-closed socket to take care of.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Thu, 21 Jun 2012 02:53:53 +0000 (21:53 -0500)]
libceph: SOCK_CLOSED is a flag, not a state

(cherry picked from commit a8d00e3cdef4c1c4f194414b72b24cd995439a05)

The following commit changed it so the SOCK_CLOSED bit was stored in
a connection's new "flags" field rather than its "state" field.

    libceph: start separating connection flags from state
    commit 928443cd

That bit is used in con_close_socket() to protect against setting an
error message more than once in the socket event handler function.

Unfortunately, the field being operated on in that function was not
updated to be "flags" as it should have been.  This fixes that
error.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Mon, 11 Jun 2012 19:57:13 +0000 (14:57 -0500)]
libceph: don't use bio_iter as a flag

(cherry picked from commit abdaa6a849af1d63153682c11f5bbb22dacb1f6b)

Recently a bug was fixed in which the bio_iter field in a ceph
message was not being properly re-initialized when a message got
re-transmitted:
    commit 43643528cce60ca184fe8197efa8e8da7c89a037
    Author: Yan, Zheng <zheng.z.yan@intel.com>
    rbd: Clear ceph_msg->bio_iter for retransmitted message

We are now only initializing the bio_iter field when we are about to
start to write message data (in prepare_write_message_data()),
rather than every time we are attempting to write any portion of the
message data (in write_partial_msg_pages()).  This means we no
longer need to use the msg->bio_iter field as a flag.

So just don't do that any more.  Trust prepare_write_message_data()
to ensure msg->bio_iter is properly initialized, every time we are
about to begin writing (or re-writing) a message's bio data.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Mon, 11 Jun 2012 19:57:13 +0000 (14:57 -0500)]
libceph: move init of bio_iter

(cherry picked from commit 572c588edadaa3da3992bd8a0fed830bbcc861f8)

If a message has a non-null bio pointer, its bio_iter field is
initialized in write_partial_msg_pages() if this has not been done
already.  This is really a one-time setup operation for sending a
message's (bio) data, so move that initialization code into
prepare_write_message_data() which serves that purpose.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Mon, 11 Jun 2012 19:57:13 +0000 (14:57 -0500)]
libceph: move init_bio_*() functions up

(cherry picked from commit df6ad1f97342ebc4270128222e896541405eecdb)

Move init_bio_iter() and iter_bio_next() up in their source file so
they'll be defined before they're needed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Mon, 11 Jun 2012 19:57:13 +0000 (14:57 -0500)]
libceph: don't mark footer complete before it is

(cherry picked from commit fd154f3c75465abd83b7a395033e3755908a1e6e)

This is a nit, but prepare_write_message() sets the FOOTER_COMPLETE
flag before the CRC for the data portion (recorded in the footer)
has been completely computed.  Hold off setting the complete flag
until we've decided it's ready to send.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Mon, 11 Jun 2012 19:57:13 +0000 (14:57 -0500)]
libceph: encapsulate advancing msg page

(cherry picked from commit 84ca8fc87fcf4ab97bb8acdb59bf97bb4820cb14)

In write_partial_msg_pages(), once all the data from a page has been
sent we advance to the next one.  Put the code that takes care of
this into its own function.

While modifying write_partial_msg_pages(), make its local variable
"in_trail" be Boolean, and use the local variable "msg" (which is
just the connection's current out_msg pointer) consistently.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Mon, 11 Jun 2012 19:57:13 +0000 (14:57 -0500)]
libceph: encapsulate out message data setup

(cherry picked from commit 739c905baa018c99003564ebc367d93aa44d4861)

Move the code that prepares to write the data portion of a message
into its own function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sage Weil [Thu, 21 Jun 2012 19:49:23 +0000 (12:49 -0700)]
libceph: drop ceph_con_get/put helpers and nref member

(cherry picked from commit d59315ca8c0de00df9b363f94a2641a30961ca1c)

These are no longer used.  Every ceph_connection instance is embedded in
another structure, and refcounts manipulated via the get/put ops.

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sage Weil [Thu, 21 Jun 2012 19:47:08 +0000 (12:47 -0700)]
libceph: use con get/put methods

(cherry picked from commit 36eb71aa57e6a33d61fd90a2fd87f00c6844bc86)

The ceph_con_get/put() helpers manipulate the embedded con ref
count, which isn't used now that ceph_connections are embedded in
other structures.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dan Carpenter [Tue, 19 Jun 2012 13:52:33 +0000 (08:52 -0500)]
libceph: fix NULL dereference in reset_connection()

(cherry picked from commit 26ce171915f348abd1f41da1ed139d93750d987f)

We dereference "con->in_msg" on the line after it was set to NULL.
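
The class of bug is easy to show in isolation.  The sketch below is not
reset_connection(); the struct and function names are hypothetical and
only demonstrate the safe ordering, where every use of the pointer
comes before it is cleared.

```c
#include <stddef.h>

struct sk_rx_con;

struct sk_rx_msg {
	struct sk_rx_con *con;		/* back pointer to the owning connection */
};

struct sk_rx_con {
	struct sk_rx_msg *in_msg;	/* message currently being read, if any */
};

/* Drop the in-progress incoming message from a connection. */
static void sk_drop_in_msg(struct sk_rx_con *con)
{
	if (con->in_msg) {
		con->in_msg->con = NULL;	/* use the pointer while it is valid */
		con->in_msg = NULL;		/* only then forget it */
	}
}
```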

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sage Weil [Sat, 9 Jun 2012 21:19:21 +0000 (14:19 -0700)]
libceph: transition socket state prior to actual connect

(cherry picked from commit 89a86be0ce20022f6ede8bccec078dbb3d63caaa)

Once we call ->connect(), we are racing against the actual
connection, and a subsequent transition from CONNECTING ->
CONNECTED.  Set the state to CONNECTING before that, under the
protection of the mutex, to avoid the race.

This was introduced in 928443cd9644e7cfd46f687dbeffda2d1a357ff9,
with the original socket state code.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Xi Wang [Thu, 7 Jun 2012 00:35:55 +0000 (19:35 -0500)]
libceph: fix overflow in osdmap_apply_incremental()

(cherry picked from commit a5506049500b30dbc5edb4d07a3577477c1f3643)

On 32-bit systems, a large `pglen' would overflow `pglen*sizeof(u32)'
and bypass the check ceph_decode_need(p, end, pglen*sizeof(u32), bad).
It would also overflow the subsequent kmalloc() size, leading to
out-of-bounds write.
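
To make the wraparound concrete, the sketch below models the 32-bit
multiplication with uint32_t and shows one overflow-safe way to phrase
the bounds check.  It is an illustration under that assumption, not
necessarily the form of the check used in the patch; the sk_* helpers
are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Byte count as 32-bit arithmetic would compute it: wraps modulo 2^32. */
static uint32_t sk_size32(uint32_t n)
{
	return n * (uint32_t)sizeof(uint32_t);
}

/*
 * Overflow-safe test that n 32-bit items fit in the remaining bytes:
 * divide the bound instead of multiplying the untrusted count.
 */
static bool sk_items_fit(uint32_t n, uint32_t remaining)
{
	return n <= remaining / sizeof(uint32_t);
}
```

For n = 0x40000001 the wrapped product is 4, so a check against the
product passes even with only a few bytes left, while the
division-based check rejects the count.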

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Xi Wang [Thu, 7 Jun 2012 00:35:55 +0000 (19:35 -0500)]
libceph: fix overflow in osdmap_decode()

(cherry picked from commit e91a9b639a691e0982088b5954eaafb5a25c8f1c)

On 32-bit systems, a large `n' would overflow `n * sizeof(u32)' and bypass
the check ceph_decode_need(p, end, n * sizeof(u32), bad).  It would also
overflow the subsequent kmalloc() size, leading to out-of-bounds write.
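
The allocation side of the problem can be guarded the same way.  A
minimal userspace sketch, using malloc() as a stand-in for kmalloc()
and a hypothetical helper name:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Allocate room for n 32-bit items, refusing any count whose byte size
 * would overflow size_t.  Returns NULL on overflow or allocation failure.
 */
static uint32_t *sk_alloc_u32(size_t n)
{
	if (n > SIZE_MAX / sizeof(uint32_t))
		return NULL;
	return malloc(n * sizeof(uint32_t));
}
```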

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Xi Wang [Thu, 7 Jun 2012 00:35:55 +0000 (19:35 -0500)]
libceph: fix overflow in __decode_pool_names()

(cherry picked from commit ad3b904c07dfa88603689bf9a67bffbb9b99beb5)

`len' is read from network and thus needs validation.  Otherwise a
large `len' would cause out-of-bounds access via the memcpy() call.
In addition, len = 0xffffffff would overflow the kmalloc() size,
leading to out-of-bounds write.

This patch adds a check of `len' via ceph_decode_need().  Also use
kstrndup rather than kmalloc/memcpy.
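
A userspace sketch of the validated decode follows.  It is not
__decode_pool_names(): the function name is hypothetical, malloc() and
memcpy() stand in for the kernel helpers, and the length prefix is
read in host byte order for brevity.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Decode a 32-bit length-prefixed name from [*p, end).
 * Returns a NUL-terminated copy and advances *p past the name, or
 * returns NULL on truncated input or allocation failure.
 */
static char *sk_decode_name(const char **p, const char *end)
{
	uint32_t len;
	char *name;

	if ((size_t)(end - *p) < sizeof(len))
		return NULL;
	memcpy(&len, *p, sizeof(len));
	*p += sizeof(len);

	/* Validate the untrusted length before allocating or copying. */
	if (len > (size_t)(end - *p))
		return NULL;

	name = malloc((size_t)len + 1);
	if (!name)
		return NULL;
	memcpy(name, *p, len);
	name[len] = '\0';
	*p += len;
	return name;
}
```

Because len is bounded by the bytes actually present, the "+ 1" in the
allocation size cannot wrap, and a len of 0xffffffff is rejected
before any copy takes place.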

[elder@inktank.com: added -ENOMEM return for null kstrndup() result]

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Fri, 1 Jun 2012 19:56:43 +0000 (14:56 -0500)]
libceph: make ceph_con_revoke_message() a msg op

(cherry picked from commit 8921d114f5574c6da2cdd00749d185633ecf88f3)

ceph_con_revoke_message() is passed both a message and a ceph
connection.  A ceph_msg allocated for incoming messages on a
connection always has a pointer to that connection, so there's no
need to provide the connection when revoking such a message.

Note that the existing logic does not preclude the message supplied
being a null/bogus message pointer.  The only user of this interface
is the OSD client, and the only value an osd client passes is a
request's r_reply field.  That is always non-null (except briefly in
an error path in ceph_osdc_alloc_request(), and that drops the
only reference so the request won't ever have a reply to revoke).
So we can safely assume the passed-in message is non-null, but add a
BUG_ON() to make it very obvious we are imposing this restriction.

Rename the function ceph_msg_revoke_incoming() to reflect that it is
really an operation on an incoming message.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Fri, 1 Jun 2012 19:56:43 +0000 (14:56 -0500)]
libceph: make ceph_con_revoke() a msg operation

(cherry picked from commit 6740a845b2543cc46e1902ba21bac743fbadd0dc)

ceph_con_revoke() is passed both a message and a ceph connection.
Now that any message associated with a connection holds a pointer
to that connection, there's no need to provide the connection when
revoking a message.

This has the added benefit of precluding the possibility of
providing the wrong connection pointer.  If the message's connection
pointer is null, it is not being tracked by any connection, so
revoking it is a no-op.  This is supported as a convenience for
upper layers, so they can revoke a message that is not actually
"in flight."

Rename the function ceph_msg_revoke() to reflect that it is really
an operation on a message, not a connection.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Mon, 4 Jun 2012 19:43:33 +0000 (14:43 -0500)]
libceph: have messages take a connection reference

(cherry picked from commit 92ce034b5a740046cc643a21ea21eaad589e0043)

There are essentially two types of ceph messages: incoming and
outgoing.  Outgoing messages are always allocated via ceph_msg_new(),
and at the time of their allocation they are not associated with any
particular connection.  Incoming messages are always allocated via
ceph_con_in_msg_alloc(), and they are initially associated with the
connection from which incoming data will be placed into the message.

When an outgoing message gets sent, it becomes associated with a
connection and remains that way until the message is successfully
sent.  The association of an incoming message goes away at the point
it is sent to an upper layer via a con->ops->dispatch method.

This patch implements reference counting for all ceph messages, such
that every message holds a reference (and a pointer) to a connection
if and only if it is associated with that connection (as described
above).

For background, here is an explanation of the ceph message
lifecycle, emphasizing when an association exists between a message
and a connection.

Outgoing Messages
An outgoing message is "owned" by its allocator, from the time it is
allocated in ceph_msg_new() up to the point it gets queued for
sending in ceph_con_send().  Prior to that point the message's
msg->con pointer is null; at the point it is queued for sending,
that pointer is assigned to refer to the connection.  At that
time the message is inserted into a connection's out_queue list.

When a message on the out_queue list has been sent to the socket
layer to be put on the wire, it is transferred out of that list and
into the connection's out_sent list.  At that point it is still owned
by the connection, and will remain so until an acknowledgement is
received from the recipient that indicates the message was
successfully transferred.  When such an acknowledgement is received
(in process_ack()), the message is removed from its list (in
ceph_msg_remove()), at which point it is no longer associated with
the connection.

So basically, any time a message is on one of a connection's lists,
it is associated with that connection.  Reference counting outgoing
messages can thus be done at the points a message is added to the
out_queue (in ceph_con_send()) and the point it is removed from
either of its two lists (in ceph_msg_remove())--at which point its
connection pointer becomes null.

Incoming Messages
When an incoming message on a connection is getting read (in
read_partial_message()) and there is no message in con->in_msg,
a new one is allocated using ceph_con_in_msg_alloc().  At that
point the message is associated with the connection.  Once that
message has been completely and successfully read, it is passed to
upper layer code using the connection's con->ops->dispatch method.
At that point the association between the message and the connection
no longer exists.

Reference counting of connections for incoming messages can be done
by taking a reference to the connection when the message gets
allocated, and releasing that reference when it gets handed off
using the dispatch method.

We should never fail to get a connection reference for a
message, since the caller should already hold one.
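
The invariant ("a message holds a reference if and only if its
connection pointer is set") can be sketched as a pair of helpers.
This is a simplified model with a plain integer count and hypothetical
names; the real code goes through the connection's get/put operations.

```c
#include <stddef.h>

struct sk_ref_con {
	int nref;			/* simplified, non-atomic reference count */
};

struct sk_ref_msg {
	struct sk_ref_con *con;		/* non-NULL only while associated */
};

/* Associate msg with con and take a reference on the connection. */
static void sk_msg_attach(struct sk_ref_msg *msg, struct sk_ref_con *con)
{
	con->nref++;
	msg->con = con;
}

/* Break the association and drop the reference; a no-op if unattached. */
static void sk_msg_detach(struct sk_ref_msg *msg)
{
	if (msg->con) {
		msg->con->nref--;
		msg->con = NULL;
	}
}
```

For an outgoing message the attach point corresponds to queueing and
the detach point to removal from the connection's lists; for an
incoming message they correspond to allocation and dispatch.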

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Fri, 1 Jun 2012 19:56:43 +0000 (14:56 -0500)]
libceph: have messages point to their connection

(cherry picked from commit 38941f8031bf042dba3ced6394ba3a3b16c244ea)

When a ceph message is queued for sending it is placed on a list of
pending messages (ceph_connection->out_queue).  When they are
actually sent over the wire, they are moved from that list to
another (ceph_connection->out_sent).  When acknowledgement for the
message is received, it is removed from the sent messages list.

During that entire time the message is "in the possession" of a
single ceph connection.  Keep track of that connection in the
message.  This will be used in the next patch (and is a helpful
bit of information for debugging anyway).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Mon, 4 Jun 2012 19:43:32 +0000 (14:43 -0500)]
libceph: tweak ceph_alloc_msg()

(cherry picked from commit 1c20f2d26795803fc4f5155fe4fca5717a5944b6)

The function ceph_alloc_msg() is only used to allocate a message
that will be assigned to a connection's in_msg pointer.  Rename the
function so this implied usage is more clear.

In addition, make that assignment inside the function (again, since
that's precisely what it's intended to be used for).  This allows us
to return what is now provided via the passed-in address of a "skip"
variable.  The return type is now Boolean to be explicit that there
are only two possible outcomes.

Make sure the result of an ->alloc_msg method call always sets the
value of *skip properly.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Sun, 27 May 2012 04:26:43 +0000 (23:26 -0500)]
libceph: fully initialize connection in con_init()

(cherry picked from commit 1bfd89f4e6e1adc6a782d94aa5d4c53be1e404d7)

Move the initialization of a ceph connection's private pointer,
operations vector pointer, and peer name information into
ceph_con_init().  Rearrange the arguments so the connection pointer
is first.  Hide the byte-swapping of the peer entity number inside
ceph_con_init().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Sun, 27 May 2012 04:26:43 +0000 (23:26 -0500)]
libceph: init monitor connection when opening

(cherry picked from commit 20581c1faf7b15ae1f8b80c0ec757877b0b53151)

Hold off initializing a monitor client's connection until just
before it gets opened for use.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sage Weil [Fri, 1 Jun 2012 03:27:50 +0000 (20:27 -0700)]
libceph: drop connection refcounting for mon_client

(cherry picked from commit ec87ef4309d33bd9c87a53bb5152a86ae7a65f25)

All references to the embedded ceph_connection come from the msgr
workqueue, which is drained prior to mon_client destruction.  That
means we can ignore con refcounting entirely.

Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Sun, 27 May 2012 04:26:43 +0000 (23:26 -0500)]
libceph: embed ceph connection structure in mon_client

(cherry picked from commit 67130934fb579fdf0f2f6d745960264378b57dc8)

A monitor client has a pointer to a ceph connection structure in it.
This is the only one of the three ceph client types that does it this
way; the OSD and MDS clients embed the connection into their main
structures.  There is always exactly one ceph connection for a
monitor client, so there is no need to allocate it separately from the
monitor client structure.

So switch the ceph_mon_client structure to embed its
ceph_connection structure.
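
One general consequence of embedding, shown here with hypothetical
struct names, is that the enclosing structure can be recovered from a
pointer to the embedded member (the idea behind the kernel's
container_of()).  This is a sketch of the layout only and makes no
claim about how the monitor client actually looks up its owner.

```c
#include <stddef.h>

struct sk_connection {
	int id;
};

struct sk_mon_client {
	int cur_mon;
	struct sk_connection con;	/* embedded, not a separately allocated pointer */
};

/* Recover the enclosing monitor client from its embedded connection. */
static struct sk_mon_client *sk_con_to_monc(struct sk_connection *con)
{
	return (struct sk_mon_client *)
		((char *)con - offsetof(struct sk_mon_client, con));
}
```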

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Tue, 29 May 2012 16:04:58 +0000 (11:04 -0500)]
libceph: set CLOSED state bit in con_init

(cherry picked from commit a5988c490ef66cb04ea2f610681949b25c773b3c)

Once a connection is fully initialized, it is really in a CLOSED
state, so make that explicit by setting the bit in its state field.

It is possible for a connection in NEGOTIATING state to get a
failure, leading to ceph_fault() and ultimately ceph_con_close().
Clear that bit if it is set in that case, to reflect that the
connection truly is closed and is no longer participating in a
connect sequence.

Issue a warning if ceph_con_open() is called on a connection that
is not in CLOSED state.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Sun, 27 May 2012 04:26:43 +0000 (23:26 -0500)]
libceph: provide osd number when creating osd

(cherry picked from commit e10006f807ffc4d5b1d861305d18d9e8145891ca)

Pass the osd number to the create_osd() routine, and move the
initialization of fields that depend on it therein.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Wed, 23 May 2012 03:15:49 +0000 (22:15 -0500)]
libceph: start tracking connection socket state

(cherry picked from commit ce2c8903e76e690846a00a0284e4bd9ee954d680)

Start explicitly keeping track of the state of a ceph connection's
socket, separate from the state of the connection itself.  Create
placeholder functions to encapsulate the state transitions.

    --------
    | NEW* |  transient initial state
    --------
        | con_sock_state_init()
        v
    ----------
    | CLOSED |  initialized, but no socket (and no
    ----------  TCP connection)
     ^      \
     |       \ con_sock_state_connecting()
     |        ----------------------
     |                              \
     + con_sock_state_closed()       \
     |\                               \
     | \                               \
     |  -----------                     \
     |  | CLOSING |  socket event;       \
     |  -----------  await close          \
     |       ^                            |
     |       |                            |
     |       + con_sock_state_closing()   |
     |      / \                           |
     |     /   ---------------            |
     |    /                   \           v
     |   /                    --------------
     |  /    -----------------| CONNECTING |  socket created, TCP
     |  |   /                 --------------  connect initiated
     |  |   | con_sock_state_connected()
     |  |   v
    -------------
    | CONNECTED |  TCP connection established
    -------------

Make the socket state an atomic variable, reinforcing that it's a
distinct transition with no possible "intermediate/both" states.
This is almost certainly overkill at this point, though the
transitions into CONNECTED and CLOSING state do get called via
socket callback (the rest of the transitions occur with the
connection mutex held).  We can back out the atomicity later.
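
The transitions in the diagram above can be sketched in user-space C
as follows.  This is only an illustration under stated assumptions:
the enum values, the struct layout, con_sock_transition_ok() and
con_sock_state_set() are hypothetical names, C11 atomics stand in for
the kernel's atomic type, and the set of allowed edges is one reading
of the ASCII art rather than a copy of the patch.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical socket states mirroring the diagram. */
enum con_sock_state {
	CON_SOCK_STATE_NEW = 0,
	CON_SOCK_STATE_CLOSED,
	CON_SOCK_STATE_CONNECTING,
	CON_SOCK_STATE_CONNECTED,
	CON_SOCK_STATE_CLOSING,
};

struct connection {
	atomic_int sock_state;
};

/* Returns 1 if the diagram has an edge from 'from' to 'to', else 0. */
int con_sock_transition_ok(int from, int to)
{
	switch (to) {
	case CON_SOCK_STATE_CLOSED:
		return from == CON_SOCK_STATE_NEW ||
		       from == CON_SOCK_STATE_CLOSING ||
		       from == CON_SOCK_STATE_CONNECTED;
	case CON_SOCK_STATE_CONNECTING:
		return from == CON_SOCK_STATE_CLOSED;
	case CON_SOCK_STATE_CONNECTED:
		return from == CON_SOCK_STATE_CONNECTING;
	case CON_SOCK_STATE_CLOSING:
		return from == CON_SOCK_STATE_CONNECTING ||
		       from == CON_SOCK_STATE_CONNECTED;
	default:
		return 0;
	}
}

/* Swap in the new state in one atomic step and return the old one. */
int con_sock_state_set(struct connection *con, int new_state)
{
	int old = atomic_exchange(&con->sock_state, new_state);

	assert(con_sock_transition_ok(old, new_state));
	return old;
}
```

Because the exchange is a single operation, an observer sees either
the old state or the new one, never a mixture of the two.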

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agolibceph: start separating connection flags from state
Alex Elder [Tue, 22 May 2012 16:41:43 +0000 (11:41 -0500)]
libceph: start separating connection flags from state

(cherry picked from commit 928443cd9644e7cfd46f687dbeffda2d1a357ff9)

A ceph_connection holds a mixture of connection state (as in "state
machine" state) and connection flags in a single "state" field.  To
make the distinction more clear, define a new "flags" field and use
it rather than the "state" field to hold Boolean flag values.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agolibceph: embed ceph messenger structure in ceph_client
Alex Elder [Sun, 27 May 2012 04:26:43 +0000 (23:26 -0500)]
libceph: embed ceph messenger structure in ceph_client

(cherry picked from commit 15d9882c336db2db73ccf9871ae2398e452f694c)

A ceph client has a pointer to a ceph messenger structure in it.
There is always exactly one ceph messenger for a ceph client, so
there is no need to allocate it separate from the ceph client
structure.

Switch the ceph_client structure to embed its ceph_messenger
structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agolibceph: rename kvec_reset and kvec_add functions
Alex Elder [Wed, 23 May 2012 19:35:23 +0000 (14:35 -0500)]
libceph: rename kvec_reset and kvec_add functions

(cherry picked from commit e22004235a900213625acd6583ac913d5a30c155)

The functions ceph_con_out_kvec_reset() and ceph_con_out_kvec_add()
are entirely private functions, so drop the "ceph_" prefix in their
name to make them slightly more wieldy.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agolibceph: rename socket callbacks
Alex Elder [Tue, 22 May 2012 16:41:43 +0000 (11:41 -0500)]
libceph: rename socket callbacks

(cherry picked from commit 327800bdc2cb9b71f4b458ca07aa9d522668dde0)

Change the names of the three socket callback functions to make it
more obvious they're specifically associated with a connection's
socket (not the ceph connection that uses it).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agolibceph: kill bad_proto ceph connection op
Alex Elder [Wed, 30 May 2012 02:47:38 +0000 (21:47 -0500)]
libceph: kill bad_proto ceph connection op

(cherry picked from commit 6384bb8b8e88a9c6bf2ae0d9517c2c0199177c34)

No code sets a bad_proto method in its ceph connection operations
vector, so just get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agolibceph: eliminate connection state "DEAD"
Alex Elder [Tue, 22 May 2012 16:41:43 +0000 (11:41 -0500)]
libceph: eliminate connection state "DEAD"

(cherry picked from commit e5e372da9a469dfe3ece40277090a7056c566838)

The ceph connection state "DEAD" is never set and is therefore not
needed.  Eliminate it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: check PG_Private flag before accessing page->private
Yan, Zheng [Mon, 28 May 2012 06:44:30 +0000 (14:44 +0800)]
ceph: check PG_Private flag before accessing page->private

(cherry picked from commit 28c0254ede13ab575d2df5c6585ed3d4817c3e6b)

I got lots of NULL pointer dereference Oopses when compiling a kernel on ceph.
The bug occurs because the kernel page migration routine replaces some pages
in the page cache with new pages, and these new pages' private fields can be
non-zero.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agorbd: Fix ceph_snap_context size calculation
Yan, Zheng [Wed, 6 Jun 2012 14:15:33 +0000 (09:15 -0500)]
rbd: Fix ceph_snap_context size calculation

(cherry picked from commit f9f9a1904467816452fc70740165030e84c2c659)

ceph_snap_context->snaps is a u64 array
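
Since the snaps array holds 64-bit values, the allocation size has to
count sizeof(u64) bytes per snapshot.  A minimal user-space sketch of
that calculation follows; struct snap_context, snap_context_size() and
snap_context_alloc() are hypothetical stand-ins, not the rbd code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a snapshot context with a trailing u64 array. */
struct snap_context {
	uint64_t seq;
	uint32_t num_snaps;
	uint64_t snaps[];	/* one 64-bit id per snapshot */
};

/* Bytes needed for a context holding snap_count snapshot ids. */
size_t snap_context_size(uint32_t snap_count)
{
	/* Guard against overflow of the size computation. */
	assert(snap_count <=
	       (SIZE_MAX - sizeof(struct snap_context)) / sizeof(uint64_t));
	return sizeof(struct snap_context) +
	       (size_t)snap_count * sizeof(uint64_t);
}

/* Allocate a zeroed context; returns NULL if allocation fails. */
struct snap_context *snap_context_alloc(uint32_t snap_count)
{
	struct snap_context *snapc = calloc(1, snap_context_size(snap_count));

	if (snapc)
		snapc->num_snaps = snap_count;
	return snapc;
}
```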

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agorbd: store snapshot id instead of index
Josh Durgin [Mon, 21 Nov 2011 21:04:42 +0000 (13:04 -0800)]
rbd: store snapshot id instead of index

(cherry picked from commit 77dfe99fe3cb0b2b0545e19e2d57b7a9134ee3c0)

When a device was open at a snapshot, and snapshots were deleted or
added, data from the wrong snapshot could be read. Instead of
assuming the snap context is constant, store the actual snap id when
the device is initialized, and rely on the OSDs to signal an error
if we try reading from a snapshot that was deleted.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agorbd: protect read of snapshot sequence number
Josh Durgin [Mon, 5 Dec 2011 18:47:13 +0000 (10:47 -0800)]
rbd: protect read of snapshot sequence number

(cherry picked from commit 403f24d3d51760a8b9368d595fa5f48c309f1a0f)

This is updated whenever a snapshot is added or deleted, and the
snapc pointer is changed with every refresh of the header.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agorbd: don't hold spinlock during messenger flush
Alex Elder [Wed, 4 Apr 2012 18:35:44 +0000 (13:35 -0500)]
rbd: don't hold spinlock during messenger flush

(cherry picked from commit cd9d9f5df6098c50726200d4185e9e8da32785b3)

A recent change made updates to the rbd_client_list protected by
a spinlock.  Unfortunately, rbd_put_client() takes the lock before
possibly dropping the last reference to an rbd_client, and dropping
the last reference eventually calls flush_workqueue(), which can
sleep.

The problem was flagged by a debug spinlock warning:
    BUG: spinlock wrong CPU on CPU#3, rbd/27814

The solution is to move the spinlock acquisition and release inside
rbd_client_release(), which is the spot where it's really needed for
protecting the removal of the rbd_client from the client list.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agolibceph: fix messenger retry
Sage Weil [Tue, 10 Jul 2012 18:53:34 +0000 (11:53 -0700)]
libceph: fix messenger retry

(cherry picked from commit 5bdca4e0768d3e0f4efa43d9a2cc8210aeb91ab9)

In ancient times, the messenger could both initiate and accept connections.
An artifact of that was data structures to store/process an incoming
ceph_msg_connect request and send an outgoing ceph_msg_connect_reply.
Sadly, the negotiation code was referencing those structures and ignoring
important information (like the peer's connect_seq) from the correct ones.

Among other things, this fixes tight reconnect loops where the server sends
RETRY_SESSION and we (the client) retry with the same connect_seq as last
time.  This bug is pretty easily triggered by injecting socket failures on the
MDS and running some fs workload like workunits/direct_io/test_sync_io.

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agolibceph: flush msgr queue during mon_client shutdown
Sage Weil [Mon, 11 Jun 2012 03:43:56 +0000 (20:43 -0700)]
libceph: flush msgr queue during mon_client shutdown

(cherry picked from commit f3dea7edd3d449fe7a6d402c1ce56a294b985261)
(cherry picked from commit 642c0dbde32f34baa7886e988a067089992adc8f)

We need to flush the msgr workqueue during mon_client shutdown to
ensure that any work affecting our embedded ceph_connection is
finished so that we can be safely destroyed.

Previously, we were flushing the work queue after osd_client
shutdown and before mon_client shutdown to ensure that any osd
connection refs to authorizers are flushed.  Remove the redundant
flush, and document in the comment that the mon_client flush is
needed to cover that case as well.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agorbd: Clear ceph_msg->bio_iter for retransmitted message
Yan, Zheng [Thu, 7 Jun 2012 00:35:55 +0000 (19:35 -0500)]
rbd: Clear ceph_msg->bio_iter for retransmitted message

(cherry picked from commit 43643528cce60ca184fe8197efa8e8da7c89a037)
(cherry picked from commit b132cf4c733f91bb4dd2277ea049243cf16e8b66)

The bug can cause NULL pointer dereference in write_partial_msg_pages

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agolibceph: use con get/put ops from osd_client
Sage Weil [Fri, 1 Jun 2012 03:22:18 +0000 (20:22 -0700)]
libceph: use con get/put ops from osd_client

(cherry picked from commit 0d47766f14211a73eaf54cab234db134ece79f49)

There were a few direct calls to ceph_con_{get,put}() instead of the con
ops from osd_client.c.  This is a bug since those ops aren't defined to
be ceph_con_get/put.

This breaks refcounting on the ceph_osd structs that contain the
ceph_connections, and could lead to all manner of strangeness.

The purpose of the ->get and ->put methods in a ceph connection is
to allow the connection to indicate it has a reference to something
external to the messaging system, *not* to indicate something
external has a reference to the connection.

[elder@inktank.com: added that last sentence]

Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 88ed6ea0b295f8e2383d599a04027ec596cdf97b)

11 years agolibceph: osd_client: don't drop reply reference too early
Alex Elder [Mon, 4 Jun 2012 19:43:32 +0000 (14:43 -0500)]
libceph: osd_client: don't drop reply reference too early

(cherry picked from commit ab8cb34a4b2f60281a4b18b1f1ad23bc2313d91b)

In ceph_osdc_release_request(), a reference to the r_reply message
is dropped.  But just after that, that same message is revoked if it
was in use to receive an incoming reply.  Reorder these so we are
sure we hold a reference until we're actually done with the message.
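
The ordering issue can be shown with a small user-space sketch.  It
assumes a plain integer reference count and hypothetical helpers
(msg_new(), msg_revoke(), msg_put(), request_release_reply()); the
real code uses the messenger's own revoke and put routines.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted message. */
struct msg {
	int refcount;
	int in_use;	/* nonzero while a receive is still filling it in */
};

/* Allocate a message holding one reference; NULL on failure. */
struct msg *msg_new(void)
{
	struct msg *m = calloc(1, sizeof(*m));

	if (m)
		m->refcount = 1;
	return m;
}

/* Detach the message from any in-progress receive. */
void msg_revoke(struct msg *m)
{
	assert(m->refcount > 0);	/* caller must still hold a reference */
	m->in_use = 0;
}

/* Drop one reference; returns 1 if the message was freed. */
int msg_put(struct msg *m)
{
	assert(m->refcount > 0);
	if (--m->refcount > 0)
		return 0;
	free(m);
	return 1;
}

/* Release in the safe order: revoke first, drop the reference last. */
int request_release_reply(struct msg *reply)
{
	msg_revoke(reply);
	return msg_put(reply);
}
```

With the two calls swapped, msg_revoke() could touch memory that
msg_put() had already freed.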

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 680584fab05efff732b5ae16ad601ba994d7b505)

11 years agolibceph: fix pg_temp updates
Sage Weil [Mon, 21 May 2012 16:45:23 +0000 (09:45 -0700)]
libceph: fix pg_temp updates

(cherry picked from commit 6bd9adbdf9ca6a052b0b7455ac67b925eb38cfad)

Usually, we are adding pg_temp entries or removing them.  Occasionally they
update.  In that case, osdmap_apply_incremental() was failing because the
rbtree entry already exists.

Fix by removing the existing entry before inserting a new one.
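
A rough sketch of the remove-then-insert pattern, using a linked list
in place of the rbtree to keep it short.  All names here
(struct pg_mapping, mapping_insert(), mapping_remove(), mapping_set())
are hypothetical and only illustrate the idea.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical mapping entry keyed by pgid. */
struct pg_mapping {
	unsigned int pgid;
	int osd;
	struct pg_mapping *next;
};

/* Insert an entry; fails with -EEXIST if the key is already present. */
int mapping_insert(struct pg_mapping **head, struct pg_mapping *entry)
{
	struct pg_mapping *pg;

	for (pg = *head; pg; pg = pg->next)
		if (pg->pgid == entry->pgid)
			return -EEXIST;
	entry->next = *head;
	*head = entry;
	return 0;
}

/* Unlink and free the entry for pgid, if there is one. */
void mapping_remove(struct pg_mapping **head, unsigned int pgid)
{
	struct pg_mapping **p;

	for (p = head; *p; p = &(*p)->next) {
		if ((*p)->pgid == pgid) {
			struct pg_mapping *old = *p;

			*p = old->next;
			free(old);
			return;
		}
	}
}

/* Add or update: drop any existing entry so the insert cannot collide. */
int mapping_set(struct pg_mapping **head, unsigned int pgid, int osd)
{
	struct pg_mapping *pg = malloc(sizeof(*pg));
	int ret;

	if (!pg)
		return -ENOMEM;
	pg->pgid = pgid;
	pg->osd = osd;
	pg->next = NULL;
	mapping_remove(head, pgid);
	ret = mapping_insert(head, pg);
	assert(ret == 0);	/* key was just removed */
	return ret;
}
```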

Fixes http://tracker.newdream.net/issues/2446

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agolibceph: avoid unregistering osd request when not registered
Sage Weil [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
libceph: avoid unregistering osd request when not registered

(cherry picked from commit 35f9f8a09e1e88e31bd34a1e645ca0e5f070dd5c)

There is a race between two __unregister_request() callers: the
reply path and the ceph_osdc_wait_request().  If we get a reply
*and* the timeout expires at roughly the same time, both callers
will try to unregister the request, and the second one will do bad
things.

Simply check if the request is already unregistered; if so,
return immediately and do nothing.
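
In outline, the guard looks like the sketch below.  The struct
layouts and function names are hypothetical, and the sketch assumes
both callers are already serialized by a lock, so a plain flag is
enough.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request and client bookkeeping. */
struct request {
	bool registered;
};

struct client {
	int num_requests;
};

void register_request(struct client *c, struct request *req)
{
	assert(!req->registered);
	req->registered = true;
	c->num_requests++;
}

/* Safe to call twice: the second caller finds nothing left to do. */
void unregister_request(struct client *c, struct request *req)
{
	if (!req->registered)
		return;
	req->registered = false;
	c->num_requests--;
}
```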

Fixes http://tracker.newdream.net/issues/2420

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: add auth buf in prepare_write_connect()
Alex Elder [Wed, 16 May 2012 20:16:39 +0000 (15:16 -0500)]
ceph: add auth buf in prepare_write_connect()

(cherry picked from commit 3da54776e2c0385c32d143fd497a7f40a88e29dd)

Move the addition of the authorizer buffer to a connection's
out_kvec out of get_connect_authorizer() and into its caller.  This
way, the caller--prepare_write_connect()--can avoid adding the
connect header to out_kvec before it has been fully initialized.

Prior to this patch, it was possible for a connect header to be
sent over the wire before the authorizer protocol or buffer length
fields were initialized.  An authorizer buffer associated with that
header could also be queued to send only after the connection header
that describes it was on the wire.

Fixes http://tracker.newdream.net/issues/2424

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: rename prepare_connect_authorizer()
Alex Elder [Wed, 16 May 2012 20:16:39 +0000 (15:16 -0500)]
ceph: rename prepare_connect_authorizer()

(cherry picked from commit dac1e716c60161867a47745bca592987ca3a9cb2)

Change the name of prepare_connect_authorizer().  The next
patch is going to make this function no longer add anything to the
connection's out_kvec, so it will no longer fit the pattern of
the rest of the prepare_connect_*() functions.

In addition, pass the address of a variable that will hold the
authorization protocol to use.  Move the assignment of that to the
connection's out_connect structure into prepare_write_connect().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: return pointer from prepare_connect_authorizer()
Alex Elder [Wed, 16 May 2012 20:16:39 +0000 (15:16 -0500)]
ceph: return pointer from prepare_connect_authorizer()

(cherry picked from commit 729796be9190f57ca40ccca315e8ad34a1eb8fef)

Change prepare_connect_authorizer() so it returns a pointer (or
pointer-coded error).
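
For readers unfamiliar with pointer-coded errors, here is a user-space
approximation of the idiom.  The helpers err_ptr(), ptr_err() and
is_err() imitate the kernel's ERR_PTR(), PTR_ERR() and IS_ERR();
struct authorizer and get_authorizer() are hypothetical and do not
reflect the real function's signature.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* True if ptr lies in the top MAX_ERRNO values of the address space. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Recover the errno value from an error pointer. */
static inline long ptr_err(const void *ptr)
{
	assert(is_err(ptr));
	return (long)(intptr_t)ptr;
}

/* Hypothetical authorizer object, for illustration only. */
struct authorizer {
	int protocol;
};

static struct authorizer sample_authorizer = { .protocol = 2 };

/* Returns a valid pointer on success or an encoded errno on failure. */
struct authorizer *get_authorizer(int have_ticket)
{
	if (!have_ticket)
		return err_ptr(-EACCES);
	return &sample_authorizer;
}
```

A caller tests the result with is_err() before dereferencing it, and
propagates ptr_err() when the test succeeds.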

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: use info returned by get_authorizer
Alex Elder [Wed, 16 May 2012 20:16:39 +0000 (15:16 -0500)]
ceph: use info returned by get_authorizer

(cherry picked from commit 8f43fb53894079bf0caab6e348ceaffe7adc651a)

Rather than passing a bunch of arguments to be filled in with the
content of the ceph_auth_handshake buffer now returned by the
get_authorizer method, just use the returned information in the
caller, and drop the unnecessary arguments.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: have get_authorizer methods return pointers
Alex Elder [Wed, 16 May 2012 20:16:39 +0000 (15:16 -0500)]
ceph: have get_authorizer methods return pointers

(cherry picked from commit a3530df33eb91d787d08c7383a0a9982690e42d0)

Have the get_authorizer auth_client method return a ceph_auth
pointer rather than an integer, pointer-encoding any returned
error value.  This is to pave the way for making use of the
returned value in an upcoming patch.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: ensure auth ops are defined before use
Alex Elder [Wed, 16 May 2012 20:16:39 +0000 (15:16 -0500)]
ceph: ensure auth ops are defined before use

(cherry picked from commit a255651d4cad89f1a606edd36135af892ada4f20)

In the create_authorizer method for both the mds and osd clients,
the auth_client->ops pointer is blindly dereferenced.  There is no
obvious guarantee that this pointer has been assigned.  And
furthermore, even if the ops pointer is non-null there is definitely
no guarantee that the create_authorizer or destroy_authorizer
methods are defined.

Add checks in both routines to make sure they are defined (non-null)
before use.  Add similar checks in a few other spots in these files
while we're at it.
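
The kind of check described above might look like this sketch.  The
struct definitions are simplified, hypothetical versions of the real
ones, sample_create() exists only to exercise the wrapper, and the
-ENOENT return value is an assumption rather than what the patch
uses.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct auth_client;

/* Hypothetical, reduced operations vector. */
struct auth_client_ops {
	int (*create_authorizer)(struct auth_client *ac);
	void (*destroy_authorizer)(struct auth_client *ac);
};

struct auth_client {
	const struct auth_client_ops *ops;
	int authorizers;
};

/* Returns the method's result, or -ENOENT if the method is missing. */
int auth_create_authorizer(struct auth_client *ac)
{
	assert(ac != NULL);
	if (!ac->ops || !ac->ops->create_authorizer)
		return -ENOENT;
	return ac->ops->create_authorizer(ac);
}

/* Does nothing if no destroy method has been provided. */
void auth_destroy_authorizer(struct auth_client *ac)
{
	assert(ac != NULL);
	if (ac->ops && ac->ops->destroy_authorizer)
		ac->ops->destroy_authorizer(ac);
}

/* Sample method used to exercise the wrappers. */
int sample_create(struct auth_client *ac)
{
	ac->authorizers++;
	return 0;
}
```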

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: messenger: reduce args to create_authorizer
Alex Elder [Wed, 16 May 2012 20:16:39 +0000 (15:16 -0500)]
ceph: messenger: reduce args to create_authorizer

(cherry picked from commit 74f1869f76d043bad12ec03b4d5f04a8c3d1f157)

Make use of the new ceph_auth_handshake structure in order to reduce
the number of arguments passed to the create_authorizer method in
ceph_auth_client_ops.  Use a local variable of that type as a
shorthand in the get_authorizer method definitions.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: define ceph_auth_handshake type
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: define ceph_auth_handshake type

(cherry picked from commit 6c4a19158b96ea1fb8acbe0c1d5493d9dcd2f147)

The definitions for the ceph_mds_session and ceph_osd both contain
five fields related only to "authorizers."  Encapsulate those fields
into their own struct type, allowing for better isolation in some
upcoming patches.

Fix the #includes in "linux/ceph/osd_client.h" to lay out their more
complete canonical path.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: messenger: check return from get_authorizer
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: messenger: check return from get_authorizer

(cherry picked from commit ed96af646011412c2bf1ffe860db170db355fae5)

In prepare_connect_authorizer(), a connection's get_authorizer
method is called but its return value is ignored.  This method can
return an error, so check for it and return it if that ever occurs.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: messenger: rework prepare_connect_authorizer()
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: messenger: rework prepare_connect_authorizer()

(cherry picked from commit b1c6b9803f5491e94041e6da96bc9dec3870e792)

Change prepare_connect_authorizer() so it returns without dropping
the connection mutex if the connection has no get_authorizer method.

Use the symbolic CEPH_AUTH_UNKNOWN instead of 0 when assigning
authorization protocols.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: messenger: check prepare_write_connect() result
Alex Elder [Thu, 17 May 2012 02:51:59 +0000 (21:51 -0500)]
ceph: messenger: check prepare_write_connect() result

(cherry picked from commit 5a0f8fdd8a0ebe320952a388331dc043d7e14ced)

prepare_write_connect() can return an error, but only one of its
callers checks for it.  All the rest are in functions that already
return errors, so it should be fine to return the error if one
gets returned.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: don't set WRITE_PENDING too early
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: don't set WRITE_PENDING too early

(cherry picked from commit e10c758e4031a801ea4d2f8fb39bf14c2658d74b)

prepare_write_connect() prepares a connect message, then sets
WRITE_PENDING on the connection.  Then *after* this, it calls
prepare_connect_authorizer(), which updates the content of the
connection buffer already queued for sending.  It's also possible it
will result in prepare_write_connect() returning -EAGAIN despite the
WRITE_PENDING bit getting set.

Fix this by preparing the connect authorizer first, setting the
WRITE_PENDING bit only after that is done.

Partially addresses http://tracker.newdream.net/issues/2424

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: drop msgr argument from prepare_write_connect()
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: drop msgr argument from prepare_write_connect()

(cherry picked from commit e825a66df97776d30a48a187e3a986736af43945)

In all cases, the value passed as the msgr argument to
prepare_write_connect() is just con->msgr.  Just get the msgr
value from the ceph connection and drop the unneeded argument.

The only msgr passed to prepare_write_banner() is also therefore
just the one from con->msgr, so change that function to drop the
msgr argument as well.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: messenger: send banner in process_connect()
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: messenger: send banner in process_connect()

(cherry picked from commit 41b90c00858129f52d08e6a05c9cfdb0f2bd074d)

prepare_write_connect() has an argument indicating whether a banner
should be sent out before sending out a connection message.  It's
only ever set in one of its callers, so move the code that arranges
to send the banner into that caller and drop the "include_banner"
argument from prepare_write_connect().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: messenger: reset connection kvec caller
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: messenger: reset connection kvec caller

(cherry picked from commit 84fb3adf6413862cff51d8af3fce5f0b655586a2)

Reset a connection's kvec fields in the caller rather than in
prepare_write_connect().   This ends up repeating a few lines of
code but it's improving the separation between distinct operations
on the connection, which we can take advantage of later.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agolibceph: don't reset kvec in prepare_write_banner()
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
libceph: don't reset kvec in prepare_write_banner()

(cherry picked from commit d329156f16306449c273002486c28de3ddddfd89)

Move the kvec reset for a connection out of prepare_write_banner and
into its only caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: messenger: change read_partial() to take "end" arg
Alex Elder [Thu, 10 May 2012 15:29:50 +0000 (10:29 -0500)]
ceph: messenger: change read_partial() to take "end" arg

(cherry picked from commit fd51653f78cf40a0516e521b6de22f329c5bad8d)

Make the second argument to read_partial() be the ending input byte
position rather than the beginning offset it now represents.  This
amounts to moving the addition "to + size" into the caller.
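
A user-space sketch of the resulting calling convention is shown
below.  It assumes a connection that tracks its input position in
in_base_pos; struct stream and stream_recv() are hypothetical
substitutes for the socket receive path, and the structure of the
loop is an approximation, not the patched code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical input stream: hands out at most 'chunk' bytes per call. */
struct stream {
	const unsigned char *data;
	int len;
	int pos;
	int chunk;
};

struct connection {
	struct stream in;
	int in_base_pos;	/* bytes consumed so far in this phase */
};

/* Copy up to 'want' bytes; returns the count, or 0 when no data is left. */
static int stream_recv(struct stream *s, void *buf, int want)
{
	int n = s->len - s->pos;

	if (n > s->chunk)
		n = s->chunk;
	if (n > want)
		n = want;
	memcpy(buf, s->data + s->pos, (size_t)n);
	s->pos += n;
	return n;
}

/*
 * Read until in_base_pos reaches 'end', the position just past 'object'.
 * Returns 1 when the object is complete, 0 if input ran dry first.
 */
int read_partial(struct connection *con, int end, int size, void *object)
{
	while (con->in_base_pos < end) {
		int left = end - con->in_base_pos;
		int have = size - left;
		int ret;

		assert(have >= 0);
		ret = stream_recv(&con->in, (unsigned char *)object + have, left);
		if (ret <= 0)
			return ret;
		con->in_base_pos += ret;
	}
	return 1;
}
```

The caller keeps a running end position and adds each object's size to
it before the call, which is the "to + size" addition moved out of
read_partial().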

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: messenger: update "to" in read_partial() caller
Alex Elder [Thu, 10 May 2012 15:29:50 +0000 (10:29 -0500)]
ceph: messenger: update "to" in read_partial() caller

(cherry picked from commit e6cee71fac27c946a0bbad754dd076e66c4e9dbd)

read_partial() always increases whatever "to" value is supplied by
adding the requested size to it, and that's the only thing it does
with that pointed-to value.

Do that pointer advance in the caller (and then only when the
updated value will be subsequently used), and change the "to"
parameter to be an in-only and non-pointer value.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: messenger: use read_partial() in read_partial_message()
Alex Elder [Thu, 10 May 2012 15:29:50 +0000 (10:29 -0500)]
ceph: messenger: use read_partial() in read_partial_message()

(cherry picked from commit 57dac9d1620942608306d8c17c98a9d1568ffdf4)

There are two blocks of code in read_partial_message()--those that
read the header and footer of the message--that can be replaced by a
call to read_partial().  Do that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoceph: osd_client: fix endianness bug in osd_req_encode_op()
Alex Elder [Fri, 20 Apr 2012 20:49:43 +0000 (15:49 -0500)]
ceph: osd_client: fix endianness bug in osd_req_encode_op()

(cherry picked from commit 065a68f9167e20f321a62d044cb2c3024393d455)

From Al Viro <viro@zeniv.linux.org.uk>

Al Viro noticed that we were using a non-cpu-encoded value in
a switch statement in osd_req_encode_op().  The result would
clearly not work correctly on a big-endian machine.
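
The class of bug can be illustrated with a short sketch.  The opcode
values and the encode_op() and le16_decode() helpers are hypothetical;
the point is only that the switch operates on the CPU-order value
while the little-endian encoding goes to the wire buffer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcode values, for illustration only. */
enum { OP_READ = 0x1201, OP_WRITE = 0x2201 };

/* Decode a 16-bit little-endian wire value into CPU byte order. */
uint16_t le16_decode(const unsigned char wire[2])
{
	return (uint16_t)(wire[0] | (wire[1] << 8));
}

/* Encode 'op' for the wire; returns 1 if the op carries extent data. */
int encode_op(uint16_t op, unsigned char wire[2])
{
	wire[0] = (unsigned char)(op & 0xff);
	wire[1] = (unsigned char)(op >> 8);
	assert(le16_decode(wire) == op);

	switch (op) {	/* CPU-order value, not the encoded bytes */
	case OP_READ:
	case OP_WRITE:
		return 1;
	default:
		return 0;
	}
}
```

Switching on the encoded bytes instead would happen to work on a
little-endian host and match none of the cases on a big-endian one.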

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agocrush: fix memory leak when destroying tree buckets
Sage Weil [Mon, 7 May 2012 22:37:05 +0000 (15:37 -0700)]
crush: fix memory leak when destroying tree buckets

(cherry picked from commit 6eb43f4b5a2a74599b4ff17a97c03a342327ca65)

Reflects ceph.git commit 46d63d98434b3bc9dad2fc9ab23cbaedc3bcb0e4.

Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agocrush: fix tree node weight lookup
Sage Weil [Mon, 7 May 2012 22:36:49 +0000 (15:36 -0700)]
crush: fix tree node weight lookup

(cherry picked from commit f671d4cd9b36691ac4ef42cde44c1b7a84e13631)

Fix the node weight lookup for tree buckets by using a correct accessor.

Reflects ceph.git commit d287ade5bcbdca82a3aef145b92924cf1e856733.

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agocrush: be more tolerant of nonsensical crush maps
Sage Weil [Mon, 7 May 2012 22:35:24 +0000 (15:35 -0700)]
crush: be more tolerant of nonsensical crush maps

(cherry picked from commit a1f4895be8bf1ba56c2306b058f51619e9b0e8f8)

If we get a map that doesn't make sense, error out or ignore the badness
instead of BUGging out.  This reflects the ceph.git commits
9895f0bff7dc68e9b49b572613d242315fb11b6c and
8ded26472058d5205803f244c2f33cb6cb10de79.

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agocrush: adjust local retry threshold
Sage Weil [Mon, 7 May 2012 22:35:09 +0000 (15:35 -0700)]
crush: adjust local retry threshold

(cherry picked from commit c90f95ed46393e29d843686e21947d1c6fcb1164)

This small adjustment reflects a change that was made in ceph.git commit
af6a9f30696c900a2a8bd7ae24e8ed15fb4964bb, about 6 months ago.  An N-1
search is not exhaustive.  Fixed ceph.git bug #1594.

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agocrush: clean up types, const-ness
Sage Weil [Mon, 7 May 2012 22:38:35 +0000 (15:38 -0700)]
crush: clean up types, const-ness

(cherry picked from commit 8b12d47b80c7a34dffdd98244d99316db490ec58)

Move various types from int -> __u32 (or similar), and add const as
appropriate.

This reflects changes that have been present in the userland implementation
for some time.

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoselinux: fix sel_netnode_insert() suspicious rcu dereference
Dave Jones [Fri, 9 Nov 2012 00:09:27 +0000 (16:09 -0800)]
selinux: fix sel_netnode_insert() suspicious rcu dereference

commit 88a693b5c1287be4da937699cb82068ce9db0135 upstream.

===============================
[ INFO: suspicious RCU usage. ]
3.5.0-rc1+ #63 Not tainted
-------------------------------
security/selinux/netnode.c:178 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
1 lock held by trinity-child1/8750:
 #0:  (sel_netnode_lock){+.....}, at: [<ffffffff812d8f8a>] sel_netnode_sid+0x16a/0x3e0

stack backtrace:
Pid: 8750, comm: trinity-child1 Not tainted 3.5.0-rc1+ #63
Call Trace:
 [<ffffffff810cec2d>] lockdep_rcu_suspicious+0xfd/0x130
 [<ffffffff812d91d1>] sel_netnode_sid+0x3b1/0x3e0
 [<ffffffff812d8e20>] ? sel_netnode_find+0x1a0/0x1a0
 [<ffffffff812d24a6>] selinux_socket_bind+0xf6/0x2c0
 [<ffffffff810cd1dd>] ? trace_hardirqs_off+0xd/0x10
 [<ffffffff810cdb55>] ? lock_release_holdtime.part.9+0x15/0x1a0
 [<ffffffff81093841>] ? lock_hrtimer_base+0x31/0x60
 [<ffffffff812c9536>] security_socket_bind+0x16/0x20
 [<ffffffff815550ca>] sys_bind+0x7a/0x100
 [<ffffffff816c03d5>] ? sysret_check+0x22/0x5d
 [<ffffffff810d392d>] ? trace_hardirqs_on_caller+0x10d/0x1a0
 [<ffffffff8133b09e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff816c03a9>] system_call_fastpath+0x16/0x1b

This patch below does what Paul McKenney suggested in the previous thread.

Signed-off-by: Dave Jones <davej@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoreiserfs: Protect reiserfs_quota_write() with write lock
Jan Kara [Tue, 13 Nov 2012 17:25:38 +0000 (18:25 +0100)]
reiserfs: Protect reiserfs_quota_write() with write lock

commit 361d94a338a3fd0cee6a4ea32bbc427ba228e628 upstream.

Calls into reiserfs journalling code and reiserfs_get_block() need to
be protected with the write lock. We remove the write lock around calls
to high-level quota code in the next patch, so these paths would suddenly
become unprotected.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoreiserfs: Move quota calls out of write lock
Jan Kara [Tue, 13 Nov 2012 16:05:14 +0000 (17:05 +0100)]
reiserfs: Move quota calls out of write lock

commit 7af11686933726e99af22901d622f9e161404e6b upstream.

Calls into high-level quota code cannot happen under the write lock. These
calls take dqio_mutex, which ranks above the write lock. So drop the write
lock before calling back into quota code.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoreiserfs: Protect reiserfs_quota_on() with write lock
Jan Kara [Tue, 13 Nov 2012 15:34:17 +0000 (16:34 +0100)]
reiserfs: Protect reiserfs_quota_on() with write lock

commit b9e06ef2e8706fe669b51f4364e3aeed58639eb2 upstream.

In reiserfs_quota_on() we do quite a bit of work, for example unpacking
the tail of a quota file. Thus we have to hold the write lock until the
moment we call back into the quota code.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoreiserfs: Fix lock ordering during remount
Jan Kara [Tue, 13 Nov 2012 13:55:52 +0000 (14:55 +0100)]
reiserfs: Fix lock ordering during remount

commit 3bb3e1fc47aca554e7e2cc4deeddc24750987ac2 upstream.

When remounting reiserfs, dquot_suspend() or dquot_resume() can be called.
These functions take dqonoff_mutex, which ranks above the write lock, so we
have to drop the write lock before calling into quota code.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoNFS: Wait for session recovery to finish before returning
Bryan Schumaker [Tue, 30 Oct 2012 20:06:35 +0000 (16:06 -0400)]
NFS: Wait for session recovery to finish before returning

commit 399f11c3d872bd748e1575574de265a6304c7c43 upstream.

Currently, we will schedule session recovery and then return to the
caller of nfs4_handle_exception.  This works for most cases, but causes
a hang on the following test case:

Client                          Server
------                          ------
Open file over NFS v4.1
Write to file
                                Expire client
Try to lock file

The server will return NFS4ERR_BADSESSION, prompting the client to
schedule recovery.  However, the client will continue placing lock
attempts and the open recovery never seems to be scheduled.  The
simplest solution is to wait for session recovery to run before retrying
the lock.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agodrm/i915: fix overlay on i830M
Daniel Vetter [Mon, 22 Oct 2012 10:55:55 +0000 (12:55 +0200)]
drm/i915: fix overlay on i830M

commit a9193983f4f292a82a00c72971c17ec0ee8c6c15 upstream.

The overlay on the i830M has a peculiar failure mode: It works the
first time around after boot-up, but consistently hangs the second time
it's used.

Chris Wilson has dug out a nice errata:

"1.5.12 Clock Gating Disable for Display Register
Address Offset: 06200h–06203h

"Bit 3
Ovrunit Clock Gating Disable.
0 = Clock gating controlled by unit enabling logic
1 = Disable clock gating function
DevALM Errata ALM049: Overlay Clock Gating Must be Disabled:  Overlay
& L2 Cache clock gating must be disabled in order to prevent device
hangs when turning off overlay. SW must turn off Ovrunit clock gating
(6200h) and L2 Cache clock gating (C8h)."

Now I couldn't find that 0xc8 register anywhere and hence couldn't apply
the L2 cache workaround. But I remembered that part of the magic that
the OVERLAY_ON/OFF commands are supposed to do is to rearrange cache
allocations so that the overlay scaler has some scratch space.

And while pondering how that could explain the hang the 2nd time we
enable the overlay, I've remembered that the old ums overlay code did
_not_ issue the OVERLAY_OFF cmd.

And indeed, disabling the OFF cmd results in the overlay working
flawlessly, so I guess we can work around the lack of the above
workaround by simply never disabling the overlay engine once it's
enabled.

Note that we have the first part of the above w/a already implemented
in i830_init_clock_gating - leave that as-is to avoid surprises.

v2: Add a comment in the code.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=47827
Tested-by: Rhys <rhyspuk@gmail.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
[bwh: Backported to 3.2:
 - Adjust context
 - s/intel_ring_emit(ring, /OUT_RING(/]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agos390/signal: set correct address space control
Martin Schwidefsky [Wed, 7 Nov 2012 09:44:08 +0000 (10:44 +0100)]
s390/signal: set correct address space control

commit fa968ee215c0ca91e4a9c3a69ac2405aae6e5d2f upstream.

If user space is running in primary mode it can switch to secondary
or access register mode; this is used e.g. in the clock_gettime code
of the vdso. If a signal is delivered to the user space process while
it is running in access register mode, the signal handler is
executed in access register mode as well, which will result in a crash
most of the time.

Set the address space control bits in the PSW to the default for the
execution of the signal handler and make sure that the previous
address space control is restored on signal return. Take care
that user space can not switch to the kernel address space by
modifying the registers in the signal frame.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agosky2: Fix for interrupt handler
Mirko Lindner [Tue, 3 Jul 2012 23:38:46 +0000 (23:38 +0000)]
sky2: Fix for interrupt handler

commit d663d181b9e92d80c2455e460e932d34e7a2a7ae upstream.

Re-enable interrupts if it is not our interrupt

Signed-off-by: Mirko Lindner <mlindner@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoeCryptfs: check for eCryptfs cipher support at mount
Tim Sally [Thu, 12 Jul 2012 23:10:24 +0000 (19:10 -0400)]
eCryptfs: check for eCryptfs cipher support at mount

commit 5f5b331d5c21228a6519dcb793fc1629646c51a6 upstream.

The issue occurs when eCryptfs is mounted with a cipher supported by
the crypto subsystem but not by eCryptfs. The mount succeeds and an
error does not occur until a write. This change checks for eCryptfs
cipher support at mount time.

Resolves Launchpad issue #338914, reported by Tyler Hicks in 03/2009.
https://bugs.launchpad.net/ecryptfs/+bug/338914

Signed-off-by: Tim Sally <tsally@atomicpeace.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoeCryptfs: Copy up POSIX ACL and read-only flags from lower mount
Tyler Hicks [Mon, 11 Jun 2012 22:42:32 +0000 (15:42 -0700)]
eCryptfs: Copy up POSIX ACL and read-only flags from lower mount

commit 069ddcda37b2cf5bb4b6031a944c0e9359213262 upstream.

When the eCryptfs mount options do not include '-o acl', but the lower
filesystem's mount options do include 'acl', the MS_POSIXACL flag is not
flipped on in the eCryptfs super block flags. This flag is what the VFS
checks in do_last() when deciding if the current umask should be applied
to a newly created inode's mode or not. When a default POSIX ACL mask is
set on a directory, the current umask is incorrectly applied to new
inodes created in the directory. This patch ignores the MS_POSIXACL flag
passed into ecryptfs_mount() and sets the flag on the eCryptfs super
block depending on the flag's presence on the lower super block.

Additionally, it is incorrect to allow a writeable eCryptfs mount on top
of a read-only lower mount. This missing check did not allow writes to
the read-only lower mount because permissions checks are still performed
on the lower filesystem's objects but it is best to simply not allow a
rw mount on top of ro mount. However, a ro eCryptfs mount on top of a rw
mount is valid and still allowed.
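
The two rules above can be sketched in isolation. This is only an
illustration under stated assumptions: the TOY_* flag bits and
toy_upper_flags() are hypothetical names, not the kernel's MS_* values or
the code in ecryptfs_mount().

```c
/* Hypothetical flag bits for illustration; not the kernel's MS_* values. */
#define TOY_RDONLY   0x1u
#define TOY_POSIXACL 0x2u

/*
 * Compute the flags for the upper (stacked) mount from the flags the
 * caller requested and the flags of the lower mount.
 * Returns 0 on success and stores the result in *out.
 * Returns -1 when a read-write upper mount is requested on top of a
 * read-only lower mount.
 */
int toy_upper_flags(unsigned requested, unsigned lower, unsigned *out)
{
    if ((lower & TOY_RDONLY) && !(requested & TOY_RDONLY))
        return -1;

    /* Ignore the requested ACL bit and inherit it from the lower mount. */
    *out = (requested & ~TOY_POSIXACL) | (lower & TOY_POSIXACL);
    return 0;
}
```

A read-only upper mount on a read-write lower mount passes through
unchanged, matching the last sentence of the paragraph above.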

https://launchpad.net/bugs/1009207

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Reported-by: Stefan Beller <stefanbeller@googlemail.com>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agousb: use usb_serial_put in usb_serial_probe errors
Jan Safrata [Tue, 22 May 2012 12:04:50 +0000 (14:04 +0200)]
usb: use usb_serial_put in usb_serial_probe errors

commit 0658a3366db7e27fa32c12e886230bb58c414c92 upstream.

The use of kfree(serial) in error cases of usb_serial_probe
was invalid - usb_serial structure allocated in create_serial()
gets reference of usb_device that needs to be put, so we need
to use usb_serial_put() instead of simple kfree().
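
A minimal userspace sketch of why a plain free is not enough when the
object holds a reference on something else. The toy_* types and functions
are hypothetical stand-ins, not the usb-serial structures or API.

```c
#include <stdlib.h>

/* Hypothetical stand-ins; not the kernel's usb_device / usb_serial. */
struct toy_device { int refcount; };
struct toy_serial { struct toy_device *dev; int refcount; };

/* Allocate a serial object and take a reference on the device. */
struct toy_serial *toy_create_serial(struct toy_device *dev)
{
    struct toy_serial *s = malloc(sizeof(*s));

    if (!s)
        return NULL;
    dev->refcount++;        /* the serial object pins the device */
    s->dev = dev;
    s->refcount = 1;
    return s;
}

/* Drop one reference; on the last one, release the device reference too. */
void toy_serial_put(struct toy_serial *s)
{
    if (--s->refcount == 0) {
        s->dev->refcount--; /* a bare free(s) would skip this step */
        free(s);
    }
}
```

Calling free() directly on the object in an error path would leave the
device reference count one too high, which is the leak the commit
describes.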

Signed-off-by: Jan Safrata <jan.nikitenko@gmail.com>
Acked-by: Johan Hovold <jhovold@gmail.com>
Cc: Richard Retanubun <richardretanubun@ruggedcom.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agonetfilter: nf_nat: don't check for port change on ICMP tuples
Ulrich Weber [Thu, 25 Oct 2012 05:34:45 +0000 (05:34 +0000)]
netfilter: nf_nat: don't check for port change on ICMP tuples

commit 38fe36a248ec3228f8e6507955d7ceb0432d2000 upstream.

ICMP tuples have id in src and type/code in dst.
So comparing src.u.all with dst.u.all will always fail here
and ip_xfrm_me_harder() is called for every ICMP packet,
even if there was no NAT.
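
A small sketch of the layout problem, assuming simplified unions: the
toy_* types, TOY_IPPROTO_ICMP and toy_port_changed() are hypothetical and
only mimic the shape of the conntrack tuple. Skipping the comparison for
ICMP is one way to express the idea in the subject line.

```c
#include <stdint.h>

#define TOY_IPPROTO_ICMP 1 /* protocol number of ICMP */

/* Source side: ICMP stores the echo id where TCP stores a port. */
union toy_src {
    uint16_t all;
    struct { uint16_t port; } tcp;
    struct { uint16_t id; } icmp;
};

/* Destination side: ICMP stores type/code where TCP stores a port. */
union toy_dst {
    uint16_t all;
    struct { uint16_t port; } tcp;
    struct { uint8_t type, code; } icmp;
};

/*
 * Report whether the "port" differs between the two tuple halves.
 * For ICMP the two fields hold unrelated data, so the comparison is
 * meaningless and is skipped.
 */
int toy_port_changed(uint8_t protonum,
                     const union toy_src *src,
                     const union toy_dst *dst)
{
    if (protonum == TOY_IPPROTO_ICMP)
        return 0;
    return src->all != dst->all;
}
```

With an echo id of 0x1234 in src and type 8, code 0 in dst, the raw
`all` fields differ even though nothing was translated.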

Signed-off-by: Ulrich Weber <ulrich.weber@sophos.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agonetfilter: Mark SYN/ACK packets as invalid from original direction
Jozsef Kadlecsik [Fri, 31 Aug 2012 09:55:53 +0000 (09:55 +0000)]
netfilter: Mark SYN/ACK packets as invalid from original direction

commit 64f509ce71b08d037998e93dd51180c19b2f464c upstream.

Clients should not send such packets. By accepting them, we open
up a hole by which ephemeral ports can be discovered in an off-path
attack.

See: "Reflection scan: an Off-Path Attack on TCP" by Jan Wrobel,
http://arxiv.org/abs/1201.2074

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agonetfilter: Validate the sequence number of dataless ACK packets as well
Jozsef Kadlecsik [Fri, 31 Aug 2012 09:55:54 +0000 (09:55 +0000)]
netfilter: Validate the sequence number of dataless ACK packets as well

commit 4a70bbfaef0361d27272629d1a250a937edcafe4 upstream.

We save nothing by not validating the sequence number of dataless
ACK packets, and enabling the check makes off-path attacks harder.

See: "Reflection scan: an Off-Path Attack on TCP" by Jan Wrobel,
http://arxiv.org/abs/1201.2074

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agor8169: allow multicast packets on sub-8168f chipset.
Nathan Walp [Thu, 1 Nov 2012 12:08:47 +0000 (12:08 +0000)]
r8169: allow multicast packets on sub-8168f chipset.

commit 0481776b7a70f09acf7d9d97c288c3a8403fbfe4 upstream.

RTL_GIGA_MAC_VER_35 includes no multicast hardware filter.

Signed-off-by: Nathan Walp <faceprint@faceprint.com>
Suggested-by: Hayes Wang <hayeswang@realtek.com>
Acked-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agor8169: Fix WoL on RTL8168d/8111d.
Cyril Brulebois [Wed, 31 Oct 2012 14:00:46 +0000 (14:00 +0000)]
r8169: Fix WoL on RTL8168d/8111d.

commit b00e69dee4ccbb3a19989e3d4f1385bc2e3406cd upstream.

This regression was spotted between Debian squeeze and Debian wheezy
kernels (respectively based on 2.6.32 and 3.2). More info about
Wake-on-LAN issues with Realtek's 816x chipsets can be found in the
following thread: http://marc.info/?t=132079219400004

Probable regression from d4ed95d796e5126bba51466dc07e287cebc8bd19;
more chipsets are likely affected.

Tested on top of a 3.2.23 kernel.

Reported-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Tested-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Hinted-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: Cyril Brulebois <kibi@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agoxen/events: fix RCU warning, or Call idle notifier after irq_enter()
Mojiong Qiu [Tue, 6 Nov 2012 08:08:15 +0000 (16:08 +0800)]
xen/events: fix RCU warning, or Call idle notifier after irq_enter()

commit 772aebcefeff310f80e32b874988af0076cb799d upstream.

exit_idle() should be called after irq_enter(), otherwise it throws:

[ INFO: suspicious RCU usage. ]
3.6.5 #1 Not tainted
-------------------------------
include/linux/rcupdate.h:725 rcu_read_lock() used illegally while idle!

other info that might help us debug this:

RCU used illegally from idle CPU!
rcu_scheduler_active = 1, debug_locks = 1
RCU used illegally from extended quiescent state!
1 lock held by swapper/0/0:
 #0:  (rcu_read_lock){......}, at: [<ffffffff810e9fe0>] __atomic_notifier_call_chain+0x0/0x140

stack backtrace:
Pid: 0, comm: swapper/0 Not tainted 3.6.5 #1
Call Trace:
 <IRQ>  [<ffffffff811259a2>] lockdep_rcu_suspicious+0xe2/0x130
 [<ffffffff810ea10c>] __atomic_notifier_call_chain+0x12c/0x140
 [<ffffffff810e9fe0>] ? atomic_notifier_chain_unregister+0x90/0x90
 [<ffffffff811216cd>] ? trace_hardirqs_off+0xd/0x10
 [<ffffffff810ea136>] atomic_notifier_call_chain+0x16/0x20
 [<ffffffff810777c3>] exit_idle+0x43/0x50
 [<ffffffff81568865>] xen_evtchn_do_upcall+0x25/0x50
 [<ffffffff81aa690e>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI>  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
 [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
 [<ffffffff81061540>] ? xen_safe_halt+0x10/0x20
 [<ffffffff81075cfa>] ? default_idle+0xba/0x570
 [<ffffffff810778af>] ? cpu_idle+0xdf/0x140
 [<ffffffff81a4d881>] ? rest_init+0x135/0x144
 [<ffffffff81a4d74c>] ? csum_partial_copy_generic+0x16c/0x16c
 [<ffffffff82520c45>] ? start_kernel+0x3db/0x3e8
 [<ffffffff8252066a>] ? repair_env_string+0x5a/0x5a
 [<ffffffff82520356>] ? x86_64_start_reservations+0x131/0x135
 [<ffffffff82524aca>] ? xen_start_kernel+0x465/0x46

Git commit 98ad1cc14a5c4fd658f9d72c6ba5c86dfd3ce0d5
Author: Frederic Weisbecker <fweisbec@gmail.com>
Date:   Fri Oct 7 18:22:09 2011 +0200

    x86: Call idle notifier after irq_enter()

did this, but it missed the Xen code.

Signed-off-by: Mojiong Qiu <mjqiu@tencent.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agor8169: use unlimited DMA burst for TX
Michal Schmidt [Sun, 9 Sep 2012 13:55:26 +0000 (13:55 +0000)]
r8169: use unlimited DMA burst for TX

commit aee77e4accbeb2c86b1d294cd84fec4a12dde3bd upstream.

The r8169 driver currently limits the DMA burst for TX to 1024 bytes. I have
a box where this prevents the interface from using the gigabit line to its full
potential. This patch solves the problem by setting TX_DMA_BURST to unlimited.

The box has an ASRock B75M motherboard with on-board RTL8168evl/8111evl
(XID 0c900880). TSO is enabled.

I used netperf (TCP_STREAM test) to measure the dependency of TX throughput
on MTU. I did it for three different values of TX_DMA_BURST ('5'=512, '6'=1024,
'7'=unlimited). This chart shows the results:
http://michich.fedorapeople.org/r8169/r8169-effects-of-TX_DMA_BURST.png

Interesting points:
 - With the current DMA burst limit (1024):
   - at the default MTU=1500 I get only 842 Mbit/s.
   - when going from small MTU, the performance rises monotonically with
     increasing MTU only up to a peak at MTU=1076 (908 Mbit/s). Then there's
     a sudden drop to 762 Mbit/s from which the throughput rises monotonically
     again with further MTU increases.
 - With a smaller DMA burst limit (512):
   - there's a similar peak at MTU=1076 and another one at MTU=564.
 - With unlimited DMA burst:
   - at the default MTU=1500 I get a nice 940 Mbit/s.
   - the throughput rises monotonically with increasing MTU with no strange
     peaks.

Notice that the peaks occur at MTU sizes that are multiples of the DMA burst
limit plus 52. Why 52? Because:
  20 (IP header) + 20 (TCP header) + 12 (TCP options) = 52
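
The arithmetic can be checked with a few lines of C. The helper name
toy_peak_mtu() is hypothetical; the header sizes are the ones listed
above.

```c
/* Header sizes quoted in the commit message, in bytes. */
enum {
    IP_HEADER_LEN   = 20,
    TCP_HEADER_LEN  = 20,
    TCP_OPTIONS_LEN = 12,
};

/*
 * MTU at which the data following the headers is exactly `multiple`
 * times the DMA burst limit: multiple * burst_limit + 52.
 */
int toy_peak_mtu(int burst_limit, int multiple)
{
    return burst_limit * multiple
         + IP_HEADER_LEN + TCP_HEADER_LEN + TCP_OPTIONS_LEN;
}
```

toy_peak_mtu(512, 1) gives 564, and both toy_peak_mtu(512, 2) and
toy_peak_mtu(1024, 1) give 1076, matching the peaks reported above.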

The Realtek-provided r8168 driver (v8.032.00) uses unlimited TX DMA burst too,
except for CFG_METHOD_1 where the TX DMA burst is set to 512 bytes.
CFG_METHOD_1 appears to be the oldest MAC version of "RTL8168B/8111B",
i.e. RTL_GIGA_MAC_VER_11 in r8169. Not sure if this MAC version really needs
the smaller burst limit, or if any other versions have similar requirements.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Acked-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 years agotmpfs: change final i_blocks BUG to WARNING
Hugh Dickins [Fri, 16 Nov 2012 22:15:04 +0000 (14:15 -0800)]
tmpfs: change final i_blocks BUG to WARNING

commit 0f3c42f522dc1ad7e27affc0a4aa8c790bce0a66 upstream.

Under a particular load on one machine, I have hit shmem_evict_inode()'s
BUG_ON(inode->i_blocks), enough times to narrow it down to a particular
race between swapout and eviction.

It comes from the "if (freed > 0)" asymmetry in shmem_recalc_inode(),
and the lack of coherent locking between mapping's nrpages and shmem's
swapped count.  There's a window in shmem_writepage(), between lowering
nrpages in shmem_delete_from_page_cache() and then raising swapped
count, when the freed count appears to be +1 when it should be 0, and
then the asymmetry stops it from being corrected with -1 before hitting
the BUG.
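
A toy model of the asymmetry, assuming the freed count is computed as
alloced minus swapped minus nrpages. The struct and function names are
hypothetical; this is not the shmem code and it ignores locking and
i_blocks accounting entirely.

```c
/* Toy model of the three counters named above. */
struct toy_shmem {
    long alloced;
    long swapped;
    long nrpages;
};

/*
 * One-sided correction: a positive freed count lowers alloced, but a
 * negative one is never added back. Returns the freed count it saw.
 */
long toy_recalc(struct toy_shmem *s)
{
    long freed = s->alloced - s->swapped - s->nrpages;

    if (freed > 0)
        s->alloced -= freed;
    return freed;
}
```

Starting from one allocated page in the page cache, lowering nrpages and
calling toy_recalc() before swapped is raised returns +1 and drops
alloced to 0. Raising swapped and calling it again returns -1, and
alloced stays at 0.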

One answer is coherent locking: using tree_lock throughout, without
info->lock; reasonable, but the raw_spin_lock in percpu_counter_add() on
used_blocks makes that messier than expected.  Another answer may be a
further effort to eliminate the weird shmem_recalc_inode() altogether,
but previous attempts at that failed.

So far undecided, but for now change the BUG_ON to WARN_ON: in usual
circumstances it remains a useful consistency check.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>