Jason Gunthorpe [Sun, 29 Jul 2018 08:34:58 +0000 (11:34 +0300)]
IB/ipoib: Do not remove child devices from within the ndo_uninit
Switching to priv_destructor and needs_free_netdev created a subtle
ordering problem in ipoib_remove_one.
Now that unregister_netdev frees the netdev and priv we must ensure that
the children are unregistered before trying to unregister the parent,
or the child unregister will hit a use-after-free.
The solution is to unregister the children, then parent, in the same batch
all while holding the rtnl_lock. This closes all the races where a new
child could have been added and ensures proper ordering.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Jason Gunthorpe [Sun, 29 Jul 2018 08:34:57 +0000 (11:34 +0300)]
IB/ipoib: Get rid of the sysfs_mutex
This mutex was introduced to deal with the deadlock formed by calling
unregister_netdev from within the sysfs callback of a netdev.
Now that we have priv_destructor and needs_free_netdev we can switch
to the more targeted solution of running the unregister from a
work queue. This avoids the deadlock and gets rid of the mutex.
The next patch in the series needs this mutex eliminated to create
atomicity of unregistration.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Jason Gunthorpe [Sun, 29 Jul 2018 08:34:56 +0000 (11:34 +0300)]
RDMA/netdev: Use priv_destructor for netdev cleanup
Now that the unregister_netdev flow for IPoIB no longer relies on external
code we can now introduce the use of priv_destructor and
needs_free_netdev.
The rdma_netdev flow is switched to use the netdev common priv_destructor
instead of the special free_rdma_netdev and the IPOIB ULP adjusted:
- priv_destructor needs to switch to point to the ULP's destructor
which will then call the rdma_ndev's in the right order
- We need to be careful around the error unwind of register_netdev
as it sometimes calls priv_destructor on failure
- ULPs need to use ndo_init/uninit to ensure proper ordering
of failures around register_netdev
Switching to priv_destructor is a necessary pre-requisite to using
the rtnl new_link mechanism.
The VNIC user for rdma_netdev should also be revised, but that is left for
another patch.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Denis Drozdov <denisd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Jason Gunthorpe [Sun, 29 Jul 2018 08:34:55 +0000 (11:34 +0300)]
IB/ipoib: Move init code to ndo_init
Now that we have a proper ndo_uninit, move code that naturally pairs
with the ndo_uninit into ndo_init. This allows the netdev core to naturally
handle ordering.
This fixes the situation where register_netdev can fail before calling
ndo_init, in which case it wouldn't call ndo_uninit either.
Also move a bunch of duplicated init code that is shared between child
and parent for clarity. Now the child and parent register functions look
very similar.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Jason Gunthorpe [Sun, 29 Jul 2018 08:34:53 +0000 (11:34 +0300)]
IB/ipoib: Move all uninit code into ndo_uninit
Currently uninit is sometimes done twice in error flows, and is sprinkled
a bit all over the place.
Improve the clarity of the design by moving all uninit into
ndo_uninit.
Some duplication is removed:
- Sometimes IPOIB_STOP_NEIGH_GC was done before unregister, but
this duplicates the process in ipoib_neigh_hash_init
- Flushing priv->wq was sometimes done before unregister,
but that duplicates what has been done in ndo_uninit
Uniniting the IB event queue must remain before unregister_netdev as it
requires the RTNL lock to be dropped, this is moved to a helper to make
that flow really clear and remove some duplication in error flows.
If register_netdev fails (and ndo_init is NULL) then it almost always
calls ndo_uninit, which lets us remove all the extra code from the error
unwinds. The next patch in the series will close the 'almost always' hole
by pairing a proper ndo_init with ndo_uninit.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Erez Shitrit [Sun, 29 Jul 2018 08:34:52 +0000 (11:34 +0300)]
IB/ipoib: Use cancel_delayed_work_sync for neigh-clean task
The neigh_reap_task is self restarting, but so long as we call
cancel_delayed_work_sync() it will be guaranteed to not be running and
never start again. Thus we don't need to have the racy
IPOIB_STOP_NEIGH_GC bit, or the confusing mismatch of places sometimes
calling flush_workqueue after the cancel.
This fixes a situation where the GC work could have been left running
in some rare situations.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Sun, 29 Jul 2018 08:34:51 +0000 (11:34 +0300)]
IB/ipoib: Get rid of IPOIB_FLAG_GOING_DOWN
This essentially duplicates the netdev's reg_state, so just use that
directly. The reg_state is updated under the rtnl_lock, and all places
using GOING_DOWN already acquire the rtnl_lock so checking is safe.
Since the only place we use GOING_DOWN is for the parent device this
does not fix any bugs, but it is a step to tidy up the unregister flow
so that after later patches the flow is uniform and sane.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Potnuri Bharat Teja [Thu, 2 Aug 2018 06:03:04 +0000 (11:33 +0530)]
iw_cxgb4: Support FW write completion WR
To optimize NVME-oF READ IOPs, use a specialized WQE that combines
the RDMA WRITE and SEND_INV WR chain submitted by the NVME-oF target
driver.
This reduces uP overhead per NVME-oF IO, and results in over 10%
improvement in NVME-oF 4K READ IOPs.
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Potnuri Bharat Teja [Thu, 2 Aug 2018 06:03:03 +0000 (11:33 +0530)]
iw_cxgb4: RDMA write with immediate support
Adds iw_cxgb4 functionality to support the RDMA_WRITE_WITH_IMMEDIATE opcode.
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Dan Carpenter [Thu, 2 Aug 2018 07:56:13 +0000 (10:56 +0300)]
rdma/cxgb4: fix some info leaks
In c4iw_create_qp() there are several struct members which potentially
aren't initialized, like uresp.rq_key. I've fixed this code before in
commit ae1fe07f3f42 ("RDMA/cxgb4: Fix stack info leak in
c4iw_create_qp()"), so this time I'm just going to take a big hammer
approach and memset the whole struct to zero. Hopefully, it will stay
fixed this time.
In c4iw_create_srq() we don't clear uresp.reserved.
Fixes: 6a0b6174d35a ("rdma/cxgb4: Add support for kernel mode SRQ's")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
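The "big hammer" approach described in the commit above can be sketched in plain userspace C. This is a minimal illustration only: struct demo_uresp and demo_fill_uresp() are hypothetical names, not the real cxgb4 response layout or code. The point is that zeroing the whole struct first means a field that some path forgets to set (or a reserved field nobody sets) can never carry stale stack contents out to userspace.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical response layout; NOT the real cxgb4 uresp struct. */
struct demo_uresp {
	uint32_t qid;
	uint32_t rq_key;    /* only filled on some paths */
	uint32_t reserved;  /* never filled explicitly */
};

/*
 * Zero the whole response first, then set only the fields this path
 * knows about. Anything left untouched is guaranteed to be zero.
 */
void demo_fill_uresp(struct demo_uresp *uresp, uint32_t qid, int has_rq)
{
	memset(uresp, 0, sizeof(*uresp));
	uresp->qid = qid;
	if (has_rq)
		uresp->rq_key = 0x1234; /* arbitrary illustrative value */
}
```

A caller would declare the struct on its stack, call demo_fill_uresp(), and only then copy it out; the fields skipped by the has_rq == 0 path read back as zero regardless of what was on the stack before.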
Yixian Liu [Thu, 2 Aug 2018 02:38:05 +0000 (10:38 +0800)]
RDMA/hns: Support flush cqe for hip08 in kernel space
According to the IB protocol, there are cases in which work requests must
return the flush error completion status through the completion queue. Due
to a hardware limitation, the driver needs to assist the flush process.
This patch adds flush cqe support for hip08 where it is needed, such as
poll cqe, post send, post recv and aeqe handling.
The patch also takes compatibility between kernel and user space into account.
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Denis Drozdov [Sun, 29 Jul 2018 08:42:28 +0000 (11:42 +0300)]
IB/IPoIB: Set ah valid flag in multicast send flow
Adding the "valid" flag to the ipoib_ah data structure, together with the
checks of ah->valid in ipoib_start_xmit, affected the multicast packet flow.
Since the multicast flow doesn't invoke path_rec_start, the "ah->valid" flag
remains unset, so ipoib_start_xmit ends up in neigh_refresh_path
instead of sending the packet using neigh.
"ah->valid" has to be set in the multicast send flow. As a result IPoIB
starts sending packets via neigh immediately and eliminates the 60sec delay
of the neigh keep alive interval.
A typical example of this issue is two sequential arpings:
arping 11.134.208.9 -> got response (mcast_send)
arping 11.134.208.9 -> no response (ah->valid = 0)
Fixes: fa9391dbad4b ("RDMA/ipoib: Update paths on CLIENT_REREG/SM_CHANGE events")
Signed-off-by: Denis Drozdov <denisd@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Thu, 26 Jul 2018 03:40:20 +0000 (21:40 -0600)]
IB/uverbs: Allow all DESTROY commands to succeed after disassociate
The disassociate function was broken by design because it failed all
commands. This prevents userspace from calling destroy on a uobject after
it has detected a device fatal error and thus reclaiming the resources in
userspace is prevented.
The fix is now straightforward: when anything other than the user destroys
a uobject, the object remains on the IDR with a NULL context and object
pointer. All lookup locking modes other than DESTROY will fail. When the
user ultimately calls the destroy function it is simply dropped from the
IDR while any related information is returned.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
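The rule stated in the commit above, that every lookup mode other than DESTROY fails once the object has been torn down, can be illustrated with a toy userspace model. Everything here is an assumption made for illustration: the demo_ names, the integer lock fields, and the choice to have the destroy mode take only a reference and no lock. It is not the uverbs implementation.

```c
#include <assert.h>

/* Hypothetical mode names; the kernel's own enum may differ. */
enum demo_lookup_mode {
	DEMO_LOOKUP_READ,     /* shared access */
	DEMO_LOOKUP_WRITE,    /* exclusive access */
	DEMO_LOOKUP_DESTROY,  /* reference only, allowed on dead objects */
};

struct demo_uobject {
	int refs;     /* stands in for a reference count */
	int readers;  /* number of shared holders */
	int writer;   /* 1 while exclusively held */
	int alive;    /* 0 once the underlying object has been destroyed */
};

/* Return 0 on success, -1 if the object cannot be looked up in this mode. */
int demo_lookup_get(struct demo_uobject *uobj, enum demo_lookup_mode mode)
{
	switch (mode) {
	case DEMO_LOOKUP_READ:
		if (!uobj->alive || uobj->writer)
			return -1;
		uobj->readers++;
		break;
	case DEMO_LOOKUP_WRITE:
		if (!uobj->alive || uobj->writer || uobj->readers)
			return -1;
		uobj->writer = 1;
		break;
	case DEMO_LOOKUP_DESTROY:
		/* Toy-model assumption: no read or write lock is taken. */
		break;
	default:
		return -1;
	}
	uobj->refs++;
	return 0;
}
```

In this model a destroy call issued after teardown still finds the object and can drop it, while read and write lookups are refused.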
Jason Gunthorpe [Thu, 26 Jul 2018 03:40:19 +0000 (21:40 -0600)]
IB/uverbs: Do not block disassociate during write()
Now that all the callbacks are safe to run concurrently with
disassociation this test can be eliminated. The ufile core infrastructure
becomes entirely self contained and is not sensitive to disassociation.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Thu, 26 Jul 2018 03:40:18 +0000 (21:40 -0600)]
IB/uverbs: Do not pass struct ib_device to the ioctl methods
This does the same as the patch before, except for ioctl. The rules are
the same, but for the ioctl methods the core code handles setting up the
uobject.
- Retrieve the ib_dev from the uobject->context->device. This is
safe under ioctl as the core has already done rdma_alloc_begin_uobject
and so CREATE calls are entirely protected by the rwsem.
- Retrieve the ib_dev from uobject->object
- Call ib_uverbs_get_ucontext()
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Thu, 26 Jul 2018 03:40:17 +0000 (21:40 -0600)]
IB/uverbs: Do not pass struct ib_device to the write based methods
This is a step to get rid of the global check for disassociation. In this
model, the ib_dev is not proven to be valid by the core code and cannot be
provided to the method. Instead, every method decides if it is able to
run after disassociation and obtains the ib_dev using one of three
different approaches:
- Call srcu_dereference on the udevice's ib_dev. As before, this means
the method cannot be called after disassociation begins.
(eg alloc ucontext)
- Retrieve the ib_dev from the ucontext, via ib_uverbs_get_ucontext()
- Retrieve the ib_dev from the uobject->object after checking
under SRCU if disassociation has started (eg uobj_get)
Largely, the code is all ready for this, the main work is to provide a
ib_dev after calling uobj_alloc(). The few other places simply use
ib_uverbs_get_ucontext() to get the ib_dev.
This flexibility will let the next patches allow destroy to operate
after disassociation.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Thu, 26 Jul 2018 03:40:16 +0000 (21:40 -0600)]
IB/uverbs: Lower the test for ongoing disassociation
Commands that are reading/writing to objects can test for an ongoing
disassociation during their initial call to rdma_lookup_get_uobject. This
directly prevents all of these commands from conflicting with an ongoing
disassociation.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Thu, 26 Jul 2018 03:40:15 +0000 (21:40 -0600)]
IB/uverbs: Allow uobject allocation to work concurrently with disassociate
After all the recent structural changes this is now straightforward: hold
the hw_destroy_rwsem across the entire uobject creation. We already take
this semaphore on the success path, so holding it a bit longer is not
going to change the performance.
After this change none of the create callbacks require the
disassociate_srcu lock to be correct.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Thu, 26 Jul 2018 03:40:14 +0000 (21:40 -0600)]
IB/uverbs: Allow RDMA_REMOVE_DESTROY to work concurrently with disassociate
After all the recent structural changes this is now straightforward: hoist
the hw_destroy_rwsem up out of rdma_destroy_explicit and wrap it around
the uobject write lock as well as the destroy.
This is necessary as obtaining a write lock concurrently with
uverbs_destroy_ufile_hw() will cause malfunction.
After this change none of the destroy callbacks require the
disassociate_srcu lock to be correct.
This requires introducing a new lookup mode, UVERBS_LOOKUP_DESTROY as the
IOCTL interface needs to hold an unlocked kref until all command
verification is completed.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Thu, 26 Jul 2018 03:40:13 +0000 (21:40 -0600)]
IB/uverbs: Convert 'bool exclusive' into an enum
This is more readable, and future patches will need a 3rd lookup type.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Thu, 26 Jul 2018 03:40:12 +0000 (21:40 -0600)]
IB/uverbs: Consolidate uobject destruction
There are several flows that can destroy a uobject and each one is
minimized and sprinkled throughout the code base, making it difficult to
understand and very hard to modify the destroy path.
Consolidate all of these into uverbs_destroy_uobject() and call it in all
cases where a uobject has to be destroyed.
This makes one change to the lifecycle, during any abort (eg when
alloc_commit is not called) we always call out to alloc_abort, even if
remove_commit needs to be called to delete a HW object.
This also renames RDMA_REMOVE_DURING_CLEANUP to RDMA_REMOVE_ABORT to
clarify its actual usage and revises some of the comments to reflect what
the life cycle is for the type implementation.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Thu, 26 Jul 2018 03:40:11 +0000 (21:40 -0600)]
IB/uverbs: Make the write path destroy methods use the same flow as ioctl
The ridiculous dance with uobj_remove_commit() is not needed, the write
path can follow the same flow as ioctl - lock and destroy the HW object
then use the data left over in the uobject to form the response to
userspace.
Two helpers are introduced to make this flow straightforward for the
caller.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Thu, 26 Jul 2018 21:57:56 +0000 (15:57 -0600)]
IB/uverbs: Remove rdma_explicit_destroy() from the ioctl methods
The core code will destroy the HW object on behalf of the method, if the
method provides an implementation it must simply copy data from the stub
uobj into the response. Destroy methods cannot touch the HW object.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Kamal Heib [Tue, 31 Jul 2018 06:02:36 +0000 (09:02 +0300)]
RDMA: Fix return code check in rdma_set_cq_moderation
The proper return code is "-EOPNOTSUPP" when the modify_cq() callback is
not supported. All drivers should generate this and all users should check
for it when detecting unsupported functionality.
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com> (for mlx5)
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Tue, 31 Jul 2018 15:51:30 +0000 (08:51 -0700)]
rdma/cxgb4: Simplify a structure initialization
This patch keeps sparse from reporting the following warning:
drivers/infiniband/hw/cxgb4/qp.c:2269:34: warning: Using plain integer as NULL pointer
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Acked-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Tue, 31 Jul 2018 15:25:41 +0000 (08:25 -0700)]
rdma/cxgb4: Fix SRQ endianness annotations
This patch keeps sparse from complaining about casts to restricted __be32.
Fixes: a3cdaa69e4ae ("cxgb4: Adds CPL support for Shared Receive Queues")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Tue, 31 Jul 2018 15:08:15 +0000 (08:08 -0700)]
rdma/cxgb4: Remove a set-but-not-used variable
This patch removes the following warning, reported when building with
W=1:
drivers/infiniband/hw/cxgb4/cm.c:1860:5: warning: variable 'status' set but not used [-Wunused-but-set-variable]
u8 status;
^~~~~~
Fixes: 6a0b6174d35a ("rdma/cxgb4: Add support for kernel mode SRQ's")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Sun, 29 Jul 2018 08:53:16 +0000 (11:53 +0300)]
RDMA/core: Prefix _ib to IB/RoCE specific functions
In rdma cm module, functions which are common between IB and iWarp
are named with cma_.
iWarp specific functions are prefixed with cma_iw.
IB specific functions are prefixed with cma_ib.
However some functions in request processing path didn't follow
cma_ib notion. Prefix them with _ib for better code clarity.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Sun, 29 Jul 2018 08:53:15 +0000 (11:53 +0300)]
RDMA/core: Simplify gid type check in cma_acquire_dev()
cma_add_one() initializes the default GID regardless of device type.
listen_id is bound to a device and an IP address, its GID type is
initialized by cma_acquire_dev().
Therefore a valid default GID type is always available, and there is no need
to check the port type during cma_acquire_dev().
Initialize gid type of a cm id when the cm_id is created instead of
doing conditional checks during cma_acquire_dev() and trying to
initialize to 0 during _cma_attach_to_dev().
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Sun, 29 Jul 2018 08:53:14 +0000 (11:53 +0300)]
RDMA/core: Avoid holding lock while initializing fields on stack
In various functions rdma_cm_event is zero initialized on stack using
memset() while holding lock which is not necessary.
Therefore, don't hold the lock while initializing on stack.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Sun, 29 Jul 2018 08:53:13 +0000 (11:53 +0300)]
RDMA/core: Return bool instead of int
Return bool for following internal and inline functions as their
underlying APIs return bool too.
1. cma_zero_addr()
2. cma_loopback_addr()
3. cma_any_addr()
4. ib_addr_any()
5. ib_addr_loopback()
While we are touching cma_loopback_addr(), remove extra white spaces
in it.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Sun, 29 Jul 2018 08:53:12 +0000 (11:53 +0300)]
RDMA/cma: Get rid of 1 bit boolean
Arrange fields of cma_req_info structure for efficiency on
stack and get rid of one bit boolean field.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Sun, 29 Jul 2018 08:53:11 +0000 (11:53 +0300)]
RDMA/cma: Constify path record, ib_cm_event, listen_id pointers
Constify several pointers such as path_rec, ib_cm_event and listen_id
pointers in several functions.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Sun, 29 Jul 2018 08:53:10 +0000 (11:53 +0300)]
RDMA/core: Constify dst_addr argument
Following APIs are not supposed to modify addr or dest_addr contents.
Therefore make those function argument const for better code
readability.
1. rdma_resolve_ip()
2. rdma_addr_size()
3. rdma_resolve_addr()
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Sun, 29 Jul 2018 08:53:09 +0000 (11:53 +0300)]
RDMA/cma: Simplify rdma_resolve_addr() error flow
Currently the dst address is set first and later cleared if any of the
3 error conditions is met.
However none of the APIs or checks are supposed to refer to the
destination address of the cm_id.
Therefore, set the destination address after necessary checks pass which
simplifies the error flow.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Sun, 29 Jul 2018 08:53:08 +0000 (11:53 +0300)]
RDMA/cma: Initialize resource type in __rdma_create_id()
Currently rdma_cm_id's resource tracking fields such as owner task and
kern_name and other non resource tracking fields are initialized in
a single function, __rdma_create_id().
Therefore, initialize rdma_cm_id's resource type also in same init
function.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Lijun Ou [Mon, 30 Jul 2018 12:20:30 +0000 (20:20 +0800)]
RDMA/hns: Program the tclass and flow label into the hardware
This was missed in a few places, and was just using 0.
Also correct the spelling of HNS_ROCE_FLOW_LABEL_MASK
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Lijun Ou [Mon, 30 Jul 2018 12:20:29 +0000 (20:20 +0800)]
RDMA/hns: Use macro instead of magic number
This patch mainly uses CMD_CSQ_DESC_NUM instead of magic number in order
to improve readability.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Lijun Ou [Mon, 30 Jul 2018 12:20:28 +0000 (20:20 +0800)]
RDMA/hns: Modify qp will return errno when qp type is illegal
The assignment to ret was missing in the error path here, resulting in an
incorrect error code for modify_qp.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Lijun Ou [Mon, 30 Jul 2018 12:20:27 +0000 (20:20 +0800)]
RDMA/hns: Assign the value for vlan field of qp context
This patch fills the correct value into the vlan id field of the qp
context and updates the vlan field name according to the latest
hardware user manual.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Lijun Ou [Mon, 30 Jul 2018 12:20:25 +0000 (20:20 +0800)]
RDMA/hns: Only assign the fields of the av if IB_QP_AV bit is set
Only when the IB_QP_AV flag of attr_mask is set is it valid to assign the
related fields of the av into the qp context.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Kamal Heib [Fri, 27 Jul 2018 18:23:06 +0000 (21:23 +0300)]
RDMA/providers: Remove pointless functions
The rdma core takes care of returning the right error code when the
rdma device callbacks aren't supported.
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Kamal Heib [Fri, 27 Jul 2018 18:23:05 +0000 (21:23 +0300)]
RDMA/core: Check for verbs callbacks before using them
Make sure the providers implement the verbs callbacks before calling
them, otherwise return -EOPNOTSUPP.
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
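The pattern in the commit above (test for a provider callback before calling it, and return -EOPNOTSUPP when it is absent) can be sketched in userspace C as follows. The struct demo_device ops table and the demo_ functions are hypothetical stand-ins, not the real ib_device or core code; only the -EOPNOTSUPP convention comes from the commit messages.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical provider ops table; the real device has many more callbacks. */
struct demo_device {
	int (*modify_cq)(struct demo_device *dev, unsigned int cq_count,
			 unsigned int cq_period);
};

/* Core-side wrapper: refuse cleanly when the provider has no callback. */
int demo_set_cq_moderation(struct demo_device *dev, unsigned int cq_count,
			   unsigned int cq_period)
{
	if (!dev->modify_cq)
		return -EOPNOTSUPP;
	return dev->modify_cq(dev, cq_count, cq_period);
}

/* A provider that implements the callback and always succeeds. */
int demo_modify_cq_ok(struct demo_device *dev, unsigned int cq_count,
		      unsigned int cq_period)
{
	(void)dev;
	(void)cq_count;
	(void)cq_period;
	return 0;
}
```

With the check living in the core wrapper, a provider that lacks the feature simply leaves the pointer NULL instead of supplying a stub that returns an error, and callers test for one well-defined code.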
Kamal Heib [Fri, 27 Jul 2018 18:23:04 +0000 (21:23 +0300)]
RDMA/core: Remove {create,destroy}_ah from mandatory verbs
{create,destroy}_ah aren't mandatory verbs, because not all providers
are implementing them.
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Kamal Heib [Mon, 30 Jul 2018 18:56:44 +0000 (21:56 +0300)]
RDMA/ipoib: Fix check for return code from ib_create_srq
Make sure to check for "-EOPNOTSUPP" instead of "-ENOSYS"; "-EOPNOTSUPP" is
the return code from ib_create_srq() when SRQs are not supported.
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Kamal Heib [Mon, 30 Jul 2018 18:56:43 +0000 (21:56 +0300)]
RDMA/providers: Fix return value from create_srq callbacks
The proper return code is "-EOPNOTSUPP" when the create_srq() callback
is not supported.
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jack Morgenstein [Thu, 26 Jul 2018 07:08:37 +0000 (10:08 +0300)]
IB/mlx4: Use 4K pages for kernel QP's WQE buffer
In the current implementation, the driver tries to allocate contiguous
memory, and if it fails, it falls back to 4K fragmented allocation.
Once the memory is fragmented, the first allocation might take a lot
of time, and even fail, which can cause connection failures.
This patch changes the logic to always allocate with 4K granularity,
since it's more robust and more likely to succeed.
This patch was tested with Lustre and no performance degradation
was observed.
Note: This commit eliminates the "shrinking WQE" feature. This feature
depended on using vmap to create a virtually contiguous send WQ.
vmap use was abandoned due to problems with several processors (see the
commit cited below). As a result, shrinking WQE was available only with
physically contiguous send WQs. Allocating such send WQs caused the
problems described above.
Therefore, as a side effect of eliminating the use of large physically
contiguous send WQs, the shrinking WQE feature became unavailable.
Warning example:
worker/20:1: page allocation failure: order:8, mode:0x80d0
CPU: 20 PID: 513 Comm: kworker/20:1 Tainted: G OE ------------
Workqueue: ib_cm cm_work_handler [ib_cm]
Call Trace:
[<ffffffff81686d81>] dump_stack+0x19/0x1b
[<ffffffff81186160>] warn_alloc_failed+0x110/0x180
[<ffffffff8118a954>] __alloc_pages_nodemask+0x9b4/0xba0
[<ffffffff811ce868>] alloc_pages_current+0x98/0x110
[<ffffffff81184fae>] __get_free_pages+0xe/0x50
[<ffffffff8133f6fe>] swiotlb_alloc_coherent+0x5e/0x150
[<ffffffff81062551>] x86_swiotlb_alloc_coherent+0x41/0x50
[<ffffffffa056b4c4>] mlx4_buf_direct_alloc.isra.7+0xc4/0x180 [mlx4_core]
[<ffffffffa056b73b>] mlx4_buf_alloc+0x1bb/0x260 [mlx4_core]
[<ffffffffa0b15496>] create_qp_common+0x536/0x1000 [mlx4_ib]
[<ffffffff811c6ef7>] ? dma_pool_free+0xa7/0xd0
[<ffffffffa0b163c1>] mlx4_ib_create_qp+0x3b1/0xdc0 [mlx4_ib]
[<ffffffffa0b01bc2>] ? mlx4_ib_create_cq+0x2d2/0x430 [mlx4_ib]
[<ffffffffa0b21f20>] mlx4_ib_create_qp_wrp+0x10/0x20 [mlx4_ib]
[<ffffffffa08f152a>] ib_create_qp+0x7a/0x2f0 [ib_core]
[<ffffffffa06205d4>] rdma_create_qp+0x34/0xb0 [rdma_cm]
[<ffffffffa08275c9>] kiblnd_create_conn+0xbf9/0x1950 [ko2iblnd]
[<ffffffffa074077a>] ? cfs_percpt_unlock+0x1a/0xb0 [libcfs]
[<ffffffffa0835519>] kiblnd_passive_connect+0xa99/0x18c0 [ko2iblnd]
Fixes: 73898db04301 ("net/mlx4: Avoid wrong virtual mappings")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
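To put the "order:8" failure in the warning above into numbers: an order-N allocation asks for 2^N physically contiguous pages, so with 4 KiB pages (an assumption that matches the x86 trace) order 8 is one 1 MiB contiguous block, whereas the same buffer built from independent 4 KiB pages needs 256 separate allocations that do not have to be adjacent. A small sketch of that arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u  /* assumes 4 KiB pages */

/* Bytes requested by a physically contiguous allocation of the given order. */
size_t demo_order_to_bytes(unsigned int order)
{
	return (size_t)DEMO_PAGE_SIZE << order;
}

/* Number of independent 4 KiB pages needed to cover a buffer of len bytes. */
size_t demo_pages_needed(size_t len)
{
	return (len + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;
}
```

On a fragmented system the single large block may not exist even when plenty of individual pages are free, which is the failure mode the commit message describes.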
Jason Gunthorpe [Thu, 26 Jul 2018 22:37:14 +0000 (16:37 -0600)]
IB/uverbs: Add UVERBS_ATTR_FLAGS_IN to the specs language
This clearly indicates that the input is a bitwise combination of values
in an enum, and identifies which enum contains the definition of the bits.
Special accessors are provided that handle the mandatory validation of the
allowed bits and enforce the correct type for bitwise flags.
If we had introduced this at the start then the kabi would have uniformly
used u64 data to pass flags, however today there is a mixture of u64 and
u32 flags. All places are converted to accept both sizes and the accessor
fixes it. This allows all existing flags to grow to u64 in future without
any hassle.
Finally all flags are, by definition, optional. If flags are not passed
the accessor does not fail, but provides a value of zero.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
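The accessor behaviour described in the commit above (accept either a u32 or a u64 input, validate the allowed bits, and treat a missing attribute as zero) might look roughly like this in a userspace sketch. struct demo_attr and demo_get_flags64() are hypothetical names for illustration, and returning -EINVAL for a bad size or a disallowed bit is an assumption of this sketch rather than something the commit message states.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute container; NOT the real uverbs attribute bundle. */
struct demo_attr {
	const void *data;  /* NULL when the attribute was not passed */
	size_t len;        /* 4 or 8 bytes when present */
};

/*
 * Read a flags attribute that may arrive as u32 or u64, widen it to u64,
 * and reject any bit outside allowed_bits. An absent attribute yields 0.
 */
int demo_get_flags64(uint64_t *to, const struct demo_attr *attr,
		     uint64_t allowed_bits)
{
	uint64_t flags = 0;

	if (attr->data) {
		if (attr->len == sizeof(uint32_t)) {
			uint32_t v32;

			memcpy(&v32, attr->data, sizeof(v32));
			flags = v32;
		} else if (attr->len == sizeof(uint64_t)) {
			memcpy(&flags, attr->data, sizeof(flags));
		} else {
			return -EINVAL; /* assumed error code */
		}
	}

	if (flags & ~allowed_bits)
		return -EINVAL; /* assumed error code */

	*to = flags;
	return 0;
}
```

Because the widening happens in one place, a caller always works with a u64 and does not care which size userspace sent.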
Bart Van Assche [Wed, 18 Jul 2018 16:25:32 +0000 (09:25 -0700)]
RDMA, core and ULPs: Declare ib_post_send() and ib_post_recv() arguments const
Since neither ib_post_send() nor ib_post_recv() modify the data structure
their second argument points at, declare that argument const. This change
makes it necessary to declare the 'bad_wr' argument const too and also to
modify all ULPs that call ib_post_send(), ib_post_recv() or
ib_post_srq_recv(). This patch does not change any functionality but makes
it possible for the compiler to verify whether the
ib_post_(send|recv|srq_recv) really do not modify the posted work request.
To make this possible, only one cast had to be introduced that casts away
constness, namely in rpcrdma_post_recvs(). The only way I can think of to
avoid that cast is to introduce an additional loop in that function or to
change the data type of bad_wr from struct ib_recv_wr ** into int
(an index that refers to an element in the work request list). However,
both approaches would require even more extensive changes than this
patch.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
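The effect of the constification in the commit above can be shown with a reduced userspace model. struct demo_send_wr and demo_post_send() are hypothetical and far smaller than the real ib_send_wr and ib_post_send(); the sketch only shows why the bad_wr out-parameter has to become a pointer to const once the list itself is const, and how the compiler then rejects any write through the list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced work request; NOT the real ib_send_wr. */
struct demo_send_wr {
	const struct demo_send_wr *next;
	int num_sge;
};

/*
 * Walk a const work-request chain. On failure, *bad_wr points at the first
 * request that could not be posted. The chain is never written to, and the
 * compiler enforces that because every pointer into it is const.
 */
int demo_post_send(const struct demo_send_wr *wr,
		   const struct demo_send_wr **bad_wr, int max_sge)
{
	for (; wr; wr = wr->next) {
		if (wr->num_sge > max_sge) {
			*bad_wr = wr;
			return -1;
		}
	}
	return 0;
}
```

Since bad_wr points into the caller's const list, declaring it as anything other than a pointer to const would require casting the constness away.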
Bart Van Assche [Wed, 18 Jul 2018 16:25:15 +0000 (09:25 -0700)]
IB/mlx5, ib_post_send(), IB_WR_REG_SIG_MR: Do not modify the 'wr' argument
Since the next patch will constify the wr pointer, do not modify the data
that pointer points at.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:25:14 +0000 (09:25 -0700)]
RDMA: Constify the argument of the work request conversion functions
When posting a send work request, the work request that is posted is not
modified by any of the RDMA drivers. Make this explicit by constifying
most ib_send_wr pointers in RDMA transport drivers.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:25:13 +0000 (09:25 -0700)]
IB/iser: Inline two work request conversion functions
Since the next patch will change the return type of these functions into a
const pointer and since the iSER driver modifies the work request these
functions return a pointer to, inline two work request conversion
function calls. This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Fri, 27 Jul 2018 15:48:30 +0000 (09:48 -0600)]
IB/cache: Restore compatibility for ib_query_gid
Code changes in smc have become so complicated this cycle that the RDMA
patches to remove ib_query_gid in smc create too complex merge conflicts.
Allow those conflicts to be resolved by using the net/smc hunks by
providing a compatibility wrapper. During the second phase of the merge
window this wrapper will be deleted and smc updated to use the new API.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Lijun Ou [Wed, 25 Jul 2018 07:29:41 +0000 (15:29 +0800)]
RDMA/hns: Enable modify_cq for uverbs.
The driver implements the modify_cq callback, but did not set the bit to
expose it to userspace.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Lijun Ou [Wed, 25 Jul 2018 07:29:40 +0000 (15:29 +0800)]
RDMA/hns: Update the data type of immediate data
Because the hip08 data structures are little endian, change the type of
the immediate data field in the wqe and cqe to __le32.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
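As a rough illustration of what a little-endian (__le32) field in the RDMA/hns commit above implies, the sketch below stores and loads a 32-bit value in little-endian byte order regardless of host endianness. The helper names put_le32 and get_le32 are hypothetical and are not taken from the driver; kernel code would normally use cpu_to_le32() and le32_to_cpu() instead.

```c
#include <stdint.h>

/* Hypothetical helper: store a CPU-order value as little-endian bytes. */
void put_le32(uint8_t out[4], uint32_t v)
{
	out[0] = (uint8_t)(v & 0xff);         /* least significant byte first */
	out[1] = (uint8_t)((v >> 8) & 0xff);
	out[2] = (uint8_t)((v >> 16) & 0xff);
	out[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Hypothetical helper: load a little-endian field into CPU order. */
uint32_t get_le32(const uint8_t in[4])
{
	return (uint32_t)in[0] |
	       ((uint32_t)in[1] << 8) |
	       ((uint32_t)in[2] << 16) |
	       ((uint32_t)in[3] << 24);
}
```

Typing the field as __le32 lets static checkers such as sparse flag any access that skips this kind of conversion.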
Lijun Ou [Wed, 25 Jul 2018 07:29:38 +0000 (15:29 +0800)]
RDMA/hns: Use delay instead of usleep
To avoid calling the usleep function inside a locked section, use a delay
function instead. Also add parentheses to make the order of evaluation
explicit.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Lijun Ou [Wed, 25 Jul 2018 07:29:37 +0000 (15:29 +0800)]
RDMA/hns: Add illegal hop_num judgement
When hop_num is greater than three, the driver needs to return -EINVAL.
This patch adds that check.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Lijun Ou [Wed, 25 Jul 2018 07:29:36 +0000 (15:29 +0800)]
RDMA/hns: Return correct error code from hns_roce_v1_rsv_lp_qp()
When creating the loop QP fails because modify_qp() fails, return the
correct error code.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Lijun Ou [Wed, 25 Jul 2018 07:29:33 +0000 (15:29 +0800)]
RDMA/hns: Add 50GE type of hnae3 device match
This patch adds PCI matching for the hns 50GE NIC.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Lijun Ou [Wed, 25 Jul 2018 07:29:31 +0000 (15:29 +0800)]
RDMA/hns: Do not overwrite the error code during error unwind in hns_roce_init
When cmq init fails in the RoCE initialization flow, return the errno of
the cmq_init function, not the result of a later call made during error
unwind.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
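The error-unwind rule from the hns_roce_init commit above can be sketched as follows. This is only an illustration under stated assumptions: cmq_init_stub, cleanup_stub and roce_init_sketch are hypothetical stand-ins, not the driver's functions. The point is that a cleanup call on the error path must not overwrite the error that triggered the unwind.

```c
#include <errno.h>

/* Hypothetical init step: fails with -ENOMEM when asked to. */
int cmq_init_stub(int fail)
{
	return fail ? -ENOMEM : 0;
}

/* Hypothetical cleanup step that itself reports an error. */
int cleanup_stub(void)
{
	return -EBUSY;
}

int roce_init_sketch(int fail_cmq)
{
	int ret;

	ret = cmq_init_stub(fail_cmq);
	if (ret)
		goto err_cleanup;

	return 0;

err_cleanup:
	/* Run the cleanup, but do not assign its result to ret. */
	(void)cleanup_stub();
	return ret; /* still the error from cmq_init_stub() */
}
```

With this shape, roce_init_sketch(1) reports -ENOMEM from the failed init step rather than -EBUSY from the cleanup.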
Qing Huang [Mon, 23 Jul 2018 21:15:08 +0000 (14:15 -0700)]
IB/mlx5: avoid excessive warning msgs when creating VFs on 2nd port
When a CX5 device is configured in dual-port RoCE mode, after creating
many VFs against port 1, creating the same number of VFs against port 2
will flood kernel/syslog with something like
"mlx5_*:mlx5_ib_bind_slave_port:4266:(pid 5269): port 2 already
affiliated."
So basically, when traversing mlx5_ib_dev_list, mlx5_ib_add_slave_port()
repeatedly attempts to bind the new mpi structure to every device on the
list until it finds an unbound device.
Change the log level from warn to dbg to avoid log flooding as the warning
should be harmless.
Signed-off-by: Qing Huang <qing.huang@oracle.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Mon, 23 Jul 2018 22:37:01 +0000 (15:37 -0700)]
RDMA/usnic: Suppress a compiler warning
This patch avoids that the following compiler warning is reported when
building with gcc 8 and W=1:
drivers/infiniband/hw/usnic/usnic_fwd.c:95:2: warning: 'strncpy' output may be truncated copying 16 bytes from a string of length 20 [-Wstringop-truncation]
strncpy(ufdev->name, netdev_name(ufdev->netdev),
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sizeof(ufdev->name) - 1);
~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
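The -Wstringop-truncation warning quoted in the usnic commit above fires because strncpy() may leave the destination without a terminating NUL when the source is too long. The sketch below shows one common way to make truncation explicit. It is not the change made by the patch, and copy_name is a hypothetical helper; in kernel code a helper such as strscpy() typically fills this role.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy src into dst (capacity size bytes) and
 * always NUL-terminate. Returns 1 if the name was truncated, else 0.
 */
int copy_name(char *dst, size_t size, const char *src)
{
	size_t len = strlen(src);
	int truncated = len >= size;

	if (size == 0)
		return 1; /* nothing can be stored, not even the NUL */
	if (truncated)
		len = size - 1; /* leave room for the terminator */
	memcpy(dst, src, len);
	dst[len] = '\0';
	return truncated;
}
```

Copying a 20-character name into a 16-byte buffer with this helper keeps the first 15 characters and reports the truncation to the caller.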
Jason Gunthorpe [Thu, 26 Jul 2018 17:36:50 +0000 (11:36 -0600)]
net/xprtrdma: Restore needed argument to ib_post_send
The call in svc_rdma_post_chunk_ctxt() does actually use bad_wr.
Fixes: ed288d74a9e5 ("net/xprtrdma: Simplify ib_post_(send|recv|srq_recv)() calls")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Mon, 16 Jul 2018 08:50:13 +0000 (11:50 +0300)]
RDMA/cma: Do not ignore net namespace for unbound cm_id
Currently, if a cm_id is not bound to any netdevice, the net namespace is
ignored for that cm_id, which is incorrect.
Regardless of whether the cm_id is bound to a netdevice, the net namespace
must match. When a cm_id is bound to a netdevice, both the net namespace
and the netdevice must match.
Fixes: 4c21b5bcef73 ("IB/cma: Add net_dev and private data checks to RDMA CM")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
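The matching rule described in the RDMA/cma net namespace commit above can be sketched as a small predicate. All types and names here (net_ns, netdev, cm_id_sketch, cm_id_matches) are hypothetical simplifications, not the structures used by the RDMA CM code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel objects. */
struct net_ns { int id; };
struct netdev { const struct net_ns *ns; int ifindex; };

struct cm_id_sketch {
	const struct net_ns *ns;         /* namespace the cm_id lives in */
	const struct netdev *bound_dev;  /* NULL when the cm_id is unbound */
};

bool cm_id_matches(const struct cm_id_sketch *id,
		   const struct net_ns *req_ns,
		   const struct netdev *req_dev)
{
	/* The net namespace must match whether or not the cm_id is bound. */
	if (id->ns != req_ns)
		return false;

	/* A bound cm_id must additionally match the netdevice. */
	if (id->bound_dev && id->bound_dev != req_dev)
		return false;

	return true;
}
```

An unbound cm_id therefore accepts any netdevice in its own namespace, but never a request from a different namespace.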
Parav Pandit [Mon, 16 Jul 2018 08:50:12 +0000 (11:50 +0300)]
RDMA/cma: Consider netdevice for RoCE ports
When a netdevice is not found for a request, and the request is for a RoCE
port, the code currently allows matching the listener as long as the port
number matches, ignoring the netdevice.
Now that we always prefer to have netdevice associated with RoCE, when
netdevice is not found, don't consider RoCE ports.
In other words, a NULL netdevice with RoCE is not acceptable. Therefore,
remove this confusing RoCE port ignorance check.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Mon, 16 Jul 2018 08:50:11 +0000 (11:50 +0300)]
IB/core: Introduce and use sgid_attr in CM requests
For RoCE, when CM requests are received for RC and UD connections,
netdevice of the incoming request is unavailable. Because of that CM
requests are always forwarded to init_net namespace.
Now that we have the GID attribute available, introduce SGID attribute in
incoming CM requests and refer to the netdevice of it. This is similar to
existing SGID attribute field in outgoing CM requests for RC and UD
transports.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Tue, 24 Jul 2018 20:37:52 +0000 (14:37 -0600)]
IB/usnic: usnic should not select INFINIBAND_USER_ACCESS
This driver doesn't provide any kernel services, it only provides
an interface via uverbs, so it should depend on, not select, uverbs
support.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Raju Rangoju [Wed, 25 Jul 2018 15:52:14 +0000 (21:22 +0530)]
rdma/cxgb4: Add support for kernel mode SRQ's
This patch implements the SRQ-specific verbs such as create/destroy/modify
and post_srq_recv, and adds SRQ-specific structures and defines to t4.h
and uapi.
Also updates the cq poll logic to deal with completions that are
associated with the SRQ's.
This patch also handles kernel mode SRQ_LIMIT events as well as flushed
SRQ buffers.
Signed-off-by: Raju Rangoju <rajur@chelsio.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Raju Rangoju [Wed, 25 Jul 2018 15:52:13 +0000 (21:22 +0530)]
rdma/cxgb4: Add support for srq functions & structs
This patch adds kernel mode t4_srq structures and support functions,
uapi structures and defines, as well as firmware work request structures.
Signed-off-by: Raju Rangoju <rajur@chelsio.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Varsha Rao [Wed, 25 Jul 2018 18:43:56 +0000 (20:43 +0200)]
IB/core: Remove extra parentheses
Remove unnecessary parentheses to fix the clang warning of extraneous
parentheses.
Signed-off-by: Varsha Rao <rvarsha016@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:01:35 +0000 (09:01 -0700)]
RDMA/ocrdma: Suppress a compiler warning
This patch avoids that the following compiler warning is reported when
building with gcc 8 and W=1:
In function 'ocrdma_mbx_get_ctrl_attribs',
inlined from 'ocrdma_init_hw' at drivers/infiniband/hw/ocrdma/ocrdma_hw.c:3224:11:
drivers/infiniband/hw/ocrdma/ocrdma_hw.c:1368:3: warning: 'strncpy' output may be truncated copying 31 bytes from a string of length 31 [-Wstringop-truncation]
strncpy(dev->model_number,
^~~~~~~~~~~~~~~~~~~~~~~~~~
hba_attribs->controller_model_number, 31);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Tue, 10 Jul 2018 19:43:06 +0000 (13:43 -0600)]
IB/uverbs: Fix locking around struct ib_uverbs_file ucontext
We have a parallel unlocked reader and writer with ib_uverbs_get_context()
vs everything else, and nothing guarantees this works properly.
Audit and fix all of the places that access ucontext to use one of the
following locking schemes:
- Call ib_uverbs_get_ucontext() under SRCU and check for failure
- Access the ucontext through a struct ib_uobject context member
while holding a READ or WRITE lock on the uobject.
This value cannot be NULL and has no race.
- Hold the ucontext_lock and check for ufile->ucontext != NULL
This also re-implements ib_uverbs_get_ucontext() in a way that is safe
against concurrent ib_uverbs_get_context() and disassociation.
As a side effect, every access to ucontext in the commands is via
ib_uverbs_get_context() with an error check, or via the uobject, so there
is no longer any need for the core code to check ucontext on every command
call. These checks are also removed.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Wed, 11 Jul 2018 02:55:22 +0000 (20:55 -0600)]
IB/mlx5: Use the ucontext from the uobj, not the file
This approach matches the standard flow of the typical write method that
relies on the HW object to store the device and the uobject to access the
ucontext. Avoiding the use of devx_ufile2uctx in several places will
make revising the semantics of ib_uverbs_get_ucontext() in the next patch
simpler.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Wed, 11 Jul 2018 02:55:21 +0000 (20:55 -0600)]
IB/uverbs: Move the FD uobj type struct file allocation to alloc_commit
Allocating the struct file during alloc_begin creates this strange
asymmetry with IDR, where the FD has two krefs pointing at it during the
pre-commit phase. In particular this makes the abort process for FD very
strange and confusing.
For instance abort currently calls the type's destroy_object twice, and
the fops release once if abort is done. This is very counter intuitive. No
fops should be called until alloc_commit succeeds, and destroy_object
should only ever be called once.
Moving the struct file allocation to the alloc_commit is now simple, as we
already support failure of rdma_alloc_commit_uobject, with all the
required rollback pieces.
This creates an understandable symmetry with IDR and simplifies/fixes the
abort handling for FD types.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Wed, 11 Jul 2018 02:55:20 +0000 (20:55 -0600)]
IB/uverbs: Always propagate errors from rdma_alloc_commit_uobject()
The ioctl framework already does this correctly, but the write path did
not. This is trivially fixed by simply using a standard pattern to return
uobj_alloc_commit() as the last statement in every function.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Wed, 11 Jul 2018 02:55:19 +0000 (20:55 -0600)]
IB/uverbs: Rework the locking for cleaning up the ucontext
The locking here has always been a bit crazy and spread out; after some
careful analysis we can simplify things.
Create a single function uverbs_destroy_ufile_hw() that internally handles
all locking. This pulls together pieces of this process that were
sprinkled all over the place into one place, and covers them with one
lock.
This eliminates several duplicate/confusing locks and makes the control
flow in ib_uverbs_close() and ib_uverbs_free_hw_resources() extremely
simple.
Unfortunately we have to keep an extra mutex, ucontext_lock. This lock is
logically part of the rwsem and provides the 'down write, fail if write
locked, wait if read locked' semantic we require.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Wed, 11 Jul 2018 02:55:18 +0000 (20:55 -0600)]
IB/uverbs: Revise and clarify the rwsem and uobjects_lock
Rename 'cleanup_rwsem' to 'hw_destroy_rwsem' which is held across any call
to the type destroy function (aka 'hw' destroy). The main purpose of this
lock is to prevent normal add and destroy from running concurrently with
uverbs_cleanup_ufile().
Since the uobjects list is always manipulated under the 'hw_destroy_rwsem'
we can eliminate the uobjects_lock in the cleanup function. This allows
converting that lock to a very simple spinlock with a narrow critical
section.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Wed, 11 Jul 2018 02:55:17 +0000 (20:55 -0600)]
IB/uverbs: Clarify and revise uverbs_close_fd
The locking requirements here have changed slightly now that we can rely
on the ib_uverbs_file always existing and containing all the necessary
locking infrastructure.
That means we can get rid of the cleanup_mutex usage (this was protecting
the check on !uobj->context).
Otherwise, follow the same pattern that IDR uses for destroy: acquire
exclusive write access, then call destroy and then undo the 'lookup'.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Wed, 11 Jul 2018 02:55:16 +0000 (20:55 -0600)]
IB/uverbs: Revise the placement of get/puts on uobject
This wasn't wrong, but the placement of two krefs didn't make any
sense. Follow some simple rules.
- A kref is held inside uobjects_list
- A kref is held inside the IDR
- A kref is held inside file->private
- A stack based kref is passed between alloc_begin and
alloc_abort/alloc_commit
Any place we destroy one of the above pointers, we stick a put,
or 'move' the kref into another pointer.
The key functions have sensible semantics:
- alloc_uobj fully initializes the common members in uobj, including
the list
- Get rid of the uverbs_idr_remove_uobj helper since IDR remove
does require put, but it depends on the situation. Later
patches will re-consolidate this differently.
- alloc_abort always consumes the passed kref, done in the type
- alloc_commit always consumes the passed kref, done in the type
- rdma_remove_commit_uobject always pairs with a lookup_get
After it is all done the only control flow change is to:
- move a get from alloc_commit_fd_uobject to rdma_alloc_commit_uobject
- add a put to remove_commit_idr_uobject
- Consistently use rdma_lookup_put in rdma_remove_commit_uobject at
the right place
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Wed, 11 Jul 2018 02:55:15 +0000 (20:55 -0600)]
IB/uverbs: Clarify the kref'ing ordering for alloc_commit
The alloc_commit callback makes the uobj visible to other threads,
and it does so using a 'move' semantic of the uobj kref on the stack
into the public storage (eg the IDR, uobject list and file_private_data).
Once this is done another thread could start up and trigger deletion
of the kref. Fortunately cleanup_rwsem happens to prevent this from
being a bug, but that is a fantastically unclear side effect.
Re-organize things so that alloc_commit is the last thing to touch
the uobj, get rid of the sneaky implicit dependency on cleanup_rwsem,
and add a comment reminding that uobj is no longer kref'd after
alloc_commit.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Wed, 11 Jul 2018 02:55:14 +0000 (20:55 -0600)]
IB/uverbs: Handle IDR and FD types without truncation
Our ABI for write() uses a s32 for FDs and a u32 for IDRs, but internally
we ended up implicitly casting these ABI values into an 'int'. For ioctl()
we use a s64 for FDs and a u64 for IDRs, again casting to an int.
The various casts to int are all missing range checks which can cause
userspace values that should be considered invalid to be accepted.
Fix this by making the generic lookup routine accept a s64, which does not
truncate the write API's u32/s32 or the ioctl API's s64. Then push the
detailed range checking down to the actual type implementations to be
shared by both interfaces.
Finally, change the copy of the uobj->id to sign extend into a s64, so eg,
if we ever wish to return a negative value for a FD it is carried
properly.
This ensures that userspace values are never weirdly interpreted due to
the various truncations and everything that is really out of range gets
an EINVAL.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
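The range-checking idea from the "Handle IDR and FD types without truncation" commit above can be sketched as follows: carry the userspace value in a signed 64-bit integer and validate it before narrowing. The helper names and the exact accepted ranges are assumptions made for illustration only; the limits enforced by the real type implementations may differ.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: narrow an FD handle carried as s64.
 * Assumption: only 0..INT_MAX is treated as valid here.
 */
int fd_from_s64(int64_t id, int *fd)
{
	if (id < 0 || id > INT_MAX)
		return -EINVAL;
	*fd = (int)id;
	return 0;
}

/*
 * Hypothetical helper: narrow an IDR handle carried as s64.
 * Assumption: valid handles fit in an unsigned 32-bit value.
 */
int idr_from_s64(int64_t id, uint32_t *handle)
{
	if (id < 0 || id > (int64_t)UINT32_MAX)
		return -EINVAL;
	*handle = (uint32_t)id;
	return 0;
}
```

Without the check, a value such as 0x100000003 would typically be truncated to 3 by an implicit cast to int and could then alias a valid handle; with the check it is rejected with -EINVAL.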
Jason Gunthorpe [Wed, 11 Jul 2018 02:55:13 +0000 (20:55 -0600)]
IB/uverbs: Get rid of null_obj_type
If the method fails after calling rdma_explicit_destroy (eg if
copy_to_user faults) then it will trigger a kernel oops:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
PGD 800000000548d067 P4D 800000000548d067 PUD 54a0067 PMD 0
SMP PTI
CPU: 0 PID: 359 Comm: ibv_rc_pingpong Not tainted 4.18.0-rc1+ #28
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
RIP: 0010: (null)
Code: Bad RIP value.
RSP: 0018:ffffc900001a3bf0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88000603bd00 RCX: 0000000000000003
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88000603bd00
RBP: 0000000000000001 R08: ffffc900001a3cf8 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffc900001a3cf0
R13: 0000000000000000 R14: ffffc900001a3cf0 R15: 0000000000000000
FS: 00007fb00dda8700(0000) GS:ffff880007c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 000000000548e004 CR4: 00000000003606b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
? rdma_lookup_put_uobject+0x22/0x50 [ib_uverbs]
? uverbs_finalize_object+0x3b/0x60 [ib_uverbs]
? uverbs_finalize_attrs+0x128/0x140 [ib_uverbs]
? ib_uverbs_cmd_verbs+0x698/0x7c0 [ib_uverbs]
? find_held_lock+0x2d/0x90
? __might_fault+0x39/0x90
? ib_uverbs_ioctl+0x111/0x1f0 [ib_uverbs]
? do_vfs_ioctl+0xa0/0x6d0
? trace_hardirqs_on_caller+0xed/0x180
? _raw_spin_unlock_irq+0x24/0x40
? syscall_trace_enter+0x138/0x1d0
? ksys_ioctl+0x35/0x60
? __x64_sys_ioctl+0x11/0x20
? do_syscall_64+0x5b/0x1c0
? entry_SYSCALL_64_after_hwframe+0x49/0xbe
This is because the type was replaced with the null_type during explicit
destroy that cannot complete the destruction.
One of the side effects of replacing the type is to make the object
handle totally unreachable - so no other command could attempt to use
it, even though it remains on the uobject list.
We can get the same end result by just fully destroying the object inside
rdma_explicit_destroy and leaving the caller the residual kref for the
uobj with no attached HW object, and no presence in the uobjects list.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:25:31 +0000 (09:25 -0700)]
net/xprtrdma: Simplify ib_post_(send|recv|srq_recv)() calls
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL
as third argument to ib_post_(send|recv|srq_recv)().
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:25:30 +0000 (09:25 -0700)]
net/smc: Simplify ib_post_(send|recv|srq_recv)() calls
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL
as third argument to ib_post_(send|recv|srq_recv)().
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:25:29 +0000 (09:25 -0700)]
net/smc: Remove a WARN_ON() statement
Remove a WARN_ON() statement that verifies something that is guaranteed
by the RDMA API, namely that the failed_wr pointer is not touched if an
ib_post_send() call succeeds and that it points at the failed wr if an
ib_post_send() call fails.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:25:28 +0000 (09:25 -0700)]
net/rds: Simplify ib_post_(send|recv|srq_recv)() calls
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL
as third argument to ib_post_(send|recv|srq_recv)().
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:25:27 +0000 (09:25 -0700)]
net/rds: Remove two WARN_ON() statements
Remove two WARN_ON() statements that verify something that is guaranteed
by the RDMA API, namely that the failed_wr pointer is not touched if an
ib_post_send() call succeeds and that it points at the failed wr if an
ib_post_send() call fails.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:25:26 +0000 (09:25 -0700)]
net/9p: Simplify ib_post_(send|recv|srq_recv)() calls
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL
as third argument to ib_post_(send|recv|srq_recv)().
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:25:25 +0000 (09:25 -0700)]
fs/cifs: Simplify ib_post_(send|recv|srq_recv)() calls
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL
as third argument to ib_post_(send|recv|srq_recv)().
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:25:24 +0000 (09:25 -0700)]
nvmet-rdma: Simplify ib_post_(send|recv|srq_recv)() calls
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL
as third argument to ib_post_(send|recv|srq_recv)().
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:25:23 +0000 (09:25 -0700)]
nvme-rdma: Simplify ib_post_(send|recv|srq_recv)() calls
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL
as third argument to ib_post_(send|recv|srq_recv)().
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:25:22 +0000 (09:25 -0700)]
IB/srpt: Simplify ib_post_(send|recv|srq_recv)() calls
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL
as third argument to ib_post_(send|recv|srq_recv)().
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:25:21 +0000 (09:25 -0700)]
IB/srp: Simplify ib_post_(send|recv|srq_recv)() calls
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL
as third argument to ib_post_(send|recv|srq_recv)().
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:25:20 +0000 (09:25 -0700)]
IB/isert: Simplify ib_post_(send|recv|srq_recv)() calls
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL
as third argument to ib_post_(send|recv|srq_recv)().
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:25:19 +0000 (09:25 -0700)]
IB/iser: Simplify ib_post_(send|recv|srq_recv)() calls
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL
as third argument to ib_post_(send|recv|srq_recv)().
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:25:18 +0000 (09:25 -0700)]
IB/IPoIB: Simplify ib_post_(send|recv|srq_recv)() calls
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL
as third argument to ib_post_(send|recv|srq_recv)().
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:25:17 +0000 (09:25 -0700)]
RDMA/core: Simplify ib_post_(send|recv|srq_recv)() calls
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL
as third argument to ib_post_(send|recv|srq_recv)().
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 18 Jul 2018 16:25:16 +0000 (09:25 -0700)]
IB/core: Allow ULPs to specify NULL as the third ib_post_(send|recv|srq_recv)() argument
This patch does not change the behavior of the modified functions.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
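One way a core wrapper can accept NULL as the third argument without changing provider behavior, as the IB/core commit above describes, is to substitute a local dummy pointer before calling into the driver. The sketch below uses hypothetical stand-in types and names (send_wr, post_send_fn, post_send, toy_provider) and is not the kernel's definition of ib_post_send().

```c
#include <stddef.h>

/* Hypothetical stand-in for a work request; not the kernel struct. */
struct send_wr {
	struct send_wr *next;
	int id;
};

/* Provider callback: reports the first failing request via bad_wr. */
typedef int (*post_send_fn)(const struct send_wr *wr,
			    const struct send_wr **bad_wr);

/*
 * Core wrapper: callers may pass NULL for bad_wr. The provider always
 * receives a valid pointer, so driver code needs no NULL checks.
 */
int post_send(post_send_fn provider, const struct send_wr *wr,
	      const struct send_wr **bad_wr)
{
	const struct send_wr *dummy;

	return provider(wr, bad_wr ? bad_wr : &dummy);
}

/* Toy provider for illustration: rejects requests with a negative id. */
int toy_provider(const struct send_wr *wr, const struct send_wr **bad_wr)
{
	for (; wr; wr = wr->next) {
		if (wr->id < 0) {
			*bad_wr = wr; /* safe: never NULL here */
			return -1;
		}
	}
	return 0;
}
```

Callers that do not care which request failed can then drop their own dummy variable and pass NULL, which is the simplification applied by the per-ULP commits listed above.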
Zhu Yanjun [Fri, 13 Jul 2018 07:10:20 +0000 (03:10 -0400)]
IB/rxe: Drop QP0 silently
According to "Annex A16: RDMA over Converged Ethernet (RoCE)":
A16.4.3 MANAGEMENT INTERFACES
As defined in the base specification, a special Queue Pair, QP0 is defined
solely for communication between subnet manager(s) and subnet management
agents. Since such an IB-defined subnet management architecture is outside
the scope of this annex, it follows that there is also no requirement that
a port which conforms to this annex be associated with a QP0. Thus, for
end nodes designed to conform to this annex, the concept of QP0 is
undefined and unused for any port connected to an Ethernet network.
CA16-8: A packet arriving at a RoCE port containing a BTH with the
destination QP field set to QP0 shall be silently dropped.
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
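Rule CA16-8 quoted in the IB/rxe commit above can be sketched as a receive-side check on the destination QP number taken from the BTH. The function and macro names are hypothetical and this is not the rxe receive path; the only fact relied on is that the BTH destination QP field is 24 bits wide.

```c
#include <stdbool.h>
#include <stdint.h>

/* The BTH destination QP is a 24-bit field (name is hypothetical). */
#define BTH_QPN_MASK_SKETCH 0x00ffffffu

/*
 * Hypothetical check: return true when a packet arriving on a RoCE
 * port targets QP0 and must therefore be dropped without any reply.
 */
bool roce_should_drop(uint32_t bth_dest_qp)
{
	return (bth_dest_qp & BTH_QPN_MASK_SKETCH) == 0;
}
```

A receive handler would call this check before any QP lookup and simply discard the packet when it returns true.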
Wei Yongjun [Wed, 11 Jul 2018 13:15:42 +0000 (13:15 +0000)]
IB/ipoib: Fix error return code in ipoib_dev_init()
Fix to return a negative error code from the ipoib_neigh_hash_init()
error handling case instead of 0, as done elsewhere in this function.
Fixes: 515ed4f3aab4 ("IB/IPoIB: Separate control and data related initializations")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>