platform/kernel/linux-starfive.git
5 years agoIB/rdmavt: Fix loopback send with invalidate ordering
Mike Marciniszyn [Tue, 26 Feb 2019 16:45:16 +0000 (08:45 -0800)]
IB/rdmavt: Fix loopback send with invalidate ordering

The IBTA spec notes:

o9-5.2.1: For any HCA which supports SEND with Invalidate, upon receiving
an IETH, the Invalidate operation must not take place until after the
normal transport header validation checks have been successfully
completed.

The rdmavt loopback code does the validation after the invalidate.

Fix by relocating the operation specific logic for all SEND variants to
after the validity checks.
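
A sketch of the resulting order (the check helper and error label are
hypothetical, not the actual rvt loopback code):

    /* transport header validation runs first... */
    if (!loopback_checks_ok(qp, wqe))       /* hypothetical check */
            goto serr;                      /* complete with error */
    /* ...and only then the operation specific SEND logic */
    if (wqe->wr.opcode == IB_WR_SEND_WITH_INV &&
        rvt_invalidate_rkey(qp, wqe->wr.ex.invalidate_rkey))
            goto serr;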

Cc: <stable@vger.kernel.org> #v4.20+
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/iser: Fix dma_nents type definition
Max Gurtovoy [Tue, 26 Feb 2019 10:22:11 +0000 (12:22 +0200)]
IB/iser: Fix dma_nents type definition

The returned value from ib_dma_map_sg is saved in the dma_nents
variable. To avoid future mismatches between types, define dma_nents as
an int instead of unsigned.
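
A sketch of the type pitfall being avoided (names illustrative):

    int dma_nents;  /* was: unsigned int */

    dma_nents = ib_dma_map_sg(dev, sgl, nents, DMA_FROM_DEVICE);
    if (dma_nents == 0)     /* ib_dma_map_sg() returns a signed int */
            return -EINVAL;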

Fixes: 57b26497fabe ("IB/iser: Pass the correct number of entries for dma mapped SGL")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/mlx5: Set correct write permissions for implicit ODP MR
Moni Shoua [Mon, 25 Feb 2019 06:53:00 +0000 (08:53 +0200)]
IB/mlx5: Set correct write permissions for implicit ODP MR

The write access of an implicit MR is inherited by all of its children.
Therefore we must set the correct write access on the parent MR.

Pass full access_flags when creating umem to let it calculate write access
correctly.

Fixes: da6a496a34f2 ("IB/mlx5: Ranges in implicit ODP MR inherit its write access")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agobnxt_re: Clean cq for kernel consumers only
Devesh Sharma [Tue, 26 Feb 2019 03:18:04 +0000 (22:18 -0500)]
bnxt_re: Clean cq for kernel consumers only

The kernel space provider driver should clean only the CQs belonging to
kernel space consumers. The current implementation does the reverse.

Fix this by avoiding the call to __clean_cq on a kernel QP during
destroy.

Fixes: c50866e2853a ("bnxt_re: fix the regression due to changes in alloc_pbl")
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/uverbs: Don't do double free of allocated PD
Leon Romanovsky [Wed, 13 Feb 2019 17:07:05 +0000 (19:07 +0200)]
RDMA/uverbs: Don't do double free of allocated PD

There is no need to call kfree(pd) because ib_dealloc_pd() frees the PD
internally.

Fixes: 21a428a019c9 ("RDMA: Handle PD allocations by IB/core")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA: Handle ucontext allocations by IB/core
Leon Romanovsky [Tue, 12 Feb 2019 18:39:16 +0000 (20:39 +0200)]
RDMA: Handle ucontext allocations by IB/core

Following the PD conversion patch, do the same for ucontext allocations.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/core: Fix a WARN() message
Dan Carpenter [Fri, 22 Feb 2019 06:29:02 +0000 (09:29 +0300)]
RDMA/core: Fix a WARN() message

The first parameter of WARN_ONCE() is a condition, and the following
parameters are the message.  In this case, we left out the condition, so
it will just print the ops->type string.
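
Schematically (message text illustrative), the change is:

    /* before: the format string lands in the condition slot */
    WARN_ONCE("Unknown type: %s\n", ops->type);

    /* after: a real condition first, then the message */
    WARN_ONCE(1, "Unknown type: %s\n", ops->type);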

Fixes: 3856ec4b93c9 ("RDMA/core: Add RDMA_NLDEV_CMD_NEWLINK/DELLINK support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agobnxt_re: fix the regression due to changes in alloc_pbl
Devesh Sharma [Fri, 22 Feb 2019 12:16:19 +0000 (07:16 -0500)]
bnxt_re: fix the regression due to changes in alloc_pbl

While adding the use of the for_each_sg_dma_page iterator for Broadcom's
rdma driver, a regression was introduced in the __alloc_pbl path. The
change left bnxt_re in a DOA state in the for-next branch.

Fix the regression, which caused a host crash when a user space object
is created, by restricting access to hwq.pg_arr to the case where the
hwq is initialized for user space objects.

Fixes: 161ebe2498d4 ("RDMA/bnxt_re: Use for_each_sg_dma_page iterator on umem SGL")
Reported-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/mlx4: Increase the timeout for CM cache
Håkon Bugge [Sun, 17 Feb 2019 14:45:12 +0000 (15:45 +0100)]
IB/mlx4: Increase the timeout for CM cache

Using CX-3 virtual functions, either from a bare-metal machine or
pass-through from a VM, MAD packets are proxied through the PF driver.

Since the VF drivers have separate name spaces for MAD Transaction Ids
(TIDs), the PF driver has to re-map the TIDs and keep the bookkeeping
in a cache.

Following the RDMA Connection Manager (CM) protocol, it is clear when
an entry has to be evicted from the cache. But life is not perfect;
remote peers may die or be rebooted. Hence, a timeout is used to wipe
out a cache entry once the PF driver assumes the remote peer has gone.

During workloads where a high number of QPs are destroyed concurrently,
an excessive number of CM DREQ retries has been observed.

The problem can be demonstrated in a bare-metal environment, where two
nodes have instantiated 8 VFs each. This uses dual-ported HCAs, so we
have 16 vPorts per physical server.

64 processes are associated with each vPort, and each creates and
destroys one QP for each of the remote 64 processes. That is, 1024 QPs
per vPort, 16K QPs in all. The QPs are created/destroyed using the
CM.

When tearing down these 16K QPs, excessive CM DREQ retries (and
duplicates) are observed. With some cat/paste/awk wizardry on the
infiniband_cm sysfs, we observe the following sums over the 16 vPorts
on one of the nodes:

cm_rx_duplicates:
      dreq  2102
cm_rx_msgs:
      drep  1989
      dreq  6195
       rep  3968
       req  4224
       rtu  4224
cm_tx_msgs:
      drep  4093
      dreq 27568
       rep  4224
       req  3968
       rtu  3968
cm_tx_retries:
      dreq 23469

Note that the active/passive side is equally distributed between the
two nodes.

Enabling pr_debug in cm.c gives tons of:

[171778.814239] <mlx4_ib> mlx4_ib_multiplex_cm_handler: id{slave:
1,sl_cm_id: 0xd393089f} is NULL!

By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
tear-down phase of the application is reduced from approximately 90 to
50 seconds. Retries/duplicates are also significantly reduced:

cm_rx_duplicates:
      dreq  2460
[]
cm_tx_retries:
      dreq  3010
       req    47

Increasing the timeout further didn't help, as these duplicates and
retries stem from a too-short CMA timeout, which was 20 (~4 seconds)
on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
numbers fell to about 10 for both of them.

Adjustment of the CMA timeout is not part of this commit.
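
The change itself is presumably a one-liner in the mlx4 CM cache code,
along the lines of:

    -#define CM_CLEANUP_CACHE_TIMEOUT  (5 * HZ)
    +#define CM_CLEANUP_CACHE_TIMEOUT  (30 * HZ)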

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Acked-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/core: Abort page fault handler silently during owning process exit
Moni Shoua [Sun, 17 Feb 2019 14:08:24 +0000 (16:08 +0200)]
IB/core: Abort page fault handler silently during owning process exit

It is possible that during page fault handling, the process that owns
the MR is terminating. The indication for this is a failure to get the
task_struct or to take a reference on the mm_struct. In this case just
abort the page-fault handler with an error, but without a warning to the
kernel log.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/mlx5: Validate correct PD before prefetch MR
Moni Shoua [Sun, 17 Feb 2019 14:08:23 +0000 (16:08 +0200)]
IB/mlx5: Validate correct PD before prefetch MR

When prefetching an ODP MR it is required to verify that the PD of the
MR is identical to the PD with which the advise_mr request arrived.

This check was missing from the synchronous flow and is added now.
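
A minimal sketch of the added check (error code assumed):

    /* the PD that advise_mr arrived on must own the MR */
    if (mr->ibmr.pd != pd)
            return -EPERM;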

Fixes: 813e90b1aeaa ("IB/mlx5: Add advise_mr() support")
Reported-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/mlx5: Protect against prefetch of invalid MR
Moni Shoua [Sun, 17 Feb 2019 14:08:22 +0000 (16:08 +0200)]
IB/mlx5: Protect against prefetch of invalid MR

When deferring a prefetch request we need to protect against MR or PD
being destroyed while the request is still enqueued.

The first step is to validate that PD owns the lkey that describes the MR
and that the MR that the lkey refers to is owned by that PD.

The second step is to dequeue all requests when MR is destroyed.

Since a PD can't be destroyed while it owns MRs, it is guaranteed that
when a worker wakes up, the request it refers to is still valid.

It is now possible to refrain from taking a reference on the device,
since it is assured to be present, as is the PD.

While at it, replace the dedicated ordered workqueue with the system
unbound workqueue to reuse an existing resource and improve
performance. This also fixes a bug of queueing to the wrong workqueue.
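
A sketch of the deferral after this change (worker name assumed):

    INIT_WORK(&work->work, mlx5_ib_prefetch_mr_work);   /* name assumed */
    /* shared unbound workqueue instead of a dedicated ordered one */
    queue_work(system_unbound_wq, &work->work);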

Fixes: 813e90b1aeaa ("IB/mlx5: Add advise_mr() support")
Reported-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/uverbs: Store PD pointer before it is overwritten
Leon Romanovsky [Thu, 21 Feb 2019 16:07:42 +0000 (18:07 +0200)]
RDMA/uverbs: Store PD pointer before it is overwritten

The IB_MR_REREG_PD command rewrites mr->pd after a successful
rereg_user_mr(); such a change causes the usecnt information to be lost
and produces the following warning:

 WARNING: CPU: 1 PID: 1771 at drivers/infiniband/core/verbs.c:336 ib_dealloc_pd+0x4e/0x60 [ib_core]
 CPU: 1 PID: 1771 Comm: rereg_mr Tainted: G        W  OE 5.0.0-rc7-for-upstream-perf-2019-02-20_14-03-40-34 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
 RIP: 0010:ib_dealloc_pd+0x4e/0x60 [ib_core]
 RSP: 0018:ffffc90003923dc0 EFLAGS: 00010286
 RAX: 00000000ffffffff RBX: ffff88821f7f0400 RCX: ffff888236a40c00
 RDX: ffff88821f7f0400 RSI: 0000000000000001 RDI: 0000000000000000
 RBP: 0000000000000001 R08: ffff88835f665d80 R09: ffff8882209c90d8
 R10: ffff88835ec003e0 R11: 0000000000000000 R12: ffff888221680ba0
 R13: ffff888221680b00 R14: 00000000ffffffea R15: ffff88821f53c318
 FS:  00007f70db11e740(0000) GS:ffff88835f640000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000001dfd030 CR3: 000000029d9d8000 CR4: 00000000000006e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  uverbs_free_pd+0x2d/0x30 [ib_uverbs]
  destroy_hw_idr_uobject+0x16/0x40 [ib_uverbs]
  uverbs_destroy_uobject+0x28/0x170 [ib_uverbs]
  __uverbs_cleanup_ufile+0x6b/0x90 [ib_uverbs]
  uverbs_destroy_ufile_hw+0x8b/0x110 [ib_uverbs]
  ib_uverbs_close+0x1f/0x80 [ib_uverbs]
  __fput+0xb1/0x220
  task_work_run+0x7f/0xa0
  exit_to_usermode_loop+0x6b/0xb2
  do_syscall_64+0xc5/0x100
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f70dad00664
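
The fix is presumably to latch the old PD before rereg_user_mr() rewrites
mr->pd, schematically (signature abridged):

    struct ib_pd *old_pd = mr->pd;  /* save before it is overwritten */

    ret = rereg_user_mr(mr, cmd.flags, pd, &udata);
    if (!ret && (cmd.flags & IB_MR_REREG_PD)) {
            atomic_dec(&old_pd->usecnt);    /* the *old* PD, not mr->pd */
            atomic_inc(&pd->usecnt);
    }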

Fixes: e278173fd19e ("RDMA/core: Cosmetic change - move member initialization to correct block")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/hfi1: Add missing break in switch statement
Gustavo A. R. Silva [Thu, 21 Feb 2019 01:02:33 +0000 (19:02 -0600)]
IB/hfi1: Add missing break in switch statement

Fix the following warning by adding a missing break:

drivers/infiniband/hw/hfi1/tid_rdma.c: In function ‘hfi1_tid_rdma_wqe_interlock’:
drivers/infiniband/hw/hfi1/tid_rdma.c:3251:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
   switch (prev->wr.opcode) {
   ^~~~~~
drivers/infiniband/hw/hfi1/tid_rdma.c:3259:2: note: here
  case IB_WR_RDMA_READ:
  ^~~~

Warning level 3 was used: -Wimplicit-fallthrough=3

This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.
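
Schematically (the interlock logic is abridged and the TID opcode name is
taken from the surrounding switch):

    switch (prev->wr.opcode) {
    case IB_WR_TID_RDMA_WRITE:
            /* ... TID RDMA WRITE interlock handling ... */
            break;          /* the missing break added here */
    case IB_WR_RDMA_READ:
            /* ... */
            break;
    }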

Fixes: c6c231175ccd ("IB/hfi1: Add interlock between TID RDMA WRITE and other requests")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Kaike Wan <Kaike.wan@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoMerge branch 'mlx5-next' into rdma.git for-next
Jason Gunthorpe [Thu, 21 Feb 2019 19:40:18 +0000 (12:40 -0700)]
Merge branch 'mlx5-next' into rdma.git for-next

From
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

To resolve conflicts with net-next and pick up the first patch.

* branch 'mlx5-next':
  net/mlx5: Factor out HCA capabilities functions
  IB/mlx5: Add support for 50Gbps per lane link modes
  net/mlx5: Add support to ext_* fields introduced in Port Type and Speed register
  net/mlx5: Add new fields to Port Type and Speed register
  net/mlx5: Refactor queries to speed fields in Port Type and Speed register
  net/mlx5: E-Switch, Avoid magic numbers when initializing offloads mode
  net/mlx5: Relocate vport macros to the vport header file
  net/mlx5: E-Switch, Normalize the name of uplink vport number
  net/mlx5: Provide an alternative VF upper bound for ECPF
  net/mlx5: Add host params change event
  net/mlx5: Add query host params command
  net/mlx5: Update enable HCA dependency
  net/mlx5: Introduce Mellanox SmartNIC and modify page management logic
  IB/mlx5: Use unified register/load function for uplink and VF vports
  net/mlx5: Use consistent vport num argument type
  net/mlx5: Use void pointer as the type in address_of macro
  net/mlx5: Align ODP capability function with netdev coding style
  mlx5: use RCU lock in mlx5_eq_cq_get()

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agodrivers/IB,qib: Fix pinned/locked limit check in qib_get_user_pages()
Davidlohr Bueso [Sat, 16 Feb 2019 12:15:52 +0000 (04:15 -0800)]
drivers/IB,qib: Fix pinned/locked limit check in qib_get_user_pages()

The current check does not take into account the previous value of
pinned_vm; thus it is quite bogus as is. Fix this by checking the
new value after the (optimistic) atomic inc.
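
A sketch of the corrected pattern, assuming the atomic64_t pinned_vm
counter of this kernel generation:

    locked = atomic64_add_return(num_pages, &current->mm->pinned_vm);
    if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
            atomic64_sub(num_pages, &current->mm->pinned_vm);
            return -ENOMEM;
    }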

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/core: Verify that memory window type is legal
Noa Osherovich [Tue, 19 Feb 2019 13:07:34 +0000 (15:07 +0200)]
RDMA/core: Verify that memory window type is legal

Before calling the provider's alloc_mw function, verify that the
given memory type is either IB_MW_TYPE_1 or IB_MW_TYPE_2.
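
The added check is presumably of the form:

    if (cmd.mw_type != IB_MW_TYPE_1 && cmd.mw_type != IB_MW_TYPE_2)
            return -EINVAL;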

Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/iwcm: Fix string truncation error
Leon Romanovsky [Tue, 19 Feb 2019 11:09:57 +0000 (13:09 +0200)]
RDMA/iwcm: Fix string truncation error

The strlen() check at the beginning of iw_cm_map() ensures that the
devname and ifname strings are shorter than the destinations to which
they are supposed to be copied. Change the strncpy() call to strcpy(),
because we are protected from overflow. Zero the entire string buffer to
avoid copying uninitialized kernel stack memory to userspace.

This fixes the compilation warning below:

In file included from ./include/linux/dma-mapping.h:6,
                 from drivers/infiniband/core/iwcm.c:38:
In function 'strncpy',
    inlined from 'iw_cm_map' at drivers/infiniband/core/iwcm.c:519:2:
./include/linux/string.h:253:9: warning: '__builtin_strncpy' specified
bound 32 equals destination size [-Wstringop-truncation]
  return __builtin_strncpy(p, q, size);
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fixes: d53ec8af56d5 ("RDMA/iwcm: Don't copy past the end of dev_name() string")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/core: Cosmetic change - move member initialization to correct block
Yuval Shaia [Mon, 18 Feb 2019 14:23:22 +0000 (16:23 +0200)]
RDMA/core: Cosmetic change - move member initialization to correct block

old_pd is used only if the IB_MR_REREG_PD flag is set.
For readability, move its initialization to where it is used.

While there, rewrite the whole 'if-else' block so that on error we jump
directly to the label, with no need for an 'else'.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoiw_cxgb4: Make function read_tcb() static
Wei Yongjun [Mon, 18 Feb 2019 07:04:32 +0000 (07:04 +0000)]
iw_cxgb4: Make function read_tcb() static

Fixes the following sparse warning:

drivers/infiniband/hw/cxgb4/cm.c:658:6: warning:
 symbol 'read_tcb' was not declared. Should it be static?

Fixes: 11a27e2121a5 ("iw_cxgb4: complete the cached SRQ buffers")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Bugfix for set hem of SCC
Yangyang Li [Sat, 16 Feb 2019 12:10:25 +0000 (20:10 +0800)]
RDMA/hns: Bugfix for set hem of SCC

The method of setting the hem for the SCC context is different from that
of other contexts. It should notify the hardware with the detailed idx
in bt0 for SCC, while for other contexts it only needs to notify the bt
step and the hardware will calculate the idx.

Here fixes the following error when unloading the hip08 driver:

[  123.570768] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[  123.579023] {1}[Hardware Error]: event severity: recoverable
[  123.584670] {1}[Hardware Error]:  Error 0, type: recoverable
[  123.590317] {1}[Hardware Error]:   section_type: PCIe error
[  123.595877] {1}[Hardware Error]:   version: 4.0
[  123.600395] {1}[Hardware Error]:   command: 0x0006, status: 0x0010
[  123.606562] {1}[Hardware Error]:   device_id: 0000:7d:00.0
[  123.612034] {1}[Hardware Error]:   slot: 0
[  123.616120] {1}[Hardware Error]:   secondary_bus: 0x00
[  123.621245] {1}[Hardware Error]:   vendor_id: 0x19e5, device_id: 0xa222
[  123.627847] {1}[Hardware Error]:   class_code: 000002
[  123.632977] hns3 0000:7d:00.0: aer_status: 0x00000000, aer_mask: 0x00000000
[  123.639928] hns3 0000:7d:00.0: aer_layer=Transaction Layer, aer_agent=Receiver ID
[  123.647400] hns3 0000:7d:00.0: aer_uncor_severity: 0x00000000
[  123.653136] hns3 0000:7d:00.0: PCI error detected, state(=1)!!
[  123.658959] hns3 0000:7d:00.0: ROCEE uncorrected RAS error identified
[  123.665395] hns3 0000:7d:00.0: ROCEE RAS AXI rresp error
[  123.670713] hns3 0000:7d:00.0: requesting reset due to PCI error
[  123.676715] hns3 0000:7d:00.0: received reset event , reset type is 5
[  123.683147] hns3 0000:7d:00.0: AER: Device recovery successful
[  123.688978] hns3 0000:7d:00.0: PF Reset requested
[  123.693684] hns3 0000:7d:00.0: PF failed(=-5) to send mailbox message to VF
[  123.700633] hns3 0000:7d:00.0: inform reset to vf(1) failded -5!

Fixes: 6a157f7d1b14 ("RDMA/hns: Add SCC context allocation support for hip08")
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Reviewed-by: Yixian Liu <liuyixian@huawei.com>
Reviewed-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Modify qp&cq&pd specification according to UM
Lijun Ou [Sat, 16 Feb 2019 12:10:24 +0000 (20:10 +0800)]
RDMA/hns: Modify qp&cq&pd specification according to UM

According to hip08's limitation, the qp&cq specification is 1M and the
mtpt specification is 1M in kernel space.

Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agolib/irq_poll: Support schedules in non-interrupt contexts
Steve Wise [Wed, 6 Feb 2019 21:11:49 +0000 (13:11 -0800)]
lib/irq_poll: Support schedules in non-interrupt contexts

Do not assume irq_poll_sched() is called only from an interrupt context.
Use raise_softirq_irqoff() instead of __raise_softirq_irqoff() so that
ksoftirqd is kicked when the schedule comes from a non-interrupt context.

This is required for RDMA drivers, like soft iwarp, that generate cq
completion notifications in a workqueue or kthread context.  Without this
change, siw completion notifications to the ULP can take several hundred
usecs, depending on the system load.
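
The core of irq_poll_sched() after the change, with the early-out flag
checks omitted:

    local_irq_save(flags);
    list_add_tail(&iop->list, this_cpu_ptr(&blk_cpu_iopoll));
    raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);  /* kicks ksoftirqd if needed */
    local_irq_restore(flags);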

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agordma_rxe: Use netlink messages to add/delete links
Steve Wise [Fri, 15 Feb 2019 19:03:57 +0000 (11:03 -0800)]
rdma_rxe: Use netlink messages to add/delete links

Add support for the RDMA_NLDEV_CMD_NEWLINK/DELLINK messages which allow
dynamically adding new RXE links.  Deprecate the old module options for
now.

Cc: Moni Shoua <monis@mellanox.com>
Reviewed-by: Yanjun Zhu <yanjun.zhu@oracle.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/core: Add RDMA_NLDEV_CMD_NEWLINK/DELLINK support
Steve Wise [Fri, 15 Feb 2019 19:03:53 +0000 (11:03 -0800)]
RDMA/core: Add RDMA_NLDEV_CMD_NEWLINK/DELLINK support

Add support for new LINK messages to allow adding and deleting rdma
interfaces.  This will be used initially for soft rdma drivers which
instantiate device instances dynamically by the admin specifying a netdev
device to use.  The rdma_rxe module will be the first user of these
messages.

The design is modeled after RTNL_NEWLINK/DELLINK: rdma drivers register
with the rdma core if they provide link add/delete functions.  Each driver
registers with a unique "type" string, that is used to dispatch messages
coming from user space.  A new RDMA_NLDEV_ATTR is defined for the "type"
string.  User mode will pass 3 attributes in a NEWLINK message:
RDMA_NLDEV_ATTR_DEV_NAME for the desired rdma device name to be created,
RDMA_NLDEV_ATTR_LINK_TYPE for the "type" of link being added, and
RDMA_NLDEV_ATTR_NDEV_NAME for the net_device interface to use for this
link.  The DELLINK message will contain the RDMA_NLDEV_ATTR_DEV_INDEX of
the device to delete.
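
A driver then registers its link ops roughly as the rxe patch in this
series does:

    static struct rdma_link_ops rxe_link_ops = {
            .type = "rxe",
            .newlink = rxe_newlink,
    };

    rdma_link_register(&rxe_link_ops);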

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/usnic: Fix deadlock
Parvi Kaustubhi [Sat, 9 Feb 2019 17:28:30 +0000 (09:28 -0800)]
IB/usnic: Fix deadlock

There is a deadlock between the usnic ib_register and netdev_notify
paths.

usnic_ib_discover_pf()
| mutex_lock(&usnic_ib_ibdev_list_lock);
 | usnic_ib_device_add();
  | ib_register_device()
   | usnic_ib_query_port()
    | mutex_lock(&us_ibdev->usdev_lock);
     | ib_get_eth_speed()
      | rtnl_lock()

order of lock: &usnic_ib_ibdev_list_lock -> usdev_lock -> rtnl_lock

rtnl_lock()
 | usnic_ib_netdevice_event()
  | mutex_lock(&usnic_ib_ibdev_list_lock);

order of lock: rtnl_lock -> &usnic_ib_ibdev_list_lock

The solution is to use the core's lock-free ib_device_get_by_netdev()
scheme to look up the ib_dev while handling netdev & inet events.
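
A sketch of the lock-free lookup in the notifier handler:

    ib_dev = ib_device_get_by_netdev(netdev, RDMA_DRIVER_USNIC);
    if (ib_dev) {
            /* handle the event without usnic_ib_ibdev_list_lock */
            ib_device_put(ib_dev);
    }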

Signed-off-by: Parvi Kaustubhi <pkaustub@cisco.com>
Reviewed-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
Reviewed-by: Tanmay Inamdar <tinamdar@cisco.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/rxe: Close a race after ib_register_device
Jason Gunthorpe [Wed, 13 Feb 2019 04:12:56 +0000 (21:12 -0700)]
RDMA/rxe: Close a race after ib_register_device

Since rxe allows unregistration from other threads, the rxe pointer can
become invalid any moment after ib_register_driver returns. This could
cause a user-triggered use after free.

Add another driver callback to be called right after the device becomes
registered to complete any device setup required post-registration.  This
callback has enough core locking to prevent the device from becoming
unregistered.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/rxe: Add ib_device_get_by_name() and use it in rxe
Jason Gunthorpe [Wed, 13 Feb 2019 04:12:55 +0000 (21:12 -0700)]
RDMA/rxe: Add ib_device_get_by_name() and use it in rxe

rxe has an open coded version of this that is not as safe as the core
version. This lets us eliminate the internal device list entirely from
rxe.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/rxe: Use driver_unregister and new unregistration API
Jason Gunthorpe [Tue, 22 Jan 2019 23:27:24 +0000 (16:27 -0700)]
RDMA/rxe: Use driver_unregister and new unregistration API

rxe does not have correct locking for its registration/unregistration
paths; use the core code to handle it instead. In this mode
ib_unregister_device will also do the dealloc, so rxe is required to do
its cleanup from a callback.

The core code ensures that unregistration is done only once, and generally
takes care of locking and concurrency problems for rxe.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/device: Provide APIs from the core code to help unregistration
Jason Gunthorpe [Wed, 13 Feb 2019 04:12:53 +0000 (21:12 -0700)]
RDMA/device: Provide APIs from the core code to help unregistration

These APIs are intended to support drivers that exist outside the usual
driver core probe()/remove() callbacks. Normally the driver core will
prevent remove() from running concurrently with probe(); once this safety
is lost, drivers need more support to get the locking and lifetimes right.

ib_unregister_driver() is intended to be used during module_exit of a
driver using these APIs. It unregisters all the associated ib_devices.

ib_unregister_device_and_put() is to be used by a driver-specific removal
function (ie removal by name, removal from a netdev notifier, removal from
netlink)

ib_unregister_device_queued() is to be used from netdev notifier chains
where RTNL is held.

The locking is tricky here since once things become async it is possible
to race unregister with registration. This is largely solved by relying on
the registration refcount: unregistration will only ever work on something
that has a positive registration refcount - and then an unregistration
mutex serializes all competing unregistrations of the same device.
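
Typical usage in a soft driver's module_exit, for example:

    static void __exit rxe_module_exit(void)
    {
            rdma_link_unregister(&rxe_link_ops);
            ib_unregister_driver(RDMA_DRIVER_RXE);
    }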

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/rxe: Use ib_device_get_by_netdev() instead of open coding
Jason Gunthorpe [Wed, 13 Feb 2019 04:12:52 +0000 (21:12 -0700)]
RDMA/rxe: Use ib_device_get_by_netdev() instead of open coding

The core API handles the locking correctly and is faster if there are
multiple devices.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/device: Add ib_device_get_by_netdev()
Jason Gunthorpe [Wed, 13 Feb 2019 04:12:51 +0000 (21:12 -0700)]
RDMA/device: Add ib_device_get_by_netdev()

Several drivers need to find the ib_device from a given netdev. rxe needs
this at speed in an unsleepable context, so choose to implement the
translation using a RCU safe hash table.

The hash table can have a many to one mapping. This is intended to support
some future case where multiple IB drivers (ie iWarp and RoCE) connect to
the same netdevs. driver_ids will need to be different to support this.

In the process this makes the struct ib_device and ib_port_data RCU safe
by deferring their kfrees.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/device: Add ib_device_set_netdev() as an alternative to get_netdev
Jason Gunthorpe [Wed, 13 Feb 2019 04:12:50 +0000 (21:12 -0700)]
RDMA/device: Add ib_device_set_netdev() as an alternative to get_netdev

The associated netdev should not actually be very dynamic, so for most
drivers there is no reason for a callback like this. Provide an API to
inform the core code about the net dev affiliation and use a core
maintained data structure instead.

This allows the core code to be more aware of the ndev relationship which
will allow some new APIs based around this.

This also uses locking that makes some kind of sense; many drivers had a
confusing RCU lock or were missing locking entirely, which isn't right.
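
For example, a driver informs the core once at device creation time:

    /* port 1 of the new ib_device is backed by ndev */
    err = ib_device_set_netdev(&rxe->ib_dev, ndev, 1);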

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/cache: Move the cache per-port data into the main ib_port_data
Jason Gunthorpe [Wed, 13 Feb 2019 04:12:49 +0000 (21:12 -0700)]
RDMA/cache: Move the cache per-port data into the main ib_port_data

Like the other cases, there is no real reason to have another array just
for the cache. This larger conversion gets its own patch.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/device: Consolidate ib_device per_port data into one place
Jason Gunthorpe [Wed, 13 Feb 2019 04:12:48 +0000 (21:12 -0700)]
RDMA/device: Consolidate ib_device per_port data into one place

There is no reason to have three allocations of per-port data. Combine
them together and make the lifetime for all the per-port data match the
struct ib_device.

Following patches will require more port-specific data, now there is a
good place to put it.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA: Add and use rdma_for_each_port
Jason Gunthorpe [Wed, 13 Feb 2019 04:12:47 +0000 (21:12 -0700)]
RDMA: Add and use rdma_for_each_port

We have many loops iterating over all of the end port numbers on a struct
ib_device, simplify them with a for_each helper.
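
A loop over the end ports then becomes:

    unsigned int port;

    rdma_for_each_port (device, port) {
            struct ib_port_attr attr;

            if (ib_query_port(device, port, &attr))
                    continue;
            /* use attr */
    }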

Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/nldev: Don't expose number of not-visible entries
Leon Romanovsky [Mon, 18 Feb 2019 20:25:52 +0000 (22:25 +0200)]
RDMA/nldev: Don't expose number of not-visible entries

The netlink dumpit handshake exchanges the index from which the kernel
should start to return values. In the current code, this index also
included items not visible to the calling PID, and thus indirectly
revealed the number of entries.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/nldev: Connect QP number to .doit callback
Leon Romanovsky [Mon, 18 Feb 2019 20:25:51 +0000 (22:25 +0200)]
RDMA/nldev: Connect QP number to .doit callback

This patch adds the ability to query a specific QP based on its LQPN
(local QPN), which is assigned by HW and needs special treatment when
inserting into the restrack DB.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/nldev: Provide parent IDs for PD, MR and QP objects
Leon Romanovsky [Mon, 18 Feb 2019 20:25:50 +0000 (22:25 +0200)]
RDMA/nldev: Provide parent IDs for PD, MR and QP objects

PD, MR and QP objects have parent objects: contexts and PDs.  The exposed
parent IDs allow correlating various objects and simplify debug
investigation.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/nldev: Share with user-space object IDs
Leon Romanovsky [Mon, 18 Feb 2019 20:25:49 +0000 (22:25 +0200)]
RDMA/nldev: Share with user-space object IDs

Give user space tools a unique identifier for PD, MR, CQ and CM_ID
objects, so they are able to query them with .doit callbacks.

The QP .doit is not supported yet, until all drivers are updated to
provide an LQPN equal to their restrack ID.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/restrack: Prepare restrack_root to addition of extra fields per-type
Leon Romanovsky [Mon, 18 Feb 2019 20:25:48 +0000 (22:25 +0200)]
RDMA/restrack: Prepare restrack_root to addition of extra fields per-type

As a preparation for extending rdma_restrack_root to provide software
IDs, which will be per-type too, convert rdma_restrack_root from a
struct with arrays to an array of structs.

Such a conversion allows us to drop the rwsem lock in favour of the
internal XArray lock.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agonet/mlx5: Factor out HCA capabilities functions
Leon Romanovsky [Sun, 17 Feb 2019 11:11:02 +0000 (13:11 +0200)]
net/mlx5: Factor out HCA capabilities functions

Combine all HCA capabilities setters under one function
and compile out the ODP related function in case the kernel
was compiled without ODP support.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
5 years agoRDMA/restrack: Hide restrack DB from IB/core
Leon Romanovsky [Mon, 18 Feb 2019 20:25:47 +0000 (22:25 +0200)]
RDMA/restrack: Hide restrack DB from IB/core

There is no need to expose internals of restrack DB to IB/core.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/restrack: Reduce scope of synchronization lock while updating DB
Leon Romanovsky [Mon, 18 Feb 2019 20:25:46 +0000 (22:25 +0200)]
RDMA/restrack: Reduce scope of synchronization lock while updating DB

XArray uses an internal lock for updates. This means that our external
RW lock is only needed to ensure that an entry is not deleted while we
are iterating over the list.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/nldev: Add resource tracker doit callback
Leon Romanovsky [Mon, 18 Feb 2019 20:25:45 +0000 (22:25 +0200)]
RDMA/nldev: Add resource tracker doit callback

Implement the doit callbacks, and ensure that users won't provide port
values for a resource entry allocated in per-device mode, as needed for
the .doit callback.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/restrack: Translate from ID to restrack object
Leon Romanovsky [Mon, 18 Feb 2019 20:25:44 +0000 (22:25 +0200)]
RDMA/restrack: Translate from ID to restrack object

Add a new general helper to get a restrack entry given its ID and
respective type.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/restrack: Convert internal DB from hash to XArray
Leon Romanovsky [Mon, 18 Feb 2019 20:25:43 +0000 (22:25 +0200)]
RDMA/restrack: Convert internal DB from hash to XArray

The addition of .doit callbacks poses a new access pattern to the
resource entries, by some user-visible index. Back then, the legacy DB
was implemented as a hash because per-index access wasn't needed and
XArray wasn't accepted yet.

The acceptance of XArray, together with the need for per-index access,
requires a refresh of the DB implementation.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/core: Move device addition deletion to device.c
Parav Pandit [Wed, 13 Feb 2019 17:23:06 +0000 (19:23 +0200)]
RDMA/core: Move device addition deletion to device.c

Move core device addition and removal from sysfs.c to device.c, as
device.c is a more appropriate place for device management.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/core: Introduce and use ib_setup_port_attrs()
Parav Pandit [Wed, 13 Feb 2019 17:23:04 +0000 (19:23 +0200)]
RDMA/core: Introduce and use ib_setup_port_attrs()

Refactor the code for device and port sysfs attributes for reuse.

While at it, rename the counterpart free function to ib_free_port_attrs.

Also, the attribute setup sequence is:
(a) port specific init.
(b) device stats alloc/init.

So for cleanup, follow reverse sequence:
(a) device stats dealloc
(b) port specific cleanup

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/core: Use simpler device_del() instead of device_unregister()
Parav Pandit [Wed, 13 Feb 2019 17:23:03 +0000 (19:23 +0200)]
RDMA/core: Use simpler device_del() instead of device_unregister()

Instead of holding an extra reference using get_device(), which
device_unregister() releases, simplify it as below.

device_add() balances with device_del().  device_initialize() balances
with put_device(), always via ib_dealloc_device().

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/cxgb4: Remove kref accounting for sync operation
Leon Romanovsky [Tue, 12 Feb 2019 18:39:15 +0000 (20:39 +0200)]
RDMA/cxgb4: Remove kref accounting for sync operation

Ucontext allocation and release aren't async events and don't need kref
accounting. The common layer of the RDMA subsystem ensures that
dealloc_ucontext will be called after all other objects are released.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/nes: Remove useless usecnt variable and redundant memset
Leon Romanovsky [Tue, 12 Feb 2019 18:39:14 +0000 (20:39 +0200)]
RDMA/nes: Remove useless usecnt variable and redundant memset

The internal design of RDMA/core ensures that dealloc_ucontext will be
called only if alloc_ucontext succeeded, hence there is no need to
manage an internal variable to mark the validity of the ucontext.

As part of this change, remove the redundant memset too.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/hfi1: Fix a build warning for TID RDMA READ
Kaike Wan [Fri, 15 Feb 2019 21:45:30 +0000 (13:45 -0800)]
IB/hfi1: Fix a build warning for TID RDMA READ

The following build warning was produced for the TID RDMA READ
patch ("IB/hfi1: Enable TID RDMA READ protocol"):

drivers/infiniband/hw/hfi1/qp.c: In function 'hfi1_setup_wqe':
drivers/infiniband/hw/hfi1/qp.c:328:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
   hfi1_setup_tid_rdma_wqe(qp, wqe);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/infiniband/hw/hfi1/qp.c:329:2: note: here
  case IB_QPT_UC:
  ^~~~

This patch will fix the issue by adding the "fall through" comment.

Fixes: f1ab4efa6d32 ("IB/hfi1: Enable TID RDMA READ protocol")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/uverbs: Fix an error flow in ib_uverbs_poll_cq
Jason Gunthorpe [Thu, 14 Feb 2019 20:13:31 +0000 (20:13 +0000)]
RDMA/uverbs: Fix an error flow in ib_uverbs_poll_cq

The new output_written block was wrongly placed before the ret=0, causing
the error code to be lost. uverbs_output_written is not expected to fail,
and even if it does fail it has no significant impact on the userspace
flow.

Reported-by: Bart Van Assche <bvanassche@acm.org>
Fixes: d6f4a21f309d ("RDMA/uverbs: Mark ioctl responses with UVERBS_ATTR_F_VALID_OUTPUT")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
5 years agoIB/{hw,sw}: Remove 'uobject->context' dependency in object creation APIs
Shamir Rabinovitch [Thu, 7 Feb 2019 16:44:49 +0000 (18:44 +0200)]
IB/{hw,sw}: Remove 'uobject->context' dependency in object creation APIs

Now that we have the udata passed to all the ib_xxx object creation APIs
and the additional macro 'rdma_udata_to_drv_context' to get the
ib_ucontext from the ib_udata stored in uverbs_attr_bundle, we can finally
start to remove the drivers' dependency on
ib_xxx->uobject->context.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/verbs: Add helper function rdma_udata_to_drv_context
Shamir Rabinovitch [Thu, 7 Feb 2019 16:44:48 +0000 (18:44 +0200)]
IB/verbs: Add helper function rdma_udata_to_drv_context

Add a helper function to get the driver's context out of the ib_udata
wrapped in uverbs_attr_bundle for user objects, or NULL for kernel
objects.
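
Usage then looks like (mlx5 shown as an example):

    struct mlx5_ib_ucontext *context = rdma_udata_to_drv_context(
            udata, struct mlx5_ib_ucontext, ibucontext);
    /* context is NULL for kernel objects */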

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/uverbs: Add ib_ucontext to uverbs_attr_bundle sent from ioctl and cmd flows
Shamir Rabinovitch [Thu, 7 Feb 2019 16:44:47 +0000 (18:44 +0200)]
IB/uverbs: Add ib_ucontext to uverbs_attr_bundle sent from ioctl and cmd flows

Add ib_ucontext to the uverbs_attr_bundle sent down the ioctl and cmd
flows as soon as the flow has an ib_uobject.

In addition, remove rdma_get_ucontext helper function that is only used by
ib_umem_get.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/ipoib: Use __func__ instead of function's name
Erez Alfasi [Wed, 13 Feb 2019 17:12:02 +0000 (19:12 +0200)]
IB/ipoib: Use __func__ instead of function's name

Change debug statements to use %s and __func__ instead
of hard-coded function names.

Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/iwpm: Remove set but not used variable 'msg_seq'
YueHaibing [Thu, 14 Feb 2019 01:57:10 +0000 (01:57 +0000)]
RDMA/iwpm: Remove set but not used variable 'msg_seq'

Fixes gcc '-Wunused-but-set-variable' warning:

drivers/infiniband/core/iwpm_util.c: In function 'iwpm_send_hello':
drivers/infiniband/core/iwpm_util.c:811:6: warning:
 variable 'msg_seq' set but not used [-Wunused-but-set-variable]

It has never been used since its introduction in commit b0bad9ad514f
("RDMA/IWPM: Support no port mapping requirements").

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Configure capacity of hns device
Lijun Ou [Sun, 3 Feb 2019 08:13:07 +0000 (16:13 +0800)]
RDMA/hns: Configure capacity of hns device

This patch adds the new device capability IB_DEVICE_MEM_MGT_EXTENSIONS to
indicate device support for the following features:

1. Fast register memory region.
2. Send with remote invalidate by FRMR.
3. Local invalidate memory region.

It also adds the max depth of the FRMR page list length.

Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Delete useless prints for aeq subtype event
Yixian Liu [Sun, 3 Feb 2019 08:13:06 +0000 (16:13 +0800)]
RDMA/hns: Delete useless prints for aeq subtype event

Currently, all messages printed for the aeq subtype event are wrong.
Thus, delete them; only the value of the subtype event is printed.

Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Set allocated memory to zero for wrid
Yixian Liu [Sun, 3 Feb 2019 08:13:05 +0000 (16:13 +0800)]
RDMA/hns: Set allocated memory to zero for wrid

The memory allocated for wrid should be initialized to zero.

Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Fix the state of rereg mr
Yixian Liu [Sun, 3 Feb 2019 08:13:04 +0000 (16:13 +0800)]
RDMA/hns: Fix the state of rereg mr

The state of an MR after a reregister operation should be set to valid.
Otherwise, it will keep the same state as before the reregister.

Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Limit minimum ROCE CQ depth to 64
chenglang [Sun, 3 Feb 2019 08:13:03 +0000 (16:13 +0800)]
RDMA/hns: Limit minimum ROCE CQ depth to 64

This patch modifies the minimum CQ depth specification of hip08 to be
consistent with the processing of hip06.

Signed-off-by: chenglang <chenglang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/mlx5: Add support for 50Gbps per lane link modes
Aya Levin [Wed, 13 Feb 2019 06:55:46 +0000 (22:55 -0800)]
IB/mlx5: Add support for 50Gbps per lane link modes

The driver now supports new link modes: 50Gbps per lane support for
50G/100G/200G. This patch reads the correct field (legacy vs. extended)
based on a FW indication bit, and adds a translation function (link
modes to IB width and speed) for the new link modes.

Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Add support to ext_* fields introduced in Port Type and Speed register
Aya Levin [Wed, 13 Feb 2019 06:55:45 +0000 (22:55 -0800)]
net/mlx5: Add support to ext_* fields introduced in Port Type and Speed register

This patch exposes new link modes (including 50Gbps per lane), and the
ext_* fields which describe the new link modes in the Port Type and
Speed register (PTYS). Access functions, translation functions
(speed <-> HW bits) and the link max speed function were modified.

Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Add new fields to Port Type and Speed register
Aya Levin [Wed, 13 Feb 2019 06:55:44 +0000 (22:55 -0800)]
net/mlx5: Add new fields to Port Type and Speed register

The Port Type and Speed register (PTYS) introduces three new fields
extending the speeds/protocols that can be reported and configured.

Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Refactor queries to speed fields in Port Type and Speed register
Aya Levin [Wed, 13 Feb 2019 06:55:43 +0000 (22:55 -0800)]
net/mlx5: Refactor queries to speed fields in Port Type and Speed register

This patch consolidates queries to speed-related fields in the Port Type
and Speed register (PTYS) into a single API. In addition, this patch
refactors functions which serve only the Ethernet driver: remove the
protocol type as an input parameter, move code from the 'core' directory
into the 'en' directory and add an 'eth' prefix to the function names.
The patch also encapsulates functions that are not used outside the
Ethernet driver and removes redundant include files.

Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: E-Switch, Avoid magic numbers when initializing offloads mode
Bodong Wang [Wed, 13 Feb 2019 06:55:42 +0000 (22:55 -0800)]
net/mlx5: E-Switch, Avoid magic numbers when initializing offloads mode

When dealing with the offloads mode initialization, the driver refers to
the number of VFs and adds the magic number one (1) to account for the
uplink. This is not clear and will make the code less readable after
adding other vports (e.g. host PF). As these are special vports
compared to VF vports, add a helper macro to denote such special
vports and eliminate the use of the magic number.

Moreover, when creating offloads flow table and groups, the driver
reserves two more slots for UC and MC miss rules. Replace this magic
number with a helper macro as well.

This patch doesn't change any functionality.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Relocate vport macros to the vport header file
Bodong Wang [Wed, 13 Feb 2019 06:55:41 +0000 (22:55 -0800)]
net/mlx5: Relocate vport macros to the vport header file

These are two macros in the driver general header which deal with the
total number of vports and whether a vport is the vport manager. Such
macros are vport entities, so it is better to place them in the vport
header file.

This patch doesn't change any functionality.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: E-Switch, Normalize the name of uplink vport number
Bodong Wang [Wed, 13 Feb 2019 06:55:40 +0000 (22:55 -0800)]
net/mlx5: E-Switch, Normalize the name of uplink vport number

The driver used to name the uplink vport FDB_UPLINK_VPORT; it's hard to
comply with the same naming convention along with the introduction of
other vports. Use MLX5_VPORT as the prefix for such vports and
relocate the uplink vport definition to a public header file for the
benefit of both the net and IB drivers.

This patch doesn't change any functionality.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Provide an alternative VF upper bound for ECPF
Bodong Wang [Wed, 13 Feb 2019 06:55:39 +0000 (22:55 -0800)]
net/mlx5: Provide an alternative VF upper bound for ECPF

The ECPF doesn't support SR-IOV, but an ECPF E-Switch manager shall know
the max VFs supported by its peer host PF in order to control those
VF vports.

The current driver implementation uses the total VFs quantity as
provided by the PCI sub-system as an upper bound of the VF vports
the e-switch code needs to deal with. This obviously can't work as
is on an ECPF e-switch manager. For now, we use a hard-coded value of
128 on such systems.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Add host params change event
Bodong Wang [Wed, 13 Feb 2019 06:55:38 +0000 (22:55 -0800)]
net/mlx5: Add host params change event

In Embedded CPU (EC) configurations, the EC driver needs to know when
the number of virtual functions changes on the corresponding PF at the
host side. This is required so the EC driver can create or destroy
representor net devices that represent the VFs' ports.

Whenever a change in the number of VFs occurs, firmware will generate an
event towards the EC which will trigger a work to complete the rest of
the handling. The specifics of the handling will be introduced in a
downstream patch.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Add query host params command
Bodong Wang [Wed, 13 Feb 2019 06:55:37 +0000 (22:55 -0800)]
net/mlx5: Add query host params command

The QUERY_HOST_PARAMS command is used by an Embedded CPU Physical
Function (ECPF) driver to identify and retrieve information about the
PF on the host side, e.g. the number of virtual functions and the PCI
BDF.

The number of VFs can be changed on the fly; a function is added to
query the current number of VFs and will be used in downstream patches.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Update enable HCA dependency
Bodong Wang [Wed, 13 Feb 2019 06:55:36 +0000 (22:55 -0800)]
net/mlx5: Update enable HCA dependency

With the introduction of the ECPF, we require that the ECPF driver
always call enable/disable HCA for that PF in the same way a PF does
this for its VFs. The PF is still responsible for calling enable and
disable HCA for its VFs.

To distinguish between the ECPF executing enable/disable HCA for
itself or for the PF, it sets the embedded CPU function bit in the
input params struct of these commands. When the bit is cleared and the
function ID is zero, it refers to the peer PF.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Introduce Mellanox SmartNIC and modify page management logic
Bodong Wang [Wed, 13 Feb 2019 06:55:35 +0000 (22:55 -0800)]
net/mlx5: Introduce Mellanox SmartNIC and modify page management logic

Mellanox's SmartNIC combines embedded CPU (e.g. ARM) processing power
with advanced network offloads to accelerate a multitude of security,
networking and storage applications.

With the introduction of the SmartNIC, there is a new PCI function
called the Embedded CPU Physical Function (ECPF), and it's possible for
a PF to get its ICM pages from the ECPF PCI function. The driver shall
identify whether it is running on such a function by reading a bit in
the initialization segment.

When firmware asks for pages, it issues a page request event
specifying how many pages it requests and for which function. The
driver responds with a manage_pages command providing the requested
pages along with an indication of which function it is providing these
pages for.

The encoding before this patch was as follows:
    function_id == 0: pages are requested for the function receiving
                      the EQE.
    function_id != 0: pages are requested for VF identified by the
                      function_id value

A new one bit field in the EQE identifies that pages are requested for
the ECPF.

The notion of a page_supplier can be introduced here; to support it,
manage_pages and query_pages were modified so firmware can distinguish
the following cases:

1. Function provides pages for itself
2. PF provides pages for its VF
3. ECPF provides pages to itself
4. ECPF provides pages for another function

This distinction is possible through the introduction of the bit
"embedded_cpu_function" in query_pages, manage_pages and page request
EQE.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agoIB/mlx5: Use unified register/load function for uplink and VF vports
Bodong Wang [Wed, 13 Feb 2019 06:55:34 +0000 (22:55 -0800)]
IB/mlx5: Use unified register/load function for uplink and VF vports

The IB driver maintains different registration and load function calls
for uplink and VF vports. This is not necessary, as they differ from
each other only in their profiles.

This patch doesn't change any functionality.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Use consistent vport num argument type
Bodong Wang [Wed, 13 Feb 2019 06:55:33 +0000 (22:55 -0800)]
net/mlx5: Use consistent vport num argument type

Use u16 for vport number, which matches how hardware refers to this
argument throughout commands.

This patch doesn't change any functionality.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Use void pointer as the type in address_of macro
Bodong Wang [Wed, 13 Feb 2019 06:55:32 +0000 (22:55 -0800)]
net/mlx5: Use void pointer as the type in address_of macro

Better to use void * and avoid unnecessary casts.

This patch doesn't change any functionality.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Align ODP capability function with netdev coding style
Leon Romanovsky [Mon, 11 Feb 2019 11:56:07 +0000 (13:56 +0200)]
net/mlx5: Align ODP capability function with netdev coding style

Update newly introduced function to be aligned to netdev coding style.

Fixes: 46861e3e88be ("net/mlx5: Set ODP SRQ support in firmware")
Reported-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
5 years agoRDMA/nes: Use for_each_sg_dma_page iterator for umem SGL
Shiraz, Saleem [Tue, 12 Feb 2019 17:01:14 +0000 (11:01 -0600)]
RDMA/nes: Use for_each_sg_dma_page iterator for umem SGL

Use the for_each_sg_dma_page iterator variant to walk the umem
DMA-mapped SGL and get the page DMA address. This avoids the extra
loop needed to iterate over the pages within each SGE when the
for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver, as it is
only relevant for ODP MRs. Use the system page size and shift instead.
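
A minimal sketch of the new walk (the destination array is assumed
for the example; the driver's page-list handling is more involved):

    struct sg_dma_page_iter dma_iter;
    u64 *pbl = page_list;   /* hypothetical DMA page-address array */

    for_each_sg_dma_page(umem->sg_head.sgl, &dma_iter, umem->nmap, 0)
        *pbl++ = sg_page_iter_dma_address(&dma_iter);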

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/rdmavt: Adapt to handle non-uniform sizes on umem SGEs
Shiraz, Saleem [Tue, 12 Feb 2019 16:52:24 +0000 (10:52 -0600)]
RDMA/rdmavt: Adapt to handle non-uniform sizes on umem SGEs

rdmavt expects a uniform size on all umem SGEs, which is currently
PAGE_SIZE.

Adapt to a umem API change that can return non-uniformly sized SGEs
due to combining contiguous PAGE_SIZE regions into a single SGE. Use
the for_each_sg_page variant to unfold the larger SGEs into a list of
PAGE_SIZE elements.

Additionally, purge umem->page_shift usage in the driver, as it is
only relevant for ODP MRs. Use the system page size and shift instead.
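
A sketch of the unfolding walk (the pages array and its bounds are
simplified for the example):

    struct sg_page_iter sg_iter;
    int n = 0;

    for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->nmap, 0) {
        /* yields one PAGE_SIZE page per iteration, even when the
         * umem packed several contiguous pages into a single SGE */
        pages[n++] = sg_page_iter_page(&sg_iter);
    }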

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA: Fix allocation failure on pointer pd
Colin Ian King [Tue, 12 Feb 2019 11:22:33 +0000 (11:22 +0000)]
RDMA: Fix allocation failure on pointer pd

The null check on an allocation failure of pd currently checks whether
pd is non-null rather than null. Fix this by adding the missing !
operator.
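
The bug in miniature (the surrounding uverbs context is simplified
to a plain allocation):

    pd = kzalloc(sizeof(*pd), GFP_KERNEL);
    if (pd)         /* wrong: treats a successful allocation as failure */
        return -ENOMEM;

which the patch turns into:

    pd = kzalloc(sizeof(*pd), GFP_KERNEL);
    if (!pd)
        return -ENOMEM;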

Fixes: 21a428a019c9 ("RDMA: Handle PD allocations by IB/core")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoMerge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma into for-next
Doug Ledford [Wed, 13 Feb 2019 14:35:39 +0000 (09:35 -0500)]
Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma into for-next

I had merged the hfi1-tid code into my local copy of for-next, but was
waiting on 0day testing before pushing it (I pushed it to my wip
branch).  Having waited several days for 0day testing to show up, I'm
finally just going to push it out.  In the meantime, though, Jason
pushed other stuff to for-next, so I needed to merge up the branches
before pushing.

Signed-off-by: Doug Ledford <dledford@redhat.com>
5 years agomlx5: use RCU lock in mlx5_eq_cq_get()
Cong Wang [Wed, 6 Feb 2019 23:00:19 +0000 (15:00 -0800)]
mlx5: use RCU lock in mlx5_eq_cq_get()

mlx5_eq_cq_get() is called from the IRQ handler, and the spinlock
inside it becomes heavily contended when we test a heavy workload
with 60 RX queues and 80 CPUs; the contention shows up clearly in
the flame graph.

In fact, radix_tree_lookup() is perfectly fine under the RCU read
lock, so we don't have to take a spinlock on this hot path. This is
much like commit 291c566a2891
("net/mlx4_core: Fix racy CQ (Completion Queue) free"). Slow paths
are still serialized with the spinlock, and with synchronize_irq()
it should be safe to just move the fast path to the RCU read lock.

This patch reduces latency by about 50% for our memcached workload
on the 4.14 kernel we test with. Upstream, as pointed out by Saeed,
this spinlock was reworked in commit 02d92f790364
("net/mlx5: CQ Database per EQ"), so the difference could be smaller.

Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agoRDMA/bnxt_re: fix or'ing of data into an uninitialized struct member
Colin Ian King [Mon, 11 Feb 2019 13:34:15 +0000 (13:34 +0000)]
RDMA/bnxt_re: fix or'ing of data into an uninitialized struct member

The struct member comp_mask has not been initialized, yet a bit
pattern is bitwise-or'd into it, so the other bit fields in comp_mask
may contain garbage from the stack. Fix this by turning the bitwise
or into an assignment.
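
The fix pattern, sketched (struct and flag names are assumed from the
ABI header rather than quoted from the patch):

    struct bnxt_re_uctx_resp resp;   /* on the stack, not zeroed */

    /* before: or'ing preserves whatever garbage comp_mask held */
    resp.comp_mask |= BNXT_RE_UCNTX_CMASK_HAVE_CCTX;

    /* after: plain assignment fully defines the member */
    resp.comp_mask = BNXT_RE_UCNTX_CMASK_HAVE_CCTX;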

Fixes: 95b86d1c91ad ("RDMA/bnxt_re: Update kernel user abi to pass chip context")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/mlx5: Fix memory leak in case we fail to add an IB device
Mark Bloch [Mon, 11 Feb 2019 15:40:54 +0000 (17:40 +0200)]
RDMA/mlx5: Fix memory leak in case we fail to add an IB device

Make sure the IB device is freed on failure.
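
The general shape of the error path (a sketch; the allocation is
shown in its two-argument form and the setup step is hypothetical):

    dev = ib_alloc_device(mlx5_ib_dev, ib_dev);
    if (!dev)
        return NULL;

    if (setup_profile(dev)) {
        ib_dealloc_device(&dev->ib_dev);  /* the previously missing free */
        return NULL;
    }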

Fixes: b5ca15ad7e61 ("IB/mlx5: Add proper representors support")
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/mlx5: Fix bad flow upon DEVX mkey creation
Yishai Hadas [Mon, 11 Feb 2019 15:40:53 +0000 (17:40 +0200)]
IB/mlx5: Fix bad flow upon DEVX mkey creation

Fix a bad flow in DEVX mkey creation: do not delete the indirect mkey
from the radix tree if a previous attempt to insert it failed.
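
The cleanup-ordering principle, sketched generically (helper names
are hypothetical):

    err = radix_tree_insert(&table, key, mkey);
    if (err)
        goto err_destroy;           /* never inserted: nothing to delete */

    err = finish_mkey_setup(mkey);
    if (err)
        goto err_tree;

    return 0;

    err_tree:
        radix_tree_delete(&table, key);
    err_destroy:
        destroy_mkey(mkey);
        return err;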

Fixes: 534fd7aac56a ("IB/mlx5: Manage indirection mkey upon DEVX flow for ODP")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/rxe: Use for_each_sg_page iterator on umem SGL
Shiraz, Saleem [Mon, 11 Feb 2019 15:25:07 +0000 (09:25 -0600)]
RDMA/rxe: Use for_each_sg_page iterator on umem SGL

The driver walks the umem SGL assuming a 1:1 mapping between SGEs and
system pages. Update it to use the for_each_sg_page iterator to get
the individual pages contained in each SGE. This is a prerequisite
for adding page combining into SGEs while building the scatter table
in the IB core.

Additionally, purge umem->page_shift usage in the driver, as it is
only relevant for ODP MRs. Use the system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/ocrdma: Use for_each_sg_dma_page iterator on umem SGL
Shiraz, Saleem [Mon, 11 Feb 2019 15:25:05 +0000 (09:25 -0600)]
RDMA/ocrdma: Use for_each_sg_dma_page iterator on umem SGL

Use the for_each_sg_dma_page iterator variant to walk the umem
DMA-mapped SGL and get the page DMA address. This avoids the extra
loop needed to iterate over the pages within each SGE when the
for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver, as it is
only relevant for ODP MRs. Use the system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/qedr: Use for_each_sg_dma_page iterator on umem SGL
Shiraz, Saleem [Mon, 11 Feb 2019 15:25:04 +0000 (09:25 -0600)]
RDMA/qedr: Use for_each_sg_dma_page iterator on umem SGL

Use the for_each_sg_dma_page iterator variant to walk the umem
DMA-mapped SGL and get the page DMA address. This avoids the extra
loop needed to iterate over the pages within each SGE when the
for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver, as it is
only relevant for ODP MRs. Use the system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/vmw_pvrdma: Use for_each_sg_dma_page iterator on umem SGL
Shiraz, Saleem [Mon, 11 Feb 2019 15:25:03 +0000 (09:25 -0600)]
RDMA/vmw_pvrdma: Use for_each_sg_dma_page iterator on umem SGL

Use the for_each_sg_dma_page iterator variant to walk the umem
DMA-mapped SGL and get the page DMA address. This avoids the extra
loop needed to iterate over the pages within each SGE when the
for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver, as it is
only relevant for ODP MRs. Use the system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/cxgb3: Use for_each_sg_dma_page iterator on umem SGL
Shiraz, Saleem [Mon, 11 Feb 2019 15:25:02 +0000 (09:25 -0600)]
RDMA/cxgb3: Use for_each_sg_dma_page iterator on umem SGL

Use the for_each_sg_dma_page iterator variant to walk the umem
DMA-mapped SGL and get the page DMA address. This avoids the extra
loop needed to iterate over the pages within each SGE when the
for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver, as it is
only relevant for ODP MRs. Use the system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/cxgb4: Use for_each_sg_dma_page iterator on umem SGL
Shiraz, Saleem [Mon, 11 Feb 2019 15:25:01 +0000 (09:25 -0600)]
RDMA/cxgb4: Use for_each_sg_dma_page iterator on umem SGL

Use the for_each_sg_dma_page iterator variant to walk the umem
DMA-mapped SGL and get the page DMA address. This avoids the extra
loop needed to iterate over the pages within each SGE when the
for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver, as it is
only relevant for ODP MRs. Use the system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Use for_each_sg_dma_page iterator on umem SGL
Shiraz, Saleem [Mon, 11 Feb 2019 15:25:00 +0000 (09:25 -0600)]
RDMA/hns: Use for_each_sg_dma_page iterator on umem SGL

Use the for_each_sg_dma_page iterator variant to walk the umem
DMA-mapped SGL and get the page DMA address. This avoids the extra
loop needed to iterate over the pages within each SGE when the
for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver, as it is
only relevant for ODP MRs. Use the system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/i40iw: Use for_each_sg_dma_page iterator on umem SGL
Shiraz, Saleem [Mon, 11 Feb 2019 15:24:59 +0000 (09:24 -0600)]
RDMA/i40iw: Use for_each_sg_dma_page iterator on umem SGL

Use the for_each_sg_dma_page iterator variant to walk the umem
DMA-mapped SGL and get the page DMA address. This avoids the extra
loop needed to iterate over the pages within each SGE when the
for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver, as it is
only relevant for ODP MRs. Use the system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/mthca: Use for_each_sg_dma_page iterator on umem SGL
Shiraz, Saleem [Mon, 11 Feb 2019 15:24:58 +0000 (09:24 -0600)]
RDMA/mthca: Use for_each_sg_dma_page iterator on umem SGL

Use the for_each_sg_dma_page iterator variant to walk the umem
DMA-mapped SGL and get the page DMA address. This avoids the extra
loop needed to iterate over the pages within each SGE when the
for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver, as it is
only relevant for ODP MRs. Use the system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/bnxt_re: Use for_each_sg_dma_page iterator on umem SGL
Shiraz, Saleem [Mon, 11 Feb 2019 15:24:57 +0000 (09:24 -0600)]
RDMA/bnxt_re: Use for_each_sg_dma_page iterator on umem SGL

Use the for_each_sg_dma_page iterator variant to walk the umem
DMA-mapped SGL and get the page DMA address. This avoids the extra
loop needed to iterate over the pages within each SGE when the
for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver, as it is
only relevant for ODP MRs. Use the system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agolib/scatterlist: Provide a DMA page iterator
Jason Gunthorpe [Fri, 4 Jan 2019 18:40:21 +0000 (11:40 -0700)]
lib/scatterlist: Provide a DMA page iterator

Commit 2db76d7c3c6d ("lib/scatterlist: sg_page_iter: support sg lists w/o
backing pages") introduced the sg_page_iter_dma_address() function without
providing a way to use it in the general case. If sg_dma_len() is not
equal to the sg length, callers cannot safely use the
for_each_sg_page/sg_page_iter_dma_address combination.

Resolve this API mistake by providing a DMA specific iterator,
for_each_sg_dma_page(), that uses the right length so
sg_page_iter_dma_address() works as expected with all sglists.

A new iterator type is introduced to provide compile-time safety against
wrongly mixing accessors and iterators.
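
A hedged usage sketch contrasting the two iterators on a DMA-mapped
sglist (the consumer callbacks are hypothetical):

    struct sg_page_iter piter;
    struct sg_dma_page_iter dma_iter;

    /* CPU pages: steps through entries using each sg's length */
    for_each_sg_page(sgl, &piter, nents, 0)
        consume_cpu_page(sg_page_iter_page(&piter));

    /* DMA addresses: steps through entries using sg_dma_len(),
     * which can differ from the sg length after IOMMU coalescing */
    for_each_sg_dma_page(sgl, &dma_iter, nmap, 0)
        consume_dma_addr(sg_page_iter_dma_address(&dma_iter));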

Acked-by: Christoph Hellwig <hch@lst.de> (for scatterlist)
Acked-by: Thomas Hellstrom <thellstrom@vmware.com>
Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com> (ipu3-cio2)
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoMerge branch 'wip/dl-for-next' into for-next
Doug Ledford [Sat, 9 Feb 2019 17:54:04 +0000 (12:54 -0500)]
Merge branch 'wip/dl-for-next' into for-next

Due to concurrent work by myself and Jason, a normal fast forward merge
was not possible.  This brings in a number of hfi1 changes, mainly the
hfi1 TID RDMA support (roughly 10,000 LOC change), which was reviewed
and integrated over a period of days.

Signed-off-by: Doug Ledford <dledford@redhat.com>