platform/kernel/linux-starfive.git
6 years agoIB/hfi1: Convert kzalloc_node and kcalloc to use kcalloc_node
Kamenee Arumugam [Thu, 1 Feb 2018 20:37:30 +0000 (12:37 -0800)]
IB/hfi1: Convert kzalloc_node and kcalloc to use kcalloc_node

Kzalloc_node API doesn't check for overflows in size multiplication.
While kcalloc API check for overflows in size multiplication
but these implementations are not NUMA-aware.

This conversion allowed for correcting an allocation used in the hot
path to be on the local NUMA and ensure us overflow free multiplication
for the size of a memory allocation.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/core: Avoid a potential OOPs for an unused optional parameter
Michael J. Ruhl [Thu, 1 Feb 2018 20:31:06 +0000 (12:31 -0800)]
IB/core: Avoid a potential OOPs for an unused optional parameter

The ev_file is an optional parameter for CQ creation. If the parameter
is not passed, the ev_file pointer will be NULL.  Using that pointer
to set the cq_context will result in an OOPs.

Verify that ev_file is not NULL before using.

Cc: <stable@vger.kernel.org> # 4.14.x
Fixes: 9ee79fce3642 ("IB/core: Add completion queue (cq) object actions")
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/core: Map iWarp AH type to undefined in rdma_ah_find_type
Don Hiatt [Thu, 1 Feb 2018 18:57:03 +0000 (10:57 -0800)]
IB/core: Map iWarp AH type to undefined in rdma_ah_find_type

iWarp devices do not support the creation of address handles
so return AH_ATTR_TYPE_UNDEFINED for all iWarp devices.

While we are here reduce the size of port_num to u8 and add
a comment.

Fixes: 44c58487d51a ("IB/core: Define 'ib' and 'roce' rdma_ah_attr types")
Reported-by: Parav Pandit <parav@mellanox.com>
CC: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/ipoib: Fix for potential no-carrier state
Alex Estrin [Thu, 1 Feb 2018 18:55:41 +0000 (10:55 -0800)]
IB/ipoib: Fix for potential no-carrier state

On reboot SM can program port pkey table before ipoib registered its
event handler, which could result in missing pkey event and leave root
interface with initial pkey value from index 0.

Since OPA port starts with invalid pkey in index 0, root interface will
fail to initialize and stay down with no-carrier flag.

For IB ipoib interface may end up with pkey different from value
opensm put in pkey table idx 0, resulting in connectivity issues
(different mcast groups, for example).

Close the window by calling event handler after registration
to make sure ipoib pkey is in sync with port pkey table.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Show fault stats in both TX and RX directions
Mitko Haralanov [Thu, 1 Feb 2018 18:52:43 +0000 (10:52 -0800)]
IB/hfi1: Show fault stats in both TX and RX directions

The routine which shows the fault stats checks the counters
to determine whether to show any stats based on the number of
transmitted pkts/bytes for a particular opcode.

Unfortunately, it only checked the receive counters. As a result,
if any packet faults have happened for packets egressing the HFI,
those stats would not be shown.

In order to fix this, the routine is amended to also check the
TX counters. With this change the pkt/byte counts are the sum of
both TX and RX counts for the opcode.

Fixes: 1b311f8931cf ("IB/hfi1: Add tx_opcode_stats like the opcode_stats")
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Remove blind constants from 16B update
Mike Marciniszyn [Thu, 1 Feb 2018 18:52:35 +0000 (10:52 -0800)]
IB/hfi1: Remove blind constants from 16B update

These values were introduced as part of the 16B code to
account for the varying size of the LRH between the differing
packet formats.

Replace the blind constants with defines based on FIELD_SIZEOF()
calls.

Fixes: 5b6cabb0db77 ("IB/hfi1: Add 16B RC/UC support")
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Convert PortXmitWait/PortVLXmitWait counters to flit times
Kamenee Arumugam [Thu, 1 Feb 2018 18:52:28 +0000 (10:52 -0800)]
IB/hfi1: Convert PortXmitWait/PortVLXmitWait counters to flit times

HFI's counters SendWaitCnt and SendWaitVlCnt are in units
of TXE cycle time (at 805MHz). OPA counters PortXmitWait and
PortVLXmtWait are in units of flit times.
Convert the counter values to flit units using following
conversion formula:

PortXmitWait =
SendWaitCnt * 2 * (4 /link_width) * (25 Gbps /link_speed)
PortVLXmitWait =
SendWaitVLCnt * 2 * (4 /link_width) * (25 Gbps /link_speed)

At link up or downgrade events, the link width can change. To ensure
accurate counter calculations, sample the counters after the events,
during counter requests, and then aggregate the OPA counters.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Do not override given pcie_pset value
Bartlomiej Dudek [Thu, 1 Feb 2018 18:52:20 +0000 (10:52 -0800)]
IB/hfi1: Do not override given pcie_pset value

During PCIe Gen 3 transistion, pcie_pset is read and might be overridden
to a default value(i.e. 255) in do_pcie_gen3_transition() routine.

If the pcie_pset value is overridden then this new value will be used
during initialization of next adapter on a different card.

Introducing a new local variable to avoid modification of pcie_pset

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Bartlomiej Dudek <bartlomiej.dudek@intel.com>
Signed-off-by: Patel Jay P <jay.p.patel@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Optimize process_receive_ib()
Sebastian Sanchez [Thu, 1 Feb 2018 18:46:46 +0000 (10:46 -0800)]
IB/hfi1: Optimize process_receive_ib()

The arguments for trace_hfi1_rcvhdr() get computed every
time in the hot path regardless of the whether the trace
is on or off. This is seen to be costly with a profile.
The handling of fault inject isolates the verbs device for
all packets regardless of the presence of a RHF_DC_ERR error.

Fix the first by computing trace_hfi1_rcvhdr() arguments within
the trace itself, so that when the trace is off, the argument
data isn't computed. Fix the second by moving the error check to
handle_eflags() when an RHF error occurs and by testing for
RHF_DC_ERR before executing the reset of handle_eflags().

Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Remove unnecessary fecn and becn fields
Sebastian Sanchez [Thu, 1 Feb 2018 18:46:38 +0000 (10:46 -0800)]
IB/hfi1: Remove unnecessary fecn and becn fields

packet->fecn and packet->becn are calculated in the hot path
and are never used. Remove these fields as they show to be
costly in a profile. Also, remove initialization for
becn and fecn in process_ecn() as they're unconditionally
assigned in the function and ensure fecn and becn variables
use a boolean type.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Look up ibport using a pointer in receive path
Sebastian Sanchez [Thu, 1 Feb 2018 18:46:31 +0000 (10:46 -0800)]
IB/hfi1: Look up ibport using a pointer in receive path

In the receive path, hfi1_ibport is looked up by indexing into an
array. A profile shows this to be expensive. The receive context
data has a pointer to the ibport data, use that pointer instead.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Optimize packet type comparison using 9B and bypass code paths
Sebastian Sanchez [Thu, 1 Feb 2018 18:46:23 +0000 (10:46 -0800)]
IB/hfi1: Optimize packet type comparison using 9B and bypass code paths

The packet type comparison used to find out if a packet is a bypass
packet in the hot path is an expensive operation as seen in a profile.

Determine packet's pkey and migration bit through the bypass and 9B
code paths instead.

Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Compute BTH only for RDMA_WRITE_LAST/SEND_LAST packet
Sebastian Sanchez [Thu, 1 Feb 2018 18:46:15 +0000 (10:46 -0800)]
IB/hfi1: Compute BTH only for RDMA_WRITE_LAST/SEND_LAST packet

In hfi1_rc_rcv(), BTH is computed for all packets received.
However, it's only used for packets received with opcodes
RDMA_WRITE_LAST and SEND_LAST, and it is a costly operation.

Compute BTH only in the RDMA_WRITE_LAST/SEND_LAST code path
and let the compiler handle endianness conversion for bitwise
operations.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Remove dependence on qp->s_hdrwords
Mitko Haralanov [Thu, 1 Feb 2018 18:46:07 +0000 (10:46 -0800)]
IB/hfi1: Remove dependence on qp->s_hdrwords

The s_hdrwords variable was used to indicate whether a
packet was already built on a previous iteration of the
send engine. This variable assumed the protection of the
QP's RVT_S_BUSY flag, which was required since the the
QP's s_lock was dropped just prior to the packet being
queued on the one of the egress mechanisms.

Support for multiple send engine instantiations require
that the field not be used due to concurency issues.
The ps.txreq signals the "already built" without the
potential concurency issues.

Fix by getting rid of all s_hdrword usage.   A wrapper
is added to test for the already built case that used to
use s_hdrwords.

What used to be stored in s_hdrwords is now in the txreq.
The PBC is not counted, but is added in the pio/sdma code
paths prior to posting the packet.

Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Fix for potential refcount leak in hfi1_open_file()
Alex Estrin [Thu, 1 Feb 2018 18:43:58 +0000 (10:43 -0800)]
IB/hfi1: Fix for potential refcount leak in hfi1_open_file()

The dd refcount is speculatively incremented prior to allocating
the fd memory with kzalloc(). If that kzalloc() failed the dd
refcount leaks.
Increment refcount on kzalloc success.

Fixes: e11ffbd57520 ("IB/hfi1: Do not free hfi1 cdev parent structure early")
Reviewed-by: Michael J Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Fix for early release of sdma context
Alex Estrin [Thu, 1 Feb 2018 18:43:50 +0000 (10:43 -0800)]
IB/hfi1: Fix for early release of sdma context

With IRQF_SHARED flag set and CONFIG_DEBUG_SHIRQ enabled
module removal may result in panic in sdma_interrupt() routine
if associated sdma context was released before pci_free_irq();

[ 9198.939885] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 9198.940514] IP: sdma_make_progress+0xa5/0x450 [hfi1]
[ 9198.941114] PGD 170bdc0067 P4D 170bdc0067 PUD 172063e067 PMD 0
[ 9198.941783] Oops: 0000 [#1] SMP
.....
[ 9198.958877] CPU: 132 PID: 64173 Comm: rmmod Tainted: G           OE   4.14.0-rc4+ #1
[ 9198.961032] Hardware name: Intel Corporation S7200AP/S7200AP, BIOS S72C610.86B.01.02.0118.080620171935 08/06/2017
[ 9198.963323] task: ffff9681397f0000 task.stack: ffffae1647c40000
[ 9198.965695] RIP: 0010:sdma_make_progress+0xa5/0x450 [hfi1]
[ 9198.968082] RSP: 0018:ffffae1647c43be8 EFLAGS: 00010046
[ 9198.970503] RAX: 0000000000000000 RBX: ffff9680ce8b5ca8 RCX: 0000000000000000
[ 9198.973006] RDX: 0000000000000000 RSI: 0000000001a00d28 RDI: ffff9680ce8b5ca0
[ 9198.975546] RBP: ffffae1647c43c40 R08: ffff96814325ec00 R09: 00000000ffffffff
[ 9198.978142] R10: 000000004325e501 R11: ffff96814325ec00 R12: ffff9680ce8b5c44
[ 9198.980779] R13: ffff9680ce8b5ca0 R14: 0000000000000000 R15: ffff9680ce8b5b00
[ 9198.983462] FS:  00007f31196ba740(0000) GS:ffff96819df00000(0000) knlGS:0000000000000000
[ 9198.986231] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9198.989036] CR2: 0000000000000000 CR3: 000000170833f000 CR4: 00000000001406e0
[ 9198.991911] Call Trace:
[ 9198.994847]  sdma_engine_interrupt+0x82/0x100 [hfi1]
[ 9198.997852]  sdma_interrupt+0x61/0xc0 [hfi1]
[ 9199.000852]  __free_irq+0x1b3/0x2d0
[ 9199.003873]  free_irq+0x35/0x70
[ 9199.006909]  pci_free_irq+0x1c/0x30
[ 9199.009999]  clean_up_interrupts+0x53/0xf0 [hfi1]
[ 9199.013137]  hfi1_start_cleanup+0x117/0x190 [hfi1]
[ 9199.016315]  postinit_cleanup+0x1d/0x270 [hfi1]
[ 9199.019529]  remove_one+0x1f3/0x210 [hfi1]
[ 9199.022738]  pci_device_remove+0x39/0xc0
[ 9199.025974]  device_release_driver_internal+0x141/0x210
[ 9199.029268]  driver_detach+0x3f/0x80
[ 9199.032580]  bus_remove_driver+0x55/0xd0
[ 9199.035931]  driver_unregister+0x2c/0x50
[ 9199.039321]  pci_unregister_driver+0x2a/0xa0
[ 9199.042755]  hfi1_mod_cleanup+0x10/0xb50 [hfi1]
[ 9199.046196]  SyS_delete_module+0x171/0x250
...

Fix by exporting sdma_clean() and removing from sdma_exit().
sdma_exit() now just manipulates the engine state,
leaving the memory free to sdma_clean() which is now called
just before the dd is freed.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Re-order IRQ cleanup to address driver cleanup race
Michael J. Ruhl [Thu, 1 Feb 2018 18:43:42 +0000 (10:43 -0800)]
IB/hfi1: Re-order IRQ cleanup to address driver cleanup race

The pci_request_irq() interfaces always adds the IRQF_SHARED bit to
all IRQ requests.

When the kernel is built with CONFIG_DEBUG_SHIRQ config flag, if the
IRQF_SHARED bit is set, a call to the IRQ handler is made from the
__free_irq() function. This is testing a race condition between the
IRQ cleanup and an IRQ racing the cleanup.  The HFI driver should be
able to handle this race, but does not.

This race can cause traces that start with this footprint:

BUG: unable to handle kernel NULL pointer dereference at   (null)
Call Trace:
 <hfi1 irq handler>
 ...
 __free_irq+0x1b3/0x2d0
 free_irq+0x35/0x70
 pci_free_irq+0x1c/0x30
 clean_up_interrupts+0x53/0xf0 [hfi1]
 hfi1_start_cleanup+0x122/0x190 [hfi1]
 postinit_cleanup+0x1d/0x280 [hfi1]
 remove_one+0x233/0x250 [hfi1]
 pci_device_remove+0x39/0xc0

Export IRQ cleanup function so it can be called from other modules.

Using the exported cleanup function:

  Re-order the driver cleanup code to clean up IRQ resources before
  other resources, eliminating the race.

  Re-order error path for init so that the race does not occur.

Reduce severity on spurious error message for SDMA IRQs to info.

Reviewed-by: Alex Estrin <alex.estrin@intel.com>
Reviewed-by: Patel Jay P <jay.p.patel@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/nldev: missing error code in nldev_res_get_doit()
Dan Carpenter [Thu, 1 Feb 2018 10:01:48 +0000 (13:01 +0300)]
RDMA/nldev: missing error code in nldev_res_get_doit()

We should return -ENOMEM if the allocation fails.  The current code
accidentally returns success.

Fixes: bf3c5a93c523 ("RDMA/nldev: Provide global resource utilization")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/hns: Fix misplaced call to hns_roce_cleanup_hem_table
oulijun [Tue, 30 Jan 2018 12:20:45 +0000 (20:20 +0800)]
RDMA/hns: Fix misplaced call to hns_roce_cleanup_hem_table

The mtt_table is cleaned up during the err_unmap_cqe label, it is a
mistake to duplicate the cleanup during the later unwind labels.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/hns: Add names to function arguments in function pointers
oulijun [Tue, 30 Jan 2018 12:20:44 +0000 (20:20 +0800)]
RDMA/hns: Add names to function arguments in function pointers

This patch mainly fix some style warings matched with the new checkpatch
requirement. The warning as follows:

WARNING: function definition argument 'struct hns_roce_cq *' should also have
an identifier name

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/hns: Remove unnecessary operator
oulijun [Tue, 30 Jan 2018 12:20:43 +0000 (20:20 +0800)]
RDMA/hns: Remove unnecessary operator

The double not-operator is unncessary when used in a boolean context. This
patch removes them.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/bnxt_re: Use common error handling code in bnxt_qplib_alloc_dpi_tbl()
Markus Elfring [Sat, 27 Jan 2018 19:56:56 +0000 (20:56 +0100)]
RDMA/bnxt_re: Use common error handling code in bnxt_qplib_alloc_dpi_tbl()

Add a jump target so that a bit of exception handling can be better reused
at the end of this function.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Devesh Sharma <devesh.sharma@broadcom.com>
Acked-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/bnxt_re: Delete two error messages for a failed memory allocation in bnxt_qplib_...
Markus Elfring [Sat, 27 Jan 2018 19:40:11 +0000 (20:40 +0100)]
RDMA/bnxt_re: Delete two error messages for a failed memory allocation in bnxt_qplib_alloc_dpi_tbl()

Omit extra messages for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Devesh Sharma <devesh.sharma@broadcom.com>
Acked-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/rxe: remove redudant parameter in rxe_av_fill_ip_info
Zhu Yanjun [Wed, 31 Jan 2018 11:06:59 +0000 (06:06 -0500)]
IB/rxe: remove redudant parameter in rxe_av_fill_ip_info

In the function rxe_av_fill_ip_info, the parameter rxe is not used.
So it is removed.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoIB/rxe: change the function rxe_av_fill_ip_info to void
Zhu Yanjun [Wed, 31 Jan 2018 11:06:58 +0000 (06:06 -0500)]
IB/rxe: change the function rxe_av_fill_ip_info to void

The function rxe_av_fill_ip_info always returns 0. So the function
type is changed to void.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoIB/rxe: change the function to void from int
Zhu Yanjun [Wed, 31 Jan 2018 11:06:57 +0000 (06:06 -0500)]
IB/rxe: change the function to void from int

Since the function rxe_av_to_attr always return 0, the function
type is changed to void.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoIB/rxe: remove unnecessary parameter in rxe_av_to_attr
Zhu Yanjun [Wed, 31 Jan 2018 11:06:56 +0000 (06:06 -0500)]
IB/rxe: remove unnecessary parameter in rxe_av_to_attr

In the function rxe_av_to_attr, the parameter rxe is
not used. So it is removed.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoIB/rxe: change the function to void from int
Zhu Yanjun [Wed, 31 Jan 2018 11:06:55 +0000 (06:06 -0500)]
IB/rxe: change the function to void from int

The function rxe_av_from_attr always return 0. So change the
function to void.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoIB/rxe: remove redudant parameter in function
Zhu Yanjun [Wed, 31 Jan 2018 11:06:54 +0000 (06:06 -0500)]
IB/rxe: remove redudant parameter in function

In the function rxe_av_from_attr, the parameter rxe
is not used. So it is removed.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoRDMA/netlink: Hide unimplemented NLDEV commands
Leon Romanovsky [Tue, 30 Jan 2018 15:07:16 +0000 (17:07 +0200)]
RDMA/netlink: Hide unimplemented NLDEV commands

The nldev was implemented by following devlink implementation,
including SET/DEL/NEW commands. However these commands were not
implemented and hence don't need to be exposed.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoRDMA/bnxt_re: Fix an error code in bnxt_qplib_create_srq()
Dan Carpenter [Tue, 30 Jan 2018 20:40:30 +0000 (23:40 +0300)]
RDMA/bnxt_re: Fix an error code in bnxt_qplib_create_srq()

We should return -ENOMEM if the allocation fails.  (The current code
returns succees).

Fixes: 37cb11acf1f7 ("RDMA/bnxt_re: Add SRQ support for Broadcom adapters")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-By: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoRDMA/bnxt_re: Fix static checker warning
Doug Ledford [Wed, 31 Jan 2018 16:13:07 +0000 (11:13 -0500)]
RDMA/bnxt_re: Fix static checker warning

If there is ever any error while creating srq->umem, we return that
error, we don't store it in srq->umem, so any check of srq->umem for
IS_ERR is pointless.  Further, checking udata is unnecessary as
srq->umem is always either NULL or valid, without respect to udata.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoMerge tag v4.15 of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2...
Jason Gunthorpe [Mon, 29 Jan 2018 20:26:40 +0000 (13:26 -0700)]
Merge tag v4.15 of git://git./linux/kernel/git/torvalds/linux-2.6.git

To resolve conflicts in:
 drivers/infiniband/hw/mlx5/main.c
 drivers/infiniband/hw/mlx5/qp.c

From patches merged into the -rc cycle. The conflict resolution matches
what linux-next has been carrying.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/nldev: Provide detailed QP information
Leon Romanovsky [Sun, 28 Jan 2018 09:17:25 +0000 (11:17 +0200)]
RDMA/nldev: Provide detailed QP information

Implement RDMA nldev netlink interface to get detailed information on each
QP in the system. This includes the owning process or kernel ULP and
detailed information from the qp_attrs.

Currently only the dumpit variant is implemented.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/nldev: Provide global resource utilization
Leon Romanovsky [Sun, 28 Jan 2018 09:17:24 +0000 (11:17 +0200)]
RDMA/nldev: Provide global resource utilization

Expose through the netlink interface the global per-device utilization of
the supported object types.

Provide both dumpit and doit callbacks.

As an example of possible output from rdmatool for system with 5
mlx5 cards:

$ rdma res
1: mlx5_0: qp 4 cq 5 pd 3
2: mlx5_1: qp 4 cq 5 pd 3
3: mlx5_2: qp 4 cq 5 pd 3
4: mlx5_3: qp 2 cq 3 pd 2
5: mlx5_4: qp 4 cq 5 pd 3

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/core: Add resource tracking for create and destroy PDs
Leon Romanovsky [Sun, 28 Jan 2018 09:17:23 +0000 (11:17 +0200)]
RDMA/core: Add resource tracking for create and destroy PDs

Track create and destroy operations of PD objects.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/core: Add resource tracking for create and destroy CQs
Leon Romanovsky [Sun, 28 Jan 2018 09:17:22 +0000 (11:17 +0200)]
RDMA/core: Add resource tracking for create and destroy CQs

Track create and destroy operations of CQ objects.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/core: Add resource tracking for create and destroy QPs
Leon Romanovsky [Sun, 28 Jan 2018 09:17:21 +0000 (11:17 +0200)]
RDMA/core: Add resource tracking for create and destroy QPs

Track create and destroy operations of QP objects.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/restrack: Add general infrastructure to track RDMA resources
Leon Romanovsky [Sun, 28 Jan 2018 09:17:20 +0000 (11:17 +0200)]
RDMA/restrack: Add general infrastructure to track RDMA resources

The RDMA subsystem has very strict set of objects to work with, but it
completely lacks tracking facilities and has no visibility of resource
utilization.

The following patch adds such infrastructure to keep track of RDMA
resources to help with debugging of user space applications. The primary
user of this infrastructure is RDMA nldev netlink (following patches), to
be exposed to userspace via rdmatool, but it is not limited too that.

At this stage, the main three objects (PD, CQ and QP) are added, and more
will be added later.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/core: Save kernel caller name when creating PD and CQ objects
Leon Romanovsky [Sun, 28 Jan 2018 09:17:19 +0000 (11:17 +0200)]
RDMA/core: Save kernel caller name when creating PD and CQ objects

The KBUILD_MODNAME variable contains the module name and it is known for
kernel users during compilation, so let's reuse it to track the owners.

Followup patches will store this for resource tracking.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/core: Use the MODNAME instead of the function name for pd callers
Leon Romanovsky [Sun, 28 Jan 2018 09:17:18 +0000 (11:17 +0200)]
RDMA/core: Use the MODNAME instead of the function name for pd callers

Each of our modules only allocates a PD in one place, so there isn't any
loss in detail, while MODNAME is more useful and recognizable as something
to expose to the user.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA: Move enum ib_cq_creation_flags to uapi headers
Jason Gunthorpe [Fri, 26 Jan 2018 22:16:46 +0000 (15:16 -0700)]
RDMA: Move enum ib_cq_creation_flags to uapi headers

The flags field the enum is used with comes directly from the uapi
so it belongs in the uapi headers for clarity and so userspace can
use it.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/rxe: Change RDMA_RXE kconfig to use select
Jason Gunthorpe [Fri, 26 Jan 2018 21:18:33 +0000 (14:18 -0700)]
IB/rxe: Change RDMA_RXE kconfig to use select

NET_UDP_TUNNEL is not user selectable, so it should be used as a select
in kconfig.

CRYPTO_CRC32 is a required library for RDMA_RXE so it should active
automatically, as most other CRYPTO_ users do.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoLinux 4.15
Linus Torvalds [Sun, 28 Jan 2018 21:20:33 +0000 (13:20 -0800)]
Linux 4.15

6 years agoIB/qib: remove qib_keys.c
Corentin Labbe [Sun, 28 Jan 2018 15:08:55 +0000 (15:08 +0000)]
IB/qib: remove qib_keys.c

qib_keys.c was left uncompilable in commit 7c2e11fe2dbe ("IB/qib: Remove qp and mr functionality from qib")
Since nothing need it, remove it from tree.

Signed-off-by: Corentin Labbe <clabbe@baylibre.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/mthca: remove mthca_user.h
Corentin Labbe [Sun, 28 Jan 2018 12:39:43 +0000 (12:39 +0000)]
IB/mthca: remove mthca_user.h

mthca_user.h is unused since commit 486f60954c71 ("IB/mthca: Move user vendor structures")
Remove it from tree.

Signed-off-by: Corentin Labbe <clabbe@baylibre.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/cm: Fix access to uninitialized variable
Leon Romanovsky [Sun, 28 Jan 2018 09:25:33 +0000 (11:25 +0200)]
RDMA/cm: Fix access to uninitialized variable

The ndev will be initialized and held only for successful
ib_get_cached_gid(), otherwise it is garbage stack memory.
Calling dev_put() in failure path is wrong.

Fixes: 16c72e402867 ("IB/cm: Refactor to avoid setting path record software only fields")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/cma: Use existing netif_is_bond_master function
Parav Pandit [Sun, 28 Jan 2018 09:25:32 +0000 (11:25 +0200)]
RDMA/cma: Use existing netif_is_bond_master function

When checking whatever the current netdev is the bond master interface,
use kernel API netif_is_bond_master() instead of hardcoding the check.
No functionality is changed.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/core: Avoid SGID attributes query while converting GID from OPA to IB
Parav Pandit [Sun, 28 Jan 2018 09:25:31 +0000 (11:25 +0200)]
IB/core: Avoid SGID attributes query while converting GID from OPA to IB

SGID attributes are not used during OPA to IB GID conversion.
Therefore don't query it.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/mlx5: Avoid memory leak in case of XRCD dealloc failure
Leon Romanovsky [Sun, 28 Jan 2018 09:25:30 +0000 (11:25 +0200)]
RDMA/mlx5: Avoid memory leak in case of XRCD dealloc failure

Failure in XRCD FW deallocation command leaves memory leaked and
returns error to the user which he can't do anything about it.

This patch changes behavior to always free memory and always return
success to the user.

Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/umad: Fix use of unprotected device pointer
Jack Morgenstein [Sun, 28 Jan 2018 09:25:29 +0000 (11:25 +0200)]
IB/umad: Fix use of unprotected device pointer

The ib_write_umad() is protected by taking the umad file mutex.
However, it accesses file->port->ib_dev -- which is protected only by the
port's mutex (field file_mutex).

The ib_umad_remove_one() calls ib_umad_kill_port() which sets
port->ib_dev to NULL under the port mutex (NOT the file mutex).
It then sets the mad agent to "dead" under the umad file mutex.

This is a race condition -- because there is a window where
port->ib_dev is NULL, while the agent is not "dead".

As a result, we saw stack traces like:

[16490.678059] BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
[16490.678246] IP: ib_umad_write+0x29c/0xa3a [ib_umad]
[16490.678333] PGD 0 P4D 0
[16490.678404] Oops: 0000 [#1] SMP PTI
[16490.678466] Modules linked in: rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx4_en(OE) ptp pps_core mlx4_ib(OE-) ib_core(OE) mlx4_core(OE) mlx_compat
(OE) memtrack(OE) devlink mst_pciconf(OE) mst_pci(OE) netconsole nfsv3 nfs_acl nfs lockd grace fscache cfg80211 rfkill esp6_offload esp6 esp4_offload esp4 sunrpc kvm_intel kvm ppdev parport_pc irqbypass
parport joydev i2c_piix4 virtio_balloon cirrus drm_kms_helper ttm drm e1000 serio_raw virtio_pci virtio_ring virtio ata_generic pata_acpi qemu_fw_cfg [last unloaded: mlxfw]
[16490.679202] CPU: 4 PID: 3115 Comm: sminfo Tainted: G           OE   4.14.13-300.fc27.x86_64 #1
[16490.679339] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
[16490.679477] task: ffff9cf753890000 task.stack: ffffaf70c26b0000
[16490.679571] RIP: 0010:ib_umad_write+0x29c/0xa3a [ib_umad]
[16490.679664] RSP: 0018:ffffaf70c26b3d90 EFLAGS: 00010202
[16490.679747] RAX: 0000000000000010 RBX: ffff9cf75610fd80 RCX: 0000000000000000
[16490.679856] RDX: 0000000000000001 RSI: 00007ffdf2bfd714 RDI: ffff9cf6bb2a9c00

In the above trace, ib_umad_write is trying to dereference the NULL
file->port->ib_dev pointer.

Fix this by using the agent's device pointer (the device field
in struct ib_mad_agent) -- which IS protected by the umad file mutex.

Cc: <stable@vger.kernel.org> # v4.11
Fixes: 44c58487d51a ("IB/core: Define 'ib' and 'roce' rdma_ah_attr types")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/iser: Combine substrings for three messages
Markus Elfring [Sat, 27 Jan 2018 17:25:37 +0000 (18:25 +0100)]
IB/iser: Combine substrings for three messages

The script "checkpatch.pl" pointed information out like the following.

WARNING: quoted string split across lines

Thus fix the affected source code places.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/iser: Delete an unnecessary variable initialisation in iser_send_data_out()
Markus Elfring [Sat, 27 Jan 2018 16:55:13 +0000 (17:55 +0100)]
IB/iser: Delete an unnecessary variable initialisation in iser_send_data_out()

The variable "tx_desc" will be set to an appropriate pointer a bit later.
Thus omit the explicit initialisation at the beginning.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/iser: Delete an error message for a failed memory allocation in iser_send_data_out()
Markus Elfring [Sat, 27 Jan 2018 16:48:47 +0000 (17:48 +0100)]
IB/iser: Delete an error message for a failed memory allocation in iser_send_data_out()

Omit an extra message for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoMerge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 28 Jan 2018 20:24:36 +0000 (12:24 -0800)]
Merge branch 'x86-pti-for-linus' of git://git./linux/kernel/git/tip/tip

Pull x86 retpoline fixlet from Thomas Gleixner:
 "Remove the ESP/RSP thunks for retpoline as they cannot ever work.

  Get rid of them before they show up in a release"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/retpoline: Remove the esp/rsp thunk

6 years agoMerge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 28 Jan 2018 20:19:23 +0000 (12:19 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "A set of small fixes for 4.15:

   - Fix vmapped stack synchronization on systems with 4-level paging
     and a large amount of memory caused by a missing 5-level folding
     which made the pgd synchronization logic to fail and causing double
     faults.

   - Add a missing sanity check in the vmalloc_fault() logic on 5-level
     paging systems.

   - Bring back protection against accessing a freed initrd in the
     microcode loader which was lost by a wrong merge conflict
     resolution.

   - Extend the Broadwell micro code loading sanity check.

   - Add a missing ENDPROC annotation in ftrace assembly code which
     makes ORC unhappy.

   - Prevent loading the AMD power module on !AMD platforms. The load
     itself is uncritical, but an unload attempt results in a kernel
     crash.

   - Update Peter Anvins role in the MAINTAINERS file"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/ftrace: Add one more ENDPROC annotation
  x86: Mark hpa as a "Designated Reviewer" for the time being
  x86/mm/64: Tighten up vmalloc_fault() sanity checks on 5-level kernels
  x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems
  x86/microcode: Fix again accessing initrd after having been freed
  x86/microcode/intel: Extend BDW late-loading further with LLC size check
  perf/x86/amd/power: Do not load AMD power module on !AMD platforms

6 years agoMerge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 28 Jan 2018 20:17:35 +0000 (12:17 -0800)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull timer fix from Thomas Gleixner:
 "A single fix for a ~10 years old problem which causes high resolution
  timers to stop after a CPU unplug/plug cycle due to a stale flag in
  the per CPU hrtimer base struct.

  Paul McKenney was hunting this for about a year, but the heisenbug
  nature made it resistant against debug attempts for quite some time"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimer: Reset hrtimer cpu base proper on CPU hotplug

6 years agoMerge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 28 Jan 2018 19:51:45 +0000 (11:51 -0800)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull scheduler fix from Thomas Gleixner:
 "A single bug fix to prevent a subtle deadlock in the scheduler core
  code vs cpu hotplug"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix cpu.max vs. cpuhotplug deadlock

6 years agoMerge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 28 Jan 2018 19:48:25 +0000 (11:48 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull perf fixes from Thomas Gleixner:
 "Four patches which all address lock inversions and deadlocks in the
  perf core code and the Intel debug store"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix perf,x86,cpuhp deadlock
  perf/core: Fix ctx::mutex deadlock
  perf/core: Fix another perf,trace,cpuhp lock inversion
  perf/core: Fix lock inversion between perf,trace,cpuhp

6 years agoMerge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 28 Jan 2018 19:20:35 +0000 (11:20 -0800)]
Merge branch 'locking-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull locking fixes from Thomas Gleixner:
 "Two final locking fixes for 4.15:

   - Repair the OWNER_DIED logic in the futex code which got wreckaged
     with the recent fix for a subtle race condition.

   - Prevent the hard lockup detector from triggering when dumping all
     held locks in the system"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/lockdep: Avoid triggering hardlockup from debug_show_all_locks()
  futex: Fix OWNER_DEAD fixup

6 years agox86/ftrace: Add one more ENDPROC annotation
Josh Poimboeuf [Sun, 28 Jan 2018 02:21:50 +0000 (20:21 -0600)]
x86/ftrace: Add one more ENDPROC annotation

When ORC support was added for the ftrace_64.S code, an ENDPROC
for function_hook() was missed. This results in the following warning:

  arch/x86/kernel/ftrace_64.o: warning: objtool: .entry.text+0x0: unreachable instruction

Fixes: e2ac83d74a4d ("x86/ftrace: Fix ORC unwinding from ftrace handlers")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180128022150.dqierscqmt3uwwsr@treble
6 years agohrtimer: Reset hrtimer cpu base proper on CPU hotplug
Thomas Gleixner [Fri, 26 Jan 2018 13:54:32 +0000 (14:54 +0100)]
hrtimer: Reset hrtimer cpu base proper on CPU hotplug

The hrtimer interrupt code contains a hang detection and mitigation
mechanism, which prevents that a long delayed hrtimer interrupt causes a
continous retriggering of interrupts which prevent the system from making
progress. If a hang is detected then the timer hardware is programmed with
a certain delay into the future and a flag is set in the hrtimer cpu base
which prevents newly enqueued timers from reprogramming the timer hardware
prior to the chosen delay. The subsequent hrtimer interrupt after the delay
clears the flag and resumes normal operation.

If such a hang happens in the last hrtimer interrupt before a CPU is
unplugged then the hang_detected flag is set and stays that way when the
CPU is plugged in again. At that point the timer hardware is not armed and
it cannot be armed because the hang_detected flag is still active, so
nothing clears that flag. As a consequence the CPU does not receive hrtimer
interrupts and no timers expire on that CPU which results in RCU stalls and
other malfunctions.

Clear the flag along with some other less critical members of the hrtimer
cpu base to ensure starting from a clean state when a CPU is plugged in.

Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
root cause of that hard to reproduce heisenbug. Once understood it's
trivial and certainly justifies a brown paperbag.

Fixes: 41d2e4949377 ("hrtimer: Tune hrtimer_interrupt hang logic")
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Sewior <bigeasy@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801261447590.2067@nanos
6 years agox86: Mark hpa as a "Designated Reviewer" for the time being
H. Peter Anvin [Thu, 25 Jan 2018 19:59:34 +0000 (11:59 -0800)]
x86: Mark hpa as a "Designated Reviewer" for the time being

Due to some unfortunate events, I have not been directly involved in
the x86 kernel patch flow for a while now.  I have also not been able
to ramp back up by now like I had hoped to, and after reviewing what I
will need to work on both internally at Intel and elsewhere in the near
term, it is clear that I am not going to be able to ramp back up until
late 2018 at the very earliest.

It is not acceptable to not recognize that this load is currently
taken by Ingo and Thomas without my direct participation, so I mark
myself as R: (designated reviewer) rather than M: (maintainer) until
further notice.  This is in fact recognizing the de facto situation
for the past few years.

I have obviously no intention of going away, and I will do everything
within my power to improve Linux on x86 and x86 for Linux.  This,
however, puts credit where it is due and reflects a change of focus.

This patch also removes stale entries for portions of the x86
architecture which have not been maintained separately from arch/x86
for a long time.  If there is a reason to re-introduce them then that
can happen later.

Signed-off-by: H. Peter Anvin <h.peter.anvin@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Bruce Schlobohm <bruce.schlobohm@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180125195934.5253-1-hpa@zytor.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
6 years agoMerge tag 'riscv-for-linus-4.15-maintainers' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Fri, 26 Jan 2018 23:10:50 +0000 (15:10 -0800)]
Merge tag 'riscv-for-linus-4.15-maintainers' of git://git./linux/kernel/git/palmer/riscv-linux

Pull RISC-V update from Palmer Dabbelt:
 "RISC-V: We have a new mailing list and git repo!

  Sorry to send something essentially as late as possible (Friday after
  an rc9), but we managed to get a mailing list for the RISC-V Linux
  port. We've been using patches@groups.riscv.org for a while, but that
  list has some problems (it's Google Groups and it's shared over all
  RISC-V software projects). The new infaread.org list is much better.
  We just got it on Wednesday but I used it a bit on Thursday to shake
  out all the configuration problems and it appears to be in working
  order.

  When I updated the mailing list I noticed that the MAINTAINERS file
  was pointing to our github repo, but now that we have a kernel.org
  repo I'd like to point to that instead so I changed that as well.
  We'll be centralizing all RISC-V Linux related development here as
  that seems to be the saner way to go about it.

  I can understand if it's too late to get this into 4.15, but given
  that it's not a code change I was hoping it'd still be OK. It would be
  nice to have the new mailing list and git repo in the release tarballs
  so when people start to find bugs they'll get to the right place"

* tag 'riscv-for-linus-4.15-maintainers' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux:
  Update the RISC-V MAINTAINERS file

6 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Fri, 26 Jan 2018 17:03:16 +0000 (09:03 -0800)]
Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:

 1) The per-network-namespace loopback device, and thus its namespace,
    can have its teardown deferred for a long time if a kernel created
    TCP socket closes and the namespace is exiting meanwhile. The kernel
    keeps trying to finish the close sequence until it times out (which
    takes quite some time).

    Fix this by forcing the socket closed in this situation, from Dan
    Streetman.

 2) Fix regression where we're trying to invoke the update_pmtu method
    on route types (in this case metadata tunnel routes) that don't
    implement the dst_ops method. Fix from Nicolas Dichtel.

 3) Fix long standing memory corruption issues in r8169 driver by
    performing the chip statistics DMA programming more correctly. From
    Francois Romieu.

 4) Handle local broadcast sends over VRF routes properly, from David
    Ahern.

 5) Don't refire the DCCP CCID2 timer endlessly, otherwise the socket
    can never be released. From Alexey Kodanev.

 6) Set poll flags properly in VSOCK protocol layer, from Stefan
    Hajnoczi.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  VSOCK: set POLLOUT | POLLWRNORM for TCP_CLOSING
  dccp: don't restart ccid2_hc_tx_rto_expire() if sk in closed state
  net: vrf: Add support for sends to local broadcast address
  r8169: fix memory corruption on retrieval of hardware statistics.
  net: don't call update_pmtu unconditionally
  net: tcp: close sock if net namespace is exiting

6 years agoMerge tag 'drm-fixes-for-v4.15-rc10-2' of git://people.freedesktop.org/~airlied/linux
Linus Torvalds [Fri, 26 Jan 2018 16:59:57 +0000 (08:59 -0800)]
Merge tag 'drm-fixes-for-v4.15-rc10-2' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:
 "A fairly urgent nouveau regression fix for broken irqs across
  suspend/resume came in. This was broken before but a patch in 4.15 has
  made it much more obviously broken and now s/r fails a lot more often.

  The fix removes freeing the irq across s/r which never should have
  been done anyways.

  Also two vc4 fixes for a NULL deference and some misrendering /
  flickering on screen"

* tag 'drm-fixes-for-v4.15-rc10-2' of git://people.freedesktop.org/~airlied/linux:
  drm/nouveau: Move irq setup/teardown to pci ctor/dtor
  drm/vc4: Fix NULL pointer dereference in vc4_save_hang_state()
  drm/vc4: Flush the caches before the bin jobs, as well.

6 years agoVSOCK: set POLLOUT | POLLWRNORM for TCP_CLOSING
Stefan Hajnoczi [Fri, 26 Jan 2018 11:48:25 +0000 (11:48 +0000)]
VSOCK: set POLLOUT | POLLWRNORM for TCP_CLOSING

select(2) with wfds but no rfds must return when the socket is shut down
by the peer.  This way userspace notices socket activity and gets -EPIPE
from the next write(2).

Currently select(2) does not return for virtio-vsock when a SEND+RCV
shutdown packet is received.  This is because vsock_poll() only sets
POLLOUT | POLLWRNORM for TCP_CLOSE, not the TCP_CLOSING state that the
socket is in when the shutdown is received.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agodccp: don't restart ccid2_hc_tx_rto_expire() if sk in closed state
Alexey Kodanev [Fri, 26 Jan 2018 12:14:16 +0000 (15:14 +0300)]
dccp: don't restart ccid2_hc_tx_rto_expire() if sk in closed state

ccid2_hc_tx_rto_expire() timer callback always restarts the timer
again and can run indefinitely (unless it is stopped outside), and after
commit 120e9dabaf55 ("dccp: defer ccid_hc_tx_delete() at dismantle time"),
which moved ccid_hc_tx_delete() (also includes sk_stop_timer()) from
dccp_destroy_sock() to sk_destruct(), this started to happen quite often.
The timer prevents releasing the socket, as a result, sk_destruct() won't
be called.

Found with LTP/dccp_ipsec tests running on the bonding device,
which later couldn't be unloaded after the tests were completed:

  unregister_netdevice: waiting for bond0 to become free. Usage count = 148

Fixes: 2a91aa396739 ("[DCCP] CCID2: Initial CCID2 (TCP-Like) implementation")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoUpdate the RISC-V MAINTAINERS file
Palmer Dabbelt [Wed, 24 Jan 2018 21:26:11 +0000 (13:26 -0800)]
Update the RISC-V MAINTAINERS file

Now that we're upstream in Linux we've been able to make some
infrastructure changes so our port works a bit more like other ports.
Specifically:

* We now have a mailing list specific to the RISC-V Linux port, hosted
  at lists.infreadead.org.
* We now have a kernel.org git tree where work on our port is
  coordinated.

This patch changes the RISC-V maintainers entry to reflect these new
bits of infrastructure.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
6 years agoIB/mthca: Fix gup usage in mthca_map_user_db()
Davidlohr Bueso [Thu, 25 Jan 2018 19:27:27 +0000 (11:27 -0800)]
IB/mthca: Fix gup usage in mthca_map_user_db()

get_user_pages() must be called with mmap_sem held, currently
it is not. In fact it is called under the user db_table->mutex.
To fix this we can convert gup to use the fast alternative,
and safely avoid taking mmap_sem, if possible. Furthermore
this is safe wrt to the mutex as other callers that take the
lock (unmap and alloc_db) are not called under mmap_sem
(hence possible deadlock).

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agox86/mm/64: Tighten up vmalloc_fault() sanity checks on 5-level kernels
Andy Lutomirski [Thu, 25 Jan 2018 21:12:15 +0000 (13:12 -0800)]
x86/mm/64: Tighten up vmalloc_fault() sanity checks on 5-level kernels

On a 5-level kernel, if a non-init mm has a top-level entry, it needs to
match init_mm's, but the vmalloc_fault() code skipped over the BUG_ON()
that would have checked it.

While we're at it, get rid of the rather confusing 4-level folded "pgd"
logic.

Cleans-up: b50858ce3e2a ("x86/mm/vmalloc: Add 5-level paging support")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Neil Berrington <neil.berrington@datacore.com>
Link: https://lkml.kernel.org/r/2ae598f8c279b0a29baf75df207e6f2fdddc0a1b.1516914529.git.luto@kernel.org
6 years agox86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems
Andy Lutomirski [Thu, 25 Jan 2018 21:12:14 +0000 (13:12 -0800)]
x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems

Neil Berrington reported a double-fault on a VM with 768GB of RAM that uses
large amounts of vmalloc space with PTI enabled.

The cause is that load_new_mm_cr3() was never fixed to take the 5-level pgd
folding code into account, so, on a 4-level kernel, the pgd synchronization
logic compiles away to exactly nothing.

Interestingly, the problem doesn't trigger with nopti.  I assume this is
because the kernel is mapped with global pages if we boot with nopti.  The
sequence of operations when we create a new task is that we first load its
mm while still running on the old stack (which crashes if the old stack is
unmapped in the new mm unless the TLB saves us), then we call
prepare_switch_to(), and then we switch to the new stack.
prepare_switch_to() pokes the new stack directly, which will populate the
mapping through vmalloc_fault().  I assume that we're getting lucky on
non-PTI systems -- the old stack's TLB entry stays alive long enough to
make it all the way through prepare_switch_to() and switch_to() so that we
make it to a valid stack.

Fixes: b50858ce3e2a ("x86/mm/vmalloc: Add 5-level paging support")
Reported-and-tested-by: Neil Berrington <neil.berrington@datacore.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: stable@vger.kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/346541c56caed61abbe693d7d2742b4a380c5001.1516914529.git.luto@kernel.org
6 years agoMerge branch 'linux-4.15' of git://github.com/skeggsb/linux into drm-fixes
Dave Airlie [Fri, 26 Jan 2018 05:27:07 +0000 (15:27 +1000)]
Merge branch 'linux-4.15' of git://github.com/skeggsb/linux into drm-fixes

Single irq regression fix
* 'linux-4.15' of git://github.com/skeggsb/linux:
  drm/nouveau: Move irq setup/teardown to pci ctor/dtor

6 years agonet: vrf: Add support for sends to local broadcast address
David Ahern [Thu, 25 Jan 2018 03:37:37 +0000 (19:37 -0800)]
net: vrf: Add support for sends to local broadcast address

Sukumar reported that sends to the local broadcast address
(255.255.255.255) are broken. Check for the address in vrf driver
and do not redirect to the VRF device - similar to multicast
packets.

With this change sockets can use SO_BINDTODEVICE to specify an
egress interface and receive responses. Note: the egress interface
can not be a VRF device but needs to be the enslaved device.

https://bugzilla.kernel.org/show_bug.cgi?id=198521

Reported-by: Sukumar Gopalakrishnan <sukumarg1973@gmail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agor8169: fix memory corruption on retrieval of hardware statistics.
Francois Romieu [Fri, 26 Jan 2018 00:53:26 +0000 (01:53 +0100)]
r8169: fix memory corruption on retrieval of hardware statistics.

Hardware statistics retrieval hurts in tight invocation loops.

Avoid extraneous write and enforce strict ordering of writes targeted to
the tally counters dump area address registers.

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Tested-by: Oliver Freyermuth <o.freyermuth@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Linus Torvalds [Fri, 26 Jan 2018 01:30:47 +0000 (17:30 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input

Pull input fixes from Dmitry Torokhov:
 "The main item is that we try to better handle the newer trackpoints on
  Lenovo devices that are now being produced by Elan/ALPS/NXP and only
  implement a small subset of the original IBM trackpoint controls"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Revert "Input: synaptics_rmi4 - use devm_device_add_group() for attributes in F01"
  Input: trackpoint - only expose supported controls for Elan, ALPS and NXP
  Input: trackpoint - force 3 buttons if 0 button is reported
  Input: xpad - add support for PDP Xbox One controllers
  Input: stmfts,s6sy671 - add SPDX identifier

6 years agoorangefs: fix deadlock; do not write i_size in read_iter
Martin Brandenburg [Fri, 26 Jan 2018 00:39:44 +0000 (19:39 -0500)]
orangefs: fix deadlock; do not write i_size in read_iter

After do_readv_writev, the inode cache is invalidated anyway, so i_size
will never be read.  It will be fetched from the server which will also
know about updates from other machines.

Fixes deadlock on 32-bit SMP.

See https://marc.info/?l=linux-fsdevel&m=151268557427760&w=2

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6 years agodrm/nouveau: Move irq setup/teardown to pci ctor/dtor
Lyude Paul [Thu, 25 Jan 2018 23:29:53 +0000 (18:29 -0500)]
drm/nouveau: Move irq setup/teardown to pci ctor/dtor

For a while we've been having issues with seemingly random interrupts
coming from nvidia cards when resuming them. Originally the fix for this
was thought to be just re-arming the MSI interrupt registers right after
re-allocating our IRQs, however it seems a lot of what we do is both
wrong and not even nessecary.

This was made apparent by what appeared to be a regression in the
mainline kernel that started introducing suspend/resume issues for
nouveau:

        a0c9259dc4e1 (irq/matrix: Spread interrupts on allocation)

After this commit was introduced, we started getting interrupts from the
GPU before we actually re-allocated our own IRQ (see references below)
and assigned the IRQ handler. Investigating this turned out that the
problem was not with the commit, but the fact that nouveau even
free/allocates it's irqs before and after suspend/resume.

For starters: drivers in the linux kernel haven't had to handle
freeing/re-allocating their IRQs during suspend/resume cycles for quite
a while now. Nouveau seems to be one of the few drivers left that still
does this, despite the fact there's no reason we actually need to since
disabling interrupts from the device side should be enough, as the
kernel is already smart enough to know to disable host-side interrupts
for us before going into suspend. Since we were tearing down our IRQs by
hand however, that means there was a short period during resume where
interrupts could be received before we re-allocated our IRQ which would
lead to us getting an unhandled IRQ. Since we never handle said IRQ and
re-arm the interrupt registers, this would cause us to miss all of the
interrupts from the GPU and cause our init process to start timing out
on anything requiring interrupts.

So, since this whole setup/teardown every suspend/resume cycle is
useless anyway, move irq setup/teardown into the pci subdev's ctor/dtor
functions instead so they're only called at driver load and driver
unload. This should fix most of the issues with pending interrupts on
resume, along with getting suspend/resume for nouveau to work again.

As well, this probably means we can also just remove the msi rearm call
inside nvkm_pci_init(). But since our main focus here is to fix
suspend/resume before 4.15, we'll save that for a later patch.

Signed-off-by: Lyude Paul <lyude@redhat.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: stable@vger.kernel.org
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
6 years agonet: don't call update_pmtu unconditionally
Nicolas Dichtel [Thu, 25 Jan 2018 18:03:03 +0000 (19:03 +0100)]
net: don't call update_pmtu unconditionally

Some dst_ops (e.g. md_dst_ops)) doesn't set this handler. It may result to:
"BUG: unable to handle kernel NULL pointer dereference at           (null)"

Let's add a helper to check if update_pmtu is available before calling it.

Fixes: 52a589d51f10 ("geneve: update skb dst pmtu on tx path")
Fixes: a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path")
CC: Roman Kapl <code@rkapl.cz>
CC: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Thu, 25 Jan 2018 17:32:10 +0000 (09:32 -0800)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull KVM fixes from Radim Krčmář:
 "Fix races and a potential use after free in the s390 cmma migration
  code"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: s390: add proper locking for CMMA migration bitmap

6 years agoMerge tag 'for-4.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Linus Torvalds [Thu, 25 Jan 2018 17:03:10 +0000 (09:03 -0800)]
Merge tag 'for-4.15-tag' of git://git./linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "It's been reported recently that readdir can list stale entries under
  some conditions. Fix it."

* tag 'for-4.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix stale entries in readdir

6 years agoRDMA/qedr: lower print level of flushed CQEs
Kalderon, Michal [Thu, 25 Jan 2018 11:23:20 +0000 (13:23 +0200)]
RDMA/qedr: lower print level of flushed CQEs

There are races where can still get flush on CQEs before the QP enters
error state. This is not an error and should be treated as
debug information.

Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoRDMA/uverbs: Use an unambiguous errno for method not supported
Jason Gunthorpe [Thu, 25 Jan 2018 02:58:34 +0000 (19:58 -0700)]
RDMA/uverbs: Use an unambiguous errno for method not supported

Returning EOPNOTSUPP is problematic because it can also be
returned by the method function, and we use it in quite a few
places in drivers these days.

Instead, dedicate EPROTONOSUPPORT to indicate that the ioctl framework
is enabled but the requested object and method are not supported by
the kernel. No other case will return this code, and it lets userspace
know to fall back to write().

grep says we do not use it today in drivers/infiniband subsystem.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agonet: tcp: close sock if net namespace is exiting
Dan Streetman [Thu, 18 Jan 2018 21:14:26 +0000 (16:14 -0500)]
net: tcp: close sock if net namespace is exiting

When a tcp socket is closed, if it detects that its net namespace is
exiting, close immediately and do not wait for FIN sequence.

For normal sockets, a reference is taken to their net namespace, so it will
never exit while the socket is open.  However, kernel sockets do not take a
reference to their net namespace, so it may begin exiting while the kernel
socket is still open.  In this case if the kernel socket is a tcp socket,
it will stay open trying to complete its close sequence.  The sock's dst(s)
hold a reference to their interface, which are all transferred to the
namespace's loopback interface when the real interfaces are taken down.
When the namespace tries to take down its loopback interface, it hangs
waiting for all references to the loopback interface to release, which
results in messages like:

unregister_netdevice: waiting for lo to become free. Usage count = 1

These messages continue until the socket finally times out and closes.
Since the net namespace cleanup holds the net_mutex while calling its
registered pernet callbacks, any new net namespace initialization is
blocked until the current net namespace finishes exiting.

After this change, the tcp socket notices the exiting net namespace, and
closes immediately, releasing its dst(s) and their reference to the
loopback interface, which lets the net namespace continue exiting.

Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
Signed-off-by: Dan Streetman <ddstreet@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMAINTAINERS: Fix the location of the rdma git repo
Doug Ledford [Thu, 25 Jan 2018 15:54:22 +0000 (10:54 -0500)]
MAINTAINERS: Fix the location of the rdma git repo

When Jason Gunthorpe and I became co-maintainers of the rdma tree, we
moved the official git repo location to a name neutral location.
However, that update did not make it here as well.  Fix that mistake.

Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoMAINTAINERS: Remove Ram Amrani from Q-Logic RDMA driver
Amrani, Ram [Wed, 24 Jan 2018 07:29:53 +0000 (09:29 +0200)]
MAINTAINERS: Remove Ram Amrani from Q-Logic RDMA driver

Remove myself from maintaining the qedr module as my period
of working with Cavium/Q-Logic has come to an end. I've had
a pleasure working with the community, cheers!

Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoRDMA/srpt: Fix RCU debug build error
Leon Romanovsky [Wed, 24 Jan 2018 06:56:05 +0000 (08:56 +0200)]
RDMA/srpt: Fix RCU debug build error

Combination of CONFIG_DEBUG_OBJECTS_RCU_HEAD=y and
CONFIG_INFINIBAND_SRPT=m produces the following build error.

ERROR: "init_rcu_head" [drivers/infiniband/ulp/srpt/ib_srpt.ko] undefined!
make[1]: *** [scripts/Makefile.modpost:92: __modpost] Error 1
make: *** [Makefile:1216: modules] Error 2

The reason to it that init_rcu_head() is not exported and not supposed
to be used in modules. It is needed for dynamic initialization of
statically allocated rcu_head structures.

Fixes: 795bc112cd5a ("IB/srpt: Make it safe to use RCU for srpt_device.rch_list")
Fixes: a11253142e6d ("IB/srpt: Rework multi-channel support")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoperf/x86: Fix perf,x86,cpuhp deadlock
Peter Zijlstra [Wed, 10 Jan 2018 18:23:08 +0000 (19:23 +0100)]
perf/x86: Fix perf,x86,cpuhp deadlock

More lockdep gifts, a 5-way lockup race:

perf_event_create_kernel_counter()
  perf_event_alloc()
    perf_try_init_event()
      x86_pmu_event_init()
__x86_pmu_event_init()
  x86_reserve_hardware()
 #0     mutex_lock(&pmc_reserve_mutex);
    reserve_ds_buffer()
 #1       get_online_cpus()

perf_event_release_kernel()
  _free_event()
    hw_perf_event_destroy()
      x86_release_hardware()
 #0 mutex_lock(&pmc_reserve_mutex)
release_ds_buffer()
 #1   get_online_cpus()

 #1 do_cpu_up()
  perf_event_init_cpu()
 #2     mutex_lock(&pmus_lock)
 #3     mutex_lock(&ctx->mutex)

sys_perf_event_open()
  mutex_lock_double()
 #3     mutex_lock(ctx->mutex)
 #4     mutex_lock_nested(ctx->mutex, 1);

perf_try_init_event()
 #4   mutex_lock_nested(ctx->mutex, 1)
  x86_pmu_event_init()
    intel_pmu_hw_config()
      x86_add_exclusive()
 #0 mutex_lock(&pmc_reserve_mutex)

Fix it by using ordering constructs instead of locking.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
6 years agoperf/core: Fix ctx::mutex deadlock
Peter Zijlstra [Tue, 9 Jan 2018 20:23:02 +0000 (21:23 +0100)]
perf/core: Fix ctx::mutex deadlock

Lockdep noticed the following 3-way lockup scenario:

sys_perf_event_open()
  perf_event_alloc()
    perf_try_init_event()
 #0       ctx = perf_event_ctx_lock_nested(1)
      perf_swevent_init()
swevent_hlist_get()
 #1   mutex_lock(&pmus_lock)

perf_event_init_cpu()
 #1   mutex_lock(&pmus_lock)
 #2   mutex_lock(&ctx->mutex)

sys_perf_event_open()
  mutex_lock_double()
 #2    mutex_lock()
 #0    mutex_lock_nested()

And while we need that perf_event_ctx_lock_nested() for HW PMUs such
that they can iterate the sibling list, trying to match it to the
available counters, the software PMUs need do no such thing. Exclude
them.

In particular the swevent triggers the above invertion, while the
tpevent PMU triggers a more elaborate one through their event_mutex.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
6 years agoperf/core: Fix another perf,trace,cpuhp lock inversion
Peter Zijlstra [Tue, 9 Jan 2018 16:07:59 +0000 (17:07 +0100)]
perf/core: Fix another perf,trace,cpuhp lock inversion

Lockdep noticed the following 3-way lockup race:

        perf_trace_init()
 #0       mutex_lock(&event_mutex)
          perf_trace_event_init()
            perf_trace_event_reg()
              tp_event->class->reg() := tracepoint_probe_register
 #1              mutex_lock(&tracepoints_mutex)
                  trace_point_add_func()
 #2                  static_key_enable()

 #2 do_cpu_up()
  perf_event_init_cpu()
 #3     mutex_lock(&pmus_lock)
 #4     mutex_lock(&ctx->mutex)

perf_ioctl()
 #4   ctx = perf_event_ctx_lock()
  _perf_iotcl()
    ftrace_profile_set_filter()
 #0       mutex_lock(&event_mutex)

Fudge it for now by noting that the tracepoint state does not depend
on the event <-> context relation. Ugly though :/

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
6 years agoperf/core: Fix lock inversion between perf,trace,cpuhp
Peter Zijlstra [Tue, 9 Jan 2018 12:10:30 +0000 (13:10 +0100)]
perf/core: Fix lock inversion between perf,trace,cpuhp

Lockdep gifted us with noticing the following 4-way lockup scenario:

        perf_trace_init()
 #0       mutex_lock(&event_mutex)
          perf_trace_event_init()
            perf_trace_event_reg()
              tp_event->class->reg() := tracepoint_probe_register
 #1             mutex_lock(&tracepoints_mutex)
                  trace_point_add_func()
 #2                 static_key_enable()

 #2     do_cpu_up()
          perf_event_init_cpu()
 #3         mutex_lock(&pmus_lock)
 #4         mutex_lock(&ctx->mutex)

        perf_event_task_disable()
          mutex_lock(&current->perf_event_mutex)
 #4       ctx = perf_event_ctx_lock()
 #5       perf_event_for_each_child()

        do_exit()
          task_work_run()
            __fput()
              perf_release()
                perf_event_release_kernel()
 #4               mutex_lock(&ctx->mutex)
 #5               mutex_lock(&event->child_mutex)
                  free_event()
                    _free_event()
                      event->destroy() := perf_trace_destroy
 #0                     mutex_lock(&event_mutex);

Fix that by moving the free_event() out from under the locks.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
6 years agoMerge tag 'drm-misc-fixes-2018-01-24' of git://anongit.freedesktop.org/drm/drm-misc...
Dave Airlie [Thu, 25 Jan 2018 02:28:15 +0000 (12:28 +1000)]
Merge tag 'drm-misc-fixes-2018-01-24' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

Two vc4 fixes that were applied in the last day.
One fixes a NULL dereference, and the other fixes
a flickering bug.

Cc: Eric Anholt <eric@anholt.net>
Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
* tag 'drm-misc-fixes-2018-01-24' of git://anongit.freedesktop.org/drm/drm-misc:
  drm/vc4: Fix NULL pointer dereference in vc4_save_hang_state()
  drm/vc4: Flush the caches before the bin jobs, as well.

6 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Thu, 25 Jan 2018 01:24:30 +0000 (17:24 -0800)]
Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:

 1) Avoid negative netdev refcount in error flow of xfrm state add, from
    Aviad Yehezkel.

 2) Fix tcpdump decoding of IPSEC decap'd frames by filling in the
    ethernet header protocol field in xfrm{4,6}_mode_tunnel_input().
    From Yossi Kuperman.

 3) Fix a syzbot triggered skb_under_panic in pppoe having to do with
    failing to allocate an appropriate amount of headroom. From
    Guillaume Nault.

 4) Fix memory leak in vmxnet3 driver, from Neil Horman.

 5) Cure out-of-bounds packet memory access in em_nbyte EMATCH module,
    from Wolfgang Bumiller.

 6) Restrict what kinds of sockets can be bound to the KCM multiplexer
    and also disallow when another layer has attached to the socket and
    made use of sk_user_data. From Tom Herbert.

 7) Fix use before init of IOTLB in vhost code, from Jason Wang.

 8) Correct STACR register write bit definition in IBM emac driver, from
    Ivan Mikhaylov.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net/ibm/emac: wrong bit is used for STA control register write
  net/ibm/emac: add 8192 rx/tx fifo size
  vhost: do not try to access device IOTLB when not initialized
  vhost: use mutex_lock_nested() in vhost_dev_lock_vqs()
  i40e: flower: check if TC offload is enabled on a netdev
  qed: Free reserved MR tid
  qed: Remove reserveration of dpi for kernel
  kcm: Check if sk_user_data already set in kcm_attach
  kcm: Only allow TCP sockets to be attached to a KCM mux
  net: sched: fix TCF_LAYER_LINK case in tcf_get_base_ptr
  net: sched: em_nbyte: don't add the data offset twice
  mlxsw: spectrum_router: Don't log an error on missing neighbor
  vmxnet3: repair memory leak
  ipv6: Fix getsockopt() for sockets with default IPV6_AUTOFLOWLABEL
  pppoe: take ->needed_headroom of lower device into account on xmit
  xfrm: fix boolean assignment in xfrm_get_type_offload
  xfrm: Fix eth_hdr(skb)->h_proto to reflect inner IP version
  xfrm: fix error flow in case of add state fails
  xfrm: Add SA to hardware at the end of xfrm_state_construct()

6 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
Linus Torvalds [Wed, 24 Jan 2018 23:49:02 +0000 (15:49 -0800)]
Merge git://git./linux/kernel/git/davem/sparc

Pull sparc bugfix from David Miller:
 "Sparc Makefile typo fix"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc64: fix typo in CONFIG_CRYPTO_DES_SPARC64 => CONFIG_CRYPTO_CAMELLIA_SPARC64

6 years agonet/ibm/emac: wrong bit is used for STA control register write
Ivan Mikhaylov [Wed, 24 Jan 2018 12:53:25 +0000 (15:53 +0300)]
net/ibm/emac: wrong bit is used for STA control register write

STA control register has areas of mode and opcodes for opeations. 18 bit is
using for mode selection, where 0 is old MIO/MDIO access method and 1 is
indirect access mode. 19-20 bits are using for setting up read/write
operation(STA opcodes). In current state 'read' is set into old MIO/MDIO mode
with 19 bit and write operation is set into 18 bit which is mode selection,
not a write operation. To correlate write with read we set it into 20 bit.
All those bit operations are MSB 0 based.

Signed-off-by: Ivan Mikhaylov <ivan@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet/ibm/emac: add 8192 rx/tx fifo size
Ivan Mikhaylov [Wed, 24 Jan 2018 12:53:24 +0000 (15:53 +0300)]
net/ibm/emac: add 8192 rx/tx fifo size

emac4syn chips has availability to use 8192 rx/tx fifo buffer sizes,
in current state if we set it up in dts 8192 as example, we will get
only 2048 which may impact on network speed.

Signed-off-by: Ivan Mikhaylov <ivan@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoRevert "Input: synaptics_rmi4 - use devm_device_add_group() for attributes in F01"
Nick Dyer [Wed, 24 Jan 2018 21:46:04 +0000 (13:46 -0800)]
Revert "Input: synaptics_rmi4 - use devm_device_add_group() for attributes in F01"

Since the sysfs attribute hangs off the RMI bus, which doesn't go away during
firmware flash, it needs to be explicitly removed, otherwise we would try and
register the same attribute twice.

This reverts commit 36a44af5c176d619552d99697433261141dd1296.

Signed-off-by: Nick Dyer <nick@shmanahar.org>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
6 years agovhost: do not try to access device IOTLB when not initialized
Jason Wang [Tue, 23 Jan 2018 09:27:26 +0000 (17:27 +0800)]
vhost: do not try to access device IOTLB when not initialized

The code will try to access dev->iotlb when processing
VHOST_IOTLB_INVALIDATE even if it was not initialized which may lead
to NULL pointer dereference. Fixes this by check dev->iotlb before.

Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agovhost: use mutex_lock_nested() in vhost_dev_lock_vqs()
Jason Wang [Tue, 23 Jan 2018 09:27:25 +0000 (17:27 +0800)]
vhost: use mutex_lock_nested() in vhost_dev_lock_vqs()

We used to call mutex_lock() in vhost_dev_lock_vqs() which tries to
hold mutexes of all virtqueues. This may confuse lockdep to report a
possible deadlock because of trying to hold locks belong to same
class. Switch to use mutex_lock_nested() to avoid false positive.

Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
Reported-by: syzbot+dbb7c1161485e61b0241@syzkaller.appspotmail.com
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoi40e: flower: check if TC offload is enabled on a netdev
Jakub Kicinski [Tue, 23 Jan 2018 08:08:40 +0000 (00:08 -0800)]
i40e: flower: check if TC offload is enabled on a netdev

Since TC block changes drivers are required to check if
the TC hw offload flag is set on the interface themselves.

Fixes: 2f4b411a3d67 ("i40e: Enable cloud filters via tc-flower")
Fixes: 44ae12a768b7 ("net: sched: move the can_offload check from binding phase to rule insertion phase")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Amritha Nambiar <amritha.nambiar@intel.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>