review.tizen.org Git - platform/kernel/linux-rpi.git/log

net: hns3: Add support to request VF Reset to PF

VF driver depends upon PF to eventually reset the hardware. This
request is made using the mailbox command. This patch adds the
required function to acheive above.

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: Add VF Reset device state and its handling

This introduces the hclge device reset states of "requested" and
"pending" and also its handling in context to Reset Service Task.

Device gets into requested state because of any VF reset request
asserted from upper layers, for example due to watchdog timeout
expiration. Requested state would result in eventually forwarding
the VF reset request to PF which would actually reset the VF.

Device will get into pending state if:
1. VF receives the acknowledgement from PF for the VF reset
   request it originally sent to PF.
2. Reset Service Task detects that after asserting VF reset for
   certain times the data-path is not working and device then
   decides to assert full VF reset(this means also resetting the
   PCIe interface).
3. PF intimates the VF that it has undergone reset.
Pending state would result in VF to poll for hardware reset
completion status and then resetting the stack/enet layer, which
in turn means reinitializing the ring management/enet layer.

Note: we would be adding support of 3. later as a separate patch.
This decision should not affect VF reset as its event handling
is generic in nature.

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: Add VF Reset Service Task to support event handling

VF reset would involve handling of different reset related events
from the stack, physical function, mailbox etc. Reset service task
would be used in servicing such reset event requests and later
handling the hardware completions waits and initiating the stack
resets.

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: Changes to make enet watchdog timeout func common for PF/VF

HNS3 drivers enet layer, used for the ring management and stack
interaction, is common to both VF and PF. PF already supports reset
functionality to handle the network stack watchdog timeout trigger
but the existing code is not generic enough to be used to support VF
reset as well.
This patch does following:
1. Makes the existing watchdog timeout handler in enet layer generic
i.e. suitable for both VF and PF and
2. Introduces the new reset event handler for the VF code.
3. Changes existing reset event handler of PF code to initialize the
reset level

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Rework-ip_ra_chain-protection'

Kirill Tkhai says:

====================
Rework ip_ra_chain protection

Commit 1215e51edad1 "ipv4: fix a deadlock in ip_ra_control"
made rtnl_lock() be used in raw_close(). This function is called
on every RAW socket destruction, so that rtnl_mutex is taken
every time. This scales very sadly. I observe cleanup_net()
spending a lot of time in rtnl_lock() and raw_close() is one
of the biggest rtnl user (since we have percpu net->ipv4.icmp_sk).

This patchset reworks the locking: reverts the problem commit
and its descendant, and introduces rtnl-independent locking.
This may have a continuation, and someone may work on killing
rtnl_lock() in mrtsock_destruct() in the future.

v3: Change patches order: [2/5] and [3/5].
v2: Fix sparse warning [4/5], as reported by kbuild test robot.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: Replace ip_ra_lock with per-net mutex

Since ra_chain is per-net, we may use per-net mutexes
to protect them in ip_ra_control(). This improves
scalability.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Make ip_ra_chain per struct net

This is optimization, which makes ip_call_ra_chain()
iterate less sockets to find the sockets it's looking for.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Revert "ipv4: fix a deadlock in ip_ra_control"

This reverts commit 1215e51edad1.
Since raw_close() is used on every RAW socket destruction,
the changes made by 1215e51edad1 scale sadly. This clearly
seen on endless unshare(CLONE_NEWNET) test, and cleanup_net()
kwork spends a lot of time waiting for rtnl_lock() introduced
by this commit.

Previous patch moved IP_ROUTER_ALERT out of rtnl_lock(),
so we revert this patch.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Move IP_ROUTER_ALERT out of lock_sock(sk)

ip_ra_control() does not need sk_lock. Who are the another
users of ip_ra_chain? ip_mroute_setsockopt() doesn't take
sk_lock, while parallel IP_ROUTER_ALERT syscalls are
synchronized by ip_ra_lock. So, we may move this command
out of sk_lock.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Revert "ipv4: get rid of ip_ra_lock"

This reverts commit ba3f571d5dde. The commit was made
after 1215e51edad1 "ipv4: fix a deadlock in ip_ra_control",
and killed ip_ra_lock, which became useless after rtnl_lock()
made used to destroy every raw ipv4 socket. This scales
very bad, and next patch in series reverts 1215e51edad1.
ip_ra_lock will be used again.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio: Added support for trusted VF

When a VF is trusted, all promiscuous traffic will only be sent to that VF.
In normal operation promiscuous traffic is sent to the PF. There can be
only one trusted VF per PF

Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'rmnet-next'

Subash Abhinov Kasiviswanathan says:

====================
net: qualcomm: rmnet: Updates 2018-03-12

This series contains some minor updates for rmnet driver.

Patch 1 contains fixes for sparse warnings.
Patch 2 updates the copyright date to 2018.
Patch 3 is a cleanup in receive path.
Patch 4 has the new rmnet netlink attributes in uapi and updates the usage.
Patch 5 has the implementation of the fill_info operation.

v1->v2: Remove the force casts since the data type is changed to __be
types as mentioned by David.
v2->v3: Update copyright in files which actually had changes as
mentioned by Joe.
v3->v4: Add new netlink attributes for mux_id and flags instead of using the
the vlan attributes as mentioned by David. The rmnet specific flags are also
moved to uapi. The netlink updates are done as part of #4 and #5 has the
fill_info operation.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: qualcomm: rmnet: Implement fill_info

This is needed to query the mux_id and flags of a rmnet device.

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qualcomm: rmnet: Export mux_id and flags to netlink

Define new netlink attributes for rmnet mux_id and flags. These
flags / mux_id were earlier using vlan flags / id respectively.
The flag bits are also moved to uapi and are renamed with
prefix RMNET_FLAG_*.

Also add the rmnet policy to handle the new netlink attributes.

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qualcomm: rmnet: Remove unnecessary device assignment

Device of the de-aggregated skb is correctly assigned after inspecting
the mux_id, so remove the assignment in rmnet_map_deaggregate().

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qualcomm: rmnet: Update copyright year to 2018

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qualcomm: rmnet: Fix casting issues

Fix warnings which were reported when running with sparse
(make C=1 CF=-D__CHECK_ENDIAN__)

drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c:81:15:
warning: cast to restricted __be16
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:271:37:
warning: incorrect type in assignment (different base types)
expected unsigned short [unsigned] [usertype] pkt_len
got restricted __be16 [usertype] <noident>
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:287:29:
warning: incorrect type in assignment (different base types)
expected unsigned short [unsigned] [usertype] pkt_len
got restricted __be16 [usertype] <noident>
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:310:22:
warning: cast to restricted __be16
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:319:13:
warning: cast to restricted __be16
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c:49:18:
warning: cast to restricted __be16
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c:50:18:
warning: cast to restricted __be32
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c:74:21:
warning: cast to restricted __be16

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

gre: fix TUNNEL_SEQ bit check on sequence numbering

The current logic of flags | TUNNEL_SEQ is always non-zero and hence
sequence numbers are always incremented no matter the setting of the
TUNNEL_SEQ bit. Fix this by using & instead of |.

Detected by CoverityScan, CID#1466039 ("Operands don't affect result")

Fixes: 77a5196a804e ("gre: add sequence number for collect md mode.")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Documentation/networking: Add net DIM documentation

Net DIM is a generic algorithm, purposed for dynamically
optimizing network devices interrupt moderation. This
document describes how it works and how to use it.

Signed-off-by: Tal Gilboa <talgi@mellanox.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvpp2: use correct index on array mvpp2_pools

Array mvpp2_pools is being indexed by long_log_pool, however this
looks like a cut-n-paste bug and in fact should be short_log_pool.

Detected by CoverityScan, CID#1466113 ("Copy-paste error")

Fixes: 576193f2d579 ("net: mvpp2: jumbo frames support")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tipc-diag'

GhantaKrishnamurthy MohanKrishna says:

====================
tipc: socket diagnostics additions for AF_TIPC

The following patchsets add socket diagnostics support for AF_TIPC by
using the sock diag framework. The patchset was created on top of
commit id: fb66cb0.

New iproute2 package is needed to use this functionality which will
be sent for review in a seperate mail.

The commit series improves diagnosis of tipc sockets by exporting the
configuration, states and statistics of sockets.

The series has been co-authored by Parthasarathy Bhuvaragan and
consist of two parts:

1-2: Adaptations of existing code to support sock_diag framework.
We modify existing functions to support socket diagnostics. Required
information about the sockets are exported.

3: Step sk_drops during packet drop.
This occurs if the packet cannot be queued due to queue length exceeding
configured thresholds.

The diag module is optional, and if enabled it will be loaded on demand
when needed.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: step sk->sk_drops when rcv buffer is full

Currently when tipc is unable to queue a received message on a
socket, the message is rejected back to the sender with error
TIPC_ERR_OVERLOAD. However, the application on this socket
has no knowledge about these discards.

In this commit, we try to step the sk_drops counter when tipc
is unable to queue a received message. Export sk_drops
using tipc socket diagnostics.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: implement socket diagnostics for AF_TIPC

This commit adds socket diagnostics capability for AF_TIPC in netlink
family NETLINK_SOCK_DIAG in a new kernel module (diag.ko).

The following are key design considerations:
- config TIPC_DIAG has default y, like INET_DIAG.
- only requests with flag NLM_F_DUMP is supported (dump all).
- tipc_sock_diag_req message is introduced to send filter parameters.
- the response attributes are of TLV, some nested.

To avoid exposing data structures between diag and tipc modules and
avoid code duplication, the following additions are required:
- export tipc_nl_sk_walk function to reuse socket iterator.
- export tipc_sk_fill_sock_diag to fill the tipc diag attributes.
- create a sock_diag response message in __tipc_add_sock_diag defined
in diag.c and use the above exported tipc_sk_fill_sock_diag
to fill response.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: modify socket iterator for sock_diag

The current socket iterator function tipc_nl_sk_dump, handles socket
locks and calls __tipc_nl_add_sk for each socket.
To reuse this logic in sock_diag implementation, we do minor
modifications to make these functions generic as described below.

In this commit, we add a two new functions __tipc_nl_sk_walk,
__tipc_nl_add_sk_info and modify tipc_nl_sk_dump, __tipc_nl_add_sk
accordingly.

In __tipc_nl_sk_walk we:
1. acquire and release socket locks
2. for each socket, execute the specified callback function

In __tipc_nl_add_sk we:
- Move the netlink attribute insertion to __tipc_nl_add_sk_info.

tipc_nl_sk_dump calls tipc_nl_sk_walk with __tipc_nl_add_sk as argument.

sock_diag will use these generic functions in a later commit.

There is no functional change in this commit.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-Update-supported-firmware-version'

Ido Schimmel says:

====================
mlxsw: Update supported firmware version

The first patch bumps the firmware version supported by the driver. The
second patch enables a feature introduced in the new version,
auto-negotiation disable.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Add support for auto-negotiation disable mode

In 'auto-neg off' the device have sent AN (auto-negotiation) frames
with the forced speed. Thus, fix it using an_disable_admin field in
Port type and speed (PTYS) register. This field indicates if speed
negotiation frames would be send by the port or not.

Add the field and enable/disable it for 'auto-neg on/off', make the
port to start/stop sending AN (auto-negotiation) frames. Note that for
SwitchX2 the behavior doesn't change (i.e support only AN enabled with
forced speed).

Signed-off-by: Tal Bar <talb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Update the supported firmware to version 13.1620.192

This new firmware contains:
- Support for auto-neg disable mode

Signed-off-by: Tal Bar <talb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'fix-some-bugs-for-HNS3-driver'

Peng Li says:

====================
fix some bugs for HNS3 driver

This patchset fixes some bugs for HNS3 driver:
[Patch 1/11 - 5/11] fix various bugs reported by hisilicon test team.
[Patch 6/11 - 7/11] fix bugs about interrupt coalescing self-adaptive
function.
[Patch 8/11 - 11/11] fix bugs about ethtool_ops.get_link_ksettings.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: fix for not returning problem in get_link_ksettings when phy exists

When phy exists, phy_ethtool_ksettings_get function is enough to get the
link ksettings. If the phy exists, get_link_ksettings function can return
directly after phy_ethtool_ksettings_get is called.

Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: add querying speed and duplex support to VF

This patch adds support for querying speed and duplex by ethtool ethX
to VF.

Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: add get_link support to VF

This patch adds ethtool_ops.get_link support to VF.

Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: fix for getting wrong link mode problem

Fixed link mode is returned by hns3_get_link_ksettings. It is
unreasonable.

This patch fixes it by adding some related functions to get link
mode from hardware.

Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: change the time interval of int_gl calculating

Since we change the update rate of int_gl from every interrupt to every
one hundred interrupts, the old way to get time interval by int_gl value
is not accurate. This patch calculates the time interval using the jiffies
value.

Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: change GL update rate

The interrupt coalescing self-adaptive function updates the int_gl every
interrupt. The GL update rate is too faster to get a better new GL value.
This patch changes the GL update rate to every one hundred interrupts.
The GL update rate is defined by HNS3_INT_ADAPT_DOWN_START.

Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: increase the max time for IMP handle command

It may need more time for IMP handle some command, such as reset.
This patch enlarges the max time for cmd timeout.

Driver will check the IMP result every us, it may break through the
loop when get the right result. So not all command need the max time.

Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: export pci table of hclge and hclgevf to userspace

There is no module that is dependent on hclge or hclgevf's symbol,
but hns_enet need them to provide ops for it to run. When there is
a need to auto load the hns3 driver, the auto load will fail because
hclge or hclgevf is not loaded.

Hns_enet has already exported the pci table, so this patch exports
the pci table for hclge and hclgevf module too.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: fix for vlan table lost problem when resetting

The vlan table in hardware is clear after PF/Core/IMP/Global
reset, which will cause vlan tagged packets not being received
problem.

This patch fixes it by restoring the vlan table after reset.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: fix the VF queue reset flow error

VF queue reset flow is different from PF queue reset flow.
VF driver should stop VF queue first, then send message to PF
and PF do the reset. PF should send a response to VF after
PF complete the queue reset, VF can initialize the queue hw
after get the response.
This patch fixes the VF queue reset flow as the correct step.

Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: reallocate tx/rx buffer after changing mtu

When changing the mtu, the max frame size also will be changed. The tx
buffer size and the rx buffer size to be allocated are determined by max
frame size. So when max frame size is changed, the tx buffer and rx buffer
need to be reallocated.

When the tc_num is changed, the tx buffer and rx buffer need to be
reallocated too. So calling set_mtu and buffer_alloc separately is better.

Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Remove top_hierarchy arg to devlink_resource_register

top_hierarchy arg can be determined by comparing parent_resource_id to
DEVLINK_RESOURCE_ID_PARENT_TOP so it does not need to be a separate
argument.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: Add multipath tests for onlink flag

Add multipath tests for onlink flag: one test with onlink added to
both nexthops, then tests with onlink added to only 1 nexthop.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'cxgb4-rdma'

Raju Rangoju says:

====================
Add support for RDMA enhancements in cxgb4

Allocates the HW-resources and provide the necessary routines for the
upper layer driver (rdma/iw_cxgb4) to enable the RDMA SRQ support for
Chelsio adapters.

Advertise support for write with immediate work request
Advertise support for write with completion

v3: modified memory allocation as per Stefano's suggestion

v2: fixed the patching issues and also
fixed the following based on review comments of Stefano Brivio
- using kvzalloc instead of vzalloc
- using #define instead of enum
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Support firmware rdma write completion work request.

If FW supports RDMA WRITE_COMPLETION functionality, then advertise that
to the ULDs. This will be used by iw_cxgb4 to allow WRITE_COMPLETION
work requests.

Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Support firmware rdma write with immediate work request.

If FW supports RDMA WRITE_WITH_IMMEDATE functionality, then advertise
that
to the ULDs. This will be used by iw_cxgb4 to allow WRITE_WITH_IMMEDIATE
work requests.

Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Add support to query HW SRQ parameters

This patch adds support to query FW for the HW SRQ table start/end, and
advertise that for ULDs.

Signed-off-by: Raju Rangoju <rajur@chelsio.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Add support to initialise/read SRQ entries

- This patch adds support to initialise srq table and read srq entries

Signed-off-by: Raju Rangoju <rajur@chelsio.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Adds CPL support for Shared Receive Queues

- Add srq table query cpl support for srq
- Add cpl_abort_req_rss6 and cpl_abort_rpl_rss6 structs.
- Add accessors, macros to get the SRQ IDX value.

Signed-off-by: Raju Rangoju <rajur@chelsio.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'r8169-small-improvements'

Heiner Kallweit says:

====================
r8169: series with smaller improvements w/o functional changes

This series includes smaller improvements w/o intended functional changes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: add helper tp_to_dev

In several places struct device is referenced by using &tp->pci_dev->dev.
Add helper tp_to_dev() to improve code readability.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: change type of argument in rtl_disable/enable_clock_request

Changing the argument type to struct rtl8169_private * is more in line
with the other functions in the driver and it allows to reduce the code size.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: change type of first argument in rtl_tx_performance_tweak

Changing the type of the first argument to struct rtl8169_private * is more
in line with the other functions in the driver and it allows to reduce the
code size.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: simplify rtl_set_mac_address

Replace open-coded functionality with eth_mac_addr().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'batadv-next-for-davem-20180319' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This feature/cleanup patchset includes the following patches:

- avoid redundant multicast TT entries, by Linus Luessing

- add netlink support for distributed arp table cache and multicast flags,
by Linus Luessing (2 patches)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

rds: tcp: remove register_netdevice_notifier infrastructure.

The netns deletion path does not need to wait for all net_devices
to be unregistered before dismantling rds_tcp state for the netns
(we are able to dismantle this state on module unload even when
all net_devices are active so there is no dependency here).

This patch removes code related to netdevice notifiers and
refactors all the code needed to dismantle rds_tcp state
into a ->exit callback for the pernet_operations used with
register_pernet_device().

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netns: send uevent messages

This patch adds a receive method to NETLINK_KOBJECT_UEVENT netlink sockets
to allow sending uevent messages into the network namespace the socket
belongs to.

Currently non-initial network namespaces are already isolated and don't
receive uevents. There are a number of cases where it is beneficial for a
sufficiently privileged userspace process to send a uevent into a network
namespace.

One such use case would be debugging and fuzzing of a piece of software
which listens and reacts to uevents. By running a copy of that software
inside a network namespace, specific uevents could then be presented to it.
More concretely, this would allow for easy testing of udevd/ueventd.

This will also allow some piece of software to run components inside a
separate network namespace and then effectively filter what that software
can receive. Some examples of software that do directly listen to uevents
and that we have in the past attempted to run inside a network namespace
are rbd (CEPH client) or the X server.

Implementation:
The implementation has been kept as simple as possible from the kernel's
perspective. Specifically, a simple input method uevent_net_rcv() is added
to NETLINK_KOBJECT_UEVENT sockets which completely reuses existing
af_netlink infrastructure and does neither add an additional netlink family
nor requires any user-visible changes.

For example, by using netlink_rcv_skb() we can make use of existing netlink
infrastructure to report back informative error messages to userspace.

Furthermore, this implementation does not introduce any overhead for
existing uevent generating codepaths. The struct netns got a new uevent
socket member that records the uevent socket associated with that network
namespace including its position in the uevent socket list. Since we record
the uevent socket for each network namespace in struct net we don't have to
walk the whole uevent socket list. Instead we can directly retrieve the
relevant uevent socket and send the message. At exit time we can now also
trivially remove the uevent socket from the uevent socket list. This keeps
the codepath very performant without introducing needless overhead and even
makes older codepaths faster.

Uevent sequence numbers are kept global. When a uevent message is sent to
another network namespace the implementation will simply increment the
global uevent sequence number and append it to the received uevent. This
has the advantage that the kernel will never need to parse the received
uevent message to replace any existing uevent sequence numbers. Instead it
is up to the userspace process to remove any existing uevent sequence
numbers in case the uevent message to be sent contains any.

Security:
In order for a caller to send uevent messages to a target network namespace
the caller must have CAP_SYS_ADMIN in the owning user namespace of the
target network namespace. Additionally, any received uevent message is
verified to not exceed size UEVENT_BUFFER_SIZE. This includes the space
needed to append the uevent sequence number.

Testing:
This patch has been tested and verified to work with the following udev
implementations:
1. CentOS 6 with udevd version 147
2. Debian Sid with systemd-udevd version 237
3. Android 7.1.1 with ueventd

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: add uevent socket member

This commit adds struct uevent_sock to struct net. Since struct uevent_sock
records the position of the uevent socket in the uevent socket list we can
trivially remove it from the uevent socket list during cleanup. This speeds
up the old removal codepath.
Note, list_del() will hit __list_del_entry_valid() in its call chain which
will validate that the element is a member of the list. If it isn't it will
take care that the list is not modified.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert nf_ct_net_ops

These pernet_operations register and unregister sysctl.
Also, there is inet_frags_exit_net() called in exit method,
which has to be safe after a560002437d3 "net: Fix hlist
corruptions in inet_evict_bucket()".

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert lowpan_frags_ops

These pernet_operations register and unregister sysctl.
Also, there is inet_frags_exit_net() called in exit method,
which has to be safe after a560002437d3 "net: Fix hlist
corruptions in inet_evict_bucket()".

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert can_pernet_ops

These pernet_operations create and destroy /proc entries
and cancel per-net timer.

Also, there are unneed iterations over empty list of net
devices, since all net devices must be already moved
to init_net or unregistered by default_device_ops. This
already was mentioned here:

https://marc.info/?l=linux-can&m=150169589119335&w=2

So, it looks safe to make them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2018-03-21

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add a BPF hook for sendmsg and sendfile by reusing the ULP infrastructure
   and sockmap. Three helpers are added along with this, bpf_msg_apply_bytes(),
   bpf_msg_cork_bytes(), and bpf_msg_pull_data(). The first is used to tell
   for how many bytes the verdict should be applied to, the second to tell
   that x bytes need to be queued first to retrigger the BPF program for a
   verdict, and the third helper is mainly for the sendfile case to pull in
   data for making it private for reading and/or writing, from John.

2) Improve address to symbol resolution of user stack traces in BPF stackmap.
   Currently, the latter stores the address for each entry in the call trace,
   however to map these addresses to user space files, it is necessary to
   maintain the mapping from these virtual addresses to symbols in the binary
   which is not practical for system-wide profiling. Instead, this option for
   the stackmap rather stores the ELF build id and offset for the call trace
   entries, from Song.

3) Add support that allows BPF programs attached to perf events to read the
   address values recorded with the perf events. They are requested through
   PERF_SAMPLE_ADDR via perf_event_open(). Main motivation behind it is to
   support building memory or lock access profiling and tracing tools with
   the help of BPF, from Teng.

4) Several improvements to the tools/bpf/ Makefiles. The 'make bpf' in the
   tools directory does not provide the standard quiet output except for
   bpftool and it also does not respect specifying a build output directory.
   'make bpf_install' command neither respects specified destination nor
   prefix, all from Jiri. In addition, Jakub fixes several other minor issues
   in the Makefiles on top of that, e.g. fixing dependency paths, phony
   targets and more.

5) Various doc updates e.g. add a comment for BPF fs about reserved names
   to make the dentry lookup from there a bit more obvious, and a comment
   to the bpf_devel_QA file in order to explain the diff between native
   and bpf target clang usage with regards to pointer size, from Quentin
   and Daniel.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bpf, doc: add description wrt native/bpf clang target and pointer size

As this recently came up on netdev [0], lets add it to the BPF devel doc.

[0] https://www.spinics.net/lists/netdev/msg489612.html

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'dsa-mv88e6xxx-some-fixes'

Uwe Kleine-König says:

====================
net: dsa: mv88e6xxx: some fixes

these patches target net-next and got approved by Andrew Lunn.

Compared to (implicit) v1, I dropped the patch that I didn't know if it
was right because of missing documentation on my side. But Andrew
already cared for that in a patch that is now adfccf118211 in net-next.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Fix interrupt name for g2 irq

This changes the respective line in /proc/interrupts from

49: x x mv88e6xxx-g1 7 Edge mv88e6xxx-g1

to

49: x x mv88e6xxx-g1 7 Edge mv88e6xxx-g2

which makes more sense.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Fix typo in a comment

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Fix name of switch 88E6141

The switch name is emitted in the kernel log, so having the right name
there is nice.

Fixes: 1558727a1c1b ("net: dsa: mv88e6xxx: Add support for ethernet switch 88E6141")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-Adapt-driver-to-upcoming-firmware-versions'

Ido Schimmel says:

====================
mlxsw: Adapt driver to upcoming firmware versions

The first two patches make sure that reserved fields are set to zero, as
required by the device's programmer's reference manual (PRM).

Last two patches prevent the driver from performing an invalid operation
that is going to be denied by upcoming firmware versions.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_acl: Do not invalidate already invalid ACL groups

When a new ACL group is created its region (ACL) list is initially
empty. Thus, the call to mlxsw_sp_acl_tcam_group_update() would
basically invalidate an already invalid (non-existent) group.

Remove the unnecessary call and make the function symmetric to its del()
counterpart.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_acl: Adapt ACL configuration to new firmware versions

The driver currently creates empty ACL groups, binds them to the
requested port and then fills them with actual ACLs that point to TCAM
regions.

However, empty ACL groups are considered invalid and upcoming firmware
versions are going to forbid their binding.

Work around this limitation by only performing the binding after the
first ACL was added to the group.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Reserved field in mbox profile shouldn't be set

There is no need to set some of the fields within 'mbox_config_profile',
since they are reserved and capability mask should be set to zero.

Signed-off-by: Tal Bar <talb@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: pci: Set mbox dma addresses to zero when not used

Some of the opcodes don't use in, out or both mboxes. In such cases, the
mbox address is a reserved field and FW expects it to be zero.

Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlx5: Remove call to ida_pre_get

The mlx5 driver calls ida_pre_get() in a loop for no readily apparent
reason. The driver uses ida_simple_get() which will call ida_pre_get()
by itself and there's no need to use ida_pre_get() unless using
ida_get_new().

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bpf-sockmap-ulp'

John Fastabend says:

====================
This series adds a BPF hook for sendmsg and senfile by using
the ULP infrastructure and sockmap. A simple pseudocode example
would be,

  // load the programs
  bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
                &obj, &msg_prog);

  // lookup the sockmap
  bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");

  // get fd for sockmap
  map_fd_msg = bpf_map__fd(bpf_map_msg);

  // attach program to sockmap
  bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);

  // Add a socket 'fd' to sockmap at location 'i'
  bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);

After the above snippet any socket attached to the map would run
msg_prog on sendmsg and sendfile system calls.

Three additional helpers are added bpf_msg_apply_bytes(),
bpf_msg_cork_bytes(), and bpf_msg_pull_data(). With
bpf_msg_apply_bytes BPF programs can tell the infrastructure how
many bytes the given verdict should apply to. This has two cases.
First, a BPF program applies verdict to fewer bytes than in the
current sendmsg/sendfile msg this will apply the verdict to the
first N bytes of the message then run the BPF program again with
data pointers recalculated to the N+1 byte. The second case is the
BPF program applies a verdict to more bytes than the current sendmsg
or sendfile system call. In this case the infrastructure will cache
the verdict and apply it to future sendmsg/sendfile calls until the
byte limit is reached. This avoids the overhead of running BPF
programs on large payloads.

The helper bpf_msg_cork_bytes() handles a different case where
a BPF program can not reach a verdict on a msg until it receives
more bytes AND the program doesn't want to forward the packet
until it is known to be "good". The example case being a user
(albeit a dumb one probably) sends a N byte header in 1B system
calls. The BPF program can call bpf_msg_cork_bytes with the
required byte limit to reach a verdict and then the program will
only be called again once N bytes are received.

The last helper added in this series is bpf_msg_pull_data(). It
is used to pull data in for modification or reading. Similar to
how sk_pull_data() works msg_pull_data can be used to access data
not in the initial (data_start, data_end) range. For sendpage()
calls this is needed if any data is accessed because the BPF
sendpage hook initializes the data_start and data_end pointers to
zero. We do this because sendpage data is shared with the user
and can be modified during or after the BPF verdict possibly
invalidating any verdict the BPF program decides. For sendmsg
the data is already copied by the sendmsg bpf infrastructure so
we only copy the data if the user request a data range that is
not already linearized. This happens if the user requests larger
blocks of data that are not in a single scatterlist element. The
common case seems to be accessing headers which normally are
in the first scatterlist element and already linearized.

For more examples please review the sample program. There are
examples for all the actions and helpers there.

Patches 1-8 implement the above sockmap/BPF infrastructure. The
remaining patches flush out some minimal selftests and the sample
sockmap program. The sockmap sample program is the main vehicle
for testing this infrastructure and will be moved into selftests
shortly. The final patch in this series is a simple shell script
to run a set of tests. These are the tests I run after any changes
to sockmap. The next task on the list after this series is to
push those into selftests so we can avoid manually testing.

Couple notes on future items in the pipeline,

  0. move sample sockmap programs into selftests (noted above)
  1. add additional support for tcp flags, most are ignored now.
  2. add a Documentation/bpf/sockmap file with these details
  3. support stacked ULP types to allow this and ktls to cooperate
  4. Ingress flag support, redirect only supports egress here. The
     other redirect helpers support ingress and egress flags.
  5. add optimizations, I cut a few optimizations here in the
     first iteration of the code for later study/implementation

-v3 updates
  : u32 data pointers in msg_md changed to void *
  : page_address NULL check and flag verification in msg_pull_data
  : remove old note in commit msg that is no longer relevant
  : remove enum sk_msg_action its not used anywhere
  : fixup test_verifier W -> DW insn to account for data pointers
  : unintentionally dropped a smap_stop_tx() call in sockmap.c

I propagated the ACKs forward because above changes were small
one/two line changes.

-v2 updates (discussion):

Dave noticed that sendpage call was previously (in v1) running
on the data directly. This allowed users to potentially modify
the data after or during the BPF program. However doing a copy
automatically even if the data is not accessed has measurable
performance impact. So we added another helper modeled after
the existing skb_pull_data() helper to allow users to selectively
pull data from the msg. This is also useful in the sendmsg case
when users need to access data outside the first scatterlist
element or across scatterlist boundaries.

While doing this I also unified the sendmsg and sendfile handlers
a bit. Originally the sendfile call was optimized for never
touching the data. I've decided for a first submission to drop
this optimization and we can add it back later. It introduced
unnecessary complexity, at least for a first posting, for a
use case I have not entirely flushed out yet. When the use
case is deployed we can add it back if needed. Then we can
review concrete performance deltas as well on real-world
use-cases/applications.

Lastly, I reorganized the patches a bit. Now all sockmap
changes are in a single patch and each helper gets its own
patch. This, at least IMO, makes it easier to review because
sockmap changes are not spread across the patch series. On
the other hand now apply_bytes, cork_bytes logic is only
activated later in the series. But that should be OK.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: sockmap test script

This adds the test script I am currently using to validate
the latest sockmap changes. Shortly sockmap will be ported
to selftests and these will be run from the infrastructure
there. Until then add the script here so we have a coverage
checklist when porting into selftests.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: sockmap sample test for bpf_msg_pull_data

This adds an option to test the msg_pull_data helper. This
uses two options txmsg_start and txmsg_end to let the user
specify start and end bytes to pull.

The options can be used with txmsg_apply, txmsg_cork options
as well as with any of the basic tests, txmsg, txmsg_redir and
txmsg_drop (plus noisy variants) to run pull_data inline with
those tests. By giving user direct control over the variables
we can easily do negative testing as well as positive tests.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: sockmap add SK_DROP tests

Add tests for SK_DROP.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: sockmap sample support for bpf_msg_cork_bytes()

Add sample application support for the bpf_msg_cork_bytes helper. This
lets the user specify how many bytes each verdict should apply to.

Similar to apply_bytes() tests these can be run as a stand-alone test
when used without other options or inline with other tests by using
the txmsg_cork option along with any of the basic tests txmsg,
txmsg_redir, txmsg_drop.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: sockmap, add sample option to test apply_bytes helper

This adds an option to test the apply_bytes helper. This option lets
the user specify an int on the command line specifying how much data
each verdict should apply to.

When this is set a map entry is set with the bytes input by the user
and then the specified program --txmsg or --txmsg_redir will use the
value and set the applied data. If no other option is set then a
default --txmsg_apply program is run. This program will drop pkts
if an error is detected on the bytes map lookup. Useful to verify
the map lookup and apply helper are working and causing a hard
error if it is not.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: sockmap sample, add data verification option

To verify data is not being dropped or corrupted this adds an option
to verify test-patterns on recv.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: sockmap sample, add sendfile test

To exercise TX ULP sendpage implementation we need a test that does
a sendfile. Add sendfile test option here.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: sockmap sample, add option to attach SK_MSG program

Add sockmap option to use SK_MSG program types.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: add verifier tests for BPF_PROG_TYPE_SK_MSG

Test read and writes for BPF_PROG_TYPE_SK_MSG.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: add map tests for BPF_PROG_TYPE_SK_MSG

Add map tests to attach BPF_PROG_TYPE_SK_MSG types to a sockmap.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: sk_msg program helper bpf_sk_msg_pull_data

Currently, if a bpf sk msg program is run the program
can only parse data that the (start,end) pointers already
consumed. For sendmsg hooks this is likely the first
scatterlist element. For sendpage this will be the range
(0,0) because the data is shared with userspace and by
default we want to avoid allowing userspace to modify
data while (or after) BPF verdict is being decided.

To support pulling in additional bytes for parsing use
a new helper bpf_sk_msg_pull(start, end, flags) which
works similar to cls tc logic. This helper will attempt
to point the data start pointer at 'start' bytes offest
into msg and data end pointer at 'end' bytes offset into
message.

After basic sanity checks to ensure 'start' <= 'end' and
'end' <= msg_length there are a few cases we need to
handle.

First the sendmsg hook has already copied the data from
userspace and has exclusive access to it. Therefor, it
is not necessesary to copy the data. However, it may
be required. After finding the scatterlist element with
'start' offset byte in it there are two cases. One the
range (start,end) is entirely contained in the sg element
and is already linear. All that is needed is to update the
data pointers, no allocate/copy is needed. The other case
is (start, end) crosses sg element boundaries. In this
case we allocate a block of size 'end - start' and copy
the data to linearize it.

Next sendpage hook has not copied any data in initial
state so that data pointers are (0,0). In this case we
handle it similar to the above sendmsg case except the
allocation/copy must always happen. Then when sending
the data we have possibly three memory regions that
need to be sent, (0, start - 1), (start, end), and
(end + 1, msg_length). This is required to ensure any
writes by the BPF program are correctly transmitted.

Lastly this operation will invalidate any previous
data checks so BPF programs will have to revalidate
pointers after making this BPF call.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: sockmap, add msg_cork_bytes() helper

In the case where we need a specific number of bytes before a
verdict can be assigned, even if the data spans multiple sendmsg
or sendfile calls. The BPF program may use msg_cork_bytes().

The extreme case is a user can call sendmsg repeatedly with
1-byte msg segments. Obviously, this is bad for performance but
is still valid. If the BPF program needs N bytes to validate
a header it can use msg_cork_bytes to specify N bytes and the
BPF program will not be called again until N bytes have been
accumulated. The infrastructure will attempt to coalesce data
if possible so in many cases (most my use cases at least) the
data will be in a single scatterlist element with data pointers
pointing to start/end of the element. However, this is dependent
on available memory so is not guaranteed. So BPF programs must
validate data pointer ranges, but this is the case anyways to
convince the verifier the accesses are valid.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: sockmap, add bpf_msg_apply_bytes() helper

A single sendmsg or sendfile system call can contain multiple logical
messages that a BPF program may want to read and apply a verdict. But,
without an apply_bytes helper any verdict on the data applies to all
bytes in the sendmsg/sendfile. Alternatively, a BPF program may only
care to read the first N bytes of a msg. If the payload is large say
MB or even GB setting up and calling the BPF program repeatedly for
all bytes, even though the verdict is already known, creates
unnecessary overhead.

To allow BPF programs to control how many bytes a given verdict
applies to we implement a bpf_msg_apply_bytes() helper. When called
from within a BPF program this sets a counter, internal to the
BPF infrastructure, that applies the last verdict to the next N
bytes. If the N is smaller than the current data being processed
from a sendmsg/sendfile call, the first N bytes will be sent and
the BPF program will be re-run with start_data pointing to the N+1
byte. If N is larger than the current data being processed the
BPF verdict will be applied to multiple sendmsg/sendfile calls
until N bytes are consumed.

Note1 if a socket closes with apply_bytes counter non-zero this
is not a problem because data is not being buffered for N bytes
and is sent as its received.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

This implements a BPF ULP layer to allow policy enforcement and
monitoring at the socket layer. In order to support this a new
program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
the sendmsg/sendpage hook. To attach the policy to sockets a
sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.

Similar to previous sockmap usages when a sock is added to a
sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT
program type attached then the BPF ULP layer is created on the
socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
every msg in sendmsg case and page/offset in sendpage case.

BPF_PROG_TYPE_SK_MSG Semantics/API:

BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
SK_DROP. Returning SK_DROP free's the copied data in the sendmsg
case and in the sendpage case leaves the data untouched. Both cases
return -EACESS to the user. Returning SK_PASS will allow the msg to
be sent.

In the sendmsg case data is copied into kernel space buffers before
running the BPF program. The kernel space buffers are stored in a
scatterlist object where each element is a kernel memory buffer.
Some effort is made to coalesce data from the sendmsg call here.
For example a sendmsg call with many one byte iov entries will
likely be pushed into a single entry. The BPF program is run with
data pointers (start/end) pointing to the first sg element.

In the sendpage case data is not copied. We opt not to copy the
data by default here, because the BPF infrastructure does not
know what bytes will be needed nor when they will be needed. So
copying all bytes may be wasteful. Because of this the initial
start/end data pointers are (0,0). Meaning no data can be read or
written. This avoids reading data that may be modified by the
user. A new helper is added later in this series if reading and
writing the data is needed. The helper call will do a copy by
default so that the page is exclusively owned by the BPF call.

The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
in the sendmsg() case and the entire page/offset in the sendpage case.
This avoids ambiguity on how to handle mixed return codes in the
sendmsg case. Again a helper is added later in the series if
a verdict needs to apply to multiple system calls and/or only
a subpart of the currently being processed message.

The helper msg_redirect_map() can be used to select the socket to
send the data on. This is used similar to existing redirect use
cases. This allows policy to redirect msgs.

Pseudo code simple example:

The basic logic to attach a program to a socket is as follows,

  // load the programs
  bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
&obj, &msg_prog);

  // lookup the sockmap
  bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");

  // get fd for sockmap
  map_fd_msg = bpf_map__fd(bpf_map_msg);

  // attach program to sockmap
  bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);

Adding sockets to the map is done in the normal way,

  // Add a socket 'fd' to sockmap at location 'i'
  bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);

After the above any socket attached to "my_sock_map", in this case
'fd', will run the BPF msg verdict program (msg_prog) on every
sendmsg and sendpage system call.

For a complete example see BPF selftests or sockmap samples.

Implementation notes:

It seemed the simplest, to me at least, to use a refcnt to ensure
psock is not lost across the sendmsg copy into the sg, the bpf program
running on the data in sg_data, and the final pass to the TCP stack.
Some performance testing may show a better method to do this and avoid
the refcnt cost, but for now use the simpler method.

Another item that will come after basic support is in place is
supporting MSG_MORE flag. At the moment we call sendpages even if
the MSG_MORE flag is set. An enhancement would be to collect the
pages into a larger scatterlist and pass down the stack. Notice that
bpf_tcp_sendmsg() could support this with some additional state saved
across sendmsg calls. I built the code to support this without having
to do refactoring work. Other features TBD include ZEROCOPY and the
TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series
shortly.

Future work could improve size limits on the scatterlist rings used
here. Currently, we use MAX_SKB_FRAGS simply because this was being
used already in the TLS case. Future work could extend the kernel sk
APIs to tune this depending on workload. This is a trade-off
between memory usage and throughput performance.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net: generalize sk_alloc_sg to work with scatterlist rings

The current implementation of sk_alloc_sg expects scatterlist to always
start at entry 0 and complete at entry MAX_SKB_FRAGS.

Future patches will want to support starting at arbitrary offset into
scatterlist so add an additional sg_start parameters and then default
to the current values in TLS code paths.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG

When calling do_tcp_sendpages() from in kernel and we know the data
has no references from user side we can omit SKBTX_SHARED_FRAG flag.
This patch adds an internal flag, NO_SKBTX_SHARED_FRAG that can be used
to omit setting SKBTX_SHARED_FRAG.

The flag is not exposed to userspace because the sendpage call from
the splice logic masks out all bits except MSG_MORE.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

sockmap: convert refcnt to an atomic refcnt

The sockmap refcnt up until now has been wrapped in the
sk_callback_lock(). So its not actually needed any locking of its
own. The counter itself tracks the lifetime of the psock object.
Sockets in a sockmap have a lifetime that is independent of the
map they are part of. This is possible because a single socket may
be in multiple maps. When this happens we can only release the
psock data associated with the socket when the refcnt reaches
zero. There are three possible delete sock reference decrement
paths first through the normal sockmap process, the user deletes
the socket from the map. Second the map is removed and all sockets
in the map are removed, delete path is similar to case 1. The third
case is an asyncronous socket event such as a closing the socket. The
last case handles removing sockets that are no longer available.
For completeness, although inc does not pose any problems in this
patch series, the inc case only happens when a psock is added to a
map.

Next we plan to add another socket prog type to handle policy and
monitoring on the TX path. When we do this however we will need to
keep a reference count open across the sendmsg/sendpage call and
holding the sk_callback_lock() here (on every send) seems less than
ideal, also it may sleep in cases where we hit memory pressure.
Instead of dealing with these issues in some clever way simply make
the reference counting a refcnt_t type and do proper atomic ops.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

sock: make static tls function alloc_sg generic sock helper

The TLS ULP module builds scatterlists from a sock using
page_frag_refill(). This is going to be useful for other ULPs
so move it into sock file for more general use.

In the process remove useless goto at end of while loop.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Merge branch '40GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2018-03-19

This series contains updates to i40e and i40evf only.

Alex fixes a potential deadlock in the configure_clsflower function in
i40evf, where we exit with the "IN_CRITICAL_TASK" bit set while
notifying the PF of flower filters.

Jan fixed an issue where it was possible to set a mode that is not
allowed which resulted in link being down, so fixed the parity between
i40e_set_link_ksettings() and i40e_get_link_ksettings().

Patryk fixes a bug where a backplane device was allowing the setting of
link settings, which is not allowed.

Shiraz fixes a crash when entering S3 because the client interface was
freeing the MSIx vectors while they are still in use.

Jake fixes up a function header comment to document a newly added
parameter. Also cleaned up flags that were never used.

Doug fixes the incorrect return type for i40e_aq_add_cloud_filters().
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

i40e: Fix the polling mechanism of GLGEN_RSTAT.DEVSTATE

This fixes the polling mechanism of GLGEN_RSTAT.DEVSTATE in the
PF Reset path when Global Reset is in progress. While the driver
is polling for the end of the PF Reset and the Global Reset is
triggered, abandon the PF Reset path and prepare for the
upcoming Global Reset.

Signed-off-by: Paweł Jabłoński <pawel.jablonski@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40evf: remove flags that are never used

These flags were defined, but there is no use within the driver code, so
we don't need to keep them.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Prevent setting link speed on I40E_DEV_ID_25G_B

Setting link settings on backplane devices shouldn't be allowed.
This patch adds one more device id to the list which we check
that against.

Signed-off-by: Patryk Małek <patryk.malek@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Fix incorrect return types

Fix return types from i40e_status to enum i40e_status_code.

Signed-off-by: Doug Dziggel <douglas.a.dziggel@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: add doxygen comment for new mode parameter

A recent patch updated the signature for i40e_aq_set_switch_config() to
add a new parameter 'mode'. It forgot to document the parameter in the
doxygen function header comment. Add the parameter to the function
description now.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Close client on suspend and restore client MSIx on resume

During suspend client MSIx vectors are freed while they are still
in use causing a crash on entering S3.

Fix this calling client close before freeing up its MSIx vectors.
Also update the client MSIx vectors on resume before client
open is called.

Fixes commit b980c0634fe5 ("i40e: shutdown all IRQs and disable MSI-X
when suspended")

Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Prevent setting link speed on KX_X722

Setting link settings on backplane devices shouldn't be allowed.
This patch adds one more device id to the list which we check
that against.

Signed-off-by: Patryk Małek <patryk.malek@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Properly check allowed advertisement capabilities

The i40e_set_link_ksettings and i40e_get_link_ksettings use different
codepaths to check available and supported advertisement modes. This
creates scenarios where it's possible to set a mode that's not allowed,
resulting in a link down.

Fix setting advertisement in i40e_set_link_ksettings by calling
i40e_get_link_ksettings to check what modes are allowed.

Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40evf: Reorder configure_clsflower to avoid deadlock on error

While doing some code review I noticed that we can get into a state where
we exit with the "IN_CRITICAL_TASK" bit set while notifying the PF of
flower filters. This patch is meant to address that plus tweak the ordering
of the while loop waiting on it slightly so that we don't wait an extra
period after we have failed for the last time.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>