platform/kernel/linux-rpi.git
7 years agotools: bpftool: add option parsing to bpftool, --help and --version
Quentin Monnet [Mon, 23 Oct 2017 16:24:06 +0000 (09:24 -0700)]
tools: bpftool: add option parsing to bpftool, --help and --version

Add an option parsing facility to bpftool, in prevision of future
options for demanding JSON output. Currently, two options are added:
--help and --version, that act the same as the respective commands
`help` and `version`.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotools: bpftool: copy JSON writer from iproute2 repository
Quentin Monnet [Mon, 23 Oct 2017 16:24:05 +0000 (09:24 -0700)]
tools: bpftool: copy JSON writer from iproute2 repository

In prevision of following commits, supposed to add JSON output to the
tool, two files are copied from the iproute2 repository (taken at commit
268a9eee985f): lib/json_writer.c and include/json_writer.h.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'tcp-tracepoints'
David S. Miller [Tue, 24 Oct 2017 00:21:26 +0000 (01:21 +0100)]
Merge branch 'tcp-tracepoints'

Song Liu says:

====================
net: add a set of tracepoints to tcp stack

Changes from v1:

Fix build error (with ipv6 as ko) by adding EXPORT_TRACEPOINT_SYMBOL_GPL
for trace_tcp_send_reset.

These patches add the following tracepoints to tcp stack.

tcp_send_reset
tcp_receive_reset
tcp_destroy_sock
tcp_set_state

These tracepoints can be used to track TCP state changes. Such state
changes include but are not limited to: connection establish,
connection termination, tx and rx of RST, various retransmits.

Currently, we use the following kprobes to trace these events:

int kprobe__tcp_validate_incoming
int kprobe__tcp_send_active_reset
int kprobe__tcp_v4_send_reset
int kprobe__tcp_v6_send_reset
int kprobe__tcp_v4_destroy_sock
int kprobe__tcp_set_state
int kprobe__tcp_retransmit_skb

These tracepoints will help us simplify this work.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotcp: add tracepoint trace_tcp_set_state()
Song Liu [Mon, 23 Oct 2017 16:20:27 +0000 (09:20 -0700)]
tcp: add tracepoint trace_tcp_set_state()

This patch adds tracepoint trace_tcp_set_state. Besides usual fields
(s/d ports, IP addresses), old and new state of the socket is also
printed with TP_printk, with __print_symbolic().

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotcp: add tracepoint trace_tcp_destroy_sock
Song Liu [Mon, 23 Oct 2017 16:20:26 +0000 (09:20 -0700)]
tcp: add tracepoint trace_tcp_destroy_sock

This patch adds trace event trace_tcp_destroy_sock.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotcp: add tracepoint trace_tcp_receive_reset
Song Liu [Mon, 23 Oct 2017 16:20:25 +0000 (09:20 -0700)]
tcp: add tracepoint trace_tcp_receive_reset

New tracepoint trace_tcp_receive_reset is added and called from
tcp_reset(). This tracepoint is define with a new class tcp_event_sk.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotcp: add tracepoint trace_tcp_send_reset
Song Liu [Mon, 23 Oct 2017 16:20:24 +0000 (09:20 -0700)]
tcp: add tracepoint trace_tcp_send_reset

New tracepoint trace_tcp_send_reset is added and called from
tcp_v4_send_reset(), tcp_v6_send_reset() and tcp_send_active_reset().

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotcp: mark trace event arguments sk and skb as const
Song Liu [Mon, 23 Oct 2017 16:20:23 +0000 (09:20 -0700)]
tcp: mark trace event arguments sk and skb as const

Some functions that we plan to add trace points require const sk
and/or skb. So we mark these fields as const in the tracepoint.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotcp: add trace event class tcp_event_sk_skb
Song Liu [Mon, 23 Oct 2017 16:20:22 +0000 (09:20 -0700)]
tcp: add trace event class tcp_event_sk_skb

Introduce event class tcp_event_sk_skb for tcp tracepoints that
have arguments sk and skb.

Existing tracepoint trace_tcp_retransmit_skb() falls into this class.
This patch rewrites the definition of trace_tcp_retransmit_skb() with
tcp_event_sk_skb.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'hns3-next'
David S. Miller [Tue, 24 Oct 2017 00:16:42 +0000 (01:16 +0100)]
Merge branch 'hns3-next'

Lipeng says:

====================
net: hns3: bug fixes & code improvements

This patchset introduces various HNS3 bug fixes, optimizations and code
improvements.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: hns3: fix a bug about hns3_clean_tx_ring
Lipeng [Mon, 23 Oct 2017 11:51:07 +0000 (19:51 +0800)]
net: hns3: fix a bug about hns3_clean_tx_ring

The return value of hns3_clean_tx_ring means tx ring clean result.
Return true means clean complete and there is no more pakcet need
clean. Retrun false means there is packets need clean and napi need
poll again. The last return of hns3_clean_tx_ring is
"return !!budget" as budget will decrease when clean a buffer.

If there is no valid BD in TX ring, return 0 for hns3_clean_tx_ring
will cause napi poll again and never complete the napi poll. This
patch fixes the bug.

Fixes: 76ad4f0 (net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC)

Signed-off-by: Lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: hns3: remove redundant memset when alloc buffer
Lipeng [Mon, 23 Oct 2017 11:51:06 +0000 (19:51 +0800)]
net: hns3: remove redundant memset when alloc buffer

HW will use packet length to write packets to buffer or read
packets from buffer. There is a redundant memset when alloc buffer,
the memset have no sense and will increase time-consuming.
This patch removes it.

Fixes: 76ad4f0 (net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC)

Signed-off-by: Lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: hns3: fix the TX/RX ring.queue_index in hns3_ring_get_cfg
Lipeng [Mon, 23 Oct 2017 11:51:05 +0000 (19:51 +0800)]
net: hns3: fix the TX/RX ring.queue_index in hns3_ring_get_cfg

The interface hns3_ring_get_cfg only update TX ring queue_index,
but do not update RX ring queue_index. This patch fixes it.

Fixes: 76ad4f0 (net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC)

Signed-off-by: Lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: hns3: get vf count by pci_sriov_get_totalvfs
Lipeng [Mon, 23 Oct 2017 11:51:04 +0000 (19:51 +0800)]
net: hns3: get vf count by pci_sriov_get_totalvfs

This patch gets vf count by standard function pci_sriov_get_totalvfs,
instead of info from NIC HW.

Signed-off-by: Lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: hns3: fix the ops check in hns3_get_rxnfc
Lipeng [Mon, 23 Oct 2017 11:51:03 +0000 (19:51 +0800)]
net: hns3: fix the ops check in hns3_get_rxnfc

1# patch: 07d2995 net: hns3: add support for ETHTOOL_GRXFH.
2# patch: 5668abd net: hns3: add support for set_ringparam.

1# patch adds ae_algo->ops->get_rss_tuple to hns3_get_rxnfc
and 2# patch delete ae_algo->ops->get_tc_size
from hns3_get_rxnfc.This patch fix the ops check in hns3_get_rxnfc.

Signed-off-by: Lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: hns3: fix the bug when map buffer fail
Lipeng [Mon, 23 Oct 2017 11:51:02 +0000 (19:51 +0800)]
net: hns3: fix the bug when map buffer fail

If one buffer had been recieved to stack, driver will alloc a new buffer,
map the buffer to device and replace the old buffer. When map fail, should
only free the new alloced buffer, but not free all buffers in the ring.

Fixes: 76ad4f0 (net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC)

Signed-off-by: Lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: hns3: fix a bug when alloc new buffer
Lipeng [Mon, 23 Oct 2017 11:51:01 +0000 (19:51 +0800)]
net: hns3: fix a bug when alloc new buffer

When alloce new buffer to HW, should unmap the old buffer first.
This old code map the old buffer but not unmap the old buffer,
this patch fixes it.

Fixes: 76ad4f0 (net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC)

Signed-off-by: Lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge tag 'batadv-next-for-davem-20171023' of git://git.open-mesh.org/linux-merge
David S. Miller [Tue, 24 Oct 2017 00:15:03 +0000 (01:15 +0100)]
Merge tag 'batadv-next-for-davem-20171023' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This documentation/cleanup patchset includes the following patches:

 - Fix parameter kerneldoc which caused kerneldoc warnings, by Sven Eckelmann

 - Remove spurious warnings in B.A.T.M.A.N. V neighbor comparison,
   by Sven Eckelmann

 - Use inline kernel-doc style for UAPI constants, by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agobatman-adv: use inline kernel-doc for uapi constants
Sven Eckelmann [Sat, 21 Oct 2017 09:45:46 +0000 (11:45 +0200)]
batman-adv: use inline kernel-doc for uapi constants

The enums of constants for netlink tends to become rather large over time.
Documenting them is easier when the kernel-doc is actually next to constant
and not in a different block above the enum.

Also inline kernel-doc allows multi-paragraph description. This could be
required to better document the netlink command types and the expected
return values.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
7 years agonet: core: rtnetlink: use BUG_ON instead of if condition followed by BUG
Gustavo A. R. Silva [Sat, 21 Oct 2017 00:43:11 +0000 (19:43 -0500)]
net: core: rtnetlink: use BUG_ON instead of if condition followed by BUG

Use BUG_ON instead of if condition followed by BUG in do_setlink.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: systemport: Guard against unmapped TX ring
Florian Fainelli [Fri, 20 Oct 2017 22:59:30 +0000 (15:59 -0700)]
net: systemport: Guard against unmapped TX ring

Because SYSTEMPORT is a (semi) normal network device, the stack may attempt to
queue packets on it oustide of the DSA slave transmit path.  When that happens,
the DSA layer has not had a chance to tag packets with the appropriate per-port
and per-queue information, and if that happens and we don't have a port 0 queue
0 available (e.g: on boards where this does not exist), we will hit a NULL
pointer de-reference in bcm_sysport_select_queue().

Guard against such cases by testing for the TX ring validity.

Fixes: 84ff33eeb23d ("net: systemport: Establish DSA network device queue mapping")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'mlxsw-Add-support-for-non-equal-cost-multi-path'
David S. Miller [Mon, 23 Oct 2017 04:23:07 +0000 (05:23 +0100)]
Merge branch 'mlxsw-Add-support-for-non-equal-cost-multi-path'

Jiri Pirko says:

====================
mlxsw: Add support for non-equal-cost multi-path

Ido says:

In the device, nexthops are stored as adjacency entries in an array
called the KVD linear (KVDL). When a multi-path route is hit the
packet's headers are hashed and then converted to an index into KVDL
based on the adjacency group's size and base index.

Up until now the driver ignored the `weight` parameter for multi-path
routes and allocated only one adjacency entry for each nexthop with a
limit of 32 nexthops in a group. This set makes the driver take the
`weight` parameter into account when allocating adjacency entries.

First patch teaches dpipe to show the size of the adjacency group, so
that users will be able to determine the actual weight of each nexthop.
The second patch refactors the KVDL allocator, making it more receptive
towards the addition of another partition later in the set.

Patches 3-5 introduce small changes towards the actual change in the
sixth patch that populates the adjacency entries according to their
relative weight.

Last two patches finally add another partition to the KVDL, which allows
us to allocate more than 32 entries per-group and thus support more
nexthops and also provide higher accuracy with regards to the requested
weights.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum: Add another partition to KVD linear
Ido Schimmel [Sun, 22 Oct 2017 21:11:50 +0000 (23:11 +0200)]
mlxsw: spectrum: Add another partition to KVD linear

The KVD linear is currently partitioned into two partitions. One for
single entries and another for groups of 32 entries.

Add another partition consisting of groups of 512 entries which will
allow us to more accurately represent the nexthop weights in non-equal
cost multi-path routing.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum: Increase number of linear entries
Ido Schimmel [Sun, 22 Oct 2017 21:11:49 +0000 (23:11 +0200)]
mlxsw: spectrum: Increase number of linear entries

The memory region where adjacency entries (nexthops) are stored is
called the KVD linear and is configured during initialization with a
size of 64K.

Extend this area with 32K more entries, that will be partitioned into 64
groups of 0.5K entries, thereby allowing us to support weighted nexthops
with high accuracy.

Change the ratio between both types of hash entries, so as to prevent
reduction in the number of double hash entries, which are used for IPv6
neighbours and routes with a prefix length greater than 64.

Note that the user will be able to control all these sizes once the
devlink resource manager is introduced.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum_router: Populate adjacency entries according to weights
Ido Schimmel [Sun, 22 Oct 2017 21:11:48 +0000 (23:11 +0200)]
mlxsw: spectrum_router: Populate adjacency entries according to weights

Up until now the driver assumed all the nexthops have an equal weight
and wrote each to a single adjacency entry.

This patch takes the `weight` parameter into account and populates the
adjacency group according to the relative weight of each nexthop.

Specifically, the weights of all the nexthops that should be offloaded
are first normalized and then used to calculate the upper adjacency
index of each nexthop. This is done according to the hash-threshold
algorithm used by the kernel for IPv4 multi-path routing.

Adjacency groups are currently limited to 32 entries which limits the
weights that can be used, but follow-up patches will introduce groups of
512 entries.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum_router: Prepare for large adjacency groups
Ido Schimmel [Sun, 22 Oct 2017 21:11:47 +0000 (23:11 +0200)]
mlxsw: spectrum_router: Prepare for large adjacency groups

The device has certain restrictions regarding the size of an adjacency
group.

Have the router determine the size of the adjacency group according to
available KVDL allocation sizes and these restrictions.

This was not needed until now since only allocations of up 32 entries
were supported and these are all valid sizes for an adjacency group.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum_router: Store weight in nexthop struct
Ido Schimmel [Sun, 22 Oct 2017 21:11:46 +0000 (23:11 +0200)]
mlxsw: spectrum_router: Store weight in nexthop struct

As the first step towards non-equal-cost multi-path support, store each
nexthop's weight.

For IPv6 nexthops always set the weight to 1, as it only supports ECMP.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum: Add ability to query KVDL allocation size
Ido Schimmel [Sun, 22 Oct 2017 21:11:45 +0000 (23:11 +0200)]
mlxsw: spectrum: Add ability to query KVDL allocation size

The current KVDL allocation API allows the user to specify the requested
number of entries, but the user has no way of knowing how many entries
were actually allocated.

This works because existing users (e.g., router) request the exact
number they end up using. With the introduction of large adjacency
groups, this will change, as the router will have the ability to choose
from several allocation sizes, where larger allocations provide higher
accuracy with respect to requested weights and better resilience against
nexthop failures.

One option is to have the router try several allocations of descending
size until one succeeds, but a better way is to simply allow it to query
the actual allocation size and then size its request accordingly.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum: Better represent KVDL partitions
Ido Schimmel [Sun, 22 Oct 2017 21:11:44 +0000 (23:11 +0200)]
mlxsw: spectrum: Better represent KVDL partitions

The KVD linear (KVDL) allocator currently consists of a very large
bitmap that reflects the KVDL's usage. The boundaries of each partition
as well as their allocation size are represented using defines.

This representation requires us to patch all the functions that act on a
partition whenever the partitioning scheme is changed. In addition, it
does not enable the dynamic configuration of the KVDL using the
up-coming resource manager.

Add objects to represent these partitions as well as the accompanying
code that acts on them to perform allocations and de-allocations.

In the following patches, this will allow us to easily add another
partition as well as new operations to act on these partitions.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum_dpipe: Add adjacency group size
Ido Schimmel [Sun, 22 Oct 2017 21:11:43 +0000 (23:11 +0200)]
mlxsw: spectrum_dpipe: Add adjacency group size

The adjacency group size is part of the match on the adjacency group and
should therefore be exposed using dpipe.

When non-equal-cost multi-path support will be introduced, the group's
size will help users understand the exact number of adjacency entries
each nexthop occupies, as a nexthop will no longer correspond to a
single entry.

The output for a multi-path route with two nexthops, one with weight 255
and the second 1 will be:

Example:

$ devlink dpipe table dump pci/0000:01:00.0 name mlxsw_adj
pci/0000:01:00.0:
  index 0
  match_value:
    type field_exact header mlxsw_meta field adj_index value 65536
    type field_exact header mlxsw_meta field adj_size value 512
    type field_exact header mlxsw_meta field adj_hash_index value 0
  action_value:
    type field_modify header ethernet field destination mac value e4:1d:2d:a5:f3:64
    type field_modify header mlxsw_meta field erif_port mapping ifindex mapping_value 3 value 1

  index 1
  match_value:
    type field_exact header mlxsw_meta field adj_index value 65536
    type field_exact header mlxsw_meta field adj_size value 512
    type field_exact header mlxsw_meta field adj_hash_index value 510
  action_value:
    type field_modify header ethernet field destination mac value e4:1d:2d:a5:f3:65
    type field_modify header mlxsw_meta field erif_port mapping ifindex mapping_value 4 value 2

Thus, the first nexthop occupies 510 adjacency entries and the second 2,
which leads to a ratio of 255 to 1.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'bcm_sf2-Add-support-for-IPv6-CFP-rules'
David S. Miller [Mon, 23 Oct 2017 02:06:48 +0000 (03:06 +0100)]
Merge branch 'bcm_sf2-Add-support-for-IPv6-CFP-rules'

Florian Fainelli says:

====================
net: dsa: bcm_sf2: Add support for IPv6 CFP rules

This patch series adds support for matching IPv6 addresses to the existing CFP
support code. Because IPv6 addresses are four times bigger than IPv4, we can
fit them anymore in a single slice, so we need to chain two in order to have
a complete match. This makes us require a second bitmap tracking unique rules
so we don't over populate the TCAM.

Finally, because the code had to be re-organized, it became a lot easier to
support arbitrary prefix/mask lengths, so the last two patches do just that.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: dsa: bcm_sf2: Allow matching arbitrary IPv6 masks/lengths
Florian Fainelli [Fri, 20 Oct 2017 21:39:49 +0000 (14:39 -0700)]
net: dsa: bcm_sf2: Allow matching arbitrary IPv6 masks/lengths

There is no reason why we should limit ourselves to matching only
full IPv4 addresses (/32), the same logic applies between the DATA and
MASK ports, so just make it more configurable to accept both.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: dsa: bcm_sf2: Allow matching arbitrary IPv4 mask lengths
Florian Fainelli [Fri, 20 Oct 2017 21:39:48 +0000 (14:39 -0700)]
net: dsa: bcm_sf2: Allow matching arbitrary IPv4 mask lengths

There is no reason why we should limit ourselves to matching only full
IPv4 addresses (/32), the same logic applies between the DATA and MASK
ports, so just make it more configurable to accept both.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: dsa: bcm_sf2: Add support for IPv6 CFP rules
Florian Fainelli [Fri, 20 Oct 2017 21:39:47 +0000 (14:39 -0700)]
net: dsa: bcm_sf2: Add support for IPv6 CFP rules

Inserting IPv6 CFP rules complicates the code a little bit in that we
need to insert two rules side by side and chain them to match a full
IPv6 tuple (src, dst IPv6 + port + protocol).

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: dsa: bcm_sf2: Simplify bcm_sf2_cfp_rule_get_all()
Florian Fainelli [Fri, 20 Oct 2017 21:39:46 +0000 (14:39 -0700)]
net: dsa: bcm_sf2: Simplify bcm_sf2_cfp_rule_get_all()

There is no need to do a HW search of the TCAMs which is something slow
and expensive. Since we already maintain a bitmask of active CFP rules,
just iterate over those, starting from bit 1 (after the reserved entry)
to get a count and index position to store the rule later on.

As a result we can remove the code in bcm_sf2_cfp_rule_get() which acted
on the "search" argument, and remove that argument.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: dsa: bcm_sf2: Make UDF slices more configurable
Florian Fainelli [Fri, 20 Oct 2017 21:39:45 +0000 (14:39 -0700)]
net: dsa: bcm_sf2: Make UDF slices more configurable

In preparation for introducing IPv6 rules support, make the
cfp_udf_layout more flexible and match more accurately how the HW is
designed: we have 3 + 1 slices per protocol, but we may not be using all
of them and we are relative to a particular base offset (slice A for
IPv4 for instance). Also populate the slice number that should be used
(slice 1 for IPv4) based on the lookup function.

Finally, we introduce two helper functions: udf_upper_bits() and
udf_lower_bits() to help setting the UDF_n_* valid bits based on the
number of UDFs valid within a slice. Update the IPv4 rule setting to
make use of it to be more robust wrt. change in number of User Defined
Fields being programmed.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: dsa: bcm_sf2: Move IPv4 CFP processing to specific functions
Florian Fainelli [Fri, 20 Oct 2017 21:39:44 +0000 (14:39 -0700)]
net: dsa: bcm_sf2: Move IPv4 CFP processing to specific functions

Move the processing of IPv4 rules into specific functions, allowing us
to clearly identify which parts are generic and which ones are not. Also
create a specific function to insert a rule into the action and policer
RAMs as those tend to be fairly generic.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: dsa: bcm_sf2: Use existing shift/masks
Florian Fainelli [Fri, 20 Oct 2017 21:39:43 +0000 (14:39 -0700)]
net: dsa: bcm_sf2: Use existing shift/masks

Instead of open coding the shift for the IP protocol, IP fragment bit
etc. define and/or use existing constants to that end.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoisdn/gigaset: Provide cardstate context for bas timer callbacks
Kees Cook [Fri, 20 Oct 2017 20:47:08 +0000 (13:47 -0700)]
isdn/gigaset: Provide cardstate context for bas timer callbacks

While the work callback uses the urb to find cardstate from bas_cardstate,
this may not be valid for timer callbacks. Instead, introduce a direct
pointer back to the cardstate from bas_cardstate for use in timer
callbacks.

Reported-by: Paul Bolle <pebolle@tiscali.nl>
Fixes: 4cfea08e6251 ("isdn/gigaset: Convert timers to use timer_setup()")
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Johan Hovold <johan@kernel.org>
Cc: gigaset307x-common@lists.sourceforge.net
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoselftests/bpf: fix broken build of test_maps
Alexei Starovoitov [Sun, 22 Oct 2017 17:29:06 +0000 (10:29 -0700)]
selftests/bpf: fix broken build of test_maps

fix multiple build errors and warnings

1.
test_maps.c: In function ‘test_map_rdonly’:
test_maps.c:1051:30: error: ‘BPF_F_RDONLY’ undeclared (first use in this function)
        MAP_SIZE, map_flags | BPF_F_RDONLY);

2.
test_maps.c:1048:6: warning: unused variable ‘i’ [-Wunused-variable]
  int i, fd, key = 0, value = 0;

3.
test_maps.c:1087:2: error: called object is not a function or function pointer
  assert(bpf_map_lookup_elem(fd, &key, &value) == -1 && errno == EPERM);

4.
./bpf_helpers.h:72:11: error: use of undeclared identifier 'BPF_FUNC_getsockopt'
        (void *) BPF_FUNC_getsockopt;

Fixes: e043325b3087 ("bpf: Add tests for eBPF file mode")
Fixes: 6e71b04a8224 ("bpf: Add file mode configuration into bpf maps")
Fixes: cd86d1fd2102 ("bpf: Adding helper function bpf_getsockops")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
David S. Miller [Sun, 22 Oct 2017 12:36:53 +0000 (13:36 +0100)]
Merge git://git./linux/kernel/git/davem/net

There were quite a few overlapping sets of changes here.

Daniel's bug fix for off-by-ones in the new BPF branch instructions,
along with the added allowances for "data_end > ptr + x" forms
collided with the metadata additions.

Along with those three changes came veritifer test cases, which in
their final form I tried to group together properly.  If I had just
trimmed GIT's conflict tags as-is, this would have split up the
meta tests unnecessarily.

In the socketmap code, a set of preemption disabling changes
overlapped with the rename of bpf_compute_data_end() to
bpf_compute_data_pointers().

Changes were made to the mv88e6060.c driver set addr method
which got removed in net-next.

The hyperv transport socket layer had a locking change in 'net'
which overlapped with a change of socket state macro usage
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Sun, 22 Oct 2017 02:44:48 +0000 (22:44 -0400)]
Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:
 "A little more than usual this time around. Been travelling, so that is
  part of it.

  Anyways, here are the highlights:

   1) Deal with memcontrol races wrt. listener dismantle, from Eric
      Dumazet.

   2) Handle page allocation failures properly in nfp driver, from Jaku
      Kicinski.

   3) Fix memory leaks in macsec, from Sabrina Dubroca.

   4) Fix crashes in pppol2tp_session_ioctl(), from Guillaume Nault.

   5) Several fixes in bnxt_en driver, including preventing potential
      NVRAM parameter corruption from Michael Chan.

   6) Fix for KRACK attacks in wireless, from Johannes Berg.

   7) rtnetlink event generation fixes from Xin Long.

   8) Deadlock in mlxsw driver, from Ido Schimmel.

   9) Disallow arithmetic operations on context pointers in bpf, from
      Jakub Kicinski.

  10) Missing sock_owned_by_user() check in sctp_icmp_redirect(), from
      Xin Long.

  11) Only TCP is supported for sockmap, make that explicit with a
      check, from John Fastabend.

  12) Fix IP options state races in DCCP and TCP, from Eric Dumazet.

  13) Fix panic in packet_getsockopt(), also from Eric Dumazet.

  14) Add missing locked in hv_sock layer, from Dexuan Cui.

  15) Various aquantia bug fixes, including several statistics handling
      cures. From Igor Russkikh et al.

  16) Fix arithmetic overflow in devmap code, from John Fastabend.

  17) Fix busted socket memory accounting when we get a fault in the tcp
      zero copy paths. From Willem de Bruijn.

  18) Don't leave opt->tot_len uninitialized in ipv6, from Eric Dumazet"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
  stmmac: Don't access tx_q->dirty_tx before netif_tx_lock
  ipv6: flowlabel: do not leave opt->tot_len with garbage
  of_mdio: Fix broken PHY IRQ in case of probe deferral
  textsearch: fix typos in library helpers
  rxrpc: Don't release call mutex on error pointer
  net: stmmac: Prevent infinite loop in get_rx_timestamp_status()
  net: stmmac: Fix stmmac_get_rx_hwtstamp()
  net: stmmac: Add missing call to dev_kfree_skb()
  mlxsw: spectrum_router: Configure TIGCR on init
  mlxsw: reg: Add Tunneling IPinIP General Configuration Register
  net: ethtool: remove error check for legacy setting transceiver type
  soreuseport: fix initialization race
  net: bridge: fix returning of vlan range op errors
  sock: correct sk_wmem_queued accounting on efault in tcp zerocopy
  bpf: add test cases to bpf selftests to cover all access tests
  bpf: fix pattern matches for direct packet access
  bpf: fix off by one for range markings with L{T, E} patterns
  bpf: devmap fix arithmetic overflow in bitmap_size calculation
  net: aquantia: Bad udp rate on default interrupt coalescing
  net: aquantia: Enable coalescing management via ethtool interface
  ...

7 years agostmmac: Don't access tx_q->dirty_tx before netif_tx_lock
Bernd Edlinger [Sat, 21 Oct 2017 06:51:30 +0000 (06:51 +0000)]
stmmac: Don't access tx_q->dirty_tx before netif_tx_lock

This is the possible reason for different hard to reproduce
problems on my ARMv7-SMP test system.

The symptoms are in recent kernels imprecise external aborts,
and in older kernels various kinds of network stalls and
unexpected page allocation failures.

My testing indicates that the trouble started between v4.5 and v4.6
and prevails up to v4.14.

Using the dirty_tx before acquiring the spin lock is clearly
wrong and was first introduced with v4.6.

Fixes: e3ad57c96715 ("stmmac: review RX/TX ring management")

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoipv6: flowlabel: do not leave opt->tot_len with garbage
Eric Dumazet [Sat, 21 Oct 2017 19:26:23 +0000 (12:26 -0700)]
ipv6: flowlabel: do not leave opt->tot_len with garbage

When syzkaller team brought us a C repro for the crash [1] that
had been reported many times in the past, I finally could find
the root cause.

If FlowLabel info is merged by fl6_merge_options(), we leave
part of the opt_space storage provided by udp/raw/l2tp with random value
in opt_space.tot_len, unless a control message was provided at sendmsg()
time.

Then ip6_setup_cork() would use this random value to perform a kzalloc()
call. Undefined behavior and crashes.

Fix is to properly set tot_len in fl6_merge_options()

At the same time, we can also avoid consuming memory and cpu cycles
to clear it, if every option is copied via a kmemdup(). This is the
change in ip6_setup_cork().

[1]
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 6613 Comm: syz-executor0 Not tainted 4.14.0-rc4+ #127
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
task: ffff8801cb64a100 task.stack: ffff8801cc350000
RIP: 0010:ip6_setup_cork+0x274/0x15c0 net/ipv6/ip6_output.c:1168
RSP: 0018:ffff8801cc357550 EFLAGS: 00010203
RAX: dffffc0000000000 RBX: ffff8801cc357748 RCX: 0000000000000010
RDX: 0000000000000002 RSI: ffffffff842bd1d9 RDI: 0000000000000014
RBP: ffff8801cc357620 R08: ffff8801cb17f380 R09: ffff8801cc357b10
R10: ffff8801cb64a100 R11: 0000000000000000 R12: ffff8801cc357ab0
R13: ffff8801cc357b10 R14: 0000000000000000 R15: ffff8801c3bbf0c0
FS:  00007f9c5c459700(0000) GS:ffff8801db200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020324000 CR3: 00000001d1cf2000 CR4: 00000000001406f0
DR0: 0000000020001010 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Call Trace:
 ip6_make_skb+0x282/0x530 net/ipv6/ip6_output.c:1729
 udpv6_sendmsg+0x2769/0x3380 net/ipv6/udp.c:1340
 inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:762
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 SYSC_sendto+0x358/0x5a0 net/socket.c:1750
 SyS_sendto+0x40/0x50 net/socket.c:1718
 entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x4520a9
RSP: 002b:00007f9c5c458c08 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 00000000004520a9
RDX: 0000000000000001 RSI: 0000000020fd1000 RDI: 0000000000000016
RBP: 0000000000000086 R08: 0000000020e0afe4 R09: 000000000000001c
R10: 0000000000000000 R11: 0000000000000216 R12: 00000000004bb1ee
R13: 00000000ffffffff R14: 0000000000000016 R15: 0000000000000029
Code: e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 ea 0f 00 00 48 8d 79 04 48 b8 00 00 00 00 00 fc ff df 45 8b 74 24 04 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85
RIP: ip6_setup_cork+0x274/0x15c0 net/ipv6/ip6_output.c:1168 RSP: ffff8801cc357550

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoof_mdio: Fix broken PHY IRQ in case of probe deferral
Geert Uytterhoeven [Wed, 18 Oct 2017 11:54:03 +0000 (13:54 +0200)]
of_mdio: Fix broken PHY IRQ in case of probe deferral

If an Ethernet PHY is initialized before the interrupt controller it is
connected to, a message like the following is printed:

    irq: no irq domain found for /interrupt-controller@e61c0000 !

However, the actual error is ignored, leading to a non-functional (POLL)
PHY interrupt later:

    Micrel KSZ8041RNLI ee700000.ethernet-ffffffff:01: attached PHY driver [Micrel KSZ8041RNLI] (mii_bus:phy_addr=ee700000.ethernet-ffffffff:01, irq=POLL)

Depending on whether the PHY driver will fall back to polling, Ethernet
may or may not work.

To fix this:
  1. Switch of_mdiobus_register_phy() from irq_of_parse_and_map() to
     of_irq_get().
     Unlike the former, the latter returns -EPROBE_DEFER if the
     interrupt controller is not yet available, so this condition can be
     detected.
     Other errors are handled the same as before, i.e. use the passed
     mdio->irq[addr] as interrupt.
  2. Propagate and handle errors from of_mdiobus_register_phy() and
     of_mdiobus_register_device().

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotextsearch: fix typos in library helpers
Randy Dunlap [Fri, 20 Oct 2017 19:15:52 +0000 (12:15 -0700)]
textsearch: fix typos in library helpers

Fix spellos (typos) in textsearch library helpers.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'tun-timer-cleanups'
David S. Miller [Sun, 22 Oct 2017 02:13:20 +0000 (03:13 +0100)]
Merge branch 'tun-timer-cleanups'

Eric Dumazet says:

====================
tun: timer cleanups

While working on a syzkaller issue that might have been
fixed already by Cong Wang in commit 0ad646c81b21
("tun: call dev_get_valid_name() before register_netdevice()")
I made three small changes related to flow_gc_timer.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotun: do not arm flow_gc_timer in tun_flow_init()
Eric Dumazet [Fri, 20 Oct 2017 18:29:57 +0000 (11:29 -0700)]
tun: do not arm flow_gc_timer in tun_flow_init()

Timer is properly armed on demand from tun_flow_update(),
so there is no need to arm it at tun init.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotun: avoid extra timer schedule in tun_flow_cleanup()
Eric Dumazet [Fri, 20 Oct 2017 18:29:56 +0000 (11:29 -0700)]
tun: avoid extra timer schedule in tun_flow_cleanup()

If tun_flow_cleanup() deleted all flows, no need to
arm the timer again. It will be armed next time
tun_flow_update() is called.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotun: do not block BH again in tun_flow_cleanup()
Eric Dumazet [Fri, 20 Oct 2017 18:29:55 +0000 (11:29 -0700)]
tun: do not block BH again in tun_flow_cleanup()

tun_flow_cleanup() being a timer callback, it is already
running in BH context.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'bpf-BASE_RTT'
David S. Miller [Sun, 22 Oct 2017 02:12:06 +0000 (03:12 +0100)]
Merge branch 'bpf-BASE_RTT'

Lawrence Brakmo says:

====================
bpf: add support for BASE_RTT

This patch set adds the following functionality to socket_ops BPF
programs.
1) Add bpf helper function bpf_getsocketops. Currently only supports
   TCP_CONGESTION
2) Add BPF_SOCKET_OPS_BASE_RTT op to get the base RTT of the
   connection. In general, the base RTT indicates the threshold such
   that RTTs above it indicate congestion. More details in the
   relevant patches.

Consists of the following patches:

[PATCH net-next 1/5] bpf: add support for BPF_SOCK_OPS_BASE_RTT
[PATCH net-next 2/5] bpf: Adding helper function bpf_getsockops
[PATCH net-next 3/5] bpf: Add BPF_SOCKET_OPS_BASE_RTT support to
[PATCH net-next 4/5] bpf: sample BPF_SOCKET_OPS_BASE_RTT program
[PATCH net-next 5/5] bpf: create samples/bpf/tcp_bpf.readme
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agobpf: create samples/bpf/tcp_bpf.readme
Lawrence Brakmo [Fri, 20 Oct 2017 18:05:43 +0000 (11:05 -0700)]
bpf: create samples/bpf/tcp_bpf.readme

Readme file explaining how to create a cgroupv2 and attach one
of the tcp_*_kern.o socket_ops BPF program.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked_by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agobpf: sample BPF_SOCKET_OPS_BASE_RTT program
Lawrence Brakmo [Fri, 20 Oct 2017 18:05:42 +0000 (11:05 -0700)]
bpf: sample BPF_SOCKET_OPS_BASE_RTT program

Sample socket_ops BPF program to test the BPF helper function
bpf_getsocketops and the new socket_ops op BPF_SOCKET_OPS_BASE_RTT.

The program provides a base RTT of 80us when the calling flow is
within a DC (as determined by the IPV6 prefix) and the congestion
algorithm is "nv".

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked_by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agobpf: Add BPF_SOCKET_OPS_BASE_RTT support to tcp_nv
Lawrence Brakmo [Fri, 20 Oct 2017 18:05:41 +0000 (11:05 -0700)]
bpf: Add BPF_SOCKET_OPS_BASE_RTT support to tcp_nv

TCP_NV will try to get the base RTT from a socket_ops BPF program if one
is loaded. NV will then use the base RTT to bound its min RTT (its
notion of the base RTT). It uses the base RTT as an upper bound and 80%
of the base RTT as its lower bound.

In other words, NV will consider filtered RTTs larger than base RTT as a
sign of congestion. As a result, there is no minRTT inflation when there
is a lot of congestion. For example, in a DC where the RTTs are less
than 40us when there is no congestion, a base RTT value of 80us improves
the performance of NV. The difference between the uncongested RTT and
the base RTT provided represents how much queueing we are willing to
have (in practice it can be higher).

NV has been tunned to reduce congestion when there are many flows at the
cost of one flow not achieving full bandwith utilization. When a
reasonable base RTT is provided, one NV flow can now fully utilize the
full bandwidth. In addition, the performance is also improved when there
are many flows.

In the following examples the NV results are using a kernel with this
patch set (i.e. both NV results are using the new nv_loss_dec_factor).

With one host sending to another host and only one flow the
goodputs are:
  Cubic: 9.3 Gbps, NV: 5.5 Gbps, NV (baseRTT=80us): 9.2 Gbps

With 2 hosts sending to one host (1 flow per host, the goodput per flow
is:
  Cubic: 4.6 Gbps, NV: 4.5 Gbps, NV (baseRTT=80us)L 4.6 Gbps

But the RTTs seen by a ping process in the sender is:
  Cubic: 3.3ms  NV: 97us,  NV (baseRTT=80us): 146us

With a lot of flows things look even better for NV with baseRTT. Here we
have 3 hosts sending to one host. Each sending host has 6 flows: 1
stream, 4x1MB RPC, 1x10KB RPC. Cubic, NV and NV with baseRTT all fully
utilize the full available bandwidth. However, the distribution of
bandwidth among the flows is very different. For the 10KB RPC flow:
  Cubic: 27Mbps, NV: 111Mbps, NV (baseRTT=80us): 222Mbps

The 99% latencies for the 10KB flows are:
  Cubic: 26ms,  NV: 1ms,  NV (baseRTT=80us): 500us

The RTT seen by a ping process at the senders:
  Cubic: 3.2ms  NV: 720us,  NV (baseRTT=80us): 330us

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agobpf: Adding helper function bpf_getsockops
Lawrence Brakmo [Fri, 20 Oct 2017 18:05:40 +0000 (11:05 -0700)]
bpf: Adding helper function bpf_getsockops

Adding support for helper function bpf_getsockops to socket_ops BPF
programs. This patch only supports TCP_CONGESTION.

Signed-off-by: Vlad Vysotsky <vlad@cs.ucla.edu>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agobpf: add support for BPF_SOCK_OPS_BASE_RTT
Lawrence Brakmo [Fri, 20 Oct 2017 18:05:39 +0000 (11:05 -0700)]
bpf: add support for BPF_SOCK_OPS_BASE_RTT

A congestion control algorithm can make a call to the BPF socket_ops
program to request the base RTT. The base RTT can be congestion control
dependent and is meant to represent a congestion threshold such that
RTTs above it indicate congestion. This is especially useful for flows
within a DC where the base RTT is easy to obtain.

Being provided a base RTT solves a basic problem in RTT based congestion
avoidance algorithms (such as Vegas, NV and BBR). Although it is easy
to get the base RTT when the network is not congested, it is very
diffcult to do when it is very congested. Newer connections get an
inflated value of the base RTT leading to unfariness (newer flows with a
larger base RTT get more bandwidth). As a result, RTT based congestion
avoidance algorithms tend to update their base RTTs to improve fairness.
In very congested networks this can lead to base RTT inflation, reducing
the ability of these RTT based congestion control algorithms to prevent
congestion.

Note that in my experiments with TCP-NV, the base RTT provided can be
much larger than the actual hardware RTT. For example, experimenting
with hosts within a rack where the hardware RTT is 16-20us, I've used
base RTTs up to 150us. The effect of using a larger base RTT is that the
congestion avoidance algorithm will allow more queueing. When there are
only a few flows the main effect is larger measured RTTs and RPC
latencies due to the increased queueing. When there are a lot of flows,
a larger base RTT can lead to more congestion and more packet drops.
For this case, where the hardware RTT is 20us, a base RTT of 80us
produces good results.

This patch only introduces BPF_SOCK_OPS_BASE_RTT, a later patch in this
set adds support for using it in TCP-NV. Further study and testing is
needed before support can be added to other delay based congestion
avoidance algorithms.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonfp: use struct fields for 8 bit-wide access
Pieter Jansen van Vuuren [Fri, 20 Oct 2017 17:49:52 +0000 (19:49 +0200)]
nfp: use struct fields for 8 bit-wide access

Use direct access struct fields rather than PREP_FIELD()
macros to manipulate the jump ID and length, both of which
are exactly 8-bits wide. This simplifies the code somewhat.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: x25: mark expected switch fall-throughs
Gustavo A. R. Silva [Fri, 20 Oct 2017 17:37:52 +0000 (12:37 -0500)]
net: x25: mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: af_unix: mark expected switch fall-through
Gustavo A. R. Silva [Fri, 20 Oct 2017 17:05:30 +0000 (12:05 -0500)]
net: af_unix: mark expected switch fall-through

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agorxrpc: Don't release call mutex on error pointer
David Howells [Fri, 20 Oct 2017 16:01:22 +0000 (17:01 +0100)]
rxrpc: Don't release call mutex on error pointer

Don't release call mutex at the end of rxrpc_kernel_begin_call() if the
call pointer actually holds an error value.

Fixes: 540b1c48c37a ("rxrpc: Fix deadlock between call creation and sendmsg/recvmsg")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'stmmac-hw-tstamp-fixes'
David S. Miller [Sun, 22 Oct 2017 01:50:40 +0000 (02:50 +0100)]
Merge branch 'stmmac-hw-tstamp-fixes'

Jose Abreu says:

====================
net: stmmac: Fix HW timestamping

Three fixes for HW timestamping feature, all of them for RX side.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: stmmac: Prevent infinite loop in get_rx_timestamp_status()
Jose Abreu [Fri, 20 Oct 2017 13:37:36 +0000 (14:37 +0100)]
net: stmmac: Prevent infinite loop in get_rx_timestamp_status()

Prevent infinite loop by correctly setting the loop condition to
break when i == 10.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: stmmac: Fix stmmac_get_rx_hwtstamp()
Jose Abreu [Fri, 20 Oct 2017 13:37:35 +0000 (14:37 +0100)]
net: stmmac: Fix stmmac_get_rx_hwtstamp()

When using GMAC4 the valid timestamp is from CTX next desc but
we are passing the previous desc to get_rx_timestamp_status()
callback.

Fix this and while at it rework a little bit the function logic.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: stmmac: Add missing call to dev_kfree_skb()
Jose Abreu [Fri, 20 Oct 2017 13:37:34 +0000 (14:37 +0100)]
net: stmmac: Add missing call to dev_kfree_skb()

When RX HW timestamp is enabled and a frame is discarded we are
not freeing the skb but instead only setting to NULL the entry.

Add a call to dev_kfree_skb_any() so that skb entry is correctly
freed.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Linus Torvalds [Sun, 22 Oct 2017 01:46:39 +0000 (21:46 -0400)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input

Pull input fixes from Dmitry Torokhov:

 - joydev now implements a blacklist to avoid creating joystick nodes
   for accelerometers found in composite devices such as PlaStation
   controllers

 - assorted driver fixes

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: ims-psu - check if CDC union descriptor is sane
  Input: joydev - blacklist ds3/ds4/udraw motion sensors
  Input: allow matching device IDs on property bits
  Input: factor out and export input_device_id matching code
  Input: goodix - poll the 'buffer status' bit before reading data
  Input: axp20x-pek - fix module not auto-loading for axp221 pek
  Input: tca8418 - enable interrupt after it has been requested
  Input: stmfts - fix setting ABS_MT_POSITION_* maximum size
  Input: ti_am335x_tsc - fix incorrect step config for 5 wire touchscreen
  Input: synaptics - disable kernel tracking on SMBus devices

7 years agogeneve: Get rid of is_all_zero(), streamline is_tnl_info_zero()
Stefano Brivio [Fri, 20 Oct 2017 11:31:36 +0000 (13:31 +0200)]
geneve: Get rid of is_all_zero(), streamline is_tnl_info_zero()

No need to re-invent memchr_inv() with !is_all_zero(). While at
it, replace conditional and return clauses with a single return
clause in is_tnl_info_zero().

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'dsa-lan9303-Add-fdb-mdb-methods'
David S. Miller [Sun, 22 Oct 2017 01:41:30 +0000 (02:41 +0100)]
Merge branch 'dsa-lan9303-Add-fdb-mdb-methods'

Egil Hjelmeland says:

====================
net: dsa: lan9303: Add fdb/mdb methods

This series add support for accessing and managing the lan9303 ALR
(Address Logic Resolution).

The first patch add low level functions for accessing the ALR, along
with port_fast_age and port_fdb_dump methods.

The second patch add functions for managing ALR entires, along with
remaining fdb/mdb methods.

Note that to complete STP support, a special ALR entry with the STP eth
address must be added too. This must be addressed later.

Comments welcome!

Changes v2 -> v3:
 - Whitespace polishing. Removed some "section" comments.
 - Prefixed ALR constants with LAN9303_ for consistency.
 - Patch 2: lan9303_port_fast_age() wrap the "port" into a struct for passing
   as context to alr_loop_cb_del_port_learned. Safer in event of type change.
 - Patch 2: Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Changes v1 -> v2:
 - Patch 2: Removed question comment
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: dsa: lan9303: Add fdb/mdb manipulation
Egil Hjelmeland [Fri, 20 Oct 2017 10:19:10 +0000 (12:19 +0200)]
net: dsa: lan9303: Add fdb/mdb manipulation

Add functions for managing the lan9303 ALR (Address Logic
Resolution).

Implement DSA methods: port_fdb_add, port_fdb_del, port_mdb_prepare,
port_mdb_add and port_mdb_del.

Since the lan9303 do not offer reading specific ALR entry, the driver
caches all static entries - in a flat table.

Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: dsa: lan9303: Add port_fast_age and port_fdb_dump methods
Egil Hjelmeland [Fri, 20 Oct 2017 10:19:09 +0000 (12:19 +0200)]
net: dsa: lan9303: Add port_fast_age and port_fdb_dump methods

Add DSA method port_fast_age as a step to STP support.

Add low level functions for accessing the lan9303 ALR (Address Logic
Resolution).

Added DSA method port_fdb_dump

Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Sun, 22 Oct 2017 01:39:18 +0000 (21:39 -0400)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs

Pull vfs fixes from Al Viro:
 "MS_I_VERSION fixes - Mimi's fix + missing bits picked from Matthew
  (his patch contained a duplicate of the fs/namespace.c fix as well,
  but by that point the original fix had already been applied)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  Convert fs/*/* to SB_I_VERSION
  vfs: fix mounting a filesystem with i_version

7 years agotipc: refactor tipc_sk_timeout() function
Jon Maloy [Fri, 20 Oct 2017 09:21:32 +0000 (11:21 +0200)]
tipc: refactor tipc_sk_timeout() function

The function tipc_sk_timeout() is more complex than necessary, and
even seems to contain an undetected bug. At one of the occurences
where we renew the timer we just order it with (HZ / 20), instead
of (jiffies + HZ / 20);

In this commit we clean up the function.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'net-driver-refcont_t'
David S. Miller [Sun, 22 Oct 2017 01:22:40 +0000 (02:22 +0100)]
Merge branch 'net-driver-refcont_t'

Elena Reshetova says:

====================
networking drivers refcount_t conversions

Note: these are the last patches related to networking that perform
conversion of refcounters from atomic_t to refcount_t.
In contrast to the core network refcounter conversions that
were merged earlier, these are much more straightforward ones.

This series, for various networking drivers, replaces atomic_t reference
counters with the new refcount_t type and API (see include/linux/refcount.h).
By doing this we prevent intentional or accidental
underflows or overflows that can led to use-after-free vulnerabilities.

The patches are fully independent and can be cherry-picked separately.
Patches are based on top of net-next.
If there are no objections to the patches, please merge them via respective trees
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers, connector: convert cn_callback_entry.refcnt from atomic_t to refcount_t
Elena Reshetova [Fri, 20 Oct 2017 07:23:49 +0000 (10:23 +0300)]
drivers, connector: convert cn_callback_entry.refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable cn_callback_entry.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers, net, ppp: convert syncppp.refcnt from atomic_t to refcount_t
Elena Reshetova [Fri, 20 Oct 2017 07:23:48 +0000 (10:23 +0300)]
drivers, net, ppp: convert syncppp.refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable syncppp.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers, net, ppp: convert ppp_file.refcnt from atomic_t to refcount_t
Elena Reshetova [Fri, 20 Oct 2017 07:23:47 +0000 (10:23 +0300)]
drivers, net, ppp: convert ppp_file.refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable ppp_file.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers, net, ppp: convert asyncppp.refcnt from atomic_t to refcount_t
Elena Reshetova [Fri, 20 Oct 2017 07:23:46 +0000 (10:23 +0300)]
drivers, net, ppp: convert asyncppp.refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable asyncppp.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers, net: convert masces_tx_sa.refcnt from atomic_t to refcount_t
Elena Reshetova [Fri, 20 Oct 2017 07:23:45 +0000 (10:23 +0300)]
drivers, net: convert masces_tx_sa.refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable masces_tx_sa.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers, net: convert masces_rx_sc.refcnt from atomic_t to refcount_t
Elena Reshetova [Fri, 20 Oct 2017 07:23:44 +0000 (10:23 +0300)]
drivers, net: convert masces_rx_sc.refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable masces_rx_sc.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers, net: convert masces_rx_sa.refcnt from atomic_t to refcount_t
Elena Reshetova [Fri, 20 Oct 2017 07:23:43 +0000 (10:23 +0300)]
drivers, net: convert masces_rx_sa.refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable masces_rx_sa.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers, net, hamradio: convert sixpack.refcnt from atomic_t to refcount_t
Elena Reshetova [Fri, 20 Oct 2017 07:23:42 +0000 (10:23 +0300)]
drivers, net, hamradio: convert sixpack.refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable sixpack.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers, net, mlx5: convert fs_node.refcount from atomic_t to refcount_t
Elena Reshetova [Fri, 20 Oct 2017 07:23:41 +0000 (10:23 +0300)]
drivers, net, mlx5: convert fs_node.refcount from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable fs_node.refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers, net, mlx5: convert mlx5_cq.refcount from atomic_t to refcount_t
Elena Reshetova [Fri, 20 Oct 2017 07:23:40 +0000 (10:23 +0300)]
drivers, net, mlx5: convert mlx5_cq.refcount from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable mlx5_cq.refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers, net, mlx4: convert mlx4_srq.refcount from atomic_t to refcount_t
Elena Reshetova [Fri, 20 Oct 2017 07:23:39 +0000 (10:23 +0300)]
drivers, net, mlx4: convert mlx4_srq.refcount from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable mlx4_srq.refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers, net, mlx4: convert mlx4_qp.refcount from atomic_t to refcount_t
Elena Reshetova [Fri, 20 Oct 2017 07:23:38 +0000 (10:23 +0300)]
drivers, net, mlx4: convert mlx4_qp.refcount from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable mlx4_qp.refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers, net, mlx4: convert mlx4_cq.refcount from atomic_t to refcount_t
Elena Reshetova [Fri, 20 Oct 2017 07:23:37 +0000 (10:23 +0300)]
drivers, net, mlx4: convert mlx4_cq.refcount from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable mlx4_cq.refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers, net, ethernet: convert mtk_eth.dma_refcnt from atomic_t to refcount_t
Elena Reshetova [Fri, 20 Oct 2017 07:23:36 +0000 (10:23 +0300)]
drivers, net, ethernet: convert mtk_eth.dma_refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable mtk_eth.dma_refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodrivers, net, ethernet: convert clip_entry.refcnt from atomic_t to refcount_t
Elena Reshetova [Fri, 20 Oct 2017 07:23:35 +0000 (10:23 +0300)]
drivers, net, ethernet: convert clip_entry.refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable clip_entry.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'mlxsw-fixes'
David S. Miller [Sun, 22 Oct 2017 01:19:03 +0000 (02:19 +0100)]
Merge branch 'mlxsw-fixes'

Jiri Pirko says:

====================
mlxsw: spectrum: Configure TTL of "inherit" for offloaded tunnels

Petr says:

Currently mlxsw only offloads tunnels that are configured with TTL of "inherit"
(which is the default). However, Spectrum defaults to 255 and the driver
neglects to change the configuration. Thus the tunnel packets from offloaded
tunnels always have TTL of 255, even though tunnels with explicit TTL of 255 are
never actually offloaded.

To fix this, introduce support for TIGCR, the register that keeps the related
bits of global tunnel configuration, and use it on first offload to properly
configure inheritance of TTL of tunnel packets from overlay packets.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: spectrum_router: Configure TIGCR on init
Petr Machata [Fri, 20 Oct 2017 07:16:16 +0000 (09:16 +0200)]
mlxsw: spectrum_router: Configure TIGCR on init

Spectrum tunnels do not default to ttl of "inherit" like the Linux ones
do. Configure TIGCR on router init so that the TTL of tunnel packets is
copied from the overlay packets.

Fixes: ee954d1a91b2 ("mlxsw: spectrum_router: Support GRE tunnels")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agomlxsw: reg: Add Tunneling IPinIP General Configuration Register
Petr Machata [Fri, 20 Oct 2017 07:16:15 +0000 (09:16 +0200)]
mlxsw: reg: Add Tunneling IPinIP General Configuration Register

The TIGCR register is used for setting up the IPinIP Tunnel
configuration.

Fixes: ee954d1a91b2 ("mlxsw: spectrum_router: Support GRE tunnels")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'hns3-loopback-selftest'
David S. Miller [Sun, 22 Oct 2017 01:16:26 +0000 (02:16 +0100)]
Merge branch 'hns3-loopback-selftest'

Yunsheng Lin says:

====================
Add mac loopback selftest support in hns3 driver

This patchset refactors the skb receiving and transmitting function
before adding mac loopback selftest support in hns3 driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: hns3: Add mac loopback selftest support in hns3 driver
Yunsheng Lin [Fri, 20 Oct 2017 02:19:22 +0000 (10:19 +0800)]
net: hns3: Add mac loopback selftest support in hns3 driver

This patch adds mac loopback selftest support for ethtool cmd
by checking if a transmitted packet can be received correctly
when mac loopback is enabled.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: hns3: Refactor the skb receiving and transmitting function
Yunsheng Lin [Fri, 20 Oct 2017 02:19:21 +0000 (10:19 +0800)]
net: hns3: Refactor the skb receiving and transmitting function

This patch refactors the skb receiving and transmitting functions
and export them in order to support the ethtool's mac loopback
selftest.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: ethtool: remove error check for legacy setting transceiver type
Niklas Söderlund [Thu, 19 Oct 2017 23:32:08 +0000 (01:32 +0200)]
net: ethtool: remove error check for legacy setting transceiver type

Commit 9cab88726929605 ("net: ethtool: Add back transceiver type")
restores the transceiver type to struct ethtool_link_settings and
convert_link_ksettings_to_legacy_settings() but forgets to remove the
error check for the same in convert_legacy_settings_to_link_ksettings().
This prevents older versions of ethtool to change link settings.

    # ethtool --version
    ethtool version 3.16

    # ethtool -s eth0 autoneg on speed 100 duplex full
    Cannot set new settings: Invalid argument
      not setting speed
      not setting duplex
      not setting autoneg

While newer versions of ethtool works.

    # ethtool --version
    ethtool version 4.10

    # ethtool -s eth0 autoneg on speed 100 duplex full
    [   57.703268] sh-eth ee700000.ethernet eth0: Link is Down
    [   59.618227] sh-eth ee700000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx

Fixes: 19cab88726929605 ("net: ethtool: Add back transceiver type")
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reported-by: Renjith R V <renjith.rv@quest-global.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'bpftool-add-a-version-command-and-fix-several-items'
David S. Miller [Sun, 22 Oct 2017 01:11:33 +0000 (02:11 +0100)]
Merge branch 'bpftool-add-a-version-command-and-fix-several-items'

Jakub Kicinski says:

====================
tools: bpftool: add a "version" command, and fix several items

Quentin says:

The first seven patches of this series bring several minor fixes to
bpftool. Please see individual commit logs for details.

Last patch adds a "version" commands to bpftool, which is in fact the
version of the kernel from which it was compiled.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotools: bpftool: add a command to display bpftool version
Quentin Monnet [Thu, 19 Oct 2017 22:46:26 +0000 (15:46 -0700)]
tools: bpftool: add a command to display bpftool version

This command can be used to print the version of the tool, which is in
fact the version from Linux taken from usr/include/linux/version.h.

Example usage:

    $ bpftool version
    bpftool v4.14.0

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotools: bpftool: show that `opcodes` or `file FILE` should be exclusive
Quentin Monnet [Thu, 19 Oct 2017 22:46:25 +0000 (15:46 -0700)]
tools: bpftool: show that `opcodes` or `file FILE` should be exclusive

For the `bpftool prog dump { jited | xlated } ...` command, adding
`opcodes` keyword (to request opcodes to be printed) will have no effect
if `file FILE` (to write binary output to FILE) is provided.

The manual page and the help message to be displayed in the terminal
should reflect that, and indicate that these options should be mutually
exclusive.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotools: bpftool: print all relevant byte opcodes for "load double word"
Quentin Monnet [Thu, 19 Oct 2017 22:46:24 +0000 (15:46 -0700)]
tools: bpftool: print all relevant byte opcodes for "load double word"

The eBPF instruction permitting to load double words (8 bytes) into a
register need 8-byte long "immediate" field, and thus occupy twice the
space of other instructions. bpftool was aware of this and would
increment the instruction counter only once on meeting such instruction,
but it would only print the first four bytes of the immediate value to
load. Make it able to dump the whole 16 byte-long double instruction
instead (as would `llvm-objdump -d <program>`).

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotools: bpftool: print only one error message on byte parsing failure
Quentin Monnet [Thu, 19 Oct 2017 22:46:23 +0000 (15:46 -0700)]
tools: bpftool: print only one error message on byte parsing failure

Make error messages more consistent. Specifically, when bpftool fails at
parsing map key bytes, make it print a single error message to stderr
and return from the function, instead of (always) printing a second
error message afterwards.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotools: bpftool: add `bpftool prog help` as real command i.r.t exit code
Quentin Monnet [Thu, 19 Oct 2017 22:46:22 +0000 (15:46 -0700)]
tools: bpftool: add `bpftool prog help` as real command i.r.t exit code

Make error messages and return codes more consistent. Specifically, make
`bpftool prog help` a real command, instead of printing usage by default
for a non-recognized "help" command. Output is the same, but this makes
bpftool return with a success value instead of an error.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>