platform/kernel/linux-starfive.git
4 years ago netfilter: nft_dynset: validate set expression definition
Pablo Neira Ayuso [Fri, 27 Mar 2020 16:43:05 +0000 (17:43 +0100)]
netfilter: nft_dynset: validate set expression definition

If the global set expression definition mismatches the dynset
expression, then bail out.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago netfilter: nft_set_bitmap: initialize set element extension in lookups
Pablo Neira Ayuso [Fri, 27 Mar 2020 16:43:04 +0000 (17:43 +0100)]
netfilter: nft_set_bitmap: initialize set element extension in lookups

Otherwise, nft_lookup might dereference an uninitialized pointer to the
element extension.

Fixes: 665153ff5752 ("netfilter: nf_tables: add bitmap set type")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
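
Illustration (not part of the patch): a minimal userspace sketch of why a
lookup should always initialize its extension out-parameter. The one-bit
bitmap layout, struct elem_ext and bitmap_lookup() are hypothetical
simplifications, not the nft_set_bitmap code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct elem_ext;	/* hypothetical opaque element extension */

/*
 * Simplified bitmap lookup: one bit per key. A bitmap carries no
 * per-element extension, so the out-parameter is always set to NULL
 * instead of being left uninitialized for the caller to dereference.
 */
static bool bitmap_lookup(const uint8_t *bitmap, unsigned int idx,
			  const struct elem_ext **ext)
{
	assert(bitmap != NULL && ext != NULL);

	*ext = NULL;
	return (bitmap[idx / 8] >> (idx % 8)) & 1u;
}
```

With this shape, a caller that inspects the extension after a successful
lookup sees NULL rather than stale stack contents.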
4 years ago netfilter: ctnetlink: be more strict when NF_CONNTRACK_MARK is not set
Romain Bellan [Fri, 27 Mar 2020 08:26:32 +0000 (09:26 +0100)]
netfilter: ctnetlink: be more strict when NF_CONNTRACK_MARK is not set

When CONFIG_NF_CONNTRACK_MARK is not set, CTA_MARK and CTA_MARK_MASK
in a netlink message are not supported. We should return an error when
either of them is set, not only when both are.

Fixes: 9306425b70bf ("netfilter: ctnetlink: must check mark attributes vs NULL")
Signed-off-by: Romain Bellan <romain.bellan@wifirst.fr>
Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
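
Illustration (not part of the patch): a minimal userspace sketch of the
stricter check. The attribute array layout, the helper name
check_mark_unsupported() and the -EOPNOTSUPP return value are assumptions
made for the example; the point is only that the test is a logical OR
rather than an AND.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the netlink attribute indices. */
enum { CTA_MARK, CTA_MARK_MASK, CTA_SKETCH_MAX };

/*
 * Models a build without CONFIG_NF_CONNTRACK_MARK: the request is
 * rejected as soon as either attribute is present.
 */
static int check_mark_unsupported(const void *const cda[])
{
	assert(cda != NULL);

	if (cda[CTA_MARK] || cda[CTA_MARK_MASK])
		return -EOPNOTSUPP;
	return 0;
}
```

With an AND in place of the OR, a message carrying only CTA_MARK would be
accepted silently even though the mark cannot be honoured.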
4 years ago netfilter: nf_queue: prefer nf_queue_entry_free
Florian Westphal [Fri, 27 Mar 2020 02:24:49 +0000 (03:24 +0100)]
netfilter: nf_queue: prefer nf_queue_entry_free

Instead of dropping refs+kfree, use the helper added in previous patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago netfilter: nf_queue: do not release refcounts until nf_reinject is done
Florian Westphal [Fri, 27 Mar 2020 02:24:48 +0000 (03:24 +0100)]
netfilter: nf_queue: do not release refcounts until nf_reinject is done

nf_queue is problematic when another NF_QUEUE invocation happens
from nf_reinject().

1. nf_queue is invoked, increments state->sk refcount.
2. skb is queued, waiting for verdict.
3. sk is closed/released.
4. verdict comes back, nf_reinject is called.
5. nf_reinject drops the reference -- refcount can now drop to 0

Instead of get_ref/release_ref pattern, we need to nest the get_ref calls:
    get_ref
       get_ref
       release_ref
    release_ref

So that when we invoke the next processing stage (another netfilter
or the okfn()), we hold at least one reference count on the
devices/socket.

After previous patch, it is now safe to put the entry even after okfn()
has potentially free'd the skb.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
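
Illustration (not part of the patch): a toy userspace model of the
ordering problem described above. struct obj, get_ref(), release_ref()
and the two reinject variants are hypothetical names; the model only
shows that releasing before the next stage runs lets the count hit zero
first, while nesting keeps one reference held throughout.

```c
#include <assert.h>

/* Toy refcounted object standing in for the socket or a device. */
struct obj {
	int refs;
	int freed;
};

static void get_ref(struct obj *o)
{
	o->refs++;
}

static void release_ref(struct obj *o)
{
	assert(o->refs > 0);
	if (--o->refs == 0)
		o->freed = 1;	/* last reference gone */
}

/* Next processing stage: queues again and takes its own reference. */
static int next_stage(struct obj *o)
{
	int alive;

	get_ref(o);		/* inner get_ref */
	alive = !o->freed;	/* was the object still valid here? */
	release_ref(o);		/* inner release_ref */
	return alive;
}

/* Unnested pattern: drop our reference, then run the next stage. */
static int reinject_unnested(struct obj *o)
{
	release_ref(o);		/* count may reach 0 here */
	return next_stage(o);	/* 0 -> 1 transition on a dead object */
}

/* Nested pattern: run the next stage while still holding our reference. */
static int reinject_nested(struct obj *o)
{
	int alive = next_stage(o);

	release_ref(o);		/* outer release_ref comes last */
	return alive;
}
```

Starting from a count of one (only the queue entry holds a reference),
reinject_unnested() reports that the next stage ran on a freed object,
while reinject_nested() does not.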
4 years ago netfilter: nf_queue: place bridge physports into queue_entry struct
Florian Westphal [Fri, 27 Mar 2020 02:24:47 +0000 (03:24 +0100)]
netfilter: nf_queue: place bridge physports into queue_entry struct

The refcount is done via entry->skb, which does work fine.
Major problem: When putting the refcount of the bridge ports, we
must always put the references while the skb is still around.

However, we will need to put the references after okfn() to avoid
a possible 1 -> 0 -> 1 refcount transition, so we cannot use the
skb pointer anymore.

Place the physports in the queue entry structure instead to allow
for refcounting changes in the next patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago netfilter: nf_queue: make nf_queue_entry_release_refs static
Florian Westphal [Fri, 27 Mar 2020 02:24:46 +0000 (03:24 +0100)]
netfilter: nf_queue: make nf_queue_entry_release_refs static

This is a preparation patch, no logical changes.
Move free_entry into core and rename it to something more sensible.

Will ease followup patches which will complicate the refcount handling.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago netfilter: flowtable: Use work entry per offload command
Paul Blakey [Fri, 27 Mar 2020 09:12:30 +0000 (12:12 +0300)]
netfilter: flowtable: Use work entry per offload command

To allow offload commands to execute in parallel, create workqueue
for flow table offload, and use a work entry per offload command.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago netfilter: flowtable: Use rw sem as flow block lock
Paul Blakey [Fri, 27 Mar 2020 09:12:29 +0000 (12:12 +0300)]
netfilter: flowtable: Use rw sem as flow block lock

Currently flow offload threads are synchronized by the flow block mutex.
Use rw lock instead to increase flow insertion (read) concurrency.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago netfilter: nf_tables: silence a RCU-list warning in nft_table_lookup()
Qian Cai [Wed, 25 Mar 2020 14:31:42 +0000 (10:31 -0400)]
netfilter: nf_tables: silence a RCU-list warning in nft_table_lookup()

It is safe to traverse &net->nft.tables with &net->nft.commit_mutex
held using list_for_each_entry_rcu(). Silence the PROVE_RCU_LIST false
positive:

WARNING: suspicious RCU usage
net/netfilter/nf_tables_api.c:523 RCU-list traversed in non-reader section!!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by iptables/1384:
 #0: ffffffff9745c4a8 (&net->nft.commit_mutex){+.+.}, at: nf_tables_valid_genid+0x25/0x60 [nf_tables]

Call Trace:
 dump_stack+0xa1/0xea
 lockdep_rcu_suspicious+0x103/0x10d
 nft_table_lookup.part.0+0x116/0x120 [nf_tables]
 nf_tables_newtable+0x12c/0x7d0 [nf_tables]
 nfnetlink_rcv_batch+0x559/0x1190 [nfnetlink]
 nfnetlink_rcv+0x1da/0x210 [nfnetlink]
 netlink_unicast+0x306/0x460
 netlink_sendmsg+0x44b/0x770
 ____sys_sendmsg+0x46b/0x4a0
 ___sys_sendmsg+0x138/0x1a0
 __sys_sendmsg+0xb6/0x130
 __x64_sys_sendmsg+0x48/0x50
 do_syscall_64+0x69/0xf4
 entry_SYSCALL_64_after_hwframe+0x49/0xb3

Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago netfilter: flowtable: Fix incorrect tc_setup_type type
wenxu [Mon, 23 Mar 2020 23:34:25 +0000 (07:34 +0800)]
netfilter: flowtable: Fix incorrect tc_setup_type type

The indirect block setup should use TC_SETUP_FT as the type instead of
TC_SETUP_BLOCK. Adjust existing users of the indirect flow block
infrastructure.

Fixes: b5140a36da78 ("netfilter: flowtable: add indr block setup support")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago netfilter: flowtable: add counter support
Pablo Neira Ayuso [Tue, 24 Mar 2020 11:50:02 +0000 (12:50 +0100)]
netfilter: flowtable: add counter support

Add a new flag to turn on flowtable counters which are stored in the
conntrack entry.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago netfilter: nf_tables: add enum nft_flowtable_flags to uapi
Pablo Neira Ayuso [Tue, 24 Mar 2020 11:23:57 +0000 (12:23 +0100)]
netfilter: nf_tables: add enum nft_flowtable_flags to uapi

Expose the NFT_FLOWTABLE_HW_OFFLOAD flag through uapi.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago netfilter: conntrack: export nf_ct_acct_update()
Pablo Neira Ayuso [Tue, 24 Mar 2020 11:34:33 +0000 (12:34 +0100)]
netfilter: conntrack: export nf_ct_acct_update()

This function allows you to update the conntrack counters.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago ipvs: optimize tunnel dumps for icmp errors
Haishuang Yan [Sun, 15 Mar 2020 13:25:41 +0000 (21:25 +0800)]
ipvs: optimize tunnel dumps for icmp errors

After stripping the GRE/UDP tunnel header for ICMP errors, it's better to
show "GRE/UDP" instead of "IPIP" in the debug message.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago netfilter: conntrack: Add missing annotations for nf_conntrack_all_lock() and nf_conn...
Jules Irenge [Wed, 11 Mar 2020 01:09:05 +0000 (01:09 +0000)]
netfilter: conntrack: Add missing annotations for nf_conntrack_all_lock() and nf_conntrack_all_unlock()

Sparse reports warnings at nf_conntrack_all_lock()
and nf_conntrack_all_unlock()

warning: context imbalance in nf_conntrack_all_lock()
- wrong count at exit
warning: context imbalance in nf_conntrack_all_unlock()
- unexpected unlock

Add the missing __acquires(&nf_conntrack_locks_all_lock)
Add missing __releases(&nf_conntrack_locks_all_lock)

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago netfilter: ctnetlink: Add missing annotation for ctnetlink_parse_nat_setup()
Jules Irenge [Wed, 11 Mar 2020 01:09:04 +0000 (01:09 +0000)]
netfilter: ctnetlink: Add missing annotation for ctnetlink_parse_nat_setup()

Sparse reports a warning at ctnetlink_parse_nat_setup()

warning: context imbalance in ctnetlink_parse_nat_setup()
- unexpected unlock

The root cause is the missing annotation at ctnetlink_parse_nat_setup()
Add the missing __must_hold(RCU) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago netfilter: flowtable: fix NULL pointer dereference in tunnel offload support
wenxu [Thu, 19 Mar 2020 04:52:45 +0000 (12:52 +0800)]
netfilter: flowtable: fix NULL pointer dereference in tunnel offload support

The tc ct action does not cache the route in the flowtable entry.

Fixes: 88bf6e4114d5 ("netfilter: flowtable: add tunnel encap/decap action offload support")
Fixes: cfab6dbd0ecf ("netfilter: flowtable: add tunnel match offload support")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago netfilter: nf_tables: add nft_set_elem_expr_destroy() and use it
Pablo Neira Ayuso [Wed, 18 Mar 2020 13:29:45 +0000 (14:29 +0100)]
netfilter: nf_tables: add nft_set_elem_expr_destroy() and use it

This patch adds nft_set_elem_expr_destroy() to destroy stateful
expressions in set elements.

This patch also updates the commit path to call this function to invoke
expr->ops->destroy_clone when required.

This is implicitly fixing up a module reference counter leak and
a memory leak in expressions that allocated internal state, e.g.
nft_counter.

Fixes: 409444522976 ("netfilter: nf_tables: add elements with stateful expressions")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago netfilter: nf_tables: fix double-free on set expression from the error path
Pablo Neira Ayuso [Wed, 18 Mar 2020 00:14:58 +0000 (01:14 +0100)]
netfilter: nf_tables: fix double-free on set expression from the error path

After copying the expression to the set element extension, release the
expression and reset the pointer to avoid a double-free from the error
path.

Fixes: 409444522976 ("netfilter: nf_tables: add elements with stateful expressions")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
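
Illustration (not part of the patch): a minimal userspace sketch of the
"release, then reset the pointer" idiom the message describes. struct
expr, struct elem, expr_release() and elem_init() are hypothetical names,
and the late failure is simulated with a flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct expr { int state; };
struct elem { struct expr expr; };	/* expression copied by value */

static int frees;			/* counts actual releases */

static void expr_release(struct expr *e)
{
	if (!e)
		return;			/* NULL is a no-op, like free() */
	frees++;
	free(e);
}

static int elem_init(struct elem *el, int value, int fail_late)
{
	struct expr *expr = malloc(sizeof(*expr));

	if (!expr)
		return -ENOMEM;
	expr->state = value;

	/* Copy into the element, then drop the temporary and reset the
	 * pointer so the shared error path cannot release it again. */
	memcpy(&el->expr, expr, sizeof(*expr));
	expr_release(expr);
	expr = NULL;

	if (fail_late)
		goto err;
	return 0;
err:
	expr_release(expr);		/* no-op: pointer was reset */
	return -EINVAL;
}
```

Without the `expr = NULL` line, the error path would release the same
allocation a second time.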
4 years ago netfilter: nf_tables: allow to specify stateful expression in set definition
Pablo Neira Ayuso [Tue, 17 Mar 2020 13:13:46 +0000 (14:13 +0100)]
netfilter: nf_tables: allow to specify stateful expression in set definition

This patch allows users to specify the stateful expression for the
elements in this set via NFTA_SET_EXPR. This new feature allows you to
turn on counters for all of the elements in this set.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago netfilter: nf_tables: pass context to nft_set_destroy()
Pablo Neira Ayuso [Tue, 17 Mar 2020 13:13:45 +0000 (14:13 +0100)]
netfilter: nf_tables: pass context to nft_set_destroy()

The patch that adds support for stateful expressions in set definitions
requires this.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago netfilter: nf_tables: move nft_expr_clone() to nf_tables_api.c
Pablo Neira Ayuso [Tue, 17 Mar 2020 13:13:44 +0000 (14:13 +0100)]
netfilter: nf_tables: move nft_expr_clone() to nf_tables_api.c

Move the nft_expr_clone() helper function to the core.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4 years ago Merge tag 'mlx5-updates-2020-03-17' of git://git.kernel.org/pub/scm/linux/kernel...
David S. Miller [Thu, 19 Mar 2020 02:13:37 +0000 (19:13 -0700)]
Merge tag 'mlx5-updates-2020-03-17' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2020-03-17

1) Compiler warnings and cleanup for the connection tracking series
2) Bug fixes for the connection tracking series
3) Fix devlink port register sequence
4) Last five patches in the series, by Eli Cohen
   Add the support for forwarding traffic between two eswitch uplink
   representors (Hairpin for eswitch), using mlx5 termination tables
   to change the direction of a packet in hw from RX to TX pipeline.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago net: phy: realtek: read actual speed to detect downshift
Heiner Kallweit [Wed, 18 Mar 2020 22:07:24 +0000 (23:07 +0100)]
net: phy: realtek: read actual speed to detect downshift

At least some integrated PHYs in RTL8168/RTL8125 chip versions support
downshift, and the actual link speed can be read from a vendor-specific
register. Info about this register was provided by Realtek.
More details about downshift configuration (e.g. number of attempts)
aren't available, therefore the downshift tunable is not implemented.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago net: sched: Fix hw_stats_type setting in pedit loop
Petr Machata [Wed, 18 Mar 2020 17:42:29 +0000 (19:42 +0200)]
net: sched: Fix hw_stats_type setting in pedit loop

In the commit referenced below, hw_stats_type of an entry is set for every
entry that corresponds to a pedit action. However, the assignment is only
done after the entry pointer is bumped, and therefore could overwrite
memory outside of the entries array.

The reason for this positioning may have been that the current entry's
hw_stats_type is already set above, before the action-type dispatch.
However, if there are no more actions, the assignment is wrong. And if
there are, the next round of the for_each_action loop will make the
assignment before the action-type dispatch anyway.

Therefore fix this issue by simply reordering the two lines.

Fixes: 74522e7baae2 ("net: sched: set the hw_stats_type in pedit loop")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
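
Illustration (not part of the patch): a simplified userspace sketch of
the ordering issue. struct entry and fill_pedit_entries() are
hypothetical; the real loop is more involved, but the essential point is
that the field is written through the entry pointer before the pointer
is advanced.

```c
#include <assert.h>
#include <stddef.h>

struct entry {
	int id;
	int hw_stats_type;
};

/*
 * Fill one entry per pedit key and return how many were written.
 * Assigning hw_stats_type before bumping the pointer keeps every write
 * inside the first n_keys slots of the array.
 */
static size_t fill_pedit_entries(struct entry *entries, size_t n_keys,
				 int hw_stats_type)
{
	struct entry *entry = entries;
	size_t k;

	assert(entries != NULL);

	for (k = 0; k < n_keys; k++) {
		entry->id = (int)k;
		entry->hw_stats_type = hw_stats_type;	/* set first */
		entry++;				/* then advance */
	}
	return (size_t)(entry - entries);
}
```

If the two marked lines were swapped, the last iteration would write one
slot past the entries that belong to this action.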
4 years ago Merge branch 'mlxsw-spectrum_cnt-Expose-counter-resources'
David S. Miller [Wed, 18 Mar 2020 23:46:20 +0000 (16:46 -0700)]
Merge branch 'mlxsw-spectrum_cnt-Expose-counter-resources'

Ido Schimmel says:

====================
mlxsw: spectrum_cnt: Expose counter resources

Jiri says:

Capacity and utilization of existing flow and RIF counters are currently
unavailable to be seen by the user. Use the existing devlink resources
API to expose the information:

$ sudo devlink resource show pci/0000:00:10.0 -v
pci/0000:00:10.0:
  name kvd resource_path /kvd size 524288 unit entry dpipe_tables none
  name span_agents resource_path /span_agents size 8 occ 0 unit entry dpipe_tables none
  name counters resource_path /counters size 79872 occ 44 unit entry dpipe_tables none
    resources:
      name flow resource_path /counters/flow size 61440 occ 4 unit entry dpipe_tables none
      name rif resource_path /counters/rif size 18432 occ 40 unit entry dpipe_tables none
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago selftests: mlxsw: Add tc action hw_stats tests
Jiri Pirko [Wed, 18 Mar 2020 13:48:57 +0000 (15:48 +0200)]
selftests: mlxsw: Add tc action hw_stats tests

Add tests for mlxsw hw_stats types.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago mlxsw: spectrum_cnt: Expose devlink resource occupancy for counters
Jiri Pirko [Wed, 18 Mar 2020 13:48:56 +0000 (15:48 +0200)]
mlxsw: spectrum_cnt: Expose devlink resource occupancy for counters

Implement occupancy counting for counters and expose over devlink
resource API.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago mlxsw: spectrum_cnt: Consolidate subpools initialization
Jiri Pirko [Wed, 18 Mar 2020 13:48:55 +0000 (15:48 +0200)]
mlxsw: spectrum_cnt: Consolidate subpools initialization

Put all init operations related to subpools into
mlxsw_sp_counter_sub_pools_init().

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago mlxsw: spectrum_cnt: Move config validation along with resource register
Jiri Pirko [Wed, 18 Mar 2020 13:48:54 +0000 (15:48 +0200)]
mlxsw: spectrum_cnt: Move config validation along with resource register

Move the validation of the subpools configuration to resource
registration, to avoid possible overcommitment. Add a WARN_ON to
indicate a bug in the code.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago mlxsw: spectrum_cnt: Expose subpool sizes over devlink resources
Jiri Pirko [Wed, 18 Mar 2020 13:48:53 +0000 (15:48 +0200)]
mlxsw: spectrum_cnt: Expose subpool sizes over devlink resources

Implement devlink resources support for counter pools. Move the subpool
sizes calculations into the new resources register function.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago mlxsw: spectrum_cnt: Add entry_size_res_id for each subpool and use it to query entry...
Jiri Pirko [Wed, 18 Mar 2020 13:48:52 +0000 (15:48 +0200)]
mlxsw: spectrum_cnt: Add entry_size_res_id for each subpool and use it to query entry size

Add new field to subpool struct that would indicate which
resource id should be used to query the entry size for
the subpool from the device.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago mlxsw: spectrum_cnt: Move sub_pools under per-instance pool struct
Jiri Pirko [Wed, 18 Mar 2020 13:48:51 +0000 (15:48 +0200)]
mlxsw: spectrum_cnt: Move sub_pools under per-instance pool struct

Currently, the global static array of subpools is used. Make it
per-instance as multiple instances of the mlxsw driver can have
different values.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago selftests: spectrum-2: Adjust tc_flower_scale limit according to current counter...
Jiri Pirko [Wed, 18 Mar 2020 13:48:50 +0000 (15:48 +0200)]
selftests: spectrum-2: Adjust tc_flower_scale limit according to current counter count

With the change that made the code query the counter bank size from the
device instead of using a hard-coded value, the number of available counters
changed for Spectrum-2. Adjust the limit in the selftests.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago mlxsw: spectrum_cnt: Query bank size from FW resources
Jiri Pirko [Wed, 18 Mar 2020 13:48:49 +0000 (15:48 +0200)]
mlxsw: spectrum_cnt: Query bank size from FW resources

The bank size is different between Spectrum versions. Also it is
a resource that can be queried. So instead of hard coding the value in
code, query it from the firmware.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago cxgb4: rework TC filter rule insertion across regions
Rahul Lakkireddy [Wed, 18 Mar 2020 10:54:51 +0000 (16:24 +0530)]
cxgb4: rework TC filter rule insertion across regions

Chelsio NICs have 3 filter regions, in following order of priority:
1. High Priority (HPFILTER) region (Highest Priority).
2. HASH region.
3. Normal FILTER region (Lowest Priority).

Currently, there's a 1-to-1 mapping between the prio value passed
by TC and the filter region index. However, it's possible to have
multiple TC rules with the same prio value. In this case, if a region
is exhausted, no attempt is made to try inserting the rule in the
next available region.

So, rework and remove the 1-to-1 mapping. Instead, dynamically select
the region to insert the filter rule, as long as the new rule's prio
value doesn't conflict with existing rules across all the 3 regions.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago netfilter: revert introduction of egress hook
Daniel Borkmann [Wed, 18 Mar 2020 09:33:22 +0000 (10:33 +0100)]
netfilter: revert introduction of egress hook

This reverts the following commits:

  8537f78647c0 ("netfilter: Introduce egress hook")
  5418d3881e1f ("netfilter: Generalize ingress hook")
  b030f194aed2 ("netfilter: Rename ingress hook include file")

From the discussion in [0], the author's main motivation to add a hook
in fast path is for an out of tree kernel module, which is a red flag
to begin with. Other mentioned potential use cases like NAT{64,46}
are future extensions w/o concrete code in the tree yet. Revert as
suggested [1] given the weak justification to add more hooks to critical
fast-path.

  [0] https://lore.kernel.org/netdev/cover.1583927267.git.lukas@wunner.de/
  [1] https://lore.kernel.org/netdev/20200318.011152.72770718915606186.davem@davemloft.net/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Miller <davem@davemloft.net>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Nacked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago Merge branch 's390-qeth-next'
David S. Miller [Wed, 18 Mar 2020 23:33:36 +0000 (16:33 -0700)]
Merge branch 's390-qeth-next'

Julian Wiedmann says:

====================
s390/qeth: updates 2020-03-18

please apply the following patch series for qeth to netdev's net-next
tree.

This consists of three parts:
1) support for __GFP_MEMALLOC,
2) several ethtool enhancements (.set_channels, SW Timestamping),
3) the usual cleanups.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago s390/qeth: use dev->reg_state
Julian Wiedmann [Wed, 18 Mar 2020 12:54:55 +0000 (13:54 +0100)]
s390/qeth: use dev->reg_state

To check whether a netdevice has already been registered, look at
NETREG_REGISTERED to replace some hacks I added a while ago.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago s390/qeth: remove gratuitous NULL checks
Julian Wiedmann [Wed, 18 Mar 2020 12:54:54 +0000 (13:54 +0100)]
s390/qeth: remove gratuitous NULL checks

qeth_do_ioctl() is only reached through our own net_device_ops, so we
can trust that dev->ml_priv still contains what we put there earlier.

qeth_bridgeport_an_set() is an internal function that doesn't require
such sanity checks.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago s390/qeth: add phys_to_virt() translation for AOB
Julian Wiedmann [Wed, 18 Mar 2020 12:54:53 +0000 (13:54 +0100)]
s390/qeth: add phys_to_virt() translation for AOB

Data addresses in the AOB are absolute, and need to be translated before
being fed into kmem_cache_free(). Currently this phys_to_virt() is a no-op.
Also see commit 2db01da8d25f ("s390/qdio: fill SBALEs with absolute addresses").

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago s390/qeth: don't report hard-coded driver version
Julian Wiedmann [Wed, 18 Mar 2020 12:54:52 +0000 (13:54 +0100)]
s390/qeth: don't report hard-coded driver version

Versions are meaningless for an in-kernel driver.
Instead use the UTS_RELEASE that is set by ethtool_get_drvinfo().

Cc: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago s390/qeth: add SW timestamping support for IQD devices
Julian Wiedmann [Wed, 18 Mar 2020 12:54:51 +0000 (13:54 +0100)]
s390/qeth: add SW timestamping support for IQD devices

This adds support for SOF_TIMESTAMPING_TX_SOFTWARE.
No support for non-IQD devices, since they orphan the skb in their xmit
path.

To play nice with TX bulking, set the timestamp when the buffer that
contains the skb(s) is actually flushed out to HW.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago s390/qeth: balance the TX queue selection for IQD devices
Julian Wiedmann [Wed, 18 Mar 2020 12:54:50 +0000 (13:54 +0100)]
s390/qeth: balance the TX queue selection for IQD devices

For ucast traffic, qeth_iqd_select_queue() falls back to
netdev_pick_tx(). This will potentially use skb_tx_hash() to distribute
the flow over all active TX queues - so txq 0 is a valid selection, and
qeth_iqd_select_queue() needs to check for this and put it on some other
queue. As a result, the distribution for ucast flows is unbalanced and
hits QETH_IQD_MIN_UCAST_TXQ more heavily than the other queues.

Open-coding a custom variant of skb_tx_hash() isn't an option, since
netdev_pick_tx() also gives us e.g. access to XPS. But we can pull a
little trick: add a single TC class that excludes the mcast txq, and
thus encourage skb_tx_hash() to not pick the mcast txq.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago s390/qeth: allow configuration of TX queues for IQD devices
Julian Wiedmann [Wed, 18 Mar 2020 12:54:49 +0000 (13:54 +0100)]
s390/qeth: allow configuration of TX queues for IQD devices

Similar to the support for z/VM NICs, but we need to take extra care
about the dedicated mcast queue:

1. netdev_pick_tx() is unaware of this limitation and might select the
   mcast txq. Catch this.
2. require at least _two_ TX queues - one for ucast, one for mcast.
3. when reducing the number of TX queues, there's a potential race
   where netdev_cap_txqueue() over-rules the selected txq index and
   falls back to index 0. This would place ucast traffic on the mcast
   queue, and result in TX errors.
   So for IQD, reject a reduction while the interface is running.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
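
Illustration (not part of the patch): a minimal userspace sketch of the
three rules listed above. The constant names, both helper functions and
the error codes are assumptions for the example; only the rules
themselves (mcast on index 0, at least two queues, no reduction while
running) come from the message.

```c
#include <assert.h>
#include <errno.h>

/* Assumed layout: index 0 is the dedicated mcast queue. */
enum { IQD_MCAST_TXQ = 0, IQD_MIN_UCAST_TXQ = 1 };

/* Rule 1: if the generic picker chose the mcast queue for ucast
 * traffic, move the packet to the first ucast queue instead. */
static unsigned int iqd_fixup_ucast_txq(unsigned int picked)
{
	return picked == IQD_MCAST_TXQ ? IQD_MIN_UCAST_TXQ : picked;
}

/* Rules 2 and 3: validate a requested TX queue count. */
static int iqd_check_tx_queue_count(unsigned int cur, unsigned int req,
				    int running)
{
	assert(cur >= 2);

	if (req < 2)
		return -EINVAL;	/* need one ucast and one mcast queue */
	if (running && req < cur)
		return -EBUSY;	/* no reduction while the interface is up */
	return 0;
}
```

Growing the queue count is allowed while the interface is running in
this sketch; only a reduction is refused.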
4 years ago s390/qeth: allow configuration of TX queues for z/VM NICs
Julian Wiedmann [Wed, 18 Mar 2020 12:54:48 +0000 (13:54 +0100)]
s390/qeth: allow configuration of TX queues for z/VM NICs

Add support for ETHTOOL_SCHANNELS to change the count of active
TX queues.

Since all TX queue structs are pre-allocated and -registered, we just
need to trivially adjust dev->real_num_tx_queues.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago s390/qeth: remove prio-queueing support for z/VM NICs
Julian Wiedmann [Wed, 18 Mar 2020 12:54:47 +0000 (13:54 +0100)]
s390/qeth: remove prio-queueing support for z/VM NICs

z/VM NICs don't offer HW QoS for TX rings. So just use netdev_pick_tx()
to distribute the connections equally over all enabled TX queues.

We start with just 1 enabled TX queue (this matches the typical
configuration without prio-queueing). A follow-on patch will allow users
to enable additional TX queues.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago s390/qeth: use memory reserves in TX slow path
Julian Wiedmann [Wed, 18 Mar 2020 12:54:46 +0000 (13:54 +0100)]
s390/qeth: use memory reserves in TX slow path

When falling back to an allocation from the HW header cache, check if
the skb is eligible for using memory reserves.
This only makes a difference if the cache is empty and needs to be
refilled.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago s390/qeth: use memory reserves to back RX buffers
Julian Wiedmann [Wed, 18 Mar 2020 12:54:45 +0000 (13:54 +0100)]
s390/qeth: use memory reserves to back RX buffers

Use dev_alloc_page() for backing the RX buffers with pages. This way we
pick up __GFP_MEMALLOC.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
David S. Miller [Wed, 18 Mar 2020 06:51:31 +0000 (23:51 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Use nf_flow_offload_tuple() to fetch flow stats, from Paul Blakey.

2) Add new xt_IDLETIMER hard mode, from Manoj Basapathi.
   Follow up patch to clean up this new mode, from Dan Carpenter.

3) Add support for geneve tunnel options, from Xin Long.

4) Make sets built-in and remove modular infrastructure for sets,
   from Florian Westphal.

5) Remove unused TEMPLATE_NULLS_VAL, from Li RongQing.

6) Statify nft_pipapo_get, from Chen Wandun.

7) Use C99 flexible-array member, from Gustavo A. R. Silva.

8) More descriptive variable names for bitwise, from Jeremy Sowden.

9) Four patches to add tunnel device hardware offload to the flowtable
   infrastructure, from wenxu.

10) pipapo set support for 8-bit grouping, from Stefano Brivio.

11) pipapo can switch between nibble and byte grouping, also from
    Stefano.

12) Add AVX2 vectorized version of pipapo, from Stefano Brivio.

13) Update pipapo to be used for single ranges, from Stefano.

14) Add stateful expression support to elements via control plane,
    e.g. counter per element.

15) Re-visit sysctls in unprivileged namespaces, from Florian Westphal.

16) Add new egress hook, from Lukas Wunner.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago mptcp: move msk state update to subflow_syn_recv_sock()
Paolo Abeni [Tue, 17 Mar 2020 14:53:34 +0000 (15:53 +0100)]
mptcp: move msk state update to subflow_syn_recv_sock()

After commit 58b09919626b ("mptcp: create msk early"), the
msk socket is already available at subflow_syn_recv_sock()
time. Let's move the state update there, to mirror more
closely the first subflow state.

The above will also help multiple subflow support.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago Merge branch 'net-add-phylink-support-for-PCS'
David S. Miller [Wed, 18 Mar 2020 05:51:16 +0000 (22:51 -0700)]
Merge branch 'net-add-phylink-support-for-PCS'

Russell King says:

====================
net: add phylink support for PCS

This series adds support for IEEE 802.3 register set compliant PCS
for phylink.  In order to do this, we:

1. convert BUG_ON() in existing accessors to WARN_ON_ONCE() and return
   an error.
2. add accessors for modifying a MDIO device register, and use them in
   phylib, rather than duplicating the code from phylib.
3. add support for decoding the advertisement from clause 22 compatible
   register sets for clause 37 advertisements and SGMII advertisements.
4. add support for clause 45 register sets for 10GBASE-R PCS.

These have been tested on the LX2160A Clearfog-CX platform.

v2: eliminate use of BUG_ON() in the accessors.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: phylink: pcs: add 802.3 clause 45 helpers
Russell King [Tue, 17 Mar 2020 14:52:41 +0000 (14:52 +0000)]
net: phylink: pcs: add 802.3 clause 45 helpers

Implement helpers for PCS accessed via the MII bus using 802.3 clause
45 cycles for 10GBASE-R. Only link up/down is supported, 10G full
duplex is assumed.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: phylink: pcs: add 802.3 clause 22 helpers
Russell King [Tue, 17 Mar 2020 14:52:36 +0000 (14:52 +0000)]
net: phylink: pcs: add 802.3 clause 22 helpers

Implement helpers for PCS accessed via the MII bus using 802.3 clause
22 cycles, conforming to 802.3 clause 37 and Cisco SGMII specifications
for the advertisement word.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: mdiobus: add APIs for modifying a MDIO device register
Russell King [Tue, 17 Mar 2020 14:52:31 +0000 (14:52 +0000)]
net: mdiobus: add APIs for modifying a MDIO device register

Add APIs for modifying a MDIO device register, similar to the existing
phy_modify() group of functions, but at mdiobus level instead.  Adapt
__phy_modify_changed() to use the new mdiobus level helper.
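
The read-modify-write pattern behind such helpers can be sketched in
user-space C. This is an illustration only: the register array fake_regs
and the function reg_modify_changed() are hypothetical stand-ins, not the
kernel's mdiobus API, and no bus locking is shown.

```c
#include <stdint.h>

/* Hypothetical stand-in for one device's 32 16-bit registers. */
static uint16_t fake_regs[32];

/*
 * Clear the bits in 'mask', set the bits in 'set', and write the register
 * back only if its value actually changed.
 * Returns 1 if the register was written, 0 if unchanged, -1 on a bad regnum.
 */
static int reg_modify_changed(unsigned int regnum, uint16_t mask, uint16_t set)
{
	uint16_t old, val;

	if (regnum >= 32)
		return -1;

	old = fake_regs[regnum];
	val = (uint16_t)((old & ~mask) | set);
	if (val == old)
		return 0;

	fake_regs[regnum] = val;
	return 1;
}
```

Reporting whether the value changed lets a caller skip follow-up work
(for example restarting a negotiation) when the write was a no-op.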

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: mdiobus: avoid BUG_ON() in mdiobus accessors
Russell King [Tue, 17 Mar 2020 14:52:26 +0000 (14:52 +0000)]
net: mdiobus: avoid BUG_ON() in mdiobus accessors

Avoid using BUG_ON() in the mdiobus accessors, preferring instead to use
WARN_ON_ONCE() and returning an error.
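
A minimal user-space sketch of the "warn once and return an error"
pattern follows. warn_once() and bus_read() are hypothetical stand-ins
written for this illustration, and the -EINVAL return value and the
placeholder read result are assumptions, not taken from this patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for WARN_ON_ONCE(): report the first hit only. */
static bool warn_once(bool cond, bool *warned, const char *what)
{
	if (cond && !*warned) {
		*warned = true;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/* Hypothetical accessor that must not run in atomic context. */
static int bus_read(bool in_atomic_context)
{
	static bool warned;

	/* A BUG_ON() here would halt; instead warn once and fail the call. */
	if (warn_once(in_atomic_context, &warned,
		      "bus access from atomic context"))
		return -EINVAL;

	return 0x1234; /* placeholder for the value read from the bus */
}
```

The caller sees an ordinary error code and can recover, while the log
still records the misuse the first time it happens.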

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'net-bridge-vlan-options-add-support-for-tunnel-mapping'
David S. Miller [Wed, 18 Mar 2020 05:47:13 +0000 (22:47 -0700)]
Merge branch 'net-bridge-vlan-options-add-support-for-tunnel-mapping'

Nikolay Aleksandrov says:

====================
net: bridge: vlan options: add support for tunnel mapping

In order to bring the new vlan API on par with the old one and be able
to completely migrate to the new one we need to support vlan tunnel mapping
and statistics. This patch-set takes care of the former by making it a
vlan option. There are two notable issues to deal with:
 - vlan range to tunnel range mapping
   * The tunnel ids are globally unique for the vlan code and a vlan can
     be mapped to one tunnel, so the old API took care of ranges by
     taking the starting tunnel id value and incrementally mapping
     vlan id(i) -> tunnel id(i). This set takes the same approach and
     uses one new attribute - BRIDGE_VLANDB_ENTRY_TUNNEL_ID. If used
     with a vlan range then it's the starting tunnel id to map.

 - tunnel mapping removal
   * Since there are no reserved/special tunnel ids defined, we can't
     encode mapping removal within the new attribute, in order to be
     able to remove a mapping we add a vlan flag which makes the new
     tunnel option remove the mapping

The rest is pretty straight-forward, in fact we directly re-use the old
code for manipulating tunnels by just mapping the command (set/del). In
order to be able to keep detecting vlan ranges we check that the current
vlan has a tunnel and it's extending the current vlan range end's tunnel
id.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: bridge: vlan options: add support for tunnel mapping set/del
Nikolay Aleksandrov [Tue, 17 Mar 2020 12:08:36 +0000 (14:08 +0200)]
net: bridge: vlan options: add support for tunnel mapping set/del

This patch adds support for manipulating vlan/tunnel mappings. The
tunnel ids are globally unique and are one per-vlan. There were two
trickier issues - first in order to support vlan ranges we have to
compute the current tunnel id in the following way:
 - base tunnel id (attr) + current vlan id - starting vlan id
This is in line with how the old API does vlan/tunnel mapping with ranges. We
already have the vlan range present, so it's redundant to add another
attribute for the tunnel range end. It's simply base tunnel id + vlan
range. And second to support removing mappings we need an out-of-band way
to tell the option manipulating function because there are no
special/reserved tunnel id values, so we use a vlan flag to denote the
operation is tunnel mapping removal.
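
The range computation above can be written out as a small stand-alone
function. The name vlan_range_tunnel_id() and its parameter names are
hypothetical, chosen for this sketch, which assumes vid_curr >= vid_start.

```c
#include <stdint.h>

/*
 * Tunnel id for one vlan inside a range:
 *   base tunnel id (attr) + current vlan id - starting vlan id
 */
static uint32_t vlan_range_tunnel_id(uint32_t base_tunnel_id,
				     uint16_t vid_start, uint16_t vid_curr)
{
	return base_tunnel_id + (uint32_t)(vid_curr - vid_start);
}
```

For a vlan range 10-13 with base tunnel id 1000 this maps vlan 10 to
tunnel 1000 and vlan 13 to tunnel 1003, so no separate attribute for the
tunnel range end is needed.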

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: bridge: vlan options: add support for tunnel id dumping
Nikolay Aleksandrov [Tue, 17 Mar 2020 12:08:35 +0000 (14:08 +0200)]
net: bridge: vlan options: add support for tunnel id dumping

Add a new option - BRIDGE_VLANDB_ENTRY_TUNNEL_ID which is used to dump
the tunnel id mapping. Since they're unique per vlan they can enter a
vlan range if they're consecutive, thus we can calculate the tunnel id
range map simply as: vlan range end id - vlan range start id. The
starting point is the tunnel id in BRIDGE_VLANDB_ENTRY_TUNNEL_ID. This
is similar to how the tunnel entries can be created in a range via the
old API (a vlan range maps to a tunnel range).
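
As a rough sketch of the "consecutive tunnel ids can enter a vlan range"
rule, assuming a simplified vlan entry: struct vlan_entry and
tunnel_extends_range() are hypothetical names for this illustration and
do not mirror the bridge code's actual structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a vlan entry. */
struct vlan_entry {
	uint16_t vid;
	bool has_tunnel;
	uint32_t tunnel_id;
};

/* Can v_curr extend the range that currently ends at range_end? */
static bool tunnel_extends_range(const struct vlan_entry *v_curr,
				 const struct vlan_entry *range_end)
{
	/* Only the next vlan id can extend the range. */
	if (v_curr->vid != range_end->vid + 1)
		return false;
	/* Both entries must agree on whether a tunnel is mapped. */
	if (v_curr->has_tunnel != range_end->has_tunnel)
		return false;
	if (!v_curr->has_tunnel)
		return true;
	/* With tunnels, the ids must be consecutive as well. */
	return v_curr->tunnel_id == range_end->tunnel_id + 1;
}
```

When every entry in a range passes this check, the dump only needs the
starting tunnel id; the rest follows from the vlan range length.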

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: bridge: vlan tunnel: constify bridge and port arguments
Nikolay Aleksandrov [Tue, 17 Mar 2020 12:08:34 +0000 (14:08 +0200)]
net: bridge: vlan tunnel: constify bridge and port arguments

The vlan tunnel code changes vlan options, it shouldn't touch port or
bridge options so we can constify the port argument. This would later help
us to re-use these functions from the vlan options code.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: bridge: vlan options: rename br_vlan_opts_eq to br_vlan_opts_eq_range
Nikolay Aleksandrov [Tue, 17 Mar 2020 12:08:33 +0000 (14:08 +0200)]
net: bridge: vlan options: rename br_vlan_opts_eq to br_vlan_opts_eq_range

It is a more appropriate name as it shows the intent of why we need to
check the options' state. It also allows us to give meaning to the two
arguments of the function: the first is the current vlan (v_curr) being
checked if it could enter the range ending in the second one (range_end).

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'stmmac-100GB-Enterprise-MAC-support'
David S. Miller [Wed, 18 Mar 2020 04:37:25 +0000 (21:37 -0700)]
Merge branch 'stmmac-100GB-Enterprise-MAC-support'

Jose Abreu says:

====================
net: stmmac: 100GB Enterprise MAC support

Adds the support for Enterprise MAC IP version which allows operating
speeds up to 100GB.

Patch 1/4, adds the support in XPCS for XLGMII interface that is used in
this kind of Enterprise MAC IPs.

Patch 2/4, adds the XLGMII interface support in stmmac.

Patch 3/4, adds the HW specific support for Enterprise MAC.

We end in patch 4/4, by updating stmmac documentation to mention the
support for this new IP version.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoDocumentation: networking: stmmac: Mention new XLGMAC support
Jose Abreu [Tue, 17 Mar 2020 09:18:53 +0000 (10:18 +0100)]
Documentation: networking: stmmac: Mention new XLGMAC support

Add the Enterprise MAC support to the list of supported IP versions and
the newly added XLGMII interface support.

Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: stmmac: Add support for Enterprise MAC version
Jose Abreu [Tue, 17 Mar 2020 09:18:52 +0000 (10:18 +0100)]
net: stmmac: Add support for Enterprise MAC version

Adds the support for Enterprise MAC IP version which is very similar to
XGMAC. It's so similar that we just need to check the device id and add
new speeds definitions and some minor callbacks.

Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: stmmac: Add XLGMII support
Jose Abreu [Tue, 17 Mar 2020 09:18:51 +0000 (10:18 +0100)]
net: stmmac: Add XLGMII support

Add XLGMII support for stmmac including the list of speeds and defines
for them.

Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: phy: xpcs: Add XLGMII support
Jose Abreu [Tue, 17 Mar 2020 09:18:50 +0000 (10:18 +0100)]
net: phy: xpcs: Add XLGMII support

Add XLGMII support for XPCS. This does not include Autoneg feature.

Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'ionic-bits-and-bytes'
David S. Miller [Wed, 18 Mar 2020 04:18:25 +0000 (21:18 -0700)]
Merge branch 'ionic-bits-and-bytes'

Shannon Nelson says:

====================
ionic bits and bytes

These are a few little updates to the ionic driver while we are in between
other feature work.  While these are mostly Fixes, they are almost all low
priority and needn't be promoted to net.  The one higher need is patch 1,
but it is fixing something that hasn't made it out of net-next yet.

v3: allow decode of unknown transceiver and use type
    codes from sfp.h
v2: add Fixes tags to patches 1-4, and a little
    description for patch 5
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoionic: add decode for IONIC_RC_ENOSUPP
Shannon Nelson [Tue, 17 Mar 2020 03:22:10 +0000 (20:22 -0700)]
ionic: add decode for IONIC_RC_ENOSUPP

Add decoding for a new firmware error code.

Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoionic: print data for unknown xcvr type
Shannon Nelson [Tue, 17 Mar 2020 03:22:09 +0000 (20:22 -0700)]
ionic: print data for unknown xcvr type

If we don't recognize the transceiver type, set the xcvr type
and data length such that ethtool can at least print the first
256 bytes and the reader can figure out why the transceiver
is not recognized.

While we're here, we can update the phy_id type values to use
the enum values in sfp.h.

Fixes: 4d03e00a2140 ("ionic: Add initial ethtool support")
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoionic: remove adminq napi instance
Shannon Nelson [Tue, 17 Mar 2020 03:22:08 +0000 (20:22 -0700)]
ionic: remove adminq napi instance

Remove the adminq's napi struct when tearing down
the adminq.

Fixes: 1d062b7b6f64 ("ionic: Add basic adminq support")
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoionic: deinit rss only if selected
Shannon Nelson [Tue, 17 Mar 2020 03:22:07 +0000 (20:22 -0700)]
ionic: deinit rss only if selected

Don't bother de-initing RSS if it wasn't selected.

Fixes: aa3198819bea ("ionic: Add RSS support")
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoionic: stop devlink warn on mgmt device
Shannon Nelson [Tue, 17 Mar 2020 03:22:06 +0000 (20:22 -0700)]
ionic: stop devlink warn on mgmt device

If we don't set a port type, the devlink code will eventually
print a WARN in the kernel log.  Because the mgmt device is
not really a useful port, don't register it as a devlink port.

Fixes: b3f064e9746d ("ionic: add support for device id 0x1004")
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'net_sched-allow-use-of-hrtimer-slack'
David S. Miller [Wed, 18 Mar 2020 04:16:35 +0000 (21:16 -0700)]
Merge branch 'net_sched-allow-use-of-hrtimer-slack'

Eric Dumazet says:

====================
net_sched: allow use of hrtimer slack

Packet schedulers have used hrtimers with exact expiry times.

Some of them can afford having a slack, in order to reduce
the number of timer interrupts and feed bigger batches
to increase efficiency.

FQ for example does not care if throttled packets are
sent with an additional (small) delay.

Original observation of having maybe too many interrupts
was made by Willem de Bruijn.

v2: added strict netlink checking (Jakub Kicinski)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet_sched: sch_fq: enable use of hrtimer slack
Eric Dumazet [Tue, 17 Mar 2020 02:12:51 +0000 (19:12 -0700)]
net_sched: sch_fq: enable use of hrtimer slack

Add a new attribute to control the fq qdisc hrtimer slack.

Default is set to 10 usec.

When/if packets are throttled, fq sets up an hrtimer that can
lead to one interrupt per packet in the throttled queue.

By using a timer slack, we allow better use of timer interrupts,
by giving them a chance to call multiple timer callbacks
at each hardware interrupt.

Also, giving a slack allows FQ to dequeue batches of packets
instead of a single one, thus increasing xmit_more efficiency.

This has no negative effect on the rate a TCP flow can sustain,
since each TCP flow maintains its own precise vtime (tp->tcp_wstamp_ns)
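
The batching effect can be illustrated with a deliberately simplified
model, which is an assumption of this sketch and not how the hrtimer
core is implemented: each interrupt fires at the earliest pending hard
expiry (soft expiry + slack) and runs every timer whose soft expiry has
passed by then. count_timer_interrupts() is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Count timer interrupts for 'n' soft expiry times (sorted ascending, in
 * ns) when each timer may fire up to 'slack_ns' late.
 */
static size_t count_timer_interrupts(const uint64_t *soft_ns, size_t n,
				     uint64_t slack_ns)
{
	size_t interrupts = 0;
	size_t i = 0;

	while (i < n) {
		/* Latest moment the oldest pending timer may fire. */
		uint64_t fire_at = soft_ns[i] + slack_ns;

		/* Everything already due at that moment runs in one batch. */
		while (i < n && soft_ns[i] <= fire_at)
			i++;
		interrupts++;
	}
	return interrupts;
}
```

In this model, 20 timers spaced 1 usec apart cost 20 interrupts with no
slack and 2 interrupts with a 10 usec slack. Real systems have other
interrupt sources, so measured gains will be smaller.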

v2: added strict netlink checking (as feedback from Jakub Kicinski)

Tested:
 1000 concurrent flows all using paced packets.
 1,000,000 packets sent per second.

Before the patch :

$ vmstat 2 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 60726784  23628 3485992    0    0   138     1  977  535  0 12 87  0  0
 0  0      0 60714700  23628 3485628    0    0     0     0 1568827 26462  0 22 78  0  0
 1  0      0 60716012  23628 3485656    0    0     0     0 1570034 26216  0 22 78  0  0
 0  0      0 60722420  23628 3485492    0    0     0     0 1567230 26424  0 22 78  0  0
 0  0      0 60727484  23628 3485556    0    0     0     0 1568220 26200  0 22 78  0  0
 2  0      0 60718900  23628 3485380    0    0     0    40 1564721 26630  0 22 78  0  0
 2  0      0 60718096  23628 3485332    0    0     0     0 1562593 26432  0 22 78  0  0
 0  0      0 60719608  23628 3485064    0    0     0     0 1563806 26238  0 22 78  0  0
 1  0      0 60722876  23628 3485236    0    0     0   130 1565874 26566  0 22 78  0  0
 1  0      0 60722752  23628 3484908    0    0     0     0 1567646 26247  0 22 78  0  0

After the patch, slack of 10 usec, we can see a reduction of interrupts
per second, and a small decrease of reported cpu usage.

$ vmstat 2 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 60722564  23628 3484728    0    0   133     1  696  545  0 13 87  0  0
 1  0      0 60722568  23628 3484824    0    0     0     0 977278 25469  0 20 80  0  0
 0  0      0 60716396  23628 3484764    0    0     0     0 979997 25326  0 20 80  0  0
 0  0      0 60713844  23628 3484960    0    0     0     0 981394 25249  0 20 80  0  0
 2  0      0 60720468  23628 3484916    0    0     0     0 982860 25062  0 20 80  0  0
 1  0      0 60721236  23628 3484856    0    0     0     0 982867 25100  0 20 80  0  0
 1  0      0 60722400  23628 3484456    0    0     0     8 982698 25303  0 20 80  0  0
 0  0      0 60715396  23628 3484428    0    0     0     0 981777 25176  0 20 80  0  0
 0  0      0 60716520  23628 3486544    0    0     0    36 978965 27857  0 21 79  0  0
 0  0      0 60719592  23628 3486516    0    0     0    22 977318 25106  0 20 80  0  0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet_sched: do not reprogram a timer about to expire
Eric Dumazet [Tue, 17 Mar 2020 02:12:50 +0000 (19:12 -0700)]
net_sched: do not reprogram a timer about to expire

qdisc_watchdog_schedule_range_ns() can use the newly added slack
and avoid rearming the hrtimer a bit earlier than the current
value. This patch has no effect if delta_ns parameter
is zero.

Note that this means the max slack is potentially doubled.
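
A simplified user-space sketch of the check, assuming the timer is left
alone whenever it is already armed within [expires, expires + delta_ns].
struct watchdog and watchdog_schedule_range() are hypothetical stand-ins
for illustration, not the kernel's qdisc_watchdog code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified watchdog state. */
struct watchdog {
	bool queued;             /* is a timer currently armed? */
	uint64_t last_expires;   /* expiry it was armed for, in ns */
	unsigned int reprograms; /* how often the timer was (re)armed */
};

static void watchdog_schedule_range(struct watchdog *wd, uint64_t expires,
				    uint64_t delta_ns)
{
	/*
	 * Timer already armed within [expires, expires + delta_ns]: keep it.
	 * The subtraction is unsigned, so an armed expiry earlier than
	 * 'expires' wraps to a huge value and fails the comparison.
	 */
	if (wd->queued && wd->last_expires - expires <= delta_ns)
		return;

	wd->queued = true;
	wd->last_expires = expires;
	wd->reprograms++;
}
```

With delta_ns == 0 only an identical expiry is skipped. The doubling
follows from the window: a timer kept at expires + delta_ns may itself
fire delta_ns late, which is 2 * delta_ns after the requested time.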

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet_sched: add qdisc_watchdog_schedule_range_ns()
Eric Dumazet [Tue, 17 Mar 2020 02:12:49 +0000 (19:12 -0700)]
net_sched: add qdisc_watchdog_schedule_range_ns()

Some packet schedulers might want to add a slack
when programming hrtimers. This can reduce number
of interrupts and increase batch sizes and thus
give good xmit_more savings.

This commit adds qdisc_watchdog_schedule_range_ns()
helper, with an extra delta_ns parameter.

Legacy qdisc_watchdog_schedule_ns() becomes an inline
passing a zero slack.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'nfp-type'
David S. Miller [Wed, 18 Mar 2020 04:12:40 +0000 (21:12 -0700)]
Merge branch 'nfp-type'

Jakub Kicinski says:

====================
net: rename flow_action stats and set NFP type

Jiri, I hope this is okay with you, I just dropped the "type" from
the helper and value names, and now things should be able to fit
on a line, within 80 characters.

Second patch makes the NFP able to offload DELAYED stats, which
is the type it supports.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonfp: allow explicitly selected delayed stats
Jakub Kicinski [Tue, 17 Mar 2020 01:42:12 +0000 (18:42 -0700)]
nfp: allow explicitly selected delayed stats

NFP flower offload uses delayed stats. Kernel recently gained
the ability to specify stats types. Make nfp accept DELAYED
stats, not just the catch all "any".

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: rename flow_action_hw_stats_types* -> flow_action_hw_stats*
Jakub Kicinski [Tue, 17 Mar 2020 01:42:11 +0000 (18:42 -0700)]
net: rename flow_action_hw_stats_types* -> flow_action_hw_stats*

flow_action_hw_stats_types_check() helper takes one of the
FLOW_ACTION_HW_STATS_*_BIT values as input. If we align
the arguments to the opening bracket of the helper there
is no way to call this helper and stay under 80 characters.

Remove the "types" part from the new flow_action helpers
and enum values.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'net-phy-improve-phy_driver-callback-handle_interrupt'
David S. Miller [Wed, 18 Mar 2020 03:58:22 +0000 (20:58 -0700)]
Merge branch 'net-phy-improve-phy_driver-callback-handle_interrupt'

Heiner Kallweit says:

====================
net: phy: improve phy_driver callback handle_interrupt

did_interrupt() clears the interrupt, therefore handle_interrupt() can
not check which event triggered the interrupt. To overcome this
constraint and allow more flexibility for custom interrupt handlers,
let's decouple handle_interrupt() from parts of the phylib interrupt
handling. Custom interrupt handlers now have to implement the
did_interrupt() functionality in handle_interrupt() if needed.

Fortunately we have just one custom interrupt handler so far (in the
mscc PHY driver), convert it to the changed API and make use of the
benefits.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: phy: mscc: consider interrupt source in interrupt handler
Heiner Kallweit [Mon, 16 Mar 2020 21:33:31 +0000 (22:33 +0100)]
net: phy: mscc: consider interrupt source in interrupt handler

Trigger the respective interrupt handler functionality only if the
related interrupt source bit is set.
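
A minimal sketch of dispatching on interrupt source bits, assuming a
clear-on-read status register whose value has already been read exactly
once. The bit names, struct irq_counts and sketch_handle_interrupt() are
hypothetical; the mscc driver defines its own register layout.

```c
#include <stdint.h>

/* Hypothetical interrupt source bits. */
#define IRQ_SRC_LINK   (1u << 0)
#define IRQ_SRC_TSTAMP (1u << 1)

struct irq_counts {
	unsigned int link_events;
	unsigned int tstamp_events;
};

/*
 * Returns 1 if the interrupt came from this device, 0 otherwise.
 * 'status' is the single read of the status register, kept so that each
 * source bit can still be inspected after the hardware cleared it.
 */
static int sketch_handle_interrupt(uint16_t status, struct irq_counts *c)
{
	/* The "was it ours?" test that did_interrupt() used to perform. */
	if (!(status & (IRQ_SRC_LINK | IRQ_SRC_TSTAMP)))
		return 0;

	if (status & IRQ_SRC_LINK)
		c->link_events++;
	if (status & IRQ_SRC_TSTAMP)
		c->tstamp_events++;
	return 1;
}
```

Reading the status once and branching on the saved value is what makes
per-source handling possible with a clear-on-read register.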

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: phy: improve phy_driver callback handle_interrupt
Heiner Kallweit [Mon, 16 Mar 2020 21:32:33 +0000 (22:32 +0100)]
net: phy: improve phy_driver callback handle_interrupt

did_interrupt() clears the interrupt, therefore handle_interrupt() can
not check which event triggered the interrupt. To overcome this
constraint and allow more flexibility for custom interrupt handlers,
let's decouple handle_interrupt() from parts of the phylib interrupt
handling. Custom interrupt handlers now have to implement the
did_interrupt() functionality in handle_interrupt() if needed.

Fortunately we have just one custom interrupt handler so far (in the
mscc PHY driver), convert it to the changed API.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'ethtool-consolidate-irq-coalescing-last-part'
David S. Miller [Wed, 18 Mar 2020 03:56:58 +0000 (20:56 -0700)]
Merge branch 'ethtool-consolidate-irq-coalescing-last-part'

Jakub Kicinski says:

====================
ethtool: consolidate irq coalescing - last part

Convert remaining drivers following the groundwork laid in a recent
patch set [1] and continued in [2], [3], [4], [5]. The aim of
the effort is to consolidate irq coalescing parameter validation
in the core.

This set is the sixth and last installment. It converts the remaining
8 drivers in drivers/net/ethernet. The last patch makes declaring
supported IRQ coalescing parameters a requirement.

[1] https://lore.kernel.org/netdev/20200305051542.991898-1-kuba@kernel.org/
[2] https://lore.kernel.org/netdev/20200306010602.1620354-1-kuba@kernel.org/
[3] https://lore.kernel.org/netdev/20200310021512.1861626-1-kuba@kernel.org/
[4] https://lore.kernel.org/netdev/20200311223302.2171564-1-kuba@kernel.org/
[5] https://lore.kernel.org/netdev/20200313040803.2367590-1-kuba@kernel.org/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: ethtool: require drivers to set supported_coalesce_params
Jakub Kicinski [Mon, 16 Mar 2020 20:47:12 +0000 (13:47 -0700)]
net: ethtool: require drivers to set supported_coalesce_params

Now that all in-tree drivers have been updated we can
make the supported_coalesce_params mandatory.

To save debugging time in case some driver was missed
(or is out of tree) add a warning when netdev is registered
with set_coalesce but without supported_coalesce_params.
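
The idea of core-side validation can be sketched in user-space C. All
names below (the COAL_* bits, struct coalesce_req, coalesce_check(),
missing_supported_mask()) are hypothetical and reduced to four
parameters, and the -EOPNOTSUPP return value is an assumption of this
sketch rather than something stated in this commit.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical parameter bits a driver can declare as supported. */
#define COAL_RX_USECS      (1u << 0)
#define COAL_RX_MAX_FRAMES (1u << 1)
#define COAL_TX_USECS      (1u << 2)
#define COAL_ADAPTIVE_TX   (1u << 3)

struct coalesce_req {
	uint32_t rx_usecs;
	uint32_t rx_max_frames;
	uint32_t tx_usecs;
	uint32_t use_adaptive_tx;
};

/* Build a mask of the parameters the request sets to a nonzero value. */
static uint32_t coalesce_used(const struct coalesce_req *r)
{
	uint32_t used = 0;

	if (r->rx_usecs)
		used |= COAL_RX_USECS;
	if (r->rx_max_frames)
		used |= COAL_RX_MAX_FRAMES;
	if (r->tx_usecs)
		used |= COAL_TX_USECS;
	if (r->use_adaptive_tx)
		used |= COAL_ADAPTIVE_TX;
	return used;
}

/* Core-side check: reject anything the driver did not declare. */
static int coalesce_check(uint32_t supported, const struct coalesce_req *r)
{
	if (coalesce_used(r) & ~supported)
		return -EOPNOTSUPP;
	return 0;
}

/* Registration-time check: a setter without a declared mask is a bug. */
static int missing_supported_mask(int has_set_coalesce, uint32_t supported)
{
	return has_set_coalesce && !supported;
}
```

With the mask declared once per driver, individual set_coalesce
implementations no longer need their own rejection code.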

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: axienet: let core reject the unsupported coalescing parameters
Jakub Kicinski [Mon, 16 Mar 2020 20:47:11 +0000 (13:47 -0700)]
net: axienet: let core reject the unsupported coalescing parameters

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver already correctly rejected all unsupported
parameters. No functional changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: ll_temac: let core reject the unsupported coalescing parameters
Jakub Kicinski [Mon, 16 Mar 2020 20:47:10 +0000 (13:47 -0700)]
net: ll_temac: let core reject the unsupported coalescing parameters

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver already correctly rejected all unsupported
parameters. No functional changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: davinci_emac: reject unsupported coalescing params
Jakub Kicinski [Mon, 16 Mar 2020 20:47:09 +0000 (13:47 -0700)]
net: davinci_emac: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: cpsw: reject unsupported coalescing params
Jakub Kicinski [Mon, 16 Mar 2020 20:47:08 +0000 (13:47 -0700)]
net: cpsw: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: tehuti: reject unsupported coalescing params
Jakub Kicinski [Mon, 16 Mar 2020 20:47:07 +0000 (13:47 -0700)]
net: tehuti: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: dwc-xlgmac: let core reject the unsupported coalescing parameters
Jakub Kicinski [Mon, 16 Mar 2020 20:47:06 +0000 (13:47 -0700)]
net: dwc-xlgmac: let core reject the unsupported coalescing parameters

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver already correctly rejected all unsupported
parameters.

While at it remove unnecessary zeroing on get.

No functional changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: socionext: reject unsupported coalescing params
Jakub Kicinski [Mon, 16 Mar 2020 20:47:05 +0000 (13:47 -0700)]
net: socionext: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: sfc: reject unsupported coalescing params
Jakub Kicinski [Mon, 16 Mar 2020 20:47:04 +0000 (13:47 -0700)]
net: sfc: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.
The check for use_adaptive_tx_coalesce will now be done by
the core.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet/mlx5: Avoid forwarding to other eswitch uplink
Eli Cohen [Thu, 12 Mar 2020 15:20:32 +0000 (17:20 +0200)]
net/mlx5: Avoid forwarding to other eswitch uplink

Do not allow forwarding of encapsulated traffic received from one eswitch's
uplink to another eswitch's uplink.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5: Eswitch, enable forwarding back to uplink port
Eli Cohen [Mon, 24 Feb 2020 14:59:54 +0000 (16:59 +0200)]
net/mlx5: Eswitch, enable forwarding back to uplink port

Add dependency on cap termination_table_raw_traffic to allow non
encapsulated packets received from uplink to be forwarded back to the
received uplink port.

Refactor the conditions into a separate function.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5e: Add support for offloading traffic from uplink to uplink
Eli Cohen [Thu, 13 Feb 2020 12:05:14 +0000 (14:05 +0200)]
net/mlx5e: Add support for offloading traffic from uplink to uplink

Termination tables change the direction of a packet in hw from RX to SX
pipeline. Use that to offload hairpin flows received from uplink and
sent back to uplink.

Currently termination tables are used for pushing VLAN to packets
received from uplink and targeting a VF. Extend the implementation to
allow forwarding packets to uplink. These packets can either be
encapsulated or not.

In case encapsulation is needed before forwarding, move the reformat
object to the termination table as required.

Extend the hash table key to include tunnel information for the sake of
reusing reformat objects.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5: Don't use termination tables in slow path
Eli Cohen [Thu, 27 Feb 2020 10:22:46 +0000 (12:22 +0200)]
net/mlx5: Don't use termination tables in slow path

Don't use termination tables for packets that are steered to the slow path,
as a pre-step for supporting packet encap (packet reformat) action on
termination tables. Packet encap (reformat) actions steer the packet
to the slow path until outer arp entries are resolved.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5: Avoid configuring eswitch QoS if not supported
Eli Cohen [Sun, 1 Mar 2020 13:31:49 +0000 (15:31 +0200)]
net/mlx5: Avoid configuring eswitch QoS if not supported

Check if QoS is enabled for the eswitch before attempting to configure
QoS parameters and emit a netlink error if not supported.

Introduce an API to check if QoS is supported for the eswitch.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5e: Fix devlink port register sequence
Vladyslav Tarasiuk [Wed, 4 Mar 2020 11:33:50 +0000 (13:33 +0200)]
net/mlx5e: Fix devlink port register sequence

If udevd is configured to rename interfaces according to persistent
naming rules and if a network interface has phys_port_name in sysfs,
its contents will be appended to the interface name.
However, register_netdev creates device in sysfs and if
devlink_port_register is called after that, there is a timeframe in
which udevd may read an empty phys_port_name value. The consequence is
that the interface will lose this suffix and its name will not be
really persistent.

The solution is to register the port before registering a netdev.

Fixes: c6acd629eec7 ("net/mlx5e: Add support for devlink-port in non-representors mode")
Signed-off-by: Vladyslav Tarasiuk <vladyslavt@mellanox.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5e: Fix rejecting all egress rules not on vlan
Roi Dayan [Tue, 3 Mar 2020 09:18:53 +0000 (11:18 +0200)]
net/mlx5e: Fix rejecting all egress rules not on vlan

The original condition rejected all egress rules that
are not on tunnel device.
Also, the whole point of this egress reject was to disallow bad
rules because of egdev which doesn't exist today, so remove
this check entirely.

Fixes: 0a7fcb78cc21 ("net/mlx5e: Support inner header rewrite with goto action")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>