review.tizen.org Git - platform/kernel/linux-starfive.git/log

ena: Add XDP frame size to amazon NIC driver

Frame size ENA_PAGE_SIZE is limited to 16K on systems with larger
PAGE_SIZE than 16K. Change ENA_XDP_MAX_MTU to also take into account
the reserved tailroom.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Sameeh Jubran <sameehj@amazon.com>
Cc: Arthur Kiyanovski <akiyano@amazon.com>
Link: https://lore.kernel.org/bpf/158945341384.97035.907403694833419456.stgit@firesoul

net: ethernet: ti: Add XDP frame size to driver cpsw

The driver code cpsw.c and cpsw_new.c both use page_pool
with default order-0 pages or their RX-pages.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/bpf/158945340875.97035.752144756428532878.stgit@firesoul

qlogic/qede: Add XDP frame size to driver

The driver qede uses a full page, when XDP is enabled. The drivers value
in rx_buf_seg_size (struct qede_rx_queue) will be PAGE_SIZE when an
XDP bpf_prog is attached.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Ariel Elior <aelior@marvell.com>
Cc: GR-everest-linux-l2@marvell.com
Link: https://lore.kernel.org/bpf/158945340366.97035.7764939691580349618.stgit@firesoul

hv_netvsc: Add XDP frame size to driver

The hyperv NIC driver does memory allocation and copy even without XDP.
In XDP mode it will allocate a new page for each packet and copy over
the payload, before invoking the XDP BPF-prog.

The positive thing it that its easy to determine the xdp.frame_sz.

The XDP implementation for hv_netvsc transparently passes xdp_prog
to the associated VF NIC. Many of the Azure VMs are using SRIOV, so
majority of the data are actually processed directly on the VF driver's XDP
path. So the overhead of the synthetic data path (hv_netvsc) is minimal.

Then XDP is enabled on this driver, XDP_PASS and XDP_TX will create the
SKB via build_skb (based on the newly allocated page). Now using XDP
frame_sz this will provide more skb_tailroom, which netstack can use for
SKB coalescing (e.g tcp_try_coalesce -> skb_try_coalesce).

V3: Adjust patch desc to be more positive.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Link: https://lore.kernel.org/bpf/158945339857.97035.10212138582505736163.stgit@firesoul

dpaa2-eth: Add XDP frame size

The dpaa2-eth driver reserve some headroom used for hardware and
software annotation area in RX/TX buffers. Thus, xdp.data_hard_start
doesn't start at page boundary.

When XDP is configured the area reserved via dpaa2_fd_get_offset(fd) is
448 bytes of which XDP have reserved 256 bytes. As frame_sz is
calculated as an offset from xdp_buff.data_hard_start, an adjust from
the full PAGE_SIZE == DPAA2_ETH_RX_BUF_RAW_SIZE.

When doing XDP_REDIRECT, the driver doesn't need this reserved headroom
any-longer and allows xdp_do_redirect() to use it. This is an advantage
for the drivers own ndo-xdp_xmit, as it uses part of this headroom for
itself. Patch also adjust frame_sz in this case.

The driver cannot support XDP data_meta, because it uses the headroom
just before xdp.data for struct dpaa2_eth_swa (DPAA2_ETH_SWA_SIZE=64),
when transmitting the packet. When transmitting a xdp_frame in
dpaa2_eth_xdp_xmit_frame (call via ndo_xdp_xmit) is uses this area to
store a pointer to xdp_frame and dma_size, which is used in TX
completion (free_tx_fd) to return frame via xdp_return_frame().

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Link: https://lore.kernel.org/bpf/158945339348.97035.8562488847066908856.stgit@firesoul

veth: Xdp using frame_sz in veth driver

The veth driver can run XDP in "native" mode in it's own NAPI
handler, and since commit 9fc8d518d9d5 ("veth: Handle xdp_frames in
xdp napi ring") packets can come in two forms either xdp_frame or
skb, calling respectively veth_xdp_rcv_one() or veth_xdp_rcv_skb().

For packets to arrive in xdp_frame format, they will have been
redirected from an XDP native driver. In case of XDP_PASS or no
XDP-prog attached, the veth driver will allocate and create an SKB.

The current code in veth_xdp_rcv_one() xdp_frame case, had to guess
the frame truesize of the incoming xdp_frame, when using
veth_build_skb(). With xdp_frame->frame_sz this is not longer
necessary.

Calculating the frame_sz in veth_xdp_rcv_skb() skb case, is done
similar to the XDP-generic handling code in net/core/dev.c.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Link: https://lore.kernel.org/bpf/158945338840.97035.935897116345700902.stgit@firesoul

veth: Adjust hard_start offset on redirect XDP frames

When native XDP redirect into a veth device, the frame arrives in the
xdp_frame structure. It is then processed in veth_xdp_rcv_one(),
which can run a new XDP bpf_prog on the packet. Doing so requires
converting xdp_frame to xdp_buff, but the tricky part is that
xdp_frame memory area is located in the top (data_hard_start) memory
area that xdp_buff will point into.

The current code tried to protect the xdp_frame area, by assigning
xdp_buff.data_hard_start past this memory. This results in 32 bytes
less headroom to expand into via BPF-helper bpf_xdp_adjust_head().

This protect step is actually not needed, because BPF-helper
bpf_xdp_adjust_head() already reserve this area, and don't allow
BPF-prog to expand into it. Thus, it is safe to point data_hard_start
directly at xdp_frame memory area.

Fixes: 9fc8d518d9d5 ("veth: Handle xdp_frames in xdp napi ring")
Reported-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945338331.97035.5923525383710752178.stgit@firesoul

xdp: Cpumap redirect use frame_sz and increase skb_tailroom

Knowing the memory size backing the packet/xdp_frame data area, and
knowing it already have reserved room for skb_shared_info, simplifies
using build_skb significantly.

With this change we no-longer lie about the SKB truesize, but more
importantly a significant larger skb_tailroom is now provided, e.g. when
drivers uses a full PAGE_SIZE. This extra tailroom (in linear area) can be
used by the network stack when coalescing SKBs (e.g. in skb_try_coalesce,
see TCP cases where tcp_queue_rcv() can 'eat' skb).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945337822.97035.13557959180460986059.stgit@firesoul

xdp: Xdp_frame add member frame_sz and handle in convert_to_xdp_frame

Use hole in struct xdp_frame, when adding member frame_sz, which keeps
same sizeof struct (32 bytes)

Drivers ixgbe and sfc had bug cases where the necessary/expected
tailroom was not reserved. This can lead to some hard to catch memory
corruption issues. Having the drivers frame_sz this can be detected when
packet length/end via xdp->data_end exceed the xdp_data_hard_end
pointer, which accounts for the reserved the tailroom.

When detecting this driver issue, simply fail the conversion with NULL,
which results in feedback to driver (failing xdp_do_redirect()) causing
driver to drop packet. Given the lack of consistent XDP stats, this can
be hard to troubleshoot. And given this is a driver bug, we want to
generate some more noise in form of a WARN stack dump (to ID the driver
code that inlined convert_to_xdp_frame).

Inlining the WARN macro is problematic, because it adds an asm
instruction (on Intel CPUs ud2) what influence instruction cache
prefetching. Thus, introduce xdp_warn and macro XDP_WARN, to avoid this
and at the same time make identifying the function and line of this
inlined function easier.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945337313.97035.10015729316710496600.stgit@firesoul

net: XDP-generic determining XDP frame size

The SKB "head" pointer points to the data area that contains
skb_shared_info, that can be found via skb_end_pointer(). Given
xdp->data_hard_start have been established (basically pointing to
skb->head), frame size is between skb_end_pointer() and data_hard_start,
plus the size reserved to skb_shared_info.

Change the bpf_xdp_adjust_tail offset adjust of skb->len, to be a positive
offset number on grow, and negative number on shrink. As this seems more
natural when reading the code.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945336804.97035.7164852191163722056.stgit@firesoul

net: netsec: Add support for XDP frame size

This driver takes advantage of page_pool PP_FLAG_DMA_SYNC_DEV that
can help reduce the number of cache-lines that need to be flushed
when doing DMA sync for_device. Due to xdp_adjust_tail can grow the
area accessible to the by the CPU (can possibly write into), then max
sync length *after* bpf_prog_run_xdp() needs to be taken into account.

For XDP_TX action the driver is smart and does DMA-sync. When growing
tail this is still safe, because page_pool have DMA-mapped the entire
page size.

Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/bpf/158945336295.97035.15034759661036971024.stgit@firesoul

mvneta: Add XDP frame size to driver

This marvell driver mvneta uses PAGE_SIZE frames, which makes it
really easy to convert. Driver updates rxq and now frame_sz
once per NAPI call.

This driver takes advantage of page_pool PP_FLAG_DMA_SYNC_DEV that
can help reduce the number of cache-lines that need to be flushed
when doing DMA sync for_device. Due to xdp_adjust_tail can grow the
area accessible to the by the CPU (can possibly write into), then max
sync length *after* bpf_prog_run_xdp() needs to be taken into account.

For XDP_TX action the driver is smart and does DMA-sync. When growing
tail this is still safe, because page_pool have DMA-mapped the entire
page size.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Cc: thomas.petazzoni@bootlin.com
Link: https://lore.kernel.org/bpf/158945335786.97035.12714388304493736747.stgit@firesoul

sfc: Add XDP frame size

This driver uses RX page-split when possible. It was recently fixed
in commit 86e85bf6981c ("sfc: fix XDP-redirect in this driver") to
add needed tailroom for XDP-redirect.

After the fix efx->rx_page_buf_step is the frame size, with enough
head and tail-room for XDP-redirect.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158945335278.97035.14611425333184621652.stgit@firesoul

bnxt: Add XDP frame size to driver

This driver uses full PAGE_SIZE pages when XDP is enabled.

In case of XDP uses driver uses __bnxt_alloc_rx_page which does full
page DMA-map. Thus, xdp_adjust_tail grow is DMA compliant for XDP_TX
action that does DMA-sync.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Cc: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Link: https://lore.kernel.org/bpf/158945334769.97035.13437970179897613984.stgit@firesoul

xdp: Add frame size to xdp_buff

XDP have evolved to support several frame sizes, but xdp_buff was not
updated with this information. The frame size (frame_sz) member of
xdp_buff is introduced to know the real size of the memory the frame is
delivered in.

When introducing this also make it clear that some tailroom is
reserved/required when creating SKBs using build_skb().

It would also have been an option to introduce a pointer to
data_hard_end (with reserved offset). The advantage with frame_sz is
that (like rxq) drivers only need to setup/assign this value once per
NAPI cycle. Due to XDP-generic (and some drivers) it's not possible to
store frame_sz inside xdp_rxq_info, because it's varies per packet as it
can be based/depend on packet length.

V2: nitpick: deduct -> deduce

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945334261.97035.555255657490688547.stgit@firesoul

Merge git://git./linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-05-14

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Merged tag 'perf-for-bpf-2020-05-06' from tip tree that includes CAP_PERFMON.

2) support for narrow loads in bpf_sock_addr progs and additional
helpers in cg-skb progs, from Andrey.

3) bpf benchmark runner, from Andrii.

4) arm and riscv JIT optimizations, from Luke.

5) bpf iterator infrastructure, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'expand-cg_skb-helpers'

Andrey Ignatov says:

====================
v2->v3:
- better documentation for bpf_sk_cgroup_id in uapi (Yonghong Song)
- save/restore errno in network helpers (Yonghong Song)
- cleanup leftover after switching selftest to skeleton (Yonghong Song)
- switch from map to skel->bss in selftest (Yonghong Song)

v1->v2:
- switch selftests to skeleton.

This patch set allows a bunch of existing sk lookup and skb cgroup id
helpers, and adds two new bpf_sk_{,ancestor_}cgroup_id helpers to be used
in cgroup skb programs.

It fills the gap to cover a use-case to apply intra-host cgroup-bpf network
policy based on a source cgroup a packet comes from.

For example, there can be multiple containers A, B, C running on a host.
Every such container runs in its own cgroup that can have multiple
sub-cgroups. But all these containers can share some IP addresses.

At the same time container A wants to have a policy for a server S running
in it so that only clients from this same container can connect to S, but
not from other containers (such as B, C). Source IP address can't be used
to decide whether to allow or deny a packet, but it looks reasonable to
filter by cgroup id.

The patch set allows to implement the following policy:
* when an ingress packet comes to container's cgroup, lookup peer (client)
  socket this packet comes from;
* having peer socket, get its cgroup id;
* compare peer cgroup id with self cgroup id and allow packet only if they
  match, i.e. it comes from same cgroup;
* the "sub-cgroup" part of the story can be addressed by getting not direct
  cgroup id of the peer socket, but ancestor cgroup id on specified level,
  similar to existing "ancestor" flavors of cgroup id helpers.

A newly introduced selftest implements such a policy in its basic form to
provide a better idea on the use-case.

Patch 1 allows existing sk lookup helpers in cgroup skb.
Patch 2 allows skb_ancestor_cgroup_id in cgrou skb.
Patch 3 introduces two new helpers to get cgroup id of socket.
Patch 4 extends network helpers to use them in the next patch.
Patch 5 adds selftest / example of use-case.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Test for sk helpers in cgroup skb

Test bpf_sk_lookup_tcp, bpf_sk_release, bpf_sk_cgroup_id and
bpf_sk_ancestor_cgroup_id helpers from cgroup skb program.

The test creates a testing cgroup, starts a TCPv6 server inside the
cgroup and creates two client sockets: one inside testing cgroup and one
outside.

Then it attaches cgroup skb program to the cgroup that checks all TCP
segments coming to the server and allows only those coming from the
cgroup of the server. If a segment comes from a peer outside of the
cgroup, it'll be dropped.

Finally the test checks that client from inside testing cgroup can
successfully connect to the server, but client outside the cgroup fails
to connect by timeout.

The main goal of the test is to check newly introduced
bpf_sk_{,ancestor_}cgroup_id helpers.

It also checks a couple of socket lookup helpers (tcp & release), but
lookup helpers were introduced much earlier and covered by other tests.
Here it's mostly checked that they can be called from cgroup skb.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/171f4c5d75e8ff4fe1c4e8c1c12288b5240a4549.1589486450.git.rdna@fb.com

selftests/bpf: Add connect_fd_to_fd, connect_wait net helpers

Add two new network helpers.

connect_fd_to_fd connects an already created client socket fd to address
of server fd. Sometimes it's useful to separate client socket creation
and connecting this socket to a server, e.g. if client socket has to be
created in a cgroup different from that of server cgroup.

Additionally connect_to_fd is now implemented using connect_fd_to_fd,
both helpers don't treat EINPROGRESS as an error and let caller decide
how to proceed with it.

connect_wait is a helper to work with non-blocking client sockets so
that if connect_to_fd or connect_fd_to_fd returned -1 with errno ==
EINPROGRESS, caller can wait for connect to finish or for connection
timeout. The helper returns -1 on error, 0 on timeout (1sec,
hard-coded), and positive number on success.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/1403fab72300f379ca97ead4820ae43eac4414ef.1589486450.git.rdna@fb.com

bpf: Introduce bpf_sk_{, ancestor_}cgroup_id helpers

With having ability to lookup sockets in cgroup skb programs it becomes
useful to access cgroup id of retrieved sockets so that policies can be
implemented based on origin cgroup of such socket.

For example, a container running in a cgroup can have cgroup skb ingress
program that can lookup peer socket that is sending packets to a process
inside the container and decide whether those packets should be allowed
or denied based on cgroup id of the peer.

More specifically such ingress program can implement intra-host policy
"allow incoming packets only from this same container and not from any
other container on same host" w/o relying on source IP addresses since
quite often it can be the case that containers share same IP address on
the host.

Introduce two new helpers for this use-case: bpf_sk_cgroup_id() and
bpf_sk_ancestor_cgroup_id().

These helpers are similar to existing bpf_skb_{,ancestor_}cgroup_id
helpers with the only difference that sk is used to get cgroup id
instead of skb, and share code with them.

See documentation in UAPI for more details.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/f5884981249ce911f63e9b57ecd5d7d19154ff39.1589486450.git.rdna@fb.com

bpf: Allow skb_ancestor_cgroup_id helper in cgroup skb

cgroup skb programs already can use bpf_skb_cgroup_id. Allow
bpf_skb_ancestor_cgroup_id as well so that container policies can be
implemented for a container that can have sub-cgroups dynamically
created, but policies should still be implemented based on cgroup id of
container itself not on an id of a sub-cgroup.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/8874194d6041eba190356453ea9f6071edf5f658.1589486450.git.rdna@fb.com

bpf: Allow sk lookup helpers in cgroup skb

Currently sk lookup helpers are allowed in tc, xdp, sk skb, and cgroup
sock_addr programs.

But they would be useful in cgroup skb as well so that for example
cgroup skb ingress program can lookup a peer socket a packet comes from
on same host and make a decision whether to allow or deny this packet
based on the properties of that socket, e.g. cgroup that peer socket
belongs to.

Allow the following sk lookup helpers in cgroup skb:
* bpf_sk_lookup_tcp;
* bpf_sk_lookup_udp;
* bpf_sk_release;
* bpf_skc_lookup_tcp.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/f8c7ee280f1582b586629436d777b6db00597d63.1589486450.git.rdna@fb.com

selftest/bpf: Fix spelling mistake "SIGALARM" -> "SIGALRM"

There is a spelling mistake in an error message, fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200514121529.259668-1-colin.king@canonical.com

bpf: Fix bpf_iter's task iterator logic

task_seq_get_next might stop prematurely if get_pid_task() fails to get
task_struct. Failure to do so doesn't mean that there are no more tasks with
higher pids. Procfs's iteration algorithm (see next_tgid in fs/proc/base.c)
does a retry in such case. After this fix, instead of stopping prematurely
after about 300 tasks on my server, bpf_iter program now returns >4000, which
sounds much closer to reality.

Fixes: eaaacd23910f ("bpf: Add task and task/file iterator targets")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200514055137.1564581-1-andriin@fb.com

selftests/bpf: Test narrow loads for bpf_sock_addr.user_port

Test 1,2,4-byte loads from bpf_sock_addr.user_port in sock_addr
programs.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/e5c734a58cca4041ab30cb5471e644246f8cdb5a.1589420814.git.rdna@fb.com

bpf: Support narrow loads from bpf_sock_addr.user_port

bpf_sock_addr.user_port supports only 4-byte load and it leads to ugly
code in BPF programs, like:

volatile __u32 user_port = ctx->user_port;
__u16 port = bpf_ntohs(user_port);

Since otherwise clang may optimize the load to be 2-byte and it's
rejected by verifier.

Add support for 1- and 2-byte loads same way as it's supported for other
fields in bpf_sock_addr like user_ip4, msg_src_ip4, etc.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/c1e983f4c17573032601d0b2b1f9d1274f24bc16.1589420814.git.rdna@fb.com

samples/bpf: xdp_redirect_cpu: Set MAX_CPUS according to NR_CPUS

xdp_redirect_cpu is currently failing in bpf_prog_load_xattr()
allocating cpu_map map if CONFIG_NR_CPUS is less than 64 since
cpu_map_alloc() requires max_entries to be less than NR_CPUS.
Set cpu_map max_entries according to NR_CPUS in xdp_redirect_cpu_kern.c
and get currently running cpus in xdp_redirect_cpu_user.c

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/374472755001c260158c4e4b22f193bdd3c56fb7.1589300442.git.lorenzo@kernel.org

r8169: don't include linux/moduleparam.h

93882c6f210a ("r8169: switch from netif_xxx message functions to
netdev_xxx") removed the last module parameter from the driver,
therefore there's no need any longer to include linux/moduleparam.h.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: remove not needed checks in rtl8169_set_eee

After 9de5d235b60a ("net: phy: fix aneg restart in phy_ethtool_set_eee")
we don't need the check for aneg being enabled any longer, and as
discussed with Russell configuring the EEE advertisement should be
supported even if we're in a half-duplex mode currently.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: felix: fix incorrect clamp calculation for burst

Currently burst is clamping on rate and not burst, the assignment
of burst from the clamping discards the previous assignment of burst.
This looks like a cut-n-paste error from the previous clamping
calculation on ramp. Fix this by replacing ramp with burst.

Addresses-Coverity: ("Unused value")
Fixes: 0fbabf875d18 ("net: dsa: felix: add support Credit Based Shaper(CBS) for hardware offload")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: mdio-moxart: remove unneeded include

mdio-moxart doesn't use regulators in the driver code. We can remove
the regulator include.

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: dp83867: Convert DP83867 to yaml

Convert the dp83867 binding to yaml.

Signed-off-by: Dan Murphy <dmurphy@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: dp83869: Update licensing info

Add BSD 2 Clause to the licensing.

CC: Rob Herring <robh@kernel.org>
Signed-off-by: Dan Murphy <dmurphy@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

hinic: update huawei ethernet driver maintainer

update huawei ethernet driver maintainer from aviad to Bin luo

Signed-off-by: Luo bin <luobin9@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

hinic: add set_ringparam ethtool_ops support

support to change TX/RX queue depth with ethtool -G

Signed-off-by: Luo bin <luobin9@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: refactor end checks in devlink_nl_cmd_region_read_dumpit

Clean up after recent fixes, move address calculations
around and change the variable init, so that we can have
just one start_offset == end_offset check.

Make the check a little stricter to preserve the -EINVAL
error if requested start offset is larger than the region
itself.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'am65-cpsw-add-taprio-EST-offload-support'

Murali Karicheri says:

====================
am65-cpsw: add taprio/EST offload support

AM65 CPSW h/w supports Enhanced Scheduled Traffic (EST – defined
in P802.1Qbv/D2.2 that later got included in IEEE 802.1Q-2018)
configuration. EST allows express queue traffic to be scheduled
(placed) on the wire at specific repeatable time intervals. In
Linux kernel, EST configuration is done through tc command and
the taprio scheduler in the net core implements a software only
scheduler (SCH_TAPRIO). If the NIC is capable of EST configuration,
user indicate "flag 2" in the command which is then parsed by
taprio scheduler in net core and indicate that the command is to
be offloaded to h/w. taprio then offloads the command to the
driver by calling ndo_setup_tc() ndo ops. This patch implements
ndo_setup_tc() as well as other changes required to offload EST
configuration to CPSW h/w

For more details please refer patch 2/2.

This series is based on original work done by Ivan Khoronzhuk
<ivan.khoronzhuk@linaro.org> to add taprio offload support to
AM65 CPSW 2G.

1. Example configuration 3 Gates

ifconfig eth0 down
ethtool -L eth0 tx 3

ethtool --set-priv-flags eth0 p0-rx-ptype-rrobin off

ifconfig eth0 192.168.2.20

tc qdisc replace dev eth0 parent root handle 100 taprio \
    num_tc 3 \
    map 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 \
    queues 1@0 1@1 1@2 \
    base-time 0000 \
    sched-entry S 4 125000 \
    sched-entry S 2 125000 \
    sched-entry S 1 250000 \
    flags 2

2. Example configuration 8 Gates

ifconfig eth0 down
ethtool -L eth0 tx 8

ethtool --set-priv-flags eth0 p0-rx-ptype-rrobin off

ifconfig eth0 192.168.2.20

tc qdisc replace dev eth0 parent root handle 100 taprio \
    num_tc 8 \
    map 0 1 2 3 4 5 6 7 0 0 0 0 0 0 0 0 \
    queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
    base-time 0000 \
    sched-entry S 80 125000 \
    sched-entry S 40 125000 \
    sched-entry S 20 125000 \
    sched-entry S 10 125000 \
    sched-entry S 08 125000 \
    sched-entry S 04 125000 \
    sched-entry S 02 125000 \
    sched-entry S 01 125000 \
    flags 2

Classify frames to particular priority using skbedit so that they land at
a specific queue in cpsw h/w which is Gated by the EST gate which opens based
on the sched-entry.

tc qdisc add dev eth0 clsact

In the below for example an iperf3 session with destination port 5007
will go through Q7.

tc filter add dev eth0 egress protocol ip prio 1 u32 match ip dport 5007 0xffff action skbedit priority 7

tc filter add dev eth0 egress protocol ip prio 1 u32 match ip dport 5006 0xffff action skbedit priority 6
tc filter add dev eth0 egress protocol ip prio 1 u32 match ip dport 5005 0xffff action skbedit priority 5
tc filter add dev eth0 egress protocol ip prio 1 u32 match ip dport 5004 0xffff action skbedit priority 4
tc filter add dev eth0 egress protocol ip prio 1 u32 match ip dport 5003 0xffff action skbedit priority 3
tc filter add dev eth0 egress protocol ip prio 1 u32 match ip dport 5002 0xffff action skbedit priority 2
tc filter add dev eth0 egress protocol ip prio 1 u32 match ip dport 5001 0xffff action skbedit priority 1

iperf3 -c 192.168.2.10 -u -l1470 -b32M -t1 -p 5007

Testing was done by capturing frames at the PC using wireshark and checking for
the bust interval or cycle time of UDP frames with a specific port number.
Verified that the distance between first frame of a burst (cycle-time) is 1
milli second and burst duration is within 125 usec based on the received packet
timestamp shown in wireshark packet display.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ethernet: ti: am65-cpsw-qos: add TAPRIO offload support

AM65 CPSW h/w supports Enhanced Scheduled Traffic (EST – defined
in P802.1Qbv/D2.2 that later got included in IEEE 802.1Q-2018)
configuration. EST allows express queue traffic to be scheduled
(placed) on the wire at specific repeatable time intervals. In
Linux kernel, EST configuration is done through tc command and
the taprio scheduler in the net core implements a software only
scheduler (SCH_TAPRIO). If the NIC is capable of EST configuration,
user indicate "flag 2" in the command which is then parsed by
taprio scheduler in net core and indicate that the command is to
be offloaded to h/w. taprio then offloads the command to the
driver by calling ndo_setup_tc() ndo ops. This patch implements
ndo_setup_tc() to offload EST configuration to CPSW h/w.

Currently driver supports only SetGateStates operation. EST
operates on a repeating time interval generated by the CPTS EST
function generator. Each Ethernet port has a global EST fetch
RAM that can be configured as 2 buffers, each of 64 locations
or one large buffer of 128 locations. In 2 buffer configuration,
a ping pong mechanism is used to hold the active schedule (oper)
in one buffer and new (admin) command in the other. Each 22-bit
fetch command consists of a 14-bit fetch count (14 MSB’s) and an
8-bit priority fetch allow (8 LSB’s) that will be applied for the
fetch count time in wireside clocks. Driver process each of the
sched-entry in the offload command and update the fetch RAM.
Driver configures duration in sched-entry into the fetch count
and Gate mask into the priority fetch bits of the RAM. Then
configures the CPTS EST function generator to activate the
schedule. Currently driver supports only 2 buffer configuration
which means driver supports a max cycle time of ~8 msec.

CPSW supports a configurable number of priority queues (up to 8)
and needs to be switched to this mode from the default round
robin mode before EST can be offloaded. User configures
these through ethtool commands (-L for changing number of
queues and --set-priv-flags to disable round robin mode).
Driver doesn't enable EST if pf_p0_rx_ptype_rrobin privat flag
is set. The flag is common for all ports, and so can't be just
overridden by taprio configuration w/o user involvement.
Command fails if pf_p0_rx_ptype_rrobin is already set in the
driver.

Scheds (commands) configuration depends on interface speed so
driver translates the duration to the fetch count based on
link speed. Each schedule can be constructed with several
command entries in fetch RAM depending on interval. For example
if each sched has timer interval < ~130us on 1000 Mb link then
each sched consumes one command and have 1:1 mapping. When
Ethernet link goes down, driver purge the configuration if link
is down for more than 1 second.

The patch allows to update the timer and scheds memory only if it's
really needed, and skip cases required the user to stop timer by
configuring only shceds memory.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethernet: ti: am65-cpts: add routines to support taprio offload

TAPRIO/EST offload support in CPSW2G requires EST scheduler
function enabled in CPTS. So this patch add a function to
set cycle time for EST scheduler. It also add a function for
getting time in ns of PHC clock for taprio qdisc configuration.
Mostly to verify if timer update is needed or to get actual
state of oper/admin schedule.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-qed-qede-critical-hw-error-handling'

Igor Russkikh says:

====================
net: qed/qede: critical hw error handling

FastLinQ devices as a complex systems may observe various hardware
level error conditions, both severe and recoverable.

Driver is able to detect and report this, but so far it only did
trace/dmesg based reporting.

Here we implement an extended hw error detection, service task
handler captures a dump for the later analysis.

I also resubmit a patch from Denis Bolotin on tx timeout handler,
addressing David's comment regarding recovery procedure as an extra
reaction on this event.

v2:

Removing the patch with ethtool dump and udev magic. Its quite isolated,
I'm working on devlink based logic for this separately.

v1:

https://patchwork.ozlabs.org/project/netdev/cover/cover.1588758463.git.irusskikh@marvell.com/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: qed: fix bad formatting

On some adjacent code, fix bad code formatting

Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qed: introduce critical hardware error handler

MCP may signal driver about generic critical failure.
Driver has to collect mdump information (get_retain),
it pushes that to logs and triggers generic notification on
"hardware attention" event.

Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qed: introduce critical fan failure handler

Fan failure is sent by firmware, driver reacts on this error with
newly introduced notification path. It will collect dump and shut down
the device to prevent physical breakage

Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qede: Implement ndo_tx_timeout

Upon tx timeout detection we do disable carrier and print TX queue
info on TX timeout. We then raise hw error condition and trigger
service task to handle this.

This handler will capture extra debug info and then optionally
trigger recovery procedure to try restore function.

Signed-off-by: Denis Bolotin <dbolotin@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qede: optional hw recovery procedure

Driver has an ability to initiate a recovery process as a reaction to
detected errors. But the codepath (recovery_process) was disabled and
never active.

Here we add ethtool private flag to allow user have the recovery
procedure activated.

We still do not enable this by default though, since in some configurations
this is not desirable. E.g. this may impact other PFs/VFs.

Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qed: attention clearing properties

On different hardware events we have to respond differently,
on some of hardware indications hw attention (error condition)
should be cleared by the driver to continue normal functioning.

Here we introduce attention clear flags, and put them on some
important events (in aeu_descs).

Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qed: cleanup debug related declarations

Thats probably a legacy code had double declaration of some fields.
Cleanup this, removing copy and fixing references.

Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qed: critical err reporting to management firmware

On various critical errors, notification handler should also report
the err information into the management firmware.

MFW can interact with server/motherboard backend agents - these are
used by server manufacturers to monitor server HW health.

Thus, it is important for driver to report on any faulty conditions

Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qed: invoke err notify on critical areas

In a number of critical places not only debug trace should be printed,
but the appropriate hw error condition should be raised and error
handling/recovery should start.

Introduce our new qed_hw_err_notify invocation in these places to
record and indicate critical error conditions in hardware.

Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qede: add hw err scheduled handler

qede (ethernet level driver) registers a callback handler.
This handler maintains eth dev state flags/bits to track error processing.

It implements in place processing part for nonsleeping context (WARN_ON
trigger), and a deferred (delayed work) part which triggers recovery
process for recoverable errors.

In later patches this atomic handler will come with more meat.

We introduce err_flags on ethdevice structure, its being used to record
error handling properties.

Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qed: adding hw_err states and handling

Here we introduce qed device error tracking flags and error types.

qed_hw_err_notify is an entrace point to report errors.
It'll notify higher level drivers (qede/qedr/etc) to handle and recover
the error.

List of posible errors comes from hardware interfaces, but could be
extended in future.

Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-hns3-add-some-cleanups-for-next'

Huazhong Tan says:

====================
net: hns3: add some cleanups for -next

This patchset adds some cleanups for the HNS3 ethernet driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: remove unnecessary frag list checking in hns3_nic_net_xmit()

The skb_has_frag_list() in hns3_nic_net_xmit() is redundant, since
skb_walk_frags() includes this checking implicitly.

Reported-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: remove some unused macros

There are some macros defined in hns3_enet.h, but not used in
anywhere.

Reported-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: modify an incorrect error log in hclge_mbx_handler()

When handling HCLGE_MBX_GET_LINK_STATUS, PF will return the link
status to the VF, so the error log of hclge_get_link_info() is
incorrect.

Reported-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: remove a duplicated printing in hclge_configure()

Since hclge_get_cfg() already has error print, so hclge_configure()
should not print error when calling hclge_get_cfg() fail.

Reported-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: modify some incorrect spelling

This patch modifies some incorrect spelling.

Reported-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8152: Use MAC address from device tree if available

If a MAC address was passed via the device tree node for the r8152
device, use it and fall back to reading from EEPROM otherwise. This is
useful for devices where the r8152 EEPROM was not programmed with a
valid MAC address, or if users want to explicitly set a MAC address in
the bootloader and pass that to the kernel.

Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: fix flower parent qdisc

Flower tests used to create ingress filter with specified parent qdisc
"parent ffff:" but dump them on "ingress". With recent commit that fixed
tcm_parent handling in dump those are not considered same parent anymore,
which causes iproute2 tc to emit additional "parent ffff:" in first line of
filter dump output. The change in output causes filter match in tests to
fail.

Prevent parent qdisc output when dumping filters in flower tests by always
correctly specifying "ingress" parent both when creating and dumping
filters.

Fixes: a7df4870d79b ("net_sched: fix tcm_parent in tc filter dump")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mt7530: set CPU port to fallback mode

Currently, setting a bridge's self PVID to other value and deleting
the default VID 1 renders untagged ports of that VLAN unable to talk to
the CPU port:

bridge vlan add dev br0 vid 2 pvid untagged self
bridge vlan del dev br0 vid 1 self
bridge vlan add dev sw0p0 vid 2 pvid untagged
bridge vlan del dev sw0p0 vid 1
# br0 cannot send untagged frames out of sw0p0 anymore

That is because the CPU port is set to security mode and its PVID is
still 1, and untagged frames are dropped due to VLAN member violation.

Set the CPU port to fallback mode so untagged frames can pass through.

Fixes: 83163f7dca56 ("net: dsa: mediatek: add VLAN support for MT7530")
Signed-off-by: DENG Qingfang <dqfext@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvneta: speed down the PHY, if WoL used, to save energy

Some PHYs connected to this ethernet hardware support the WoL feature.
But when WoL is enabled and the machine is powered off, the PHY remains
waiting for a magic packet at max speed (i.e. 1Gbps), which is a waste of
energy.

Slow down the PHY speed before stopping the ethernet if WoL is enabled,
and save some energy while the machine is powered off or sleeping.

Tested using an Armada 370 based board (LS421DE) equipped with a Marvell
88E1518 PHY.

Signed-off-by: Daniel González Cabanelas <dgcbueu@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: fix dereference of table before it is null checked

Currently pointer table is being dereferenced on a null check of
table->must_restore_filters before it is being null checked, leading
to a potential null pointer dereference issue. Fix this by null
checking table before dereferencing it when checking for a null
table->must_restore_filters.

Addresses-Coverity: ("Dereference before null check")
Fixes: e4fe938cff04 ("sfc: move 'must restore' flags out of ef10-specific nic_data")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: at803x: add cable diagnostics support

The AR8031/AR8033 and the AR8035 support cable diagnostics. Adding
driver support is straightforward, so lets add it.

The PHY just do one pair at a time, so we have to start the test four
times. The cable_test_get_status() can block and therefore we can just
busy poll the test completion and continue with the next pair until we
are done.
The time delta counter seems to run at 125MHz which just gives us a
resolution of about 82.4cm per tick.

100m cable, A/B/C/D open:
  Cable test started for device eth0.
  Cable test completed for device eth0.
  Pair: Pair A, result: Open Circuit
  Pair: Pair A, fault length: 107.94m
  Pair: Pair B, result: Open Circuit
  Pair: Pair B, fault length: 104.64m
  Pair: Pair C, result: Open Circuit
  Pair: Pair C, fault length: 105.47m
  Pair: Pair D, result: Open Circuit
  Pair: Pair D, fault length: 107.94m

1m cable, A/B connected, C shorted, D open:
  Cable test started for device eth0.
  Cable test completed for device eth0.
  Pair: Pair A, result: OK
  Pair: Pair B, result: OK
  Pair: Pair C, result: Short within Pair
  Pair: Pair C, fault length: 0.82m
  Pair: Pair D, result: Open Circuit
  Pair: Pair D, fault length: 0.82m

Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: set msg_control_is_user in do_ipv6_getsockopt

While do_ipv6_getsockopt does not call the high-level recvmsg helper,
the msghdr eventually ends up being passed to put_cmsg anyway, and thus
needs msg_control_is_user set to the proper value.

Fixes: 1f466e1f15cf ("net: cleanly handle kernel vs user buffers for ->msg_control")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-phy-broadcom-cable-tester-support'

Michael Walle says:

====================
net: phy: broadcom: cable tester support

Add cable tester support for the Broadcom PHYs. Support for it was
developed on a BCM54140 Quad PHY which RDB register access.

If there is a link partner the results are not as good as with an open
cable. I guess we could retry if the measurement until all pairs had at
least one valid result.

changes since v1:
- added Reviewed-by: tags
- removed "div by 2" for cross shorts, just mention it in the commit
message. The results are inconclusive if the tests are repeated. So
just report the length as is for now.
- fixed typo in commit message
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: bcm54140: add cable diagnostics support

Use the generic cable tester functions from bcm-phy-lib to add cable
tester support.

100m cable, A/B/C/D open:
  Cable test started for device eth0.
  Cable test completed for device eth0.
  Pair: Pair A, result: Open Circuit
  Pair: Pair B, result: Open Circuit
  Pair: Pair C, result: Open Circuit
  Pair: Pair D, result: Open Circuit
  Pair: Pair A, fault length: 106.60m
  Pair: Pair B, fault length: 103.32m
  Pair: Pair C, fault length: 104.96m
  Pair: Pair D, fault length: 106.60m

1m cable, A/B connected, pair C shorted, D open:
  Cable test started for device eth0.
  Cable test completed for device eth0.
  Pair: Pair A, result: OK
  Pair: Pair B, result: OK
  Pair: Pair C, result: Short within Pair
  Pair: Pair D, result: Open Circuit
  Pair: Pair C, fault length: 0.82m
  Pair: Pair D, fault length: 1.64m

1m cable, A/B connected, pair C shorted with D:
  Cable test started for device eth0.
  Cable test completed for device eth0.
  Pair: Pair A, result: OK
  Pair: Pair B, result: OK
  Pair: Pair C, result: Short to another pair
  Pair: Pair D, result: Short to another pair
  Pair: Pair C, fault length: 1.64m
  Pair: Pair D, fault length: 1.64m

The granularity of the length measurement seems to be 82cm.

Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: broadcom: add cable test support

Most modern broadcom PHYs support ECD (enhanced cable diagnostics). Add
support for it in the bcm-phy-lib so they can easily be used in the PHY
driver.

There are two access methods for ECD: legacy by expansion registers and
via the new RDB registers which are exclusive. Provide functions in two
variants where the PHY driver can choose from. To keep things simple for
now, we just switch the register access to expansion registers in the
RDB variant for now. On the flipside, we have to keep a bus lock to
prevent any other non-legacy access on the PHY.

The results of the intra-pair tests are inconclusive (at least for the
BCM54140). Most of the times half the length is reported but sometimes
the length is correct.

Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: broadcom: add bcm_phy_modify_exp()

Add the convenience function to do a read-modify-write. This has the
additional benefit of saving one write to the selection register.

Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: broadcom: add exp register access methods without buslock

Add helper to read and write expansion registers without taking the mdio
lock.

Please note, that this changes the semantics of the read and write.
Before there was no lock between selecting the expansion register and
the actual read/write. This may lead to access failures if there are
parallel accesses. Instead take the bus lock during the whole access
cycle.

Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: tja11xx: add cable-test support

Add initial cable testing support.
This PHY needs only 100usec for this test and it is recommended to run it
before the link is up. For now, provide at least ethtool support, so it
can be tested by more developers.

This patch was tested with TJA1102 PHY with following results:
- No cable, is detected as open
- 1m cable, with no connected other end and detected as open
- a 40m cable (out of spec, max lenght should be 15m) is detected as OK.

Current patch do not provide polarity test support. This test would
indicate not proper wire connection, where "+" wire of main phy is
connected to the "-" wire of the link partner.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bpf_iter-fixes'

Yonghong Song says:

====================
Commit ae24345da54e ("bpf: Implement an interface to register
bpf_iter targets") and its subsequent commits in the same patch set
introduced bpf iterator, a way to run bpf program when iterating
kernel data structures.

This patch set addressed some followup issues. One big change
is to allow target to pass ctx arg register types to verifier
for verification purpose. Please see individual patch for details.

Changelogs:
  v1 -> v2:
    . add "const" qualifier to struct bpf_iter_reg for
      bpf_iter_[un]reg_target, and this results in
      additional "const" qualifiers in some other places
    . drop the patch which will issue WARN_ONCE if
      seq_ops->show() returns a positive value.
      If this does happen, code review should spot
      this or author does know what he is doing.
      In the future, we do want to implement a
      mechanism to find out all registered targets
      so we will be aware of new additions.
====================

Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

net: ignore sock_from_file errors in __scm_install_fd

The code had historically been ignoring these errors, and my recent
refactoring changed that, which broke ssh in some setups.

Fixes: 2618d530dd8b ("net/scm: cleanup scm_detach_fds")
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

samples/bpf: Remove compiler warnings

Commit 5fbc220862fc ("tools/libpf: Add offsetof/container_of macro
in bpf_helpers.h") added macros offsetof/container_of to
bpf_helpers.h. Unfortunately, it caused compilation warnings
below for a few samples/bpf programs:
  In file included from /data/users/yhs/work/net-next/samples/bpf/sockex2_kern.c:4:
  In file included from /data/users/yhs/work/net-next/include/uapi/linux/in.h:24:
  In file included from /data/users/yhs/work/net-next/include/linux/socket.h:8:
  In file included from /data/users/yhs/work/net-next/include/linux/uio.h:8:
  /data/users/yhs/work/net-next/include/linux/kernel.h:992:9: warning: 'container_of' macro redefined [-Wmacro-redefined]
          ^
  /data/users/yhs/work/net-next/tools/lib/bpf/bpf_helpers.h:46:9: note: previous definition is here
          ^
  1 warning generated.
    CLANG-bpf  samples/bpf/sockex3_kern.o

In all these cases, bpf_helpers.h is included first, followed by other
standard headers. The macro container_of is defined unconditionally
in kernel.h, causing the compiler warning.

The fix is to move bpf_helpers.h after standard headers.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200513180223.2949987-1-yhs@fb.com

bpf: Enable bpf_iter targets registering ctx argument types

Commit b121b341e598 ("bpf: Add PTR_TO_BTF_ID_OR_NULL
support") adds a field btf_id_or_null_non0_off to
bpf_prog->aux structure to indicate that the
first ctx argument is PTR_TO_BTF_ID reg_type and
all others are PTR_TO_BTF_ID_OR_NULL.
This approach does not really scale if we have
other different reg types in the future, e.g.,
a pointer to a buffer.

This patch enables bpf_iter targets registering ctx argument
reg types which may be different from the default one.
For example, for pointers to structures, the default reg_type
is PTR_TO_BTF_ID for tracing program. The target can register
a particular pointer type as PTR_TO_BTF_ID_OR_NULL which can
be used by the verifier to enforce accesses.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200513180221.2949882-1-yhs@fb.com

bpf: Change func bpf_iter_unreg_target() signature

Change func bpf_iter_unreg_target() parameter from target
name to target reg_info, similar to bpf_iter_reg_target().

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200513180220.2949737-1-yhs@fb.com

bpf: net: Refactor bpf_iter target registration

Currently bpf_iter_reg_target takes parameters from target
and allocates memory to save them. This is really not
necessary, esp. in the future we may grow information
passed from targets to bpf_iter manager.

The patch refactors the code so target reg_info
becomes static and bpf_iter manager can just take
a reference to it.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200513180219.2949605-1-yhs@fb.com

bpf: Add comments to interpret bpf_prog return values

Add a short comment in bpf_iter_run_prog() function to
explain how bpf_prog return value is converted to
seq_ops->show() return value:
  bpf_prog return           seq_ops()->show() return
     0                         0
     1                         -EAGAIN

When show() return value is -EAGAIN, the current
bpf_seq_read() will end. If the current seq_file buffer
is empty, -EAGAIN will return to user space. Otherwise,
the buffer will be copied to user space.
In both cases, the next bpf_seq_read() call will
try to show the same object which returned -EAGAIN
previously.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200513180218.2949517-1-yhs@fb.com

bpf: Change btf_iter func proto prefix to "bpf_iter_"

This is to be consistent with tracing and lsm programs
which have prefix "bpf_trace_" and "bpf_lsm_" respectively.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200513180216.2949387-1-yhs@fb.com

tools/bpf: selftests : Explain bpf_iter test failures with llvm 10.0.0

Commit 6879c042e105 ("tools/bpf: selftests: Add bpf_iter selftests")
added self tests for bpf_iter feature. But two subtests
ipv6_route and netlink needs llvm latest 10.x release branch
or trunk due to a bug in llvm BPF backend. This patch added
the file README.rst to document these two failures
so people using llvm 10.0.0 can be aware of them.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200513180215.2949237-1-yhs@fb.com

Merge branch 'dwmac-meson8b-Ethernet-RX-delay-configuration'

Martin Blumenstingl says:

====================
dwmac-meson8b Ethernet RX delay configuration

The Ethernet TX performance has been historically bad on Meson8b and
Meson8m2 SoCs because high packet loss was seen. I found out that this
was related (yet again) to the RGMII TX delay configuration.
In the process of discussing the big picture (and not just a single
patch) [0] with Andrew I discovered that the IP block behind the
dwmac-meson8b driver actually seems to support the configuration of the
RGMII RX delay (at least on the Meson8b SoC generation).

Since I sent the first RFC I got additional documentation from Jianxin
(many thanks!). Also I have discovered some more interesting details:
- Meson8b Odroid-C1 requires an RX delay (by either the PHY or the MAC)
  Based on the vendor u-boot code (not upstream) I assume that it will
  be the same for all Meson8b and Meson8m2 boards
- Khadas VIM2 seems to have the RX delay built into the PCB trace
  length. When I enable the RX delay on the PHY or MAC I can't get any
  data through. I expect that we will have the same situation on all
  GXBB, GXM, AXG, G12A, G12B and SM1 boards. Further clarification is
  needed here though (since I can't visually see these lengthened
  traces on the PCB). This will be done before sending patches for
  these boards.

Dependencies for this series:
There is a soft dependency for patch #2 on commit f22531438ff42c
"dt-bindings: net: dwmac: increase 'maxItems' for 'clocks',
'clock-names' properties" which is currently in Rob's -next tree.
That commit is needed to make the dt-bindings schema validation
pass for patch #2. That patch has been for ~4 weeks in Robs tree,
so I assume that is not going to be dropped.

Changes since RFC v2 at [2]:
- dropped $ref: /schemas/types.yaml#definitions/uint32 from the
  "amlogic,rx-delay-ns" in patch #1 ("Don't need to define the
  type when in standard units." says Rob - thanks, I learned
  something new). Also use "default: 0" for for this property
  instead of explaining it in the description text.
- added a note to the cover-letter about a hidden dependency for
  dt-binding schema validation in patch #2
- Added Andrew's Reviewed-by to patches 1-7. Thank you again for
  the quick and detailed reviews, I appreciate this!
- error out if the (optional) timing-adjustment clock is missing
  but we're asked to enable the RGMII RX delay. The MAC won't
  work in this specific case and either the RX delay has to be
  provided by the PHY or the timing-adjustment clock has to be
  added.
- dropped the dts patches (#9-11) which were only added to give
  an overview how this is going to be used. those will be sent
  separately
- dropped the RFC prefix

Changes since RFC v1 at [1]:
- add support for the timing adjustment clock input (dt-bindings and
  in the driver) thanks to the input from the unnamed Ethernet engineer
  at Amlogic. This is the missing link between the fclk_div2 clock and
  the Ethernet controller on Meson8b (no traffic would flow if that
  clock was disabled)
- add support fot the amlogic,rx-delay-ns property. The only supported
  values so far are 0ns and 2ns. The registers seem to allow more
  precise timing adjustments, but I could not make that work so far.
- add more register documentation (for the new RX delay bits) and
  unified the placement of existing register documentation. Again,
  thanks to Jianxin and the unnamed Ethernet engineer at Amlogic
- DO NOT MERGE: .dts patches to show the conversion of the Meson8b
  and Meson8m2 boards to "rgmii-id". I didn't have time for all arm64
  patches yet, but these will switch to phy-mode = "rgmii-txid" with
  amlogic,rx-delay-ns = <0> (because the delay seems to be provided by
  the PCB trace length).

[0] https://patchwork.kernel.org/patch/11309891/
[1] https://patchwork.kernel.org/cover/11310719/
[2] https://patchwork.kernel.org/cover/11518257/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: dwmac-meson8b: add support for the RX delay configuration

Configure the PRG_ETH0_ADJ_* bits to enable or disable the RX delay
based on the various RGMII PHY modes. For now the only supported RX
delay settings are:
- disabled, use for example for phy-mode "rgmii-id"
- 0ns - this is treated identical to "disabled", used for example on
  boards where the PHY provides 2ns TX delay and the PCB trace length
  already adds 2ns RX delay
- 2ns - for whenever the PHY cannot add the RX delay and the traces on
  the PCB don't add any RX delay

Disabling the RX delay (in case u-boot enables it, which is the case
for example on Meson8b Odroid-C1) simply means that PRG_ETH0_ADJ_ENABLE,
PRG_ETH0_ADJ_SETUP, PRG_ETH0_ADJ_DELAY and PRG_ETH0_ADJ_SKEW should be
disabled (just disabling PRG_ETH0_ADJ_ENABLE may be enough, since that
disables the whole re-timing logic - but I find it makes more sense to
clear the other bits as well since they depend on that setting).

u-boot on Odroid-C1 uses the following steps to enable a 2ns RX delay:
- enabling enabling the timing adjustment clock
- enabling the timing adjustment logic by setting PRG_ETH0_ADJ_ENABLE
- setting the PRG_ETH0_ADJ_SETUP bit

The documentation for the PRG_ETH0_ADJ_DELAY and PRG_ETH0_ADJ_SKEW
registers indicates that we can even set different RX delays. However,
I could not find out how this works exactly, so for now we only support
a 2ns RX delay using the exact same way that Odroid-C1's u-boot does.

Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: dwmac-meson8b: Make the clock enabling code re-usable

The timing adjustment clock will need similar logic as the RGMII clock:
It has to be enabled in the driver conditionally and when the driver is
unloaded it should be disabled again. Extract the existing code for the
RGMII clock into a new function so it can be re-used.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: dwmac-meson8b: Fetch the "timing-adjustment" clock

The PRG_ETHERNET registers have a built-in timing adjustment circuit
which can provide the RX delay in RGMII mode. This is driven by an
external (to this IP, but internal to the SoC) clock input. Fetch this
clock as optional (even though it's there on all supported SoCs) since
we just learned about it and existing .dtbs don't specify it.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: dwmac-meson8b: Add the PRG_ETH0_ADJ_* bits

The PRG_ETH0_ADJ_* are used for applying the RGMII RX delay. The public
datasheets only have very limited description for these registers, but
Jianxin Pan provided more detailed documentation from an (unnamed)
Amlogic engineer. Add the PRG_ETH0_ADJ_* bits along with the improved
description.

Suggested-by: Jianxin Pan <jianxin.pan@amlogic.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: dwmac-meson8b: Move the documentation for the TX delay

Move the documentation for the TX delay above the PRG_ETH0_TXDLY_MASK
definition. Future commits will add more registers also with
documentation above their register bit definitions. Move the existing
comment so it will be consistent with the upcoming changes.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: dwmac-meson8b: use FIELD_PREP instead of open-coding it

Use FIELD_PREP() to shift a value to the correct offset based on a
bitmask instead of open-coding the logic.
No functional changes.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: dwmac-meson: Document the "timing-adjustment" clock

The PRG_ETHERNET registers can add an RX delay in RGMII mode. This
requires an internal re-timing circuit whose input clock is called
"timing adjustment clock". Document this clock input so the clock can be
enabled as needed.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: meson-dwmac: Add the amlogic,rx-delay-ns property

The PRG_ETHERNET registers on Meson8b and newer SoCs can add an RX
delay. Add a property with the known supported values so it can be
configured according to the board layout.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'for-upstream' of git://git./linux/kernel/git/bluetooth/bluetooth-next

Johan Hedberg says:

====================
pull request: bluetooth-next 2020-05-13

Here's a second attempt at a bluetooth-next pull request which
supercedes the one dated 2020-05-09. This should have the issues
discovered by Jakub fixed.

- Add support for Intel Typhoon Peak device (8087:0032)
- Add device tree bindings for Realtek RTL8723BS device
- Add device tree bindings for Qualcomm QCA9377 device
- Add support for experimental features configuration through mgmt
- Add driver hook to prevent wake from suspend
- Add support for waiting for L2CAP disconnection response
- Multiple fixes & cleanups to the btbcm driver
- Add support for LE scatternet topology for selected devices
- A few other smaller fixes & cleanups

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'benchmark-runner'

Andrii Nakryiko says:

====================
Add generic benchmark runner framework which simplifies writing various
performance benchmarks in a consistent fashion.  This framework will be used
in follow up patches to test performance of perf buffer and ring buffer as
well.

Patch #1 extracts parse_num_list to be re-used between test_progs and bench.

Patch #2 adds generic runner implementation and atomic counter benchmarks to
validate benchmark runner's behavior.

Patch #3 implements test_overhead benchmark as part of bench runner. It also
add fmod_ret BPF program type to a set of benchmarks.

Patch #4 tests faster alternatives to set_task_comm() approach, tested in
test_overhead, in search for minimal-overhead way to trigger BPF program
execution from user-space on demand.

v2->v3:
  - added --prod-affinity and --cons-affinity (Yonghong);
  - removed ringbuf-related options leftovers (Yonghong);
  - added more benchmarking results for test_overhead performance discrepancies;
v1->v2:
  - moved benchmarks into benchs/ subdir (John);
  - added benchmark "suite" scripts (John);
  - few small clean ups, change defaults, etc.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftest/bpf: Add BPF triggering benchmark

It is sometimes desirable to be able to trigger BPF program from user-space
with minimal overhead. sys_enter would seem to be a good candidate, yet in
a lot of cases there will be a lot of noise from syscalls triggered by other
processes on the system. So while searching for low-overhead alternative, I've
stumbled upon getpgid() syscall, which seems to be specific enough to not
suffer from accidental syscall by other apps.

This set of benchmarks compares tp, raw_tp w/ filtering by syscall ID, kprobe,
fentry and fmod_ret with returning error (so that syscall would not be
executed), to determine the lowest-overhead way. Here are results on my
machine (using benchs/run_bench_trigger.sh script):

  base      :    9.200 ± 0.319M/s
  tp        :    6.690 ± 0.125M/s
  rawtp     :    8.571 ± 0.214M/s
  kprobe    :    6.431 ± 0.048M/s
  fentry    :    8.955 ± 0.241M/s
  fmodret   :    8.903 ± 0.135M/s

So it seems like fmodret doesn't give much benefit for such lightweight
syscall. Raw tracepoint is pretty decent despite additional filtering logic,
but it will be called for any other syscall in the system, which rules it out.
Fentry, though, seems to be adding the least amoung of overhead and achieves
97.3% of performance of baseline no-BPF-attached syscall.

Using getpgid() seems to be preferable to set_task_comm() approach from
test_overhead, as it's about 2.35x faster in a baseline performance.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200512192445.2351848-5-andriin@fb.com

selftest/bpf: Fmod_ret prog and implement test_overhead as part of bench

Add fmod_ret BPF program to existing test_overhead selftest. Also re-implement
user-space benchmarking part into benchmark runner to compare results. Results
with ./bench are consistently somewhat lower than test_overhead's, but relative
performance of various types of BPF programs stay consisten (e.g., kretprobe is
noticeably slower). This slowdown seems to be coming from the fact that
test_overhead is single-threaded, while benchmark always spins off at least
one thread for producer. This has been confirmed by hacking multi-threaded
test_overhead variant and also single-threaded bench variant. Resutls are
below. run_bench_rename.sh script from benchs/ subdirectory was used to
produce results for ./bench.

Single-threaded implementations
===============================

/* bench: single-threaded, atomics */
base      :    4.622 ± 0.049M/s
kprobe    :    3.673 ± 0.052M/s
kretprobe :    2.625 ± 0.052M/s
rawtp     :    4.369 ± 0.089M/s
fentry    :    4.201 ± 0.558M/s
fexit     :    4.309 ± 0.148M/s
fmodret   :    4.314 ± 0.203M/s

/* selftest: single-threaded, no atomics */
task_rename base        4555K events per sec
task_rename kprobe      3643K events per sec
task_rename kretprobe   2506K events per sec
task_rename raw_tp      4303K events per sec
task_rename fentry      4307K events per sec
task_rename fexit       4010K events per sec
task_rename fmod_ret    3984K events per sec

Multi-threaded implementations
==============================

/* bench: multi-threaded w/ atomics */
base      :    3.910 ± 0.023M/s
kprobe    :    3.048 ± 0.037M/s
kretprobe :    2.300 ± 0.015M/s
rawtp     :    3.687 ± 0.034M/s
fentry    :    3.740 ± 0.087M/s
fexit     :    3.510 ± 0.009M/s
fmodret   :    3.485 ± 0.050M/s

/* selftest: multi-threaded w/ atomics */
task_rename base        3872K events per sec
task_rename kprobe      3068K events per sec
task_rename kretprobe   2350K events per sec
task_rename raw_tp      3731K events per sec
task_rename fentry      3639K events per sec
task_rename fexit       3558K events per sec
task_rename fmod_ret    3511K events per sec

/* selftest: multi-threaded, no atomics */
task_rename base        3945K events per sec
task_rename kprobe      3298K events per sec
task_rename kretprobe   2451K events per sec
task_rename raw_tp      3718K events per sec
task_rename fentry      3782K events per sec
task_rename fexit       3543K events per sec
task_rename fmod_ret    3526K events per sec

Note that the fact that ./bench benchmark always uses atomic increments for
counting, while test_overhead doesn't, doesn't influence test results all that
much.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200512192445.2351848-4-andriin@fb.com

selftests/bpf: Add benchmark runner infrastructure

While working on BPF ringbuf implementation, testing, and benchmarking, I've
developed a pretty generic and modular benchmark runner, which seems to be
generically useful, as I've already used it for one more purpose (testing
fastest way to trigger BPF program, to minimize overhead of in-kernel code).

This patch adds generic part of benchmark runner and sets up Makefile for
extending it with more sets of benchmarks.

Benchmarker itself operates by spinning up specified number of producer and
consumer threads, setting up interval timer sending SIGALARM signal to
application once a second. Every second, current snapshot with hits/drops
counters are collected and stored in an array. Drops are useful for
producer/consumer benchmarks in which producer might overwhelm consumers.

Once test finishes after given amount of warm-up and testing seconds, mean and
stddev are calculated (ignoring warm-up results) and is printed out to stdout.
This setup seems to give consistent and accurate results.

To validate behavior, I added two atomic counting tests: global and local.
For global one, all the producer threads are atomically incrementing same
counter as fast as possible. This, of course, leads to huge drop of
performance once there is more than one producer thread due to CPUs fighting
for the same memory location.

Local counting, on the other hand, maintains one counter per each producer
thread, incremented independently. Once per second, all counters are read and
added together to form final "counting throughput" measurement. As expected,
such setup demonstrates linear scalability with number of producers (as long
as there are enough physical CPU cores, of course). See example output below.
Also, this setup can nicely demonstrate disastrous effects of false sharing,
if care is not taken to take those per-producer counters apart into
independent cache lines.

Demo output shows global counter first with 1 producer, then with 4. Both
total and per-producer performance significantly drop. The last run is local
counter with 4 producers, demonstrating near-perfect scalability.

$ ./bench -a -w1 -d2 -p1 count-global
Setting up benchmark 'count-global'...
Benchmark 'count-global' started.
Iter   0 ( 24.822us): hits  148.179M/s (148.179M/prod), drops    0.000M/s
Iter   1 ( 37.939us): hits  149.308M/s (149.308M/prod), drops    0.000M/s
Iter   2 (-10.774us): hits  150.717M/s (150.717M/prod), drops    0.000M/s
Iter   3 (  3.807us): hits  151.435M/s (151.435M/prod), drops    0.000M/s
Summary: hits  150.488 ± 1.079M/s (150.488M/prod), drops    0.000 ± 0.000M/s

$ ./bench -a -w1 -d2 -p4 count-global
Setting up benchmark 'count-global'...
Benchmark 'count-global' started.
Iter   0 ( 60.659us): hits   53.910M/s ( 13.477M/prod), drops    0.000M/s
Iter   1 (-17.658us): hits   53.722M/s ( 13.431M/prod), drops    0.000M/s
Iter   2 (  5.865us): hits   53.495M/s ( 13.374M/prod), drops    0.000M/s
Iter   3 (  0.104us): hits   53.606M/s ( 13.402M/prod), drops    0.000M/s
Summary: hits   53.608 ± 0.113M/s ( 13.402M/prod), drops    0.000 ± 0.000M/s

$ ./bench -a -w1 -d2 -p4 count-local
Setting up benchmark 'count-local'...
Benchmark 'count-local' started.
Iter   0 ( 23.388us): hits  640.450M/s (160.113M/prod), drops    0.000M/s
Iter   1 (  2.291us): hits  605.661M/s (151.415M/prod), drops    0.000M/s
Iter   2 ( -6.415us): hits  607.092M/s (151.773M/prod), drops    0.000M/s
Iter   3 ( -1.361us): hits  601.796M/s (150.449M/prod), drops    0.000M/s
Summary: hits  604.849 ± 2.739M/s (151.212M/prod), drops    0.000 ± 0.000M/s

Benchmark runner supports setting thread affinity for producer and consumer
threads. You can use -a flag for default CPU selection scheme, where first
consumer gets CPU #0, next one gets CPU #1, and so on. Then producer threads
pick up next CPU and increment one-by-one as well. But user can also specify
a set of CPUs independently for producers and consumers with --prod-affinity
1,2-10,15 and --cons-affinity <set-of-cpus>. The latter allows to force
producers and consumers to share same set of CPUs, if necessary.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200512192445.2351848-3-andriin@fb.com

selftests/bpf: Extract parse_num_list into generic testing_helpers.c

Add testing_helpers.c, which will contain generic helpers for test runners and
tests needing some common generic functionality, like parsing a set of
numbers.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200512192445.2351848-2-andriin@fb.com

Merge branch 'net-dsa-felix-tc-taprio-and-CBS-offload-support'

Xiaoliang Yang says:

====================
net: dsa: felix: tc taprio and CBS offload support

This patch series support tc taprio and CBS hardware offload according
to IEEE 802.1Qbv and IEEE-802.1Qav on VSC9959.

v1->v2 changes:
- Move port_qos_map_init() function to be common felix codes.
- Keep const for dsa_switch_ops structs, add felix_port_setup_tc
function to call port_setup_tc of felix.info.
- fix code style for cbs_set, rename variables.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: felix: add support Credit Based Shaper(CBS) for hardware offload

VSC9959 hardware support the Credit Based Shaper(CBS) which part
of the IEEE-802.1Qav. This patch support sch_cbs set for VSC9959.

Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: felix: Configure Time-Aware Scheduler via taprio offload

Ocelot VSC9959 switch supports time-based egress shaping in hardware
according to IEEE 802.1Qbv. This patch add support for TAS configuration
on egress port of VSC9959 switch.

Felix driver is an instance of Ocelot family, with a DSA front-end. The
patch uses tc taprio hardware offload to setup TAS set function on felix
driver.

Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: felix: qos classified based on pcp

Set the default QoS Classification based on PCP and DEI of vlan tag,
after that, frames can be Classified to different Qos based on PCP tag.
If there is no vlan tag or vlan ignored, use port default Qos.

Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

libbpf: Fix probe code to return EPERM if encountered

When the probe code was failing for any reason ENOTSUP was returned, even
if this was due to not having enough lock space. This patch fixes this by
returning EPERM to the user application, so it can respond and increase
the RLIMIT_MEMLOCK size.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/158927424896.2342.10402475603585742943.stgit@ebuild

selftests/bpf: Install generated test progs

Before commit 74b5a5968fe8 ("selftests/bpf: Replace test_progs and
test_maps w/ general rule") selftests/bpf used generic install
target from selftests/lib.mk to install generated bpf test progs
by mentioning them in TEST_GEN_FILES variable.

Take that functionality back.

Fixes: 74b5a5968fe8 ("selftests/bpf: Replace test_progs and test_maps w/ general rule")
Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200513021722.7787-1-yauheni.kaliuta@redhat.com