review.tizen.org Git - platform/kernel/linux-starfive.git/log

bpf: Teach stack depth check about async callbacks.

Teach max stack depth checking algorithm about async callbacks
that don't increase bpf program stack size.
Also add sanity check that bpf_tail_call didn't sneak into async cb.
It's impossible, since PTR_TO_CTX is not available in async cb,
hence the program cannot contain bpf_tail_call(ctx,...);

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-10-alexei.starovoitov@gmail.com

bpf: Implement verifier support for validation of async callbacks.

bpf_for_each_map_elem() and bpf_timer_set_callback() helpers are relying on
PTR_TO_FUNC infra in the verifier to validate addresses to subprograms
and pass them into the helpers as function callbacks.
In case of bpf_for_each_map_elem() the callback is invoked synchronously
and the verifier treats it as a normal subprogram call by adding another
bpf_func_state and new frame in __check_func_call().
bpf_timer_set_callback() doesn't invoke the callback directly.
The subprogram will be called asynchronously from bpf_timer_cb().
Teach the verifier to validate such async callbacks as special kind
of jump by pushing verifier state into stack and let pop_stack() process it.

Special care needs to be taken during state pruning.
The call insn doing bpf_timer_set_callback has to be a prune_point.
Otherwise short timer callbacks might not have prune points in front of
bpf_timer_set_callback() which means is_state_visited() will be called
after this call insn is processed in __check_func_call(). As a result,
another async_cb state will be pushed to be walked later and the verifier
will eventually hit BPF_COMPLEXITY_LIMIT_JMP_SEQ limit.
Since push_async_cb() looks like another push_stack() branch the
infinite loop detection will trigger false positive. To recognize
this case mark such states as in_async_callback_fn.
To distinguish infinite loop in async callback vs the same callback called
with different arguments for different map and timer add async_entry_cnt
to bpf_func_state.

Enforce return zero from async callbacks.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-9-alexei.starovoitov@gmail.com

bpf: Relax verifier recursion check.

In the following bpf subprogram:
static int timer_cb(void *map, void *key, void *value)
{
bpf_timer_set_callback(.., timer_cb);
}

the 'timer_cb' is a pointer to a function.
ld_imm64 insn is used to carry this pointer.
bpf_pseudo_func() returns true for such ld_imm64 insn.

Unlike bpf_for_each_map_elem() the bpf_timer_set_callback() is asynchronous.
Relax control flow check to allow such "recursion" that is seen as an infinite
loop by check_cfg(). The distinction between bpf_for_each_map_elem() and
bpf_timer_set_callback() is done in the follow up patch.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-8-alexei.starovoitov@gmail.com

bpf: Remember BTF of inner maps.

BTF is required for 'struct bpf_timer' to be recognized inside map value.
The bpf timers are supported inside inner maps.
Remember 'struct btf *' in inner_map_meta to make it available
to the verifier in the sequence:

struct bpf_map *inner_map = bpf_map_lookup_elem(&outer_map, ...);
if (inner_map)
    timer = bpf_map_lookup_elem(inner_map, ...);

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-7-alexei.starovoitov@gmail.com

bpf: Prevent pointer mismatch in bpf_timer_init.

bpf_timer_init() arguments are:
1. pointer to a timer (which is embedded in map element).
2. pointer to a map.
Make sure that pointer to a timer actually belongs to that map.

Use map_uid (which is unique id of inner map) to reject:
inner_map1 = bpf_map_lookup_elem(outer_map, key1)
inner_map2 = bpf_map_lookup_elem(outer_map, key2)
if (inner_map1 && inner_map2) {
    timer = bpf_map_lookup_elem(inner_map1);
    if (timer)
        // mismatch would have been allowed
        bpf_timer_init(timer, inner_map2);
}
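
The check can be pictured with a minimal sketch. It is a toy model, not the
verifier code: 'struct reg_info' and timer_matches_map() are hypothetical
names, and the sketch assumes that two inner maps looked up from the same
outer map look identical by map pointer at verification time, so only the
uid tells them apart.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of what is tracked per register. */
struct reg_info {
	const void *map_ptr; /* map the register refers to or points into */
	uint32_t map_uid;    /* unique id per inner-map lookup; 0 otherwise */
};

/*
 * Returns true when the timer pointer and the map argument refer to the
 * same map. Comparing map_ptr alone is assumed to be insufficient for
 * inner maps, hence the additional uid comparison.
 */
static bool timer_matches_map(const struct reg_info *timer,
			      const struct reg_info *map)
{
	return timer->map_ptr == map->map_ptr &&
	       timer->map_uid == map->map_uid;
}
```

With the same map_ptr but uids 1 and 2 (inner_map1 vs inner_map2 above) the
function returns false, which corresponds to rejecting the program.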

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-6-alexei.starovoitov@gmail.com

bpf: Add map side support for bpf timers.

Restrict bpf timers to array, hash (both preallocated and kmalloced), and
lru map types. The per-cpu maps with timers don't make sense, since 'struct
bpf_timer' is a part of map value. bpf timers in per-cpu maps would mean that
the number of timers depends on number of possible cpus and timers would not be
accessible from all cpus. lpm map support can be added in the future.
The timers in inner maps are supported.

The bpf_map_update/delete_elem() helpers and sys_bpf commands cancel and free
bpf_timer in a given map element.

Similar to 'struct bpf_spin_lock' BTF is required and it is used to validate
that map element indeed contains 'struct bpf_timer'.

Make check_and_init_map_value() init both bpf_spin_lock and bpf_timer when
map element data is reused in preallocated htab and lru maps.

Teach copy_map_value() to support both bpf_spin_lock and bpf_timer in a single
map element. There could be one of each, but not more than one. Due to 'one
bpf_timer in one element' restriction do not support timers in global data,
since global data is a map of single element, but from bpf program side it's
seen as many global variables and restriction of single global timer would be
odd. The sys_bpf map_freeze and sys_mmap syscalls are not allowed on maps with
timers, since user space could have corrupted mmap element and crashed the
kernel. The maps with timers cannot be readonly. Due to these restrictions
search for bpf_timer in datasec BTF in case it was placed in the global data to
report clear error.

The previous patch allowed 'struct bpf_timer' as a first field in a map
element only. Relax this restriction.

Refactor lru map to s/bpf_lru_push_free/htab_lru_push_free/ to cancel and free
the timer when lru map deletes an element as a part of its eviction algorithm.

Make sure that bpf program cannot access 'struct bpf_timer' via direct load/store.
The timer operations are done through helpers only.
This is similar to 'struct bpf_spin_lock'.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-5-alexei.starovoitov@gmail.com

bpf: Introduce bpf timers.

Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
in hash/array/lru maps as a regular field and helpers to operate on it:

// Initialize the timer.
// First 4 bits of 'flags' specify clockid.
// Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
long bpf_timer_init(struct bpf_timer *timer, struct bpf_map *map, int flags);

// Configure the timer to call 'callback_fn' static function.
long bpf_timer_set_callback(struct bpf_timer *timer, void *callback_fn);

// Arm the timer to expire 'nsec' nanoseconds from the current time.
long bpf_timer_start(struct bpf_timer *timer, u64 nsec, u64 flags);

// Cancel the timer and wait for callback_fn to finish if it was running.
long bpf_timer_cancel(struct bpf_timer *timer);

Here is what a BPF program might look like:
struct map_elem {
    int counter;
    struct bpf_timer timer;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1000);
    __type(key, int);
    __type(value, struct map_elem);
} hmap SEC(".maps");

static int timer_cb(void *map, int *key, struct map_elem *val);
/* val points to particular map element that contains bpf_timer. */

SEC("fentry/bpf_fentry_test1")
int BPF_PROG(test1, int a)
{
    struct map_elem *val;
    int key = 0;

    val = bpf_map_lookup_elem(&hmap, &key);
    if (val) {
        bpf_timer_init(&val->timer, &hmap, CLOCK_REALTIME);
        bpf_timer_set_callback(&val->timer, timer_cb);
        bpf_timer_start(&val->timer, 1000 /* call timer_cb in 1 usec */, 0);
    }
    return 0;
}

This patch adds helper implementations that rely on hrtimers
to call bpf functions as timers expire.
The following patches add necessary safety checks.
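
The 'first 4 bits of flags specify clockid' rule for bpf_timer_init() can be
sketched in plain C as follows. This is an illustration, not the helper
itself: timer_flags_to_clockid() and the SKETCH_* names are hypothetical, the
clock id values are the Linux ones redefined locally for portability, and
rejecting any bit above the low 4 is an assumption.

```c
#include <errno.h>
#include <stdint.h>

/* Linux clock id values, redefined locally to keep the sketch portable. */
enum {
	SKETCH_CLOCK_REALTIME  = 0,
	SKETCH_CLOCK_MONOTONIC = 1,
	SKETCH_CLOCK_BOOTTIME  = 7,
};

#define SKETCH_CLOCKID_MASK 0xfULL /* low 4 bits of 'flags' */

/* Returns the clock id on success, -EINVAL for unsupported flags. */
static int timer_flags_to_clockid(uint64_t flags)
{
	int clockid;

	/* Assumption: bits above the clockid field must be zero. */
	if (flags & ~SKETCH_CLOCKID_MASK)
		return -EINVAL;

	clockid = (int)(flags & SKETCH_CLOCKID_MASK);
	switch (clockid) {
	case SKETCH_CLOCK_REALTIME:
	case SKETCH_CLOCK_MONOTONIC:
	case SKETCH_CLOCK_BOOTTIME:
		return clockid;
	default:
		return -EINVAL;
	}
}
```

In the example program above, passing CLOCK_REALTIME as 'flags' would map to
clock id 0 under this sketch.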

Only programs with CAP_BPF are allowed to use bpf_timer.

The number of timers used by the program is constrained by
the memcg recorded at map creation time.

The bpf_timer_init() helper needs explicit 'map' argument because inner maps
are dynamic and not known at load time. While the bpf_timer_set_callback() is
receiving hidden 'aux->prog' argument supplied by the verifier.

The prog pointer is needed to do refcnting of bpf program to make sure that
program doesn't get freed while the timer is armed. This approach relies on
"user refcnt" scheme used in prog_array that stores bpf programs for
bpf_tail_call. The bpf_timer_set_callback() will increment the prog refcnt which is
paired with bpf_timer_cancel() that will drop the prog refcnt. The
ops->map_release_uref is responsible for cancelling the timers and dropping
prog refcnt when user space reference to a map reaches zero.
This uref approach is done to make sure that Ctrl-C of user space process will
not leave timers running forever unless the user space explicitly pinned a map
that contained timers in bpffs.

bpf_timer_init() and bpf_timer_set_callback() will return -EPERM if map doesn't
have user references (is not held by open file descriptor from user space and
not pinned in bpffs).

The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
and free the timer if given map element had it allocated.
"bpftool map update" command can be used to cancel timers.

The 'struct bpf_timer' is explicitly __attribute__((aligned(8))) because
'__u64 :64' has 1 byte alignment of 8 byte padding.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-4-alexei.starovoitov@gmail.com

bpf: Factor out bpf_spin_lock into helpers.

Move ____bpf_spin_lock/unlock into helpers to make it more clear
that quadruple underscore bpf_spin_lock/unlock are irqsave/restore variants.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-3-alexei.starovoitov@gmail.com

bpf: Prepare bpf_prog_put() to be called from irq context.

Currently bpf_prog_put() is called from the task context only.
With addition of bpf timers the timer related helpers will start calling
bpf_prog_put() from irq-saved region and in rare cases might drop
the refcnt to zero.
To address this case, first, convert bpf_prog_free_id() to be irq-save
(this is similar to bpf_map_free_id), and, second, defer non irq
appropriate calls into work queue.
For example:
bpf_audit_prog() is calling kmalloc and wake_up_interruptible,
bpf_prog_kallsyms_del_all()->bpf_ksym_del()->spin_unlock_bh().
They are not safe with irqs disabled.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-2-alexei.starovoitov@gmail.com

selftests/bpf: Remove unused variable in tc_tunnel prog

The variable buf is unused since commit 005edd16562b ("selftests/bpf:
convert bpf tunnel test to BPF_ADJ_ROOM_MAC"). Remove it to fix the
following warning:

test_tc_tunnel.c:531:7: warning: unused variable 'buf' [-Wunused-variable]

Fixes: 005edd16562b ("selftests/bpf: convert bpf tunnel test to BPF_ADJ_ROOM_MAC")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210713102719.8890-1-tklauser@distanz.ch

bpf: Fix potential memleak and UAF in the verifier.

In bpf_patch_insn_data(), we first use the bpf_patch_insn_single() to
insert new instructions, then use adjust_insn_aux_data() to adjust
insn_aux_data. If the old env->prog does not have enough room for the newly
inserted instructions, we use bpf_prog_realloc to construct new_prog and free
the old env->prog.

There are two errors here. First, if adjust_insn_aux_data() returns
ENOMEM, we should free the new_prog. Second, if adjust_insn_aux_data()
returns ENOMEM, bpf_patch_insn_data() will return NULL, and env->prog has
been freed in bpf_prog_realloc, but we will use it in bpf_check().

So in this patch, we make adjust_insn_aux_data() never fail. In
bpf_patch_insn_data(), we first pre-malloc memory for the new
insn_aux_data, then call bpf_patch_insn_single() to insert new
instructions, and finally call adjust_insn_aux_data() to adjust
insn_aux_data.
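
The 'allocate first, then mutate' ordering can be illustrated with a toy
model. This is not the verifier code: 'struct prog' and prog_insert() are
hypothetical, and the model simply keeps two parallel arrays in sync. The
point is that every allocation that can fail happens before the old state is
touched, so a failure leaves nothing leaked and nothing dangling.

```c
#include <stdlib.h>
#include <string.h>

/* Toy program: instructions plus per-instruction aux data of equal length. */
struct prog {
	int *insns;
	int *aux;
	size_t len;
};

/*
 * Insert 'cnt' zeroed slots at index 'off' in both arrays.
 * Returns 0 on success, -1 on failure; on failure 'p' is left unchanged.
 */
static int prog_insert(struct prog *p, size_t off, size_t cnt)
{
	size_t new_len = p->len + cnt;
	size_t tail = p->len - off;
	int *new_insns, *new_aux;

	if (off > p->len)
		return -1;
	if (cnt == 0)
		return 0;

	/* Step 1: do everything that can fail while 'p' is still intact. */
	new_aux = calloc(new_len, sizeof(*new_aux));
	if (!new_aux)
		return -1;
	new_insns = calloc(new_len, sizeof(*new_insns));
	if (!new_insns) {
		free(new_aux);
		return -1;
	}

	/* Step 2: copy and swap; nothing below this line can fail. */
	memcpy(new_insns, p->insns, off * sizeof(*new_insns));
	memcpy(new_insns + off + cnt, p->insns + off, tail * sizeof(*new_insns));
	memcpy(new_aux, p->aux, off * sizeof(*new_aux));
	memcpy(new_aux + off + cnt, p->aux + off, tail * sizeof(*new_aux));

	free(p->insns);
	free(p->aux);
	p->insns = new_insns;
	p->aux = new_aux;
	p->len = new_len;
	return 0;
}
```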

Fixes: 8041902dae52 ("bpf: adjust insn_aux_data when patching insns")
Signed-off-by: He Fengqing <hefengqing@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210714101815.164322-1-hefengqing@huawei.com

bpf: Fix a typo of reuseport map in bpf.h.

Fix s/BPF_MAP_TYPE_REUSEPORT_ARRAY/BPF_MAP_TYPE_REUSEPORT_SOCKARRAY/ typo
in bpf.h.

Fixes: 2dbb9b9e6df6 ("bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210714124317.67526-1-kuniyu@amazon.co.jp

bpf: Sync tools/include/uapi/linux/bpf.h

Commit 47316f4a3053 missed updating tools/.../bpf.h.
Sync it.

Fixes: 47316f4a3053 ("bpf: Support input xdp_md context in BPF_PROG_TEST_RUN")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

libbpf: Fix reuse of pinned map on older kernel

When loading a BPF program with a pinned map, the loader checks whether
the pinned map can be reused, i.e. whether its properties match. To derive
the properties of the pinned map, the loader invokes BPF_OBJ_GET_INFO_BY_FD
and then does the comparison.

Unfortunately, on < 4.12 kernels the BPF_OBJ_GET_INFO_BY_FD is not
available, so loading the program fails with the following error:

libbpf: failed to get map info for map FD 5: Invalid argument
libbpf: couldn't reuse pinned map at
'/sys/fs/bpf/tc/globals/cilium_call_policy': parameter
mismatch
libbpf: map 'cilium_call_policy': error reusing pinned map
libbpf: map 'cilium_call_policy': failed to create:
Invalid argument(-22)
libbpf: failed to load object 'bpf_overlay.o'

To fix this, fall back to deriving the map properties via
/proc/$PID/fdinfo/$MAP_FD if BPF_OBJ_GET_INFO_BY_FD fails with EINVAL,
which can be used as an indicator that the kernel doesn't support
the latter.
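
A minimal sketch of the fdinfo fallback, assuming the fdinfo text of a BPF
map contains 'name:<tab>value' lines such as map_type, key_size, value_size,
max_entries and map_flags. 'struct map_props' and parse_map_fdinfo() are
hypothetical names, and the function takes the text as a string so that it
can be exercised without a real map FD.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the properties compared on map reuse. */
struct map_props {
	unsigned int type;
	unsigned int key_size;
	unsigned int value_size;
	unsigned int max_entries;
	unsigned int map_flags;
};

/*
 * Parse fdinfo-style text line by line.
 * Returns the number of recognized fields; unknown lines are skipped.
 */
static int parse_map_fdinfo(const char *text, struct map_props *out)
{
	const char *line = text;
	unsigned int val;
	int found = 0;

	memset(out, 0, sizeof(*out));
	while (line && *line) {
		if (sscanf(line, "map_type:\t%u", &val) == 1) {
			out->type = val;
			found++;
		} else if (sscanf(line, "key_size:\t%u", &val) == 1) {
			out->key_size = val;
			found++;
		} else if (sscanf(line, "value_size:\t%u", &val) == 1) {
			out->value_size = val;
			found++;
		} else if (sscanf(line, "max_entries:\t%u", &val) == 1) {
			out->max_entries = val;
			found++;
		} else if (sscanf(line, "map_flags:\t%x", &val) == 1) {
			/* %x also accepts a leading 0x prefix */
			out->map_flags = val;
			found++;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return found;
}
```

A caller would read /proc/$PID/fdinfo/$MAP_FD into a buffer, pass it to the
function, and compare the result with the properties of the map being loaded.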

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210712125552.58705-1-m@lambda.lt

samples/bpf: xdp_redirect_cpu_user: Cpumap qsize set larger default

Experience from production shows queue size of 192 is too small, as
this caused packet drops during cpumap-enqueue on RX-CPU.  This can be
diagnosed with xdp_monitor sample program.

This bpftrace program was used to diagnose the problem in more detail:

bpftrace -e '
  tracepoint:xdp:xdp_cpumap_kthread { @deq_bulk = lhist(args->processed,0,10,1); @drop_net = lhist(args->drops,0,10,1) }
  tracepoint:xdp:xdp_cpumap_enqueue { @enq_bulk = lhist(args->processed,0,10,1); @enq_drops = lhist(args->drops,0,10,1); }'

Watch out for the @enq_drops counter. The @drop_net counter can increase
when the netstack gets invalid packets, so don't despair; it can be
natural. That counter will likely disappear in newer kernels, as it
was a source of confusion (look at netstat info for the reason behind the
netstack @drop_net counters).

The production system was configured with CPU power-saving C6 state.
Learn more in this blogpost[1].

The wakeup latencies in usec for the states are:

# grep -H . /sys/devices/system/cpu/cpu0/cpuidle/*/latency
/sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu0/cpuidle/state1/latency:2
/sys/devices/system/cpu/cpu0/cpuidle/state2/latency:10
/sys/devices/system/cpu/cpu0/cpuidle/state3/latency:133

The deepest state takes 133 usec to wake up from (133/10^6 sec). The link
speed is 25Gbit/s ((25*10^9/8) in bytes/sec). How many bytes can arrive
within 133 usec at this speed: (25*10^9/8)*(133/10^6) = 415625 bytes. With
MTU size packets this is 275 packets, and with minimum Ethernet (incl
intergap overhead) 84 bytes it is 4948 packets. Clearly the default queue
size is too small.

Set the default cpumap queue size to 2048. The worst case (small packets) at
10Gbit/s is 1979 packets within the 133 usec wakeup time, plus 64 packets
before the kthread wakeup call (due to xdp_do_flush), for a worst case of
2043 packets.

Thus, a packet burst on the RX-CPU that enqueues packets to a remote
cpumap CPU in deep-sleep state can overrun the cpumap queue.
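
The queue sizing arithmetic above can be reproduced with a short sketch. The
function names are hypothetical, and the packet counts are rounded to the
nearest integer, which is an assumption chosen because it matches the figures
quoted above.

```c
#include <stdint.h>

/* Bytes that arrive on the link while the remote CPU is still waking up. */
static uint64_t bytes_in_flight(uint64_t link_bits_per_sec, uint64_t wakeup_usec)
{
	return link_bits_per_sec / 8 * wakeup_usec / 1000000;
}

/* Same quantity in packets of 'wire_bytes' each, rounded to nearest. */
static uint64_t packets_in_flight(uint64_t link_bits_per_sec,
				  uint64_t wakeup_usec, uint64_t wire_bytes)
{
	uint64_t bytes = bytes_in_flight(link_bits_per_sec, wakeup_usec);

	return (bytes + wire_bytes / 2) / wire_bytes;
}
```

For example, packets_in_flight(25000000000ULL, 133, 84) gives 4948 and
packets_in_flight(10000000000ULL, 133, 84) gives 1979; adding the 64 packets
queued before the kthread wakeup gives 2043, just under the new default of
2048.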

The production system was also configured to avoid deep-sleep via:
tuned-adm profile network-latency

[1] https://jeremyeder.com/2013/08/30/oh-did-you-expect-the-cpu/

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/162523477604.786243.13372630844944530891.stgit@firesoul

Merge branch 'Generic XDP improvements'

Kumar Kartikeya says:

====================

This small series makes some improvements to generic XDP mode and brings it
closer to native XDP. Patch 1 splits out generic XDP processing into reusable
parts, patch 2 adds pointer friendly wrappers for bitops (to avoid casting the
address of a local pointer back and forth to unsigned long *), patch 3 implements
generic cpumap support (details in commit) and patch 4 allows devmap bpf prog
execution before generic_xdp_tx is called.

Patch 5 just updates a couple of selftests to adapt to changes in behavior (in
that specifying devmap/cpumap prog fd in generic mode is now allowed).

Changelog:
----------
v5 -> v6
v5: https://lore.kernel.org/bpf/20210701002759.381983-1-memxor@gmail.com
* Put rcpu->prog check before RCU-bh section to avoid do_softirq (Jesper)

v4 -> v5
v4: https://lore.kernel.org/bpf/20210628114746.129669-1-memxor@gmail.com
* Add comments and examples for new bitops macros (Alexei)

v3 -> v4
v3: https://lore.kernel.org/bpf/20210622202835.1151230-1-memxor@gmail.com
* Add detach now that attach of XDP program succeeds (Toke)
* Clean up the test to use new ASSERT macros

v2 -> v3
v2: https://lore.kernel.org/bpf/20210622195527.1110497-1-memxor@gmail.com
* list_for_each_entry -> list_for_each_entry_safe (due to deletion of skb)

v1 -> v2
v1: https://lore.kernel.org/bpf/20210620233200.855534-1-memxor@gmail.com
* Move __ptr_{set,clear,test}_bit to bitops.h (Toke)
Also changed argument order to match the bit op they wrap.
* Remove map value size checking functions for cpumap/devmap (Toke)
* Rework prog run for skb in cpu_map_kthread_run (Toke)
* Set skb->dev to dst->dev after devmap prog has run
* Don't set xdp rxq that will be overwritten in cpumap prog run
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Tidy xdp attach selftests

Support for cpumap and devmap entry progs in previous commits means the
test needs to be updated for the new semantics. Also take this
opportunity to convert it from CHECK macros to the new ASSERT macros.

Since xdp_cpumap_attach has no subtest, put the sole test inside the
test_xdp_cpumap_attach function.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210702111825.491065-6-memxor@gmail.com

bpf: devmap: Implement devmap prog execution for generic XDP

This lifts the restriction on running devmap BPF progs in generic
redirect mode. To match native XDP behavior, it is invoked right before
generic_xdp_tx is called, and only supports XDP_PASS/XDP_ABORTED/
XDP_DROP actions.

We also return 0 even if devmap program drops the packet, as
semantically redirect has already succeeded and the devmap prog is the
last point before TX of the packet to device where it can deliver a
verdict on the packet.

This also means it must take care of freeing the skb, as
xdp_do_generic_redirect callers only do that in case an error is
returned.

Since devmap entry prog is supported, remove the check in
generic_xdp_install entirely.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210702111825.491065-5-memxor@gmail.com

bpf: cpumap: Implement generic cpumap

This change implements CPUMAP redirect support for generic XDP programs.
The idea is to reuse the cpu map entry's queue that is used to push
native xdp frames for redirecting skb to a different CPU. This will
match native XDP behavior (in that RPS is invoked again for packet
reinjected into networking stack).

To be able to determine whether the incoming skb is from the driver or
cpumap, we reuse skb->redirected bit that skips generic XDP processing
when it is set. To always make use of this, CONFIG_NET_REDIRECT guard on
it has been lifted and it is always available.

From the redirect side, we add the skb to ptr_ring with its lowest bit
set to 1. This should be safe, as skb pointers are aligned to more than
1 byte. This allows
kthread to discern between xdp_frames and sk_buff. On consumption of the
ptr_ring item, the lowest bit is unset.
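
The low-bit tagging scheme can be sketched in userspace C as follows. The
helper names are hypothetical and the code only illustrates the idea of
carrying a one-bit type tag in an aligned pointer; it is not the cpumap code.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SKB_TAG ((uintptr_t)1)

/* Producer side: mark a ring entry as an skb by setting the lowest bit. */
static void *tag_skb(void *skb)
{
	return (void *)((uintptr_t)skb | SKETCH_SKB_TAG);
}

/* Consumer side: an entry with the lowest bit set is an skb. */
static bool entry_is_skb(const void *entry)
{
	return ((uintptr_t)entry & SKETCH_SKB_TAG) != 0;
}

/* Consumer side: clear the tag to recover the original pointer. */
static void *untag_entry(void *entry)
{
	return (void *)((uintptr_t)entry & ~SKETCH_SKB_TAG);
}
```

The scheme relies on every object placed in the ring being aligned to at
least 2 bytes, so that an untagged pointer always has a zero lowest bit.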

In the end, the skb is simply added to the list that kthread is anyway
going to maintain for xdp_frames converted to skb, and then received
again by using netif_receive_skb_list.

Bulking optimization for generic cpumap is left as an exercise for a
future patch for now.

Since cpumap entry progs are now supported, also remove check in
generic_xdp_install for the cpumap.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/20210702111825.491065-4-memxor@gmail.com

bitops: Add non-atomic bitops for pointers

cpumap needs to set, clear, and test the lowest bit in skb pointer in
various places. To make these checks less noisy, add pointer friendly
bitop macros that also do some typechecking to sanitize the argument.

These wrap the non-atomic bitops __set_bit, __clear_bit, and test_bit
but for pointer arguments. Pointer's address has to be passed in and it
is treated as an unsigned long *, since width and representation of
pointer and unsigned long match on targets Linux supports. They are
prefixed with double underscore to indicate lack of atomicity.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210702111825.491065-3-memxor@gmail.com

net: core: Split out code to run generic XDP prog

This helper can later be utilized in code that runs cpumap and devmap
programs in generic redirect mode and adjust skb based on changes made
to xdp_buff.

When returning XDP_REDIRECT/XDP_TX, it invokes __skb_push, so whenever a
generic redirect path invokes devmap/cpumap prog if set, it must
__skb_pull again as we expect mac header to be pulled.

It also drops the skb_reset_mac_len call after do_xdp_generic, as the
mac_header and network_header are advanced by the same offset, so the
difference (mac_len) remains constant.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210702111825.491065-2-memxor@gmail.com

Merge branch 'bpf: support input xdp_md context in BPF_PROG_TEST_RUN'

Zvi Effron says:

====================

This patchset adds support for passing an xdp_md via ctx_in/ctx_out in
bpf_attr for BPF_PROG_TEST_RUN of XDP programs.

Patch 1 adds a function to validate XDP meta data lengths.

Patch 2 adds initial support for passing XDP meta data in addition to
packet data.

Patch 3 adds support for also specifying the ingress interface and
rx queue.

Patch 4 adds selftests to ensure functionality is correct.

Changelog:
----------
v7->v8
v7: https://lore.kernel.org/bpf/20210624211304.90807-1-zeffron@riotgames.com/

* Fix too long comment line in patch 3

v6->v7
v6: https://lore.kernel.org/bpf/20210617232904.1899-1-zeffron@riotgames.com/

* Add Yonghong Song's Acked-by to commit message in patch 1
* Add Yonghong Song's Acked-by to commit message in patch 2
* Extracted the post-update of the xdp_md context into a function (again)
* Validate that the rx queue was registered with XDP info
* Decrement the reference count on a found netdevice on failure to find
  a valid rx queue
* Decrement the reference count on a found netdevice after the XDP
  program is run
* Drop Yonghong Song's Acked-By for patch 3 because of patch changes
* Improve a comment in the selftests
* Drop Yonghong Song's Acked-By for patch 4 because of patch changes

v5->v6
v5: https://lore.kernel.org/bpf/20210616224712.3243-1-zeffron@riotgames.com/

* Correct commit messages in patches 1 and 3
* Add Acked-by to commit message in patch 4
* Use gotos instead of returns to correctly free resources in
  bpf_prog_test_run_xdp
* Rename xdp_metalen_valid to xdp_metalen_invalid
* Improve the function signature for xdp_metalen_invalid
* Merged declaration of ingress_ifindex and rx_queue_index into one line

v4->v5
v4: https://lore.kernel.org/bpf/20210604220235.6758-1-zeffron@riotgames.com/

* Add new patch to introduce xdp_metalen_valid inline function to avoid
  duplicated code from net/core/filter.c
* Correct size of bad_ctx in selftests
* Make all declarations reverse Christmas tree
* Move data check from xdp_convert_md_to_buff to bpf_prog_test_run_xdp
* Merge xdp_convert_buff_to_md into bpf_prog_test_run_xdp
* Fix line too long
* Extracted common checks in selftests to a helper function
* Removed redundant assignment in selftests
* Reordered test cases in selftests
* Check data against 0 instead of data_meta in selftests
* Made selftests use EINVAL instead of hardcoded 22
* Dropped "_" from XDP function name
* Changed casts in XDP program from unsigned long to long
* Added a comment explaining the use of the loopback interface in selftests
* Change parameter order in xdp_convert_md_to_buff to be input first
* Assigned xdp->ingress_ifindex and xdp->rx_queue_index to local variables in
  xdp_convert_md_to_buff
* Made use of "meta data" versus "metadata" consistent in comments and commit
  messages

v3->v4
v3: https://lore.kernel.org/bpf/20210602190815.8096-1-zeffron@riotgames.com/

* Clean up nits
* Validate xdp_md->data_end in bpf_prog_test_run_xdp
* Remove intermediate metalen variables

v2 -> v3
v2: https://lore.kernel.org/bpf/20210527201341.7128-1-zeffron@riotgames.com/

* Check errno first in selftests
* Use DECLARE_LIBBPF_OPTS
* Rename tattr to opts in selftests
* Remove extra new line
* Rename convert_xdpmd_to_xdpb to xdp_convert_md_to_buff
* Rename convert_xdpb_to_xdpmd to xdp_convert_buff_to_md
* Move declaration of device and rxqueue in xdp_convert_md_to_buff to
  patch 2
* Reorder the kfree calls in bpf_prog_test_run_xdp

v1 -> v2
v1: https://lore.kernel.org/bpf/20210524220555.251473-1-zeffron@riotgames.com

* Fix null pointer dereference with no context
* Use the BPF skeleton and replace CHECK with ASSERT macros
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add test for xdp_md context in BPF_PROG_TEST_RUN

Add a test for using xdp_md as a context to BPF_PROG_TEST_RUN for XDP
programs.

The test uses a BPF program that takes in a return value from XDP
meta data, then reduces the size of the XDP meta data by 4 bytes.

Test cases validate the possible failure cases for passing in invalid
xdp_md contexts, that the return value is successfully passed
in, and that the adjusted meta data is successfully copied out.

Co-developed-by: Cody Haas <chaas@riotgames.com>
Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com>
Signed-off-by: Cody Haas <chaas@riotgames.com>
Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com>
Signed-off-by: Zvi Effron <zeffron@riotgames.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210707221657.3985075-5-zeffron@riotgames.com

bpf: Support specifying ingress via xdp_md context in BPF_PROG_TEST_RUN

Support specifying the ingress_ifindex and rx_queue_index of xdp_md
contexts for BPF_PROG_TEST_RUN.

The intended use case is to allow testing XDP programs that make decisions
based on the ingress interface or RX queue.

If ingress_ifindex is specified, look up the device by the provided index
in the current namespace and use its xdp_rxq for the xdp_buff. If the
rx_queue_index is out of range, or is non-zero when the ingress_ifindex is
0, return -EINVAL.
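
The rule above can be sketched as a small check. 'struct fake_dev',
find_dev() and check_ingress() are hypothetical stand-ins for the device
lookup, and returning -ENODEV for an unknown ifindex is an assumption that is
not stated in this message.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a network device in the current namespace. */
struct fake_dev {
	uint32_t ifindex;
	uint32_t num_rx_queues;
};

static const struct fake_dev *find_dev(const struct fake_dev *devs, size_t n,
				       uint32_t ifindex)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (devs[i].ifindex == ifindex)
			return &devs[i];
	return NULL;
}

/* Returns 0 if the (ifindex, queue) pair is acceptable, negative errno if not. */
static int check_ingress(const struct fake_dev *devs, size_t n,
			 uint32_t ingress_ifindex, uint32_t rx_queue_index)
{
	const struct fake_dev *dev;

	/* Without an ingress interface only queue 0 makes sense. */
	if (!ingress_ifindex)
		return rx_queue_index ? -EINVAL : 0;

	dev = find_dev(devs, n, ingress_ifindex);
	if (!dev)
		return -ENODEV; /* assumption, see above */

	if (rx_queue_index >= dev->num_rx_queues)
		return -EINVAL;
	return 0;
}
```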

Co-developed-by: Cody Haas <chaas@riotgames.com>
Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com>
Signed-off-by: Cody Haas <chaas@riotgames.com>
Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com>
Signed-off-by: Zvi Effron <zeffron@riotgames.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210707221657.3985075-4-zeffron@riotgames.com

bpf: Support input xdp_md context in BPF_PROG_TEST_RUN

Support passing a xdp_md via ctx_in/ctx_out in bpf_attr for
BPF_PROG_TEST_RUN.

The intended use case is to pass some XDP meta data to the test runs of
XDP programs that are used as tail calls.

For programs that use bpf_prog_test_run_xdp, support xdp_md input and
output. Unlike with an actual xdp_md during a non-test run, data_meta must
be 0 because it must point to the start of the provided user data. From
the initial xdp_md, use data and data_end to adjust the pointers in the
generated xdp_buff. All other non-zero fields are prohibited (with
EINVAL). If the user has set ctx_out/ctx_size_out, copy the (potentially
different) xdp_md back to the userspace.

We require all fields of input xdp_md except the ones we explicitly
support to be set to zero. The expectation is that in the future we might
add support for more fields and we want to fail explicitly if the user
runs the program on the kernel where we don't yet support them.

Co-developed-by: Cody Haas <chaas@riotgames.com>
Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com>
Signed-off-by: Cody Haas <chaas@riotgames.com>
Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com>
Signed-off-by: Zvi Effron <zeffron@riotgames.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210707221657.3985075-3-zeffron@riotgames.com
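
The "everything unsupported must be zero" rule from this commit can be
sketched as follows. This is a userspace illustration under stated
assumptions: struct sketch_xdp_md is a local mirror of the xdp_md
fields, sketch_check_xdp_md() is a hypothetical name, and the two
data/data_end sanity checks are assumptions, since the commit text only
says that those fields are used to adjust the xdp_buff pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the xdp_md fields, for illustration only. */
struct sketch_xdp_md {
    uint32_t data;
    uint32_t data_end;
    uint32_t data_meta;
    uint32_t ingress_ifindex;
    uint32_t rx_queue_index;
    uint32_t egress_ifindex;
};

/* Validate a user-supplied context against the rules in the commit text. */
static int sketch_check_xdp_md(const struct sketch_xdp_md *md,
                               uint32_t data_size)
{
    assert(md != NULL);

    /* data_meta must be 0: it points at the start of the user data. */
    if (md->data_meta != 0)
        return -EINVAL;

    /* Fields that are not yet supported must be zero, so that a newer
     * test fails loudly on a kernel that would silently ignore them. */
    if (md->ingress_ifindex || md->rx_queue_index || md->egress_ifindex)
        return -EINVAL;

    /* Assumption: data must not lie past data_end. */
    if (md->data > md->data_end)
        return -EINVAL;

    /* Assumption: data_end must stay within the supplied buffer. */
    if (md->data_end > data_size)
        return -EINVAL;

    return 0;
}
```

Rejecting non-zero unknown fields keeps the interface extensible: a
later kernel can start honoring a field without changing the meaning of
inputs that older kernels accepted.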

bpf: Add function for XDP meta data length check

This commit prepares to use the XDP meta data length check in multiple
places by making it into a static inline function instead of a literal.

Co-developed-by: Cody Haas <chaas@riotgames.com>
Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com>
Signed-off-by: Cody Haas <chaas@riotgames.com>
Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com>
Signed-off-by: Zvi Effron <zeffron@riotgames.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210707221657.3985075-2-zeffron@riotgames.com
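
A rough sketch of what such a helper can look like is given here. The
commit text does not spell out the check itself, so the constraints
used (length is a multiple of 4 bytes and at most 32 bytes) and the
names sketch_metalen_invalid() and SKETCH_META_MAX are assumptions made
for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed upper bound on the metadata area, in bytes. */
#define SKETCH_META_MAX 32u

static_assert(sizeof(uint32_t) == 4, "mask below assumes 4-byte words");

/* One shared predicate instead of an open-coded expression per caller. */
static inline bool sketch_metalen_invalid(unsigned long metalen)
{
    /* Invalid if not a multiple of 4 bytes or larger than the bound. */
    return (metalen & (sizeof(uint32_t) - 1)) || metalen > SKETCH_META_MAX;
}
```

Wrapping the expression in a static inline function lets several call
sites share one definition of "invalid" at no runtime cost.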

Merge branch 'dsa-mv88e6xxx-topaz-fixes'

Marek Behún says:

====================
dsa: mv88e6xxx: Topaz fixes

here come some fixes for the Topaz family (Marvell 88E6141 / 88E6341),
which I found when I compared Topaz's operations structure with that
of Peridot (6390).

This is v2. In v1, I accidentally sent patches generated from the wrong
branch, and the 5th patch did not contain a necessary change in
serdes.c.

Changes from v1:
- the fifth patch, "enable SerDes RX stats for Topaz", needs another
change in serdes.c
- Andrew's Reviewed-by to 1,2,3,4 and 6
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: enable SerDes PCS register dump via ethtool -d on Topaz

Commit bf3504cea7d7e ("net: dsa: mv88e6xxx: Add 6390 family PCS
registers to ethtool -d") added support for dumping SerDes PCS registers
via ethtool -d for Peridot.

The same implementation is also valid for Topaz, but was not
enabled at the time.

Signed-off-by: Marek Behún <kabel@kernel.org>
Fixes: bf3504cea7d7e ("net: dsa: mv88e6xxx: Add 6390 family PCS registers to ethtool -d")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: enable SerDes RX stats for Topaz

Commit 0df952873636a ("mv88e6xxx: Add serdes Rx statistics") added
support for RX statistics on SerDes ports for Peridot.

This same implementation is also valid for Topaz, but was not enabled
at the time.

We need to use the generic .serdes_get_lane() method instead of the
Peridot specific one in the stats methods so that on Topaz the proper
one is used.

Signed-off-by: Marek Behún <kabel@kernel.org>
Fixes: 0df952873636a ("mv88e6xxx: Add serdes Rx statistics")
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: enable devlink ATU hash param for Topaz

Commit 23e8b470c7788 ("net: dsa: mv88e6xxx: Add devlink param for ATU
hash algorithm.") introduced ATU hash algorithm access via devlink, but
did not enable it for Topaz.

Enable this feature also for Topaz.

Signed-off-by: Marek Behún <kabel@kernel.org>
Fixes: 23e8b470c7788 ("net: dsa: mv88e6xxx: Add devlink param for ATU hash algorithm.")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: enable .rmu_disable() on Topaz

Commit 9e5baf9b36367 ("net: dsa: mv88e6xxx: add RMU disable op")
introduced .rmu_disable() method with implementation for several models,
but forgot to add Topaz, which can use the Peridot implementation.

Use the Peridot implementation of .rmu_disable() on Topaz.

Signed-off-by: Marek Behún <kabel@kernel.org>
Fixes: 9e5baf9b36367 ("net: dsa: mv88e6xxx: add RMU disable op")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: use correct .stats_set_histogram() on Topaz

Commit 40cff8fca9e3 ("net: dsa: mv88e6xxx: Fix stats histogram mode")
introduced wrong .stats_set_histogram() method for Topaz family.

The Peridot method should be used instead.

Signed-off-by: Marek Behún <kabel@kernel.org>
Fixes: 40cff8fca9e3 ("net: dsa: mv88e6xxx: Fix stats histogram mode")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: enable .port_set_policy() on Topaz

Commit f3a2cd326e44 ("net: dsa: mv88e6xxx: introduce .port_set_policy")
introduced .port_set_policy() method with implementation for several
models, but forgot to add Topaz, which can use the 6352 implementation.

Use the 6352 implementation of .port_set_policy() on Topaz.

Signed-off-by: Marek Behún <kabel@kernel.org>
Fixes: f3a2cd326e44 ("net: dsa: mv88e6xxx: introduce .port_set_policy")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: return -EOPNOTSUPP when driver does not implement .port_lag_join

The DSA core has a layered structure, and even though we end up
returning 0 (success) to user space when setting a bonding/team upper
that can't be offloaded, some parts of the framework actually need to
know that we couldn't offload that.

For example, if dsa_switch_lag_join returns 0 as it currently does,
dsa_port_lag_join has no way to tell a successful offload from a
software fallback, and it will call dsa_port_bridge_join afterwards.
Then we'll think we're offloading the bridge master of the LAG, when in
fact we're not even offloading the LAG. In turn, this will make us set
skb->offload_fwd_mark = true, which is incorrect and the bridge doesn't
like it.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
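
The layering problem described in this commit can be sketched with a
small userspace model. This is a simplified illustration, not the DSA
code: struct sketch_ops, sketch_switch_lag_join(),
sketch_port_lag_join() and sketch_driver_lag_join() are hypothetical
names, and the two layers shown collapse what the real framework
spreads over more functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical driver ops table with an optional LAG offload hook. */
struct sketch_ops {
    int (*port_lag_join)(int port);
};

/* Hypothetical driver callback that always manages to offload. */
static int sketch_driver_lag_join(int port)
{
    (void)port;
    return 0;
}

/* Lower layer: report a missing hook instead of pretending success. */
static int sketch_switch_lag_join(const struct sketch_ops *ops, int port)
{
    assert(ops != NULL);

    if (!ops->port_lag_join)
        return -EOPNOTSUPP;

    return ops->port_lag_join(port);
}

/* Upper layer: only treat the bridge as offloaded if the LAG was.
 * User space still sees 0 on a software fallback. */
static int sketch_port_lag_join(const struct sketch_ops *ops, int port,
                                bool *bridge_offloaded)
{
    int err = sketch_switch_lag_join(ops, port);

    *bridge_offloaded = false;

    if (err == -EOPNOTSUPP)
        return 0;   /* software fallback, nothing offloaded */
    if (err)
        return err; /* genuine failure */

    *bridge_offloaded = true;
    return 0;
}
```

With a return value of 0 from the lower layer in both cases, the upper
layer could not set bridge_offloaded correctly, which is the confusion
the commit message describes.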

Merge branch 'octeopntx2-LMTST-regions'

Geetha sowjanya says:

====================
Dynamic LMTST region setup

This patch series allows RVU PF/VF to allocate memory for
LMTST operations instead of using memory reserved by firmware
which is mapped as device memory.
The LMTST mapping table contains the RVU PF/VF LMTST memory base
address entries. This table is used by hardware for LMTST operations.
Patch 1 introduces a new mailbox message to update the LMTST table with
the newly allocated memory address.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

octeontx2-pf: cn10k: Use runtime allocated LMTLINE region

The current driver uses a static LMTST region allocated by firmware.
This memory gets populated as PF/VF BAR2. The RVU PF/VF driver ioremaps
the memory as device memory for NIX/NPA operation. Since the memory
is mapped as device memory, we see performance degradation. To address
this issue, this patch implements runtime memory allocation.
RVU PF/VF allocates memory during device probe and shares the base
address with RVU AF. RVU AF then configures the LMT MAP table
accordingly.

Signed-off-by: Geetha sowjanya <gakula@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

octeontx2-af: cn10k: Support configurable LMTST regions

This patch extends the lmtst_tbl_setup_req mbox to support run time
LMTST configuration.
RVU PF/VF and DPDK/ODP allocate an LMT region and create a translation
entry for a device via VFIO IOCTLs.
This IOVA is shared with AF through the above mbox. AF then uses the
RVU_SMMU translation widget to get the PA for the IOVA and updates
the LMT table entry for that device.

Signed-off-by: Geetha sowjanya <gakula@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

octeontx2-af: cn10k: Setting up lmtst map table

Introduce a new mailbox to support updating lmt entries
and a common lmt base address scheme, i.e. multiple pcifuncs
can share an lmt region to reduce L1 cache pressure for the application.
The parameters passed to the mailbox include the primary pcifunc
value whose lmt region will be shared by the other, secondary
pcifuncs. Here the secondary pcifunc is the one that is
calling the mailbox.
For example:
By default each pcifunc has its own LMT base address:
        PCIFUNC1    LMT_BASE_ADDR A
        PCIFUNC2    LMT_BASE_ADDR B
        PCIFUNC3    LMT_BASE_ADDR C
        PCIFUNC4    LMT_BASE_ADDR D
The application chooses PCIFUNC1 as the base/primary pcifunc,
and as the other (secondary) pcifuncs get probed,
this mailbox is called and the LMTST table is
updated as:
        PCIFUNC1    LMT_BASE_ADDR A
        PCIFUNC2    LMT_BASE_ADDR A
        PCIFUNC3    LMT_BASE_ADDR A
        PCIFUNC4    LMT_BASE_ADDR A

On FLR, the lmtst map table is reset to the default lmt
base addresses for all secondary pcifuncs.

Signed-off-by: Harman Kalra <hkalra@marvell.com>
Signed-off-by: Geetha sowjanya <gakula@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
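
The sharing scheme in this commit message can be modelled with a small
table in userspace C. This is only a sketch: struct sketch_lmt_map and
the sketch_lmt_*() helpers are hypothetical names, the table is indexed
from 0 rather than by a real pcifunc value, and nothing here reflects
the actual mailbox message layout.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NUM_PCIFUNC 4

/* Per-pcifunc LMT base address: the default and the one in effect. */
struct sketch_lmt_map {
    uint64_t dflt[SKETCH_NUM_PCIFUNC];
    uint64_t cur[SKETCH_NUM_PCIFUNC];
};

/* Start with every pcifunc pointing at its own LMT base address. */
static void sketch_lmt_init(struct sketch_lmt_map *m, const uint64_t *bases)
{
    for (unsigned int i = 0; i < SKETCH_NUM_PCIFUNC; i++) {
        m->dflt[i] = bases[i];
        m->cur[i] = bases[i];
    }
}

/* Mailbox handler sketch: the calling (secondary) pcifunc adopts the
 * LMT base address of the primary pcifunc it names. */
static void sketch_lmt_share(struct sketch_lmt_map *m,
                             unsigned int secondary, unsigned int primary)
{
    assert(secondary < SKETCH_NUM_PCIFUNC && primary < SKETCH_NUM_PCIFUNC);
    m->cur[secondary] = m->cur[primary];
}

/* FLR sketch: the pcifunc falls back to its default base address. */
static void sketch_lmt_flr(struct sketch_lmt_map *m, unsigned int func)
{
    assert(func < SKETCH_NUM_PCIFUNC);
    m->cur[func] = m->dflt[func];
}
```

Keeping the default next to the current value makes the FLR reset a
plain copy, with no need to recompute the original address.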

Merge tag 'net-next-5.14' of git://git./linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
"Core:

   - BPF:
      - add syscall program type and libbpf support for generating
        instructions and bindings for in-kernel BPF loaders (BPF loaders
        for BPF), this is a stepping stone for signed BPF programs
      - infrastructure to migrate TCP child sockets from one listener to
        another in the same reuseport group/map to improve flexibility
        of service hand-off/restart
      - add broadcast support to XDP redirect

   - allow bypass of the lockless qdisc to improve performance (for
     pktgen: +23% with one thread, +44% with 2 threads)

   - add a simpler version of "DO_ONCE()" which does not require jump
     labels, intended for slow-path usage

   - virtio/vsock: introduce SOCK_SEQPACKET support

   - add getsockopt to retrieve netns cookie

   - ip: treat lowest address of an IPv4 subnet as ordinary unicast
     address allowing reclaiming of precious IPv4 addresses

   - ipv6: use prandom_u32() for ID generation

   - ip: add support for more flexible field selection for hashing
     across multi-path routes (w/ offload to mlxsw)

   - icmp: add support for extended RFC 8335 PROBE (ping)

   - seg6: add support for SRv6 End.DT46 behavior

   - mptcp:
      - DSS checksum support (RFC 8684) to detect middlebox meddling
      - support Connection-time 'C' flag
      - time stamping support

   - sctp: Packetization Layer Path MTU Discovery (RFC 8899)

   - xfrm: speed up state addition with seq set

   - WiFi:
      - hidden AP discovery on 6 GHz and other HE 6 GHz improvements
      - aggregation handling improvements for some drivers
      - minstrel improvements for no-ack frames
      - deferred rate control for TXQs to improve reaction times
      - switch from round robin to virtual time-based airtime scheduler

   - add trace points:
      - tcp checksum errors
      - openvswitch - action execution, upcalls
      - socket errors via sk_error_report

  Device APIs:

   - devlink: add rate API for hierarchical control of max egress rate
     of virtual devices (VFs, SFs etc.)

   - don't require RCU read lock to be held around BPF hooks in NAPI
     context

   - page_pool: generic buffer recycling

  New hardware/drivers:

   - mobile:
      - iosm: PCIe Driver for Intel M.2 Modem
      - support for Qualcomm MSM8998 (ipa)

   - WiFi: Qualcomm QCN9074 and WCN6855 PCI devices

   - sparx5: Microchip SparX-5 family of Enterprise Ethernet switches

   - Mellanox BlueField Gigabit Ethernet (control NIC of the DPU)

   - NXP SJA1110 Automotive Ethernet 10-port switch

   - Qualcomm QCA8327 switch support (qca8k)

   - Mikrotik 10/25G NIC (atl1c)

  Driver changes:

   - ACPI support for some MDIO, MAC and PHY devices from Marvell and
     NXP (our first foray into MAC/PHY description via ACPI)

   - HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx

   - Mellanox/Nvidia NIC (mlx5)
      - NIC VF offload of L2 bridging
      - support IRQ distribution to Sub-functions

   - Marvell (prestera):
      - add flower and match all
      - devlink trap
      - link aggregation

   - Netronome (nfp): connection tracking offload

   - Intel 1GE (igc): add AF_XDP support

   - Marvell DPU (octeontx2): ingress ratelimit offload

   - Google vNIC (gve): new ring/descriptor format support

   - Qualcomm mobile (rmnet & ipa): inline checksum offload support

   - MediaTek WiFi (mt76)
      - mt7915 MSI support
      - mt7915 Tx status reporting
      - mt7915 thermal sensors support
      - mt7921 decapsulation offload
      - mt7921 enable runtime pm and deep sleep

   - Realtek WiFi (rtw88)
      - beacon filter support
      - Tx antenna path diversity support
      - firmware crash information via devcoredump

   - Qualcomm WiFi (wcn36xx)
      - Wake-on-WLAN support with magic packets and GTK rekeying

   - Micrel PHY (ksz886x/ksz8081): add cable test support"

* tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2168 commits)
  tcp: change ICSK_CA_PRIV_SIZE definition
  tcp_yeah: check struct yeah size at compile time
  gve: DQO: Fix off by one in gve_rx_dqo()
  stmmac: intel: set PCI_D3hot in suspend
  stmmac: intel: Enable PHY WOL option in EHL
  net: stmmac: option to enable PHY WOL with PMT enabled
  net: say "local" instead of "static" addresses in ndo_dflt_fdb_{add,del}
  net: use netdev_info in ndo_dflt_fdb_{add,del}
  ptp: Set lookup cookie when creating a PTP PPS source.
  net: sock: add trace for socket errors
  net: sock: introduce sk_error_report
  net: dsa: replay the local bridge FDB entries pointing to the bridge dev too
  net: dsa: ensure during dsa_fdb_offload_notify that dev_hold and dev_put are on the same dev
  net: dsa: include fdb entries pointing to bridge in the host fdb list
  net: dsa: include bridge addresses which are local in the host fdb list
  net: dsa: sync static FDB entries on foreign interfaces to hardware
  net: dsa: install the host MDB and FDB entries in the master's RX filter
  net: dsa: reference count the FDB addresses at the cross-chip notifier level
  net: dsa: introduce a separate cross-chip notifier type for host FDBs
  net: dsa: reference count the MDB entries at the cross-chip notifier level
  ...

Merge tag 'sched-urgent-2021-06-30' of git://git./linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:

- Fix a small inconsistency (bug) in load tracking, caught by a new
   warning that several people reported.

- Flip CONFIG_SCHED_CORE to default-disabled, and update the Kconfig
   help text.

* tag 'sched-urgent-2021-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Disable CONFIG_SCHED_CORE by default
  sched/fair: Ensure _sum and _avg values stay consistent

Merge tag 'microblaze-v5.14' of git://git.monstr.eu/linux-2.6-microblaze

Pull microblaze updates from Michal Simek:

- Remove unused PAGE_UP/DOWN macros

- Fix trivial spelling mistake

* tag 'microblaze-v5.14' of git://git.monstr.eu/linux-2.6-microblaze:
  arch: microblaze: Fix spelling mistake "vesion" -> "version"
  microblaze: Cleanup unused functions

Merge tag 'safesetid-5.14' of git://github.com/micah-morton/linux

Pull SafeSetID update from Micah Morton:
"One very minor code cleanup change that marks a variable as
__initdata"

* tag 'safesetid-5.14' of git://github.com/micah-morton/linux:
  LSM: SafeSetID: Mark safesetid_initialized as __initdata

Merge tag 'Smack-for-5.14' of git://github.com/cschaufler/smack-next

Pull smack updates from Casey Schaufler:
"There is nothing more significant than an improvement to a byte count
  check in smackfs.

  All changes have been in next for weeks"

* tag 'Smack-for-5.14' of git://github.com/cschaufler/smack-next:
  Smack: fix doc warning
  Revert "Smack: Handle io_uring kernel thread privileges"
  smackfs: restrict bytes count in smk_set_cipso()
  security/smack/: fix misspellings using codespell tool

Merge tag 'audit-pr-20210629' of git://git./linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:
"Another merge window, another small audit pull request.

  Four patches in total: one is cosmetic, one removes an unnecessary
  initialization, one renames some enum values to prevent name
  collisions, and one converts list_del()/list_add() to list_move().

  None of these are earth shattering and all pass the audit-testsuite
  tests while merging cleanly on top of your tree from earlier today"

* tag 'audit-pr-20210629' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: remove unnecessary 'ret' initialization
  audit: remove trailing spaces and tabs
  audit: Use list_move instead of list_del/list_add
  audit: Rename enum audit_state constants to avoid AUDIT_DISABLED redefinition
  audit: add blank line after variable declarations

Merge tag 'selinux-pr-20210629' of git://git./linux/kernel/git/pcmoore/selinux

Pull SELinux updates from Paul Moore:

- The slow_avc_audit() function is now non-blocking so we can remove
   the AVC_NONBLOCKING tricks; this also includes the 'flags' variant of
   avc_has_perm().

- Use kmemdup() instead of kcalloc()+copy when copying parts of the
   SELinux policydb.

- The InfiniBand device name is now passed by reference when possible
   in the SELinux code, removing a strncpy().

- Minor cleanups including: constification of avtab function args,
   removal of useless LSM/XFRM function args, SELinux kdoc fixes, and
   removal of redundant assignments.

* tag 'selinux-pr-20210629' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  selinux: kill 'flags' argument in avc_has_perm_flags() and avc_audit()
  selinux: slow_avc_audit has become non-blocking
  selinux: Fix kernel-doc
  selinux: use __GFP_NOWARN with GFP_NOWAIT in the AVC
  lsm_audit,selinux: pass IB device name by reference
  selinux: Remove redundant assignment to rc
  selinux: Corrected comment to match kernel-doc comment
  selinux: delete selinux_xfrm_policy_lookup() useless argument
  selinux: constify some avtab function arguments
  selinux: simplify duplicate_policydb_cond_list() by using kmemdup()

Merge tag 'clang-features-v5.14-rc1' of git://git./linux/kernel/git/kees/linux

Pull clang feature updates from Kees Cook:

- Add CC_HAS_NO_PROFILE_FN_ATTR in preparation for PGO support in the
   face of the noinstr attribute, paving the way for PGO and fixing
   GCOV. (Nick Desaulniers)

- x86_64 LTO coverage is expanded to 32-bit x86. (Nathan Chancellor)

- Small fixes to CFI. (Mark Rutland, Nathan Chancellor)

* tag 'clang-features-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  qemu_fw_cfg: Make fw_cfg_rev_attr a proper kobj_attribute
  Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTR
  compiler_attributes.h: cleanups for GCC 4.9+
  compiler_attributes.h: define __no_profile, add to noinstr
  x86, lto: Enable Clang LTO for 32-bit as well
  CFI: Move function_nocfi() into compiler.h
  MAINTAINERS: Add Clang CFI section

Merge tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
"Pretty calm round, mostly just NVMe and a bit of MD:

   - NVMe updates (via Christoph)
        - improve the APST configuration algorithm (Alexey Bogoslavsky)
        - look for StorageD3Enable on companion ACPI device
          (Mario Limonciello)
        - allow selecting the network interface for TCP connections
          (Martin Belanger)
        - misc cleanups (Amit Engel, Chaitanya Kulkarni, Colin Ian King,
          Christoph)
        - move the ACPI StorageD3 code to drivers/acpi/ and add quirks
          for certain AMD CPUs (Mario Limonciello)
        - zoned device support for nvmet (Chaitanya Kulkarni)
        - fix the rules for changing the serial number in nvmet
          (Noam Gottlieb)
        - various small fixes and cleanups (Dan Carpenter, JK Kim,
          Chaitanya Kulkarni, Hannes Reinecke, Wesley Sheng, Geert
          Uytterhoeven, Daniel Wagner)

   - MD updates (Via Song)
        - iostats rewrite (Guoqing Jiang)
        - raid5 lock contention optimization (Gal Ofri)

   - Fall through warning fix (Gustavo)

   - Misc fixes (Gustavo, Jiapeng)"

* tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block: (78 commits)
  nvmet: use NVMET_MAX_NAMESPACES to set nn value
  loop: Fix missing discard support when using LOOP_CONFIGURE
  nvme.h: add missing nvme_lba_range_type endianness annotations
  nvme: remove zeroout memset call for struct
  nvme-pci: remove zeroout memset call for struct
  nvmet: remove zeroout memset call for struct
  nvmet: add ZBD over ZNS backend support
  nvmet: add Command Set Identifier support
  nvmet: add nvmet_req_bio put helper for backends
  nvmet: add req cns error complete helper
  block: export blk_next_bio()
  nvmet: remove local variable
  nvmet: use nvme status value directly
  nvmet: use u32 type for the local variable nsid
  nvmet: use u32 for nvmet_subsys max_nsid
  nvmet: use req->cmd directly in file-ns fast path
  nvmet: use req->cmd directly in bdev-ns fast path
  nvmet: make ver stable once connection established
  nvmet: allow mn change if subsys not discovered
  nvmet: make sn stable once connection was established
  ...

Merge tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:

- disk events cleanup (Christoph)

- gendisk and request queue allocation simplifications (Christoph)

- bdev_disk_changed cleanups (Christoph)

- IO priority improvements (Bart)

- Chained bio completion trace fix (Edward)

- blk-wbt fixes (Jan)

- blk-wbt enable/disable fix (Zhang)

- Scheduler dispatch improvements (Jan, Ming)

- Shared tagset scheduler improvements (John)

- BFQ updates (Paolo, Luca, Pietro)

- BFQ lock inversion fix (Jan)

- Documentation improvements (Kir)

- CLONE_IO block cgroup fix (Tejun)

- Removal of the ancient and deprecated block dump feature (zhangyi)

- Discard merge fix (Ming)

- Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas,
   Yang)

* tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits)
  block: fix discard request merge
  block/mq-deadline: Remove a WARN_ON_ONCE() call
  blk-mq: update hctx->dispatch_busy in case of real scheduler
  blk: Fix lock inversion between ioc lock and bfqd lock
  bfq: Remove merged request already in bfq_requests_merged()
  block: pass a gendisk to bdev_disk_changed
  block: move bdev_disk_changed
  block: add the events* attributes to disk_attrs
  block: move the disk events code to a separate file
  block: fix trace completion for chained bio
  block/partitions/msdos: Fix typo inidicator -> indicator
  block, bfq: reset waker pointer with shared queues
  block, bfq: check waker only for queues with no in-flight I/O
  block, bfq: avoid delayed merge of async queues
  block, bfq: boost throughput by extending queue-merging times
  block, bfq: consider also creation time in delayed stable merge
  block, bfq: fix delayed stable merge check
  block, bfq: let also stably merged queues enjoy weight raising
  blk-wbt: make sure throttle is enabled properly
  blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()
  ...

Merge branch 'for-linus' of git://git./linux/kernel/git/hid/hid

Pull HID updates from Jiri Kosina:

- patch series that ensures that hid-multitouch driver disables touch
   and button-press reporting on hid-mt devices during suspend when the
   device is not configured as a wakeup-source, from Hans de Goede

- support for ISH DMA on Intel EHL platform, from Even Xu

- support for Renoir and Cezanne SoCs, Ambient Light Sensor and Human
   Presence Detection sensor for amd-sfh driver, from Basavaraj Natikar

- other assorted code cleanups and device-specific fixes/quirks

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: (45 commits)
  HID: thrustmaster: Switch to kmemdup() when allocate change_request
  HID: multitouch: Disable event reporting on suspend when the device is not a wakeup-source
  HID: logitech-dj: Implement may_wakeup ll-driver callback
  HID: usbhid: Implement may_wakeup ll-driver callback
  HID: core: Add hid_hw_may_wakeup() function
  HID: input: Add support for Programmable Buttons
  HID: wacom: Correct base usage for capacitive ExpressKey status bits
  HID: amd_sfh: Add initial support for HPD sensor
  HID: amd_sfh: Extend ALS support for newer AMD platform
  HID: amd_sfh: Extend driver capabilities for multi-generation support
  HID: surface-hid: Fix get-report request
  HID: sony: fix freeze when inserting ghlive ps3/wii dongles
  HID: usbkbd: Avoid GFP_ATOMIC when GFP_KERNEL is possible
  HID: amd_sfh: change in maintainer
  HID: intel-ish-hid: ipc: Specify that EHL no cache snooping
  HID: intel-ish-hid: ishtp: Add dma_no_cache_snooping() callback
  HID: intel-ish-hid: Set ISH driver depends on x86
  HID: hid-input: add Surface Go battery quirk
  HID: intel-ish-hid: Fix minor typos in comments
  HID: usbmouse: Avoid GFP_ATOMIC when GFP_KERNEL is possible
  ...

Merge tag 'edac_updates_for_v5.14' of git://git./linux/kernel/git/ras/ras

Pull EDAC updates from Tony Luck:
"Various fixes and support for new CPUs:

   - Clean up error messages from thunderx_edac

   - Add MODULE_DEVICE_TABLE to ti_edac so it will autoload

   - Use %pR to print resources in aspeed_edac

   - Add Yazen Ghannam as MAINTAINER for AMD edac drivers

   - Fix Ice Lake and Sapphire Rapids drivers to report correct "near"
     or "far" device for errors in 2LM configurations

   - Add support for on-package high bandwidth memory in Sapphire Rapids

   - New CPU support for three CPUs supporting in-band ECC (IOT SKUs for
     ICL-NNPI, Tiger Lake and Alder Lake)

   - Don't even try to load Intel EDAC drivers when running as a guest

   - Fix Kconfig dependency on X86_MCE_INTEL for EDAC_IGEN6"

* tag 'edac_updates_for_v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/igen6: fix core dependency
  EDAC/Intel: Do not load EDAC driver when running as a guest
  EDAC/igen6: Add Intel Alder Lake SoC support
  EDAC/igen6: Add Intel Tiger Lake SoC support
  EDAC/igen6: Add Intel ICL-NNPI SoC support
  EDAC/i10nm: Add support for high bandwidth memory
  EDAC/i10nm: Add detection of memory levels for ICX/SPR servers
  EDAC/skx_common: Add new ADXL components for 2-level memory
  MAINTAINERS: Make Yazen Ghannam maintainer for EDAC-AMD64
  EDAC/aspeed: Use proper format string for printing resource
  EDAC/ti: Add missing MODULE_DEVICE_TABLE
  EDAC/thunderx: Remove irrelevant variable from error messages

Merge tag 'tpmdd-next-v5.14-rc1' of git://git./linux/kernel/git/jarkko/linux-tpmdd

Pull tpm driver updates from Jarkko Sakkinen:
"Bug fixes for TPM"

[ This isn't actually the whole contents of the tag and thus doesn't
  contain Jarkko's signature - I dropped the two top commits that added
  support for signing modules using elliptic curve keys because there's
  a new series for that that fixes a few confusing things - Linus ]

* tag 'tpmdd-next-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
  tpm: Replace WARN_ONCE() with dev_err_once() in tpm_tis_status()
  tpm_tis: Use DEFINE_RES_MEM() to simplify code
  tpm: fix some doc warnings in tpm1-cmd.c
  tpm_tis_spi: add missing SPI device ID entries
  tpm: add longer timeout for TPM2_CC_VERIFY_SIGNATURE
  char: tpm: move to use request_irq by IRQF_NO_AUTOEN flag
  tpm_tis_spi: set default probe function if device id not match
  tpm_crb: Use IOMEM_ERR_PTR when function returns iomem

Merge tag 'platform-drivers-x86-v5.14-1' of git://git./linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform driver updates from Hans de Goede:
"Highlights:

   - New think-lmi driver adding support for changing Lenovo Thinkpad
     BIOS settings from within Linux using the standard firmware-
     attributes class sysfs API

   - MS Surface aggregator-cdev now also supports forwarding events to
     user-space (for debugging / new driver development purposes only)

   - New intel_skl_int3472 driver, which provides the necessary glue to
     translate ACPI table information to GPIOs, regulators, etc. for
     camera sensors on Intel devices with IPU3 attached MIPI cameras

   - A whole bunch of other fixes + device-specific quirk additions

   - New devm_work_autocancel() devm-helpers.h function"

* tag 'platform-drivers-x86-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (83 commits)
  platform/x86: dell-wmi-sysman: Change user experience when Admin/System Password is modified
  platform/x86: intel_skl_int3472: Uninitialized variable in skl_int3472_handle_gpio_resources()
  platform/x86: think-lmi: Move kfree(setting->possible_values) to tlmi_attr_setting_release()
  platform/x86: think-lmi: Split current_value to reflect only the value
  platform/x86: think-lmi: Fix issues with duplicate attributes
  platform/x86: think-lmi: Return EINVAL when kbdlang gets set to a 0 length string
  platform/x86: intel_cht_int33fe: Move to its own subfolder
  platform/x86: intel_skl_int3472: Move to intel/ subfolder
  platform/x86: intel_skl_int3472: Provide skl_int3472_unregister_clock()
  platform/x86: intel_skl_int3472: Provide skl_int3472_unregister_regulator()
  platform/x86: intel_skl_int3472: Use ACPI GPIO resource directly
  platform/x86: intel_skl_int3472: Fix dependencies (drop CLKDEV_LOOKUP)
  platform/x86: intel_skl_int3472: Free ACPI device resources after use
  platform/x86: Remove "default n" entries
  platform/x86: ISST: Use numa node id for cpu pci dev mapping
  platform/x86: ISST: Optimize CPU to PCI device mapping
  tools/power/x86/intel-speed-select: v1.10 release
  tools/power/x86/intel-speed-select: Fix uncore memory frequency display
  extcon: extcon-max8997: Simplify driver using devm
  extcon: extcon-max8997: Fix IRQ freeing at error path
  ...

Merge tag 'mailbox-v5.14' of git://git.linaro.org/landing-teams/working/fujitsu/integration

Pull mailbox updates from Jassi Brar:

- imx: add support for i.MX8ULP

- mtk: code change around callback struct

- qcom: add sm6125, MSM8939 fix for channel exhaustion

- microchip: add support for polarfire controller

- misc: cosmetic changes to bcm-2835, flexrm, pdc, arm-mhu and hisilicon

* tag 'mailbox-v5.14' of git://git.linaro.org/landing-teams/working/fujitsu/integration: (26 commits)
  MAINTAINERS: add entry for polarfire soc mailbox
  dt-bindings: add bindings for polarfire soc system controller
  mbox: add polarfire soc system controller mailbox
  dt-bindings: add bindings for polarfire soc mailbox
  mailbox: imx: Avoid using val uninitialized in imx_mu_isr()
  mailbox: qcom: Add MSM8939 APCS support
  mailbox: qcom: Use PLATFORM_DEVID_AUTO to register platform device
  dt-bindings: mailbox: qcom: Add MSM8939 APCS compatible
  mailbox: qcom-apcs: Add SM6125 compatible
  dt-bindings: mailbox: Add binding for sm6125
  mailbox: mtk-cmdq: Fix uninitialized variable in cmdq_mbox_flush()
  mailbox: bcm-flexrm-mailbox: Remove redundant dev_err call in flexrm_mbox_probe()
  mailbox: bcm2835: Remove redundant dev_err call in bcm2835_mbox_probe()
  mailbox: qcom-ipcc: Fix IPCC mbox channel exhaustion
  mailbox: mtk-cmdq: Add struct cmdq_pkt in struct cmdq_cb_data
  mailbox: mtk-cmdq: Use mailbox rx_callback
  mailbox: mtk-cmdq: Remove cmdq_cb_status
  mailbox: imx-mailbox: support i.MX8ULP MU
  mailbox: imx: add xSR/xCR register array
  mailbox: imx: replace the xTR/xRR array with single register
  ...

Merge tag 'for-linus-5.14-1' of git://github.com/cminyard/linux-ipmi

Pull IPMI driver updates from Corey Minyard:
"Mostly a restructure of the kcs_bmc driver to make it easier to use
  with different types of devices, and just to clean things up and
  improve things.

  Also some bug fixes for the kcs_bmc driver.

  One fix to the IPMI watchdog to stop the timer when the action is
  none. Not a big deal, but it's the right thing to do"

* tag 'for-linus-5.14-1' of git://github.com/cminyard/linux-ipmi:
  ipmi: kcs_bmc_aspeed: Fix less than zero comparison of a unsigned int
  ipmi: kcs_bmc_aspeed: Optionally apply status address
  ipmi: kcs_bmc_aspeed: Fix IBFIE typo from datasheet
  ipmi: kcs_bmc_aspeed: Implement KCS SerIRQ configuration
  dt-bindings: ipmi: Add optional SerIRQ property to ASPEED KCS devices
  dt-bindings: ipmi: Convert ASPEED KCS binding to schema
  ipmi: kcs_bmc: Add serio adaptor
  ipmi: kcs_bmc: Enable IBF on open
  ipmi: kcs_bmc: Allow clients to control KCS IRQ state
  ipmi: kcs_bmc: Decouple the IPMI chardev from the core
  ipmi: kcs_bmc: Strip private client data from struct kcs_bmc
  ipmi: kcs_bmc: Split headers into device and client
  ipmi: kcs_bmc: Turn the driver data-structures inside-out
  ipmi: kcs_bmc: Split out kcs_bmc_cdev_ipmi
  ipmi: kcs_bmc: Rename {read,write}_{status,data}() functions
  ipmi: kcs_bmc: Make status update atomic
  ipmi: kcs_bmc_aspeed: Use of match data to extract KCS properties
  ipmi/watchdog: Stop watchdog timer when the current action is 'none'

Merge branch 'sched/core' into sched/urgent, to pick up fix

Pick up a fix for a warning that several people reported.

Signed-off-by: Ingo Molnar <mingo@kernel.org>

Merge branch 'for-5.14/multitouch' into for-linus

- patch series that ensures that hid-multitouch driver disables touch and
button-press reporting on hid-mt devices during suspend when the device is
not configured as a wakeup-source, from Hans de Goede

Merge branch 'for-5.14/logitech' into for-linus

- support for LCD menu keys + LCD brightness control on the Logitech Z-10
speakers (with LCD) which use the same protocol as the G15 keyboards
from Hans de Goede

Merge branch 'for-5.14/intel-ish' into for-linus

- support for ISH DMA on EHL platform from Even Xu
- various code style fixes and cleanups from Lee Jones and Uwe Kleine-König

Merge branch 'for-5.14/google' into for-linus

- device tree match for Google Whiskers device from Ikjoon Jang

Merge branch 'for-5.14/core' into for-linus

- device unbinding locking fix from Dmitry Torokhov
- support for programmable buttons (mapping to KEY_MACRO# event codes)
from Thomas Weißschuh
- various other small fixes and code style improvements

Merge branch 'for-5.14/amd-sfh' into for-linus

- support for Renoir and Cezanne SoCs
- support for Ambient Light Sensor
- support for Human Presence Detection sensor

all from Basavaraj Natikar

Merge tag 'dlm-5.14' of git://git./linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
"This is a major dlm networking enhancement that adds message
  retransmission so that the dlm can reliably continue operating when
  network connections fail and nodes reconnect.

  Previously, this would result in lost messages which could only be
  handled as a node failure"

* tag 'dlm-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: (26 commits)
  fs: dlm: invalid buffer access in lookup error
  fs: dlm: fix race in mhandle deletion
  fs: dlm: rename socket and app buffer defines
  fs: dlm: introduce proto values
  fs: dlm: move dlm allow conn
  fs: dlm: use alloc_ordered_workqueue
  fs: dlm: fix memory leak when fenced
  fs: dlm: fix lowcomms_start error case
  fs: dlm: Fix spelling mistake "stucked" -> "stuck"
  fs: dlm: Fix memory leak of object mh
  fs: dlm: don't allow half transmitted messages
  fs: dlm: add midcomms debugfs functionality
  fs: dlm: add reliable connection if reconnect
  fs: dlm: add union in dlm header for lockspace id
  fs: dlm: move out some hash functionality
  fs: dlm: add functionality to re-transmit a message
  fs: dlm: make buffer handling per msg
  fs: dlm: add more midcomms hooks
  fs: dlm: public header in out utility
  fs: dlm: fix connection tcp EOF handling
  ...

Merge tag 'gfs2-v5.13-fixes' of git://git./linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Andreas Gruenbacher:
"Various minor gfs2 cleanups and fixes"

* tag 'gfs2-v5.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Clean up gfs2_unstuff_dinode
  gfs2: Unstuff before locking page in gfs2_page_mkwrite
  gfs2: Clean up the error handling in gfs2_page_mkwrite
  gfs2: Fix error handling in init_statfs
  gfs2: Fix underflow in gfs2_page_mkwrite
  gfs2: Use list_move_tail instead of list_del/list_add_tail
  gfs2: Fix do_gfs2_set_flags description

Merge tag '5.14-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs updates from Steve French:

- improve fallocate emulation

- DFS fixes

- minor multichannel fixes

- various cleanup patches, many to address Coverity warnings

* tag '5.14-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6: (38 commits)
  smb3: prevent races updating CurrentMid
  cifs: fix missing spinlock around update to ses->status
  cifs: missing null pointer check in cifs_mount
  smb3: fix possible access to uninitialized pointer to DACL
  cifs: missing null check for newinode pointer
  cifs: remove two cases where rc is set unnecessarily in sid_to_id
  SMB3: Add new info level for query directory
  cifs: fix NULL dereference in smb2_check_message()
  smbdirect: missing rc checks while waiting for rdma events
  cifs: Avoid field over-reading memcpy()
  smb311: remove dead code for non compounded posix query info
  cifs: fix SMB1 error path in cifs_get_file_info_unix
  smb3: fix uninitialized value for port in witness protocol move
  cifs: fix unneeded null check
  cifs: use SPDX-Licence-Identifier
  cifs: convert list_for_each to entry variant in cifs_debug.c
  cifs: convert list_for_each to entry variant in smb2misc.c
  cifs: avoid extra calls in posix_info_parse
  cifs: retry lookup and readdir when EAGAIN is returned.
  cifs: fix check of dfs interlinks
  ...

Merge tag 'fs.openat2.unknown_flags.v5.14' of git://git./linux/kernel/git/brauner/linux

Pull openat2 fixes from Christian Brauner:

- Remove the unused VALID_UPGRADE_FLAGS define we carried from an
   extension to openat2() that we haven't merged. Aleksa might be
   getting back to it at some point but just not right now.

- openat2() used to accidentally ignore unknown flag values in the upper
   32 bits.

   The new openat2() syscall verifies that no unknown O-flag values are
   set and returns an error to userspace if they are while the older
   open syscalls like open() and openat() simply ignore unknown flag
   values:

      #define O_FLAG_CURRENTLY_INVALID (1 << 31)
      struct open_how how = {
            .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID,
            .resolve = 0,
      };

      /* fails */
      fd = openat2(-EBADF, "/dev/null", &how, sizeof(how));

      /* succeeds */
      fd = openat(-EBADF, "/dev/null", O_RDONLY | O_FLAG_CURRENTLY_INVALID);

   However, openat2() silently truncates the upper 32 bits meaning:

      #define O_FLAG_CURRENTLY_INVALID_LOWER32 (1 << 31)
      #define O_FLAG_CURRENTLY_INVALID_UPPER32 (1ULL << 40)

      struct open_how how_lower32 = {
            .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID_LOWER32,
      };

      struct open_how how_upper32 = {
            .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID_UPPER32,
      };

      /* fails */
      fd = openat2(-EBADF, "/dev/null", &how_lower32, sizeof(how_lower32));

      /* succeeds */
      fd = openat2(-EBADF, "/dev/null", &how_upper32, sizeof(how_upper32));

   Fix this by preventing the immediate truncation in build_open_flags()
   and add a compile-time check to catch when we add flags in the upper
   32 bit range.
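The truncation described above can be reproduced in user space without calling the syscall at all. The following is a minimal sketch, not the kernel's build_open_flags(): the mask SKETCH_VALID_FLAGS and both helper names are hypothetical and exist only for illustration. It contrasts validating after narrowing to 32 bits with validating the full 64-bit value, and adds a compile-time check in the spirit of the one mentioned in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask for this sketch: pretend only the low 16 bits are known flags. */
#define SKETCH_VALID_FLAGS UINT64_C(0xffff)

/* Compile-time guard: fails the build if a known flag ever lands in the upper 32 bits. */
static_assert((SKETCH_VALID_FLAGS >> 32) == 0,
	      "known flags must fit in the lower 32 bits");

/* Buggy pattern: the 64-bit value is narrowed to 32 bits before the check. */
static bool flags_ok_truncating(uint64_t how_flags)
{
	uint32_t flags = (uint32_t)how_flags;	/* bits 32..63 are lost here */

	return (flags & ~(uint32_t)SKETCH_VALID_FLAGS) == 0;
}

/* Fixed pattern: validate the full 64-bit value before any narrowing. */
static bool flags_ok_full_width(uint64_t how_flags)
{
	return (how_flags & ~SKETCH_VALID_FLAGS) == 0;
}
```

With these helpers, a value with only bit 40 set passes flags_ok_truncating() but is rejected by flags_ok_full_width(), which mirrors the "succeeds" case shown above.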

* tag 'fs.openat2.unknown_flags.v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  test: add openat2() test for invalid upper 32 bit flag value
  open: don't silently ignore unknown O-flags in openat2()
  fcntl: remove unused VALID_UPGRADE_FLAGS

Merge tag 'fs.mount_setattr.nosymfollow.v5.14' of git://git./linux/kernel/git/brauner/linux

Pull mount_setattr updates from Christian Brauner:
"A few releases ago the old mount API gained support for a mount
  option which prevents following symlinks on a given mount. This adds
  support for it in the new mount api through the MOUNT_ATTR_NOSYMFOLLOW
  flag via mount_setattr() and fsmount(). With mount_setattr() that flag
  can even be applied recursively.

  There's an additional ack from Ross Zwisler who originally authored
  the nosymfollow patch. As I've already had the patches in my for-next
  I didn't add his ack explicitly"

* tag 'fs.mount_setattr.nosymfollow.v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  tests: test MOUNT_ATTR_NOSYMFOLLOW with mount_setattr()
  mount: Support "nosymfollow" in new mount api

Merge branch 'akpm' (patches from Andrew)

Merge misc updates from Andrew Morton:
"191 patches.

  Subsystems affected by this patch series: kthread, ia64, scripts,
  ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
  slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
  mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
  pagealloc, and memory-failure)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits)
  mm,hwpoison: make get_hwpoison_page() call get_any_page()
  mm,hwpoison: send SIGBUS with error virutal address
  mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
  mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
  mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
  mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
  docs: remove description of DISCONTIGMEM
  arch, mm: remove stale mentions of DISCONIGMEM
  mm: remove CONFIG_DISCONTIGMEM
  m68k: remove support for DISCONTIGMEM
  arc: remove support for DISCONTIGMEM
  arc: update comment about HIGHMEM implementation
  alpha: remove DISCONTIGMEM and NUMA
  mm/page_alloc: move free_the_page
  mm/page_alloc: fix counting of managed_pages
  mm/page_alloc: improve memmap_pages dbg msg
  mm: drop SECTION_SHIFT in code comments
  mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
  mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
  mm/page_alloc: scale the number of pages that are batch freed
  ...

Merge git://git./linux/kernel/git/netdev/net

Trivial conflict in net/netfilter/nf_tables_api.c.

Duplicate fix in tools/testing/selftests/net/devlink_port_split.py
- take the net-next version.

skmsg, and L4 bpf - keep the bpf code but remove the flags
and err params.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'devprop-5.14-rc1' of git://git./linux/kernel/git/rafael/linux-pm

Pull device properties framework updates from Rafael Wysocki:
"These unify device properties access in some pieces of code and make
  related changes.

  Specifics:

   - Handle device properties with software node API in the ACPI IORT
     table parsing code (Heikki Krogerus).

   - Unify of_node access in the common device properties code, constify
     the acpi_dma_supported() argument pointer and fix up CONFIG_ACPI=n
     stubs of some functions related to device properties (Andy
     Shevchenko)"

* tag 'devprop-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  device property: Unify access to of_node
  ACPI: scan: Constify acpi_dma_supported() helper function
  ACPI: property: Constify stubs for CONFIG_ACPI=n case
  ACPI: IORT: Handle device properties with software node API
  device property: Retrieve fwnode from of_node via accessor

Merge tag 'pnp-5.14-rc1' of git://git./linux/kernel/git/rafael/linux-pm

Pull PNP updates from Rafael Wysocki:
"These get rid of unnecessary local variables and function, reduce code
  duplication and clean up message printing.

  Specifics:

   - Remove unnecessary local variables from isapnp_proc_attach_device()
     (Anupama K Patil).

   - Make the callers of pnp_alloc() use kzalloc() directly and drop the
     former (Heiner Kallweit).

   - Make two pieces of code use dev_dbg() instead of dev_printk() with
     the KERN_DEBUG message level (Heiner Kallweit).

   - Use DEVICE_ATTR_RO() instead of full DEVICE_ATTR() in some places
     in card.c (Zhen Lei).

   - Use list_for_each_entry() instead of list_for_each() in
     insert_device() (Zou Wei)"

* tag 'pnp-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PNP: pnpbios: Use list_for_each_entry() instead of list_for_each()
  PNP: use DEVICE_ATTR_RO macro
  PNP: Switch over to dev_dbg()
  PNP: Remove pnp_alloc()
  drivers: pnp: isapnp: proc.c: Remove unnecessary local variables

Merge tag 'acpi-5.14-rc1' of git://git./linux/kernel/git/rafael/linux-pm

Pull ACPI updates from Rafael Wysocki:
"These update the ACPICA code in the kernel to the 20210604 upstream
  revision, add preliminary support for the Platform Runtime Mechanism
  (PRM), address issues related to the handling of device dependencies
  in the ACPI device enumeration code, improve the tracking of ACPI
  power resource states, improve the ACPI support for suspend-to-idle on
  AMD systems, continue the unification of message printing in the ACPI
  code, address assorted issues and clean up the code in a number of
  places.

  Specifics:

   - Update ACPICA code in the kernel to upstream revision 20210604
     including the following changes:

      - Add defines for the CXL Host Bridge Structure and add the
        CFMWS structure definition to CEDT (Alison Schofield).
      - iASL: Finish support for the IVRS ACPI table (Bob Moore).
      - iASL: Add support for the SVKL table (Bob Moore).
      - iASL: Add full support for RGRT ACPI table (Bob Moore).
      - iASL: Add support for the BDAT ACPI table (Bob Moore).
      - iASL: add disassembler support for PRMT (Erik Kaneda).
      - Fix memory leak caused by _CID repair function (Erik Kaneda).
      - Add support for PlatformRtMechanism OpRegion (Erik Kaneda).
      - Add PRMT module header to facilitate parsing (Erik Kaneda).
      - Add _PLD panel positions (Fabian Wüthrich).
      - MADT: add Multiprocessor Wakeup Mailbox Structure and the SVKL
        table headers (Kuppuswamy Sathyanarayanan).
      - Use ACPI_FALLTHROUGH (Wei Ming Chen).

   - Add preliminary support for the Platform Runtime Mechanism (PRM) to
     allow the AML interpreter to call PRM functions (Erik Kaneda).

   - Address some issues related to the handling of device dependencies
     reported by _DEP in the ACPI device enumeration code and clean up
     some related pieces of it (Rafael Wysocki).

   - Improve the tracking of states of ACPI power resources (Rafael
     Wysocki).

   - Improve ACPI support for suspend-to-idle on AMD systems (Alex
     Deucher, Mario Limonciello, Pratik Vishwakarma).

   - Continue the unification and cleanup of message printing in the
     ACPI code (Hanjun Guo, Heiner Kallweit).

   - Fix possible buffer overrun issue with the description_show() sysfs
     attribute method (Krzysztof Wilczyński).

   - Improve the acpi_mask_gpe kernel command line parameter handling
     and clean up the core ACPI code related to sysfs (Andy Shevchenko,
     Baokun Li, Clayton Casciato).

   - Postpone bringing devices in the general ACPI PM domain to D0
     during resume from system-wide suspend until they are really needed
     (Dmitry Torokhov).

   - Make the ACPI processor driver fix up C-state latency if not
     ordered (Mario Limonciello).

   - Add support for identifying devices depending on the given one that
     are not its direct descendants with the help of _DEP (Daniel
     Scally).

   - Extend the checks related to ACPI IRQ overrides on x86 in order to
     avoid false-positives (Hui Wang).

   - Add battery DPTF participant for Intel SoCs (Sumeet Pawnikar).

   - Rearrange the ACPI fan driver and device power management code to
     use a common list of device IDs (Rafael Wysocki).

   - Fix clang CFI violation in the ACPI BGRT table parsing code and
     clean it up (Nathan Chancellor).

   - Add GPE-related quirks for some laptops to the EC driver (Chris
     Chiu, Zhang Rui).

   - Make the ACPI PPTT table parsing code populate the cache-id value
     if present in the firmware (James Morse).

   - Remove redundant clearing of context->ret.pointer from
     acpi_run_osc() (Hans de Goede).

   - Add missing acpi_put_table() in acpi_init_fpdt() (Jing Xiangfeng).

   - Make ACPI APEI handle ARM Processor Error CPER records like Memory
     Error ones to avoid user space task lockups (Xiaofei Tan).

   - Stop warning about disabled ACPI in APEI (Jon Hunter).

   - Fix fall-through warning for Clang in the SBSHC driver (Gustavo A.
     R. Silva).

   - Add custom DSDT file as Makefile prerequisite (Richard Fitzgerald).

   - Initialize local variable to avoid garbage being returned (Colin
     Ian King).

   - Simplify assorted pieces of code, address assorted coding style and
     documentation issues and comment typos (Baokun Li, Christophe
     JAILLET, Clayton Casciato, Liu Shixin, Shaokun Zhang, Wei Yongjun,
     Yang Li, Zhen Lei)"

* tag 'acpi-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (97 commits)
  ACPI: PM: postpone bringing devices to D0 unless we need them
  ACPI: tables: Add custom DSDT file as makefile prerequisite
  ACPI: bgrt: Use sysfs_emit
  ACPI: bgrt: Fix CFI violation
  ACPI: EC: trust DSDT GPE for certain HP laptop
  ACPI: scan: Simplify acpi_table_events_fn()
  ACPI: PM: Adjust behavior for field problems on AMD systems
  ACPI: PM: s2idle: Add support for new Microsoft UUID
  ACPI: PM: s2idle: Add support for multiple func mask
  ACPI: PM: s2idle: Refactor common code
  ACPI: PM: s2idle: Use correct revision id
  ACPI: sysfs: Remove tailing return statement in void function
  ACPI: sysfs: Use __ATTR_RO() and __ATTR_RW() macros
  ACPI: sysfs: Sort headers alphabetically
  ACPI: sysfs: Refactor param_get_trace_state() to drop dead code
  ACPI: sysfs: Unify pattern of memory allocations
  ACPI: sysfs: Allow bitmap list to be supplied to acpi_mask_gpe
  ACPI: sysfs: Make sparse happy about address space in use
  ACPI: scan: Fix race related to dropping dependencies
  ACPI: scan: Reorganize acpi_device_add()
  ...

Merge tag 'pm-5.14-rc1' of git://git./linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
"These add hybrid processors support to the intel_pstate driver and
  make it work with more processor models when HWP is disabled, make the
  intel_idle driver use special C6 idle state parameters when package
  C-states are disabled, add cooling support to the tegra30 devfreq
  driver, rework the TEO (timer events oriented) cpuidle governor,
  extend the OPP (operating performance points) framework to use the
  required-opps DT property in more cases, fix some issues and clean up
  a number of assorted pieces of code.

  Specifics:

   - Make intel_pstate support hybrid processors using abstract
     performance units in the HWP interface (Rafael Wysocki).

   - Add Icelake servers and Cometlake support in no-HWP mode to
     intel_pstate (Giovanni Gherdovich).

   - Make cpufreq_online() error path be consistent with the CPU device
     removal path in cpufreq (Rafael Wysocki).

   - Clean up 3 cpufreq drivers and the statistics code (Hailong Liu,
     Randy Dunlap, Shaokun Zhang).

   - Make intel_idle use special idle state parameters for C6 when
     package C-states are disabled (Chen Yu).

   - Rework the TEO (timer events oriented) cpuidle governor to address
     some theoretical shortcomings in it (Rafael Wysocki).

   - Drop unneeded semicolon from the TEO governor (Wan Jiabing).

   - Modify the runtime PM framework to accept unassigned suspend and
     resume callback pointers (Ulf Hansson).

   - Improve pm_runtime_get_sync() documentation (Krzysztof Kozlowski).

   - Improve device performance states support in the generic power
     domains (genpd) framework (Ulf Hansson).

   - Fix some documentation issues in genpd (Yang Yingliang).

   - Make the operating performance points (OPP) framework use the
     required-opps DT property in use cases that are not related to
     genpd (Hsin-Yi Wang).

   - Make lazy_link_required_opp_table() use list_del_init instead of
     list_del/INIT_LIST_HEAD (Yang Yingliang).

   - Simplify wake IRQs handling in the core system-wide sleep support
     code and clean up some coding style inconsistencies in it (Tian
     Tao, Zhen Lei).

   - Add cooling support to the tegra30 devfreq driver and improve its
     DT bindings (Dmitry Osipenko).

   - Fix some assorted issues in the devfreq core and drivers (Chanwoo
     Choi, Dong Aisheng, YueHaibing)"

* tag 'pm-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (39 commits)
  PM / devfreq: passive: Fix get_target_freq when not using required-opp
  cpufreq: Make cpufreq_online() call driver->offline() on errors
  opp: Allow required-opps to be used for non genpd use cases
  cpuidle: teo: remove unneeded semicolon in teo_select()
  dt-bindings: devfreq: tegra30-actmon: Add cooling-cells
  dt-bindings: devfreq: tegra30-actmon: Convert to schema
  PM / devfreq: userspace: Use DEVICE_ATTR_RW macro
  PM: runtime: Clarify documentation when callbacks are unassigned
  PM: runtime: Allow unassigned ->runtime_suspend|resume callbacks
  PM: runtime: Improve path in rpm_idle() when no callback
  PM: hibernate: remove leading spaces before tabs
  PM: sleep: remove trailing spaces and tabs
  PM: domains: Drop/restore performance state votes for devices at runtime PM
  PM: domains: Return early if perf state is already set for the device
  PM: domains: Split code in dev_pm_genpd_set_performance_state()
  cpuidle: teo: Use kerneldoc documentation in admin-guide
  cpuidle: teo: Rework most recent idle duration values treatment
  cpuidle: teo: Change the main idle state selection logic
  cpuidle: teo: Cosmetic modification of teo_select()
  cpuidle: teo: Cosmetic modifications of teo_update()
  ...

qemu_fw_cfg: Make fw_cfg_rev_attr a proper kobj_attribute

fw_cfg_showrev() is called by an indirect call in kobj_attr_show(),
which violates clang's CFI checking because fw_cfg_showrev()'s second
parameter is 'struct attribute', whereas the ->show() member of 'struct
kobj_attribute' expects the second parameter to be of type 'struct
kobj_attribute'.

$ cat /sys/firmware/qemu_fw_cfg/rev
3

$ dmesg | grep "CFI failure"
[ 26.016832] CFI failure (target: fw_cfg_showrev+0x0/0x8):

Fix this by converting fw_cfg_rev_attr to 'struct kobj_attribute' where
this would have been caught automatically by the incompatible pointer
types compiler warning. Update fw_cfg_showrev() accordingly.

Fixes: 75f3e8e47f38 ("firmware: introduce sysfs driver for QEMU's fw_cfg device")
Link: https://github.com/ClangBuiltLinux/linux/issues/1299
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210211194258.4137998-1-nathan@kernel.org
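The prototype mismatch in the qemu_fw_cfg commit above can be illustrated with user-space stand-ins for the kernel types. This is a simplified sketch, not the driver's code: the structs are cut down, the return type is int rather than the kernel's ssize_t, and rev_show(), rev_attr and attr_show() are hypothetical names. The point is that once the attribute is a 'struct kobj_attribute', the handler is assigned to ->show() without a cast, so the compiler checks its prototype.

```c
#include <assert.h>
#include <stdio.h>

/* User-space stand-ins for the kernel types; simplified for this sketch. */
struct kobject { const char *name; };
struct attribute { const char *name; };

struct kobj_attribute {
	struct attribute attr;
	int (*show)(struct kobject *kobj, struct kobj_attribute *attr, char *buf);
};

/*
 * Handler with the prototype that ->show() expects. A handler taking
 * 'struct attribute *' as its second parameter could only be stored in
 * ->show() through a cast, and the indirect call would then go through
 * a mismatched function type.
 */
static int rev_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
{
	(void)kobj;
	(void)attr;
	return sprintf(buf, "%u\n", 3u);
}

/* No cast needed: the compiler checks the prototype at this assignment. */
static struct kobj_attribute rev_attr = {
	.attr = { .name = "rev" },
	.show = rev_show,
};

/* Simplified analogue of the dispatcher that performs the indirect call. */
static int attr_show(struct kobject *kobj, struct attribute *attr, char *buf)
{
	/* 'attr' is the first member, so this cast recovers the container. */
	struct kobj_attribute *kattr = (struct kobj_attribute *)attr;

	assert(kattr != NULL);
	return kattr->show ? kattr->show(kobj, kattr, buf) : -1;
}
```

Calling attr_show() with &rev_attr.attr writes "3\n" into the buffer, matching the value read from the sysfs file in the commit message.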

Merge tag 'x86-entry-2021-06-29' of git://git./linux/kernel/git/tip/tip

Pull x86 entry code related updates from Thomas Gleixner:

- Consolidate the macros for .byte ... opcode sequences

- Deduplicate register offset defines in include files

- Simplify the ia32,x32 compat handling of the related syscall tables
   to get rid of #ifdeffery.

- Clear all EFLAGS which are not required for syscall handling

- Consolidate the syscall tables and switch the generation over to the
   generic shell script and remove the CFLAGS tweaks which are no
   longer required.

- Use 'int' type for system call numbers to match the generic code.

- Add more selftests for syscalls

* tag 'x86-entry-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/syscalls: Don't adjust CFLAGS for syscall tables
  x86/syscalls: Remove -Wno-override-init for syscall tables
  x86/uml/syscalls: Remove array index from syscall initializers
  x86/syscalls: Clear 'offset' and 'prefix' in case they are set in env
  x86/entry: Use int everywhere for system call numbers
  x86/entry: Treat out of range and gap system calls the same
  x86/entry/64: Sign-extend system calls on entry to int
  selftests/x86/syscall: Add tests under ptrace to syscall_numbering_64
  selftests/x86/syscall: Simplify message reporting in syscall_numbering
  selftests/x86/syscall: Update and extend syscall_numbering_64
  x86/syscalls: Switch to generic syscallhdr.sh
  x86/syscalls: Use __NR_syscalls instead of __NR_syscall_max
  x86/unistd: Define X32_NR_syscalls only for 64-bit kernel
  x86/syscalls: Stop filling syscall arrays with *_sys_ni_syscall
  x86/syscalls: Switch to generic syscalltbl.sh
  x86/entry/x32: Rename __x32_compat_sys_* to __x64_compat_sys_*

Merge tag 'x86-irq-2021-06-29' of git://git./linux/kernel/git/tip/tip

Pull x86 interrupt related updates from Thomas Gleixner:

- Consolidate the VECTOR defines and the usage sites.

- Clean up GDT/IDT related code and replace open coded ASM with proper
   native helper functions.

* tag 'x86-irq-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/kexec: Set_[gi]dt() -> native_[gi]dt_invalidate() in machine_kexec_*.c
  x86: Add native_[ig]dt_invalidate()
  x86/idt: Remove address argument from idt_invalidate()
  x86/irq: Add and use NR_EXTERNAL_VECTORS and NR_SYSTEM_VECTORS
  x86/irq: Remove unused vectors defines

Merge tag 'timers-core-2021-06-29' of git://git./linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
"Time and clocksource/clockevent related updates:

  Core changes:

   - Infrastructure to support per CPU "broadcast" devices for per CPU
     clockevent devices which stop in deep idle states. This allows us
     to utilize the more efficient architected timer on certain ARM SoCs
     for normal operation instead of permanently using the slow to
     access SoC specific clockevent device.

   - Print the name of the broadcast/wakeup device in /proc/timer_list

   - Make the clocksource watchdog more robust against delays between
     reading the current active clocksource and the watchdog
     clocksource. Such delays can be caused by NMIs, SMIs and vCPU
     preemption.

     Handle this by reading the watchdog clocksource twice, i.e. before
     and after reading the current active clocksource. If the two
     watchdog reads show an excessive time delta, the read sequence is
     repeated up to 3 times.

   - Improve the debug output and add a test module for the watchdog
     mechanism.

   - Reimplementation of the venerable time64_to_tm() function with a
     faster and significantly smaller version. Straight from the source,
     i.e. the author of the related research paper contributed this!

  Driver changes:

   - No new drivers, not even new device tree bindings!

   - Fixes, improvements and cleanups all over the place"
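The watchdog read sequence described in the core changes above can be sketched in a few lines. This is an illustration under assumptions, not the kernel's implementation: read_pair(), SKETCH_MAX_RETRIES and the simulated counters are hypothetical, and the threshold is passed in by the caller. The watchdog is read before and after the clocksource; if the two watchdog reads are too far apart, the sample is discarded and the sequence is repeated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_RETRIES 3	/* "repeated up to 3 times" in the text above */

typedef uint64_t (*read_fn)(void);

/*
 * Read the watchdog before and after the clocksource. If the two watchdog
 * reads are too far apart, the sample was disturbed (NMI, SMI, vCPU
 * preemption), so try again. Returns true when a clean sample was taken.
 */
static bool read_pair(read_fn read_wd, read_fn read_cs, uint64_t max_delay,
		      uint64_t *wd_now, uint64_t *cs_now)
{
	int attempt;

	assert(read_wd && read_cs && wd_now && cs_now);

	/* One initial attempt plus up to SKETCH_MAX_RETRIES repeats. */
	for (attempt = 0; attempt <= SKETCH_MAX_RETRIES; attempt++) {
		uint64_t wd_before = read_wd();
		uint64_t cs = read_cs();
		uint64_t wd_after = read_wd();

		if (wd_after - wd_before <= max_delay) {
			*wd_now = wd_before;
			*cs_now = cs;
			return true;
		}
	}
	return false;
}

/* Simulated counters for trying the sketch out; not real hardware reads. */
static uint64_t fake_time;
static uint64_t fake_stall;	/* extra ticks injected during the clocksource read */

static uint64_t fake_wd(void) { return fake_time++; }
static uint64_t fake_cs(void) { fake_time += fake_stall; return fake_time; }
```

With fake_stall left at zero the first attempt succeeds; setting it above max_delay makes every attempt look disturbed, so read_pair() gives up after the retries are exhausted.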

* tag 'timers-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  time/kunit: Add missing MODULE_LICENSE()
  time: Improve performance of time64_to_tm()
  clockevents: Use list_move() instead of list_del()/list_add()
  clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
  clocksource: Provide kernel module to test clocksource watchdog
  clocksource: Reduce clocksource-skew threshold
  clocksource: Limit number of CPUs checked for clock synchronization
  clocksource: Check per-CPU clock synchronization when marked unstable
  clocksource: Retry clock read if long delays detected
  clockevents: Add missing parameter documentation
  clocksource/drivers/timer-ti-dm: Drop unnecessary restore
  clocksource/arm_arch_timer: Improve Allwinner A64 timer workaround
  clocksource/drivers/arm_global_timer: Remove duplicated argument in arm_global_timer
  clocksource/drivers/arm_global_timer: Make symbol 'gt_clk_rate_change_nb' static
  arm: zynq: don't disable CONFIG_ARM_GLOBAL_TIMER due to CONFIG_CPU_FREQ anymore
  clocksource/drivers/arm_global_timer: Implement rate compensation whenever source clock changes
  clocksource/drivers/ingenic: Rename unreasonable array names
  clocksource/drivers/timer-ti-dm: Save and restore timer TIOCP_CFG
  clocksource/drivers/mediatek: Ack and disable interrupts on suspend
  clocksource/drivers/samsung_pwm: Constify source IO memory
  ...

Merge tag 'irq-core-2021-06-29' of git://git./linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
"Updates for the interrupt subsystem:

  Core changes:

   - Cleanup and simplification of common code to invoke the low level
     interrupt flow handlers when this invocation requires irqdomain
     resolution. Add the necessary core infrastructure.

   - Provide a proper interface for modular PMU drivers to set the
     interrupt affinity.

   - Add a request flag which allows interrupts to be excluded from
     spurious interrupt detection. This is especially useful for IPI
     handlers, which always return IRQ_HANDLED and so turn spurious
     interrupt detection into a pointless waste of CPU cycles.

  Driver changes:

   - Bulk convert interrupt chip drivers to the new irqdomain low level
     flow handler invocation mechanism.

   - Add device tree bindings for the Renesas R-Car M3-W+ SoC

   - Enable modular build of the Qualcomm PDC driver

   - The usual small fixes and improvements"

* tag 'irq-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  dt-bindings: interrupt-controller: arm,gic-v3: Describe GICv3 optional properties
  irqchip: gic-pm: Remove redundant error log of clock bulk
  irqchip/sun4i: Remove unnecessary oom message
  irqchip/irq-imx-gpcv2: Remove unnecessary oom message
  irqchip/imgpdc: Remove unnecessary oom message
  irqchip/gic-v3-its: Remove unnecessary oom message
  irqchip/gic-v2m: Remove unnecessary oom message
  irqchip/exynos-combiner: Remove unnecessary oom message
  irqchip: Bulk conversion to generic_handle_domain_irq()
  genirq: Move non-irqdomain handle_domain_irq() handling into ARM's handle_IRQ()
  genirq: Add generic_handle_domain_irq() helper
  irqchip/nvic: Convert from handle_IRQ() to handle_domain_irq()
  irqdesc: Fix __handle_domain_irq() comment
  genirq: Use irq_resolve_mapping() to implement __handle_domain_irq() and co
  irqdomain: Introduce irq_resolve_mapping()
  irqdomain: Protect the linear revmap with RCU
  irqdomain: Cache irq_data instead of a virq number in the revmap
  irqdomain: Use struct_size() helper when allocating irqdomain
  irqdomain: Make normal and nomap irqdomains exclusive
  powerpc: Move the use of irq_domain_add_nomap() behind a config option
  ...

Merge tag 'smp-urgent-2021-06-29' of git://git./linux/kernel/git/tip/tip

Pull CPU hotplug fix from Thomas Gleixner:
"A fix for the CPU hotplug and cpusets interaction:

  cpusets delegate the hotplug work to a workqueue to prevent a lock
  order inversion vs. the CPU hotplug lock. The work is not flushed
  before the hotplug operation returns which creates user visible
  inconsistent state. Prevent this by flushing the work after dropping
  CPU hotplug lock and before releasing the outer mutex which serializes
  the CPU hotplug related sysfs interface operations"

* tag 'smp-urgent-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Cure the cpusets trainwreck

Merge tag 'smp-core-2021-06-29' of git://git./linux/kernel/git/tip/tip

Pull CPU hotplug cleanup from Thomas Gleixner:
"A simple cleanup for the CPU hotplug code to avoid per_cpu_ptr()
  reevaluation"

* tag 'smp-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Simplify access to percpu cpuhp_state

Merge tag 'printk-for-5.14' of git://git./linux/kernel/git/printk/linux

Pull printk updates from Petr Mladek:

- Add %pt[RT]s modifier to vsprintf(). It overrides the ISO 8601 separator
   by using ' ' (space). It produces "YYYY-mm-dd HH:MM:SS" instead of
   "YYYY-mm-ddTHH:MM:SS".

- Correctly parse a long row of numbers with sscanf() when using the field
   width. Add an extensive sscanf() selftest.

- Generalize the re-entrant CPU lock that has already been used to
   serialize dump_stack() output. It is part of the ongoing printk
   rework. It will allow removing the obsolete printk_safe buffers and
   introducing atomic consoles.

- Some code clean up and sparse warning fixes.

* tag 'printk-for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk: fix cpu lock ordering
  lib/dump_stack: move cpu lock to printk.c
  printk: Remove trailing semicolon in macros
  random32: Fix implicit truncation warning in prandom_seed_state()
  lib: test_scanf: Remove pointless use of type_min() with unsigned types
  selftests: lib: Add wrapper script for test_scanf
  lib: test_scanf: Add tests for sscanf number conversion
  lib: vsprintf: Fix handling of number field widths in vsscanf
  lib: vsprintf: scanf: Negative number must have field width > 1
  usb: host: xhci-tegra: Switch to use %ptTs
  nilfs2: Switch to use %ptTs
  kdb: Switch to use %ptTs
  lib/vsprintf: Allow to override ISO 8601 date and time separator
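
The effect of the separator override can be illustrated in user space.
%ptTs itself is a kernel vsprintf() extension and is not available in libc,
so the sketch below only mimics the two output shapes with snprintf(). The
function name demo_format_time() is hypothetical.

```c
#include <stdio.h>
#include <time.h>

/*
 * Format a broken-down time as "YYYY-mm-dd<sep>HH:MM:SS".
 * sep is 'T' for the ISO 8601 form and ' ' for the overridden form.
 * Returns the snprintf() result.
 */
int demo_format_time(const struct tm *tm, char sep, char *buf, size_t len)
{
	return snprintf(buf, len, "%04d-%02d-%02d%c%02d:%02d:%02d",
			tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday, sep,
			tm->tm_hour, tm->tm_min, tm->tm_sec);
}
```

Passing ' ' as the separator gives the shape the new modifier produces,
while 'T' gives the previous ISO 8601 shape.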

tcp: change ICSK_CA_PRIV_SIZE definition

Instead of a magic number (13 currently) and having
to change it every other year, use sizeof_field() macro.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
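
As a rough illustration of the idea in the tcp commit above, the size of a
private area can be derived from the member itself instead of being written
as a literal. The macro has the same shape as the kernel's sizeof_field();
struct demo_conn_sock and DEMO_CA_PRIV_SIZE are hypothetical stand-ins, not
the real definitions.

```c
#include <stddef.h>

/* Size of a struct member, computed without needing an object. */
#define sizeof_field(TYPE, MEMBER) sizeof(((TYPE *)0)->MEMBER)

/* Hypothetical stand-in for the connection socket structure. */
struct demo_conn_sock {
	int state;
	unsigned long long ca_priv[13];	/* private area for congestion control */
};

/* Follows the array automatically if its length ever changes. */
#define DEMO_CA_PRIV_SIZE sizeof_field(struct demo_conn_sock, ca_priv)

size_t demo_ca_priv_size(void)
{
	return DEMO_CA_PRIV_SIZE;
}
```

With this form, changing the array length in one place is enough; no
separate constant has to be kept in sync.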

tcp_yeah: check struct yeah size at compile time

Compiler can perform the sanity check instead of waiting
to load the module and crash the host.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
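
A compile-time size check of this kind can be sketched in plain C11 with
static_assert. Kernel code commonly uses BUILD_BUG_ON() for this purpose;
the struct and the size constant below are hypothetical and only show the
technique.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stdint.h>

/* Hypothetical size of the slot reserved for congestion control state. */
#define DEMO_CA_PRIV_SIZE (13 * sizeof(uint64_t))

/* Hypothetical private state of a congestion control module. */
struct demo_cc_state {
	uint32_t reno_count;
	uint32_t fast_count;
	uint64_t last_rtt_us;
};

/* Breaks the build if the state outgrows the slot, instead of failing at load time. */
static_assert(sizeof(struct demo_cc_state) <= DEMO_CA_PRIV_SIZE,
	      "demo_cc_state does not fit in the private area");

/* Run-time view of the same condition, for demonstration only. */
int demo_cc_state_fits(void)
{
	return sizeof(struct demo_cc_state) <= DEMO_CA_PRIV_SIZE;
}
```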

gve: DQO: Fix off by one in gve_rx_dqo()

The rx->dqo.buf_states[] array is allocated in gve_rx_alloc_ring_dqo()
and it has rx->dqo.num_buf_states elements, so this > needs to be >= to
prevent an out of bounds access.

Fixes: 9b8dd5e5ea48 ("gve: DQO: Add RX path")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
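
The off-by-one described above can be shown with a minimal bounds check. The
function and parameter names are hypothetical; only the comparison operator
reflects the fix.

```c
#include <stdbool.h>
#include <stddef.h>

/* Valid indexes for an array of num_states elements are 0 .. num_states - 1. */
bool demo_buf_id_is_valid(size_t buf_id, size_t num_states)
{
	/*
	 * The buggy form rejected only "buf_id > num_states", which lets
	 * buf_id == num_states through and reads one past the end.
	 */
	if (buf_id >= num_states)
		return false;
	return true;
}
```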

Merge branch 'stmmac-phy-wol'

Ling Pei Lee says:

====================
stmmac: Add option to enable PHY WOL with PMT enabled

The main objective of this patchset is to provide an option to enable PHY WoL
even when PMT is enabled by default in the HW features.

The current stmmac driver WOL implementation enables MAC WOL if the
MAC HW PMT feature is on. Otherwise, the driver checks for PHY WOL
support. Intel EHL mgbe is designed to wake up through PHY WOL
even though HW PMT is enabled. Hence, introduce the use_phy_wol platform
data to provide this PHY WOL option. Setting use_phy_wol disables
plat->pmt, which is currently used to determine whether the system wakes
up by MAC WOL or PHY WOL.

This WOL patchset also sets the device power state to D3hot.
This is because the EHL PSE needs the PSE mgbe to be in D3 state in
order for the PSE to go into suspend mode.

Change Log:
V2: Drop Patch #3 net: stmmac: Reconfigure the PHY WOL settings in stmmac_resume().
====================

stmmac: intel: set PCI_D3hot in suspend

During suspend, set the Intel mgbe to D3hot state
to reduce power consumption.

Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
Signed-off-by: Ling Pei Lee <pei.lee.ling@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: intel: Enable PHY WOL option in EHL

Enable PHY Wake On LAN on the Intel EHL platform.
The PHY Wake On LAN option is enabled because
the Intel EHL platform is designed for
PHY Wake On LAN but not MAC Wake On LAN.

Signed-off-by: Ling Pei Lee <pei.lee.ling@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: option to enable PHY WOL with PMT enabled

The current stmmac driver WOL implementation enables MAC WOL
if the MAC HW PMT feature is on. Otherwise, the driver checks for
PHY WOL support. There is another case where MAC HW PMT is
enabled but the platform still goes for the PHY WOL option.
E.g., Intel platforms are designed for PHY WOL but not MAC WOL
although HW MAC PMT features are enabled.

Introduce use_phy_wol platform data to select PHY WOL
instead of depending on HW PMT features. Setting use_phy_wol
disables plat->pmt, which is currently used to
determine whether the system wakes up by MAC WOL or PHY WOL.

Signed-off-by: Ling Pei Lee <pei.lee.ling@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
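
The selection logic described in this commit message can be summarized with
a small sketch. All type, field and function names below are hypothetical
simplifications; the real driver structures carry much more state.

```c
#include <stdbool.h>

enum demo_wol_path { DEMO_WOL_MAC, DEMO_WOL_PHY };

/* Hypothetical subset of the platform data described above. */
struct demo_plat {
	bool hw_pmt;		/* MAC reports the PMT hardware feature */
	bool use_phy_wol;	/* platform asks for PHY WOL regardless of PMT */
	bool pmt;		/* effective setting used to pick the wake-up path */
};

/* use_phy_wol overrides the hardware feature when deciding plat->pmt. */
void demo_resolve_pmt(struct demo_plat *plat)
{
	plat->pmt = plat->hw_pmt && !plat->use_phy_wol;
}

/* MAC WOL is chosen only while pmt is still set; otherwise fall back to PHY WOL. */
enum demo_wol_path demo_wol_path(const struct demo_plat *plat)
{
	return plat->pmt ? DEMO_WOL_MAC : DEMO_WOL_PHY;
}
```

A platform with hw_pmt set and use_phy_wol set ends up on the PHY path,
which is the EHL case described in the series.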

Merge branch 'ndo_dflt_fdb-print'

Vladimir Oltean says:

====================
Trivial print improvements in ndo_dflt_fdb_{add,del}

These are some changes brought to the informational messages printed in
the default .ndo_fdb_add and .ndo_fdb_del method implementations.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: say "local" instead of "static" addresses in ndo_dflt_fdb_{add,del}

"Static" is a loaded word, and probably not what the author meant when
the code was written.

In particular, this looks weird:
$ bridge fdb add dev swp0 00:01:02:03:04:05 local # totally fine, but
$ bridge fdb add dev swp0 00:01:02:03:04:05 static
[ 2020.708298] swp0: FDB only supports static addresses # hmm what?

By looking at the implementation which uses dev_uc_add/dev_uc_del it is
absolutely clear that only local addresses are supported, and the proper
Neighbour Unreachability Detection state is being used for this purpose
(user space indeed sets NUD_PERMANENT when local addresses are meant).
So it is just the message that is wrong, fix it.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
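
The state check behind that message can be sketched as follows. The value
0x80 is NUD_PERMANENT from the Linux UAPI header <linux/neighbour.h>; the
function name is hypothetical, and accepting a zero state is an assumption
about the default implementation rather than something stated above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value of NUD_PERMANENT in <linux/neighbour.h>. */
#define DEMO_NUD_PERMANENT 0x80

/*
 * Accept an unset state (assumption) or any state with the PERMANENT bit,
 * which is what user space sets for "local" addresses.
 */
bool demo_fdb_state_is_local(uint16_t ndm_state)
{
	return ndm_state == 0 || (ndm_state & DEMO_NUD_PERMANENT) != 0;
}
```

A state without the PERMANENT bit, such as the one user space is believed to
send for "static" entries, fails this check and leads to the message
discussed in the commit.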

net: use netdev_info in ndo_dflt_fdb_{add,del}

Use the more modern printk helper for network interfaces, which also
contains information about the associated struct device, and results in
overall shorter line lengths compared to printing an open-coded
dev->name.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: Set lookup cookie when creating a PTP PPS source.

When creating a PTP device, the configuration block allows
creation of an associated PPS device. However, there isn't
any way to associate the two devices after creation.

Set the PPS cookie, so pps_lookup_dev(ptp) performs correctly.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'inet-sk_error-tracers'

Alexander Aring says:

====================
net: sock: add tracers for inet socket errors

This patch series introduces tracers for sk_error_report socket callback
calls. The use case is that a user space application can monitor them
and build its own heuristic about bad peer connections, even beyond a
single socket's lifetime. As a specific example, it could be used in the Linux
cluster world to fence a badly behaving node. For now it's okay to only
trace inet sockets. Other socket families can introduce their own tracers
easily.

Example output with trace-cmd:

<idle>-0 [003] 201.799437: inet_sk_error_report: family=AF_INET protocol=IPPROTO_TCP sport=21064 dport=38941 saddr=192.168.122.57 daddr=192.168.122.251 saddrv6=::ffff:192.168.122.57 daddrv6=::ffff:192.168.122.251 error=104

- Alex

changes since v2:

- change "sk.sk_error_report(&ipc->sk);" to "sk_error_report(&ipc->sk);"
in net/qrtr/qrtr.c
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
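
A user space monitor of the kind mentioned in the cover letter would need to
pull the error code out of each trace line. The helper below is a minimal
sketch that assumes the text layout shown in the example output; the
function name is hypothetical and a real monitor would more likely read the
binary trace data.

```c
#include <stdlib.h>
#include <string.h>

/* Returns the value of the "error=" field, or -1 if the field is missing. */
long demo_trace_error(const char *line)
{
	static const char key[] = " error=";
	const char *field = strstr(line, key);

	if (!field)
		return -1;
	/* Skip the key itself and parse the decimal number that follows. */
	return strtol(field + strlen(key), NULL, 10);
}
```

For the example line above this yields 104, which is ECONNRESET on most
Linux architectures.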

net: sock: add trace for socket errors

This patch adds tracers to trace inet socket errors only. A user
space monitor application can track connection errors independently of
socket lifetime and do additional handling. For example, a cluster
manager can fence a node if errors occur according to a specific heuristic.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sock: introduce sk_error_report

This patch introduces a function wrapper to call the sk_error_report
callback. This prepares for adding additional handling whenever
sk_error_report is called, for example to trace socket errors.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
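
The wrapper pattern described above can be sketched in a few lines. The
struct, the callback and the trace hook below are hypothetical stand-ins;
the point is only that call sites go through one function, so extra
handling can be added in a single place.

```c
#include <stddef.h>

struct demo_sock {
	int err;
	void (*error_report_cb)(struct demo_sock *sk);	/* per-socket callback */
};

int demo_cb_hits;	/* counts callback invocations */
int demo_trace_hits;	/* counts trace hook invocations */

/* Example callback a socket owner might install. */
void demo_default_cb(struct demo_sock *sk)
{
	(void)sk;
	demo_cb_hits++;
}

/* Hypothetical hook standing in for a tracepoint. */
static void demo_trace_sk_error(const struct demo_sock *sk)
{
	(void)sk;
	demo_trace_hits++;
}

/* Single entry point: callers use this instead of invoking the callback directly. */
void demo_sk_error_report(struct demo_sock *sk)
{
	sk->error_report_cb(sk);
	demo_trace_sk_error(sk);
}
```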

Merge tag 'hyperv-next-signed-20210629' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv updates from Wei Liu:
"Just a few minor enhancement patches and bug fixes"

* tag 'hyperv-next-signed-20210629' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  PCI: hv: Add check for hyperv_initialized in init_hv_pci_drv()
  Drivers: hv: Move Hyper-V extended capability check to arch neutral code
  drivers: hv: Fix missing error code in vmbus_connect()
  x86/hyperv: fix logical processor creation
  hv_utils: Fix passing zero to 'PTR_ERR' warning
  scsi: storvsc: Use blk_mq_unique_tag() to generate requestIDs
  Drivers: hv: vmbus: Copy packets sent by Hyper-V out of the ring buffer
  hv_balloon: Remove redundant assignment to region_start

mm,hwpoison: make get_hwpoison_page() call get_any_page()

__get_hwpoison_page() could fail to grab a refcount due to a race condition,
so it's helpful if we can handle it by retrying.  We already have retry
logic, so make get_hwpoison_page() call get_any_page() when called from
memory_failure().

As a result, get_hwpoison_page() can return negative values (i.e.  error
codes), so some callers are also changed to handle error cases.
soft_offline_page() does nothing for -EBUSY because that's enough and
users in userspace can easily handle it.  unpoison_memory() is also
unchanged because it's broken and needs thorough fixes (will be done
later).

Link: https://lkml.kernel.org/r/20210603233632.2964832-3-nao.horiguchi@gmail.com
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm,hwpoison: send SIGBUS with error virtual address

An action-required MCE on an already hwpoisoned address now reliably sends
SIGBUS to the current process, but the SIGBUS doesn't convey the error
virtual address.  That's not optimal for hwpoison-aware applications.

To fix the issue, make memory_failure() call kill_accessing_process(),
which does a pagetable walk to find the error virtual address.  It could find
multiple virtual addresses for the same error page, and it seems hard to
tell which virtual address is the correct one.  But that's rare, and sending
an incorrect virtual address could be better than no address.  So let's
report the first found virtual address for now.

[naoya.horiguchi@nec.com: fix walk_page_range() return]
Link: https://lkml.kernel.org/r/20210603051055.GA244241@hori.linux.bs1.fc.nec.co.jp
Link: https://lkml.kernel.org/r/20210521030156.2612074-4-nao.horiguchi@gmail.com
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jue Wang <juew@google.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes

Dave Hansen reported the following about Feng Tang's tests on a machine
with persistent memory onlined as a DRAM-like device.

  Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
  ~512G of persistent memory and 128G of DRAM.  The PMEM is in "volatile
  use" mode and being managed via the buddy just like the normal RAM.

  The PMEM zones are big ones:

        present  65011712 = 248 G
        high       134595 = 525 M

  The PMEM nodes, of course, don't have any CPUs in them.

  With your series, the pcp->high value per-cpu is 69584 pages or about
  270MB per CPU.  Scaled up by the 96 CPU threads, that's ~26GB of
  worst-case memory in the pcps per zone, or roughly 10% of the size of
  the zone.

This should not cause a problem as such although it could trigger reclaim
due to pages being stored on per-cpu lists for CPUs remote to a node.  It
is not possible to treat cpuless nodes exactly the same as normal nodes
but the worst-case scenario can be mitigated by splitting pcp->high across
all online CPUs for cpuless memory nodes.

Link: https://lkml.kernel.org/r/20210616110743.GK30378@techsingularity.net
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Tang, Feng" <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
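
The arithmetic in the report can be checked with a short sketch. The
constants are the figures quoted above; the helper names are hypothetical,
and dividing a single budget evenly across CPUs is a simplification of
whatever the patch computes internally.

```c
/* Figures quoted in the report. */
#define DEMO_PCP_HIGH_PAGES	69584ULL	/* pcp->high per CPU, in pages */
#define DEMO_NR_CPUS		96ULL		/* CPU threads on the test machine */
#define DEMO_ZONE_PAGES		65011712ULL	/* "present" pages in the PMEM zone */

/* Worst case: every CPU fills its per-cpu list up to pcp->high. */
unsigned long long demo_worst_case_pages(unsigned long long high,
					 unsigned long long cpus)
{
	return high * cpus;
}

/* Simplified split: share one budget across all online CPUs. */
unsigned long long demo_split_high(unsigned long long budget,
				   unsigned long long cpus)
{
	return cpus ? budget / cpus : budget;
}
```

With these figures the worst case is 6680064 pages, roughly 25.5 GiB with
4 KiB pages and about 10% of the zone, which matches the report. If the same
per-CPU budget were instead shared across the 96 CPUs, each CPU would get
724 pages and the zone-wide worst case would stay below 69584 pages.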

mm/page_alloc: allow high-order pages to be stored on the per-cpu lists

The per-cpu page allocator (PCP) only stores order-0 pages.  This means
that all THP and "cheap" high-order allocations including SLUB contend on
the zone->lock.  This patch extends the PCP allocator to store THP and
"cheap" high-order pages.  Note that struct per_cpu_pages increases in
size to 256 bytes (4 cache lines) on x86-64.

Note that this is not necessarily a universal performance win because of
how it is implemented.  High-order pages can cause pcp->high to be
exceeded prematurely for lower-orders so for example, a large number of
THP pages being freed could release order-0 pages from the PCP lists.
Hence, much depends on the allocation/free pattern as observed by a single
CPU to determine if caching helps or hurts a particular workload.

That said, basic performance testing passed.  The following is a netperf
UDP_STREAM test which hits the relevant patches as some of the network
allocations are high-order.

netperf-udp
                                 5.13.0-rc2             5.13.0-rc2
                           mm-pcpburst-v3r4   mm-pcphighorder-v1r7
Hmean     send-64         261.46 (   0.00%)      266.30 *   1.85%*
Hmean     send-128        516.35 (   0.00%)      536.78 *   3.96%*
Hmean     send-256       1014.13 (   0.00%)     1034.63 *   2.02%*
Hmean     send-1024      3907.65 (   0.00%)     4046.11 *   3.54%*
Hmean     send-2048      7492.93 (   0.00%)     7754.85 *   3.50%*
Hmean     send-3312     11410.04 (   0.00%)    11772.32 *   3.18%*
Hmean     send-4096     13521.95 (   0.00%)    13912.34 *   2.89%*
Hmean     send-8192     21660.50 (   0.00%)    22730.72 *   4.94%*
Hmean     send-16384    31902.32 (   0.00%)    32637.50 *   2.30%*

Functionally, a patch like this is necessary to make bulk allocation of
high-order pages work with similar performance to order-0 bulk
allocations.  The bulk allocator is not updated in this series as it would
have to be determined by bulk allocation users how they want to track the
order of pages allocated with the bulk allocator.

Link: https://lkml.kernel.org/r/20210611135753.GC30378@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
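
Which orders qualify for the per-cpu lists can be pictured with a small
predicate. This is a sketch under stated assumptions: "cheap" is taken to
mean orders up to 3, and the THP order is taken to be 9 (2 MiB with 4 KiB
pages on x86-64). The names are hypothetical and the real check may differ
by configuration.

```c
#include <stdbool.h>

/* Assumed values, see the note above. */
#define DEMO_COSTLY_ORDER	3
#define DEMO_THP_ORDER		9

/* True if pages of this order would be cached on the per-cpu lists. */
bool demo_pcp_allowed_order(unsigned int order)
{
	if (order <= DEMO_COSTLY_ORDER)
		return true;	/* order-0 and "cheap" high-order pages */
	return order == DEMO_THP_ORDER;	/* THP-sized pages */
}
```

Orders between the two thresholds still go straight to the buddy allocator
in this sketch and therefore still take zone->lock.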

mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM

After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP
configuration option is equivalent to FLATMEM.

Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead.

Link: https://lkml.kernel.org/r/20210608091316.3622-10-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>