Merge branch 'Support stashing local kptrs with bpf_kptr_xchg'
Dave Marchevsky says:
====================
Local kptrs are kptrs allocated via bpf_obj_new with a type specified in program
BTF. A BPF program which creates a local kptr has exclusive control of the
lifetime of the kptr, and, prior to terminating, must:
* free the kptr via bpf_obj_drop
* If the kptr is a {list,rbtree} node, add the node to a {list, rbtree},
thereby passing control of the lifetime to the collection
This series adds a third option:
* stash the kptr in a map value using bpf_kptr_xchg
As indicated by the use of "stash" to describe this behavior, the intended use
of this feature is temporary storage of local kptrs. For example, a sched_ext
([0]) scheduler may want to create an rbtree node for each new cgroup on cgroup
init, but to add that node to the rbtree as part of a separate program which
runs on enqueue. Stashing the node in a map_value allows its lifetime to outlive
the execution of the cgroup_init program.
Behavior:
There is no semantic difference between adding a kptr to a graph collection and
"stashing" it in a map. In both cases exclusive ownership of the kptr's lifetime
is passed to some containing data structure, which is responsible for
bpf_obj_drop'ing it when the container goes away.
Since graph collections also expect exclusive ownership of the nodes they
contain, graph nodes cannot be both stashed in a map_value and contained by
their corresponding collection.
Implementation:
Two observations simplify the verifier changes for this feature. First, kptrs
("referenced kptrs" until a recent renaming) require registration of a
dtor function as part of their acquire/release semantics, so that a referenced
kptr which is placed in a map_value is properly released when the map goes away.
We want this exact behavior for local kptrs, but with bpf_obj_drop as the dtor
instead of a per-btf_id dtor.
The second observation is that, in terms of identification, "referenced kptr"
and "local kptr" already don't interfere with one another. Consider the
following example:
struct node_data {
long key;
long data;
struct bpf_rb_node node;
};
struct map_value {
struct node_data __kptr *node;
};
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__type(key, int);
__type(value, struct map_value);
__uint(max_entries, 1);
} some_nodes SEC(".maps");
struct map_value *mapval;
struct node_data *res;
int key = 0;
res = bpf_obj_new(typeof(*res));
if (!res) { /* err handling */ }
mapval = bpf_map_lookup_elem(&some_nodes, &key);
if (!mapval) { /* err handling */ }
res = bpf_kptr_xchg(&mapval->node, res);
if (res)
bpf_obj_drop(res);
The __kptr tag identifies map_value's node as a referenced kptr, while the
PTR_TO_BTF_ID which bpf_obj_new returns - a type in some non-vmlinux,
non-module BTF - identifies res as a local kptr. Type tag on the pointer
indicates referenced kptr, while the type of the pointee indicates local kptr.
So using existing facilities we can tell the verifier about a "referenced kptr"
pointer to a "local kptr" pointee.
When kptr_xchg'ing a kptr into a map_value, the verifier can recognize local
kptr types and treat them like referenced kptrs with a properly-typed
bpf_obj_drop as a dtor.
Other implementation notes:
* We don't need to do anything special to enforce "graph nodes cannot be
both stashed in a map_value and contained by their corresponding collection"
* bpf_kptr_xchg both returns and takes as input a (possibly-null) owning
reference. It does not accept non-owning references as input by virtue
of requiring a ref_obj_id. By definition, if a program has an owning
ref to a node, the node isn't in a collection, so it's safe to pass
ownership via bpf_kptr_xchg.
Summary of patches:
* Patch 1 modifies BTF plumbing to support using bpf_obj_drop as a dtor
* Patch 2 adds verifier plumbing to support MEM_ALLOC-flagged param for
bpf_kptr_xchg
* Patch 3 adds selftests exercising the new behavior
Changelog:
v1 -> v2: https://lore.kernel.org/bpf/
20230309180111.
1618459-1-davemarchevsky@fb.com/
Patch #s used below refer to the patch's position in v1 unless otherwise
specified.
Patches 1-3 were applied and are not included in v2.
Rebase onto latest bpf-next: "libbpf: Revert poisoning of strlcpy"
Patch 4: "bpf: Support __kptr to local kptrs"
* Remove !btf_is_kernel(btf) check, WARN_ON_ONCE instead (Alexei)
Patch 6: "selftests/bpf: Add local kptr stashing test"
* Add test which stashes 2 nodes and later unstashes one of them using a
separate BPF program (Alexei)
* Fix incorrect runner subtest name for original test (was
"rbtree_add_nodes")
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>