The safety of the eBPF program is determined in two steps.

The first step does a DAG check to disallow loops and performs other CFG
validation. In particular, it will detect programs that have unreachable
instructions (though the classic BPF checker allows them).

The second step starts from the first insn and descends all possible paths.
It simulates execution of every insn and observes the state change of
registers and stack.

At the start of the program the register R1 contains a pointer to context
and has type PTR_TO_CTX.
If the verifier sees an insn that does R2=R1, then R2 now has type
PTR_TO_CTX as well and can be used on the right hand side of an expression.
If R1=PTR_TO_CTX and the insn is R2=R1+R1, then R2=SCALAR_VALUE,
since the addition of two valid pointers makes an invalid pointer.
(In 'secure' mode the verifier will reject any type of pointer arithmetic to
make sure that kernel addresses don't leak to unprivileged users.)

If a register was never written to, it is not readable. A program that reads
R2 before any insn has written to it will be rejected, since R2 is unreadable
at the start of the program.

After a kernel function call, R1-R5 are reset to unreadable and
R0 has the return type of the function.

Since R6-R9 are callee saved, their state is preserved across the call.
A program that writes R6, calls a kernel function and then reads R6
is therefore a correct program. If there was R1 instead of R6, it would have
been rejected.

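The register rules above can be sketched as a small state machine. This is
only an illustration under simplifying assumptions: the type set is reduced to
three entries, R10 is left out, and the names ``regs_init``, ``do_mov``,
``do_mov_imm``, ``do_add`` and ``do_call`` are hypothetical, not verifier
functions.

```c
#include <stdbool.h>

/* Reduced set of register types; the real verifier tracks many more. */
enum reg_type { NOT_INIT, SCALAR_VALUE, PTR_TO_CTX };

#define NUM_REGS 10 /* R0-R9; the frame pointer R10 is left out of this sketch */

struct regs { enum reg_type type[NUM_REGS]; };

/* State on program entry: only R1 (the context pointer) is readable. */
void regs_init(struct regs *r)
{
    for (int i = 0; i < NUM_REGS; i++)
        r->type[i] = NOT_INIT;
    r->type[1] = PTR_TO_CTX;
}

/* dst = src: rejected if src was never written, otherwise the type is copied. */
bool do_mov(struct regs *r, int dst, int src)
{
    if (r->type[src] == NOT_INIT)
        return false;
    r->type[dst] = r->type[src];
    return true;
}

/* dst = imm: the destination becomes a plain scalar. */
void do_mov_imm(struct regs *r, int dst)
{
    r->type[dst] = SCALAR_VALUE;
}

/* dst = a + b: pointer + pointer degrades to a scalar. */
bool do_add(struct regs *r, int dst, int a, int b)
{
    enum reg_type ta = r->type[a], tb = r->type[b];

    if (ta == NOT_INIT || tb == NOT_INIT)
        return false;
    if (ta == PTR_TO_CTX && tb == PTR_TO_CTX)
        r->type[dst] = SCALAR_VALUE;
    else if (ta == PTR_TO_CTX || tb == PTR_TO_CTX)
        r->type[dst] = PTR_TO_CTX;
    else
        r->type[dst] = SCALAR_VALUE;
    return true;
}

/* Function call: R1-R5 become unreadable, R0 takes the return type,
 * R6-R9 keep their state. */
void do_call(struct regs *r, enum reg_type ret)
{
    for (int i = 1; i <= 5; i++)
        r->type[i] = NOT_INIT;
    r->type[0] = ret;
}
```

With this model, reading R6 after a call succeeds while reading R1 after the
same call fails, which matches the callee-saved rule described above.
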
Load/store instructions are allowed only with registers of valid types, which
are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
For example, if R1 holds a scalar rather than a pointer, the insn::

  bpf_xadd *(u32 *)(R1 + 3) += R2

will be rejected, since R1 doesn't have a valid pointer type at the time of
execution of instruction bpf_xadd.

At the start R1 type is PTR_TO_CTX (a pointer to generic ``struct bpf_context``).
A callback is used to customize the verifier to restrict eBPF program access to
only certain fields within the ctx structure with specified size and alignment.

For example, the following insn::

  bpf_ld R0 = *(u32 *)(R6 + 8)

intends to load a word from address R6 + 8 and store it into R0.
If R6=PTR_TO_CTX, via the is_valid_access() callback the verifier will know
that offset 8 of size 4 bytes can be accessed for reading, otherwise
the verifier will reject the program.
If R6=PTR_TO_STACK, then the access should be aligned and be within
stack bounds, which are [-MAX_BPF_STACK, 0). In this example the offset is 8,
so it will fail verification, since it's out of bounds.

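A minimal sketch of the PTR_TO_STACK case follows. The helper name
``stack_access_ok`` is hypothetical, the value 512 for MAX_BPF_STACK is an
assumption of this example, and natural alignment (offset divisible by access
size) is assumed as the alignment rule.

```c
#include <stdbool.h>

#define MAX_BPF_STACK 512 /* assumed value for this sketch */

/* off is relative to the frame pointer; valid bytes lie in
 * [-MAX_BPF_STACK, 0). size is the access width in bytes. */
bool stack_access_ok(int off, int size)
{
    if (size <= 0)
        return false;
    if (off < -MAX_BPF_STACK || off + size > 0)
        return false;          /* out of stack bounds */
    return off % size == 0;    /* natural alignment */
}
```

Under these assumptions ``stack_access_ok(8, 4)`` is false, matching the
out-of-bounds example above, while ``stack_access_ok(-4, 4)`` is true.
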
The verifier will allow an eBPF program to read data from the stack only after
it wrote into it.

The classic BPF verifier does a similar check with M[0-15] memory slots.
For example, a program that begins with::

  bpf_ld R0 = *(u32 *)(R10 - 4)

is an invalid program.
Though R10 is a correct read-only register and has type PTR_TO_STACK
and R10 - 4 is within stack bounds, there were no stores into that location.

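The read-after-write rule can be modelled with one flag per stack byte. This
is a sketch under assumptions: the names ``stack_state``, ``stack_write`` and
``stack_read_ok`` are hypothetical, MAX_BPF_STACK is assumed to be 512, and
spilled pointer registers are not modelled.

```c
#include <stdbool.h>

#define MAX_BPF_STACK 512 /* assumed value for this sketch */

/* written[i] covers the byte at frame pointer offset -(i + 1). */
struct stack_state { bool written[MAX_BPF_STACK]; };

static bool in_bounds(int off, int size)
{
    return size > 0 && off >= -MAX_BPF_STACK && off + size <= 0;
}

/* Record a store of `size` bytes at fp + off. */
bool stack_write(struct stack_state *s, int off, int size)
{
    if (!in_bounds(off, size))
        return false;
    for (int i = 0; i < size; i++)
        s->written[-(off + i) - 1] = true;
    return true;
}

/* A load is allowed only if every byte it touches was stored earlier. */
bool stack_read_ok(const struct stack_state *s, int off, int size)
{
    if (!in_bounds(off, size))
        return false;
    for (int i = 0; i < size; i++)
        if (!s->written[-(off + i) - 1])
            return false;
    return true;
}
```

A load from R10 - 4 on a fresh state fails, and succeeds once a 4-byte store
to the same offset has been recorded.
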
Pointer register spill/fill is tracked as well, since four (R6-R9)
callee saved registers may not be enough for some programs.

Allowed function calls are customized with bpf_verifier_ops->get_func_proto().
The eBPF verifier will check that registers match argument constraints.
After the call, register R0 will be set to the return type of the function.

Function calls are the main mechanism to extend the functionality of eBPF
programs. Socket filters may let programs call one set of functions, whereas
tracing filters may allow a completely different set.

If a function is made accessible to eBPF programs, it needs to be thought
through from a safety point of view. The verifier will guarantee that the
function is called with valid arguments.

seccomp and socket filters have different security restrictions for classic BPF.
Seccomp solves this with a two-stage verifier: the classic BPF verifier is
followed by the seccomp verifier. In the case of eBPF, one configurable verifier
is shared for all use cases.

See details of the eBPF verifier in kernel/bpf/verifier.c

Register value tracking
=======================

In order to determine the safety of an eBPF program, the verifier must track
the range of possible values in each register and also in each stack slot.
This is done with ``struct bpf_reg_state``, defined in
include/linux/bpf_verifier.h, which unifies tracking of scalar and pointer
values. Each register state has a type, which is either NOT_INIT (the register
has not been written to), SCALAR_VALUE (some value which is not usable as a
pointer), or a pointer type. The types of pointers describe their base, as
follows:

    PTR_TO_CTX
        Pointer to bpf_context.
    CONST_PTR_TO_MAP
        Pointer to struct bpf_map. "Const" because arithmetic
        on these pointers is forbidden.
    PTR_TO_MAP_VALUE
        Pointer to the value stored in a map element.
    PTR_TO_MAP_VALUE_OR_NULL
        Either a pointer to a map value, or NULL; map accesses
        (see maps.rst) return this type, which becomes a
        PTR_TO_MAP_VALUE when checked != NULL. Arithmetic on
        these pointers is forbidden.
    PTR_TO_STACK
        Frame pointer.
    PTR_TO_PACKET
        skb->data.
    PTR_TO_PACKET_END
        skb->data + headlen; arithmetic forbidden.
    PTR_TO_SOCKET
        Pointer to struct bpf_sock_ops, implicitly refcounted.
    PTR_TO_SOCKET_OR_NULL
        Either a pointer to a socket, or NULL; socket lookup
        returns this type, which becomes a PTR_TO_SOCKET when
        checked != NULL. PTR_TO_SOCKET is reference-counted,
        so programs must release the reference through the
        socket release function before the end of the program.
        Arithmetic on these pointers is forbidden.

However, a pointer may be offset from this base (as a result of pointer
arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
offset'. The former is used when an exactly-known value (e.g. an immediate
operand) is added to a pointer, while the latter is used for values which are
not exactly known. The variable offset is also used in SCALAR_VALUEs, to track
the range of possible values in the register.

The verifier's knowledge about the variable offset consists of:

* minimum and maximum values as unsigned
* minimum and maximum values as signed

* knowledge of the values of individual bits, in the form of a 'tnum': a u64
  'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown;
  1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both
  mask and value; no bit should ever be 1 in both. For example, if a byte is read
  into a register from memory, the register's top 56 bits are known zero, while
  the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we
  then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0;
  0x1ff), because of potential carries.

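The tnum arithmetic in the example above can be reproduced with a short
sketch. The functions below are modelled on the OR and ADD operations of the
kernel's tnum helpers as best understood here; treat the exact names and
layout as illustrative rather than as the kernel's API.

```c
#include <stdint.h>

/* value: bits known to be 1; mask: bits whose value is unknown. */
struct tnum { uint64_t value; uint64_t mask; };

/* A fully known constant has an empty mask. */
struct tnum tnum_const(uint64_t v)
{
    return (struct tnum){ v, 0 };
}

/* OR: a bit is known 1 if it is known 1 in either input; it stays unknown
 * only if it is unknown in some input and not already known to be 1. */
struct tnum tnum_or(struct tnum a, struct tnum b)
{
    uint64_t v = a.value | b.value;
    uint64_t mu = a.mask | b.mask;

    return (struct tnum){ v, mu & ~v };
}

/* ADD: every bit that a carry could reach becomes unknown. */
struct tnum tnum_add(struct tnum a, struct tnum b)
{
    uint64_t sm = a.mask + b.mask;       /* sum of the unknown parts */
    uint64_t sv = a.value + b.value;     /* sum of the known parts */
    uint64_t sigma = sm + sv;            /* largest possible sum */
    uint64_t chi = sigma ^ sv;           /* bits affected by carries */
    uint64_t mu = chi | a.mask | b.mask; /* all unknown bits */

    return (struct tnum){ sv & ~mu, mu };
}
```

Starting from (0x0; 0xff), ``tnum_or`` with the constant 0x40 gives
(0x40; 0xbf), and ``tnum_add`` with the constant 1 then gives (0x0; 0x1ff),
as stated in the text.
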
Besides arithmetic, the register state can also be updated by conditional
branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch
it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false'
branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or
BPF_JSGE) would instead update the signed minimum/maximum values. Information
from the signed and unsigned bounds can be combined; for instance if a value is
first tested < 8 and then tested s> 4, the verifier will conclude that the value
is also > 4 and s< 8, since the bounds prevent crossing the sign boundary.

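As an illustration of the unsigned case, the following sketch splits a
register's bounds across an ``if (reg > imm)`` branch. The ``bounds`` struct
and ``refine_jgt`` are hypothetical names, and only unsigned bounds are
modelled; signed bounds and the tnum are left out.

```c
#include <stdint.h>

/* Unsigned bounds only; signed bounds and known bits are not modelled. */
struct bounds { uint64_t umin, umax; };

/* Split the state of a register across `if (reg > imm)` (unsigned).
 * *t receives the state on the taken branch, *f on the fall-through. */
void refine_jgt(struct bounds in, uint64_t imm,
                struct bounds *t, struct bounds *f)
{
    *t = in;
    *f = in;
    /* Taken: reg >= imm + 1. When imm is UINT64_MAX the branch can never
     * be taken, so the bounds are simply left alone in this sketch. */
    if (imm < UINT64_MAX && t->umin < imm + 1)
        t->umin = imm + 1;
    /* Not taken: reg <= imm. */
    if (f->umax > imm)
        f->umax = imm;
}
```

For a fully unknown register and ``imm = 8`` this yields a umin of 9 on the
taken branch and a umax of 8 on the other, as in the text.
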
PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all
pointers sharing that same variable offset. This is important for packet range
checks: after adding a variable to a packet pointer register A, if you then copy
it to another register B and then add a constant 4 to A, both registers will
share the same 'id' but A will have a fixed offset of +4. Then if A is
bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is
now known to have a safe range of at least 4 bytes. See 'Direct packet access',
below, for more on PTR_TO_PACKET ranges.

The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of
the pointer returned from a map lookup. This means that when one copy is
checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.

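The effect of the shared 'id' on NULL checks can be sketched as follows. The
struct layout and the name ``mark_non_null`` are assumptions made for this
example, not the verifier's own definitions.

```c
#define NUM_REGS 10

enum reg_type { SCALAR_VALUE, PTR_TO_MAP_VALUE_OR_NULL, PTR_TO_MAP_VALUE };

/* id is shared by every copy of the pointer returned by one map lookup. */
struct reg { enum reg_type type; unsigned int id; };

/* Called for the branch where regs[checked] was found to be non-NULL:
 * every register carrying the same id is upgraded along with it. */
void mark_non_null(struct reg *regs, int checked)
{
    unsigned int id = regs[checked].id;

    if (regs[checked].type != PTR_TO_MAP_VALUE_OR_NULL)
        return;
    for (int i = 0; i < NUM_REGS; i++)
        if (regs[i].type == PTR_TO_MAP_VALUE_OR_NULL && regs[i].id == id)
            regs[i].type = PTR_TO_MAP_VALUE;
}
```

If R0 holds a lookup result and was copied to R6, a NULL check on R6 upgrades
both, while a pointer from a different lookup (different id) is unaffected.
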

As well as range-checking, the tracked information is also used for enforcing
alignment of pointer accesses. For instance, on most systems the packet pointer
is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump
over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting
pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
that pointer are safe.

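The alignment argument can be checked mechanically with the tnum
representation. This is a sketch under assumptions: ``pkt_access_aligned`` is a
hypothetical helper, the tnum functions are written from scratch for the
example, and IHL is taken to be a 4-bit field so that IHL * 4 has the tnum
(0x0; 0x3c).

```c
#include <stdbool.h>
#include <stdint.h>

struct tnum { uint64_t value; uint64_t mask; };

/* Same carry-aware addition as used for known-bits tracking. */
struct tnum tnum_add(struct tnum a, struct tnum b)
{
    uint64_t sm = a.mask + b.mask;
    uint64_t sv = a.value + b.value;
    uint64_t chi = (sm + sv) ^ sv;
    uint64_t mu = chi | a.mask | b.mask;

    return (struct tnum){ sv & ~mu, mu };
}

/* Aligned to `size` (a power of two) only if no low bit can be set. */
bool tnum_is_aligned(struct tnum a, uint64_t size)
{
    return !((a.value | a.mask) & (size - 1));
}

/* ip_align stands for NET_IP_ALIGN, fixed_off for the constant part of the
 * pointer offset, var_off for the variable part. */
bool pkt_access_aligned(uint64_t ip_align, uint64_t fixed_off,
                        struct tnum var_off, uint64_t size)
{
    struct tnum c = { ip_align + fixed_off, 0 };

    return tnum_is_aligned(tnum_add(var_off, c), size);
}
```

With a fixed offset of 14 and a variable offset of (0x0; 0x3c), a 4-byte
access is accepted when ``ip_align`` is 2 and refused when it is 0.
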

The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common
to all copies of the pointer returned from a socket lookup. This has similar
behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but
it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly
represents a reference to the corresponding ``struct sock``. To ensure that the
reference is not leaked, it is imperative to NULL-check the reference and, in
the non-NULL case, pass the valid reference to the socket release function.

Direct packet access
====================

In cls_bpf and act_bpf programs the verifier allows direct access to the packet
data via skb->data and skb->data_end pointers.
Ex::

  1:  r4 = *(u32 *)(r1 +80)  /* load skb->data_end */
  2:  r3 = *(u32 *)(r1 +76)  /* load skb->data */
  5:  if r5 > r4 goto pc+16
  R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
  6:  r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */

This 2-byte load from the packet is safe to do, since the program author
did check ``if (skb->data + 14 > skb->data_end) goto err`` at insn #5 which
means that in the fall-through case the register R3 (which points to skb->data)
has at least 14 directly accessible bytes. The verifier marks it
as R3=pkt(id=0,off=0,r=14).
id=0 means that no additional variables were added to the register.
off=0 means that no additional constants were added.
r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok.
Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points
to the packet data, but constant 14 was added to the register, so
it now points to ``skb->data + 14`` and accessible range is [R5, R5 + 14 - 14)
which is zero bytes.

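The meaning of ``off`` and ``r`` can be condensed into one check. The struct
and the name ``pkt_access_ok`` are hypothetical; the sketch assumes the range
is counted from the pointer's base, so the fixed offset is added before
comparing.

```c
#include <stdbool.h>

/* Mirrors the pkt(id=..,off=..,r=..) notation used in the verifier log. */
struct pkt_reg { int id; int off; int range; };

/* Would a `size`-byte access at reg + insn_off stay inside the bytes that
 * were proven to lie before skb->data_end? */
bool pkt_access_ok(struct pkt_reg reg, int insn_off, int size)
{
    int start = reg.off + insn_off;

    return size > 0 && start >= 0 && start + size <= reg.range;
}
```

For R3=pkt(id=0,off=0,r=14) the 2-byte load at offset 12 passes, while any
access through R5=pkt(id=0,off=14,r=14) fails because no bytes remain.
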
More complex packet access may look like::

  R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
  6:  r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */
  7:  r4 = *(u8 *)(r3 +12)
  9:  r3 = *(u32 *)(r1 +76) /* load skb->data */
  17:  r1 = *(u32 *)(r1 +80) /* load skb->data_end */
  18:  if r2 > r1 goto pc+2
  R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp
  19:  r1 = *(u8 *)(r3 +4)

The state of the register R3 is R3=pkt(id=2,off=0,r=8).
id=2 means that two ``r3 += rX`` instructions were seen, so r3 points to some
offset within a packet and since the program author did
``if (r3 + 8 > r1) goto err`` at insn #18, the safe range is [R3, R3 + 8).
The verifier only allows 'add'/'sub' operations on packet registers. Any other
operation will set the register state to 'SCALAR_VALUE' and it won't be
available for direct packet access.

Operation ``r3 += rX`` may overflow and become less than the original skb->data,
therefore the verifier has to prevent that. So when it sees an ``r3 += rX``
instruction and rX is more than a 16-bit value, any subsequent bounds-check of
r3 against skb->data_end will not give us 'range' information, so attempts to
read through the pointer will give an "invalid access to packet" error.

For example, after insn ``r4 = *(u8 *)(r3 +12)`` (insn #7 above) the state of r4
is R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that the upper 56
bits of the register are guaranteed to be zero, and nothing is known about the
lower 8 bits. After insn ``r4 *= 14`` the state becomes
R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit
value by constant 14 will keep the upper 52 bits as zero, also the least
significant bit will be zero as 14 is even. Similarly ``r2 >>= 48`` will make
R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign
extending. This logic is implemented in the adjust_reg_min_max_vals() function,
which calls adjust_ptr_min_max_vals() for adding a pointer to a scalar (or vice
versa) and adjust_scalar_min_max_vals() for operations on two scalars.

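The umax_value figures quoted above follow from simple arithmetic, sketched
below. The helper names are hypothetical and this is not the kernel's
implementation; the known-bits result of the multiplication is left out, and
only the right shift is shown at the tnum level.

```c
#include <stdint.h>

struct tnum { uint64_t value; uint64_t mask; };

/* Upper bound after multiplying by a constant; if the product could wrap,
 * nothing is known any more and the bound falls back to UINT64_MAX. */
uint64_t umax_mul(uint64_t umax, uint64_t k)
{
    if (k != 0 && umax > UINT64_MAX / k)
        return UINT64_MAX;
    return umax * k;
}

/* Upper bound after a logical (not sign extending) right shift. */
uint64_t umax_rsh(uint64_t umax, unsigned int shift)
{
    return umax >> shift;
}

/* Known bits after a logical right shift: zeros are shifted in from the top,
 * so the vacated high bits become known zero. */
struct tnum tnum_rshift(struct tnum a, unsigned int shift)
{
    return (struct tnum){ a.value >> shift, a.mask >> shift };
}
```

An 8-bit value (umax 255) multiplied by 14 has a umax of 3570, and a fully
unknown 64-bit value shifted right by 48 has umax 65535 and var_off
(0x0; 0xffff).
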
The end result is that a bpf program author can access the packet directly
using normal C code as::

  void *data = (void *)(long)skb->data;
  void *data_end = (void *)(long)skb->data_end;
  struct eth_hdr *eth = data;
  struct iphdr *iph = data + sizeof(*eth);
  struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph);

  if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end)
          return 0;
  if (eth->h_proto != htons(ETH_P_IP))
          return 0;
  if (iph->protocol != IPPROTO_UDP || iph->ihl != 5)
          return 0;
  if (udp->dest == 53 || udp->source == 9)
          ...;

which makes such programs easier to write compared to the LD_ABS insn,
and significantly faster.

Pruning
=======

The verifier does not actually walk all possible paths through the program. For
each new branch to analyse, the verifier looks at all the states it's previously
been in when at this instruction. If any of them contain the current state as a
subset, the branch is 'pruned' - that is, the fact that the previous state was
accepted implies the current state would be as well. For instance, if in the
previous state, r1 held a packet-pointer, and in the current state, r1 holds a
packet-pointer with a range as long or longer and at least as strict an
alignment, then r1 is safe. Similarly, if r2 was NOT_INIT before then it can't
have been used by any path from that point, so any value in r2 (including
another NOT_INIT) is safe. The implementation is in the function regsafe().
Pruning considers not only the registers but also the stack (and any spilled
registers it may hold). They must all be safe for the branch to be pruned.
This is implemented in states_equal().

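The subset test for a single register can be sketched as below. This is a
heavily simplified illustration and not the kernel's regsafe(): the name
``reg_safe`` and the struct are hypothetical, and alignment, offsets, ids and
signed bounds are all ignored.

```c
#include <stdbool.h>
#include <stdint.h>

enum reg_type { NOT_INIT, SCALAR_VALUE, PTR_TO_PACKET };

struct reg {
    enum reg_type type;
    uint64_t umin, umax; /* used for SCALAR_VALUE */
    int range;           /* used for PTR_TO_PACKET */
};

/* Returns true if the current register state is covered by a previously
 * verified state, so that this register does not prevent pruning. */
bool reg_safe(const struct reg *old, const struct reg *cur)
{
    /* Never read on any path from here in the old state: anything goes. */
    if (old->type == NOT_INIT)
        return true;
    if (cur->type != old->type)
        return false;
    switch (old->type) {
    case SCALAR_VALUE:
        /* The current value range must lie within the old one. */
        return old->umin <= cur->umin && cur->umax <= old->umax;
    case PTR_TO_PACKET:
        /* At least as many bytes must be known to be accessible. */
        return cur->range >= old->range;
    default:
        return false;
    }
}
```

A branch would be pruned only if such a check passes for every register and
every stack slot of the two states being compared.
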
Understanding eBPF verifier messages
====================================

The following are a few examples of invalid eBPF programs and verifier error
messages as seen in the log:

Program with unreachable instructions::

  static struct bpf_insn prog[] = {

Program that reads uninitialized register::

  BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),

Program that doesn't initialize R0 before exiting::

  BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),

Program that accesses stack out of bounds::

  BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),

Error::

  0: (7a) *(u64 *)(r10 +8) = 0
  invalid stack off=8 size=8

Program that doesn't initialize stack before passing its address into function::

  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_LD_MAP_FD(BPF_REG_1, 0),
  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),

Error::

  invalid indirect read from stack off -8+0 size 8

Program that uses invalid map_fd=0 while calling to map_lookup_elem() function::

  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_LD_MAP_FD(BPF_REG_1, 0),
  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),

Error::

  0: (7a) *(u64 *)(r10 -8) = 0
  fd 0 is not pointing to valid bpf_map

Program that doesn't check return value of map_lookup_elem() before accessing
map element::

  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_LD_MAP_FD(BPF_REG_1, 0),
  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),

Error::

  0: (7a) *(u64 *)(r10 -8) = 0
  5: (7a) *(u64 *)(r0 +0) = 0
  R0 invalid mem access 'map_value_or_null'

Program that correctly checks map_lookup_elem() returned value for NULL, but
accesses the memory with incorrect alignment::

  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_LD_MAP_FD(BPF_REG_1, 0),
  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
  BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),

Error::

  0: (7a) *(u64 *)(r10 -8) = 0
  5: (15) if r0 == 0x0 goto pc+1
  6: (7a) *(u64 *)(r0 +4) = 0
  misaligned access off 4 size 8

Program that correctly checks map_lookup_elem() returned value for NULL and
accesses memory with correct alignment in one side of 'if' branch, but fails
to do so in the other side of 'if' branch::

  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_LD_MAP_FD(BPF_REG_1, 0),
  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),

Error::

  0: (7a) *(u64 *)(r10 -8) = 0
  5: (15) if r0 == 0x0 goto pc+2
  6: (7a) *(u64 *)(r0 +0) = 0

  from 5 to 8: R0=imm0 R10=fp
  8: (7a) *(u64 *)(r0 +0) = 1
  R0 invalid mem access 'imm'

Program that performs a socket lookup then sets the pointer to NULL without
releasing it::

  BPF_MOV64_IMM(BPF_REG_2, 0),
  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_MOV64_IMM(BPF_REG_3, 4),
  BPF_MOV64_IMM(BPF_REG_4, 0),
  BPF_MOV64_IMM(BPF_REG_5, 0),
  BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
  BPF_MOV64_IMM(BPF_REG_0, 0),

Error::

  1: (63) *(u32 *)(r10 -8) = r2
  7: (85) call bpf_sk_lookup_tcp#65
  Unreleased reference id=1, alloc_insn=7

Program that performs a socket lookup but does not NULL-check the returned
value::

  BPF_MOV64_IMM(BPF_REG_2, 0),
  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_MOV64_IMM(BPF_REG_3, 4),
  BPF_MOV64_IMM(BPF_REG_4, 0),
  BPF_MOV64_IMM(BPF_REG_5, 0),
  BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),

Error::

  1: (63) *(u32 *)(r10 -8) = r2
  7: (85) call bpf_sk_lookup_tcp#65
  Unreleased reference id=1, alloc_insn=7