Yonghong Song [Thu, 24 May 2018 18:21:09 +0000 (11:21 -0700)]
bpf: introduce bpf subcommand BPF_TASK_FD_QUERY
Currently, suppose a userspace application has loaded a bpf program
and attached it to a tracepoint/kprobe/uprobe, and a bpf
introspection tool, e.g., bpftool, wants to show which bpf program
is attached to which tracepoint/kprobe/uprobe. Such attachment
information will be really useful to understand the overall bpf
deployment in the system.
There is a name field (16 bytes) for each program, which could
be used to encode the attachment point. There are some drawbacks
for this approaches. First, bpftool user (e.g., an admin) may not
really understand the association between the name and the
attachment point. Second, if one program is attached to multiple
places, encoding a proper name which can imply all these
attachments becomes difficult.
This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY.
Given a pid and fd, if the <pid, fd> is associated with a
tracepoint/kprobe/uprobe perf event, BPF_TASK_FD_QUERY will return
. prog_id
. tracepoint name, or
. k[ret]probe funcname + offset or kernel addr, or
. u[ret]probe filename + offset
to the userspace.
The user can use "bpftool prog" to find more information about
bpf program itself with prog_id.
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song [Thu, 24 May 2018 18:21:08 +0000 (11:21 -0700)]
perf/core: add perf_get_event() to return perf_event given a struct file
A new extern function, perf_get_event(), is added to return a perf event
given a struct file. This function will be used in later patches.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Thu, 24 May 2018 09:57:37 +0000 (11:57 +0200)]
Merge branch 'bpf-ipv6-seg6-bpf-action'
Mathieu Xhonneux says:
====================
As of Linux 4.14, it is possible to define advanced local processing for
IPv6 packets with a Segment Routing Header through the seg6local LWT
infrastructure. This LWT implements the network programming principles
defined in the IETF "SRv6 Network Programming" draft.
The implemented operations are generic, and it would be very interesting to
be able to implement user-specific seg6local actions, without having to
modify the kernel directly. To do so, this patchset adds an End.BPF action
to seg6local, powered by some specific Segment Routing-related helpers,
which provide SR functionalities that can be applied on the packet. This
BPF hook would then allow to implement specific actions at native kernel
speed such as OAM features, advanced SR SDN policies, SRv6 actions like
Segment Routing Header (SRH) encapsulation depending on the content of
the packet, etc.
This patchset is divided in 6 patches, whose main features are :
- A new seg6local action End.BPF with the corresponding new BPF program
type BPF_PROG_TYPE_LWT_SEG6LOCAL. Such attached BPF program can be
passed to the LWT seg6local through netlink, the same way as the LWT
BPF hook operates.
- 3 new BPF helpers for the seg6local BPF hook, allowing to edit/grow/
shrink a SRH and apply on a packet some of the generic SRv6 actions.
- 1 new BPF helper for the LWT BPF IN hook, allowing to add a SRH through
encapsulation (via IPv6 encapsulation or inlining if the packet contains
already an IPv6 header).
As this patchset adds a new LWT BPF hook, I took into account the result
of the discussions when the LWT BPF infrastructure got merged. Hence, the
seg6local BPF hook doesn't allow write access to skb->data directly, only
the SRH can be modified through specific helpers, which ensures that the
integrity of the packet is maintained. More details are available in the
related patches messages.
The performances of this BPF hook have been assessed with the BPF JIT
enabled on an Intel Xeon X3440 processors with 4 cores and 8 threads
clocked at 2.53 GHz. No throughput losses are noted with the seg6local
BPF hook when the BPF program does nothing (440kpps). Adding a 8-bytes
TLV (1 call each to bpf_lwt_seg6_adjust_srh and bpf_lwt_seg6_store_bytes)
drops the throughput to 410kpps, and inlining a SRH via bpf_lwt_seg6_action
drops the throughput to 420kpps. All throughputs are stable.
Changelog:
v2: move the SRH integrity state from skb->cb to a per-cpu buffer
v3: - document helpers in man-page style
- fix kbuild bugs
- un-break BPF LWT out hook
- bpf_push_seg6_encap is now static
- preempt_enable is now called when the packet is dropped in
input_action_end_bpf
v4: fix kbuild bugs when CONFIG_IPV6=m
v5: fix kbuild sparse warnings when CONFIG_IPV6=m
v6: fix skb pointers-related bugs in helpers
v7: - fix memory leak in error path of End.BPF setup
- add freeing of BPF data in seg6_local_destroy_state
- new enums SEG6_LOCAL_BPF_* instead of re-using ones of lwt bpf for
netlink nested bpf attributes
- SEG6_LOCAL_BPF_PROG attr now contains prog->aux->id when dumping
state
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Mathieu Xhonneux [Sun, 20 May 2018 13:58:17 +0000 (14:58 +0100)]
selftests/bpf: test for seg6local End.BPF action
Add a new test for the seg6local End.BPF action. The following helpers
are also tested:
- bpf_lwt_push_encap within the LWT BPF IN hook
- bpf_lwt_seg6_action
- bpf_lwt_seg6_adjust_srh
- bpf_lwt_seg6_store_bytes
A chain of End.BPF actions is built. The SRH is injected through a LWT
BPF IN hook before entering this chain. Each End.BPF action validates
the previous one, otherwise the packet is dropped. The test succeeds
if the last node in the chain receives the packet and the UDP datagram
contained can be retrieved from userspace.
Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Mathieu Xhonneux [Sun, 20 May 2018 13:58:16 +0000 (14:58 +0100)]
ipv6: sr: Add seg6local action End.BPF
This patch adds the End.BPF action to the LWT seg6local infrastructure.
This action works like any other seg6local End action, meaning that an IPv6
header with SRH is needed, whose DA has to be equal to the SID of the
action. It will also advance the SRH to the next segment, the BPF program
does not have to take care of this.
Since the BPF program may not be a source of instability in the kernel, it
is important to ensure that the integrity of the packet is maintained
before yielding it back to the IPv6 layer. The hook hence keeps track if
the SRH has been altered through the helpers, and re-validates its
content if needed with seg6_validate_srh. The state kept for validation is
stored in a per-CPU buffer. The BPF program is not allowed to directly
write into the packet, and only some fields of the SRH can be altered
through the helper bpf_lwt_seg6_store_bytes.
Performances profiling has shown that the SRH re-validation does not induce
a significant overhead. If the altered SRH is deemed as invalid, the packet
is dropped.
This validation is also done before executing any action through
bpf_lwt_seg6_action, and will not be performed again if the SRH is not
modified after calling the action.
The BPF program may return 3 types of return codes:
- BPF_OK: the End.BPF action will look up the next destination through
seg6_lookup_nexthop.
- BPF_REDIRECT: if an action has been executed through the
bpf_lwt_seg6_action helper, the BPF program should return this
value, as the skb's destination is already set and the default
lookup should not be performed.
- BPF_DROP : the packet will be dropped.
Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Mathieu Xhonneux [Sun, 20 May 2018 13:58:15 +0000 (14:58 +0100)]
bpf: Split lwt inout verifier structures
The new bpf_lwt_push_encap helper should only be accessible within the
LWT BPF IN hook, and not the OUT one, as this may lead to a skb under
panic.
At the moment, both LWT BPF IN and OUT share the same list of helpers,
whose calls are authorized by the verifier. This patch separates the
verifier ops for the IN and OUT hooks, and allows the IN hook to call the
bpf_lwt_push_encap helper.
This patch is also the occasion to put all lwt_*_func_proto functions
together for clarity. At the moment, socks_op_func_proto is in the middle
of lwt_inout_func_proto and lwt_xmit_func_proto.
Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Mathieu Xhonneux [Sun, 20 May 2018 13:58:14 +0000 (14:58 +0100)]
bpf: Add IPv6 Segment Routing helpers
The BPF seg6local hook should be powerful enough to enable users to
implement most of the use-cases one could think of. After some thinking,
we figured out that the following actions should be possible on a SRv6
packet, requiring 3 specific helpers :
- bpf_lwt_seg6_store_bytes: Modify non-sensitive fields of the SRH
- bpf_lwt_seg6_adjust_srh: Allow to grow or shrink a SRH
(to add/delete TLVs)
- bpf_lwt_seg6_action: Apply some SRv6 network programming actions
(specifically End.X, End.T, End.B6 and
End.B6.Encap)
The specifications of these helpers are provided in the patch (see
include/uapi/linux/bpf.h).
The non-sensitive fields of the SRH are the following : flags, tag and
TLVs. The other fields can not be modified, to maintain the SRH
integrity. Flags, tag and TLVs can easily be modified as their validity
can be checked afterwards via seg6_validate_srh. It is not allowed to
modify the segments directly. If one wants to add segments on the path,
he should stack a new SRH using the End.B6 action via
bpf_lwt_seg6_action.
Growing, shrinking or editing TLVs via the helpers will flag the SRH as
invalid, and it will have to be re-validated before re-entering the IPv6
layer. This flag is stored in a per-CPU buffer, along with the current
header length in bytes.
Storing the SRH len in bytes in the control block is mandatory when using
bpf_lwt_seg6_adjust_srh. The Header Ext. Length field contains the SRH
len rounded to 8 bytes (a padding TLV can be inserted to ensure the 8-bytes
boundary). When adding/deleting TLVs within the BPF program, the SRH may
temporary be in an invalid state where its length cannot be rounded to 8
bytes without remainder, hence the need to store the length in bytes
separately. The caller of the BPF program can then ensure that the SRH's
final length is valid using this value. Again, a final SRH modified by a
BPF program which doesn’t respect the 8-bytes boundary will be discarded
as it will be considered as invalid.
Finally, a fourth helper is provided, bpf_lwt_push_encap, which is
available from the LWT BPF IN hook, but not from the seg6local BPF one.
This helper allows to encapsulate a Segment Routing Header (either with
a new outer IPv6 header, or by inlining it directly in the existing IPv6
header) into a non-SRv6 packet. This helper is required if we want to
offer the possibility to dynamically encapsulate a SRH for non-SRv6 packet,
as the BPF seg6local hook only works on traffic already containing a SRH.
This is the BPF equivalent of the seg6 LWT infrastructure, which achieves
the same purpose but with a static SRH per route.
These helpers require CONFIG_IPV6=y (and not =m).
Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Mathieu Xhonneux [Sun, 20 May 2018 13:58:13 +0000 (14:58 +0100)]
ipv6: sr: export function lookup_nexthop
The function lookup_nexthop is essential to implement most of the seg6local
actions. As we want to provide a BPF helper allowing to apply some of these
actions on the packet being processed, the helper should be able to call
this function, hence the need to make it public.
Moreover, if one argument is incorrect or if the next hop can not be found,
an error should be returned by the BPF helper so the BPF program can adapt
its processing of the packet (return an error, properly force the drop,
...). This patch hence makes this function return dst->error to indicate a
possible error.
Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Mathieu Xhonneux [Sun, 20 May 2018 13:58:12 +0000 (14:58 +0100)]
ipv6: sr: make seg6.h includable without IPv6
include/net/seg6.h cannot be included in a source file if CONFIG_IPV6 is
not enabled:
include/net/seg6.h: In function 'seg6_pernet':
>> include/net/seg6.h:52:14: error: 'struct net' has no member named
'ipv6'; did you mean 'ipv4'?
return net->ipv6.seg6_data;
^~~~
ipv4
This commit makes seg6_pernet return NULL if IPv6 is not compiled, hence
allowing seg6.h to be included regardless of the configuration.
Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Thu, 24 May 2018 07:20:50 +0000 (09:20 +0200)]
Merge branch 'bpf-multi-prog-improvements'
Sandipan Das says:
====================
[1] Support for bpf-to-bpf function calls in the powerpc64 JIT compiler.
[2] Provide a way for resolving function calls because of the way JITed
images are allocated in powerpc64.
[3] Fix to get JITed instruction dumps for multi-function programs from
the bpf system call.
[4] Fix for bpftool to show delimited multi-function JITed image dumps.
v4:
- Incorporate review comments from Jakub.
- Fix JSON output for bpftool.
v3:
- Change base tree tag to bpf-next.
- Incorporate review comments from Alexei, Daniel and Jakub.
- Make sure that the JITed image does not grow or shrink after
the last pass due to the way the instruction sequence used
to load a callee's address maybe optimized.
- Make additional changes to the bpf system call and bpftool to
make multi-function JITed dumps easier to correlate.
v2:
- Incorporate review comments from Jakub.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Sandipan Das [Thu, 24 May 2018 06:56:54 +0000 (12:26 +0530)]
tools: bpftool: add delimiters to multi-function JITed dumps
This splits up the contiguous JITed dump obtained via the bpf
system call into more relatable chunks for each function in
the program. If the kernel symbols corresponding to these are
known, they are printed in the header for each JIT image dump
otherwise the masked start address is printed.
Before applying this patch:
# bpftool prog dump jited id 1
0: push %rbp
1: mov %rsp,%rbp
...
70: leaveq
71: retq
72: push %rbp
73: mov %rsp,%rbp
...
dd: leaveq
de: retq
# bpftool -p prog dump jited id 1
[{
"pc": "0x0",
"operation": "push",
"operands": ["%rbp"
]
},{
...
},{
"pc": "0x71",
"operation": "retq",
"operands": [null
]
},{
"pc": "0x72",
"operation": "push",
"operands": ["%rbp"
]
},{
...
},{
"pc": "0xde",
"operation": "retq",
"operands": [null
]
}
]
After applying this patch:
# echo 0 > /proc/sys/net/core/bpf_jit_kallsyms
# bpftool prog dump jited id 1
0xffffffffc02c7000:
0: push %rbp
1: mov %rsp,%rbp
...
70: leaveq
71: retq
0xffffffffc02cf000:
0: push %rbp
1: mov %rsp,%rbp
...
6b: leaveq
6c: retq
# bpftool -p prog dump jited id 1
[{
"name": "0xffffffffc02c7000",
"insns": [{
"pc": "0x0",
"operation": "push",
"operands": ["%rbp"
]
},{
...
},{
"pc": "0x71",
"operation": "retq",
"operands": [null
]
}
]
},{
"name": "0xffffffffc02cf000",
"insns": [{
"pc": "0x0",
"operation": "push",
"operands": ["%rbp"
]
},{
...
},{
"pc": "0x6c",
"operation": "retq",
"operands": [null
]
}
]
}
]
# echo 1 > /proc/sys/net/core/bpf_jit_kallsyms
# bpftool prog dump jited id 1
bpf_prog_b811aab41a39ad3d_foo:
0: push %rbp
1: mov %rsp,%rbp
...
70: leaveq
71: retq
bpf_prog_cf418ac8b67bebd9_F:
0: push %rbp
1: mov %rsp,%rbp
...
6b: leaveq
6c: retq
# bpftool -p prog dump jited id 1
[{
"name": "bpf_prog_b811aab41a39ad3d_foo",
"insns": [{
"pc": "0x0",
"operation": "push",
"operands": ["%rbp"
]
},{
...
},{
"pc": "0x71",
"operation": "retq",
"operands": [null
]
}
]
},{
"name": "bpf_prog_cf418ac8b67bebd9_F",
"insns": [{
"pc": "0x0",
"operation": "push",
"operands": ["%rbp"
]
},{
...
},{
"pc": "0x6c",
"operation": "retq",
"operands": [null
]
}
]
}
]
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Sandipan Das [Thu, 24 May 2018 06:56:53 +0000 (12:26 +0530)]
tools: bpf: sync bpf uapi header
Syncing the bpf.h uapi header with tools so that struct
bpf_prog_info has the two new fields for passing on the
JITed image lengths of each function in a multi-function
program.
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Sandipan Das [Thu, 24 May 2018 06:56:52 +0000 (12:26 +0530)]
bpf: get JITed image lengths of functions via syscall
This adds new two new fields to struct bpf_prog_info. For
multi-function programs, these fields can be used to pass
a list of the JITed image lengths of each function for a
given program to userspace using the bpf system call with
the BPF_OBJ_GET_INFO_BY_FD command.
This can be used by userspace applications like bpftool
to split up the contiguous JITed dump, also obtained via
the system call, into more relatable chunks corresponding
to each function.
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Sandipan Das [Thu, 24 May 2018 06:56:51 +0000 (12:26 +0530)]
bpf: fix multi-function JITed dump obtained via syscall
Currently, for multi-function programs, we cannot get the JITed
instructions using the bpf system call's BPF_OBJ_GET_INFO_BY_FD
command. Because of this, userspace tools such as bpftool fail
to identify a multi-function program as being JITed or not.
With the JIT enabled and the test program running, this can be
verified as follows:
# cat /proc/sys/net/core/bpf_jit_enable
1
Before applying this patch:
# bpftool prog list
1: kprobe name foo tag
b811aab41a39ad3d gpl
loaded_at 2018-05-16T11:43:38+0530 uid 0
xlated 216B not jited memlock 65536B
...
# bpftool prog dump jited id 1
no instructions returned
After applying this patch:
# bpftool prog list
1: kprobe name foo tag
b811aab41a39ad3d gpl
loaded_at 2018-05-16T12:13:01+0530 uid 0
xlated 216B jited 308B memlock 65536B
...
# bpftool prog dump jited id 1
0: nop
4: nop
8: mflr r0
c: std r0,16(r1)
10: stdu r1,-112(r1)
14: std r31,104(r1)
18: addi r31,r1,48
1c: li r3,10
...
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Sandipan Das [Thu, 24 May 2018 06:56:50 +0000 (12:26 +0530)]
tools: bpftool: resolve calls without using imm field
Currently, we resolve the callee's address for a JITed function
call by using the imm field of the call instruction as an offset
from __bpf_call_base. If bpf_jit_kallsyms is enabled, we further
use this address to get the callee's kernel symbol's name.
For some architectures, such as powerpc64, the imm field is not
large enough to hold this offset. So, instead of assigning this
offset to the imm field, the verifier now assigns the subprog
id. Also, a list of kernel symbol addresses for all the JITed
functions is provided in the program info. We now use the imm
field as an index for this list to lookup a callee's symbol's
address and resolve its name.
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Sandipan Das [Thu, 24 May 2018 06:56:49 +0000 (12:26 +0530)]
tools: bpf: sync bpf uapi header
Syncing the bpf.h uapi header with tools so that struct
bpf_prog_info has the two new fields for passing on the
addresses of the kernel symbols corresponding to each
function in a program.
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Sandipan Das [Thu, 24 May 2018 06:56:48 +0000 (12:26 +0530)]
bpf: get kernel symbol addresses via syscall
This adds new two new fields to struct bpf_prog_info. For
multi-function programs, these fields can be used to pass
a list of kernel symbol addresses for all functions in a
given program to userspace using the bpf system call with
the BPF_OBJ_GET_INFO_BY_FD command.
When bpf_jit_kallsyms is enabled, we can get the address
of the corresponding kernel symbol for a callee function
and resolve the symbol's name. The address is determined
by adding the value of the call instruction's imm field
to __bpf_call_base. This offset gets assigned to the imm
field by the verifier.
For some architectures, such as powerpc64, the imm field
is not large enough to hold this offset.
We resolve this by:
[1] Assigning the subprog id to the imm field of a call
instruction in the verifier instead of the offset of
the callee's symbol's address from __bpf_call_base.
[2] Determining the address of a callee's corresponding
symbol by using the imm field as an index for the
list of kernel symbol addresses now available from
the program info.
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Sandipan Das [Thu, 24 May 2018 06:56:47 +0000 (12:26 +0530)]
bpf: powerpc64: add JIT support for multi-function programs
This adds support for bpf-to-bpf function calls in the powerpc64
JIT compiler. The JIT compiler converts the bpf call instructions
to native branch instructions. After a round of the usual passes,
the start addresses of the JITed images for the callee functions
are known. Finally, to fixup the branch target addresses, we need
to perform an extra pass.
Because of the address range in which JITed images are allocated
on powerpc64, the offsets of the start addresses of these images
from __bpf_call_base are as large as 64 bits. So, for a function
call, we cannot use the imm field of the instruction to determine
the callee's address. Instead, we use the alternative method of
getting it from the list of function addresses in the auxiliary
data of the caller by using the off field as an index.
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Sandipan Das [Thu, 24 May 2018 06:56:46 +0000 (12:26 +0530)]
bpf: powerpc64: pad function address loads with NOPs
For multi-function programs, loading the address of a callee
function to a register requires emitting instructions whose
count varies from one to five depending on the nature of the
address.
Since we come to know of the callee's address only before the
extra pass, the number of instructions required to load this
address may vary from what was previously generated. This can
make the JITed image grow or shrink.
To avoid this, we should generate a constant five-instruction
when loading function addresses by padding the optimized load
sequence with NOPs.
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Sandipan Das [Thu, 24 May 2018 06:56:45 +0000 (12:26 +0530)]
bpf: support 64-bit offsets for bpf function calls
The imm field of a bpf instruction is a signed 32-bit integer.
For JITed bpf-to-bpf function calls, it holds the offset of the
start address of the callee's JITed image from __bpf_call_base.
For some architectures, such as powerpc64, this offset may be
as large as 64 bits and cannot be accomodated in the imm field
without truncation.
We resolve this by:
[1] Additionally using the auxiliary data of each function to
keep a list of start addresses of the JITed images for all
functions determined by the verifier.
[2] Retaining the subprog id inside the off field of the call
instructions and using it to index into the list mentioned
above and lookup the callee's address.
To make sure that the existing JIT compilers continue to work
without requiring changes, we keep the imm field as it is.
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Wed, 23 May 2018 18:32:36 +0000 (11:32 -0700)]
bpf: btf: Avoid variable length array
Sparse warning:
kernel/bpf/btf.c:1985:34: warning: Variable length array is used.
This patch directly uses ARRAY_SIZE().
Fixes: f80442a4cd18 ("bpf: btf: Change how section is supported in btf_header")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Sirio Balmelli [Wed, 23 May 2018 16:17:07 +0000 (18:17 +0200)]
tools/lib/libbpf.c: fix string format to allow build on arm32
On arm32, 'cd tools/testing/selftests/bpf && make' fails with:
libbpf.c:80:10: error: format ‘%ld’ expects argument of type ‘long int’, but argument 4 has type ‘int64_t {aka long long int}’ [-Werror=format=]
(func)("libbpf: " fmt, ##__VA_ARGS__); \
^
libbpf.c:83:30: note: in expansion of macro ‘__pr’
#define pr_warning(fmt, ...) __pr(__pr_warning, fmt, ##__VA_ARGS__)
^~~~
libbpf.c:1072:3: note: in expansion of macro ‘pr_warning’
pr_warning("map:%s value_type:%s has BTF type_size:%ld != value_size:%u\n",
To fix, typecast 'key_size' and amend format string.
Signed-off-by: Sirio Balmelli <sirio@b-ad.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Sirio Balmelli [Mon, 21 May 2018 07:00:03 +0000 (09:00 +0200)]
selftests/bpf: Makefile fix "missing" headers on build with -idirafter
Selftests fail to build on several distros/architectures because of
missing headers files.
On a Ubuntu/x86_64 some missing headers are:
asm/byteorder.h, asm/socket.h, asm/sockios.h
On a Debian/arm32 build already fails at sys/cdefs.h
In both cases, these already exist in /usr/include/<arch-specific-dir>,
but Clang does not include these when using '-target bpf' flag,
since it is no longer compiling against the host architecture.
The solution is to:
- run Clang without '-target bpf' and extract the include chain for the
current system
- add these to the bpf build with '-idirafter'
The choice of -idirafter is to catch this error without injecting
unexpected include behavior: if an arch-specific tree is built
for bpf in the future, this will be correctly found by Clang.
Signed-off-by: Sirio Balmelli <sirio@b-ad.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Wed, 23 May 2018 10:03:33 +0000 (12:03 +0200)]
Merge branch 'btf-uapi-cleanups'
Martin KaFai Lau says:
====================
This patch set makes some changes to cleanup the unused
bits in BTF uapi. It also makes the btf_header extensible.
Please see individual patches for details.
v2:
- Remove NR_SECS from patch 2
- Remove "unsigned" check on array->index_type from patch 3
- Remove BTF_INT_VARARGS and further limit BTF_INT_ENCODING
from 8 bits to 4 bits in patch 4
- Adjustments in test_btf.c to reflect changes in v2
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Tue, 22 May 2018 22:04:24 +0000 (15:04 -0700)]
bpf: btf: Add tests for the btf uapi changes
This patch does the followings:
1. Modify libbpf and test_btf to reflect the uapi changes in btf
2. Add test for the btf_header changes
3. Add tests for array->index_type
4. Add err_str check to the tests
5. Fix a 4 bytes hole in "struct test #1" by swapping "m" and "n"
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Tue, 22 May 2018 21:57:22 +0000 (14:57 -0700)]
bpf: btf: Sync bpf.h and btf.h to tools
This patch sync the uapi bpf.h and btf.h to tools.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Tue, 22 May 2018 21:57:21 +0000 (14:57 -0700)]
bpf: btf: Rename btf_key_id and btf_value_id in bpf_map_info
In "struct bpf_map_info", the name "btf_id", "btf_key_id" and "btf_value_id"
could cause confusion because the "id" of "btf_id" means the BPF obj id
given to the BTF object while
"btf_key_id" and "btf_value_id" means the BTF type id within
that BTF object.
To make it clear, btf_key_id and btf_value_id are
renamed to btf_key_type_id and btf_value_type_id.
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Tue, 22 May 2018 21:57:20 +0000 (14:57 -0700)]
bpf: btf: Remove unused bits from uapi/linux/btf.h
This patch does the followings:
1. Limit BTF_MAX_TYPES and BTF_MAX_NAME_OFFSET to 64k. We can
raise it later.
2. Remove the BTF_TYPE_PARENT and BTF_STR_TBL_ELF_ID. They are
currently encoded at the highest bit of a u32.
It is because the current use case does not require supporting
parent type (i.e type_id referring to a type in another BTF file).
It also does not support referring to a string in ELF.
The BTF_TYPE_PARENT and BTF_STR_TBL_ELF_ID checks are replaced
by BTF_TYPE_ID_CHECK and BTF_STR_OFFSET_CHECK which are
defined in btf.c instead of uapi/linux/btf.h.
3. Limit the BTF_INFO_KIND from 5 bits to 4 bits which is enough.
There is unused bits headroom if we ever needed it later.
4. The root bit in BTF_INFO is also removed because it is not
used in the current use case.
5. Remove BTF_INT_VARARGS since func type is not supported now.
The BTF_INT_ENCODING is limited to 4 bits instead of 8 bits.
The above can be added back later because the verifier
ensures the unused bits are zeros.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Tue, 22 May 2018 21:57:19 +0000 (14:57 -0700)]
bpf: btf: Check array->index_type
Instead of ingoring the array->index_type field. Enforce that
it must be a BTF_KIND_INT in size 1/2/4/8 bytes.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Tue, 22 May 2018 21:57:18 +0000 (14:57 -0700)]
bpf: btf: Change how section is supported in btf_header
There are currently unused section descriptions in the btf_header. Those
sections are here to support future BTF use cases. For example, the
func section (func_off) is to support function signature (e.g. the BPF
prog function signature).
Instead of spelling out all potential sections up-front in the btf_header.
This patch makes changes to btf_header such that extending it (e.g. adding
a section) is possible later. The unused ones can be removed for now and
they can be added back later.
This patch:
1. adds a hdr_len to the btf_header. It will allow adding
sections (and other info like parent_label and parent_name)
later. The check is similar to the existing bpf_attr.
If a user passes in a longer hdr_len, the kernel
ensures the extra tailing bytes are 0.
2. allows the section order in the BTF object to be
different from its sec_off order in btf_header.
3. each sec_off is followed by a sec_len. It must not have gap or
overlapping among sections.
The string section is ensured to be at the end due to the 4 bytes
alignment requirement of the type section.
The above changes will allow enough flexibility to
add new sections (and other info) to the btf_header later.
This patch also removes an unnecessary !err check
at the end of btf_parse().
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Tue, 22 May 2018 22:03:31 +0000 (15:03 -0700)]
bpf: Expose check_uarg_tail_zero()
This patch exposes check_uarg_tail_zero() which will
be reused by a later BTF patch. Its name is changed to
bpf_check_uarg_tail_zero().
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Tue, 22 May 2018 08:51:10 +0000 (10:51 +0200)]
Merge branch 'bpf-fib-mtu-check'
David Ahern says:
====================
Packets that exceed the egress MTU can not be forwarded in the fast path.
Add IPv4 and IPv6 MTU helpers that take a FIB lookup result (versus the
typical dst path) and add the calls to bpf_ipv{4,6}_fib_lookup.
v2
- add ip6_mtu_from_fib6 to ipv6_stub
- only call the new MTU helpers for fib lookups in XDP path; skb
path uses is_skb_forwardable to determine if the packet can be
sent via the egress device from the FIB lookup
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David Ahern [Mon, 21 May 2018 16:08:15 +0000 (09:08 -0700)]
bpf: Add mtu checking to FIB forwarding helper
Add check that egress MTU can handle packet to be forwarded. If
the MTU is less than the packet length, return 0 meaning the
packet is expected to continue up the stack for help - eg.,
fragmenting the packet or sending an ICMP.
The XDP path needs to leverage the FIB entry for an MTU on the
route spec or an exception entry for a given destination. The
skb path lets is_skb_forwardable decide if the packet can be
sent.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David Ahern [Mon, 21 May 2018 16:08:14 +0000 (09:08 -0700)]
net/ipv6: Add helper to return path MTU based on fib result
Determine path MTU from a FIB lookup result. Logic is based on
ip6_dst_mtu_forward plus lookup of nexthop exception.
Add ip6_dst_mtu_forward to ipv6_stubs to handle access by core
bpf code.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David Ahern [Mon, 21 May 2018 16:08:13 +0000 (09:08 -0700)]
net/ipv4: Add helper to return path MTU based on fib result
Determine path MTU from a FIB lookup result. Logic is a distillation of
ip_dst_mtu_maybe_forward.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Tue, 22 May 2018 08:25:07 +0000 (10:25 +0200)]
Merge branch 'bpf-af-xdp-cleanups'
Björn Töpel says:
====================
This the second follow-up set. The first four patches are uapi
changes:
* Removing rebind support
* Getting rid of structure hole
* Removing explicit cache line alignment
* Stricter bind checks
The last patches do some cleanups, where the umem and refcount_t
changes were suggested by Daniel.
* Add a missing write-barrier and use READ_ONCE for data-dependencies
* Clean up umem and do proper locking
* Convert atomic_t to refcount_t
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Björn Töpel [Tue, 22 May 2018 07:35:03 +0000 (09:35 +0200)]
xsk: convert atomic_t to refcount_t
Introduce refcount_t, in favor of atomic_t.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Björn Töpel [Tue, 22 May 2018 07:35:02 +0000 (09:35 +0200)]
xsk: simplified umem setup
As suggested by Daniel Borkmann, the umem setup code was a too
defensive and complex. Here, we reduce the number of checks. Also, the
memory pinning is now folded into the umem creation, and we do correct
locking.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Björn Töpel [Tue, 22 May 2018 07:35:01 +0000 (09:35 +0200)]
xsk: add missing write- and data-dependency barrier
Here, we add a missing write-barrier, and use READ_ONCE for the
data-dependency barrier.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Björn Töpel [Tue, 22 May 2018 07:35:00 +0000 (09:35 +0200)]
samples/bpf: adapt xdpsock to the new uapi
Adapt xdpsock to use the new getsockopt introduced in the previous
commit.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Björn Töpel [Tue, 22 May 2018 07:34:59 +0000 (09:34 +0200)]
xsk: remove explicit ring structure from uapi
In this commit we remove the explicit ring structure from the the
uapi. It is tricky for an uapi to depend on a certain L1 cache line
size, since it can differ for variants of the same architecture. Now,
we let the user application determine the offsets of the producer,
consumer and descriptors by asking the socket via getsockopt.
A typical flow would be (Rx ring):
struct xdp_mmap_offsets off;
struct xdp_desc *ring;
u32 *prod, *cons;
void *map;
...
getsockopt(fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen);
map = mmap(NULL, off.rx.desc +
NUM_DESCS * sizeof(struct xdp_desc),
PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_POPULATE, sfd,
XDP_PGOFF_RX_RING);
prod = map + off.rx.producer;
cons = map + off.rx.consumer;
ring = map + off.rx.desc;
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Magnus Karlsson [Tue, 22 May 2018 07:34:58 +0000 (09:34 +0200)]
xsk: proper queue id check at bind
Validate the queue id against both Rx and Tx on the netdev. Also, make
sure that the queue exists at xmit time.
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Björn Töpel [Tue, 22 May 2018 07:34:57 +0000 (09:34 +0200)]
xsk: fill hole in struct sockaddr_xdp
Move the sxdp_flags up, avoiding a hole in the uapi structure.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Björn Töpel [Tue, 22 May 2018 07:34:56 +0000 (09:34 +0200)]
xsk: remove rebind support
Supporting rebind, i.e. after a successful bind the process can call
bind again without closing the socket, makes the AF_XDP setup state
machine more complex. Constrain the state space, by not supporting
rebind.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Fri, 18 May 2018 20:44:11 +0000 (22:44 +0200)]
Merge branch 'bpf-sk-msg-fields'
John Fastabend says:
====================
In this series we add the ability for sk msg programs to read basic
sock information about the sock they are attached to. The second
patch adds the tests to the selftest test_verifier.
One observation that I had from writing this seriess is lots of the
./net/core/filter.c code is almost duplicated across program types.
I thought about building a template/macro that we could use as a
single block of code to read sock data out for multiple programs,
but I wasn't convinced it was worth it yet. The result was using a
macro saved a couple lines of code per block but made the code
a bit harder to read IMO. We can probably revisit the idea later
if we get more duplication.
v2: add errstr field to negative test_verifier test cases to ensure
we get the expected err string back from the verifier.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
John Fastabend [Thu, 17 May 2018 21:17:03 +0000 (14:17 -0700)]
bpf: add sk_msg prog sk access tests to test_verifier
Add tests for BPF_PROG_TYPE_SK_MSG to test_verifier for read access
to new sk fields.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
John Fastabend [Thu, 17 May 2018 21:16:58 +0000 (14:16 -0700)]
bpf: allow sk_msg programs to read sock fields
Currently sk_msg programs only have access to the raw data. However,
it is often useful when building policies to have the policies specific
to the socket endpoint. This allows using the socket tuple as input
into filters, etc.
This patch adds ctx access to the sock fields.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Fri, 18 May 2018 19:35:56 +0000 (21:35 +0200)]
Merge branch 'bpf-nfp-shift-insns'
Jiong Wang says:
====================
NFP eBPF JIT is missing logic indirect shifts (both left and right) and
arithmetic right shift (both indirect shift and shift by constant).
This patch adds support for them.
For indirect shifts, shift amount is not specified as constant, NFP needs
to get the shift amount through the low 5 bits of source A operand in
PREV_ALU, therefore extra instructions are needed compared with shifts by
constants.
Because NFP is 32-bit, so we are using register pair for 64-bit shifts and
therefore would need different instruction sequences depending on whether
shift amount is less than 32 or not.
NFP branch-on-bit-test instruction emitter is added by this patch set and
is used for efficient runtime check on shift amount. We'd think the shift
amount is less than 32 if bit 5 is clear and greater or equal then 32
otherwise. Shift amount is greater than or equal to 64 will result in
undefined behavior.
This patch also use range info to avoid generating unnecessary runtime code
if we are certain shift amount is less than 32 or not.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jiong Wang [Fri, 18 May 2018 19:12:11 +0000 (12:12 -0700)]
nfp: bpf: support arithmetic indirect right shift (BPF_ARSH | BPF_X)
Code logic is similar with arithmetic right shift by constant, and NFP
get indirect shift amount through source A operand of PREV_ALU.
It is possible to fall back to logic right shift if the MSB is known to be
zero from range info, however there is no benefit to do this given logic
indirect right shift use the same number and cycle of instruction sequence.
Suppose the MSB of regX is the bit we want to replicate to fill in all the
vacant positions, and regY contains the shift amount, then we could use
single instruction to set up both.
[alu, --, regY, OR, regX]
--
NOTE: the PREV_ALU result doesn't need to write to any destination
register.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jiong Wang [Fri, 18 May 2018 19:12:10 +0000 (12:12 -0700)]
nfp: bpf: support arithmetic right shift by constant (BPF_ARSH | BPF_K)
Code logic is similar with logic right shift except we also need to set
PREV_ALU result properly, the MSB of which is the bit that will be
replicated to fill in all the vacant positions.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jiong Wang [Fri, 18 May 2018 19:12:09 +0000 (12:12 -0700)]
nfp: bpf: support logic indirect shifts (BPF_[L|R]SH | BPF_X)
For indirect shifts, shift amount is not specified as constant, NFP needs
to get the shift amount through the low 5 bits of source A operand in
PREV_ALU, therefore extra instructions are needed compared with shifts by
constants.
Because NFP is 32-bit, so we are using register pair for 64-bit shifts and
therefore would need different instruction sequences depending on whether
shift amount is less than 32 or not.
NFP branch-on-bit-test instruction emitter is added by this patch and is
used for efficient runtime check on shift amount. We'd think the shift
amount is less than 32 if bit 5 is clear and greater or equal than 32
otherwise. Shift amount is greater than or equal to 64 will result in
undefined behavior.
This patch also use range info to avoid generating unnecessary runtime code
if we are certain shift amount is less than 32 or not.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Fri, 18 May 2018 14:07:04 +0000 (16:07 +0200)]
Merge branch 'bpf-af-xdp-cleanups'
Björn Töpel says:
====================
This series contain "cosmetics only" follow-up patches for AF_XDP.
Thanks to Daniel for suggesting them!
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Björn Töpel [Fri, 18 May 2018 12:00:24 +0000 (14:00 +0200)]
xsk: proper '=' alignment
Properly align xsk_proto_ops initialization.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Björn Töpel [Fri, 18 May 2018 12:00:23 +0000 (14:00 +0200)]
xsk: fixed some cases of unnecessary parentheses
Removed some cases of unnecessary parentheses.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Björn Töpel [Fri, 18 May 2018 12:00:22 +0000 (14:00 +0200)]
xsk: remove newline at end of file
Minor cleanup, remove newline at end of Makefile.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Björn Töpel [Fri, 18 May 2018 12:00:21 +0000 (14:00 +0200)]
xsk: clean up SPDX headers
Clean up SPDX-License-Identifier and removing licensing leftovers.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Gustavo A. R. Silva [Thu, 17 May 2018 14:11:02 +0000 (09:11 -0500)]
bpf: sockmap, fix double-free
`e' is being freed twice.
Fix this by removing one of the kfree() calls.
Addresses-Coverity-ID:
1468983 ("Double free")
Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Gustavo A. R. Silva [Thu, 17 May 2018 14:08:43 +0000 (09:08 -0500)]
bpf: sockmap, fix uninitialized variable
There is a potential execution path in which variable err is
returned without being properly initialized previously.
Fix this by initializing variable err to 0.
Addresses-Coverity-ID:
1468964 ("Uninitialized scalar variable")
Fixes: e5cd3abcb31a ("bpf: sockmap, refactor sockmap routines to work with hashmap")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Quentin Monnet [Thu, 17 May 2018 12:43:56 +0000 (13:43 +0100)]
bpf: change eBPF helper doc parsing script to allow for smaller indent
Documentation for eBPF helpers can be parsed from bpf.h and eventually
turned into a man page. Commit
6f96674dbd8c ("bpf: relax constraints on
formatting for eBPF helper documentation") changed the script used to
parse it, in order to allow for different indent style and to ease the
work for writing documentation for future helpers.
The script currently considers that the first tab can be replaced by 6
to 8 spaces. But the documentation for bpf_fib_lookup() uses a mix of
tabs (for the "Description" part) and of spaces ("Return" part), and
only has 5 space long indent for the latter.
We probably do not want to change the values accepted by the script each
time a new helper gets a new indent style. However, it is worth noting
that with those 5 spaces, the "Description" and "Return" part *look*
aligned in the generated patch and in `git show`, so it is likely other
helper authors will use the same length. Therefore, allow for helper
documentation to use 5 spaces only for the first indent level.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David S. Miller [Thu, 17 May 2018 02:47:11 +0000 (22:47 -0400)]
Merge git://git./linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-05-17
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Provide a new BPF helper for doing a FIB and neighbor lookup
in the kernel tables from an XDP or tc BPF program. The helper
provides a fast-path for forwarding packets. The API supports
IPv4, IPv6 and MPLS protocols, but currently IPv4 and IPv6 are
implemented in this initial work, from David (Ahern).
2) Just a tiny diff but huge feature enabled for nfp driver by
extending the BPF offload beyond a pure host processing offload.
Offloaded XDP programs are allowed to set the RX queue index and
thus opening the door for defining a fully programmable RSS/n-tuple
filter replacement. Once BPF decided on a queue already, the device
data-path will skip the conventional RSS processing completely,
from Jakub.
3) The original sockmap implementation was array based similar to
devmap. However unlike devmap where an ifindex has a 1:1 mapping
into the map there are use cases with sockets that need to be
referenced using longer keys. Hence, sockhash map is added reusing
as much of the sockmap code as possible, from John.
4) Introduce BTF ID. The ID is allocatd through an IDR similar as
with BPF maps and progs. It also makes BTF accessible to user
space via BPF_BTF_GET_FD_BY_ID and adds exposure of the BTF data
through BPF_OBJ_GET_INFO_BY_FD, from Martin.
5) Enable BPF stackmap with build_id also in NMI context. Due to the
up_read() of current->mm->mmap_sem build_id cannot be parsed.
This work defers the up_read() via a per-cpu irq_work so that
at least limited support can be enabled, from Song.
6) Various BPF JIT follow-up cleanups and fixups after the LD_ABS/LD_IND
JIT conversion as well as implementation of an optimized 32/64 bit
immediate load in the arm64 JIT that allows to reduce the number of
emitted instructions; in case of tested real-world programs they
were shrinking by three percent, from Daniel.
7) Add ifindex parameter to the libbpf loader in order to enable
BPF offload support. Right now only iproute2 can load offloaded
BPF and this will also enable libbpf for direct integration into
other applications, from David (Beckett).
8) Convert the plain text documentation under Documentation/bpf/ into
RST format since this is the appropriate standard the kernel is
moving to for all documentation. Also add an overview README.rst,
from Jesper.
9) Add __printf verification attribute to the bpf_verifier_vlog()
helper. Though it uses va_list we can still allow gcc to check
the format string, from Mathieu.
10) Fix a bash reference in the BPF selftest's Makefile. The '|& ...'
is a bash 4.0+ feature which is not guaranteed to be available
when calling out to shell, therefore use a more portable variant,
from Joe.
11) Fix a 64 bit division in xdp_umem_reg() by using div_u64()
instead of relying on the gcc built-in, from Björn.
12) Fix a sock hashmap kmalloc warning reported by syzbot when an
overly large key size is used in hashmap then causing overflows
in htab->elem_size. Reject bogus attr->key_size early in the
sock_hash_alloc(), from Yonghong.
13) Ensure in BPF selftests when urandom_read is being linked that
--build-id is always enabled so that test_stacktrace_build_id[_nmi]
won't be failing, from Alexei.
14) Add bitsperlong.h as well as errno.h uapi headers into the tools
header infrastructure which point to one of the arch specific
uapi headers. This was needed in order to fix a build error on
some systems for the BPF selftests, from Sirio.
15) Allow for short options to be used in the xdp_monitor BPF sample
code. And also a bpf.h tools uapi header sync in order to fix a
selftest build failure. Both from Prashant.
16) More formally clarify the meaning of ID in the direct packet access
section of the BPF documentation, from Wang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Wed, 16 May 2018 23:38:14 +0000 (16:38 -0700)]
bpf: sockmap, on update propagate errors back to userspace
When an error happens in the update sockmap element logic also pass
the err up to the user.
Fixes: e5cd3abcb31a ("bpf: sockmap, refactor sockmap routines to work with hashmap")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Yonghong Song [Wed, 16 May 2018 21:06:26 +0000 (14:06 -0700)]
bpf: fix sock hashmap kmalloc warning
syzbot reported a kernel warning below:
WARNING: CPU: 0 PID: 4499 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 mm/slab_common.c:996
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 4499 Comm: syz-executor050 Not tainted 4.17.0-rc3+ #9
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
panic+0x22f/0x4de kernel/panic.c:184
__warn.cold.8+0x163/0x1b3 kernel/panic.c:536
report_bug+0x252/0x2d0 lib/bug.c:186
fixup_bug arch/x86/kernel/traps.c:178 [inline]
do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
RIP: 0010:kmalloc_slab+0x56/0x70 mm/slab_common.c:996
RSP: 0018:
ffff8801d907fc58 EFLAGS:
00010246
RAX:
0000000000000000 RBX:
ffff8801aeecb280 RCX:
ffffffff8185ebd7
RDX:
0000000000000000 RSI:
0000000000000000 RDI:
00000000ffffffe1
RBP:
ffff8801d907fc58 R08:
ffff8801adb5e1c0 R09:
ffffed0035a84700
R10:
ffffed0035a84700 R11:
ffff8801ad423803 R12:
ffff8801aeecb280
R13:
00000000fffffff4 R14:
ffff8801ad891a00 R15:
00000000014200c0
__do_kmalloc mm/slab.c:3713 [inline]
__kmalloc+0x25/0x760 mm/slab.c:3727
kmalloc include/linux/slab.h:517 [inline]
map_get_next_key+0x24a/0x640 kernel/bpf/syscall.c:858
__do_sys_bpf kernel/bpf/syscall.c:2131 [inline]
__se_sys_bpf kernel/bpf/syscall.c:2096 [inline]
__x64_sys_bpf+0x354/0x4f0 kernel/bpf/syscall.c:2096
do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The test case is against sock hashmap with a key size 0xffffffe1.
Such a large key size will cause the below code in function
sock_hash_alloc() overflowing and produces a smaller elem_size,
hence map creation will be successful.
htab->elem_size = sizeof(struct htab_elem) +
round_up(htab->map.key_size, 8);
Later, when map_get_next_key is called and kernel tries
to allocate the key unsuccessfully, it will issue
the above warning.
Similar to hashtab, ensure the key size is at most
MAX_BPF_STACK for a successful map creation.
Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Reported-by: syzbot+e4566d29080e7f3460ff@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David Beckett [Wed, 16 May 2018 21:02:49 +0000 (14:02 -0700)]
libbpf: add ifindex to enable offload support
BPF programs currently can only be offloaded using iproute2. This
patch will allow programs to be offloaded using libbpf calls.
Signed-off-by: David Beckett <david.beckett@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Mathieu Malaterre [Wed, 16 May 2018 20:27:41 +0000 (22:27 +0200)]
bpf: add __printf verification to bpf_verifier_vlog
__printf is useful to verify format and arguments. ‘bpf_verifier_vlog’
function is used twice in verifier.c in both cases the caller function
already uses the __printf gcc attribute.
Remove the following warning, triggered with W=1:
kernel/bpf/verifier.c:176:2: warning: function might be possible candidate for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David Ahern [Tue, 15 May 2018 23:20:52 +0000 (16:20 -0700)]
samples/bpf: Decrement ttl in fib forwarding example
Only consider forwarding packets if ttl in received packet is > 1 and
decrement ttl before handing off to bpf_redirect_map.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Wed, 16 May 2018 20:02:14 +0000 (22:02 +0200)]
Merge branch 'bpf-sock-hashmap'
John Fastabend says:
====================
In the original sockmap implementation we got away with using an
array similar to devmap. However, unlike devmap where an ifindex
has a nice 1:1 function into the map we have found some use cases
with sockets that need to be referenced using longer keys.
This series adds support for a sockhash map reusing as much of
the sockmap code as possible. I made the decision to add sockhash
specific helpers vs trying to generalize the existing helpers
because (a) they have sockmap in the name and (b) the keys are
different types. I prefer to be explicit here rather than play
type games or do something else tricky.
To test this we duplicate all the sockmap testing except swap out
the sockmap with a sockhash.
v2: fix file stats and add v2 tag
v3: move tool updates into test patch, move bpftool updates into
its own patch, and fixup the test patch stats to catch the
renamed file and provide only diffs ± on that.
v4: Add documentation to UAPI bpf.h
v5: Add documentation to tools UAPI bpf.h
v6: 'git add' test_sockhash_kern.c which was previously missing
but was not causing issues because of typo in test script,
noticed by Daniel. After this the git format-patch -M option
no longer tracks the rename of the test_sockmap_kern files for
some reason. I guess the diff has exceeded some threshold.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
John Fastabend [Mon, 14 May 2018 17:00:19 +0000 (10:00 -0700)]
bpf: bpftool, support for sockhash
This adds the SOCKHASH map type to bpftools so that we get correct
pretty printing.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
John Fastabend [Mon, 14 May 2018 17:00:18 +0000 (10:00 -0700)]
bpf: selftest additions for SOCKHASH
This runs existing SOCKMAP tests with SOCKHASH map type. To do this
we push programs into include file and build two BPF programs. One
for SOCKHASH and one for SOCKMAP.
We then run the entire test suite with each type.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Rahul Lakkireddy [Wed, 16 May 2018 14:21:15 +0000 (19:51 +0530)]
cxgb4: update LE-TCAM collection for T6
For T6, clip table is separated from main TCAM. So, update LE-TCAM
collection logic to collect clip table TCAM as well. IPv6 takes
4 entries in clip table TCAM compared to 2 entries in main TCAM.
Also, in case of errors, keep LE-TCAM collected so far and set the
status to partial dump.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 May 2018 18:49:01 +0000 (14:49 -0400)]
Merge branch 'qed-LL2-fixes'
Michal Kalderon says:
====================
qed: LL2 fixes
This series fixes some issues in ll2 related to synchronization
and resource freeing
====================
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Michal Kalderon [Wed, 16 May 2018 11:44:40 +0000 (14:44 +0300)]
qed: Fix LL2 race during connection terminate
Stress on qedi/qedr load unload lead to list_del corruption.
This is due to ll2 connection terminate freeing resources without
verifying that no more ll2 processing will occur.
This patch unregisters the ll2 status block before terminating
the connection to assure this race does not occur.
Fixes: 1d6cff4fca4366 ("qed: Add iSCSI out of order packet handling")
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kalderon [Wed, 16 May 2018 11:44:39 +0000 (14:44 +0300)]
qed: Fix possibility of list corruption during rmmod flows
The ll2 flows of flushing the txq/rxq need to be synchronized with the
regular fp processing. Caused list corruption during load/unload stress
tests.
Fixes: 0a7fb11c23c0f ("qed: Add Light L2 support")
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kalderon [Wed, 16 May 2018 11:44:38 +0000 (14:44 +0300)]
qed: LL2 flush isles when connection is closed
Driver should free all pending isles once it gets a FLUSH cqe from FW.
Part of iSCSI out of order flow.
Fixes: 1d6cff4fca4366 ("qed: Add iSCSI out of order packet handling")
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Wed, 16 May 2018 11:18:22 +0000 (19:18 +0800)]
net: ethoc: Remove useless test before clk_disable_unprepare
clk_disable_unprepare() already checks that the clock pointer is valid.
No need to test it before calling it.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Wed, 16 May 2018 10:59:19 +0000 (18:59 +0800)]
net: stmmac: Remove useless test before clk_disable_unprepare
clk_disable_unprepare() already checks that the clock pointer is valid.
No need to test it before calling it.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hemanth Puranik [Wed, 16 May 2018 01:10:53 +0000 (06:40 +0530)]
net: qcom/emac: Encapsulate sgmii ops under one structure
This patch introduces ops structure for sgmii, This by ensures that
we do not need dummy functions in case of emulation platforms.
Signed-off-by: Hemanth Puranik <hpuranik@codeaurora.org>
Acked-by: Timur Tabi <timur@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 May 2018 18:23:05 +0000 (14:23 -0400)]
Merge branch 'rmnet-next'
Subash Abhinov Kasiviswanathan says:
====================
net: qualcomm: rmnet: Updates 2018-05-14
Patch 1 adds tx_drops counter to more places.
Patch 2 adds ethtool private stats support to make it easy to debug
the checksum offload path.
Patch 3 is a cleanup in command packet processing path.
v1->v2: Fix the incorrect if / else statement in
rmnet_map_checksum_downlink_packet() and define rmnet_ethtool_ops
as static as mentioned by kbuild test robot.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Subash Abhinov Kasiviswanathan [Wed, 16 May 2018 00:52:02 +0000 (18:52 -0600)]
net: qualcomm: rmnet: Remove redundant command check
The command packet size is already checked once in
rmnet_map_deaggregate() for the header, packet and trailer size, so
this additional check is not needed.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subash Abhinov Kasiviswanathan [Wed, 16 May 2018 00:52:01 +0000 (18:52 -0600)]
net: qualcomm: rmnet: Add support for ethtool private stats
Add ethtool private stats handler to debug the handling of packets
with checksum offload header / trailer. This allows to keep track of
the number of packets for which hardware computes the checksum and
counts and reasons where checksum computation was skipped in hardware
and was done in the network stack.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subash Abhinov Kasiviswanathan [Wed, 16 May 2018 00:52:00 +0000 (18:52 -0600)]
net: qualcomm: rmnet: Capture all drops in transmit path
Packets in transmit path could potentially be dropped if there were
errors while adding the MAP header or the checksum header.
Increment the tx_drops stats in these cases.
Additionally, refactor the code to free the packet and increment
the tx_drops stat under a single label.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 May 2018 18:20:36 +0000 (14:20 -0400)]
Merge branch 'of-mdio-Fall-back-to-mdiobus_register-with-NULL-device_node'
Florian Fainelli says:
====================
of: mdio: Fall back to mdiobus_register() with NULL device_node
This patch series updates of_mdiobus_register() such that when the device_node
argument is NULL, it calls mdiobus_register() directly. This is consistent with
the behavior of of_mdiobus_register() when CONFIG_OF=n.
I only converted the most obvious drivers, there are others that have a much
less obvious behavior and specifically attempt to deal with CONFIG_ACPI.
Changes in v2:
- fixed build error in davincin_mdio.c (Grygorii)
- reworked first patch a bit: commit message, subject and removed useless
code comment
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Tue, 15 May 2018 23:56:19 +0000 (16:56 -0700)]
drivers: net: Remove device_node checks with of_mdiobus_register()
A number of drivers have the following pattern:
if (np)
of_mdiobus_register()
else
mdiobus_register()
which the implementation of of_mdiobus_register() now takes care of.
Remove that pattern in drivers that strictly adhere to it.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Fugang Duan <fugang.duan@nxp.com>
Reviewed-by: Antoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: Jose Abreu <joabreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Tue, 15 May 2018 23:56:18 +0000 (16:56 -0700)]
of: mdio: Fall back to mdiobus_register() with NULL device_node
When the device_node specified is NULL, fall back to mdiobus_register().
We have a number of drivers having a similar pattern which is:
if (np)
of_mdiobus_register()
else
mdiobus_register()
so incorporate that behavior within the core of_mdiobus_register()
function. This is also consistent with the stub version that we defined
when CONFIG_OF=n.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Grygorii Strashko [Tue, 15 May 2018 23:37:25 +0000 (18:37 -0500)]
net: ethernet: ti: cpsw-phy-sel: check bus_find_device() ret value
This fixes klockworks warnings: Pointer 'dev' returned from call to
function 'bus_find_device' at line 179 may be NULL and will be dereferenced
at line 181.
cpsw-phy-sel.c:179: 'dev' is assigned the return value from function 'bus_find_device'.
bus.c:342: 'bus_find_device' explicitly returns a NULL value.
cpsw-phy-sel.c:181: 'dev' is dereferenced by passing argument 1 to function 'dev_get_drvdata'.
device.h:1024: 'dev' is passed to function 'dev_get_drvdata'.
device.h:1026: 'dev' is explicitly dereferenced.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
[nsekhar@ti.com: add an error message, fix return path]
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Debabrata Banerjee [Wed, 16 May 2018 18:02:13 +0000 (14:02 -0400)]
Revert "bonding: allow carrier and link status to determine link state"
This reverts commit
1386c36b30388f46a95100924bfcae75160db715.
We don't want to encourage drivers to not report carrier status
correctly, therefore remove this commit.
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roman Mashak [Tue, 15 May 2018 18:31:14 +0000 (14:31 -0400)]
tc-testing: updated mirred and vlan with more tests
Added extra test cases for different control actions (reclassify, pipe
etc.), cookies, max values & exceeding maximum, and replace existing
actions unit tests.
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roman Mashak [Tue, 15 May 2018 14:14:40 +0000 (10:14 -0400)]
tc-testing: fixed copy-pasting error in police tests
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 15 May 2018 08:50:31 +0000 (10:50 +0200)]
sched: manipulate __QDISC_STATE_RUNNING in qdisc_run_* helpers
Currently NOLOCK qdiscs pay a measurable overhead to atomically
manipulate the __QDISC_STATE_RUNNING. Such bit is flipped twice per
packet in the uncontended scenario with packet rate below the
line rate: on packed dequeue and on the next, failing dequeue attempt.
This changeset moves the bit manipulation into the qdisc_run_{begin,end}
helpers, so that the bit is now flipped only once per packet, with
measurable performance improvement in the uncontended scenario.
This also allows simplifying the qdisc teardown code path - since
qdisc_is_running() is now effective for each qdisc type - and avoid a
possible race between qdisc_run() and dev_deactivate_many(), as now
the some_qdisc_is_busy() can properly detect NOLOCK qdiscs being busy
dequeuing packets.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 May 2018 16:15:12 +0000 (12:15 -0400)]
Merge branch 'bonding-performance-and-reliability'
Debabrata Banerjee says:
====================
bonding: performance and reliability
Series of fixes to how rlb updates are handled, code cleanup, allowing
higher performance tx hashing in balance-alb mode, and reliability of
link up/down monitoring.
v2: refactor bond_is_nondyn_tlb with inline fn, update log comment to
point out that multicast addresses will not get rlb updates.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Debabrata Banerjee [Mon, 14 May 2018 18:48:10 +0000 (14:48 -0400)]
bonding: allow carrier and link status to determine link state
In a mixed environment it may be difficult to tell if your hardware
support carrier, if it does not it can always report true. With a new
use_carrier option of 2, we can check both carrier and link status
sequentially, instead of one or the other
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Debabrata Banerjee [Mon, 14 May 2018 18:48:09 +0000 (14:48 -0400)]
bonding: allow use of tx hashing in balance-alb
The rx load balancing provided by balance-alb is not mutually
exclusive with using hashing for tx selection, and should provide a decent
speed increase because this eliminates spinlocks and cache contention.
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Debabrata Banerjee [Mon, 14 May 2018 18:48:08 +0000 (14:48 -0400)]
bonding: use common mac addr checks
Replace homegrown mac addr checks with faster defs from etherdevice.h
Note that this will also prevent any rlb arp updates for multicast
addresses, however this should have been forbidden anyway.
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Debabrata Banerjee [Mon, 14 May 2018 18:48:07 +0000 (14:48 -0400)]
bonding: don't queue up extraneous rlb updates
arps for incomplete entries can't be sent anyway.
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 May 2018 15:49:20 +0000 (11:49 -0400)]
Merge branch 'net-smc-enhancements-2018-05-15'
Ursula Braun says:
====================
net/smc: enhancements 2018/05/15
here are smc patches for net-next. The first one is a fix for net-next
commit
01d2f7e2cdd3 "net/smc: sockopts TCP_NODELAY and TCP_CORK".
Patch 7 improves Connection Layer Control error handling, patch 10
improves abnormal termination of link groups. The remaining patches
from Karsten improve Link Layer Control code.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul [Tue, 15 May 2018 15:05:03 +0000 (17:05 +0200)]
net/smc: check for pending termination
Avoid to run the processing in smc_lgr_terminate() more than once,
remember when the link group termination is triggered.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul [Tue, 15 May 2018 15:05:02 +0000 (17:05 +0200)]
net/smc: drop messages when link state is inactive
Drop incoming messages when the link is flagged as inactive.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul [Tue, 15 May 2018 15:05:01 +0000 (17:05 +0200)]
net/smc: set link inactive before calling smc_lgr_free()
Before smc_lgr_free() is called the link must be set inactive by calling
smc_llc_link_inactive().
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul [Tue, 15 May 2018 15:05:00 +0000 (17:05 +0200)]
net/smc: handle all error codes from smc_conn_create()
Always set a reason_code when smc_conn_create() returns an error code.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul [Tue, 15 May 2018 15:04:59 +0000 (17:04 +0200)]
net/smc: use a workqueue to defer llc send
SMC handles deferred work in tasklets. As tasklets cannot sleep this
can result in rare EBUSY conditions, so defer this work in a work queue.
The high level api functions do not defer work because they can sleep
until the llc send is actually completed.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul [Tue, 15 May 2018 15:04:58 +0000 (17:04 +0200)]
net/smc: move link llc initialization to llc layer
Move the llc layer specific initialization and cleanup out of smc_core.c
into smc_llc.c (smc_llc_link_init and smc_llc_link_clear). Move all
initialization of a link into the new init function.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>