Jiri Pirko [Fri, 8 Apr 2016 17:11:21 +0000 (19:11 +0200)]
mlxsw: Move devlink port registration into common core code
Remove devlink port reg/unreg from spectrum and switchx2 code and rather
do the common work in core. That also ensures code separation where
devlink is only used in core.c.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 8 Apr 2016 17:11:20 +0000 (19:11 +0200)]
devlink: remove implicit type set in port register
As we rely on caller zeroing or correctly set the struct before the call,
this implicit type set is either no-op (DEVLINK_PORT_TYPE_NOTSET is 0)
or it rewrites wanted value. So remove this.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 8 Apr 2016 19:26:07 +0000 (15:26 -0400)]
Merge branch 'nfp-mtu-buffer-reconfig'
Jakub Kicinski says:
====================
MTU/buffer reconfig changes
I re-discussed MPLS/MTU internally, dropped it from the patch 1,
re-tested everything, found out I forgot about debugfs pointers,
fixed that as well.
v5:
- don't reserve space in RX buffers for MPLS label stack
(patch 1);
- fix debugfs pointers to ring structures (patch 5).
v4:
- cut down on unrelated patches;
- don't "close" the device on error path.
--- v4 cover letter
Previous series included some not entirely related patches,
this one is cut down. Main issue I'm trying to solve here
is that .ndo_change_mtu() in nfpvf driver is doing full
close/open to reallocate buffers - which if open fails
can result in device being basically closed even though
the interface is started. As suggested by you I try to move
towards a paradigm where the resources are allocated first
and the MTU change is only done once I'm certain (almost)
nothing can fail. Almost because I need to communicate
with FW and that can always time out.
Patch 1 fixes small issue. Next 10 patches reorganize things
so that I can easily allocate new rings and sets of buffers
while the device is running. Patches 13 and 15 reshape the
.ndo_change_mtu() and ethtool's ring-resize operation into
desired form.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 7 Apr 2016 18:39:48 +0000 (19:39 +0100)]
nfp: allow ring size reconfiguration at runtime
Since much of the required changes have already been made for
changing MTU at runtime let's use it for ring size changes as
well.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 7 Apr 2016 18:39:47 +0000 (19:39 +0100)]
nfp: pass ring count as function parameter
Soon ring resize will call this functions with values
different than the current configuration we need to
explicitly pass the ring count as parameter.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 7 Apr 2016 18:39:46 +0000 (19:39 +0100)]
nfp: convert .ndo_change_mtu() to prepare/commit paradigm
When changing MTU on running device first allocate new rings
and buffers and once it succeeds proceed with changing MTU.
Allocation of new rings is not really necessary for this
operation - it's done to keep the code simple and because
size of the extra ring memory is quite small compared to
the size of buffers.
Operation can still fail midway through if FW communication
times out. In that case we retry with old MTU (rings).
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 7 Apr 2016 18:39:45 +0000 (19:39 +0100)]
nfp: propagate list buffer size in struct rx_ring
Free list buffer size needs to be propagated to few functions
as a parameter and added to struct nfp_net_rx_ring since soon
some of the functions will be reused to manage rings with
buffers of size different than nn->fl_bufsz.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 7 Apr 2016 18:39:44 +0000 (19:39 +0100)]
nfp: sync ring state during FW reconfiguration
FW reconfiguration in .ndo_open()/.ndo_stop() should reset/
restore queue state. Since we need IRQs to be disabled when
filling rings on RX path we have to move disable_irq() from
.ndo_open() all the way up to IRQ allocation.
nfp_net_start_vec() becomes trivial now so it's inlined.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 7 Apr 2016 18:39:43 +0000 (19:39 +0100)]
nfp: slice .ndo_open() and .ndo_stop() up
Divide .ndo_open() and .ndo_stop() into logical, callable
chunks. No functional changes.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 7 Apr 2016 18:39:42 +0000 (19:39 +0100)]
nfp: move filling ring information to FW config
nfp_net_[rt]x_ring_{alloc,free} should only allocate or free
ring resources without touching the device. Move setting
parameters in the BAR to separate functions. This will make
it possible to reuse alloc/free functions to allocate new
rings while the device is running.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 7 Apr 2016 18:39:41 +0000 (19:39 +0100)]
nfp: preallocate RX buffers early in .ndo_open
We want the .ndo_open() to have following structure:
- allocate resources;
- configure HW/FW;
- enable the device from stack perspective.
Therefore filling RX rings needs to be moved to the beginning
of .ndo_open().
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 7 Apr 2016 18:39:40 +0000 (19:39 +0100)]
nfp: reorganize initial filling of RX rings
Separate allocation of buffers from giving them to FW,
thanks to this it will be possible to move allocation
earlier on .ndo_open() path and reuse buffers during
runtime reconfiguration.
Similar to TX side clean up the spill of functionality
from flush to freeing the ring. Unlike on TX side,
RX ring reset does not free buffers from the ring.
Ring reset means only that FW pointers are zeroed and
buffers on the ring must be placed in [0, cnt - 1)
positions.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 7 Apr 2016 18:39:39 +0000 (19:39 +0100)]
nfp: cleanup tx ring flush and rename to reset
Since we never used flush without freeing the ring later
the functionality of the two operations is mixed.
Rename flush to ring reset and move there all the things
which have to be done after FW ring state is cleared.
While at it do some clean-ups.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 7 Apr 2016 18:39:38 +0000 (19:39 +0100)]
nfp: allocate ring SW structs dynamically
To be able to switch rings more easily on config changes
allocate them dynamically, separately from nfp_net structure.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 7 Apr 2016 18:39:37 +0000 (19:39 +0100)]
nfp: make *x_ring_init do all the init
nfp_net_[rt]x_ring_init functions used to be called from probe
path only and some of their functionality was spilled to the
call site. In order to reuse them for ring reconfiguration
we need them to do all the init.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 7 Apr 2016 18:39:36 +0000 (19:39 +0100)]
nfp: break up nfp_net_{alloc|free}_rings
nfp_net_{alloc|free}_rings contained strange mix of allocations
and vector initialization. Remove it, declare vector init as
a separate function and handle allocations explicitly.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 7 Apr 2016 18:39:35 +0000 (19:39 +0100)]
nfp: move link state interrupt request/free calls
We need to be able to disable the link state interrupt when
the device is brought down. We used to just free the IRQ
at the beginning of .ndo_stop(). As we now move towards
more ordered .ndo_open()/.ndo_stop() paths LSC allocation
should be placed in the "allocate resource" section.
Since the IRQ can't be freed early in .ndo_stop(), it is
disabled instead.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 7 Apr 2016 18:39:34 +0000 (19:39 +0100)]
nfp: correct RX buffer length calculation
When calculating the RX buffer length we need to account
for up to 2 VLAN tags. Rounding up to 1k is an relic of
a distant past and can be removed. While at it also remove
trivial print statement.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Sun, 27 Mar 2016 21:34:52 +0000 (14:34 -0700)]
orangefs: Add KERN_<LEVEL> to gossip_<level> macros
Emit the logging messages at the appropriate levels.
Miscellanea:
o Change format to fmt
o Use the more common ##__VA_ARGS__
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Martin Brandenburg [Fri, 8 Apr 2016 17:33:21 +0000 (13:33 -0400)]
orangefs: strncpy -> strscpy
It would have been possible for a rogue client-core to send in a symlink
target which is not NUL terminated. This returns EIO if the client-core
gives us corrupt data.
Leave debugfs and superblock code as is for now.
Other dcache.c and namei.c strncpy instances are safe because
ORANGEFS_NAME_MAX = NAME_MAX + 1; there is always enough space for a
name plus a NUL byte.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Martin Brandenburg [Mon, 4 Apr 2016 20:26:36 +0000 (16:26 -0400)]
orangefs: clean up truncate ctime and mtime setting
The ctime and mtime are always updated on a successful ftruncate and
only updated on a successful truncate where the size changed.
We handle the ``if the size changed'' bit.
This matches FUSE's behavior.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
kbuild test robot [Sat, 26 Mar 2016 18:54:23 +0000 (02:54 +0800)]
Orangefs: fix ifnullfree.cocci warnings
fs/orangefs/orangefs-debugfs.c:130:2-26: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values.
NULL check before some freeing functions is not needed.
Based on checkpatch warning
"kfree(NULL) is safe this check is probably not required"
and kfreeaddr.cocci by Julia Lawall.
Generated by: scripts/coccinelle/free/ifnullfree.cocci
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Mike Marshall [Wed, 6 Apr 2016 15:19:37 +0000 (11:19 -0400)]
Orangefs: optimize boilerplate code.
Suggested by David Binderman <dcb314@hotmail.com>
The former can potentially be a performance win over the latter.
memcpy(d, s, len);
memset(d+len, c, size-len);
memset(d, c, size);
memcpy(d, s, len);
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Mike Marshall [Wed, 6 Apr 2016 14:52:38 +0000 (10:52 -0400)]
Orangefs: xattr.c cleanup
1. It is nonsense to test for negative size_t, suggested by
David Binderman <dcb314@hotmail.com>
2. By the time Orangefs gets called, the vfs has ensured that
name != NULL, and that buffer and size are sane.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Roopa Prabhu [Fri, 8 Apr 2016 04:28:38 +0000 (21:28 -0700)]
mpls: find_outdev: check for err ptr in addition to NULL check
find_outdev calls inet{,6}_fib_lookup_dev() or dev_get_by_index() to
find the output device. In case of an error, inet{,6}_fib_lookup_dev()
returns error pointer and dev_get_by_index() returns NULL. But the function
only checks for NULL and thus can end up calling dev_put on an ERR_PTR.
This patch adds an additional check for err ptr after the NULL check.
Before: Trying to add an mpls route with no oif from user, no available
path to 10.1.1.8 and no default route:
$ip -f mpls route add 100 as 200 via inet 10.1.1.8
[ 822.337195] BUG: unable to handle kernel NULL pointer dereference at
00000000000003a3
[ 822.340033] IP: [<
ffffffff8148781e>] mpls_nh_assign_dev+0x10b/0x182
[ 822.340033] PGD
1db38067 PUD
1de9e067 PMD 0
[ 822.340033] Oops: 0000 [#1] SMP
[ 822.340033] Modules linked in:
[ 822.340033] CPU: 0 PID: 11148 Comm: ip Not tainted 4.5.0-rc7+ #54
[ 822.340033] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org
04/01/2014
[ 822.340033] task:
ffff88001db82580 ti:
ffff88001dad4000 task.ti:
ffff88001dad4000
[ 822.340033] RIP: 0010:[<
ffffffff8148781e>] [<
ffffffff8148781e>]
mpls_nh_assign_dev+0x10b/0x182
[ 822.340033] RSP: 0018:
ffff88001dad7a88 EFLAGS:
00010282
[ 822.340033] RAX:
ffffffffffffff9b RBX:
ffffffffffffff9b RCX:
0000000000000002
[ 822.340033] RDX:
00000000ffffff9b RSI:
0000000000000008 RDI:
0000000000000000
[ 822.340033] RBP:
ffff88001ddc9ea0 R08:
ffff88001e9f1768 R09:
0000000000000000
[ 822.340033] R10:
ffff88001d9c1100 R11:
ffff88001e3c89f0 R12:
ffffffff8187e0c0
[ 822.340033] R13:
ffffffff8187e0c0 R14:
ffff88001ddc9e80 R15:
0000000000000004
[ 822.340033] FS:
00007ff9ed798700(0000) GS:
ffff88001fc00000(0000)
knlGS:
0000000000000000
[ 822.340033] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 822.340033] CR2:
00000000000003a3 CR3:
000000001de89000 CR4:
00000000000006f0
[ 822.340033] Stack:
[ 822.340033]
0000000000000000 0000000100000000 0000000000000000
0000000000000000
[ 822.340033]
0000000000000000 0801010a00000000 0000000000000000
0000000000000000
[ 822.340033]
0000000000000004 ffffffff8148749b ffffffff8187e0c0
000000000000001c
[ 822.340033] Call Trace:
[ 822.340033] [<
ffffffff8148749b>] ? mpls_rt_alloc+0x2b/0x3e
[ 822.340033] [<
ffffffff81488e66>] ? mpls_rtm_newroute+0x358/0x3e2
[ 822.340033] [<
ffffffff810e7bbc>] ? get_page+0x5/0xa
[ 822.340033] [<
ffffffff813b7d94>] ? rtnetlink_rcv_msg+0x17e/0x191
[ 822.340033] [<
ffffffff8111794e>] ? __kmalloc_track_caller+0x8c/0x9e
[ 822.340033] [<
ffffffff813c9393>] ?
rht_key_hashfn.isra.20.constprop.57+0x14/0x1f
[ 822.340033] [<
ffffffff813b7c16>] ? __rtnl_unlock+0xc/0xc
[ 822.340033] [<
ffffffff813cb794>] ? netlink_rcv_skb+0x36/0x82
[ 822.340033] [<
ffffffff813b4507>] ? rtnetlink_rcv+0x1f/0x28
[ 822.340033] [<
ffffffff813cb2b1>] ? netlink_unicast+0x106/0x189
[ 822.340033] [<
ffffffff813cb5b3>] ? netlink_sendmsg+0x27f/0x2c8
[ 822.340033] [<
ffffffff81392ede>] ? sock_sendmsg_nosec+0x10/0x1b
[ 822.340033] [<
ffffffff81393df1>] ? ___sys_sendmsg+0x182/0x1e3
[ 822.340033] [<
ffffffff810e4f35>] ?
__alloc_pages_nodemask+0x11c/0x1e4
[ 822.340033] [<
ffffffff8110619c>] ? PageAnon+0x5/0xd
[ 822.340033] [<
ffffffff811062fe>] ? __page_set_anon_rmap+0x45/0x52
[ 822.340033] [<
ffffffff810e7bbc>] ? get_page+0x5/0xa
[ 822.340033] [<
ffffffff810e85ab>] ? __lru_cache_add+0x1a/0x3a
[ 822.340033] [<
ffffffff81087ea9>] ? current_kernel_time64+0x9/0x30
[ 822.340033] [<
ffffffff813940c4>] ? __sys_sendmsg+0x3c/0x5a
[ 822.340033] [<
ffffffff8148f597>] ?
entry_SYSCALL_64_fastpath+0x12/0x6a
[ 822.340033] Code: 83 08 04 00 00 65 ff 00 48 8b 3c 24 e8 40 7c f2 ff
eb 13 48 c7 c3 9f ff ff ff eb 0f 89 ce e8 f1 ae f1 ff 48 89 c3 48 85 db
74 15 <48> 8b 83 08 04 00 00 65 ff 08 48 81 fb 00 f0 ff ff 76 0d eb 07
[ 822.340033] RIP [<
ffffffff8148781e>] mpls_nh_assign_dev+0x10b/0x182
[ 822.340033] RSP <
ffff88001dad7a88>
[ 822.340033] CR2:
00000000000003a3
[ 822.435363] ---[ end trace
98cc65e6f6b8bf11 ]---
After patch:
$ip -f mpls route add 100 as 200 via inet 10.1.1.8
RTNETLINK answers: Network is unreachable
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 8 Apr 2016 16:13:30 +0000 (12:13 -0400)]
Merge branch '10GbE' of git://git./linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
10GbE Intel Wired LAN Driver Updates 2016-04-07
This series contains updates to ixgbe and ixgbevf.
This entire series (except for one patch from Alex) comes from Mark and
is mainly to add support for our new MAC (x550em_a).
So let's get Alex's patch out of the way first before we cover Mark's
many changes. Alex does his enable bulk free in transmit cleanup for
ixgbe and ixgbevf, like his has done for all of our other drivers.
First Mark cleans up registers that were not being used, so do some
house cleaning. Then to avoid casting lan_id and func fields, just
make them u8 since they only hold small values anyways. Found and
fixed an issue where on read operations it could be possible to
modify locations beyond the length passed in, so change the check
to round up in the same way. Cleaned up the interface for issuing
firmware commands to use a void * instead of a u32 * which eliminates
a number of casts. Added support for the new MAC and provided method
pointers and use them to access IOSF-attached devices, since the
new MAC will also need a new access method. Added support for SFPs
with an external retimer and for an SGMII backplane interface.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Sitnicki [Tue, 5 Apr 2016 16:41:08 +0000 (18:41 +0200)]
ipv6: Count in extension headers in skb->network_header
When sending a UDPv6 message longer than MTU, account for the length
of fragmentable IPv6 extension headers in skb->network_header offset.
Same as we do in alloc_new_skb path in __ip6_append_data().
This ensures that later on __ip6_make_skb() will make space in
headroom for fragmentable extension headers:
/* move skb->data to ip header from ext header */
if (skb->data < skb_network_header(skb))
__skb_pull(skb, skb_network_offset(skb));
Prevents a splat due to skb_under_panic:
skbuff: skb_under_panic: text:
ffffffff8143397b len:2126 put:14 \
head:
ffff880005bacf50 data:
ffff880005bacf4a tail:0x48 end:0xc0 dev:lo
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:104!
invalid opcode: 0000 [#1] KASAN
CPU: 0 PID: 160 Comm: reproducer Not tainted 4.6.0-rc2 #65
[...]
Call Trace:
[<
ffffffff813eb7b9>] skb_push+0x79/0x80
[<
ffffffff8143397b>] eth_header+0x2b/0x100
[<
ffffffff8141e0d0>] neigh_resolve_output+0x210/0x310
[<
ffffffff814eab77>] ip6_finish_output2+0x4a7/0x7c0
[<
ffffffff814efe3a>] ip6_output+0x16a/0x280
[<
ffffffff815440c1>] ip6_local_out+0xb1/0xf0
[<
ffffffff814f1115>] ip6_send_skb+0x45/0xd0
[<
ffffffff81518836>] udp_v6_send_skb+0x246/0x5d0
[<
ffffffff8151985e>] udpv6_sendmsg+0xa6e/0x1090
[...]
Reported-by: Ji Jianwen <jiji@redhat.com>
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bart Van Assche [Thu, 7 Apr 2016 22:55:04 +0000 (15:55 -0700)]
Revert "ib_srpt: Convert to percpu_ida tag allocation"
This reverts commit
0fd10721fe3664f7549e74af9d28a509c9a68719.
That patch causes the ib_srpt driver to crash as soon as the first SCSI
command is received:
kernel BUG at drivers/infiniband/ulp/srpt/ib_srpt.c:1439!
invalid opcode: 0000 [#1] SMP
Workqueue: target_completion target_complete_ok_work [target_core_mod]
RIP: srpt_queue_response+0x437/0x4a0 [ib_srpt]
Call Trace:
srpt_queue_data_in+0x9/0x10 [ib_srpt]
target_complete_ok_work+0x152/0x2b0 [target_core_mod]
process_one_work+0x197/0x480
worker_thread+0x49/0x490
kthread+0xea/0x100
ret_from_fork+0x22/0x40
Aside from the crash, the shortcomings of that patch are as follows:
- It makes the ib_srpt driver use I/O contexts allocated by
transport_alloc_session_tags() but it does not initialize these I/O
contexts properly. All the initializations performed by
srpt_alloc_ioctx() are skipped.
- It swaps the order of the send ioctx allocation and the transition to
RTR mode which is wrong.
- The amount of memory that is needed for I/O contexts is doubled.
- srpt_rdma_ch.free_list is no longer used but is not removed.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David S. Miller [Fri, 8 Apr 2016 01:04:27 +0000 (21:04 -0400)]
Merge branch 'bpf-tracepoints'
Alexei Starovoitov says:
====================
allow bpf attach to tracepoints
Hi Steven, Peter,
v1->v2: addressed Peter's comments:
- fixed wording in patch 1, added ack
- refactored 2nd patch into 3:
2/10 remove unused __perf_addr macro which frees up
an argument in perf_trace_buf_submit
3/10 split perf_trace_buf_prepare into alloc and update parts, so that bpf
programs don't have to pay performance penalty for update of struct trace_entry
which is not going to be accessed by bpf
4/10 actual addition of bpf filter to perf tracepoint handler is now trivial
and bpf prog can be used as proper filter of tracepoints
v1 cover:
last time we discussed bpf+tracepoints it was a year ago [1] and the reason
we didn't proceed with that approach was that bpf would make arguments
arg1, arg2 to trace_xx(arg1, arg2) call to be exposed to bpf program
and that was considered unnecessary extension of abi. Back then I wanted
to avoid the cost of buffer alloc and field assign part in all
of the tracepoints, but looks like when optimized the cost is acceptable.
So this new apporach doesn't expose any new abi to bpf program.
The program is looking at tracepoint fields after they were copied
by perf_trace_xx() and described in /sys/kernel/debug/tracing/events/xxx/format
We made a tool [2] that takes arguments from /sys/.../format and works as:
$ tplist.py -v random:urandom_read
int got_bits;
int pool_left;
int input_left;
Then these fields can be copy-pasted into bpf program like:
struct urandom_read {
__u64 hidden_pad;
int got_bits;
int pool_left;
int input_left;
};
and the program can use it:
SEC("tracepoint/random/urandom_read")
int bpf_prog(struct urandom_read *ctx)
{
return ctx->pool_left > 0 ? 1 : 0;
}
This way the program can access tracepoint fields faster than
equivalent bpf+kprobe program, which is the main goal of these patches.
Patch 1-4 are simple changes in perf core side, please review.
I'd like to take the whole set via net-next tree, since the rest of
the patches might conflict with other bpf work going on in net-next
and we want to avoid cross-tree merge conflicts.
Alternatively we can put patches 1-4 into both tip and net-next.
Patch 9 is an example of access to tracepoint fields from bpf prog.
Patch 10 is a micro benchmark for bpf+kprobe vs bpf+tracepoint.
Note that for actual tracing tools the user doesn't need to
run tplist.py and copy-paste fields manually. The tools do it
automatically. Like argdist tool [3] can be used as:
$ argdist -H 't:block:block_rq_complete():u32:nr_sector'
where 'nr_sector' is name of tracepoint field taken from
/sys/kernel/debug/tracing/events/block/block_rq_complete/format
and appropriate bpf program is generated on the fly.
[1] http://thread.gmane.org/gmane.linux.kernel.api/8127/focus=8165
[2] https://github.com/iovisor/bcc/blob/master/tools/tplist.py
[3] https://github.com/iovisor/bcc/blob/master/tools/argdist.py
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Thu, 7 Apr 2016 01:43:31 +0000 (18:43 -0700)]
samples/bpf: add tracepoint vs kprobe performance tests
the first microbenchmark does
fd=open("/proc/self/comm");
for() {
write(fd, "test");
}
and on 4 cpus in parallel:
writes per sec
base (no tracepoints, no kprobes) 930k
with kprobe at __set_task_comm() 420k
with tracepoint at task:task_rename 730k
For kprobe + full bpf program manully fetches oldcomm, newcomm via bpf_probe_read.
For tracepint bpf program does nothing, since arguments are copied by tracepoint.
2nd microbenchmark does:
fd=open("/dev/urandom");
for() {
read(fd, buf);
}
and on 4 cpus in parallel:
reads per sec
base (no tracepoints, no kprobes) 300k
with kprobe at urandom_read() 279k
with tracepoint at random:urandom_read 290k
bpf progs attached to kprobe and tracepoint are noop.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Thu, 7 Apr 2016 01:43:30 +0000 (18:43 -0700)]
samples/bpf: tracepoint example
modify offwaketime to work with sched/sched_switch tracepoint
instead of kprobe into finish_task_switch
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Thu, 7 Apr 2016 01:43:29 +0000 (18:43 -0700)]
samples/bpf: add tracepoint support to bpf loader
Recognize "tracepoint/" section name prefix and attach the program
to that tracepoint.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Thu, 7 Apr 2016 01:43:28 +0000 (18:43 -0700)]
bpf: sanitize bpf tracepoint access
during bpf program loading remember the last byte of ctx access
and at the time of attaching the program to tracepoint check that
the program doesn't access bytes beyond defined in tracepoint fields
This also disallows access to __dynamic_array fields, but can be
relaxed in the future.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Thu, 7 Apr 2016 01:43:27 +0000 (18:43 -0700)]
bpf: support bpf_get_stackid() and bpf_perf_event_output() in tracepoint programs
needs two wrapper functions to fetch 'struct pt_regs *' to convert
tracepoint bpf context into kprobe bpf context to reuse existing
helper functions
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Thu, 7 Apr 2016 01:43:26 +0000 (18:43 -0700)]
bpf: register BPF_PROG_TYPE_TRACEPOINT program type
register tracepoint bpf program type and let it call the same set
of helper functions as BPF_PROG_TYPE_KPROBE
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Thu, 7 Apr 2016 01:43:25 +0000 (18:43 -0700)]
perf, bpf: allow bpf programs attach to tracepoints
introduce BPF_PROG_TYPE_TRACEPOINT program type and allow it to be attached
to the perf tracepoint handler, which will copy the arguments into
the per-cpu buffer and pass it to the bpf program as its first argument.
The layout of the fields can be discovered by doing
'cat /sys/kernel/debug/tracing/events/sched/sched_switch/format'
prior to the compilation of the program with exception that first 8 bytes
are reserved and not accessible to the program. This area is used to store
the pointer to 'struct pt_regs' which some of the bpf helpers will use:
+---------+
| 8 bytes | hidden 'struct pt_regs *' (inaccessible to bpf program)
+---------+
| N bytes | static tracepoint fields defined in tracepoint/format (bpf readonly)
+---------+
| dynamic | __dynamic_array bytes of tracepoint (inaccessible to bpf yet)
+---------+
Not that all of the fields are already dumped to user space via perf ring buffer
and broken application access it directly without consulting tracepoint/format.
Same rule applies here: static tracepoint fields should only be accessed
in a format defined in tracepoint/format. The order of fields and
field sizes are not an ABI.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Thu, 7 Apr 2016 01:43:24 +0000 (18:43 -0700)]
perf: split perf_trace_buf_prepare into alloc and update parts
split allows to move expensive update of 'struct trace_entry' to later phase.
Repurpose unused 1st argument of perf_tp_event() to indicate event type.
While splitting use temp variable 'rctx' instead of '*rctx' to avoid
unnecessary loads done by the compiler due to -fno-strict-aliasing
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Thu, 7 Apr 2016 01:43:23 +0000 (18:43 -0700)]
perf: remove unused __addr variable
now all calls to perf_trace_buf_submit() pass 0 as 4th
argument which will be repurposed in the next patch which will
change the meaning of 1st arg of perf_tp_event() to event_type
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Thu, 7 Apr 2016 01:43:22 +0000 (18:43 -0700)]
perf: optimize perf_fetch_caller_regs
avoid memset in perf_fetch_caller_regs, since it's the critical path of all tracepoints.
It's called from perf_sw_event_sched, perf_event_task_sched_in and all of perf_trace_##call
with this_cpu_ptr(&__perf_regs[..]) which are zero initialized by perpcu init logic and
subsequent call to perf_arch_fetch_caller_regs initializes the same fields on all archs,
so we can safely drop memset from all of the above cases and move it into
perf_ftrace_function_call that calls it with stack allocated pt_regs.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 8 Apr 2016 00:40:25 +0000 (20:40 -0400)]
net: Fix build failure due to lockdep_sock_is_held().
Needs to be protected with CONFIG_LOCKDEP.
Based upon a patch by Hannes Frederic Sowa.
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 8 Apr 2016 00:22:20 +0000 (17:22 -0700)]
Merge tag 'ext4_for_linus_stable' of git://git./linux/kernel/git/tytso/ext4
Pull ext4 bugfixes from Ted Ts'o:
"These changes contains a fix for overlayfs interacting with some
(badly behaved) dentry code in various file systems. These have been
reviewed by Al and the respective file system mtinainers and are going
through the ext4 tree for convenience.
This also has a few ext4 encryption bug fixes that were discovered in
Android testing (yes, we will need to get these sync'ed up with the
fs/crypto code; I'll take care of that). It also has some bug fixes
and a change to ignore the legacy quota options to allow for xfstests
regression testing of ext4's internal quota feature and to be more
consistent with how xfs handles this case"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: ignore quota mount options if the quota feature is enabled
ext4 crypto: fix some error handling
ext4: avoid calling dquot_get_next_id() if quota is not enabled
ext4: retry block allocation for failed DIO and DAX writes
ext4: add lockdep annotations for i_data_sem
ext4: allow readdir()'s of large empty directories to be interrupted
btrfs: fix crash/invalid memory access on fsync when using overlayfs
ext4 crypto: use dget_parent() in ext4_d_revalidate()
ext4: use file_dentry()
ext4: use dget_parent() in ext4_file_open()
nfs: use file_dentry()
fs: add file_dentry()
ext4 crypto: don't let data integrity writebacks fail with ENOMEM
ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()
Mark Rustad [Fri, 1 Apr 2016 19:18:51 +0000 (12:18 -0700)]
ixgbe: Bump version number
Update ixgbe version number.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Fri, 1 Apr 2016 19:18:46 +0000 (12:18 -0700)]
ixgbe: Add KR backplane support for x550em_a
Add support for x550em_a-based KR backplane devices.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Fri, 1 Apr 2016 19:18:40 +0000 (12:18 -0700)]
ixgbe: Add support for SGMII backplane interface
Add support for an SGMII backplane interface.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Fri, 1 Apr 2016 19:18:35 +0000 (12:18 -0700)]
ixgbe: Add support for SFPs with retimer
Add support for SFPs with an external retimer.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Fri, 1 Apr 2016 19:18:30 +0000 (12:18 -0700)]
ixgbe: Introduce function to control MDIO speed
Move code that controls MDIO speed into a new function because
there will be more MACs that need the control.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Fri, 1 Apr 2016 19:18:25 +0000 (12:18 -0700)]
ixgbe: Read and parse NW_MNG_IF_SEL register
Read the IXGBE_NW_MNG_IF_SEL register and use it to set interface
attributes.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Linus Torvalds [Thu, 7 Apr 2016 23:34:26 +0000 (16:34 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/sage/ceph-client
Pull Ceph fix from Sage Weil:
"This just fixes a few remaining memory allocations in RBD to use
GFP_NOIO instead of GFP_ATOMIC"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: use GFP_NOIO consistently for request allocations
Mark Rustad [Fri, 1 Apr 2016 19:18:20 +0000 (12:18 -0700)]
ixgbe: Read and set instance id
Read the instance number from EEPROM and save it for later use.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Fri, 1 Apr 2016 19:18:14 +0000 (12:18 -0700)]
ixgbe: Use new methods for PHY access
Now x550em_a devices will use a new method for PHY access that will
get the firmware token for each access.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Linus Torvalds [Thu, 7 Apr 2016 23:20:40 +0000 (16:20 -0700)]
Merge tag 'for_linus' of git://git./linux/kernel/git/mst/vhost
Pull virtio/qemu fixes from Michael S Tsirkin:
"A couple of fixes for virtio and for the new QEMU fw cfg driver"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio: add VIRTIO_CONFIG_S_NEEDS_RESET device status bit
MAINTAINERS: add entry for QEMU
firmware: qemu_fw_cfg.c: hold ACPI global lock during device access
virtio: virtio 1.0 cs04 spec compliance for reset
qemu_fw_cfg: don't leak kobj on init error
Mark Rustad [Fri, 1 Apr 2016 19:18:09 +0000 (12:18 -0700)]
ixgbe: Add support for x550em_a 10G MAC type
Add support for x550em_a 10G MAC type to the ixgbe driver. The new
MAC includes new firmware commands that need to be used to control
PHY and IOSF access, so that support is also added. The interface
supported is a native SFP+ interface.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Fri, 1 Apr 2016 19:18:04 +0000 (12:18 -0700)]
ixgbe: Use method pointer to access IOSF devices
Provide method pointers and use them to access IOSF-attached
devices. A new MAC will introduce a new access method.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Fri, 1 Apr 2016 19:17:59 +0000 (12:17 -0700)]
ixgbe: Add definitions for x550em_a 10G MAC
Add definitions for a x550em_a 10G MAC device with a native SFP
interface.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Mon, 21 Mar 2016 18:21:31 +0000 (11:21 -0700)]
ixgbe: Add support for single-port X550 device
Add support for a single-port X550 device.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Mon, 7 Mar 2016 17:30:09 +0000 (09:30 -0800)]
ixgbe/ixgbevf: Add support for bulk free in Tx cleanup & cleanup boolean logic
This patch enables bulk free in Tx cleanup for ixgbevf and cleans up the
boolean logic in the polling routines for ixgbe and ixgbevf in the hopes of
avoiding any mix-ups similar to what occurred with i40e and i40evf.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Linus Torvalds [Thu, 7 Apr 2016 22:53:19 +0000 (15:53 -0700)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"This is mostly amdgpu/radeon fixes, and imx related fixes.
There is also one one TTM fix, one nouveau fix, and one hdlcd fix.
The AMD ones are some fixes for power management after suspend/resume
one some GPUs, and some vblank fixes.
The IMX ones are for more stricter plane checks and some cleanups.
I'm off until Monday, so therre might be some fixes early next week if
anyone missed me"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (34 commits)
drm/nouveau/tegra: acquire and enable reference clock if needed
drm/amdgpu: total vram size also reduces pin size
drm/amd/powerplay: add uvd/vce dpm enabling flag default.
drm/amd/powerplay: fix issue that resume back, dpm can't work on FIJI.
drm/amdgpu: save and restore the firwmware cache part when suspend resume
drm/amdgpu: save and restore UVD context with suspend and resume
drm/ttm: use phys_addr_t for ttm_bus_placement
drm: ARM HDLCD - fix an error code
drm: ARM HDLCD - get rid of devm_clk_put()
drm/radeon: Only call drm_vblank_on/off between drm_vblank_init/cleanup
drm/amdgpu: fence wait old rcu slot
drm/amdgpu: fix leaking fence in the pageflip code
drm/amdgpu: print vram type rather than just DDR
drm/amdgpu/gmc: use proper register for vram type on Fiji
drm/amdgpu/gmc: move vram type fetching into sw_init
drm/amdgpu: Set vblank_disable_allowed = true
drm/radeon: Set vblank_disable_allowed = true
drm/amd/powerplay: Need to change boot to performance state in resume.
drm/amd/powerplay: add new Fiji function for not setting same ps.
drm/amdgpu: check dpm state before pm system fs initialized.
...
Mark Rustad [Mon, 14 Mar 2016 18:06:02 +0000 (11:06 -0700)]
ixgbe: Take manageability semaphore for firmware commands
We need to take the manageability semaphore when issuing firmware
commands to avoid problems. With this in place, the semaphore is
no longer taken in the ixgbe_set_fw_drv_ver_generic function, since
it will now always be taken by the ixgbe_host_interface_command
function.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Mon, 14 Mar 2016 18:05:57 +0000 (11:05 -0700)]
ixgbe: Clean up interface for firmware commands
Clean up the interface for issuing firmware commands to use a
void * instead of a u32 *. This eliminates a number of casts.
Also clean up ixgbe_host_interface_command in a few other ways,
eliminating comparisons with 0, redundant parens and minor
formatting issues.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Mon, 14 Mar 2016 18:05:51 +0000 (11:05 -0700)]
ixgbe: Correct length check for round up
The function ixgbe_host_interface_command actually uses a multiple
of word sized buffer to do its business, but only checks against
the actual length passed in. This means that on read operations it
could be possible to modify locations beyond the length passed in.
Change the check to round up in the same way, just to avoid any
possible hazard.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Mon, 14 Mar 2016 18:05:46 +0000 (11:05 -0700)]
ixgbe: Change the lan_id and func fields to a u8 to avoid casts
Since the lan_id and func fields only ever hold small values, make
them u8 to avoid casts used to silence warnings.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Mon, 14 Mar 2016 18:05:40 +0000 (11:05 -0700)]
ixgbe: Delete some unused register definitions
I noticed the SRAMREL registers are not referenced for any device,
so delete the definitions.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Hannes Frederic Sowa [Thu, 7 Apr 2016 21:53:35 +0000 (23:53 +0200)]
sock: make lockdep_sock_is_held static inline
I forgot to add inline to lockdep_sock_is_held, so it generated all
kinds of build warnings if not build with lockdep support.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 7 Apr 2016 21:00:14 +0000 (17:00 -0400)]
Merge branch 'tipc-next'
Jon Maloy says:
====================
tipc: some small fixes
When fix a minor buffer leak, and ensure that bearers filter packets
correctly while they are being shut down.
v2: Corrected typos in commit #3, as per feedback from S. Shtylyov
v3: Removed commit #3 from the series. Improved version will be
re-submitted later.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Thu, 7 Apr 2016 14:09:14 +0000 (10:09 -0400)]
tipc: stricter filtering of packets in bearer layer
Resetting a bearer/interface, with the consequence of resetting all its
pertaining links, is not an atomic action. This becomes particularly
evident in very large clusters, where a lot of traffic may happen on the
remaining links while we are busy shutting them down. In extreme cases,
we may even see links being re-created and re-established before we are
finished with the job.
To solve this, we now introduce a solution where we temporarily detach
the bearer from the interface when the bearer is reset. This inhibits
all packet reception, while sending still is possible. For the latter,
we use the fact that the device's user pointer now is zero to filter out
which packets can be sent during this situation; i.e., outgoing RESET
messages only. This filtering serves to speed up the neighbors'
detection of the loss event, and saves us from unnecessary probing.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Thu, 7 Apr 2016 14:09:13 +0000 (10:09 -0400)]
tipc: eliminate buffer leak in bearer layer
When enabling a bearer we create a 'neigbor discoverer' instance by
calling the function tipc_disc_create() before the bearer is actually
registered in the list of enabled bearers. Because of this, the very
first discovery broadcast message, created by the mentioned function,
is lost, since it cannot find any valid bearer to use. Furthermore,
the used send function, tipc_bearer_xmit_skb() does not free the given
buffer when it cannot find a bearer, resulting in the leak of exactly
one send buffer each time a bearer is enabled.
This commit fixes this problem by introducing two changes:
1) Instead of attemting to send the discovery message directly, we let
tipc_disc_create() return the discovery buffer to the calling
function, tipc_enable_bearer(), so that the latter can send it
when the enabling sequence is finished.
2) In tipc_bearer_xmit_skb(), as well as in the two other transmit
functions at the bearer layer, we now free the indicated buffer or
buffer chain when a valid bearer cannot be found.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
shamir rabinovitch [Thu, 7 Apr 2016 11:57:36 +0000 (07:57 -0400)]
RDS: fix congestion map corruption for PAGE_SIZE > 4k
When PAGE_SIZE > 4k single page can contain 2 RDS fragments. If
'rds_ib_cong_recv' ignore the RDS fragment offset in to the page it
then read the data fragment as far congestion map update and lead to
corruption of the RDS connection far congestion map.
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
shamir rabinovitch [Thu, 7 Apr 2016 11:57:35 +0000 (07:57 -0400)]
RDS: memory allocated must be align to 8
Fix issue in 'rds_ib_cong_recv' when accessing unaligned memory
allocated by 'rds_page_remainder_alloc' using uint64_t pointer.
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Tue, 5 Apr 2016 16:13:39 +0000 (09:13 -0700)]
GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU
This patch fixes an issue I found in which we were dropping frames if we
had enabled checksums on GRE headers that were encapsulated by either FOU
or GUE. Without this patch I was barely able to get 1 Gb/s of throughput.
With this patch applied I am now at least getting around 6 Gb/s.
The issue is due to the fact that with FOU or GUE applied we do not provide
a transport offset pointing to the GRE header, nor do we offload it in
software as the GRE header is completely skipped by GSO and treated like a
VXLAN or GENEVE type header. As such we need to prevent the stack from
generating it and also prevent GRE from generating it via any interface we
create.
Fixes:
c3483384ee511 ("gro: Allow tunnel stacking in the case of FOU/GUE")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 7 Apr 2016 20:53:37 +0000 (16:53 -0400)]
Merge branch 'gro-in-udp'
Tom Herbert says:
====================
udp: GRO in UDP sockets
This patch set adds GRO functions (gro_receive and gro_complete) to UDP
sockets and removes udp_offload infrastructure.
Add GRO functions (gro_receive and gro_complete) to UDP sockets. In
udp_gro_receive and udp_gro_complete a socket lookup is done instead of
looking up the port number in udp_offloads. If a socket is found and
there are GRO functions for it then those are called. This feature
allows binding GRO functions to more than just a port number.
Eventually, we will be able to use this technique to allow application
defined GRO for an application protocol by attaching BPF porgrams to UDP
sockets for doing GRO.
In order to implement these functions, we added exported
udp6_lib_lookup_skb and udp4_lib_lookup_skb functions in ipv4/udp.c and
ipv6/udp.c. Also, inet_iif and references to skb_dst() were changed to
check that dst is set in skbuf before derefencing. In the GRO path there
is now a UDP socket lookup performed before dst is set, to the get the
device in that case we simply use skb->dev.
Tested:
Ran various combinations of VXLAN and GUE TCP_STREAM and TCP_RR tests.
Did not see any material regression.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 5 Apr 2016 15:22:56 +0000 (08:22 -0700)]
udp: Remove udp_offloads
Now that the UDP encapsulation GRO functions have been moved to the UDP
socket we not longer need the udp_offload insfrastructure so removing it.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 5 Apr 2016 15:22:55 +0000 (08:22 -0700)]
geneve: change to use UDP socket GRO
Adapt geneve_gro_receive, geneve_gro_complete to take a socket argument.
Set these functions in tunnel_config. Don't set udp_offloads any more.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 5 Apr 2016 15:22:54 +0000 (08:22 -0700)]
fou: change to use UDP socket GRO
Adapt gue_gro_receive, gue_gro_complete to take a socket argument.
Don't set udp_offloads any more.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 5 Apr 2016 15:22:53 +0000 (08:22 -0700)]
vxlan: change vxlan to use UDP socket GRO
Adapt vxlan_gro_receive, vxlan_gro_complete to take a socket argument.
Set these functions in tunnel_config. Don't set udp_offloads any more.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 5 Apr 2016 15:22:52 +0000 (08:22 -0700)]
udp: Add socket based GRO and config
Add gro_receive and gro_complete to struct udp_tunnel_sock_cfg.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 5 Apr 2016 15:22:51 +0000 (08:22 -0700)]
udp: Add GRO functions to UDP socket
This patch adds GRO functions (gro_receive and gro_complete) to UDP
sockets. udp_gro_receive is changed to perform socket lookup on a
packet. If a socket is found the related GRO functions are called.
This features obsoletes using UDP offload infrastructure for GRO
(udp_offload). This has the advantage of not being limited to provide
offload on a per port basis, GRO is now applied to whatever individual
UDP sockets are bound to. This also allows the possbility of
"application defined GRO"-- that is we can attach something like
a BPF program to a UDP socket to perfrom GRO on an application
layer protocol.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 5 Apr 2016 15:22:50 +0000 (08:22 -0700)]
udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skb
Add externally visible functions to lookup a UDP socket by skb. This
will be used for GRO in UDP sockets. These functions also check
if skb->dst is set, and if it is not skb->dev is used to get dev_net.
This allows calling lookup functions before dst has been set on the
skbuff.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 5 Apr 2016 15:22:49 +0000 (08:22 -0700)]
net: Checks skb_dst to be NULL in inet_iif
In inet_iif check if skb_rtable is NULL for the skb and return
skb->skb_iif if it is.
This change allows inet_iif to be called before the dst
information has been set in the skb (e.g. when doing socket based
UDP GRO).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 7 Apr 2016 20:44:15 +0000 (16:44 -0400)]
Merge branch 'sock-lockdep-tightening'
Hannes Frederic Sowa says:
====================
sock: lockdep tightening
First patch is from Eric Dumazet and improves lockdep accuracy for
socket locks. After that, second patch introduces lockdep_sock_is_held
and uses it. Final patch reverts and reworks the lockdep fix from Daniel
in the filter code, as we now have tighter lockdep support.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Hannes Frederic Sowa [Tue, 5 Apr 2016 15:10:16 +0000 (17:10 +0200)]
tun: use socket locks for sk_{attach,detatch}_filter
This reverts commit
5a5abb1fa3b05dd ("tun, bpf: fix suspicious RCU usage
in tun_{attach, detach}_filter") and replaces it to use lock_sock around
sk_{attach,detach}_filter. The checks inside filter.c are updated with
lockdep_sock_is_held to check for proper socket locks.
It keeps the code cleaner by ensuring that only one lock governs the
socket filter instead of two independent locks.
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hannes Frederic Sowa [Tue, 5 Apr 2016 15:10:15 +0000 (17:10 +0200)]
net: introduce lockdep_is_held and update various places to use it
The socket is either locked if we hold the slock spin_lock for
lock_sock_fast and unlock_sock_fast or we own the lock (sk_lock.owned
!= 0). Check for this and at the same time improve that the current
thread/cpu is really holding the lock.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hannes Frederic Sowa [Tue, 5 Apr 2016 15:10:14 +0000 (17:10 +0200)]
sock: fix lockdep annotation in release_sock
During release_sock we use callbacks to finish the processing
of outstanding skbs on the socket. We actually are still locked,
sk_locked.owned == 1, but we already told lockdep that the mutex
is released. This could lead to false positives in lockdep for
lockdep_sock_is_held (we don't hold the slock spinlock during processing
the outstanding skbs).
I took over this patch from Eric Dumazet and tested it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Strashko, Grygorii [Wed, 6 Apr 2016 11:45:53 +0000 (14:45 +0300)]
PM / wakeirq: fix wakeirq setting after wakup re-configuration from sysfs
Now wakeirq stops working for device if wakeup option for
this device will be reconfigured through sysfs, like:
echo disabled > /sys/devices/platform/extcon_usb1/power/wakeup
echo enabled > /sys/devices/platform/extcon_usb1/power/wakeup
Once above set of commands is executed the device's wakeup_source
opject will be recreated and dev->power.wakeup->wakeirq field will
contain NULL. As result, device_wakeup_arm_wake_irqs() will not arm
wakeirq for the affected device.
Hece, lets try to fix it in the following way:
check for dev->wakeirq field when device_wakeup_attach() is called
and if !NULL re-attach wakeirq to the device
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Len Brown [Wed, 6 Apr 2016 21:16:00 +0000 (17:16 -0400)]
tools/power turbostat: work around RC6 counter wrap
Sometimes the rc6 sysfs counter spontaneously resets,
causing turbostat prints a very large number
as it tries to calcuate % = 100 * (old - new) / interval
When we see (old > new), print ***.**% instead
of a bogus huge number.
Note that this detection is not fool-proof, as the counter
could reset several times and still result in new > old.
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Len Brown [Wed, 6 Apr 2016 21:15:59 +0000 (17:15 -0400)]
tools/power turbostat: initial KBL support
KBL is similar to SKL
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Len Brown [Wed, 6 Apr 2016 21:15:58 +0000 (17:15 -0400)]
tools/power turbostat: initial SKX support
SKX has a lot in common with HSX
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Len Brown [Wed, 6 Apr 2016 21:15:57 +0000 (17:15 -0400)]
tools/power turbostat: decode BXT TSC frequency via CPUID
Hard-code BXT ART to 19200MHz, so turbostat --debug
can fully enumerate TSC:
CPUID(0x15): eax_crystal: 3 ebx_tsc: 186 ecx_crystal_hz: 0
TSC: 1190 MHz (
19200000 Hz * 186 / 3 / 1000000)
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Len Brown [Wed, 6 Apr 2016 21:15:56 +0000 (17:15 -0400)]
tools/power turbostat: initial BXT support
Broxton has a lot in common with SKL
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Len Brown [Wed, 6 Apr 2016 21:15:55 +0000 (17:15 -0400)]
tools/power turbostat: print IRTL MSRs
Some processors use the Interrupt Response Time Limit (IRTL) MSR value
to describe the maximum IRQ response time latency for deep
package C-states. (Though others have the register, but do not use it)
Lets print it out to give insight into the cases where it is used.
IRTL begain in SNB, with PC3/PC6/PC7, and HSW added PC8/PC9/PC10.
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Len Brown [Wed, 6 Apr 2016 21:15:54 +0000 (17:15 -0400)]
tools/power turbostat: SGX state should print only if --debug
The CPUID.SGX bit was printed, even if --debug was used
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Len Brown [Wed, 6 Apr 2016 21:00:59 +0000 (17:00 -0400)]
intel_idle: Add KBL support
KBL is similar to SKL
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Len Brown [Wed, 6 Apr 2016 21:00:58 +0000 (17:00 -0400)]
intel_idle: Add SKX support
SKX is similar to BDX
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Richard Cochran [Wed, 6 Apr 2016 21:00:57 +0000 (17:00 -0400)]
intel_idle: Clean up all registered devices on exit.
This driver registers cpuidle devices when a CPU comes online, but it
leaves the registrations in place when a CPU goes offline. The module
exit code only unregisters the currently online CPUs, leaving the
devices for offline CPUs dangling.
This patch changes the driver to clean up all registrations on exit,
even those from CPUs that are offline.
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Richard Cochran [Wed, 6 Apr 2016 21:00:56 +0000 (17:00 -0400)]
intel_idle: Propagate hot plug errors.
If a cpuidle registration error occurs during the hot plug notifier
callback, we should really inform the hot plug machinery instead of
just ignoring the error. This patch changes the callback to properly
return on error.
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Richard Cochran [Wed, 6 Apr 2016 21:00:55 +0000 (17:00 -0400)]
intel_idle: Don't overreact to a cpuidle registration failure.
The helper function, intel_idle_cpu_init, registers one new device
with the cpuidle layer. If the registration should fail, that
function immediately calls intel_idle_cpuidle_devices_uninit() to
unregister every last CPU's device. However, it makes no sense to do
so, when called from the hot plug notifier callback.
This patch moves the call to intel_idle_cpuidle_devices_uninit()
outside of the helper function to the one call site that actually
needs to perform the de-registrations.
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Richard Cochran [Wed, 6 Apr 2016 21:00:54 +0000 (17:00 -0400)]
intel_idle: Setup the timer broadcast only on successful driver load.
This driver sets the broadcast tick quite early on during probe and does
not clean up again in cast of failure. This patch moves the setup call
after the registration, placing the on_each_cpu() calls within the global
CPU lock region.
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Richard Cochran [Wed, 6 Apr 2016 21:00:53 +0000 (17:00 -0400)]
intel_idle: Avoid a double free of the per-CPU data.
The helper function, intel_idle_cpuidle_devices_uninit, frees the
globally allocated per-CPU data. However, this function is invoked
from the hot plug notifier callback at a time when freeing that data
is not safe.
If the call to cpuidle_register_driver() should fail (say, due to lack
of memory), then the driver will free its per-CPU region. On the
*next* CPU_ONLINE event, the driver will happily use the region again
and even free it again if the failure repeats.
This patch fixes the issue by moving the call to free_percpu() outside
of the helper function at the two call sites that actually need to
free the per-CPU data.
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Richard Cochran [Wed, 6 Apr 2016 21:00:52 +0000 (17:00 -0400)]
intel_idle: Fix dangling registration on error path.
In the module_init() method, if the per-CPU allocation fails, then the
active cpuidle registration is not cleaned up. This patch fixes the
issue by attempting the allocation before registration, and then
cleaning it up again on registration failure.
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Richard Cochran [Wed, 6 Apr 2016 21:00:51 +0000 (17:00 -0400)]
intel_idle: Fix deallocation order on the driver exit path.
In the module_exit() method, this driver first frees its per-CPU
pointer, then unregisters a callback making use of the pointer.
Furthermore, the function, intel_idle_cpuidle_devices_uninit, is racy
against CPU hot plugging as it calls for_each_online_cpu().
This patch corrects the issues by unregistering first on the exit path
while holding the hot plug lock.
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Richard Cochran [Wed, 6 Apr 2016 21:00:50 +0000 (17:00 -0400)]
intel_idle: Remove redundant initialization calls.
The function, intel_idle_cpuidle_driver_init, makes calls on each CPU
to auto_demotion_disable() and c1e_promotion_disable(). These calls
are redundant, as intel_idle_cpu_init() does the same calls just a bit
later on. They are also premature, as the driver registration may yet
fail.
This patch removes the redundant code.
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>