Ariel Elior [Tue, 1 Jan 2013 05:22:32 +0000 (05:22 +0000)]
bnx2x: Prepare device and initialize VF database
At nic load of the PF, if VFs may be present, prepare the device
for the VFs. Initialize the VF database in preparation of VF arrival.
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ariel Elior [Tue, 1 Jan 2013 05:22:31 +0000 (05:22 +0000)]
bnx2x: Allocate VF database in PF when VFs are present
When A PF determines that it may have to manage SRIOV VFs it
allocates a database for this purpose. The database is intended to
keep track of the VF state, the resources allocated for each VF
(queues, interrupt vectors, etc), the state of the VF's queues.
When the VF loads the database is updated accordingly.
When A VF closes the database is consulted to determine which
resources need to be released (close queues against device, reclaim
interrupt vectors, etc).
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ariel Elior [Tue, 1 Jan 2013 05:22:30 +0000 (05:22 +0000)]
bnx2x: VF fastpath
When VF driver is transmitting it must supply the correct mac
address in the parsing BD. This is used for firmware validation
and enforcement and also for tx-switching.
Refactor interrupt ack flow to allow for different BAR addresses of
the hardware in the PF BAR vs the VF BAR.
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ariel Elior [Tue, 1 Jan 2013 05:22:29 +0000 (05:22 +0000)]
bnx2x: Support ndo_set_rxmode in VF driver
The VF driver uses the 'q_filter' request in the VF <-> PF channel to
have the PF configure the requested rxmode to device. ndo_set_rxmode
is called under bottom half lock, so sleeping until the response
arrives over the VF <-> PF channel is out of the question. For this reason
the VF driver returns from the ndo after scheduling a work item, which
in turn processes the rx mode request and adds the classification
information through the VF <-> PF channel accordingly.
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ariel Elior [Tue, 1 Jan 2013 05:22:28 +0000 (05:22 +0000)]
bnx2x: Add teardown_q and close to VF <-> PF channel
When a VF is being closed its queues are released via
the 'teardown_q' and the VF itself is closed with
'close'. These are essentially the unload counterparts of
'init' and 'setup_q' from the load flow.
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ariel Elior [Tue, 1 Jan 2013 05:22:27 +0000 (05:22 +0000)]
bnx2x: Add init, setup_q, set_mac to VF <-> PF channel
'init' - init an acquired VF. Supply allocation GPAs to PF.
'setup_q' - PF to allocate a queue in device on behalf of the VF.
'set_mac' - PF to configure a mac in device on behalf of the VF.
VF driver uses these requests in the VF <-> PF channel in nic_load
flow.
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ariel Elior [Tue, 1 Jan 2013 05:22:26 +0000 (05:22 +0000)]
bnx2x: Separate VF and PF logic
Generally, the VF driver cannot access the chip, except by the
narrow window its BAR allows. Care had to be taken so the VF driver
will not reach code which accesses the chip elsewhere.
Refactor the nic_load flow into parts so it would be
easier to separate the VF-only logic from the PF-only logic.
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ariel Elior [Tue, 1 Jan 2013 05:22:25 +0000 (05:22 +0000)]
bnx2x: Add to VF <-> PF channel the release request
VF driver uses this request when removed. The PF driver
reclaims all resources allocated for that VF at this
time.
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ariel Elior [Tue, 1 Jan 2013 05:22:24 +0000 (05:22 +0000)]
bnx2x: VF <-> PF channel 'acquire' at vf probe
Add the 'acquire' request to VF <-> PF channel and use it at
VF probe. In the acquire request the VF driver lists the resources
it would like to have. In the response the PF either ratifies the
request, or denies it and supplies the maximum values supported.
The VF may then attempt another acquire request.
This patch adds the bnx2x_vfpf.c file which contains the
implementation of the VF to PF hardware channel.
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ariel Elior [Tue, 1 Jan 2013 05:22:23 +0000 (05:22 +0000)]
bnx2x: Support probing and removing of VF device
To support probing and removing of a bnx2x virtual function
the following were added:
1. add bnx2x_vfpf.h: defines the VF to PF channel
2. add bnx2x_sriov.h: header for bnx2x SR-IOV functionality
3. enumerate VF hw types (identify VFs)
4. if driving a VF, map VF bar
5. if driving a VF, allocate Vf to PF channel
6. refactor interrupt flows to include VF
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flavio Leitner [Sat, 29 Dec 2012 16:37:33 +0000 (16:37 +0000)]
team: add ethtool support
This patch adds few ethtool operations to team driver.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 29 Dec 2012 16:26:10 +0000 (16:26 +0000)]
veth: extend device features
veth is lacking most modern facilities, like SG, checksums, TSO.
It makes sense to extend dev->features to get them, or GRO aggregation
is defeated by a forced segmentation.
Reported-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 29 Dec 2012 16:02:43 +0000 (16:02 +0000)]
veth: reduce stat overhead
veth stats are a bit bloated. There is no need to account transmit
and receive stats, since they are absolutely symmetric.
Also use a per device atomic64_t for the dropped counter, as it
should never be used in fast path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flavio Leitner [Sat, 29 Dec 2012 15:31:01 +0000 (15:31 +0000)]
team: implement carrier change
The user space teamd daemon may need to control the
master's carrier state depending on the selected mode.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Fri, 28 Dec 2012 18:15:22 +0000 (18:15 +0000)]
bridge: respect RFC2863 operational state
The bridge link detection should follow the operational state
of the lower device, rather than the carrier bit. This allows devices
like tunnels that are controlled by userspace control plane to work
with bridge STP link management.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Reviewed-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Fri, 28 Dec 2012 10:50:17 +0000 (10:50 +0000)]
net: filter: return -EINVAL if BPF_S_ANC* operation is not supported
Currently, we return -EINVAL for malformed or wrong BPF filters.
However, this is not done for BPF_S_ANC* operations, which makes it
more difficult to detect if it's actually supported or not by the
BPF machine. Therefore, we should also return -EINVAL if K is within
the SKF_AD_OFF universe and the ancillary operation did not match.
Why exactly is it needed? If tools such as libpcap/tcpdump want to
make use of new ancillary operations (like filtering VLAN in kernel
space), there is currently no sane way to test if this feature /
BPF_S_ANC* op is present or not, since no error is returned. This
patch will make life easier for that and allow for a proper usage
for user space applications.
There was concern, if this patch will break userland. Short answer: Yes
and no. Long answer: It will "break" only for code that calls ...
{ BPF_LD | BPF_(W|H|B) | BPF_ABS, 0, 0, <K> },
... where <K> is in [0xfffff000, 0xffffffff] _and_ <K> is *not* an
ancillary. And here comes the BUT: assuming some *old* code will have
such an instruction where <K> is between [0xfffff000, 0xffffffff] and
it doesn't know ancillary operations, then this will give a
non-expected / unwanted behavior as well (since we do not return the
BPF machine with 0 after a failed load_pointer(), which was the case
before introducing ancillary operations, but load sth. into the
accumulator instead, and continue with the next instruction, for
instance). Thus, user space code would already have been broken by
introducing ancillary operations into the BPF machine per se. Code
that does such a direct load, e.g. "load word at packet offset
0xffffffff into accumulator" ("ld [0xffffffff]") is quite broken,
isn't it? The whole assumption of ancillary operations is that no-one
intentionally calls things like "ld [0xffffffff]" and expect this
word to be loaded from such a packet offset. Hence, we can also safely
make use of this feature testing patch and facilitate application
development. Therefore, at least from this patch onwards, we have
*for sure* a check whether current or in future implemented BPF_S_ANC*
ops are supported in the kernel. Patch was tested on x86_64.
(Thanks to Eric for the previous review.)
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Ani Sinha <ani@aristanetworks.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Fri, 28 Dec 2012 18:24:28 +0000 (18:24 +0000)]
skbuff: make __kmalloc_reserve static
Sparse detected case where this local function should be static.
It may even allow some compiler optimizations.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Fri, 28 Dec 2012 18:20:24 +0000 (18:20 +0000)]
tcp: make proc_tcp_fastopen_key static
Detected by sparse.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Fri, 28 Dec 2012 18:18:55 +0000 (18:18 +0000)]
sctp: make sctp_addr_wq_timeout_handler static
Fix sparse warning about local function that should be static.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 28 Dec 2012 06:06:37 +0000 (06:06 +0000)]
net: use per task frag allocator in skb_append_datato_frags
Use the new per task frag allocator in skb_append_datato_frags(),
to reduce number of frags and page allocator overhead.
Tested:
ifconfig lo mtu 16436
perf record netperf -t UDP_STREAM ; perf report
before :
Throughput: 32928 Mbit/s
51.79% netperf [kernel.kallsyms] [k] copy_user_generic_string
5.98% netperf [kernel.kallsyms] [k] __alloc_pages_nodemask
5.58% netperf [kernel.kallsyms] [k] get_page_from_freelist
5.01% netperf [kernel.kallsyms] [k] __rmqueue
3.74% netperf [kernel.kallsyms] [k] skb_append_datato_frags
1.87% netperf [kernel.kallsyms] [k] prep_new_page
1.42% netperf [kernel.kallsyms] [k] next_zones_zonelist
1.28% netperf [kernel.kallsyms] [k] __inc_zone_state
1.26% netperf [kernel.kallsyms] [k] alloc_pages_current
0.78% netperf [kernel.kallsyms] [k] sock_alloc_send_pskb
0.74% netperf [kernel.kallsyms] [k] udp_sendmsg
0.72% netperf [kernel.kallsyms] [k] zone_watermark_ok
0.68% netperf [kernel.kallsyms] [k] __cpuset_node_allowed_softwall
0.67% netperf [kernel.kallsyms] [k] fib_table_lookup
0.60% netperf [kernel.kallsyms] [k] memcpy_fromiovecend
0.55% netperf [kernel.kallsyms] [k] __udp4_lib_lookup
after:
Throughput: 47185 Mbit/s
61.74% netperf [kernel.kallsyms] [k] copy_user_generic_string
2.07% netperf [kernel.kallsyms] [k] prep_new_page
1.98% netperf [kernel.kallsyms] [k] skb_append_datato_frags
1.02% netperf [kernel.kallsyms] [k] sock_alloc_send_pskb
0.97% netperf [kernel.kallsyms] [k] enqueue_task_fair
0.97% netperf [kernel.kallsyms] [k] udp_sendmsg
0.91% netperf [kernel.kallsyms] [k] __ip_route_output_key
0.88% netperf [kernel.kallsyms] [k] __netif_receive_skb
0.87% netperf [kernel.kallsyms] [k] fib_table_lookup
0.85% netperf [kernel.kallsyms] [k] resched_task
0.78% netperf [kernel.kallsyms] [k] __udp4_lib_lookup
0.77% netperf [kernel.kallsyms] [k] _raw_spin_lock_irqsave
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 27 Dec 2012 23:49:40 +0000 (23:49 +0000)]
dummy: implement carrier change
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 27 Dec 2012 23:49:39 +0000 (23:49 +0000)]
rtnl: expose carrier value with possibility to set it
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 27 Dec 2012 23:49:38 +0000 (23:49 +0000)]
net: allow to change carrier via sysfs
Make carrier writable
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 27 Dec 2012 23:49:37 +0000 (23:49 +0000)]
net: add change_carrier netdev op
This allows a driver to register change_carrier callback which will be
called whenever user will like to change carrier state. This is useful
for devices like dummy, gre, team and so on.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sasha Levin [Thu, 20 Dec 2012 09:11:24 +0000 (09:11 +0000)]
bnx2x: use ARRAY_SIZE where possible
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 27 Dec 2012 18:46:47 +0000 (10:46 -0800)]
Merge tag 'hwmon-for-linus' of git://git./linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
- Report i2c errors to userspace in lm73 driver
- Fix problem with DIV_ROUND_CLOSEST and unsigned divisors in emc6w201
driver
* tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (emc6w201) Fix DIV_ROUND_CLOSEST problem with unsigned divisors
hwmon: (lm73} Detect and report i2c bus errors
Linus Torvalds [Thu, 27 Dec 2012 18:42:46 +0000 (10:42 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/ebiederm/user-namespace
Pull namespace fixes from Eric Biederman:
"This tree includes two bug fixes for problems Oleg spotted on his
review of the recent pid namespace work. A small fix to not enable
bottom halves with irqs disabled, and a trivial build fix for f2fs
with user namespaces enabled."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
f2fs: Don't assign e_id in f2fs_acl_from_disk
proc: Allow proc_free_inum to be called from any context
pidns: Stop pid allocation when init dies
pidns: Outlaw thread creation after unshare(CLONE_NEWPID)
Linus Torvalds [Thu, 27 Dec 2012 18:40:30 +0000 (10:40 -0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) GRE tunnel drivers don't set the transport header properly, they also
blindly deref the inner protocol ipv4 and needs some checks. Fixes
from Isaku Yamahata.
2) Fix sleeps while atomic in netdevice rename code, from Eric Dumazet.
3) Fix double-spinlock in solos-pci driver, from Dan Carpenter.
4) More ARP bug fixes. Fix lockdep splat in arp_solicit() and then the
bug accidentally added by that fix. From Eric Dumazet and Cong Wang.
5) Remove some __dev* annotations that slipped back in, as well as all
HOTPLUG references. From Greg KH
6) RDS protocol uses wrong interfaces to access scatter-gather elements,
causing a regression. From Mike Marciniszyn.
7) Fix build error in cpts driver, from Richard Cochran.
8) Fix arithmetic in packet scheduler, from Stefan Hasko.
9) Similarly, fix association during calculation of random backoff in
batman-adv. From Akinobu Mita.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (21 commits)
ipv6/ip6_gre: set transport header correctly
ipv4/ip_gre: set transport header correctly to gre header
IB/rds: suppress incompatible protocol when version is known
IB/rds: Correct ib_api use with gs_dma_address/sg_dma_len
net/vxlan: Use the underlying device index when joining/leaving multicast groups
tcp: should drop incoming frames without ACK flag set
netprio_cgroup: define sk_cgrp_prioidx only if NETPRIO_CGROUP is enabled
cpts: fix a run time warn_on.
cpts: fix build error by removing useless code.
batman-adv: fix random jitter calculation
arp: fix a regression in arp_solicit()
net: sched: integer overflow fix
CONFIG_HOTPLUG removal from networking core
Drivers: network: more __dev* removal
bridge: call br_netpoll_disable in br_add_if
ipv4: arp: fix a lockdep splat in arp_solicit()
tuntap: dont use a private kmem_cache
net: devnet_rename_seq should be a seqcount
ip_gre: fix possible use after free
ip_gre: make ipgre_tunnel_xmit() not parse network header as IP unconditionally
...
Isaku Yamahata [Mon, 24 Dec 2012 16:51:04 +0000 (16:51 +0000)]
ipv6/ip6_gre: set transport header correctly
ip6gre_xmit2() incorrectly sets transport header to inner payload
instead of GRE header. It seems copy-and-pasted from ipip.c.
Set transport header to gre header.
(In ipip case the transport header is the inner ip header, so that's
correct.)
Found by inspection. In practice the incorrect transport header
doesn't matter because the skb usually is sent to another net_device
or socket, so the transport header isn't referenced.
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Isaku Yamahata [Mon, 24 Dec 2012 16:51:03 +0000 (16:51 +0000)]
ipv4/ip_gre: set transport header correctly to gre header
ipgre_tunnel_xmit() incorrectly sets transport header to inner payload
instead of GRE header. It seems copy-and-pasted from ipip.c.
So set transport header to gre header.
(In ipip case the transport header is the inner ip header, so that's
correct.)
Found by inspection. In practice the incorrect transport header
doesn't matter because the skb usually is sent to another net_device
or socket, so the transport header isn't referenced.
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marciniszyn, Mike [Fri, 21 Dec 2012 08:01:54 +0000 (08:01 +0000)]
IB/rds: suppress incompatible protocol when version is known
Add an else to only print the incompatible protocol message
when version hasn't been established.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marciniszyn, Mike [Fri, 21 Dec 2012 08:01:49 +0000 (08:01 +0000)]
IB/rds: Correct ib_api use with gs_dma_address/sg_dma_len
0b088e00 ("RDS: Use page_remainder_alloc() for recv bufs")
added uses of sg_dma_len() and sg_dma_address(). This makes
RDS DOA with the qib driver.
IB ulps should use ib_sg_dma_len() and ib_sg_dma_address
respectively since some HCAs overload ib_sg_dma* operations.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yan Burman [Thu, 20 Dec 2012 03:36:08 +0000 (03:36 +0000)]
net/vxlan: Use the underlying device index when joining/leaving multicast groups
The socket calls from vxlan to join/leave multicast group aren't
using the index of the underlying device, as a result the stack uses
the first interface that is up. This results in vxlan being non functional
over a device which isn't the 1st to be up.
Fix this by providing the iflink field to the vxlan instance
to the multicast calls.
Signed-off-by: Yan Burman <yanb@mellanox.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 26 Dec 2012 12:44:34 +0000 (12:44 +0000)]
tcp: should drop incoming frames without ACK flag set
In commit
96e0bf4b5193d (tcp: Discard segments that ack data not yet
sent) John Dykstra enforced a check against ack sequences.
In commit
354e4aa391ed5 (tcp: RFC 5961 5.2 Blind Data Injection Attack
Mitigation) I added more safety tests.
But we missed fact that these tests are not performed if ACK bit is
not set.
RFC 793 3.9 mandates TCP should drop a frame without ACK flag set.
" fifth check the ACK field,
if the ACK bit is off drop the segment and return"
Not doing so permits an attacker to only guess an acceptable sequence
number, evading stronger checks.
Many thanks to Zhiyun Qian for bringing this issue to our attention.
See :
http://web.eecs.umich.edu/~zhiyunq/pub/ccs12_TCP_sequence_number_inference.pdf
Reported-by: Zhiyun Qian <zhiyunq@umich.edu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: John Dykstra <john.dykstra1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christoffer Dall [Fri, 21 Dec 2012 18:03:50 +0000 (13:03 -0500)]
mm: Fix PageHead when !CONFIG_PAGEFLAGS_EXTENDED
Unfortunately with !CONFIG_PAGEFLAGS_EXTENDED, (!PageHead) is false, and
(PageHead) is true, for tail pages. If this is indeed the intended
behavior, which I doubt because it breaks cache cleaning on some ARM
systems, then the nomenclature is highly problematic.
This patch makes sure PageHead is only true for head pages and PageTail
is only true for tail pages, and neither is true for non-compound pages.
[ This buglet seems ancient - seems to have been introduced back in Apr
2008 in commit
6a1e7f777f61: "pageflags: convert to the use of new
macros". And the reason nobody noticed is because the PageHead()
tests are almost all about just sanity-checking, and only used on
pages that are actual page heads. The fact that the old code returned
true for tail pages too was thus not really noticeable. - Linus ]
Signed-off-by: Christoffer Dall <cdall@cs.columbia.edu>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Will Deacon <Will.Deacon@arm.com>
Cc: Steve Capper <Steve.Capper@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: stable@kernel.org # 2.6.26+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Zefan [Tue, 25 Dec 2012 20:48:24 +0000 (20:48 +0000)]
netprio_cgroup: define sk_cgrp_prioidx only if NETPRIO_CGROUP is enabled
sock->sk_cgrp_prioidx won't be used at all if CONFIG_NETPRIO_CGROUP=n.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Richard Cochran [Sun, 23 Dec 2012 21:19:10 +0000 (21:19 +0000)]
cpts: fix a run time warn_on.
This patch fixes a warning in clk_enable by calling clk_prepare_enable
instead.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Richard Cochran [Sun, 23 Dec 2012 21:19:09 +0000 (21:19 +0000)]
cpts: fix build error by removing useless code.
The cpts driver tries to obtain the input clock frequency by calling the
clock's internal 'recalc' method. Since <plat/clock.h> has been removed,
this code can no longer compile.
However, the driver never makes use of the frequency value, so this patch
fixes the issue by removing the offending code altogether.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Akinobu Mita [Wed, 26 Dec 2012 02:32:10 +0000 (02:32 +0000)]
batman-adv: fix random jitter calculation
batadv_iv_ogm_emit_send_time() attempts to calculates a random integer
in the range of 'orig_interval +- BATADV_JITTER' by the below lines.
msecs = atomic_read(&bat_priv->orig_interval) - BATADV_JITTER;
msecs += (random32() % 2 * BATADV_JITTER);
But it actually gets 'orig_interval' or 'orig_interval - BATADV_JITTER'
because '%' and '*' have same precedence and associativity is
left-to-right.
This adds the parentheses at the appropriate position so that it matches
original intension.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Antonio Quartulli <ordex@autistici.org>
Cc: Marek Lindner <lindner_marek@yahoo.de>
Cc: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Cc: Antonio Quartulli <ordex@autistici.org>
Cc: b.a.t.m.a.n@lists.open-mesh.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman [Sat, 22 Dec 2012 09:52:39 +0000 (01:52 -0800)]
f2fs: Don't assign e_id in f2fs_acl_from_disk
With user namespaces enabled building f2fs fails with:
CC fs/f2fs/acl.o
fs/f2fs/acl.c: In function ‘f2fs_acl_from_disk’:
fs/f2fs/acl.c:85:21: error: ‘struct posix_acl_entry’ has no member named ‘e_id’
make[2]: *** [fs/f2fs/acl.o] Error 1
make[2]: Target `__build' not remade because of errors.
e_id is a backwards compatibility field only used for file systems
that haven't been converted to use kuids and kgids. When the posix
acl tag field is neither ACL_USER nor ACL_GROUP assigning e_id is
unnecessary. Remove the assignment so f2fs will build with user
namespaces enabled.
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Amit Sahrawat <a.sahrawat@samsung.com>
Acked-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Eric W. Biederman [Sat, 22 Dec 2012 04:38:00 +0000 (20:38 -0800)]
proc: Allow proc_free_inum to be called from any context
While testing the pid namespace code I hit this nasty warning.
[ 176.262617] ------------[ cut here ]------------
[ 176.263388] WARNING: at /home/eric/projects/linux/linux-userns-devel/kernel/softirq.c:160 local_bh_enable_ip+0x7a/0xa0()
[ 176.265145] Hardware name: Bochs
[ 176.265677] Modules linked in:
[ 176.266341] Pid: 742, comm: bash Not tainted 3.7.0userns+ #18
[ 176.266564] Call Trace:
[ 176.266564] [<
ffffffff810a539f>] warn_slowpath_common+0x7f/0xc0
[ 176.266564] [<
ffffffff810a53fa>] warn_slowpath_null+0x1a/0x20
[ 176.266564] [<
ffffffff810ad9ea>] local_bh_enable_ip+0x7a/0xa0
[ 176.266564] [<
ffffffff819308c9>] _raw_spin_unlock_bh+0x19/0x20
[ 176.266564] [<
ffffffff8123dbda>] proc_free_inum+0x3a/0x50
[ 176.266564] [<
ffffffff8111d0dc>] free_pid_ns+0x1c/0x80
[ 176.266564] [<
ffffffff8111d195>] put_pid_ns+0x35/0x50
[ 176.266564] [<
ffffffff810c608a>] put_pid+0x4a/0x60
[ 176.266564] [<
ffffffff8146b177>] tty_ioctl+0x717/0xc10
[ 176.266564] [<
ffffffff810aa4d5>] ? wait_consider_task+0x855/0xb90
[ 176.266564] [<
ffffffff81086bf9>] ? default_spin_lock_flags+0x9/0x10
[ 176.266564] [<
ffffffff810cab0a>] ? remove_wait_queue+0x5a/0x70
[ 176.266564] [<
ffffffff811e37e8>] do_vfs_ioctl+0x98/0x550
[ 176.266564] [<
ffffffff810b8a0f>] ? recalc_sigpending+0x1f/0x60
[ 176.266564] [<
ffffffff810b9127>] ? __set_task_blocked+0x37/0x80
[ 176.266564] [<
ffffffff810ab95b>] ? sys_wait4+0xab/0xf0
[ 176.266564] [<
ffffffff811e3d31>] sys_ioctl+0x91/0xb0
[ 176.266564] [<
ffffffff810a95f0>] ? task_stopped_code+0x50/0x50
[ 176.266564] [<
ffffffff81939199>] system_call_fastpath+0x16/0x1b
[ 176.266564] ---[ end trace
387af88219ad6143 ]---
It turns out that spin_unlock_bh(proc_inum_lock) is not safe when
put_pid is called with another spinlock held and irqs disabled.
For now take the easy path and use spin_lock_irqsave(proc_inum_lock)
in proc_free_inum and spin_loc_irq in proc_alloc_inum(proc_inum_lock).
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Eric W. Biederman [Sat, 22 Dec 2012 04:27:12 +0000 (20:27 -0800)]
pidns: Stop pid allocation when init dies
Oleg pointed out that in a pid namespace the sequence.
- pid 1 becomes a zombie
- setns(thepidns), fork,...
- reaping pid 1.
- The injected processes exiting.
Can lead to processes attempting access their child reaper and
instead following a stale pointer.
That waitpid for init can return before all of the processes in
the pid namespace have exited is also unfortunate.
Avoid these problems by disabling the allocation of new pids in a pid
namespace when init dies, instead of when the last process in a pid
namespace is reaped.
Pointed-out-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Eric W. Biederman [Fri, 21 Dec 2012 03:26:06 +0000 (19:26 -0800)]
pidns: Outlaw thread creation after unshare(CLONE_NEWPID)
The sequence:
unshare(CLONE_NEWPID)
clone(CLONE_THREAD|CLONE_SIGHAND|CLONE_VM)
Creates a new process in the new pid namespace without setting
pid_ns->child_reaper. After forking this results in a NULL
pointer dereference.
Avoid this and other nonsense scenarios that can show up after
creating a new pid namespace with unshare by adding a new
check in copy_prodcess.
Pointed-out-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cong Wang [Sun, 23 Dec 2012 15:23:16 +0000 (15:23 +0000)]
arp: fix a regression in arp_solicit()
Sedat reported the following commit caused a regression:
commit
9650388b5c56578fdccc79c57a8c82fb92b8e7f1
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Dec 21 07:32:10 2012 +0000
ipv4: arp: fix a lockdep splat in arp_solicit
This is due to the 6th parameter of arp_send() needs to be NULL
for the broadcast case, the above commit changed it to an all-zero
array by mistake.
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 23 Dec 2012 17:48:33 +0000 (09:48 -0800)]
Merge branch 'i2c-embedded/for-next' of git://git.pengutronix.de/git/wsa/linux
Pull i2c __dev* attribute removal from Wolfram Sang:
"The squashed patches from Bill to get rid of the __dev* annotations in
the i2c subsystem. I couldn't include it in my previous pull request
due to some dependency with the mfd subsystem. I had this patch in
linux-next for two days before rc1 and nothing popped up."
* 'i2c-embedded/for-next' of git://git.pengutronix.de/git/wsa/linux:
i2c: remove __dev* attributes from subsystem
Zlatko Calusic [Sun, 23 Dec 2012 14:12:54 +0000 (15:12 +0100)]
mm: modify pgdat_balanced() so that it also handles order-0
Teach pgdat_balanced() about order-0 allocations so that we can simplify
code in a few places in vmstat.c.
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rafael J. Wysocki [Sun, 23 Dec 2012 13:39:32 +0000 (14:39 +0100)]
Partly revert "[media] uvcvideo: Set error_idx properly for extended controls API failures"
Commit
f0ed2ce840b3 ("[media] uvcvideo: Set error_idx properly for
extended controls API failures") causes user space to behave incorrectly
on one of my test machines (there is no sound under KDE 4.9.4 using
pulseaudio and there is a knotify4 process occupying one of the CPU
cores 100% of the time). Reverting that commit entirely fixes the
problem for me.
However, commit
f0ed2ce840b3 appears to do more than it follows from its
changelog, because the changelog only says about the changes related to
ctrls->error_idx, while the commit additionally changes error codes
returned by various functions in uvc_ctrl.c and uvc_v4l2.c. It turns
out that the changes of the returned error codes confuse the user spce,
so it is sufficient to revert the part of commit
f0ed2ce840b3 not
mentioned in its changelog to fix the problem.
[ 'ENOENT' is not a valid error return from an ioctl to begin with, and
I don't understand how anybody ever even thought it would be. - Linus ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Cc: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bill Pemberton [Tue, 27 Nov 2012 20:59:38 +0000 (15:59 -0500)]
i2c: remove __dev* attributes from subsystem
CONFIG_HOTPLUG is going away as an option. As result the __dev*
markings will be going away.
Remove use of __devinit, __devexit_p, __devinitdata, __devinitconst,
and __devexit.
Signed-off-by: Bill Pemberton <wfp5p@virginia.edu>
Acked-by: Peter Korsgaard <peter.korsgaard@barco.com> (for ocores and mux-gpio)
Acked-by: Havard Skinnemoen <hskinnemoen@gmail.com> (for i2c-gpio)
Acked-by: Guan Xuetao <gxt@mprc.pku.edu.cn> (for puf3)
Acked-by: Barry Song <baohua.song@csr.com> (for sirf)
Reviewed-by: Jean Delvare <khali@linux-fr.org>
[wsa: Fixed "foo* bar" flaws while we are here]
Signed-off-by: Wolfram Sang <w.sang@pengutronix.de>
Guenter Roeck [Wed, 19 Dec 2012 02:16:08 +0000 (18:16 -0800)]
hwmon: (emc6w201) Fix DIV_ROUND_CLOSEST problem with unsigned divisors
Result of DIV_ROUND_CLOSEST is undefined for negative dividends if the divisor
variable type is unsigned. Fix by declaring divisor as signed variable.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Jean Delvare <khali@linux-fr.org>
Stefan Hasko [Fri, 21 Dec 2012 15:04:59 +0000 (15:04 +0000)]
net: sched: integer overflow fix
Fixed integer overflow in function htb_dequeue
Signed-off-by: Stefan Hasko <hasko.stevo@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Greg KH [Fri, 21 Dec 2012 13:44:29 +0000 (13:44 +0000)]
CONFIG_HOTPLUG removal from networking core
CONFIG_HOTPLUG is always enabled now, so remove the unused code that was
trying to be compiled out when this option was disabled, in the
networking core.
Cc: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Greg KH [Fri, 21 Dec 2012 13:42:15 +0000 (13:42 +0000)]
Drivers: network: more __dev* removal
Remove some __dev* markings that snuck in the 3.8-rc1 merge window in
the drivers/net/* directory.
Cc: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Chris Verges [Fri, 21 Dec 2012 09:58:34 +0000 (01:58 -0800)]
hwmon: (lm73} Detect and report i2c bus errors
If an LM73 device does not exist on an I2C bus, attempts to communicate
with the device result in an error code returned from the i2c read/write
functions. The current lm73 driver casts that return value from a s32
type to a s16 type, then converts it to a temperature in celsius.
Because negative temperatures are valid, it is difficult to distinguish
between an error code printed to the response buffer and a negative
temperature recorded by the sensor.
The solution is to evaluate the return value from the i2c functions
before performing any temperature calculations. If the i2c function did
not succeed, the error code should be passed back through the virtual
file system layer instead of being printed into the response buffer.
Before:
$ cat /sys/class/hwmon/hwmon0/device/temp1_input
-46
After:
$ cat /sys/class/hwmon/hwmon0/device/temp1_input
cat: read error: No such device or address
Signed-off-by: Chris Verges <kg4ysn@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Linus Torvalds [Sat, 22 Dec 2012 01:19:00 +0000 (17:19 -0800)]
Linux 3.8-rc1
Linus Torvalds [Sat, 22 Dec 2012 01:10:29 +0000 (17:10 -0800)]
Merge git://www.linux-watchdog.org/linux-watchdog
Pull watchdog updates from Wim Van Sebroeck:
"This includes some fixes and code improvements (like
clk_prepare_enable and clk_disable_unprepare), conversion from the
omap_wdt and twl4030_wdt drivers to the watchdog framework, addition
of the SB8x0 chipset support and the DA9055 Watchdog driver and some
OF support for the davinci_wdt driver."
* git://www.linux-watchdog.org/linux-watchdog: (22 commits)
watchdog: mei: avoid oops in watchdog unregister code path
watchdog: Orion: Fix possible null-deference in orion_wdt_probe
watchdog: sp5100_tco: Add SB8x0 chipset support
watchdog: davinci_wdt: add OF support
watchdog: da9052: Fix invalid free of devm_ allocated data
watchdog: twl4030_wdt: Change TWL4030_MODULE_PM_RECEIVER to TWL_MODULE_PM_RECEIVER
watchdog: remove depends on CONFIG_EXPERIMENTAL
watchdog: Convert dev_printk(KERN_<LEVEL> to dev_<level>(
watchdog: DA9055 Watchdog driver
watchdog: omap_wdt: eliminate goto
watchdog: omap_wdt: delete redundant platform_set_drvdata() calls
watchdog: omap_wdt: convert to devm_ functions
watchdog: omap_wdt: convert to new watchdog core
watchdog: WatchDog Timer Driver Core: fix comment
watchdog: s3c2410_wdt: use clk_prepare_enable and clk_disable_unprepare
watchdog: imx2_wdt: Select the driver via ARCH_MXC
watchdog: cpu5wdt.c: add missing del_timer call
watchdog: hpwdt.c: Increase version string
watchdog: Convert twl4030_wdt to watchdog core
davinci_wdt: preparation for switch to common clock framework
...
Linus Torvalds [Sat, 22 Dec 2012 01:09:07 +0000 (17:09 -0800)]
Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
"Misc small cifs fixes"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
cifs: eliminate cifsERROR variable
cifs: don't compare uniqueids in cifs_prime_dcache unless server inode numbers are in use
cifs: fix double-free of "string" in cifs_parse_mount_options
Linus Torvalds [Sat, 22 Dec 2012 01:08:06 +0000 (17:08 -0800)]
Merge tag 'dm-3.8-fixes' of git://git./linux/kernel/git/agk/linux-dm
Pull dm update from Alasdair G Kergon:
"Miscellaneous device-mapper fixes, cleanups and performance
improvements.
Of particular note:
- Disable broken WRITE SAME support in all targets except linear and
striped. Use it when kcopyd is zeroing blocks.
- Remove several mempools from targets by moving the data into the
bio's new front_pad area(which dm calls 'per_bio_data').
- Fix a race in thin provisioning if discards are misused.
- Prevent userspace from interfering with the ioctl parameters and
use kmalloc for the data buffer if it's small instead of vmalloc.
- Throttle some annoying error messages when I/O fails."
* tag 'dm-3.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: (36 commits)
dm stripe: add WRITE SAME support
dm: remove map_info
dm snapshot: do not use map_context
dm thin: dont use map_context
dm raid1: dont use map_context
dm flakey: dont use map_context
dm raid1: rename read_record to bio_record
dm: move target request nr to dm_target_io
dm snapshot: use per_bio_data
dm verity: use per_bio_data
dm raid1: use per_bio_data
dm: introduce per_bio_data
dm kcopyd: add WRITE SAME support to dm_kcopyd_zero
dm linear: add WRITE SAME support
dm: add WRITE SAME support
dm: prepare to support WRITE SAME
dm ioctl: use kmalloc if possible
dm ioctl: remove PF_MEMALLOC
dm persistent data: improve improve space map block alloc failure message
dm thin: use DMERR_LIMIT for errors
...
J. Bruce Fields [Sat, 22 Dec 2012 00:48:59 +0000 (19:48 -0500)]
Revert "nfsd: warn on odd reply state in nfsd_vfs_read"
This reverts commit
79f77bf9a4e3dd5ead006b8f17e7c4ff07d8374e.
This is obviously wrong, and I have no idea how I missed seeing the
warning in testing: I must just not have looked at the right logs. The
caller bumps rq_resused/rq_next_page, so it will always be hit on a
large enough read.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 22 Dec 2012 00:40:26 +0000 (16:40 -0800)]
Merge tag 'rdma-for-linus' of git://git./linux/kernel/git/roland/infiniband
Pull more infiniband changes from Roland Dreier:
"Second batch of InfiniBand/RDMA changes for 3.8:
- cxgb4 changes to fix lookup engine hash collisions
- mlx4 changes to make flow steering usable
- fix to IPoIB to avoid pinning dst reference for too long"
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
RDMA/cxgb4: Fix bug for active and passive LE hash collision path
RDMA/cxgb4: Fix LE hash collision bug for passive open connection
RDMA/cxgb4: Fix LE hash collision bug for active open connection
mlx4_core: Allow choosing flow steering mode
mlx4_core: Adjustments to Flow Steering activation logic for SR-IOV
mlx4_core: Fix error flow in the flow steering wrapper
mlx4_core: Add QPN enforcement for flow steering rules set by VFs
cxgb4: Add LE hash collision bug fix path in LLD driver
cxgb4: Add T4 filter support
IPoIB: Call skb_dst_drop() once skb is enqueued for sending
Linus Torvalds [Sat, 22 Dec 2012 00:39:08 +0000 (16:39 -0800)]
Merge tag 'asm-generic' of git://git./linux/kernel/git/arnd/asm-generic
Pull asm-generic cleanup from Arnd Bergmann:
"These are a few cleanups for asm-generic:
- a set of patches from Lars-Peter Clausen to generalize asm/mmu.h
and use it in the architectures that don't need any special
handling.
- A patch from Will Deacon to remove the {read,write}s{b,w,l} as
discussed during the arm64 review
- A patch from James Hogan that helps with the meta architecture
series."
* tag 'asm-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
xtensa: Use generic asm/mmu.h for nommu
h8300: Use generic asm/mmu.h
c6x: Use generic asm/mmu.h
asm-generic/mmu.h: Add support for FDPIC
asm-generic/mmu.h: Remove unused vmlist field from mm_context_t
asm-generic: io: remove {read,write} string functions
asm-generic/io.h: remove asm/cacheflush.h include
Kukjin Kim [Fri, 21 Dec 2012 18:02:13 +0000 (10:02 -0800)]
ARM: dts: fix duplicated build target and alphabetical sort out for exynos
Commit
db5b0ae00712 ("Merge tag 'dt' of git://git.kernel.org/.../arm-soc")
causes a duplicated build target. This patch fixes it and sorts out the
build target alphabetically so that we can recognize something wrong
easily.
Cc: Olof Johansson <olof@lixom.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gao feng [Wed, 19 Dec 2012 23:41:43 +0000 (23:41 +0000)]
bridge: call br_netpoll_disable in br_add_if
When netdev_set_master faild in br_add_if, we should
call br_netpoll_disable to do some cleanup jobs,such
as free the memory of struct netpoll which allocated
in br_netpoll_enable.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 21 Dec 2012 07:32:10 +0000 (07:32 +0000)]
ipv4: arp: fix a lockdep splat in arp_solicit()
Yan Burman reported following lockdep warning :
=============================================
[ INFO: possible recursive locking detected ]
3.7.0+ #24 Not tainted
---------------------------------------------
swapper/1/0 is trying to acquire lock:
(&n->lock){++--..}, at: [<
ffffffff8139f56e>] __neigh_event_send
+0x2e/0x2f0
but task is already holding lock:
(&n->lock){++--..}, at: [<
ffffffff813f63f4>] arp_solicit+0x1d4/0x280
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&n->lock);
lock(&n->lock);
*** DEADLOCK ***
May be due to missing lock nesting notation
4 locks held by swapper/1/0:
#0: (((&n->timer))){+.-...}, at: [<
ffffffff8104b350>]
call_timer_fn+0x0/0x1c0
#1: (&n->lock){++--..}, at: [<
ffffffff813f63f4>] arp_solicit
+0x1d4/0x280
#2: (rcu_read_lock_bh){.+....}, at: [<
ffffffff81395400>]
dev_queue_xmit+0x0/0x5d0
#3: (rcu_read_lock_bh){.+....}, at: [<
ffffffff813cb41e>]
ip_finish_output+0x13e/0x640
stack backtrace:
Pid: 0, comm: swapper/1 Not tainted 3.7.0+ #24
Call Trace:
<IRQ> [<
ffffffff8108c7ac>] validate_chain+0xdcc/0x11f0
[<
ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
[<
ffffffff81120565>] ? kmem_cache_free+0xe5/0x1c0
[<
ffffffff8108d570>] __lock_acquire+0x440/0xc30
[<
ffffffff813c3570>] ? inet_getpeer+0x40/0x600
[<
ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
[<
ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
[<
ffffffff8108ddf5>] lock_acquire+0x95/0x140
[<
ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
[<
ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
[<
ffffffff81448d4b>] _raw_write_lock_bh+0x3b/0x50
[<
ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
[<
ffffffff8139f56e>] __neigh_event_send+0x2e/0x2f0
[<
ffffffff8139f99b>] neigh_resolve_output+0x16b/0x270
[<
ffffffff813cb62d>] ip_finish_output+0x34d/0x640
[<
ffffffff813cb41e>] ? ip_finish_output+0x13e/0x640
[<
ffffffffa046f146>] ? vxlan_xmit+0x556/0xbec [vxlan]
[<
ffffffff813cb9a0>] ip_output+0x80/0xf0
[<
ffffffff813ca368>] ip_local_out+0x28/0x80
[<
ffffffffa046f25a>] vxlan_xmit+0x66a/0xbec [vxlan]
[<
ffffffffa046f146>] ? vxlan_xmit+0x556/0xbec [vxlan]
[<
ffffffff81394a50>] ? skb_gso_segment+0x2b0/0x2b0
[<
ffffffff81449355>] ? _raw_spin_unlock_irqrestore+0x65/0x80
[<
ffffffff81394c57>] ? dev_queue_xmit_nit+0x207/0x270
[<
ffffffff813950c8>] dev_hard_start_xmit+0x298/0x5d0
[<
ffffffff813956f3>] dev_queue_xmit+0x2f3/0x5d0
[<
ffffffff81395400>] ? dev_hard_start_xmit+0x5d0/0x5d0
[<
ffffffff813f5788>] arp_xmit+0x58/0x60
[<
ffffffff813f59db>] arp_send+0x3b/0x40
[<
ffffffff813f6424>] arp_solicit+0x204/0x280
[<
ffffffff813a1a70>] ? neigh_add+0x310/0x310
[<
ffffffff8139f515>] neigh_probe+0x45/0x70
[<
ffffffff813a1c10>] neigh_timer_handler+0x1a0/0x2a0
[<
ffffffff8104b3cf>] call_timer_fn+0x7f/0x1c0
[<
ffffffff8104b350>] ? detach_if_pending+0x120/0x120
[<
ffffffff8104b748>] run_timer_softirq+0x238/0x2b0
[<
ffffffff813a1a70>] ? neigh_add+0x310/0x310
[<
ffffffff81043e51>] __do_softirq+0x101/0x280
[<
ffffffff814518cc>] call_softirq+0x1c/0x30
[<
ffffffff81003b65>] do_softirq+0x85/0xc0
[<
ffffffff81043a7e>] irq_exit+0x9e/0xc0
[<
ffffffff810264f8>] smp_apic_timer_interrupt+0x68/0xa0
[<
ffffffff8145122f>] apic_timer_interrupt+0x6f/0x80
<EOI> [<
ffffffff8100a054>] ? mwait_idle+0xa4/0x1c0
[<
ffffffff8100a04b>] ? mwait_idle+0x9b/0x1c0
[<
ffffffff8100a6a9>] cpu_idle+0x89/0xe0
[<
ffffffff81441127>] start_secondary+0x1b2/0x1b6
Bug is from arp_solicit(), releasing the neigh lock after arp_send()
In case of vxlan, we eventually need to write lock a neigh lock later.
Its a false positive, but we can get rid of it without lockdep
annotations.
We can instead use neigh_ha_snapshot() helper.
Reported-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 21 Dec 2012 07:17:21 +0000 (07:17 +0000)]
tuntap: dont use a private kmem_cache
Commit
96442e42429 (tuntap: choose the txq based on rxq)
added a per tun_struct kmem_cache.
As soon as several tun_struct are used, we get an error
because two caches cannot have same name.
Use the default kmalloc()/kfree_rcu(), as it reduce code
size and doesn't have performance impact here.
Reported-by: Paul Moore <pmoore@redhat.com>
Tested-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 20 Dec 2012 17:25:08 +0000 (17:25 +0000)]
net: devnet_rename_seq should be a seqcount
Using a seqlock for devnet_rename_seq is not a good idea,
as device_rename() can sleep.
As we hold RTNL, we dont need a protection for writers,
and only need a seqcount so that readers can catch a change done
by a writer.
Bug added in commit
c91f6df2db4972d3 (sockopt: Change getsockopt() of
SO_BINDTODEVICE to return an interface name)
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 20 Dec 2012 16:00:27 +0000 (16:00 +0000)]
ip_gre: fix possible use after free
Once skb_realloc_headroom() is called, tiph might point to freed memory.
Cache tiph->ttl value before the reallocation, to avoid unexpected
behavior.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Isaku Yamahata <yamahata@valinux.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Isaku Yamahata [Thu, 20 Dec 2012 15:12:52 +0000 (15:12 +0000)]
ip_gre: make ipgre_tunnel_xmit() not parse network header as IP unconditionally
ipgre_tunnel_xmit() parses network header as IP unconditionally.
But transmitting packets are not always IP packet. For example such packet
can be sent by packet socket with sockaddr_ll.sll_protocol set.
So make the function check if skb->protocol is IP.
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Wed, 19 Dec 2012 21:48:45 +0000 (21:48 +0000)]
solos-pci: double lock in geos_gpio_store()
There is a typo here so we do a double lock instead of an unlock.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mike Snitzer [Fri, 21 Dec 2012 20:23:41 +0000 (20:23 +0000)]
dm stripe: add WRITE SAME support
Rename stripe_map_discard to stripe_map_range and reuse it for WRITE
SAME bio processing.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:41 +0000 (20:23 +0000)]
dm: remove map_info
This patch removes map_info from bio-based device mapper targets.
map_info is still used for request-based targets.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:41 +0000 (20:23 +0000)]
dm snapshot: do not use map_context
Eliminate struct map_info from dm-snap.
map_info->ptr was used in dm-snap to indicate if the bio was tracked.
If map_info->ptr was non-NULL, the bio was linked in tracked_chunk_hash.
This patch removes the use of map_info->ptr. We determine if the bio was
tracked based on hlist_unhashed(&c->node). If hlist_unhashed is true,
the bio is not tracked, if it is false, the bio is tracked.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:40 +0000 (20:23 +0000)]
dm thin: dont use map_context
This patch removes endio_hook_pool from dm-thin and uses per-bio data instead.
This patch removes any use of map_info in preparation for the next patch
that removes map_info from bio-based device mapper.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:40 +0000 (20:23 +0000)]
dm raid1: dont use map_context
Don't use map_info any more in dm-raid1.
map_info was used for writes to hold the region number. For this purpose
we add a new field dm_bio_details to dm_raid1_bio_record.
map_info was used for reads to hold a pointer to dm_raid1_bio_record (if
the pointer was non-NULL, bio details were saved; if the pointer was
NULL, bio details were not saved). We use
dm_raid1_bio_record.details->bi_bdev for this purpose. If bi_bdev is
NULL, details were not saved, if bi_bdev is non-NULL, details were
saved.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:39 +0000 (20:23 +0000)]
dm flakey: dont use map_context
Replace map_info with a per-bio structure "struct per_bio_data" in dm-flakey.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:39 +0000 (20:23 +0000)]
dm raid1: rename read_record to bio_record
Rename struct read_record to bio_record in dm-raid1.
In the following patch, the structure will be used for both read and
write bios, so rename it.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:39 +0000 (20:23 +0000)]
dm: move target request nr to dm_target_io
This patch moves target_request_nr from map_info to dm_target_io and
makes it accessible with dm_bio_get_target_request_nr.
This patch is a preparation for the next patch that removes map_info.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:38 +0000 (20:23 +0000)]
dm snapshot: use per_bio_data
Replace tracked_chunk_pool with per_bio_data in dm-snap.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:38 +0000 (20:23 +0000)]
dm verity: use per_bio_data
Replace io_mempool with per_bio_data in dm-verity.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:38 +0000 (20:23 +0000)]
dm raid1: use per_bio_data
Replace read_record_pool with per_bio_data in dm-raid1.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:38 +0000 (20:23 +0000)]
dm: introduce per_bio_data
Introduce a field per_bio_data_size in struct dm_target.
Targets can set this field in the constructor. If a target sets this
field to a non-zero value, "per_bio_data_size" bytes of auxiliary data
are allocated for each bio submitted to the target. These data can be
used for any purpose by the target and help us improve performance by
removing some per-target mempools.
Per-bio data is accessed with dm_per_bio_data. The
argument data_size must be the same as the value per_bio_data_size in
dm_target.
If the target has a pointer to per_bio_data, it can get a pointer to
the bio with dm_bio_from_per_bio_data() function (data_size must be the
same as the value passed to dm_per_bio_data).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:37 +0000 (20:23 +0000)]
dm kcopyd: add WRITE SAME support to dm_kcopyd_zero
Add WRITE SAME support to dm-io and make it accessible to
dm_kcopyd_zero(). dm_kcopyd_zero() provides an asynchronous interface
whereas the blkdev_issue_write_same() interface is synchronous.
WRITE SAME is a SCSI command that can be leveraged for more efficient
zeroing of a specified logical extent of a device which supports it.
Only a single zeroed logical block is transfered to the target for each
WRITE SAME and the target then writes that same block across the
specified extent.
The dm thin target uses this.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:37 +0000 (20:23 +0000)]
dm linear: add WRITE SAME support
The linear target can already support WRITE SAME requests so signal
this by setting num_write_same_requests to 1.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:37 +0000 (20:23 +0000)]
dm: add WRITE SAME support
WRITE SAME bios have a payload that contain a single page. When
cloning WRITE SAME bios DM has no need to modify the bi_io_vec
attributes (and doing so would be detrimental). DM need only alter the
start and end of the WRITE SAME bio accordingly.
Rather than duplicate __clone_and_map_discard, factor out a common
function that is also used by __clone_and_map_write_same.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:36 +0000 (20:23 +0000)]
dm: prepare to support WRITE SAME
Allow targets to opt in to WRITE SAME support by setting
'num_write_same_requests' in the dm_target structure.
A dm device will only advertise WRITE SAME support if all its
targets and all its underlying devices support it.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:36 +0000 (20:23 +0000)]
dm ioctl: use kmalloc if possible
If the parameter buffer is small enough, try to allocate it with kmalloc()
rather than vmalloc().
vmalloc is noticeably slower than kmalloc because it has to manipulate
page tables.
In my tests, on PA-RISC this patch speeds up activation 13 times.
On Opteron this patch speeds up activation by 5%.
This patch introduces a new function free_params() to free the
parameters and this uses new flags that record whether or not vmalloc()
was used and whether or not the input buffer must be wiped after use.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:36 +0000 (20:23 +0000)]
dm ioctl: remove PF_MEMALLOC
When allocating memory for the userspace ioctl data, set some
appropriate GPF flags directly instead of using PF_MEMALLOC.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 21 Dec 2012 20:23:36 +0000 (20:23 +0000)]
dm persistent data: improve improve space map block alloc failure message
Improve space map error message when unable to allocate a new
metadata block.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:34 +0000 (20:23 +0000)]
dm thin: use DMERR_LIMIT for errors
Throttle all errors logged from the IO path by dm thin.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:34 +0000 (20:23 +0000)]
dm persistent data: use DMERR_LIMIT for errors
Nearly all of persistent-data is in the IO path so throttle error
messages with DMERR_LIMIT to limit the amount logged when
something has gone wrong.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:34 +0000 (20:23 +0000)]
dm block manager: reinstate message when validator fails
Reinstate a useful error message when the block manager buffer validator fails.
This was mistakenly eliminated when the block manager was converted to use
dm-bufio.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Jonathan Brassow [Fri, 21 Dec 2012 20:23:33 +0000 (20:23 +0000)]
dm raid: round region_size to power of two
If the user does not supply a bitmap region_size to the dm raid target,
a reasonable size is computed automatically. If this is not a power of 2,
the md code will report an error later.
This patch catches the problem early and rounds the region_size to the
next power of two.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 21 Dec 2012 20:23:33 +0000 (20:23 +0000)]
dm thin: cleanup dead code
Remove unused @data_block parameter from cell_defer.
Change thin_bio_map to use many returns rather than setting a variable.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 21 Dec 2012 20:23:33 +0000 (20:23 +0000)]
dm thin: rename cell_defer_except to cell_defer_no_holder
Rename cell_defer_except() to cell_defer_no_holder() which describes
its function more clearly.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:33 +0000 (20:23 +0000)]
dm snapshot: optimize track_chunk
track_chunk is always called with interrupts enabled. Consequently, we
do not need to save and restore interrupt state in "flags" variable.
This patch changes spin_lock_irqsave to spin_lock_irq and
spin_unlock_irqrestore to spin_unlock_irq.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:32 +0000 (20:23 +0000)]
dm raid: use DM_ENDIO_INCOMPLETE
Use a defined macro DM_ENDIO_INCOMPLETE instead of a numeric constant.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:32 +0000 (20:23 +0000)]
dm raid1: remove impossible mempool_alloc error test
mempool_alloc can't fail if __GFP_WAIT is specified, so the condition
that tests if read_record is non-NULL is always true.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:32 +0000 (20:23 +0000)]
dm thin: emit ignore_discard in status when discards disabled
If "ignore_discard" is specified when creating the thin pool device then
discard support is disabled for that device. The pool device's status
should reflect this fact rather than stating "no_discard_passdown"
(which implies discards are enabled but passdown is disabled).
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 21 Dec 2012 20:23:32 +0000 (20:23 +0000)]
dm persistent data: fix nested btree deletion
When deleting nested btrees, the code forgets to delete the innermost
btree. The thin-metadata code serendipitously compensates for this by
claiming there is one extra layer in the tree.
This patch corrects both problems.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 21 Dec 2012 20:23:31 +0000 (20:23 +0000)]
dm thin: wake worker when discard is prepared
When discards are prepared it is best to directly wake the worker that
will process them. The worker will be woken anyway, via periodic
commit, but there is no reason to not wake_worker here.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 21 Dec 2012 20:23:31 +0000 (20:23 +0000)]
dm thin: fix race between simultaneous io and discards to same block
There is a race when discard bios and non-discard bios are issued
simultaneously to the same block.
Discard support is expensive for all thin devices precisely because you
have to be careful to quiesce the area you're discarding. DM thin must
handle this conflicting IO pattern (simultaneous non-discard vs discard)
even though a sane application shouldn't be issuing such IO.
The race manifests as follows:
1. A non-discard bio is mapped in thin_bio_map.
This doesn't lock out parallel activity to the same block.
2. A discard bio is issued to the same block as the non-discard bio.
3. The discard bio is locked in a dm_bio_prison_cell in process_discard
to lock out parallel activity against the same block.
4. The non-discard bio's mapping continues and its all_io_entry is
incremented so the bio is accounted for in the thin pool's all_io_ds
which is a dm_deferred_set used to track time locality of non-discard IO.
5. The non-discard bio is finally locked in a dm_bio_prison_cell in
process_bio.
The race can result in deadlock, leaving the block layer hanging waiting
for completion of a discard bio that never completes, e.g.:
INFO: task ruby:15354 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ruby D
ffffffff8160f0e0 0 15354 15314 0x00000000
ffff8802fb08bc58 0000000000000082 ffff8802fb08bfd8 0000000000012900
ffff8802fb08a010 0000000000012900 0000000000012900 0000000000012900
ffff8802fb08bfd8 0000000000012900 ffff8803324b9480 ffff88032c6f14c0
Call Trace:
[<
ffffffff814e5a19>] schedule+0x29/0x70
[<
ffffffff814e3d85>] schedule_timeout+0x195/0x220
[<
ffffffffa06b9bc1>] ? _dm_request+0x111/0x160 [dm_mod]
[<
ffffffff814e589e>] wait_for_common+0x11e/0x190
[<
ffffffff8107a170>] ? try_to_wake_up+0x2b0/0x2b0
[<
ffffffff814e59ed>] wait_for_completion+0x1d/0x20
[<
ffffffff81233289>] blkdev_issue_discard+0x219/0x260
[<
ffffffff81233e79>] blkdev_ioctl+0x6e9/0x7b0
[<
ffffffff8119a65c>] block_ioctl+0x3c/0x40
[<
ffffffff8117539c>] do_vfs_ioctl+0x8c/0x340
[<
ffffffff8119a547>] ? block_llseek+0x67/0xb0
[<
ffffffff811756f1>] sys_ioctl+0xa1/0xb0
[<
ffffffff810561f6>] ? sys_rt_sigprocmask+0x86/0xd0
[<
ffffffff814ef099>] system_call_fastpath+0x16/0x1b
The thinp-test-suite's test_discard_random_sectors reliably hits this
deadlock on fast SSD storage.
The fix for this race is that the all_io_entry for a bio must be
incremented whilst the dm_bio_prison_cell is held for the bio's
associated virtual and physical blocks. That cell locking wasn't
occurring early enough in thin_bio_map. This patch fixes this.
Care is taken to always call the new function inc_all_io_entry() with
the relevant cells locked, but they are generally unlocked before
calling issue() to try to avoid holding the cells locked across
generic_submit_request.
Also, now that thin_bio_map may lock bios in a cell, process_bio() is no
longer the only thread that will do so. Because of this we must be sure
to use cell_defer_except() to release all non-holder entries, that
were added by the other thread, because they must be deferred.
This patch depends on "dm thin: replace dm_cell_release_singleton with
cell_defer_except".
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@vger.kernel.org