review.tizen.org Git - kernel/kernel-generic.git/log

virtio-pci: fix leaks of msix_affinity_masks

vp_dev->msix_vectors should be initialized before allocating
msix_affinity_masks, otherwise vp_free_vectors will not free these
objects.

unreferenced object 0xffff88010f969d88 (size 512):
  comm "systemd-udevd", pid 158, jiffies 4294673645 (age 80.545s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff816e455e>] kmemleak_alloc+0x5e/0xc0
    [<ffffffff811aa7f1>] kmem_cache_alloc_node_trace+0x141/0x2c0
    [<ffffffff8133ba23>] alloc_cpumask_var_node+0x23/0x80
    [<ffffffff8133ba8e>] alloc_cpumask_var+0xe/0x10
    [<ffffffff813fdb3d>] vp_try_to_find_vqs+0x25d/0x810
    [<ffffffff813fe171>] vp_find_vqs+0x81/0xb0
    [<ffffffffa00d2a05>] init_vqs+0x85/0x120 [virtio_balloon]
    [<ffffffffa00d2c29>] virtballoon_probe+0xf9/0x1a0 [virtio_balloon]
    [<ffffffff813fb61e>] virtio_dev_probe+0xde/0x140
    [<ffffffff814452b8>] driver_probe_device+0x98/0x3a0
    [<ffffffff8144566b>] __driver_attach+0xab/0xb0
    [<ffffffff814432f4>] bus_for_each_dev+0x94/0xb0
    [<ffffffff81444f4e>] driver_attach+0x1e/0x20
    [<ffffffff81444910>] bus_add_driver+0x200/0x280
    [<ffffffff81445c14>] driver_register+0x74/0x160
    [<ffffffff813fb7d0>] register_virtio_driver+0x20/0x40

v2: change msix_vectors uncoditionaly in vp_free_vectors

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

Fix comment typo "CONFIG_PAE"

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

virtio: remove virtqueue_add_buf().

All users changed to virtqueue_add_sg() or virtqueue_add_outbuf/inbuf.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

lguest: rename i386_head.S

Since commit 9a163ed8e0 (i386: move kernel) kernel/i386_head.S
was renamed to kernel/head_32.S. We do the same for lguest/i386_head.S.

Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

virtio_blk: Add missing 'static' qualifiers

Add missing 'static' qualifiers

Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

virtio: console: Add emergency writeonly register to config space

This patch adds an emerg_wr register (writeonly) in config space
of virtio console device which can be used for debugging.

Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
Signed-off-by: Anup Patel <anup.patel@linaro.org>
Acked-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

virtio_pci: better macro exported in uapi

Macro VIRTIO_PCI_CONFIG assumes that userspace actually has a structure
with a field named msix_enabled. Add VIRTIO_PCI_CONFIG_OFF that gets
the msix_enabled by value instead, to make it useful for userspace. We
still keep VIRTIO_PCI_CONFIG around for now, in case some userspace uses
it.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

Merge tag 'pm+acpi-3.10-rc2' of git://git./linux/kernel/git/rafael/linux-pm

Pull power management and ACPI fixes from Rafael Wysocki:

- intel_pstate driver fixes and cleanups from Dirk Brandewie and Wei
   Yongjun.

- cpufreq fixes related to ARM big.LITTLE support and the cpufreq-cpu0
   driver from Viresh Kumar.

- Assorted cpufreq fixes from Srivatsa S Bhat, Borislav Petkov, Wolfram
   Sang, Alexander Shiyan, and Nishanth Menon.

- Assorted ACPI fixes from Catalin Marinas, Lan Tianyu, Alex Hung,
   Jan-Simon Möller, and Rafael J Wysocki.

- Fix for a kfree() under spinlock in the PM core from Shuah Khan.

- PM documentation updates from Borislav Petkov and Zhang Rui.

* tag 'pm+acpi-3.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (30 commits)
  cpufreq: Preserve sysfs files across suspend/resume
  ACPI / scan: Fix memory leak on acpi_scan_init_hotplug() error path
  PM / hibernate: Correct documentation
  PM / Documentation: remove inaccurate suspend/hibernate transition lantency statement
  PM: Documentation update for freeze state
  cpufreq / intel_pstate: use vzalloc() instead of vmalloc()/memset(0)
  cpufreq, ondemand: Remove leftover debug line
  PM: Avoid calling kfree() under spinlock in dev_pm_put_subsys_data()
  cpufreq / kirkwood: don't check resource with devm_ioremap_resource
  cpufreq / intel_pstate: remove #ifdef MODULE compile fence
  cpufreq / intel_pstate: Remove idle mode PID
  cpufreq / intel_pstate: fix ffmpeg regression
  cpufreq / intel_pstate: use lowest requested max performance
  cpufreq / intel_pstate: remove idle time and duration from sample and calculations
  cpufreq: Fix incorrect dependecies for ARM SA11xx drivers
  cpufreq: ARM big LITTLE: Fix Kconfig entries
  cpufreq: cpufreq-cpu0: Free parent node for error cases
  cpufreq: cpufreq-cpu0: defer probe when regulator is not ready
  cpufreq: Issue CPUFREQ_GOV_POLICY_EXIT notifier before dropping policy refcount
  cpufreq: governors: Fix CPUFREQ_GOV_POLICY_{INIT|EXIT} notifiers
  ...

Merge tag 'ntb-bugfixes-3.10' of git://github.com/jonmason/ntb

Pull NTB update from Jon Mason:
"NTB bug fixes to address Smatch/Coverity errors, link toggling bugs,
  and a few corner cases in the driver."

This pull request came in during the merge window, but without any
signage etc.  So I'm taking it late, because it wasn't _originally_
late.

* tag 'ntb-bugfixes-3.10' of git://github.com/jonmason/ntb:
  NTB: Multiple NTB client fix
  ntb_netdev: remove from list on exit
  NTB: memcpy lockup workaround
  NTB: Correctly handle receive buffers of the minimal size
  NTB: reset tx_index on link toggle
  NTB: Link toggle memory leak
  NTB: Handle 64bit BAR sizes
  NTB: fix pointer math issues
  ntb: off by one sanity checks
  NTB: variable dereferenced before check

Merge branch 'ipmi' (minor ipmi fixes from Corey)

Merge ipmi fixes from Corey Minyard:
"Some minor fixes I had queued up.  The last one came in recently
  (patch 4) and it and patch 2 are candidates for stable-kernel."

* emailed patches from Corey Minyard <cminyard@mvista.com>:
  ipmi: ipmi_devintf: compat_ioctl method fails to take ipmi_mutex
  ipmi: Improve error messages on failed irq enable
  drivers/char/ipmi: memcpy, need additional 2 bytes to avoid memory overflow
  drivers: char: ipmi: Replaced kmalloc and strcpy with kstrdup

ipmi: ipmi_devintf: compat_ioctl method fails to take ipmi_mutex

When a 32 bit version of ipmitool is used on a 64 bit kernel, the
ipmi_devintf code fails to correctly acquire ipmi_mutex. This results in
incomplete data being retrieved in some cases, or other possible failures.
Add a wrapper around compat_ipmi_ioctl() to take ipmi_mutex to fix this.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ipmi: Improve error messages on failed irq enable

When the interrupt enable message returns an error, the messages are
not entirely accurate nor helpful. So improve them.

Signed-off-by: Corey Minyard <cminyard@mvista.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

drivers/char/ipmi: memcpy, need additional 2 bytes to avoid memory overflow

When calling memcpy, read_data and write_data need additional 2 bytes.

  write_data:
    for checking:  "if (size > IPMI_MAX_MSG_LENGTH)"
    for operating: "memcpy(bt->write_data + 3, data + 1, size - 1)"

  read_data:
    for checking:  "if (msg_len < 3 || msg_len > IPMI_MAX_MSG_LENGTH)"
    for operating: "memcpy(data + 2, bt->read_data + 4, msg_len - 2)"

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

drivers: char: ipmi: Replaced kmalloc and strcpy with kstrdup

Replaced calls to kmalloc followed by strcpy with a sincle call to
kstrdup. Patch found using coccinelle.

Signed-off-by: Alexandru Gheorghiu <gheorghiuandru@gmail.com>
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-3.10-fixes' of git://git./linux/kernel/git/tj/wq

Pull workqueue fixes from Tejun Heo:
"Three more workqueue regression fixes.

   - Fix unbalanced unlock in trylock failure path of manage_workers().
     This shouldn't happen often in the wild but is possible.

   - While making schedule_work() and friends inline, they become
     unavailable to !GPL modules.  Allow !GPL modules to access basic
     stuff - system_wq and queue_*work_on() - so that schedule_work()
     and friends can be used.

   - During boot, the unbound NUMA support code allocates a cpumask for
     each possible node using alloc_cpumask_var_node(), which ends up
     trying to allocate node-specific memory even for offline nodes
     triggering BUG in the memory alloc code.  Use NUMA_NO_NODE for
     offline nodes."

* 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: don't perform NUMA-aware allocations on offline nodes in wq_numa_init()
  workqueue: Make schedule_work() available again to non GPL modules
  workqueue: correct handling of the pool spin_lock

Merge branch 'rcu/urgent' of git://git./linux/kernel/git/paulmck/linux-rcu

Pull RCU fixes from Paul McKenney:
"A couple of fixes for RCU regressions:

   - A boneheaded boolean-logic bug that resulted in excessive delays on
     boot, hibernation and suspend that was reported by Borislav Petkov,
     Bjørn Mork, and Joerg Roedel.  The fix inserts a single "!".

   - A fix for a boot-time splat due to allocating from bootmem too late
     in boot, fix courtesy of Sasha Levin with additional help from
     Yinghai Lu."

* 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  rcu: Don't allocate bootmem from rcu_init()
  rcu: Fix comparison sense in rcu_needs_cpu()

usermodehelper: check subprocess_info->path != NULL

argv_split(empty_or_all_spaces) happily succeeds, it simply returns
argc == 0 and argv[0] == NULL. Change call_usermodehelper_exec() to
check sub_info->path != NULL to avoid the crash.

This is the minimal fix, todo:

- perhaps we should change argv_split() to return NULL or change the
callers.

- kill or justify ->path[0] check

- narrow the scope of helper_lock()

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-By: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'queue' of git://git./linux/kernel/git/nab/target-pending

Pull target fixes from Nicholas Bellinger:
"A handful of fixes + minor changes this time around, along with one
  important >= v3.9 regression fix for IBLOCK backends.  The highlights
  include:

   - Use FD_MAX_SECTORS in FILEIO for block_device as
     well as files (agrover)

   - Fix processing of out-of-order CmdSNs with
     iSBD driver (shlomo)

   - Close long-standing target_put_sess_cmd() vs.
     core_tmr_abort_task() race with the addition of
     kref_put_spinlock_irqsave() (joern + greg-kh)

   - Fix IBLOCK WCE=1 + DPOFUA=1 backend WRITE
     regression in >= v3.9 (nab + bootc)

  Note these four patches are CC'ed to stable.

  Also, there is still some work left to be done on the active I/O
  shutdown path in target_wait_for_sess_cmds() used by tcm_qla2xxx +
  ib_isert fabrics that is still being discussed on the list, and will
  hopefully be resolved soon."

* 'queue' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
  target: close target_put_sess_cmd() vs. core_tmr_abort_task() race
  target: removed unused transport_state flag
  target/iblock: Fix WCE=1 + DPOFUA=1 backend WRITE regression
  MAINTAINERS: Update target git tree URL
  iscsi-target: Fix typos in RDMAEXTENSIONS macro usage
  target/rd: Add ramdisk bit for NULLIO operation
  iscsi-target: Fix processing of OOO commands
  iscsi-target: Make buf param of iscsit_do_crypto_hash_buf() const void *
  iscsi-target: Fix NULL pointer dereference in iscsit_send_reject
  target: Have dev/enable show if TCM device is configured
  target: Use FD_MAX_SECTORS/FD_BLOCKSIZE for blockdevs using fileio
  target: Remove unused struct members in se_dev_entry

Merge branch 'acpi-fixes'

* acpi-fixes:
ACPI / scan: Fix memory leak on acpi_scan_init_hotplug() error path

Merge branch 'pm-cpufreq'

* pm-cpufreq:
cpufreq: Preserve sysfs files across suspend/resume

workqueue: don't perform NUMA-aware allocations on offline nodes in wq_numa_init()

wq_numa_init() builds per-node cpumasks which are later used to make
unbound workqueues NUMA-aware.  The cpumasks are allocated using
alloc_cpumask_var_node() for all possible nodes.  Unfortunately, on
machines with off-line nodes, this leads to NUMA-aware allocations on
existing bug offline nodes, which in turn triggers BUG in the memory
allocation code.

Fix it by using NUMA_NO_NODE for cpumask allocations for offline
nodes.

  kernel BUG at include/linux/gfp.h:323!
  invalid opcode: 0000 [#1] SMP
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0+ #1
  Hardware name: ProLiant BL465c G7, BIOS A19 12/10/2011
  task: ffff880234608000 ti: ffff880234602000 task.ti: ffff880234602000
  RIP: 0010:[<ffffffff8117495d>]  [<ffffffff8117495d>] new_slab+0x2ad/0x340
  RSP: 0000:ffff880234603bf8  EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff880237404b40 RCX: 00000000000000d0
  RDX: 0000000000000001 RSI: 0000000000000003 RDI: 00000000002052d0
  RBP: ffff880234603c28 R08: 0000000000000000 R09: 0000000000000001
  R10: 0000000000000001 R11: ffffffff812e3aa8 R12: 0000000000000001
  R13: ffff8802378161c0 R14: 0000000000030027 R15: 00000000000040d0
  FS:  0000000000000000(0000) GS:ffff880237800000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: ffff88043fdff000 CR3: 00000000018d5000 CR4: 00000000000007f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Stack:
   ffff880234603c28 0000000000000001 00000000000000d0 ffff8802378161c0
   ffff880237404b40 ffff880237404b40 ffff880234603d28 ffffffff815edba1
   ffff880237816140 0000000000000000 ffff88023740e1c0
  Call Trace:
   [<ffffffff815edba1>] __slab_alloc+0x330/0x4f2
   [<ffffffff81174b25>] kmem_cache_alloc_node_trace+0xa5/0x200
   [<ffffffff812e3aa8>] alloc_cpumask_var_node+0x28/0x90
   [<ffffffff81a0bdb3>] wq_numa_init+0x10d/0x1be
   [<ffffffff81a0bec8>] init_workqueues+0x64/0x341
   [<ffffffff810002ea>] do_one_initcall+0xea/0x1a0
   [<ffffffff819f1f31>] kernel_init_freeable+0xb7/0x1ec
   [<ffffffff815d50de>] kernel_init+0xe/0xf0
   [<ffffffff815ff89c>] ret_from_fork+0x7c/0xb0
  Code: 45  84 ac 00 00 00 f0 41 80 4d 00 40 e9 f6 fe ff ff 66 0f 1f 84 00 00 00 00 00 e8 eb 4b ff ff 49 89 c5 e9 05 fe ff ff <0f> 0b 4c 8b 73 38 44 89 ff 81 cf 00 00 20 00 4c 89 f6 48 c1 ee

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-Tested-by: Lingzhu Xiang <lxiang@redhat.com>

Merge tag 'trace-fixes-v3.10-rc1' of git://git./linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
"This includes a fix to a memory leak when adding filters to traces.

  Also, Masami Hiramatsu fixed up some minor bugs that were discovered
  by sparse."

* tag 'trace-fixes-v3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/kprobes: Make print_*probe_event static
  tracing/kprobes: Fix a sparse warning for incorrect type in assignment
  tracing/kprobes: Use rcu_dereference_raw for tp->files
  tracing: Fix leaks of filter preds

Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:

- Fix for a CPU hot-add deadlock in microcode update code

- Fix for idle consolidation fallout

- Documentation update for initial kernel direct mapping

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Add missing comments for initial kernel direct mapping
  x86/microcode: Add local mutex to fix physical CPU hot-add deadlock
  x86: Fix idle consolidation fallout

Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull perf fixes from Thomas Gleixner:

- Fix for a task exit cleanup race caused by a missing a preempt
   disable

- Cleanup of the event notification functions with a massive reduction
   of duplicated code

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Factor out auxiliary events notification
  perf: Fix EXIT event notification

Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull timer fixes from Thomas Gleixner:

- Cure for not using zalloc in the first place, which leads to random
   crashes with CPUMASK_OFF_STACK.

- Revert a user space visible change which broke udev

- Add a missing cpu_online early return introduced by the new full
   dyntick conversions

- Plug a long standing race in the timer wheel cpu hotplug code.
   Sigh...

- Cleanup NOHZ per cpu data on cpu down to prevent stale data on cpu
   up.

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time: Revert ALWAYS_USE_PERSISTENT_CLOCK compile time optimizaitons
  timer: Don't reinitialize the cpu base lock during CPU_UP_PREPARE
  tick: Don't invoke tick_nohz_stop_sched_tick() if the cpu is offline
  tick: Cleanup NOHZ per cpu data on cpu down
  tick: Use zalloc_cpumask_var for allocating offstack cpumasks

Merge branch 'core-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull core fixes from Thomas Gleixner:

- Two fixlets for the fallout of the generic idle task conversion

- Documentation update

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rcu/idle: Wrap cpu-idle poll mode within rcu_idle_enter/exit
  idle: Fix hlt/nohlt command-line handling in new generic idle
  kthread: Document ways of reducing OS jitter due to per-CPU kthreads

Merge branch 'fixes' of git://git.linaro.org/people/rmk/linux-arm

Pull ARM fixes from Russell King:
"A small number of fixes for stuff from the last merge window, and in
  one case (IRQ time accounting) the previous merge window."

* 'fixes' of git://git.linaro.org/people/rmk/linux-arm:
  ARM: 7720/1: ARM v6/v7 cmpxchg64 shouldn't clear upper 32 bits of the old/new value
  ARM: 7715/1: MCPM: adapt to GIC changes after upstream merge
  ARM: 7714/1: mmc: mmci: Ensure return value of regulator_enable() is checked
  ARM: 7712/1: Remove trailing whitespace in arch/arm/Makefile
  ARM: 7711/1: dove: fix Dove cpu type from V7 to PJ4
  ARM: finally enable IRQ time accounting config

Merge branch 'for-linus' of git://git./linux/kernel/git/sage/ceph-client

Pull Ceph fixes from Sage Weil:
"Yes, this is a much larger pull than I would like after -rc1.  There
  are a few things included:

   - a few fixes for leaks and incorrect assertions
   - a few patches fixing behavior when mapped images are resized
   - handling for cloned/layered images that are flattened out from
     underneath the client

  The last bit was non-trivial, and there is some code movement and
  associated cleanup mixed in.  This was ready and was meant to go in
  last week but I missed the boat on Friday.  My only excuse is that I
  was waiting for an all clear from the testing and there were many
  other shiny things to distract me.

  Strictly speaking, handling the flatten case isn't a regression and
  could wait, so if you like we can try to pull the series apart, but
  Alex and I would much prefer to have it all in as it is a case real
  users will hit with 3.10."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (33 commits)
  rbd: re-submit flattened write request (part 2)
  rbd: re-submit write request for flattened clone
  rbd: re-submit read request for flattened clone
  rbd: detect when clone image is flattened
  rbd: reference count parent requests
  rbd: define parent image request routines
  rbd: define rbd_dev_unparent()
  rbd: don't release write request until necessary
  rbd: get parent info on refresh
  rbd: ignore zero-overlap parent
  rbd: support reading parent page data for writes
  rbd: fix parent request size assumption
  libceph: init sent and completed when starting
  rbd: kill rbd_img_request_get()
  rbd: only set up watch for mapped images
  rbd: set mapping read-only flag in rbd_add()
  rbd: support reading parent page data
  rbd: fix an incorrect assertion condition
  rbd: define rbd_dev_v2_header_info()
  rbd: get rid of trivial v1 header wrappers
  ...

cpufreq: Preserve sysfs files across suspend/resume

The file permissions of cpufreq per-cpu sysfs files are not preserved
across suspend/resume because we internally go through the CPU
Hotplug path which reinitializes the file permissions on CPU online.

But the user is not supposed to know that we are using CPU hotplug
internally within suspend/resume (IOW, the kernel should not silently
wreck the user-set file permissions across a suspend cycle).
Therefore, we need to preserve the file permissions as they are
across suspend/resume.

The simplest way to achieve that is to just not touch the sysfs files
at all - ie., just ignore the CPU hotplug notifications in the
suspend/resume path (_FROZEN) in the cpufreq hotplug callback.

Reported-by: Robert Jarzmik <robert.jarzmik@intel.com>
Reported-by: Durgadoss R <durgadoss.r@intel.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

ACPI / scan: Fix memory leak on acpi_scan_init_hotplug() error path

Following commit 6b772e8f9 (ACPI: Update PNPID match handling for
notify), the acpi_scan_init_hotplug() calls acpi_set_pnp_ids() which
allocates acpi_hardware_id and copies a few strings (kstrdup). If the
devices does not have hardware_id set, the function exits without
freeing the previously allocated ids (and kmemleak complains). This
patch calls simply changes 'return' on error to a 'goto out' which
calls acpi_free_pnp_ids().

Reported-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

NTB: Multiple NTB client fix

Fix issue with adding multiple ntb client devices to the ntb virtual
bus. Previously, multiple devices would be added with the same name,
resulting in crashes. To get around this issue, add a unique number to
the device when it is added.

Signed-off-by: Jon Mason <jon.mason@intel.com>

ntb_netdev: remove from list on exit

The ntb_netdev device is not removed from the global list of devices
upon device removal. If the device is re-added, the removal code would
find the first instance and try to remove an already removed device.

Signed-off-by: Jon Mason <jon.mason@intel.com>

NTB: memcpy lockup workaround

The system will appear to lockup for long periods of time due to the NTB
driver spending too much time in memcpy. Avoid this by reducing the
number of packets that can be serviced on a given interrupt.

Signed-off-by: Jon Mason <jon.mason@intel.com>

NTB: Correctly handle receive buffers of the minimal size

The ring logic of the NTB receive buffer/transmit memory window requires
there to be at least 2 payload sized allotments. For the minimal size
case, split the buffer into two and set the transport_mtu to the
appropriate size.

Signed-off-by: Jon Mason <jon.mason@intel.com>

NTB: reset tx_index on link toggle

If the NTB link toggles, the driver could stop receiving due to the
tx_index not being set to 0 on the transmitting size on a link-up event.
This is due to the driver expecting the incoming data to start at the
beginning of the receive buffer and not at a random place.

Signed-off-by: Jon Mason <jon.mason@intel.com>

NTB: Link toggle memory leak

Each link-up will allocate a new NTB receive buffer when the NTB
properties are negotiated with the remote system. These allocations did
not check for existing buffers and thus did not free them. Now, the
driver will check for an existing buffer and free it if not of the
correct size, before trying to alloc a new one.

Signed-off-by: Jon Mason <jon.mason@intel.com>

NTB: Handle 64bit BAR sizes

64bit BAR sizes are permissible with an NTB device. To support them
various modifications and clean-ups were required, most significantly
using 2 32bit scratch pad registers for each BAR.

Also, modify the driver to allow more than 2 Memory Windows.

Signed-off-by: Jon Mason <jon.mason@intel.com>

NTB: fix pointer math issues

->remote_rx_info and ->rx_info are struct ntb_rx_info pointers. If we
add sizeof(struct ntb_rx_info) then it goes too far.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jon Mason <jon.mason@intel.com>

ntb: off by one sanity checks

These tests are off by one. If "mw" is equal to NTB_NUM_MW then we
would go beyond the end of the ndev->mw[] array.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jon Mason <jon.mason@intel.com>

NTB: variable dereferenced before check

Correct instances of variable dereferencing before checking its value on
the functions exported to the client drivers. Also, add sanity checks
for all exported functions.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jon Mason <jon.mason@intel.com>

tracing/kprobes: Make print_*probe_event static

According to sparse warning, print_*probe_event static because
those functions are not directly called from outside.

Link: http://lkml.kernel.org/r/20130513115839.6545.83067.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

tracing/kprobes: Fix a sparse warning for incorrect type in assignment

Fix a sparse warning about the rcu operated pointer is
defined without __rcu address space.

Link: http://lkml.kernel.org/r/20130513115837.6545.23322.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

tracing/kprobes: Use rcu_dereference_raw for tp->files

Use rcu_dereference_raw() for accessing tp->files. Because the
write-side uses rcu_assign_pointer() for memory barrier,
the read-side also has to use rcu_dereference_raw() with
read memory barrier.

Link: http://lkml.kernel.org/r/20130513115834.6545.17022.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

tracing: Fix leaks of filter preds

Special preds are created when folding a series of preds that
can be done in serial. These are allocated in an ops field of
the pred structure. But they were never freed, causing memory
leaks.

This was discovered using the kmemleak checker:

unreferenced object 0xffff8800797fd5e0 (size 32):
  comm "swapper/0", pid 1, jiffies 4294690605 (age 104.608s)
  hex dump (first 32 bytes):
    00 00 01 00 03 00 05 00 07 00 09 00 0b 00 0d 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff814b52af>] kmemleak_alloc+0x73/0x98
    [<ffffffff8111ff84>] kmemleak_alloc_recursive.constprop.42+0x16/0x18
    [<ffffffff81120e68>] __kmalloc+0xd7/0x125
    [<ffffffff810d47eb>] kcalloc.constprop.24+0x2d/0x2f
    [<ffffffff810d4896>] fold_pred_tree_cb+0xa9/0xf4
    [<ffffffff810d3781>] walk_pred_tree+0x47/0xcc
    [<ffffffff810d5030>] replace_preds.isra.20+0x6f8/0x72f
    [<ffffffff810d50b5>] create_filter+0x4e/0x8b
    [<ffffffff81b1c30d>] ftrace_test_event_filter+0x5a/0x155
    [<ffffffff8100028d>] do_one_initcall+0xa0/0x137
    [<ffffffff81afbedf>] kernel_init_freeable+0x14d/0x1dc
    [<ffffffff814b24b7>] kernel_init+0xe/0xdb
    [<ffffffff814d539c>] ret_from_fork+0x7c/0xb0
    [<ffffffffffffffff>] 0xffffffffffffffff

Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: stable@vger.kernel.org # 2.6.39+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

rcu: Don't allocate bootmem from rcu_init()

When rcu_init() is called we already have slab working, allocating
bootmem at that point results in warnings and an allocation from
slab. This commit therefore changes alloc_bootmem_cpumask_var() to
alloc_cpumask_var() in rcu_bootup_announce_oddness(), which is called
from rcu_init().

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Tested-by: Robin Holt <holt@sgi.com>
[paulmck: convert to zalloc_cpumask_var(), as suggested by Yinghai Lu.]

target: close target_put_sess_cmd() vs. core_tmr_abort_task() race

It is possible for one thread to to take se_sess->sess_cmd_lock in
core_tmr_abort_task() before taking a reference count on
se_cmd->cmd_kref, while another thread in target_put_sess_cmd() drops
se_cmd->cmd_kref before taking se_sess->sess_cmd_lock.

This introduces kref_put_spinlock_irqsave() and uses it in
target_put_sess_cmd() to close the race window.

Signed-off-by: Joern Engel <joern@logfs.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: removed unused transport_state flag

Signed-off-by: Joern Engel <joern@logfs.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target/iblock: Fix WCE=1 + DPOFUA=1 backend WRITE regression

This patch fixes a regression bug introduced in v3.9-rc1 where if the
underlying struct block_device for a IBLOCK backend is configured with
WCE=1 + DPOFUA=1 settings, the rw = WRITE assignment no longer occurs
in iblock_execute_rw(), and rw = 0 is passed to iblock_submit_bios()
in effect causing a READ bio operation to occur.

The offending commit is:

commit d0c8b259f8970d39354c1966853363345d401330
Author: Nicholas Bellinger <nab@linux-iscsi.org>
Date: Tue Jan 29 22:10:06 2013 -0800

target/iblock: Use backend REQ_FLUSH hint for WriteCacheEnabled status

Note the WCE=1 + DPOFUA=0, WCE=0 + DPOFUA=1, and WCE=0 + DPOFUA=0 cases
are not affected by this regression bug.

Reported-by: Chris Boot <bootc@bootc.net>
Tested-by: Chris Boot <bootc@bootc.net>
Reported-by: Hannes Reinecke <hare@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

time: Revert ALWAYS_USE_PERSISTENT_CLOCK compile time optimizaitons

Kay Sievers noted that the ALWAYS_USE_PERSISTENT_CLOCK config,
which enables some minor compile time optimization to avoid
uncessary code in mostly the suspend/resume path could cause
problems for userland.

In particular, the dependency for RTC_HCTOSYS on
!ALWAYS_USE_PERSISTENT_CLOCK, which avoids setting the time
twice and simplifies suspend/resume, has the side effect
of causing the /sys/class/rtc/rtcN/hctosys flag to always be
zero, and this flag is commonly used by udev to setup the
/dev/rtc symlink to /dev/rtcN, which can cause pain for
older applications.

While the udev rules could use some work to be less fragile,
breaking userland should strongly be avoided. Additionally
the compile time optimizations are fairly minor, and the code
being optimized is likely to be reworked in the future, so
lets revert this change.

Reported-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: stable <stable@vger.kernel.org> #3.9
Cc: Feng Tang <feng.tang@intel.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Link: http://lkml.kernel.org/r/1366828376-18124-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

workqueue: Make schedule_work() available again to non GPL modules

Commit 8425e3d5bdbe ("workqueue: inline trivial wrappers") changed
schedule_work() and schedule_delayed_work() to inline wrappers,
but these rely on some symbols that are EXPORT_SYMBOL_GPL, while
the original functions were EXPORT_SYMBOL. This has the effect of
changing the licensing requirement for these functions and making
them unavailable to non GPL modules.

Make them available again by removing the restriction on the
required symbols.

Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: correct handling of the pool spin_lock

When we fail to mutex_trylock(), we release the pool spin_lock and do
mutex_lock(). After that, we should regrab the pool spin_lock, but,
regrabbing is missed in current code. So correct it.

Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

rcu: Fix comparison sense in rcu_needs_cpu()

Commit c0f4dfd4f (rcu: Make RCU_FAST_NO_HZ take advantage of numbered
callbacks) introduced a bug that can result in excessively long grace
periods. This bug reverse the senes of the "if" statement checking
for lazy callbacks, so that RCU takes a lazy approach when there are
in fact non-lazy callbacks. This can result in excessive boot, suspend,
and resume times.

This commit therefore fixes the sense of this "if" statement.

Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Bjørn Mork <bjorn@mork.no>
Reported-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Bjørn Mork <bjorn@mork.no>
Tested-by: Joerg Roedel <joro@8bytes.org>

Merge tag 'ext4_for_linus_stable' of git://git./linux/kernel/git/tytso/ext4

Pull ext4 update from Ted Ts'o:
"Fixed regressions (two stability regressions and a performance
  regression) introduced during the 3.10-rc1 merge window.

  Also included is a bug fix relating to allocating blocks after
  resizing an ext3 file system when using the ext4 file system driver"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd,jbd2: fix oops in jbd2_journal_put_journal_head()
  ext4: revert "ext4: use io_end for multiple bios"
  ext4: limit group search loop for non-extent files
  ext4: fix fio regression

Merge branch 'for-3.10-fixes' of git://git./linux/kernel/git/tj/wq

Pull workqueue fix from Tejun Heo:
"A fix for a workqueue_congested() regression that broke fscache"

* 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: workqueue_congested() shouldn't translate WORK_CPU_UNBOUND into node number

timer: Don't reinitialize the cpu base lock during CPU_UP_PREPARE

An inactive timer's base can refer to a offline cpu's base.

In the current code, cpu_base's lock is blindly reinitialized each
time a CPU is brought up. If a CPU is brought online during the period
that another thread is trying to modify an inactive timer on that CPU
with holding its timer base lock, then the lock will be reinitialized
under its feet. This leads to following SPIN_BUG().

<0> BUG: spinlock already unlocked on CPU#3, kworker/u:3/1466
<0> lock: 0xe3ebe000, .magic: dead4ead, .owner: kworker/u:3/1466, .owner_cpu: 1
<4> [<c0013dc4>] (unwind_backtrace+0x0/0x11c) from [<c026e794>] (do_raw_spin_unlock+0x40/0xcc)
<4> [<c026e794>] (do_raw_spin_unlock+0x40/0xcc) from [<c076c160>] (_raw_spin_unlock+0x8/0x30)
<4> [<c076c160>] (_raw_spin_unlock+0x8/0x30) from [<c009b858>] (mod_timer+0x294/0x310)
<4> [<c009b858>] (mod_timer+0x294/0x310) from [<c00a5e04>] (queue_delayed_work_on+0x104/0x120)
<4> [<c00a5e04>] (queue_delayed_work_on+0x104/0x120) from [<c04eae00>] (sdhci_msm_bus_voting+0x88/0x9c)
<4> [<c04eae00>] (sdhci_msm_bus_voting+0x88/0x9c) from [<c04d8780>] (sdhci_disable+0x40/0x48)
<4> [<c04d8780>] (sdhci_disable+0x40/0x48) from [<c04bf300>] (mmc_release_host+0x4c/0xb0)
<4> [<c04bf300>] (mmc_release_host+0x4c/0xb0) from [<c04c7aac>] (mmc_sd_detect+0x90/0xfc)
<4> [<c04c7aac>] (mmc_sd_detect+0x90/0xfc) from [<c04c2504>] (mmc_rescan+0x7c/0x2c4)
<4> [<c04c2504>] (mmc_rescan+0x7c/0x2c4) from [<c00a6a7c>] (process_one_work+0x27c/0x484)
<4> [<c00a6a7c>] (process_one_work+0x27c/0x484) from [<c00a6e94>] (worker_thread+0x210/0x3b0)
<4> [<c00a6e94>] (worker_thread+0x210/0x3b0) from [<c00aad9c>] (kthread+0x80/0x8c)
<4> [<c00aad9c>] (kthread+0x80/0x8c) from [<c000ea80>] (kernel_thread_exit+0x0/0x8)

As an example, this particular crash occurred when CPU #3 is executing
mod_timer() on an inactive timer whose base is refered to offlined CPU
#2.  The code locked the timer_base corresponding to CPU #2. Before it
could proceed, CPU #2 came online and reinitialized the spinlock
corresponding to its base. Thus now CPU #3 held a lock which was
reinitialized. When CPU #3 finally ended up unlocking the old cpu_base
corresponding to CPU #2, we hit the above SPIN_BUG().

CPU #0 CPU #3        CPU #2
------ -------        -------
..... ......       <Offline>
mod_timer()
lock_timer_base
   spin_lock_irqsave(&base->lock)

cpu_up(2) .....         ......
init_timers_cpu()
.... .....      spin_lock_init(&base->lock)
.....    spin_unlock_irqrestore(&base->lock)  ......
   <spin_bug>

Allocation of per_cpu timer vector bases is done only once under
"tvec_base_done[]" check. In the current code, spinlock_initialization
of base->lock isn't under this check. When a CPU is up each time the
base lock is reinitialized. Move base spinlock initialization under
the check.

Signed-off-by: Tirupathi Reddy <tirupath@codeaurora.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1368520142-4136-1-git-send-email-tirupath@codeaurora.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

rcu/idle: Wrap cpu-idle poll mode within rcu_idle_enter/exit

Bjørn Mork reported the following warning when running powertop.

[   49.289034] ------------[ cut here ]------------
[   49.289055] WARNING: at kernel/rcutree.c:502 rcu_eqs_exit_common.isra.48+0x3d/0x125()
[   49.289244] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.0-bisect-rcu-warn+ #107
[   49.289251]  ffffffff8157d8c8 ffffffff81801e28 ffffffff8137e4e3 ffffffff81801e68
[   49.289260]  ffffffff8103094f ffffffff81801e68 0000000000000000 ffff88023afcd9b0
[   49.289268]  0000000000000000 0140000000000000 ffff88023bee7700 ffffffff81801e78
[   49.289276] Call Trace:
[   49.289285]  [<ffffffff8137e4e3>] dump_stack+0x19/0x1b
[   49.289293]  [<ffffffff8103094f>] warn_slowpath_common+0x62/0x7b
[   49.289300]  [<ffffffff8103097d>] warn_slowpath_null+0x15/0x17
[   49.289306]  [<ffffffff810a9006>] rcu_eqs_exit_common.isra.48+0x3d/0x125
[   49.289314]  [<ffffffff81079b49>] ? trace_hardirqs_off_caller+0x37/0xa6
[   49.289320]  [<ffffffff810a9692>] rcu_idle_exit+0x85/0xa8
[   49.289327]  [<ffffffff8107076e>] trace_cpu_idle_rcuidle+0xae/0xff
[   49.289334]  [<ffffffff810708b1>] cpu_startup_entry+0x72/0x115
[   49.289341]  [<ffffffff813689e5>] rest_init+0x149/0x150
[   49.289347]  [<ffffffff8136889c>] ? csum_partial_copy_generic+0x16c/0x16c
[   49.289355]  [<ffffffff81a82d34>] start_kernel+0x3f0/0x3fd
[   49.289362]  [<ffffffff81a8274c>] ? repair_env_string+0x5a/0x5a
[   49.289368]  [<ffffffff81a82481>] x86_64_start_reservations+0x2a/0x2c
[   49.289375]  [<ffffffff81a82550>] x86_64_start_kernel+0xcd/0xd1
[   49.289379] ---[ end trace 07a1cc95e29e9036 ]---

The warning is that 'rdtp->dynticks' has an unexpected value, which roughly
translates to - the calls to rcu_idle_enter() and rcu_idle_exit() were not
made in the correct order, or otherwise messed up.

And Bjørn's painstaking debugging indicated that this happens when the idle
loop enters the poll mode. Looking at the poll function cpu_idle_poll(), and
the implementation of trace_cpu_idle_rcuidle(), the problem becomes very clear:
cpu_idle_poll() lacks calls to rcu_idle_enter/exit(), and trace_cpu_idle_rcuidle()
calls them in the reverse order - first rcu_idle_exit(), and then rcu_idle_enter().
Hence the even/odd alternative sequencing of rdtp->dynticks goes for a toss.

And powertop readily triggers this because powertop uses the idle-tracing
infrastructure extensively.

So, to fix this, wrap the code in cpu_idle_poll() within rcu_idle_enter/exit(),
so that it blends properly with the calls inside trace_cpu_idle_rcuidle() and
thus get the function ordering right.

Reported-and-tested-by: Bjørn Mork <bjorn@mork.no>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/519169BF.4080208@linux.vnet.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

tick: Don't invoke tick_nohz_stop_sched_tick() if the cpu is offline

commit 5b39939a4 (nohz: Move ts->idle_calls incrementation into strict
idle logic) moved code out of tick_nohz_stop_sched_tick() and missed
to bail out when the cpu is offline. That's causing subsequent
failures as an offline CPU is supposed to die and not to fiddle with
nohz magic.

Return false in can_stop_idle_tick() if the cpu is offline.

Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz>
Reported-and-tested-by: Prarit Bhargava <prarit@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1305132138160.2863@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Merge branch 'merge' of git://git./linux/kernel/git/benh/powerpc

Pull powerpc fixes from Benjamin Herrenschmidt:
"This is mostly bug fixes (some of them regressions, some of them I
  deemed worth merging now) along with some patches from Li Zhong
  hooking up the new context tracking stuff (for the new full NO_HZ)"

* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (25 commits)
  powerpc: Set show_unhandled_signals to 1 by default
  powerpc/perf: Fix setting of "to" addresses for BHRB
  powerpc/pmu: Fix order of interpreting BHRB target entries
  powerpc/perf: Move BHRB code into CONFIG_PPC64 region
  powerpc: select HAVE_CONTEXT_TRACKING for pSeries
  powerpc: Use the new schedule_user API on userspace preemption
  powerpc: Exit user context on notify resume
  powerpc: Exception hooks for context tracking subsystem
  powerpc: Syscall hooks for context tracking subsystem
  powerpc/booke64: Fix kernel hangs at kernel_dbg_exc
  powerpc: Fix irq_set_affinity() return values
  powerpc: Provide __bswapdi2
  powerpc/powernv: Fix starting of secondary CPUs on OPALv2 and v3
  powerpc/powernv: Detect OPAL v3 API version
  powerpc: Fix MAX_STACK_TRACE_ENTRIES too low warning again
  powerpc: Make CONFIG_RTAS_PROC depend on CONFIG_PROC_FS
  powerpc: Bring all threads online prior to migration/hibernation
  powerpc/rtas_flash: Fix validate_flash buffer overflow issue
  powerpc/kexec: Fix kexec when using VMX optimised memcpy
  powerpc: Fix build errors STRICT_MM_TYPECHECKS
  ...

Merge branch 'pm-fixes'

* pm-fixes:
  PM / hibernate: Correct documentation
  PM / Documentation: remove inaccurate suspend/hibernate transition lantency statement
  PM: Documentation update for freeze state
  PM: Avoid calling kfree() under spinlock in dev_pm_put_subsys_data()

Merge branch 'acpi-fixes'

* acpi-fixes:
  ACPI / AC: Add sleep quirk for Thinkpad e530
  ACPI / EC: Restart transaction even when the IBF flag set
  ACPI video: ignore BIOS initial backlight value for HP 1000
  ACPI: Fix section to __init. Align with usage in acpixf.h
  ACPI / PM: Move processor suspend/resume to syscore_ops

Merge branch 'pm-cpufreq'

* pm-cpufreq:
  cpufreq / intel_pstate: use vzalloc() instead of vmalloc()/memset(0)
  cpufreq, ondemand: Remove leftover debug line
  cpufreq / kirkwood: don't check resource with devm_ioremap_resource
  cpufreq / intel_pstate: remove #ifdef MODULE compile fence
  cpufreq / intel_pstate: Remove idle mode PID
  cpufreq / intel_pstate: fix ffmpeg regression
  cpufreq / intel_pstate: use lowest requested max performance
  cpufreq / intel_pstate: remove idle time and duration from sample and calculations
  cpufreq: Fix incorrect dependecies for ARM SA11xx drivers
  cpufreq: ARM big LITTLE: Fix Kconfig entries
  cpufreq: cpufreq-cpu0: Free parent node for error cases
  cpufreq: cpufreq-cpu0: defer probe when regulator is not ready
  cpufreq: Issue CPUFREQ_GOV_POLICY_EXIT notifier before dropping policy refcount
  cpufreq: governors: Fix CPUFREQ_GOV_POLICY_{INIT|EXIT} notifiers
  cpufreq: ARM big LITTLE: Improve print message
  cpufreq: ARM big LITTLE: Move cpu_to_cluster() to arm_big_little.h
  cpufreq: ARM big LITTLE DT: Return CPUFREQ_ETERNAL if clock-latency isn't found
  cpufreq: ARM big LITTLE DT: Return correct transition latency
  cpufreq: ARM big LITTLE: Select PM_OPP

powerpc: Set show_unhandled_signals to 1 by default

Just like other architectures

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/perf: Fix setting of "to" addresses for BHRB

Currently we only set the "to" address in the branch stack when the CPU
explicitly gives us a value.  Unfortunately it only does this for XL form
branches (eg blr, bctr, bctar) and not I and B form branches (eg b, bc).

Fortunately if we read the instruction from memory we can extract the offset of
a branch and calculate the target address.

This adds a function power_pmu_bhrb_to() to calculate the target/to address of
the corresponding I and B form branches.  It handles branches in both user and
kernel spaces.  It also plumbs this into the perf brhb reading code.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/pmu: Fix order of interpreting BHRB target entries

The current Branch History Rolling Buffer (BHRB) code misinterprets the order
of entries in the hardware buffer. It assumes that a branch target address
will be read _after_ its corresponding branch. In reality the branch target
comes before (lower mfbhrb entry) it's corresponding branch.

This is a rewrite of the code to take this into account.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/perf: Move BHRB code into CONFIG_PPC64 region

The new Branch History Rolling buffer (BHRB) code is only useful on 64bit
processors, so move it into the #ifdef CONFIG_PPC64 region.

This avoids code bloat on 32bit systems.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: select HAVE_CONTEXT_TRACKING for pSeries

Start context tracking support from pSeries.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Use the new schedule_user API on userspace preemption

This patch corresponds to
[PATCH] x86: Use the new schedule_user API on userspace preemption
commit 0430499ce9d78691f3985962021b16bf8f8a8048

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Exit user context on notify resume

This patch allows RCU usage in do_notify_resume, e.g. signal handling.
It corresponds to
[PATCH] x86: Exit RCU extended QS on notify resume
commit edf55fda35c7dc7f2d9241c3abaddaf759b457c6

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Exception hooks for context tracking subsystem

This is the exception hooks for context tracking subsystem, including
data access, program check, single step, instruction breakpoint, machine check,
alignment, fp unavailable, altivec assist, unknown exception, whose handlers
might use RCU.

This patch corresponds to
[PATCH] x86: Exception hooks for userspace RCU extended QS
  commit 6ba3c97a38803883c2eee489505796cb0a727122

But after the exception handling moved to generic code, and some changes in
following two commits:
56dd9470d7c8734f055da2a6bac553caf4a468eb
  context_tracking: Move exception handling to generic code
6c1e0256fad84a843d915414e4b5973b7443d48d
  context_tracking: Restore correct previous context state on exception exit

it is able for exception hooks to use the generic code above instead of a
redundant arch implementation.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Syscall hooks for context tracking subsystem

This is the syscall slow path hooks for context tracking subsystem,
corresponding to
[PATCH] x86: Syscall hooks for userspace RCU extended QS
commit bf5a3c13b939813d28ce26c01425054c740d6731

TIF_MEMDIE is moved to the second 16-bits (with value 17), as it seems there
is no asm code using it. TIF_NOHZ is added to _TIF_SYCALL_T_OR_A, so it is
better for it to be in the same 16 bits with others in the group, so in the
asm code, andi. with this group could work.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/booke64: Fix kernel hangs at kernel_dbg_exc

MSR_DE is not cleared on entry to the kernel, and we don't clear it
explicitly outside of debug code.  If we have MSR_DE set in
prime_debug_regs(), and the new thread has events enabled in DBCR0
(e.g.  ICMP is set in thread->dbsr0, even though it was cleared in the
real DBCR0 when the thread got scheduled out), we'll end up taking a
debug exception in the kernel when DBCR0 is loaded.  DSRR0 will not
point to an exception vector, and the kernel ends up hanging at
kernel_dbg_exc.  Fix this by always clearing MSR_DE when we load new
debug state.

Another observed source of kernel_dbg_exc hangs is with the branch
taken event.  If this event is active, but we take a non-debug trap
(e.g. a TLB miss or an asynchronous interrupt) before the next branch.
We end up taking a branch-taken debug exception on the initial branch
instruction of the exception vector, but because the debug exception is
DBSR_BT rather than DBSR_IC we branch to kernel_dbg_exc before even
checking the DSRR0 address.  Fix this by checking for DBSR_BT as well
as DBSR_IC, which is what 32-bit does and what the comments suggest was
intended in the 64-bit code as well.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Fix irq_set_affinity() return values

Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Provide __bswapdi2

Some versions of GCC apparently expect this to be provided by libgcc.

Updates from Mikey to fix 32 bit version and adding "r" to registers.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/powernv: Fix starting of secondary CPUs on OPALv2 and v3

The current code fails to handle kexec on OPALv2. This fixes it
and adds code to improve the situation on OPALv3 where we can
query the CPU status from the firmware and decide what to do
based on that.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/powernv: Detect OPAL v3 API version

Future firmwares will support that new version

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Fix MAX_STACK_TRACE_ENTRIES too low warning again

Saw this warning again, and this time from the ret_from_fork path.

It seems we could clear the back chain earlier in copy_thread(), which
could cover both path, and also fix potential lockdep usage in
schedule_tail(), or exception occurred before we clear the back chain.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Make CONFIG_RTAS_PROC depend on CONFIG_PROC_FS

We are getting build errors with CONFIG_PROC_FS=n:

arch/powerpc/kernel/rtas_flash.c
In function 'rtas_flash_init':
745:33: error: unused variable 'f' [-Werror=unused-variable]

But rtas_flash.c should not be built when CONFIG_PROC_FS=n, beacause all
it does is provide a /proc interface to the RTAS flash routines.

CONFIG_RTAS_FLASH already depends on CONFIG_RTAS_PROC, to indicate that
it depends on the RTAS proc support, but CONFIG_RTAS_PROC does not
depend on CONFIG_PROC_FS. So fix that.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Bring all threads online prior to migration/hibernation

This patch brings online all threads which are present but not online
prior to migration/hibernation. After migration/hibernation those
threads are taken back offline.

During migration/hibernation all online CPUs must call H_JOIN, this is
required by the hypervisor. Without this patch, threads that are offline
(H_CEDE'd) will not be woken to make the H_JOIN call and the OS will be
deadlocked (all threads either JOIN'd or CEDE'd).

Cc: <stable@kernel.org>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/rtas_flash: Fix validate_flash buffer overflow issue

ibm,validate-flash-image RTAS call output buffer contains 150 - 200
bytes of data on latest system. Presently we have output
buffer size as 64 bytes and we use sprintf to copy data from
RTAS buffer to local buffer. This causes kernel oops (see below
call trace).

This patch increases local buffer size to 256 and also uses
snprintf instead of sprintf to copy data from RTAS buffer.

Kernel call trace :
-------------------
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=1024 NUMA pSeries
Modules linked in: nfs fscache lockd auth_rpcgss nfs_acl sunrpc fuse loop dm_mod ipv6 ipv6_lib usb_storage ehea(X) sr_mod qlge ses cdrom enclosure st be2net sg ext3 jbd mbcache usbhid hid ohci_hcd ehci_hcd usbcore qla2xxx usb_common sd_mod crc_t10dif scsi_dh_hp_sw scsi_dh_rdac scsi_dh_alua scsi_dh_emc scsi_dh lpfc scsi_transport_fc scsi_tgt ipr(X) libata scsi_mod
Supported: Yes
NIP: 4520323031333130 LR: 4520323031333130 CTR: 0000000000000000
REGS: c0000001b91779b0 TRAP: 0400 Tainted: G X (3.0.13-0.27-ppc64)
MSR: 8000000040009032 <EE,ME,IR,DR> CR: 44022488 XER: 20000018
TASK = c0000001bca1aba0[4736] 'cat' THREAD: c0000001b9174000 CPU: 36
GPR00: 4520323031333130 c0000001b9177c30 c000000000f87c98 000000000000009b
GPR04: c0000001b9177c4a 000000000000000b 3520323031333130 2032303133313031
GPR08: 3133313031350a4d 000000000000009b 0000000000000000 c0000000003664a4
GPR12: 0000000022022448 c000000003ee6c00 0000000000000002 00000000100e8a90
GPR16: 00000000100cb9d8 0000000010093370 000000001001d310 0000000000000000
GPR20: 0000000000008000 00000000100fae60 000000000000005e 0000000000000000
GPR24: 0000000010129350 46573738302e3030 2046573738302e30 300a4d4720323031
GPR28: 333130313520554e 4b4e4f574e0a4d47 2032303133313031 3520323031333130
NIP [4520323031333130] 0x4520323031333130
LR [4520323031333130] 0x4520323031333130
Call Trace:
[c0000001b9177c30] [4520323031333130] 0x4520323031333130 (unreliable)
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX

Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/kexec: Fix kexec when using VMX optimised memcpy

commit b3f271e86e5a (powerpc: POWER7 optimised memcpy using VMX and
enhanced prefetch) uses VMX when it is safe to do so (ie not in
interrupt). It also looks at the task struct to decide if we have to
save the current tasks' VMX state.

kexec calls memcpy() at a point where the task struct may have been
overwritten by the new kexec segments. If it has been overwritten
then when memcpy -> enable_altivec looks up current->thread.regs->msr
we get a cryptic oops or lockup.

I also notice we aren't initialising thread_info->cpu, which means
smp_processor_id is broken. Fix that too.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@vger.kernel.org> # 3.6+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Fix build errors STRICT_MM_TYPECHECKS

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/mm: Use the correct mask value when looking at pgtable address

Our pgtable are 2*sizeof(pte_t)*PTRS_PER_PTE which is PTE_FRAG_SIZE.
Instead of depending on frag size, mask with PMD_MASKED_BITS.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

Merge tag 'fixes-for-3.10-rc2-tag' of git://git./linux/kernel/git/sstabellini/xen

Pull Xen/arm fixes from Stefano Stabellini:
"This contains a couple of Xen on ARM initialization fixes and a patch
  to improve error handling"

* tag 'fixes-for-3.10-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen:
  xen/arm: rename xen_secondary_init and run it on every online cpu
  xen/arm: do not handle VCPUOP_register_vcpu_info failures
  xen/arm: initialize pm functions later

PM / hibernate: Correct documentation

Correct the meaning of PM_HIBERNATION_PREPARE in the docs.

References: http://lkml.kernel.org/r/20130512162717.GA6305@pd.tnic
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Merge branch 'parisc-for-3.10' of git://git./linux/kernel/git/deller/parisc-linux

Pull parisc update from Helge Deller:
"The second round of parisc updates for 3.10 includes build fixes and
  enhancements to utilize irq stacks, fixes SMP races when updating PTE
  and TLB entries by proper locking and makes the search for the correct
  cross compiler more robust on Debian and Gentoo."

* 'parisc-for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: make default cross compiler search more robust (v3)
  parisc: fix SMP races when updating PTE and TLB entries in entry.S
  parisc: implement irq stacks - part 2 (v2)

PM / Documentation: remove inaccurate suspend/hibernate transition lantency statement

The lantency of the transition from suspend and hibernate is
platform-dependent. Thus we should not refer the lantency in the
documentation.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

PM: Documentation update for freeze state

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq / intel_pstate: use vzalloc() instead of vmalloc()/memset(0)

Use vzalloc() instead of vmalloc() and memset(0).

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

ARM: 7720/1: ARM v6/v7 cmpxchg64 shouldn't clear upper 32 bits of the old/new value

The implementation of cmpxchg64() for the ARM v6 and v7 architecture
casts parameter 2 and 3 (the old and new 64bit values) to an unsigned
long before calling the atomic_cmpxchg64() function. This clears
the top 32 bits of the old and new values, resulting in the wrong
values being compare-exchanged. Luckily, this only appears to be used
for 64-bit sched_clock, which we don't (yet) have on ARM.

This bug was introduced by commit 3e0f5a15f500 ("ARM: 7404/1: cmpxchg64:
use atomic64 and local64 routines for cmpxchg64").

Cc: <stable@vger.kernel.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Jaccon Bastiaansen <jaccon.bastiaansen@gmail.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:
"Several small bug fixes all over:

   1) be2net driver uses wrong payload length when submitting MAC list
      get requests to the chip.  From Sathya Perla.

   2) Fix mwifiex memory leak on driver unload, from Amitkumar Karwar.

   3) Prevent random memory access in batman-adv, from Marek Lindner.

   4) batman-adv doesn't check for pskb_trim_rcsum() errors, also from
      Marek Lindner.

   5) Fix fec crashes on rapid link up/down, from Frank Li.

   6) Fix inner protocol grovelling in GSO, from Pravin B Shelar.

   7) Link event validation fix in qlcnic from Rajesh Borundia.

   8) Not all FEC chips can support checksum offload, fix from Shawn
      Guo.

   9) EXPORT_SYMBOL + inline doesn't make any sense, from Denis Efremov.

  10) Fix race in passthru mode during device removal in macvlan, from
      Jiri Pirko.

  11) Fix RCU hash table lookup socket state race in ipv6, leading to
      NULL pointer derefs, from Eric Dumazet.

  12) Add several missing HAS_DMA kconfig dependencies, from Geert
      Uyttterhoeven.

  13) Fix bogus PCI resource management in 3c59x driver, from Sergei
      Shtylyov.

  14) Fix info leak in ipv6 GRE tunnel driver, from Amerigo Wang.

  15) Fix device leak in ipv6 IPSEC policy layer, from Cong Wang.

  16) DMA mapping leak fix in qlge from Thadeu Lima de Souza Cascardo.

  17) Missing iounmap on probe failure in bna driver, from Wei Yongjun."

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (40 commits)
  bna: add missing iounmap() on error in bnad_init()
  qlge: fix dma map leak when the last chunk is not allocated
  xfrm6: release dev before returning error
  ipv6,gre: do not leak info to user-space
  virtio_net: use default napi weight by default
  emac: Fix EMAC soft reset on 460EX/GT
  3c59x: fix PCI resource management
  caif: CAIF_VIRTIO should depend on HAS_DMA
  net/ethernet: MACB should depend on HAS_DMA
  net/ethernet: ARM_AT91_ETHER should depend on HAS_DMA
  net/wireless: ATH9K should depend on HAS_DMA
  net/ethernet: STMMAC_ETH should depend on HAS_DMA
  net/ethernet: NET_CALXEDA_XGMAC should depend on HAS_DMA
  ipv6: do not clear pinet6 field
  macvlan: fix passthru mode race between dev removal and rx path
  ipv4: ip_output: remove inline marking of EXPORT_SYMBOL functions
  net/mlx4: Strengthen VLAN tags/priorities enforcement in VST mode
  net/mlx4_core: Add missing report on VST and spoof-checking dev caps
  net: fec: enable hardware checksum only on imx6q-fec
  qlcnic: Fix validation of link event command.
  ...

parisc: make default cross compiler search more robust (v3)

People/distros vary how they prefix the toolchain name for 64bit builds.
Rather than enforce one convention over another, add a for loop which
does a search for all the general prefixes.

For 64bit builds, we now search for (in order):
hppa64-unknown-linux-gnu
hppa64-linux-gnu
hppa64-linux

For 32bit builds, we look for:
hppa-unknown-linux-gnu
hppa-linux-gnu
hppa-linux
hppa2.0-unknown-linux-gnu
hppa2.0-linux-gnu
hppa2.0-linux
hppa1.1-unknown-linux-gnu
hppa1.1-linux-gnu
hppa1.1-linux

This patch was initiated by Mike Frysinger, with feedback from Jeroen
Roovers, John David Anglin and Helge Deller.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Jeroen Roovers <jer@gentoo.org>
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Signed-off-by: Helge Deller <deller@gmx.de>

rbd: re-submit flattened write request (part 2)

Add code to rbd_img_obj_exists_callback() to detect when a clone's
parent image has disappeared, and re-submit the original write
request in that case.

Kill off some redundant assertions.

This completes the resolution for:
http://tracker.ceph.com/issues/3763

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

rbd: re-submit write request for flattened clone

Add code to rbd_img_parent_read_full_callback() to detect when a
clone's parent image has disappeared, and re-submit the original
write request in that case. (See the previous commit for more
reasoning about why this is appropriate.)

Rename some variables in rbd_img_obj_parent_read_full_callback()
to match the convention used in the previous patch.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

rbd: re-submit read request for flattened clone

If a clone image gets flattened while a parent read request is
underway, the original rbd object request needs to be resubmitted.

The reason is that by the time we get the response to the parent
read request, the data read from the parent may be out of date.
In other words, we could see this sequence of events:

    rbd client                      parent image/osd
    ----------                      ----------------
    original object ENOENT;
        issue parent read
                                    respond to parent read
                                    child image flattened
    original image header refresh
             <--- original object written independently here
    parent read response received

Add code to rbd_img_parent_read_callback() to detect when a clone's
parent image has disappeared (as evidenced by its parent overlap
becoming 0), and re-submit the original read request in that case.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

rbd: detect when clone image is flattened

A format 2 clone image can be the subject of a "flatten" operation,
during which all of its data gets "copied up" from its parent image,
leaving the image fully populated.  Once this is complete, the
clone's association with the parent is abolished.

Since this can occur when a clone is mapped, we need to detect when
it has occurred and handle it accordingly.  We know an image has
been flattened when we know it at one time had a parent, but we have
learned (via a "get_parent" object class method call) it no longer
has one.

There might be in-flight requests at the point we learn an image has
been flattened, so we can't simply clean up parent data structures
right away.  Instead, we'll drop the initial parent reference when
the parent has disappeared (rather than when the image gets
destroyed), which will allow the last in-flight reference to clean
things up when it's complete.

We leverage the fact that a zero parent overlap renders an image
effectively unlayered.  We set the overlap to 0 at the point we
detect the clone image has flattened, which allows the unlayered
behavior to take effect immediately, while keeping other parent
structures in place until in-flight requests to complete.

This and the next few patches resolve:
    http://tracker.ceph.com/issues/3763

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

rbd: reference count parent requests

Keep a reference count for uses of the parent information for an rbd
device.

An initial reference is set in rbd_img_request_create() if the
target image has a parent (with non-zero overlap). Each image
request for an image with a non-zero parent overlap gets another
reference when it's created, and that reference is dropped when the
request is destroyed.

The initial reference is dropped when the image gets torn down.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

rbd: define parent image request routines

Define rbd_parent_request_create() and rbd_parent_request_destroy()
to handle the creation of parent image requests submitted for
layered image objects. For simplicity, let rbd_img_request_put()
handle dropping the reference to any image request (parent or not),
and call whichever destructor is appropriate on the last put.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

rbd: define rbd_dev_unparent()

Define rbd_dev_unparent() to encapsulate cleaning up parent data
structures from a layered rbd image.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

rbd: don't release write request until necessary

Previously when a layered write was going to involve a copyup
request, the original osd request was released before submitting the
parent full-object read. The osd request for the copyup would then
be allocated in rbd_img_obj_parent_read_full_callback().

Shortly we will be handling the event of mapped layered images
getting flattened, and when that occurs we need to resubmit the
original request. We therefore don't want to release the osd
request until we really konw we're going to replace it--in the
callback function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

rbd: get parent info on refresh

Get parent info for format 2 images on every refresh (rather than
just during the initial probe).  This will be needed to detect the
disappearance of the parent image in the event a mapped image
becomes unlayered (i.e., flattened).  Avoid leaking the previous
parent spec on the second and subsequent times this information is
requested by dropping the previous one (if any) before updating it.
(Also, extract the pool id into a local variable before assigning
it into the parent spec.)

Switch to using a non-zero parent overlap value rather than the
existence of a parent (a non-null parent_spec pointer) to determine
whether to mark a request layered.  It will soon be possible for
a layered image to become unlayered while a request is in flight.

This means that the layered flag for an image request indicates that
there was a non-zero parent overlap at the time the image request
was created.  The parent overlap can change thereafter, which may
lead to special handling at request submission or completion time.

This and the next several patches are related to:
    http://tracker.ceph.com/issues/3763

NOTE:
If an error occurs while refreshing the parent info (i.e.,
requesting it after initial probe), the old parent info will
persist.  This is not really correct, and is a scenario that needs
to be addressed.  For now we'll assert that the failure mode is
unlikely, but the issue has been documented in tracker issue 5040.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>