Wolfram Sang [Tue, 21 May 2013 18:45:40 +0000 (20:45 +0200)]
macintosh/windfarm: Remove obsolete cleanup for clientdata
A few new i2c-drivers came into the kernel which clear the clientdata-pointer
on exit or error. This is obsolete meanwhile, the core will do it.
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Chen Gang [Tue, 21 May 2013 09:20:50 +0000 (17:20 +0800)]
powerpc/nvram64: Need return the related error code on failure occurs
When error occurs, need return the related error code to let upper
caller know about it.
ppc_md.nvram_size() can return the error code (e.g. core99_nvram_size()
in 'arch/powerpc/platforms/powermac/nvram.c').
Also set ret value when only need it, so can save structions for normal
cases.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Li Zhong [Thu, 16 May 2013 10:20:26 +0000 (18:20 +0800)]
powerpc: Set cpu sibling mask before online cpu
It seems following race is possible:
cpu0 cpux
smp_init->cpu_up->_cpu_up
__cpu_up
kick_cpu(1)
-------------------------------------------------------------------------
waiting online ...
... notify CPU_STARTING
set cpux active
set cpux online
-------------------------------------------------------------------------
finish waiting online
...
sched_init_smp
init_sched_domains(cpu_active_mask)
build_sched_domains
set cpux sibling info
-------------------------------------------------------------------------
Execution of cpu0 and cpux could be concurrent between two separator
lines.
So if the cpux sibling information was set too late (normally
impossible, but could be triggered by adding some delay in
start_secondary, after setting cpu online), build_sched_domains()
running on cpu0 might see cpux active, with an empty sibling mask, then
cause some bad address accessing like following:
[ 0.099855] Unable to handle kernel paging request for data at address 0xc00000038518078f
[ 0.099868] Faulting instruction address: 0xc0000000000b7a64
[ 0.099883] Oops: Kernel access of bad area, sig: 11 [#1]
[ 0.099895] PREEMPT SMP NR_CPUS=16 DEBUG_PAGEALLOC NUMA pSeries
[ 0.099922] Modules linked in:
[ 0.099940] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-rc1-00120-gb973425-dirty #16
[ 0.099956] task:
c0000001fed80000 ti:
c0000001fed7c000 task.ti:
c0000001fed7c000
[ 0.099971] NIP:
c0000000000b7a64 LR:
c0000000000b7a40 CTR:
c0000000000b4934
[ 0.099985] REGS:
c0000001fed7f760 TRAP: 0300 Not tainted (3.10.0-rc1-00120-gb973425-dirty)
[ 0.099997] MSR:
8000000000009032 <SF,EE,ME,IR,DR,RI> CR:
24272828 XER:
20000003
[ 0.100045] SOFTE: 1
[ 0.100053] CFAR:
c000000000445ee8
[ 0.100064] DAR:
c00000038518078f, DSISR:
40000000
[ 0.100073]
GPR00:
0000000000000080 c0000001fed7f9e0 c000000000c84d48 0000000000000010
GPR04:
0000000000000010 0000000000000000 c0000001fc55e090 0000000000000000
GPR08:
ffffffffffffffff c000000000b80b30 c000000000c962d8 00000003845ffc5f
GPR12:
0000000000000000 c00000000f33d000 c00000000000b9e4 0000000000000000
GPR16:
0000000000000000 0000000000000000 0000000000000001 0000000000000000
GPR20:
c000000000ccf750 0000000000000000 c000000000c94d48 c0000001fc504000
GPR24:
c0000001fc504000 c0000001fecef848 c000000000c94d48 c000000000ccf000
GPR28:
c0000001fc522090 0000000000000010 c0000001fecef848 c0000001fed7fae0
[ 0.100293] NIP [
c0000000000b7a64] .get_group+0x84/0xc4
[ 0.100307] LR [
c0000000000b7a40] .get_group+0x60/0xc4
[ 0.100318] Call Trace:
[ 0.100332] [
c0000001fed7f9e0] [
c0000000000dbce4] .lock_is_held+0xa8/0xd0 (unreliable)
[ 0.100354] [
c0000001fed7fa70] [
c0000000000bf62c] .build_sched_domains+0x728/0xd14
[ 0.100375] [
c0000001fed7fbe0] [
c000000000af67bc] .sched_init_smp+0x4fc/0x654
[ 0.100394] [
c0000001fed7fce0] [
c000000000adce24] .kernel_init_freeable+0x17c/0x30c
[ 0.100413] [
c0000001fed7fdb0] [
c00000000000ba08] .kernel_init+0x24/0x12c
[ 0.100431] [
c0000001fed7fe30] [
c000000000009f74] .ret_from_kernel_thread+0x5c/0x68
[ 0.100445] Instruction dump:
[ 0.100456]
38800010 38a00000 4838e3f5 60000000 7c6307b4 2fbf0000 419e0040 3d220001
[ 0.100496]
78601f24 39491590 e93e0008 7d6a002a <
7d69582a>
f97f0000 7d4a002a e93e0010
[ 0.100559] ---[ end trace
31fd0ba7d8756001 ]---
This patch tries to move the sibling maps updating before
notify_cpu_starting() and cpu online, and a write barrier there to make
sure sibling maps are updated before active and online mask.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Geert Uytterhoeven [Sun, 30 Jun 2013 10:03:23 +0000 (12:03 +0200)]
mac: Make cuda_init_via() __init
cuda_init_via() is called from find_via_cuda() only, which is __init.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Paul Gortmaker [Mon, 24 Jun 2013 19:30:09 +0000 (15:30 -0400)]
powerpc: Delete __cpuinit usage from all users
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications. For example, the fix in
commit
5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.
After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out. Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.
This removes all the powerpc uses of the __cpuinit macros. There
are no __CPUINIT users in assembly files in powerpc.
[1] https://lkml.org/lkml/2013/5/20/589
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Josh Boyer <jwboyer@gmail.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Joe Perches [Fri, 14 Jun 2013 02:37:37 +0000 (19:37 -0700)]
macintosh: Convert use of typedef ctl_table to struct ctl_table
This typedef is unnecessary and should just be removed.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Joe Perches [Fri, 14 Jun 2013 02:37:30 +0000 (19:37 -0700)]
powerpc/idle: Convert use of typedef ctl_table to struct ctl_table
This typedef is unnecessary and should just be removed.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Bjorn Helgaas [Tue, 11 Jun 2013 19:57:05 +0000 (13:57 -0600)]
powerpc/iommu: Remove unused pci_iommu_init() and pci_direct_iommu_init()
pci_iommu_init() and pci_direct_iommu_init() are not referenced anywhere,
so remove them.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Kevin Hao [Thu, 27 Jun 2013 01:09:43 +0000 (09:09 +0800)]
powerpc: Don't flush/invalidate the d/icache for an unknown relocation type
For an unknown relocation type since the value of r4 is just the 8bit
relocation type, the sum of r4 and r7 may yield an invalid memory
address. For example:
In normal case:
r4 = c00xxxxx
r7 =
40000000
r4 + r7 = 000xxxxx
For an unknown relocation type:
r4 = 000000xx
r7 =
40000000
r4 + r7 = 400000xx
400000xx is an invalid memory address for a board which has just
512M memory.
And for operations such as dcbst or icbi may cause bus error for an
invalid memory address on some platforms and then cause the board
reset. So we should skip the flush/invalidate the d/icache for
an unknown relocation type.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Acked-by: Suzuki K. Poulose <suzuki@in.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aaro Koskinen [Sun, 30 Jun 2013 19:00:42 +0000 (22:00 +0300)]
powerpc/windfarm: Fix overtemperature clearing
With pm81/pm91/pm121, when the overtemperature state is entered, and
when it remains on after skipped ticks, the driver will try to leave
it too soon (immediately on the next tick). This is because the active
FAILURE_OVERTEMP state is not visible in "new_failure" variable of the
current tick. Furthermore, the driver will keep trying to clear condition
in subsequent ticks as FAILURE_OVERTEMP remains set in the "last_failure"
variable. These will start to trigger WARNINGS from windfarm core:
[ 100.082735] windfarm: Clamping CPU frequency to minimum !
[ 100.108132] windfarm: Overtemp condition detected !
[ 101.952908] windfarm: Overtemp condition cleared !
[...]
[ 102.980388] WARNING: at drivers/macintosh/windfarm_core.c:463
[...]
[ 103.982227] WARNING: at drivers/macintosh/windfarm_core.c:463
[...]
[ 105.030494] WARNING: at drivers/macintosh/windfarm_core.c:463
[...]
[ 105.973666] WARNING: at drivers/macintosh/windfarm_core.c:463
[...]
[ 106.977913] WARNING: at drivers/macintosh/windfarm_core.c:463
Fix by adding a helper global variable. We leave the overtemp state only
after all failure bits have been cleared.
I saw this error on iMac G5 iSight (pm121). Also pm81/pm91 are fixed
based on the observation that these are almost identical/copy-pasted code.
Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 27 Jun 2013 05:46:48 +0000 (13:46 +0800)]
powerpc/powernv: Use dev-node in PCI config accessors
Currently, we're using the combo (PCI bus + devfn) in the PCI
config accessors and PCI config accessors in EEH depends on them.
However, it's not safe to refer the PCI bus which might have been
removed during hotplug. So we're using device node in the PCI
config accessors and the corresponding backends just reuse them.
The patch also fix one potential risk: We possiblly have frozen
PE during the early PCI probe time, but we haven't setup the PE
mapping yet. So the errors should be counted to PE#0.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 27 Jun 2013 05:46:47 +0000 (13:46 +0800)]
powerpc/eeh: Avoid build warnings
The patch is for avoiding following build warnings:
The function .pnv_pci_ioda_fixup() references
the function __init .eeh_init().
This is often because .pnv_pci_ioda_fixup lacks a __init
The function .pnv_pci_ioda_fixup() references
the function __init .eeh_addr_cache_build().
This is often because .pnv_pci_ioda_fixup lacks a __init
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 27 Jun 2013 05:46:46 +0000 (13:46 +0800)]
powerpc/eeh: Refactor the output message
We needn't the the whole backtrace other than one-line message in
the error reporting interrupt handler. For errors triggered by
access PCI config space or MMIO, we replace "WARN(1, ...)" with
pr_err() and dump_stack(). The patch also adds more output messages
to indicate what EEH core is doing. Besides, some printk() are
replaced with pr_warning().
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 27 Jun 2013 05:46:45 +0000 (13:46 +0800)]
powerpc/eeh: Fix address catch for PowerNV
On the PowerNV platform, the EEH address cache isn't built correctly
because we skipped the EEH devices without binding PE. The patch
fixes that.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 27 Jun 2013 05:46:44 +0000 (13:46 +0800)]
powerpc/powernv: Replace variables with flags
We have 2 fields in "struct pnv_phb" to trace the states. The patch
replace the fields with one and introduces flags for that. The patch
doesn't impact the logic.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 27 Jun 2013 05:46:43 +0000 (13:46 +0800)]
powerpc/eeh: Check PCIe link after reset
After reset (e.g. complete reset) in order to bring the fenced PHB
back, the PCIe link might not be ready yet. The patch intends to
make sure the PCIe link is ready before accessing its subordinate
PCI devices. The patch also fixes that wrong values restored to
PCI_COMMAND register for PCI bridges.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 27 Jun 2013 05:46:42 +0000 (13:46 +0800)]
powerpc/eeh: Don't collect PCI-CFG data on PHB
When the PHB is fenced or dead, it's pointless to collect the data
from PCI config space of subordinate PCI devices since it should
return 0xFF's. The patch also fixes overwritten buffer while getting
PCI config data.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Michael Neuling [Fri, 28 Jun 2013 08:17:09 +0000 (18:17 +1000)]
powerpc/tm: Clear MSR RI in non-recoverable TM code
When we treclaim and trecheckpoint there's an unavoidable period when r1
will not be a valid kernel stack pointer.
This patch clears the MSR recoverable interrupt (RI) bit over these
regions to indicate we have an invalid kernel stack pointer.
For treclaim, the region over which we clear MSR RI is larger than
required to avoid the need for an extra costly mtmsrd.
Thanks to Paulus for suggesting this change.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
James Yang [Tue, 25 Jun 2013 16:41:05 +0000 (11:41 -0500)]
powerpc: Fix string instr. emulation for 32-bit processes on ppc64
String instruction emulation would erroneously result in a segfault if
the upper bits of the EA are set and is so high that it fails access
check. Truncate the EA to 32 bits if the process is 32-bit.
Signed-off-by: James Yang <James.Yang@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Sebastien Bessiere [Fri, 14 Jun 2013 15:57:03 +0000 (17:57 +0200)]
trivial: powerpc: Fix typo in ioei_interrupt() description
Signed-off-by: Sebastien Bessiere <sebastien.bessiere@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Kirill A. Shutemov [Tue, 25 Jun 2013 21:15:36 +0000 (14:15 -0700)]
mm/thp: define HPAGE_PMD_* constants as BUILD_BUG() if !THP
Currently, HPAGE_PMD_* constans rely on PMD_SHIFT regardless of
CONFIG_TRANSPARENT_HUGEPAGE. PMD_SHIFT is not defined everywhere (e.g.
arm nommu case).
It means we can't use anything like this in generic code:
if (PageTransHuge(page))
zero_huge_user(page, 0, HPAGE_PMD_SIZE);
else
clear_highpage(page);
For !THP case, PageTransHuge() is 0 and compiler can eliminate
zero_huge_user() call. But it still need to be valid C expression, means
HPAGE_PMD_SIZE has to expand to something compiler can understand.
Previously, HPAGE_PMD_* were defined to BUILD_BUG() for !THP. Let's come
back to it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Tue, 25 Jun 2013 06:35:28 +0000 (14:35 +0800)]
powerpc/eeh: Use interruptible sleep in keehd
To replace down() with down_interrutible() to avoid following
warning:
[
c00000007ba7b710] [
c000000000014410] .__switch_to+0x1b0/0x380
[
c00000007ba7b7c0] [
c0000000007b408c] .__schedule+0x3ec/0x970
[
c00000007ba7ba50] [
c0000000007b1f24] .schedule_timeout+0x1a4/0x2b0
[
c00000007ba7bb30] [
c0000000007b34a4] .__down+0xa4/0x104
[
c00000007ba7bbf0] [
c0000000000b9230] .down+0x60/0x70
[
c00000007ba7bc80] [
c0000000000336d0] .eeh_event_handler+0x70/0x190
[
c00000007ba7bd30] [
c0000000000b1a58] .kthread+0xe8/0xf0
[
c00000007ba7be30] [
c00000000000a05c] .ret_from_kernel_thread+0x5c/0x8
This also avoids keeping the load average up while doing nothing.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Tue, 25 Jun 2013 06:35:27 +0000 (14:35 +0800)]
powerpc/eeh: Remove eeh_mutex
Originally, eeh_mutex was introduced to protect the PE hierarchy
tree and the attached EEH devices because EEH core was possiblly
running with multiple threads to access the PE hierarchy tree.
However, we now have only one kthread in EEH core. So we needn't
the eeh_mutex and just remove it.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Nathan Fontenot [Mon, 24 Jun 2013 14:35:55 +0000 (09:35 -0500)]
powerpc/mm: Fix build warnings with CONFIG_TRANSPARENT_HUGEPAGE disabled
Building with CONFIG_TRANSPARENT_HUGEPAGE disabled causes the following
build wearnings;
powerpc/arch/powerpc/include/asm/mmu-hash64.h: In function ‘__hash_page_thp’:
powerpc/arch/powerpc/include/asm/mmu-hash64.h:354: warning: no return statement in function returning non-void
This patch adds a return -1 to the static inline for __hash_page_thp()
to correct the warnings.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aruna Balakrishnaiah [Mon, 24 Jun 2013 06:53:00 +0000 (12:23 +0530)]
powerpc/pseries: Enable PSTORE in pseries_defconfig
Since now we have pstore support for nvram in pseries, enable it
in the default config. With this config option enabled, pstore
infra-structure will be used to read/write the messages from/to nvram.
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Michael Neuling [Mon, 24 Jun 2013 05:47:23 +0000 (15:47 +1000)]
powerpc/hw_brk: Fix clearing of extraneous IRQ
In 9422de3 "powerpc: Hardware breakpoints rewrite to handle non DABR breakpoint
registers" we changed the way we mark extraneous irqs with this:
- info->extraneous_interrupt = !((bp->attr.bp_addr <= dar) &&
- (dar - bp->attr.bp_addr < bp->attr.bp_len));
+ if (!((bp->attr.bp_addr <= dar) &&
+ (dar - bp->attr.bp_addr < bp->attr.bp_len)))
+ info->type |= HW_BRK_TYPE_EXTRANEOUS_IRQ;
Unfortunately this is bogus as it never clears extraneous IRQ if it's already
set.
This correctly clears extraneous IRQ before possibly setting it.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reported-by: Edjunior Barbosa Machado <emachado@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Michael Neuling [Mon, 24 Jun 2013 05:47:22 +0000 (15:47 +1000)]
powerpc/hw_brk: Fix setting of length for exact mode breakpoints
The smallest match region for both the DABR and DAWR is 8 bytes, so the
kernel needs to filter matches when users want to look at regions smaller than
this.
Currently we set the length of PPC_BREAKPOINT_MODE_EXACT breakpoints to 8.
This is wrong as in exact mode we should only match on 1 address, hence the
length should be 1.
This ensures that the kernel will filter out any exact mode hardware breakpoint
matches on any addresses other than the requested one.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reported-by: Edjunior Barbosa Machado <emachado@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Robert P. J. Day [Sat, 22 Jun 2013 08:29:48 +0000 (04:29 -0400)]
macintosh/adb: Replace __WAITQUEUE_INITIALIZER with more standard DECLARE_WAITQUEUE.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:27 +0000 (14:30 +0530)]
powerpc: Optimize hugepage invalidate
Hugepage invalidate involves invalidating multiple hpte entries.
Optimize the operation using H_BULK_REMOVE on lpar platforms.
On native, reduce the number of tlb flush.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:26 +0000 (14:30 +0530)]
powerpc/THP: Enable THP on PPC64
We enable only if the we support 16MB page size.
Reviewed-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:25 +0000 (14:30 +0530)]
powerpc: split hugepage when using subpage protection
We find all the overlapping vma and mark them such that we don't allocate
hugepage in that range. Also we split existing huge page so that the
normal page hash can be invalidated and new page faulted in with new
protection bits.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:24 +0000 (14:30 +0530)]
powerpc: disable assert_pte_locked for collapse_huge_page
With THP we set pmd to none, before we do pte_clear. Hence we can't
walk page table to get the pte lock ptr and verify whether it is locked.
THP do take pte lock before calling pte_clear. So we don't change the locking
rules here. It is that we can't use page table walking to check whether
pte locks are held with THP.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:23 +0000 (14:30 +0530)]
powerpc: Prevent gcc to re-read the pagetables
GCC is very likely to read the pagetables just once and cache them in
the local stack or in a register, but it is can also decide to re-read
the pagetables. The problem is that the pagetable in those places can
change from under gcc.
With THP/hugetlbfs the pmd (and pgd for hugetlbfs giga pages) can
change under gup_fast. The pages won't be freed untill we finish
gup fast because we have irq disabled and we free these pages via
rcu callback.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:22 +0000 (14:30 +0530)]
powerpc: Make linux pagetable walk safe with THP enabled
We need to have irqs disabled to handle all the possible parallel update for
linux page table without holding locks.
Events that we are intersted in while walking page tables are
1) Page fault
2) umap
3) THP split
4) THP collapse
A) local_irq_disabled:
------------------------
1) page fault:
A none to valid transition via page fault is not an issue because we
would either see a none or valid. If it is none, we would error out
the page table walk. We may need to use on stack values when checking for
type of page table elements, because if we do
if (!is_hugepd()) {
if (!pmd_none() {
if (pmd_bad() {
We could take that bad condition because the pmd got converted to a hugepd
after the !is_hugepd check via a hugetlb fault.
The right way would be to check for pmd_none higher up or use on stack value.
2) A valid to none conversion via unmap:
We can safely walk the upper level table, because we don't remove the the
page table entries until rcu grace period. So even if we followed a
wrong pointer we still have the pointer valid till the grace period.
A PTE pointer returned need to be atomically checked for _PAGE_PRESENT and
_PAGE_BUSY. A valid pointer returned could becoming none later. To prevent
pte_clear we take _PAGE_BUSY.
3) THP split:
A valid transparent hugepage is converted to nomal page. Before we split we
do pmd_splitting_flush, which sets the hugepage PTE to _PAGE_SPLITTING
So when walking page table we need to check for pmd_trans_splitting and
handle that. The pte returned should also need to be checked for
_PAGE_SPLITTING before setting _PAGE_BUSY similar to _PAGE_PRESENT. We save
the value of PTE on stack and check for the flag in the local pte value.
If we don't have the value set we can safely operate on the local pte value
and we atomicaly set _PAGE_BUSY.
4) THP collapse:
A normal page gets converted to hugepage. In the collapse path, we
mark the pmd none early (pmdp_clear_flush). With irq disabled, if we
are aleady walking page table we would see the pmd_none and won't continue.
If we see a valid PMD, we should still check for _PAGE_PRESENT before
setting _PAGE_BUSY, to make sure we didn't collapse the PTE to a Huge PTE.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:21 +0000 (14:30 +0530)]
powerpc/THP: Add code to handle HPTE faults for hugepages
The deposted PTE page in the second half of the PMD table is used to
track the state on hash PTEs. After updating the HPTE, we mark the
coresponding slot in the deposted PTE page valid.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:20 +0000 (14:30 +0530)]
powerpc: Update gup_pmd_range to handle transparent hugepages
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:19 +0000 (14:30 +0530)]
powerpc/kvm: Handle transparent hugepage in KVM
We can find pte that are splitting while walking page tables. Return
None pte in that case.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:18 +0000 (14:30 +0530)]
powerpc: Replace find_linux_pte with find_linux_pte_or_hugepte
Replace find_linux_pte with find_linux_pte_or_hugepte and explicitly
document why we don't need to handle transparent hugepages at callsites.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:17 +0000 (14:30 +0530)]
powerpc: Update find_linux_pte_or_hugepte to handle transparent hugepages
Reviewed-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:16 +0000 (14:30 +0530)]
powerpc: move find_linux_pte_or_hugepte and gup_hugepte to common code
We will use this in the later patch for handling THP pages
Reviewed-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:15 +0000 (14:30 +0530)]
powerpc/THP: Implement transparent hugepages for ppc64
We now have pmd entries covering 16MB range and the PMD table double its original size.
We use the second half of the PMD table to deposit the pgtable (PTE page).
The depoisted PTE page is further used to track the HPTE information. The information
include [ secondary group | 3 bit hidx | valid ]. We use one byte per each HPTE entry.
With 16MB hugepage and 64K HPTE we need 256 entries and with 4K HPTE we need
4096 entries. Both will fit in a 4K PTE page. On hugepage invalidate we need to walk
the PTE page and invalidate all valid HPTEs.
This patch implements necessary arch specific functions for THP support and also
hugepage invalidate logic. These PMD related functions are intentionally kept
similar to their PTE counter-part.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:14 +0000 (14:30 +0530)]
powerpc/THP: Double the PMD table size for THP
THP code does PTE page allocation along with large page request and deposit them
for later use. This is to ensure that we won't have any failures when we split
hugepages to regular pages.
On powerpc we want to use the deposited PTE page for storing hash pte slot and
secondary bit information for the HPTEs. We use the second half
of the pmd table to save the deposted PTE page.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:13 +0000 (14:30 +0530)]
powerpc/mm: handle hugepage size correctly when invalidating hpte entries
If a hash bucket gets full, we "evict" a more/less random entry from it.
When we do that we don't invalidate the TLB (hpte_remove) because we assume
the old translation is still technically "valid". This implies that when
we are invalidating or updating pte, even if HPTE entry is not valid
we should do a tlb invalidate. With hugepages, we need to pass the correct
actual page size value for tlb invalidation.
This change update the patch
0608d692463598c1d6e826d9dd7283381b4f246c
"powerpc/mm: Always invalidate tlb on hpte invalidate and update" to handle
transparent hugepages correctly.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 10:13:26 +0000 (18:13 +0800)]
powerpc/eeh: Debugfs for error injection
The patch creates debugfs entries (powerpc/PCIxxxx/err_injct) for
injecting EEH errors for testing purpose.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 10:13:25 +0000 (18:13 +0800)]
powerpc/powernv: Debugfs directory for PHB
The patch creates one debugfs directory ("powerpc/PCIxxxx") for
each PHB so that we can hook EEH error injection debugfs entry
there in proceeding patch.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 10:13:24 +0000 (18:13 +0800)]
powerpc/eeh: Register OPAL notifier for PCI error
The patch registers OPAL event notifier and process the PCI errors
from firmware. If we have pending PCI errors, special EEH event
(without binding PE) will be sent to EEH core for processing.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 10:13:23 +0000 (18:13 +0800)]
powernv/opal: Disable OPAL notifier upon poweroff
While we're restarting or powering off the system, we needn't
the OPAL notifier any more. So just to disable that.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 10:13:22 +0000 (18:13 +0800)]
powernv/opal: Notifier for OPAL events
This patch implements a notifier to receive a notification on OPAL
event mask changes. The notifier is only called as a result of an OPAL
interrupt, which will happen upon reception of FSP messages or PCI errors.
Any event mask change detected as a result of opal_poll_events() will not
result in a notifier call.
[benh: changelog]
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:21:16 +0000 (13:21 +0800)]
powerpc/eeh: Allow to check fenced PHB proactively
It's meaningless to handle frozen PE if we already had fenced PHB.
The patch intends to check the PHB state before checking PE. If the
PHB has been put into fenced state, we need take care of that firstly.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:21:15 +0000 (13:21 +0800)]
powerpc/eeh: Enable EEH check for config access
The patch enables EEH check and let EEH core to process the EEH
errors for PowerNV platform while accessing config space. Originally,
the implementation already had mechanism to check EEH errors and
tried to recover from them. However, we never let EEH core to handle
the EEH errors.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:21:14 +0000 (13:21 +0800)]
powerpc/eeh: Initialization for PowerNV
The patch initializes EEH for PowerNV platform. Because the OPAL
APIs requires HUB ID, we need trace that through struct pnv_phb.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:21:13 +0000 (13:21 +0800)]
powerpc/eeh: PowerNV EEH backends
The patch adds EEH backends for PowerNV platform. It's notable that
part of those EEH backends call to the I/O chip dependent backends.
[Removed pointless change to eeh_pseries.c -- BenH]
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:21:12 +0000 (13:21 +0800)]
powerpc/eeh: I/O chip next error
The patch implements the backend for EEH core to retrieve next
EEH error to handle. For the informational errors, we won't bother
the EEH core. Otherwise, the EEH should take appropriate actions
depending on the return value:
0 - No further errors detected
1 - Frozen PE
2 - Fenced PHB
3 - Dead PHB
4 - Dead IOC
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:21:11 +0000 (13:21 +0800)]
powerpc/eeh: I/O chip PE log and bridge setup
The patch adds backends to retrieve error log and configure p2p
bridges for the indicated PE.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:21:10 +0000 (13:21 +0800)]
powerpc/eeh: I/O chip PE reset
The patch adds the I/O chip backend to do PE reset. For now, we
focus on PCI bus dependent PE. If PHB PE has been put into error
state, the PHB will take complete reset. Besides, the root bridge
will take fundamental or hot reset accordingly if the indicated
PE locates at the toppest of PCI hierarchy tree. Otherwise, the
upstream p2p bridge will take hot reset.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:21:09 +0000 (13:21 +0800)]
powerpc/eeh: I/O chip EEH state retrieval
The patch adds I/O chip backend to retrieve the state for the
indicated PE. While the PE state is temperarily unavailable,
the upper layer (powernv platform) should return default delay
(1 second).
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:21:08 +0000 (13:21 +0800)]
powerpc/eeh: I/O chip EEH enable option
The patch adds the backend to enable or disable EEH functionality
for the specified PE. The backend is also used to enable MMIO or
DMA path for the problematic PE. It's notable that all PEs on
PowerNV platform support EEH functionality by default, and we
disallow to disable EEH for the specific PE.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:21:07 +0000 (13:21 +0800)]
powerpc/eeh: I/O chip post initialization
The post initialization (struct eeh_ops::post_init) is called after
the EEH probe is done. On the other hand, the EEH core post
initialization is designed to call platform and then I/O chip backend
on PowerNV platform.
The patch adds the backend for I/O chip to notify the platform
that the specific PHB is ready to supply EEH service.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:21:06 +0000 (13:21 +0800)]
powerpc/eeh: EEH backend for P7IOC
For EEH on PowerNV platform, the overall architecture is different
from that on pSeries platform. In order to support multiple I/O chips
in future, we split EEH to 3 layers for PowerNV platform: EEH core,
platform layer, I/O layer. It would give EEH implementation on PowerNV
platform much more flexibility in future.
The patch adds the EEH backend for P7IOC.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:21:05 +0000 (13:21 +0800)]
powerpc/eeh: Sync OPAL API with firmware
The patch synchronizes OPAL APIs between kernel and firmware. Also,
we starts to replace opal_pci_get_phb_diag_data() with the similar
opal_pci_get_phb_diag_data2() and the former OPAL API would return
OPAL_UNSUPPORTED from now on.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:21:04 +0000 (13:21 +0800)]
powerpc/eeh: EEH core to handle special event
On PowerNV platform, the EEH event caused by interrupt won't have
binding PE. The patch enables EEH core to handle the special event.
To avoid the current logic we have, The eeh_handle_event() is renamed
to eeh_handle_normal_event(), and the eeh_handle_special_event() is
introduced. The function eeh_handle_event() dispatches to above two
functions according to the input parameter. Besides, new backend
"next_error" added to eeh_ops and it's expected to have following
return values:
4 - Dead IOC 3 - Dead PHB
2 - Fenced PHB 1 - Frozen PE
0 - No error found
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:21:03 +0000 (13:21 +0800)]
powerpc/eeh: Export confirm_error_lock
An EEH event is created and queued to the event queue for each
ingress EEH error. When there're mutiple EEH errors, we need serialize
the process to keep consistent PE state (flags). The spinlock
"confirm_error_lock" was introduced for the purpose. We'll inject
EEH event upon error reporting interrupts on PowerNV platform. So
we export the spinlock for that to use for consistent PE state.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:21:02 +0000 (13:21 +0800)]
powerpc/eeh: Allow to purge EEH events
On PowerNV platform, we might run into the situation where subsequent
events are duplicated events of former one, which is being processed.
For the case, we need the function implemented by the patch to purge
EEH events accordingly.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:21:01 +0000 (13:21 +0800)]
powerpc/eeh: Trace time on first error for PE
We're not expecting that one specific PE got frozen for over 5
times in last hour. Otherwise, the PE will be removed from the
system upon newly coming EEH errors. The patch introduces time
stamp to trace the first error on specific PE in last hour and
function to update that accordingly. Besides, the time stamp
is recovered during PE hotplug path as we did for frozen count.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:21:00 +0000 (13:21 +0800)]
powerpc/eeh: Single kthread to handle events
We possiblly have multiple kthreads running for multiple EEH errors
(events) and use one spinlock to make the process of handling those
EEH events serialized. That's unnecessary and the patch creates only
one kthread, which is started during EEH core initialization time in
eeh_init(). A new semaphore introduced to count the number of existing
EEH events in the queue and the kthread waiting on the semaphore.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:20:59 +0000 (13:20 +0800)]
powerpc/eeh: Delay EEH probe during hotplug
While doing EEH recovery, the PCI devices of the problematic PE
should be removed and then added to the system again. During the
so-called hotplug event, the PCI devices of the problematic PE
will be probed through early/late phase. We would delay EEH probe
on late point for PowerNV platform since the PCI device isn't
available in early phase.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:20:58 +0000 (13:20 +0800)]
powerpc/eeh: Refactor eeh_reset_pe_once()
We shouldn't check that the returned PE status is exactly equal to
(EEH_STATE_MMIO_ACTIVE | EEH_STATE_DMA_ACTIVE) but instead only check
that they are both set.
[benh: changelog]
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:20:57 +0000 (13:20 +0800)]
powerpc/eeh: EEH post initialization operation
The patch adds new EEH operation post_init. It's used to notify
the platform that EEH core has completed the EEH probe. By that,
PowerNV platform starts to use the services supplied by EEH
functionality.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:20:56 +0000 (13:20 +0800)]
powerpc/eeh: Make eeh_init() public
For EEH on PowerNV platform, we will do EEH probe based on the
real PCI devices. The PCI devices are available after PCI probe.
So we have to call eeh_init() explicitly on PowerNV platform
after PCI probe. The patch also does EEH probe for PowerNV platform
in eeh_init().
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:20:55 +0000 (13:20 +0800)]
powerpc/eeh: Trace PCI bus from PE
There're several types of PEs can be supported for now: PHB, Bus
and Device dependent PE. For PCI bus dependent PE, tracing the
corresponding PCI bus from PE (struct eeh_pe) would make the code
more efficient. The patch also enables the retrieval of PCI bus based
on the PCI bus dependent PE.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:20:54 +0000 (13:20 +0800)]
powerpc/eeh: Make eeh_pe_get() public
While processing EEH event interrupt from P7IOC, we need function
to retrieve the PE according to the indicated EEH device. The patch
makes function eeh_pe_get() public so that other source files can call
it for that purpose. Also, the patch fixes referring to wrong BDF
(Bus/Device/Function) address while searching PE in function
__eeh_pe_get().
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:20:53 +0000 (13:20 +0800)]
powerpc/eeh: Make eeh_phb_pe_get() public
One of the possible cases indicated by P7IOC interrupt is fenced
PHB. For that case, we need fetch the PE corresponding to the PHB
and disable the PHB and all subordinate PCI buses/devices, recover
from the fenced state and eventually enable the whole PHB. We need
one function to fetch the PHB PE outside eeh_pe.c and the patch is
going to make eeh_phb_pe_get() public for that purpose.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:20:52 +0000 (13:20 +0800)]
powerpc/eeh: Move common part to kernel directory
The patch moves the common part of EEH core into arch/powerpc/kernel
directory so that we needn't PPC_PSERIES while compiling POWERNV
platform:
* Move the EEH common part into arch/powerpc/kernel
* Move the functions for PCI hotplug from pSeries platform to
arch/powerpc/kernel/pci-hotplug.c
* Move CONFIG_EEH from arch/powerpc/platforms/pseries/Kconfig to
arch/powerpc/platforms/Kconfig
* Adjust makefile accordingly
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 20 Jun 2013 05:20:51 +0000 (13:20 +0800)]
powerpc/eeh: Cleanup for EEH core
Cleanup on EEH core to remove unnecessary whitespaces.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Michael Neuling [Sun, 9 Jun 2013 11:23:19 +0000 (21:23 +1000)]
powerpc/tm: Fix return of active 64bit signals
Currently we only restore signals which are transactionally suspended but it's
possible that the transaction can be restored even when it's active. Most
likely this will result in a transactional rollback by the hardware as the
transaction will have been doomed by an earlier treclaim.
The current code is a legacy of earlier kernel implementations which did
software rollback of active transactions in the kernel. That code has now gone
but we didn't correctly fix up this part of the signals code which still makes
assumptions based on having software rollback.
This changes the signal return code to always restore both contexts on 64 bit
signal return. It also ensures that the MSR TM bits are properly restored from
the signal context which they are not currently.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Michael Neuling [Sun, 9 Jun 2013 11:23:18 +0000 (21:23 +1000)]
powerpc/tm: Fix return of 32bit rt signals to active transactions
Currently we only restore signals which are transactionally suspended but it's
possible that the transaction can be restored even when it's active. Most
likely this will result in a transactional rollback by the hardware as the
transaction will have been doomed by an earlier treclaim.
The current code is a legacy of earlier kernel implementations which did
software rollback of active transactions in the kernel. That code has now gone
but we didn't correctly fix up this part of the signals code which still makes
assumptions based on having software rollback.
This changes the signal return code to always restore both contexts on 32 bit
rt signal return.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Michael Neuling [Sun, 9 Jun 2013 11:23:17 +0000 (21:23 +1000)]
powerpc/tm: Fix restoration of MSR on 32bit signal return
Currently we clear out the MSR TM bits on signal return assuming that the
signal should never return to an active transaction.
This is bogus as the user may do this. It's most likely the transaction will
be doomed due to a treclaim but that's a problem for the HW not the kernel.
The current code is a legacy of earlier kernel implementations which did
software rollback of active transactions in the kernel. That code has now gone
but we didn't correctly fix up this part of the signals code which still makes
the assumption that it must be returning to a suspended transaction.
This pulls out both MSR TM bits from the user supplied context rather than just
setting TM suspend. We pull out only the bits needed to ensure the user can't
do anything dangerous to the MSR.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Michael Neuling [Sun, 9 Jun 2013 11:23:16 +0000 (21:23 +1000)]
powerpc/tm: Fix 32 bit non-rt signals
Currently sys_sigreturn() is TM unaware. Therefore, if we take a 32 bit signal
without SIGINFO (non RT) inside a transaction, on signal return we don't
restore the signal frame correctly.
This checks if the signal frame being restoring is an active transaction, and
if so, it copies the additional state to ptregs so it can be restored.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Michael Neuling [Sun, 9 Jun 2013 11:23:15 +0000 (21:23 +1000)]
powerpc/tm: Fix writing top half of MSR on 32 bit signals
The MSR TM controls are in the top 32 bits of the MSR hence on 32 bit signals,
we stick the top half of the MSR in the checkpointed signal context so that the
user can access it.
Unfortunately, we don't currently write anything to the checkpointed signal
context when coming in a from a non transactional process and hence the top MSR
bits can contain junk.
This updates the 32 bit signal handling code to always write something to the
top MSR bits so that users know if the process is transactional or not and the
kernel can use it on signal return.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Benjamin Herrenschmidt [Sun, 9 Jun 2013 07:04:58 +0000 (17:04 +1000)]
powerpc/8xx: Remove 8xx specific "minimal FPU emulation"
This is duplicated code from math-emu and implements such a small
subset of the FPU (load/stores/fmr) that it's essentially pointless
nowdays.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Benjamin Herrenschmidt [Sun, 9 Jun 2013 07:01:24 +0000 (17:01 +1000)]
powerpc/math-emu: Allow math-emu to be used for HW FPU
(Including 64-bit ones)
This allow SW emulation by the kernel of optional instructions
such as fsqrt which aren't implemented on some processors, and
thus fixes some Fedora 19 issues such as Anaconda since the
compiler is set to generate those by default on 64-bit.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Benjamin Herrenschmidt [Sun, 9 Jun 2013 07:00:42 +0000 (17:00 +1000)]
powerpc/math-emu: Fix decoding of some instructions
The decoding of some instructions such as fsqrt{s} was incorrect,
using the wrong registers, and thus could not work.
This fixes it and also adds a couple of place holders for missing
instructions.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aruna Balakrishnaiah [Wed, 5 Jun 2013 18:52:20 +0000 (00:22 +0530)]
powerpc/pseries: Read common partition via pstore
This patch exploits pstore subsystem to read details of common partition
in NVRAM to a separate file in /dev/pstore. For instance, common partition
details will be stored in a file named [common-nvram-6].
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aruna Balakrishnaiah [Wed, 5 Jun 2013 18:52:10 +0000 (00:22 +0530)]
powerpc/pseries: Read of-config partition via pstore
This patch set exploits the pstore subsystem to read details of
of-config partition in NVRAM to a separate file in /dev/pstore.
For instance, of-config partition details will be stored in a
file named [of-nvram-5].
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aruna Balakrishnaiah [Wed, 5 Jun 2013 18:51:59 +0000 (00:21 +0530)]
powerpc/pseries: Distinguish between a os-partition and non-os partition
Introduce os_partition member in nvram_os_partition structure to identify
if the partition is an os partition or not. This will be useful to handle
non-os partitions of-config and common.
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aruna Balakrishnaiah [Wed, 5 Jun 2013 18:51:44 +0000 (00:21 +0530)]
powerpc/pseries: Read rtas partition via pstore
This patch set exploits the pstore subsystem to read details of rtas partition
in NVRAM to a separate file in /dev/pstore. For instance, rtas details will be
stored in a file named [rtas-nvram-4].
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aruna Balakrishnaiah [Wed, 5 Jun 2013 18:51:32 +0000 (00:21 +0530)]
powerpc/pseries: Read/Write oops nvram partition via pstore
IBM's p series machines provide persistent storage for LPARs through NVRAM.
NVRAM's lnx,oops-log partition is used to log oops messages.
Currently the kernel provides the contents of p-series NVRAM only as a
simple stream of bytes via /dev/nvram, which must be interpreted in user
space by the nvram command in the powerpc-utils package.
This patch set exploits the pstore subsystem to expose oops partition in
NVRAM as a separate file in /dev/pstore. For instance, Oops messages will be
stored in a file named [dmesg-nvram-2]. In case pstore registration fails it
will fall back to kmsg_dump mechanism.
This patch will read/write the oops messages from/to this partition via pstore.
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aruna Balakrishnaiah [Wed, 5 Jun 2013 18:51:16 +0000 (00:21 +0530)]
powerpc/pseries: Introduce generic read function to read nvram-partitions
Introduce generic read function to read nvram partitions other than rtas.
nvram_read_error_log will be retained which is used to read rtas partition
from rtasd. nvram_read_partition is the generic read function to read from
any nvram partition.
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aruna Balakrishnaiah [Wed, 5 Jun 2013 18:51:05 +0000 (00:21 +0530)]
powerpc/pseries: Add version and timestamp to oops header
Introduce version and timestamp information in the oops header.
oops_log_info (oops header) holds version (to distinguish between old
and new format oops header), length of the oops text
(compressed or uncompressed) and timestamp.
The version field will sit in the same place as the length in old
headers. version is assigned 5000 (greater than oops partition size)
so that existing tools will refuse to dump new style partitions as
the length is too large. The updated tools will work with both
old and new format headers.
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aruna Balakrishnaiah [Wed, 5 Jun 2013 18:50:55 +0000 (00:20 +0530)]
powerpc/pseries: Remove syslog prefix in uncompressed oops text
Removal of syslog prefix in the uncompressed oops text will
help in capturing more oops data.
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Wed, 5 Jun 2013 07:34:03 +0000 (15:34 +0800)]
powerpc/eeh: Enhance converting EEH dev
Under some special circumstances, the EEH device doesn't have the
associated device tree node or PCI device. The patch enhances those
functions converting EEH device to device tree node or PCI device
accordingly to avoid unnecessary system crash.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Wed, 5 Jun 2013 07:34:02 +0000 (15:34 +0800)]
powerpc/eeh: Fix fetching bus for single-dev-PE
While running Linux as guest on top of phyp, we possiblly have
PE that includes single PCI device. However, we didn't return
its PCI bus correctly and it leads to failure on recovery from
EEH errors for single-dev-PE. The patch fixes the issue.
Cc: <stable@vger.kernel.org> # v3.7+
Cc: Steve Best <sbest@us.ibm.com>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Blanchard [Wed, 5 Jun 2013 03:02:26 +0000 (13:02 +1000)]
powerpc: Align thread->fpr to 16 bytes
On newer CPUs we use VSX loads and stores to the thread->fpr array.
For best performance we need to ensure 16 byte alignment.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
liguang [Thu, 30 May 2013 07:20:33 +0000 (15:20 +0800)]
powerpc/pseries: Use 'true' instead of '1' for orderly_poweroff
orderly_poweroff is expecting a bool parameter, so
use 'true' instead '1'
Signed-off-by: liguang <lig.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
liguang [Thu, 30 May 2013 06:47:53 +0000 (14:47 +0800)]
powerpc/smp: Use '==' instead of '<' for system_state
'system_state < SYSTEM_RUNNING' will have same effect
with 'system_state == SYSTEM_BOOTING', but the later
one is more clearer.
Signed-off-by: liguang <lig.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Bharat Bhushan [Wed, 22 May 2013 04:20:59 +0000 (09:50 +0530)]
powerpc: Restore dbcr0 on user space exit
On BookE (Branch taken + Single Step) is as same as Branch Taken
on BookS and in Linux we simulate BookS behavior for BookE as well.
When doing so, in Branch taken handling we want to set DBCR0_IC but
we update the current->thread->dbcr0 and not DBCR0.
Now on 64bit the current->thread.dbcr0 (and other debug registers)
is synchronized ONLY on context switch flow. But after handling
Branch taken in debug exception if we return back to user space
without context switch then single stepping change (DBCR0_ICMP)
does not get written in h/w DBCR0 and Instruction Complete exception
does not happen.
This fixes using ptrace reliably on BookE-PowerPC
lmbench latency test (lat_syscall) Results are (they varies a little
on each run)
1) ./lat_syscall <action> /dev/shm/uImage
action: Open read write stat fstat null
Before: 3.8618 0.2017 0.2851 1.6789 0.2256 0.0856
After: 3.8580 0.2017 0.2851 1.6955 0.2255 0.0856
1) ./lat_syscall -P 2 -N 10 <action> /dev/shm/uImage
action: Open read write stat fstat null
Before: 4.1388 0.2238 0.3066 1.7106 0.2256 0.0856
After: 4.1413 0.2236 0.3062 1.7107 0.2256 0.0856
[ Slightly modified to avoid extra branch in the fast path
on Book3S and fix build on all non-BookE 64-bit -- BenH
]
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Bharat Bhushan [Wed, 22 May 2013 04:20:58 +0000 (09:50 +0530)]
powerpc: Debug control and status registers are 32bit
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Alexey Kardashevskiy [Tue, 21 May 2013 03:33:11 +0000 (13:33 +1000)]
powerpc/vfio: Enable on pSeries platform
The enables VFIO on the pSeries platform, enabling user space
programs to access PCI devices directly.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Alexey Kardashevskiy [Tue, 21 May 2013 03:33:10 +0000 (13:33 +1000)]
powerpc/vfio: Implement IOMMU driver for VFIO
VFIO implements platform independent stuff such as
a PCI driver, BAR access (via read/write on a file descriptor
or direct mapping when possible) and IRQ signaling.
The platform dependent part includes IOMMU initialization
and handling. This implements an IOMMU driver for VFIO
which does mapping/unmapping pages for the guest IO and
provides information about DMA window (required by a POWER
guest).
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Alexey Kardashevskiy [Tue, 21 May 2013 03:33:09 +0000 (13:33 +1000)]
powerpc/vfio: Enable on PowerNV platform
This initializes IOMMU groups based on the IOMMU configuration
discovered during the PCI scan on POWERNV (POWER non virtualized)
platform. The IOMMU groups are to be used later by the VFIO driver,
which is used for PCI pass through.
It also implements an API for mapping/unmapping pages for
guest PCI drivers and providing DMA window properties.
This API is going to be used later by QEMU-VFIO to handle
h_put_tce hypercalls from the KVM guest.
The iommu_put_tce_user_mode() does only a single page mapping
as an API for adding many mappings at once is going to be
added later.
Although this driver has been tested only on the POWERNV
platform, it should work on any platform which supports
TCE tables. As h_put_tce hypercall is received by the host
kernel and processed by the QEMU (what involves calling
the host kernel again), performance is not the best -
circa 220MB/s on 10Gb ethernet network.
To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config
option and configure VFIO as required.
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>