review.tizen.org Git - platform/kernel/linux-rpi.git/log

Documentation: dt: chosen properties for arm64 kdump

Add documentation for DT properties:
linux,usable-memory-range
linux,elfcorehdr
used by arm64 kdump. Those are, respectively, a usable memory range
allocated to crash dump kernel and the elfcorehdr's location within it.

Signed-off-by: James Morse <james.morse@arm.com>
[takahiro.akashi@linaro.org: update the text due to recent changes ]
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: devicetree@vger.kernel.org
Cc: Rob Herring <robh+dt@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

Documentation: kdump: describe arm64 port

Add arch specific descriptions about kdump usage on arm64 to kdump.txt.

Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Dave Young <dyoung@redhat.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: kdump: enable kdump in defconfig

Kdump is enabled by default as kexec is.

Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: kdump: provide /proc/vmcore file

Arch-specific functions are added to allow for implementing a crash dump
file interface, /proc/vmcore, which can be viewed as a ELF file.

A user space tool, like kexec-tools, is responsible for allocating
a separate region for the core's ELF header within crash kdump kernel
memory and filling it in when executing kexec_load().

Then, its location will be advertised to crash dump kernel via a new
device-tree property, "linux,elfcorehdr", and crash dump kernel preserves
the region for later use with reserve_elfcorehdr() at boot time.

On crash dump kernel, /proc/vmcore will access the primary kernel's memory
with copy_oldmem_page(), which feeds the data page-by-page by ioremap'ing
it since it does not reside in linear mapping on crash dump kernel.

Meanwhile, elfcorehdr_read() is simple as the region is always mapped.

Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Reviewed-by: James Morse <james.morse@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: kdump: add VMCOREINFO's for user-space tools

In addition to common VMCOREINFO's defined in
crash_save_vmcoreinfo_init(), we need to know, for crash utility,
  - kimage_voffset
  - PHYS_OFFSET
to examine the contents of a dump file (/proc/vmcore) correctly
due to the introduction of KASLR (CONFIG_RANDOMIZE_BASE) in v4.6.

  - VA_BITS
is also required for makedumpfile command.

arch_crash_save_vmcoreinfo() appends them to the dump file.
More VMCOREINFO's may be added later.

Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Reviewed-by: James Morse <james.morse@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: kdump: implement machine_crash_shutdown()

Primary kernel calls machine_crash_shutdown() to shut down non-boot cpus
and save registers' status in per-cpu ELF notes before starting crash
dump kernel. See kernel_kexec().
Even if not all secondary cpus have shut down, we do kdump anyway.

As we don't have to make non-boot(crashed) cpus offline (to preserve
correct status of cpus at crash dump) before shutting down, this patch
also adds a variant of smp_send_stop().

Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Reviewed-by: James Morse <james.morse@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: hibernate: preserve kdump image around hibernation

Since arch_kexec_protect_crashkres() removes a mapping for crash dump
kernel image, the loaded data won't be preserved around hibernation.

In this patch, helper functions, crash_prepare_suspend()/
crash_post_resume(), are additionally called before/after hibernation so
that the relevant memory segments will be mapped again and preserved just
as the others are.

In addition, to minimize the size of hibernation image, crash_is_nosave()
is added to pfn_is_nosave() in order to recognize only the pages that hold
loaded crash dump kernel image as saveable. Hibernation excludes any pages
that are marked as Reserved and yet "nosave."

Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Reviewed-by: James Morse <james.morse@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: kdump: protect crash dump kernel memory

arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
are meant to be called by kexec_load() in order to protect the memory
allocated for crash dump kernel once the image is loaded.

The protection is implemented by unmapping the relevant segments in crash
dump kernel memory, rather than making it read-only as other archs do,
to prevent coherency issues due to potential cache aliasing (with
mismatched attributes).

Page-level mappings are consistently used here so that we can change
the attributes of segments in page granularity as well as shrink the region
also in page granularity through /sys/kernel/kexec_crash_size, putting
the freed memory back to buddy system.

Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: mm: add set_memory_valid()

This function validates and invalidates PTE entries, and will be utilized
in kdump to protect loaded crash dump kernel image.

Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: kdump: reserve memory for crash dump kernel

"crashkernel=" kernel parameter specifies the size (and optionally
the start address) of the system ram to be used by crash dump kernel.
reserve_crashkernel() will allocate and reserve that memory at boot time
of primary kernel.

The memory range will be exposed to userspace as a resource named
"Crash kernel" in /proc/iomem.

Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Signed-off-by: Mark Salter <msalter@redhat.com>
Signed-off-by: Pratyush Anand <panand@redhat.com>
Reviewed-by: James Morse <james.morse@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: limit memory regions based on DT property, usable-memory-range

Crash dump kernel uses only a limited range of available memory as System
RAM. On arm64 kdump, This memory range is advertised to crash dump kernel
via a device-tree property under /chosen,
linux,usable-memory-range = <BASE SIZE>

Crash dump kernel reads this property at boot time and calls
memblock_cap_memory_range() to limit usable memory which are listed either
in UEFI memory map table or "memory" nodes of a device tree blob.

Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Reviewed-by: Geoff Levand <geoff@infradead.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

memblock: add memblock_cap_memory_range()

Add memblock_cap_memory_range() which will remove all the memblock regions
except the memory range specified in the arguments. In addition, rework is
done on memblock_mem_limit_remove_map() to re-implement it using
memblock_cap_memory_range().

This function, like memblock_mem_limit_remove_map(), will not remove
memblocks with MEMMAP_NOMAP attribute as they may be mapped and accessed
later as "device memory."
See the commit a571d4eb55d8 ("mm/memblock.c: add new infrastructure to
address the mem limit issue").

This function is used, in a succeeding patch in the series of arm64 kdump
suuport, to limit the range of usable memory, or System RAM, on crash dump
kernel.
(Please note that "mem=" parameter is of little use for this purpose.)

Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Dennis Chen <dennis.chen@arm.com>
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

memblock: add memblock_clear_nomap()

This function, with a combination of memblock_mark_nomap(), will be used
in a later kdump patch for arm64 when it temporarily isolates some range
of memory from the other memory blocks in order to create a specific
kernel mapping at boot time.

Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

Merge branch 'arm64/common-sysreg' of git://git./linux/kernel/git/mark/linux into for-next/core

* 'arm64/common-sysreg' of git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux:
  arm64: sysreg: add Set/Way sys encodings
  arm64: sysreg: add register encodings used by KVM
  arm64: sysreg: add physical timer registers
  arm64: sysreg: subsume GICv3 sysreg definitions
  arm64: sysreg: add performance monitor registers
  arm64: sysreg: add debug system registers
  arm64: sysreg: sort by encoding

Merge tag 'acpi-arm64-for-v4.12' of git://git./linux/kernel/git/lpieralisi/linux into for-next/core

ACPI ARM64 specific changes for v4.12.

Patches contain:

- IORT kernel interface misc clean-ups
- IORT id mapping interface refactoring in preparation for platform
  MSI (IORT named components -> GIC ITS mappings) devid mapping code
- IORT id mapping implementation for named components nodes to ITS nodes,
  in order to provide the kernel with a firmware interface to map
  platform devices devids to GIC ITS components

* tag 'acpi-arm64-for-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/lpieralisi/linux:
  ACPI: platform: setup MSI domain for ACPI based platform device
  ACPI: platform-msi: retrieve devid from IORT
  ACPI/IORT: Introduce iort_node_map_platform_id() to retrieve dev id
  ACPI/IORT: Rename iort_node_map_rid() to make it generic
  ACPI/IORT: Rework iort_match_node_callback() return value handling
  ACPI/IORT: Add missing comment for iort_dev_find_its_id()
  ACPI/IORT: Fix the indentation in iort_scan_node()

arm64: efi: split Image code and data into separate PE/COFF sections

To prevent unintended modifications to the kernel text (malicious or
otherwise) while running the EFI stub, describe the kernel image as
two separate sections: a .text section with read-execute permissions,
covering .text, .rodata and .init.text, and a .data section with
read-write permissions, covering .init.data, .data and .bss.

This relies on the firmware to actually take the section permission
flags into account, but this is something that is currently being
implemented in EDK2, which means we will likely start seeing it in
the wild between one and two years from now.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: efi: replace open coded constants with symbolic ones

Replace open coded constants with symbolic ones throughout the
Image and the EFI headers. No binary level changes are intended.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: efi: remove pointless dummy .reloc section

The kernel's EFI PE/COFF header contains a dummy .reloc section, and
an explanatory comment that claims that this is required for the EFI
application loader to accept the Image as a relocatable image (i.e.,
one that can be loaded at any offset and fixed up in place)

This was inherited from the x86 implementation, which has elaborate host
tooling to mangle the PE/COFF header post-link time, and which populates
the .reloc section with a single dummy base relocation. On ARM, no such
tooling exists, and the .reloc section remains empty, and is never even
exposed via the BaseRelocationTable directory entry, which is where the
PE/COFF loader looks for it.

The PE/COFF spec is unclear about relocatable images that do not require
any fixups, but the EDK2 implementation, which is the de facto reference
for PE/COFF in the UEFI space, clearly does not care, and explicitly
mentions (in a comment) that relocatable images with no base relocations
are perfectly fine, as long as they don't have the RELOCS_STRIPPED
attribute set (which is not the case for our PE/COFF image)

So simply remove the .reloc section altogether.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Peter Jones <pjones@redhat.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: efi: remove forbidden values from the PE/COFF header

Bring the PE/COFF header in line with the PE/COFF spec, by setting
NumberOfSymbols to 0, and removing the section alignment flags.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: efi: clean up Image header after PE header has been split off

After having split off the PE header, clean up the bits that remain:
use .long consistently, merge two adjacent #ifdef CONFIG_EFI blocks,
fix the offset of the PE header pointer and remove the redundant .align
that follows it.

Also, since we will be eliminating all open coded constants from the
EFI header in subsequent patches, let's replace the open coded "ARM\x64"
magic number with its .ascii equivalent.

No changes to the resulting binary image are intended.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: efi: move EFI header and related data to a separate .S file

In preparation of yet another round of modifications to the PE/COFF
header, macroize it and move the definition into a separate source
file.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

include: pe.h: add some missing definitions

Add the missing IMAGE_FILE_MACHINE_ARM64 and IMAGE_DEBUG_TYPE_CODEVIEW
definitions.

We'll need them for the arm64 EFI stub...

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
[ardb: add IMAGE_DEBUG_TYPE_CODEVIEW as well]
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

include: pe.h: allow for use in assembly

Some of the definitions in include/linux/pe.h would be useful for the
EFI stub headers, where values are currently open-coded. Unfortunately
they cannot be used as some structures are also defined in pe.h without
!__ASSEMBLY__ guards.

This patch moves the structure definitions into an #ifdef __ASSEMBLY__
block, so that the common value definitions can be used from assembly.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: relocation testing module

This module tests the module loader's ELF relocation processing
routines. When loaded, it logs output like below.

    Relocation test:
    -------------------------------------------------------
    R_AARCH64_ABS64                 0xffff880000cccccc pass
    R_AARCH64_ABS32                 0x00000000f800cccc pass
    R_AARCH64_ABS16                 0x000000000000f8cc pass
    R_AARCH64_MOVW_SABS_Gn          0xffff880000cccccc pass
    R_AARCH64_MOVW_UABS_Gn          0xffff880000cccccc pass
    R_AARCH64_ADR_PREL_LO21         0xffffff9cf4d1a400 pass
    R_AARCH64_PREL64                0xffffff9cf4d1a400 pass
    R_AARCH64_PREL32                0xffffff9cf4d1a400 pass
    R_AARCH64_PREL16                0xffffff9cf4d1a400 pass

Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: cpufeature: Make ID reg accessor naming less counterintuitive

read_system_reg() can readily be confused with read_sysreg(),
whereas these are really quite different in their meaning.

This patches attempts to reduce the ambiguity be reserving "sysreg"
for the actual system register accessors.

read_system_reg() is instead renamed to read_sanitised_ftr_reg(),
to make it more obvious that the Linux-defined sanitised feature
register cache is being accessed here, not the underlying
architectural system registers.

cpufeature.c's internal __raw_read_system_reg() function is renamed
in line with its actual purpose: a form of read_sysreg() that
indexes on (non-compiletime-constant) encoding rather than symbolic
register name.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

ACPI: platform: setup MSI domain for ACPI based platform device

By allowing platform MSI domain to be created on ACPI platforms,
a platform device MSI domain can be set-up when it is probed.

In order to do that, the MSI domain the platform device connects
to should be retrieved, so the iort_get_platform_device_domain() is
introduced to retrieve the domain from the IORT kernel layer.

With the domain retrieved, we need a proper way to set the
domain to platform device.

Given that some platform devices (irqchips) require the MSI irqdomain
to be their interrupt parent domain, the MSI irqdomain should be
determined before platform device is probed but after the platform
device is allocated which means that the code setting up the MSI
irqdomain, ie acpi_configure_pmsi_domain() should be called in
acpi_platform_notify() (that is triggered after adding a device but
before the respective driver is probed) for the platform MSI domain
code set-up path to work properly.

Acked-by: Rafael J. Wysocki <rafael@kernel.org> [for glue.c]
Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
[lorenzo.pieralisi@arm.com: rewrote commit log]
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Tested-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Wei Xu <xuwei5@hisilicon.com>
Tested-by: Sinan Kaya <okaya@codeaurora.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Tomasz Nowicki <tn@semihalf.com>

ACPI: platform-msi: retrieve devid from IORT

For devices connecting to an ITS, the devices need to identify themself
through a devid; this devid is represented in the IORT table in named
component node [1] for platform devices, so this patch adds code that
scans the IORT table to retrieve the devices devid.

Add an IORT interface to collect ITS devices devid to carry out platform
devices MSI mappings with IORT tables.

[1]: https://static.docs.arm.com/den0049/b/DEN0049B_IO_Remapping_Table.pdf

Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
[lorenzo.pieralisi@arm.com: rewrote commit log/dropped ITS changes]
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Tested-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Wei Xu <xuwei5@hisilicon.com>
Tested-by: Sinan Kaya <okaya@codeaurora.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Tomasz Nowicki <tn@semihalf.com>
Cc: Thomas Gleixner <tglx@linutronix.de>

ACPI/IORT: Introduce iort_node_map_platform_id() to retrieve dev id

To retrieve dev id for IORT named components nodes there are
two steps involved (second is optional):

(1) Retrieve the initial id (this may well provide the final mapping)
(2) Map the id (optional if (1) represents the map type we need), this
is needed for use cases such as NC (named component) -> SMMU -> ITS
mappings.

the iort_node_get_id() function was created for step (1) above and
iort_node_map_rid() for step (2).

Create a wrapper, named iort_node_map_platform_id(), that encompasses
the two steps at once to retrieve the dev id to provide steps (1)-(2)
functionality.

iort_node_map_platform_id() will handle the parent type so type handling
in iort_node_get_id() is duplicated, remove it and update current
iort_node_get_id() users to move them over to iort_node_map_platform_id().

Suggested-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Suggested-by: Tomasz Nowicki <tn@semihalf.com>
Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
[lorenzo.pieralisi@arm.com: rewrote commit log]
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Tested-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Wei Xu <xuwei5@hisilicon.com>
Tested-by: Sinan Kaya <okaya@codeaurora.org>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Tomasz Nowicki <tn@semihalf.com>

ACPI/IORT: Rename iort_node_map_rid() to make it generic

iort_node_map_rid() was designed to take an input id (that is not
necessarily a PCI requester id) and map it to an output id (eg an SMMU
streamid or an ITS deviceid) according to the mappings provided by an
IORT node mapping entries. This means that the iort_node_map_rid() input
id is not always a PCI requester id as its name, parameters and local
variables suggest, which is misleading.

Apply the s/rid/id substitution to the iort_node_map_rid() mapping
function and its users to make sure its intended usage is clearer.

Suggested-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Tested-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Wei Xu <xuwei5@hisilicon.com>
Tested-by: Sinan Kaya <okaya@codeaurora.org>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Tomasz Nowicki <tn@semihalf.com>

arm64: drop unnecessary newlines in show_regs()

There are two unnecessary newlines, one is in show_regs, another
is in __show_regs(), drop them.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: mm: set the contiguous bit for kernel mappings where appropriate

This is the third attempt at enabling the use of contiguous hints for
kernel mappings. The most recent attempt 0bfc445dec9d was reverted after
it turned out that updating permission attributes on live contiguous ranges
may result in TLB conflicts. So this time, the contiguous hint is not set
for .rodata or for the linear alias of .text/.rodata, both of which are
mapped read-write initially, and remapped read-only at a later stage.
(Note that the latter region could also be unmapped and remapped again
with updated permission attributes, given that the region, while live, is
only mapped for the convenience of the hibernation code, but that also
means the TLB footprint is negligible anyway, so why bother)

This enables the following contiguous range sizes for the virtual mapping
of the kernel image, and for the linear mapping:

          granule size |  cont PTE  |  cont PMD  |
          -------------+------------+------------+
               4 KB    |    64 KB   |   32 MB    |
              16 KB    |     2 MB   |    1 GB*   |
              64 KB    |     2 MB   |   16 GB*   |

* Only when built for 3 or more levels of translation. This is due to the
  fact that a 2 level configuration only consists of PGDs and PTEs, and the
  added complexity of dealing with folded PMDs is not justified considering
  that 16 GB contiguous ranges are likely to be ignored by the hardware (and
  16k/2 levels is a niche configuration)

Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64/mm: remove pointless map/unmap sequences when creating page tables

The routines __pud_populate and __pmd_populate only create a table
entry at their respective level which refers to the next level page
by its physical address, so there is no reason to map this page and
then unmap it immediately after.

Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64/mmu: replace 'page_mappings_only' parameter with flags argument

In preparation of extending the policy for manipulating kernel mappings
with whether or not contiguous hints may be used in the page tables,
replace the bool 'page_mappings_only' with a flags field and a flag
NO_BLOCK_MAPPINGS.

Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64/mmu: add contiguous bit to sanity bug check

A mapping with the contiguous bit cannot be safely manipulated while
live, regardless of whether the bit changes between the old and new
mapping. So take this into account when deciding whether the change
is safe.

Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64/mmu: ignore debug_pagealloc for kernel segments

The debug_pagealloc facility manipulates kernel mappings in the linear
region at page granularity to detect out of bounds or use-after-free
accesses. Since the kernel segments are not allocated dynamically,
there is no point in taking the debug_pagealloc_enabled flag into
account for them, and we can use block mappings unconditionally.

Note that this applies equally to the linear alias of text/rodata:
we will never have dynamic allocations there given that the same
memory is statically in use by the kernel image.

Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64/mmu: align alloc_init_pte prototype with pmd/pud versions

Align the function prototype of alloc_init_pte() with its pmd and pud
counterparts by replacing the pfn parameter with the equivalent physical
address.

Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: mmu: apply strict permissions to .init.text and .init.data

To avoid having mappings that are writable and executable at the same
time, split the init region into a .init.text region that is mapped
read-only, and a .init.data region that is mapped non-executable.

This is possible now that the alternative patching occurs via the linear
mapping, and the linear alias of the init region is always mapped writable
(but never executable).

Since the alternatives descriptions themselves are read-only data, move
those into the .init.text region.

Reviewed-by: Laura Abbott <labbott@redhat.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: mmu: map .text as read-only from the outset

Now that alternatives patching code no longer relies on the primary
mapping of .text being writable, we can remove the code that removes
the writable permissions post-init time, and map it read-only from
the outset.

To preserve the existing behavior under rodata=off, which is relied
upon by external debuggers to manage software breakpoints (as pointed
out by Mark), add an early_param() check for rodata=, and use RWX
permissions if it set to 'off'.

Reviewed-by: Laura Abbott <labbott@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: alternatives: apply boot time fixups via the linear mapping

One important rule of thumb when desiging a secure software system is
that memory should never be writable and executable at the same time.
We mostly adhere to this rule in the kernel, except at boot time, when
regions may be mapped RWX until after we are done applying alternatives
or making other one-off changes.

For the alternative patching, we can improve the situation by applying
the fixups via the linear mapping, which is never mapped with executable
permissions. So map the linear alias of .text with RW- permissions
initially, and remove the write permissions as soon as alternative
patching has completed.

Reviewed-by: Laura Abbott <labbott@redhat.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: mmu: move TLB maintenance from callers to create_mapping_late()

In preparation of refactoring the kernel mapping logic so that text regions
are never mapped writable, which would require adding explicit TLB
maintenance to new call sites of create_mapping_late() (which is currently
invoked twice from the same function), move the TLB maintenance from the
call site into create_mapping_late() itself, and change it from a full
TLB flush into a flush by VA, which is more appropriate here.

Also, given that create_mapping_late() has evolved into a routine that only
updates protection bits on existing mappings, rename it to
update_mapping_prot()

Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm: kvm: move kvm_vgic_global_state out of .text section

The kvm_vgic_global_state struct contains a static key which is
written to by jump_label_init() at boot time. So in preparation of
making .text regions truly (well, almost truly) read-only, mark
kvm_vgic_global_state __ro_after_init so it moves to the .rodata
section instead.

Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Laura Abbott <labbott@redhat.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: Revert "arm64: kaslr: fix breakage with CONFIG_MODVERSIONS=y"

This reverts commit 9c0e83c371cf4696926c95f9c8c77cd6ea803426, which
is no longer needed now that the modversions code plays nice with
relocatable PIE kernels.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: move !VHE work to end of el2_setup

We only need to initialise sctlr_el1 if we're installing an EL2 stub, so
we may as well defer this until we're doing so. Similarly, we can defer
intialising CPTR_EL2 until then, as we do not access any trapped
functionality as part of el2_setup.

This patch modified el2_setup accordingly, allowing us to remove a
branch and simplify the code flow.

Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: reduce el2_setup branching

The early el2_setup code is a little convoluted, with two branches where
one would do. This makes the code more painful to read than is
necessary.

We can remove a branch and simplify the logic by moving the early return
in the booted-at-EL1 case earlier in the function. This separates it
from all the setup logic that only makes sense for EL2.

Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: struct debug_info: Check CONFIG_HAVE_HW_BREAKPOINT

Check if CONFIG_HAVE_HW_BREAKPOINT is enabled before compiling in extra
data required for hardware breakpoints. Compiling out this code when hw
breakpoints are disabled saves about 272 bytes per struct task_struct.

Signed-off-by: Chris Redmon <credmonster@gmail.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: Add support for DMA_ATTR_FORCE_CONTIGUOUS to IOMMU

Add support for allocating physically contiguous DMA buffers on arm64
systems with an IOMMU.  This can be useful when two or more devices
with different memory requirements are involved in buffer sharing.

Note that as this uses the CMA allocator, setting the
DMA_ATTR_FORCE_CONTIGUOUS attribute has a runtime-dependency on
CONFIG_DMA_CMA, just like on arm32.

For arm64 systems using swiotlb, no changes are needed to support the
allocation of physically contiguous DMA buffers:
  - swiotlb always uses physically contiguous buffers (up to
    IO_TLB_SEGSIZE = 128 pages),
  - arm64's __dma_alloc_coherent() already calls
    dma_alloc_from_contiguous() when CMA is available.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: define BUG() instruction without CONFIG_BUG

This mirrors commit e9c38ceba8d9 ("ARM: 8455/1: define __BUG as
asm(BUG_INSTR) without CONFIG_BUG") to make the behavior of
arm64 consistent with arm and x86, and avoids lots of warnings in
randconfig builds, such as:

kernel/seccomp.c: In function '__seccomp_filter':
kernel/seccomp.c:666:1: error: no return statement in function returning non-void [-Werror=return-type]

Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

ACPI/IORT: Rework iort_match_node_callback() return value handling

The return value handling in iort_match_node_callback() is
too convoluted; update the iort_match_node_callback() return
value handling to make code easier to read.

Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
[lorenzo.pieralisi@arm.com: rewrote commit log]
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Tested-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Wei Xu <xuwei5@hisilicon.com>
Tested-by: Sinan Kaya <okaya@codeaurora.org>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Tomasz Nowicki <tn@semihalf.com>

ACPI/IORT: Add missing comment for iort_dev_find_its_id()

Add missing req_id parameter to the iort_dev_find_its_id() function
kernel-doc comment.

Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Tested-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Wei Xu <xuwei5@hisilicon.com>
Tested-by: Sinan Kaya <okaya@codeaurora.org>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Tomasz Nowicki <tn@semihalf.com>

ACPI/IORT: Fix the indentation in iort_scan_node()

The indentation in the iort_scan_node() function is wrong, fix it.

Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
[lorenzo.pieralisi@arm.com: massaged commit log]
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Tested-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Wei Xu <xuwei5@hisilicon.com>
Tested-by: Sinan Kaya <okaya@codeaurora.org>

arm64: v8.3: Support for weaker release consistency

ARMv8.3 adds new instructions to support Release Consistent
processor consistent (RCpc) model, which is weaker than the
RCsc model.

Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: v8.3: Support for complex number instructions

ARM v8.3 adds support for new instructions to aid floating-point
multiplication and addition of complex numbers. Expose the support
via HWCAP and MRS emulation

Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: v8.3: Support for Javascript conversion instruction

ARMv8.3 adds support for a new instruction to perform conversion
from double precision floating point to integer to match the
architected behaviour of the equivalent Javascript conversion.
Expose the availability via HWCAP and MRS emulation.

Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: KVM: Add support for VPIPT I-caches

A VPIPT I-cache has two main properties:

1. Lines allocated into the cache are tagged by VMID and a lookup can
only hit lines that were allocated with the current VMID.

2. I-cache invalidation from EL1/0 only invalidates lines that match the
current VMID of the CPU doing the invalidation.

This can cause issues with non-VHE configurations, where the host runs
at EL1 and wants to invalidate I-cache entries for a guest running with
a different VMID. VHE is not affected, because the host runs at EL2 and
I-cache invalidation applies as expected.

This patch solves the problem by invalidating the I-cache when unmapping
a page at stage 2 on a system with a VPIPT I-cache but not running with
VHE enabled. Hopefully this is an obscure enough configuration that the
overhead isn't anything to worry about, although it does mean that the
by-range I-cache invalidation currently performed when mapping at stage
2 can be elided on such systems, because the I-cache will be clean for
the guest VMID following a rollover event.

Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: cache: Identify VPIPT I-caches

Add support for detecting VPIPT I-caches, as introduced by ARMv8.2.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: cache: Merge cachetype.h into cache.h

cachetype.h and cache.h are small and both obviously related to caches.
Merge them together to reduce clutter.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: cache: Remove support for ASID-tagged VIVT I-caches

As a recent change to ARMv8, ASID-tagged VIVT I-caches are removed
retrospectively from the architecture. Consequently, we don't need to
support them in Linux either.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: cacheinfo: Remove CCSIDR-based cache information probing

The CCSIDR_EL1.{NumSets,Associativity,LineSize} fields are only for use
in conjunction with set/way cache maintenance and are not guaranteed to
represent the actual microarchitectural features of a design.

The architecture explicitly states:

| You cannot make any inference about the actual sizes of caches based
| on these parameters.

Furthermore, CCSIDR_EL1.{WT,WB,RA,WA} have been removed retrospectively
from ARMv8 and are now considered to be UNKNOWN.

Since the kernel doesn't make use of set/way cache maintenance and it is
not possible for userspace to execute these instructions, we have no
need for the CCSIDR information in the kernel.

This patch removes the accessors, along with the related portions of the
cacheinfo support, which should instead be reintroduced when firmware has
a mechanism to provide us with reliable information.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: cpuinfo: remove I-cache VIPT aliasing detection

The CCSIDR_EL1.{NumSets,Associativity,LineSize} fields are only for use
in conjunction with set/way cache maintenance and are not guaranteed to
represent the actual microarchitectural features of a design.

The architecture explicitly states:

| You cannot make any inference about the actual sizes of caches based
| on these parameters.

We currently use these fields to determine whether or the I-cache is
aliasing, which is bogus and known to break on some platforms. Instead,
assume the I-cache is always aliasing if it advertises a VIPT policy.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

Linux 4.11-rc3

mm/swap: don't BUG_ON() due to uninitialized swap slot cache

This BUG_ON() triggered for me once at shutdown, and I don't see a
reason for the check. The code correctly checks whether the swap slot
cache is usable or not, so an uninitialized swap slot cache is not
actually problematic afaik.

I've temporarily just switched the BUG_ON() to a WARN_ON_ONCE(), since
I'm not sure why that seemingly pointless check was there. I suspect
the real fix is to just remove it entirely, but for now we'll warn about
it but not bring the machine down.

Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'powerpc-4.11-5' of git://git./linux/kernel/git/powerpc/linux

Pull more powerpc fixes from Michael Ellerman:
"A couple of minor powerpc fixes for 4.11:

   - wire up statx() syscall

   - don't print a warning on memory hotplug when HPT resizing isn't
     available

  Thanks to: David Gibson, Chandan Rajendra"

* tag 'powerpc-4.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/pseries: Don't give a warning when HPT resizing isn't available
  powerpc: Wire up statx() syscall

Merge branch 'parisc-4.11-2' of git://git./linux/kernel/git/deller/parisc-linux

Pull parisc fixes from Helge Deller:

- Mikulas Patocka added support for R_PARISC_SECREL32 relocations in
   modules with CONFIG_MODVERSIONS.

- Dave Anglin optimized the cache flushing for vmap ranges.

- Arvind Yadav provided a fix for a potential NULL pointer dereference
   in the parisc perf code (and some code cleanups).

- I wired up the new statx system call, fixed some compiler warnings
   with the access_ok() macro and fixed shutdown code to really halt a
   system at shutdown instead of crashing & rebooting.

* 'parisc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Fix system shutdown halt
  parisc: perf: Fix potential NULL pointer dereference
  parisc: Avoid compiler warnings with access_ok()
  parisc: Wire up statx system call
  parisc: Optimize flush_kernel_vmap_range and invalidate_kernel_vmap_range
  parisc: support R_PARISC_SECREL32 relocation in modules

Merge git://git./linux/kernel/git/nab/target-pending

Pull SCSI target fixes from Nicholas Bellinger:
"The bulk of the changes are in qla2xxx target driver code to address
  various issues found during Cavium/QLogic's internal testing (stable
  CC's included), along with a few other stability and smaller
  miscellaneous improvements.

  There are also a couple of different patch sets from Mike Christie,
  which have been a result of his work to use target-core ALUA logic
  together with tcm-user backend driver.

  Finally, a patch to address some long standing issues with
  pass-through SCSI export of TYPE_TAPE + TYPE_MEDIUM_CHANGER devices,
  which will make folks using physical (or virtual) magnetic tape happy"

* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (28 commits)
  qla2xxx: Update driver version to 9.00.00.00-k
  qla2xxx: Fix delayed response to command for loop mode/direct connect.
  qla2xxx: Change scsi host lookup method.
  qla2xxx: Add DebugFS node to display Port Database
  qla2xxx: Use IOCB interface to submit non-critical MBX.
  qla2xxx: Add async new target notification
  qla2xxx: Export DIF stats via debugfs
  qla2xxx: Improve T10-DIF/PI handling in driver.
  qla2xxx: Allow relogin to proceed if remote login did not finish
  qla2xxx: Fix sess_lock & hardware_lock lock order problem.
  qla2xxx: Fix inadequate lock protection for ABTS.
  qla2xxx: Fix request queue corruption.
  qla2xxx: Fix memory leak for abts processing
  qla2xxx: Allow vref count to timeout on vport delete.
  tcmu: Convert cmd_time_out into backend device attribute
  tcmu: make cmd timeout configurable
  tcmu: add helper to check if dev was configured
  target: fix race during implicit transition work flushes
  target: allow userspace to set state to transitioning
  target: fix ALUA transition timeout handling
  ...

Merge branch 'libnvdimm-fixes' of git://git./linux/kernel/git/nvdimm/nvdimm

Pull device-dax fixes from Dan Williams:
"The device-dax driver was not being careful to handle falling back to
  smaller fault-granularity sizes.

  The driver already fails fault attempts that are smaller than the
  device's alignment, but it also needs to handle the cases where a
  larger page mapping could be established. For simplicity of the
  immediate fix the implementation just signals VM_FAULT_FALLBACK until
  fault-size == device-alignment.

  One fix is for -stable to address pmd-to-pte fallback from the
  original implementation, another fix is for the new (introduced in
  4.11-rc1) pud-to-pmd regression, and a typo fix comes along for the
  ride.

  These have received a build success notification from the kbuild
  robot"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  device-dax: fix debug output typo
  device-dax: fix pud fault fallback handling
  device-dax: fix pmd/pte fault fallback handling

qla2xxx: Update driver version to 9.00.00.00-k

Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
signed-off-by: Giridhar Malavali <giridhar.malavali@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: Fix delayed response to command for loop mode/direct connect.

Current driver wait for FW to be in the ready state before
processing in-coming commands. For Arbitrated Loop or
Point-to- Point (not switch), FW Ready state can take a while.
FW will transition to ready state after all Nports have been
logged in. In the mean time, certain initiators have completed
the login and starts IO. Driver needs to start processing all
queues if FW is already started.

Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: Change scsi host lookup method.

For target mode, when new scsi command arrive, driver first performs
a look up of the SCSI Host. The current look up method is based on
the ALPA portion of the NPort ID. For Cisco switch, the ALPA can
not be used as the index. Instead, the new search method is based
on the full value of the Nport_ID via btree lib.

Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: Add DebugFS node to display Port Database

Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Giridhar Malavali <giridhar.malavali@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: Use IOCB interface to submit non-critical MBX.

The Mailbox interface is currently over subscribed. We like
to reserve the Mailbox interface for the chip managment and
link initialization. Any non essential Mailbox command will
be routed through the IOCB interface. The IOCB interface is
able to absorb more commands.

Following commands are being routed through IOCB interface

- Get ID List (007Ch)
- Get Port DB (0064h)
- Get Link Priv Stats (006Dh)

Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: Add async new target notification

Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: Export DIF stats via debugfs

Signed-off-by: Anil Gurumurthy <anil.gurumurthy@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: Improve T10-DIF/PI handling in driver.

Add routines to support T10 DIF tag.

Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Anil Gurumurthy <anil.gurumurthy@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: Allow relogin to proceed if remote login did not finish

If the remote port have started the login process, then the
PLOGI and PRLI should be back to back. Driver will allow
the remote port to complete the process. For the case where
the remote port decide to back off from sending PRLI, this
local port sets an expiration timer for the PRLI. Once the
expiration time passes, the relogin retry logic is allowed
to go through and perform login with the remote port.

Signed-off-by: Quinn Tran <quinn.tran@qlogic.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: Fix sess_lock & hardware_lock lock order problem.

The main lock that needs to be held for CMD or TMR submission
to upper layer is the sess_lock. The sess_lock is used to
serialize cmd submission and session deletion. The addition
of hardware_lock being held is not necessary. This patch removes
hardware_lock dependency from CMD/TMR submission.

Use hardware_lock only for error response in this case.

Path1
       CPU0                    CPU1
       ----                    ----
  lock(&(&ha->tgt.sess_lock)->rlock);
                               lock(&(&ha->hardware_lock)->rlock);
                               lock(&(&ha->tgt.sess_lock)->rlock);
  lock(&(&ha->hardware_lock)->rlock);

Path2/deadlock
*** DEADLOCK ***
Call Trace:
dump_stack+0x85/0xc2
print_circular_bug+0x1e3/0x250
__lock_acquire+0x1425/0x1620
lock_acquire+0xbf/0x210
_raw_spin_lock_irqsave+0x53/0x70
qlt_sess_work_fn+0x21d/0x480 [qla2xxx]
process_one_work+0x1f4/0x6e0

Cc: <stable@vger.kernel.org>
Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>
Reported-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: Fix inadequate lock protection for ABTS.

Normally, ABTS is sent to Target Core as Task MGMT command.
In the case of error, qla2xxx needs to send response, hardware_lock
is required to prevent request queue corruption.

Cc: <stable@vger.kernel.org>
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: Fix request queue corruption.

When FW notify driver or driver detects low FW resource,
driver tries to send out Busy SCSI Status to tell Initiator
side to back off. During the send process, the lock was not held.

Cc: <stable@vger.kernel.org>
Signed-off-by: Quinn Tran <quinn.tran@qlogic.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: Fix memory leak for abts processing

Cc: <stable@vger.kernel.org>
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: Allow vref count to timeout on vport delete.

Cc: <stable@vger.kernel.org>
Signed-off-by: Joe Carnuccio <joe.carnuccio@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

tcmu: Convert cmd_time_out into backend device attribute

Instead of putting cmd_time_out under ../target/core/user_0/foo/control,
which has historically been used by parameters needed for initial
backend device configuration, go ahead and move cmd_time_out into
a backend device attribute.

In order to do this, tcmu_module_init() has been updated to create
a local struct configfs_attribute **tcmu_attrs, that is based upon
the existing passthrough_attrib_attrs along with the new cmd_time_out
attribute. Once **tcm_attrs has been setup, go ahead and point
it at tcmu_ops->tb_dev_attrib_attrs so it's picked up by target-core.

Also following MNC's previous change, ->cmd_time_out is stored in
milliseconds but exposed via configfs in seconds. Also, note this
patch restricts the modification of ->cmd_time_out to before +
after the TCMU device has been configured, but not while it has
active fabric exports.

Cc: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

tcmu: make cmd timeout configurable

A single daemon could implement multiple types of devices
using multuple types of real devices that may not support
restarting from crashes and/or handling tcmu timeouts. This
makes the cmd timeout configurable, so handlers that do not
support it can turn if off for now.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

tcmu: add helper to check if dev was configured

This adds a helper to check if the dev was configured. It
will be used in the next patch to prevent updates to some
config settings after the device has been setup.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

Merge tag 'openrisc-for-linus' of git://github.com/openrisc/linux

Pull OpenRISC fixes from Stafford Horne:
"OpenRISC fixes for build issues that were exposed by kbuild robots
  after 4.11 merge. All from allmodconfig builds. This includes:

   - bug in the handling of 8-byte get_user() calls

   - module build failure due to multile missing symbol exports"

* tag 'openrisc-for-linus' of git://github.com/openrisc/linux:
  openrisc: Export symbols needed by modules
  openrisc: fix issue handling 8 byte get_user calls
  openrisc: xchg: fix `computed is not used` warning

target: fix race during implicit transition work flushes

This fixes the following races:

1. core_alua_do_transition_tg_pt could have read
tg_pt_gp_alua_access_state and gone into this if chunk:

if (!explicit &&
atomic_read(&tg_pt_gp->tg_pt_gp_alua_access_state) ==
ALUA_ACCESS_STATE_TRANSITION) {

and then core_alua_do_transition_tg_pt_work could update the
state. core_alua_do_transition_tg_pt would then only set
tg_pt_gp_alua_pending_state and the tg_pt_gp_alua_access_state would
not get updated with the second calls state.

2. core_alua_do_transition_tg_pt could be setting
tg_pt_gp_transition_complete while the tg_pt_gp_transition_work
is already completing. core_alua_do_transition_tg_pt then waits on the
completion that will never be called.

To handle these issues, we just call flush_work which will return when
core_alua_do_transition_tg_pt_work has completed so there is no need
to do the complete/wait. And, if core_alua_do_transition_tg_pt_work
was running, instead of trying to sneak in the state change, we just
schedule up another core_alua_do_transition_tg_pt_work call.

Note that this does not handle a possible race where there are multiple
threads call core_alua_do_transition_tg_pt at the same time. I think
we need a mutex in target_tg_pt_gp_alua_access_state_store.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: allow userspace to set state to transitioning

Userspace target_core_user handlers like tcmu-runner may want to set the
ALUA state to transitioning while it does implicit transitions. This
patch allows that state when set from configfs.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: fix ALUA transition timeout handling

The implicit transition time tells initiators the min time
to wait before timing out a transition. We currently schedule
the transition to occur in tg_pt_gp_implicit_trans_secs
seconds so there is no room for delays. If
core_alua_do_transition_tg_pt_work->core_alua_update_tpg_primary_metadata
needs to write out info to a remote file, then the initiator can
easily time out the operation.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Use system workqueue for ALUA transitions

If tcmu-runner is processing a STPG and needs to change the kernel's
ALUA state then we cannot use the same work queue for task management
requests and ALUA transitions, because we could deadlock. The problem
occurs when a STPG times out before tcmu-runner is able to
call into target_tg_pt_gp_alua_access_state_store->
core_alua_do_port_transition -> core_alua_do_transition_tg_pt ->
queue_work. In this case, the tmr is on the work queue waiting for
the STPG to complete, but the STPG transition is now queued behind
the waiting tmr.

Note:
This bug will also be fixed by this patch:
http://www.spinics.net/lists/target-devel/msg14560.html
which switches the tmr code to use the system workqueues.

For both, I am not sure if we need a dedicated workqueue since
it is not a performance path and I do not think we need WQ_MEM_RECLAIM
to make forward progress to free up memory like the block layer does.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: fail ALUA transitions for pscsi

We do not setup the LU group for pscsi devices, so if you write
a state to alua_access_state that will cause a transition you will
get a NULL pointer dereference.

This patch will fail attempts to try and transition the path
for backend devices that set the TRANSPORT_FLAG_PASSTHROUGH_ALUA
flag.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: allow ALUA setup for some passthrough backends

This patch allows passthrough backends to use the core/base LIO
ALUA setup and state checks, but still handle the execution of
commands.

This will allow the target_core_user module to execute STPG and RTPG
in userspace, and not have to duplicate the ALUA state checks, path
information (needed so we can check if command is executable on
specific paths) and setup (rtslib sets/updates the configfs ALUA
interface like it does for iblock or file).

For STPG, the target_core_user userspace daemon, tcmu-runner will
still execute the STPG, and to update the core/base LIO state it
will use the existing configfs interface. For RTPG, tcmu-runner
will loop over configfs and/or cache the state.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

tcmu: return on first Opt parse failure

We only were returing failure if the last opt to be parsed failed.
This has a return failure when we first detect a failure.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

tcmu: allow hw_max_sectors greater than 128

tcmu hard codes the hw_max_sectors to 128 which is a litle small.
Userspace uses the max_sectors to report the optimal IO size and
some initiators perform better with larger IOs (open-iscsi seems
to do better with 256 to 512 depending on the test).

(Fix do not display hw max sectors twice - MNC)

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Drop pointless tfo->check_stop_free check

All in-tree fabric drivers provide a tfo->check_stop_free(),
so there is no need to do the extra check within existing
transport_cmd_check_stop_to_fabric() code.

Just to be sure, add a check in target_fabric_tf_ops_check()
to notify any out-of-tree drivers that might be missing it.

Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

parisc: Fix system shutdown halt

On those parisc machines which don't provide a software power off
function, the system currently kills the init process at the end of a
shutdown and unexpectedly restarts insteads of halting.
Fix it by adding a loop which will not return.

Signed-off-by: Helge Deller <deller@gmx.de>
Cc: stable@vger.kernel.org # 4.9+

parisc: perf: Fix potential NULL pointer dereference

Fix potential NULL pointer dereference and clean up
coding style errors (code indent, trailing whitespaces).

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Helge Deller <deller@gmx.de>

Merge branch 'smp-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull CPU hotplug fix from Thomas Gleixner:
"A single fix preventing the concurrent execution of the CPU hotplug
  callback install/invocation machinery. Long standing bug caused by a
  massive brain slip of that Gleixner dude, which went unnoticed for
  almost a year"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Serialize callback invocations proper

Merge tag 'pm-4.11-rc3' of git://git./linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
"These fix a few more intel_pstate issues and one small issue in the
  cpufreq core.

  Specifics:

   - Fix breakage in the intel_pstate's debugfs interface for PID
     controller tuning (Rafael Wysocki)

   - Fix computations related to P-state limits in intel_pstate to avoid
     excessive rounding errors leading to visible inaccuracies (Srinivas
     Pandruvada, Rafael Wysocki)

   - Add a missing newline to a message printed by one function in the
     cpufreq core and clean up that function (Rafael Wysocki)"

* tag 'pm-4.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: Fix and clean up show_cpuinfo_cur_freq()
  cpufreq: intel_pstate: Avoid percentages in limits-related computations
  cpufreq: intel_pstate: Correct frequency setting in the HWP mode
  cpufreq: intel_pstate: Update pid_params.sample_rate_ns in pid_param_set()

Merge branches 'pm-cpufreq-fixes' and 'intel_pstate-fixes'

* pm-cpufreq-fixes:
  cpufreq: Fix and clean up show_cpuinfo_cur_freq()

* intel_pstate-fixes:
  cpufreq: intel_pstate: Avoid percentages in limits-related computations
  cpufreq: intel_pstate: Correct frequency setting in the HWP mode
  cpufreq: intel_pstate: Update pid_params.sample_rate_ns in pid_param_set()

Merge tag 'nfs-for-4.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
"We have a handful of stable fixes to fix kernel warnings and other
  bugs that have been around for a while. We've also found a few other
  reference counting bugs and memory leaks since the initial 4.11 pull.

  Stable Bugfixes:
   - Fix decrementing nrequests in NFS v4.2 COPY to fix kernel warnings
   - Prevent a double free in async nfs4_exchange_id()
   - Squelch a kbuild sparse complaint for xprtrdma

  Other Bugfixes:
   - Fix a typo (NFS_ATTR_FATTR_GROUP_NAME) that causes a memory leak
   - Fix a reference leak that causes kernel warnings
   - Make nfs4_cb_sv_ops static to fix a sparse warning
   - Respect a server's max size in CREATE_SESSION
   - Handle errors from nfs4_pnfs_ds_connect
   - Flexfiles layout shouldn't mark devices as unavailable"

* tag 'nfs-for-4.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  pNFS/flexfiles: never nfs4_mark_deviceid_unavailable
  pNFS: return status from nfs4_pnfs_ds_connect
  NFSv4.1 respect server's max size in CREATE_SESSION
  NFS prevent double free in async nfs4_exchange_id
  nfs: make nfs4_cb_sv_ops static
  xprtrdma: Squelch kbuild sparse complaint
  NFS: fix the fault nrequests decreasing for nfs_inode COPY
  NFSv4: fix a reference leak caused WARNING messages
  nfs4: fix a typo of NFS_ATTR_FATTR_GROUP_NAME

Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
"An assorted pile of fixes along with some hardware enablement:

   - a fix for a KASAN / branch profiling related boot failure

   - some more fallout of the PUD rework

   - a fix for the Always Running Timer which is not initialized when
     the TSC frequency is known at boot time (via MSR/CPUID)

   - a resource leak fix for the RDT filesystem

   - another unwinder corner case fixup

   - removal of the warning for duplicate NMI handlers because there are
     legitimate cases where more than one handler can be registered at
     the last level

   - make a function static - found by sparse

   - a set of updates for the Intel MID platform which got delayed due
     to merge ordering constraints. It's hardware enablement for a non
     mainstream platform, so there is no risk"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mpx: Make unnecessarily global function static
  x86/intel_rdt: Put group node in rdtgroup_kn_unlock
  x86/unwind: Fix last frame check for aligned function stacks
  mm, x86: Fix native_pud_clear build error
  x86/kasan: Fix boot with KASAN=y and PROFILE_ANNOTATED_BRANCHES=y
  x86/platform/intel-mid: Add power button support for Merrifield
  x86/platform/intel-mid: Use common power off sequence
  x86/platform: Remove warning message for duplicate NMI handlers
  x86/tsc: Fix ART for TSC_KNOWN_FREQ
  x86/platform/intel-mid: Correct MSI IRQ line for watchdog device

Merge branch 'x86-acpi-for-linus' of git://git./linux/kernel/git/tip/tip

Pull x86 acpi fixes from Thomas Gleixner:
"This update deals with the fallout of the recent work to make
  cpuid/node mappings persistent.

  It turned out that the boot time ACPI based mapping tripped over ACPI
  inconsistencies and caused regressions. It's partially reverted and
  the fragile part replaced by an implementation which makes the mapping
  persistent when a CPU goes online for the first time"

* 'x86-acpi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  acpi/processor: Check for duplicate processor ids at hotplug time
  acpi/processor: Implement DEVICE operator for processor enumeration
  x86/acpi: Restore the order of CPU IDs
  Revert"x86/acpi: Enable MADT APIs to return disabled apicids"
  Revert "x86/acpi: Set persistent cpuid <-> nodeid mapping when booting"