Linus Torvalds [Wed, 6 Mar 2019 20:59:46 +0000 (12:59 -0800)]
Merge tag 'pm-5.1-rc1' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These are PM-runtime framework changes to use ktime instead of jiffies
for accounting, new PM core flag to mark devices that don't need any
form of power management, cpuidle updates including driver API
documentation and a new governor, cpufreq updates including a new
driver for Armada 8K, thermal cleanups and more, some energy-aware
scheduling (EAS) enabling changes, new chips support in the intel_idle
and RAPL drivers and assorted cleanups in some other places.
Specifics:
- Update the PM-runtime framework to use ktime instead of jiffies for
accounting (Thara Gopinath, Vincent Guittot)
- Optimize the autosuspend code in the PM-runtime framework somewhat
(Ladislav Michl)
- Add a PM core flag to mark devices that don't need any form of
power management (Sudeep Holla)
- Introduce driver API documentation for cpuidle and add a new
cpuidle governor for tickless systems (Rafael Wysocki)
- Add Jacobsville support to the intel_idle driver (Zhang Rui)
- Clean up a cpuidle core header file and the cpuidle-dt and ACPI
processor-idle drivers (Yangtao Li, Joseph Lo, Yazen Ghannam)
- Add new cpufreq driver for Armada 8K (Gregory Clement)
- Fix and clean up cpufreq core (Rafael Wysocki, Viresh Kumar, Amit
Kucheria)
- Add support for light-weight tear-down and bring-up of CPUs to the
cpufreq core and use it in the cpufreq-dt driver (Viresh Kumar)
- Fix cpu_cooling Kconfig dependencies, add support for CPU cooling
auto-registration to the cpufreq core and use it in multiple
cpufreq drivers (Amit Kucheria)
- Fix some minor issues and do some cleanups in the davinci,
e_powersaver, ap806, s5pv210, qcom and kryo cpufreq drivers
(Bartosz Golaszewski, Gustavo Silva, Julia Lawall, Paweł Chmiel,
Taniya Das, Viresh Kumar)
- Add a Hisilicon CPPC quirk to the cppc_cpufreq driver (Xiongfeng
Wang)
- Clean up the intel_pstate and acpi-cpufreq drivers (Erwan Velu,
Rafael Wysocki)
- Clean up multiple cpufreq drivers (Yangtao Li)
- Update cpufreq-related MAINTAINERS entries (Baruch Siach, Lukas
Bulwahn)
- Add support for exposing the Energy Model via debugfs and make
multiple cpufreq drivers register an Energy Model to support
energy-aware scheduling (Quentin Perret, Dietmar Eggemann, Matthias
Kaehlcke)
- Add Ice Lake mobile and Jacobsville support to the Intel RAPL
power-capping driver (Gayatri Kammela, Zhang Rui)
- Add a power estimation helper to the operating performance points
(OPP) framework and clean up a core function in it (Quentin Perret,
Viresh Kumar)
- Make minor improvements in the generic power domains (genpd), OPP
and system suspend frameworks and in the PM core (Aditya Pakki,
Douglas Anderson, Greg Kroah-Hartman, Rafael Wysocki, Yangtao Li)"
* tag 'pm-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (80 commits)
cpufreq: kryo: Release OPP tables on module removal
cpufreq: ap806: add missing of_node_put after of_device_is_available
cpufreq: acpi-cpufreq: Report if CPU doesn't support boost technologies
cpufreq: Pass updated policy to driver ->setpolicy() callback
cpufreq: Fix two debug messages in cpufreq_set_policy()
cpufreq: Reorder and simplify cpufreq_update_policy()
cpufreq: Add kerneldoc comments for two core functions
PM / core: Add support to skip power management in device/driver model
cpufreq: intel_pstate: Rework iowait boosting to be less aggressive
cpufreq: intel_pstate: Eliminate intel_pstate_get_base_pstate()
cpufreq: intel_pstate: Avoid redundant initialization of local vars
powercap/intel_rapl: add Ice Lake mobile
ACPI / processor: Set P_LVL{2,3} idle state descriptions
cpufreq / cppc: Work around for Hisilicon CPPC cpufreq
ACPI / CPPC: Add a helper to get desired performance
cpufreq: davinci: move configuration to include/linux/platform_data
cpufreq: speedstep: convert BUG() to BUG_ON()
cpufreq: powernv: fix missing check of return value in init_powernv_pstates()
cpufreq: longhaul: remove unneeded semicolon
cpufreq: pcc-cpufreq: remove unneeded semicolon
..
Linus Torvalds [Wed, 6 Mar 2019 18:31:36 +0000 (10:31 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
- a few misc things
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (159 commits)
tools/testing/selftests/proc/proc-self-syscall.c: remove duplicate include
proc: more robust bulk read test
proc: test /proc/*/maps, smaps, smaps_rollup, statm
proc: use seq_puts() everywhere
proc: read kernel cpu stat pointer once
proc: remove unused argument in proc_pid_lookup()
fs/proc/thread_self.c: code cleanup for proc_setup_thread_self()
fs/proc/self.c: code cleanup for proc_setup_self()
proc: return exit code 4 for skipped tests
mm,mremap: bail out earlier in mremap_to under map pressure
mm/sparse: fix a bad comparison
mm/memory.c: do_fault: avoid usage of stale vm_area_struct
writeback: fix inode cgroup switching comment
mm/huge_memory.c: fix "orig_pud" set but not used
mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC
mm/memcontrol.c: fix bad line in comment
mm/cma.c: cma_declare_contiguous: correct err handling
mm/page_ext.c: fix an imbalance with kmemleak
mm/compaction: pass pgdat to too_many_isolated() instead of zone
mm: remove zone_lru_lock() function, access ->lru_lock directly
...
Linus Torvalds [Wed, 6 Mar 2019 18:22:26 +0000 (10:22 -0800)]
Merge tag 'armsoc-late' of git://git./linux/kernel/git/soc/soc
Pull ARM SoC late updates from Arnd Bergmann:
"Here are two branches that came relatively late during the linux-5.0
development cycle and have dependencies on the other branches:
- On the TI OMAP platform, the CPSW Ethernet PHY mode selection
driver is being replaced, this puts the final pieces in place
- On the DaVinci platform, the interrupt handling code in arch/arm
gets moved into a regular device driver in drivers/irqchip.
Since they both had some time in linux-next after the 5.0-rc8 release,
I'm sending them along with the other updates"
* tag 'armsoc-late' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (38 commits)
net: ethernet: ti: cpsw: deprecate cpsw-phy-sel driver
ARM: davinci: remove intc related fields from davinci_soc_info
irqchip: davinci-cp-intc: move the driver to drivers/irqchip
ARM: davinci: cp-intc: remove redundant comments
ARM: davinci: cp-intc: drop GPL license boilerplate
ARM: davinci: cp-intc: use readl/writel_relaxed()
ARM: davinci: cp-intc: unify error handling
ARM: davinci: cp-intc: improve coding style
ARM: davinci: cp-intc: request the memory region before remapping it
ARM: davinci: cp-intc: use the new-style config structure
ARM: davinci: cp-intc: convert all hex numbers to lowercase
ARM: davinci: cp-intc: use a common prefix for all symbols
ARM: davinci: cp-intc: add the new config structures for da8xx SoCs
irqchip: davinci-cp-intc: add a new config structure
ARM: davinci: cp-intc: add a wrapper around cp_intc_init()
ARM: davinci: cp-intc: remove cp_intc.h
irqchip: davinci-aintc: move the driver to drivers/irqchip
ARM: davinci: aintc: remove unnecessary includes
ARM: davinci: aintc: remove the timer-specific irq_set_handler()
ARM: davinci: aintc: request memory region before remapping it
...
Linus Torvalds [Wed, 6 Mar 2019 18:15:42 +0000 (10:15 -0800)]
Merge tag 'armsoc-newsoc' of git://git./linux/kernel/git/soc/soc
Pull ARM new SoC family support from Arnd Bergmann:
"Two new SoC families are added this time.
Sugaya Taichi submitted support for the Milbeaut SoC family from
Socionext and explains:
"SC2000 is a SoC of the Milbeaut series. equipped with a DSP
optimized for computer vision. It also features advanced
functionalities such as 360-degree, real-time spherical stitching
with multi cameras, image stabilization for without mechanical
gimbals, and rolling shutter correction. More detail is below:
https://www.socionext.com/en/products/assp/milbeaut/SC2000.html"
Interestingly, this one has a history dating back to older chips made
by Socionext and previously Matsushita/Panasonic based on their own
mn10300 CPU architecture that was removed from the kernel last year.
Manivannan Sadhasivam adds support for another SoC family, this is the
Bitmain BM1880 chip used in the Sophon Edge TPU developer board.
The chip is intended for Deep Learning applications, and comes with
dual-core Arm Cortex-A53 to run Linux as well as a RISC-V
microcontroller core to control the tensor unit. For the moment, the
TPU is not accessible in mainline Linux, so we treat it as a generic
Arm SoC.
More information is available at
https://www.sophon.ai/"
* tag 'armsoc-newsoc' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
ARM: multi_v7_defconfig: add ARCH_MILBEAUT and ARCH_MILBEAUT_M10V
ARM: configs: Add Milbeaut M10V defconfig
ARM: dts: milbeaut: Add device tree set for the Milbeaut M10V board
clocksource/drivers/timer-milbeaut: Introduce timer for Milbeaut SoCs
dt-bindings: timer: Add Milbeaut M10V timer description
ARM: milbeaut: Add basic support for Milbeaut m10v SoC
dt-bindings: Add documentation for Milbeaut SoCs
dt-bindings: arm: Add SMP enable-method for Milbeaut
dt-bindings: sram: milbeaut: Add binding for Milbeaut smp-sram
MAINTAINERS: Add entry for Bitmain SoC platform
arm64: dts: bitmain: Add Sophon Egde board support
arm64: dts: bitmain: Add BM1880 SoC support
arm64: Add ARCH_BITMAIN platform
dt-bindings: arm: Document Bitmain BM1880 SoC
Linus Torvalds [Wed, 6 Mar 2019 18:09:50 +0000 (10:09 -0800)]
Merge tag 'armsoc-defconfig' of git://git./linux/kernel/git/soc/soc
Pull ARM SoC defconfig updates from Arnd Bergmann:
"We regenerated the defconfig files for samsung, shmobile, lpc18xx,
lpc32xx, omap2, and nhk8815.
Lots of additional drivers added on samsung and nhk8815, as well as
the new pl110 driver on all machines that have it.
The remaining changes are mostly to enable newly added drivers, and in
case of imx8mq together with the SoC getting merged"
* tag 'armsoc-defconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (47 commits)
ARM: spear3xx_defconfig: Activate PL111 DRM driver
ARM: nhk8815_defconfig: Add new options
ARM: nhk8815_defconfig: Update defconfig
ARM: pxa: remove CONFIG_SND_PXA2XX_AC97 in pxa_defconfig
ARM: defconfig: integrator: Switch to DRM
arm64: defconfig: Add IMX2+ watchdog
arm64: defconfig: Enable PFUZE100 regulator
arm64: defconfig: enable NXP FlexSPI driver
arm64: defconfig: Add i.MX8MQ boot necessary configs
arm64: defconfig: add imx8qxp support
arm64: defconfig: add i.MX system controller RTC support
arm64: defconfig: Enable Tegra TCU
arm64: defconfig: Enable MAX8973 regulator
ARM: socfpga_defconfig: enable BLK_DEV_LOOP config option
ARM: defconfig: lpc32xx: enable DRM simple panel driver
ARM: defconfig: lpc32xx: enable fixed voltage regulator support
arm64: defconfig: Enable SUN6I Camera sensor interface
arm64: defconfig: Enable I2C_GPIO
ARM: omap2plus_defconfig: Update for moved options
ARM: omap2plus_defconfig: Update for dropped options
...
Linus Torvalds [Wed, 6 Mar 2019 17:41:12 +0000 (09:41 -0800)]
Merge tag 'armsoc-drivers' of git://git./linux/kernel/git/soc/soc
Pull ARM SoC driver updates from Arnd Bergmann:
"As usual, the drivers/tee and drivers/reset subsystems get merged
here, with the expected set of smaller updates and some new hardware
support. The tee subsystem now supports device drivers to be attached
to a tee, the first example here is a random number driver with its
implementation in the secure world.
Three new power domain drivers get added for specific chip families:
- Broadcom BCM283x chips (used in Raspberry Pi)
- Qualcomm Snapdragon phone chips
- Xilinx ZynqMP FPGA SoCs
One new driver is added to talk to the BPMP firmware on NVIDIA
Tegra210
Existing drivers are extended for new SoC variants from NXP, NVIDIA,
Amlogic and Qualcomm"
* tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (113 commits)
tee: optee: update optee_msg.h and optee_smc.h to dual license
tee: add cancellation support to client interface
dpaa2-eth: configure the cache stashing amount on a queue
soc: fsl: dpio: configure cache stashing destination
soc: fsl: dpio: enable frame data cache stashing per software portal
soc: fsl: guts: make fsl_guts_get_svr() static
hwrng: make symbol 'optee_rng_id_table' static
tee: optee: Fix unsigned comparison with less than zero
hwrng: Fix unsigned comparison with less than zero
tee: fix possible error pointer ctx dereferencing
hwrng: optee: Initialize some structs using memset instead of braces
tee: optee: Initialize some structs using memset instead of braces
soc: fsl: dpio: fix memory leak of a struct qbman on error exit path
clk: tegra: dfll: Make symbol 'tegra210_cpu_cvb_tables' static
soc: qcom: llcc-slice: Fix typos
qcom: soc: llcc-slice: Consolidate some code
qcom: soc: llcc-slice: Clear the global drv_data pointer on error
drivers: soc: xilinx: Add ZynqMP power domain driver
firmware: xilinx: Add APIs to control node status/power
dt-bindings: power: Add ZynqMP power domain bindings
...
Linus Torvalds [Wed, 6 Mar 2019 17:36:37 +0000 (09:36 -0800)]
Merge tag 'armsoc-dt' of git://git./linux/kernel/git/soc/soc
Pull ARM SoC device tree updates from Arnd Bergmann:
"This is a smaller update than the past few times, but with just over
500 non-merge changesets still dwarfes the rest of the SoC tree.
Three new SoC platforms get added, each one a follow-up to an existing
product, and added here in combination with a reference platform:
- Renesas RZ/A2M (R7S9210) 32-bit Cortex-A9 Real-time imaging
processor:
https://www.renesas.com/eu/en/products/microcontrollers-microprocessors/rz/rza/rza2m.html
- Renesas RZ/G2E (r8a774c0) 64-bit Cortex-A53 SoC "for Rich Graphics
Applications":
https://www.renesas.com/eu/en/products/microcontrollers-microprocessors/rz/rzg/rzg2e.html
- NXP i.MX8QuadXPlus 64-bit Cortex-A35 SoC:
https://www.nxp.com/products/processors-and-microcontrollers/arm-based-processors-and-mcus/i.mx-applications-processors/i.mx-8-processors/i.mx-8x-family-arm-cortex-a35-3d-graphics-4k-video-dsp-error-correcting-code-on-ddr:i.MX8X
These are actual commercial products we now support with an in-kernel
device tree source file:
- Bosch Guardian is a product made by Bosch Power Tools GmbH, based
on the Texas Instruments AM335x chip
- Winterland IceBoard is a Texas Instruments AM3874 based machine
used in telescopes at the south pole and elsewhere, see commit
d031773169df2 for some pointers:
- Inspur on5263m5 is an x86 server platform with an Aspeed ast2500
baseboard management controller. This is for running on the BMC.
- Zodiac Digital Tapping Unit, apparently a kind of ethernet switch
used in airplanes.
- Phicomm K3 is a WiFi router based on Broadcom bcm47094
- Methode Electronics uDPU FTTdp distribution point unit
- X96 Max, a generic TV box based on Amlogic G12a (S905X2)
- NVIDIA Shield TV (Darcy) based on Tegra210
And then there are several new SBC, evaluation, development or modular
systems that we add:
- Three new Rockchips rk3399 based boards:
- FriendlyElec NanoPC-T4 and NanoPi M4
- Radxa ROCK Pi 4
- Five new i.MX6 family SoM modules and boards for industrial
products:
- Logic PD i.MX6QD SoM and evaluation baseboad
- Y Soft IOTA Draco/Hydra/Ursa family boards based on i.MX6DL
- Phytec phyCORE i.MX6 UltraLite SoM and evaluation module
- MYIR Tech MYD-LPC4357 development based on the NXP lpc4357
microcontroller
- Chameleon96, an Intel/Altera Cyclone5 based FPGA development system
in 96boards form factor
- Arm Fixed Virtual Platforms(FVP) Base RevC, a purely virtual
platform for corresponding to the latest "fast model"
- Another Raspberry Pi variant: Model 3 A+, supported both in 32-bit
and 64-bit mode.
- Oxalis Evalkit V100 based on NXP Layerscape LS1012a, in 96Boards
enterprise form factor
- Elgin RV1108 R1 development board based on 32-bit Rockchips RV1108
For already supported boards and SoCs, we often add support for new
devices after merging the drivers. This time, the largest changes
include updates for
- STMicroelectronics stm32mp1, which was now formally launched last
week
- Qualcomm Snapdragon 845, a high-end phone and low-end laptop chip
- Action Semi S700
- TI AM654x, their recently merged 64-bit SoC from the OMAP family
- Various Amlogic Meson SoCs
- Mediatek MT2712
- NVIDIA Tegra186 and Tegra210
- The ancient NXP lpc32xx family
- Samsung s5pv210, used in some older mobile phones
Many other chips see smaller updates and bugfixes beyond that"
* tag 'armsoc-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (506 commits)
ARM: dts: exynos: Fix max voltage for buck8 regulator on Odroid XU3/XU4
dt-bindings: net: ti: deprecate cpsw-phy-sel bindings
ARM: dts: am335x: switch to use phy-gmii-sel
ARM: dts: am4372: switch to use phy-gmii-sel
ARM: dts: dm814x: switch to use phy-gmii-sel
ARM: dts: dra7: switch to use phy-gmii-sel
arch: arm: dts: kirkwood-rd88f6281: Remove disabled marvell,dsa reference
ARM: dts: exynos: Add support for secondary DAI to Odroid XU4
ARM: dts: exynos: Add support for secondary DAI to Odroid XU3
ARM: dts: exynos: Disable ARM PMU on Odroid XU3-lite
ARM: dts: exynos: Add stdout path property to Arndale board
ARM: dts: exynos: Add minimal clkout parameters to Exynos3250 PMU
ARM: dts: exynos: Enable ADC on Odroid HC1
arm64: dts: sprd: Remove wildcard compatible string
arm64: dts: sprd: Add SC27XX fuel gauge device
arm64: dts: sprd: Add SC2731 charger device
arm64: dts: sprd: Add ADC calibration support
arm64: dts: sprd: Remove PMIC INTC irq trigger type
arm64: dts: rockchip: Enable tsadc device on rock960
ARM: dts: rockchip: add chosen node on veyron devices
...
Linus Torvalds [Wed, 6 Mar 2019 17:33:05 +0000 (09:33 -0800)]
Merge tag 'armsoc-soc' of git://git./linux/kernel/git/soc/soc
Pull ARM SoC platform updates from Arnd Bergmann:
"The APM X-Gene platform is now maintained by folks from Ampere
computing that took over the product line a while ago, this gets
reflected in the MAINTAINERS file.
Cleanups continue on the older mach-davinci and mach-pxa platform, to
get them to be more like the modern ones. For pxa, we now remove the
Raumfeld platform code as it now works with device tree based booting.
i.MX adds a couple new features for the i.MX7ULP SoC
Mediatek gains support for a new SoC: MT7629 is a new wireless router
platform, following MT7623.
Aside from those, there are the usual minor cleanups and bugfixes
across several platforms"
* tag 'armsoc-soc' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (49 commits)
MAINTAINERS: Update Ampere email address
usb: ohci-da8xx: remove unused callbacks from platform data
ARM: davinci: da830-evm: remove legacy usb helpers
ARM: davinci: omapl138-hawk: remove legacy usb helpers
usb: ohci-da8xx: add vbus and overcurrent gpios
ARM: davinci: da830-evm: use gpio lookup entries for usb gpios
ARM: davinci: omapl138-hawk: use gpio lookup entries for usb gpios
usb: ohci-da8xx: add a helper pointer to &pdev->dev
usb: ohci-da8xx: add a new line after local variables
arm64: meson: enable g12a clock controller
MAINTAINERS: Add entry for uDPU board
ARM: davinci: da850-evm: use GPIO hogs instead of the legacy API
arm: mediatek: add MT7629 smp bring up code
Revert "ARM: mediatek: add MT7623a smp bringup code"
dt-bindings: soc: fix typo of MT8173 power dt-bindings
ARM: meson: remove COMMON_CLK_AMLOGIC selection
arm64: meson: remove COMMON_CLK_AMLOGIC selection
ARM: lpc32xx: remove platform data of ARM PL111 LCD controller
ARM: lpc32xx: remove platform data of ARM PL180 SD/MMC controller
ARM: lpc32xx: Use kmemdup to replace duplicating its implementation
...
Linus Torvalds [Wed, 6 Mar 2019 17:18:43 +0000 (09:18 -0800)]
Merge tag 'asm-generic-5.1' of git://git./linux/kernel/git/arnd/asm-generic
Pull asm-generic updates from Arnd Bergmann:
"Only a few small changes this time:
- Michael S. Tsirkin cleans up linux/mman.h
- Mike Rapoport found a typo
I had originally merged another cleanup series for I/O accessors from
Hugo Lefeuvre as well, but dropped it after the discussion of the
barrier semantics and some conflicts. I expect this series to get
merged for a later release though"
* tag 'asm-generic-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
asm-generic/page.h: fix typo in #error text requiring a real asm/page.h
arch: move common mmap flags to linux/mman.h
drm: tweak header name
x86/mpx: tweak header name
Linus Torvalds [Wed, 6 Mar 2019 17:07:08 +0000 (09:07 -0800)]
Merge tag 'y2038-fix' of git://git./linux/kernel/git/arnd/playground
Pull y2038 build fix for compat mode from Arnd Bergmann:
"Here is one more patch on top of the y2038 changes already pulled for
linux-5.1, for some reason this had escaped all testing"
* tag 'y2038-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
ipc: Fix building compat mode without sysvipc
Linus Torvalds [Wed, 6 Mar 2019 16:45:46 +0000 (08:45 -0800)]
Merge branch 'x86-alternatives-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 alternative instruction updates from Ingo Molnar:
"Small RDTSCP opimization, enabled by the newly added ALTERNATIVE_3(),
and other small improvements"
* 'x86-alternatives-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/TSC: Use RDTSCP
x86/alternatives: Add an ALTERNATIVE_3() macro
x86/alternatives: Print containing function
x86/alternatives: Add macro comments
Linus Torvalds [Wed, 6 Mar 2019 16:14:05 +0000 (08:14 -0800)]
Merge branch 'sched-core-for-linus' of git://git./linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- refcount conversions
- Solve the rq->leaf_cfs_rq_list can of worms for real.
- improve power-aware scheduling
- add sysctl knob for Energy Aware Scheduling
- documentation updates
- misc other changes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
kthread: Do not use TIMER_IRQSAFE
kthread: Convert worker lock to raw spinlock
sched/fair: Use non-atomic cpumask_{set,clear}_cpu()
sched/fair: Remove unused 'sd' parameter from select_idle_smt()
sched/wait: Use freezable_schedule() when possible
sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block
sched/fair: Explain LLC nohz kick condition
sched/fair: Simplify nohz_balancer_kick()
sched/topology: Fix percpu data types in struct sd_data & struct s_data
sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument
sched/fair: Fix O(nr_cgroups) in the load balancing path
sched/fair: Optimize update_blocked_averages()
sched/fair: Fix insertion in rq->leaf_cfs_rq_list
sched/fair: Add tmp_alone_branch assertion
sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity
sched/fair: Update scale invariance of PELT
sched/fair: Move the rq_of() helper function
sched/core: Convert task_struct.stack_refcount to refcount_t
...
Linus Torvalds [Wed, 6 Mar 2019 15:59:36 +0000 (07:59 -0800)]
Merge branch 'perf-core-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
"Lots of tooling updates - too many to list, here's a few highlights:
- Various subcommand updates to 'perf trace', 'perf report', 'perf
record', 'perf annotate', 'perf script', 'perf test', etc.
- CPU and NUMA topology and affinity handling improvements,
- HW tracing and HW support updates:
- Intel PT updates
- ARM CoreSight updates
- vendor HW event updates
- BPF updates
- Tons of infrastructure updates, both on the build system and the
library support side
- Documentation updates.
- ... and lots of other changes, see the changelog for details.
Kernel side updates:
- Tighten up kprobes blacklist handling, reduce the number of places
where developers can install a kprobe and hang/crash the system.
- Fix/enhance vma address filter handling.
- Various PMU driver updates, small fixes and additions.
- refcount_t conversions
- BPF updates
- error code propagation enhancements
- misc other changes"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (238 commits)
perf script python: Add Python3 support to syscall-counts-by-pid.py
perf script python: Add Python3 support to syscall-counts.py
perf script python: Add Python3 support to stat-cpi.py
perf script python: Add Python3 support to stackcollapse.py
perf script python: Add Python3 support to sctop.py
perf script python: Add Python3 support to powerpc-hcalls.py
perf script python: Add Python3 support to net_dropmonitor.py
perf script python: Add Python3 support to mem-phys-addr.py
perf script python: Add Python3 support to failed-syscalls-by-pid.py
perf script python: Add Python3 support to netdev-times.py
perf tools: Add perf_exe() helper to find perf binary
perf script: Handle missing fields with -F +..
perf data: Add perf_data__open_dir_data function
perf data: Add perf_data__(create_dir|close_dir) functions
perf data: Fail check_backup in case of error
perf data: Make check_backup work over directories
perf tools: Add rm_rf_perf_data function
perf tools: Add pattern name checking to rm_rf
perf tools: Add depth checking to rm_rf
perf data: Add global path holder
...
Linus Torvalds [Wed, 6 Mar 2019 15:17:17 +0000 (07:17 -0800)]
Merge branch 'locking-core-for-linus' of git://git./linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"The biggest part of this tree is the new auto-generated atomics API
wrappers by Mark Rutland.
The primary motivation was to allow instrumentation without uglifying
the primary source code.
The linecount increase comes from adding the auto-generated files to
the Git space as well:
include/asm-generic/atomic-instrumented.h | 1689 ++++++++++++++++--
include/asm-generic/atomic-long.h | 1174 ++++++++++---
include/linux/atomic-fallback.h | 2295 +++++++++++++++++++++++++
include/linux/atomic.h | 1241 +------------
I preferred this approach, so that the full call stack of the (already
complex) locking APIs is still fully visible in 'git grep'.
But if this is excessive we could certainly hide them.
There's a separate build-time mechanism to determine whether the
headers are out of date (they should never be stale if we do our job
right).
Anyway, nothing from this should be visible to regular kernel
developers.
Other changes:
- Add support for dynamic keys, which removes a source of false
positives in the workqueue code, among other things (Bart Van
Assche)
- Updates to tools/memory-model (Andrea Parri, Paul E. McKenney)
- qspinlock, wake_q and lockdep micro-optimizations (Waiman Long)
- misc other updates and enhancements"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
locking/lockdep: Shrink struct lock_class_key
locking/lockdep: Add module_param to enable consistency checks
lockdep/lib/tests: Test dynamic key registration
lockdep/lib/tests: Fix run_tests.sh
kernel/workqueue: Use dynamic lockdep keys for workqueues
locking/lockdep: Add support for dynamic keys
locking/lockdep: Verify whether lock objects are small enough to be used as class keys
locking/lockdep: Check data structure consistency
locking/lockdep: Reuse lock chains that have been freed
locking/lockdep: Fix a comment in add_chain_cache()
locking/lockdep: Introduce lockdep_next_lockchain() and lock_chain_count()
locking/lockdep: Reuse list entries that are no longer in use
locking/lockdep: Free lock classes that are no longer in use
locking/lockdep: Update two outdated comments
locking/lockdep: Make it easy to detect whether or not inside a selftest
locking/lockdep: Split lockdep_free_key_range() and lockdep_reset_lock()
locking/lockdep: Initialize the locks_before and locks_after lists earlier
locking/lockdep: Make zap_class() remove all matching lock order entries
locking/lockdep: Reorder struct lock_class members
locking/lockdep: Avoid that add_chain_cache() adds an invalid chain to the cache
...
Linus Torvalds [Wed, 6 Mar 2019 15:13:56 +0000 (07:13 -0800)]
Merge branch 'efi-core-for-linus' of git://git./linux/kernel/git/tip/tip
Pull EFI updates from Ingo Molnar:
"The main EFI changes in this cycle were:
- Use 32-bit alignment for efi_guid_t
- Allow the SetVirtualAddressMap() call to be omitted
- Implement earlycon=efifb based on existing earlyprintk code
- Various minor fixes and code cleanups from Sai, Ard and me"
* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: Fix build error due to enum collision between efi.h and ima.h
efi/x86: Convert x86 EFI earlyprintk into generic earlycon implementation
x86: Make ARCH_USE_MEMREMAP_PROT a generic Kconfig symbol
efi/arm/arm64: Allow SetVirtualAddressMap() to be omitted
efi: Replace GPL license boilerplate with SPDX headers
efi/fdt: Apply more cleanups
efi: Use 32-bit alignment for efi_guid_t
efi/memattr: Don't bail on zero VA if it equals the region's PA
x86/efi: Mark can_free_region() as an __init function
Arnd Bergmann [Thu, 28 Feb 2019 14:22:53 +0000 (15:22 +0100)]
ipc: Fix building compat mode without sysvipc
As John Stultz noticed, my y2038 syscall series caused a link
failure when CONFIG_SYSVIPC is disabled but CONFIG_COMPAT is
enabled:
arch/arm64/kernel/sys32.o:(.rodata+0x960): undefined reference to `__arm64_compat_sys_old_semctl'
arch/arm64/kernel/sys32.o:(.rodata+0x980): undefined reference to `__arm64_compat_sys_old_msgctl'
arch/arm64/kernel/sys32.o:(.rodata+0x9a0): undefined reference to `__arm64_compat_sys_old_shmctl'
Add the missing entries in kernel/sys_ni.c for the new system
calls.
Cc: Laura Abbott <labbott@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Souptick Joarder [Tue, 5 Mar 2019 23:50:45 +0000 (15:50 -0800)]
tools/testing/selftests/proc/proc-self-syscall.c: remove duplicate include
Remove duplicate header which is included twice.
Link: http://lkml.kernel.org/r/20190304182719.GA6606@jordon-HP-15-Notebook-PC
Signed-off-by: Sabyasachi Gupta <sabyasachi.linux@gmail.com>
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexey Dobriyan [Tue, 5 Mar 2019 23:50:42 +0000 (15:50 -0800)]
proc: more robust bulk read test
/proc may not be mounted and test will exit successfully.
Ensure proc is mounted at /proc.
Link: http://lkml.kernel.org/r/20190209105613.GA10384@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexey Dobriyan [Tue, 5 Mar 2019 23:50:39 +0000 (15:50 -0800)]
proc: test /proc/*/maps, smaps, smaps_rollup, statm
Start testing VM related fiels found in per-process files.
Do it by jiting small executable which brings its address space to
precisely known state, then comparing /proc/*/maps, smaps, smaps_rollup,
and statm files to expected values.
Currently only x86_64 is supported.
[adobriyan@gmail.com: exit correctly in /proc/*/maps test]
Link: http://lkml.kernel.org/r/20190206073659.GB15311@avx2
Link: http://lkml.kernel.org/r/20190203165806.GA14568@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexey Dobriyan [Tue, 5 Mar 2019 23:50:35 +0000 (15:50 -0800)]
proc: use seq_puts() everywhere
seq_printf() without format specifiers == faster seq_puts()
Link: http://lkml.kernel.org/r/20190114200545.GC9680@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexey Dobriyan [Tue, 5 Mar 2019 23:50:32 +0000 (15:50 -0800)]
proc: read kernel cpu stat pointer once
Help gcc generate better code:
$ ./scripts/bloat-o-meter ../vmlinux-000 ../vmlinux-001
add/remove: 2/2 grow/shrink: 0/1 up/down: 92/-142 (-50)
Function old new delta
get_iowait_time.isra - 46 +46
get_idle_time.isra - 46 +46
show_stat 1489 1477 -12
get_iowait_time 65 - -65
get_idle_time 65 - -65
Link: http://lkml.kernel.org/r/20190114195907.GA9680@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zhikang Zhang [Tue, 5 Mar 2019 23:50:29 +0000 (15:50 -0800)]
proc: remove unused argument in proc_pid_lookup()
[adobriyan@gmail.com: delete "extern" from prototype]
Link: http://lkml.kernel.org/r/20190114195635.GA9372@avx2
Signed-off-by: Zhikang Zhang <zhangzhikang1@huawei.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chengguang Xu [Tue, 5 Mar 2019 23:50:25 +0000 (15:50 -0800)]
fs/proc/thread_self.c: code cleanup for proc_setup_thread_self()
Remove unnecessary ERR_PTR()/PTR_ERR() cast in proc_setup_thread_self().
Link: http://lkml.kernel.org/r/20190124030150.8472-2-cgxu519@gmx.com
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chengguang Xu [Tue, 5 Mar 2019 23:50:22 +0000 (15:50 -0800)]
fs/proc/self.c: code cleanup for proc_setup_self()
Remove unnecessary ERR_PTR()/PTR_ERR() cast in proc_setup_self().
Link: http://lkml.kernel.org/r/20190124030150.8472-1-cgxu519@gmx.com
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexey Dobriyan [Tue, 5 Mar 2019 23:50:18 +0000 (15:50 -0800)]
proc: return exit code 4 for skipped tests
Test harness uses 4 for SKIP, not 2.
Link: http://lkml.kernel.org/r/20190108193108.GA12259@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oscar Salvador [Tue, 5 Mar 2019 23:50:14 +0000 (15:50 -0800)]
mm,mremap: bail out earlier in mremap_to under map pressure
When using mremap() syscall in addition to MREMAP_FIXED flag, mremap()
calls mremap_to() which does the following:
1) unmaps the destination region where we are going to move the map
2) If the new region is going to be smaller, we unmap the last part
of the old region
Then, we will eventually call move_vma() to do the actual move.
move_vma() checks whether we are at least 4 maps below max_map_count
before going further, otherwise it bails out with -ENOMEM. The problem
is that we might have already unmapped the vma's in steps 1) and 2), so
it is not possible for userspace to figure out the state of the vmas
after it gets -ENOMEM, and it gets tricky for userspace to clean up
properly on error path.
While it is true that we can return -ENOMEM for more reasons (e.g: see
may_expand_vm() or move_page_tables()), I think that we can avoid this
scenario if we check early in mremap_to() if the operation has high
chances to succeed map-wise.
Should that not be the case, we can bail out before we even try to unmap
anything, so we make sure the vma's are left untouched in case we are
likely to be short of maps.
The thumb-rule now is to rely on the worst-scenario case we can have.
That is when both vma's (old region and new region) are going to be
split in 3, so we get two more maps to the ones we already hold (one per
each). If current map count + 2 maps still leads us to 4 maps below the
threshold, we are going to pass the check in move_vma().
Of course, this is not free, as it might generate false positives when
it is true that we are tight map-wise, but the unmap operation can
release several vma's leading us to a good state.
Another approach was also investigated [1], but it may be too much
hassle for what it brings.
[1] https://lore.kernel.org/lkml/
20190219155320.tkfkwvqk53tfdojt@d104.suse.de/
Link: http://lkml.kernel.org/r/20190226091314.18446-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Tue, 5 Mar 2019 23:50:11 +0000 (15:50 -0800)]
mm/sparse: fix a bad comparison
next_present_section_nr() could only return an unsigned number -1, so
just check it specifically where compilers will convert -1 to unsigned
if needed.
mm/sparse.c: In function 'sparse_init_nid':
mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
((section_nr >= 0) && \
^~
mm/sparse.c:478:2: note: in expansion of macro
'for_each_present_section_nr'
for_each_present_section_nr(pnum_begin, pnum) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
((section_nr >= 0) && \
^~
mm/sparse.c:497:2: note: in expansion of macro
'for_each_present_section_nr'
for_each_present_section_nr(pnum_begin, pnum) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/sparse.c: In function 'sparse_init':
mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
((section_nr >= 0) && \
^~
mm/sparse.c:520:2: note: in expansion of macro
'for_each_present_section_nr'
for_each_present_section_nr(pnum_begin + 1, pnum_end) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~
Link: http://lkml.kernel.org/r/20190228181839.86504-1-cai@lca.pw
Fixes: c4e1be9ec113 ("mm, sparsemem: break out of loops early")
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Stancek [Tue, 5 Mar 2019 23:50:08 +0000 (15:50 -0800)]
mm/memory.c: do_fault: avoid usage of stale vm_area_struct
LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
This is a stress test, where one thread mmaps/writes/munmaps memory area
and other thread is trying to read from it:
CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
Krnl PSW :
0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
Call Trace:
([<
0000000000000000>] (null))
[<
00000000001adae4>] lock_acquire+0xec/0x258
[<
000000000080d1ac>] _raw_spin_lock_bh+0x5c/0x98
[<
000000000012a780>] page_table_free+0x48/0x1a8
[<
00000000002f6e54>] do_fault+0xdc/0x670
[<
00000000002fadae>] __handle_mm_fault+0x416/0x5f0
[<
00000000002fb138>] handle_mm_fault+0x1b0/0x320
[<
00000000001248cc>] do_dat_exception+0x19c/0x2c8
[<
000000000080e5ee>] pgm_check_handler+0x19e/0x200
page_table_free() is called with NULL mm parameter, but because "0" is a
valid address on s390 (see S390_lowcore), it keeps going until it
eventually crashes in lockdep's lock_acquire. This crash is
reproducible at least since 4.14.
Problem is that "vmf->vma" used in do_fault() can become stale. Because
mmap_sem may be released, other threads can come in, call munmap() and
cause "vma" be returned to kmem cache, and get zeroed/re-initialized and
re-used:
handle_mm_fault |
__handle_mm_fault |
do_fault |
vma = vmf->vma |
do_read_fault |
__do_fault |
vma->vm_ops->fault(vmf); |
mmap_sem is released |
|
| do_munmap()
| remove_vma_list()
| remove_vma()
| vm_area_free()
| # vma is released
| ...
| # same vma is allocated
| # from kmem cache
| do_mmap()
| vm_area_alloc()
| memset(vma, 0, ...)
|
pte_free(vma->vm_mm, ...); |
page_table_free |
spin_lock_bh(&mm->context.lock);|
<crash> |
Cache mm_struct to avoid using potentially stale "vma".
[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
Link: http://lkml.kernel.org/r/5b3fdf19e2a5be460a384b936f5b56e13733f1b8.1551595137.git.jstancek@redhat.com
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Rafael Aquini <aquini@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Greg Thelen [Tue, 5 Mar 2019 23:50:03 +0000 (15:50 -0800)]
writeback: fix inode cgroup switching comment
Commit
682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb
transaction and use it for stat updates") refers to
inode_switch_wb_work_fn() which never got merged.
Switch the comments to inode_switch_wbs_work_fn().
Link: http://lkml.kernel.org/r/20190305004617.142590-1-gthelen@google.com
Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Tue, 5 Mar 2019 23:50:00 +0000 (15:50 -0800)]
mm/huge_memory.c: fix "orig_pud" set but not used
Commit
a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent
hugepages") introduced pudp_huge_get_and_clear_full() but no one uses
its return code.
In order to not diverge from pmdp_huge_get_and_clear_full(), just change
zap_huge_pud() to not assign the return value from
pudp_huge_get_and_clear_full().
mm/huge_memory.c: In function 'zap_huge_pud':
mm/huge_memory.c:1982:8: warning: variable 'orig_pud' set but not used [-Wunused-but-set-variable]
pud_t orig_pud;
^~~~~~~~
Link: http://lkml.kernel.org/r/20190301221956.97493-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Tue, 5 Mar 2019 23:49:57 +0000 (15:49 -0800)]
mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC
When onlining a memory block with DEBUG_PAGEALLOC, it unmaps the pages
in the block from kernel, However, it does not map those pages while
offlining at the beginning. As the result, it triggers a panic below
while onlining on ppc64le as it checks if the pages are mapped before
unmapping. However, the imbalance exists for all arches where
double-unmappings could happen. Therefore, let kernel map those pages
in generic_online_page() before they have being freed into the page
allocator for the first time where it will set the page count to one.
On the other hand, it works fine during the boot, because at least for
IBM POWER8, it does,
early_setup
early_init_mmu
harsh__early_init_mmu
htab_initialize [1]
htab_bolt_mapping [2]
where it effectively map all memblock regions just like
kernel_map_linear_page(), so later mem_init() -> memblock_free_all()
will unmap them just fine without any imbalance. On other arches
without this imbalance checking, it still unmap them once at the most.
[1]
for_each_memblock(memory, reg) {
base = (unsigned long)__va(reg->base);
size = reg->size;
DBG("creating mapping for region: %lx..%lx (prot: %lx)\n",
base, size, prot);
BUG_ON(htab_bolt_mapping(base, base + size, __pa(base),
prot, mmu_linear_psize, mmu_kernel_ssize));
}
[2] linear_map_hash_slots[paddr >> PAGE_SHIFT] = ret | 0x80;
kernel BUG at arch/powerpc/mm/hash_utils_64.c:1815!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=256 DEBUG_PAGEALLOC NUMA pSeries
CPU: 2 PID: 4298 Comm: bash Not tainted 5.0.0-rc7+ #15
NIP:
c000000000062670 LR:
c00000000006265c CTR:
0000000000000000
REGS:
c0000005bf8a75b0 TRAP: 0700 Not tainted (5.0.0-rc7+)
MSR:
800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR:
28422842
XER:
00000000
CFAR:
c000000000804f44 IRQMASK: 1
NIP [
c000000000062670] __kernel_map_pages+0x2e0/0x4f0
LR [
c00000000006265c] __kernel_map_pages+0x2cc/0x4f0
Call Trace:
__kernel_map_pages+0x2cc/0x4f0
free_unref_page_prepare+0x2f0/0x4d0
free_unref_page+0x44/0x90
__online_page_free+0x84/0x110
online_pages_range+0xc0/0x150
walk_system_ram_range+0xc8/0x120
online_pages+0x280/0x5a0
memory_subsys_online+0x1b4/0x270
device_online+0xc0/0xf0
state_store+0xc0/0x180
dev_attr_store+0x3c/0x60
sysfs_kf_write+0x70/0xb0
kernfs_fop_write+0x10c/0x250
__vfs_write+0x48/0x240
vfs_write+0xd8/0x210
ksys_write+0x70/0x120
system_call+0x5c/0x70
Link: http://lkml.kernel.org/r/20190301220814.97339-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Tue, 5 Mar 2019 23:49:53 +0000 (15:49 -0800)]
mm/memcontrol.c: fix bad line in comment
Commit
230671533d64 ("mm: memory.low hierarchical behavior") missed an
asterisk in one of the comments.
mm/memcontrol.c:5774: warning: bad line: | 0, otherwise.
Link: http://lkml.kernel.org/r/20190301143734.94393-1-cai@lca.pw
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peng Fan [Tue, 5 Mar 2019 23:49:50 +0000 (15:49 -0800)]
mm/cma.c: cma_declare_contiguous: correct err handling
In case cma_init_reserved_mem failed, need to free the memblock
allocated by memblock_reserve or memblock_alloc_range.
Quote Catalin's comments:
https://lkml.org/lkml/2019/2/26/482
Kmemleak is supposed to work with the memblock_{alloc,free} pair and it
ignores the memblock_reserve() as a memblock_alloc() implementation
detail. It is, however, tolerant to memblock_free() being called on
a sub-range or just a different range from a previous memblock_alloc().
So the original patch looks fine to me. FWIW:
Link: http://lkml.kernel.org/r/20190227144631.16708-1-peng.fan@nxp.com
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Tue, 5 Mar 2019 23:49:46 +0000 (15:49 -0800)]
mm/page_ext.c: fix an imbalance with kmemleak
After offlining a memory block, kmemleak scan will trigger a crash, as
it encounters a page ext address that has already been freed during
memory offlining. At the beginning in alloc_page_ext(), it calls
kmemleak_alloc(), but it does not call kmemleak_free() in
free_page_ext().
BUG: unable to handle kernel paging request at
ffff888453d00000
PGD
128a01067 P4D
128a01067 PUD
128a04067 PMD
47e09e067 PTE
800ffffbac2ff060
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
CPU: 1 PID: 1594 Comm: bash Not tainted 5.0.0-rc8+ #15
Hardware name: HP ProLiant DL180 Gen9/ProLiant DL180 Gen9, BIOS U20 10/25/2017
RIP: 0010:scan_block+0xb5/0x290
Code: 85 6e 01 00 00 48 b8 00 00 30 f5 81 88 ff ff 48 39 c3 0f 84 5b 01 00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 0f 85 87 01 00 00 <4c> 8b 3b e8 f3 0c fa ff 4c 39 3d 0c 6b 4c 01 0f 87 08 01 00 00 4c
RSP: 0018:
ffff8881ec57f8e0 EFLAGS:
00010082
RAX:
0000000000000000 RBX:
ffff888453d00000 RCX:
ffffffffa61e5a54
RDX:
0000000000000000 RSI:
0000000000000008 RDI:
ffff888453d00000
RBP:
ffff8881ec57f920 R08:
fffffbfff4ed588d R09:
fffffbfff4ed588c
R10:
fffffbfff4ed588c R11:
ffffffffa76ac463 R12:
dffffc0000000000
R13:
ffff888453d00ff9 R14:
ffff8881f80cef48 R15:
ffff8881f80cef48
FS:
00007f6c0e3f8740(0000) GS:
ffff8881f7680000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
ffff888453d00000 CR3:
00000001c4244003 CR4:
00000000001606a0
Call Trace:
scan_gray_list+0x269/0x430
kmemleak_scan+0x5a8/0x10f0
kmemleak_write+0x541/0x6ca
full_proxy_write+0xf8/0x190
__vfs_write+0xeb/0x980
vfs_write+0x15a/0x4f0
ksys_write+0xd2/0x1b0
__x64_sys_write+0x73/0xb0
do_syscall_64+0xeb/0xaaa
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f6c0dad73b8
Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 63 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
RSP: 002b:
00007ffd5b863cb8 EFLAGS:
00000246 ORIG_RAX:
0000000000000001
RAX:
ffffffffffffffda RBX:
0000000000000005 RCX:
00007f6c0dad73b8
RDX:
0000000000000005 RSI:
000055a9216e1710 RDI:
0000000000000001
RBP:
000055a9216e1710 R08:
000000000000000a R09:
00007ffd5b863840
R10:
000000000000000a R11:
0000000000000246 R12:
00007f6c0dda9780
R13:
0000000000000005 R14:
00007f6c0dda4740 R15:
0000000000000005
Modules linked in: nls_iso8859_1 nls_cp437 vfat fat kvm_intel kvm irqbypass efivars ip_tables x_tables xfs sd_mod ahci libahci igb i2c_algo_bit libata i2c_core dm_mirror dm_region_hash dm_log dm_mod efivarfs
CR2:
ffff888453d00000
---[ end trace
ccf646c7456717c5 ]---
Kernel panic - not syncing: Fatal exception
Shutting down cpus with NMI
Kernel Offset: 0x24c00000 from 0xffffffff81000000 (relocation range:
0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception ]---
Link: http://lkml.kernel.org/r/20190227173147.75650-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Ryabinin [Tue, 5 Mar 2019 23:49:42 +0000 (15:49 -0800)]
mm/compaction: pass pgdat to too_many_isolated() instead of zone
too_many_isolated() in mm/compaction.c looks only at node state, so it
makes more sense to change argument to pgdat instead of zone.
Link: http://lkml.kernel.org/r/20190228083329.31892-3-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Ryabinin [Tue, 5 Mar 2019 23:49:39 +0000 (15:49 -0800)]
mm: remove zone_lru_lock() function, access ->lru_lock directly
We have common pattern to access lru_lock from a page pointer:
zone_lru_lock(page_zone(page))
Which is silly, because it unfolds to this:
&NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]->zone_pgdat->lru_lock
while we can simply do
&NODE_DATA(page_to_nid(page))->lru_lock
Remove zone_lru_lock() function, since it's only complicate things. Use
'page_pgdat(page)->lru_lock' pattern instead.
[aryabinin@virtuozzo.com: a slightly better version of __split_huge_page()]
Link: http://lkml.kernel.org/r/20190301121651.7741-1-aryabinin@virtuozzo.com
Link: http://lkml.kernel.org/r/20190228083329.31892-2-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Ryabinin [Tue, 5 Mar 2019 23:49:35 +0000 (15:49 -0800)]
mm/workingset: remove unused @mapping argument in workingset_eviction()
workingset_eviction() doesn't use and never did use the @mapping
argument. Remove it.
Link: http://lkml.kernel.org/r/20190228083329.31892-1-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gustavo A. R. Silva [Tue, 5 Mar 2019 23:49:31 +0000 (15:49 -0800)]
mm/swapfile.c: use struct_size() in kvzalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
struct boo entry[];
};
size = sizeof(struct foo) + count * sizeof(struct boo);
instance = kvzalloc(size, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kvzalloc(struct_size(instance, entry, count), GFP_KERNEL);
Notice that, in this case, variable size is not necessary, hence it is
removed.
This code was detected with the help of Coccinelle.
Link: http://lkml.kernel.org/r/20190221154622.GA19599@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yue Hu [Tue, 5 Mar 2019 23:49:27 +0000 (15:49 -0800)]
mm/cma_debug.c: remove static scoped cma_debugfs_root
Currently cma_debugfs_root is static storage. That is unnecessary since
it will be only used by next cma_debugfs_add_one(). We can just pass it
to following calling to save thisspace. Also remove useless idx
parameter.
Link: http://lkml.kernel.org/r/20190221040130.8940-1-zbestahu@gmail.com
Signed-off-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexey Dobriyan [Tue, 5 Mar 2019 23:49:24 +0000 (15:49 -0800)]
tmpfs: test link accounting with O_TMPFILE
Mount tmpfs with "nr_inodes=3" for easy check.
Link: http://lkml.kernel.org/r/20190219215016.GA20084@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matej Kupljen <matej.kupljen@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Tue, 5 Mar 2019 23:49:20 +0000 (15:49 -0800)]
MAINTAINERS: add entry for memblock
Add entry for memblock in MAINTAINERS file
Link: http://lkml.kernel.org/r/20190214093630.GC9063@rapoport-lnx
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yu Zhao [Tue, 5 Mar 2019 23:49:17 +0000 (15:49 -0800)]
mm/shmem: make find_get_pages_range() work for huge page
find_get_pages_range() and find_get_pages_range_tag() already correctly
increment reference count on head when seeing compound page, but they
may still use page index from tail. Page index from tail is always
zero, so these functions don't work on huge shmem. This hasn't been a
problem because, AFAIK, nobody calls these functions on (huge) shmem.
Fix them anyway just in case.
Link: http://lkml.kernel.org/r/20190110030838.84446-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christoph Hellwig [Tue, 5 Mar 2019 23:49:13 +0000 (15:49 -0800)]
mm: unexport free_reserved_area
This function is only used by built-in code, which makes perfect sense
given the purpose of it.
Link: http://lkml.kernel.org/r/20190213174621.29297-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tobin C. Harding [Tue, 5 Mar 2019 23:49:10 +0000 (15:49 -0800)]
tools/vm/slabinfo: clean up usage menu debug items
Attempt to make the usage comment for debug options a little cleaner.
Link: http://lkml.kernel.org/r/20190212001219.27769-5-tobin@kernel.org
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tobin C. Harding [Tue, 5 Mar 2019 23:49:07 +0000 (15:49 -0800)]
tools/vm/slabinfo: align usage output columns
Usage message uses spaces not tabspaces, a few tabspaces have snuck in
making the columns not align correctly when output.
Align usage output columns using spaces instead of tabspaces.
Link: http://lkml.kernel.org/r/20190212001219.27769-4-tobin@kernel.org
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tobin C. Harding [Tue, 5 Mar 2019 23:49:03 +0000 (15:49 -0800)]
tools/vm/slabinfo: put options in alphabetic order
Primarily the usage message lists options in alphabetic order however
there are a bunch of the options that are not in alphabetic order.
Put options in alphabetic order.
Link: http://lkml.kernel.org/r/20190212001219.27769-3-tobin@kernel.org
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tobin C. Harding [Tue, 5 Mar 2019 23:48:59 +0000 (15:48 -0800)]
tools/vm/slabinfo: update options in usage message
Currently usage message list only a subset of the available options.
should list them all.
Update options in usage massage to include all available options.
Link: http://lkml.kernel.org/r/20190212001219.27769-2-tobin@kernel.org
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yu Zhao [Tue, 5 Mar 2019 23:48:56 +0000 (15:48 -0800)]
include/linux/compaction.h: fix potential build error
Declaration of struct node is required regardless. On UMA systems,
including compaction.h without preceding node.h shouldn't cause a build
error.
Link: http://lkml.kernel.org/r/20190208080437.253322-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oscar Salvador [Tue, 5 Mar 2019 23:48:53 +0000 (15:48 -0800)]
mm,memory_hotplug: explicitly pass the head to isolate_huge_page
isolate_huge_page() expects we pass the head of hugetlb page to it:
bool isolate_huge_page(...)
{
...
VM_BUG_ON_PAGE(!PageHead(page), page);
...
}
While I really cannot think of any situation where we end up with a
non-head page between hands in do_migrate_range(), let us make sure the
code is as sane as possible by explicitly passing the Head. Since we
already got the pointer, it does not take us extra effort.
Link: http://lkml.kernel.org/r/20190208090604.975-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
john.hubbard@gmail.com [Tue, 5 Mar 2019 23:48:49 +0000 (15:48 -0800)]
mm: page_cache_add_speculative(): refactor out some code duplication
From: John Hubbard <jhubbard@nvidia.com>
This combines the common elements of these routines:
page_cache_get_speculative()
page_cache_add_speculative()
This was anticipated by the original author, as shown by the comment in
commit
ce0ad7f095258 ("powerpc/mm: Lockless get_user_pages_fast() for
64-bit (v3)"):
"Same as above, but add instead of inc (could just be merged)"
There is no intention to introduce any behavioral change, but there is a
small risk of that, due to slightly differing ways of expressing the
TINY_RCU and related configurations.
This also removes the VM_BUG_ON(in_interrupt()) that was in
page_cache_add_speculative(), but not in page_cache_get_speculative().
This provides slightly less detection of such bugs, but it given that it
was only there on the "add" path anyway, we can likely do without it
just fine.
And it removes the
VM_BUG_ON_PAGE(PageCompound(page) && page != compound_head(page), page);
that page_cache_add_speculative() had.
Link: http://lkml.kernel.org/r/20190206231016.22734-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Tue, 5 Mar 2019 23:48:46 +0000 (15:48 -0800)]
mm/migrate.c: cleanup expected_page_refs()
Andrea has noted that page migration code propagates page_mapping(page)
through the whole migration stack down to migrate_page() function so it
seems stupid to then use page_mapping(page) in expected_page_refs()
instead of passed down 'mapping' argument. I agree so let's make
expected_page_refs() more in line with the rest of the migration stack.
Link: http://lkml.kernel.org/r/20190207112314.24872-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Tue, 5 Mar 2019 23:48:42 +0000 (15:48 -0800)]
docs/core-api/mm: fix return value descriptions in mm/
Many kernel-doc comments in mm/ have the return value descriptions
either misformatted or omitted at all which makes kernel-doc script
unhappy:
$ make V=1 htmldocs
...
./mm/util.c:36: info: Scanning doc for kstrdup
./mm/util.c:41: warning: No description found for return value of 'kstrdup'
./mm/util.c:57: info: Scanning doc for kstrdup_const
./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
./mm/util.c:75: info: Scanning doc for kstrndup
./mm/util.c:83: warning: No description found for return value of 'kstrndup'
...
Fixing the formatting and adding the missing return value descriptions
eliminates ~100 such warnings.
Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Tue, 5 Mar 2019 23:48:39 +0000 (15:48 -0800)]
docs/core-api/mm: fix user memory accessors formatting
The descriptions of userspace memory access functions had minor issues
with formatting that made kernel-doc unable to properly detect the
function/macro names and the return value sections:
./arch/x86/include/asm/uaccess.h:80: info: Scanning doc for
./arch/x86/include/asm/uaccess.h:139: info: Scanning doc for
./arch/x86/include/asm/uaccess.h:231: info: Scanning doc for
./arch/x86/include/asm/uaccess.h:505: info: Scanning doc for
./arch/x86/include/asm/uaccess.h:530: info: Scanning doc for
./arch/x86/lib/usercopy_32.c:58: info: Scanning doc for
./arch/x86/lib/usercopy_32.c:69: warning: No description found for return
value of 'clear_user'
./arch/x86/lib/usercopy_32.c:78: info: Scanning doc for
./arch/x86/lib/usercopy_32.c:90: warning: No description found for return
value of '__clear_user'
Fix the formatting.
Link: http://lkml.kernel.org/r/1549549644-4903-3-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Tue, 5 Mar 2019 23:48:36 +0000 (15:48 -0800)]
docs/mm: vmalloc: re-indent kernel-doc comemnts
Some kernel-doc comments in mm/vmalloc.c have leading tab in
indentation. This leads to excessive indentation in the generated HTML
and to the inconsistency of its layout ([1] vs [2]).
Besides, multi-line Note: sections are not handled properly with extra
indentation.
[1] https://www.kernel.org/doc/html/v4.20/core-api/mm-api.html?#c.vm_map_ram
[2] https://www.kernel.org/doc/html/v4.20/core-api/mm-api.html?#c.vfree
Link: http://lkml.kernel.org/r/1549549644-4903-2-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michael S. Tsirkin [Tue, 5 Mar 2019 23:48:33 +0000 (15:48 -0800)]
mm/page_poison: update comment after code moved
mm/debug-pagealloc.c is no more, so of course header now needs to be
updated. This seems like something checkpatch should be able to catch -
worth looking into?
Link: http://lkml.kernel.org/r/20190207191113.14039-1-mst@redhat.com
Fixes: 8823b1dbc05f ("mm/page_poison.c: enable PAGE_POISONING as a separate option")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexey Dobriyan [Tue, 5 Mar 2019 23:48:29 +0000 (15:48 -0800)]
numa: make "nr_online_nodes" unsigned int
Number of online NUMA nodes can't be negative as well. This doesn't
save space as the variable is used only in 32-bit context, but do it
anyway for consistency.
Link: http://lkml.kernel.org/r/20190201223151.GB15820@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexey Dobriyan [Tue, 5 Mar 2019 23:48:26 +0000 (15:48 -0800)]
numa: make "nr_node_ids" unsigned int
Number of NUMA nodes can't be negative.
This saves a few bytes on x86_64:
add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238)
Function old new delta
hv_synic_alloc.cold 88 110 +22
prealloc_shrinker 260 262 +2
bootstrap 249 251 +2
sched_init_numa 1566 1567 +1
show_slab_objects 778 777 -1
s_show 1201 1200 -1
kmem_cache_init 346 345 -1
__alloc_workqueue_key 1146 1145 -1
mem_cgroup_css_alloc 1614 1612 -2
__do_sys_swapon 4702 4699 -3
__list_lru_init 655 651 -4
nic_probe 2379 2374 -5
store_user_store 118 111 -7
red_zone_store 106 99 -7
poison_store 106 99 -7
wq_numa_init 348 338 -10
__kmem_cache_empty 75 65 -10
task_numa_free 186 173 -13
merge_across_nodes_store 351 336 -15
irq_create_affinity_masks 1261 1246 -15
do_numa_crng_init 343 321 -22
task_numa_fault 4760 4737 -23
swapfile_init 179 156 -23
hv_synic_alloc 536 492 -44
apply_wqattrs_prepare 746 695 -51
Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Tue, 5 Mar 2019 23:48:22 +0000 (15:48 -0800)]
mm,oom: don't kill global init via memory.oom.group
Since setting global init process to some memory cgroup is technically
possible, oom_kill_memcg_member() must check it.
Tasks in /test1 are going to be killed due to memory.oom.group set
Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int main(int argc, char *argv[])
{
static char buffer[
10485760];
static int pipe_fd[2] = { EOF, EOF };
unsigned int i;
int fd;
char buf[64] = { };
if (pipe(pipe_fd))
return 1;
if (chdir("/sys/fs/cgroup/"))
return 1;
fd = open("cgroup.subtree_control", O_WRONLY);
write(fd, "+memory", 7);
close(fd);
mkdir("test1", 0755);
fd = open("test1/memory.oom.group", O_WRONLY);
write(fd, "1", 1);
close(fd);
fd = open("test1/cgroup.procs", O_WRONLY);
write(fd, "1", 1);
snprintf(buf, sizeof(buf) - 1, "%d", getpid());
write(fd, buf, strlen(buf));
close(fd);
snprintf(buf, sizeof(buf) - 1, "%lu", sizeof(buffer) * 5);
fd = open("test1/memory.max", O_WRONLY);
write(fd, buf, strlen(buf));
close(fd);
for (i = 0; i < 10; i++)
if (fork() == 0) {
char c;
close(pipe_fd[1]);
read(pipe_fd[0], &c, 1);
memset(buffer, 0, sizeof(buffer));
sleep(3);
_exit(0);
}
close(pipe_fd[0]);
close(pipe_fd[1]);
sleep(3);
return 0;
}
[ 37.052923][ T9185] a.out invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 37.056169][ T9185] CPU: 4 PID: 9185 Comm: a.out Kdump: loaded Not tainted 5.0.0-rc4-next-
20190131 #280
[ 37.059205][ T9185] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[ 37.062954][ T9185] Call Trace:
[ 37.063976][ T9185] dump_stack+0x67/0x95
[ 37.065263][ T9185] dump_header+0x51/0x570
[ 37.066619][ T9185] ? trace_hardirqs_on+0x3f/0x110
[ 37.068171][ T9185] ? _raw_spin_unlock_irqrestore+0x3d/0x70
[ 37.069967][ T9185] oom_kill_process+0x18d/0x210
[ 37.071515][ T9185] out_of_memory+0x11b/0x380
[ 37.072936][ T9185] mem_cgroup_out_of_memory+0xb6/0xd0
[ 37.074601][ T9185] try_charge+0x790/0x820
[ 37.076021][ T9185] mem_cgroup_try_charge+0x42/0x1d0
[ 37.077629][ T9185] mem_cgroup_try_charge_delay+0x11/0x30
[ 37.079370][ T9185] do_anonymous_page+0x105/0x5e0
[ 37.080939][ T9185] __handle_mm_fault+0x9cb/0x1070
[ 37.082485][ T9185] handle_mm_fault+0x1b2/0x3a0
[ 37.083819][ T9185] ? handle_mm_fault+0x47/0x3a0
[ 37.085181][ T9185] __do_page_fault+0x255/0x4c0
[ 37.086529][ T9185] do_page_fault+0x28/0x260
[ 37.087788][ T9185] ? page_fault+0x8/0x30
[ 37.088978][ T9185] page_fault+0x1e/0x30
[ 37.090142][ T9185] RIP: 0033:0x7f8b183aefe0
[ 37.091433][ T9185] Code: 20 f3 44 0f 7f 44 17 d0 f3 44 0f 7f 47 30 f3 44 0f 7f 44 17 c0 48 01 fa 48 83 e2 c0 48 39 d1 74 a3 66 0f 1f 84 00 00 00 00 00 <66> 44 0f 7f 01 66 44 0f 7f 41 10 66 44 0f 7f 41 20 66 44 0f 7f 41
[ 37.096917][ T9185] RSP: 002b:
00007fffc5d329e8 EFLAGS:
00010206
[ 37.098615][ T9185] RAX:
00000000006010e0 RBX:
0000000000000008 RCX:
0000000000c30000
[ 37.100905][ T9185] RDX:
00000000010010c0 RSI:
0000000000000000 RDI:
00000000006010e0
[ 37.103349][ T9185] RBP:
0000000000000000 R08:
00007f8b188f4740 R09:
0000000000000000
[ 37.105797][ T9185] R10:
00007fffc5d32420 R11:
00007f8b183aef40 R12:
0000000000000005
[ 37.108228][ T9185] R13:
0000000000000000 R14:
ffffffffffffffff R15:
0000000000000000
[ 37.110840][ T9185] memory: usage 51200kB, limit 51200kB, failcnt 125
[ 37.113045][ T9185] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 37.115808][ T9185] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 37.117660][ T9185] Memory cgroup stats for /test1: cache:0KB rss:49484KB rss_huge:30720KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:49700KB inactive_file:0KB active_file:0KB unevictable:0KB
[ 37.123371][ T9185] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/test1,task_memcg=/test1,task=a.out,pid=9188,uid=0
[ 37.128158][ T9185] Memory cgroup out of memory: Killed process 9188 (a.out) total-vm:14456kB, anon-rss:10324kB, file-rss:504kB, shmem-rss:0kB
[ 37.132710][ T9185] Tasks in /test1 are going to be killed due to memory.oom.group set
[ 37.132833][ T54] oom_reaper: reaped process 9188 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 37.135498][ T9185] Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
[ 37.143434][ T9185] Memory cgroup out of memory: Killed process 9182 (a.out) total-vm:14456kB, anon-rss:76kB, file-rss:588kB, shmem-rss:0kB
[ 37.144328][ T54] oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 37.147585][ T9185] Memory cgroup out of memory: Killed process 9183 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
[ 37.157222][ T9185] Memory cgroup out of memory: Killed process 9184 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:508kB, shmem-rss:0kB
[ 37.157259][ T9185] Memory cgroup out of memory: Killed process 9185 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
[ 37.157291][ T9185] Memory cgroup out of memory: Killed process 9186 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:508kB, shmem-rss:0kB
[ 37.157306][ T54] oom_reaper: reaped process 9183 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 37.157328][ T9185] Memory cgroup out of memory: Killed process 9187 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
[ 37.157452][ T9185] Memory cgroup out of memory: Killed process 9189 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
[ 37.158733][ T9185] Memory cgroup out of memory: Killed process 9190 (a.out) total-vm:14456kB, anon-rss:552kB, file-rss:512kB, shmem-rss:0kB
[ 37.160083][ T54] oom_reaper: reaped process 9186 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 37.160187][ T54] oom_reaper: reaped process 9189 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 37.206941][ T54] oom_reaper: reaped process 9185 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 37.212300][ T9185] Memory cgroup out of memory: Killed process 9191 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
[ 37.212317][ T54] oom_reaper: reaped process 9190 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 37.218860][ T9185] Memory cgroup out of memory: Killed process 9192 (a.out) total-vm:14456kB, anon-rss:1080kB, file-rss:512kB, shmem-rss:0kB
[ 37.227667][ T54] oom_reaper: reaped process 9192 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 37.292323][ T9193] abrt-hook-ccpp (9193) used greatest stack depth: 10480 bytes left
[ 37.351843][ T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
[ 37.354833][ T1] CPU: 7 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.0.0-rc4-next-
20190131 #280
[ 37.357876][ T1] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[ 37.361685][ T1] Call Trace:
[ 37.363239][ T1] dump_stack+0x67/0x95
[ 37.365010][ T1] panic+0xfc/0x2b0
[ 37.366853][ T1] do_exit+0xd55/0xd60
[ 37.368595][ T1] do_group_exit+0x47/0xc0
[ 37.370415][ T1] get_signal+0x32a/0x920
[ 37.372449][ T1] ? _raw_spin_unlock_irqrestore+0x3d/0x70
[ 37.374596][ T1] do_signal+0x32/0x6e0
[ 37.376430][ T1] ? exit_to_usermode_loop+0x26/0x9b
[ 37.378418][ T1] ? prepare_exit_to_usermode+0xa8/0xd0
[ 37.380571][ T1] exit_to_usermode_loop+0x3e/0x9b
[ 37.382588][ T1] prepare_exit_to_usermode+0xa8/0xd0
[ 37.384594][ T1] ? page_fault+0x8/0x30
[ 37.386453][ T1] retint_user+0x8/0x18
[ 37.388160][ T1] RIP: 0033:0x7f42c06974a8
[ 37.389922][ T1] Code: Bad RIP value.
[ 37.391788][ T1] RSP: 002b:
00007ffc3effd388 EFLAGS:
00010213
[ 37.394075][ T1] RAX:
000000000000000e RBX:
00007ffc3effd390 RCX:
0000000000000000
[ 37.396963][ T1] RDX:
000000000000002a RSI:
00007ffc3effd390 RDI:
0000000000000004
[ 37.399550][ T1] RBP:
00007ffc3effd680 R08:
0000000000000000 R09:
0000000000000000
[ 37.402334][ T1] R10:
00000000ffffffff R11:
0000000000000246 R12:
0000000000000001
[ 37.404890][ T1] R13:
ffffffffffffffff R14:
0000000000000884 R15:
000056460b1ac3b0
Link: http://lkml.kernel.org/r/201902010336.x113a4EO027170@www262.sakura.ne.jp
Fixes: 3d8b38eb81cac813 ("mm, oom: introduce memory.oom.group")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Daniel Jordan [Tue, 5 Mar 2019 23:48:19 +0000 (15:48 -0800)]
mm, swap: bounds check swap_info array accesses to avoid NULL derefs
Dan Carpenter reports a potential NULL dereference in
get_swap_page_of_type:
Smatch complains that the NULL checks on "si" aren't consistent. This
seems like a real bug because we have not ensured that the type is
valid and so "si" can be NULL.
Add the missing check for NULL, taking care to use a read barrier to
ensure CPU1 observes CPU0's updates in the correct order:
CPU0 CPU1
alloc_swap_info() if (type >= nr_swapfiles)
swap_info[type] = p /* handle invalid entry */
smp_wmb() smp_rmb()
++nr_swapfiles p = swap_info[type]
Without smp_rmb, CPU1 might observe CPU0's write to nr_swapfiles before
CPU0's write to swap_info[type] and read NULL from swap_info[type].
Ying Huang noticed other places in swapfile.c don't order these reads
properly. Introduce swap_type_to_swap_info to encourage correct usage.
Use READ_ONCE and WRITE_ONCE to follow the Linux Kernel Memory Model
(see tools/memory-model/Documentation/explanation.txt).
This ordering need not be enforced in places where swap_lock is held
(e.g. si_swapinfo) because swap_lock serializes updates to nr_swapfiles
and the swap_info array.
Link: http://lkml.kernel.org/r/20190131024410.29859-1-daniel.m.jordan@oracle.com
Fixes: ec8acf20afb8 ("swap: add per-partition lock for swapfile")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill Tkhai [Tue, 5 Mar 2019 23:48:15 +0000 (15:48 -0800)]
mm/vmscan.c: do not allocate duplicate stack variables in shrink_page_list()
On path shrink_inactive_list() ---> shrink_page_list() we allocate stack
variables for the statistics twice. This is completely useless, and
this just consumes stack much more, then we really need.
The patch kills duplicate stack variables from shrink_page_list(), and
this reduce stack usage and object file size significantly:
Stack usage:
Before: vmscan.c:1122:22:shrink_page_list 648 static
After: vmscan.c:1122:22:shrink_page_list 616 static
Size of vmscan.o:
text data bss dec hex filename
Before: 56866 4720 128 61714 f112 mm/vmscan.o
After: 56770 4720 128 61618 f0b2 mm/vmscan.o
Link: http://lkml.kernel.org/r/154894900030.5211.12104993874109647641.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Tue, 5 Mar 2019 23:48:12 +0000 (15:48 -0800)]
mm: ksm: do not block on page lock when searching stable tree
ksmd needs to search the stable tree to look for the suitable KSM page,
but the KSM page might be locked for a while due to i.e. KSM page rmap
walk. Basically it is not a big deal since commit
2c653d0ee2ae ("ksm:
introduce ksm_max_page_sharing per page deduplication limit"), since
max_page_sharing limits the number of shared KSM pages.
But it still sounds not worth waiting for the lock, the page can be
skip, then try to merge it in the next scan to avoid potential stall if
its content is still intact.
Introduce trylock mode to get_ksm_page() to not block on page lock, like
what try_to_merge_one_page() does. And, define three possible
operations (nolock, lock and trylock) as enum type to avoid stacking up
bools and make the code more readable.
Return -EBUSY if trylock fails, since NULL means not find suitable KSM
page, which is a valid case.
With the default max_page_sharing setting (256), there is almost no
observed change comparing lock vs trylock.
However, with ksm02 of LTP, the reduced ksmd full scan time can be
observed, which has set max_page_sharing to 786432. With lock version,
ksmd may tak 10s - 11s to run two full scans, with trylock version ksmd
may take 8s - 11s to run two full scans. And, the number of
pages_sharing and pages_to_scan keep same. Basically, this change has
no harm.
[hughd@google.com: fix BUG_ON()]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1902182122280.6914@eggly.anvils
Link: http://lkml.kernel.org/r/1548793753-62377-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Suggested-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chris Down [Tue, 5 Mar 2019 23:48:09 +0000 (15:48 -0800)]
mm: memcontrol: expose THP events on a per-memcg basis
Currently THP allocation events data is fairly opaque, since you can
only get it system-wide. This patch makes it easier to reason about
transparent hugepage behaviour on a per-memcg basis.
For anonymous THP-backed pages, we already have MEMCG_RSS_HUGE in v1,
which is used for v1's rss_huge [sic]. This is reused here as it's
fairly involved to untangle NR_ANON_THPS right now to make it per-memcg,
since right now some of this is delegated to rmap before we have any
memcg actually assigned to the page. It's a good idea to rework that,
but let's leave untangling THP allocation for a future patch.
[akpm@linux-foundation.org: fix build]
[chris@chrisdown.name: fix memcontrol build when THP is disabled]
Link: http://lkml.kernel.org/r/20190131160802.GA5777@chrisdown.name
Link: http://lkml.kernel.org/r/20190129205852.GA7310@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Tue, 5 Mar 2019 23:48:05 +0000 (15:48 -0800)]
mm: vmscan: do not iterate all mem cgroups for global direct reclaim
In current implementation, both kswapd and direct reclaim has to iterate
all mem cgroups. It is not a problem before offline mem cgroups could
be iterated. But, currently with iterating offline mem cgroups, it
could be very time consuming. In our workloads, we saw over 400K mem
cgroups accumulated in some cases, only a few hundred are online memcgs.
Although kswapd could help out to reduce the number of memcgs, direct
reclaim still get hit with iterating a number of offline memcgs in some
cases. We experienced the responsiveness problems due to this
occassionally.
A simple test with pref shows it may take around 220ms to iterate 8K
memcgs in direct reclaim:
dd 13873 [011] 578.542919: vmscan:mm_vmscan_direct_reclaim_begin
dd 13873 [011] 578.758689: vmscan:mm_vmscan_direct_reclaim_end
So for 400K, it may take around 11 seconds to iterate all memcgs.
Here just break the iteration once it reclaims enough pages as what
memcg direct reclaim does. This may hurt the fairness among memcgs.
But the cached iterator cookie could help to achieve the fairness more
or less.
Link: http://lkml.kernel.org/r/1548799877-10949-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Tue, 5 Mar 2019 23:48:02 +0000 (15:48 -0800)]
mm: swap: use mem_cgroup_is_root() instead of deferencing css->parent
mem_cgroup_is_root() is the preferred API to check if memcg is root or
not. Use it instead of deferencing css->parent.
Link: http://lkml.kernel.org/r/1547232913-118148-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joel Fernandes (Google) [Tue, 5 Mar 2019 23:47:58 +0000 (15:47 -0800)]
selftests/memfd: add tests for F_SEAL_FUTURE_WRITE seal
Add tests to verify sealing memfds with the F_SEAL_FUTURE_WRITE works as
expected.
Link: http://lkml.kernel.org/r/20190112203816.85534-3-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Shuah Khan <shuah@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Marc-Andr Lureau <marcandre.lureau@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joel Fernandes (Google) [Tue, 5 Mar 2019 23:47:54 +0000 (15:47 -0800)]
mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd
Android uses ashmem for sharing memory regions. We are looking forward
to migrating all usecases of ashmem to memfd so that we can possibly
remove the ashmem driver in the future from staging while also
benefiting from using memfd and contributing to it. Note staging
drivers are also not ABI and generally can be removed at anytime.
One of the main usecases Android has is the ability to create a region
and mmap it as writeable, then add protection against making any
"future" writes while keeping the existing already mmap'ed
writeable-region active. This allows us to implement a usecase where
receivers of the shared memory buffer can get a read-only view, while
the sender continues to write to the buffer. See CursorWindow
documentation in Android for more details:
https://developer.android.com/reference/android/database/CursorWindow
This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
which prevents any future mmap and write syscalls from succeeding while
keeping the existing mmap active.
A better way to do F_SEAL_FUTURE_WRITE seal was discussed [1] last week
where we don't need to modify core VFS structures to get the same
behavior of the seal. This solves several side-effects pointed by Andy.
self-tests are provided in later patch to verify the expected semantics.
[1] https://lore.kernel.org/lkml/
20181111173650.GA256781@google.com/
Thanks a lot to Andy for suggestions to improve code.
Link: http://lkml.kernel.org/r/20190112203816.85534-2-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Marc-Andr Lureau <marcandre.lureau@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aneesh Kumar K.V [Tue, 5 Mar 2019 23:47:51 +0000 (15:47 -0800)]
powerpc/mm/iommu: allow large IOMMU page size only for hugetlb backing
THP pages can get split during different code paths. An incremented
reference count does imply we will not split the compound page. But the
pmd entry can be converted to level 4 pte entries. Keep the code
simpler by allowing large IOMMU page size only if the guest ram is
backed by hugetlb pages.
Link: http://lkml.kernel.org/r/20190114095438.32470-6-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aneesh Kumar K.V [Tue, 5 Mar 2019 23:47:47 +0000 (15:47 -0800)]
powerpc/mm/iommu: allow migration of cma allocated pages during mm_iommu_do_alloc
The current code doesn't do page migration if the page allocated is a
compound page. With HugeTLB migration support, we can end up allocating
hugetlb pages from CMA region. Also, THP pages can be allocated from
CMA region. This patch updates the code to handle compound pages
correctly. The patch also switches to a single get_user_pages with the
right count, instead of doing one get_user_pages per page. That avoids
reading page table multiple times. This is done by using
get_user_pages_longterm, because that also takes care of DAX backed
pages.
DAX pages lifetime is dictated by file system rules and as such, we need
to make sure that we free these pages on operations like truncate and
punch hole. If we have long term pin on these pages, which are mostly
return to userspace with elevated page count, the entity holding the
long term pin may not be aware of the fact that file got truncated and
the file system blocks possibly got reused. That can result in
corruption.
The patch also converts the hpas member of mm_iommu_table_group_mem_t to
a union. We use the same storage location to store pointers to struct
page. We cannot update all the code path use struct page *, because we
access hpas in real mode and we can't do that struct page * to pfn
conversion in real mode.
[aneesh.kumar@linux.ibm.com: address review feedback, update changelog]
Link: http://lkml.kernel.org/r/20190227144736.5872-4-aneesh.kumar@linux.ibm.com
Link: http://lkml.kernel.org/r/20190114095438.32470-5-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aneesh Kumar K.V [Tue, 5 Mar 2019 23:47:44 +0000 (15:47 -0800)]
mm: update get_user_pages_longterm to migrate pages allocated from CMA region
This patch updates get_user_pages_longterm to migrate pages allocated
out of CMA region. This makes sure that we don't keep non-movable pages
(due to page reference count) in the CMA area.
This will be used by ppc64 in a later patch to avoid pinning pages in
the CMA region. ppc64 uses CMA region for allocation of the hardware
page table (hash page table) and not able to migrate pages out of CMA
region results in page table allocation failures.
One case where we hit this easy is when a guest using a VFIO passthrough
device. VFIO locks all the guest's memory and if the guest memory is
backed by CMA region, it becomes unmovable resulting in fragmenting the
CMA and possibly preventing other guests from allocation a large enough
hash page table.
NOTE: We allocate the new page without using __GFP_THISNODE
Link: http://lkml.kernel.org/r/20190114095438.32470-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aneesh Kumar K.V [Tue, 5 Mar 2019 23:47:40 +0000 (15:47 -0800)]
mm/cma: add PF flag to force non cma alloc
Patch series "mm/kvm/vfio/ppc64: Migrate compound pages out of CMA
region", v8.
ppc64 uses the CMA area for the allocation of guest page table (hash
page table). We won't be able to start guest if we fail to allocate
hash page table. We have observed hash table allocation failure because
we failed to migrate pages out of CMA region because they were pinned.
This happen when we are using VFIO. VFIO on ppc64 pins the entire guest
RAM. If the guest RAM pages get allocated out of CMA region, we won't
be able to migrate those pages. The pages are also pinned for the
lifetime of the guest.
Currently we support migration of non-compound pages. With THP and with
the addition of hugetlb migration we can end up allocating compound
pages from CMA region. This patch series add support for migrating
compound pages.
This patch (of 4):
Add PF_MEMALLOC_NOCMA which make sure any allocation in that context is
marked non-movable and hence cannot be satisfied by CMA region.
This is useful with get_user_pages_longterm where we want to take a page
pin by migrating pages from CMA region. Marking the section
PF_MEMALLOC_NOCMA ensures that we avoid unnecessary page migration
later.
Link: http://lkml.kernel.org/r/20190114095438.32470-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Tue, 5 Mar 2019 23:47:36 +0000 (15:47 -0800)]
mm: better document PG_reserved
The usage of PG_reserved and how PG_reserved pages are to be treated is
buried deep down in different parts of the kernel. Let's shine some
light onto these details by documenting current users and expected
behavior.
Especially, clarify on the "Some of them might not even exist" case.
These are physical memory gaps that will never be dumped as they are not
marked as IORESOURCE_SYSRAM. PG_reserved does in general not hinder
anybody from dumping or swapping. In some cases, these pages will not
be stored in the hibernation image.
Link: http://lkml.kernel.org/r/20190114125903.24845-10-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Miles Chen <miles.chen@mediatek.com>
Cc: <yi.z.zhang@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Tue, 5 Mar 2019 23:47:32 +0000 (15:47 -0800)]
ia64: perfmon: don't mark buffer pages as PG_reserved
In the old days, remap_pfn_range() required pages to be marked as
PG_reserved, so they would e.g. never get swapped out. This was
required for special mappings. Nowadays, this is fully handled via the
VMA (VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP inside
remap_pfn_range() to be precise). PG_reserved is no longer required but
only a relic from the past.
So only architecture specific MM handling might require it (e.g. to
detect them as MMIO pages). As there are no architecture specific
checks for PageReserved() apart from MCA handling in ia64code, this can
go. Use simple vzalloc()/vfree() instead.
Note that before calling vzalloc(), size has already been aligned to
PAGE_SIZE, no need to align again.
Link: http://lkml.kernel.org/r/20190114125903.24845-9-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Tue, 5 Mar 2019 23:47:28 +0000 (15:47 -0800)]
arm64: kdump: no need to mark crashkernel pages manually PG_reserved
The crashkernel is reserved via memblock_reserve(). memblock_free_all()
will call free_low_memory_core_early(), which will go over all reserved
memblocks, marking the pages as PG_reserved.
So manually marking pages as PG_reserved is not necessary, they are
already in the desired state (otherwise they would have been handed over
to the buddy as free pages and bad things would happen).
Link: http://lkml.kernel.org/r/20190114125903.24845-8-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthias Brugger <mbrugger@suse.com>
Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Greg Hackmann <ghackmann@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kristina Martsenko <kristina.martsenko@arm.com>
Cc: CHANDAN VN <chandan.vn@samsung.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Tue, 5 Mar 2019 23:47:25 +0000 (15:47 -0800)]
arm64: kexec: no need to ClearPageReserved()
This will be done by free_reserved_page().
Link: http://lkml.kernel.org/r/20190114125903.24845-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: James Morse <james.morse@arm.com>
Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Tue, 5 Mar 2019 23:47:21 +0000 (15:47 -0800)]
m68k/mm: use __ClearPageReserved()
The PG_reserved flag is cleared from memory that is part of the kernel
image (and therefore marked as PG_reserved). Avoid using PG_reserved
directly.
Link: http://lkml.kernel.org/r/20190114125903.24845-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Tue, 5 Mar 2019 23:47:18 +0000 (15:47 -0800)]
riscv/vdso: don't clear PG_reserved
The VDSO is part of the kernel image and therefore the struct pages are
marked as reserved during boot.
As we install a special mapping, the actual struct pages will never be
exposed to MM via the page tables. We can therefore leave the pages
marked as reserved.
Link: http://lkml.kernel.org/r/20190114125903.24845-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Palmer Dabbelt <palmer@sifive.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Tobias Klauser <tklauser@distanz.ch>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Tue, 5 Mar 2019 23:47:14 +0000 (15:47 -0800)]
powerpc/vdso: don't clear PG_reserved
The VDSO is part of the kernel image and therefore the struct pages are
marked as reserved during boot.
As we install a special mapping, the actual struct pages will never be
exposed to MM via the page tables. We can therefore leave the pages
marked as reserved.
Link: http://lkml.kernel.org/r/20190114125903.24845-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Tue, 5 Mar 2019 23:47:10 +0000 (15:47 -0800)]
s390/vdso: don't clear PG_reserved
The VDSO is part of the kernel image and therefore the struct pages are
marked as reserved during boot.
As we install a special mapping, the actual struct pages will never be
exposed to MM via the page tables. We can therefore leave the pages
marked as reserved.
Link: http://lkml.kernel.org/r/20190114125903.24845-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Suggested-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Tue, 5 Mar 2019 23:47:07 +0000 (15:47 -0800)]
agp: efficeon: no need to set PG_reserved on GATT tables
Patch series "mm: PG_reserved cleanups and documentation", v2.
I was recently going over all users of PG_reserved. Short story: it is
difficult and sometimes not really clear if setting/checking for
PG_reserved is only a relict from the past. Easy to break things. I
guess I now have a pretty good idea wh things are like that nowadays and
how they evolved.
I had way more cleanups in this series inititally, but some
architectures take PG_reserved as a way to apply a different caching
strategy (for MMIO pages). So I decided to only include the most
obvious changes (that are less likely to break something). So the big
chunk of manual SetPageReserved users are MMIO/DMA related things on
device buffers.
Most notably, for device memory we will hopefully soon stop setting
PG_reserved. Then the documentation has to be updated.
This patch (of 9):
The l1 GATT page table is kept in a special on-chip page with 64
entries. We allocate the l2 page table pages via get_zeroed_page() and
enter them into the table. These l2 pages are modified accordingly when
inserting/removing memory via efficeon_insert_memory and
efficeon_remove_memory.
Apart from that, these pages are not exposed or ioremap'ed. We can stop
setting them reserved (propably copied from generic code).
Link: http://lkml.kernel.org/r/20190114125903.24845-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vineeth Remanan Pillai [Tue, 5 Mar 2019 23:47:03 +0000 (15:47 -0800)]
mm: rid swapoff of quadratic complexity
This patch was initially posted by Kelley Nielsen. Reposting the patch
with all review comments addressed and with minor modifications and
optimizations. Also, folding in the fixes offered by Hugh Dickins and
Huang Ying. Tests were rerun and commit message updated with new
results.
try_to_unuse() is of quadratic complexity, with a lot of wasted effort.
It unuses swap entries one by one, potentially iterating over all the
page tables for all the processes in the system for each one.
This new proposed implementation of try_to_unuse simplifies its
complexity to linear. It iterates over the system's mms once, unusing
all the affected entries as it walks each set of page tables. It also
makes similar changes to shmem_unuse.
Improvement
swapoff was called on a swap partition containing about 6G of data, in a
VM(8cpu, 16G RAM), and calls to unuse_pte_range() were counted.
Present implementation....about 1200M calls(8min, avg 80% cpu util).
Prototype.................about 9.0K calls(3min, avg 5% cpu util).
Details
In shmem_unuse(), iterate over the shmem_swaplist and, for each
shmem_inode_info that contains a swap entry, pass it to
shmem_unuse_inode(), along with the swap type. In shmem_unuse_inode(),
iterate over its associated xarray, and store the index and value of
each swap entry in an array for passing to shmem_swapin_page() outside
of the RCU critical section.
In try_to_unuse(), instead of iterating over the entries in the type and
unusing them one by one, perhaps walking all the page tables for all the
processes for each one, iterate over the mmlist, making one pass. Pass
each mm to unuse_mm() to begin its page table walk, and during the walk,
unuse all the ptes that have backing store in the swap type received by
try_to_unuse(). After the walk, check the type for orphaned swap
entries with find_next_to_unuse(), and remove them from the swap cache.
If find_next_to_unuse() starts over at the beginning of the type, repeat
the check of the shmem_swaplist and the walk a maximum of three times.
Change unuse_mm() and the intervening walk functions down to
unuse_pte_range() to take the type as a parameter, and to iterate over
their entire range, calling the next function down on every iteration.
In unuse_pte_range(), make a swap entry from each pte in the range using
the passed in type. If it has backing store in the type, call
swapin_readahead() to retrieve the page and pass it to unuse_pte().
Pass the count of pages_to_unuse down the page table walks in
try_to_unuse(), and return from the walk when the desired number of
pages has been swapped back in.
Link: http://lkml.kernel.org/r/20190114153129.4852-2-vpillai@digitalocean.com
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vineeth Remanan Pillai [Tue, 5 Mar 2019 23:46:58 +0000 (15:46 -0800)]
mm: refactor swap-in logic out of shmem_getpage_gfp
swapin logic can be reused independently without rest of the logic in
shmem_getpage_gfp. So lets refactor it out as an independent function.
Link: http://lkml.kernel.org/r/20190114153129.4852-1-vpillai@digitalocean.com
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kelley Nielsen <kelleynnn@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill Tkhai [Tue, 5 Mar 2019 23:46:55 +0000 (15:46 -0800)]
mm/vmscan.c: remove 7th argument of isolate_lru_pages()
We may simply check for sc->may_unmap in isolate_lru_pages() instead of
doing that in both of its callers.
Link: http://lkml.kernel.org/r/154748280735.29962.15867846875217618569.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Tue, 5 Mar 2019 23:46:50 +0000 (15:46 -0800)]
mm, mempolicy: fix uninit memory access
Syzbot with KMSAN reports (excerpt):
==================================================================
BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:353 [inline]
BUG: KMSAN: uninit-value in mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384
CPU: 1 PID: 17420 Comm: syz-executor4 Not tainted 4.20.0-rc7+ #15
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x173/0x1d0 lib/dump_stack.c:113
kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613
__msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:295
mpol_rebind_policy mm/mempolicy.c:353 [inline]
mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384
update_tasks_nodemask+0x608/0xca0 kernel/cgroup/cpuset.c:1120
update_nodemasks_hier kernel/cgroup/cpuset.c:1185 [inline]
update_nodemask kernel/cgroup/cpuset.c:1253 [inline]
cpuset_write_resmask+0x2a98/0x34b0 kernel/cgroup/cpuset.c:1728
...
Uninit was created at:
kmsan_save_stack_with_flags mm/kmsan/kmsan.c:204 [inline]
kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:158
kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176
kmem_cache_alloc+0x572/0xb90 mm/slub.c:2777
mpol_new mm/mempolicy.c:276 [inline]
do_mbind mm/mempolicy.c:1180 [inline]
kernel_mbind+0x8a7/0x31a0 mm/mempolicy.c:1347
__do_sys_mbind mm/mempolicy.c:1354 [inline]
As it's difficult to report where exactly the uninit value resides in
the mempolicy object, we have to guess a bit. mm/mempolicy.c:353
contains this part of mpol_rebind_policy():
if (!mpol_store_user_nodemask(pol) &&
nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
"mpol_store_user_nodemask(pol)" is testing pol->flags, which I couldn't
ever see being uninitialized after leaving mpol_new(). So I'll guess
it's actually about accessing pol->w.cpuset_mems_allowed on line 354,
but still part of statement starting on line 353.
For w.cpuset_mems_allowed to be not initialized, and the nodes_equal()
reachable for a mempolicy where mpol_set_nodemask() is called in
do_mbind(), it seems the only possibility is a MPOL_PREFERRED policy
with empty set of nodes, i.e. MPOL_LOCAL equivalent, with MPOL_F_LOCAL
flag. Let's exclude such policies from the nodes_equal() check. Note
the uninit access should be benign anyway, as rebinding this kind of
policy is always a no-op. Therefore no actual need for stable
inclusion.
Link: http://lkml.kernel.org/r/a71997c3-e8ae-a787-d5ce-3db05768b27c@suse.cz
Link: http://lkml.kernel.org/r/73da3e9c-cc84-509e-17d9-0c434bb9967d@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: syzbot+b19c2dc2c990ea657a71@syzkaller.appspotmail.com
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Cc: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Tue, 5 Mar 2019 23:46:47 +0000 (15:46 -0800)]
memcg: killed threads should not invoke memcg OOM killer
If a memory cgroup contains a single process with many threads
(including different process group sharing the mm) then it is possible
to trigger a race when the oom killer complains that there are no oom
elible tasks and complain into the log which is both annoying and
confusing because there is no actual problem. The race looks as
follows:
P1 oom_reaper P2
try_charge try_charge
mem_cgroup_out_of_memory
mutex_lock(oom_lock)
out_of_memory
oom_kill_process(P1,P2)
wake_oom_reaper
mutex_unlock(oom_lock)
oom_reap_task
mutex_lock(oom_lock)
select_bad_process # no victim
The problem is more visible with many threads.
Fix this by checking for fatal_signal_pending from
mem_cgroup_out_of_memory when the oom_lock is already held.
The oom bypass is safe because we do the same early in the try_charge
path already. The situation migh have changed in the mean time. It
should be safe to check for fatal_signal_pending and tsk_is_oom_victim
but for a better code readability abstract the current charge bypass
condition into should_force_charge and reuse it from that path. "
Link: http://lkml.kernel.org/r/01370f70-e1f6-ebe4-b95e-0df21a0bc15e@i-love.sakura.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Tue, 5 Mar 2019 23:46:43 +0000 (15:46 -0800)]
mm/page_alloc.c: check return value of memblock_alloc_node_nopanic()
There are two early memory allocations that use
memblock_alloc_node_nopanic() and do not check its return value.
While this happens very early during boot and chances that the
allocation will fail are diminishing, it is still worth to have proper
checks for the allocation errors.
Link: http://lkml.kernel.org/r/1547734941-944-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aneesh Kumar K.V [Tue, 5 Mar 2019 23:46:40 +0000 (15:46 -0800)]
arch/powerpc/mm/hugetlb: NestMMU workaround for hugetlb mprotect RW upgrade
NestMMU requires us to mark the pte invalid and flush the tlb when we do
a RW upgrade of pte. We fixed a variant of this in the fault path in
bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to handle
nest MMU hang").
Link: http://lkml.kernel.org/r/20190116085035.29729-6-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aneesh Kumar K.V [Tue, 5 Mar 2019 23:46:37 +0000 (15:46 -0800)]
mm/hugetlb: add prot_modify_start/commit sequence for hugetlb update
Architectures like ppc64 require to do a conditional tlb flush based on
the old and new value of pte. Follow the regular pte change protection
sequence for hugetlb too. This allows the architectures to override the
update sequence.
Link: http://lkml.kernel.org/r/20190116085035.29729-5-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aneesh Kumar K.V [Tue, 5 Mar 2019 23:46:33 +0000 (15:46 -0800)]
arch/powerpc/mm: Nest MMU workaround for mprotect RW upgrade
NestMMU requires us to mark the pte invalid and flush the tlb when we do
a RW upgrade of pte. We fixed a variant of this in the fault path in
bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to handle
nest MMU hang").
Do the same for mprotect upgrades.
Hugetlb is handled in the next patch.
Link: http://lkml.kernel.org/r/20190116085035.29729-4-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aneesh Kumar K.V [Tue, 5 Mar 2019 23:46:29 +0000 (15:46 -0800)]
mm: update ptep_modify_prot_commit to take old pte value as arg
Architectures like ppc64 require to do a conditional tlb flush based on
the old and new value of pte. Enable that by passing old pte value as
the arg.
Link: http://lkml.kernel.org/r/20190116085035.29729-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aneesh Kumar K.V [Tue, 5 Mar 2019 23:46:26 +0000 (15:46 -0800)]
mm: update ptep_modify_prot_start/commit to take vm_area_struct as arg
Patch series "NestMMU pte upgrade workaround for mprotect", v5.
We can upgrade pte access (R -> RW transition) via mprotect. We need to
make sure we follow the recommended pte update sequence as outlined in
commit
bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to
handle nest MMU hang") for such updates. This patch series does that.
This patch (of 5):
Some architectures may want to call flush_tlb_range from these helpers.
Link: http://lkml.kernel.org/r/20190116085035.29729-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wei Yang [Tue, 5 Mar 2019 23:46:22 +0000 (15:46 -0800)]
mm: fix some typos in mm directory
No functional change.
Link: http://lkml.kernel.org/r/20190118235123.27843-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changbin Du [Tue, 5 Mar 2019 23:46:19 +0000 (15:46 -0800)]
mm/page_owner: move config option to mm/Kconfig.debug
Move the PAGE_OWNER option from submenu "Compile-time checks and
compiler options" to dedicated submenu "Memory Debugging".
Link: http://lkml.kernel.org/r/20190120024254.6270-1-changbin.du@gmail.com
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Fan [Tue, 5 Mar 2019 23:46:16 +0000 (15:46 -0800)]
mm/mmap.c: remove some redundancy in arch_get_unmapped_area_topdown()
The variable 'addr' is redundant in arch_get_unmapped_area_topdown(),
just use parameter 'addr0' directly. Then remove the const qualifier of
the parameter, and change its name to 'addr'.
And in according with other functions, remove the const qualifier of all
other no-pointer parameters in function arch_get_unmapped_area_topdown().
Link: http://lkml.kernel.org/r/20190127041112.25599-1-nullptr.cpp@gmail.com
Signed-off-by: Yang Fan <nullptr.cpp@gmail.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shakeel Butt [Tue, 5 Mar 2019 23:46:12 +0000 (15:46 -0800)]
mm, oom: remove 'prefer children over parent' heuristic
Since the start of the git history of Linux, the kernel after selecting
the worst process to be oom-killed, prefer to kill its child (if the
child does not share mm with the parent). Later it was changed to
prefer to kill a child who is worst. If the parent is still the worst
then the parent will be killed.
This heuristic assumes that the children did less work than their parent
and by killing one of them, the work lost will be less. However this is
very workload dependent. If there is a workload which can benefit from
this heuristic, can use oom_score_adj to prefer children to be killed
before the parent.
The select_bad_process() has already selected the worst process in the
system/memcg. There is no need to recheck the badness of its children
and hoping to find a worse candidate. That's a lot of unneeded racy
work. Also the heuristic is dangerous because it make fork bomb like
workloads to recover much later because we constantly pick and kill
processes which are not memory hogs. So, let's remove this whole
heuristic.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Greg Kroah-Hartman [Tue, 5 Mar 2019 23:46:09 +0000 (15:46 -0800)]
mm: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
Link: http://lkml.kernel.org/r/20190122152151.16139-14-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Laura Abbott <labbott@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Tue, 5 Mar 2019 23:46:06 +0000 (15:46 -0800)]
mm/memory.c: prevent mapping typed pages to userspace
Pages which use page_type must never be mapped to userspace as it would
destroy their page type. Add an explicit check for this instead of
assuming that kernel drivers always get this right.
Link: http://lkml.kernel.org/r/20190129053830.3749-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Tue, 5 Mar 2019 23:46:02 +0000 (15:46 -0800)]
mm: prevent mapping slab pages to userspace
It's never appropriate to map a page allocated by SLAB into userspace.
A buggy device driver might try this, or an attacker might be able to
find a way to make it happen.
Christoph said:
: Let's just fail the code. Currently this may work with SLUB. But SLAB
: and SLOB overlay fields with mapcount. So you would have a corrupted page
: struct if you mapped a slab page to user space.
Link: http://lkml.kernel.org/r/20190125173827.2658-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Uladzislau Rezki (Sony) [Tue, 5 Mar 2019 23:45:59 +0000 (15:45 -0800)]
mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512!
One of the vmalloc stress test case triggers the kernel BUG():
<snip>
[60.562151] ------------[ cut here ]------------
[60.562154] kernel BUG at mm/vmalloc.c:512!
[60.562206] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[60.562247] CPU: 0 PID: 430 Comm: vmalloc_test/0 Not tainted 4.20.0+ #161
[60.562293] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[60.562351] RIP: 0010:alloc_vmap_area+0x36f/0x390
<snip>
it can happen due to big align request resulting in overflowing of
calculated address, i.e. it becomes 0 after ALIGN()'s fixup.
Fix it by checking if calculated address is within vstart/vend range.
Link: http://lkml.kernel.org/r/20190124115648.9433-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chris Down [Tue, 5 Mar 2019 23:45:55 +0000 (15:45 -0800)]
mm, memcg: extract memcg maxable seq_file logic to seq_show_memcg_tunable
memcg has a significant number of files exposed to kernfs where their
value is either exposed directly or is "max" in the case of
PAGE_COUNTER_MAX.
This patch makes this generic by providing a single function to do this
work. In combination with the previous patch adding
mem_cgroup_from_seq, this makes all of the seq_show feeder functions
significantly more simple.
Link: http://lkml.kernel.org/r/20190124194100.GA31425@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chris Down [Tue, 5 Mar 2019 23:45:52 +0000 (15:45 -0800)]
mm, memcg: create mem_cgroup_from_seq
This is the start of a series of patches similar to my earlier
DEFINE_MEMCG_MAX_OR_VAL work, but with less Macro Magic(tm).
There are a bunch of places we go from seq_file to mem_cgroup, which
currently requires manually getting the css, then getting the mem_cgroup
from the css. It's in enough places now that having mem_cgroup_from_seq
makes sense (and also makes the next patch a bit nicer).
Link: http://lkml.kernel.org/r/20190124194050.GA31341@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>