Leo Yan [Sun, 11 Oct 2020 12:10:22 +0000 (20:10 +0800)]
perf c2c: Update usage for showing memory events
Since commit b027cc6fdf1b ("perf c2c: Fix 'perf c2c record -e list' to
show the default events used"), the "perf c2c" tool can show the memory
events properly, so there is no reason to still suggest that the user run
the command "perf mem record -e list" to show events.
This patch updates the usage for showing memory events with command
"perf c2c record -e list".
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Ian Rogers <irogers@google.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20201011121022.22409-1-leo.yan@linaro.org
Arnaldo Carvalho de Melo [Tue, 13 Oct 2020 16:02:20 +0000 (13:02 -0300)]
Merge branch 'perf/urgent' into perf/core
To pick fixes that missed v5.9.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tzvetomir Stoyanov (VMware) [Wed, 30 Sep 2020 11:07:33 +0000 (14:07 +0300)]
tools lib traceevent: Hide non API functions
There are internal library functions that are not declared static.
They are used inside the library from different files. Hide them from
the library users, as they are not part of the API.
These functions are made hidden and are renamed without the prefix "tep_":
tep_free_plugin_paths
tep_peek_char
tep_buffer_init
tep_get_input_buf_ptr
tep_get_input_buf
tep_read_token
tep_free_token
tep_free_event
tep_free_format_field
__tep_parse_format
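One common way to hide such helpers with GCC or Clang is the ELF visibility attribute. The sketch below is illustrative only: the LIB_HIDDEN macro, the buffer_init()/peek_char() bodies and the private state are assumptions made for the example, not the code from this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed helper macro: symbols marked this way are not exported from a shared object. */
#define LIB_HIDDEN __attribute__((visibility("hidden")))

/* Hypothetical private parser state, standing in for the library's input buffer. */
static const char *input_buf = "";
static size_t input_pos;

/* Internal helper: callable from any file in the library, invisible to its users. */
LIB_HIDDEN void buffer_init(const char *buf)
{
	assert(buf != NULL);
	input_buf = buf;
	input_pos = 0;
}

/* Return the next byte without consuming it, or -1 at end of input. */
LIB_HIDDEN int peek_char(void)
{
	return input_buf[input_pos] ? (unsigned char)input_buf[input_pos] : -1;
}
```

Dropping the "tep_" prefix at the same time makes it obvious at the call site that a function is internal and not part of the public API.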
Link: https://lore.kernel.org/linux-trace-devel/e4afdd82deb5e023d53231bb13e08dca78085fb0.camel@decadent.org.uk/
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: linux-trace-devel@vger.kernel.org
Link: http://lore.kernel.org/lkml/20200930110733.280534-1-tz.stoyanov@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Joel Fernandes (Google) [Fri, 25 Sep 2020 23:56:34 +0000 (19:56 -0400)]
perf sched: Show start of latency as well
The 'perf sched latency' tool is really useful at showing worst-case
latencies that a task encountered since wakeup. However, it shows only the
end of the latency. Often the start of a latency is interesting, as
it can show what else was going on at the time to cause the latency. I
certainly find myself spending a lot of time backtracking to the start of
the latency in "perf sched script".
This patch therefore adds a new column "Max delay start". Considering
this, also rename "Maximum delay at" to "Max delay end" as it's easier to
understand.
Example of the new output:
----------------------------------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Max delay start | Max delay end |
----------------------------------------------------------------------------------------------------------------------------------
MediaScannerSer:11936 | 651.296 ms | 67978 | avg: 0.113 ms | max: 77.250 ms | max start: 477.691360 s | max end: 477.768610 s
audio@2.0-servi:(3) | 0.000 ms | 3440 | avg: 0.034 ms | max: 72.267 ms | max start: 477.697051 s | max end: 477.769318 s
AudioOut_1D:8112 | 0.000 ms | 2588 | avg: 0.083 ms | max: 64.020 ms | max start: 477.710740 s | max end: 477.774760 s
Time-limited te:14973 | 7966.090 ms | 24807 | avg: 0.073 ms | max: 15.563 ms | max start: 477.162746 s | max end: 477.178309 s
surfaceflinger:8049 | 9.680 ms | 603 | avg: 0.063 ms | max: 13.275 ms | max start: 476.931791 s | max end: 476.945067 s
HeapTaskDaemon:(3) | 1588.830 ms | 7040 | avg: 0.065 ms | max: 6.880 ms | max start: 473.666043 s | max end: 473.672922 s
mount-passthrou:(3) | 1370.809 ms | 68904 | avg: 0.011 ms | max: 6.524 ms | max start: 478.090630 s | max end: 478.097154 s
ReferenceQueueD:(3) | 11.794 ms | 1725 | avg: 0.014 ms | max: 6.521 ms | max start: 476.119782 s | max end: 476.126303 s
writer:14077 | 18.410 ms | 1427 | avg: 0.036 ms | max: 6.131 ms | max start: 474.169675 s | max end: 474.175805 s
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20200925235634.4089867-1-joel@joelfernandes.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Sandipan Das [Mon, 12 Oct 2020 05:02:05 +0000 (10:32 +0530)]
perf vendor events: Fix typos in power8 PMU events
This replaces the incorrectly spelled word "localtion" with "location"
in some power8 PMU event descriptions.
Fixes: 2a81fa3bb5ed ("perf vendor events: Add power8 PMU events")
Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Link: http://lore.kernel.org/lkml/20201012050205.328523-1-sandipan@linux.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Mon, 12 Oct 2020 07:02:14 +0000 (16:02 +0900)]
perf bench: Run inject-build-id with --buildid-all option too
For comparison, it now runs the benchmark twice: once with the regular -b
and once with --buildid-all.
$ perf bench internals inject-build-id
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 21.002 msec (+- 0.172 msec)
Average time per event: 2.059 usec (+- 0.017 usec)
Average memory usage: 8169 KB (+- 0 KB)
Average build-id-all injection took: 19.543 msec (+- 0.124 msec)
Average time per event: 1.916 usec (+- 0.012 usec)
Average memory usage: 7348 KB (+- 0 KB)
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20201012070214.2074921-7-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Mon, 12 Oct 2020 07:02:13 +0000 (16:02 +0900)]
perf inject: Add --buildid-all option
Like 'perf record', we can speed up build-id processing even more by just
using all DSOs. Then we don't need to look at all the sample events
anymore. The following patch will update 'perf bench' to show the result
of the --buildid-all option too.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Original-patch-by: Stephane Eranian <eranian@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20201012070214.2074921-6-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Mon, 12 Oct 2020 07:02:12 +0000 (16:02 +0900)]
perf inject: Do not load map/dso when injecting build-id
No need to load symbols in a DSO when injecting build-id. I guess the
reason was to check whether the DSO is a special file like anon files. Use some
helper functions in map.c to check them before reading build-id. Also
pass sample event's cpumode to a new build-id event.
It brought a speedup in the benchmark of 25 -> 21 msec on my laptop.
Also the memory usage (Max RSS) went down by ~200 KB.
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 21.389 msec (+- 0.138 msec)
Average time per event: 2.097 usec (+- 0.014 usec)
Average memory usage: 8225 KB (+- 0 KB)
Committer notes:
Before:
$ perf stat -r5 perf bench internals inject-build-id > /dev/null
Performance counter stats for 'perf bench internals inject-build-id' (5 runs):
4,020.56 msec task-clock:u # 1.271 CPUs utilized ( +- 0.74% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
123,354 page-faults:u # 0.031 M/sec ( +- 0.81% )
7,119,951,568 cycles:u # 1.771 GHz ( +- 1.74% ) (83.27%)
230,086,969 stalled-cycles-frontend:u # 3.23% frontend cycles idle ( +- 1.97% ) (83.41%)
1,168,298,765 stalled-cycles-backend:u # 16.41% backend cycles idle ( +- 1.13% ) (83.44%)
11,173,083,669 instructions:u # 1.57 insn per cycle
# 0.10 stalled cycles per insn ( +- 1.58% ) (83.31%)
2,413,908,936 branches:u # 600.392 M/sec ( +- 1.69% ) (83.26%)
46,576,289 branch-misses:u # 1.93% of all branches ( +- 2.20% ) (83.31%)
3.1638 +- 0.0309 seconds time elapsed ( +- 0.98% )
$
After:
$ perf stat -r5 perf bench internals inject-build-id > /dev/null
Performance counter stats for 'perf bench internals inject-build-id' (5 runs):
2,379.94 msec task-clock:u # 1.473 CPUs utilized ( +- 0.18% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
62,584 page-faults:u # 0.026 M/sec ( +- 0.07% )
2,372,389,668 cycles:u # 0.997 GHz ( +- 0.29% ) (83.14%)
106,937,862 stalled-cycles-frontend:u # 4.51% frontend cycles idle ( +- 4.89% ) (83.20%)
581,697,915 stalled-cycles-backend:u # 24.52% backend cycles idle ( +- 0.71% ) (83.47%)
3,659,692,199 instructions:u # 1.54 insn per cycle
# 0.16 stalled cycles per insn ( +- 0.10% ) (83.63%)
791,372,961 branches:u # 332.518 M/sec ( +- 0.27% ) (83.39%)
10,648,083 branch-misses:u # 1.35% of all branches ( +- 0.22% ) (83.16%)
1.61570 +- 0.00172 seconds time elapsed ( +- 0.11% )
$
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Original-patch-by: Stephane Eranian <eranian@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20201012070214.2074921-5-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Mon, 12 Oct 2020 07:02:11 +0000 (16:02 +0900)]
perf inject: Enter namespace when reading build-id
It should be in a proper mnt namespace when accessing the file.
I think this had no problem since the build-id was actually read from
map__load() -> dso__load() already. But I'd like to change it in the
following commit.
Acked-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20201012070214.2074921-4-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Mon, 12 Oct 2020 07:02:10 +0000 (16:02 +0900)]
perf inject: Add missing callbacks in perf_tool
I found some events (like PERF_RECORD_CGROUP) are not copied by perf
inject due to the missing callbacks. Let's add them.
While at it, I've changed the order of the callbacks to match with
struct perf_tool so that we can compare them easily.
Acked-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20201012070214.2074921-3-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Mon, 12 Oct 2020 07:02:09 +0000 (16:02 +0900)]
perf bench: Add build-id injection benchmark
Sometimes I can see that 'perf record' piped with 'perf inject' takes a
long time processing build-ids.
So introduce an inject-build-id benchmark to the internals benchmark
suite to measure its overhead regularly.
It runs the 'perf inject' command internally and feeds the given number
of synthesized events (MMAP2 + SAMPLE basically).
Usage: perf bench internals inject-build-id <options>
-i, --iterations <n> Number of iterations used to compute average (default: 100)
-m, --nr-mmaps <n> Number of mmap events for each iteration (default: 100)
-n, --nr-samples <n> Number of sample events per mmap event (default: 100)
-v, --verbose be more verbose (show iteration count, DSO name, etc)
By default, it measures average processing time of 100 MMAP2 events
and 10000 SAMPLE events. Below is a result on my laptop.
$ perf bench internals inject-build-id
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 25.789 msec (+- 0.202 msec)
Average time per event: 2.528 usec (+- 0.020 usec)
Average memory usage: 8411 KB (+- 7 KB)
Committer testing:
$ perf bench
Usage:
perf bench [<common options>] <collection> <benchmark> [<options>]
# List of all available benchmark collections:
sched: Scheduler and IPC benchmarks
syscall: System call benchmarks
mem: Memory access benchmarks
numa: NUMA scheduling and MM benchmarks
futex: Futex stressing benchmarks
epoll: Epoll stressing benchmarks
internals: Perf-internals benchmarks
all: All benchmarks
$ perf bench internals
# List of available benchmarks for collection 'internals':
synthesize: Benchmark perf event synthesis
kallsyms-parse: Benchmark kallsyms parsing
inject-build-id: Benchmark build-id injection
$ perf bench internals inject-build-id
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.202 msec (+- 0.059 msec)
Average time per event: 1.392 usec (+- 0.006 usec)
Average memory usage: 12650 KB (+- 10 KB)
Average build-id-all injection took: 12.831 msec (+- 0.071 msec)
Average time per event: 1.258 usec (+- 0.007 usec)
Average memory usage: 11895 KB (+- 10 KB)
$
$ perf stat -r5 perf bench internals inject-build-id
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.380 msec (+- 0.056 msec)
Average time per event: 1.410 usec (+- 0.006 usec)
Average memory usage: 12608 KB (+- 11 KB)
Average build-id-all injection took: 11.889 msec (+- 0.064 msec)
Average time per event: 1.166 usec (+- 0.006 usec)
Average memory usage: 11838 KB (+- 10 KB)
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.246 msec (+- 0.065 msec)
Average time per event: 1.397 usec (+- 0.006 usec)
Average memory usage: 12744 KB (+- 10 KB)
Average build-id-all injection took: 12.019 msec (+- 0.066 msec)
Average time per event: 1.178 usec (+- 0.006 usec)
Average memory usage: 11963 KB (+- 10 KB)
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.321 msec (+- 0.067 msec)
Average time per event: 1.404 usec (+- 0.007 usec)
Average memory usage: 12690 KB (+- 10 KB)
Average build-id-all injection took: 11.909 msec (+- 0.041 msec)
Average time per event: 1.168 usec (+- 0.004 usec)
Average memory usage: 11938 KB (+- 10 KB)
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.287 msec (+- 0.059 msec)
Average time per event: 1.401 usec (+- 0.006 usec)
Average memory usage: 12864 KB (+- 10 KB)
Average build-id-all injection took: 11.862 msec (+- 0.058 msec)
Average time per event: 1.163 usec (+- 0.006 usec)
Average memory usage: 12103 KB (+- 10 KB)
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.402 msec (+- 0.053 msec)
Average time per event: 1.412 usec (+- 0.005 usec)
Average memory usage: 12876 KB (+- 10 KB)
Average build-id-all injection took: 11.826 msec (+- 0.061 msec)
Average time per event: 1.159 usec (+- 0.006 usec)
Average memory usage: 12111 KB (+- 10 KB)
Performance counter stats for 'perf bench internals inject-build-id' (5 runs):
4,267.48 msec task-clock:u # 1.502 CPUs utilized ( +- 0.14% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
102,092 page-faults:u # 0.024 M/sec ( +- 0.08% )
3,894,589,578 cycles:u # 0.913 GHz ( +- 0.19% ) (83.49%)
140,078,421 stalled-cycles-frontend:u # 3.60% frontend cycles idle ( +- 0.77% ) (83.34%)
948,581,189 stalled-cycles-backend:u # 24.36% backend cycles idle ( +- 0.46% ) (83.25%)
5,835,587,719 instructions:u # 1.50 insn per cycle
# 0.16 stalled cycles per insn ( +- 0.21% ) (83.24%)
1,267,423,636 branches:u # 296.996 M/sec ( +- 0.22% ) (83.12%)
17,484,290 branch-misses:u # 1.38% of all branches ( +- 0.12% ) (83.55%)
2.84176 +- 0.00222 seconds time elapsed ( +- 0.08% )
$
Acked-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20201012070214.2074921-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Wed, 7 Oct 2020 08:13:11 +0000 (17:13 +0900)]
perf stat: Fix out of bounds CPU map access when handling armv8_pmu events
It was reported that 'perf stat' crashed when used with armv8_pmu (CPU)
events in task mode. As 'perf stat' uses an empty cpu map for
task mode but armv8_pmu has its own cpu mask, it got confused about which map
it should use when accessing file descriptors, and this causes segfaults:
(gdb) bt
#0 0x0000000000603fc8 in perf_evsel__close_fd_cpu (evsel=<optimized out>,
cpu=<optimized out>) at evsel.c:122
#1 perf_evsel__close_cpu (evsel=evsel@entry=0x716e950, cpu=7) at evsel.c:156
#2 0x00000000004d4718 in evlist__close (evlist=0x70a7cb0) at util/evlist.c:1242
#3 0x0000000000453404 in __run_perf_stat (argc=3, argc@entry=1, argv=0x30,
argv@entry=0xfffffaea2f90, run_idx=119, run_idx@entry=1701998435)
at builtin-stat.c:929
#4 0x0000000000455058 in run_perf_stat (run_idx=1701998435, argv=0xfffffaea2f90,
argc=1) at builtin-stat.c:947
#5 cmd_stat (argc=1, argv=0xfffffaea2f90) at builtin-stat.c:2357
#6 0x00000000004bb888 in run_builtin (p=p@entry=0x9764b8 <commands+288>,
argc=argc@entry=4, argv=argv@entry=0xfffffaea2f90) at perf.c:312
#7 0x00000000004bbb54 in handle_internal_command (argc=argc@entry=4,
argv=argv@entry=0xfffffaea2f90) at perf.c:364
#8 0x0000000000435378 in run_argv (argcp=<synthetic pointer>,
argv=<synthetic pointer>) at perf.c:408
#9 main (argc=4, argv=0xfffffaea2f90) at perf.c:538
To fix this, I simply used the given cpu map unless the evsel actually
is not a system-wide event (like uncore events).
Fixes: 7736627b865d ("perf stat: Use affinity for closing file descriptors")
Reported-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Barry Song <song.bao.hua@hisilicon.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20201007081311.1831003-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Mon, 28 Sep 2020 20:11:35 +0000 (22:11 +0200)]
perf python scripting: Fix printable strings in python3 scripts
Hagen reported broken strings in python3 tracepoint scripts:
make PYTHON=python3
perf record -e sched:sched_switch -a -- sleep 5
perf script --gen-script py
perf script -s ./perf-script.py
[..]
sched__sched_switch 7 563231.759525792 0 swapper prev_comm=bytearray(b'swapper/7\x00\x00\x00\x00\x00\x00\x00'), prev_pid=0, prev_prio=120, prev_state=, next_comm=bytearray(b'mutex-thread-co\x00'),
The problem is in the is_printable_array function, which does not take the
zero byte into account and claims such strings are not printable, so the
code will create a byte array instead of a string.
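The idea can be sketched as below: stop scanning at the first zero byte so that the NUL padding of a fixed-size field such as prev_comm is never tested for printability. The helper name printable_until_nul() is hypothetical and this is not the perf function itself, only an illustration of the check.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: report whether a fixed-size field holds a
 * printable, NUL-terminated C string. Scanning stops at the first NUL,
 * so padding bytes after it are never inspected.
 */
static int printable_until_nul(const char *p, size_t len)
{
	size_t i;

	assert(p != NULL);
	if (len == 0 || p[len - 1] != '\0')
		return 0;	/* not NUL-terminated within the field */

	for (i = 0; i < len && p[i] != '\0'; i++) {
		unsigned char c = (unsigned char)p[i];

		if (!isprint(c) && !isspace(c))
			return 0;
	}
	return 1;
}
```

A 16-byte field holding "swapper/7" followed by zero padding is then accepted as a string, while a field containing a control byte before the first NUL is still rejected.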
Committer testing:
After this fix:
sched__sched_switch 3 484522.497072626 1158680 kworker/3:0-eve prev_comm=kworker/3:0, prev_pid=1158680, prev_prio=120, prev_state=I, next_comm=swapper/3, next_pid=0, next_prio=120
Sample: {addr=0, cpu=3, datasrc=84410401, datasrc_decode=N/A|SNP N/A|TLB N/A|LCK N/A, ip=18446744071841817196, period=1, phys_addr=0, pid=1158680, tid=1158680, time=484522497072626, transaction=0, values=[(0, 0)], weight=0}
sched__sched_switch 4 484522.497085610 1225814 perf prev_comm=perf, prev_pid=1225814, prev_prio=120, prev_state=, next_comm=migration/4, next_pid=30, next_prio=0
Sample: {addr=0, cpu=4, datasrc=84410401, datasrc_decode=N/A|SNP N/A|TLB N/A|LCK N/A, ip=18446744071841817196, period=1, phys_addr=0, pid=1225814, tid=1225814, time=484522497085610, transaction=0, values=[(0, 0)], weight=0}
Fixes: 249de6e07458 ("perf script python: Fix string vs byte array resolving")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Petlan <mpetlan@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: http://lore.kernel.org/lkml/20200928201135.3633850-1-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Arnaldo Carvalho de Melo [Thu, 1 Oct 2020 14:23:38 +0000 (11:23 -0300)]
perf trace: Use the autogenerated mmap 'prot' string/id table
No change in behaviour:
# perf trace -e mmap sleep 1
0.000 ( 0.009 ms): sleep/751870 mmap(len: 143317, prot: READ, flags: PRIVATE, fd: 3) = 0x7fa96d0f7000
0.028 ( 0.004 ms): sleep/751870 mmap(len: 8192, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS) = 0x7fa96d0f5000
0.037 ( 0.005 ms): sleep/751870 mmap(len: 1872744, prot: READ, flags: PRIVATE|DENYWRITE, fd: 3) = 0x7fa96cf2b000
0.044 ( 0.011 ms): sleep/751870 mmap(addr: 0x7fa96cf50000, len: 1376256, prot: READ|EXEC, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x25000) = 0x7fa96cf50000
0.056 ( 0.007 ms): sleep/751870 mmap(addr: 0x7fa96d0a0000, len: 307200, prot: READ, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x175000) = 0x7fa96d0a0000
0.064 ( 0.007 ms): sleep/751870 mmap(addr: 0x7fa96d0eb000, len: 24576, prot: READ|WRITE, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x1bf000) = 0x7fa96d0eb000
0.075 ( 0.005 ms): sleep/751870 mmap(addr: 0x7fa96d0f1000, len: 13160, prot: READ|WRITE, flags: PRIVATE|FIXED|ANONYMOUS) = 0x7fa96d0f1000
0.253 ( 0.005 ms): sleep/751870 mmap(len: 218049136, prot: READ, flags: PRIVATE, fd: 3) = 0x7fa95ff38000
#
#
# set -o vi
# strace -e mmap sleep 1
mmap(NULL, 143317, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f333bd83000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f333bd81000
mmap(NULL, 1872744, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f333bbb7000
mmap(0x7f333bbdc000, 1376256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f333bbdc000
mmap(0x7f333bd2c000, 307200, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x175000) = 0x7f333bd2c000
mmap(0x7f333bd77000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bf000) = 0x7f333bd77000
mmap(0x7f333bd7d000, 13160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f333bd7d000
mmap(NULL, 218049136, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f332ebc4000
+++ exited with 0 +++
#
And you can as well tweak 'perf trace's output to more closely match
strace's:
# perf config trace.show_arg_names=no
# perf config trace.show_duration=no
# perf config trace.show_prefix=yes
# perf config trace.show_timestamp=no
# perf config trace.show_zeros=yes
# perf config trace.no_inherit=yes
# perf trace -e mmap sleep 1
mmap(NULL, 143317, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0d287ca000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS) = 0x7f0d287c8000
mmap(NULL, 1872744, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0d285fe000
mmap(0x7f0d28623000, 1376256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f0d28623000
mmap(0x7f0d28773000, 307200, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x175000) = 0x7f0d28773000
mmap(0x7f0d287be000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bf000) = 0x7f0d287be000
mmap(0x7f0d287c4000, 13160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS) = 0x7f0d287c4000
mmap(NULL, 218049136, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0d1b60b000
#
# perf config | grep ^trace
trace.show_arg_names=no
trace.show_duration=no
trace.show_prefix=yes
trace.show_timestamp=no
trace.show_zeros=yes
trace.no_inherit=yes
#
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Arnaldo Carvalho de Melo [Thu, 1 Oct 2020 14:14:22 +0000 (11:14 -0300)]
tools beauty: Add script to generate table of mmap's 'prot' argument
Will be wired up in the following csets:
$ tools/perf/trace/beauty/mmap_prot.sh
static const char *mmap_prot[] = {
[ilog2(0x1) + 1] = "READ",
#ifndef PROT_READ
#define PROT_READ 0x1
#endif
[ilog2(0x2) + 1] = "WRITE",
#ifndef PROT_WRITE
#define PROT_WRITE 0x2
#endif
[ilog2(0x4) + 1] = "EXEC",
#ifndef PROT_EXEC
#define PROT_EXEC 0x4
#endif
[ilog2(0x8) + 1] = "SEM",
#ifndef PROT_SEM
#define PROT_SEM 0x8
#endif
[ilog2(0x01000000) + 1] = "GROWSDOWN",
#ifndef PROT_GROWSDOWN
#define PROT_GROWSDOWN 0x01000000
#endif
[ilog2(0x02000000) + 1] = "GROWSUP",
#ifndef PROT_GROWSUP
#define PROT_GROWSUP 0x02000000
#endif
};
$
$
$
$ tools/perf/trace/beauty/mmap_prot.sh alpha
static const char *mmap_prot[] = {
[ilog2(0x4) + 1] = "EXEC",
#ifndef PROT_EXEC
#define PROT_EXEC 0x4
#endif
[ilog2(0x01000000) + 1] = "GROWSDOWN",
#ifndef PROT_GROWSDOWN
#define PROT_GROWSDOWN 0x01000000
#endif
[ilog2(0x02000000) + 1] = "GROWSUP",
#ifndef PROT_GROWSUP
#define PROT_GROWSUP 0x02000000
#endif
[ilog2(0x1) + 1] = "READ",
#ifndef PROT_READ
#define PROT_READ 0x1
#endif
[ilog2(0x8) + 1] = "SEM",
#ifndef PROT_SEM
#define PROT_SEM 0x8
#endif
[ilog2(0x2) + 1] = "WRITE",
#ifndef PROT_WRITE
#define PROT_WRITE 0x2
#endif
};
$
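As an illustration of how a table of this shape can be consumed, the sketch below turns a 'prot' bitmask into a "READ|WRITE" style string. It is an assumption-laden example, not the perf trace code: ILOG2() is a simplified constant-expression stand-in for ilog2() that only covers the four low bits, the table is trimmed accordingly, and prot_to_string() and append_flag() are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for ilog2(), valid only for the bits 0x1..0x8. */
#define ILOG2(x) ((x) & 0x8 ? 3 : (x) & 0x4 ? 2 : (x) & 0x2 ? 1 : 0)

/* Trimmed copy of the generated table: slot 0 is unused, bit N lives at N + 1. */
static const char *mmap_prot[] = {
	[ILOG2(0x1) + 1] = "READ",
	[ILOG2(0x2) + 1] = "WRITE",
	[ILOG2(0x4) + 1] = "EXEC",
	[ILOG2(0x8) + 1] = "SEM",
};

/* Append one name to buf, separated by '|' when buf is not empty. */
static void append_flag(char *buf, size_t size, const char *name)
{
	size_t len = strlen(buf);

	if (len < size - 1)
		snprintf(buf + len, size - len, "%s%s", len ? "|" : "", name);
}

/* Hypothetical formatter: known bits by name, leftover bits in hex. */
static void prot_to_string(unsigned int prot, char *buf, size_t size)
{
	size_t n = sizeof(mmap_prot) / sizeof(mmap_prot[0]);
	size_t i;

	assert(buf != NULL && size > 0);
	buf[0] = '\0';
	if (prot == 0) {
		append_flag(buf, size, "NONE");	/* choice made for this sketch */
		return;
	}
	for (i = 1; i < n; i++) {
		unsigned int bit = 1u << (i - 1);

		if ((prot & bit) && mmap_prot[i]) {
			append_flag(buf, size, mmap_prot[i]);
			prot &= ~bit;
		}
	}
	if (prot) {
		char hex[16];

		snprintf(hex, sizeof(hex), "%#x", prot);
		append_flag(buf, size, hex);
	}
}
```

Because the table is indexed by bit position, the per-architecture ordering of entries in the generated output (as in the alpha listing above) does not affect the lookup.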
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Arnaldo Carvalho de Melo [Wed, 30 Sep 2020 12:34:20 +0000 (09:34 -0300)]
perf beauty mmap_flags: Conditionally define the mmap flags
So that in older systems we get it in the mmap flags scnprintf routines:
$ tools/perf/trace/beauty/mmap_flags.sh | head -9 2> /dev/null
static const char *mmap_flags[] = {
[ilog2(0x40) + 1] = "32BIT",
#ifndef MAP_32BIT
#define MAP_32BIT 0x40
#endif
[ilog2(0x01) + 1] = "SHARED",
#ifndef MAP_SHARED
#define MAP_SHARED 0x01
#endif
$
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Arnaldo Carvalho de Melo [Tue, 29 Sep 2020 21:07:27 +0000 (18:07 -0300)]
perf trace beauty: Add script to autogenerate mremap's flags args string/id table
It'll also conditionally generate the defines, so that if we don't have
those when building a new tool tarball on an older system, we get
them, and we need them sometimes in the actual scnprintf routine, such
as when checking if a flag means we have an extra arg, like with
MREMAP_FIXED.
$ tools/perf/trace/beauty/mremap_flags.sh
static const char *mremap_flags[] = {
[ilog2(1) + 1] = "MAYMOVE",
#ifndef MREMAP_MAYMOVE
#define MREMAP_MAYMOVE 1
#endif
[ilog2(2) + 1] = "FIXED",
#ifndef MREMAP_FIXED
#define MREMAP_FIXED 2
#endif
[ilog2(4) + 1] = "DONTUNMAP",
#ifndef MREMAP_DONTUNMAP
#define MREMAP_DONTUNMAP 4
#endif
};
$
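The reason the fallback defines matter can be sketched like this: a formatting routine that needs to know whether the optional fifth mremap() argument is meaningful has to test MREMAP_FIXED by name, even when the system headers do not provide it. The function mremap_nr_args() is a hypothetical helper for illustration, not part of perf.

```c
#include <assert.h>
#include <sys/mman.h>

/* Fallback for systems whose headers do not expose the flag (value as generated above). */
#ifndef MREMAP_FIXED
#define MREMAP_FIXED 2
#endif

/* Hypothetical helper: how many mremap() arguments are meaningful for 'flags'. */
static int mremap_nr_args(unsigned long flags)
{
	/* new_address, the fifth argument, is only used with MREMAP_FIXED. */
	return (flags & MREMAP_FIXED) ? 5 : 4;
}
```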
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Arnaldo Carvalho de Melo [Tue, 29 Sep 2020 11:56:38 +0000 (08:56 -0300)]
perf tools: Separate the checking of headers only used to build beautification tables
Some headers are not used in building the tools directly, but instead to
generate tables that then get included in the source code to do id->string and
string->id lookups for things like syscall flags and commands.
We were adding them directly to tools/include/ and this sometimes gets in
the way of building using system headers, let's untangle this a bit.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Arnaldo Carvalho de Melo [Mon, 28 Sep 2020 18:44:52 +0000 (15:44 -0300)]
Merge remote-tracking branch 'torvalds/master' into perf/core
To pick up fixes and get v5.10 development in sync with the main kernel
sources.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Linus Torvalds [Mon, 28 Sep 2020 18:05:56 +0000 (11:05 -0700)]
Merge tag 'nfs-for-5.9-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
- NFSv4.2: copy_file_range needs to invalidate caches on success
- NFSv4.2: Fix security label length not being reset
- pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly
on read
- pNFS/flexfiles: Fix signed/unsigned type issues with mirror
indices"
* tag 'nfs-for-5.9-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
pNFS/flexfiles: Be consistent about mirror index types
pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly on read
NFSv4.2: fix client's attribute cache management for copy_file_range
nfs: Fix security label length not being reset
Jason A. Donenfeld [Mon, 28 Sep 2020 10:35:07 +0000 (12:35 +0200)]
mm: do not rely on mm == current->mm in __get_user_pages_locked
It seems likely this block was pasted from internal_get_user_pages_fast,
which is not passed an mm struct and therefore uses current's. But
__get_user_pages_locked is passed an explicit mm, and current->mm is not
always valid. This was hit when being called from i915, which uses:
pin_user_pages_remote->
__get_user_pages_remote->
__gup_longterm_locked->
__get_user_pages_locked
Before, this would lead to an OOPS:
BUG: kernel NULL pointer dereference, address: 0000000000000064
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
CPU: 10 PID: 1431 Comm: kworker/u33:1 Tainted: P S U O 5.9.0-rc7+ #140
Hardware name: LENOVO 20QTCTO1WW/20QTCTO1WW, BIOS N2OET47W (1.34 ) 08/06/2020
Workqueue: i915-userptr-acquire __i915_gem_userptr_get_pages_worker [i915]
RIP: 0010:__get_user_pages_remote+0xd7/0x310
Call Trace:
__i915_gem_userptr_get_pages_worker+0xc8/0x260 [i915]
process_one_work+0x1ca/0x390
worker_thread+0x48/0x3c0
kthread+0x114/0x130
ret_from_fork+0x1f/0x30
CR2: 0000000000000064
This commit fixes the problem by using the mm pointer passed to the
function rather than the bogus one in current.
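The bug pattern can be modelled in user space as below. This is a deliberately simplified sketch, not kernel code: the two structs are minimal stand-ins, and current_task and mark_pinned() are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal user-space stand-ins for the kernel structures. */
struct mm_struct { int has_pinned; };
struct task_struct { struct mm_struct *mm; };

/* A kernel worker thread has no user address space, so its mm is NULL. */
static struct task_struct kworker = { .mm = NULL };
static struct task_struct *current_task = &kworker;

/* Mark the address space whose pages are being pinned. */
static void mark_pinned(struct mm_struct *mm)
{
	assert(mm != NULL);
	/*
	 * The buggy variant would write through current_task->mm here,
	 * which is a NULL dereference when called from a worker thread.
	 */
	mm->has_pinned = 1;
}
```

Using the explicit argument keeps the function correct for remote callers, where the target mm and the calling task's mm differ.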
Fixes: 008cfe4418b3 ("mm: Introduce mm_struct.has_pinned")
Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
Reported-by: Harald Arnesen <harald@skogtun.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ian Rogers [Wed, 23 Sep 2020 21:06:55 +0000 (14:06 -0700)]
perf test: Fix msan uninitialized use.
Ensure 'st' is initialized before an error branch is taken.
Fixes test "67: Parse and process metrics" with LLVM msan:
==6757==WARNING: MemorySanitizer: use-of-uninitialized-value
#0 0x5570edae947d in rblist__exit tools/perf/util/rblist.c:114:2
#1 0x5570edb1c6e8 in runtime_stat__exit tools/perf/util/stat-shadow.c:141:2
#2 0x5570ed92cfae in __compute_metric tools/perf/tests/parse-metric.c:187:2
#3 0x5570ed92cb74 in compute_metric tools/perf/tests/parse-metric.c:196:9
#4 0x5570ed92c6d8 in test_recursion_fail tools/perf/tests/parse-metric.c:318:2
#5 0x5570ed92b8c8 in test__parse_metric tools/perf/tests/parse-metric.c:356:2
#6 0x5570ed8de8c1 in run_test tools/perf/tests/builtin-test.c:410:9
#7 0x5570ed8ddadf in test_and_print tools/perf/tests/builtin-test.c:440:9
#8 0x5570ed8dca04 in __cmd_test tools/perf/tests/builtin-test.c:661:4
#9 0x5570ed8dbc07 in cmd_test tools/perf/tests/builtin-test.c:807:9
#10 0x5570ed7326cc in run_builtin tools/perf/perf.c:313:11
#11 0x5570ed731639 in handle_internal_command tools/perf/perf.c:365:8
#12 0x5570ed7323cd in run_argv tools/perf/perf.c:409:2
#13 0x5570ed731076 in main tools/perf/perf.c:539:3
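The general shape of such a fix is to initialize the object before the first branch that can jump to the cleanup label. The sketch below shows that pattern with hypothetical names (struct stats, stats_init(), stats_exit(), compute()); it is not the parse-metric test code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a stats object that owns heap memory. */
struct stats { int *values; };

static void stats_init(struct stats *st)
{
	assert(st != NULL);
	st->values = NULL;
}

static void stats_exit(struct stats *st)
{
	free(st->values);	/* free(NULL) is a no-op */
	st->values = NULL;
}

/* Returns 0 on success, -1 if setup or allocation fails. */
static int compute(int setup_fails)
{
	struct stats st;
	int err = -1;

	stats_init(&st);	/* must precede every jump to 'out' */
	if (setup_fails)
		goto out;

	st.values = calloc(4, sizeof(*st.values));
	if (st.values)
		err = 0;
out:
	stats_exit(&st);	/* safe: 'st' is always initialized here */
	return err;
}
```

If stats_init() were called after the early "goto out", the cleanup path would read an uninitialized pointer, which is the kind of access MemorySanitizer reports.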
Fixes: commit f5a56570a3f2 ("perf test: Fix memory leaks in parse-metric test")
Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: clang-built-linux@googlegroups.com
Link: http://lore.kernel.org/lkml/20200923210655.4143682-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Ian Rogers [Fri, 25 Sep 2020 00:39:03 +0000 (17:39 -0700)]
perf parse-events: Reduce casts around bp_addr
perf_event_attr bp_addr is a u64. parse-events.y parses it as a u64, but
casts it to a void* and then parse-events.c casts it back to a u64.
Rather than all the casts, change the type of the address to be a u64.
This removes an issue noted in:
https://lore.kernel.org/lkml/20200903184359.GC3495158@kernel.org/
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200925003903.561568-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Thu, 24 Sep 2020 12:44:55 +0000 (21:44 +0900)]
perf test: Add expand cgroup event test
It'll expand given events for cgroups A, B and C.
$ perf test -v expansion
69: Event expansion for cgroups :
--- start ---
test child forked, pid 983140
metric expr 1 / IPC for CPI
metric expr instructions / cycles for IPC
found event instructions
found event cycles
adding {instructions,cycles}:W
copying metric event for cgroup 'A': instructions (idx=0)
copying metric event for cgroup 'B': instructions (idx=0)
copying metric event for cgroup 'C': instructions (idx=0)
test child finished with 0
---- end ----
Event expansion for cgroups: Ok
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200924124455.336326-6-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Thu, 24 Sep 2020 12:44:54 +0000 (21:44 +0900)]
perf tools: Allow creation of cgroup without open
This is a preparation for a test case of expanding events for multiple
cgroups. Instead of using real system cgroups, the test will use fake
cgroups, so it needs a way to create them without an open file descriptor.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200924124455.336326-5-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Thu, 24 Sep 2020 12:44:53 +0000 (21:44 +0900)]
perf tools: Copy metric events properly when expand cgroups
metricgroup__copy_metric_events() handles metric events when
expanding events for cgroups. As the metric events keep pointers to
evsels, the pointers must be refreshed when events are cloned during the
operation.
perf_stat__collect_metric_expr() is also called in case an event has
a metric directly.
During the copy, evsels are referenced by index, as the evlist now has
cloned evsels for the given cgroup.
Also, the kernel test robot found an issue in the python module import, so
add empty implementations of those two functions to fix it.
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200924124455.336326-4-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Thu, 24 Sep 2020 12:44:52 +0000 (21:44 +0900)]
perf stat: Add --for-each-cgroup option
The --for-each-cgroup option is syntactic sugar for monitoring a large
number of cgroups easily. The current command line requires listing all
the events and cgroups even if users want to monitor the same events for
each cgroup. This patch addresses that usage by copying the given events
for each cgroup on the user's behalf.
For instance, if they want to monitor 6 events for 200 cgroups each they
should write 1200 event names (with -e) AND 1200 cgroup names (with -G)
on the command line. But with this change, they can just specify 6
events and 200 cgroups with a new option.
A simpler example is below: it measures 3 events for 2 cgroups ('A'
and 'B'). The result is that a total of 6 events are counted, as shown.
$ perf stat -a -e cpu-clock,cycles,instructions --for-each-cgroup A,B sleep 1
Performance counter stats for 'system wide':
988.18 msec cpu-clock A # 0.987 CPUs utilized
3,153,761,702 cycles A # 3.200 GHz (100.00%)
8,067,769,847 instructions A # 2.57 insn per cycle (100.00%)
982.71 msec cpu-clock B # 0.982 CPUs utilized
3,136,093,298 cycles B # 3.182 GHz (99.99%)
8,109,619,327 instructions B # 2.58 insn per cycle (99.99%)
1.001228054 seconds time elapsed
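The expansion described above amounts to pairing every event with every cgroup. A minimal sketch of that idea, assuming hypothetical names (struct counter, expand_for_each_cgroup()) that are not part of perf:

```c
#include <assert.h>

/* One counted (event, cgroup) pair, as in the 'perf stat' output above. */
struct counter {
	const char *event;
	const char *cgroup;
};

/*
 * Copy every event once per cgroup, cgroup-major, so the order matches
 * the listing above (all events for A, then all events for B).
 * Returns the number of pairs written, or -1 if 'out' is too small.
 */
static int expand_for_each_cgroup(const char *const *events, int nr_events,
				  const char *const *cgroups, int nr_cgroups,
				  struct counter *out, int out_len)
{
	int n = 0;

	assert(nr_events >= 0 && nr_cgroups >= 0);
	if (nr_events * nr_cgroups > out_len)
		return -1;

	for (int c = 0; c < nr_cgroups; c++) {
		for (int e = 0; e < nr_events; e++) {
			out[n].event = events[e];
			out[n].cgroup = cgroups[c];
			n++;
		}
	}
	return n;
}
```

With 3 events and 2 cgroups this yields 6 pairs, and with 6 events and 200 cgroups it yields the 1200 pairs that previously had to be spelled out by hand.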
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200924124455.336326-3-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Thu, 24 Sep 2020 12:44:51 +0000 (21:44 +0900)]
perf evsel: Add evsel__clone() function
evsel__clone() creates an identical evsel from the same
attributes. The function assumes the given evsel is not configured
yet, so it only cares about fields set during event parsing. Those fields
are now grouped together, as Jiri suggested. Note that metric events will
be handled by a later patch.
It will be used by perf stat to generate separate events for each
cgroup.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200924124455.336326-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jin Yao [Wed, 13 May 2020 08:13:33 +0000 (16:13 +0800)]
perf vendor events: Update SkylakeX events to v1.21
- Update SkylakeX events to v1.21.
- Update SkylakeX JSON metrics from TMAM 4.0.
Other fixes:
- Add NO_NMI_WATCHDOG metric constraint to Backend_Bound
- Fix misspelled error
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/20200922031918.3723-1-yao.jin@linux.intel.com/
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jin Yao [Tue, 22 Sep 2020 02:51:19 +0000 (10:51 +0800)]
perf vendor events intel: Update CascadelakeX events to v1.08
- Update CascadelakeX events to v1.08.
- Update CascadelakeX JSON metrics from TMAM 4.0.
Other fixes:
- Add NO_NMI_WATCHDOG metric constraint to Backend_Bound
- Change 'MB/sec' to 'MB' in UNC_M_PMM_BANDWIDTH.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Link: https://lore.kernel.org/lkml/20200922031918.3723-1-yao.jin@linux.intel.com/
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Linus Torvalds [Sun, 27 Sep 2020 21:38:10 +0000 (14:38 -0700)]
Linux 5.9-rc7
Linus Torvalds [Sun, 27 Sep 2020 19:18:57 +0000 (12:18 -0700)]
Merge tag 'kbuild-fixes-v5.9-4' of git://git./linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- ignore compiler stubs for PPC to fix builds
- fix the usage of --target mentioned in the LLVM document
* tag 'kbuild-fixes-v5.9-4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
Documentation/llvm: Fix clang target examples
scripts/kallsyms: skip ppc compiler stub *.long_branch.* / *.plt_branch.*
Linus Torvalds [Sun, 27 Sep 2020 19:15:21 +0000 (12:15 -0700)]
Merge tag 'x86-urgent-2020-09-27' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Two fixes for the x86 interrupt code:
- Unbreak the magic 'search the timer interrupt' logic in IO/APIC
code which got wrecked when the core interrupt code made the
state tracking logic stricter.
That caused the interrupt line to stay masked after switching from
IO/APIC to PIC delivery mode, which obviously prevents interrupts
from being delivered.
- Make run_on_irqstack_cond() typesafe. The function argument is a
void pointer which is then cast to 'void (*fun)(void *)'.
This breaks Control Flow Integrity checking in clang. Use proper
helper functions for the three variants required"
* tag 'x86-urgent-2020-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ioapic: Unbreak check_timer()
x86/irq: Make run_on_irqstack_cond() typesafe
Linus Torvalds [Sun, 27 Sep 2020 19:11:35 +0000 (12:11 -0700)]
Merge tag 'timers-urgent-2020-09-27' of git://git./linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"A set of clocksource/clockevents updates:
- Reset the TI/DM timer before enabling it instead of doing it the
other way round.
- Initialize the reload value for the GX6605s timer correctly so the
hardware counter starts at 0 again after overrun.
- Make error return value negative in the h8300 timer init function"
* tag 'timers-urgent-2020-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/timer-gx6605s: Fixup counter reload
clocksource/drivers/timer-ti-dm: Do reset before enable
clocksource/drivers/h8300_timer8: Fix wrong return value in h8300_8timer_init()
Peter Xu [Fri, 25 Sep 2020 22:26:00 +0000 (18:26 -0400)]
mm/thp: Split huge pmds/puds if they're pinned when fork()
Pinned pages shouldn't be write-protected when fork() happens, because
follow up copy-on-write on these pages could cause the pinned pages to
be replaced by random newly allocated pages.
For huge PMDs, we split the huge pmd if pinning is detected, so that
future handling will be done at the PTE level (with our latest changes,
each of the small pages will be copied). We can achieve this by letting
copy_huge_pmd() return -EAGAIN for pinned pages, so that we fall
through in copy_pmd_range() and finally land in the next
copy_pte_range() call.
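The fall-through on -EAGAIN can be modelled outside the kernel. The sketch below is only an illustration of the control flow: struct pmd_sketch, copy_huge_sketch() and copy_range_sketch() are hypothetical and do not mirror the real page-table code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one entry at the PMD level. */
struct pmd_sketch {
	bool huge;	/* maps a huge page */
	bool pinned;	/* the huge page may be DMA-pinned */
};

enum copy_path { COPIED_HUGE, COPIED_SMALL };

/* Huge-entry copy: refuse pinned pages so the caller can fall back. */
static int copy_huge_sketch(struct pmd_sketch *pmd)
{
	if (pmd->pinned) {
		pmd->huge = false;	/* split: now handled page by page */
		return -EAGAIN;
	}
	return 0;
}

/* Range copy: -EAGAIN from the huge path falls through to the small path. */
static enum copy_path copy_range_sketch(struct pmd_sketch *pmd)
{
	assert(pmd != NULL);
	if (pmd->huge) {
		int err = copy_huge_sketch(pmd);

		if (err != -EAGAIN)
			return COPIED_HUGE;
		/* fall through to per-page copying */
	}
	return COPIED_SMALL;
}
```

An unpinned huge entry is copied as a whole, while a pinned one is split and ends up on the per-page path, where each small page can be copied individually.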
Huge PUDs are even more special: so far they do not support
anonymous pages. But they can be handled the same way as huge PMDs,
even though splitting a huge PUD means erasing the PUD entries. This
guarantees that follow-up faults will remap the same pages in either
parent or child later.
This might not be the most efficient way, but it should be easy and
clean enough. It should be fine, since we're tackling a very rare
case just to make sure userspace programs that pinned some THPs still
work even without MADV_DONTFORK and after they fork().
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Xu [Fri, 25 Sep 2020 22:25:59 +0000 (18:25 -0400)]
mm: Do early cow for pinned pages during fork() for ptes
This allows copy_pte_range() to do early cow if the pages were pinned on
the source mm.
Currently we don't have an accurate way to know whether a page is pinned
or not. The only thing we have is page_maybe_dma_pinned(). However,
that's good enough for now, especially with the newly added
mm->has_pinned flag making sure we won't affect processes that never
pinned any pages.
It would be easier if we could do a GFP_KERNEL allocation within
copy_one_pte(). Unfortunately, we can't, because we hold the page table
locks for both the parent and child processes. So the page
allocation needs to be done outside copy_one_pte().
There is some trickery in copy_present_pte(), mainly the wrprotect trick
to block concurrent fast-gup. The comments in the function explain it
better in place.
Oleg Nesterov reported a (probably harmless) bug during review: we
didn't reset entry.val properly in copy_pte_range(), so there was
potentially a chance of calling add_swap_count_continuation() multiple
times on the same swp entry. However, that should be harmless since even
if it happens, the same function (add_swap_count_continuation()) will
return directly on noticing that there is enough space for the swp
counter. So instead of a standalone stable patch, it is touched up in
this patch directly.
Link: https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Xu [Fri, 25 Sep 2020 22:25:58 +0000 (18:25 -0400)]
mm/fork: Pass new vma pointer into copy_page_range()
This prepares for the future work to trigger early cow on pinned pages
during fork().
No functional change intended.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Xu [Fri, 25 Sep 2020 22:25:57 +0000 (18:25 -0400)]
mm: Introduce mm_struct.has_pinned
(Commit message mostly collected from Jason Gunthorpe)
Reduce the chance of false positive from page_maybe_dma_pinned() by
keeping track if the mm_struct has ever been used with pin_user_pages().
This allows cases that might drive up the page ref_count to avoid any
penalty from handling dma_pinned pages.
Future work is planned, to provide a more sophisticated solution, likely
to turn it into a real counter. For now, make it atomic_t but use it as
a boolean for simplicity.
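A user-space sketch of the idea, using C11 atomics in place of the kernel's atomic_t. struct mm_sketch and both helpers are hypothetical and only model the "atomic used as a sticky boolean" behaviour described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for mm_struct. */
struct mm_sketch {
	atomic_int has_pinned;	/* 0 = never pinned, 1 = pinned at least once */
};

/* Called on the pin path: sticky, never cleared once set. */
static void mm_mark_pinned(struct mm_sketch *mm)
{
	assert(mm != NULL);
	atomic_store(&mm->has_pinned, 1);
}

/*
 * A page only needs the pinned-page handling if the page-level heuristic
 * says "maybe pinned" and the owning mm has pinned pages at some point.
 */
static bool page_needs_pin_handling(struct mm_sketch *mm, bool page_maybe_pinned)
{
	if (atomic_load(&mm->has_pinned) == 0)
		return false;	/* this mm never pinned: skip the penalty */
	return page_maybe_pinned;
}
```

A process that never pinned anything always takes the cheap path, even when a high reference count makes the per-page heuristic report a false positive.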
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thomas Gleixner [Sun, 27 Sep 2020 09:24:34 +0000 (11:24 +0200)]
Merge tag 'timers-v5.9-rc4' of https://git.linaro.org/people/daniel.lezcano/linux into timers/urgent
Pull clocksource/clockevent fixes from Daniel Lezcano:
- Fix wrong signed return value when checking of_iomap in the probe
function for the h8300 timer (Tianjia Zhang)
- Fix reset sequence when setting up the timer on the dm_timer (Tony
Lindgren)
- Fix counter reload when the interrupt fires on gx6605s (Guo Ren)
Linus Torvalds [Sat, 26 Sep 2020 18:18:37 +0000 (11:18 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Three fixes: one in drivers (lpfc) and two for zoned block devices.
The latter also impinges on the block layer but only to introduce a
new block API for setting the zone model rather than fiddling with the
queue directly in the zoned block driver"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: sd: sd_zbc: Fix ZBC disk initialization
scsi: sd: sd_zbc: Fix handling of host-aware ZBC disks
scsi: lpfc: Fix initial FLOGI failure due to BBSCN not supported
Linus Torvalds [Sat, 26 Sep 2020 18:13:51 +0000 (11:13 -0700)]
Merge tag 'io_uring-5.9-2020-09-25' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Two fixes for regressions in this cycle, and one that goes to 5.8
stable:
- fix leak of getname() retrieved filename
- remove plug->nowait assignment, fixing a regression with btrfs
- fix for async buffered retry"
* tag 'io_uring-5.9-2020-09-25' of git://git.kernel.dk/linux-block:
io_uring: ensure async buffered read-retry is setup properly
io_uring: don't unconditionally set plug->nowait = true
io_uring: ensure open/openat2 name is cleaned on cancelation
Linus Torvalds [Sat, 26 Sep 2020 18:07:36 +0000 (11:07 -0700)]
Merge tag 'block-5.9-2020-09-25' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"NVMe pull request from Christoph, and removal of a dead define.
- fix error during controller probe that cause double free irqs
(Keith Busch)
- FC connection establishment fix (James Smart)
- properly handle completions for invalid tags (Xianting Tian)
- pass the correct nsid to the command effects and supported log
(Chaitanya Kulkarni)"
* tag 'block-5.9-2020-09-25' of git://git.kernel.dk/linux-block:
block: remove unused BLK_QC_T_EAGAIN flag
nvme-core: don't use NVME_NSID_ALL for command effects and supported log
nvme-fc: fail new connections to a deleted host or remote port
nvme-pci: fix NULL req in completion handler
nvme: return errors for hwmon init
Linus Torvalds [Sat, 26 Sep 2020 18:01:18 +0000 (11:01 -0700)]
Merge tag 's390-5.9-7' of git://git./linux/kernel/git/s390/linux
Pull s390 fix from Vasily Gorbik:
"Fix truncated ZCRYPT_PERDEV_REQCNT ioctl result. Copy entire reqcnt
list"
* tag 's390-5.9-7' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/zcrypt: Fix ZCRYPT_PERDEV_REQCNT ioctl
Linus Torvalds [Sat, 26 Sep 2020 17:53:35 +0000 (10:53 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"9 patches.
Subsystems affected by this patch series: mm (thp, memcg, gup,
migration, memory-hotplug), lib, and x86"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: don't rely on system state to detect hot-plug operations
mm: replace memmap_context by meminit_context
arch/x86/lib/usercopy_64.c: fix __copy_user_flushcache() cache writeback
lib/memregion.c: include memregion.h
lib/string.c: implement stpcpy
mm/migrate: correct thp migration stats
mm/gup: fix gup_fast with dynamic page table folding
mm: memcontrol: fix missing suffix of workingset_restore
mm, THP, swap: fix allocating cluster for swapfile by mistake
Minchan Kim [Tue, 15 Sep 2020 06:32:15 +0000 (23:32 -0700)]
mm: validate pmd after splitting
syzbot reported the following KASAN splat:
general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
CPU: 1 PID: 6826 Comm: syz-executor142 Not tainted 5.9.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__lock_acquire+0x84/0x2ae0 kernel/locking/lockdep.c:4296
Code: ff df 8a 04 30 84 c0 0f 85 e3 16 00 00 83 3d 56 58 35 08 00 0f 84 0e 17 00 00 83 3d 25 c7 f5 07 00 74 2c 4c 89 e8 48 c1 e8 03 <80> 3c 30 00 74 12 4c 89 ef e8 3e d1 5a 00 48 be 00 00 00 00 00 fc
RSP: 0018:ffffc90004b9f850 EFLAGS: 00010006
Call Trace:
lock_acquire+0x140/0x6f0 kernel/locking/lockdep.c:5006
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:354 [inline]
madvise_cold_or_pageout_pte_range+0x52f/0x25c0 mm/madvise.c:389
walk_pmd_range mm/pagewalk.c:89 [inline]
walk_pud_range mm/pagewalk.c:160 [inline]
walk_p4d_range mm/pagewalk.c:193 [inline]
walk_pgd_range mm/pagewalk.c:229 [inline]
__walk_page_range+0xe7b/0x1da0 mm/pagewalk.c:331
walk_page_range+0x2c3/0x5c0 mm/pagewalk.c:427
madvise_pageout_page_range mm/madvise.c:521 [inline]
madvise_pageout mm/madvise.c:557 [inline]
madvise_vma mm/madvise.c:946 [inline]
do_madvise+0x12d0/0x2090 mm/madvise.c:1145
__do_sys_madvise mm/madvise.c:1171 [inline]
__se_sys_madvise mm/madvise.c:1169 [inline]
__x64_sys_madvise+0x76/0x80 mm/madvise.c:1169
do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The backing vma was shmem.
When a file-backed THP is split, the pmd is zapped instead of being
remapped to sub-pages. So we need to check pmd validity after the
split.
Reported-by: syzbot+ecf80462cb7d5d552bc7@syzkaller.appspotmail.com
Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Laurent Dufour [Sat, 26 Sep 2020 04:19:31 +0000 (21:19 -0700)]
mm: don't rely on system state to detect hot-plug operations
In register_mem_sect_under_node() the system_state's value is checked to
detect whether the call is made during boot time or during a hot-plug
operation. Unfortunately, that check against SYSTEM_BOOTING is wrong
because regular memory is registered at SYSTEM_SCHEDULING state. In
addition, a memory hot-plug operation can be triggered at this system
state by ACPI [1]. So checking against the system state is not
enough.
The consequence shows up on systems with interleaved node ranges like this:
Early memory node ranges
node 1: [mem 0x0000000000000000-0x000000011fffffff]
node 2: [mem 0x0000000120000000-0x000000014fffffff]
node 1: [mem 0x0000000150000000-0x00000001ffffffff]
node 0: [mem 0x0000000200000000-0x000000048fffffff]
node 2: [mem 0x0000000490000000-0x00000007ffffffff]
This can be seen on PowerPC LPAR after multiple memory hot-plug and
hot-unplug operations are done. At the next reboot the nodes' memory
ranges can be interleaved, and since the call to link_mem_sections() is
made in topology_init() while the system is in the SYSTEM_SCHEDULING
state, the node id is not checked and the sections are registered to
multiple nodes:
$ ls -l /sys/devices/system/memory/memory21/node*
total 0
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
In that case, the system is able to boot, but if one of these
memory blocks is later hot-unplugged and then hot-plugged, the sysfs
inconsistency is detected and triggers a BUG_ON():
kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
Call Trace:
add_memory_resource+0x23c/0x340 (unreliable)
__add_memory+0x5c/0xf0
dlpar_add_lmb+0x1b4/0x500
dlpar_memory+0x1f8/0xb80
handle_dlpar_errorlog+0xc0/0x190
dlpar_store+0x198/0x4a0
kobj_attr_store+0x30/0x50
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1b0/0x290
vfs_write+0xe8/0x290
ksys_write+0xdc/0x130
system_call_exception+0x160/0x270
system_call_common+0xf0/0x27c
This patch addresses the root cause by not relying on the system_state
value to detect whether the call is due to a hot-plug operation. An
extra parameter is added to link_mem_sections() detailing whether the
operation is due to a hot-plug operation.
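The change in approach, passing the context explicitly instead of inferring it from global state, can be sketched as follows. The enum name comes from this series; the enumerator names and should_link_section() are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Enum name taken from the series; enumerators are assumed here. */
enum meminit_context {
	MEMINIT_EARLY,
	MEMINIT_HOTPLUG,
};

/*
 * Decide whether a memory section should be linked to node 'nid'.
 * 'section_nid' is the node the section really belongs to.
 */
static bool should_link_section(int section_nid, int nid,
				enum meminit_context context)
{
	assert(nid >= 0);

	/* Hot-plugged memory is added for one known node: no check needed. */
	if (context == MEMINIT_HOTPLUG)
		return true;

	/* At boot, node ranges may interleave, so the node id must match. */
	return section_nid == nid;
}
```

Because the caller states its context, the decision no longer depends on which system state happens to be current when the function runs.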
[1] According to Oscar Salvador, using this qemu command line, ACPI
memory hotplug operations are raised at SYSTEM_SCHEDULING state:
$QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
-m size=$MEM,slots=255,maxmem=4294967296k \
-numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
-object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
-object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
-object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
-object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
-object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
-object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
-object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \
Fixes: 4fbce633910e ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Laurent Dufour [Sat, 26 Sep 2020 04:19:28 +0000 (21:19 -0700)]
mm: replace memmap_context by meminit_context
Patch series "mm: fix memory to node bad links in sysfs", v3.
Sometimes, firmware may expose interleaved memory layout like this:
Early memory node ranges
node 1: [mem 0x0000000000000000-0x000000011fffffff]
node 2: [mem 0x0000000120000000-0x000000014fffffff]
node 1: [mem 0x0000000150000000-0x00000001ffffffff]
node 0: [mem 0x0000000200000000-0x000000048fffffff]
node 2: [mem 0x0000000490000000-0x00000007ffffffff]
In that case, we can see memory blocks assigned to multiple nodes in
sysfs:
$ ls -l /sys/devices/system/memory/memory21
total 0
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
-rw-r--r-- 1 root root 65536 Aug 24 05:27 online
-r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
-r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
drwxr-xr-x 2 root root 0 Aug 24 05:27 power
-r--r--r-- 1 root root 65536 Aug 24 05:27 removable
-rw-r--r-- 1 root root 65536 Aug 24 05:27 state
lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory
-rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
-r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones
The same applies in the node's directory with a memory21 link in both
the node1 and node2's directory.
This is wrong but doesn't prevent the system from running. However, when
one of these memory blocks is later hot-unplugged and then hot-plugged,
the system detects an inconsistency in the sysfs layout and a
BUG_ON() is raised:
kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
Call Trace:
add_memory_resource+0x23c/0x340 (unreliable)
__add_memory+0x5c/0xf0
dlpar_add_lmb+0x1b4/0x500
dlpar_memory+0x1f8/0xb80
handle_dlpar_errorlog+0xc0/0x190
dlpar_store+0x198/0x4a0
kobj_attr_store+0x30/0x50
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1b0/0x290
vfs_write+0xe8/0x290
ksys_write+0xdc/0x130
system_call_exception+0x160/0x270
system_call_common+0xf0/0x27c
This has been seen on PowerPC LPAR.
The root cause of this issue is that when node's memory is registered,
the range used can overlap another node's range, thus the memory block
is registered to multiple nodes in sysfs.
There are two issues here:
(a) The sysfs memory and node's layouts are broken due to these
multiple links
(b) The link errors in link_mem_sections() should not lead to a system
panic.
To address (a) register_mem_sect_under_node should not rely on the
system state to detect whether the link operation is triggered by a hot
plug operation or not. This is addressed by the patches 1 and 2 of this
series.
Issue (b) will be addressed separately.
This patch (of 2):
The memmap_context enum is used to detect whether a memory operation is
due to a hot-add operation or happening at boot time.
Make it general to the hotplug operation and rename it as
meminit_context.
There is no functional change introduced by this patch.
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J . Wysocki" <rafael@kernel.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mikulas Patocka [Sat, 26 Sep 2020 04:19:24 +0000 (21:19 -0700)]
arch/x86/lib/usercopy_64.c: fix __copy_user_flushcache() cache writeback
If we copy less than 8 bytes and if the destination crosses a cache
line, __copy_user_flushcache would invalidate only the first cache line.
This patch makes it invalidate the second cache line as well.
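The arithmetic behind the bug is easy to check in isolation. The sketch below counts the cache lines covered by a destination range; the 64-byte line size and the function name are assumptions for illustration, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u	/* assumed line size for this sketch */

/*
 * Count the cache lines touched by the byte range [dest, dest + size).
 * A writeback loop must visit every one of them, not just the first.
 */
static unsigned int cache_lines_touched(uintptr_t dest, size_t size)
{
	uintptr_t first, last;

	assert(size > 0);
	first = dest / CACHE_LINE_SIZE;
	last = (dest + size - 1) / CACHE_LINE_SIZE;
	return (unsigned int)(last - first + 1);
}
```

A 7-byte copy to offset 60 covers bytes 60 through 66 and therefore touches two lines, which is the case the old code handled as if it were one.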
Fixes: 0aed55af88345b ("x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dan Williams <dan.j.wiilliams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/alpine.LRH.2.02.2009161451140.21915@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jason Yan [Sat, 26 Sep 2020 04:19:21 +0000 (21:19 -0700)]
lib/memregion.c: include memregion.h
This addresses the following sparse warning:
lib/memregion.c:8:5: warning: symbol 'memregion_alloc' was not declared. Should it be static?
lib/memregion.c:14:6: warning: symbol 'memregion_free' was not declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20200921142852.875312-1-yanaijie@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Desaulniers [Sat, 26 Sep 2020 04:19:18 +0000 (21:19 -0700)]
lib/string.c: implement stpcpy
LLVM implemented a recent "libcall optimization" that lowers calls to
`sprintf(dest, "%s", str)` where the return value is used to
`stpcpy(dest, str) - dest`.
This generally avoids the machinery involved in parsing format strings.
`stpcpy` is just like `strcpy` except it returns the pointer to the new
tail of `dest`. This optimization was introduced into clang-12.
Implement this so that we don't observe linkage failures due to missing
symbol definitions for `stpcpy`.
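For readers unfamiliar with the function, here is a minimal user-space sketch of the semantics described above. It is named sketch_stpcpy() to avoid clashing with the libc symbol and is not claimed to be the kernel implementation; the two helper functions exist only to compare the results.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Copy 'src' (including its NUL terminator) to 'dest' and return a pointer
 * to the terminator written into 'dest'. Like strcpy(), it does no bounds
 * checking, so 'dest' must be large enough.
 */
static char *sketch_stpcpy(char *dest, const char *src)
{
	assert(dest != NULL && src != NULL);
	while ((*dest++ = *src++) != '\0')
		;	/* copy byte by byte, terminator included */
	return dest - 1;	/* step back onto the terminator */
}

/* Length as reported by sprintf(dest, "%s", src). */
static int copy_len_via_sprintf(char *dest, const char *src)
{
	return sprintf(dest, "%s", src);
}

/* The same length computed as stpcpy(dest, src) - dest. */
static int copy_len_via_stpcpy(char *dest, const char *src)
{
	return (int)(sketch_stpcpy(dest, src) - dest);
}
```

Both helpers leave the same bytes in dest and return the same length, which is why the compiler can substitute one form for the other when the format string is exactly "%s".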
Similar to last year's fire drill with: commit 5f074f3e192f
("lib/string.c: implement a basic bcmp")
The kernel is somewhere between a "freestanding" environment (no full
libc) and "hosted" environment (many symbols from libc exist with the
same type, function signature, and semantics).
As Peter Anvin notes, there's not really a great way to inform the
compiler that you're targeting a freestanding environment but would like
to opt-in to some libcall optimizations (see pr/47280 below), rather
than opt-out.
Arvind notes, -fno-builtin-* behaves slightly differently between GCC
and Clang, and Clang is missing many __builtin_* definitions, which I
consider a bug in Clang and am working on fixing.
Masahiro summarizes the subtle distinction between compilers justly:
To prevent transformation from foo() into bar(), there are two ways in
Clang to do that; -fno-builtin-foo, and -fno-builtin-bar. There is
only one in GCC; -fno-builtin-foo.
(Any difference in that behavior in Clang is likely a bug from a missing
__builtin_* definition.)
Masahiro also notes:
We want to disable optimization from foo() to bar(),
but we may still benefit from the optimization from
foo() into something else. If GCC implements the same transform, we
would run into a problem because it is not -fno-builtin-bar, but
-fno-builtin-foo that disables that optimization.
In this regard, -fno-builtin-foo would be more future-proof than
-fno-builtin-bar, but -fno-builtin-foo is still potentially overkill. We
may want to prevent calls from foo() being optimized into calls to
bar(), but we still may want other optimization on calls to foo().
It seems that compilers today don't quite provide the fine grain control
over which libcall optimizations pseudo-freestanding environments would
prefer.
Finally, Kees notes that this interface is unsafe, so we should not
encourage its use. As such, I've removed the declaration from any
header, but it still needs to be exported to avoid linkage errors in
modules.
Reported-by: Sami Tolvanen <samitolvanen@google.com>
Suggested-by: Andy Lavr <andy.lavr@gmail.com>
Suggested-by: Arvind Sankar <nivedita@alum.mit.edu>
Suggested-by: Joe Perches <joe@perches.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Suggested-by: Masahiro Yamada <masahiroy@kernel.org>
Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200914161643.938408-1-ndesaulniers@google.com
Link: https://bugs.llvm.org/show_bug.cgi?id=47162
Link: https://bugs.llvm.org/show_bug.cgi?id=47280
Link: https://github.com/ClangBuiltLinux/linux/issues/1126
Link: https://man7.org/linux/man-pages/man3/stpcpy.3.html
Link: https://pubs.opengroup.org/onlinepubs/9699919799/functions/stpcpy.html
Link: https://reviews.llvm.org/D85963
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zi Yan [Sat, 26 Sep 2020 04:19:14 +0000 (21:19 -0700)]
mm/migrate: correct thp migration stats
PageTransHuge returns true for both thp and hugetlb, so the thp stats were
counting both thp and hugetlb migrations. Exclude hugetlb migration by
setting the is_thp variable correctly.
Clean up the thp handling code too while we are there.
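A minimal userspace sketch of the corrected classification, assuming a hypothetical struct fake_page that only carries the answers PageTransHuge() and PageHuge() would give. The struct and function names are made up for this sketch and do not mirror mm/migrate.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only the two predicates matter here. */
struct fake_page {
	bool trans_huge; /* what PageTransHuge() would report */
	bool hugetlb;    /* what PageHuge() would report */
};

struct migrate_stats {
	unsigned long thp_success;   /* THP migrations only */
	unsigned long other_success; /* base pages and hugetlb pages */
};

/* A page counts as THP only if it is a compound huge page and not hugetlb. */
bool page_is_thp(const struct fake_page *page)
{
	assert(page != NULL);
	return page->trans_huge && !page->hugetlb;
}

void account_migration_success(struct migrate_stats *stats,
			       const struct fake_page *page)
{
	if (page_is_thp(page))
		stats->thp_success++;
	else
		stats->other_success++;
}
```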
Fixes: 1a5bae25e3cf ("mm/vmstat: add events for THP migration without split")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lkml.kernel.org/r/20200917210413.1462975-1-zi.yan@sent.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vasily Gorbik [Sat, 26 Sep 2020 04:19:10 +0000 (21:19 -0700)]
mm/gup: fix gup_fast with dynamic page table folding
Currently to make sure that every page table entry is read just once
gup_fast walks perform READ_ONCE and pass pXd value down to the next
gup_pXd_range function by value e.g.:
  static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                           unsigned int flags, struct page **pages, int *nr)
  ...
          pudp = pud_offset(&p4d, addr);
This function passes a reference on that local value copy to pXd_offset,
and might get the very same pointer in return. This happens when the
level is folded (on most arches), and that pointer should not be
iterated.
On s390, each task might have a different 5-, 4- or 3-level address
translation and hence different levels folded, so the logic is more
complex, and a non-iterable pointer to a local copy leads to severe
problems.
Here is an example of what happens with gup_fast on s390, for a task
with 3-level paging, crossing a 2 GB pud boundary:
  // addr = 0x1007ffff000, end = 0x10080001000
  static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                           unsigned int flags, struct page **pages, int *nr)
  {
          unsigned long next;
          pud_t *pudp;
          // pud_offset returns &p4d itself (a pointer to a value on stack)
          pudp = pud_offset(&p4d, addr);
          do {
                  // on second iteration reading "random" stack value
                  pud_t pud = READ_ONCE(*pudp);
                  // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                  next = pud_addr_end(addr, end);
                  ...
          } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
          return 1;
  }
This happens since s390 moved to common gup code with commit
d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust") and
commit 1a42010cdc26 ("s390/mm: convert to the generic
get_user_pages_fast code").
s390 tried to mimic static level folding by changing pXd_offset
primitives to always calculate top level page table offset in pgd_offset
and just return the value passed when pXd_offset has to act as folded.
What is crucial for gup_fast and what has been overlooked is that
PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly.
And the latter is not possible with dynamic folding.
To fix the issue in addition to pXd values pass original pXdp pointers
down to gup_pXd_range functions. And introduce pXd_offset_lockless
helpers, which take an additional pXd entry value parameter. This has
already been discussed in
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
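The idea of passing both the original pointer and the value read from it can be modelled in userspace roughly as follows. This is only a sketch of the folded case: entry_t, folded_offset_lockless and walk_entries are hypothetical names, and the real pXd_offset_lockless helpers are architecture specific.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-word table entry standing in for p4d_t/pud_t. */
typedef struct { unsigned long val; } entry_t;

/*
 * Folded-level helper in the new style: it receives the original table
 * pointer and the value read from it. For a folded level it returns the
 * original pointer, which points into the real table and may be iterated.
 */
entry_t *folded_offset_lockless(entry_t *upperp, entry_t upper)
{
	(void)upper; /* a non-folded level would dereference this value instead */
	return upperp;
}

/* Walks nr entries starting at upperp and sums their values. */
unsigned long walk_entries(entry_t *upperp, size_t nr)
{
	entry_t upper;
	entry_t *p;
	unsigned long sum = 0;
	size_t i;

	assert(upperp != NULL && nr > 0);
	upper = *upperp; /* single read of the upper-level entry */
	p = folded_offset_lockless(upperp, upper);
	for (i = 0; i < nr; i++)
		sum += p[i].val; /* stays inside the table, never on the stack */
	return sum;
}
```

Had the helper been handed &upper instead, p would point at a single local variable and p[1] would read unrelated stack memory, which is the failure described above.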
Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: <stable@vger.kernel.org> [5.2+]
Link: https://lkml.kernel.org/r/patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Sat, 26 Sep 2020 04:19:05 +0000 (21:19 -0700)]
mm: memcontrol: fix missing suffix of workingset_restore
We forgot to add the suffix to the workingset_restore string, so fix it.
Also update the documentation in cgroup-v2.rst.
Fixes: 170b04b7ae49 ("mm/workingset: prepare the workingset detection infrastructure for anon LRU")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: https://lkml.kernel.org/r/20200916100030.71698-1-songmuchun@bytedance.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gao Xiang [Sat, 26 Sep 2020 04:19:01 +0000 (21:19 -0700)]
mm, THP, swap: fix allocating cluster for swapfile by mistake
SWP_FS is used to make swap_{read,write}page() go through the
filesystem, and it's only used for swap files over NFS. So, !SWP_FS
means non NFS for now, it could be either file backed or device backed.
Something similar goes with legacy SWP_FILE.
So in order to achieve the goal of the original patch, SWP_BLKDEV should
be used instead.
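The difference between the two tests can be sketched with plain flag bits. The bit values below are hypothetical and the function names are invented for this sketch; only the flag names come from the text above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; only the names mirror the commit message. */
#define SWP_BLKDEV (1u << 0) /* swap area is a block device */
#define SWP_FS     (1u << 1) /* swap I/O goes through the filesystem (NFS) */

/* Old test: anything that is not NFS-style swap, including local swapfiles. */
bool may_alloc_cluster_old(unsigned int flags)
{
	return !(flags & SWP_FS);
}

/* New test: only real block devices may get a huge cluster. */
bool may_alloc_cluster_new(unsigned int flags)
{
	assert((SWP_BLKDEV & SWP_FS) == 0); /* the two flags are distinct bits */
	return (flags & SWP_BLKDEV) != 0;
}
```

A swapfile on a local filesystem such as XFS has neither flag set, so the old test accepts it and the new one rejects it.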
FS corruption can be observed with SSD device + XFS + fragmented
swapfile due to CONFIG_THP_SWAP=y.
I reproduced the issue with the following details:
Environment:
QEMU + upstream kernel + buildroot + NVMe (2 GB)
Kernel config:
CONFIG_BLK_DEV_NVME=y
CONFIG_THP_SWAP=y
Some reproducible steps:
mkfs.xfs -f /dev/nvme0n1
mkdir /tmp/mnt
mount /dev/nvme0n1 /tmp/mnt
bs="32k"
sz="1024m" # doesn't matter too much, I also tried 16m
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw
mkswap /tmp/mnt/sw
swapon /tmp/mnt/sw
stress --vm 2 --vm-bytes 600M # doesn't matter too much as well
Symptoms:
- FS corruption (e.g. checksum failure)
- memory corruption at: 0xd2808010
- segfault
Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Eric Sandeen <esandeen@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200820045323.7809-1-hsiangkao@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shakeel Butt [Sat, 26 Sep 2020 14:13:41 +0000 (07:13 -0700)]
mm: slab: fix potential double free in ___cache_free
With commit 10befea91b61 ("mm: memcg/slab: use a single set of
kmem_caches for all allocations"), it becomes possible to call kfree()
from slabs_destroy().
The functions cache_flusharray() and do_drain() call slabs_destroy() on
the array_cache of the local CPU without updating the size of the
array_cache. This enables the kfree() call from slabs_destroy() to
recursively call cache_flusharray(), which can potentially call
free_block() on the same elements of the array_cache of the local CPU,
causing a double free and memory corruption.
To fix the issue, simply update the local CPU array_cache before
calling slabs_destroy().
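The ordering problem can be reproduced in a small userspace model. Everything here is hypothetical (the struct layout, the batch size, and the way re-entry is simulated); it only shows why the bookkeeping has to be updated before a call that may recurse.

```c
#include <assert.h>
#include <string.h>

#define AC_CAP 4u
#define BATCH  2u

struct array_cache {
	unsigned int avail;
	int entry[AC_CAP]; /* object ids standing in for pointers */
};

static int free_count[AC_CAP]; /* how many times each object id was freed */
static int reenter_budget;     /* how many times the destroy step re-enters */

static void flush(struct array_cache *ac, int update_first);

/* Stand-in for slabs_destroy(): its kfree() may recurse into the flush. */
static void destroy_model(struct array_cache *ac, int update_first)
{
	if (reenter_budget > 0) {
		reenter_budget--;
		flush(ac, update_first);
	}
}

/* Drop the first n entries from the cache and shift the rest down. */
static void consume(struct array_cache *ac, unsigned int n)
{
	assert(n <= ac->avail);
	ac->avail -= n;
	memmove(ac->entry, ac->entry + n, sizeof(ac->entry[0]) * ac->avail);
}

static void flush(struct array_cache *ac, int update_first)
{
	unsigned int i, n = ac->avail < BATCH ? ac->avail : BATCH;

	for (i = 0; i < n; i++)
		free_count[ac->entry[i]]++; /* stand-in for free_block() */

	if (update_first) {
		consume(ac, n); /* fixed order: bookkeeping first */
		destroy_model(ac, update_first);
	} else {
		destroy_model(ac, update_first);
		consume(ac, n); /* old order: the recursion saw stale entries */
	}
}

/* Runs one flush with a single re-entry; returns the worst per-object count. */
int max_frees_per_object(int update_first)
{
	struct array_cache ac = { AC_CAP, { 0, 1, 2, 3 } };
	unsigned int i;
	int max = 0;

	memset(free_count, 0, sizeof(free_count));
	reenter_budget = 1;
	flush(&ac, update_first);
	for (i = 0; i < AC_CAP; i++)
		if (free_count[i] > max)
			max = free_count[i];
	return max;
}
```

In this model max_frees_per_object(0) reports objects freed twice, while max_frees_per_object(1) frees each object at most once.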
Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ted Ts'o <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Florian Fainelli [Fri, 25 Sep 2020 15:21:14 +0000 (08:21 -0700)]
Documentation/llvm: Fix clang target examples
clang --target=<triple> is how we can specify a particular toolchain
triple to be used; fix the two occurrences in the documentation.
Fixes: fcf1b6a35c16 ("Documentation/llvm: add documentation on building w/ Clang/LLVM")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Linus Torvalds [Sat, 26 Sep 2020 00:15:19 +0000 (17:15 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull more kvm fixes from Paolo Bonzini:
"Five small fixes.
The nested migration bug will be fixed with a better API in 5.10 or
5.11, for now this is a fix that works with existing userspace but
keeps the current ugly API"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: SVM: Add a dedicated INVD intercept routine
KVM: x86: Reset MMU context if guest toggles CR4.SMAP or CR4.PKE
KVM: x86: fix MSR_IA32_TSC read for nested migration
selftests: kvm: Fix assert failure in single-step test
KVM: x86: VMX: Make smaller physical guest address space support user-configurable
Linus Torvalds [Fri, 25 Sep 2020 22:24:04 +0000 (15:24 -0700)]
Merge tag 'mips_fixes_5.9_3' of git://git./linux/kernel/git/mips/linux
Pull MIPS fixes from Thomas Bogendoerfer:
- fixed FP register access on Loongson-3
- added missing 1074 cpu handling
- fixed Loongson2ef build error
* tag 'mips_fixes_5.9_3' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
MIPS: BCM47XX: Remove the needless check with the 1074K
MIPS: Add the missing 'CPU_1074K' into __get_cpu_type()
MIPS: Loongson2ef: Disable Loongson MMI instructions
MIPS: Loongson-3: Fix fp register access if MSA enabled
Linus Torvalds [Fri, 25 Sep 2020 22:21:54 +0000 (15:21 -0700)]
Merge tag 'spi-fix-v5.9-rc6' of git://git./linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A small collection of driver specific fixes, the fsl-espi and bcm-qspi
changes in particular have been causing breakage for users"
* tag 'spi-fix-v5.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: bcm-qspi: Fix probe regression on iProc platforms
spi: fsl-dspi: fix use-after-free in remove path
spi: fsl-espi: Only process interrupts for expected events
spi: bcm2835: Make polling_limit_us static
spi: spi-fsl-dspi: use XSPI mode instead of DMA for DPAA2 SoCs
Linus Torvalds [Fri, 25 Sep 2020 22:16:01 +0000 (15:16 -0700)]
Merge tag 'regulator-fix-v5.9-rc6' of git://git./linux/kernel/git/broonie/regulator
Pull regulator fix from Mark Brown:
"A single fix for incorrect specification of some of the register
fields on axp20x devices which would break voltage setting on affected
systems"
* tag 'regulator-fix-v5.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: axp20x: fix LDO2/4 description
Linus Torvalds [Fri, 25 Sep 2020 22:11:24 +0000 (15:11 -0700)]
Merge tag 'regmap-fix-v5.9-rc6' of git://git./linux/kernel/git/broonie/regmap
Pull regmap fixes from Mark Brown:
"Two issues here - one is a fix for use after free issues in the case
where a regmap overrides its name using something dynamically
generated, the other is that we weren't handling access checks for
non-incrementing I/O on registers within paged register regions
correctly, resulting in spurious errors.
Both of these are quite rare but serious if they occur"
* tag 'regmap-fix-v5.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: fix page selection for noinc writes
regmap: fix page selection for noinc reads
regmap: debugfs: Add back in erroneously removed initialisation of ret
regmap: debugfs: Fix handling of name string for debugfs init delays
Jens Axboe [Fri, 25 Sep 2020 21:23:43 +0000 (15:23 -0600)]
io_uring: ensure async buffered read-retry is setup properly
A previous commit for fixing up short reads botched the async retry
path, so we ended up going to worker threads more often than we should.
Fix this up, so retries work the way they originally were intended to.
Fixes: 227c0c9673d8 ("io_uring: internally retry short reads")
Reported-by: Hao_Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Fri, 25 Sep 2020 17:46:11 +0000 (10:46 -0700)]
Merge tag 'nfsd-5.9-2' of git://git.linux-nfs.org/projects/cel/cel-2.6
Pull NFS server fix from Chuck Lever:
"Fix incorrect calculation on platforms that implement
flush_dcache_page()"
* tag 'nfsd-5.9-2' of git://git.linux-nfs.org/projects/cel/cel-2.6:
SUNRPC: Fix svc_flush_dcache()
Linus Torvalds [Fri, 25 Sep 2020 17:39:22 +0000 (10:39 -0700)]
Merge tag 'pm-5.9-rc7' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix more fallout of recent RCU-lockdep changes in CPU idle code
and two devfreq issues.
Specifics:
- Export rcu_idle_{enter,exit} to modules to fix build issues
introduced by recent RCU-lockdep fixes (Borislav Petkov)
- Add missing return statement to a stub function in the ACPI
processor driver to fix a build issue introduced by recent
RCU-lockdep fixes (Rafael Wysocki)
- Fix recently introduced suspicious RCU usage warnings in the PSCI
cpuidle driver and drop stale comments regarding RCU_NONIDLE()
usage from enter_s2idle_proper() (Ulf Hansson)
- Fix error code path in the tegra30 devfreq driver (Dan Carpenter)
- Add missing information to devfreq_summary debugfs (Chanwoo Choi)"
* tag 'pm-5.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: processor: Fix build for ARCH_APICTIMER_STOPS_ON_C3 unset
PM / devfreq: tegra30: Disable clock on error in probe
PM / devfreq: Add timer type to devfreq_summary debugfs
cpuidle: Drop misleading comments about RCU usage
cpuidle: psci: Fix suspicious RCU usage
rcu/tree: Export rcu_idle_{enter,exit} to modules
Tom Lendacky [Thu, 24 Sep 2020 18:41:57 +0000 (13:41 -0500)]
KVM: SVM: Add a dedicated INVD intercept routine
The INVD instruction intercept performs emulation. Emulation can't be done
on an SEV guest because the guest memory is encrypted.
Provide a dedicated intercept routine for the INVD intercept. And since
the instruction is emulated as a NOP, just skip it instead.
Fixes: 1654efcbc431 ("KVM: SVM: Add KVM_SEV_INIT command")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <a0b9a19ffa7fef86a3cc700c7ea01cb2731e04e5.1600972918.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Linus Torvalds [Fri, 25 Sep 2020 16:49:19 +0000 (09:49 -0700)]
Merge tag 'for-linus' of git://git./linux/kernel/git/rdma/rdma
Pull rdma fix from Jason Gunthorpe:
"One fix for a bug that blktests hits when using rxe: tear down the CQ
pool before waiting for all references to go away"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/core: Fix ordering of CQ pool destruction
Linus Torvalds [Fri, 25 Sep 2020 16:41:57 +0000 (09:41 -0700)]
Merge tag 'drm-fixes-2020-09-25' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Fairly quiet, a couple of i915 fixes, one dma-buf fix, one vc4 and two
sun4i changes
dma-buf:
- Single null pointer deref fix
i915:
- Fix selftest reference to stack data out of scope
- Fix GVT null pointer dereference
vc4:
- fill asoc card owner
sun4i:
- program secondary CSC correctly"
* tag 'drm-fixes-2020-09-25' of git://anongit.freedesktop.org/drm/drm:
drm/i915/selftests: Push the fake iommu device from the stack to data
dmabuf: fix NULL pointer dereference in dma_buf_release()
drm/i915/gvt: Fix port number for BDW on EDID region setup
drm/sun4i: mixer: Extend regmap max_register
drm/sun4i: sun8i-csc: Secondary CSC register correction
drm/vc4/vc4_hdmi: fill ASoC card owner
Rafael J. Wysocki [Fri, 25 Sep 2020 16:33:46 +0000 (18:33 +0200)]
Merge branch 'pm-cpuidle'
* pm-cpuidle:
ACPI: processor: Fix build for ARCH_APICTIMER_STOPS_ON_C3 unset
cpuidle: Drop misleading comments about RCU usage
cpuidle: psci: Fix suspicious RCU usage
rcu/tree: Export rcu_idle_{enter,exit} to modules
Jens Axboe [Fri, 25 Sep 2020 15:01:53 +0000 (09:01 -0600)]
io_uring: don't unconditionally set plug->nowait = true
This causes all the bios to be submitted with REQ_NOWAIT, which can be
problematic on either btrfs or on file systems that otherwise use a mix
of block devices where only some of them support it.
For now, just remove the setting of plug->nowait = true.
Reported-by: Dan Melnic <dmm@fb.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Fixes: b63534c41e20 ("io_uring: re-issue block requests that failed because of resources")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rafael J. Wysocki [Fri, 25 Sep 2020 14:33:19 +0000 (16:33 +0200)]
Merge tag 'devfreq-fixes-for-5.9-rc7' of git://git./linux/kernel/git/chanwoo/linux
Pull devfreq updates for 5.9-rc7 from Chanwoo Choi:
"1. Update devfreq core
- Add missing timer type to devfreq_summary debugfs node.
2. Fix devfreq device driver
- Fix the exception handling about clock on tegra30-devfreq.c"
* tag 'devfreq-fixes-for-5.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux:
PM / devfreq: tegra30: Disable clock on error in probe
PM / devfreq: Add timer type to devfreq_summary debugfs
Jeffle Xu [Fri, 25 Sep 2020 06:00:31 +0000 (14:00 +0800)]
block: remove unused BLK_QC_T_EAGAIN flag
Commit 7b6620d7db56 ("block: remove REQ_NOWAIT_INLINE") removed the
REQ_NOWAIT_INLINE related code, but the diff wasn't applied to
blk_types.h somehow.
Then commit 2771cefeac49 ("block: remove the REQ_NOWAIT_INLINE flag")
removed the REQ_NOWAIT_INLINE flag while the BLK_QC_T_EAGAIN flag still
remains.
Fixes: 7b6620d7db56 ("block: remove REQ_NOWAIT_INLINE")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 24 Sep 2020 20:55:54 +0000 (14:55 -0600)]
io_uring: ensure open/openat2 name is cleaned on cancelation
If we cancel these requests, we'll leak the memory associated with the
filename. Add them to the table of ops that need cleaning, if
REQ_F_NEED_CLEANUP is set.
Cc: stable@vger.kernel.org
Fixes: e62753e4e292 ("io_uring: call statx directly")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sean Christopherson [Wed, 23 Sep 2020 21:53:52 +0000 (14:53 -0700)]
KVM: x86: Reset MMU context if guest toggles CR4.SMAP or CR4.PKE
Reset the MMU context during kvm_set_cr4() if SMAP or PKE is toggled.
Recent commits to (correctly) not reload PDPTRs when SMAP/PKE are
toggled inadvertently skipped the MMU context reset due to the mask
of bits that triggers PDPTR loads also being used to trigger MMU context
resets.
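The separation of the two masks can be sketched as below. The CR4 bit positions are the architectural ones, but which bits belong in each mask is an assumption made for this sketch, and the macro and function names are hypothetical rather than taken from kvm_set_cr4().

```c
#include <assert.h>
#include <stdbool.h>

/* Architectural CR4 bit positions. */
#define CR4_PSE  (1UL << 4)
#define CR4_PAE  (1UL << 5)
#define CR4_PGE  (1UL << 7)
#define CR4_SMEP (1UL << 20)
#define CR4_SMAP (1UL << 21)
#define CR4_PKE  (1UL << 22)

/* Assumed split: toggling these reloads PDPTRs ... */
#define PDPTR_BITS    (CR4_PGE | CR4_PSE | CR4_PAE | CR4_SMEP)
/* ... while toggling any of these resets the MMU context. */
#define MMU_ROLE_BITS (PDPTR_BITS | CR4_SMAP | CR4_PKE)

bool needs_pdptr_reload(unsigned long old_cr4, unsigned long new_cr4)
{
	/* every PDPTR-relevant bit is also an MMU role bit in this model */
	assert((PDPTR_BITS & ~MMU_ROLE_BITS) == 0);
	return ((old_cr4 ^ new_cr4) & PDPTR_BITS) != 0;
}

bool needs_mmu_reset(unsigned long old_cr4, unsigned long new_cr4)
{
	return ((old_cr4 ^ new_cr4) & MMU_ROLE_BITS) != 0;
}
```

With a single shared mask, removing SMAP and PKE from it to avoid PDPTR reloads also removes them from the reset check, which is the regression described above.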
Fixes: 427890aff855 ("kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode")
Fixes: cb957adb4ea4 ("kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode")
Cc: Jim Mattson <jmattson@google.com>
Cc: Peter Shier <pshier@google.com>
Cc: Oliver Upton <oupton@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923215352.17756-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Dave Airlie [Fri, 25 Sep 2020 01:28:36 +0000 (11:28 +1000)]
Merge tag 'drm-misc-fixes-2020-09-24' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
drm-misc-fixes for v5.9:
- Single null pointer deref fix for dma-buf.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/4106c21e-f52c-4c05-6cdb-daa743bb8617@linux.intel.com
Dave Airlie [Fri, 25 Sep 2020 01:07:01 +0000 (11:07 +1000)]
Merge tag 'drm-intel-fixes-2020-09-24' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
drm/i915 fixes for v5.9-rc7:
- Fix selftest reference to stack data out of scope
- Fix GVT null pointer dereference
- Backmerge from Linus' master to fix build
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/87zh5fpmha.fsf@intel.com
Dave Airlie [Fri, 25 Sep 2020 01:06:18 +0000 (11:06 +1000)]
BackMerge commit '98477740630f270aecf648f1d6a9dbc6027d4ff1' into drm-fixes
The dax mess had some fallout, and i915 used a later base to fix their CI.
Signed-off-by: Dave Airlie <airlied@redhat.com>
Jens Axboe [Thu, 24 Sep 2020 19:42:40 +0000 (13:42 -0600)]
Merge tag 'nvme-5.9-2020-09-24' of git://git.infradead.org/nvme into block-5.9
Pull NVMe fixes from Christoph:
"nvme fixes for 5.9
- fix error during controller probe that causes a double free of irqs
(Keith Busch)
- FC connection establishment fix (James Smart)
- properly handle completions for invalid tags (Xianting Tian)
- pass the correct nsid to the command effects and supported log
(Chaitanya Kulkarni)"
* tag 'nvme-5.9-2020-09-24' of git://git.infradead.org/nvme:
nvme-core: don't use NVME_NSID_ALL for command effects and supported log
nvme-fc: fail new connections to a deleted host or remote port
nvme-pci: fix NULL req in completion handler
nvme: return errors for hwmon init
Maxim Levitsky [Mon, 21 Sep 2020 10:38:05 +0000 (13:38 +0300)]
KVM: x86: fix MSR_IA32_TSC read for nested migration
MSR reads/writes should always access the L1 state, since the (nested)
hypervisor should intercept all the MSRs it wants to adjust, and those
that it doesn't should be read by the guest as if the host had read them.
However, IA32_TSC is an exception. Even when not intercepted, the guest still
reads the value + TSC offset.
The write, however, does not take any TSC offset into account.
This is documented in Intel's SDM and seems to happen on AMD as well.
This creates a problem when userspace wants to read the IA32_TSC value and then
write it (e.g. for migration).
In this case it reads the L2 value, but the write is interpreted as an L1 value.
To fix this, make userspace-initiated reads of IA32_TSC return the L1 value
as well.
Huge thanks to Dave Gilbert for helping me understand this very confusing
semantic of MSR writes.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200921103805.9102-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Linus Torvalds [Thu, 24 Sep 2020 16:09:47 +0000 (09:09 -0700)]
Merge tag 'mmc-v5.9-rc4-2' of git://git./linux/kernel/git/ulfh/mmc
Pull MMC fix from Ulf Hansson:
"Fix build warning in mmc_spi when CONFIG_HAS_DMA is unset"
* tag 'mmc-v5.9-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: mmc_spi: Fix mmc_spi_dma_alloc() return type for !HAS_DMA
Linus Torvalds [Thu, 24 Sep 2020 16:05:04 +0000 (09:05 -0700)]
Merge tag 'media/v5.9-3' of git://git./linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
- fix a regression at the CEC adapter core
- two uAPI patches (one revert) for changes in this development cycle
* tag 'media/v5.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: dt-bindings: media: imx274: Convert to json-schema
media: media/v4l2: remove V4L2_FLAG_MEMORY_NON_CONSISTENT flag
media: cec-adap.c: don't use flush_scheduled_work()
Linus Torvalds [Thu, 24 Sep 2020 16:00:05 +0000 (09:00 -0700)]
Merge tag 'sound-5.9-rc7' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Just a handful small device-specific fixes including a couple of
reverts"
* tag 'sound-5.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
Revert "ALSA: usb-audio: Disable Lenovo P620 Rear line-in volume control"
Revert "ALSA: hda - Fix silent audio output and corrupted input on MSI X570-A PRO"
ALSA: usb-audio: Add delay quirk for H570e USB headsets
ALSA: hda/realtek: Enable front panel headset LED on Lenovo ThinkStation P520
ALSA: hda/realtek - Couldn't detect Mic if booting with headset plugged
ALSA: asihpi: fix iounmap in error handler
Masahiro Yamada [Tue, 22 Sep 2020 17:48:56 +0000 (02:48 +0900)]
scripts/kallsyms: skip ppc compiler stub *.long_branch.* / *.plt_branch.*
PowerPC allmodconfig often fails to build as follows:
LD .tmp_vmlinux.kallsyms1
KSYM .tmp_vmlinux.kallsyms1.o
LD .tmp_vmlinux.kallsyms2
KSYM .tmp_vmlinux.kallsyms2.o
LD .tmp_vmlinux.kallsyms3
KSYM .tmp_vmlinux.kallsyms3.o
LD vmlinux
SORTTAB vmlinux
SYSMAP System.map
Inconsistent kallsyms data
Try make KALLSYMS_EXTRA_PASS=1 as a workaround
make[2]: *** [../Makefile:1162: vmlinux] Error 1
Setting KALLSYMS_EXTRA_PASS=1 does not help.
This is caused by the compiler inserting stubs such as *.long_branch.*
and *.plt_branch.*
$ powerpc-linux-nm -n .tmp_vmlinux.kallsyms2
[ snip ]
c00000000210c010 t 00000075.plt_branch.da9:19
c00000000210c020 t 00000075.plt_branch.1677:5
c00000000210c030 t 00000075.long_branch.memmove
c00000000210c034 t 00000075.plt_branch.9e0:5
c00000000210c044 t 00000075.plt_branch.free_initrd_mem
...
Actually, the problem mentioned in the scripts/link-vmlinux.sh comment,
"In theory it's possible this results in even more stubs, but unlikely",
is happening here, and ends up requiring another kallsyms step.
scripts/kallsyms.c already ignores various compiler stubs. Let's do
the same for these to make kallsyms for PowerPC always succeed in 2 steps.
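One way such a filter could look is a substring match on the symbol name. This is a sketch with a hypothetical function name, not the code in scripts/kallsyms.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true for names like "00000075.plt_branch.da9:19". */
bool is_ppc_branch_stub(const char *name)
{
	assert(name != NULL);
	return strstr(name, ".long_branch.") != NULL ||
	       strstr(name, ".plt_branch.") != NULL;
}
```

Symbols for which this returns true would simply be left out of the table, so their appearance or disappearance between passes no longer changes the kallsyms data.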
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Linus Torvalds [Thu, 24 Sep 2020 15:41:32 +0000 (08:41 -0700)]
mm: fix misplaced unlock_page in do_wp_page()
Commit 09854ba94c6a ("mm: do_wp_page() simplification") reorganized all
the code around the page re-use vs copy, but in the process also moved
the final unlock_page() around to after the wp_page_reuse() call.
That normally doesn't matter - but it means that the unlock_page() is
now done after releasing the page table lock. Again, not a big deal,
you'd think.
But it turns out that it's very wrong indeed, because once we've
released the page table lock, we've basically lost our only reference to
the page - the page tables - and it could now be free'd at any time. We
do hold the mmap_sem, so no actual unmap() can happen, but madvise can
come in and a MADV_DONTNEED will zap the page range - and free the page.
So now the page may be free'd just as we're unlocking it, which in turn
will usually trigger a "Bad page state" error in the freeing path. To
make matters more confusing, by the time the debug code prints out the
page state, the unlock has typically completed and everything looks fine
again.
This all doesn't happen in any normal situations, but it does trigger
with the dirtyc0w_child LTP test. And it seems to trigger much more
easily (but not exclusively) on s390 than elsewhere, probably because
s390 doesn't do the "batch pages up for freeing after the TLB flush"
that gives the unlock_page() more time to complete and makes the race
harder to hit.
Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
Link: https://lore.kernel.org/lkml/a46e9bbef2ed4e17778f5615e818526ef848d791.camel@redhat.com/
Link: https://lore.kernel.org/linux-mm/c41149a8-211e-390b-af1d-d5eee690fecb@linux.alibaba.com/
Reported-by: Qian Cai <cai@redhat.com>
Reported-by: Alex Shi <alex.shi@linux.alibaba.com>
Bisected-and-analyzed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Tested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ray Jui [Thu, 10 Sep 2020 15:25:38 +0000 (08:25 -0700)]
spi: bcm-qspi: Fix probe regression on iProc platforms
iProc chips have a QSPI controller that does not have the MSPI_REV
offset. Reading from that offset will cause a bus error. Fix it by
having MSPI_REV query disabled in the generic compatible string.
Fixes: 3a01f04d74ef ("spi: bcm-qspi: Handle lack of MSPI_REV offset")
Link: https://lore.kernel.org/linux-arm-kernel/20200909211857.4144718-1-f.fainelli@gmail.com/T/#u
Signed-off-by: Ray Jui <ray.jui@broadcom.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20200910152539.45584-3-ray.jui@broadcom.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Christian Borntraeger [Mon, 21 Sep 2020 10:48:36 +0000 (12:48 +0200)]
s390/zcrypt: Fix ZCRYPT_PERDEV_REQCNT ioctl
reqcnt is a u32 pointer, but we copy sizeof(reqcnt), which is the
size of the pointer. This means we only copy 8 bytes. Let us copy
the full monty.
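The size mismatch can be sketched from user space with Python's ctypes. This is only an illustration, not the driver code: the table of 256 u32 counters is an assumption based on the title of the commit named in the Fixes tag, and the variable names are hypothetical.

```python
import ctypes

# Assumption for illustration: one u32 request counter per adapter and
# 256 adapters. The real driver's table layout is not shown in this log.
NUM_ADAPTERS = 256
ReqCntArray = ctypes.c_uint32 * NUM_ADAPTERS

table = ReqCntArray()
# A u32 pointer to the first element, analogous to "reqcnt" above.
reqcnt = ctypes.cast(table, ctypes.POINTER(ctypes.c_uint32))

# sizeof() on the pointer yields the pointer width, not the table size.
pointer_size = ctypes.sizeof(reqcnt)
# The number of bytes that actually has to be copied.
table_size = ctypes.sizeof(table)

print("sizeof(pointer) =", pointer_size)
print("sizeof(table)   =", table_size)
```

On a 64-bit build the first value is the 8 bytes mentioned above, far short of the whole table.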
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Harald Freudenberger <freude@linux.ibm.com>
Cc: stable@vger.kernel.org
Fixes: af4a72276d49 ("s390/zcrypt: Support up to 256 crypto adapters.")
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Linus Torvalds [Wed, 23 Sep 2020 21:52:22 +0000 (14:52 -0700)]
Merge tag 'trace-v5.9-rc5-2' of git://git./linux/kernel/git/rostedt/linux-trace
Pull bootconfig fixes from Steven Rostedt:
"A couple of fixes for bootconfig.
Masami discovered two bugs which this fixes and he added tests to
cover these issues.
- Fix a bug that breaks bootconfig tree nodes
- Fix a bug that does not truncate whitespace properly
- Add tests to cover the above two cases"
* tag 'trace-v5.9-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tools/bootconfig: Add testcase for tailing space
tools/bootconfig: Add testcases for repeated key with brace
lib/bootconfig: Fix to remove tailing spaces after value
lib/bootconfig: Fix a bug of breaking existing tree nodes
Linus Torvalds [Wed, 23 Sep 2020 21:38:21 +0000 (14:38 -0700)]
Merge tag 'for-5.9/dm-fixes-2' of git://git./linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- DM core fix for incorrect double bio splitting. Keep "fixing" this
because past attempts didn't fully appreciate the liability relative
to recursive bio splitting. This fix limits DM's bio splitting to a
single method and does _not_ use blk_queue_split() for normal IO.
- DM crypt Documentation updates for features added during 5.9 merge.
* tag 'for-5.9/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm crypt: document encrypted keyring key option
dm crypt: document new no_workqueue flags
dm: fix comment in dm_process_bio()
dm: fix bio splitting and its bio completion order for regular IO
Linus Torvalds [Wed, 23 Sep 2020 21:32:23 +0000 (14:32 -0700)]
Merge tag 'for-5.9-rc6-tag' of git://git./linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"syzkaller started to hit us with reports, here's a fix for one type
(stack overflow when printing checksums on read error).
The other patch is a fix for sysfs object, we have a test for that and
it leads to a crash."
* tag 'for-5.9-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix put of uninitialized kobject after seed device delete
btrfs: fix overflow when copying corrupt csums for a message
Thomas Gleixner [Wed, 23 Sep 2020 15:46:20 +0000 (17:46 +0200)]
x86/ioapic: Unbreak check_timer()
Several people reported in the kernel bugzilla that between v4.12 and v4.13
the magic which works around broken hardware and BIOSes to find the proper
timer interrupt delivery mode stopped working for some older affected
platforms which need to fall back to ExtINT delivery mode.
The reason is that the core code changed to keep track of the masked and
disabled state of an interrupt line more accurately to avoid the expensive
hardware operations.
That broke an assumption in i8259_make_irq() which invokes
disable_irq_nosync();
irq_set_chip_and_handler();
enable_irq();
Up to v4.12 this worked because enable_irq() unconditionally unmasked the
interrupt line, but after the state tracking improvements this is no
longer the case because the IO/APIC uses lazy disabling. So the line state
is unmasked which means that enable_irq() does not call into the new irq
chip to unmask it.
In principle this is a shortcoming of the core code, but it's more than
unclear whether the core code should try to reset state. At least this
cannot be done unconditionally as that would break other existing use cases
where the chip type is changed, e.g. when changing the trigger type, but
the callers expect the state to be preserved.
As the way check_timer() switches the delivery modes is truly
unique, the obvious fix is to simply unmask the i8259 manually after
changing the mode to ExtINT delivery and switching the irq chip to the
legacy PIC.
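The interaction can be sketched with a toy model. This is not kernel code: the class and function names are hypothetical, and the model assumes the i8259 line starts out masked in hardware while the core's tracked state still says "unmasked" after the lazy disable.

```python
class Chip:
    """Hypothetical stand-in for an irq chip; holds the hardware mask bit."""

    def __init__(self, name, hw_masked):
        self.name = name
        self.hw_masked = hw_masked

    def unmask(self):
        self.hw_masked = False


class IrqLine:
    """Toy model of core state tracking with lazy disabling."""

    def __init__(self, chip):
        self.chip = chip
        self.disabled = False
        self.state_masked = chip.hw_masked  # what the core believes

    def disable_nosync(self):
        # Lazy disable: record the software state, skip the hardware access.
        self.disabled = True

    def set_chip(self, chip):
        # The tracked state is preserved across the chip switch.
        self.chip = chip

    def enable(self):
        self.disabled = False
        # The chip is only called if the core believes the line is masked.
        if self.state_masked:
            self.chip.unmask()
            self.state_masked = False


def switch_to_legacy_pic(line, pic, manual_unmask):
    """Replay the three-call sequence; return True if the PIC stays masked."""
    line.disable_nosync()
    line.set_chip(pic)
    line.enable()
    if manual_unmask:
        pic.unmask()  # the manual unmask described above
    return pic.hw_masked
```

In this model, `switch_to_legacy_pic(IrqLine(Chip("ioapic", False)), Chip("i8259", True), manual_unmask=False)` leaves the PIC line masked, while passing `manual_unmask=True` clears it.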
Note that the Fixes tag is not really precise, but identifies the commit
which broke the assumptions in the IO/APIC and i8259 code and that's the
kernel version to which this needs to be backported.
Fixes: bf22ff45bed6 ("genirq: Avoid unnecessary low level irq function calls")
Reported-by: p_c_chan@hotmail.com
Reported-by: ecm4@mail.com
Reported-by: perdigao1@yahoo.com
Reported-by: matzes@users.sourceforge.net
Reported-by: rvelascog@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: p_c_chan@hotmail.com
Tested-by: matzes@users.sourceforge.net
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=197769
Chaitanya Kulkarni [Tue, 22 Sep 2020 19:49:38 +0000 (12:49 -0700)]
nvme-core: don't use NVME_NSID_ALL for command effects and supported log
nvme_get_effects_log() uses NVME_NSID_ALL, which has namespace scope.
The command effects log page is controller specific.
Replace NVME_NSID_ALL with 0x00 which specifies the controller scope
instead of namespace scope.
Fixes: 84fef62d135b ("nvme: check admin passthru command effects")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=209287
Reported-by: Huai-Cheng Kuo <hh81478072@gmail.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Linus Torvalds [Wed, 23 Sep 2020 17:04:16 +0000 (10:04 -0700)]
mm: move the copy_one_pte() pte_present check into the caller
This completes the split of the non-present and present pte cases by
moving the check for the source pte being present into the single
caller, which also means that we clearly separate out the very different
return value case for a non-present pte.
The present pte case currently always succeeds.
This is a pure code re-organization with no semantic change: the intent
is to make it much easier to add a new return case to the present pte
case for when we do early COW at page table copy time.
This was split out from the previous commit simply to make it easy to
visually see that there were no semantic changes from this code
re-organization.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 23 Sep 2020 16:56:59 +0000 (09:56 -0700)]
mm: split out the non-present case from copy_one_pte()
This is a purely mechanical split of the copy_one_pte() function. It's
not immediately obvious when looking at the diff because of the
indentation change, but the way to see what is going on in this commit
is to use the "-w" flag to not show pure whitespace changes, and you see
how the first part of copy_one_pte() is simply lifted out into a
separate function.
And since the non-present case is marked unlikely, don't make the new
function be inlined. Not that gcc really seems to care, since it looks
like it will inline it anyway due to the whole "single callsite for
static function" logic. In fact, code generation with the function
split is almost identical to before. But not marking it inline is the
right thing to do.
This is pure prep-work and cleanup for subsequent changes.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sascha Hauer [Wed, 23 Sep 2020 13:10:26 +0000 (15:10 +0200)]
spi: fsl-dspi: fix use-after-free in remove path
spi_unregister_controller() not only unregisters the controller, but
also frees the controller. This will free the driver data with it, so
we must not access it later in dspi_remove().
Solve this by allocating the driver data separately from the SPI
controller.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Link: https://lore.kernel.org/r/20200923131026.20707-1-s.hauer@pengutronix.de
Signed-off-by: Mark Brown <broonie@kernel.org>
Icenowy Zheng [Wed, 23 Sep 2020 00:51:42 +0000 (08:51 +0800)]
regulator: axp20x: fix LDO2/4 description
Currently we wrongly set the value mask of both LDO2 and LDO4 to the
mask of LDO2, so the LDO4 voltage configuration is left untouched. This
leads to a conflict when LDO2 and LDO4 are both in use.
Fix this issue by setting a different vsel_mask for each regulator.
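A small sketch of why a shared mask causes the conflict. The register layout here is an assumption made for illustration (LDO2 in the upper nibble, LDO4 in the lower nibble of one 8-bit register), and update_bits() is a hypothetical helper, not the regmap API.

```python
# Assumed layout, for illustration only.
LDO2_MASK = 0xF0  # upper nibble
LDO4_MASK = 0x0F  # lower nibble


def update_bits(reg, mask, selector):
    """Write a voltage selector into the field described by mask."""
    shift = (mask & -mask).bit_length() - 1  # position of the lowest set bit
    return (reg & ~mask & 0xFF) | ((selector << shift) & mask)


# LDO2 is programmed first with selector 0xA.
reg = update_bits(0x00, LDO2_MASK, 0xA)

# Bug: LDO4 reuses the LDO2 mask, so its write lands in the LDO2 field.
buggy = update_bits(reg, LDO2_MASK, 0x3)

# Fix: LDO4 has its own mask, so both selectors coexist.
fixed = update_bits(reg, LDO4_MASK, 0x3)

print(hex(reg), hex(buggy), hex(fixed))
```

With the shared mask the LDO2 selector is overwritten and the LDO4 field never changes; with separate masks both fields hold their own value.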
Fixes: db4a555f7c4c ("regulator: axp20x: use defines for masks")
Signed-off-by: Icenowy Zheng <icenowy@aosc.io>
Link: https://lore.kernel.org/r/20200923005142.147135-1-icenowy@aosc.io
Signed-off-by: Mark Brown <broonie@kernel.org>
Hagen Paul Pfeifer [Tue, 22 Sep 2020 20:09:22 +0000 (22:09 +0200)]
perf script: Add min, max to futex-contention output, in addition to avg
Average is quite informative, but the outliers - especially max - are
also of interest.
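A minimal sketch of tracking min, max and average per (thread, lock) pair. The helper names are hypothetical and the incremental mean is one possible choice; this is not the code of the futex-contention script itself.

```python
def add_sample(stats, key, value):
    """Fold one wait time in ns into a (min, max, avg, count) record."""
    if key not in stats:
        stats[key] = (value, value, value, 1)
        return
    lo, hi, avg, count = stats[key]
    count += 1
    avg += (value - avg) / count  # incremental arithmetic mean
    stats[key] = (min(lo, value), max(hi, value), avg, count)


def format_line(comm, tid, lock, record):
    """Render one record in the same shape as the output shown below."""
    lo, hi, avg, count = record
    return ("%s[%d] lock %x contended %d times, %d avg ns "
            "[max: %d ns, min %d ns]" % (comm, tid, lock, count, avg, hi, lo))


stats = {}
for wait_ns in (340, 1000, 12270):
    add_sample(stats, (795251, 0x55b14e6dd080), wait_ns)
```

A single very long wait barely moves the average once many samples are in, but it shows up immediately in the max column.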
Before:
mutex-locker[793299] lock 5637ec61e080 contended 3400 times, 446 avg ns
mutex-locker[793301] lock 5637ec61e080 contended 3563 times, 385 avg ns
mutex-locker[793300] lock 5637ec61e080 contended 3110 times, 1855 avg ns
After:
mutex-locker[795251] lock 55b14e6dd080 contended 3853 times, 1279 avg ns [max: 12270 ns, min 340 ns]
mutex-locker[795253] lock 55b14e6dd080 contended 2911 times, 518 avg ns [max: 51660261 ns, min 347 ns]
mutex-locker[795252] lock 55b14e6dd080 contended 3843 times, 385 avg ns [max: 24323998 ns, min 338 ns]
Committer testing:
[root@five ~]# perf script record futex-contention -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.877 MB perf.data (923 samples) ]
[root@five ~]# perf evlist
syscalls:sys_enter_futex
syscalls:sys_exit_futex
dummy:HG
# Tip: use 'perf evlist --trace-fields' to show fields for tracepoint events
#
Before:
[root@five ~]# perf script report futex-contention
JS Helper[2457] lock 55fe0cf82610 contended 4 times, 6657 avg ns
ibus-daemon[2975] lock 56227f6d0210 contended 4 times, 1020 avg ns
chromium-browse[1905801] lock 7ffe573f5088 contended 8 times, 108463 avg ns
gnome-shell[2240] lock 55fe0cf82678 contended 1 times, 8616 avg ns
gnome-shel:cs0[2292] lock 55fe0d0ab768 contended 3 times, 606016034 avg ns
JS Helper[2458] lock 55fe0cf82690 contended 1 times, 1167840 avg ns
chromium-browse[1905470] lock 7ffe573f5358 contended 1 times, 551504 avg ns
chromium-browse[1905948] lock 7ffe573f5358 contended 1 times, 577422 avg ns
gnome-shell[2240] lock 55fe0cf82660 contended 6 times, 202696 avg ns
pool[2602] lock 7fd600008ef0 contended 1 times, 500046007 avg ns
chromium-browse[1905801] lock 7ffe573f5128 contended 4 times, 285083 avg ns
JS Helper[2460] lock 55fe0cf82690 contended 1 times, 680877 avg ns
JS Helper[2459] lock 55fe0cf82610 contended 7 times, 4224 avg ns
chromium-browse[1905434] lock 7ffe573f5358 contended 1 times, 697038 avg ns
chromium-browse[212592] lock 7ffe573f53c8 contended 4 times, 460601 avg ns
gnome-shel:cs0[2292] lock 55fe0d0ab76c contended 2 times, 601237648 avg ns
JS Helper[2460] lock 55fe0cf82610 contended 4 times, 3340 avg ns
JS Helper[2462] lock 55fe0cf82694 contended 1 times, 237275 avg ns
chromium-browse[1905605] lock 7ffe573f5358 contended 2 times, 634555 avg ns
chromium-browse[1905992] lock 7ffe573f5358 contended 1 times, 583965 avg ns
chromium-browse[1905647] lock 7ffe573f5368 contended 8 times, 549800 avg ns
JS Helper[2462] lock 55fe0cf82610 contended 2 times, 4694 avg ns
JS Helper[2461] lock 55fe0cf82694 contended 1 times, 257793 avg ns
JS Helper[2456] lock 55fe0cf82690 contended 1 times, 677771 avg ns
JS Helper[2463] lock 55fe0cf82610 contended 3 times, 5139 avg ns
gdbus[2980] lock 56227f6d0210 contended 2 times, 2465 avg ns
gnome-shell[2240] lock 55fe0cf82664 contended 5 times, 8036 avg ns
chromium-browse[1906308] lock 7ffe573f5358 contended 1 times, 210735 avg ns
JS Helper[2463] lock 55fe0cf82694 contended 1 times, 251531 avg ns
chromium-browse[1905801] lock 7ffe573f4f58 contended 4 times, 399927 avg ns
[root@five ~]#
After:
[root@five ~]# perf script report futex-contention
JS Helper[2457] lock 55fe0cf82610 contended 4 times, 6657 avg ns [max: 11502 ns, min 792 ns]
ibus-daemon[2975] lock 56227f6d0210 contended 4 times, 1020 avg ns [max: 1813 ns, min 581 ns]
chromium-browse[1905801] lock 7ffe573f5088 contended 8 times, 108463 avg ns [max: 380103 ns, min 57989 ns]
gnome-shell[2240] lock 55fe0cf82678 contended 1 times, 8616 avg ns [max: 8616 ns, min 8616 ns]
gnome-shel:cs0[2292] lock 55fe0d0ab768 contended 3 times, 606016034 avg ns [max: 611295960 ns, min 600191357 ns]
JS Helper[2458] lock 55fe0cf82690 contended 1 times, 1167840 avg ns [max: 1167840 ns, min 1167840 ns]
chromium-browse[1905470] lock 7ffe573f5358 contended 1 times, 551504 avg ns [max: 551504 ns, min 551504 ns]
chromium-browse[1905948] lock 7ffe573f5358 contended 1 times, 577422 avg ns [max: 577422 ns, min 577422 ns]
gnome-shell[2240] lock 55fe0cf82660 contended 6 times, 202696 avg ns [max: 398998 ns, min 5050 ns]
pool[2602] lock 7fd600008ef0 contended 1 times, 500046007 avg ns [max: 500046007 ns, min 500046007 ns]
chromium-browse[1905801] lock 7ffe573f5128 contended 4 times, 285083 avg ns [max: 389531 ns, min 76183 ns]
JS Helper[2460] lock 55fe0cf82690 contended 1 times, 680877 avg ns [max: 680877 ns, min 680877 ns]
JS Helper[2459] lock 55fe0cf82610 contended 7 times, 4224 avg ns [max: 12724 ns, min 1012 ns]
chromium-browse[1905434] lock 7ffe573f5358 contended 1 times, 697038 avg ns [max: 697038 ns, min 697038 ns]
chromium-browse[212592] lock 7ffe573f53c8 contended 4 times, 460601 avg ns [max: 594956 ns, min 232996 ns]
gnome-shel:cs0[2292] lock 55fe0d0ab76c contended 2 times, 601237648 avg ns [max: 601255863 ns, min 601219434 ns]
JS Helper[2460] lock 55fe0cf82610 contended 4 times, 3340 avg ns [max: 9168 ns, min 962 ns]
JS Helper[2462] lock 55fe0cf82694 contended 1 times, 237275 avg ns [max: 237275 ns, min 237275 ns]
chromium-browse[1905605] lock 7ffe573f5358 contended 2 times, 634555 avg ns [max: 1024060 ns, min 245050 ns]
chromium-browse[1905992] lock 7ffe573f5358 contended 1 times, 583965 avg ns [max: 583965 ns, min 583965 ns]
chromium-browse[1905647] lock 7ffe573f5368 contended 8 times, 549800 avg ns [max: 775293 ns, min 258375 ns]
JS Helper[2462] lock 55fe0cf82610 contended 2 times, 4694 avg ns [max: 8556 ns, min 832 ns]
JS Helper[2461] lock 55fe0cf82694 contended 1 times, 257793 avg ns [max: 257793 ns, min 257793 ns]
JS Helper[2456] lock 55fe0cf82690 contended 1 times, 677771 avg ns [max: 677771 ns, min 677771 ns]
JS Helper[2463] lock 55fe0cf82610 contended 3 times, 5139 avg ns [max: 6873 ns, min 931 ns]
gdbus[2980] lock 56227f6d0210 contended 2 times, 2465 avg ns [max: 4188 ns, min 742 ns]
gnome-shell[2240] lock 55fe0cf82664 contended 5 times, 8036 avg ns [max: 13105 ns, min 401 ns]
chromium-browse[1906308] lock 7ffe573f5358 contended 1 times, 210735 avg ns [max: 210735 ns, min 210735 ns]
JS Helper[2463] lock 55fe0cf82694 contended 1 times, 251531 avg ns [max: 251531 ns, min 251531 ns]
chromium-browse[1905801] lock 7ffe573f4f58 contended 4 times, 399927 avg ns [max: 476904 ns, min 178495 ns]
[root@five ~]#
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lore.kernel.org/lkml/20200922200922.1306034-1-hagen@jauu.net
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Hagen Paul Pfeifer [Mon, 21 Sep 2020 20:19:27 +0000 (22:19 +0200)]
perf script: Autopep8 futex-contention
10 years leaves its mark! Python has evolved and so has its style guide.
Even with vim it is getting hard to follow the no longer valid
guidelines (spaces vs. tabs).
Autopep8 this code to modernize it!
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Link: http://lore.kernel.org/lkml/20200921201928.799498-1-hagen@jauu.net
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jin Yao [Tue, 22 Sep 2020 01:50:04 +0000 (09:50 +0800)]
perf stat: Skip duration_time in setup_system_wide
Some metrics (such as DRAM_BW_Use) consist of uncore events and
duration_time. For uncore events, counter->core.system_wide is true. But
for duration_time, counter->core.system_wide is false, so
target.system_wide is set to false.
Then 'enable_on_exec' is set in the perf_event_attr of the uncore event,
and the kernel returns an error when trying to open the uncore event.
This patch skips the duration_time in setup_system_wide then
target.system_wide will be set to true for the evlist of uncore events +
duration_time.
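The decision can be sketched as follows. This is a simplified model with hypothetical names, not the perf source: each counter carries a system_wide flag, and the list counts as system wide only if every counter that is not skipped has the flag set.

```python
from collections import namedtuple

# Hypothetical stand-in for an event selector in the evlist.
Counter = namedtuple("Counter", ["name", "system_wide"])


def evlist_is_system_wide(evlist, skip_duration_time):
    """Return True if the whole event list should be opened system wide."""
    for counter in evlist:
        if skip_duration_time and counter.name == "duration_time":
            continue  # duration_time does not veto system-wide mode
        if not counter.system_wide:
            return False
    return len(evlist) > 0


evlist = [
    Counter("arb/event=0x84,umask=0x1/", True),
    Counter("arb/event=0x81,umask=0x1/", True),
    Counter("duration_time", False),
]

print(evlist_is_system_wide(evlist, skip_duration_time=False))
print(evlist_is_system_wide(evlist, skip_duration_time=True))
```

Without the skip, the single duration_time entry forces the result to false for the whole list, which matches the failure described above.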
Before (tested on skylake desktop):
# perf stat -M DRAM_BW_Use -- sleep 1
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (arb/event=0x84,umask=0x1/).
/bin/dmesg | grep -i perf may provide additional information.
After:
# perf stat -M DRAM_BW_Use -- sleep 1
Performance counter stats for 'system wide':
169 arb/event=0x84,umask=0x1/ # 0.00 DRAM_BW_Use
40,427 arb/event=0x81,umask=0x1/
1,000,902,197 ns duration_time
1.000902197 seconds time elapsed
Fixes: e3ba76deef23064f ("perf tools: Force uncore events to system wide monitoring")
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jin Yao <yao.jin@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20200922015004.30114-1-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Yang Weijiang [Wed, 26 Aug 2020 01:55:24 +0000 (09:55 +0800)]
selftests: kvm: Fix assert failure in single-step test
This is a follow-up patch to fix an issue left in commit:
98b0bf02738004829d7e26d6cb47b2e469aaba86
selftests: kvm: Use a shorter encoding to clear RAX
With that change, the "xor" instruction length in the ss_size array must
also change from 3 to 2 in order to pass the check below:
for (i = 0; i < (sizeof(ss_size) / sizeof(ss_size[0])); i++) {
target_rip += ss_size[i];
CLEAR_DEBUG();
debug.control = KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP;
debug.arch.debugreg[7] = 0x00000400;
APPLY_DEBUG();
vcpu_run(vm, VCPU_ID);
TEST_ASSERT(run->exit_reason == KVM_EXIT_DEBUG &&
run->debug.arch.exception == DB_VECTOR &&
run->debug.arch.pc == target_rip &&
run->debug.arch.dr6 == target_dr6,
"SINGLE_STEP[%d]: exit %d exception %d rip 0x%llx "
"(should be 0x%llx) dr6 0x%llx (should be 0x%llx)",
i, run->exit_reason, run->debug.arch.exception,
run->debug.arch.pc, target_rip, run->debug.arch.dr6,
target_dr6);
}
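The loop above adds each instruction length to target_rip before comparing, so one wrong length shifts every later expected address. A small sketch, where the start address and the second length are made-up values and only the 3 to 2 change for "xor" comes from this commit:

```python
from itertools import accumulate


def expected_rips(start_rip, ss_size):
    """Expected RIP after each single step, given the instruction lengths."""
    return [start_rip + offset for offset in accumulate(ss_size)]


START = 0x1000  # hypothetical guest code address

# Stale table: "xor" still counted as 3 bytes.
stale = expected_rips(START, [3, 2])
# Updated table: the shorter encoding is 2 bytes.
updated = expected_rips(START, [2, 2])

print([hex(rip) for rip in stale])
print([hex(rip) for rip in updated])
```

Every entry of the stale list is one byte too high, so the very first comparison against run->debug.arch.pc already fails.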
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Message-Id: <20200826015524.13251-1-weijiang.yang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Mohammed Gamal [Thu, 3 Sep 2020 14:11:22 +0000 (16:11 +0200)]
KVM: x86: VMX: Make smaller physical guest address space support user-configurable
This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
Since smaller physical address spaces are only supported on VMX, the
parameter is only exposed in the kvm_intel module.
For now disable support by default, and let the user decide if they want
to enable it.
Modifications to VMX page fault and EPT violation handling will depend
on whether that parameter is enabled.
Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
Message-Id: <20200903141122.72908-1-mgamal@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wei Li [Wed, 23 Sep 2020 06:53:26 +0000 (14:53 +0800)]
MIPS: BCM47XX: Remove the needless check with the 1074K
As there is no known SoC powered by the MIPS 1074K in the bcm47xx series,
the check for the 1074K is needless, so just remove it.
Link: https://wireless.wiki.kernel.org/en/users/Drivers/b43/soc
Fixes: 442e14a2c55e ("MIPS: Add 1074K CPU support explicitly.")
Signed-off-by: Wei Li <liwei391@huawei.com>
Acked-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>