From a1eb1411b4e4251db02179e39d234c2ee5192c72 Mon Sep 17 00:00:00 2001 From: Stanislaw Gruszka Date: Wed, 17 Aug 2016 11:30:44 +0200 Subject: [PATCH] sched/cputime: Improve scalability by not accounting thread group tasks pending runtime Commit: d670ec13178d0 ("posix-cpu-timers: Cure SMP wobbles") started accounting thread group tasks pending runtime in thread_group_cputime(). Another commit: 6e998916dfe32 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") updated scheduler runtime statistics (call update_curr()) when reading task pending runtime. Those changes cause bad performance of SYS_times() and SYS_clock_gettimes(CLOCK_PROCESS_CPUTIME_ID) syscalls, especially on larger systems with many CPUs. While we would like to have cpuclock monotonicity kept i.e. have problems fixed by above commits stay fixed, we also would like to have good performance. However when we notice that change from commit d670ec13178d0 is not longer needed to solve problem addressed by that commit, because of change from the second commit 6e998916dfe32, we can get room for optimization. Since we update task while reading it's pending runtime in task_sched_runtime(), clock_gettime(CLOCK_PROCESS_CPUTIME_ID) will see updated values and on testcase from d670ec13178d0 process cpuclock will not be smaller than thread cpuclock. I tested the patch on testcases from commits d670ec13178d0, 6e998916dfe32 and some other cpuclock/cputimers testcases and did not found cpuclock monotonicity problems or other malfunction. This patch has the drawback that we will not provide thread group cputime up-to-date to the last moment. For example when arming cputime timer, we will arm it with possibly a bit outdated values and that timer will trigger earlier compared to behaviour without the patch. However that was the behaviour before d670ec13178d0 commit (kernel v3.1) so it's unlikely to affect applications. Patch improves related syscall performance, as measured by Giovanni's benchmarks described in commit: 6075620b0590e ("sched/cputime: Mitigate performance regression in times()/clock_gettime()") The benchmark results are: SYS_clock_gettime(): threads 4.7-rc7 3.18-rc3 4.7-rc7 + prefetch 4.7-rc7 + patch (pre-6e998916dfe3) 2 3.48 2.23 ( 35.68%) 3.06 ( 11.83%) 1.08 ( 68.81%) 5 3.33 2.83 ( 14.84%) 3.25 ( 2.40%) 0.71 ( 78.55%) 8 3.37 2.84 ( 15.80%) 3.26 ( 3.30%) 0.56 ( 83.49%) 12 3.32 3.09 ( 6.69%) 3.37 ( -1.60%) 0.42 ( 87.28%) 21 4.01 3.14 ( 21.70%) 3.90 ( 2.74%) 0.35 ( 91.35%) 30 3.63 3.28 ( 9.75%) 3.36 ( 7.41%) 0.28 ( 92.23%) 48 3.71 3.02 ( 18.69%) 3.11 ( 16.27%) 0.39 ( 89.39%) 79 3.75 2.88 ( 23.23%) 3.16 ( 15.74%) 0.46 ( 87.76%) 110 3.81 2.95 ( 22.62%) 3.25 ( 14.80%) 0.56 ( 85.41%) 128 3.88 3.05 ( 21.28%) 3.31 ( 14.76%) 0.62 ( 84.10%) SYS_times(): threads 4.7-rc7 3.18-rc3 4.7-rc7 + prefetch 4.7-rc7 + patch (pre-6e998916dfe3) 2 3.65 2.27 ( 37.94%) 3.25 ( 11.03%) 1.62 ( 55.71%) 5 3.45 2.78 ( 19.34%) 3.17 ( 7.92%) 2.33 ( 32.28%) 8 3.52 2.79 ( 20.66%) 3.22 ( 8.69%) 2.06 ( 41.44%) 12 3.29 3.02 ( 8.33%) 3.36 ( -2.04%) 2.00 ( 39.18%) 21 4.07 3.10 ( 23.86%) 3.92 ( 3.78%) 2.07 ( 49.18%) 30 3.87 3.33 ( 13.80%) 3.40 ( 12.17%) 1.89 ( 51.12%) 48 3.79 2.96 ( 21.94%) 3.16 ( 16.61%) 1.69 ( 55.46%) 79 3.88 2.88 ( 25.82%) 3.28 ( 15.42%) 1.60 ( 58.81%) 110 3.90 2.98 ( 23.73%) 3.38 ( 13.35%) 1.73 ( 55.61%) 128 4.00 3.10 ( 22.40%) 3.38 ( 15.45%) 1.66 ( 58.52%) Reported-and-tested-by: Giovanni Gherdovich Signed-off-by: Stanislaw Gruszka Cc: Linus Torvalds Cc: Mel Gorman Cc: Mike Galbraith Cc: Paolo Bonzini Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Wanpeng Li Link: http://lkml.kernel.org/r/20160817093043.GA25206@redhat.com Signed-off-by: Ingo Molnar --- kernel/sched/cputime.c | 33 ++++++++++++++++++++++++++++++++- 1 file changed, 32 insertions(+), 1 deletion(-) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index a846cf8..b93c72d 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -306,6 +306,26 @@ static inline cputime_t account_other_time(cputime_t max) return accounted; } +#ifdef CONFIG_64BIT +static inline u64 read_sum_exec_runtime(struct task_struct *t) +{ + return t->se.sum_exec_runtime; +} +#else +static u64 read_sum_exec_runtime(struct task_struct *t) +{ + u64 ns; + struct rq_flags rf; + struct rq *rq; + + rq = task_rq_lock(t, &rf); + ns = t->se.sum_exec_runtime; + task_rq_unlock(rq, t, &rf); + + return ns; +} +#endif + /* * Accumulate raw cputime values of dead tasks (sig->[us]time) and live * tasks (sum on group iteration) belonging to @tsk's group. @@ -318,6 +338,17 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times) unsigned int seq, nextseq; unsigned long flags; + /* + * Update current task runtime to account pending time since last + * scheduler action or thread_group_cputime() call. This thread group + * might have other running tasks on different CPUs, but updating + * their runtime can affect syscall performance, so we skip account + * those pending times and rely only on values updated on tick or + * other scheduler action. + */ + if (same_thread_group(current, tsk)) + (void) task_sched_runtime(current); + rcu_read_lock(); /* Attempt a lockless read on the first round. */ nextseq = 0; @@ -332,7 +363,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times) task_cputime(t, &utime, &stime); times->utime += utime; times->stime += stime; - times->sum_exec_runtime += task_sched_runtime(t); + times->sum_exec_runtime += read_sum_exec_runtime(t); } /* If lockless access failed, take the lock. */ nextseq = 1; -- 2.7.4