perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent task targets
Running the perf benchmark with (note: more aggressive parameters vs.
preceding changes, but same 256 CPUs host):
| $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 100 threads with 4 breakpoints and 128 parallelism
| Total time: 1.989 [sec]
|
| 38.854160 usecs/op
| 4973.332500 usecs/op/cpu
20.43% [kernel] [k] queued_spin_lock_slowpath
18.75% [kernel] [k] osq_lock
16.98% [kernel] [k] rhashtable_jhash2
8.34% [kernel] [k] task_bp_pinned
4.23% [kernel] [k] smp_cfm_core_cond
3.65% [kernel] [k] bcmp
2.83% [kernel] [k] toggle_bp_slot
1.87% [kernel] [k] find_next_bit
1.49% [kernel] [k] __reserve_bp_slot
We can see that a majority of the time is now spent hashing task
pointers to index into task_bps_ht in task_bp_pinned().
Obtaining the max_bp_pinned_slots() for CPU-independent task targets
currently is O(#cpus), and calls task_bp_pinned() for each CPU, even if
the result of task_bp_pinned() is CPU-independent.
The loop in max_bp_pinned_slots() wants to compute the maximum slots
across all CPUs. If task_bp_pinned() is CPU-independent, we can do so by
obtaining the max slots across all CPUs and adding task_bp_pinned().
To do so in O(1), use a bp_slots_histogram for CPU-pinned slots.
After this optimization:
| $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 100 threads with 4 breakpoints and 128 parallelism
| Total time: 1.930 [sec]
|
| 37.697832 usecs/op
| 4825.322500 usecs/op/cpu
19.13% [kernel] [k] queued_spin_lock_slowpath
18.21% [kernel] [k] rhashtable_jhash2
15.46% [kernel] [k] osq_lock
6.27% [kernel] [k] toggle_bp_slot
5.91% [kernel] [k] task_bp_pinned
5.05% [kernel] [k] smp_cfm_core_cond
1.78% [kernel] [k] update_sg_lb_stats
1.36% [kernel] [k] llist_reverse_order
1.34% [kernel] [k] find_next_bit
1.19% [kernel] [k] bcmp
Suggesting that time spent in task_bp_pinned() has been reduced.
However, we're still hashing too much, which will be addressed in the
subsequent change.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-14-elver@google.com