======================================================
hrtimers - subsystem for high-resolution kernel timers
======================================================

This patch introduces a new subsystem for high-resolution kernel timers.

One might ask: we already have a timer subsystem (kernel/timers.c), so
why do we need two timer subsystems? After a lot of back and forth
trying to integrate high-resolution and high-precision features into
the existing timer framework, and after testing various such
high-resolution timer implementations in practice, we came to the
conclusion that the timer wheel code is fundamentally not suitable for
such an approach. We initially didn't believe this ('there must be a way
to solve this') and spent considerable effort trying to integrate
things into the timer wheel, but we failed. In hindsight, there are
several reasons why such integration is hard or impossible:

- the forced handling of low-resolution and high-resolution timers in
  the same way leads to a lot of compromises, macro magic and #ifdef
  mess. The timers.c code is very "tightly coded" around jiffies and
  32-bitness assumptions, and has been honed and micro-optimized for a
  relatively narrow use case (jiffies in a relatively narrow HZ range)
  for many years - and thus even small extensions to it easily break
  the wheel concept, leading to even worse compromises. The timer wheel
  code is very good and tight code, and there are zero problems with it
  in its current usage - but it is simply not suitable to be extended
  for high-res timers.

- the unpredictable [O(N)] overhead of cascading leads to delays which
  necessitate a more complex handling of high-resolution timers, which
  in turn decreases robustness. Such a design still leads to rather
  large timing inaccuracies. Cascading is a fundamental property of the
  timer wheel concept; it cannot be 'designed out' without inevitably
  degrading other portions of the timers.c code in an unacceptable way.

- the implementation of the current posix-timer subsystem on top of
  the timer wheel has already introduced a quite complex handling of
  the required readjusting of absolute CLOCK_REALTIME timers at
  settimeofday or NTP time - further underlining our experience by
  example: the timer wheel data structure is too rigid for high-res
  timers.

- the timer wheel code is best suited to use cases which can be
  identified as "timeouts". Such timeouts are usually set up to cover
  error conditions in various I/O paths, such as networking and block
  I/O. The vast majority of those timers never expire and are rarely
  recascaded, because the expected correct event arrives in time so they
  can be removed from the timer wheel before any further processing of
  them becomes necessary. Thus the users of these timeouts can accept
  the granularity and precision tradeoffs of the timer wheel, and
  largely expect the timer subsystem to have near-zero overhead.
  Accurate timing for them is not a core purpose - in fact most of the
  timeout values used are ad-hoc. For them it is at most a necessary
  evil to guarantee the processing of actual timeout completions
  (because most of the timeouts are deleted before completion), which
  should thus be as cheap and unintrusive as possible.

The primary users of precision timers are user-space applications that
use the nanosleep, posix-timers and itimer interfaces. In-kernel users
such as drivers and subsystems which require precisely timed events
(e.g. multimedia) can also benefit from the availability of a separate
high-resolution timer subsystem.

While this subsystem does not offer high-resolution clock sources just
yet, the hrtimer subsystem can be easily extended with high-resolution
clock capabilities, and patches for that exist and are maturing quickly.
The increasing demand for realtime and multimedia applications, along
with other potential users of precise timers, gives another reason to
separate the "timeout" and "precise timer" subsystems.

Another potential benefit is that such a separation allows even more
special-purpose optimization of the existing timer wheel for the
low-resolution and low-precision use cases - once the
precision-sensitive APIs are separated from the timer wheel and are
migrated over to hrtimers. E.g. we could decrease the frequency of the
timeout subsystem from 250 Hz to 100 Hz (or even lower).

hrtimer subsystem implementation details
----------------------------------------

The basic design considerations were:

- data structure not bound to jiffies or any other granularity. All the
  kernel logic works at 64-bit nanoseconds resolution - no compromises.

- simplification of existing, timing related kernel code

Another basic requirement was the immediate enqueueing and ordering of
timers at activation time. After looking at several possible solutions
such as radix trees and hashes, we chose the red-black tree as the basic
data structure. Rbtrees are available as a library in the kernel and are
used in various performance-critical areas of e.g. memory management and
file systems. The rbtree is used solely for time-sorted ordering, while
a separate list is used to give the expiry code fast access to the
queued timers, without having to walk the rbtree.

(This separate list is also useful for later when we'll introduce
high-resolution clocks, where we need separate pending and expired
queues while keeping the time-order intact.)
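
The idea of time-ordered enqueueing with cheap access to the next timer
can be sketched in a few lines of user-space C. This is only an
illustration under stated assumptions: all names (demo_timer,
demo_queue, demo_enqueue, demo_next) are hypothetical, a plain
unbalanced binary search tree stands in for the kernel's red-black tree
(so there is no rebalancing), and a cached pointer to the earliest timer
plays the role of the separate list described above.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical types for illustration; not the kernel's rbtree API. */
struct demo_timer {
    int64_t expires_ns;               /* absolute expiry in nanoseconds */
    struct demo_timer *left, *right;
};

struct demo_queue {
    struct demo_timer *root;          /* time-ordered tree of timers */
    struct demo_timer *first;         /* cached earliest timer */
};

/*
 * Insert a timer in expiry order at activation time. Equal expiries go
 * to the right, so timers with the same expiry keep insertion order.
 * The cached 'first' pointer is updated only if the new timer ended up
 * as the leftmost node, i.e. it never took a right turn on the way down.
 */
static void demo_enqueue(struct demo_queue *q, struct demo_timer *t)
{
    struct demo_timer **link = &q->root;
    int leftmost = 1;

    t->left = t->right = NULL;
    while (*link) {
        if (t->expires_ns < (*link)->expires_ns) {
            link = &(*link)->left;
        } else {
            link = &(*link)->right;
            leftmost = 0;
        }
    }
    *link = t;
    if (leftmost)
        q->first = t;
}

/* O(1) access to the next timer to expire, without walking the tree. */
static struct demo_timer *demo_next(const struct demo_queue *q)
{
    return q->first;
}
```

Starting from an empty queue (both pointers NULL), enqueueing timers
with expiries 300, 100 and 200 leaves demo_next() pointing at the timer
with expiry 100. A real implementation also needs removal and
rebalancing, which this sketch leaves out.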

Time-ordered enqueueing is not purely for the purposes of
high-resolution clocks though; it also simplifies the handling of
absolute timers based on a low-resolution CLOCK_REALTIME. The existing
implementation needed to keep an extra list of all armed absolute
CLOCK_REALTIME timers along with complex locking. In case of
settimeofday and NTP, all the timers (!) had to be dequeued, the
time-changing code had to fix them up one by one, and all of them had to
be enqueued again. The time-ordered enqueueing and the storage of the
expiry time in absolute time units remove all this complex and poorly
scaling code from the posix-timer implementation - the clock can simply
be set without having to touch the rbtree. This also makes the handling
of posix-timers simpler in general.
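
Why absolute expiry times make a clock change cheap can be shown with a
very small user-space sketch. It assumes the timers are already kept in
ascending expiry order; the helper name demo_expired_count and the
array representation are hypothetical and chosen only for brevity. The
point is that a clock jump changes only the "now" value that is
compared against the queue, never the queue itself.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical helper: expiries[] holds absolute expiry times in
 * nanoseconds, sorted ascending. Returns how many timers at the head
 * of the queue are due at time now_ns. The queue is not modified.
 */
static size_t demo_expired_count(const int64_t *expiries, size_t n,
                                 int64_t now_ns)
{
    size_t i = 0;

    /* Stop at the first timer that is still in the future. */
    while (i < n && expiries[i] <= now_ns)
        i++;
    return i;
}
```

With expiries of 1000, 2000 and 5000 ns, one timer is due at
now_ns = 1500. If the clock is then set forward to 6000, the same call
reports all three timers as due, and if it is set back to 500 it
reports none. In neither case did any queued entry have to be dequeued,
adjusted and enqueued again.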

The locking and per-CPU behavior of hrtimers was mostly taken from the
existing timer wheel code, as it is mature and well suited. Sharing code
was not really a win, due to the different data structures. Also, the
hrtimer functions now have clearer behavior and clearer names - such as
hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
equivalent to timer_delete() and timer_delete_sync()] - so there's no
direct 1:1 mapping between them on the algorithmic level, and thus no
real potential for code sharing either.

Basic data types: every time value, absolute or relative, is in a
special nanosecond-resolution 64-bit type: ktime_t.
(Originally, the kernel-internal representation of ktime_t values and
operations was implemented via macros and inline functions, and could be
switched at compile time between a "hybrid union" type and a plain
"scalar" 64-bit nanoseconds representation. This was abandoned in the
context of the Y2038 work.)
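
As an illustration of the scalar representation, the following
user-space sketch converts between struct timespec and a signed 64-bit
nanosecond count. The type demo_ktime_t and the helper names are
hypothetical stand-ins, not the kernel's ktime_t API, and the reverse
conversion assumes a non-negative value.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a scalar 64-bit nanosecond time value. */
typedef int64_t demo_ktime_t;

#define DEMO_NSEC_PER_SEC 1000000000LL

/* Fold seconds and nanoseconds into one 64-bit nanosecond count. */
static demo_ktime_t demo_timespec_to_ns(struct timespec ts)
{
    return (demo_ktime_t)ts.tv_sec * DEMO_NSEC_PER_SEC + ts.tv_nsec;
}

/* Split a non-negative nanosecond count back into a timespec. */
static struct timespec demo_ns_to_timespec(demo_ktime_t t)
{
    struct timespec ts;

    ts.tv_sec = (time_t)(t / DEMO_NSEC_PER_SEC);
    ts.tv_nsec = (long)(t % DEMO_NSEC_PER_SEC);
    return ts;
}
```

With a single scalar, comparing two expiry times or adding a relative
timeout to an absolute time is one 64-bit operation, with no carry
between a seconds field and a nanoseconds field. A signed 64-bit
nanosecond count covers roughly 292 years in either direction.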

hrtimers - rounding of timer values
-----------------------------------

The hrtimer code rounds timer events to the resolution of a
lower-resolution clock only because it has to. Otherwise it does no
artificial rounding at all.

One question is what resolution value should be returned to the user by
the clock_getres() interface. This will return whatever real resolution
a given clock has - be it low-res, high-res, or artificially-low-res.
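
From user space, the resolution of a clock can be queried with the
POSIX clock_getres() call. The sketch below wraps that call and adds a
helper that rounds a time value to a given resolution. The helper names
are hypothetical, and the rounding direction (up, so that a timer never
fires early, as POSIX requires for timer expirations) is an assumption
of this example; the text above does not spell out the direction.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Resolution of a clock in nanoseconds, or -1 if the query fails. */
static int64_t demo_clock_res_ns(clockid_t clk)
{
    struct timespec res;

    if (clock_getres(clk, &res) != 0)
        return -1;
    return (int64_t)res.tv_sec * 1000000000LL + res.tv_nsec;
}

/*
 * Round a non-negative time value up to the next multiple of res_ns.
 * A resolution of 1 ns or less means no rounding is needed.
 */
static int64_t demo_round_up_ns(int64_t t_ns, int64_t res_ns)
{
    int64_t rem;

    if (res_ns <= 1)
        return t_ns;
    rem = t_ns % res_ns;
    return rem ? t_ns + (res_ns - rem) : t_ns;
}
```

For example, demo_round_up_ns(1000001, 1000000) yields 2000000, while a
value that is already a multiple of the resolution is returned
unchanged. The value reported by demo_clock_res_ns(CLOCK_MONOTONIC)
depends on the kernel configuration and hardware.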

hrtimers - testing and verification
-----------------------------------

We used the high-resolution clock subsystem on top of hrtimers to verify
the hrtimer implementation details in practice, and we also ran the
posix timer tests in order to ensure specification compliance. We also
ran tests on low-resolution clocks.

The hrtimer patch converts the following kernel functionality to use
hrtimers:

 - nanosleep
 - itimers
 - posix-timers

The conversion of nanosleep and posix-timers enabled the unification of
nanosleep and clock_nanosleep.

The code was successfully compiled for the following platforms:

 i386, x86_64, ARM, PPC, PPC64, IA64

The code was run-tested on the following platforms:

 i386(UP/SMP), x86_64(UP/SMP), ARM, PPC

hrtimers were also integrated into the -rt tree, along with an
hrtimers-based high-resolution clock implementation, so the hrtimers
code got a healthy amount of testing and use in practice.

    Thomas Gleixner, Ingo Molnar