--- /dev/null
+=======================
+CPU cooling APIs How To
+=======================
+
+Written by Amit Daniel Kachhap <amit.kachhap@linaro.org>
+
+Updated: 6 Jan 2015
+
+Copyright (c) 2012 Samsung Electronics Co., Ltd(http://www.samsung.com)
+
+0. Introduction
+===============
+
+The generic cpu cooling(freq clipping) provides registration/unregistration APIs
+to the caller. The binding of the cooling devices to the trip point is left for
+the user. The registration APIs returns the cooling device pointer.
+
+1. cpu cooling APIs
+===================
+
+1.1 cpufreq registration/unregistration APIs
+--------------------------------------------
+
+ ::
+
+ struct thermal_cooling_device
+ *cpufreq_cooling_register(struct cpumask *clip_cpus)
+
+ This interface function registers the cpufreq cooling device with the name
+ "thermal-cpufreq-%x". This api can support multiple instances of cpufreq
+ cooling devices.
+
+ clip_cpus:
+ cpumask of cpus where the frequency constraints will happen.
+
+ ::
+
+ struct thermal_cooling_device
+ *of_cpufreq_cooling_register(struct cpufreq_policy *policy)
+
+ This interface function registers the cpufreq cooling device with
+ the name "thermal-cpufreq-%x" linking it with a device tree node, in
+ order to bind it via the thermal DT code. This api can support multiple
+ instances of cpufreq cooling devices.
+
+ policy:
+ CPUFreq policy.
+
+
+ ::
+
+ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
+
+ This interface function unregisters the "thermal-cpufreq-%x" cooling device.
+
+ cdev: Cooling device pointer which has to be unregistered.
+
+2. Power models
+===============
+
+The power API registration functions provide a simple power model for
+CPUs. The current power is calculated as dynamic power (static power isn't
+supported currently). This power model requires that the operating-points of
+the CPUs are registered using the kernel's opp library and the
+`cpufreq_frequency_table` is assigned to the `struct device` of the
+cpu. If you are using CONFIG_CPUFREQ_DT then the
+`cpufreq_frequency_table` should already be assigned to the cpu
+device.
+
+The dynamic power consumption of a processor depends on many factors.
+For a given processor implementation the primary factors are:
+
+- The time the processor spends running, consuming dynamic power, as
+ compared to the time in idle states where dynamic consumption is
+ negligible. Herein we refer to this as 'utilisation'.
+- The voltage and frequency levels as a result of DVFS. The DVFS
+ level is a dominant factor governing power consumption.
+- In running time the 'execution' behaviour (instruction types, memory
+ access patterns and so forth) causes, in most cases, a second order
+ variation. In pathological cases this variation can be significant,
+ but typically it is of a much lesser impact than the factors above.
+
+A high level dynamic power consumption model may then be represented as::
+
+ Pdyn = f(run) * Voltage^2 * Frequency * Utilisation
+
+f(run) here represents the described execution behaviour and its
+result has a units of Watts/Hz/Volt^2 (this often expressed in
+mW/MHz/uVolt^2)
+
+The detailed behaviour for f(run) could be modelled on-line. However,
+in practice, such an on-line model has dependencies on a number of
+implementation specific processor support and characterisation
+factors. Therefore, in initial implementation that contribution is
+represented as a constant coefficient. This is a simplification
+consistent with the relative contribution to overall power variation.
+
+In this simplified representation our model becomes::
+
+ Pdyn = Capacitance * Voltage^2 * Frequency * Utilisation
+
+Where `capacitance` is a constant that represents an indicative
+running time dynamic power coefficient in fundamental units of
+mW/MHz/uVolt^2. Typical values for mobile CPUs might lie in range
+from 100 to 500. For reference, the approximate values for the SoC in
+ARM's Juno Development Platform are 530 for the Cortex-A57 cluster and
+140 for the Cortex-A53 cluster.
+++ /dev/null
-CPU cooling APIs How To
-===================================
-
-Written by Amit Daniel Kachhap <amit.kachhap@linaro.org>
-
-Updated: 6 Jan 2015
-
-Copyright (c) 2012 Samsung Electronics Co., Ltd(http://www.samsung.com)
-
-0. Introduction
-
-The generic cpu cooling(freq clipping) provides registration/unregistration APIs
-to the caller. The binding of the cooling devices to the trip point is left for
-the user. The registration APIs returns the cooling device pointer.
-
-1. cpu cooling APIs
-
-1.1 cpufreq registration/unregistration APIs
-1.1.1 struct thermal_cooling_device *cpufreq_cooling_register(
- struct cpumask *clip_cpus)
-
- This interface function registers the cpufreq cooling device with the name
- "thermal-cpufreq-%x". This api can support multiple instances of cpufreq
- cooling devices.
-
- clip_cpus: cpumask of cpus where the frequency constraints will happen.
-
-1.1.2 struct thermal_cooling_device *of_cpufreq_cooling_register(
- struct cpufreq_policy *policy)
-
- This interface function registers the cpufreq cooling device with
- the name "thermal-cpufreq-%x" linking it with a device tree node, in
- order to bind it via the thermal DT code. This api can support multiple
- instances of cpufreq cooling devices.
-
- policy: CPUFreq policy.
-
-1.1.3 void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
-
- This interface function unregisters the "thermal-cpufreq-%x" cooling device.
-
- cdev: Cooling device pointer which has to be unregistered.
-
-2. Power models
-
-The power API registration functions provide a simple power model for
-CPUs. The current power is calculated as dynamic power (static power isn't
-supported currently). This power model requires that the operating-points of
-the CPUs are registered using the kernel's opp library and the
-`cpufreq_frequency_table` is assigned to the `struct device` of the
-cpu. If you are using CONFIG_CPUFREQ_DT then the
-`cpufreq_frequency_table` should already be assigned to the cpu
-device.
-
-The dynamic power consumption of a processor depends on many factors.
-For a given processor implementation the primary factors are:
-
-- The time the processor spends running, consuming dynamic power, as
- compared to the time in idle states where dynamic consumption is
- negligible. Herein we refer to this as 'utilisation'.
-- The voltage and frequency levels as a result of DVFS. The DVFS
- level is a dominant factor governing power consumption.
-- In running time the 'execution' behaviour (instruction types, memory
- access patterns and so forth) causes, in most cases, a second order
- variation. In pathological cases this variation can be significant,
- but typically it is of a much lesser impact than the factors above.
-
-A high level dynamic power consumption model may then be represented as:
-
-Pdyn = f(run) * Voltage^2 * Frequency * Utilisation
-
-f(run) here represents the described execution behaviour and its
-result has a units of Watts/Hz/Volt^2 (this often expressed in
-mW/MHz/uVolt^2)
-
-The detailed behaviour for f(run) could be modelled on-line. However,
-in practice, such an on-line model has dependencies on a number of
-implementation specific processor support and characterisation
-factors. Therefore, in initial implementation that contribution is
-represented as a constant coefficient. This is a simplification
-consistent with the relative contribution to overall power variation.
-
-In this simplified representation our model becomes:
-
-Pdyn = Capacitance * Voltage^2 * Frequency * Utilisation
-
-Where `capacitance` is a constant that represents an indicative
-running time dynamic power coefficient in fundamental units of
-mW/MHz/uVolt^2. Typical values for mobile CPUs might lie in range
-from 100 to 500. For reference, the approximate values for the SoC in
-ARM's Juno Development Platform are 530 for the Cortex-A57 cluster and
-140 for the Cortex-A53 cluster.
+++ /dev/null
-Kernel driver exynos_tmu
-=================
-
-Supported chips:
-* ARM SAMSUNG EXYNOS4, EXYNOS5 series of SoC
- Datasheet: Not publicly available
-
-Authors: Donggeun Kim <dg77.kim@samsung.com>
-Authors: Amit Daniel <amit.daniel@samsung.com>
-
-TMU controller Description:
----------------------------
-
-This driver allows to read temperature inside SAMSUNG EXYNOS4/5 series of SoC.
-
-The chip only exposes the measured 8-bit temperature code value
-through a register.
-Temperature can be taken from the temperature code.
-There are three equations converting from temperature to temperature code.
-
-The three equations are:
- 1. Two point trimming
- Tc = (T - 25) * (TI2 - TI1) / (85 - 25) + TI1
-
- 2. One point trimming
- Tc = T + TI1 - 25
-
- 3. No trimming
- Tc = T + 50
-
- Tc: Temperature code, T: Temperature,
- TI1: Trimming info for 25 degree Celsius (stored at TRIMINFO register)
- Temperature code measured at 25 degree Celsius which is unchanged
- TI2: Trimming info for 85 degree Celsius (stored at TRIMINFO register)
- Temperature code measured at 85 degree Celsius which is unchanged
-
-TMU(Thermal Management Unit) in EXYNOS4/5 generates interrupt
-when temperature exceeds pre-defined levels.
-The maximum number of configurable threshold is five.
-The threshold levels are defined as follows:
- Level_0: current temperature > trigger_level_0 + threshold
- Level_1: current temperature > trigger_level_1 + threshold
- Level_2: current temperature > trigger_level_2 + threshold
- Level_3: current temperature > trigger_level_3 + threshold
-
- The threshold and each trigger_level are set
- through the corresponding registers.
-
-When an interrupt occurs, this driver notify kernel thermal framework
-with the function exynos_report_trigger.
-Although an interrupt condition for level_0 can be set,
-it can be used to synchronize the cooling action.
-
-TMU driver description:
------------------------
-
-The exynos thermal driver is structured as,
-
- Kernel Core thermal framework
- (thermal_core.c, step_wise.c, cpu_cooling.c)
- ^
- |
- |
-TMU configuration data -------> TMU Driver <------> Exynos Core thermal wrapper
-(exynos_tmu_data.c) (exynos_tmu.c) (exynos_thermal_common.c)
-(exynos_tmu_data.h) (exynos_tmu.h) (exynos_thermal_common.h)
-
-a) TMU configuration data: This consist of TMU register offsets/bitfields
- described through structure exynos_tmu_registers. Also several
- other platform data (struct exynos_tmu_platform_data) members
- are used to configure the TMU.
-b) TMU driver: This component initialises the TMU controller and sets different
- thresholds. It invokes core thermal implementation with the call
- exynos_report_trigger.
-c) Exynos Core thermal wrapper: This provides 3 wrapper function to use the
- Kernel core thermal framework. They are exynos_unregister_thermal,
- exynos_register_thermal and exynos_report_trigger.
--- /dev/null
+========================
+Kernel driver exynos_tmu
+========================
+
+Supported chips:
+
+* ARM SAMSUNG EXYNOS4, EXYNOS5 series of SoC
+
+ Datasheet: Not publicly available
+
+Authors: Donggeun Kim <dg77.kim@samsung.com>
+Authors: Amit Daniel <amit.daniel@samsung.com>
+
+TMU controller Description:
+---------------------------
+
+This driver allows to read temperature inside SAMSUNG EXYNOS4/5 series of SoC.
+
+The chip only exposes the measured 8-bit temperature code value
+through a register.
+Temperature can be taken from the temperature code.
+There are three equations converting from temperature to temperature code.
+
+The three equations are:
+ 1. Two point trimming::
+
+ Tc = (T - 25) * (TI2 - TI1) / (85 - 25) + TI1
+
+ 2. One point trimming::
+
+ Tc = T + TI1 - 25
+
+ 3. No trimming::
+
+ Tc = T + 50
+
+ Tc:
+ Temperature code, T: Temperature,
+ TI1:
+ Trimming info for 25 degree Celsius (stored at TRIMINFO register)
+ Temperature code measured at 25 degree Celsius which is unchanged
+ TI2:
+ Trimming info for 85 degree Celsius (stored at TRIMINFO register)
+ Temperature code measured at 85 degree Celsius which is unchanged
+
+TMU(Thermal Management Unit) in EXYNOS4/5 generates interrupt
+when temperature exceeds pre-defined levels.
+The maximum number of configurable threshold is five.
+The threshold levels are defined as follows::
+
+ Level_0: current temperature > trigger_level_0 + threshold
+ Level_1: current temperature > trigger_level_1 + threshold
+ Level_2: current temperature > trigger_level_2 + threshold
+ Level_3: current temperature > trigger_level_3 + threshold
+
+The threshold and each trigger_level are set
+through the corresponding registers.
+
+When an interrupt occurs, this driver notify kernel thermal framework
+with the function exynos_report_trigger.
+Although an interrupt condition for level_0 can be set,
+it can be used to synchronize the cooling action.
+
+TMU driver description:
+-----------------------
+
+The exynos thermal driver is structured as::
+
+ Kernel Core thermal framework
+ (thermal_core.c, step_wise.c, cpu_cooling.c)
+ ^
+ |
+ |
+ TMU configuration data -----> TMU Driver <----> Exynos Core thermal wrapper
+ (exynos_tmu_data.c) (exynos_tmu.c) (exynos_thermal_common.c)
+ (exynos_tmu_data.h) (exynos_tmu.h) (exynos_thermal_common.h)
+
+a) TMU configuration data:
+ This consist of TMU register offsets/bitfields
+ described through structure exynos_tmu_registers. Also several
+ other platform data (struct exynos_tmu_platform_data) members
+ are used to configure the TMU.
+b) TMU driver:
+ This component initialises the TMU controller and sets different
+ thresholds. It invokes core thermal implementation with the call
+ exynos_report_trigger.
+c) Exynos Core thermal wrapper:
+ This provides 3 wrapper function to use the
+ Kernel core thermal framework. They are exynos_unregister_thermal,
+ exynos_register_thermal and exynos_report_trigger.
+++ /dev/null
-EXYNOS EMULATION MODE
-========================
-
-Copyright (C) 2012 Samsung Electronics
-
-Written by Jonghwa Lee <jonghwa3.lee@samsung.com>
-
-Description
------------
-
-Exynos 4x12 (4212, 4412) and 5 series provide emulation mode for thermal management unit.
-Thermal emulation mode supports software debug for TMU's operation. User can set temperature
-manually with software code and TMU will read current temperature from user value not from
-sensor's value.
-
-Enabling CONFIG_THERMAL_EMULATION option will make this support available.
-When it's enabled, sysfs node will be created as
-/sys/devices/virtual/thermal/thermal_zone'zone id'/emul_temp.
-
-The sysfs node, 'emul_node', will contain value 0 for the initial state. When you input any
-temperature you want to update to sysfs node, it automatically enable emulation mode and
-current temperature will be changed into it.
-(Exynos also supports user changeable delay time which would be used to delay of
- changing temperature. However, this node only uses same delay of real sensing time, 938us.)
-
-Exynos emulation mode requires synchronous of value changing and enabling. It means when you
-want to update the any value of delay or next temperature, then you have to enable emulation
-mode at the same time. (Or you have to keep the mode enabling.) If you don't, it fails to
-change the value to updated one and just use last succeessful value repeatedly. That's why
-this node gives users the right to change termerpature only. Just one interface makes it more
-simply to use.
-
-Disabling emulation mode only requires writing value 0 to sysfs node.
-
-
-TEMP 120 |
- |
- 100 |
- |
- 80 |
- | +-----------
- 60 | | |
- | +-------------| |
- 40 | | | |
- | | | |
- 20 | | | +----------
- | | | | |
- 0 |______________|_____________|__________|__________|_________
- A A A A TIME
- |<----->| |<----->| |<----->| |
- | 938us | | | | | |
-emulation : 0 50 | 70 | 20 | 0
-current temp : sensor 50 70 20 sensor
--- /dev/null
+=====================
+Exynos Emulation Mode
+=====================
+
+Copyright (C) 2012 Samsung Electronics
+
+Written by Jonghwa Lee <jonghwa3.lee@samsung.com>
+
+Description
+-----------
+
+Exynos 4x12 (4212, 4412) and 5 series provide emulation mode for thermal
+management unit. Thermal emulation mode supports software debug for
+TMU's operation. User can set temperature manually with software code
+and TMU will read current temperature from user value not from sensor's
+value.
+
+Enabling CONFIG_THERMAL_EMULATION option will make this support
+available. When it's enabled, sysfs node will be created as
+/sys/devices/virtual/thermal/thermal_zone'zone id'/emul_temp.
+
+The sysfs node, 'emul_node', will contain value 0 for the initial state.
+When you input any temperature you want to update to sysfs node, it
+automatically enable emulation mode and current temperature will be
+changed into it.
+
+(Exynos also supports user changeable delay time which would be used to
+delay of changing temperature. However, this node only uses same delay
+of real sensing time, 938us.)
+
+Exynos emulation mode requires synchronous of value changing and
+enabling. It means when you want to update the any value of delay or
+next temperature, then you have to enable emulation mode at the same
+time. (Or you have to keep the mode enabling.) If you don't, it fails to
+change the value to updated one and just use last succeessful value
+repeatedly. That's why this node gives users the right to change
+termerpature only. Just one interface makes it more simply to use.
+
+Disabling emulation mode only requires writing value 0 to sysfs node.
+
+::
+
+
+ TEMP 120 |
+ |
+ 100 |
+ |
+ 80 |
+ | +-----------
+ 60 | | |
+ | +-------------| |
+ 40 | | | |
+ | | | |
+ 20 | | | +----------
+ | | | | |
+ 0 |______________|_____________|__________|__________|_________
+ A A A A TIME
+ |<----->| |<----->| |<----->| |
+ | 938us | | | | | |
+ emulation : 0 50 | 70 | 20 | 0
+ current temp: sensor 50 70 20 sensor
--- /dev/null
+:orphan:
+
+=======
+Thermal
+=======
+
+.. toctree::
+ :maxdepth: 1
+
+ cpu-cooling-api
+ sysfs-api
+ power_allocator
+
+ exynos_thermal
+ exynos_thermal_emulation
+ intel_powerclamp
+ nouveau_thermal
+ x86_pkg_temperature_thermal
--- /dev/null
+=======================
+Intel Powerclamp Driver
+=======================
+
+By:
+ - Arjan van de Ven <arjan@linux.intel.com>
+ - Jacob Pan <jacob.jun.pan@linux.intel.com>
+
+.. Contents:
+
+ (*) Introduction
+ - Goals and Objectives
+
+ (*) Theory of Operation
+ - Idle Injection
+ - Calibration
+
+ (*) Performance Analysis
+ - Effectiveness and Limitations
+ - Power vs Performance
+ - Scalability
+ - Calibration
+ - Comparison with Alternative Techniques
+
+ (*) Usage and Interfaces
+ - Generic Thermal Layer (sysfs)
+ - Kernel APIs (TBD)
+
+INTRODUCTION
+============
+
+Consider the situation where a system’s power consumption must be
+reduced at runtime, due to power budget, thermal constraint, or noise
+level, and where active cooling is not preferred. Software managed
+passive power reduction must be performed to prevent the hardware
+actions that are designed for catastrophic scenarios.
+
+Currently, P-states, T-states (clock modulation), and CPU offlining
+are used for CPU throttling.
+
+On Intel CPUs, C-states provide effective power reduction, but so far
+they’re only used opportunistically, based on workload. With the
+development of intel_powerclamp driver, the method of synchronizing
+idle injection across all online CPU threads was introduced. The goal
+is to achieve forced and controllable C-state residency.
+
+Test/Analysis has been made in the areas of power, performance,
+scalability, and user experience. In many cases, clear advantage is
+shown over taking the CPU offline or modulating the CPU clock.
+
+
+THEORY OF OPERATION
+===================
+
+Idle Injection
+--------------
+
+On modern Intel processors (Nehalem or later), package level C-state
+residency is available in MSRs, thus also available to the kernel.
+
+These MSRs are::
+
+ #define MSR_PKG_C2_RESIDENCY 0x60D
+ #define MSR_PKG_C3_RESIDENCY 0x3F8
+ #define MSR_PKG_C6_RESIDENCY 0x3F9
+ #define MSR_PKG_C7_RESIDENCY 0x3FA
+
+If the kernel can also inject idle time to the system, then a
+closed-loop control system can be established that manages package
+level C-state. The intel_powerclamp driver is conceived as such a
+control system, where the target set point is a user-selected idle
+ratio (based on power reduction), and the error is the difference
+between the actual package level C-state residency ratio and the target idle
+ratio.
+
+Injection is controlled by high priority kernel threads, spawned for
+each online CPU.
+
+These kernel threads, with SCHED_FIFO class, are created to perform
+clamping actions of controlled duty ratio and duration. Each per-CPU
+thread synchronizes its idle time and duration, based on the rounding
+of jiffies, so accumulated errors can be prevented to avoid a jittery
+effect. Threads are also bound to the CPU such that they cannot be
+migrated, unless the CPU is taken offline. In this case, threads
+belong to the offlined CPUs will be terminated immediately.
+
+Running as SCHED_FIFO and relatively high priority, also allows such
+scheme to work for both preemptable and non-preemptable kernels.
+Alignment of idle time around jiffies ensures scalability for HZ
+values. This effect can be better visualized using a Perf timechart.
+The following diagram shows the behavior of kernel thread
+kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
+for a given "duration", then relinquishes the CPU to other tasks,
+until the next time interval.
+
+The NOHZ schedule tick is disabled during idle time, but interrupts
+are not masked. Tests show that the extra wakeups from scheduler tick
+have a dramatic impact on the effectiveness of the powerclamp driver
+on large scale systems (Westmere system with 80 processors).
+
+::
+
+ CPU0
+ ____________ ____________
+ kidle_inject/0 | sleep | mwait | sleep |
+ _________| |________| |_______
+ duration
+ CPU1
+ ____________ ____________
+ kidle_inject/1 | sleep | mwait | sleep |
+ _________| |________| |_______
+ ^
+ |
+ |
+ roundup(jiffies, interval)
+
+Only one CPU is allowed to collect statistics and update global
+control parameters. This CPU is referred to as the controlling CPU in
+this document. The controlling CPU is elected at runtime, with a
+policy that favors BSP, taking into account the possibility of a CPU
+hot-plug.
+
+In terms of dynamics of the idle control system, package level idle
+time is considered largely as a non-causal system where its behavior
+cannot be based on the past or current input. Therefore, the
+intel_powerclamp driver attempts to enforce the desired idle time
+instantly as given input (target idle ratio). After injection,
+powerclamp monitors the actual idle for a given time window and adjust
+the next injection accordingly to avoid over/under correction.
+
+When used in a causal control system, such as a temperature control,
+it is up to the user of this driver to implement algorithms where
+past samples and outputs are included in the feedback. For example, a
+PID-based thermal controller can use the powerclamp driver to
+maintain a desired target temperature, based on integral and
+derivative gains of the past samples.
+
+
+
+Calibration
+-----------
+During scalability testing, it is observed that synchronized actions
+among CPUs become challenging as the number of cores grows. This is
+also true for the ability of a system to enter package level C-states.
+
+To make sure the intel_powerclamp driver scales well, online
+calibration is implemented. The goals for doing such a calibration
+are:
+
+a) determine the effective range of idle injection ratio
+b) determine the amount of compensation needed at each target ratio
+
+Compensation to each target ratio consists of two parts:
+
+ a) steady state error compensation
+ This is to offset the error occurring when the system can
+ enter idle without extra wakeups (such as external interrupts).
+
+ b) dynamic error compensation
+ When an excessive amount of wakeups occurs during idle, an
+ additional idle ratio can be added to quiet interrupts, by
+ slowing down CPU activities.
+
+A debugfs file is provided for the user to examine compensation
+progress and results, such as on a Westmere system::
+
+ [jacob@nex01 ~]$ cat
+ /sys/kernel/debug/intel_powerclamp/powerclamp_calib
+ controlling cpu: 0
+ pct confidence steady dynamic (compensation)
+ 0 0 0 0
+ 1 1 0 0
+ 2 1 1 0
+ 3 3 1 0
+ 4 3 1 0
+ 5 3 1 0
+ 6 3 1 0
+ 7 3 1 0
+ 8 3 1 0
+ ...
+ 30 3 2 0
+ 31 3 2 0
+ 32 3 1 0
+ 33 3 2 0
+ 34 3 1 0
+ 35 3 2 0
+ 36 3 1 0
+ 37 3 2 0
+ 38 3 1 0
+ 39 3 2 0
+ 40 3 3 0
+ 41 3 1 0
+ 42 3 2 0
+ 43 3 1 0
+ 44 3 1 0
+ 45 3 2 0
+ 46 3 3 0
+ 47 3 0 0
+ 48 3 2 0
+ 49 3 3 0
+
+Calibration occurs during runtime. No offline method is available.
+Steady state compensation is used only when confidence levels of all
+adjacent ratios have reached satisfactory level. A confidence level
+is accumulated based on clean data collected at runtime. Data
+collected during a period without extra interrupts is considered
+clean.
+
+To compensate for excessive amounts of wakeup during idle, additional
+idle time is injected when such a condition is detected. Currently,
+we have a simple algorithm to double the injection ratio. A possible
+enhancement might be to throttle the offending IRQ, such as delaying
+EOI for level triggered interrupts. But it is a challenge to be
+non-intrusive to the scheduler or the IRQ core code.
+
+
+CPU Online/Offline
+------------------
+Per-CPU kernel threads are started/stopped upon receiving
+notifications of CPU hotplug activities. The intel_powerclamp driver
+keeps track of clamping kernel threads, even after they are migrated
+to other CPUs, after a CPU offline event.
+
+
+Performance Analysis
+====================
+This section describes the general performance data collected on
+multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).
+
+Effectiveness and Limitations
+-----------------------------
+The maximum range that idle injection is allowed is capped at 50
+percent. As mentioned earlier, since interrupts are allowed during
+forced idle time, excessive interrupts could result in less
+effectiveness. The extreme case would be doing a ping -f to generated
+flooded network interrupts without much CPU acknowledgement. In this
+case, little can be done from the idle injection threads. In most
+normal cases, such as scp a large file, applications can be throttled
+by the powerclamp driver, since slowing down the CPU also slows down
+network protocol processing, which in turn reduces interrupts.
+
+When control parameters change at runtime by the controlling CPU, it
+may take an additional period for the rest of the CPUs to catch up
+with the changes. During this time, idle injection is out of sync,
+thus not able to enter package C- states at the expected ratio. But
+this effect is minor, in that in most cases change to the target
+ratio is updated much less frequently than the idle injection
+frequency.
+
+Scalability
+-----------
+Tests also show a minor, but measurable, difference between the 4P/8P
+Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
+More compensation is needed on Westmere for the same amount of
+target idle ratio. The compensation also increases as the idle ratio
+gets larger. The above reason constitutes the need for the
+calibration code.
+
+On the IVB 8P system, compared to an offline CPU, powerclamp can
+achieve up to 40% better performance per watt. (measured by a spin
+counter summed over per CPU counting threads spawned for all running
+CPUs).
+
+Usage and Interfaces
+====================
+The powerclamp driver is registered to the generic thermal layer as a
+cooling device. Currently, it’s not bound to any thermal zones::
+
+ jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
+ cur_state:0
+ max_state:50
+ type:intel_powerclamp
+
+cur_state allows user to set the desired idle percentage. Writing 0 to
+cur_state will stop idle injection. Writing a value between 1 and
+max_state will start the idle injection. Reading cur_state returns the
+actual and current idle percentage. This may not be the same value
+set by the user in that current idle percentage depends on workload
+and includes natural idle. When idle injection is disabled, reading
+cur_state returns value -1 instead of 0 which is to avoid confusing
+100% busy state with the disabled state.
+
+Example usage:
+- To inject 25% idle time::
+
+ $ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state
+
+If the system is not busy and has more than 25% idle time already,
+then the powerclamp driver will not start idle injection. Using Top
+will not show idle injection kernel threads.
+
+If the system is busy (spin test below) and has less than 25% natural
+idle time, powerclamp kernel threads will do idle injection. Forced
+idle time is accounted as normal idle in that common code path is
+taken as the idle task.
+
+In this example, 24.1% idle is shown. This helps the system admin or
+user determine the cause of slowdown, when a powerclamp driver is in action::
+
+
+ Tasks: 197 total, 1 running, 196 sleeping, 0 stopped, 0 zombie
+ Cpu(s): 71.2%us, 4.7%sy, 0.0%ni, 24.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
+ Mem: 3943228k total, 1689632k used, 2253596k free, 74960k buffers
+ Swap: 4087804k total, 0k used, 4087804k free, 945336k cached
+
+ PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
+ 3352 jacob 20 0 262m 644 428 S 286 0.0 0:17.16 spin
+ 3341 root -51 0 0 0 0 D 25 0.0 0:01.62 kidle_inject/0
+ 3344 root -51 0 0 0 0 D 25 0.0 0:01.60 kidle_inject/3
+ 3342 root -51 0 0 0 0 D 25 0.0 0:01.61 kidle_inject/1
+ 3343 root -51 0 0 0 0 D 25 0.0 0:01.60 kidle_inject/2
+ 2935 jacob 20 0 696m 125m 35m S 5 3.3 0:31.11 firefox
+ 1546 root 20 0 158m 20m 6640 S 3 0.5 0:26.97 Xorg
+ 2100 jacob 20 0 1223m 88m 30m S 3 2.3 0:23.68 compiz
+
+Tests have shown that by using the powerclamp driver as a cooling
+device, a PID based userspace thermal controller can manage to
+control CPU temperature effectively, when no other thermal influence
+is added. For example, a UltraBook user can compile the kernel under
+certain temperature (below most active trip points).
+++ /dev/null
- =======================
- INTEL POWERCLAMP DRIVER
- =======================
-By: Arjan van de Ven <arjan@linux.intel.com>
- Jacob Pan <jacob.jun.pan@linux.intel.com>
-
-Contents:
- (*) Introduction
- - Goals and Objectives
-
- (*) Theory of Operation
- - Idle Injection
- - Calibration
-
- (*) Performance Analysis
- - Effectiveness and Limitations
- - Power vs Performance
- - Scalability
- - Calibration
- - Comparison with Alternative Techniques
-
- (*) Usage and Interfaces
- - Generic Thermal Layer (sysfs)
- - Kernel APIs (TBD)
-
-============
-INTRODUCTION
-============
-
-Consider the situation where a system’s power consumption must be
-reduced at runtime, due to power budget, thermal constraint, or noise
-level, and where active cooling is not preferred. Software managed
-passive power reduction must be performed to prevent the hardware
-actions that are designed for catastrophic scenarios.
-
-Currently, P-states, T-states (clock modulation), and CPU offlining
-are used for CPU throttling.
-
-On Intel CPUs, C-states provide effective power reduction, but so far
-they’re only used opportunistically, based on workload. With the
-development of intel_powerclamp driver, the method of synchronizing
-idle injection across all online CPU threads was introduced. The goal
-is to achieve forced and controllable C-state residency.
-
-Test/Analysis has been made in the areas of power, performance,
-scalability, and user experience. In many cases, clear advantage is
-shown over taking the CPU offline or modulating the CPU clock.
-
-
-===================
-THEORY OF OPERATION
-===================
-
-Idle Injection
---------------
-
-On modern Intel processors (Nehalem or later), package level C-state
-residency is available in MSRs, thus also available to the kernel.
-
-These MSRs are:
- #define MSR_PKG_C2_RESIDENCY 0x60D
- #define MSR_PKG_C3_RESIDENCY 0x3F8
- #define MSR_PKG_C6_RESIDENCY 0x3F9
- #define MSR_PKG_C7_RESIDENCY 0x3FA
-
-If the kernel can also inject idle time to the system, then a
-closed-loop control system can be established that manages package
-level C-state. The intel_powerclamp driver is conceived as such a
-control system, where the target set point is a user-selected idle
-ratio (based on power reduction), and the error is the difference
-between the actual package level C-state residency ratio and the target idle
-ratio.
-
-Injection is controlled by high priority kernel threads, spawned for
-each online CPU.
-
-These kernel threads, with SCHED_FIFO class, are created to perform
-clamping actions of controlled duty ratio and duration. Each per-CPU
-thread synchronizes its idle time and duration, based on the rounding
-of jiffies, so accumulated errors can be prevented to avoid a jittery
-effect. Threads are also bound to the CPU such that they cannot be
-migrated, unless the CPU is taken offline. In this case, threads
-belong to the offlined CPUs will be terminated immediately.
-
-Running as SCHED_FIFO and relatively high priority, also allows such
-scheme to work for both preemptable and non-preemptable kernels.
-Alignment of idle time around jiffies ensures scalability for HZ
-values. This effect can be better visualized using a Perf timechart.
-The following diagram shows the behavior of kernel thread
-kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
-for a given "duration", then relinquishes the CPU to other tasks,
-until the next time interval.
-
-The NOHZ schedule tick is disabled during idle time, but interrupts
-are not masked. Tests show that the extra wakeups from scheduler tick
-have a dramatic impact on the effectiveness of the powerclamp driver
-on large scale systems (Westmere system with 80 processors).
-
-CPU0
- ____________ ____________
-kidle_inject/0 | sleep | mwait | sleep |
- _________| |________| |_______
- duration
-CPU1
- ____________ ____________
-kidle_inject/1 | sleep | mwait | sleep |
- _________| |________| |_______
- ^
- |
- |
- roundup(jiffies, interval)
-
-Only one CPU is allowed to collect statistics and update global
-control parameters. This CPU is referred to as the controlling CPU in
-this document. The controlling CPU is elected at runtime, with a
-policy that favors BSP, taking into account the possibility of a CPU
-hot-plug.
-
-In terms of dynamics of the idle control system, package level idle
-time is considered largely as a non-causal system where its behavior
-cannot be based on the past or current input. Therefore, the
-intel_powerclamp driver attempts to enforce the desired idle time
-instantly as given input (target idle ratio). After injection,
-powerclamp monitors the actual idle for a given time window and adjust
-the next injection accordingly to avoid over/under correction.
-
-When used in a causal control system, such as a temperature control,
-it is up to the user of this driver to implement algorithms where
-past samples and outputs are included in the feedback. For example, a
-PID-based thermal controller can use the powerclamp driver to
-maintain a desired target temperature, based on integral and
-derivative gains of the past samples.
-
-
-
-Calibration
------------
-During scalability testing, it is observed that synchronized actions
-among CPUs become challenging as the number of cores grows. This is
-also true for the ability of a system to enter package level C-states.
-
-To make sure the intel_powerclamp driver scales well, online
-calibration is implemented. The goals for doing such a calibration
-are:
-
-a) determine the effective range of idle injection ratio
-b) determine the amount of compensation needed at each target ratio
-
-Compensation to each target ratio consists of two parts:
-
- a) steady state error compensation
- This is to offset the error occurring when the system can
- enter idle without extra wakeups (such as external interrupts).
-
- b) dynamic error compensation
- When an excessive amount of wakeups occurs during idle, an
- additional idle ratio can be added to quiet interrupts, by
- slowing down CPU activities.
-
-A debugfs file is provided for the user to examine compensation
-progress and results, such as on a Westmere system.
-[jacob@nex01 ~]$ cat
-/sys/kernel/debug/intel_powerclamp/powerclamp_calib
-controlling cpu: 0
-pct confidence steady dynamic (compensation)
-0 0 0 0
-1 1 0 0
-2 1 1 0
-3 3 1 0
-4 3 1 0
-5 3 1 0
-6 3 1 0
-7 3 1 0
-8 3 1 0
-...
-30 3 2 0
-31 3 2 0
-32 3 1 0
-33 3 2 0
-34 3 1 0
-35 3 2 0
-36 3 1 0
-37 3 2 0
-38 3 1 0
-39 3 2 0
-40 3 3 0
-41 3 1 0
-42 3 2 0
-43 3 1 0
-44 3 1 0
-45 3 2 0
-46 3 3 0
-47 3 0 0
-48 3 2 0
-49 3 3 0
-
-Calibration occurs during runtime. No offline method is available.
-Steady state compensation is used only when confidence levels of all
-adjacent ratios have reached satisfactory level. A confidence level
-is accumulated based on clean data collected at runtime. Data
-collected during a period without extra interrupts is considered
-clean.
-
-To compensate for excessive amounts of wakeup during idle, additional
-idle time is injected when such a condition is detected. Currently,
-we have a simple algorithm to double the injection ratio. A possible
-enhancement might be to throttle the offending IRQ, such as delaying
-EOI for level triggered interrupts. But it is a challenge to be
-non-intrusive to the scheduler or the IRQ core code.
-
-
-CPU Online/Offline
-------------------
-Per-CPU kernel threads are started/stopped upon receiving
-notifications of CPU hotplug activities. The intel_powerclamp driver
-keeps track of clamping kernel threads, even after they are migrated
-to other CPUs, after a CPU offline event.
-
-
-=====================
-Performance Analysis
-=====================
-This section describes the general performance data collected on
-multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).
-
-Effectiveness and Limitations
------------------------------
-The maximum range that idle injection is allowed is capped at 50
-percent. As mentioned earlier, since interrupts are allowed during
-forced idle time, excessive interrupts could result in less
-effectiveness. The extreme case would be doing a ping -f to generated
-flooded network interrupts without much CPU acknowledgement. In this
-case, little can be done from the idle injection threads. In most
-normal cases, such as scp a large file, applications can be throttled
-by the powerclamp driver, since slowing down the CPU also slows down
-network protocol processing, which in turn reduces interrupts.
-
-When control parameters change at runtime by the controlling CPU, it
-may take an additional period for the rest of the CPUs to catch up
-with the changes. During this time, idle injection is out of sync,
-thus not able to enter package C- states at the expected ratio. But
-this effect is minor, in that in most cases change to the target
-ratio is updated much less frequently than the idle injection
-frequency.
-
-Scalability
------------
-Tests also show a minor, but measurable, difference between the 4P/8P
-Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
-More compensation is needed on Westmere for the same amount of
-target idle ratio. The compensation also increases as the idle ratio
-gets larger. The above reason constitutes the need for the
-calibration code.
-
-On the IVB 8P system, compared to an offline CPU, powerclamp can
-achieve up to 40% better performance per watt. (measured by a spin
-counter summed over per CPU counting threads spawned for all running
-CPUs).
-
-====================
-Usage and Interfaces
-====================
-The powerclamp driver is registered to the generic thermal layer as a
-cooling device. Currently, it’s not bound to any thermal zones.
-
-jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
-cur_state:0
-max_state:50
-type:intel_powerclamp
-
-cur_state allows user to set the desired idle percentage. Writing 0 to
-cur_state will stop idle injection. Writing a value between 1 and
-max_state will start the idle injection. Reading cur_state returns the
-actual and current idle percentage. This may not be the same value
-set by the user in that current idle percentage depends on workload
-and includes natural idle. When idle injection is disabled, reading
-cur_state returns value -1 instead of 0 which is to avoid confusing
-100% busy state with the disabled state.
-
-Example usage:
-- To inject 25% idle time
-$ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state
-"
-
-If the system is not busy and has more than 25% idle time already,
-then the powerclamp driver will not start idle injection. Using Top
-will not show idle injection kernel threads.
-
-If the system is busy (spin test below) and has less than 25% natural
-idle time, powerclamp kernel threads will do idle injection. Forced
-idle time is accounted as normal idle in that common code path is
-taken as the idle task.
-
-In this example, 24.1% idle is shown. This helps the system admin or
-user determine the cause of slowdown, when a powerclamp driver is in action.
-
-
-Tasks: 197 total, 1 running, 196 sleeping, 0 stopped, 0 zombie
-Cpu(s): 71.2%us, 4.7%sy, 0.0%ni, 24.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
-Mem: 3943228k total, 1689632k used, 2253596k free, 74960k buffers
-Swap: 4087804k total, 0k used, 4087804k free, 945336k cached
-
- PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
- 3352 jacob 20 0 262m 644 428 S 286 0.0 0:17.16 spin
- 3341 root -51 0 0 0 0 D 25 0.0 0:01.62 kidle_inject/0
- 3344 root -51 0 0 0 0 D 25 0.0 0:01.60 kidle_inject/3
- 3342 root -51 0 0 0 0 D 25 0.0 0:01.61 kidle_inject/1
- 3343 root -51 0 0 0 0 D 25 0.0 0:01.60 kidle_inject/2
- 2935 jacob 20 0 696m 125m 35m S 5 3.3 0:31.11 firefox
- 1546 root 20 0 158m 20m 6640 S 3 0.5 0:26.97 Xorg
- 2100 jacob 20 0 1223m 88m 30m S 3 2.3 0:23.68 compiz
-
-Tests have shown that by using the powerclamp driver as a cooling
-device, a PID based userspace thermal controller can manage to
-control CPU temperature effectively, when no other thermal influence
-is added. For example, a UltraBook user can compile the kernel under
-certain temperature (below most active trip points).
+++ /dev/null
-Kernel driver nouveau
-===================
-
-Supported chips:
-* NV43+
-
-Authors: Martin Peres (mupuf) <martin.peres@free.fr>
-
-Description
----------
-
-This driver allows to read the GPU core temperature, drive the GPU fan and
-set temperature alarms.
-
-Currently, due to the absence of in-kernel API to access HWMON drivers, Nouveau
-cannot access any of the i2c external monitoring chips it may find. If you
-have one of those, temperature and/or fan management through Nouveau's HWMON
-interface is likely not to work. This document may then not cover your situation
-entirely.
-
-Temperature management
---------------------
-
-Temperature is exposed under as a read-only HWMON attribute temp1_input.
-
-In order to protect the GPU from overheating, Nouveau supports 4 configurable
-temperature thresholds:
-
- * Fan_boost: Fan speed is set to 100% when reaching this temperature;
- * Downclock: The GPU will be downclocked to reduce its power dissipation;
- * Critical: The GPU is put on hold to further lower power dissipation;
- * Shutdown: Shut the computer down to protect your GPU.
-
-WARNING: Some of these thresholds may not be used by Nouveau depending
-on your chipset.
-
-The default value for these thresholds comes from the GPU's vbios. These
-thresholds can be configured thanks to the following HWMON attributes:
-
- * Fan_boost: temp1_auto_point1_temp and temp1_auto_point1_temp_hyst;
- * Downclock: temp1_max and temp1_max_hyst;
- * Critical: temp1_crit and temp1_crit_hyst;
- * Shutdown: temp1_emergency and temp1_emergency_hyst.
-
-NOTE: Remember that the values are stored as milli degrees Celsius. Don't forget
-to multiply!
-
-Fan management
-------------
-
-Not all cards have a drivable fan. If you do, then the following HWMON
-attributes should be available:
-
- * pwm1_enable: Current fan management mode (NONE, MANUAL or AUTO);
- * pwm1: Current PWM value (power percentage);
- * pwm1_min: The minimum PWM speed allowed;
- * pwm1_max: The maximum PWM speed allowed (bypassed when hitting Fan_boost);
-
-You may also have the following attribute:
-
- * fan1_input: Speed in RPM of your fan.
-
-Your fan can be driven in different modes:
-
- * 0: The fan is left untouched;
- * 1: The fan can be driven in manual (use pwm1 to change the speed);
- * 2; The fan is driven automatically depending on the temperature.
-
-NOTE: Be sure to use the manual mode if you want to drive the fan speed manually
-
-NOTE2: When operating in manual mode outside the vbios-defined
-[PWM_min, PWM_max] range, the reported fan speed (RPM) may not be accurate
-depending on your hardware.
-
-Bug reports
----------
-
-Thermal management on Nouveau is new and may not work on all cards. If you have
-inquiries, please ping mupuf on IRC (#nouveau, freenode).
-
-Bug reports should be filled on Freedesktop's bug tracker. Please follow
-http://nouveau.freedesktop.org/wiki/Bugs
--- /dev/null
+=====================
+Kernel driver nouveau
+=====================
+
+Supported chips:
+
+* NV43+
+
+Authors: Martin Peres (mupuf) <martin.peres@free.fr>
+
+Description
+-----------
+
+This driver allows to read the GPU core temperature, drive the GPU fan and
+set temperature alarms.
+
+Currently, due to the absence of in-kernel API to access HWMON drivers, Nouveau
+cannot access any of the i2c external monitoring chips it may find. If you
+have one of those, temperature and/or fan management through Nouveau's HWMON
+interface is likely not to work. This document may then not cover your situation
+entirely.
+
+Temperature management
+----------------------
+
+Temperature is exposed under as a read-only HWMON attribute temp1_input.
+
+In order to protect the GPU from overheating, Nouveau supports 4 configurable
+temperature thresholds:
+
+ * Fan_boost:
+ Fan speed is set to 100% when reaching this temperature;
+ * Downclock:
+ The GPU will be downclocked to reduce its power dissipation;
+ * Critical:
+ The GPU is put on hold to further lower power dissipation;
+ * Shutdown:
+ Shut the computer down to protect your GPU.
+
+WARNING:
+ Some of these thresholds may not be used by Nouveau depending
+ on your chipset.
+
+The default value for these thresholds comes from the GPU's vbios. These
+thresholds can be configured thanks to the following HWMON attributes:
+
+ * Fan_boost: temp1_auto_point1_temp and temp1_auto_point1_temp_hyst;
+ * Downclock: temp1_max and temp1_max_hyst;
+ * Critical: temp1_crit and temp1_crit_hyst;
+ * Shutdown: temp1_emergency and temp1_emergency_hyst.
+
+NOTE: Remember that the values are stored as milli degrees Celsius. Don't forget
+to multiply!
+
+Fan management
+--------------
+
+Not all cards have a drivable fan. If you do, then the following HWMON
+attributes should be available:
+
+ * pwm1_enable:
+ Current fan management mode (NONE, MANUAL or AUTO);
+ * pwm1:
+ Current PWM value (power percentage);
+ * pwm1_min:
+ The minimum PWM speed allowed;
+ * pwm1_max:
+ The maximum PWM speed allowed (bypassed when hitting Fan_boost);
+
+You may also have the following attribute:
+
+ * fan1_input:
+ Speed in RPM of your fan.
+
+Your fan can be driven in different modes:
+
+ * 0: The fan is left untouched;
+ * 1: The fan can be driven in manual (use pwm1 to change the speed);
+ * 2; The fan is driven automatically depending on the temperature.
+
+NOTE:
+ Be sure to use the manual mode if you want to drive the fan speed manually
+
+NOTE2:
+ When operating in manual mode outside the vbios-defined
+ [PWM_min, PWM_max] range, the reported fan speed (RPM) may not be accurate
+ depending on your hardware.
+
+Bug reports
+-----------
+
+Thermal management on Nouveau is new and may not work on all cards. If you have
+inquiries, please ping mupuf on IRC (#nouveau, freenode).
+
+Bug reports should be filled on Freedesktop's bug tracker. Please follow
+http://nouveau.freedesktop.org/wiki/Bugs
--- /dev/null
+=================================
+Power allocator governor tunables
+=================================
+
+Trip points
+-----------
+
+The governor works optimally with the following two passive trip points:
+
+1. "switch on" trip point: temperature above which the governor
+ control loop starts operating. This is the first passive trip
+ point of the thermal zone.
+
+2. "desired temperature" trip point: it should be higher than the
+ "switch on" trip point. This the target temperature the governor
+ is controlling for. This is the last passive trip point of the
+ thermal zone.
+
+PID Controller
+--------------
+
+The power allocator governor implements a
+Proportional-Integral-Derivative controller (PID controller) with
+temperature as the control input and power as the controlled output:
+
+ P_max = k_p * e + k_i * err_integral + k_d * diff_err + sustainable_power
+
+where
+ - e = desired_temperature - current_temperature
+ - err_integral is the sum of previous errors
+ - diff_err = e - previous_error
+
+It is similar to the one depicted below::
+
+ k_d
+ |
+ current_temp |
+ | v
+ | +----------+ +---+
+ | +----->| diff_err |-->| X |------+
+ | | +----------+ +---+ |
+ | | | tdp actor
+ | | k_i | | get_requested_power()
+ | | | | | | |
+ | | | | | | | ...
+ v | v v v v v
+ +---+ | +-------+ +---+ +---+ +---+ +----------+
+ | S |-----+----->| sum e |----->| X |--->| S |-->| S |-->|power |
+ +---+ | +-------+ +---+ +---+ +---+ |allocation|
+ ^ | ^ +----------+
+ | | | | |
+ | | +---+ | | |
+ | +------->| X |-------------------+ v v
+ | +---+ granted performance
+ desired_temperature ^
+ |
+ |
+ k_po/k_pu
+
+Sustainable power
+-----------------
+
+An estimate of the sustainable dissipatable power (in mW) should be
+provided while registering the thermal zone. This estimates the
+sustained power that can be dissipated at the desired control
+temperature. This is the maximum sustained power for allocation at
+the desired maximum temperature. The actual sustained power can vary
+for a number of reasons. The closed loop controller will take care of
+variations such as environmental conditions, and some factors related
+to the speed-grade of the silicon. `sustainable_power` is therefore
+simply an estimate, and may be tuned to affect the aggressiveness of
+the thermal ramp. For reference, the sustainable power of a 4" phone
+is typically 2000mW, while on a 10" tablet is around 4500mW (may vary
+depending on screen size).
+
+If you are using device tree, do add it as a property of the
+thermal-zone. For example::
+
+ thermal-zones {
+ soc_thermal {
+ polling-delay = <1000>;
+ polling-delay-passive = <100>;
+ sustainable-power = <2500>;
+ ...
+
+Instead, if the thermal zone is registered from the platform code, pass a
+`thermal_zone_params` that has a `sustainable_power`. If no
+`thermal_zone_params` were being passed, then something like below
+will suffice::
+
+ static const struct thermal_zone_params tz_params = {
+ .sustainable_power = 3500,
+ };
+
+and then pass `tz_params` as the 5th parameter to
+`thermal_zone_device_register()`
+
+k_po and k_pu
+-------------
+
+The implementation of the PID controller in the power allocator
+thermal governor allows the configuration of two proportional term
+constants: `k_po` and `k_pu`. `k_po` is the proportional term
+constant during temperature overshoot periods (current temperature is
+above "desired temperature" trip point). Conversely, `k_pu` is the
+proportional term constant during temperature undershoot periods
+(current temperature below "desired temperature" trip point).
+
+These controls are intended as the primary mechanism for configuring
+the permitted thermal "ramp" of the system. For instance, a lower
+`k_pu` value will provide a slower ramp, at the cost of capping
+available capacity at a low temperature. On the other hand, a high
+value of `k_pu` will result in the governor granting very high power
+while temperature is low, and may lead to temperature overshooting.
+
+The default value for `k_pu` is::
+
+ 2 * sustainable_power / (desired_temperature - switch_on_temp)
+
+This means that at `switch_on_temp` the output of the controller's
+proportional term will be 2 * `sustainable_power`. The default value
+for `k_po` is::
+
+ sustainable_power / (desired_temperature - switch_on_temp)
+
+Focusing on the proportional and feed forward values of the PID
+controller equation we have::
+
+ P_max = k_p * e + sustainable_power
+
+The proportional term is proportional to the difference between the
+desired temperature and the current one. When the current temperature
+is the desired one, then the proportional component is zero and
+`P_max` = `sustainable_power`. That is, the system should operate in
+thermal equilibrium under constant load. `sustainable_power` is only
+an estimate, which is the reason for closed-loop control such as this.
+
+Expanding `k_pu` we get::
+
+ P_max = 2 * sustainable_power * (T_set - T) / (T_set - T_on) +
+ sustainable_power
+
+where:
+
+ - T_set is the desired temperature
+ - T is the current temperature
+ - T_on is the switch on temperature
+
+When the current temperature is the switch_on temperature, the above
+formula becomes::
+
+ P_max = 2 * sustainable_power * (T_set - T_on) / (T_set - T_on) +
+ sustainable_power = 2 * sustainable_power + sustainable_power =
+ 3 * sustainable_power
+
+Therefore, the proportional term alone linearly decreases power from
+3 * `sustainable_power` to `sustainable_power` as the temperature
+rises from the switch on temperature to the desired temperature.
+
+k_i and integral_cutoff
+-----------------------
+
+`k_i` configures the PID loop's integral term constant. This term
+allows the PID controller to compensate for long term drift and for
+the quantized nature of the output control: cooling devices can't set
+the exact power that the governor requests. When the temperature
+error is below `integral_cutoff`, errors are accumulated in the
+integral term. This term is then multiplied by `k_i` and the result
+added to the output of the controller. Typically `k_i` is set low (1
+or 2) and `integral_cutoff` is 0.
+
+k_d
+---
+
+`k_d` configures the PID loop's derivative term constant. It's
+recommended to leave it as the default: 0.
+
+Cooling device power API
+========================
+
+Cooling devices controlled by this governor must supply the additional
+"power" API in their `cooling_device_ops`. It consists on three ops:
+
+1. ::
+
+ int get_requested_power(struct thermal_cooling_device *cdev,
+ struct thermal_zone_device *tz, u32 *power);
+
+
+@cdev:
+ The `struct thermal_cooling_device` pointer
+@tz:
+ thermal zone in which we are currently operating
+@power:
+ pointer in which to store the calculated power
+
+`get_requested_power()` calculates the power requested by the device
+in milliwatts and stores it in @power . It should return 0 on
+success, -E* on failure. This is currently used by the power
+allocator governor to calculate how much power to give to each cooling
+device.
+
+2. ::
+
+ int state2power(struct thermal_cooling_device *cdev, struct
+ thermal_zone_device *tz, unsigned long state,
+ u32 *power);
+
+@cdev:
+ The `struct thermal_cooling_device` pointer
+@tz:
+ thermal zone in which we are currently operating
+@state:
+ A cooling device state
+@power:
+ pointer in which to store the equivalent power
+
+Convert cooling device state @state into power consumption in
+milliwatts and store it in @power. It should return 0 on success, -E*
+on failure. This is currently used by thermal core to calculate the
+maximum power that an actor can consume.
+
+3. ::
+
+ int power2state(struct thermal_cooling_device *cdev, u32 power,
+ unsigned long *state);
+
+@cdev:
+ The `struct thermal_cooling_device` pointer
+@power:
+ power in milliwatts
+@state:
+ pointer in which to store the resulting state
+
+Calculate a cooling device state that would make the device consume at
+most @power mW and store it in @state. It should return 0 on success,
+-E* on failure. This is currently used by the thermal core to convert
+a given power set by the power allocator governor to a state that the
+cooling device can set. It is a function because this conversion may
+depend on external factors that may change so this function should the
+best conversion given "current circumstances".
+
+Cooling device weights
+----------------------
+
+Weights are a mechanism to bias the allocation among cooling
+devices. They express the relative power efficiency of different
+cooling devices. Higher weight can be used to express higher power
+efficiency. Weighting is relative such that if each cooling device
+has a weight of one they are considered equal. This is particularly
+useful in heterogeneous systems where two cooling devices may perform
+the same kind of compute, but with different efficiency. For example,
+a system with two different types of processors.
+
+If the thermal zone is registered using
+`thermal_zone_device_register()` (i.e., platform code), then weights
+are passed as part of the thermal zone's `thermal_bind_parameters`.
+If the platform is registered using device tree, then they are passed
+as the `contribution` property of each map in the `cooling-maps` node.
+
+Limitations of the power allocator governor
+===========================================
+
+The power allocator governor's PID controller works best if there is a
+periodic tick. If you have a driver that calls
+`thermal_zone_device_update()` (or anything that ends up calling the
+governor's `throttle()` function) repetitively, the governor response
+won't be very good. Note that this is not particular to this
+governor, step-wise will also misbehave if you call its throttle()
+faster than the normal thermal framework tick (due to interrupts for
+example) as it will overreact.
+++ /dev/null
-Power allocator governor tunables
-=================================
-
-Trip points
------------
-
-The governor works optimally with the following two passive trip points:
-
-1. "switch on" trip point: temperature above which the governor
- control loop starts operating. This is the first passive trip
- point of the thermal zone.
-
-2. "desired temperature" trip point: it should be higher than the
- "switch on" trip point. This the target temperature the governor
- is controlling for. This is the last passive trip point of the
- thermal zone.
-
-PID Controller
---------------
-
-The power allocator governor implements a
-Proportional-Integral-Derivative controller (PID controller) with
-temperature as the control input and power as the controlled output:
-
- P_max = k_p * e + k_i * err_integral + k_d * diff_err + sustainable_power
-
-where
- e = desired_temperature - current_temperature
- err_integral is the sum of previous errors
- diff_err = e - previous_error
-
-It is similar to the one depicted below:
-
- k_d
- |
-current_temp |
- | v
- | +----------+ +---+
- | +----->| diff_err |-->| X |------+
- | | +----------+ +---+ |
- | | | tdp actor
- | | k_i | | get_requested_power()
- | | | | | | |
- | | | | | | | ...
- v | v v v v v
- +---+ | +-------+ +---+ +---+ +---+ +----------+
- | S |-------+----->| sum e |----->| X |--->| S |-->| S |-->|power |
- +---+ | +-------+ +---+ +---+ +---+ |allocation|
- ^ | ^ +----------+
- | | | | |
- | | +---+ | | |
- | +------->| X |-------------------+ v v
- | +---+ granted performance
-desired_temperature ^
- |
- |
- k_po/k_pu
-
-Sustainable power
------------------
-
-An estimate of the sustainable dissipatable power (in mW) should be
-provided while registering the thermal zone. This estimates the
-sustained power that can be dissipated at the desired control
-temperature. This is the maximum sustained power for allocation at
-the desired maximum temperature. The actual sustained power can vary
-for a number of reasons. The closed loop controller will take care of
-variations such as environmental conditions, and some factors related
-to the speed-grade of the silicon. `sustainable_power` is therefore
-simply an estimate, and may be tuned to affect the aggressiveness of
-the thermal ramp. For reference, the sustainable power of a 4" phone
-is typically 2000mW, while on a 10" tablet is around 4500mW (may vary
-depending on screen size).
-
-If you are using device tree, do add it as a property of the
-thermal-zone. For example:
-
- thermal-zones {
- soc_thermal {
- polling-delay = <1000>;
- polling-delay-passive = <100>;
- sustainable-power = <2500>;
- ...
-
-Instead, if the thermal zone is registered from the platform code, pass a
-`thermal_zone_params` that has a `sustainable_power`. If no
-`thermal_zone_params` were being passed, then something like below
-will suffice:
-
- static const struct thermal_zone_params tz_params = {
- .sustainable_power = 3500,
- };
-
-and then pass `tz_params` as the 5th parameter to
-`thermal_zone_device_register()`
-
-k_po and k_pu
--------------
-
-The implementation of the PID controller in the power allocator
-thermal governor allows the configuration of two proportional term
-constants: `k_po` and `k_pu`. `k_po` is the proportional term
-constant during temperature overshoot periods (current temperature is
-above "desired temperature" trip point). Conversely, `k_pu` is the
-proportional term constant during temperature undershoot periods
-(current temperature below "desired temperature" trip point).
-
-These controls are intended as the primary mechanism for configuring
-the permitted thermal "ramp" of the system. For instance, a lower
-`k_pu` value will provide a slower ramp, at the cost of capping
-available capacity at a low temperature. On the other hand, a high
-value of `k_pu` will result in the governor granting very high power
-while temperature is low, and may lead to temperature overshooting.
-
-The default value for `k_pu` is:
-
- 2 * sustainable_power / (desired_temperature - switch_on_temp)
-
-This means that at `switch_on_temp` the output of the controller's
-proportional term will be 2 * `sustainable_power`. The default value
-for `k_po` is:
-
- sustainable_power / (desired_temperature - switch_on_temp)
-
-Focusing on the proportional and feed forward values of the PID
-controller equation we have:
-
- P_max = k_p * e + sustainable_power
-
-The proportional term is proportional to the difference between the
-desired temperature and the current one. When the current temperature
-is the desired one, then the proportional component is zero and
-`P_max` = `sustainable_power`. That is, the system should operate in
-thermal equilibrium under constant load. `sustainable_power` is only
-an estimate, which is the reason for closed-loop control such as this.
-
-Expanding `k_pu` we get:
- P_max = 2 * sustainable_power * (T_set - T) / (T_set - T_on) +
- sustainable_power
-
-where
- T_set is the desired temperature
- T is the current temperature
- T_on is the switch on temperature
-
-When the current temperature is the switch_on temperature, the above
-formula becomes:
-
- P_max = 2 * sustainable_power * (T_set - T_on) / (T_set - T_on) +
- sustainable_power = 2 * sustainable_power + sustainable_power =
- 3 * sustainable_power
-
-Therefore, the proportional term alone linearly decreases power from
-3 * `sustainable_power` to `sustainable_power` as the temperature
-rises from the switch on temperature to the desired temperature.
-
-k_i and integral_cutoff
------------------------
-
-`k_i` configures the PID loop's integral term constant. This term
-allows the PID controller to compensate for long term drift and for
-the quantized nature of the output control: cooling devices can't set
-the exact power that the governor requests. When the temperature
-error is below `integral_cutoff`, errors are accumulated in the
-integral term. This term is then multiplied by `k_i` and the result
-added to the output of the controller. Typically `k_i` is set low (1
-or 2) and `integral_cutoff` is 0.
-
-k_d
----
-
-`k_d` configures the PID loop's derivative term constant. It's
-recommended to leave it as the default: 0.
-
-Cooling device power API
-========================
-
-Cooling devices controlled by this governor must supply the additional
-"power" API in their `cooling_device_ops`. It consists on three ops:
-
-1. int get_requested_power(struct thermal_cooling_device *cdev,
- struct thermal_zone_device *tz, u32 *power);
-@cdev: The `struct thermal_cooling_device` pointer
-@tz: thermal zone in which we are currently operating
-@power: pointer in which to store the calculated power
-
-`get_requested_power()` calculates the power requested by the device
-in milliwatts and stores it in @power . It should return 0 on
-success, -E* on failure. This is currently used by the power
-allocator governor to calculate how much power to give to each cooling
-device.
-
-2. int state2power(struct thermal_cooling_device *cdev, struct
- thermal_zone_device *tz, unsigned long state, u32 *power);
-@cdev: The `struct thermal_cooling_device` pointer
-@tz: thermal zone in which we are currently operating
-@state: A cooling device state
-@power: pointer in which to store the equivalent power
-
-Convert cooling device state @state into power consumption in
-milliwatts and store it in @power. It should return 0 on success, -E*
-on failure. This is currently used by thermal core to calculate the
-maximum power that an actor can consume.
-
-3. int power2state(struct thermal_cooling_device *cdev, u32 power,
- unsigned long *state);
-@cdev: The `struct thermal_cooling_device` pointer
-@power: power in milliwatts
-@state: pointer in which to store the resulting state
-
-Calculate a cooling device state that would make the device consume at
-most @power mW and store it in @state. It should return 0 on success,
--E* on failure. This is currently used by the thermal core to convert
-a given power set by the power allocator governor to a state that the
-cooling device can set. It is a function because this conversion may
-depend on external factors that may change so this function should the
-best conversion given "current circumstances".
-
-Cooling device weights
-----------------------
-
-Weights are a mechanism to bias the allocation among cooling
-devices. They express the relative power efficiency of different
-cooling devices. Higher weight can be used to express higher power
-efficiency. Weighting is relative such that if each cooling device
-has a weight of one they are considered equal. This is particularly
-useful in heterogeneous systems where two cooling devices may perform
-the same kind of compute, but with different efficiency. For example,
-a system with two different types of processors.
-
-If the thermal zone is registered using
-`thermal_zone_device_register()` (i.e., platform code), then weights
-are passed as part of the thermal zone's `thermal_bind_parameters`.
-If the platform is registered using device tree, then they are passed
-as the `contribution` property of each map in the `cooling-maps` node.
-
-Limitations of the power allocator governor
-===========================================
-
-The power allocator governor's PID controller works best if there is a
-periodic tick. If you have a driver that calls
-`thermal_zone_device_update()` (or anything that ends up calling the
-governor's `throttle()` function) repetitively, the governor response
-won't be very good. Note that this is not particular to this
-governor, step-wise will also misbehave if you call its throttle()
-faster than the normal thermal framework tick (due to interrupts for
-example) as it will overreact.
--- /dev/null
+===================================
+Generic Thermal Sysfs driver How To
+===================================
+
+Written by Sujith Thomas <sujith.thomas@intel.com>, Zhang Rui <rui.zhang@intel.com>
+
+Updated: 2 January 2008
+
+Copyright (c) 2008 Intel Corporation
+
+
+0. Introduction
+===============
+
+The generic thermal sysfs provides a set of interfaces for thermal zone
+devices (sensors) and thermal cooling devices (fan, processor...) to register
+with the thermal management solution and to be a part of it.
+
+This how-to focuses on enabling new thermal zone and cooling devices to
+participate in thermal management.
+This solution is platform independent and any type of thermal zone devices
+and cooling devices should be able to make use of the infrastructure.
+
+The main task of the thermal sysfs driver is to expose thermal zone attributes
+as well as cooling device attributes to the user space.
+An intelligent thermal management application can make decisions based on
+inputs from thermal zone attributes (the current temperature and trip point
+temperature) and throttle appropriate devices.
+
+- `[0-*]` denotes any positive number starting from 0
+- `[1-*]` denotes any positive number starting from 1
+
+1. thermal sysfs driver interface functions
+===========================================
+
+1.1 thermal zone device interface
+---------------------------------
+
+ ::
+
+ struct thermal_zone_device
+ *thermal_zone_device_register(char *type,
+ int trips, int mask, void *devdata,
+ struct thermal_zone_device_ops *ops,
+ const struct thermal_zone_params *tzp,
+ int passive_delay, int polling_delay))
+
+ This interface function adds a new thermal zone device (sensor) to
+ /sys/class/thermal folder as `thermal_zone[0-*]`. It tries to bind all the
+ thermal cooling devices registered at the same time.
+
+ type:
+ the thermal zone type.
+ trips:
+ the total number of trip points this thermal zone supports.
+ mask:
+ Bit string: If 'n'th bit is set, then trip point 'n' is writeable.
+ devdata:
+ device private data
+ ops:
+ thermal zone device call-backs.
+
+ .bind:
+ bind the thermal zone device with a thermal cooling device.
+ .unbind:
+ unbind the thermal zone device with a thermal cooling device.
+ .get_temp:
+ get the current temperature of the thermal zone.
+ .set_trips:
+ set the trip points window. Whenever the current temperature
+ is updated, the trip points immediately below and above the
+ current temperature are found.
+ .get_mode:
+ get the current mode (enabled/disabled) of the thermal zone.
+
+ - "enabled" means the kernel thermal management is
+ enabled.
+ - "disabled" will prevent kernel thermal driver action
+ upon trip points so that user applications can take
+ charge of thermal management.
+ .set_mode:
+ set the mode (enabled/disabled) of the thermal zone.
+ .get_trip_type:
+ get the type of certain trip point.
+ .get_trip_temp:
+ get the temperature above which the certain trip point
+ will be fired.
+ .set_emul_temp:
+ set the emulation temperature which helps in debugging
+ different threshold temperature points.
+ tzp:
+ thermal zone platform parameters.
+ passive_delay:
+ number of milliseconds to wait between polls when
+ performing passive cooling.
+ polling_delay:
+ number of milliseconds to wait between polls when checking
+ whether trip points have been crossed (0 for interrupt driven systems).
+
+ ::
+
+ void thermal_zone_device_unregister(struct thermal_zone_device *tz)
+
+ This interface function removes the thermal zone device.
+ It deletes the corresponding entry from /sys/class/thermal folder and
+ unbinds all the thermal cooling devices it uses.
+
+ ::
+
+ struct thermal_zone_device
+ *thermal_zone_of_sensor_register(struct device *dev, int sensor_id,
+ void *data,
+ const struct thermal_zone_of_device_ops *ops)
+
+ This interface adds a new sensor to a DT thermal zone.
+ This function will search the list of thermal zones described in
+ device tree and look for the zone that refer to the sensor device
+ pointed by dev->of_node as temperature providers. For the zone
+ pointing to the sensor node, the sensor will be added to the DT
+ thermal zone device.
+
+ The parameters for this interface are:
+
+ dev:
+ Device node of sensor containing valid node pointer in
+ dev->of_node.
+ sensor_id:
+ a sensor identifier, in case the sensor IP has more
+ than one sensors
+ data:
+ a private pointer (owned by the caller) that will be
+ passed back, when a temperature reading is needed.
+ ops:
+ `struct thermal_zone_of_device_ops *`.
+
+ ============== =======================================
+ get_temp a pointer to a function that reads the
+ sensor temperature. This is mandatory
+ callback provided by sensor driver.
+ set_trips a pointer to a function that sets a
+ temperature window. When this window is
+ left the driver must inform the thermal
+ core via thermal_zone_device_update.
+ get_trend a pointer to a function that reads the
+ sensor temperature trend.
+ set_emul_temp a pointer to a function that sets
+ sensor emulated temperature.
+ ============== =======================================
+
+ The thermal zone temperature is provided by the get_temp() function
+ pointer of thermal_zone_of_device_ops. When called, it will
+ have the private pointer @data back.
+
+ It returns error pointer if fails otherwise valid thermal zone device
+ handle. Caller should check the return handle with IS_ERR() for finding
+ whether success or not.
+
+ ::
+
+ void thermal_zone_of_sensor_unregister(struct device *dev,
+ struct thermal_zone_device *tzd)
+
+ This interface unregisters a sensor from a DT thermal zone which was
+ successfully added by interface thermal_zone_of_sensor_register().
+ This function removes the sensor callbacks and private data from the
+ thermal zone device registered with thermal_zone_of_sensor_register()
+ interface. It will also silent the zone by remove the .get_temp() and
+ get_trend() thermal zone device callbacks.
+
+ ::
+
+ struct thermal_zone_device
+ *devm_thermal_zone_of_sensor_register(struct device *dev,
+ int sensor_id,
+ void *data,
+ const struct thermal_zone_of_device_ops *ops)
+
+ This interface is resource managed version of
+ thermal_zone_of_sensor_register().
+
+ All details of thermal_zone_of_sensor_register() described in
+ section 1.1.3 is applicable here.
+
+ The benefit of using this interface to register sensor is that it
+ is not require to explicitly call thermal_zone_of_sensor_unregister()
+ in error path or during driver unbinding as this is done by driver
+ resource manager.
+
+ ::
+
+ void devm_thermal_zone_of_sensor_unregister(struct device *dev,
+ struct thermal_zone_device *tzd)
+
+ This interface is resource managed version of
+ thermal_zone_of_sensor_unregister().
+ All details of thermal_zone_of_sensor_unregister() described in
+ section 1.1.4 is applicable here.
+ Normally this function will not need to be called and the resource
+ management code will ensure that the resource is freed.
+
+ ::
+
+ int thermal_zone_get_slope(struct thermal_zone_device *tz)
+
+ This interface is used to read the slope attribute value
+ for the thermal zone device, which might be useful for platform
+ drivers for temperature calculations.
+
+ ::
+
+ int thermal_zone_get_offset(struct thermal_zone_device *tz)
+
+ This interface is used to read the offset attribute value
+ for the thermal zone device, which might be useful for platform
+ drivers for temperature calculations.
+
+1.2 thermal cooling device interface
+------------------------------------
+
+
+ ::
+
+ struct thermal_cooling_device
+ *thermal_cooling_device_register(char *name,
+ void *devdata, struct thermal_cooling_device_ops *)
+
+ This interface function adds a new thermal cooling device (fan/processor/...)
+ to /sys/class/thermal/ folder as `cooling_device[0-*]`. It tries to bind itself
+ to all the thermal zone devices registered at the same time.
+
+ name:
+ the cooling device name.
+ devdata:
+ device private data.
+ ops:
+ thermal cooling devices call-backs.
+
+ .get_max_state:
+ get the Maximum throttle state of the cooling device.
+ .get_cur_state:
+ get the Currently requested throttle state of the
+ cooling device.
+ .set_cur_state:
+ set the Current throttle state of the cooling device.
+
+ ::
+
+ void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev)
+
+ This interface function removes the thermal cooling device.
+ It deletes the corresponding entry from /sys/class/thermal folder and
+ unbinds itself from all the thermal zone devices using it.
+
+1.3 interface for binding a thermal zone device with a thermal cooling device
+-----------------------------------------------------------------------------
+
+ ::
+
+ int thermal_zone_bind_cooling_device(struct thermal_zone_device *tz,
+ int trip, struct thermal_cooling_device *cdev,
+ unsigned long upper, unsigned long lower, unsigned int weight);
+
+ This interface function binds a thermal cooling device to a particular trip
+ point of a thermal zone device.
+
+ This function is usually called in the thermal zone device .bind callback.
+
+ tz:
+ the thermal zone device
+ cdev:
+ thermal cooling device
+ trip:
+ indicates which trip point in this thermal zone the cooling device
+ is associated with.
+ upper:
+ the Maximum cooling state for this trip point.
+ THERMAL_NO_LIMIT means no upper limit,
+ and the cooling device can be in max_state.
+ lower:
+ the Minimum cooling state can be used for this trip point.
+ THERMAL_NO_LIMIT means no lower limit,
+ and the cooling device can be in cooling state 0.
+ weight:
+ the influence of this cooling device in this thermal
+ zone. See 1.4.1 below for more information.
+
+ ::
+
+ int thermal_zone_unbind_cooling_device(struct thermal_zone_device *tz,
+ int trip, struct thermal_cooling_device *cdev);
+
+ This interface function unbinds a thermal cooling device from a particular
+ trip point of a thermal zone device. This function is usually called in
+ the thermal zone device .unbind callback.
+
+ tz:
+ the thermal zone device
+ cdev:
+ thermal cooling device
+ trip:
+ indicates which trip point in this thermal zone the cooling device
+ is associated with.
+
+1.4 Thermal Zone Parameters
+---------------------------
+
+ ::
+
+ struct thermal_bind_params
+
+ This structure defines the following parameters that are used to bind
+ a zone with a cooling device for a particular trip point.
+
+ .cdev:
+ The cooling device pointer
+ .weight:
+ The 'influence' of a particular cooling device on this
+ zone. This is relative to the rest of the cooling
+ devices. For example, if all cooling devices have a
+ weight of 1, then they all contribute the same. You can
+ use percentages if you want, but it's not mandatory. A
+ weight of 0 means that this cooling device doesn't
+ contribute to the cooling of this zone unless all cooling
+ devices have a weight of 0. If all weights are 0, then
+ they all contribute the same.
+ .trip_mask:
+ This is a bit mask that gives the binding relation between
+ this thermal zone and cdev, for a particular trip point.
+ If nth bit is set, then the cdev and thermal zone are bound
+ for trip point n.
+ .binding_limits:
+ This is an array of cooling state limits. Must have
+ exactly 2 * thermal_zone.number_of_trip_points. It is an
+ array consisting of tuples <lower-state upper-state> of
+ state limits. Each trip will be associated with one state
+ limit tuple when binding. A NULL pointer means
+ <THERMAL_NO_LIMITS THERMAL_NO_LIMITS> on all trips.
+ These limits are used when binding a cdev to a trip point.
+ .match:
+ This call back returns success(0) if the 'tz and cdev' need to
+ be bound, as per platform data.
+
+ ::
+
+ struct thermal_zone_params
+
+ This structure defines the platform level parameters for a thermal zone.
+ This data, for each thermal zone should come from the platform layer.
+ This is an optional feature where some platforms can choose not to
+ provide this data.
+
+ .governor_name:
+ Name of the thermal governor used for this zone
+ .no_hwmon:
+ a boolean to indicate if the thermal to hwmon sysfs interface
+ is required. when no_hwmon == false, a hwmon sysfs interface
+ will be created. when no_hwmon == true, nothing will be done.
+ In case the thermal_zone_params is NULL, the hwmon interface
+ will be created (for backward compatibility).
+ .num_tbps:
+ Number of thermal_bind_params entries for this zone
+ .tbp:
+ thermal_bind_params entries
+
+2. sysfs attributes structure
+=============================
+
+== ================
+RO read only value
+WO write only value
+RW read/write value
+== ================
+
+Thermal sysfs attributes will be represented under /sys/class/thermal.
+Hwmon sysfs I/F extension is also available under /sys/class/hwmon
+if hwmon is compiled in or built as a module.
+
+Thermal zone device sys I/F, created once it's registered::
+
+ /sys/class/thermal/thermal_zone[0-*]:
+ |---type: Type of the thermal zone
+ |---temp: Current temperature
+ |---mode: Working mode of the thermal zone
+ |---policy: Thermal governor used for this zone
+ |---available_policies: Available thermal governors for this zone
+ |---trip_point_[0-*]_temp: Trip point temperature
+ |---trip_point_[0-*]_type: Trip point type
+ |---trip_point_[0-*]_hyst: Hysteresis value for this trip point
+ |---emul_temp: Emulated temperature set node
+ |---sustainable_power: Sustainable dissipatable power
+ |---k_po: Proportional term during temperature overshoot
+ |---k_pu: Proportional term during temperature undershoot
+ |---k_i: PID's integral term in the power allocator gov
+ |---k_d: PID's derivative term in the power allocator
+ |---integral_cutoff: Offset above which errors are accumulated
+ |---slope: Slope constant applied as linear extrapolation
+ |---offset: Offset constant applied as linear extrapolation
+
+Thermal cooling device sys I/F, created once it's registered::
+
+ /sys/class/thermal/cooling_device[0-*]:
+ |---type: Type of the cooling device(processor/fan/...)
+ |---max_state: Maximum cooling state of the cooling device
+ |---cur_state: Current cooling state of the cooling device
+ |---stats: Directory containing cooling device's statistics
+ |---stats/reset: Writing any value resets the statistics
+ |---stats/time_in_state_ms: Time (msec) spent in various cooling states
+ |---stats/total_trans: Total number of times cooling state is changed
+ |---stats/trans_table: Cooing state transition table
+
+
+Then next two dynamic attributes are created/removed in pairs. They represent
+the relationship between a thermal zone and its associated cooling device.
+They are created/removed for each successful execution of
+thermal_zone_bind_cooling_device/thermal_zone_unbind_cooling_device.
+
+::
+
+ /sys/class/thermal/thermal_zone[0-*]:
+ |---cdev[0-*]: [0-*]th cooling device in current thermal zone
+ |---cdev[0-*]_trip_point: Trip point that cdev[0-*] is associated with
+ |---cdev[0-*]_weight: Influence of the cooling device in
+ this thermal zone
+
+Besides the thermal zone device sysfs I/F and cooling device sysfs I/F,
+the generic thermal driver also creates a hwmon sysfs I/F for each _type_
+of thermal zone device. E.g. the generic thermal driver registers one hwmon
+class device and build the associated hwmon sysfs I/F for all the registered
+ACPI thermal zones.
+
+::
+
+ /sys/class/hwmon/hwmon[0-*]:
+ |---name: The type of the thermal zone devices
+ |---temp[1-*]_input: The current temperature of thermal zone [1-*]
+ |---temp[1-*]_critical: The critical trip point of thermal zone [1-*]
+
+Please read Documentation/hwmon/sysfs-interface.rst for additional information.
+
+Thermal zone attributes
+-----------------------
+
+type
+ Strings which represent the thermal zone type.
+ This is given by thermal zone driver as part of registration.
+ E.g: "acpitz" indicates it's an ACPI thermal device.
+ In order to keep it consistent with hwmon sys attribute; this should
+ be a short, lowercase string, not containing spaces nor dashes.
+ RO, Required
+
+temp
+ Current temperature as reported by thermal zone (sensor).
+ Unit: millidegree Celsius
+ RO, Required
+
+mode
+ One of the predefined values in [enabled, disabled].
+ This file gives information about the algorithm that is currently
+ managing the thermal zone. It can be either default kernel based
+ algorithm or user space application.
+
+ enabled
+ enable Kernel Thermal management.
+ disabled
+ Preventing kernel thermal zone driver actions upon
+ trip points so that user application can take full
+ charge of the thermal management.
+
+ RW, Optional
+
+policy
+ One of the various thermal governors used for a particular zone.
+
+ RW, Required
+
+available_policies
+ Available thermal governors which can be used for a particular zone.
+
+ RO, Required
+
+`trip_point_[0-*]_temp`
+ The temperature above which trip point will be fired.
+
+ Unit: millidegree Celsius
+
+ RO, Optional
+
+`trip_point_[0-*]_type`
+ Strings which indicate the type of the trip point.
+
+ E.g. it can be one of critical, hot, passive, `active[0-*]` for ACPI
+ thermal zone.
+
+ RO, Optional
+
+`trip_point_[0-*]_hyst`
+ The hysteresis value for a trip point, represented as an integer
+ Unit: Celsius
+ RW, Optional
+
+`cdev[0-*]`
+ Sysfs link to the thermal cooling device node where the sys I/F
+ for cooling device throttling control represents.
+
+ RO, Optional
+
+`cdev[0-*]_trip_point`
+ The trip point in this thermal zone which `cdev[0-*]` is associated
+ with; -1 means the cooling device is not associated with any trip
+ point.
+
+ RO, Optional
+
+`cdev[0-*]_weight`
+ The influence of `cdev[0-*]` in this thermal zone. This value
+ is relative to the rest of cooling devices in the thermal
+ zone. For example, if a cooling device has a weight double
+ than that of other, it's twice as effective in cooling the
+ thermal zone.
+
+ RW, Optional
+
+passive
+ Attribute is only present for zones in which the passive cooling
+ policy is not supported by native thermal driver. Default is zero
+ and can be set to a temperature (in millidegrees) to enable a
+ passive trip point for the zone. Activation is done by polling with
+ an interval of 1 second.
+
+ Unit: millidegrees Celsius
+
+ Valid values: 0 (disabled) or greater than 1000
+
+ RW, Optional
+
+emul_temp
+ Interface to set the emulated temperature method in thermal zone
+ (sensor). After setting this temperature, the thermal zone may pass
+ this temperature to platform emulation function if registered or
+ cache it locally. This is useful in debugging different temperature
+ threshold and its associated cooling action. This is write only node
+ and writing 0 on this node should disable emulation.
+ Unit: millidegree Celsius
+
+ WO, Optional
+
+ WARNING:
+ Be careful while enabling this option on production systems,
+ because userland can easily disable the thermal policy by simply
+ flooding this sysfs node with low temperature values.
+
+sustainable_power
+ An estimate of the sustained power that can be dissipated by
+ the thermal zone. Used by the power allocator governor. For
+ more information see Documentation/thermal/power_allocator.rst
+
+ Unit: milliwatts
+
+ RW, Optional
+
+k_po
+ The proportional term of the power allocator governor's PID
+ controller during temperature overshoot. Temperature overshoot
+ is when the current temperature is above the "desired
+ temperature" trip point. For more information see
+ Documentation/thermal/power_allocator.rst
+
+ RW, Optional
+
+k_pu
+ The proportional term of the power allocator governor's PID
+ controller during temperature undershoot. Temperature undershoot
+ is when the current temperature is below the "desired
+ temperature" trip point. For more information see
+ Documentation/thermal/power_allocator.rst
+
+ RW, Optional
+
+k_i
+ The integral term of the power allocator governor's PID
+ controller. This term allows the PID controller to compensate
+ for long term drift. For more information see
+ Documentation/thermal/power_allocator.rst
+
+ RW, Optional
+
+k_d
+ The derivative term of the power allocator governor's PID
+ controller. For more information see
+ Documentation/thermal/power_allocator.rst
+
+ RW, Optional
+
+integral_cutoff
+ Temperature offset from the desired temperature trip point
+ above which the integral term of the power allocator
+ governor's PID controller starts accumulating errors. For
+ example, if integral_cutoff is 0, then the integral term only
+ accumulates error when temperature is above the desired
+ temperature trip point. For more information see
+ Documentation/thermal/power_allocator.rst
+
+ Unit: millidegree Celsius
+
+ RW, Optional
+
+slope
+ The slope constant used in a linear extrapolation model
+ to determine a hotspot temperature based off the sensor's
+ raw readings. It is up to the device driver to determine
+ the usage of these values.
+
+ RW, Optional
+
+offset
+ The offset constant used in a linear extrapolation model
+ to determine a hotspot temperature based off the sensor's
+ raw readings. It is up to the device driver to determine
+ the usage of these values.
+
+ RW, Optional
+
+Cooling device attributes
+-------------------------
+
+type
+ String which represents the type of device, e.g:
+
+ - for generic ACPI: should be "Fan", "Processor" or "LCD"
+ - for memory controller device on intel_menlow platform:
+ should be "Memory controller".
+
+ RO, Required
+
+max_state
+ The maximum permissible cooling state of this cooling device.
+
+ RO, Required
+
+cur_state
+ The current cooling state of this cooling device.
+ The value can any integer numbers between 0 and max_state:
+
+ - cur_state == 0 means no cooling
+ - cur_state == max_state means the maximum cooling.
+
+ RW, Required
+
+stats/reset
+ Writing any value resets the cooling device's statistics.
+ WO, Required
+
+stats/time_in_state_ms:
+ The amount of time spent by the cooling device in various cooling
+ states. The output will have "<state> <time>" pair in each line, which
+ will mean this cooling device spent <time> msec of time at <state>.
+ Output will have one line for each of the supported states. usertime
+ units here is 10mS (similar to other time exported in /proc).
+ RO, Required
+
+
+stats/total_trans:
+ A single positive value showing the total number of times the state of a
+ cooling device is changed.
+
+ RO, Required
+
+stats/trans_table:
+ This gives fine grained information about all the cooling state
+ transitions. The cat output here is a two dimensional matrix, where an
+ entry <i,j> (row i, column j) represents the number of transitions from
+ State_i to State_j. If the transition table is bigger than PAGE_SIZE,
+ reading this will return an -EFBIG error.
+ RO, Required
+
+3. A simple implementation
+==========================
+
+ACPI thermal zone may support multiple trip points like critical, hot,
+passive, active. If an ACPI thermal zone supports critical, passive,
+active[0] and active[1] at the same time, it may register itself as a
+thermal_zone_device (thermal_zone1) with 4 trip points in all.
+It has one processor and one fan, which are both registered as
+thermal_cooling_device. Both are considered to have the same
+effectiveness in cooling the thermal zone.
+
+If the processor is listed in _PSL method, and the fan is listed in _AL0
+method, the sys I/F structure will be built like this::
+
+ /sys/class/thermal:
+ |thermal_zone1:
+ |---type: acpitz
+ |---temp: 37000
+ |---mode: enabled
+ |---policy: step_wise
+ |---available_policies: step_wise fair_share
+ |---trip_point_0_temp: 100000
+ |---trip_point_0_type: critical
+ |---trip_point_1_temp: 80000
+ |---trip_point_1_type: passive
+ |---trip_point_2_temp: 70000
+ |---trip_point_2_type: active0
+ |---trip_point_3_temp: 60000
+ |---trip_point_3_type: active1
+ |---cdev0: --->/sys/class/thermal/cooling_device0
+ |---cdev0_trip_point: 1 /* cdev0 can be used for passive */
+ |---cdev0_weight: 1024
+ |---cdev1: --->/sys/class/thermal/cooling_device3
+ |---cdev1_trip_point: 2 /* cdev1 can be used for active[0]*/
+ |---cdev1_weight: 1024
+
+ |cooling_device0:
+ |---type: Processor
+ |---max_state: 8
+ |---cur_state: 0
+
+ |cooling_device3:
+ |---type: Fan
+ |---max_state: 2
+ |---cur_state: 0
+
+ /sys/class/hwmon:
+ |hwmon0:
+ |---name: acpitz
+ |---temp1_input: 37000
+ |---temp1_crit: 100000
+
+4. Event Notification
+=====================
+
+The framework includes a simple notification mechanism, in the form of a
+netlink event. Netlink socket initialization is done during the _init_
+of the framework. Drivers which intend to use the notification mechanism
+just need to call thermal_generate_netlink_event() with two arguments viz
+(originator, event). The originator is a pointer to struct thermal_zone_device
+from where the event has been originated. An integer which represents the
+thermal zone device will be used in the message to identify the zone. The
+event will be one of:{THERMAL_AUX0, THERMAL_AUX1, THERMAL_CRITICAL,
+THERMAL_DEV_FAULT}. Notification can be sent when the current temperature
+crosses any of the configured thresholds.
+
+5. Export Symbol APIs
+=====================
+
+5.1. get_tz_trend
+-----------------
+
+This function returns the trend of a thermal zone, i.e the rate of change
+of temperature of the thermal zone. Ideally, the thermal sensor drivers
+are supposed to implement the callback. If they don't, the thermal
+framework calculated the trend by comparing the previous and the current
+temperature values.
+
+5.2. get_thermal_instance
+-------------------------
+
+This function returns the thermal_instance corresponding to a given
+{thermal_zone, cooling_device, trip_point} combination. Returns NULL
+if such an instance does not exist.
+
+5.3. thermal_notify_framework
+-----------------------------
+
+This function handles the trip events from sensor drivers. It starts
+throttling the cooling devices according to the policy configured.
+For CRITICAL and HOT trip points, this notifies the respective drivers,
+and does actual throttling for other trip points i.e ACTIVE and PASSIVE.
+The throttling policy is based on the configured platform data; if no
+platform data is provided, this uses the step_wise throttling policy.
+
+5.4. thermal_cdev_update
+------------------------
+
+This function serves as an arbitrator to set the state of a cooling
+device. It sets the cooling device to the deepest cooling state if
+possible.
+
+6. thermal_emergency_poweroff
+=============================
+
+On an event of critical trip temperature crossing. Thermal framework
+allows the system to shutdown gracefully by calling orderly_poweroff().
+In the event of a failure of orderly_poweroff() to shut down the system
+we are in danger of keeping the system alive at undesirably high
+temperatures. To mitigate this high risk scenario we program a work
+queue to fire after a pre-determined number of seconds to start
+an emergency shutdown of the device using the kernel_power_off()
+function. In case kernel_power_off() fails then finally
+emergency_restart() is called in the worst case.
+
+The delay should be carefully profiled so as to give adequate time for
+orderly_poweroff(). In case of failure of an orderly_poweroff() the
+emergency poweroff kicks in after the delay has elapsed and shuts down
+the system.
+
+If set to 0 emergency poweroff will not be supported. So a carefully
+profiled non-zero positive value is a must for emergerncy poweroff to be
+triggered.
+++ /dev/null
-Generic Thermal Sysfs driver How To
-===================================
-
-Written by Sujith Thomas <sujith.thomas@intel.com>, Zhang Rui <rui.zhang@intel.com>
-
-Updated: 2 January 2008
-
-Copyright (c) 2008 Intel Corporation
-
-
-0. Introduction
-
-The generic thermal sysfs provides a set of interfaces for thermal zone
-devices (sensors) and thermal cooling devices (fan, processor...) to register
-with the thermal management solution and to be a part of it.
-
-This how-to focuses on enabling new thermal zone and cooling devices to
-participate in thermal management.
-This solution is platform independent and any type of thermal zone devices
-and cooling devices should be able to make use of the infrastructure.
-
-The main task of the thermal sysfs driver is to expose thermal zone attributes
-as well as cooling device attributes to the user space.
-An intelligent thermal management application can make decisions based on
-inputs from thermal zone attributes (the current temperature and trip point
-temperature) and throttle appropriate devices.
-
-[0-*] denotes any positive number starting from 0
-[1-*] denotes any positive number starting from 1
-
-1. thermal sysfs driver interface functions
-
-1.1 thermal zone device interface
-1.1.1 struct thermal_zone_device *thermal_zone_device_register(char *type,
- int trips, int mask, void *devdata,
- struct thermal_zone_device_ops *ops,
- const struct thermal_zone_params *tzp,
- int passive_delay, int polling_delay))
-
- This interface function adds a new thermal zone device (sensor) to
- /sys/class/thermal folder as thermal_zone[0-*]. It tries to bind all the
- thermal cooling devices registered at the same time.
-
- type: the thermal zone type.
- trips: the total number of trip points this thermal zone supports.
- mask: Bit string: If 'n'th bit is set, then trip point 'n' is writeable.
- devdata: device private data
- ops: thermal zone device call-backs.
- .bind: bind the thermal zone device with a thermal cooling device.
- .unbind: unbind the thermal zone device with a thermal cooling device.
- .get_temp: get the current temperature of the thermal zone.
- .set_trips: set the trip points window. Whenever the current temperature
- is updated, the trip points immediately below and above the
- current temperature are found.
- .get_mode: get the current mode (enabled/disabled) of the thermal zone.
- - "enabled" means the kernel thermal management is enabled.
- - "disabled" will prevent kernel thermal driver action upon trip points
- so that user applications can take charge of thermal management.
- .set_mode: set the mode (enabled/disabled) of the thermal zone.
- .get_trip_type: get the type of certain trip point.
- .get_trip_temp: get the temperature above which the certain trip point
- will be fired.
- .set_emul_temp: set the emulation temperature which helps in debugging
- different threshold temperature points.
- tzp: thermal zone platform parameters.
- passive_delay: number of milliseconds to wait between polls when
- performing passive cooling.
- polling_delay: number of milliseconds to wait between polls when checking
- whether trip points have been crossed (0 for interrupt driven systems).
-
-
-1.1.2 void thermal_zone_device_unregister(struct thermal_zone_device *tz)
-
- This interface function removes the thermal zone device.
- It deletes the corresponding entry from /sys/class/thermal folder and
- unbinds all the thermal cooling devices it uses.
-
-1.1.3 struct thermal_zone_device *thermal_zone_of_sensor_register(
- struct device *dev, int sensor_id, void *data,
- const struct thermal_zone_of_device_ops *ops)
-
- This interface adds a new sensor to a DT thermal zone.
- This function will search the list of thermal zones described in
- device tree and look for the zone that refer to the sensor device
- pointed by dev->of_node as temperature providers. For the zone
- pointing to the sensor node, the sensor will be added to the DT
- thermal zone device.
-
- The parameters for this interface are:
- dev: Device node of sensor containing valid node pointer in
- dev->of_node.
- sensor_id: a sensor identifier, in case the sensor IP has more
- than one sensors
- data: a private pointer (owned by the caller) that will be
- passed back, when a temperature reading is needed.
- ops: struct thermal_zone_of_device_ops *.
-
- get_temp: a pointer to a function that reads the
- sensor temperature. This is mandatory
- callback provided by sensor driver.
- set_trips: a pointer to a function that sets a
- temperature window. When this window is
- left the driver must inform the thermal
- core via thermal_zone_device_update.
- get_trend: a pointer to a function that reads the
- sensor temperature trend.
- set_emul_temp: a pointer to a function that sets
- sensor emulated temperature.
- The thermal zone temperature is provided by the get_temp() function
- pointer of thermal_zone_of_device_ops. When called, it will
- have the private pointer @data back.
-
- It returns error pointer if fails otherwise valid thermal zone device
- handle. Caller should check the return handle with IS_ERR() for finding
- whether success or not.
-
-1.1.4 void thermal_zone_of_sensor_unregister(struct device *dev,
- struct thermal_zone_device *tzd)
-
- This interface unregisters a sensor from a DT thermal zone which was
- successfully added by interface thermal_zone_of_sensor_register().
- This function removes the sensor callbacks and private data from the
- thermal zone device registered with thermal_zone_of_sensor_register()
- interface. It will also silent the zone by remove the .get_temp() and
- get_trend() thermal zone device callbacks.
-
-1.1.5 struct thermal_zone_device *devm_thermal_zone_of_sensor_register(
- struct device *dev, int sensor_id,
- void *data, const struct thermal_zone_of_device_ops *ops)
-
- This interface is resource managed version of
- thermal_zone_of_sensor_register().
- All details of thermal_zone_of_sensor_register() described in
- section 1.1.3 is applicable here.
- The benefit of using this interface to register sensor is that it
- is not require to explicitly call thermal_zone_of_sensor_unregister()
- in error path or during driver unbinding as this is done by driver
- resource manager.
-
-1.1.6 void devm_thermal_zone_of_sensor_unregister(struct device *dev,
- struct thermal_zone_device *tzd)
-
- This interface is resource managed version of
- thermal_zone_of_sensor_unregister().
- All details of thermal_zone_of_sensor_unregister() described in
- section 1.1.4 is applicable here.
- Normally this function will not need to be called and the resource
- management code will ensure that the resource is freed.
-
-1.1.7 int thermal_zone_get_slope(struct thermal_zone_device *tz)
-
- This interface is used to read the slope attribute value
- for the thermal zone device, which might be useful for platform
- drivers for temperature calculations.
-
-1.1.8 int thermal_zone_get_offset(struct thermal_zone_device *tz)
-
- This interface is used to read the offset attribute value
- for the thermal zone device, which might be useful for platform
- drivers for temperature calculations.
-
-1.2 thermal cooling device interface
-1.2.1 struct thermal_cooling_device *thermal_cooling_device_register(char *name,
- void *devdata, struct thermal_cooling_device_ops *)
-
- This interface function adds a new thermal cooling device (fan/processor/...)
- to /sys/class/thermal/ folder as cooling_device[0-*]. It tries to bind itself
- to all the thermal zone devices registered at the same time.
- name: the cooling device name.
- devdata: device private data.
- ops: thermal cooling devices call-backs.
- .get_max_state: get the Maximum throttle state of the cooling device.
- .get_cur_state: get the Currently requested throttle state of the cooling device.
- .set_cur_state: set the Current throttle state of the cooling device.
-
-1.2.2 void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev)
-
- This interface function removes the thermal cooling device.
- It deletes the corresponding entry from /sys/class/thermal folder and
- unbinds itself from all the thermal zone devices using it.
-
-1.3 interface for binding a thermal zone device with a thermal cooling device
-1.3.1 int thermal_zone_bind_cooling_device(struct thermal_zone_device *tz,
- int trip, struct thermal_cooling_device *cdev,
- unsigned long upper, unsigned long lower, unsigned int weight);
-
- This interface function binds a thermal cooling device to a particular trip
- point of a thermal zone device.
- This function is usually called in the thermal zone device .bind callback.
- tz: the thermal zone device
- cdev: thermal cooling device
- trip: indicates which trip point in this thermal zone the cooling device
- is associated with.
- upper:the Maximum cooling state for this trip point.
- THERMAL_NO_LIMIT means no upper limit,
- and the cooling device can be in max_state.
- lower:the Minimum cooling state can be used for this trip point.
- THERMAL_NO_LIMIT means no lower limit,
- and the cooling device can be in cooling state 0.
- weight: the influence of this cooling device in this thermal
- zone. See 1.4.1 below for more information.
-
-1.3.2 int thermal_zone_unbind_cooling_device(struct thermal_zone_device *tz,
- int trip, struct thermal_cooling_device *cdev);
-
- This interface function unbinds a thermal cooling device from a particular
- trip point of a thermal zone device. This function is usually called in
- the thermal zone device .unbind callback.
- tz: the thermal zone device
- cdev: thermal cooling device
- trip: indicates which trip point in this thermal zone the cooling device
- is associated with.
-
-1.4 Thermal Zone Parameters
-1.4.1 struct thermal_bind_params
- This structure defines the following parameters that are used to bind
- a zone with a cooling device for a particular trip point.
- .cdev: The cooling device pointer
- .weight: The 'influence' of a particular cooling device on this
- zone. This is relative to the rest of the cooling
- devices. For example, if all cooling devices have a
- weight of 1, then they all contribute the same. You can
- use percentages if you want, but it's not mandatory. A
- weight of 0 means that this cooling device doesn't
- contribute to the cooling of this zone unless all cooling
- devices have a weight of 0. If all weights are 0, then
- they all contribute the same.
- .trip_mask:This is a bit mask that gives the binding relation between
- this thermal zone and cdev, for a particular trip point.
- If nth bit is set, then the cdev and thermal zone are bound
- for trip point n.
- .binding_limits: This is an array of cooling state limits. Must have
- exactly 2 * thermal_zone.number_of_trip_points. It is an
- array consisting of tuples <lower-state upper-state> of
- state limits. Each trip will be associated with one state
- limit tuple when binding. A NULL pointer means
- <THERMAL_NO_LIMITS THERMAL_NO_LIMITS> on all trips.
- These limits are used when binding a cdev to a trip point.
- .match: This call back returns success(0) if the 'tz and cdev' need to
- be bound, as per platform data.
-1.4.2 struct thermal_zone_params
- This structure defines the platform level parameters for a thermal zone.
- This data, for each thermal zone should come from the platform layer.
- This is an optional feature where some platforms can choose not to
- provide this data.
- .governor_name: Name of the thermal governor used for this zone
- .no_hwmon: a boolean to indicate if the thermal to hwmon sysfs interface
- is required. when no_hwmon == false, a hwmon sysfs interface
- will be created. when no_hwmon == true, nothing will be done.
- In case the thermal_zone_params is NULL, the hwmon interface
- will be created (for backward compatibility).
- .num_tbps: Number of thermal_bind_params entries for this zone
- .tbp: thermal_bind_params entries
-
-2. sysfs attributes structure
-
-RO read only value
-WO write only value
-RW read/write value
-
-Thermal sysfs attributes will be represented under /sys/class/thermal.
-Hwmon sysfs I/F extension is also available under /sys/class/hwmon
-if hwmon is compiled in or built as a module.
-
-Thermal zone device sys I/F, created once it's registered:
-/sys/class/thermal/thermal_zone[0-*]:
- |---type: Type of the thermal zone
- |---temp: Current temperature
- |---mode: Working mode of the thermal zone
- |---policy: Thermal governor used for this zone
- |---available_policies: Available thermal governors for this zone
- |---trip_point_[0-*]_temp: Trip point temperature
- |---trip_point_[0-*]_type: Trip point type
- |---trip_point_[0-*]_hyst: Hysteresis value for this trip point
- |---emul_temp: Emulated temperature set node
- |---sustainable_power: Sustainable dissipatable power
- |---k_po: Proportional term during temperature overshoot
- |---k_pu: Proportional term during temperature undershoot
- |---k_i: PID's integral term in the power allocator gov
- |---k_d: PID's derivative term in the power allocator
- |---integral_cutoff: Offset above which errors are accumulated
- |---slope: Slope constant applied as linear extrapolation
- |---offset: Offset constant applied as linear extrapolation
-
-Thermal cooling device sys I/F, created once it's registered:
-/sys/class/thermal/cooling_device[0-*]:
- |---type: Type of the cooling device(processor/fan/...)
- |---max_state: Maximum cooling state of the cooling device
- |---cur_state: Current cooling state of the cooling device
- |---stats: Directory containing cooling device's statistics
- |---stats/reset: Writing any value resets the statistics
- |---stats/time_in_state_ms: Time (msec) spent in various cooling states
- |---stats/total_trans: Total number of times cooling state is changed
- |---stats/trans_table: Cooing state transition table
-
-
-Then next two dynamic attributes are created/removed in pairs. They represent
-the relationship between a thermal zone and its associated cooling device.
-They are created/removed for each successful execution of
-thermal_zone_bind_cooling_device/thermal_zone_unbind_cooling_device.
-
-/sys/class/thermal/thermal_zone[0-*]:
- |---cdev[0-*]: [0-*]th cooling device in current thermal zone
- |---cdev[0-*]_trip_point: Trip point that cdev[0-*] is associated with
- |---cdev[0-*]_weight: Influence of the cooling device in
- this thermal zone
-
-Besides the thermal zone device sysfs I/F and cooling device sysfs I/F,
-the generic thermal driver also creates a hwmon sysfs I/F for each _type_
-of thermal zone device. E.g. the generic thermal driver registers one hwmon
-class device and build the associated hwmon sysfs I/F for all the registered
-ACPI thermal zones.
-
-/sys/class/hwmon/hwmon[0-*]:
- |---name: The type of the thermal zone devices
- |---temp[1-*]_input: The current temperature of thermal zone [1-*]
- |---temp[1-*]_critical: The critical trip point of thermal zone [1-*]
-
-Please read Documentation/hwmon/sysfs-interface.rst for additional information.
-
-***************************
-* Thermal zone attributes *
-***************************
-
-type
- Strings which represent the thermal zone type.
- This is given by thermal zone driver as part of registration.
- E.g: "acpitz" indicates it's an ACPI thermal device.
- In order to keep it consistent with hwmon sys attribute; this should
- be a short, lowercase string, not containing spaces nor dashes.
- RO, Required
-
-temp
- Current temperature as reported by thermal zone (sensor).
- Unit: millidegree Celsius
- RO, Required
-
-mode
- One of the predefined values in [enabled, disabled].
- This file gives information about the algorithm that is currently
- managing the thermal zone. It can be either default kernel based
- algorithm or user space application.
- enabled = enable Kernel Thermal management.
- disabled = Preventing kernel thermal zone driver actions upon
- trip points so that user application can take full
- charge of the thermal management.
- RW, Optional
-
-policy
- One of the various thermal governors used for a particular zone.
- RW, Required
-
-available_policies
- Available thermal governors which can be used for a particular zone.
- RO, Required
-
-trip_point_[0-*]_temp
- The temperature above which trip point will be fired.
- Unit: millidegree Celsius
- RO, Optional
-
-trip_point_[0-*]_type
- Strings which indicate the type of the trip point.
- E.g. it can be one of critical, hot, passive, active[0-*] for ACPI
- thermal zone.
- RO, Optional
-
-trip_point_[0-*]_hyst
- The hysteresis value for a trip point, represented as an integer
- Unit: Celsius
- RW, Optional
-
-cdev[0-*]
- Sysfs link to the thermal cooling device node where the sys I/F
- for cooling device throttling control represents.
- RO, Optional
-
-cdev[0-*]_trip_point
- The trip point in this thermal zone which cdev[0-*] is associated
- with; -1 means the cooling device is not associated with any trip
- point.
- RO, Optional
-
-cdev[0-*]_weight
- The influence of cdev[0-*] in this thermal zone. This value
- is relative to the rest of cooling devices in the thermal
- zone. For example, if a cooling device has a weight double
- than that of other, it's twice as effective in cooling the
- thermal zone.
- RW, Optional
-
-passive
- Attribute is only present for zones in which the passive cooling
- policy is not supported by native thermal driver. Default is zero
- and can be set to a temperature (in millidegrees) to enable a
- passive trip point for the zone. Activation is done by polling with
- an interval of 1 second.
- Unit: millidegrees Celsius
- Valid values: 0 (disabled) or greater than 1000
- RW, Optional
-
-emul_temp
- Interface to set the emulated temperature method in thermal zone
- (sensor). After setting this temperature, the thermal zone may pass
- this temperature to platform emulation function if registered or
- cache it locally. This is useful in debugging different temperature
- threshold and its associated cooling action. This is write only node
- and writing 0 on this node should disable emulation.
- Unit: millidegree Celsius
- WO, Optional
-
- WARNING: Be careful while enabling this option on production systems,
- because userland can easily disable the thermal policy by simply
- flooding this sysfs node with low temperature values.
-
-sustainable_power
- An estimate of the sustained power that can be dissipated by
- the thermal zone. Used by the power allocator governor. For
- more information see Documentation/thermal/power_allocator.txt
- Unit: milliwatts
- RW, Optional
-
-k_po
- The proportional term of the power allocator governor's PID
- controller during temperature overshoot. Temperature overshoot
- is when the current temperature is above the "desired
- temperature" trip point. For more information see
- Documentation/thermal/power_allocator.txt
- RW, Optional
-
-k_pu
- The proportional term of the power allocator governor's PID
- controller during temperature undershoot. Temperature undershoot
- is when the current temperature is below the "desired
- temperature" trip point. For more information see
- Documentation/thermal/power_allocator.txt
- RW, Optional
-
-k_i
- The integral term of the power allocator governor's PID
- controller. This term allows the PID controller to compensate
- for long term drift. For more information see
- Documentation/thermal/power_allocator.txt
- RW, Optional
-
-k_d
- The derivative term of the power allocator governor's PID
- controller. For more information see
- Documentation/thermal/power_allocator.txt
- RW, Optional
-
-integral_cutoff
- Temperature offset from the desired temperature trip point
- above which the integral term of the power allocator
- governor's PID controller starts accumulating errors. For
- example, if integral_cutoff is 0, then the integral term only
- accumulates error when temperature is above the desired
- temperature trip point. For more information see
- Documentation/thermal/power_allocator.txt
- Unit: millidegree Celsius
- RW, Optional
-
-slope
- The slope constant used in a linear extrapolation model
- to determine a hotspot temperature based off the sensor's
- raw readings. It is up to the device driver to determine
- the usage of these values.
- RW, Optional
-
-offset
- The offset constant used in a linear extrapolation model
- to determine a hotspot temperature based off the sensor's
- raw readings. It is up to the device driver to determine
- the usage of these values.
- RW, Optional
-
-*****************************
-* Cooling device attributes *
-*****************************
-
-type
- String which represents the type of device, e.g:
- - for generic ACPI: should be "Fan", "Processor" or "LCD"
- - for memory controller device on intel_menlow platform:
- should be "Memory controller".
- RO, Required
-
-max_state
- The maximum permissible cooling state of this cooling device.
- RO, Required
-
-cur_state
- The current cooling state of this cooling device.
- The value can any integer numbers between 0 and max_state:
- - cur_state == 0 means no cooling
- - cur_state == max_state means the maximum cooling.
- RW, Required
-
-stats/reset
- Writing any value resets the cooling device's statistics.
- WO, Required
-
-stats/time_in_state_ms:
- The amount of time spent by the cooling device in various cooling
- states. The output will have "<state> <time>" pair in each line, which
- will mean this cooling device spent <time> msec of time at <state>.
- Output will have one line for each of the supported states. usertime
- units here is 10mS (similar to other time exported in /proc).
- RO, Required
-
-stats/total_trans:
- A single positive value showing the total number of times the state of a
- cooling device is changed.
- RO, Required
-
-stats/trans_table:
- This gives fine grained information about all the cooling state
- transitions. The cat output here is a two dimensional matrix, where an
- entry <i,j> (row i, column j) represents the number of transitions from
- State_i to State_j. If the transition table is bigger than PAGE_SIZE,
- reading this will return an -EFBIG error.
- RO, Required
-
-3. A simple implementation
-
-ACPI thermal zone may support multiple trip points like critical, hot,
-passive, active. If an ACPI thermal zone supports critical, passive,
-active[0] and active[1] at the same time, it may register itself as a
-thermal_zone_device (thermal_zone1) with 4 trip points in all.
-It has one processor and one fan, which are both registered as
-thermal_cooling_device. Both are considered to have the same
-effectiveness in cooling the thermal zone.
-
-If the processor is listed in _PSL method, and the fan is listed in _AL0
-method, the sys I/F structure will be built like this:
-
-/sys/class/thermal:
-
-|thermal_zone1:
- |---type: acpitz
- |---temp: 37000
- |---mode: enabled
- |---policy: step_wise
- |---available_policies: step_wise fair_share
- |---trip_point_0_temp: 100000
- |---trip_point_0_type: critical
- |---trip_point_1_temp: 80000
- |---trip_point_1_type: passive
- |---trip_point_2_temp: 70000
- |---trip_point_2_type: active0
- |---trip_point_3_temp: 60000
- |---trip_point_3_type: active1
- |---cdev0: --->/sys/class/thermal/cooling_device0
- |---cdev0_trip_point: 1 /* cdev0 can be used for passive */
- |---cdev0_weight: 1024
- |---cdev1: --->/sys/class/thermal/cooling_device3
- |---cdev1_trip_point: 2 /* cdev1 can be used for active[0]*/
- |---cdev1_weight: 1024
-
-|cooling_device0:
- |---type: Processor
- |---max_state: 8
- |---cur_state: 0
-
-|cooling_device3:
- |---type: Fan
- |---max_state: 2
- |---cur_state: 0
-
-/sys/class/hwmon:
-
-|hwmon0:
- |---name: acpitz
- |---temp1_input: 37000
- |---temp1_crit: 100000
-
-4. Event Notification
-
-The framework includes a simple notification mechanism, in the form of a
-netlink event. Netlink socket initialization is done during the _init_
-of the framework. Drivers which intend to use the notification mechanism
-just need to call thermal_generate_netlink_event() with two arguments viz
-(originator, event). The originator is a pointer to struct thermal_zone_device
-from where the event has been originated. An integer which represents the
-thermal zone device will be used in the message to identify the zone. The
-event will be one of:{THERMAL_AUX0, THERMAL_AUX1, THERMAL_CRITICAL,
-THERMAL_DEV_FAULT}. Notification can be sent when the current temperature
-crosses any of the configured thresholds.
-
-5. Export Symbol APIs:
-
-5.1: get_tz_trend:
-This function returns the trend of a thermal zone, i.e the rate of change
-of temperature of the thermal zone. Ideally, the thermal sensor drivers
-are supposed to implement the callback. If they don't, the thermal
-framework calculated the trend by comparing the previous and the current
-temperature values.
-
-5.2:get_thermal_instance:
-This function returns the thermal_instance corresponding to a given
-{thermal_zone, cooling_device, trip_point} combination. Returns NULL
-if such an instance does not exist.
-
-5.3:thermal_notify_framework:
-This function handles the trip events from sensor drivers. It starts
-throttling the cooling devices according to the policy configured.
-For CRITICAL and HOT trip points, this notifies the respective drivers,
-and does actual throttling for other trip points i.e ACTIVE and PASSIVE.
-The throttling policy is based on the configured platform data; if no
-platform data is provided, this uses the step_wise throttling policy.
-
-5.4:thermal_cdev_update:
-This function serves as an arbitrator to set the state of a cooling
-device. It sets the cooling device to the deepest cooling state if
-possible.
-
-6. thermal_emergency_poweroff:
-
-On an event of critical trip temperature crossing. Thermal framework
-allows the system to shutdown gracefully by calling orderly_poweroff().
-In the event of a failure of orderly_poweroff() to shut down the system
-we are in danger of keeping the system alive at undesirably high
-temperatures. To mitigate this high risk scenario we program a work
-queue to fire after a pre-determined number of seconds to start
-an emergency shutdown of the device using the kernel_power_off()
-function. In case kernel_power_off() fails then finally
-emergency_restart() is called in the worst case.
-
-The delay should be carefully profiled so as to give adequate time for
-orderly_poweroff(). In case of failure of an orderly_poweroff() the
-emergency poweroff kicks in after the delay has elapsed and shuts down
-the system.
-
-If set to 0 emergency poweroff will not be supported. So a carefully
-profiled non-zero positive value is a must for emergerncy poweroff to be
-triggered.
+++ /dev/null
-Kernel driver: x86_pkg_temp_thermal
-===================
-
-Supported chips:
-* x86: with package level thermal management
-(Verify using: CPUID.06H:EAX[bit 6] =1)
-
-Authors: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
-
-Reference
----
-Intel® 64 and IA-32 Architectures Software Developer’s Manual (Jan, 2013):
-Chapter 14.6: PACKAGE LEVEL THERMAL MANAGEMENT
-
-Description
----------
-
-This driver register CPU digital temperature package level sensor as a thermal
-zone with maximum two user mode configurable trip points. Number of trip points
-depends on the capability of the package. Once the trip point is violated,
-user mode can receive notification via thermal notification mechanism and can
-take any action to control temperature.
-
-
-Threshold management
---------------------
-Each package will register as a thermal zone under /sys/class/thermal.
-Example:
-/sys/class/thermal/thermal_zone1
-
-This contains two trip points:
-- trip_point_0_temp
-- trip_point_1_temp
-
-User can set any temperature between 0 to TJ-Max temperature. Temperature units
-are in milli-degree Celsius. Refer to "Documentation/thermal/sysfs-api.txt" for
-thermal sys-fs details.
-
-Any value other than 0 in these trip points, can trigger thermal notifications.
-Setting 0, stops sending thermal notifications.
-
-Thermal notifications: To get kobject-uevent notifications, set the thermal zone
-policy to "user_space". For example: echo -n "user_space" > policy
-
-
-
-
--- /dev/null
+===================================
+Kernel driver: x86_pkg_temp_thermal
+===================================
+
+Supported chips:
+
+* x86: with package level thermal management
+
+(Verify using: CPUID.06H:EAX[bit 6] =1)
+
+Authors: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
+
+Reference
+---------
+
+Intel® 64 and IA-32 Architectures Software Developer’s Manual (Jan, 2013):
+Chapter 14.6: PACKAGE LEVEL THERMAL MANAGEMENT
+
+Description
+-----------
+
+This driver register CPU digital temperature package level sensor as a thermal
+zone with maximum two user mode configurable trip points. Number of trip points
+depends on the capability of the package. Once the trip point is violated,
+user mode can receive notification via thermal notification mechanism and can
+take any action to control temperature.
+
+
+Threshold management
+--------------------
+Each package will register as a thermal zone under /sys/class/thermal.
+
+Example::
+
+ /sys/class/thermal/thermal_zone1
+
+This contains two trip points:
+
+- trip_point_0_temp
+- trip_point_1_temp
+
+User can set any temperature between 0 to TJ-Max temperature. Temperature units
+are in milli-degree Celsius. Refer to "Documentation/thermal/sysfs-api.rst" for
+thermal sys-fs details.
+
+Any value other than 0 in these trip points, can trigger thermal notifications.
+Setting 0, stops sending thermal notifications.
+
+Thermal notifications:
+To get kobject-uevent notifications, set the thermal zone
+policy to "user_space".
+
+For example::
+
+ echo -n "user_space" > policy
M: Javi Merino <javi.merino@kernel.org>
L: linux-pm@vger.kernel.org
S: Supported
-F: Documentation/thermal/cpu-cooling-api.txt
+F: Documentation/thermal/cpu-cooling-api.rst
F: drivers/thermal/cpu_cooling.c
F: include/linux/cpu_cooling.h
* platform characterization. This value is relative to the
* rest of the weights so a cooling device whose weight is
* double that of another cooling device is twice as
- * effective. See Documentation/thermal/sysfs-api.txt for more
+ * effective. See Documentation/thermal/sysfs-api.rst for more
* information.
*/
int weight;
/*
* This is a bit mask that gives the binding relation between this
* thermal zone and cdev, for a particular trip point.
- * See Documentation/thermal/sysfs-api.txt for more information.
+ * See Documentation/thermal/sysfs-api.rst for more information.
*/
int trip_mask;