.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================================
The PCI Express Advanced Error Reporting Driver Guide HOWTO
===========================================================

:Authors: - T. Long Nguyen <tom.l.nguyen@intel.com>
          - Yanmin Zhang <yanmin.zhang@intel.com>

:Copyright: |copy| 2006 Intel Corporation

This guide describes the basics of the PCI Express (PCIe) Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of Endpoint devices to conform with
the PCIe AER driver.

What is the PCIe AER Driver?
----------------------------

PCIe error signaling can occur on the PCIe link itself
or on behalf of transactions initiated on the link. PCIe
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCIe components and provides a minimum defined
set of error reporting requirements. The Advanced Error Reporting
capability is implemented with a PCIe Advanced Error Reporting
extended capability structure, which provides more robust error reporting.

The PCIe AER driver provides the infrastructure to support the PCIe Advanced
Error Reporting capability. The PCIe AER driver provides three basic
functions:

- Gathers comprehensive error information when errors occur.
- Reports errors to the users.
- Performs error recovery actions.

The AER driver only attaches to Root Ports and RCECs that support the PCIe
AER capability.

Include the PCIe AER Root Driver into the Linux Kernel
------------------------------------------------------

The PCIe AER driver is a Root Port service driver attached
via the PCIe Port Bus driver. To use it, the driver
must be compiled. It is enabled with CONFIG_PCIEAER, which
depends on CONFIG_PCIEPORTBUS.
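
As a rough illustration of the dependency above, the following Python sketch
checks whether a configuration text enables both options. The helper names
parse_kconfig and aer_enabled and the SAMPLE text are hypothetical; on a real
system the text would be read from the kernel build's .config file.

```python
def parse_kconfig(text):
    """Return a dict of CONFIG_* options that are assigned a value."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments such as "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.startswith("CONFIG_"):
            options[name] = value
    return options


def aer_enabled(text):
    """True if CONFIG_PCIEPORTBUS and CONFIG_PCIEAER are both set to 'y'."""
    options = parse_kconfig(text)
    return all(options.get(name) == "y"
               for name in ("CONFIG_PCIEPORTBUS", "CONFIG_PCIEAER"))


# Hypothetical sample input, not taken from a real system.
SAMPLE = """\
CONFIG_PCIEPORTBUS=y
CONFIG_PCIEAER=y
# CONFIG_PCIEAER_INJECT is not set
"""

print(aer_enabled(SAMPLE))
```

The check only inspects the text it is given; it does not tell whether the
running kernel was actually built from that configuration.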

Load PCIe AER Root Driver
-------------------------

Some systems have AER support in firmware. Enabling Linux AER support at
the same time the firmware handles AER would result in unpredictable
behavior. Therefore, Linux does not handle AER events unless the firmware
grants AER control to the OS via the ACPI _OSC method. See the PCI Firmware
Specification for details regarding _OSC usage.

When a PCIe AER error is captured, an error message is output to the
console. A correctable error is output as an info message; any other
error is printed as an error. Users can therefore choose a log level
that filters out correctable error messages.
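
The filtering idea can be sketched in Python against records in the
/dev/kmsg layout, where the first comma-separated field before the ';' is the
syslog priority and its low three bits are the level. The helper names and
the sample records are hypothetical, and the message texts are placeholders
rather than real kernel output.

```python
LOG_ERR = 3   # syslog level for error messages
LOG_INFO = 6  # syslog level for info messages


def record_level(record):
    """Extract the syslog level from a /dev/kmsg-style record."""
    prefix, _, _ = record.partition(";")
    priority = int(prefix.split(",")[0])
    return priority & 7  # the low three bits hold the level


def filter_records(records, max_level):
    """Keep records whose level is at least as severe as max_level."""
    return [r for r in records if record_level(r) <= max_level]


# Hypothetical records: one info-level and one error-level message.
SAMPLE = [
    "6,100,1000,-;placeholder for a correctable AER message",
    "3,101,2000,-;placeholder for an uncorrectable AER message",
]

for record in filter_records(SAMPLE, LOG_ERR):
    print(record)
```

With max_level set to LOG_ERR only the second record is kept, which mirrors
choosing a log level that hides correctable errors.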

An example is shown below::

  0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
  0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
  0000:50:00.0: [20] Unsupported Request (First)
  0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100

In the example, 'Requester ID' means the ID of the device that sent
the error message to the Root Port. Please refer to the PCIe specifications
for the other fields.
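
To make the fields a little more concrete, here is a minimal Python sketch
that pulls the Requester ID and the status/mask words out of the first two
lines of the example and decodes them. It assumes the usual PCIe layout of a
Requester ID (bus in bits 15:8, device in bits 7:3, function in bits 2:0);
the helper names are hypothetical.

```python
import re

# The first two lines of the example output shown above.
LINE1 = ("0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), "
         "type=Transaction Layer, id=0500(Requester ID)")
LINE2 = "0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000"


def decode_requester_id(req_id):
    """Split a 16-bit Requester ID into (bus, device, function)."""
    bus = (req_id >> 8) & 0xFF
    device = (req_id >> 3) & 0x1F
    function = req_id & 0x07
    return bus, device, function


def set_bits(value):
    """Return the positions of all bits set in a 32-bit word."""
    return [bit for bit in range(32) if value & (1 << bit)]


req_id = int(re.search(r"id=([0-9a-fA-F]{4})", LINE1).group(1), 16)
match = re.search(r"status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})", LINE2)
status, mask = (int(word, 16) for word in match.groups())

print(decode_requester_id(req_id))
print(set_bits(status), set_bits(mask))
```

The only status bit set is bit 20, which corresponds to the
"[20] Unsupported Request" line of the example.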

AER Statistics / Counters
-------------------------

When PCIe AER errors are captured, counters and statistics are also exposed
as sysfs attributes, which are documented in
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats.
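
A minimal Python sketch for reading such counters follows. It assumes each
attribute contains one "name count" pair per line and that an attribute
named aer_dev_correctable exists under /sys/bus/pci/devices/<device>/; both
assumptions should be checked against the ABI document above. The SAMPLE
text is made up for illustration.

```python
from pathlib import Path


def parse_aer_stats(text):
    """Parse "name count" lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Counter names may contain spaces, so split on the last field only.
        name, count = line.rsplit(None, 1)
        counters[name] = int(count)
    return counters


def read_aer_stats(device, attribute="aer_dev_correctable"):
    """Read one statistics attribute of a device such as 0000:50:00.0."""
    path = Path("/sys/bus/pci/devices") / device / attribute
    return parse_aer_stats(path.read_text())


# Hypothetical attribute content, not captured from real hardware.
SAMPLE = """\
Receiver Error 2
Bad TLP 0
TOTAL_ERR_COR 2
"""

print(parse_aer_stats(SAMPLE))
```

Calling read_aer_stats() only works on a system whose device exposes the
attribute; the parser itself can be exercised on any text.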

To enable error recovery, a software driver must provide callbacks.

To support AER better, developers need to understand how AER works.

PCIe errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impact
of those errors, which may result in degraded performance or function
failure.

Correctable errors have no impact on the functionality of the
interface. The PCIe protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware.

Unlike correctable errors, uncorrectable
errors impact the functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCIe link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCIe link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.
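
As background that this guide does not spell out, the PCIe specification
lets software select per error bit whether an uncorrectable error is
reported as fatal or non-fatal, through the Uncorrectable Error Severity
register (a set bit means fatal). The Python sketch below applies that rule
to a status word; the function name and the sample values are hypothetical.

```python
def classify_uncorrectable(status, severity):
    """Split the set bits of an uncorrectable status word by severity.

    Returns (fatal_bits, nonfatal_bits). A bit counts as fatal when the
    same bit is set in the severity word.
    """
    fatal, nonfatal = [], []
    for bit in range(32):
        if status & (1 << bit):
            if severity & (1 << bit):
                fatal.append(bit)
            else:
                nonfatal.append(bit)
    return fatal, nonfatal


# Hypothetical values: bits 4 and 20 are pending, only bit 20 is fatal.
print(classify_uncorrectable(0x00100010, 0x00100000))
```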

When PCIe error reporting is enabled, a device automatically sends an
error message to the Root Port above it when it captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its AER
Capability structure. The logged error information includes
the error reporting agent's Requester ID, stored in the Error Source
Identification Registers, and the error bits set in the Root Error
Status Register accordingly. If AER error reporting is enabled in the Root
Error Command Register, the Root Port generates an interrupt when an
error is detected.
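
The two Root Port registers mentioned above can be decoded with a short
Python sketch. The bit positions follow the register layout in the PCIe
specification as recalled here (correctable source ID in bits 15:0 and
uncorrectable source ID in bits 31:16 of the Error Source Identification
register); treat them as assumptions to verify against the specification.
The names and sample values are hypothetical.

```python
# Low bits of the Root Error Status register (assumed layout).
ROOT_STATUS_BITS = {
    0: "ERR_COR received",
    1: "multiple ERR_COR received",
    2: "ERR_FATAL/NONFATAL received",
    3: "multiple ERR_FATAL/NONFATAL received",
    4: "first uncorrectable error is fatal",
    5: "non-fatal error message received",
    6: "fatal error message received",
}


def decode_root_status(value):
    """List the descriptions of all known bits set in the status word."""
    return [text for bit, text in sorted(ROOT_STATUS_BITS.items())
            if value & (1 << bit)]


def decode_error_source(value):
    """Split the Error Source Identification word into its two IDs."""
    return {
        "correctable": value & 0xFFFF,
        "uncorrectable": (value >> 16) & 0xFFFF,
    }


# Hypothetical register contents for a single fatal error from ID 0x0500.
print(decode_root_status(0x44))
print(decode_error_source(0x05000000))
```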

Note that the errors described above are related to the PCIe
hierarchy and links. These errors do not include any device-specific
errors because device-specific errors will still get sent directly to
the device driver.

callback reset_link to reset PCIe link
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This callback is used to reset the PCIe physical link when a
fatal error happens. The Root Port AER service driver provides a
default reset_link function, but different Upstream Ports might
have different specifications for resetting the PCIe link, so
Upstream Port drivers may provide their own reset_link functions.

Section 3.2.2.2 provides more detailed info on when to call
reset_link.

PCI error-recovery callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The PCIe AER Root driver uses error callbacks to coordinate
with downstream device drivers associated with the hierarchy in question
when performing error recovery actions.

struct pci_driver has a pointer, err_handler, that points to a
struct pci_error_handlers, which consists of several callback function
pointers. The AER driver follows the rules defined in
pci-error-recovery.rst except for the PCIe-specific parts (e.g.
reset_link). Please refer to pci-error-recovery.rst for detailed
definitions of the callbacks.

The sections below specify when to call the error callback functions.

Correctable errors have no impact on the functionality of
the interface. The PCIe protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.

Non-correctable (non-fatal and fatal) errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If an error message indicates a non-fatal error, a link reset
upstream is not required. The AER driver calls error_detected(dev,
pci_channel_io_normal) for all drivers associated with the hierarchy in
question. For example::

  Endpoint <==> Downstream Port B <==> Upstream Port A <==> Root Port

If Upstream Port A captures an AER error, the hierarchy consists of
Downstream Port B and Endpoint.
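
The notion of "hierarchy in question" can be modeled as everything below the
port that captured the error. The Python sketch below is a toy model of the
example topology only, not the kernel's actual bus walk; the TOPOLOGY table
and the function name are hypothetical.

```python
# Each device maps to the devices directly below it in the example.
TOPOLOGY = {
    "Root Port": ["Upstream Port A"],
    "Upstream Port A": ["Downstream Port B"],
    "Downstream Port B": ["Endpoint"],
    "Endpoint": [],
}


def hierarchy_below(port, topology):
    """Return every device downstream of port, nearest first."""
    found = []
    pending = list(topology.get(port, []))
    while pending:
        device = pending.pop(0)
        found.append(device)
        # Queue the children so deeper levels are visited afterwards.
        pending.extend(topology.get(device, []))
    return found


print(hierarchy_below("Upstream Port A", TOPOLOGY))
```

For Upstream Port A the model yields Downstream Port B and Endpoint, the
same set named in the text above.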

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover. If the driver can recover, the AER driver calls
mmio_enabled next.

If an error message indicates a fatal error, the kernel broadcasts
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. A link reset upstream is then
necessary. Because different kinds of devices might use different approaches
to reset the link, the AER port service driver is required to provide the
function that resets the link via the callback parameter of the
pcie_do_recovery() function. If reset_link is not NULL, the recovery function
uses it to reset the link. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER
and reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
to mmio_enabled.
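
The difference between the error classes described in this guide can be
condensed into a small Python model: which channel state is passed to
error_detected and whether a link reset is needed. This is a simplified
summary of the text, not the kernel's recovery code, and the function name
and severity strings are hypothetical.

```python
def recovery_plan(severity):
    """Return (channel_state, link_reset_needed) for an error severity.

    severity is one of "correctable", "nonfatal" or "fatal".
    """
    if severity == "correctable":
        # Corrected by hardware: no callbacks and no reset are required.
        return None, False
    if severity == "nonfatal":
        return "pci_channel_io_normal", False
    if severity == "fatal":
        return "pci_channel_io_frozen", True
    raise ValueError("unknown severity: %r" % (severity,))


for name in ("correctable", "nonfatal", "fatal"):
    print(name, recovery_plan(name))
```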

Frequently Asked Questions
--------------------------

Q:
  What happens if a PCIe device driver does not provide an
  error recovery handler (pci_driver->err_handler is equal to NULL)?

A:
  The devices attached to the driver won't be recovered. If the
  error is fatal, the kernel will print out warning messages. Please refer
  to section 3 for more information.

Q:
  What happens if an upstream port service driver does not provide
  callback reset_link?

A:
  Fatal error recovery will fail if the errors are reported by the
  upstream ports to which the service driver is attached.

Software error injection
========================

Debugging PCIe AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software-based error
injection can be used to fake various kinds of PCIe errors.

First, enable PCIe AER software error injection in the kernel
configuration; that is, the following item should be in your .config.

CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m

After rebooting with the new kernel or inserting the module, a device file
named /dev/aer_inject should be created.
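
Whether the device file is present can be checked with a few lines of
Python. The sketch assumes /dev/aer_inject appears as a character device;
the helper name is hypothetical.

```python
import os
import stat


def is_char_device(path):
    """True if path exists and is a character device."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing file or no permission to stat it.
        return False
    return stat.S_ISCHR(mode)


if is_char_device("/dev/aer_inject"):
    print("aer_inject device file found")
else:
    print("aer_inject device file not found")
```

A negative result usually means the kernel was built without
CONFIG_PCIEAER_INJECT or the module has not been inserted yet.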

Then, you need a user space tool named aer-inject, which can be obtained
from:

    https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/

More information about aer-inject can be found in the document in
its source code.