====================================
Coherent Accelerator Interface (CXL)
====================================

The coherent accelerator interface is designed to allow the
coherent connection of accelerators (FPGAs and other devices) to a
POWER system. These devices need to adhere to the Coherent
Accelerator Interface Architecture (CAIA).

IBM refers to this as the Coherent Accelerator Processor Interface
or CAPI. In the kernel it's referred to by the name CXL to avoid
confusion with the ISDN CAPI subsystem.

Coherent in this context means that the accelerator and CPUs can
both access system memory directly and with the same effective
addresses.

[Diagram: the POWER8/9 chip (with the CAPP inside the PHB) connected
over PCIe to the FPGA (AFU and PSL).]

The POWER8/9 chip has a Coherently Attached Processor Proxy (CAPP)
unit which is part of the PCIe Host Bridge (PHB). Linux manages it
through calls into OPAL. Linux doesn't directly program the
CAPP.

The FPGA (or coherently attached device) consists of two parts:
the POWER Service Layer (PSL) and the Accelerator Function Unit
(AFU). The AFU is used to implement specific functionality behind
the PSL. The PSL, among other things, provides memory address
translation services to allow each AFU direct access to userspace
memory.

The AFU is the core part of the accelerator (e.g. the compression
or crypto function). The kernel has no knowledge of the function
of the AFU. Only userspace interacts directly with the AFU.

The PSL provides the translation and interrupt services that the
AFU needs. This is what the kernel interacts with. For example, if
the AFU needs to read a particular effective address, it sends
that address to the PSL; the PSL then translates it, fetches the
data from memory and returns it to the AFU. If the PSL has a
translation miss, it interrupts the kernel and the kernel services
the fault. The context in which this fault is serviced depends on
who owns that acceleration function.

- POWER8 and PSL Version 8 are compliant with CAIA Version 1.0.
- POWER9 and PSL Version 9 are compliant with CAIA Version 2.0.

PSL Version 9 provides new features such as:

* Interaction with the nest MMU on the POWER9 chip.
* Support for sending ASB_Notify messages for host thread wakeup.
* Support for atomic operations.

Cards with a PSL9 won't work on a POWER8 system and cards with a
PSL8 won't work on a POWER9 system.

There are two programming modes supported by the AFU: dedicated
and AFU directed. An AFU may support one or both modes.

When using dedicated mode, only one MMU context is supported. In
this mode, only one userspace process can use the accelerator at
a time.

When using AFU directed mode, up to 16K simultaneous contexts can
be supported. This means up to 16K simultaneous userspace
applications may use the accelerator (although specific AFUs may
support fewer). In this mode, the AFU sends a 16-bit context ID
with each of its requests. This tells the PSL which context is
associated with each operation. If the PSL can't translate an
operation, the ID can also be accessed by the kernel so it can
determine the userspace context associated with an operation.

A portion of the accelerator MMIO space can be directly mapped
from the AFU to userspace. Either the whole space can be mapped or
just a per-context portion. The hardware is self-describing, hence
the kernel can determine the offset and size of the per-context
portion.

AFUs may generate interrupts that are destined for userspace. These
are received by the kernel as hardware interrupts and passed on to
userspace by a read syscall documented below.

Data storage faults and error interrupts are handled by the kernel.

Work Element Descriptor (WED)
=============================

The WED is a 64-bit parameter passed to the AFU when a context is
started. Its format is up to the AFU, hence the kernel has no
knowledge of what it represents. Typically it will be the
effective address of a work queue or status block where the AFU
and userspace can share control and status information.
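
As a minimal sketch of the typical case described above, the WED can
simply carry the address of a control structure in the process's own
memory. The demo_work_queue layout and the wed_from_ptr() and
wed_to_ptr() helpers are hypothetical names used only for illustration;
a real AFU defines its own format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical control block shared between userspace and an AFU.
 * The real layout is defined by each AFU, not by the kernel. */
struct demo_work_queue {
	uint64_t head;   /* next entry the AFU would consume */
	uint64_t tail;   /* next entry userspace would fill */
	uint64_t status; /* status word written by the AFU */
};

/* Encode the effective address of the control block as a 64-bit WED. */
static uint64_t wed_from_ptr(const struct demo_work_queue *wq)
{
	assert(wq != NULL);
	return (uint64_t)(uintptr_t)wq;
}

/* Recover the pointer from a WED value, e.g. for a sanity check. */
static struct demo_work_queue *wed_to_ptr(uint64_t wed)
{
	return (struct demo_work_queue *)(uintptr_t)wed;
}
```

Because the AFU uses the same effective addresses as the process, no
address conversion beyond the integer cast should be needed.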

1. AFU character devices
^^^^^^^^^^^^^^^^^^^^^^^^

For AFUs operating in AFU directed mode, two character device
files will be created. /dev/cxl/afu0.0m will correspond to a
master context and /dev/cxl/afu0.0s will correspond to a slave
context. Master contexts have access to the full MMIO space an
AFU provides. Slave contexts have access to only the per-process
MMIO space an AFU provides.

For AFUs operating in dedicated process mode, the driver will
only create a single character device per AFU called
/dev/cxl/afu0.0d. This will have access to the entire MMIO space
that the AFU provides (like master contexts in AFU directed mode).

The types described below are defined in include/uapi/misc/cxl.h.

The following file operations are supported on both slave and
master devices.

A userspace library, libcxl, is available here:

    https://github.com/ibm-capi/libcxl

This provides a C interface to this kernel API.

open() opens the device and allocates a file descriptor to be used
with the rest of the API.

A dedicated mode AFU only has one context and only allows the
device to be opened once.

An AFU directed mode AFU can have many contexts; the device can be
opened once for each context that is available.

When all available contexts are allocated the open call will fail.

IRQs need to be allocated for each context, which may limit
the number of contexts that can be created, and therefore
how many times the device can be opened. The POWER8 CAPP
supports 2040 IRQs and 3 are used by the kernel, so 2037 are
left. If 1 IRQ is needed per context, then only 2037
contexts can be allocated. If 4 IRQs are needed per context,
then only 2037/4 = 509 contexts can be allocated.
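
The arithmetic above can be written as a small helper. This is only a
sketch of the IRQ-imposed bound; max_contexts_by_irq() is a
hypothetical name, and an AFU may impose a lower limit of its own.

```c
#include <assert.h>

/* Upper bound on contexts imposed by IRQ availability alone. */
static unsigned int max_contexts_by_irq(unsigned int total_irqs,
					unsigned int kernel_irqs,
					unsigned int irqs_per_context)
{
	assert(irqs_per_context > 0);
	assert(kernel_irqs <= total_irqs);
	/* Integer division rounds down: leftover IRQs cannot back a context. */
	return (total_irqs - kernel_irqs) / irqs_per_context;
}
```

With the POWER8 figures quoted above, max_contexts_by_irq(2040, 3, 4)
gives 509 and max_contexts_by_irq(2040, 3, 1) gives 2037.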

CXL_IOCTL_START_WORK:
    Starts the AFU context and associates it with the current
    process. Once this ioctl is successfully executed, all memory
    mapped into this process is accessible to this AFU context
    using the same effective addresses. No additional calls are
    required to map/unmap memory. The AFU memory context will be
    updated as userspace allocates and frees memory. This ioctl
    returns once the AFU context is started.

    Takes a pointer to a struct cxl_ioctl_start_work::

        struct cxl_ioctl_start_work {
                __u64 flags;
                __u64 work_element_descriptor;
                __u64 amr;
                __s16 num_interrupts; /* followed by reserved fields */
        };

    flags:
        Indicates which optional fields in the structure are
        valid.

    work_element_descriptor:
        The Work Element Descriptor (WED) is a 64-bit argument
        defined by the AFU. Typically this is an effective
        address pointing to an AFU specific structure
        describing what work to perform.

    amr:
        Authority Mask Register (AMR), same as the powerpc
        AMR. This field is only used by the kernel when the
        corresponding CXL_START_WORK_AMR value is specified in
        flags. If not specified the kernel will use a default
        value.

    num_interrupts:
        Number of userspace interrupts to request. This field
        is only used by the kernel when the corresponding
        CXL_START_WORK_NUM_IRQS value is specified in flags.
        If not specified the minimum number required by the
        AFU will be allocated. The min and max number can be
        obtained from sysfs.

    reserved fields:
        For ABI padding and future extensions.

CXL_IOCTL_GET_PROCESS_ELEMENT:
    Get the current context id, also known as the process element.
    The value is returned from the kernel as a __u32.
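
The two ioctls might be combined as follows. This is a sketch rather
than a reference implementation: it assumes the UAPI header is
installed as misc/cxl.h, demo_start_context() is a hypothetical helper
name, and error reporting is reduced to a single -1 return.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <misc/cxl.h>

/*
 * Open an AFU device, start a context and return its fd, or -1 on error.
 * On success *pe receives the process element (context id).
 * Pass num_irqs < 0 to let the kernel pick the AFU's minimum.
 */
static int demo_start_context(const char *path, uint64_t wed,
			      int16_t num_irqs, uint32_t *pe)
{
	struct cxl_ioctl_start_work work;
	__u32 process_element = 0;
	int fd;

	assert(path != NULL && pe != NULL);

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	memset(&work, 0, sizeof(work)); /* also zeroes the reserved fields */
	work.work_element_descriptor = wed;
	if (num_irqs >= 0) {
		/* num_interrupts is only consulted when this flag is set. */
		work.flags |= CXL_START_WORK_NUM_IRQS;
		work.num_interrupts = num_irqs;
	}

	if (ioctl(fd, CXL_IOCTL_START_WORK, &work) < 0 ||
	    ioctl(fd, CXL_IOCTL_GET_PROCESS_ELEMENT, &process_element) < 0) {
		close(fd);
		return -1;
	}

	*pe = process_element;
	return fd;
}
```

A caller would pass a device path such as /dev/cxl/afu0.0d together
with a WED in whatever format the AFU expects.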

An AFU may have an MMIO space to facilitate communication with the
AFU. If it does, the MMIO space can be accessed via mmap. The size
and contents of this area are specific to the particular AFU. The
size can be discovered via sysfs.

In AFU directed mode, master contexts are allowed to map all of
the MMIO space and slave contexts are allowed to only map the
per-process MMIO space associated with the context. In dedicated
process mode the entire MMIO space can always be mapped.

This mmap call must be done after the START_WORK ioctl.

Care should be taken when accessing MMIO space. Only 32-bit and
64-bit accesses are supported by POWER8. Also, the AFU will be
designed with a specific endianness, so all MMIO accesses should
consider endianness (the endian(3) variants such as le64toh() and
be64toh() are recommended). These endian issues equally apply to
shared memory queues the WED may describe.
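
A possible shape for endian-aware 64-bit accessors is sketched below.
It assumes an AFU whose registers are little-endian; the
demo_mmio_read64_le() and demo_mmio_write64_le() names are
hypothetical, and any memory barriers a particular AFU needs are left
out.

```c
#define _DEFAULT_SOURCE /* for le64toh()/htole64() under strict -std modes */
#include <assert.h>
#include <endian.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 64-bit little-endian register at byte offset off. */
static uint64_t demo_mmio_read64_le(const volatile void *base, size_t off)
{
	const volatile uint64_t *reg;

	assert(off % 8 == 0); /* keep 64-bit accesses naturally aligned */
	reg = (const volatile uint64_t *)((const volatile char *)base + off);
	return le64toh(*reg);
}

/* Write a host-order value to a 64-bit little-endian register. */
static void demo_mmio_write64_le(volatile void *base, size_t off, uint64_t val)
{
	volatile uint64_t *reg;

	assert(off % 8 == 0);
	reg = (volatile uint64_t *)((volatile char *)base + off);
	*reg = htole64(val);
}
```

Here base would be the pointer returned by mmap on the AFU file
descriptor. For a big-endian AFU the same pattern applies with
be64toh() and htobe64().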

read() reads events from the AFU. It blocks if no events are pending
(unless O_NONBLOCK is supplied). Returns -EIO in the case of an
unrecoverable error or if the card is removed.

read() will always return an integral number of events.

The buffer passed to read() must be at least 4K bytes.

The result of the read will be a buffer of one or more events.
Each event is of type struct cxl_event, of varying size::

    struct cxl_event {
            struct cxl_event_header header;
            union {
                    struct cxl_event_afu_interrupt irq;
                    struct cxl_event_data_storage fault;
                    struct cxl_event_afu_error afu_error;
            };
    };

The struct cxl_event_header is defined as::

    struct cxl_event_header {
            __u16 type;
            __u16 size;
            __u16 process_element;
            __u16 reserved1;
    };

type:
    This defines the type of event. The type determines how
    the rest of the event is structured. These types are
    described below and defined by enum cxl_event_type.

size:
    This is the size of the event in bytes including the
    struct cxl_event_header. The start of the next event can
    be found at this offset from the start of the current
    event.

process_element:
    Context ID of the event.

reserved field:
    For future extensions and padding.
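
Walking a buffer returned by read() using the size field could look
like the sketch below. To stay self-contained it uses a local
demo_event_header that mirrors the fields described above; real code
would use struct cxl_event_header from misc/cxl.h. The callback type
and the demo_for_each_event() helper are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the documented header fields (illustration only). */
struct demo_event_header {
	uint16_t type;
	uint16_t size; /* whole event in bytes, header included */
	uint16_t process_element;
	uint16_t reserved1;
};

typedef void (*demo_event_fn)(const struct demo_event_header *hdr,
			      const unsigned char *event, void *arg);

/*
 * Visit every event in buf[0..len). Returns the number of events visited,
 * or -1 if a header is truncated or reports an impossible size.
 */
static int demo_for_each_event(const unsigned char *buf, size_t len,
			       demo_event_fn fn, void *arg)
{
	size_t off = 0;
	int count = 0;

	assert(buf != NULL);
	while (off < len) {
		struct demo_event_header hdr;

		if (len - off < sizeof(hdr))
			return -1;
		memcpy(&hdr, buf + off, sizeof(hdr)); /* avoids unaligned loads */
		if (hdr.size < sizeof(hdr) || hdr.size > len - off)
			return -1;
		if (fn)
			fn(&hdr, buf + off, arg);
		off += hdr.size; /* next event starts size bytes later */
		count++;
	}
	return count;
}
```

The len argument would be the byte count returned by read(), and the
callback can switch on hdr->type to interpret the rest of each event.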

If the event type is CXL_EVENT_AFU_INTERRUPT then the event
structure is defined as::

    struct cxl_event_afu_interrupt {
            __u16 flags;
            __u16 irq; /* Raised AFU interrupt number */
            __u32 reserved1;
    };

flags:
    These flags indicate which optional fields are present
    in this struct. Currently all fields are mandatory.

irq:
    The IRQ number sent by the AFU.

reserved field:
    For future extensions and padding.

If the event type is CXL_EVENT_DATA_STORAGE then the event
structure is defined as::

    struct cxl_event_data_storage {
            __u16 flags;
            __u16 reserved1;
            __u32 reserved2;
            __u64 addr;
            __u64 dsisr;
            __u64 reserved3;
    };

flags:
    These flags indicate which optional fields are present in
    this struct. Currently all fields are mandatory.

addr:
    The address that the AFU unsuccessfully attempted to
    access. Valid accesses will be handled transparently by the
    kernel but invalid accesses will generate this event.

dsisr:
    This field gives information on the type of fault. It is a
    copy of the DSISR from the PSL hardware when the address
    fault occurred. The form of the DSISR is as defined in the
    CAIA.

reserved fields:
    For future extensions.

If the event type is CXL_EVENT_AFU_ERROR then the event structure
is defined as::

    struct cxl_event_afu_error {
            __u16 flags;
            __u16 reserved1;
            __u32 reserved2;
            __u64 error;
    };

flags:
    These flags indicate which optional fields are present in
    this struct. Currently all fields are mandatory.

error:
    Error status from the AFU. Defined by the AFU.

reserved fields:
    For future extensions and padding.

2. Card character device (PowerVM guest only)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In a PowerVM guest, an extra character device is created for the
card. The device is only used to write (flash) a new image on the
FPGA accelerator. Once the image is written and verified, the
device tree is updated and the card is reset to reload the updated
image.

open() opens the device and allocates a file descriptor to be used
with the rest of the API. The device can only be opened once.

CXL_IOCTL_DOWNLOAD_IMAGE / CXL_IOCTL_VALIDATE_IMAGE:
    Starts and controls flashing a new FPGA image. Partial
    reconfiguration is not supported (yet), so the image must contain
    a copy of the PSL and AFU(s). Since an image can be quite large,
    the caller may have to iterate, splitting the image into smaller
    chunks.

    Takes a pointer to a struct cxl_adapter_image::

        struct cxl_adapter_image {
                __u64 flags;
                __u64 data;
                __u64 len_data;
                __u64 len_image;
                __u64 reserved1;
                __u64 reserved2;
                __u64 reserved3;
                __u64 reserved4;
        };

    flags:
        These flags indicate which optional fields are present in
        this struct. Currently all fields are mandatory.

    data:
        Pointer to a buffer with part of the image to write to the
        card.

    len_data:
        Size of the buffer pointed to by data.

    len_image:
        Full size of the image.
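
The chunk bookkeeping for such an iteration can be sketched as below.
The chunk size is an assumption chosen by the caller, and
demo_image_chunk, demo_chunk_count() and demo_chunk_at() are
hypothetical names; the sketch only computes offsets and lengths and
does not issue any ioctl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one slice of the image. */
struct demo_image_chunk {
	uint64_t offset;    /* start of this chunk within the image */
	uint64_t len_data;  /* bytes in this chunk */
	uint64_t len_image; /* full image size, same for every chunk */
};

/* Number of chunks needed to cover len_image bytes. */
static uint64_t demo_chunk_count(uint64_t len_image, uint64_t chunk_size)
{
	assert(chunk_size > 0);
	return len_image / chunk_size + (len_image % chunk_size != 0);
}

/* Describe chunk number idx (0-based); the last chunk may be short. */
static struct demo_image_chunk demo_chunk_at(uint64_t len_image,
					     uint64_t chunk_size,
					     uint64_t idx)
{
	struct demo_image_chunk c;

	assert(idx < demo_chunk_count(len_image, chunk_size));
	c.offset = idx * chunk_size;
	c.len_data = len_image - c.offset;
	if (c.len_data > chunk_size)
		c.len_data = chunk_size;
	c.len_image = len_image;
	return c;
}
```

For each chunk, a caller would point data at the image buffer plus
offset and copy len_data and len_image into struct cxl_adapter_image
before issuing the ioctl.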

A cxl sysfs class is added under /sys/class/cxl to facilitate
enumeration and tuning of the accelerators. Its layout is
described in Documentation/ABI/testing/sysfs-class-cxl

The following udev rules could be used to create a symlink to the
most logical chardev to use in any programming mode (afuX.Yd for
dedicated, afuX.Ys for AFU directed), since the API is virtually
identical for each::

    SUBSYSTEM=="cxl", ATTRS{mode}=="dedicated_process", SYMLINK="cxl/%b"
    SUBSYSTEM=="cxl", ATTRS{mode}=="afu_directed", \
                      KERNEL=="afu[0-9]*.[0-9]*s", SYMLINK="cxl/%b"