The Graphics Execution Manager
Part of the Direct Rendering Manager
====================================

Keith Packard <keithp@keithp.com>
Eric Anholt <eric@anholt.net>
Contents:

 1. Graphics Execution Manager overview
 2. API overview and conventions
 3. Object creation/destruction
 4. Reading/writing contents
 5. Mapping objects to user space
 6. Memory domains
 7. Execution (Intel specific)
 8. Other misc Intel-specific functions
1. Graphics Execution Manager Overview

GEM is designed to manage graphics memory, control access to the graphics
device execution context and handle the essentially NUMA environment unique
to modern graphics hardware. GEM allows multiple applications to share
graphics device resources without the need to constantly reload the entire
graphics card. Data may be shared between multiple applications with GEM
ensuring that the correct memory synchronization occurs.
Graphics data can consume arbitrary amounts of memory, with 3D applications
constructing ever larger sets of textures and vertices. With graphics card
memory growing larger every year, and graphics APIs growing more
complex, we can no longer insist that each application save a complete copy
of its graphics state so that the card can be re-initialized from user
space at each context switch. Ensuring that graphics data remain persistent
across context switches allows applications significant new functionality
while also improving performance for existing APIs.
Modern Linux desktops include significant 3D rendering as a fundamental
component of the desktop image construction process. 2D and 3D applications
paint their content to offscreen storage and the central 'compositing
manager' constructs the final screen image from those window contents. This
means that pixel image data from these applications must move within reach
of the compositing manager and be used as source operands for screen image
construction.
GEM provides simple mechanisms to manage graphics data and control execution
flow within the Linux operating system. Using many existing kernel
subsystems, it does this with a modest amount of code.
2. API Overview and Conventions

All APIs here are defined in terms of ioctls applied to the DRM file
descriptor. To create and manipulate objects, an application must be
'authorized' using the DRI or DRI2 protocols with the X server. To relax
that, we will need to implement some better access control mechanisms within
the hardware portion of the driver to prevent inappropriate
cross-application data access.
Any DRM driver which does not support GEM will return -ENODEV for all of
these ioctls. Invalid object handles return -EINVAL. Invalid object names
return -ENOENT. Other errors are as documented in the specific API below.
To avoid the need to translate ioctl contents on mixed-size systems (with
32-bit user space running on a 64-bit kernel), the ioctl data structures
contain explicitly sized objects, using 64 bits for all size and pointer
data and 32 bits for identifiers. In addition, the 64-bit objects are all
carefully aligned on 64-bit boundaries. Because of this, all pointers in the
ioctl data structures are passed as uint64_t values. Suitable casts will
be necessary to convert between pointers and these values.
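The layout convention above can be illustrated with a small sketch. The struct and helper names here are hypothetical, not part of the GEM API; only the packing rules come from this document:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * A hypothetical ioctl argument (not part of the GEM API) laid out
 * following the conventions above: 64-bit sizes and pointers, 32-bit
 * identifiers, and explicit padding so every 64-bit field sits on a
 * 64-bit boundary. The layout is then identical for 32-bit and
 * 64-bit user space.
 */
struct example_ioctl_arg {
	uint32_t handle;	/* 32-bit identifier */
	uint32_t pad;		/* keeps 'size' aligned on a 64-bit boundary */
	uint64_t size;		/* 64-bit size */
	uint64_t data_ptr;	/* a void *, passed as a uint64_t */
};

/* Pack a user pointer into the 64-bit field. */
static void set_data_ptr(struct example_ioctl_arg *arg, void *ptr)
{
	arg->data_ptr = (uint64_t) (uintptr_t) ptr;
}

/* Recover the pointer on the way back out. */
static void *get_data_ptr(const struct example_ioctl_arg *arg)
{
	return (void *) (uintptr_t) arg->data_ptr;
}
```

The double cast through uintptr_t avoids sign-extension surprises when a 32-bit pointer is widened to 64 bits.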
One significant operation which is explicitly left out of this API is object
locking. Applications are expected to perform locking of shared objects
outside of the GEM API. This kind of locking is not necessary to safely
manipulate the graphics engine, and with multiple objects interacting in
unknown ways, per-object locking would likely introduce all kinds of
lock-order issues. Punting this to the application seems like the only
sensible plan. Given that DRM already offers a global lock on the hardware,
this doesn't change the current situation.
3. Object Creation and Destruction

GEM provides explicit memory management primitives. System pages are
allocated when the object is created, either as the fundamental storage for
hardware where system memory is used by the graphics processor directly, or
as backing store for graphics-processor resident memory.
Objects are referenced from user space using handles. These are, for all
intents and purposes, equivalent to file descriptors. We could simply use
file descriptors were it not for the small limit (1024) of file descriptors
available to applications, and for the fact that the X server (a rather
significant user of this API) uses 'select' and has a limited maximum file
descriptor for that operation. Given the ability to allocate more file
descriptors, and given the ability to place these 'higher' in the file
descriptor space, we'd love to simply use file descriptors.
Objects may be published with a name so that other applications can access
them. The name remains valid as long as the object exists. Right now, our
DRI APIs use 32-bit integer names, so that's what we expose here.
struct drm_gem_create {
	/**
	 * Requested size for the object.
	 *
	 * The (page-aligned) allocated size for the object
	 * will be returned.
	 */
	uint64_t size;
	/**
	 * Returned handle for the object.
	 *
	 * Object handles are nonzero.
	 */
	uint32_t handle;
};

	/* usage */
	create.size = <size>;
	ret = ioctl (fd, DRM_IOCTL_GEM_CREATE, &create);
	if (ret == 0)
		return create.handle;

Note that the size is rounded up to a page boundary, and that
the rounded-up size is returned in 'size'. No name is assigned to
this object, making it local to this process.

If insufficient memory is available, -ENOMEM will be returned.
struct drm_gem_close {
	/** Handle of the object to be closed. */
	uint32_t handle;
};

	/* usage */
	close.handle = <handle>;
	ret = ioctl (fd, DRM_IOCTL_GEM_CLOSE, &close);

This call makes the specified handle invalid, and if no other
applications are using the object, any necessary graphics hardware
synchronization is performed and the resources used by the object
released.
struct drm_gem_flink {
	/** Handle for the object being named */
	uint32_t handle;

	/** Returned global name */
	uint32_t name;
};

	/* usage */
	flink.handle = <handle>;
	ret = ioctl (fd, DRM_IOCTL_GEM_FLINK, &flink);
	if (ret == 0)
		return flink.name;

Flink creates a name for the object and returns it to the
application. This name can be used by other applications to gain
access to the same object.
struct drm_gem_open {
	/** Name of object being opened */
	uint32_t name;

	/** Returned handle for the object */
	uint32_t handle;

	/** Returned size of the object */
	uint64_t size;
};

	/* usage */
	open.name = <name>;
	ret = ioctl (fd, DRM_IOCTL_GEM_OPEN, &open);
	if (ret == 0)
		return open.handle;

Open accesses an existing object and returns a handle for it. If the
object doesn't exist, -ENOENT is returned. The size of the object is
also returned. This handle has all the same capabilities as the
handle used to create the object. In particular, the object is not
destroyed until all handles are closed.
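The whole lifecycle can be sketched as a set of user-space wrappers. The struct layouts follow this document, but the ioctl request codes below are illustrative placeholders; real code takes the DRM_IOCTL_GEM_* values from the DRM headers:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Struct layouts as described in this document. */
struct drm_gem_create { uint64_t size; uint32_t handle; uint32_t pad; };
struct drm_gem_flink  { uint32_t handle; uint32_t name; };
struct drm_gem_open   { uint32_t name; uint32_t handle; uint64_t size; };
struct drm_gem_close  { uint32_t handle; uint32_t pad; };

/* Placeholder request codes, NOT the real DRM_IOCTL_GEM_* values. */
#define EX_IOCTL_GEM_CREATE _IOWR('d', 0x01, struct drm_gem_create)
#define EX_IOCTL_GEM_FLINK  _IOWR('d', 0x02, struct drm_gem_flink)
#define EX_IOCTL_GEM_OPEN   _IOWR('d', 0x03, struct drm_gem_open)
#define EX_IOCTL_GEM_CLOSE  _IOW('d', 0x04, struct drm_gem_close)

/* Create an anonymous object; returns a nonzero handle, or 0 on failure. */
static uint32_t gem_create(int fd, uint64_t size)
{
	struct drm_gem_create create = { .size = size };

	if (ioctl(fd, EX_IOCTL_GEM_CREATE, &create) != 0)
		return 0;
	/* create.size now holds the page-aligned allocated size. */
	return create.handle;
}

/* Publish a global name for the object; 0 on failure. */
static uint32_t gem_flink(int fd, uint32_t handle)
{
	struct drm_gem_flink flink = { .handle = handle };

	if (ioctl(fd, EX_IOCTL_GEM_FLINK, &flink) != 0)
		return 0;
	return flink.name;
}

/* Turn a global name into a local handle; 0 on failure. */
static uint32_t gem_open(int fd, uint32_t name, uint64_t *size)
{
	struct drm_gem_open open = { .name = name };

	if (ioctl(fd, EX_IOCTL_GEM_OPEN, &open) != 0)
		return 0;
	if (size)
		*size = open.size;
	return open.handle;
}

/* Drop this process's handle; the object lives while any handle does. */
static int gem_close(int fd, uint32_t handle)
{
	struct drm_gem_close close = { .handle = handle };

	return ioctl(fd, EX_IOCTL_GEM_CLOSE, &close);
}
```

Because handles are nonzero, returning 0 is an unambiguous failure indication for the first three wrappers.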
4. Basic read/write operations

By default, GEM objects are not mapped to the application's address space;
getting data in and out of them is done with I/O operations instead. This
allows the data to reside in otherwise unmapped pages, including pages in
video memory on an attached discrete graphics card. In addition, using
explicit I/O operations allows better control over cache contents, as
graphics devices are generally not cache coherent with the CPU; mapping
pages used for graphics into an application address space requires the use
of expensive cache flushing operations. Providing direct control over
graphics data access ensures that data are handled in the most efficient
possible fashion.
struct drm_gem_pread {
	/** Handle for the object being read. */
	uint32_t handle;
	uint32_t pad;
	/** Offset into the object to read from */
	uint64_t offset;
	/** Length of data to read */
	uint64_t size;
	/** Pointer to write the data into. */
	uint64_t data_ptr;	/* void * */
};

This copies data out of the specified object at the specified
position into the waiting user memory. Any necessary graphics device
synchronization and flushing will be done automatically.
struct drm_gem_pwrite {
	/** Handle for the object being written to. */
	uint32_t handle;
	uint32_t pad;
	/** Offset into the object to write to */
	uint64_t offset;
	/** Length of data to write */
	uint64_t size;
	/** Pointer to read the data from. */
	uint64_t data_ptr;	/* void * */
};

This copies data from the waiting user memory into the specified
object at the specified position. Again, device synchronization will
be handled by the kernel to ensure user space sees a
consistent view of the graphics device.
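Typical wrappers for these two ioctls look like the following sketch. The struct layouts follow this document; the request codes are placeholders rather than the real DRM values:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Layouts as described above; pread and pwrite share the same shape. */
struct drm_gem_pread {
	uint32_t handle;
	uint32_t pad;
	uint64_t offset;	/* offset into the object */
	uint64_t size;		/* bytes to transfer */
	uint64_t data_ptr;	/* void *: destination in user memory */
};

struct drm_gem_pwrite {
	uint32_t handle;
	uint32_t pad;
	uint64_t offset;
	uint64_t size;
	uint64_t data_ptr;	/* void *: source in user memory */
};

/* Placeholder request codes, NOT the real DRM values. */
#define EX_IOCTL_GEM_PREAD  _IOW('d', 0x05, struct drm_gem_pread)
#define EX_IOCTL_GEM_PWRITE _IOW('d', 0x06, struct drm_gem_pwrite)

/* Copy object contents into user memory; 0 on success, -1 on error. */
static int gem_pread(int fd, uint32_t handle, uint64_t offset,
		     void *data, uint64_t size)
{
	struct drm_gem_pread pread = {
		.handle = handle,
		.offset = offset,
		.size = size,
		.data_ptr = (uint64_t) (uintptr_t) data,
	};

	return ioctl(fd, EX_IOCTL_GEM_PREAD, &pread);
}

/* Copy user memory into the object; 0 on success, -1 on error. */
static int gem_pwrite(int fd, uint32_t handle, uint64_t offset,
		      const void *data, uint64_t size)
{
	struct drm_gem_pwrite pwrite = {
		.handle = handle,
		.offset = offset,
		.size = size,
		.data_ptr = (uint64_t) (uintptr_t) data,
	};

	return ioctl(fd, EX_IOCTL_GEM_PWRITE, &pwrite);
}
```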
5. Mapping objects to user space

For most objects, reading/writing is the preferred interaction mode.
However, when the CPU is involved in rendering to cover deficiencies in
hardware support for particular operations, the CPU will want to directly
access the relevant objects.
Because mmap is fairly heavyweight, we allow applications to retain maps to
objects persistently and then update how they're using the memory through a
separate interface. Applications which fail to use this separate interface
may exhibit unpredictable behaviour as memory consistency will not be
maintained.
struct drm_gem_mmap {
	/** Handle for the object being mapped. */
	uint32_t handle;
	uint32_t pad;
	/** Offset in the object to map. */
	uint64_t offset;
	/**
	 * Length of data to map.
	 *
	 * The value will be page-aligned.
	 */
	uint64_t size;
	/** Returned pointer the data was mapped at */
	uint64_t addr_ptr;	/* void * */
};

	/* usage */
	mmap.handle = <handle>;
	mmap.offset = <offset>;
	mmap.size = <size>;
	ret = ioctl (fd, DRM_IOCTL_GEM_MMAP, &mmap);
	if (ret == 0)
		return (void *) (uintptr_t) mmap.addr_ptr;
	/* unmapping */
	munmap (addr, length);

Nothing strange here; just use the normal munmap syscall.
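A wrapper for the mapping ioctl looks like the following sketch; again the layout is from this document and the request code is a placeholder:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Layout as described above. */
struct drm_gem_mmap {
	uint32_t handle;
	uint32_t pad;
	uint64_t offset;
	uint64_t size;
	uint64_t addr_ptr;	/* void *: returned mapping address */
};

/* Placeholder request code, NOT the real DRM value. */
#define EX_IOCTL_GEM_MMAP _IOWR('d', 0x07, struct drm_gem_mmap)

/*
 * Map 'size' bytes of the object starting at 'offset'; returns the
 * mapping address, or NULL on failure. The mapping persists until
 * munmap(); see the memory-consistency caveat above.
 */
static void *gem_mmap(int fd, uint32_t handle, uint64_t offset, uint64_t size)
{
	struct drm_gem_mmap mmap_arg = {
		.handle = handle,
		.offset = offset,
		.size = size,
	};

	if (ioctl(fd, EX_IOCTL_GEM_MMAP, &mmap_arg) != 0)
		return NULL;
	return (void *) (uintptr_t) mmap_arg.addr_ptr;
}
```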
6. Memory Domains

Graphics devices remain a strong bastion of non cache-coherent memory. As a
result, accessing data through one functional unit will end up loading that
cache with data which then needs to be manually synchronized when that data
is used with another functional unit.
Tracking where data are resident is done by identifying how functional units
deal with caches. Each cache is labeled as a separate memory domain. Then,
each sequence of operations is expected to load data into various read
domains and leave data in at most one write domain. GEM tracks the read and
write memory domains of each object and performs the necessary
synchronization operations when objects move from one domain set to another.
For example, if operation 'A' constructs an image that is immediately used
by operation 'B', then when the read domain for 'B' is not the same as the
write domain for 'A', the write domain must be flushed and the read
domain invalidated. If these two operations are both executed in the same
command queue, then the flush operation can go in between them in the same
queue, avoiding any kind of CPU-based synchronization and leaving the GPU
to manage the synchronization itself.
6.1 Memory Domains (GPU-independent)

 * DRM_GEM_DOMAIN_CPU.

Objects in this domain are using caches which are connected to the CPU.
Moving objects from non-CPU domains into the CPU domain can involve waiting
for the GPU to finish with operations using this object. Moving objects
from this domain to a GPU domain can involve flushing CPU caches and chipset
buffers.
6.2 GPU-independent memory domain ioctl

This ioctl is independent of the GPU in use. So far, no use other than
synchronizing objects to the CPU domain has been found; if that turns out
to be generally true, this ioctl may be simplified further.
A. Explicit domain control

struct drm_gem_set_domain {
	/** Handle for the object */
	uint32_t handle;

	/** New read domains */
	uint32_t read_domains;

	/** New write domain */
	uint32_t write_domain;
};

	/* usage */
	set_domain.handle = <handle>;
	set_domain.read_domains = <read_domains>;
	set_domain.write_domain = <write_domain>;
	ret = ioctl (fd, DRM_IOCTL_GEM_SET_DOMAIN, &set_domain);
When the application wants to explicitly manage memory domains for
an object, it can use this function. Usually, this is only used
when the application wants to synchronize object contents between
the GPU and CPU-based application rendering. In that case,
the <read_domains> would be set to DRM_GEM_DOMAIN_CPU, and if the
application were going to write to the object, the <write_domain>
would also be set to DRM_GEM_DOMAIN_CPU. After the call, GEM
guarantees that all previous rendering operations involving this
object are complete. The application is then free to access the
object through the address returned by the mmap call. Afterwards,
when the application again uses the object through the GPU, any
necessary CPU flushing will occur and the object will be correctly
synchronized with the GPU.
Note that this synchronization is not required for any accesses
going through the driver itself. The pread, pwrite and execbuffer
ioctls all perform the necessary domain management internally.
Explicit synchronization is only necessary when accessing the object
through the mmap'd address.
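The usual CPU-access pattern above can be sketched as a helper. The struct layout follows this document; the domain bit value and request code are placeholders:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Placeholder bit for DRM_GEM_DOMAIN_CPU, NOT the real header value. */
#define EX_GEM_DOMAIN_CPU 0x00000001

/* Layout as described above. */
struct drm_gem_set_domain {
	uint32_t handle;
	uint32_t read_domains;
	uint32_t write_domain;
};

/* Placeholder request code, NOT the real DRM value. */
#define EX_IOCTL_GEM_SET_DOMAIN _IOW('d', 0x08, struct drm_gem_set_domain)

/*
 * Wait for any GPU rendering to this object and make it CPU-coherent
 * before touching the mmap'd address. Pass write = 1 if the CPU is
 * about to modify the contents.
 */
static int gem_prep_cpu_access(int fd, uint32_t handle, int write)
{
	struct drm_gem_set_domain sd = {
		.handle = handle,
		.read_domains = EX_GEM_DOMAIN_CPU,
		.write_domain = write ? EX_GEM_DOMAIN_CPU : 0,
	};

	return ioctl(fd, EX_IOCTL_GEM_SET_DOMAIN, &sd);
}
```

After CPU writes, the next GPU use of the object triggers the reverse transition automatically, as the text above describes.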
7. Execution (Intel specific)

Managing the command buffers is inherently chip-specific, so the core of GEM
doesn't have any intrinsic functions. Rather, execution is left to the
device-specific portions of the driver.

The Intel DRM_I915_GEM_EXECBUFFER ioctl takes a list of GEM objects, all of
which are mapped to the graphics device. The last object in the list is the
batch buffer, containing the commands to be executed.

7.1 Relocations (Intel specific)
Command buffers often refer to other objects, and to allow the kernel driver
to move objects around, a sequence of relocations is associated with each
object. Device-specific relocation operations are used to place the
target-object relative value into the object.
The Intel driver has a single relocation type:

struct drm_i915_gem_relocation_entry {
	/**
	 * Handle of the buffer being pointed to by this
	 * relocation entry.
	 *
	 * It's appealing to make this be an index into the
	 * mm_validate_entry list to refer to the buffer,
	 * but this allows the driver to create a relocation
	 * list for state buffers and not re-write it per
	 * exec using the buffer.
	 */
	uint32_t target_handle;

	/**
	 * Value to be added to the offset of the target
	 * buffer to make up the relocation entry.
	 */
	uint32_t delta;

	/**
	 * Offset in the buffer the relocation entry will be
	 * written into.
	 */
	uint64_t offset;

	/**
	 * Offset value of the target buffer that the
	 * relocation entry was last written as.
	 *
	 * If the buffer has the same offset as last time, we
	 * can skip syncing and writing the relocation. This
	 * value is written back out by the execbuffer ioctl
	 * when the relocation is written.
	 */
	uint64_t presumed_offset;

	/**
	 * Target memory domains read by this operation.
	 */
	uint32_t read_domains;

	/**
	 * Target memory domains written by this operation.
	 *
	 * Note that only one domain may be written by the
	 * whole execbuffer operation, so that where there are
	 * conflicts, the application will get -EINVAL back.
	 */
	uint32_t write_domain;
};
'target_handle', the handle to the target object. This object must
be one of the objects listed in the execbuffer request or
bad things will happen. The kernel doesn't check for this.

'offset' is where, in the source object, the relocation data
are written. Each relocation value is a 32-bit value consisting
of the location of the target object in the GPU memory space plus
the 'delta' value included in the relocation.

'presumed_offset' is where user space believes the target object
lies in GPU memory space. If this value matches where the object
actually is, then no relocation data are written; the kernel
assumes that user space has set up data in the source object
using this presumption. This offers a fairly important optimization
as writing relocation data requires mapping of the source object
into the kernel memory space.

'read_domains' and 'write_domain' list the usage by the source
object of the target object. The kernel unions all of the domain
information from all relocations in the execbuffer request. No more
than one write_domain is allowed, otherwise an -EINVAL error is
returned. read_domains must contain write_domain. This domain
information is used to synchronize buffer contents as described
above in the section on domains.
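The relocation arithmetic described above can be shown in isolation. The struct layout follows this document; the helper names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Layout as described above. */
struct drm_i915_gem_relocation_entry {
	uint32_t target_handle;
	uint32_t delta;
	uint64_t offset;
	uint64_t presumed_offset;
	uint32_t read_domains;
	uint32_t write_domain;
};

/*
 * Value the kernel would write at 'offset' bytes into the source
 * object: the target's current GPU offset plus 'delta', as a 32-bit
 * quantity.
 */
static uint32_t reloc_value(uint64_t target_offset, uint32_t delta)
{
	return (uint32_t) (target_offset + delta);
}

/*
 * The write (and the synchronization it may require) can be skipped
 * entirely when user space's guess was right.
 */
static int reloc_needs_write(const struct drm_i915_gem_relocation_entry *r,
			     uint64_t current_target_offset)
{
	return r->presumed_offset != current_target_offset;
}
```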
7.1.1 Memory Domains (Intel specific)

The Intel GPU has several internal caches which are not coherent and hence
require explicit synchronization. Memory domains provide the necessary data
to synchronize what is needed while leaving other cache contents intact.

 * DRM_GEM_DOMAIN_I915_RENDER.
   The GPU 3D and 2D rendering operations use a unified rendering cache, so
   operations doing 3D painting and 2D blits will use this domain.

 * DRM_GEM_DOMAIN_I915_SAMPLER.
   Textures are loaded by the sampler through a separate cache, so
   any texture reading will use this domain. Note that the sampler
   and renderer use different caches, so moving an object from render target
   to texture source will require a domain transfer.

 * DRM_GEM_DOMAIN_I915_COMMAND.
   The command buffer doesn't have an explicit cache (although it does
   read ahead quite a bit), so this domain just indicates that the object
   needs to be flushed to the GPU.

 * DRM_GEM_DOMAIN_I915_INSTRUCTION.
   All of the programs on Gen4 and later chips use an instruction cache to
   speed program execution. It must be explicitly flushed when new programs
   are written to memory by the CPU.

 * DRM_GEM_DOMAIN_I915_VERTEX.
   Vertex data uses two different vertex caches, but they're
   both flushed with the same instruction.
7.2 Execution object list (Intel specific)

struct drm_i915_gem_exec_object {
	/**
	 * User's handle for a buffer to be bound into the GTT
	 * for this operation.
	 */
	uint32_t handle;

	/**
	 * List of relocations to be performed on this buffer
	 */
	uint32_t relocation_count;
	/* struct drm_i915_gem_relocation_entry *relocs */
	uint64_t relocs_ptr;

	/**
	 * Required alignment in graphics aperture
	 */
	uint64_t alignment;

	/**
	 * Returned value of the updated offset of the object,
	 * for future presumed_offset writes.
	 */
	uint64_t offset;
};
Each object involved in a particular execution operation must be
listed using one of these structures.

'handle' references the object.

'relocs_ptr' is a user-mode pointer to an array of 'relocation_count'
drm_i915_gem_relocation_entry structs (see above) that
define the relocations necessary in this buffer. Note that all
relocations must reference other exec_object structures in the same
execbuffer ioctl and that those other buffers must come earlier in
the exec_object array. In other words, the dependencies mapped by the
exec_object relocations must form a directed acyclic graph.

'alignment' is the byte alignment necessary for this buffer. Each
object has specific alignment requirements; as the kernel doesn't
know what each object is being used for, those requirements must be
provided by user mode. If an object is used in two different ways,
it's quite possible that the alignment requirements will differ.

'offset' is a return value, receiving the location of the object
during this execbuffer operation. The application should use this
as the presumed offset in future operations; if the object does not
move, then the kernel need not write relocation data.
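The earlier-index ordering rule above can be checked in user space before submitting. The simplified structs here stand in for the real ones; only the fields the check needs are included:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-ins for the real structures. */
struct reloc { uint32_t target_handle; };

struct exec_obj {
	uint32_t handle;
	uint32_t relocation_count;
	const struct reloc *relocs;
};

/*
 * Returns 1 when every relocation in objs[i] targets a handle that
 * appears at an earlier index in the array (the DAG ordering rule
 * described above), else 0.
 */
static int exec_list_ordered(const struct exec_obj *objs, unsigned count)
{
	for (unsigned i = 0; i < count; i++) {
		for (unsigned r = 0; r < objs[i].relocation_count; r++) {
			unsigned j;

			for (j = 0; j < i; j++)
				if (objs[j].handle ==
				    objs[i].relocs[r].target_handle)
					break;
			if (j == i)	/* target not found earlier */
				return 0;
		}
	}
	return 1;
}
```

Since the batch buffer comes last and typically relocates against everything else, a valid list naturally places all relocation targets first.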
7.3 Execbuffer ioctl (Intel specific)

struct drm_i915_gem_execbuffer {
	/**
	 * List of buffers to be validated with their
	 * relocations to be performed on them.
	 *
	 * These buffers must be listed in an order such that
	 * all relocations a buffer is performing refer to
	 * buffers that have already appeared in the validate
	 * list.
	 */
	/* struct drm_i915_gem_validate_entry *buffers */
	uint64_t buffers_ptr;
	uint32_t buffer_count;

	/**
	 * Offset in the batchbuffer to start execution from.
	 */
	uint32_t batch_start_offset;

	/**
	 * Bytes used in batchbuffer from batch_start_offset.
	 */
	uint32_t batch_len;

	uint32_t DR1;
	uint32_t DR4;

	uint32_t num_cliprects;
	uint64_t cliprects_ptr;	/* struct drm_clip_rect *cliprects */
};
'buffers_ptr' is a user-mode pointer to an array of 'buffer_count'
drm_i915_gem_exec_object structures which contains the complete set
of objects required for this execbuffer operation. The last entry in
this array, the 'batch buffer', is the buffer of commands which will
be linked to the ring and executed.
'batch_start_offset' is the byte offset within the batch buffer which
contains the first command to execute. So far, we haven't found a
reason to use anything other than '0' here, but the thought was that
some space might be allocated for additional initialization which
could be skipped in some cases. This must be a multiple of 4.

'batch_len' is the length, in bytes, of the data to be executed
(i.e., the amount of data after batch_start_offset). This must
also be a multiple of 4.
'num_cliprects' and 'cliprects_ptr' reference an array of
drm_clip_rect structures that is num_cliprects long. The entire
batch buffer will be executed multiple times, once for each
rectangle in this list. If num_cliprects is 0, then no clipping
rectangle will be set.

'DR1' and 'DR4' are portions of the 3DSTATE_DRAWING_RECTANGLE
command which will be queued when this operation is clipped
(num_cliprects != 0).
DR1 bit		definition

31		Fast Scissor Clip Disable (debug only).
		Disables a hardware optimization that
		improves performance. This should have
		no visible effect, other than reducing
		performance.

30		Depth Buffer Coordinate Offset Disable.
		This disables the addition of the
		depth buffer offset bits which are used
		to change the location of the depth buffer
		relative to the front buffer.

27:26		X Dither Offset. Specifies the X pixel
		offset to use when accessing the dither table.

25:24		Y Dither Offset. Specifies the Y pixel
		offset to use when accessing the dither
		table.

DR4 bit		definition

31:16		Drawing Rectangle Origin Y. Specifies the Y
		origin of coordinates relative to the
		draw buffer.

15:0		Drawing Rectangle Origin X. Specifies the X
		origin of coordinates relative to the
		draw buffer.
As you can see, these two fields are necessary for correctly
offsetting drawing within a buffer which contains multiple surfaces.
Note that DR1 is only used on Gen3 and earlier hardware and that
newer hardware sticks the dither offset elsewhere.
7.3.1 Detailed Execution Description

Execution of a single batch buffer requires several preparatory
steps to make the objects visible to the graphics engine and resolve
relocations to account for their current addresses.

A. Mapping and Relocation

Each exec_object structure in the array is examined in turn.

If the object is not already bound to the GTT, it is assigned a
location in the graphics address space. If no space is available in
the GTT, some other object will be evicted. This may require waiting
for previous execbuffer requests to complete before that object can
be unmapped. With the location assigned, the pages for the object
are pinned in memory using find_or_create_page and the GTT entries
updated to point at the relevant pages using drm_agp_bind_pages.

Then the array of relocations is traversed. Each relocation record
looks up the target object and, if the presumed offset does not
match the current offset (remember that this buffer has already been
assigned an address as it must have been mapped earlier), the
relocation value is computed using the current offset. If the
object is currently in use by the graphics engine, writing the data
out must be preceded by a delay while the object is still busy.
Once it is idle, then the page containing the relocation is mapped
by the CPU and the updated relocation data written out.
The read_domains and write_domain entries in each relocation are
used to compute the new read_domains and write_domain values for the
target buffers. The actual execution of the domain changes must wait
until all of the exec_object entries have been evaluated as the
complete set of domain information will not be available until then.
B. Memory Domain Resolution

After all of the new memory domain data has been pulled out of the
relocations and computed for each object, the list of objects is
again traversed and the new memory domains compared against the
current memory domains. There are two basic operations involved here:

 * Flushing the current write domain. If the new read domains
   are not equal to the current write domain, then the current
   write domain must be flushed. Otherwise, reads will not see data
   present in the write domain cache. In addition, any new read domains
   other than the current write domain must be invalidated to ensure
   that the flushed data are re-read into their caches.

 * Invalidating new read domains. Any domains which were not currently
   used for this object must be invalidated as old objects which
   were mapped at the same location may have left stale data in the
   newly activated caches.
If the CPU cache is being invalidated and some GPU cache is being
flushed, then we'll have to wait for rendering to complete so that
any pending GPU writes will be complete before we flush the GPU
caches.

If the CPU cache is being flushed, then we use 'clflush' to get data
written from the CPU.
Because the GPU caches cannot be partially flushed or invalidated,
we don't actually flush them during this traversal stage. Rather, we
gather the invalidate and flush bits up in the device structure.

Once all of the object domain changes have been evaluated, then the
gathered invalidate and flush bits are examined. For any GPU flush
operations, we emit a single MI_FLUSH command that performs all of
the necessary flushes. We then look to see if the CPU cache was
flushed. If so, we use the chipset flush magic (writing to a special
page) to get the data out of the chipset and into memory.
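The per-object bookkeeping in step B can be modeled as a small pure function. Domains are bit masks; the constants below are placeholders for illustration, not the real DRM_GEM_DOMAIN_* values:

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder domain bits, NOT the real DRM_GEM_DOMAIN_* values. */
#define EX_DOMAIN_CPU     0x01
#define EX_DOMAIN_RENDER  0x02
#define EX_DOMAIN_SAMPLER 0x04

/*
 * Given one object's old and new domain sets, accumulate which caches
 * must be flushed and which invalidated. As described above, the bits
 * are only gathered here; the single MI_FLUSH (or chipset flush)
 * happens after every object has been examined.
 */
static void domain_transition(uint32_t old_read, uint32_t old_write,
			      uint32_t new_read,
			      uint32_t *flush, uint32_t *invalidate)
{
	/* Readers in other domains won't see data sitting in the
	 * current write domain's cache, so that cache must be flushed. */
	if (old_write && (new_read & ~old_write)) {
		*flush |= old_write;
		/* Those readers must also re-read the flushed data. */
		*invalidate |= new_read & ~old_write;
	}
	/* Newly activated read domains may hold stale data left by old
	 * objects mapped at the same location. */
	*invalidate |= new_read & ~old_read;
}
```

For example, turning a render target into a texture source flushes the render cache and invalidates the sampler cache, matching the domain-transfer note in section 7.1.1.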
C. Queuing Batch Buffer to the Ring

With all of the objects resident in graphics memory space, and all
of the caches prepared with appropriate data, the batch buffer
object can be queued to the ring. If there are clip rectangles, then
the buffer is queued once per rectangle, with suitable clipping
inserted into the ring just before the batch buffer.

D. Creating an IRQ Cookie

Right after the batch buffer is placed in the ring, a request to
generate an IRQ is added to the ring along with a command to write a
marker into memory. When the IRQ fires, the driver can look at the
memory location to see where in the ring the GPU has passed. This
magic cookie value is stored in each object used in this execbuffer
command; it is used wherever you saw 'wait for rendering' above in
this document.

E. Writing back the new object offsets

So that the application has a better idea what to use for
'presumed_offset' values later, the current object offsets are
written back to the exec_object structures.
8. Other misc Intel-specific functions

To complete the driver, a few other functions were necessary.

8.1 Initialization from the X server

As the X server is currently responsible for apportioning memory between 2D
and 3D, it must tell the kernel which region of the GTT aperture is
available for 3D objects to be mapped into.

struct drm_i915_gem_init {
	/**
	 * Beginning offset in the GTT to be managed by the
	 * DRM memory manager.
	 */
	uint64_t gtt_start;
	/**
	 * Ending offset in the GTT to be managed by the DRM
	 * memory manager.
	 */
	uint64_t gtt_end;
};

	/* usage */
	init.gtt_start = <gtt_start>;
	init.gtt_end = <gtt_end>;
	ret = ioctl (fd, DRM_IOCTL_I915_GEM_INIT, &init);

The GTT aperture between gtt_start and gtt_end will be used to map
objects. This also tells the kernel that the ring can be used,
pulling the ring addresses from the device registers.
8.2 Pinning objects in the GTT

For scan-out buffers and the current shared depth and back buffers, we need
to have them always available in the GTT, at least for now. Pinning means to
lock their pages in memory along with keeping them at a fixed offset in the
graphics aperture. These operations are available only to root.

struct drm_i915_gem_pin {
	/** Handle of the buffer to be pinned. */
	uint32_t handle;
	uint32_t pad;

	/** alignment required within the aperture */
	uint64_t alignment;

	/** Returned GTT offset of the buffer. */
	uint64_t offset;
};

	/* usage */
	pin.handle = <handle>;
	pin.alignment = <alignment>;
	ret = ioctl (fd, DRM_IOCTL_I915_GEM_PIN, &pin);
	if (ret == 0)
		return pin.offset;

Pinning an object ensures that it will not be evicted from the GTT
or moved. It will stay resident until destroyed or unpinned.

struct drm_i915_gem_unpin {
	/** Handle of the buffer to be unpinned. */
	uint32_t handle;
	uint32_t pad;
};

	/* usage */
	unpin.handle = <handle>;
	ret = ioctl (fd, DRM_IOCTL_I915_GEM_UNPIN, &unpin);

Unpinning an object makes it possible to evict this object from the
GTT. It doesn't ensure that it will be evicted, just that it may.
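Wrappers for the pin/unpin pair follow the same pattern as the earlier sketches. Struct layouts follow this document; request codes are placeholders:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Layouts as described above. */
struct drm_i915_gem_pin {
	uint32_t handle;
	uint32_t pad;
	uint64_t alignment;
	uint64_t offset;	/* returned fixed GTT offset */
};

struct drm_i915_gem_unpin {
	uint32_t handle;
	uint32_t pad;
};

/* Placeholder request codes, NOT the real DRM values. */
#define EX_IOCTL_I915_GEM_PIN   _IOWR('d', 0x09, struct drm_i915_gem_pin)
#define EX_IOCTL_I915_GEM_UNPIN _IOW('d', 0x0a, struct drm_i915_gem_unpin)

/*
 * Pin the object (root only, as noted above); on success, *offset
 * receives its fixed GTT address. Returns 0 on success, -1 on error.
 */
static int gem_pin(int fd, uint32_t handle, uint64_t alignment,
		   uint64_t *offset)
{
	struct drm_i915_gem_pin pin = {
		.handle = handle,
		.alignment = alignment,
	};

	if (ioctl(fd, EX_IOCTL_I915_GEM_PIN, &pin) != 0)
		return -1;
	*offset = pin.offset;
	return 0;
}

/* Allow the object to be evicted from the GTT again. */
static int gem_unpin(int fd, uint32_t handle)
{
	struct drm_i915_gem_unpin unpin = { .handle = handle };

	return ioctl(fd, EX_IOCTL_I915_GEM_UNPIN, &unpin);
}
```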