The Graphics Execution Manager
Part of the Direct Rendering Manager
====================================

Keith Packard <keithp@keithp.com>
Eric Anholt <eric@anholt.net>
Contents:

 1. Graphics Execution Manager overview
 2. API overview and conventions
 3. Object creation/destruction
 4. Reading/writing contents
 5. Mapping objects to user space
 6. Memory domains
 7. Execution (Intel specific)
 8. Other misc Intel-specific functions
1. Graphics Execution Manager Overview

GEM is designed to manage graphics memory, control access to the graphics
device execution context and handle the essentially NUMA environment unique
to modern graphics hardware. GEM allows multiple applications to share
graphics device resources without the need to constantly reload the entire
graphics card. Data may be shared between multiple applications with GEM
ensuring that the correct memory synchronization occurs.
Graphics data can consume arbitrary amounts of memory, with 3D applications
constructing ever larger sets of textures and vertices. With graphics card
memory growing larger every year, and graphics APIs growing more
complex, we can no longer insist that each application save a complete copy
of its graphics state so that the card can be re-initialized from user
space at each context switch. Ensuring that graphics data remain persistent
across context switches allows applications significant new functionality
while also improving performance for existing APIs.
Modern Linux desktops include significant 3D rendering as a fundamental
component of the desktop image construction process. 2D and 3D applications
paint their content to offscreen storage and the central 'compositing
manager' constructs the final screen image from those window contents. This
means that pixel image data from these applications must move within reach
of the compositing manager and be used as source operands for screen image
construction.
GEM provides simple mechanisms to manage graphics data and control execution
flow within the Linux operating system. Using many existing kernel
subsystems, it does this with a modest amount of code.
2. API Overview and Conventions

All APIs here are defined in terms of ioctls applied to the DRM file
descriptor. To create and manipulate objects, an application must be
'authorized' using the DRI or DRI2 protocols with the X server. To relax
that, we will need to implement some better access control mechanisms within
the hardware portion of the driver to prevent inappropriate
cross-application data access.
Any DRM driver which does not support GEM will return -ENODEV for all of
these ioctls. Invalid object handles return -EINVAL. Invalid object names
return -ENOENT. Other errors are as documented in the specific API below.
To avoid the need to translate ioctl contents on mixed-size systems (with
32-bit user space running on a 64-bit kernel), the ioctl data structures
contain explicitly sized objects, using 64 bits for all size and pointer
data and 32 bits for identifiers. In addition, the 64-bit objects are all
carefully aligned on 64-bit boundaries. Because of this, all pointers in the
ioctl data structures are passed as uint64_t values. Suitable casts will
be necessary to convert between pointers and these values.
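The layout convention above can be illustrated with a small sketch. The struct and helper names here are hypothetical, not part of the GEM API; only the packing rules come from this document:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * A hypothetical ioctl argument (not part of the GEM API) laid out
 * following the conventions above: 64-bit sizes and pointers, 32-bit
 * identifiers, and explicit padding so every 64-bit field sits on a
 * 64-bit boundary. The layout is then identical for 32-bit and
 * 64-bit user space.
 */
struct example_ioctl_arg {
	uint32_t handle;	/* 32-bit identifier */
	uint32_t pad;		/* keeps 'size' aligned on a 64-bit boundary */
	uint64_t size;		/* 64-bit size */
	uint64_t data_ptr;	/* a void *, passed as a uint64_t */
};

/* Pack a user pointer into the 64-bit field. */
static void set_data_ptr(struct example_ioctl_arg *arg, void *ptr)
{
	arg->data_ptr = (uint64_t) (uintptr_t) ptr;
}

/* Recover the pointer on the way back out. */
static void *get_data_ptr(const struct example_ioctl_arg *arg)
{
	return (void *) (uintptr_t) arg->data_ptr;
}
```

The double cast through uintptr_t avoids sign-extension surprises when a 32-bit pointer is widened to 64 bits.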
One significant operation which is explicitly left out of this API is object
locking. Applications are expected to perform locking of shared objects
outside of the GEM API. This kind of locking is not necessary to safely
manipulate the graphics engine, and with multiple objects interacting in
unknown ways, per-object locking would likely introduce all kinds of
lock-order issues. Punting this to the application seems like the only
sensible plan. Given that DRM already offers a global lock on the hardware,
this doesn't change the current situation.
3. Object Creation and Destruction

GEM provides explicit memory management primitives. System pages are
allocated when the object is created, either as the fundamental storage for
hardware where system memory is used by the graphics processor directly, or
as backing store for graphics-processor resident memory.
Objects are referenced from user space using handles. These are, for all
intents and purposes, equivalent to file descriptors. We could simply use
file descriptors were it not for the small limit (1024) of file descriptors
available to applications, and for the fact that the X server (a rather
significant user of this API) uses 'select' and has a limited maximum file
descriptor for that operation. Given the ability to allocate more file
descriptors, and given the ability to place these 'higher' in the file
descriptor space, we'd love to simply use file descriptors.
Objects may be published with a name so that other applications can access
them. The name remains valid as long as the object exists. Right now, our
DRI APIs use 32-bit integer names, so that's what we expose here.
struct drm_gem_create {
	/**
	 * Requested size for the object.
	 *
	 * The (page-aligned) allocated size for the object
	 * will be returned.
	 */
	uint64_t size;
	/**
	 * Returned handle for the object.
	 *
	 * Object handles are nonzero.
	 */
	uint32_t handle;
};

	/* usage */
	create.size = <size>;
	ret = ioctl (fd, DRM_IOCTL_GEM_CREATE, &create);
	if (ret == 0)
		return create.handle;

Note that the size is rounded up to a page boundary, and that
the rounded-up size is returned in 'size'. No name is assigned to
this object, making it local to this process.

If insufficient memory is available, -ENOMEM will be returned.
struct drm_gem_close {
	/** Handle of the object to be closed. */
	uint32_t handle;
};

	/* usage */
	close.handle = <handle>;
	ret = ioctl (fd, DRM_IOCTL_GEM_CLOSE, &close);

This call makes the specified handle invalid, and if no other
applications are using the object, any necessary graphics hardware
synchronization is performed and the resources used by the object
released.
struct drm_gem_flink {
	/** Handle for the object being named */
	uint32_t handle;

	/** Returned global name */
	uint32_t name;
};

	/* usage */
	flink.handle = <handle>;
	ret = ioctl (fd, DRM_IOCTL_GEM_FLINK, &flink);
	if (ret == 0)
		return flink.name;

Flink creates a name for the object and returns it to the
application. This name can be used by other applications to gain
access to the same object.
struct drm_gem_open {
	/** Name of object being opened */
	uint32_t name;

	/** Returned handle for the object */
	uint32_t handle;

	/** Returned size of the object */
	uint64_t size;
};

	/* usage */
	open.name = <name>;
	ret = ioctl (fd, DRM_IOCTL_GEM_OPEN, &open);
	if (ret == 0)
		return open.handle;

Open accesses an existing object and returns a handle for it. If the
object doesn't exist, -ENOENT is returned. The size of the object is
also returned. This handle has all the same capabilities as the
handle used to create the object. In particular, the object is not
destroyed until all handles are closed.
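The whole lifecycle can be sketched as a set of user-space wrappers. The struct layouts follow this document, but the ioctl request codes below are illustrative placeholders; real code takes the DRM_IOCTL_GEM_* values from the DRM headers:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Struct layouts as described in this document. */
struct drm_gem_create { uint64_t size; uint32_t handle; uint32_t pad; };
struct drm_gem_flink  { uint32_t handle; uint32_t name; };
struct drm_gem_open   { uint32_t name; uint32_t handle; uint64_t size; };
struct drm_gem_close  { uint32_t handle; uint32_t pad; };

/* Placeholder request codes, NOT the real DRM_IOCTL_GEM_* values. */
#define EX_IOCTL_GEM_CREATE _IOWR('d', 0x01, struct drm_gem_create)
#define EX_IOCTL_GEM_FLINK  _IOWR('d', 0x02, struct drm_gem_flink)
#define EX_IOCTL_GEM_OPEN   _IOWR('d', 0x03, struct drm_gem_open)
#define EX_IOCTL_GEM_CLOSE  _IOW('d', 0x04, struct drm_gem_close)

/* Create an anonymous object; returns a nonzero handle, or 0 on failure. */
static uint32_t gem_create(int fd, uint64_t size)
{
	struct drm_gem_create create = { .size = size };

	if (ioctl(fd, EX_IOCTL_GEM_CREATE, &create) != 0)
		return 0;
	/* create.size now holds the page-aligned allocated size. */
	return create.handle;
}

/* Publish a global name for the object; 0 on failure. */
static uint32_t gem_flink(int fd, uint32_t handle)
{
	struct drm_gem_flink flink = { .handle = handle };

	if (ioctl(fd, EX_IOCTL_GEM_FLINK, &flink) != 0)
		return 0;
	return flink.name;
}

/* Turn a global name into a local handle; 0 on failure. */
static uint32_t gem_open(int fd, uint32_t name, uint64_t *size)
{
	struct drm_gem_open open = { .name = name };

	if (ioctl(fd, EX_IOCTL_GEM_OPEN, &open) != 0)
		return 0;
	if (size)
		*size = open.size;
	return open.handle;
}

/* Drop this process's handle; the object lives while any handle does. */
static int gem_close(int fd, uint32_t handle)
{
	struct drm_gem_close close = { .handle = handle };

	return ioctl(fd, EX_IOCTL_GEM_CLOSE, &close);
}
```

Because handles are nonzero, returning 0 is an unambiguous failure indication for the first three wrappers.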
4. Basic read/write operations

By default, GEM objects are not mapped to the application's address space;
getting data in and out of them is done with I/O operations instead. This
allows the data to reside in otherwise unmapped pages, including pages in
video memory on an attached discrete graphics card. In addition, using
explicit I/O operations allows better control over cache contents, as
graphics devices are generally not cache coherent with the CPU; mapping
pages used for graphics into an application address space requires the use
of expensive cache flushing operations. Providing direct control over
graphics data access ensures that data are handled in the most efficient
possible fashion.
struct drm_gem_pread {
	/** Handle for the object being read. */
	uint32_t handle;
	uint32_t pad;
	/** Offset into the object to read from */
	uint64_t offset;
	/** Length of data to read */
	uint64_t size;
	/** Pointer to write the data into. */
	uint64_t data_ptr;	/* void * */
};

This copies data out of the specified object at the specified
position into the waiting user memory. Any necessary graphics device
synchronization and flushing will be done automatically.
struct drm_gem_pwrite {
	/** Handle for the object being written to. */
	uint32_t handle;
	uint32_t pad;
	/** Offset into the object to write to */
	uint64_t offset;
	/** Length of data to write */
	uint64_t size;
	/** Pointer to read the data from. */
	uint64_t data_ptr;	/* void * */
};

This copies data from the waiting user memory into the specified
object at the specified position. Again, device synchronization will
be handled by the kernel to ensure user space sees a
consistent view of the graphics device.
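Typical wrappers for these two ioctls look like the following sketch. The struct layouts follow this document; the request codes are placeholders rather than the real DRM values:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Layouts as described above; pread and pwrite share the same shape. */
struct drm_gem_pread {
	uint32_t handle;
	uint32_t pad;
	uint64_t offset;	/* offset into the object */
	uint64_t size;		/* bytes to transfer */
	uint64_t data_ptr;	/* void *: destination in user memory */
};

struct drm_gem_pwrite {
	uint32_t handle;
	uint32_t pad;
	uint64_t offset;
	uint64_t size;
	uint64_t data_ptr;	/* void *: source in user memory */
};

/* Placeholder request codes, NOT the real DRM values. */
#define EX_IOCTL_GEM_PREAD  _IOW('d', 0x05, struct drm_gem_pread)
#define EX_IOCTL_GEM_PWRITE _IOW('d', 0x06, struct drm_gem_pwrite)

/* Copy object contents into user memory; 0 on success, -1 on error. */
static int gem_pread(int fd, uint32_t handle, uint64_t offset,
		     void *data, uint64_t size)
{
	struct drm_gem_pread pread = {
		.handle = handle,
		.offset = offset,
		.size = size,
		.data_ptr = (uint64_t) (uintptr_t) data,
	};

	return ioctl(fd, EX_IOCTL_GEM_PREAD, &pread);
}

/* Copy user memory into the object; 0 on success, -1 on error. */
static int gem_pwrite(int fd, uint32_t handle, uint64_t offset,
		      const void *data, uint64_t size)
{
	struct drm_gem_pwrite pwrite = {
		.handle = handle,
		.offset = offset,
		.size = size,
		.data_ptr = (uint64_t) (uintptr_t) data,
	};

	return ioctl(fd, EX_IOCTL_GEM_PWRITE, &pwrite);
}
```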
5. Mapping objects to user space

For most objects, reading/writing is the preferred interaction mode.
However, when the CPU is involved in rendering to cover deficiencies in
hardware support for particular operations, the CPU will want to directly
access the relevant objects.
Because mmap is fairly heavyweight, we allow applications to retain maps to
objects persistently and then update how they're using the memory through a
separate interface. Applications which fail to use this separate interface
may exhibit unpredictable behaviour as memory consistency will not be
maintained.
struct drm_gem_mmap {
	/** Handle for the object being mapped. */
	uint32_t handle;
	uint32_t pad;
	/** Offset in the object to map. */
	uint64_t offset;
	/**
	 * Length of data to map.
	 *
	 * The value will be page-aligned.
	 */
	uint64_t size;
	/** Returned pointer the data was mapped at */
	uint64_t addr_ptr;	/* void * */
};

	/* usage */
	mmap.handle = <handle>;
	mmap.offset = <offset>;
	mmap.size = <size>;
	ret = ioctl (fd, DRM_IOCTL_GEM_MMAP, &mmap);
	if (ret == 0)
		return (void *) (uintptr_t) mmap.addr_ptr;
	/* unmapping */
	munmap (addr, length);

Nothing strange here; just use the normal munmap syscall.
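A wrapper for the mapping ioctl looks like the following sketch; again the layout is from this document and the request code is a placeholder:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Layout as described above. */
struct drm_gem_mmap {
	uint32_t handle;
	uint32_t pad;
	uint64_t offset;
	uint64_t size;
	uint64_t addr_ptr;	/* void *: returned mapping address */
};

/* Placeholder request code, NOT the real DRM value. */
#define EX_IOCTL_GEM_MMAP _IOWR('d', 0x07, struct drm_gem_mmap)

/*
 * Map 'size' bytes of the object starting at 'offset'; returns the
 * mapping address, or NULL on failure. The mapping persists until
 * munmap(); see the memory-consistency caveat above.
 */
static void *gem_mmap(int fd, uint32_t handle, uint64_t offset, uint64_t size)
{
	struct drm_gem_mmap mmap_arg = {
		.handle = handle,
		.offset = offset,
		.size = size,
	};

	if (ioctl(fd, EX_IOCTL_GEM_MMAP, &mmap_arg) != 0)
		return NULL;
	return (void *) (uintptr_t) mmap_arg.addr_ptr;
}
```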
6. Memory Domains

Graphics devices remain a strong bastion of non cache-coherent memory. As a
result, accessing data through one functional unit will end up loading that
cache with data which then needs to be manually synchronized when that data
is used with another functional unit.
Tracking where data are resident is done by identifying how functional units
deal with caches. Each cache is labeled as a separate memory domain. Then,
each sequence of operations is expected to load data into various read
domains and leave data in at most one write domain. GEM tracks the read and
write memory domains of each object and performs the necessary
synchronization operations when objects move from one domain set to another.
For example, if operation 'A' constructs an image that is immediately used
by operation 'B', then when the read domain for 'B' is not the same as the
write domain for 'A', the write domain must be flushed and the read
domain invalidated. If these two operations are both executed in the same
command queue, then the flush operation can go in between them in the same
queue, avoiding any kind of CPU-based synchronization and leaving the GPU
to manage the synchronization itself.
6.1 Memory Domains (GPU-independent)

 * DRM_GEM_DOMAIN_CPU.

Objects in this domain are using caches which are connected to the CPU.
Moving objects from non-CPU domains into the CPU domain can involve waiting
for the GPU to finish with operations using this object. Moving objects
from this domain to a GPU domain can involve flushing CPU caches and chipset
buffers.
6.2 GPU-independent memory domain ioctl

This ioctl is independent of the GPU in use. So far, no use other than
synchronizing objects to the CPU domain has been found; if that turns out
to be generally true, this ioctl may be simplified further.
A. Explicit domain control

struct drm_gem_set_domain {
	/** Handle for the object */
	uint32_t handle;

	/** New read domains */
	uint32_t read_domains;

	/** New write domain */
	uint32_t write_domain;
};

	/* usage */
	set_domain.handle = <handle>;
	set_domain.read_domains = <read_domains>;
	set_domain.write_domain = <write_domain>;
	ret = ioctl (fd, DRM_IOCTL_GEM_SET_DOMAIN, &set_domain);
When the application wants to explicitly manage memory domains for
an object, it can use this function. Usually, this is only used
when the application wants to synchronize object contents between
the GPU and CPU-based application rendering. In that case,
the <read_domains> would be set to DRM_GEM_DOMAIN_CPU, and if the
application were going to write to the object, the <write_domain>
would also be set to DRM_GEM_DOMAIN_CPU. After the call, GEM
guarantees that all previous rendering operations involving this
object are complete. The application is then free to access the
object through the address returned by the mmap call. Afterwards,
when the application again uses the object through the GPU, any
necessary CPU flushing will occur and the object will be correctly
synchronized with the GPU.
Note that this synchronization is not required for any accesses
going through the driver itself. The pread, pwrite and execbuffer
ioctls all perform the necessary domain management internally.
Explicit synchronization is only necessary when accessing the object
through the mmap'd address.
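The usual CPU-access pattern above can be sketched as a helper. The struct layout follows this document; the domain bit value and request code are placeholders:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Placeholder bit for DRM_GEM_DOMAIN_CPU, NOT the real header value. */
#define EX_GEM_DOMAIN_CPU 0x00000001

/* Layout as described above. */
struct drm_gem_set_domain {
	uint32_t handle;
	uint32_t read_domains;
	uint32_t write_domain;
};

/* Placeholder request code, NOT the real DRM value. */
#define EX_IOCTL_GEM_SET_DOMAIN _IOW('d', 0x08, struct drm_gem_set_domain)

/*
 * Wait for any GPU rendering to this object and make it CPU-coherent
 * before touching the mmap'd address. Pass write = 1 if the CPU is
 * about to modify the contents.
 */
static int gem_prep_cpu_access(int fd, uint32_t handle, int write)
{
	struct drm_gem_set_domain sd = {
		.handle = handle,
		.read_domains = EX_GEM_DOMAIN_CPU,
		.write_domain = write ? EX_GEM_DOMAIN_CPU : 0,
	};

	return ioctl(fd, EX_IOCTL_GEM_SET_DOMAIN, &sd);
}
```

After CPU writes, the next GPU use of the object triggers the reverse transition automatically, as the text above describes.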
7. Execution (Intel specific)

Managing the command buffers is inherently chip-specific, so the core of GEM
doesn't have any intrinsic functions. Rather, execution is left to the
device-specific portions of the driver.

The Intel DRM_I915_GEM_EXECBUFFER ioctl takes a list of GEM objects, all of
which are mapped to the graphics device. The last object in the list is the
batch buffer, containing the commands to be executed.

7.1 Relocations (Intel specific)
Command buffers often refer to other objects, and to allow the kernel driver
to move objects around, a sequence of relocations is associated with each
object. Device-specific relocation operations are used to place the
target-object relative value into the object.
The Intel driver has a single relocation type:

struct drm_i915_gem_relocation_entry {
	/**
	 * Handle of the buffer being pointed to by this
	 * relocation entry.
	 *
	 * It's appealing to make this be an index into the
	 * mm_validate_entry list to refer to the buffer,
	 * but this allows the driver to create a relocation
	 * list for state buffers and not re-write it per
	 * exec using the buffer.
	 */
	uint32_t target_handle;

	/**
	 * Value to be added to the offset of the target
	 * buffer to make up the relocation entry.
	 */
	uint32_t delta;

	/**
	 * Offset in the buffer the relocation entry will be
	 * written into.
	 */
	uint64_t offset;

	/**
	 * Offset value of the target buffer that the
	 * relocation entry was last written as.
	 *
	 * If the buffer has the same offset as last time, we
	 * can skip syncing and writing the relocation. This
	 * value is written back out by the execbuffer ioctl
	 * when the relocation is written.
	 */
	uint64_t presumed_offset;

	/**
	 * Target memory domains read by this operation.
	 */
	uint32_t read_domains;

	/**
	 * Target memory domains written by this operation.
	 *
	 * Note that only one domain may be written by the
	 * whole execbuffer operation, so that where there are
	 * conflicts, the application will get -EINVAL back.
	 */
	uint32_t write_domain;
};
'target_handle', the handle to the target object. This object must
be one of the objects listed in the execbuffer request or
bad things will happen. The kernel doesn't check for this.

'offset' is where, in the source object, the relocation data
are written. Each relocation value is a 32-bit value consisting
of the location of the target object in the GPU memory space plus
the 'delta' value included in the relocation.

'presumed_offset' is where user space believes the target object
lies in GPU memory space. If this value matches where the object
actually is, then no relocation data are written; the kernel
assumes that user space has set up data in the source object
using this presumption. This offers a fairly important optimization
as writing relocation data requires mapping of the source object
into the kernel memory space.

'read_domains' and 'write_domain' list the usage by the source
object of the target object. The kernel unions all of the domain
information from all relocations in the execbuffer request. No more
than one write_domain is allowed, otherwise an -EINVAL error is
returned. read_domains must contain write_domain. This domain
information is used to synchronize buffer contents as described
above in the section on domains.
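The relocation arithmetic described above can be shown in isolation. The struct layout follows this document; the helper names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Layout as described above. */
struct drm_i915_gem_relocation_entry {
	uint32_t target_handle;
	uint32_t delta;
	uint64_t offset;
	uint64_t presumed_offset;
	uint32_t read_domains;
	uint32_t write_domain;
};

/*
 * Value the kernel would write at 'offset' bytes into the source
 * object: the target's current GPU offset plus 'delta', as a 32-bit
 * quantity.
 */
static uint32_t reloc_value(uint64_t target_offset, uint32_t delta)
{
	return (uint32_t) (target_offset + delta);
}

/*
 * The write (and the synchronization it may require) can be skipped
 * entirely when user space's guess was right.
 */
static int reloc_needs_write(const struct drm_i915_gem_relocation_entry *r,
			     uint64_t current_target_offset)
{
	return r->presumed_offset != current_target_offset;
}
```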
7.1.1 Memory Domains (Intel specific)

The Intel GPU has several internal caches which are not coherent and hence
require explicit synchronization. Memory domains provide the necessary data
to synchronize what is needed while leaving other cache contents intact.

 * DRM_GEM_DOMAIN_I915_RENDER.
   The GPU 3D and 2D rendering operations use a unified rendering cache, so
   operations doing 3D painting and 2D blits will use this domain.

 * DRM_GEM_DOMAIN_I915_SAMPLER.
   Textures are loaded by the sampler through a separate cache, so
   any texture reading will use this domain. Note that the sampler
   and renderer use different caches, so moving an object from render target
   to texture source will require a domain transfer.

 * DRM_GEM_DOMAIN_I915_COMMAND.
   The command buffer doesn't have an explicit cache (although it does
   read ahead quite a bit), so this domain just indicates that the object
   needs to be flushed to the GPU.

 * DRM_GEM_DOMAIN_I915_INSTRUCTION.
   All of the programs on Gen4 and later chips use an instruction cache to
   speed program execution. It must be explicitly flushed when new programs
   are written to memory by the CPU.

 * DRM_GEM_DOMAIN_I915_VERTEX.
   Vertex data uses two different vertex caches, but they're
   both flushed with the same instruction.
7.2 Execution object list (Intel specific)

struct drm_i915_gem_exec_object {
	/**
	 * User's handle for a buffer to be bound into the GTT
	 * for this operation.
	 */
	uint32_t handle;

	/**
	 * List of relocations to be performed on this buffer
	 */
	uint32_t relocation_count;
	/* struct drm_i915_gem_relocation_entry *relocs */
	uint64_t relocs_ptr;

	/**
	 * Required alignment in graphics aperture
	 */
	uint64_t alignment;

	/**
	 * Returned value of the updated offset of the object,
	 * for future presumed_offset writes.
	 */
	uint64_t offset;
};
Each object involved in a particular execution operation must be
listed using one of these structures.

'handle' references the object.

'relocs_ptr' is a user-mode pointer to an array of 'relocation_count'
drm_i915_gem_relocation_entry structs (see above) that
define the relocations necessary in this buffer. Note that all
relocations must reference other exec_object structures in the same
execbuffer ioctl and that those other buffers must come earlier in
the exec_object array. In other words, the dependencies mapped by the
exec_object relocations must form a directed acyclic graph.

'alignment' is the byte alignment necessary for this buffer. Each
object has specific alignment requirements; as the kernel doesn't
know what each object is being used for, those requirements must be
provided by user mode. If an object is used in two different ways,
it's quite possible that the alignment requirements will differ.

'offset' is a return value, receiving the location of the object
during this execbuffer operation. The application should use this
as the presumed offset in future operations; if the object does not
move, then the kernel need not write relocation data.
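The earlier-index ordering rule above can be checked in user space before submitting. The simplified structs here stand in for the real ones; only the fields the check needs are included:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-ins for the real structures. */
struct reloc { uint32_t target_handle; };

struct exec_obj {
	uint32_t handle;
	uint32_t relocation_count;
	const struct reloc *relocs;
};

/*
 * Returns 1 when every relocation in objs[i] targets a handle that
 * appears at an earlier index in the array (the DAG ordering rule
 * described above), else 0.
 */
static int exec_list_ordered(const struct exec_obj *objs, unsigned count)
{
	for (unsigned i = 0; i < count; i++) {
		for (unsigned r = 0; r < objs[i].relocation_count; r++) {
			unsigned j;

			for (j = 0; j < i; j++)
				if (objs[j].handle ==
				    objs[i].relocs[r].target_handle)
					break;
			if (j == i)	/* target not found earlier */
				return 0;
		}
	}
	return 1;
}
```

Since the batch buffer comes last and typically relocates against everything else, a valid list naturally places all relocation targets first.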
7.3 Execbuffer ioctl (Intel specific)

struct drm_i915_gem_execbuffer {
	/**
	 * List of buffers to be validated with their
	 * relocations to be performed on them.
	 *
	 * These buffers must be listed in an order such that
	 * all relocations a buffer is performing refer to
	 * buffers that have already appeared in the validate
	 * list.
	 */
	/* struct drm_i915_gem_validate_entry *buffers */
	uint64_t buffers_ptr;
	uint32_t buffer_count;

	/**
	 * Offset in the batchbuffer to start execution from.
	 */
	uint32_t batch_start_offset;

	/**
	 * Bytes used in batchbuffer from batch_start_offset.
	 */
	uint32_t batch_len;

	uint32_t DR1;
	uint32_t DR4;

	uint32_t num_cliprects;
	uint64_t cliprects_ptr;	/* struct drm_clip_rect *cliprects */
};
'buffers_ptr' is a user-mode pointer to an array of 'buffer_count'
drm_i915_gem_exec_object structures which contains the complete set
of objects required for this execbuffer operation. The last entry in
this array, the 'batch buffer', is the buffer of commands which will
be linked to the ring and executed.
'batch_start_offset' is the byte offset within the batch buffer which
contains the first command to execute. So far, we haven't found a
reason to use anything other than '0' here, but the thought was that
some space might be allocated for additional initialization which
could be skipped in some cases. This must be a multiple of 4.

'batch_len' is the length, in bytes, of the data to be executed
(i.e., the amount of data after batch_start_offset). This must
also be a multiple of 4.
'num_cliprects' and 'cliprects_ptr' reference an array of
drm_clip_rect structures that is num_cliprects long. The entire
batch buffer will be executed multiple times, once for each
rectangle in this list. If num_cliprects is 0, then no clipping
rectangle will be set.

'DR1' and 'DR4' are portions of the 3DSTATE_DRAWING_RECTANGLE
command which will be queued when this operation is clipped
(num_cliprects != 0).
DR1 bit		definition

31		Fast Scissor Clip Disable (debug only).
		Disables a hardware optimization that
		improves performance. This should have
		no visible effect, other than reducing
		performance.

30		Depth Buffer Coordinate Offset Disable.
		This disables the addition of the
		depth buffer offset bits which are used
		to change the location of the depth buffer
		relative to the front buffer.

27:26		X Dither Offset. Specifies the X pixel
		offset to use when accessing the dither table.

25:24		Y Dither Offset. Specifies the Y pixel
		offset to use when accessing the dither
		table.

DR4 bit		definition

31:16		Drawing Rectangle Origin Y. Specifies the Y
		origin of coordinates relative to the
		draw buffer.

15:0		Drawing Rectangle Origin X. Specifies the X
		origin of coordinates relative to the
		draw buffer.
As you can see, these two fields are necessary for correctly
offsetting drawing within a buffer which contains multiple surfaces.
Note that DR1 is only used on Gen3 and earlier hardware and that
newer hardware sticks the dither offset elsewhere.
7.3.1 Detailed Execution Description

Execution of a single batch buffer requires several preparatory
steps to make the objects visible to the graphics engine and resolve
relocations to account for their current addresses.

A. Mapping and Relocation

Each exec_object structure in the array is examined in turn.

If the object is not already bound to the GTT, it is assigned a
location in the graphics address space. If no space is available in
the GTT, some other object will be evicted. This may require waiting
for previous execbuffer requests to complete before that object can
be unmapped. With the location assigned, the pages for the object
are pinned in memory using find_or_create_page and the GTT entries
updated to point at the relevant pages using drm_agp_bind_pages.

Then the array of relocations is traversed. Each relocation record
looks up the target object and, if the presumed offset does not
match the current offset (remember that this buffer has already been
assigned an address as it must have been mapped earlier), the
relocation value is computed using the current offset. If the
object is currently in use by the graphics engine, writing the data
out must be preceded by a delay while the object is still busy.
Once it is idle, then the page containing the relocation is mapped
by the CPU and the updated relocation data written out.
The read_domains and write_domain entries in each relocation are
used to compute the new read_domains and write_domain values for the
target buffers. The actual execution of the domain changes must wait
until all of the exec_object entries have been evaluated as the
complete set of domain information will not be available until then.
B. Memory Domain Resolution

After all of the new memory domain data has been pulled out of the
relocations and computed for each object, the list of objects is
again traversed and the new memory domains compared against the
current memory domains. There are two basic operations involved here:

 * Flushing the current write domain. If the new read domains
   are not equal to the current write domain, then the current
   write domain must be flushed. Otherwise, reads will not see data
   present in the write domain cache. In addition, any new read domains
   other than the current write domain must be invalidated to ensure
   that the flushed data are re-read into their caches.

 * Invalidating new read domains. Any domains which were not currently
   used for this object must be invalidated as old objects which
   were mapped at the same location may have left stale data in the
   newly activated caches.
If the CPU cache is being invalidated and some GPU cache is being
flushed, then we'll have to wait for rendering to complete so that
any pending GPU writes will be complete before we flush the GPU
caches.

If the CPU cache is being flushed, then we use 'clflush' to get data
written from the CPU.
Because the GPU caches cannot be partially flushed or invalidated,
we don't actually flush them during this traversal stage. Rather, we
gather the invalidate and flush bits up in the device structure.

Once all of the object domain changes have been evaluated, then the
gathered invalidate and flush bits are examined. For any GPU flush
operations, we emit a single MI_FLUSH command that performs all of
the necessary flushes. We then look to see if the CPU cache was
flushed. If so, we use the chipset flush magic (writing to a special
page) to get the data out of the chipset and into memory.
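The per-object bookkeeping in step B can be modeled as a small pure function. Domains are bit masks; the constants below are placeholders for illustration, not the real DRM_GEM_DOMAIN_* values:

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder domain bits, NOT the real DRM_GEM_DOMAIN_* values. */
#define EX_DOMAIN_CPU     0x01
#define EX_DOMAIN_RENDER  0x02
#define EX_DOMAIN_SAMPLER 0x04

/*
 * Given one object's old and new domain sets, accumulate which caches
 * must be flushed and which invalidated. As described above, the bits
 * are only gathered here; the single MI_FLUSH (or chipset flush)
 * happens after every object has been examined.
 */
static void domain_transition(uint32_t old_read, uint32_t old_write,
			      uint32_t new_read,
			      uint32_t *flush, uint32_t *invalidate)
{
	/* Readers in other domains won't see data sitting in the
	 * current write domain's cache, so that cache must be flushed. */
	if (old_write && (new_read & ~old_write)) {
		*flush |= old_write;
		/* Those readers must also re-read the flushed data. */
		*invalidate |= new_read & ~old_write;
	}
	/* Newly activated read domains may hold stale data left by old
	 * objects mapped at the same location. */
	*invalidate |= new_read & ~old_read;
}
```

For example, turning a render target into a texture source flushes the render cache and invalidates the sampler cache, matching the domain-transfer note in section 7.1.1.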
C. Queuing Batch Buffer to the Ring

With all of the objects resident in graphics memory space, and all
of the caches prepared with appropriate data, the batch buffer
object can be queued to the ring. If there are clip rectangles, then
the buffer is queued once per rectangle, with suitable clipping
inserted into the ring just before the batch buffer.

D. Creating an IRQ Cookie

Right after the batch buffer is placed in the ring, a request to
generate an IRQ is added to the ring along with a command to write a
marker into memory. When the IRQ fires, the driver can look at the
memory location to see where in the ring the GPU has passed. This
magic cookie value is stored in each object used in this execbuffer
command; it is used wherever you saw 'wait for rendering' above in
this document.

E. Writing back the new object offsets

So that the application has a better idea what to use for
'presumed_offset' values later, the current object offsets are
written back to the exec_object structures.
8. Other misc Intel-specific functions

To complete the driver, a few other functions were necessary.

8.1 Initialization from the X server

As the X server is currently responsible for apportioning memory between 2D
and 3D, it must tell the kernel which region of the GTT aperture is
available for 3D objects to be mapped into.

struct drm_i915_gem_init {
	/**
	 * Beginning offset in the GTT to be managed by the
	 * DRM memory manager.
	 */
	uint64_t gtt_start;
	/**
	 * Ending offset in the GTT to be managed by the DRM
	 * memory manager.
	 */
	uint64_t gtt_end;
};

	/* usage */
	init.gtt_start = <gtt_start>;
	init.gtt_end = <gtt_end>;
	ret = ioctl (fd, DRM_IOCTL_I915_GEM_INIT, &init);

The GTT aperture between gtt_start and gtt_end will be used to map
objects. This also tells the kernel that the ring can be used,
pulling the ring addresses from the device registers.
8.2 Pinning objects in the GTT

For scan-out buffers and the current shared depth and back buffers, we need
to have them always available in the GTT, at least for now. Pinning means to
lock their pages in memory along with keeping them at a fixed offset in the
graphics aperture. These operations are available only to root.

struct drm_i915_gem_pin {
	/** Handle of the buffer to be pinned. */
	uint32_t handle;
	uint32_t pad;

	/** alignment required within the aperture */
	uint64_t alignment;

	/** Returned GTT offset of the buffer. */
	uint64_t offset;
};

	/* usage */
	pin.handle = <handle>;
	pin.alignment = <alignment>;
	ret = ioctl (fd, DRM_IOCTL_I915_GEM_PIN, &pin);
	if (ret == 0)
		return pin.offset;

Pinning an object ensures that it will not be evicted from the GTT
or moved. It will stay resident until destroyed or unpinned.

struct drm_i915_gem_unpin {
	/** Handle of the buffer to be unpinned. */
	uint32_t handle;
	uint32_t pad;
};

	/* usage */
	unpin.handle = <handle>;
	ret = ioctl (fd, DRM_IOCTL_I915_GEM_UNPIN, &unpin);

Unpinning an object makes it possible to evict this object from the
GTT. It doesn't ensure that it will be evicted, just that it may.
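Wrappers for the pin/unpin pair follow the same pattern as the earlier sketches. Struct layouts follow this document; request codes are placeholders:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Layouts as described above. */
struct drm_i915_gem_pin {
	uint32_t handle;
	uint32_t pad;
	uint64_t alignment;
	uint64_t offset;	/* returned fixed GTT offset */
};

struct drm_i915_gem_unpin {
	uint32_t handle;
	uint32_t pad;
};

/* Placeholder request codes, NOT the real DRM values. */
#define EX_IOCTL_I915_GEM_PIN   _IOWR('d', 0x09, struct drm_i915_gem_pin)
#define EX_IOCTL_I915_GEM_UNPIN _IOW('d', 0x0a, struct drm_i915_gem_unpin)

/*
 * Pin the object (root only, as noted above); on success, *offset
 * receives its fixed GTT address. Returns 0 on success, -1 on error.
 */
static int gem_pin(int fd, uint32_t handle, uint64_t alignment,
		   uint64_t *offset)
{
	struct drm_i915_gem_pin pin = {
		.handle = handle,
		.alignment = alignment,
	};

	if (ioctl(fd, EX_IOCTL_I915_GEM_PIN, &pin) != 0)
		return -1;
	*offset = pin.offset;
	return 0;
}

/* Allow the object to be evicted from the GTT again. */
static int gem_unpin(int fd, uint32_t handle)
{
	struct drm_i915_gem_unpin unpin = { .handle = handle };

	return ioctl(fd, EX_IOCTL_I915_GEM_UNPIN, &unpin);
}
```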