src/native_client_sdk/src/doc/reference/sandbox_internals/arm-32-bit-sandbox.rst

   1 ==================
   2 ARM 32-bit Sandbox
   3 ==================
   4
   5 Native Client for ARM is a sandboxing technology for running
   6 programs---even malicious ones---safely, on computers that use 32-bit
   7 ARM processors. The ARM sandbox is an extension of earlier work on
   8 Native Client for x86 processors. Security is provided with a low
   9 performance overhead of about 10% over regular ARM code, and as you'll
  10 see in this document the sandbox model is beautifully simple, meaning
  11 that the trusted codebase is much easier to validate.
  12
  13 As an implementation detail, the Native Client 32-bit ARM sandbox is
  14 currently used by Portable Native Client to execute code on 32-bit ARM
  15 machines in a safe manner. The portable bitcode contained in a **pexe**
  16 is translated to a 32-bit ARM **nexe** before execution. This may change
  17 in the future: Portable Native Client doesn't necessarily need this
  18 sandbox to execute code on ARM. Note that the Portable Native Client
  19 compiler itself is also untrusted: it too runs in the ARM sandbox
  20 described in this document.
  21
  22 On this page, we describe how Native Client works on 32-bit ARM. We
  23 assume no prior knowledge about the internals of Native Client, on x86
  24 or any other architecture, but we do assume some familiarity with
  25 assembly languages in general.
  26
  27 .. contents::
  28    :local:
  29    :backlinks: none
  30    :depth: 3
  31
  32 An Introduction to the ARM Architecture
  33 =======================================
  34
  35 In this section, we summarize the relevant parts of the ARM processor
  36 architecture.
  37
  38 About ARM and ARMv7-A
  39 ---------------------
  40
  41 ARM is one of the older commercial "RISC" processor designs, dating back
  42 to the early 1980s. Today, it is used primarily in embedded systems:
  43 everything from toys, to home automation, to automobiles. However, its
  44 most visible use is in cellular phones, tablets and some
  45 laptops.
  46
  47 Through the years, there have been many revisions of the ARM
  48 architecture, written as ARMv\ *X* for some version *X*. Native Client
  49 specifically targets the ARMv7-A architecture commonly used in high-end
  50 phones and smartbooks. This revision, defined in the mid-2000s, adds a
  51 number of useful instructions, and specifies some portions of the system
  52 that used to be left to individual chip manufacturers. Critically,
  53 ARMv7-A specifies the "eXecute Never" bit, or *XN*. This pagetable
  54 attribute lets us mark memory as non-executable. Our security relies on
  55 the presence of this feature.
  56
  57 ARMv8 adds a new 64-bit instruction set architecture called A64, while
  58 also enhancing the 32-bit A32 ISA. For Native Client's purposes the A32
  59 ISA is equivalent to the ARMv7 ARM ISA, albeit with a few new
  60 instructions. This document only discusses the 32-bit A32 instruction
  61 set: A64 would require a different sandboxing model.
  62
  63 ARM Programmer's Model
  64 ----------------------
  65
  66 While modern ARM chips support several instruction encodings, 32-bit
  67 Native Client on ARM focuses on a single one: A32 (previously, and
  68 confusingly, called simply ARM), a fixed-width encoding where every
  69 instruction is 32 bits wide. Thumb, Thumb2 (now confusingly called
  70 T32), Jazelle, ThumbEE and such aren't supported by Native Client. This
  71 dramatically simplifies some of our analyses, as we'll see later. Nearly
  72 every instruction can be conditionally executed based on the contents of
  73 a dedicated condition code register.
  74
  75 ARM processors have 16 general-purpose registers used for integer and
  76 memory operations, written ``r0`` through ``r15``. Of these, two have
  77 special roles baked into the hardware:
  78
  79 * ``r14`` is the Link Register. The ARM *call* instruction
  80   (*branch-with-link*) doesn't use the stack directly. Instead, it
  81   stashes the return address in ``r14``. In other circumstances, ``r14``
  82   can be (and is!) used as a general-purpose register. When ``r14`` is
  83   playing its Link Register role, it's referred to as ``lr``.
  84 * ``r15`` is the Program Counter. While it can be read and written like
  85   any other register, setting it to a new value will cause execution to
  86   jump to a new address. Using it in some circumstances is also
  87   undefined by the ARM architecture. Because of this, ``r15`` is never
  88   used for anything else, and is referred to as ``pc``.
  89
  90 Other registers are given roles by convention. The only important
  91 registers to Native Client are ``r9`` and ``r13``, which are used as the
  92 Thread Pointer location and Stack Pointer, respectively. When playing
  93 these roles, they're referred to as ``tp`` and ``sp``.
  94
  95 Like other RISC-inspired designs, ARM programs use explicit *load* and
  96 *store* instructions to access memory. All other instructions operate
  97 only on registers, or on registers and small constants called
  98 immediates. Because both instructions and data words are 32 bits wide, we
  99 can't simply embed a 32-bit number into an instruction. ARM programs use
 100 three methods to work around this, all of which Native Client exploits:
 101
 102 1. Many instructions can encode a modified immediate, which is an 8-bit
 103    number rotated right by an even number of bits.
 104 2. The ``movw`` and ``movt`` instructions set the bottom and top 16 bits
 105    of a register, respectively, and can therefore encode any 32-bit
 106    immediate.
 107 3. For values that can't be represented as modified immediates, ARM
 108    programs use ``pc``-relative loads to load data from inside the
 109    code---hidden in places where it won't be executed, such as the
 110    "constant pools" just past the final return of a function.
 111
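The first two methods can be sketched in a few lines of Python. This is only an illustration: the helper names ``ror32``, ``encode_modified_immediate`` and ``split_for_movw_movt`` are ours, and a real assembler may choose a different (equally valid) rotation for the same constant.

```python
MASK32 = 0xFFFFFFFF


def ror32(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def encode_modified_immediate(value):
    """Return (imm8, rotation) if `value` is an 8-bit number rotated right
    by an even amount, or None if no such encoding exists."""
    for rotation in range(0, 32, 2):
        # Rotating right by (32 - rotation) undoes a rotate-right by `rotation`.
        imm8 = ror32(value, (32 - rotation) % 32)
        if imm8 <= 0xFF:
            return imm8, rotation
    return None


def split_for_movw_movt(value):
    """Return the (bottom, top) 16-bit halves that movw and movt would set."""
    return value & 0xFFFF, (value >> 16) & 0xFFFF
```

With these helpers, ``0xC0000000`` encodes as the 8-bit value ``0x03`` rotated right by 2, whereas ``0xDEADBEEF`` has no modified-immediate encoding and would need ``movw``/``movt`` (halves ``0xBEEF`` and ``0xDEAD``) or a literal load.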
 112 We'll introduce more details of the ARM instruction set later, as we
 113 walk through the system.
 114
 115 The Native Client Approach
 116 ==========================
 117
 118 Native Client runs an untrusted program, potentially from an unknown or
 119 malicious source, inside a sandbox created by a trusted runtime. The
 120 trusted runtime allows the untrusted program to "call-out" and perform
 121 certain actions, such as drawing graphics, but prevents it from
 122 accessing the operating system directly. This "call-out" facility,
 123 called a trampoline, looks like a standard function call to the
 124 untrusted program, but it allows control to escape from the sandbox in a
 125 controlled way.
 126
 127 The untrusted program and trusted runtime inhabit the same process, or
 128 virtual address space, maintained by the operating system. To keep the
 129 trusted runtime behaving the way we expect, we must prevent the
 130 untrusted program from accessing and modifying its internals. Since they
 131 share a virtual address space, we can't rely on the operating system for
 132 this. Instead, we isolate the untrusted program from the trusted
 133 runtime.
 134
 135 Unlike modern operating systems, we use a cooperative isolation
 136 method. Native Client can't run any off-the-shelf program compiled for
 137 an off-the-shelf operating system. The program must be compiled to
 138 comply with Native Client's rules. The details vary on each platform,
 139 but in general, the untrusted program:
 140
 141 * Must not attempt to use certain forbidden instructions, such as system
 142   calls.
 143 * Must not attempt to modify its own code without abiding by Native
 144   Client's code modification rules.
 145 * Must not jump into the middle of an instruction group, or otherwise do
 146   tricky things to cause instructions to be interpreted multiple ways.
 147 * Must use special, strictly-defined instruction sequences to perform
 148   permitted but potentially dangerous actions. We call these sequences
 149   pseudo-instructions.
 150
 151 We can't simply take the program's word that it complies with these
 152 rules---we call it "untrusted" for a reason! Nor do we require it to be
 153 produced by a special compiler; in practice, we don't trust our
 154 compilers either. Instead, we apply a load-time validator that
 155 disassembles the program. The validator either proves that the program
 156 complies with our rules, or rejects it as unsafe. By keeping the rules
 157 simple, we keep the validator simple, small, and fast. We like to put
 158 our trust in small, simple things, and the validator is key to the
 159 system's security.
 160
 161 .. Note::
 162   :class: note
 163
 164   For the computationally-inclined, all our validators scale linearly in
 165   the size of the program.
 166
 167 NaCl/ARM: Pure Software Fault Isolation
 168 ---------------------------------------
 169
 170 In the original Native Client system for the x86, we used unusual
 171 hardware features of that processor (the segment registers) to isolate
 172 untrusted programs. This was simple and fast, but won't work on ARM,
 173 which has nothing equivalent. Instead, we use pure software fault
 174 isolation.
 175
 176 We use a fixed address space layout: the untrusted program gets the
 177 lowest gigabyte, addresses ``0`` through ``0x3FFFFFFF``. The rest of the
 178 address space holds the trusted runtime and the operating system. We
 179 isolate the program by requiring every *load*, *store*, and *indirect
 180 branch* (to an address in a register) to use a pseudo-instruction. The
 181 pseudo-instructions ensure that the address stays within the
 182 sandbox. The *indirect branch* pseudo-instruction, in turn, ensures that
 183 such branches won't split up other pseudo-instructions.
 184
 185 At either side of the sandbox, we place small (8KiB) guard
 186 regions. These are simply areas in the process's address space that are
 187 mapped without read, write, or execute permissions, so any attempt to
 188 access them for any reason---*load*, *store*, or *jump*---will cause a
 189 fault.
 190
 191 Finally, we ban the use of certain instructions, notably direct system
 192 calls. This is to ensure that the untrusted program can be run on any
 193 operating system supported by Native Client, and to prevent access to
 194 certain system features that might be used to subvert the sandbox. As a
 195 side effect, it helps to prevent programs from exploiting buggy
 196 operating system APIs.
 197
 198 Let's walk through the details, starting with the simplest part: *load*
 199 and *store*.
 200
 201 *Load* and *Store*
 202 ^^^^^^^^^^^^^^^^^^
 203
 204 All access to memory must be through *load* and *store*
 205 pseudo-instructions. These are simply a native *load* or *store*
 206 instruction, preceded by a guard instruction.
 207
 208 Each *load* or *store* pseudo-instruction is similar to the *load* shown
 209 below. We use abstract "placeholder" registers instead of specific
 210 numbered registers for the sake of discussion. ``rA`` is the register
 211 holding the address to load from. ``rD`` is the destination for the
 212 loaded data.
 213
 214 .. naclcode::
 215   :prettyprint: 0
 216
 217   bic    rA,  #0xC0000000
 218   ldr    rD,  [rA]
 219
 220 The first instruction, ``bic``, clears the top two bits of ``rA``. In
 221 this case, that means that the value in ``rA`` is forced to an address
 222 inside our sandbox, between ``0`` and ``0x3FFFFFFF``, inclusive.
 223
 224 The second instruction, ``ldr``, uses the previously-sandboxed address
 225 to load a value. This address might not be the address that the program
 226 intended, and might cause an access to an unmapped memory location
 227 within the sandbox: ``bic`` forces the address to be valid, by clearing
 228 the top two bits. This is a no-op in a correct program.
 229
 230 This illustrates a common property of all Native Client systems: we aim
 231 for safety, not correctness. A program using an invalid address in
 232 ``rA`` here is simply broken, so we are free to do whatever we want to
 233 preserve safety. In this case the program might load an invalid (but
 234 safe) value, or cause a segmentation fault limited to the untrusted
 235 code.
 236
 237 Now, if we allowed arbitrary branches within the program, a malicious
 238 program could set up carefully-crafted values in ``rA``, and then jump
 239 straight to the ``ldr``. This is why we validate that programs never
 240 split pseudo-instructions.
 241
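To make the effect of the guard concrete, here is a small Python model of the ``bic`` step. It is a sketch rather than validator code; the function name ``sandbox_address`` is our own.

```python
SANDBOX_MASK = 0xC0000000    # the immediate used in `bic rA, #0xC0000000`
SANDBOX_TOP = 0x3FFFFFFF     # highest address inside the sandbox


def sandbox_address(address):
    """Model `bic rA, #0xC0000000` on a 32-bit register value."""
    return address & ~SANDBOX_MASK & 0xFFFFFFFF


# Whatever the program put in rA, the masked value lies inside the sandbox.
for address in (0x00001000, 0x3FFFFFFF, 0x40000000, 0xBEEF0040, 0xFFFFFFFF):
    assert 0 <= sandbox_address(address) <= SANDBOX_TOP
```

An address that was already inside the sandbox passes through unchanged, which is why the guard is a no-op in a correct program.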
 242 Alternative Sandboxing
 243 """"""""""""""""""""""
 244
 245 .. naclcode::
 246   :prettyprint: 0
 247
 248   tst    rA,  #0xC0000000
 249   ldreq  rD,  [rA]
 250
 251 The first instruction, ``tst``, performs a bitwise-\ ``AND`` of ``rA``
 252 and the modified immediate literal, ``0xC0000000``. It sets the
 253 condition flags based on the result, but does not write the result to a
 254 register. In particular, it sets the ``Z`` condition flag if the result
 255 was zero---if the two values had no set bits in common. In this case,
 256 that means that the value in ``rA`` was an address inside our sandbox,
 257 between ``0`` and ``0x3FFFFFFF``, inclusive.
 258
 259 The second instruction, ``ldreq``, is a conditional load if equal. As we
 260 mentioned before, nearly all ARM instructions can be made
 261 conditional. In assembly language, we simply stick the desired condition
 262 on the end of the instruction's mnemonic name. Here, the condition is
 263 ``EQ``, which causes the instruction to execute only if the ``Z`` flag
 264 is set.
 265
 266 Thus, when the pseudo-instruction executes, the ``tst`` sets ``Z`` if
 267 (and only if) the value in ``rA`` is an address within the bounds of the
 268 sandbox, and then the ``ldreq`` loads if (and only if) it was. If ``rA``
 269 held an invalid address, the *load* does not execute, and ``rD`` is
 270 unchanged.
 271
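The behaviour of the ``tst``-based sequence can be modelled as follows. This is a toy model under our own assumptions: registers and memory are plain Python dictionaries, and ``tst_ldreq`` is a hypothetical helper, not part of Native Client.

```python
SANDBOX_MASK = 0xC0000000


def tst_ldreq(registers, memory, rA, rD):
    """Model `tst rA, #0xC0000000` followed by `ldreq rD, [rA]`."""
    z_flag = (registers[rA] & SANDBOX_MASK) == 0   # tst sets Z on a zero result
    if z_flag:                                     # ldreq runs only when Z is set
        registers[rD] = memory[registers[rA]]
    return z_flag


registers = {"r0": 0x00001000, "r1": 0xAAAAAAAA}
memory = {0x00001000: 0x12345678}

tst_ldreq(registers, memory, "r0", "r1")    # in bounds: r1 is loaded
in_bounds_result = registers["r1"]

registers["r0"] = 0x80001000                # outside the sandbox
registers["r1"] = 0xAAAAAAAA
tst_ldreq(registers, memory, "r0", "r1")    # load is skipped: r1 is unchanged
out_of_bounds_result = registers["r1"]
```

Unlike the ``bic`` form, the address register is never modified here; an out-of-bounds access is skipped rather than redirected.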
 272 .. Note::
 273   :class: note
 274
 275   The ``tst``-based sequence is faster than the ``bic``-based sequence
 276   on modern ARM chips. It avoids a data dependency in the address
 277   register. This is why we keep both around. The ``tst``-based sequence
 278   unfortunately leaks information on some processors, and is therefore
 279   forbidden on certain processors. This effectively means that it cannot
 280   be used for regular Native Client **nexe** files, but can be used with
 281   Portable Native Client because the target processor is known at
 282   translation time from **pexe** to **nexe**.
 283
 284 Addressing Modes
 285 """"""""""""""""
 286
 287 ARM has an unusually rich set of addressing modes. We allow all but one:
 288 register-indexed, where two registers are added to determine the
 289 address.
 290
 291 We permit simple *load* and *store*, as shown above. We also permit
 292 displacement, pre-index, and post-index memory operations:
 293
 294 .. naclcode::
 295   :prettyprint: 0
 296
 297   bic    rA,  #0xC0000000
 298   ldr    rD,  [rA, #1234]    ; This is fine.
 299   bic    rA,  #0xC0000000
 300   ldr    rD,  [rA, #1234]!   ; Also fine.
 301   bic    rA,  #0xC0000000
 302   ldr    rD,  [rA], #1234    ; Looking good.
 303
 304 In each case, we know ``rA`` points into the sandbox when the ``ldr``
 305 executes. We allow adding an immediate displacement to ``rA`` to
 306 determine the final address (as in the first two examples here) because
 307 the largest immediate displacement is ±4095 bytes, while our guard pages
 308 are 8192 bytes wide.
 309
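The arithmetic behind this argument is easy to check. The sketch below uses our own names and one simplification: addresses below zero stand in for the guard region below the sandbox, and 32-bit wraparound is not modelled.

```python
SANDBOX_SIZE = 0x40000000      # untrusted addresses are 0 .. 0x3FFFFFFF
GUARD_SIZE = 8192              # width of each guard region, in bytes
MAX_DISPLACEMENT = 4095        # largest immediate offset in a load or store


def stays_in_sandbox_or_guard(base, displacement):
    """True if base + displacement hits the sandbox or a guard region."""
    if not 0 <= base < SANDBOX_SIZE:
        raise ValueError("base must already be masked into the sandbox")
    if abs(displacement) > MAX_DISPLACEMENT:
        raise ValueError("displacement does not fit in the instruction")
    effective = base + displacement
    return -GUARD_SIZE <= effective < SANDBOX_SIZE + GUARD_SIZE
```

The worst cases are the two ends of the sandbox: a base of ``0`` with displacement ``-4095`` and a base of ``0x3FFFFFFF`` with displacement ``+4095``. Both land in a guard region because 4095 is smaller than 8192.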
 310 We also allow ARM's more unusual *load* and *store* instructions, such
 311 as *load-multiple* and *store-multiple*, etc.
 312
 313 Conditional *Load* and *Store*
 314 """"""""""""""""""""""""""""""
 315
 316 There's one problem with the pseudo-instructions shown above: they are
 317 unconditional (assuming ``rA`` is valid). ARM compilers regularly use
 318 conditional *load* and *store*, so we should support this in Native
 319 Client. We do so by defining alternate, predicated
 320 pseudo-instructions. Here is a conditional *store*
 321 (*store-if-greater-than*) using this pseudo-instruction sequence:
 322
 323 .. naclcode::
 324   :prettyprint: 0
 325
 326   bicgt  rA,  #0xC0000000
 327   strgt  rX,  [rA, #123]
 328
 329 The Stack Pointer, Thread Pointer, and Program Counter
 330 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 331
 332 Stack Pointer
 333 """""""""""""
 334
 335 In C-like languages, the stack is used to store return addresses during
 336 function calls, as well as any local variables that won't fit in
 337 registers. This makes stack operations very common.
 338
 339 Native Client does not require guard instructions on any *load* or
 340 *store* involving the stack pointer, ``sp``. This improves performance
 341 and reduces code size. However, ARM's stack pointer isn't special: it's
 342 just another register, called ``sp`` only by convention. To make it safe
 343 to use this register as a *load* or *store* address without guards, we
 344 add a rule: ``sp`` must always contain a valid address.
 345
 346 We enforce this rule by restricting the sorts of operations that
 347 programs can use to alter ``sp``. Programs can alter ``sp`` by adding or
 348 subtracting an immediate, as a side-effect of a *load* or *store*:
 349
 350 .. naclcode::
 351   :prettyprint: 0
 352
 353   ldr  rX,  [sp],  #4    ; Load from stack, then add 4 to sp.
 354   str  rX,  [sp, #1234]! ; Add 1234 to sp, then store to stack.
 355
 356 These are safe because, as we mentioned before, the largest immediate
 357 available in a *load* or *store* is ±4095. Even after adding or
 358 subtracting 4095, the stack pointer will still be within the sandbox or
 359 guard regions.
 360
 361 Any other operation that alters ``sp`` must be followed by a guard
 362 instruction. The most common alterations, in practice, are addition and
 363 subtraction of arbitrary integers:
 364
 365 .. naclcode::
 366   :prettyprint: 0
 367
 368   add  sp,  rX
 369   bic  sp,  #0xC0000000
 370
 371 The ``bic`` is similar to the one we used for conditional *load* and
 372 *store*, and serves exactly the same purpose: after it completes, ``sp``
 373 is a valid address.
 374
 375 .. Note::
 376   :class: note
 377
 378   Clever assembly programmers and compilers may want to use this
 379   "trusted" property of ``sp`` to emit more efficient code: in a hot
 380   loop instead of using ``sp`` as a stack pointer it can be temporarily
 381   used as an index pointer (e.g. to traverse an array). This avoids the
 382   extra ``bic`` whenever the pointer is updated in the loop.
 383
 384 Thread Pointer Loads
 385 """"""""""""""""""""
 386
 387 The thread pointer and IRT thread pointer are stored in the trusted
 388 address space. All uses and definitions of ``r9`` from untrusted code
 389 are forbidden except as follows:
 390
 391 .. naclcode::
 392   :prettyprint: 0
 393
 394   ldr Rn, [r9]     ; Load user thread pointer.
 395   ldr Rn, [r9, #4] ; Load IRT thread pointer.
 396
 397 ``pc``-relative Loads
 398 """""""""""""""""""""
 399
 400 By extension, we also allow *load* through the ``pc`` without a
 401 mask. The explanation is quite similar:
 402
 403 * Our control-flow isolation rules mean that the ``pc`` will always
 404   point into the sandbox.
 405 * The maximum immediate displacement that can be used in a
 406   ``pc``-relative *load* is smaller than the width of the guard pages.
 407
 408 We do not allow ``pc``-relative stores, because they look suspiciously
 409 like self-modifying code, nor any addressing mode that would alter the
 410 ``pc`` as a side effect of the *load*.
 411
 412 *Indirect Branch*
 413 ^^^^^^^^^^^^^^^^^
 414
 415 There are two types of control flow on ARM: direct and indirect. Direct
 416 control flow instructions have an embedded target address or
 417 offset. Indirect control flow instructions take their destination
 418 address from a register. The ``b`` (branch) and ``bl``
 419 (*branch-with-link*) instructions are *direct branch* and *call*,
 420 respectively. The ``bx`` (*branch-exchange*) and ``blx``
 421 (*branch-with-link-exchange*) are the indirect equivalents.
 422
 423 Because the program counter ``pc`` is simply another register, ARM also
 424 has many implicit indirect control flow instructions. Programs can
 425 operate on the ``pc`` using *add* or *load*, or even outlandish (and
 426 often specified as having unpredictable-behavior) things like multiply!
 427 In Native Client we ban all such instructions. Indirect control flow is
 428 exclusively through ``bx`` and ``blx``. Because all of ARM's control
 429 flow instructions are called *branch* instructions, we'll use the term
 430 *indirect branch* from here on, even though this includes things like
 431 *virtual call*, *return*, and the like.
 432
 433 The Trouble with Indirection
 434 """"""""""""""""""""""""""""
 435
 436 *Indirect branch* presents two problems for Native Client:
 437
 438 * We must ensure that they don't send execution outside the sandbox.
 439 * We must ensure that they don't break up the instructions inside a
 440   pseudo-instruction, by landing on the second one.
 441
 442 .. Note::
 443   :class: note
 444
 445   On the x86 architectures we must also ensure that they don't land
 446   inside an instruction. This is unnecessary on ARM, where all
 447   instructions are 32 bits wide.
 448
 449 Checking both of these for *direct branch* is easy: the validator just
 450 pulls the (fixed) target address out of the instruction and checks what
 451 it points to.
 452
 453 The Native Client Solution: "Bundles"
 454 """""""""""""""""""""""""""""""""""""
 455
 456 For *indirect branch*, we can address the first problem by simply
 457 masking some high-order bits off the address, like we did for *load* and
 458 *store*. The second problem is more subtle. Detecting every possible
 459 route that every *indirect branch* might take is difficult. Instead, we
 460 take the approach pioneered by the original Native Client: we restrict
 461 the possible places that any *indirect branch* can land. On Native
 462 Client for ARM, *indirect branch* can target any address that has its
 463 bottom four bits clear---any address that's ``0 mod 16``. We call these
 464 16-byte chunks of code "bundles". The validator makes sure that no
 465 pseudo-instruction straddles a bundle boundary. Compilers must pad with
 466 ``nop``\ s to ensure that every pseudo-instruction fits entirely inside
 467 one bundle.
 468
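The straddling check and the padding it implies can be sketched as below. The function names are hypothetical, and the sketch assumes 4-byte-aligned addresses and pseudo-instructions of at most four instructions.

```python
BUNDLE_SIZE = 16        # bytes per bundle
INSTRUCTION_SIZE = 4    # every A32 instruction is 4 bytes


def straddles_bundle(address, instruction_count):
    """True if a pseudo-instruction starting at `address` crosses a bundle
    boundary, i.e. its first and last instructions sit in different bundles."""
    first = address // BUNDLE_SIZE
    last = (address + (instruction_count - 1) * INSTRUCTION_SIZE) // BUNDLE_SIZE
    return first != last


def nops_needed(address, instruction_count):
    """Number of nops to insert so the pseudo-instruction starts a new bundle."""
    if not straddles_bundle(address, instruction_count):
        return 0
    return (BUNDLE_SIZE - address % BUNDLE_SIZE) // INSTRUCTION_SIZE
```

For example, a two-instruction guard-plus-load pair starting at ``0x100C`` would straddle the boundary at ``0x1010``, so a compiler would emit one ``nop`` and start the pair at ``0x1010`` instead.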
 469 Here is the *indirect branch* pseudo-instruction. As you can see, it
 470 clears the top two and bottom four bits of the address:
 471
 472 .. naclcode::
 473   :prettyprint: 0
 474
 475   bic  rA,  #0xC000000F
 476   bx   rA
 477
 478 This particular pseudo-instruction (a ``bic`` followed by a ``bx``) is
 479 used for computed jumps in switch tables and returning from functions,
 480 among other uses. Recall that, under ARM's modified immediate rules, we
 481 can fit the constant ``0xC000000F`` into the ``bic`` instruction's
 482 immediate field: ``0xC000000F`` is the 8-bit constant ``0xFC``, rotated
 483 right by 4 bits.
 484
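As a quick check of what this mask guarantees, the following Python sketch (with a helper name of our choosing) applies it to a few arbitrary register values:

```python
BRANCH_MASK = 0xC000000F    # the immediate used in `bic rA, #0xC000000F`


def sandbox_branch_target(address):
    """Model `bic rA, #0xC000000F` on a 32-bit register value."""
    return address & ~BRANCH_MASK & 0xFFFFFFFF


for address in (0x00020000, 0x12345679, 0x8000ABCD, 0xFFFFFFFF):
    target = sandbox_branch_target(address)
    assert target <= 0x3FFFFFFF    # inside the sandbox
    assert target % 16 == 0        # start of a bundle
```

Because the bottom four bits are cleared, bit 0 of the target is always zero as well.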
 485 The other useful variant is the *indirect branch-with-link*, which is
 486 the ARM equivalent to *call*:
 487
 488 .. naclcode::
 489   :prettyprint: 0
 490
 491   bic  rA,  #0xC000000F
 492   blx  rA
 493
 494 This is used for indirect function calls---commonly seen in C++ programs
 495 as virtual calls, but also for calling function pointers in C.
 496
 497 Note that both *indirect branch* pseudo-instructions use ``bic``, rather
 498 than the ``tst`` instruction we allow for *load* and *store*. There are
 499 two reasons for this:
 500
 501 1. Conditional *branch* is very common. Much more common than
 502    conditional *load* and *store*. If we supported an alternative
 503    ``tst``-based sequence for *branch*, it would be rare.
 504 2. There's no performance benefit to using ``tst`` here on modern ARM
 505    chips. *Branch* consumes its operands later in the pipeline than
 506    *load* and *store* (since it doesn't have to generate an address,
 507    etc.), so this sequence doesn't stall.
 508
 509 .. Note::
 510   :class: note
 511
 512   At this point astute readers are wondering what the ``x`` in ``bx``
 513   and ``blx`` means. We told you it stood for "exchange", but exchange
 514   to what? ARM, for all the reduced-ness of its instruction set, can
 515   change execution mode from A32 (ARM) to T32 (Thumb) and back with
 516   these *branch* instructions, called *interworking branch*. Recall that
 517   A32 instructions are 32 bits wide, and T32 instructions are a mix of
 518   16-bit and 32-bit encodings. The destination address given to a
 519   *branch* therefore cannot sensibly have its bottom bit set in either
 520   instruction set: that would be an unaligned instruction in both cases,
 521   and ARM simply doesn't support this. The bottom bit for the *indirect
 522   branch* was therefore cleverly recycled by the ARM architecture to
 523   mean "switch to T32 mode" when set!
 524
 525   As you've figured out by now, Native Client's sandbox wouldn't be very
 526   happy if A32 instructions were to be executed as T32 instructions: who
 527   knows what they correspond to?  A malicious person could craft valid
 528   A32 code that's actually very naughty T32 code, somewhat like forming
 529   a sentence that happens to be valid in English and French but with
 530   completely different meanings, complimenting the reader in one
 531   language and insulting them in the other.
 532
 533   You've figured out by now that the bundle alignment restrictions of
 534   the Native Client sandbox already take care of making this travesty
 535   impossible: by masking off the bottom 4 bits of the destination the
 536   interworking nature of ARM's *indirect branch* is completely avoided.
 537
 538 *Call* and *Return*
 539 """""""""""""""""""
 540
 541 On ARM, there is no *call* or *return* instruction. A *call* is simply a
 542 *branch* that just happens to load a return address into ``lr``, the link
 543 register. If the called function is a leaf (that is, if it calls no
 544 other functions before returning), it simply branches to the address
 545 stored in ``lr`` to *return* to its caller:
 546
 547 .. naclcode::
 548   :prettyprint: 0
 549
 550   bic  lr,  #0xC000000F
 551   bx   lr
 552
 553 If the function called other functions, however, it had to spill ``lr``
 554 onto the stack. On x86, this is done implicitly, but it is explicit on
 555 ARM:
 556
 557 .. naclcode::
 558   :prettyprint: 0
 559
 560   push { lr }
 561   ; Some code here...
 562   pop  { lr }
 563   bic  lr,  #0xC000000F
 564   bx   lr
 565
 566 There are two things to note about this code.
 567
 568 1. As we mentioned before, we don't allow arbitrary instructions to
 569    write to the Program Counter, ``pc``. Thus, while a traditional ARM
 570    program might have popped directly into ``pc`` to end the function,
 571    we require a pop into a register, followed by a pseudo-instruction.
 572 2. Function returns really are just *indirect branch*, with the same
 573    restrictions. This means that functions can only return to addresses
 574    that are bundle-aligned: ``0 mod 16``.
 575
 576 The implication here is that a *call*\ ---the *branch* that enters
 577 functions---must be placed at the end of the bundle, so that the return
 578 address it generates is ``0 mod 16``. Otherwise, when we clear the
 579 bottom four bits, the program would enter an infinite loop!  (Native
 580 Client doesn't try to prevent infinite loops, but the validator actually
 581 does check the alignment of calls. This is because, when we were writing
 582 the compiler, it was annoying to find out our calls were in the wrong
 583 place by having the program run forever!)
 584
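The infinite-loop scenario can be reproduced with a little arithmetic. In this sketch the helper names are ours, and the return address is modelled as the address of the instruction following the call.

```python
BUNDLE_SIZE = 16
BRANCH_MASK = 0xC000000F


def return_lands_at(call_address):
    """Address reached when the callee returns through the masked lr."""
    link = call_address + 4                    # value the call leaves in lr
    return link & ~BRANCH_MASK & 0xFFFFFFFF    # after `bic lr, #0xC000000F`


def call_is_well_placed(call_address):
    """True if the call occupies the last instruction slot of its bundle."""
    return call_address % BUNDLE_SIZE == BUNDLE_SIZE - INSTRUCTION_SLOT


INSTRUCTION_SLOT = 4    # bytes per A32 instruction
```

A call at ``0x2000C`` returns to ``0x20010``, the start of the next bundle. A call at ``0x20004`` would instead return to ``0x20000``, which is before the call itself, so execution would reach the same call again.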
 585 .. Note::
 586   :class: note
 587
 588   Properly balancing the CPU's *call*/*return* actually allows it to
 589   perform much better by allowing it to speculatively execute the return
 590   address' code. For more information on ARM's *call*/*return* stack see
 591   ARM's technical reference manual.
 592
 593 Literal Pools and Data Bundles
 594 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 595
 596 In the section where we described the ARM architecture, we mentioned
 597 ARM's unusual immediate forms. To restate:
 598
 599 * ARM instructions are fixed-length, 32-bits, so we can't have an
 600   instruction that includes an arbitrary 32-bit constant.
 601 * Many ARM instructions can include a modified immediate constant, which
 602   is flexible, but limited.
 603 * For any other value (particularly addresses), ARM programs explicitly
 604   load constants from inside the code itself.
 605
 606 .. Note::
 607   :class: note
 608
 609   ARMv7 introduces some instructions, ``movw`` and ``movt``, that try to
 610   address this by letting us directly load larger constants. Our
 611   toolchain uses this capability in some cases.
 612
 613 Here's a typical example of the use of a literal pool. ARM assemblers
 614 typically hide the details---this is the sort of code you'd see produced
 615 by a disassembler, but with more comments.
 616
 617 .. naclcode::
 618   :prettyprint: 0
 619
 620   ; C equivalent: "table[3] = 4"
 621   ; 'table' is a static array of bytes.
 622   ldr   r0,  [pc, #124]    ; Load the address of the 'table',
 623                            ; "124" is the offset from here
 624                            ; to the constant below.
 625   add   r0,  #3            ; Add the immediate array index.
 626   mov   r1,  #4            ; Get the constant '4' into a register.
 627   bic   r0,  #0xC0000000   ; Mask our array address.
 628   strb  r1,  [r0]          ; Store one byte.
 629   ; ...
 630   .word table              ; Constant referenced above.
 631
 632 Because table is a static array, the compiler knew its address at
 633 compile-time---but the address didn't fit in a modified immediate. (Most
 634 don't.) So, instead of loading an immediate into ``r0`` with a ``mov``,
 635 we stashed the address in the code, generated its address using ``pc``,
 636 and loaded the constant. ARM compilers will typically group all the
 637 embedded data together into a literal pool. These typically live just
 638 past the end of functions, where they won't be executed.
 639
 640 This is an important trick in ARM code, so it's important to support it
 641 in Native Client... but there's a potential flaw. If we let programs
 642 contain arbitrary data, mingled in with the code, couldn't they hide
 643 malicious instructions this way?
 644
 645 The answer is no, because the validator disassembles the entire
 646 executable region of the program, without regard to whether the
 647 programmer said a certain chunk was code or data. But this brings the
 648 opposite problem: what if the program needs to contain a certain
 649 constant that just happens to encode a malicious instruction?  We want
 650 to allow this, but we have to be certain it will never be executed as
 651 code!
 652
 653 Data Bundles to the Rescue
 654 """"""""""""""""""""""""""
 655
 656 As we discussed in the last section, ARM code in Native Client is
 657 structured in 16-byte bundles. We allow literal pools by putting them in
 658 special bundles, called data bundles. Each data bundle can contain 12
 659 bytes of arbitrary data, and the program can have as many data bundles
 660 as it likes.
 661
 662 Each data bundle starts with a breakpoint instruction, ``bkpt``. This
 663 way, if an *indirect branch* tries to enter the data bundle, the process
 664 will take a fault and the trusted runtime will intervene (by terminating
 665 the program). For example:
 666
 667 .. naclcode::
 668   :prettyprint: 0
 669
 670   bkpt #0x5BE0          ; Must be aligned 0 mod 16!
 671   .word 0xDEADBEEF      ; Arbitrary constants are A-OK.
 672   svc #30               ; Trying to make a syscall? OK!
 673   str r0, [r1]          ; Unmasked stores are fine too.
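
As a rough sketch of how a toolchain might lay out a literal pool under these rules, the Python below packs 32-bit constants into data bundles of one roadblock word plus three data words. The value ``0xE125BE70`` is the A32 encoding of ``bkpt #0x5BE0``; the function name and the zero padding word are assumptions made for illustration, not something the sandbox prescribes.

```python
BUNDLE_WORDS = 4                # a 16-byte bundle holds four 32-bit words
DATA_BUNDLE_HEAD = 0xE125BE70   # A32 encoding of `bkpt #0x5BE0`
PAD_WORD = 0x00000000           # hypothetical filler for a partial bundle


def pack_data_bundles(literals):
    """Group 32-bit literals into data bundles: one bkpt plus 12 bytes of data."""
    per_bundle = BUNDLE_WORDS - 1
    bundles = []
    for i in range(0, len(literals), per_bundle):
        chunk = list(literals[i:i + per_bundle])
        # Pad the last bundle so every bundle is exactly 16 bytes long.
        chunk += [PAD_WORD] * (per_bundle - len(chunk))
        bundles.append([DATA_BUNDLE_HEAD] + chunk)
    return bundles
```

Four literals therefore cost two bundles (32 bytes), because each bundle gives up its first word to the ``bkpt``.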
 674
 675 So, we have a way for programs to create an arbitrary, even dangerous,
 676 chunk of data within their code. We can prevent *indirect branch* from
 677 entering it. The ``bkpt`` also prevents fall-through from the code just
 678 before it. But what about *direct branch* straight into the
 679 middle?
 680
 681 The validator detects all data bundles (because this ``bkpt`` has a
 682 special encoding) and marks them as off-limits for *direct branch*. If
 683 it finds a *direct branch* into a data bundle, the entire program is
 684 rejected as unsafe. Because *direct branch* cannot be modified at
 685 runtime, the data bundles cannot be executed.
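
The check described above can be sketched in a few lines of Python. This is not the real validator: it is a simplified model that assumes the code is given as a list of 32-bit words starting at a 16-byte-aligned ``base``, that only A32 ``b``/``bl`` count as *direct branch*, and that the contents of data bundles are skipped rather than decoded. All function names are hypothetical.

```python
BUNDLE_SIZE = 16
DATA_BUNDLE_HEAD = 0xE125BE70   # A32 encoding of `bkpt #0x5BE0`


def data_bundle_addresses(words, base):
    """Start addresses of bundles whose first word is the roadblock bkpt."""
    return {base + 4 * i
            for i in range(0, len(words), 4)
            if words[i] == DATA_BUNDLE_HEAD}


def direct_branch_target(word, address):
    """Target of an A32 `b`/`bl` located at `address`, or None otherwise."""
    if (word >> 28) == 0xF or ((word >> 25) & 0x7) != 0b101:
        return None
    imm24 = word & 0x00FFFFFF
    if imm24 & 0x800000:          # sign-extend the 24-bit offset
        imm24 -= 1 << 24
    return (address + 8 + (imm24 << 2)) & 0xFFFFFFFF


def branches_into_data(words, base):
    """Addresses of direct branches that land anywhere inside a data bundle."""
    data = data_bundle_addresses(words, base)
    offenders = []
    for i, word in enumerate(words):
        address = base + 4 * i
        if (address & ~(BUNDLE_SIZE - 1)) in data:
            continue              # data bundle contents are not treated as code
        target = direct_branch_target(word, address)
        if target is not None and (target & ~(BUNDLE_SIZE - 1)) in data:
            offenders.append(address)
    return offenders
```

In this model a program would be rejected whenever ``branches_into_data`` returns a non-empty list.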
 686
 687 .. Note::
 688   :class: note
 689
 690   Clever readers may wonder: why use ``bkpt #0x5BE0``? That seems
 691   awfully specific when you just need a special "roadblock" instruction!
 692   Quite true, young Padawan! It happens that this odd ``bkpt``
 693   instruction is encoded as ``0xE125BE70`` in A32, and in T32 the
 694   ``bkpt`` instruction is encoded as ``0xBExx`` (where ``xx`` could be
 695   any 8-bit immediate, say ``0x70``) and ``0xE125`` encodes the *branch*
 696   instruction ``b.n #0x250``. The special roadblock instruction
 697   therefore doubles as a roadblock in T32, if anything were to go so
 698   awry that we tried to execute it as a T32 instruction! Much defense,
 699   such depth, wow!
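
The encoding arithmetic in the note can be checked with a short Python sketch. It assumes the A32 ``bkpt`` layout (condition ``1110``, with the 16-bit immediate split into a 12-bit and a 4-bit field), little-endian memory, and the 16-bit T32 encodings ``0xBExx`` for ``bkpt`` and ``11100`` in the top five bits for an unconditional branch. The helper names are ours.

```python
def encode_a32_bkpt(imm16: int) -> int:
    """A32 BKPT: imm16[15:4] goes to bits 19:8, imm16[3:0] to bits 3:0."""
    assert 0 <= imm16 <= 0xFFFF
    return 0xE1200070 | ((imm16 >> 4) << 8) | (imm16 & 0xF)


def as_t32_halfwords(a32_word: int):
    """Halfwords, in memory order, that a Thumb decoder would fetch
    from a little-endian A32 word."""
    return a32_word & 0xFFFF, a32_word >> 16


def is_t32_bkpt(halfword: int) -> bool:
    return (halfword & 0xFF00) == 0xBE00


def is_t32_unconditional_branch(halfword: int) -> bool:
    return (halfword >> 11) == 0b11100
```

Running ``as_t32_halfwords(encode_a32_bkpt(0x5BE0))`` gives the pair ``0xBE70`` and ``0xE125``, so a Thumb decoder would hit the breakpoint first.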
 700
 701 Trampolines and Memory Layout
 702 -----------------------------
 703
 704 So far, the rules we've described make for boring programs: they can't
 705 communicate with the outside world!
 706
 707 * The program can't call an external library, or the operating system,
 708   even to do something simple like draw some pixels on the screen.
 709 * It also can't read or write memory outside of its dedicated sandbox,
 710   so communicating that way is right out.
 711
 712 We fix this by allowing the untrusted program to call into the trusted
 713 runtime using a trampoline. A trampoline is simply a short stretch of
 714 code, placed by the trusted runtime at a known location within the
 715 sandbox, that is permitted to do things the untrusted program can't.
 716
 717 Even though trampolines are inside the sandbox, the untrusted program
 718 can't modify them: the trusted runtime marks them read-only. It also
 719 can't do anything clever with the special instructions inside the
 720 trampoline---for example, call it at a slightly offset address to bypass
 721 some checks---because the validator only allows trampolines to be
 722 reached by *indirect branch* (or *branch-with-link*). We structure the
 723 trampolines carefully so that they're safe to enter at any ``0 mod 16``
 724 address.
 725
 726 The validator can detect attempts to use the trampolines because they're
 727 loaded at a fixed location in memory. Let's look at the memory map of
 728 the Native Client sandbox.
 729
 730 Memory Map
 731 ^^^^^^^^^^
 732
 733 The ARM sandbox is always at virtual address ``0``, and is exactly 1GiB
 734 in size. This includes the untrusted program's code and data, the
 735 trampolines, and a small guard region to detect null pointer
 736 dereferences. In practice, the sandbox takes up a bit more address
 737 space than this, because of the need for additional guard regions at
 738 either end of the sandbox.
 739
 740 +----------------+-------+-------------------+--------------------------------------------------------------------+
 741 | Address        | Size  | Name              | Purpose                                                            |
 742 +================+=======+===================+====================================================================+
 743 | ``-0x2000``    |  8KiB | Bottom Guard      | Keeps negative-displacement *load* or *store* from escaping.       |
 744 +----------------+-------+-------------------+--------------------------------------------------------------------+
 745 | ``0``          | 64KiB | Null Guard        | Catches null pointer dereferences, guards against kernel exploits. |
 746 +----------------+-------+-------------------+--------------------------------------------------------------------+
 747 | ``0x10000``    | 64KiB | Trampolines       | Up to 2048 unique syscall entry points.                            |
 748 +----------------+-------+-------------------+--------------------------------------------------------------------+
 749 | ``0x20000``    | ~1GiB | Untrusted Sandbox | Contains untrusted code, followed by its heap/stack/memory.        |
 750 +----------------+-------+-------------------+--------------------------------------------------------------------+
 751 | ``0x40000000`` |  8KiB | Top Guard         | Keeps positive-displacement *load* or *store* from escaping.       |
 752 +----------------+-------+-------------------+--------------------------------------------------------------------+
 753
 754 Within the trampolines, the untrusted program can call any address
 755 that's ``0 mod 16``. However, only even slots are used, so useful
 756 trampolines are always ``0 mod 32``. If the program calls an odd slot,
 757 it will fault, and the trusted runtime will shut it down.
 758
 759 .. Note::
 760   :class: note
 761
 762   This is a bit of speculative flexibility. While the current bundle
 763   size of Native Client on ARM is 16 bytes, we've considered the
 764   possibility of optional 32-byte bundles, to enable certain compiler
 765   improvements. While this option isn't available to untrusted programs
 766   today, we're trying to keep the system "32-byte clean".
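
The memory map and the trampoline slot rule can be summarized in a small Python sketch. The region boundaries and the 32-byte stride come from the table and text above; the names ``REGIONS``, ``region_of``, ``trampoline_address`` and ``is_useful_trampoline_entry`` are hypothetical.

```python
# (start, end-exclusive, name), taken from the memory map table.
REGIONS = [
    (-0x2000, 0x0, "Bottom Guard"),
    (0x0, 0x10000, "Null Guard"),
    (0x10000, 0x20000, "Trampolines"),
    (0x20000, 0x40000000, "Untrusted Sandbox"),
    (0x40000000, 0x40002000, "Top Guard"),
]

TRAMPOLINE_BASE = 0x10000
TRAMPOLINE_STRIDE = 32      # only even 16-byte slots are used
TRAMPOLINE_COUNT = 2048     # 64KiB / 32 bytes


def region_of(address):
    """Name of the region containing `address`, or None if outside the map."""
    for start, end, name in REGIONS:
        if start <= address < end:
            return name
    return None


def trampoline_address(slot):
    """Entry address of trampoline number `slot`."""
    if not 0 <= slot < TRAMPOLINE_COUNT:
        raise ValueError("no such trampoline slot")
    return TRAMPOLINE_BASE + slot * TRAMPOLINE_STRIDE


def is_useful_trampoline_entry(address):
    """True for the `0 mod 32` addresses that hold real trampolines."""
    return (region_of(address) == "Trampolines"
            and address % TRAMPOLINE_STRIDE == 0)
```

An address such as ``0x10010`` is still ``0 mod 16`` and inside the trampoline region, but it is an odd slot, so ``is_useful_trampoline_entry`` returns ``False`` for it.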
 767
 768 Inside a Trampoline
 769 ^^^^^^^^^^^^^^^^^^^
 770
 771 When we introduced trampolines, we mentioned that they can do things
 772 that untrusted programs can't. To be more specific, trampolines can jump
 773 to locations outside the sandbox. On ARM, this is all they do. Here's a
 774 typical trampoline fragment on ARM:
 775
 776 .. naclcode::
 777   :prettyprint: 0
 778
 779   ; Even trampoline bundle:
 780   push  { r0-r3 }     ; Save arguments that may be in registers.
 781   push  { lr }        ; Save the untrusted return address,
 782                       ; separate step because it must be on top.
 783   ldr   r0,  [pc, #4] ; Load the destination address from
 784                       ; the next bundle.
 785   blx   r0            ; Go!
 786   ; The odd trampoline that immediately follows:
 787   bkpt #0x5BE0        ; Prevent entry to this data bundle.
 788   .word address_of_routine
 789
 790 The only odd thing here is that we push the incoming value of ``lr``,
 791 and then use ``blx``---not ``bx``---to escape the sandbox. This is
 792 because, in practice, all trampolines jump to the same routine in the
 793 trusted runtime, called the syscall hook. It uses the return address
 794 produced by the final ``blx`` instruction to determine which trampoline
 795 was called.
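
The address arithmetic behind this fragment can be sketched in Python. The offsets follow from the listing above (``ldr`` is the third instruction and ``blx`` the fourth), and the sketch assumes trampolines start at ``0x10000`` with a 32-byte stride. How the real syscall hook maps a return address back to a trampoline is not shown here, so ``slot_from_return_address`` is only one plausible way to do it.

```python
TRAMPOLINE_BASE = 0x10000
TRAMPOLINE_STRIDE = 32
LDR_OFFSET = 8      # push, push, then ldr
BLX_OFFSET = 12     # blx is the last instruction of the even bundle


def ldr_literal_address(slot):
    """Address read by `ldr r0, [pc, #4]`: pc reads as the ldr address + 8."""
    ldr_address = TRAMPOLINE_BASE + slot * TRAMPOLINE_STRIDE + LDR_OFFSET
    return ldr_address + 8 + 4


def return_address_for_slot(slot):
    """Value `blx` leaves in lr: the address of the instruction after it."""
    return TRAMPOLINE_BASE + slot * TRAMPOLINE_STRIDE + BLX_OFFSET + 4


def slot_from_return_address(lr):
    """Recover the trampoline number from the return address (assumed scheme)."""
    offset = lr - TRAMPOLINE_BASE - (BLX_OFFSET + 4)
    if offset < 0 or offset % TRAMPOLINE_STRIDE:
        raise ValueError("not a trampoline return address")
    return offset // TRAMPOLINE_STRIDE
```

Note that the return address is the start of the odd bundle, that is, the ``bkpt``, and the word loaded by ``ldr`` sits 4 bytes after it.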
 796
 797 Loose Ends
 798 ----------
 799
 800 Forbidden Instructions
 801 ^^^^^^^^^^^^^^^^^^^^^^
 802
 803 To complete the sandbox, the validator ensures that the program does not
 804 try to use certain forbidden instructions.
 805
 806 * We forbid instructions that directly interact with the operating
 807   system by going around the trusted runtime. We prevent this to limit
 808   the functionality of the untrusted program, and to ensure portability
 809   across operating systems.
 810 * We forbid instructions that change the processor's execution mode to
 811   Thumb, ThumbEE, or Jazelle. This would cause the code to be
 812   interpreted differently than the validator's original 32-bit ARM
 813   disassembly, so the validator results might be invalidated.
 814 * We forbid instructions that aren't available to user code (i.e., can
 815   only be used by an operating system kernel). This is purely out of
 816   paranoia, because the hardware should prevent the instructions from
 817   working. Essentially, we consider it "suspicious" if a program
 818   contains these instructions---it might be trying to exploit a hardware
 819   bug.
 820 * We forbid instructions, or variants of instructions, that are
 821   implementation-defined ("unpredictable") or deprecated in the ARMv7-A
 822   architecture manual.
 823 * Finally, we forbid a small number of instructions, such as ``setend``,
 824   purely out of paranoia. It's easier to loosen the validator's
 825   restrictions than to tighten them, so we err on the side of rejecting
 826   safe instructions.
 827
 828 If an instruction can't be decoded at all within the ARMv7-A instruction
 829 set specification, it is forbidden.
 830
 831 .. Note::
 832   :class: note
 833
 834   Here is a list of instructions currently forbidden for security
 835   reasons (that is, excluding deprecated or undefined instructions):
 836
 837   * ``BLX`` (immediate): always changes to Thumb mode.
 838   * ``BXJ``: always changes to Jazelle mode.
 839   * ``CPS``: not available to user code.
 840   * ``LDM``, exception return version: not available to user code.
 841   * ``LDM``, kernel version: not available to user code.
 842   * ``LDR*T`` (unprivileged load operations): theoretically harmless,
 843     but suspicious when found in user code. Use ``LDR`` instead.
 844   * ``MSR``, kernel version: not available to user code.
 845   * ``RFE``: not available to user code.
 846   * ``SETEND``: theoretically harmless, but suspicious when found in
 847     user code. May make some future validator extensions difficult.
 848   * ``SMC``: not available to user code.
 849   * ``SRS``: not available to user code.
 850   * ``STM``, kernel version: not available to user code.
 851   * ``STR*T`` (unprivileged store operations): theoretically harmless,
 852     but suspicious when found in user code. Use ``STR`` instead.
 853   * ``SVC``/``SWI``: allows direct operating system interaction.
 854   * Any unassigned hint instruction: difficult to reason about, so
 855     treated as suspicious.
 856
 857   More details are available in the `ARMv7 instruction table definition
 858   <http://src.chromium.org/viewvc/native_client/trunk/src/native_client/src/trusted/validator_arm/armv7.table>`_.
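
To give a feel for what such checks look like, here is a Python sketch that recognizes four of the instructions listed above by mask and value. The bit patterns are the standard A32 encodings as we understand them. The table is deliberately incomplete and is not derived from the real instruction table, and ``forbidden_name`` is a hypothetical helper.

```python
# (mask, value, conditional, name). "conditional" patterns only apply
# when the condition field (bits 31:28) is not 0b1111.
FORBIDDEN_PATTERNS = [
    (0xFE000000, 0xFA000000, False, "BLX (immediate)"),
    (0xFFFFFDFF, 0xF1010000, False, "SETEND"),
    (0x0FFFFFF0, 0x012FFF20, True, "BXJ"),
    (0x0F000000, 0x0F000000, True, "SVC"),
]


def forbidden_name(word):
    """Name of the forbidden instruction `word` encodes, or None."""
    in_conditional_space = (word >> 28) != 0xF
    for mask, value, conditional, name in FORBIDDEN_PATTERNS:
        if conditional and not in_conditional_space:
            continue
        if (word & mask) == value:
            return name
    return None
```

For example, ``forbidden_name(0xEF00001E)`` reports ``"SVC"`` for ``svc #30``, while ``bx lr`` (``0xE12FFF1E``) matches none of the patterns.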
 859
 860 Coprocessors
 861 ^^^^^^^^^^^^
 862
 863 ARM has traditionally added new instruction set features through
 864 coprocessors. Coprocessors are accessed through a small set of
 865 instructions, and often have their own register files. Floating point
 866 and the NEON vector extensions are both implemented as coprocessors, as
 867 is the MMU.
 868
 869 We're confident that the side-effects of coprocessors in slots 10 and 11
 870 (that is, floating point, NEON, etc.) are well-understood. These are in
 871 the coprocessor space reserved by ARM Ltd. for their own extensions
 872 (``CP8``--\ ``CP15``), and are unlikely to change significantly. So, we
 873 allow untrusted code to use coprocessors 10 and 11, and we mandate the
 874 presence of at least VFPv3 and NEON/AdvancedSIMD. Multiprocessor
 875 Extension, VFPv4, FP16 and other extensions are allowed but not
 876 required, and may fail on processors that do not support them. It is
 877 therefore the program's responsibility to validate their availability
 878 before executing them.
 879
 880 We don't allow access to any other ARM-reserved coprocessor
 881 (``CP8``--\ ``CP9`` or ``CP12``--\ ``CP15``). It's possible that read
 882 access to ``CP15`` might be useful, and we might allow it in the
 883 future---but again, it's easier to loosen the restrictions than tighten
 884 them, so we ban it for now.
 885
 886 We do not, and probably never will, allow access to the vendor-specific
 887 coprocessor space, ``CP0``--\ ``CP7``. We're simply not confident in our
 888 ability to model the operations on these coprocessors, given that
 889 vendors often leave them poorly-specified. Unfortunately this eliminates
 890 some legacy floating point and vector implementations, but these are
 891 superseded on ARMv7-A parts anyway.
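
The coprocessor policy boils down to an allowlist on one 4-bit field. The Python sketch below assumes the generic A32 coprocessor encodings (``CDP``, ``MCR``, ``MRC``, ``LDC``, ``STC``, ``MCRR``, ``MRRC``), where the coprocessor number sits in bits 11:8. It does not model the unconditional instruction space, where many NEON instructions live, and the function names are hypothetical.

```python
ALLOWED_COPROCESSORS = {10, 11}   # floating point and NEON


def coprocessor_number(word):
    """Coprocessor field of a generic coprocessor instruction, else None."""
    if (word >> 28) == 0xF:
        return None               # unconditional space: not modeled here
    op = (word >> 24) & 0xF
    # 0b1100 / 0b1101: LDC, STC, MCRR, MRRC; 0b1110: CDP, MCR, MRC.
    # 0b1111 is SVC, which is not a coprocessor instruction.
    if op in (0b1100, 0b1101, 0b1110):
        return (word >> 8) & 0xF
    return None


def touches_forbidden_coprocessor(word):
    """True if `word` is a coprocessor instruction outside CP10 and CP11."""
    cp = coprocessor_number(word)
    return cp is not None and cp not in ALLOWED_COPROCESSORS
```

Under this model ``0xEE100F10`` (``mrc p15, 0, r0, c0, c0, 0``) is flagged because it names ``CP15``, while an encoding whose coprocessor field is 10 or 11 passes.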
 892
 893 Validator Code
 894 ^^^^^^^^^^^^^^
 895
 896 By now you're itching to see the sandbox validator's code and dissect
 897 it. You'll have a disappointing read: at fewer than 500 lines of code
 898 `validator.cc
 899 <http://src.chromium.org/viewvc/native_client/trunk/src/native_client/src/trusted/validator_arm/validator.cc>`_
 900 is quite simple to understand and much shorter than this document. It's
 901 of course dependent on the `ARMv7 instruction table definition
 902 <http://src.chromium.org/viewvc/native_client/trunk/src/native_client/src/trusted/validator_arm/armv7.table>`_,
 903 which teaches it about the ARMv7 instruction set.