src/native_client_sdk/src/doc/reference/sandbox_internals/arm-32-bit-sandbox.rst

   1 .. _arm-32-bit-sandbox:
   2
   3 ==================
   4 ARM 32-bit Sandbox
   5 ==================
   6
   7 Native Client for ARM is a sandboxing technology for running
   8 programs---even malicious ones---safely, on computers that use 32-bit
   9 ARM processors. The ARM sandbox is an extension of earlier work on
  10 Native Client for x86 processors. Security is provided with a low
  11 performance overhead of about 10% over regular ARM code, and as you'll
  12 see in this document the sandbox model is beautifully simple, meaning
  13 that the trusted codebase is much easier to validate.
  14
  15 As an implementation detail, the Native Client 32-bit ARM sandbox is
  16 currently used by Portable Native Client to execute code on 32-bit ARM
  17 machines in a safe manner. The portable bitcode contained in a **pexe**
  18 is translated to a 32-bit ARM **nexe** before execution. This may change
  19 in the future: Portable Native Client doesn't necessarily need this
  20 sandbox to execute code on ARM. Note that the Portable Native Client
  21 compiler itself is also untrusted: it too runs in the ARM sandbox
  22 described in this document.
  23
  24 On this page, we describe how Native Client works on 32-bit ARM. We
  25 assume no prior knowledge about the internals of Native Client, on x86
  26 or any other architecture, but we do assume some familiarity with
  27 assembly languages in general.
  28
  29 .. contents::
  30    :local:
  31    :backlinks: none
  32    :depth: 3
  33
  34 An Introduction to the ARM Architecture
  35 =======================================
  36
  37 In this section, we summarize the relevant parts of the ARM processor
  38 architecture.
  39
  40 About ARM and ARMv7-A
  41 ---------------------
  42
  43 ARM is one of the older commercial "RISC" processor designs, dating back
  44 to the early 1980s. Today, it is used primarily in embedded systems:
  45 everything from toys, to home automation, to automobiles. However, its
  46 most visible use is in cellular phones, tablets and some
  47 laptops.
  48
  49 Through the years, there have been many revisions of the ARM
  50 architecture, written as ARMv\ *X* for some version *X*. Native Client
  51 specifically targets the ARMv7-A architecture commonly used in high-end
  52 phones and smartbooks. This revision, defined in the mid-2000s, adds a
  53 number of useful instructions, and specifies some portions of the system
  54 that used to be left to individual chip manufacturers. Critically,
  55 ARMv7-A specifies the "eXecute Never" bit, or *XN*. This pagetable
  56 attribute lets us mark memory as non-executable. Our security relies on
  57 the presence of this feature.
  58
  59 ARMv8 adds a new 64-bit instruction set architecture called A64, while
  60 also enhancing the 32-bit A32 ISA. For Native Client's purposes the A32
  61 ISA is equivalent to the ARMv7 ARM ISA, albeit with a few new
  62 instructions. This document only discusses the 32-bit A32 instruction
  63 set: A64 would require a different sandboxing model.
  64
  65 ARM Programmer's Model
  66 ----------------------
  67
  68 While modern ARM chips support several instruction encodings, 32-bit
  69 Native Client on ARM focuses on a single one: a fixed-width encoding
  70 where every instruction is 32 bits wide, called A32 (previously, and
  71 confusingly, called simply ARM). Thumb, Thumb2 (now confusingly called
  72 T32), Jazelle, ThumbEE and such aren't supported by Native Client. This
  73 dramatically simplifies some of our analyses, as we'll see later. Nearly
  74 every instruction can be conditionally executed based on the contents of
  75 a dedicated condition code register.
  76
  77 ARM processors have 16 general-purpose registers used for integer and
  78 memory operations, written ``r0`` through ``r15``. Of these, two have
  79 special roles baked in to the hardware:
  80
  81 * ``r14`` is the Link Register. The ARM *call* instruction
  82   (*branch-with-link*) doesn't use the stack directly. Instead, it
  83   stashes the return address in ``r14``. In other circumstances, ``r14``
  84   can be (and is!) used as a general-purpose register. When ``r14`` is
  85   playing its Link Register role, it's referred to as ``lr``.
  86 * ``r15`` is the Program Counter. While it can be read and written like
  87   any other register, setting it to a new value will cause execution to
  88   jump to a new address. Using it in some circumstances is also
  89   undefined by the ARM architecture. Because of this, ``r15`` is never
  90   used for anything else, and is referred to as ``pc``.
  91
  92 Other registers are given roles by convention. The only important
  93 registers to Native Client are ``r9`` and ``r13``, which are used as the
  94 Thread Pointer location and Stack Pointer, respectively. When playing
  95 these roles, they're referred to as ``tp`` and ``sp``.
  96
  97 Like other RISC-inspired designs, ARM programs use explicit *load* and
  98 *store* instructions to access memory. All other instructions operate
  99 only on registers, or on registers and small constants called
 100 immediates. Because both instructions and data words are 32 bits wide, we
 101 can't simply embed a 32-bit number into an instruction. ARM programs use
 102 three methods to work around this, all of which Native Client exploits:
 103
 104 1. Many instructions can encode a modified immediate, which is an 8-bit
 105    number rotated right by an even number of bits.
 106 2. The ``movw`` and ``movt`` instructions can be used to set the bottom
 107    and top 16 bits of a register, respectively, and can therefore
 108    encode any 32-bit immediate.
 109 3. For values that can't be represented as modified immediates, ARM
 110    programs use ``pc``-relative loads to load data from inside the
 111    code---hidden in a place where it won't be executed such as "constant
 112    pools", just past the final return of a function.
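
As a rough illustration of the first two methods, the following Python sketch searches for a modified-immediate encoding of a 32-bit value and splits a value into the halves that ``movw`` and ``movt`` would carry. It is not part of Native Client, and the function names are made up for this example.

```python
def ror32(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    value &= 0xFFFFFFFF
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def encode_modified_immediate(value):
    """Return (imm8, rotation) if `value` is an 8-bit number rotated right
    by an even amount, or None if no such encoding exists."""
    for rotation in range(0, 32, 2):
        # Rotating right by (32 - rotation) undoes a right-rotation by `rotation`.
        imm8 = ror32(value, 32 - rotation)
        if imm8 <= 0xFF:
            return imm8, rotation
    return None


def split_for_movw_movt(value):
    """Return (bottom 16 bits for movw, top 16 bits for movt)."""
    return value & 0xFFFF, (value >> 16) & 0xFFFF
```

Under this sketch, ``0xC0000000`` is encodable (as ``3`` rotated right by 2 bits), while ``0xDEADBEEF`` is not and would need ``movw``/``movt`` or a ``pc``-relative load.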
 113
 114 We'll introduce more details of the ARM instruction set later, as we
 115 walk through the system.
 116
 117 The Native Client Approach
 118 ==========================
 119
 120 Native Client runs an untrusted program, potentially from an unknown or
 121 malicious source, inside a sandbox created by a trusted runtime. The
 122 trusted runtime allows the untrusted program to "call-out" and perform
 123 certain actions, such as drawing graphics, but prevents it from
 124 accessing the operating system directly. This "call-out" facility,
 125 called a trampoline, looks like a standard function call to the
 126 untrusted program, but it allows control to escape from the sandbox in a
 127 controlled way.
 128
 129 The untrusted program and trusted runtime inhabit the same process, or
 130 virtual address space, maintained by the operating system. To keep the
 131 trusted runtime behaving the way we expect, we must prevent the
 132 untrusted program from accessing and modifying its internals. Since they
 133 share a virtual address space, we can't rely on the operating system for
 134 this. Instead, we isolate the untrusted program from the trusted
 135 runtime.
 136
 137 Unlike modern operating systems, we use a cooperative isolation
 138 method. Native Client can't run any off-the-shelf program compiled for
 139 an off-the-shelf operating system. The program must be compiled to
 140 comply with Native Client's rules. The details vary on each platform,
 141 but in general, the untrusted program:
 142
 143 * Must not attempt to use certain forbidden instructions, such as system
 144   calls.
 145 * Must not attempt to modify its own code without abiding by Native
 146   Client's code modification rules.
 147 * Must not jump into the middle of an instruction group, or otherwise do
 148   tricky things to cause instructions to be interpreted multiple ways.
 149 * Must use special, strictly-defined instruction sequences to perform
 150   permitted but potentially dangerous actions. We call these sequences
 151   pseudo-instructions.
 152
 153 We can't simply take the program's word that it complies with these
 154 rules---we call it "untrusted" for a reason! Nor do we require it to be
 155 produced by a special compiler; in practice, we don't trust our
 156 compilers either. Instead, we apply a load-time validator that
 157 disassembles the program. The validator either proves that the program
 158 complies with our rules, or rejects it as unsafe. By keeping the rules
 159 simple, we keep the validator simple, small, and fast. We like to put
 160 our trust in small, simple things, and the validator is key to the
 161 system's security.
 162
 163 .. Note::
 164   :class: note
 165
 166   For the computationally-inclined, all our validators scale linearly in
 167   the size of the program.
 168
 169 NaCl/ARM: Pure Software Fault Isolation
 170 ---------------------------------------
 171
 172 In the original Native Client system for the x86, we used unusual
 173 hardware features of that processor (the segment registers) to isolate
 174 untrusted programs. This was simple and fast, but won't work on ARM,
 175 which has nothing equivalent. Instead, we use pure software fault
 176 isolation.
 177
 178 We use a fixed address space layout: the untrusted program gets the
 179 lowest gigabyte, addresses ``0`` through ``0x3FFFFFFF``. The rest of the
 180 address space holds the trusted runtime and the operating system. We
 181 isolate the program by requiring every *load*, *store*, and *indirect
 182 branch* (to an address in a register) to use a pseudo-instruction. The
 183 pseudo-instructions ensure that the address stays within the
 184 sandbox. The *indirect branch* pseudo-instruction, in turn, ensures that
 185 such branches won't split up other pseudo-instructions.
 186
 187 At either side of the sandbox, we place small (8KiB) guard
 188 regions. These are simply areas in the process's address space that are
 189 mapped without read, write, or execute permissions, so any attempt to
 190 access them for any reason---*load*, *store*, or *jump*---will cause a
 191 fault.
 192
 193 Finally, we ban the use of certain instructions, notably direct system
 194 calls. This is to ensure that the untrusted program can be run on any
 195 operating system supported by Native Client, and to prevent access to
 196 certain system features that might be used to subvert the sandbox. As a
 197 side effect, it helps to prevent programs from exploiting buggy
 198 operating system APIs.
 199
 200 Let's walk through the details, starting with the simplest part: *load*
 201 and *store*.
 202
 203 *Load* and *Store*
 204 ^^^^^^^^^^^^^^^^^^
 205
 206 All access to memory must be through *load* and *store*
 207 pseudo-instructions. These are simply a native *load* or *store*
 208 instruction, preceded by a guard instruction.
 209
 210 Each *load* or *store* pseudo-instruction is similar to the *load* shown
 211 below. We use abstract "placeholder" registers instead of specific
 212 numbered registers for the sake of discussion. ``rA`` is the register
 213 holding the address to load from. ``rD`` is the destination for the
 214 loaded data.
 215
 216 .. naclcode::
 217   :prettyprint: 0
 218
 219   bic    rA,  #0xC0000000
 220   ldr    rD,  [rA]
 221
 222 The first instruction, ``bic``, clears the top two bits of ``rA``. In
 223 this case, that means that the value in ``rA`` is forced to an address
 224 inside our sandbox, between ``0`` and ``0x3FFFFFFF``, inclusive.
 225
 226 The second instruction, ``ldr``, uses the previously-sandboxed address
 227 to load a value. This address might not be the address that the program
 228 intended, and might cause an access to an unmapped memory location
 229 within the sandbox: ``bic`` forces the address to be valid, by clearing
 230 the top two bits. This is a no-op in a correct program.
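
The effect of the guard instruction can be modelled in a few lines of Python. This is only a sketch of the arithmetic, with names chosen for this example. It shows that an in-sandbox address passes through unchanged while any other address is forced into the lowest gigabyte.

```python
SANDBOX_MASK = 0xC0000000   # The top two bits of a 32-bit address.


def bic(value, mask):
    """Model the ARM `bic` (bit clear) instruction on 32-bit values."""
    return value & ~mask & 0xFFFFFFFF


def sandboxed_address(rA):
    """Address actually used by `bic rA, #0xC0000000; ldr rD, [rA]`."""
    return bic(rA, SANDBOX_MASK)
```

For example, ``sandboxed_address(0x12345678)`` leaves the address untouched, while ``sandboxed_address(0x80001000)`` yields ``0x1000``: safe, though probably not what a broken program intended.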
 231
 232 This illustrates a common property of all Native Client systems: we aim
 233 for safety, not correctness. A program using an invalid address in
 234 ``rA`` here is simply broken, so we are free to do whatever we want to
 235 preserve safety. In this case the program might load an invalid (but
 236 safe) value, or cause a segmentation fault limited to the untrusted
 237 code.
 238
 239 Now, if we allowed arbitrary branches within the program, a malicious
 240 program could set up carefully-crafted values in ``rA``, and then jump
 241 straight to the ``ldr``. This is why we validate that programs never
 242 split pseudo-instructions.
 243
 244 Alternative Sandboxing
 245 """"""""""""""""""""""
 246
 247 .. naclcode::
 248   :prettyprint: 0
 249
 250   tst    rA,  #0xC0000000
 251   ldreq  rD,  [rA]
 252
 253 The first instruction, ``tst``, performs a bitwise-\ ``AND`` of ``rA``
 254 and the modified immediate literal, ``0xC0000000``. It sets the
 255 condition flags based on the result, but does not write the result to a
 256 register. In particular, it sets the ``Z`` condition flag if the result
 257 was zero---if the two values had no set bits in common. In this case,
 258 that means that the value in ``rA`` was an address inside our sandbox,
 259 between ``0`` and ``0x3FFFFFFF``, inclusive.
 260
 261 The second instruction, ``ldreq``, is a conditional load if equal. As we
 262 mentioned before, nearly all ARM instructions can be made
 263 conditional. In assembly language, we simply stick the desired condition
 264 on the end of the instruction's mnemonic name. Here, the condition is
 265 ``EQ``, which causes the instruction to execute only if the ``Z`` flag
 266 is set.
 267
 268 Thus, when the pseudo-instruction executes, the ``tst`` sets ``Z`` if
 269 (and only if) the value in ``rA`` is an address within the bounds of the
 270 sandbox, and then the ``ldreq`` loads if (and only if) it was. If ``rA``
 271 held an invalid address, the *load* does not execute, and ``rD`` is
 272 unchanged.
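
The behaviour of this pair can be sketched in Python as follows. The ``memory`` dictionary is a stand-in for sandbox memory, and the function name is hypothetical. The point is only that the *load* is skipped, not redirected, when the address is out of bounds.

```python
SANDBOX_MASK = 0xC0000000


def tst_ldreq(rA, rD, memory):
    """Model `tst rA, #0xC0000000` followed by `ldreq rD, [rA]`.

    `memory` is a dict from address to value standing in for sandbox memory.
    Returns the value left in rD.
    """
    z_flag = (rA & SANDBOX_MASK) == 0   # tst: Z is set when the AND is zero.
    if z_flag:                          # ldreq: executes only when Z is set.
        rD = memory[rA]
    return rD
```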
 273
 274 .. Note::
 275   :class: note
 276
 277   The ``tst``-based sequence is faster than the ``bic``-based sequence
 278   on modern ARM chips. It avoids a data dependency in the address
 279   register. This is why we keep both around. The ``tst``-based sequence
 280   unfortunately leaks information on some processors, and is therefore
 281   forbidden on those processors. This effectively means that it cannot
 282   be used for regular Native Client **nexe** files, but can be used with
 283   Portable Native Client because the target processor is known at
 284   translation time from **pexe** to **nexe**.
 285
 286 Addressing Modes
 287 """"""""""""""""
 288
 289 ARM has an unusually rich set of addressing modes. We allow all but one:
 290 register-indexed, where two registers are added to determine the
 291 address.
 292
 293 We permit simple *load* and *store*, as shown above. We also permit
 294 displacement, pre-index, and post-index memory operations:
 295
 296 .. naclcode::
 297   :prettyprint: 0
 298
 299   bic    rA,  #0xC0000000
 300   ldr    rD,  [rA, #1234]    ; This is fine.
 301   bic    rA,  #0xC0000000
 302   ldr    rD,  [rA, #1234]!   ; Also fine.
 303   bic    rA,  #0xC0000000
 304   ldr    rD,  [rA], #1234    ; Looking good.
 305
 306 In each case, we know ``rA`` points into the sandbox when the ``ldr``
 307 executes. We allow adding an immediate displacement to ``rA`` to
 308 determine the final address (as in the first two examples here) because
 309 the largest immediate displacement is ±4095 bytes, while our guard pages
 310 are 8192 bytes wide.
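
The worst case at the top of the sandbox can be checked with simple arithmetic. The sketch below assumes, for illustration, that the upper guard region begins at the first address past the sandbox. The constant and function names are invented for this example.

```python
SANDBOX_TOP = 0x3FFFFFFF       # Highest address a masked register can hold.
MAX_DISPLACEMENT = 4095        # Largest immediate offset in a load or store.
GUARD_SIZE = 8192              # Width of a guard region, in bytes.

# Assumption for this sketch: the upper guard region starts at the first
# address past the sandbox.
UPPER_GUARD_START = SANDBOX_TOP + 1
UPPER_GUARD_END = UPPER_GUARD_START + GUARD_SIZE - 1


def highest_reachable_address():
    """Worst case: the largest masked base plus the largest displacement."""
    return SANDBOX_TOP + MAX_DISPLACEMENT


def lands_in_sandbox_or_guard(address):
    """True if `address` is inside the sandbox or the upper guard region."""
    return 0 <= address <= UPPER_GUARD_END
```

The highest reachable address is ``0x40000FFE``, which falls well inside the guard region, so such an access faults instead of touching trusted memory.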
 311
 312 We also allow ARM's more unusual *load* and *store* instructions, such
 313 as *load-multiple* and *store-multiple*, etc.
 314
 315 Conditional *Load* and *Store*
 316 """"""""""""""""""""""""""""""
 317
 318 There's one problem with the pseudo-instructions shown above: they are
 319 unconditional (assuming ``rA`` is valid). ARM compilers regularly use
 320 conditional *load* and *store*, so we should support this in Native
 321 Client. We do so by defining alternate, predictable
 322 pseudo-instructions. Here is a conditional *store*
 323 (*store-if-greater-than*) using this pseudo-instruction sequence:
 324
 325 .. naclcode::
 326   :prettyprint: 0
 327
 328   bicgt  rA,  #0xC0000000
 329   strgt  rX,  [rA, #123]
 330
 331 The Stack Pointer, Thread Pointer, and Program Counter
 332 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 333
 334 Stack Pointer
 335 """""""""""""
 336
 337 In C-like languages, the stack is used to store return addresses during
 338 function calls, as well as any local variables that won't fit in
 339 registers. This makes stack operations very common.
 340
 341 Native Client does not require guard instructions on any *load* or
 342 *store* involving the stack pointer, ``sp``. This improves performance
 343 and reduces code size. However, ARM's stack pointer isn't special: it's
 344 just another register, called ``sp`` only by convention. To make it safe
 345 to use this register as a *load* or *store* address without guards, we
 346 add a rule: ``sp`` must always contain a valid address.
 347
 348 We enforce this rule by restricting the sorts of operations that
 349 programs can use to alter ``sp``. Programs can alter ``sp`` by adding or
 350 subtracting an immediate, as a side-effect of a *load* or *store*:
 351
 352 .. naclcode::
 353   :prettyprint: 0
 354
 355   ldr  rX,  [sp],  #4    ; Load from stack, then add 4 to sp.
 356   str  rX,  [sp, #1234]! ; Add 1234 to sp, then store to stack.
 357
 358 These are safe because, as we mentioned before, the largest immediate
 359 available in a *load* or *store* is ±4095. Even after adding or
 360 subtracting 4095, the stack pointer will still be within the sandbox or
 361 guard regions.
 362
 363 Any other operation that alters ``sp`` must be followed by a guard
 364 instruction. The most common alterations, in practice, are addition and
 365 subtraction of arbitrary integers:
 366
 367 .. naclcode::
 368   :prettyprint: 0
 369
 370   add  sp,  rX
 371   bic  sp,  #0xC0000000
 372
 373 The ``bic`` is similar to the one we used for *load* and
 374 *store*, and serves exactly the same purpose: after it completes, ``sp``
 375 is a valid address.
 376
 377 .. Note::
 378   :class: note
 379
 380   Clever assembly programmers and compilers may want to use this
 381   "trusted" property of ``sp`` to emit more efficient code: in a hot
 382   loop instead of using ``sp`` as a stack pointer it can be temporarily
 383   used as an index pointer (e.g. to traverse an array). This avoids the
 384   extra ``bic`` whenever the pointer is updated in the loop.
 385
 386 Thread Pointer Loads
 387 """"""""""""""""""""
 388
 389 The thread pointer and IRT thread pointer are stored in the trusted
 390 address space. All uses and definitions of ``r9`` from untrusted code
 391 are forbidden except as follows:
 392
 393 .. naclcode::
 394   :prettyprint: 0
 395
 396   ldr Rn, [r9]     ; Load user thread pointer.
 397   ldr Rn, [r9, #4] ; Load IRT thread pointer.
 398
 399 ``pc``-relative Loads
 400 """""""""""""""""""""
 401
 402 By extension, we also allow *load* through the ``pc`` without a
 403 mask. The explanation is quite similar:
 404
 405 * Our control-flow isolation rules mean that the ``pc`` will always
 406   point into the sandbox.
 407 * The maximum immediate displacement that can be used in a
 408   ``pc``-relative *load* is smaller than the width of the guard pages.
 409
 410 We do not allow ``pc``-relative stores, because they look suspiciously
 411 like self-modifying code. Nor do we allow any addressing mode that would
 412 alter the ``pc`` as a side effect of the *load*.
 413
 414 *Indirect Branch*
 415 ^^^^^^^^^^^^^^^^^
 416
 417 There are two types of control flow on ARM: direct and indirect. Direct
 418 control flow instructions have an embedded target address or
 419 offset. Indirect control flow instructions take their destination
 420 address from a register. The ``b`` (branch) and ``bl``
 421 (*branch-with-link*) instructions are *direct branch* and *call*,
 422 respectively. The ``bx`` (*branch-exchange*) and ``blx``
 423 (*branch-with-link-exchange*) are the indirect equivalents.
 424
 425 Because the program counter ``pc`` is simply another register, ARM also
 426 has many implicit indirect control flow instructions. Programs can
 427 operate on the ``pc`` using *add* or *load*, or even outlandish (and
 428 often specified as having unpredictable-behavior) things like multiply!
 429 In Native Client we ban all such instructions. Indirect control flow is
 430 exclusively through ``bx`` and ``blx``. Because all of ARM's control
 431 flow instructions are called *branch* instructions, we'll use the term
 432 *indirect branch* from here on, even though this includes things like
 433 *virtual call*, *return*, and the like.
 434
 435 The Trouble with Indirection
 436 """"""""""""""""""""""""""""
 437
 438 *Indirect branch* instructions present two problems for Native Client:
 439
 440 * We must ensure that they don't send execution outside the sandbox.
 441 * We must ensure that they don't break up the instructions inside a
 442   pseudo-instruction, by landing on the second one.
 443
 444 .. Note::
 445   :class: note
 446
 447   On the x86 architectures we must also ensure that they don't land
 448   inside an instruction. This is unnecessary on ARM, where all
 449   instructions are 32 bits wide.
 450
 451 Checking both of these for *direct branch* is easy: the validator just
 452 pulls the (fixed) target address out of the instruction and checks what
 453 it points to.
 454
 455 The Native Client Solution: "Bundles"
 456 """""""""""""""""""""""""""""""""""""
 457
 458 For *indirect branch*, we can address the first problem by simply
 459 masking some high-order bits off the address, like we did for *load* and
 460 *store*. The second problem is more subtle. Detecting every possible
 461 route that every *indirect branch* might take is difficult. Instead, we
 462 take the approach pioneered by the original Native Client: we restrict
 463 the possible places that any *indirect branch* can land. On Native
 464 Client for ARM, *indirect branch* can target any address that has its
 465 bottom four bits clear---any address that's ``0 mod 16``. We call these
 466 16-byte chunks of code "bundles". The validator makes sure that no
 467 pseudo-instruction straddles a bundle boundary. Compilers must pad with
 468 ``nop`` to ensure that every pseudo-instruction fits entirely inside one
 469 bundle.
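
The bundle rule is easy to state as arithmetic. The following Python sketch is not the real validator or assembler, and its names are hypothetical. It checks whether a pseudo-instruction would straddle a bundle boundary and how many ``nop`` instructions would push it into the next bundle.

```python
BUNDLE_SIZE = 16        # Bytes per bundle.
INSTRUCTION_SIZE = 4    # Every A32 instruction is 4 bytes.


def straddles_bundle(address, instruction_count):
    """True if a sequence starting at `address` crosses a bundle boundary."""
    last_byte = address + instruction_count * INSTRUCTION_SIZE - 1
    return address // BUNDLE_SIZE != last_byte // BUNDLE_SIZE


def nops_needed(address, instruction_count):
    """Number of nops to insert so the sequence fits inside one bundle."""
    if not straddles_bundle(address, instruction_count):
        return 0
    next_bundle = (address // BUNDLE_SIZE + 1) * BUNDLE_SIZE
    return (next_bundle - address) // INSTRUCTION_SIZE
```

A two-instruction pseudo-instruction starting at ``0x100C`` would straddle the boundary at ``0x1010``, so one ``nop`` is needed. Starting at ``0x1008`` it fits as is.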
 470
 471 Here is the *indirect branch* pseudo-instruction. As you can see, it
 472 clears the top two and bottom four bits of the address:
 473
 474 .. naclcode::
 475   :prettyprint: 0
 476
 477   bic  rA,  #0xC000000F
 478   bx   rA
 479
 480 This particular pseudo-instruction (a ``bic`` followed by a ``bx``) is
 481 used for computed jumps in switch tables and returning from functions,
 482 among other uses. Recall that, under ARM's modified immediate rules, we
 483 can fit the constant ``0xC000000F`` into the ``bic`` instruction's
 484 immediate field: ``0xC000000F`` is the 8-bit constant ``0xFC``, rotated
 485 right by 4 bits.
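
Both claims can be checked with a short Python sketch. The helper names are made up for this example. It reproduces the rotation that yields the mask and applies the mask to a few candidate branch targets.

```python
BRANCH_MASK = 0xC000000F   # Top two bits and bottom four bits.


def ror32(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    value &= 0xFFFFFFFF
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def sandboxed_branch_target(rA):
    """Model `bic rA, #0xC000000F`: clear the top two and bottom four bits."""
    return rA & ~BRANCH_MASK & 0xFFFFFFFF
```

Here ``ror32(0xFC, 4)`` gives ``0xC000000F``, and every value returned by ``sandboxed_branch_target`` is below ``0x40000000`` and a multiple of 16.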
 486
 487 The other useful variant is the *indirect branch-with-link*, which is
 488 the ARM equivalent to *call*:
 489
 490 .. naclcode::
 491   :prettyprint: 0
 492
 493   bic  rA,  #0xC000000F
 494   blx  rA
 495
 496 This is used for indirect function calls---commonly seen in C++ programs
 497 as virtual calls, but also for calling function pointers in C.
 498
 499 Note that both *indirect branch* pseudo-instructions use ``bic``, rather
 500 than the ``tst`` instruction we allow for *load* and *store*. There are
 501 two reasons for this:
 502
 503 1. Conditional *branch* is very common, much more common than
 504    conditional *load* and *store*. If we supported an alternative
 505    ``tst``-based sequence for *branch*, its use would be rare.
 506 2. There's no performance benefit to using ``tst`` here on modern ARM
 507    chips. *Branch* consumes its operands later in the pipeline than
 508    *load* and *store* (since it doesn't have to generate an address,
 509    etc.) so this sequence doesn't stall.
 510
 511 .. Note::
 512   :class: note
 513
 514   At this point astute readers are wondering what the ``x`` in ``bx``
 515   and ``blx`` means. We told you it stood for "exchange", but exchange
 516   to what? ARM, for all the reduced-ness of its instruction set, can
 517   change execution mode from A32 (ARM) to T32 (Thumb) and back with
 518   these *branch* instructions, called *interworking branch*. Recall that
 519   A32 instructions are 32 bits wide, and T32 instructions are a mix of
 520   16-bit and 32-bit instructions. The destination address given to a
 521   *branch* therefore cannot sensibly have its bottom bit set in either
 522   instruction set: that would be an unaligned instruction in both cases,
 523   and ARM simply doesn't support this. The bottom bit for the *indirect
 524   branch* was therefore cleverly recycled by the ARM architecture to
 525   mean "switch to T32 mode" when set!
 526
 527   As you've figured out by now, Native Client's sandbox wouldn't be very
 528   happy if A32 instructions were to be executed as T32 instructions: who
 529   knows what they correspond to?  A malicious person could craft valid
 530   A32 code that's actually very naughty T32 code, somewhat like forming
 531   a sentence that happens to be valid in English and French but with
 532   completely different meanings, complimenting the reader in one
 533   language and insulting them in the other.
 534
 535   You've figured out by now that the bundle alignment restrictions of
 536   the Native Client sandbox already take care of making this travesty
 537   impossible: by masking off the bottom 4 bits of the destination the
 538   interworking nature of ARM's *indirect branch* is completely avoided.
 539
 540 *Call* and *Return*
 541 """""""""""""""""""
 542
 543 On ARM, there is no *call* or *return* instruction. A *call* is simply a
 544 *branch* that just happens to load a return address into ``lr``, the link
 545 register. If the called function is a leaf (that is, if it calls no
 546 other functions before returning), it simply branches to the address
 547 stored in ``lr`` to *return* to its caller:
 548
 549 .. naclcode::
 550   :prettyprint: 0
 551
 552   bic  lr,  #0xC000000F
 553   bx   lr
 554
 555 If the function calls other functions, however, it has to spill ``lr``
 556 onto the stack. On x86, this is done implicitly, but it is explicit on
 557 ARM:
 558
 559 .. naclcode::
 560   :prettyprint: 0
 561
 562   push { lr }
 563   ; Some code here...
 564   pop  { lr }
 565   bic  lr,  #0xC000000F
 566   bx   lr
 567
 568 There are two things to note about this code.
 569
 570 1. As we mentioned before, we don't allow arbitrary instructions to
 571    write to the Program Counter, ``pc``. Thus, while a traditional ARM
 572    program might have popped directly into ``pc`` to end the function,
 573    we require a pop into a register, followed by a pseudo-instruction.
 574 2. Function returns really are just *indirect branch*, with the same
 575    restrictions. This means that functions can only return to addresses
 576    that are bundle-aligned: ``0 mod 16``.
 577
 578 The implication here is that a *call*\ ---the *branch* that enters
 579 functions---must be placed at the end of the bundle, so that the return
 580 address it generates is ``0 mod 16``. Otherwise, when we clear the
 581 bottom four bits, the program would enter an infinite loop!  (Native
 582 Client doesn't try to prevent infinite loops, but the validator actually
 583 does check the alignment of calls. This is because, when we were writing
 584 the compiler, it was annoying to find out our calls were in the wrong
 585 place by having the program run forever!)
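
To see why a misplaced *call* loops forever, consider this small Python model. It is a sketch with invented names, relying only on the fact that an A32 *branch-with-link* stores the address of the following instruction in ``lr``.

```python
BUNDLE_SIZE = 16
BRANCH_MASK = 0xC000000F


def return_address(call_address):
    """A32 branch-with-link stores the address of the next instruction in lr."""
    return call_address + 4


def masked_return_target(call_address):
    """Where `bic lr, #0xC000000F; bx lr` actually lands after this call."""
    return return_address(call_address) & ~BRANCH_MASK & 0xFFFFFFFF


def call_is_correctly_placed(call_address):
    """True if the call occupies the last slot of its bundle."""
    return return_address(call_address) % BUNDLE_SIZE == 0
```

A call at ``0x100C`` returns to ``0x1010`` as intended. A call at ``0x1004`` would instead return to ``0x1000``, before the call itself, so execution would reach the same call again.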
 586
 587 .. Note::
 588   :class: note
 589
 590   Properly balancing the CPU's *call*/*return* actually allows it to
 591   perform much better by allowing it to speculatively execute the return
 592   address' code. For more information on ARM's *call*/*return* stack see
 593   ARM's technical reference manual.
 594
 595 Literal Pools and Data Bundles
 596 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 597
 598 In the section where we described the ARM architecture, we mentioned
 599 ARM's unusual immediate forms. To restate:
 600
 601 * ARM instructions are fixed-length, 32 bits wide, so we can't have an
 602   instruction that includes an arbitrary 32-bit constant.
 603 * Many ARM instructions can include a modified immediate constant, which
 604   is flexible, but limited.
 605 * For any other value (particularly addresses), ARM programs explicitly
 606   load constants from inside the code itself.
 607
 608 .. Note::
 609   :class: note
 610
 611   ARMv7 introduces some instructions, ``movw`` and ``movt``, that try to
 612   address this by letting us directly load larger constants. Our
 613   toolchain uses this capability in some cases.
 614
 615 Here's a typical example of the use of a literal pool. ARM assemblers
 616 typically hide the details---this is the sort of code you'd see produced
 617 by a disassembler, but with more comments.
 618
 619 .. naclcode::
 620   :prettyprint: 0
 621
 622   ; C equivalent: "table[3] = 4"
 623   ; 'table' is a static array of bytes.
 624   ldr   r0,  [pc, #124]    ; Load the address of the 'table',
 625                            ; "124" is the offset from here
 626                            ; to the constant below.
 627   add   r0,  #3            ; Add the immediate array index.
 628   mov   r1,  #4            ; Get the constant '4' into a register.
 629   bic   r0,  #0xC0000000   ; Mask our array address.
 630   strb  r1,  [r0]          ; Store one byte.
 631   ; ...
 632   .word table              ; Constant referenced above.
 633
 634 Because table is a static array, the compiler knew its address at
 635 compile-time---but the address didn't fit in a modified immediate. (Most
 636 don't.)  So, instead of loading an immediate into ``r0`` with a ``mov``,
 637 we stashed the address in the code, generated its address using ``pc``,
 638 and loaded the constant. ARM compilers will typically group all the
 639 embedded data together into a literal pool. These typically live just
 640 past the end of functions, where they won't be executed.
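
When working out where such a load reads from, keep in mind that an A32 instruction reading ``pc`` sees its own address plus 8. The following Python sketch (the function name is made up for this example) computes the address of the literal for a given ``pc``-relative *load*.

```python
def pc_relative_literal_address(instruction_address, offset):
    """Address read by `ldr rD, [pc, #offset]` in A32.

    When an A32 instruction reads pc, it sees its own address plus 8.
    """
    return instruction_address + 8 + offset
```

If the ``ldr r0, [pc, #124]`` above were located at ``0x1000``, the constant would be read from ``0x1084``.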
 641
 642 This is an important trick in ARM code, so it's important to support it
 643 in Native Client... but there's a potential flaw. If we let programs
 644 contain arbitrary data, mingled in with the code, couldn't they hide
 645 malicious instructions this way?
 646
 647 The answer is no, because the validator disassembles the entire
 648 executable region of the program, without regard to whether the
 649 programmer said a certain chunk was code or data. But this brings the
 650 opposite problem: what if the program needs to contain a certain
 651 constant that just happens to encode a malicious instruction?  We want
 652 to allow this, but we have to be certain it will never be executed as
 653 code!
 654
 655 Data Bundles to the Rescue
 656 """"""""""""""""""""""""""
 657
 658 As we discussed in the last section, ARM code in Native Client is
 659 structured in 16-byte bundles. We allow literal pools by putting them in
 660 special bundles, called data bundles. Each data bundle can contain 12
 661 bytes of arbitrary data, and the program can have as many data bundles
 662 as it likes.
 663
 664 Each data bundle starts with a breakpoint instruction, ``bkpt``. This
 665 way, if an *indirect branch* tries to enter the data bundle, the process
 666 will take a fault and the trusted runtime will intervene (by terminating
 667 the program). For example:
 668
 669 .. naclcode::
 670   :prettyprint: 0
 671
 672   .p2align 4
 673   bkpt #0x5BE0          ; Must be aligned 0 mod 16!
 674   .word 0xDEADBEEF      ; Arbitrary constants are A-OK.
 675   svc #30               ; Trying to make a syscall? OK!
 676   str r0, [r1]          ; Unmasked stores are fine too.
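
As a rough illustration of the layout above, here is a small Python sketch that packs one data bundle. The helper name ``make_data_bundle`` and the zero padding are assumptions made for this example; they are not part of the Native Client toolchain.

```python
import struct

DATA_BUNDLE_MARKER = 0xE125BE70  # A32 encoding of "bkpt #0x5BE0"
BUNDLE_SIZE = 16                 # bytes per bundle
PAYLOAD_SIZE = BUNDLE_SIZE - 4   # 12 bytes remain after the marker


def make_data_bundle(payload: bytes) -> bytes:
    """Return one 16-byte data bundle: marker word, then padded payload.

    Hypothetical helper: zero padding is an assumption of this sketch.
    """
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("a data bundle holds at most 12 bytes of data")
    padded = payload.ljust(PAYLOAD_SIZE, b"\x00")
    # Instruction words are stored little-endian in this sketch.
    return struct.pack("<I", DATA_BUNDLE_MARKER) + padded


bundle = make_data_bundle(struct.pack("<I", 0xDEADBEEF))
```

A program needing more than 12 bytes of constants would simply emit several such bundles back to back.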
 677
 678 So, we have a way for programs to create an arbitrary, even dangerous,
 679 chunk of data within their code. We can prevent an *indirect branch*
 680 from entering it, and the leading ``bkpt`` also stops fall-through from
 681 the code just before it. But what about a *direct branch* straight into
 682 the middle?
 683
 684 The validator detects all data bundles (because this ``bkpt`` has a
 685 special encoding) and marks them as off-limits for *direct branch*. If
 686 it finds a *direct branch* into a data bundle, the entire program is
 687 rejected as unsafe. Because *direct branch* cannot be modified at
 688 runtime, the data bundles cannot be executed.
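
The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, only the A32 ``B``/``BL`` immediate encodings are decoded, and the real validator is table-driven and checks far more than this.

```python
DATA_BUNDLE_MARKER = 0xE125BE70  # A32 encoding of "bkpt #0x5BE0"
BUNDLE_SIZE = 16


def branch_target(addr, word):
    """Return the target of an A32 B/BL located at addr, else None."""
    if (word >> 28) == 0xF or ((word >> 25) & 0b111) != 0b101:
        return None
    imm24 = word & 0x00FFFFFF
    if imm24 & 0x800000:          # sign-extend the 24-bit offset
        imm24 -= 1 << 24
    return addr + 8 + (imm24 << 2)  # pc reads as instruction address + 8


def direct_branches_ok(words, base=0):
    """words: 32-bit instruction words; base: 16-byte-aligned load address."""
    # Pass 1: find every bundle that starts with the marker.
    data_bundles = {base + 4 * i
                    for i in range(0, len(words), 4)
                    if words[i] == DATA_BUNDLE_MARKER}
    # Pass 2: reject any direct branch that lands inside a data bundle.
    for i, word in enumerate(words):
        addr = base + 4 * i
        if (addr & ~(BUNDLE_SIZE - 1)) in data_bundles:
            continue              # bundle contents are data, not code
        target = branch_target(addr, word)
        if target is not None and (target & ~(BUNDLE_SIZE - 1)) in data_bundles:
            return False
    return True


NOP = 0xE1A00000                  # mov r0, r0
DATA = [DATA_BUNDLE_MARKER, 0xDEADBEEF, 0xEF00001E, 0xE5810000]
bad = [0xEA000003, NOP, NOP, NOP] + DATA + [NOP] * 4   # b 0x14 (into data)
good = [0xEA000006, NOP, NOP, NOP] + DATA + [NOP] * 4  # b 0x20 (past data)
```

Here ``direct_branches_ok(bad)`` is false because the branch at address ``0`` targets ``0x14``, inside the data bundle, while ``direct_branches_ok(good)`` is true because its branch skips over the data bundle entirely.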
 689
 690 .. Note::
 691   :class: note
 692
 693   Clever readers may wonder: why use ``bkpt #0x5BE0``? That seems
 694   awfully specific when you just need a special "roadblock" instruction!
 695   Quite true, young Padawan! It happens that this odd ``bkpt``
 696   instruction is encoded as ``0xE125BE70`` in A32, and in T32 the
 697   ``bkpt`` instruction is encoded as ``0xBExx`` (where ``xx`` could be
 698   any 8-bit immediate, say ``0x70``) and ``0xE125`` encodes the *branch*
 699   instruction ``b.n #0x250``. The special roadblock instruction
 700   therefore doubles as a roadblock in T32, if anything were to go so
 701   awry that we tried to execute it as a T32 instruction! Much defense,
 702   such depth, wow!
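
The encoding arithmetic in the note can be checked with a short Python sketch. The function names are made up for this example, and the bit layouts are this sketch's reading of the A32 ``bkpt`` encoding and the T32 16-bit ``bkpt`` and unconditional branch encodings.

```python
def encode_a32_bkpt(imm16):
    """Encode A32 'bkpt #imm16': the immediate is split into imm12:imm4."""
    if not 0 <= imm16 <= 0xFFFF:
        raise ValueError("bkpt immediate must fit in 16 bits")
    imm12 = imm16 >> 4
    imm4 = imm16 & 0xF
    return 0xE1200070 | (imm12 << 8) | imm4


def t32_halfwords(word):
    """Split a little-endian 32-bit word into halfwords in memory order."""
    return word & 0xFFFF, word >> 16


word = encode_a32_bkpt(0x5BE0)
first, second = t32_halfwords(word)
is_t32_bkpt = (first & 0xFF00) == 0xBE00    # 0xBExx is T32 bkpt
is_t32_branch = (second >> 11) == 0b11100   # 16-bit unconditional branch
```

With the immediate ``0x5BE0``, ``word`` is ``0xE125BE70``, so a processor mistakenly running in T32 mode would meet the ``bkpt`` halfword ``0xBE70`` first.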
 703
 704 Trampolines and Memory Layout
 705 -----------------------------
 706
 707 So far, the rules we've described make for boring programs: they can't
 708 communicate with the outside world!
 709
 710 * The program can't call an external library, or the operating system,
 711   even to do something simple like draw some pixels on the screen.
 712 * It also can't read or write memory outside of its dedicated sandbox,
 713   so communicating that way is right out.
 714
 715 We fix this by allowing the untrusted program to call into the trusted
 716 runtime using a trampoline. A trampoline is simply a short stretch of
 717 code, placed by the trusted runtime at a known location within the
 718 sandbox, that is permitted to do things the untrusted program can't.
 719
 720 Even though trampolines are inside the sandbox, the untrusted program
 721 can't modify them: the trusted runtime marks them read-only. It also
 722 can't do anything clever with the special instructions inside the
 723 trampoline---for example, call it at a slightly offset address to bypass
 724 some checks---because the validator only allows trampolines to be
 725 reached by *indirect branch* (or *branch-with-link*). We structure the
 726 trampolines carefully so that they're safe to enter at any ``0 mod 16``
 727 address.
 728
 729 The validator can detect attempts to use the trampolines because they're
 730 loaded at a fixed location in memory. Let's look at the memory map of
 731 the Native Client sandbox.
 732
 733 Memory Map
 734 ^^^^^^^^^^
 735
 736 The ARM sandbox is always at virtual address ``0``, and is exactly 1GiB
 737 in size. This includes the untrusted program's code and data, the
 738 trampolines, and a small guard region to detect null pointer
 739 dereferences. In practice, the sandbox reserves a bit more address
 740 space than this, because of the need for additional guard regions at
 741 either end.
 742
 743 +----------------+-------+-------------------+--------------------------------------------------------------------+
 744 | Address        | Size  | Name              | Purpose                                                            |
 745 +================+=======+===================+====================================================================+
 746 | ``-0x2000``    |  8KiB | Bottom Guard      | Keeps negative-displacement *load* or *store* from escaping.       |
 747 +----------------+-------+-------------------+--------------------------------------------------------------------+
 748 | ``0``          | 64KiB | Null Guard        | Catches null pointer dereferences, guards against kernel exploits. |
 749 +----------------+-------+-------------------+--------------------------------------------------------------------+
 750 | ``0x10000``    | 64KiB | Trampolines       | Up to 2048 unique syscall entry points.                            |
 751 +----------------+-------+-------------------+--------------------------------------------------------------------+
 752 | ``0x20000``    | ~1GiB | Untrusted Sandbox | Contains untrusted code, followed by its heap/stack/memory.        |
 753 +----------------+-------+-------------------+--------------------------------------------------------------------+
 754 | ``0x40000000`` |  8KiB | Top Guard         | Keeps positive-displacement *load* or *store* from escaping.       |
 755 +----------------+-------+-------------------+--------------------------------------------------------------------+
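
To make the table concrete, the following Python sketch maps an address to the region that contains it. The ``REGIONS`` list simply restates the table; the ``region_of`` helper is a hypothetical name used only here, and addresses are treated as signed integers so the bottom guard can be written as ``-0x2000``.

```python
# (start, end, name) with end exclusive, restating the memory map table.
REGIONS = [
    (-0x2000, 0x0, "Bottom Guard"),
    (0x0, 0x10000, "Null Guard"),
    (0x10000, 0x20000, "Trampolines"),
    (0x20000, 0x40000000, "Untrusted Sandbox"),
    (0x40000000, 0x40002000, "Top Guard"),
]


def region_of(addr):
    """Return the name of the region containing addr, or None if outside."""
    for start, end, name in REGIONS:
        if start <= addr < end:
            return name
    return None
```

For instance, ``region_of(0)`` lands in the null guard, and ``region_of(0x40000000)`` lands in the top guard rather than in the sandbox proper.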
 756
 757 Within the trampolines, the untrusted program can call any address
 758 that's ``0 mod 16``. However, only even slots are used, so useful
 759 trampolines are always ``0 mod 32``. If the program calls an odd slot,
 760 it will fault, and the trusted runtime will shut it down.
 761
 762 .. Note::
 763   :class: note
 764
 765   This is a bit of speculative flexibility. While the current bundle
 766   size of Native Client on ARM is 16 bytes, we've considered the
 767   possibility of optional 32-byte bundles, to enable certain compiler
 768   improvements. While this option isn't available to untrusted programs
 769   today, we're trying to keep the system "32-byte clean".
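
The slot rules can be summarized with a small Python sketch. The classification labels and the function name are assumptions of this example, not terms used by the trusted runtime.

```python
TRAMPOLINE_BASE = 0x10000
TRAMPOLINE_END = 0x20000
BUNDLE_SIZE = 16


def classify_trampoline_call(addr):
    """Describe what happens when untrusted code calls addr."""
    if not TRAMPOLINE_BASE <= addr < TRAMPOLINE_END:
        return "outside trampolines"
    if addr % BUNDLE_SIZE != 0:
        return "rejected: not 0 mod 16"
    slot = (addr - TRAMPOLINE_BASE) // BUNDLE_SIZE
    # Even slots hold trampoline code; odd slots begin with bkpt.
    return "trampoline" if slot % 2 == 0 else "fault"


# One useful trampoline per 32 bytes of the 64KiB region.
NUM_TRAMPOLINES = (TRAMPOLINE_END - TRAMPOLINE_BASE) // (2 * BUNDLE_SIZE)
```

``NUM_TRAMPOLINES`` works out to 2048, which matches the figure in the memory map.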
 770
 771 Inside a Trampoline
 772 ^^^^^^^^^^^^^^^^^^^
 773
 774 When we introduced trampolines, we mentioned that they can do things
 775 that untrusted programs can't. To be more specific, trampolines can jump
 776 to locations outside the sandbox. On ARM, this is all they do. Here's a
 777 typical trampoline fragment on ARM:
 778
 779 .. naclcode::
 780   :prettyprint: 0
 781
 782   ; Even trampoline bundle:
 783   push  { r0-r3 }     ; Save arguments that may be in registers.
 784   push  { lr }        ; Save the untrusted return address,
 785                       ; separate step because it must be on top.
 786   ldr   r0,  [pc, #4] ; Load the destination address from
 787                       ; the next bundle.
 788   blx   r0            ; Go!
 789   ; The odd trampoline that immediately follows:
 790   bkpt 0x5be0         ; Prevent entry to this data bundle.
 791   .word address_of_routine
 792
 793 The only odd thing here is that we push the incoming value of ``lr``,
 794 and then use ``blx``---not ``bx``---to escape the sandbox. This is
 795 because, in practice, all trampolines jump to the same routine in the
 796 trusted runtime, called the syscall hook. It uses the return address
 797 produced by the final ``blx`` instruction to determine which trampoline
 798 was called.
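
Two bits of address arithmetic are at work here, and a Python sketch may help. The first function models how an A32 ``pc``-relative ``ldr`` finds its operand. The second shows one way a syscall hook *could* recover the trampoline index from the return address; the actual computation in the trusted runtime is not shown in this document, so treat it as an assumption.

```python
TRAMPOLINE_BASE = 0x10000
BUNDLE_SIZE = 16


def ldr_pc_relative_address(instr_addr, offset):
    """Address read by 'ldr rX, [pc, #offset]' in A32 code.

    In A32, pc reads as the address of the current instruction plus 8.
    """
    return instr_addr + 8 + offset


def trampoline_index(return_address):
    """Hypothetical: map the lr produced by the final blx to a slot index.

    blx is the last instruction of the even bundle, so lr points at the
    start of the odd bundle that follows it.
    """
    even_bundle = return_address - BUNDLE_SIZE
    return (even_bundle - TRAMPOLINE_BASE) // (2 * BUNDLE_SIZE)


bundle = 0x10040                      # an even trampoline bundle
ldr_addr = bundle + 8                 # third instruction in the bundle
literal = ldr_pc_relative_address(ldr_addr, 4)
lr_after_blx = bundle + 12 + 4        # blx is the fourth instruction
```

In this example ``literal`` equals ``bundle + 20``, the word right after the ``bkpt`` in the odd bundle, and ``trampoline_index(lr_after_blx)`` identifies the third trampoline (index 2).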
 799
 800 Loose Ends
 801 ----------
 802
 803 Forbidden Instructions
 804 ^^^^^^^^^^^^^^^^^^^^^^
 805
 806 To complete the sandbox, the validator ensures that the program does not
 807 try to use certain forbidden instructions.
 808
 809 * We forbid instructions that directly interact with the operating
 810   system by going around the trusted runtime. We prevent this to limit
 811   the functionality of the untrusted program, and to ensure portability
 812   across operating systems.
 813 * We forbid instructions that change the processor's execution mode to
 814   Thumb, ThumbEE, or Jazelle. This would cause the code to be
 815   interpreted differently than the validator's original 32-bit ARM
 816   disassembly, so the validator results might be invalidated.
 817 * We forbid instructions that aren't available to user code (i.e. have
 818   to be used by an operating system kernel). This is purely out of
 819   paranoia, because the hardware should prevent the instructions from
 820   working. Essentially, we consider it "suspicious" if a program
 821   contains these instructions---it might be trying to exploit a hardware
 822   bug.
 823 * We forbid instructions, or variants of instructions, that are
 824   implementation-defined ("unpredictable") or deprecated in the ARMv7-A
 825   architecture manual.
 826 * Finally, we forbid a small number of instructions, such as ``setend``,
 827   purely out of paranoia. It's easier to loosen the validator's
 828   restrictions than to tighten them, so we err on the side of rejecting
 829   safe instructions.
 830
 831 If an instruction can't be decoded at all within the ARMv7-A instruction
 832 set specification, it is forbidden.
 833
 834 .. Note::
 835   :class: note
 836
 837   Here is a list of instructions currently forbidden for security
 838   reasons (that is, excluding deprecated or undefined instructions):
 839
 840   * ``BLX`` (immediate): always changes to Thumb mode.
 841   * ``BXJ``: always changes to Jazelle mode.
 842   * ``CPS``: not available to user code.
 843   * ``LDM``, exception return version: not available to user code.
 844   * ``LDM``, kernel version: not available to user code.
 845   * ``LDR*T`` (unprivileged load operations): theoretically harmless,
 846     but suspicious when found in user code. Use ``LDR`` instead.
 847   * ``MSR``, kernel version: not available to user code.
 848   * ``RFE``: not available to user code.
 849   * ``SETEND``: theoretically harmless, but suspicious when found in
 850     user code. May make some future validator extensions difficult.
 851   * ``SMC``: not available to user code.
 852   * ``SRS``: not available to user code.
 853   * ``STM``, kernel version: not available to user code.
 854   * ``STR*T`` (unprivileged store operations): theoretically harmless,
 855     but suspicious when found in user code. Use ``STR`` instead.
 856   * ``SVC``/``SWI``: allows direct operating system interaction.
 857   * Any unassigned hint instruction: difficult to reason about, so
 858     treated as suspicious.
 859
 860   More details are available in the `ARMv7 instruction table definition
 861   <http://src.chromium.org/viewvc/native_client/trunk/src/native_client/src/trusted/validator_arm/armv7.table>`_.
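
To give a feel for what such checks look like, here is a Python sketch that recognizes four of the forbidden instructions by bit pattern. It is an illustration only: the masks reflect this sketch's reading of the ARMv7-A A32 encodings, the function name is hypothetical, and the real validator derives its decisions from the instruction table rather than from hand-written masks.

```python
def forbidden_reason(word):
    """Return why a 32-bit A32 word is rejected, or None if none of
    the four checks in this sketch match."""
    cond = word >> 28
    if cond != 0xF and (word & 0x0F000000) == 0x0F000000:
        return "SVC/SWI: allows direct operating system interaction"
    if (word & 0xFE000000) == 0xFA000000:
        return "BLX (immediate): always changes to Thumb mode"
    if cond != 0xF and (word & 0x0FFFFFF0) == 0x012FFF20:
        return "BXJ: always changes to Jazelle mode"
    if (word & 0xFFFFFDFF) == 0xF1010000:
        return "SETEND: suspicious when found in user code"
    return None
```

Note that ``blx r0`` (``0xE12FFF30``), the register form used inside trampolines, does not match the ``BLX`` (immediate) pattern; only the immediate form unconditionally switches to Thumb.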
 862
 863 Coprocessors
 864 ^^^^^^^^^^^^
 865
 866 ARM has traditionally added new instruction set features through
 867 coprocessors. Coprocessors are accessed through a small set of
 868 instructions, and often have their own register files. Floating point
 869 and the NEON vector extensions are both implemented as coprocessors, as
 870 is the MMU.
 871
 872 We're confident that the side-effects of coprocessors in slots 10 and 11
 873 (that is, floating point, NEON, etc.) are well-understood. These are in
 874 the coprocessor space reserved by ARM Ltd. for their own extensions
 875 (``CP8``--\ ``CP15``), and are unlikely to change significantly. So, we
 876 allow untrusted code to use coprocessors 10 and 11, and we mandate the
 877 presence of at least VFPv3 and NEON/AdvancedSIMD. Multiprocessor
 878 Extension, VFPv4, FP16 and other extensions are allowed but not
 879 required, and may fail on processors that do not support them. It is
 880 therefore the program's responsibility to check their availability
 881 before executing them.
 882
 883 We don't allow access to any other ARM-reserved coprocessor
 884 (``CP8``--\ ``CP9`` or ``CP12``--\ ``CP15``). It's possible that read
 885 access to ``CP15`` might be useful, and we might allow it in the
 886 future---but again, it's easier to loosen the restrictions than tighten
 887 them, so we ban it for now.
 888
 889 We do not, and probably never will, allow access to the vendor-specific
 890 coprocessor space, ``CP0``--\ ``CP7``. We're simply not confident in our
 891 ability to model the operations on these coprocessors, given that
 892 vendors often leave them poorly-specified. Unfortunately this eliminates
 893 some legacy floating point and vector implementations, but these are
 894 superseded on ARMv7-A parts anyway.
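
The coprocessor policy above amounts to a whitelist on the coprocessor number. A minimal Python sketch, assuming the generic A32 coprocessor encodings (bits 27-25 equal to ``110``, or bits 27-24 equal to ``1110``, with the coprocessor number in bits 11-8), might look like this; the function names are hypothetical.

```python
ALLOWED_COPROCESSORS = {10, 11}   # floating point and NEON, per the text


def coprocessor_number(word):
    """Return the coprocessor field of an A32 coprocessor instruction,
    or None if word is not in the coprocessor encoding space."""
    is_load_store = ((word >> 25) & 0b111) == 0b110   # LDC/STC/MCRR/MRRC
    is_data_op = ((word >> 24) & 0b1111) == 0b1110    # CDP/MCR/MRC
    if not (is_load_store or is_data_op):
        return None
    return (word >> 8) & 0xF


def coprocessor_allowed(word):
    """True for non-coprocessor words and for coprocessors 10 and 11."""
    number = coprocessor_number(word)
    return number is None or number in ALLOWED_COPROCESSORS
```

For example, ``0xEE300A20`` (``vadd.f32 s0, s0, s1``) uses coprocessor 10 and passes, while ``0xEE100F10`` (``mrc p15, 0, r0, c0, c0, 0``) targets ``CP15`` and is rejected.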
 895
 896 Validator Code
 897 ^^^^^^^^^^^^^^
 898
 899 By now you're itching to see the sandbox validator's code and dissect
 900 it. You'll have a disappointing read: at fewer than 500 lines of code,
 901 `validator.cc
 902 <http://src.chromium.org/viewvc/native_client/trunk/src/native_client/src/trusted/validator_arm/validator.cc>`_
 903 is quite simple to understand and much shorter than this document. It's
 904 of course dependent on the `ARMv7 instruction table definition
 905 <http://src.chromium.org/viewvc/native_client/trunk/src/native_client/src/trusted/validator_arm/armv7.table>`_,
 906 which teaches it about the ARMv7 instruction set.