From: H. Peter Anvin Date: Tue, 11 Sep 2007 23:52:01 +0000 (+0000) Subject: Feeble attempt at updating the documentation; remove Appendix B X-Git-Tag: nasm-2.11.05~1999 X-Git-Url: http://review.tizen.org/git/?a=commitdiff_plain;h=9b49e24e1fe1a4afc021f6c3a01720fcabdc47ca;p=platform%2Fupstream%2Fnasm.git Feeble attempt at updating the documentation; remove Appendix B Feeble attempt to document 64-bit support. Also, remove Appendix B since we have been utterly useless at keeping it up to date, and it's redundant with the processor manufacturer's documentation anyway. --- diff --git a/doc/insref.src b/doc/insref.src new file mode 100644 index 0000000..1406f87 --- /dev/null +++ b/doc/insref.src @@ -0,0 +1,6732 @@ +\A{iref} x86 Instruction Reference + +This appendix provides a complete list of the machine instructions +which NASM will assemble, and a short description of the function of +each one. + +It is not intended to be an exhaustive documentation on the fine +details of the instructions' function, such as which exceptions they +can trigger: for such documentation, you should go to Intel's Web +site, \W{http://developer.intel.com/design/Pentium4/manuals/}\c{http://developer.intel.com/design/Pentium4/manuals/}. + +Instead, this appendix is intended primarily to provide +documentation on the way the instructions may be used within NASM. +For example, looking up \c{LOOP} will tell you that NASM allows +\c{CX} or \c{ECX} to be specified as an optional second argument to +the \c{LOOP} instruction, to enforce which of the two possible +counter registers should be used if the default is not the one +desired. + +The instructions are not quite listed in alphabetical order, since +groups of instructions with similar functions are lumped together in +the same entry. Most of them don't move very far from their +alphabetic position because of this. 
+ + +\H{iref-opr} Key to Operand Specifications + +The instruction descriptions in this appendix specify their operands +using the following notation: + +\b Registers: \c{reg8} denotes an 8-bit \i{general purpose +register}, \c{reg16} denotes a 16-bit general purpose register, +\c{reg32} a 32-bit one and \c{reg64} a 64-bit one. \c{fpureg} denotes +one of the eight FPU stack registers, \c{mmxreg} denotes one of the +eight 64-bit MMX registers, and \c{segreg} denotes a segment register. +\c{xmmreg} denotes one of the 8, or 16 in x64 long mode, SSE XMM registers. +In addition, some registers (such as \c{AL}, \c{DX}, \c{ECX} or \c{RAX}) +may be specified explicitly. + +\b Immediate operands: \c{imm} denotes a generic \i{immediate operand}. +\c{imm8}, \c{imm16} and \c{imm32} are used when the operand is +intended to be a specific size. For some of these instructions, NASM +needs an explicit specifier: for example, \c{ADD ESP,16} could be +interpreted as either \c{ADD r/m32,imm32} or \c{ADD r/m32,imm8}. +NASM chooses the former by default, and so you must specify \c{ADD +ESP,BYTE 16} for the latter. There is a special case of the allowance +of an \c{imm64} for particular x64 versions of the MOV instruction. + +\b Memory references: \c{mem} denotes a generic \i{memory reference}; +\c{mem8}, \c{mem16}, \c{mem32}, \c{mem64} and \c{mem80} are used +when the operand needs to be a specific size. Again, a specifier is +needed in some cases: \c{DEC [address]} is ambiguous and will be +rejected by NASM. You must specify \c{DEC BYTE [address]}, \c{DEC +WORD [address]} or \c{DEC DWORD [address]} instead. + +\b \i{Restricted memory references}: one form of the \c{MOV} +instruction allows a memory address to be specified \e{without} +allowing the normal range of register combinations and effective +address processing. This is denoted by \c{memoffs8}, \c{memoffs16}, +\c{memoffs32} or \c{memoffs64}. 
+ +\b Register or memory choices: many instructions can accept either a +register \e{or} a memory reference as an operand. \c{r/m8} is +shorthand for \c{reg8/mem8}; similarly \c{r/m16} and \c{r/m32}. +On legacy x86 modes, \c{r/m64} is MMX-related, and is shorthand for +\c{mmxreg/mem64}. When utilizing the x86-64 architecture extension, +\c{r/m64} denotes use of a 64-bit GPR as well, and is shorthand for +\c{reg64/mem64}. + + +\H{iref-opc} Key to Opcode Descriptions + +This appendix also provides the opcodes which NASM will generate for +each form of each instruction. The opcodes are listed in the +following way: + +\b A hex number, such as \c{3F}, indicates a fixed byte containing +that number. + +\b A hex number followed by \c{+r}, such as \c{C8+r}, indicates that +one of the operands to the instruction is a register, and the +`register value' of that register should be added to the hex number +to produce the generated byte. For example, EDX has register value +2, so the code \c{C8+r}, when the register operand is EDX, generates +the hex byte \c{CA}. Register values for specific registers are +given in \k{iref-rv}. + +\b A hex number followed by \c{+cc}, such as \c{40+cc}, indicates +that the instruction name has a condition code suffix, and the +numeric representation of the condition code should be added to the +hex number to produce the generated byte. For example, the code +\c{40+cc}, when the instruction contains the \c{NE} condition, +generates the hex byte \c{45}. Condition codes and their numeric +representations are given in \k{iref-cc}. + +\b A slash followed by a digit, such as \c{/2}, indicates that one +of the operands to the instruction is a memory address or register +(denoted \c{mem} or \c{r/m}, with an optional size). 
This is to be +encoded as an effective address, with a \i{ModR/M byte}, an optional +\i{SIB byte}, and an optional displacement, and the spare (register) +field of the ModR/M byte should be the digit given (which will be +from 0 to 7, so it fits in three bits). The encoding of effective +addresses is given in \k{iref-ea}. + +\b The code \c{/r} combines the above two: it indicates that one of +the operands is a memory address or \c{r/m}, and another is a +register, and that an effective address should be generated with the +spare (register) field in the ModR/M byte being equal to the +`register value' of the register operand. The encoding of effective +addresses is given in \k{iref-ea}; register values are given in +\k{iref-rv}. + +\b The codes \c{ib}, \c{iw} and \c{id} indicate that one of the +operands to the instruction is an immediate value, and that this is +to be encoded as a byte, little-endian word or little-endian +doubleword respectively. + +\b The codes \c{rb}, \c{rw} and \c{rd} indicate that one of the +operands to the instruction is an immediate value, and that the +\e{difference} between this value and the address of the end of the +instruction is to be encoded as a byte, word or doubleword +respectively. Where the form \c{rw/rd} appears, it indicates that +either \c{rw} or \c{rd} should be used according to whether assembly +is being performed in \c{BITS 16} or \c{BITS 32} state respectively. + +\b The codes \c{ow} and \c{od} indicate that one of the operands to +the instruction is a reference to the contents of a memory address +specified as an immediate value: this encoding is used in some forms +of the \c{MOV} instruction in place of the standard +effective-address mechanism. The displacement is encoded as a word +or doubleword. Again, \c{ow/od} denotes that \c{ow} or \c{od} should +be chosen according to the \c{BITS} setting. 
+ +\b The codes \c{o16} and \c{o32} indicate that the given form of the +instruction should be assembled with operand size 16 or 32 bits. In +other words, \c{o16} indicates a \c{66} prefix in \c{BITS 32} state, +but generates no code in \c{BITS 16} state; and \c{o32} indicates a +\c{66} prefix in \c{BITS 16} state but generates nothing in \c{BITS +32}. + +\b The codes \c{a16} and \c{a32}, similarly to \c{o16} and \c{o32}, +indicate the address size of the given form of the instruction. +Where this does not match the \c{BITS} setting, a \c{67} prefix is +required. Please note that \c{a16} is useless in long mode as +16-bit addressing is deprecated on the x86-64 architecture extension. + + +\S{iref-rv} Register Values + +Where an instruction requires a register value, it is already +implicit in the encoding of the rest of the instruction what type of +register is intended: an 8-bit general-purpose register, a segment +register, a debug register, an MMX register, or whatever. Therefore +there is no problem with registers of different types sharing an +encoding value. + +Please note that for the register classes listed below, the register +extensions (REX) classes require the use of the REX prefix, which +is only available in long mode on an x86-64 processor. This +applies, in particular, to any register that has a number higher than 7. + +The encodings for the various classes of register are: + +\b 8-bit general registers: \c{AL} is 0, \c{CL} is 1, \c{DL} is 2, +\c{BL} is 3, \c{AH} is 4, \c{CH} is 5, \c{DH} is 6 and \c{BH} is +7. Please note that \c{AH}, \c{BH}, \c{CH} and \c{DH} are not +addressable when using the REX prefix in long mode. + +\b 8-bit general register extensions (REX): \c{SPL} is 4, \c{BPL} is 5, +\c{SIL} is 6, \c{DIL} is 7, \c{R8B} is 8, \c{R9B} is 9, \c{R10B} is 10, +\c{R11B} is 11, \c{R12B} is 12, \c{R13B} is 13, \c{R14B} is 14 and +\c{R15B} is 15. 
+ +\b 16-bit general registers: \c{AX} is 0, \c{CX} is 1, \c{DX} is 2, +\c{BX} is 3, \c{SP} is 4, \c{BP} is 5, \c{SI} is 6, and \c{DI} is 7. + +\b 16-bit general register extensions (REX): \c{R8W} is 8, \c{R9W} is 9, +\c{R10W} is 10, \c{R11W} is 11, \c{R12W} is 12, \c{R13W} is 13, \c{R14W} +is 14 and \c{R15W} is 15. + +\b 32-bit general registers: \c{EAX} is 0, \c{ECX} is 1, \c{EDX} is +2, \c{EBX} is 3, \c{ESP} is 4, \c{EBP} is 5, \c{ESI} is 6, and +\c{EDI} is 7. + +\b 32-bit general register extensions (REX): \c{R8D} is 8, \c{R9D} is 9, +\c{R10D} is 10, \c{R11D} is 11, \c{R12D} is 12, \c{R13D} is 13, \c{R14D} +is 14 and \c{R15D} is 15. + +\b 64-bit general register extensions (REX): \c{RAX} is 0, \c{RCX} is 1, +\c{RDX} is 2, \c{RBX} is 3, \c{RSP} is 4, \c{RBP} is 5, \c{RSI} is 6, +\c{RDI} is 7, \c{R8} is 8, \c{R9} is 9, \c{R10} is 10, \c{R11} is 11, +\c{R12} is 12, \c{R13} is 13, \c{R14} is 14 and \c{R15} is 15. + +\b \i{Segment registers}: \c{ES} is 0, \c{CS} is 1, \c{SS} is 2, \c{DS} +is 3, \c{FS} is 4, and \c{GS} is 5. + +\b \I{floating-point, registers}Floating-point registers: \c{ST0} +is 0, \c{ST1} is 1, \c{ST2} is 2, \c{ST3} is 3, \c{ST4} is 4, +\c{ST5} is 5, \c{ST6} is 6, and \c{ST7} is 7. + +\b 64-bit \i{MMX registers}: \c{MM0} is 0, \c{MM1} is 1, \c{MM2} is 2, +\c{MM3} is 3, \c{MM4} is 4, \c{MM5} is 5, \c{MM6} is 6, and \c{MM7} +is 7. + +\b 128-bit \i{XMM (SSE) registers}: \c{XMM0} is 0, \c{XMM1} is 1, +\c{XMM2} is 2, \c{XMM3} is 3, \c{XMM4} is 4, \c{XMM5} is 5, \c{XMM6} is +6 and \c{XMM7} is 7. + +\b 128-bit \i{XMM (SSE) register} extensions (REX): \c{XMM8} is 8, +\c{XMM9} is 9, \c{XMM10} is 10, \c{XMM11} is 11, \c{XMM12} is 12, +\c{XMM13} is 13, \c{XMM14} is 14 and \c{XMM15} is 15. + +\b \i{Control registers}: \c{CR0} is 0, \c{CR2} is 2, \c{CR3} is 3, +and \c{CR4} is 4. + +\b \i{Control register} extensions: \c{CR8} is 8. + +\b \i{Debug registers}: \c{DR0} is 0, \c{DR1} is 1, \c{DR2} is 2, +\c{DR3} is 3, \c{DR6} is 6, and \c{DR7} is 7. 
+ +\b \i{Test registers}: \c{TR3} is 3, \c{TR4} is 4, \c{TR5} is 5, +\c{TR6} is 6, and \c{TR7} is 7. + +(Note that wherever a register name contains a number, that number +is also the register value for that register.) + + +\S{iref-cc} \i{Condition Codes} + +The available condition codes are given here, along with their +numeric representations as part of opcodes. Many of these condition +codes have synonyms, so several will be listed at a time. + +In the following descriptions, the word `either', when applied to two +possible trigger conditions, is used to mean `either or both'. If +`either but not both' is meant, the phrase `exactly one of' is used. + +\b \c{O} is 0 (trigger if the overflow flag is set); \c{NO} is 1. + +\b \c{B}, \c{C} and \c{NAE} are 2 (trigger if the carry flag is +set); \c{AE}, \c{NB} and \c{NC} are 3. + +\b \c{E} and \c{Z} are 4 (trigger if the zero flag is set); \c{NE} +and \c{NZ} are 5. + +\b \c{BE} and \c{NA} are 6 (trigger if either of the carry or zero +flags is set); \c{A} and \c{NBE} are 7. + +\b \c{S} is 8 (trigger if the sign flag is set); \c{NS} is 9. + +\b \c{P} and \c{PE} are 10 (trigger if the parity flag is set); +\c{NP} and \c{PO} are 11. + +\b \c{L} and \c{NGE} are 12 (trigger if exactly one of the sign and +overflow flags is set); \c{GE} and \c{NL} are 13. + +\b \c{LE} and \c{NG} are 14 (trigger if either the zero flag is set, +or exactly one of the sign and overflow flags is set); \c{G} and +\c{NLE} are 15. + +Note that in all cases, the sense of a condition code may be +reversed by changing the low bit of the numeric representation. + +For details of when an instruction sets each of the status flags, +see the individual instruction, plus the Status Flags reference +in \k{iref-Flags} + + +\S{iref-SSE-cc} \i{SSE Condition Predicates} + +The condition predicates for SSE comparison instructions are the +codes used as part of the opcode, to determine what form of +comparison is being carried out. 
In each case, the imm8 value is +the final byte of the opcode encoding, and the predicate is the +code used as part of the mnemonic for the instruction (equivalent +to the "cc" in an integer instruction that used a condition code). +The instructions that use this will give details of what the various +mnemonics are; this table is meant to help you work out the details of what +is happening. +
+\c Predi-  imm8    Description     Relation where:    Emula-     Result    QNaN
+\c cate    Encod-                  A Is 1st Operand   tion       if NaN    Signal
+\c         ing                     B Is 2nd Operand              Operand   Invalid
+\c
+\c EQ      000B    equal           A = B                         False     No
+\c
+\c LT      001B    less-than       A < B                         False     Yes
+\c
+\c LE      010B    less-than-      A <= B                        False     Yes
+\c                 or-equal
+\c
+\c ---     ----    greater         A > B              Swap       False     Yes
+\c                 than                               Operands,
+\c                                                    Use LT
+\c
+\c ---     ----    greater-        A >= B             Swap       False     Yes
+\c                 than-or-equal                      Operands,
+\c                                                    Use LE
+\c
+\c UNORD   011B    unordered       A, B = Unordered              True      No
+\c
+\c NEQ     100B    not-equal       A != B                        True      No
+\c
+\c NLT     101B    not-less-       NOT(A < B)                    True      Yes
+\c                 than
+\c
+\c NLE     110B    not-less-       NOT(A <= B)                   True      Yes
+\c                 than-or-
+\c                 equal
+\c
+\c ---     ----    not-greater     NOT(A > B)         Swap       True      Yes
+\c                 than                               Operands,
+\c                                                    Use NLT
+\c
+\c ---     ----    not-greater     NOT(A >= B)        Swap       True      Yes
+\c                 than-                              Operands,
+\c                 or-equal                           Use NLE
+\c
+\c ORD     111B    ordered         A , B = Ordered               False     No
+ +The unordered relationship is true when at least one of the two +values being compared is a NaN or in an unsupported format. + +Note that the comparisons which are listed as not having a predicate +or encoding can only be achieved through software emulation, as +described in the "emulation" column. Note in particular that a +comparison such as \c{greater-than} is not the same as \c{NLE}, as, +unlike with the \c{CMP} instruction, it has to take into account the +possibility of one operand containing a NaN or an unsupported numeric +format. + + +\S{iref-Flags} \i{Status Flags} + +The status flags provide some information about the result of the +arithmetic instructions. 
This information can be used by conditional +instructions (such as \c{Jcc} and \c{CMOVcc}) as well as by some of +the other instructions (such as \c{ADC} and \c{INTO}). + +There are 6 status flags: + +\c CF - Carry flag. + +Set if an arithmetic operation generates a +carry or a borrow out of the most-significant bit of the result; +cleared otherwise. This flag indicates an overflow condition for +unsigned-integer arithmetic. It is also used in multiple-precision +arithmetic. + +\c PF - Parity flag. + +Set if the least-significant byte of the result contains an even +number of 1 bits; cleared otherwise. + +\c AF - Adjust flag. + +Set if an arithmetic operation generates a carry or a borrow +out of bit 3 of the result; cleared otherwise. This flag is used +in binary-coded decimal (BCD) arithmetic. + +\c ZF - Zero flag. + +Set if the result is zero; cleared otherwise. + +\c SF - Sign flag. + +Set equal to the most-significant bit of the result, which is the +sign bit of a signed integer. (0 indicates a positive value and 1 +indicates a negative value.) + +\c OF - Overflow flag. + +Set if the integer result is too large a positive number or too +small a negative number (excluding the sign-bit) to fit in the +destination operand; cleared otherwise. This flag indicates an +overflow condition for signed-integer (two's complement) arithmetic. + + +\S{iref-ea} Effective Address Encoding: \i{ModR/M} and \i{SIB} + +An \i{effective address} is encoded in up to three parts: a ModR/M +byte, an optional SIB byte, and an optional byte, word or doubleword +displacement field. + +The ModR/M byte consists of three fields: the \c{mod} field, ranging +from 0 to 3, in the upper two bits of the byte, the \c{r/m} field, +ranging from 0 to 7, in the lower three bits, and the spare +(register) field in the middle (bit 3 to bit 5). 
The spare field is +not relevant to the effective address being encoded, and either +contains an extension to the instruction opcode or the register +value of another operand. + +The ModR/M system can be used to encode a direct register reference +rather than a memory access. This is always done by setting the +\c{mod} field to 3 and the \c{r/m} field to the register value of +the register in question (it must be a general-purpose register, and +the size of the register must already be implicit in the encoding of +the rest of the instruction). In this case, the SIB byte and +displacement field are both absent. + +In 16-bit addressing mode (either \c{BITS 16} with no \c{67} prefix, +or \c{BITS 32} with a \c{67} prefix), the SIB byte is never used. +The general rules for \c{mod} and \c{r/m} (there is an exception, +given below) are: + +\b The \c{mod} field gives the length of the displacement field: 0 +means no displacement, 1 means one byte, and 2 means two bytes. + +\b The \c{r/m} field encodes the combination of registers to be +added to the displacement to give the accessed address: 0 means +\c{BX+SI}, 1 means \c{BX+DI}, 2 means \c{BP+SI}, 3 means \c{BP+DI}, +4 means \c{SI} only, 5 means \c{DI} only, 6 means \c{BP} only, and 7 +means \c{BX} only. + +However, there is a special case: + +\b If \c{mod} is 0 and \c{r/m} is 6, the effective address encoded +is not \c{[BP]} as the above rules would suggest, but instead +\c{[disp16]}: the displacement field is present and is two bytes +long, and no registers are added to the displacement. + +Therefore the effective address \c{[BP]} cannot be encoded as +efficiently as \c{[BX]}; so if you code \c{[BP]} in a program, NASM +adds a notional 8-bit zero displacement, and sets \c{mod} to 1, +\c{r/m} to 6, and the one-byte displacement field to 0. 
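The 16-bit rules above, including the \c{[BP]} special case, can be sketched in a few lines of Python. This is only an illustration, not NASM's own code: the names \c{RM16} and \c{encode_ea16} are hypothetical, and the sketch assumes that a one-byte displacement is treated as a signed value that the processor sign-extends.

```python
# Hypothetical helper (not part of NASM): encode a 16-bit effective address.
# r/m values for the register combinations listed in the text.
RM16 = {
    ("BX", "SI"): 0, ("BX", "DI"): 1, ("BP", "SI"): 2, ("BP", "DI"): 3,
    ("SI",): 4, ("DI",): 5, ("BP",): 6, ("BX",): 7,
}


def encode_ea16(regs, disp=0, spare=0):
    """Return the ModR/M byte followed by the displacement bytes.

    regs  -- tuple of register names, e.g. ("BX", "SI"); () means [disp16]
    disp  -- signed displacement
    spare -- value of the spare (register) field, 0..7
    """
    if not regs:
        # Special case: mod=0 with r/m=6 means [disp16], no registers.
        mod, rm = 0, 6
        disp_bytes = (disp & 0xFFFF).to_bytes(2, "little")
    else:
        rm = RM16[tuple(regs)]
        if disp == 0 and rm != 6:
            mod, disp_bytes = 0, b""
        elif -128 <= disp <= 127:
            # Also used for plain [BP]: a notional zero byte displacement.
            mod, disp_bytes = 1, (disp & 0xFF).to_bytes(1, "little")
        else:
            mod, disp_bytes = 2, (disp & 0xFFFF).to_bytes(2, "little")
    modrm = (mod << 6) | (spare << 3) | rm
    return bytes([modrm]) + disp_bytes
```

With these assumptions, \c{[BX]} costs one byte (\c{07}), whereas \c{[BP]} costs two (\c{46 00}), which is the inefficiency described above.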
+ +In 32-bit addressing mode (either \c{BITS 16} with a \c{67} prefix, +or \c{BITS 32} with no \c{67} prefix) the general rules (again, +there are exceptions) for \c{mod} and \c{r/m} are: + +\b The \c{mod} field gives the length of the displacement field: 0 +means no displacement, 1 means one byte, and 2 means four bytes. + +\b If only one register is to be added to the displacement, and it +is not \c{ESP}, the \c{r/m} field gives its register value, and the +SIB byte is absent. If the \c{r/m} field is 4 (which would encode +\c{ESP}), the SIB byte is present and gives the combination and +scaling of registers to be added to the displacement. + +If the SIB byte is present, it describes the combination of +registers (an optional base register, and an optional index register +scaled by multiplication by 1, 2, 4 or 8) to be added to the +displacement. The SIB byte is divided into the \c{scale} field, in +the top two bits, the \c{index} field in the next three, and the +\c{base} field in the bottom three. The general rules are: + +\b The \c{base} field encodes the register value of the base +register. + +\b The \c{index} field encodes the register value of the index +register, unless it is 4, in which case no index register is used +(so \c{ESP} cannot be used as an index register). + +\b The \c{scale} field encodes the multiplier by which the index +register is scaled before adding it to the base and displacement: 0 +encodes a multiplier of 1, 1 encodes 2, 2 encodes 4 and 3 encodes 8. + +The exceptions to the 32-bit encoding rules are: + +\b If \c{mod} is 0 and \c{r/m} is 5, the effective address encoded +is not \c{[EBP]} as the above rules would suggest, but instead +\c{[disp32]}: the displacement field is present and is four bytes +long, and no registers are added to the displacement. 
+ +\b If \c{mod} is 0, \c{r/m} is 4 (meaning the SIB byte is present) +and \c{base} is 5, the effective address encoded is not +\c{[EBP+index]} as the above rules would suggest, but instead +\c{[disp32+index]}: the displacement field is present and is four +bytes long, and there is no base register (but the index register is +still processed in the normal way). + + +\S{iref-rex} Register Extensions: The \i{REX} Prefix + +The Register Extensions prefix, or \i{REX} for short, is the means +of accessing extended registers on the x86-64 architecture. \i{REX} +is considered an instruction prefix, but is required to be after +all other prefixes and thus immediately before the first instruction +opcode itself. So overall, \i{REX} can be thought of as an "Opcode +Prefix" instead. The \i{REX} prefix itself is indicated by a value +of 0x4X, where X is one of 16 different combinations of the actual +\i{REX} flags. + +The \i{REX} prefix flags consist of four 1-bit extension fields. +These flags are found in the lower nibble of the actual \i{REX} +prefix opcode. Below is the list of \i{REX} prefix flags, from +high bit to low bit. + +\c{REX.W}: When set, this flag indicates the use of a 64-bit operand, +as opposed to the default of using 32-bit operands as found in 32-bit +Protected Mode. + +\c{REX.R}: When set, this flag extends the \c{reg (spare)} field of +the \c{ModRM} byte. Overall, this raises the number of addressable +registers in this field from 8 to 16. + +\c{REX.X}: When set, this flag extends the \c{index} field of the +\c{SIB} byte. Overall, this raises the number of addressable +registers in this field from 8 to 16. + +\c{REX.B}: When set, this flag extends the \c{r/m} field of the +\c{ModRM} byte or the \c{base} field of the \c{SIB} byte. This flag +can also represent an extension to the +opcode register \c{(+r)} field. The determination of which is used +varies depending on which instruction is used. Overall, this raises +the number of addressable registers in these fields from 8 to 16. 
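As a rough illustration of the flag layout just described, the following Python sketch builds a \c{REX} prefix byte from its four flags and splits a register value from 0 to 15 into its extension bit and the three bits stored in the ordinary field. The function names \c{rex_prefix} and \c{split_reg} are hypothetical and chosen for this example only.

```python
def rex_prefix(w=0, r=0, x=0, b=0):
    """Build a REX prefix byte with the bit layout 0100WRXB."""
    for flag in (w, r, x, b):
        if flag not in (0, 1):
            raise ValueError("REX flags are single bits")
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b


def split_reg(value):
    """Split a register value 0..15 into (extension bit, low three bits).

    The extension bit goes into REX.R, REX.X or REX.B, depending on
    which field of the instruction holds the register.
    """
    if not 0 <= value <= 15:
        raise ValueError("register value out of range")
    return value >> 3, value & 7
```

For instance, \c{rex_prefix(w=1)} yields \c{0x48}, and \c{rex_prefix(w=1, r=1)} yields \c{0x4C}; \c{split_reg(8)} shows that \c{R8} is stored as 0 in its three-bit field with the matching extension flag set.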
+ +Internal use of the \i{REX} prefix by the processor is consistent, +yet non-trivial. Most instructions use the \i{REX} prefix as +indicated by the above flags. Some instructions require the \i{REX} +prefix to be present even if the flags are empty. Some instructions +default to a 64-bit operand and require the \i{REX} prefix only for +actual register extensions, and thus ignore the \c{REX.W} field +completely. + +At any rate, NASM is designed to handle, and fully supports, the +\i{REX} prefix internally. Please read the appropriate processor +documentation for further information on the \i{REX} prefix. + +You may have noticed that opcodes 0x40 through 0x4F are actually +opcodes for the INC/DEC instructions for each General Purpose +Register. This is, of course, correct... for legacy x86. While +in long mode, opcodes 0x40 through 0x4F are reserved for use as +the REX prefix. The other opcode forms of the INC/DEC instructions +are used instead. + + +\H{iref-flg} Key to Instruction Flags + +Given along with each instruction in this appendix is a set of +flags, denoting the type of the instruction. The types are as follows: + +\b \c{8086}, \c{186}, \c{286}, \c{386}, \c{486}, \c{PENT} and \c{P6} +denote the lowest processor type that supports the instruction. Most +instructions run on all processors above the given type; those that +do not are documented. The Pentium II contains no additional +instructions beyond the P6 (Pentium Pro); from the point of view of +its instruction set, it can be thought of as a P6 with MMX +capability. + +\b \c{3DNOW} indicates that the instruction is a 3DNow! one, and will +run on the AMD K6-2 and later processors. ATHLON extensions to the +3DNow! instruction set are documented as such. + +\b \c{CYRIX} indicates that the instruction is specific to Cyrix +processors, for example the extra MMX instructions in the Cyrix +extended MMX instruction set. 
+ +\b \c{FPU} indicates that the instruction is a floating-point one, +and will only run on machines with a coprocessor (automatically +including 486DX, Pentium and above). + +\b \c{KATMAI} indicates that the instruction was introduced as part +of the Katmai New Instruction set. These instructions are available +on the Pentium III and later processors. Those which are not +specifically SSE instructions are also available on the AMD Athlon. + +\b \c{MMX} indicates that the instruction is an MMX one, and will +run on MMX-capable Pentium processors and the Pentium II. + +\b \c{PRIV} indicates that the instruction is a protected-mode +management instruction. Many of these may only be used in protected +mode, or only at privilege level zero. + +\b \c{SSE} and \c{SSE2} indicate that the instruction is a Streaming +SIMD Extension instruction. These instructions operate on multiple +values in a single operation. SSE was introduced with the Pentium III +and SSE2 was introduced with the Pentium 4. + +\b \c{UNDOC} indicates that the instruction is an undocumented one, +and not part of the official Intel Architecture; it may or may not +be supported on any given machine. + +\b \c{WILLAMETTE} indicates that the instruction was introduced as +part of the new instruction set in the Pentium 4 and Intel Xeon +processors. These instructions are also known as SSE2 instructions. + +\b \c{X64} indicates that the instruction was introduced as part of +the new instruction set in the x86-64 architecture extension, +commonly referred to as x64, AMD64 or EM64T. 
+ + +\H{iref-inst} x86 Instruction Set + + +\S{insAAA} \i\c{AAA}, \i\c{AAS}, \i\c{AAM}, \i\c{AAD}: ASCII +Adjustments + +\c AAA ; 37 [8086] + +\c AAS ; 3F [8086] + +\c AAD ; D5 0A [8086] +\c AAD imm ; D5 ib [8086] + +\c AAM ; D4 0A [8086] +\c AAM imm ; D4 ib [8086] + +These instructions are used in conjunction with the add, subtract, +multiply and divide instructions to perform binary-coded decimal +arithmetic in \e{unpacked} (one BCD digit per byte - easy to +translate to and from \c{ASCII}, hence the instruction names) form. +There are also packed BCD instructions \c{DAA} and \c{DAS}: see +\k{insDAA}. + +\b \c{AAA} (ASCII Adjust After Addition) should be used after a +one-byte \c{ADD} instruction whose destination was the \c{AL} +register: by means of examining the value in the low nibble of +\c{AL} and also the auxiliary carry flag \c{AF}, it determines +whether the addition has overflowed, and adjusts it (and sets +the carry flag) if so. You can add long BCD strings together +by doing \c{ADD}/\c{AAA} on the low digits, then doing +\c{ADC}/\c{AAA} on each subsequent digit. + +\b \c{AAS} (ASCII Adjust AL After Subtraction) works similarly to +\c{AAA}, but is for use after \c{SUB} instructions rather than +\c{ADD}. + +\b \c{AAM} (ASCII Adjust AX After Multiply) is for use after you +have multiplied two decimal digits together and left the result +in \c{AL}: it divides \c{AL} by ten and stores the quotient in +\c{AH}, leaving the remainder in \c{AL}. The divisor 10 can be +changed by specifying an operand to the instruction: a particularly +handy use of this is \c{AAM 16}, causing the two nibbles in \c{AL} +to be separated into \c{AH} and \c{AL}. + +\b \c{AAD} (ASCII Adjust AX Before Division) performs the inverse +operation to \c{AAM}: it multiplies \c{AH} by ten, adds it to +\c{AL}, and sets \c{AH} to zero. Again, the multiplier 10 can +be changed. 
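The arithmetic performed by \c{AAM} and \c{AAD} can be sketched in Python as follows. This models only the register values described above; the effect on the flags is not modelled, and the function names \c{aam} and \c{aad} are simply chosen to mirror the mnemonics.

```python
def aam(al, base=10):
    """Emulate AAM: return (AH, AL) = (AL // base, AL % base)."""
    return al // base, al % base


def aad(ah, al, base=10):
    """Emulate AAD: return (AH, AL) = (0, (AH * base + AL) mod 256)."""
    return 0, (ah * base + al) & 0xFF
```

Under this model, \c{aam(0xAB, 16)} separates the two nibbles into \c{AH} = 0xA and \c{AL} = 0xB, matching the \c{AAM 16} trick mentioned above, and \c{aad} applied to the result of \c{aam} restores the original \c{AL}.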
+ + +\S{insADC} \i\c{ADC}: Add with Carry + +\c ADC r/m8,reg8 ; 10 /r [8086] +\c ADC r/m16,reg16 ; o16 11 /r [8086] +\c ADC r/m32,reg32 ; o32 11 /r [386] + +\c ADC reg8,r/m8 ; 12 /r [8086] +\c ADC reg16,r/m16 ; o16 13 /r [8086] +\c ADC reg32,r/m32 ; o32 13 /r [386] + +\c ADC r/m8,imm8 ; 80 /2 ib [8086] +\c ADC r/m16,imm16 ; o16 81 /2 iw [8086] +\c ADC r/m32,imm32 ; o32 81 /2 id [386] + +\c ADC r/m16,imm8 ; o16 83 /2 ib [8086] +\c ADC r/m32,imm8 ; o32 83 /2 ib [386] + +\c ADC AL,imm8 ; 14 ib [8086] +\c ADC AX,imm16 ; o16 15 iw [8086] +\c ADC EAX,imm32 ; o32 15 id [386] + +\c{ADC} performs integer addition: it adds its two operands +together, plus the value of the carry flag, and leaves the result in +its destination (first) operand. The destination operand can be a +register or a memory location. The source operand can be a register, +a memory location or an immediate value. + +The flags are set according to the result of the operation: in +particular, the carry flag is affected and can be used by a +subsequent \c{ADC} instruction. + +In the forms with an 8-bit immediate second operand and a longer +first operand, the second operand is considered to be signed, and is +sign-extended to the length of the first operand. In these cases, +the \c{BYTE} qualifier is necessary to force NASM to generate this +form of the instruction. + +To add two numbers without also adding the contents of the carry +flag, use \c{ADD} (\k{insADD}). 
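To illustrate how the carry flag chains additions together, here is a small Python model of a multi-word addition. It is a sketch only: \c{add_with_carry} and \c{add_multiword} are hypothetical names, only the carry flag is modelled, and the word size defaults to 32 bits.

```python
def add_with_carry(a, b, carry_in=0, bits=32):
    """Model ADC on one word: return (result, carry_out)."""
    mask = (1 << bits) - 1
    total = (a & mask) + (b & mask) + carry_in
    return total & mask, total >> bits


def add_multiword(xs, ys, bits=32):
    """Add two equal-length little-endian word lists.

    The first step has no incoming carry, so it behaves like ADD;
    every later step behaves like ADC.
    """
    result, carry = [], 0
    for x, y in zip(xs, ys):
        word, carry = add_with_carry(x, y, carry, bits)
        result.append(word)
    return result, carry
```

Adding \c{[0xFFFFFFFF, 0]} and \c{[1, 0]} in this model gives \c{[0, 1]} with no final carry: the carry out of the low word is picked up by the second step.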
+ + +\S{insADD} \i\c{ADD}: Add Integers + +\c ADD r/m8,reg8 ; 00 /r [8086] +\c ADD r/m16,reg16 ; o16 01 /r [8086] +\c ADD r/m32,reg32 ; o32 01 /r [386] + +\c ADD reg8,r/m8 ; 02 /r [8086] +\c ADD reg16,r/m16 ; o16 03 /r [8086] +\c ADD reg32,r/m32 ; o32 03 /r [386] + +\c ADD r/m8,imm8 ; 80 /0 ib [8086] +\c ADD r/m16,imm16 ; o16 81 /0 iw [8086] +\c ADD r/m32,imm32 ; o32 81 /0 id [386] + +\c ADD r/m16,imm8 ; o16 83 /0 ib [8086] +\c ADD r/m32,imm8 ; o32 83 /0 ib [386] + +\c ADD AL,imm8 ; 04 ib [8086] +\c ADD AX,imm16 ; o16 05 iw [8086] +\c ADD EAX,imm32 ; o32 05 id [386] + +\c{ADD} performs integer addition: it adds its two operands +together, and leaves the result in its destination (first) operand. +The destination operand can be a register or a memory location. +The source operand can be a register, a memory location or an +immediate value. + +The flags are set according to the result of the operation: in +particular, the carry flag is affected and can be used by a +subsequent \c{ADC} instruction. + +In the forms with an 8-bit immediate second operand and a longer +first operand, the second operand is considered to be signed, and is +sign-extended to the length of the first operand. In these cases, +the \c{BYTE} qualifier is necessary to force NASM to generate this +form of the instruction. + + +\S{insADDPD} \i\c{ADDPD}: ADD Packed Double-Precision FP Values + +\c ADDPD xmm1,xmm2/mem128 ; 66 0F 58 /r [WILLAMETTE,SSE2] + +\c{ADDPD} performs addition on each of two packed double-precision +FP value pairs. + +\c dst[0-63] := dst[0-63] + src[0-63], +\c dst[64-127] := dst[64-127] + src[64-127]. + +The destination is an \c{XMM} register. The source operand can be +either an \c{XMM} register or a 128-bit memory location. 
+ + +\S{insADDPS} \i\c{ADDPS}: ADD Packed Single-Precision FP Values + +\c ADDPS xmm1,xmm2/mem128 ; 0F 58 /r [KATMAI,SSE] + +\c{ADDPS} performs addition on each of four packed single-precision +FP value pairs. + +\c dst[0-31] := dst[0-31] + src[0-31], +\c dst[32-63] := dst[32-63] + src[32-63], +\c dst[64-95] := dst[64-95] + src[64-95], +\c dst[96-127] := dst[96-127] + src[96-127]. + +The destination is an \c{XMM} register. The source operand can be +either an \c{XMM} register or a 128-bit memory location. + + +\S{insADDSD} \i\c{ADDSD}: ADD Scalar Double-Precision FP Values + +\c ADDSD xmm1,xmm2/mem64 ; F2 0F 58 /r [WILLAMETTE,SSE2] + +\c{ADDSD} adds the low double-precision FP values from the source +and destination operands and stores the double-precision FP result +in the destination operand. + +\c dst[0-63] := dst[0-63] + src[0-63], +\c dst[64-127] remains unchanged. + +The destination is an \c{XMM} register. The source operand can be +either an \c{XMM} register or a 64-bit memory location. + + +\S{insADDSS} \i\c{ADDSS}: ADD Scalar Single-Precision FP Values + +\c ADDSS xmm1,xmm2/mem32 ; F3 0F 58 /r [KATMAI,SSE] + +\c{ADDSS} adds the low single-precision FP values from the source +and destination operands and stores the single-precision FP result +in the destination operand. + +\c dst[0-31] := dst[0-31] + src[0-31], +\c dst[32-127] remains unchanged. + +The destination is an \c{XMM} register. The source operand can be +either an \c{XMM} register or a 32-bit memory location. 
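The difference between the packed and scalar forms can be seen in a small Python model. An \c{XMM} register is represented here as a list of four single-precision lanes, lowest lane first; the names \c{to_f32}, \c{addps} and \c{addss} are hypothetical, and exceptions, NaN handling and rounding-mode control are not modelled.

```python
import struct


def to_f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def addps(dst, src):
    """Model ADDPS: add all four single-precision lanes."""
    return [to_f32(d + s) for d, s in zip(dst, src)]


def addss(dst, src):
    """Model ADDSS: add the low lane only; upper lanes of dst are kept."""
    return [to_f32(dst[0] + src[0])] + list(dst[1:])
```

In this model \c{addss} ignores the three upper lanes of the source entirely, which corresponds to the "remains unchanged" line in the description of the scalar forms.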
+ + +\S{insAND} \i\c{AND}: Bitwise AND + +\c AND r/m8,reg8 ; 20 /r [8086] +\c AND r/m16,reg16 ; o16 21 /r [8086] +\c AND r/m32,reg32 ; o32 21 /r [386] + +\c AND reg8,r/m8 ; 22 /r [8086] +\c AND reg16,r/m16 ; o16 23 /r [8086] +\c AND reg32,r/m32 ; o32 23 /r [386] + +\c AND r/m8,imm8 ; 80 /4 ib [8086] +\c AND r/m16,imm16 ; o16 81 /4 iw [8086] +\c AND r/m32,imm32 ; o32 81 /4 id [386] + +\c AND r/m16,imm8 ; o16 83 /4 ib [8086] +\c AND r/m32,imm8 ; o32 83 /4 ib [386] + +\c AND AL,imm8 ; 24 ib [8086] +\c AND AX,imm16 ; o16 25 iw [8086] +\c AND EAX,imm32 ; o32 25 id [386] + +\c{AND} performs a bitwise AND operation between its two operands +(i.e. each bit of the result is 1 if and only if the corresponding +bits of the two inputs were both 1), and stores the result in the +destination (first) operand. The destination operand can be a +register or a memory location. The source operand can be a register, +a memory location or an immediate value. + +In the forms with an 8-bit immediate second operand and a longer +first operand, the second operand is considered to be signed, and is +sign-extended to the length of the first operand. In these cases, +the \c{BYTE} qualifier is necessary to force NASM to generate this +form of the instruction. + +The \c{MMX} instruction \c{PAND} (see \k{insPAND}) performs the same +operation on the 64-bit \c{MMX} registers. + + +\S{insANDNPD} \i\c{ANDNPD}: Bitwise Logical AND NOT of +Packed Double-Precision FP Values + +\c ANDNPD xmm1,xmm2/mem128 ; 66 0F 55 /r [WILLAMETTE,SSE2] + +\c{ANDNPD} inverts the bits of the two double-precision +floating-point values in the destination register, and then +performs a logical AND between the two double-precision +floating-point values in the source operand and the temporary +inverted result, storing the result in the destination register. + +\c dst[0-63] := src[0-63] AND NOT dst[0-63], +\c dst[64-127] := src[64-127] AND NOT dst[64-127]. + +The destination is an \c{XMM} register. 
The source operand can be +either an \c{XMM} register or a 128-bit memory location. + + +\S{insANDNPS} \i\c{ANDNPS}: Bitwise Logical AND NOT of +Packed Single-Precision FP Values + +\c ANDNPS xmm1,xmm2/mem128 ; 0F 55 /r [KATMAI,SSE] + +\c{ANDNPS} inverts the bits of the four single-precision +floating-point values in the destination register, and then +performs a logical AND between the four single-precision +floating-point values in the source operand and the temporary +inverted result, storing the result in the destination register. + +\c dst[0-31] := src[0-31] AND NOT dst[0-31], +\c dst[32-63] := src[32-63] AND NOT dst[32-63], +\c dst[64-95] := src[64-95] AND NOT dst[64-95], +\c dst[96-127] := src[96-127] AND NOT dst[96-127]. + +The destination is an \c{XMM} register. The source operand can be +either an \c{XMM} register or a 128-bit memory location. + + +\S{insANDPD} \i\c{ANDPD}: Bitwise Logical AND For Double FP + +\c ANDPD xmm1,xmm2/mem128 ; 66 0F 54 /r [WILLAMETTE,SSE2] + +\c{ANDPD} performs a bitwise logical AND of the two double-precision +floating point values in the source and destination operands, and +stores the result in the destination register. + +\c dst[0-63] := src[0-63] AND dst[0-63], +\c dst[64-127] := src[64-127] AND dst[64-127]. + +The destination is an \c{XMM} register. The source operand can be +either an \c{XMM} register or a 128-bit memory location. + + +\S{insANDPS} \i\c{ANDPS}: Bitwise Logical AND For Single FP + +\c ANDPS xmm1,xmm2/mem128 ; 0F 54 /r [KATMAI,SSE] + +\c{ANDPS} performs a bitwise logical AND of the four single-precision +floating point values in the source and destination operands, and +stores the result in the destination register. + +\c dst[0-31] := src[0-31] AND dst[0-31], +\c dst[32-63] := src[32-63] AND dst[32-63], +\c dst[64-95] := src[64-95] AND dst[64-95], +\c dst[96-127] := src[96-127] AND dst[96-127]. + +The destination is an \c{XMM} register.
The source operand can be +either an \c{XMM} register or a 128-bit memory location. + + +\S{insARPL} \i\c{ARPL}: Adjust RPL Field of Selector + +\c ARPL r/m16,reg16 ; 63 /r [286,PRIV] + +\c{ARPL} expects its two word operands to be segment selectors. It +adjusts the \i\c{RPL} (requested privilege level - stored in the bottom +two bits of the selector) field of the destination (first) operand +to ensure that it is no less than (i.e. no more privileged than) the \c{RPL} +field of the source operand. The zero flag is set if and only if a +change had to be made. + + +\S{insBOUND} \i\c{BOUND}: Check Array Index against Bounds + +\c BOUND reg16,mem ; o16 62 /r [186] +\c BOUND reg32,mem ; o32 62 /r [386] + +\c{BOUND} expects its second operand to point to an area of memory +containing two signed values of the same size as its first operand +(i.e. two words for the 16-bit form; two doublewords for the 32-bit +form). It performs two signed comparisons: if the value in the +register passed as its first operand is less than the first of the +in-memory values, or is greater than the second, it +throws a \c{BR} exception. Otherwise, it does nothing. + + +\S{insBSF} \i\c{BSF}, \i\c{BSR}: Bit Scan + +\c BSF reg16,r/m16 ; o16 0F BC /r [386] +\c BSF reg32,r/m32 ; o32 0F BC /r [386] + +\c BSR reg16,r/m16 ; o16 0F BD /r [386] +\c BSR reg32,r/m32 ; o32 0F BD /r [386] + +\b \c{BSF} searches for the least significant set bit in its source +(second) operand, and if it finds one, stores the index in +its destination (first) operand. If no set bit is found, the +contents of the destination operand are undefined. If the source +operand is zero, the zero flag is set. + +\b \c{BSR} performs the same function, but searches from the top +instead, so it finds the most significant set bit. + +Bit indices are from 0 (least significant) to 15 or 31 (most +significant). The destination operand can only be a register. +The source operand can be a register or a memory location.
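+
+As a hedged sketch of the zero-flag behaviour of \c{BSF} described
+above (the routine name \c{lowest_bit} and the convention of
+returning -1 are assumptions made for illustration, not anything
+NASM defines), the zero flag is checked before the destination is
+trusted:
+
+\c lowest_bit:                     ; input: EBX, output: EAX
+\c         bsf     eax,ebx         ; EAX := index of lowest set bit
+\c         jnz     .found          ; ZF clear: EBX was non-zero
+\c         mov     eax,-1          ; EBX was zero, so EAX is undefined:
+\c                                 ; return -1 by convention
+\c .found: ret
+
+With \c{EBX} holding \c{0x28} (binary 101000), \c{BSF} yields 3 and
+\c{BSR} would yield 5.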
+ + +\S{insBSWAP} \i\c{BSWAP}: Byte Swap + +\c BSWAP reg32 ; o32 0F C8+r [486] + +\c{BSWAP} swaps the order of the four bytes of a 32-bit register: +bits 0-7 exchange places with bits 24-31, and bits 8-15 swap with +bits 16-23. There is no explicit 16-bit equivalent: to byte-swap +\c{AX}, \c{BX}, \c{CX} or \c{DX}, \c{XCHG} can be used. When \c{BSWAP} +is used with a 16-bit register, the result is undefined. + + +\S{insBT} \i\c{BT}, \i\c{BTC}, \i\c{BTR}, \i\c{BTS}: Bit Test + +\c BT r/m16,reg16 ; o16 0F A3 /r [386] +\c BT r/m32,reg32 ; o32 0F A3 /r [386] +\c BT r/m16,imm8 ; o16 0F BA /4 ib [386] +\c BT r/m32,imm8 ; o32 0F BA /4 ib [386] + +\c BTC r/m16,reg16 ; o16 0F BB /r [386] +\c BTC r/m32,reg32 ; o32 0F BB /r [386] +\c BTC r/m16,imm8 ; o16 0F BA /7 ib [386] +\c BTC r/m32,imm8 ; o32 0F BA /7 ib [386] + +\c BTR r/m16,reg16 ; o16 0F B3 /r [386] +\c BTR r/m32,reg32 ; o32 0F B3 /r [386] +\c BTR r/m16,imm8 ; o16 0F BA /6 ib [386] +\c BTR r/m32,imm8 ; o32 0F BA /6 ib [386] + +\c BTS r/m16,reg16 ; o16 0F AB /r [386] +\c BTS r/m32,reg32 ; o32 0F AB /r [386] +\c BTS r/m16,imm8 ; o16 0F BA /5 ib [386] +\c BTS r/m32,imm8 ; o32 0F BA /5 ib [386] + +These instructions all test one bit of their first operand, whose +index is given by the second operand, and store the value of that +bit into the carry flag. Bit indices are from 0 (least significant) +to 15 or 31 (most significant). + +In addition to storing the original value of the bit into the carry +flag, \c{BTR} also resets (clears) the bit in the operand itself. +\c{BTS} sets the bit, and \c{BTC} complements the bit. \c{BT} does +not modify its operands. + +The destination can be a register or a memory location. The source can +be a register or an immediate value. + +If the destination operand is a register, the bit offset should be +in the range 0-15 (for 16-bit operands) or 0-31 (for 32-bit operands). +An immediate value outside these ranges will be taken modulo 16/32 +by the processor.
+ +If the destination operand is a memory location, then an immediate +bit offset follows the same rules as for a register. If the bit offset +is in a register, then it can be anything within the signed range of +the register used (i.e. for a 32-bit operand, it can be from -2^31 to 2^31 - 1). + + +\S{insCALL} \i\c{CALL}: Call Subroutine + +\c CALL imm ; E8 rw/rd [8086] +\c CALL imm:imm16 ; o16 9A iw iw [8086] +\c CALL imm:imm32 ; o32 9A id iw [386] +\c CALL FAR mem16 ; o16 FF /3 [8086] +\c CALL FAR mem32 ; o32 FF /3 [386] +\c CALL r/m16 ; o16 FF /2 [8086] +\c CALL r/m32 ; o32 FF /2 [386] + +\c{CALL} calls a subroutine, by means of pushing the current +instruction pointer (\c{IP}) and optionally \c{CS} as well on the +stack, and then jumping to a given address. + +\c{CS} is pushed as well as \c{IP} if and only if the call is a far +call, i.e. a destination segment address is specified in the +instruction. The forms involving two colon-separated arguments are +far calls; so are the \c{CALL FAR mem} forms. + +The immediate \i{near call} takes one of two forms (\c{CALL imm16} or +\c{CALL imm32}), determined by the current segment size limit. For 16-bit operands, +you would use \c{CALL 0x1234}, and for 32-bit operands you would use +\c{CALL 0x12345678}. The value passed as an operand is a relative offset. + +You can choose between the two immediate \i{far call} forms +(\c{CALL imm:imm}) by the use of the \c{WORD} and \c{DWORD} keywords: +\c{CALL WORD 0x1234:0x5678} or \c{CALL DWORD 0x1234:0x56789abc}. + +The \c{CALL FAR mem} forms execute a far call by loading the +destination address out of memory. The address loaded consists of 16 +or 32 bits of offset (depending on the operand size), and 16 bits of +segment. The operand size may be overridden using \c{CALL WORD FAR +mem} or \c{CALL DWORD FAR mem}. + +The \c{CALL r/m} forms execute a \i{near call} (within the same +segment), loading the destination address out of memory or out of a +register.
The keyword \c{NEAR} may be specified, for clarity, in +these forms, but is not necessary. Again, operand size can be +overridden using \c{CALL WORD mem} or \c{CALL DWORD mem}. + +As a convenience, NASM does not require you to call a far procedure +symbol by coding the cumbersome \c{CALL SEG routine:routine}, but +instead allows the easier synonym \c{CALL FAR routine}. + +The \c{CALL r/m} forms given above are near calls; NASM will accept +the \c{NEAR} keyword (e.g. \c{CALL NEAR [address]}), even though it +is not strictly necessary. + + +\S{insCBW} \i\c{CBW}, \i\c{CWD}, \i\c{CDQ}, \i\c{CWDE}: Sign Extensions + +\c CBW ; o16 98 [8086] +\c CWDE ; o32 98 [386] + +\c CWD ; o16 99 [8086] +\c CDQ ; o32 99 [386] + +All these instructions sign-extend a short value into a longer one, +by replicating the top bit of the original value to fill the +extended one. + +\c{CBW} extends \c{AL} into \c{AX} by repeating the top bit of +\c{AL} in every bit of \c{AH}. \c{CWDE} extends \c{AX} into +\c{EAX}. \c{CWD} extends \c{AX} into \c{DX:AX} by repeating +the top bit of \c{AX} throughout \c{DX}, and \c{CDQ} extends +\c{EAX} into \c{EDX:EAX}. + + +\S{insCLC} \i\c{CLC}, \i\c{CLD}, \i\c{CLI}, \i\c{CLTS}: Clear Flags + +\c CLC ; F8 [8086] +\c CLD ; FC [8086] +\c CLI ; FA [8086] +\c CLTS ; 0F 06 [286,PRIV] + +These instructions clear various flags. \c{CLC} clears the carry +flag; \c{CLD} clears the direction flag; \c{CLI} clears the +interrupt flag (thus disabling interrupts); and \c{CLTS} clears the +task-switched (\c{TS}) flag in \c{CR0}. + +To set the carry, direction, or interrupt flags, use the \c{STC}, +\c{STD} and \c{STI} instructions (\k{insSTC}). To invert the carry +flag, use \c{CMC} (\k{insCMC}). + + +\S{insCLFLUSH} \i\c{CLFLUSH}: Flush Cache Line + +\c CLFLUSH mem ; 0F AE /7 [WILLAMETTE,SSE2] + +\c{CLFLUSH} invalidates the cache line that contains the linear address +specified by the source operand from all levels of the processor cache +hierarchy (data and instruction). 
If, at any level of the cache +hierarchy, the line is inconsistent with memory (dirty) it is written +to memory before invalidation. The source operand points to a +byte-sized memory location. + +Although \c{CLFLUSH} is flagged \c{SSE2} and above, it may not be +present on all processors which have \c{SSE2} support, and it may be +supported on other processors; the \c{CPUID} instruction (\k{insCPUID}) +will return a bit which indicates support for the \c{CLFLUSH} instruction. + + +\S{insCMC} \i\c{CMC}: Complement Carry Flag + +\c CMC ; F5 [8086] + +\c{CMC} changes the value of the carry flag: if it was 0, it sets it +to 1, and vice versa. + + +\S{insCMOVcc} \i\c{CMOVcc}: Conditional Move + +\c CMOVcc reg16,r/m16 ; o16 0F 40+cc /r [P6] +\c CMOVcc reg32,r/m32 ; o32 0F 40+cc /r [P6] + +\c{CMOV} moves its source (second) operand into its destination +(first) operand if the given condition code is satisfied; otherwise +it does nothing. + +For a list of condition codes, see \k{iref-cc}. + +Although the \c{CMOV} instructions are flagged \c{P6} and above, they +may not be supported by all Pentium Pro processors; the \c{CPUID} +instruction (\k{insCPUID}) will return a bit which indicates whether +conditional moves are supported. 
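+
+As a hedged sketch of \c{CMOVcc} (32-bit flat-model code is assumed,
+and the routine \c{signed_max} and the variables \c{a} and \c{b} are
+hypothetical), a program might test the feature flag and then take a
+signed maximum without a branch. Bit 15 of \c{EDX} is the
+conditional-move flag described under \c{CPUID} (\k{insCPUID}):
+
+\c         section .data
+\c a:      dd      -5
+\c b:      dd      3
+
+\c         section .text
+\c signed_max:                     ; returns max(a,b) in EAX
+\c         mov     eax,1
+\c         cpuid                   ; overwrites EAX, EBX, ECX and EDX
+\c         test    edx,1 << 15     ; conditional moves supported?
+\c         jz      .no_cmov
+\c         mov     eax,[a]
+\c         cmp     eax,[b]
+\c         cmovl   eax,[b]         ; if a < b (signed), EAX := b
+\c         ret
+\c .no_cmov:                       ; fallback using a branch
+\c         mov     eax,[a]
+\c         cmp     eax,[b]
+\c         jge     .done
+\c         mov     eax,[b]
+\c .done:  ret
+
+Note that \c{CPUID} destroys \c{EBX}, which many calling conventions
+expect a routine to preserve; a real routine would save and restore
+it.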
+ + +\S{insCMP} \i\c{CMP}: Compare Integers + +\c CMP r/m8,reg8 ; 38 /r [8086] +\c CMP r/m16,reg16 ; o16 39 /r [8086] +\c CMP r/m32,reg32 ; o32 39 /r [386] + +\c CMP reg8,r/m8 ; 3A /r [8086] +\c CMP reg16,r/m16 ; o16 3B /r [8086] +\c CMP reg32,r/m32 ; o32 3B /r [386] + +\c CMP r/m8,imm8 ; 80 /7 ib [8086] +\c CMP r/m16,imm16 ; o16 81 /7 iw [8086] +\c CMP r/m32,imm32 ; o32 81 /7 id [386] + +\c CMP r/m16,imm8 ; o16 83 /7 ib [8086] +\c CMP r/m32,imm8 ; o32 83 /7 ib [386] + +\c CMP AL,imm8 ; 3C ib [8086] +\c CMP AX,imm16 ; o16 3D iw [8086] +\c CMP EAX,imm32 ; o32 3D id [386] + +\c{CMP} performs a `mental' subtraction of its second operand from +its first operand, and affects the flags as if the subtraction had +taken place, but does not store the result of the subtraction +anywhere. + +In the forms with an 8-bit immediate second operand and a longer +first operand, the second operand is considered to be signed, and is +sign-extended to the length of the first operand. In these cases, +the \c{BYTE} qualifier is necessary to force NASM to generate this +form of the instruction. + +The destination operand can be a register or a memory location. The +source can be a register, memory location or an immediate value of +the same size as the destination. 
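+
+As a hedged sketch (the routine \c{clamp10} and the variable
+\c{count} are hypothetical, and 32-bit flat-model code is assumed),
+\c{CMP} can drive a conditional jump without altering either
+operand. The \c{BYTE} qualifier selects the sign-extended 8-bit
+immediate form:
+
+\c         section .data
+\c count:  dd      42
+
+\c         section .text
+\c clamp10:
+\c         cmp     dword [count],byte 10   ; flags as for [count] - 10
+\c         jbe     .done                   ; unsigned [count] <= 10
+\c         mov     dword [count],10        ; otherwise clamp to 10
+\c .done:  ret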
+ + +\S{insCMPccPD} \i\c{CMPccPD}: Packed Double-Precision FP Compare +\I\c{CMPEQPD} \I\c{CMPLTPD} \I\c{CMPLEPD} \I\c{CMPUNORDPD} +\I\c{CMPNEQPD} \I\c{CMPNLTPD} \I\c{CMPNLEPD} \I\c{CMPORDPD} + +\c CMPPD xmm1,xmm2/mem128,imm8 ; 66 0F C2 /r ib [WILLAMETTE,SSE2] + +\c CMPEQPD xmm1,xmm2/mem128 ; 66 0F C2 /r 00 [WILLAMETTE,SSE2] +\c CMPLTPD xmm1,xmm2/mem128 ; 66 0F C2 /r 01 [WILLAMETTE,SSE2] +\c CMPLEPD xmm1,xmm2/mem128 ; 66 0F C2 /r 02 [WILLAMETTE,SSE2] +\c CMPUNORDPD xmm1,xmm2/mem128 ; 66 0F C2 /r 03 [WILLAMETTE,SSE2] +\c CMPNEQPD xmm1,xmm2/mem128 ; 66 0F C2 /r 04 [WILLAMETTE,SSE2] +\c CMPNLTPD xmm1,xmm2/mem128 ; 66 0F C2 /r 05 [WILLAMETTE,SSE2] +\c CMPNLEPD xmm1,xmm2/mem128 ; 66 0F C2 /r 06 [WILLAMETTE,SSE2] +\c CMPORDPD xmm1,xmm2/mem128 ; 66 0F C2 /r 07 [WILLAMETTE,SSE2] + +The \c{CMPccPD} instructions compare the two packed double-precision +FP values in the source and destination operands, and return the +result of the comparison in the destination register. The result of +each comparison is a quadword mask of all 1s (comparison true) or +all 0s (comparison false). + +The destination is an \c{XMM} register. The source can be either an +\c{XMM} register or a 128-bit memory location. + +The third operand is an 8-bit immediate value, of which the low 3 +bits define the type of comparison. For ease of programming, the +8 two-operand pseudo-instructions are provided, with the third +operand already filled in.
The \I{Condition Predicates} +\c{Condition Predicates} are: + +\c EQ 0 Equal +\c LT 1 Less-than +\c LE 2 Less-than-or-equal +\c UNORD 3 Unordered +\c NE 4 Not-equal +\c NLT 5 Not-less-than +\c NLE 6 Not-less-than-or-equal +\c ORD 7 Ordered + +For more details of the comparison predicates, and details of how +to emulate the "greater-than" equivalents, see \k{iref-SSE-cc} + + +\S{insCMPccPS} \i\c{CMPccPS}: Packed Single-Precision FP Compare +\I\c{CMPEQPS} \I\c{CMPLTPS} \I\c{CMPLEPS} \I\c{CMPUNORDPS} +\I\c{CMPNEQPS} \I\c{CMPNLTPS} \I\c{CMPNLEPS} \I\c{CMPORDPS} + +\c CMPPS xmm1,xmm2/mem128,imm8 ; 0F C2 /r ib [KATMAI,SSE] + +\c CMPEQPS xmm1,xmm2/mem128 ; 0F C2 /r 00 [KATMAI,SSE] +\c CMPLTPS xmm1,xmm2/mem128 ; 0F C2 /r 01 [KATMAI,SSE] +\c CMPLEPS xmm1,xmm2/mem128 ; 0F C2 /r 02 [KATMAI,SSE] +\c CMPUNORDPS xmm1,xmm2/mem128 ; 0F C2 /r 03 [KATMAI,SSE] +\c CMPNEQPS xmm1,xmm2/mem128 ; 0F C2 /r 04 [KATMAI,SSE] +\c CMPNLTPS xmm1,xmm2/mem128 ; 0F C2 /r 05 [KATMAI,SSE] +\c CMPNLEPS xmm1,xmm2/mem128 ; 0F C2 /r 06 [KATMAI,SSE] +\c CMPORDPS xmm1,xmm2/mem128 ; 0F C2 /r 07 [KATMAI,SSE] + +The \c{CMPccPS} instructions compare the four packed single-precision +FP values in the source and destination operands, and return the +result of the comparison in the destination register. The result of +each comparison is a doubleword mask of all 1s (comparison true) or +all 0s (comparison false). + +The destination is an \c{XMM} register. The source can be either an +\c{XMM} register or a 128-bit memory location. + +The third operand is an 8-bit immediate value, of which the low 3 +bits define the type of comparison. For ease of programming, the +8 two-operand pseudo-instructions are provided, with the third +operand already filled in.
The \I{Condition Predicates} +\c{Condition Predicates} are: + +\c EQ 0 Equal +\c LT 1 Less-than +\c LE 2 Less-than-or-equal +\c UNORD 3 Unordered +\c NE 4 Not-equal +\c NLT 5 Not-less-than +\c NLE 6 Not-less-than-or-equal +\c ORD 7 Ordered + +For more details of the comparison predicates, and details of how +to emulate the "greater-than" equivalents, see \k{iref-SSE-cc} + + +\S{insCMPSB} \i\c{CMPSB}, \i\c{CMPSW}, \i\c{CMPSD}: Compare Strings + +\c CMPSB ; A6 [8086] +\c CMPSW ; o16 A7 [8086] +\c CMPSD ; o32 A7 [386] + +\c{CMPSB} compares the byte at \c{[DS:SI]} or \c{[DS:ESI]} with the +byte at \c{[ES:DI]} or \c{[ES:EDI]}, and sets the flags accordingly. +It then increments or decrements (depending on the direction flag: +increments if the flag is clear, decrements if it is set) \c{SI} and +\c{DI} (or \c{ESI} and \c{EDI}). + +The registers used are \c{SI} and \c{DI} if the address size is 16 +bits, and \c{ESI} and \c{EDI} if it is 32 bits. If you need to use +an address size not equal to the current \c{BITS} setting, you can +use an explicit \i\c{a16} or \i\c{a32} prefix. + +The segment register used to load from \c{[SI]} or \c{[ESI]} can be +overridden by using a segment register name as a prefix (for +example, \c{ES CMPSB}). The use of \c{ES} for the load from \c{[DI]} +or \c{[EDI]} cannot be overridden. + +\c{CMPSW} and \c{CMPSD} work in the same way, but they compare a +word or a doubleword instead of a byte, and increment or decrement +the addressing registers by 2 or 4 instead of 1. + +The \c{REPE} and \c{REPNE} prefixes (equivalently, \c{REPZ} and +\c{REPNZ}) may be used to repeat the instruction up to \c{CX} (or +\c{ECX} - again, the address size chooses which) times until the +first unequal or equal byte is found. 
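+
+As a hedged sketch of the \c{REPE} form of \c{CMPSB} (the labels
+\c{buf1}, \c{buf2}, \c{len} and \c{compare_bufs} are hypothetical,
+and 32-bit flat-model code with \c{DS} and \c{ES} addressing the
+same segment is assumed), two buffers can be compared like this.
+The count is assumed to be non-zero: with \c{ECX} equal to zero the
+comparison never executes and the flags are left as they were.
+
+\c         section .data
+\c buf1:   db      "NASM"
+\c buf2:   db      "NASM"
+\c len     equ     $ - buf2
+
+\c         section .text
+\c compare_bufs:
+\c         cld                     ; DF = 0: ESI and EDI count upwards
+\c         mov     esi,buf1
+\c         mov     edi,buf2
+\c         mov     ecx,len         ; assumed non-zero
+\c         repe    cmpsb           ; stops at first mismatch, or ECX = 0
+\c         setne   al              ; AL := 1 if a mismatch was found
+\c         ret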
+ + +\S{insCMPccSD} \i\c{CMPccSD}: Scalar Double-Precision FP Compare +\I\c{CMPEQSD} \I\c{CMPLTSD} \I\c{CMPLESD} \I\c{CMPUNORDSD} +\I\c{CMPNEQSD} \I\c{CMPNLTSD} \I\c{CMPNLESD} \I\c{CMPORDSD} + +\c CMPSD xmm1,xmm2/mem64,imm8 ; F2 0F C2 /r ib [WILLAMETTE,SSE2] + +\c CMPEQSD xmm1,xmm2/mem64 ; F2 0F C2 /r 00 [WILLAMETTE,SSE2] +\c CMPLTSD xmm1,xmm2/mem64 ; F2 0F C2 /r 01 [WILLAMETTE,SSE2] +\c CMPLESD xmm1,xmm2/mem64 ; F2 0F C2 /r 02 [WILLAMETTE,SSE2] +\c CMPUNORDSD xmm1,xmm2/mem64 ; F2 0F C2 /r 03 [WILLAMETTE,SSE2] +\c CMPNEQSD xmm1,xmm2/mem64 ; F2 0F C2 /r 04 [WILLAMETTE,SSE2] +\c CMPNLTSD xmm1,xmm2/mem64 ; F2 0F C2 /r 05 [WILLAMETTE,SSE2] +\c CMPNLESD xmm1,xmm2/mem64 ; F2 0F C2 /r 06 [WILLAMETTE,SSE2] +\c CMPORDSD xmm1,xmm2/mem64 ; F2 0F C2 /r 07 [WILLAMETTE,SSE2] + +The \c{CMPccSD} instructions compare the low-order double-precision +FP values in the source and destination operands, and return the +result of the comparison in the destination register. The result of +the comparison is a quadword mask of all 1s (comparison true) or +all 0s (comparison false). + +The destination is an \c{XMM} register. The source can be either an +\c{XMM} register or a 64-bit memory location. + +The third operand is an 8-bit immediate value, of which the low 3 +bits define the type of comparison. For ease of programming, the +8 two-operand pseudo-instructions are provided, with the third +operand already filled in.
The \I{Condition Predicates} +\c{Condition Predicates} are: + +\c EQ 0 Equal +\c LT 1 Less-than +\c LE 2 Less-than-or-equal +\c UNORD 3 Unordered +\c NE 4 Not-equal +\c NLT 5 Not-less-than +\c NLE 6 Not-less-than-or-equal +\c ORD 7 Ordered + +For more details of the comparison predicates, and details of how +to emulate the "greater-than" equivalents, see \k{iref-SSE-cc} + + +\S{insCMPccSS} \i\c{CMPccSS}: Scalar Single-Precision FP Compare +\I\c{CMPEQSS} \I\c{CMPLTSS} \I\c{CMPLESS} \I\c{CMPUNORDSS} +\I\c{CMPNEQSS} \I\c{CMPNLTSS} \I\c{CMPNLESS} \I\c{CMPORDSS} + +\c CMPSS xmm1,xmm2/mem32,imm8 ; F3 0F C2 /r ib [KATMAI,SSE] + +\c CMPEQSS xmm1,xmm2/mem32 ; F3 0F C2 /r 00 [KATMAI,SSE] +\c CMPLTSS xmm1,xmm2/mem32 ; F3 0F C2 /r 01 [KATMAI,SSE] +\c CMPLESS xmm1,xmm2/mem32 ; F3 0F C2 /r 02 [KATMAI,SSE] +\c CMPUNORDSS xmm1,xmm2/mem32 ; F3 0F C2 /r 03 [KATMAI,SSE] +\c CMPNEQSS xmm1,xmm2/mem32 ; F3 0F C2 /r 04 [KATMAI,SSE] +\c CMPNLTSS xmm1,xmm2/mem32 ; F3 0F C2 /r 05 [KATMAI,SSE] +\c CMPNLESS xmm1,xmm2/mem32 ; F3 0F C2 /r 06 [KATMAI,SSE] +\c CMPORDSS xmm1,xmm2/mem32 ; F3 0F C2 /r 07 [KATMAI,SSE] + +The \c{CMPccSS} instructions compare the low-order single-precision +FP values in the source and destination operands, and return the +result of the comparison in the destination register. The result of +the comparison is a doubleword mask of all 1s (comparison true) or +all 0s (comparison false). + +The destination is an \c{XMM} register. The source can be either an +\c{XMM} register or a 32-bit memory location. + +The third operand is an 8-bit immediate value, of which the low 3 +bits define the type of comparison. For ease of programming, the +8 two-operand pseudo-instructions are provided, with the third +operand already filled in.
The \I{Condition Predicates} +\c{Condition Predicates} are: + +\c EQ 0 Equal +\c LT 1 Less-than +\c LE 2 Less-than-or-equal +\c UNORD 3 Unordered +\c NE 4 Not-equal +\c NLT 5 Not-less-than +\c NLE 6 Not-less-than-or-equal +\c ORD 7 Ordered + +For more details of the comparison predicates, and details of how +to emulate the "greater-than" equivalents, see \k{iref-SSE-cc} + + +\S{insCMPXCHG} \i\c{CMPXCHG}, \i\c{CMPXCHG486}: Compare and Exchange + +\c CMPXCHG r/m8,reg8 ; 0F B0 /r [PENT] +\c CMPXCHG r/m16,reg16 ; o16 0F B1 /r [PENT] +\c CMPXCHG r/m32,reg32 ; o32 0F B1 /r [PENT] + +\c CMPXCHG486 r/m8,reg8 ; 0F A6 /r [486,UNDOC] +\c CMPXCHG486 r/m16,reg16 ; o16 0F A7 /r [486,UNDOC] +\c CMPXCHG486 r/m32,reg32 ; o32 0F A7 /r [486,UNDOC] + +These two instructions perform exactly the same operation; however, +apparently some (not all) 486 processors support it under a +non-standard opcode, so NASM provides the undocumented +\c{CMPXCHG486} form to generate the non-standard opcode. + +\c{CMPXCHG} compares its destination (first) operand to the value in +\c{AL}, \c{AX} or \c{EAX} (depending on the operand size of the +instruction). If they are equal, it copies its source (second) +operand into the destination and sets the zero flag. Otherwise, it +clears the zero flag and copies the destination operand to \c{AL}, \c{AX} or \c{EAX}. + +The destination can be either a register or a memory location. The +source is a register. + +\c{CMPXCHG} is intended to be used for atomic operations in +multitasking or multiprocessor environments. To safely update a +value in shared memory, for example, you might load the value into +\c{EAX}, load the updated value into \c{EBX}, and then execute the +instruction \c{LOCK CMPXCHG [value],EBX}. If \c{value} has not +changed since being loaded, it is updated with your desired new +value, and the zero flag is set to let you know it has worked.
(The +\c{LOCK} prefix prevents another processor doing anything in the +middle of this operation: it guarantees atomicity.) However, if +another processor has modified the value in between your load and +your attempted store, the store does not happen, and you are +notified of the failure by a cleared zero flag, so you can go round +and try again. + + +\S{insCMPXCHG8B} \i\c{CMPXCHG8B}: Compare and Exchange Eight Bytes + +\c CMPXCHG8B mem ; 0F C7 /1 [PENT] + +This is a larger and more unwieldy version of \c{CMPXCHG}: it +compares the 64-bit (eight-byte) value stored at \c{[mem]} with the +value in \c{EDX:EAX}. If they are equal, it sets the zero flag and +stores \c{ECX:EBX} into the memory area. If they are unequal, it +clears the zero flag and stores the memory contents into \c{EDX:EAX}. + +\c{CMPXCHG8B} can be used with the \c{LOCK} prefix, to allow atomic +execution. This is useful in multi-processor and multi-tasking +environments. + + +\S{insCOMISD} \i\c{COMISD}: Scalar Ordered Double-Precision FP Compare and Set EFLAGS + +\c COMISD xmm1,xmm2/mem64 ; 66 0F 2F /r [WILLAMETTE,SSE2] + +\c{COMISD} compares the low-order double-precision FP value in the +two source operands. ZF, PF and CF are set according to the result. +OF, SF and AF are cleared. The unordered result is returned if either +source is a NaN (QNaN or SNaN). + +The destination operand is an \c{XMM} register. The source can be either +an \c{XMM} register or a memory location. + +The flags are set according to the following rules: + +\c Result Flags Values + +\c UNORDERED: ZF,PF,CF <-- 111; +\c GREATER_THAN: ZF,PF,CF <-- 000; +\c LESS_THAN: ZF,PF,CF <-- 001; +\c EQUAL: ZF,PF,CF <-- 100; + + +\S{insCOMISS} \i\c{COMISS}: Scalar Ordered Single-Precision FP Compare and Set EFLAGS + +\c COMISS xmm1,xmm2/mem32 ; 0F 2F /r [KATMAI,SSE] + +\c{COMISS} compares the low-order single-precision FP value in the +two source operands. ZF, PF and CF are set according to the result. +OF, SF and AF are cleared.
The unordered result is returned if either +source is a NaN (QNaN or SNaN). + +The destination operand is an \c{XMM} register. The source can be either +an \c{XMM} register or a memory location. + +The flags are set according to the following rules: + +\c Result Flags Values + +\c UNORDERED: ZF,PF,CF <-- 111; +\c GREATER_THAN: ZF,PF,CF <-- 000; +\c LESS_THAN: ZF,PF,CF <-- 001; +\c EQUAL: ZF,PF,CF <-- 100; + + +\S{insCPUID} \i\c{CPUID}: Get CPU Identification Code + +\c CPUID ; 0F A2 [PENT] + +\c{CPUID} returns various information about the processor it is +being executed on. It fills the four registers \c{EAX}, \c{EBX}, +\c{ECX} and \c{EDX} with information, which varies depending on the +input contents of \c{EAX}. + +\c{CPUID} also acts as a barrier to serialize instruction execution: +executing the \c{CPUID} instruction guarantees that all the effects +(memory modification, flag modification, register modification) of +previous instructions have been completed before the next +instruction gets fetched. + +The information returned is as follows: + +\b If \c{EAX} is zero on input, \c{EAX} on output holds the maximum +acceptable input value of \c{EAX}, and \c{EBX:EDX:ECX} contain the +string \c{"GenuineIntel"} (or not, if you have a clone processor). +That is to say, \c{EBX} contains \c{"Genu"} (in NASM's own sense of +character constants, described in \k{chrconst}), \c{EDX} contains +\c{"ineI"} and \c{ECX} contains \c{"ntel"}. + +\b If \c{EAX} is one on input, \c{EAX} on output contains version +information about the processor, and \c{EDX} contains a set of +feature flags, showing the presence and absence of various features. +For example, bit 8 is set if the \c{CMPXCHG8B} instruction +(\k{insCMPXCHG8B}) is supported, bit 15 is set if the conditional +move instructions (\k{insCMOVcc} and \k{insFCMOVB}) are supported, +and bit 23 is set if \c{MMX} instructions are supported. 
+ +\b If \c{EAX} is two on input, \c{EAX}, \c{EBX}, \c{ECX} and \c{EDX} +all contain information about caches and TLBs (Translation Lookaside +Buffers). + +For more information on the data returned from \c{CPUID}, see the +documentation from Intel and other processor manufacturers. + + +\S{insCVTDQ2PD} \i\c{CVTDQ2PD}: +Packed Signed INT32 to Packed Double-Precision FP Conversion + +\c CVTDQ2PD xmm1,xmm2/mem64 ; F3 0F E6 /r [WILLAMETTE,SSE2] + +\c{CVTDQ2PD} converts two packed signed doublewords from the source +operand to two packed double-precision FP values in the destination +operand. + +The destination operand is an \c{XMM} register. The source can be +either an \c{XMM} register or a 64-bit memory location. If the +source is a register, the packed integers are in the low quadword. + + +\S{insCVTDQ2PS} \i\c{CVTDQ2PS}: +Packed Signed INT32 to Packed Single-Precision FP Conversion + +\c CVTDQ2PS xmm1,xmm2/mem128 ; 0F 5B /r [WILLAMETTE,SSE2] + +\c{CVTDQ2PS} converts four packed signed doublewords from the source +operand to four packed single-precision FP values in the destination +operand. + +The destination operand is an \c{XMM} register. The source can be +either an \c{XMM} register or a 128-bit memory location. + +For more details of this instruction, see the Intel Processor manuals. + + +\S{insCVTPD2DQ} \i\c{CVTPD2DQ}: +Packed Double-Precision FP to Packed Signed INT32 Conversion + +\c CVTPD2DQ xmm1,xmm2/mem128 ; F2 0F E6 /r [WILLAMETTE,SSE2] + +\c{CVTPD2DQ} converts two packed double-precision FP values from the +source operand to two packed signed doublewords in the low quadword +of the destination operand. The high quadword of the destination is +set to all 0s. + +The destination operand is an \c{XMM} register. The source can be +either an \c{XMM} register or a 128-bit memory location. + +For more details of this instruction, see the Intel Processor manuals.
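+
+As a hedged sketch of the packed conversions above (the label
+\c{ints} is hypothetical, and 32-bit flat-model code is assumed),
+two doublewords can be widened to doubles and narrowed again. The
+values chosen are exactly representable, so no rounding is involved:
+
+\c         section .data
+\c ints:   dd      3, -4
+
+\c         section .text
+\c         cvtdq2pd xmm0,[ints]    ; XMM0 := 3.0 (low), -4.0 (high)
+\c         cvtpd2dq xmm1,xmm0      ; low quadword of XMM1 := 3, -4
+\c                                 ; high quadword of XMM1 := 0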
+ + +\S{insCVTPD2PI} \i\c{CVTPD2PI}: +Packed Double-Precision FP to Packed Signed INT32 Conversion + +\c CVTPD2PI mm,xmm/mem128 ; 66 0F 2D /r [WILLAMETTE,SSE2] + +\c{CVTPD2PI} converts two packed double-precision FP values from the +source operand to two packed signed doublewords in the destination +operand. + +The destination operand is an \c{MMX} register. The source can be +either an \c{XMM} register or a 128-bit memory location. + +For more details of this instruction, see the Intel Processor manuals. + + +\S{insCVTPD2PS} \i\c{CVTPD2PS}: +Packed Double-Precision FP to Packed Single-Precision FP Conversion + +\c CVTPD2PS xmm1,xmm2/mem128 ; 66 0F 5A /r [WILLAMETTE,SSE2] + +\c{CVTPD2PS} converts two packed double-precision FP values from the +source operand to two packed single-precision FP values in the low +quadword of the destination operand. The high quadword of the +destination is set to all 0s. + +The destination operand is an \c{XMM} register. The source can be +either an \c{XMM} register or a 128-bit memory location. + +For more details of this instruction, see the Intel Processor manuals. + + +\S{insCVTPI2PD} \i\c{CVTPI2PD}: +Packed Signed INT32 to Packed Double-Precision FP Conversion + +\c CVTPI2PD xmm,mm/mem64 ; 66 0F 2A /r [WILLAMETTE,SSE2] + +\c{CVTPI2PD} converts two packed signed doublewords from the source +operand to two packed double-precision FP values in the destination +operand. + +The destination operand is an \c{XMM} register. The source can be +either an \c{MMX} register or a 64-bit memory location. + +For more details of this instruction, see the Intel Processor manuals. + + +\S{insCVTPI2PS} \i\c{CVTPI2PS}: +Packed Signed INT32 to Packed Single-FP Conversion + +\c CVTPI2PS xmm,mm/mem64 ; 0F 2A /r [KATMAI,SSE] + +\c{CVTPI2PS} converts two packed signed doublewords from the source +operand to two packed single-precision FP values in the low quadword +of the destination operand. The high quadword of the destination +remains unchanged. 
+ +The destination operand is an \c{XMM} register. The source can be +either an \c{MMX} register or a 64-bit memory location. + +For more details of this instruction, see the Intel Processor manuals. + + +\S{insCVTPS2DQ} \i\c{CVTPS2DQ}: +Packed Single-Precision FP to Packed Signed INT32 Conversion + +\c CVTPS2DQ xmm1,xmm2/mem128 ; 66 0F 5B /r [WILLAMETTE,SSE2] + +\c{CVTPS2DQ} converts four packed single-precision FP values from the +source operand to four packed signed doublewords in the destination operand. + +The destination operand is an \c{XMM} register. The source can be +either an \c{XMM} register or a 128-bit memory location. + +For more details of this instruction, see the Intel Processor manuals. + + +\S{insCVTPS2PD} \i\c{CVTPS2PD}: +Packed Single-Precision FP to Packed Double-Precision FP Conversion + +\c CVTPS2PD xmm1,xmm2/mem64 ; 0F 5A /r [WILLAMETTE,SSE2] + +\c{CVTPS2PD} converts two packed single-precision FP values from the +source operand to two packed double-precision FP values in the destination +operand. + +The destination operand is an \c{XMM} register. The source can be +either an \c{XMM} register or a 64-bit memory location. If the source +is a register, the input values are in the low quadword. + +For more details of this instruction, see the Intel Processor manuals. + + +\S{insCVTPS2PI} \i\c{CVTPS2PI}: +Packed Single-Precision FP to Packed Signed INT32 Conversion + +\c CVTPS2PI mm,xmm/mem64 ; 0F 2D /r [KATMAI,SSE] + +\c{CVTPS2PI} converts two packed single-precision FP values from +the source operand to two packed signed doublewords in the destination +operand. + +The destination operand is an \c{MMX} register. The source can be +either an \c{XMM} register or a 64-bit memory location. If the +source is a register, the input values are in the low quadword. + +For more details of this instruction, see the Intel Processor manuals. 
+ + +\S{insCVTSD2SI} \i\c{CVTSD2SI}: +Scalar Double-Precision FP to Signed INT32 Conversion + +\c CVTSD2SI reg32,xmm/mem64 ; F2 0F 2D /r [WILLAMETTE,SSE2] + +\c{CVTSD2SI} converts a double-precision FP value from the source +operand to a signed doubleword in the destination operand. + +The destination operand is a general purpose register. The source can be +either an \c{XMM} register or a 64-bit memory location. If the +source is a register, the input value is in the low quadword. + +For more details of this instruction, see the Intel Processor manuals. + + +\S{insCVTSD2SS} \i\c{CVTSD2SS}: +Scalar Double-Precision FP to Scalar Single-Precision FP Conversion + +\c CVTSD2SS xmm1,xmm2/mem64 ; F2 0F 5A /r [WILLAMETTE,SSE2] + +\c{CVTSD2SS} converts a double-precision FP value from the source +operand to a single-precision FP value in the low doubleword of the +destination operand. The upper 3 doublewords are left unchanged. + +The destination operand is an \c{XMM} register. The source can be +either an \c{XMM} register or a 64-bit memory location. If the +source is a register, the input value is in the low quadword. + +For more details of this instruction, see the Intel Processor manuals. + + +\S{insCVTSI2SD} \i\c{CVTSI2SD}: +Signed INT32 to Scalar Double-Precision FP Conversion + +\c CVTSI2SD xmm,r/m32 ; F2 0F 2A /r [WILLAMETTE,SSE2] + +\c{CVTSI2SD} converts a signed doubleword from the source operand to +a double-precision FP value in the low quadword of the destination +operand. The high quadword is left unchanged. + +The destination operand is an \c{XMM} register. The source can be either +a general purpose register or a 32-bit memory location. + +For more details of this instruction, see the Intel Processor manuals. 
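As an illustration of the scalar conversions described above, the following sketch converts two integers to double precision, divides them, and converts the quotient back to an integer. The register choices and values are arbitrary assumptions, and the result noted in the final comment assumes the default \c{MXCSR} rounding mode (round to nearest, ties to even).

```nasm
        bits 32
        section .text
        mov       eax, 7
        cvtsi2sd  xmm0, eax     ; low quadword of XMM0 := 7.0
        mov       eax, 2
        cvtsi2sd  xmm1, eax     ; low quadword of XMM1 := 2.0
        divsd     xmm0, xmm1    ; low quadword of XMM0 := 3.5
        cvtsd2si  eax, xmm0     ; EAX := 4 under round-to-nearest-even
```

With a different rounding mode selected in \c{MXCSR}, the final \c{CVTSD2SI} may yield 3 instead.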
+ + +\S{insCVTSI2SS} \i\c{CVTSI2SS}: +Signed INT32 to Scalar Single-Precision FP Conversion + +\c CVTSI2SS xmm,r/m32 ; F3 0F 2A /r [KATMAI,SSE] + +\c{CVTSI2SS} converts a signed doubleword from the source operand to a +single-precision FP value in the low doubleword of the destination operand. +The upper 3 doublewords are left unchanged. + +The destination operand is an \c{XMM} register. The source can be either +a general purpose register or a 32-bit memory location. + +For more details of this instruction, see the Intel Processor manuals. + + +\S{insCVTSS2SD} \i\c{CVTSS2SD}: +Scalar Single-Precision FP to Scalar Double-Precision FP Conversion + +\c CVTSS2SD xmm1,xmm2/mem32 ; F3 0F 5A /r [WILLAMETTE,SSE2] + +\c{CVTSS2SD} converts a single-precision FP value from the source operand +to a double-precision FP value in the low quadword of the destination +operand. The upper quadword is left unchanged. + +The destination operand is an \c{XMM} register. The source can be either +an \c{XMM} register or a 32-bit memory location. If the source is a +register, the input value is contained in the low doubleword. + +For more details of this instruction, see the Intel Processor manuals. + + +\S{insCVTSS2SI} \i\c{CVTSS2SI}: +Scalar Single-Precision FP to Signed INT32 Conversion + +\c CVTSS2SI reg32,xmm/mem32 ; F3 0F 2D /r [KATMAI,SSE] + +\c{CVTSS2SI} converts a single-precision FP value from the source +operand to a signed doubleword in the destination operand. + +The destination operand is a general purpose register. The source can be +either an \c{XMM} register or a 32-bit memory location. If the +source is a register, the input value is in the low doubleword. + +For more details of this instruction, see the Intel Processor manuals. 
+ + +\S{insCVTTPD2DQ} \i\c{CVTTPD2DQ}: +Packed Double-Precision FP to Packed Signed INT32 Conversion with Truncation + +\c CVTTPD2DQ xmm1,xmm2/mem128 ; 66 0F E6 /r [WILLAMETTE,SSE2] + +\c{CVTTPD2DQ} converts two packed double-precision FP values in the source +operand to two packed signed doublewords in the low quadword of the +destination operand. If the result is inexact, it is truncated (rounded +toward zero). The high quadword is set to all 0s. + +The destination operand is an \c{XMM} register. The source can be +either an \c{XMM} register or a 128-bit memory location. + +For more details of this instruction, see the Intel Processor manuals. + + +\S{insCVTTPD2PI} \i\c{CVTTPD2PI}: +Packed Double-Precision FP to Packed Signed INT32 Conversion with Truncation + +\c CVTTPD2PI mm,xmm/mem128 ; 66 0F 2C /r [WILLAMETTE,SSE2] + +\c{CVTTPD2PI} converts two packed double-precision FP values in the source +operand to two packed signed doublewords in the destination operand. +If the result is inexact, it is truncated (rounded toward zero). + +The destination operand is an \c{MMX} register. The source can be +either an \c{XMM} register or a 128-bit memory location. + +For more details of this instruction, see the Intel Processor manuals. + + +\S{insCVTTPS2DQ} \i\c{CVTTPS2DQ}: +Packed Single-Precision FP to Packed Signed INT32 Conversion with Truncation + +\c CVTTPS2DQ xmm1,xmm2/mem128 ; F3 0F 5B /r [WILLAMETTE,SSE2] + +\c{CVTTPS2DQ} converts four packed single-precision FP values in the source +operand to four packed signed doublewords in the destination operand. +If the result is inexact, it is truncated (rounded toward zero). + +The destination operand is an \c{XMM} register. The source can be +either an \c{XMM} register or a 128-bit memory location. + +For more details of this instruction, see the Intel Processor manuals. 
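The difference between the truncating conversions and their non-truncating counterparts can be seen in a short sketch. The label \c{vals} and the data values are hypothetical, chosen only for illustration, and the results given for \c{CVTPS2DQ} assume the default \c{MXCSR} rounding mode (round to nearest, ties to even).

```nasm
        bits 32
        section .data
        align 16
vals:   dd 1.5, 2.5, -1.5, 2.7      ; four single-precision values

        section .text
        movaps    xmm0, [vals]
        cvtps2dq  xmm1, xmm0        ; rounds per MXCSR: 2, 2, -2, 3 by default
        cvttps2dq xmm2, xmm0        ; always truncates:  1, 2, -1, 2
```

The results are listed from the lowest doubleword upwards. The truncating form gives the same answer whatever rounding mode is currently selected.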
+ + +\S{insCVTTPS2PI} \i\c{CVTTPS2PI}: +Packed Single-Precision FP to Packed Signed INT32 Conversion with Truncation + +\c CVTTPS2PI mm,xmm/mem64 ; 0F 2C /r [KATMAI,SSE] + +\c{CVTTPS2PI} converts two packed single-precision FP values in the source +operand to two packed signed doublewords in the destination operand. +If the result is inexact, it is truncated (rounded toward zero). + +The destination operand is an \c{MMX} register. The source can be +either an \c{XMM} register or a 64-bit memory location. If the source +is a register, the input values are in the low quadword. + +For more details of this instruction, see the Intel Processor manuals. + + +\S{insCVTTSD2SI} \i\c{CVTTSD2SI}: +Scalar Double-Precision FP to Signed INT32 Conversion with Truncation + +\c CVTTSD2SI reg32,xmm/mem64 ; F2 0F 2C /r [WILLAMETTE,SSE2] + +\c{CVTTSD2SI} converts a double-precision FP value in the source operand +to a signed doubleword in the destination operand. If the result is +inexact, it is truncated (rounded toward zero). + +The destination operand is a general purpose register. The source can be +either an \c{XMM} register or a 64-bit memory location. If the source is a +register, the input value is in the low quadword. + +For more details of this instruction, see the Intel Processor manuals. + + +\S{insCVTTSS2SI} \i\c{CVTTSS2SI}: +Scalar Single-Precision FP to Signed INT32 Conversion with Truncation + +\c CVTTSS2SI reg32,xmm/mem32 ; F3 0F 2C /r [KATMAI,SSE] + +\c{CVTTSS2SI} converts a single-precision FP value in the source operand +to a signed doubleword in the destination operand. If the result is +inexact, it is truncated (rounded toward zero). + +The destination operand is a general purpose register. The source can be +either an \c{XMM} register or a 32-bit memory location. If the source is a +register, the input value is in the low doubleword. 
+ +For more details of this instruction, see the Intel Processor manuals. + + +\S{insDAA} \i\c{DAA}, \i\c{DAS}: Decimal Adjustments + +\c DAA ; 27 [8086] +\c DAS ; 2F [8086] + +These instructions are used in conjunction with the add and subtract +instructions to perform binary-coded decimal arithmetic in +\e{packed} (one BCD digit per nibble) form. For the unpacked +equivalents, see \k{insAAA}. + +\c{DAA} should be used after a one-byte \c{ADD} instruction whose +destination was the \c{AL} register: by means of examining the value +in the \c{AL} and also the auxiliary carry flag \c{AF}, it +determines whether either digit of the addition has overflowed, and +adjusts it (and sets the carry and auxiliary-carry flags) if so. You +can add long BCD strings together by doing \c{ADD}/\c{DAA} on the +low two digits, then doing \c{ADC}/\c{DAA} on each subsequent pair +of digits. + +\c{DAS} works similarly to \c{DAA}, but is for use after \c{SUB} +instructions rather than \c{ADD}. + + +\S{insDEC} \i\c{DEC}: Decrement Integer + +\c DEC reg16 ; o16 48+r [8086] +\c DEC reg32 ; o32 48+r [386] +\c DEC r/m8 ; FE /1 [8086] +\c DEC r/m16 ; o16 FF /1 [8086] +\c DEC r/m32 ; o32 FF /1 [386] + +\c{DEC} subtracts 1 from its operand. It does \e{not} affect the +carry flag: to affect the carry flag, use \c{SUB something,1} (see +\k{insSUB}). \c{DEC} affects all the other flags according to the result. + +This instruction can be used with a \c{LOCK} prefix to allow atomic +execution. + +See also \c{INC} (\k{insINC}). + + +\S{insDIV} \i\c{DIV}: Unsigned Integer Divide + +\c DIV r/m8 ; F6 /6 [8086] +\c DIV r/m16 ; o16 F7 /6 [8086] +\c DIV r/m32 ; o32 F7 /6 [386] + +\c{DIV} performs unsigned integer division. The explicit operand +provided is the divisor; the dividend and destination operands are +implicit, in the following way: + +\b For \c{DIV r/m8}, \c{AX} is divided by the given operand; the +quotient is stored in \c{AL} and the remainder in \c{AH}. 
+ +\b For \c{DIV r/m16}, \c{DX:AX} is divided by the given operand; the +quotient is stored in \c{AX} and the remainder in \c{DX}. + +\b For \c{DIV r/m32}, \c{EDX:EAX} is divided by the given operand; +the quotient is stored in \c{EAX} and the remainder in \c{EDX}. + +Signed integer division is performed by the \c{IDIV} instruction: +see \k{insIDIV}. + + +\S{insDIVPD} \i\c{DIVPD}: Packed Double-Precision FP Divide + +\c DIVPD xmm1,xmm2/mem128 ; 66 0F 5E /r [WILLAMETTE,SSE2] + +\c{DIVPD} divides the two packed double-precision FP values in +the destination operand by the two packed double-precision FP +values in the source operand, and stores the packed double-precision +results in the destination register. + +The destination is an \c{XMM} register. The source operand can be +either an \c{XMM} register or a 128-bit memory location. + +\c dst[0-63] := dst[0-63] / src[0-63], +\c dst[64-127] := dst[64-127] / src[64-127]. + + +\S{insDIVPS} \i\c{DIVPS}: Packed Single-Precision FP Divide + +\c DIVPS xmm1,xmm2/mem128 ; 0F 5E /r [KATMAI,SSE] + +\c{DIVPS} divides the four packed single-precision FP values in +the destination operand by the four packed single-precision FP +values in the source operand, and stores the packed single-precision +results in the destination register. + +The destination is an \c{XMM} register. The source operand can be +either an \c{XMM} register or a 128-bit memory location. + +\c dst[0-31] := dst[0-31] / src[0-31], +\c dst[32-63] := dst[32-63] / src[32-63], +\c dst[64-95] := dst[64-95] / src[64-95], +\c dst[96-127] := dst[96-127] / src[96-127]. + + +\S{insDIVSD} \i\c{DIVSD}: Scalar Double-Precision FP Divide + +\c DIVSD xmm1,xmm2/mem64 ; F2 0F 5E /r [WILLAMETTE,SSE2] + +\c{DIVSD} divides the low-order double-precision FP value in the +destination operand by the low-order double-precision FP value in +the source operand, and stores the double-precision result in the +destination register. + +The destination is an \c{XMM} register. 
The source operand can be +either an \c{XMM} register or a 64-bit memory location. + +\c dst[0-63] := dst[0-63] / src[0-63], +\c dst[64-127] remains unchanged. + + +\S{insDIVSS} \i\c{DIVSS}: Scalar Single-Precision FP Divide + +\c DIVSS xmm1,xmm2/mem32 ; F3 0F 5E /r [KATMAI,SSE] + +\c{DIVSS} divides the low-order single-precision FP value in the +destination operand by the low-order single-precision FP value in +the source operand, and stores the single-precision result in the +destination register. + +The destination is an \c{XMM} register. The source operand can be +either an \c{XMM} register or a 32-bit memory location. + +\c dst[0-31] := dst[0-31] / src[0-31], +\c dst[32-127] remains unchanged. + + +\S{insEMMS} \i\c{EMMS}: Empty MMX State + +\c EMMS ; 0F 77 [PENT,MMX] + +\c{EMMS} sets the FPU tag word (marking which floating-point registers +are available) to all ones, meaning all registers are available for +the FPU to use. It should be used after executing \c{MMX} instructions +and before executing any subsequent floating-point operations. + + +\S{insENTER} \i\c{ENTER}: Create Stack Frame + +\c ENTER imm,imm ; C8 iw ib [186] + +\c{ENTER} constructs a \i\c{stack frame} for a high-level language +procedure call. The first operand (the \c{iw} in the opcode +definition above refers to the first operand) gives the amount of +stack space to allocate for local variables; the second (the \c{ib} +above) gives the nesting level of the procedure (for languages like +Pascal, with nested procedures). + +The function of \c{ENTER}, with a nesting level of zero, is +equivalent to + +\c PUSH EBP ; or PUSH BP in 16 bits +\c MOV EBP,ESP ; or MOV BP,SP in 16 bits +\c SUB ESP,operand1 ; or SUB SP,operand1 in 16 bits + +This creates a stack frame with the procedure parameters accessible +upwards from \c{EBP}, and local variables accessible downwards from +\c{EBP}. 
+ +With a nesting level of one, the stack frame created is 4 (or 2) +bytes bigger, and the value of the final frame pointer \c{EBP} is +accessible in memory at \c{[EBP-4]}. + +This allows \c{ENTER}, when called with a nesting level of two, to +look at the stack frame described by the \e{previous} value of +\c{EBP}, find the frame pointer at offset -4 from that, and push it +along with its new frame pointer, so that when a level-two procedure +is called from within a level-one procedure, \c{[EBP-4]} holds the +frame pointer of the most recent level-one procedure call and +\c{[EBP-8]} holds that of the most recent level-two call. And so on, +for nesting levels up to 31. + +Stack frames created by \c{ENTER} can be destroyed by the \c{LEAVE} +instruction: see \k{insLEAVE}. + + +\S{insF2XM1} \i\c{F2XM1}: Calculate 2**X-1 + +\c F2XM1 ; D9 F0 [8086,FPU] + +\c{F2XM1} raises 2 to the power of \c{ST0}, subtracts one, and +stores the result back into \c{ST0}. The initial contents of \c{ST0} +must be a number in the range -1.0 to +1.0. + + +\S{insFABS} \i\c{FABS}: Floating-Point Absolute Value + +\c FABS ; D9 E1 [8086,FPU] + +\c{FABS} computes the absolute value of \c{ST0}, by clearing the sign +bit, and stores the result back in \c{ST0}. + + +\S{insFADD} \i\c{FADD}, \i\c{FADDP}: Floating-Point Addition + +\c FADD mem32 ; D8 /0 [8086,FPU] +\c FADD mem64 ; DC /0 [8086,FPU] + +\c FADD fpureg ; D8 C0+r [8086,FPU] +\c FADD ST0,fpureg ; D8 C0+r [8086,FPU] + +\c FADD TO fpureg ; DC C0+r [8086,FPU] +\c FADD fpureg,ST0 ; DC C0+r [8086,FPU] + +\c FADDP fpureg ; DE C0+r [8086,FPU] +\c FADDP fpureg,ST0 ; DE C0+r [8086,FPU] + +\b \c{FADD}, given one operand, adds the operand to \c{ST0} and stores +the result back in \c{ST0}. If the operand has the \c{TO} modifier, +the result is stored in the register given rather than in \c{ST0}. + +\b \c{FADDP} performs the same function as \c{FADD TO}, but pops the +register stack after storing the result. 
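The register forms listed above might be written as follows. This is only a sketch: the choice of \c{ST3} is arbitrary, and the comments restate the behaviour described in the two points above.

```nasm
        bits 32
        section .text
        fadd    st3         ; ST0 := ST0 + ST3
        fadd    st0, st3    ; same encoding as the line above
        fadd    to st3      ; ST3 := ST3 + ST0
        fadd    st3, st0    ; same encoding as the line above
        faddp   st3         ; ST3 := ST3 + ST0, then pop the register stack
```

Note that after the pop performed by \c{FADDP}, the register that held the sum is addressed as \c{ST2}.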
+ +The given two-operand forms are synonyms for the one-operand forms. + +To add an integer value to \c{ST0}, use the \c{FIADD} instruction +(\k{insFIADD}). + + +\S{insFBLD} \i\c{FBLD}, \i\c{FBSTP}: BCD Floating-Point Load and Store + +\c FBLD mem80 ; DF /4 [8086,FPU] +\c FBSTP mem80 ; DF /6 [8086,FPU] + +\c{FBLD} loads an 80-bit (ten-byte) packed binary-coded decimal +number from the given memory address, converts it to a real, and +pushes it on the register stack. \c{FBSTP} stores the value of +\c{ST0}, in packed BCD, at the given address and then pops the +register stack. + + +\S{insFCHS} \i\c{FCHS}: Floating-Point Change Sign + +\c FCHS ; D9 E0 [8086,FPU] + +\c{FCHS} negates the number in \c{ST0}, by inverting the sign bit: +negative numbers become positive, and vice versa. + + +\S{insFCLEX} \i\c{FCLEX}, \c{FNCLEX}: Clear Floating-Point Exceptions + +\c FCLEX ; 9B DB E2 [8086,FPU] +\c FNCLEX ; DB E2 [8086,FPU] + +\c{FCLEX} clears any floating-point exceptions which may be pending. +\c{FNCLEX} does the same thing but doesn't wait for previous +floating-point operations (including the \e{handling} of pending +exceptions) to finish first. 
+ + +\S{insFCMOVB} \i\c{FCMOVcc}: Floating-Point Conditional Move + +\c FCMOVB fpureg ; DA C0+r [P6,FPU] +\c FCMOVB ST0,fpureg ; DA C0+r [P6,FPU] + +\c FCMOVE fpureg ; DA C8+r [P6,FPU] +\c FCMOVE ST0,fpureg ; DA C8+r [P6,FPU] + +\c FCMOVBE fpureg ; DA D0+r [P6,FPU] +\c FCMOVBE ST0,fpureg ; DA D0+r [P6,FPU] + +\c FCMOVU fpureg ; DA D8+r [P6,FPU] +\c FCMOVU ST0,fpureg ; DA D8+r [P6,FPU] + +\c FCMOVNB fpureg ; DB C0+r [P6,FPU] +\c FCMOVNB ST0,fpureg ; DB C0+r [P6,FPU] + +\c FCMOVNE fpureg ; DB C8+r [P6,FPU] +\c FCMOVNE ST0,fpureg ; DB C8+r [P6,FPU] + +\c FCMOVNBE fpureg ; DB D0+r [P6,FPU] +\c FCMOVNBE ST0,fpureg ; DB D0+r [P6,FPU] + +\c FCMOVNU fpureg ; DB D8+r [P6,FPU] +\c FCMOVNU ST0,fpureg ; DB D8+r [P6,FPU] + +The \c{FCMOV} instructions perform conditional move operations: each +of them moves the contents of the given register into \c{ST0} if its +condition is satisfied, and does nothing if not. + +The conditions are not the same as the standard condition codes used +with conditional jump instructions. The conditions \c{B}, \c{BE}, +\c{NB}, \c{NBE}, \c{E} and \c{NE} are exactly as normal, but none of +the other standard ones are supported. Instead, the condition \c{U} +and its counterpart \c{NU} are provided; the \c{U} condition is +satisfied if the last two floating-point numbers compared were +\e{unordered}, i.e. they were not equal but neither one could be +said to be greater than the other, for example if they were NaNs. +(The flag state which signals this is the setting of the parity +flag: so the \c{U} condition is notionally equivalent to \c{PE}, and +\c{NU} is equivalent to \c{PO}.) + +The \c{FCMOV} conditions test the main processor's status flags, not +the FPU status flags, so using \c{FCMOV} directly after \c{FCOM} +will not work. Instead, you should either use \c{FCOMI} which writes +directly to the main CPU flags word, or use \c{FSTSW} to extract the +FPU flags. 
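As a sketch of the \c{FCOMI}/\c{FCMOV} pairing recommended above, the following fragment leaves the larger of \c{ST0} and \c{ST1} in \c{ST0}. The label \c{fp_max} is a hypothetical name used only for this illustration.

```nasm
        bits 32
        section .text
; Hypothetical helper: leaves the larger of ST0 and ST1 in ST0.
fp_max: fcomi   st0, st1    ; compare ST0 with ST1; result goes to the CPU flags
        fcmovb  st0, st1    ; CF set (ST0 < ST1, or unordered): ST0 := ST1
        ret
```

An unordered comparison also sets the carry flag, so code that must treat NaNs specially would need an additional test of the \c{U} condition.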
+ +Although the \c{FCMOV} instructions are flagged \c{P6} above, they +may not be supported by all Pentium Pro processors; the \c{CPUID} +instruction (\k{insCPUID}) will return a bit which indicates whether +conditional moves are supported. + + +\S{insFCOM} \i\c{FCOM}, \i\c{FCOMP}, \i\c{FCOMPP}, \i\c{FCOMI}, +\i\c{FCOMIP}: Floating-Point Compare + +\c FCOM mem32 ; D8 /2 [8086,FPU] +\c FCOM mem64 ; DC /2 [8086,FPU] +\c FCOM fpureg ; D8 D0+r [8086,FPU] +\c FCOM ST0,fpureg ; D8 D0+r [8086,FPU] + +\c FCOMP mem32 ; D8 /3 [8086,FPU] +\c FCOMP mem64 ; DC /3 [8086,FPU] +\c FCOMP fpureg ; D8 D8+r [8086,FPU] +\c FCOMP ST0,fpureg ; D8 D8+r [8086,FPU] + +\c FCOMPP ; DE D9 [8086,FPU] + +\c FCOMI fpureg ; DB F0+r [P6,FPU] +\c FCOMI ST0,fpureg ; DB F0+r [P6,FPU] + +\c FCOMIP fpureg ; DF F0+r [P6,FPU] +\c FCOMIP ST0,fpureg ; DF F0+r [P6,FPU] + +\c{FCOM} compares \c{ST0} with the given operand, and sets the FPU +flags accordingly. \c{ST0} is treated as the left-hand side of the +comparison, so that the carry flag is set (for a `less-than' result) +if \c{ST0} is less than the given operand. + +\c{FCOMP} does the same as \c{FCOM}, but pops the register stack +afterwards. \c{FCOMPP} compares \c{ST0} with \c{ST1} and then pops +the register stack twice. + +\c{FCOMI} and \c{FCOMIP} work like the corresponding forms of +\c{FCOM} and \c{FCOMP}, but write their results directly to the CPU +flags register rather than the FPU status word, so they can be +immediately followed by conditional jump or conditional move +instructions. + +The \c{FCOM} instructions differ from the \c{FUCOM} instructions +(\k{insFUCOM}) only in the way they handle quiet NaNs: \c{FUCOM} +will handle them silently and set the condition code flags to an +`unordered' result, whereas \c{FCOM} will generate an exception. + + +\S{insFCOS} \i\c{FCOS}: Cosine + +\c FCOS ; D9 FF [386,FPU] + +\c{FCOS} computes the cosine of \c{ST0} (in radians), and stores the +result in \c{ST0}. 
The absolute value of \c{ST0} must be less than 2**63. + +See also \c{FSINCOS} (\k{insFSIN}). + + +\S{insFDECSTP} \i\c{FDECSTP}: Decrement Floating-Point Stack Pointer + +\c FDECSTP ; D9 F6 [8086,FPU] + +\c{FDECSTP} decrements the `top' field in the floating-point status +word. This has the effect of rotating the FPU register stack by one, +as if the contents of \c{ST7} had been pushed on the stack. See also +\c{FINCSTP} (\k{insFINCSTP}). + + +\S{insFDISI} \i\c{FxDISI}, \i\c{FxENI}: Disable and Enable Floating-Point Interrupts + +\c FDISI ; 9B DB E1 [8086,FPU] +\c FNDISI ; DB E1 [8086,FPU] + +\c FENI ; 9B DB E0 [8086,FPU] +\c FNENI ; DB E0 [8086,FPU] + +\c{FDISI} and \c{FENI} disable and enable floating-point interrupts. +These instructions are only meaningful on original 8087 processors: +the 287 and above treat them as no-operation instructions. + +\c{FNDISI} and \c{FNENI} do the same thing as \c{FDISI} and \c{FENI} +respectively, but without waiting for the floating-point processor +to finish what it was doing first. + + +\S{insFDIV} \i\c{FDIV}, \i\c{FDIVP}, \i\c{FDIVR}, \i\c{FDIVRP}: Floating-Point Division + +\c FDIV mem32 ; D8 /6 [8086,FPU] +\c FDIV mem64 ; DC /6 [8086,FPU] + +\c FDIV fpureg ; D8 F0+r [8086,FPU] +\c FDIV ST0,fpureg ; D8 F0+r [8086,FPU] + +\c FDIV TO fpureg ; DC F8+r [8086,FPU] +\c FDIV fpureg,ST0 ; DC F8+r [8086,FPU] + +\c FDIVR mem32 ; D8 /7 [8086,FPU] +\c FDIVR mem64 ; DC /7 [8086,FPU] + +\c FDIVR fpureg ; D8 F8+r [8086,FPU] +\c FDIVR ST0,fpureg ; D8 F8+r [8086,FPU] + +\c FDIVR TO fpureg ; DC F0+r [8086,FPU] +\c FDIVR fpureg,ST0 ; DC F0+r [8086,FPU] + +\c FDIVP fpureg ; DE F8+r [8086,FPU] +\c FDIVP fpureg,ST0 ; DE F8+r [8086,FPU] + +\c FDIVRP fpureg ; DE F0+r [8086,FPU] +\c FDIVRP fpureg,ST0 ; DE F0+r [8086,FPU] + +\b \c{FDIV} divides \c{ST0} by the given operand and stores the result +back in \c{ST0}, unless the \c{TO} qualifier is given, in which case +it divides the given operand by \c{ST0} and stores the result in the +operand. 
+ +\b \c{FDIVR} does the same thing, but does the division the other way +up: so if \c{TO} is not given, it divides the given operand by +\c{ST0} and stores the result in \c{ST0}, whereas if \c{TO} is given +it divides \c{ST0} by its operand and stores the result in the +operand. + +\b \c{FDIVP} operates like \c{FDIV TO}, but pops the register stack +once it has finished. + +\b \c{FDIVRP} operates like \c{FDIVR TO}, but pops the register stack +once it has finished. + +For FP/Integer divisions, see \c{FIDIV} (\k{insFIDIV}). + + +\S{insFEMMS} \i\c{FEMMS}: Faster Enter/Exit of the MMX or floating-point state + +\c FEMMS ; 0F 0E [PENT,3DNOW] + +\c{FEMMS} can be used in place of the \c{EMMS} instruction on +processors which support the 3DNow! instruction set. Following +execution of \c{FEMMS}, the state of the \c{MMX/FP} registers +is undefined, and this allows a faster context switch between +\c{FP} and \c{MMX} instructions. The \c{FEMMS} instruction can +also be used \e{before} executing \c{MMX} instructions. + + +\S{insFFREE} \i\c{FFREE}: Flag Floating-Point Register as Unused + +\c FFREE fpureg ; DD C0+r [8086,FPU] +\c FFREEP fpureg ; DF C0+r [286,FPU,UNDOC] + +\c{FFREE} marks the given register as being empty. + +\c{FFREEP} marks the given register as being empty, and then +pops the register stack. + + +\S{insFIADD} \i\c{FIADD}: Floating-Point/Integer Addition + +\c FIADD mem16 ; DE /0 [8086,FPU] +\c FIADD mem32 ; DA /0 [8086,FPU] + +\c{FIADD} adds the 16-bit or 32-bit integer stored in the given +memory location to \c{ST0}, storing the result in \c{ST0}. + + +\S{insFICOM} \i\c{FICOM}, \i\c{FICOMP}: Floating-Point/Integer Compare + +\c FICOM mem16 ; DE /2 [8086,FPU] +\c FICOM mem32 ; DA /2 [8086,FPU] + +\c FICOMP mem16 ; DE /3 [8086,FPU] +\c FICOMP mem32 ; DA /3 [8086,FPU] + +\c{FICOM} compares \c{ST0} with the 16-bit or 32-bit integer stored +in the given memory location, and sets the FPU flags accordingly. 
+\c{FICOMP} does the same, but pops the register stack afterwards. + + +\S{insFIDIV} \i\c{FIDIV}, \i\c{FIDIVR}: Floating-Point/Integer Division + +\c FIDIV mem16 ; DE /6 [8086,FPU] +\c FIDIV mem32 ; DA /6 [8086,FPU] + +\c FIDIVR mem16 ; DE /7 [8086,FPU] +\c FIDIVR mem32 ; DA /7 [8086,FPU] + +\c{FIDIV} divides \c{ST0} by the 16-bit or 32-bit integer stored in +the given memory location, and stores the result in \c{ST0}. +\c{FIDIVR} does the division the other way up: it divides the +integer by \c{ST0}, but still stores the result in \c{ST0}. + + +\S{insFILD} \i\c{FILD}, \i\c{FIST}, \i\c{FISTP}: Floating-Point/Integer Conversion + +\c FILD mem16 ; DF /0 [8086,FPU] +\c FILD mem32 ; DB /0 [8086,FPU] +\c FILD mem64 ; DF /5 [8086,FPU] + +\c FIST mem16 ; DF /2 [8086,FPU] +\c FIST mem32 ; DB /2 [8086,FPU] + +\c FISTP mem16 ; DF /3 [8086,FPU] +\c FISTP mem32 ; DB /3 [8086,FPU] +\c FISTP mem64 ; DF /7 [8086,FPU] + +\c{FILD} loads an integer out of a memory location, converts it to a +real, and pushes it on the FPU register stack. \c{FIST} converts +\c{ST0} to an integer and stores that in memory; \c{FISTP} does the +same as \c{FIST}, but pops the register stack afterwards. + + +\S{insFIMUL} \i\c{FIMUL}: Floating-Point/Integer Multiplication + +\c FIMUL mem16 ; DE /1 [8086,FPU] +\c FIMUL mem32 ; DA /1 [8086,FPU] + +\c{FIMUL} multiplies \c{ST0} by the 16-bit or 32-bit integer stored +in the given memory location, and stores the result in \c{ST0}. + + +\S{insFINCSTP} \i\c{FINCSTP}: Increment Floating-Point Stack Pointer + +\c FINCSTP ; D9 F7 [8086,FPU] + +\c{FINCSTP} increments the `top' field in the floating-point status +word. This has the effect of rotating the FPU register stack by one, +as if the register stack had been popped; however, unlike the +popping of the stack performed by many FPU instructions, it does not +flag the new \c{ST7} (previously \c{ST0}) as empty. See also +\c{FDECSTP} (\k{insFDECSTP}). 
+ + +\S{insFINIT} \i\c{FINIT}, \i\c{FNINIT}: Initialize Floating-Point Unit + +\c FINIT ; 9B DB E3 [8086,FPU] +\c FNINIT ; DB E3 [8086,FPU] + +\c{FINIT} initializes the FPU to its default state. It flags all +registers as empty, without actually changing their values, and clears +the top-of-stack pointer. \c{FNINIT} does the same, without first +waiting for pending exceptions to clear. + + +\S{insFISUB} \i\c{FISUB}: Floating-Point/Integer Subtraction + +\c FISUB mem16 ; DE /4 [8086,FPU] +\c FISUB mem32 ; DA /4 [8086,FPU] + +\c FISUBR mem16 ; DE /5 [8086,FPU] +\c FISUBR mem32 ; DA /5 [8086,FPU] + +\c{FISUB} subtracts the 16-bit or 32-bit integer stored in the given +memory location from \c{ST0}, and stores the result in \c{ST0}. +\c{FISUBR} does the subtraction the other way round, i.e. it +subtracts \c{ST0} from the given integer, but still stores the +result in \c{ST0}. + + +\S{insFLD} \i\c{FLD}: Floating-Point Load + +\c FLD mem32 ; D9 /0 [8086,FPU] +\c FLD mem64 ; DD /0 [8086,FPU] +\c FLD mem80 ; DB /5 [8086,FPU] +\c FLD fpureg ; D9 C0+r [8086,FPU] + +\c{FLD} loads a floating-point value out of the given register or +memory location, and pushes it on the FPU register stack. + + +\S{insFLD1} \i\c{FLDxx}: Floating-Point Load Constants + +\c FLD1 ; D9 E8 [8086,FPU] +\c FLDL2E ; D9 EA [8086,FPU] +\c FLDL2T ; D9 E9 [8086,FPU] +\c FLDLG2 ; D9 EC [8086,FPU] +\c FLDLN2 ; D9 ED [8086,FPU] +\c FLDPI ; D9 EB [8086,FPU] +\c FLDZ ; D9 EE [8086,FPU] + +These instructions push specific standard constants on the FPU +register stack. 
+ +\c Instruction Constant pushed + +\c FLD1 1 +\c FLDL2E base-2 logarithm of e +\c FLDL2T base-2 log of 10 +\c FLDLG2 base-10 log of 2 +\c FLDLN2 base-e log of 2 +\c FLDPI pi +\c FLDZ zero + + +\S{insFLDCW} \i\c{FLDCW}: Load Floating-Point Control Word + +\c FLDCW mem16 ; D9 /5 [8086,FPU] + +\c{FLDCW} loads a 16-bit value out of memory and stores it into the +FPU control word (governing things like the rounding mode, the +precision, and the exception masks). See also \c{FSTCW} +(\k{insFSTCW}). If exceptions are enabled and you don't want to +generate one, use \c{FCLEX} or \c{FNCLEX} (\k{insFCLEX}) before +loading the new control word. + + +\S{insFLDENV} \i\c{FLDENV}: Load Floating-Point Environment + +\c FLDENV mem ; D9 /4 [8086,FPU] + +\c{FLDENV} loads the FPU operating environment (control word, status +word, tag word, instruction pointer, data pointer and last opcode) +from memory. The memory area is 14 or 28 bytes long, depending on +the CPU mode at the time. See also \c{FSTENV} (\k{insFSTENV}). + + +\S{insFMUL} \i\c{FMUL}, \i\c{FMULP}: Floating-Point Multiply + +\c FMUL mem32 ; D8 /1 [8086,FPU] +\c FMUL mem64 ; DC /1 [8086,FPU] + +\c FMUL fpureg ; D8 C8+r [8086,FPU] +\c FMUL ST0,fpureg ; D8 C8+r [8086,FPU] + +\c FMUL TO fpureg ; DC C8+r [8086,FPU] +\c FMUL fpureg,ST0 ; DC C8+r [8086,FPU] + +\c FMULP fpureg ; DE C8+r [8086,FPU] +\c FMULP fpureg,ST0 ; DE C8+r [8086,FPU] + +\c{FMUL} multiplies \c{ST0} by the given operand, and stores the +result in \c{ST0}, unless the \c{TO} qualifier is used in which case +it stores the result in the operand. \c{FMULP} performs the same +operation as \c{FMUL TO}, and then pops the register stack. + + +\S{insFNOP} \i\c{FNOP}: Floating-Point No Operation + +\c FNOP ; D9 D0 [8086,FPU] + +\c{FNOP} does nothing. 
+ + +\S{insFPATAN} \i\c{FPATAN}, \i\c{FPTAN}: Arctangent and Tangent + +\c FPATAN ; D9 F3 [8086,FPU] +\c FPTAN ; D9 F2 [8086,FPU] + +\c{FPATAN} computes the arctangent, in radians, of the result of +dividing \c{ST1} by \c{ST0}, stores the result in \c{ST1}, and pops +the register stack. It works like the C \c{atan2} function, in that +changing the sign of both \c{ST0} and \c{ST1} changes the output +value by pi (so it performs true rectangular-to-polar coordinate +conversion, with \c{ST1} being the Y coordinate and \c{ST0} being +the X coordinate, not merely an arctangent). + +\c{FPTAN} computes the tangent of the value in \c{ST0} (in radians), +and stores the result back into \c{ST0}. + +The absolute value of \c{ST0} must be less than 2**63. + + +\S{insFPREM} \i\c{FPREM}, \i\c{FPREM1}: Floating-Point Partial Remainder + +\c FPREM ; D9 F8 [8086,FPU] +\c FPREM1 ; D9 F5 [386,FPU] + +These instructions both produce the remainder obtained by dividing +\c{ST0} by \c{ST1}. This is calculated, notionally, by dividing +\c{ST0} by \c{ST1}, rounding the result to an integer, multiplying +by \c{ST1} again, and computing the value which would need to be +added back on to the result to get back to the original value in +\c{ST0}. + +The two instructions differ in the way the notional round-to-integer +operation is performed. \c{FPREM} does it by rounding towards zero, +so that the remainder it returns always has the same sign as the +original value in \c{ST0}; \c{FPREM1} does it by rounding to the +nearest integer, so that the remainder always has at most half the +magnitude of \c{ST1}. + +Both instructions calculate \e{partial} remainders, meaning that +they may not manage to provide the final result, but might leave +intermediate results in \c{ST0} instead. If this happens, they will +set the C2 flag in the FPU status word; therefore, to calculate a +remainder, you should repeatedly execute \c{FPREM} or \c{FPREM1} +until C2 becomes clear. 
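The repeat-until-C2-clear procedure described above might be coded as in the following sketch. The label \c{fp_remainder} is a hypothetical name; the routine assumes the dividend is in \c{ST0} and the divisor in \c{ST1} on entry, and it uses \c{FNSTSW AX}, which requires a 287 or later.

```nasm
        bits 32
        section .text
; Hypothetical helper: on entry ST0 = dividend, ST1 = divisor.
; On exit ST0 holds the complete remainder; AX is clobbered.
fp_remainder:
.again: fprem               ; ST0 := partial remainder of ST0 / ST1
        fnstsw  ax          ; copy the FPU status word into AX
        test    ah, 4       ; C2 is bit 10 of the status word (bit 2 of AH)
        jnz     .again      ; C2 set: reduction incomplete, go round again
        ret
```

Substituting \c{FPREM1} for \c{FPREM} in the loop gives the round-to-nearest variant of the remainder.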
+ + +\S{insFRNDINT} \i\c{FRNDINT}: Floating-Point Round to Integer + +\c FRNDINT ; D9 FC [8086,FPU] + +\c{FRNDINT} rounds the contents of \c{ST0} to an integer, according +to the current rounding mode set in the FPU control word, and stores +the result back in \c{ST0}. + + +\S{insFRSTOR} \i\c{FSAVE}, \i\c{FRSTOR}: Save/Restore Floating-Point State + +\c FSAVE mem ; 9B DD /6 [8086,FPU] +\c FNSAVE mem ; DD /6 [8086,FPU] + +\c FRSTOR mem ; DD /4 [8086,FPU] + +\c{FSAVE} saves the entire floating-point unit state, including all +the information saved by \c{FSTENV} (\k{insFSTENV}) plus the +contents of all the registers, to a 94 or 108 byte area of memory +(depending on the CPU mode). \c{FRSTOR} restores the floating-point +state from the same area of memory. + +\c{FNSAVE} does the same as \c{FSAVE}, without first waiting for +pending floating-point exceptions to clear. + + +\S{insFSCALE} \i\c{FSCALE}: Scale Floating-Point Value by Power of Two + +\c FSCALE ; D9 FD [8086,FPU] + +\c{FSCALE} scales a number by a power of two: it rounds \c{ST1} +towards zero to obtain an integer, then multiplies \c{ST0} by two to +the power of that integer, and stores the result in \c{ST0}. + + +\S{insFSETPM} \i\c{FSETPM}: Set Protected Mode + +\c FSETPM ; DB E4 [286,FPU] + +This instruction initializes protected mode on the 287 floating-point +coprocessor. It is only meaningful on that processor: the 387 and +above treat the instruction as a no-operation. + + +\S{insFSIN} \i\c{FSIN}, \i\c{FSINCOS}: Sine and Cosine + +\c FSIN ; D9 FE [386,FPU] +\c FSINCOS ; D9 FB [386,FPU] + +\c{FSIN} calculates the sine of \c{ST0} (in radians) and stores the +result in \c{ST0}. \c{FSINCOS} does the same, but then pushes the +cosine of the same value on the register stack, so that the sine +ends up in \c{ST1} and the cosine in \c{ST0}. \c{FSINCOS} is faster +than executing \c{FSIN} and \c{FCOS} (see \k{insFCOS}) in succession. + +The absolute value of \c{ST0} must be less than 2**63. 
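+
+As a rough sketch of the stack layout left by \c{FSINCOS}, the
+fragment below computes both values and stores them. The variables
+\c{angle}, \c{cosine} and \c{sine} are hypothetical dword locations
+assumed to be defined elsewhere.
+
+\c         fld     dword [angle]   ; ST0 = angle in radians
+\c         fsincos                 ; ST0 = cosine, ST1 = sine
+\c         fstp    dword [cosine]  ; store the cosine and pop
+\c         fstp    dword [sine]    ; store the sine and pop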
+ + +\S{insFSQRT} \i\c{FSQRT}: Floating-Point Square Root + +\c FSQRT ; D9 FA [8086,FPU] + +\c{FSQRT} calculates the square root of \c{ST0} and stores the +result in \c{ST0}. + + +\S{insFST} \i\c{FST}, \i\c{FSTP}: Floating-Point Store + +\c FST mem32 ; D9 /2 [8086,FPU] +\c FST mem64 ; DD /2 [8086,FPU] +\c FST fpureg ; DD D0+r [8086,FPU] + +\c FSTP mem32 ; D9 /3 [8086,FPU] +\c FSTP mem64 ; DD /3 [8086,FPU] +\c FSTP mem80 ; DB /7 [8086,FPU] +\c FSTP fpureg ; DD D8+r [8086,FPU] + +\c{FST} stores the value in \c{ST0} into the given memory location +or other FPU register. \c{FSTP} does the same, but then pops the +register stack. + + +\S{insFSTCW} \i\c{FSTCW}: Store Floating-Point Control Word + +\c FSTCW mem16 ; 9B D9 /7 [8086,FPU] +\c FNSTCW mem16 ; D9 /7 [8086,FPU] + +\c{FSTCW} stores the \c{FPU} control word (governing things like the +rounding mode, the precision, and the exception masks) into a 2-byte +memory area. See also \c{FLDCW} (\k{insFLDCW}). + +\c{FNSTCW} does the same thing as \c{FSTCW}, without first waiting +for pending floating-point exceptions to clear. + + +\S{insFSTENV} \i\c{FSTENV}: Store Floating-Point Environment + +\c FSTENV mem ; 9B D9 /6 [8086,FPU] +\c FNSTENV mem ; D9 /6 [8086,FPU] + +\c{FSTENV} stores the \c{FPU} operating environment (control word, +status word, tag word, instruction pointer, data pointer and last +opcode) into memory. The memory area is 14 or 28 bytes long, +depending on the CPU mode at the time. See also \c{FLDENV} +(\k{insFLDENV}). + +\c{FNSTENV} does the same thing as \c{FSTENV}, without first waiting +for pending floating-point exceptions to clear. + + +\S{insFSTSW} \i\c{FSTSW}: Store Floating-Point Status Word + +\c FSTSW mem16 ; 9B DD /7 [8086,FPU] +\c FSTSW AX ; 9B DF E0 [286,FPU] + +\c FNSTSW mem16 ; DD /7 [8086,FPU] +\c FNSTSW AX ; DF E0 [286,FPU] + +\c{FSTSW} stores the \c{FPU} status word into \c{AX} or into a 2-byte +memory area. 
+ +\c{FNSTSW} does the same thing as \c{FSTSW}, without first waiting +for pending floating-point exceptions to clear. + + +\S{insFSUB} \i\c{FSUB}, \i\c{FSUBP}, \i\c{FSUBR}, \i\c{FSUBRP}: Floating-Point Subtract + +\c FSUB mem32 ; D8 /4 [8086,FPU] +\c FSUB mem64 ; DC /4 [8086,FPU] + +\c FSUB fpureg ; D8 E0+r [8086,FPU] +\c FSUB ST0,fpureg ; D8 E0+r [8086,FPU] + +\c FSUB TO fpureg ; DC E8+r [8086,FPU] +\c FSUB fpureg,ST0 ; DC E8+r [8086,FPU] + +\c FSUBR mem32 ; D8 /5 [8086,FPU] +\c FSUBR mem64 ; DC /5 [8086,FPU] + +\c FSUBR fpureg ; D8 E8+r [8086,FPU] +\c FSUBR ST0,fpureg ; D8 E8+r [8086,FPU] + +\c FSUBR TO fpureg ; DC E0+r [8086,FPU] +\c FSUBR fpureg,ST0 ; DC E0+r [8086,FPU] + +\c FSUBP fpureg ; DE E8+r [8086,FPU] +\c FSUBP fpureg,ST0 ; DE E8+r [8086,FPU] + +\c FSUBRP fpureg ; DE E0+r [8086,FPU] +\c FSUBRP fpureg,ST0 ; DE E0+r [8086,FPU] + +\b \c{FSUB} subtracts the given operand from \c{ST0} and stores the +result back in \c{ST0}, unless the \c{TO} qualifier is given, in +which case it subtracts \c{ST0} from the given operand and stores +the result in the operand. + +\b \c{FSUBR} does the same thing, but does the subtraction the other +way up: so if \c{TO} is not given, it subtracts \c{ST0} from the given +operand and stores the result in \c{ST0}, whereas if \c{TO} is given +it subtracts its operand from \c{ST0} and stores the result in the +operand. + +\b \c{FSUBP} operates like \c{FSUB TO}, but pops the register stack +once it has finished. + +\b \c{FSUBRP} operates like \c{FSUBR TO}, but pops the register stack +once it has finished. + + +\S{insFTST} \i\c{FTST}: Test \c{ST0} Against Zero + +\c FTST ; D9 E4 [8086,FPU] + +\c{FTST} compares \c{ST0} with zero and sets the FPU flags +accordingly. \c{ST0} is treated as the left-hand side of the +comparison, so that a `less-than' result is generated if \c{ST0} is +negative. 
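+
+One common way of acting on the result of \c{FTST} is to copy the FPU
+flags into the CPU flags, as sketched below. The three target labels
+are hypothetical, and \c{FNSTSW AX} requires a 286 or later. The
+unordered case is tested first, because a NaN sets C0, C2 and C3
+together.
+
+\c         ftst                    ; compare ST0 with zero
+\c         fnstsw  ax              ; copy the FPU status word into AX
+\c         sahf                    ; C0 -> CF, C2 -> PF, C3 -> ZF
+\c         jp      is_unordered    ; ST0 is a NaN
+\c         jb      is_negative     ; ST0 is less than zero
+\c         je      is_zero         ; ST0 is zero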
+ + +\S{insFUCOM} \i\c{FUCOMxx}: Floating-Point Unordered Compare + +\c FUCOM fpureg ; DD E0+r [386,FPU] +\c FUCOM ST0,fpureg ; DD E0+r [386,FPU] + +\c FUCOMP fpureg ; DD E8+r [386,FPU] +\c FUCOMP ST0,fpureg ; DD E8+r [386,FPU] + +\c FUCOMPP ; DA E9 [386,FPU] + +\c FUCOMI fpureg ; DB E8+r [P6,FPU] +\c FUCOMI ST0,fpureg ; DB E8+r [P6,FPU] + +\c FUCOMIP fpureg ; DF E8+r [P6,FPU] +\c FUCOMIP ST0,fpureg ; DF E8+r [P6,FPU] + +\b \c{FUCOM} compares \c{ST0} with the given operand, and sets the +FPU flags accordingly. \c{ST0} is treated as the left-hand side of +the comparison, so that the carry flag is set (for a `less-than' +result) if \c{ST0} is less than the given operand. + +\b \c{FUCOMP} does the same as \c{FUCOM}, but pops the register stack +afterwards. \c{FUCOMPP} compares \c{ST0} with \c{ST1} and then pops +the register stack twice. + +\b \c{FUCOMI} and \c{FUCOMIP} work like the corresponding forms of +\c{FUCOM} and \c{FUCOMP}, but write their results directly to the CPU +flags register rather than the FPU status word, so they can be +immediately followed by conditional jump or conditional move +instructions. + +The \c{FUCOM} instructions differ from the \c{FCOM} instructions +(\k{insFCOM}) only in the way they handle quiet NaNs: \c{FUCOM} will +handle them silently and set the condition code flags to an +`unordered' result, whereas \c{FCOM} will generate an exception. + + +\S{insFXAM} \i\c{FXAM}: Examine Class of Value in \c{ST0} + +\c FXAM ; D9 E5 [8086,FPU] + +\c{FXAM} sets the FPU flags \c{C3}, \c{C2} and \c{C0} depending on +the type of value stored in \c{ST0}: + +\c Register contents Flags + +\c Unsupported format 000 +\c NaN 001 +\c Finite number 010 +\c Infinity 011 +\c Zero 100 +\c Empty register 101 +\c Denormal 110 + +Additionally, the \c{C1} flag is set to the sign of the number. 
+ + +\S{insFXCH} \i\c{FXCH}: Floating-Point Exchange + +\c FXCH ; D9 C9 [8086,FPU] +\c FXCH fpureg ; D9 C8+r [8086,FPU] +\c FXCH fpureg,ST0 ; D9 C8+r [8086,FPU] +\c FXCH ST0,fpureg ; D9 C8+r [8086,FPU] + +\c{FXCH} exchanges \c{ST0} with a given FPU register. The no-operand +form exchanges \c{ST0} with \c{ST1}. + + +\S{insFXRSTOR} \i\c{FXRSTOR}: Restore \c{FP}, \c{MMX} and \c{SSE} State + +\c FXRSTOR memory ; 0F AE /1 [P6,SSE,FPU] + +The \c{FXRSTOR} instruction reloads the \c{FPU}, \c{MMX} and \c{SSE} +state (environment and registers) from the 512-byte memory area defined +by the source operand. This data should have been written by a previous +\c{FXSAVE}. + + +\S{insFXSAVE} \i\c{FXSAVE}: Store \c{FP}, \c{MMX} and \c{SSE} State + +\c FXSAVE memory ; 0F AE /0 [P6,SSE,FPU] + +The \c{FXSAVE} instruction writes the current \c{FPU}, \c{MMX} +and \c{SSE} state (environment and registers) to the +512-byte memory area defined by the destination operand. It does this +without checking for pending unmasked floating-point exceptions +(similar to the operation of \c{FNSAVE}). + +Unlike \c{FSAVE} and \c{FNSAVE}, \c{FXSAVE} leaves the +\c{FPU}, \c{MMX} and \c{SSE} state in the processor intact +after the state has been saved. This instruction has been optimized +to maximize floating-point save performance. + + +\S{insFXTRACT} \i\c{FXTRACT}: Extract Exponent and Significand + +\c FXTRACT ; D9 F4 [8086,FPU] + +\c{FXTRACT} separates the number in \c{ST0} into its exponent and +significand (mantissa), stores the exponent back into \c{ST0}, and +then pushes the significand on the register stack (so that the +significand ends up in \c{ST0}, and the exponent in \c{ST1}).
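+
+The resulting stack layout can be illustrated with a short fragment.
+This is a sketch only: \c{value}, \c{signif} and \c{expon} are
+hypothetical qword variables assumed to be defined elsewhere.
+
+\c         fld     qword [value]   ; ST0 = number to split
+\c         fxtract                 ; ST0 = significand, ST1 = exponent
+\c         fstp    qword [signif]  ; store the significand and pop
+\c         fstp    qword [expon]   ; store the exponent and pop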
+ + +\S{insFYL2X} \i\c{FYL2X}, \i\c{FYL2XP1}: Compute Y times Log2(X) or Log2(X+1) + +\c FYL2X ; D9 F1 [8086,FPU] +\c FYL2XP1 ; D9 F9 [8086,FPU] + +\c{FYL2X} multiplies \c{ST1} by the base-2 logarithm of \c{ST0}, +stores the result in \c{ST1}, and pops the register stack (so that +the result ends up in \c{ST0}). \c{ST0} must be non-zero and +positive. + +\c{FYL2XP1} works the same way, but replacing the base-2 log of +\c{ST0} with that of \c{ST0} plus one. This time, \c{ST0} must have +magnitude no greater than 1 minus half the square root of two. + + +\S{insHLT} \i\c{HLT}: Halt Processor + +\c HLT ; F4 [8086,PRIV] + +\c{HLT} puts the processor into a halted state, where it will +perform no more operations until restarted by an interrupt or a +reset. + +On the 286 and later processors, this is a privileged instruction. + + +\S{insIBTS} \i\c{IBTS}: Insert Bit String + +\c IBTS r/m16,reg16 ; o16 0F A7 /r [386,UNDOC] +\c IBTS r/m32,reg32 ; o32 0F A7 /r [386,UNDOC] + +The implied operation of this instruction is: + +\c IBTS r/m16,AX,CL,reg16 +\c IBTS r/m32,EAX,CL,reg32 + +Writes a bit string from the source operand to the destination. +\c{CL} indicates the number of bits to be copied, from the low bits +of the source. \c{(E)AX} indicates the low order bit offset in the +destination that is written to. For example, if \c{CL} is set to 4 +and \c{AX} (for 16-bit code) is set to 5, bits 0-3 of \c{src} will +be copied to bits 5-8 of \c{dst}. This instruction is very poorly +documented, and I have been unable to find any official source of +documentation on it. + +\c{IBTS} is supported only on the early Intel 386s, and conflicts +with the opcodes for \c{CMPXCHG486} (on early Intel 486s). NASM +supports it only for completeness. Its counterpart is \c{XBTS} +(see \k{insXBTS}). + + +\S{insIDIV} \i\c{IDIV}: Signed Integer Divide + +\c IDIV r/m8 ; F6 /7 [8086] +\c IDIV r/m16 ; o16 F7 /7 [8086] +\c IDIV r/m32 ; o32 F7 /7 [386] + +\c{IDIV} performs signed integer division. 
The explicit operand +provided is the divisor; the dividend and destination operands +are implicit, in the following way: + +\b For \c{IDIV r/m8}, \c{AX} is divided by the given operand; +the quotient is stored in \c{AL} and the remainder in \c{AH}. + +\b For \c{IDIV r/m16}, \c{DX:AX} is divided by the given operand; +the quotient is stored in \c{AX} and the remainder in \c{DX}. + +\b For \c{IDIV r/m32}, \c{EDX:EAX} is divided by the given operand; +the quotient is stored in \c{EAX} and the remainder in \c{EDX}. + +Unsigned integer division is performed by the \c{DIV} instruction: +see \k{insDIV}. + + +\S{insIMUL} \i\c{IMUL}: Signed Integer Multiply + +\c IMUL r/m8 ; F6 /5 [8086] +\c IMUL r/m16 ; o16 F7 /5 [8086] +\c IMUL r/m32 ; o32 F7 /5 [386] + +\c IMUL reg16,r/m16 ; o16 0F AF /r [386] +\c IMUL reg32,r/m32 ; o32 0F AF /r [386] + +\c IMUL reg16,imm8 ; o16 6B /r ib [186] +\c IMUL reg16,imm16 ; o16 69 /r iw [186] +\c IMUL reg32,imm8 ; o32 6B /r ib [386] +\c IMUL reg32,imm32 ; o32 69 /r id [386] + +\c IMUL reg16,r/m16,imm8 ; o16 6B /r ib [186] +\c IMUL reg16,r/m16,imm16 ; o16 69 /r iw [186] +\c IMUL reg32,r/m32,imm8 ; o32 6B /r ib [386] +\c IMUL reg32,r/m32,imm32 ; o32 69 /r id [386] + +\c{IMUL} performs signed integer multiplication. For the +single-operand form, the other operand and destination are +implicit, in the following way: + +\b For \c{IMUL r/m8}, \c{AL} is multiplied by the given operand; +the product is stored in \c{AX}. + +\b For \c{IMUL r/m16}, \c{AX} is multiplied by the given operand; +the product is stored in \c{DX:AX}. + +\b For \c{IMUL r/m32}, \c{EAX} is multiplied by the given operand; +the product is stored in \c{EDX:EAX}. + +The two-operand form multiplies its two operands and stores the +result in the destination (first) operand. The three-operand +form multiplies its last two operands and stores the result in +the first operand. 
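+
+As an illustration of the forms described so far, the following lines
+show where each one leaves its result. The register choices are
+arbitrary, and the 32-bit forms assume a 386 or later.
+
+\c         imul    bl              ; AX = AL * BL
+\c         imul    ecx             ; EDX:EAX = EAX * ECX
+\c         imul    eax,ebx         ; EAX = EAX * EBX (low 32 bits)
+\c         imul    eax,ebx,1000    ; EAX = EBX * 1000 (low 32 bits)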
+ +The two-operand form with an immediate second operand is in +fact a shorthand for the three-operand form, as can be seen by +examining the opcode descriptions: in the two-operand form, the +code \c{/r} takes both its register and \c{r/m} parts from the +same operand (the first one). + +In the forms with an 8-bit immediate operand and another longer +source operand, the immediate operand is considered to be signed, +and is sign-extended to the length of the other source operand. +In these cases, the \c{BYTE} qualifier is necessary to force +NASM to generate this form of the instruction. + +Unsigned integer multiplication is performed by the \c{MUL} +instruction: see \k{insMUL}. + + +\S{insIN} \i\c{IN}: Input from I/O Port + +\c IN AL,imm8 ; E4 ib [8086] +\c IN AX,imm8 ; o16 E5 ib [8086] +\c IN EAX,imm8 ; o32 E5 ib [386] +\c IN AL,DX ; EC [8086] +\c IN AX,DX ; o16 ED [8086] +\c IN EAX,DX ; o32 ED [386] + +\c{IN} reads a byte, word or doubleword from the specified I/O port, +and stores it in the given destination register. The port number may +be specified as an immediate value if it is between 0 and 255, and +otherwise must be stored in \c{DX}. See also \c{OUT} (\k{insOUT}). + + +\S{insINC} \i\c{INC}: Increment Integer + +\c INC reg16 ; o16 40+r [8086] +\c INC reg32 ; o32 40+r [386] +\c INC r/m8 ; FE /0 [8086] +\c INC r/m16 ; o16 FF /0 [8086] +\c INC r/m32 ; o32 FF /0 [386] + +\c{INC} adds 1 to its operand. It does \e{not} affect the carry +flag: to affect the carry flag, use \c{ADD something,1} (see +\k{insADD}). \c{INC} affects all the other flags according to the result. + +This instruction can be used with a \c{LOCK} prefix to allow atomic execution. + +See also \c{DEC} (\k{insDEC}). + + +\S{insINSB} \i\c{INSB}, \i\c{INSW}, \i\c{INSD}: Input String from I/O Port + +\c INSB ; 6C [186] +\c INSW ; o16 6D [186] +\c INSD ; o32 6D [386] + +\c{INSB} inputs a byte from the I/O port specified in \c{DX} and +stores it at \c{[ES:DI]} or \c{[ES:EDI]}. 
It then increments or +decrements (depending on the direction flag: increments if the flag +is clear, decrements if it is set) \c{DI} or \c{EDI}. + +The register used is \c{DI} if the address size is 16 bits, and +\c{EDI} if it is 32 bits. If you need to use an address size not +equal to the current \c{BITS} setting, you can use an explicit +\i\c{a16} or \i\c{a32} prefix. + +Segment override prefixes have no effect for this instruction: the +use of \c{ES} for the load from \c{[DI]} or \c{[EDI]} cannot be +overridden. + +\c{INSW} and \c{INSD} work in the same way, but they input a word or +a doubleword instead of a byte, and increment or decrement the +addressing register by 2 or 4 instead of 1. + +The \c{REP} prefix may be used to repeat the instruction \c{CX} (or +\c{ECX} - again, the address size chooses which) times. + +See also \c{OUTSB}, \c{OUTSW} and \c{OUTSD} (\k{insOUTSB}). + + +\S{insINT} \i\c{INT}: Software Interrupt + +\c INT imm8 ; CD ib [8086] + +\c{INT} causes a software interrupt through a specified vector +number from 0 to 255. + +The code generated by the \c{INT} instruction is always two bytes +long: although there are short forms for some \c{INT} instructions, +NASM does not generate them when it sees the \c{INT} mnemonic. In +order to generate single-byte breakpoint instructions, use the +\c{INT3} or \c{INT1} instructions (see \k{insINT1}) instead. + + +\S{insINT1} \i\c{INT3}, \i\c{INT1}, \i\c{ICEBP}, \i\c{INT01}: Breakpoints + +\c INT1 ; F1 [P6] +\c ICEBP ; F1 [P6] +\c INT01 ; F1 [P6] + +\c INT3 ; CC [8086] +\c INT03 ; CC [8086] + +\c{INT1} and \c{INT3} are short one-byte forms of the instructions +\c{INT 1} and \c{INT 3} (see \k{insINT}). They perform a similar +function to their longer counterparts, but take up less code space. +They are used as breakpoints by debuggers. + +\b \c{INT1}, and its alternative synonyms \c{INT01} and \c{ICEBP}, is +an instruction used by in-circuit emulators (ICEs). 
It is present, +though not documented, on some processors as far back as the 286, but is +officially documented only for the Pentium Pro. \c{INT3} is the instruction +normally used as a breakpoint by debuggers. + +\b \c{INT3}, and its synonym \c{INT03}, is not precisely equivalent to +\c{INT 3}: the short form, since it is designed to be used as a +breakpoint, bypasses the normal \c{IOPL} checks in virtual-8086 mode, +and also does not go through interrupt redirection. + + +\S{insINTO} \i\c{INTO}: Interrupt if Overflow + +\c INTO ; CE [8086] + +\c{INTO} performs an \c{INT 4} software interrupt (see \k{insINT}) +if and only if the overflow flag is set. + + +\S{insINVD} \i\c{INVD}: Invalidate Internal Caches + +\c INVD ; 0F 08 [486] + +\c{INVD} invalidates and empties the processor's internal caches, +and causes the processor to instruct external caches to do the same. +It does not write the contents of the caches back to memory first: +any modified data held in the caches will be lost. To write the data +back first, use \c{WBINVD} (\k{insWBINVD}). + + +\S{insINVLPG} \i\c{INVLPG}: Invalidate TLB Entry + +\c INVLPG mem ; 0F 01 /7 [486] + +\c{INVLPG} invalidates the translation lookaside buffer (TLB) entry +associated with the supplied memory address. + + +\S{insIRET} \i\c{IRET}, \i\c{IRETW}, \i\c{IRETD}: Return from Interrupt + +\c IRET ; CF [8086] +\c IRETW ; o16 CF [8086] +\c IRETD ; o32 CF [386] + +\c{IRET} returns from an interrupt (hardware or software) by +popping \c{IP} (or \c{EIP}), \c{CS} and the flags off the stack +and then continuing execution from the new \c{CS:IP}. + +\c{IRETW} pops \c{IP}, \c{CS} and the flags as 2 bytes each, taking +6 bytes off the stack in total. \c{IRETD} pops \c{EIP} as 4 bytes, +pops a further 4 bytes of which the top two are discarded and the +bottom two go into \c{CS}, and pops the flags as 4 bytes as well, +taking 12 bytes off the stack.
+ +\c{IRET} is a shorthand for either \c{IRETW} or \c{IRETD}, depending +on the default \c{BITS} setting at the time. + + +\S{insJcc} \i\c{Jcc}: Conditional Branch + +\c Jcc imm ; 70+cc rb [8086] +\c Jcc NEAR imm ; 0F 80+cc rw/rd [386] + +The \i{conditional jump} instructions execute a near (same segment) +jump if and only if their conditions are satisfied. For example, +\c{JNZ} jumps only if the zero flag is not set. + +The ordinary form of the instructions has only a 128-byte range; the +\c{NEAR} form is a 386 extension to the instruction set, and can +span the full size of a segment. NASM will not override your choice +of jump instruction: if you want \c{Jcc NEAR}, you have to use the +\c{NEAR} keyword. + +The \c{SHORT} keyword is allowed on the first form of the +instruction, for clarity, but is not necessary. + +For details of the condition codes, see \k{iref-cc}. + + +\S{insJCXZ} \i\c{JCXZ}, \i\c{JECXZ}: Jump if CX/ECX Zero + +\c JCXZ imm ; a16 E3 rb [8086] +\c JECXZ imm ; a32 E3 rb [386] + +\c{JCXZ} performs a short jump (with maximum range 128 bytes) if and +only if the contents of the \c{CX} register is 0. \c{JECXZ} does the +same thing, but with \c{ECX}. + + +\S{insJMP} \i\c{JMP}: Jump + +\c JMP imm ; E9 rw/rd [8086] +\c JMP SHORT imm ; EB rb [8086] +\c JMP imm:imm16 ; o16 EA iw iw [8086] +\c JMP imm:imm32 ; o32 EA id iw [386] +\c JMP FAR mem ; o16 FF /5 [8086] +\c JMP FAR mem32 ; o32 FF /5 [386] +\c JMP r/m16 ; o16 FF /4 [8086] +\c JMP r/m32 ; o32 FF /4 [386] + +\c{JMP} jumps to a given address. The address may be specified as an +absolute segment and offset, or as a relative jump within the +current segment. + +\c{JMP SHORT imm} has a maximum range of 128 bytes, since the +displacement is specified as only 8 bits, but takes up less code +space. NASM does not choose when to generate \c{JMP SHORT} for you: +you must explicitly code \c{SHORT} every time you want a short jump. 
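+
+For example, the two relative forms might be written as below. This
+is only a sketch: \c{nearby} and \c{distant} are hypothetical labels,
+the first assumed to lie within range of an 8-bit displacement.
+
+\c         jmp     short nearby    ; EB rb: 8-bit displacement
+\c         jmp     distant         ; E9 rw/rd: 16- or 32-bit displacement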
+ +You can choose between the two immediate \i{far jump} forms (\c{JMP +imm:imm}) by the use of the \c{WORD} and \c{DWORD} keywords: \c{JMP +WORD 0x1234:0x5678} or \c{JMP DWORD 0x1234:0x56789abc}. + +The \c{JMP FAR mem} forms execute a far jump by loading the +destination address out of memory. The address loaded consists of 16 +or 32 bits of offset (depending on the operand size), and 16 bits of +segment. The operand size may be overridden using \c{JMP WORD FAR +mem} or \c{JMP DWORD FAR mem}. + +The \c{JMP r/m} forms execute a \i{near jump} (within the same +segment), loading the destination address out of memory or out of a +register. The keyword \c{NEAR} may be specified, for clarity, in +these forms, but is not necessary. Again, operand size can be +overridden using \c{JMP WORD mem} or \c{JMP DWORD mem}. + +As a convenience, NASM does not require you to jump to a far symbol +by coding the cumbersome \c{JMP SEG routine:routine}, but instead +allows the easier synonym \c{JMP FAR routine}. + +The \c{JMP r/m} forms given above are near jumps; NASM will accept +the \c{NEAR} keyword (e.g. \c{JMP NEAR [address]}), even though it +is not strictly necessary. + + +\S{insLAHF} \i\c{LAHF}: Load AH from Flags + +\c LAHF ; 9F [8086] + +\c{LAHF} sets the \c{AH} register according to the contents of the +low byte of the flags word. + +The operation of \c{LAHF} is: + +\c AH <-- SF:ZF:0:AF:0:PF:1:CF + +See also \c{SAHF} (\k{insSAHF}). + + +\S{insLAR} \i\c{LAR}: Load Access Rights + +\c LAR reg16,r/m16 ; o16 0F 02 /r [286,PRIV] +\c LAR reg32,r/m32 ; o32 0F 02 /r [286,PRIV] + +\c{LAR} takes the segment selector specified by its source (second) +operand, finds the corresponding segment descriptor in the GDT or +LDT, and loads the access-rights byte of the descriptor into its +destination (first) operand.
+ + +\S{insLDMXCSR} \i\c{LDMXCSR}: Load Streaming SIMD Extension + Control/Status + +\c LDMXCSR mem32 ; 0F AE /2 [KATMAI,SSE] + +\c{LDMXCSR} loads 32 bits of data from the specified memory location +into the \c{MXCSR} control/status register. \c{MXCSR} is used to +enable masked/unmasked exception handling, to set rounding modes, +to set flush-to-zero mode, and to view exception status flags. + +For details of the \c{MXCSR} register, see the Intel processor docs. + +See also \c{STMXCSR} (\k{insSTMXCSR}). + + +\S{insLDS} \i\c{LDS}, \i\c{LES}, \i\c{LFS}, \i\c{LGS}, \i\c{LSS}: Load Far Pointer + +\c LDS reg16,mem ; o16 C5 /r [8086] +\c LDS reg32,mem ; o32 C5 /r [386] + +\c LES reg16,mem ; o16 C4 /r [8086] +\c LES reg32,mem ; o32 C4 /r [386] + +\c LFS reg16,mem ; o16 0F B4 /r [386] +\c LFS reg32,mem ; o32 0F B4 /r [386] + +\c LGS reg16,mem ; o16 0F B5 /r [386] +\c LGS reg32,mem ; o32 0F B5 /r [386] + +\c LSS reg16,mem ; o16 0F B2 /r [386] +\c LSS reg32,mem ; o32 0F B2 /r [386] + +These instructions load an entire far pointer (16 or 32 bits of +offset, plus 16 bits of segment) out of memory in one go. \c{LDS}, +for example, loads 16 or 32 bits from the given memory address into +the given register (depending on the size of the register), then +loads the \e{next} 16 bits from memory into \c{DS}. \c{LES}, +\c{LFS}, \c{LGS} and \c{LSS} work in the same way but use the other +segment registers. + + +\S{insLEA} \i\c{LEA}: Load Effective Address + +\c LEA reg16,mem ; o16 8D /r [8086] +\c LEA reg32,mem ; o32 8D /r [386] + +\c{LEA}, despite its syntax, does not access memory. It calculates +the effective address specified by its second operand as if it were +going to load or store data from it, but instead it stores the +calculated address into the register specified by its first operand. +This can be used to perform quite complex calculations (e.g. \c{LEA +EAX,[EBX+ECX*4+100]}) in one instruction.
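+
+A few illustrative uses are sketched below. The symbol \c{table} is a
+hypothetical label assumed to be defined elsewhere, and the code
+assumes 32-bit addressing.
+
+\c         lea     eax,[ebx+ecx*4+100] ; EAX = EBX + ECX*4 + 100
+\c         lea     eax,[eax+eax*4]     ; EAX = EAX * 5
+\c         lea     esi,[table]         ; ESI = address of table
+
+None of these lines reads or writes memory, and none of them alters
+the flags.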
+ +\c{LEA}, despite being a purely arithmetic instruction which +accesses no memory, still requires square brackets around its second +operand, as if it were a memory reference. + +The size of the calculation is the current \e{address} size, and the +size that the result is stored as is the current \e{operand} size. +If the address and operand size are not the same, then if the +addressing mode was 32 bits, the low 16 bits are stored, and if the +address was 16 bits, it is zero-extended to 32 bits before storing. + + +\S{insLEAVE} \i\c{LEAVE}: Destroy Stack Frame + +\c LEAVE ; C9 [186] + +\c{LEAVE} destroys a stack frame of the form created by the +\c{ENTER} instruction (see \k{insENTER}). It is functionally +equivalent to \c{MOV ESP,EBP} followed by \c{POP EBP} (or \c{MOV +SP,BP} followed by \c{POP BP} in 16-bit mode). + + +\S{insLFENCE} \i\c{LFENCE}: Load Fence + +\c LFENCE ; 0F AE /5 [WILLAMETTE,SSE2] + +\c{LFENCE} performs a serialising operation on all loads from memory +that were issued before the \c{LFENCE} instruction. This guarantees that +all memory reads before the \c{LFENCE} instruction are visible before any +reads after the \c{LFENCE} instruction. + +\c{LFENCE} is ordered with respect to other \c{LFENCE} instructions, \c{MFENCE}, +any memory read and any other serialising instruction (such as \c{CPUID}). + +Weakly ordered memory types can be used to achieve higher processor +performance through such techniques as out-of-order issue and +speculative reads. The degree to which a consumer of data recognizes +or knows that the data is weakly ordered varies among applications +and may be unknown to the producer of this data. The \c{LFENCE} +instruction provides a performance-efficient way of ensuring load +ordering between routines that produce weakly-ordered results and +routines that consume that data.
+ +\c{LFENCE} uses the following ModRM encoding: + +\c Mod (7:6) = 11B +\c Reg/Opcode (5:3) = 101B +\c R/M (2:0) = 000B + +All other ModRM encodings are defined to be reserved, and use +of these encodings risks incompatibility with future processors. + +See also \c{SFENCE} (\k{insSFENCE}) and \c{MFENCE} (\k{insMFENCE}). + + +\S{insLGDT} \i\c{LGDT}, \i\c{LIDT}, \i\c{LLDT}: Load Descriptor Tables + +\c LGDT mem ; 0F 01 /2 [286,PRIV] +\c LIDT mem ; 0F 01 /3 [286,PRIV] +\c LLDT r/m16 ; 0F 00 /2 [286,PRIV] + +\c{LGDT} and \c{LIDT} both take a 6-byte memory area as an operand: +they load a 16-bit size limit and a 32-bit linear address from that +area (in the opposite order) into the \c{GDTR} (global descriptor table +register) or \c{IDTR} (interrupt descriptor table register). These are +the only instructions which directly use \e{linear} addresses, rather +than segment/offset pairs. + +\c{LLDT} takes a segment selector as an operand. The processor looks +up that selector in the GDT and stores the limit and base address +given there into the \c{LDTR} (local descriptor table register). + +See also \c{SGDT}, \c{SIDT} and \c{SLDT} (\k{insSGDT}). + + +\S{insLMSW} \i\c{LMSW}: Load Machine Status Word + +\c LMSW r/m16 ; 0F 01 /6 [286,PRIV] + +\c{LMSW} loads the bottom four bits of the source operand into the +bottom four bits of the \c{CR0} control register (or the Machine +Status Word, on 286 processors). See also \c{SMSW} (\k{insSMSW}). + + +\S{insLOADALL} \i\c{LOADALL}, \i\c{LOADALL286}: Load Processor State + +\c LOADALL ; 0F 07 [386,UNDOC] +\c LOADALL286 ; 0F 05 [286,UNDOC] + +This instruction, in its two different-opcode forms, is apparently +supported on most 286 processors, some 386 and possibly some 486. +The opcode differs between the 286 and the 386.
+ +The function of the instruction is to load all information relating +to the state of the processor out of a block of memory: on the 286, +this block is located implicitly at absolute address \c{0x800}, and +on the 386 and 486 it is at \c{[ES:EDI]}. + + +\S{insLODSB} \i\c{LODSB}, \i\c{LODSW}, \i\c{LODSD}: Load from String + +\c LODSB ; AC [8086] +\c LODSW ; o16 AD [8086] +\c LODSD ; o32 AD [386] + +\c{LODSB} loads a byte from \c{[DS:SI]} or \c{[DS:ESI]} into \c{AL}. +It then increments or decrements (depending on the direction flag: +increments if the flag is clear, decrements if it is set) \c{SI} or +\c{ESI}. + +The register used is \c{SI} if the address size is 16 bits, and +\c{ESI} if it is 32 bits. If you need to use an address size not +equal to the current \c{BITS} setting, you can use an explicit +\i\c{a16} or \i\c{a32} prefix. + +The segment register used to load from \c{[SI]} or \c{[ESI]} can be +overridden by using a segment register name as a prefix (for +example, \c{ES LODSB}). + +\c{LODSW} and \c{LODSD} work in the same way, but they load a +word or a doubleword instead of a byte, and increment or decrement +the addressing registers by 2 or 4 instead of 1. 
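+
+A typical use of \c{LODSB} is to walk along a buffer one byte at a
+time. The loop below is a minimal sketch that sums the bytes of a
+buffer in 32-bit code: \c{buffer} is a hypothetical label, \c{count}
+is a hypothetical assemble-time constant assumed to be nonzero, and
+\c{next} is an arbitrary label name.
+
+\c         mov     esi,buffer      ; ESI points at the first byte
+\c         mov     ecx,count       ; number of bytes to process
+\c         xor     edx,edx         ; running total
+\c         cld                     ; clear DF so that ESI increments
+\c next:   lodsb                   ; AL = [DS:ESI], then ESI = ESI + 1
+\c         movzx   eax,al          ; zero-extend the byte
+\c         add     edx,eax         ; add it to the total
+\c         dec     ecx
+\c         jnz     next            ; repeat until ECX reaches zero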
+ + +\S{insLOOP} \i\c{LOOP}, \i\c{LOOPE}, \i\c{LOOPZ}, \i\c{LOOPNE}, \i\c{LOOPNZ}: Loop with Counter + +\c LOOP imm ; E2 rb [8086] +\c LOOP imm,CX ; a16 E2 rb [8086] +\c LOOP imm,ECX ; a32 E2 rb [386] + +\c LOOPE imm ; E1 rb [8086] +\c LOOPE imm,CX ; a16 E1 rb [8086] +\c LOOPE imm,ECX ; a32 E1 rb [386] +\c LOOPZ imm ; E1 rb [8086] +\c LOOPZ imm,CX ; a16 E1 rb [8086] +\c LOOPZ imm,ECX ; a32 E1 rb [386] + +\c LOOPNE imm ; E0 rb [8086] +\c LOOPNE imm,CX ; a16 E0 rb [8086] +\c LOOPNE imm,ECX ; a32 E0 rb [386] +\c LOOPNZ imm ; E0 rb [8086] +\c LOOPNZ imm,CX ; a16 E0 rb [8086] +\c LOOPNZ imm,ECX ; a32 E0 rb [386] + +\c{LOOP} decrements its counter register (either \c{CX} or \c{ECX} - +if one is not specified explicitly, the \c{BITS} setting dictates +which is used) by one, and if the counter does not become zero as a +result of this operation, it jumps to the given label. The jump has +a range of 128 bytes. + +\c{LOOPE} (or its synonym \c{LOOPZ}) adds the additional condition +that it only jumps if the counter is nonzero \e{and} the zero flag +is set. Similarly, \c{LOOPNE} (and \c{LOOPNZ}) jumps only if the +counter is nonzero and the zero flag is clear. + + +\S{insLSL} \i\c{LSL}: Load Segment Limit + +\c LSL reg16,r/m16 ; o16 0F 03 /r [286,PRIV] +\c LSL reg32,r/m32 ; o32 0F 03 /r [286,PRIV] + +\c{LSL} is given a segment selector in its source (second) operand; +it computes the segment limit value by loading the segment limit +field from the associated segment descriptor in the \c{GDT} or \c{LDT}. +(This involves shifting left by 12 bits if the segment limit is +page-granular, and not if it is byte-granular; so you end up with a +byte limit in either case.) The segment limit obtained is then +loaded into the destination (first) operand. 
+ + +\S{insLTR} \i\c{LTR}: Load Task Register + +\c LTR r/m16 ; 0F 00 /3 [286,PRIV] + +\c{LTR} looks up the segment base and limit in the GDT or LDT +descriptor specified by the segment selector given as its operand, +and loads them into the Task Register. + + +\S{insMASKMOVDQU} \i\c{MASKMOVDQU}: Byte Mask Write + +\c MASKMOVDQU xmm1,xmm2 ; 66 0F F7 /r [WILLAMETTE,SSE2] + +\c{MASKMOVDQU} stores data from xmm1 to the location specified by +\c{ES:(E)DI}. The size of the store depends on the address-size +attribute. The most significant bit in each byte of the mask +register xmm2 is used to selectively write the data (0 = no write, +1 = write) on a per-byte basis. + + +\S{insMASKMOVQ} \i\c{MASKMOVQ}: Byte Mask Write + +\c MASKMOVQ mm1,mm2 ; 0F F7 /r [KATMAI,MMX] + +\c{MASKMOVQ} stores data from mm1 to the location specified by +\c{ES:(E)DI}. The size of the store depends on the address-size +attribute. The most significant bit in each byte of the mask +register mm2 is used to selectively write the data (0 = no write, +1 = write) on a per-byte basis. + + +\S{insMAXPD} \i\c{MAXPD}: Return Packed Double-Precision FP Maximum + +\c MAXPD xmm1,xmm2/m128 ; 66 0F 5F /r [WILLAMETTE,SSE2] + +\c{MAXPD} performs a SIMD compare of the packed double-precision +FP numbers from xmm1 and xmm2/mem, and stores the maximum values +of each pair of values in xmm1. If the values being compared are +both zeroes, source2 (xmm2/m128) would be returned. If source2 +(xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the +destination (i.e., a QNaN version of the SNaN is not returned). + + +\S{insMAXPS} \i\c{MAXPS}: Return Packed Single-Precision FP Maximum + +\c MAXPS xmm1,xmm2/m128 ; 0F 5F /r [KATMAI,SSE] + +\c{MAXPS} performs a SIMD compare of the packed single-precision +FP numbers from xmm1 and xmm2/mem, and stores the maximum values +of each pair of values in xmm1. If the values being compared are +both zeroes, source2 (xmm2/m128) would be returned. 
If source2 +(xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the +destination (i.e., a QNaN version of the SNaN is not returned). + + +\S{insMAXSD} \i\c{MAXSD}: Return Scalar Double-Precision FP Maximum + +\c MAXSD xmm1,xmm2/m64 ; F2 0F 5F /r [WILLAMETTE,SSE2] + +\c{MAXSD} compares the low-order double-precision FP numbers from +xmm1 and xmm2/mem, and stores the maximum value in xmm1. If the +values being compared are both zeroes, source2 (xmm2/m64) would +be returned. If source2 (xmm2/m64) is an SNaN, this SNaN is +forwarded unchanged to the destination (i.e., a QNaN version of +the SNaN is not returned). The high quadword of the destination +is left unchanged. + + +\S{insMAXSS} \i\c{MAXSS}: Return Scalar Single-Precision FP Maximum + +\c MAXSS xmm1,xmm2/m32 ; F3 0F 5F /r [KATMAI,SSE] + +\c{MAXSS} compares the low-order single-precision FP numbers from +xmm1 and xmm2/mem, and stores the maximum value in xmm1. If the +values being compared are both zeroes, source2 (xmm2/m32) would +be returned. If source2 (xmm2/m32) is an SNaN, this SNaN is +forwarded unchanged to the destination (i.e., a QNaN version of +the SNaN is not returned). The high three doublewords of the +destination are left unchanged. + + +\S{insMFENCE} \i\c{MFENCE}: Memory Fence + +\c MFENCE ; 0F AE /6 [WILLAMETTE,SSE2] + +\c{MFENCE} performs a serialising operation on all loads from memory +and writes to memory that were issued before the \c{MFENCE} instruction. +This guarantees that all memory reads and writes before the \c{MFENCE} +instruction are completed before any reads and writes after the +\c{MFENCE} instruction. + +\c{MFENCE} is ordered respective to other \c{MFENCE} instructions, +\c{LFENCE}, \c{SFENCE}, any memory read and any other serialising +instruction (such as \c{CPUID}). + +Weakly ordered memory types can be used to achieve higher processor +performance through such techniques as out-of-order issue, speculative +reads, write-combining, and write-collapsing. 
The degree to which a +consumer of data recognizes or knows that the data is weakly ordered +varies among applications and may be unknown to the producer of this +data. The \c{MFENCE} instruction provides a performance-efficient way +of ensuring load and store ordering between routines that produce +weakly-ordered results and routines that consume that data. + +\c{MFENCE} uses the following ModRM encoding: + +\c Mod (7:6) = 11B +\c Reg/Opcode (5:3) = 110B +\c R/M (2:0) = 000B + +All other ModRM encodings are defined to be reserved, and use +of these encodings risks incompatibility with future processors. + +See also \c{LFENCE} (\k{insLFENCE}) and \c{SFENCE} (\k{insSFENCE}). + + +\S{insMINPD} \i\c{MINPD}: Return Packed Double-Precision FP Minimum + +\c MINPD xmm1,xmm2/m128 ; 66 0F 5D /r [WILLAMETTE,SSE2] + +\c{MINPD} performs a SIMD compare of the packed double-precision +FP numbers from xmm1 and xmm2/mem, and stores the minimum values +of each pair of values in xmm1. If the values being compared are +both zeroes, source2 (xmm2/m128) would be returned. If source2 +(xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the +destination (i.e., a QNaN version of the SNaN is not returned). + + +\S{insMINPS} \i\c{MINPS}: Return Packed Single-Precision FP Minimum + +\c MINPS xmm1,xmm2/m128 ; 0F 5D /r [KATMAI,SSE] + +\c{MINPS} performs a SIMD compare of the packed single-precision +FP numbers from xmm1 and xmm2/mem, and stores the minimum values +of each pair of values in xmm1. If the values being compared are +both zeroes, source2 (xmm2/m128) would be returned. If source2 +(xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the +destination (i.e., a QNaN version of the SNaN is not returned). + + +\S{insMINSD} \i\c{MINSD}: Return Scalar Double-Precision FP Minimum + +\c MINSD xmm1,xmm2/m64 ; F2 0F 5D /r [WILLAMETTE,SSE2] + +\c{MINSD} compares the low-order double-precision FP numbers from +xmm1 and xmm2/mem, and stores the minimum value in xmm1. 
If the +values being compared are both zeroes, source2 (xmm2/m64) would +be returned. If source2 (xmm2/m64) is an SNaN, this SNaN is +forwarded unchanged to the destination (i.e., a QNaN version of +the SNaN is not returned). The high quadword of the destination +is left unchanged. + + +\S{insMINSS} \i\c{MINSS}: Return Scalar Single-Precision FP Minimum + +\c MINSS xmm1,xmm2/m32 ; F3 0F 5D /r [KATMAI,SSE] + +\c{MINSS} compares the low-order single-precision FP numbers from +xmm1 and xmm2/mem, and stores the minimum value in xmm1. If the +values being compared are both zeroes, source2 (xmm2/m32) would +be returned. If source2 (xmm2/m32) is an SNaN, this SNaN is +forwarded unchanged to the destination (i.e., a QNaN version of +the SNaN is not returned). The high three doublewords of the +destination are left unchanged. + + +\S{insMOV} \i\c{MOV}: Move Data + +\c MOV r/m8,reg8 ; 88 /r [8086] +\c MOV r/m16,reg16 ; o16 89 /r [8086] +\c MOV r/m32,reg32 ; o32 89 /r [386] +\c MOV reg8,r/m8 ; 8A /r [8086] +\c MOV reg16,r/m16 ; o16 8B /r [8086] +\c MOV reg32,r/m32 ; o32 8B /r [386] + +\c MOV reg8,imm8 ; B0+r ib [8086] +\c MOV reg16,imm16 ; o16 B8+r iw [8086] +\c MOV reg32,imm32 ; o32 B8+r id [386] +\c MOV r/m8,imm8 ; C6 /0 ib [8086] +\c MOV r/m16,imm16 ; o16 C7 /0 iw [8086] +\c MOV r/m32,imm32 ; o32 C7 /0 id [386] + +\c MOV AL,memoffs8 ; A0 ow/od [8086] +\c MOV AX,memoffs16 ; o16 A1 ow/od [8086] +\c MOV EAX,memoffs32 ; o32 A1 ow/od [386] +\c MOV memoffs8,AL ; A2 ow/od [8086] +\c MOV memoffs16,AX ; o16 A3 ow/od [8086] +\c MOV memoffs32,EAX ; o32 A3 ow/od [386] + +\c MOV r/m16,segreg ; o16 8C /r [8086] +\c MOV r/m32,segreg ; o32 8C /r [386] +\c MOV segreg,r/m16 ; o16 8E /r [8086] +\c MOV segreg,r/m32 ; o32 8E /r [386] + +\c MOV reg32,CR0/2/3/4 ; 0F 20 /r [386] +\c MOV reg32,DR0/1/2/3/6/7 ; 0F 21 /r [386] +\c MOV reg32,TR3/4/5/6/7 ; 0F 24 /r [386] +\c MOV CR0/2/3/4,reg32 ; 0F 22 /r [386] +\c MOV DR0/1/2/3/6/7,reg32 ; 0F 23 /r [386] +\c MOV TR3/4/5/6/7,reg32 ; 0F 26 /r [386] + 
+\c{MOV} copies the contents of its source (second) operand into its +destination (first) operand. + +In all forms of the \c{MOV} instruction, the two operands are the +same size, except for moving between a segment register and an +\c{r/m32} operand. These instructions are treated exactly like the +corresponding 16-bit equivalent (so that, for example, \c{MOV +DS,EAX} functions identically to \c{MOV DS,AX} but saves a prefix +when in 32-bit mode), except that when a segment register is moved +into a 32-bit destination, the top two bytes of the result are +undefined. + +\c{MOV} may not use \c{CS} as a destination. + +\c{CR4} is only a supported register on the Pentium and above. + +Test registers are supported on 386/486 processors and on some +non-Intel Pentium class processors. + + +\S{insMOVAPD} \i\c{MOVAPD}: Move Aligned Packed Double-Precision FP Values + +\c MOVAPD xmm1,xmm2/mem128 ; 66 0F 28 /r [WILLAMETTE,SSE2] +\c MOVAPD xmm1/mem128,xmm2 ; 66 0F 29 /r [WILLAMETTE,SSE2] + +\c{MOVAPD} moves a double quadword containing 2 packed double-precision +FP values from the source operand to the destination. When the source +or destination operand is a memory location, it must be aligned on a +16-byte boundary. + +To move data in and out of memory locations that are not known to be on +16-byte boundaries, use the \c{MOVUPD} instruction (\k{insMOVUPD}). + + +\S{insMOVAPS} \i\c{MOVAPS}: Move Aligned Packed Single-Precision FP Values + +\c MOVAPS xmm1,xmm2/mem128 ; 0F 28 /r [KATMAI,SSE] +\c MOVAPS xmm1/mem128,xmm2 ; 0F 29 /r [KATMAI,SSE] + +\c{MOVAPS} moves a double quadword containing 4 packed single-precision +FP values from the source operand to the destination. When the source +or destination operand is a memory location, it must be aligned on a +16-byte boundary. + +To move data in and out of memory locations that are not known to be on +16-byte boundaries, use the \c{MOVUPS} instruction (\k{insMOVUPS}). 
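The 16-byte alignment requirement of \c{MOVAPD} and \c{MOVAPS} amounts to the low four bits of the address being zero. A minimal C sketch of that test follows; the helper name is hypothetical and is not part of NASM.

```c
#include <stdint.h>

/* Returns nonzero if p lies on a 16-byte boundary, which is what the
 * aligned moves require of a memory operand. Hypothetical helper. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

A caller that cannot guarantee this property for an address would choose the unaligned forms (\c{MOVUPD}, \c{MOVUPS}) instead.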
+ + +\S{insMOVD} \i\c{MOVD}: Move Doubleword to/from MMX Register + +\c MOVD mm,r/m32 ; 0F 6E /r [PENT,MMX] +\c MOVD r/m32,mm ; 0F 7E /r [PENT,MMX] +\c MOVD xmm,r/m32 ; 66 0F 6E /r [WILLAMETTE,SSE2] +\c MOVD r/m32,xmm ; 66 0F 7E /r [WILLAMETTE,SSE2] + +\c{MOVD} copies 32 bits from its source (second) operand into its +destination (first) operand. When the destination is a 64-bit \c{MMX} +register or a 128-bit \c{XMM} register, the input value is zero-extended +to fill the destination register. + + +\S{insMOVDQ2Q} \i\c{MOVDQ2Q}: Move Quadword from XMM to MMX register. + +\c MOVDQ2Q mm,xmm ; F2 0F D6 /r [WILLAMETTE,SSE2] + +\c{MOVDQ2Q} moves the low quadword from the source operand to the +destination operand. + + +\S{insMOVDQA} \i\c{MOVDQA}: Move Aligned Double Quadword + +\c MOVDQA xmm1,xmm2/m128 ; 66 0F 6F /r [WILLAMETTE,SSE2] +\c MOVDQA xmm1/m128,xmm2 ; 66 0F 7F /r [WILLAMETTE,SSE2] + +\c{MOVDQA} moves a double quadword from the source operand to the +destination operand. When the source or destination operand is a +memory location, it must be aligned to a 16-byte boundary. + +To move a double quadword to or from unaligned memory locations, +use the \c{MOVDQU} instruction (\k{insMOVDQU}). + + +\S{insMOVDQU} \i\c{MOVDQU}: Move Unaligned Double Quadword + +\c MOVDQU xmm1,xmm2/m128 ; F3 0F 6F /r [WILLAMETTE,SSE2] +\c MOVDQU xmm1/m128,xmm2 ; F3 0F 7F /r [WILLAMETTE,SSE2] + +\c{MOVDQU} moves a double quadword from the source operand to the +destination operand. When the source or destination operand is a +memory location, the memory may be unaligned. + +To move a double quadword to or from known aligned memory locations, +use the \c{MOVDQA} instruction (\k{insMOVDQA}). + + +\S{insMOVHLPS} \i\c{MOVHLPS}: Move Packed Single-Precision FP High to Low + +\c MOVHLPS xmm1,xmm2 ; 0F 12 /r [KATMAI,SSE] + +\c{MOVHLPS} moves the two packed single-precision FP values from the +high quadword of the source register xmm2 to the low quadword of the +destination register, xmm1.
The upper quadword of xmm1 is left unchanged. + +The operation of this instruction is: + +\c dst[0-63] := src[64-127], +\c dst[64-127] remains unchanged. + + +\S{insMOVHPD} \i\c{MOVHPD}: Move High Packed Double-Precision FP + +\c MOVHPD xmm,m64 ; 66 0F 16 /r [WILLAMETTE,SSE2] +\c MOVHPD m64,xmm ; 66 0F 17 /r [WILLAMETTE,SSE2] + +\c{MOVHPD} moves a double-precision FP value between the source and +destination operands. One of the operands is a 64-bit memory location, +the other is the high quadword of an \c{XMM} register. + +The operation of this instruction is: + +\c mem[0-63] := xmm[64-127]; + +or + +\c xmm[0-63] remains unchanged; +\c xmm[64-127] := mem[0-63]. + + +\S{insMOVHPS} \i\c{MOVHPS}: Move High Packed Single-Precision FP + +\c MOVHPS xmm,m64 ; 0F 16 /r [KATMAI,SSE] +\c MOVHPS m64,xmm ; 0F 17 /r [KATMAI,SSE] + +\c{MOVHPS} moves two packed single-precision FP values between the source +and destination operands. One of the operands is a 64-bit memory location, +the other is the high quadword of an \c{XMM} register. + +The operation of this instruction is: + +\c mem[0-63] := xmm[64-127]; + +or + +\c xmm[0-63] remains unchanged; +\c xmm[64-127] := mem[0-63]. + + +\S{insMOVLHPS} \i\c{MOVLHPS}: Move Packed Single-Precision FP Low to High + +\c MOVLHPS xmm1,xmm2 ; 0F 16 /r [KATMAI,SSE] + +\c{MOVLHPS} moves the two packed single-precision FP values from the +low quadword of the source register xmm2 to the high quadword of the +destination register, xmm1. The low quadword of xmm1 is left unchanged. + +The operation of this instruction is: + +\c dst[0-63] remains unchanged; +\c dst[64-127] := src[0-63]. + +\S{insMOVLPD} \i\c{MOVLPD}: Move Low Packed Double-Precision FP + +\c MOVLPD xmm,m64 ; 66 0F 12 /r [WILLAMETTE,SSE2] +\c MOVLPD m64,xmm ; 66 0F 13 /r [WILLAMETTE,SSE2] + +\c{MOVLPD} moves a double-precision FP value between the source and +destination operands. One of the operands is a 64-bit memory location, +the other is the low quadword of an \c{XMM} register.
+ +The operation of this instruction is: + +\c mem(0-63) := xmm(0-63); + +or + +\c xmm(0-63) := mem(0-63); +\c xmm(64-127) remains unchanged. + +\S{insMOVLPS} \i\c{MOVLPS}: Move Low Packed Single-Precision FP + +\c MOVLPS xmm,m64 ; 0F 12 /r [KATMAI,SSE] +\c MOVLPS m64,xmm ; 0F 13 /r [KATMAI,SSE] + +\c{MOVLPS} moves two packed single-precision FP values between the source +and destination operands. One of the operands is a 64-bit memory location, +the other is the low quadword of an \c{XMM} register. + +The operation of this instruction is: + +\c mem(0-63) := xmm(0-63); + +or + +\c xmm(0-63) := mem(0-63); +\c xmm(64-127) remains unchanged. + + +\S{insMOVMSKPD} \i\c{MOVMSKPD}: Extract Packed Double-Precision FP Sign Mask + +\c MOVMSKPD reg32,xmm ; 66 0F 50 /r [WILLAMETTE,SSE2] + +\c{MOVMSKPD} inserts a 2-bit mask in r32, formed of the most significant +bits of each double-precision FP number of the source operand. + + +\S{insMOVMSKPS} \i\c{MOVMSKPS}: Extract Packed Single-Precision FP Sign Mask + +\c MOVMSKPS reg32,xmm ; 0F 50 /r [KATMAI,SSE] + +\c{MOVMSKPS} inserts a 4-bit mask in r32, formed of the most significant +bits of each single-precision FP number of the source operand. + + +\S{insMOVNTDQ} \i\c{MOVNTDQ}: Move Double Quadword Non Temporal + +\c MOVNTDQ m128,xmm ; 66 0F E7 /r [WILLAMETTE,SSE2] + +\c{MOVNTDQ} moves the double quadword from the \c{XMM} source +register to the destination memory location, using a non-temporal +hint. This store instruction minimizes cache pollution. + + +\S{insMOVNTI} \i\c{MOVNTI}: Move Doubleword Non Temporal + +\c MOVNTI m32,reg32 ; 0F C3 /r [WILLAMETTE,SSE2] + +\c{MOVNTI} moves the doubleword in the source register +to the destination memory location, using a non-temporal +hint. This store instruction minimizes cache pollution.
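The sign-mask extraction described in the \c{MOVMSKPS} entry above can be modelled in C as follows. This is a sketch, not processor code: the function name is hypothetical, and it assumes that \c{float} is a 32-bit IEEE single and that the lowest element of the register maps to bit 0 of the mask.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of MOVMSKPS: gather the most significant (sign)
 * bit of each of four single-precision values into a 4-bit mask.
 * v[0] stands for the lowest element of the XMM register. */
unsigned movmskps_model(const float v[4])
{
    unsigned mask = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        memcpy(&bits, &v[i], sizeof bits);  /* reinterpret the float's bits */
        mask |= (bits >> 31) << i;          /* sign bit goes to mask bit i */
    }
    return mask;
}
```

Note that negative zero contributes a set bit, since only the sign bit is examined and no comparison takes place.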
+ + +\S{insMOVNTPD} \i\c{MOVNTPD}: Move Aligned Two Packed Double-Precision +FP Values Non Temporal + +\c MOVNTPD m128,xmm ; 66 0F 2B /r [WILLAMETTE,SSE2] + +\c{MOVNTPD} moves the double quadword from the \c{XMM} source +register to the destination memory location, using a non-temporal +hint. This store instruction minimizes cache pollution. The memory +location must be aligned to a 16-byte boundary. + + +\S{insMOVNTPS} \i\c{MOVNTPS}: Move Aligned Four Packed Single-Precision +FP Values Non Temporal + +\c MOVNTPS m128,xmm ; 0F 2B /r [KATMAI,SSE] + +\c{MOVNTPS} moves the double quadword from the \c{XMM} source +register to the destination memory location, using a non-temporal +hint. This store instruction minimizes cache pollution. The memory +location must be aligned to a 16-byte boundary. + + +\S{insMOVNTQ} \i\c{MOVNTQ}: Move Quadword Non Temporal + +\c MOVNTQ m64,mm ; 0F E7 /r [KATMAI,MMX] + +\c{MOVNTQ} moves the quadword in the \c{MMX} source register +to the destination memory location, using a non-temporal +hint. This store instruction minimizes cache pollution. + + +\S{insMOVQ} \i\c{MOVQ}: Move Quadword to/from MMX Register + +\c MOVQ mm1,mm2/m64 ; 0F 6F /r [PENT,MMX] +\c MOVQ mm1/m64,mm2 ; 0F 7F /r [PENT,MMX] + +\c MOVQ xmm1,xmm2/m64 ; F3 0F 7E /r [WILLAMETTE,SSE2] +\c MOVQ xmm1/m64,xmm2 ; 66 0F D6 /r [WILLAMETTE,SSE2] + +\c{MOVQ} copies 64 bits from its source (second) operand into its +destination (first) operand. When the source is an \c{XMM} register, +the low quadword is moved. When the destination is an \c{XMM} register, +the destination is the low quadword, and the high quadword is cleared. + + +\S{insMOVQ2DQ} \i\c{MOVQ2DQ}: Move Quadword from MMX to XMM register. + +\c MOVQ2DQ xmm,mm ; F3 0F D6 /r [WILLAMETTE,SSE2] + +\c{MOVQ2DQ} moves the quadword from the source operand to the low +quadword of the destination operand, and clears the high quadword.
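The way \c{MOVQ} and \c{MOVQ2DQ} treat the two halves of an \c{XMM} register can be illustrated with a small C model. The \c{xmm_model} type and both function names are hypothetical, introduced only for this sketch.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit XMM register as two quadwords. */
typedef struct {
    uint64_t lo;  /* bits 0-63 */
    uint64_t hi;  /* bits 64-127 */
} xmm_model;

/* Quadword into an XMM destination: the low quadword is written and
 * the high quadword is cleared. */
xmm_model movq_to_xmm(uint64_t src)
{
    xmm_model r = { src, 0 };
    return r;
}

/* Quadword out of an XMM source: only the low quadword is read. */
uint64_t movq_from_xmm(xmm_model src)
{
    return src.lo;
}
```

Whatever the destination register held in its high quadword beforehand is lost, which distinguishes these moves from \c{MOVLPD} and \c{MOVLPS}, where the high quadword remains unchanged.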
+ + +\S{insMOVSB} \i\c{MOVSB}, \i\c{MOVSW}, \i\c{MOVSD}: Move String + +\c MOVSB ; A4 [8086] +\c MOVSW ; o16 A5 [8086] +\c MOVSD ; o32 A5 [386] + +\c{MOVSB} copies the byte at \c{[DS:SI]} or \c{[DS:ESI]} to +\c{[ES:DI]} or \c{[ES:EDI]}. It then increments or decrements +(depending on the direction flag: increments if the flag is clear, +decrements if it is set) \c{SI} and \c{DI} (or \c{ESI} and \c{EDI}). + +The registers used are \c{SI} and \c{DI} if the address size is 16 +bits, and \c{ESI} and \c{EDI} if it is 32 bits. If you need to use +an address size not equal to the current \c{BITS} setting, you can +use an explicit \i\c{a16} or \i\c{a32} prefix. + +The segment register used to load from \c{[SI]} or \c{[ESI]} can be +overridden by using a segment register name as a prefix (for +example, \c{es movsb}). The use of \c{ES} for the store to \c{[DI]} +or \c{[EDI]} cannot be overridden. + +\c{MOVSW} and \c{MOVSD} work in the same way, but they copy a word +or a doubleword instead of a byte, and increment or decrement the +addressing registers by 2 or 4 instead of 1. + +The \c{REP} prefix may be used to repeat the instruction \c{CX} (or +\c{ECX} - again, the address size chooses which) times. + + +\S{insMOVSD} \i\c{MOVSD}: Move Scalar Double-Precision FP Value + +\c MOVSD xmm1,xmm2/m64 ; F2 0F 10 /r [WILLAMETTE,SSE2] +\c MOVSD xmm1/m64,xmm2 ; F2 0F 11 /r [WILLAMETTE,SSE2] + +\c{MOVSD} moves a double-precision FP value from the source operand +to the destination operand. When the source or destination is a +register, the low-order FP value is read or written. + + +\S{insMOVSS} \i\c{MOVSS}: Move Scalar Single-Precision FP Value + +\c MOVSS xmm1,xmm2/m32 ; F3 0F 10 /r [KATMAI,SSE] +\c MOVSS xmm1/m32,xmm2 ; F3 0F 11 /r [KATMAI,SSE] + +\c{MOVSS} moves a single-precision FP value from the source operand +to the destination operand. When the source or destination is a +register, the low-order FP value is read or written. 
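The behaviour of \c{REP MOVSB} described in the string-move entry above can be sketched as a loop in C. This is a simplified model under stated assumptions: segment registers are ignored, memory is one flat byte array, and the function and parameter names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of REP MOVSB over one flat byte array.
 * si, di: source and destination offsets (stand-ins for SI/ESI, DI/EDI).
 * count:  repeat count (stand-in for CX/ECX).
 * df:     direction flag; 0 increments the offsets, nonzero decrements. */
void rep_movsb_model(uint8_t *mem, size_t *si, size_t *di,
                     size_t *count, int df)
{
    while (*count != 0) {
        mem[*di] = mem[*si];      /* copy one byte */
        if (df) {
            --*si;
            --*di;
        } else {
            ++*si;
            ++*di;
        }
        --*count;                 /* REP counts down to zero */
    }
}
```

In this model a count of zero copies nothing, and after the loop both offsets point just past (or just before, if the direction flag is set) the last byte copied.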
+ + +\S{insMOVSX} \i\c{MOVSX}, \i\c{MOVZX}: Move Data with Sign or Zero Extend + +\c MOVSX reg16,r/m8 ; o16 0F BE /r [386] +\c MOVSX reg32,r/m8 ; o32 0F BE /r [386] +\c MOVSX reg32,r/m16 ; o32 0F BF /r [386] + +\c MOVZX reg16,r/m8 ; o16 0F B6 /r [386] +\c MOVZX reg32,r/m8 ; o32 0F B6 /r [386] +\c MOVZX reg32,r/m16 ; o32 0F B7 /r [386] + +\c{MOVSX} sign-extends its source (second) operand to the length of +its destination (first) operand, and copies the result into the +destination operand. \c{MOVZX} does the same, but zero-extends +rather than sign-extending. + + +\S{insMOVUPD} \i\c{MOVUPD}: Move Unaligned Packed Double-Precision FP Values + +\c MOVUPD xmm1,xmm2/mem128 ; 66 0F 10 /r [WILLAMETTE,SSE2] +\c MOVUPD xmm1/mem128,xmm2 ; 66 0F 11 /r [WILLAMETTE,SSE2] + +\c{MOVUPD} moves a double quadword containing 2 packed double-precision +FP values from the source operand to the destination. This instruction +makes no assumptions about alignment of memory operands. + +To move data in and out of memory locations that are known to be on 16-byte +boundaries, use the \c{MOVAPD} instruction (\k{insMOVAPD}). + + +\S{insMOVUPS} \i\c{MOVUPS}: Move Unaligned Packed Single-Precision FP Values + +\c MOVUPS xmm1,xmm2/mem128 ; 0F 10 /r [KATMAI,SSE] +\c MOVUPS xmm1/mem128,xmm2 ; 0F 11 /r [KATMAI,SSE] + +\c{MOVUPS} moves a double quadword containing 4 packed single-precision +FP values from the source operand to the destination. This instruction +makes no assumptions about alignment of memory operands. + +To move data in and out of memory locations that are known to be on 16-byte +boundaries, use the \c{MOVAPS} instruction (\k{insMOVAPS}). + + +\S{insMUL} \i\c{MUL}: Unsigned Integer Multiply + +\c MUL r/m8 ; F6 /4 [8086] +\c MUL r/m16 ; o16 F7 /4 [8086] +\c MUL r/m32 ; o32 F7 /4 [386] + +\c{MUL} performs unsigned integer multiplication. 
The other operand +to the multiplication, and the destination operand, are implicit, in +the following way: + +\b For \c{MUL r/m8}, \c{AL} is multiplied by the given operand; the +product is stored in \c{AX}. + +\b For \c{MUL r/m16}, \c{AX} is multiplied by the given operand; +the product is stored in \c{DX:AX}. + +\b For \c{MUL r/m32}, \c{EAX} is multiplied by the given operand; +the product is stored in \c{EDX:EAX}. + +Signed integer multiplication is performed by the \c{IMUL} +instruction: see \k{insIMUL}. + + +\S{insMULPD} \i\c{MULPD}: Packed Double-FP Multiply + +\c MULPD xmm1,xmm2/mem128 ; 66 0F 59 /r [WILLAMETTE,SSE2] + +\c{MULPD} performs a SIMD multiply of the packed double-precision FP +values in both operands, and stores the results in the destination register. + + +\S{insMULPS} \i\c{MULPS}: Packed Single-FP Multiply + +\c MULPS xmm1,xmm2/mem128 ; 0F 59 /r [KATMAI,SSE] + +\c{MULPS} performs a SIMD multiply of the packed single-precision FP +values in both operands, and stores the results in the destination register. + + +\S{insMULSD} \i\c{MULSD}: Scalar Double-FP Multiply + +\c MULSD xmm1,xmm2/mem64 ; F2 0F 59 /r [WILLAMETTE,SSE2] + +\c{MULSD} multiplies the lowest double-precision FP values of both +operands, and stores the result in the low quadword of xmm1. + + +\S{insMULSS} \i\c{MULSS}: Scalar Single-FP Multiply + +\c MULSS xmm1,xmm2/mem32 ; F3 0F 59 /r [KATMAI,SSE] + +\c{MULSS} multiplies the lowest single-precision FP values of both +operands, and stores the result in the low doubleword of xmm1. + + +\S{insNEG} \i\c{NEG}, \i\c{NOT}: Two's and One's Complement + +\c NEG r/m8 ; F6 /3 [8086] +\c NEG r/m16 ; o16 F7 /3 [8086] +\c NEG r/m32 ; o32 F7 /3 [386] + +\c NOT r/m8 ; F6 /2 [8086] +\c NOT r/m16 ; o16 F7 /2 [8086] +\c NOT r/m32 ; o32 F7 /2 [386] + +\c{NEG} replaces the contents of its operand by the two's complement +negation (invert all the bits and then add one) of the original +value.
\c{NOT}, similarly, performs one's complement (inverts all +the bits). + + +\S{insNOP} \i\c{NOP}: No Operation + +\c NOP ; 90 [8086] + +\c{NOP} performs no operation. Its opcode is the same as that +generated by \c{XCHG AX,AX} or \c{XCHG EAX,EAX} (depending on the +processor mode; see \k{insXCHG}). + + +\S{insOR} \i\c{OR}: Bitwise OR + +\c OR r/m8,reg8 ; 08 /r [8086] +\c OR r/m16,reg16 ; o16 09 /r [8086] +\c OR r/m32,reg32 ; o32 09 /r [386] + +\c OR reg8,r/m8 ; 0A /r [8086] +\c OR reg16,r/m16 ; o16 0B /r [8086] +\c OR reg32,r/m32 ; o32 0B /r [386] + +\c OR r/m8,imm8 ; 80 /1 ib [8086] +\c OR r/m16,imm16 ; o16 81 /1 iw [8086] +\c OR r/m32,imm32 ; o32 81 /1 id [386] + +\c OR r/m16,imm8 ; o16 83 /1 ib [8086] +\c OR r/m32,imm8 ; o32 83 /1 ib [386] + +\c OR AL,imm8 ; 0C ib [8086] +\c OR AX,imm16 ; o16 0D iw [8086] +\c OR EAX,imm32 ; o32 0D id [386] + +\c{OR} performs a bitwise OR operation between its two operands +(i.e. each bit of the result is 1 if and only if at least one of the +corresponding bits of the two inputs was 1), and stores the result +in the destination (first) operand. + +In the forms with an 8-bit immediate second operand and a longer +first operand, the second operand is considered to be signed, and is +sign-extended to the length of the first operand. In these cases, +the \c{BYTE} qualifier is necessary to force NASM to generate this +form of the instruction. + +The MMX instruction \c{POR} (see \k{insPOR}) performs the same +operation on the 64-bit MMX registers. + + +\S{insORPD} \i\c{ORPD}: Bit-wise Logical OR of Double-Precision FP Data + +\c ORPD xmm1,xmm2/m128 ; 66 0F 56 /r [WILLAMETTE,SSE2] + +\c{ORPD} performs a bit-wise logical OR between xmm1 and xmm2/mem, +and stores the result in xmm1. If the source operand is a memory +location, it must be aligned to a 16-byte boundary.
+ + +\S{insORPS} \i\c{ORPS}: Bit-wise Logical OR of Single-Precision FP Data + +\c ORPS xmm1,xmm2/m128 ; 0F 56 /r [KATMAI,SSE] + +\c{ORPS} performs a bit-wise logical OR between xmm1 and xmm2/mem, +and stores the result in xmm1. If the source operand is a memory +location, it must be aligned to a 16-byte boundary. + + +\S{insOUT} \i\c{OUT}: Output Data to I/O Port + +\c OUT imm8,AL ; E6 ib [8086] +\c OUT imm8,AX ; o16 E7 ib [8086] +\c OUT imm8,EAX ; o32 E7 ib [386] +\c OUT DX,AL ; EE [8086] +\c OUT DX,AX ; o16 EF [8086] +\c OUT DX,EAX ; o32 EF [386] + +\c{OUT} writes the contents of the given source register to the +specified I/O port. The port number may be specified as an immediate +value if it is between 0 and 255, and otherwise must be stored in +\c{DX}. See also \c{IN} (\k{insIN}). + + +\S{insOUTSB} \i\c{OUTSB}, \i\c{OUTSW}, \i\c{OUTSD}: Output String to I/O Port + +\c OUTSB ; 6E [186] +\c OUTSW ; o16 6F [186] +\c OUTSD ; o32 6F [386] + +\c{OUTSB} loads a byte from \c{[DS:SI]} or \c{[DS:ESI]} and writes +it to the I/O port specified in \c{DX}. It then increments or +decrements (depending on the direction flag: increments if the flag +is clear, decrements if it is set) \c{SI} or \c{ESI}. + +The register used is \c{SI} if the address size is 16 bits, and +\c{ESI} if it is 32 bits. If you need to use an address size not +equal to the current \c{BITS} setting, you can use an explicit +\i\c{a16} or \i\c{a32} prefix. + +The segment register used to load from \c{[SI]} or \c{[ESI]} can be +overridden by using a segment register name as a prefix (for +example, \c{es outsb}). + +\c{OUTSW} and \c{OUTSD} work in the same way, but they output a +word or a doubleword instead of a byte, and increment or decrement +the addressing registers by 2 or 4 instead of 1. + +The \c{REP} prefix may be used to repeat the instruction \c{CX} (or +\c{ECX} - again, the address size chooses which) times.
+ + +\S{insPACKSSDW} \i\c{PACKSSDW}, \i\c{PACKSSWB}, \i\c{PACKUSWB}: Pack Data + +\c PACKSSDW mm1,mm2/m64 ; 0F 6B /r [PENT,MMX] +\c PACKSSWB mm1,mm2/m64 ; 0F 63 /r [PENT,MMX] +\c PACKUSWB mm1,mm2/m64 ; 0F 67 /r [PENT,MMX] + +\c PACKSSDW xmm1,xmm2/m128 ; 66 0F 6B /r [WILLAMETTE,SSE2] +\c PACKSSWB xmm1,xmm2/m128 ; 66 0F 63 /r [WILLAMETTE,SSE2] +\c PACKUSWB xmm1,xmm2/m128 ; 66 0F 67 /r [WILLAMETTE,SSE2] + +All these instructions start by combining the source and destination +operands, then split the result into smaller sections, which they +pack into the destination register. The \c{MMX} versions pack +two 64-bit operands into one 64-bit register, while the \c{SSE} +versions pack two 128-bit operands into one 128-bit register. + +\b \c{PACKSSWB} splits the combined value into words, and then reduces +the words to bytes, using signed saturation. It then packs the bytes +into the destination register in the same order the words were in. + +\b \c{PACKSSDW} performs the same operation as \c{PACKSSWB}, except that +it reduces doublewords to words, then packs them into the destination +register. + +\b \c{PACKUSWB} performs the same operation as \c{PACKSSWB}, except that +it uses unsigned saturation when reducing the size of the elements. + +To perform signed saturation on a number, if it is too large it is +replaced by the largest signed number (\c{7FFFh} or \c{7Fh}) that +\e{will} fit, and if it is too small it is replaced by the smallest +signed number (\c{8000h} or \c{80h}) that will fit. To perform unsigned +saturation as \c{PACKUSWB} does, the input is still treated as signed: +a negative value is replaced by zero, and a value that is too large is +replaced by the largest unsigned number (\c{FFh}) that will fit.
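Signed saturation, as applied per element by \c{PACKSSWB} and \c{PACKSSDW}, can be written out in C as a pair of clamping functions. This is a sketch of the per-element arithmetic only, not of the full packing operation, and the function names are hypothetical.

```c
#include <stdint.h>

/* Signed saturation of a word to a byte (the PACKSSWB element step). */
int8_t sat_word_to_byte(int16_t v)
{
    if (v > 0x7F)
        return 0x7F;        /* too large: largest signed byte, 7Fh */
    if (v < -0x80)
        return -0x80;       /* too small: smallest signed byte, 80h */
    return (int8_t)v;       /* already fits */
}

/* Signed saturation of a doubleword to a word (the PACKSSDW element step). */
int16_t sat_dword_to_word(int32_t v)
{
    if (v > 0x7FFF)
        return 0x7FFF;      /* largest signed word, 7FFFh */
    if (v < -0x8000)
        return -0x8000;     /* smallest signed word, 8000h */
    return (int16_t)v;
}
```

A value that already fits passes through unchanged; only out-of-range values are clamped to the nearest representable limit.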
+ + +\S{insPADDB} \i\c{PADDB}, \i\c{PADDW}, \i\c{PADDD}: Add Packed Integers + +\c PADDB mm1,mm2/m64 ; 0F FC /r [PENT,MMX] +\c PADDW mm1,mm2/m64 ; 0F FD /r [PENT,MMX] +\c PADDD mm1,mm2/m64 ; 0F FE /r [PENT,MMX] + +\c PADDB xmm1,xmm2/m128 ; 66 0F FC /r [WILLAMETTE,SSE2] +\c PADDW xmm1,xmm2/m128 ; 66 0F FD /r [WILLAMETTE,SSE2] +\c PADDD xmm1,xmm2/m128 ; 66 0F FE /r [WILLAMETTE,SSE2] + +\c{PADDx} performs packed addition of the two operands, storing the +result in the destination (first) operand. + +\b \c{PADDB} treats the operands as packed bytes, and adds each byte +individually; + +\b \c{PADDW} treats the operands as packed words; + +\b \c{PADDD} treats its operands as packed doublewords. + +When an individual result is too large to fit in its destination, it +is wrapped around and the low bits are stored, with the carry bit +discarded. + + +\S{insPADDQ} \i\c{PADDQ}: Add Packed Quadword Integers + +\c PADDQ mm1,mm2/m64 ; 0F D4 /r [PENT,MMX] + +\c PADDQ xmm1,xmm2/m128 ; 66 0F D4 /r [WILLAMETTE,SSE2] + +\c{PADDQ} adds the quadwords in the source and destination operands, and +stores the result in the destination register. + +When an individual result is too large to fit in its destination, it +is wrapped around and the low bits are stored, with the carry bit +discarded. + + +\S{insPADDSB} \i\c{PADDSB}, \i\c{PADDSW}: Add Packed Signed Integers With Saturation + +\c PADDSB mm1,mm2/m64 ; 0F EC /r [PENT,MMX] +\c PADDSW mm1,mm2/m64 ; 0F ED /r [PENT,MMX] + +\c PADDSB xmm1,xmm2/m128 ; 66 0F EC /r [WILLAMETTE,SSE2] +\c PADDSW xmm1,xmm2/m128 ; 66 0F ED /r [WILLAMETTE,SSE2] + +\c{PADDSx} performs packed addition of the two operands, storing the +result in the destination (first) operand. +\c{PADDSB} treats the operands as packed bytes, and adds each byte +individually; and \c{PADDSW} treats the operands as packed words. + +When an individual result is too large to fit in its destination, a +saturated value is stored. 
The resulting value is the value with the +largest magnitude of the same sign as the result which will fit in +the available space. + + +\S{insPADDSIW} \i\c{PADDSIW}: MMX Packed Addition to Implicit Destination + +\c PADDSIW mmxreg,r/m64 ; 0F 51 /r [CYRIX,MMX] + +\c{PADDSIW}, specific to the Cyrix extensions to the MMX instruction +set, performs the same function as \c{PADDSW}, except that the result +is placed in an implied register. + +To work out the implied register, invert the lowest bit in the register +number. So \c{PADDSIW MM0,MM2} would put the result in \c{MM1}, but +\c{PADDSIW MM1,MM2} would put the result in \c{MM0}. + + +\S{insPADDUSB} \i\c{PADDUSB}, \i\c{PADDUSW}: Add Packed Unsigned Integers With Saturation + +\c PADDUSB mm1,mm2/m64 ; 0F DC /r [PENT,MMX] +\c PADDUSW mm1,mm2/m64 ; 0F DD /r [PENT,MMX] + +\c PADDUSB xmm1,xmm2/m128 ; 66 0F DC /r [WILLAMETTE,SSE2] +\c PADDUSW xmm1,xmm2/m128 ; 66 0F DD /r [WILLAMETTE,SSE2] + +\c{PADDUSx} performs packed addition of the two operands, storing the +result in the destination (first) operand. +\c{PADDUSB} treats the operands as packed bytes, and adds each byte +individually; and \c{PADDUSW} treats the operands as packed words. + +When an individual result is too large to fit in its destination, a +saturated value is stored. The resulting value is the maximum value +that will fit in the available space. + + +\S{insPAND} \i\c{PAND}, \i\c{PANDN}: MMX Bitwise AND and AND-NOT + +\c PAND mm1,mm2/m64 ; 0F DB /r [PENT,MMX] +\c PANDN mm1,mm2/m64 ; 0F DF /r [PENT,MMX] + +\c PAND xmm1,xmm2/m128 ; 66 0F DB /r [WILLAMETTE,SSE2] +\c PANDN xmm1,xmm2/m128 ; 66 0F DF /r [WILLAMETTE,SSE2] + + +\c{PAND} performs a bitwise AND operation between its two operands +(i.e. each bit of the result is 1 if and only if the corresponding +bits of the two inputs were both 1), and stores the result in the +destination (first) operand. 
+ +\c{PANDN} performs the same operation, but performs a one's +complement operation on the destination (first) operand first. + + +\S{insPAUSE} \i\c{PAUSE}: Spin Loop Hint + +\c PAUSE ; F3 90 [WILLAMETTE,SSE2] + +\c{PAUSE} provides a hint to the processor that the following code +is a spin loop. This improves processor performance by bypassing +possible memory order violations. On older processors, this instruction +operates as a \c{NOP}. + + +\S{insPAVEB} \i\c{PAVEB}: MMX Packed Average + +\c PAVEB mmxreg,r/m64 ; 0F 50 /r [CYRIX,MMX] + +\c{PAVEB}, specific to the Cyrix MMX extensions, treats its two +operands as vectors of eight unsigned bytes, and calculates the +average of the corresponding bytes in the operands. The resulting +vector of eight averages is stored in the first operand. + +This opcode maps to \c{MOVMSKPS r32, xmm} on processors that support +the SSE instruction set. + + +\S{insPAVGB} \i\c{PAVGB} \i\c{PAVGW}: Average Packed Integers + +\c PAVGB mm1,mm2/m64 ; 0F E0 /r [KATMAI,MMX] +\c PAVGW mm1,mm2/m64 ; 0F E3 /r [KATMAI,MMX,SM] + +\c PAVGB xmm1,xmm2/m128 ; 66 0F E0 /r [WILLAMETTE,SSE2] +\c PAVGW xmm1,xmm2/m128 ; 66 0F E3 /r [WILLAMETTE,SSE2] + +\c{PAVGB} and \c{PAVGW} add the unsigned data elements of the source +operand to the unsigned data elements of the destination register, +then add 1 to the temporary results. The results of the add are then +each independently right-shifted by one bit position. The high order +bits of each element are filled with the carry bits of the corresponding +sum. + +\b \c{PAVGB} operates on packed unsigned bytes, and + +\b \c{PAVGW} operates on packed unsigned words. + + +\S{insPAVGUSB} \i\c{PAVGUSB}: Average of unsigned packed 8-bit values + +\c PAVGUSB mm1,mm2/m64 ; 0F 0F /r BF [PENT,3DNOW] + +\c{PAVGUSB} adds the unsigned data elements of the source operand to +the unsigned data elements of the destination register, then adds 1 +to the temporary results.
The results of the add are then each +independently right-shifted by one bit position. The high order bits +of each element are filled with the carry bits of the corresponding +sum. + +This instruction performs exactly the same operations as the \c{PAVGB} +\c{MMX} instruction (\k{insPAVGB}). + + +\S{insPCMPEQB} \i\c{PCMPxx}: Compare Packed Integers. + +\c PCMPEQB mm1,mm2/m64 ; 0F 74 /r [PENT,MMX] +\c PCMPEQW mm1,mm2/m64 ; 0F 75 /r [PENT,MMX] +\c PCMPEQD mm1,mm2/m64 ; 0F 76 /r [PENT,MMX] + +\c PCMPGTB mm1,mm2/m64 ; 0F 64 /r [PENT,MMX] +\c PCMPGTW mm1,mm2/m64 ; 0F 65 /r [PENT,MMX] +\c PCMPGTD mm1,mm2/m64 ; 0F 66 /r [PENT,MMX] + +\c PCMPEQB xmm1,xmm2/m128 ; 66 0F 74 /r [WILLAMETTE,SSE2] +\c PCMPEQW xmm1,xmm2/m128 ; 66 0F 75 /r [WILLAMETTE,SSE2] +\c PCMPEQD xmm1,xmm2/m128 ; 66 0F 76 /r [WILLAMETTE,SSE2] + +\c PCMPGTB xmm1,xmm2/m128 ; 66 0F 64 /r [WILLAMETTE,SSE2] +\c PCMPGTW xmm1,xmm2/m128 ; 66 0F 65 /r [WILLAMETTE,SSE2] +\c PCMPGTD xmm1,xmm2/m128 ; 66 0F 66 /r [WILLAMETTE,SSE2] + +The \c{PCMPxx} instructions all treat their operands as vectors of +bytes, words, or doublewords; corresponding elements of the source +and destination are compared, and the corresponding element of the +destination (first) operand is set to all zeros or all ones +depending on the result of the comparison. + +\b \c{PCMPxxB} treats the operands as vectors of bytes; + +\b \c{PCMPxxW} treats the operands as vectors of words; + +\b \c{PCMPxxD} treats the operands as vectors of doublewords; + +\b \c{PCMPEQx} sets the corresponding element of the destination +operand to all ones if the two elements compared are equal; + +\b \c{PCMPGTx} sets the destination element to all ones if the element +of the first (destination) operand is greater (treated as a signed +integer) than that of the second (source) operand. 
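As a rough illustration of the \c{PCMPxx} rules above, the following Python sketch models the byte-wise 64-bit forms, \c{PCMPEQB} and \c{PCMPGTB}, on plain integers. The helper names (\c{_lanes}, \c{_signed}, \c{_pack}) are hypothetical and are not part of NASM; this is a reference model of the documented behaviour only, assuming lane 0 is the least significant byte.

```python
def _lanes(value, count=8):
    """Split an integer into `count` byte lanes, lane 0 = least significant."""
    return [(value >> (8 * i)) & 0xFF for i in range(count)]


def _signed(byte):
    """Reinterpret an unsigned byte as a signed 8-bit integer."""
    return byte - 0x100 if byte & 0x80 else byte


def _pack(lanes):
    """Reassemble byte lanes into a single integer."""
    result = 0
    for i, lane in enumerate(lanes):
        result |= (lane & 0xFF) << (8 * i)
    return result


def pcmpeqb(dst, src):
    """Each result byte is all ones if the two bytes are equal, else zero."""
    return _pack([0xFF if d == s else 0x00
                  for d, s in zip(_lanes(dst), _lanes(src))])


def pcmpgtb(dst, src):
    """Each result byte is all ones if dst > src as signed bytes, else zero."""
    return _pack([0xFF if _signed(d) > _signed(s) else 0x00
                  for d, s in zip(_lanes(dst), _lanes(src))])
```

Because the comparison in \c{PCMPGTx} is signed, a byte of \c{0x7F} compares greater than \c{0x80} in this model, not the other way round.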
+ + +\S{insPDISTIB} \i\c{PDISTIB}: MMX Packed Distance and Accumulate +with Implied Register + +\c PDISTIB mm,m64 ; 0F 54 /r [CYRIX,MMX] + +\c{PDISTIB}, specific to the Cyrix MMX extensions, treats its two +input operands as vectors of eight unsigned bytes. For each byte +position, it finds the absolute difference between the bytes in that +position in the two input operands, and adds that value to the byte +in the same position in the implied output register. The addition is +saturated to an unsigned byte in the same way as \c{PADDUSB}. + +To work out the implied register, invert the lowest bit in the register +number. So \c{PDISTIB MM0,M64} would put the result in \c{MM1}, but +\c{PDISTIB MM1,M64} would put the result in \c{MM0}. + +Note that \c{PDISTIB} cannot take a register as its second source +operand. + +Operation: + +\c dstI[0-7] := dstI[0-7] + ABS(src0[0-7] - src1[0-7]), +\c dstI[8-15] := dstI[8-15] + ABS(src0[8-15] - src1[8-15]), +\c ....... +\c ....... +\c dstI[56-63] := dstI[56-63] + ABS(src0[56-63] - src1[56-63]). + + +\S{insPEXTRW} \i\c{PEXTRW}: Extract Word + +\c PEXTRW reg32,mm,imm8 ; 0F C5 /r ib [KATMAI,MMX] +\c PEXTRW reg32,xmm,imm8 ; 66 0F C5 /r ib [WILLAMETTE,SSE2] + +\c{PEXTRW} moves the word in the source register (second operand) +that is pointed to by the count operand (third operand), into the +lower half of a 32-bit general purpose register. The upper half of +the register is cleared to all 0s. + +When the source operand is an \c{MMX} register, the two least +significant bits of the count specify the source word. When it is +an \c{SSE} register, the three least significant bits specify the +word location. + + +\S{insPF2ID} \i\c{PF2ID}: Packed Single-Precision FP to Integer Convert + +\c PF2ID mm1,mm2/m64 ; 0F 0F /r 1D [PENT,3DNOW] + +\c{PF2ID} converts two single-precision FP values in the source operand +to signed 32-bit integers, using truncation, and stores them in the +destination operand. 
Source values that are outside the range supported +by the destination are saturated to the largest absolute value of the +same sign. + + +\S{insPF2IW} \i\c{PF2IW}: Packed Single-Precision FP to Integer Word Convert + +\c PF2IW mm1,mm2/m64 ; 0F 0F /r 1C [PENT,3DNOW] + +\c{PF2IW} converts two single-precision FP values in the source operand +to signed 16-bit integers, using truncation, and stores them in the +destination operand. Source values that are outside the range supported +by the destination are saturated to the largest absolute value of the +same sign. + +\b In the K6-2 and K6-III, the 16-bit value is zero-extended to 32 bits +before storing. + +\b In the K6-2+, K6-III+ and Athlon processors, the value is sign-extended +to 32 bits before storing. + + +\S{insPFACC} \i\c{PFACC}: Packed Single-Precision FP Accumulate + +\c PFACC mm1,mm2/m64 ; 0F 0F /r AE [PENT,3DNOW] + +\c{PFACC} adds the two single-precision FP values from the destination +operand together, then adds the two single-precision FP values from the +source operand, and places the results in the low and high doublewords +of the destination operand. + +The operation is: + +\c dst[0-31] := dst[0-31] + dst[32-63], +\c dst[32-63] := src[0-31] + src[32-63]. + + +\S{insPFADD} \i\c{PFADD}: Packed Single-Precision FP Addition + +\c PFADD mm1,mm2/m64 ; 0F 0F /r 9E [PENT,3DNOW] + +\c{PFADD} performs addition on each of two packed single-precision +FP value pairs. + +\c dst[0-31] := dst[0-31] + src[0-31], +\c dst[32-63] := dst[32-63] + src[32-63]. + + +\S{insPFCMP} \i\c{PFCMPxx}: Packed Single-Precision FP Compare +\I\c{PFCMPEQ} \I\c{PFCMPGE} \I\c{PFCMPGT} + +\c PFCMPEQ mm1,mm2/m64 ; 0F 0F /r B0 [PENT,3DNOW] +\c PFCMPGE mm1,mm2/m64 ; 0F 0F /r 90 [PENT,3DNOW] +\c PFCMPGT mm1,mm2/m64 ; 0F 0F /r A0 [PENT,3DNOW] + +The \c{PFCMPxx} instructions compare the packed single-precision FP values +in the source and destination operands, and set the destination +according to the result.
If the condition is true, the destination is +set to all 1s, otherwise it's set to all 0s. + +\b \c{PFCMPEQ} tests whether dst == src; + +\b \c{PFCMPGE} tests whether dst >= src; + +\b \c{PFCMPGT} tests whether dst > src. + + +\S{insPFMAX} \i\c{PFMAX}: Packed Single-Precision FP Maximum + +\c PFMAX mm1,mm2/m64 ; 0F 0F /r A4 [PENT,3DNOW] + +\c{PFMAX} returns the higher of each pair of single-precision FP values. +If the higher value is zero, it is returned as positive zero. + + +\S{insPFMIN} \i\c{PFMIN}: Packed Single-Precision FP Minimum + +\c PFMIN mm1,mm2/m64 ; 0F 0F /r 94 [PENT,3DNOW] + +\c{PFMIN} returns the lower of each pair of single-precision FP values. +If the lower value is zero, it is returned as positive zero. + + +\S{insPFMUL} \i\c{PFMUL}: Packed Single-Precision FP Multiply + +\c PFMUL mm1,mm2/m64 ; 0F 0F /r B4 [PENT,3DNOW] + +\c{PFMUL} returns the product of each pair of single-precision FP values. + +\c dst[0-31] := dst[0-31] * src[0-31], +\c dst[32-63] := dst[32-63] * src[32-63]. + + +\S{insPFNACC} \i\c{PFNACC}: Packed Single-Precision FP Negative Accumulate + +\c PFNACC mm1,mm2/m64 ; 0F 0F /r 8A [PENT,3DNOW] + +\c{PFNACC} performs a negative accumulate of the two single-precision +FP values in the source and destination registers. The result of the +accumulate from the destination register is stored in the low doubleword +of the destination, and the result of the source accumulate is stored in +the high doubleword of the destination register. + +The operation is: + +\c dst[0-31] := dst[0-31] - dst[32-63], +\c dst[32-63] := src[0-31] - src[32-63]. + + +\S{insPFPNACC} \i\c{PFPNACC}: Packed Single-Precision FP Mixed Accumulate + +\c PFPNACC mm1,mm2/m64 ; 0F 0F /r 8E [PENT,3DNOW] + +\c{PFPNACC} performs a positive accumulate of the two single-precision +FP values in the source register and a negative accumulate of the +destination register. 
The result of the accumulate from the destination +register is stored in the low doubleword of the destination, and the +result of the source accumulate is stored in the high doubleword of the +destination register. + +The operation is: + +\c dst[0-31] := dst[0-31] - dst[32-63], +\c dst[32-63] := src[0-31] + src[32-63]. + + +\S{insPFRCP} \i\c{PFRCP}: Packed Single-Precision FP Reciprocal Approximation + +\c PFRCP mm1,mm2/m64 ; 0F 0F /r 96 [PENT,3DNOW] + +\c{PFRCP} performs a low precision estimate of the reciprocal of the +low-order single-precision FP value in the source operand, storing the +result in both halves of the destination register. The result is accurate +to 14 bits. + +For higher precision reciprocals, this instruction should be followed by +two more instructions: \c{PFRCPIT1} (\k{insPFRCPIT1}) and \c{PFRCPIT2} +(\k{insPFRCPIT2}). This gives 24-bit accuracy. For more details, +see the AMD 3DNow! technology manual. + + +\S{insPFRCPIT1} \i\c{PFRCPIT1}: Packed Single-Precision FP Reciprocal, +First Iteration Step + +\c PFRCPIT1 mm1,mm2/m64 ; 0F 0F /r A6 [PENT,3DNOW] + +\c{PFRCPIT1} performs the first intermediate step in the calculation of +the reciprocal of a single-precision FP value. The first source value +(\c{mm1}) is the original value, and the second source value (\c{mm2/m64}) +is the result of a \c{PFRCP} instruction. + +For the final step in a reciprocal, returning the full 24-bit accuracy +of a single-precision FP value, see \c{PFRCPIT2} (\k{insPFRCPIT2}). For +more details, see the AMD 3DNow! technology manual. + + +\S{insPFRCPIT2} \i\c{PFRCPIT2}: Packed Single-Precision FP +Reciprocal/ Reciprocal Square Root, Second Iteration Step + +\c PFRCPIT2 mm1,mm2/m64 ; 0F 0F /r B6 [PENT,3DNOW] + +\c{PFRCPIT2} performs the second and final intermediate step in the +calculation of a reciprocal or reciprocal square root, refining the +values returned by the \c{PFRCP} and \c{PFRSQRT} instructions, +respectively.
+ +The first source value (\c{mm1}) is the output of either a \c{PFRCPIT1} +or a \c{PFRSQIT1} instruction, and the second source is the output of +either the \c{PFRCP} or the \c{PFRSQRT} instruction. For more details, +see the AMD 3DNow! technology manual. + + +\S{insPFRSQIT1} \i\c{PFRSQIT1}: Packed Single-Precision FP Reciprocal +Square Root, First Iteration Step + +\c PFRSQIT1 mm1,mm2/m64 ; 0F 0F /r A7 [PENT,3DNOW] + +\c{PFRSQIT1} performs the first intermediate step in the calculation of +the reciprocal square root of a single-precision FP value. The first +source value (\c{mm1}) is the square of the result of a \c{PFRSQRT} +instruction, and the second source value (\c{mm2/m64}) is the original +value. + +For the final step in a calculation, returning the full 24-bit accuracy +of a single-precision FP value, see \c{PFRCPIT2} (\k{insPFRCPIT2}). For +more details, see the AMD 3DNow! technology manual. + + +\S{insPFRSQRT} \i\c{PFRSQRT}: Packed Single-Precision FP Reciprocal +Square Root Approximation + +\c PFRSQRT mm1,mm2/m64 ; 0F 0F /r 97 [PENT,3DNOW] + +\c{PFRSQRT} performs a low precision estimate of the reciprocal square +root of the low-order single-precision FP value in the source operand, +storing the result in both halves of the destination register. The result +is accurate to 15 bits. + +For higher precision reciprocal square roots, this instruction should be +followed by two more instructions: \c{PFRSQIT1} (\k{insPFRSQIT1}) and +\c{PFRCPIT2} (\k{insPFRCPIT2}). This gives 24-bit accuracy. For more details, +see the AMD 3DNow! technology manual. + + +\S{insPFSUB} \i\c{PFSUB}: Packed Single-Precision FP Subtract + +\c PFSUB mm1,mm2/m64 ; 0F 0F /r 9A [PENT,3DNOW] + +\c{PFSUB} subtracts the single-precision FP values in the source from +those in the destination, and stores the result in the destination +operand. + +\c dst[0-31] := dst[0-31] - src[0-31], +\c dst[32-63] := dst[32-63] - src[32-63].
+ + +\S{insPFSUBR} \i\c{PFSUBR}: Packed Single-Precision FP Reverse Subtract + +\c PFSUBR mm1,mm2/m64 ; 0F 0F /r AA [PENT,3DNOW] + +\c{PFSUBR} subtracts the single-precision FP values in the destination +from those in the source, and stores the result in the destination +operand. + +\c dst[0-31] := src[0-31] - dst[0-31], +\c dst[32-63] := src[32-63] - dst[32-63]. + + +\S{insPI2FD} \i\c{PI2FD}: Packed Doubleword Integer to Single-Precision FP Convert + +\c PI2FD mm1,mm2/m64 ; 0F 0F /r 0D [PENT,3DNOW] + +\c{PI2FD} converts two signed 32-bit integers in the source operand +to single-precision FP values, using truncation of significant digits, +and stores them in the destination operand. + + +\S{insPI2FW} \i\c{PI2FW}: Packed Word Integer to Single-Precision FP Convert + +\c PI2FW mm1,mm2/m64 ; 0F 0F /r 0C [PENT,3DNOW] + +\c{PI2FW} converts two signed 16-bit integers in the source operand +to single-precision FP values, and stores them in the destination +operand. The input values are in the low word of each doubleword. + + +\S{insPINSRW} \i\c{PINSRW}: Insert Word + +\c PINSRW mm,r16/r32/m16,imm8 ;0F C4 /r ib [KATMAI,MMX] +\c PINSRW xmm,r16/r32/m16,imm8 ;66 0F C4 /r ib [WILLAMETTE,SSE2] + +\c{PINSRW} loads a word from a 16-bit register (or the low half of a +32-bit register), or from memory, and inserts it at the word position +in the destination register selected by the count operand (third +operand). If the destination is an \c{MMX} register, the low two bits +of the count byte are used; if it is an \c{XMM} register, the low three +bits are used. The insertion is done in such a way that the other +words from the destination register are left untouched.
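The word selection performed by \c{PINSRW} can be sketched in Python as follows. The function name and the \c{lanes} parameter are assumptions made for illustration (4 lanes for an \c{MMX} destination, 8 for an \c{XMM} destination); registers are modelled as plain integers with word 0 in the least significant bits.

```python
def pinsrw(dst, word, count, lanes=4):
    """Model of PINSRW: insert the low 16 bits of `word` into `dst`.

    `lanes` is 4 for a 64-bit MMX destination (low two bits of the count
    are used) or 8 for a 128-bit XMM destination (low three bits).
    """
    index = count & (lanes - 1)       # only the low bits of the count matter
    shift = 16 * index
    mask = 0xFFFF << shift
    # Clear the selected word, then drop the new word in; others untouched.
    return (dst & ~mask) | ((word & 0xFFFF) << shift)
```

For example, under these assumptions a count of 5 on an \c{MMX} destination selects the same word as a count of 1, because only the low two bits are examined.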
+ + +\S{insPMACHRIW} \i\c{PMACHRIW}: Packed Multiply and Accumulate with Rounding + +\c PMACHRIW mm,m64 ; 0F 5E /r [CYRIX,MMX] + +\c{PMACHRIW} takes two packed 16-bit integer inputs, multiplies the +values in the inputs, rounds on bit 15 of each result, then adds bits +15-30 of each result to the corresponding position of the \e{implied} +destination register. + +The operation of this instruction is: + +\c dstI[0-15] := dstI[0-15] + (mm[0-15] *m64[0-15] +\c + 0x00004000)[15-30], +\c dstI[16-31] := dstI[16-31] + (mm[16-31]*m64[16-31] +\c + 0x00004000)[15-30], +\c dstI[32-47] := dstI[32-47] + (mm[32-47]*m64[32-47] +\c + 0x00004000)[15-30], +\c dstI[48-63] := dstI[48-63] + (mm[48-63]*m64[48-63] +\c + 0x00004000)[15-30]. + +Note that \c{PMACHRIW} cannot take a register as its second source +operand. + + +\S{insPMADDWD} \i\c{PMADDWD}: MMX Packed Multiply and Add + +\c PMADDWD mm1,mm2/m64 ; 0F F5 /r [PENT,MMX] +\c PMADDWD xmm1,xmm2/m128 ; 66 0F F5 /r [WILLAMETTE,SSE2] + +\c{PMADDWD} treats its two inputs as vectors of signed words. It +multiplies corresponding elements of the two operands, giving doubleword +results. These are then added together in pairs and stored in the +destination operand. + +The operation of this instruction is: + +\c dst[0-31] := (dst[0-15] * src[0-15]) +\c + (dst[16-31] * src[16-31]); +\c dst[32-63] := (dst[32-47] * src[32-47]) +\c + (dst[48-63] * src[48-63]); + +The following apply to the \c{SSE} version of the instruction: + +\c dst[64-95] := (dst[64-79] * src[64-79]) +\c + (dst[80-95] * src[80-95]); +\c dst[96-127] := (dst[96-111] * src[96-111]) +\c + (dst[112-127] * src[112-127]). + + +\S{insPMAGW} \i\c{PMAGW}: MMX Packed Magnitude + +\c PMAGW mm1,mm2/m64 ; 0F 52 /r [CYRIX,MMX] + +\c{PMAGW}, specific to the Cyrix MMX extensions, treats both its +operands as vectors of four signed words. 
It compares the absolute +values of the words in corresponding positions, and sets each word +of the destination (first) operand to whichever of the two words in +that position had the larger absolute value. + + +\S{insPMAXSW} \i\c{PMAXSW}: Packed Signed Integer Word Maximum + +\c PMAXSW mm1,mm2/m64 ; 0F EE /r [KATMAI,MMX] +\c PMAXSW xmm1,xmm2/m128 ; 66 0F EE /r [WILLAMETTE,SSE2] + +\c{PMAXSW} compares each pair of words in the two source operands, and +for each pair it stores the maximum value in the destination register. + + +\S{insPMAXUB} \i\c{PMAXUB}: Packed Unsigned Integer Byte Maximum + +\c PMAXUB mm1,mm2/m64 ; 0F DE /r [KATMAI,MMX] +\c PMAXUB xmm1,xmm2/m128 ; 66 0F DE /r [WILLAMETTE,SSE2] + +\c{PMAXUB} compares each pair of bytes in the two source operands, and +for each pair it stores the maximum value in the destination register. + + +\S{insPMINSW} \i\c{PMINSW}: Packed Signed Integer Word Minimum + +\c PMINSW mm1,mm2/m64 ; 0F EA /r [KATMAI,MMX] +\c PMINSW xmm1,xmm2/m128 ; 66 0F EA /r [WILLAMETTE,SSE2] + +\c{PMINSW} compares each pair of words in the two source operands, and +for each pair it stores the minimum value in the destination register. + + +\S{insPMINUB} \i\c{PMINUB}: Packed Unsigned Integer Byte Minimum + +\c PMINUB mm1,mm2/m64 ; 0F DA /r [KATMAI,MMX] +\c PMINUB xmm1,xmm2/m128 ; 66 0F DA /r [WILLAMETTE,SSE2] + +\c{PMINUB} compares each pair of bytes in the two source operands, and +for each pair it stores the minimum value in the destination register. + + +\S{insPMOVMSKB} \i\c{PMOVMSKB}: Move Byte Mask To Integer + +\c PMOVMSKB reg32,mm ; 0F D7 /r [KATMAI,MMX] +\c PMOVMSKB reg32,xmm ; 66 0F D7 /r [WILLAMETTE,SSE2] + +\c{PMOVMSKB} returns an 8-bit or 16-bit mask formed of the most +significant bits of each byte of source operand (8-bits for an +\c{MMX} register, 16-bits for an \c{XMM} register). 
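A minimal Python sketch of the mask that \c{PMOVMSKB} builds is shown here. The function name and the \c{lanes} parameter are hypothetical (8 for an \c{MMX} source, 16 for an \c{XMM} source), and it assumes that bit \e{n} of the mask comes from byte \e{n} of the source, counting from the least significant byte.

```python
def pmovmskb(src, lanes=8):
    """Model of PMOVMSKB: gather the top bit of each source byte.

    `lanes` is 8 for a 64-bit MMX source or 16 for a 128-bit XMM source.
    """
    mask = 0
    for i in range(lanes):
        msb = (src >> (8 * i + 7)) & 1   # most significant bit of byte i
        mask |= msb << i
    return mask
```

A common use of such a mask is to test the all-ones or all-zeros bytes produced by a packed compare with ordinary integer instructions.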
+ + +\S{insPMULHRW} \i\c{PMULHRWC}, \i\c{PMULHRIW}: Multiply Packed 16-bit Integers +With Rounding, and Store High Word + +\c PMULHRWC mm1,mm2/m64 ; 0F 59 /r [CYRIX,MMX] +\c PMULHRIW mm1,mm2/m64 ; 0F 5D /r [CYRIX,MMX] + +These instructions take two packed 16-bit integer inputs, multiply the +values in the inputs, round on bit 15 of each result, then store bits +15-30 of each result to the corresponding position of the destination +register. + +\b For \c{PMULHRWC}, the destination is the first source operand. + +\b For \c{PMULHRIW}, the destination is an implied register (worked out +as described for \c{PADDSIW} (\k{insPADDSIW})). + +The operation of this instruction is: + +\c dst[0-15] := (src1[0-15] *src2[0-15] + 0x00004000)[15-30] +\c dst[16-31] := (src1[16-31]*src2[16-31] + 0x00004000)[15-30] +\c dst[32-47] := (src1[32-47]*src2[32-47] + 0x00004000)[15-30] +\c dst[48-63] := (src1[48-63]*src2[48-63] + 0x00004000)[15-30] + +See also \c{PMULHRWA} (\k{insPMULHRWA}) for a 3DNow! version of this +instruction. + + +\S{insPMULHRWA} \i\c{PMULHRWA}: Multiply Packed 16-bit Integers +With Rounding, and Store High Word + +\c PMULHRWA mm1,mm2/m64 ; 0F 0F /r B7 [PENT,3DNOW] + +\c{PMULHRWA} takes two packed 16-bit integer inputs, multiplies +the values in the inputs, rounds on bit 16 of each result, then +stores bits 16-31 of each result to the corresponding position +of the destination register. + +The operation of this instruction is: + +\c dst[0-15] := (src1[0-15] *src2[0-15] + 0x00008000)[16-31]; +\c dst[16-31] := (src1[16-31]*src2[16-31] + 0x00008000)[16-31]; +\c dst[32-47] := (src1[32-47]*src2[32-47] + 0x00008000)[16-31]; +\c dst[48-63] := (src1[48-63]*src2[48-63] + 0x00008000)[16-31]. + +See also \c{PMULHRWC} (\k{insPMULHRW}) for a Cyrix version of this +instruction. 
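The rounding step in \c{PMULHRWA} can be checked against a small Python model of the operation listed above. The function names are hypothetical, and the sketch assumes the 16-bit inputs are treated as signed, which the text above does not state explicitly.

```python
def _s16(word):
    """Reinterpret a 16-bit value as a signed integer (assumed signedness)."""
    return word - 0x10000 if word & 0x8000 else word


def pmulhrwa(dst, src):
    """Model of PMULHRWA on 64-bit operands held as integers."""
    result = 0
    for i in range(4):
        a = _s16((dst >> (16 * i)) & 0xFFFF)
        b = _s16((src >> (16 * i)) & 0xFFFF)
        product = a * b + 0x8000                 # round by adding 0x00008000
        result |= ((product >> 16) & 0xFFFF) << (16 * i)   # keep bits 16-31
    return result
```

In this model 3 times \c{0x4000} gives 1 rather than 0, which shows the effect of the rounding constant: the unrounded high word of \c{0xC000} would be 0.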
+ + +\S{insPMULHUW} \i\c{PMULHUW}: Multiply Packed 16-bit Integers, +and Store High Word + +\c PMULHUW mm1,mm2/m64 ; 0F E4 /r [KATMAI,MMX] +\c PMULHUW xmm1,xmm2/m128 ; 66 0F E4 /r [WILLAMETTE,SSE2] + +\c{PMULHUW} takes two packed unsigned 16-bit integer inputs, multiplies +the values in the inputs, then stores bits 16-31 of each result to the +corresponding position of the destination register. + + +\S{insPMULHW} \i\c{PMULHW}, \i\c{PMULLW}: Multiply Packed 16-bit Integers, +and Store + +\c PMULHW mm1,mm2/m64 ; 0F E5 /r [PENT,MMX] +\c PMULLW mm1,mm2/m64 ; 0F D5 /r [PENT,MMX] + +\c PMULHW xmm1,xmm2/m128 ; 66 0F E5 /r [WILLAMETTE,SSE2] +\c PMULLW xmm1,xmm2/m128 ; 66 0F D5 /r [WILLAMETTE,SSE2] + +\c{PMULxW} takes two packed signed 16-bit integer inputs, and +multiplies the values in the inputs, forming doubleword results. + +\b \c{PMULHW} then stores the top 16 bits of each doubleword in the +destination (first) operand; + +\b \c{PMULLW} stores the bottom 16 bits of each doubleword in the +destination operand. + + +\S{insPMULUDQ} \i\c{PMULUDQ}: Multiply Packed Unsigned +32-bit Integers, and Store. + +\c PMULUDQ mm1,mm2/m64 ; 0F F4 /r [WILLAMETTE,SSE2] +\c PMULUDQ xmm1,xmm2/m128 ; 66 0F F4 /r [WILLAMETTE,SSE2] + +\c{PMULUDQ} takes two packed unsigned 32-bit integer inputs, and +multiplies the values in the inputs, forming quadword results. The +source is either an unsigned doubleword in the low doubleword of a +64-bit operand, or two unsigned doublewords in the first and +third doublewords of a 128-bit operand. This produces either one or +two 64-bit results, which are stored in the respective quadword +locations of the destination register. + +The operation is: + +\c dst[0-63] := dst[0-31] * src[0-31]; +\c dst[64-127] := dst[64-95] * src[64-95].
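The two forms of \c{PMULUDQ} described above can be modelled in Python as below. The function names are hypothetical, and registers are represented as plain integers; only the doublewords named in the operation listing take part in the multiplication.

```python
MASK32 = 0xFFFFFFFF


def pmuludq_mmx(dst, src):
    """64-bit form: multiply the low doublewords, giving one quadword."""
    return (dst & MASK32) * (src & MASK32)


def pmuludq_xmm(dst, src):
    """128-bit form: multiply doublewords 0 and 2, giving two quadwords."""
    low = (dst & MASK32) * (src & MASK32)
    high = ((dst >> 64) & MASK32) * ((src >> 64) & MASK32)
    return (high << 64) | low
```

The product of two unsigned 32-bit values always fits in 64 bits, so neither quadword result can overflow into its neighbour.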
+ + +\S{insPMVccZB} \i\c{PMVccZB}: MMX Packed Conditional Move + +\c PMVZB mmxreg,mem64 ; 0F 58 /r [CYRIX,MMX] +\c PMVNZB mmxreg,mem64 ; 0F 5A /r [CYRIX,MMX] +\c PMVLZB mmxreg,mem64 ; 0F 5B /r [CYRIX,MMX] +\c PMVGEZB mmxreg,mem64 ; 0F 5C /r [CYRIX,MMX] + +These instructions, specific to the Cyrix MMX extensions, perform +parallel conditional moves. The two input operands are treated as +vectors of eight bytes. Each byte of the destination (first) operand +is either written from the corresponding byte of the source (second) +operand, or left alone, depending on the value of the byte in the +\e{implied} operand (specified in the same way as \c{PADDSIW}, in +\k{insPADDSIW}). + +\b \c{PMVZB} performs each move if the corresponding byte in the +implied operand is zero; + +\b \c{PMVNZB} moves if the byte is non-zero; + +\b \c{PMVLZB} moves if the byte is less than zero; + +\b \c{PMVGEZB} moves if the byte is greater than or equal to zero. + +Note that these instructions cannot take a register as their second +source operand. + + +\S{insPOP} \i\c{POP}: Pop Data from Stack + +\c POP reg16 ; o16 58+r [8086] +\c POP reg32 ; o32 58+r [386] + +\c POP r/m16 ; o16 8F /0 [8086] +\c POP r/m32 ; o32 8F /0 [386] + +\c POP CS ; 0F [8086,UNDOC] +\c POP DS ; 1F [8086] +\c POP ES ; 07 [8086] +\c POP SS ; 17 [8086] +\c POP FS ; 0F A1 [386] +\c POP GS ; 0F A9 [386] + +\c{POP} loads a value from the stack (from \c{[SS:SP]} or +\c{[SS:ESP]}) and then increments the stack pointer. + +The address-size attribute of the instruction determines whether +\c{SP} or \c{ESP} is used as the stack pointer: to deliberately +override the default given by the \c{BITS} setting, you can use an +\i\c{a16} or \i\c{a32} prefix. + +The operand-size attribute of the instruction determines whether the +stack pointer is incremented by 2 or 4: this means that segment +register pops in \c{BITS 32} mode will pop 4 bytes off the stack and +discard the upper two of them. 
If you need to override that, you can +use an \i\c{o16} or \i\c{o32} prefix. + +The above opcode listings give two forms for general-purpose +register pop instructions: for example, \c{POP BX} has the two forms +\c{5B} and \c{8F C3}. NASM will always generate the shorter form +when given \c{POP BX}. NDISASM will disassemble both. + +\c{POP CS} is not a documented instruction, and is not supported on +any processor above the 8086 (since they use \c{0Fh} as an opcode +prefix for instruction set extensions). However, at least some 8086 +processors do support it, and so NASM generates it for completeness. + + +\S{insPOPA} \i\c{POPAx}: Pop All General-Purpose Registers + +\c POPA ; 61 [186] +\c POPAW ; o16 61 [186] +\c POPAD ; o32 61 [386] + +\b \c{POPAW} pops a word from the stack into each of, successively, +\c{DI}, \c{SI}, \c{BP}, nothing (it discards a word from the stack +which was a placeholder for \c{SP}), \c{BX}, \c{DX}, \c{CX} and +\c{AX}. It is intended to reverse the operation of \c{PUSHAW} (see +\k{insPUSHA}), but it ignores the value for \c{SP} that was pushed +on the stack by \c{PUSHAW}. + +\b \c{POPAD} pops twice as much data, and places the results in +\c{EDI}, \c{ESI}, \c{EBP}, nothing (placeholder for \c{ESP}), +\c{EBX}, \c{EDX}, \c{ECX} and \c{EAX}. It reverses the operation of +\c{PUSHAD}. + +\c{POPA} is an alias mnemonic for either \c{POPAW} or \c{POPAD}, +depending on the current \c{BITS} setting. + +Note that the registers are popped in reverse order of their numeric +values in opcodes (see \k{iref-rv}). + + +\S{insPOPF} \i\c{POPFx}: Pop Flags Register + +\c POPF ; 9D [8086] +\c POPFW ; o16 9D [8086] +\c POPFD ; o32 9D [386] + +\b \c{POPFW} pops a word from the stack and stores it in the bottom 16 +bits of the flags register (or the whole flags register, on +processors below a 386). + +\b \c{POPFD} pops a doubleword and stores it in the entire flags register. 
+ +\c{POPF} is an alias mnemonic for either \c{POPFW} or \c{POPFD}, +depending on the current \c{BITS} setting. + +See also \c{PUSHF} (\k{insPUSHF}). + + +\S{insPOR} \i\c{POR}: MMX Bitwise OR + +\c POR mm1,mm2/m64 ; 0F EB /r [PENT,MMX] +\c POR xmm1,xmm2/m128 ; 66 0F EB /r [WILLAMETTE,SSE2] + +\c{POR} performs a bitwise OR operation between its two operands +(i.e. each bit of the result is 1 if and only if at least one of the +corresponding bits of the two inputs was 1), and stores the result +in the destination (first) operand. + + +\S{insPREFETCH} \i\c{PREFETCH}: Prefetch Data Into Caches + +\c PREFETCH mem8 ; 0F 0D /0 [PENT,3DNOW] +\c PREFETCHW mem8 ; 0F 0D /1 [PENT,3DNOW] + +\c{PREFETCH} and \c{PREFETCHW} fetch the line of data from memory that +contains the specified byte. \c{PREFETCHW} performs differently on the +Athlon to earlier processors. + +For more details, see the 3DNow! Technology Manual. + + +\S{insPREFETCHh} \i\c{PREFETCHh}: Prefetch Data Into Caches +\I\c{PREFETCHNTA} \I\c{PREFETCHT0} \I\c{PREFETCHT1} \I\c{PREFETCHT2} + +\c PREFETCHNTA m8 ; 0F 18 /0 [KATMAI] +\c PREFETCHT0 m8 ; 0F 18 /1 [KATMAI] +\c PREFETCHT1 m8 ; 0F 18 /2 [KATMAI] +\c PREFETCHT2 m8 ; 0F 18 /3 [KATMAI] + +The \c{PREFETCHh} instructions fetch the line of data from memory +that contains the specified byte. It is placed in the cache +according to rules specified by locality hints \c{h}: + +The hints are: + +\b \c{T0} (temporal data) - prefetch data into all levels of the +cache hierarchy. + +\b \c{T1} (temporal data with respect to first level cache) - +prefetch data into level 2 cache and higher. + +\b \c{T2} (temporal data with respect to second level cache) - +prefetch data into level 2 cache and higher. + +\b \c{NTA} (non-temporal data with respect to all cache levels) - +prefetch data into non-temporal cache structure and into a +location close to the processor, minimizing cache pollution. 
+ +Note that this group of instructions doesn't provide a guarantee +that the data will be in the cache when it is needed. For more +details, see the Intel IA32 Software Developer Manual, Volume 2. + + +\S{insPSADBW} \i\c{PSADBW}: Packed Sum of Absolute Differences + +\c PSADBW mm1,mm2/m64 ; 0F F6 /r [KATMAI,MMX] +\c PSADBW xmm1,xmm2/m128 ; 66 0F F6 /r [WILLAMETTE,SSE2] + +\c{PSADBW} computes the absolute value of the +difference of the packed unsigned bytes in the two source operands. +These differences are then summed to produce a word result in the lower +16-bit field of the destination register; the rest of the register is +cleared. The destination operand is an \c{MMX} or an \c{XMM} register. +The source operand can either be a register or a memory operand. + + +\S{insPSHUFD} \i\c{PSHUFD}: Shuffle Packed Doublewords + +\c PSHUFD xmm1,xmm2/m128,imm8 ; 66 0F 70 /r ib [WILLAMETTE,SSE2] + +\c{PSHUFD} shuffles the doublewords in the source (second) operand +according to the encoding specified by imm8, and stores the result +in the destination (first) operand. + +Bits 0 and 1 of imm8 encode the source position of the doubleword to +be copied to position 0 in the destination operand. Bits 2 and 3 +encode for position 1, bits 4 and 5 encode for position 2, and bits +6 and 7 encode for position 3. For example, an encoding of 10 in +bits 0 and 1 of imm8 indicates that the doubleword at bits 64-95 of +the source operand will be copied to bits 0-31 of the destination. + + +\S{insPSHUFHW} \i\c{PSHUFHW}: Shuffle Packed High Words + +\c PSHUFHW xmm1,xmm2/m128,imm8 ; F3 0F 70 /r ib [WILLAMETTE,SSE2] + +\c{PSHUFHW} shuffles the words in the high quadword of the source +(second) operand according to the encoding specified by imm8, and +stores the result in the high quadword of the destination (first) +operand.
+ +The operation of this instruction is similar to the \c{PSHUFW} +instruction, except that the source and destination are the top +quadword of a 128-bit operand, instead of being 64-bit operands. +The low quadword is copied from the source to the destination +without any changes. + + +\S{insPSHUFLW} \i\c{PSHUFLW}: Shuffle Packed Low Words + +\c PSHUFLW xmm1,xmm2/m128,imm8 ; F2 0F 70 /r ib [WILLAMETTE,SSE2] + +\c{PSHUFLW} shuffles the words in the low quadword of the source +(second) operand according to the encoding specified by imm8, and +stores the result in the low quadword of the destination (first) +operand. + +The operation of this instruction is similar to the \c{PSHUFW} +instruction, except that the source and destination are the low +quadword of a 128-bit operand, instead of being 64-bit operands. +The high quadword is copied from the source to the destination +without any changes. + + +\S{insPSHUFW} \i\c{PSHUFW}: Shuffle Packed Words + +\c PSHUFW mm1,mm2/m64,imm8 ; 0F 70 /r ib [KATMAI,MMX] + +\c{PSHUFW} shuffles the words in the source (second) operand +according to the encoding specified by imm8, and stores the result +in the destination (first) operand. + +Bits 0 and 1 of imm8 encode the source position of the word to be +copied to position 0 in the destination operand. Bits 2 and 3 encode +for position 1, bits 4 and 5 encode for position 2, and bits 6 and 7 +encode for position 3. For example, an encoding of 10 in bits 0 and 1 +of imm8 indicates that the word at bits 32-47 of the source operand +will be copied to bits 0-15 of the destination. 
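The imm8 encoding used by \c{PSHUFW} can be sketched in Python as follows. The function name is hypothetical and the 64-bit operand is modelled as a plain integer with word 0 in bits 0-15.

```python
def pshufw(src, imm8):
    """Model of PSHUFW: each 2-bit field of imm8 picks a source word."""
    result = 0
    for position in range(4):
        selector = (imm8 >> (2 * position)) & 3    # bits 2n and 2n+1
        word = (src >> (16 * selector)) & 0xFFFF
        result |= word << (16 * position)
    return result
```

With this model an imm8 of \c{0x1B} reverses the order of the four words, and \c{0xE4} leaves them in place.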
+ + +\S{insPSLLD} \i\c{PSLLx}: Packed Data Bit Shift Left Logical + +\c PSLLW mm1,mm2/m64 ; 0F F1 /r [PENT,MMX] +\c PSLLW mm,imm8 ; 0F 71 /6 ib [PENT,MMX] + +\c PSLLW xmm1,xmm2/m128 ; 66 0F F1 /r [WILLAMETTE,SSE2] +\c PSLLW xmm,imm8 ; 66 0F 71 /6 ib [WILLAMETTE,SSE2] + +\c PSLLD mm1,mm2/m64 ; 0F F2 /r [PENT,MMX] +\c PSLLD mm,imm8 ; 0F 72 /6 ib [PENT,MMX] + +\c PSLLD xmm1,xmm2/m128 ; 66 0F F2 /r [WILLAMETTE,SSE2] +\c PSLLD xmm,imm8 ; 66 0F 72 /6 ib [WILLAMETTE,SSE2] + +\c PSLLQ mm1,mm2/m64 ; 0F F3 /r [PENT,MMX] +\c PSLLQ mm,imm8 ; 0F 73 /6 ib [PENT,MMX] + +\c PSLLQ xmm1,xmm2/m128 ; 66 0F F3 /r [WILLAMETTE,SSE2] +\c PSLLQ xmm,imm8 ; 66 0F 73 /6 ib [WILLAMETTE,SSE2] + +\c PSLLDQ xmm1,imm8 ; 66 0F 73 /7 ib [WILLAMETTE,SSE2] + +\c{PSLLx} performs logical left shifts of the data elements in the +destination (first) operand, moving each bit in the separate elements +left by the number of bits specified in the source (second) operand, +clearing the low-order bits as they are vacated. \c{PSLLDQ} +shifts bytes, not bits. + +\b \c{PSLLW} shifts word sized elements. + +\b \c{PSLLD} shifts doubleword sized elements. + +\b \c{PSLLQ} shifts quadword sized elements. + +\b \c{PSLLDQ} shifts double quadword sized elements. + + +\S{insPSRAD} \i\c{PSRAx}: Packed Data Bit Shift Right Arithmetic + +\c PSRAW mm1,mm2/m64 ; 0F E1 /r [PENT,MMX] +\c PSRAW mm,imm8 ; 0F 71 /4 ib [PENT,MMX] + +\c PSRAW xmm1,xmm2/m128 ; 66 0F E1 /r [WILLAMETTE,SSE2] +\c PSRAW xmm,imm8 ; 66 0F 71 /4 ib [WILLAMETTE,SSE2] + +\c PSRAD mm1,mm2/m64 ; 0F E2 /r [PENT,MMX] +\c PSRAD mm,imm8 ; 0F 72 /4 ib [PENT,MMX] + +\c PSRAD xmm1,xmm2/m128 ; 66 0F E2 /r [WILLAMETTE,SSE2] +\c PSRAD xmm,imm8 ; 66 0F 72 /4 ib [WILLAMETTE,SSE2] + +\c{PSRAx} performs arithmetic right shifts of the data elements in the +destination (first) operand, moving each bit in the separate elements +right by the number of bits specified in the source (second) operand, +setting the high-order bits to the value of the original sign bit. 
+ +\b \c{PSRAW} shifts word sized elements. + +\b \c{PSRAD} shifts doubleword sized elements. + + +\S{insPSRLD} \i\c{PSRLx}: Packed Data Bit Shift Right Logical + +\c PSRLW mm1,mm2/m64 ; 0F D1 /r [PENT,MMX] +\c PSRLW mm,imm8 ; 0F 71 /2 ib [PENT,MMX] + +\c PSRLW xmm1,xmm2/m128 ; 66 0F D1 /r [WILLAMETTE,SSE2] +\c PSRLW xmm,imm8 ; 66 0F 71 /2 ib [WILLAMETTE,SSE2] + +\c PSRLD mm1,mm2/m64 ; 0F D2 /r [PENT,MMX] +\c PSRLD mm,imm8 ; 0F 72 /2 ib [PENT,MMX] + +\c PSRLD xmm1,xmm2/m128 ; 66 0F D2 /r [WILLAMETTE,SSE2] +\c PSRLD xmm,imm8 ; 66 0F 72 /2 ib [WILLAMETTE,SSE2] + +\c PSRLQ mm1,mm2/m64 ; 0F D3 /r [PENT,MMX] +\c PSRLQ mm,imm8 ; 0F 73 /2 ib [PENT,MMX] + +\c PSRLQ xmm1,xmm2/m128 ; 66 0F D3 /r [WILLAMETTE,SSE2] +\c PSRLQ xmm,imm8 ; 66 0F 73 /2 ib [WILLAMETTE,SSE2] + +\c PSRLDQ xmm1,imm8 ; 66 0F 73 /3 ib [WILLAMETTE,SSE2] + +\c{PSRLx} performs logical right shifts of the data elements in the +destination (first) operand, moving each bit in the separate elements +right by the number of bits specified in the source (second) operand, +clearing the high-order bits as they are vacated. \c{PSRLDQ} +shifts bytes, not bits. + +\b \c{PSRLW} shifts word sized elements. + +\b \c{PSRLD} shifts doubleword sized elements. + +\b \c{PSRLQ} shifts quadword sized elements. + +\b \c{PSRLDQ} shifts double quadword sized elements. + + +\S{insPSUBB} \i\c{PSUBx}: Subtract Packed Integers + +\c PSUBB mm1,mm2/m64 ; 0F F8 /r [PENT,MMX] +\c PSUBW mm1,mm2/m64 ; 0F F9 /r [PENT,MMX] +\c PSUBD mm1,mm2/m64 ; 0F FA /r [PENT,MMX] +\c PSUBQ mm1,mm2/m64 ; 0F FB /r [WILLAMETTE,SSE2] + +\c PSUBB xmm1,xmm2/m128 ; 66 0F F8 /r [WILLAMETTE,SSE2] +\c PSUBW xmm1,xmm2/m128 ; 66 0F F9 /r [WILLAMETTE,SSE2] +\c PSUBD xmm1,xmm2/m128 ; 66 0F FA /r [WILLAMETTE,SSE2] +\c PSUBQ xmm1,xmm2/m128 ; 66 0F FB /r [WILLAMETTE,SSE2] + +\c{PSUBx} subtracts packed integers in the source operand from those +in the destination operand. It doesn't differentiate between signed +and unsigned integers, and doesn't set any of the flags. 
+ +\b \c{PSUBB} operates on byte sized elements. + +\b \c{PSUBW} operates on word sized elements. + +\b \c{PSUBD} operates on doubleword sized elements. + +\b \c{PSUBQ} operates on quadword sized elements. + + +\S{insPSUBSB} \i\c{PSUBSxx}, \i\c{PSUBUSx}: Subtract Packed Integers With Saturation + +\c PSUBSB mm1,mm2/m64 ; 0F E8 /r [PENT,MMX] +\c PSUBSW mm1,mm2/m64 ; 0F E9 /r [PENT,MMX] + +\c PSUBSB xmm1,xmm2/m128 ; 66 0F E8 /r [WILLAMETTE,SSE2] +\c PSUBSW xmm1,xmm2/m128 ; 66 0F E9 /r [WILLAMETTE,SSE2] + +\c PSUBUSB mm1,mm2/m64 ; 0F D8 /r [PENT,MMX] +\c PSUBUSW mm1,mm2/m64 ; 0F D9 /r [PENT,MMX] + +\c PSUBUSB xmm1,xmm2/m128 ; 66 0F D8 /r [WILLAMETTE,SSE2] +\c PSUBUSW xmm1,xmm2/m128 ; 66 0F D9 /r [WILLAMETTE,SSE2] + +\c{PSUBSx} and \c{PSUBUSx} subtract packed integers in the source +operand from those in the destination operand, and use saturation for +results that are outside the range supported by the destination operand. + +\b \c{PSUBSB} operates on signed bytes, and uses signed saturation on the +results. + +\b \c{PSUBSW} operates on signed words, and uses signed saturation on the +results. + +\b \c{PSUBUSB} operates on unsigned bytes, and uses unsigned saturation on +the results. + +\b \c{PSUBUSW} operates on unsigned words, and uses unsigned saturation on +the results. + + +\S{insPSUBSIW} \i\c{PSUBSIW}: MMX Packed Subtract with Saturation to +Implied Destination + +\c PSUBSIW mm1,mm2/m64 ; 0F 55 /r [CYRIX,MMX] + +\c{PSUBSIW}, specific to the Cyrix extensions to the MMX instruction +set, performs the same function as \c{PSUBSW}, except that the +result is not placed in the register specified by the first operand, +but instead in the implied destination register, specified as for +\c{PADDSIW} (\k{insPADDSIW}). + + +\S{insPSWAPD} \i\c{PSWAPD}: Swap Packed Data +\I\c{PSWAPW} + +\c PSWAPD mm1,mm2/m64 ; 0F 0F /r BB [PENT,3DNOW] + +\c{PSWAPD} swaps the packed doublewords in the source operand, and +stores the result in the destination operand.
+ +In the \c{K6-2} and \c{K6-III} processors, this opcode uses the +mnemonic \c{PSWAPW}, and it swaps the order of words when copying +from the source to the destination. + +The operation in the \c{K6-2} and \c{K6-III} processors is: + +\c dst[0-15] = src[48-63]; +\c dst[16-31] = src[32-47]; +\c dst[32-47] = src[16-31]; +\c dst[48-63] = src[0-15]. + +The operation in the \c{K6-x+}, \c{ATHLON} and later processors is: + +\c dst[0-31] = src[32-63]; +\c dst[32-63] = src[0-31]. + + +\S{insPUNPCKHBW} \i\c{PUNPCKxxx}: Unpack and Interleave Data + +\c PUNPCKHBW mm1,mm2/m64 ; 0F 68 /r [PENT,MMX] +\c PUNPCKHWD mm1,mm2/m64 ; 0F 69 /r [PENT,MMX] +\c PUNPCKHDQ mm1,mm2/m64 ; 0F 6A /r [PENT,MMX] + +\c PUNPCKHBW xmm1,xmm2/m128 ; 66 0F 68 /r [WILLAMETTE,SSE2] +\c PUNPCKHWD xmm1,xmm2/m128 ; 66 0F 69 /r [WILLAMETTE,SSE2] +\c PUNPCKHDQ xmm1,xmm2/m128 ; 66 0F 6A /r [WILLAMETTE,SSE2] +\c PUNPCKHQDQ xmm1,xmm2/m128 ; 66 0F 6D /r [WILLAMETTE,SSE2] + +\c PUNPCKLBW mm1,mm2/m32 ; 0F 60 /r [PENT,MMX] +\c PUNPCKLWD mm1,mm2/m32 ; 0F 61 /r [PENT,MMX] +\c PUNPCKLDQ mm1,mm2/m32 ; 0F 62 /r [PENT,MMX] + +\c PUNPCKLBW xmm1,xmm2/m128 ; 66 0F 60 /r [WILLAMETTE,SSE2] +\c PUNPCKLWD xmm1,xmm2/m128 ; 66 0F 61 /r [WILLAMETTE,SSE2] +\c PUNPCKLDQ xmm1,xmm2/m128 ; 66 0F 62 /r [WILLAMETTE,SSE2] +\c PUNPCKLQDQ xmm1,xmm2/m128 ; 66 0F 6C /r [WILLAMETTE,SSE2] + +\c{PUNPCKxx} all treat their operands as vectors, and produce a new +vector generated by interleaving elements from the two inputs. The +\c{PUNPCKHxx} instructions start by throwing away the bottom half of +each input operand, and the \c{PUNPCKLxx} instructions throw away +the top half. + +The remaining elements are then interleaved into the destination, +alternating elements from the second (source) operand and the first +(destination) operand: so the leftmost part of each element in the +result always comes from the second operand, and the rightmost from +the destination. + +\b \c{PUNPCKxBW} works a byte at a time, producing word sized output +elements.
+ +\b \c{PUNPCKxWD} works a word at a time, producing doubleword sized +output elements. + +\b \c{PUNPCKxDQ} works a doubleword at a time, producing quadword sized +output elements. + +\b \c{PUNPCKxQDQ} works a quadword at a time, producing double quadword +sized output elements. + +So, for example, for \c{MMX} operands, if the first operand held +\c{0x7A6A5A4A3A2A1A0A} and the second held \c{0x7B6B5B4B3B2B1B0B}, +then: + +\b \c{PUNPCKHBW} would return \c{0x7B7A6B6A5B5A4B4A}. + +\b \c{PUNPCKHWD} would return \c{0x7B6B7A6A5B4B5A4A}. + +\b \c{PUNPCKHDQ} would return \c{0x7B6B5B4B7A6A5A4A}. + +\b \c{PUNPCKLBW} would return \c{0x3B3A2B2A1B1A0B0A}. + +\b \c{PUNPCKLWD} would return \c{0x3B2B3A2A1B0B1A0A}. + +\b \c{PUNPCKLDQ} would return \c{0x3B2B1B0B3A2A1A0A}. + + +\S{insPUSH} \i\c{PUSH}: Push Data on Stack + +\c PUSH reg16 ; o16 50+r [8086] +\c PUSH reg32 ; o32 50+r [386] + +\c PUSH r/m16 ; o16 FF /6 [8086] +\c PUSH r/m32 ; o32 FF /6 [386] + +\c PUSH CS ; 0E [8086] +\c PUSH DS ; 1E [8086] +\c PUSH ES ; 06 [8086] +\c PUSH SS ; 16 [8086] +\c PUSH FS ; 0F A0 [386] +\c PUSH GS ; 0F A8 [386] + +\c PUSH imm8 ; 6A ib [186] +\c PUSH imm16 ; o16 68 iw [186] +\c PUSH imm32 ; o32 68 id [386] + +\c{PUSH} decrements the stack pointer (\c{SP} or \c{ESP}) by 2 or 4, +and then stores the given value at \c{[SS:SP]} or \c{[SS:ESP]}. + +The address-size attribute of the instruction determines whether +\c{SP} or \c{ESP} is used as the stack pointer: to deliberately +override the default given by the \c{BITS} setting, you can use an +\i\c{a16} or \i\c{a32} prefix. + +The operand-size attribute of the instruction determines whether the +stack pointer is decremented by 2 or 4: this means that segment +register pushes in \c{BITS 32} mode will push 4 bytes on the stack, +of which the upper two are undefined. If you need to override that, +you can use an \i\c{o16} or \i\c{o32} prefix. 
+ +The above opcode listings give two forms for general-purpose +\i{register push} instructions: for example, \c{PUSH BX} has the two +forms \c{53} and \c{FF F3}. NASM will always generate the shorter +form when given \c{PUSH BX}. NDISASM will disassemble both. + +Unlike the undocumented and barely supported \c{POP CS}, \c{PUSH CS} +is a perfectly valid and sensible instruction, supported on all +processors. + +The instruction \c{PUSH SP} may be used to distinguish an 8086 from +later processors: on an 8086, the value of \c{SP} stored is the +value it has \e{after} the push instruction, whereas on later +processors it is the value \e{before} the push instruction. + + +\S{insPUSHA} \i\c{PUSHAx}: Push All General-Purpose Registers + +\c PUSHA ; 60 [186] +\c PUSHAD ; o32 60 [386] +\c PUSHAW ; o16 60 [186] + +\c{PUSHAW} pushes, in succession, \c{AX}, \c{CX}, \c{DX}, \c{BX}, +\c{SP}, \c{BP}, \c{SI} and \c{DI} on the stack, decrementing the +stack pointer by a total of 16. + +\c{PUSHAD} pushes, in succession, \c{EAX}, \c{ECX}, \c{EDX}, +\c{EBX}, \c{ESP}, \c{EBP}, \c{ESI} and \c{EDI} on the stack, +decrementing the stack pointer by a total of 32. + +In both cases, the value of \c{SP} or \c{ESP} pushed is its +\e{original} value, as it had before the instruction was executed. + +\c{PUSHA} is an alias mnemonic for either \c{PUSHAW} or \c{PUSHAD}, +depending on the current \c{BITS} setting. + +Note that the registers are pushed in order of their numeric values +in opcodes (see \k{iref-rv}). + +See also \c{POPA} (\k{insPOPA}). + + +\S{insPUSHF} \i\c{PUSHFx}: Push Flags Register + +\c PUSHF ; 9C [8086] +\c PUSHFD ; o32 9C [386] +\c PUSHFW ; o16 9C [8086] + +\b \c{PUSHFW} pushes the bottom 16 bits of the flags register +(or the whole flags register, on processors below a 386) onto +the stack. + +\b \c{PUSHFD} pushes the entire flags register onto the stack. + +\c{PUSHF} is an alias mnemonic for either \c{PUSHFW} or \c{PUSHFD}, +depending on the current \c{BITS} setting. 
+ +See also \c{POPF} (\k{insPOPF}). + + +\S{insPXOR} \i\c{PXOR}: MMX Bitwise XOR + +\c PXOR mm1,mm2/m64 ; 0F EF /r [PENT,MMX] +\c PXOR xmm1,xmm2/m128 ; 66 0F EF /r [WILLAMETTE,SSE2] + +\c{PXOR} performs a bitwise XOR operation between its two operands +(i.e. each bit of the result is 1 if and only if exactly one of the +corresponding bits of the two inputs was 1), and stores the result +in the destination (first) operand. + + +\S{insRCL} \i\c{RCL}, \i\c{RCR}: Bitwise Rotate through Carry Bit + +\c RCL r/m8,1 ; D0 /2 [8086] +\c RCL r/m8,CL ; D2 /2 [8086] +\c RCL r/m8,imm8 ; C0 /2 ib [186] +\c RCL r/m16,1 ; o16 D1 /2 [8086] +\c RCL r/m16,CL ; o16 D3 /2 [8086] +\c RCL r/m16,imm8 ; o16 C1 /2 ib [186] +\c RCL r/m32,1 ; o32 D1 /2 [386] +\c RCL r/m32,CL ; o32 D3 /2 [386] +\c RCL r/m32,imm8 ; o32 C1 /2 ib [386] + +\c RCR r/m8,1 ; D0 /3 [8086] +\c RCR r/m8,CL ; D2 /3 [8086] +\c RCR r/m8,imm8 ; C0 /3 ib [186] +\c RCR r/m16,1 ; o16 D1 /3 [8086] +\c RCR r/m16,CL ; o16 D3 /3 [8086] +\c RCR r/m16,imm8 ; o16 C1 /3 ib [186] +\c RCR r/m32,1 ; o32 D1 /3 [386] +\c RCR r/m32,CL ; o32 D3 /3 [386] +\c RCR r/m32,imm8 ; o32 C1 /3 ib [386] + +\c{RCL} and \c{RCR} perform a 9-bit, 17-bit or 33-bit bitwise +rotation operation, involving the given source/destination (first) +operand and the carry bit. Thus, for example, in the operation +\c{RCL AL,1}, a 9-bit rotation is performed in which \c{AL} is +shifted left by 1, the top bit of \c{AL} moves into the carry flag, +and the original value of the carry flag is placed in the low bit of +\c{AL}. + +The number of bits to rotate by is given by the second operand. Only +the bottom five bits of the rotation count are considered by +processors above the 8086. + +You can force the longer (286 and upwards, beginning with a \c{C1} +byte) form of \c{RCL foo,1} by using a \c{BYTE} prefix: \c{RCL +foo,BYTE 1}. Similarly with \c{RCR}. 
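The rotate-through-carry behaviour described for \c{RCL} and \c{RCR} can be sketched as a small reference model. This is an illustrative sketch only, not NASM or processor code: the function names \c{rcl} and \c{rcr} and their parameters are hypothetical, the count masking models processors above the 8086, and the flag effects other than carry are not modelled.

```python
def rcl(value, carry, count, width=8):
    """Rotate a width-bit value left through the carry flag.

    Returns the pair (new_value, new_carry).
    """
    total = width + 1                      # 9, 17 or 33 bits including carry
    count = (count & 0x1F) % total         # 5-bit count; full turns change nothing
    combined = (carry << width) | value    # carry sits just above the top bit
    mask = (1 << total) - 1
    combined = ((combined << count) | (combined >> (total - count))) & mask
    return combined & ((1 << width) - 1), combined >> width


def rcr(value, carry, count, width=8):
    """Rotate a width-bit value right through the carry flag.

    Returns the pair (new_value, new_carry).
    """
    total = width + 1
    count = (count & 0x1F) % total
    combined = (carry << width) | value
    mask = (1 << total) - 1
    combined = ((combined >> count) | (combined << (total - count))) & mask
    return combined & ((1 << width) - 1), combined >> width
```

For instance, modelling \c{RCL AL,1} with \c{AL} holding \c{0x80} and the carry flag clear, \c{rcl(0x80, 0, 1)} yields a zero \c{AL} and a set carry flag.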
+ + +\S{insRCPPS} \i\c{RCPPS}: Packed Single-Precision FP Reciprocal + +\c RCPPS xmm1,xmm2/m128 ; 0F 53 /r [KATMAI,SSE] + +\c{RCPPS} returns an approximation of the reciprocal of the packed +single-precision FP values from xmm2/m128. The maximum error for this +approximation is: |Error| <= 1.5 x 2^-12 + + +\S{insRCPSS} \i\c{RCPSS}: Scalar Single-Precision FP Reciprocal + +\c RCPSS xmm1,xmm2/m128 ; F3 0F 53 /r [KATMAI,SSE] + +\c{RCPSS} returns an approximation of the reciprocal of the lower +single-precision FP value from xmm2/m32; the upper three fields are +passed through from xmm1. The maximum error for this approximation is: +|Error| <= 1.5 x 2^-12 + + +\S{insRDMSR} \i\c{RDMSR}: Read Model-Specific Registers + +\c RDMSR ; 0F 32 [PENT,PRIV] + +\c{RDMSR} reads the processor Model-Specific Register (MSR) whose +index is stored in \c{ECX}, and stores the result in \c{EDX:EAX}. +See also \c{WRMSR} (\k{insWRMSR}). + + +\S{insRDPMC} \i\c{RDPMC}: Read Performance-Monitoring Counters + +\c RDPMC ; 0F 33 [P6] + +\c{RDPMC} reads the processor performance-monitoring counter whose +index is stored in \c{ECX}, and stores the result in \c{EDX:EAX}. + +This instruction is available on P6 and later processors and on MMX +class processors. + + +\S{insRDSHR} \i\c{RDSHR}: Read SMM Header Pointer Register + +\c RDSHR r/m32 ; 0F 36 /0 [386,CYRIX,SMM] + +\c{RDSHR} reads the contents of the SMM header pointer register and +saves it to the destination operand, which can be either a 32 bit +memory location or a 32 bit register. + +See also \c{WRSHR} (\k{insWRSHR}). + + +\S{insRDTSC} \i\c{RDTSC}: Read Time-Stamp Counter + +\c RDTSC ; 0F 31 [PENT] + +\c{RDTSC} reads the processor's time-stamp counter into \c{EDX:EAX}. 
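\c{RDMSR}, \c{RDPMC} and \c{RDTSC} all return a 64-bit result split across \c{EDX:EAX}, with \c{EDX} holding the high 32 bits and \c{EAX} the low 32 bits. A minimal sketch of recombining the halves follows; the helper names \c{join_edx_eax} and \c{split_edx_eax} are hypothetical and exist only for illustration.

```python
def join_edx_eax(edx, eax):
    """Combine the EDX (high) and EAX (low) halves into one 64-bit integer."""
    return ((edx & 0xFFFFFFFF) << 32) | (eax & 0xFFFFFFFF)


def split_edx_eax(value):
    """Split a 64-bit integer into its (EDX, EAX) halves."""
    return (value >> 32) & 0xFFFFFFFF, value & 0xFFFFFFFF
```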
+ + +\S{insRET} \i\c{RET}, \i\c{RETF}, \i\c{RETN}: Return from Procedure Call + +\c RET ; C3 [8086] +\c RET imm16 ; C2 iw [8086] + +\c RETF ; CB [8086] +\c RETF imm16 ; CA iw [8086] + +\c RETN ; C3 [8086] +\c RETN imm16 ; C2 iw [8086] + +\b \c{RET}, and its exact synonym \c{RETN}, pop \c{IP} or \c{EIP} from +the stack and transfer control to the new address. Optionally, if a +numeric second operand is provided, they increment the stack pointer +by a further \c{imm16} bytes after popping the return address. + +\b \c{RETF} executes a far return: after popping \c{IP}/\c{EIP}, it +then pops \c{CS}, and \e{then} increments the stack pointer by the +optional argument if present. + + +\S{insROL} \i\c{ROL}, \i\c{ROR}: Bitwise Rotate + +\c ROL r/m8,1 ; D0 /0 [8086] +\c ROL r/m8,CL ; D2 /0 [8086] +\c ROL r/m8,imm8 ; C0 /0 ib [186] +\c ROL r/m16,1 ; o16 D1 /0 [8086] +\c ROL r/m16,CL ; o16 D3 /0 [8086] +\c ROL r/m16,imm8 ; o16 C1 /0 ib [186] +\c ROL r/m32,1 ; o32 D1 /0 [386] +\c ROL r/m32,CL ; o32 D3 /0 [386] +\c ROL r/m32,imm8 ; o32 C1 /0 ib [386] + +\c ROR r/m8,1 ; D0 /1 [8086] +\c ROR r/m8,CL ; D2 /1 [8086] +\c ROR r/m8,imm8 ; C0 /1 ib [186] +\c ROR r/m16,1 ; o16 D1 /1 [8086] +\c ROR r/m16,CL ; o16 D3 /1 [8086] +\c ROR r/m16,imm8 ; o16 C1 /1 ib [186] +\c ROR r/m32,1 ; o32 D1 /1 [386] +\c ROR r/m32,CL ; o32 D3 /1 [386] +\c ROR r/m32,imm8 ; o32 C1 /1 ib [386] + +\c{ROL} and \c{ROR} perform a bitwise rotation operation on the given +source/destination (first) operand. Thus, for example, in the +operation \c{ROL AL,1}, an 8-bit rotation is performed in which +\c{AL} is shifted left by 1 and the original top bit of \c{AL} moves +round into the low bit. + +The number of bits to rotate by is given by the second operand. Only +the bottom five bits of the rotation count are considered by processors +above the 8086. + +You can force the longer (286 and upwards, beginning with a \c{C1} +byte) form of \c{ROL foo,1} by using a \c{BYTE} prefix: \c{ROL +foo,BYTE 1}. 
Similarly with \c{ROR}. + + +\S{insRSDC} \i\c{RSDC}: Restore Segment Register and Descriptor + +\c RSDC segreg,m80 ; 0F 79 /r [486,CYRIX,SMM] + +\c{RSDC} restores a segment register (DS, ES, FS, GS, or SS) from mem80, +and sets up its descriptor. + + +\S{insRSLDT} \i\c{RSLDT}: Restore LDTR and Descriptor + +\c RSLDT m80 ; 0F 7B /0 [486,CYRIX,SMM] + +\c{RSLDT} restores the Local Descriptor Table Register (LDTR) from mem80. + + +\S{insRSM} \i\c{RSM}: Resume from System-Management Mode + +\c RSM ; 0F AA [PENT] + +\c{RSM} returns the processor to its normal operating mode when it +was in System-Management Mode. + + +\S{insRSQRTPS} \i\c{RSQRTPS}: Packed Single-Precision FP Square Root Reciprocal + +\c RSQRTPS xmm1,xmm2/m128 ; 0F 52 /r [KATMAI,SSE] + +\c{RSQRTPS} computes the approximate reciprocals of the square +roots of the packed single-precision floating-point values in the +source and stores the results in xmm1. The maximum error for this +approximation is: |Error| <= 1.5 x 2^-12 + + +\S{insRSQRTSS} \i\c{RSQRTSS}: Scalar Single-Precision FP Square Root Reciprocal + +\c RSQRTSS xmm1,xmm2/m128 ; F3 0F 52 /r [KATMAI,SSE] + +\c{RSQRTSS} returns an approximation of the reciprocal of the +square root of the lowest order single-precision FP value from +the source, and stores it in the low doubleword of the destination +register. The upper three fields of xmm1 are preserved. The maximum +error for this approximation is: |Error| <= 1.5 x 2^-12 + + +\S{insRSTS} \i\c{RSTS}: Restore TSR and Descriptor + +\c RSTS m80 ; 0F 7D /0 [486,CYRIX,SMM] + +\c{RSTS} restores the Task State Register (TSR) from mem80. + + +\S{insSAHF} \i\c{SAHF}: Store AH to Flags + +\c SAHF ; 9E [8086] + +\c{SAHF} sets the low byte of the flags word according to the +contents of the \c{AH} register. + +The operation of \c{SAHF} is: + +\c AH --> SF:ZF:0:AF:0:PF:1:CF + +See also \c{LAHF} (\k{insLAHF}).
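The \c{SAHF} layout above reads from bit 7 down to bit 0. As a rough sketch, the mapping from an \c{AH} value to the five flags it loads could be modelled as below; the function name \c{sahf} and the dictionary return format are hypothetical choices made for this illustration, and the three fixed bit positions (shown as 0, 0 and 1 in the layout) are ignored because \c{AH} cannot change them.

```python
def sahf(ah):
    """Decode an AH value into the flags SAHF would load.

    Bit positions follow the layout SF:ZF:0:AF:0:PF:1:CF (bit 7 .. bit 0).
    """
    return {
        "SF": (ah >> 7) & 1,   # sign flag, bit 7
        "ZF": (ah >> 6) & 1,   # zero flag, bit 6
        "AF": (ah >> 4) & 1,   # auxiliary carry flag, bit 4
        "PF": (ah >> 2) & 1,   # parity flag, bit 2
        "CF": ah & 1,          # carry flag, bit 0
    }
```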
+ + +\S{insSAL} \i\c{SAL}, \i\c{SAR}: Bitwise Arithmetic Shifts + +\c SAL r/m8,1 ; D0 /4 [8086] +\c SAL r/m8,CL ; D2 /4 [8086] +\c SAL r/m8,imm8 ; C0 /4 ib [186] +\c SAL r/m16,1 ; o16 D1 /4 [8086] +\c SAL r/m16,CL ; o16 D3 /4 [8086] +\c SAL r/m16,imm8 ; o16 C1 /4 ib [186] +\c SAL r/m32,1 ; o32 D1 /4 [386] +\c SAL r/m32,CL ; o32 D3 /4 [386] +\c SAL r/m32,imm8 ; o32 C1 /4 ib [386] + +\c SAR r/m8,1 ; D0 /7 [8086] +\c SAR r/m8,CL ; D2 /7 [8086] +\c SAR r/m8,imm8 ; C0 /7 ib [186] +\c SAR r/m16,1 ; o16 D1 /7 [8086] +\c SAR r/m16,CL ; o16 D3 /7 [8086] +\c SAR r/m16,imm8 ; o16 C1 /7 ib [186] +\c SAR r/m32,1 ; o32 D1 /7 [386] +\c SAR r/m32,CL ; o32 D3 /7 [386] +\c SAR r/m32,imm8 ; o32 C1 /7 ib [386] + +\c{SAL} and \c{SAR} perform an arithmetic shift operation on the given +source/destination (first) operand. The vacated bits are filled with +zero for \c{SAL}, and with copies of the original high bit of the +source operand for \c{SAR}. + +\c{SAL} is a synonym for \c{SHL} (see \k{insSHL}). NASM will +assemble either one to the same code, but NDISASM will always +disassemble that code as \c{SHL}. + +The number of bits to shift by is given by the second operand. Only +the bottom five bits of the shift count are considered by processors +above the 8086. + +You can force the longer (286 and upwards, beginning with a \c{C1} +byte) form of \c{SAL foo,1} by using a \c{BYTE} prefix: \c{SAL +foo,BYTE 1}. Similarly with \c{SAR}. + + +\S{insSALC} \i\c{SALC}: Set AL from Carry Flag + +\c SALC ; D6 [8086,UNDOC] + +\c{SALC} is an early undocumented instruction similar in concept to +\c{SETcc} (\k{insSETcc}). Its function is to set \c{AL} to zero if +the carry flag is clear, or to \c{0xFF} if it is set. 
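The effect of \c{SALC} on \c{AL} can be sketched in a couple of lines. The comparison with \c{SBB AL,AL} is an illustration added here rather than something stated above: it produces the same value in \c{AL}, although \c{SBB} additionally updates the flags. Both function names are hypothetical.

```python
def salc(carry):
    """Model of SALC: AL becomes 0xFF if the carry flag is set, else 0x00."""
    return 0xFF if carry else 0x00


def sbb_al_al(al, carry):
    """Model of the AL result of SBB AL,AL: AL - AL - CF, truncated to 8 bits."""
    return (al - al - carry) & 0xFF
```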
+ + +\S{insSBB} \i\c{SBB}: Subtract with Borrow + +\c SBB r/m8,reg8 ; 18 /r [8086] +\c SBB r/m16,reg16 ; o16 19 /r [8086] +\c SBB r/m32,reg32 ; o32 19 /r [386] + +\c SBB reg8,r/m8 ; 1A /r [8086] +\c SBB reg16,r/m16 ; o16 1B /r [8086] +\c SBB reg32,r/m32 ; o32 1B /r [386] + +\c SBB r/m8,imm8 ; 80 /3 ib [8086] +\c SBB r/m16,imm16 ; o16 81 /3 iw [8086] +\c SBB r/m32,imm32 ; o32 81 /3 id [386] + +\c SBB r/m16,imm8 ; o16 83 /3 ib [8086] +\c SBB r/m32,imm8 ; o32 83 /3 ib [386] + +\c SBB AL,imm8 ; 1C ib [8086] +\c SBB AX,imm16 ; o16 1D iw [8086] +\c SBB EAX,imm32 ; o32 1D id [386] + +\c{SBB} performs integer subtraction: it subtracts its second +operand, plus the value of the carry flag, from its first, and +leaves the result in its destination (first) operand. The flags are +set according to the result of the operation: in particular, the +carry flag is affected and can be used by a subsequent \c{SBB} +instruction. + +In the forms with an 8-bit immediate second operand and a longer +first operand, the second operand is considered to be signed, and is +sign-extended to the length of the first operand. In these cases, +the \c{BYTE} qualifier is necessary to force NASM to generate this +form of the instruction. + +To subtract one number from another without also subtracting the +contents of the carry flag, use \c{SUB} (\k{insSUB}). + + +\S{insSCASB} \i\c{SCASB}, \i\c{SCASW}, \i\c{SCASD}: Scan String + +\c SCASB ; AE [8086] +\c SCASW ; o16 AF [8086] +\c SCASD ; o32 AF [386] + +\c{SCASB} compares the byte in \c{AL} with the byte at \c{[ES:DI]} +or \c{[ES:EDI]}, and sets the flags accordingly. It then increments +or decrements (depending on the direction flag: increments if the +flag is clear, decrements if it is set) \c{DI} (or \c{EDI}). + +The register used is \c{DI} if the address size is 16 bits, and +\c{EDI} if it is 32 bits. If you need to use an address size not +equal to the current \c{BITS} setting, you can use an explicit +\i\c{a16} or \i\c{a32} prefix. 
+ +Segment override prefixes have no effect for this instruction: the +use of \c{ES} for the load from \c{[DI]} or \c{[EDI]} cannot be +overridden. + +\c{SCASW} and \c{SCASD} work in the same way, but they compare a +word to \c{AX} or a doubleword to \c{EAX} instead of a byte to +\c{AL}, and increment or decrement the addressing registers by 2 or +4 instead of 1. + +The \c{REPE} and \c{REPNE} prefixes (equivalently, \c{REPZ} and +\c{REPNZ}) may be used to repeat the instruction up to \c{CX} (or +\c{ECX} - again, the address size chooses which) times until the +first unequal or equal byte is found. + + +\S{insSETcc} \i\c{SETcc}: Set Register from Condition + +\c SETcc r/m8 ; 0F 90+cc /2 [386] + +\c{SETcc} sets the given 8-bit operand to zero if its condition is +not satisfied, and to 1 if it is. + + +\S{insSFENCE} \i\c{SFENCE}: Store Fence + +\c SFENCE ; 0F AE /7 [KATMAI] + +\c{SFENCE} performs a serialising operation on all writes to memory +that were issued before the \c{SFENCE} instruction. This guarantees that +all memory writes before the \c{SFENCE} instruction are visible before any +writes after the \c{SFENCE} instruction. + +\c{SFENCE} is ordered with respect to other \c{SFENCE} instructions, \c{MFENCE}, +any memory write and any other serialising instruction (such as \c{CPUID}). + +Weakly ordered memory types can be used to achieve higher processor +performance through such techniques as out-of-order issue, +write-combining, and write-collapsing. The degree to which a consumer +of data recognizes or knows that the data is weakly ordered varies +among applications and may be unknown to the producer of this data. +The \c{SFENCE} instruction provides a performance-efficient way of +ensuring store ordering between routines that produce weakly-ordered +results and routines that consume this data.
+ +\c{SFENCE} uses the following ModRM encoding: + +\c Mod (7:6) = 11B +\c Reg/Opcode (5:3) = 111B +\c R/M (2:0) = 000B + +All other ModRM encodings are defined to be reserved, and use +of these encodings risks incompatibility with future processors. + +See also \c{LFENCE} (\k{insLFENCE}) and \c{MFENCE} (\k{insMFENCE}). + + +\S{insSGDT} \i\c{SGDT}, \i\c{SIDT}, \i\c{SLDT}: Store Descriptor Table Pointers + +\c SGDT mem ; 0F 01 /0 [286,PRIV] +\c SIDT mem ; 0F 01 /1 [286,PRIV] +\c SLDT r/m16 ; 0F 00 /0 [286,PRIV] + +\c{SGDT} and \c{SIDT} both take a 6-byte memory area as an operand: +they store the contents of the GDTR (global descriptor table +register) or IDTR (interrupt descriptor table register) into that +area as a 32-bit linear address and a 16-bit size limit from that +area (in that order). These are the only instructions which directly +use \e{linear} addresses, rather than segment/offset pairs. + +\c{SLDT} stores the segment selector corresponding to the LDT (local +descriptor table) into the given operand. + +See also \c{LGDT}, \c{LIDT} and \c{LLDT} (\k{insLGDT}). + + +\S{insSHL} \i\c{SHL}, \i\c{SHR}: Bitwise Logical Shifts + +\c SHL r/m8,1 ; D0 /4 [8086] +\c SHL r/m8,CL ; D2 /4 [8086] +\c SHL r/m8,imm8 ; C0 /4 ib [186] +\c SHL r/m16,1 ; o16 D1 /4 [8086] +\c SHL r/m16,CL ; o16 D3 /4 [8086] +\c SHL r/m16,imm8 ; o16 C1 /4 ib [186] +\c SHL r/m32,1 ; o32 D1 /4 [386] +\c SHL r/m32,CL ; o32 D3 /4 [386] +\c SHL r/m32,imm8 ; o32 C1 /4 ib [386] + +\c SHR r/m8,1 ; D0 /5 [8086] +\c SHR r/m8,CL ; D2 /5 [8086] +\c SHR r/m8,imm8 ; C0 /5 ib [186] +\c SHR r/m16,1 ; o16 D1 /5 [8086] +\c SHR r/m16,CL ; o16 D3 /5 [8086] +\c SHR r/m16,imm8 ; o16 C1 /5 ib [186] +\c SHR r/m32,1 ; o32 D1 /5 [386] +\c SHR r/m32,CL ; o32 D3 /5 [386] +\c SHR r/m32,imm8 ; o32 C1 /5 ib [386] + +\c{SHL} and \c{SHR} perform a logical shift operation on the given +source/destination (first) operand. The vacated bits are filled with +zero. + +A synonym for \c{SHL} is \c{SAL} (see \k{insSAL}). 
NASM will +assemble either one to the same code, but NDISASM will always +disassemble that code as \c{SHL}. + +The number of bits to shift by is given by the second operand. Only +the bottom five bits of the shift count are considered by processors +above the 8086. + +You can force the longer (286 and upwards, beginning with a \c{C1} +byte) form of \c{SHL foo,1} by using a \c{BYTE} prefix: \c{SHL +foo,BYTE 1}. Similarly with \c{SHR}. + + +\S{insSHLD} \i\c{SHLD}, \i\c{SHRD}: Bitwise Double-Precision Shifts + +\c SHLD r/m16,reg16,imm8 ; o16 0F A4 /r ib [386] +\c SHLD r/m32,reg32,imm8 ; o32 0F A4 /r ib [386] +\c SHLD r/m16,reg16,CL ; o16 0F A5 /r [386] +\c SHLD r/m32,reg32,CL ; o32 0F A5 /r [386] + +\c SHRD r/m16,reg16,imm8 ; o16 0F AC /r ib [386] +\c SHRD r/m32,reg32,imm8 ; o32 0F AC /r ib [386] +\c SHRD r/m16,reg16,CL ; o16 0F AD /r [386] +\c SHRD r/m32,reg32,CL ; o32 0F AD /r [386] + +\b \c{SHLD} performs a double-precision left shift. It notionally +places its second operand to the right of its first, then shifts +the entire bit string thus generated to the left by a number of +bits specified in the third operand. It then updates only the +\e{first} operand according to the result of this. The second +operand is not modified. + +\b \c{SHRD} performs the corresponding right shift: it notionally +places the second operand to the \e{left} of the first, shifts the +whole bit string right, and updates only the first operand. + +For example, if \c{EAX} holds \c{0x01234567} and \c{EBX} holds +\c{0x89ABCDEF}, then the instruction \c{SHLD EAX,EBX,4} would update +\c{EAX} to hold \c{0x12345678}. Under the same conditions, \c{SHRD +EAX,EBX,4} would update \c{EAX} to hold \c{0xF0123456}. + +The number of bits to shift by is given by the third operand. Only +the bottom five bits of the shift count are considered.
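The double-precision shifts can be checked against the worked example above with a small reference model. This is a sketch of the 32-bit forms only, under the description given here; the names \c{shld32}, \c{shrd32} and \c{MASK32} are hypothetical, and flag effects are not modelled.

```python
MASK32 = 0xFFFFFFFF


def shld32(dst, src, count):
    """Model of 32-bit SHLD: shift dst:src left, return the new dst."""
    count &= 0x1F                                        # only the bottom five bits count
    combined = ((dst & MASK32) << 32) | (src & MASK32)   # src sits to the right of dst
    return ((combined << count) >> 32) & MASK32


def shrd32(dst, src, count):
    """Model of 32-bit SHRD: shift src:dst right, return the new dst."""
    count &= 0x1F
    combined = ((src & MASK32) << 32) | (dst & MASK32)   # src sits to the left of dst
    return (combined >> count) & MASK32
```

With \c{EAX} as \c{0x01234567} and \c{EBX} as \c{0x89ABCDEF}, \c{shld32(0x01234567, 0x89ABCDEF, 4)} gives \c{0x12345678} and \c{shrd32(0x01234567, 0x89ABCDEF, 4)} gives \c{0xF0123456}, matching the example.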
+ + +\S{insSHUFPD} \i\c{SHUFPD}: Shuffle Packed Double-Precision FP Values + +\c SHUFPD xmm1,xmm2/m128,imm8 ; 66 0F C6 /r ib [WILLAMETTE,SSE2] + +\c{SHUFPD} moves one of the packed double-precision FP values from +the destination operand into the low quadword of the destination +operand; the upper quadword is generated by moving one of the +double-precision FP values from the source operand into the +destination. The select (third) operand selects which of the values +are moved to the destination register. + +The select operand is an 8-bit immediate: bit 0 selects which value +is moved from the destination operand to the result (where 0 selects +the low quadword and 1 selects the high quadword) and bit 1 selects +which value is moved from the source operand to the result. +Bits 2 through 7 of the select operand are reserved. + + +\S{insSHUFPS} \i\c{SHUFPS}: Shuffle Packed Single-Precision FP Values + +\c SHUFPS xmm1,xmm2/m128,imm8 ; 0F C6 /r ib [KATMAI,SSE] + +\c{SHUFPS} moves two of the packed single-precision FP values from +the destination operand into the low quadword of the destination +operand; the upper quadword is generated by moving two of the +single-precision FP values from the source operand into the +destination. The select (third) operand selects which of the +values are moved to the destination register. + +The select operand is an 8-bit immediate: bits 0 and 1 select the +value to be moved from the destination operand to the low doubleword of +the result, bits 2 and 3 select the value to be moved from the +destination operand to the second doubleword of the result, bits 4 and +5 select the value to be moved from the source operand to the third +doubleword of the result, and bits 6 and 7 select the value to be +moved from the source operand to the high doubleword of the result. + + +\S{insSMI} \i\c{SMI}: System Management Interrupt + +\c SMI ; F1 [386,UNDOC] + +\c{SMI} puts some AMD processors into SMM mode.
It is available on some +386 and 486 processors, and is only available when DR7 bit 12 is set, +otherwise it generates an Int 1. + + +\S{insSMINT} \i\c{SMINT}, \i\c{SMINTOLD}: Software SMM Entry (CYRIX) + +\c SMINT ; 0F 38 [PENT,CYRIX] +\c SMINTOLD ; 0F 7E [486,CYRIX] + +\c{SMINT} puts the processor into SMM mode. The CPU state information is +saved in the SMM memory header, and then execution begins at the SMM base +address. + +\c{SMINTOLD} is the same as \c{SMINT}, but was the opcode used on the 486. + +This pair of opcodes are specific to the Cyrix and compatible range of +processors (Cyrix, IBM, Via). + + +\S{insSMSW} \i\c{SMSW}: Store Machine Status Word + +\c SMSW r/m16 ; 0F 01 /4 [286,PRIV] + +\c{SMSW} stores the bottom half of the \c{CR0} control register (or +the Machine Status Word, on 286 processors) into the destination +operand. See also \c{LMSW} (\k{insLMSW}). + +For 32-bit code, this would store all of \c{CR0} in the specified +register (or the bottom 16 bits if the destination is a memory location), + without needing an operand size override byte. + + +\S{insSQRTPD} \i\c{SQRTPD}: Packed Double-Precision FP Square Root + +\c SQRTPD xmm1,xmm2/m128 ; 66 0F 51 /r [WILLAMETTE,SSE2] + +\c{SQRTPD} calculates the square root of the packed double-precision +FP value from the source operand, and stores the double-precision +results in the destination register. + + +\S{insSQRTPS} \i\c{SQRTPS}: Packed Single-Precision FP Square Root + +\c SQRTPS xmm1,xmm2/m128 ; 0F 51 /r [KATMAI,SSE] + +\c{SQRTPS} calculates the square root of the packed single-precision +FP value from the source operand, and stores the single-precision +results in the destination register. 
+ + +\S{insSQRTSD} \i\c{SQRTSD}: Scalar Double-Precision FP Square Root + +\c SQRTSD xmm1,xmm2/m128 ; F2 0F 51 /r [WILLAMETTE,SSE2] + +\c{SQRTSD} calculates the square root of the low-order double-precision +FP value from the source operand, and stores the double-precision +result in the destination register. The high quadword remains unchanged. + + +\S{insSQRTSS} \i\c{SQRTSS}: Scalar Single-Precision FP Square Root + +\c SQRTSS xmm1,xmm2/m128 ; F3 0F 51 /r [KATMAI,SSE] + +\c{SQRTSS} calculates the square root of the low-order single-precision +FP value from the source operand, and stores the single-precision +result in the destination register. The three high doublewords remain +unchanged. + + +\S{insSTC} \i\c{STC}, \i\c{STD}, \i\c{STI}: Set Flags + +\c STC ; F9 [8086] +\c STD ; FD [8086] +\c STI ; FB [8086] + +These instructions set various flags. \c{STC} sets the carry flag; +\c{STD} sets the direction flag; and \c{STI} sets the interrupt flag +(thus enabling interrupts). + +To clear the carry, direction, or interrupt flags, use the \c{CLC}, +\c{CLD} and \c{CLI} instructions (\k{insCLC}). To invert the carry +flag, use \c{CMC} (\k{insCMC}). + + +\S{insSTMXCSR} \i\c{STMXCSR}: Store Streaming SIMD Extension + Control/Status + +\c STMXCSR m32 ; 0F AE /3 [KATMAI,SSE] + +\c{STMXCSR} stores the contents of the \c{MXCSR} control/status +register to the specified memory location. \c{MXCSR} is used to +enable masked/unmasked exception handling, to set rounding modes, +to set flush-to-zero mode, and to view exception status flags. +The reserved bits in the \c{MXCSR} register are stored as 0s. + +For details of the \c{MXCSR} register, see the Intel processor docs. + +See also \c{LDMXCSR} (\k{insLDMXCSR}). + + +\S{insSTOSB} \i\c{STOSB}, \i\c{STOSW}, \i\c{STOSD}: Store Byte to String + +\c STOSB ; AA [8086] +\c STOSW ; o16 AB [8086] +\c STOSD ; o32 AB [386] + +\c{STOSB} stores the byte in \c{AL} at \c{[ES:DI]} or \c{[ES:EDI]}; +it does not affect the flags.
It then increments or decrements +(depending on the direction flag: increments if the flag is clear, +decrements if it is set) \c{DI} (or \c{EDI}). + +The register used is \c{DI} if the address size is 16 bits, and +\c{EDI} if it is 32 bits. If you need to use an address size not +equal to the current \c{BITS} setting, you can use an explicit +\i\c{a16} or \i\c{a32} prefix. + +Segment override prefixes have no effect for this instruction: the +use of \c{ES} for the store to \c{[DI]} or \c{[EDI]} cannot be +overridden. + +\c{STOSW} and \c{STOSD} work in the same way, but they store the +word in \c{AX} or the doubleword in \c{EAX} instead of the byte in +\c{AL}, and increment or decrement the addressing registers by 2 or +4 instead of 1. + +The \c{REP} prefix may be used to repeat the instruction \c{CX} (or +\c{ECX} - again, the address size chooses which) times. + + +\S{insSTR} \i\c{STR}: Store Task Register + +\c STR r/m16 ; 0F 00 /1 [286,PRIV] + +\c{STR} stores the segment selector corresponding to the contents of +the Task Register into its operand. When the operand size is 32 bit and +the destination is a register, the upper 16-bits are cleared to 0s. +When the destination operand is a memory location, 16 bits are +written regardless of the operand size. 
+ + +\S{insSUB} \i\c{SUB}: Subtract Integers + +\c SUB r/m8,reg8 ; 28 /r [8086] +\c SUB r/m16,reg16 ; o16 29 /r [8086] +\c SUB r/m32,reg32 ; o32 29 /r [386] + +\c SUB reg8,r/m8 ; 2A /r [8086] +\c SUB reg16,r/m16 ; o16 2B /r [8086] +\c SUB reg32,r/m32 ; o32 2B /r [386] + +\c SUB r/m8,imm8 ; 80 /5 ib [8086] +\c SUB r/m16,imm16 ; o16 81 /5 iw [8086] +\c SUB r/m32,imm32 ; o32 81 /5 id [386] + +\c SUB r/m16,imm8 ; o16 83 /5 ib [8086] +\c SUB r/m32,imm8 ; o32 83 /5 ib [386] + +\c SUB AL,imm8 ; 2C ib [8086] +\c SUB AX,imm16 ; o16 2D iw [8086] +\c SUB EAX,imm32 ; o32 2D id [386] + +\c{SUB} performs integer subtraction: it subtracts its second +operand from its first, and leaves the result in its destination +(first) operand. The flags are set according to the result of the +operation: in particular, the carry flag is affected and can be used +by a subsequent \c{SBB} instruction (\k{insSBB}). + +In the forms with an 8-bit immediate second operand and a longer +first operand, the second operand is considered to be signed, and is +sign-extended to the length of the first operand. In these cases, +the \c{BYTE} qualifier is necessary to force NASM to generate this +form of the instruction. + + +\S{insSUBPD} \i\c{SUBPD}: Packed Double-Precision FP Subtract + +\c SUBPD xmm1,xmm2/m128 ; 66 0F 5C /r [WILLAMETTE,SSE2] + +\c{SUBPD} subtracts the packed double-precision FP values of +the source operand from those of the destination operand, and +stores the result in the destination operand. + + +\S{insSUBPS} \i\c{SUBPS}: Packed Single-Precision FP Subtract + +\c SUBPS xmm1,xmm2/m128 ; 0F 5C /r [KATMAI,SSE] + +\c{SUBPS} subtracts the packed single-precision FP values of +the source operand from those of the destination operand, and +stores the result in the destination operand.
+ + +\S{insSUBSD} \i\c{SUBSD}: Scalar Double-FP Subtract + +\c SUBSD xmm1,xmm2/m128 ; F2 0F 5C /r [WILLAMETTE,SSE2] + +\c{SUBSD} subtracts the low-order double-precision FP value of +the source operand from that of the destination operand, and +stores the result in the destination operand. The high +quadword is unchanged. + + +\S{insSUBSS} \i\c{SUBSS}: Scalar Single-FP Subtract + +\c SUBSS xmm1,xmm2/m128 ; F3 0F 5C /r [KATMAI,SSE] + +\c{SUBSS} subtracts the low-order single-precision FP value of +the source operand from that of the destination operand, and +stores the result in the destination operand. The three high +doublewords are unchanged. + + +\S{insSVDC} \i\c{SVDC}: Save Segment Register and Descriptor + +\c SVDC m80,segreg ; 0F 78 /r [486,CYRIX,SMM] + +\c{SVDC} saves a segment register (DS, ES, FS, GS, or SS) and its +descriptor to mem80. + + +\S{insSVLDT} \i\c{SVLDT}: Save LDTR and Descriptor + +\c SVLDT m80 ; 0F 7A /0 [486,CYRIX,SMM] + +\c{SVLDT} saves the Local Descriptor Table Register (LDTR) to mem80. + + +\S{insSVTS} \i\c{SVTS}: Save TSR and Descriptor + +\c SVTS m80 ; 0F 7C /0 [486,CYRIX,SMM] + +\c{SVTS} saves the Task State Register (TSR) to mem80. + + +\S{insSYSCALL} \i\c{SYSCALL}: Call Operating System + +\c SYSCALL ; 0F 05 [P6,AMD] + +\c{SYSCALL} provides a fast method of transferring control to a fixed +entry point in an operating system. + +\b The \c{EIP} register is copied into the \c{ECX} register. + +\b Bits [31-0] of the 64-bit SYSCALL/SYSRET Target Address Register +(\c{STAR}) are copied into the \c{EIP} register. + +\b Bits [47-32] of the \c{STAR} register specify the selector that is +copied into the \c{CS} register. + +\b Bits [47-32]+1000b of the \c{STAR} register specify the selector that +is copied into the SS register. + +The \c{CS} and \c{SS} registers should not be modified by the operating +system between the execution of the \c{SYSCALL} instruction and its +corresponding \c{SYSRET} instruction.
+ +For more information, see the \c{SYSCALL and SYSRET Instruction Specification} +(AMD document number 21086.pdf). + + +\S{insSYSENTER} \i\c{SYSENTER}: Fast System Call + +\c SYSENTER ; 0F 34 [P6] + +\c{SYSENTER} executes a fast call to a level 0 system procedure or +routine. Before using this instruction, various MSRs need to be set +up: + +\b \c{SYSENTER_CS_MSR} contains the 32-bit segment selector for the +privilege level 0 code segment. (This value is also used to compute +the segment selector of the privilege level 0 stack segment.) + +\b \c{SYSENTER_EIP_MSR} contains the 32-bit offset into the privilege +level 0 code segment to the first instruction of the selected operating +procedure or routine. + +\b \c{SYSENTER_ESP_MSR} contains the 32-bit stack pointer for the +privilege level 0 stack. + +\c{SYSENTER} performs the following sequence of operations: + +\b Loads the segment selector from the \c{SYSENTER_CS_MSR} into the +\c{CS} register. + +\b Loads the instruction pointer from the \c{SYSENTER_EIP_MSR} into +the \c{EIP} register. + +\b Adds 8 to the value in \c{SYSENTER_CS_MSR} and loads it into the +\c{SS} register. + +\b Loads the stack pointer from the \c{SYSENTER_ESP_MSR} into the +\c{ESP} register. + +\b Switches to privilege level 0. + +\b Clears the \c{VM} flag in the \c{EFLAGS} register, if the flag +is set. + +\b Begins executing the selected system procedure. + +In particular, note that this instruction does not save the values of +\c{CS} or \c{(E)IP}. If you need to return to the calling code, you +need to write your code to cater for this. + +For more information, see the Intel Architecture Software Developer's +Manual, Volume 2. + + +\S{insSYSEXIT} \i\c{SYSEXIT}: Fast Return From System Call + +\c SYSEXIT ; 0F 35 [P6,PRIV] + +\c{SYSEXIT} executes a fast return to privilege level 3 user code. +This instruction is a companion instruction to the \c{SYSENTER} +instruction, and can only be executed by privilege level 0 code.
+Various registers need to be set up before calling this instruction: + +\b \c{SYSENTER_CS_MSR} contains the 32-bit segment selector for the +privilege level 0 code segment in which the processor is currently +executing. (This value is used to compute the segment selectors for +the privilege level 3 code and stack segments.) + +\b \c{EDX} contains the 32-bit offset into the privilege level 3 code +segment to the first instruction to be executed in the user code. + +\b \c{ECX} contains the 32-bit stack pointer for the privilege level 3 +stack. + +\c{SYSEXIT} performs the following sequence of operations: + +\b Adds 16 to the value in \c{SYSENTER_CS_MSR} and loads the sum into +the \c{CS} selector register. + +\b Loads the instruction pointer from the \c{EDX} register into the +\c{EIP} register. + +\b Adds 24 to the value in \c{SYSENTER_CS_MSR} and loads the sum +into the \c{SS} selector register. + +\b Loads the stack pointer from the \c{ECX} register into the \c{ESP} +register. + +\b Switches to privilege level 3. + +\b Begins executing the user code at the \c{EIP} address. + +For more information on the use of the \c{SYSENTER} and \c{SYSEXIT} +instructions, see the Intel Architecture Software Developer's +Manual, Volume 2. + + +\S{insSYSRET} \i\c{SYSRET}: Return From Operating System + +\c SYSRET ; 0F 07 [P6,AMD,PRIV] + +\c{SYSRET} is the return instruction used in conjunction with the +\c{SYSCALL} instruction to provide fast entry/exit to an operating system. + +\b The \c{ECX} register, which points to the next sequential instruction +after the corresponding \c{SYSCALL} instruction, is copied into the \c{EIP} +register. + +\b Bits [63-48] of the \c{STAR} register specify the selector that is copied +into the \c{CS} register. + +\b Bits [63-48]+1000b of the \c{STAR} register specify the selector that is +copied into the \c{SS} register. 
+ +\b Bits [1-0] of the \c{SS} register are set to 11b (RPL of 3) regardless of +the value of bits [49-48] of the \c{STAR} register. + +The \c{CS} and \c{SS} registers should not be modified by the operating +system between the execution of the \c{SYSCALL} instruction and its +corresponding \c{SYSRET} instruction. + +For more information, see the \c{SYSCALL and SYSRET Instruction Specification} +(AMD document number 21086.pdf). + + +\S{insTEST} \i\c{TEST}: Test Bits (notional bitwise AND) + +\c TEST r/m8,reg8 ; 84 /r [8086] +\c TEST r/m16,reg16 ; o16 85 /r [8086] +\c TEST r/m32,reg32 ; o32 85 /r [386] + +\c TEST r/m8,imm8 ; F6 /0 ib [8086] +\c TEST r/m16,imm16 ; o16 F7 /0 iw [8086] +\c TEST r/m32,imm32 ; o32 F7 /0 id [386] + +\c TEST AL,imm8 ; A8 ib [8086] +\c TEST AX,imm16 ; o16 A9 iw [8086] +\c TEST EAX,imm32 ; o32 A9 id [386] + +\c{TEST} performs a `mental' bitwise AND of its two operands, and +affects the flags as if the operation had taken place, but does not +store the result of the operation anywhere. + + +\S{insUCOMISD} \i\c{UCOMISD}: Unordered Scalar Double-Precision FP +compare and set EFLAGS + +\c UCOMISD xmm1,xmm2/m128 ; 66 0F 2E /r [WILLAMETTE,SSE2] + +\c{UCOMISD} compares the low-order double-precision FP numbers in the +two operands, and sets the \c{ZF}, \c{PF} and \c{CF} bits in the +\c{EFLAGS} register. In addition, the \c{OF}, \c{SF} and \c{AF} bits +in the \c{EFLAGS} register are zeroed out. The unordered predicate +(\c{ZF}, \c{PF} and \c{CF} all set) is returned if either source +operand is a \c{NaN} (\c{qNaN} or \c{sNaN}). + + +\S{insUCOMISS} \i\c{UCOMISS}: Unordered Scalar Single-Precision FP +compare and set EFLAGS + +\c UCOMISS xmm1,xmm2/m128 ; 0F 2E /r [KATMAI,SSE] + +\c{UCOMISS} compares the low-order single-precision FP numbers in the +two operands, and sets the \c{ZF}, \c{PF} and \c{CF} bits in the +\c{EFLAGS} register. In addition, the \c{OF}, \c{SF} and \c{AF} bits +in the \c{EFLAGS} register are zeroed out. 
The unordered predicate +(\c{ZF}, \c{PF} and \c{CF} all set) is returned if either source +operand is a \c{NaN} (\c{qNaN} or \c{sNaN}). + + +\S{insUD2} \i\c{UD0}, \i\c{UD1}, \i\c{UD2}: Undefined Instruction + +\c UD0 ; 0F FF [186,UNDOC] +\c UD1 ; 0F B9 [186,UNDOC] +\c UD2 ; 0F 0B [186] + +\c{UDx} can be used to generate an invalid opcode exception, for testing +purposes. + +\c{UD0} is specifically documented by AMD as being reserved for this +purpose. + +\c{UD1} is documented by Intel as being available for this purpose. + +\c{UD2} is specifically documented by Intel as being reserved for this +purpose. Intel document this as the preferred method of generating an +invalid opcode exception. + +All these opcodes can be used to generate invalid opcode exceptions on +all currently available processors. + + +\S{insUMOV} \i\c{UMOV}: User Move Data + +\c UMOV r/m8,reg8 ; 0F 10 /r [386,UNDOC] +\c UMOV r/m16,reg16 ; o16 0F 11 /r [386,UNDOC] +\c UMOV r/m32,reg32 ; o32 0F 11 /r [386,UNDOC] + +\c UMOV reg8,r/m8 ; 0F 12 /r [386,UNDOC] +\c UMOV reg16,r/m16 ; o16 0F 13 /r [386,UNDOC] +\c UMOV reg32,r/m32 ; o32 0F 13 /r [386,UNDOC] + +This undocumented instruction is used by in-circuit emulators to +access user memory (as opposed to host memory). It is used just like +an ordinary memory/register or register/register \c{MOV} +instruction, but accesses user space. + +This instruction is only available on some AMD and IBM 386 and 486 +processors. + + +\S{insUNPCKHPD} \i\c{UNPCKHPD}: Unpack and Interleave High Packed +Double-Precision FP Values + +\c UNPCKHPD xmm1,xmm2/m128 ; 66 0F 15 /r [WILLAMETTE,SSE2] + +\c{UNPCKHPD} performs an interleaved unpack of the high-order data +elements of the source and destination operands, saving the result +in \c{xmm1}. It ignores the lower half of the sources. + +The operation of this instruction is: + +\c dst[63-0] := dst[127-64]; +\c dst[127-64] := src[127-64]. 
+ + +\S{insUNPCKHPS} \i\c{UNPCKHPS}: Unpack and Interleave High Packed +Single-Precision FP Values + +\c UNPCKHPS xmm1,xmm2/m128 ; 0F 15 /r [KATMAI,SSE] + +\c{UNPCKHPS} performs an interleaved unpack of the high-order data +elements of the source and destination operands, saving the result +in \c{xmm1}. It ignores the lower half of the sources. + +The operation of this instruction is: + +\c dst[31-0] := dst[95-64]; +\c dst[63-32] := src[95-64]; +\c dst[95-64] := dst[127-96]; +\c dst[127-96] := src[127-96]. + + +\S{insUNPCKLPD} \i\c{UNPCKLPD}: Unpack and Interleave Low Packed +Double-Precision FP Data + +\c UNPCKLPD xmm1,xmm2/m128 ; 66 0F 14 /r [WILLAMETTE,SSE2] + +\c{UNPCKLPD} performs an interleaved unpack of the low-order data +elements of the source and destination operands, saving the result +in \c{xmm1}. It ignores the upper half of the sources. + +The operation of this instruction is: + +\c dst[63-0] := dst[63-0]; +\c dst[127-64] := src[63-0]. + + +\S{insUNPCKLPS} \i\c{UNPCKLPS}: Unpack and Interleave Low Packed +Single-Precision FP Data + +\c UNPCKLPS xmm1,xmm2/m128 ; 0F 14 /r [KATMAI,SSE] + +\c{UNPCKLPS} performs an interleaved unpack of the low-order data +elements of the source and destination operands, saving the result +in \c{xmm1}. It ignores the upper half of the sources. + +The operation of this instruction is: + +\c dst[31-0] := dst[31-0]; +\c dst[63-32] := src[31-0]; +\c dst[95-64] := dst[63-32]; +\c dst[127-96] := src[63-32]. + + +\S{insVERR} \i\c{VERR}, \i\c{VERW}: Verify Segment Readability/Writability + +\c VERR r/m16 ; 0F 00 /4 [286,PRIV] + +\c VERW r/m16 ; 0F 00 /5 [286,PRIV] + +\b \c{VERR} sets the zero flag if the segment specified by the selector +in its operand can be read from at the current privilege level. +Otherwise it is cleared. + +\b \c{VERW} sets the zero flag if the segment can be written.
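As a worked illustration of the \c{UNPCKLPS} and \c{UNPCKHPS} operations listed in their entries, the sketch below interleaves two four-element vectors. It assumes \c{BITS 32} code on an SSE-capable processor; the labels \c{va}, \c{vb} and \c{interleave} are hypothetical, and \c{MOVUPS} is used so that no alignment of the data is assumed.

```nasm
        bits 32

section .data
va:     dd 1.0, 2.0, 3.0, 4.0   ; elements a0..a3, lowest doubleword first
vb:     dd 5.0, 6.0, 7.0, 8.0   ; elements b0..b3

section .text
interleave:
        movups  xmm1, [vb]      ; source operand for both unpacks

        movups  xmm0, [va]
        unpcklps xmm0, xmm1     ; xmm0 = a0, b0, a1, b1 (low to high)
                                ;      = 1.0, 5.0, 2.0, 6.0

        movups  xmm2, [va]
        unpckhps xmm2, xmm1     ; xmm2 = a2, b2, a3, b3 (low to high)
                                ;      = 3.0, 7.0, 4.0, 8.0
        ret
```

The destination is reloaded before the second unpack because each instruction overwrites its first operand.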
+ + +\S{insWAIT} \i\c{WAIT}: Wait for Floating-Point Processor + +\c WAIT ; 9B [8086] +\c FWAIT ; 9B [8086] + +\c{WAIT}, on 8086 systems with a separate 8087 FPU, waits for the +FPU to have finished any operation it is engaged in before +continuing main processor operations, so that (for example) an FPU +store to main memory can be guaranteed to have completed before the +CPU tries to read the result back out. + +On higher processors, \c{WAIT} is unnecessary for this purpose, and +it has the alternative purpose of ensuring that any pending unmasked +FPU exceptions have happened before execution continues. + + +\S{insWBINVD} \i\c{WBINVD}: Write Back and Invalidate Cache + +\c WBINVD ; 0F 09 [486] + +\c{WBINVD} invalidates and empties the processor's internal caches, +and causes the processor to instruct external caches to do the same. +It writes the contents of the caches back to memory first, so no +data is lost. To flush the caches quickly without bothering to write +the data back first, use \c{INVD} (\k{insINVD}). + + +\S{insWRMSR} \i\c{WRMSR}: Write Model-Specific Registers + +\c WRMSR ; 0F 30 [PENT] + +\c{WRMSR} writes the value in \c{EDX:EAX} to the processor +Model-Specific Register (MSR) whose index is stored in \c{ECX}. +See also \c{RDMSR} (\k{insRDMSR}). + + +\S{insWRSHR} \i\c{WRSHR}: Write SMM Header Pointer Register + +\c WRSHR r/m32 ; 0F 37 /0 [386,CYRIX,SMM] + +\c{WRSHR} loads the contents of either a 32-bit memory location or a +32-bit register into the SMM header pointer register. + +See also \c{RDSHR} (\k{insRDSHR}). + + +\S{insXADD} \i\c{XADD}: Exchange and Add + +\c XADD r/m8,reg8 ; 0F C0 /r [486] +\c XADD r/m16,reg16 ; o16 0F C1 /r [486] +\c XADD r/m32,reg32 ; o32 0F C1 /r [486] + +\c{XADD} exchanges the values in its two operands, and then adds +them together and writes the result into the destination (first) +operand. This instruction can be used with a \c{LOCK} prefix for +multi-processor synchronisation purposes. 
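The use of \c{XADD} with a \c{LOCK} prefix mentioned above can be sketched as an atomic fetch-and-add. This is a minimal example under stated assumptions: \c{BITS 32} code, and a shared doubleword named \c{counter}; both the variable and the routine name \c{fetch_and_inc} are hypothetical.

```nasm
        bits 32

section .data
counter: dd 0                   ; shared counter (hypothetical)

section .text
; Returns the previous value of 'counter' in EAX and increments it.
fetch_and_inc:
        mov     eax, 1          ; amount to add
        lock xadd [counter], eax ; counter := counter + EAX,
                                ; EAX := old value of counter
        ret
```

Because the old value comes back in the register operand, each caller receives a distinct value even when several processors execute the routine at the same time.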
+ + +\S{insXBTS} \i\c{XBTS}: Extract Bit String + +\c XBTS reg16,r/m16 ; o16 0F A6 /r [386,UNDOC] +\c XBTS reg32,r/m32 ; o32 0F A6 /r [386,UNDOC] + +The implied operation of this instruction is: + +\c XBTS r/m16,reg16,AX,CL +\c XBTS r/m32,reg32,EAX,CL + +Writes a bit string from the source operand to the destination. \c{CL} +indicates the number of bits to be copied, and \c{(E)AX} indicates the +low order bit offset in the source. The bits are written to the low +order bits of the destination register. For example, if \c{CL} is set +to 4 and \c{AX} (for 16-bit code) is set to 5, bits 5-8 of \c{src} will +be copied to bits 0-3 of \c{dst}. This instruction is very poorly +documented, and I have been unable to find any official source of +documentation on it. + +\c{XBTS} is supported only on the early Intel 386s, and conflicts with +the opcodes for \c{CMPXCHG486} (on early Intel 486s). NASM supports it +only for completeness. Its counterpart is \c{IBTS} (see \k{insIBTS}). + + +\S{insXCHG} \i\c{XCHG}: Exchange + +\c XCHG reg8,r/m8 ; 86 /r [8086] +\c XCHG reg16,r/m16 ; o16 87 /r [8086] +\c XCHG reg32,r/m32 ; o32 87 /r [386] + +\c XCHG r/m8,reg8 ; 86 /r [8086] +\c XCHG r/m16,reg16 ; o16 87 /r [8086] +\c XCHG r/m32,reg32 ; o32 87 /r [386] + +\c XCHG AX,reg16 ; o16 90+r [8086] +\c XCHG EAX,reg32 ; o32 90+r [386] +\c XCHG reg16,AX ; o16 90+r [8086] +\c XCHG reg32,EAX ; o32 90+r [386] + +\c{XCHG} exchanges the values in its two operands. It can be used +with a \c{LOCK} prefix for purposes of multi-processor +synchronisation. + +\c{XCHG AX,AX} or \c{XCHG EAX,EAX} (depending on the \c{BITS} +setting) generates the opcode \c{90h}, and so is a synonym for +\c{NOP} (\k{insNOP}). + + +\S{insXLATB} \i\c{XLATB}: Translate Byte in Lookup Table + +\c XLAT ; D7 [8086] +\c XLATB ; D7 [8086] + +\c{XLATB} adds the value in \c{AL}, treated as an unsigned byte, to +\c{BX} or \c{EBX}, and loads the byte from the resulting address (in +the segment specified by \c{DS}) back into \c{AL}.
+ +The base register used is \c{BX} if the address size is 16 bits, and +\c{EBX} if it is 32 bits. If you need to use an address size not +equal to the current \c{BITS} setting, you can use an explicit +\i\c{a16} or \i\c{a32} prefix. + +The segment register used to load from \c{[BX+AL]} or \c{[EBX+AL]} +can be overridden by using a segment register name as a prefix (for +example, \c{es xlatb}). + + +\S{insXOR} \i\c{XOR}: Bitwise Exclusive OR + +\c XOR r/m8,reg8 ; 30 /r [8086] +\c XOR r/m16,reg16 ; o16 31 /r [8086] +\c XOR r/m32,reg32 ; o32 31 /r [386] + +\c XOR reg8,r/m8 ; 32 /r [8086] +\c XOR reg16,r/m16 ; o16 33 /r [8086] +\c XOR reg32,r/m32 ; o32 33 /r [386] + +\c XOR r/m8,imm8 ; 80 /6 ib [8086] +\c XOR r/m16,imm16 ; o16 81 /6 iw [8086] +\c XOR r/m32,imm32 ; o32 81 /6 id [386] + +\c XOR r/m16,imm8 ; o16 83 /6 ib [8086] +\c XOR r/m32,imm8 ; o32 83 /6 ib [386] + +\c XOR AL,imm8 ; 34 ib [8086] +\c XOR AX,imm16 ; o16 35 iw [8086] +\c XOR EAX,imm32 ; o32 35 id [386] + +\c{XOR} performs a bitwise XOR operation between its two operands +(i.e. each bit of the result is 1 if and only if exactly one of the +corresponding bits of the two inputs was 1), and stores the result +in the destination (first) operand. + +In the forms with an 8-bit immediate second operand and a longer +first operand, the second operand is considered to be signed, and is +sign-extended to the length of the first operand. In these cases, +the \c{BYTE} qualifier is necessary to force NASM to generate this +form of the instruction. + +The \c{MMX} instruction \c{PXOR} (see \k{insPXOR}) performs the same +operation on the 64-bit \c{MMX} registers. + + +\S{insXORPD} \i\c{XORPD}: Bitwise Logical XOR of Double-Precision FP Values + +\c XORPD xmm1,xmm2/m128 ; 66 0F 57 /r [WILLAMETTE,SSE2] + +\c{XORPD} returns a bit-wise logical XOR between the source and +destination operands, storing the result in the destination operand. 
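The table lookup performed by \c{XLATB} can be sketched as a nibble-to-hex-digit conversion. This is an illustration only, assuming \c{BITS 32} code (so \c{EBX} is the base register) and the default \c{DS} segment; the table \c{hexdigits} and the routine \c{nibble_to_hex} are hypothetical names.

```nasm
        bits 32

section .data
hexdigits: db '0123456789ABCDEF' ; 16-entry lookup table

section .text
; In:  AL = value, of which only the low four bits are used
; Out: AL = corresponding ASCII hex digit
nibble_to_hex:
        push    ebx             ; preserve the caller's EBX
        mov     ebx, hexdigits  ; DS:EBX -> table base
        and     al, 0x0F        ; keep the index within the table
        xlatb                   ; AL := byte at [DS:EBX + AL]
        pop     ebx
        ret
```

If the table lived in a different segment, the instruction would be written with a segment prefix, for example \c{es xlatb}, as described in the \c{XLATB} entry.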
+ + +\S{insXORPS} \i\c{XORPS}: Bitwise Logical XOR of Single-Precision FP Values + +\c XORPS xmm1,xmm2/m128 ; 0F 57 /r [KATMAI,SSE] + +\c{XORPS} returns a bit-wise logical XOR between the source and +destination operands, storing the result in the destination operand. + + diff --git a/doc/nasmdoc.src b/doc/nasmdoc.src index ee8c0f6..197011a 100644 --- a/doc/nasmdoc.src +++ b/doc/nasmdoc.src @@ -4,7 +4,7 @@ \# \M{category}{Programming} \M{title}{NASM - The Netwide Assembler} -\M{year}{2003} +\M{year}{2007} \M{author}{The NASM Development Team} \M{license}{All rights reserved. This document is redistributable under the license given in the file "COPYING" distributed in the NASM archive.} \M{summary}{This file documents NASM, the Netwide Assembler: an assembler targetting the Intel x86 series of processors, with portable source.} @@ -1096,9 +1096,11 @@ they can be \i{effective addresses} (see \k{effaddr}), constants For \i{floating-point} instructions, NASM accepts a wide range of syntaxes: you can use two-operand forms like MASM supports, or you -can use NASM's native single-operand forms in most cases. Details of -all forms of each supported instruction are given in -\k{iref}. For example, you can code: +can use NASM's native single-operand forms in most cases. +\# Details of +\# all forms of each supported instruction are given in +\# \k{iref}. +For example, you can code: \c fadd st1 ; this sets st0 := st0 + st1 \c fadd st0,st1 ; so does this @@ -1304,6 +1306,11 @@ fact, it will also split \c{[eax*2+offset]} into the \c{NOSPLIT} keyword: \c{[nosplit eax*2]} will force \c{[eax*2+0]} to be generated literally. +In 64-bit mode, NASM will by default generate absolute addresses. The +\i\c{REL} keyword makes it produce \c{RIP}-relative addresses. Since +this is frequently the normally desired behaviour, see the \c{DEFAULT} +directive. The keyword \i\c{ABS} overrides \i\c{REL}. 
+ \H{const} \i{Constants} @@ -1350,7 +1357,8 @@ then the constant generated is not \c{0x61626364}, but \c{0x64636261}, so that if you were then to store the value into memory, it would read \c{abcd} rather than \c{dcba}. This is also the sense of character constants understood by the Pentium's -\i\c{CPUID} instruction (see \k{insCPUID}). +\i\c{CPUID} instruction. +\# (see \k{insCPUID}) \S{strconst} String Constants @@ -3262,7 +3270,8 @@ local variables in C are an example of this kind of variable. The (see \k{stacksize} and is also compatible with the \c{%arg} directive (see \k{arg}). It allows simplified reference to variables on the stack which have been allocated typically by using the \c{ENTER} -instruction (see \k{insENTER} for a description of that instruction). +instruction. +\# (see \k{insENTER} for a description of that instruction). An example of its use is the following: \c silly_swap: @@ -3428,7 +3437,7 @@ the REX prefix is used. In summary, the \c{REX} prefix causes the addressing of AH, BH, CH and DH to be replaced by SPL, BPL, SIL and DIL. The \c{BITS} directive has an exactly equivalent primitive form, -\c{[BITS 16]}, \c{[BITS 32]} and \c{BITS 64]}. The user-level form is +\c{[BITS 16]}, \c{[BITS 32]} and \c{[BITS 64]}. The user-level form is a macro which has no function other than to call the primitive form. Note that the space is necessary, e.g. \c{BITS32} will \e{not} work! @@ -3439,6 +3448,25 @@ The `\c{USE16}' and `\c{USE32}' directives can be used in place of `\c{BITS 16}' and `\c{BITS 32}', for compatibility with other assemblers. +\H{default} \i\c{DEFAULT}: Change the assembler defaults + +The \c{DEFAULT} directive changes the assembler defaults. Normally, +NASM defaults to a mode where the programmer is expected to explicitly +specify most features directly. However, this is occasionally +obnoxious, as the explicit form is pretty much the only one one wishes +to use.
+ +Currently, the only \c{DEFAULT} that is settable is whether +registerless instructions in 64-bit mode are \c{RIP}-relative. +By default, they are absolute unless overridden with the \i\c{REL} +specifier. However, if \c{DEFAULT REL} is specified, \c{REL} is +default, unless overridden with the \c{ABS} specifier, \e{except when +used with an \c{FS} or \c{GS} segment override}. The special handling +of \c{FS} and \c{GS} overrides is due to the fact that these +registers are generally used as thread pointers or other special +functions in 64-bit mode, and generating \c{RIP}-relative addresses +would be extremely confusing. + \H{section} \i\c{SECTION} or \i\c{SEGMENT}: Changing and \i{Defining Sections} @@ -6140,10 +6168,14 @@ corresponding \c{a16} prefix can be used. The \c{a16} and \c{a32} prefixes can be applied to any instruction in NASM's instruction table, but most of them can generate all the useful forms without them. The prefixes are necessary only for -instructions with implicit addressing: \c{CMPSx} (\k{insCMPSB}), -\c{SCASx} (\k{insSCASB}), \c{LODSx} (\k{insLODSB}), \c{STOSx} -(\k{insSTOSB}), \c{MOVSx} (\k{insMOVSB}), \c{INSx} (\k{insINSB}), -\c{OUTSx} (\k{insOUTSB}), and \c{XLATB} (\k{insXLATB}). Also, the +instructions with implicit addressing: +\# \c{CMPSx} (\k{insCMPSB}), +\# \c{SCASx} (\k{insSCASB}), \c{LODSx} (\k{insLODSB}), \c{STOSx} +\# (\k{insSTOSB}), \c{MOVSx} (\k{insMOVSB}), \c{INSx} (\k{insINSB}), +\# \c{OUTSx} (\k{insOUTSB}), and \c{XLATB} (\k{insXLATB}). +\c{CMPSx}, \c{SCASx}, \c{LODSx}, \c{STOSx}, \c{MOVSx}, \c{INSx}, +\c{OUTSx}, and \c{XLATB}. +Also, the various push and pop instructions (\c{PUSHA} and \c{POPF} as well as the more usual \c{PUSH} and \c{POP}) can accept \c{a16} or \c{a32} prefixes to force a particular one of \c{SP} or \c{ESP} to be used @@ -6168,6 +6200,28 @@ one. when in 16-bit mode, but this seems less useful.)
+\C{64bit} Writing 64-bit Code (Unix, Win64) + +This chapter attempts to cover some of the common issues involved when +writing 64-bit code, to run under \i{Win64} or Unix. It covers how to +write assembly code to interface with 64-bit C routines, and how to +write position-independent code for shared libraries. + +All 64-bit code uses a flat memory model, since segmentation is not +available in 64-bit mode. The one exception is the \c{FS} and \c{GS} +registers, which still add their bases. + +Position independence in 64-bit mode is significantly simpler, since +the processor supports \c{RIP}-relative addressing directly; see the +\c{REL} keyword (\k{effaddr}). + +64-bit programming is relatively similar to 32-bit programming, but +of course pointers are 64 bits long; additionally, all existing +platforms pass arguments in registers rather than on the stack. +Furthermore, 64-bit platforms use SSE2 by default for floating point. +Please see the ABI documentation for your platform. + + \C{trouble} Troubleshooting This chapter describes some of the common problems that users have @@ -6394,12 +6448,12 @@ are on a Unix system. To disassemble a file, you will typically use a command of the form -\c ndisasm [-b16 | -b32] filename +\c ndisasm -b {16|32|64} filename -NDISASM can disassemble 16-bit code or 32-bit code equally easily, +NDISASM can disassemble 16-, 32- or 64-bit code equally easily, provided of course that you remember to specify which it is to work -with. If no \i\c{-b} switch is present, NDISASM works in 16-bit mode by -default. The \i\c{-u} switch (for USE32) also invokes 32-bit mode. +with. If no \i\c{-b} switch is present, NDISASM works in 16-bit mode +by default. The \i\c{-u} switch (for USE32) also invokes 32-bit mode. Two more command line options are \i\c{-r} which reports the version number of NDISASM you are running, and \i\c{-h} which gives a short @@ -6541,8 +6595,8 @@ anyway. \H{ndisbugs} Bugs and Improvements There are no known bugs. 
However, any you find, with patches if -possible, should be sent to \W{mailto:jules@dsf.org.uk}\c{jules@dsf.org.uk} -or \W{mailto:anakin@pobox.com}\c{anakin@pobox.com}, or to the +possible, should be sent to +\W{mailto:nasm-bugs@lists.sourceforge.net}\c{nasm-bugs@lists.sourceforge.net}, or to the developer's site at \W{https://sourceforge.net/projects/nasm/}\c{https://sourceforge.net/projects/nasm/} and we'll try to fix them. Feel free to send contributions and @@ -6562,6736 +6616,3 @@ I don't recommend taking NDISASM apart to see how an efficient disassembler works, because as far as I know, it isn't an efficient one anyway. You have been warned. - -\A{iref} x86 Instruction Reference - -This appendix provides a complete list of the machine instructions -which NASM will assemble, and a short description of the function of -each one. - -It is not intended to be an exhaustive documentation on the fine -details of the instructions' function, such as which exceptions they -can trigger: for such documentation, you should go to Intel's Web -site, \W{http://developer.intel.com/design/Pentium4/manuals/}\c{http://developer.intel.com/design/Pentium4/manuals/}. - -Instead, this appendix is intended primarily to provide -documentation on the way the instructions may be used within NASM. -For example, looking up \c{LOOP} will tell you that NASM allows -\c{CX} or \c{ECX} to be specified as an optional second argument to -the \c{LOOP} instruction, to enforce which of the two possible -counter registers should be used if the default is not the one -desired. - -The instructions are not quite listed in alphabetical order, since -groups of instructions with similar functions are lumped together in -the same entry. Most of them don't move very far from their -alphabetic position because of this. 
- - -\H{iref-opr} Key to Operand Specifications - -The instruction descriptions in this appendix specify their operands -using the following notation: - -\b Registers: \c{reg8} denotes an 8-bit \i{general purpose -register}, \c{reg16} denotes a 16-bit general purpose register, -\c{reg32} a 32-bit one and \c{reg64} a 64-bit one. \c{fpureg} denotes -one of the eight FPU stack registers, \c{mmxreg} denotes one of the -eight 64-bit MMX registers, and \c{segreg} denotes a segment register. -\c{xmmreg} denotes one of the 8, or 16 in x64 long mode, SSE XMM registers. -In addition, some registers (such as \c{AL}, \c{DX}, \c{ECX} or \c{RAX}) -may be specified explicitly. - -\b Immediate operands: \c{imm} denotes a generic \i{immediate operand}. -\c{imm8}, \c{imm16} and \c{imm32} are used when the operand is -intended to be a specific size. For some of these instructions, NASM -needs an explicit specifier: for example, \c{ADD ESP,16} could be -interpreted as either \c{ADD r/m32,imm32} or \c{ADD r/m32,imm8}. -NASM chooses the former by default, and so you must specify \c{ADD -ESP,BYTE 16} for the latter. There is a special case of the allowance -of an \c{imm64} for particular x64 versions of the MOV instruction. - -\b Memory references: \c{mem} denotes a generic \i{memory reference}; -\c{mem8}, \c{mem16}, \c{mem32}, \c{mem64} and \c{mem80} are used -when the operand needs to be a specific size. Again, a specifier is -needed in some cases: \c{DEC [address]} is ambiguous and will be -rejected by NASM. You must specify \c{DEC BYTE [address]}, \c{DEC -WORD [address]} or \c{DEC DWORD [address]} instead. - -\b \i{Restricted memory references}: one form of the \c{MOV} -instruction allows a memory address to be specified \e{without} -allowing the normal range of register combinations and effective -address processing. This is denoted by \c{memoffs8}, \c{memoffs16}, -\c{memoffs32} or \c{memoffs64}. 
- -\b Register or memory choices: many instructions can accept either a -register \e{or} a memory reference as an operand. \c{r/m8} is -shorthand for \c{reg8/mem8}; similarly \c{r/m16} and \c{r/m32}. -On legacy x86 modes, \c{r/m64} is MMX-related, and is shorthand for -\c{mmxreg/mem64}. When utilizing the x86-64 architecture extension, -\c{r/m64} denotes use of a 64-bit GPR as well, and is shorthand for -\c{reg64/mem64}. - - -\H{iref-opc} Key to Opcode Descriptions - -This appendix also provides the opcodes which NASM will generate for -each form of each instruction. The opcodes are listed in the -following way: - -\b A hex number, such as \c{3F}, indicates a fixed byte containing -that number. - -\b A hex number followed by \c{+r}, such as \c{C8+r}, indicates that -one of the operands to the instruction is a register, and the -`register value' of that register should be added to the hex number -to produce the generated byte. For example, EDX has register value -2, so the code \c{C8+r}, when the register operand is EDX, generates -the hex byte \c{CA}. Register values for specific registers are -given in \k{iref-rv}. - -\b A hex number followed by \c{+cc}, such as \c{40+cc}, indicates -that the instruction name has a condition code suffix, and the -numeric representation of the condition code should be added to the -hex number to produce the generated byte. For example, the code -\c{40+cc}, when the instruction contains the \c{NE} condition, -generates the hex byte \c{45}. Condition codes and their numeric -representations are given in \k{iref-cc}. - -\b A slash followed by a digit, such as \c{/2}, indicates that one -of the operands to the instruction is a memory address or register -(denoted \c{mem} or \c{r/m}, with an optional size). 
This is to be -encoded as an effective address, with a \i{ModR/M byte}, an optional -\i{SIB byte}, and an optional displacement, and the spare (register) -field of the ModR/M byte should be the digit given (which will be -from 0 to 7, so it fits in three bits). The encoding of effective -addresses is given in \k{iref-ea}. - -\b The code \c{/r} combines the above two: it indicates that one of -the operands is a memory address or \c{r/m}, and another is a -register, and that an effective address should be generated with the -spare (register) field in the ModR/M byte being equal to the -`register value' of the register operand. The encoding of effective -addresses is given in \k{iref-ea}; register values are given in -\k{iref-rv}. - -\b The codes \c{ib}, \c{iw} and \c{id} indicate that one of the -operands to the instruction is an immediate value, and that this is -to be encoded as a byte, little-endian word or little-endian -doubleword respectively. - -\b The codes \c{rb}, \c{rw} and \c{rd} indicate that one of the -operands to the instruction is an immediate value, and that the -\e{difference} between this value and the address of the end of the -instruction is to be encoded as a byte, word or doubleword -respectively. Where the form \c{rw/rd} appears, it indicates that -either \c{rw} or \c{rd} should be used according to whether assembly -is being performed in \c{BITS 16} or \c{BITS 32} state respectively. - -\b The codes \c{ow} and \c{od} indicate that one of the operands to -the instruction is a reference to the contents of a memory address -specified as an immediate value: this encoding is used in some forms -of the \c{MOV} instruction in place of the standard -effective-address mechanism. The displacement is encoded as a word -or doubleword. Again, \c{ow/od} denotes that \c{ow} or \c{od} should -be chosen according to the \c{BITS} setting. 
- -\b The codes \c{o16} and \c{o32} indicate that the given form of the -instruction should be assembled with operand size 16 or 32 bits. In -other words, \c{o16} indicates a \c{66} prefix in \c{BITS 32} state, -but generates no code in \c{BITS 16} state; and \c{o32} indicates a -\c{66} prefix in \c{BITS 16} state but generates nothing in \c{BITS -32}. - -\b The codes \c{a16} and \c{a32}, similarly to \c{o16} and \c{o32}, -indicate the address size of the given form of the instruction. -Where this does not match the \c{BITS} setting, a \c{67} prefix is -required. Please note that \c{a16} is useless in long mode as -16-bit addressing is deprecated on the x86-64 architecture extension. - - -\S{iref-rv} Register Values - -Where an instruction requires a register value, it is already -implicit in the encoding of the rest of the instruction what type of -register is intended: an 8-bit general-purpose register, a segment -register, a debug register, an MMX register, or whatever. Therefore -there is no problem with registers of different types sharing an -encoding value. - -Please note that for the register classes listed below, the register -extensions (REX) classes require the use of the REX prefix, which -is only available when in long mode on the x86-64 processor. This -applies to any register that has a number higher than 7. - -The encodings for the various classes of register are: - -\b 8-bit general registers: \c{AL} is 0, \c{CL} is 1, \c{DL} is 2, -\c{BL} is 3, \c{AH} is 4, \c{CH} is 5, \c{DH} is 6 and \c{BH} is -7. Please note that \c{AH}, \c{BH}, \c{CH} and \c{DH} are not -addressable when using the REX prefix in long mode. - -\b 8-bit general register extensions (REX): \c{SPL} is 4, \c{BPL} is 5, -\c{SIL} is 6, \c{DIL} is 7, \c{R8B} is 8, \c{R9B} is 9, \c{R10B} is 10, -\c{R11B} is 11, \c{R12B} is 12, \c{R13B} is 13, \c{R14B} is 14 and -\c{R15B} is 15.
- -\b 16-bit general registers: \c{AX} is 0, \c{CX} is 1, \c{DX} is 2, -\c{BX} is 3, \c{SP} is 4, \c{BP} is 5, \c{SI} is 6, and \c{DI} is 7. - -\b 16-bit general register extensions (REX): \c{R8W} is 8, \c{R9W} is 9, -\c{R10W} is 10, \c{R11W} is 11, \c{R12W} is 12, \c{R13W} is 13, \c{R14W} -is 14 and \c{R15W} is 15. - -\b 32-bit general registers: \c{EAX} is 0, \c{ECX} is 1, \c{EDX} is -2, \c{EBX} is 3, \c{ESP} is 4, \c{EBP} is 5, \c{ESI} is 6, and -\c{EDI} is 7. - -\b 32-bit general register extensions (REX): \c{R8D} is 8, \c{R9D} is 9, -\c{R10D} is 10, \c{R11D} is 11, \c{R12D} is 12, \c{R13D} is 13, \c{R14D} -is 14 and \c{R15D} is 15. - -\b 64-bit general register extensions (REX): \c{RAX} is 0, \c{RCX} is 1, -\c{RDX} is 2, \c{RBX} is 3, \c{RSP} is 4, \c{RBP} is 5, \c{RSI} is 6, -\c{RDI} is 7, \c{R8} is 8, \c{R9} is 9, \c{R10} is 10, \c{R11} is 11, -\c{R12} is 12, \c{R13} is 13, \c{R14} is 14 and \c{R15} is 15. - -\b \i{Segment registers}: \c{ES} is 0, \c{CS} is 1, \c{SS} is 2, \c{DS} -is 3, \c{FS} is 4, and \c{GS} is 5. - -\b \I{floating-point, registers}Floating-point registers: \c{ST0} -is 0, \c{ST1} is 1, \c{ST2} is 2, \c{ST3} is 3, \c{ST4} is 4, -\c{ST5} is 5, \c{ST6} is 6, and \c{ST7} is 7. - -\b 64-bit \i{MMX registers}: \c{MM0} is 0, \c{MM1} is 1, \c{MM2} is 2, -\c{MM3} is 3, \c{MM4} is 4, \c{MM5} is 5, \c{MM6} is 6, and \c{MM7} -is 7. - -\b 128-bit \i{XMM (SSE) registers}: \c{XMM0} is 0, \c{XMM1} is 1, -\c{XMM2} is 2, \c{XMM3} is 3, \c{XMM4} is 4, \c{XMM5} is 5, \c{XMM6} is -6 and \c{XMM7} is 7. - -\b 128-bit \i{XMM (SSE) register} extensions (REX): \c{XMM8} is 8, -\c{XMM9} is 9, \c{XMM10} is 10, \c{XMM11} is 11, \c{XMM12} is 12, -\c{XMM13} is 13, \c{XMM14} is 14 and \c{XMM15} is 15. - -\b \i{Control registers}: \c{CR0} is 0, \c{CR2} is 2, \c{CR3} is 3, -and \c{CR4} is 4. - -\b \i{Control register} extensions: \c{CR8} is 8. - -\b \i{Debug registers}: \c{DR0} is 0, \c{DR1} is 1, \c{DR2} is 2, -\c{DR3} is 3, \c{DR6} is 6, and \c{DR7} is 7.
- -\b \i{Test registers}: \c{TR3} is 3, \c{TR4} is 4, \c{TR5} is 5, -\c{TR6} is 6, and \c{TR7} is 7. - -(Note that wherever a register name contains a number, that number -is also the register value for that register.) - - -\S{iref-cc} \i{Condition Codes} - -The available condition codes are given here, along with their -numeric representations as part of opcodes. Many of these condition -codes have synonyms, so several will be listed at a time. - -In the following descriptions, the word `either', when applied to two -possible trigger conditions, is used to mean `either or both'. If -`either but not both' is meant, the phrase `exactly one of' is used. - -\b \c{O} is 0 (trigger if the overflow flag is set); \c{NO} is 1. - -\b \c{B}, \c{C} and \c{NAE} are 2 (trigger if the carry flag is -set); \c{AE}, \c{NB} and \c{NC} are 3. - -\b \c{E} and \c{Z} are 4 (trigger if the zero flag is set); \c{NE} -and \c{NZ} are 5. - -\b \c{BE} and \c{NA} are 6 (trigger if either of the carry or zero -flags is set); \c{A} and \c{NBE} are 7. - -\b \c{S} is 8 (trigger if the sign flag is set); \c{NS} is 9. - -\b \c{P} and \c{PE} are 10 (trigger if the parity flag is set); -\c{NP} and \c{PO} are 11. - -\b \c{L} and \c{NGE} are 12 (trigger if exactly one of the sign and -overflow flags is set); \c{GE} and \c{NL} are 13. - -\b \c{LE} and \c{NG} are 14 (trigger if either the zero flag is set, -or exactly one of the sign and overflow flags is set); \c{G} and -\c{NLE} are 15. - -Note that in all cases, the sense of a condition code may be -reversed by changing the low bit of the numeric representation. - -For details of when an instruction sets each of the status flags, -see the individual instruction, plus the Status Flags reference -in \k{iref-Flags} - - -\S{iref-SSE-cc} \i{SSE Condition Predicates} - -The condition predicates for SSE comparison instructions are the -codes used as part of the opcode, to determine what form of -comparison is being carried out. 
In each case, the imm8 value is -the final byte of the opcode encoding, and the predicate is the -code used as part of the mnemonic for the instruction (equivalent -to the "cc" in an integer instruction that used a condition code). -The instructions that use this will give details of what the various -mnemonics are, this table is used to help you work out details of what -is happening. - -\c Predi- imm8 Description Relation where: Emula- Result QNaN -\c cate Encod- A Is 1st Operand tion if NaN Signal -\c ing B Is 2nd Operand Operand Invalid -\c -\c EQ 000B equal A = B False No -\c -\c LT 001B less-than A < B False Yes -\c -\c LE 010B less-than- A <= B False Yes -\c or-equal -\c -\c --- ---- greater A > B Swap False Yes -\c than Operands, -\c Use LT -\c -\c --- ---- greater- A >= B Swap False Yes -\c than-or-equal Operands, -\c Use LE -\c -\c UNORD 011B unordered A, B = Unordered True No -\c -\c NEQ 100B not-equal A != B True No -\c -\c NLT 101B not-less- NOT(A < B) True Yes -\c than -\c -\c NLE 110B not-less- NOT(A <= B) True Yes -\c than-or- -\c equal -\c -\c --- ---- not-greater NOT(A > B) Swap True Yes -\c than Operands, -\c Use NLT -\c -\c --- ---- not-greater NOT(A >= B) Swap True Yes -\c than- Operands, -\c or-equal Use NLE -\c -\c ORD 111B ordered A , B = Ordered False No - -The unordered relationship is true when at least one of the two -values being compared is a NaN or in an unsupported format. - -Note that the comparisons which are listed as not having a predicate -or encoding can only be achieved through software emulation, as -described in the "emulation" column. Note in particular that an -instruction such as \c{greater-than} is not the same as \c{NLE}, as, -unlike with the \c{CMP} instruction, it has to take into account the -possibility of one operand containing a NaN or an unsupported numeric -format. - - -\S{iref-Flags} \i{Status Flags} - -The status flags provide some information about the result of the -arithmetic instructions. 
This information can be used by conditional -instructions (such as \c{Jcc} and \c{CMOVcc}) as well as by some of -the other instructions (such as \c{ADC} and \c{INTO}). - -There are 6 status flags: - -\c CF - Carry flag. - -Set if an arithmetic operation generates a -carry or a borrow out of the most-significant bit of the result; -cleared otherwise. This flag indicates an overflow condition for -unsigned-integer arithmetic. It is also used in multiple-precision -arithmetic. - -\c PF - Parity flag. - -Set if the least-significant byte of the result contains an even -number of 1 bits; cleared otherwise. - -\c AF - Adjust flag. - -Set if an arithmetic operation generates a carry or a borrow -out of bit 3 of the result; cleared otherwise. This flag is used -in binary-coded decimal (BCD) arithmetic. - -\c ZF - Zero flag. - -Set if the result is zero; cleared otherwise. - -\c SF - Sign flag. - -Set equal to the most-significant bit of the result, which is the -sign bit of a signed integer. (0 indicates a positive value and 1 -indicates a negative value.) - -\c OF - Overflow flag. - -Set if the integer result is too large a positive number or too -small a negative number (excluding the sign-bit) to fit in the -destination operand; cleared otherwise. This flag indicates an -overflow condition for signed-integer (two's complement) arithmetic. - - -\S{iref-ea} Effective Address Encoding: \i{ModR/M} and \i{SIB} - -An \i{effective address} is encoded in up to three parts: a ModR/M -byte, an optional SIB byte, and an optional byte, word or doubleword -displacement field. - -The ModR/M byte consists of three fields: the \c{mod} field, ranging -from 0 to 3, in the upper two bits of the byte, the \c{r/m} field, -ranging from 0 to 7, in the lower three bits, and the spare -(register) field in the middle (bit 3 to bit 5).
The spare field is -not relevant to the effective address being encoded, and either -contains an extension to the instruction opcode or the register -value of another operand. - -The ModR/M system can be used to encode a direct register reference -rather than a memory access. This is always done by setting the -\c{mod} field to 3 and the \c{r/m} field to the register value of -the register in question (it must be a general-purpose register, and -the size of the register must already be implicit in the encoding of -the rest of the instruction). In this case, the SIB byte and -displacement field are both absent. - -In 16-bit addressing mode (either \c{BITS 16} with no \c{67} prefix, -or \c{BITS 32} with a \c{67} prefix), the SIB byte is never used. -The general rules for \c{mod} and \c{r/m} (there is an exception, -given below) are: - -\b The \c{mod} field gives the length of the displacement field: 0 -means no displacement, 1 means one byte, and 2 means two bytes. - -\b The \c{r/m} field encodes the combination of registers to be -added to the displacement to give the accessed address: 0 means -\c{BX+SI}, 1 means \c{BX+DI}, 2 means \c{BP+SI}, 3 means \c{BP+DI}, -4 means \c{SI} only, 5 means \c{DI} only, 6 means \c{BP} only, and 7 -means \c{BX} only. - -However, there is a special case: - -\b If \c{mod} is 0 and \c{r/m} is 6, the effective address encoded -is not \c{[BP]} as the above rules would suggest, but instead -\c{[disp16]}: the displacement field is present and is two bytes -long, and no registers are added to the displacement. - -Therefore the effective address \c{[BP]} cannot be encoded as -efficiently as \c{[BX]}; so if you code \c{[BP]} in a program, NASM -adds a notional 8-bit zero displacement, and sets \c{mod} to 1, -\c{r/m} to 6, and the one-byte displacement field to 0. 
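The 16-bit rules above can be sketched in Python. This is an illustrative model only, not NASM's own encoder: the function name `encode_modrm16`, the `RM16` table and the way registers are passed in are all assumptions made for this example.

```python
# Illustrative model of 16-bit effective-address encoding (not NASM's code).
# r/m values for each register combination, as listed in the text above.
RM16 = {
    frozenset(["BX", "SI"]): 0,
    frozenset(["BX", "DI"]): 1,
    frozenset(["BP", "SI"]): 2,
    frozenset(["BP", "DI"]): 3,
    frozenset(["SI"]): 4,
    frozenset(["DI"]): 5,
    frozenset(["BP"]): 6,
    frozenset(["BX"]): 7,
}


def encode_modrm16(regs, disp, spare):
    """Return the ModR/M byte followed by the displacement bytes.

    regs  -- iterable of register names (empty for a bare [disp16])
    disp  -- signed displacement
    spare -- value for the spare (register) field, 0..7
    """
    regs = frozenset(regs)
    if not regs:
        # Special case: mod=0, r/m=6 encodes [disp16] with no registers.
        mod, rm, size = 0, 6, 2
    else:
        rm = RM16[regs]
        if disp == 0 and rm != 6:
            mod, size = 0, 0
        elif -128 <= disp <= 127:
            # Also covers [BP], which needs a notional zero 8-bit displacement.
            mod, size = 1, 1
        else:
            mod, size = 2, 2
    modrm = (mod << 6) | (spare << 3) | rm
    if size:
        # Two's-complement, little-endian displacement of the chosen width.
        disp_bytes = (disp & ((1 << (8 * size)) - 1)).to_bytes(size, "little")
    else:
        disp_bytes = b""
    return bytes([modrm]) + disp_bytes
```

With this sketch, `encode_modrm16(["BX"], 0, 0)` yields the single byte `07`, whereas `encode_modrm16(["BP"], 0, 0)` yields `46 00`, which is the extra displacement byte described above.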
- -In 32-bit addressing mode (either \c{BITS 16} with a \c{67} prefix, -or \c{BITS 32} with no \c{67} prefix) the general rules (again, -there are exceptions) for \c{mod} and \c{r/m} are: - -\b The \c{mod} field gives the length of the displacement field: 0 -means no displacement, 1 means one byte, and 2 means four bytes. - -\b If only one register is to be added to the displacement, and it -is not \c{ESP}, the \c{r/m} field gives its register value, and the -SIB byte is absent. If the \c{r/m} field is 4 (which would encode -\c{ESP}), the SIB byte is present and gives the combination and -scaling of registers to be added to the displacement. - -If the SIB byte is present, it describes the combination of -registers (an optional base register, and an optional index register -scaled by multiplication by 1, 2, 4 or 8) to be added to the -displacement. The SIB byte is divided into the \c{scale} field, in -the top two bits, the \c{index} field in the next three, and the -\c{base} field in the bottom three. The general rules are: - -\b The \c{base} field encodes the register value of the base -register. - -\b The \c{index} field encodes the register value of the index -register, unless it is 4, in which case no index register is used -(so \c{ESP} cannot be used as an index register). - -\b The \c{scale} field encodes the multiplier by which the index -register is scaled before adding it to the base and displacement: 0 -encodes a multiplier of 1, 1 encodes 2, 2 encodes 4 and 3 encodes 8. - -The exceptions to the 32-bit encoding rules are: - -\b If \c{mod} is 0 and \c{r/m} is 5, the effective address encoded -is not \c{[EBP]} as the above rules would suggest, but instead -\c{[disp32]}: the displacement field is present and is four bytes -long, and no registers are added to the displacement. 
- -\b If \c{mod} is 0, \c{r/m} is 4 (meaning the SIB byte is present) -and \c{base} is 5, the effective address encoded is not -\c{[EBP+index]} as the above rules would suggest, but instead -\c{[disp32+index]}: the displacement field is present and is four -bytes long, and there is no base register (but the index register is -still processed in the normal way). - - -\S{iref-rex} Register Extensions: The \i{REX} Prefix - -The Register Extensions, or \i{REX} for short, prefix is the means -of accessing extended registers on the x86-64 architecture. \i{REX} -is considered an instruction prefix, but is required to be after -all other prefixes and thus immediately before the first instruction -opcode itself. So overall, \i{REX} can be thought of as an "Opcode -Prefix" instead. The \i{REX} prefix itself is indicated by a value -of 0x4X, where X is one of 16 different combinations of the actual -\i{REX} flags. - -The \i{REX} prefix flags consist of four 1-bit extension fields. -These flags are found in the lower nibble of the actual \i{REX} -prefix opcode. Below is the list of \i{REX} prefix flags, from -high bit to low bit. - -\c{REX.W}: When set, this flag indicates the use of a 64-bit operand, -as opposed to the default of using 32-bit operands as found in 32-bit -Protected Mode. - -\c{REX.R}: When set, this flag extends the \c{reg (spare)} field of -the \c{ModRM} byte. Overall, this raises the number of addressable -registers in this field from 8 to 16. - -\c{REX.X}: When set, this flag extends the \c{index} field of the -\c{SIB} byte. Overall, this raises the number of addressable -registers in this field from 8 to 16. - -\c{REX.B}: When set, this flag extends the \c{r/m} field of the -\c{ModRM} byte. This flag can also represent an extension to the -opcode register \c{(/r)} field. The determination of which is used -varies depending on which instruction is used. Overall, this raises -the number of addressable registers in these fields from 8 to 16.
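As a rough illustration of the flag layout just described, the following Python sketch assembles a REX prefix byte from its four flags and splits a register value into its extension bit and 3-bit field. The helper names `rex_prefix` and `split_register` are hypothetical and exist only for this example.

```python
def rex_prefix(w=0, r=0, x=0, b=0):
    """Build a REX prefix byte 0100WRXB from its four 1-bit flags."""
    for flag in (w, r, x, b):
        if flag not in (0, 1):
            raise ValueError("REX flags are single bits")
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b


def split_register(value):
    """Split a register value 0..15 into (REX extension bit, 3-bit field)."""
    if not 0 <= value <= 15:
        raise ValueError("register value must be in the range 0..15")
    return value >> 3, value & 7
```

For instance, a 64-bit operation whose `r/m` operand is `R10` (register value 10) would need `REX.W` and `REX.B` set, giving the prefix byte `0x49`, with 2 placed in the `r/m` field.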
- -Internal use of the \i{REX} prefix by the processor is consistent, -yet non-trivial. Most instructions use the \i{REX} prefix as -indicated by the above flags. Some instructions require the \i{REX} -prefix to be present even if the flags are empty. Some instructions -default to a 64-bit operand and require the \i{REX} prefix only for -actual register extensions, and thus ignore the \c{REX.W} field -completely. - -At any rate, NASM is designed to handle, and fully supports, the -\i{REX} prefix internally. Please read the appropriate processor -documentation for further information on the \i{REX} prefix. - -You may have noticed that opcodes 0x40 through 0x4F are actually -opcodes for the INC/DEC instructions for each General Purpose -Register. This is, of course, correct... for legacy x86. While -in long mode, opcodes 0x40 through 0x4F are reserved for use as -the REX prefix. The other opcode forms of the INC/DEC instructions -are used instead. - - -\H{iref-flg} Key to Instruction Flags - -Given along with each instruction in this appendix is a set of -flags, denoting the type of the instruction. The types are as follows: - -\b \c{8086}, \c{186}, \c{286}, \c{386}, \c{486}, \c{PENT} and \c{P6} -denote the lowest processor type that supports the instruction. Most -instructions run on all processors above the given type; those that -do not are documented. The Pentium II contains no additional -instructions beyond the P6 (Pentium Pro); from the point of view of -its instruction set, it can be thought of as a P6 with MMX -capability. - -\b \c{3DNOW} indicates that the instruction is a 3DNow! one, and will -run on the AMD K6-2 and later processors. ATHLON extensions to the -3DNow! instruction set are documented as such. - -\b \c{CYRIX} indicates that the instruction is specific to Cyrix -processors, for example the extra MMX instructions in the Cyrix -extended MMX instruction set.
- -\b \c{FPU} indicates that the instruction is a floating-point one, -and will only run on machines with a coprocessor (automatically -including 486DX, Pentium and above). - -\b \c{KATMAI} indicates that the instruction was introduced as part -of the Katmai New Instruction set. These instructions are available -on the Pentium III and later processors. Those which are not -specifically SSE instructions are also available on the AMD Athlon. - -\b \c{MMX} indicates that the instruction is an MMX one, and will -run on MMX-capable Pentium processors and the Pentium II. - -\b \c{PRIV} indicates that the instruction is a protected-mode -management instruction. Many of these may only be used in protected -mode, or only at privilege level zero. - -\b \c{SSE} and \c{SSE2} indicate that the instruction is a Streaming -SIMD Extension instruction. These instructions operate on multiple -values in a single operation. SSE was introduced with the Pentium III -and SSE2 was introduced with the Pentium 4. - -\b \c{UNDOC} indicates that the instruction is an undocumented one, -and not part of the official Intel Architecture; it may or may not -be supported on any given machine. - -\b \c{WILLAMETTE} indicates that the instruction was introduced as -part of the new instruction set in the Pentium 4 and Intel Xeon -processors. These instructions are also known as SSE2 instructions. - -\b \c{X64} indicates that the instruction was introduced as part of -the new instruction set in the x86-64 architecture extension, -commonly referred to as x64, AMD64 or EM64T. 
- - -\H{iref-inst} x86 Instruction Set - - -\S{insAAA} \i\c{AAA}, \i\c{AAS}, \i\c{AAM}, \i\c{AAD}: ASCII -Adjustments - -\c AAA ; 37 [8086] - -\c AAS ; 3F [8086] - -\c AAD ; D5 0A [8086] -\c AAD imm ; D5 ib [8086] - -\c AAM ; D4 0A [8086] -\c AAM imm ; D4 ib [8086] - -These instructions are used in conjunction with the add, subtract, -multiply and divide instructions to perform binary-coded decimal -arithmetic in \e{unpacked} (one BCD digit per byte - easy to -translate to and from \c{ASCII}, hence the instruction names) form. -There are also packed BCD instructions \c{DAA} and \c{DAS}: see -\k{insDAA}. - -\b \c{AAA} (ASCII Adjust After Addition) should be used after a -one-byte \c{ADD} instruction whose destination was the \c{AL} -register: by means of examining the value in the low nibble of -\c{AL} and also the auxiliary carry flag \c{AF}, it determines -whether the addition has overflowed, and adjusts it (and sets -the carry flag) if so. You can add long BCD strings together -by doing \c{ADD}/\c{AAA} on the low digits, then doing -\c{ADC}/\c{AAA} on each subsequent digit. - -\b \c{AAS} (ASCII Adjust AL After Subtraction) works similarly to -\c{AAA}, but is for use after \c{SUB} instructions rather than -\c{ADD}. - -\b \c{AAM} (ASCII Adjust AX After Multiply) is for use after you -have multiplied two decimal digits together and left the result -in \c{AL}: it divides \c{AL} by ten and stores the quotient in -\c{AH}, leaving the remainder in \c{AL}. The divisor 10 can be -changed by specifying an operand to the instruction: a particularly -handy use of this is \c{AAM 16}, causing the two nibbles in \c{AL} -to be separated into \c{AH} and \c{AL}. - -\b \c{AAD} (ASCII Adjust AX Before Division) performs the inverse -operation to \c{AAM}: it multiplies \c{AH} by ten, adds it to -\c{AL}, and sets \c{AH} to zero. Again, the multiplier 10 can -be changed. 
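The arithmetic performed by \c{AAM} and \c{AAD} can be modelled in a few lines of Python. This is only a sketch of the register arithmetic described above: flag updates are not modelled, and the function names `aam` and `aad` are simply chosen to mirror the mnemonics.

```python
def aam(al, base=10):
    """Model AAM: return (AH, AL) = (AL div base, AL mod base)."""
    if base == 0:
        # On real hardware an immediate of 0 raises a divide error.
        raise ZeroDivisionError("AAM with a zero divisor")
    return al // base, al % base


def aad(ah, al, base=10):
    """Model AAD: return (AH, AL) = (0, (AL + AH * base) mod 256)."""
    return 0, (al + ah * base) & 0xFF
```

Calling `aam(0xAB, 16)` separates the two nibbles into `(0x0A, 0x0B)`, which corresponds to the \c{AAM 16} trick mentioned above, and `aad` applied to that pair with base 16 puts them back together.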
- - -\S{insADC} \i\c{ADC}: Add with Carry - -\c ADC r/m8,reg8 ; 10 /r [8086] -\c ADC r/m16,reg16 ; o16 11 /r [8086] -\c ADC r/m32,reg32 ; o32 11 /r [386] - -\c ADC reg8,r/m8 ; 12 /r [8086] -\c ADC reg16,r/m16 ; o16 13 /r [8086] -\c ADC reg32,r/m32 ; o32 13 /r [386] - -\c ADC r/m8,imm8 ; 80 /2 ib [8086] -\c ADC r/m16,imm16 ; o16 81 /2 iw [8086] -\c ADC r/m32,imm32 ; o32 81 /2 id [386] - -\c ADC r/m16,imm8 ; o16 83 /2 ib [8086] -\c ADC r/m32,imm8 ; o32 83 /2 ib [386] - -\c ADC AL,imm8 ; 14 ib [8086] -\c ADC AX,imm16 ; o16 15 iw [8086] -\c ADC EAX,imm32 ; o32 15 id [386] - -\c{ADC} performs integer addition: it adds its two operands -together, plus the value of the carry flag, and leaves the result in -its destination (first) operand. The destination operand can be a -register or a memory location. The source operand can be a register, -a memory location or an immediate value. - -The flags are set according to the result of the operation: in -particular, the carry flag is affected and can be used by a -subsequent \c{ADC} instruction. - -In the forms with an 8-bit immediate second operand and a longer -first operand, the second operand is considered to be signed, and is -sign-extended to the length of the first operand. In these cases, -the \c{BYTE} qualifier is necessary to force NASM to generate this -form of the instruction. - -To add two numbers without also adding the contents of the carry -flag, use \c{ADD} (\k{insADD}). 
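The way \c{ADC} chains carries across words can be sketched in Python as follows. This is a model of the arithmetic only, under the assumption of equal-length little-endian word lists; the helper names `add_with_carry` and `add_multiword` are invented for this example.

```python
def add_with_carry(a, b, carry_in, bits=32):
    """Model ADC: return (result, carry_out) for bits-wide operands."""
    mask = (1 << bits) - 1
    total = (a & mask) + (b & mask) + carry_in
    return total & mask, total >> bits


def add_multiword(xs, ys, bits=32):
    """Add two equal-length little-endian word lists.

    The first step has a zero carry-in, so it behaves like ADD;
    every later step behaves like ADC.
    """
    result, carry = [], 0
    for x, y in zip(xs, ys):
        word, carry = add_with_carry(x, y, carry, bits)
        result.append(word)
    return result, carry
```

Adding `[0xFFFFFFFF, 0]` and `[1, 0]` with this sketch gives `([0, 1], 0)`: the carry out of the low word is consumed by the second addition.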
- - -\S{insADD} \i\c{ADD}: Add Integers - -\c ADD r/m8,reg8 ; 00 /r [8086] -\c ADD r/m16,reg16 ; o16 01 /r [8086] -\c ADD r/m32,reg32 ; o32 01 /r [386] - -\c ADD reg8,r/m8 ; 02 /r [8086] -\c ADD reg16,r/m16 ; o16 03 /r [8086] -\c ADD reg32,r/m32 ; o32 03 /r [386] - -\c ADD r/m8,imm8 ; 80 /0 ib [8086] -\c ADD r/m16,imm16 ; o16 81 /0 iw [8086] -\c ADD r/m32,imm32 ; o32 81 /0 id [386] - -\c ADD r/m16,imm8 ; o16 83 /0 ib [8086] -\c ADD r/m32,imm8 ; o32 83 /0 ib [386] - -\c ADD AL,imm8 ; 04 ib [8086] -\c ADD AX,imm16 ; o16 05 iw [8086] -\c ADD EAX,imm32 ; o32 05 id [386] - -\c{ADD} performs integer addition: it adds its two operands -together, and leaves the result in its destination (first) operand. -The destination operand can be a register or a memory location. -The source operand can be a register, a memory location or an -immediate value. - -The flags are set according to the result of the operation: in -particular, the carry flag is affected and can be used by a -subsequent \c{ADC} instruction. - -In the forms with an 8-bit immediate second operand and a longer -first operand, the second operand is considered to be signed, and is -sign-extended to the length of the first operand. In these cases, -the \c{BYTE} qualifier is necessary to force NASM to generate this -form of the instruction. - - -\S{insADDPD} \i\c{ADDPD}: ADD Packed Double-Precision FP Values - -\c ADDPD xmm1,xmm2/mem128 ; 66 0F 58 /r [WILLAMETTE,SSE2] - -\c{ADDPD} performs addition on each of two packed double-precision -FP value pairs. - -\c dst[0-63] := dst[0-63] + src[0-63], -\c dst[64-127] := dst[64-127] + src[64-127]. - -The destination is an \c{XMM} register. The source operand can be -either an \c{XMM} register or a 128-bit memory location.
- - -\S{insADDPS} \i\c{ADDPS}: ADD Packed Single-Precision FP Values - -\c ADDPS xmm1,xmm2/mem128 ; 0F 58 /r [KATMAI,SSE] - -\c{ADDPS} performs addition on each of four packed single-precision -FP value pairs. - -\c dst[0-31] := dst[0-31] + src[0-31], -\c dst[32-63] := dst[32-63] + src[32-63], -\c dst[64-95] := dst[64-95] + src[64-95], -\c dst[96-127] := dst[96-127] + src[96-127]. - -The destination is an \c{XMM} register. The source operand can be -either an \c{XMM} register or a 128-bit memory location. - - -\S{insADDSD} \i\c{ADDSD}: ADD Scalar Double-Precision FP Values - -\c ADDSD xmm1,xmm2/mem64 ; F2 0F 58 /r [WILLAMETTE,SSE2] - -\c{ADDSD} adds the low double-precision FP values from the source -and destination operands and stores the double-precision FP result -in the destination operand. - -\c dst[0-63] := dst[0-63] + src[0-63], -\c dst[64-127] remains unchanged. - -The destination is an \c{XMM} register. The source operand can be -either an \c{XMM} register or a 64-bit memory location. - - -\S{insADDSS} \i\c{ADDSS}: ADD Scalar Single-Precision FP Values - -\c ADDSS xmm1,xmm2/mem32 ; F3 0F 58 /r [KATMAI,SSE] - -\c{ADDSS} adds the low single-precision FP values from the source -and destination operands and stores the single-precision FP result -in the destination operand. - -\c dst[0-31] := dst[0-31] + src[0-31], -\c dst[32-127] remains unchanged. - -The destination is an \c{XMM} register. The source operand can be -either an \c{XMM} register or a 32-bit memory location.
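The difference between the packed and scalar single-precision forms can be illustrated with a small Python model. It treats an XMM register as a list of four lanes (lane 0 being bits 0-31) and ignores rounding-mode control, exceptions and overflow to infinity; the names `addps`, `addss` and `_round_single` are assumptions made for this sketch.

```python
import struct


def _round_single(x):
    """Round a Python float to IEEE single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def addps(dst, src):
    """Model ADDPS: add all four single-precision lanes."""
    return [_round_single(d + s) for d, s in zip(dst, src)]


def addss(dst, src):
    """Model ADDSS: add lane 0 only; lanes 1..3 of dst are unchanged."""
    return [_round_single(dst[0] + src[0])] + list(dst[1:])
```

Given the same inputs, `addps` changes every lane while `addss` changes only the first, leaving the upper three lanes of the destination as they were.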
- - -\S{insAND} \i\c{AND}: Bitwise AND - -\c AND r/m8,reg8 ; 20 /r [8086] -\c AND r/m16,reg16 ; o16 21 /r [8086] -\c AND r/m32,reg32 ; o32 21 /r [386] - -\c AND reg8,r/m8 ; 22 /r [8086] -\c AND reg16,r/m16 ; o16 23 /r [8086] -\c AND reg32,r/m32 ; o32 23 /r [386] - -\c AND r/m8,imm8 ; 80 /4 ib [8086] -\c AND r/m16,imm16 ; o16 81 /4 iw [8086] -\c AND r/m32,imm32 ; o32 81 /4 id [386] - -\c AND r/m16,imm8 ; o16 83 /4 ib [8086] -\c AND r/m32,imm8 ; o32 83 /4 ib [386] - -\c AND AL,imm8 ; 24 ib [8086] -\c AND AX,imm16 ; o16 25 iw [8086] -\c AND EAX,imm32 ; o32 25 id [386] - -\c{AND} performs a bitwise AND operation between its two operands -(i.e. each bit of the result is 1 if and only if the corresponding -bits of the two inputs were both 1), and stores the result in the -destination (first) operand. The destination operand can be a -register or a memory location. The source operand can be a register, -a memory location or an immediate value. - -In the forms with an 8-bit immediate second operand and a longer -first operand, the second operand is considered to be signed, and is -sign-extended to the length of the first operand. In these cases, -the \c{BYTE} qualifier is necessary to force NASM to generate this -form of the instruction. - -The \c{MMX} instruction \c{PAND} (see \k{insPAND}) performs the same -operation on the 64-bit \c{MMX} registers. - - -\S{insANDNPD} \i\c{ANDNPD}: Bitwise Logical AND NOT of -Packed Double-Precision FP Values - -\c ANDNPD xmm1,xmm2/mem128 ; 66 0F 55 /r [WILLAMETTE,SSE2] - -\c{ANDNPD} inverts the bits of the two double-precision -floating-point values in the destination register, and then -performs a logical AND between the two double-precision -floating-point values in the source operand and the temporary -inverted result, storing the result in the destination register. - -\c dst[0-63] := src[0-63] AND NOT dst[0-63], -\c dst[64-127] := src[64-127] AND NOT dst[64-127]. - -The destination is an \c{XMM} register. 
The source operand can be -either an \c{XMM} register or a 128-bit memory location. - - -\S{insANDNPS} \i\c{ANDNPS}: Bitwise Logical AND NOT of -Packed Single-Precision FP Values - -\c ANDNPS xmm1,xmm2/mem128 ; 0F 55 /r [KATMAI,SSE] - -\c{ANDNPS} inverts the bits of the four single-precision -floating-point values in the destination register, and then -performs a logical AND between the four single-precision -floating-point values in the source operand and the temporary -inverted result, storing the result in the destination register. - -\c dst[0-31] := src[0-31] AND NOT dst[0-31], -\c dst[32-63] := src[32-63] AND NOT dst[32-63], -\c dst[64-95] := src[64-95] AND NOT dst[64-95], -\c dst[96-127] := src[96-127] AND NOT dst[96-127]. - -The destination is an \c{XMM} register. The source operand can be -either an \c{XMM} register or a 128-bit memory location. - - -\S{insANDPD} \i\c{ANDPD}: Bitwise Logical AND For Double FP - -\c ANDPD xmm1,xmm2/mem128 ; 66 0F 54 /r [WILLAMETTE,SSE2] - -\c{ANDPD} performs a bitwise logical AND of the two double-precision -floating point values in the source and destination operand, and -stores the result in the destination register. - -\c dst[0-63] := src[0-63] AND dst[0-63], -\c dst[64-127] := src[64-127] AND dst[64-127]. - -The destination is an \c{XMM} register. The source operand can be -either an \c{XMM} register or a 128-bit memory location. - - -\S{insANDPS} \i\c{ANDPS}: Bitwise Logical AND For Single FP - -\c ANDPS xmm1,xmm2/mem128 ; 0F 54 /r [KATMAI,SSE] - -\c{ANDPS} performs a bitwise logical AND of the four single-precision -floating point values in the source and destination operand, and -stores the result in the destination register. - -\c dst[0-31] := src[0-31] AND dst[0-31], -\c dst[32-63] := src[32-63] AND dst[32-63], -\c dst[64-95] := src[64-95] AND dst[64-95], -\c dst[96-127] := src[96-127] AND dst[96-127]. - -The destination is an \c{XMM} register.
The source operand can be -either an \c{XMM} register or a 128-bit memory location. - - -\S{insARPL} \i\c{ARPL}: Adjust RPL Field of Selector - -\c ARPL r/m16,reg16 ; 63 /r [286,PRIV] - -\c{ARPL} expects its two word operands to be segment selectors. It -adjusts the \i\c{RPL} (requested privilege level - stored in the bottom -two bits of the selector) field of the destination (first) operand -to ensure that it is no less than (i.e. no more privileged than) the -\c{RPL} field of the source operand. The zero flag is set if and only -if a change had to be made. - - -\S{insBOUND} \i\c{BOUND}: Check Array Index against Bounds - -\c BOUND reg16,mem ; o16 62 /r [186] -\c BOUND reg32,mem ; o32 62 /r [386] - -\c{BOUND} expects its second operand to point to an area of memory -containing two signed values of the same size as its first operand -(i.e. two words for the 16-bit form; two doublewords for the 32-bit -form). It performs two signed comparisons: if the value in the -register passed as its first operand is less than the first of the -in-memory values, or is greater than the second, it -throws a \c{BR} exception. Otherwise, it does nothing. - - -\S{insBSF} \i\c{BSF}, \i\c{BSR}: Bit Scan - -\c BSF reg16,r/m16 ; o16 0F BC /r [386] -\c BSF reg32,r/m32 ; o32 0F BC /r [386] - -\c BSR reg16,r/m16 ; o16 0F BD /r [386] -\c BSR reg32,r/m32 ; o32 0F BD /r [386] - -\b \c{BSF} searches for the least significant set bit in its source -(second) operand, and if it finds one, stores the index in -its destination (first) operand. If no set bit is found, the -contents of the destination operand are undefined. If the source -operand is zero, the zero flag is set. - -\b \c{BSR} performs the same function, but searches from the top -instead, so it finds the most significant set bit. - -Bit indices are from 0 (least significant) to 15 or 31 (most -significant). The destination operand can only be a register. -The source operand can be a register or a memory location.
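The bit-scan behaviour can be modelled in Python as below. This is a sketch under one stated assumption: since the destination is undefined when the source is zero, the model returns `None` for the index in that case. The function names simply mirror the mnemonics.

```python
def bsf(src):
    """Model BSF: return (zero_flag, index of the lowest set bit)."""
    if src == 0:
        return 1, None  # destination undefined, zero flag set
    # src & -src isolates the lowest set bit.
    return 0, (src & -src).bit_length() - 1


def bsr(src):
    """Model BSR: return (zero_flag, index of the highest set bit)."""
    if src == 0:
        return 1, None  # destination undefined, zero flag set
    return 0, src.bit_length() - 1
```

For the value `0b101000`, `bsf` reports bit 3 and `bsr` reports bit 5.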
- - -\S{insBSWAP} \i\c{BSWAP}: Byte Swap - -\c BSWAP reg32 ; o32 0F C8+r [486] - -\c{BSWAP} swaps the order of the four bytes of a 32-bit register: -bits 0-7 exchange places with bits 24-31, and bits 8-15 swap with -bits 16-23. There is no explicit 16-bit equivalent: to byte-swap -\c{AX}, \c{BX}, \c{CX} or \c{DX}, \c{XCHG} can be used. When \c{BSWAP} -is used with a 16-bit register, the result is undefined. - - -\S{insBT} \i\c{BT}, \i\c{BTC}, \i\c{BTR}, \i\c{BTS}: Bit Test - -\c BT r/m16,reg16 ; o16 0F A3 /r [386] -\c BT r/m32,reg32 ; o32 0F A3 /r [386] -\c BT r/m16,imm8 ; o16 0F BA /4 ib [386] -\c BT r/m32,imm8 ; o32 0F BA /4 ib [386] - -\c BTC r/m16,reg16 ; o16 0F BB /r [386] -\c BTC r/m32,reg32 ; o32 0F BB /r [386] -\c BTC r/m16,imm8 ; o16 0F BA /7 ib [386] -\c BTC r/m32,imm8 ; o32 0F BA /7 ib [386] - -\c BTR r/m16,reg16 ; o16 0F B3 /r [386] -\c BTR r/m32,reg32 ; o32 0F B3 /r [386] -\c BTR r/m16,imm8 ; o16 0F BA /6 ib [386] -\c BTR r/m32,imm8 ; o32 0F BA /6 ib [386] - -\c BTS r/m16,reg16 ; o16 0F AB /r [386] -\c BTS r/m32,reg32 ; o32 0F AB /r [386] -\c BTS r/m16,imm8 ; o16 0F BA /5 ib [386] -\c BTS r/m32,imm8 ; o32 0F BA /5 ib [386] - -These instructions all test one bit of their first operand, whose -index is given by the second operand, and store the value of that -bit into the carry flag. Bit indices are from 0 (least significant) -to 15 or 31 (most significant). - -In addition to storing the original value of the bit into the carry -flag, \c{BTR} also resets (clears) the bit in the operand itself. -\c{BTS} sets the bit, and \c{BTC} complements the bit. \c{BT} does -not modify its operands. - -The destination can be a register or a memory location. The source can -be a register or an immediate value. - -If the destination operand is a register, the bit offset should be -in the range 0-15 (for 16-bit operands) or 0-31 (for 32-bit operands). -An immediate value outside these ranges will be taken modulo 16/32 -by the processor.
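For a register destination, the four bit-test instructions can be modelled in Python as follows. This is a sketch only: the function name `bit_test` and the convention of selecting the instruction with a string are assumptions made for the example, and memory destinations are not modelled.

```python
def bit_test(op, value, index, bits=32):
    """Model BT/BTS/BTR/BTC on a register operand.

    Returns (carry_flag, new_value); the index is taken modulo bits.
    """
    mask = 1 << (index % bits)
    carry = 1 if value & mask else 0  # original bit goes to the carry flag
    if op == "BTS":
        value |= mask    # set the bit
    elif op == "BTR":
        value &= ~mask   # reset (clear) the bit
    elif op == "BTC":
        value ^= mask    # complement the bit
    elif op != "BT":
        raise ValueError("unknown instruction: " + op)
    return carry, value
```

Here `bit_test("BTC", 0, 35)` acts on bit 3, since 35 modulo 32 is 3, and returns `(0, 8)`.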
- -If the destination operand is a memory location, then an immediate -bit offset follows the same rules as for a register. If the bit offset -is in a register, then it can be anything within the signed range of -the register used (i.e. for a 32-bit operand, it can be (-2^31) to (2^31 - 1)). - - -\S{insCALL} \i\c{CALL}: Call Subroutine - -\c CALL imm ; E8 rw/rd [8086] -\c CALL imm:imm16 ; o16 9A iw iw [8086] -\c CALL imm:imm32 ; o32 9A id iw [386] -\c CALL FAR mem16 ; o16 FF /3 [8086] -\c CALL FAR mem32 ; o32 FF /3 [386] -\c CALL r/m16 ; o16 FF /2 [8086] -\c CALL r/m32 ; o32 FF /2 [386] - -\c{CALL} calls a subroutine, by means of pushing the current -instruction pointer (\c{IP}) and optionally \c{CS} as well on the -stack, and then jumping to a given address. - -\c{CS} is pushed as well as \c{IP} if and only if the call is a far -call, i.e. a destination segment address is specified in the -instruction. The forms involving two colon-separated arguments are -far calls; so are the \c{CALL FAR mem} forms. - -The immediate \i{near call} takes one of two forms (\c{CALL imm16/imm32}), -determined by the current segment size limit. For 16-bit operands, -you would use \c{CALL 0x1234}, and for 32-bit operands you would use -\c{CALL 0x12345678}. The value passed as an operand is a relative offset. - -You can choose between the two immediate \i{far call} forms -(\c{CALL imm:imm}) by the use of the \c{WORD} and \c{DWORD} keywords: -\c{CALL WORD 0x1234:0x5678} or \c{CALL DWORD 0x1234:0x56789abc}. - -The \c{CALL FAR mem} forms execute a far call by loading the -destination address out of memory. The address loaded consists of 16 -or 32 bits of offset (depending on the operand size), and 16 bits of -segment. The operand size may be overridden using \c{CALL WORD FAR -mem} or \c{CALL DWORD FAR mem}. - -The \c{CALL r/m} forms execute a \i{near call} (within the same -segment), loading the destination address out of memory or out of a -register.
The keyword \c{NEAR} may be specified, for clarity, in -these forms, but is not necessary. Again, operand size can be -overridden using \c{CALL WORD mem} or \c{CALL DWORD mem}. - -As a convenience, NASM does not require you to call a far procedure -symbol by coding the cumbersome \c{CALL SEG routine:routine}, but -instead allows the easier synonym \c{CALL FAR routine}. - -The \c{CALL r/m} forms given above are near calls; NASM will accept -the \c{NEAR} keyword (e.g. \c{CALL NEAR [address]}), even though it -is not strictly necessary. - - -\S{insCBW} \i\c{CBW}, \i\c{CWD}, \i\c{CDQ}, \i\c{CWDE}: Sign Extensions - -\c CBW ; o16 98 [8086] -\c CWDE ; o32 98 [386] - -\c CWD ; o16 99 [8086] -\c CDQ ; o32 99 [386] - -All these instructions sign-extend a short value into a longer one, -by replicating the top bit of the original value to fill the -extended one. - -\c{CBW} extends \c{AL} into \c{AX} by repeating the top bit of -\c{AL} in every bit of \c{AH}. \c{CWDE} extends \c{AX} into -\c{EAX}. \c{CWD} extends \c{AX} into \c{DX:AX} by repeating -the top bit of \c{AX} throughout \c{DX}, and \c{CDQ} extends -\c{EAX} into \c{EDX:EAX}. - - -\S{insCLC} \i\c{CLC}, \i\c{CLD}, \i\c{CLI}, \i\c{CLTS}: Clear Flags - -\c CLC ; F8 [8086] -\c CLD ; FC [8086] -\c CLI ; FA [8086] -\c CLTS ; 0F 06 [286,PRIV] - -These instructions clear various flags. \c{CLC} clears the carry -flag; \c{CLD} clears the direction flag; \c{CLI} clears the -interrupt flag (thus disabling interrupts); and \c{CLTS} clears the -task-switched (\c{TS}) flag in \c{CR0}. - -To set the carry, direction, or interrupt flags, use the \c{STC}, -\c{STD} and \c{STI} instructions (\k{insSTC}). To invert the carry -flag, use \c{CMC} (\k{insCMC}). - - -\S{insCLFLUSH} \i\c{CLFLUSH}: Flush Cache Line - -\c CLFLUSH mem ; 0F AE /7 [WILLAMETTE,SSE2] - -\c{CLFLUSH} invalidates the cache line that contains the linear address -specified by the source operand from all levels of the processor cache -hierarchy (data and instruction). 
If, at any level of the cache -hierarchy, the line is inconsistent with memory (dirty) it is written -to memory before invalidation. The source operand points to a -byte-sized memory location. - -Although \c{CLFLUSH} is flagged \c{SSE2} and above, it may not be -present on all processors which have \c{SSE2} support, and it may be -supported on other processors; the \c{CPUID} instruction (\k{insCPUID}) -will return a bit which indicates support for the \c{CLFLUSH} instruction. - - -\S{insCMC} \i\c{CMC}: Complement Carry Flag - -\c CMC ; F5 [8086] - -\c{CMC} changes the value of the carry flag: if it was 0, it sets it -to 1, and vice versa. - - -\S{insCMOVcc} \i\c{CMOVcc}: Conditional Move - -\c CMOVcc reg16,r/m16 ; o16 0F 40+cc /r [P6] -\c CMOVcc reg32,r/m32 ; o32 0F 40+cc /r [P6] - -\c{CMOV} moves its source (second) operand into its destination -(first) operand if the given condition code is satisfied; otherwise -it does nothing. - -For a list of condition codes, see \k{iref-cc}. - -Although the \c{CMOV} instructions are flagged \c{P6} and above, they -may not be supported by all Pentium Pro processors; the \c{CPUID} -instruction (\k{insCPUID}) will return a bit which indicates whether -conditional moves are supported. 
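
A minimal sketch of a typical use, assuming a processor on which \c{CPUID} reports conditional-move support (the choice of \c{EAX} and \c{EBX} is arbitrary): a signed maximum computed without a branch.

\c         cmp     eax,ebx         ; compare the two signed values
\c         cmovl   eax,ebx         ; if EAX < EBX, copy EBX into EAX

After these two instructions \c{EAX} holds the larger of the two original values, and \c{EBX} is unchanged.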
- - -\S{insCMP} \i\c{CMP}: Compare Integers - -\c CMP r/m8,reg8 ; 38 /r [8086] -\c CMP r/m16,reg16 ; o16 39 /r [8086] -\c CMP r/m32,reg32 ; o32 39 /r [386] - -\c CMP reg8,r/m8 ; 3A /r [8086] -\c CMP reg16,r/m16 ; o16 3B /r [8086] -\c CMP reg32,r/m32 ; o32 3B /r [386] - -\c CMP r/m8,imm8 ; 80 /7 ib [8086] -\c CMP r/m16,imm16 ; o16 81 /7 iw [8086] -\c CMP r/m32,imm32 ; o32 81 /7 id [386] - -\c CMP r/m16,imm8 ; o16 83 /7 ib [8086] -\c CMP r/m32,imm8 ; o32 83 /7 ib [386] - -\c CMP AL,imm8 ; 3C ib [8086] -\c CMP AX,imm16 ; o16 3D iw [8086] -\c CMP EAX,imm32 ; o32 3D id [386] - -\c{CMP} performs a `mental' subtraction of its second operand from -its first operand, and affects the flags as if the subtraction had -taken place, but does not store the result of the subtraction -anywhere. - -In the forms with an 8-bit immediate second operand and a longer -first operand, the second operand is considered to be signed, and is -sign-extended to the length of the first operand. In these cases, -the \c{BYTE} qualifier is necessary to force NASM to generate this -form of the instruction. - -The destination operand can be a register or a memory location. The -source can be a register, memory location or an immediate value of -the same size as the destination. 
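
As a small illustration of the \c{BYTE} qualifier (the register and constant are arbitrary; the byte sequences are worked out by hand from the opcode table above, and assume 32-bit mode with no immediate-size optimisation enabled in the assembler):

\c         cmp     ebx,5           ; 81 FB 05 00 00 00 (r/m32,imm32 form)
\c         cmp     ebx,byte 5      ; 83 FB 05 (r/m32,imm8, sign-extended)

Both instructions set the flags identically; the second is simply three bytes shorter.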
- - -\S{insCMPccPD} \i\c{CMPccPD}: Packed Double-Precision FP Compare -\I\c{CMPEQPD} \I\c{CMPLTPD} \I\c{CMPLEPD} \I\c{CMPUNORDPD} -\I\c{CMPNEQPD} \I\c{CMPNLTPD} \I\c{CMPNLEPD} \I\c{CMPORDPD} - -\c CMPPD xmm1,xmm2/mem128,imm8 ; 66 0F C2 /r ib [WILLAMETTE,SSE2] - -\c CMPEQPD xmm1,xmm2/mem128 ; 66 0F C2 /r 00 [WILLAMETTE,SSE2] -\c CMPLTPD xmm1,xmm2/mem128 ; 66 0F C2 /r 01 [WILLAMETTE,SSE2] -\c CMPLEPD xmm1,xmm2/mem128 ; 66 0F C2 /r 02 [WILLAMETTE,SSE2] -\c CMPUNORDPD xmm1,xmm2/mem128 ; 66 0F C2 /r 03 [WILLAMETTE,SSE2] -\c CMPNEQPD xmm1,xmm2/mem128 ; 66 0F C2 /r 04 [WILLAMETTE,SSE2] -\c CMPNLTPD xmm1,xmm2/mem128 ; 66 0F C2 /r 05 [WILLAMETTE,SSE2] -\c CMPNLEPD xmm1,xmm2/mem128 ; 66 0F C2 /r 06 [WILLAMETTE,SSE2] -\c CMPORDPD xmm1,xmm2/mem128 ; 66 0F C2 /r 07 [WILLAMETTE,SSE2] - -The \c{CMPccPD} instructions compare the two packed double-precision -FP values in the source and destination operands, and returns the -result of the comparison in the destination register. The result of -each comparison is a quadword mask of all 1s (comparison true) or -all 0s (comparison false). - -The destination is an \c{XMM} register. The source can be either an -\c{XMM} register or a 128-bit memory location. - -The third operand is an 8-bit immediate value, of which the low 3 -bits define the type of comparison. For ease of programming, the -8 two-operand pseudo-instructions are provided, with the third -operand already filled in. 
The \I{Condition Predicates} -\c{Condition Predicates} are: - -\c EQ 0 Equal -\c LT 1 Less-than -\c LE 2 Less-than-or-equal -\c UNORD 3 Unordered -\c NE 4 Not-equal -\c NLT 5 Not-less-than -\c NLE 6 Not-less-than-or-equal -\c ORD 7 Ordered - -For more details of the comparison predicates, and details of how -to emulate the "greater-than" equivalents, see \k{iref-SSE-cc} - - -\S{insCMPccPS} \i\c{CMPccPS}: Packed Single-Precision FP Compare -\I\c{CMPEQPS} \I\c{CMPLTPS} \I\c{CMPLEPS} \I\c{CMPUNORDPS} -\I\c{CMPNEQPS} \I\c{CMPNLTPS} \I\c{CMPNLEPS} \I\c{CMPORDPS} - -\c CMPPS xmm1,xmm2/mem128,imm8 ; 0F C2 /r ib [KATMAI,SSE] - -\c CMPEQPS xmm1,xmm2/mem128 ; 0F C2 /r 00 [KATMAI,SSE] -\c CMPLTPS xmm1,xmm2/mem128 ; 0F C2 /r 01 [KATMAI,SSE] -\c CMPLEPS xmm1,xmm2/mem128 ; 0F C2 /r 02 [KATMAI,SSE] -\c CMPUNORDPS xmm1,xmm2/mem128 ; 0F C2 /r 03 [KATMAI,SSE] -\c CMPNEQPS xmm1,xmm2/mem128 ; 0F C2 /r 04 [KATMAI,SSE] -\c CMPNLTPS xmm1,xmm2/mem128 ; 0F C2 /r 05 [KATMAI,SSE] -\c CMPNLEPS xmm1,xmm2/mem128 ; 0F C2 /r 06 [KATMAI,SSE] -\c CMPORDPS xmm1,xmm2/mem128 ; 0F C2 /r 07 [KATMAI,SSE] - -The \c{CMPccPS} instructions compare the two packed single-precision -FP values in the source and destination operands, and returns the -result of the comparison in the destination register. The result of -each comparison is a doubleword mask of all 1s (comparison true) or -all 0s (comparison false). - -The destination is an \c{XMM} register. The source can be either an -\c{XMM} register or a 128-bit memory location. - -The third operand is an 8-bit immediate value, of which the low 3 -bits define the type of comparison. For ease of programming, the -8 two-operand pseudo-instructions are provided, with the third -operand already filled in. 
The \I{Condition Predicates} -\c{Condition Predicates} are: - -\c EQ 0 Equal -\c LT 1 Less-than -\c LE 2 Less-than-or-equal -\c UNORD 3 Unordered -\c NE 4 Not-equal -\c NLT 5 Not-less-than -\c NLE 6 Not-less-than-or-equal -\c ORD 7 Ordered - -For more details of the comparison predicates, and details of how -to emulate the "greater-than" equivalents, see \k{iref-SSE-cc} - - -\S{insCMPSB} \i\c{CMPSB}, \i\c{CMPSW}, \i\c{CMPSD}: Compare Strings - -\c CMPSB ; A6 [8086] -\c CMPSW ; o16 A7 [8086] -\c CMPSD ; o32 A7 [386] - -\c{CMPSB} compares the byte at \c{[DS:SI]} or \c{[DS:ESI]} with the -byte at \c{[ES:DI]} or \c{[ES:EDI]}, and sets the flags accordingly. -It then increments or decrements (depending on the direction flag: -increments if the flag is clear, decrements if it is set) \c{SI} and -\c{DI} (or \c{ESI} and \c{EDI}). - -The registers used are \c{SI} and \c{DI} if the address size is 16 -bits, and \c{ESI} and \c{EDI} if it is 32 bits. If you need to use -an address size not equal to the current \c{BITS} setting, you can -use an explicit \i\c{a16} or \i\c{a32} prefix. - -The segment register used to load from \c{[SI]} or \c{[ESI]} can be -overridden by using a segment register name as a prefix (for -example, \c{ES CMPSB}). The use of \c{ES} for the load from \c{[DI]} -or \c{[EDI]} cannot be overridden. - -\c{CMPSW} and \c{CMPSD} work in the same way, but they compare a -word or a doubleword instead of a byte, and increment or decrement -the addressing registers by 2 or 4 instead of 1. - -The \c{REPE} and \c{REPNE} prefixes (equivalently, \c{REPZ} and -\c{REPNZ}) may be used to repeat the instruction up to \c{CX} (or -\c{ECX} - again, the address size chooses which) times until the -first unequal or equal byte is found. 
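
The following fragment sketches a common use of \c{REPE CMPSB} in 32-bit code. The labels \c{buf1}, \c{buf2} and \c{differ} and the constant \c{len} are hypothetical, \c{len} is assumed to be non-zero, and \c{DS} and \c{ES} are assumed to address the same memory (as in a flat memory model):

\c         cld                     ; clear DF so ESI and EDI count upwards
\c         mov     esi,buf1        ; first buffer, read via DS
\c         mov     edi,buf2        ; second buffer, read via ES
\c         mov     ecx,len         ; maximum number of bytes to compare
\c         repe cmpsb              ; stop at the first unequal byte
\c         jne     differ          ; ZF clear: the buffers differ

If \c{ECX} is zero on entry, no comparison is performed and the flags are left as they were, which is why the sketch assumes a non-zero length.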
- - -\S{insCMPccSD} \i\c{CMPccSD}: Scalar Double-Precision FP Compare -\I\c{CMPEQSD} \I\c{CMPLTSD} \I\c{CMPLESD} \I\c{CMPUNORDSD} -\I\c{CMPNEQSD} \I\c{CMPNLTSD} \I\c{CMPNLESD} \I\c{CMPORDSD} - -\c CMPSD xmm1,xmm2/mem64,imm8 ; F2 0F C2 /r ib [WILLAMETTE,SSE2] - -\c CMPEQSD xmm1,xmm2/mem64 ; F2 0F C2 /r 00 [WILLAMETTE,SSE2] -\c CMPLTSD xmm1,xmm2/mem64 ; F2 0F C2 /r 01 [WILLAMETTE,SSE2] -\c CMPLESD xmm1,xmm2/mem64 ; F2 0F C2 /r 02 [WILLAMETTE,SSE2] -\c CMPUNORDSD xmm1,xmm2/mem64 ; F2 0F C2 /r 03 [WILLAMETTE,SSE2] -\c CMPNEQSD xmm1,xmm2/mem64 ; F2 0F C2 /r 04 [WILLAMETTE,SSE2] -\c CMPNLTSD xmm1,xmm2/mem64 ; F2 0F C2 /r 05 [WILLAMETTE,SSE2] -\c CMPNLESD xmm1,xmm2/mem64 ; F2 0F C2 /r 06 [WILLAMETTE,SSE2] -\c CMPORDSD xmm1,xmm2/mem64 ; F2 0F C2 /r 07 [WILLAMETTE,SSE2] - -The \c{CMPccSD} instructions compare the low-order double-precision -FP values in the source and destination operands, and return the -result of the comparison in the destination register. The result of -the comparison is a quadword mask of all 1s (comparison true) or -all 0s (comparison false). - -The destination is an \c{XMM} register. The source can be either an -\c{XMM} register or a 64-bit memory location. - -The third operand is an 8-bit immediate value, of which the low 3 -bits define the type of comparison. For ease of programming, the -8 two-operand pseudo-instructions are provided, with the third -operand already filled in.
The \I{Condition Predicates} -\c{Condition Predicates} are: - -\c EQ 0 Equal -\c LT 1 Less-than -\c LE 2 Less-than-or-equal -\c UNORD 3 Unordered -\c NE 4 Not-equal -\c NLT 5 Not-less-than -\c NLE 6 Not-less-than-or-equal -\c ORD 7 Ordered - -For more details of the comparison predicates, and details of how -to emulate the "greater-than" equivalents, see \k{iref-SSE-cc} - - -\S{insCMPccSS} \i\c{CMPccSS}: Scalar Single-Precision FP Compare -\I\c{CMPEQSS} \I\c{CMPLTSS} \I\c{CMPLESS} \I\c{CMPUNORDSS} -\I\c{CMPNEQSS} \I\c{CMPNLTSS} \I\c{CMPNLESS} \I\c{CMPORDSS} - -\c CMPSS xmm1,xmm2/mem32,imm8 ; F3 0F C2 /r ib [KATMAI,SSE] - -\c CMPEQSS xmm1,xmm2/mem32 ; F3 0F C2 /r 00 [KATMAI,SSE] -\c CMPLTSS xmm1,xmm2/mem32 ; F3 0F C2 /r 01 [KATMAI,SSE] -\c CMPLESS xmm1,xmm2/mem32 ; F3 0F C2 /r 02 [KATMAI,SSE] -\c CMPUNORDSS xmm1,xmm2/mem32 ; F3 0F C2 /r 03 [KATMAI,SSE] -\c CMPNEQSS xmm1,xmm2/mem32 ; F3 0F C2 /r 04 [KATMAI,SSE] -\c CMPNLTSS xmm1,xmm2/mem32 ; F3 0F C2 /r 05 [KATMAI,SSE] -\c CMPNLESS xmm1,xmm2/mem32 ; F3 0F C2 /r 06 [KATMAI,SSE] -\c CMPORDSS xmm1,xmm2/mem32 ; F3 0F C2 /r 07 [KATMAI,SSE] - -The \c{CMPccSS} instructions compare the low-order single-precision -FP values in the source and destination operands, and return the -result of the comparison in the destination register. The result of -the comparison is a doubleword mask of all 1s (comparison true) or -all 0s (comparison false). - -The destination is an \c{XMM} register. The source can be either an -\c{XMM} register or a 32-bit memory location. - -The third operand is an 8-bit immediate value, of which the low 3 -bits define the type of comparison. For ease of programming, the -8 two-operand pseudo-instructions are provided, with the third -operand already filled in.
The \I{Condition Predicates} -\c{Condition Predicates} are: - -\c EQ 0 Equal -\c LT 1 Less-than -\c LE 2 Less-than-or-equal -\c UNORD 3 Unordered -\c NE 4 Not-equal -\c NLT 5 Not-less-than -\c NLE 6 Not-less-than-or-equal -\c ORD 7 Ordered - -For more details of the comparison predicates, and details of how -to emulate the "greater-than" equivalents, see \k{iref-SSE-cc} - - -\S{insCMPXCHG} \i\c{CMPXCHG}, \i\c{CMPXCHG486}: Compare and Exchange - -\c CMPXCHG r/m8,reg8 ; 0F B0 /r [PENT] -\c CMPXCHG r/m16,reg16 ; o16 0F B1 /r [PENT] -\c CMPXCHG r/m32,reg32 ; o32 0F B1 /r [PENT] - -\c CMPXCHG486 r/m8,reg8 ; 0F A6 /r [486,UNDOC] -\c CMPXCHG486 r/m16,reg16 ; o16 0F A7 /r [486,UNDOC] -\c CMPXCHG486 r/m32,reg32 ; o32 0F A7 /r [486,UNDOC] - -These two instructions perform exactly the same operation; however, -apparently some (not all) 486 processors support it under a -non-standard opcode, so NASM provides the undocumented -\c{CMPXCHG486} form to generate the non-standard opcode. - -\c{CMPXCHG} compares its destination (first) operand to the value in -\c{AL}, \c{AX} or \c{EAX} (depending on the operand size of the -instruction). If they are equal, it copies its source (second) -operand into the destination and sets the zero flag. Otherwise, it -clears the zero flag and copies the destination register to AL, AX or EAX. - -The destination can be either a register or a memory location. The -source is a register. - -\c{CMPXCHG} is intended to be used for atomic operations in -multitasking or multiprocessor environments. To safely update a -value in shared memory, for example, you might load the value into -\c{EAX}, load the updated value into \c{EBX}, and then execute the -instruction \c{LOCK CMPXCHG [value],EBX}. If \c{value} has not -changed since being loaded, it is updated with your desired new -value, and the zero flag is set to let you know it has worked. 
(The -\c{LOCK} prefix prevents another processor doing anything in the -middle of this operation: it guarantees atomicity.) However, if -another processor has modified the value in between your load and -your attempted store, the store does not happen, and you are -notified of the failure by a cleared zero flag, so you can go round -and try again. - - -\S{insCMPXCHG8B} \i\c{CMPXCHG8B}: Compare and Exchange Eight Bytes - -\c CMPXCHG8B mem ; 0F C7 /1 [PENT] - -This is a larger and more unwieldy version of \c{CMPXCHG}: it -compares the 64-bit (eight-byte) value stored at \c{[mem]} with the -value in \c{EDX:EAX}. If they are equal, it sets the zero flag and -stores \c{ECX:EBX} into the memory area. If they are unequal, it -clears the zero flag and stores the memory contents into \c{EDX:EAX}. - -\c{CMPXCHG8B} can be used with the \c{LOCK} prefix, to allow atomic -execution. This is useful in multi-processor and multi-tasking -environments. - - -\S{insCOMISD} \i\c{COMISD}: Scalar Ordered Double-Precision FP Compare and Set EFLAGS - -\c COMISD xmm1,xmm2/mem64 ; 66 0F 2F /r [WILLAMETTE,SSE2] - -\c{COMISD} compares the low-order double-precision FP value in the -two source operands. ZF, PF and CF are set according to the result. -OF, SF and AF are cleared. The unordered result is returned if either -source is a NaN (QNaN or SNaN). - -The destination operand is an \c{XMM} register. The source can be either -an \c{XMM} register or a memory location. - -The flags are set according to the following rules: - -\c Result Flags Values - -\c UNORDERED: ZF,PF,CF <-- 111; -\c GREATER_THAN: ZF,PF,CF <-- 000; -\c LESS_THAN: ZF,PF,CF <-- 001; -\c EQUAL: ZF,PF,CF <-- 100; - - -\S{insCOMISS} \i\c{COMISS}: Scalar Ordered Single-Precision FP Compare and Set EFLAGS - -\c COMISS xmm1,xmm2/mem32 ; 0F 2F /r [KATMAI,SSE] - -\c{COMISS} compares the low-order single-precision FP value in the -two source operands. ZF, PF and CF are set according to the result. -OF, SF and AF are cleared.
The unordered result is returned if either -source is a NaN (QNaN or SNaN). - -The destination operand is an \c{XMM} register. The source can be either -an \c{XMM} register or a memory location. - -The flags are set according to the following rules: - -\c Result Flags Values - -\c UNORDERED: ZF,PF,CF <-- 111; -\c GREATER_THAN: ZF,PF,CF <-- 000; -\c LESS_THAN: ZF,PF,CF <-- 001; -\c EQUAL: ZF,PF,CF <-- 100; - - -\S{insCPUID} \i\c{CPUID}: Get CPU Identification Code - -\c CPUID ; 0F A2 [PENT] - -\c{CPUID} returns various information about the processor it is -being executed on. It fills the four registers \c{EAX}, \c{EBX}, -\c{ECX} and \c{EDX} with information, which varies depending on the -input contents of \c{EAX}. - -\c{CPUID} also acts as a barrier to serialize instruction execution: -executing the \c{CPUID} instruction guarantees that all the effects -(memory modification, flag modification, register modification) of -previous instructions have been completed before the next -instruction gets fetched. - -The information returned is as follows: - -\b If \c{EAX} is zero on input, \c{EAX} on output holds the maximum -acceptable input value of \c{EAX}, and \c{EBX:EDX:ECX} contain the -string \c{"GenuineIntel"} (or not, if you have a clone processor). -That is to say, \c{EBX} contains \c{"Genu"} (in NASM's own sense of -character constants, described in \k{chrconst}), \c{EDX} contains -\c{"ineI"} and \c{ECX} contains \c{"ntel"}. - -\b If \c{EAX} is one on input, \c{EAX} on output contains version -information about the processor, and \c{EDX} contains a set of -feature flags, showing the presence and absence of various features. -For example, bit 8 is set if the \c{CMPXCHG8B} instruction -(\k{insCMPXCHG8B}) is supported, bit 15 is set if the conditional -move instructions (\k{insCMOVcc} and \k{insFCMOVB}) are supported, -and bit 23 is set if \c{MMX} instructions are supported. 
- -\b If \c{EAX} is two on input, \c{EAX}, \c{EBX}, \c{ECX} and \c{EDX} -all contain information about caches and TLBs (Translation Lookaside -Buffers). - -For more information on the data returned from \c{CPUID}, see the -documentation from Intel and other processor manufacturers. - - -\S{insCVTDQ2PD} \i\c{CVTDQ2PD}: -Packed Signed INT32 to Packed Double-Precision FP Conversion - -\c CVTDQ2PD xmm1,xmm2/mem64 ; F3 0F E6 /r [WILLAMETTE,SSE2] - -\c{CVTDQ2PD} converts two packed signed doublewords from the source -operand to two packed double-precision FP values in the destination -operand. - -The destination operand is an \c{XMM} register. The source can be -either an \c{XMM} register or a 64-bit memory location. If the -source is a register, the packed integers are in the low quadword. - - -\S{insCVTDQ2PS} \i\c{CVTDQ2PS}: -Packed Signed INT32 to Packed Single-Precision FP Conversion - -\c CVTDQ2PS xmm1,xmm2/mem128 ; 0F 5B /r [WILLAMETTE,SSE2] - -\c{CVTDQ2PS} converts four packed signed doublewords from the source -operand to four packed single-precision FP values in the destination -operand. - -The destination operand is an \c{XMM} register. The source can be -either an \c{XMM} register or a 128-bit memory location. - -For more details of this instruction, see the Intel Processor manuals. - - -\S{insCVTPD2DQ} \i\c{CVTPD2DQ}: -Packed Double-Precision FP to Packed Signed INT32 Conversion - -\c CVTPD2DQ xmm1,xmm2/mem128 ; F2 0F E6 /r [WILLAMETTE,SSE2] - -\c{CVTPD2DQ} converts two packed double-precision FP values from the -source operand to two packed signed doublewords in the low quadword -of the destination operand. The high quadword of the destination is -set to all 0s. - -The destination operand is an \c{XMM} register. The source can be -either an \c{XMM} register or a 128-bit memory location. - -For more details of this instruction, see the Intel Processor manuals.
- - -\S{insCVTPD2PI} \i\c{CVTPD2PI}: -Packed Double-Precision FP to Packed Signed INT32 Conversion - -\c CVTPD2PI mm,xmm/mem128 ; 66 0F 2D /r [WILLAMETTE,SSE2] - -\c{CVTPD2PI} converts two packed double-precision FP values from the -source operand to two packed signed doublewords in the destination -operand. - -The destination operand is an \c{MMX} register. The source can be -either an \c{XMM} register or a 128-bit memory location. - -For more details of this instruction, see the Intel Processor manuals. - - -\S{insCVTPD2PS} \i\c{CVTPD2PS}: -Packed Double-Precision FP to Packed Single-Precision FP Conversion - -\c CVTPD2PS xmm1,xmm2/mem128 ; 66 0F 5A /r [WILLAMETTE,SSE2] - -\c{CVTPD2PS} converts two packed double-precision FP values from the -source operand to two packed single-precision FP values in the low -quadword of the destination operand. The high quadword of the -destination is set to all 0s. - -The destination operand is an \c{XMM} register. The source can be -either an \c{XMM} register or a 128-bit memory location. - -For more details of this instruction, see the Intel Processor manuals. - - -\S{insCVTPI2PD} \i\c{CVTPI2PD}: -Packed Signed INT32 to Packed Double-Precision FP Conversion - -\c CVTPI2PD xmm,mm/mem64 ; 66 0F 2A /r [WILLAMETTE,SSE2] - -\c{CVTPI2PD} converts two packed signed doublewords from the source -operand to two packed double-precision FP values in the destination -operand. - -The destination operand is an \c{XMM} register. The source can be -either an \c{MMX} register or a 64-bit memory location. - -For more details of this instruction, see the Intel Processor manuals. - - -\S{insCVTPI2PS} \i\c{CVTPI2PS}: -Packed Signed INT32 to Packed Single-FP Conversion - -\c CVTPI2PS xmm,mm/mem64 ; 0F 2A /r [KATMAI,SSE] - -\c{CVTPI2PS} converts two packed signed doublewords from the source -operand to two packed single-precision FP values in the low quadword -of the destination operand. The high quadword of the destination -remains unchanged. 
- -The destination operand is an \c{XMM} register. The source can be -either an \c{MMX} register or a 64-bit memory location. - -For more details of this instruction, see the Intel Processor manuals. - - -\S{insCVTPS2DQ} \i\c{CVTPS2DQ}: -Packed Single-Precision FP to Packed Signed INT32 Conversion - -\c CVTPS2DQ xmm1,xmm2/mem128 ; 66 0F 5B /r [WILLAMETTE,SSE2] - -\c{CVTPS2DQ} converts four packed single-precision FP values from the -source operand to four packed signed doublewords in the destination operand. - -The destination operand is an \c{XMM} register. The source can be -either an \c{XMM} register or a 128-bit memory location. - -For more details of this instruction, see the Intel Processor manuals. - - -\S{insCVTPS2PD} \i\c{CVTPS2PD}: -Packed Single-Precision FP to Packed Double-Precision FP Conversion - -\c CVTPS2PD xmm1,xmm2/mem64 ; 0F 5A /r [WILLAMETTE,SSE2] - -\c{CVTPS2PD} converts two packed single-precision FP values from the -source operand to two packed double-precision FP values in the destination -operand. - -The destination operand is an \c{XMM} register. The source can be -either an \c{XMM} register or a 64-bit memory location. If the source -is a register, the input values are in the low quadword. - -For more details of this instruction, see the Intel Processor manuals. - - -\S{insCVTPS2PI} \i\c{CVTPS2PI}: -Packed Single-Precision FP to Packed Signed INT32 Conversion - -\c CVTPS2PI mm,xmm/mem64 ; 0F 2D /r [KATMAI,SSE] - -\c{CVTPS2PI} converts two packed single-precision FP values from -the source operand to two packed signed doublewords in the destination -operand. - -The destination operand is an \c{MMX} register. The source can be -either an \c{XMM} register or a 64-bit memory location. If the -source is a register, the input values are in the low quadword. - -For more details of this instruction, see the Intel Processor manuals. 
- - -\S{insCVTSD2SI} \i\c{CVTSD2SI}: -Scalar Double-Precision FP to Signed INT32 Conversion - -\c CVTSD2SI reg32,xmm/mem64 ; F2 0F 2D /r [WILLAMETTE,SSE2] - -\c{CVTSD2SI} converts a double-precision FP value from the source -operand to a signed doubleword in the destination operand. - -The destination operand is a general purpose register. The source can be -either an \c{XMM} register or a 64-bit memory location. If the -source is a register, the input value is in the low quadword. - -For more details of this instruction, see the Intel Processor manuals. - - -\S{insCVTSD2SS} \i\c{CVTSD2SS}: -Scalar Double-Precision FP to Scalar Single-Precision FP Conversion - -\c CVTSD2SS xmm1,xmm2/mem64 ; F2 0F 5A /r [WILLAMETTE,SSE2] - -\c{CVTSD2SS} converts a double-precision FP value from the source -operand to a single-precision FP value in the low doubleword of the -destination operand. The upper 3 doublewords are left unchanged. - -The destination operand is an \c{XMM} register. The source can be -either an \c{XMM} register or a 64-bit memory location. If the -source is a register, the input value is in the low quadword. - -For more details of this instruction, see the Intel Processor manuals. - - -\S{insCVTSI2SD} \i\c{CVTSI2SD}: -Signed INT32 to Scalar Double-Precision FP Conversion - -\c CVTSI2SD xmm,r/m32 ; F2 0F 2A /r [WILLAMETTE,SSE2] - -\c{CVTSI2SD} converts a signed doubleword from the source operand to -a double-precision FP value in the low quadword of the destination -operand. The high quadword is left unchanged. - -The destination operand is an \c{XMM} register. The source can be either -a general purpose register or a 32-bit memory location. - -For more details of this instruction, see the Intel Processor manuals.
- - -\S{insCVTSI2SS} \i\c{CVTSI2SS}: -Signed INT32 to Scalar Single-Precision FP Conversion - -\c CVTSI2SS xmm,r/m32 ; F3 0F 2A /r [KATMAI,SSE] - -\c{CVTSI2SS} converts a signed doubleword from the source operand to a -single-precision FP value in the low doubleword of the destination operand. -The upper 3 doublewords are left unchanged. - -The destination operand is an \c{XMM} register. The source can be either -a general purpose register or a 32-bit memory location. - -For more details of this instruction, see the Intel Processor manuals. - - -\S{insCVTSS2SD} \i\c{CVTSS2SD}: -Scalar Single-Precision FP to Scalar Double-Precision FP Conversion - -\c CVTSS2SD xmm1,xmm2/mem32 ; F3 0F 5A /r [WILLAMETTE,SSE2] - -\c{CVTSS2SD} converts a single-precision FP value from the source operand -to a double-precision FP value in the low quadword of the destination -operand. The upper quadword is left unchanged. - -The destination operand is an \c{XMM} register. The source can be either -an \c{XMM} register or a 32-bit memory location. If the source is a -register, the input value is contained in the low doubleword. - -For more details of this instruction, see the Intel Processor manuals. - - -\S{insCVTSS2SI} \i\c{CVTSS2SI}: -Scalar Single-Precision FP to Signed INT32 Conversion - -\c CVTSS2SI reg32,xmm/mem32 ; F3 0F 2D /r [KATMAI,SSE] - -\c{CVTSS2SI} converts a single-precision FP value from the source -operand to a signed doubleword in the destination operand. - -The destination operand is a general purpose register. The source can be -either an \c{XMM} register or a 32-bit memory location. If the -source is a register, the input value is in the low doubleword. - -For more details of this instruction, see the Intel Processor manuals. 
- - -\S{insCVTTPD2DQ} \i\c{CVTTPD2DQ}: -Packed Double-Precision FP to Packed Signed INT32 Conversion with Truncation - -\c CVTTPD2DQ xmm1,xmm2/mem128 ; 66 0F E6 /r [WILLAMETTE,SSE2] - -\c{CVTTPD2DQ} converts two packed double-precision FP values in the source -operand to two packed signed doublewords in the low quadword of the -destination operand. If the result is inexact, it is truncated (rounded -toward zero). The high quadword is set to all 0s. - -The destination operand is an \c{XMM} register. The source can be -either an \c{XMM} register or a 128-bit memory location. - -For more details of this instruction, see the Intel Processor manuals. - - -\S{insCVTTPD2PI} \i\c{CVTTPD2PI}: -Packed Double-Precision FP to Packed Signed INT32 Conversion with Truncation - -\c CVTTPD2PI mm,xmm/mem128 ; 66 0F 2C /r [WILLAMETTE,SSE2] - -\c{CVTTPD2PI} converts two packed double-precision FP values in the source -operand to two packed signed doublewords in the destination operand. -If the result is inexact, it is truncated (rounded toward zero). - -The destination operand is an \c{MMX} register. The source can be -either an \c{XMM} register or a 128-bit memory location. - -For more details of this instruction, see the Intel Processor manuals. - - -\S{insCVTTPS2DQ} \i\c{CVTTPS2DQ}: -Packed Single-Precision FP to Packed Signed INT32 Conversion with Truncation - -\c CVTTPS2DQ xmm1,xmm2/mem128 ; F3 0F 5B /r [WILLAMETTE,SSE2] - -\c{CVTTPS2DQ} converts four packed single-precision FP values in the source -operand to four packed signed doublewords in the destination operand. -If the result is inexact, it is truncated (rounded toward zero). - -The destination operand is an \c{XMM} register. The source can be -either an \c{XMM} register or a 128-bit memory location. - -For more details of this instruction, see the Intel Processor manuals.
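
To illustrate the difference between the truncating and non-truncating conversions, here is a minimal sketch. The label \c{vals} is hypothetical and is assumed to be 16-byte aligned; the rounding mode in \c{MXCSR} is assumed to be the power-on default, round-to-nearest-even:

\c vals:   dd      1.7, -1.7, 2.5, -2.5
\c
\c         movaps    xmm0,[vals]   ; load four single-precision values
\c         cvttps2dq xmm1,xmm0     ; XMM1 = 1, -1, 2, -2 (toward zero)
\c         cvtps2dq  xmm2,xmm0     ; XMM2 = 2, -2, 2, -2 (MXCSR rounding)

The truncating form always rounds toward zero, whatever the current rounding mode, which matches the behaviour of a C cast from \c{float} to \c{int}.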
- - -\S{insCVTTPS2PI} \i\c{CVTTPS2PI}: -Packed Single-Precision FP to Packed Signed INT32 Conversion with Truncation - -\c CVTTPS2PI mm,xmm/mem64 ; 0F 2C /r [KATMAI,SSE] - -\c{CVTTPS2PI} converts two packed single-precision FP values in the source -operand to two packed signed doublewords in the destination operand. -If the result is inexact, it is truncated (rounded toward zero). - -The destination operand is an \c{MMX} register. The source can be -either an \c{XMM} register or a 64-bit memory location. If the source -is a register, the input values are in the low quadword. - -For more details of this instruction, see the Intel Processor manuals. - - -\S{insCVTTSD2SI} \i\c{CVTTSD2SI}: -Scalar Double-Precision FP to Signed INT32 Conversion with Truncation - -\c CVTTSD2SI reg32,xmm/mem64 ; F2 0F 2C /r [WILLAMETTE,SSE2] - -\c{CVTTSD2SI} converts a double-precision FP value in the source operand -to a signed doubleword in the destination operand. If the result is -inexact, it is truncated (rounded toward zero). - -The destination operand is a general purpose register. The source can be -either an \c{XMM} register or a 64-bit memory location. If the source is a -register, the input value is in the low quadword. - -For more details of this instruction, see the Intel Processor manuals. - - -\S{insCVTTSS2SI} \i\c{CVTTSS2SI}: -Scalar Single-Precision FP to Signed INT32 Conversion with Truncation - -\c CVTTSS2SI reg32,xmm/mem32 ; F3 0F 2C /r [KATMAI,SSE] - -\c{CVTTSS2SI} converts a single-precision FP value in the source operand -to a signed doubleword in the destination operand. If the result is -inexact, it is truncated (rounded toward zero). - -The destination operand is a general purpose register. The source can be -either an \c{XMM} register or a 32-bit memory location. If the source is a -register, the input value is in the low doubleword.
- -For more details of this instruction, see the Intel Processor manuals. - - -\S{insDAA} \i\c{DAA}, \i\c{DAS}: Decimal Adjustments - -\c DAA ; 27 [8086] -\c DAS ; 2F [8086] - -These instructions are used in conjunction with the add and subtract -instructions to perform binary-coded decimal arithmetic in -\e{packed} (one BCD digit per nibble) form. For the unpacked -equivalents, see \k{insAAA}. - -\c{DAA} should be used after a one-byte \c{ADD} instruction whose -destination was the \c{AL} register: by means of examining the value -in the \c{AL} and also the auxiliary carry flag \c{AF}, it -determines whether either digit of the addition has overflowed, and -adjusts it (and sets the carry and auxiliary-carry flags) if so. You -can add long BCD strings together by doing \c{ADD}/\c{DAA} on the -low two digits, then doing \c{ADC}/\c{DAA} on each subsequent pair -of digits. - -\c{DAS} works similarly to \c{DAA}, but is for use after \c{SUB} -instructions rather than \c{ADD}. - - -\S{insDEC} \i\c{DEC}: Decrement Integer - -\c DEC reg16 ; o16 48+r [8086] -\c DEC reg32 ; o32 48+r [386] -\c DEC r/m8 ; FE /1 [8086] -\c DEC r/m16 ; o16 FF /1 [8086] -\c DEC r/m32 ; o32 FF /1 [386] - -\c{DEC} subtracts 1 from its operand. It does \e{not} affect the -carry flag: to affect the carry flag, use \c{SUB something,1} (see -\k{insSUB}). \c{DEC} affects all the other flags according to the result. - -This instruction can be used with a \c{LOCK} prefix to allow atomic -execution. - -See also \c{INC} (\k{insINC}). - - -\S{insDIV} \i\c{DIV}: Unsigned Integer Divide - -\c DIV r/m8 ; F6 /6 [8086] -\c DIV r/m16 ; o16 F7 /6 [8086] -\c DIV r/m32 ; o32 F7 /6 [386] - -\c{DIV} performs unsigned integer division. The explicit operand -provided is the divisor; the dividend and destination operands are -implicit, in the following way: - -\b For \c{DIV r/m8}, \c{AX} is divided by the given operand; the -quotient is stored in \c{AL} and the remainder in \c{AH}. 
- -\b For \c{DIV r/m16}, \c{DX:AX} is divided by the given operand; the -quotient is stored in \c{AX} and the remainder in \c{DX}. - -\b For \c{DIV r/m32}, \c{EDX:EAX} is divided by the given operand; -the quotient is stored in \c{EAX} and the remainder in \c{EDX}. - -Signed integer division is performed by the \c{IDIV} instruction: -see \k{insIDIV}. - - -\S{insDIVPD} \i\c{DIVPD}: Packed Double-Precision FP Divide - -\c DIVPD xmm1,xmm2/mem128 ; 66 0F 5E /r [WILLAMETTE,SSE2] - -\c{DIVPD} divides the two packed double-precision FP values in -the destination operand by the two packed double-precision FP -values in the source operand, and stores the packed double-precision -results in the destination register. - -The destination is an \c{XMM} register. The source operand can be -either an \c{XMM} register or a 128-bit memory location. - -\c dst[0-63] := dst[0-63] / src[0-63], -\c dst[64-127] := dst[64-127] / src[64-127]. - - -\S{insDIVPS} \i\c{DIVPS}: Packed Single-Precision FP Divide - -\c DIVPS xmm1,xmm2/mem128 ; 0F 5E /r [KATMAI,SSE] - -\c{DIVPS} divides the four packed single-precision FP values in -the destination operand by the four packed single-precision FP -values in the source operand, and stores the packed single-precision -results in the destination register. - -The destination is an \c{XMM} register. The source operand can be -either an \c{XMM} register or a 128-bit memory location. - -\c dst[0-31] := dst[0-31] / src[0-31], -\c dst[32-63] := dst[32-63] / src[32-63], -\c dst[64-95] := dst[64-95] / src[64-95], -\c dst[96-127] := dst[96-127] / src[96-127]. - - -\S{insDIVSD} \i\c{DIVSD}: Scalar Double-Precision FP Divide - -\c DIVSD xmm1,xmm2/mem64 ; F2 0F 5E /r [WILLAMETTE,SSE2] - -\c{DIVSD} divides the low-order double-precision FP value in the -destination operand by the low-order double-precision FP value in -the source operand, and stores the double-precision result in the -destination register. - -The destination is an \c{XMM} register. 
The source operand can be -either an \c{XMM} register or a 64-bit memory location. - -\c dst[0-63] := dst[0-63] / src[0-63], -\c dst[64-127] remains unchanged. - - -\S{insDIVSS} \i\c{DIVSS}: Scalar Single-Precision FP Divide - -\c DIVSS xmm1,xmm2/mem32 ; F3 0F 5E /r [KATMAI,SSE] - -\c{DIVSS} divides the low-order single-precision FP value in the -destination operand by the low-order single-precision FP value in -the source operand, and stores the single-precision result in the -destination register. - -The destination is an \c{XMM} register. The source operand can be -either an \c{XMM} register or a 32-bit memory location. - -\c dst[0-31] := dst[0-31] / src[0-31], -\c dst[32-127] remains unchanged. - - -\S{insEMMS} \i\c{EMMS}: Empty MMX State - -\c EMMS ; 0F 77 [PENT,MMX] - -\c{EMMS} sets the FPU tag word (marking which floating-point registers -are available) to all ones, meaning all registers are available for -the FPU to use. It should be used after executing \c{MMX} instructions -and before executing any subsequent floating-point operations. - - -\S{insENTER} \i\c{ENTER}: Create Stack Frame - -\c ENTER imm,imm ; C8 iw ib [186] - -\c{ENTER} constructs a \i\c{stack frame} for a high-level language -procedure call. The first operand (the \c{iw} in the opcode -definition above refers to the first operand) gives the amount of -stack space to allocate for local variables; the second (the \c{ib} -above) gives the nesting level of the procedure (for languages like -Pascal, with nested procedures). - -The function of \c{ENTER}, with a nesting level of zero, is -equivalent to - -\c PUSH EBP ; or PUSH BP in 16 bits -\c MOV EBP,ESP ; or MOV BP,SP in 16 bits -\c SUB ESP,operand1 ; or SUB SP,operand1 in 16 bits - -This creates a stack frame with the procedure parameters accessible -upwards from \c{EBP}, and local variables accessible downwards from -\c{EBP}. 
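A minimal sketch of a procedure using a zero-nesting-level frame is
given here, assuming 32-bit code and a caller that pushes two doubleword
parameters before the call. The label \c{sum2} and the parameter layout
are assumptions made for this example only.

\c         bits 32
\c sum2:   enter 4,0               ; PUSH EBP / MOV EBP,ESP / SUB ESP,4
\c         mov   eax,[ebp+8]       ; first parameter, above the return address
\c         add   eax,[ebp+12]      ; second parameter
\c         mov   [ebp-4],eax       ; local variable, below EBP
\c         mov   eax,[ebp-4]       ; return value in EAX
\c         leave                   ; MOV ESP,EBP / POP EBP
\c         ret

Here \c{[EBP]} holds the saved frame pointer and \c{[EBP+4]} the return
address, which is why the parameters start at \c{[EBP+8]}.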
- -With a nesting level of one, the stack frame created is 4 (or 2) -bytes bigger, and the value of the final frame pointer \c{EBP} is -accessible in memory at \c{[EBP-4]}. - -This allows \c{ENTER}, when called with a nesting level of two, to -look at the stack frame described by the \e{previous} value of -\c{EBP}, find the frame pointer at offset -4 from that, and push it -along with its new frame pointer, so that when a level-two procedure -is called from within a level-one procedure, \c{[EBP-4]} holds the -frame pointer of the most recent level-one procedure call and -\c{[EBP-8]} holds that of the most recent level-two call. And so on, -for nesting levels up to 31. - -Stack frames created by \c{ENTER} can be destroyed by the \c{LEAVE} -instruction: see \k{insLEAVE}. - - -\S{insF2XM1} \i\c{F2XM1}: Calculate 2**X-1 - -\c F2XM1 ; D9 F0 [8086,FPU] - -\c{F2XM1} raises 2 to the power of \c{ST0}, subtracts one, and -stores the result back into \c{ST0}. The initial contents of \c{ST0} -must be a number in the range -1.0 to +1.0. - - -\S{insFABS} \i\c{FABS}: Floating-Point Absolute Value - -\c FABS ; D9 E1 [8086,FPU] - -\c{FABS} computes the absolute value of \c{ST0}, by clearing the sign -bit, and stores the result back in \c{ST0}. - - -\S{insFADD} \i\c{FADD}, \i\c{FADDP}: Floating-Point Addition - -\c FADD mem32 ; D8 /0 [8086,FPU] -\c FADD mem64 ; DC /0 [8086,FPU] - -\c FADD fpureg ; D8 C0+r [8086,FPU] -\c FADD ST0,fpureg ; D8 C0+r [8086,FPU] - -\c FADD TO fpureg ; DC C0+r [8086,FPU] -\c FADD fpureg,ST0 ; DC C0+r [8086,FPU] - -\c FADDP fpureg ; DE C0+r [8086,FPU] -\c FADDP fpureg,ST0 ; DE C0+r [8086,FPU] - -\b \c{FADD}, given one operand, adds the operand to \c{ST0} and stores -the result back in \c{ST0}. If the operand has the \c{TO} modifier, -the result is stored in the register given rather than in \c{ST0}. - -\b \c{FADDP} performs the same function as \c{FADD TO}, but pops the -register stack after storing the result.
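The three behaviours listed above can be followed in a short sequence.
This is only an illustrative sketch; it assumes the FPU register stack
has at least two free slots on entry.

\c         fld1                    ; ST0 = 1.0
\c         fld1                    ; ST0 = 1.0, ST1 = 1.0
\c         fadd  st1               ; ST0 = ST0 + ST1 = 2.0
\c         fadd  to st1            ; ST1 = ST1 + ST0 = 3.0
\c         faddp st1               ; ST1 = 3.0 + 2.0, then pop: ST0 = 5.0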
- -The given two-operand forms are synonyms for the one-operand forms. - -To add an integer value to \c{ST0}, use the \c{FIADD} instruction -(\k{insFIADD}). - - -\S{insFBLD} \i\c{FBLD}, \i\c{FBSTP}: BCD Floating-Point Load and Store - -\c FBLD mem80 ; DF /4 [8086,FPU] -\c FBSTP mem80 ; DF /6 [8086,FPU] - -\c{FBLD} loads an 80-bit (ten-byte) packed binary-coded decimal -number from the given memory address, converts it to a real, and -pushes it on the register stack. \c{FBSTP} stores the value of -\c{ST0}, in packed BCD, at the given address and then pops the -register stack. - - -\S{insFCHS} \i\c{FCHS}: Floating-Point Change Sign - -\c FCHS ; D9 E0 [8086,FPU] - -\c{FCHS} negates the number in \c{ST0}, by inverting the sign bit: -negative numbers become positive, and vice versa. - - -\S{insFCLEX} \i\c{FCLEX}, \i\c{FNCLEX}: Clear Floating-Point Exceptions - -\c FCLEX ; 9B DB E2 [8086,FPU] -\c FNCLEX ; DB E2 [8086,FPU] - -\c{FCLEX} clears any floating-point exceptions which may be pending. -\c{FNCLEX} does the same thing but doesn't wait for previous -floating-point operations (including the \e{handling} of pending -exceptions) to finish first.
- - -\S{insFCMOVB} \i\c{FCMOVcc}: Floating-Point Conditional Move - -\c FCMOVB fpureg ; DA C0+r [P6,FPU] -\c FCMOVB ST0,fpureg ; DA C0+r [P6,FPU] - -\c FCMOVE fpureg ; DA C8+r [P6,FPU] -\c FCMOVE ST0,fpureg ; DA C8+r [P6,FPU] - -\c FCMOVBE fpureg ; DA D0+r [P6,FPU] -\c FCMOVBE ST0,fpureg ; DA D0+r [P6,FPU] - -\c FCMOVU fpureg ; DA D8+r [P6,FPU] -\c FCMOVU ST0,fpureg ; DA D8+r [P6,FPU] - -\c FCMOVNB fpureg ; DB C0+r [P6,FPU] -\c FCMOVNB ST0,fpureg ; DB C0+r [P6,FPU] - -\c FCMOVNE fpureg ; DB C8+r [P6,FPU] -\c FCMOVNE ST0,fpureg ; DB C8+r [P6,FPU] - -\c FCMOVNBE fpureg ; DB D0+r [P6,FPU] -\c FCMOVNBE ST0,fpureg ; DB D0+r [P6,FPU] - -\c FCMOVNU fpureg ; DB D8+r [P6,FPU] -\c FCMOVNU ST0,fpureg ; DB D8+r [P6,FPU] - -The \c{FCMOV} instructions perform conditional move operations: each -of them moves the contents of the given register into \c{ST0} if its -condition is satisfied, and does nothing if not. - -The conditions are not the same as the standard condition codes used -with conditional jump instructions. The conditions \c{B}, \c{BE}, -\c{NB}, \c{NBE}, \c{E} and \c{NE} are exactly as normal, but none of -the other standard ones are supported. Instead, the condition \c{U} -and its counterpart \c{NU} are provided; the \c{U} condition is -satisfied if the last two floating-point numbers compared were -\e{unordered}, i.e. they were not equal but neither one could be -said to be greater than the other, for example if they were NaNs. -(The flag state which signals this is the setting of the parity -flag: so the \c{U} condition is notionally equivalent to \c{PE}, and -\c{NU} is equivalent to \c{PO}.) - -The \c{FCMOV} conditions test the main processor's status flags, not -the FPU status flags, so using \c{FCMOV} directly after \c{FCOM} -will not work. Instead, you should either use \c{FCOMI} which writes -directly to the main CPU flags word, or use \c{FSTSW} to extract the -FPU flags. 
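As a sketch of the \c{FCOMI} approach just described, the following
pair of instructions leaves the larger of \c{ST0} and \c{ST1} in
\c{ST0}. It assumes a processor that supports both \c{FCOMI} and
\c{FCMOVcc}.

\c         fcomi  st0,st1          ; compare ST0 with ST1, result in CF/ZF/PF
\c         fcmovb st0,st1          ; if ST0 < ST1 (CF set), copy ST1 into ST0

Note that an unordered comparison also sets the carry flag, so if either
value is a NaN the move is performed as well; test the \c{U} condition
first if that case matters.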
- -Although the \c{FCMOV} instructions are flagged \c{P6} above, they -may not be supported by all Pentium Pro processors; the \c{CPUID} -instruction (\k{insCPUID}) will return a bit which indicates whether -conditional moves are supported. - - -\S{insFCOM} \i\c{FCOM}, \i\c{FCOMP}, \i\c{FCOMPP}, \i\c{FCOMI}, -\i\c{FCOMIP}: Floating-Point Compare - -\c FCOM mem32 ; D8 /2 [8086,FPU] -\c FCOM mem64 ; DC /2 [8086,FPU] -\c FCOM fpureg ; D8 D0+r [8086,FPU] -\c FCOM ST0,fpureg ; D8 D0+r [8086,FPU] - -\c FCOMP mem32 ; D8 /3 [8086,FPU] -\c FCOMP mem64 ; DC /3 [8086,FPU] -\c FCOMP fpureg ; D8 D8+r [8086,FPU] -\c FCOMP ST0,fpureg ; D8 D8+r [8086,FPU] - -\c FCOMPP ; DE D9 [8086,FPU] - -\c FCOMI fpureg ; DB F0+r [P6,FPU] -\c FCOMI ST0,fpureg ; DB F0+r [P6,FPU] - -\c FCOMIP fpureg ; DF F0+r [P6,FPU] -\c FCOMIP ST0,fpureg ; DF F0+r [P6,FPU] - -\c{FCOM} compares \c{ST0} with the given operand, and sets the FPU -flags accordingly. \c{ST0} is treated as the left-hand side of the -comparison, so that the carry flag is set (for a `less-than' result) -if \c{ST0} is less than the given operand. - -\c{FCOMP} does the same as \c{FCOM}, but pops the register stack -afterwards. \c{FCOMPP} compares \c{ST0} with \c{ST1} and then pops -the register stack twice. - -\c{FCOMI} and \c{FCOMIP} work like the corresponding forms of -\c{FCOM} and \c{FCOMP}, but write their results directly to the CPU -flags register rather than the FPU status word, so they can be -immediately followed by conditional jump or conditional move -instructions. - -The \c{FCOM} instructions differ from the \c{FUCOM} instructions -(\k{insFUCOM}) only in the way they handle quiet NaNs: \c{FUCOM} -will handle them silently and set the condition code flags to an -`unordered' result, whereas \c{FCOM} will generate an exception. - - -\S{insFCOS} \i\c{FCOS}: Cosine - -\c FCOS ; D9 FF [386,FPU] - -\c{FCOS} computes the cosine of \c{ST0} (in radians), and stores the -result in \c{ST0}. 
The absolute value of \c{ST0} must be less than 2**63. - -See also \c{FSINCOS} (\k{insFSIN}). - - -\S{insFDECSTP} \i\c{FDECSTP}: Decrement Floating-Point Stack Pointer - -\c FDECSTP ; D9 F6 [8086,FPU] - -\c{FDECSTP} decrements the `top' field in the floating-point status -word. This has the effect of rotating the FPU register stack by one, -as if the contents of \c{ST7} had been pushed on the stack. See also -\c{FINCSTP} (\k{insFINCSTP}). - - -\S{insFDISI} \i\c{FxDISI}, \i\c{FxENI}: Disable and Enable Floating-Point Interrupts - -\c FDISI ; 9B DB E1 [8086,FPU] -\c FNDISI ; DB E1 [8086,FPU] - -\c FENI ; 9B DB E0 [8086,FPU] -\c FNENI ; DB E0 [8086,FPU] - -\c{FDISI} and \c{FENI} disable and enable floating-point interrupts. -These instructions are only meaningful on original 8087 processors: -the 287 and above treat them as no-operation instructions. - -\c{FNDISI} and \c{FNENI} do the same thing as \c{FDISI} and \c{FENI} -respectively, but without waiting for the floating-point processor -to finish what it was doing first. - - -\S{insFDIV} \i\c{FDIV}, \i\c{FDIVP}, \i\c{FDIVR}, \i\c{FDIVRP}: Floating-Point Division - -\c FDIV mem32 ; D8 /6 [8086,FPU] -\c FDIV mem64 ; DC /6 [8086,FPU] - -\c FDIV fpureg ; D8 F0+r [8086,FPU] -\c FDIV ST0,fpureg ; D8 F0+r [8086,FPU] - -\c FDIV TO fpureg ; DC F8+r [8086,FPU] -\c FDIV fpureg,ST0 ; DC F8+r [8086,FPU] - -\c FDIVR mem32 ; D8 /7 [8086,FPU] -\c FDIVR mem64 ; DC /7 [8086,FPU] - -\c FDIVR fpureg ; D8 F8+r [8086,FPU] -\c FDIVR ST0,fpureg ; D8 F8+r [8086,FPU] - -\c FDIVR TO fpureg ; DC F0+r [8086,FPU] -\c FDIVR fpureg,ST0 ; DC F0+r [8086,FPU] - -\c FDIVP fpureg ; DE F8+r [8086,FPU] -\c FDIVP fpureg,ST0 ; DE F8+r [8086,FPU] - -\c FDIVRP fpureg ; DE F0+r [8086,FPU] -\c FDIVRP fpureg,ST0 ; DE F0+r [8086,FPU] - -\b \c{FDIV} divides \c{ST0} by the given operand and stores the result -back in \c{ST0}, unless the \c{TO} qualifier is given, in which case -it divides the given operand by \c{ST0} and stores the result in the -operand. 
- -\b \c{FDIVR} does the same thing, but does the division the other way -up: so if \c{TO} is not given, it divides the given operand by -\c{ST0} and stores the result in \c{ST0}, whereas if \c{TO} is given -it divides \c{ST0} by its operand and stores the result in the -operand. - -\b \c{FDIVP} operates like \c{FDIV TO}, but pops the register stack -once it has finished. - -\b \c{FDIVRP} operates like \c{FDIVR TO}, but pops the register stack -once it has finished. - -For FP/Integer divisions, see \c{FIDIV} (\k{insFIDIV}). - - -\S{insFEMMS} \i\c{FEMMS}: Faster Enter/Exit of the MMX or floating-point state - -\c FEMMS ; 0F 0E [PENT,3DNOW] - -\c{FEMMS} can be used in place of the \c{EMMS} instruction on -processors which support the 3DNow! instruction set. Following -execution of \c{FEMMS}, the state of the \c{MMX/FP} registers -is undefined, and this allows a faster context switch between -\c{FP} and \c{MMX} instructions. The \c{FEMMS} instruction can -also be used \e{before} executing \c{MMX} instructions. - - -\S{insFFREE} \i\c{FFREE}: Flag Floating-Point Register as Unused - -\c FFREE fpureg ; DD C0+r [8086,FPU] -\c FFREEP fpureg ; DF C0+r [286,FPU,UNDOC] - -\c{FFREE} marks the given register as being empty. - -\c{FFREEP} marks the given register as being empty, and then -pops the register stack. - - -\S{insFIADD} \i\c{FIADD}: Floating-Point/Integer Addition - -\c FIADD mem16 ; DE /0 [8086,FPU] -\c FIADD mem32 ; DA /0 [8086,FPU] - -\c{FIADD} adds the 16-bit or 32-bit integer stored in the given -memory location to \c{ST0}, storing the result in \c{ST0}. - - -\S{insFICOM} \i\c{FICOM}, \i\c{FICOMP}: Floating-Point/Integer Compare - -\c FICOM mem16 ; DE /2 [8086,FPU] -\c FICOM mem32 ; DA /2 [8086,FPU] - -\c FICOMP mem16 ; DE /3 [8086,FPU] -\c FICOMP mem32 ; DA /3 [8086,FPU] - -\c{FICOM} compares \c{ST0} with the 16-bit or 32-bit integer stored -in the given memory location, and sets the FPU flags accordingly.
-\c{FICOMP} does the same, but pops the register stack afterwards. - - -\S{insFIDIV} \i\c{FIDIV}, \i\c{FIDIVR}: Floating-Point/Integer Division - -\c FIDIV mem16 ; DE /6 [8086,FPU] -\c FIDIV mem32 ; DA /6 [8086,FPU] - -\c FIDIVR mem16 ; DE /7 [8086,FPU] -\c FIDIVR mem32 ; DA /7 [8086,FPU] - -\c{FIDIV} divides \c{ST0} by the 16-bit or 32-bit integer stored in -the given memory location, and stores the result in \c{ST0}. -\c{FIDIVR} does the division the other way up: it divides the -integer by \c{ST0}, but still stores the result in \c{ST0}. - - -\S{insFILD} \i\c{FILD}, \i\c{FIST}, \i\c{FISTP}: Floating-Point/Integer Conversion - -\c FILD mem16 ; DF /0 [8086,FPU] -\c FILD mem32 ; DB /0 [8086,FPU] -\c FILD mem64 ; DF /5 [8086,FPU] - -\c FIST mem16 ; DF /2 [8086,FPU] -\c FIST mem32 ; DB /2 [8086,FPU] - -\c FISTP mem16 ; DF /3 [8086,FPU] -\c FISTP mem32 ; DB /3 [8086,FPU] -\c FISTP mem64 ; DF /7 [8086,FPU] - -\c{FILD} loads an integer out of a memory location, converts it to a -real, and pushes it on the FPU register stack. \c{FIST} converts -\c{ST0} to an integer and stores that in memory; \c{FISTP} does the -same as \c{FIST}, but pops the register stack afterwards. - - -\S{insFIMUL} \i\c{FIMUL}: Floating-Point/Integer Multiplication - -\c FIMUL mem16 ; DE /1 [8086,FPU] -\c FIMUL mem32 ; DA /1 [8086,FPU] - -\c{FIMUL} multiplies \c{ST0} by the 16-bit or 32-bit integer stored -in the given memory location, and stores the result in \c{ST0}. - - -\S{insFINCSTP} \i\c{FINCSTP}: Increment Floating-Point Stack Pointer - -\c FINCSTP ; D9 F7 [8086,FPU] - -\c{FINCSTP} increments the `top' field in the floating-point status -word. This has the effect of rotating the FPU register stack by one, -as if the register stack had been popped; however, unlike the -popping of the stack performed by many FPU instructions, it does not -flag the new \c{ST7} (previously \c{ST0}) as empty. See also -\c{FDECSTP} (\k{insFDECSTP}). 
- - -\S{insFINIT} \i\c{FINIT}, \i\c{FNINIT}: Initialize Floating-Point Unit - -\c FINIT ; 9B DB E3 [8086,FPU] -\c FNINIT ; DB E3 [8086,FPU] - -\c{FINIT} initializes the FPU to its default state. It flags all -registers as empty, without actually changing their values, and clears -the top-of-stack pointer. \c{FNINIT} does the same, without first -waiting for pending exceptions to clear. - - -\S{insFISUB} \i\c{FISUB}: Floating-Point/Integer Subtraction - -\c FISUB mem16 ; DE /4 [8086,FPU] -\c FISUB mem32 ; DA /4 [8086,FPU] - -\c FISUBR mem16 ; DE /5 [8086,FPU] -\c FISUBR mem32 ; DA /5 [8086,FPU] - -\c{FISUB} subtracts the 16-bit or 32-bit integer stored in the given -memory location from \c{ST0}, and stores the result in \c{ST0}. -\c{FISUBR} does the subtraction the other way round, i.e. it -subtracts \c{ST0} from the given integer, but still stores the -result in \c{ST0}. - - -\S{insFLD} \i\c{FLD}: Floating-Point Load - -\c FLD mem32 ; D9 /0 [8086,FPU] -\c FLD mem64 ; DD /0 [8086,FPU] -\c FLD mem80 ; DB /5 [8086,FPU] -\c FLD fpureg ; D9 C0+r [8086,FPU] - -\c{FLD} loads a floating-point value out of the given register or -memory location, and pushes it on the FPU register stack. - - -\S{insFLD1} \i\c{FLDxx}: Floating-Point Load Constants - -\c FLD1 ; D9 E8 [8086,FPU] -\c FLDL2E ; D9 EA [8086,FPU] -\c FLDL2T ; D9 E9 [8086,FPU] -\c FLDLG2 ; D9 EC [8086,FPU] -\c FLDLN2 ; D9 ED [8086,FPU] -\c FLDPI ; D9 EB [8086,FPU] -\c FLDZ ; D9 EE [8086,FPU] - -These instructions push specific standard constants on the FPU -register stack.
- -\c Instruction Constant pushed - -\c FLD1 1 -\c FLDL2E base-2 logarithm of e -\c FLDL2T base-2 log of 10 -\c FLDLG2 base-10 log of 2 -\c FLDLN2 base-e log of 2 -\c FLDPI pi -\c FLDZ zero - - -\S{insFLDCW} \i\c{FLDCW}: Load Floating-Point Control Word - -\c FLDCW mem16 ; D9 /5 [8086,FPU] - -\c{FLDCW} loads a 16-bit value out of memory and stores it into the -FPU control word (governing things like the rounding mode, the -precision, and the exception masks). See also \c{FSTCW} -(\k{insFSTCW}). If exceptions are enabled and you don't want to -generate one, use \c{FCLEX} or \c{FNCLEX} (\k{insFCLEX}) before -loading the new control word. - - -\S{insFLDENV} \i\c{FLDENV}: Load Floating-Point Environment - -\c FLDENV mem ; D9 /4 [8086,FPU] - -\c{FLDENV} loads the FPU operating environment (control word, status -word, tag word, instruction pointer, data pointer and last opcode) -from memory. The memory area is 14 or 28 bytes long, depending on -the CPU mode at the time. See also \c{FSTENV} (\k{insFSTENV}). - - -\S{insFMUL} \i\c{FMUL}, \i\c{FMULP}: Floating-Point Multiply - -\c FMUL mem32 ; D8 /1 [8086,FPU] -\c FMUL mem64 ; DC /1 [8086,FPU] - -\c FMUL fpureg ; D8 C8+r [8086,FPU] -\c FMUL ST0,fpureg ; D8 C8+r [8086,FPU] - -\c FMUL TO fpureg ; DC C8+r [8086,FPU] -\c FMUL fpureg,ST0 ; DC C8+r [8086,FPU] - -\c FMULP fpureg ; DE C8+r [8086,FPU] -\c FMULP fpureg,ST0 ; DE C8+r [8086,FPU] - -\c{FMUL} multiplies \c{ST0} by the given operand, and stores the -result in \c{ST0}, unless the \c{TO} qualifier is used in which case -it stores the result in the operand. \c{FMULP} performs the same -operation as \c{FMUL TO}, and then pops the register stack. - - -\S{insFNOP} \i\c{FNOP}: Floating-Point No Operation - -\c FNOP ; D9 D0 [8086,FPU] - -\c{FNOP} does nothing. 
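A common use of \c{FLDCW}, described above, is to switch the rounding
mode temporarily. The sketch below stores \c{ST0} as a truncated 32-bit
integer and then restores the previous control word. The labels
\c{oldcw}, \c{newcw}, \c{result} and \c{trunc} are hypothetical, and
32-bit code is assumed.

\c         bits 32
\c         section .bss
\c oldcw:  resw 1                  ; saved control word
\c newcw:  resw 1                  ; modified control word
\c result: resd 1                  ; truncated integer result
\c         section .text
\c trunc:  fnstcw [oldcw]          ; save the current control word
\c         mov    ax,[oldcw]
\c         or     ax,0x0C00        ; rounding-control bits 10-11 = 11b (toward zero)
\c         mov    [newcw],ax
\c         fldcw  [newcw]          ; switch to truncation
\c         fistp  dword [result]   ; store ST0 as a 32-bit integer and pop
\c         fldcw  [oldcw]          ; restore the previous rounding mode
\c         ret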
- - -\S{insFPATAN} \i\c{FPATAN}, \i\c{FPTAN}: Arctangent and Tangent - -\c FPATAN ; D9 F3 [8086,FPU] -\c FPTAN ; D9 F2 [8086,FPU] - -\c{FPATAN} computes the arctangent, in radians, of the result of -dividing \c{ST1} by \c{ST0}, stores the result in \c{ST1}, and pops -the register stack. It works like the C \c{atan2} function, in that -changing the sign of both \c{ST0} and \c{ST1} changes the output -value by pi (so it performs true rectangular-to-polar coordinate -conversion, with \c{ST1} being the Y coordinate and \c{ST0} being -the X coordinate, not merely an arctangent). - -\c{FPTAN} computes the tangent of the value in \c{ST0} (in radians), -and stores the result back into \c{ST0}. - -The absolute value of \c{ST0} must be less than 2**63. - - -\S{insFPREM} \i\c{FPREM}, \i\c{FPREM1}: Floating-Point Partial Remainder - -\c FPREM ; D9 F8 [8086,FPU] -\c FPREM1 ; D9 F5 [386,FPU] - -These instructions both produce the remainder obtained by dividing -\c{ST0} by \c{ST1}. This is calculated, notionally, by dividing -\c{ST0} by \c{ST1}, rounding the result to an integer, multiplying -by \c{ST1} again, and computing the value which would need to be -added back on to the result to get back to the original value in -\c{ST0}. - -The two instructions differ in the way the notional round-to-integer -operation is performed. \c{FPREM} does it by rounding towards zero, -so that the remainder it returns always has the same sign as the -original value in \c{ST0}; \c{FPREM1} does it by rounding to the -nearest integer, so that the remainder always has at most half the -magnitude of \c{ST1}. - -Both instructions calculate \e{partial} remainders, meaning that -they may not manage to provide the final result, but might leave -intermediate results in \c{ST0} instead. If this happens, they will -set the C2 flag in the FPU status word; therefore, to calculate a -remainder, you should repeatedly execute \c{FPREM} or \c{FPREM1} -until C2 becomes clear. 
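The repeat-until-C2-clear loop described above might be written as
follows. This is a sketch only; the label \c{fmodl} is a hypothetical
name, and the dividend and divisor are assumed to be in \c{ST0} and
\c{ST1} already.

\c fmodl:
\c .again: fprem                   ; partial remainder of ST0 / ST1
\c         fnstsw ax               ; copy the FPU status word into AX
\c         test   ah,0x04          ; C2 is bit 10 of the status word (bit 2 of AH)
\c         jnz    .again           ; C2 set: reduction incomplete, go round again
\c         ret                     ; ST0 = remainder, ST1 unchanged

Substituting \c{FPREM1} for \c{FPREM} gives the round-to-nearest form of
the remainder with the same loop structure.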
- - -\S{insFRNDINT} \i\c{FRNDINT}: Floating-Point Round to Integer - -\c FRNDINT ; D9 FC [8086,FPU] - -\c{FRNDINT} rounds the contents of \c{ST0} to an integer, according -to the current rounding mode set in the FPU control word, and stores -the result back in \c{ST0}. - - -\S{insFRSTOR} \i\c{FSAVE}, \i\c{FRSTOR}: Save/Restore Floating-Point State - -\c FSAVE mem ; 9B DD /6 [8086,FPU] -\c FNSAVE mem ; DD /6 [8086,FPU] - -\c FRSTOR mem ; DD /4 [8086,FPU] - -\c{FSAVE} saves the entire floating-point unit state, including all -the information saved by \c{FSTENV} (\k{insFSTENV}) plus the -contents of all the registers, to a 94 or 108 byte area of memory -(depending on the CPU mode). \c{FRSTOR} restores the floating-point -state from the same area of memory. - -\c{FNSAVE} does the same as \c{FSAVE}, without first waiting for -pending floating-point exceptions to clear. - - -\S{insFSCALE} \i\c{FSCALE}: Scale Floating-Point Value by Power of Two - -\c FSCALE ; D9 FD [8086,FPU] - -\c{FSCALE} scales a number by a power of two: it rounds \c{ST1} -towards zero to obtain an integer, then multiplies \c{ST0} by two to -the power of that integer, and stores the result in \c{ST0}. - - -\S{insFSETPM} \i\c{FSETPM}: Set Protected Mode - -\c FSETPM ; DB E4 [286,FPU] - -This instruction initializes protected mode on the 287 floating-point -coprocessor. It is only meaningful on that processor: the 387 and -above treat the instruction as a no-operation. - - -\S{insFSIN} \i\c{FSIN}, \i\c{FSINCOS}: Sine and Cosine - -\c FSIN ; D9 FE [386,FPU] -\c FSINCOS ; D9 FB [386,FPU] - -\c{FSIN} calculates the sine of \c{ST0} (in radians) and stores the -result in \c{ST0}. \c{FSINCOS} does the same, but then pushes the -cosine of the same value on the register stack, so that the sine -ends up in \c{ST1} and the cosine in \c{ST0}. \c{FSINCOS} is faster -than executing \c{FSIN} and \c{FCOS} (see \k{insFCOS}) in succession. - -The absolute value of \c{ST0} must be less than 2**63. 
- - -\S{insFSQRT} \i\c{FSQRT}: Floating-Point Square Root - -\c FSQRT ; D9 FA [8086,FPU] - -\c{FSQRT} calculates the square root of \c{ST0} and stores the -result in \c{ST0}. - - -\S{insFST} \i\c{FST}, \i\c{FSTP}: Floating-Point Store - -\c FST mem32 ; D9 /2 [8086,FPU] -\c FST mem64 ; DD /2 [8086,FPU] -\c FST fpureg ; DD D0+r [8086,FPU] - -\c FSTP mem32 ; D9 /3 [8086,FPU] -\c FSTP mem64 ; DD /3 [8086,FPU] -\c FSTP mem80 ; DB /7 [8086,FPU] -\c FSTP fpureg ; DD D8+r [8086,FPU] - -\c{FST} stores the value in \c{ST0} into the given memory location -or other FPU register. \c{FSTP} does the same, but then pops the -register stack. - - -\S{insFSTCW} \i\c{FSTCW}: Store Floating-Point Control Word - -\c FSTCW mem16 ; 9B D9 /7 [8086,FPU] -\c FNSTCW mem16 ; D9 /7 [8086,FPU] - -\c{FSTCW} stores the \c{FPU} control word (governing things like the -rounding mode, the precision, and the exception masks) into a 2-byte -memory area. See also \c{FLDCW} (\k{insFLDCW}). - -\c{FNSTCW} does the same thing as \c{FSTCW}, without first waiting -for pending floating-point exceptions to clear. - - -\S{insFSTENV} \i\c{FSTENV}: Store Floating-Point Environment - -\c FSTENV mem ; 9B D9 /6 [8086,FPU] -\c FNSTENV mem ; D9 /6 [8086,FPU] - -\c{FSTENV} stores the \c{FPU} operating environment (control word, -status word, tag word, instruction pointer, data pointer and last -opcode) into memory. The memory area is 14 or 28 bytes long, -depending on the CPU mode at the time. See also \c{FLDENV} -(\k{insFLDENV}). - -\c{FNSTENV} does the same thing as \c{FSTENV}, without first waiting -for pending floating-point exceptions to clear. - - -\S{insFSTSW} \i\c{FSTSW}: Store Floating-Point Status Word - -\c FSTSW mem16 ; 9B DD /7 [8086,FPU] -\c FSTSW AX ; 9B DF E0 [286,FPU] - -\c FNSTSW mem16 ; DD /7 [8086,FPU] -\c FNSTSW AX ; DF E0 [286,FPU] - -\c{FSTSW} stores the \c{FPU} status word into \c{AX} or into a 2-byte -memory area. 
- -\c{FNSTSW} does the same thing as \c{FSTSW}, without first waiting -for pending floating-point exceptions to clear. - - -\S{insFSUB} \i\c{FSUB}, \i\c{FSUBP}, \i\c{FSUBR}, \i\c{FSUBRP}: Floating-Point Subtract - -\c FSUB mem32 ; D8 /4 [8086,FPU] -\c FSUB mem64 ; DC /4 [8086,FPU] - -\c FSUB fpureg ; D8 E0+r [8086,FPU] -\c FSUB ST0,fpureg ; D8 E0+r [8086,FPU] - -\c FSUB TO fpureg ; DC E8+r [8086,FPU] -\c FSUB fpureg,ST0 ; DC E8+r [8086,FPU] - -\c FSUBR mem32 ; D8 /5 [8086,FPU] -\c FSUBR mem64 ; DC /5 [8086,FPU] - -\c FSUBR fpureg ; D8 E8+r [8086,FPU] -\c FSUBR ST0,fpureg ; D8 E8+r [8086,FPU] - -\c FSUBR TO fpureg ; DC E0+r [8086,FPU] -\c FSUBR fpureg,ST0 ; DC E0+r [8086,FPU] - -\c FSUBP fpureg ; DE E8+r [8086,FPU] -\c FSUBP fpureg,ST0 ; DE E8+r [8086,FPU] - -\c FSUBRP fpureg ; DE E0+r [8086,FPU] -\c FSUBRP fpureg,ST0 ; DE E0+r [8086,FPU] - -\b \c{FSUB} subtracts the given operand from \c{ST0} and stores the -result back in \c{ST0}, unless the \c{TO} qualifier is given, in -which case it subtracts \c{ST0} from the given operand and stores -the result in the operand. - -\b \c{FSUBR} does the same thing, but does the subtraction the other -way up: so if \c{TO} is not given, it subtracts \c{ST0} from the given -operand and stores the result in \c{ST0}, whereas if \c{TO} is given -it subtracts its operand from \c{ST0} and stores the result in the -operand. - -\b \c{FSUBP} operates like \c{FSUB TO}, but pops the register stack -once it has finished. - -\b \c{FSUBRP} operates like \c{FSUBR TO}, but pops the register stack -once it has finished. - - -\S{insFTST} \i\c{FTST}: Test \c{ST0} Against Zero - -\c FTST ; D9 E4 [8086,FPU] - -\c{FTST} compares \c{ST0} with zero and sets the FPU flags -accordingly. \c{ST0} is treated as the left-hand side of the -comparison, so that a `less-than' result is generated if \c{ST0} is -negative. 
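Since \c{FTST} sets only the FPU flags, its result is usually moved to
the CPU flags with \c{FNSTSW AX} followed by \c{SAHF}. The sketch below
returns a sign code in \c{EAX}; the label \c{sign} and the return values
(2 for a NaN, -1, 0 and 1 otherwise) are assumptions made for this
example.

\c sign:   ftst                    ; compare ST0 with 0.0
\c         fnstsw ax               ; status word -> AX
\c         sahf                    ; C0 -> CF, C2 -> PF, C3 -> ZF
\c         mov    eax,2            ; MOV leaves the flags untouched
\c         jp     .done            ; unordered: ST0 is a NaN
\c         mov    eax,-1
\c         jb     .done            ; ST0 is negative
\c         mov    eax,0
\c         je     .done            ; ST0 is zero
\c         mov    eax,1            ; otherwise ST0 is positive
\c .done:  ret

The unordered case is tested first because it sets all three of C3, C2
and C0, and would otherwise be mistaken for a `less-than' result.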
- - -\S{insFUCOM} \i\c{FUCOMxx}: Floating-Point Unordered Compare - -\c FUCOM fpureg ; DD E0+r [386,FPU] -\c FUCOM ST0,fpureg ; DD E0+r [386,FPU] - -\c FUCOMP fpureg ; DD E8+r [386,FPU] -\c FUCOMP ST0,fpureg ; DD E8+r [386,FPU] - -\c FUCOMPP ; DA E9 [386,FPU] - -\c FUCOMI fpureg ; DB E8+r [P6,FPU] -\c FUCOMI ST0,fpureg ; DB E8+r [P6,FPU] - -\c FUCOMIP fpureg ; DF E8+r [P6,FPU] -\c FUCOMIP ST0,fpureg ; DF E8+r [P6,FPU] - -\b \c{FUCOM} compares \c{ST0} with the given operand, and sets the -FPU flags accordingly. \c{ST0} is treated as the left-hand side of -the comparison, so that the carry flag is set (for a `less-than' -result) if \c{ST0} is less than the given operand. - -\b \c{FUCOMP} does the same as \c{FUCOM}, but pops the register stack -afterwards. \c{FUCOMPP} compares \c{ST0} with \c{ST1} and then pops -the register stack twice. - -\b \c{FUCOMI} and \c{FUCOMIP} work like the corresponding forms of -\c{FUCOM} and \c{FUCOMP}, but write their results directly to the CPU -flags register rather than the FPU status word, so they can be -immediately followed by conditional jump or conditional move -instructions. - -The \c{FUCOM} instructions differ from the \c{FCOM} instructions -(\k{insFCOM}) only in the way they handle quiet NaNs: \c{FUCOM} will -handle them silently and set the condition code flags to an -`unordered' result, whereas \c{FCOM} will generate an exception. - - -\S{insFXAM} \i\c{FXAM}: Examine Class of Value in \c{ST0} - -\c FXAM ; D9 E5 [8086,FPU] - -\c{FXAM} sets the FPU flags \c{C3}, \c{C2} and \c{C0} depending on -the type of value stored in \c{ST0}: - -\c Register contents Flags - -\c Unsupported format 000 -\c NaN 001 -\c Finite number 010 -\c Infinity 011 -\c Zero 100 -\c Empty register 101 -\c Denormal 110 - -Additionally, the \c{C1} flag is set to the sign of the number. 
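One way of turning the \c{FXAM} flags into the three-bit value shown in
the table above is sketched here, assuming 32-bit code. The label
\c{class} is a hypothetical name; on return \c{EAX} holds the bits
C3, C2 and C0 packed in that order.

\c class:  fxam                    ; classify ST0 into C3, C2, C0 (sign in C1)
\c         fnstsw ax               ; AH now holds status word bits 8-15
\c         movzx  ecx,ah           ; bit 0 = C0, bit 1 = C1, bit 2 = C2, bit 6 = C3
\c         mov    eax,ecx
\c         and    eax,1            ; C0 -> result bit 0
\c         mov    edx,ecx
\c         shr    edx,1
\c         and    edx,2            ; C2 -> result bit 1
\c         or     eax,edx
\c         shr    ecx,4
\c         and    ecx,4            ; C3 -> result bit 2
\c         or     eax,ecx          ; EAX = 0 to 6, matching the table rows
\c         ret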
- - -\S{insFXCH} \i\c{FXCH}: Floating-Point Exchange - -\c FXCH ; D9 C9 [8086,FPU] -\c FXCH fpureg ; D9 C8+r [8086,FPU] -\c FXCH fpureg,ST0 ; D9 C8+r [8086,FPU] -\c FXCH ST0,fpureg ; D9 C8+r [8086,FPU] - -\c{FXCH} exchanges \c{ST0} with a given FPU register. The no-operand -form exchanges \c{ST0} with \c{ST1}. - - -\S{insFXRSTOR} \i\c{FXRSTOR}: Restore \c{FP}, \c{MMX} and \c{SSE} State - -\c FXRSTOR memory ; 0F AE /1 [P6,SSE,FPU] - -The \c{FXRSTOR} instruction reloads the \c{FPU}, \c{MMX} and \c{SSE} -state (environment and registers), from the 512 byte memory area defined -by the source operand. This data should have been written by a previous -\c{FXSAVE}. - - -\S{insFXSAVE} \i\c{FXSAVE}: Store \c{FP}, \c{MMX} and \c{SSE} State - -\c FXSAVE memory ; 0F AE /0 [P6,SSE,FPU] - -The \c{FXSAVE} instruction writes the current \c{FPU}, \c{MMX} -and \c{SSE} technology states (environment and registers), to the -512 byte memory area defined by the destination operand. It does this -without checking for pending unmasked floating-point exceptions -(similar to the operation of \c{FNSAVE}). - -Unlike the \c{FSAVE/FNSAVE} instructions, the processor retains the -contents of the \c{FPU}, \c{MMX} and \c{SSE} state in the processor -after the state has been saved. This instruction has been optimized -to maximize floating-point save performance. - - -\S{insFXTRACT} \i\c{FXTRACT}: Extract Exponent and Significand - -\c FXTRACT ; D9 F4 [8086,FPU] - -\c{FXTRACT} separates the number in \c{ST0} into its exponent and -significand (mantissa), stores the exponent back into \c{ST0}, and -then pushes the significand on the register stack (so that the -significand ends up in \c{ST0}, and the exponent in \c{ST1}).
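As a small worked case of \c{FXTRACT}, the value 12.0 can be written as
1.5 times 2**3, so the instruction splits it into a significand of 1.5
and an exponent of 3.0. The labels \c{x} and \c{split} are hypothetical
names for this sketch.

\c         section .data
\c x:      dd 12.0                 ; 12.0 = 1.5 * 2**3
\c         section .text
\c split:  fld   dword [x]         ; ST0 = 12.0
\c         fxtract                 ; ST0 = 1.5 (significand), ST1 = 3.0 (exponent)
\c         ret

The routine leaves both results on the FPU register stack, so the caller
is expected to pop or store them.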
- - -\S{insFYL2X} \i\c{FYL2X}, \i\c{FYL2XP1}: Compute Y times Log2(X) or Log2(X+1) - -\c FYL2X ; D9 F1 [8086,FPU] -\c FYL2XP1 ; D9 F9 [8086,FPU] - -\c{FYL2X} multiplies \c{ST1} by the base-2 logarithm of \c{ST0}, -stores the result in \c{ST1}, and pops the register stack (so that -the result ends up in \c{ST0}). \c{ST0} must be non-zero and -positive. - -\c{FYL2XP1} works the same way, but replacing the base-2 log of -\c{ST0} with that of \c{ST0} plus one. This time, \c{ST0} must have -magnitude no greater than 1 minus half the square root of two. - - -\S{insHLT} \i\c{HLT}: Halt Processor - -\c HLT ; F4 [8086,PRIV] - -\c{HLT} puts the processor into a halted state, where it will -perform no more operations until restarted by an interrupt or a -reset. - -On the 286 and later processors, this is a privileged instruction. - - -\S{insIBTS} \i\c{IBTS}: Insert Bit String - -\c IBTS r/m16,reg16 ; o16 0F A7 /r [386,UNDOC] -\c IBTS r/m32,reg32 ; o32 0F A7 /r [386,UNDOC] - -The implied operation of this instruction is: - -\c IBTS r/m16,AX,CL,reg16 -\c IBTS r/m32,EAX,CL,reg32 - -Writes a bit string from the source operand to the destination. -\c{CL} indicates the number of bits to be copied, from the low bits -of the source. \c{(E)AX} indicates the low order bit offset in the -destination that is written to. For example, if \c{CL} is set to 4 -and \c{AX} (for 16-bit code) is set to 5, bits 0-3 of \c{src} will -be copied to bits 5-8 of \c{dst}. This instruction is very poorly -documented, and I have been unable to find any official source of -documentation on it. - -\c{IBTS} is supported only on the early Intel 386s, and conflicts -with the opcodes for \c{CMPXCHG486} (on early Intel 486s). NASM -supports it only for completeness. Its counterpart is \c{XBTS} -(see \k{insXBTS}). - - -\S{insIDIV} \i\c{IDIV}: Signed Integer Divide - -\c IDIV r/m8 ; F6 /7 [8086] -\c IDIV r/m16 ; o16 F7 /7 [8086] -\c IDIV r/m32 ; o32 F7 /7 [386] - -\c{IDIV} performs signed integer division. 
The explicit operand -provided is the divisor; the dividend and destination operands -are implicit, in the following way: - -\b For \c{IDIV r/m8}, \c{AX} is divided by the given operand; -the quotient is stored in \c{AL} and the remainder in \c{AH}. - -\b For \c{IDIV r/m16}, \c{DX:AX} is divided by the given operand; -the quotient is stored in \c{AX} and the remainder in \c{DX}. - -\b For \c{IDIV r/m32}, \c{EDX:EAX} is divided by the given operand; -the quotient is stored in \c{EAX} and the remainder in \c{EDX}. - -Unsigned integer division is performed by the \c{DIV} instruction: -see \k{insDIV}. - - -\S{insIMUL} \i\c{IMUL}: Signed Integer Multiply - -\c IMUL r/m8 ; F6 /5 [8086] -\c IMUL r/m16 ; o16 F7 /5 [8086] -\c IMUL r/m32 ; o32 F7 /5 [386] - -\c IMUL reg16,r/m16 ; o16 0F AF /r [386] -\c IMUL reg32,r/m32 ; o32 0F AF /r [386] - -\c IMUL reg16,imm8 ; o16 6B /r ib [186] -\c IMUL reg16,imm16 ; o16 69 /r iw [186] -\c IMUL reg32,imm8 ; o32 6B /r ib [386] -\c IMUL reg32,imm32 ; o32 69 /r id [386] - -\c IMUL reg16,r/m16,imm8 ; o16 6B /r ib [186] -\c IMUL reg16,r/m16,imm16 ; o16 69 /r iw [186] -\c IMUL reg32,r/m32,imm8 ; o32 6B /r ib [386] -\c IMUL reg32,r/m32,imm32 ; o32 69 /r id [386] - -\c{IMUL} performs signed integer multiplication. For the -single-operand form, the other operand and destination are -implicit, in the following way: - -\b For \c{IMUL r/m8}, \c{AL} is multiplied by the given operand; -the product is stored in \c{AX}. - -\b For \c{IMUL r/m16}, \c{AX} is multiplied by the given operand; -the product is stored in \c{DX:AX}. - -\b For \c{IMUL r/m32}, \c{EAX} is multiplied by the given operand; -the product is stored in \c{EDX:EAX}. - -The two-operand form multiplies its two operands and stores the -result in the destination (first) operand. The three-operand -form multiplies its last two operands and stores the result in -the first operand. 
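The implicit operands of \c{IMUL} and \c{IDIV} are easy to get wrong, so a short sketch may help. It assumes 32-bit code; the routine name and the particular values are illustrative only.

```nasm
        bits 32

multiply_demo:                  ; hypothetical routine name
        mov     eax, 6
        mov     ebx, -7
        imul    ebx             ; one-operand form: EDX:EAX = EAX * EBX = -42
        mov     ecx, 3
        imul    ecx, ebx        ; two-operand form: ECX = ECX * EBX = -21
        imul    edx, ebx, 10    ; three-operand form: EDX = EBX * 10 = -70

        mov     eax, -42
        cdq                     ; sign-extend EAX into EDX:EAX (the dividend)
        mov     ecx, 5
        idiv    ecx             ; EAX = -8 (quotient), EDX = -2 (remainder)
        ret
```

Note the \c{CDQ} before \c{IDIV}: the dividend is the full \c{EDX:EAX} pair, so \c{EDX} has to hold the sign extension of \c{EAX} before dividing.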
- -The two-operand form with an immediate second operand is in -fact a shorthand for the three-operand form, as can be seen by -examining the opcode descriptions: in the two-operand form, the -code \c{/r} takes both its register and \c{r/m} parts from the -same operand (the first one). - -In the forms with an 8-bit immediate operand and another longer -source operand, the immediate operand is considered to be signed, -and is sign-extended to the length of the other source operand. -In these cases, the \c{BYTE} qualifier is necessary to force -NASM to generate this form of the instruction. - -Unsigned integer multiplication is performed by the \c{MUL} -instruction: see \k{insMUL}. - - -\S{insIN} \i\c{IN}: Input from I/O Port - -\c IN AL,imm8 ; E4 ib [8086] -\c IN AX,imm8 ; o16 E5 ib [8086] -\c IN EAX,imm8 ; o32 E5 ib [386] -\c IN AL,DX ; EC [8086] -\c IN AX,DX ; o16 ED [8086] -\c IN EAX,DX ; o32 ED [386] - -\c{IN} reads a byte, word or doubleword from the specified I/O port, -and stores it in the given destination register. The port number may -be specified as an immediate value if it is between 0 and 255, and -otherwise must be stored in \c{DX}. See also \c{OUT} (\k{insOUT}). - - -\S{insINC} \i\c{INC}: Increment Integer - -\c INC reg16 ; o16 40+r [8086] -\c INC reg32 ; o32 40+r [386] -\c INC r/m8 ; FE /0 [8086] -\c INC r/m16 ; o16 FF /0 [8086] -\c INC r/m32 ; o32 FF /0 [386] - -\c{INC} adds 1 to its operand. It does \e{not} affect the carry -flag: to affect the carry flag, use \c{ADD something,1} (see -\k{insADD}). \c{INC} affects all the other flags according to the result. - -This instruction can be used with a \c{LOCK} prefix to allow atomic execution. - -See also \c{DEC} (\k{insDEC}). - - -\S{insINSB} \i\c{INSB}, \i\c{INSW}, \i\c{INSD}: Input String from I/O Port - -\c INSB ; 6C [186] -\c INSW ; o16 6D [186] -\c INSD ; o32 6D [386] - -\c{INSB} inputs a byte from the I/O port specified in \c{DX} and -stores it at \c{[ES:DI]} or \c{[ES:EDI]}. 
It then increments or -decrements (depending on the direction flag: increments if the flag -is clear, decrements if it is set) \c{DI} or \c{EDI}. - -The register used is \c{DI} if the address size is 16 bits, and -\c{EDI} if it is 32 bits. If you need to use an address size not -equal to the current \c{BITS} setting, you can use an explicit -\i\c{a16} or \i\c{a32} prefix. - -Segment override prefixes have no effect for this instruction: the -use of \c{ES} for the store to \c{[DI]} or \c{[EDI]} cannot be -overridden. - -\c{INSW} and \c{INSD} work in the same way, but they input a word or -a doubleword instead of a byte, and increment or decrement the -addressing register by 2 or 4 instead of 1. - -The \c{REP} prefix may be used to repeat the instruction \c{CX} (or -\c{ECX} - again, the address size chooses which) times. - -See also \c{OUTSB}, \c{OUTSW} and \c{OUTSD} (\k{insOUTSB}). - - -\S{insINT} \i\c{INT}: Software Interrupt - -\c INT imm8 ; CD ib [8086] - -\c{INT} causes a software interrupt through a specified vector -number from 0 to 255. - -The code generated by the \c{INT} instruction is always two bytes -long: although there are short forms for some \c{INT} instructions, -NASM does not generate them when it sees the \c{INT} mnemonic. In -order to generate single-byte breakpoint instructions, use the -\c{INT3} or \c{INT1} instructions (see \k{insINT1}) instead. - - -\S{insINT1} \i\c{INT3}, \i\c{INT1}, \i\c{ICEBP}, \i\c{INT01}: Breakpoints - -\c INT1 ; F1 [P6] -\c ICEBP ; F1 [P6] -\c INT01 ; F1 [P6] - -\c INT3 ; CC [8086] -\c INT03 ; CC [8086] - -\c{INT1} and \c{INT3} are short one-byte forms of the instructions -\c{INT 1} and \c{INT 3} (see \k{insINT}). They perform a similar -function to their longer counterparts, but take up less code space. -They are used as breakpoints by debuggers. - -\b \c{INT1}, and its alternative synonyms \c{INT01} and \c{ICEBP}, is -an instruction used by in-circuit emulators (ICEs).
It is present, -though not documented, on some processors down to the 286, but is -only documented for the Pentium Pro. \c{INT3} is the instruction -normally used as a breakpoint by debuggers. - -\b \c{INT3}, and its synonym \c{INT03}, is not precisely equivalent to -\c{INT 3}: the short form, since it is designed to be used as a -breakpoint, bypasses the normal \c{IOPL} checks in virtual-8086 mode, -and also does not go through interrupt redirection. - - -\S{insINTO} \i\c{INTO}: Interrupt if Overflow - -\c INTO ; CE [8086] - -\c{INTO} performs an \c{INT 4} software interrupt (see \k{insINT}) -if and only if the overflow flag is set. - - -\S{insINVD} \i\c{INVD}: Invalidate Internal Caches - -\c INVD ; 0F 08 [486] - -\c{INVD} invalidates and empties the processor's internal caches, -and causes the processor to instruct external caches to do the same. -It does not write the contents of the caches back to memory first: -any modified data held in the caches will be lost. To write the data -back first, use \c{WBINVD} (\k{insWBINVD}). - - -\S{insINVLPG} \i\c{INVLPG}: Invalidate TLB Entry - -\c INVLPG mem ; 0F 01 /7 [486] - -\c{INVLPG} invalidates the translation lookaside buffer (TLB) entry -associated with the supplied memory address. - - -\S{insIRET} \i\c{IRET}, \i\c{IRETW}, \i\c{IRETD}: Return from Interrupt - -\c IRET ; CF [8086] -\c IRETW ; o16 CF [8086] -\c IRETD ; o32 CF [386] - -\c{IRET} returns from an interrupt (hardware or software) by -popping \c{IP} (or \c{EIP}), \c{CS} and the flags off the stack -and then continuing execution from the new \c{CS:IP}. - -\c{IRETW} pops \c{IP}, \c{CS} and the flags as 2 bytes each, taking -6 bytes off the stack in total. \c{IRETD} pops \c{EIP} as 4 bytes, -pops a further 4 bytes of which the top two are discarded and the -bottom two go into \c{CS}, and pops the flags as 4 bytes as well, -taking 12 bytes off the stack.
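A minimal sketch of where \c{IRET} sits in practice is a real-mode interrupt handler skeleton. The handler name is hypothetical, and the body is left as a comment since it depends entirely on the device being serviced.

```nasm
        bits 16

timer_isr:                      ; hypothetical handler name
        push    ax              ; save whatever the handler will clobber
        ; ... service the interrupt here ...
        pop     ax              ; restore it in reverse order
        iret                    ; pop IP, CS and FLAGS: 6 bytes in 16-bit code
```

Everything the handler pushes must be popped again before the \c{IRET}, otherwise the wrong words are taken as the return address and flags.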
- -\c{IRET} is a shorthand for either \c{IRETW} or \c{IRETD}, depending -on the default \c{BITS} setting at the time. - - -\S{insJcc} \i\c{Jcc}: Conditional Branch - -\c Jcc imm ; 70+cc rb [8086] -\c Jcc NEAR imm ; 0F 80+cc rw/rd [386] - -The \i{conditional jump} instructions execute a near (same segment) -jump if and only if their conditions are satisfied. For example, -\c{JNZ} jumps only if the zero flag is not set. - -The ordinary form of the instructions has only a 128-byte range; the -\c{NEAR} form is a 386 extension to the instruction set, and can -span the full size of a segment. NASM will not override your choice -of jump instruction: if you want \c{Jcc NEAR}, you have to use the -\c{NEAR} keyword. - -The \c{SHORT} keyword is allowed on the first form of the -instruction, for clarity, but is not necessary. - -For details of the condition codes, see \k{iref-cc}. - - -\S{insJCXZ} \i\c{JCXZ}, \i\c{JECXZ}: Jump if CX/ECX Zero - -\c JCXZ imm ; a16 E3 rb [8086] -\c JECXZ imm ; a32 E3 rb [386] - -\c{JCXZ} performs a short jump (with maximum range 128 bytes) if and -only if the contents of the \c{CX} register is 0. \c{JECXZ} does the -same thing, but with \c{ECX}. - - -\S{insJMP} \i\c{JMP}: Jump - -\c JMP imm ; E9 rw/rd [8086] -\c JMP SHORT imm ; EB rb [8086] -\c JMP imm:imm16 ; o16 EA iw iw [8086] -\c JMP imm:imm32 ; o32 EA id iw [386] -\c JMP FAR mem ; o16 FF /5 [8086] -\c JMP FAR mem32 ; o32 FF /5 [386] -\c JMP r/m16 ; o16 FF /4 [8086] -\c JMP r/m32 ; o32 FF /4 [386] - -\c{JMP} jumps to a given address. The address may be specified as an -absolute segment and offset, or as a relative jump within the -current segment. - -\c{JMP SHORT imm} has a maximum range of 128 bytes, since the -displacement is specified as only 8 bits, but takes up less code -space. NASM does not choose when to generate \c{JMP SHORT} for you: -you must explicitly code \c{SHORT} every time you want a short jump. 
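The difference between the short and near forms can be sketched as follows, assuming 32-bit code. The labels and the 200 bytes of padding are made up for the example; the padding is only there to put one target beyond the reach of an 8-bit displacement.

```nasm
        bits 32

check:  test    eax, eax
        jz      near is_zero    ; 0F 84 rd: can reach anywhere in the segment
        jmp     short done      ; EB rb: 8-bit signed displacement
done:   ret

        times 200 nop           ; padding that puts is_zero out of short range

is_zero:
        xor     eax, eax
        ret
```

Writing the \c{SHORT} and \c{NEAR} keywords explicitly, as here, makes the chosen encoding visible in the source regardless of how the assembler is configured.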
- -You can choose between the two immediate \i{far jump} forms (\c{JMP -imm:imm}) by the use of the \c{WORD} and \c{DWORD} keywords: \c{JMP -WORD 0x1234:0x5678} or \c{JMP DWORD 0x1234:0x56789abc}. - -The \c{JMP FAR mem} forms execute a far jump by loading the -destination address out of memory. The address loaded consists of 16 -or 32 bits of offset (depending on the operand size), and 16 bits of -segment. The operand size may be overridden using \c{JMP WORD FAR -mem} or \c{JMP DWORD FAR mem}. - -The \c{JMP r/m} forms execute a \i{near jump} (within the same -segment), loading the destination address out of memory or out of a -register. The keyword \c{NEAR} may be specified, for clarity, in -these forms, but is not necessary. Again, operand size can be -overridden using \c{JMP WORD mem} or \c{JMP DWORD mem}. - -As a convenience, NASM does not require you to jump to a far symbol -by coding the cumbersome \c{JMP SEG routine:routine}, but instead -allows the easier synonym \c{JMP FAR routine}. - -The \c{JMP r/m} forms given above are near jumps; NASM will accept -the \c{NEAR} keyword (e.g. \c{JMP NEAR [address]}), even though it -is not strictly necessary. - - -\S{insLAHF} \i\c{LAHF}: Load AH from Flags - -\c LAHF ; 9F [8086] - -\c{LAHF} sets the \c{AH} register according to the contents of the -low byte of the flags word. - -The operation of \c{LAHF} is: - -\c AH <-- SF:ZF:0:AF:0:PF:1:CF - -See also \c{SAHF} (\k{insSAHF}). - - -\S{insLAR} \i\c{LAR}: Load Access Rights - -\c LAR reg16,r/m16 ; o16 0F 02 /r [286,PRIV] -\c LAR reg32,r/m32 ; o32 0F 02 /r [286,PRIV] - -\c{LAR} takes the segment selector specified by its source (second) -operand, finds the corresponding segment descriptor in the GDT or -LDT, and loads the access-rights byte of the descriptor into its -destination (first) operand.
- - -\S{insLDMXCSR} \i\c{LDMXCSR}: Load Streaming SIMD Extension - Control/Status - -\c LDMXCSR mem32 ; 0F AE /2 [KATMAI,SSE] - -\c{LDMXCSR} loads 32 bits of data from the specified memory location -into the \c{MXCSR} control/status register. \c{MXCSR} is used to -enable masked/unmasked exception handling, to set rounding modes, -to set flush-to-zero mode, and to view exception status flags. - -For details of the \c{MXCSR} register, see the Intel processor docs. - -See also \c{STMXCSR} (\k{insSTMXCSR}). - - -\S{insLDS} \i\c{LDS}, \i\c{LES}, \i\c{LFS}, \i\c{LGS}, \i\c{LSS}: Load Far Pointer - -\c LDS reg16,mem ; o16 C5 /r [8086] -\c LDS reg32,mem ; o32 C5 /r [386] - -\c LES reg16,mem ; o16 C4 /r [8086] -\c LES reg32,mem ; o32 C4 /r [386] - -\c LFS reg16,mem ; o16 0F B4 /r [386] -\c LFS reg32,mem ; o32 0F B4 /r [386] - -\c LGS reg16,mem ; o16 0F B5 /r [386] -\c LGS reg32,mem ; o32 0F B5 /r [386] - -\c LSS reg16,mem ; o16 0F B2 /r [386] -\c LSS reg32,mem ; o32 0F B2 /r [386] - -These instructions load an entire far pointer (16 or 32 bits of -offset, plus 16 bits of segment) out of memory in one go. \c{LDS}, -for example, loads 16 or 32 bits from the given memory address into -the given register (depending on the size of the register), then -loads the \e{next} 16 bits from memory into \c{DS}. \c{LES}, -\c{LFS}, \c{LGS} and \c{LSS} work in the same way but use the other -segment registers. - - -\S{insLEA} \i\c{LEA}: Load Effective Address - -\c LEA reg16,mem ; o16 8D /r [8086] -\c LEA reg32,mem ; o32 8D /r [386] - -\c{LEA}, despite its syntax, does not access memory. It calculates -the effective address specified by its second operand as if it were -going to load or store data from it, but instead it stores the -calculated address into the register specified by its first operand. -This can be used to perform quite complex calculations (e.g. \c{LEA -EAX,[EBX+ECX*4+100]}) in one instruction.
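A few typical uses of \c{LEA} as an arithmetic instruction are sketched below, assuming 32-bit code. The register choices are arbitrary.

```nasm
        bits 32

        lea     eax, [ebx+ecx*4+100]    ; EAX = EBX + ECX*4 + 100, no memory access
        lea     edx, [eax+eax*4]        ; EDX = EAX * 5
        lea     esi, [esi+1]            ; ESI = ESI + 1, flags left untouched
```

Because \c{LEA} does not modify the flags, it can be slotted between a comparison and the conditional jump that depends on it.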
- -\c{LEA}, despite being a purely arithmetic instruction which -accesses no memory, still requires square brackets around its second -operand, as if it were a memory reference. - -The size of the calculation is the current \e{address} size, and the -size that the result is stored as is the current \e{operand} size. -If the address and operand sizes are not the same, then if the -address size is 32 bits, the low 16 bits are stored, and if the -address size is 16 bits, the result is zero-extended to 32 bits before storing. - - -\S{insLEAVE} \i\c{LEAVE}: Destroy Stack Frame - -\c LEAVE ; C9 [186] - -\c{LEAVE} destroys a stack frame of the form created by the -\c{ENTER} instruction (see \k{insENTER}). It is functionally -equivalent to \c{MOV ESP,EBP} followed by \c{POP EBP} (or \c{MOV -SP,BP} followed by \c{POP BP} in 16-bit mode). - - -\S{insLFENCE} \i\c{LFENCE}: Load Fence - -\c LFENCE ; 0F AE /5 [WILLAMETTE,SSE2] - -\c{LFENCE} performs a serialising operation on all loads from memory -that were issued before the \c{LFENCE} instruction. This guarantees that -all memory reads before the \c{LFENCE} instruction are visible before any -reads after the \c{LFENCE} instruction. - -\c{LFENCE} is ordered with respect to other \c{LFENCE} instructions, \c{MFENCE}, -any memory read and any other serialising instruction (such as \c{CPUID}). - -Weakly ordered memory types can be used to achieve higher processor -performance through such techniques as out-of-order issue and -speculative reads. The degree to which a consumer of data recognizes -or knows that the data is weakly ordered varies among applications -and may be unknown to the producer of this data. The \c{LFENCE} -instruction provides a performance-efficient way of ensuring load -ordering between routines that produce weakly-ordered results and -routines that consume that data.
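The producer/consumer situation described above might look roughly like this on the consumer side. This is only a sketch: \c{ready} and \c{payload} are hypothetical variables, and it assumes the producer writes \c{payload} before setting \c{ready}.

```nasm
        bits 32

        section .bss
ready:   resd 1                 ; hypothetical flag set by the producer
payload: resd 1                 ; hypothetical data written before the flag

        section .text
consume:
.wait:  cmp     dword [ready], 0
        je      .wait           ; spin until the producer sets the flag
        lfence                  ; the flag load completes before any later load
        mov     eax, [payload]  ; read the data only after the flag has been seen
        ret
```

Without the fence, a weakly ordered memory type would allow the load of \c{payload} to be performed speculatively before the load of \c{ready}.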
- -\c{LFENCE} uses the following ModRM encoding: - -\c Mod (7:6) = 11B -\c Reg/Opcode (5:3) = 101B -\c R/M (2:0) = 000B - -All other ModRM encodings are defined to be reserved, and use -of these encodings risks incompatibility with future processors. - -See also \c{SFENCE} (\k{insSFENCE}) and \c{MFENCE} (\k{insMFENCE}). - - -\S{insLGDT} \i\c{LGDT}, \i\c{LIDT}, \i\c{LLDT}: Load Descriptor Tables - -\c LGDT mem ; 0F 01 /2 [286,PRIV] -\c LIDT mem ; 0F 01 /3 [286,PRIV] -\c LLDT r/m16 ; 0F 00 /2 [286,PRIV] - -\c{LGDT} and \c{LIDT} both take a 6-byte memory area as an operand: -they load a 16-bit size limit and a 32-bit linear address from that -area (in the opposite order) into the \c{GDTR} (global descriptor table -register) or \c{IDTR} (interrupt descriptor table register). These are -the only instructions which directly use \e{linear} addresses, rather -than segment/offset pairs. - -\c{LLDT} takes a segment selector as an operand. The processor looks -up that selector in the GDT and stores the limit and base address -given there into the \c{LDTR} (local descriptor table register). - -See also \c{SGDT}, \c{SIDT} and \c{SLDT} (\k{insSGDT}). - - -\S{insLMSW} \i\c{LMSW}: Load/Store Machine Status Word - -\c LMSW r/m16 ; 0F 01 /6 [286,PRIV] - -\c{LMSW} loads the bottom four bits of the source operand into the -bottom four bits of the \c{CR0} control register (or the Machine -Status Word, on 286 processors). See also \c{SMSW} (\k{insSMSW}). - - -\S{insLOADALL} \i\c{LOADALL}, \i\c{LOADALL286}: Load Processor State - -\c LOADALL ; 0F 07 [386,UNDOC] -\c LOADALL286 ; 0F 05 [286,UNDOC] - -This instruction, in its two different-opcode forms, is apparently -supported on most 286 processors, some 386 and possibly some 486. -The opcode differs between the 286 and the 386. 
- -The function of the instruction is to load all information relating -to the state of the processor out of a block of memory: on the 286, -this block is located implicitly at absolute address \c{0x800}, and -on the 386 and 486 it is at \c{[ES:EDI]}. - - -\S{insLODSB} \i\c{LODSB}, \i\c{LODSW}, \i\c{LODSD}: Load from String - -\c LODSB ; AC [8086] -\c LODSW ; o16 AD [8086] -\c LODSD ; o32 AD [386] - -\c{LODSB} loads a byte from \c{[DS:SI]} or \c{[DS:ESI]} into \c{AL}. -It then increments or decrements (depending on the direction flag: -increments if the flag is clear, decrements if it is set) \c{SI} or -\c{ESI}. - -The register used is \c{SI} if the address size is 16 bits, and -\c{ESI} if it is 32 bits. If you need to use an address size not -equal to the current \c{BITS} setting, you can use an explicit -\i\c{a16} or \i\c{a32} prefix. - -The segment register used to load from \c{[SI]} or \c{[ESI]} can be -overridden by using a segment register name as a prefix (for -example, \c{ES LODSB}). - -\c{LODSW} and \c{LODSD} work in the same way, but they load a -word or a doubleword instead of a byte, and increment or decrement -the addressing registers by 2 or 4 instead of 1. 
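A typical \c{LODSB} loop, sketched here for 32-bit code, walks a zero-terminated string one byte at a time. The string, the labels and the choice of \c{ECX} as a counter are all invented for the example.

```nasm
        bits 32

        section .data
msg:    db "hello", 0           ; hypothetical zero-terminated string

        section .text
count_bytes:
        cld                     ; clear the direction flag so ESI increments
        mov     esi, msg
        xor     ecx, ecx        ; ECX counts the non-zero bytes
.next:  lodsb                   ; AL = [DS:ESI], then ESI = ESI + 1
        test    al, al
        jz      .done           ; stop at the terminating zero
        inc     ecx
        jmp     short .next
.done:  mov     eax, ecx        ; EAX = 5 for the string above
        ret
```

The explicit \c{CLD} matters: if the direction flag happened to be set, \c{ESI} would be decremented and the loop would walk backwards through memory.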
- - -\S{insLOOP} \i\c{LOOP}, \i\c{LOOPE}, \i\c{LOOPZ}, \i\c{LOOPNE}, \i\c{LOOPNZ}: Loop with Counter - -\c LOOP imm ; E2 rb [8086] -\c LOOP imm,CX ; a16 E2 rb [8086] -\c LOOP imm,ECX ; a32 E2 rb [386] - -\c LOOPE imm ; E1 rb [8086] -\c LOOPE imm,CX ; a16 E1 rb [8086] -\c LOOPE imm,ECX ; a32 E1 rb [386] -\c LOOPZ imm ; E1 rb [8086] -\c LOOPZ imm,CX ; a16 E1 rb [8086] -\c LOOPZ imm,ECX ; a32 E1 rb [386] - -\c LOOPNE imm ; E0 rb [8086] -\c LOOPNE imm,CX ; a16 E0 rb [8086] -\c LOOPNE imm,ECX ; a32 E0 rb [386] -\c LOOPNZ imm ; E0 rb [8086] -\c LOOPNZ imm,CX ; a16 E0 rb [8086] -\c LOOPNZ imm,ECX ; a32 E0 rb [386] - -\c{LOOP} decrements its counter register (either \c{CX} or \c{ECX} - -if one is not specified explicitly, the \c{BITS} setting dictates -which is used) by one, and if the counter does not become zero as a -result of this operation, it jumps to the given label. The jump has -a range of 128 bytes. - -\c{LOOPE} (or its synonym \c{LOOPZ}) adds the additional condition -that it only jumps if the counter is nonzero \e{and} the zero flag -is set. Similarly, \c{LOOPNE} (and \c{LOOPNZ}) jumps only if the -counter is nonzero and the zero flag is clear. - - -\S{insLSL} \i\c{LSL}: Load Segment Limit - -\c LSL reg16,r/m16 ; o16 0F 03 /r [286,PRIV] -\c LSL reg32,r/m32 ; o32 0F 03 /r [286,PRIV] - -\c{LSL} is given a segment selector in its source (second) operand; -it computes the segment limit value by loading the segment limit -field from the associated segment descriptor in the \c{GDT} or \c{LDT}. -(This involves shifting left by 12 bits if the segment limit is -page-granular, and not if it is byte-granular; so you end up with a -byte limit in either case.) The segment limit obtained is then -loaded into the destination (first) operand. 
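As a small sketch of \c{LSL} in use, the routine below fetches the limit of the segment currently in \c{DS}. It assumes 32-bit protected-mode code; the routine name and the convention of returning zero on failure are invented for the example. \c{LSL} sets the zero flag when the limit was loaded and clears it when the selector could not be used.

```nasm
        bits 32

get_ds_limit:                   ; hypothetical routine name
        mov     ebx, ds         ; selector to inspect
        lsl     eax, ebx        ; EAX = limit in bytes, whatever the granularity
        jnz     .invalid        ; ZF clear: the limit was not loaded
        ret
.invalid:
        xor     eax, eax        ; illustrative failure value
        ret
```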
- - -\S{insLTR} \i\c{LTR}: Load Task Register - -\c LTR r/m16 ; 0F 00 /3 [286,PRIV] - -\c{LTR} looks up the segment base and limit in the GDT or LDT -descriptor specified by the segment selector given as its operand, -and loads them into the Task Register. - - -\S{insMASKMOVDQU} \i\c{MASKMOVDQU}: Byte Mask Write - -\c MASKMOVDQU xmm1,xmm2 ; 66 0F F7 /r [WILLAMETTE,SSE2] - -\c{MASKMOVDQU} stores data from xmm1 to the location specified by -\c{ES:(E)DI}. The size of the store depends on the address-size -attribute. The most significant bit in each byte of the mask -register xmm2 is used to selectively write the data (0 = no write, -1 = write) on a per-byte basis. - - -\S{insMASKMOVQ} \i\c{MASKMOVQ}: Byte Mask Write - -\c MASKMOVQ mm1,mm2 ; 0F F7 /r [KATMAI,MMX] - -\c{MASKMOVQ} stores data from mm1 to the location specified by -\c{ES:(E)DI}. The size of the store depends on the address-size -attribute. The most significant bit in each byte of the mask -register mm2 is used to selectively write the data (0 = no write, -1 = write) on a per-byte basis. - - -\S{insMAXPD} \i\c{MAXPD}: Return Packed Double-Precision FP Maximum - -\c MAXPD xmm1,xmm2/m128 ; 66 0F 5F /r [WILLAMETTE,SSE2] - -\c{MAXPD} performs a SIMD compare of the packed double-precision -FP numbers from xmm1 and xmm2/mem, and stores the maximum values -of each pair of values in xmm1. If the values being compared are -both zeroes, source2 (xmm2/m128) would be returned. If source2 -(xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the -destination (i.e., a QNaN version of the SNaN is not returned). - - -\S{insMAXPS} \i\c{MAXPS}: Return Packed Single-Precision FP Maximum - -\c MAXPS xmm1,xmm2/m128 ; 0F 5F /r [KATMAI,SSE] - -\c{MAXPS} performs a SIMD compare of the packed single-precision -FP numbers from xmm1 and xmm2/mem, and stores the maximum values -of each pair of values in xmm1. If the values being compared are -both zeroes, source2 (xmm2/m128) would be returned. 
If source2 -(xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the -destination (i.e., a QNaN version of the SNaN is not returned). - - -\S{insMAXSD} \i\c{MAXSD}: Return Scalar Double-Precision FP Maximum - -\c MAXSD xmm1,xmm2/m64 ; F2 0F 5F /r [WILLAMETTE,SSE2] - -\c{MAXSD} compares the low-order double-precision FP numbers from -xmm1 and xmm2/mem, and stores the maximum value in xmm1. If the -values being compared are both zeroes, source2 (xmm2/m64) would -be returned. If source2 (xmm2/m64) is an SNaN, this SNaN is -forwarded unchanged to the destination (i.e., a QNaN version of -the SNaN is not returned). The high quadword of the destination -is left unchanged. - - -\S{insMAXSS} \i\c{MAXSS}: Return Scalar Single-Precision FP Maximum - -\c MAXSS xmm1,xmm2/m32 ; F3 0F 5F /r [KATMAI,SSE] - -\c{MAXSS} compares the low-order single-precision FP numbers from -xmm1 and xmm2/mem, and stores the maximum value in xmm1. If the -values being compared are both zeroes, source2 (xmm2/m32) would -be returned. If source2 (xmm2/m32) is an SNaN, this SNaN is -forwarded unchanged to the destination (i.e., a QNaN version of -the SNaN is not returned). The high three doublewords of the -destination are left unchanged. - - -\S{insMFENCE} \i\c{MFENCE}: Memory Fence - -\c MFENCE ; 0F AE /6 [WILLAMETTE,SSE2] - -\c{MFENCE} performs a serialising operation on all loads from memory -and writes to memory that were issued before the \c{MFENCE} instruction. -This guarantees that all memory reads and writes before the \c{MFENCE} -instruction are completed before any reads and writes after the -\c{MFENCE} instruction. - -\c{MFENCE} is ordered respective to other \c{MFENCE} instructions, -\c{LFENCE}, \c{SFENCE}, any memory read and any other serialising -instruction (such as \c{CPUID}). - -Weakly ordered memory types can be used to achieve higher processor -performance through such techniques as out-of-order issue, speculative -reads, write-combining, and write-collapsing. 
The degree to which a -consumer of data recognizes or knows that the data is weakly ordered -varies among applications and may be unknown to the producer of this -data. The \c{MFENCE} instruction provides a performance-efficient way -of ensuring load and store ordering between routines that produce -weakly-ordered results and routines that consume that data. - -\c{MFENCE} uses the following ModRM encoding: - -\c Mod (7:6) = 11B -\c Reg/Opcode (5:3) = 110B -\c R/M (2:0) = 000B - -All other ModRM encodings are defined to be reserved, and use -of these encodings risks incompatibility with future processors. - -See also \c{LFENCE} (\k{insLFENCE}) and \c{SFENCE} (\k{insSFENCE}). - - -\S{insMINPD} \i\c{MINPD}: Return Packed Double-Precision FP Minimum - -\c MINPD xmm1,xmm2/m128 ; 66 0F 5D /r [WILLAMETTE,SSE2] - -\c{MINPD} performs a SIMD compare of the packed double-precision -FP numbers from xmm1 and xmm2/mem, and stores the minimum values -of each pair of values in xmm1. If the values being compared are -both zeroes, source2 (xmm2/m128) would be returned. If source2 -(xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the -destination (i.e., a QNaN version of the SNaN is not returned). - - -\S{insMINPS} \i\c{MINPS}: Return Packed Single-Precision FP Minimum - -\c MINPS xmm1,xmm2/m128 ; 0F 5D /r [KATMAI,SSE] - -\c{MINPS} performs a SIMD compare of the packed single-precision -FP numbers from xmm1 and xmm2/mem, and stores the minimum values -of each pair of values in xmm1. If the values being compared are -both zeroes, source2 (xmm2/m128) would be returned. If source2 -(xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the -destination (i.e., a QNaN version of the SNaN is not returned). - - -\S{insMINSD} \i\c{MINSD}: Return Scalar Double-Precision FP Minimum - -\c MINSD xmm1,xmm2/m64 ; F2 0F 5D /r [WILLAMETTE,SSE2] - -\c{MINSD} compares the low-order double-precision FP numbers from -xmm1 and xmm2/mem, and stores the minimum value in xmm1. 
If the -values being compared are both zeroes, source2 (xmm2/m64) would -be returned. If source2 (xmm2/m64) is an SNaN, this SNaN is -forwarded unchanged to the destination (i.e., a QNaN version of -the SNaN is not returned). The high quadword of the destination -is left unchanged. - - -\S{insMINSS} \i\c{MINSS}: Return Scalar Single-Precision FP Minimum - -\c MINSS xmm1,xmm2/m32 ; F3 0F 5D /r [KATMAI,SSE] - -\c{MINSS} compares the low-order single-precision FP numbers from -xmm1 and xmm2/mem, and stores the minimum value in xmm1. If the -values being compared are both zeroes, source2 (xmm2/m32) would -be returned. If source2 (xmm2/m32) is an SNaN, this SNaN is -forwarded unchanged to the destination (i.e., a QNaN version of -the SNaN is not returned). The high three doublewords of the -destination are left unchanged. - - -\S{insMOV} \i\c{MOV}: Move Data - -\c MOV r/m8,reg8 ; 88 /r [8086] -\c MOV r/m16,reg16 ; o16 89 /r [8086] -\c MOV r/m32,reg32 ; o32 89 /r [386] -\c MOV reg8,r/m8 ; 8A /r [8086] -\c MOV reg16,r/m16 ; o16 8B /r [8086] -\c MOV reg32,r/m32 ; o32 8B /r [386] - -\c MOV reg8,imm8 ; B0+r ib [8086] -\c MOV reg16,imm16 ; o16 B8+r iw [8086] -\c MOV reg32,imm32 ; o32 B8+r id [386] -\c MOV r/m8,imm8 ; C6 /0 ib [8086] -\c MOV r/m16,imm16 ; o16 C7 /0 iw [8086] -\c MOV r/m32,imm32 ; o32 C7 /0 id [386] - -\c MOV AL,memoffs8 ; A0 ow/od [8086] -\c MOV AX,memoffs16 ; o16 A1 ow/od [8086] -\c MOV EAX,memoffs32 ; o32 A1 ow/od [386] -\c MOV memoffs8,AL ; A2 ow/od [8086] -\c MOV memoffs16,AX ; o16 A3 ow/od [8086] -\c MOV memoffs32,EAX ; o32 A3 ow/od [386] - -\c MOV r/m16,segreg ; o16 8C /r [8086] -\c MOV r/m32,segreg ; o32 8C /r [386] -\c MOV segreg,r/m16 ; o16 8E /r [8086] -\c MOV segreg,r/m32 ; o32 8E /r [386] - -\c MOV reg32,CR0/2/3/4 ; 0F 20 /r [386] -\c MOV reg32,DR0/1/2/3/6/7 ; 0F 21 /r [386] -\c MOV reg32,TR3/4/5/6/7 ; 0F 24 /r [386] -\c MOV CR0/2/3/4,reg32 ; 0F 22 /r [386] -\c MOV DR0/1/2/3/6/7,reg32 ; 0F 23 /r [386] -\c MOV TR3/4/5/6/7,reg32 ; 0F 26 /r [386] - 
-\c{MOV} copies the contents of its source (second) operand into its -destination (first) operand. - -In all forms of the \c{MOV} instruction, the two operands are the -same size, except for moving between a segment register and an -\c{r/m32} operand. These instructions are treated exactly like the -corresponding 16-bit equivalent (so that, for example, \c{MOV -DS,EAX} functions identically to \c{MOV DS,AX} but saves a prefix -when in 32-bit mode), except that when a segment register is moved -into a 32-bit destination, the top two bytes of the result are -undefined. - -\c{MOV} may not use \c{CS} as a destination. - -\c{CR4} is only a supported register on the Pentium and above. - -Test registers are supported on 386/486 processors and on some -non-Intel Pentium class processors. - - -\S{insMOVAPD} \i\c{MOVAPD}: Move Aligned Packed Double-Precision FP Values - -\c MOVAPD xmm1,xmm2/mem128 ; 66 0F 28 /r [WILLAMETTE,SSE2] -\c MOVAPD xmm1/mem128,xmm2 ; 66 0F 29 /r [WILLAMETTE,SSE2] - -\c{MOVAPD} moves a double quadword containing 2 packed double-precision -FP values from the source operand to the destination. When the source -or destination operand is a memory location, it must be aligned on a -16-byte boundary. - -To move data in and out of memory locations that are not known to be on -16-byte boundaries, use the \c{MOVUPD} instruction (\k{insMOVUPD}). - - -\S{insMOVAPS} \i\c{MOVAPS}: Move Aligned Packed Single-Precision FP Values - -\c MOVAPS xmm1,xmm2/mem128 ; 0F 28 /r [KATMAI,SSE] -\c MOVAPS xmm1/mem128,xmm2 ; 0F 29 /r [KATMAI,SSE] - -\c{MOVAPS} moves a double quadword containing 4 packed single-precision -FP values from the source operand to the destination. When the source -or destination operand is a memory location, it must be aligned on a -16-byte boundary. - -To move data in and out of memory locations that are not known to be on -16-byte boundaries, use the \c{MOVUPS} instruction (\k{insMOVUPS}). 
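The alignment requirement can be illustrated with a short sketch, assuming 32-bit code and an output format that honours the \c{align=16} section qualifier. The label names are hypothetical.

```nasm
        bits 32

        section .data align=16
        align   16
vec:    dd 1.0, 2.0, 3.0, 4.0, 5.0      ; vec is 16-byte aligned, vec+4 is not

        section .text
load_vectors:
        movaps  xmm0, [vec]     ; aligned load: vec is on a 16-byte boundary
        movups  xmm1, [vec+4]   ; unaligned load: MOVAPS would fault here
        ret
```

The fifth element is there so that the 16-byte load from \c{vec+4} stays inside the array.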
- - -\S{insMOVD} \i\c{MOVD}: Move Doubleword to/from MMX Register - -\c MOVD mm,r/m32 ; 0F 6E /r [PENT,MMX] -\c MOVD r/m32,mm ; 0F 7E /r [PENT,MMX] -\c MOVD xmm,r/m32 ; 66 0F 6E /r [WILLAMETTE,SSE2] -\c MOVD r/m32,xmm ; 66 0F 7E /r [WILLAMETTE,SSE2] - -\c{MOVD} copies 32 bits from its source (second) operand into its -destination (first) operand. When the destination is a 64-bit \c{MMX} -register or a 128-bit \c{XMM} register, the input value is zero-extended -to fill the destination register. - - -\S{insMOVDQ2Q} \i\c{MOVDQ2Q}: Move Quadword from XMM to MMX Register - -\c MOVDQ2Q mm,xmm ; F2 0F D6 /r [WILLAMETTE,SSE2] - -\c{MOVDQ2Q} moves the low quadword from the source operand to the -destination operand. - - -\S{insMOVDQA} \i\c{MOVDQA}: Move Aligned Double Quadword - -\c MOVDQA xmm1,xmm2/m128 ; 66 0F 6F /r [WILLAMETTE,SSE2] -\c MOVDQA xmm1/m128,xmm2 ; 66 0F 7F /r [WILLAMETTE,SSE2] - -\c{MOVDQA} moves a double quadword from the source operand to the -destination operand. When the source or destination operand is a -memory location, it must be aligned to a 16-byte boundary. - -To move a double quadword to or from unaligned memory locations, -use the \c{MOVDQU} instruction (\k{insMOVDQU}). - - -\S{insMOVDQU} \i\c{MOVDQU}: Move Unaligned Double Quadword - -\c MOVDQU xmm1,xmm2/m128 ; F3 0F 6F /r [WILLAMETTE,SSE2] -\c MOVDQU xmm1/m128,xmm2 ; F3 0F 7F /r [WILLAMETTE,SSE2] - -\c{MOVDQU} moves a double quadword from the source operand to the -destination operand. When the source or destination operand is a -memory location, the memory may be unaligned. - -To move a double quadword to or from known aligned memory locations, -use the \c{MOVDQA} instruction (\k{insMOVDQA}). - - -\S{insMOVHLPS} \i\c{MOVHLPS}: Move Packed Single-Precision FP High to Low - -\c MOVHLPS xmm1,xmm2 ; 0F 12 /r [KATMAI,SSE] - -\c{MOVHLPS} moves the two packed single-precision FP values from the -high quadword of the source register xmm2 to the low quadword of the -destination register, xmm1.
The upper quadword of xmm1 is left unchanged. - -The operation of this instruction is: - -\c dst[0-63] := src[64-127], -\c dst[64-127] remains unchanged. - - -\S{insMOVHPD} \i\c{MOVHPD}: Move High Packed Double-Precision FP - -\c MOVHPD xmm,m64 ; 66 0F 16 /r [WILLAMETTE,SSE2] -\c MOVHPD m64,xmm ; 66 0F 17 /r [WILLAMETTE,SSE2] - -\c{MOVHPD} moves a double-precision FP value between the source and -destination operands. One of the operands is a 64-bit memory location; -the other is the high quadword of an \c{XMM} register. - -The operation of this instruction is: - -\c mem[0-63] := xmm[64-127]; - -or - -\c xmm[0-63] remains unchanged; -\c xmm[64-127] := mem[0-63]. - - -\S{insMOVHPS} \i\c{MOVHPS}: Move High Packed Single-Precision FP - -\c MOVHPS xmm,m64 ; 0F 16 /r [KATMAI,SSE] -\c MOVHPS m64,xmm ; 0F 17 /r [KATMAI,SSE] - -\c{MOVHPS} moves two packed single-precision FP values between the source -and destination operands. One of the operands is a 64-bit memory location; -the other is the high quadword of an \c{XMM} register. - -The operation of this instruction is: - -\c mem[0-63] := xmm[64-127]; - -or - -\c xmm[0-63] remains unchanged; -\c xmm[64-127] := mem[0-63]. - - -\S{insMOVLHPS} \i\c{MOVLHPS}: Move Packed Single-Precision FP Low to High - -\c MOVLHPS xmm1,xmm2 ; 0F 16 /r [KATMAI,SSE] - -\c{MOVLHPS} moves the two packed single-precision FP values from the -low quadword of the source register xmm2 to the high quadword of the -destination register, xmm1. The low quadword of xmm1 is left unchanged. - -The operation of this instruction is: - -\c dst[0-63] remains unchanged; -\c dst[64-127] := src[0-63]. - -\S{insMOVLPD} \i\c{MOVLPD}: Move Low Packed Double-Precision FP - -\c MOVLPD xmm,m64 ; 66 0F 12 /r [WILLAMETTE,SSE2] -\c MOVLPD m64,xmm ; 66 0F 13 /r [WILLAMETTE,SSE2] - -\c{MOVLPD} moves a double-precision FP value between the source and -destination operands. One of the operands is a 64-bit memory location; -the other is the low quadword of an \c{XMM} register.
- -The operation of this instruction is: - -\c mem[0-63] := xmm[0-63]; - -or - -\c xmm[0-63] := mem[0-63]; -\c xmm[64-127] remains unchanged. - -\S{insMOVLPS} \i\c{MOVLPS}: Move Low Packed Single-Precision FP - -\c MOVLPS xmm,m64 ; 0F 12 /r [KATMAI,SSE] -\c MOVLPS m64,xmm ; 0F 13 /r [KATMAI,SSE] - -\c{MOVLPS} moves two packed single-precision FP values between the source -and destination operands. One of the operands is a 64-bit memory location; -the other is the low quadword of an \c{XMM} register. - -The operation of this instruction is: - -\c mem[0-63] := xmm[0-63]; - -or - -\c xmm[0-63] := mem[0-63]; -\c xmm[64-127] remains unchanged. - - -\S{insMOVMSKPD} \i\c{MOVMSKPD}: Extract Packed Double-Precision FP Sign Mask - -\c MOVMSKPD reg32,xmm ; 66 0F 50 /r [WILLAMETTE,SSE2] - -\c{MOVMSKPD} writes a 2-bit mask to r32, formed of the most significant -bits of each double-precision FP number of the source operand. - - -\S{insMOVMSKPS} \i\c{MOVMSKPS}: Extract Packed Single-Precision FP Sign Mask - -\c MOVMSKPS reg32,xmm ; 0F 50 /r [KATMAI,SSE] - -\c{MOVMSKPS} writes a 4-bit mask to r32, formed of the most significant -bits of each single-precision FP number of the source operand. - - -\S{insMOVNTDQ} \i\c{MOVNTDQ}: Move Double Quadword Non Temporal - -\c MOVNTDQ m128,xmm ; 66 0F E7 /r [WILLAMETTE,SSE2] - -\c{MOVNTDQ} moves the double quadword from the \c{XMM} source -register to the destination memory location, using a non-temporal -hint. This store instruction minimizes cache pollution. - - -\S{insMOVNTI} \i\c{MOVNTI}: Move Doubleword Non Temporal - -\c MOVNTI m32,reg32 ; 0F C3 /r [WILLAMETTE,SSE2] - -\c{MOVNTI} moves the doubleword in the source register -to the destination memory location, using a non-temporal -hint. This store instruction minimizes cache pollution.
- - -\S{insMOVNTPD} \i\c{MOVNTPD}: Move Aligned Two Packed Double-Precision -FP Values Non Temporal - -\c MOVNTPD m128,xmm ; 66 0F 2B /r [WILLAMETTE,SSE2] - -\c{MOVNTPD} moves the double quadword from the \c{XMM} source -register to the destination memory location, using a non-temporal -hint. This store instruction minimizes cache pollution. The memory -location must be aligned to a 16-byte boundary. - - -\S{insMOVNTPS} \i\c{MOVNTPS}: Move Aligned Four Packed Single-Precision -FP Values Non Temporal - -\c MOVNTPS m128,xmm ; 0F 2B /r [KATMAI,SSE] - -\c{MOVNTPS} moves the double quadword from the \c{XMM} source -register to the destination memory location, using a non-temporal -hint. This store instruction minimizes cache pollution. The memory -location must be aligned to a 16-byte boundary. - - -\S{insMOVNTQ} \i\c{MOVNTQ}: Move Quadword Non Temporal - -\c MOVNTQ m64,mm ; 0F E7 /r [KATMAI,MMX] - -\c{MOVNTQ} moves the quadword in the \c{MMX} source register -to the destination memory location, using a non-temporal -hint. This store instruction minimizes cache pollution. - - -\S{insMOVQ} \i\c{MOVQ}: Move Quadword to/from MMX Register - -\c MOVQ mm1,mm2/m64 ; 0F 6F /r [PENT,MMX] -\c MOVQ mm1/m64,mm2 ; 0F 7F /r [PENT,MMX] - -\c MOVQ xmm1,xmm2/m64 ; F3 0F 7E /r [WILLAMETTE,SSE2] -\c MOVQ xmm1/m64,xmm2 ; 66 0F D6 /r [WILLAMETTE,SSE2] - -\c{MOVQ} copies 64 bits from its source (second) operand into its -destination (first) operand. When the source is an \c{XMM} register, -the low quadword is moved. When the destination is an \c{XMM} register, -the destination is the low quadword, and the high quadword is cleared. - - -\S{insMOVQ2DQ} \i\c{MOVQ2DQ}: Move Quadword from MMX to XMM register - -\c MOVQ2DQ xmm,mm ; F3 0F D6 /r [WILLAMETTE,SSE2] - -\c{MOVQ2DQ} moves the quadword from the source operand to the low -quadword of the destination operand, and clears the high quadword.
- - -\S{insMOVSB} \i\c{MOVSB}, \i\c{MOVSW}, \i\c{MOVSD}: Move String - -\c MOVSB ; A4 [8086] -\c MOVSW ; o16 A5 [8086] -\c MOVSD ; o32 A5 [386] - -\c{MOVSB} copies the byte at \c{[DS:SI]} or \c{[DS:ESI]} to -\c{[ES:DI]} or \c{[ES:EDI]}. It then increments or decrements -(depending on the direction flag: increments if the flag is clear, -decrements if it is set) \c{SI} and \c{DI} (or \c{ESI} and \c{EDI}). - -The registers used are \c{SI} and \c{DI} if the address size is 16 -bits, and \c{ESI} and \c{EDI} if it is 32 bits. If you need to use -an address size not equal to the current \c{BITS} setting, you can -use an explicit \i\c{a16} or \i\c{a32} prefix. - -The segment register used to load from \c{[SI]} or \c{[ESI]} can be -overridden by using a segment register name as a prefix (for -example, \c{es movsb}). The use of \c{ES} for the store to \c{[DI]} -or \c{[EDI]} cannot be overridden. - -\c{MOVSW} and \c{MOVSD} work in the same way, but they copy a word -or a doubleword instead of a byte, and increment or decrement the -addressing registers by 2 or 4 instead of 1. - -The \c{REP} prefix may be used to repeat the instruction \c{CX} (or -\c{ECX} - again, the address size chooses which) times. - - -\S{insMOVSD} \i\c{MOVSD}: Move Scalar Double-Precision FP Value - -\c MOVSD xmm1,xmm2/m64 ; F2 0F 10 /r [WILLAMETTE,SSE2] -\c MOVSD xmm1/m64,xmm2 ; F2 0F 11 /r [WILLAMETTE,SSE2] - -\c{MOVSD} moves a double-precision FP value from the source operand -to the destination operand. When the source or destination is a -register, the low-order FP value is read or written. - - -\S{insMOVSS} \i\c{MOVSS}: Move Scalar Single-Precision FP Value - -\c MOVSS xmm1,xmm2/m32 ; F3 0F 10 /r [KATMAI,SSE] -\c MOVSS xmm1/m32,xmm2 ; F3 0F 11 /r [KATMAI,SSE] - -\c{MOVSS} moves a single-precision FP value from the source operand -to the destination operand. When the source or destination is a -register, the low-order FP value is read or written. 
- - -\S{insMOVSX} \i\c{MOVSX}, \i\c{MOVZX}: Move Data with Sign or Zero Extend - -\c MOVSX reg16,r/m8 ; o16 0F BE /r [386] -\c MOVSX reg32,r/m8 ; o32 0F BE /r [386] -\c MOVSX reg32,r/m16 ; o32 0F BF /r [386] - -\c MOVZX reg16,r/m8 ; o16 0F B6 /r [386] -\c MOVZX reg32,r/m8 ; o32 0F B6 /r [386] -\c MOVZX reg32,r/m16 ; o32 0F B7 /r [386] - -\c{MOVSX} sign-extends its source (second) operand to the length of -its destination (first) operand, and copies the result into the -destination operand. \c{MOVZX} does the same, but zero-extends -rather than sign-extending. - - -\S{insMOVUPD} \i\c{MOVUPD}: Move Unaligned Packed Double-Precision FP Values - -\c MOVUPD xmm1,xmm2/mem128 ; 66 0F 10 /r [WILLAMETTE,SSE2] -\c MOVUPD xmm1/mem128,xmm2 ; 66 0F 11 /r [WILLAMETTE,SSE2] - -\c{MOVUPD} moves a double quadword containing 2 packed double-precision -FP values from the source operand to the destination. This instruction -makes no assumptions about alignment of memory operands. - -To move data in and out of memory locations that are known to be on 16-byte -boundaries, use the \c{MOVAPD} instruction (\k{insMOVAPD}). - - -\S{insMOVUPS} \i\c{MOVUPS}: Move Unaligned Packed Single-Precision FP Values - -\c MOVUPS xmm1,xmm2/mem128 ; 0F 10 /r [KATMAI,SSE] -\c MOVUPS xmm1/mem128,xmm2 ; 0F 11 /r [KATMAI,SSE] - -\c{MOVUPS} moves a double quadword containing 4 packed single-precision -FP values from the source operand to the destination. This instruction -makes no assumptions about alignment of memory operands. - -To move data in and out of memory locations that are known to be on 16-byte -boundaries, use the \c{MOVAPS} instruction (\k{insMOVAPS}). - - -\S{insMUL} \i\c{MUL}: Unsigned Integer Multiply - -\c MUL r/m8 ; F6 /4 [8086] -\c MUL r/m16 ; o16 F7 /4 [8086] -\c MUL r/m32 ; o32 F7 /4 [386] - -\c{MUL} performs unsigned integer multiplication. 
The other operand -to the multiplication, and the destination operand, are implicit, in -the following way: - -\b For \c{MUL r/m8}, \c{AL} is multiplied by the given operand; the -product is stored in \c{AX}. - -\b For \c{MUL r/m16}, \c{AX} is multiplied by the given operand; -the product is stored in \c{DX:AX}. - -\b For \c{MUL r/m32}, \c{EAX} is multiplied by the given operand; -the product is stored in \c{EDX:EAX}. - -Signed integer multiplication is performed by the \c{IMUL} -instruction: see \k{insIMUL}. - - -\S{insMULPD} \i\c{MULPD}: Packed Double-FP Multiply - -\c MULPD xmm1,xmm2/mem128 ; 66 0F 59 /r [WILLAMETTE,SSE2] - -\c{MULPD} performs a SIMD multiply of the packed double-precision FP -values in both operands, and stores the results in the destination register. - - -\S{insMULPS} \i\c{MULPS}: Packed Single-FP Multiply - -\c MULPS xmm1,xmm2/mem128 ; 0F 59 /r [KATMAI,SSE] - -\c{MULPS} performs a SIMD multiply of the packed single-precision FP -values in both operands, and stores the results in the destination register. - - -\S{insMULSD} \i\c{MULSD}: Scalar Double-FP Multiply - -\c MULSD xmm1,xmm2/mem64 ; F2 0F 59 /r [WILLAMETTE,SSE2] - -\c{MULSD} multiplies the lowest double-precision FP values of both -operands, and stores the result in the low quadword of xmm1. - - -\S{insMULSS} \i\c{MULSS}: Scalar Single-FP Multiply - -\c MULSS xmm1,xmm2/mem32 ; F3 0F 59 /r [KATMAI,SSE] - -\c{MULSS} multiplies the lowest single-precision FP values of both -operands, and stores the result in the low doubleword of xmm1. - - -\S{insNEG} \i\c{NEG}, \i\c{NOT}: Two's and One's Complement - -\c NEG r/m8 ; F6 /3 [8086] -\c NEG r/m16 ; o16 F7 /3 [8086] -\c NEG r/m32 ; o32 F7 /3 [386] - -\c NOT r/m8 ; F6 /2 [8086] -\c NOT r/m16 ; o16 F7 /2 [8086] -\c NOT r/m32 ; o32 F7 /2 [386] - -\c{NEG} replaces the contents of its operand by the two's complement -negation (invert all the bits and then add one) of the original -value.
\c{NOT}, similarly, performs one's complement (inverts all -the bits). - - -\S{insNOP} \i\c{NOP}: No Operation - -\c NOP ; 90 [8086] - -\c{NOP} performs no operation. Its opcode is the same as that -generated by \c{XCHG AX,AX} or \c{XCHG EAX,EAX} (depending on the -processor mode; see \k{insXCHG}). - - -\S{insOR} \i\c{OR}: Bitwise OR - -\c OR r/m8,reg8 ; 08 /r [8086] -\c OR r/m16,reg16 ; o16 09 /r [8086] -\c OR r/m32,reg32 ; o32 09 /r [386] - -\c OR reg8,r/m8 ; 0A /r [8086] -\c OR reg16,r/m16 ; o16 0B /r [8086] -\c OR reg32,r/m32 ; o32 0B /r [386] - -\c OR r/m8,imm8 ; 80 /1 ib [8086] -\c OR r/m16,imm16 ; o16 81 /1 iw [8086] -\c OR r/m32,imm32 ; o32 81 /1 id [386] - -\c OR r/m16,imm8 ; o16 83 /1 ib [8086] -\c OR r/m32,imm8 ; o32 83 /1 ib [386] - -\c OR AL,imm8 ; 0C ib [8086] -\c OR AX,imm16 ; o16 0D iw [8086] -\c OR EAX,imm32 ; o32 0D id [386] - -\c{OR} performs a bitwise OR operation between its two operands -(i.e. each bit of the result is 1 if and only if at least one of the -corresponding bits of the two inputs was 1), and stores the result -in the destination (first) operand. - -In the forms with an 8-bit immediate second operand and a longer -first operand, the second operand is considered to be signed, and is -sign-extended to the length of the first operand. In these cases, -the \c{BYTE} qualifier is necessary to force NASM to generate this -form of the instruction. - -The MMX instruction \c{POR} (see \k{insPOR}) performs the same -operation on the 64-bit MMX registers. - - -\S{insORPD} \i\c{ORPD}: Bit-wise Logical OR of Double-Precision FP Data - -\c ORPD xmm1,xmm2/m128 ; 66 0F 56 /r [WILLAMETTE,SSE2] - -\c{ORPD} performs a bit-wise logical OR between xmm1 and xmm2/mem, -and stores the result in xmm1. If the source operand is a memory -location, it must be aligned to a 16-byte boundary.
- - -\S{insORPS} \i\c{ORPS}: Bit-wise Logical OR of Single-Precision FP Data - -\c ORPS xmm1,xmm2/m128 ; 0F 56 /r [KATMAI,SSE] - -\c{ORPS} performs a bit-wise logical OR between xmm1 and xmm2/mem, -and stores the result in xmm1. If the source operand is a memory -location, it must be aligned to a 16-byte boundary. - - -\S{insOUT} \i\c{OUT}: Output Data to I/O Port - -\c OUT imm8,AL ; E6 ib [8086] -\c OUT imm8,AX ; o16 E7 ib [8086] -\c OUT imm8,EAX ; o32 E7 ib [386] -\c OUT DX,AL ; EE [8086] -\c OUT DX,AX ; o16 EF [8086] -\c OUT DX,EAX ; o32 EF [386] - -\c{OUT} writes the contents of the given source register to the -specified I/O port. The port number may be specified as an immediate -value if it is between 0 and 255, and otherwise must be stored in -\c{DX}. See also \c{IN} (\k{insIN}). - - -\S{insOUTSB} \i\c{OUTSB}, \i\c{OUTSW}, \i\c{OUTSD}: Output String to I/O Port - -\c OUTSB ; 6E [186] -\c OUTSW ; o16 6F [186] -\c OUTSD ; o32 6F [386] - -\c{OUTSB} loads a byte from \c{[DS:SI]} or \c{[DS:ESI]} and writes -it to the I/O port specified in \c{DX}. It then increments or -decrements (depending on the direction flag: increments if the flag -is clear, decrements if it is set) \c{SI} or \c{ESI}. - -The register used is \c{SI} if the address size is 16 bits, and -\c{ESI} if it is 32 bits. If you need to use an address size not -equal to the current \c{BITS} setting, you can use an explicit -\i\c{a16} or \i\c{a32} prefix. - -The segment register used to load from \c{[SI]} or \c{[ESI]} can be -overridden by using a segment register name as a prefix (for -example, \c{es outsb}). - -\c{OUTSW} and \c{OUTSD} work in the same way, but they output a -word or a doubleword instead of a byte, and increment or decrement -the addressing registers by 2 or 4 instead of 1. - -The \c{REP} prefix may be used to repeat the instruction \c{CX} (or -\c{ECX} - again, the address size chooses which) times.
- - -\S{insPACKSSDW} \i\c{PACKSSDW}, \i\c{PACKSSWB}, \i\c{PACKUSWB}: Pack Data - -\c PACKSSDW mm1,mm2/m64 ; 0F 6B /r [PENT,MMX] -\c PACKSSWB mm1,mm2/m64 ; 0F 63 /r [PENT,MMX] -\c PACKUSWB mm1,mm2/m64 ; 0F 67 /r [PENT,MMX] - -\c PACKSSDW xmm1,xmm2/m128 ; 66 0F 6B /r [WILLAMETTE,SSE2] -\c PACKSSWB xmm1,xmm2/m128 ; 66 0F 63 /r [WILLAMETTE,SSE2] -\c PACKUSWB xmm1,xmm2/m128 ; 66 0F 67 /r [WILLAMETTE,SSE2] - -All these instructions start by combining the source and destination -operands, then split the result into smaller sections, which they -then pack into the destination register. The \c{MMX} versions pack -two 64-bit operands into one 64-bit register, while the \c{SSE} -versions pack two 128-bit operands into one 128-bit register. - -\b \c{PACKSSWB} splits the combined value into words, and then reduces -the words to bytes, using signed saturation. It then packs the bytes -into the destination register in the same order the words were in. - -\b \c{PACKSSDW} performs the same operation as \c{PACKSSWB}, except that -it reduces doublewords to words, then packs them into the destination -register. - -\b \c{PACKUSWB} performs the same operation as \c{PACKSSWB}, except that -it uses unsigned saturation when reducing the size of the elements. - -To perform signed saturation on a number that is too large, it is -replaced by the largest signed number (\c{7FFFh} or \c{7Fh}) that -\e{will} fit; if it is too small, it is replaced by the smallest signed -number (\c{8000h} or \c{80h}) that will fit. To perform unsigned -saturation, a number that is too large is replaced by the largest -unsigned number that will fit, and a negative number is replaced by zero.
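The saturation rules for the pack instructions can be modelled in a few lines. The following is only an illustrative Python sketch, not NASM code: the function names (\c{saturate_signed}, \c{saturate_unsigned}, \c{packsswb}, \c{packuswb}) are hypothetical, operands are represented as lists of signed words with element 0 being the lowest, and the unsigned case assumes that \c{PACKUSWB} clamps negative inputs to zero.

```python
def saturate_signed(value, bits):
    """Clamp a signed integer to the range of a `bits`-wide signed field."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def saturate_unsigned(value, bits):
    """Clamp a signed integer to the range of a `bits`-wide unsigned field."""
    return max(0, min((1 << bits) - 1, value))


def packsswb(dst_words, src_words):
    """Model of PACKSSWB: destination words fill the low bytes, source the high."""
    return [saturate_signed(w, 8) for w in dst_words + src_words]


def packuswb(dst_words, src_words):
    """Model of PACKUSWB: same layout, but with unsigned saturation."""
    return [saturate_unsigned(w, 8) for w in dst_words + src_words]
```

With four words per operand this corresponds to the \c{MMX} forms; passing eight words per operand would model the 128-bit forms in the same way.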
- - -\S{insPADDB} \i\c{PADDB}, \i\c{PADDW}, \i\c{PADDD}: Add Packed Integers - -\c PADDB mm1,mm2/m64 ; 0F FC /r [PENT,MMX] -\c PADDW mm1,mm2/m64 ; 0F FD /r [PENT,MMX] -\c PADDD mm1,mm2/m64 ; 0F FE /r [PENT,MMX] - -\c PADDB xmm1,xmm2/m128 ; 66 0F FC /r [WILLAMETTE,SSE2] -\c PADDW xmm1,xmm2/m128 ; 66 0F FD /r [WILLAMETTE,SSE2] -\c PADDD xmm1,xmm2/m128 ; 66 0F FE /r [WILLAMETTE,SSE2] - -\c{PADDx} performs packed addition of the two operands, storing the -result in the destination (first) operand. - -\b \c{PADDB} treats the operands as packed bytes, and adds each byte -individually; - -\b \c{PADDW} treats the operands as packed words; - -\b \c{PADDD} treats its operands as packed doublewords. - -When an individual result is too large to fit in its destination, it -is wrapped around and the low bits are stored, with the carry bit -discarded. - - -\S{insPADDQ} \i\c{PADDQ}: Add Packed Quadword Integers - -\c PADDQ mm1,mm2/m64 ; 0F D4 /r [PENT,MMX] - -\c PADDQ xmm1,xmm2/m128 ; 66 0F D4 /r [WILLAMETTE,SSE2] - -\c{PADDQ} adds the quadwords in the source and destination operands, and -stores the result in the destination register. - -When an individual result is too large to fit in its destination, it -is wrapped around and the low bits are stored, with the carry bit -discarded. - - -\S{insPADDSB} \i\c{PADDSB}, \i\c{PADDSW}: Add Packed Signed Integers With Saturation - -\c PADDSB mm1,mm2/m64 ; 0F EC /r [PENT,MMX] -\c PADDSW mm1,mm2/m64 ; 0F ED /r [PENT,MMX] - -\c PADDSB xmm1,xmm2/m128 ; 66 0F EC /r [WILLAMETTE,SSE2] -\c PADDSW xmm1,xmm2/m128 ; 66 0F ED /r [WILLAMETTE,SSE2] - -\c{PADDSx} performs packed addition of the two operands, storing the -result in the destination (first) operand. -\c{PADDSB} treats the operands as packed bytes, and adds each byte -individually; and \c{PADDSW} treats the operands as packed words. - -When an individual result is too large to fit in its destination, a -saturated value is stored. 
The resulting value is the value with the -largest magnitude of the same sign as the result which will fit in -the available space. - - -\S{insPADDSIW} \i\c{PADDSIW}: MMX Packed Addition to Implicit Destination - -\c PADDSIW mmxreg,r/m64 ; 0F 51 /r [CYRIX,MMX] - -\c{PADDSIW}, specific to the Cyrix extensions to the MMX instruction -set, performs the same function as \c{PADDSW}, except that the result -is placed in an implied register. - -To work out the implied register, invert the lowest bit in the register -number. So \c{PADDSIW MM0,MM2} would put the result in \c{MM1}, but -\c{PADDSIW MM1,MM2} would put the result in \c{MM0}. - - -\S{insPADDUSB} \i\c{PADDUSB}, \i\c{PADDUSW}: Add Packed Unsigned Integers With Saturation - -\c PADDUSB mm1,mm2/m64 ; 0F DC /r [PENT,MMX] -\c PADDUSW mm1,mm2/m64 ; 0F DD /r [PENT,MMX] - -\c PADDUSB xmm1,xmm2/m128 ; 66 0F DC /r [WILLAMETTE,SSE2] -\c PADDUSW xmm1,xmm2/m128 ; 66 0F DD /r [WILLAMETTE,SSE2] - -\c{PADDUSx} performs packed addition of the two operands, storing the -result in the destination (first) operand. -\c{PADDUSB} treats the operands as packed bytes, and adds each byte -individually; and \c{PADDUSW} treats the operands as packed words. - -When an individual result is too large to fit in its destination, a -saturated value is stored. The resulting value is the maximum value -that will fit in the available space. - - -\S{insPAND} \i\c{PAND}, \i\c{PANDN}: MMX Bitwise AND and AND-NOT - -\c PAND mm1,mm2/m64 ; 0F DB /r [PENT,MMX] -\c PANDN mm1,mm2/m64 ; 0F DF /r [PENT,MMX] - -\c PAND xmm1,xmm2/m128 ; 66 0F DB /r [WILLAMETTE,SSE2] -\c PANDN xmm1,xmm2/m128 ; 66 0F DF /r [WILLAMETTE,SSE2] - - -\c{PAND} performs a bitwise AND operation between its two operands -(i.e. each bit of the result is 1 if and only if the corresponding -bits of the two inputs were both 1), and stores the result in the -destination (first) operand. 
- -\c{PANDN} performs the same operation, but first performs a one's -complement operation on the destination (first) operand. - - -\S{insPAUSE} \i\c{PAUSE}: Spin Loop Hint - -\c PAUSE ; F3 90 [WILLAMETTE,SSE2] - -\c{PAUSE} provides a hint to the processor that the following code -is a spin loop. This improves processor performance by bypassing -possible memory order violations. On older processors, this instruction -operates as a \c{NOP}. - - -\S{insPAVEB} \i\c{PAVEB}: MMX Packed Average - -\c PAVEB mmxreg,r/m64 ; 0F 50 /r [CYRIX,MMX] - -\c{PAVEB}, specific to the Cyrix MMX extensions, treats its two -operands as vectors of eight unsigned bytes, and calculates the -average of the corresponding bytes in the operands. The resulting -vector of eight averages is stored in the first operand. - -This opcode maps to \c{MOVMSKPS r32, xmm} on processors that support -the SSE instruction set. - - -\S{insPAVGB} \i\c{PAVGB}, \i\c{PAVGW}: Average Packed Integers - -\c PAVGB mm1,mm2/m64 ; 0F E0 /r [KATMAI,MMX] -\c PAVGW mm1,mm2/m64 ; 0F E3 /r [KATMAI,MMX,SM] - -\c PAVGB xmm1,xmm2/m128 ; 66 0F E0 /r [WILLAMETTE,SSE2] -\c PAVGW xmm1,xmm2/m128 ; 66 0F E3 /r [WILLAMETTE,SSE2] - -\c{PAVGB} and \c{PAVGW} add the unsigned data elements of the source -operand to the unsigned data elements of the destination register, -then add 1 to the temporary results. The results of the add are then -each independently right-shifted by one bit position. The high order -bits of each element are filled with the carry bits of the corresponding -sum. - -\b \c{PAVGB} operates on packed unsigned bytes, and - -\b \c{PAVGW} operates on packed unsigned words. - - -\S{insPAVGUSB} \i\c{PAVGUSB}: Average of unsigned packed 8-bit values - -\c PAVGUSB mm1,mm2/m64 ; 0F 0F /r BF [PENT,3DNOW] - -\c{PAVGUSB} adds the unsigned data elements of the source operand to -the unsigned data elements of the destination register, then adds 1 -to the temporary results.
The results of the add are then each -independently right-shifted by one bit position. The high order bits -of each element are filled with the carry bits of the corresponding -sum. - -This instruction performs exactly the same operations as the \c{PAVGB} -\c{MMX} instruction (\k{insPAVGB}). - - -\S{insPCMPEQB} \i\c{PCMPxx}: Compare Packed Integers. - -\c PCMPEQB mm1,mm2/m64 ; 0F 74 /r [PENT,MMX] -\c PCMPEQW mm1,mm2/m64 ; 0F 75 /r [PENT,MMX] -\c PCMPEQD mm1,mm2/m64 ; 0F 76 /r [PENT,MMX] - -\c PCMPGTB mm1,mm2/m64 ; 0F 64 /r [PENT,MMX] -\c PCMPGTW mm1,mm2/m64 ; 0F 65 /r [PENT,MMX] -\c PCMPGTD mm1,mm2/m64 ; 0F 66 /r [PENT,MMX] - -\c PCMPEQB xmm1,xmm2/m128 ; 66 0F 74 /r [WILLAMETTE,SSE2] -\c PCMPEQW xmm1,xmm2/m128 ; 66 0F 75 /r [WILLAMETTE,SSE2] -\c PCMPEQD xmm1,xmm2/m128 ; 66 0F 76 /r [WILLAMETTE,SSE2] - -\c PCMPGTB xmm1,xmm2/m128 ; 66 0F 64 /r [WILLAMETTE,SSE2] -\c PCMPGTW xmm1,xmm2/m128 ; 66 0F 65 /r [WILLAMETTE,SSE2] -\c PCMPGTD xmm1,xmm2/m128 ; 66 0F 66 /r [WILLAMETTE,SSE2] - -The \c{PCMPxx} instructions all treat their operands as vectors of -bytes, words, or doublewords; corresponding elements of the source -and destination are compared, and the corresponding element of the -destination (first) operand is set to all zeros or all ones -depending on the result of the comparison. - -\b \c{PCMPxxB} treats the operands as vectors of bytes; - -\b \c{PCMPxxW} treats the operands as vectors of words; - -\b \c{PCMPxxD} treats the operands as vectors of doublewords; - -\b \c{PCMPEQx} sets the corresponding element of the destination -operand to all ones if the two elements compared are equal; - -\b \c{PCMPGTx} sets the destination element to all ones if the element -of the first (destination) operand is greater (treated as a signed -integer) than that of the second (source) operand. 
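The element-wise behaviour of the \c{PCMPxx} family described above can be sketched as follows. This is a hedged Python model rather than NASM code: the helper names (\c{to_signed}, \c{pcmpeq}, \c{pcmpgt}) are hypothetical, and operands are represented as lists of raw element values with the element width passed as \c{bits} (8, 16 or 32).

```python
def to_signed(value, bits):
    """Reinterpret the low `bits` bits of value as a two's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def pcmpeq(dst, src, bits=8):
    """Model of PCMPEQx: all ones where elements are equal, else all zeros."""
    mask = (1 << bits) - 1
    return [mask if (d & mask) == (s & mask) else 0 for d, s in zip(dst, src)]


def pcmpgt(dst, src, bits=8):
    """Model of PCMPGTx: all ones where dst > src as signed integers."""
    mask = (1 << bits) - 1
    return [mask if to_signed(d, bits) > to_signed(s, bits) else 0
            for d, s in zip(dst, src)]
```

Note that the greater-than comparison is signed, so a byte of \c{0FFh} (that is, -1) compares as less than \c{01h}.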
- - -\S{insPDISTIB} \i\c{PDISTIB}: MMX Packed Distance and Accumulate -with Implied Register - -\c PDISTIB mm,m64 ; 0F 54 /r [CYRIX,MMX] - -\c{PDISTIB}, specific to the Cyrix MMX extensions, treats its two -input operands as vectors of eight unsigned bytes. For each byte -position, it finds the absolute difference between the bytes in that -position in the two input operands, and adds that value to the byte -in the same position in the implied output register. The addition is -saturated to an unsigned byte in the same way as \c{PADDUSB}. - -To work out the implied register, invert the lowest bit in the register -number. So \c{PDISTIB MM0,M64} would put the result in \c{MM1}, but -\c{PDISTIB MM1,M64} would put the result in \c{MM0}. - -Note that \c{PDISTIB} cannot take a register as its second source -operand. - -Operation: - -\c dstI[0-7] := dstI[0-7] + ABS(src0[0-7] - src1[0-7]), -\c dstI[8-15] := dstI[8-15] + ABS(src0[8-15] - src1[8-15]), -\c ....... -\c ....... -\c dstI[56-63] := dstI[56-63] + ABS(src0[56-63] - src1[56-63]). - - -\S{insPEXTRW} \i\c{PEXTRW}: Extract Word - -\c PEXTRW reg32,mm,imm8 ; 0F C5 /r ib [KATMAI,MMX] -\c PEXTRW reg32,xmm,imm8 ; 66 0F C5 /r ib [WILLAMETTE,SSE2] - -\c{PEXTRW} moves the word in the source register (second operand) -that is pointed to by the count operand (third operand), into the -lower half of a 32-bit general purpose register. The upper half of -the register is cleared to all 0s. - -When the source operand is an \c{MMX} register, the two least -significant bits of the count specify the source word. When it is -an \c{SSE} register, the three least significant bits specify the -word location. - - -\S{insPF2ID} \i\c{PF2ID}: Packed Single-Precision FP to Integer Convert - -\c PF2ID mm1,mm2/m64 ; 0F 0F /r 1D [PENT,3DNOW] - -\c{PF2ID} converts two single-precision FP values in the source operand -to signed 32-bit integers, using truncation, and stores them in the -destination operand. 
Source values that are outside the range supported -by the destination are saturated to the largest absolute value of the -same sign. - - -\S{insPF2IW} \i\c{PF2IW}: Packed Single-Precision FP to Integer Word Convert - -\c PF2IW mm1,mm2/m64 ; 0F 0F /r 1C [PENT,3DNOW] - -\c{PF2IW} converts two single-precision FP values in the source operand -to signed 16-bit integers, using truncation, and stores them in the -destination operand. Source values that are outside the range supported -by the destination are saturated to the largest absolute value of the -same sign. - -\b In the K6-2 and K6-III, the 16-bit value is zero-extended to 32 bits -before storing. - -\b In the K6-2+, K6-III+ and Athlon processors, the value is sign-extended -to 32 bits before storing. - - -\S{insPFACC} \i\c{PFACC}: Packed Single-Precision FP Accumulate - -\c PFACC mm1,mm2/m64 ; 0F 0F /r AE [PENT,3DNOW] - -\c{PFACC} adds the two single-precision FP values from the destination -operand together, then adds the two single-precision FP values from the -source operand, and places the results in the low and high doublewords -of the destination operand. - -The operation is: - -\c dst[0-31] := dst[0-31] + dst[32-63], -\c dst[32-63] := src[0-31] + src[32-63]. - - -\S{insPFADD} \i\c{PFADD}: Packed Single-Precision FP Addition - -\c PFADD mm1,mm2/m64 ; 0F 0F /r 9E [PENT,3DNOW] - -\c{PFADD} performs addition on each of two packed single-precision -FP value pairs. - -\c dst[0-31] := dst[0-31] + src[0-31], -\c dst[32-63] := dst[32-63] + src[32-63]. - - -\S{insPFCMP} \i\c{PFCMPxx}: Packed Single-Precision FP Compare -\I\c{PFCMPEQ} \I\c{PFCMPGE} \I\c{PFCMPGT} - -\c PFCMPEQ mm1,mm2/m64 ; 0F 0F /r B0 [PENT,3DNOW] -\c PFCMPGE mm1,mm2/m64 ; 0F 0F /r 90 [PENT,3DNOW] -\c PFCMPGT mm1,mm2/m64 ; 0F 0F /r A0 [PENT,3DNOW] - -The \c{PFCMPxx} instructions compare the packed single-precision FP values -in the source and destination operands, and set the destination -according to the result.
If the condition is true, the destination is -set to all 1s, otherwise it's set to all 0s. - -\b \c{PFCMPEQ} tests whether dst == src; - -\b \c{PFCMPGE} tests whether dst >= src; - -\b \c{PFCMPGT} tests whether dst > src. - - -\S{insPFMAX} \i\c{PFMAX}: Packed Single-Precision FP Maximum - -\c PFMAX mm1,mm2/m64 ; 0F 0F /r A4 [PENT,3DNOW] - -\c{PFMAX} returns the higher of each pair of single-precision FP values. -If the higher value is zero, it is returned as positive zero. - - -\S{insPFMIN} \i\c{PFMIN}: Packed Single-Precision FP Minimum - -\c PFMIN mm1,mm2/m64 ; 0F 0F /r 94 [PENT,3DNOW] - -\c{PFMIN} returns the lower of each pair of single-precision FP values. -If the lower value is zero, it is returned as positive zero. - - -\S{insPFMUL} \i\c{PFMUL}: Packed Single-Precision FP Multiply - -\c PFMUL mm1,mm2/m64 ; 0F 0F /r B4 [PENT,3DNOW] - -\c{PFMUL} returns the product of each pair of single-precision FP values. - -\c dst[0-31] := dst[0-31] * src[0-31], -\c dst[32-63] := dst[32-63] * src[32-63]. - - -\S{insPFNACC} \i\c{PFNACC}: Packed Single-Precision FP Negative Accumulate - -\c PFNACC mm1,mm2/m64 ; 0F 0F /r 8A [PENT,3DNOW] - -\c{PFNACC} performs a negative accumulate of the two single-precision -FP values in the source and destination registers. The result of the -accumulate from the destination register is stored in the low doubleword -of the destination, and the result of the source accumulate is stored in -the high doubleword of the destination register. - -The operation is: - -\c dst[0-31] := dst[0-31] - dst[32-63], -\c dst[32-63] := src[0-31] - src[32-63]. - - -\S{insPFPNACC} \i\c{PFPNACC}: Packed Single-Precision FP Mixed Accumulate - -\c PFPNACC mm1,mm2/m64 ; 0F 0F /r 8E [PENT,3DNOW] - -\c{PFPNACC} performs a positive accumulate of the two single-precision -FP values in the source register and a negative accumulate of the -destination register. 
The result of the accumulate from the destination -register is stored in the low doubleword of the destination, and the -result of the source accumulate is stored in the high doubleword of the -destination register. - -The operation is: - -\c dst[0-31] := dst[0-31] - dst[32-63], -\c dst[32-63] := src[0-31] + src[32-63]. - - -\S{insPFRCP} \i\c{PFRCP}: Packed Single-Precision FP Reciprocal Approximation - -\c PFRCP mm1,mm2/m64 ; 0F 0F /r 96 [PENT,3DNOW] - -\c{PFRCP} performs a low precision estimate of the reciprocal of the -low-order single-precision FP value in the source operand, storing the -result in both halves of the destination register. The result is accurate -to 14 bits. - -For higher precision reciprocals, this instruction should be followed by -two more instructions: \c{PFRCPIT1} (\k{insPFRCPIT1}) and \c{PFRCPIT2} -(\k{insPFRCPIT2}). This will result in a 24-bit accuracy. For more details, -see the AMD 3DNow! technology manual. - - -\S{insPFRCPIT1} \i\c{PFRCPIT1}: Packed Single-Precision FP Reciprocal, -First Iteration Step - -\c PFRCPIT1 mm1,mm2/m64 ; 0F 0F /r A6 [PENT,3DNOW] - -\c{PFRCPIT1} performs the first intermediate step in the calculation of -the reciprocal of a single-precision FP value. The first source value -(\c{mm1}) is the original value, and the second source value (\c{mm2/m64}) -is the result of a \c{PFRCP} instruction. - -For the final step in a reciprocal, returning the full 24-bit accuracy -of a single-precision FP value, see \c{PFRCPIT2} (\k{insPFRCPIT2}). For -more details, see the AMD 3DNow! technology manual. - - -\S{insPFRCPIT2} \i\c{PFRCPIT2}: Packed Single-Precision FP -Reciprocal/Reciprocal Square Root, Second Iteration Step - -\c PFRCPIT2 mm1,mm2/m64 ; 0F 0F /r B6 [PENT,3DNOW] - -\c{PFRCPIT2} performs the second and final intermediate step in the -calculation of a reciprocal or reciprocal square root, refining the -values returned by the \c{PFRCP} and \c{PFRSQRT} instructions, -respectively.
- -The first source value (\c{mm1}) is the output of either a \c{PFRCPIT1} -or a \c{PFRSQIT1} instruction, and the second source is the output of -either the \c{PFRCP} or the \c{PFRSQRT} instruction. For more details, -see the AMD 3DNow! technology manual. - - -\S{insPFRSQIT1} \i\c{PFRSQIT1}: Packed Single-Precision FP Reciprocal -Square Root, First Iteration Step - -\c PFRSQIT1 mm1,mm2/m64 ; 0F 0F /r A7 [PENT,3DNOW] - -\c{PFRSQIT1} performs the first intermediate step in the calculation of -the reciprocal square root of a single-precision FP value. The first -source value (\c{mm1}) is the square of the result of a \c{PFRSQRT} -instruction, and the second source value (\c{mm2/m64}) is the original -value. - -For the final step in a calculation, returning the full 24-bit accuracy -of a single-precision FP value, see \c{PFRCPIT2} (\k{insPFRCPIT2}). For -more details, see the AMD 3DNow! technology manual. - - -\S{insPFRSQRT} \i\c{PFRSQRT}: Packed Single-Precision FP Reciprocal -Square Root Approximation - -\c PFRSQRT mm1,mm2/m64 ; 0F 0F /r 97 [PENT,3DNOW] - -\c{PFRSQRT} performs a low precision estimate of the reciprocal square -root of the low-order single-precision FP value in the source operand, -storing the result in both halves of the destination register. The result -is accurate to 15 bits. - -For higher precision reciprocal square roots, this instruction should be followed by -two more instructions: \c{PFRSQIT1} (\k{insPFRSQIT1}) and \c{PFRCPIT2} -(\k{insPFRCPIT2}). This yields 24-bit accuracy. For more details, -see the AMD 3DNow! technology manual. - - -\S{insPFSUB} \i\c{PFSUB}: Packed Single-Precision FP Subtract - -\c PFSUB mm1,mm2/m64 ; 0F 0F /r 9A [PENT,3DNOW] - -\c{PFSUB} subtracts the single-precision FP values in the source from -those in the destination, and stores the result in the destination -operand. - -\c dst[0-31] := dst[0-31] - src[0-31], -\c dst[32-63] := dst[32-63] - src[32-63].
- - -\S{insPFSUBR} \i\c{PFSUBR}: Packed Single-Precision FP Reverse Subtract - -\c PFSUBR mm1,mm2/m64 ; 0F 0F /r AA [PENT,3DNOW] - -\c{PFSUBR} subtracts the single-precision FP values in the destination -from those in the source, and stores the result in the destination -operand. - -\c dst[0-31] := src[0-31] - dst[0-31], -\c dst[32-63] := src[32-63] - dst[32-63]. - - -\S{insPI2FD} \i\c{PI2FD}: Packed Doubleword Integer to Single-Precision FP Convert - -\c PI2FD mm1,mm2/m64 ; 0F 0F /r 0D [PENT,3DNOW] - -\c{PI2FD} converts two signed 32-bit integers in the source operand -to single-precision FP values, using truncation of significant digits, -and stores them in the destination operand. - - -\S{insPF2IW} \i\c{PI2FW}: Packed Word Integer to Single-Precision FP Convert - -\c PI2FW mm1,mm2/m64 ; 0F 0F /r 0C [PENT,3DNOW] - -\c{PI2FW} converts two signed 16-bit integers in the source operand -to single-precision FP values, and stores them in the destination -operand. The input values are in the low word of each doubleword. - - -\S{insPINSRW} \i\c{PINSRW}: Insert Word - -\c PINSRW mm,r16/r32/m16,imm8 ;0F C4 /r ib [KATMAI,MMX] -\c PINSRW xmm,r16/r32/m16,imm8 ;66 0F C4 /r ib [WILLAMETTE,SSE2] - -\c{PINSRW} loads a word from a 16-bit register (or the low half of a -32-bit register), or from memory, and inserts it at the word position -in the destination register selected by the count operand (third -operand). If the destination is an \c{MMX} register, the low two bits -of the count byte are used; if it is an \c{XMM} register, the low three -bits are used. The insertion is done in such a way that the other -words from the destination register are left untouched.
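The count masking described for \c{PINSRW} can be sketched as a small reference model. This is an illustrative Python sketch, not NASM output: the function name \c{pinsrw} and its \c{xmm} flag are hypothetical, and registers are modelled as plain integers.

```python
def pinsrw(dst: int, word: int, count: int, xmm: bool = False) -> int:
    """Illustrative model of PINSRW (hypothetical helper, not a NASM API).

    dst   -- destination register contents as an integer (64 or 128 bits)
    word  -- source value; only its low 16 bits are inserted
    count -- the imm8 count operand
    xmm   -- True to model the XMM form, False for the MMX form
    """
    # Only the low 2 bits (MMX) or low 3 bits (XMM) of the count are used.
    slot = count & (0x7 if xmm else 0x3)
    shift = 16 * slot
    mask = 0xFFFF << shift
    # Clear the selected word, leave every other word untouched.
    return (dst & ~mask) | ((word & 0xFFFF) << shift)


if __name__ == "__main__":
    print(hex(pinsrw(0x1111222233334444, 0xABCD, 2)))
```

Because the count is masked rather than range-checked, a count of 6 selects the same word as a count of 2 in the MMX form of this model.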
- - -\S{insPMACHRIW} \i\c{PMACHRIW}: Packed Multiply and Accumulate with Rounding - -\c PMACHRIW mm,m64 ; 0F 5E /r [CYRIX,MMX] - -\c{PMACHRIW} takes two packed 16-bit integer inputs, multiplies the -values in the inputs, rounds on bit 15 of each result, then adds bits -15-30 of each result to the corresponding position of the \e{implied} -destination register. - -The operation of this instruction is: - -\c dstI[0-15] := dstI[0-15] + (mm[0-15] *m64[0-15] -\c + 0x00004000)[15-30], -\c dstI[16-31] := dstI[16-31] + (mm[16-31]*m64[16-31] -\c + 0x00004000)[15-30], -\c dstI[32-47] := dstI[32-47] + (mm[32-47]*m64[32-47] -\c + 0x00004000)[15-30], -\c dstI[48-63] := dstI[48-63] + (mm[48-63]*m64[48-63] -\c + 0x00004000)[15-30]. - -Note that \c{PMACHRIW} cannot take a register as its second source -operand. - - -\S{insPMADDWD} \i\c{PMADDWD}: MMX Packed Multiply and Add - -\c PMADDWD mm1,mm2/m64 ; 0F F5 /r [PENT,MMX] -\c PMADDWD xmm1,xmm2/m128 ; 66 0F F5 /r [WILLAMETTE,SSE2] - -\c{PMADDWD} treats its two inputs as vectors of signed words. It -multiplies corresponding elements of the two operands, giving doubleword -results. These are then added together in pairs and stored in the -destination operand. - -The operation of this instruction is: - -\c dst[0-31] := (dst[0-15] * src[0-15]) -\c + (dst[16-31] * src[16-31]); -\c dst[32-63] := (dst[32-47] * src[32-47]) -\c + (dst[48-63] * src[48-63]); - -The following apply to the \c{SSE} version of the instruction: - -\c dst[64-95] := (dst[64-79] * src[64-79]) -\c + (dst[80-95] * src[80-95]); -\c dst[96-127] := (dst[96-111] * src[96-111]) -\c + (dst[112-127] * src[112-127]). - - -\S{insPMAGW} \i\c{PMAGW}: MMX Packed Magnitude - -\c PMAGW mm1,mm2/m64 ; 0F 52 /r [CYRIX,MMX] - -\c{PMAGW}, specific to the Cyrix MMX extensions, treats both its -operands as vectors of four signed words. 
It compares the absolute -values of the words in corresponding positions, and sets each word -of the destination (first) operand to whichever of the two words in -that position had the larger absolute value. - - -\S{insPMAXSW} \i\c{PMAXSW}: Packed Signed Integer Word Maximum - -\c PMAXSW mm1,mm2/m64 ; 0F EE /r [KATMAI,MMX] -\c PMAXSW xmm1,xmm2/m128 ; 66 0F EE /r [WILLAMETTE,SSE2] - -\c{PMAXSW} compares each pair of words in the two source operands, and -for each pair it stores the maximum value in the destination register. - - -\S{insPMAXUB} \i\c{PMAXUB}: Packed Unsigned Integer Byte Maximum - -\c PMAXUB mm1,mm2/m64 ; 0F DE /r [KATMAI,MMX] -\c PMAXUB xmm1,xmm2/m128 ; 66 0F DE /r [WILLAMETTE,SSE2] - -\c{PMAXUB} compares each pair of bytes in the two source operands, and -for each pair it stores the maximum value in the destination register. - - -\S{insPMINSW} \i\c{PMINSW}: Packed Signed Integer Word Minimum - -\c PMINSW mm1,mm2/m64 ; 0F EA /r [KATMAI,MMX] -\c PMINSW xmm1,xmm2/m128 ; 66 0F EA /r [WILLAMETTE,SSE2] - -\c{PMINSW} compares each pair of words in the two source operands, and -for each pair it stores the minimum value in the destination register. - - -\S{insPMINUB} \i\c{PMINUB}: Packed Unsigned Integer Byte Minimum - -\c PMINUB mm1,mm2/m64 ; 0F DA /r [KATMAI,MMX] -\c PMINUB xmm1,xmm2/m128 ; 66 0F DA /r [WILLAMETTE,SSE2] - -\c{PMINUB} compares each pair of bytes in the two source operands, and -for each pair it stores the minimum value in the destination register. - - -\S{insPMOVMSKB} \i\c{PMOVMSKB}: Move Byte Mask To Integer - -\c PMOVMSKB reg32,mm ; 0F D7 /r [KATMAI,MMX] -\c PMOVMSKB reg32,xmm ; 66 0F D7 /r [WILLAMETTE,SSE2] - -\c{PMOVMSKB} returns an 8-bit or 16-bit mask formed of the most -significant bits of each byte of source operand (8-bits for an -\c{MMX} register, 16-bits for an \c{XMM} register). 
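As a rough illustration of how the \c{PMOVMSKB} mask is assembled, the following Python sketch models the source register as an integer. The helper name \c{pmovmskb} is hypothetical; bit \e{i} of the result is taken from the top bit of byte \e{i} of the source.

```python
def pmovmskb(src: int, xmm: bool = False) -> int:
    """Illustrative model of PMOVMSKB (hypothetical helper).

    Collects the most significant bit of each source byte into an
    8-bit (MMX) or 16-bit (XMM) mask.
    """
    nbytes = 16 if xmm else 8
    mask = 0
    for i in range(nbytes):
        msb = (src >> (8 * i + 7)) & 1   # top bit of byte i
        mask |= msb << i                 # becomes bit i of the mask
    return mask


if __name__ == "__main__":
    print(hex(pmovmskb(0x80000000000000FF)))
```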
- - -\S{insPMULHRW} \i\c{PMULHRWC}, \i\c{PMULHRIW}: Multiply Packed 16-bit Integers -With Rounding, and Store High Word - -\c PMULHRWC mm1,mm2/m64 ; 0F 59 /r [CYRIX,MMX] -\c PMULHRIW mm1,mm2/m64 ; 0F 5D /r [CYRIX,MMX] - -These instructions take two packed 16-bit integer inputs, multiply the -values in the inputs, round on bit 15 of each result, then store bits -15-30 of each result to the corresponding position of the destination -register. - -\b For \c{PMULHRWC}, the destination is the first source operand. - -\b For \c{PMULHRIW}, the destination is an implied register (worked out -as described for \c{PADDSIW} (\k{insPADDSIW})). - -The operation of this instruction is: - -\c dst[0-15] := (src1[0-15] *src2[0-15] + 0x00004000)[15-30] -\c dst[16-31] := (src1[16-31]*src2[16-31] + 0x00004000)[15-30] -\c dst[32-47] := (src1[32-47]*src2[32-47] + 0x00004000)[15-30] -\c dst[48-63] := (src1[48-63]*src2[48-63] + 0x00004000)[15-30] - -See also \c{PMULHRWA} (\k{insPMULHRWA}) for a 3DNow! version of this -instruction. - - -\S{insPMULHRWA} \i\c{PMULHRWA}: Multiply Packed 16-bit Integers -With Rounding, and Store High Word - -\c PMULHRWA mm1,mm2/m64 ; 0F 0F /r B7 [PENT,3DNOW] - -\c{PMULHRWA} takes two packed 16-bit integer inputs, multiplies -the values in the inputs, rounds on bit 16 of each result, then -stores bits 16-31 of each result to the corresponding position -of the destination register. - -The operation of this instruction is: - -\c dst[0-15] := (src1[0-15] *src2[0-15] + 0x00008000)[16-31]; -\c dst[16-31] := (src1[16-31]*src2[16-31] + 0x00008000)[16-31]; -\c dst[32-47] := (src1[32-47]*src2[32-47] + 0x00008000)[16-31]; -\c dst[48-63] := (src1[48-63]*src2[48-63] + 0x00008000)[16-31]. - -See also \c{PMULHRWC} (\k{insPMULHRW}) for a Cyrix version of this -instruction. 
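The rounding formula given for \c{PMULHRWA} can be tried out with a short Python model. This is only a sketch under stated assumptions: the helper names are hypothetical, registers are modelled as 64-bit integers, and the 16-bit inputs are assumed to be interpreted as signed values (the text above does not state the signedness).

```python
def to_signed16(x: int) -> int:
    """Interpret the low 16 bits of x as a two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def pmulhrwa(src1: int, src2: int) -> int:
    """Illustrative model of PMULHRWA (hypothetical helper).

    For each of the four 16-bit lanes, computes
    (src1 * src2 + 0x00008000)[16-31], assuming signed inputs.
    """
    result = 0
    for lane in range(4):
        shift = 16 * lane
        a = to_signed16(src1 >> shift)
        b = to_signed16(src2 >> shift)
        rounded = (a * b + 0x00008000) >> 16   # round, then keep bits 16-31
        result |= (rounded & 0xFFFF) << shift
    return result


if __name__ == "__main__":
    print(hex(pmulhrwa(0x000080007FFF4000, 0x000000027FFF0002)))
```

In the lowest lane of this example, 0x4000 * 2 = 0x8000 would truncate to zero without the rounding constant, but rounds up to 1 with it.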
- - -\S{insPMULHUW} \i\c{PMULHUW}: Multiply Packed 16-bit Integers, -and Store High Word - -\c PMULHUW mm1,mm2/m64 ; 0F E4 /r [KATMAI,MMX] -\c PMULHUW xmm1,xmm2/m128 ; 66 0F E4 /r [WILLAMETTE,SSE2] - -\c{PMULHUW} takes two packed unsigned 16-bit integer inputs, multiplies -the values in the inputs, then stores bits 16-31 of each result to the -corresponding position of the destination register. - - -\S{insPMULHW} \i\c{PMULHW}, \i\c{PMULLW}: Multiply Packed 16-bit Integers, -and Store - -\c PMULHW mm1,mm2/m64 ; 0F E5 /r [PENT,MMX] -\c PMULLW mm1,mm2/m64 ; 0F D5 /r [PENT,MMX] - -\c PMULHW xmm1,xmm2/m128 ; 66 0F E5 /r [WILLAMETTE,SSE2] -\c PMULLW xmm1,xmm2/m128 ; 66 0F D5 /r [WILLAMETTE,SSE2] - -\c{PMULxW} takes two packed signed 16-bit integer inputs, and -multiplies the values in the inputs, forming doubleword results. - -\b \c{PMULHW} then stores the top 16 bits of each doubleword in the -destination (first) operand; - -\b \c{PMULLW} stores the bottom 16 bits of each doubleword in the -destination operand. - - -\S{insPMULUDQ} \i\c{PMULUDQ}: Multiply Packed Unsigned -32-bit Integers, and Store - -\c PMULUDQ mm1,mm2/m64 ; 0F F4 /r [WILLAMETTE,SSE2] -\c PMULUDQ xmm1,xmm2/m128 ; 66 0F F4 /r [WILLAMETTE,SSE2] - -\c{PMULUDQ} takes two packed unsigned 32-bit integer inputs, and -multiplies the values in the inputs, forming quadword results. The -source is either an unsigned doubleword in the low doubleword of a -64-bit operand, or two unsigned doublewords in the first and -third doublewords of a 128-bit operand. This produces either one or -two 64-bit results, which are stored in the respective quadword -locations of the destination register. - -The operation is: - -\c dst[0-63] := dst[0-31] * src[0-31]; -\c dst[64-127] := dst[64-95] * src[64-95].
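The \c{PMULUDQ} operation above can be sketched in Python as follows. The helper name \c{pmuludq} and its \c{xmm} flag are hypothetical, and registers are modelled as integers; the point is only to show which doublewords take part in the multiplication.

```python
def pmuludq(dst: int, src: int, xmm: bool = False) -> int:
    """Illustrative model of PMULUDQ (hypothetical helper).

    Multiplies the low doubleword of each quadword as unsigned 32-bit
    values and returns the full 64-bit product(s).
    """
    m32 = 0xFFFFFFFF
    # dst[0-63] := dst[0-31] * src[0-31]
    low = (dst & m32) * (src & m32)
    if not xmm:
        return low
    # dst[64-127] := dst[64-95] * src[64-95]
    high = ((dst >> 64) & m32) * ((src >> 64) & m32)
    return (high << 64) | low


if __name__ == "__main__":
    print(hex(pmuludq(0xFFFFFFFF, 0xFFFFFFFF)))
```

Note that the second and fourth doublewords of the inputs are ignored in this model, matching the operation listing.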
- - -\S{insPMVccZB} \i\c{PMVccZB}: MMX Packed Conditional Move - -\c PMVZB mmxreg,mem64 ; 0F 58 /r [CYRIX,MMX] -\c PMVNZB mmxreg,mem64 ; 0F 5A /r [CYRIX,MMX] -\c PMVLZB mmxreg,mem64 ; 0F 5B /r [CYRIX,MMX] -\c PMVGEZB mmxreg,mem64 ; 0F 5C /r [CYRIX,MMX] - -These instructions, specific to the Cyrix MMX extensions, perform -parallel conditional moves. The two input operands are treated as -vectors of eight bytes. Each byte of the destination (first) operand -is either written from the corresponding byte of the source (second) -operand, or left alone, depending on the value of the byte in the -\e{implied} operand (specified in the same way as \c{PADDSIW}, in -\k{insPADDSIW}). - -\b \c{PMVZB} performs each move if the corresponding byte in the -implied operand is zero; - -\b \c{PMVNZB} moves if the byte is non-zero; - -\b \c{PMVLZB} moves if the byte is less than zero; - -\b \c{PMVGEZB} moves if the byte is greater than or equal to zero. - -Note that these instructions cannot take a register as their second -source operand. - - -\S{insPOP} \i\c{POP}: Pop Data from Stack - -\c POP reg16 ; o16 58+r [8086] -\c POP reg32 ; o32 58+r [386] - -\c POP r/m16 ; o16 8F /0 [8086] -\c POP r/m32 ; o32 8F /0 [386] - -\c POP CS ; 0F [8086,UNDOC] -\c POP DS ; 1F [8086] -\c POP ES ; 07 [8086] -\c POP SS ; 17 [8086] -\c POP FS ; 0F A1 [386] -\c POP GS ; 0F A9 [386] - -\c{POP} loads a value from the stack (from \c{[SS:SP]} or -\c{[SS:ESP]}) and then increments the stack pointer. - -The address-size attribute of the instruction determines whether -\c{SP} or \c{ESP} is used as the stack pointer: to deliberately -override the default given by the \c{BITS} setting, you can use an -\i\c{a16} or \i\c{a32} prefix. - -The operand-size attribute of the instruction determines whether the -stack pointer is incremented by 2 or 4: this means that segment -register pops in \c{BITS 32} mode will pop 4 bytes off the stack and -discard the upper two of them. 
If you need to override that, you can -use an \i\c{o16} or \i\c{o32} prefix. - -The above opcode listings give two forms for general-purpose -register pop instructions: for example, \c{POP BX} has the two forms -\c{5B} and \c{8F C3}. NASM will always generate the shorter form -when given \c{POP BX}. NDISASM will disassemble both. - -\c{POP CS} is not a documented instruction, and is not supported on -any processor above the 8086 (since they use \c{0Fh} as an opcode -prefix for instruction set extensions). However, at least some 8086 -processors do support it, and so NASM generates it for completeness. - - -\S{insPOPA} \i\c{POPAx}: Pop All General-Purpose Registers - -\c POPA ; 61 [186] -\c POPAW ; o16 61 [186] -\c POPAD ; o32 61 [386] - -\b \c{POPAW} pops a word from the stack into each of, successively, -\c{DI}, \c{SI}, \c{BP}, nothing (it discards a word from the stack -which was a placeholder for \c{SP}), \c{BX}, \c{DX}, \c{CX} and -\c{AX}. It is intended to reverse the operation of \c{PUSHAW} (see -\k{insPUSHA}), but it ignores the value for \c{SP} that was pushed -on the stack by \c{PUSHAW}. - -\b \c{POPAD} pops twice as much data, and places the results in -\c{EDI}, \c{ESI}, \c{EBP}, nothing (placeholder for \c{ESP}), -\c{EBX}, \c{EDX}, \c{ECX} and \c{EAX}. It reverses the operation of -\c{PUSHAD}. - -\c{POPA} is an alias mnemonic for either \c{POPAW} or \c{POPAD}, -depending on the current \c{BITS} setting. - -Note that the registers are popped in reverse order of their numeric -values in opcodes (see \k{iref-rv}). - - -\S{insPOPF} \i\c{POPFx}: Pop Flags Register - -\c POPF ; 9D [8086] -\c POPFW ; o16 9D [8086] -\c POPFD ; o32 9D [386] - -\b \c{POPFW} pops a word from the stack and stores it in the bottom 16 -bits of the flags register (or the whole flags register, on -processors below a 386). - -\b \c{POPFD} pops a doubleword and stores it in the entire flags register. 
- -\c{POPF} is an alias mnemonic for either \c{POPFW} or \c{POPFD}, -depending on the current \c{BITS} setting. - -See also \c{PUSHF} (\k{insPUSHF}). - - -\S{insPOR} \i\c{POR}: MMX Bitwise OR - -\c POR mm1,mm2/m64 ; 0F EB /r [PENT,MMX] -\c POR xmm1,xmm2/m128 ; 66 0F EB /r [WILLAMETTE,SSE2] - -\c{POR} performs a bitwise OR operation between its two operands -(i.e. each bit of the result is 1 if and only if at least one of the -corresponding bits of the two inputs was 1), and stores the result -in the destination (first) operand. - - -\S{insPREFETCH} \i\c{PREFETCH}: Prefetch Data Into Caches - -\c PREFETCH mem8 ; 0F 0D /0 [PENT,3DNOW] -\c PREFETCHW mem8 ; 0F 0D /1 [PENT,3DNOW] - -\c{PREFETCH} and \c{PREFETCHW} fetch the line of data from memory that -contains the specified byte. \c{PREFETCHW} performs differently on the -Athlon to earlier processors. - -For more details, see the 3DNow! Technology Manual. - - -\S{insPREFETCHh} \i\c{PREFETCHh}: Prefetch Data Into Caches -\I\c{PREFETCHNTA} \I\c{PREFETCHT0} \I\c{PREFETCHT1} \I\c{PREFETCHT2} - -\c PREFETCHNTA m8 ; 0F 18 /0 [KATMAI] -\c PREFETCHT0 m8 ; 0F 18 /1 [KATMAI] -\c PREFETCHT1 m8 ; 0F 18 /2 [KATMAI] -\c PREFETCHT2 m8 ; 0F 18 /3 [KATMAI] - -The \c{PREFETCHh} instructions fetch the line of data from memory -that contains the specified byte. It is placed in the cache -according to rules specified by locality hints \c{h}: - -The hints are: - -\b \c{T0} (temporal data) - prefetch data into all levels of the -cache hierarchy. - -\b \c{T1} (temporal data with respect to first level cache) - -prefetch data into level 2 cache and higher. - -\b \c{T2} (temporal data with respect to second level cache) - -prefetch data into level 2 cache and higher. - -\b \c{NTA} (non-temporal data with respect to all cache levels) - -prefetch data into non-temporal cache structure and into a -location close to the processor, minimizing cache pollution. 
- -Note that this group of instructions doesn't provide a guarantee -that the data will be in the cache when it is needed. For more -details, see the Intel IA32 Software Developer Manual, Volume 2. - - -\S{insPSADBW} \i\c{PSADBW}: Packed Sum of Absolute Differences - -\c PSADBW mm1,mm2/m64 ; 0F F6 /r [KATMAI,MMX] -\c PSADBW xmm1,xmm2/m128 ; 66 0F F6 /r [WILLAMETTE,SSE2] - -\c{PSADBW} computes the absolute value of the -difference of the packed unsigned bytes in the two source operands. -These differences are then summed to produce a word result in the lower -16-bit field of the destination register; the rest of the register is -cleared. With \c{XMM} operands, this is done separately for each 64-bit -half of the register. The destination operand is an \c{MMX} or an \c{XMM} register. -The source operand can either be a register or a memory operand. - - -\S{insPSHUFD} \i\c{PSHUFD}: Shuffle Packed Doublewords - -\c PSHUFD xmm1,xmm2/m128,imm8 ; 66 0F 70 /r ib [WILLAMETTE,SSE2] - -\c{PSHUFD} shuffles the doublewords in the source (second) operand -according to the encoding specified by imm8, and stores the result -in the destination (first) operand. - -Bits 0 and 1 of imm8 encode the source position of the doubleword to -be copied to position 0 in the destination operand. Bits 2 and 3 -encode for position 1, bits 4 and 5 encode for position 2, and bits -6 and 7 encode for position 3. For example, an encoding of 10 in -bits 0 and 1 of imm8 indicates that the doubleword at bits 64-95 of -the source operand will be copied to bits 0-31 of the destination. - - -\S{insPSHUFHW} \i\c{PSHUFHW}: Shuffle Packed High Words - -\c PSHUFHW xmm1,xmm2/m128,imm8 ; F3 0F 70 /r ib [WILLAMETTE,SSE2] - -\c{PSHUFHW} shuffles the words in the high quadword of the source -(second) operand according to the encoding specified by imm8, and -stores the result in the high quadword of the destination (first) -operand.
- -The operation of this instruction is similar to the \c{PSHUFW} -instruction, except that the source and destination are the top -quadword of a 128-bit operand, instead of being 64-bit operands. -The low quadword is copied from the source to the destination -without any changes. - - -\S{insPSHUFLW} \i\c{PSHUFLW}: Shuffle Packed Low Words - -\c PSHUFLW xmm1,xmm2/m128,imm8 ; F2 0F 70 /r ib [WILLAMETTE,SSE2] - -\c{PSHUFLW} shuffles the words in the low quadword of the source -(second) operand according to the encoding specified by imm8, and -stores the result in the low quadword of the destination (first) -operand. - -The operation of this instruction is similar to the \c{PSHUFW} -instruction, except that the source and destination are the low -quadword of a 128-bit operand, instead of being 64-bit operands. -The high quadword is copied from the source to the destination -without any changes. - - -\S{insPSHUFW} \i\c{PSHUFW}: Shuffle Packed Words - -\c PSHUFW mm1,mm2/m64,imm8 ; 0F 70 /r ib [KATMAI,MMX] - -\c{PSHUFW} shuffles the words in the source (second) operand -according to the encoding specified by imm8, and stores the result -in the destination (first) operand. - -Bits 0 and 1 of imm8 encode the source position of the word to be -copied to position 0 in the destination operand. Bits 2 and 3 encode -for position 1, bits 4 and 5 encode for position 2, and bits 6 and 7 -encode for position 3. For example, an encoding of 10 in bits 0 and 1 -of imm8 indicates that the word at bits 32-47 of the source operand -will be copied to bits 0-15 of the destination. 
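The imm8 encoding used by \c{PSHUFW} can be illustrated with a short Python model. The helper name \c{pshufw} is hypothetical and the 64-bit operand is modelled as an integer; the same two-bits-per-position scheme applies to \c{PSHUFD} with doublewords in place of words.

```python
def pshufw(src: int, imm8: int) -> int:
    """Illustrative model of PSHUFW (hypothetical helper).

    Bits 2*pos and 2*pos+1 of imm8 select which source word is
    copied to destination word position pos.
    """
    result = 0
    for pos in range(4):
        sel = (imm8 >> (2 * pos)) & 0x3          # source word index
        word = (src >> (16 * sel)) & 0xFFFF
        result |= word << (16 * pos)
    return result


if __name__ == "__main__":
    # 0x1B = 00 01 10 11b reverses the order of the four words.
    print(hex(pshufw(0x3333222211110000, 0x1B)))
```

An imm8 of 0xE4 (11 10 01 00 in binary) selects each word for its own position and so leaves the value unchanged in this model.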
- - -\S{insPSLLD} \i\c{PSLLx}: Packed Data Bit Shift Left Logical - -\c PSLLW mm1,mm2/m64 ; 0F F1 /r [PENT,MMX] -\c PSLLW mm,imm8 ; 0F 71 /6 ib [PENT,MMX] - -\c PSLLW xmm1,xmm2/m128 ; 66 0F F1 /r [WILLAMETTE,SSE2] -\c PSLLW xmm,imm8 ; 66 0F 71 /6 ib [WILLAMETTE,SSE2] - -\c PSLLD mm1,mm2/m64 ; 0F F2 /r [PENT,MMX] -\c PSLLD mm,imm8 ; 0F 72 /6 ib [PENT,MMX] - -\c PSLLD xmm1,xmm2/m128 ; 66 0F F2 /r [WILLAMETTE,SSE2] -\c PSLLD xmm,imm8 ; 66 0F 72 /6 ib [WILLAMETTE,SSE2] - -\c PSLLQ mm1,mm2/m64 ; 0F F3 /r [PENT,MMX] -\c PSLLQ mm,imm8 ; 0F 73 /6 ib [PENT,MMX] - -\c PSLLQ xmm1,xmm2/m128 ; 66 0F F3 /r [WILLAMETTE,SSE2] -\c PSLLQ xmm,imm8 ; 66 0F 73 /6 ib [WILLAMETTE,SSE2] - -\c PSLLDQ xmm1,imm8 ; 66 0F 73 /7 ib [WILLAMETTE,SSE2] - -\c{PSLLx} performs logical left shifts of the data elements in the -destination (first) operand, moving each bit in the separate elements -left by the number of bits specified in the source (second) operand, -clearing the low-order bits as they are vacated. \c{PSLLDQ} -shifts bytes, not bits. - -\b \c{PSLLW} shifts word sized elements. - -\b \c{PSLLD} shifts doubleword sized elements. - -\b \c{PSLLQ} shifts quadword sized elements. - -\b \c{PSLLDQ} shifts double quadword sized elements. - - -\S{insPSRAD} \i\c{PSRAx}: Packed Data Bit Shift Right Arithmetic - -\c PSRAW mm1,mm2/m64 ; 0F E1 /r [PENT,MMX] -\c PSRAW mm,imm8 ; 0F 71 /4 ib [PENT,MMX] - -\c PSRAW xmm1,xmm2/m128 ; 66 0F E1 /r [WILLAMETTE,SSE2] -\c PSRAW xmm,imm8 ; 66 0F 71 /4 ib [WILLAMETTE,SSE2] - -\c PSRAD mm1,mm2/m64 ; 0F E2 /r [PENT,MMX] -\c PSRAD mm,imm8 ; 0F 72 /4 ib [PENT,MMX] - -\c PSRAD xmm1,xmm2/m128 ; 66 0F E2 /r [WILLAMETTE,SSE2] -\c PSRAD xmm,imm8 ; 66 0F 72 /4 ib [WILLAMETTE,SSE2] - -\c{PSRAx} performs arithmetic right shifts of the data elements in the -destination (first) operand, moving each bit in the separate elements -right by the number of bits specified in the source (second) operand, -setting the high-order bits to the value of the original sign bit. 
- -\b \c{PSRAW} shifts word sized elements. - -\b \c{PSRAD} shifts doubleword sized elements. - - -\S{insPSRLD} \i\c{PSRLx}: Packed Data Bit Shift Right Logical - -\c PSRLW mm1,mm2/m64 ; 0F D1 /r [PENT,MMX] -\c PSRLW mm,imm8 ; 0F 71 /2 ib [PENT,MMX] - -\c PSRLW xmm1,xmm2/m128 ; 66 0F D1 /r [WILLAMETTE,SSE2] -\c PSRLW xmm,imm8 ; 66 0F 71 /2 ib [WILLAMETTE,SSE2] - -\c PSRLD mm1,mm2/m64 ; 0F D2 /r [PENT,MMX] -\c PSRLD mm,imm8 ; 0F 72 /2 ib [PENT,MMX] - -\c PSRLD xmm1,xmm2/m128 ; 66 0F D2 /r [WILLAMETTE,SSE2] -\c PSRLD xmm,imm8 ; 66 0F 72 /2 ib [WILLAMETTE,SSE2] - -\c PSRLQ mm1,mm2/m64 ; 0F D3 /r [PENT,MMX] -\c PSRLQ mm,imm8 ; 0F 73 /2 ib [PENT,MMX] - -\c PSRLQ xmm1,xmm2/m128 ; 66 0F D3 /r [WILLAMETTE,SSE2] -\c PSRLQ xmm,imm8 ; 66 0F 73 /2 ib [WILLAMETTE,SSE2] - -\c PSRLDQ xmm1,imm8 ; 66 0F 73 /3 ib [WILLAMETTE,SSE2] - -\c{PSRLx} performs logical right shifts of the data elements in the -destination (first) operand, moving each bit in the separate elements -right by the number of bits specified in the source (second) operand, -clearing the high-order bits as they are vacated. \c{PSRLDQ} -shifts bytes, not bits. - -\b \c{PSRLW} shifts word sized elements. - -\b \c{PSRLD} shifts doubleword sized elements. - -\b \c{PSRLQ} shifts quadword sized elements. - -\b \c{PSRLDQ} shifts double quadword sized elements. - - -\S{insPSUBB} \i\c{PSUBx}: Subtract Packed Integers - -\c PSUBB mm1,mm2/m64 ; 0F F8 /r [PENT,MMX] -\c PSUBW mm1,mm2/m64 ; 0F F9 /r [PENT,MMX] -\c PSUBD mm1,mm2/m64 ; 0F FA /r [PENT,MMX] -\c PSUBQ mm1,mm2/m64 ; 0F FB /r [WILLAMETTE,SSE2] - -\c PSUBB xmm1,xmm2/m128 ; 66 0F F8 /r [WILLAMETTE,SSE2] -\c PSUBW xmm1,xmm2/m128 ; 66 0F F9 /r [WILLAMETTE,SSE2] -\c PSUBD xmm1,xmm2/m128 ; 66 0F FA /r [WILLAMETTE,SSE2] -\c PSUBQ xmm1,xmm2/m128 ; 66 0F FB /r [WILLAMETTE,SSE2] - -\c{PSUBx} subtracts packed integers in the source operand from those -in the destination operand. It doesn't differentiate between signed -and unsigned integers, and doesn't set any of the flags. 
- -\b \c{PSUBB} operates on byte sized elements. - -\b \c{PSUBW} operates on word sized elements. - -\b \c{PSUBD} operates on doubleword sized elements. - -\b \c{PSUBQ} operates on quadword sized elements. - - -\S{insPSUBSB} \i\c{PSUBSxx}, \i\c{PSUBUSx}: Subtract Packed Integers With Saturation - -\c PSUBSB mm1,mm2/m64 ; 0F E8 /r [PENT,MMX] -\c PSUBSW mm1,mm2/m64 ; 0F E9 /r [PENT,MMX] - -\c PSUBSB xmm1,xmm2/m128 ; 66 0F E8 /r [WILLAMETTE,SSE2] -\c PSUBSW xmm1,xmm2/m128 ; 66 0F E9 /r [WILLAMETTE,SSE2] - -\c PSUBUSB mm1,mm2/m64 ; 0F D8 /r [PENT,MMX] -\c PSUBUSW mm1,mm2/m64 ; 0F D9 /r [PENT,MMX] - -\c PSUBUSB xmm1,xmm2/m128 ; 66 0F D8 /r [WILLAMETTE,SSE2] -\c PSUBUSW xmm1,xmm2/m128 ; 66 0F D9 /r [WILLAMETTE,SSE2] - -\c{PSUBSx} and \c{PSUBUSx} subtract packed integers in the source -operand from those in the destination operand, and use saturation for -results that are outside the range supported by the destination operand. - -\b \c{PSUBSB} operates on signed bytes, and uses signed saturation on the -results. - -\b \c{PSUBSW} operates on signed words, and uses signed saturation on the -results. - -\b \c{PSUBUSB} operates on unsigned bytes, and uses unsigned saturation on -the results. - -\b \c{PSUBUSW} operates on unsigned words, and uses unsigned saturation on -the results. - - -\S{insPSUBSIW} \i\c{PSUBSIW}: MMX Packed Subtract with Saturation to -Implied Destination - -\c PSUBSIW mm1,mm2/m64 ; 0F 55 /r [CYRIX,MMX] - -\c{PSUBSIW}, specific to the Cyrix extensions to the MMX instruction -set, performs the same function as \c{PSUBSW}, except that the -result is not placed in the register specified by the first operand, -but instead in the implied destination register, specified as for -\c{PADDSIW} (\k{insPADDSIW}). - - -\S{insPSWAPD} \i\c{PSWAPD}: Swap Packed Data -\I\c{PSWAPW} - -\c PSWAPD mm1,mm2/m64 ; 0F 0F /r BB [PENT,3DNOW] - -\c{PSWAPD} swaps the packed doublewords in the source operand, and -stores the result in the destination operand.
- -In the \c{K6-2} and \c{K6-III} processors, this opcode uses the -mnemonic \c{PSWAPW}, and it swaps the order of words when copying -from the source to the destination. - -The operation in the \c{K6-2} and \c{K6-III} processors is - -\c dst[0-15] = src[48-63]; -\c dst[16-31] = src[32-47]; -\c dst[32-47] = src[16-31]; -\c dst[48-63] = src[0-15]. - -The operation in the \c{K6-x+}, \c{ATHLON} and later processors is: - -\c dst[0-31] = src[32-63]; -\c dst[32-63] = src[0-31]. - - -\S{insPUNPCKHBW} \i\c{PUNPCKxxx}: Unpack and Interleave Data - -\c PUNPCKHBW mm1,mm2/m64 ; 0F 68 /r [PENT,MMX] -\c PUNPCKHWD mm1,mm2/m64 ; 0F 69 /r [PENT,MMX] -\c PUNPCKHDQ mm1,mm2/m64 ; 0F 6A /r [PENT,MMX] - -\c PUNPCKHBW xmm1,xmm2/m128 ; 66 0F 68 /r [WILLAMETTE,SSE2] -\c PUNPCKHWD xmm1,xmm2/m128 ; 66 0F 69 /r [WILLAMETTE,SSE2] -\c PUNPCKHDQ xmm1,xmm2/m128 ; 66 0F 6A /r [WILLAMETTE,SSE2] -\c PUNPCKHQDQ xmm1,xmm2/m128 ; 66 0F 6D /r [WILLAMETTE,SSE2] - -\c PUNPCKLBW mm1,mm2/m32 ; 0F 60 /r [PENT,MMX] -\c PUNPCKLWD mm1,mm2/m32 ; 0F 61 /r [PENT,MMX] -\c PUNPCKLDQ mm1,mm2/m32 ; 0F 62 /r [PENT,MMX] - -\c PUNPCKLBW xmm1,xmm2/m128 ; 66 0F 60 /r [WILLAMETTE,SSE2] -\c PUNPCKLWD xmm1,xmm2/m128 ; 66 0F 61 /r [WILLAMETTE,SSE2] -\c PUNPCKLDQ xmm1,xmm2/m128 ; 66 0F 62 /r [WILLAMETTE,SSE2] -\c PUNPCKLQDQ xmm1,xmm2/m128 ; 66 0F 6C /r [WILLAMETTE,SSE2] - -\c{PUNPCKxx} all treat their operands as vectors, and produce a new -vector generated by interleaving elements from the two inputs. The -\c{PUNPCKHxx} instructions start by throwing away the bottom half of -each input operand, and the \c{PUNPCKLxx} instructions throw away -the top half. - -The remaining elements, are then interleaved into the destination, -alternating elements from the second (source) operand and the first -(destination) operand: so the leftmost part of each element in the -result always comes from the second operand, and the rightmost from -the destination. - -\b \c{PUNPCKxBW} works a byte at a time, producing word sized output -elements. 
- -\b \c{PUNPCKxWD} works a word at a time, producing doubleword sized -output elements. - -\b \c{PUNPCKxDQ} works a doubleword at a time, producing quadword sized -output elements. - -\b \c{PUNPCKxQDQ} works a quadword at a time, producing double quadword -sized output elements. - -So, for example, for \c{MMX} operands, if the first operand held -\c{0x7A6A5A4A3A2A1A0A} and the second held \c{0x7B6B5B4B3B2B1B0B}, -then: - -\b \c{PUNPCKHBW} would return \c{0x7B7A6B6A5B5A4B4A}. - -\b \c{PUNPCKHWD} would return \c{0x7B6B7A6A5B4B5A4A}. - -\b \c{PUNPCKHDQ} would return \c{0x7B6B5B4B7A6A5A4A}. - -\b \c{PUNPCKLBW} would return \c{0x3B3A2B2A1B1A0B0A}. - -\b \c{PUNPCKLWD} would return \c{0x3B2B3A2A1B0B1A0A}. - -\b \c{PUNPCKLDQ} would return \c{0x3B2B1B0B3A2A1A0A}. - - -\S{insPUSH} \i\c{PUSH}: Push Data on Stack - -\c PUSH reg16 ; o16 50+r [8086] -\c PUSH reg32 ; o32 50+r [386] - -\c PUSH r/m16 ; o16 FF /6 [8086] -\c PUSH r/m32 ; o32 FF /6 [386] - -\c PUSH CS ; 0E [8086] -\c PUSH DS ; 1E [8086] -\c PUSH ES ; 06 [8086] -\c PUSH SS ; 16 [8086] -\c PUSH FS ; 0F A0 [386] -\c PUSH GS ; 0F A8 [386] - -\c PUSH imm8 ; 6A ib [186] -\c PUSH imm16 ; o16 68 iw [186] -\c PUSH imm32 ; o32 68 id [386] - -\c{PUSH} decrements the stack pointer (\c{SP} or \c{ESP}) by 2 or 4, -and then stores the given value at \c{[SS:SP]} or \c{[SS:ESP]}. - -The address-size attribute of the instruction determines whether -\c{SP} or \c{ESP} is used as the stack pointer: to deliberately -override the default given by the \c{BITS} setting, you can use an -\i\c{a16} or \i\c{a32} prefix. - -The operand-size attribute of the instruction determines whether the -stack pointer is decremented by 2 or 4: this means that segment -register pushes in \c{BITS 32} mode will push 4 bytes on the stack, -of which the upper two are undefined. If you need to override that, -you can use an \i\c{o16} or \i\c{o32} prefix. 
- -The above opcode listings give two forms for general-purpose -\i{register push} instructions: for example, \c{PUSH BX} has the two -forms \c{53} and \c{FF F3}. NASM will always generate the shorter -form when given \c{PUSH BX}. NDISASM will disassemble both. - -Unlike the undocumented and barely supported \c{POP CS}, \c{PUSH CS} -is a perfectly valid and sensible instruction, supported on all -processors. - -The instruction \c{PUSH SP} may be used to distinguish an 8086 from -later processors: on an 8086, the value of \c{SP} stored is the -value it has \e{after} the push instruction, whereas on later -processors it is the value \e{before} the push instruction. - - -\S{insPUSHA} \i\c{PUSHAx}: Push All General-Purpose Registers - -\c PUSHA ; 60 [186] -\c PUSHAD ; o32 60 [386] -\c PUSHAW ; o16 60 [186] - -\c{PUSHAW} pushes, in succession, \c{AX}, \c{CX}, \c{DX}, \c{BX}, -\c{SP}, \c{BP}, \c{SI} and \c{DI} on the stack, decrementing the -stack pointer by a total of 16. - -\c{PUSHAD} pushes, in succession, \c{EAX}, \c{ECX}, \c{EDX}, -\c{EBX}, \c{ESP}, \c{EBP}, \c{ESI} and \c{EDI} on the stack, -decrementing the stack pointer by a total of 32. - -In both cases, the value of \c{SP} or \c{ESP} pushed is its -\e{original} value, as it had before the instruction was executed. - -\c{PUSHA} is an alias mnemonic for either \c{PUSHAW} or \c{PUSHAD}, -depending on the current \c{BITS} setting. - -Note that the registers are pushed in order of their numeric values -in opcodes (see \k{iref-rv}). - -See also \c{POPA} (\k{insPOPA}). - - -\S{insPUSHF} \i\c{PUSHFx}: Push Flags Register - -\c PUSHF ; 9C [8086] -\c PUSHFD ; o32 9C [386] -\c PUSHFW ; o16 9C [8086] - -\b \c{PUSHFW} pushes the bottom 16 bits of the flags register -(or the whole flags register, on processors below a 386) onto -the stack. - -\b \c{PUSHFD} pushes the entire flags register onto the stack. - -\c{PUSHF} is an alias mnemonic for either \c{PUSHFW} or \c{PUSHFD}, -depending on the current \c{BITS} setting. 
- -See also \c{POPF} (\k{insPOPF}). - - -\S{insPXOR} \i\c{PXOR}: MMX Bitwise XOR - -\c PXOR mm1,mm2/m64 ; 0F EF /r [PENT,MMX] -\c PXOR xmm1,xmm2/m128 ; 66 0F EF /r [WILLAMETTE,SSE2] - -\c{PXOR} performs a bitwise XOR operation between its two operands -(i.e. each bit of the result is 1 if and only if exactly one of the -corresponding bits of the two inputs was 1), and stores the result -in the destination (first) operand. - - -\S{insRCL} \i\c{RCL}, \i\c{RCR}: Bitwise Rotate through Carry Bit - -\c RCL r/m8,1 ; D0 /2 [8086] -\c RCL r/m8,CL ; D2 /2 [8086] -\c RCL r/m8,imm8 ; C0 /2 ib [186] -\c RCL r/m16,1 ; o16 D1 /2 [8086] -\c RCL r/m16,CL ; o16 D3 /2 [8086] -\c RCL r/m16,imm8 ; o16 C1 /2 ib [186] -\c RCL r/m32,1 ; o32 D1 /2 [386] -\c RCL r/m32,CL ; o32 D3 /2 [386] -\c RCL r/m32,imm8 ; o32 C1 /2 ib [386] - -\c RCR r/m8,1 ; D0 /3 [8086] -\c RCR r/m8,CL ; D2 /3 [8086] -\c RCR r/m8,imm8 ; C0 /3 ib [186] -\c RCR r/m16,1 ; o16 D1 /3 [8086] -\c RCR r/m16,CL ; o16 D3 /3 [8086] -\c RCR r/m16,imm8 ; o16 C1 /3 ib [186] -\c RCR r/m32,1 ; o32 D1 /3 [386] -\c RCR r/m32,CL ; o32 D3 /3 [386] -\c RCR r/m32,imm8 ; o32 C1 /3 ib [386] - -\c{RCL} and \c{RCR} perform a 9-bit, 17-bit or 33-bit bitwise -rotation operation, involving the given source/destination (first) -operand and the carry bit. Thus, for example, in the operation -\c{RCL AL,1}, a 9-bit rotation is performed in which \c{AL} is -shifted left by 1, the top bit of \c{AL} moves into the carry flag, -and the original value of the carry flag is placed in the low bit of -\c{AL}. - -The number of bits to rotate by is given by the second operand. Only -the bottom five bits of the rotation count are considered by -processors above the 8086. - -You can force the longer (186 and upwards, beginning with a \c{C1} -byte) form of \c{RCL foo,1} by using a \c{BYTE} prefix: \c{RCL -foo,BYTE 1}. Similarly with \c{RCR}.
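The rotate-through-carry behaviour described for \c{RCL} can be sketched as a Python model. The helper name \c{rcl} is hypothetical; the operand and carry flag are modelled as integers, and the count is masked to five bits as on processors above the 8086. Reducing the masked count modulo the rotation width (9, 17 or 33 bits) is an assumption of this sketch, chosen because a full turn of the combined value leaves it unchanged.

```python
def rcl(value: int, carry: int, count: int, width: int = 8):
    """Illustrative model of RCL (hypothetical helper).

    Rotates the (width + 1)-bit quantity carry:value left by count
    and returns the pair (new_value, new_carry).
    """
    total = width + 1                    # 9, 17 or 33 bits including carry
    count &= 0x1F                        # only the low five bits are used
    count %= total                       # assumption: full turns are no-ops
    combined = ((carry & 1) << width) | (value & ((1 << width) - 1))
    mask = (1 << total) - 1
    combined = ((combined << count) | (combined >> (total - count))) & mask
    return combined & ((1 << width) - 1), combined >> width


if __name__ == "__main__":
    # RCL AL,1 with AL = 0x80 and CF = 0: the top bit moves into the carry.
    print(rcl(0x80, 0, 1))
```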
- - -\S{insRCPPS} \i\c{RCPPS}: Packed Single-Precision FP Reciprocal - -\c RCPPS xmm1,xmm2/m128 ; 0F 53 /r [KATMAI,SSE] - -\c{RCPPS} returns an approximation of the reciprocal of the packed -single-precision FP values from xmm2/m128. The maximum error for this -approximation is: |Error| <= 1.5 x 2^-12 - - -\S{insRCPSS} \i\c{RCPSS}: Scalar Single-Precision FP Reciprocal - -\c RCPSS xmm1,xmm2/m128 ; F3 0F 53 /r [KATMAI,SSE] - -\c{RCPSS} returns an approximation of the reciprocal of the lower -single-precision FP value from xmm2/m32; the upper three fields are -passed through from xmm1. The maximum error for this approximation is: -|Error| <= 1.5 x 2^-12 - - -\S{insRDMSR} \i\c{RDMSR}: Read Model-Specific Registers - -\c RDMSR ; 0F 32 [PENT,PRIV] - -\c{RDMSR} reads the processor Model-Specific Register (MSR) whose -index is stored in \c{ECX}, and stores the result in \c{EDX:EAX}. -See also \c{WRMSR} (\k{insWRMSR}). - - -\S{insRDPMC} \i\c{RDPMC}: Read Performance-Monitoring Counters - -\c RDPMC ; 0F 33 [P6] - -\c{RDPMC} reads the processor performance-monitoring counter whose -index is stored in \c{ECX}, and stores the result in \c{EDX:EAX}. - -This instruction is available on P6 and later processors and on MMX -class processors. - - -\S{insRDSHR} \i\c{RDSHR}: Read SMM Header Pointer Register - -\c RDSHR r/m32 ; 0F 36 /0 [386,CYRIX,SMM] - -\c{RDSHR} reads the contents of the SMM header pointer register and -saves it to the destination operand, which can be either a 32 bit -memory location or a 32 bit register. - -See also \c{WRSHR} (\k{insWRSHR}). - - -\S{insRDTSC} \i\c{RDTSC}: Read Time-Stamp Counter - -\c RDTSC ; 0F 31 [PENT] - -\c{RDTSC} reads the processor's time-stamp counter into \c{EDX:EAX}. 
- - -\S{insRET} \i\c{RET}, \i\c{RETF}, \i\c{RETN}: Return from Procedure Call - -\c RET ; C3 [8086] -\c RET imm16 ; C2 iw [8086] - -\c RETF ; CB [8086] -\c RETF imm16 ; CA iw [8086] - -\c RETN ; C3 [8086] -\c RETN imm16 ; C2 iw [8086] - -\b \c{RET}, and its exact synonym \c{RETN}, pop \c{IP} or \c{EIP} from -the stack and transfer control to the new address. Optionally, if a -numeric second operand is provided, they increment the stack pointer -by a further \c{imm16} bytes after popping the return address. - -\b \c{RETF} executes a far return: after popping \c{IP}/\c{EIP}, it -then pops \c{CS}, and \e{then} increments the stack pointer by the -optional argument if present. - - -\S{insROL} \i\c{ROL}, \i\c{ROR}: Bitwise Rotate - -\c ROL r/m8,1 ; D0 /0 [8086] -\c ROL r/m8,CL ; D2 /0 [8086] -\c ROL r/m8,imm8 ; C0 /0 ib [186] -\c ROL r/m16,1 ; o16 D1 /0 [8086] -\c ROL r/m16,CL ; o16 D3 /0 [8086] -\c ROL r/m16,imm8 ; o16 C1 /0 ib [186] -\c ROL r/m32,1 ; o32 D1 /0 [386] -\c ROL r/m32,CL ; o32 D3 /0 [386] -\c ROL r/m32,imm8 ; o32 C1 /0 ib [386] - -\c ROR r/m8,1 ; D0 /1 [8086] -\c ROR r/m8,CL ; D2 /1 [8086] -\c ROR r/m8,imm8 ; C0 /1 ib [186] -\c ROR r/m16,1 ; o16 D1 /1 [8086] -\c ROR r/m16,CL ; o16 D3 /1 [8086] -\c ROR r/m16,imm8 ; o16 C1 /1 ib [186] -\c ROR r/m32,1 ; o32 D1 /1 [386] -\c ROR r/m32,CL ; o32 D3 /1 [386] -\c ROR r/m32,imm8 ; o32 C1 /1 ib [386] - -\c{ROL} and \c{ROR} perform a bitwise rotation operation on the given -source/destination (first) operand. Thus, for example, in the -operation \c{ROL AL,1}, an 8-bit rotation is performed in which -\c{AL} is shifted left by 1 and the original top bit of \c{AL} moves -round into the low bit. - -The number of bits to rotate by is given by the second operand. Only -the bottom five bits of the rotation count are considered by processors -above the 8086. - -You can force the longer (286 and upwards, beginning with a \c{C1} -byte) form of \c{ROL foo,1} by using a \c{BYTE} prefix: \c{ROL -foo,BYTE 1}. 
Similarly with \c{ROR}. - - -\S{insRSDC} \i\c{RSDC}: Restore Segment Register and Descriptor - -\c RSDC segreg,m80 ; 0F 79 /r [486,CYRIX,SMM] - -\c{RSDC} restores a segment register (DS, ES, FS, GS, or SS) from mem80, -and sets up its descriptor. - - -\S{insRSLDT} \i\c{RSLDT}: Restore LDTR and Descriptor - -\c RSLDT m80 ; 0F 7B /0 [486,CYRIX,SMM] - -\c{RSLDT} restores the Local Descriptor Table Register (LDTR) from mem80. - - -\S{insRSM} \i\c{RSM}: Resume from System-Management Mode - -\c RSM ; 0F AA [PENT] - -\c{RSM} returns the processor to its normal operating mode when it -was in System-Management Mode. - - -\S{insRSQRTPS} \i\c{RSQRTPS}: Packed Single-Precision FP Square Root Reciprocal - -\c RSQRTPS xmm1,xmm2/m128 ; 0F 52 /r [KATMAI,SSE] - -\c{RSQRTPS} computes the approximate reciprocals of the square -roots of the packed single-precision floating-point values in the -source and stores the results in xmm1. The maximum error for this -approximation is: |Error| <= 1.5 x 2^-12 - - -\S{insRSQRTSS} \i\c{RSQRTSS}: Scalar Single-Precision FP Square Root Reciprocal - -\c RSQRTSS xmm1,xmm2/m128 ; F3 0F 52 /r [KATMAI,SSE] - -\c{RSQRTSS} returns an approximation of the reciprocal of the -square root of the lowest order single-precision FP value from -the source, and stores it in the low doubleword of the destination -register. The upper three fields of xmm1 are preserved. The maximum -error for this approximation is: |Error| <= 1.5 x 2^-12 - - -\S{insRSTS} \i\c{RSTS}: Restore TSR and Descriptor - -\c RSTS m80 ; 0F 7D /0 [486,CYRIX,SMM] - -\c{RSTS} restores the Task State Register (TSR) from mem80. - - -\S{insSAHF} \i\c{SAHF}: Store AH to Flags - -\c SAHF ; 9E [8086] - -\c{SAHF} sets the low byte of the flags word according to the -contents of the \c{AH} register. - -The operation of \c{SAHF} is: - -\c AH --> SF:ZF:0:AF:0:PF:1:CF - -See also \c{LAHF} (\k{insLAHF}).
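The \c{SAHF} bit layout shown above can be modelled with a mask. This is a hedged sketch in Python rather than NASM input; the name `sahf` and the constants `SAHF_MASK` and `FIXED_ONE` are made up for the example. It assumes the layout given in the text: bits 7, 6, 4, 2 and 0 of \c{AH} supply SF, ZF, AF, PF and CF, bits 5 and 3 read as 0, and bit 1 always reads as 1.

```python
SAHF_MASK = 0xD5   # bits 7 (SF), 6 (ZF), 4 (AF), 2 (PF), 0 (CF)
FIXED_ONE = 0x02   # bit 1 of the flags word always reads as 1


def sahf(flags, ah):
    """Return the flags word after a modelled SAHF.

    Only the low byte of `flags` changes; higher bits pass through.
    """
    low = (ah & SAHF_MASK) | FIXED_ONE
    return (flags & ~0xFF) | low
```

Under these assumptions, loading \c{AH} with \c{0xFF} yields a low flags byte of \c{0xD7}, since the two reserved-zero positions are dropped.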
- - -\S{insSAL} \i\c{SAL}, \i\c{SAR}: Bitwise Arithmetic Shifts - -\c SAL r/m8,1 ; D0 /4 [8086] -\c SAL r/m8,CL ; D2 /4 [8086] -\c SAL r/m8,imm8 ; C0 /4 ib [186] -\c SAL r/m16,1 ; o16 D1 /4 [8086] -\c SAL r/m16,CL ; o16 D3 /4 [8086] -\c SAL r/m16,imm8 ; o16 C1 /4 ib [186] -\c SAL r/m32,1 ; o32 D1 /4 [386] -\c SAL r/m32,CL ; o32 D3 /4 [386] -\c SAL r/m32,imm8 ; o32 C1 /4 ib [386] - -\c SAR r/m8,1 ; D0 /7 [8086] -\c SAR r/m8,CL ; D2 /7 [8086] -\c SAR r/m8,imm8 ; C0 /7 ib [186] -\c SAR r/m16,1 ; o16 D1 /7 [8086] -\c SAR r/m16,CL ; o16 D3 /7 [8086] -\c SAR r/m16,imm8 ; o16 C1 /7 ib [186] -\c SAR r/m32,1 ; o32 D1 /7 [386] -\c SAR r/m32,CL ; o32 D3 /7 [386] -\c SAR r/m32,imm8 ; o32 C1 /7 ib [386] - -\c{SAL} and \c{SAR} perform an arithmetic shift operation on the given -source/destination (first) operand. The vacated bits are filled with -zero for \c{SAL}, and with copies of the original high bit of the -source operand for \c{SAR}. - -\c{SAL} is a synonym for \c{SHL} (see \k{insSHL}). NASM will -assemble either one to the same code, but NDISASM will always -disassemble that code as \c{SHL}. - -The number of bits to shift by is given by the second operand. Only -the bottom five bits of the shift count are considered by processors -above the 8086. - -You can force the longer (286 and upwards, beginning with a \c{C1} -byte) form of \c{SAL foo,1} by using a \c{BYTE} prefix: \c{SAL -foo,BYTE 1}. Similarly with \c{SAR}. - - -\S{insSALC} \i\c{SALC}: Set AL from Carry Flag - -\c SALC ; D6 [8086,UNDOC] - -\c{SALC} is an early undocumented instruction similar in concept to -\c{SETcc} (\k{insSETcc}). Its function is to set \c{AL} to zero if -the carry flag is clear, or to \c{0xFF} if it is set. 
- - -\S{insSBB} \i\c{SBB}: Subtract with Borrow - -\c SBB r/m8,reg8 ; 18 /r [8086] -\c SBB r/m16,reg16 ; o16 19 /r [8086] -\c SBB r/m32,reg32 ; o32 19 /r [386] - -\c SBB reg8,r/m8 ; 1A /r [8086] -\c SBB reg16,r/m16 ; o16 1B /r [8086] -\c SBB reg32,r/m32 ; o32 1B /r [386] - -\c SBB r/m8,imm8 ; 80 /3 ib [8086] -\c SBB r/m16,imm16 ; o16 81 /3 iw [8086] -\c SBB r/m32,imm32 ; o32 81 /3 id [386] - -\c SBB r/m16,imm8 ; o16 83 /3 ib [8086] -\c SBB r/m32,imm8 ; o32 83 /3 ib [386] - -\c SBB AL,imm8 ; 1C ib [8086] -\c SBB AX,imm16 ; o16 1D iw [8086] -\c SBB EAX,imm32 ; o32 1D id [386] - -\c{SBB} performs integer subtraction: it subtracts its second -operand, plus the value of the carry flag, from its first, and -leaves the result in its destination (first) operand. The flags are -set according to the result of the operation: in particular, the -carry flag is affected and can be used by a subsequent \c{SBB} -instruction. - -In the forms with an 8-bit immediate second operand and a longer -first operand, the second operand is considered to be signed, and is -sign-extended to the length of the first operand. In these cases, -the \c{BYTE} qualifier is necessary to force NASM to generate this -form of the instruction. - -To subtract one number from another without also subtracting the -contents of the carry flag, use \c{SUB} (\k{insSUB}). - - -\S{insSCASB} \i\c{SCASB}, \i\c{SCASW}, \i\c{SCASD}: Scan String - -\c SCASB ; AE [8086] -\c SCASW ; o16 AF [8086] -\c SCASD ; o32 AF [386] - -\c{SCASB} compares the byte in \c{AL} with the byte at \c{[ES:DI]} -or \c{[ES:EDI]}, and sets the flags accordingly. It then increments -or decrements (depending on the direction flag: increments if the -flag is clear, decrements if it is set) \c{DI} (or \c{EDI}). - -The register used is \c{DI} if the address size is 16 bits, and -\c{EDI} if it is 32 bits. If you need to use an address size not -equal to the current \c{BITS} setting, you can use an explicit -\i\c{a16} or \i\c{a32} prefix. 
- -Segment override prefixes have no effect for this instruction: the -use of \c{ES} for the load from \c{[DI]} or \c{[EDI]} cannot be -overridden. - -\c{SCASW} and \c{SCASD} work in the same way, but they compare a -word to \c{AX} or a doubleword to \c{EAX} instead of a byte to -\c{AL}, and increment or decrement the addressing registers by 2 or -4 instead of 1. - -The \c{REPE} and \c{REPNE} prefixes (equivalently, \c{REPZ} and -\c{REPNZ}) may be used to repeat the instruction up to \c{CX} (or -\c{ECX} - again, the address size chooses which) times until the -first unequal or equal byte is found. - - -\S{insSETcc} \i\c{SETcc}: Set Register from Condition - -\c SETcc r/m8 ; 0F 90+cc /2 [386] - -\c{SETcc} sets the given 8-bit operand to zero if its condition is -not satisfied, and to 1 if it is. - - -\S{insSFENCE} \i\c{SFENCE}: Store Fence - -\c SFENCE ; 0F AE /7 [KATMAI] - -\c{SFENCE} performs a serialising operation on all writes to memory -that were issued before the \c{SFENCE} instruction. This guarantees that -all memory writes before the \c{SFENCE} instruction are visible before any -writes after the \c{SFENCE} instruction. - -\c{SFENCE} is ordered with respect to other \c{SFENCE} instructions, \c{MFENCE}, -any memory write and any other serialising instruction (such as \c{CPUID}). - -Weakly ordered memory types can be used to achieve higher processor -performance through such techniques as out-of-order issue, -write-combining, and write-collapsing. The degree to which a consumer -of data recognizes or knows that the data is weakly ordered varies -among applications and may be unknown to the producer of this data. -The \c{SFENCE} instruction provides a performance-efficient way of -ensuring store ordering between routines that produce weakly-ordered -results and routines that consume this data.
- -\c{SFENCE} uses the following ModRM encoding: - -\c Mod (7:6) = 11B -\c Reg/Opcode (5:3) = 111B -\c R/M (2:0) = 000B - -All other ModRM encodings are defined to be reserved, and use -of these encodings risks incompatibility with future processors. - -See also \c{LFENCE} (\k{insLFENCE}) and \c{MFENCE} (\k{insMFENCE}). - - -\S{insSGDT} \i\c{SGDT}, \i\c{SIDT}, \i\c{SLDT}: Store Descriptor Table Pointers - -\c SGDT mem ; 0F 01 /0 [286,PRIV] -\c SIDT mem ; 0F 01 /1 [286,PRIV] -\c SLDT r/m16 ; 0F 00 /0 [286,PRIV] - -\c{SGDT} and \c{SIDT} both take a 6-byte memory area as an operand: -they store the contents of the GDTR (global descriptor table -register) or IDTR (interrupt descriptor table register) into that -area as a 32-bit linear address and a 16-bit size limit from that -area (in that order). These are the only instructions which directly -use \e{linear} addresses, rather than segment/offset pairs. - -\c{SLDT} stores the segment selector corresponding to the LDT (local -descriptor table) into the given operand. - -See also \c{LGDT}, \c{LIDT} and \c{LLDT} (\k{insLGDT}). - - -\S{insSHL} \i\c{SHL}, \i\c{SHR}: Bitwise Logical Shifts - -\c SHL r/m8,1 ; D0 /4 [8086] -\c SHL r/m8,CL ; D2 /4 [8086] -\c SHL r/m8,imm8 ; C0 /4 ib [186] -\c SHL r/m16,1 ; o16 D1 /4 [8086] -\c SHL r/m16,CL ; o16 D3 /4 [8086] -\c SHL r/m16,imm8 ; o16 C1 /4 ib [186] -\c SHL r/m32,1 ; o32 D1 /4 [386] -\c SHL r/m32,CL ; o32 D3 /4 [386] -\c SHL r/m32,imm8 ; o32 C1 /4 ib [386] - -\c SHR r/m8,1 ; D0 /5 [8086] -\c SHR r/m8,CL ; D2 /5 [8086] -\c SHR r/m8,imm8 ; C0 /5 ib [186] -\c SHR r/m16,1 ; o16 D1 /5 [8086] -\c SHR r/m16,CL ; o16 D3 /5 [8086] -\c SHR r/m16,imm8 ; o16 C1 /5 ib [186] -\c SHR r/m32,1 ; o32 D1 /5 [386] -\c SHR r/m32,CL ; o32 D3 /5 [386] -\c SHR r/m32,imm8 ; o32 C1 /5 ib [386] - -\c{SHL} and \c{SHR} perform a logical shift operation on the given -source/destination (first) operand. The vacated bits are filled with -zero. - -A synonym for \c{SHL} is \c{SAL} (see \k{insSAL}). 
NASM will -assemble either one to the same code, but NDISASM will always -disassemble that code as \c{SHL}. - -The number of bits to shift by is given by the second operand. Only -the bottom five bits of the shift count are considered by processors -above the 8086. - -You can force the longer (286 and upwards, beginning with a \c{C1} -byte) form of \c{SHL foo,1} by using a \c{BYTE} prefix: \c{SHL -foo,BYTE 1}. Similarly with \c{SHR}. - - -\S{insSHLD} \i\c{SHLD}, \i\c{SHRD}: Bitwise Double-Precision Shifts - -\c SHLD r/m16,reg16,imm8 ; o16 0F A4 /r ib [386] -\c SHLD r/m32,reg32,imm8 ; o32 0F A4 /r ib [386] -\c SHLD r/m16,reg16,CL ; o16 0F A5 /r [386] -\c SHLD r/m32,reg32,CL ; o32 0F A5 /r [386] - -\c SHRD r/m16,reg16,imm8 ; o16 0F AC /r ib [386] -\c SHRD r/m32,reg32,imm8 ; o32 0F AC /r ib [386] -\c SHRD r/m16,reg16,CL ; o16 0F AD /r [386] -\c SHRD r/m32,reg32,CL ; o32 0F AD /r [386] - -\b \c{SHLD} performs a double-precision left shift. It notionally -places its second operand to the right of its first, then shifts -the entire bit string thus generated to the left by a number of -bits specified in the third operand. It then updates only the -\e{first} operand according to the result of this. The second -operand is not modified. - -\b \c{SHRD} performs the corresponding right shift: it notionally -places the second operand to the \e{left} of the first, shifts the -whole bit string right, and updates only the first operand. - -For example, if \c{EAX} holds \c{0x01234567} and \c{EBX} holds -\c{0x89ABCDEF}, then the instruction \c{SHLD EAX,EBX,4} would update -\c{EAX} to hold \c{0x12345678}. Under the same conditions, \c{SHRD -EAX,EBX,4} would update \c{EAX} to hold \c{0xF0123456}. - -The number of bits to shift by is given by the third operand. Only -the bottom five bits of the shift count are considered.
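The worked \c{SHLD}/\c{SHRD} example above can be checked with a short model. This is a sketch in Python, not NASM input; the function names `shld` and `shrd` and the `width` parameter are assumptions for illustration. It builds the notional double-width bit string described in the text and keeps only the part that ends up in the first operand. Counts larger than the operand width are rejected here, because the model makes no claim about what a processor does in that case.

```python
def shld(dst, src, count, width=32):
    """Model SHLD: `src` sits to the right of `dst`, string shifts left."""
    mask = (1 << width) - 1
    count &= 0x1F                          # bottom five bits only
    if count > width:
        raise ValueError("count exceeds operand width; not modelled")
    wide = ((dst & mask) << width) | (src & mask)
    return (wide >> (width - count)) & mask   # upper half after shifting


def shrd(dst, src, count, width=32):
    """Model SHRD: `src` sits to the left of `dst`, string shifts right."""
    mask = (1 << width) - 1
    count &= 0x1F
    if count > width:
        raise ValueError("count exceeds operand width; not modelled")
    wide = ((src & mask) << width) | (dst & mask)
    return (wide >> count) & mask             # lower half after shifting


eax, ebx = 0x01234567, 0x89ABCDEF
print(hex(shld(eax, ebx, 4)))
print(hex(shrd(eax, ebx, 4)))
```

Only the returned first-operand value is modelled; the second operand is left untouched, matching the description above, and flag effects are outside the scope of the sketch.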
- - -\S{insSHUFPD} \i\c{SHUFPD}: Shuffle Packed Double-Precision FP Values - -\c SHUFPD xmm1,xmm2/m128,imm8 ; 66 0F C6 /r ib [WILLAMETTE,SSE2] - -\c{SHUFPD} moves one of the packed double-precision FP values from -the destination operand into the low quadword of the destination -operand; the upper quadword is generated by moving one of the -double-precision FP values from the source operand into the -destination. The select (third) operand selects which of the values -are moved to the destination register. - -The select operand is an 8-bit immediate: bit 0 selects which value -is moved from the destination operand to the result (where 0 selects -the low quadword and 1 selects the high quadword) and bit 1 selects -which value is moved from the source operand to the result. -Bits 2 through 7 of the shuffle operand are reserved. - - -\S{insSHUFPS} \i\c{SHUFPS}: Shuffle Packed Single-Precision FP Values - -\c SHUFPS xmm1,xmm2/m128,imm8 ; 0F C6 /r ib [KATMAI,SSE] - -\c{SHUFPS} moves two of the packed single-precision FP values from -the destination operand into the low quadword of the destination -operand; the upper quadword is generated by moving two of the -single-precision FP values from the source operand into the -destination. The select (third) operand selects which of the -values are moved to the destination register. - -The select operand is an 8-bit immediate: bits 0 and 1 select the -value to be moved from the destination operand to the low doubleword of -the result, bits 2 and 3 select the value to be moved from the -destination operand to the second doubleword of the result, bits 4 and -5 select the value to be moved from the source operand to the third -doubleword of the result, and bits 6 and 7 select the value to be -moved from the source operand to the high doubleword of the result. - - -\S{insSMI} \i\c{SMI}: System Management Interrupt - -\c SMI ; F1 [386,UNDOC] - -\c{SMI} puts some AMD processors into SMM mode.
It is available on some -386 and 486 processors, and is only available when DR7 bit 12 is set, -otherwise it generates an Int 1. - - -\S{insSMINT} \i\c{SMINT}, \i\c{SMINTOLD}: Software SMM Entry (CYRIX) - -\c SMINT ; 0F 38 [PENT,CYRIX] -\c SMINTOLD ; 0F 7E [486,CYRIX] - -\c{SMINT} puts the processor into SMM mode. The CPU state information is -saved in the SMM memory header, and then execution begins at the SMM base -address. - -\c{SMINTOLD} is the same as \c{SMINT}, but was the opcode used on the 486. - -This pair of opcodes are specific to the Cyrix and compatible range of -processors (Cyrix, IBM, Via). - - -\S{insSMSW} \i\c{SMSW}: Store Machine Status Word - -\c SMSW r/m16 ; 0F 01 /4 [286,PRIV] - -\c{SMSW} stores the bottom half of the \c{CR0} control register (or -the Machine Status Word, on 286 processors) into the destination -operand. See also \c{LMSW} (\k{insLMSW}). - -For 32-bit code, this would store all of \c{CR0} in the specified -register (or the bottom 16 bits if the destination is a memory location), - without needing an operand size override byte. - - -\S{insSQRTPD} \i\c{SQRTPD}: Packed Double-Precision FP Square Root - -\c SQRTPD xmm1,xmm2/m128 ; 66 0F 51 /r [WILLAMETTE,SSE2] - -\c{SQRTPD} calculates the square root of the packed double-precision -FP value from the source operand, and stores the double-precision -results in the destination register. - - -\S{insSQRTPS} \i\c{SQRTPS}: Packed Single-Precision FP Square Root - -\c SQRTPS xmm1,xmm2/m128 ; 0F 51 /r [KATMAI,SSE] - -\c{SQRTPS} calculates the square root of the packed single-precision -FP value from the source operand, and stores the single-precision -results in the destination register. 
- - -\S{insSQRTSD} \i\c{SQRTSD}: Scalar Double-Precision FP Square Root - -\c SQRTSD xmm1,xmm2/m128 ; F2 0F 51 /r [WILLAMETTE,SSE2] - -\c{SQRTSD} calculates the square root of the low-order double-precision -FP value from the source operand, and stores the double-precision -result in the destination register. The high quadword remains unchanged. - - -\S{insSQRTSS} \i\c{SQRTSS}: Scalar Single-Precision FP Square Root - -\c SQRTSS xmm1,xmm2/m128 ; F3 0F 51 /r [KATMAI,SSE] - -\c{SQRTSS} calculates the square root of the low-order single-precision -FP value from the source operand, and stores the single-precision -result in the destination register. The three high doublewords remain -unchanged. - - -\S{insSTC} \i\c{STC}, \i\c{STD}, \i\c{STI}: Set Flags - -\c STC ; F9 [8086] -\c STD ; FD [8086] -\c STI ; FB [8086] - -These instructions set various flags. \c{STC} sets the carry flag; -\c{STD} sets the direction flag; and \c{STI} sets the interrupt flag -(thus enabling interrupts). - -To clear the carry, direction, or interrupt flags, use the \c{CLC}, -\c{CLD} and \c{CLI} instructions (\k{insCLC}). To invert the carry -flag, use \c{CMC} (\k{insCMC}). - - -\S{insSTMXCSR} \i\c{STMXCSR}: Store Streaming SIMD Extension - Control/Status - -\c STMXCSR m32 ; 0F AE /3 [KATMAI,SSE] - -\c{STMXCSR} stores the contents of the \c{MXCSR} control/status -register to the specified memory location. \c{MXCSR} is used to -enable masked/unmasked exception handling, to set rounding modes, -to set flush-to-zero mode, and to view exception status flags. -The reserved bits in the \c{MXCSR} register are stored as 0s. - -For details of the \c{MXCSR} register, see the Intel processor docs. - -See also \c{LDMXCSR} (\k{insLDMXCSR}). - - -\S{insSTOSB} \i\c{STOSB}, \i\c{STOSW}, \i\c{STOSD}: Store Byte to String - -\c STOSB ; AA [8086] -\c STOSW ; o16 AB [8086] -\c STOSD ; o32 AB [386] - -\c{STOSB} stores the byte in \c{AL} at \c{[ES:DI]} or \c{[ES:EDI]}, -without affecting the flags.
It then increments or decrements -(depending on the direction flag: increments if the flag is clear, -decrements if it is set) \c{DI} (or \c{EDI}). - -The register used is \c{DI} if the address size is 16 bits, and -\c{EDI} if it is 32 bits. If you need to use an address size not -equal to the current \c{BITS} setting, you can use an explicit -\i\c{a16} or \i\c{a32} prefix. - -Segment override prefixes have no effect for this instruction: the -use of \c{ES} for the store to \c{[DI]} or \c{[EDI]} cannot be -overridden. - -\c{STOSW} and \c{STOSD} work in the same way, but they store the -word in \c{AX} or the doubleword in \c{EAX} instead of the byte in -\c{AL}, and increment or decrement the addressing registers by 2 or -4 instead of 1. - -The \c{REP} prefix may be used to repeat the instruction \c{CX} (or -\c{ECX} - again, the address size chooses which) times. - - -\S{insSTR} \i\c{STR}: Store Task Register - -\c STR r/m16 ; 0F 00 /1 [286,PRIV] - -\c{STR} stores the segment selector corresponding to the contents of -the Task Register into its operand. When the operand size is 32 bit and -the destination is a register, the upper 16-bits are cleared to 0s. -When the destination operand is a memory location, 16 bits are -written regardless of the operand size. 
- - -\S{insSUB} \i\c{SUB}: Subtract Integers - -\c SUB r/m8,reg8 ; 28 /r [8086] -\c SUB r/m16,reg16 ; o16 29 /r [8086] -\c SUB r/m32,reg32 ; o32 29 /r [386] - -\c SUB reg8,r/m8 ; 2A /r [8086] -\c SUB reg16,r/m16 ; o16 2B /r [8086] -\c SUB reg32,r/m32 ; o32 2B /r [386] - -\c SUB r/m8,imm8 ; 80 /5 ib [8086] -\c SUB r/m16,imm16 ; o16 81 /5 iw [8086] -\c SUB r/m32,imm32 ; o32 81 /5 id [386] - -\c SUB r/m16,imm8 ; o16 83 /5 ib [8086] -\c SUB r/m32,imm8 ; o32 83 /5 ib [386] - -\c SUB AL,imm8 ; 2C ib [8086] -\c SUB AX,imm16 ; o16 2D iw [8086] -\c SUB EAX,imm32 ; o32 2D id [386] - -\c{SUB} performs integer subtraction: it subtracts its second -operand from its first, and leaves the result in its destination -(first) operand. The flags are set according to the result of the -operation: in particular, the carry flag is affected and can be used -by a subsequent \c{SBB} instruction (\k{insSBB}). - -In the forms with an 8-bit immediate second operand and a longer -first operand, the second operand is considered to be signed, and is -sign-extended to the length of the first operand. In these cases, -the \c{BYTE} qualifier is necessary to force NASM to generate this -form of the instruction. - - -\S{insSUBPD} \i\c{SUBPD}: Packed Double-Precision FP Subtract - -\c SUBPD xmm1,xmm2/m128 ; 66 0F 5C /r [WILLAMETTE,SSE2] - -\c{SUBPD} subtracts the packed double-precision FP values of -the source operand from those of the destination operand, and -stores the result in the destination operand. - - -\S{insSUBPS} \i\c{SUBPS}: Packed Single-Precision FP Subtract - -\c SUBPS xmm1,xmm2/m128 ; 0F 5C /r [KATMAI,SSE] - -\c{SUBPS} subtracts the packed single-precision FP values of -the source operand from those of the destination operand, and -stores the result in the destination operand.
- - -\S{insSUBSD} \i\c{SUBSD}: Scalar Double-Precision FP Subtract - -\c SUBSD xmm1,xmm2/m128 ; F2 0F 5C /r [WILLAMETTE,SSE2] - -\c{SUBSD} subtracts the low-order double-precision FP value of -the source operand from that of the destination operand, and -stores the result in the destination operand. The high -quadword is unchanged. - - -\S{insSUBSS} \i\c{SUBSS}: Scalar Single-Precision FP Subtract - -\c SUBSS xmm1,xmm2/m128 ; F3 0F 5C /r [KATMAI,SSE] - -\c{SUBSS} subtracts the low-order single-precision FP value of -the source operand from that of the destination operand, and -stores the result in the destination operand. The three high -doublewords are unchanged. - - -\S{insSVDC} \i\c{SVDC}: Save Segment Register and Descriptor - -\c SVDC m80,segreg ; 0F 78 /r [486,CYRIX,SMM] - -\c{SVDC} saves a segment register (DS, ES, FS, GS, or SS) and its -descriptor to mem80. - - -\S{insSVLDT} \i\c{SVLDT}: Save LDTR and Descriptor - -\c SVLDT m80 ; 0F 7A /0 [486,CYRIX,SMM] - -\c{SVLDT} saves the Local Descriptor Table Register (LDTR) to mem80. - - -\S{insSVTS} \i\c{SVTS}: Save TSR and Descriptor - -\c SVTS m80 ; 0F 7C /0 [486,CYRIX,SMM] - -\c{SVTS} saves the Task State Register (TSR) to mem80. - - -\S{insSYSCALL} \i\c{SYSCALL}: Call Operating System - -\c SYSCALL ; 0F 05 [P6,AMD] - -\c{SYSCALL} provides a fast method of transferring control to a fixed -entry point in an operating system. - -\b The \c{EIP} register is copied into the \c{ECX} register. - -\b Bits [31-0] of the 64-bit SYSCALL/SYSRET Target Address Register -(\c{STAR}) are copied into the \c{EIP} register. - -\b Bits [47-32] of the \c{STAR} register specify the selector that is -copied into the \c{CS} register. - -\b Bits [47-32]+1000b of the \c{STAR} register specify the selector that -is copied into the \c{SS} register. - -The \c{CS} and \c{SS} registers should not be modified by the operating -system between the execution of the \c{SYSCALL} instruction and its -corresponding \c{SYSRET} instruction.
- -For more information, see the \c{SYSCALL and SYSRET Instruction Specification} -(AMD document number 21086.pdf). - - -\S{insSYSENTER} \i\c{SYSENTER}: Fast System Call - -\c SYSENTER ; 0F 34 [P6] - -\c{SYSENTER} executes a fast call to a level 0 system procedure or -routine. Before using this instruction, various MSRs need to be set -up: - -\b \c{SYSENTER_CS_MSR} contains the 32-bit segment selector for the -privilege level 0 code segment. (This value is also used to compute -the segment selector of the privilege level 0 stack segment.) - -\b \c{SYSENTER_EIP_MSR} contains the 32-bit offset into the privilege -level 0 code segment to the first instruction of the selected operating -procedure or routine. - -\b \c{SYSENTER_ESP_MSR} contains the 32-bit stack pointer for the -privilege level 0 stack. - -\c{SYSENTER} performs the following sequence of operations: - -\b Loads the segment selector from the \c{SYSENTER_CS_MSR} into the -\c{CS} register. - -\b Loads the instruction pointer from the \c{SYSENTER_EIP_MSR} into -the \c{EIP} register. - -\b Adds 8 to the value in \c{SYSENTER_CS_MSR} and loads it into the -\c{SS} register. - -\b Loads the stack pointer from the \c{SYSENTER_ESP_MSR} into the -\c{ESP} register. - -\b Switches to privilege level 0. - -\b Clears the \c{VM} flag in the \c{EFLAGS} register, if the flag -is set. - -\b Begins executing the selected system procedure. - -In particular, note that this instruction does not save the values of -\c{CS} or \c{(E)IP}. If you need to return to the calling code, you -need to write your code to cater for this. - -For more information, see the Intel Architecture Software Developer's -Manual, Volume 2. - - -\S{insSYSEXIT} \i\c{SYSEXIT}: Fast Return From System Call - -\c SYSEXIT ; 0F 35 [P6,PRIV] - -\c{SYSEXIT} executes a fast return to privilege level 3 user code. -This instruction is a companion instruction to the \c{SYSENTER} -instruction, and can only be executed by privilege level 0 code.
-Various registers need to be set up before calling this instruction: - -\b \c{SYSENTER_CS_MSR} contains the 32-bit segment selector for the -privilege level 0 code segment in which the processor is currently -executing. (This value is used to compute the segment selectors for -the privilege level 3 code and stack segments.) - -\b \c{EDX} contains the 32-bit offset into the privilege level 3 code -segment to the first instruction to be executed in the user code. - -\b \c{ECX} contains the 32-bit stack pointer for the privilege level 3 -stack. - -\c{SYSEXIT} performs the following sequence of operations: - -\b Adds 16 to the value in \c{SYSENTER_CS_MSR} and loads the sum into -the \c{CS} selector register. - -\b Loads the instruction pointer from the \c{EDX} register into the -\c{EIP} register. - -\b Adds 24 to the value in \c{SYSENTER_CS_MSR} and loads the sum -into the \c{SS} selector register. - -\b Loads the stack pointer from the \c{ECX} register into the \c{ESP} -register. - -\b Switches to privilege level 3. - -\b Begins executing the user code at the \c{EIP} address. - -For more information on the use of the \c{SYSENTER} and \c{SYSEXIT} -instructions, see the Intel Architecture Software Developer's -Manual, Volume 2. - - -\S{insSYSRET} \i\c{SYSRET}: Return From Operating System - -\c SYSRET ; 0F 07 [P6,AMD,PRIV] - -\c{SYSRET} is the return instruction used in conjunction with the -\c{SYSCALL} instruction to provide fast entry/exit to an operating system. - -\b The \c{ECX} register, which points to the next sequential instruction -after the corresponding \c{SYSCALL} instruction, is copied into the \c{EIP} -register. - -\b Bits [63-48] of the \c{STAR} register specify the selector that is copied -into the \c{CS} register. - -\b Bits [63-48]+1000b of the \c{STAR} register specify the selector that is -copied into the \c{SS} register. 
- -\b Bits [1-0] of the \c{SS} register are set to 11b (RPL of 3) regardless of -the value of bits [49-48] of the \c{STAR} register. - -The \c{CS} and \c{SS} registers should not be modified by the operating -system between the execution of the \c{SYSCALL} instruction and its -corresponding \c{SYSRET} instruction. - -For more information, see the \c{SYSCALL and SYSRET Instruction Specification} -(AMD document number 21086.pdf). - - -\S{insTEST} \i\c{TEST}: Test Bits (notional bitwise AND) - -\c TEST r/m8,reg8 ; 84 /r [8086] -\c TEST r/m16,reg16 ; o16 85 /r [8086] -\c TEST r/m32,reg32 ; o32 85 /r [386] - -\c TEST r/m8,imm8 ; F6 /0 ib [8086] -\c TEST r/m16,imm16 ; o16 F7 /0 iw [8086] -\c TEST r/m32,imm32 ; o32 F7 /0 id [386] - -\c TEST AL,imm8 ; A8 ib [8086] -\c TEST AX,imm16 ; o16 A9 iw [8086] -\c TEST EAX,imm32 ; o32 A9 id [386] - -\c{TEST} performs a `mental' bitwise AND of its two operands, and -affects the flags as if the operation had taken place, but does not -store the result of the operation anywhere. - - -\S{insUCOMISD} \i\c{UCOMISD}: Unordered Scalar Double-Precision FP -compare and set EFLAGS - -\c UCOMISD xmm1,xmm2/m128 ; 66 0F 2E /r [WILLAMETTE,SSE2] - -\c{UCOMISD} compares the low-order double-precision FP numbers in the -two operands, and sets the \c{ZF}, \c{PF} and \c{CF} bits in the -\c{EFLAGS} register. In addition, the \c{OF}, \c{SF} and \c{AF} bits -in the \c{EFLAGS} register are zeroed out. The unordered predicate -(\c{ZF}, \c{PF} and \c{CF} all set) is returned if either source -operand is a \c{NaN} (\c{qNaN} or \c{sNaN}). - - -\S{insUCOMISS} \i\c{UCOMISS}: Unordered Scalar Single-Precision FP -compare and set EFLAGS - -\c UCOMISS xmm1,xmm2/m128 ; 0F 2E /r [KATMAI,SSE] - -\c{UCOMISS} compares the low-order single-precision FP numbers in the -two operands, and sets the \c{ZF}, \c{PF} and \c{CF} bits in the -\c{EFLAGS} register. In addition, the \c{OF}, \c{SF} and \c{AF} bits -in the \c{EFLAGS} register are zeroed out. 
The unordered predicate -(\c{ZF}, \c{PF} and \c{CF} all set) is returned if either source -operand is a \c{NaN} (\c{qNaN} or \c{sNaN}). - - -\S{insUD2} \i\c{UD0}, \i\c{UD1}, \i\c{UD2}: Undefined Instruction - -\c UD0 ; 0F FF [186,UNDOC] -\c UD1 ; 0F B9 [186,UNDOC] -\c UD2 ; 0F 0B [186] - -\c{UDx} can be used to generate an invalid opcode exception, for testing -purposes. - -\c{UD0} is specifically documented by AMD as being reserved for this -purpose. - -\c{UD1} is documented by Intel as being available for this purpose. - -\c{UD2} is specifically documented by Intel as being reserved for this -purpose. Intel document this as the preferred method of generating an -invalid opcode exception. - -All these opcodes can be used to generate invalid opcode exceptions on -all currently available processors. - - -\S{insUMOV} \i\c{UMOV}: User Move Data - -\c UMOV r/m8,reg8 ; 0F 10 /r [386,UNDOC] -\c UMOV r/m16,reg16 ; o16 0F 11 /r [386,UNDOC] -\c UMOV r/m32,reg32 ; o32 0F 11 /r [386,UNDOC] - -\c UMOV reg8,r/m8 ; 0F 12 /r [386,UNDOC] -\c UMOV reg16,r/m16 ; o16 0F 13 /r [386,UNDOC] -\c UMOV reg32,r/m32 ; o32 0F 13 /r [386,UNDOC] - -This undocumented instruction is used by in-circuit emulators to -access user memory (as opposed to host memory). It is used just like -an ordinary memory/register or register/register \c{MOV} -instruction, but accesses user space. - -This instruction is only available on some AMD and IBM 386 and 486 -processors. - - -\S{insUNPCKHPD} \i\c{UNPCKHPD}: Unpack and Interleave High Packed -Double-Precision FP Values - -\c UNPCKHPD xmm1,xmm2/m128 ; 66 0F 15 /r [WILLAMETTE,SSE2] - -\c{UNPCKHPD} performs an interleaved unpack of the high-order data -elements of the source and destination operands, saving the result -in \c{xmm1}. It ignores the lower half of the sources. - -The operation of this instruction is: - -\c dst[63-0] := dst[127-64]; -\c dst[127-64] := src[127-64]. 
- - -\S{insUNPCKHPS} \i\c{UNPCKHPS}: Unpack and Interleave High Packed -Single-Precision FP Values - -\c UNPCKHPS xmm1,xmm2/m128 ; 0F 15 /r [KATMAI,SSE] - -\c{UNPCKHPS} performs an interleaved unpack of the high-order data -elements of the source and destination operands, saving the result -in \c{xmm1}. It ignores the lower half of the sources. - -The operation of this instruction is: - -\c dst[31-0] := dst[95-64]; -\c dst[63-32] := src[95-64]; -\c dst[95-64] := dst[127-96]; -\c dst[127-96] := src[127-96]. - - -\S{insUNPCKLPD} \i\c{UNPCKLPD}: Unpack and Interleave Low Packed -Double-Precision FP Data - -\c UNPCKLPD xmm1,xmm2/m128 ; 66 0F 14 /r [WILLAMETTE,SSE2] - -\c{UNPCKLPD} performs an interleaved unpack of the low-order data -elements of the source and destination operands, saving the result -in \c{xmm1}. It ignores the upper half of the sources. - -The operation of this instruction is: - -\c dst[63-0] := dst[63-0]; -\c dst[127-64] := src[63-0]. - - -\S{insUNPCKLPS} \i\c{UNPCKLPS}: Unpack and Interleave Low Packed -Single-Precision FP Data - -\c UNPCKLPS xmm1,xmm2/m128 ; 0F 14 /r [KATMAI,SSE] - -\c{UNPCKLPS} performs an interleaved unpack of the low-order data -elements of the source and destination operands, saving the result -in \c{xmm1}. It ignores the upper half of the sources. - -The operation of this instruction is: - -\c dst[31-0] := dst[31-0]; -\c dst[63-32] := src[31-0]; -\c dst[95-64] := dst[63-32]; -\c dst[127-96] := src[63-32]. - - -\S{insVERR} \i\c{VERR}, \i\c{VERW}: Verify Segment Readability/Writability - -\c VERR r/m16 ; 0F 00 /4 [286,PRIV] - -\c VERW r/m16 ; 0F 00 /5 [286,PRIV] - -\b \c{VERR} sets the zero flag if the segment specified by the selector -in its operand can be read from at the current privilege level. -Otherwise it is cleared. - -\b \c{VERW} sets the zero flag if the segment can be written.
- - -\S{insWAIT} \i\c{WAIT}: Wait for Floating-Point Processor - -\c WAIT ; 9B [8086] -\c FWAIT ; 9B [8086] - -\c{WAIT}, on 8086 systems with a separate 8087 FPU, waits for the -FPU to have finished any operation it is engaged in before -continuing main processor operations, so that (for example) an FPU -store to main memory can be guaranteed to have completed before the -CPU tries to read the result back out. - -On higher processors, \c{WAIT} is unnecessary for this purpose, and -it has the alternative purpose of ensuring that any pending unmasked -FPU exceptions have happened before execution continues. - - -\S{insWBINVD} \i\c{WBINVD}: Write Back and Invalidate Cache - -\c WBINVD ; 0F 09 [486] - -\c{WBINVD} invalidates and empties the processor's internal caches, -and causes the processor to instruct external caches to do the same. -It writes the contents of the caches back to memory first, so no -data is lost. To flush the caches quickly without bothering to write -the data back first, use \c{INVD} (\k{insINVD}). - - -\S{insWRMSR} \i\c{WRMSR}: Write Model-Specific Registers - -\c WRMSR ; 0F 30 [PENT] - -\c{WRMSR} writes the value in \c{EDX:EAX} to the processor -Model-Specific Register (MSR) whose index is stored in \c{ECX}. -See also \c{RDMSR} (\k{insRDMSR}). - - -\S{insWRSHR} \i\c{WRSHR}: Write SMM Header Pointer Register - -\c WRSHR r/m32 ; 0F 37 /0 [386,CYRIX,SMM] - -\c{WRSHR} loads the contents of either a 32-bit memory location or a -32-bit register into the SMM header pointer register. - -See also \c{RDSHR} (\k{insRDSHR}). - - -\S{insXADD} \i\c{XADD}: Exchange and Add - -\c XADD r/m8,reg8 ; 0F C0 /r [486] -\c XADD r/m16,reg16 ; o16 0F C1 /r [486] -\c XADD r/m32,reg32 ; o32 0F C1 /r [486] - -\c{XADD} exchanges the values in its two operands, and then adds -them together and writes the result into the destination (first) -operand. This instruction can be used with a \c{LOCK} prefix for -multi-processor synchronisation purposes. 
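The exchange-then-add behaviour described for \c{XADD} can be illustrated with a small C sketch. The function name `xadd32_model` is an assumption made for this example, the sketch covers only the 32-bit form, and it does not model the flags or the atomicity that a \c{LOCK} prefix would provide.

```c
#include <stdint.h>

/* Sketch of XADD dst,src for 32-bit operands (illustrative name).
   Exchanging the operands and then adding them leaves the sum in
   the destination and the old destination value in the source. */
static void xadd32_model(uint32_t *dst, uint32_t *src)
{
    uint32_t old_dst = *dst;

    *dst = old_dst + *src;  /* sum written to the destination (wraps modulo 2^32) */
    *src = old_dst;         /* source receives the original destination value */
}
```

Because the source operand ends up holding the previous value of the destination, the instruction behaves like a fetch-and-add, which is why it is useful for synchronisation when combined with \c{LOCK}.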
- - -\S{insXBTS} \i\c{XBTS}: Extract Bit String - -\c XBTS reg16,r/m16 ; o16 0F A6 /r [386,UNDOC] -\c XBTS reg32,r/m32 ; o32 0F A6 /r [386,UNDOC] - -The implied operation of this instruction is: - -\c XBTS r/m16,reg16,AX,CL -\c XBTS r/m32,reg32,EAX,CL - -Writes a bit string from the source operand to the destination. \c{CL} -indicates the number of bits to be copied, and \c{(E)AX} indicates the -low order bit offset in the source. The bits are written to the low -order bits of the destination register. For example, if \c{CL} is set -to 4 and \c{AX} (for 16-bit code) is set to 5, bits 5-8 of \c{src} will -be copied to bits 0-3 of \c{dst}. This instruction is very poorly -documented, and I have been unable to find any official source of -documentation on it. - -\c{XBTS} is supported only on the early Intel 386s, and conflicts with -the opcodes for \c{CMPXCHG486} (on early Intel 486s). NASM supports it -only for completeness. Its counterpart is \c{IBTS} (see \k{insIBTS}). - - -\S{insXCHG} \i\c{XCHG}: Exchange - -\c XCHG reg8,r/m8 ; 86 /r [8086] -\c XCHG reg16,r/m16 ; o16 87 /r [8086] -\c XCHG reg32,r/m32 ; o32 87 /r [386] - -\c XCHG r/m8,reg8 ; 86 /r [8086] -\c XCHG r/m16,reg16 ; o16 87 /r [8086] -\c XCHG r/m32,reg32 ; o32 87 /r [386] - -\c XCHG AX,reg16 ; o16 90+r [8086] -\c XCHG EAX,reg32 ; o32 90+r [386] -\c XCHG reg16,AX ; o16 90+r [8086] -\c XCHG reg32,EAX ; o32 90+r [386] - -\c{XCHG} exchanges the values in its two operands. It can be used -with a \c{LOCK} prefix for purposes of multi-processor -synchronisation. - -\c{XCHG AX,AX} or \c{XCHG EAX,EAX} (depending on the \c{BITS} -setting) generates the opcode \c{90h}, and so is a synonym for -\c{NOP} (\k{insNOP}). - - -\S{insXLATB} \i\c{XLATB}: Translate Byte in Lookup Table - -\c XLAT ; D7 [8086] -\c XLATB ; D7 [8086] - -\c{XLATB} adds the value in \c{AL}, treated as an unsigned byte, to -\c{BX} or \c{EBX}, and loads the byte from the resulting address (in -the segment specified by \c{DS}) back into \c{AL}.
- -The base register used is \c{BX} if the address size is 16 bits, and -\c{EBX} if it is 32 bits. If you need to use an address size not -equal to the current \c{BITS} setting, you can use an explicit -\i\c{a16} or \i\c{a32} prefix. - -The segment register used to load from \c{[BX+AL]} or \c{[EBX+AL]} -can be overridden by using a segment register name as a prefix (for -example, \c{es xlatb}). - - -\S{insXOR} \i\c{XOR}: Bitwise Exclusive OR - -\c XOR r/m8,reg8 ; 30 /r [8086] -\c XOR r/m16,reg16 ; o16 31 /r [8086] -\c XOR r/m32,reg32 ; o32 31 /r [386] - -\c XOR reg8,r/m8 ; 32 /r [8086] -\c XOR reg16,r/m16 ; o16 33 /r [8086] -\c XOR reg32,r/m32 ; o32 33 /r [386] - -\c XOR r/m8,imm8 ; 80 /6 ib [8086] -\c XOR r/m16,imm16 ; o16 81 /6 iw [8086] -\c XOR r/m32,imm32 ; o32 81 /6 id [386] - -\c XOR r/m16,imm8 ; o16 83 /6 ib [8086] -\c XOR r/m32,imm8 ; o32 83 /6 ib [386] - -\c XOR AL,imm8 ; 34 ib [8086] -\c XOR AX,imm16 ; o16 35 iw [8086] -\c XOR EAX,imm32 ; o32 35 id [386] - -\c{XOR} performs a bitwise XOR operation between its two operands -(i.e. each bit of the result is 1 if and only if exactly one of the -corresponding bits of the two inputs was 1), and stores the result -in the destination (first) operand. - -In the forms with an 8-bit immediate second operand and a longer -first operand, the second operand is considered to be signed, and is -sign-extended to the length of the first operand. In these cases, -the \c{BYTE} qualifier is necessary to force NASM to generate this -form of the instruction. - -The \c{MMX} instruction \c{PXOR} (see \k{insPXOR}) performs the same -operation on the 64-bit \c{MMX} registers. - - -\S{insXORPD} \i\c{XORPD}: Bitwise Logical XOR of Double-Precision FP Values - -\c XORPD xmm1,xmm2/m128 ; 66 0F 57 /r [WILLAMETTE,SSE2] - -\c{XORPD} returns a bit-wise logical XOR between the source and -destination operands, storing the result in the destination operand. 
- - -\S{insXORPS} \i\c{XORPS}: Bitwise Logical XOR of Single-Precision FP Values - -\c XORPS xmm1,xmm2/m128 ; 0F 57 /r [KATMAI,SSE] - -\c{XORPS} returns a bit-wise logical XOR between the source and -destination operands, storing the result in the destination operand. - -
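As a rough illustration of the bit-wise XOR described for \c{XORPS}, the following C sketch operates on the raw bit patterns of four 32-bit lanes. The function name `xorps_model` and the array representation of the registers are assumptions made for this example only; the point is that the operation works on bits and performs no floating-point arithmetic.

```c
#include <stdint.h>

/* Sketch of XORPS dst,src: each of the four 32-bit lanes of the
   destination is XORed with the corresponding lane of the source.
   The lanes are treated purely as bit patterns. */
static void xorps_model(uint32_t dst[4], const uint32_t src[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] ^= src[i];
}
```

In this model, XORing a lane holding the bit pattern of 1.0 (3F800000h) with 80000000h flips only the sign bit, and XORing any lane with an identical value yields zero.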