src/lib/krb5/unicode/ucdata/format.txt

   1 #
   2 # $Id: format.txt,v 1.2 2001/01/02 18:46:20 mleisher Exp $
   3 #
   4
   5 CHARACTER DATA
   6 ==============
   7
   8 This package generates some data files that contain character properties useful
   9 for text processing.
  10
  11 CHARACTER PROPERTIES
  12 ====================
  13
  14 The first data file is called "ctype.dat" and contains a compressed form of
  15 the character properties found in the Unicode Character Database (UCDB).
  16 Additional properties can be specified in limited UCDB format in another file
  17 to avoid modifying the original UCDB.
  18
  19 The following is a property name and code table to be used with the character
  20 data:
  21
  22 NAME CODE DESCRIPTION
  23 ---------------------
  24 Mn   0    Mark, Non-Spacing
  25 Mc   1    Mark, Spacing Combining
  26 Me   2    Mark, Enclosing
  27 Nd   3    Number, Decimal Digit
  28 Nl   4    Number, Letter
  29 No   5    Number, Other
  30 Zs   6    Separator, Space
  31 Zl   7    Separator, Line
  32 Zp   8    Separator, Paragraph
  33 Cc   9    Other, Control
  34 Cf   10   Other, Format
  35 Cs   11   Other, Surrogate
  36 Co   12   Other, Private Use
  37 Cn   13   Other, Not Assigned
  38 Lu   14   Letter, Uppercase
  39 Ll   15   Letter, Lowercase
  40 Lt   16   Letter, Titlecase
  41 Lm   17   Letter, Modifier
  42 Lo   18   Letter, Other
  43 Pc   19   Punctuation, Connector
  44 Pd   20   Punctuation, Dash
  45 Ps   21   Punctuation, Open
  46 Pe   22   Punctuation, Close
  47 Po   23   Punctuation, Other
  48 Sm   24   Symbol, Math
  49 Sc   25   Symbol, Currency
  50 Sk   26   Symbol, Modifier
  51 So   27   Symbol, Other
  52 L    28   Left-To-Right
  53 R    29   Right-To-Left
  54 EN   30   European Number
  55 ES   31   European Number Separator
  56 ET   32   European Number Terminator
  57 AN   33   Arabic Number
  58 CS   34   Common Number Separator
  59 B    35   Block Separator
  60 S    36   Segment Separator
  61 WS   37   Whitespace
  62 ON   38   Other Neutrals
  63 Pi   47   Punctuation, Initial
  64 Pf   48   Punctuation, Final
  65 #
  66 # Implementation specific properties.
  67 #
  68 Cm   39   Composite
  69 Nb   40   Non-Breaking
  70 Sy   41   Symmetric (characters which are part of open/close pairs)
  71 Hd   42   Hex Digit
  72 Qm   43   Quote Mark
  73 Mr   44   Mirroring
  74 Ss   45   Space, Other (controls viewed as spaces in ctype isspace())
  75 Cp   46   Defined character
  76
  77 The actual binary data is formatted as follows:
  78
  79   Assumptions: unsigned short is at least 16 bits in size and unsigned long
  80                is at least 32 bits in size.
  81
  82     unsigned short ByteOrderMark
  83     unsigned short OffsetArraySize
  84     unsigned long  Bytes
  85     unsigned short Offsets[OffsetArraySize + 1]
  86     unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]
  87
  88   The Bytes field provides the total byte count used for the Offsets[] and
  89   Ranges[] arrays.  The Offsets[] array is aligned on a 4-byte boundary and
  90   there is always one extra node on the end to hold the final index of the
  91   Ranges[] array.  The Ranges[] array contains pairs of 4-byte values
  92   representing a range of Unicode characters.  The pairs are arranged in
  93   increasing order by the first character code in the range.
  94
  95   Determining whether a particular character has a property requires a
  96   simple binary search over the pairs in Ranges[] that belong to that
  97   property.
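
  The search described above might look like the following sketch.  It is
  not the package's own code.  The name prop_lookup and the fixed-width
  types are hypothetical, and it assumes that Offsets[p] and Offsets[p + 1]
  are indexes into Ranges[] that bracket the pairs for property code p.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if code lies in one of the ranges for property prop, else 0.
 * Assumption: offsets[prop] and offsets[prop + 1] are indexes into ranges[]
 * that bracket this property's (first, last) pairs.
 */
int prop_lookup(const uint16_t *offsets, const uint32_t *ranges,
                unsigned prop, uint32_t code)
{
    long lo, hi, mid;

    assert(offsets != NULL && ranges != NULL);
    lo = offsets[prop] / 2;            /* first pair of this property */
    hi = offsets[prop + 1] / 2 - 1;    /* last pair of this property */

    while (lo <= hi) {
        mid = lo + (hi - lo) / 2;
        if (code < ranges[2 * mid])
            hi = mid - 1;              /* code is below this range */
        else if (code > ranges[2 * mid + 1])
            lo = mid + 1;              /* code is above this range */
        else
            return 1;                  /* first <= code <= last */
    }
    return 0;
}
```

  A property with no ranges has equal neighbouring offsets under this
  assumption, so the loop body never runs and the function returns 0.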
  98
  99   If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
 100   machine with a different endian order and the values must be byte-swapped.
 101
 102   To swap a 16-bit value:
 103      c = (c >> 8) | ((c & 0xff) << 8)
 104
 105   To swap a 32-bit value:
 106      c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
 107          (((c >> 16) & 0xff) << 8) | (c >> 24)
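
  As a sketch only, the two swaps can be wrapped in small C functions.  The
  names swap16, swap32 and swap32_array are hypothetical, and the
  fixed-width types from <stdint.h> stand in for the unsigned short and
  unsigned long values described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. */
uint16_t swap16(uint16_t c)
{
    return (uint16_t)((c >> 8) | ((c & 0xff) << 8));
}

/* Reverse the four bytes of a 32-bit value. */
uint32_t swap32(uint32_t c)
{
    return ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
           (((c >> 16) & 0xff) << 8) | (c >> 24);
}

/* Swap every element of a 32-bit array in place (hypothetical helper). */
void swap32_array(uint32_t *v, size_t n)
{
    size_t i;

    assert(v != NULL || n == 0);
    for (i = 0; i < n; i++)
        v[i] = swap32(v[i]);
}
```

  A loader would typically read the ByteOrderMark first and, if it equals
  0xFFFE, apply the 16-bit swap to the header and offset fields and the
  32-bit swap to the remaining arrays.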
 108
 109 CASE MAPPINGS
 110 =============
 111
 112 The next data file is called "case.dat" and contains three case mapping tables
 113 in the following order: upper, lower, and title case.  Each table is in
 114 increasing order by character code and each mapping contains 3 unsigned longs
 115 which represent the possible mappings.
 116
 117 The format for the binary form of these tables is:
 118
 119   unsigned short ByteOrderMark
 120   unsigned short NumMappingNodes, count of all mapping nodes
 121   unsigned short CaseTableSizes[2], upper and lower mapping node counts
 122   unsigned long  CaseTables[NumMappingNodes * 3]
 123
 124   The starting indexes of the case tables are calculated as follows:
 125
 126     UpperIndex = 0;
 127     LowerIndex = CaseTableSizes[0] * 3;
 128     TitleIndex = LowerIndex + CaseTableSizes[1] * 3;
 129
 130   The order of the fields for the three tables is:
 131
 132     Upper case
 133     ----------
 134     unsigned long upper;
 135     unsigned long lower;
 136     unsigned long title;
 137
 138     Lower case
 139     ----------
 140     unsigned long lower;
 141     unsigned long upper;
 142     unsigned long title;
 143
 144     Title case
 145     ----------
 146     unsigned long title;
 147     unsigned long upper;
 148     unsigned long lower;
 149
 150   If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
 151   same way as described in the CHARACTER PROPERTIES section.
 152
 153   Because the tables are in increasing order by character code, locating a
 154   mapping requires a simple binary search on one of the 3 codes that make up
 155   each node.
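
  The index arithmetic and the search can be combined as in the sketch
  below.  The names case_lookup and enum case_table are hypothetical, the
  fixed-width types stand in for unsigned short and unsigned long, and the
  sketch assumes that each table is sorted on the first field of its nodes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Which of the three tables to search. */
enum case_table { CASE_UPPER = 0, CASE_LOWER = 1, CASE_TITLE = 2 };

/*
 * Find the 3-field node whose first field equals code in the chosen table.
 * Returns a pointer to the node, or NULL if code has no mapping there.
 * sizes[] holds the upper and lower node counts; num_nodes is the total.
 * Assumption: each table is sorted on the first field of its nodes.
 */
const uint32_t *case_lookup(const uint32_t *tables, uint16_t num_nodes,
                            const uint16_t sizes[2], enum case_table which,
                            uint32_t code)
{
    long upper_index = 0;
    long lower_index = (long)sizes[0] * 3;
    long title_index = lower_index + (long)sizes[1] * 3;
    long end_index = (long)num_nodes * 3;
    long start, stop, lo, hi, mid;

    assert(tables != NULL && sizes != NULL);
    if (which == CASE_UPPER) {
        start = upper_index; stop = lower_index;
    } else if (which == CASE_LOWER) {
        start = lower_index; stop = title_index;
    } else {
        start = title_index; stop = end_index;
    }

    lo = start / 3;                    /* first node of the table */
    hi = stop / 3 - 1;                 /* last node of the table */
    while (lo <= hi) {
        mid = lo + (hi - lo) / 2;
        if (code < tables[mid * 3])
            hi = mid - 1;
        else if (code > tables[mid * 3])
            lo = mid + 1;
        else
            return &tables[mid * 3];
    }
    return NULL;
}
```

  With a node from the upper case table in hand, node[1] is the lower case
  form and node[2] the title case form, following the field order listed
  above.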
 156
 157   It is important to note that there can be at most 65535 mapping nodes,
 158   which divided into 3 portions allows 21845 nodes for each case mapping
 159   table.  The distribution of mappings may be more or less than 21845 per
 160   table, but only 65535 in total are allowed.
 161
 162 COMPOSITIONS
 163 ============
 164
 165 This data file is called "comp.dat" and contains data that tracks character
 166 pairs that have a single Unicode value representing the combination of the two
 167 characters.
 168
 169 The format for the binary form of this table is:
 170
 171   unsigned short ByteOrderMark
 172   unsigned short NumCompositionNodes, count of composition nodes
 173   unsigned long  Bytes, total number of bytes used for composition nodes
 174   unsigned long  CompositionNodes[NumCompositionNodes * 4]
 175
 176   If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
 177   same way as described in the CHARACTER PROPERTIES section.
 178
 179   The CompositionNodes[] array consists of groups of 4 unsigned longs.  The
 180   first of these is the character code representing the combination of two
 181   other character codes, the second records the number of character codes that
 182   make up the composition (not currently used), and the last two are the pair
 183   of character codes whose combination is represented by the character code in
 184   the first field.
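
  A minimal sketch of finding the composite for a pair of character codes
  follows.  The name comp_lookup is hypothetical.  Because this section
  does not state how the nodes are ordered, the sketch uses a plain linear
  scan; a real reader may well use a faster search.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Look up the composite for the pair (first, second).
 * nodes[] holds num_nodes groups of 4 values:
 *   [0] composite, [1] count (unused), [2] first, [3] second.
 * Returns 1 and stores the composite in *out on success, 0 otherwise.
 * A linear scan is used because no ordering is assumed here.
 */
int comp_lookup(const uint32_t *nodes, uint16_t num_nodes,
                uint32_t first, uint32_t second, uint32_t *out)
{
    size_t i;

    assert(nodes != NULL && out != NULL);
    for (i = 0; i < num_nodes; i++) {
        const uint32_t *n = &nodes[i * 4];

        if (n[2] == first && n[3] == second) {
            *out = n[0];
            return 1;
        }
    }
    return 0;
}
```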
 185
 186 DECOMPOSITIONS
 187 ==============
 188
 189 The next data file is called "decomp.dat" and contains the decomposition data
 190 for all characters whose decompositions contain more than one character and
 191 are *not* compatibility decompositions.  Compatibility decompositions are
 192 signaled in the UCDB format by the use of the <compat> tag in the
 193 decomposition field.  Each list of character codes represents a full
 194 decomposition of a composite character.  The nodes are arranged in increasing
 195 order by character code.
 196
 197 The format for the binary form of this table is:
 198
 199   unsigned short ByteOrderMark
 200   unsigned short NumDecompNodes, count of all decomposition nodes
 201   unsigned long  Bytes
 202   unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
 203   unsigned long  Decomp[N], N = sum of all counts in DecompNodes[]
 204
 205   If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
 206   same way as described in the CHARACTER PROPERTIES section.
 207
 208   The DecompNodes[] array consists of pairs of unsigned longs, the first of
 209   which is the character code and the second is the initial index of the list
 210   of character codes representing the decomposition.
 211
 212   Locating the decomposition of a composite character requires a binary search
 213   for a character code in the DecompNodes[] array and using its index to
 214   locate the start of the decomposition.  The length of the decomposition list
 215   is the index in the following element of DecompNodes[] minus the current
 216   index.
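
  The lookup and the length calculation might be sketched as follows.  The
  name decomp_lookup is hypothetical.  The sketch reads the layout above
  as NumDecompNodes (code, index) pairs followed by a single trailing
  element that holds the end index of the last list; that reading is an
  assumption and should be checked against the generated data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Find the decomposition of code.
 * nodes[] holds num_nodes (code, index) pairs sorted by code.
 * Assumption: one trailing element, nodes[num_nodes * 2], holds the end
 * index of the last list.
 * On success returns 1 and sets *list and *len; otherwise returns 0.
 */
int decomp_lookup(const uint32_t *nodes, uint16_t num_nodes,
                  const uint32_t *decomp, uint32_t code,
                  const uint32_t **list, uint32_t *len)
{
    long lo = 0, hi = (long)num_nodes - 1, mid;

    assert(nodes != NULL && decomp != NULL && list != NULL && len != NULL);
    while (lo <= hi) {
        mid = lo + (hi - lo) / 2;
        if (code < nodes[mid * 2]) {
            hi = mid - 1;
        } else if (code > nodes[mid * 2]) {
            lo = mid + 1;
        } else {
            uint32_t start = nodes[mid * 2 + 1];
            /* Next node's index, or the trailing element for the last node. */
            uint32_t end = (mid + 1 < num_nodes)
                               ? nodes[mid * 2 + 3]
                               : nodes[(long)num_nodes * 2];

            *list = &decomp[start];
            *len = end - start;
            return 1;
        }
    }
    return 0;
}
```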
 217
 218 COMBINING CLASSES
 219 =================
 220
 221 The next data file is called "cmbcl.dat" and contains the characters with
 222 non-zero combining classes.
 223
 224 The format for the binary form of this table is:
 225
 226   unsigned short ByteOrderMark
 227   unsigned short NumCCLNodes
 228   unsigned long  Bytes
 229   unsigned long  CCLNodes[NumCCLNodes * 3]
 230
 231   If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
 232   same way as described in the CHARACTER PROPERTIES section.
 233
 234   The CCLNodes[] array consists of groups of three unsigned longs.  The first
 235   and second are the beginning and ending of a range and the third is the
 236   combining class of that range.
 237
 238   If a character is not found in this table, then the combining class is
 239   assumed to be 0.
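
  A sketch of the lookup, including the default of 0, is shown below.  The
  name ccl_lookup is hypothetical, and the binary search assumes that the
  ranges are sorted by their first code and do not overlap, which this
  section does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the combining class of code, or 0 if code is in no range.
 * nodes[] holds num_nodes groups of 3 values: first, last, class.
 * Assumption: groups are sorted by first and the ranges do not overlap.
 */
uint32_t ccl_lookup(const uint32_t *nodes, uint16_t num_nodes, uint32_t code)
{
    long lo = 0, hi = (long)num_nodes - 1, mid;

    assert(nodes != NULL || num_nodes == 0);
    while (lo <= hi) {
        mid = lo + (hi - lo) / 2;
        if (code < nodes[mid * 3])
            hi = mid - 1;              /* code is below this range */
        else if (code > nodes[mid * 3 + 1])
            lo = mid + 1;              /* code is above this range */
        else
            return nodes[mid * 3 + 2]; /* class stored with the range */
    }
    return 0;                          /* not listed: class 0 */
}
```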
 240
 241   It is important to note that only 65535 distinct ranges plus combining class
 242   can be specified because NumCCLNodes is usually a 16-bit number.
 243
 244 NUMBER TABLE
 245 ============
 246
 247 The final data file is called "num.dat" and contains the characters that have
 248 a numeric value associated with them.
 249
 250 The format for the binary form of the table is:
 251
 252   unsigned short ByteOrderMark
 253   unsigned short NumNumberNodes
 254   unsigned long  Bytes
 255   unsigned long  NumberNodes[NumNumberNodes]
 256   unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
 257                             / sizeof(short)]
 258
 259   If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
 260   same way as described in the CHARACTER PROPERTIES section.
 261
 262   The NumberNodes array contains pairs of values, the first of which is the
 263   character code and the second an index into the ValueNodes array.  The
 264   ValueNodes array contains pairs of integers which represent the numerator
 265   and denominator of the numeric value of the character.  If the character
 266   happens to map to an integer, both the values in ValueNodes will be the
 267   same.
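
  The two arrays could be consulted as in the sketch below.  The name
  num_lookup is hypothetical.  It assumes that NumNumberNodes counts
  individual array elements (two per character) and that the stored index
  is an element index into ValueNodes, and it uses a linear scan because no
  ordering of NumberNodes is stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fetch the numerator and denominator stored for code.
 * nodes[] holds (code, index) pairs; num_nodes counts its elements.
 * Assumption: index is an element index into vals[], where the
 * numerator is followed by the denominator.
 * Returns 1 on success, 0 if code has no numeric value.
 */
int num_lookup(const uint32_t *nodes, uint16_t num_nodes,
               const uint16_t *vals, uint32_t code,
               uint16_t *numerator, uint16_t *denominator)
{
    size_t i;

    assert(nodes != NULL && vals != NULL);
    assert(numerator != NULL && denominator != NULL);
    /* Step two elements at a time over the (code, index) pairs. */
    for (i = 0; i + 1 < num_nodes; i += 2) {
        if (nodes[i] == code) {
            const uint16_t *v = &vals[nodes[i + 1]];

            *numerator = v[0];
            *denominator = v[1];
            return 1;
        }
    }
    return 0;
}
```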