manual/aspell.html/Notes-on-8_002dbit-Characters.html

   1 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
   2 <html>
   3 <!-- This is the user's manual for Aspell
   4
   5 GNU Aspell is a spell checker designed to eventually replace Ispell.
   6 It can either be used as a library or as an independent spell checker.
   7
   8 Copyright © 2000-2019 Kevin Atkinson.
   9
  10 Permission is granted to copy, distribute and/or modify this document
  11 under the terms of the GNU Free Documentation License, Version 1.1 or
  12 any later version published by the Free Software Foundation; with no
  13 Invariant Sections, no Front-Cover Texts and no Back-Cover Texts.  A
  14 copy of the license is included in the section entitled "GNU Free
  15 Documentation License". -->
  16 <!-- Created by GNU Texinfo 5.2, http://www.gnu.org/software/texinfo/ -->
  17 <head>
  18 <title>GNU Aspell 0.60.7: Notes on 8-bit Characters</title>
  19
  20 <meta name="description" content="Aspell 0.60.7 spell checker user&rsquo;s manual.">
  21 <meta name="keywords" content="GNU Aspell 0.60.7: Notes on 8-bit Characters">
  22 <meta name="resource-type" content="document">
  23 <meta name="distribution" content="global">
  24 <meta name="Generator" content="makeinfo">
  25 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  26 <link href="index.html#Top" rel="start" title="Top">
  27 <link href="index.html#SEC_Contents" rel="contents" title="Table of Contents">
  28 <link href="Implementation-Notes.html#Implementation-Notes" rel="up" title="Implementation Notes">
  29 <link href="Languages-Which-Aspell-can-Support.html#Languages-Which-Aspell-can-Support" rel="next" title="Languages Which Aspell can Support">
  30 <link href="Aspell-Suggestion-Strategy.html#Aspell-Suggestion-Strategy" rel="prev" title="Aspell Suggestion Strategy">
  31 <style type="text/css">
  32 <!--
  33 a.summary-letter {text-decoration: none}
  34 blockquote.smallquotation {font-size: smaller}
  35 div.display {margin-left: 3.2em}
  36 div.example {margin-left: 3.2em}
  37 div.indentedblock {margin-left: 3.2em}
  38 div.lisp {margin-left: 3.2em}
  39 div.smalldisplay {margin-left: 3.2em}
  40 div.smallexample {margin-left: 3.2em}
  41 div.smallindentedblock {margin-left: 3.2em; font-size: smaller}
  42 div.smalllisp {margin-left: 3.2em}
  43 kbd {font-style:oblique}
  44 pre.display {font-family: inherit}
  45 pre.format {font-family: inherit}
  46 pre.menu-comment {font-family: serif}
  47 pre.menu-preformatted {font-family: serif}
  48 pre.smalldisplay {font-family: inherit; font-size: smaller}
  49 pre.smallexample {font-size: smaller}
  50 pre.smallformat {font-family: inherit; font-size: smaller}
  51 pre.smalllisp {font-size: smaller}
  52 span.nocodebreak {white-space:nowrap}
  53 span.nolinebreak {white-space:nowrap}
  54 span.roman {font-family:serif; font-weight:normal}
  55 span.sansserif {font-family:sans-serif; font-weight:normal}
  56 ul.no-bullet {list-style: none}
  57 -->
  58 </style>
  59
  60
  61 </head>
  62
  63 <body lang="en" bgcolor="#FFFFFF" text="#000000" link="#0000FF" vlink="#800080" alink="#FF0000">
  64 <a name="Notes-on-8_002dbit-Characters"></a>
  65 <div class="header">
  66 <p>
  67 Previous: <a href="Aspell-Suggestion-Strategy.html#Aspell-Suggestion-Strategy" accesskey="p" rel="prev">Aspell Suggestion Strategy</a>, Up: <a href="Implementation-Notes.html#Implementation-Notes" accesskey="u" rel="up">Implementation Notes</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>]</p>
  68 </div>
  69 <hr>
  70 <a name="Notes-on-8_002dbit-Characters-1"></a>
  71 <h3 class="appendixsec">A.2 Notes on 8-bit Characters</h3>
  72
  73 <p>There is a very good reason I use 8-bit characters in Aspell: speed
  74 and simplicity. While many parts of my code can fairly easily be
  75 converted to some sort of wide character, as my code is clean, other
  76 parts cannot be.
  77 </p>
  78 <p>One reason is that in many, many places I use a direct
  79 lookup to find out various information about characters. With 8-bit
  80 characters this is very feasible because there are only 256 of
  81 them. With 16-bit wide characters this will waste a LOT of space. With
  82 32-bit characters this is just plain impossible. Converting the lookup
  83 tables to another form is certainly possible but degrades performance
  84 significantly.
  85 </p>
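<p>As a rough illustration of the direct-lookup idea (a minimal sketch,
not Aspell&rsquo;s actual code; <code>CharInfo</code>, <code>build_table</code>,
<code>to_lower</code> and <code>table_bytes</code> are hypothetical names), a
table indexed by an 8-bit character needs only 256 entries. With one byte
per entry, a flat table costs 256 bytes for 8-bit characters, 64 KiB for
16-bit characters, and 4 GiB for 32-bit characters, and that cost is paid
once per table.
</p>

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical per-character record; the fields are illustrative only.
struct CharInfo {
    std::uint8_t lower;      // lower-case form of the character
    bool         is_letter;  // true for letters known to this toy table
};

// Build a 256-entry table covering every possible 8-bit value.
// This toy version only knows about the ASCII letters.
std::array<CharInfo, 256> build_table() {
    std::array<CharInfo, 256> table{};
    for (std::size_t c = 0; c < table.size(); ++c) {
        table[c].lower = static_cast<std::uint8_t>(c);
        table[c].is_letter = false;
    }
    for (unsigned c = 'a'; c <= 'z'; ++c) table[c].is_letter = true;
    for (unsigned c = 'A'; c <= 'Z'; ++c) {
        table[c].is_letter = true;
        table[c].lower = static_cast<std::uint8_t>(c - 'A' + 'a');
    }
    return table;
}

// Direct lookup: a single array index, no searching or hashing.
inline std::uint8_t to_lower(const std::array<CharInfo, 256>& t, unsigned char c) {
    return t[c].lower;
}

// Bytes needed for a flat table with one entry per possible character value.
constexpr std::uint64_t table_bytes(unsigned bits, std::uint64_t entry_size) {
    return (std::uint64_t{1} << bits) * entry_size;
}
```

<p>Here <code>table_bytes(8, 1)</code>, <code>table_bytes(16, 1)</code> and
<code>table_bytes(32, 1)</code> give the three sizes quoted above.
</p>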
  86 <p>Furthermore, some of my algorithms rely on words consisting of only a
  87 small number of distinct characters (often around 30 when case and
  88 accents are not considered). When a word can contain any Unicode
  89 character, this number becomes several thousand, if not more.
  90 In order for these algorithms to still be used, some sort of
  91 limit will need to be placed on the possible characters the word can
  92 contain. If I impose that limit, I might as well use some sort of
  93 8-bit character set, which will automatically place the limit on what
  94 the characters can be.
  95 </p>
  96 <p>There is also the issue of how I should store the word lists in
  97 memory. As strings of 32-bit wide characters? That uses 4 times as
  98 much memory as 8-bit characters would, and for languages that can fit
  99 within an 8-bit character set that is, in my view, a gross waste of
 100 memory. So maybe I should store them in some variable-width format
 101 such as UTF-8. Unfortunately, way, way too many of the algorithms will
 102 simply not work with variable-width characters without significant
 103 modification, which will very likely degrade performance. So the
 104 solution is to work with the characters as 32-bit wide characters and
 105 then convert them to a shorter representation when storing them in the
 106 lookup tables. Now that can lead to an inefficiency. I could also use
 107 16-bit wide characters; however, that may not be good enough to hold all
 108 future versions of Unicode and therefore has the same problems.
 109 </p>
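<p>To make the trade-off concrete, here is a small sketch (the helper
names are hypothetical and this is not how Aspell stores its word lists)
that compares the bytes one word occupies in an 8-bit encoding, in UTF-8,
and in 32-bit wide characters. It also shows why variable-width storage is
awkward: even counting the characters in a UTF-8 string requires a scan,
whereas in a fixed-width encoding the length and the n-th character are
available directly.
</p>

```cpp
#include <cstddef>
#include <string>

// Bytes used by the characters of a word in each encoding
// (ignoring any per-string bookkeeping overhead).
std::size_t bytes_8bit(const std::string& word)     { return word.size(); }
std::size_t bytes_utf8(const std::string& word)     { return word.size(); }
std::size_t bytes_utf32(const std::u32string& word) { return word.size() * sizeof(char32_t); }

// Number of code points in a UTF-8 string: count every byte that is not
// a continuation byte (bit pattern 10xxxxxx). Assumes well-formed input.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

<p>For the four-letter word &ldquo;caf&eacute;&rdquo;, the ISO-8859-1 form
<code>"caf\xE9"</code> takes 4 bytes, the UTF-8 form
<code>"caf\xC3\xA9"</code> takes 5, and the 32-bit form takes 16 on
platforms where <code>char32_t</code> is 4 bytes wide.
</p>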
 110 <p>In response to the space wasted by storing word lists in some
 111 sort of wide format, someone asked:
 112 </p>
 113 <blockquote>
 114 <p>Since hard drives are cheaper and cheaper, you could store a dictionary
 115 in a usable (uncompressed) form and use it directly with memory
 116 mapping. Then the efficiency would directly depend on the disk caching
 117 method, and only the used part of the dictionaries would really be
 118 loaded into memory. You would no more have to load plain dictionaries
 119 into main memory, you&rsquo;ll just want to compute some indexes (or
 120 something like that) after mapping.
 121 </p></blockquote>
 122
 123 <p>However, the fact of the matter is that most of the dictionary will be
 124 read into memory anyway if enough memory is available. If it is not,
 125 then there will be a good deal of disk swapping. Making characters
 126 32-bit wide will increase the chance that there are more disk swaps.
 127 So the bottom line is that it is more efficient to convert characters
 128 from something like UTF-8 into some sort of 8-bit character set. I could
 129 also use some sort of disk-based lookup table such as the Berkeley
 130 Database. However, this will <strong>definitely</strong> degrade performance.
 131 </p>
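<p>The conversion step mentioned above can be sketched as follows. This
is only an illustration under a strong assumption: the target 8-bit
character set is ISO-8859-1, whose byte values coincide with Unicode code
points U+0000 to U+00FF. The function name <code>utf8_to_latin1</code> is
hypothetical, and other 8-bit character sets would need a mapping table
instead of the bit arithmetic used here.
</p>

```cpp
#include <cstddef>
#include <string>

// Convert UTF-8 text to ISO-8859-1 (Latin-1).
// Returns false if the input contains a code point above U+00FF or a
// byte sequence this simplified decoder does not accept; in that case
// the contents of `out` are unspecified.
bool utf8_to_latin1(const std::string& in, std::string& out) {
    out.clear();
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {
            // ASCII: one byte, copied unchanged.
            out.push_back(static_cast<char>(b0));
        } else if (b0 == 0xC2 || b0 == 0xC3) {
            // U+0080..U+00FF: lead byte 0xC2 or 0xC3 plus one continuation byte.
            if (i + 1 >= in.size()) return false;
            unsigned char b1 = static_cast<unsigned char>(in[++i]);
            if ((b1 & 0xC0) != 0x80) return false;
            out.push_back(static_cast<char>(((b0 & 0x03) << 6) | (b1 & 0x3F)));
        } else {
            // Outside Latin-1, or malformed UTF-8.
            return false;
        }
    }
    return true;
}
```

<p>A caller would convert each word once at the boundary and then work
on fixed-width 8-bit strings internally; input such as the euro sign
(U+20AC), which has no ISO-8859-1 equivalent, is rejected rather than
silently altered.
</p>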
 132 <p>The bottom line is that keeping Aspell 8-bit internally is a very well
 133 thought out decision that is not likely to change any time soon. Feel
 134 free to challenge me on it, but don&rsquo;t expect me to change my mind
 135 unless you can bring up some point that I have not thought of before
 136 and quite possibly a patch that cleanly converts Aspell to Unicode
 137 internally without a serious performance loss OR serious memory usage
 138 increase.
 139 </p>
 145
 146
 147
 148 </body>
 149 </html>