   1 <html lang="en">
   2 <head>
   3 <title>Phonetic Code - GNU Aspell 0.60.6.1</title>
   4 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   5 <meta name="description" content="Aspell 0.60.6.1 spell checker user's manual.">
   6 <meta name="generator" content="makeinfo 4.8">
   7 <link title="Top" rel="start" href="index.html#Top">
   8 <link rel="up" href="Adding-Support-For-Other-Languages.html#Adding-Support-For-Other-Languages" title="Adding Support For Other Languages">
   9 <link rel="prev" href="Compiling-the-Word-List.html#Compiling-the-Word-List" title="Compiling the Word List">
  10 <link rel="next" href="The-Simple-Soundslike.html#The-Simple-Soundslike" title="The Simple Soundslike">
  11 <link href="http://www.gnu.org/software/texinfo/" rel="generator-home" title="Texinfo Homepage">
  12 <!--
  13 This is the user's manual for Aspell
  14
  15 GNU Aspell is a spell checker designed to eventually replace Ispell.
  16 It can either be used as a library or as an independent spell checker.
  17
  18 Copyright (C) 2000--2011 Kevin Atkinson.
  19
  20      Permission is granted to copy, distribute and/or modify this
  21      document under the terms of the GNU Free Documentation License,
  22      Version 1.1 or any later version published by the Free Software
  23      Foundation; with no Invariant Sections, no Front-Cover Texts and
  24      no Back-Cover Texts.  A copy of the license is included in the
  25      section entitled "GNU Free Documentation License".
  26    -->
  27 <meta http-equiv="Content-Style-Type" content="text/css">
  28 <style type="text/css"><!--
  29   pre.display { font-family:inherit }
  30   pre.format  { font-family:inherit }
  31   pre.smalldisplay { font-family:inherit; font-size:smaller }
  32   pre.smallformat  { font-family:inherit; font-size:smaller }
  33   pre.smallexample { font-size:smaller }
  34   pre.smalllisp    { font-size:smaller }
  35   span.sc    { font-variant:small-caps }
  36   span.roman { font-family:serif; font-weight:normal; }
  37   span.sansserif { font-family:sans-serif; font-weight:normal; }
  38 --></style>
  39 </head>
  40 <body>
  41 <div class="node">
  42 <p>
  43 <a name="Phonetic-Code"></a>
  44 Next:&nbsp;<a rel="next" accesskey="n" href="The-Simple-Soundslike.html#The-Simple-Soundslike">The Simple Soundslike</a>,
  45 Previous:&nbsp;<a rel="previous" accesskey="p" href="Compiling-the-Word-List.html#Compiling-the-Word-List">Compiling the Word List</a>,
  46 Up:&nbsp;<a rel="up" accesskey="u" href="Adding-Support-For-Other-Languages.html#Adding-Support-For-Other-Languages">Adding Support For Other Languages</a>
  47 <hr>
  48 </div>
  49
  50 <h3 class="section">7.3 Phonetic Code</h3>
  51
  52 <!-- @emph{(The following section was originally written by Bj@"orn Jacke, -->
  53 <!-- bjoern.jacke at gmx de)} -->
<p>Aspell stands out for the quality of the suggestions it makes when it
finds an unknown word.  One reason is that it does not just compare the
word with other words in the dictionary (as Ispell does) but also
compares the words phonetically.
  58
   <p>The new table-driven phonetic code is very flexible, and setting up
phonetic transformation rules for other languages is not difficult.
There are, however, a number of stumbling blocks, which is why I wrote
this section.
  63
   <p>The main phonetic code is free of any language-specific code and
should be powerful enough to allow setting up rules for any language.
Anything that is language specific is kept in a plain text file and
can easily be edited.  It is therefore possible to write phonetic
transformation rules even if you don't have any programming skills.  All
you need to know is how words of the language are written and how they
are pronounced.
  71
  72 <h4 class="subsection">7.3.1 Syntax of the transformation array</h4>
  73
  74 <p>In the translation array there are two strings on each line; the first
  75 one is the search string (or switch name) and the second one is the
  76 replacement string (or switch parameter).  The line
  77
  78 <pre class="example">     version   <var>version</var>
  79 </pre>
   <p class="noindent">is also required to appear somewhere in the translation array.  The
version string can be anything, but it should be changed whenever a new
version of the translation array is released.  This is important
because it keeps Aspell from using a compiled dictionary with the
wrong set of rules.  For example, when coming up with suggestions
for <code>hallo</code>, Aspell will use the new rules to compute the
soundslike, say <code>H*L*</code>.  If `<samp><span class="samp">hello</span></samp>' is stored in the
dictionary using the old rules as <code>HL</code> instead of <code>H*L*</code>,
Aspell will never be able to come up with `<samp><span class="samp">hello</span></samp>'.  To avoid
this problem, Aspell checks whether the version strings match and aborts
with an error if they don't.  This is
only a problem with the main word list, as the personal word lists are
now stored as simple word lists with a single header line (i.e. no
soundslike data).
  95
   <p>Each non-switch line represents one replacement (transformation) rule.
Rules whose search strings begin with the same letter must be grouped
together.  The order inside such a group need not be alphabetical;
instead it sets the priority: the earlier a rule appears, the higher its
priority, because the first rule that matches is applied.  In the
following example:
 101
 102 <pre class="example">     GH   _
 103      G    K
 104 </pre>
   <p class="noindent">`<samp><span class="samp">GH -&gt; _</span></samp>' has higher priority than `<samp><span class="samp">G -&gt; K</span></samp>'.

   <p>`<samp><span class="samp">_</span></samp>' represents the empty string &ldquo;&rdquo;.  If `<samp><span class="samp">GH -&gt; _</span></samp>' came
after `<samp><span class="samp">G -&gt; K</span></samp>', the `<samp><span class="samp">GH</span></samp>' rule would never match, because the
algorithm stops searching for more rules after the first match.
The above rules transform any `<samp><span class="samp">GH</span></samp>' to an empty string (that is,
delete it) and transform any other `<samp><span class="samp">G</span></samp>' to `<samp><span class="samp">K</span></samp>'.
 112
   <p>At the end of the first string of a line (the search string) there may
optionally be a set of characters in parentheses.  Exactly one
of these characters must match.  This is comparable to the `<samp><span class="samp">[ ]</span></samp>'
brackets in regular expressions.  The rule `<samp><span class="samp">DG(EIY) -&gt; J</span></samp>', for
example, matches any `<samp><span class="samp">DGE</span></samp>', `<samp><span class="samp">DGI</span></samp>' or
`<samp><span class="samp">DGY</span></samp>' and replaces it with `<samp><span class="samp">J</span></samp>'.  This way you can
reduce several rules to one.
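
   <p>As a small sketch of that reduction, the single rule

<pre class="example">     DG(EIY)   J
</pre>
   <p class="noindent">is meant to cover the same cases as the three separate rules below.
The three-line form is written out here only for comparison and assumes
that all three rules share the same replacement and no further control
characters.

<pre class="example">     DGE       J
     DGI       J
     DGY       J
</pre>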
 120
   <p>After the search string, one or more dashes `<samp><span class="samp">-</span></samp>' may be placed.
Such a search string must match in full, but only its beginning
is replaced.  Furthermore, for these rules no follow-up
rule will be searched (what this is will be explained later).  The
rule `<samp><span class="samp">TCH-- -&gt; _</span></samp>' will match any word containing
`<samp><span class="samp">TCH</span></samp>' (like `<samp><span class="samp">match</span></samp>') but will only replace the first
character `<samp><span class="samp">T</span></samp>' with an empty string.  The number of dashes
determines how many characters from the end will not be replaced.
After the replacement, the search for transformation rules continues
with the unreplaced `<samp><span class="samp">CH</span></samp>'.
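
   <p>The following trace is a sketch of how such a rule plays out on the
word `<samp><span class="samp">MATCH</span></samp>'.  It assumes, purely for illustration, a table that
contains only the three rules shown (the `<samp><span class="samp">CH</span></samp>' and `<samp><span class="samp">M</span></samp>' rules
are hypothetical) and therefore has no rule for `<samp><span class="samp">A</span></samp>'.

<pre class="example">     CH      X

     M       M

     TCH--   _
</pre>
<pre class="example">     at M:   M matches; M is written            result so far: M
     at A:   no rule; the character is skipped  result so far: M
     at T:   TCH-- matches; only T is replaced
             (by the empty string)              result so far: M
     at C:   CH matches; X is written           result so far: MX
</pre>
   <p class="noindent">Under these assumptions the word should come out as `<samp><span class="samp">MX</span></samp>': the
two dashes protect `<samp><span class="samp">CH</span></samp>' from the first replacement so that the
`<samp><span class="samp">CH</span></samp>' rule still gets its turn.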
 131
   <p>If a `<samp><span class="samp">&lt;</span></samp>' is appended to the search string, the search for
replacement rules continues with the replacement string and not with
the next character of the word.  The rule `<samp><span class="samp">PH&lt; -&gt; F</span></samp>', for example,
replaces `<samp><span class="samp">PH</span></samp>' with `<samp><span class="samp">F</span></samp>' and then starts searching again for
a replacement rule for `<samp><span class="samp">F...</span></samp>'.  If there were also rules
like `<samp><span class="samp">FO -&gt; O</span></samp>' and `<samp><span class="samp">F -&gt; _</span></samp>', then words like
`<samp><span class="samp">PHOXYZ</span></samp>' would be transformed to `<samp><span class="samp">OXYZ</span></samp>', and any occurrence of
`<samp><span class="samp">PH</span></samp>' that is not followed by an `<samp><span class="samp">O</span></samp>' would be deleted, as in
`<samp><span class="samp">PHIXYZ -&gt; IXYZ</span></samp>'.  The second replacement, however, is not applied if
the priority of that rule is lower than the priority of the first rule.
 142
   <p>Priorities are added to a rule by putting a number between 0 and 9 at
the end of the search string, for example `<samp><span class="samp">ING6 -&gt; N</span></samp>'.
The higher the number, the higher the priority.
 146
   <p>Priorities are especially important for the previously mentioned
follow-up rules.  Follow-up rules are searched beginning from the last
character of the first search string.  This is a bit complicated, but I
hope this example will make it clearer:
 151
 152 <pre class="example">     CHS      X
 153      CH       G
 154
 155      HAU--1   H
 156
 157      SCH      SH
 158 </pre>
   <p>In this example `<samp><span class="samp">CHS</span></samp>' in the word `<samp><span class="samp">FUCHS</span></samp>' would be
transformed to `<samp><span class="samp">X</span></samp>'.  If we take the word `<samp><span class="samp">DURCHSCHNITT</span></samp>' then
things look a bit different.  Here `<samp><span class="samp">CH</span></samp>' belongs together and
`<samp><span class="samp">SCH</span></samp>' belongs together, and the two are spoken separately.  The
algorithm, however, first finds the string `<samp><span class="samp">CHS</span></samp>', which must not be
transformed as in the previous word `<samp><span class="samp">FUCHS</span></samp>'.  At this point the
algorithm can find a follow-up rule.  It takes the last character of
the first matching rule (`<samp><span class="samp">CHS</span></samp>'), which is `<samp><span class="samp">S</span></samp>', and looks for
the next match, beginning from this character.  It finds
`<samp><span class="samp">SCH -&gt; SH</span></samp>', which has the same priority
(no priority means standard priority, which is 5).  If the priority is
the same or higher, the follow-up rule will be applied.  Now take a
look at the word `<samp><span class="samp">SCHAUKEL</span></samp>'.  In this word `<samp><span class="samp">SCH</span></samp>' belongs
together and may not be taken apart.  After the algorithm has found
`<samp><span class="samp">SCH -&gt; SH</span></samp>' it searches for a follow-up rule for
`<samp><span class="samp">H</span></samp>' + `<samp><span class="samp">AUKEL</span></samp>'.  It finds `<samp><span class="samp">HAU--1 -&gt; H</span></samp>', but does not
apply it because its priority is lower than that of the first rule.
This is a very powerful feature, but it can also easily
lead to mistakes.  If you really don't need this feature you can turn
it off by putting the line:
 179
 180 <pre class="example">     followup      0
 181 </pre>
   <p class="noindent">at the beginning of the phonetic table file.  As mentioned, for rules
containing a `<samp><span class="samp">-</span></samp>' no follow-up rules are searched.  Giving such
rules a priority is still not pointless, because they can themselves be
follow-up rules, and in that case the priority matters again.
Follow-up rules of follow-up rules are not searched because this is
not needed very often.
 188
   <p>The control character `<samp><span class="samp">^</span></samp>' says that the search string only
matches at the beginning of words, so that the rule `<samp><span class="samp">RH^ -&gt; R</span></samp>' will
only apply to words like `<samp><span class="samp">RHESUS</span></samp>' but not `<samp><span class="samp">PERHAPS</span></samp>'.  You
can append another `<samp><span class="samp">^</span></samp>' to the search string.  In that case the
algorithm treats the rest of the word completely separately from the
first matched string at the beginning.  This is useful for prefixes
whose pronunciation does not depend on the rest of the word and vice
versa, like `<samp><span class="samp">OVER^^</span></samp>' in English.
 197
   <p>Just as `<samp><span class="samp">^</span></samp>' anchors a rule to the beginning of a word, `<samp><span class="samp">$</span></samp>'
makes a rule apply only to words that end with the search string.
`<samp><span class="samp">GN$ -&gt; N</span></samp>' only
matches words like `<samp><span class="samp">SIGN</span></samp>' but not `<samp><span class="samp">SIGNUM</span></samp>'.  If
you use `<samp><span class="samp">^</span></samp>' and `<samp><span class="samp">$</span></samp>' together, both of them must fit:
`<samp><span class="samp">ENOUGH^$ -&gt; NF</span></samp>' will only match the word
`<samp><span class="samp">ENOUGH</span></samp>' and nothing else.
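
   <p>Put together, the anchored rules discussed above might appear in a
table as in the following sketch.  The layout is only an illustration:
the unanchored `<samp><span class="samp">G -&gt; K</span></samp>' line is repeated from the earlier example
to show that the more specific, anchored rule has to come first within
its letter group.

<pre class="example">     ENOUGH^$   NF

     GN$        N
     G          K

     RH^        R
</pre>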
 204
 205    <p>Of course you can combine all of the mentioned control characters but
 206 they must occur in this order: `<samp><span class="samp">&lt; - priority ^ $</span></samp>'.  All
 207 characters must be written in CAPITAL letters.
 208
   <p>If absolutely no rule can be found (which might happen if you use
unusual characters for which you don't have any replacement rule), the next
character is simply skipped and the search for replacement rules
continues with the rest of the word.
 213
 214    <p>If you want double letters to be reduced to one you must set up a rule
 215 like `<samp><span class="samp">LL- -&gt; L</span></samp>'.  If double letters in the resulting phonetic
 216 word should be allowed, you must place the line:
 217
 218 <pre class="example">     collapse_result     0
 219 </pre>
 220    <p class="noindent">at the beginning of your transformation table file; otherwise set the
 221 value to `1'.  The English rules for example strip all vowels from
 222 words and so the word "GOGO" would be transformed to "K" and not to
 223 "KK" (as desired) if <code>collapse_result</code> is set to 1.  That's why
 224 the English rules have <code>collapse_result</code> set to <code>0</code>.
 225
 226    <p>By default, all accents are removed from a word before it is matched to
 227 the soundslike rules.  If you do not want this then add the line
 228
 229 <pre class="example">     remove_accents      0
 230 </pre>
   <p>at the beginning of your file.  The exact definition of an accent is
language dependent and is controlled via the character set file.  If you
set <code>remove_accents</code> to <code>0</code> then you should also set <code>store-as</code> to <code>lower</code>
in the language data file (not the phonetic transformation file);
otherwise Aspell will have problems when both the accented and the
de-accented version of a word appear in the dictionary: it will
consider one of them incorrectly spelled.
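
   <p>Putting the switches from this section together, the top of a table
file might look like the following sketch.  The version string
<code>1.0</code> is a placeholder, and the three switch values simply repeat
the settings shown above rather than recommending them for any particular
language; the rule lines would follow.

<pre class="example">     version             1.0
     followup            0
     collapse_result     0
     remove_accents      0
</pre>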
 238
 239 <h4 class="subsection">7.3.2 How do I start finally?</h4>
 240
 241 <p>Before you start to write an array of transformation rules, you should
 242 be aware that you have to do some work to make sure that things you do
 243 will result in correct transformation rules.
 244
 245 <h5 class="subsubsection">7.3.2.1 Things that come in handy</h5>
 246
<p>First of all, you need a large word list of the language you
want to make phonetics for.  It should contain about as many words as
the dictionary of the spell checker.  If you don't have such a list,
you will probably find an Ispell dictionary at
<a href="http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html">http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html</a> which
will help you.  You can then expand the affixes with <samp><span class="command">ispell -e</span></samp>
and pipe the result through <samp><span class="command">tr " " "\n"</span></samp> to put one word on
each line.  After that you may have to convert special
characters like `<samp><span class="samp">&eacute;</span></samp>' from Ispell's internal representation to
latin1 encoding.  <samp><span class="command">sed "s/e'/&eacute;/g"</span></samp>, for example, would replace
all `<samp><span class="samp">e'</span></samp>' with `<samp><span class="samp">&eacute;</span></samp>'.
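
   <p>Assembled into one pipeline, these steps might look like the sketch
below.  The dictionary name <samp><span class="file">your_language</span></samp> passed to Ispell's
<samp><span class="option">-d</span></samp> option and the input file <samp><span class="file">your_language.words</span></samp> are
hypothetical names; adapt them, and the <samp><span class="command">sed</span></samp> expression, to the
dictionary you actually downloaded.

<pre class="example">     ispell -d your_language -e &lt; your_language.words \
       | tr " " "\n" \
       | sed "s/e'/&eacute;/g" \
       &gt; wordlist
</pre>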
 258
   <p>Second, you should know how to use regular expressions and
<samp><span class="command">grep</span></samp>.  You should for example know that:

<pre class="example">     grep '^[^aeiou]qu[io]' wordlist | less
</pre>
   <p class="noindent">will show you all words that begin with any character but `<samp><span class="samp">a</span></samp>',
`<samp><span class="samp">e</span></samp>', `<samp><span class="samp">i</span></samp>', `<samp><span class="samp">o</span></samp>' or `<samp><span class="samp">u</span></samp>' and then continue with
`<samp><span class="samp">qui</span></samp>' or `<samp><span class="samp">quo</span></samp>'.  This matters, for example, when you want to
find out whether a phonetic replacement rule you want to set up is valid
for all words that match the expression you want to replace.  Taking
a look at the regex(7) man page is a good idea.
 270
 271 <h5 class="subsubsection">7.3.2.2 What the phonetic code should do</h5>
 272
<p>Normal text comparison works well as long as the typist misspells a
word by pressing a key he or she did not mean to press.  In
these cases, usually only one character differs from the intended word.
 276
   <p>In cases where the writer did not know the correct spelling of
the word, the word may have several characters that differ from the
original word, but usually it will still sound like the
original.  Someone might think that `tough' is spelled `taff'.  No
spell checker without phonetic code will come up with the idea that this
might be `tough', but a spell checker that knows that `taff' would be
pronounced like `tough' will make good suggestions to the user.  Another
example could be `funetik' and `phonetic'.
 285
   <p>From these examples you can see that the phonetic transformation should
not be too fussy or too precise.  If you implement a whole phonetic
dictionary as you can find it in books, this will not be very useful,
because there could still be many characters that differ between the
misspelled and the desired word.  When you implement
the phonetic transformation table, you should instead reduce the number
of letters used to only those that are really necessary.
 293
   <p>Characters that sound similar should be reduced to one.  In the English
language, for example, `Z' sounds like `S', and that's why the
transformation rule `<samp><span class="samp">Z -&gt; S</span></samp>' is present in the
replacement table.  &ldquo;PH&rdquo; is spoken like &ldquo;F&rdquo; and so we have a
`<samp><span class="samp">PH -&gt; F</span></samp>' rule.
 299
 300    <p>If you take a closer look you will even see that vowels sound very
 301 similar in the English language: `contradiction', `cuntradiction',
 302 `cantradiction' or `centradiction' in fact sound nearly the same,
don't they? Therefore the English phonetic replacement rules not only
reduce all vowels to one but even remove them all (removing is done by
simply not setting up any rule for those letters).  The phonetic code of
&ldquo;contradiction&rdquo; is &ldquo;KNTRTKXN&rdquo; and if you try to read this
letter-monster aloud you will hear that it still sounds a bit like
`contradiction'.  You also see that &ldquo;D&rdquo; is transformed to &ldquo;T&rdquo;
because they sound nearly the same.
 310
   <p>If you think you have found a regularity you should <em>always</em> take
your word list and <samp><span class="command">grep</span></samp> for the regular
expression you want to make a transformation rule for.  An example:
suppose you come to the idea that all English words ending in `ough' sound
like `AF' at the end, because you think of `enough' and `tough'.  If
you then <code>grep</code> for the corresponding regular expression with
<samp><span class="command">grep -i 'ough$' wordlist</span></samp> you will see that the rule you wanted
to set up is not correct, because it does not fit words like
`although' or `bough'.  So you have to define your rule more precisely,
or you have to set up exceptions if the number of words that deviate
from the desired rule is not too big.
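
   <p>Before committing to such a rule it can help to see how many words
it would touch and to skim the ones that do not fit.  A possible sketch,
using only standard <samp><span class="command">grep</span></samp> options and assuming the word list is in
a file named <samp><span class="file">wordlist</span></samp>:

<pre class="example">     grep -ic 'ough$' wordlist
     grep -i 'ough$' wordlist | less
</pre>
   <p class="noindent">The first command prints only the number of matching words; the
second lists them so that exceptions such as `although' or `bough'
stand out.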
 322
 323    <p>Don't forget about follow-up rules which can help in many cases but
 324 which also can lead to confusion and unwanted side effects.  It's also
 325 important to write exceptions in front of the more general rules
 326 (`<samp><span class="samp">GH</span></samp>' before `<samp><span class="samp">G</span></samp>' etc.).
 327
   <p>If you think you have set up a number of rules that may produce some
good results, try them out! If you run Aspell as <samp><span class="command">aspell
--lang=</span><var>your_language</var><span class="command"> pipe</span></samp> you get a prompt at which you can type
in words.  If you just type words, Aspell checks them and
makes suggestions if they are misspelled.  If you type in <code>$$Sw
</code><var>word</var> you will see the phonetic transformation and you can test
whether your work does what you want.
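
   <p>The same check can also be run without an interactive prompt, since
pipe mode reads its commands from standard input.  The line below is a
sketch that assumes an English dictionary is installed under the language
code <samp><span class="samp">en</span></samp>; the word is only an example, and the output is not shown
because it depends on the installed table.

<pre class="example">     echo '$$Sw contradiction' | aspell --lang=en pipe
</pre>
   <p class="noindent">The single quotes matter: without them the shell would replace
<code>$$</code> with its own process ID before Aspell sees the command.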
 335
   <p>Another good way to check that changes you make to your rules don't
have any bad side effects is to create a second list from your word
list that contains, on each line, a word followed by its
phonetic version.  If you
do this once before the change and once after the change you can make
a diff (see <samp><span class="command">man diff</span></samp>) to see what <em>really</em> changed.  To
do this use the command <samp><span class="command">aspell --lang=</span><var>your_language</var><span class="command">
soundslike</span></samp>.  In this mode Aspell outputs, for each word you give it,
the original word and then its soundslike, separated by a tab character.
If you are interested in seeing how the algorithm works you
can download a set of useful programs from
<a href="http://members.xoom.com/maccy/spell/phonet-utils.tar.gz">http://members.xoom.com/maccy/spell/phonet-utils.tar.gz</a>.  This
includes a program that produces a list as mentioned above and another
program which illustrates how the algorithm works.  It uses the same
transformation table as Aspell and so it helps a lot during the
process of creating a phonetic transformation table for Aspell.
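
   <p>The before-and-after comparison described above could be run along
these lines.  This is only a sketch: <samp><span class="file">wordlist</span></samp>, <samp><span class="file">before.txt</span></samp>
and <samp><span class="file">after.txt</span></samp> are hypothetical file names, and it assumes that
Aspell picks up your edited table for <var>your_language</var> on the second
run.

<pre class="example">     aspell --lang=your_language soundslike &lt; wordlist &gt; before.txt
     # ... edit the phonetic transformation table ...
     aspell --lang=your_language soundslike &lt; wordlist &gt; after.txt
     diff before.txt after.txt | less
</pre>
   <p class="noindent">Every line that <samp><span class="command">diff</span></samp> reports is a word whose soundslike
changed, which makes unintended side effects of a new rule easy to spot.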
 352
   <p>During your work you should write down your basic ideas so that other
people are able to understand what you did (and so that you still
remember it after a few weeks).  The English table has extensive
documentation appended as an example.
 357
   <p>Now you can start experimenting with all the things you just read and
perhaps set up a nice phonetic transformation table for your language,
so that Aspell can come up with equally good suggestions
for it.  Take a look at the Aspell homepage to
see if there is already a transformation table for your language.  If
there is one, you might also take a look at it to see if it could be
improved.
 365
 366    <p>If you think that this section helped you or if you think that this is
 367 just a waste of time you can send any feedback to
 368 <a href="mailto:bjoern.jacke@gmx.de">bjoern.jacke@gmx.de</a>.
 369
 370    </body></html>
 371