doc/slmbuild.pod

=head1 NAME

slmbuild - generate a language model from an idngram file

=head1 SYNOPSIS

slmbuild [I<option>]... I<idngram_file>...

=head1 DESCRIPTION

B<slmbuild> generates a back-off smoothed language model from the given idngram file. The I<idngram_file> is generally created by B<ids2ngram>.

=head1 OPTIONS

All the following options are mandatory.

=over 4

=item B<-n>, B<--NMax> I<N>

Maximum n-gram order: 1 for unigram, 2 for bigram, 3 for trigram. Any number outside the range 1..3 is invalid.

=item B<-o>, B<--out> I<output-file>

Specify the output file name.

=item B<-l>, B<--log>

Use I<-log(pr)> in the model. By default, I<pr> is used directly.

=item B<-w>, B<--wordcount> I<N>

Lexicon size, i.e. the number of different words.

=item B<-b>, B<--brk> I<id>...

Set the ids that should be treated as breakers.

=item B<-e>, B<--e> I<id>...

Set the ids that should not be put into the language model.

=item B<-c>, B<--cut> I<c>...

k-grams whose frequency is E<lt>= c[k] are dropped.

=item B<-d>, B<--discount> I<method>,I<param>...

The k-th B<-d> option specifies the discount method for k-grams. Possible values for I<method> and I<param> are:

=over 4

=item B<GT>,I<R>,I<dis>

Good-Turing discount for r E<lt>= I<R>, where r is the frequency of an n-gram. Linear discount for r E<gt> I<R>, i.e. r' = r * I<dis>, with 0 E<lt>E<lt> I<dis> E<lt> 1.0, for example 0.999.

=item B<ABS>[,I<dis>]

Absolute discount, r' = r - I<dis>. I<dis> is optional; 0 E<lt>E<lt> I<dis> E<lt> cut[k]+1.0, normally I<dis> E<lt> 1.0.

=item B<LIN>[,I<dis>]

Linear discount, r' = r * I<dis>. I<dis> is optional; 0 E<lt> I<dis> E<lt> 1.0.

=back

=back

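As a rough illustration of the arithmetic behind B<-c> and the B<ABS> and B<LIN> methods, the following Python sketch applies a cut-off and then a discount to a few made-up k-gram frequencies. It is not B<slmbuild>'s implementation: the function names and the sample counts are hypothetical, and both Good-Turing discounting and the automatic choice of I<dis> are left out.

```python
def apply_cutoff(freqs, cut):
    """Drop k-grams whose frequency is <= cut (the -c rule)."""
    return {gram: r for gram, r in freqs.items() if r > cut}


def abs_discount(r, dis):
    """Absolute discount: r' = r - dis."""
    return r - dis


def lin_discount(r, dis):
    """Linear discount: r' = r * dis."""
    return r * dis


# Hypothetical bigram frequencies, keyed by pairs of word ids.
bigrams = {(21, 22): 5, (21, 23): 3, (24, 25): 1}

# With cut-off 1, the bigram seen only once is dropped.
kept = apply_cutoff(bigrams, cut=1)

# dis = 0.5 is an arbitrary value chosen to keep the numbers easy to read.
abs_counts = {gram: abs_discount(r, 0.5) for gram, r in kept.items()}
lin_counts = {gram: lin_discount(r, 0.5) for gram, r in kept.items()}
```

In both cases the discounted frequency r' is smaller than r. In a back-off model, the difference is the probability mass left over for n-grams that were not seen or were cut off.
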

=head1 NOTE

B<-n> must be given before B<-c> and B<-b>. B<-c> must give the right number of cut-offs, and B<-d> must appear exactly N times, specifying the discounts for 1-grams, 2-grams, ..., respectively.

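The counting constraints above can be written as a small consistency check. The Python sketch below is only an illustration: C<check_args> is a hypothetical helper, not part of B<slmbuild>, and it assumes that the right number of cut-offs is one per n-gram level, as in the invocation in the EXAMPLE section. It does not check the ordering of options on the command line.

```python
def check_args(n, cutoffs, discounts):
    """Return a list of problems found in an slmbuild-style option set."""
    problems = []
    if n not in (1, 2, 3):
        problems.append("-n must be 1, 2 or 3")
    if len(cutoffs) != n:
        problems.append(f"-c needs {n} cut-offs, got {len(cutoffs)}")
    if len(discounts) != n:
        problems.append(f"-d must appear {n} times, got {len(discounts)}")
    return problems


# Mirrors: -n 3 -c 0,3,2 -d GT,8,0.9995 -d ABS -d ABS
ok = check_args(3, [0, 3, 2], ["GT,8,0.9995", "ABS", "ABS"])

# A trigram model with only two cut-offs and a single -d is inconsistent.
bad = check_args(3, [0, 3], ["ABS"])
```

Here C<ok> is an empty list, while C<bad> holds one message for B<-c> and one for B<-d>.
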

BREAKER-IDs could be sentence tokens or paragraph tokens. Conceptually, these ids have no meaning when they appear in the middle of an n-gram.

EXCLUDE-IDs could be ambiguous ids. Conceptually, n-grams that contain those ids are meaningless.

N-grams cannot be erased directly from the idngram file according to BREAKER-IDs and EXCLUDE-IDs, because some of the low-level information in them is still useful.

=head1 EXAMPLE

The following example reads F<all.id3gram> and writes the trigram model F<all.slm>.

At the 1-gram level, use Good-Turing discount with cut-off 0, I<R>=8, I<dis>=0.9995. At the 2-gram level, use Absolute discount with cut-off 3, with I<dis> calculated automatically. At the 3-gram level, use Absolute discount with cut-off 2, with I<dis> calculated automatically. Word ids 10, 11 and 12 are breakers (sentence/paragraph/paper breakers, etc.). The exclude id is 9. The lexicon contains 200000 words. The resulting language model uses -log(pr).

B<slmbuild -l -n 3 -o all.slm -w 200000 -c 0,3,2 -d GT,8,0.9995 -d ABS -d ABS -b 10,11,12 -e 9 all.id3gram>

=head1 AUTHOR

Originally written by Phill.Zhang E<lt>phill.zhang@sun.comE<gt>.
Currently maintained by Kov.Chai E<lt>tchaikov@gmail.comE<gt>.

=head1 SEE ALSO

B<ids2ngram>(1), B<slmprune>(1).
