slmbuild - generate language model from idngram file
slmbuild [I<option>]... I<idngram_file>...
B<slmbuild> generates a back-off smoothing language model from a given idngram file. Generally, the I<idngram_file> is created by B<ids2ngram>.
All the following options are mandatory.
=item B<-n>, B<--NMax> I<N>
1 for unigram, 2 for bigram, 3 for trigram. Any number outside the range 1..3 is invalid.
=item B<-o>, B<--out> I<output-file>
Specify the output file name.
Use I<-log(pr)> in the model (the B<-l> flag); I<pr> is used directly by default.
=item B<-w>, B<--wordcount> I<N>
Lexicon size, i.e. the number of distinct words.
=item B<-b>, B<--brk> I<id>...
Set the ids that should be treated as breakers.
=item B<-e>, B<--e> I<id>...
Set the ids that should not be put into the LM.
=item B<-c>, B<--cut> I<c>...
k-grams whose frequency is <= I<c>[k] are dropped.
=item B<-d>, B<--discount> I<method>, I<param>...
The k-th B<-d> parameter specifies the discount method for k-grams.
For a k-gram, possible values for method/param are:
B<GT>,I<R>,I<dis> : B<GT> discount for r E<lt>= I<R>, where r is the frequency of an n-gram.
Linear discount for those r E<gt> I<R>, i.e. r'=r*I<dis>,
0 E<lt>E<lt> I<dis> E<lt> 1.0, for example 0.999.
B<ABS>,[I<dis>] : Absolute discount r'=r-I<dis>. I<dis> is optional; when omitted, it is calculated automatically.
0 E<lt>E<lt> I<dis> E<lt> cut[k]+1.0, normally I<dis> E<lt> 1.0.
B<LIN>,[I<dis>] : Linear discount r'=r*I<dis>. I<dis> is optional.
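As a rough illustration of the cut-off and discount rules above, the following Python sketch applies them to plain frequency counts. It is not B<slmbuild>'s implementation: all function names and the C<nr> table (number of distinct k-grams seen exactly r times) are hypothetical, and the Good-Turing branch uses the textbook estimate r'=(r+1)*n[r+1]/n[r], which is only an assumption about what B<GT> computes.

```python
def apply_cutoff(counts, cutoff):
    """Drop k-grams whose frequency is <= cutoff (the c[k] given to -c)."""
    return {gram: r for gram, r in counts.items() if r > cutoff}


def discount_abs(r, dis):
    """Absolute discount: r' = r - dis."""
    return r - dis


def discount_lin(r, dis):
    """Linear discount: r' = r * dis."""
    return r * dis


def discount_gt(r, R, dis, nr):
    """GT,R,dis: Good-Turing for r <= R, linear discount for r > R.

    nr maps a frequency to the number of distinct k-grams seen that often.
    ASSUMPTION: the r <= R branch is the textbook Good-Turing estimate,
    not necessarily the formula slmbuild uses. It needs nr[r] > 0.
    """
    if r > R:
        return r * dis
    return (r + 1) * nr[r + 1] / nr[r]


# Hypothetical bigram counts keyed by word-id tuples.
bigrams = {(1, 2): 5, (1, 3): 3, (2, 3): 1}
kept = apply_cutoff(bigrams, 3)  # only counts above the cut-off survive
```

Note that the cut-off comparison is inclusive: a k-gram whose frequency equals c[k] is dropped as well.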
B<-n> must be given before B<-c> and B<-b>. B<-c> must give the right number of cut-offs,
and B<-d> must appear exactly N times, specifying the discounts for 1-gram, 2-gram, and so on, in order.
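The consistency rules for B<-n>, B<-c> and B<-d> can be sketched as a small pre-flight check. This is a hypothetical helper, not part of B<slmbuild>; it assumes that "the right number of cut-offs" means exactly N values, as in the example where B<-n 3> is paired with B<-c 0,3,2>.

```python
def check_options(nmax, cutoffs, discounts):
    """Return a list of problems with the -n / -c / -d combination.

    nmax      -- the N passed to -n
    cutoffs   -- list of cut-off values passed to -c
    discounts -- list of strings, one per -d occurrence
    """
    errors = []
    if nmax not in (1, 2, 3):
        errors.append("-n must be 1, 2 or 3")
    # ASSUMPTION: one cut-off per n-gram level, i.e. exactly N values.
    if len(cutoffs) != nmax:
        errors.append("-c needs %d cut-off values, got %d" % (nmax, len(cutoffs)))
    if len(discounts) != nmax:
        errors.append("-d must appear %d times, got %d" % (nmax, len(discounts)))
    return errors


# The option set used in this page's example passes the check.
problems = check_options(3, [0, 3, 2], ["GT,8,0.9995", "ABS", "ABS"])
```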
BREAKER-IDs could be SentenceTokens or ParagraphTokens. Conceptually,
these ids have no meaning when they appear in the middle of an n-gram.
EXCLUDE-IDs could be ambiguous ids. Conceptually, n-grams which
contain those ids are meaningless.
We cannot erase n-grams according to BREAKER-IDs and EXCLUDE-IDs directly
from the IDNGRAM file, because some of the low-level information in it is still useful.
The following example reads 'all.id3gram' and writes the trigram model 'all.slm'.
At the 1-gram level, it uses Good-Turing discount with cut-off 0, I<R>=8, I<dis>=0.9995. At
the 2-gram level, it uses Absolute discount with cut-off 3 and I<dis> calculated automatically. At the 3-gram
level, it uses Absolute discount with cut-off 2 and I<dis> calculated automatically. Word ids 10, 11 and 12
are breakers (sentence, paragraph or paper breakers, etc.). The Exclude-ID is 9. The lexicon
contains 200000 words. The resulting language model uses -log(pr).
B<slmbuild -l -n 3 -o all.slm -w 200000 -c 0,3,2 -d GT,8,0.9995 -d ABS -d ABS -b 10,11,12 -e 9 all.id3gram>
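When the command is driven from a script, the argument list can be assembled programmatically. The sketch below is a hypothetical Python helper (its name and parameters are illustrative only); it uses just the short options shown on this page and places B<-n> before B<-c> and B<-b>.

```python
def build_slmbuild_argv(idngram_file, out_file, nmax, wordcount,
                        cutoffs, discounts,
                        breakers=(), excludes=(), use_log=False):
    """Assemble an argument list for slmbuild from plain Python values."""
    argv = ["slmbuild"]
    if use_log:
        argv.append("-l")                      # store -log(pr) instead of pr
    argv += ["-n", str(nmax)]                  # -n comes before -c and -b
    argv += ["-o", out_file, "-w", str(wordcount)]
    argv += ["-c", ",".join(str(c) for c in cutoffs)]
    for method in discounts:                   # one -d per n-gram level
        argv += ["-d", method]
    if breakers:
        argv += ["-b", ",".join(str(i) for i in breakers)]
    if excludes:
        argv += ["-e", ",".join(str(i) for i in excludes)]
    argv.append(idngram_file)
    return argv


# Rebuild the example command from this page.
argv = build_slmbuild_argv(
    "all.id3gram", "all.slm", nmax=3, wordcount=200000,
    cutoffs=[0, 3, 2], discounts=["GT,8,0.9995", "ABS", "ABS"],
    breakers=[10, 11, 12], excludes=[9], use_log=True,
)
command_line = " ".join(argv)
```

The resulting list can be handed to C<subprocess.run(argv, check=True)>, assuming B<slmbuild> is on the PATH.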
Originally written by Phill.Zhang E<lt>phill.zhang@sun.comE<gt>.
Currently maintained by Kov.Chai E<lt>tchaikov@gmail.comE<gt>.
B<ids2ngram>(1), B<slmprune>(1).