MIGRATION FROM 0.1.X TO 0.2.X

0.2.x breaks interoperability with 0.1.x in several ways, in order to allow
more use cases and to provide more storage capacity.
1.1 All Trie Data in Single File

A trie is no longer split into '{trie-name}.sbm', '{trie-name}.br' and
'{trie-name}.tl'. All parts are now stored in a single file, '{trie-name}.tri'.

Note, however, that a '{trie-name}.abm' (the renamed version of
'{trie-name}.sbm' after Unicode support) is still needed on first creation.
Once created, the '{trie-name}.tri' incorporates the alphabet map data, and no
'{trie-name}.abm' is required in later uses. It is ignored even if it exists.
To accommodate larger word lists, trie node indices are now 32 bits wide
instead of 16 bits. This gives roughly 65,536 (2^16) times the capacity of the
old format. As a result, data size generally doubles when migrating from the
old format, but a trie can now hold far more entries.

In addition, tail block lengths are now 16 bits instead of 8 bits, making it
possible to store longer suffixes, for dictionaries of extremely long words.
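
The effect of widening the node index can be checked with a little arithmetic.
The sketch below is only an illustration: it assumes that node indices are
signed integers (the text above gives only the widths), and the function names
are hypothetical, not part of the datrie API.

```c
#include <stdint.h>

/* Largest representable index for each width, ASSUMING signed indices. */
int64_t max_index_16(void) { return INT16_MAX; }   /* 32,767 */
int64_t max_index_32(void) { return INT32_MAX; }   /* 2,147,483,647 */

/* How many times larger the 32-bit index space is (integer division). */
int64_t index_capacity_ratio(void)
{
    return max_index_32() / max_index_16();
}
```

Under that assumption, index_capacity_ratio() evaluates to 65538, that is,
about 2^16 times the old index space.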
1.3 No Backward Compatibility

To keep the code simple, it was decided not to read or write old format files.
If you prefer the old format, just stay with the old version. If you want the
features of the new version, you can migrate your old data by first dumping
your dictionary into a text file with the 0.1.x trietool, and then creating the
new dictionary from the dumped word list. If you already have the word list,
things are a lot easier: just create the dictionary with the new trietool.
  a. If you have the word list source, skip to the next step. Otherwise, you
     can dump the old data with the 0.1.x trietool:

       $ trietool {trie-name} list > words.lst

  b. Prepare '{trie-name}.abm', listing the ranges of characters used in the
     word list, in terms of Unicode values. For example, for an English and Thai

  c. Generate the new trie with the 0.2.x trietool-0.2. For example:

       $ trietool-0.2 {trie-name} add-list -e TIS-620 words.lst

     In this example, '-e TIS-620' indicates that the 'words.lst' file
     contains TIS-620 encoded text, which is the most likely case for word
     lists dumped from an old trie that used 8-bit Thai character codes as the
     key encoding. Replace it with your old encoding as necessary, such as
     ISO-8859-1 or the like. If the '-e' option is omitted, the current locale
     encoding is assumed. See the trietool-0.2 man page for details.
2.1 Non-File Trie Usage

In datrie 0.1.x, every trie was associated with a set of files. Now, this is
not only reduced to a single file, but using no file at all is also possible.
That is, a new trie can be created in memory, have words added, removed and
queried, and then be disposed of without writing data to any file. Meanwhile,
saving to file
  Scenario 1: Loading trie from file, using it read-only.
    1a. Open trie with trie_new_from_file(path).
        - Close it with trie_free().

  Scenario 2: Loading trie from file, updating file when finished.
    2a. Open trie with trie_new_from_file(path).
        - If trie_is_dirty(), then trie_save().
        - Close it with trie_free().

  Scenario 3: Create a new trie, saving it when finished.
    3a. Prepare an alphabet map:
        - Create new alphabet map with alpha_map_new().
        - Add ranges with alpha_map_add_range().
    3b. Create new trie with trie_new(alpha_map).
    3c. Free the alphabet map with alpha_map_free().
    3d. Use/update the trie.
        - If trie_is_dirty(), then trie_save().
        - Close the trie with trie_free().

  Scenario 4: Create temporary trie, disposing it when finished.
    4a. Prepare an alphabet map:
        - Create new alphabet map with alpha_map_new().
        - Add ranges with alpha_map_add_range().
    4b. Create new trie with trie_new(alpha_map).
    4c. Free the alphabet map with alpha_map_free().
    4d. Use/update the trie.
        - Close the trie with trie_free().
In datrie 0.1.x, SBTrie provided a wrapper around the Trie implementation,
converting between real character codes and trie internal codes. This was for
compactness: a continuous range of character codes allows a more compact sparse
table allocation, while the real alphabet set need not be continuous. In datrie
0.2.x, however, this mapping feature has been merged into the Trie class to
reduce call layers. So there is no SBTrie any more. You can call Trie directly,
in the same way you called SBTrie in 0.1.x.
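
To make the idea of mapping a non-continuous alphabet onto continuous internal
codes concrete, here is a minimal sketch. It is not datrie's actual
implementation: the CharRange type, the to_internal_code() function and the
example ranges (lowercase ASCII plus the Thai block) are all assumptions made
for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range type for illustration; not datrie's internal layout. */
typedef struct {
    uint32_t begin;  /* first Unicode code point in the range (inclusive) */
    uint32_t end;    /* last Unicode code point in the range (inclusive) */
} CharRange;

/*
 * Map a Unicode code point to a compact internal code by numbering the
 * characters of all ranges consecutively, starting from 1.
 * Returns 0 if the code point lies outside every range.
 * The ranges are assumed not to overlap.
 */
uint32_t to_internal_code(const CharRange *ranges, size_t n_ranges,
                          uint32_t ch)
{
    uint32_t base = 1;
    for (size_t i = 0; i < n_ranges; i++) {
        if (ch >= ranges[i].begin && ch <= ranges[i].end)
            return base + (ch - ranges[i].begin);
        /* Skip past all the codes consumed by this range. */
        base += ranges[i].end - ranges[i].begin + 1;
    }
    return 0;
}

/* Example alphabet (an assumption): a-z and Thai U+0E01..U+0E5B. */
const CharRange example_ranges[] = {
    { 0x0061, 0x007A },
    { 0x0E01, 0x0E5B },
};
```

With these example ranges, 'a' (U+0061) maps to 1, 'z' (U+007A) to 26, and
U+0E01 to 27, so the two distant Unicode ranges occupy one gap-free block of
internal codes.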
2.3 Characters are Now Unicode

datrie was previously planned to support multiple kinds of character encodings,
with only single-byte encoding as the available implementation for the time
However, as there have been many requests for Unicode support, it seems to be
the most useful choice, into which all other encodings can be converted.

Furthermore, as datrie is mostly used in a program's critical path, having too
many layers can turn it into a bottleneck. So, only Unicode is accepted in this
version. It is now the application's duty to convert its keys into Unicode
before passing them to datrie. This should also allow any kind of
2.4 New Public APIs for Alphabet Map

As AlphaMap (alphabet map) is now necessary for creating a new empty trie, the
APIs for manipulating this data are now exposed to the public scope. See
<datrie/alpha-map.h> for the details.
2.5 Extensions to TrieState

  As part of performance profiling, allocating and freeing TrieState was found
  to eat up CPU time to some degree. So, reusing an existing TrieState where
  possible does help. This function is added for copying TrieState data, as a
  better alternative to trie_state_clone().

  trie_state_is_single()

  Sometimes, checking whether a TrieState is a leaf state is too expensive for
  a program's critical path. It needs to check both whether the state is on a
  non-branching path, that is, whether it is in a suffix node, and whether it
  can be walked by a terminator. When a program only needs to check the former
  and not the latter, this function is at its disposal.
3. Changes to TrieTool

To allow co-existence with 0.1.x trietool, 0.2.x trietool is named
3.2 '*.abm' Instead of '*.sbm'

As SBTrie has been eliminated in datrie 0.2.x, the corresponding '*.sbm'
(single-byte map) input file is also obsolete. It has been renamed to '*.abm'
(alphabet map), and its format has been redefined to be Unicode-based: all
alphabet character ranges are defined in Unicode.

Besides, the '*.abm' file is required only once, at trie creation time. It is
not needed at deployment, as the alphabet map is already included in the single
3.3 Encoding Conversion Support

As datrie is now Unicode-based, conversion from other encodings can be useful.
This is possible for the word list operations, namely add-list and delete-list,
through the additional '-e {enc}' or '--encoding {enc}' option. This option
specifies the character encoding of the word list file, and trietool-0.2
converts the contents to Unicode on the fly.
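
For a sense of what such a conversion involves, the sketch below converts a
TIS-620 byte string into Unicode code points by hand. It is only an
illustration and makes no claim about how trietool-0.2 performs the conversion
internally; the function name tis620_to_unicode() is hypothetical. It relies on
the fact that the Thai Unicode block mirrors the TIS-620 layout, so byte 0xA1
corresponds to U+0E01.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Convert a NUL-terminated TIS-620 byte string to Unicode code points.
 * Writes at most out_size - 1 code points followed by a terminating 0.
 * Returns the number of code points written (excluding the terminator),
 * or -1 on an undefined byte or if the output buffer is too small.
 */
long tis620_to_unicode(const unsigned char *in, uint32_t *out,
                       size_t out_size)
{
    size_t n = 0;

    if (out_size == 0)
        return -1;

    for (; *in != 0; in++) {
        uint32_t cp;

        if (*in < 0x80) {
            cp = *in;                        /* ASCII maps to itself */
        } else if ((*in >= 0xA1 && *in <= 0xDA) ||
                   (*in >= 0xDF && *in <= 0xFB)) {
            cp = 0x0E00u + (*in - 0xA0u);    /* Thai block U+0E01..U+0E5B */
        } else {
            return -1;                       /* byte undefined in TIS-620 */
        }

        if (n + 1 >= out_size)
            return -1;                       /* no room left for terminator */
        out[n++] = cp;
    }
    out[n] = 0;
    return (long)n;
}
```

An application that keeps its keys in TIS-620 could run a routine like this
over each key before passing it to the trie, since the library itself accepts
only Unicode.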