MIGRATION FROM 0.1.X TO 0.2.X

0.2.x breaks interoperability with 0.1.x in several ways, in order to allow
more use cases and to provide more storage capacity.

1. Binary Data Changes

1.1 All Trie Data in Single File

A trie is no longer split into '{trie-name}.sbm', '{trie-name}.br' and
'{trie-name}.tl'. All parts are now stored in a single file,
'{trie-name}.tri'.

Note, however, that a '{trie-name}.abm' (the new name of '{trie-name}.sbm'
since Unicode support was added) is still needed when the trie is first
created. Once created, '{trie-name}.tri' incorporates the alphabet map data,
and no '{trie-name}.abm' is required for later use. It is ignored if it
exists.

1.2 32-Bit Node Index

To accommodate larger word lists, trie node indices are now 32 bits instead
of 16 bits. This gives roughly 65,536 (2^16) times the index range of the old
format. As a result, data size generally doubles when migrating from the old
format, but a trie can now hold far more entries.

In addition, tail block lengths are now 16 bits instead of 8 bits, making it
possible to store longer suffixes, for dictionaries of extremely long words.

1.3 No Backward Compatibility

To keep the code simple, it was decided not to read or write old-format
files. If you prefer the old format, stay with the old version. To benefit
from the new version, you can migrate your old data by first dumping your
dictionary to a text file with the 0.1.x trietool, and then creating the new
dictionary from the dumped word list. If you already have the word list,
things are easier: just create the dictionary with the new trietool.

Data Migration Steps:

a. If you have the word list source, skip to the next step. Otherwise, dump
   the old data with the 0.1.x trietool:

     $ trietool {trie-name} list > words.lst

b. Prepare '{trie-name}.abm', listing the ranges of characters used in the
   word list, in terms of Unicode values.
   For example, for an English and Thai dictionary:

     [0x0041,0x005a]
     [0x0061,0x007a]
     [0x0e01,0x0e3a]
     [0x0e40,0x0e4e]

c. Generate the new trie with the 0.2.x trietool, trietool-0.2. For example:

     $ trietool-0.2 {trie-name} add-list -e TIS-620 words.lst

   In this example, '-e TIS-620' indicates that 'words.lst' contains TIS-620
   encoded text, which is the most likely case for word lists dumped from an
   old trie that used 8-bit Thai character codes as the key encoding. Replace
   it with your old encoding as necessary, such as ISO-8859-1. If the '-e'
   option is omitted, the current locale encoding is assumed. See the
   trietool-0.2 man page for details.

2. API Changes

2.1 Non-File Trie Usage

In datrie 0.1.x, every trie was associated with a set of files. Now, this is
not only reduced to a single file; using no file at all is also possible.
That is, a new trie can be created in memory, have words added, removed and
queried, and then be disposed of without writing any data to a file. Saving
to a file is still possible.

Scenario 1: Loading trie from file, using it read-only.

  1a. Open trie with trie_new_from_file(path).
  1b. Use it.
  1c. On exit:
      - Close it with trie_free().

Scenario 2: Loading trie from file, updating file when finished.

  2a. Open trie with trie_new_from_file(path).
  2b. Use/update it.
  2c. On exit:
      - If trie_is_dirty(), then trie_save().
      - Close it with trie_free().

Scenario 3: Create a new trie, saving it when finished.

  3a. Prepare an alphabet map:
      - Create new alphabet map with alpha_map_new().
      - Add ranges with alpha_map_add_range().
  3b. Create new trie with trie_new(alpha_map).
  3c. Free the alphabet map with alpha_map_free().
  3d. Use/update the trie.
  3e. On exit:
      - If trie_is_dirty(), then trie_save().
      - Close the trie with trie_free().

Scenario 4: Create temporary trie, disposing it when finished.

  4a. Prepare an alphabet map:
      - Create new alphabet map with alpha_map_new().
      - Add ranges with alpha_map_add_range().
  4b. Create new trie with trie_new(alpha_map).
  4c. Free the alphabet map with alpha_map_free().
  4d. Use/update the trie.
  4e. On exit:
      - Close the trie with trie_free().

2.2 No More SBTrie

In datrie 0.1.x, SBTrie provided a wrapper around the Trie implementation,
converting between real character codes and trie internal codes. This was
for compactness: a contiguous range of character codes allows a more compact
sparse table allocation, while the real alphabet set need not be contiguous.

In datrie 0.2.x, however, this mapping feature has been merged into the Trie
class to reduce call layers, so there is no SBTrie any more. You can call
Trie directly in the same way you called SBTrie in 0.1.x.

2.3 Characters are Now Unicode

datrie was previously planned to support multiple kinds of character
encodings, with single-byte encoding as the only available implementation
for the time being. However, as there have been many requests for Unicode
support, Unicode seems to be the most useful choice, since all other
encodings can be converted into it. Furthermore, as datrie is mostly used in
a program's critical path, too many layers can become a bottleneck. So, only
Unicode is accepted in this version.

It is now the application's duty to convert its keys into Unicode before
passing them to datrie. This also leaves room for any caching the
application may want to do.

2.4 New Public APIs for Alphabet Map

As AlphaMap (alphabet map) is now necessary for creating a new empty trie,
the APIs for manipulating this data are now exposed in the public scope. See
the AlphaMap header file for the details.

2.5 Extensions to TrieState

trie_state_copy()

  Performance profiling showed that allocating and freeing TrieState
  consumes a noticeable amount of CPU time, so reusing an existing TrieState
  where possible does help. This function copies TrieState data into an
  existing state, as a cheaper alternative to trie_state_clone().

trie_state_is_single()

  Sometimes, checking whether a TrieState is a leaf state is too expensive
  for a program's critical path.
  A leaf check has to test both whether the state is on a non-branching path
  (that is, whether it is in a suffix node) and whether it can be walked by
  a terminator. When a program only needs the former check and not the
  latter, this function can be used instead.

3. Changes to TrieTool

3.1 Renaming

To allow co-existence with the 0.1.x trietool, the 0.2.x trietool is named
trietool-0.2.

3.2 '*.abm' Instead of '*.sbm'

As SBTrie has been eliminated in datrie 0.2.x, the corresponding '*.sbm'
(single-byte map) input file is also obsolete. It is now renamed to '*.abm'
(alphabet map), and its format is redefined to be Unicode-based: all
alphabet character ranges are defined in Unicode.

Besides, the '*.abm' file is required only once, at trie creation time. It
is not needed at deployment, as the alphabet map is already included in the
single trie file.

3.3 Encoding Conversion Support

As datrie is now Unicode-based, conversion from other encodings can be
useful. This is possible for the word list operations, namely add-list and
delete-list, through the additional '-e {enc}' or '--encoding {enc}' option.
This option specifies the character encoding of the word list file, and
trietool-0.2 converts the contents to Unicode on the fly.
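
Applications that call the library directly get no such conversion and must
convert keys themselves, as noted in section 2.3. The following is a minimal
sketch of what that could look like for TIS-620 keys. The function name
tis620_to_unicode() is hypothetical and not part of datrie. The sketch uses
uint32_t for code points on the assumption that datrie's AlphaChar is a
32-bit unsigned type, and it relies only on the standard TIS-620 layout
(ASCII in the lower half, Thai characters at a fixed offset from the Unicode
Thai block), so it compiles without the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of datrie.
 *
 * Converts a NUL-terminated TIS-620 string into a 0-terminated array of
 * Unicode code points. Returns the number of code points written (not
 * counting the terminator), or -1 if the input contains a byte that is
 * unassigned in TIS-620 or the result does not fit in 'out'. */
long tis620_to_unicode(const unsigned char *in, uint32_t *out, size_t out_len)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);

    for (; *in != 0; ++in) {
        uint32_t uc;

        if (*in < 0x80) {
            /* ASCII maps to itself. */
            uc = *in;
        } else if ((*in >= 0xA1 && *in <= 0xDA) ||
                   (*in >= 0xDF && *in <= 0xFB)) {
            /* Thai characters: 0xA1 corresponds to U+0E01, and so on. */
            uc = 0x0E00u + (uint32_t)(*in - 0xA0);
        } else {
            /* Byte is unassigned in TIS-620. */
            return -1;
        }

        /* Always keep one slot free for the terminator. */
        if (n + 1 >= out_len)
            return -1;
        out[n++] = uc;
    }

    if (n >= out_len)
        return -1;
    out[n] = 0;
    return (long) n;
}
```

The resulting array is in the 0-terminated form that a Unicode key is
assumed to take when passed to the trie functions. For encodings other than
TIS-620, a general-purpose converter such as iconv is likely a better fit
than a hand-written table.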