1 Copyright (C) 1992, 1997-2002, 2004-2014 Free Software Foundation, Inc.
3 Copying and distribution of this file, with or without modification,
4 are permitted in any medium without royalty provided the copyright
5 notice and this notice are preserved.
11 See where we are with UTF-8 performance.
13 Merge Debian patches 55-bigfile.patch, 69-mbtowc.patch and
14 70-man_apostrophe.patch. Go through patches in Savannah.
16 Cleanup of the grep(), grepdir(), recursion (the "main loop") to use fts.
17 Fix --directories=read.
19 Write better Texinfo documentation for grep. The manual page would be a
20 good place to start, but Info documents are also supposed to contain a
21 tutorial and examples.
23 Some test in tests/spencer2.tests should have failed! Need to filter out
24 some bugs in dfa.[ch]/regex.[ch].
28 GNU grep does 32-bit arithmetic, it needs to move to 64-bit (i.e.
31 Lazy dynamic linking of libpcre.
33 Check FreeBSD's integration of zgrep (-Z) and bzgrep (-J) in one
34 binary. Is there a possibility of doing even better by automatically
35 checking the magic of binary files ourselves (0x1F 0x8B for gzip, 0x1F
36 0x9D for compress, and 0x42 0x5A 0x68 for bzip2)? Once what to do with
37 libpcre is decided, do the same for libz and libbz2.
44 Check <http://tony.abou-assaleh.net/greps.html>. Take a look at these
45 and consider opportunities for merging or cloning:
47 -- ja-grep's mlb2 patch (Japanese grep)
48 <ftp://ftp.freebsd.org/pub/FreeBSD/ports/distfiles/grep-2.4.2-mlb2.patch.gz>
49 -- lgrep (from lv, a Powerful Multilingual File Viewer / Grep)
50 <http://www.ff.iij4u.or.jp/~nrt/lv/>;
51 -- cgrep (Context grep) <http://plg.uwaterloo.ca/~ftp/mt/cgrep/>
53 -- sgrep (Struct grep) <http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>;
54 -- agrep (Approximate grep) <http://www.tgries.de/agrep/>,
56 -- nr-grep (Nondeterministic reverse grep)
57 <http://www.dcc.uchile.cl/~gnavarro/software/>;
58 -- ggrep (Grouse grep) <http://www.grouse.com.au/ggrep/>;
59 -- grep.py (Python grep) <http://www.vdesmedt.com/~vds2212/grep.html>;
60 -- freegrep <http://www.vocito.com/downloads/software/grep/>;
62 Check some new algorithms for matching; talk to Karl Berry and Nelson.
63 Sunday's "Quick Search" Algorithm (CACM 33, 1990-08-08 pp. 132-142)
64 claim that his algorithm is faster than Boyer-More. Worth checking.
66 Fix the DFA matcher to never use exponential space. (Fortunately, these
70 ============================
71 Standards: POSIX and Unicode
72 ============================
74 For POSIX compliance, see p10003.x. Current support for the POSIX [= =]
75 and [. .] constructs is limited. This is difficult because it requires
76 locale-dependent details of the character set and collating sequence,
77 but POSIX does not standardize any method for accessing this information!
79 For Unicode, interesting things to check include the Unicode Standard
80 <http://www.unicode.org/standard/standard.html> and the Unicode Technical
81 Standard #18 (<http://www.unicode.org/reports/tr18/> “Unicode Regular
82 Expressions”). Talk to Bruno Haible who's maintaining GNU libunistring.
83 See also Unicode Standard Annex #15 (<http://www.unicode.org/reports/tr15/>
84 “Unicode Normalization Forms”), already implemented by GNU libunistring.
86 In particular, --ignore-case needs to be evaluated against the standards.
87 We may want to deviate from POSIX if Unicode provides better or clearer
90 POSIX and --ignore-case
91 -----------------------
93 For this issue, interesting things to check in POSIX include the
94 Volume “Base Definitions (XBD)”, Chapter “Regular Expressions” and in
95 particular Section “Regular Expression General Requirements” and its
96 paragraph about caseless matching (note that this may not have been
97 fully thought through and that this text may be self-contradicting
98 [specifically: “of either data or patterns” versus all the rest]).
100 In particular, consider the following with POSIX's approach to case
101 folding in mind. Assume a non-Turkic locale with a character
102 repertoire reduced to the following various forms of “LATIN LETTER I”:
104 0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
105 0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
106 0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
107 0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
109 First note the differing UTF-8 octet lengths of U+0049 (0x49) and
110 U+0069 (0x69) versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1). This
111 implies that whole UTF-8 strings cannot be case-converted in place,
112 using the same memory buffer, and that the needed octet-size of the
113 new buffer cannot merely be guessed (although there's a simple upper
114 bound of six times the size of the input, as the longest UTF-8
115 encoding of any character is six bytes).
124 where lc() and uc() denote lower-case and upper-case conversions.
126 There are several candidate --ignore-case logics (including the one
131 if (lc(input_wchar) == lc(pattern_wchar))
133 logic leads to the following matches:
142 There is a lack of symmetry between CAPITAL and SMALL LETTERs with
147 if (uc(input_wchar) == uc(pattern_wchar))
149 logic leads to the following matches:
158 There is a lack of symmetry between CAPITAL and SMALL LETTERs with
163 if ( lc(input_wchar) == lc(pattern_wchar)
164 || uc(input_wchar) == uc(pattern_wchar))
166 logic leads to the following matches:
175 There is some elegance and symmetry with this. But there are
176 potentially two conversions to be made per input character. If the
177 pattern is pre-converted, two copies of it need to be kept and used in
178 a mutually coherent fashion.
182 if ( input_wchar == pattern_wchar
183 || lc(input_wchar) == pattern_wchar
184 || uc(input_wchar) == pattern_wchar)
186 logic (as mandated by POSIX) leads to the following matches:
195 There is a different CAPITAL/SMALL symmetry with this. But there's
196 also a loss of pattern/input symmetry that's unique to it. Also there
197 are potentially two conversions to be made per input character.
201 if (lc(uc(input_wchar)) == lc(uc(pattern_wchar)))
204 logic leads to the following matches:
213 This shows total symmetry and transitivity
214 (at least in this example analysis).
215 There are two conversions to be made per input character,
216 but support could be added for having
217 a single straight mapping performing
218 a composition of the two conversions.
220 Any optimization in the implementation of each logic
221 must not change its basic semantic.
224 Unicode and --ignore-case
225 -------------------------
227 For this issue, interesting things to check in Unicode include:
229 -- The Unicode Standard, Chapter 3
230 (<http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf>
231 “Conformance”), Section 3.13 (“Default Case Operations”) and the
232 toCasefold() case conversion operation.
234 -- The Unicode Standard, Chapter 4
235 (<http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf>
236 “Character Properties”), Section 4.2 (“Case—Normative”) and
237 the <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>
238 SpecialCasing.txt and
239 <http://www.unicode.org/Public/UNIDATA/CaseFolding.txt>
240 CaseFolding.txt files from the
241 <http://www.unicode.org/Public/UNIDATA/UCD.html> Unicode
244 The <http://www.unicode.org/standard/standard.html> Unicode Standard,
245 Chapter 5 (“<http://www.unicode.org/versions/Unicode5.2.0/ch05.pdf>
246 Implementation Guidelines ”), Section 5.18 (“Case Mappings”),
247 Subsection “Caseless Matching”.
249 The Unicode <http://www.unicode.org/charts/case/> case charts.
253 if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string))
255 logic for caseless matching. Let's consider the “LATIN LETTER I”
256 example mentioned above. In a non-Turkic locale, simple case folding
259 toCasefold_simple(U+0049) = U+0069
260 toCasefold_simple(U+0069) = U+0069
261 toCasefold_simple(U+0130) = U+0130
262 toCasefold_simple(U+0131) = U+0131
264 which leads to the following matches:
273 This is different from anything so far!
275 In a non-Turkic locale, full case folding yields
277 toCasefold_full(U+0049) = U+0069
278 toCasefold_full(U+0069) = U+0069
279 toCasefold_full(U+0130) = <U+0069, U+0307>
280 toCasefold_full(U+0131) = U+0131
284 0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;
286 which leads to the following matches:
297 Note that having toCasefold(U+0131), simple or full, map to itself
298 instead of U+0069 is in contradiction with the rules of Section 5.18
299 of the Unicode Standard since toUpperCase(U+0131) is U+0049. Same
300 thing for toCasefold_simple(U+0130) since toLowerCase(U+0131) is
301 U+0069. The justification for the weird toCasefold_full(U+0130)
302 mapping is unknown; it doesn't even make sense to add a dot (U+0307)
303 to a letter that already has one (U+0069). It would have been so
304 simple to put them all in the same equivalence class!
306 Otherwise, also consider the following problem with Unicode's approach
307 on case folding in mind. Assume that we want to perform
309 echo 'AßBC | grep -i 'Sb'
313 input: U+0041 U+00DF U+0042 U+0043 U+000A
314 pattern: U+0053 U+0062
316 Following “CaseFolding-4.1.0.txt”, applying the toCasefold()
317 transformation to these yields
319 input: U+0061 U+0073 U+0073 U+0062 U+0063 U+000A
320 pattern: U+0073 U+0062
322 so, according to this approach, the input should match the pattern. As
323 long as the original input line is to be reported to the user as a
324 whole, there is no problem (from the user's point-of-view;
325 implementation is complicated by this).
327 However, consider both these GNU extensions:
329 echo 'AßBC' | grep -i --only-matching 'Sb'
330 echo 'AßBC' | grep -i --color=always 'Sb'
332 What is to be reported in these cases, since the match begins in the
333 *middle* of the original input character 'ß'?
335 Note that Unicode's toCasefold() cannot be implemented in terms of
336 POSIX' towctrans() since that can only return a single wint_t value
337 per input wint_t value.