TODO

   1   Copyright (C) 1992, 1997-2002, 2004-2012 Free Software Foundation, Inc.
   2
   3   Copying and distribution of this file, with or without modification,
   4   are permitted in any medium without royalty provided the copyright
   5   notice and this notice are preserved.
   6
   7 ===============
   8 Short term work
   9 ===============
  10
  11 See where we are with UTF-8 performance.
  12
  13 Merge Debian patches 55-bigfile.patch, 69-mbtowc.patch and
  14 70-man_apostrophe.patch.  Go through patches in Savannah.
  15
  16 Cleanup of the grep(), grepdir(), recursion (the "main loop") to use fts.
  17 Fix --directories=read.
  18
  19 Write better Texinfo documentation for grep.  The manual page would be a
  20 good place to start, but Info documents are also supposed to contain a
  21 tutorial and examples.
  22
  23 Some test in tests/spencer2.tests should have failed!  Need to filter out
  24 some bugs in dfa.[ch]/regex.[ch].
  25
  26 Multithreading?
  27
  28 GNU grep does 32-bit arithmetic, it needs to move to 64-bit (i.e.
  29 size_t/ptrdiff_t).
  30
  31 Lazy dynamic linking of libpcre.
  32
  33 Check FreeBSD's integration of zgrep (-Z) and bzgrep (-J) in one
  34 binary. Is there a possibility of doing even better by automatically
  35 checking the magic of binary files ourselves (0x1F 0x8B for gzip, 0x1F
  36 0x9D for compress, and 0x42 0x5A 0x68 for bzip2)?  Once what to do with
  37 libpcre is decided, do the same for libz and libbz2.
  38
  39 \f
  40 ==================
  41 Matching algorithms
  42 ==================
  43
  44 Check <http://tony.abou-assaleh.net/greps.html>.  Take a look at these
  45 and consider opportunities for merging or cloning:
  46
  47    -- ja-grep's mlb2 patch (Japanese grep)
  48       <ftp://ftp.freebsd.org/pub/FreeBSD/ports/distfiles/grep-2.4.2-mlb2.patch.gz>
  49    -- lgrep (from lv, a Powerful Multilingual File Viewer / Grep)
  50       <http://www.ff.iij4u.or.jp/~nrt/lv/>;
  51    -- cgrep (Context grep) <http://plg.uwaterloo.ca/~ftp/mt/cgrep/>
  52       seems like nice work;
  53    -- sgrep (Struct grep) <http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>;
  54    -- agrep (Approximate grep) <http://www.tgries.de/agrep/>,
  55       from glimpse;
  56    -- nr-grep (Nondeterministic reverse grep)
  57       <http://www.dcc.uchile.cl/~gnavarro/software/>;
  58    -- ggrep (Grouse grep) <http://www.grouse.com.au/ggrep/>;
  59    -- grep.py (Python grep) <http://www.vdesmedt.com/~vds2212/grep.html>;
  60    -- freegrep <http://www.vocito.com/downloads/software/grep/>;
  61
  62 Check some new algorithms for matching; talk to Karl Berry and Nelson.
  63 Sunday's "Quick Search" Algorithm (CACM 33, 1990-08-08 pp. 132-142)
  64 claim that his algorithm is faster than Boyer-More. Worth checking.
  65
  66 Fix the DFA matcher to never use exponential space.  (Fortunately, these
  67 cases are rare.)
  68
  69 \f
  70 ============================
  71 Standards: POSIX and Unicode
  72 ============================
  73
  74 For POSIX compliance, see p10003.x. Current support for the POSIX [= =]
  75 and [. .] constructs is limited. This is difficult because it requires
  76 locale-dependent details of the character set and collating sequence,
  77 but POSIX does not standardize any method for accessing this information!
  78
  79 For Unicode, interesting things to check include the Unicode Standard
  80 <http://www.unicode.org/standard/standard.html> and the Unicode Technical
  81 Standard #18 (<http://www.unicode.org/reports/tr18/> “Unicode Regular
  82 Expressions”).  Talk to Bruno Haible who's maintaining GNU libunistring.
  83 See also Unicode Standard Annex #15 (<http://www.unicode.org/reports/tr15/>
  84 “Unicode Normalization Forms”), already implemented by GNU libunistring.
  85
  86 In particular, --ignore-case needs to be evaluated against the standards.
  87 We may want to deviate from POSIX if Unicode provides better or clearer
  88 semantics.
  89
  90 POSIX and --ignore-case
  91 -----------------------
  92
  93 For this issue, interesting things to check in POSIX include the
  94 Volume “Base Definitions (XBD)”, Chapter “Regular Expressions” and in
  95 particular Section “Regular Expression General Requirements” and its
  96 paragraph about caseless matching (note that this may not have been
  97 fully thought through and that this text may be self-contradicting
  98 [specifically: “of either data or patterns” versus all the rest]).
  99
 100 In particular, consider the following with POSIX's approach to case
 101 folding in mind. Assume a non-Turkic locale with a character
 102 repertoire reduced to the following various forms of “LATIN LETTER I”:
 103
 104 0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
 105 0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
 106 0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
 107 0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
 108
 109 First note the differing UTF-8 octet lengths of U+0049 (0x49) and
 110 U+0069 (0x69) versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1). This
 111 implies that whole UTF-8 strings cannot be case-converted in place,
 112 using the same memory buffer, and that the needed octet-size of the
 113 new buffer cannot merely be guessed (although there's a simple upper
 114 bound of six times the size of the input, as the longest UTF-8
 115 encoding of any character is six bytes).
 116
 117 We have
 118
 119 lc(I) = i, uc(I) = I
 120 lc(i) = i, uc(i) = I
 121 lc(İ) = i, uc(İ) = İ
 122 lc(ı) = ı, uc(ı) = I
 123
 124 where lc() and uc() denote lower-case and upper-case conversions.
 125
 126 There are several candidate --ignore-case logics (including the one
 127 mandated by POSIX):
 128
 129 Using the
 130
 131 if (lc(input_wchar) == lc(pattern_wchar))
 132
 133 logic leads to the following matches:
 134
 135   \in  I  i  İ  ı
 136 pat\   ----------
 137 "I" |  Y  Y  Y  n
 138 "i" |  Y  Y  Y  n
 139 "İ" |  Y  Y  Y  n
 140 "ı" |  n  n  n  Y
 141
 142 There is a lack of symmetry between CAPITAL and SMALL LETTERs with
 143 this.
 144
 145 Using the
 146
 147 if (uc(input_wchar) == uc(pattern_wchar))
 148
 149 logic leads to the following matches:
 150
 151   \in  I  i  İ  ı
 152 pat\   ----------
 153 "I" |  Y  Y  n  Y
 154 "i" |  Y  Y  n  Y
 155 "İ" |  n  n  Y  n
 156 "ı" |  Y  Y  n  Y
 157
 158 There is a lack of symmetry between CAPITAL and SMALL LETTERs with
 159 this.
 160
 161 Using the
 162
 163 if (   lc(input_wchar) == lc(pattern_wchar)
 164 || uc(input_wchar) == uc(pattern_wchar))
 165
 166 logic leads to the following matches:
 167
 168   \in  I  i  İ  ı
 169 pat\   ----------
 170 "I" |  Y  Y  Y  Y
 171 "i" |  Y  Y  Y  Y
 172 "İ" |  Y  Y  Y  n
 173 "ı" |  Y  Y  n  Y
 174
 175 There is some elegance and symmetry with this. But there are
 176 potentially two conversions to be made per input character. If the
 177 pattern is pre-converted, two copies of it need to be kept and used in
 178 a mutually coherent fashion.
 179
 180 Using the
 181
 182 if (      input_wchar  == pattern_wchar
 183 || lc(input_wchar) == pattern_wchar
 184 || uc(input_wchar) == pattern_wchar)
 185
 186 logic (as mandated by POSIX) leads to the following matches:
 187
 188   \in  I  i  İ  ı
 189 pat\   ----------
 190 "I" |  Y  Y  n  Y
 191 "i" |  Y  Y  Y  n
 192 "İ" |  n  n  Y  n
 193 "ı" |  n  n  n  Y
 194
 195 There is a different CAPITAL/SMALL symmetry with this. But there's
 196 also a loss of pattern/input symmetry that's unique to it. Also there
 197 are potentially two conversions to be made per input character.
 198
 199 Using the
 200
 201 if (lc(uc(input_wchar)) == lc(uc(pattern_wchar)))
 202
 203
 204 logic leads to the following matches:
 205
 206   \in  I  i  İ  ı
 207 pat\   ----------
 208 "I" |  Y  Y  Y  Y
 209 "i" |  Y  Y  Y  Y
 210 "İ" |  Y  Y  Y  Y
 211 "ı" |  Y  Y  Y  Y
 212
 213 This shows total symmetry and transitivity
 214 (at least in this example analysis).
 215 There are two conversions to be made per input character,
 216 but support could be added for having
 217 a single straight mapping performing
 218 a composition of the two conversions.
 219
 220 Any optimization in the implementation of each logic
 221 must not change its basic semantic.
 222
 223
 224 Unicode and --ignore-case
 225 -------------------------
 226
 227 For this issue, interesting things to check in Unicode include:
 228
 229   -- The Unicode Standard, Chapter 3
 230      (<http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf>
 231      “Conformance”), Section 3.13 (“Default Case Operations”) and the
 232      toCasefold() case conversion operation.
 233
 234   -- The Unicode Standard, Chapter 4
 235      (<http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf>
 236      “Character Properties”), Section 4.2 (“Case—Normative”) and
 237      the <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>
 238      SpecialCasing.txt and
 239      <http://www.unicode.org/Public/UNIDATA/CaseFolding.txt>
 240      CaseFolding.txt files from the
 241      <http://www.unicode.org/Public/UNIDATA/UCD.html> Unicode
 242      Character Database .
 243
 244 The <http://www.unicode.org/standard/standard.html> Unicode Standard,
 245 Chapter 5 (“<http://www.unicode.org/versions/Unicode5.2.0/ch05.pdf>
 246 Implementation Guidelines ”), Section 5.18 (“Case Mappings”),
 247 Subsection “Caseless Matching”.
 248
 249 The Unicode <http://www.unicode.org/charts/case/> case charts.
 250
 251 Unicode uses the
 252
 253 if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string))
 254
 255 logic for caseless matching. Let's consider the “LATIN LETTER I”
 256 example mentioned above. In a non-Turkic locale, simple case folding
 257 yields
 258
 259 toCasefold_simple(U+0049) = U+0069
 260 toCasefold_simple(U+0069) = U+0069
 261 toCasefold_simple(U+0130) = U+0130
 262 toCasefold_simple(U+0131) = U+0131
 263
 264 which leads to the following matches:
 265
 266   \in  I  i  İ  ı
 267 pat\   ----------
 268 "I" |  Y  Y  n  n
 269 "i" |  Y  Y  n  n
 270 "İ" |  n  n  Y  n
 271 "ı" |  n  n  n  Y
 272
 273 This is different from anything so far!
 274
 275 In a non-Turkic locale, full case folding yields
 276
 277 toCasefold_full(U+0049) = U+0069
 278 toCasefold_full(U+0069) = U+0069
 279 toCasefold_full(U+0130) = <U+0069, U+0307>
 280 toCasefold_full(U+0131) = U+0131
 281
 282 with
 283
 284 0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;
 285
 286 which leads to the following matches:
 287
 288   \in  I  i  İ  ı
 289 pat\   ----------
 290 "I" |  Y  Y  *  n
 291 "i" |  Y  Y  *  n
 292 "İ" |  n  n  Y  n
 293 "ı" |  n  n  n  Y
 294
 295 This is just sad!
 296
 297 Note that having toCasefold(U+0131), simple or full, map to itself
 298 instead of U+0069 is in contradiction with the rules of Section 5.18
 299 of the Unicode Standard since toUpperCase(U+0131) is U+0049. Same
 300 thing for toCasefold_simple(U+0130) since toLowerCase(U+0131) is
 301 U+0069. The justification for the weird toCasefold_full(U+0130)
 302 mapping is unknown; it doesn't even make sense to add a dot (U+0307)
 303 to a letter that already has one (U+0069). It would have been so
 304 simple to put them all in the same equivalence class!
 305
 306 Otherwise, also consider the following problem with Unicode's approach
 307 on case folding in mind. Assume that we want to perform
 308
 309 echo 'AßBC | grep -i 'Sb'
 310
 311 which corresponds to
 312
 313 input:    U+0041 U+00DF U+0042 U+0043 U+000A
 314 pattern:  U+0053 U+0062
 315
 316 Following “CaseFolding-4.1.0.txt”, applying the toCasefold()
 317 transformation to these yields
 318
 319 input:    U+0061 U+0073 U+0073 U+0062 U+0063 U+000A
 320 pattern:                U+0073 U+0062
 321
 322 so, according to this approach, the input should match the pattern. As
 323 long as the original input line is to be reported to the user as a
 324 whole, there is no problem (from the user's point-of-view;
 325 implementation is complicated by this).
 326
 327 However, consider both these GNU extensions:
 328
 329 echo 'AßBC' | grep -i --only-matching 'Sb'
 330 echo 'AßBC' | grep -i --color=always  'Sb'
 331
 332 What is to be reported in these cases, since the match begins in the
 333 *middle* of the original input character 'ß'?
 334
 335 Note that Unicode's toCasefold() cannot be implemented in terms of
 336 POSIX' towctrans() since that can only return a single wint_t value
 337 per input wint_t value.