docs/reference/glib/tmpl/gregex.sgml

   1 <!-- ##### SECTION Title ##### -->
   2 Perl-compatible regular expressions
   3
   4 <!-- ##### SECTION Short_Description ##### -->
   5 matches strings against regular expressions.
   6
   7 <!-- ##### SECTION Long_Description ##### -->
   8 <para>
   9 The <function>g_regex_*()</function> functions implement regular
  10 expression pattern matching using syntax and semantics similar to
  11 Perl regular expressions.
  12 </para>
  13 <para>
  14 Some functions accept a <parameter>start_position</parameter> argument.
  15 Setting it differs from passing a shortened string and setting
  16 #G_REGEX_MATCH_NOTBOL when the pattern begins with any kind of
  17 lookbehind assertion.
  18 For example, consider the pattern "\Biss\B", which finds occurrences of "iss"
  19 in the middle of words. ("\B" matches only if the current position in the
  20 subject is not a word boundary.) When applied to the substring of "Mississippi"
  21 starting at byte offset 4, namely "issippi", it does not match, because "\B" is
  22 always false at the start of the subject, which is deemed to be a word
  23 boundary. However, if the entire string is passed, but with
  24 <parameter>start_position</parameter> set to 4, it finds the second
  25 occurrence of "iss" because it is able to look behind the starting point
  26 to discover that it is preceded by a letter.
  27 </para>
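<para>
The difference can be illustrated without GRegex at all. The following is a
minimal sketch that hard-codes the "\Biss\B" example in plain C, assuming
ASCII word characters only; the helper names (is_word, not_word_boundary,
find_inner_iss) are hypothetical and are not part of GLib or PCRE.
</para>

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A "word" byte in the ASCII sense used by "\w": letter, digit or '_'. */
static int is_word(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* "\B" holds at pos when pos is NOT a word boundary, that is, when the
 * bytes on both sides are both word bytes or both non-word bytes.
 * Positions before the first byte and after the last count as non-word. */
static int not_word_boundary(const char *s, size_t len, size_t pos)
{
    int before = pos > 0 && is_word((unsigned char) s[pos - 1]);
    int after = pos < len && is_word((unsigned char) s[pos]);
    return before == after;
}

/* Hand-rolled search for "\Biss\B" beginning at byte offset start.
 * Returns the byte offset of the match, or -1 if there is none. */
static long find_inner_iss(const char *s, size_t len, size_t start)
{
    for (size_t i = start; i + 3 <= len; i++) {
        if (memcmp(s + i, "iss", 3) == 0
            && not_word_boundary(s, len, i)
            && not_word_boundary(s, len, i + 3))
            return (long) i;
    }
    return -1;
}
```

<para>
Called on the shortened string, find_inner_iss ("issippi", 7, 0) returns -1
because nothing precedes the first "i". Called on the whole string with a
start offset, find_inner_iss ("Mississippi", 11, 4) returns 4 because the
preceding "s" is visible to the boundary check.
</para>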
  28 <para>
  29 Note that, unless you set the #G_REGEX_RAW flag, all the strings passed
  30 to these functions must be encoded in UTF-8. The lengths and the positions
  31 inside the strings are in bytes and not in characters, so, for instance,
  32 "\xc3\xa0" (i.e. "&agrave;") is two bytes long but it is treated as a single
  33 character. If you set #G_REGEX_RAW, the strings need not be valid UTF-8
  34 and each byte is treated as a character, so "\xc3\xa0" is two bytes
  35 and two characters long.
  36 </para>
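<para>
Because positions are byte offsets, a character index has to be converted
before it is used as a position. The sketch below shows one way to do that
in plain C for valid UTF-8 input; utf8_char_count and utf8_byte_offset are
hypothetical helper names chosen for this illustration, not GLib functions.
</para>

```c
#include <stddef.h>
#include <string.h>

/* Counts characters in a valid UTF-8 string by skipping continuation
 * bytes, which always have the bit pattern 10xxxxxx. */
static size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char) *s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Converts a character index into a byte offset. Returns the length of
 * the string in bytes if char_index is past the last character. */
static size_t utf8_byte_offset(const char *s, size_t char_index)
{
    size_t offset = 0;
    while (s[offset] != '\0') {
        if (((unsigned char) s[offset] & 0xC0) != 0x80) {
            if (char_index == 0)
                break;
            char_index--;
        }
        offset++;
    }
    return offset;
}
```

<para>
For "\xc3\xa0", strlen() reports 2 bytes while utf8_char_count() reports
1 character. In the string "\xc3\xa0" followed by "b", the character at
index 1 starts at byte offset 2.
</para>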
  37 <para>
  38 When matching a pattern, "\n" matches only against a "\n" character in the
  39 string, and "\r" matches only a "\r" character. To match any newline sequence
  40 use "\R". This particular group matches either the two-character sequence
  41 CR + LF ("\r\n"), or one of the single characters LF (linefeed, U+000A, "\n"), VT
  42 (vertical tab, U+000B, "\v"), FF (formfeed, U+000C, "\f"), CR (carriage return,
  43 U+000D, "\r"), NEL (next line, U+0085), LS (line separator, U+2028), or PS
  44 (paragraph separator, U+2029).
  45 </para>
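<para>
The set of sequences listed above can be written down as a small recognizer.
This is only a sketch of the list, assuming UTF-8 input, and says nothing
about how PCRE implements "\R" internally; newline_sequence_length is a
hypothetical name.
</para>

```c
#include <stddef.h>

/* Returns the length in bytes of the newline sequence that starts at
 * s[0], or 0 if no newline sequence starts there. len is the number of
 * bytes available in s, which is assumed to be UTF-8. */
static size_t newline_sequence_length(const unsigned char *s, size_t len)
{
    if (len == 0)
        return 0;

    /* CR + LF is checked first so that it counts as a single sequence. */
    if (s[0] == 0x0D)
        return (len >= 2 && s[1] == 0x0A) ? 2 : 1;

    /* LF, VT and FF are the consecutive code points U+000A..U+000C. */
    if (s[0] >= 0x0A && s[0] <= 0x0C)
        return 1;

    /* NEL (U+0085) is encoded in UTF-8 as C2 85. */
    if (len >= 2 && s[0] == 0xC2 && s[1] == 0x85)
        return 2;

    /* LS (U+2028) and PS (U+2029) are encoded as E2 80 A8 and E2 80 A9. */
    if (len >= 3 && s[0] == 0xE2 && s[1] == 0x80
        && (s[2] == 0xA8 || s[2] == 0xA9))
        return 3;

    return 0;
}
```

<para>
Note that the two-byte input "\r\n" yields 2, not 1 followed by 1, which
mirrors the rule that CR + LF is treated as one newline sequence.
</para>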
  46 <para>
  47 The behaviour of the dot, circumflex, and dollar metacharacters is affected by
  48 newline characters. The default is to recognize any newline character (the same
  49 characters recognized by "\R"). This can be changed with the #G_REGEX_NEWLINE_CR,
  50 #G_REGEX_NEWLINE_LF and #G_REGEX_NEWLINE_CRLF compile options,
  51 and with the #G_REGEX_MATCH_NEWLINE_ANY, #G_REGEX_MATCH_NEWLINE_CR,
  52 #G_REGEX_MATCH_NEWLINE_LF and #G_REGEX_MATCH_NEWLINE_CRLF match options.
  53 These settings are also relevant when compiling a pattern if
  54 #G_REGEX_EXTENDED is set, and an unescaped "#" outside a character class is
  55 encountered. This indicates a comment that lasts until after the next
  56 newline.
  57 </para>
  58 <para>
  59 If you have two threads manipulating the same #GRegex, they must use a
  60 lock to synchronize their operation, as these functions are not threadsafe.
  61 Creating and manipulating different #GRegex structures from different
  62 threads is not a problem.
  63 </para>
  64 <para>
  65 The low-level regular expression functionality is provided by
  66 the excellent <ulink url="http://www.pcre.org/">PCRE</ulink> library
  67 written by Philip Hazel.
  68 </para>
  69
  70 <!-- ##### SECTION See_Also ##### -->
  71 <para>
  72
  73 </para>
  74
  75 <!-- ##### SECTION Stability_Level ##### -->
  76
  77
  78 <!-- ##### ENUM GRegexError ##### -->
  79 <para>
  80 Error codes returned by regular expression functions.
  81 </para>
  82
  83 @G_REGEX_ERROR_COMPILE: Compilation of the regular expression in <function>g_regex_new()</function> failed.
  84 @G_REGEX_ERROR_OPTIMIZE: Optimization of the regular expression in <function>g_regex_optimize()</function> failed.
  85 @G_REGEX_ERROR_REPLACE: Replacement failed due to an ill-formed replacement string.
  86 @G_REGEX_ERROR_MATCH: The match process failed.
  87 @Since: 2.14
  88
  89 <!-- ##### MACRO G_REGEX_ERROR ##### -->
  90 <para>
  91 Error domain for regular expressions. Errors in this domain will be from the #GRegexError enumeration. See #GError for information on error domains.
  92 </para>
  93
  94 @Since: 2.14
  95
  96
  97 <!-- ##### ENUM GRegexCompileFlags ##### -->
  98 <para>
  99 Flags specifying compile-time options.
 100 </para>
 101
 102 @G_REGEX_CASELESS: Letters in the pattern match both upper and lower case
 103 letters. This can be changed within a pattern by a "(?i)" option setting.
 104 @G_REGEX_MULTILINE: By default, GRegex treats the string as consisting
 105 of a single line of characters (even if it actually contains newlines).
 106 The "start of line" metacharacter ("^") matches only at the start of the
 107 string, while the "end of line" metacharacter ("$") matches only at the
 108 end of the string, or before a terminating newline (unless
 109 #G_REGEX_DOLLAR_ENDONLY is set). When #G_REGEX_MULTILINE is set,
 110 the "start of line" and "end of line" constructs match immediately following
 111 or immediately before any newline in the string, respectively, as well
 112 as at the very start and end. This can be changed within a pattern by a
 113 "(?m)" option setting.
 114 @G_REGEX_DOTALL: A dot metacharacter (".") in the pattern matches all
 115 characters, including newlines. Without it, newlines are excluded. This
 116 option can be changed within a pattern by a "(?s)" option setting.
 117 @G_REGEX_EXTENDED: Whitespace data characters in the pattern are
 118 totally ignored except when escaped or inside a character class.
 119 Whitespace does not include the VT character (code 11). In addition,
 120 characters between an unescaped "#" outside a character class and
 121 the next newline character, inclusive, are also ignored. This can be
 122 changed within a pattern by a "(?x)" option setting.
 123 @G_REGEX_ANCHORED: The pattern is forced to be "anchored", that is,
 124 it is constrained to match only at the first matching point in the string
 125 that is being searched. This effect can also be achieved by appropriate
 126 constructs in the pattern itself such as the "^" metacharacter.
 127 @G_REGEX_DOLLAR_ENDONLY: A dollar metacharacter ("$") in the pattern
 128 matches only at the end of the string. Without this option, a dollar also
 129 matches immediately before the final character if it is a newline (but
 130 not before any other newlines). This option is ignored if
 131 #G_REGEX_MULTILINE is set.
 132 @G_REGEX_UNGREEDY: Inverts the "greediness" of the
 133 quantifiers so that they are not greedy by default, but become greedy
 134 if followed by "?". It can also be set by a "(?U)" option setting within
 135 the pattern.
 136 @G_REGEX_RAW: Usually strings must be valid UTF-8 strings; with this
 137 flag they are treated as a raw sequence of bytes.
 138 @G_REGEX_NO_AUTO_CAPTURE: Disables the use of numbered capturing
 139 parentheses in the pattern. Any opening parenthesis that is not followed
 140 by "?" behaves as if it were followed by "?:" but named parentheses can
 141 still be used for capturing (and they acquire numbers in the usual way).
 142 @G_REGEX_DUPNAMES: Names used to identify capturing subpatterns need not
 143 be unique. This can be helpful for certain types of pattern when it is known
 144 that only one instance of the named subpattern can ever be matched.
 145 @G_REGEX_NEWLINE_CR: Usually any newline character is recognized; if this
 146 option is set, the only recognized newline character is '\r'.
 147 @G_REGEX_NEWLINE_LF: Usually any newline character is recognized; if this
 148 option is set, the only recognized newline character is '\n'.
 149 @G_REGEX_NEWLINE_CRLF: Usually any newline character is recognized; if this
 150 option is set, the only recognized newline character sequence is '\r\n'.
 151 @Since: 2.14
 152
 153 <!-- ##### ENUM GRegexMatchFlags ##### -->
 154 <para>
 155 Flags specifying match-time options.
 156 </para>
 157
 158 @G_REGEX_MATCH_ANCHORED: The pattern is forced to be "anchored", that is,
 159 it is constrained to match only at the first matching point in the string
 160 that is being searched. This effect can also be achieved by appropriate
 161 constructs in the pattern itself such as the "^" metacharacter.
 162 @G_REGEX_MATCH_NOTBOL: Specifies that the first character of the string is
 163 not the beginning of a line, so the circumflex metacharacter should not
 164 match before it. Setting this without #G_REGEX_MULTILINE (at compile time)
 165 causes circumflex never to match. This option affects only the behaviour of
 166 the circumflex metacharacter; it does not affect "\A".
 167 @G_REGEX_MATCH_NOTEOL: Specifies that the end of the subject string is
 168 not the end of a line, so the dollar metacharacter should not match it nor
 169 (except in multiline mode) a newline immediately before it. Setting this
 170 without #G_REGEX_MULTILINE (at compile time) causes dollar never to match.
 171 This option affects only the behaviour of the dollar metacharacter; it does
 172 not affect "\Z" or "\z".
 173 @G_REGEX_MATCH_NOTEMPTY: An empty string is not considered to be a valid
 174 match if this option is set. If there are alternatives in the pattern, they
 175 are tried. If all the alternatives match the empty string, the entire match
 176 fails. For example, if the pattern "a?b?" is applied to a string not beginning
 177 with "a" or "b", it matches the empty string at the start of the string.
 178 With this flag set, this match is not valid, so GRegex searches further
 179 into the string for occurrences of "a" or "b".
 180 @G_REGEX_MATCH_PARTIAL: Turns on the partial matching feature. For more
 181 documentation on partial matching, see g_regex_is_partial_match().
 182 @G_REGEX_MATCH_NEWLINE_CR: Overrides the newline definition set when creating
 183 a new #GRegex, setting the '\r' character as line terminator.
 184 @G_REGEX_MATCH_NEWLINE_LF: Overrides the newline definition set when creating
 185 a new #GRegex, setting the '\n' character as line terminator.
 186 @G_REGEX_MATCH_NEWLINE_CRLF: Overrides the newline definition set when creating
 187 a new #GRegex, setting the '\r\n' character sequence as line terminator.
 188 @G_REGEX_MATCH_NEWLINE_ANY: Overrides the newline definition set when creating
 189 a new #GRegex; any newline character or character sequence is recognized.
 190 @Since: 2.14
 191
 192 <!-- ##### STRUCT GRegex ##### -->
 193 <para>
 194 A GRegex is the "compiled" form of a regular expression pattern. This
 195 structure is opaque and its fields cannot be accessed directly.
 196 </para>
 197
 198 @Since: 2.14
 199
 200 <!-- ##### USER_FUNCTION GRegexEvalCallback ##### -->
 201 <para>
 202 Specifies the type of the function passed to g_regex_replace_eval().
 203 It is called for each occurrence of the pattern @regex in @string, and it
 204 should append the replacement to @result.
 205 </para>
 206
 207 <para>
 208 Do not call functions on @regex that modify its internal state, such as
 209 g_regex_match(); if you need to do so, create a temporary copy of
 210 @regex using g_regex_copy().
 211 </para>
 212
 213 @Param1: a #GRegex.
 214 @Param2: the string used to perform matches against.
 215 @Param3: a #GString containing the new string.
 216 @Param4: user data passed to g_regex_replace_eval().
 217 @Returns: %FALSE to continue the replacement process, %TRUE to stop it.
 218 @Since: 2.14
 219
 220
 221 <!-- ##### FUNCTION g_regex_new ##### -->
 222 <para>
 223
 224 </para>
 225
 226 @pattern:
 227 @compile_options:
 228 @match_options:
 229 @error:
 230 @Returns:
 231
 232
 233 <!-- ##### FUNCTION g_regex_free ##### -->
 234 <para>
 235
 236 </para>
 237
 238 @regex:
 239
 240
 241 <!-- ##### FUNCTION g_regex_optimize ##### -->
 242 <para>
 243
 244 </para>
 245
 246 @regex:
 247 @error:
 248 @Returns:
 249
 250
 251 <!-- ##### FUNCTION g_regex_copy ##### -->
 252 <para>
 253
 254 </para>
 255
 256 @regex:
 257 @Returns:
 258
 259
 260 <!-- ##### FUNCTION g_regex_get_pattern ##### -->
 261 <para>
 262
 263 </para>
 264
 265 @regex:
 266 @Returns:
 267
 268
 269 <!-- ##### FUNCTION g_regex_clear ##### -->
 270 <para>
 271
 272 </para>
 273
 274 @regex:
 275
 276
 277 <!-- ##### FUNCTION g_regex_match_simple ##### -->
 278 <para>
 279
 280 </para>
 281
 282 @pattern:
 283 @string:
 284 @compile_options:
 285 @match_options:
 286 @Returns:
 287
 288
 289 <!-- ##### FUNCTION g_regex_match ##### -->
 290 <para>
 291
 292 </para>
 293
 294 @regex:
 295 @string:
 296 @match_options:
 297 @Returns:
 298
 299
 300 <!-- ##### FUNCTION g_regex_match_full ##### -->
 301 <para>
 302
 303 </para>
 304
 305 @regex:
 306 @string:
 307 @string_len:
 308 @start_position:
 309 @match_options:
 310 @error:
 311 @Returns:
 312
 313
 314 <!-- ##### FUNCTION g_regex_match_next ##### -->
 315 <para>
 316
 317 </para>
 318
 319 @regex:
 320 @string:
 321 @match_options:
 322 @Returns:
 323
 324
 325 <!-- ##### FUNCTION g_regex_match_next_full ##### -->
 326 <para>
 327
 328 </para>
 329
 330 @regex:
 331 @string:
 332 @string_len:
 333 @start_position:
 334 @match_options:
 335 @error:
 336 @Returns:
 337
 338
 339 <!-- ##### FUNCTION g_regex_match_all ##### -->
 340 <para>
 341
 342 </para>
 343
 344 @regex:
 345 @string:
 346 @match_options:
 347 @Returns:
 348
 349
 350 <!-- ##### FUNCTION g_regex_match_all_full ##### -->
 351 <para>
 352
 353 </para>
 354
 355 @regex:
 356 @string:
 357 @string_len:
 358 @start_position:
 359 @match_options:
 360 @error:
 361 @Returns:
 362
 363
 364 <!-- ##### FUNCTION g_regex_get_match_count ##### -->
 365 <para>
 366
 367 </para>
 368
 369 @regex:
 370 @Returns:
 371
 372
 373 <!-- ##### FUNCTION g_regex_is_partial_match ##### -->
 374 <para>
 375
 376 </para>
 377
 378 @regex:
 379 @Returns:
 380
 381
 382 <!-- ##### FUNCTION g_regex_fetch ##### -->
 383 <para>
 384
 385 </para>
 386
 387 @regex:
 388 @match_num:
 389 @string:
 390 @Returns:
 391
 392
 393 <!-- ##### FUNCTION g_regex_fetch_pos ##### -->
 394 <para>
 395
 396 </para>
 397
 398 @regex:
 399 @match_num:
 400 @start_pos:
 401 @end_pos:
 402 @Returns:
 403
 404
 405 <!-- ##### FUNCTION g_regex_fetch_named ##### -->
 406 <para>
 407
 408 </para>
 409
 410 @regex:
 411 @name:
 412 @string:
 413 @Returns:
 414
 415
 416 <!-- ##### FUNCTION g_regex_fetch_named_pos ##### -->
 417 <para>
 418
 419 </para>
 420
 421 @regex:
 422 @name:
 423 @start_pos:
 424 @end_pos:
 425 @Returns:
 426
 427
 428 <!-- ##### FUNCTION g_regex_fetch_all ##### -->
 429 <para>
 430
 431 </para>
 432
 433 @regex:
 434 @string:
 435 @Returns:
 436
 437
 438 <!-- ##### FUNCTION g_regex_get_string_number ##### -->
 439 <para>
 440
 441 </para>
 442
 443 @regex:
 444 @name:
 445 @Returns:
 446
 447
 448 <!-- ##### FUNCTION g_regex_split_simple ##### -->
 449 <para>
 450
 451 </para>
 452
 453 @pattern:
 454 @string:
 455 @compile_options:
 456 @match_options:
 457 @Returns:
 458
 459
 460 <!-- ##### FUNCTION g_regex_split ##### -->
 461 <para>
 462
 463 </para>
 464
 465 @regex:
 466 @string:
 467 @match_options:
 468 @Returns:
 469
 470
 471 <!-- ##### FUNCTION g_regex_split_full ##### -->
 472 <para>
 473
 474 </para>
 475
 476 @regex:
 477 @string:
 478 @string_len:
 479 @start_position:
 480 @match_options:
 481 @max_tokens:
 482 @error:
 483 @Returns:
 484
 485
 486 <!-- ##### FUNCTION g_regex_split_next ##### -->
 487 <para>
 488
 489 </para>
 490
 491 @regex:
 492 @string:
 493 @match_options:
 494 @Returns:
 495
 496
 497 <!-- ##### FUNCTION g_regex_split_next_full ##### -->
 498 <para>
 499
 500 </para>
 501
 502 @regex:
 503 @string:
 504 @string_len:
 505 @start_position:
 506 @match_options:
 507 @error:
 508 @Returns:
 509
 510
 511 <!-- ##### FUNCTION g_regex_expand_references ##### -->
 512 <para>
 513
 514 </para>
 515
 516 @regex:
 517 @string:
 518 @string_to_expand:
 519 @error:
 520 @Returns:
 521
 522
 523 <!-- ##### FUNCTION g_regex_replace ##### -->
 524 <para>
 525
 526 </para>
 527
 528 @regex:
 529 @string:
 530 @string_len:
 531 @start_position:
 532 @replacement:
 533 @match_options:
 534 @error:
 535 @Returns:
 536
 537
 538 <!-- ##### FUNCTION g_regex_replace_literal ##### -->
 539 <para>
 540
 541 </para>
 542
 543 @regex:
 544 @string:
 545 @string_len:
 546 @start_position:
 547 @replacement:
 548 @match_options:
 549 @error:
 550 @Returns:
 551
 552
 553 <!-- ##### FUNCTION g_regex_replace_eval ##### -->
 554 <para>
 555
 556 </para>
 557
 558 @regex:
 559 @string:
 560 @string_len:
 561 @start_position:
 562 @match_options:
 563 @eval:
 564 @user_data:
 565 @error:
 566 @Returns:
 567
 568
 569 <!-- ##### FUNCTION g_regex_escape_string ##### -->
 570 <para>
 571
 572 </para>
 573
 574 @string:
 575 @length:
 576 @Returns:
 577
 578