From 243effed56c5bea983c9cdbdc24b329f19ff0aad Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Mon, 3 Dec 2012 16:57:05 -0700
Subject: [PATCH] handy.h: Change documentation for perlapi

This documents several more of the character classification macros,
including all variants of them. There are no code changes. The
READ_XDIGIT macro was moved to "Miscellaneous Functions", as it really
isn't character classification. Several of the macros remain
undocumented because I'm not comfortable yet about their names and/or
functionality.
---
 handy.h           | 189 ++++++++++++++++++++++++++++++++++++++++++------------
 pod/perldelta.pod |   7 +-
 2 files changed, 153 insertions(+), 43 deletions(-)

diff --git a/handy.h b/handy.h
index 2d45075..c7a8829 100644
--- a/handy.h
+++ b/handy.h
@@ -471,84 +471,188 @@ C).
 
 /*
 =head1 Character classes
-There are three variants for all the functions in this section. The base ones
-operate using the character set of the platform Perl is running on. The ones
-with an C<_A> suffix operate on the ASCII character set, and the ones with an
-C<_L1> suffix operate on the full Latin1 character set. All are unaffected by
-locale and by C<use bytes>.
-
-For ASCII platforms, the base function with no suffix and the one with the
-C<_A> suffix are identical. The function with the C<_L1> suffix imposes the
-Latin-1 character set onto the platform. That is, the code points that are
-ASCII are unaffected, since ASCII is a subset of Latin-1. But the non-ASCII
-code points are treated as if they are Latin-1 characters. For example,
-C<isSPACE_L1()> will return true when called with the code point 0xA0, which is
-the Latin-1 NO-BREAK SPACE.
-
-For EBCDIC platforms, the base function with no suffix and the one with the
-C<_L1> suffix should be identical, since, as of this writing, the EBCDIC code
-pages that Perl knows about all are equivalent to Latin-1. The function that
-ends in an C<_A> suffix will not return true unless the specified character also
-has an ASCII equivalent.
+This section is about functions (really macros) that classify characters
+into types, such as punctuation versus alphabetic, etc. Most of these are
+analogous to regular expression character classes. (See
+L<perlrecharclass/POSIX Character Classes>.) There are several variants for
+each class. (Not all macros have all variants; each item below lists the
+ones valid for it.) None are affected by C<use bytes>, and only the ones
+with C<LC> in the name are affected by the current locale.
+
+The base function, e.g., C<isALPHA()>, takes an octet (either a C<char> or a
+C<U8>) as input and returns a boolean as to whether or not the character
+represented by that octet is in the named class based on platform, Unicode, and
+Perl rules. If the input is a number that doesn't fit in an octet, FALSE is
+always returned.
+
+Variant C<isFOO_A> (e.g., C<isALPHA_A()>) will return TRUE only if the input is
+also in the ASCII character set. For ASCII platforms, the base function with
+no suffix and the one with the C<_A> suffix are identical. On EBCDIC
+platforms, the C<_A> suffix function will not return true unless the specified
+character also has an ASCII equivalent.
+
+Variant C<isFOO_L1> operates on the full Latin1 character set. For EBCDIC
+platforms, the base function with no suffix and the one with the C<_L1> suffix
+are identical. For ASCII platforms, the C<_L1> suffix imposes the Latin-1
+character set onto the platform. That is, the code points that are ASCII are
+unaffected, since ASCII is a subset of Latin-1. But the non-ASCII code points
+are treated as if they are Latin-1 characters. For example, C<isSPACE_L1()>
+will return true when called with the code point 0xA0, which is the Latin-1
+NO-BREAK SPACE.
+
+Variant C<isFOO_uni> is like the C<isFOO_L1> variant, but accepts any UV code
+point as input. If the code point is larger than 255, Unicode rules are used
+to determine if it is in the character class. For example,
+C<isWORDCHAR_uni(0x100)> returns TRUE, since 0x100 is LATIN CAPITAL LETTER A WITH
+MACRON in Unicode, and is a word character.
+
+Variant C<isFOO_utf8> is like C<isFOO_uni>, but the input is a pointer to a
+(known to be well-formed) UTF-8 encoded string (C<U8*> or C<char*>). The
+classification of just the first character in the string is tested.
+
+Variant C<isFOO_LC> is like the C<isFOO_A> and C<isFOO_L1> variants, but uses
+the C library function that gives the named classification instead of
+hard-coded rules. For example, C<isDIGIT_LC()> returns the result of calling
+C<isdigit()>. This means that the result is based on the current locale, which
+is what C<LC> in the name stands for. FALSE is always returned if the input
+won't fit into an octet.
+
+Variant C<isFOO_LC_uvchr> is like C<isFOO_LC>, but is defined on any UV. It
+returns the same as C<isFOO_LC> for input code points less than 256, and
+returns the hard-coded, not-affected-by-locale, Unicode results for larger ones.
+
+Variant C<isFOO_LC_utf8> is like C<isFOO_LC_uvchr>, but the input is a pointer to a
+(known to be well-formed) UTF-8 encoded string (C<U8*> or C<char*>). The
+classification of just the first character in the string is tested.
 
 =for apidoc Am|bool|isALPHA|char ch
 Returns a boolean indicating whether the specified character is an
-alphabetic character in the platform's native character set.
+alphabetic character in the platform's native character set, analogous to
+C<m/[[:alpha:]]/>.
 See the L<top of this section|/Character classes> for an explanation of variants
-C<isALPHA_A> and C<isALPHA_L1>.
+C<isALPHA_A>, C<isALPHA_L1>, C<isALPHA_uni>, C<isALPHA_utf8>, C<isALPHA_LC>,
+C<isALPHA_LC_uvchr>, and C<isALPHA_LC_utf8>.
 
 =for apidoc Am|bool|isASCII|char ch
 Returns a boolean indicating whether the specified character is one of the 128
-characters in the ASCII character set. On non-ASCII platforms, it is if this
+characters in the ASCII character set, analogous to C<m/[[:ascii:]]/>.
+On non-ASCII platforms, it is if this
 character corresponds to an ASCII character. Variants C<isASCII_A()> and
 C<isASCII_L1()> are identical to C<isASCII()>.
+See the L<top of this section|/Character classes> for an explanation of variants
+C<isASCII_uni>, C<isASCII_utf8>, C<isASCII_LC>, C<isASCII_LC_uvchr>, and
+C<isASCII_LC_utf8>. Note, however, that some platforms do not have the C
+library routine C<isascii()>. In these cases, the variants whose names contain
+C<LC> are the same as the corresponding ones without.
+
+=for apidoc Am|bool|isBLANK|char ch
+Returns a boolean indicating whether the specified character is a
+character considered to be a blank in the platform's native character set,
+analogous to C<m/[[:blank:]]/>.
+See the L<top of this section|/Character classes> for an explanation of variants
+C<isBLANK_A>, C<isBLANK_L1>, C<isBLANK_uni>, C<isBLANK_utf8>, C<isBLANK_LC>,
+C<isBLANK_LC_uvchr>, and C<isBLANK_LC_utf8>. Note, however, that some
+platforms do not have the C library routine C<isblank()>. In these cases, the
+variants whose names contain C<LC> are the same as the corresponding ones
+without.
+
+=for apidoc Am|bool|isCNTRL|char ch
+Returns a boolean indicating whether the specified character is a
+control character in the platform's native character set,
+analogous to C<m/[[:cntrl:]]/>.
+See the L<top of this section|/Character classes> for an explanation of variants
+C<isCNTRL_A>, C<isCNTRL_L1>, C<isCNTRL_uni>, C<isCNTRL_utf8>, C<isCNTRL_LC>,
+C<isCNTRL_LC_uvchr>, and C<isCNTRL_LC_utf8>.
 
 =for apidoc Am|bool|isDIGIT|char ch
 Returns a boolean indicating whether the specified character is a
-digit in the platform's native character set.
+digit in the platform's native character set, analogous to C<m/[[:digit:]]/>.
 Variants C<isDIGIT_A> and C<isDIGIT_L1> are identical to C<isDIGIT>.
+See the L<top of this section|/Character classes> for an explanation of variants
+C<isDIGIT_uni>, C<isDIGIT_utf8>, C<isDIGIT_LC>, C<isDIGIT_LC_uvchr>, and
+C<isDIGIT_LC_utf8>.
+
+=for apidoc Am|bool|isGRAPH|char ch
+Returns a boolean indicating whether the specified character is a
+graphic character in the platform's native character set, analogous to
+C<m/[[:graph:]]/>.
+See the L<top of this section|/Character classes> for an explanation of variants
+C<isGRAPH_A>, C<isGRAPH_L1>, C<isGRAPH_uni>, C<isGRAPH_utf8>, C<isGRAPH_LC>,
+C<isGRAPH_LC_uvchr>, and C<isGRAPH_LC_utf8>.
 
 =for apidoc Am|bool|isLOWER|char ch
 Returns a boolean indicating whether the specified character is a
-lowercase character in the platform's native character set.
+lowercase character in the platform's native character set, analogous to
+C<m/[[:lower:]]/>.
 See the L<top of this section|/Character classes> for an explanation of variants
-C<isLOWER_A> and C<isLOWER_L1>.
+C<isLOWER_A>, C<isLOWER_L1>, C<isLOWER_uni>, C<isLOWER_utf8>, C<isLOWER_LC>,
+C<isLOWER_LC_uvchr>, and C<isLOWER_LC_utf8>.
 
 =for apidoc Am|bool|isOCTAL|char ch
 Returns a boolean indicating whether the specified character is an
 octal digit, [0-7] in the platform's native character set.
-Variants C<isOCTAL_A> and C<isOCTAL_L1> are identical to C<isOCTAL>.
+The only two variants are C<isOCTAL_A> and C<isOCTAL_L1>; each is identical to
+C<isOCTAL>.
+
+=for apidoc Am|bool|isPUNCT|char ch
+Returns a boolean indicating whether the specified character is a
+punctuation character in the platform's native character set, analogous to
+C<m/[[:punct:]]/>. Note that the definition of what is punctuation isn't as
+straightforward as one might desire. See L<perlrecharclass/POSIX Character Classes> for details.
+See the L<top of this section|/Character classes> for an explanation of variants
+C<isPUNCT_A>, C<isPUNCT_L1>, C<isPUNCT_uni>, C<isPUNCT_utf8>, C<isPUNCT_LC>,
+C<isPUNCT_LC_uvchr>, and C<isPUNCT_LC_utf8>.
 
 =for apidoc Am|bool|isSPACE|char ch
 Returns a boolean indicating whether the specified character is a
-whitespace character in the platform's native character set. This is the same
-as what C<\s> matches in a regular expression.
+whitespace character in the platform's native character set. This is analogous
+to what C<m/\s/> and C<m/[[:space:]]/> match in a regular expression.
 See the L<top of this section|/Character classes> for an explanation of variants
-C<isSPACE_A> and C<isSPACE_L1>.
+C<isSPACE_A>, C<isSPACE_L1>, C<isSPACE_uni>, C<isSPACE_utf8>, C<isSPACE_LC>,
+C<isSPACE_LC_uvchr>, and C<isSPACE_LC_utf8>.
 
 =for apidoc Am|bool|isUPPER|char ch
 Returns a boolean indicating whether the specified character is an
-uppercase character in the platform's native character set.
+uppercase character in the platform's native character set, analogous to
+C<m/[[:upper:]]/>.
 See the L<top of this section|/Character classes> for an explanation of variants
-C<isUPPER_A> and C<isUPPER_L1>.
+C<isUPPER_A>, C<isUPPER_L1>, C<isUPPER_uni>, C<isUPPER_utf8>, C<isUPPER_LC>,
+C<isUPPER_LC_uvchr>, and C<isUPPER_LC_utf8>.
 
-=for apidoc Am|bool|isWORDCHAR|char ch
+=for apidoc Am|bool|isPRINT|char ch
 Returns a boolean indicating whether the specified character is a
-character that is any of: alphabetic, numeric, or an underscore. This is the
-same as what C<\w> matches in a regular expression.
-C<isALNUM()> is a synonym provided for backward compatibility. Note that it
-does not have the standard C language meaning of alphanumeric, since it matches
-an underscore and the standard meaning does not.
+printable character in the platform's native character set, analogous to
+C<m/[[:print:]]/>.
 See the L<top of this section|/Character classes> for an explanation of variants
-C<isWORDCHAR_A> and C<isWORDCHAR_L1>.
+C<isPRINT_A>, C<isPRINT_L1>, C<isPRINT_uni>, C<isPRINT_utf8>, C<isPRINT_LC>,
+C<isPRINT_LC_uvchr>, and C<isPRINT_LC_utf8>.
+
+=for apidoc Am|bool|isWORDCHAR|char ch
+Returns a boolean indicating whether the specified character is a character
+that is a word character, analogous to what C<m/\w/> and C<m/[[:word:]]/> match
+in a regular expression. A word character is an alphabetic character, a
+decimal digit, a connecting punctuation character (such as an underscore), or
+a "mark" character that attaches to one of those (like some sort of accent).
+C<isALNUM()> is a synonym provided for backward compatibility, even though a
+word character includes more than the standard C language meaning of
+alphanumeric.
+See the L<top of this section|/Character classes> for an explanation of variants
+C<isWORDCHAR_A>, C<isWORDCHAR_L1>, C<isWORDCHAR_uni>, C<isWORDCHAR_utf8>,
+C<isWORDCHAR_LC>, C<isWORDCHAR_LC_uvchr>, and C<isWORDCHAR_LC_utf8>.
 
 =for apidoc Am|bool|isXDIGIT|char ch
 Returns a boolean indicating whether the specified character is a hexadecimal
-digit, [0-9A-Fa-f]. Variants C<isXDIGIT_A()> and C<isXDIGIT_L1()> are
-identical to C<isXDIGIT()>.
+digit. In the ASCII range these are C<[0-9A-Fa-f]>. Variants C<isXDIGIT_A()>
+and C<isXDIGIT_L1()> are identical to C<isXDIGIT()>.
+See the L<top of this section|/Character classes> for an explanation of variants
+C<isXDIGIT_uni>, C<isXDIGIT_utf8>, C<isXDIGIT_LC>, C<isXDIGIT_LC_uvchr>, and
+C<isXDIGIT_LC_utf8>.
+
+=head1 Miscellaneous Functions
 
 =for apidoc Am|U8|READ_XDIGIT|char str*
-Returns the value of a hex digit and advances the string pointer.
+Returns the value of an ASCII-range hex digit and advances the string pointer.
 Behaviour is only well defined when isXDIGIT(*str) is true.
 
 =head1 Character case changing
@@ -563,6 +667,9 @@ character set, if possible; otherwise returns the input character itself.
 
 =cut
 
+Still undocumented are ALNUMC, PSXSPC, VERTSPACE, and IDFIRST, and the other
+toUPPER etc functions
+
 Note that these macros are repeated in Devel::PPPort, so should also be
 patched there. The file as of this writing is
 cpan/Devel-PPPort/parts/inc/misc
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 7220d40..0f1c89a 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -129,13 +129,16 @@ XXX Changes which significantly change existing files in F<pod/> go here.
 However, any changes to F<pod/perldiag.pod> should go in the L</Diagnostics>
 section.
 
-=head3 L<XXX>
+=head3 L<perlapi>
 
 =over 4
 
 =item *
 
-XXX Description of the change here
+There are quite a few macros callable from XS modules that classify
+characters into things like alphabetic, punctuation, etc. More of these
+are now documented, including ones which work on characters whose code
+points are outside the Latin-1 range.
 
 =back
 
-- 
2.7.4
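
Illustration (not part of the patch): the difference between the base,
_A, and _L1 variants documented in the handy.h hunk can be sketched in
standalone C. This is only a rough sketch under stated assumptions. It does
not include perl.h, it assumes an ASCII platform, and every sketch_* name
is a hypothetical stand-in rather than a real handy.h macro. The Latin-1
word-character ranges are hard-coded by hand, and returning false for
inputs above 0xFF in the _L1 stand-in is assumed from the documented
"doesn't fit in an octet" rule.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the documented isWORDCHAR_A behaviour:
 * true only for ASCII letters, ASCII digits, and underscore.
 * Assumes an ASCII platform, so the letter ranges are contiguous. */
bool sketch_isWORDCHAR_A(unsigned long cp)
{
    return (cp >= 'a' && cp <= 'z')
        || (cp >= 'A' && cp <= 'Z')
        || (cp >= '0' && cp <= '9')
        || cp == '_';
}

/* Hypothetical stand-in for the documented isWORDCHAR_L1 behaviour:
 * ASCII code points are classified as above, and code points
 * 0x80..0xFF are treated as Latin-1 characters. Inputs that do not
 * fit in an octet yield false (assumed, see the note above). */
bool sketch_isWORDCHAR_L1(unsigned long cp)
{
    if (cp > 0xFF)
        return false;
    if (cp < 0x80)
        return sketch_isWORDCHAR_A(cp);
    /* Latin-1 letters: 0xAA, 0xB5, 0xBA, and 0xC0..0xFF except the
     * multiplication sign (0xD7) and division sign (0xF7). */
    return cp == 0xAA || cp == 0xB5 || cp == 0xBA
        || (cp >= 0xC0 && cp != 0xD7 && cp != 0xF7);
}

/* On an ASCII platform the documentation says the base function and
 * the _A variant are identical, so the stand-in simply forwards. */
bool sketch_isWORDCHAR(unsigned long cp)
{
    return sketch_isWORDCHAR_A(cp);
}

/* Small usage check: 0xE9 is LATIN SMALL LETTER E WITH ACUTE. */
void sketch_self_check(void)
{
    assert(sketch_isWORDCHAR('x'));
    assert(!sketch_isWORDCHAR(0xE9));     /* base == _A on ASCII */
    assert(sketch_isWORDCHAR_L1(0xE9));   /* Latin-1 rules apply */
}
```

Code points above 255 would need the _uni or _utf8 variants described
in the patch; those depend on Unicode tables and are deliberately left
out of this sketch.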