Handle universal character names and Unicode characters outside of literals.
author Jordan Rose <jordan_rose@apple.com>
Thu, 24 Jan 2013 20:50:46 +0000 (20:50 +0000)
committer Jordan Rose <jordan_rose@apple.com>
Thu, 24 Jan 2013 20:50:46 +0000 (20:50 +0000)
commit 7f43dddae0669f0700cf04bf520c12d3301cd809
tree c431698192bbab6064242887616f7e36916aa973
parent aa89cf1a6611ce1a35b85418b0434620b4064c83

This is a missing piece for C99 conformance.

This patch handles UCNs by adding a '\\' case to LexTokenInternal and
LexIdentifier -- if we see a backslash, we tentatively try to read in a UCN.
If the UCN is not syntactically well-formed, we fall back to the old
treatment: a backslash followed by an identifier beginning with 'u' (or 'U').
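
The tentative read with fallback can be sketched as below. This is only an
illustration under stated assumptions, not the code added to Lexer.cpp: the
name tryReadUCN and its signature are hypothetical, and the sketch checks
syntax only (a backslash, 'u' or 'U', then exactly 4 or 8 hex digits). It does
not check whether the resulting code point is allowed in an identifier.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper for illustration; not a function from Lexer.cpp.
// Tentatively reads a UCN (\uXXXX or \UXXXXXXXX) starting at text[pos].
// On success, stores the decoded value and the number of characters consumed.
// On failure, leaves the outputs untouched so the caller can fall back to
// lexing the backslash as an ordinary token.
bool tryReadUCN(const std::string &text, std::size_t pos,
                std::uint32_t &codePoint, std::size_t &length) {
  assert(pos <= text.size() && "position must lie within the buffer");
  if (text.size() - pos < 2 || text[pos] != '\\')
    return false;

  // 'u' introduces 4 hex digits, 'U' introduces 8.
  std::size_t digits;
  if (text[pos + 1] == 'u')
    digits = 4;
  else if (text[pos + 1] == 'U')
    digits = 8;
  else
    return false;

  if (text.size() - pos < 2 + digits)
    return false;

  std::uint32_t value = 0;
  for (std::size_t i = 0; i < digits; ++i) {
    char c = text[pos + 2 + i];
    std::uint32_t d;
    if (c >= '0' && c <= '9')
      d = static_cast<std::uint32_t>(c - '0');
    else if (c >= 'a' && c <= 'f')
      d = static_cast<std::uint32_t>(c - 'a' + 10);
    else if (c >= 'A' && c <= 'F')
      d = static_cast<std::uint32_t>(c - 'A' + 10);
    else
      return false; // Not well-formed: the caller falls back.
    value = (value << 4) | d;
  }

  codePoint = value;
  length = 2 + digits;
  return true;
}
```

A caller would invoke this when it sees a backslash and, on a false return,
continue exactly as it did before the patch.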

Because the spelling of an identifier with UCNs still has the UCN in it, we
need to convert that to UTF-8 in Preprocessor::LookUpIdentifierInfo.
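
As a rough sketch of that conversion (not the actual code in
Preprocessor::LookUpIdentifierInfo), the following rewrites each well-formed
UCN in a spelling as its UTF-8 encoding and copies everything else verbatim.
The names expandUCNsToUTF8, appendUTF8, and hexValue are hypothetical, and the
sketch assumes the spelling has already been accepted as an identifier, so it
does not re-validate which code points are permitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Returns the value of a hex digit, or -1 if c is not a hex digit.
int hexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Appends the UTF-8 encoding of cp to out.
void appendUTF8(std::uint32_t cp, std::string &out) {
  assert(cp <= 0x10FFFF && "code point outside the Unicode range");
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Hypothetical helper: replaces every \uXXXX or \UXXXXXXXX in spelling with
// its UTF-8 bytes. Malformed or out-of-range sequences are copied unchanged.
std::string expandUCNsToUTF8(const std::string &spelling) {
  std::string out;
  std::size_t i = 0;
  while (i < spelling.size()) {
    std::size_t digits = 0;
    if (spelling[i] == '\\' && i + 1 < spelling.size()) {
      if (spelling[i + 1] == 'u') digits = 4;
      else if (spelling[i + 1] == 'U') digits = 8;
    }
    bool ok = digits != 0 && i + 2 + digits <= spelling.size();
    std::uint32_t cp = 0;
    for (std::size_t k = 0; ok && k < digits; ++k) {
      int d = hexValue(spelling[i + 2 + k]);
      if (d < 0) ok = false;
      else cp = (cp << 4) | static_cast<std::uint32_t>(d);
    }
    if (ok && cp <= 0x10FFFF) {
      appendUTF8(cp, out);
      i += 2 + digits;
    } else {
      out += spelling[i];
      ++i;
    }
  }
  return out;
}
```

With this normalization, an identifier spelled with a UCN and the same
identifier spelled with the literal character produce the same lookup key.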

Valid code that does *not* use UCNs should see only a minimal performance
hit: a check after each identifier for non-ASCII characters, a check when
converting raw_identifiers to identifiers that they do not contain UCNs, and
a check when getting the spelling of an identifier that it does not contain
a UCN.

This patch also adds basic support for actual UTF-8 in the source. This is
treated almost exactly the same as UCNs except that we consider stray
Unicode characters to be mistakes and offer a fixit to remove them.
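
To make the stray-character handling concrete, here is a minimal sketch,
assuming a toy notion of "allowed in an identifier". It decodes UTF-8 from the
source text and drops any non-ASCII character the predicate rejects, which is
the effect the removal fixit would have. The names decodeUTF8,
isToyIdentifierChar, and removeStrayUnicode are hypothetical; the predicate
covers only a small Latin-1 letter range for illustration and is not the table
the patch uses. Invalid UTF-8 bytes are left alone here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes one UTF-8 sequence at text[pos]. Rejects truncated sequences,
// bad continuation bytes, overlong forms, surrogates, and values > U+10FFFF.
bool decodeUTF8(const std::string &text, std::size_t pos,
                std::uint32_t &cp, std::size_t &length) {
  assert(pos < text.size() && "position must point at a byte");
  unsigned char lead = static_cast<unsigned char>(text[pos]);
  std::size_t n;
  std::uint32_t value, minValue;
  if (lead < 0x80)                { n = 1; value = lead;        minValue = 0; }
  else if ((lead & 0xE0) == 0xC0) { n = 2; value = lead & 0x1F; minValue = 0x80; }
  else if ((lead & 0xF0) == 0xE0) { n = 3; value = lead & 0x0F; minValue = 0x800; }
  else if ((lead & 0xF8) == 0xF0) { n = 4; value = lead & 0x07; minValue = 0x10000; }
  else return false;
  if (text.size() - pos < n)
    return false;
  for (std::size_t i = 1; i < n; ++i) {
    unsigned char c = static_cast<unsigned char>(text[pos + i]);
    if ((c & 0xC0) != 0x80)
      return false;
    value = (value << 6) | (c & 0x3F);
  }
  if (value < minValue || value > 0x10FFFF ||
      (value >= 0xD800 && value <= 0xDFFF))
    return false;
  cp = value;
  length = n;
  return true;
}

// Toy predicate (an assumption for this sketch): accepts only the Latin-1
// letters U+00C0..U+00FF, excluding the multiplication and division signs.
bool isToyIdentifierChar(std::uint32_t cp) {
  return cp >= 0xC0 && cp <= 0xFF && cp != 0xD7 && cp != 0xF7;
}

// Returns text with every stray non-ASCII character removed.
std::string removeStrayUnicode(const std::string &text) {
  std::string out;
  std::size_t i = 0;
  while (i < text.size()) {
    std::uint32_t cp = 0;
    std::size_t len = 0;
    if (!decodeUTF8(text, i, cp, len)) {
      out += text[i]; // Invalid byte: not handled by this sketch.
      ++i;
      continue;
    }
    if (cp < 0x80 || isToyIdentifierChar(cp))
      out.append(text, i, len);
    i += len;
  }
  return out;
}
```

For example, a no-break space (U+00A0) pasted into a declaration is removed,
while an accented letter inside an identifier is kept.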

llvm-svn: 173369
12 files changed:
clang/include/clang/Basic/ConvertUTF.h
clang/include/clang/Basic/DiagnosticLexKinds.td
clang/include/clang/Lex/Lexer.h
clang/include/clang/Lex/Token.h
clang/lib/Lex/Lexer.cpp
clang/lib/Lex/Preprocessor.cpp
clang/test/CXX/over/over.oper/over.literal/p8.cpp
clang/test/CodeGen/ucn-identifiers.c [new file with mode: 0644]
clang/test/FixIt/fixit-unicode.c
clang/test/Lexer/utf8-invalid.c [new file with mode: 0644]
clang/test/Preprocessor/ucn-pp-identifier.c [new file with mode: 0644]
clang/test/Sema/ucn-identifiers.c [new file with mode: 0644]