New Character Types
Last updated Jan 1, 2003.
The C++09 draft includes a proposal for adding new character types to the language. The main purpose of this proposal is to support Unicode characters and strings. In addition, it ensures compatibility with a similar extension that is being added to standard C.
Doesn’t wchar_t Already Solve the Unicode Support Problem?
These days, many C++ applications manipulate Unicode characters instead of 8-bit ASCII/EBCDIC characters. At first glance, the standard type wchar_t seems suitable for this task. However, wchar_t has several problems with which both C++ and C have been battling for a long time. The first problem is that Unicode 4.0.0, the latest version of this standard, is in fact a superset of older Unicode versions. Originally, Unicode was implemented as a 16-bit encoding system. However, as new scripts were added (including Runic, Linear B and Ugaritic!), it became obvious that 16 bits weren’t enough. Consequently, the Unicode standard was extended to 32 bits. This means that the abstract concept of a character can have at least three different sizes: 1 byte, 2 bytes and 4 bytes. Presently, the dual system of char and wchar_t can’t accommodate all three sizes. There are other problems with wchar_t as well, which I will not discuss here for the sake of brevity. The bottom line, however, is this: in order to ensure native and portable Unicode support, C++ needs additional character types specifically designed for supporting Unicode.
The C Solution
The ISO C committee has addressed Unicode support extensively (see document ISO/IEC TR 19769:2004, Extensions for the programming language C to support new character data types). The C++09 N2018 proposal is based on the C proposal, with some necessary changes that ensure proper overload resolution. The N2018 proposal addresses the core language changes; specializations for numeric_limits, character traits, basic_string, streams, and insertion operations will be addressed in other documents that deal with the required library changes.
New Fundamental Types
The new proposal adds two new fundamental datatypes to C++: _Char16_t and _Char32_t. Fortunately, standard typedefs will enable you to hide these ugly keywords: char16_t is a new typedef for _Char16_t and char32_t is a new typedef for _Char32_t.
The new datatypes have the same size and representation as uint_least16_t and uint_least32_t, respectively. Indeed, earlier versions of this proposal didn’t include the new datatypes _Char16_t and _Char32_t; they used only the uint_least16_t and uint_least32_t typedefs. Alas, the typedef approach made overloading on the new character types impossible, because a typedef doesn’t introduce a new type. To fix this, the N2018 proposal adds _Char16_t and _Char32_t as new built-in types (and keywords) to C++09.
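The overloading argument is easier to see in code. The following is only a sketch, and it rests on an assumption that goes beyond the draft described here: it uses the spellings char16_t and char32_t directly, which is what compilers supporting C++11 accept as built-in types, rather than _Char16_t and _Char32_t. The function name describe() is made up for the illustration.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>

// Assumption: a C++11 compiler, where char16_t and char32_t are built-in types.
// Because they are distinct types rather than aliases of the uint_leastN_t
// typedefs, the type system can tell a character from an integer.
static_assert(!std::is_same<char16_t, std::uint_least16_t>::value,
              "char16_t is not a typedef for uint_least16_t");
static_assert(!std::is_same<char32_t, std::uint_least32_t>::value,
              "char32_t is not a typedef for uint_least32_t");

// Hypothetical overload set: one function name, three argument types.
// With the typedef-only approach, the first two would declare the same function.
std::string describe(std::uint_least16_t) { return "16-bit integer"; }
std::string describe(char16_t)            { return "16-bit character"; }
std::string describe(char32_t)            { return "32-bit character"; }
```

Calling describe(u'A') should select the char16_t overload, whereas describe(std::uint_least16_t(65)) should select the integer one, even though both arguments may well have the same object representation.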
New Character Literals
The proposal extends the grammar by adding two new character literal sequences:
- u'c-char-sequence'
- U'c-char-sequence'
A character literal beginning with the letter u, e.g., u'z', is a character literal of type _Char16_t. The value of a _Char16_t literal containing a single c-char is the numerical value of that c-char as it appears in the _Char16_t character set used by the implementation, provided that the value can be represented with 16 bits. If the value cannot be represented with 16 bits, the program is ill-formed. Similarly, a _Char16_t literal containing multiple c-chars is ill-formed.
A character literal beginning with the letter U, e.g., U'y', is a character literal of type _Char32_t. The value of a _Char32_t literal containing a single c-char is the numerical value of that c-char as it appears in the _Char32_t character set used by the implementation, provided that the value can be represented with 32 bits. If the value cannot be represented with 32 bits, the program is ill-formed. A _Char32_t literal containing multiple c-chars is ill-formed.
It is implementation-defined whether literals may contain universal character names such as \Unnnnnnnn and \unnnn.
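Here is a minimal sketch of the two literal forms. It assumes a C++11 compiler, which spells the types char16_t and char32_t and accepts universal character names in these literals; under the draft discussed here, the latter is implementation-defined. The constant names are mine, and the numeric values shown assume that the literals are encoded as ISO 10646 code points.

```cpp
#include <type_traits>

// A u-prefixed literal has type char16_t; a U-prefixed one has type char32_t.
static_assert(std::is_same<decltype(u'z'), char16_t>::value, "u'z' is char16_t");
static_assert(std::is_same<decltype(U'y'), char32_t>::value, "U'y' is char32_t");

// U+20AC (the euro sign) fits in 16 bits, so either literal form can hold it.
constexpr char16_t euro16 = u'\u20AC';
constexpr char32_t euro32 = U'\U000020AC';

// U+1F600 needs more than 16 bits, so only the U form can represent it.
constexpr char32_t grin = U'\U0001F600';

// The following lines are intentionally commented out: both are ill-formed.
// constexpr char16_t too_wide = u'\U0001F600'; // not representable in 16 bits
// constexpr char16_t two      = u'ab';         // multiple c-chars
```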
New String Literals
The new proposal extends the grammar by adding two new string literal sequences:
- u"c-char-sequence(optional)"
- U"c-char-sequence(optional)"
A string literal beginning with u, such as u"abcd", is a _Char16_t string literal. A _Char16_t string literal has the type "array of n const _Char16_t" and static storage duration, where n is the size of the string as defined below. Similarly, a string literal beginning with U, such as U"abcd", is a _Char32_t string literal. A _Char32_t string literal has the type "array of n const _Char32_t" and static storage duration.
The size of a _Char16_t or _Char32_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating u'\0' or U'\0' character. The universal-character-names, if present, must be representable by the type of the literal.
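The size rule can be checked at compile time. The sketch below assumes a C++11 compiler (hence the spellings char16_t and char32_t), and element_count() is a hypothetical helper that I introduce only to read the array bound. I deliberately stick to characters that fit in a single 16-bit unit.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper: the number of elements in a built-in array.
template <typename T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) { return N; }

// A string literal is an lvalue, so decltype yields a reference to the array.
static_assert(std::is_same<decltype(u"abcd"), const char16_t (&)[5]>::value,
              "u\"abcd\" is an array of 5 const char16_t");
static_assert(std::is_same<decltype(U"abcd"), const char32_t (&)[5]>::value,
              "U\"abcd\" is an array of 5 const char32_t");

// Four characters plus the terminating null character.
static_assert(element_count(u"abcd") == 5, "4 characters + terminator");
// A universal-character-name counts as one element here.
static_assert(element_count(u"a\u20ACb") == 4, "3 characters + terminator");
// The empty literal still holds the terminator.
static_assert(element_count(U"") == 1, "terminator only");
```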
Underlying Types
Types _Char16_t and _Char32_t denote distinct types with the same size, signedness, and alignment as the underlying types uint_least16_t and uint_least32_t, respectively. Notice that _Char16_t is required to occupy at least 16 bits, which means that it may be represented by an underlying type larger than 16 bits. Similarly, _Char32_t must occupy at least 32 bits, although in theory its underlying representation may occupy more than 32 bits. Notice also that unlike the char type, the three character types wchar_t, _Char16_t and _Char32_t cannot take the signed or unsigned specifiers.
With the adoption of this proposal, C++ will support four different character types: char (including its signed and unsigned variants), wchar_t, _Char16_t and _Char32_t. A literal character of each of these character types looks like this:
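These guarantees lend themselves to a handful of compile-time checks. As before, this is a sketch that assumes a C++11 compiler, where the types are spelled char16_t and char32_t; bits_of() is a hypothetical helper, not a standard facility.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical helper: the number of bits an object of type T occupies.
template <typename T>
constexpr std::size_t bits_of() { return sizeof(T) * CHAR_BIT; }

// Same size, alignment and signedness as the underlying types.
static_assert(sizeof(char16_t) == sizeof(std::uint_least16_t), "same size");
static_assert(alignof(char16_t) == alignof(std::uint_least16_t), "same alignment");
static_assert(std::is_unsigned<char16_t>::value, "unsigned, like uint_least16_t");

static_assert(sizeof(char32_t) == sizeof(std::uint_least32_t), "same size");
static_assert(alignof(char32_t) == alignof(std::uint_least32_t), "same alignment");
static_assert(std::is_unsigned<char32_t>::value, "unsigned, like uint_least32_t");

// "At least" N bits: a wider representation is permitted.
static_assert(bits_of<char16_t>() >= 16, "at least 16 bits");
static_assert(bits_of<char32_t>() >= 32, "at least 32 bits");

// Intentionally commented out: there are no signed or unsigned variants.
// unsigned char16_t c; // ill-formed
```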
- char: 'a'
- wchar_t: L'a'
- _Char16_t: u'a' (as well as a universal character name such as \unnnn).
- _Char32_t: U'a' (as well as a universal character name such as \Unnnnnnnn).
Literal strings look like this:
- char: "ab"
- wchar_t: L"ab"
- _Char16_t: u"ab"
- _Char32_t: U"ab"
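Since each prefix yields a different element type, a function can be overloaded on all four kinds of string literals. The following sketch assumes a C++11 compiler (with the types spelled char16_t and char32_t); literal_kind() is a made-up name for the illustration.

```cpp
#include <string>

// Hypothetical overload set keyed on the element type of a string literal.
// Each literal decays to a pointer to its own character type, so every call
// below is an exact match for precisely one overload.
std::string literal_kind(const char*)     { return "char"; }
std::string literal_kind(const wchar_t*)  { return "wchar_t"; }
std::string literal_kind(const char16_t*) { return "char16_t"; }
std::string literal_kind(const char32_t*) { return "char32_t"; }
```

For instance, literal_kind(u"ab") should pick the char16_t overload and literal_kind(L"ab") the wchar_t one.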
<cuchar> Declarations
The new header <cuchar>, which corresponds to the <uchar.h> header in C, includes the following typedef names and macros:
typedef _Char16_t char16_t;
typedef _Char32_t char32_t;
If the macro __STDC_UTF_16__ is defined in <cuchar>, values of type _Char16_t shall have UTF-16 encoding as defined by ISO 10646.
If the macro __STDC_UTF_32__ is defined in <cuchar>, values of type _Char32_t shall have UTF-32 encoding, as defined by ISO 10646.
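A program can test these macros with the preprocessor. The sketch below assumes that your implementation ships a <cuchar> header, which not every platform does, and it uses the char16_t and char32_t spellings of a C++11 compiler. The two function names are hypothetical; some compilers predefine the macros rather than defining them in the header, in which case the checks behave the same way.

```cpp
#include <cuchar>

// Hypothetical helper: reports whether the implementation promises
// UTF-16 encoding for 16-bit character values.
constexpr bool utf16_promised() {
#ifdef __STDC_UTF_16__
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: the same check for UTF-32 and 32-bit character values.
constexpr bool utf32_promised() {
#ifdef __STDC_UTF_32__
    return true;
#else
    return false;
#endif
}
```

Code that depends on a particular encoding can then guard itself, for example with static_assert(utf16_promised(), "UTF-16 required").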
