C++ Reference Guide

New Character Types

Last updated Jan 1, 2003.

The C++09 draft includes a proposal for adding new character types to the language. The main purpose of this proposal is to support Unicode characters and strings. In addition, it ensures compatibility with a similar extension that is being added to standard C.

Doesn’t wchar_t Already Solve the Unicode Support Problem?

These days, many C++ applications manipulate Unicode characters instead of 8-bit ASCII/EBCDIC characters. At first glance, the standard type wchar_t seems suitable for the job. However, wchar_t has several problems that both C++ and C have been struggling with for a long time. The first is that Unicode has outgrown its original design. Unicode 4.0.0, the latest version of the standard, is a superset of older Unicode versions. Originally, Unicode was implemented as a 16-bit encoding system. However, as new scripts were added (including Runic, Linear B and Ugaritic!), it became obvious that 16 bits weren’t enough. Consequently, the Unicode standard was extended to 32 bits. This means that the abstract concept of a character can have at least three different sizes: 1 byte, 2 bytes and 4 bytes. Presently, the dual system of char and wchar_t can’t handle this variety.

There are other problems with wchar_t as well, which I will not discuss here for the sake of brevity. The bottom line, however, is this: to ensure native and portable Unicode support, C++ needs additional character types specifically designed for Unicode.

The C solution

The ISO C committee has addressed Unicode support extensively (see document ISO/IEC TR 19769:2004, Extensions for the programming language C to support new character data types). The C++09 N2018 proposal is based on the C proposal, with some necessary changes that ensure proper overload resolution. The N2018 proposal addresses only the core language changes; specializations for numeric_limits, character traits, basic_string, streams, and insertion operations will be addressed in other documents that deal with the required library changes.

New Fundamental Types

The proposal adds two new fundamental data types to C++: _Char16_t and _Char32_t. Fortunately, standard typedefs let you hide these ugly keywords: char16_t is a typedef for _Char16_t, and char32_t is a typedef for _Char32_t.

The new data types have the same size and representation as uint_least16_t and uint_least32_t, respectively. Indeed, earlier versions of this proposal didn’t include _Char16_t and _Char32_t at all; they used only the uint_least16_t and uint_least32_t typedefs. Alas, the typedef approach made overloading on the new character types impossible, because a typedef doesn’t introduce a new type. To fix this, the N2018 proposal adds _Char16_t and _Char32_t as new built-in types (and keywords) to C++09.
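The overloading point can be illustrated with a minimal sketch. It assumes a C++11-or-later compiler, where char16_t and char32_t ended up as keywords in their own right (the _Char16_t and _Char32_t spellings from the proposal may not compile there). The names Kind and kind are hypothetical, chosen only for illustration.

```cpp
#include <cstdint>

// Hypothetical tag describing which overload was selected.
enum class Kind { Integer16, Char16, Char32 };

// This overload set is only valid because char16_t is a distinct type.
// If char16_t were a plain typedef of uint_least16_t, the second
// function would be a redefinition of the first.
constexpr Kind kind(std::uint_least16_t) { return Kind::Integer16; }
constexpr Kind kind(char16_t)            { return Kind::Char16; }
constexpr Kind kind(char32_t)            { return Kind::Char32; }

// Compile-time checks: each argument picks the overload for its own type.
static_assert(kind(u'a') == Kind::Char16, "u'a' selects the char16_t overload");
static_assert(kind(U'a') == Kind::Char32, "U'a' selects the char32_t overload");
static_assert(kind(std::uint_least16_t(97)) == Kind::Integer16,
              "a 16-bit integer keeps its own overload");
```

With the typedef-only approach of the earlier drafts, a function could not tell a 16-bit character apart from a 16-bit integer.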

New Character Literals

The proposal extends the grammar by adding two new character literal sequences:

  • u'c-char-sequence'
  • U'c-char-sequence'

A character literal beginning with the letter u, e.g., u'z', is a character literal of type _Char16_t. The value of a _Char16_t literal containing a single c-char is the numerical value of that c-char as it appears in the _Char16_t character set used by the implementation, provided that the value can be represented with 16 bits. If the value cannot be represented with 16 bits, the program is ill-formed. Similarly, a _Char16_t literal containing multiple c-chars is ill-formed.

A character literal beginning with the letter U, e.g., U'y', is a character literal of type _Char32_t. The value of a _Char32_t literal containing a single c-char is the numerical value of that c-char as it appears in the _Char32_t character set used by the implementation, provided that the value can be represented with 32 bits. If the value cannot be represented with 32 bits, the program is ill-formed. A _Char32_t literal containing multiple c-chars is ill-formed.

It is implementation-defined whether literals may contain universal character names such as \Unnnnnnnn and \unnnn.
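The rules for character literals can be sketched as follows. The example assumes a C++11-or-later compiler that spells the types char16_t and char32_t and accepts universal character names in these literals. The constant names euro and clef are hypothetical.

```cpp
#include <type_traits>

// The prefix alone determines the type of the literal.
static_assert(std::is_same<decltype(u'z'), char16_t>::value,
              "the u prefix yields a 16-bit character");
static_assert(std::is_same<decltype(U'y'), char32_t>::value,
              "the U prefix yields a 32-bit character");

// A universal character name designates a single character.
constexpr char16_t euro = u'\u20AC';      // EURO SIGN, fits in 16 bits
constexpr char32_t clef = U'\U0001D11E';  // MUSICAL SYMBOL G CLEF, needs more than 16 bits

// By the rule described above, u'\U0001D11E' would be ill-formed,
// because that value cannot be represented with 16 bits.
```

On implementations that use UTF-16 and UTF-32 for these types, euro holds 0x20AC and clef holds 0x1D11E.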

New String Literals

The new proposal extends the grammar by adding two new string literal sequences:

  • u"s-char-sequence(optional)"
  • U"s-char-sequence(optional)"

A string literal beginning with u, such as u"abcd", is a _Char16_t string literal. A _Char16_t string literal has type "array of n const _Char16_t" and static storage duration, where n is the size of the string as defined below. Similarly, a string literal beginning with U, such as U"abcd", is a _Char32_t string literal. A _Char32_t string literal has type "array of n const _Char32_t" and static storage duration.

The size of a _Char16_t or _Char32_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating u'\0' or U'\0' character. Each universal-character-name, if present, must be representable in the type of the literal.
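The size rule can be checked at compile time. This is a minimal sketch that assumes a C++11-or-later compiler spelling the types char16_t and char32_t; the helper element_count is hypothetical and simply reports the n of an "array of n" type.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper: deduces the number of elements in an array.
template <typename T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) { return N; }

// Four ordinary characters plus one terminator.
static_assert(element_count(u"abcd") == 5, "n is 5 for u\"abcd\"");
static_assert(element_count(U"abcd") == 5, "n is 5 for U\"abcd\"");

// One ordinary character, one escape sequence, one
// universal-character-name, and the terminator.
static_assert(element_count(u"a\n\u00E9") == 4, "each item counts once");

// The literal is an lvalue, so decltype reports a reference to the array type.
static_assert(std::is_same<decltype(u"abcd"), const char16_t (&)[5]>::value,
              "array of 5 const char16_t");
```

The universal-character-name in this sketch was picked so that it fits in a single 16-bit element.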

Underlying Types

Types _Char16_t and _Char32_t denote distinct types with the same size, signedness, and alignment as the underlying types uint_least16_t and uint_least32_t, respectively. Notice that _Char16_t is required to occupy at least 16 bits, which means that it may be represented by an underlying type larger than 16 bits. Similarly, _Char32_t must occupy at least 32 bits, although in theory its underlying representation may occupy more than 32 bits. Notice also that unlike the char type, the three character types wchar_t, _Char16_t and _Char32_t cannot be combined with the signed or unsigned specifiers.
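These layout guarantees can be expressed as compile-time checks. The sketch below assumes a C++11-or-later compiler that spells the types char16_t and char32_t; the helper same_layout is hypothetical.

```cpp
#include <climits>
#include <cstdint>
#include <type_traits>

// Hypothetical helper: true when two types agree in size, alignment,
// and signedness, which is what the proposal requires of a character
// type and its underlying type.
template <typename CharT, typename Underlying>
constexpr bool same_layout() {
    return sizeof(CharT) == sizeof(Underlying)
        && alignof(CharT) == alignof(Underlying)
        && std::is_signed<CharT>::value == std::is_signed<Underlying>::value;
}

static_assert(same_layout<char16_t, std::uint_least16_t>(), "16-bit layouts match");
static_assert(same_layout<char32_t, std::uint_least32_t>(), "32-bit layouts match");

// "At least" 16 and 32 bits: wider representations are permitted.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t holds at least 16 bits");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t holds at least 32 bits");

// Same layout, yet still a different type from the underlying one.
static_assert(!std::is_same<char16_t, std::uint_least16_t>::value, "distinct types");
```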

With the adoption of this proposal, C++ will support four different character types: char (including its signed and unsigned variants), wchar_t, _Char16_t and _Char32_t. A literal character of each of these character types looks like this:

  • char: 'a'
  • wchar_t: L'a'
  • _Char16_t: u'a' (or a universal character name such as \unnnn)
  • _Char32_t: U'a' (or a universal character name such as \Unnnnnnnn)

Literal strings look like this:

  • char: "ab"
  • wchar_t: L"ab"
  • _Char16_t: u"ab"
  • _Char32_t: U"ab"
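Because the four character types are distinct, generic code can be written once and instantiated for each of them. The following is a minimal sketch, assuming a C++11-or-later compiler that spells the new types char16_t and char32_t; the helper length_of is hypothetical and is not part of the proposal.

```cpp
#include <cstddef>

// Hypothetical helper: counts the elements that precede the terminator.
// CharT is deduced from the prefix of the string literal passed in.
template <typename CharT>
constexpr std::size_t length_of(const CharT* s) {
    return *s == CharT() ? 0 : 1 + length_of(s + 1);
}

static_assert(length_of("ab")  == 2, "char string");
static_assert(length_of(L"ab") == 2, "wchar_t string");
static_assert(length_of(u"ab") == 2, "16-bit character string");
static_assert(length_of(U"ab") == 2, "32-bit character string");
```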

<cuchar> declarations

The new header <cuchar>, which corresponds to the <uchar.h> header in C, includes the following typedef names and macros:

typedef _Char16_t char16_t;
typedef _Char32_t char32_t; 

If the macro __STDC_UTF_16__ is defined in <cuchar>, values of type _Char16_t shall have UTF-16 encoding, as defined by ISO 10646.

If the macro __STDC_UTF_32__ is defined in <cuchar>, values of type _Char32_t shall have UTF-32 encoding, as defined by ISO 10646.
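A program can test these macros before relying on a particular encoding. The sketch below assumes a toolchain that ships <cuchar> and spells the types char16_t and char32_t (C++11 or later); the function names utf16_guaranteed and utf32_guaranteed are hypothetical.

```cpp
#include <cuchar>

// Hypothetical helpers: report whether the implementation promises
// UTF-16 and UTF-32 for the new character types.
constexpr bool utf16_guaranteed() {
#ifdef __STDC_UTF_16__
    return true;
#else
    return false;
#endif
}

constexpr bool utf32_guaranteed() {
#ifdef __STDC_UTF_32__
    return true;
#else
    return false;
#endif
}

// Only when the guarantee holds can a literal be assumed to equal
// its ISO 10646 code point.
static_assert(!utf16_guaranteed() || u'\u20AC' == 0x20AC,
              "EURO SIGN is 0x20AC under UTF-16");
static_assert(!utf32_guaranteed() || U'\U0001D11E' == 0x1D11E,
              "G CLEF is 0x1D11E under UTF-32");
```

Code that must also run where the macros are absent would have to consult the implementation's documentation for the encoding in use.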