
Secure Coding in C and C++: Strings and Buffer Overflows

Buffer overflows are a primary source of software vulnerabilities. Type-unsafe languages, such as C and C++, are especially prone to such vulnerabilities. In this chapter, Robert C. Seacord discusses practical mitigation strategies that can be used to help eliminate vulnerabilities resulting from buffer overflows.



 

with Dan Plakosh, Jason Rafail, and Martin Sebor

    But evil things, in robes of sorrow, Assailed the monarch’s high estate.
    Edgar Allan Poe, “The Fall of the House of Usher”

2.1. Character Strings

Strings from sources such as command-line arguments, environment variables, console input, text files, and network connections are of special concern in secure programming because they provide means for external input to influence the behavior and output of a program. Graphics- and Web-based applications, for example, make extensive use of text input fields, and because of standards like XML, data exchanged between programs is increasingly in string form as well. As a result, weaknesses in string representation, string management, and string manipulation have led to a broad range of software vulnerabilities and exploits.

Strings are a fundamental concept in software engineering, but they are not a built-in type in C or C++. The standard C library supports strings of type char and wide strings of type wchar_t.

String Data Type

A string consists of a contiguous sequence of characters terminated by and including the first null character. A pointer to a string points to its initial character. The length of a string is the number of bytes preceding the null character, and the value of a string is the sequence of the values of the contained characters, in order. Figure 2.1 shows a string representation of “hello.”


Figure 2.1. String representation of “hello”
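The distinction between the length of a string and the space it occupies can be sketched in a few lines. This is a minimal illustration, assuming a hosted C implementation; the names greeting, greeting_length, and greeting_size are hypothetical and do not come from the chapter.

```c
#include <stddef.h>
#include <string.h>

/* The array holds 'h', 'e', 'l', 'l', 'o', '\0': six bytes in total. */
static const char greeting[] = "hello";

/* Length: the number of bytes preceding the null character. */
size_t greeting_length(void) {
  return strlen(greeting);
}

/* Size: the number of bytes in the array, including the null character. */
size_t greeting_size(void) {
  return sizeof(greeting);
}
```

The two values always differ by one for an array that is sized by its string literal initializer.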

Strings are implemented as arrays of characters and are susceptible to the same problems as arrays.

As a result, secure coding practices for arrays should also be applied to null-terminated character strings; see the “Arrays (ARR)” chapter of The CERT C Secure Coding Standard [Seacord 2008]. When dealing with character arrays, it is useful to define some terms:

The C Standard allows for the creation of pointers that point one past the last element of the array object, although these pointers cannot be dereferenced without invoking undefined behavior. When dealing with strings, some extra terms are also useful:

Array Size

One of the problems with arrays is determining the number of elements. In the following example, the function clear() uses the idiom sizeof(array) / sizeof(array[0]) to determine the number of elements in the array. However, array is a pointer type because it is a parameter. As a result, sizeof(array) is equal to sizeof(int *). For example, on an architecture (such as x86-32) where sizeof(int) == 4 and sizeof(int *) == 4, the expression sizeof(array) / sizeof(array[0]) evaluates to 1, regardless of the length of the array passed, leaving the rest of the array unaffected.

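#include <stddef.h>

/* Flawed: array is adjusted to int *, so sizeof(array) is the size
   of a pointer, not the size of the caller's array. */
void clear(int array[]) {
  for (size_t i = 0; i < sizeof(array) / sizeof(array[0]); ++i) {
    array[i] = 0;
  }
}

void dowork(void) {
  int dis[12];

  clear(dis);
  /* ... */
}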

This is because the sizeof operator yields the size of the adjusted (pointer) type when applied to a parameter declared to have array or function type. The strlen() function can be used to determine the length of a properly null-terminated character string but not the space available in an array. The CERT C Secure Coding Standard [Seacord 2008] includes “ARR01-C. Do not apply the sizeof operator to a pointer when taking the size of an array,” which warns against this problem.
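One common way to avoid the problem is to pass the element count as a separate argument. The following is a sketch of that approach, not the chapter's own solution; the count parameter and the return value of dowork() are additions made here for illustration.

```c
#include <stddef.h>

/* The caller supplies the element count because the callee cannot
   recover it from the pointer parameter. */
void clear(int array[], size_t count) {
  for (size_t i = 0; i < count; ++i) {
    array[i] = 0;
  }
}

/* Fills a local array with nonzero values, clears it, and returns
   the sum of its elements (0 if every element was cleared). */
int dowork(void) {
  int dis[12];
  const size_t count = sizeof(dis) / sizeof(dis[0]);  /* dis is an array here: 12 */
  int sum = 0;

  for (size_t i = 0; i < count; ++i) {
    dis[i] = (int)i + 1;
  }
  clear(dis, count);
  for (size_t i = 0; i < count; ++i) {
    sum += dis[i];
  }
  return sum;
}
```

The sizeof idiom is applied only in dowork(), where dis still has array type.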

The characters in a string belong to the character set interpreted in the execution environment—the execution character set. These characters consist of a basic character set, defined by the C Standard, and a set of zero or more extended characters, which are not members of the basic character set. The values of the members of the execution character set are implementation defined but may, for example, be the values of the 7-bit U.S. ASCII character set.

C uses the concept of a locale, which can be changed by the setlocale() function, to keep track of various conventions such as language and punctuation supported by the implementation. The current locale determines which characters are available as extended characters.

The basic execution character set includes the 26 uppercase and 26 lowercase letters of the Latin alphabet, the 10 decimal digits, 29 graphic characters, the space character, and control characters representing horizontal tab, vertical tab, form feed, alert, backspace, carriage return, and newline. The representation of each member of the basic character set fits in a single byte. A byte with all bits set to 0, called the null character, must exist in the basic execution character set; it is used to terminate a character string.

The execution character set may contain a large number of characters and therefore require multiple bytes to represent some individual characters in the extended character set. This is called a multibyte character set. In this case, the basic characters must still be present, and each character of the basic character set is encoded as a single byte. The presence, meaning, and representation of any additional characters are locale specific. A string may sometimes be called a multibyte string to emphasize that it might hold multibyte characters. These are not the same as wide strings in which each character has the same length.

A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence. While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the shift state. The interpretation for subsequent bytes in the sequence is a function of the current shift state.

UTF-8

UTF-8 is a multibyte character set that can represent every character in the Unicode character set but is also backward compatible with the 7-bit U.S. ASCII character set. Each UTF-8 character is represented by 1 to 4 bytes (see Table 2.1). If the character is encoded by just 1 byte, the high-order bit is 0 and the other bits give the code value (in the range 0 to 127). If the character is encoded by a sequence of more than 1 byte, the first byte has as many leading 1 bits as the total number of bytes in the sequence, followed by a 0 bit, and the succeeding bytes are all marked by the leading 2-bit pattern 10. The remaining bits in the byte sequence are concatenated to form the Unicode code point value (in the range 0x80 to 0x10FFFF). Consequently, a byte with lead bit 0 is a single-byte code, a byte with multiple leading 1 bits is the first of a multibyte sequence, and a byte with the leading bit pattern 10 is a continuation byte of a multibyte sequence. The format of the bytes allows the beginning of each sequence to be detected without decoding from the beginning of the string.
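The bit patterns just described can be tested with simple masks. The sketch below is illustrative only; the function names utf8_lead_length and utf8_count are hypothetical, and the code inspects bit patterns without checking that a sequence is well formed.

```c
#include <stddef.h>

/* Returns the sequence length implied by a lead byte (1 to 4), or 0 if
   the byte is a continuation byte or cannot start a sequence. Only the
   bit pattern is examined, so bytes such as C0 or F5 match a pattern
   here even though they never appear in well-formed UTF-8. */
size_t utf8_lead_length(unsigned char b) {
  if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx: single-byte code */
  if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
  if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
  if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
  return 0;                          /* 10xxxxxx or 11111xxx */
}

/* Counts code points by counting every byte that is not a continuation
   byte. Assumes s is well-formed, null-terminated UTF-8. */
size_t utf8_count(const char *s) {
  size_t n = 0;
  for (; *s != '\0'; ++s) {
    if (((unsigned char)*s & 0xC0) != 0x80) {
      ++n;
    }
  }
  return n;
}
```

Because continuation bytes are recognizable on their own, utf8_count() never needs to decode a full sequence.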

Table 2.1. Well-Formed UTF-8 Byte Sequences

Code Points           First Byte   Second Byte   Third Byte   Fourth Byte
U+0000..U+007F        00..7F
U+0080..U+07FF        C2..DF       80..BF
U+0800..U+0FFF        E0           A0..BF        80..BF
U+1000..U+CFFF        E1..EC       80..BF        80..BF
U+D000..U+D7FF        ED           80..9F        80..BF
U+E000..U+FFFF        EE..EF       80..BF        80..BF
U+10000..U+3FFFF      F0           90..BF        80..BF       80..BF
U+40000..U+FFFFF      F1..F3       80..BF        80..BF       80..BF
U+100000..U+10FFFF    F4           80..8F        80..BF       80..BF

Source: [Unicode 2012]

The first 128 characters constitute the basic execution character set; each of these characters fits in a single byte.

UTF-8 decoders are sometimes a security hole. In some circumstances, an attacker can exploit an incautious UTF-8 decoder by sending it an octet sequence that is not permitted by the UTF-8 syntax. The CERT C Secure Coding Standard [Seacord 2008] includes “MSC10-C. Character encoding—UTF-8-related issues,” which describes this problem and other UTF-8-related issues.
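A cautious decoder can reject ill-formed input before decoding it. The following sketch checks a byte buffer against the ranges in Table 2.1; it is one possible approach rather than a reference implementation, and the names in_range and utf8_is_well_formed are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

static bool in_range(unsigned char b, unsigned char lo, unsigned char hi) {
  return b >= lo && b <= hi;
}

/* Returns true if the len bytes at s consist only of byte sequences
   permitted by Table 2.1. */
bool utf8_is_well_formed(const unsigned char *s, size_t len) {
  size_t i = 0;
  while (i < len) {
    unsigned char b = s[i];
    size_t need;                         /* continuation bytes expected */
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of the second byte */

    if (b <= 0x7F) {
      need = 0;
    } else if (in_range(b, 0xC2, 0xDF)) {
      need = 1;
    } else if (b == 0xE0) {
      need = 2; lo = 0xA0;               /* excludes overlong forms */
    } else if (in_range(b, 0xE1, 0xEC) || in_range(b, 0xEE, 0xEF)) {
      need = 2;
    } else if (b == 0xED) {
      need = 2; hi = 0x9F;               /* excludes U+D800..U+DFFF */
    } else if (b == 0xF0) {
      need = 3; lo = 0x90;               /* excludes overlong forms */
    } else if (in_range(b, 0xF1, 0xF3)) {
      need = 3;
    } else if (b == 0xF4) {
      need = 3; hi = 0x8F;               /* excludes values above U+10FFFF */
    } else {
      return false;                      /* 80..C1 and F5..FF cannot lead */
    }

    if (need > len - i - 1) {
      return false;                      /* sequence is truncated */
    }
    for (size_t k = 1; k <= need; ++k) {
      unsigned char c = s[i + k];
      if (k == 1 ? !in_range(c, lo, hi) : !in_range(c, 0x80, 0xBF)) {
        return false;
      }
    }
    i += need + 1;
  }
  return true;
}
```

For example, the two-byte sequence C0 AF, an overlong encoding of "/", is rejected because C0 is not a permitted first byte.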

Wide Strings

To process the characters of a large character set, a program may represent each character as a wide character, which generally takes more space than an ordinary character. Most implementations choose either 16 or 32 bits to represent a wide character. The problem of sizing wide strings is covered in the section “Sizing Strings.”

A wide string is a contiguous sequence of wide characters terminated by and including the first null wide character. A pointer to a wide string points to its initial (lowest addressed) wide character. The length of a wide string is the number of wide characters preceding the null wide character, and the value of a wide string is the sequence of code values of the contained wide characters, in order.
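For a wide string, the length, the element count, and the size in bytes are three different quantities. The sketch below shows each one; the names wide_greeting, wide_length, wide_count, and wide_size are hypothetical.

```c
#include <stddef.h>
#include <wchar.h>

/* Six wide characters: 'h', 'e', 'l', 'l', 'o', and the null wide character. */
static const wchar_t wide_greeting[] = L"hello";

/* Length: wide characters preceding the null wide character. */
size_t wide_length(void) {
  return wcslen(wide_greeting);
}

/* Count: elements in the array, including the null wide character. */
size_t wide_count(void) {
  return sizeof(wide_greeting) / sizeof(wide_greeting[0]);
}

/* Size: bytes in the array, which depends on sizeof(wchar_t). */
size_t wide_size(void) {
  return sizeof(wide_greeting);
}
```

Only wide_size() varies between platforms; the length and the count are the same wherever the code is compiled.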

String Literals

A character string literal is a sequence of zero or more characters enclosed in double quotes, as in "xyz". A wide string literal is the same, except prefixed by the letter L, as in L"xyz".

In a character constant or string literal, members of the character set used during execution are represented by corresponding members of the character set in the source code or by escape sequences consisting of the backslash \ followed by one or more characters.

During compilation, the multibyte character sequences specified by any sequence of adjacent characters and identically prefixed string literal tokens are concatenated into a single multibyte character sequence. If any of the tokens have an encoding prefix, the resulting multibyte character sequence is treated as having the same prefix; otherwise, it is treated as a character string literal. Whether differently prefixed wide string literal tokens can be concatenated (and, if so, the treatment of the resulting multibyte character sequence) is implementation defined. For example, each of the following sequences of adjacent string literal tokens

"a" "b" L"c"
"a" L"b" "c"
L"a" "b" L"c"
L"a" L"b" L"c"

is equivalent to the string literal

L"abc"

Next, a byte or code of value 0 is appended to each character sequence that results from a string literal or literals. (A character string literal need not be a string, because a null character may be embedded in it by a \0 escape sequence.) The character sequence is then used to initialize an array of static storage duration and length just sufficient to contain the sequence. For character string literals, the array elements have type char and are initialized with the individual bytes of the character sequence. For wide string literals, the array elements have type wchar_t and are initialized with the sequence of wide characters corresponding to the character sequence, as defined by the mbstowcs() (multibyte string to wide-character string) function with an implementation-defined current locale. The value of a string literal containing a character or escape sequence not represented in the execution character set is implementation defined.
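The parenthetical remark about embedded null characters can be made concrete. In this small sketch the array created from the literal is larger than the string it contains; the names embedded, embedded_array_size, and embedded_string_length are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* The array holds 'a', 'b', '\0', 'c', 'd', '\0': six bytes. */
static const char embedded[] = "ab\0cd";

/* Size of the array initialized from the literal. */
size_t embedded_array_size(void) {
  return sizeof(embedded);
}

/* Length of the string, which ends at the first null character. */
size_t embedded_string_length(void) {
  return strlen(embedded);
}
```

String-handling functions see only "ab", so any code that needs all six bytes must rely on the array size instead.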

The type of a string literal is an array of char in C, but it is an array of const char in C++. Consequently, the C type system does not prevent a program from attempting to modify a string literal. However, if the program attempts to modify such an array, the behavior is undefined, and therefore such behavior is prohibited by The CERT C Secure Coding Standard [Seacord 2008], “STR30-C. Do not attempt to modify string literals.” One reason for this rule is that the C Standard does not specify that these arrays must be distinct, provided their elements have the appropriate values. For example, compilers sometimes store multiple identical string literals at the same address, so that modifying one such literal might have the effect of changing the others as well. Another reason for this rule is that string literals are frequently stored in read-only memory (ROM).
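When a program needs modifiable text, it can initialize an array from the literal and modify the array instead. The following sketch assumes a caller-supplied buffer of at least 4 bytes; the function name upper_abc is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Writes an uppercased copy of "abc" into out, which must have room
   for at least 4 bytes. The literal itself is never modified. */
void upper_abc(char out[4]) {
  char s[] = "abc";  /* a modifiable array initialized from the literal */

  for (size_t i = 0; s[i] != '\0'; ++i) {
    s[i] = (char)toupper((unsigned char)s[i]);
  }
  for (size_t i = 0; i < sizeof(s); ++i) {
    out[i] = s[i];   /* copies the null character as well */
  }
}
```

Declaring char *s = "abc" instead would make s point at the literal, and writing through it would be undefined behavior.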

The C Standard allows an array variable to be declared both with a bound index and with an initialization literal. The initialization literal also implies an array size in the number of elements specified. For strings, the size specified by a string literal is the number of characters in the literal plus one for the terminating null character.

Array variables are often initialized by a string literal and declared with an explicit bound that matches the number of characters in the string literal. For example, the following declaration initializes an array of characters using a string literal that defines one more character (counting the terminating '\0') than the array can hold:

const char s[3] = "abc";

The size of the array s is 3, although the size of the string literal is 4; consequently, the trailing null byte is omitted. Any subsequent use of the array as a null-terminated byte string can result in a vulnerability, because s is not properly null-terminated.

A better approach is to not specify the bound of a string initialized with a string literal because the compiler will automatically allocate sufficient space for the entire string literal, including the terminating null character:

const char s[] = "abc";

This approach also simplifies maintenance, because the size of the array can always be derived even if the size of the string literal changes. This issue is further described by The CERT C Secure Coding Standard [Seacord 2008], “STR36-C. Do not specify the bound of a character array initialized with a string literal.”
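The effect of the explicit bound can be observed by searching each array for a null character within its known size. This is a sketch only; the names bounded, unbounded, and has_terminator are hypothetical, and bounded is spelled with character initializers that give it the same contents as the declaration const char s[3] = "abc".

```c
#include <stddef.h>
#include <string.h>

/* Same contents as const char s[3] = "abc": no null character. */
static const char bounded[3] = {'a', 'b', 'c'};

/* The compiler sizes the array to 4, including the null character. */
static const char unbounded[] = "abc";

/* Returns nonzero if a null character occurs within the first size
   bytes of a. Unlike strlen(), memchr() never reads past size bytes. */
int has_terminator(const char *a, size_t size) {
  return memchr(a, '\0', size) != NULL;
}
```

Calling strlen(bounded) would read beyond the array, which is why the check is bounded by the array size.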

Strings in C++

Multibyte strings and wide strings are both common data types in C++ programs, but many attempts have been made to also create string classes. Most C++ developers have written at least one string class, and a number of widely accepted forms exist. The standardization of C++ [ISO/IEC 1998] promotes the standard class template std::basic_string. The basic_string template represents a sequence of characters. It supports sequence operations as well as string operations such as search and concatenation and is parameterized by character type:

  • string is a typedef for the template specialization basic_string<char>.
  • wstring is a typedef for the template specialization basic_string<wchar_t>.

Because the C++ standard defines additional string types, C++ also defines additional terms for multibyte strings. A null-terminated byte string, or NTBS, is a character sequence whose highest addressed element with defined content has the value 0 (the terminating null character); no other element in the sequence has the value 0. A null-terminated multibyte string, or NTMBS, is an NTBS that constitutes a sequence of valid multibyte characters beginning and ending in the initial shift state.

The basic_string class template specializations are less prone to errors and security vulnerabilities than are null-terminated byte strings. Unfortunately, there is a mismatch between C++ string objects and null-terminated byte strings. Specifically, most C++ string objects are treated as atomic entities (usually passed by value or reference), whereas existing C library functions accept pointers to null-terminated character sequences. In the standard C++ string class, the internal representation does not have to be null-terminated [Stroustrup 1997], although all common implementations are null-terminated. Some other string types, such as Win32 LSA_UNICODE_STRING, do not have to be null-terminated either. As a result, there are different ways to access string contents, determine the string length, and determine whether a string is empty.

It is virtually impossible to avoid multiple string types within a C++ program. If you want to use basic_string exclusively, you must ensure that there are no

  • basic_string literals. A string literal such as "abc" is a static null-terminated byte string.
  • Interactions with the existing libraries that accept null-terminated byte strings (for example, many of the objects manipulated by function signatures declared in <cstring> are NTBSs).
  • Interactions with the existing libraries that accept null-terminated wide-character strings (for example, many of the objects manipulated by function signatures declared in <cwchar> are wide-character sequences).

Typically, C++ programs use null-terminated byte strings and one string class, although it is often necessary to deal with multiple string classes within a legacy code base [Wilson 2003].

Character Types

The three types char, signed char, and unsigned char are collectively called the character types. Compilers have the latitude to define char to have the same range, representation, and behavior as either signed char or unsigned char. Regardless of the choice made, char is a distinct type.

Although not stated in one place, the C Standard follows a consistent philosophy for choosing character types:

signed char and unsigned char

  • Suitable for small integer values

plain char

  • The type of each element of a string literal
  • Used for character data (where signedness has little meaning) as opposed to integer data
The following program fragment shows the standard string-handling function strlen() being called with a plain character string, a signed character string, and an unsigned character string. The strlen() function takes a single argument of type const char *.

size_t len;
char cstr[] = "char string";
signed char scstr[] = "signed char string";
unsigned char ucstr[] = "unsigned char string";

len = strlen(cstr);
len = strlen(scstr);  /* warns when char is unsigned */
len = strlen(ucstr);  /* warns when char is signed */

Compiling at high warning levels in compliance with “MSC00-C. Compile cleanly at high warning levels” causes warnings to be issued when

  • Converting from unsigned char[] to const char * when char is signed
  • Converting from signed char[] to const char * when char is defined to be unsigned

Casts are required to eliminate these warnings, but excessive casts can make code difficult to read and hide legitimate warning messages.

If this code were compiled using a C++ compiler, conversions from unsigned char[] to const char * and from signed char[] to const char * would be flagged as errors requiring casts. “STR04-C. Use plain char for characters in the basic character set” recommends the use of plain char for compatibility with standard narrow-string-handling functions.

int

The int type is used for data that could be either EOF (a negative value) or character data interpreted as unsigned char to prevent sign extension and then converted to int. For example, on a platform in which the int type is represented as a 32-bit value, the extended ASCII code 0xFF would be returned as 00 00 00 FF.

  • Consequently, fgetc(), getc(), getchar(), fgetwc(), getwc(), and getwchar() return int.
  • The character classification functions declared in <ctype.h>, such as isalpha(), accept int because they might be passed the result of fgetc() or the other functions from this list.
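The same convention applies when plain char data is passed to the classification functions: the value is first converted to unsigned char so that it is never negative. The sketch below assumes an 8-bit byte; the function names count_alpha and byte_as_int are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Counts alphabetic characters in a null-terminated byte string. Each
   char is converted to unsigned char before the call so that the
   argument is representable as unsigned char, as <ctype.h> requires. */
size_t count_alpha(const char *s) {
  size_t n = 0;
  for (; *s != '\0'; ++s) {
    if (isalpha((unsigned char)*s)) {
      ++n;
    }
  }
  return n;
}

/* Mimics how fgetc() reports a byte: interpreted as unsigned char and
   then converted to int, so the result is never negative. */
int byte_as_int(char c) {
  return (int)(unsigned char)c;
}
```

Because byte_as_int() never returns a negative value, a byte such as 0xFF cannot be mistaken for EOF.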

In C, a character constant has type int. Its value is that of a plain char converted to int. The perhaps surprising consequence is that for all character constants c, sizeof c is equal to sizeof(int). This also means, for example, that sizeof 'a' is not equal to sizeof x when x is a variable of type char.

In C++, a character literal that contains only one character has type char and consequently, unlike in C, its size is 1. In both C and C++, a wide-character literal has type wchar_t, and a multicharacter literal has type int.
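The C behavior is easy to confirm. The sketch below must be compiled as C, because the first function would return 1 under a C++ compiler; the function names are hypothetical.

```c
#include <stddef.h>

/* In C, 'a' has type int, so this is sizeof(int). */
size_t char_constant_size(void) {
  return sizeof 'a';
}

/* An object of type char always has size 1. */
size_t char_object_size(void) {
  char x = 'a';
  return sizeof x;
}
```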

unsigned char

The unsigned char type is useful when the object being manipulated might be of any type, and it is necessary to access all bits of that object, as with fwrite(). Unlike other integer types, unsigned char has the unique property that values stored in objects of type unsigned char are guaranteed to be represented using a pure binary notation. A pure binary notation is defined by the C Standard as “a positional representation for integers that uses the binary digits 0 and 1, in which the values represented by successive bits are additive, begin with 1, and are multiplied by successive integral powers of 2, except perhaps the bit with the highest position.”

Objects of type unsigned char are guaranteed to have no padding bits and consequently no trap representation. As a result, non-bit-field objects of any type may be copied into an array of unsigned char (for example, via memcpy()) and have their representation examined 1 byte at a time.
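Copying an object into an array of unsigned char and back can be sketched as follows. The example uses int as the object type, and the function names int_to_bytes and int_from_bytes are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Copies the object representation of value into out and returns the
   number of bytes copied. */
size_t int_to_bytes(int value, unsigned char out[sizeof(int)]) {
  memcpy(out, &value, sizeof(value));
  return sizeof(value);
}

/* Rebuilds an int from a representation produced by int_to_bytes(). */
int int_from_bytes(const unsigned char in[sizeof(int)]) {
  int value;
  memcpy(&value, in, sizeof(value));
  return value;
}
```

The order of the bytes in the array depends on the platform, so the representation should be treated as opaque unless the byte order is known.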

wchar_t

  • Wide characters are used for natural-language character data.

“STR00-C. Represent characters using an appropriate type” recommends that the use of character types follow this same philosophy. For characters in the basic character set, it does not matter which data type is used, except for type compatibility.

Sizing Strings

Sizing strings correctly is essential in preventing buffer overflows and other runtime errors. Incorrect string sizes can lead to buffer overflows when used, for example, to allocate an inadequately sized buffer. The CERT C Secure Coding Standard [Seacord 2008], “STR31-C. Guarantee that storage for strings has sufficient space for character data and the null terminator,” addresses this issue. Several important properties of arrays and strings are critical to allocating space correctly and preventing buffer overflows:

Confusing these concepts frequently leads to critical errors in C and C++ programs. The C Standard guarantees that objects of type char consist of a single byte. Consequently, the size of an array of char is equal to the count of an array of char, which is also the bounds. The length is the number of characters before the null terminator. For a properly null-terminated string of type char, the length must be less than or equal to the size minus 1.
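The relationship between length and size can be enforced before copying. The following sketch copies a string only when its length is at most the destination size minus 1; the function name copy_if_fits is hypothetical and is not part of any standard library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copies src into dest only if the string, including its null
   terminator, fits in dest_size bytes. Returns true on success and
   leaves dest untouched on failure. */
bool copy_if_fits(char *dest, size_t dest_size, const char *src) {
  size_t len = strlen(src);  /* length: characters before the null terminator */

  if (dest_size == 0 || len > dest_size - 1) {
    return false;
  }
  memcpy(dest, src, len + 1);  /* len + 1 also copies the null terminator */
  return true;
}
```

A caller would typically pass sizeof(buffer) as dest_size while the buffer still has array type.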

Wide-character strings may be improperly sized when they are mistaken for narrow strings or for multibyte character strings. The C Standard defines wchar_t to be an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales. Windows uses UTF-16 character encodings, so the size of wchar_t is typically 2 bytes. Linux and OS X (GCC/g++ and Xcode) use UTF-32 character encodings, so the size of wchar_t is typically 4 bytes. On most platforms, the size of wchar_t is at least 2 bytes, and consequently, the size of an array of wchar_t is no longer equal to the count of the same array. Programs that assume otherwise are likely to contain errors. For example, in the following program fragment, the strlen() function is incorrectly used to determine the size of a wide-character string:

wchar_t wide_str1[] = L"0123456789";
/* Incorrect: strlen() expects a byte string, not a wide string */
wchar_t *wide_str2 = (wchar_t *)malloc(strlen(wide_str1) + 1);
if (wide_str2 == NULL) {
  /* handle error */
}
/* ... */
free(wide_str2);
wide_str2 = NULL;

When this program is compiled, Microsoft Visual Studio 2012 generates an incompatible type warning and terminates translation. GCC 4.7.2 also generates an incompatible type warning but continues compilation.

The strlen() function counts the number of characters in a null-terminated byte string preceding the terminating null byte (the length). However, wide characters can contain null bytes, particularly when taken from the ASCII character set, as in this example. As a result, the strlen() function will return the number of bytes preceding the first null byte in the string.

In the following program fragment, the wcslen() function is correctly used to determine the length of a wide-character string, but the length is not multiplied by sizeof(wchar_t), so too few bytes are allocated:

wchar_t wide_str1[] = L"0123456789";
/* Incorrect: requests a number of bytes equal to the number of wide characters */
wchar_t *wide_str3 = (wchar_t *)malloc(wcslen(wide_str1) + 1);
if (wide_str3 == NULL) {
  /* handle error */
}
/* ... */
free(wide_str3);
wide_str3 = NULL;

The following program fragment correctly calculates the number of bytes required to contain a copy of the wide string (including the termination character):

wchar_t wide_str1[] = L"0123456789";
wchar_t *wide_str2 = (wchar_t *)malloc(
  (wcslen(wide_str1) + 1) * sizeof(wchar_t)
);
if (wide_str2 == NULL) {
  /* handle error */
}
/* ... */
free(wide_str2);
wide_str2 = NULL;

The CERT C Secure Coding Standard [Seacord 2008], “STR31-C. Guarantee that storage for strings has sufficient space for character data and the null terminator,” provides additional information with respect to sizing wide strings.
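The correct size calculation can be wrapped in a small helper so that it is written only once. This is a sketch under the assumption that the caller frees the result; the function name wide_dup is hypothetical, and the overflow check is an addition that goes beyond the fragment shown earlier.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Returns a heap-allocated copy of the wide string src, or NULL on
   failure. The caller is responsible for calling free(). */
wchar_t *wide_dup(const wchar_t *src) {
  size_t count = wcslen(src) + 1;  /* wide characters, including the null */
  wchar_t *copy;

  if (count > SIZE_MAX / sizeof(wchar_t)) {
    return NULL;                   /* the byte count would overflow */
  }
  copy = malloc(count * sizeof(wchar_t));
  if (copy == NULL) {
    return NULL;
  }
  memcpy(copy, src, count * sizeof(wchar_t));
  return copy;
}
```

The helper converts the count of wide characters to bytes in exactly one place, which removes the opportunity to forget the multiplication by sizeof(wchar_t).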
