
Unicode Strings

The use of standard strings and Unicode strings in the same program presents a number of subtle complications. This is because such strings may be used in a variety of operations, including string concatenation, comparisons, dictionary key lookups, and as arguments to built-in functions.

To convert a standard string, s, to a Unicode string, the built-in unicode(s [, encoding [,errors]]) function is used. To convert a Unicode string, u, to a standard string, the string method u.encode([encoding [, errors]]) is used. Both of these conversion operators require the use of a special encoding rule that specifies how 16-bit Unicode character values are mapped to a sequence of 8-bit characters in standard strings, and vice versa. The encoding parameter is specified as a string and is one of the following values:

Value                        Description
'ascii'                      7-bit ASCII
'latin-1' or 'iso-8859-1'    ISO 8859-1 Latin-1
'utf-8'                      8-bit variable-length encoding
'utf-16'                     16-bit variable-length encoding (may be little or big endian)
'utf-16-le'                  UTF-16, little endian encoding
'utf-16-be'                  UTF-16, big endian encoding
'unicode-escape'             Same format as Unicode literals u"string"
'raw-unicode-escape'         Same format as raw Unicode literals ur"string"
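As a rough illustration of how the encodings in the table differ, the sketch below encodes one Unicode string several ways and converts it back. It uses the decode() string method in place of unicode(s, encoding), and b"..." literals for standard strings, purely so the snippet is not tied to one interpreter version; the variable names are illustrative only.

```python
text = u"M\u00fcller"                   # 6 characters, one of them non-ASCII

latin1 = text.encode("latin-1")         # one byte per character
utf8 = text.encode("utf-8")             # U+00FC becomes the two bytes C3 BC
utf16_le = text.encode("utf-16-le")     # two bytes per character, low byte first
utf16_be = text.encode("utf-16-be")     # two bytes per character, high byte first
utf16 = text.encode("utf-16")           # native byte order, preceded by a byte-order mark

# Decoding with the same encoding recovers the original Unicode string.
roundtrip = utf8.decode("utf-8")

print(len(latin1), len(utf8), len(utf16_le), len(utf16))
```

The differing lengths show why the encoding used to produce a standard string has to be known before it can be converted back.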


The default encoding is set in the site module and can be queried using sys.getdefaultencoding(). In most cases, the default encoding is 'ascii', which means that ASCII characters with values in the range [0x00,0x7f] are directly mapped to Unicode characters in the range [U+0000, U+007F]. Details about the other encodings can be found in Chapter 9, "Input and Output."
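The direct mapping described above can be checked with a short sketch. The reported default encoding depends on the interpreter and site configuration, so the snippet only prints it; the variable names are illustrative.

```python
import sys

# Report the interpreter's default encoding (the value varies by installation).
print(sys.getdefaultencoding())

# Build a standard string holding every ASCII value 0x00..0x7f and decode it.
ascii_bytes = bytes(bytearray(range(0x80)))
decoded = ascii_bytes.decode("ascii")

# Each ASCII value maps to the Unicode code point with the same number.
code_points = [ord(ch) for ch in decoded]
print(code_points == list(range(0x80)))
```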

When string values are being converted, a UnicodeError exception may be raised if a character that can't be converted is encountered. For instance, if the encoding rule is 'ascii', a Unicode character such as U+1F28 can't be converted because its value is too large. Similarly, the string "\xfc" can't be converted to Unicode because it contains a character outside the range of valid ASCII character values. The errors parameter determines how encoding errors are handled. It's a string with one of the following values:

Value                  Description
'strict'               Raises a UnicodeError exception for encoding and decoding errors.
'ignore'               Ignores invalid characters.
'replace'              Replaces invalid characters with a replacement character (U+FFFD in Unicode, '?' in standard strings).
'backslashreplace'     Replaces invalid characters with a Python character escape sequence. For example, the character U+1234 is replaced by '\u1234'.
'xmlcharrefreplace'    Replaces invalid characters with an XML character reference. For example, the character U+1234 is replaced by '&#4660;'.


The default error handling is 'strict'.
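The effect of each errors value can be seen by encoding a string that contains U+1234 with the 'ascii' rule. This is a minimal sketch; the sample string and variable names are made up for illustration.

```python
sample = u"a\u1234b"

# 'strict' (the default) raises UnicodeError for the unencodable character.
try:
    sample.encode("ascii")
    strict_failed = False
except UnicodeError:
    strict_failed = True

ignored = sample.encode("ascii", "ignore")             # character dropped
replaced = sample.encode("ascii", "replace")           # character becomes '?'
escaped = sample.encode("ascii", "backslashreplace")   # Python escape sequence
xmlref = sample.encode("ascii", "xmlcharrefreplace")   # XML character reference

# When decoding, 'replace' substitutes U+FFFD for the invalid byte.
decoded = b"\xfc".decode("ascii", "replace")

print(strict_failed, ignored, replaced, escaped, xmlref)
```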

When standard strings and Unicode strings are mixed in an expression, standard strings are automatically coerced to Unicode using the built-in unicode() function. For example:

s = "hello"
t = u"world"
w = s + t     # w = unicode(s) + t
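The coercion can also be written out explicitly. The sketch below assumes the default 'ascii' encoding and uses decode() to stand in for the implicit unicode(s) call; it also shows the failure that results when the standard string holds a non-ASCII byte. The b"..." literals and variable names are choices made for this illustration.

```python
s = b"hello"
t = u"world"

# Explicit equivalent of the implicit coercion under the 'ascii' default.
w = s.decode("ascii") + t

# A standard string with a non-ASCII byte cannot be coerced this way.
try:
    b"M\xfcller".decode("ascii") + t
    mixed_ok = True
except UnicodeError:
    mixed_ok = False

print(w, mixed_ok)
```

Spelling out the decode step makes the encoding assumption visible instead of leaving it to the default.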

When Unicode strings are used in string methods that return new strings (as described in Chapter 3), the result is always coerced to Unicode. Here's an example:

a = "Hello World"
b = a.replace("World", u"Bob") # Produces u"Hello Bob"

Furthermore, even if zero replacements are made and the result is identical to the original string, the final result is still a Unicode string.

If a Unicode string is used as the format string with the % operator, all the arguments are first coerced to Unicode and then put together according to the given format rules. If a Unicode object is passed as one of the arguments to the % operator, the entire result is coerced to Unicode at the point at which the Unicode object is expanded. For example:

c = "%s %s" % ("Hello", u"World") # c = "Hello " + u"World"
d = u"%s %s" % ("Hello", "World") # d = u"Hello " + u"World"

When applied to Unicode strings, the str() and repr() functions automatically coerce the value back to a standard string. For Unicode string u, str(u) produces the value u.encode() and repr(u) produces essentially "u%s" % repr(u.encode('unicode-escape')).
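To see what the 'unicode-escape' rule mentioned here produces, the following sketch encodes a string containing non-ASCII characters and then reverses the step. The sample value is illustrative only.

```python
u = u"M\u00fcller \u1234"

# Non-ASCII characters are written as escape sequences made of plain ASCII.
escaped = u.encode("unicode-escape")

# Decoding with the same rule restores the original Unicode string.
restored = escaped.decode("unicode-escape")

print(escaped)
```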

In addition, most library and built-in functions that only operate with standard strings will automatically coerce Unicode strings to a standard string using the default encoding. If such a coercion is not possible, a UnicodeError exception is raised.

Standard and Unicode strings can be compared. In this case, standard strings are coerced to Unicode using the default encoding before any comparison is made. This coercion also occurs whenever comparisons are made during list and dictionary operations. For example, 'x' in [u'x', u'y', u'z'] coerces 'x' to Unicode and returns True. For character containment tests such as 'W' in u'Hello World', the character 'W' is coerced to Unicode before the test.

When computing hash values with the hash() function, standard strings and Unicode strings produce identical values, provided that the Unicode string only contains characters in the range [U+0000, U+007F]. This allows standard strings and Unicode strings to be used interchangeably as dictionary keys, provided that the Unicode strings are confined to ASCII characters. For example:

a = { }
a[u"foo"] = 1234
print a["foo"]    # Prints 1234

However, this dictionary key behavior may not hold if the default encoding is ever changed to something other than 'ascii' or if Unicode strings contain non-ASCII characters. If 'utf-8' is used as the default character encoding, it's possible to produce pathological cases in which strings compare as equal but have different hash values. For example:

a = u"M\u00fcller"      # Unicode string
b = "M\303\274ller"     # utf-8 encoded version of a
print a == b            # Prints True
print hash(a)==hash(b)  # Prints False
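One way to avoid depending on this behavior is to convert every key to Unicode before it reaches the dictionary. The sketch below is only an illustration: the helper name to_unicode and the choice of 'utf-8' as the source encoding are assumptions, not something the text prescribes.

```python
def to_unicode(value, encoding="utf-8"):
    """Return value as a Unicode string (hypothetical helper)."""
    # Standard (byte) strings are decoded; Unicode strings pass through unchanged.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

table = {}
table[to_unicode(b"M\xc3\xbcller")] = 1234   # utf-8 encoded standard string

# The lookup succeeds because both keys are the same Unicode string.
found = table[to_unicode(u"M\u00fcller")]
print(found)
```

Because every key is stored in one form, equality and hashing agree regardless of the default encoding.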