Character Semantics

Because Unicode aims to encode semantics rather than appearances, simple code charts aren't sufficient. After all, they merely show pictures of characters in a grid that maps them to numeric values. The pictures of the characters can certainly help illustrate the semantics of the characters, but they can't tell the whole story. The Unicode standard goes well beyond just pictures of the characters, providing a wealth of information on every character.

Every code chart in the standard is followed by a list of the characters in the code chart. For each character, an entry gives the following information:

  • Its Unicode code point value.

  • A representative glyph. For characters that combine with other characters, such as accent marks, the representative glyph includes a dotted circle that shows where the main character would go, which makes it possible to distinguish COMBINING DOT ABOVE from COMBINING DOT BELOW, for example. For characters that have no visual appearance, such as spaces and control and formatting codes, the representative glyph is a dotted square with some sort of abbreviation of the character name inside.

  • The character's name. The name, not the representative glyph, is the normative property. (The parts of the standard declared "normative" are the parts you have to follow exactly to conform to the standard; the parts declared "informative" supplement or clarify the normative parts and don't have to be followed exactly.) This reflects the philosophy that Unicode encodes semantics, although in a few very rare cases the actual meaning of a character has drifted since the earliest drafts of the standard and no longer matches its name.
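As a rough illustration of names carrying information that an isolated glyph cannot, the sketch below looks up the two combining marks mentioned above. It assumes Python's standard `unicodedata` module as a convenient view of the character database; the helper name `describe` is hypothetical and not part of the standard.

```python
import unicodedata

def describe(code_point):
    """Return a 'U+XXXX NAME' line for one code point (hypothetical helper)."""
    ch = chr(code_point)
    return f"U+{code_point:04X} {unicodedata.name(ch)}"

# Both marks are a single dot; only the name (and the dotted circle in the
# printed chart) says where the dot sits relative to the base character.
for cp in (0x0307, 0x0323):
    print(describe(cp))

# The name also works as a lookup key in the other direction.
dot_below = unicodedata.lookup("COMBINING DOT BELOW")
```

The names returned depend on the version of the character database bundled with the Python build in use, so very recently added characters may be missing.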

In addition to the code point value, name, and representative glyph, an entry may include the following:

  • Alternate names for the character

  • Cross-references to similar characters elsewhere in the standard (which helps to distinguish them from each other)

  • The character's canonical or compatibility decomposition (if it's a composite character)

  • Additional notes on its usage or meaning (for example, the entries for many letters include the languages that use them)
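The decomposition listed in an entry can also be read programmatically. The following is a minimal sketch, again assuming Python's `unicodedata` module; the function name `decomposition_info` and the shape of its return value are choices made for this example only. In the data that module returns, a leading tag in angle brackets marks a compatibility decomposition, and the absence of a tag marks a canonical one.

```python
import unicodedata

def decomposition_info(ch):
    """Return (kind, tag, code_points) for ch, or None if ch does not decompose.

    Hypothetical helper: 'kind' is "canonical" or "compatibility", and 'tag'
    is the bracketed label (such as "compat" or "super") or None.
    """
    raw = unicodedata.decomposition(ch)
    if not raw:
        return None
    parts = raw.split()
    tag = None
    if parts[0].startswith("<"):
        tag = parts[0].strip("<>")
        parts = parts[1:]
    kind = "compatibility" if tag else "canonical"
    return kind, tag, [int(p, 16) for p in parts]

# U+00E9 (e with acute), U+FB01 (fi ligature), U+00B2 (superscript two), plain A
for ch in ("\u00e9", "\ufb01", "\u00b2", "A"):
    print(f"U+{ord(ch):04X}", decomposition_info(ch))
```

Returning `None` for characters with no decomposition keeps the "not a composite" case distinct from an empty mapping.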

The Unicode standard also includes chapters on each major group of characters in the standard. Each chapter gives information that's common to all of the characters in the group (such as encoding philosophy or special processing challenges), along with additional narrative explaining the meaning and usage of any characters in the group whose special properties or behavior need to be called out.

There's more to the Unicode standard than just the book, The Unicode Standard Version 3.0. The standard also includes a comprehensive database of all the characters, a copy of which is on the CD that accompanies the book. Because the character database changes more frequently than the rest of the standard, it's usually a good idea to get the most recent version of the database from the Unicode Consortium's Web site at http://www.unicode.org.
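Because the database is versioned separately, it can be worth checking which version a given tool is actually using. As one example, and assuming a Python environment, the standard `unicodedata` module reports the version of the character database it was built with; the helper name `database_version` is hypothetical.

```python
import unicodedata

def database_version():
    """Return the character database version bundled with this Python build."""
    return unicodedata.unidata_version

# Prints a dotted version string; the exact value depends on the Python build.
print(database_version())
```

If the reported version is older than the one published by the Unicode Consortium, properties of newer characters will not be available through this module.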

Every character in Unicode is associated with a list of properties that define how the character is to be treated by various processes. The Unicode Character Database comprises a group of text files that give the properties for each character in Unicode. Among the properties that each character has are the following:

  • The character's code point value and name.

  • The character's general category. All of the characters in Unicode are grouped into 30 categories. The category tells you things like whether the character is a letter, numeral, symbol, whitespace character, control code, and so forth.

  • The character's decomposition, along with whether it's a canonical or compatibility decomposition, and for compatibility composites, a tag that attempts to indicate what data are lost when you convert to the decomposed form.

  • The character's case mapping. If the character is a cased letter, the database includes the mapping from the character to its counterpart in the opposite case.

  • For characters that are considered numerals, the character's numeric value (that is, the numeric value the character represents, not the character's code point value).

  • The character's directionality (e.g., whether it is left-to-right, is right-to-left, or takes on the directionality of the surrounding text). The Unicode Bidirectional Layout Algorithm uses this property to determine how to arrange characters of different directionalities on a single line of text.

  • The character's mirroring property. It says whether the character takes on a mirror-image glyph shape when surrounded by right-to-left text.

  • The character's combining class. It is used to derive the canonical representation of a character with more than one combining mark attached to it. Specifically, it determines the canonical ordering of combining characters that don't interact with each other.

  • The character's line-break properties. This information is used by text rendering processes to help figure out where line divisions should go.
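Several of the properties listed above can be inspected directly. The sketch below assumes Python's standard `unicodedata` module as a stand-in for reading the database files; the helper name `properties` and its dictionary keys are made up for this example. The module does not expose the line-break property, so that one is omitted, and case mappings are shown through the built-in string methods.

```python
import unicodedata

def properties(ch):
    """Collect a few database properties for one character (hypothetical helper)."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),          # e.g. Lu, No, Ps, Mn
        "numeric": unicodedata.numeric(ch, None),      # value represented, if any
        "bidi_class": unicodedata.bidirectional(ch),   # e.g. L, R, ON, NSM
        "mirrored": bool(unicodedata.mirrored(ch)),
        "combining_class": unicodedata.combining(ch),  # 0 for non-combining
    }

# Latin A, vulgar fraction one half, left parenthesis, Hebrew alef, combining acute
for ch in ("A", "\u00bd", "(", "\u05d0", "\u0301"):
    print(properties(ch))

# Case mapping: the opposite-case counterpart of a cased letter.
print("A".lower(), "a".upper())

# Combining class in action: DOT ABOVE has class 230 and DOT BELOW has 220,
# so canonical ordering places the dot below first.
typed = "q\u0307\u0323"
ordered = unicodedata.normalize("NFD", typed)
print([f"U+{ord(c):04X}" for c in ordered])
```

Marks with the same combining class are left in their original order, since they may interact with each other typographically.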

For an in-depth look at the various files in the Unicode Character Database, see Chapter 5.
