Unicode Standard, Version 4.0, The

Description

  • Copyright 2004
  • Dimensions: 8-1/2x11
  • Pages: 1504
  • Edition: 1st
  • Book
  • ISBN-10: 0-321-18578-1
  • ISBN-13: 978-0-321-18578-5

The authoritative guide to universal character encoding
The official way to implement ISO/IEC 10646
The key to advancing global interoperability in information technology products

Unicode 4.0: The Unicode Standard

The Unicode Standard provides a unique code number for every character in electronic text, no matter what the platform, no matter what the application, no matter what the language. It is required for XML and is at the core of modern software products. Unicode 4.0 contains 96,248 characters covering languages of the world. The Unicode Standard contains extensive descriptions of each writing system, as well as definitions of character properties and detailed conformance requirements. It is the complete and definitive user's guide for novices and experts alike.

This edition, The Unicode Standard, Version 4.0, adds 47,188 new characters for minority and historic scripts, several sets of symbols, and a very large collection of additional CJK ideographs. It provides updated specifications covering structure, conformance, character behavior and semantics, as well as implementation guidelines, detailed discussions of writing systems, comprehensive charts, and an extensive glossary. The accompanying CD-ROM includes the text of all the Unicode Standard Annexes and the entire Unicode Character Database.



Sample Content

Table of Contents



Acknowledgments.


Unicode Consortium Members and Directors.


Figures.


Tables.


Preface.


1. Introduction.

Coverage.

Standards Coverage.

New Characters.

Design Goals.

Text Handling.

Interpreting Characters.

Text Elements.

The Unicode Standard and ISO/IEC 10646.

The Unicode Consortium.

The Unicode Technical Committee.

Submitting New Characters.



2. General Structure.

Architectural Context.

Basic Text Processes.

Text Elements, Characters, and Text Processes.

Text Processes and Encoding.

Unicode Design Principles.

Universality.

Efficiency.

Characters, Not Glyphs.

Semantics.

Plain Text.

Logical Order.

Unification.

Dynamic Composition.

Equivalent Sequences.

Convertibility.

Compatibility Characters.

Compatibility Characters.

Compatibility Decomposable Characters.

Mapping Compatibility Characters.

Code Points and Characters.

Types of Code Points.

Encoding Forms.

UTF-32.

UTF-16.

UTF-8.

Comparison of the Advantages of UTF-32, UTF-16, and UTF-8.

Encoding Schemes.

Unicode Strings.

Unicode Allocation.

Planes.

Allocation Areas and Character Blocks.

Details of Allocation.

Assignment of Code Points.

Writing Direction.

Combining Characters.

Sequence of Base Characters and Diacritics.

Multiple Combining Characters.

Ligated Multiple Base Characters.

Spacing Clones of European Diacritical Marks.

"Characters" and Grapheme Clusters.

Special Characters and Noncharacters.

Byte Order Mark (BOM).

Special Noncharacter Code Points.

Layout and Format Control Characters.

The Replacement Character.

Control Codes.

Conforming to the Unicode Standard.

Supported Subsets.

Related Publications.



3. Conformance.

Versions of the Unicode Standard.

Stability.

Version Numbering.

Errata, Corrigenda, and Future Updates.

References to the Unicode Standard.

References to Unicode Character Properties.

References to Unicode Algorithms.

Conformance Requirements.

Byte Ordering.

Unassigned Code Points.

Interpretation.

Modification.

Character Encoding Forms.

Character Encoding Schemes.

Bidirectional Text.

Normalization Forms.

Normative References.

Unicode Algorithms.

Default Casing Operations.

Unicode Standard Annexes.

Semantics.

Definitions.

Character Identity and Semantics.

Characters and Encoding.

Properties.

Normative and Informative Properties.

Simple and Derived Properties.

Property Aliases.

Default Property Values.

Private Use.

Combination.

Decomposition.

Compatibility Decomposition.

Canonical Decomposition.

Surrogates.

Unicode Encoding Forms.

UTF-32.

UTF-16.

UTF-8.

Encoding Form Conversion.

Unicode Encoding Schemes.

Canonical Ordering Behavior.

Application of Combining Marks.

Combining Classes.

Canonical Ordering.

Canonical Ordering and Collation.

Conjoining Jamo Behavior.

Hangul Syllable Boundaries.

Standard Korean Syllables.

Hangul Syllable Composition.

Hangul Syllable Decomposition.

Hangul Syllable Names.

Default Case Operations.

Definitions.

Case Conversion of Strings.

Case Detection for Strings.

Caseless Matching.



4. Character Properties.

Unicode Character Database.

Case-Normative.

Case Mapping.

Combining Classes-Normative.

Directionality-Normative.

General Category-Normative.

Numeric Value-Normative.

Ideographic Numeric Values.

Bidi Mirrored-Normative.

Unicode 1.0 Names.

Letters, Alphabetic, and Ideographic.

Boundary Control.

Characters with Unusual Properties.



5. Implementation Guidelines.

Transcoding to Other Standards.

Issues.

Multistage Tables.

ANSI/ISO C wchar_t.

Unknown and Missing Characters.

Reserved and Private-Use Character Codes.

Interpretable but Unrenderable Characters.

Default Property Values.

Default Ignorable Code Points.

Interacting with Downlevel Systems.

Handling Surrogate Pairs in UTF-16.

Handling Numbers.

Normalization.

Compression.

Newline Guidelines.

Definitions.

Background.

Recommendations.

Regular Expressions.

Language Information in Plain Text.

Requirements for Language Tagging.

Language Tags and Han Unification.

Editing and Selection.

Consistent Text Elements.

Strategies for Handling Nonspacing Marks.

Keyboard Input.

Truncation.

Rendering Nonspacing Marks.

Canonical Equivalence.

Positioning Methods.

Locating Text Element Boundaries.

Identifiers.

Property-Based Identifier Syntax.

Syntactic Rule.

Alternative Recommendation.

Sorting and Searching.

Culturally Expected Sorting and Searching.

Language-Insensitive Sorting.

Searching.

Sublinear Searching.

Binary Order.

UTF-8 in UTF-16 Order.

UTF-16 in UTF-8 Order.

Case Mappings.

Complications for Case Mapping.

Reversibility.

Caseless Matching.

Normalization.

Unicode Security.

Default Ignorable Code Points.



6. Writing Systems and Punctuation.

Writing Systems.

General Punctuation.

Punctuation: U+0020-U+00BF.

General Punctuation: U+2000-U+206F.

CJK Symbols and Punctuation: U+3000-U+303F.

CJK Compatibility Forms: U+FE30-U+FE4F.

Small Form Variants: U+FE50-U+FE6F.



7. European Alphabetic Scripts.

Latin.

Letters of Basic Latin: U+0041-U+007A.

Letters of the Latin-1 Supplement: U+00C0-U+00FF.

Latin Extended-A: U+0100-U+017F.

Latin Extended-B: U+0180-U+024F.

IPA Extensions: U+0250-U+02AF.

Phonetic Extensions: U+1D00-U+1D6A.

Latin Extended Additional: U+1E00-U+1EFF.

Latin Ligatures: U+FB00-U+FB06.

Greek.

Greek: U+0370-U+03FF.

Greek Extended: U+1F00-U+1FFF.

Cyrillic.

Cyrillic: U+0400-U+04FF.

Cyrillic Supplement: U+0500-U+052F.

Armenian.

Armenian: U+0530-U+058F.

Georgian.

Georgian: U+10A0-U+10FF.

Modifier Letters.

Spacing Modifier Letters: U+02B0-U+02FF.

Combining Marks.

Combining Diacritical Marks: U+0300-U+036F.

Combining Marks for Symbols: U+20D0-U+20FF.

Combining Half Marks: U+FE20-U+FE2F.



8. Middle Eastern Scripts.

Hebrew.

Hebrew: U+0590-U+05FF.

Alphabetic Presentation Forms: U+FB1D-U+FB4F.

Arabic.

Arabic: U+0600-U+06FF.

Cursive Joining.

Ligatures.

Arabic Presentation Forms-A: U+FB50-U+FDFF.

Arabic Presentation Forms-B: U+FE70-U+FEFF.

Syriac.

Syriac: U+0700-U+074F.

Syriac Shaping.

Syriac Cursive Joining.

Ligatures.

Thaana.

Thaana: U+0780-U+07BF.



9. South Asian Scripts.

Devanagari.

Devanagari: U+0900-U+097F.

Bengali.

Bengali: U+0980-U+09FF.

Gurmukhi.

Gurmukhi: U+0A00-U+0A7F.

Gujarati.

Gujarati: U+0A80-U+0AFF.

Oriya.

Oriya: U+0B00-U+0B7F.

Tamil.

Tamil: U+0B80-U+0BFF.

Telugu.

Telugu: U+0C00-U+0C7F.

Kannada.

Kannada: U+0C80-U+0CFF.

Malayalam.

Malayalam: U+0D00-U+0D7F.

Sinhala.

Sinhala: U+0D80-U+0DFF.

Tibetan.

Tibetan: U+0F00-U+0FFF.

Limbu.

Limbu: U+1900-U+194F.



10. Southeast Asian Scripts.

Thai.

Thai: U+0E00-U+0E7F.

Lao.

Lao: U+0E80-U+0EFF.

Myanmar.

Myanmar: U+1000-U+109F.

Khmer.

Khmer: U+1780-U+17FF.

Khmer Symbols: U+19E0-U+19FF.

Tai Le.

Tai Le: U+1950-U+197F.

Philippine Scripts.

Tagalog: U+1700-U+171F.

Hanunoo: U+1720-U+173F.

Buhid: U+1740-U+175F.

Tagbanwa: U+1760-U+177F.



11. East Asian Scripts.

Han.

CJK Unified Ideographs.

CJK Unified Ideographs Ext. B: U+20000-U+2A6D6.

CJK Compatibility Ideographs: U+F900-U+FAFF.

CJK Compatibility Supplement: U+2F800-U+2FA1D.

Kanbun: U+3190-U+319F.

CJK and KangXi Radicals: U+2E80-U+2FD5.

Ideographic Description: U+2FF0-U+2FFB.

Bopomofo.

Bopomofo: U+3100-U+312F.

Hiragana and Katakana.

Hiragana: U+3040-U+309F.

Katakana: U+30A0-U+30FF.

Katakana Phonetic Extensions: U+31F0-U+31FF.

Halfwidth and Fullwidth Forms: U+FF00-U+FFEF.

Hangul.

Hangul Jamo: U+1100-U+11FF.

Hangul Compatibility Jamo: U+3130-U+318F.

Hangul Syllables: U+AC00-U+D7A3.

Yi.

Yi: U+A000-U+A4CF.



12. Additional Modern Scripts.

Ethiopic.

Ethiopic: U+1200-U+137F.

Mongolian.

Mongolian: U+1800-U+18AF.

Osmanya.

Osmanya: U+10480-U+104AF.

Cherokee.

Cherokee: U+13A0-U+13FF.

Canadian Aboriginal Syllabics.

Canadian Aboriginal Syllabics: U+1400-U+167F.

Deseret.

Deseret: U+10400-U+1044F.

Shavian.

Shavian: U+10450-U+1047F.



13. Archaic Scripts.

Ogham.

Ogham: U+1680-U+169F.

Old Italic.

Old Italic: U+10300-U+1032F.

Runic.

Runic: U+16A0-U+16F0.

Gothic.

Gothic: U+10330-U+1034F.

Ugaritic.

Ugaritic: U+10380-U+1039F.

Linear B.

Linear B Syllabary: U+10000-U+1007F.

Linear B Ideograms: U+10080-U+100FF.

Aegean Numbers: U+10100-U+1013F.

Cypriot Syllabary.

Cypriot Syllabary: U+10800-U+1083F.



14. Symbols.

Currency Symbols.

Currency Symbols: U+20A0-U+20CF.

Letterlike Symbols.

Letterlike Symbols: U+2100-U+214F.

Math Alphanumeric Symbols: U+1D400-U+1D7FF.

Mathematical Alphabets.

Fonts Used for Mathematical Alphabets.

Number Forms.

Number Forms: U+2150-U+218F.

Superscripts and Subscripts: U+2070-U+209F.

Mathematical Symbols.

Mathematical Operators: U+2200-U+22FF.

Supplements to Mathematical Symbols and Arrows.

Supplemental Math Operators: U+2A00-U+2AFF.

Miscellaneous Math Symbols-A: U+27C0-U+27EF.

Miscellaneous Math Symbols-B: U+2980-U+29FF.

Arrows: U+2190-U+21FF.

Supplemental Arrows.

Standardized Variants of Mathematical Symbols.

Technical Symbols.

Control Pictures: U+2400-U+243F.

Miscellaneous Technical: U+2300-U+23FF.

Optical Character Recognition: U+2440-U+245F.

Geometrical Symbols.

Box Drawing: U+2500-U+257F.

Block Elements: U+2580-U+259F.

Geometric Shapes: U+25A0-U+25FF.

Miscellaneous Symbols and Dingbats.

Miscellaneous Symbols: U+2600-U+26FF.

Dingbats: U+2700-U+27BF.

Yijing Hexagram Symbols: U+4DC0-U+4DFF.

Tai Xuan Jing Symbols: U+1D300-U+1D356.

Enclosed and Square.

Enclosed Alphanumerics: U+2460-U+24FF.

Enclosed CJK Letters and Months: U+3200-U+32FF.

CJK Compatibility: U+3300-U+33FF.

Braille.

Braille Patterns: U+2800-U+28FF.

Byzantine Musical Symbols.

Byzantine Musical Symbols: U+1D000-U+1D0FF.

Western Musical Symbols.

Musical Symbols: U+1D100-U+1D1FF.



15. Special Areas and Format Characters.

Control Codes.

Layout Controls.

Invisible Operators.

Deprecated Format Characters.

Deprecated Format Characters: U+206A-U+206F.

Surrogates Area.

Surrogates Area: U+D800-U+DFFF.

Variation Selectors.

Private-Use Characters.

Private Use Area: U+E000-U+F8FF.

Supplementary Private Use Areas.

Noncharacters.

Noncharacters: U+FFFE, U+FFFF, and Others.

Specials.

Specials: U+FEFF, U+FFF0-U+FFFD.

Tag Characters.

Tag Characters: U+E0000-U+E007F.



16. Code Charts.

Character Names List.

Images in the Code Charts and Character Lists.

Character Names.

Aliases.

Cross References.

Information About Languages.

Case Mappings.

Decompositions.

Reserved Characters.

Noncharacters.

Subheads.

CJK Unified Ideographs.

Hangul Syllables.



17. Han Radical-Stroke Index.


A. Han Unification History.


B. Abstracts of Unicode Technical Reports.

Unicode Standard Annexes.

UAX #9: The Bidirectional Algorithm.

UAX #11: East Asian Width.

UAX #14: Line Breaking Properties.

UAX #15: Unicode Normalization Forms.

UAX #24: Script Names.

UAX #29: Text Boundaries.

Unicode Technical Standards.

UTS #6: A Standard Compression Scheme for Unicode.

UTS #10: Unicode Collation Algorithm.

Unicode Technical Reports.

UTR #16: UTF-EBCDIC.

UTR #17: Character Encoding Model.

UTR #18: Unicode Regular Expression Guidelines.

UTR #20: Unicode in XML and Other Markup Languages.

UTR #22: Character Mapping Markup Language (CharMapML).

UTR #26: Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8).

Other Unicode References.

Unicode Technical Notes.

FAQ (Frequently Asked Questions).

Charts.

Conferences.

Policies.

Updates and Errata.

Versions.

Where Is My Character?



C. Relationship to ISO/IEC 10646.

History.

Unicode 1.0.

Unicode 2.0.

Unicode 3.0.

Unicode 4.0.

Encoding Forms in ISO/IEC 10646.

Zero Extending.

UCS Transformation Formats.

UTF-8.

UTF-16.

Synchronization of the Standards.

Identification of Features for the Unicode Standard.

Character Names.

Character Functional Specifications.



D. Changes from Unicode Version 3.0.

Versions of the Unicode Standard.

Changes from Unicode Version 3.0 to Version 3.1.

New Characters Added.

Unicode Character Database Changes.

Changes Affecting Conformance.

Unicode Standard Annexes.

Changes from Unicode Version 3.1 to Version 3.2.

New Characters Added.

Unicode Character Database Changes.

Changes Affecting Conformance.

Unicode Standard Annexes.

Changes from Unicode Version 3.2 to Version 4.0.

New Characters Added.

Unicode Character Database Changes.

Changes Affecting Conformance.

Unicode Standard Annexes.

Errata.



G. Glossary.


R. References.

Source Standards and Specifications.

Source Dictionaries for Han Unification.

Other Sources for the Unicode Standard.

Selected Resources: Technical.

Selected Resources: Scripts and Languages.



I. Indices.

Unicode Names Index.

General Index.

Preface

This book, The Unicode Standard, Version 4.0, is the authoritative source of information on the Unicode character encoding standard.

Version 4.0 expands on and supersedes all other previous versions. The text of the standard has been extensively rewritten to improve its structure and clarity.

Major additions to Version 4.0 since Version 3.0 include:

  • extensive additions of CJK characters to cover dictionaries and historic usage
  • many new symbols for mathematical and technical publication
  • substantially improved specification of conformance requirements, incorporating the character encoding model
  • encoding of supplementary characters
  • formalized policies for stability of the standard
  • clarification of semantics of special characters, including the byte order mark
  • major expansion of Unicode Character Database properties and of specifications for text boundaries and casing
  • more minority scripts, including Limbu, Tai Le, Osmanya, and Philippine scripts
  • more historic scripts, including Linear B, Cypriot, and Ugaritic
  • tightened definition of encoding terms, including UTF-32

Furthermore, many individual characters were added to meet the requirements of users and implementers alike. The Unicode Standard maintains consistency with the international standard ISO/IEC 10646. Version 4.0 of the Unicode Standard corresponds to ISO/IEC 10646:2003.

0.1 About the Unicode Standard

This book, together with the Unicode Standard Annexes described in Appendix B, and the Unicode Character Database, defines Version 4.0 of the Unicode Standard. The book gives the general principles, requirements for conformance, and guidelines for implementers, followed by character code charts and names.

Concepts, Architecture, Conformance, and Guidelines

The first five chapters of Version 4.0 introduce the Unicode Standard and provide the fundamental information needed to produce a conforming implementation. Basic text processing, working with combining marks, and encoding forms are all described. A special chapter on implementation guidelines answers many common questions that arise when implementing Unicode.

Chapter 1 introduces the standard's basic concepts, design basis, and coverage, and discusses basic text handling requirements.

Chapter 2 sets forth the fundamental principles underlying the Unicode Standard and covers specific topics such as text processes, overall character properties, and the use of combining marks.

Chapter 3 constitutes the formal statement of conformance. This chapter also presents the normative algorithms for two processes: the canonical ordering of combining marks and the encoding of Korean Hangul syllables by conjoining jamo.

Chapter 4 describes character properties in detail, both normative (required) and informative. Tables giving additional character property information appear in the Unicode Character Database.

Chapter 5 discusses implementation issues, including compression, strategies for dealing with unknown and unsupported characters, and transcoding to other standards.

Character Block Descriptions

Chapters 6 through 15 contain the character block descriptions that give basic information about each script or group of symbols and may discuss specific characters or pertinent layout information. Some of this information is required in order to produce conformant implementations of these scripts and other collections of characters.

Chapter 6 introduces writing systems and describes the general punctuation characters.

Chapter 7 presents the European Alphabetic scripts, including Latin, Greek, Cyrillic, Armenian, Georgian, and associated combining marks.

Chapter 8 presents the Middle Eastern, right-to-left scripts: Hebrew, Arabic, Syriac, and Thaana.

Chapter 9 covers the South Asian scripts, including Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Tibetan, and Limbu. Chapter 10 covers the Southeast Asian scripts, including Thai, Lao, Tai Le, Myanmar, Khmer, and Philippine scripts.

Chapter 11 presents the East Asian scripts, including Han, Hiragana, Katakana, Hangul, Bopomofo, and Yi.

Chapter 12 presents other scripts, including Ethiopic, Mongolian, Osmanya, Cherokee, Canadian Aboriginal Syllabics, Deseret, and Shavian.

Chapter 13 describes archaic scripts, including Ogham, Old Italic, Runic, Gothic, Ugaritic, Linear B, and Cypriot.

Chapter 14 presents symbols, including currency, letterlike and technical symbols, mathematical operators, and musical symbols.

Chapter 15 describes other topics such as private-use characters, surrogate code points, and special characters.

Charts and Han Radical-Stroke Index

The next two chapters document the Unicode Standard's character code assignments, their names and important descriptive information, and provide a Han radical-stroke index that aids in locating specific ideographs encoded in Unicode.

Chapter 16 gives the code charts and the Character Names List. The code charts contain the normative character encoding assignments, and the names list contains normative information as well as useful cross references and informational notes.

Chapter 17 provides a radical-stroke index to East Asian ideographs.

Appendices

The appendices contain detailed background information on important topics regarding the history of the Unicode Standard and its relationship to ISO/IEC 10646.

Appendix A describes the history of Han Unification in the Unicode Standard.

Appendix B provides abstracts of Unicode Technical Reports and lists other important Unicode resources.

Appendix C details the relationship between the Unicode Standard and ISO/IEC 10646.

Appendix D lists the changes to the Unicode Standard since Version 3.0.

The appendices are followed by a glossary of terms, a bibliography, and two indices: an index to Unicode characters and an index to the text of the book.

0.2 The Unicode Character Database and Technical Reports

The Unicode Character Database is a collection of data files that contain character code points, character names, and character property data. It is described more fully in Chapter 4, Character Properties. The files of the Unicode Character Database are found on the Unicode Web site at: http://www.unicode.org/ucd/.

The files for Version 4.0.0 of the Unicode Character Database are also supplied on the CD-ROM that accompanies this book.

Information on versions of the Unicode Standard can be found on the Unicode Web site at: http://www.unicode.org/versions/.

All versions of all Unicode Technical Reports, Unicode Technical Standards, and Unicode Standard Annexes are available on the Unicode Web site http://www.unicode.org/reports/.

The latest available version of each document at the time of publication is included on the CD-ROM. See Appendix B for a summary overview of important Unicode Technical Standards, Unicode Technical Reports and Unicode Standard Annexes.

On the CD-ROM

The CD-ROM also contains additional information, such as sample code, which is maintained on the Unicode ftp site at: ftp.unicode.org or via http at: http://www.unicode.org/Public/. For the complete contents of the CD-ROM see its ReadMe.txt file.

0.3 Notational Conventions

Throughout this book, certain typographic conventions are used.

Code Points

In running text, an individual Unicode code point can be expressed as U+n, where n is from four to six hexadecimal digits, using the digits 0-9 and uppercase letters A-F (for 10 through 15, respectively). There should be no leading zeros, unless the code point would have fewer than four hexadecimal digits; for example, U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.

  • U+0416 is the Unicode code point for the character named CYRILLIC CAPITAL LETTER ZHE.

In tables, the U+ may be omitted for brevity.
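
The U+n convention above can be sketched in a few lines of Python. The function name `format_code_point` is an assumption made for this illustration and is not part of the standard; the sketch only shows that padding to four uppercase hexadecimal digits reproduces the examples given in the text.

```python
def format_code_point(cp: int) -> str:
    """Render a code point in U+n notation (hypothetical helper).

    Uses at least four uppercase hexadecimal digits, with no leading
    zeros beyond the fourth digit.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # The 04X format pads short values to four digits and never
    # truncates five- or six-digit values.
    return f"U+{cp:04X}"


if __name__ == "__main__":
    for cp in (0x1, 0x12, 0x123, 0x1234, 0x12345, 0x102345):
        print(format_code_point(cp))
```

Run as a script, this prints the six sample values listed above, from U+0001 through U+102345.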

A range of Unicode code points is expressed as U+xxxx-U+yyyy or xxxx..yyyy, where xxxx and yyyy are the first and last Unicode values in the range, and the long dash or two dots indicate a contiguous range inclusive of the endpoints. For ranges involving supplementary characters, the code points in the ranges are expressed with five or six hexadecimal digits.

  • The range U+0900-U+097F contains 128 Unicode code points.
  • The Plane 16 private use characters are in the range 100000..10FFFD.
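
Because ranges are inclusive of both endpoints, the number of code points in a range is last - first + 1. The following Python sketch parses both range notations and checks the two examples above. The helper name `parse_range` and the regular expression are assumptions for illustration only; the sketch accepts a hyphen or an en dash as the range separator.

```python
import re

# Accepts "U+xxxx-U+yyyy" (hyphen or en dash) and "xxxx..yyyy";
# the U+ prefix is optional, as it may be omitted in tables.
_RANGE = re.compile(
    r"^(?:U\+)?([0-9A-F]{4,6})(?:\.\.|[-\u2013])(?:U\+)?([0-9A-F]{4,6})$"
)


def parse_range(text: str) -> range:
    """Return the inclusive code point range written in `text` (hypothetical helper)."""
    match = _RANGE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized range notation: {text!r}")
    first, last = (int(group, 16) for group in match.groups())
    if first > last or last > 0x10FFFF:
        raise ValueError(f"invalid code point range: {text!r}")
    # Python ranges exclude the stop value, so add 1 to include `last`.
    return range(first, last + 1)


if __name__ == "__main__":
    print(len(parse_range("U+0900-U+097F")))   # 128
    print(len(parse_range("100000..10FFFD")))  # 65534
```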

Character Names

All Unicode characters have unique names, which are identical to those of the English language edition of International Standard ISO/IEC 10646. Unicode character names contain only uppercase Latin letters A through Z, digits, space, and hyphen-minus; this convention makes it easy to generate computer-language identifiers automatically from the names. Unified CJK ideographs are named CJK UNIFIED IDEOGRAPH-X, where X is replaced with the hexadecimal Unicode code point; for example, CJK UNIFIED IDEOGRAPH-4E00. The names of Hangul syllables are generated algorithmically; for details, see Hangul Syllable Names in Section 3.12, Conjoining Jamo Behavior.
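
As a rough illustration of how identifiers might be derived from character names, the sketch below uses Python's standard `unicodedata` module to look up names. The helpers `name_to_identifier` and `identifier_for` are hypothetical, and mapping both space and hyphen-minus to an underscore is just one possible convention, not one prescribed by the standard.

```python
import unicodedata


def name_to_identifier(name: str) -> str:
    """Turn a character name into an identifier-like string (hypothetical helper).

    Names use only A-Z, digits, space, and hyphen-minus, so replacing
    the last two with underscores is enough for most languages.
    """
    return name.replace(" ", "_").replace("-", "_")


def identifier_for(ch: str) -> str:
    """Look up the name of a single character and convert it."""
    return name_to_identifier(unicodedata.name(ch))


if __name__ == "__main__":
    # A unified CJK ideograph: the name embeds the code point.
    print(unicodedata.name("\u4E00"))
    # A Hangul syllable: the name is generated algorithmically.
    print(unicodedata.name("\uAC00"))
    print(identifier_for("\u0416"))
```

The names returned depend on the version of the Unicode Character Database bundled with the Python interpreter, which is normally later than Version 4.0.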

In running text, a formal Unicode name is shown in small capitals (for example, GREEK SMALL LETTER MU), and alternative names (aliases) appear in italics (for example, umlaut). Italics are also used to refer to a text element that is not explicitly encoded (for example, pasekh alef) or to set off a non-English word (for example, the Welsh word ynghyd).

Sequences

A sequence of two or more code points may be represented by a comma-delimited list, set off by angle brackets. For this purpose angle brackets consist of U+003C LESS-THAN SIGN and U+003E GREATER-THAN SIGN. Spaces are optional after the comma, and U+ notation for the code point is also optional; for example, "<U+0061, U+0300>". When the usage is clear from the context, a sequence of characters may also be represented with generic short names, for example as in "<a, grave>", or the angle brackets may be omitted.

In contrast to sequences of code points, a sequence of one or more code units may be represented by a list set off by angle brackets, but without comma delimitation or U+ notation. For example, the notation "<nn nn nn nn>" represents a sequence of bytes, as for the UTF-8 encoding form of a Unicode character. The notation "<nnnn nnnn>" represents a sequence of 16-bit code units, as for the UTF-16 encoding form of a Unicode character.
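
The difference between the two notations can be made concrete with a short Python sketch. The three function names below are assumptions for illustration; the encodings themselves come from Python's built-in codecs. Code point sequences are comma-delimited with U+ notation, whereas code unit sequences are space-separated hexadecimal values.

```python
def code_point_sequence(text: str) -> str:
    """Comma-delimited code points in angle brackets (hypothetical helper)."""
    return "<" + ", ".join(f"U+{ord(ch):04X}" for ch in text) + ">"


def utf8_code_units(text: str) -> str:
    """UTF-8 code units (bytes), two hex digits each, no commas or U+."""
    return "<" + " ".join(f"{byte:02X}" for byte in text.encode("utf-8")) + ">"


def utf16_code_units(text: str) -> str:
    """UTF-16 code units, four hex digits each, no commas or U+."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "<" + " ".join(f"{unit:04X}" for unit in units) + ">"


if __name__ == "__main__":
    print(code_point_sequence("a\u0300"))    # <U+0061, U+0300>
    print(utf8_code_units("\U00010000"))     # <F0 90 80 80>
    print(utf16_code_units("\U00010000"))    # <D800 DC00>
```

The supplementary character U+10000 is used here because it needs four bytes in UTF-8 and a surrogate pair in UTF-16, so the output shows more than one code unit in each form.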

Miscellaneous

Phonemic transcriptions are shown between slashes, as in Khmer /khnyom/.

Phonetic transcriptions are shown between square brackets, using the International Phonetic Alphabet. (Full details on the IPA can be found on the International Phonetic Association's Web site, http://www2.arts.gla.ac.uk/IPA/ipa.html.)

A leading asterisk is used to represent an incorrect or nonoccurring linguistic form.

The symbols used in the character names list are described at the beginning of Chapter 16, Code Charts.

In the text of this book, the word "Unicode" when used alone as a noun refers to the Unicode Standard.

Unambiguous dates of the current common era, such as 1999, are unlabeled. In cases of ambiguity, CE is used. Dates before the common era are labeled with BCE.

The term byte, as used in this standard, always refers to a unit of eight bits. This corresponds to the use of the term octet in some other standards.

Extended BNF

The Unicode Standard and technical reports use an extended BNF format for describing syntax. As different conventions are used for BNF, Table 0-1, Extended BNF, lists the notation used here.

In other environments, such as programming languages or mark-up, alternative notation for sequences of code points or code units may be used.

Character Classes. A code point class is a specification of an unordered set of code points. Whenever the code points are all assigned characters, it can also be referred to as a character class. The specification consists of any of the following:

  • A literal code point
  • A range of literal code points
  • A set of code points having a given Unicode character property value, as defined in the Unicode Character Database (see PropertyAliases.txt and PropertyValueAliases.txt)
  • Non-boolean properties given as an expression <property> = <property_value> or <property> ≠ <property_value>, such as "General_Category=Titlecase_Letter"
  • Boolean properties given as an expression <property> = true or <property> ≠ true, such as "Uppercase=true"
  • Combinations of logical operations on classes

Further extensions to this specification of character classes are used in some Unicode Standard Annexes and Unicode Technical Reports. Such extensions are described in those documents, as appropriate.

A partial formal BNF syntax for character classes as used in this standard is given by the following.

char_class := "[" char_class - char_class "]"            // set difference
           := "[" item_list "]"
           := "[" property ("=" | "≠") property_value "]"
item_list  := item (","? item)?
item       := code_point                                 // either literal or escaped
           := code_point - code_point                    // inclusive range

Whenever any character could be interpreted as a syntax character, it must be escaped. Where no ambiguity would result (with normal operator precedence), extra square brackets can be discarded. If a space character is used as a literal, it is escaped. Examples are found in Table 0-2, Character Class Examples.
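
One way to see what a code point class denotes is to model it as a plain set of integers. The Python sketch below does this with the standard `unicodedata` module; the helper names `code_points` and `with_general_category` are assumptions for illustration and do not implement the BNF syntax itself. Set difference and union stand in for the logical operations on classes.

```python
import unicodedata


def code_points(first: int, last: int) -> set:
    """An inclusive range of literal code points, as a set (hypothetical helper)."""
    return set(range(first, last + 1))


def with_general_category(prefix: str, universe: set) -> set:
    """Members of `universe` whose General_Category starts with `prefix`.

    A one-letter prefix such as "L" selects a major class; a two-letter
    value such as "Cn" (unassigned) selects a single category.
    """
    return {cp for cp in universe
            if unicodedata.category(chr(cp)).startswith(prefix)}


# English lowercase letters except for c: a range minus one literal.
lowercase_except_c = code_points(ord("a"), ord("z")) - {ord("c")}

# Hexadecimal digits: a union of three ranges.
hex_digits = (code_points(ord("0"), ord("9"))
              | code_points(ord("A"), ord("F"))
              | code_points(ord("a"), ord("f")))

# All assigned characters in U+0600-U+06FF: the range minus gc=unassigned.
arabic_range = code_points(0x0600, 0x06FF)
assigned_arabic = arabic_range - with_general_category("Cn", arabic_range)

if __name__ == "__main__":
    print(len(lowercase_except_c), len(hex_digits), len(assigned_arabic))
```

The size of `assigned_arabic` depends on the Unicode version bundled with the Python interpreter, since later versions assign more characters in that range than Version 4.0 did.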

For more information about character classes, see Unicode Technical Report #18, "Unicode Regular Expression Guidelines."

Operators

Operators used in this standard are listed in Table 0-3, Operators.

0.4 Resources

The Unicode Consortium provides a number of online resources for obtaining information and data about the Unicode Standard, as well as updates and corrigenda. They are listed below.

Subscription instructions for the email discussion list are posted on the Unicode Web site.

Table 0-2. Character Class Examples

  • a-z-c: English lowercase letters except for c
  • 0-9: European decimal digits
  • \u0030-\u0039: same as above, using Unicode escapes
  • 0-9,A-F,a-f: hexadecimal digits
  • {gc=letter},{gc=non-spacing_mark}: all letters and non-spacing marks
  • {gc=L},{gc=Mn}: same as above, using abbreviated notation
  • {gc≠unassigned}: all assigned Unicode characters
  • \u0600-\u06FF-{gc=unassigned}: all assigned Arabic characters
  • Alphabetic=true: all alphabetic characters
  • Line_Break≠Infix_Numeric: all code points that do not have the line break property of Infix_Numeric

How to Contact the Unicode Consortium

Contact the Unicode Consortium for membership information and to order publications (including additional copies of this book).

Postal address:
P.O. Box 391476
Mountain View, CA 94039-1476
USA

Please check the Web site for up-to-date contact information, including telephone, fax, and courier delivery address.



0321185781P05142003

Updates

Submit Errata

More Information

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Overview


Pearson Education, Inc., 221 River Street, Hoboken, New Jersey 07030, (Pearson) presents this site to provide information about products and services that can be purchased through this site.

This privacy notice provides an overview of our commitment to privacy and describes how we collect, protect, use and share personal information collected through this site. Please note that other Pearson websites and online products and services have their own separate privacy policies.

Collection and Use of Information


To conduct business and deliver products and services, Pearson collects and uses personal information in several ways in connection with this site, including:

Questions and Inquiries

For inquiries and questions, we collect the inquiry or question, together with name, contact details (email address, phone number and mailing address) and any other additional information voluntarily submitted to us through a Contact Us form or an email. We use this information to address the inquiry and respond to the question.

Online Store

For orders and purchases placed through our online store on this site, we collect order details, name, institution name and address (if applicable), email address, phone number, shipping and billing addresses, credit/debit card information, shipping options and any instructions. We use this information to complete transactions, fulfill orders, communicate with individuals placing orders or visiting the online store, and for related purposes.

Surveys

Pearson may offer opportunities to provide feedback or participate in surveys, including surveys evaluating Pearson products, services or sites. Participation is voluntary. Pearson collects information requested in the survey questions and uses the information to evaluate, support, maintain and improve products, services or sites, develop new products and services, conduct educational research and for other purposes specified in the survey.

Contests and Drawings

Occasionally, we may sponsor a contest or drawing. Participation is optional. Pearson collects name, contact information and other information specified on the entry form for the contest or drawing to conduct the contest or drawing. Pearson may collect additional personal information from the winners of a contest or drawing in order to award the prize and for tax reporting purposes, as required by law.

Newsletters

If you have elected to receive email newsletters or promotional mailings and special offers but want to unsubscribe, simply email information@informit.com.

Service Announcements

On rare occasions it is necessary to send out a strictly service related announcement. For instance, if our service is temporarily suspended for maintenance we might send users an email. Generally, users may not opt-out of these communications, though they can deactivate their account information. However, these communications are not promotional in nature.

Customer Service

We communicate with users on a regular basis to provide requested services. For issues relating to their account, we reply via email or phone, in accordance with the user's wishes, when a user submits their information through our Contact Us form.

Other Collection and Use of Information


Application and System Logs

Pearson automatically collects log data to help ensure the delivery, availability and security of this site. Log data may include technical information about how a user or visitor connected to this site, such as browser type, type of computer/device, operating system, internet service provider and IP address. We use this information for support purposes and to monitor the health of the site, identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents and appropriately scale computing resources.

Web Analytics

Pearson may use third party web trend analytical services, including Google Analytics, to collect visitor information, such as IP addresses, browser types, referring pages, pages visited and time spent on a particular site. While these analytical services collect and report information on an anonymous basis, they may use cookies to gather web trend information. The information gathered may enable Pearson (but not the third party web trend services) to link information with application and system log data. Pearson uses this information for system administration and to identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents, appropriately scale computing resources and otherwise support and deliver this site and its services.

Cookies and Related Technologies

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

Do Not Track

This site currently does not respond to Do Not Track signals.

Security


Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

Children


This site is not directed to children under the age of 13.

Marketing


Pearson may send or direct marketing communications to users, provided that:

  • Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
  • Such marketing is consistent with applicable law and Pearson's legal obligations.
  • Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
  • Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

Correcting/Updating Personal Information


If a user's personally identifiable information changes (such as a postal address or email address), we provide a way to correct or update the personal data provided to us. This can be done on the Account page. Users who no longer desire our service and wish to delete their account should contact us at customer-service@informit.com, and we will process the deletion of the account.

Choice/Opt-out


Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s), simply visit the following page and uncheck any communication you no longer want to receive: www.informit.com/u.aspx.

Sale of Personal Information


Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

Supplemental Privacy Statement for California Residents


California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

Sharing and Disclosure


Pearson may disclose personal information, as follows:

  • As required by law.
  • With the consent of the individual (or their parent, if the individual is a minor)
  • In response to a subpoena, court order or legal process, to the extent permitted or required by law
  • To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
  • In connection with the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
  • To investigate or address actual or suspected fraud or other illegal activities
  • To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
  • To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
  • To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

Links


This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.

Requests and Contact


Please contact us about this Privacy Notice or if you have any requests or questions relating to the privacy of your personal information.

Changes to this Privacy Notice


We may revise this Privacy Notice through an updated posting. We will identify the effective date of the revision in the posting. Often, updates are made to provide greater clarity or to comply with changes in regulatory requirements. If the updates involve material changes to the collection, protection, use or disclosure of Personal Information, Pearson will provide notice of the change through a conspicuous notice on this site or in another appropriate way. Continued use of the site after the effective date of a posted revision evidences acceptance. Please contact us if you have questions or concerns about the Privacy Notice or any objection to any revisions.

Last Update: November 17, 2020