Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard

By Richard Gillam
Published Sep 16, 2002 by Addison-Wesley Professional.

Book

Sorry, this book is no longer in print.

Description

Copyright 2003
Dimensions: 7-3/8" x 9-1/4"
Pages: 896
Edition: 1st

Book
ISBN-10: 0-201-70052-2
ISBN-13: 978-0-201-70052-7

Unicode is a critical enabling technology for developers who want to internationalize applications for global environments. But, until now, developers have had to turn to standards documents for crucial information on utilizing Unicode. In Unicode Demystified, one of IBM's leading software internationalization experts covers every key aspect of Unicode development, offering practical examples and detailed guidance for integrating Unicode 3.0 into virtually any application or environment. Writing from a developer's point of view, Rich Gillam presents a systematic introduction to Unicode's goals, evolution, and key elements. Gillam illuminates the Unicode standards documents with insightful discussions of character properties, the Unicode character database, storage formats, character sequences, Unicode normalization, character encoding conversion, and more. He presents practical techniques for text processing, locating text boundaries, searching, sorting, rendering text, accepting user input, and other key development tasks. Along the way, he offers specific guidance on integrating Unicode with other technologies, including Java, JavaScript, XML, and the Web. For every developer building internationalized applications, internationalizing existing applications, or interfacing with systems that already utilize Unicode.



Sample Content

Table of Contents

Preface.

I. UNICODE IN ESSENCE: AN ARCHITECTURAL OVERVIEW OF THE UNICODE STANDARD.

1. Language, Computers, and Unicode.

What Unicode Is.

What Unicode Isn't.

The Challenge of Representing Text in Computers.

What This Book Does.

How This Book Is Organized.

Part I: Unicode in Essence.

Part II: Unicode in Depth.

Part III: Unicode in Action.

2. A Brief History of Character Encoding.

Prehistory.

The Telegraph and Morse Code.

The Teletypewriter and Baudot Code.

Other Teletype and Telegraphy Codes.

FIELDATA and ASCII.

Hollerith and EBCDIC.

Single-Byte Encoding Systems.

Eight-Bit Encoding Schemes and the ISO 2022 Model.

ISO 8859.

Other 8-Bit Encoding Schemes.

Character Encoding Terminology.

Multiple-Byte Encoding Systems.

East Asian Coded Character Sets.

Character Encoding Schemes for East Asian Coded Character Sets.

Other East Asian Encoding Systems.

ISO 10646 and Unicode.

How the Unicode Standard Is Maintained.

3. Architecture: Not Just a Pile of Code Charts.

The Unicode Character-Glyph Model.

Character Positioning.

The Principle of Unification.

Alternate-Glyph Selection.

Multiple Representations.

Flavors of Unicode.

Character Semantics.

Unicode Versions and Unicode Technical Reports.

Unicode Standard Annexes.

Unicode Technical Standards.

Unicode Technical Reports.

Draft and Proposed Draft Technical Reports.

Superseded Technical Reports.

Unicode Versions.

Unicode Stability Policies.

Arrangement of the Encoding Space.

Organization of the Planes.

The Basic Multilingual Plane.

The Supplementary Planes.

Noncharacter Code Point Values.

Conforming to the Standard.

General.

Producing Text as Output.

Interpreting Text from the Outside World.

Passing Text Through.

Drawing Text on the Screen or Other Output Devices.

Comparing Character Strings.

Summary.

4. Combining Character Sequences and Unicode Normalization.

How Unicode Non-spacing Marks Work.

Dealing Properly with Combining Character Sequences.

Canonical Decompositions.

Canonical Accent Ordering.

Double Diacritics.

Compatibility Decompositions.

Singleton Decompositions.

Hangul.

Unicode Normalization Forms.

Grapheme Clusters.

5. Character Properties and the Unicode Character Database.

Where to Get the Unicode Character Database.

The UNIDATA Directory.

UnicodeData.txt.

PropList.txt.

General Character Properties.

Standard Character Names.

Algorithmically Derived Names.

Control-Character Names.

ISO 10646 Comments.

Block and Script.

General Category.

Letters.

Marks.

Numbers.

Punctuation.

Symbols.

Separators.

Miscellaneous.

Other Categories.

Properties of Letters.

SpecialCasing.txt.

CaseFolding.txt.

Properties of Digits, Numerals, and Mathematical Symbols.

Layout-Related Properties.

Bidirectional Layout.

Mirroring.

Arabic Contextual Shaping.

East Asian Width.

Line-Breaking Property.

Normalization-Related Properties.

Decomposition.

Decomposition Type.

Combining Class.

Composition Exclusion List.

Normalization Test File.

Derived Normalization Properties.

Grapheme Cluster-Related Properties.

Unihan.txt.

6. Unicode Storage and Serialization Formats.

A Historical Note.

UTF-32.

UTF-16 and the Surrogate Mechanism.

Endian-ness and the Byte Order Mark.

UTF-8.

CESU-8.

UTF-EBCDIC.

UTF-7.

Standard Compression Scheme for Unicode.

BOCU.

Detecting Unicode Storage Formats.

II. UNICODE IN DEPTH: A GUIDED TOUR OF THE CHARACTER REPERTOIRE.

7. Scripts of Europe.

The Western Alphabetic Scripts.

The Latin Alphabet.

The Latin-1 Characters.

The Latin Extended A Block.

The Latin Extended B Block.

The Latin Extended Additional Block.

The International Phonetic Alphabet.

Diacritical Marks.

Isolated Combining Marks.

Spacing Modifier Letters.

The Greek Alphabet.

The Greek Block.

The Greek Extended Block.

The Coptic Alphabet.

The Cyrillic Alphabet.

The Cyrillic Block.

The Cyrillic Supplementary Block.

The Armenian Alphabet.

The Georgian Alphabet.

8. Scripts of the Middle East.

Bidirectional Text Layout.

The Unicode Bidirectional Layout Algorithm.

Inherent Directionality.

Neutrals.

Numbers.

The Left-to-Right and Right-to-Left Marks.

The Explicit Override Characters.

The Explicit Embedding Characters.

Mirroring Characters.

Line and Paragraph Boundaries.

Bidirectional Text in a Text-Editing Environment.

The Hebrew Alphabet.

The Hebrew Block.

The Arabic Alphabet.

The Arabic Block.

Joiners and Non-joiners.

The Arabic Presentation Forms B Block.

The Arabic Presentation Forms A Block.

The Syriac Alphabet.

The Syriac Block.

The Thaana Script.

The Thaana Block.

9. Scripts of India and Southeast Asia.

Devanagari.

The Devanagari Block.

Bengali.

The Bengali Block.

Gurmukhi.

The Gurmukhi Block.

Gujarati.

The Gujarati Block.

Oriya.

The Oriya Block.

Tamil.

The Tamil Block.

Telugu.

The Telugu Block.

Kannada.

The Kannada Block.

Malayalam.

The Malayalam Block.

Sinhala.

The Sinhala Block.

Thai.

The Thai Block.

Lao.

The Lao Block.

Khmer.

The Khmer Block.

Myanmar.

The Myanmar Block.

Tibetan.

The Tibetan Block.

The Philippine Scripts.

10. Scripts of East Asia.

The Han Characters.

Variant Forms of Han Characters.

Han Characters in Unicode.

The CJK Unified Ideographs Area.

The CJK Unified Ideographs Extension A Area.

The CJK Unified Ideographs Extension B Area.

The CJK Compatibility Ideographs Block.

The CJK Compatibility Ideographs Supplement Block.

The Kangxi Radicals Block.

The CJK Radicals Supplement Block.

Ideographic Description Sequences.

Bopomofo.

The Bopomofo Block.

The Bopomofo Extended Block.

Japanese.

The Hiragana Block.

The Katakana Block.

The Katakana Phonetic Extensions Block.

The Kanbun Block.

Korean.

The Hangul Jamo Block.

The Hangul Compatibility Jamo Block.

The Hangul Syllables Area.

Half-width and Full-width Characters.

The Half-width and Full-width Forms Block.

Vertical Text Layout.

Ruby.

The Interlinear Annotation Characters.

Yi.

The Yi Syllables Block.

The Yi Radicals Block.

11. Scripts from Other Parts of the World.

Mongolian.

The Mongolian Block.

Ethiopic.

The Ethiopic Block.

Cherokee.

The Cherokee Block.

Canadian Aboriginal Syllabics.

The Unified Canadian Aboriginal Syllabics Block.

Historical Scripts.

Runic.

Ogham.

Old Italic.

Gothic.

Deseret.

12. Numbers, Punctuation, Symbols, and Specials.

Numbers.

Western Positional Notation.

Alphabetic Numerals.

Roman Numerals.

Han Characters as Numerals.

Other Numeration Systems.

Numeric Presentation Forms.

National and Nominal Digit Shapes.

Punctuation.

Script-Specific Punctuation.

The General Punctuation Block.

The CJK Symbols and Punctuation Block.

Spaces.

Dashes and Hyphens.

Quotation Marks, Apostrophes, and Similar-Looking Characters.

Paired Punctuation.

Dot Leaders.

Bullets and Dots.

Special Characters.

Line and Paragraph Separators.

Segment and Page Separators.

Control Characters.

Characters That Control Word Wrapping.

Characters That Control Glyph Selection.

The Grapheme Joiner.

Bidirectional Formatting Characters.

Deprecated Characters.

Interlinear Annotation.

The Object Replacement Character.

The General Substitution Character.

Tagging Characters.

Noncharacters.

Symbols Used with Numbers.

Numeric Punctuation.

Currency Symbols.

Unit Markers.

Math Symbols.

Mathematical Alphanumeric Symbols.

Other Symbols and Miscellaneous Characters.

Musical Notation.

Braille.

Other Symbols.

Presentation Forms.

Miscellaneous Characters.

III. UNICODE IN ACTION: IMPLEMENTING AND USING THE UNICODE STANDARD.

13. Techniques and Data Structures for Handling Unicode Text.

Useful Data Structures.

Testing for Membership in a Class.

The Inversion List.

Performing Set Operations on Inversion Lists.

Mapping Single Characters to Other Values.

Inversion Maps.

The Compact Array.

Two-Level Compact Arrays.

Mapping Single Characters to Multiple Values.

Exception Tables.

Mapping Multiple Characters to Other Values.

Exception Tables and Key Closure.

Tries as Exception Tables.

Tries as the Main Lookup Table.

Single Versus Multiple Tables.

14. Conversions and Transformations.

Converting Between Unicode Encoding Forms.

Converting Between UTF-16 and UTF-32.

Converting Between UTF-8 and UTF-32.

Converting Between UTF-8 and UTF-16.

Implementing Unicode Compression.

Unicode Normalization.

Canonical Decomposition.

Compatibility Decomposition.

Canonical Composition.

Optimizing Unicode Normalization.

Testing Unicode Normalization.

Converting Between Unicode and Other Standards.

Getting Conversion Information.

Converting Between Unicode and Single-Byte Encodings.

Converting Between Unicode and Multibyte Encodings.

Other Types of Conversions.

Handling Exceptional Conditions.

Dealing with Differences in Encoding Philosophy.

Choosing a Converter.

Line-Break Conversion.

Case Mapping and Case Folding.

Case Mapping on a Single Character.

Case Mapping on a String.

Case Folding.

Transliteration.

15. Searching and Sorting.

The Basics of Language-Sensitive String Comparison.

Multilevel Comparisons.

Ignorable Characters.

French Accent Sorting.

Contracting Character Sequences.

Expanding Characters.

Context-Sensitive Weighting.

Putting It All Together.

Other Processes and Equivalences.

Language-Sensitive Comparison on Unicode Text.

Unicode Normalization.

Reordering.

A General Implementation Strategy.

The Unicode Collation Algorithm.

The Default UCA Sort Order.

Alternate Weighting.

Optimizations and Enhancements.

Language-Insensitive String Comparison.

Sorting.

Collation Strength and Secondary Keys.

Exposing Sort Keys.

Minimizing Sort Key Length.

Searching.

The Boyer-Moore Algorithm.

Using the Boyer-Moore Algorithm with Unicode.

“Whole Word” Searches.

Using Unicode with Regular Expressions.

16. Rendering and Editing.

Line Breaking.

Line-Breaking Properties.

Implementing Boundary Analysis with Pair Tables.

Implementing Boundary Analysis with State Machines.

Performing Boundary Analysis Using a Dictionary.

A Few More Thoughts on Boundary Analysis.

Performing Line Breaking.

Line Layout.

Glyph Selection and Positioning.

Font Technologies.

Poor Man's Glyph Selection.

Glyph Selection and Placement in AAT.

Glyph Selection and Placement in OpenType.

Special-Purpose Rendering Technology.

Compound and Virtual Fonts.

Special Text-Editing Considerations.

Optimizing for Editing Performance.

Accepting Text Input.

Handling Arrow Keys.

Handling Discontiguous Selection.

Handling Multiple-Click Selection.

17. Unicode and Other Technologies.

Unicode and the Internet.

The W3C Character Model.

XML.

HTML and HTTP.

URLs and Domain Names.

Mail and Usenet.

Unicode and Programming Languages.

The Unicode Identifier Guidelines.

Java.

C and C++.

JavaScript and JScript.

Visual Basic.

Perl.

ICU.

Unicode and Operating Systems.

Microsoft Windows.

MacOS.

Varieties of UNIX.

Conclusion.

Glossary.
Bibliography.
Index.

Preface

As the economies of the world become more connected, and as the American computer market becomes more and more saturated, computer-related businesses are looking increasingly to markets outside the United States to grow. At the same time, companies in other industries are not only beginning to do the same thing (or, in fact, have been for a long time), but are increasingly turning to computer technology, especially the Internet, to grow their businesses and streamline their operations.

The convergence of these two trends means that it's no longer just an English-only market for computer software. More and more, computer software is being used not only by people outside the United States or by people whose first language isn't English, but by people who don't speak English at all. As a result, interest in software internationalization is growing in the software development community.

A lot of things are involved in software internationalization: displaying text in the user's native language (and in different languages depending on the user), accepting input in the user's native language, altering window layouts to accommodate expansion or contraction of text or differences in writing direction, displaying numeric values according to local customs, indicating events in time according to the local calendar systems, and so on.

This book isn't about any of these things. It's about something more basic that underlies most of the issues listed above: representing written language in a computer. There are many different ways to do this; in fact, there are several for just about every language that's been represented in computers. And that's the whole problem. Designing software that's flexible enough to handle data in multiple languages (at least multiple languages that use different writing systems) has traditionally meant not just keeping track of the text, but also keeping track of which encoding scheme is being used to represent it. And if you want to mix text in multiple writing systems, this bookkeeping becomes more and more cumbersome.

The Unicode standard was designed specifically to solve this problem. It aims to be the universal character encoding standard, providing unique, unambiguous representations for every character in virtually every writing system and language in the world. The most recent version of Unicode provides representations for over 90,000 characters.

Unicode has been around for twelve years now and is in its third major revision, adding support for more languages with each revision. It has gained widespread support in the software community and is now supported in a wide variety of operating systems, programming languages, and application programs. Each of the semiannual International Unicode Conferences is better attended than the previous one, and the number of presenters and sessions at the Conferences grows correspondingly.

Representing text isn't as straightforward as it appears at first glance: It's not merely as simple as picking out a bunch of characters and assigning numbers to them. First you have to decide what a "character" is, which isn't as obvious in many writing systems as it is in English. You have to contend with things such as how to represent characters with diacritical marks applied to them, how to represent clusters of marks that represent syllables, when differently shaped marks on the page are different "characters" and when they're just different ways of writing the same "character," what order to store the characters in when they don't proceed in a straightforward manner from one side of the page to the other (for example, some characters stack on top of each other, or you have two parallel lines of characters, or the reading order of the text on the page zigzags around the line because of differences in natural reading direction), and many similar issues.

The decisions you make on each of these issues for every character affect how various processes, such as comparing strings or analyzing a string for word boundaries, are performed, making them more complicated. In addition, the sheer number of different characters representable using the Unicode standard makes many processes on text more complicated.
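
To make the diacritical-mark question concrete, here is a minimal sketch (not taken from the book) using the JDK's java.text.Normalizer. It shows one "character," an e with an acute accent, stored two different ways, and how normalization reconciles them. The class name DiacriticForms and the helper describe are hypothetical names chosen for this illustration.

```java
import java.text.Normalizer;

// Illustrative only: class and method names are assumptions, not the book's code.
final class DiacriticForms {
    // Lists the code points of a string in U+ notation, separated by spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        String combining = "e\u0301";  // "e" followed by COMBINING ACUTE ACCENT

        System.out.println(describe(precomposed)); // one code point
        System.out.println(describe(combining));   // two code points

        // A naive comparison treats the two spellings as different strings.
        System.out.println(precomposed.equals(combining));

        // Normalizing both to the same form (here NFC) makes them comparable.
        String nfc = Normalizer.normalize(combining, Normalizer.Form.NFC);
        System.out.println(nfc.equals(precomposed));
    }
}
```

Which normalization form to use is a design choice that depends on the application; NFC is used here only because it yields the shorter, precomposed spelling.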

For all of these reasons, the Unicode standard is a large, complicated affair. Unicode 3.0, the last version published as a book, is 1,040 pages long. Even at this length, many of the explanations are fairly concise and assume the reader already has some degree of familiarity with the problems to be solved. It can be kind of intimidating.

The aim of this book is to provide an easier entree into the world of Unicode. It arranges things in a more pedagogical manner, takes more time to explain the various issues and how they're solved, fills in various pieces of background information, and adds implementation information and information on what Unicode support is already out there. It is this author's hope that this book will be a worthy companion to the standard itself, and will provide the average programmer and the internationalization specialist alike with all the information they need to effectively handle Unicode in their software.

About This Book

There are a few things you should keep in mind as you go through this book:

This book assumes the reader either is a professional computer programmer or is familiar with most computer-programming concepts and terms. Most general computer-science jargon isn't defined or explained here.
It's helpful, but not essential, if the reader has some understanding of the basic concepts of software internationalization. Many of those concepts are explained here, but if they're not central to one of the book's topics, they're not given a lot of time.
This book covers a lot of ground, and it isn't intended as a comprehensive and definitive reference for every single topic it discusses. In particular, I'm not repeating the entire text of the Unicode standard here; the idea is to complement the standard, not replace it. In many cases, this book will summarize a topic or attempt to explain it at a high level, leaving it to other documents (typically the Unicode standard or one of its technical reports) to fill in all the details.
The Unicode standard changes rapidly. New versions come out yearly, and small changes, new technical reports, and other things happen more quickly. In Unicode's history, terminology has changed, and this will probably continue to happen from time to time. In addition, there are a lot of other technologies that use or depend on Unicode, and they are also constantly changing, and I'm certainly not an expert on every single topic I discuss here. (In my darker moments, I'm not sure I'm an expert on any of them!) I have made every effort I could to see to it that this book is complete, accurate, and up to date, but I can't guarantee I've succeeded in every detail. In fact, I can almost guarantee you that there is information in here that is either outdated or just plain wrong. But I have made every effort to make the proportion of such information in this book as small as possible, and I pledge to continue, with each future version, to try to bring it closer to being fully accurate.
At the time of this writing (January 2002), the newest version of Unicode, Unicode 3.2, was in beta, and thus still in flux. The Unicode 3.2 spec is scheduled to be finalized in March 2002, well before this book actually hits the streets. With a few exceptions, I don't expect major changes between now and March, but they're always possible, and therefore, the Unicode 3.2 information in this book may wind up wrong in some details. I've tried to flag all the Unicode 3.2-specific information here as being from Unicode 3.2, and I've tried to indicate the areas that I think are still in the greatest amount of flux.
Sample code in this book is almost always in Java. This is partially because Java is the language I personally use in my regular job, and thus the programming language I think in these days. But I also chose Java because of its increasing importance and popularity in the programming world in general and because Java code tends to be somewhat easier to understand than, say, C (or at least no more difficult). Because of Java's syntactic similarity to C and C++, I also hope the examples will be reasonably accessible to C and C++ programmers who don't also program in Java.
The sample code is provided for illustrative purposes only. I've gone to the trouble, at least with the examples that can stand alone, to make sure the examples all compile, and I've tested them to make sure I didn't make any obvious stupid mistakes, but they haven't been tested comprehensively. They were also written with far more of an eye toward explaining a concept than being directly usable in any particular context. Incorporate them into your code at your own risk!
I've tried to define all the jargon the first time I use it or to indicate a full explanation is coming later, but there's also a glossary at the back you can refer to if you come across an unfamiliar term that isn't defined.
Numeric constants, especially numbers representing characters, are pretty much always shown in hexadecimal notation. Hexadecimal numbers in the text are always written using the 0x notation familiar to C and Java programmers.
Unicode code point values are shown using the standard Unicode notation, U+1234, where "1234" is a hexadecimal number of four to six digits. In many cases, a character is referred to by both its Unicode code point value and its Unicode name: for example, "U+0041 LATIN CAPITAL LETTER A." Code unit values in one of the Unicode transformation formats are shown using the 0x notation.
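
The notation conventions in the last two items can be sketched in a few lines of Java. This is an illustration added here, not code from the book; the class name CodePointNotation and its methods are hypothetical. It relies only on the standard Character.isValidCodePoint and Character.getName methods and on String.format.

```java
// Illustrative only: names are assumptions made for this sketch.
final class CodePointNotation {
    // Formats a code point as U+ followed by at least four uppercase hex digits.
    static String toUPlus(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        return String.format("U+%04X", codePoint);
    }

    // Pairs the U+ notation with the character name known to the running JDK.
    static String describe(int codePoint) {
        return toUPlus(codePoint) + " " + Character.getName(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(describe(0x0041));   // four hex digits
        System.out.println(toUPlus(0x10400));   // five hex digits
        System.out.println(toUPlus(0x10FFFF));  // six hex digits, the largest code point
    }
}
```

Note that Character.getName returns null for code points that are unassigned in the Unicode version the JDK supports, so describe is only meaningful for assigned characters.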

How This Book Was Produced

All of the examples of text in non-Latin writing systems posed quite a challenge for the production process. The bulk of this manuscript was written on a Power Macintosh G4/450 using Adobe FrameMaker 6 running on MacOS 9. I did the original versions of the various diagrams in Microsoft PowerPoint 98 on the Macintosh. But I discovered early on that FrameMaker on the Macintosh couldn't handle a lot of the characters I needed to be able to show in the book. I wound up writing the whole thing with little placeholder notes to myself throughout describing what the examples were supposed to look like.

FrameMaker was somewhat compatible with Apple's WorldScript technology, enabling me to do some of the examples, but I quickly discovered Acrobat 3, which I was using at the time, wasn't. It crashed when I tried to create PDFs of chapters that included the non-Latin characters. Switching to Windows didn't prove much more fruitful: On both platforms, FrameMaker 6, Adobe Illustrator 9, and Acrobat 3 and 4 were not Unicode compatible. The non-Latin characters would either turn into garbage characters, not show up at all, or show up with very compromised quality.

Late in the process, I decided to switch to the one combination of software and platform I knew would work: Microsoft Office 2000 on Windows 2000, which handles (with varying degrees of difficulty) everything I needed to do. I converted the whole project from FrameMaker to Word and spent a couple of months restoring all the missing examples to the text. (In a few cases where I didn't have suitable fonts at my disposal, or where Word didn't produce quite the results I wanted, I either scanned pictures out of books or just left the placeholders in.) The last rounds of revisions were done in Word on either the Mac or on a Windows machine, depending on where I was physically at the time, and all the example text was done on the Windows machine.

The Author's Journey

Like many in the field, I fell into software internationalization by happenstance. I've always been interested in language--written language in particular--and (of course) in computers. But my job had never really dealt with this directly.

In the spring of 1995, that changed when I went to work for Taligent. Taligent, you may remember, was the ill-fated joint venture between Apple Computer and IBM (later joined by Hewlett Packard) that was originally formed to create a new operating system for personal computers using state-of-the-art object-oriented technology. The fruit of our labors, CommonPoint, turned out to be too little too late, but it spawned a lot of technologies that found their places in other products.

For a while there, Taligent enjoyed a cachet in the industry as the place where Apple and IBM had sent many of their best and brightest. If you managed to get a job at Taligent, you had "made it."

I almost didn't "make it." I had wanted to work at Taligent for some time and eventually got the chance, but turned in a rather unimpressive interview performance (a couple coworkers kidded me about that for years afterward) and wasn't offered the job. About that same time, a friend of mine did get a job there, and after the person who did get the job I interviewed for turned it down for family reasons, my friend put in a good word for me and I got a second chance.

I probably would have taken almost any job there, but the specific opening was in the text and internationalization group, and thus began my long association with Unicode.

One thing pretty much everybody who ever worked at Taligent will agree on is that working there was a wonderful learning experience: an opportunity, as it were, to "sit at the feet of the masters." Personally, the Taligent experience made me into the programmer I am today. My C++ and OOD skills improved dramatically, I became proficient in Java, and I went from knowing virtually nothing about written language and software internationalization to... well, I'll let you be the judge.

My team was eventually absorbed into IBM, and I enjoyed a little over two years as an IBMer before deciding to move on in early 2000. During my time at Taligent/IBM, I worked on four different sets of Unicode-related text handling routines: the text-editing frameworks in CommonPoint, various text-storage and internationalization frameworks in IBM's Open Class Library, various internationalization facilities in Sun's Java Class Library (which IBM wrote under contract to Sun), and the libraries that eventually came to be known as the International Components for Unicode.

International Components for Unicode, or ICU, began life as an IBM developer-tools package based on the Java internationalization libraries, but has since morphed into an open-source project and taken on a life of its own. It's gaining increasing popularity and showing up in more operating systems and software packages, and it's acquiring a reputation as a great demonstration of how to implement the various features of the Unicode standard. I had the twin privileges of contributing frameworks to Java and ICU and of working alongside those who developed the other frameworks and learning from them. I got to watch the Unicode standard develop, work with some of those who were developing it, occasionally rub shoulders with the others, and occasionally contribute a tidbit or two to the effort myself. It was a fantastic experience, and I hope that at least some of their expertise rubbed off on me.

A Personal Appeal

For a year and a half, I wrote a column on Java for C++ Report magazine, and for much of that time, I wondered if anyone was actually reading the thing and what they thought of it. I would occasionally get an e-mail from a reader commenting on something I'd written, and I was always grateful, whether the feedback was good or bad, because it meant people were reading the thing and took it seriously enough to let me know what they thought.

I'm hoping there will be more than one edition of this book, and I really want it to be as good as possible. If you read it and find it less than helpful, I hope you won't just throw it on a shelf somewhere and grumble about the money you threw away on it. Please, if this book fails to adequately answer your questions about Unicode, or if it wastes too much time answering questions you don't care about, I want to know. The more specific you can be about just what isn't doing it for you, the better. Please write me at rtgillam@concentric.net with your comments and criticisms.

For that matter, if you like what you see here, I wouldn't mind hearing from you either. God knows, I can use the encouragement.

--R. T. G.
Austin, Texas
January 2002

Foreword

Unicode began with a simple goal: to unify the many hundreds of separate character encodings into a single, universal standard. These character encodings were incomplete and inconsistent: Two encodings would use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needed to support many different encodings, yet whenever data was passed between different encodings or platforms, that data always ran the risk of corruption.
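
The corruption risk described above is easy to reproduce. The following sketch is an illustration added here, not part of the Foreword, and the class name EncodingMismatch is hypothetical. It uses only the charsets every JDK guarantees, so it pairs ISO-8859-1 with UTF-8 rather than two legacy encodings; the mechanics of a mismatch are the same. It shows the same character stored as different numbers, and what happens when bytes written in one encoding are read in another.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: names are assumptions made for this sketch.
final class EncodingMismatch {
    // Renders bytes as two-digit uppercase hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE

        // Same character, different numbers in different encodings.
        byte[] latin1 = eAcute.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = eAcute.getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(latin1)); // one byte
        System.out.println(hex(utf8));   // two bytes

        // Reading UTF-8 bytes as ISO-8859-1 silently produces two wrong characters.
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());
    }
}
```

Nothing in the byte stream itself says which interpretation is right, which is why the encoding has to be tracked alongside the text.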

Unicode was designed to fix that situation: to provide a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

Unfortunately, Unicode has not remained quite as simple as we originally planned. Most of the complications derive from three unavoidable sources. First, written languages are complex: They have evolved in strange and curious ways over the centuries. Many oddities end up being reflected in the way characters are used to represent those languages, or reflected in the way strings of characters function in a broader context. Second, Unicode did not develop in a vacuum: We had to add a number of features--and a great many characters--for compatibility with legacy character encoding standards. Third, we needed to adapt Unicode to different OS environments: allowing the use of 8-, 16-, or 32-bit units to represent characters, and either big- or little-endian integers. Of course, in addition to these three factors, there are in hindsight parts of the design that could have been simpler (true of any project of such magnitude).
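
As an illustration of the third point (added here, not part of the Foreword), the sketch below serializes a single code point using 8-, 16-, and 32-bit units in both byte orders. The class name EncodingForms and its helpers are hypothetical. UTF-8 and UTF-16 come from the JDK's StandardCharsets; the UTF-32 bytes are produced by hand with ByteBuffer, on the assumption that UTF-32 is simply the code point written as a 32-bit integer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative only: names are assumptions made for this sketch.
final class EncodingForms {
    // Renders bytes as two-digit uppercase hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Writes the code point as one 32-bit unit in the requested byte order.
    static byte[] utf32(int codePoint, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(codePoint).array();
    }

    public static void main(String[] args) {
        int cp = 0x10400; // DESERET CAPITAL LETTER LONG I, outside the BMP
        String s = new String(Character.toChars(cp));

        System.out.println("UTF-8:    " + hex(s.getBytes(StandardCharsets.UTF_8)));
        System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-16LE: " + hex(s.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println("UTF-32BE: " + hex(utf32(cp, ByteOrder.BIG_ENDIAN)));
        System.out.println("UTF-32LE: " + hex(utf32(cp, ByteOrder.LITTLE_ENDIAN)));
    }
}
```

A supplementary-plane character was chosen so that the UTF-16 output shows a surrogate pair: two 16-bit units for one code point.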

Luckily you, a typical programmer, are shielded from most of these complexities. Just to put two strings in alphabetic order according to the conventions of any given language or country, for example, involves a lot of heavy lifting. Sometimes accents are important, sometimes not; sometimes two characters compare as one, sometimes one character compares as two, sometimes characters even swap. The software has to twist and hop between multiple code paths to make the simple cases absolutely as fast as possible, while still making every comparison absolutely accurate. All these gyrations are done by "the man behind the curtain"--you just call up a library routine in Windows, Java, ICU, or some other library, magic happens, and you get the right result.
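
For readers who want to see what "calling a library routine" looks like, here is a minimal sketch using the JDK's java.text.Collator. It is an illustration added here, not part of the Foreword, and the class name CollationSketch and helper english are hypothetical. It shows accents being ignored or respected depending on collation strength, and how collated order differs from a plain code-unit sort.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Illustrative only: names are assumptions made for this sketch.
final class CollationSketch {
    // Returns an English collator configured with the given strength.
    static Collator english(int strength) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator;
    }

    public static void main(String[] args) {
        String plain = "resume";
        String accented = "r\u00E9sum\u00E9";

        // PRIMARY strength looks at base letters only, so the accents are ignored.
        System.out.println(english(Collator.PRIMARY).compare(plain, accented) == 0);
        // TERTIARY strength also weighs accents and case, so the strings differ.
        System.out.println(english(Collator.TERTIARY).compare(plain, accented) != 0);

        String[] words = { "Zebra", "apple", "Mango" };

        // String.compareTo orders by UTF-16 code unit: capitals before lowercase.
        String[] byCodeUnit = words.clone();
        Arrays.sort(byCodeUnit);
        System.out.println(Arrays.toString(byCodeUnit));

        // The collator orders the way an English reader would expect.
        String[] byCollator = words.clone();
        Arrays.sort(byCollator, english(Collator.TERTIARY));
        System.out.println(Arrays.toString(byCollator));
    }
}
```

The locale and strength are the caller's choices; the library handles the rest, which is the division of labor the paragraph above describes.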

You are not shielded from all of the complexities, however. There are still pitfalls. An English-speaking programmer might assume, for example, that given the three characters X, Y, and Z, if X sorts before Y, then XZ sorts before YZ. This works for English, but fails for many languages. These are the things that a library can't do for you; you need to understand enough of the differences between the way characters behave in various environments so you can avoid stumbling into the traps.

So how do you learn more? The Unicode Standard and associated Technical Reports are available, but they often start at too high a level, or are aimed at a different audience--programmers implementing the standard, rather than programmers using pre-built implementations. The FAQs on the Unicode site (http://www.unicode.org) should help a good deal, but they jump from topic to topic, and don't provide a smooth exposition of the issues involved.

That brings us to this book and its author. For many years, Rich was part of the ICU team, which supplies one of the premier Unicode-enablement libraries. During that time, Rich was the "man behind the curtain" for parts of ICU, including pieces that were incorporated into the standard Java versions. He did some very interesting work there, and became intimately familiar with data structures and mechanisms used to support implementations of Unicode. Because ICU offers the same functionality across Java and cross-platform C/C++, he also has a solid grasp of the issues involved in those environments. That led him to becoming a columnist for C++ Report.

The ICU team got an inside look at Rich's work as successive drafts of his columns would float across our internal net for feedback. We still use his "Anatomy of the Assignment Operator" column as a basis for asking questions of interviewees--it is remarkable just how tricky some of the nuts and bolts issues are in C++!

Rich left the ICU team a couple of years ago (after all, when Texas beckons, who can refuse her?), but has continued to develop as a writer. He has a clear, colloquial style that allows him to make even complex matters understandable.

This book covers many of the important issues involved in dealing with Unicode: the basic character architecture, combining characters, encoding forms, normalization, searching, and sorting. It provides a solid foundation for understanding the structure and usage of Unicode, and the implications for most programmers. I'm glad that Rich has poured so much of his time and energy into this effort over the past few years, and hope you find the book as useful as I think you will.

Dr. Mark Davis,
President, The Unicode Consortium
(Chief Globalization Architect, IBM)
http://www.macchiato.com
