I. UNICODE IN ESSENCE: AN ARCHITECTURAL OVERVIEW OF THE UNICODE STANDARD.
1. Language, Computers, and Unicode.
What Unicode Is.
What Unicode Isn't.
The Challenge of Representing Text in Computers.
What This Book Does.
How This Book Is Organized.
Part I: Unicode in Essence.
Part II: Unicode in Depth.
Part III: Unicode in Action.
2. A Brief History of Character Encoding.
The Telegraph and Morse Code.
The Teletypewriter and Baudot Code.
Other Teletype and Telegraphy Codes.
FIELDATA and ASCII.
Hollerith and EBCDIC.
Single-Byte Encoding Systems.
Eight-Bit Encoding Schemes and the ISO 2022 Model.
Other 8-Bit Encoding Schemes.
Character Encoding Terminology.
Multiple-Byte Encoding Systems.
East Asian Coded Character Sets.
Character Encoding Schemes for East Asian Coded Character Sets.
Other East Asian Encoding Systems.
ISO 10646 and Unicode.
How the Unicode Standard Is Maintained.
3. Architecture: Not Just a Pile of Code Charts.
The Unicode Character-Glyph Model.
The Principle of Unification.
Flavors of Unicode.
Unicode Versions and Unicode Technical Reports.
Unicode Standard Annexes.
Unicode Technical Standards.
Unicode Technical Reports.
Draft and Proposed Draft Technical Reports.
Superseded Technical Reports.
Unicode Stability Policies.
Arrangement of the Encoding Space.
Organization of the Planes.
The Basic Multilingual Plane.
The Supplementary Planes.
Noncharacter Code Point Values.
Conforming to the Standard.
Producing Text as Output.
Interpreting Text from the Outside World.
Passing Text Through.
Drawing Text on the Screen or Other Output Devices.
Comparing Character Strings.
Summary.
4. Combining Character Sequences and Unicode Normalization.
How Unicode Non-spacing Marks Work.
Dealing Properly with Combining Character Sequences.
Canonical Accent Ordering.
Unicode Normalization Forms.
Grapheme Clusters.
5. Character Properties and the Unicode Character Database.
Where to Get the Unicode Character Database.
The UNIDATA Directory.
General Character Properties.
Standard Character Names.
Algorithmically Derived Names.
ISO 10646 Comments.
Block and Script.
Properties of Letters.
Properties of Digits, Numerals, and Mathematical Symbols.
Arabic Contextual Shaping.
East Asian Width.
Composition Exclusion List.
Normalization Test File.
Derived Normalization Properties.
Grapheme Cluster-Related Properties.
Unihan.txt.
6. Unicode Storage and Serialization Formats.
A Historical Note.
UTF-16 and the Surrogate Mechanism.
Endian-ness and the Byte Order Mark.
Standard Compression Scheme for Unicode.
Detecting Unicode Storage Formats.
II. UNICODE IN DEPTH: A GUIDED TOUR OF THE CHARACTER REPERTOIRE.
7. Scripts of Europe.
The Western Alphabetic Scripts.
The Latin Alphabet.
The Latin-1 Characters.
The Latin Extended A Block.
The Latin Extended B Block.
The Latin Extended Additional Block.
The International Phonetic Alphabet.
Isolated Combining Marks.
Spacing Modifier Letters.
The Greek Alphabet.
The Greek Block.
The Greek Extended Block.
The Coptic Alphabet.
The Cyrillic Alphabet.
The Cyrillic Block.
The Cyrillic Supplementary Block.
The Armenian Alphabet.
The Georgian Alphabet.
8. Scripts of the Middle East.
Bidirectional Text Layout.
The Unicode Bidirectional Layout Algorithm.
The Left-to-Right and Right-to-Left Marks.
The Explicit Override Characters.
The Explicit Embedding Characters.
Line and Paragraph Boundaries.
Bidirectional Text in a Text-Editing Environment.
The Hebrew Alphabet.
The Hebrew Block.
The Arabic Alphabet.
The Arabic Block.
Joiners and Non-joiners.
The Arabic Presentation Forms B Block.
The Arabic Presentation Forms A Block.
The Syriac Alphabet.
The Syriac Block.
The Thaana Script.
The Thaana Block.
9. Scripts of India and Southeast Asia.
The Devanagari Block.
The Bengali Block.
The Gurmukhi Block.
The Gujarati Block.
The Oriya Block.
The Tamil Block.
The Telugu Block.
The Kannada Block.
The Malayalam Block.
The Sinhala Block.
The Thai Block.
The Lao Block.
The Khmer Block.
The Myanmar Block.
The Tibetan Block.
The Philippine Scripts.
10. Scripts of East Asia.
The Han Characters.
Variant Forms of Han Characters.
Han Characters in Unicode.
The CJK Unified Ideographs Area.
The CJK Unified Ideographs Extension A Area.
The CJK Unified Ideographs Extension B Area.
The CJK Compatibility Ideographs Block.
The CJK Compatibility Ideographs Supplement Block.
The Kangxi Radicals Block.
The CJK Radicals Supplement Block.
Ideographic Description Sequences.
The Bopomofo Block.
The Bopomofo Extended Block.
The Hiragana Block.
The Katakana Block.
The Katakana Phonetic Extensions Block.
The Kanbun Block.
The Hangul Jamo Block.
The Hangul Compatibility Jamo Block.
The Hangul Syllables Area.
Half-width and Full-width Characters.
The Half-width and Full-width Forms Block.
Vertical Text Layout.
The Interlinear Annotation Characters.
The Yi Syllables Block.
The Yi Radicals Block.
11. Scripts from Other Parts of the World.
The Mongolian Block.
The Ethiopic Block.
The Cherokee Block.
Canadian Aboriginal Syllables.
The Unified Canadian Aboriginal Syllabics Block.
Deseret.
12. Numbers, Punctuation, Symbols, and Specials.
Western Positional Notation.
Han Characters as Numerals.
Other Numeration Systems.
Numeric Presentation Forms.
National and Nominal Digit Shapes.
The General Punctuation Block.
The CJK Symbols and Punctuation Block.
Dashes and Hyphens.
Quotation Marks, Apostrophes, and Similar-Looking Characters.
Bullets and Dots.
Line and Paragraph Separators.
Segment and Page Separators.
Characters That Control Word Wrapping.
Characters That Control Glyph Selection.
The Grapheme Joiner.
Bidirectional Formatting Characters.
The Object Replacement Character.
The General Substitution Character.
Symbols Used with Numbers.
Mathematical Alphanumeric Symbols.
Other Symbols and Miscellaneous Characters.
III. UNICODE IN ACTION: IMPLEMENTING AND USING THE UNICODE STANDARD.
13. Techniques and Data Structures for Handling Unicode Text.
Useful Data Structures.
Testing for Membership in a Class.
The Inversion List.
Performing Set Operations on Inversion Lists.
Mapping Single Characters to Other Values.
The Compact Array.
Two-Level Compact Arrays.
Mapping Single Characters to Multiple Values.
Mapping Multiple Characters to Other Values.
Exception Tables and Key Closure.
Tries as Exception Tables.
Tries as the Main Lookup Table.
Single Versus Multiple Tables.
14. Conversions and Transformations.
Converting Between Unicode Encoding Forms.
Converting Between UTF-16 and UTF-32.
Converting Between UTF-8 and UTF-32.
Converting Between UTF-8 and UTF-16.
Implementing Unicode Compression.
Optimizing Unicode Normalization.
Testing Unicode Normalization.
Converting Between Unicode and Other Standards.
Getting Conversion Information.
Converting Between Unicode and Single-Byte Encodings.
Converting Between Unicode and Multibyte Encodings.
Other Types of Conversions.
Handling Exceptional Conditions.
Dealing with Differences in Encoding Philosophy.
Choosing a Converter.
Case Mapping and Case Folding.
Case Mapping on a Single Character.
Case Mapping on a String.
Transliteration.
15. Searching and Sorting.
The Basics of Language-Sensitive String Comparison.
French Accent Sorting.
Contracting Character Sequences.
Putting It All Together.
Other Processes and Equivalences.
Language-Sensitive Comparison on Unicode Text.
A General Implementation Strategy.
The Unicode Collation Algorithm.
The Default UCA Sort Order.
Optimizations and Enhancements.
Language-Insensitive String Comparison.
Collation Strength and Secondary Keys.
Exposing Sort Keys.
Minimizing Sort Key Length.
The Boyer-Moore Algorithm.
Using the Boyer-Moore Algorithm with Unicode.
“Whole Word” Searches.
Using Unicode with Regular Expressions.
16. Rendering and Editing.
Implementing Boundary Analysis with Pair Tables.
Implementing Boundary Analysis with State Machines.
Performing Boundary Analysis Using a Dictionary.
A Few More Thoughts on Boundary Analysis.
Performing Line Breaking.
Glyph Selection and Positioning.
Poor Man's Glyph Selection.
Glyph Selection and Placement in AAT.
Glyph Selection and Placement in OpenType.
Special-Purpose Rendering Technology.
Compound and Virtual Fonts.
Special Text-Editing Considerations.
Optimizing for Editing Performance.
Accepting Text Input.
Handling Arrow Keys.
Handling Discontiguous Selection.
Handling Multiple-Click Selection.
17. Unicode and Other Technologies.
Unicode and the Internet.
The W3C Character Model.
HTML and HTTP.
URLs and Domain Names.
Mail and Usenet.
Unicode and Programming Languages.
The Unicode Identifier Guidelines.
C and C++.
Unicode and Operating Systems.
Varieties of UNIX.
As the economies of the world continue to become more connected, and as the American computer market becomes more and more saturated, computer-related businesses are looking more and more to markets outside the United States to grow their businesses. At the same time, companies in other industries are not only beginning to do the same thing (or, in fact, have been for a long time), but are increasingly turning to computer technology, especially the Internet, to grow their businesses and streamline their operations.
The convergence of these two trends means that it's no longer just an English-only market for computer software. More and more, computer software is being used not only by people outside the United States or by people whose first language isn't English, but by people who don't speak English at all. As a result, interest in software internationalization is growing in the software development community.
A lot of things are involved in software internationalization: displaying text in the user's native language (and in different languages depending on the user), accepting input in the user's native language, altering window layouts to accommodate expansion or contraction of text or differences in writing direction, displaying numeric values according to local customs, indicating events in time according to the local calendar systems, and so on.
This book isn't about any of these things. It's about something more basic, something that underlies most of the issues listed above: representing written language in a computer. There are many different ways to do this; there are several for just about every language that's been represented in computers. In fact, that's the whole problem. Designing software that's flexible enough to handle data in multiple languages (at least multiple languages that use different writing systems) has traditionally meant not just keeping track of the text, but also keeping track of which encoding scheme is being used to represent it. And if you want to mix text in multiple writing systems, this bookkeeping becomes more and more cumbersome.
The Unicode standard was designed specifically to solve this problem. It aims to be the universal character encoding standard, providing unique, unambiguous representations for every character in virtually every writing system and language in the world. The most recent version of Unicode provides representations for over 90,000 characters.
Unicode has been around for twelve years now and is in its third major revision, adding support for more languages with each revision. It has gained widespread support in the software community and is now supported in a wide variety of operating systems, programming languages, and application programs. Each of the semiannual International Unicode Conferences is better attended than the previous one, and the number of presenters and sessions at the Conferences grows correspondingly.
Representing text isn't as straightforward as it appears at first glance: It's not as simple as picking out a bunch of characters and assigning numbers to them. First you have to decide what a "character" is, which isn't as obvious in many writing systems as it is in English. You have to contend with things such as how to represent characters with diacritical marks applied to them, how to represent clusters of marks that represent syllables, when differently shaped marks on the page are different "characters" and when they're just different ways of writing the same "character," what order to store the characters in when they don't proceed in a straightforward manner from one side of the page to the other (for example, some characters stack on top of each other, or you have two parallel lines of characters, or the reading order of the text on the page zigzags around the line because of differences in natural reading direction), and many similar issues.
The decisions you make on each of these issues for every character affect how various processes, such as comparing strings or analyzing a string for word boundaries, are performed, making them more complicated. In addition, the sheer number of different characters representable using the Unicode standard makes many processes on text more complicated.
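As a small illustration of how a representation decision affects string comparison, the sketch below uses Python's standard `unicodedata` module. It is only an example under stated assumptions: the helper name `same_text` is hypothetical, and normalizing to NFC before comparing is one reasonable approach, not the only one.

```python
import unicodedata

# Two ways to represent the same visible text, "é":
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE (one code point)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT (two code points)


def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A naive code-point-by-code-point comparison treats the two as different.
raw_equal = precomposed == decomposed

# After normalization the two representations compare as equal.
normalized_equal = same_text(precomposed, decomposed)

# The two strings do not even have the same length in code points.
lengths = (len(precomposed), len(decomposed))
```

Here `raw_equal` is false while `normalized_equal` is true, which is why comparison code generally has to agree on a normalized form first.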
For all of these reasons, the Unicode standard is a large, complicated affair. Unicode 3.0, the last version published as a book, is 1,040 pages long. Even at this length, many of the explanations are fairly concise and assume the reader already has some degree of familiarity with the problems to be solved. It can be kind of intimidating.
The aim of this book is to provide an easier entree into the world of Unicode. It arranges things in a more pedagogical manner, takes more time to explain the various issues and how they're solved, fills in various pieces of background information, and adds implementation information and information on what Unicode support is already out there. It is this author's hope that this book will be a worthy companion to the standard itself, and will provide the average programmer and the internationalization specialist alike with all the information they need to effectively handle Unicode in their software.
There are a few things you should keep in mind as you go through this book:
All of the examples of text in non-Latin writing systems posed quite a challenge for the production process. The bulk of this manuscript was written on a Power Macintosh G4/450 using Adobe FrameMaker 6 running on MacOS 9. I did the original versions of the various diagrams in Microsoft PowerPoint 98 on the Macintosh. But I discovered early on that FrameMaker on the Macintosh couldn't handle a lot of the characters I needed to be able to show in the book. I wound up writing the whole thing with little placeholder notes to myself throughout describing what the examples were supposed to look like.
FrameMaker was somewhat compatible with Apple's WorldScript technology, enabling me to do some of the examples, but I quickly discovered that Acrobat 3, which I was using at the time, wasn't. It crashed when I tried to create PDFs of chapters that included the non-Latin characters. Switching to Windows didn't prove much more fruitful: On both platforms, FrameMaker 6, Adobe Illustrator 9, and Acrobat 3 and 4 were not Unicode compatible. The non-Latin characters would either turn into garbage characters, not show up at all, or show up with very compromised quality.
Late in the process, I decided to switch to the one combination of software and platform I knew would work: Microsoft Office 2000 on Windows 2000, which handles (with varying degrees of difficulty) everything I needed to do. I converted the whole project from FrameMaker to Word and spent a couple of months restoring all the missing examples to the text. (In a few cases where I didn't have suitable fonts at my disposal, or where Word didn't produce quite the results I wanted, I either scanned pictures out of books or just left the placeholders in.) The last rounds of revisions were done in Word on either the Mac or a Windows machine, depending on where I was physically at the time, and all the example text was done on the Windows machine.
Like many in the field, I fell into software internationalization by happenstance. I've always been interested in language--written language in particular--and (of course) in computers. But my job had never really dealt with this directly.
In the spring of 1995, that changed when I went to work for Taligent. Taligent, you may remember, was the ill-fated joint venture between Apple Computer and IBM (later joined by Hewlett Packard) that was originally formed to create a new operating system for personal computers using state-of-the-art object-oriented technology. The fruit of our labors, CommonPoint, turned out to be too little too late, but it spawned a lot of technologies that found their places in other products.
For a while there, Taligent enjoyed a cachet in the industry as the place where Apple and IBM had sent many of their best and brightest. If you managed to get a job at Taligent, you had "made it."
I almost didn't "make it." I had wanted to work at Taligent for some time and eventually got the chance, but turned in a rather unimpressive interview performance (a couple coworkers kidded me about that for years afterward) and wasn't offered the job. About that same time, a friend of mine did get a job there, and after the person who did get the job I interviewed for turned it down for family reasons, my friend put in a good word for me and I got a second chance.
I probably would have taken almost any job there, but the specific opening was in the text and internationalization group, and thus began my long association with Unicode.
One thing pretty much everybody who ever worked at Taligent will agree on is that working there was a wonderful learning experience: an opportunity, as it were, to "sit at the feet of the masters." Personally, the Taligent experience made me into the programmer I am today. My C++ and OOD skills improved dramatically, I became proficient in Java, and I went from knowing virtually nothing about written language and software internationalization to... well, I'll let you be the judge.
My team was eventually absorbed into IBM, and I enjoyed a little over two years as an IBMer before deciding to move on in early 2000. During my time at Taligent/IBM, I worked on four different sets of Unicode-related text handling routines: the text-editing frameworks in CommonPoint, various text-storage and internationalization frameworks in IBM's Open Class Library, various internationalization facilities in Sun's Java Class Library (which IBM wrote under contract to Sun), and the libraries that eventually came to be known as the International Components for Unicode.
International Components for Unicode, or ICU, began life as an IBM developer-tools package based on the Java internationalization libraries, but has since morphed into an open-source project and taken on a life of its own. It's gaining increasing popularity and showing up in more operating systems and software packages, and it's acquiring a reputation as a great demonstration of how to implement the various features of the Unicode standard. I had the twin privileges of contributing frameworks to Java and ICU and of working alongside those who developed the other frameworks and learning from them. I got to watch the Unicode standard develop, work with some of those who were developing it, occasionally rub shoulders with the others, and occasionally contribute a tidbit or two to the effort myself. It was a fantastic experience, and I hope that at least some of their expertise rubbed off on me.
For a year and a half, I wrote a column on Java for C++ Report magazine, and for much of that time, I wondered if anyone was actually reading the thing and what they thought of it. I would occasionally get an e-mail from a reader commenting on something I'd written, and I was always grateful, whether the feedback was good or bad, because it meant people were reading the thing and took it seriously enough to let me know what they thought.
I'm hoping there will be more than one edition of this book, and I really want it to be as good as possible. If you read it and find it less than helpful, I hope you won't just throw it on a shelf somewhere and grumble about the money you threw away on it. Please, if this book fails to adequately answer your questions about Unicode, or if it wastes too much time answering questions you don't care about, I want to know. The more specific you can be about just what isn't doing it for you, the better. Please write me at email@example.com with your comments and criticisms.
For that matter, if you like what you see here, I wouldn't mind hearing from you either. God knows, I can use the encouragement.
--R. T. G.
Unicode began with a simple goal: to unify the many hundreds of separate character encodings into a single, universal standard. These character encodings were incomplete and inconsistent: Two encodings would use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needed to support many different encodings, yet whenever data was passed between different encodings or platforms, that data always ran the risk of corruption.
Unicode was designed to fix that situation: to provide a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
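The problem described above can be seen with the legacy codecs that ship with Python. This is only an illustrative sketch: the three code pages used here (Latin-1, Windows-1251, and IBM code page 437) are chosen as examples and are not named in the text.

```python
# One byte value, two different characters, depending on the assumed encoding.
raw = b"\xe9"
as_latin1 = raw.decode("latin-1")   # Western European reading of the byte
as_cp1251 = raw.decode("cp1251")    # Cyrillic reading of the same byte

# One character, two different byte values, depending on the encoding.
e_acute = "\u00e9"                  # LATIN SMALL LETTER E WITH ACUTE
bytes_latin1 = e_acute.encode("latin-1")
bytes_cp437 = e_acute.encode("cp437")

# Unicode gives each of the two characters its own code point,
# independent of platform or program.
code_points = (ord(as_latin1), ord(as_cp1251))
```

The byte `0xE9` comes back as "é" under Latin-1 and as "й" under Windows-1251, while "é" itself is stored as `0xE9` in Latin-1 but as `0x82` in code page 437. In Unicode the two characters are simply U+00E9 and U+0439.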
Unfortunately, Unicode has not remained quite as simple as we originally planned. Most of the complications derive from three unavoidable sources. First, written languages are complex: They have evolved in strange and curious ways over the centuries. Many oddities end up being reflected in the way characters are used to represent those languages, or reflected in the way strings of characters function in a broader context. Second, Unicode did not develop in a vacuum: We had to add a number of features--and a great many characters--for compatibility with legacy character encoding standards. Third, we needed to adapt Unicode to different OS environments: allowing the use of 8-, 16-, or 32-bit units to represent characters, and either big- or little-endian integers. Of course, in addition to these three factors, there are in hindsight parts of the design that could have been simpler (true of any project of such magnitude).
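To make the third point concrete, the sketch below serializes one code point using 8-, 16-, and 32-bit units in both byte orders, using Python's built-in codecs. The helper name `hex_bytes` is hypothetical, and the two sample characters are chosen only for illustration.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Hypothetical helper: encode text and show the bytes as spaced hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


euro = "\u20ac"       # EURO SIGN, inside the Basic Multilingual Plane
clef = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, in a supplementary plane

# The same code point, U+20AC, in five different serializations.
forms = {
    enc: hex_bytes(euro, enc)
    for enc in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
}

# A supplementary-plane character needs two 16-bit units (a surrogate pair)
# in UTF-16 and four bytes in UTF-8.
clef_utf16 = hex_bytes(clef, "utf-16-be")
clef_utf8 = hex_bytes(clef, "utf-8")
```

For U+20AC this gives `E2 82 AC` in UTF-8, `20 AC` or `AC 20` in UTF-16 depending on byte order, and `00 00 20 AC` or `AC 20 00 00` in UTF-32. The G clef becomes the surrogate pair `D8 34 DD 1E` in big-endian UTF-16.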
Luckily you, a typical programmer, are shielded from most of these complexities. Just to put two strings in alphabetic order according to the conventions of any given language or country, for example, involves a lot of heavy lifting. Sometimes accents are important, sometimes not; sometimes two characters compare as one, sometimes one character compares as two, sometimes characters even swap. The software has to twist and hop between multiple code paths to make the simple cases absolutely as fast as possible, while still making every comparison absolutely accurate. All these gyrations are done by "the man behind the curtain"--you just call up a library routine in Windows, Java, ICU, or some other library, magic happens, and you get the right result.
You are not shielded from all of the complexities, however. There are still pitfalls. An English-speaking programmer might assume, for example, that given the three characters X, Y, and Z, that if X sorts before Y, then XZ sorts before YZ. This works for English, but fails for many languages. These are the things that a library can't do for you; you need to understand enough of the differences between the way characters behave in various environments so you can avoid stumbling into the traps.
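One way to see how the X, Y, Z assumption can fail is a language in which two letters together sort as a single unit. Czech is a common example: the digraph "ch" is treated as one letter that sorts after "h". The sketch below is a toy model, not a real collation implementation. The alphabet is simplified to unaccented lowercase letters, and `toy_sort_key` is a hypothetical helper, not a library routine.

```python
# Simplified alphabet in which "ch" is a single letter placed after "h".
ALPHABET = [
    "a", "b", "c", "d", "e", "f", "g", "h", "ch", "i", "j", "k", "l", "m",
    "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {letter: position for position, letter in enumerate(ALPHABET)}


def toy_sort_key(word: str) -> list:
    """Hypothetical helper: build a sort key, treating "ch" as one unit.

    Handles only the lowercase letters listed in ALPHABET.
    """
    key = []
    i = 0
    while i < len(word):
        if word.startswith("ch", i):
            key.append(RANK["ch"])
            i += 2
        else:
            key.append(RANK[word[i]])
            i += 1
    return key


x, y, z = "c", "d", "h"

# X sorts before Y ...
x_before_y = toy_sort_key(x) < toy_sort_key(y)

# ... but XZ does not sort before YZ, because "ch" is one letter after "h".
xz_before_yz = toy_sort_key(x + z) < toy_sort_key(y + z)

ordered = sorted(["ch", "dh", "c", "d"], key=toy_sort_key)
```

In this toy ordering "c" comes before "d", yet "ch" comes after "dh". Real applications should rely on a collation library rather than a hand-built table like this one.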
So how do you learn more? The Unicode Standard and associated Technical Reports are available, but they often start at too high a level, or are aimed at a different audience--programmers implementing the standard, rather than programmers using pre-built implementations. The FAQs on the Unicode site (http://www.unicode.org) should help a good deal, but they jump from topic to topic, and don't provide a smooth exposition of the issues involved.
That brings us to this book and its author. For many years, Rich was part of the ICU team, which supplies one of the premier Unicode-enablement libraries. During that time, Rich was the "man behind the curtain" for parts of ICU, including pieces that were incorporated into the standard Java versions. He did some very interesting work there, and became intimately familiar with data structures and mechanisms used to support implementations of Unicode. Because ICU offers the same functionality across Java and cross-platform C/C++, he also has a solid grasp of the issues involved in those environments. That led him to becoming a columnist for C++ Report.
The ICU team got an inside look at Rich's work as successive drafts of his columns would float across our internal net for feedback. We still use his "Anatomy of the Assignment Operator" column as a basis for asking questions of interviewees--it is remarkable just how tricky some of the nuts and bolts issues are in C++!
Rich left the ICU team a couple of years ago (after all, when Texas beckons, who can refuse her), but has continued to develop as a writer. He has a clear, colloquial style that allows him to make even complex matters understandable.
This book covers many of the important issues involved in dealing with Unicode: the basic character architecture, combining characters, encoding forms, normalization, searching, and sorting. It provides a solid foundation for understanding the structure and usage of Unicode, and the implications for most programmers. I'm glad that Rich has poured so much of his time and energy into this effort over the past few years, and hope you find the book as useful as I think you will.
Dr. Mark Davis,