What Are Character Sets?
Next, let's look at why there are so many different character sets for the various languages of the world, and what you'll need to know about the different character sets and how computers use and display them.
One character set you've probably heard of or are already familiar with is the ASCII character set. ASCII, which stands for American Standard Code for Information Interchange, includes punctuation marks, numbers, and the 26 uppercase and 26 lowercase letters in the English alphabet (a-z and A-Z). Because the ASCII character set contains 128 characters, it's known as a 7-bit character set (2^7 = 128).
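As a quick check of the 7-bit arithmetic, here is a minimal Python sketch. The variable names are illustrative only and not part of any standard.

```python
import string

# ASCII code points run from 0 to 127, so there are 2**7 = 128 of them.
ascii_code_points = range(2 ** 7)
largest = max(ascii_code_points)

# Number of bits needed to store the largest ASCII code point.
bits_needed = largest.bit_length()

# All 52 English letters (a-z and A-Z) fall inside the 7-bit range.
letters_fit = all(ord(ch) < 128 for ch in string.ascii_letters)
```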
While ASCII is sufficient for displaying information in English on computer systems, it's easy to see its shortcomings for most other languages. Many Western European languages, for example, include characters that require accents such as an umlaut or circumflex. As a result, the need for broader character sets that serve those languages quickly arose in order to accommodate exchanging information across computer systems.
The Latin 1 or Western character set (known as ISO 8859-1) quickly gained wide usage as an Internet standard. It's an 8-bit character set (containing 256 characters) that extends the ASCII character set with additional punctuation marks, special characters such as the German ß and the accented letters common to most Western European languages.
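To illustrate how ISO 8859-1 extends ASCII, the following sketch encodes a short string containing ß and an accented letter. The sample text is an arbitrary choice for illustration.

```python
text = "Straße café"

# In ISO 8859-1 every character, accented or not, occupies exactly one byte.
latin1_bytes = text.encode("iso-8859-1")
one_byte_per_char = len(latin1_bytes) == len(text)

# The German ß sits in the upper half of the 8-bit range (above 127).
sharp_s_value = "ß".encode("iso-8859-1")[0]

# Plain ASCII has no slot for ß, so encoding it as ASCII fails.
try:
    "ß".encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False
```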
Typically, ASCII characters appear as a subset of most character sets. For example, the first 128 characters of the Hebrew set (ISO 8859-8) match the ASCII character set. This is necessary because the HTML tags in your Web pages, such as <P> or <TABLE>, should remain the same no matter what language your content displays in.
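The subset relationship can be checked directly. This sketch decodes the 128 ASCII byte values with three codecs from Python's standard library and compares the results.

```python
# Every byte value in the ASCII range, 0 through 127.
ascii_bytes = bytes(range(128))

as_ascii = ascii_bytes.decode("ascii")
as_hebrew = ascii_bytes.decode("iso-8859-8")
as_latin1 = ascii_bytes.decode("iso-8859-1")

# An HTML tag is made only of ASCII bytes, so it decodes identically
# whether the page is declared as Hebrew or as Western European.
tag_bytes = b"<TABLE>"
tag_in_hebrew = tag_bytes.decode("iso-8859-8")
tag_in_latin1 = tag_bytes.decode("iso-8859-1")
```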
If you were to create both English and Spanish versions of your Web site, you could use the ISO 8859-1 character set for both. To do so, you'd insert the following <META> tag in the <HEAD> of your document:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
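As a rough illustration of how software might use that declaration, the sketch below pulls the charset value out of the tag and uses it to encode and decode text. The helper name `declared_charset` and the sample sentence are hypothetical, and the regular expression is a simplification; real browsers follow a far more involved detection procedure.

```python
import re

def declared_charset(meta_tag):
    """Hypothetical helper: return the charset value found in a <meta> tag string."""
    match = re.search(r'charset=["\']?([\w-]+)', meta_tag, re.IGNORECASE)
    return match.group(1) if match else None

META_TAG = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'

charset = declared_charset(META_TAG)

# The same ISO 8859-1 declaration covers both English and Spanish text.
sample = "Welcome / Bienvenido, señor"
page_bytes = sample.encode(charset)
decoded = page_bytes.decode(charset)
```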
Character sets need a way to let computers know how to convert from bits to characters. For example, in ASCII, the number 65 represents the capital letter A, whereas the number 97 represents the lowercase letter a. This conversion from number to letter is called the character encoding. Your Web browser must be told which encoding is being used in order to map the number to the appropriate character, and this is what the charset value in the <META> tag really does.
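The number-to-character mapping can be seen in a few lines of Python. The second half of this sketch shows why the browser has to be told the encoding: the same byte value maps to different characters under ISO 8859-1 and ISO 8859-8.

```python
# In ASCII, 65 maps to "A" and 97 maps to "a".
capital_a = chr(65)
lowercase_a = chr(97)
code_of_a = ord("A")

# One byte, two interpretations, depending on the declared encoding.
raw = bytes([0xE0])
in_latin1 = raw.decode("iso-8859-1")   # a Latin letter with a grave accent
in_hebrew = raw.decode("iso-8859-8")   # a Hebrew letter
```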
It's especially critical to include the appropriate character encoding in <META> tags for Web content in languages such as Korean, Japanese, or Chinese that contain many thousands of characters. Because these languages need up to two bytes (16 bits) per character to cover their full character sets, those sets are often referred to as double-byte character sets. If by default your Web browser expects the characters in your document to be just one byte long instead of two, it won't be able to properly display double-byte characters. To ensure your Japanese-language Web pages display well, insert the following <META> tag in the <HEAD> of each document:
<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
Here, the value for the charset attribute has been changed from ISO 8859-1 (appropriate for a French or Spanish page) to Shift_JIS (appropriate for a Japanese Web page). Be careful here: although the charset value can use the actual character set name (as with ISO 8859-1), some character sets have more than one kind of character encoding, and it's the character encoding that needs to appear as the charset value in your page's <META> tag.
It can sound confusing, but Japanese provides a good real-world example: The two most popular Japanese encodings are Shift-JIS and EUC-JP. Shift-JIS is an encoding originally used by Microsoft platforms, and over time it has become more prominent than EUC-JP, an encoding used on UNIX platforms. As a result, you're far more likely to see Shift-JIS specified as the charset value when you view the source code of a Japanese Web page, and you'll reach the largest audience if you use Shift-JIS (written Shift_JIS) as the charset value in your <META> tag when coding your own Japanese Web pages.
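To make the difference between the two encodings concrete, the following sketch encodes the same three Japanese characters both ways using the codecs that ship with Python. The sample word is an arbitrary choice; the point is that each character takes two bytes, and that the bytes only make sense when decoded with the encoding that produced them.

```python
text = "日本語"

sjis_bytes = text.encode("shift_jis")
euc_bytes = text.encode("euc_jp")

# Each of the three characters occupies two bytes in either encoding,
# yet the two encodings produce different byte sequences.
bytes_per_char_sjis = len(sjis_bytes) // len(text)
bytes_per_char_euc = len(euc_bytes) // len(text)
same_bytes = sjis_bytes == euc_bytes

# Reading Shift-JIS bytes as a single-byte set yields one wrong character per byte.
misread_as_latin1 = sjis_bytes.decode("iso-8859-1")

# Reading EUC-JP bytes as Shift-JIS also garbles the text
# (errors="replace" substitutes a marker instead of raising an exception).
misread_as_sjis = euc_bytes.decode("shift_jis", errors="replace")
```

This is why the charset value has to name the encoding actually used to save the page, not just the language of its content.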