SQL Server Reference Guide

Unicode and SQL Server

Last updated Mar 28, 2003.

Systems have often been designed for a single country, a single language, and a single audience. But as organizations grow, they are increasingly required to expand that audience to other locations. When that happens, most DBAs and developers face the need to "globalize" their systems.

The word "Globalize" means that your software systems should be flexible enough to deal with different time zones, money and date formats, and, not least of all, languages. In this tutorial I'll focus on dealing with the language components — I'll explain the other facets in another article.

The first question that comes to mind then, of course, is how to store more than one language at a time. We're all familiar with working on a system that has a particular language installed where we read the screen and type in information. And of course on a "client" system like this, it's easy to think about applications and ultimately data in a database that they access, in a single language. But the new requirements state that while you look at the screen in one language, someone on the other side of the world needs to look at data from that system in another language — the system is providing storage for both sets of users. How can we accomplish this?

As you're already aware (or can probably guess), computers deal in numbers, specifically binary numbers. To oversimplify a bit, a computer is nothing more than a huge set of on and off switches. Those binary values are interpreted as whole numbers to store things like the screen coordinates of a pixel or the number associated with a letter on a keyboard. So when you press the letter "A" on a keyboard manufactured for the English language, the keyboard sends electronic signals that are converted to the binary number 1000001, or decimal 65 (a lower-case "a" has a different number, 97). The computer (or more accurately the software running on the computer) deals with this number however it is supposed to. In the case of a database, the number is stored inside a database column or field.
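
As a quick check of that letter-to-number mapping, here is a minimal Python sketch. Python is used only for illustration, nothing here is specific to SQL Server, and the variable names are hypothetical. It looks up the number behind a letter and writes the same number in binary.

```python
# Look up the number a computer stores for a letter, then show it in binary.
letter = "A"
code = ord(letter)            # decimal code for upper-case "A"
binary = format(code, "b")    # the same number written in binary digits

lower_code = ord("a")         # lower-case "a" has its own, different code

print(letter, code, binary)   # prints: A 65 1000001
print("a", lower_code)        # prints: a 97
```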

This happens using the American Standard Code for Information Interchange, or ASCII. Most electronic equipment made in the US conforms to these codes. The 128 standard ASCII codes, which include 33 non-printable control characters (for things like a carriage return or line feed), cover all of the letters of the English alphabet. But they don't cover the letters used by other languages, so you can't use them to store characters other than English.

To fix this problem, another set of number-to-character mappings was invented, called "Unicode". Unicode has room for far more character mappings because it separates the number assigned to each character (its code point) from the way that number is stored. An encoding, such as UTF-8 or UTF-16, defines how each code point is written out as one or more bytes, so a character is no longer limited to the values that fit in a single byte. What this gets you is a substantially larger set of characters you can work with. Rather than storing only the letters of the English alphabet and a few other characters, you get more than 100,000 characters you can store! As an added bonus, the original ASCII numbers are carried over unchanged as the first 128 Unicode values, so everything just continues to work like it always has, at least for computers and electronics that used ASCII before.
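
The relationship between ASCII and Unicode can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the helper names code_point and fits_in_ascii are hypothetical, and the sample characters were picked arbitrarily. It shows that every character has a Unicode number, that plain English text yields the same bytes under ASCII and under a Unicode encoding (UTF-8 here), and that an accented letter has no ASCII code at all.

```python
def code_point(ch):
    """Return the Unicode number (code point) assigned to one character."""
    return ord(ch)

def fits_in_ascii(text):
    """Return True if every character in text has an ASCII code."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

# "\u00e9" is an e with an acute accent; "\u65e5" is a CJK character.
samples = ["A", "\u00e9", "\u65e5"]
points = [code_point(ch) for ch in samples]

# Plain ASCII text produces identical bytes under ASCII and under UTF-8.
same_bytes = "Hello".encode("ascii") == "Hello".encode("utf-8")

print(points)                           # prints: [65, 233, 26085]
print(same_bytes)                       # prints: True
print(fits_in_ascii("\u00e9clair"))     # prints: False
```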

All this comes at a cost, of course. It takes more space to store these extra characters, although not as much more as you might think. We'll come back to that in a moment.

Now that you understand how the encoding of the characters works, we can start working on how to use Unicode in a SQL Server system. This is the part that confuses a lot of people, and we should stop for a moment and talk about how the system will use Unicode.

SQL Server uses something called a collation. For non-Unicode character data, the collation determines the code page, which is the set of character mappings used to store the data. Most code pages store each character in a single byte, so you get around 256 characters with one collation. A default collation is selected when you install the server, and you can then select a collation when you create a database. Collations also define how characters are sorted and compared, based on a language specification and locale settings.
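
To see why a single-byte character set ties the data to one set of languages, consider this small Python sketch. It is an illustration only: the two code pages (Python's "cp1252" and "cp1251" codecs, for Western European and Cyrillic Windows code pages) are my own choice of examples, not something a particular SQL Server collation is claimed to use. The same stored byte comes back as a different letter depending on which mapping reads it.

```python
# One stored byte value, interpreted under two different single-byte code pages.
stored = bytes([0xE9])

as_western = stored.decode("cp1252")    # Windows Western European code page
as_cyrillic = stored.decode("cp1251")   # Windows Cyrillic code page

# A single byte can only take 256 distinct values (0 through 255).
distinct_values = 2 ** 8

print(as_western == "\u00e9")    # prints: True  (e with an acute accent)
print(as_cyrillic == "\u0439")   # prints: True  (Cyrillic short i)
print(distinct_values)           # prints: 256
```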

So, when you want to work with international data, you need to do a couple of things. First, check the server's collation (the Windows server collation, that is) where SQL Server will be installed. By default, the collation for SQL Server is set to the same as the operating system. In fact, any operation that depends on a particular collation requires that collation to be present on the operating system as well.

Second, you need to decide how to deal with multi-language data. When you think about it, most of the time all you need to store for multiple languages are the characters, not numbers or dates. Because of that, Microsoft created the nchar (fixed-length characters) and nvarchar (variable-length characters) data types, and added nvarchar(max) (large variable-length characters) in SQL Server 2005. In SQL Server 2000, you use the ntext data type instead of nvarchar(max). You can use these data types within a database to store the fields that you think might take input from multiple languages.

By the way, your applications will send these characters to your database. I've been asked before, "If I set a field to Unicode using one of these data types and I enter the word 'Hello', will it automatically translate it to another language, like 'Bonjour'?" The answer is no, although that would be pretty cool.

Using one of these Unicode types (distinguished by the "n" in front of the type name) uses twice as many bytes (two per character, in fact) to store the data. That gives the system that sent the data the room for the character mapping, and the type itself marks the data as Unicode.

So you can see that you also need to be concerned about how much more storage you'll need with Unicode data. For one thing, you get half the amount of "width" in a text field. With standard char or varchar types you get 8000 characters in a single field — but with the nchar or nvarchar types you only get 4000.
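
The storage arithmetic can be sketched in Python. This is a rough model under stated assumptions, not a description of SQL Server internals: "cp1252" stands in for a one-byte-per-character code page, "utf-16-le" stands in for two-byte-per-character Unicode storage, and the names BYTES_PER_FIELD, stored_size, and max_characters are hypothetical.

```python
# Byte budget implied by the 8000-character limit for char and varchar.
BYTES_PER_FIELD = 8000

def stored_size(text, unicode_type):
    """Approximate bytes needed for text in a char-style or nchar-style column."""
    if unicode_type:
        return len(text.encode("utf-16-le"))   # two bytes per character here
    return len(text.encode("cp1252"))          # one byte per character

def max_characters(bytes_per_char):
    """How many characters fit in one field at a given width per character."""
    return BYTES_PER_FIELD // bytes_per_char

single = stored_size("Hello", unicode_type=False)
double = stored_size("Hello", unicode_type=True)

print(single, double)                          # prints: 5 10
print(max_characters(1), max_characters(2))    # prints: 8000 4000
```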

Collations define the sort orders and comparisons of data. This is true even with Unicode data. Things won't automatically sort using another language just because the field is using a Unicode type. There is a special algorithm for sorting Unicode column data, so you'll need to test the results of your code, and there will be a performance impact as well.
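
The difference between storing Unicode and sorting it sensibly can be shown with a small Python sketch. The sort_key function below is a hypothetical and deliberately crude stand-in for a collation: it ignores case and accents and is not the algorithm SQL Server uses. The word list is made up for the example.

```python
import unicodedata

def sort_key(text):
    """Crude collation-like key: drop accents, then ignore case."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

# "\u00e9clair" starts with an e that carries an acute accent.
words = ["zebra", "Apple", "\u00e9clair", "banana"]

by_code_point = sorted(words)             # raw numeric order of the characters
by_key = sorted(words, key=sort_key)      # order a reader would likely expect

print(by_code_point)   # the accented word lands after "zebra"
print(by_key)          # the accented word lands between "banana" and "zebra"
```

Sorting by raw character numbers puts the accented word last because its first letter has a higher number than "z", which is the kind of result that makes testing your sort output worthwhile.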

Just to make things a little more complicated, it is important to note that SQL Server 2000 and 2005 use the UCS-2 encoding of Unicode. There are several other Unicode encodings, such as UTF-8 and UTF-16, so you'll need to coordinate with the application owners to ensure they use the same one, or create a function or code shim to translate between UCS-2 and whatever the applications use.
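
A translation shim of the kind described could look something like the following Python sketch. Several things here are assumptions: that the application sends UTF-8, that the two-byte values are little-endian, and the function names utf8_to_ucs2 and ucs2_to_utf8, which are hypothetical. Python has no codec named UCS-2, so the sketch uses "utf-16-le" and rejects any character above U+FFFF, since those do not fit in a single two-byte value.

```python
def utf8_to_ucs2(data):
    """Convert UTF-8 bytes to two-byte-per-character little-endian bytes.

    Raises ValueError for characters above U+FFFF, which UCS-2 cannot hold.
    """
    text = data.decode("utf-8")
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError("character U+%04X does not fit in UCS-2" % ord(ch))
    return text.encode("utf-16-le")

def ucs2_to_utf8(data):
    """Convert two-byte-per-character little-endian bytes back to UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")

from_app = "Bonjour".encode("utf-8")     # 7 bytes as sent by the application
for_storage = utf8_to_ucs2(from_app)     # 14 bytes, two per character
round_trip = ucs2_to_utf8(for_storage)   # back to the original 7 bytes

print(len(from_app), len(for_storage))   # prints: 7 14
print(round_trip == from_app)            # prints: True
```

Rejecting out-of-range characters loudly, rather than silently dropping them, is a design choice made for this sketch; a real shim would follow whatever the application owners agree on.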

We've only scratched the surface on Unicode data and data programming, and we haven't discussed the full impact of globalizing your applications and their data. Understanding these basics of Unicode is a good start. There are some references at the end of this overview that will help you explore this topic further.

InformIT Articles and Sample Chapters

Still using SQL Server 2000? There's more about the data types, including Unicode, in this free chapter.

Books and eBooks

Richard Gillam has a great practical guide to all this in his book, Unicode Demystified: A Practical Programmer’s Guide to the Encoding Standard.

Online Resources

Books Online for SQL Server 2005 largely holds true for 2000 as well, as far as Unicode data goes. You can find that reference here.