Unicode and SQL Server
Last updated Mar 28, 2003.
Systems have often been designed to be used in a single country, in a single language, and for a single audience. But as the world matures, organizations are increasingly required to expand that audience to other locations. As this happens in the business, most DBAs and developers will be faced with the need to "Globalize" their systems.
The word "Globalize" means that your software systems should be flexible enough to deal with different time zones, money and date formats, and, not least of all, languages. In this tutorial I'll focus on dealing with the language components — I'll explain the other facets in another article.
The first question that comes to mind then, of course, is how to store more than one language at a time. We're all familiar with working on a system that has a particular language installed where we read the screen and type in information. And of course on a "client" system like this, it's easy to think about applications and ultimately data in a database that they access, in a single language. But the new requirements state that while you look at the screen in one language, someone on the other side of the world needs to look at data from that system in another language — the system is providing storage for both sets of users. How can we accomplish this?
As you're already aware (or can probably guess), computers deal in numbers, specifically binary numbers. To oversimplify a bit, a computer is nothing more than a huge set of on and off switches. Those binary values are interpreted as whole numbers to store things like the screen coordinates of a pixel or the number associated with a letter on a keyboard. So when you press the letter "A" on a keyboard manufactured for the English language, the keyboard sends electronic signals that are converted to the binary number 1000001, or decimal 65 (a lower-case "a" is 97). The software running on the computer deals with this number however it is supposed to. In the case of a database, the number is stored inside a database column or field.
This mapping comes from the American Standard Code for Information Interchange, or ASCII. Most electronic equipment made in the US conforms to these codes. The standard set has 128 codes, 33 of which are non-printable control characters (for things like a carriage return or line feed), and it covers all of the letters of the English alphabet. But these codes don't cover the letters of other languages, so you can't use them to store characters other than English.
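The character-to-number mapping, and the point where ASCII runs out, can be checked with a short sketch. Python is used here purely for illustration (it is not part of SQL Server), and the helper name `fits_in_ascii` and the sample words are arbitrary choices.

```python
# Look up the numeric code behind a character, as ASCII defines it.
upper_a = ord("A")                    # decimal code for upper-case A
lower_a = ord("a")                    # decimal code for lower-case a
upper_a_bits = format(upper_a, "b")   # the same code written in binary

print(upper_a, upper_a_bits, lower_a)


def fits_in_ascii(text):
    """Return True when every character has a code in the 0-127 range."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(fits_in_ascii("Hello"))         # plain English text
print(fits_in_ascii("Se\u00f1or"))    # contains n-with-tilde (U+00F1)
```

The second call fails the check because the n-with-tilde has no ASCII code, which is the gap Unicode was created to close.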
To fix this problem, another set of number-to-character mappings was created, called "Unicode". Unicode has room for far more character mappings because it separates the number assigned to each character (its code point) from the way that number is stored. An encoding, such as UTF-8 or UTF-16, defines how each code point is written out as one or more bytes. What this gets you is a substantially larger set of characters you can work with. Rather than storing only the letters of the English alphabet and a few other characters, you get more than 100,000 characters you can store! As an added bonus, the original ASCII numbers are carried over unchanged as the first 128 Unicode code points, so everything continues to work like it always has, at least for computers and electronics that used ASCII before.
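As a rough illustration of how Unicode extends ASCII, the sketch below looks up a few code points and compares encodings. It uses Python only as a convenient calculator; the sample characters are arbitrary, and UTF-8 and UTF-16 are used here as two common Unicode encodings.

```python
# Every character has a Unicode code point (a plain number).
samples = ["A", "\u00e9", "\u03a9", "\u20ac"]  # A, e-acute, Omega, euro sign
code_points = [ord(ch) for ch in samples]
print([f"U+{cp:04X}" for cp in code_points])

# The first 128 code points are the ASCII codes, so English text
# produces identical bytes when encoded as ASCII or as UTF-8.
ascii_bytes = "Hello".encode("ascii")
utf8_bytes = "Hello".encode("utf-8")
print(ascii_bytes == utf8_bytes)

# The encoding decides how many bytes a code point occupies.
euro_utf8 = "\u20ac".encode("utf-8")        # three bytes
euro_utf16 = "\u20ac".encode("utf-16-le")   # two bytes
print(len(euro_utf8), len(euro_utf16))
```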
All this comes at a cost, of course. It takes more space to store these extra characters, although not as much more as you might think. We'll come back to that in a moment.
Now that you understand how the encoding of the characters works, we can start working on how to use Unicode in a SQL Server system. This is the part that confuses a lot of people, and we should stop for a moment and talk about how the system will use Unicode.
SQL Server uses something called a collation, which specifies a code page (a set of character mappings) for non-Unicode data, along with the rules for sorting and comparing characters based on a language specification and locale settings. Most code pages store each character in a single byte, so you get around 256 characters with one collation. A default collation is selected when you install the server, and you can select a different collation when you create a database.
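To see why a single-byte code page tops out at 256 characters, and why the code page matters when reading data back, consider this small sketch. It is written in Python for illustration only; cp1252 (Western European) and cp1251 (Cyrillic) are two Windows code pages chosen as examples, not a statement about how any particular server is configured.

```python
import unicodedata

# One byte can only take 256 distinct values, so a single-byte
# code page can map at most 256 characters.
values_per_byte = 2 ** 8
print(values_per_byte)

# The same stored byte means different characters under different
# code pages.
stored = bytes([0xE9])
as_western = stored.decode("cp1252")
as_cyrillic = stored.decode("cp1251")

# Print the official character names rather than the glyphs, so the
# output is readable on any console.
print(unicodedata.name(as_western))
print(unicodedata.name(as_cyrillic))
```

The byte never changes; only its interpretation does, which is why data written under one code page can look garbled when read under another.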
So, when you want to work with international data, you need to do a couple of things. First, check the Windows collation of the server where SQL Server will be installed. By default, the collation for SQL Server is set to match the operating system's. In fact, any operation that depends on a particular collation requires that collation to be present on the operating system as well.
Second, you need to decide how to deal with multi-language data. When you think about it, most of the time all you need to store for multiple languages are the characters, not numbers or dates. For that purpose, SQL Server provides the nchar (fixed-length characters) and nvarchar (variable-length characters) data types, and SQL Server 2005 adds nvarchar(max) (large variable-length characters). In SQL Server 2000, you use the ntext data type instead of nvarchar(max). You can use these data types within a database for the fields that you think might take input from multiple languages.
By the way, your applications will send these characters to your database. I've been asked before, "If I set a field to Unicode using one of these data types and I enter the word 'Hello', will it automatically translate it to another language, like 'Bonjour'?" The answer is no, although that would be pretty cool.
Using one of these Unicode types (distinguished by the "n" in front of the type name) takes twice as many bytes (two per character, in fact) to store the data. That gives the system that sent the data room for the larger character mapping, and the column type marks the contents as Unicode data.
So you can see that you also need to be concerned about how much more storage you'll need with Unicode data. For one thing, you get half the amount of "width" in a text field. With standard char or varchar types you get 8000 characters in a single field — but with the nchar or nvarchar types you only get 4000.
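The storage cost can be estimated with a quick sketch. This is Python used as a calculator, not SQL Server itself: cp1252 stands in for a single-byte code page, UTF-16 little-endian stands in for two-byte Unicode storage, and the shared 8000-byte budget is an assumption used to explain the two limits.

```python
text = "Hello"

# Single-byte storage, as with char/varchar under a Western code page.
single_byte = text.encode("cp1252")
# Two-byte storage, similar to how nchar/nvarchar hold each character.
double_byte = text.encode("utf-16-le")
print(len(single_byte), len(double_byte))

# Assumption: both limits come from the same 8000-byte budget,
# divided by the number of bytes each character needs.
BYTE_BUDGET = 8000
max_single_byte_chars = BYTE_BUDGET // 1
max_double_byte_chars = BYTE_BUDGET // 2
print(max_single_byte_chars, max_double_byte_chars)
```

When sizing a database, a reasonable first estimate is to double the space of every character column you convert to a Unicode type.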
Collations define the sort orders and comparisons of data. This is true even with Unicode data. Things won't automatically sort according to another language's rules just because the field uses a Unicode type. SQL Server uses a separate algorithm to sort data in Unicode columns, so you'll need to test the results of your code, and there will be a performance impact as well.
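The effect of sort rules on Unicode text can be sketched as follows. This is a Python illustration only: `loose_key` is a hypothetical stand-in for a case-insensitive, accent-insensitive collation and is not the algorithm SQL Server uses.

```python
import unicodedata

words = ["zebra", "Apple", "\u00e9clair", "apple"]  # \u00e9 is e-acute

# A plain sort compares raw code points, much like a binary sort order.
by_code_point = sorted(words)


def loose_key(word):
    # Hypothetical collation-like rule: split accents off the base
    # letters, drop the accents, then ignore case.
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return stripped.casefold()


by_loose_rule = sorted(words, key=loose_key)

# ascii() keeps the output readable on any console.
print([ascii(w) for w in by_code_point])
print([ascii(w) for w in by_loose_rule])
```

Under the code-point sort the accented word lands after "zebra", while the looser rule places it between "apple" and "zebra". The same data gives different orderings, so sort results should be tested against the collation actually in use.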
Just to make things a little more complicated, it is important to note that SQL Server 2000 and 2005 store Unicode data using the UCS-2 encoding. There are several other Unicode encodings, such as UTF-8 and UTF-16, so you'll need to coordinate with the application owners to ensure they use the same one, or create a function or code shim to translate between UCS-2 and whatever the applications use.
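A translation shim of the kind described above might look like the following sketch. It is written in Python for illustration and makes several assumptions: the application sends UTF-8, the two-byte units are little-endian, and the function names `utf8_to_ucs2` and `ucs2_to_utf8` are hypothetical. Because UCS-2 cannot represent code points above U+FFFF, the sketch rejects them rather than guessing.

```python
def utf8_to_ucs2(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as two-byte little-endian code units.

    Hypothetical helper for illustration only. Code points above
    U+FFFF do not fit in UCS-2, so they raise ValueError.
    """
    text = data.decode("utf-8")
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError(f"U+{ord(ch):04X} does not fit in UCS-2")
    # For code points up to U+FFFF, UTF-16 and UCS-2 bytes are identical.
    return text.encode("utf-16-le")


def ucs2_to_utf8(data: bytes) -> bytes:
    """Reverse direction: two-byte little-endian units back to UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


from_app = "caf\u00e9".encode("utf-8")   # what a UTF-8 application sends
for_db = utf8_to_ucs2(from_app)
print(len(from_app), len(for_db))
print(ucs2_to_utf8(for_db) == from_app)
```

In practice, most database drivers handle this conversion for you, so a hand-written shim is only worth considering when data arrives as raw bytes.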
We've only scratched the surface on Unicode data and data programming, and we haven't discussed the full impact of globalizing your applications and their data. Understanding these basics of Unicode is a good start. There are some references at the end of this overview that will help you explore this topic further.
InformIT Articles and Sample Chapters
Still using SQL Server 2000? There's more about the data types, including Unicode, in this free chapter.
Books and eBooks
Richard Gillam has a great practical guide to all this in his book, Unicode Demystified: A Practical Programmer’s Guide to the Encoding Standard.
Online Resources
Books Online for SQL Server 2005 largely holds true for 2000 as well, as far as Unicode data goes. You can find that reference here.
