.NET Reference Guide

Working with Text Encodings

Last updated Mar 14, 2003.

When you start working with text in .NET programs, you soon realize that there's more to it than meets the eye. In the past, with languages like C/C++ and Visual Basic, text files could be treated much like binary files. There were some special cases and some assumptions about the contents, but for the most part a byte was a character and there was little or no translation. That is not the case in .NET programming. The .NET Framework stores strings internally as Unicode and supports many other text encodings as well. When you read text from a file into a string, the .NET Framework interprets the incoming bytes according to a text encoding and converts them to Unicode. This can cause some unexpected results.

Simple Example

The method below reads text from a file into a string and then displays the contents of the string in a multiline text box. For brevity I've eliminated the rest of the program's code. To test this method, you'll need to create a Windows Forms application that has a text box named textBox1 and a button named button1, and hook the button's Click event up to the button1_Click method shown here.

[C#]

private void button1_Click(object sender, System.EventArgs e)
{
  textBox1.Clear();
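  // fully qualified so the method compiles without a "using System.IO;" directive
  System.IO.StreamReader reader = new System.IO.StreamReader("testfile.txt");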
  try
  {
    string theString = reader.ReadToEnd();
    textBox1.AppendText(theString);
  }
  finally
  {
    reader.Close();
  }
}

[Visual Basic]

Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
  TextBox1.Clear()
  Dim reader As New System.IO.StreamReader("testfile.txt")
  Try
    Dim theString As String = reader.ReadToEnd()
    TextBox1.AppendText(theString)
  Finally
    reader.Close()
  End Try
End Sub

If the test file was saved in the system's ANSI code page rather than as Unicode, and it contains this text:

The em dash character (—) is used in writing to indicate a sudden break in thought.

you'll likely be surprised to see that the displayed string is missing the em dash character between the parentheses. What in the name of ASCII is going on?

About Text Encodings

As I mentioned in the introduction, the .NET Framework interprets the incoming byte stream and converts it to Unicode. There are, however, many different ways to interpret a particular byte stream. The StreamReader constructor I used in the sample opens the file with the UTF-8 encoding. In UTF-8, the US-ASCII characters 0 through 127 are encoded as single bytes, and all other characters are encoded as sequences of two or more bytes, each of which has a value above 127. The test file stores the em dash as the single byte 151 (0x97), which is its code in the Windows ANSI code page. In UTF-8 that value can appear only as a continuation byte inside a multibyte sequence, so standing alone it is invalid.
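To see why byte 151 trips up the UTF-8 decoder, here is a minimal console sketch (the class and method names are hypothetical, not part of the original sample). It decodes the same parenthesized text twice, once from ANSI-style bytes and once from genuine UTF-8 bytes. What the decoder does with the stray byte depends on the framework version: it may drop the byte or substitute a replacement character, but either way the em dash does not appear.

```csharp
using System;
using System.Text;

public static class LoneByteDemo
{
    // Decodes a byte buffer the same way the one-argument StreamReader
    // constructor interprets a file: as UTF-8.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        // "(", byte 151 (the em dash in Windows-1252), ")"
        byte[] ansiBytes = { 0x28, 0x97, 0x29 };
        // The same text encoded as real UTF-8
        byte[] utf8Bytes = { 0x28, 0xE2, 0x80, 0x94, 0x29 };

        string fromAnsi = DecodeAsUtf8(ansiBytes);
        string fromUtf8 = DecodeAsUtf8(utf8Bytes);

        // '\u2014' is the em dash
        Console.WriteLine(fromAnsi.IndexOf('\u2014') >= 0); // False
        Console.WriteLine(fromUtf8.IndexOf('\u2014') >= 0); // True
    }
}
```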

TIP

The documentation for the System.IO.StreamReader constructor, by the way, is incorrect or at least unclear. It says that the particular constructor I used opens the file with the default character encoding, when in fact the file is opened with the UTF-8 encoding. There is an encoding called Default, which, as you will see, is quite different.

There's not much the system can do with an invalid byte sequence in the file other than discard it. I guess the function could throw an exception, but that seems a bit drastic. Programs should be able to gracefully handle invalid data. Unfortunately, discarding the data is less than ideal if you're not expecting it.
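If you would rather be told about bad input than lose it silently, the UTF8Encoding class has a constructor whose second argument, throwOnInvalidBytes, asks the decoder to throw instead. The sketch below shows one way to use it. TryDecodeStrict is a hypothetical helper name, and catching ArgumentException is my choice for illustration, not something the original sample does.

```csharp
using System;
using System.Text;

public static class StrictUtf8Demo
{
    // Returns the decoded text, or null when the bytes are not valid UTF-8.
    public static string TryDecodeStrict(byte[] bytes)
    {
        // First argument: do not emit a byte order mark when encoding.
        // Second argument: throw on invalid byte sequences when decoding.
        UTF8Encoding strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (ArgumentException)
        {
            // Invalid input surfaces as an ArgumentException
            // (or a subclass of it, depending on the framework version).
            return null;
        }
    }

    public static void Main()
    {
        byte[] ansiBytes = { 0x28, 0x97, 0x29 };             // invalid as UTF-8
        byte[] utf8Bytes = { 0x28, 0xE2, 0x80, 0x94, 0x29 }; // valid UTF-8

        Console.WriteLine(TryDecodeStrict(ansiBytes) == null); // True
        Console.WriteLine(TryDecodeStrict(utf8Bytes) != null); // True
    }
}
```

A program could use a check like this to fall back to another encoding, such as Default, when strict UTF-8 decoding fails.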

When I ran into this problem I scratched my head over it for a bit and then started working with different encodings to determine which one would read the file properly. I won't bore you with the results of all my experiments. The solution was to open the file with the Default text encoding, like this:

[C#]

System.IO.StreamReader reader = new System.IO.StreamReader("testfile.txt", System.Text.Encoding.Default);

[Visual Basic]

Dim reader As New System.IO.StreamReader("testfile.txt", System.Text.Encoding.Default)

The second parameter to this overload of the StreamReader constructor specifies the Encoding that the system should use when converting the incoming byte stream to Unicode. The Default text encoding interprets the byte stream using the system's current ANSI code page. On my system, running the U.S. English version of Windows 2000, that is code page Windows-1252, which is often loosely called Latin-1. It is similar but not identical to ISO-8859-1, the actual Latin-1 standard: Windows-1252 assigns printable characters such as the em dash to the byte values 0x80 through 0x9F, which ISO-8859-1 reserves for control codes. Code pages are the old (pre-Unicode) method of interpreting characters for different languages. Because the same byte can mean a different character under each code page, using them was very confusing.

Don't expect this problem to go away any time soon. Although Unicode is a better way to encode characters because it eliminates ambiguity in interpretation, code pages allow for smaller text files. In UTF-8, the em dash is encoded as the three-byte sequence 0xE2, 0x80, 0x94, whereas Windows-1252 stores it as the single byte 0x97. A UTF-8 encoded text file with many em dashes and other such characters will therefore be noticeably larger than the equivalent file encoded with the Windows-1252 code page.
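The size difference is easy to check for yourself. This short sketch (the class and the Hex helper are hypothetical names) prints the bytes that two of the framework's Unicode encodings produce for a single em dash. For comparison, Windows-1252 needs only one byte for the same character.

```csharp
using System;
using System.Text;

public static class EmDashSizeDemo
{
    // Encodes the text and formats the resulting bytes as hex, e.g. "E2-80-94".
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        string emDash = "\u2014";

        Console.WriteLine(Hex(Encoding.UTF8, emDash));    // E2-80-94 (three bytes)
        Console.WriteLine(Hex(Encoding.Unicode, emDash)); // 14-20 (UTF-16, little-endian, two bytes)
    }
}
```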

Other Uses for Text Encodings

Text encodings are useful for more than just reading files. For example, say you have a file like the one described above and you want to rewrite it in the UTF-8 encoding so that it is valid Unicode text. You can open and read the file as shown above, create a StreamWriter that uses the UTF-8 encoding, and then write the string, like this:

[C#]

string theString;
System.IO.StreamReader reader = new System.IO.StreamReader("testfile.txt", System.Text.Encoding.Default);
try
{
  theString = reader.ReadToEnd();
}
finally
{
  reader.Close();
}
System.IO.StreamWriter writer = new System.IO.StreamWriter("out.txt", false, System.Text.Encoding.UTF8);
try
{
  writer.WriteLine(theString);
}
finally
{
  writer.Close();
}

[Visual Basic]

  Dim theString As String
  Dim reader As New System.IO.StreamReader("testfile.txt", System.Text.Encoding.Default)
  Try
    theString = reader.ReadToEnd()
  Finally
    reader.Close()
  End Try

  Dim writer As New System.IO.StreamWriter("out.txt", False, System.Text.Encoding.UTF8)
  Try
    writer.WriteLine(theString)
  Finally
    writer.Close()
  End Try
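The same read-then-write pattern can be wrapped in a small reusable method. The following is only a sketch under a few assumptions: ConvertFile is a hypothetical helper name, the demo uses temporary files rather than testfile.txt, and it reads the input as ISO-8859-1 simply because that encoding is available on every .NET runtime. It also calls Write rather than WriteLine so that no extra line break is appended, and it passes a UTF8Encoding constructed with false so that the output does not start with a byte order mark.

```csharp
using System;
using System.IO;
using System.Text;

public static class TranscodeDemo
{
    // Reads inputPath using sourceEncoding and rewrites the text to
    // outputPath using targetEncoding.
    public static void ConvertFile(string inputPath, string outputPath,
                                   Encoding sourceEncoding, Encoding targetEncoding)
    {
        string text;
        using (StreamReader reader = new StreamReader(inputPath, sourceEncoding))
        {
            text = reader.ReadToEnd();
        }
        using (StreamWriter writer = new StreamWriter(outputPath, false, targetEncoding))
        {
            writer.Write(text);
        }
    }

    public static void Main()
    {
        string input = Path.GetTempFileName();
        string output = Path.GetTempFileName();

        // "caf" followed by byte 0xE9, which is e-acute in ISO-8859-1
        File.WriteAllBytes(input, new byte[] { 0x63, 0x61, 0x66, 0xE9 });

        ConvertFile(input, output,
                    Encoding.GetEncoding("iso-8859-1"),
                    new UTF8Encoding(false));

        // The single byte 0xE9 becomes the two-byte UTF-8 sequence C3 A9
        Console.WriteLine(BitConverter.ToString(File.ReadAllBytes(output))); // 63-61-66-C3-A9

        File.Delete(input);
        File.Delete(output);
    }
}
```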

File input isn't the only place that you'll run into text encodings. In a program that communicates with other programs you might receive a buffer full of bytes that you need to convert to a string, or you might need to convert a string to a byte buffer using a particular encoding. For example, a program that uses the POP3 protocol to receive electronic mail will have to convert an incoming byte stream from 7-bit ASCII to Unicode strings, and will need to convert Unicode strings to 7-bit ASCII in order to send commands to the POP server. Both rely on encodings in order to get the translation correct.

[C#]

// convert an ASCII byte buffer to a string
const int BUFFER_SIZE = 1024;
byte[] inputBuffer = new byte[BUFFER_SIZE];
// fill the buffer from the input stream (GetInput method not shown)
int bytesRead = GetInput(inputBuffer);
// now convert the byte buffer to a string
string strInput = System.Text.Encoding.ASCII.GetString(inputBuffer, 0, bytesRead);

// convert a string to an ASCII byte buffer
string strOutput = "USER jmischel";
byte[] outputBuffer = System.Text.Encoding.ASCII.GetBytes(strOutput);

[Visual Basic]

' convert an ASCII byte buffer to a string
Dim BUFFER_SIZE As Integer = 1024
' Visual Basic arrays are declared by upper bound, so subtract 1 to get BUFFER_SIZE elements
Dim inputBuffer(BUFFER_SIZE - 1) As Byte
' fill the buffer from the input stream (GetInput method not shown)
Dim bytesRead As Integer = GetInput(inputBuffer)
' now convert the byte buffer to a string
Dim strInput As String = System.Text.Encoding.ASCII.GetString(inputBuffer, 0, bytesRead)

' convert a string to an ASCII byte buffer
Dim strOutput As String = "USER jmischel"
Dim outputBuffer() As Byte = System.Text.Encoding.ASCII.GetBytes(strOutput)
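One thing to keep in mind with the ASCII conversions above is that they are lossy: a character outside the 7-bit range has no ASCII representation, so the encoder substitutes a question mark. The sketch below illustrates this, along with a small hypothetical helper, ToPop3Command, that appends the CRLF pair that terminates each POP3 command before encoding it.

```csharp
using System;
using System.Text;

public static class AsciiDemo
{
    // Encodes a POP3 command as ASCII bytes, terminated by CRLF.
    public static byte[] ToPop3Command(string command)
    {
        return Encoding.ASCII.GetBytes(command + "\r\n");
    }

    public static void Main()
    {
        byte[] commandBytes = ToPop3Command("USER jmischel");
        Console.WriteLine(commandBytes.Length); // 15 (13 characters plus CR and LF)

        // "caf" followed by e-acute (U+00E9), which ASCII cannot represent
        byte[] lossy = Encoding.ASCII.GetBytes("caf\u00E9");
        Console.WriteLine(Encoding.ASCII.GetString(lossy)); // caf?
    }
}
```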

Unicode is undoubtedly a Good Thing, but other character encodings are going to be around for a very long time to come. If you're reading text files created by older programs, or if you're communicating using older Internet protocols like mail and news, you will have to convert text between those old encodings and Unicode using the facilities provided by the System.Text.Encoding class.