This chapter is from the book

Pascal Strings

I've included this section on strings because this feature of the language has a number of very confusing aspects. Under normal circumstances, Pascal strings are very easy to use. However, there happen to be a number of different kinds of Pascal strings, and that proliferation of types really cries out for a clear explanation.

Object Pascal has four different kinds of strings: ShortStrings, AnsiStrings, PChars, and WideStrings. All Object Pascal strings except WideStrings are, at heart, little more than an array of Char. A WideString is an array of WideChars. A Char is 8 bits in size, while a WideChar is 16 bits—going on 32 bits—in size. I will explain more about WideStrings and WideChars at the end of this section on strings.

The following code fragment gives you examples of the types of things you can do with a Char or a String. The code explicitly uses AnsiStrings, but most of it would work the same regardless of whether the variables S and S2 were declared as ShortStrings, PChars, or AnsiStrings. Of course, I will explain the differences among these three types later in this section. Here is the example:

var
  a, b: Char;
  S, S2: String;
begin
  S := 'Sam';     // Valid: Set a string equal to a string literal
  S := '';        // Valid: Set a string equal to an empty string literal
  S := '1';       // Valid: Set a string equal to a character
  a := '1';       // Valid: Set a Char equal to a character literal
  b := a;         // Valid: Set a Char equal to a Char
  // a := 'Sam';  // Invalid: You can't set a Char equal to a string
  a := #65;       // Valid: Set a Char equal to a character code
  a := Char(10);  // Valid: Set a Char equal to an integer converted to a Char
  a := S[1];      // Valid: Set a Char equal to the first Char in a non-empty string
  S2 := 'Sam'#10; // Valid: Set a string equal to a string with a Char appended
  S := S + S2;    // Valid: Concatenate two strings
  if (S = S2) then
    ShowMessage('S and S2 contain equivalent strings');
  if (S > S2) then
    ShowMessage('S would appear in a dictionary after S2');
end;

The Pascal language originated in Europe, so strings follow the traditional European syntax and are set off with single rather than double quotes. The code shown here declares two Chars and two Strings. The first statement after the begin sets the String equal to a string literal that contains three letters. You can also set a String equal to a string literal that contains a single character or no characters. You can set a Char equal to a single character such as a, b, A, or B. You cannot set a Char equal to a string such as Sam. You can, however, set a Char equal to the first character in a String, as in a := S[1], provided that the string is not empty. You can also set a Char equal to the character with code 65 in a character set by writing this syntax: a := #65. In the standard ANSI character set, code 65 is a capital A, so this is equivalent to setting a Char equal to the letter A: a := 'A';. The expression Char(10) is equivalent to the expression #10. Both expressions reference the ANSI character with code 10, which is the linefeed character. It is also legal to append or insert characters into a string using the following syntax: S := 'Sam'#10;. This adds a linefeed to the end of the string. Notice that the character is appended outside the closing quote.
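If you want to check these character rules for yourself, a small console program is enough. The following is only a sketch, and it assumes the Free Pascal compiler rather than Kylix; the program name and the CharFromCode helper are my own inventions for illustration.

```pascal
program CharCodes;
{$mode objfpc}{$H+}

// Hypothetical helper: returns the character whose code is Code.
function CharFromCode(Code: Byte): AnsiChar;
begin
  Result := AnsiChar(Code);
end;

var
  S: AnsiString;
begin
  WriteLn(#65 = 'A');               // TRUE: two spellings of one character
  WriteLn(CharFromCode(10) = #10);  // TRUE: a typecast and # agree
  WriteLn(Ord('A'));                // 65
  S := 'Sam'#10;                    // linefeed appended outside the quotes
  WriteLn(Length(S));               // 4
  WriteLn(Ord(S[4]));               // 10
end.
```

The Ord function goes the other way from the typecast: it turns a character back into its numeric code.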

C/C++, JAVA NOTE

In Java or C++ you would write "Sam\n" rather than 'Sam'#10. The two statements are equivalent.

Studying the examples in this section should give you some sense of how to use strings in your programs. Notice that in one of the examples, you can use the + operator to concatenate two strings. You can also use the < and > operators to test whether one String sorts before or after another, and you can use the = operator to test whether two Strings contain identical sequences of characters.

JAVA NOTE

The = operator in Pascal does the same thing as the String.equals method does in Java. You are not testing to see whether the strings point at the same memory; you are testing to see whether they contain the same sequence of characters.
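Here is a short sketch of the comparison operators at work. It assumes Free Pascal, and the SameContents and SortsBefore helpers are hypothetical names of my own. One thing to watch for: as far as I can tell, the < and > operators compare character codes, so every uppercase letter sorts before every lowercase letter, which is not quite what a dictionary does.

```pascal
program CompareStrings;
{$mode objfpc}{$H+}

// Hypothetical helper: True when A and B hold the same characters.
function SameContents(const A, B: AnsiString): Boolean;
begin
  Result := A = B;
end;

// Hypothetical helper: True when A sorts before B under the < operator.
function SortsBefore(const A, B: AnsiString): Boolean;
begin
  Result := A < B;
end;

var
  First, Second: AnsiString;
begin
  First := 'Sam';
  Second := 'S';
  Second := Second + 'am';                 // built at run time
  WriteLn(SameContents(First, Second));    // TRUE: contents are compared
  WriteLn(SortsBefore('Sam', 'Sat'));      // TRUE
  WriteLn(SortsBefore('Zebra', 'apple'));  // TRUE: 'Z' is 90, 'a' is 97
end.
```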

ShortStrings

The ShortString is the oldest kind of Pascal string, and it is rarely in use today. A ShortString is essentially a glorified array of Char that occupies at most 256 bytes. The first byte, the length byte, designates the length of the string. ShortStrings are not null-terminated; their length is determined only by the length byte. Remember that the length byte takes up 1 of the 256 bytes in the string, so the longest possible ShortString contains 255 characters. The limitation on the length of a ShortString exists because the length byte is 8 bits in size, and 8 bits can hold only the values 0 through 255.

NOTE

ShortStrings are used mostly for backward compatibility with old Pascal code. However, you might use a ShortString if you need to be sure that a block of memory has a prescribed size. For instance, you know that a variable declared as ShortString always occupies 256 bytes, so if you want to create an array of 4 Strings and you want to be sure that it occupies exactly 1,024 bytes of memory, regardless of the length of each string (and assuming that each string is 255 characters in length or less), you might decide to use ShortStrings rather than AnsiStrings. ShortStrings can also be useful in variant records, as described in the later section of this chapter titled "Variant Records."

Here is the syntax for using a ShortString:

var
  S: ShortString;
begin
  S := 'Hello';
end;

This string is represented in memory as such: [#5][H][e][l][l][o]. The first byte of the string, which the user never sees, represents the length of the string. The remaining bytes contain the string itself.
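You can peek at the length byte and confirm the fixed sizes with a few lines of code. This is a sketch that assumes Free Pascal; the LengthByte helper and the TFourStrings type are names I made up for the example.

```pascal
program ShortStringLayout;
{$mode objfpc}{$H+}

type
  TFourStrings = array[0..3] of ShortString;

// Hypothetical helper: reads the hidden length byte at index 0.
function LengthByte(const S: ShortString): Byte;
begin
  Result := Ord(S[0]);
end;

var
  S: ShortString;
begin
  S := 'Hello';
  WriteLn(LengthByte(S));          // 5
  WriteLn(Length(S));              // 5: Length reports the same value
  WriteLn(SizeOf(ShortString));    // 256: 1 length byte plus 255 characters
  WriteLn(SizeOf(TFourStrings));   // 1024, whatever the strings contain
end.
```

In everyday code you would call Length rather than index position 0 directly.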

You can also declare a ShortString like this:

var
  S: String[10];

This string holds at most 10 characters rather than 255. More commonly, you might declare a type of string that is a custom length and then reuse that type throughout your program:

type
  String5 = String[5];
  String15 = String[15];
var
  S5: String5;
  S15: String15;

The compiler does not appear to object if you assign strings longer than 5 or 15 characters to variables of the types declared previously. However, the string that you create will hold only the declared number of characters. The others will be discarded.
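The truncation is easy to see in a small test. The sketch below assumes Free Pascal; the FitToFive function is a hypothetical helper that exists only for this example.

```pascal
program ShortStringTruncation;
{$mode objfpc}{$H+}

type
  String5 = String[5];

// Hypothetical helper: copies Source into a 5-character ShortString.
// Characters beyond the fifth are silently dropped.
function FitToFive(const Source: AnsiString): String5;
begin
  Result := Source;
end;

begin
  WriteLn(FitToFive('Hello, world'));          // Hello
  WriteLn(Length(FitToFive('Hello, world')));  // 5
  WriteLn(FitToFive('Hi'));                    // Hi
  WriteLn(SizeOf(String5));                    // 6: length byte plus 5 chars
end.
```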

Again, I want to stress that ShortStrings are not in common use today. In Java parlance, one might even say that they are deprecated, although I doubt that they will ever cease to be a part of the language.

AnsiStrings

AnsiStrings are also known as long strings. On 32-bit platforms, the maximum length for an AnsiString is 2GB. This type is the native Object Pascal string and the kind that you will use in most programs.

If you declare a variable as a String, it is assumed to be an AnsiString. In other words, if you do not specify that a string is an AnsiString, a ShortString, or a custom string such as String[10], you can assume that it is an AnsiString. The one exception to this rule occurs if you explicitly turn off the $H directive, where H can be thought of as standing for "huge" strings. In such cases, all strings are assumed to be ShortStrings unless explicitly declared otherwise. If you place the {$H-} directive at the top of a module, that entire module will use ShortStrings by default. If you deselect Project, Options, Compiler, Huge Strings from the menu, your entire program will use ShortStrings by default.
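To see what the $H directive does to a plain String declaration, you can compare the sizes of two type aliases declared under each setting. This is a sketch under the assumption that you are using Free Pascal, where $H can be switched inside a source file; the two type names are hypothetical.

```pascal
program HugeStringsSwitch;
{$mode objfpc}

{$H-}
type
  TPlainStringOff = String;   // with $H-, plain String means ShortString
{$H+}
type
  TPlainStringOn = String;    // with $H+, plain String means a long string

begin
  WriteLn(SizeOf(TPlainStringOff));                    // 256
  WriteLn(SizeOf(TPlainStringOn) = SizeOf(Pointer));   // TRUE
end.
```

The second result reflects the fact that a long string variable is just a pointer to memory that is allocated elsewhere.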

NOTE

When using the default key mappings, you can press Ctrl+O+O (that's the letter O) to get a list of all the compiler directives for the current module.

When a CLX method needs to be passed a string, it almost always expects to be passed an AnsiString. The AnsiString is the native type expected by CLX controls. Despite the simplicity of this statement, there are some twists and turns to it. As a result, I will discuss this in more depth both in this section and in the section "PChars."

An AnsiString is a pointer type, although you should rarely, if ever, need to explicitly allocate memory for it. The compiler notes the times when you make an assignment to a string, and it calls routines at that time for allocating the memory for the string. (Many of these routines are in System.pas, and you can step right into them with the debugger on some versions of Kylix.)

NOTE

You will find that many of the routines in the System unit use Assembly language. In general, they follow one of two different formats:

procedure Foo;
asm
  mov eax, 1
end;

procedure FooBar;
var
  X: Integer;
begin
  X := 7;
  asm
    mov eax, X;
  end;
end;

Procedure Foo uses asm where a normal Pascal procedure would use begin. In this type of procedure, all the code is written in Assembler until the closing end statement. The second example embeds an asm statement in a begin..end block. Both syntaxes are valid. When using the debugger, after starting your program, choose View, Debug Windows, CPU to step through your code. I will talk more about debugging in Chapter 5. However, I am not going to say anything more about Assembler in this book. Use System.pas as a reference if you are interested in this technology.

The only time that you might need to allocate memory for an AnsiString is if you are going to pass it to a routine that does not know about AnsiStrings; that is, when you are passing it to a routine written in some language other than Pascal or when you are passing it to some exceptionally peculiar Pascal routine. In such a case, you would normally want to pass a PChar rather than an AnsiString. But it is possible to pass an AnsiString to such a routine; you allocate memory for it first and then pass it. (Use the SetLength routine to allocate memory for an AnsiString, as described at the very end of this section.)

Routines that take PChars are generally routines that are written in some other language, such as C or C++. If you pass an AnsiString into such a routine and you expect it to pass the string back with a new value in it (passing by reference), you probably need to allocate memory for the string before passing it. If you are passing an AnsiString into an Object Pascal routine, you can assume that the compiler will know how to allocate memory for it. In your day-to-day practice as an Object Pascal programmer, you should never need to think about allocating memory for an AnsiString. The cases when you need to do it are very rare and are not the type that beginning or intermediate-level programmers are ever likely to encounter.

AnsiStrings are null-terminated. This means that the end of the string is marked with #0, the first character in the ANSI character set. This is the same way that you mark the end of a string in C/C++. AnsiStrings are different from C/C++ strings, however, because they are usually prefaced with two 32-bit values; one value holds the length of the string, and the other holds the reference count for the string. The only time that an AnsiString is not prefaced by these values is when the string variable references a 0-length string. As a programmer, you will almost certainly never have an occasion to explicitly reference either of these values.

It is a simple matter to understand the 32-bit value that holds the length of the string. It is similar to the length byte in a ShortString, except that it is 32 bits in size rather than 8 bits, so it can reference a very large string. What is the point, though, of the 32-bit value used for reference counting?

Reference counting is a means of saving memory and decreasing the time necessary to make string assignments. If two strings contain the same values, it is thriftiest to have them both point at the same memory. If possible, Object Pascal will do this by default. (You can override this behavior, as explained later in this section in the note on the UniqueString procedure.) When reference counting, the compiler simply points a second string at the memory allocated for a first string and then ups the reference count of the strings. Consider the following code fragment:

var
  Sam: String;
  Fred: String;
begin
  Sam := 'Look at all beings with the eyes of compassion. -- Lotus Sutra';
  Fred := Sam; // Reference count incremented, no memory allocated for chars.
  Fred := 'Learn to ' + Fred; // Strings not equal, memory must be allocated.
end;

When you set Sam equal to the quote from the Lotus Sutra, the compiler allocates sufficient memory for the variable Sam. When you set Fred equal to Sam, no new memory for character values is allocated. Instead, the reference count for the string is incremented and Fred is pointed at the same string as Sam. This kind of assignment is very fast and also saves memory. In short, you avoid both the extra memory consumed by allocating memory for the characters in the string and also the extra time required to copy the memory from one location in memory to another.

So far, so good. But what happens if you change one of the values that either variable addresses? That is what happens in the third line of the code fragment. When you change the value of Fred in the last line of the method, new memory is allocated for Fred and the reference count for the string is decremented by 1. At this point, Fred and Sam point at two entirely separate strings.
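You can watch the sharing happen by comparing the pointers behind two string variables. The following is a sketch that assumes Free Pascal; the SharesMemory helper is hypothetical, and I build the string at run time with StringOfChar so that it lives in ordinary allocated memory rather than in the program's constant data.

```pascal
program SharedStringMemory;
{$mode objfpc}{$H+}

// Hypothetical helper: True when both variables point at the same
// block of character memory.
function SharesMemory(const A, B: AnsiString): Boolean;
begin
  Result := Pointer(A) = Pointer(B);
end;

var
  Sam, Fred: AnsiString;
begin
  Sam := StringOfChar('x', 20);          // allocated at run time
  Fred := Sam;                           // no characters are copied
  WriteLn(SharesMemory(Sam, Fred));      // TRUE
  Fred := 'Learn to ' + Fred;            // Fred gets memory of its own
  WriteLn(SharesMemory(Sam, Fred));      // FALSE
  WriteLn(Sam = StringOfChar('x', 20));  // TRUE: Sam is untouched
end.
```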

NOTE

You can use the UniqueString procedure to force a string to have a reference count of 1, even if it would normally have a higher count.
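As a rough illustration of UniqueString, the sketch below copies a string and then forces the copy into its own memory. It assumes Free Pascal, and PrivateCopy is a hypothetical helper of my own.

```pascal
program UniqueStringDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: returns a copy of Source that does not share
// its memory with Source.
function PrivateCopy(const Source: AnsiString): AnsiString;
begin
  Result := Source;       // shares memory with Source for the moment
  UniqueString(Result);   // forces a separate allocation
end;

var
  Sam, Fred: AnsiString;
begin
  Sam := StringOfChar('x', 20);
  Fred := PrivateCopy(Sam);
  WriteLn(Pointer(Sam) = Pointer(Fred));  // FALSE: separate memory
  WriteLn(Sam = Fred);                    // TRUE: same characters
end.
```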

I want to stress that all these complicated machinations mean that you normally don't have to think about string memory allocation at all. You can just use a string type in a manner similar to the way you would use an Integer type. The compiler handles the allocation, and you don't have to think about it. However, it helps to know the inner workings of the AnsiString type, both so that you know what happens in unusual cases and so that you can design your code to be as efficient as possible.

Strings are generally allocated for you automatically. However, you can use the SetLength procedure to set or reset the length of a string:

var
  S: string;
begin
  SetLength(S, 10);
  SetLength(S, 12);
end;
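A slightly fuller sketch shows what SetLength gives you: room for a fixed number of characters that you can then fill in by index. It assumes Free Pascal, and the Filled function is a hypothetical helper written for this example.

```pascal
program SetLengthDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: builds a string of Count copies of Ch by
// sizing the string first and then filling it in.
function Filled(Ch: AnsiChar; Count: Integer): AnsiString;
var
  I: Integer;
begin
  SetLength(Result, Count);   // allocate room for Count characters
  for I := 1 to Count do
    Result[I] := Ch;          // strings are indexed from 1
end;

var
  S: AnsiString;
begin
  S := Filled('-', 10);
  WriteLn(Length(S));   // 10
  SetLength(S, 3);      // shrinking keeps the leading characters
  WriteLn(S);           // ---
  SetLength(S, 0);      // the same as S := ''
  WriteLn(Length(S));   // 0
end.
```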

Many routines built into the Object Pascal language can help you work with strings. In particular, see the FmtStr procedure and the Format function. You might also want to browse the entire SysUtils unit and become familiar with the many useful routines found there. Also see the LCodeBox unit that ships with this book.
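To give you a feel for Format and FmtStr, here is a minimal sketch. It assumes Free Pascal and its SysUtils unit; the Describe function is a hypothetical helper, not part of any library.

```pascal
program FormatDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Hypothetical helper: builds a report line with Format.
// %s takes a string argument and %d takes an integer.
function Describe(const Name: AnsiString; Len: Integer): AnsiString;
begin
  Result := Format('%s has %d characters', [Name, Len]);
end;

var
  Line: AnsiString;
begin
  WriteLn(Describe('Sam', Length('Sam')));   // Sam has 3 characters
  FmtStr(Line, '%s-%.3d', ['ID', 7]);        // FmtStr fills a variable
  WriteLn(Line);                             // ID-007
end.
```

Format returns its result, while FmtStr stores the result in the variable that you pass as its first parameter.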

PChars

A PChar is a standard null-terminated string and is structurally exactly like a C string. In fact, this type was created primarily to provide compatibility with C libraries. In particular, it was created for compatibility with the Windows API, which is written in C. It has proven to be a generally useful type, and it will come in handy when you are calling functions from the Linux C libraries such as Libc.

NOTE

To call most of the routines in the Libc library, just add Libc to your uses clause and go to work. This process is described in more depth in Chapter 6, "Understanding the Linux Environment."

The native Object Pascal string type is known as a String—or, more properly, as a long string or AnsiString. However, in most cases you are free to use either the native String type or the PChar type. Both types of strings are null-terminated. The difference between them is that a Pascal string has data placed in front of the String that determines the string's length and its reference count.

In most cases in a Kylix program, you should use the AnsiString type. A Kylix control such as a TEdit would never expect you to pass it a PChar. However, it is usually legal, but unorthodox, to pass it a PChar. This is confusing enough that an example might be helpful. Consider the following block of code:

procedure TForm1.Button1Click(Sender: TObject);
var
  Sam: PChar;
begin
  Sam := 'Fred';
  Edit1.Text := Sam;
end;

This code will compile and run without error. In short, it is legal to assign a PChar to a property that is declared to be of type AnsiString. (Actually the Text property is declared to be of type TCaption, but TCaption is declared to be of type String.)

NOTE

CLX is built on top of the C++ library called Qt. As a result, many of the controls in CLX ultimately end up working with native C strings, or a C String object. However, none of that is any concern to us as Pascal programmers. CLX is expecting AnsiStrings and, when you work with CLX controls, you should use the native String type.

You can assign a PChar to a string directly. However, if you assign a String to a PChar, you need to typecast it:

var
  S: string;
  P: PChar;
begin
  P := PChar(S);
end;

As you recall, an AnsiString is simply a PChar with some data in front of it. This data appears at a negative offset from the pointer to the AnsiString. As a result, typecasting the AnsiString as a PChar is really just a confirmation that from the pointer to the AnsiString and onward, an AnsiString is nothing more than a PChar. You will use this typecasting technique quite often if you need to pass AnsiStrings to routines written in C that are expecting a regular C string rather than an AnsiString.
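The sketch below stands in for the C case. CountChars is a hypothetical routine that behaves the way a C function would: it knows nothing about AnsiStrings and simply walks the characters until it reaches the terminating #0. The example assumes Free Pascal.

```pascal
program PCharCast;
{$mode objfpc}{$H+}

// Hypothetical stand-in for a C routine: it sees only a pointer to
// null-terminated characters and counts them up to the #0.
function CountChars(P: PChar): Integer;
begin
  Result := 0;
  while P[Result] <> #0 do
    Inc(Result);
end;

var
  S, Copied: AnsiString;
  P: PChar;
begin
  S := 'Hello';
  P := PChar(S);            // String to PChar needs the typecast
  WriteLn(CountChars(P));   // 5
  WriteLn(P[0]);            // H: a PChar is indexed from 0
  Copied := P;              // PChar to String needs no typecast
  WriteLn(Copied);          // Hello
end.
```

Keep in mind that P is valid only as long as S is left alone; the pointer refers to the memory that S owns.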

Once the decision was made to make PChars part of Object Pascal, there needed to be a set of routines to help you work with such strings. These routines are based closely on the functions you would use for manipulating strings in a C/C++ program. For instance, these routines have names such as StrLen, StrCat, StrPos, and StrScan. Again, you should look in the SysUtils unit for more information on these routines. You will find that there are dozens of such routines and that they are quite flexible and powerful.
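Here is a brief sketch of a few of these routines working together. It assumes Free Pascal and its SysUtils unit; OffsetOf is a hypothetical helper that I wrote on top of StrPos for the example.

```pascal
program PCharRoutines;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Hypothetical helper: zero-based offset of Needle within Haystack,
// or -1 if Needle does not occur.
function OffsetOf(Haystack, Needle: PChar): Integer;
var
  Found: PChar;
begin
  Found := StrPos(Haystack, Needle);   // pointer to first match, or nil
  if Found = nil then
    Result := -1
  else
    Result := Found - Haystack;        // pointer difference in characters
end;

var
  Buffer: PChar;
begin
  Buffer := StrAlloc(32);              // room for 31 characters plus #0
  StrCopy(Buffer, 'Hello');
  StrCat(Buffer, ', world');
  WriteLn(StrLen(Buffer));             // 12
  WriteLn(OffsetOf(Buffer, 'world'));  // 7
  WriteLn(StrScan(Buffer, 'w'));       // world
  StrDispose(Buffer);                  // PChar memory is freed by hand
end.
```

Unlike an AnsiString, the buffer here is your responsibility: you allocate it, you make sure that it is large enough, and you free it.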

WideStrings

WideStrings are very much like AnsiStrings, except that they point at wide characters of 16 bits rather than normal Chars of 8 bits. These large characters, known as WideChars, are a means of manipulating Unicode characters. Unicode in particular, and WideChars in general, provide a means for working with large character sets that will not fit in the 256 values that an 8-bit Char can represent. For instance, the kanji character sets from Asia have thousands of characters in them. You can't capture them using standard AnsiStrings; instead, you must use WideStrings.

NOTE

In Windows, the native wide character type (WCHAR) is 16 bits in size. In Linux, wide characters are 32 bits in size. The Kylix team decided to reuse the 16-bit WideChar already in place for Windows rather than rewrite the routines explicitly for the 32-bit Linux WideChar. As a result, your programs work with 16-bit WideChars, even though Linux defaults to 32-bit WideChars. Unless we are invaded from Alpha Centauri, where very large character sets are in common use, you should find that 16-bit WideChars are large enough for all practical purposes.

Starting with Kylix and Delphi 6, WideStrings are reference counted just as AnsiStrings are reference counted. In fact, you use a WideString exactly as you would use an AnsiString:

procedure TForm1.Button1Click(Sender: TObject);
var
  S: WideString;
begin
  S := 'Sam';
  Edit1.Text := S;
end;

This example shows that you can convert an AnsiString to a WideString and also convert a WideString to an AnsiString through the simple use of the assignment operator. In Kylix and Delphi 6, code based on WideStrings is actually quite efficient. If you have good reason to use WideStrings, go ahead and use them. The compiler handles them quite easily.
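If you do not have a form handy, the same conversions can be tried in a console program. This sketch assumes Free Pascal, and RoundTrip is a hypothetical helper. I stick to plain ASCII text here; on some platforms, converting other characters may need extra runtime support.

```pascal
program WideStringDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: converts to a WideString and back again.
// Plain assignment performs both conversions.
function RoundTrip(const Source: AnsiString): AnsiString;
var
  W: WideString;
begin
  W := Source;    // AnsiString to WideString
  Result := W;    // WideString to AnsiString
end;

var
  W: WideString;
begin
  W := 'Sam';
  WriteLn(Length(W));                  // 3: Length counts WideChars
  WriteLn(SizeOf(WideChar));           // 2: each WideChar is 16 bits
  WriteLn(RoundTrip('Sam') = 'Sam');   // TRUE for plain ASCII text
end.
```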

This is the end of the section on Strings. Next up are typecasts, a technology used very widely in Kylix programs. After that, we will look at the array and record types, and then we'll take a quick tour of Object Pascal pointers.
