Home > Articles > Data > SQL Server

SQL Server Reference Guide

Hosted by

Toggle Open Guide Table of ContentsGuide Contents

Close Table of ContentsGuide Contents

Close Table of Contents

Database Objects: Indexes

Last updated Mar 28, 2003.

As I mentioned in my last overview on Stored Procedures, the two main categories of database objects are those that store data and those that access or provide access to data. Tables are the only objects that actually store data; all other objects in the database provide access to data.

It seems obvious that objects such as views and stored procedures provide access to data, but in this article I’ll describe another object that provides a round-about way of data access: indexes. In this overview, I’ll explain index theory, index structures, and how indexes are implemented. To put these theories into practice, we have other articles here at InformIT regarding maintain indexes. In other words, this article is more theory; the "how to" portions are in the sections I'll reference at the bottom of this overview.

Indexes Defined

So just what is an index? An index is a structure in the database that SQL Server can use to find and access data quickly.

Think about a fairly thick book. If I asked you to find a particular topic or even a word in that book, you’d turn to the index. The index in the book is very similar to what a SQL Server Index.

In the case of the book’s index, you can find the page numbers where a word or phrase is located. You simply find the word (which is listed in alphabetical order) and the page or pages where that word or phrase appears is listed just to the right. In fact, most indexes are even broken out into alphabetical groupings in a bold heading, which is a visual cue for you to locate that section quickly. In addition to alphabetical order, sometimes books have additional indexes based on content type or another grouping.

If you didn’t have an index or table of contents in a book when you were asked to locate a word, you’d have to read each and every page to find it. And even when you found it, you’d have to keep reading to make sure that it didn’t occur again. You would have to do this each and every time I asked you to find a word or phrase, since you probably can't memorize every location of something as you look it up. Each time you're asked where a word is, you have to scan the entire book.

In the case of SQL Server, you have the same issue. You have a lot of data, and it may surprise you to learn that by its very definition, the data in a relational database isn't stored in any particular order. It's just scattered all over the data pages (which are 8K blocks on the hard drive) with no particular order other than what was quickest to insert at the time. That makes the job of finding the data very difficult.

In fact, without some sort of help, the server’s Query Optimization engine has to do a "scan" operation, which is the same as what you and I would do without an index in a book — it would have to read every page of data to locate the field you're asked for. With tables under a thousand rows or so, this isn't really a problem, and in fact is the default behavior in SQL Server. But with larger tables, those with thousands or millions of rows, this takes a lot of time. It also has some other bad side-effects. Because all of those rows don't fit into memory, the server has to read from the I/O system, which is among the slowest components on your server. While data are being read from the I/O system, the database can't write to the tables while it is reading from them. That slows down not only reads but writes.

It gets worse. In a database the data is changing all the time. Imagine that you’re asked to find a word in a book, but someone keeps writing new words in the book, and changing where the old ones were.

What you can do to mitigate this scanning problem is to create an Index. The index in SQL Server works the same as the index in a book. It stores the words you're interested in along with a number (which is really a pointer to a block on the I/O system) where the word or phrase is found. Now all SQL Server has to do when you ask for data is to look up the word or phrase in the index, which is much smaller, and find where it lives on the hard drive.

The physical arrangement of indexes in SQL Server is just a table-structure in the database, and on the hard drive. When you create an index, a new physical structure is created and new rows of data are put into that table. You can tell SQL Server when you create the index which fields in the database table you want to track, just like you would mark words in a book that need to be tracked for its index. You don't put all data into an index, since it that would create the same problem you're trying to solve. You only index the fields that users are likely to look up, or fields that are useful for your programs to use. I'll explain more about this later.

The physical arrangement of indexes fall into two types: Clustered and Non-Clustered. There are important differences in these two types, and which one you choose should be carefully considered during the design phase.

A clustered index actually changes the way the data is stored in the table, and the index. In fact, if you create a clustered index on a table, the table itself simply becomes the index.

Let’s take a look at a very simple table:

A

D

C

B

E

One tenet of a relational database is that the data isn’t stored in any particular order, and SQL Server holds to that tenet as you can see here. Tables where the data is in no particular order and with no indexes are called heap tables. Once you apply a clustered index to the columns, however, the table changes physically on the hard drive to this:

A

B

C

D

E

As you can see, the data can only be physically arranged one way. You can’t have the same data arranged physically as A-F and also F-A. Since the data is arranged this way, the server doesn't need to store another table for the index pointing to the data — it just calls the table the index, and the index the table.

The advantage here is that data access is extremely fast. It's like rearranging a book to the point where every word is in alphabetical order. That wouldn't help the storyline much, but you would be able to find anything quickly. Come to think of it, there is a book like that — the dictionary.

Of course, this speed comes at a price. Since the table is physically ordered by the index fields, each time you insert a new row of data the I/O subsystem has to rearrange all the data in the table. So if there are a lot of data changes, the system can slow down to a crawl. If the data doesn't change that often or is arranged in large groups, a clustered index might be the best solution.

The other type of index in SQL Server is called Non-Clustered. This type doesn’t physically re-arrange the data; it just makes another table-like structure which points to where the data is physically located.

So if we continue the book analogy, a clustered index is similar to the table of contents, since the data is arranged that way. A non-clustered index would be similar to a book’s index in the back.

Let's examine our table again, but this time I'll add some more data and show a hidden column: the block address of each row:

AEF1

A

BFE2

D

AEF1

C

BFE2

B

BFE2

E

BFE2

A

You'll notice that there are two "A" values, stored at different locations. At this point if you asked for all of the "A" values the engine would have to read the entire table structure, stored on all these block fragments. If there were thousands of blocks and only two values of "A", the engine would still read them all just to make sure. That would take a lot of time, and create lots of locks as it did so.

Now we'll create an index on the table, which makes another internal table structure that looks like this:

AEF1, BFE2

A

BFE2

B

AEF1

C

BFE2

D

BFE2

E

Now when you look up all the "A" values the engine sees that there is an index, uses that index to find the data, and quickly calls up just those two blocks where the data lives. That's a faster access process, and takes far fewer locks on the rows or tables.

Types of Indexes

There are many types of indexes in SQL Server. The one that you'll work with most often is the user-defined index. This is the Index object type in the database, and they are stored in one of the physical arrangements I just explained.

But there are other types of indexes as well. You see, even if you don't specify an index on a table, the SQL Server Query Optimizer wants to be able to find data quickly. So it will "watch" what you are querying on and make temporary indexes of its own, called "statistical indexes" or more commonly "statistics". You can tell SQL Server to create and maintain these statistics automatically, or you can control that manually. You can make those changes on the database settings in the Properties tab for the database. I recommend that you allow the server to create and maintain statistics, unless the database is quite large and you'll run this process manually. I'll cover this topic more in the other articles referenced at the end of this overview.

Another type of index is the Full-Text index. This structure is used for Full-Text fields and isn't stored in SQL Server at all, and I've covered that information in more depth here.

Finally, in SQL Server 2005, you can even index views, computed columns, and create a special XML index. I'll cover those in other overviews as well.

Using Indexes Effectively

The basic premise in any of these situations is that you need to create an index on the columns that the users access the most for searching. For instance, in an address table, the last name, city, and state are normally prime candidates for an index. In that case, you might make an index for each of those three fields. You might also make a single index that has all three of these fields. I’ll describe that further in a moment. First, take a look at this table:

1

1

1

1

2

2

2

2

3

Now look at this table:

1

2

3

4

5

See the difference? If you have an index on first field of the second table, you'll end up with the same values in the index as in the table itself. In the first filed of the first table, you have a lot of the same values, and they will potentially end up all over the drive. If users need to look up that data frequently, that's the column you should index. We'll cover these decisions more in the other tutorials I've mentioned.

You’ll normally want to create indexes on the fields in a query’s predicate (the WHERE clause). You may include more than one column in an index. Including more than one column in an index is called a composite index. If your users are including the AND predicate, then you may need to include all those columns in a single index. Don’t go crazy with that, as anything above five or so columns will usually cost you.

To follow on with that concept, sometimes you’ll create a compound index that has all of the fields a user might select at one time. Creating an index on all those fields is called a covering index. What this means is that all of the data you're query needs is included in the index table itself, so it never has to use the pointer at all – it's kind of like having the advantages of a clustered index without as much of the expense. In some cases that can really speed up the query.

If your database is used in an application that performs many inserts, edits, and deletes, then speedy access isn’t always the driving factor. Databases that are used in transaction-oriented solutions, such as a sales database, fit this model. In a database where sales are recorded, there aren’t a lot of look-up activities, other than perhaps checking inventory. These types of database systems are called On-Line Transaction Processing systems, or OLTP. The decisions regarding the use of indexes is important in OLTP databases, because adding an index slows down INSERT operations, as the index is kept up to date when the data changes. So indexes on this type of database are less useful, and sometimes can impact performance. That isn't to say you shouldn't have any indexes, because even insert operations are sometimes based on lookups.

The second type of database is more reporting or look-up oriented. Databases that match this description are those used at a public library, or those used to provide reporting information. When those systems provide multidimensional data lookups they are called On-Line Analytical Processing systems, or OLAP. It’s easier to design indexes for these databases, since the data doesn’t change as often and you’re usually familiar with the columns the user is searching on.

The next type is the hybrid transaction/reporting database. This type of database is used for both entry and reporting. This is the most difficult situation for designing indexes, because you need them to speed up access for reporting but you don’t need them because they will slow down access for data entry. Unfortunately, this is also the type you’ll encounter most often.

Creating indexes is more difficult to do with OLAP databases than with OLTP databases. As a matter of fact, it’s usually best to minimize indexes for OLTP. Of course, that’s assuming your database is used purely for transactions and not also for reporting. Unfortunately, that’s not normally the case.

The next question is how to determine which kind of physical index to use — clustered or non-clustered. There are some other good indicators that the index you’re making should be clustered. Since the table is physically re-arranged if it’s clustered, then it’s best if the data doesn’t change a lot. If the table with a clustered index on the first field looks like this:

1

3

4

6

7

Then to insert the values "2" and "5" into the table (which you'll remember is the index as well) everything has to move around to accommodate the insertion. It’s better if the data that will be inserted comes after the data that’s already there — at least on the columns where a clustered index is defined. That way no movement happens at all.

So now you can see that perhaps a great candidate for the one (and only) clustered index on a table is the primary key — the fields you have set as the ones that uniquely identify the rows in the table. And you also can see that picking a "meaningful" value like an identification number for a person is such a bad idea for a Primary Key. If the number changes or the person leaves, the whole table would need to be rearranged physically on the drive. If you pick a meaningless, incrementing number for the Primary Key it becomes a natural target for a clustered index, since so many lookups are based on the Primary Key.

Another consideration for making the index clustered is the length of the data. If the data length is short, and even better if it’s made up of integers, the insert operations are quick and the system can move the data around easily. It’s also a good idea if the data is unique. When the values are unique, there is less chance that you'll have to move the data to insert it.

You should also consider a clustered index if the data is within a range that the user will search on. Queries such as:

SELECT au_lname 
FROM authors 
WHERE au_lname like ’[A-L]%’

Work well with a clustered index on the au_lname column.

If the order of the data is similar to the order in which the user wants to get results, it’s also a great candidate for a clustered index.

If a column isn’t a candidate for a clustered index after all these considerations, then of course you’re looking at creating a non-clustered index. Luckily, this is the default option.

Check the references section that follows for more strategies on how you can create and maintain good indexes. What I've shown here are more guidelines than hard and fast rules. You mileage will vary.

Informit Articles and Sample Chapters

An overview of the mechanics of implementing effective indexes is here.

To learn more about index operations, see this tutorial.

SQL Server: Optimizing Database Performance Through Indexes, by Kevin Kline, Baya Pavliashvili.

Online Resources

SQL Server Performance asks: Will the SQL Server Query Optimizer will use your carefully crafted indexes?

Create and Modify SQL Server Indexes Properly — Learn how to determine the state of your indexes and understand whether that state should be changed. See how evaluating the construction and location of your system’s indexes and deciding whether to adjust their fill factors and padding can improve the system’s performance, by Drew Georgopulos.

Here are some tips on optimizing your SQL Server indexes.

Microsoft explains the architecture of indexes here.

Here is another general description of Indexes.