Managing the WHERE Clause
The WHERE clause is the place to start your search for SQL you can change to get better performance. You already know what indexes you have. Now look at the optimizer plan for a problem query and see if the indexes are getting used or if the optimizer is searching the table row by row.
Why a Table Scan?
Just because you have an index on a column doesn't mean your optimizer will use it.
If the amount of data is trivial (as in the msdpn tables), a search using an index may not be faster than a table scan and the optimizer may choose not to use the index.
If the query includes all rows in the table (no WHERE clause), the optimizer does a table scan.
If you're retrieving a lot of data from the table, an index may give no advantage.
If the optimizer does not have accurate information on data distribution, it may pick the wrong index or no index at all.
If you include certain elements in the WHERE clause, you may make the index unavailable.
Of these elements, only the last two are under your control.
Data Distribution Statistics
First, check your system documents to see if the DBA needs periodically to run a command to keep the optimizer current (UPDATE STATISTICS in Transact-SQL and Informix, ANALYZE in Oracle). Microsoft SQL Server also supports a command that tells you when the command was last run (STAT_DATEthere's more information on it in "Using System Functions" in Chapter 7). See Table 62 for a list of commands associated with index statistics. The SQL Anywhere ESTIMATE command is included in the table, but it works differently than the UPDATE STATISTICS or ANALYZE commands, allowing the user to give the optimizer hints on data distribution.
Table 62. Index Statistics
MS SQL Server
Disabling an Index with a Bad WHERE
Second, don't disable a valid index by the way you construct your WHERE clause. The easiest elements to optimize are comparison operators (=, >, <, and variants) or operators that can be translated to comparison operators (BETWEEN and some LIKE clauses).
Anything else may make your indexes unavailable. Here are some suspicious areas to investigate. Since architecture and optimizers vary so much, you'll have to check your system documentation to find out just how these areas affect (or don't affect!) your queries.
- Comparing columns in the same table
- Choosing columns with low-selectivity indexes
- Doing math on a column before comparing it to a constant
- Applying a function to column data before comparing it to a constant
- Finding ranges with BETWEEN
- Matching with LIKE
- Comparing to NULL
- Negating with NOT
- Converting values
- Using OR
- Finding sets of values with IN
- Using multicolumn indexes
Comparing Columns in the Same Table
In many systems, comparing columns in the same table makes an index useless. For example, consider matching the empnum column to a string constant versus matching it to another column, bossnum. The two queries return the same results.
Adaptive Server Anywhere select fname, lname, empnum, bossnum from employee where empnum = bossnum select fname, lname, empnum, bossnum from employee where empnum = '443232366' fname lname empnum bossnum ==================== ==================== ========= ========= Scorley Blake-Pipps 443232366 443232366 [1 row]
With a nontrivial number of rows, the first query scans the table sequentially. The second uses the empix index. If you test this query on another system, you may not see this difference, because the employee table is so small. However, it's clear that an index on empid helps find the employee with a particular identification number and is less useful in finding one with the same number as the boss. Why? Because a constant (443232366) is constant. An index points to a known value, not an unknown value.
Using Nonselective Indexes
A unique index is 100% selective. Every index entry points to a single location in the data. A nonselective index is just the oppositeeach index entry points to multiple data locations. You can think of selectivity as roughly the number of distinct index entries divided by the number of data rows. Optimizers don't get much advantage from nonselective indexes, and they often don't use them.
Returning to the book index comparison, imagine an index that listed every occurrence of the word "and." Because "and" is so common, you'd find an index reference to just about every page in the book. Using the index does not make finding "and" faster than paging through the book, because "and" has low selectivity.
Consider attributes such as gender, where there are only two choices. There's no point indexing this kind of column unless the data distribution is very skewed. If 90% of the employees are male and 10% are female, the optimizer might use the index where gender is female. It would not use it in searches for males.
Doing Math on a Column
Another WHERE clause element to watch out for is doing math on an indexed column before comparing it to a constant. The index can find the column value but not the column value * 2. To test this out on the msdpn database on the Adaptive Server Anywhere CD, run these two queries.
Adaptive Server Anywhere select prodnum, price from product where price * 2 > 200 select prodnum, price from product where price > 100 prodnum price =========== ========== 2111 119.99 1106 149.00 1794 400.00 [3 rows]
The first one does a table scan. The second takes advantage of the pricix index. This change is relatively easy to enforce, thanks to our good friend Ms. Algebra. Unadorned indexed columns on the left! Computations on the right! Get in the habit of changing WHERE clauses as in Figure 69.
Figure 69. Calculation in the WHERE Clause
Functions have the same effect on an indexed column in a WHERE clause as math does. The index points to a value but understands nothing about calculations or functions. You already know that a search for employees by number uses the empix index.
Adaptive Server Anywhere select fname, lname, empnum, bossnum from employee where empnum = '443232366'
If you modify the column name (empnum) with a substring function, even one that does exactly the same match and produces the same results, you'll get a table scan.
Adaptive Server Anywhere select fname, lname, empnum, bossnum from employee where substr(empnum, 1, 9) = '443232366' fname lname empnum bossnum ==================== ==================== ========= ========= Scorley Blake-Pipps 443232366 443232366 [1 row]
Here's another example. The first query does a table scan. The second uses the empid index.
Adaptive Server Anywhere select fname, lname, empnum, bossnum from employee where empnum || ' ' || fname = '443232366 Scorley' select fname, lname, empnum, bossnum from employee where empnum = '443232366' and fname = 'Scorley' fname lname empnum bossnum ==================== ==================== ========= ========= Scorley Blake-Pipps 443232366 443232366 [1 row]
Finding Ranges with BETWEEN
If there is an index on a column, BETWEEN will not disable it. BETWEEN is treated as a pair of comparison operators. The low value must precede the high value.
The first of the following queries uses the prodix index on prodnum. The second uses the index on price.
Adaptive Server Anywhere select prodnum, name from product where prodnum between '1110' and '1357' prodnum name =========== ==================== 1110 star systems 1255 bug stories 1357 nt guru [3 rows] select prodnum, name, price from product where price between 300 and 400 prodnum name price =========== ==================== ========== 1794 memory8 400.00 [1 row]
NOT BETWEEN is not as easy to resolve with an indexan index points to a specific value. In the following case, you get a table scan.
Adaptive Server Anywhere select prodnum, name, price from product where price not between 10 and 100 prodnum name price =========== ==================== ========== 2111 memory tripler 119.99 1794 memory8 400.00 1106 z_connector 149.00 [3 rows]
Matching with LIKE
Most systems can take advantage of an index with LIKE as long as you provide the first character in the pattern (see "Matching Patterns" in Chapter 2). To follow this example, start by creating an index on the name columns of customer, last name first.
Adaptive Server Anywhere create unique index custnmix on customer (lname, fname) select lname, fname from customer where lname like 'Pe%'
The SQL engine translates the LIKE into a range comparison and is able to take advantage of the index.
Adaptive Server Anywhere select lname, fname from customer where lname >= 'Pe' and lname < 'Pf' lname fname ==================== ==================== Peters Pete [1 row]
However, the picture is different if you start with a wildcard. The optimizer does not use an indexthe index cannot point to an unknown value.
Adaptive Server Anywhere select lname, fname from customer where lname like '%ete%'
You get the same answer, of course, but the method is a table scan.
Comparing to NULL
Adaptive Server Anywhere uses the index on product.price in both the following queries, but not all systems do. Check vendor documentation to find out how nulls and indexes relate. If your system takes the "a null is not equal to anything" dictum into the realm of indexes, you may want to campaign for defaults instead of nulls at design time.
Adaptive Server Anywhere select prodnum, price from product where price = 49.99 select prodnum, price from product where price is null
Another issue to check out is whether or not null returns a value in an IN. The following query may return one or two rows, depending on how your system handles null values in this situation.
Adaptive Server Anywhere select prodnum, price from product where price in (null, 400.00)
Negating with NOT
A related area is how NOT and negatives such as <> and != affect index use. Indexes, after all, point to entries. What do they know about nonentries? The two following queries return the same results here (they could return quite disparate results with different data). The first uses the custnmix on last and first names. The second does not.
Adaptive Server Anywhere select lname, fname from customer where lname = 'WONG' select lname, fname from customer where lname not between ' ' and 'W' lname fname ==================== ==================== WONG LI-REN [1 row]
Get in the habit of converting NOT phrases when you can (Figure 610). In most cases, you'll be better off looking for rows with prices greater than zero rather than rows with prices not equal to zero. In addition to its effect on index uses, negative logic is sometimes hard to understand and therefore open to error.
Figure 610. Negatives in WHERE
Conversions, like math or functions, can disable an index, and this is true for autoconversions the system handles as well as conversions you perform. For starters, keep the conversion on the right side of the equation, if possible (Figure 611).
Figure 611. Conversions in WHERE
In ASE, character data defined to allow null is actually stored as VARCHAR data, so joining two columns that differ only in whether or not they allow nulls may cause a conversion. The optimizer cannot use an index on the converted column. In Oracle, some autoconversions deactivate an index, others don't. Check your vendor documentation for details.
You'll find information about convert functions in "Converting Dates (and Other Datatypes)" in Chapter 2.
An OR clause returns results if any one of the conditions is true (A = 1 or B > A or C = 3). If the columns in an OR clause have indexes, the optimizer can use the relevant indexes or do a table scan. Using the indexes means dumping results from each clause into an intermediate table and then removing duplicates from the intermediate table (Figure 612).
Figure 612. OR Processing: Indexes
In a cost-based system, it's important that the table statistics be up to date. Otherwise, the optimizer may make the wrong choice.
If the ORed columns do not have indexes or the optimizer chooses not to use the indexes, you'll see a table scan (Figure 613).
Figure 613. OR Processing: No Indexes
Finding Sets of Values with IN
Many third-party front-end applications overuse IN, sometimes employing it to find a single value or to return values in a range (Figure 614). This can be a problem, as IN is a form of OR processing, and tends to be expensive. If you are lucky, an IN with multiple terms can be re-stated as a range. This works only when the elements in the IN are the only ones in the range. If there were a $200 price in the second Figure 614 query, the BETWEEN translation would not be equivalent to the IN version.
Figure 614. IN
Check IN clauses carefully. Could you do the same work with a simple equals comparison? With a range?
Using Multicolumn Indexes
In many cases, indexes consist of two or more columns (Figure 615). The ordprodix index in orderdetail is an example: it includes the ordnum and prodnum columns, in that order, and the order matters. The optimizer can use the index to find ordnum or combinations of ordnum and prodnum, but it can't follow pointers to go directly to prodnum values. Think of a phone book entry. It helps you find all the subscribers surnamed Smith and all the Smiths with Heather as a first name. It's no good for finding people with an unknown last name whose first name is Heather, though.
Figure 615. Multicolumn Indexes
When you construct WHERE clauses, keep in mind the order of the columns in the index. Order of columns in the SELECT list has no effect.