Home > Articles > Data > SQL

SQL Performance Tuning: Simple "Searches"

  • Print
  • + Share This
  • 💬 Discuss
This chapter is from the book

This chapter is from the book

Master SQL syntax-based optimizing and simple search conditions. Learn which search conditions are best, and armed with this information, you can decide whether to change the order of expressions, or to substitute one expression for another that does the same work more efficiently.

In this chapter, we'll talk about syntax-based optimizing and simple search conditions.

A syntax is a choice of words and their arrangement in the SQL statement. To optimize based on syntax, assume that nonsyntactic factors (e.g., indexes, table sizes, storage) are irrelevant or unknown. This is the lowest level of optimizing—it's usually predictable, and some of it can be done on the client.

There's no point in attempting to optimize most SQL syntax because only certain SQL statements have options that lend themselves to optimization. The particular syntax that offers many optimization possibilities is the SQL search condition. Here are three examples of search conditions:

 ... WHERE title LIKE 'The %' OR title LIKE 'A %'
 ... WHERE name <> 'Smith'
 ... WHERE number = 5

Although the slowest search conditions are those that contain joins and subqueries, this chapter deals only with single-table searches. (We'll talk about joins and subqueries later.) Also, although search conditions can appear in HAVING, IF, or ON clauses, we'll talk only about search conditions in WHERE clauses. So the chapter title—"Simple Searches"—is an understatement. Don't worry—the complex queries will come later.

General Tuning

In this part of the chapter, we'll look at some general ideas you should keep in mind when writing simple search conditions.

Code for Points

The best search conditions are those that work on few rows with easy-to-do comparisons. Table 2-1 and Table 2-2 show typical lists (derived from vendors' manuals) of types of search conditions, in order from best to worst. Each search condition component has a "point count"—the better the component, the higher the score. You can see from the allotted points shown in Tables 2-1 and 2-2 that the best search condition is something like:

 ... WHERE smallint_column = 12345

Table 2-1. Search Condition Point Counts for Operators

Operator

Points

=

10

>

5

>=

5

<

5

<=

5

LIKE

3

<>

0


Table 2-2. Search Condition Point Counts for Operands

Operand

Points

Literal alone

10

Column alone

5

Parameter alone

5

Multioperand expression

3

Exact numeric data type

2

Other numeric data type

1

Temporal data type

1

Character data type

0

NULL

0


This example gets a total of 27 points, calculated as follows:

  • Five points for the column (smallint_column) alone on the left

  • Two points for the exact numeric (smallint_column) operand data type

  • Ten points for the equals operator

  • Ten points for the literal (12345) alone on the right

Here's another example:

 ... WHERE char_column >= varchar_column || 'x'

The point count for this type of search condition is much lower—only 13:

  • Five points for the column (char_column) alone on the left

  • Zero points for the CHAR (char_column) operand data type

  • Five points for the greater than or equal to operator

  • Three points for the multioperand expression (varchar_column || 'x') on the right

  • Zero points for the VARCHAR (varchar_column) operand data type

The precise point count for a search condition varies from vendor to vendor, so it's pointless to memorize anything other than the order and the concept for this optimization technique. So just remember:

The condition that takes the least time—usually because it involves fewer rows or easier comparisons—gets the most points.

Armed with this concept, you can decide whether to change the order of expressions, or to substitute one expression for another that does the same work. Even though a modern cost-based DBMS optimizer has many more rules that require information outside the SQL statement itself, all DBMSs still fall back on the point count when no other information is available. The possibility is always there that you will use an item of information that the optimizer doesn't. (For more information about cost-based optimizers, see Chapter 17 "Cost-Based Optimizers.")

Another way you can optimize a search condition is by putting multiple expressions in the correct order. The expressions in this WHERE clause are already in optimal order:

 SELECT * FROM Table1
  WHERE column1 = 5
   AND column2 = 77.3
   AND column3 = 'Smith'
   AND column4 < 117
   AND column4 > column5
 GAIN: 0/8

The note at the bottom of this example says there is a GAIN: 0/8. That's an important number, and we're going to say "GAIN: x/8" in many of our examples, so let's clarify. As explained in Chapter 1, the gain shows how many of the Big Eight run faster when the search condition is optimized. Mileage varies with different data and different machines, of course. We're only reporting what our tests showed.

So "GAIN: 0/8" means "you'd be wasting your time if you rearranged this particular WHERE clause into optimum order because the DBMS does this for you." All DBMS makers know the basics of point counting, so the rearrangement is automatic. This means that, in ordinary cases, you will gain nothing by doing your own syntax-based optimization. However, there are many exceptions to the rule. For the rest of this chapter, we'll look at cases where the gain is both significant and predictable.

Constant Propagation

"All men are mortal. Socrates is a man. Therefore, Socrates is mortal."
—Attributed to Aristotle

Formally, the Law of Transitivity states that:

 IF
 (A <comparison operator> B) IS TRUE AND (B <comparison operator> C) IS TRUE
 THEN
 (A <comparison operator> C) IS TRUE AND NOT (A <comparison operator> C) IS FALSE

(Comparison operator is any one of: = or > or >= or < or <= but not one of: <> or LIKE.)

The Law of Transitivity leads to the simple observation that we can substitute C for B without changing the meaning of an expression. When such a substitution involves substituting a constant value, the process is called constant propagation.1

The next two expressions mean the same thing, but the second expression has a better point count because it substitutes a literal (5) for a column name (column1):

 Expression #1
 ... WHERE column1 < column2
    AND column2 = column3
    AND column1 = 5

 Expression #2
 ... WHERE 5 < column2
    AND column2 = column3
    AND column1 = 5
 GAIN: 2/8

Expression #2 is called a transform of Expression #1. (Writing a transform means rewriting an SQL statement to produce the same result but with different syntax. When two SQL statements have different syntax, but will predictably and regularly produce the same outputs, they are known as transforms of one another.) Most good DBMSs do this sort of thing automatically. But some DBMSs won't try transforms when the expression contains multiple parentheses and NOTs. For example, this SELECT statement can be slow:

 SELECT * FROM Table1
  WHERE column1 = 5 AND
   NOT (column3 = 7 OR column1 = column2)

Applying the transforms ourselves, we came up with this statement:

 SELECT * FROM Table1
  WHERE column1 = 5
   AND column3 <> 7
   AND column2 <> 5
 GAIN: 5/8

The transformed statement is faster more than half of the time. In other words, sometimes it pays to do your own transforms.

Sometimes constant propagation won't work with floats, because it's possible to be both "greater than" and "equal to" at the same time when approximate numeric comparisons happen. When it does work, though, expect a GAIN: 5/8. And sometimes, constant propagation won't work for CHAR expressions. But when it does, expect a GAIN: 4/8.

Exercise time: The MySQL online documentation has this example:

 ... WHERE a < b AND b = c AND a = 5

transforms to:

 ... WHERE b > 5 AND b = c AND a = 5

The quiz question here is—Did the MySQL folks make a mistake?2

In the real world, you'll find many semiconstant operands, as program parameters or functions. Examples are the niladic functions like CURRENT_DATE (a niladic function is a function that has no arguments). Because using a constant value always accelerates accesses, try a transform to speed up these cases. Here's an example: Query #1 transforms to Query #2:

 Query #1:
 SELECT * FROM Table1
  WHERE date_column = CURRENT_DATE
   AND amount * 5 > 100.00
 Query #2:
 SELECT * FROM Table1
  WHERE date_column = DATE '2002-01-01'
   AND amount * 5 > 100.00
 GAIN: 5/8

If you're thinking of transforming this type of expression, keep in mind that (because of the DATE constant), you'd have to change the query every day. That's only practical when an application program is generating the queries on the server.

Dead Code Elimination

"Branches with no leaves should be cut off."
—Ortho Guide To Pruning Bushes and Shrubs

In some old SQL programs, you'll encounter literals on both sides of the comparison operator, as in this example:

 SELECT * FROM Table1
  WHERE 0 = 1
   AND column1 = 'I hope we never execute this'

In the days before C-style /* comments */ were legal in an SQL statement, this was a way to add an inline comment string. Because the expression 0 = 1 is always false, this query will always return zero rows, and therefore DBMSs can skip the whole WHERE clause. But some of them don't. We tested this by removing the WHERE clause and got a gain:

 SELECT * FROM Table1
 GAIN: 5/8

It is, of course, obvious that these two queries aren't equivalent—the point is merely that it should take less time to retrieve zero rows due to an always-false condition than it should to do a full table scan-provided the DBMS doesn't evaluate the always-false condition. This example shows that DBMSs don't always throw out always-false conditions and all their dependents in the PREPARE stage. But they're pretty reliable at throwing out always-true conditions. So you can use always-true conditions for an SQL equivalent of conditional compilation. For example, if you worry that a DBMS won't give high precision for division results, add a separate condition that comes into play only when necessary—as in this example:

 ... WHERE (77 / 10 = 7.7 AND column1 / 10 = 7.7)
    OR (77 / 10 = 7 AND column1 * 10 = 77)
 GAIN: 5/8

Because of the unreliability aspect, though, it is usually a bad idea to put in redundant code. Suppose that a column, indexed_column, is an indexed NOT NULL column. You could transform this SQL statement:

 SELECT * FROM Table1

to this statement:

 SELECT * FROM Table1
  WHERE indexed_column > 0

This is a way of forcing the DBMS to look up via the index. Alas, it works only with a few DBMSs. In general, then, don't add redundant conditions to WHERE clauses.

Ensure You Use the Right DBMS

There are several ways to ensure that a specific DBMS (and no other) executes an expression. Here are three examples, all of which use nonstandard SQL extensions:

 Example 1:
 ... WHERE :variable = 'Oracle'
    AND /* Oracle-specific code here */

 Example 2:
 SELECT /* ! HIGH_PRIORITY */ ...
  /* all DBMSs except MySQL ignore this */

 Example 3:
 ... WHERE <escape-sequence> AND /* ODBC code */

While we're on the subject, Oracle allows you to add a comment that indicates what index you want to use. It looks like this:

 SELECT /*+ INDEX(Widgets Widget_index) */
    column1, column2, column3
  FROM Widgets
  WHERE column1 <> 7;
 GAIN: only 1/8 because it's Oracle-specific

Oracle-specific optimizations are bad ideas if they tie you to Oracle. In this case, the hint is in a comment, so other DBMSs will ignore it. That's good—it's more portable than putting hints outside comments as Microsoft and Sybase do. So it's okay—until other DBMSs start to put executable data inside comments too. Some are already starting to do so. So right now hints are okay, but eventually they will lead to conflict and chaos.

Constant Folding

"Go north one mile, west one mile, west one more mile, then south one mile, and you will be at your starting point."
—The South Pole Riddle

Anyone who has used C will know that the expression x=1+1-1-1 is folded to x=0 at compile time. So it may surprise you that many SQL DBMSs do not fold these five obvious-looking transform candidates:

 ... WHERE column1 + 0

 ... WHERE 5 + 0.0

 ... WHERE column1 IN (1, 3, 3)

 ... CAST(1 AS INTEGER)

 ... WHERE 'a' || 'b'

If you find expressions like these in old code though, our tip is—Leave them alone. They are there for historical reasons, such as forcing the DBMS to ignore indexes, changing the result data type, allowing for the difference between SMALLINT and INTEGER, or evading a limit on line size. Sorry—but the obvious-looking cases are precisely the cases where you should stop and wonder whether the original programmer had some reason for the weird syntax choice. We read a Java optimization article once that sums it up nicely: "Rule #1: Understand the code." Nevertheless, we do recommend that you transform this search condition:

 ... WHERE a - 3 = 5

to:

 ... WHERE a = 8     /* a - 3 = 5 */
 GAIN: 6/8

Although it's useless in simple cases, constant folding can lead to constant propagation and is therefore A Good Thing.

Case-Insensitive Searches

Microsoft's Access Jet considers the strings 'SMITH' and 'Smith' to be equal, so Access is called a case-insensitive DBMS. Oracle, on the other hand, is usually case sensitive (the engine would say 'SMITH' and 'Smith' are unequal strings). Sybase allows you to decide about case sensitivity when you install, and a true SQL Standard DBMS will allow you to switch case sensitivity at runtime. We've seen many programmers try to ensure case insensitivity by using the fold function UPPER, as in:

 ... WHERE UPPER(column1) = 'SMITH'

That can be a mistake if you're dealing with strings that contain anything other than strictly Latin letters. With some DBMSs, when you translate certain French or German strings to uppercase, you lose information. For example, the function:

 ... UPPER('rÈsumÈ')

returns RESUME—that is, the accent marks are lost, changing the meaning of the word from "curriculum vitae" to "begin again." Because information isn't lost going the other way, it's better to use the LOWER function, like this:

 ... WHERE LOWER(column1) = 'rÈsumÈ'

An even better way is to eliminate the fold function entirely if that's possible, because—we appeal to authority here—both the Microsoft and Oracle manuals say: "Avoid functions on columns." We're sure they mean "avoid functions on columns when there's another way to get the result needed"—for example, to ensure case insensitivity, the best method is to use a case-insensitive collation rather than a fold function.

A slightly faster search assumes that the data is clean and asks for the only reasonable combinations, like this:

 ... WHERE column1 = 'SMITH'
    OR column1 = 'Smith'
 GAIN: 8/8

which is still slow. Our tip here is—Take advantage of dead code elimination so that the 'Smith' search happens only when the DBMS is case sensitive. Here's how:

 ... WHERE column1 = 'SMITH'
    OR ('SMITH' <> 'Smith' AND column1 = 'Smith')
 GAIN: 3/8

Sargability

"CHUFFED. adj. [1] Pleased. [2] Displeased."
—The Concise Oxford Dictionary

The ideal SQL search condition has the general form:

 <column> <comparison operator> <literal>

In the early days, IBM researchers named these kinds of search conditions "sargable predicates" because SARG is a contraction for Search ARGument. In later days, Microsoft and Sybase redefined "sargable" to mean "can be looked up via the index." Alas—when the same word has two very different meanings, it's not much use as a word any more! So we titled this section "Sargability" just for fun. But although the word is dead, the idea lives on in this rule:

The left side of a search condition should be a simple column name; the right side should be an easy-to-look-up value.

To enforce this rule, all DBMSs will transpose the expression:

 5 = column1

to:

 column1 = 5

When there's arithmetic involved, though, only some DBMSs will transpose. For example, we tested this transform:

 ... WHERE column1 - 3 = -column2

transforms to:

 ... WHERE column1 = -column2 + 3
 GAIN: 4/8

The gain shows that doing the transform ourselves helped considerably.

On a 32-bit computer, arithmetic is fastest if all operands are INTEGERs (because INTEGERs are 32-bit signed numbers) rather than SMALLINTs, DECIMALs, or FLOATs. Thus this condition:

 ... WHERE decimal_column * float_column

is slower than:

 ... WHERE integer_column * integer_column
 GAIN: 5/8

You might expect "GAIN: 8/8" with the last example, but some DBMSs (Oracle is an example) don't distinguish between 16-bit and 32-bit numbers or store them in binary form. (Recall that dBASE stores numbers in ASCII. When a DBMS stores numbers in ASCII, truncation will always beat division by ten. So store using the base that you might divide by.)

The Bottom Line: General Tuning

The left side of a search condition should be a simple column name; the right side should be an easy-to-look-up value.

Each component of a search condition has a point count. The higher the points, the faster the component. The condition that takes the least time gets the most points.

Put multiple expressions in the correct order.

Use the Law of Transitivity and the concept of constant propagation to substitute a literal for a column name or column expression whenever you can do so without changing the meaning of an expression.

Some DBMSs won't fold most obvious-looking (to a C programmer) expressions. Don't use this principle as the reason to always transform such expressions when you find them in old code though—usually they're there for historical reasons. Remember Rule #1: Understand the code before changing it.

If the code involves an obvious math expression, do evaluate it and transform the condition to the evaluated result. Constant folding can lead to constant propagation and is therefore A Good Thing.

Avoid functions on columns.

If you can't avoid functions, don't use UPPER to ensure case insensitivity. Use LOWER instead.

  • + Share This
  • 🔖 Save To Your Account

Discussions

comments powered by Disqus