3.6 Choosing Data Types
Section 3.2, “MySQL Data Types,” described the various data types from which you can choose and the general properties of those types, such as the kind of values they may contain, how much storage space they take, and so on. But how do you actually decide which types to use when you create a table? This section discusses issues to consider that will help you choose.
The most “generic” data types are the string types. You can store anything in them because numbers and dates can be represented in string form. So should you just define all your columns as strings and be done with it? No. Let’s consider a simple example. Suppose that you have values that look like numbers. You could represent these as strings, but should you? What happens if you do?
For one thing, you’ll probably use more space, because numbers can be stored more efficiently using numeric columns than string columns. You’ll also notice some differences in query results due to the different ways that numbers and strings are handled. For example, the sort order for numbers is not the same as for strings. The number 2 is less than the number 11, but the string '2' is lexically greater than the string '11'. You can work around this by using the column in a numeric context like this:
SELECT col_name + 0 as num ... ORDER BY num;
Adding zero to the column forces a numeric sort, but is that a reasonable thing to do? It’s a useful technique sometimes, but you don’t want to have to use it every time you want a numeric sort. Causing MySQL to treat a string column as a number has a couple of significant implications. It forces a string-to-number conversion for each column value, which is inefficient. Also, using the column in a calculation prevents MySQL from using any index on the column, which slows down the query further. Neither of these performance degradations occur if you store the values as numbers in the first place.
The preceding example illustrates that several issues come into play when you choose data types. The simple choice of using one representation rather than another has implications for storage requirements, query handling, and processing performance. The following list gives a quick rundown of factors to think about when picking a type for a column.
What kind of values will the column hold? Numbers? Strings? Dates? Spatial values? This is an obvious question, but you must ask it. You can represent any type of value as a string, but as we’ve just seen, it’s likely that you’ll get better performance if you use other more appropriate types for numeric values. (This is also true for temporal and spatial values.) However, assessing the kind of values you’re working with isn’t necessarily trivial, particularly for other people’s data. It’s especially important to ask what kind of values the column will hold if you’re setting up a table for someone else, and you must be sure to ask enough questions to get sufficient information for making a good decision.
Do your values lie within some particular range? If they are integers, will they always be non-negative? If so, you can use UNSIGNED. If they are strings, will they always be chosen from among a fixed, limited set of values? If so, you may find ENUM or SET a useful type.
There is a tradeoff between the range of a type and the amount of storage it uses. How “big” a type do you need? For numbers, you can choose small types with a limited range of values, or large types with a much larger range. For strings, you can make them short or long, so you wouldn’t choose CHAR(255) if all the values you want to store contain fewer than 10 characters.
What are the performance and efficiency issues? Some types can be processed more efficiently than others. Numeric operations generally can be performed more quickly than string operations. Short strings can be compared more quickly than long strings, and also involve less disk overhead. For MyISAM tables, performance is better for fixed-length rows than for variable-length rows.
The following sections consider these issues in more detail, except for the performance issues, which are covered in Section 5.3, “Choosing Data Types for Efficient Queries.”
Before we proceed, I should point out that, although you want to make the best data type choices you can when you create a table, it’s not the end of the world if you make a choice that turns out to be non-optimal. You can use ALTER TABLE to change the type to a better one. This might be as simple as changing a SMALLINT to MEDIUMINT after finding out your data set contains values larger than you originally thought. Or it can be more complex, such as changing a CHAR to an ENUM with a specific set of allowed values. You can use PROCEDURE ANALYSE() to obtain information about your table’s columns, such as the minimum and maximum values as well as a suggested optimal type to cover the range of values in a column:
SELECT * FROM tbl_name PROCEDURE ANALYSE();
The output from this query may help you determine that a smaller type can be used, which can improve the performance of queries that involve the table and reduce the amount of space required for table storage. For more information about PROCEDURE ANALYSE(), see Section 5.3, “Choosing Data Types for Efficient Queries.”
3.6.1 What Kind of Values Will the Column Hold?
The first thing you think of when you’re trying to decide on a data type is the kind of values the column will be used for because this has the most evident implications for the type you choose. In general, you do the obvious thing: You store numbers in numeric columns, strings in string columns, and dates and times in temporal columns. If your numbers have a fractional part, you use a DECIMAL or floating-point type rather than an integer type. But sometimes there are exceptions. The principle here is that you need to understand the nature of your data to be able to choose the type in an informed manner. If you’re going to store your own data, you probably have a good idea of how to characterize it. On the other hand, if others ask you to set up a table for them, it’s sometimes a different story. It may not be so easy to know just what you’re working with. Be sure to ask enough questions to find out what kind of values the table really should contain.
Suppose that you’re told that a table needs a column to record “amount of precipitation.” Is that a number? Or is it “mostly” numeric—that is, typically but not always coded as a number? For example, when you watch the news on television, the weather report generally includes a measure of precipitation. Sometimes this is a number (as in “0.25 inches of rain”), but sometimes it’s a “trace” of precipitation, meaning “not much at all.” That’s fine for the weather report, but what does it mean for storage in a database? You either need to quantify “trace” as a number so that you can use a numeric data type to record precipitation amounts, or you need to use a string so that you can record the word “trace.” Or you could come up with some more complicated arrangement, using a number column and a string column where you fill in one column and leave the other one NULL. It should be obvious that you want to avoid that option, if possible; it makes the table harder to understand and it makes query-writing much more difficult.
I would probably try to store all rows in numeric form, and then convert them as necessary for display purposes. For example, if any non-zero amount of precipitation less than .01 inches is considered a trace amount, you could display values from the column like this:
SELECT IF(precip>0 AND precip<.01,'trace',precip) FROM ... ;
Some values are obviously numeric but you must determine whether to use an integer or non-integer type. You should ask what your units are and what accuracy you require. Is whole-unit accuracy sufficient or do you need to represent fractional units? This may help you distinguish between integer and fixed-point or floating-point data types. For example, if you’re recording weights to the nearest pound, you can use an integer column. If you want to record fractional units, you’d use a fixed-point or floating-point column. In some cases, you might even use multiple columns—for example, to record weight in terms of pounds and ounces.
Height is a numeric type of information for which there are several representational possibilities:
- Use a string such as '6-2' for a value like “6 feet, 2 inches.” This has the advantage of having a form that’s easy to look at and understand (certainly more so than “74 inches”), but it’s difficult to use this kind of value for mathematical operations such as summation or averaging.
- Use one numeric column for feet and another for inches. This would be a little easier to work with for numerical operations, but two columns are more difficult to use than one.
- Use one numeric column representing inches. This is easiest for a database to work with, and least meaningful for humans. But remember that you don’t have to present values in the same format that you use to work with them. You can reformat values for meaningful display using MySQL’s many functions. That means this might be the best way to represent height.
Another type of numeric information is currency, such as U.S. dollars. For monetary calculations, you’re working with values that have dollars and cents parts. These look like floating-point values, but FLOAT and DOUBLE are subject to rounding error and may not be suitable except for rows in which you need only approximate accuracy. Because people tend to be touchy about their money, it’s more likely you need a type that affords perfect accuracy. You have a couple of choices:
- You can represent money as a DECIMAL(M,2) type, choosing M as the maximum width appropriate for the range of values you need. This gives you values with two decimal places of accuracy. The advantage of DECIMAL is that values are not subject to roundoff error and calculations are exact.
- You can represent all monetary values internally as cents using an integer type. The advantage is that calculations are done internally using integers, which is very fast. The disadvantage is that you will need to convert values on input or output by multiplying or dividing by 100.
Some kinds of “numbers” aren’t. Telephone numbers, credit card numbers, and Social Security numbers all can be written using non-digit characters such as spaces or dashes and cannot be stored directly in a numeric column unless you strip the non-digits. But even with non-digits stripped, you may want to store values as strings rather than as numbers to avoid loss of leading zeros.
If you need to store date information, do the values include a time? That is, will they ever need to include a time? MySQL doesn’t provide a date type that has an optional time part: DATE never has a time, and DATETIME must have a time. If the time really is optional, use a DATE column to record the date, and a separate TIME column to record the time. Then allow the TIME column to be NULL and interpret that as “no time”:
CREATE TABLE mytbl ( date DATE NOT NULL, # date is required time TIME NULL # time is optional (may be NULL) );
One type of situation in which it’s especially important to determine whether you need a time value occurs when you’re joining two tables with a master-detail relationship that are “linked” based on date information. Suppose that you’re conducting research involving test subjects. Following a standard initial battery of tests, you might run several additional tests, with the choice of tests varying according to the results of the initial tests. You can represent this information using a master-detail relationship, in which the subject identification information and the standard initial tests are stored in a master row and any additional tests are stored as rows in a secondary detail table. Then you link together the two tables based on subject ID and the date on which the tests are given.
The question you must answer in this situation is whether you can use just the date or whether you need both date and time. This depends on whether a subject might go through the testing procedure more than once during the same day. If so, record the time (for example, the time that the procedure begins), using either a DATETIME column or separate DATE and TIME columns that both must be filled in. Without the time value, you will not be able to associate a subject’s detail rows with the proper master rows if the subject is tested twice in a day.
I’ve heard people claim “I don’t need a time; I will never test a subject twice on the same day.” Sometimes they’re correct, but I have also seen some of these same people turn up later wondering how to prevent detail rows from being mixed up with the wrong master row after entering data for subjects who were tested multiple times in a day. Sorry, by then it’s too late!
Sometimes you can deal with this problem by retrofitting a TIME column into the tables. Unfortunately, it’s difficult to fix existing rows unless you have some independent data source, such as the original paper rows. Otherwise, you have no way to disambiguate detail rows to associate them with the proper master row. Even if you have an independent source of information, this is very messy and likely to cause problems for applications that you’ve already written to use the tables. It’s best to explain the issues to the table owners and make sure that you’ve gotten a good characterization of the data values before creating their tables.
Sometimes you have incomplete data, and this will influence your choice of data types. You may be collecting birth and death dates for genealogical research, and sometimes all you can find out is the year or year and month someone was born or died, but not the exact date. If you use a DATE column, you can’t enter a date unless you have the full date. If you want to be able to record whatever information you have, even if it’s incomplete, you may have to keep separate year, month, and day columns. Then you can enter such parts of the date as you have and leave the rest NULL. Another possibility is to use DATE values in which the day or month and day parts are set to 0. Such “fuzzy” dates can be used to represent incomplete date values.
3.6.2 Do Your Values Lie Within Some Particular Range?
If you’ve decided on the general category from which to pick a data type for a column, thinking about the range of values you want to represent will help you narrow down your choices to a particular type within that category. Suppose that you want to store integer values. The range of your values determines the types you can use. If you need values in the range from 0 to 1000, you can use anything from a SMALLINT up to a BIGINT. If your values range up to 2 million, you can’t use SMALLINT, so your choices range from MEDIUMINT to BIGINT.
You could, of course, simply use the largest type for the kind of value you want to store (BIGINT for the examples in the previous paragraph). Generally, however, you should use the smallest type that is large enough for your purposes. By doing so, you’ll minimize the amount of storage used by your tables, and they will give you better performance because smaller columns usually can be processed more quickly than larger ones. (Reading smaller values requires less disk activity, and more key values fit into the key cache, allowing indexed searches to be performed faster.)
If you don’t know the range of values you’ll need to be able to represent, you either must guess or use BIGINT to accommodate the worst possible case. If you guess and the type you choose turns out later to be too small, all is not lost. Use ALTER TABLE later to make the column bigger.
Sometimes you even find out that you can make a column smaller. In Chapter 1, we created a score table for the grade-keeping project that had a score column for recording quiz and test scores. The column was created using INT in order to keep the discussion simpler, but you can see now that if scores are in the range from 0 to 100, a better choice would be TINYINT UNSIGNED, because that would use less storage.
The range of values in your data also affects the attributes you can use with your data type. If values never are negative, you can use UNSIGNED; otherwise, you can’t.
String types don’t have a “range” in the same way numeric columns do, but they have a length, and the maximum length you need affects the column types you can use. If you’re storing character strings that are shorter than 256 characters, you can use CHAR, VARCHAR, or TINYTEXT. If you want longer strings, you can use VARCHAR or a longer TEXT type.
For a string column used to represent a fixed set of values, you might consider using an ENUM or SET data type. These can be good choices because they are represented internally as numbers. Operations on them are performed numerically, which makes them more efficient than other string types. They also can be more compact than other string types, which saves space. In addition, you can prevent entry of values not present in the list of legal values by enabling strict SQL mode. See Section 3.3, “How MySQL Handles Invalid Data Values.”
When characterizing the range of values you have to deal with, the best terms are “always” and “never” (as in “always less than 1000” or “never negative”), because they enable you to constrain your data type choices more tightly. But be wary of using these terms when they’re not really justified. Be especially wary if you’re consulting with other people about their data and they start throwing around those two terms. When people say “always” or “never,” be sure they really mean it. Sometimes people say their data always have a particular characteristic when they really mean “almost always.”
Suppose that you’re designing a table for a group of investigators who tell you, “Our test scores are always 0 to 100.” Based on that statement, you choose TINYINT and you make it UNSIGNED because the values are always non-negative. Then you find out that the people who code the data for entry into the database sometimes use −1 to mean “student was absent due to illness.” Oops. They didn’t tell you that. It might be acceptable to use NULL to represent such values, but if not, you’ll have to record a −1, and then you can’t use an UNSIGNED column. (This is an instance where ALTER TABLE comes to your rescue.)
Sometimes decisions about these cases can be made more easily by asking a simple question: Are there ever exceptions? If an exceptional case ever occurs, even just once, you must allow for it. You will find that people who talk to you about designing a database invariably think that if exceptions don’t occur very often, they don’t matter. When you’re creating a table, you can’t think that way. The question you need to ask isn’t “how often do exceptions occur?” It’s “do exceptions ever occur?” If they do, you must take them into account.
3.6.3 Inter-Relatedness of Data Type Choice Issues
You can’t always consider the issues involved in choosing data types as though they are independent of one another. For example, range is related to storage size for numeric types: As you increase the range, you require more storage, which affects performance. Or consider the implications of using AUTO_INCREMENT to create a column for holding unique sequence numbers. That single choice has several consequences involving the data type, indexing, and the use of NULL:
- AUTO_INCREMENT is a column attribute that is best used with integer types. That immediately limits your choices to TINYINT through BIGINT.
- An AUTO_INCREMENT column is intended only for generating sequences of positive values, so you should define it as UNSIGNED.
- AUTO_INCREMENT columns must be indexed. Furthermore, to prevent duplicates in the column, the index should be unique, so you should define the column as a PRIMARY KEY or as a UNIQUE index.
- AUTO_INCREMENT columns must be NOT NULL. (If you omit NOT NULL, MySQL adds it automatically.)
All of this means you do not just define an AUTO_INCREMENT column like this:
mycol arbitrary_type AUTO_INCREMENT
You define it like this:
mycol integer_type UNSIGNED NOT NULL AUTO_INCREMENT, PRIMARY KEY (mycol)
Or like this:
mycol integer_type UNSIGNED NOT NULL AUTO_INCREMENT, UNIQUE (mycol)