Home > Articles > Data > MySQL

This chapter is from the book

This chapter is from the book

Choosing Data Types

The section "MySQL Data Types" describes the various data types from which you can choose and the general properties of those types, such as the kind of values they may contain, how much storage space they take, and so on. But how do you actually decide which types to use when you create a table? This section discusses issues to consider that will help you choose.

The most "generic" data types are the string types. You can store anything in them because numbers and dates can be represented in string form. So should you just define all your columns as strings and be done with it? No. Let's consider a simple example. Suppose that you have values that look like numbers. You could represent these as strings, but should you? What happens if you do?

For one thing, you'll probably use more space, because numbers can be stored more efficiently using numeric columns than string columns. You'll also notice some differences in query results due to the different ways that numbers and strings are handled. For example, the sort order for numbers is not the same as for strings. The number 2 is less than the number 11, but the string '2' is lexically greater than the string '11'. You can work around this by using the column in a numeric context like this:

SELECT col_name + 0 as num ... ORDER BY num;

Adding zero to the column forces a numeric sort, but is that a reasonable thing to do? It's a useful technique sometimes, but you don't want to have to use it every time you want a numeric sort. Causing MySQL to treat a string column as a number has a couple of significant implications. It forces a string-to-number conversion for each column value, which is inefficient. Also, using the column in a calculation prevents MySQL from using any index on the column, which slows down the query further. Neither of these performance degradations occur if you store the values as numbers in the first place.

The preceding example illustrates that several issues come into play when you choose data types. The simple choice of using one representation rather than another has implications for storage requirements, query handling, and processing performance. The following list gives a quick rundown of factors to think about when picking a type for a column.

  • What kind of values will the column hold? Numbers? Strings? Dates? Spatial values? This is an obvious question, but you must ask it. You can represent any type of value as a string, but as we've just seen, it's likely that you'll get better performance if you use other more appropriate types for numeric values. (This is also true for temporal and spatial values.) However, assessing the kind of values you're working with isn't necessarily trivial, particularly for other people's data. It's especially important to ask what kind of values the column will hold if you're setting up a table for someone else, and you must be sure to ask enough questions to get sufficient information for making a good decision.

  • Do your values lie within some particular range? If they are integers, will they always be non-negative? If so, you can use UNSIGNED. If they are strings, will they always be chosen from among a fixed, limited set of values? If so, you may find ENUM or SET a useful type.

  • There is a tradeoff between the range of a type and the amount of storage it uses. How "big" a type do you need? For numbers, you can choose small types with a limited range of values, or large types with a much larger range. For strings, you can make them short or long, so you wouldn't choose CHAR(255) if all the values you want to store contain fewer than 10 characters.

  • What are the performance and efficiency issues? Some types can be processed more efficiently than others. Numeric operations generally can be performed more quickly than string operations. Short strings can be compared more quickly than long strings, and also involve less disk overhead. For MyISAM tables, performance is worse for variable-length rows than for fixed-length rows, but fixed-length rows are used only if all columns have fixed-length data types.

Now let's consider each of these issues in more detail. But before we do, allow me to point something out: You want to make the best data type choices you can when you create a table, but it's not the end of the world if you make a choice that turns out to be non-optimal. You can use ALTER TABLE to change the type to a better one. This might be as simple as changing a SMALLINT to MEDIUMINT after finding out your dataset contains values larger than you originally thought. Or it can be more complex, such as changing a CHAR to an ENUM with a specific set of allowed values. You can use PROCEDURE ANALYSE() to obtain information about your table's columns, such as the minimum and maximum values as well as a suggested optimal type to cover the range of values in a column:

SELECT * FROM tbl_name PROCEDURE ANALYSE();

The output from this query may help you determine that a smaller type can be used, which can improve the performance of queries that involve the table and reduce the amount of space required for table storage.

What Kind of Values Will the Column Hold?

The first thing you think of when you're trying to decide on a data type is the kind of values the column will be used for because this has the most evident implications for the type you choose. In general, you do the obvious thing: You store numbers in numeric columns, strings in string columns, and dates and times in temporal columns. If your numbers have a fractional part, you use a floating-point type rather than an integer type. But sometimes there are exceptions. The principle here is that you need to understand the nature of your data to be able to choose the type in an informed manner. If you're going to store your own data, you probably have a good idea of how to characterize it. On the other hand, if others ask you to set up a table for them, it's sometimes a different story. It may not be so easy to know just what you're working with. Be sure to ask enough questions to find out what kind of values the table really should contain.

Suppose that you're told that a table needs a column to record "amount of precipitation." Is that a number? Or is it "mostly" numeric—that is, typically but not always coded as a number? For example, when you watch the news on television, the weather report generally includes a measure of precipitation. Sometimes this is a number (as in "0.25 inches of rain"), but sometimes it's a "trace" of precipitation, meaning "not much at all." That's fine for the weather report, but what does it mean for storage in a database? You either need to quantify "trace" as a number so that you can use a numeric data type to record precipitation amounts, or you need to use a string so that you can record the word "trace." Or you could come up with some more complicated arrangement, using a number column and a string column where you fill in one column and leave the other one NULL. It should be obvious that you want to avoid that option, if possible; it makes the table harder to understand and it makes query-writing much more difficult.

I would probably try to store all rows in numeric form, and then convert them as necessary for display purposes. For example, if any non-zero amount of precipitation less than .01 inches is considered a trace amount, you could display values from the column like this:

SELECT IF(precip>0 AND precip<.01,'trace',precip) FROM ... ;

Some values are obviously numeric but you must determine whether to use a floating-point or integer type. You should ask what your units are and what accuracy you require. Is whole-unit accuracy sufficient or do you need to represent fractional units? This may help you distinguish between integer and floating-point column types. For example, if you're representing weights, you can use an integer column if you record values to the nearest pound. You'd use a floating-point column if you want to record fractional units. In some cases, you might even use multiple columns—for example, if you want to record weight in terms of pounds and ounces.

Height is a numeric type of information for which there are several representational possibilities:

  • Use a string such as '6-2' for a value like "6 feet, 2 inches." This has the advantage of having a form that's easy to look at and understand (certainly more so than "74 inches"), but it's difficult to use this kind of value for mathematical operations such as summation or averaging.

  • Use one numeric column for feet and another for inches. This would be a little easier to work with for numerical operations, but two columns are more difficult to use than one.

  • Use one numeric column representing inches. This is easiest for a database to work with, and least meaningful for humans. But remember that you don't have to present values in the same format that you use to work with them. You can reformat values for meaningful display using MySQL's many functions. That means this might be the best way to represent height.

Another type of numeric information is money. For monetary calculations, you're working with values that have dollars and cents parts. These look like floating-point values, but FLOAT and DOUBLE are subject to rounding error and may not be suitable except for records in which you need only approximate accuracy. Because people tend to be touchy about their money, it's more likely you need a type that affords perfect accuracy. You have a couple of choices:

  • You can represent money as a DECIMAL(M,2) type, choosing M as the maximum width appropriate for the range of values you need. This gives you floating point values with two decimal places of accuracy. The advantage of DECIMAL is that values are represented as strings and are not subject to roundoff error. The disadvantage is that string operations are less efficient than operations on values represented internally as numbers. Also, although storage and retrieval will be exact, calculations on DECIMAL values might still be done using floating-point operations; roundoff error can occur as a result.

  • You can represent all monetary values internally as cents using an integer type. The advantage is that calculations are done internally using integers, which is very fast. The disadvantage is that you will need to convert values on input or output by multiplying or dividing by 100.

Some kinds of "numbers" aren't. Telephone numbers, credit card numbers, and Social Security numbers all contain non-digit characters and cannot be stored directly in a numeric column unless you strip the non-digits.

If you need to store date information, do the values include a time? That is, will they ever need to include a time? MySQL doesn't provide a date type that has an optional time part: DATE never has a time, and DATETIME must have a time. If the time really is optional, use a DATE column to record the date, and a separate TIME column to record the time. Then allow the TIME column to be NULL and interpret that as "no time":

CREATE TABLE mytbl
(
  date DATE NOT NULL,  # date is required
  time TIME NULL       # time is optional (may be NULL)
);

One type of situation in which it's especially important to determine whether you need a time value occurs when you're joining two tables with a master-detail relationship that are "linked" based on date information. Suppose that you're conducting research involving test subjects. Following a standard initial battery of tests, you might run several additional tests, with the choice of tests varying according to the results of the initial tests. You can represent this information using a master-detail relationship, in which the subject identification information and the standard initial tests are stored in a master record and any additional tests are stored as rows in a secondary detail table. Then you link together the two tables based on subject ID and the date on which the tests are given.

The question you must answer in this situation is whether you can use just the date or whether you need both date and time. This depends on whether a subject might go through the testing procedure more than once during the same day. If so, record the time (for example, the time that the procedure begins), using either a DATETIME column or separate DATE and TIME columns that both must be filled in. Without the time value, you will not be able to associate a subject's detail records with the proper master records if the subject is tested twice in a day.

I've heard people claim "I don't need a time; I will never test a subject twice on the same day." Sometimes they're correct, but I have also seen some of these same people turn up later wondering how to prevent detail records from being mixed up with the wrong master record after entering data for subjects who were tested multiple times in a day. Sorry, by then it's too late!

Sometimes you can deal with this problem by retrofitting a TIME column into the tables. Unfortunately, it's difficult to fix existing records unless you have some independent data source, such as the original paper records. Otherwise, you have no way to disambiguate detail records to associate them to the proper master record. Even if you have an independent source of information, this is very messy and likely to cause problems for applications that you've already written to use the tables. It's best to explain the issues to the table owners and make sure that you've gotten a good characterization of the data values before creating their tables.

Sometimes you have incomplete data, and this will influence your choice of data types. You may be collecting birth and death dates for genealogical research, and sometimes all you can find out is the year or year and month someone was born or died, but not the exact date. If you use a DATE column, you can't enter a date unless you have the full date. If you want to be able to record whatever information you have, even if it's incomplete, you may have to keep separate year, month, and day columns. Then you can enter such parts of the date as you have and leave the rest NULL. Another possibility is to use DATE values in which the day or month and day parts are set to 0. Such "fuzzy" dates can be used to represent incomplete date values.

Do Your Values Lie Within Some Particular Range?

If you've decided on the general category from which to pick a data type for a column, thinking about the range of values you want to represent will help you narrow down your choices to a particular type within that category. Suppose that you want to store integer values. The range of your values determines the types you can use. If you need values in the range from 0 to 1000, you can use anything from a SMALLINT up to a BIGINT. If your values range up to 2 million, you can't use SMALLINT, so your choices range from MEDIUMINT to BIGINT.

You could, of course, simply use the largest type for the kind of value you want to store (BIGINT for the examples in the previous paragraph). Generally, however, you should use the smallest type that is large enough for your purposes. By doing so, you'll minimize the amount of storage used by your tables, and they will give you better performance because smaller columns usually can be processed more quickly than larger ones. (Reading smaller values requires less disk activity, and more key values fit into the key cache, allowing indexed searches to be performed faster.)

If you don't know the range of values you'll need to be able to represent, you either must guess or use BIGINT to accommodate the worst possible case. If you guess and the type you choose turns out later to be too small, all is not lost. Use ALTER TABLE later to make the column bigger.

Sometimes you even find out that you can make a column smaller. In Chapter 1, we created a score table for the grade-keeping project that had a score column for recording quiz and test scores. The column was created using INT in order to keep the discussion simpler, but you can see now that if scores are in the range from 0 to 100, a better choice would be TINYINT UNSIGNED, because that would use less storage.

The range of values in your data also affects the attributes you can use with your data type. If values never are negative, you can use UNSIGNED; otherwise, you can't.

String types don't have a "range" in the same way numeric columns do, but they have a length, and the maximum length you need affects the column types you can use. If you're storing character strings that are shorter than 256 characters, you can use CHAR, VARCHAR, or TINYTEXT. If you want longer strings, you can use a longer TEXT type, or VARCHAR as of MySQL 5.0.3.

For a string column used to represent a fixed set of values, you might consider using an ENUM or SET data type. These can be good choices because they are represented internally as numbers. Operations on them are performed numerically, which makes them more efficient than other string types. They also can be more compact than other string types, which saves space. In addition, for MySQL 5.0.2 and up, you can prevent entry of values not present in the list of legal values by enabling strict SQL mode. See "How MySQL Handles Invalid Data Values."

When characterizing the range of values you have to deal with, the best terms are "always" and "never" (as in "always less than 1000" or "never negative"), because they allow you to constrain your data type choices more tightly. But be wary of using these terms when they're not really justified. Be especially wary if you're consulting with other people about their data and they start throwing around those two terms. When people say "always" or "never," be sure they really mean it. Sometimes people say their data always have a particular characteristic when they really mean "almost always."

Suppose that you're designing a table for a group of investigators who tell you, "Our test scores are always 0 to 100." Based on that statement, you choose TINYINT and you make it UNSIGNED because the values are always non-negative. Then you find out that the people who code the data for entry into the database sometimes use –1 to mean "student was absent due to illness." Oops. They didn't tell you that. It might be acceptable to use NULL to represent such values, but if not, you'll have to record a –1, and then you can't use an UNSIGNED column. (This is an instance where ALTER TABLE comes to your rescue.)

Sometimes decisions about these cases can be made more easily by asking a simple question: Are there ever exceptions? If an exceptional case ever occurs, even just once, you must allow for it. You will find that people who talk to you about designing a database invariably think that if exceptions don't occur very often, they don't matter. When you're creating a table, you can't think that way. The question you need to ask isn't "how often do exceptions occur?" It's "do exceptions ever occur?" If they do, you must take them into account.

What Are the Performance and Efficiency Issues?

Your choice of data type can influence query performance in several ways. If you keep the general guidelines discussed in the following sections in mind, you'll be able to choose types that will help MySQL process your tables more efficiently.

Numeric Versus String Operations

Numeric operations are generally faster than string operations. Consider comparison operations. Numbers can be compared in a single operation. String comparisons may involve several byte-by-byte or character-by-character comparisons, more so as the strings become longer.

If a string column has a limited number of values, you can use an ENUM or SET type to get the advantages of numeric operations. These types are represented internally as numbers and can be processed more efficiently.

Consider alternative representations for strings. Sometimes you can improve performance by representing string values as numbers. For example, to represent IP numbers in dotted-quad notation, such as 192.168.0.4, you might use a string. As an alternative, you could convert the IP numbers to integer form by storing each part of the dotted-quad form in one byte of a four-byte INT UNSIGNED type. Storing integers would both save space and speed lookups. On the other hand, representing IP numbers as INT values might make it difficult to perform pattern matches such as you might do if you wanted to look for numbers in a given subnet. Perhaps you can do the same thing by using bitmask operations. These kinds of issues illustrate that you cannot consider only space issues; you must decide which representation is most appropriate based on what you want to do with the values. (Whatever choice you make, the INET_ATON() and INET_NTOA() functions can help convert between the two representations.)

Smaller Types Versus Larger Types

Smaller types can be processed more quickly than larger types. A general principle is that they take less space and involve less overhead for disk activity. For strings in particular, processing time is in direct relationship to string length.

For columns that use fixed-size data types, choose the smallest type that will hold the required range of values. For example, don't use BIGINT if MEDIUMINT will do. Don't use DOUBLE if you need only FLOAT precision. For variable-size types, you may still be able to save space. A BLOB uses 2 bytes to record the length of the value, a LONGBLOB uses 4 bytes. If you're storing values that are never as long as 64KB, using BLOB saves you 2 bytes per value. (Similar considerations apply for TEXT types.)

Fixed-Length Versus Variable-Length Types

Fixed-length and variable-length types have different performance implications, although the particular effects of each varies per storage engine. For example, MyISAM tables that have fixed-length rows generally can be processed more quickly than tables with variable-length rows. For further discussion of this issue, see "Data Type Choices and Query Efficiency," in Chapter 4.

Prohibiting or Allowing NULL Values

If you define a column to be NOT NULL, it can be handled more quickly because MySQL doesn't have to check the column's values during query processing to see whether they are NULL. It also saves one bit per row in the table. Avoiding NULL in columns may make your queries simpler because you don't have to think about NULL as a special case, and simpler queries generally can be processed more quickly.

Inter-Relatedness of Data Type Choice Issues

You can't always consider the issues involved in choosing data types as though they are independent of one another. For example, range is related to storage size for numeric types: As you increase the range, you require more storage, which affects performance. Or consider the implications of using AUTO_INCREMENT to create a column for holding unique sequence numbers. That single choice has several consequences involving the data type, indexing, and the use of NULL:

  • AUTO_INCREMENT is a column attribute that should be used only with integer types. That immediately limits your choices to TINYINT through BIGINT.

  • An AUTO_INCREMENT column is intended only for generating sequences of positive values, so you should define it as UNSIGNED.

  • AUTO_INCREMENT columns must be indexed. Furthermore, to prevent duplicates in the column, the index must be unique. This means you must define the column as a PRIMARY KEY or as a UNIQUE index.

  • AUTO_INCREMENT columns must be NOT NULL.

All of this means you do not just define an AUTO_INCREMENT column like this:

mycol arbitrary_type AUTO_INCREMENT

You define it like this:

mycol integer_type UNSIGNED NOT NULL AUTO_INCREMENT,
PRIMARY KEY (mycol)

Or like this:

mycol integer_type UNSIGNED NOT NULL AUTO_INCREMENT,
UNIQUE (mycol)

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Overview


Pearson Education, Inc., 221 River Street, Hoboken, New Jersey 07030, (Pearson) presents this site to provide information about products and services that can be purchased through this site.

This privacy notice provides an overview of our commitment to privacy and describes how we collect, protect, use and share personal information collected through this site. Please note that other Pearson websites and online products and services have their own separate privacy policies.

Collection and Use of Information


To conduct business and deliver products and services, Pearson collects and uses personal information in several ways in connection with this site, including:

Questions and Inquiries

For inquiries and questions, we collect the inquiry or question, together with name, contact details (email address, phone number and mailing address) and any other additional information voluntarily submitted to us through a Contact Us form or an email. We use this information to address the inquiry and respond to the question.

Online Store

For orders and purchases placed through our online store on this site, we collect order details, name, institution name and address (if applicable), email address, phone number, shipping and billing addresses, credit/debit card information, shipping options and any instructions. We use this information to complete transactions, fulfill orders, communicate with individuals placing orders or visiting the online store, and for related purposes.

Surveys

Pearson may offer opportunities to provide feedback or participate in surveys, including surveys evaluating Pearson products, services or sites. Participation is voluntary. Pearson collects information requested in the survey questions and uses the information to evaluate, support, maintain and improve products, services or sites, develop new products and services, conduct educational research and for other purposes specified in the survey.

Contests and Drawings

Occasionally, we may sponsor a contest or drawing. Participation is optional. Pearson collects name, contact information and other information specified on the entry form for the contest or drawing to conduct the contest or drawing. Pearson may collect additional personal information from the winners of a contest or drawing in order to award the prize and for tax reporting purposes, as required by law.

Newsletters

If you have elected to receive email newsletters or promotional mailings and special offers but want to unsubscribe, simply email information@informit.com.

Service Announcements

On rare occasions it is necessary to send out a strictly service related announcement. For instance, if our service is temporarily suspended for maintenance we might send users an email. Generally, users may not opt-out of these communications, though they can deactivate their account information. However, these communications are not promotional in nature.

Customer Service

We communicate with users on a regular basis to provide requested services and in regard to issues relating to their account we reply via email or phone in accordance with the users' wishes when a user submits their information through our Contact Us form.

Other Collection and Use of Information


Application and System Logs

Pearson automatically collects log data to help ensure the delivery, availability and security of this site. Log data may include technical information about how a user or visitor connected to this site, such as browser type, type of computer/device, operating system, internet service provider and IP address. We use this information for support purposes and to monitor the health of the site, identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents and appropriately scale computing resources.

Web Analytics

Pearson may use third party web trend analytical services, including Google Analytics, to collect visitor information, such as IP addresses, browser types, referring pages, pages visited and time spent on a particular site. While these analytical services collect and report information on an anonymous basis, they may use cookies to gather web trend information. The information gathered may enable Pearson (but not the third party web trend services) to link information with application and system log data. Pearson uses this information for system administration and to identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents, appropriately scale computing resources and otherwise support and deliver this site and its services.

Cookies and Related Technologies

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

Do Not Track

This site currently does not respond to Do Not Track signals.

Security


Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

Children


This site is not directed to children under the age of 13.

Marketing


Pearson may send or direct marketing communications to users, provided that

  • Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
  • Such marketing is consistent with applicable law and Pearson's legal obligations.
  • Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
  • Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

Correcting/Updating Personal Information


If a user's personally identifiable information changes (such as your postal address or email address), we provide a way to correct or update that user's personal data provided to us. This can be done on the Account page. If a user no longer desires our service and desires to delete his or her account, please contact us at customer-service@informit.com and we will process the deletion of a user's account.

Choice/Opt-out


Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s) simply visit the following page and uncheck any communication you no longer want to receive: www.informit.com/u.aspx.

Sale of Personal Information


Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

Supplemental Privacy Statement for California Residents


California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

Sharing and Disclosure


Pearson may disclose personal information, as follows:

  • As required by law.
  • With the consent of the individual (or their parent, if the individual is a minor)
  • In response to a subpoena, court order or legal process, to the extent permitted or required by law
  • To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
  • In connection the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
  • To investigate or address actual or suspected fraud or other illegal activities
  • To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
  • To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
  • To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

Links


This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.

Requests and Contact


Please contact us about this Privacy Notice or if you have any requests or questions relating to the privacy of your personal information.

Changes to this Privacy Notice


We may revise this Privacy Notice through an updated posting. We will identify the effective date of the revision in the posting. Often, updates are made to provide greater clarity or to comply with changes in regulatory requirements. If the updates involve material changes to the collection, protection, use or disclosure of Personal Information, Pearson will provide notice of the change through a conspicuous notice on this site or other appropriate way. Continued use of the site after the effective date of a posted revision evidences acceptance. Please contact us if you have questions or concerns about the Privacy Notice or any objection to any revisions.

Last Update: November 17, 2020