Normalization
After your data is modeled into tables, you need to ensure that the tables are truly relational. Far from being just good practice, making your data relational is essential for efficient processing of transactions, good use of storage, and ease of data management. Making a data structure relational is achieved by the process of normalization.
Normalization is probably the most important tool for database designers. It is central to relational database theory and is the key process that translates data models from flat-files to efficient, relational formats.
There are nine rules for normalization, which arise from numerous theories of data modeling. However, in this lesson we'll keep things simple and explain the first three normal forms. These are really all you need for the majority of modeling exercises.
First Normal Form
First Normal Form, put in simple language, states that any columns that occur multiple times in a table should be removed and placed in a different table.
For example, imagine that you have a (non-normalized) purchases table like that shown in Figure 3.5.
Figure 3.5 Applying normalization to First Normal Form.
As you can see, there are multiple columns for both product name and product quantity, both of which are repeated three times. This is not only wasteful of storage space for small orders (where only one product is ordered) but also allows only three different products to be ordered at a time.
First Normal Form means removing these multiple columns (product name and quantity) from the table (purchases) to another table (in this case products_ordered). After this is done, the tables will be normalized to First Normal Form. However, note that a key will be needed on the new table (products_ordered) to relate each product back to the relevant purchase in the first table.
Second Normal Form
The rule of Second Normal Form states that, in addition to the design complying with First Normal Form, all data not relying on a table key to uniquely identify it should be removed and placed in a different table.
Imagine that you have a table describing products available in the store. Each record includes information about the individual product (its name, price, and so on) and also a category description (such as "sweater," "shoes," and so on).
Because several products will be in any given category, the category name would have to be repeated several times within the table.
Not only would this representation be wasteful of space, because descriptions of the categories are being included repeatedly, but also it's difficult to change something. For example, if you initially had a category, "sweaters," and then introduced a new category, "polar neck sweaters," you would risk creating ambiguities in the table if you updated some of the descriptions. You would not be managing your data efficiently and reliably.
Figure 3.6 shows a sample products table, in which the category_name column is removed according to the Second Rule, and this category information is moved off to another table.
Figure 3.6 Applying normalization to Second Normal Form.
Going to Second Normal Form means that the category data would be removed and exist in a categories table its own right. To keep the relationship, a non-unique key would need to be added to both the categories and products tables to relate products to their categories (refer to Figure 3.3).
Third Normal Form
Third Normal Form is all about eliminating dependencies within a table. It states that, in addition to the design complying with Second Normal Form, all non-key data that is dependent on other non-key data in the same table should be placed in a different table.
Imagine that the table customers includes every customer's address, the name of the country in which the customer lives, and the ISO abbreviation for that country, as shown in Figure 3.7. The ISO country code and the country name are dependent because a given country will always have the same ISO code.
Figure 3.7 Applying normalization to Third Normal Form.
To comply with Third Normal Form, you must remove the country name from the customers table and place it in a new table, countries. So countries will have both iso_country_code (a key to relate the country back to the customers table) and country_name.
customers can keep the iso_country_code column because this is key data (and therefore smaller). By removing country_name, the non-key, dependent data is removed.
Other Considerations When Designing Tables
It is possible to perform several other forms of normalization, but it's beyond the scope of all but the most specialized database design texts to cover these.
In some circumstances, there may also be reasons for not normalizing every table to Third Normal Form, or not even normalizing at all. For simple databases, it may not be worth the increased complexity of a multitable architecture, with keys having to be added as part of the normalization process. Data in normalized tables is normally more difficult to manage because applications have to be more complex to display data and to insert, update, and delete records. This is because such operations normally require SQL which references several tables rather than just one.
Another reason for not normalizing is that when actual database performance is considered, doing cross-referencing between tables (performing joins) may slow down the live system.
These considerations are discussed more in Day 9, and techniques for improving performance are looked at in Day 18, "Optimizing Performance."
Apart from normalizing your database, you need to specify data types for the columns in your table: you need to progress your table specifications from the concepts of text or number, to the concepts of fixed- or variable-length text columns, integer or floating-point number columns, and so on. The column types implemented in MySQL are considered in Day 5.