Understanding Relational Database Normalization
In a nutshell, building a "good" optimized, relational database is mostly about normalization. Once you understand the basic principles of normalization, SQL Server should be able to manage your data more efficiently, the applications you write should be able to return data more efficiently, and you'll find it a lot easier to protect your database's data and relational integrity.
So, what is "normalization" and how does it help performance and all that other good stuff? Well, normalization is simply the set of relational database techniques developed to efficiently organize the information you want to manage in a relational database. The academics talk about (at least) five "normal forms," but most database designers stick to the first three forms and seldom go further. The benefit of implementing further levels is usually not that great when compared to the costs—especially in smaller databases.
Here are basic tenants of the first three normal forms.
Don't define duplicate columns in the same table. Each column in a table should contain "different" information. This does not mean avoiding use of identical column names (that's prohibited by the SQL engine), but it does mean that any two columns should not store basically the same information. For example, don't create a table with two or more addresses for a customer (such as a home and business address), as shown in Figure 3.1. The solution to this problem is best implemented by the second normal form.
Figure 3.1 Unnormalized Customers table.
All attributes (columns) in a table that are not dependent on the primary key must be eliminated. This means you need to create a separate table for each logical group of data and identify each row with a unique set of columns (its own primary key). In this case, create a separate Addresses table and connect the two tables together with their primary keys, as shown in Figure 3.2.
Figure 3.2 Normalized Customers and Addresses tables.
- Third normal: Tables cannot include duplicate information. For example, if two tables require a common field, a separate table should be created to manage that column. Our basic design already conforms to the third normal form. However, as an example, take a look at the Biblio database. In this case, I created a TitleAuthor table that contains fields common to the Titles and Authors table.
No, this is not an in-depth discussion of normalization or relational theory, but it's enough to get you started. It's also important to know that many database developers bend these rules from time to time to get more efficiency out of their databases. Sometimes, they add a bit more detail for an entity in a parent table, so it's not always necessary to JOIN to another table just to get one or two bits of information. Yes, these changes mean that the data must be kept current in two different tables, and if someone else comes along and does not realize what's going on....
Understand as well that stored procedures or object-based approaches can (and do) help resolve these issues. By blocking direct access to base tables, developers can write server-side code to dereference the data in the base tables and get away with some tactics that would cause quite a bit of trouble if direct table access were permitted.
Once you have decided what tables you need, you need to use one of the SQL Server or Visual Studio tools to create them (as I illustrate in Chapter 4, "Getting Started with Visual Studio"). But before doing that, I often draw these tables on a whiteboard, which makes it easier to "see" how the data is to be stored and how the tables are related. In Visual Studio, you can use the database diagramming tool to help at this phase, and the ink does not stain your fingers as much.