Access Database Design and Normalization
In This Chapter:
- The Relational Design Theory
Access applications are database applications, an obvious statement that can get lost in the details of designing a dialog box or crafting the perfect form. To develop smarter, faster, more efficient database applications, you need to understand the concept of database normalization, this chapter's topic. Relational database management systems, or RDBMSes, such as Microsoft Access, implement a theory of data management and manipulation called, strangely enough, relational theory, pioneered by Dr. E.F. Codd at IBM during the 1970s. This chapter introduces you to just enough relational theory to enable you to design Access databases that take advantage of the way relational database operations work.
Critics of Microsoft Access argue it does not conform to all of Dr. Codd's 13 rules defining a relational database and so cannot be called relational. Predictably, Access' supporters argue it is fully conformant. From a practical standpoint, it behaves like relational database. You may judge for yourself whether it is truly conformant.
The following table lists the 13 rules Codd declared to which a database system much conform in order to be considered a true relational database management system.
The Foundation Principle
Any RDBMS must be able to manage databases entirely through its relational capabilities. If a database system depends on a record-by-record data manipulation tools, it is not truly relational.
All data in a relational database is represented explicitly as values in tables. Data cannot be stored in any other way.
Every data element must be accessible logically through the use of a combination of its primary key value, table name, and column name.
Null values are supported explicitly. Nulls represent missing or inapplicable information.
The database description or "catalog" exists at the logical level as tabular values. The relational language (SQL) must be able to set the database design in the same manner in which it acts on data stored in the structure.
An RDBMS must support a clearly defined data-manipulation language (SQL) that comprehensively supports data manipulation and definition, view definition, integrity constraints, transactional boundaries, and authorization.
All views that can be updated by the system. In a true RDBMS, most (although not all) views would be updatable.
Set Level Updates
An RDBMS must do more than just be able to retrieve data sets. It has to be capable of inserting, updating, and deleting data as a relational set.
Physical Data Independence
Data must be physically independent of the application program. The underlying RDBMS program or "optimizer" should be able to track physical changes in the data. For example, an RDBMS's application programs should not have to change when an index is added to or deleted from a table.
Logical Data Independence
Whenever possible, application software must be independent of changes made to the base tables. For example, no code should be rewritten when tables are combined into a view.
Data integrity must be definable in a relational language and stored in the catalog. Data integrity constraints can be built into applications. However, this approach is foreign to the relational model. In the relational model, the integrity should be inherent in the database design.
RDBMS capabilities will not be limited due to the distribution of its components in separate databases.
If an RDBMS has a single-record-at-a-time language, that language cannot be used to bypass the integrity rules or constraints of the relational language. Thus, not only must an RDBMS be governed by relational rules, but these rules must be primary laws.
The Relational Design Theory
The relational design theory developed by Dr. Codd consists of the following categories:
Tables and uniqueness
Foreign keys and domains
The Benefits of Using the Relational Model
Using the relational design theory, you gain the benefits of years of research into the best way to manage data. Some of the benefits you can achieve by following the relational model are
Ensuring data integrity.
Storing data storage efficiently.
Giving your database application tremendous room for growth.
Creating a database that behaves predictably because it conforms to these well-tested rules.
Enabling other database designers to understand your database because it follows the rules.
Ensuring that database schema changes are easy to implement
Improving the speed of data access. I know this ties in with efficiency; however, the mention of the word speed is important.
Tables and Uniqueness
When you create database applications, each table represents an entity or process in the real world. Tables can represent people, events, financial transactions, and physical items such as products. Relational theory requires that all data is stored in a table consisting of unique rows and columns. The way to guarantee uniqueness for each rule is to set a primary key for each row.
A primary key is a field or group of fields that uniquely identifies that row. If the primary key consists of multiple fields, it is called a composite key. The primary key must be uniquethat is, it must appear in one and only one row. Access allows you to designate a field as a primary key by setting it as a Key value in table design view. Access will then check to see if the data in that field is unique and does not allow duplicates.
Sometimes it can be hard to come up with a unique value for each row while following the business rules in your application. For example, suppose you are creating a contact management system to track your contacts. Your initial database design might resemble the table shown in Figure 3.1.
Figure 3.1 A table storing contact information.
As you develop your contact management system, you might select ContactName, ContactEmail, or one of the ContactPhone fields as the key. Possible key fields are commonly referred to as candidate keys because, during the initial database design period, they might be used as primary keys, but the final decision has not been made.
A common rule of database design is to keep the key as simple as possible and to choose the key from data that is the most unique and least likely to change. A person's name might change due to marriage, divorce, and so on. Phone numbers and e-mail addresses change all the time. A good key would be a Social Security number, but what if your contact lives outside the United States? In this situation, Access allows you to create an AutoNumber field to use as a primary key. An AutoNumber is a numeric value that Access maintains and automatically increments each time you add a new record to a table.
However, an AutoNumber is not the same as a record number. In record numberbased database products, record numbers are recalculated as records are added or deleted. In Microsoft Access, a record's AutoNumber remains the same as records are added or deleted (except in an unusual circumstance discussed in Chapter 5, "The Microsoft Jet Database Engine 4.0"). AutoNumber values are not reused in the same table after a record is deleted. There are two types of AutoNumber values:
ReplicationID or GUID (used in Replication).
You can manipulate a Jet AutoNumber in a few ways. You can set the AutoNumber increment value, reset the starting seed, and start counting backward. See Chapter 5 for more information.
Foreign Keys and Domains
When one table's primary key is used in another table, it is called a foreign key because it is foreign to the second table. Foreign keys are used to relate records from two (or more) tables, a topic discussed in the section titled Relationships later in the chapter. Figure 3.2 illustrates a table using foreign keys.
Figure 3.2 A table with two foreign keys.
Figure 3.2 shows the Contacts table with two foreign keys, ContactID and PhoneID. ContactID is the primary key of the tblContactName table (not shown) and PhoneID is the primary key of the tblContactPhone table (not shown).
When you use foreign keys, they will always be in the same domain. A domain is the pool of possible values from which a column's value is drawn. For example, in the United States, the domain of ZIP codes consists of all valid ZIP codes. Another example might be a university's student ID numbers. If the student ID system is based on Social Security numbers, the domain consists of all valid Social Security numbers available to the system.
When you create primary keys and foreign keys, you are defining relationships. Relationships refer to how the records in one table connect to the records in another table. Access supports three different types of relationships:
To set relationships in Microsoft Access, press the Relationships button on the toolbar or select Tools, Relationships from the menu to bring up the Relationships window.
Two tables have a one-to-one relationship if, for each row in one table, there is at most one related record in the related table. Figure 3.3 shows an example of a one-to-one relationship.
Figure 3.3 A one-to-one relationship.
Figure 3.3 shows a one-to-one relationship because each record in the table tblContacts relates to, at most, one related record in tblContactEmail. One-to-one relationships are the least common types of relationships because most databases include related information in one table. For security reasons, however, you might choose to break out information into two tables. Financial transactions often involve many one-to-one relationships.
The most common type of relationship is the one-to-many relationship. A one-to-many relationship occurs when a table has zero or many related records (sometimes called child records) in another table. A typical example occurs in a customer order database, because one customer can have many orders.
Similarly, in your contact database, one contact (the "one" side of the relationship) might have several phone numbers (the "many" side). To express the relationship in terms of keys, each contact has a unique ID which is the primary key in one table. This primary key would appear as a foreign key in each related record in a table of telephone numbers. Figure 3.4 illustrates a one-to-many relationship.
Figure 3.4 A one-to-many relationship.
The ContactID field is tblContacts' primary key and a foreign key in tblContactPhone. ContactID 2 appears twice in tblContactPhone, meaning that there are two related records in tblContactPhone. Figure 3.4 includes the contents of the table to make the relationships clearer.
Given two tables, tblAlpha and tblBeta, a many-to-many relationship occurs when one or more rows in tblAlpha relate to zero or more rows in tblBeta, and one or more rows in tblBeta relates to zero or more rows in tblAlpha.
Many-to-many relationships are problematic because it is difficult to select unique records from one table using existing relationships. Suppose you have three contacts (stored in tblAlpha) who use any one of three fax numbers (stored in tblBeta). If you want to send one of those contacts a fax, it is not immediately clear which fax number to use. Conversely, if you receive a fax from one of those three numbers that lists the sender's fax number but not the sender's name, you will not immediately know which of the three sent the fax.
Unfortunately, Access does not have a way to resolve many-to-many relationships between two tables. The only way to do so is to create a so-called "linking" table. A linking table is a table that contains the primary keys of both tables as foreign keys.
Figure 3.5 shows a many-to-many relationship between tblContacts and tblContactType through the linking table tblContactInfo. tblContacts contains all the contacts in your database. tblContactType lists all the types of contact information: Fax, Email, Phone, and Pager. tblContactInfo is the linking table. It contains a reference to a contact (ContactID), a contact type (ContactTypeID), and the contact information itself (ContactInfo), so you can locate either a specific telephone number using a contact's name or a specific contact name using a particular telephone number.
Moreover, if your contact adds a new type of communication (like a Web address) to the list, you simply add "Web Address" to the lookup table (tblContactType), and then add the new linking record to tblContactInfo. This is the most flexible way to build a contact management system because it avoids repeating fields (Phone1, Phone2, and so forth), blank fields, and having to add new fields to a table when you add another communication type to the list. A common naming convention when using a linking table this way is to use the "trel" prefix to identify a linking table. In this case, the tblContactInfo would be named trelContactInfo.
Figure 3.5 A many-to-many relationship using a linking table.
Data normalization describes the process of designing a database (or modifying an existing database design) and organizing data to take best advantage of relational database principles. Codd defined six levels of normalization that he described as normal forms. The first three, known as First, Second, and Third Normal Form, have the most impact on database design decisions. As you read through the next few paragraphs, keep the following somewhat whimsical but useful summation of Codd's first three laws in mind: "The values in a row are dependent on the key, the whole key, and nothing but the key, so help me Codd."
First Normal Form
First Normal Form requires that all columns (fields) in a table contain atomic values. That is, each field should contain one value, not a list of values or any repeating groups of data. Many flat file "databases" store data in this fashion, which makes searching quite difficult.
The term flat file database refers to files, almost always text files but sometimes binary, containing some type of data. They are called "flat" because they lack any structure or hierarchical arrangement suggesting the relationship between one record and the next. Flat files are often database dumps.
Figure 3.6 shows an example of a table that is not in First Normal Form.
Figure 3.6 A table not in First Normal Form.
The ContactInfo column violates First Normal Form because it contains multiple pieces of information. Putting the contact type (Fax, Voice, Email) in its own column (ContactType), making separate records for each contact number in the ContactInfo column, creating the ContactInfoID column, and making ContactID the table's primary key results in Figure 3.7.
Figure 3.7 A Table in First Normal Form.
The table shown in Figure 3.7 is in First Normal Form because each column contains a single, discrete piece of information.
Second Normal Form
A table in Second Normal Form is one in First Normal Form and in which every non-key field is fully dependent on the entire primary key. That is, any column in a table that is not the primary key or part of a composite primary key must depend on the primary key or all parts of a composite primary key for its meaning or interpretation.
tblContact is not in Second Normal Form. The primary key is a composite consisting of ContactID, ContactInfoID, and ContactInfo. ContactName does depend on ContactID, but not on ContactInfoID or ContactInfo.
To put this table in Second Normal Form, put ContactInfo, ContactInfoID, and ContactType in a separate table. Next, use ContactID as a foreign key in the new table, creating a one-to-many relationship. The result is shown in Figure 3.8. Now both tables are in Second Normal Form because the columns in each table depend on their respective table's primary keys.
Figure 3.8 A Table in Second Normal Form.
Third Normal Form
A table is in Third Normal Form when it is in Second Normal Form and when no non-key column depends on the primary key.
tblContactOld in Figure 3.8 contains a field dependent on another non-key field, so it is not in Third Normal Form. If you do not see it right away, look at it for a few moments before continuing with the next paragraph.
If you missed it, the dependent column is ContactType. Why? Given the ContactID and ContactInfoID, you can use a query to obtain the ContactType. The solution is to create a third table named tblContactType using the ContactType descriptions, and use tblContactType's primary key (ContactTypeID, in this case) as a foreign key in tblContactOld. The result is shown in Figure 3.9.
Figure 3.9 Tables in Third Normal Form.
All three tables are now in Third Normal Form.
Real-World Benefits of Normalization
Why bother to normalize data? The original table in the previous example (Figure 3.6), listed the contact types and information in an array of values, making searching for the contents of a fax number very difficult. In addition, when one of your contacts' phone number changes or gets a new form of communication method, like a cellular phone, updating the data is difficult and easily botched. First Normal Form improved the situation and gave you more flexibility because it created separate records for each contact's voice number, fax number, e-mail address, and so forth.
First Normal Form, shown in Figure 3.7, only partly solves the problem. Contact data is difficult to manage because the same information is repeated many times and because storing the same name 10 times wastes disk space.
Second Normal Form, in Figure 3.8, is almost, but not quite there. It eliminates most of the wasted disk space and repetitious information, but limits your flexibility. Third Normal Form allows you to add and remove contact types from a single table without requiring any changes to the structure of other tables. The tables in Figure 3.9 give you the most flexibility and use the least amount of disk space.
Data Integrity Rules
When normalizing a set of tables, you should also consider data integrity rules. Data integrity rules, sometimes called constraints, are rules, often enforced by the database engine itself, that ensure the consistency of your data. For example, to revisit the contact database example a final time, if you delete a contact from the main table (tblContact), you want to make sure that related records in other tables are also deleted. Deleting only a related record, on the other hand, should not delete other related records or, certainly, the primary contact information. Most data integrity rules in an Access 2002 database are determined by the relationships you define. In addition, Access lets you specify cascading updates and deletes when defining relationships.
A cascading update ensures that when you update a record's primary key value on the "one" side of a one-to-many relationship, the change will be reflected in all records on the "many" side. The change, that is, will cascade through all related records.
Suppose you have a lookup table for states. New York is in this table with a primary key of NY and a description of New York. Imagine that New York City broke away and formed its own state (before you laugh, it almost happened in 1789!) When you change your NY value to NNY (for "New New York"), the values of NY in all the child tables will be changed to NNY. If cascading updates were not set, you would need to add a record to the lookup table for NY1, update all the records in the child table, and then delete the NY record from the lookup table.
A cascading delete has the same effect for deleting a primary key. Cascading deletes ensure that deleting a record from a "one" table also deletes all related records from the "many" table(s). This feature has a downside, however. If you delete a customer and cascading deletes were set, all the "many" invoices would be deleted. If you do not have deletes set, the deletion would not be allowed in the "one" table until you deleted all the records in the "many" table, to avoid leaving any orphans. Orphaned records are records not related to an existing primary key. As a result, they are not directly accessible and constitute wasted disk space.