- Apr 11, 2003
After this lab, you will be able to:
Read a Schema Diagram
Identify Data Normalization Rules and Table Relationships
Understand the Database Development Context
Although this is a book about SQL, you must understand the basic concepts, terminology, and issues involving database design to be able to understand why tables are organized in specific ways. This lab will introduce you to the practical aspects of designing tables and determining their respective relationships to each other.
The objective of normalization is the elimination of redundancy in tables, therefore avoiding any future data manipulation problems. There are a number of different rules for minimizing duplication of data, which are formulated into the various normal forms.
The rules verify that the columns you placed in the tables do in fact belong there. You design your tables, the appropriate columns, and the matching primary and foreign keys to comply with these rules. This process is called normalization. The normalization rules will be quite intuitive after you have read through the examples in this lab. Although there are many normalization rules, the five normal forms and the BoyceCodd normal form (BCNF) are the most widely accepted. This lab will discuss the first three normal forms as programmers and analysts typically don't bother normalizing beyond third normal form; with the exception of experienced database designers.
First Normal Form
For a table to be in first normal form, all repeating groups must be removed and placed in a new table. The example in Figure 1.8 illustrates the repeating groups in the BOOK table. The table has the location information of various warehouses across the country where the title is stocked. The location is listed in three columns as LOCATION_1 LOCATION_2, and LOCATION_3.
Figure 1.8 Repeating group.
Imagine the scenario when you have more than three locations for a book. To avoid this and other problems, the database designer will move the location information to a separate table named BOOK_LOCATION, as illustrated in Figure 1.9. This design is more flexible and allows the storing of books at an unlimited number of locations.
Figure 1.9 Tables in first normal form.
Second Normal Form
Second normal form states that all nonkey columns must depend on the entire primary key, not just part of it. This form only applies to tables that have composite primary keys. Figure 1.10 shows the BOOK_AUTHOR table with both the BOOK_ID and AUTHOR_ID as the composite primary key. In this example, authors with the ID 900 and 901 coauthored the book with the ID of 10002. If you add the author's phone number to the table, the second normal form is violated because the phone number is dependent only on the AUTHOR_ID, not on the BOOK_ID. Note ROYALTY_SHARE is dependent completely on the combination of both columns because the percentage of the royalty varies from book to book and is split among authors.
Figure 1.10 Violation of second normal form in the BOOK_AUTHOR table.
Third Normal Form
The third normal form goes a step further than the second normal form: It states that every nonkey column must be a fact about the primary key column. The third normal form is quite intuitive. Figure 1.11 shows a table that violates third normal form. The publisher_phone_no column is not dependent on the primary key column Book_ID but on the PUBLISHER_NAME column. Therefore, it should not be part of the BOOK table.
Figure 1.11 Violation of third normal form.
Instead, the publisher's phone number should be stored in a separate table called PUBLISHER. This has the advantage that when a publisher's phone number is updated, it only needs to be updated in one place, rather than all occurrences of this publisher in the BOOK table. Removing the PUBLISHER_PHONE_NO column eliminates redundancy and avoids any possibilities of data inconsistencies (see Figure 1.12).
Figure 1.12 Tables in third normal form.
Also, the BOOK table can benefit by introducing a surrogate key, such as a PUBLISHER_ID. Such a key is not subject to changes and is easily referenced in any additional tables that may need to refer to data about the publisher.
BoyceCodd Normal Form, Fourth Normal Form, and Fifth Normal Form
The Boyce-Codd normal form is an even more elaborate version of the third normal form and deals with deletion anomalies. The fourth normal form tackles potential problems when three or more columns are part of the unique identifier and their dependencies to each other. The fifth normal form splits the tables even further apart to eliminate all redundancy. These different normal forms are beyond the scope of this book; for more details, please consult one of the many excellent books on database design.
When two tables have a common column or columns, the tables are said to have a relationship between them. The cardinality of a relationship is the actual number of occurrences for each entity. We will explore one-to-one, one-to-many, and many-to-many relationships.
One-to-Many Relationship (1:M)
Figure 1.13 shows the Customer table and the Order table. The common column is CUSTOMER_ID. The link between the two tables is a one-to-many relationship, the most common type of relationship. This means that "one" individual customer can have "many" order rows in the Order table. This relationship represents the business rule that "One customer can place one or many orders (or no orders)." Reading the relationship in the other direction, an order is associated with only one customer row (or no customer rows). In other words, "each order may be placed by one and only one customer."
Figure 1.13 One-to-many relationship example between Customer and Order table.
One-to-One Relationship (1:1)
One-to-one relationships exist in the database world, but they are not typical because most often data from both tables are combined into one table for simplicity. Figure 1.14 shows an example of a one-to-one relationship between the Product table and the Product_Price table. For every row in the Product table you may find only "one" matching row in the Product_Price table. And for every row in the PRODUCT_PRICE table there is "one" matching row in the PRODUCT table. If the two tables are combined, the Retail_Price and In_Stock_Qty columns can be included in the Product table.
Figure 1.14 One-to-one relationship example.
Many-to-Many Relationship (M:M)
The examination of Figure 1.15 reveals a many-to-many relationship between the BOOK and AUTHOR tables. One book can have one or more authors and one author can write one or more books. The relational database model requires the resolution of many-to-many relationships into one-to-many relationship tables. This is done by creating an associative table (also called an intersection table). The solution in this example is achieved via the BOOK_AUTHOR table. Figure 1.16 shows the columns of this table.
Figure 1.15 Many-to-many relationship example.
Figure 1.16 Associative BOOK_AUTHOR table that resolves the many-to-many relationship.
The BOOK_AUTHOR table lists the individual author(s) for each book and shows, for a particular author, the book(s) he or she wrote. The primary key of the BOOK_AUTHOR table is the combination of both columns: the BOOK_ID column and the AUTHOR_ID column. These two columns represent the concatenated primary key that uniquely identifies a row in the table. As you may recall from the previous lab, multicolumn primary keys are referred to as a composite or concatenated primary key. Additionally, the BOOK_AUTHOR table has the AUTHOR_ID and the BOOK_ID as two individual foreign keys linking back to the AUTHOR and the BOOK table, respectively.
The BOOK_AUTHOR table contains an additional column, the ROYALTY_SHARE column. It identifies the royalty percentage for each author for an individual book. When there are multiple authors, the percentage of the royalty is split; in the case of a sole author the share is 100 percent. This column is appropriately located in the BOOK_AUTHOR table as the values are relevant for the combination of the book_ID and author_ID. This combination of columns uniquely identifies both a book and an author and the respective percentage share of the royalty.
For clarity of meaning and conceptual consistency, it is useful to show table relationships using drawings (called schema diagrams) and there are a number of standard notations for this type of diagram. For example, Figure 1.17 illustrates one of the ways to graphically depict the relationship between tables. The convention used in this book for a one-to-many relationship is a line with a "crow's foot" (fork) on one end indicating the "many" side of the relationship; at the other end, a "single line" depicts the "one" side of the relationship. You will see the use of the crow's-foot notation throughout this book. Software diagramming programs that support the graphical display of relational database models often allow you to choose your notation preference.
Figure 1.17 Crow's foot notation.
Cardinality and Optionality
The cardinality expresses the ratio of a parent and child table from the perspective of the parent table. It describes how many rows you may find between the two tables for a given primary key value. For example, in Figure 1.13 you saw a one-to-many relationship between the CUSTOMER and ORDER tables and the relationship ratio is expressed in the form of a 1:M ratio.
Graphical relationship lines indicate the optionality of a relationship, whether a row is required or not (mandatory or optional). Specifically, optionality shows if one row in a table can exist without a row in the related table.
Figure 1.17 shows a one-to-many relationship between the PUBLISHER (parent) and the BOOK (child). Examining the relationship line on the "many" end, you notice a "circle" identifying the optional relationship and a crow's foot indicating "many." The symbols indicate that a publisher may publish zero, one, or many books. You use the word "may" to indicate that the relationship is optional and allows a publisher to exist without a corresponding value in the BOOK table.
The relationship line also reads the other way. The solid line on the PUBLISHER end of the line indicates the "one" side, a "vertical bar" intersects it and this bar identifies a mandatory relationship. You read this direction of the relationship as "One book must be published by one and only one publisher." This means a row in the BOOK table must always have the PUBLISHER_ID value filled in. It cannot be null because that means unknown and indicates there is no associated PUBLISHER row.
The "(FK)" symbol next to the PUBLISHER_ID column indicates that this is the foreign key column. In this diagram, the primary key is separated from the other columns with a line; you observe the BOOK_ID and the PUBLISHER_ID as the primary keys or unique identifiers.
Figure 1.18 shows an optional relationship on both sides; a book may be published by zero or one publisher. Effectively, this means the value in the publisher_id column in BOOK is optional. Reading the relationship from the PUBLISHER, you can say that "one publisher may publish zero, one, or many books" (which is identical to Figure 1.17).
Figure 1.18 Optional relationship on both sides.
Real World Business Practice
You will typically see only these two types of relationships: First, mandatory on the "one" side and optional on the "many" end as in Figure 1.17; and second, optional on both ends as in Figure 1.18. Only rarely will you find other types of relationships. For example, mandatory relationships on both sides are infrequently implemented; it means that rows must be inserted in both tables simultaneously. Occasionally you will find one-to-one relationships but most often the columns from both tables are combined into one table. Many-to-many relationships are not allowed in the relational database; they must be resolved via an associative or intersection table into one-to-many relationships.
To clarify and explain the nature of the relationship on the diagram, it's useful to add a label or name with a verb on the relationship line. Figure 1.19 shows an example of a labeled relationship. For the utmost in clarity, a labeled relationship should be labeled on both sides. You then read it as: "One PUBLISHER may publish zero, one, or many BOOKs; and one BOOK must be published by one and only one PUBLISHER." This kind of labeling makes the relationship perfectly clear and states the relationship in terms that a business user can understand.
Figure 1.19 Labeled relationship between BOOK and PUBLISHER.
Identifying and nonidentifying Relationships
In an identifying relationship, the foreign key is propagated to the child entity as the primary key. This is in contrast to a nonidentifying relationship, in which the foreign key becomes one of the nonkey columns. Nonidentifying relationships may accept null value in the foreign key column.
Figure 1.20 depicts some of the tables used in the lab; the many-to-many relationship between the BOOK and AUTHOR tables is now resolved to the associative table called BOOK_AUTHOR. If a graphical representation of a table's box has rounded edges it means that the relationship is identifying. Effectively, one of the foreign keys became the primary key or part of the primary key. In the case of the BOOK_AUTHOR table, both foreign key columns constitute the primary key and both columns may not be null because a primary key is never null.
Figure 1.20 Identifying and nonidentifying relationships.
The relationship between the PUBLISHER and BOOK tables is nonidentifying, as indicated by the sharp edges. The foreign key column PUBLISHER_ID is not part of the primary key. The foreign key columns of a nonidentifying relationship may be either null or NOT NULL. In this instance you can determine if a null is allowed by checking if the relationship is optional or mandatory. Although the foreign key column allows null values in non-identifying relationships, here the relationship depicts a single bar on the relationship line. Effectively, for every row in the BOOK table there must be a corresponding row in the PUBLISHER table and the PUBLISHER_ID column of the BOOK table cannot be null.
DATABASE Development context
Now that you are familiar with the some of the relational database terminology and its core concepts, you are ready to learn about how all this information fits into the context of database development. From the initial idea of an application until the final system implementation, the data model is continuously refined. Figure 1.21 indicates the essential phases of the development project with respect to the database.
Figure 1.21 Database development and design phases.
Initially, the process starts off with gathering data requirements that identify the needs and wants of the users. One of the outputs of this phase is a list of individual data elements that need to be stored in the database.
Conceptual DATA Model
The conceptual data model logically groups the major data elements from the requirements analysis into individual entities. An entity is just something of significance for which you need to store data. For example, all data related to books such as the title, publish date, and retail price are placed in the book entity. Data elements such as the author's name and address are part of the author entity. The individual data elements are referred to as attributes.
You designate a unique identifier or candidate key that uniquely distinguishes a row in the entity. Notice that in this conceptual data model we use the terms entity, attribute, and candidate key or unique identifier instead of table, column, and primary key, respectively.
Noncritical attributes are not included in the model to emphasize the business meaning of those entities, attributes, and relationships. Many-to-many relationships are acceptable and not resolved. The diagram of the conceptual model is useful to communicate the initial understanding of the requirements to business users. The conceptual model gives no consideration to the implementation platform or database software. Many projects skip the conceptual model and go directly to the logical model.
Logical Data Model
The purpose of the logical data model is to show that all of the entities, their respective attributes, and the relationship between entities represent the business requirements without considering technical issues. The focus is entirely on business problems and considers a design that accommodates growth and change. The entities and attributes require descriptive names and documentation of their meaning. Labeling and documenting the relationships between entities clarify the business rules between them.
The diagram may show the datatype of an attribute in general terms such as text, number, and date. In many logical design models you will find foreign key columns identified, in others they are implied. (For example, Oracle's Designer software product doesn't show the foreign keys in the logical model diagram because they are an implementation detail that is implied by the relationships.)
The complete model is called the logical data model or Entity Relationship Diagram (ERD). At the end of the analysis phase the entities are fully normalized, the unique identifier for each entity is determined, and any many-to-many relationships are resolved into associative entities.
Physical Data Model
The physical data model, also referred to as the schema diagram, is a graphical model of the physical design implementation of the database. This physical schema diagram is what the programmers and you will use to learn about the database and the relationship between the tables. In Lab 1.3 you will be introduced to the STUDENT schema diagram used throughout this workbook.
This physical data model is derived from the fully normalized logical model. Before the actual implementation (installation) of the physical data model in the database, multiple physical data models may exist. They represent a variety of alternative physical database designs that consider the performance implications and application constraints. One of the physical design models will be implemented in the database. The schema diagram graphically represents the chosen implemented physical data model; it is specific to a particular RDBMS product such as Oracle.
Figure 1.22 depicts the schema diagram of the book publishing database discussed in this chapter. It shows the structure of the tables with their respective columns, and it illustrates the relationships between the tables.
Figure 1.22 Book publishing database diagram.
The physical data model has a different terminology than the conceptual or logical data model. The physical data model refers to tables instead of entities; the individual pieces of data are columns instead of attributes in the logical model.
Transfer from Logical to Physical Model
The transfer from the logical to the physical models, which ultimately means the actual implementation in a database as tables, columns, primary keys, foreign keys, indexes, and so on, requires a number of steps and considerations. The entities identified in the logical data model are resolved to physical tables; the entity name is often identical to the table name. Some designers use singular names for entities and plural names for tables; others abbreviate the entity names when implementing the physical model to follow certain business naming standards. Frequently, the physical data model includes additional tables for specific technical implementation requirements and programming purposes such as a report queue table or an error log table.
As mentioned, attributes become columns with names being either identical or following business naming conventions and abbreviations. The columns are associated with the database software vendor's specific datatypes, which considers valid column lengths and restrictions. Individual data entry formats are determined (e.g., phone numbers must be in numeric format with dashes between). Rules for maintaining data integrity and consistency are created and physical storage parameters for individual tables are determined. You will learn about these and many other aspects of creating these restrictions in Chapter 11, "Create, Alter, and Drop Tables." Sometimes additional columns are added that were never in the logical design with the purpose of storing precalculated values; this is referred to as denormalization, which we will discuss shortly.
Another activity that occurs in the physical data design phase is the design of indexes. Indexes are database objects that facilitate speedy access to data to a specific column or columns of a table. Placing indexes on tables is necessary to optimize efficient query performance, but indexes have the negative impact of requiring additional time for insert, update, or delete operations. Balancing the trade-offs with the advantages requires careful consideration of these factors, including knowledge in optimizing SQL statements and an understanding of the features of a particular database version. You will learn more about different types of indexes and the success factors of a well-placed index strategy in Chapter 12, "Views, Indexes, and Sequences."
Poor physical database design is very costly and difficult to correct.
Database designers must be knowledgeable and experienced in many aspects of programming, design, and database administration to fully understand how design decisions impact cost, system interfaces, programming effort, and future maintenance.
You may wonder how the graphical models you see in this book are produced. Specific software packages allow you to visually design the various models and they allow you to display different aspects of it such as showing only table names or showing table names, columns, and their respective datatypes. Many of these tools even allow you to generate the DDL SQL statements to create the tables. For a list of software tools that allow you to visually produce the diagrams, see the book's Web site at http://www.phptr.com/rischert and Appendix H,"Resources."
Denormalization is the act of adding redundancy to the physical database design. Typically, logical models are fully normalized or at least in third normal form. When designing the physical model, database designers must weigh the benefit of eliminating all redundancy with data split into many tables against potentially poor performance when these many tables are joined.
Therefore database designers, also called database architects, sometimes purposely add redundancy to their physical design. Only experienced database designers should do denormalization. Increasing redundancy may greatly increase the overall programming effort because now many copies of the same data must be kept in sync; however, the time it takes to query data may be less.
In some applications, particularly data warehousing applications where massive amounts of detailed data are stored and summarized, denormalization is required. Data warehouse applications are database applications that benefit users that need to analyze large data sets from various angles and use this data for reporting and decision-making purposes. Typically, the source of the data warehouse is historical transaction data but can also include data from various other sources for the purpose of consolidating data. For example, the purchasing department of a supermarket chain could determine how many turkeys to order for a specific store on the week before Thanksgiving or use the data to determine what promotional offers have the largest sales impact on stores with certain customer demographics.
The primary purpose of a data warehouses is to query, report, and analyze data. Therefore redundancy is encouraged and necessary for queries to perform efficiently.