Getting Started with Solid Database Design
The Microsoft Books Online (BOL) documentation seems to fall a bit short in this important subject, so this chapter might be helpful for those who need a more complete understanding of how to create a best-practice relational database. The problem faced by any database designer is knowing what's going to be stored ahead of time—before the first table is created. That's always been (and always will be) a problem. As I've said before, a customer rarely knows what they want until they don't get it.
To get started on the right foot, I recommend a good course in relational theory like Extended Relational Analysis. This can do a world of good—but its depth is well beyond the scope of this book. In courses like this, you learn how to ask the right questions for each "entity" you expect to store in the database.
I also think that using a (big) whiteboard to lay out the database with your team (or customer) can help visualize the data. Getting everyone who is going to consume the data is essential. How I design a database for a single-focused project is very different than the way it's designed for projects where a small multitude of groups expect to consume the data. Admittedly, database development by committee is tough, and one should try to avoid those situations, but leaving town might not be an option.
Before we get into the academics of normalization, let's spend a few moments in quiet contemplation and focus on a few guiding principles. As you design your database, you should keep these basic tenants in mind:
- Each table needs a unique identifier. That is, you need to choose one or more columns to permit SQL Server to find specific rows to return or update without including other irrelevant rows. In SQL Server, this typically means each table should have an "ID" column (typically cast as an Identity integer) that gives the row a unique (SQL Server–generated) value. You should define this column using the Primary Key (PK) constraint (as discussed in Chapter 2, "How Does SQL Server Work?"). For example Au_ID is the unique identifier for the Authors table in the sample Biblio database.
- Each table should store only one kind of information and not repeat information in multiple columns. As I discuss next, this is where normalization comes in. SQL Server is tuned to work with small, tight rows that contain relatively few columns. If you find your table has more than a dozen columns, you're treading off the boards and into the swamp. Remember, the largest row you can define is only 8K (not counting BLOB columns).
- Microsoft feels that you should avoid columns that can be set to NULL—I'm not so sure. That is, they feel that you should avoid columns where you might not have access to the data at all and cannot (logically) assign an arbitrary default. Each time you define a column as permitting NULL (making it "nullable"), SQL Server incurs extra overhead. It makes sense to move these columns to another table and cross-reference them to the table's PK as long as the database complexity does not get out of hand.
- Each column needs to be defined both with the content in mind and with the constraints needed to keep it pure. It's not enough to type a column as integer and hope that the data entered therein is going to be pure. All columns, especially numeric columns, need to have (at least) CHECK constraints defined, if not TSQL rules. Columns should be defined to hold what's expected to be saved—and no more. Needlessly bloating data type capacity simply chews up memory to no good purpose. If you have columns that contain text, but that text is never expressed in Unicode, don't use a Unicode type. If you have a date column but don't need to store time with 3.33-millisecond accuracy, use smalldatetime instead of datetime. You get the idea. In this case, less is more—more space saved, more memory available to store other pertinent stuff. I'll show you how much each column costs in memory near the end of this chapter.
- Start thinking about a concurrency strategy from the beginning. Determine how data is to be shared (if at all). Consider that many "single-user" databases are doomed to failure once they are "converted" to multi-user. Think about the mechanism you're planning to use to determine if a row has changed once it's fetched. For example, you might (perhaps ought) to use a "Rowversion" or "TimeStamp" column to help track access—it can make Visual Studio's job a lot easier (and yours, too) when it comes time to write action commands to change the database.
- Avoid BLOBs in the database. I have been suggesting this for over a decade, and those who have listened have been able to build a smaller, faster, and simpler database. If you have BLOBs, store them in files (or on RO media) and use the database to store the path.
- Finally, and perhaps most importantly, for less-experienced developers, I think that you should strive to keep your database simple—simple to understand, support, and maintain. Excessive complexity is the bane of many a mature and amateur database designer.