Home > Articles > Data > SQL Server

SQL Server Reference Guide

Hosted by

Data Retention Strategy

Last updated Mar 28, 2003.

When I served in the United States Air Force, computerized records were just coming into wide use. We still had a lot of paperwork that we had to originate, track and maintain. One of the primary issues we ran into was how to determine when a document no longer had any value. Some records, such as daily memos or reports, were only valid for a short period of time. Other data records, such as medical histories or large purchases, were kept far longer. Whenever I created a document, I had to include a "disposition." This stamp on the document determined how long the record was to be retained, and when and how it would be destroyed.

When data is stored on a single piece of paper, it is easier to track, handle and dispose of. With the advent of computers, things are handled much differently. Anyone can create a record and store it in multiple locations. Data isn't stored in a single kind of document, but in word-processing formats, spreadsheets, and of course, databases. And most of the data we now store isn't stamped with any kind of data retention.

In fact, I'm willing to bet that you have a lot of structured and unstructured data in your organization that fits this description. We tend to think of computer-based data as inexpensive and permanent. But as recent high-profile cases of data loss have show us, that isn't the case. Data management isn't free, and losing it or having it fall into the wrong hands is now commonplace. While the IT infrastructure is focused to make it efficient to create and store data, it spends less effort protecting that data, and even less thinking about its disposition. In this tutorial, I'll explain some concepts you can use to think about your data differently.

There are many reasons to develop a data strategy, not the least of which are the multiple regulatory implications mandated in almost every country. In the United States for instance, Sarbanes-Oxley laws dictate financial record retention, and HIPAA regulations determine medical data retention. Europe's laws are even more stringent than in the U.S.

Although I'll talk generally about data retention, there are good reasons for the database staff to take the lead on this kind of effort. The database systems often hold the most important data in an organization's store, and it can be designed to hold other data types such as text or binary formats.

The Data Lifecycle

Whether it is a Microsoft Word document or an entry in a database table, most data goes through four basic processes: creation, distribution, editing, and destruction.

In the creation phase, the user originates the data and stores it on a networked share point or locally on a single device, or perhaps in a database. Once the data is created, it is distributed to one or more people. In some cases this is a "push" operation, where the data is sent using a mechanism such as e-mail, and in other cases the data is "pulled," in which case it is retrieved from a website or mount point. In the editing phase, either the original owner or others make changes to the data. At some point in time, the data is either inadvertently or purposefully deleted. It is this last step that I want to focus on. Using the following information, you can help your organization develop a strategy

Data Evaluation and Categorization

The first thing you need to consider is the type of data that you store. This will help you create the categories of data you have. Many companies have never performed this exercise, so expect that there will be a great deal of confusion and denial as you begin. If your company is like most others, this is the most difficult part of creating your strategy.

The categories you choose can fall into structural or type, like "Word Files," "Excel Data" or "SAP records," or by use, such as "Financial Data," "Personnel Records" and so forth. My recommendation is to create a database with both terms, as I'll describe a bit later.

You'll find that the easiest data to locate and categorize is in the database systems used in your largest applications. Since IT controls these resources, you can easily locate and identify it. You've probably already done that as part of your disaster recovery strategy. The more difficult tasks involve the unstructured data your company at least partially runs on. Most of the company's data isn't in databases; it’s in Word documents and Excel spreadsheets. That doesn't mean that it isn't important — on the contrary, it's at least as important as the data in your databases. Finding it and categorizing it is another matter.

Format Determination

You might think that if the data is in a database, it's safer than in a file share or on a user's hard drive. While that is probably true in the short and even middle-term, it isn't true in a longer period of time. If you doubt that, think about restoring a SQL Server 4.2 database from a series of 5.25 floppies. As a case in point, the news recently reported that NASA is having issues locating and recovering several sets of key records from the first moon landing.

Once you locate and categorize your data, you need to think about its format. If the data is spread out over your organization in various formats, will those formats be readable in the distant future? You need to think at least ten years out. You may want to consider extracting the data into a more "universal" format, such as ASCII. This involves a lot of effort, and not all data requires this level of work. Only the longest-term categories fit the bill.

You may also want to consolidate the data into a single platform. Rather than storing documents on file shares and web sites, you can feed them into a database. In fact, many portals that provide a single view to the organization do just that. However, if you do store files in the database, make sure that you track what format the file is. It doesn't help to consolidate your Wordstar files into a database if you don't have a way to read them out later.

Metadata Creation

It's tough to adequately categorize your data using only one criteria. For instance, a Word document might contain both financial and medical information. Your database records also have multiple applications. You can track all these attributes using a database, and create an interface your business users can enter the locations of data, its disposition, and all the other metadata needed to surface the data to the business.

Hierarchical Storage and Review

Most large firms subject to long-term regulations for data retention "stage" the data into systems that present the data based on how it is accessed. Data that is used often requires expensive, current technology. Data that is used less frequently can be placed on storage mechanisms that are slower and less expensive. Data that is retained for regulatory reasons can be placed on storage that not subject to change, located remotely from the organization.

As you develop your strategy, don't forget the final step: disposal. As a data professional, you should guide your organization to protect the data you store, and that means keeping it only as long as necessary and destroying it completely when it isn't needed.

The metadata repository you created in the last step enables you to periodically review your data to ensure you have the right storage strategy. This is a process that you need to undertake with business users to surface any concerns they have.

Each one of these points has mechanical and procedural aspects. Each is also interdependent. If you don't categorize your data, you won't know where it is or how long to keep it. If you don't have a database to store the metadata, you won't be able to track it, putting you back where you started. If you don't have the formats defined, you won't be able to get to the data when you need to, and without the proper review you're wasting a lot of effort.

Informit Articles and Sample Chapters

David Gulbransen has a great article that covers this same topic.

Online Resources

TDAN has a great rundown on some of the regulatory issues for data here in the U.S.