Home > Articles > Programming

This chapter is from the book

A Brief History of Data Architecture

While this multifaceted view of data is not universally practiced, it is generally accepted as a reasonable way to look at the world. How we got here turns out to be an interesting story.

The “Application Approach” to Systems . . .

Originally, we wrote programs to address particular problems. From the beginning of the computer industry in the 1940s, and through the early 1970s, data processing professionals focused on program processing. From machine language, through assembler, Fortran, and COBOL, programmers were concerned primarily with what a program was supposed to do. Presumably that was in support of some activity in an enterprise. Any data referred to were defined as needed by the program and then promptly forgotten. For example, we might write a program for a client that would take Whiz-bang input data and produce a Framis report of some kind. (See Figure 3.3.)

03fig03.gifFigure 3.3. One Program.

As it happens, the fellow sitting next to our client sees the report and likes it. He'd like one, too. But could you add this little dollop of Knart data, and maybe summarize things a little differently? This gives us the configuration in Figure 3.4.

03fig04.gifFigure 3.4. Two Programs.

Now, our popularity grows. Lots of people want variations on the report, using not only the data originally provided but various other kinds of data as well. This gives us Figure 3.5.

03fig05.gifFigure 3.5. Until . . . .

In fact it usually gives us a picture even messier than is shown in Figure 3.5. There could be hundreds of these programs, all copying data from one place to another—often requiring re-entry of the same data—with no particular organization or cohesion.

The problem is compounded by the fact that the business is dynamic. Our clients are always changing the way they want things to work. By the time the configuration has matured to this complexity, the situation is all but hopeless. Errors cannot be fixed and enhancements cannot be added without generating yet more errors.

What Went Wrong?

It is easy to bemoan the mess we have found ourselves in. It is easy to imagine that things simply got out of control by virtue of general incompetence. The fact of the matter, however, is that this approach was based on three very specific mistaken assumptions.

1 Input's Connections to Output

First, we assumed that input and output were closely related. (See Figure 3.6.) Each project was a matched pair of inputs and outputs. Take some data, process it, and generate a report.

03fig06.gifFigure 3.6. The Assumptions.

The problem with that assumption is that it's false.

The dynamics of input are in fact radically different from those of output. Input data tend to come from the operations of the company. The mechanisms for collecting them are difficult to set up, but once in place they are relatively stable and often are difficult to change. Output requirements, on the other hand, are usually highly dynamic. “I want a report that looks like this. I have never asked for this kind of report before, and I never will again, but this is what I want today.”

When somebody wants a new report and is told that it can't be produced because the structures for collecting the data are too inflexible, you have an unhappy client.

2 Overlapping Applications

A second, related, assumption was that each of the applications was a separate entity. By assuming that applications could be addressed in isolation, we wound up duplicating:

  • Data definition

  • Data entry

  • Program code

. . . with the effect that we couldn't:

  • Guarantee consistency

  • Find errors easily

The fact of the matter is that much of the data that any one department requires is often required by other departments as well. For each and every department to pursue collection and massaging of the same data is not very productive.

3 Encoding Business in the Programs

A third assumption we made was that programs should represent the workings of the enterprise. This meant:

  • The structure of the company's business was encoded in the programs.

  • Changes in the business meant changes to programs (by programmers).

  • “Ease of maintenance” (for programmers) was an issue.

    But processes are inherently subject to change. To encode the business in the programs was to condemn the Information Technology Department to be forever trying to keep up with business changes. The advent of structured design techniques was intended to address this problem, and it did make programs easier to change—but the fundamental problem remained. Programs could adapt to change, but they could not accommodate it.

It is these three assumptions, not a general undifferentiated incompetence, that left us in this sad position.

The Solution—Version 1

So, how should we address this problem and these fallacies?

In 1970, Dr. E. F. Codd introduced the relational approach to organizing data [Codd, 1970]. Dr. Codd's theory included specific criteria to ensure that data structures remained simple and non-redundant (“normalization”1). This approach produced much simpler database structures than had been seen before. Before that, COBOL programs arranged data hierarchically, as did many early versions of database management systems. This meant that to find a piece of data, it was necessary to start with general categories and search through progressively narrower ones.

Dr. Codd's insight was that you could do everything you needed to store data coherently with two-dimensional tables, or relations, as he called them. He based his theory on the mathematics of set theory, and from that he derived a set of rules to ensure that data were stored with the least possible redundancy. Moreover, relationships between data were defined by the data themselves, not by internal (and invisible) pointers.

Thus, four years before publication of the three-schema architecture, Dr. Codd had already described the nature of the conceptual schema.

Shortly after Dr. Codd's paper was published, the IBM San Jose Laboratory began work on System R as a database management system based on his relational theory. At about the same time, a group of students at the University of California at Berkeley built the Interactive Graphics and Retrieval System (INGRES), and Relation Software, Inc. was developing the Oracle Database.2

Six years later, in 1976, Peter Chen introduced “entity/relationship modeling” or “data modeling” as a way of representing data structures graphically [Chen, 1976]. Two years after that, Oracle Corporation introduced the first commercial relational database management system.

These insights and techniques drew our attention to the structure of the data our programs were manipulating. And the more we looked at things, the more we realized that these data structures were more stable than our program structures could ever be. The things an enterprise is concerned with don't change very much. The processes they carry out change all the time.

Thus, our orientation changed from the programs we wrote to the data they used. The advent of this data orientation made it clear that we had been looking at the problem all wrong. Instead of addressing the problem of a company's systems horizontally, why not address them vertically, as shown in Figure 3.7?

03fig07.gifFigure 3.7. Cut a Different Way.

This means addressing the entire problem of collecting data in the organization as a single project, utilizing all the best technology available. This might include time clocks, process-control equipment, and bar code readers.

A second project can then address reporting overall. There are excellent tools for retrieving data in many different ways and forms. These include query languages, online analytical processing tools, and other fourth-generation programming tools.

The idea is to isolate these processes from each other. This is done by redesigning the middle. The secret of this redesign is the addition of a buffer database, as shown in Figure 3.8.

03fig08.gifFigure 3.8. The Database Approach.

Included with this buffer is a standard piece of software called a database management system. Traditional programs tended to have a lot of code concerned with reading from and writing to files. A database management system is specifically equipped to read from and write to the database. Because the mechanics of reading and writing have been standardized, the only programming required now is that to:

  • define the structure of the input transactions, and the translation of input data into a form for storage

  • define the structure of the data as stored

  • define the nature of each query, in terms of the data as stored

An “application” is now just that required to enter or retrieve (and manipulate) data. (See Figure 3.8.)

Note that with this arrangement, the definition of the nature of the enterprise is no longer just encoded in programs. It is now reflected in the structure of the data. The task of designing the database is nothing other than the task of defining the structure of the enterprise itself.

The database management system is then charged with distinguishing the external views of data from the conceptual and physical views. Indeed, relational database management systems, via the standard language SQL, have the ability to represent “views” of data that are different from the physical tables and columns that constitute the database.

All this, then, constitutes “the database approach” to designing systems.

Data Management

A significant effect of this change in orientation is the appearance of “data management” as a function. Companies and government agencies have come to recognize that information itself constitutes an important asset to the company, and throughout the nineties the functions “database administrator” and “data administrator” began to appear.

The meanings of these two terms have varied widely but appear to be converging on the following: the database administrator is the person responsible for maintenance of the database management software, while the data administrator (also called the information resource manager, or IRM) is the person responsible for the integrity of the data themselves. The data administrator is responsible for data models as well as for the quality of the data captured.

Professional societies such as the Data Administration Management Association (DAMA)3 have been formed to further the body of knowledge concerning data administration, and numerous books have been written on the subject.

The Solution—Version 2

The database approach to developing systems has been undertaken in various forms since the late 1970s. Relational database management systems in particular now make it possible to design a database along the lines of a conceptual schema and then provide different “views” of the database to both the creators and consumers of its data. In addition each database management system has the ability to tune physical storage parameters independently of the table and column structure. The principles of data “normalization” (described below) provide guidance for the design of the central database, while SQL views are used to construct specific external schemas.

In spite of these advances, however, different companies have had varying degrees of success in trying to carry out this approach, and few have been able to do so completely. It turns out that there are other problems besides the ones we first identified.

The first problem relates to the first assumption the database approach was supposed to deal with previously. The way creators and consumers look at data is very different. (See Figure 3.9.)

03fig09.gifFigure 3.9. Inputs vs. Outputs.

Creators of data tend to deal with it one transaction at a time. It is not impractical to analyze a transaction and figure out where each of its components fits into the database based on the conceptual schema.

Consumers of data, on the other hand, tend to want to see large sets of data, with many records taken from various parts of the underlying database. As powerful and fast as computers have become in recent years, they still are not up to the task of retrieving data as complicated as those typically requested, in the time required.

The response to these problems has been development of the central database into what has come to be called the data warehouse. This term has two meanings: First, it is a configuration consisting of a central database that reflects an enterprise's conceptual schema, plus one or more data marts, each of which represents a particular external schema—summarizing data in particular ways that will be useful to a particular data customer. The second meaning is that a data warehouse is the central database itself. It is an integrated, enterprisewide database that serves as a central point of reference. It's purpose is to store data for retrieval, not to support operations, so it will not be quite as current, and some data in it may be stored in summarized form.

It is this second meaning of the term that will be used here. A data warehouse is analogous to a discount warehouse of the sort run by Costco or Sam's Club. Here, the pretzels are stored in one part of the store, the beer is in a second part, and the video tapes are in a third part altogether. At the local convenience store, however (the mart), the pretzels, beer, and video tapes tend to be displayed close together, to encourage customers buying one to buy them all. This configuration means that most queries will not have to examine the entire data warehouse at all.

Mapping software extracts data from the central database to create these data marts. (See Figure 3.10.)

03fig10.gifFigure 3.10. The Data Warehouse.

The data marts each tend to be organized in terms of a central fact, such as “sales”, with various dimensions for collecting groups of facts. A dimension could be identification of the fact's product and product group, its point in time, or geographical location. A collection of facts, then, is expressed in terms of the product group, time, and geography to which they apply. (“Give me all the sales for green vegetables in the Eastern Region last quarter.”) This organization of data is also called a multidimensional database.

For example, while a data warehouse (in the sense of the central database) might keep the complete structure of customer information and contracts, one multidimensional data mart might just concern the fact “sale”, accessible through the dimensions of time (day, week, month, or quarter), geography (region, state, city), customer (customer group or individual customer), and product (product group or product). In such a data mart, it would be easy to report on sales at any level of detail along any dimension.

One problem with early incarnations of the database approach was that a typical company's management was not prepared to replace all its existing systems and base the entire company's operations on a single database. Older systems are not replaced if there seems no need to do so.

The data warehouse architecture, then, often involves feeding the central database from existing systems (legacy systems). This means additional mapping software to translate data from the terms in which they were stored there into terms appropriate for the data warehouse. These are shown on the left side of Figure 3.10.

The result of all this is a configuration that may include a set of older systems, a “mapping” from those systems to a central data warehouse, and then another “mapping” to the data marts used for user queries.

One or more of the systems which support the organization's daily operations may also be based on the conceptual schema. To the extent that this is done and the operational system comes to resemble a data warehouse, the database becomes what is called an operational data store, or ODS. The dynamics of a data warehouse and an operational data store remain very different, however. To the extent that a database is an ODS used to support applications, it will be more dynamic, with potentially less room for history. To the extent that it is used only as a warehouse to support retrievals and analysis, it may have more history and be updated less frequently. In either case, however, the underlying structure will be basically the same.

Since the data warehouse is used only for retrievals, it does not require the structures and mechanisms for enforcing business rules that are to be found in the operational data store and in other operational legacy systems.

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Overview


Pearson Education, Inc., 221 River Street, Hoboken, New Jersey 07030, (Pearson) presents this site to provide information about products and services that can be purchased through this site.

This privacy notice provides an overview of our commitment to privacy and describes how we collect, protect, use and share personal information collected through this site. Please note that other Pearson websites and online products and services have their own separate privacy policies.

Collection and Use of Information


To conduct business and deliver products and services, Pearson collects and uses personal information in several ways in connection with this site, including:

Questions and Inquiries

For inquiries and questions, we collect the inquiry or question, together with name, contact details (email address, phone number and mailing address) and any other additional information voluntarily submitted to us through a Contact Us form or an email. We use this information to address the inquiry and respond to the question.

Online Store

For orders and purchases placed through our online store on this site, we collect order details, name, institution name and address (if applicable), email address, phone number, shipping and billing addresses, credit/debit card information, shipping options and any instructions. We use this information to complete transactions, fulfill orders, communicate with individuals placing orders or visiting the online store, and for related purposes.

Surveys

Pearson may offer opportunities to provide feedback or participate in surveys, including surveys evaluating Pearson products, services or sites. Participation is voluntary. Pearson collects information requested in the survey questions and uses the information to evaluate, support, maintain and improve products, services or sites, develop new products and services, conduct educational research and for other purposes specified in the survey.

Contests and Drawings

Occasionally, we may sponsor a contest or drawing. Participation is optional. Pearson collects name, contact information and other information specified on the entry form for the contest or drawing to conduct the contest or drawing. Pearson may collect additional personal information from the winners of a contest or drawing in order to award the prize and for tax reporting purposes, as required by law.

Newsletters

If you have elected to receive email newsletters or promotional mailings and special offers but want to unsubscribe, simply email information@informit.com.

Service Announcements

On rare occasions it is necessary to send out a strictly service related announcement. For instance, if our service is temporarily suspended for maintenance we might send users an email. Generally, users may not opt-out of these communications, though they can deactivate their account information. However, these communications are not promotional in nature.

Customer Service

We communicate with users on a regular basis to provide requested services and in regard to issues relating to their account we reply via email or phone in accordance with the users' wishes when a user submits their information through our Contact Us form.

Other Collection and Use of Information


Application and System Logs

Pearson automatically collects log data to help ensure the delivery, availability and security of this site. Log data may include technical information about how a user or visitor connected to this site, such as browser type, type of computer/device, operating system, internet service provider and IP address. We use this information for support purposes and to monitor the health of the site, identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents and appropriately scale computing resources.

Web Analytics

Pearson may use third party web trend analytical services, including Google Analytics, to collect visitor information, such as IP addresses, browser types, referring pages, pages visited and time spent on a particular site. While these analytical services collect and report information on an anonymous basis, they may use cookies to gather web trend information. The information gathered may enable Pearson (but not the third party web trend services) to link information with application and system log data. Pearson uses this information for system administration and to identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents, appropriately scale computing resources and otherwise support and deliver this site and its services.

Cookies and Related Technologies

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

Do Not Track

This site currently does not respond to Do Not Track signals.

Security


Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

Children


This site is not directed to children under the age of 13.

Marketing


Pearson may send or direct marketing communications to users, provided that

  • Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
  • Such marketing is consistent with applicable law and Pearson's legal obligations.
  • Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
  • Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

Correcting/Updating Personal Information


If a user's personally identifiable information changes (such as your postal address or email address), we provide a way to correct or update that user's personal data provided to us. This can be done on the Account page. If a user no longer desires our service and desires to delete his or her account, please contact us at customer-service@informit.com and we will process the deletion of a user's account.

Choice/Opt-out


Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s) simply visit the following page and uncheck any communication you no longer want to receive: www.informit.com/u.aspx.

Sale of Personal Information


Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

Supplemental Privacy Statement for California Residents


California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

Sharing and Disclosure


Pearson may disclose personal information, as follows:

  • As required by law.
  • With the consent of the individual (or their parent, if the individual is a minor)
  • In response to a subpoena, court order or legal process, to the extent permitted or required by law
  • To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
  • In connection the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
  • To investigate or address actual or suspected fraud or other illegal activities
  • To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
  • To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
  • To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

Links


This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.

Requests and Contact


Please contact us about this Privacy Notice or if you have any requests or questions relating to the privacy of your personal information.

Changes to this Privacy Notice


We may revise this Privacy Notice through an updated posting. We will identify the effective date of the revision in the posting. Often, updates are made to provide greater clarity or to comply with changes in regulatory requirements. If the updates involve material changes to the collection, protection, use or disclosure of Personal Information, Pearson will provide notice of the change through a conspicuous notice on this site or other appropriate way. Continued use of the site after the effective date of a posted revision evidences acceptance. Please contact us if you have questions or concerns about the Privacy Notice or any objection to any revisions.

Last Update: November 17, 2020