Problems with Data

For a long time, computers were used strictly as calculating machines. Data was fed into them manually from punch cards, a holdover from tabulating machinery. The computer crunched the numbers and generated, with limited flexibility, reams of reports on the familiar "greenbar." Data in computers was typically lost because there was nowhere to store it. Storage wasn't considered important anyway, as punch cards and paper reports, usually retyped into forms with a typewriter, were considered the official system of record. The computer and its programs and data were merely tools to speed up some work.

Both greenbar and official reports were stored for recordkeeping. This is how it had always been done, and the appearance of a "big calculator" didn't change anything right away. In fact, many firms, well into the 1990s, kept paper records long after the appearance of magnetic tape and later nonsequential storage devices, such as magnetic disk drives. Perhaps the tradition of keeping important records, such as legal documents, on paper contributed to this habit. It wasn't until total quality management (TQM) and business process reengineering (BPR) took hold that the inefficiencies of large organizations were exposed. Unfortunately, the attitude that the computer was just a calculator, not to be trusted for real work, took far longer to break down than it had taken to form.

The Reporting Gap

With the introduction of permanent storage devices, however, computers began to take on a role as archival devices as well as number crunchers and report generators. Although paper was king at the time, the economy of microfiche was easily understood and accepted because it didn't alter the concept of computers outputting information in a format understood by businesspeople. Microfiche just compressed information and made it more durable. However, magnetic storage devices, especially tape, ushered in a completely new way of thinking about computers in business. Tapes stored everything—data, report streams, and programs. The tapes even substituted for core memory, allowing programmers the luxury of writing programs that used more resources than were actually available. Computers gradually became bona fide business tools.

Computers gained speed and flexibility through the advancements of disk drives, solid-state memory, and virtual memory operating systems. Eventually, interfaces or "transaction monitors," such as IBM's CICS, opened a new discipline: online systems. Instead of batch machines that ran serial scheduled jobs, transaction monitors allowed programmers to create applications that handled many concurrent requests in real time (although most small "transactions" were still stored for later batch processing). Soon, transaction processing was born, the polar opposite of batch computing. Online transaction processing (OLTP) emerged as the next big thing and never really faded. The bulk of corporate computing, even as it moves through generations of technology, is still transaction-oriented.

The advent of formal data methodologies and software (Chen, Codd, Computer-Aided Software Engineering [CASE] tools, and relational databases) also led to better appreciation and use of the information stored in computing technology. So we moved from computers for processing to computers for information. One thing was missing from these transactions, however—decision making. All the "brainpower" in transactions came from people. The computer's role was to capture and store the data needed to record the transaction. If a judgment had to be made, a person made it. If the transaction had to be approved, the system waited for someone to approve it. These transactions were increasingly fast and essential to business operations, but they remained unintelligent.

Additionally, the more transaction-oriented corporate computing became, the more alienated it became from its original purpose: reporting. As a result, a new industry emerged in the late 1970s that was designed to address this gap. At the time, this industry was called decision support systems (DSSs) or sometimes fourth-generation languages (4GLs), but its goal was to enable users, mostly businesspeople, to create models and generate reports without the IT department's involvement. This industry later became known as business intelligence, but almost all the early providers of these products fell by the wayside before then.

Mainstream systems evolved from batch to OLTP to client/server and finally, by 2000, to Y2K/enterprise applications, as shown in Figure 3.1. Business intelligence and analytics developed separately, supported by functional areas, particularly finance and marketing. As this separate analytical computing became more advanced and complicated, it ended up back in IT. It was called "data warehousing" and "business intelligence," but it was still a separate discipline. The original motivation for this separation was mostly IT's neglect of reporting and analysis, but as the need for and volume of business intelligence grew, IT took control to protect its existing resources. The gap has continued to grow and is, with today's technology, largely artificial and unnecessary.

Figure 3.1 The divergence of applications and analytics as computing evolved

Many of the early DSS tools were quite ingenious, with features that current analytical tools would envy, such as nonprocedural modeling, multidimensional orientation, goal seeking, statistical and stochastic routines, graphics, and tools for development and collaboration. Their weakness was data. Without the data in transactional systems, these tools were limited to whatever data businesspeople could generate on their own, such as budget, planning, and forecasting data. In many cases, getting existing data required reentering data from printed reports, a time-consuming and tedious process. Nevertheless, these tools flourished, especially in planning and budgeting roles where the need for external data was low.

The IT department wasn't supportive of these tools. There was a constant refrain that they would "bring the mainframe to its knees," but in practice that rarely happened. What did happen was that for the first time, "power users" emerged—non-IT people who could actually develop applications and generate reports from computers, a privileged group that endures to this day. In fact, there are so many of them, and their output is so important, that this group has a name of its own: shadow IT.2

Early DSS vendors met their Waterloo in the early 1980s when a small product was launched: Lotus 1-2-3. It turned out that a PC with a Lotus 1-2-3 spreadsheet was capable of doing much, if not all, of what mainframe- or mini-based DSS tools could do. Connectivity was actually worse, because PCs weren't connected to the mainframe, but the freedom and novelty were so great that, within a few years, most DSS vendors disappeared.

A bigger drain on IT was the constant flood of requests for data extracted from operational systems, whether for decision support systems or for PCs. Many alternatives were tried:

  • Formal programming requests with highly specific requirements
  • Generic abstracts to flat files that could be shared by more than one request
  • Nonprocedural programming languages, such as A Programming Language (APL), that allowed a less programmer-centered approach
  • Ad hoc query tools, such as Query Management Facility (QMF), so that users could create their own abstracts

In the end, none of these approaches worked well because, for the most part, they couldn't be maintained. The number of extracts expanded geometrically, as shown in simplified form in Figure 3.2, and changes in programs, requirements, and tools overwhelmed IT's ability to respond to requests. There were no standards or best practices for non-IT people to develop libraries of extraction and importation routines, so the "self-service" aspect suffered, too. The sheer number of extracts was also a burden on often heavily used mainframe and minicomputer hosts.

Figure 3.2 Simplified representation of extracts from various systems for diverse reporting needs

Enter the Data Warehouse

Originally, the data warehouse (or information warehouse, as IBM called it) was designed to solve a problem for IT departments, not end users. The goal was to build a database to provide in-place access to physically distributed databases and serve as a repository for satisfying report requests from end users (because individual silo databases couldn't do so), not to support self-service or direct access. This key point explains why even current data warehouse approaches often use a layered, after-the-fact batch architecture with a substantial investment in an enterprise data warehouse that knowledge workers never, or rarely, query directly.

The term "data warehouse" has an uncertain provenance. The most influential person who shaped the concept's development is probably Bill Inmon. For more than 20 years, he has dominated as a thought leader, with dozens of books and countless appearances and publications. However, the practice of data warehousing has surged ahead on many fronts, with broad adoption of Inmon's original work, Ralph Kimball's architecture and methodology, and the contributions of many others. Data warehousing has become a well-established and vibrant strand in the IT mainstream.

However, in the early to mid-1980s, a typical organization owned an IBM mainframe that struggled to perform all scheduled jobs. The computer often simply ran out of computing cycles to do all the work, necessitating an upgrade to a newer, more powerful machine. This solution was expensive and temporary, because new requests were already lined up to absorb all the new computing cycles. This management-from-scarcity approach resulted in a sort of triage process in which requests for new applications were evaluated. Most were put into a queue, the so-called backlog, where they languished sometimes indefinitely. This backlog was largely made up of business departments' requests for reports, because preference was given to operational programs.

The original concept of data warehousing was quite simple: Buy another mainframe, install a more "friendly" operating system (such as IBM's VM/CMS), and provide a set of utilities to build a relational database of data culled from various operational systems. This solution would relieve the other mainframe's load of numerous and overlapping extracts and allow IT to service users' reporting needs with tools of its choosing. How reporting needs were met took different forms. At the simplest level, the new machine simply took copies of data from various systems, in whole or in part. From these files or databases, programmers could draw their own extracts for reporting programs without affecting the "main" mainframe's performance.

This method is very different from the current concept of data warehousing. For one thing, there were no business users in the data warehouse; it was designed for IT. Business units, with or without support from IT, developed their own reporting and analysis solutions to compensate for the weak reporting operational systems offered. None of these solutions, however, was completely effective, and all were expensive and tedious to build and maintain. Eventually, power users of DSS tools and PCs got access to the early data warehouse, and it became apparent that the arrangement wasn't satisfactory. When information from multiple extracts was combined, the data simply did not line up. Whether the problem was caused by coding errors, sloppy data entry, or timing errors, merging information across the various disconnected source systems or "stovepipes" was impossible. Solving this problem was the beginning of data warehousing as it is known today: the integration of disparate data into a common logical model. In time, data warehousing became a structured solution to address these intractable reporting and analysis problems:

  • Performance—Systems designed for transaction processing performance, especially in an era of expensive, proprietary hardware, were unable to tolerate the added load of query and report processing.
  • Data quality—OLTP system data integrity was limited to the application's needs, so integrating data across systems was difficult to impossible.
  • Access—Initially, connectivity issues were paramount, but security, licensing, and organizational boundary disputes also restricted access to information.
  • Stewardship of historical data—OLTP systems didn't maintain the historical analysis needed for variance reporting, trend analysis, and even statutory and regulatory requirements.

In the 20 or so years since this concept emerged, the data warehouse has taken on far more responsibility, some of which it still hasn't met completely. For instance, many see the "enterprise" data warehouse as the "single version of the truth," a comprehensive repository of the one true set of data and semantics that everyone can get behind. Having absorbed the role of business intelligence, it's also the architecture for all forms of end-user reporting, query, and analysis. It's the plug that fills the gap between operational and analytical processing, a gap that has been steadily widening since the advent of online processing. As Figure 3.3 shows, a data warehouse is the hub for spokes that lead to operational systems, analysts, customers, performance management, and more.

Figure 3.3 A modern data warehouse plays a more central role in information systems (Teradata)

A functioning data warehouse is a collection of data, models, metadata, processes, practices, standards, and services. At the most fundamental level, a data warehouse gathers data from other systems and transforms it into a merged model of sorts. This process can range from preserving almost all the original data at the most detailed level to compressing and summarizing at a much higher level of aggregation or, in most cases, both. What's most important is finding a way to preserve as much of the data's original intent as possible yet fit it into a single schema, at least initially. This process often involves "cleaning" the data because data in various source systems might be of poor quality, especially historical data. Another unique characteristic of data warehouses is the extent of historical data they retain. This data provides the basis for tasks such as trend analysis, statutory and regulatory reporting, and variance reporting.
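
To make this gather-and-merge process concrete, here is a minimal sketch in Python (using pandas; the source systems, column names, and cleaning rules are invented for illustration) of extracting records from two stovepipe systems, cleaning them, conforming them to one schema, and stamping them with a load date for history.

```python
# A minimal sketch of the core data-warehouse idea: pull data from separate
# source systems, clean it, reconcile it to one schema, and keep history.
# All table/column names and cleaning rules here are hypothetical.
import pandas as pd

# Extract: copies of data from two "stovepipe" source systems.
orders_erp = pd.DataFrame({
    "cust_id": ["C001", "C002", "C003"],
    "order_dt": ["2007-01-15", "2007-01-16", "2007-01-18"],
    "amount": ["1200.00", "450.50", "not available"],   # dirty value
})
customers_crm = pd.DataFrame({
    "CUSTOMER": ["c001", "C002", "C004"],                # inconsistent coding
    "region": ["East", "West", "South"],
})

# Transform: clean and conform the data to a single logical model.
orders_erp["amount"] = pd.to_numeric(orders_erp["amount"], errors="coerce")
orders_erp["order_dt"] = pd.to_datetime(orders_erp["order_dt"])
customers_crm["cust_id"] = customers_crm["CUSTOMER"].str.upper()

# Integrate: merge across systems; mismatches surface as missing values,
# exactly the "data did not line up" problem described earlier.
fact_orders = orders_erp.merge(
    customers_crm[["cust_id", "region"]], on="cust_id", how="left"
)

# Load: append to the warehouse table, preserving history with a load date.
fact_orders["load_dt"] = pd.Timestamp("2007-02-01")
print(fact_orders)
```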

Keeping historical data presents a new set of challenges, however. In today's fast-changing world, shoehorning three-year-old data into current data models can be exceedingly difficult. Data warehousing has developed some ingenious methods for pulling off this process, but keep in mind that these innovations were added to the original concept.

Data warehouses have also prolonged and extended the gap between operational and analytical processes. Because data warehouses and the business intelligence tools used with them were separate disciplines from other enterprise applications, even the small amount of reporting and analysis that occurred in operational systems gradually migrated to data warehouse/business intelligence systems.

The data warehouse was a simple idea, but in practice, its implementation caused, and still is causing, problems. Extracting data from source systems and transforming it into an integrated whole is a monumental effort. The latency caused by the load and refresh cycle devalues data. The scale of data warehouses grew exponentially, leading to spiraling costs and performance problems. The industry divided over polarizing issues of database design, architecture, and methodology. Most of all, the data warehouse's purpose changed dramatically from a tool for programmers to an analytical environment for knowledge workers. Enter business intelligence.

Business Intelligence

The terms "data warehousing" and "business intelligence" are like binary star systems: They orbit each other, but from a distance, they appear to be a single entity. Because the subjects of data warehousing and business intelligence are somewhat aligned with EDM, understanding the differences is important, as is knowing how each evolved, what problems they were designed to solve, and how well they performed.

In its purest form, business intelligence is supposed to be part of decision making in an organization. Phrases such as "better data for improved decision making," "getting the right information to the right person at the right time," and "the single version of the truth" are common. The focus is on informing people with data; those people then go about making decisions. Business intelligence is designed to inform people and provide a way for them to evaluate information in retrospect. A branch of business intelligence now called performance management also has provisions for forecasting and planning, but in practice, most of performance management is still retrospective analysis and reporting.

Some people refer to the process of people getting and using information as business intelligence, but the term was coined when many providers, concepts, and methods for today's business intelligence didn't exist. Some fuzziness in the definition of business intelligence is unavoidable. In fact, the term "business intelligence" has more or less subsumed the practice of data warehousing and has become synonymous with a range of functionality that business intelligence vendors offer.

Across all the types of business intelligence (reporting, ad hoc query, online analytical processing [OLAP], dashboards, scorecards, visualization, and even data mining), one common thread is delivering information to people. With transaction systems still relying on human intelligence for decision making and judgment, business intelligence has sold itself to a new group of users by promising to improve decisions. If people must make a decision because the system can't, at least they should make one informed by all the facts or data in the data warehouse. Even business activity monitoring (BAM), which catches data on the fly and applies some simple rules to decide whether to process data or let it go, ultimately alerts people through visual devices, as sketched below. Some BAM products are capable of bypassing people in the loop and alerting applications, services, or processes directly, but this feature is unusual. Most implementations update digital dashboards for people to view and navigate.
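
The BAM pattern just described is small enough to sketch. The following Python fragment (the event fields, threshold, and the notify_dashboard stand-in are hypothetical) applies a simple rule to each event as it arrives and alerts a person rather than another system.

```python
# A minimal sketch of the BAM pattern described above: apply a simple rule
# to each event as it arrives and alert a person (e.g., via a dashboard).
# Event fields, the threshold, and notify_dashboard() are hypothetical.

LOW_STOCK_THRESHOLD = 25

def notify_dashboard(message: str) -> None:
    # Stand-in for updating a digital dashboard or sending a visual alert.
    print(f"ALERT: {message}")

def monitor(event_stream):
    for event in event_stream:
        # The "simple rule": process the event or let it go.
        if event["type"] == "inventory" and event["on_hand"] < LOW_STOCK_THRESHOLD:
            notify_dashboard(
                f"Item {event['sku']} low: {event['on_hand']} units on hand"
            )
        # Most BAM products stop here; bypassing people and invoking another
        # application or process directly is the unusual case.

monitor([
    {"type": "inventory", "sku": "A-100", "on_hand": 12},
    {"type": "inventory", "sku": "B-200", "on_hand": 300},
])
```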

Operational Business Intelligence

As the gap between operational and analytical processes closes, thinking about the role of business intelligence in this new landscape is natural. Latency in business intelligence is an issue because data warehouses are updated monthly, weekly, and often daily, but rarely within a business day. Operational reporting or monitoring that requires fresher data is a problem. The concept of a data warehouse is based on layers of schema that are updated in batch offline and then conformed, indexed, preaggregated, and otherwise tuned and dispatched. Trickling in new data in real time disrupts this model.

Another problem is that after data is culled from its source and integrated into the data warehouse, the connection to its source can be lost. Although vendors have invested heavily in maintaining these links, persistent connectivity problems make data warehouses a poor candidate for single-source reporting. The solution was a sort of kludge called the operational data store. It violated many data warehouse principles, even though it was still considered part of the data warehouse. In time, operational data stores were also integrated, so they lost their capability to tie back to the original systems. Because of the transformation that happens with integration, they inherited much of the same latency as data warehouses.

The need for more timely analysis of data is strong and growing, however. The business intelligence industry is working toward solutions of operational business intelligence, but that definition is typically limited to daily or "intraday" data.3 So for the time being, operational business intelligence is characterized as having the same output (to people) and using the same data model (the data warehouse) but taking place a little more often than daily. True real-time data monitoring has been scaled back in business intelligence to a watery concept called "right time," which means as fast as the data warehouse can handle it. Some vendors can trickle data into a data warehouse, but the answer to operational business intelligence clearly lies in the ability to get closer to business processes as composite applications or from message queues. In any case, the lingering shadow of the data warehouse plus people-based delivery of business intelligence will keep business intelligence from participating in making your systems smarter, but it can be a valuable contributor.

Data Mining, Predictive Reporting, and Operations Research

To improve the value of reporting and analysis tools for knowledge workers, some organizations have adopted data-mining technologies. Data mining involves using powerful mathematical techniques to analyze huge volumes of data, such as the contents of a data warehouse, and extract meaningful insight. This insight might mean finding out how two products are related in their sales patterns or what customer characteristics make customers susceptible to an offer. Association rules, likelihood scores, and trend estimates are typical forms these insights take.
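
As a concrete, if toy, illustration of the association-rule idea, the following Python sketch (the market baskets are made up) counts how often pairs of products sell together and reports the standard support and confidence measures.

```python
# A minimal sketch of association rules: from market-basket data, measure
# how often two products sell together. The baskets are made up.
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(baskets)
for (a, b), together in pair_counts.items():
    support = together / n                    # how common the pair is overall
    confidence = together / item_counts[a]    # P(b in basket | a in basket)
    if support >= 0.4:
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```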

Although much data mining is aimed at understanding historical data, some is focused on predictive analysis. What customer actions predict whether they will cancel a contract? What kinds of customers are likely to respond well to a store redesign? Which suppliers will have credit issues in the future? This kind of predictive reporting can increase the value of analysis by giving knowledge workers a sense of data's future implications. Generally, it doesn't prescribe actions or lend itself to causing change in systems. Most organizations have applied it only to making reporting and analysis more sophisticated. Predictive reporting improves the value of your data but doesn't make your systems smarter.
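
The predictive side can be sketched just as briefly. The fragment below (the features, data, and customer IDs are invented) fits a simple scikit-learn model to historical cancellation data and scores current customers; the output is a likelihood score for a report, and nothing in it takes any action on its own.

```python
# A minimal sketch of predictive reporting: score the likelihood that a
# customer cancels, based on historical data. Features and data are invented,
# and the output only informs a person; it doesn't change any system.
from sklearn.linear_model import LogisticRegression

# Historical data: [support_calls, months_as_customer], 1 = cancelled
X_history = [[5, 3], [1, 40], [7, 6], [0, 55], [4, 10], [2, 30]]
y_history = [1, 0, 1, 0, 1, 0]

model = LogisticRegression()
model.fit(X_history, y_history)

# Score current customers for a "likelihood to cancel" report.
current = {"CUST-101": [6, 4], "CUST-102": [1, 48]}
for cust_id, features in current.items():
    p_cancel = model.predict_proba([features])[0][1]
    print(f"{cust_id}: probability of cancellation = {p_cancel:.2f}")
```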

Operations research grew out of work in the United Kingdom during World War II and has been applied to business problems increasingly over the years. Using a mix of statistical and mathematical techniques, operations research is more focused on finding the right action to take, given current constraints and historical data. With optimization and simulation tools, it allows a modeler to set up a problem and then "solve" it to find the best outcomes, given existing constraints. Although some models can be embedded into systems to make them smarter, the range of problems for which that approach suffices is limited. Therefore, operations research has focused instead on helping people understand possible solutions. Many problems have been targeted and solved with operations research, but it hasn't generally been used in the programs that run organizations day to day.
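
A small, hypothetical example of the operations research style: choose the most profitable product mix subject to capacity constraints, expressed as a linear program and solved with scipy. The products, profits, and capacities are made up, and because linprog minimizes, the profits are negated.

```python
# A minimal sketch of the operations-research approach: set up a problem
# with constraints and "solve" it for the best outcome. The products,
# profits, and capacities below are made up for illustration.
from scipy.optimize import linprog

# Maximize profit 40*x1 + 30*x2 (linprog minimizes, so negate the profits).
profit = [-40, -30]

# Constraints: machine hours 2*x1 + 1*x2 <= 100, labor hours 1*x1 + 2*x2 <= 80
A_ub = [[2, 1], [1, 2]]
b_ub = [100, 80]

result = linprog(c=profit, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
x1, x2 = result.x
print(f"Make {x1:.1f} of product 1 and {x2:.1f} of product 2")
print(f"Best profit given the constraints: {-result.fun:.0f}")
```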

Although data mining and operations research have roles to play in solving these problems, problems with data aren't the only ones organizations have. You need to consider problems with programs as well, as discussed in the following section.
