Traditional Data Systems

Traditional data systems, such as relational databases and data warehouses, have been the primary way businesses and organizations have stored and analyzed their data for the past 30 to 40 years. Although other data stores and technologies exist, the majority of business data can be found in these traditional systems, which are designed from the ground up to work with structured data. Characteristics of structured data include the following:

  • Clearly defined fields organized in records. Records are usually stored in tables. Fields have names, and relationships are defined between different fields.
  • Schema-on-write that requires data be validated against a schema before it can be written to disk. A significant amount of requirements analysis, design, and effort up front can be involved in putting the data in clearly defined structured formats. This can increase the time before business value can be realized from the data.
  • A design that reads data from disk and loads it into memory for processing by applications. This architecture is extremely inefficient for large data volumes: the data is extremely large and the programs are small, so the big component must move to the small component for processing.
  • The use of Structured Query Language (SQL) for managing and accessing the data.
  • Relational and warehouse database systems that often read data in 8k or 16k block sizes, loading those blocks into memory for applications to process. Reading large volumes of data in such small blocks is extremely inefficient.
  • Organizations today hold large volumes of data that are not actionable and are not being leveraged for the insight they contain.
  • An order management system is designed to take orders. A web application is designed for operational efficiency. A customer system is designed to manage information on customers. Data from these systems usually resides in separate data silos, but bringing this information together and correlating it with other data can help establish detailed patterns on customers.
  • In a number of traditional siloed environments, data scientists can spend 80% of their time looking for the right data and only 20% doing analytics. A data-driven environment must flip that ratio so data scientists spend most of their time doing analytics.

Every year organizations need to store more and more detailed information for longer periods of time. Increased regulation in areas such as health and finance is significantly increasing storage volumes. This data is often kept on expensive shared storage systems because of the critical nature of the information; shared storage arrays provide features such as striping (for performance) and mirroring (for availability). Managing the volume and cost of this data growth within traditional systems is usually a stress point for IT organizations. Examples of data often stored in structured form include Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), financial, retail, and customer information.

Atomicity, Consistency, Isolation, Durability (ACID) compliant systems and the strategy around them are still important for running the business. A number of these systems were built over the years and support business decisions that run an organization today. Relational databases and data warehouses can store petabytes (PB) of information. However, these systems were not designed from the ground up to address a number of today’s data challenges. The cost, required speed, and complexity of using these traditional systems to address these new data challenges would be extremely high.

Semi-Structured and Unstructured Data

Semi-structured data does not conform to the organized form of structured data but contains tags, markers, or some other method for organizing the data. Unstructured data usually does not have a predefined data model or order. Examples of unstructured data include Voice over IP (VoIP), social media data (Twitter, Facebook), application server logs, video, audio, messaging data, RFID, GPS coordinates, machine sensor data, and so on. This unstructured data is completely dwarfing the volume of structured data being generated. Organizations are finding that this unstructured data, usually generated externally, is just as critical as the structured data stored internally in relational databases; external data about your products and services can be just as important as the data you collect. Every time you use social media or a smart device, you might be broadcasting the information shown in Table 1.1, or more. In fact, smartphones are generating massive volumes of data that telecommunication companies have to deal with.

Table 1.1 Social Media Data That Can Be Used to Establish Patterns

| Environment                   | Personal                           |
| ----------------------------- | ---------------------------------- |
| What are you doing?           | What do you think?                 |
| Where are you going?          | How do you feel?                   |
| Where are you?                | What do you like and not like?     |
| Where have you been?          | What are your driving patterns?    |
| What neighborhood are you in? | What are your purchasing patterns? |
| What store are you visiting?  | Are you at work?                   |
| Are you on vacation?          | What products are you looking at?  |
| What road are you on?         |                                    |

This information can be correlated with other sources of data to predict, with a high degree of accuracy, some of the information shown in Table 1.2.

Table 1.2 Examples of Patterns Derived from Social Media

| Health | Money | Personal | Infrastructure |
| --- | --- | --- | --- |
| Detection of diseases or outbreaks | Questionable trading practices | Is someone a safe driver? | Is a machine component wearing out or likely to break? |
| Activity and growth patterns | Credit card fraud | Is someone having an affair? | Designing roads to reflect traffic patterns and activity in different areas |
| Probability of a heart attack or stroke | Identify process failures and security breaches | Who you will vote for | Activity and growth patterns |
| Are you an alcoholic? | How money is spent | Products in your home | Identify process failures and security breaches |
| The outbreak of a virus | How much money you make | Are you likely to commit a crime? | Driving patterns in a city |
| Purchase patterns | What you do for relaxation | A good place to put a store or business | Products you are likely to buy |
| How you use a website | Brand loyalty and why people switch brands | Probability of loan default | Products you are likely to buy |
| Brand loyalty and why people switch brands | Type of people you associate with | Activities you enjoy | |

In a very competitive world, people realize they need to use this information and mine it for the “business insight” it contains. In some ways, business insight or insight generation might be a better term than big data because insight is one of the key goals for a big data platform. This type of data is raising the minimum bar for the level of information an organization needs to make competitive business decisions.

Causation and Correlation

Storing large volumes of data on shared storage systems is very expensive, so for most of the critical data discussed here, companies have not had the capability to save it, organize it, analyze it, or leverage its benefits. We have lived in a world of causation: detailed information is filtered, aggregated, and averaged, and then used to try to figure out what “caused” the results. After the data has been processed this way, most of its golden secrets have been stripped away; the original detailed records can provide much more insight than aggregated and filtered data. The ever-increasing volume of data, the unstoppable velocity at which it is generated, the complexity of working with unstructured data, and the costs have kept organizations from leveraging the details of the data. This impacts the capability to make good business decisions in an ever-changing competitive environment.
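The difference between the aggregated (causation) view and the detailed (correlation) view can be illustrated with a toy example. This is a minimal Python sketch with hypothetical numbers, not real customer data: two customers have an identical average spend, yet only the detailed records reveal that one of them is clearly lapsing.

```python
# Illustrative only: aggregation hides a pattern that detail preserves.
from statistics import mean

# Hypothetical per-visit purchase amounts for two customers
steady_customer = [50, 50, 50, 50, 50, 50]
lapsing_customer = [120, 100, 60, 15, 5, 0]   # spend is collapsing

# The aggregate view: both customers look identical
assert mean(steady_customer) == mean(lapsing_customer) == 50

def is_declining(purchases):
    """Flag a customer whose recent spend is well below earlier spend."""
    first_half = mean(purchases[: len(purchases) // 2])
    second_half = mean(purchases[len(purchases) // 2 :])
    return second_half < 0.5 * first_half

# The detailed view: a simple trend check spots the churn risk
assert not is_declining(steady_customer)
assert is_declining(lapsing_customer)
```

Once the detail is averaged away, no analysis can recover the churn signal; keeping the atomic records is what makes the correlation possible.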

When you look at large corporations, it is typical to see hundreds or even thousands of relational databases of different types along with multiple data warehouses. Organizations must be able to analyze the data from databases, data warehouses, application servers, machine sensors, social media, and so on together. Data can be organized into repositories, such as data refineries and data lakes, that can store data of all kinds, of different types, and from different sources. This data can be correlated using more data points for increased business value. By processing data from different sources in a single place, organizations can do far more descriptive and predictive analytics. Organizations want not only to predict with high degrees of accuracy but also to reduce the risk in those predictions.

Organizations want to centralize a lot of their data for improved analytics and to reduce the cost of data movement. These centralized data repositories are referred to differently, such as data refineries and data lakes.

  • A data refinery is a repository that can ingest, process, and transform disparate polystructured data into usable formats for analytics. A data refinery is analogous to an oil refinery: with an oil refinery, it is understood how to make gasoline and kerosene from oil. A data refinery is somewhat rigid in the data it accepts for analytics, with an emphasis on making sure garbage data does not enter it. A data refinery can work with extremely large datasets of any format cost effectively.
  • A data lake is a newer concept in which structured, semi-structured, and unstructured data are pooled into one single repository where business users can interact with them in multiple ways for analytical purposes. A data lake is an enterprise data platform that uses different types of software, such as Hadoop and NoSQL, and can run applications with different runtime characteristics. A water lake does not have rigid boundaries; its shoreline can change over time. A data lake is designed with similar flexibility to support new types and combinations of data so they can be analyzed for new sources of insight. This does not mean a data lake should admit any data whatsoever, or it turns into a swamp. Control must be maintained to ensure that only quality data, or data with the potential for new insight, is stored in the data lake; it should not be allowed to flood with just any type of data.

Data Challenges

They say that necessity is the mother of invention. That definitely holds true for data. Banks, governments, insurance firms, manufacturing companies, health institutions, and retail companies all realized the issues of working with large volumes of data, yet it was the Internet companies that were forced to solve them. Organizations such as Google, Yahoo!, Facebook, and eBay were ingesting massive volumes of data that were increasing in size and velocity every day, and to stay in business they had to solve this data problem. Google wanted to be able to rank the Internet. It knew the data volume was large and would grow larger every day. It went to the traditional database and storage vendors and saw that the costs of their software licenses and storage technology were so prohibitive they could not even be considered. So Google realized it needed a new technology and a new way of addressing the data challenges.

Back to the Basics

Google realized that if it wanted to be able to rank the Internet, it had to design a new way of solving the problem. It started with looking at what was needed:

  • Inexpensive storage that could store massive amounts of data cost effectively
  • To scale cost effectively as the data volume continued to increase
  • To analyze these large data volumes very fast
  • To be able to correlate semi-structured and unstructured data with existing structured data
  • To work with unstructured data that had many forms that could change frequently; for example, data structures from organizations such as Twitter can change regularly

Google also identified the problems:

  • The traditional storage vendor solutions were too expensive.
  • When processing very large volumes of data at the level of hundreds of terabytes and petabytes, technologies based on “shared block-level storage” were too slow and could not scale cost effectively. Relational databases and data warehouses were not designed for the new level of scale of data ingestion, storage, and processing that was required. Today’s data scale requires a high-performance super-computer platform that can scale at cost.
  • The processing model of relational databases that read data in 8k and 16k increments and then loaded the data into memory to be accessed by software programs was too inefficient for working with large volumes of data.
  • The traditional relational database and data warehouse software licenses were too expensive for the scale of data Google needed.
  • The architecture and processing models of relational databases and data warehouses were designed to handle transactions for a world that existed 30 to 40 years ago. They were not designed to process the semi-structured and unstructured data coming from social media, machine sensors, GPS coordinates, and RFID, and the solutions that did address these challenges were so expensive that organizations wanted another choice.
  • Business data latency needed to be reduced. Business data latency is the differential between the time data is stored and the time it can be analyzed to solve business problems.
  • Google needed a large single data repository to store all the data. Walk into any large organization and it typically has thousands of relational databases along with a number of different data warehouse and business analysis solutions. All these data platforms stored their data in their own independent silos. The data needed to be correlated and analyzed with different datasets to maximize business value. Moving data across data silos is expensive, requires lots of resources, and significantly slows down the time to business insight.

The solution criteria were as follows:

  • Inexpensive storage. The most inexpensive storage is local storage from off-the-shelf disks.
  • A data platform that could handle large volumes of data and be linearly scalable at cost and performance.
  • A highly parallel processing model that was highly distributed to access and compute the data very fast.
  • A data repository that could break down the silos and store structured, semi-structured, and unstructured data to make it easy to correlate and analyze the data together.

The key whitepapers that were the genesis for the solution follow. They are still recommended reading because they lay down the foundation for the processing and storage models of Hadoop, and they are also insightful because they define the business drivers and technical challenges Google wanted to solve:

  • “The Google File System” (Ghemawat, Gobioff, and Leung, 2003)
  • “MapReduce: Simplified Data Processing on Large Clusters” (Dean and Ghemawat, 2004)
  • “Bigtable: A Distributed Storage System for Structured Data” (Chang et al., 2006)

Solving the Data Problem

Necessity may be the mother of invention, but for something to be created and grow, it needs a culture and environment that can support and nurture it. Look at the Italian Renaissance, a great period in the history of art. Why? In a very condensed area of Europe, artists started studying in childhood, often as young as seven years old, and learned as apprentices to other great artists, with kings and nobility paying for their works. Great artists flourished because a culture existed that allowed individuals with talent to spend their entire lives studying and working with other great artists.

During the industrial revolution there was a great need for stronger materials to construct larger buildings in condensed areas, for faster and more efficient transportation, and for the ability to create products quickly for fast-growing populations, and steel manufacturing and transportation grew almost overnight. The Italian Renaissance, the industrial revolution, and Hadoop all grew from the need, demand, and culture that could promote their growth. Today’s data challenges have created demand for a new platform, and open source is a culture that can provide tremendous innovation by leveraging great talent from around the world in collaborative efforts. The capability to store, process, and analyze information at ever faster rates will change how businesses, organizations, and governments run; how people think; and the very nature of the world created around us.

The Necessity and Environment for Solving the Data Problem

The environment that solved the problem turned out to be Silicon Valley in California, and the culture was open source. In Silicon Valley, a number of Internet companies had to solve the same problem to stay in business, but they needed to be able to share and exchange ideas with other smart people who could add the additional components. Silicon Valley is unique in that it has a large number of startup and Internet companies that by their nature are innovative, believe in open source, and cross-pollinate heavily in a very condensed area. Open source is a culture in which individuals and companies around the world exchange ideas and write software. Larger proprietary companies might have hundreds or thousands of engineers and customers, but open source has tens of thousands to millions of individuals who can write, download, and test software.

Individuals from Google, Yahoo!, and the open source community created a solution for the data problem called Hadoop. Hadoop was created for a very important reason—survival. The Internet companies needed to solve this data problem to stay in business and be able to grow.

Giving the Data Problem a Name

The data problem is being able to store large amounts of data cost effectively (volume), with large ingestion rates (velocity), for data that can be of different types and structures (variety), while ensuring the data is trustworthy enough to provide value (veracity). This type of data is referred to as big data, and these are the Vs of big data. Yet big data is not just volume, velocity, or variety. Big data is the name given to a data context or environment that is too difficult, too slow, or too expensive for traditional relational databases and data warehouses to handle.

What’s the Deal with Big Data?

Across the board, industry analyst firms consistently report almost unimaginable numbers on the growth of data. The traditional data in relational databases and data warehouses is growing at incredible rates, and that growth is by itself a significant challenge for organizations to solve. The cost of storing just the traditional data growth on expensive storage arrays is strangling the budgets of IT departments.

The big news, though, is that VoIP, social media, and machine data are growing at almost exponential rates and are completely dwarfing the data growth of traditional systems. Most organizations are learning that this data is just as critical to making business decisions as traditional data. This nontraditional data is usually semi-structured and unstructured data. Examples include web logs, mobile web, clickstream, spatial and GPS coordinates, sensor data, RFID, video, audio, and image data.

Big data from an industry perspective:

  • All the industry analysts and pundits are predicting massive growth of the big data market. In every company we walk into, one of the top priorities involves using predictive analytics to better understand its customers, itself, and its industry.
  • Opportunities for vendors will exist at all levels of the big data technology stack, including infrastructure, software, and services.
  • Organizations that have begun to embrace big data technology and approaches are demonstrating that they can gain a competitive advantage by being able to take action based on timely, relevant, complete, and accurate information rather than guesswork.

Data becomes big data when its volume, velocity, and/or variety gets to the point where it is too difficult or too expensive for traditional systems to handle; big data is not defined by any specific volume, ingestion velocity, or data type. It is the exponential data growth, however, that is the driving factor of the data revolution.

Open Source

Open source is a community and culture designed around crowdsourcing to solve problems. Many of the most innovative individuals, working for companies or for themselves, help design and create open source software. It is created under open source license structures that can make the software free and the source code available to anyone. Be aware that there are different types of open source licensing. The open source culture provides an environment of rapid innovation and free software, and the hardware is relatively inexpensive because open source typically runs on commodity x86 hardware. MySQL, Linux, Apache HTTP Server, Ganglia, Nagios, Tomcat, Java, Python, and JavaScript are all growing significantly in large organizations. As an example of the rapid innovation, proprietary vendors often ship a major new release every two to three years, whereas Hadoop recently had three major releases in a year.

The ecosystem around Hadoop is innovating just as fast. For example, frameworks such as Spark, Storm, and Kafka are significantly increasing the capabilities around Hadoop. Open source solutions can be very innovative because contributions come from individuals and organizations all around the world. There is increasing participation from large vendors as well, and software teams in large organizations also generate open source software. Large companies such as EMC, HP, Hitachi, Oracle, VMware, and IBM now offer solutions around big data. The innovation driven by open source is completely changing the landscape of the software industry. In looking at Hadoop and big data, we see that open source is now defining platforms and ecosystems, not just software frameworks or tools.

Why Traditional Systems Have Difficulty with Big Data

The reason traditional systems have a problem with big data is that they were not designed for it.

  • Problem—Schema-On-Write: Traditional systems are schema-on-write. Schema-on-write requires the data to be validated when it is written, which means a lot of work must be done before a new data source can be analyzed. Suppose a company wants to start analyzing a new source of unstructured or semi-structured data. It will usually spend 3 to 6 months designing schemas and related structures to store the data in a data warehouse. That is 3 to 6 months during which the company cannot use the data to make business decisions, and by the time the design is completed, the data has often changed again. Data structures from social media, for example, change on a regular basis. The schema-on-write environment is too slow and rigid to deal with the dynamics of semi-structured and unstructured data environments that change over time. The other problem is that traditional systems usually handle unstructured data with Large Object (LOB) types, which are often inconvenient and difficult to work with.
  • Solution—Schema-On-Read: Hadoop systems are schema-on-read, which means any data can be written to the storage system immediately. Data is not validated until it is read, which enables Hadoop systems to load any type of data and begin analyzing it quickly. As a result, Hadoop systems have far lower business data latency than traditional schema-on-write systems, whose design dates back decades. Many companies need real-time processing of data and customer models generated in hours or days rather than weeks or months. The Internet of Things (IoT) is accelerating the data streams coming from different types of devices and physical objects, and digital personalization is accelerating the need to make real-time decisions. Schema-on-read gives Hadoop a tremendous advantage in the area that matters most: analyzing the data faster to make business decisions, especially when working with semi-structured or unstructured data.
  • Problem—Cost of Storage: Traditional systems use shared storage. As organizations start to ingest larger volumes of data, shared storage is cost prohibitive.
  • Solution—Local Storage: Hadoop can use the Hadoop Distributed File System (HDFS), a distributed file system that leverages local disks on commodity servers. Shared storage is about $1.20/GB, whereas local storage is about $0.04/GB. Hadoop’s HDFS creates three replicas by default for high availability, so at about $0.12/GB it is still a fraction of the cost of traditional shared storage.
  • Problem—Cost of Proprietary Hardware: Large proprietary hardware solutions can be cost prohibitive when deployed to process extremely large volumes of data. Organizations are spending millions of dollars in hardware and software licensing costs while supporting large data environments. Organizations are often growing their hardware in million dollar increments to handle the increasing data. New technology in traditional vendor systems that can grow to petabyte scale and good performance are extremely expensive.
  • Solution—Commodity Hardware: It is possible to build a high-performance super-computer environment using Hadoop. One customer was evaluating a proprietary hardware vendor whose solution was $1.2 million in hardware costs and $3 million in software licensing. The Hadoop solution for the same processing power was $400,000 for hardware, the software was free, and the support costs were included. Because data volumes would constantly increase, the proprietary solution would have grown in $500,000 to $1 million increments, whereas the Hadoop solution would grow in $10,000 to $100,000 increments.
  • Problem—Complexity: When you look at any traditional proprietary solution, it is full of extremely complex silos of system administrators, DBAs, application server teams, storage teams, and network teams. Often there is one DBA for every 40 to 50 database servers. Anyone running traditional systems knows that complex systems fail in complex ways.
  • Solution—Simplicity: Because Hadoop uses commodity hardware and follows the “shared-nothing” architecture, it is a platform that one person can understand very easily. Numerous organizations running Hadoop have one administrator for every 1,000 data nodes. With commodity hardware, one person can understand the entire technology stack.
  • Problem—Causation: Because data is so expensive to store in traditional systems, it is filtered and aggregated, and large volumes are thrown out because of the cost of storage. Minimizing the data to be analyzed reduces the accuracy and confidence of the results, and it also limits an organization’s ability to identify business opportunities. Atomic data can yield more insights than aggregated data.
  • Solution—Correlation: Because of the relatively low cost of Hadoop storage, the detailed records are kept in Hadoop’s storage system, HDFS. Traditional data can then be analyzed together with nontraditional data in Hadoop to find correlation points that provide much higher accuracy of analysis. We are moving to a world of correlation because the accuracy and confidence of the results are many times higher than with traditional systems. Organizations are seeing big data as transformational. Companies building predictive models for their customers used to spend weeks or months building new profiles; now these same companies are building new profiles and models in a few days. One company reduced a data load from 20 hours to 3 hours by moving to Hadoop.
  • Problem—Bringing Data to the Programs: In relational databases and data warehouses, data is loaded from shared storage elsewhere in the datacenter. The data must cross wires and switches with limited bandwidth before programs can process it. For many types of analytics that process tens, hundreds, or thousands of terabytes, the computational side’s capability to process data greatly exceeds the available storage bandwidth.
  • Solution—Bringing Programs to the Data: With Hadoop, the programs are moved to where the data is. Hadoop data is spread across all the disks of the local servers that make up the Hadoop cluster, often in 64MB or 128MB block increments. Individual programs, one for every block, run in parallel (up to the number of available map slots; more on this later) across the cluster, delivering a very high level of parallelization and Input/Output Operations per Second (IOPS). This means Hadoop systems can process extremely large volumes of data much faster than traditional systems, and at a fraction of the cost, because of this architecture. Moving the programs (the small component) to the data (the large component) is an architecture that supports extremely fast processing of large volumes of data.
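The schema-on-read contrast described above can be sketched in a few lines. This is an illustrative Python sketch, not Hadoop code; the records and field names (including the `txt_v2` rename) are hypothetical stand-ins for a social feed whose format drifts over time.

```python
# A minimal schema-on-read sketch: raw, heterogeneous records land as-is,
# and structure is imposed only when the data is read for an analysis.
import json

# Ingest: records are stored immediately, with no upfront schema validation.
# Note that the fields differ from record to record, as social feeds often do.
raw_store = [
    '{"user": "a", "text": "great product", "lang": "en"}',
    '{"user": "b", "text": "love it", "geo": [40.7, -74.0]}',
    '{"user": "c", "txt_v2": "meh"}',   # the source changed its format
]

def read_texts(store):
    """Project a text field at read time, tolerating format drift."""
    for line in store:
        record = json.loads(line)
        text = record.get("text") or record.get("txt_v2")
        if text is not None:
            yield record.get("user"), text

assert list(read_texts(raw_store)) == [
    ("a", "great product"), ("b", "love it"), ("c", "meh"),
]
```

When the upstream format changes again, only the read-side projection is updated; nothing already stored has to be migrated, which is the source of the latency advantage described above.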

Successfully leveraging big data is transforming how organizations are analyzing data and making business decisions. The “value” of the results of big data has most companies racing to build Hadoop solutions to do data analysis. Often, customers bring in consulting firms and want to “out Hadoop” their competitors. Hadoop is not just a transformation technology; it has become the strategic difference between success and failure in today’s modern analytics world.

Hadoop’s Framework Architecture

Hadoop is a software solution where all the components are designed from the ground up to be an extremely parallel high-performance platform that can store large volumes of information cost effectively. It handles very large ingestion rates; easily works with structured, semi-structured, and unstructured data; eliminates the business data latency problem; is extremely low cost in relation to traditional systems; has a very low entry cost point; and is linearly scalable in cost effective increments.

A Hadoop distribution is made up of a number of separate frameworks designed to work together, and both the individual frameworks and the Hadoop platform itself are extensible. Hadoop has evolved to support fast data as well as big data. Big data was initially about large batch processing of data; now organizations also need to make business decisions in real time or near real time as the data arrives. Fast data involves the capability to act on the data as it arrives, and Hadoop’s flexible framework architecture supports processing data with different run-time characteristics.

NoSQL Databases

NoSQL databases were also designed from the ground up to work with very large datasets of different types and to perform very fast analysis of that data. Traditional databases were designed to store relational records and handle transactions; NoSQL databases are nonrelational. When records need to be analyzed, it is the columns that contain the important information. The NoSQL name reflects the ways data may be accessed:

  • Without SQL using APIs (“No” SQL).
  • With SQL or other access methods (“Not only” SQL).

When Apache Hive (a Hadoop framework) is used to run SQL against NoSQL databases, those queries are converted to MapReduce jobs and run as batch operations to process large volumes of data in parallel. APIs can also be used to access the data in NoSQL databases for interactive and real-time queries.
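The conversion of a query into parallel batch work follows the MapReduce pattern, which can be sketched conceptually. This is plain Python for illustration, not actual Hive-generated code; the input splits and the word-count job are hypothetical, but the three phases (map, shuffle, reduce) are the ones the framework runs.

```python
# A conceptual MapReduce sketch: map emits key/value pairs from each input
# split, shuffle groups them by key, and reduce aggregates each group.
from collections import defaultdict

def map_phase(split):
    """Emit (word, 1) for every word in one input split."""
    for line in split:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

splits = [["big data big"], ["data lake"]]          # two hypothetical splits
pairs = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(pairs))
assert counts == {"big": 2, "data": 2, "lake": 1}
```

In a real cluster each split is a block on a local disk and the map tasks run on the nodes holding those blocks, which is what makes the batch operation scale.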

Popular NoSQL databases include HBase, Accumulo, MongoDB, and Cassandra, all designed to provide very fast access to and analysis of large datasets. Accumulo is a NoSQL database designed by the National Security Agency (NSA) of the United States, so it has additional security features not currently available in HBase.

NoSQL databases have different characteristics and features: they can be key-value based, column based, document based, or graph based. Each NoSQL database emphasizes different areas of the CAP theorem (Brewer’s theorem), which states that a distributed database can guarantee only two of the following three properties: consistency (all data nodes see the same data at the same time), availability (every request for data gets a response of success or failure), and partition tolerance (the data platform continues to run even if parts of the system are not available). NoSQL databases are often indexed by key, but not all support secondary indexes. Data in NoSQL databases is usually distributed across local disks on different servers, and fan-out queries are used to access it. With NoSQL systems supporting eventual consistency, the data can be stored in separate geographical locations. NoSQL is discussed in more detail in Chapter 2, “Hadoop Fundamental Concepts.”

RDBMS systems enforce schemas, are ACID compliant, and support the relational model. NoSQL databases are less structured (nonrelational): tables can be schema free (the schema can differ from row to row), the databases are often open source, and they can be distributed horizontally across a cluster. Some NoSQL databases are evolving to support ACID, and there are Apache projects, such as Phoenix, that provide a relational database layer over HBase. Schema-free tables are very flexible, even for simple cases such as an order table that stores addresses from different countries requiring different formats. A number of customers start looking at NoSQL when they need to work with a lot of unstructured or semi-structured data, or when they have performance or data ingestion issues because of the volume or velocity of the data.
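The schema-free row idea, including the order-table example above, can be sketched as follows. This is an illustrative Python sketch; the column names and address formats are hypothetical and do not correspond to any particular NoSQL API.

```python
# Illustrative schema-free "table": each row is a map of column names to
# values, so rows in the same order table can carry different address
# formats, with no ALTER TABLE required when a new country appears.
orders = [
    {"order_id": 1, "country": "US",
     "address": {"street": "1 Main St", "city": "Reno", "state": "NV", "zip": "89501"}},
    {"order_id": 2, "country": "JP",
     "address": {"prefecture": "Tokyo", "ward": "Shibuya", "postal_code": "150-0002"}},
    {"order_id": 3, "country": "UK",
     "address": {"street": "10 High St", "town": "Leeds", "postcode": "LS1 1AA"}},
]

def postal_code(row):
    """Read whichever postal-code column this row happens to have."""
    addr = row["address"]
    return addr.get("zip") or addr.get("postal_code") or addr.get("postcode")

assert [postal_code(r) for r in orders] == ["89501", "150-0002", "LS1 1AA"]
```

The flexibility comes at a price: as with schema-on-read generally, every reader must cope with the variation that a rigid schema would have forbidden at write time.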

In-Memory Frameworks

Fast data is driving the adoption of in-memory distributed data systems. Frameworks such as Apache Spark and Cloudera’s Impala offer in-memory distributed datasets that are spread across the Hadoop cluster, and Apache Drill and Apache Tez (driven by Hortonworks) are emerging as additional solutions for fast data.
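The benefit of keeping a dataset in memory can be illustrated with a toy sketch. This is plain Python standing in for the idea, not the Spark or Impala API: a disk-oriented style pays the full storage scan for every query, while the in-memory style loads once and runs many computations against the cached data.

```python
# Toy illustration of in-memory reuse for iterative analytics.
reads_from_storage = 0

def load_from_storage():
    """Stand-in for an expensive scan of HDFS or a database."""
    global reads_from_storage
    reads_from_storage += 1
    return list(range(1_000))

# Disk-oriented style: every query pays the full read again.
total = sum(load_from_storage())
count = len(load_from_storage())
assert reads_from_storage == 2

# In-memory style: one read, then many fast computations over cached data.
cached = load_from_storage()
total, count, maximum = sum(cached), len(cached), max(cached)
assert reads_from_storage == 3   # one additional read served three queries
```

Iterative workloads such as machine learning, which revisit the same dataset many times, are where this load-once, compute-many pattern pays off most.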
