Technology Overview

Data mining requires a hardware and software infrastructure capable of supporting high-throughput data processing, and a network capable of carrying data from the database to the visualization workstation. With a robust hardware and software infrastructure in place, processes such as machine learning can be used to automatically manage and refine the knowledge discovery and data mining processes. This work can be performed with minimal user interaction once a knowledgeable researcher has established the basic design of the system.

The core technologies that actually perform the work of data mining, whether under computer control or directed by users, provide a means of simplifying the complexity and reducing the effective size of the databases. This focus isn't limited to genome sequences and protein structures, but extends to the wealth of data hidden in the online literature. Advanced text mining methods are used to identify textual data and place them in the proper context.


The infrastructure in a data mining laboratory includes high-speed Internet and intranet connectivity, a data warehouse with a data dictionary that defines a standard vocabulary and data format, several databases, and high-performance computer hardware. Some form of database management system (DBMS) is required to support queries and ensure data integrity. The infrastructure can be based on a central high-performance computer. However, most systems support some form of parallel processing, so intermediate results from one workstation can be fed to another workstation. For example, link analysis performed on one workstation may take as input the regression analysis results from another workstation. The trend toward distributed data mining using relatively inexpensive desktop hardware is largely a reflection of the economics of modern computing. In many cases, the price-performance ratio of desktop hardware is superior to that of mainframe computers.

Pattern Recognition

Data mining involves identifying patterns and relationships in data that often are not obvious in large, complex data sets. This pattern recognition is most often concerned with the automatic classification of character sequences representative of the nucleotide bases or molecular structures, and of 3-D protein structures. From an information processing perspective, pattern recognition can be viewed as a data simplification process that filters extraneous data from consideration and labels the remaining data according to a classification scheme.

As illustrated in Figure 3, the major steps in the pattern recognition and discovery process are feature selection, measurement, processing, feature extraction, classification, and labeling. Given a pattern, the first step in pattern recognition is to select, from the universe of available features, the set of features or attributes that will be used to classify the pattern. Next, the original pattern must be transformed into a representation that can be easily manipulated programmatically. After the data are processed to remove noise, the data are searched for the features defined as relevant to pattern matching. In the classification stage, data are classified based on measurements of similarity with other patterns. The pattern recognition process ends when a label is assigned to the data, based on its membership in a class.

Figure 3 Stages in the pattern recognition process.
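The stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a method described in this article: the two selected features (GC fraction and scaled length), the noise filter, the nearest-centroid classifier, and the reference class values in `CENTROIDS` are all hypothetical choices made for brevity.

```python
from math import dist

# Feature selection (assumed): GC fraction and length / 100.
# Hypothetical reference patterns, one centroid per class label.
CENTROIDS = {
    "gc_rich": (0.70, 0.12),
    "at_rich": (0.30, 0.12),
}

def clean(raw: str) -> str:
    """Measurement and processing: normalize case, drop non-base characters."""
    return "".join(base for base in raw.upper() if base in "ACGT")

def extract_features(sequence: str) -> tuple:
    """Feature extraction: compute the selected features on the cleaned data."""
    if not sequence:
        raise ValueError("no nucleotide bases found")
    gc_fraction = sum(base in "GC" for base in sequence) / len(sequence)
    return (gc_fraction, len(sequence) / 100)

def classify(features: tuple) -> str:
    """Classification and labeling: return the label of the closest centroid."""
    return min(CENTROIDS, key=lambda label: dist(features, CENTROIDS[label]))

def recognize(raw: str) -> str:
    """Run the full pipeline on one raw input string."""
    return classify(extract_features(clean(raw)))

print(recognize("GGCC-GCGC nnGCAT"))
print(recognize("ATATTTAAGC"))
```

Similarity here is plain Euclidean distance; a real system would choose features and a similarity measure suited to its data.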

Machine Learning

The pattern matching and pattern discovery components of data mining are often performed by using machine learning techniques. Machine learning encompasses a variety of methods that represent the convergence of statistics, biological modeling, adaptive control theory, psychology, and artificial intelligence (AI). The spectrum of machine learning technologies applicable to data mining in bioinformatics includes inductive logic programming, genetic algorithms, neural networks, statistical methods, Bayesian methods, decision trees, and Hidden Markov Models.

Inductive logic programming uses a set of rules or heuristics to categorize data. Genetic algorithms are based on evolutionary principles wherein a particular function or definition that best fits the constraints of an environment survives to the next generation, and the other functions are eliminated. Neural networks learn to associate input patterns with output patterns in a way that allows them to categorize new patterns and to extrapolate trends from data. The statistical methods used to support data mining are generally some form of feature extraction, classification, or clustering. Decision trees are hierarchically arranged questions and answers that lead to classification. A Hidden Markov Model (HMM) is a statistical model for an ordered sequence of symbols, acting as a stochastic state machine that generates a symbol each time a transition is made from one state to the next. Transitions between states are specified by transition probabilities.
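To make the HMM description concrete, the following sketch generates a symbol sequence from a two-state model. The state names, the transition table, and the per-state symbol probabilities are hypothetical values invented for illustration; only the mechanism (make a transition, then generate a symbol) follows the description above.

```python
import random

# Hypothetical transition probabilities between two hidden states.
TRANSITIONS = {
    "gc_rich": {"gc_rich": 0.9, "at_rich": 0.1},
    "at_rich": {"gc_rich": 0.2, "at_rich": 0.8},
}

# Hypothetical probability of generating each symbol in each state.
EMISSIONS = {
    "gc_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "at_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def pick(distribution, rng):
    """Draw one outcome from a {outcome: probability} table."""
    outcomes = list(distribution)
    weights = list(distribution.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]

def generate(length, start="gc_rich", seed=0):
    """Return (symbols, state_path) for a sequence of the given length."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    state, symbols, path = start, [], []
    for _ in range(length):
        state = pick(TRANSITIONS[state], rng)        # transition to the next state
        symbols.append(pick(EMISSIONS[state], rng))  # generate a symbol there
        path.append(state)
    return "".join(symbols), path

sequence, states = generate(20)
print(sequence)
print(states)
```

Only `sequence` would be visible to an observer; the state path is the "hidden" part that analysis methods try to infer.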

Regardless of the underlying technology, most machine learning follows the general process outlined in Figure 4. Input data are fed to a comparison engine that compares the data with an underlying model. The results of the comparison engine then direct a software actor to initiate some type of change. This output, whether it takes the form of a change in data or a modification of the underlying model, is evaluated by an evaluation engine, which uses the underlying goals of the system as a point of reference. Feedback from the actor and the evaluation engine directs changes in the model. In this scenario, the goals can be standard patterns that are known to be associated with the input data. Alternatively, the goals can be states, such as minimal change in output compared with the system's previous encounter with the same data.

Figure 4 The machine learning process.
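One simple way to see this loop in code is a perceptron, used here only as a stand-in for "the underlying model"; the article does not prescribe it. The function names `compare`, `act`, and `evaluate` are hypothetical labels that mirror the comparison engine, actor, and evaluation engine, and the goals are known labels supplied with each input.

```python
def compare(model, features):
    """Comparison engine: score the input against the current model."""
    weighted = sum(w * x for w, x in zip(model["weights"], features))
    return weighted + model["bias"]

def act(score):
    """Actor: turn the comparison result into an output (a class label)."""
    return 1 if score > 0 else 0

def evaluate(output, goal):
    """Evaluation engine: measure the output against the goal (0 = goal met)."""
    return goal - output

def train(examples, epochs=20, rate=1.0):
    """Feed each input through the loop and let feedback adjust the model."""
    model = {"weights": [0.0, 0.0], "bias": 0.0}
    for _ in range(epochs):
        for features, goal in examples:
            error = evaluate(act(compare(model, features)), goal)
            # Feedback: nudge the model toward the goal when the output missed.
            model["weights"] = [
                w + rate * error * x for w, x in zip(model["weights"], features)
            ]
            model["bias"] += rate * error
    return model

# Toy goal pattern (assumed): output 1 only when both inputs are 1.
EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

model = train(EXAMPLES)
for features, goal in EXAMPLES:
    print(features, act(compare(model, features)), goal)
```

The fixed epoch count is a simplification; the "minimal change in output" goal mentioned above would correspond to stopping once a full pass produces no model updates.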

Text Mining

The primary store of functional data that links clinical medicine, pharmacology, sequence data, and structure data is in the form of biomedical documents in online bibliographic databases such as PubMed. Mining these databases is expected to reveal the relationships between structure and function at the molecular level and their relationship to pharmacology and clinical medicine.

However, text mining (automatically extracting these data from documents published as unstructured free text, often in several languages) is a nontrivial task. Although computer languages such as LISP (LISt Processing) have long been used for symbolic and text processing, working with free text remains one of the most challenging areas of computer science. This is primarily because, unlike the sequence of amino acids in a protein, natural language is ambiguous and often references data not contained in the document under study. For example, a research article on the expression of a particular gene in PubMed may contain numerous synonyms, acronyms, and abbreviations. Furthermore, despite editing to constrain the sentences to proper English (or other language), the syntax (the ordering of words and their relationships to other elements in phrases and sentences) is typically author-specific. The article may also reference an experimental method that isn't defined because it's assumed to be common knowledge among the intended readership. In addition, text mining is complicated by the variability in how data are represented in a typical text document. Data on a particular topic may appear in the main body of text, in a footnote, in a table, or embedded in a graphic illustration.

The most promising approaches to text mining online documents rely on natural language processing (NLP), a technology that encompasses a variety of computational methods ranging from simple keyword extraction to semantic analysis. The simplest NLP systems work by parsing documents and tagging those that contain recognized keywords such as "protein" or "amino acid". The contents of the tagged documents can then be copied to a local database and later reviewed.
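A keyword tagger of this simplest kind might look like the following sketch. The document IDs and texts are made up, and matching is a plain case-insensitive substring test, so "proteins" also counts as a hit for "protein".

```python
KEYWORDS = ("protein", "amino acid")

def tag_documents(documents, keywords=KEYWORDS):
    """Return {doc_id: [matched keywords]} for documents with at least one hit."""
    tagged = {}
    for doc_id, text in documents.items():
        lowered = text.lower()
        hits = [keyword for keyword in keywords if keyword in lowered]
        if hits:
            tagged[doc_id] = hits
    return tagged

# Hypothetical documents for illustration only.
DOCUMENTS = {
    "doc1": "Each amino acid in the protein was sequenced.",
    "doc2": "Network bandwidth limits visualization performance.",
    "doc3": "Proteins fold into 3-D structures.",
}

print(tag_documents(DOCUMENTS))
```

The tagged IDs are what would then be used to copy document contents to a local database for review.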

More elaborate NLP systems use statistical methods to recognize not only relevant keywords, but also their distribution within a document. In this way, it's possible to infer context. For example, an NLP system can identify documents with the keywords "amino acid", "neurofibromatosis", and "clinical outcome" in the same paragraph. The result of this more advanced analysis is a set of document clusters, each of which represents data on a specific topic in a particular context.
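The same-paragraph check from the example can be sketched as follows. This covers only the co-occurrence test, not the statistical modeling or clustering a full system would add, and it assumes paragraphs are separated by blank lines.

```python
def paragraphs_with_all(text, keywords):
    """Return the paragraphs in which every keyword appears (case-insensitive)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    wanted = [keyword.lower() for keyword in keywords]
    return [p for p in paragraphs if all(k in p.lower() for k in wanted)]

# Hypothetical document text for illustration only.
DOCUMENT = (
    "Neurofibromatosis was studied in a small cohort.\n\n"
    "A single amino acid substitution was linked to neurofibromatosis "
    "and to a poorer clinical outcome.\n\n"
    "Clinical outcome was measured at twelve months."
)

KEYWORDS = ("amino acid", "neurofibromatosis", "clinical outcome")
matches = paragraphs_with_all(DOCUMENT, KEYWORDS)
print(len(matches))
```

Documents could then be grouped by which keyword sets co-occur in them, which is one crude route to the clusters described above.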

The most advanced NLP systems work at the semantic level, analyzing how meaning is created by the use and interrelationships of words, phrases, and sentences in a document.

The processing phase of NLP involves one or more of a variety of the following techniques:

  • Stemming—Identifying the stem of each word. For example, "hybridized", "hybridizing", and "hybridization" would be stemmed to "hybrid". As a result, the analysis phase of the NLP process has to deal with only the stem of each word, not every possible permutation.

  • Tagging—Identifying the part of speech represented by each word, such as noun, verb, or adjective.

  • Tokenizing—Segmenting sentences into words and phrases. This process determines which words should be retained as phrases and which ones should be segmented into individual words. For example, "Type II Diabetes" should be retained as a word phrase, whereas "A patient with diabetes" would be segmented into four separate words.

  • Core Terms—Significant terms, such as protein names and experimental method names, are identified based on a dictionary of core terms. A related process is ignoring insignificant words such as "the", "and", and "a".

  • Resolving Abbreviations, Acronyms, and Synonyms—Replacing abbreviations with the words they represent, and resolving acronyms and synonyms to a controlled vocabulary. For example, "DM" and "Diabetes Mellitus" could be resolved to "Type II Diabetes", depending on the controlled vocabulary.
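Several of the techniques in the list can be chained into a small preprocessing function. The sketch below is deliberately naive and rests on assumptions: the phrase list, stop-word set, synonym table, and suffix list are hypothetical and cover little more than the examples above, and part-of-speech tagging is omitted because it needs a lexicon or trained model.

```python
import re

PHRASES = ("type ii diabetes",)          # multi-word terms to keep whole
STOP_WORDS = {"the", "and", "a"}         # insignificant words to ignore
SYNONYMS = {                             # hypothetical controlled vocabulary
    "dm": "type ii diabetes",
    "diabetes mellitus": "type ii diabetes",
}
SUFFIXES = ("ization", "izing", "ized")  # naive stemming rules

def resolve(text):
    """Map abbreviations and synonyms to the controlled vocabulary."""
    text = text.lower()
    for variant, preferred in SYNONYMS.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", preferred, text)
    return text

def tokenize(text):
    """Split into words, keeping known phrases together as single tokens."""
    for phrase in PHRASES:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return re.findall(r"[a-z0-9_]+", text)

def stem(token):
    """Strip one known suffix; protected phrases are left unchanged."""
    if "_" in token:
        return token
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Resolve, tokenize, drop stop words, then stem what remains."""
    tokens = tokenize(resolve(text))
    return [stem(token) for token in tokens if token not in STOP_WORDS]

print(preprocess("The DM sample hybridized and the hybridization"))
```

Production systems typically replace the suffix list with an established stemming algorithm and the lookup tables with curated vocabularies.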

The analysis phase of NLP typically involves the use of heuristics, grammar, or statistical methods. Heuristic approaches rely on a knowledge base of rules that are applied to the processed text. Grammar-based methods use language models to extract information from the processed text. Statistical methods use mathematical models to derive context and meaning from words. Often, these methods are combined in the same system. For example, grammar-based methods and statistical methods are frequently combined in NLP systems to achieve better performance than either approach could deliver alone.
