Applied Bioinformatics Computing: An Introduction
If I have seen further it is by standing on the shoulders of Giants.
--Sir Isaac Newton
Bioinformatics is the study of how information is represented and transmitted in biological systems for myriad practical applications. These applications include creating new drugs, discovering cures for genetic diseases, cloning threatened species, creating new biomaterials for military and civilian applications, and creating high-yield and disease-resistant crops to feed the world's growing population. Bioinformatics is also a field at the fringes, in which advances are made by technologists using technologies from many different disciplines. For example, techniques developed by computer scientists enabled researchers at the commercial Celera Genomics, the public Human Genome Project consortium, and other laboratories around the world to sequence the nearly three billion base pairs that define the 35,000 genes that constitute the human genome. Sequencing the human genome, most of which was performed from 1998 to 2000, would have been virtually impossible without the Internet; high-performance computing; and the combined efforts of mathematicians, statisticians, life scientists, physicists, and computer scientists.
What separates bioinformatics from most other technical fields isn't its reliance on computer technology, but its immediate and real potential to redefine life as we know it. Proponents of the broad field of biotechnology contend that because of developments in bioinformatics and related life sciences, we are on the verge of controlling the coding of all living things, with concomitant breakthroughs in biomedical engineering, therapeutics, and drug development. Many people take the view that in the near future, chemical and drug testing simulations will streamline pharmaceutical development and predict subpopulation and individual response to designer drugs, dramatically changing the practice of medicine. These views seem especially credible when one considers the advances made in nanoscience, nanoengineering, and computing over the past decade.
One of the most obvious practical applications of bioinformatics is in collecting and analyzing data for creating designer drugs (molecules designed to work with maximum effectiveness on a particular person's DNA makeup), as depicted in Figure 1. The process begins with a tissue sample from the patient, which can be as simple as swiping a cotton swab inside the patient's mouth. Next, the sample is processed by a microarray (sometimes referred to as a "gene chip"), which examines an array of tens of thousands of the patient's genes in only a few minutes. Each element or cell in the microarray is an independent experiment on the patient's DNA that can be used to determine, for example, whether the patient will respond favorably to a particular drug, and whether the patient will suffer an allergic response or other undesirable side effect.
Figure 1 Creating patient-specific designer drugs that are free of untoward side effects is one of the early practical applications of bioinformatics.
The microarray is processed so that cells that react to a drug or other molecules absorb marker dyes in varying degrees, which results in a color mosaic pattern. This mosaic pattern is then optically scanned, and the resultant digital image is processed with a pattern recognizer that maps each experiment to the color of its cell in the microarray. The resulting data are then stored in a database in preparation for analysis.
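The mapping from scanned cell colors to per-experiment results can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used by any particular scanner software: it assumes a two-dye (red/green) array, a fold-change threshold of 1.0 on the log2 scale, and a hypothetical `layout` table that records which probe sits in which cell. The probe names and intensity values are invented for the example.

```python
import math

def classify_spots(intensities, layout, threshold=1.0):
    """Map each microarray cell to its experiment and a coarse call.

    intensities: {(row, col): (red, green)} scanned dye intensities (assumed format)
    layout:      {(row, col): probe_name} hypothetical chip layout table
    threshold:   log2 ratio beyond which a cell counts as reacting (assumed value)
    """
    calls = {}
    for (row, col), (red, green) in intensities.items():
        # Add 1 to each channel so that an empty cell cannot cause log(0).
        ratio = math.log2((red + 1) / (green + 1))
        if ratio >= threshold:
            call = "up"
        elif ratio <= -threshold:
            call = "down"
        else:
            call = "unchanged"
        calls[layout[(row, col)]] = (round(ratio, 2), call)
    return calls

# Hypothetical three-cell array; real arrays hold tens of thousands of cells.
layout = {(0, 0): "probe_A", (0, 1): "probe_B", (1, 0): "probe_C"}
intensities = {(0, 0): (1023, 255), (0, 1): (255, 1023), (1, 0): (500, 480)}
calls = classify_spots(intensities, layout)
for probe, (ratio, call) in calls.items():
    print(probe, ratio, call)
```

The resulting dictionary, keyed by probe rather than by grid position, is the kind of record that would then be stored in a database for later analysis.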
A scientist using a networked workstation accesses the microarray data and analyzes them with a variety of tools. In addition, the scientist may extend the analysis by incorporating data from the online biology databases, for example by mining biomedical literature for articles that might explain anomalies found in the patient's data. The scientist may also perform statistical analysis on the microarray data, comparing them to data from similar patients, simulate the interaction of the drug and the patient's proteins, or visualize particular proteins defined by the patient's DNA. The results of the analysis are delivered to a pharmaceutical company, which then uses the data to synthesize a custom drug that exactly fits the patient's needs. Moreover, the drug will be free of unwanted side effects, which are a significant cause of mortality, especially in patients on multiple medications, and of non-compliance by patients who often experience side effects of drug therapy.
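As one hedged illustration of the statistical step, the sketch below compares a patient's expression values with those of similar patients and flags the genes that deviate strongly. It assumes that expression is already summarized as one number per gene, that a simple z-score against the reference cohort is an adequate measure, and that a cutoff of 2.0 is reasonable. The function name, gene names, and values are hypothetical, and a real analysis would also need normalization and correction for multiple testing.

```python
from statistics import mean, stdev

def flag_outlier_genes(patient, cohort, z_cutoff=2.0):
    """Return genes whose value in `patient` is far from the cohort average.

    patient:  {gene: value} for the patient under study (assumed format)
    cohort:   list of {gene: value} dicts for similar patients
    z_cutoff: absolute z-score at which a gene is flagged (assumed value)
    """
    flagged = {}
    for gene, value in patient.items():
        reference = [p[gene] for p in cohort if gene in p]
        if len(reference) < 2:
            continue  # a standard deviation needs at least two values
        spread = stdev(reference)
        if spread == 0:
            continue  # no variation in the cohort, so z is undefined
        z = (value - mean(reference)) / spread
        if abs(z) >= z_cutoff:
            flagged[gene] = round(z, 2)
    return flagged

# Hypothetical log-ratio values for two genes in three similar patients.
cohort = [
    {"gene_X": 1.0, "gene_Y": 0.2},
    {"gene_X": 1.2, "gene_Y": 0.0},
    {"gene_X": 0.8, "gene_Y": -0.2},
]
patient = {"gene_X": 3.0, "gene_Y": 0.1}
flagged = flag_outlier_genes(patient, cohort)
print(flagged)
```

Genes flagged this way are candidates for the follow-up steps described above, such as a literature search for articles that might explain the anomaly.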
Many proponents of bioinformatics and biotechnology in general contend that as the earth's population continues to explode, genetically modified fruits will offer extended shelf life; tolerate herbicides; promote faster growth; be able to grow in harsh climates; and provide significant sources of vitamins, protein, and other nutrients. Fruits and vegetables will be engineered to create drugs to control human disease, just as bacteria have been harnessed to mass-produce insulin for diabetics. To determine whether this future is likely, it's important to understand the historical path biotechnology has taken thus far.
The first wave of non-military biotechnology focused on medicine. It was relatively well-received by the public, perhaps because of the obvious benefits of the technology, and the lack of general knowledge of government-sponsored research in biological weapons. Instead, media stressed the benefits of genetic engineering, reporting that millions of patients with diabetes would have ready access to affordable insulin.
The second wave of biotechnology focused on crops. This had a much more difficult time gaining acceptance, in part because some consumers fear that engineered organisms have the potential to disrupt the ecosystem. As a result, the first genetically engineered whole food ever brought to the U.S. market, the short-lived Flavr Savr™ tomato, was an economic failure when it was introduced in the spring of 1994 (only four years after the first federally approved gene therapy on a patient). However, Calgene's entry into the market paved the way for a new industry that today holds nearly 2,000 patents on engineered foods, from virus-resistant papayas and bug-free corn to caffeine-free coffee beans. It's important to note that the first successful genetically modified crop, a virus-resistant line of tobacco plants, was created in China in 1988.
Today, nearly a century after the first gene map of an organism was published, we're in the third wave of biotechnology. The focus of modern biotechnology (and, by extension, of bioinformatics) is on manufacturing: military armaments made of transgenic spider webs; consumer plastics and other biomaterials made from corn; and stain-removing bacilli. Although biotechnology manufacturing is still in its infancy, it holds promise as a means of avoiding the pollution caused by traditional smokestack factories. This form of technology also remains relatively unnoticed by opponents of genetic engineering.
The modern biotechnology arena is characterized by complexity, uncertainty, and unprecedented scale, all of which contribute to the difficulty in bringing together experts in multiple disciplines. Working without awareness of advances in the field of computer science, life scientists have in some cases developed innovative computational solutions heretofore unknown or unappreciated by the computer science community. In other cases, molecular biologists and other life scientists who have little formal training in computer science, mathematics, and physics have reinvented techniques and rediscovered principles long known to scientists and professionals in these fields. For example, advances in machine learning techniques have been redundantly developed by the microarray community, mostly independently of the traditional machine-learning research community. As a result, valuable time has been wasted in the duplication of effort in both disciplines.
The aim of this article is to introduce computer science professionals to the multidisciplinary field of bioinformatics. Taking a computer science perspective, the field can be segmented into databases, networks, search engines, visualization techniques, statistics, data mining, pattern matching, modeling and simulation, and collaboration. Before considering these areas in more detail, it's important to appreciate the Central Dogma of molecular biology and how it relates to classical information theory.