
HIERARCHICAL CLUSTERING

Scatterplots are excellent visual representations because they facilitate rapid and simple comparisons of two datasets. However, it is frequently necessary to identify groups of genes with similar expression profiles across a large number of experiments. The most commonly used technique for finding such relationships is cluster analysis, which is often used to identify genes that may be functionally related. Such clusters often suggest biochemical pathways. Like many other mathematical tools, cluster analysis generates meaningful results only when combined with biochemical insight.

Hierarchical clustering, the most frequently used mathematical technique, attempts to group genes into small clusters and to group clusters into higher-level systems. The resulting hierarchical tree is easily viewed as a dendrogram [[11], [12]].

Most studies involve comparing a series of experiments to identify genes that are consistently coregulated under some defined set of circumstances—disease state, increasing time, increasing drug dose, etc. A two-dimensional grid is constructed with each row corresponding to a different gene sequence and each column to a different set of experimental conditions. Each set of gene-expression levels (each row in the matrix) is compared to every other set of expression levels in a pairwise fashion, and similarity scores are produced in the form of statistical correlation coefficients. These correlation coefficients can be thought of as representing the Euclidean distances between the rows in the matrix. The correlations are ordered, and a node is created between the highest-scoring (geometrically closest) pair of rows—the two gene sequences that were most nearly coregulated across the experiments. The matrix is then modified to represent the joined elements as a single node, and all distances between the newly formed node and the other gene sequences (rows) in the matrix are calculated. It is not necessary to recalculate all correlations because only those involving the two rows joined in the new node have changed. Typically, the node is represented by a link in the dendrogram, the height of the link being inversely proportional to the strength of the correlation (strongly correlated genes are joined through short links). The process of creating proportional links and joining genes into clusters continues until all genes in the experiment have been joined into a single hierarchical cluster through links of appropriate length. If more than two nodes are related by the same correlation coefficient (same geometric distance), the conflict is resolved according to a predetermined set of rules.
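The joining procedure described above can be sketched in a few lines of Python. This is a simplified illustration, not production code: it uses 1 - r (one minus the Pearson correlation) as the distance measure and averages the joined rows to form each new node, which is only one of several possible linkage rules.

```python
def pearson(x, y):
    # Pearson correlation coefficient between two expression profiles.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def hierarchical_cluster(profiles):
    """Repeatedly join the closest pair of clusters (distance = 1 - r)."""
    members = {name: [name] for name in profiles}
    centers = {name: list(v) for name, v in profiles.items()}
    merge_order = []
    while len(members) > 1:
        # Find the highest-scoring (geometrically closest) pair of rows.
        a, b = min(((p, q) for p in members for q in members if p < q),
                   key=lambda pq: 1 - pearson(centers[pq[0]], centers[pq[1]]))
        node = f"({a}+{b})"
        members[node] = members.pop(a) + members.pop(b)
        # Represent the joined elements as a single node (element-wise mean);
        # only distances involving the new node need to be recalculated.
        centers[node] = [(u + v) / 2
                         for u, v in zip(centers.pop(a), centers.pop(b))]
        merge_order.append(node)
    return merge_order
```

On a toy matrix in which two genes are perfectly correlated, those two rows are joined first, mirroring the behavior described for Figure 5-4.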

Advantages of Hierarchical Clustering

It is sometimes meaningful to cluster data at the experiment level rather than at the level of individual genes. Such experiments are most often used to identify similarities in overall gene-expression patterns in the context of different treatment regimens—the goal being to stratify patients based on their molecular-level responses to the treatments. The hierarchical techniques outlined earlier are appropriate for such clustering, which is based on the pairwise statistical comparison of complete scatterplots rather than individual gene sequences. The data are represented as a matrix of scatterplots, ultimately reduced to a matrix of correlation coefficients. The correlation coefficients are then used to construct a two-dimensional dendrogram in the exact same way as in the gene-cluster experiments previously described.

The overall process of constructing a two-dimensional dendrogram using hierarchical clustering data is depicted in Figure 5-4. The example in the figure embodies all the principles of the technique but in a vastly simplified form; expression-profile experiments typically include hundreds, sometimes thousands of genes, and the analysis is almost always more complex than this illustration.

Figure 5-4 Construction of a two-dimensional dendrogram representing a hierarchical cluster of related genes. Each column represents a different experiment, each row a different spot (oligonucleotide sequence) on the microarray. The height of each link is inversely proportional to the strength of the correlation. Relative correlation strengths are represented by integers in the accompanying chart sequence. Genes 1 and 2 are most closely coregulated, followed by genes 4 and 5. The regulation of gene 3 is more closely linked with the regulation of genes 4 and 5 than any remaining link or combination of links. The strength of the correlation between the expression levels of genes 1 and 2 and the cluster containing genes 3, 4, and 5 is the weakest (relative score of 10). (Adapted from: Jeffrey Augen, "Bioinformatics and Data Mining in Support of Drug Discovery," Handbook of Anticancer Drug Development. D. Budman, A. Calvert, E. Rowinsky, editors. Lippincott Williams and Wilkins, 2003.)

Messenger RNA profiling techniques have become a cornerstone of modern disease classification. These advances are especially significant in areas such as oncology and neuroscience, where complex phenotypes have recently been found to correlate with specific changes in gene expression, the result being more precise patient stratification both for clinical trials and treatment. A significant example that illustrates the utility of hierarchical clustering involves the identification of distinct tumor subclasses in diffuse large B-cell lymphoma (DLBCL). Two distinct forms of DLBCL have been identified using hierarchical clustering techniques, each related to a different stage of B-cell differentiation. The fact that the cluster correlates are significant is demonstrated by direct relationships to patient survival rates [[13]].

Disadvantages of Hierarchical Clustering

Despite its proven utility, hierarchical clustering has many flaws. Interpretation of the hierarchy is complex and often confusing; the deterministic nature of the technique prevents reevaluation after points are grouped into a node; all determinations are strictly based on local decisions and a single pass of analysis; it has been demonstrated that the tree structure can lock in accidental features reflecting idiosyncrasies of the clustering rules; expression patterns of individual gene sequences become less relevant as the clustering process progresses; and an incorrect assignment made early in the process cannot be corrected [[14]]. These deficiencies have driven the development of additional clustering techniques that are based on multiple passes of analysis and utilize advanced algorithms borrowed from the artificial intelligence community. Two of these techniques, k-means clustering and self-organizing maps (SOMs), have achieved widespread acceptance in research oncology where they have been enormously successful in identifying meaningful genetic differences between patient populations.

When discussing clustering algorithms, it is essential to recognize the limitations of two- and three-dimensional representations of individual gene-expression values across a collection of experiments. Figure 5-5 depicts a simple analysis composed of two experiments. Each experiment is represented by a dimension in the grid, and clusters of the genes are readily apparent.

Figure 5-5 Visual representation of two gene-clustering experiments. For each transcript, the fluorescence ratio (Cy5/Cy3) is plotted on one of the two axes: the y-axis for experiment 1 and the x-axis for experiment 2. Genes with high fluorescence ratios in both experiments appear farthest from the origin. Genes with similar behavior across the two experiments are closely clustered on the graph. Three distinct clusters are evident in the diagram.

Representations of two or three experiments are relatively straightforward to visualize because they can be plotted on the axes of a simple graph (either x and y axes or x, y, and z axes). Results from more than three experiments are difficult to represent because they cannot be visualized in three dimensions. Figure 5-6, which depicts the results from three experiments, represents the most complex case that can be easily visualized on the axes of a graph. As in the two-dimensional case, each gene occupies a unique position on the graph determined by its fluorescence ratio in each of the three experiments. For example, a gene that exhibits Cy5/Cy3 fluorescence ratios of 3, 5, and 4.5 in the three experiments would be represented by a single point plotted at the coordinate (3, 5, 4.5) in the graph. As in the two-experiment model, absolute distance from the origin correlates with the Cy5/Cy3 ratio, and the distance between points in three-dimensional space is representative of the likelihood that the genes they represent are coregulated across the three experiments.

Figure 5-6 A three-dimensional (three-experiment) gene-clustering analysis containing ten different sequences. Each axis in the drawing represents a different experiment, and each set of expression levels is represented by a single vector defined in three dimensions. As in the two-dimensional case, grouping of the sequences is accomplished by determining the geometric distance between each vector. Higher-dimensional models representing more than three experiments cannot be visualized as single vectors, and so different graphical techniques must be used.

The ten genes used in these experiments are clustered into readily recognizable groups. As mentioned previously, higher-dimensional representations, those containing more than three sets of experimental results, are much more complex to imagine because absolute distances between individual genes and gene clusters do not lend themselves to visual representation. However, despite the complexities associated with visual representation of microarray data across large numbers of experiments, it is always possible to calculate a single vector to represent all the expression values for any gene sequence regardless of the number of dimensions/experiments. It is the distance between these vectors that determines the degree to which a pair of genes is coregulated.
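The distance calculation itself is independent of the number of experiments; a Euclidean distance in n dimensions is computed the same way whether n is 2, 3, or 300. A minimal sketch:

```python
import math

def euclidean(u, v):
    # Distance between two expression vectors of any dimensionality;
    # smaller distances suggest more tightly coregulated genes.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

For the three-experiment example above, the gene with fluorescence ratios 3, 5, and 4.5 is simply the vector (3, 5, 4.5), and its distance to any other gene's vector can be computed directly, no matter how many experiments are added.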

Regardless of the graphical technique used, clustering algorithms can be collectively viewed as a mechanism for defining boundaries that partition vectors into meaningful groups. Datasets with many dimensions are often visualized in a simple two-dimensional plot—time or experiment number on the x-axis and expression ratio on the y-axis. (Expression ratios are normally represented logarithmically.) When many different gene sequences are shown on the same grid, it is often possible to visually identify groups. A two-dimensional grid containing expression-level information for several different genes measured across ten different time points is displayed in Figure 5-7. Although many of the expression curves overlap and individual sequences are difficult to dissect out from the mix, it is clear that the data fall into three distinct clusters of coregulated genes. Representations of this type are often used to set starting conditions for more complex algorithmic clustering procedures such as those described next.

Figure 5-7 A two-dimensional grid containing expression-level information for a large number of genes measured across ten different time points. The expression data represented in these ten experiments reveal three distinct clusters of coregulated genes (1, 2, and 3). (Adapted from: Jeffrey Augen, "Bioinformatics and Data Mining in Support of Drug Discovery," Handbook of Anticancer Drug Development. D. Budman, A. Calvert, E. Rowinsky, editors. Lippincott Williams and Wilkins, 2003.)

K-Means Clustering

K-means clustering is most useful when the number of clusters that should be represented is known. An example might include microarray classification of a group of patients that have morphologically similar diseases that fall into three clinically distinct categories (k=3). The clustering process would proceed as follows [[15]]:

  1. Each expression vector is randomly assigned to one of three groups or clusters (k=3).

  2. An average expression vector (called the center) is calculated for each group, and these vectors are used to compute the distances between groups.

  3. Each gene-expression vector is reassigned to the group whose center is closest. Expression vectors are allowed to remain in a cluster only when they are closer to the center of that cluster than to a neighboring one.

  4. Inter- and intracluster distances are recomputed, and new expression vectors for the center of each cluster are calculated.

  5. The process is repeated until all expression vectors are optimally placed. At this point, any additional changes would increase intracluster distances while decreasing intercluster distances.
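The five steps above translate almost directly into code. The sketch below is a bare-bones illustration (random partition initialization, squared Euclidean distances, a fixed iteration cap); real implementations add refinements such as smarter seeding.

```python
import random

def kmeans(vectors, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly assign each expression vector to one of k clusters.
    assign = [rng.randrange(k) for _ in vectors]
    centers = []
    for _ in range(max_iters):
        # Step 2: compute the average expression vector (center) of each cluster.
        centers = []
        for c in range(k):
            group = [v for v, a in zip(vectors, assign) if a == c]
            if not group:                      # re-seed an empty cluster
                group = [rng.choice(vectors)]
            dims = len(group[0])
            centers.append([sum(g[d] for g in group) / len(group)
                            for d in range(dims)])
        # Step 3: reassign each vector to the cluster whose center is closest.
        new = [min(range(k),
                   key=lambda c: sum((x - y) ** 2
                                     for x, y in zip(v, centers[c])))
               for v in vectors]
        # Steps 4-5: repeat until no vector changes cluster.
        if new == assign:
            break
        assign = new
    return assign, centers
```

Running this on a handful of well-separated expression vectors with k=2 recovers the obvious two-group partition.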

The Advantages of K-Means Clustering

K-means clustering has proven to be a valuable tool for identifying coregulated genes in systems where biochemical or clinical knowledge can be used to predict the appropriate number of clusters. Even when the appropriate number of clusters is unknown, the tool can still be useful if the researcher experiments with different values of k.

The Disadvantages of K-Means Clustering

Despite these advantages, the unstructured nature of the k-means clustering technique tends to proceed in a local fashion, and this effect intensifies as additional clustering centers are added to the analysis. Excessive locality eventually leads to incorrect groupings, and important gene associations can be lost. It follows that as the number of clustering centers is increased, initial placement of the centers becomes increasingly critical; for analyses that involve large numbers of clustering centers, it makes sense to use more-structured techniques. Algorithms based on self-organizing maps solve many of these problems and have demonstrated tremendous utility in anticancer drug discovery and research oncology.

Self-Organizing Maps

The SOM analysis technique bears much resemblance to k-means clustering because both techniques involve an iterative approach to locating the center of each cluster. However, SOM analysis is much more structured, and the user must initialize the system with a specific geometric construct representing the initial location of each cluster center. (SOM cluster centers are referred to as centroids.) More than a simple list of locations, the construct is a complete topology in which the centroids are part of a symmetrical structure. Initial mapping of nodes into the structure is random and is followed by an iterative process that organizes the map. The SOM process is tantamount to selecting a gene-expression vector and adjusting the reference vectors of nearby centroids to make them more compatible with the selected gene; the nearest centroid is adjusted by the largest amount, and other local centroids are moved toward the selected gene by smaller amounts depending on their proximity to the selected point. More specifically, each iteration involves randomly selecting a data point (P) and moving all nodes in the direction of (P) by an amount that is inversely proportional to the distance between the two points. The process typically continues for many thousands of iterations until a well-organized map is complete. A large number of iterations is necessary to achieve a stable map because typical experiments involve many genes, and each point (gene) in the map migrates a relatively small distance during each iteration of the algorithm.
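A single SOM iteration can be sketched as follows. This illustration omits the map topology (the neighborhood relationships between centroids on the grid) and simply decays each centroid's adjustment with its distance from the sampled point; a full SOM also shrinks the learning rate and neighborhood radius over time.

```python
import math

def som_step(centroids, point, rate=0.5):
    """Move every centroid toward the sampled expression vector (P);
    centroids near P are adjusted by the largest amount."""
    updated = []
    for c in centroids:
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(c, point)))
        influence = rate * math.exp(-d)   # falls off with distance from P
        updated.append([a + influence * (b - a) for a, b in zip(c, point)])
    return updated
```

Repeating this step over many thousands of randomly sampled genes gradually drags the centroids into the centers of the gene clusters.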

Many successful experiments have been conducted utilizing a simple two-dimensional 3 × 2 rectangular grid of centroids as the starting point. Alternative structures based on hexagonal rings, grids, and lines have also been used, and each has a pronounced effect on the outcome of the analysis. More structured than k-means clustering and far less rigid than hierarchical techniques, SOM is emerging as the technique of choice for analyzing complex datasets with high-dimensional character and many gene sequences [[16]].

An eight-node SOM based on two experiments is depicted in Figure 5-8.

Figure 5-8 A simple SOM analysis composed of 8 centroids, 51 genes, and 2 dimensions (2 sets of experimental conditions). Large dark spots denote centroids, and small spots represent gene-expression vectors. Successive iterations of the system generate a migration trajectory for each centroid until the best fit is found. Additional experiments require additional dimensions and are difficult to visualize in a drawing. Many other topologies—hexagonal rings, lines, and complex grids—are also possible and have a pronounced effect on the outcome.

A seminal paper on use of SOMs to profile a specific disease state was recently published by Golub and Slonim. The research compared expression profiles from tissue samples recovered from patients with acute myeloid leukemia and acute lymphoblastic leukemia (AML and ALL respectively) using the Affymetrix HU6800 GeneChip [[17]]. The goal was to identify a gene or small number of genes that are consistently up regulated in one disease class and down regulated in the other. After such a grouping is located, cluster analysis can be used to locate "nearby" genes that behave in similar fashion. The array contained probes for 6,817 different human genes, and final results revealed 1,100 that were coregulated in one of the disease classes. Each of these can be thought of as providing a "weighted vote" based on its statistical significance in the analysis. The system is tested on a well-characterized sample (known disease class) that is not included in the training set for the SOM. Of the 1,100 genes identified, 50 that had the highest correlation with class distinction were selected as predictors. These predictors were tested against 38 carefully diagnosed patients. Of the 38 samples, 36 were correctly assigned as belonging to either the AML or ALL group strictly on the basis of up or down regulation of these 50 genes. Two of the 38 predictions were statistically uncertain. The selection of 50 genes was arbitrary in the sense that more than 50 genes displayed expression patterns that were well correlated with one of the disease states. Further refinement of the technique might include additional genes; in practice it often makes sense to adjust the predictor set after the function of each of the genes is known. Such an approach links microarray analysis to further biochemical experimentation and may require extensive analysis to identify functions for previously unknown genes. 
Conversely, if a gene is found to be a strong correlate with a specific disease state, and if the function of that gene is not currently known, it makes excellent scientific sense to select that particular sequence for further analysis. Approaches of this sort are driving the use of microarray analysis as a core component of research oncology.
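The "weighted vote" idea can be sketched as follows. This is an illustration of the general scheme rather than a reproduction of the published method: each predictor gene is scored by how cleanly it separates the two classes (difference of class means scaled by the class standard deviations), and a new sample is classified by the sign of the summed, weighted votes.

```python
def class_distinction(expr, labels, pos="AML"):
    # Separation score for one gene: (mean_pos - mean_neg) / (sd_pos + sd_neg).
    pos_vals = [e for e, l in zip(expr, labels) if l == pos]
    neg_vals = [e for e, l in zip(expr, labels) if l != pos]
    def mean_sd(xs):
        m = sum(xs) / len(xs)
        return m, (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    (mp, sp), (mn, sn) = mean_sd(pos_vals), mean_sd(neg_vals)
    return (mp - mn) / (sp + sn)

def weighted_vote(sample, predictors):
    # predictors: (gene_index, weight, midpoint) triples; each gene casts a
    # vote of weight * (expression - midpoint), and the sign of the total
    # vote selects the class.
    total = sum(w * (sample[i] - mid) for i, w, mid in predictors)
    return "AML" if total > 0 else "ALL"
```

In practice the predictor set would hold the 50 highest-scoring genes, each contributing a vote proportional to its statistical significance.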

Two additional procedural points are important to note:

  • All expression data should be passed through a variation filter to remove genes that do not show significant up or down regulation. This additional step prevents centroids from being attracted to groups of inactive genes.

  • Expression levels must be normalized across all experiments to focus attention on the shape of the expression curves rather than absolute expression levels.
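Both preprocessing steps are easy to sketch. The thresholds below (a twofold variation filter, per-gene standardization to mean 0 and standard deviation 1) are illustrative choices, not prescribed values:

```python
def preprocess(profiles, min_fold=2.0):
    # Variation filter: discard genes whose expression never varies by at
    # least min_fold across the experiments (inactive genes would otherwise
    # attract centroids).
    kept = {g: v for g, v in profiles.items()
            if min(v) > 0 and max(v) / min(v) >= min_fold}
    # Normalization: rescale each gene so that curve shape, not absolute
    # expression level, drives the clustering.
    out = {}
    for g, v in kept.items():
        m = sum(v) / len(v)
        sd = (sum((x - m) ** 2 for x in v) / len(v)) ** 0.5 or 1.0
        out[g] = [(x - m) / sd for x in v]
    return out
```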

Supervised Neural Networks

As previously discussed, neural networks are algorithmic approaches to machine learning that are designed to simulate the input/output structure of biological neurons. Like their biological counterparts, neural networks learn by adjusting weighting functions and combining outputs to create trigger situations when inputs exceed certain thresholds. SOM clustering algorithms are members of a broad class of "unsupervised" machine learning algorithms intended to provide an alternative to more rigid "supervised" approaches built on specific sets of assumptions. The system proceeds through thousands of iterations of learning until a solution is discovered that optimally places the centroids in the centers of gene clusters.

Supervised neural networks (SNNs) are similar in their iterative approach to learning, but proceed until a predefined endpoint determined by the user is reached. The simplest case of a SNN is the single two-dimensional matrix, referred to as a perceptron, which we reviewed in Chapter 4 [[18]]. Values contained in the matrix are used as the basis of a scoring system whose ultimate goal is to distinguish between two sets of data. An iterative process is used to score individual members of each dataset and make modifications to the matrix until a satisfactory endpoint is reached. More complex systems use multiple layers of perceptrons; scores in the first layer are combined and used as triggers for the second layer, and so on through as many layers as are built in to the system. Modern neural networks often use three layers of perceptrons—an input layer, a hidden layer that combines data from the input layer, and a final output layer. Each iteration involves comparison to the predetermined endpoint, and a decision is made regarding modification of one of the layers.
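A single perceptron, the building block described above, fits in a few lines. This sketch uses the classic threshold rule with labels of +1 and -1; multilayer networks chain such units together and train them with more elaborate update rules (backpropagation).

```python
def train_perceptron(samples, labels, epochs=50, rate=0.1):
    # labels are +1 or -1; weights are nudged whenever a sample is misclassified.
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if out != y:
                w = [wi + rate * y * xi for wi, xi in zip(w, x)]
                b += rate * y
    return w, b

def classify(w, b, x):
    # Trigger when the weighted sum of inputs exceeds the threshold.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

On a linearly separable two-class dataset, the iterative updates converge to a weight matrix that scores every training sample correctly, which is exactly the "satisfactory endpoint" described above.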

One of the most important features of multilayer perceptron-based SNNs is that they can be designed to perform multiset classifications of complex datasets. Recent experiments have demonstrated the utility of this approach for studying gene-expression signatures from multiclass microarray data. The data represented gene clusters for major metabolic pathways. One of the most intriguing outcomes of the research involved the detailed examination of "false positives"—results that seemed to artificially cluster genes not thought to be directly related. However, deeper analysis revealed that these false positives actually represented related metabolic pathways with overlapping enzymatic reactions [[19]]. Biological systems are logically complex in the sense that transitivity rules often fail: If genes A and B belong to the same class and genes B and C are also of the same class, it does not necessarily follow that A and C belong to the same class.

This effect can confound the learning process for neural networks. The problem is one of separating two distinct classes of expression profiles when those profiles both display some level of overlap with a third biologically distinct class. For example, consider three genes—A, B, and C—belonging to three distinct biological classes. If the class that includes gene A is considered to be the positive group, and the class that contains gene C is the negative group, then, in effect, when the network learns the signature of the class that includes A it will also learn the signature of gene B; thus genes with expression profiles similar to that of B but belonging to the biological class containing gene C may be erroneously identified as a member of the class that includes gene A. This result amounts to identification of a false positive.

A false negative can also occur in the reverse direction if gene B is mathematically recruited into the solution set built around the class containing gene C (the negative group in the learning exercise). The resulting network would likely identify certain members of the class containing gene A as belonging to the class containing gene C, thus providing a false negative result.

It is important to note that any recognition problem that can be solved by a neural network can be solved in three layers. However, neural networks containing more than three layers can sometimes converge on a solution more quickly if provided with a sufficient amount of training data.

Serial Analysis of Gene Expression (SAGE)

Serial Analysis of Gene Expression (SAGE) is a sequence-based gene-expression analysis tool that incorporates both a discovery engine and a mechanism for quantification of the level of expression. Microarray technology necessarily relies on a large body of sequence information that must be used to direct construction of the physical array. In situations where new sequences are likely to be discovered, the array must be constructed with an extremely large number of sequences, possibly covering the entire genome. Because most microarray experiments involve very short sequences, several per known coding region, the number of spots required for an unbounded experiment is likely to be enormous. Clustering enormous numbers of spots can be difficult and confusing. For these and other reasons related to time, cost, and complexity, it is often desirable to use a technique such as SAGE that can reveal the identity of every relevant message regardless of the number of transcripts produced. This information can then be used to construct more focused microarray experiments that contain only relevant sequences.

SAGE is designed to measure the expression of a small "tag" rather than the entire transcription product of a gene. The tag, defined as a ten-base-pair region directly adjacent to the 3' side of the CATG site closest to the 3' end of the transcript, is generated through a process of well-defined enzymatic digestions and oligonucleotide purification steps. The complex process of generating SAGE tags is as follows [[20]]:

  1. In the first step, double-stranded cDNA is made from purified mRNA using a biotinylated oligo (dT) primer. The result is a length of double-stranded cDNA with an exposed 3' end and a biotin moiety attached at the 5' end. Digestion with the endonuclease Nla III is then used to expose a defined location—5'...CATG...3'—within each cDNA that will be used for excision of the adjoining SAGE tag. (Nla III cleaves DNA immediately adjacent to the 3' side of the sequence CATG.) Ultimately, the SAGE tag will be the ten bases immediately adjoining the Nla III cut site.

  2. Biotinylated 3' cDNAs are then bound to streptavidin-coated magnetic beads and affinity-purified.

  3. The 5' termini of the purified cDNAs are separated into two pools and attached to adaptors at their 5' ends using oligo duplexes that encode the Nla III CATG cohesive overhang, a type IIs recognition sequence (which will be recognized by another restriction enzyme and used to release the tag), and a polymerase chain reaction (PCR) primer sequence.

  4. The adapted cDNAs are then digested at the type IIs recognition sequence with BsmFI (also known as a tagging enzyme). BsmFI cleaves 14 to 15 base pairs 3' of its recognition sequence, releasing the linker-adapted SAGE tag from each cDNA.

  5. The linker-adapted SAGE tags from each pool are then repaired using DNA polymerase, combined, and ligated using T4 DNA ligase. The resulting linker-adapted "ditags" are amplified by PCR and digested one more time with Nla III to release the primer adaptors.

  6. The SAGE ditags are purified prior to being polymerized using T4 DNA ligase. Repaired ditags are then size-selected and cloned into a high-copy plasmid vector pending sequencing.

Figure 5-9 illustrates these steps.

Figure 5-9 Serial analysis of gene expression (SAGE). A complex series of biochemical reactions is required to produce a sequence composed of short (ten-base-pair) tags derived from each transcript in the mix. The steps are: (1) Synthesis of biotinylated double-stranded cDNA; (2) Restriction-enzyme digestion of cDNA and exposure of the 3'-most CATG fragment; (3) Addition of adaptors and primer sequences for PCR coupled with excision of a ten-base-pair tag from each cDNA sequence; (4) Formation of ditags (sequences containing two adjoining tags) followed by amplification and removal of the adaptors and primer sequences; (5) Concatenation of all tags into a single double-stranded DNA followed by sequencing.
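The tag-excision rule, reduced to its computational essence, is a simple string operation. The sketch below ignores all the enzymology and simply extracts the ten bases on the 3' side of the CATG site closest to the 3' end of a cDNA sequence:

```python
def sage_tag(cdna, anchor="CATG", tag_len=10):
    # Locate the anchoring-enzyme site (Nla III recognizes CATG) closest to
    # the 3' end, then return the tag_len bases immediately 3' of it.
    pos = cdna.rfind(anchor)
    if pos == -1 or pos + len(anchor) + tag_len > len(cdna):
        return None                      # no usable tag in this sequence
    start = pos + len(anchor)
    return cdna[start:start + tag_len]
```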

Because each tag represents a small (ten-base-pair) portion of an existing transcript, SAGE analysis is strongly dependent on the availability of high-quality genome sequence data. Newly identified and sequenced tags have little meaning outside the context of such data. High-throughput sequencing of whole genomes is, therefore, an important adjunct to the SAGE technique. High-quality sequence data combined with SAGE represents a powerful strategy for identifying and counting transcripts within a population of cells. Ultimately the number of tags counted for each gene will be proportional to the number of copies of each message present in the original sample.
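Quantification then reduces to counting. Given the tags read out of the sequenced concatemers, relative abundance is simply each tag's share of the total, as sketched here:

```python
from collections import Counter

def transcript_abundance(tags):
    # The number of times a tag appears is proportional to the number of
    # copies of the corresponding message in the original sample.
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}
```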

The simplicity of SAGE analysis makes it an excellent tool for measuring the appearance and disappearance of a small number of highly regulated transcripts under a well-characterized set of conditions. Table 5-2 displays the results of such an experiment. Each entry in the table corresponds to a transcript whose expression is up regulated in the presence of an active form of the p53 tumor suppressor gene (data from rat fibroblast cells). p53, known to be inactive in many human cancers, has become the subject of intense scrutiny. Although expression of the gene is known to induce either stable growth arrest or programmed cell death (apoptosis), the mechanism underlying development of p53-dependent apoptosis remains incompletely understood [[21]]. SAGE is an appropriate technique for such analysis because the transcripts each come from well-characterized genes that have been completely sequenced. In addition, each of the ten genes identified in the experiment is up regulated by at least twofold (200%). As in other expression-profiling techniques, SAGE is most effective when the quantification spans a limited range.

Table 5-2. SAGE Analysis of Genes Induced in the Presence of an Active Form of p53 in the Rat Fibroblast

Gene

Percent Up Regulation (+p53 / –p53)

MDM2

300%

HSP70

500%

WAF1

>500%

Cyclin G

2500%

BAX

200%

CGR11

1200%

CGR19

>500%

EGR1

3000%

Shabin 80

>2000%

Shabin 123

>500%

Source: Madden et al. (2000) Drug Discovery Today 5(9), 415–425.

Similar studies in humans have revealed the up regulation of 14 transcripts, 8 of which are known to code for proteins involved in cellular responses to oxidative stress. These observations stimulated additional biochemical experimentation that ultimately suggested a three-step process for p53-induced apoptosis: (1) transcriptional induction of redox-related genes; (2) formation of reactive oxygen species; and (3) oxidative degradation of mitochondrial components. Leakage of calcium and proteinaceous components from the damaged mitochondria stimulate caspases that are ubiquitously activated during the apoptotic process. Table 5-3 lists these transcripts.

Table 5-3. Eight Genes Significantly Up Regulated in the Presence of p53 in Human Epithelial Cancer Cells

Gene

Function and Reactive Oxygen Species (ROS) Effect

p21

CDK inhibitor (induced by ROS)

PIG1

Galectin 7 (enhancer of superoxide production)

PIG3

Quinone oxidoreductase homologue (ROS generator)

PIG4

Serum amyloid A (induced by ROS)

PIG6

Proline oxidase homologue (glutathione biosynthesis)

PIG7

TNFa-induced mRNA (ROS induces TNFa)

PIG8

Etoposide-induced mRNA (quinone-causing ROS)

PIG12

GST homologue (induced by ROS)

Source: Polyak et al. (1997) Nature 389, 300–305.

The use of a single relatively short tag to identify each transcript often creates ambiguity, which is exactly why SAGE results must be interpreted in the context of other biochemical information. As one might expect, the tags used to identify p53-induced genes are occasionally found in more than one coding region. For example, the sequence GAGGCCAACA, which identified PIG3 (NADPH quinone oxidoreductase homologue), is associated with three different GenBank accession numbers (see Table 5-4).

Table 5-4. SAGE Tags Often Display Redundancy

Tag Sequence    Accession    Description
GAGGCCAACA      AF010309     PIG3 (NADPH quinone oxidoreductase homologue)
GAGGCCAACA      H42923       yo10e11.s1 (EST name) Soares adult brain N2b5HB55Y [*]
GAGGCCAACA      W07320       za94c09.r1 (EST name) Soares fetal lung NbHL19W [3]

Source: sagenet.org, NCBI
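This kind of redundancy is straightforward to detect once tags have been mapped to accessions. The sketch below hard-codes the entries from Table 5-4; in a real pipeline the mapping would be built by scanning a reference sequence database, and the function name is my own:

```python
# Map each SAGE tag to the GenBank accessions it matches.
# These entries come from Table 5-4; a real pipeline would build
# the mapping by scanning a reference sequence database.
TAG_TO_ACCESSIONS = {
    "GAGGCCAACA": ["AF010309", "H42923", "W07320"],
}

def is_ambiguous(tag):
    """A tag is ambiguous when it matches more than one accession."""
    return len(TAG_TO_ACCESSIONS.get(tag, [])) > 1
```

Any tag for which the check returns True needs the kind of corroborating biochemical evidence described above before it can be assigned to a single gene.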

The fact that SAGE and microarray analyses can complement each other has not escaped notice. One popular strategy involves using SAGE to design custom microarrays in lieu of prefabricated designs with predefined target cDNA sequences. This approach has been used effectively to identify genes that are differentially expressed in primary breast cancers, metastatic breast cancers, and normal mammary epithelial cells [[22]]. In addition to identifying known cancer markers—HER-2/neu and MUC-1—the combination of techniques revealed several new genes and pathways not previously linked to breast cancer. Moreover, the new analysis generated significant information about expression-level differences between primary and metastatic disease. These differences have become important clinical markers, thus bridging research and clinical environments. It is likely that complete transcriptional maps will eventually enable researchers to construct families of microarrays containing all relevant sequences for each disease and metabolic state. Accomplishing this goal will take time; SAGE and other analysis techniques are likely to play an important role in the identification of sequences that will become the basis of a complete transcription-level picture of each relevant metabolic state.

Massively Parallel Signature Sequencing

Researchers at Lynx Therapeutics have developed a technology that generates accurate and precise copy counts for every transcript in a sample, regardless of the number of transcripts or the range of copy counts. This patented technology was invented by Sydney Brenner, whose goal was to obtain an accurate accounting of the complete transcriptional profile of a cell—major and minor messages, transient messages, short-lived transcripts, and control sequences.

The strategy involves several innovative approaches to amplifying, cloning, and sequencing millions of transcripts in parallel [[23], [24], [25]]. Two new technologies form the basis of this approach. The first, Megaclone, allows a cDNA copy of each message to be cloned onto the surface of a 5 μm microbead. More specifically, each microbead, through amplification and selection processes that will be described, is ultimately decorated with 100,000 copies of the same sequence. For example, a message present in 20 copies will appear on 20 different microbeads, each carrying 100,000 copies. The second, Massively Parallel Signature Sequencing (MPSS), uses sequence-dependent fluorescent signatures to successively identify four-base segments and ultimately construct sequences from the microbead-bound cDNA. With more than one million microbeads immobilized in a single-layer array inside a flow cell, solvents and reagents can be washed over the samples in each cycle of the process (hence the designation "massively parallel"). Sequence data are ultimately matched against more complete, well-characterized entries in the public database infrastructure.

The combination of Megaclone and MPSS represents a complete departure from analog detection techniques; each transcript in the cell is separately cloned, amplified, and sequenced. Moreover, the technique is fully digital in the sense that each transcript is guaranteed to be sequenced as many times as it appears in the cell without any dependence on comparisons or extrapolations. Because the technique involves precise counting, accuracy is maintained across a very broad range of copy counts. Furthermore, the digital nature of the Megaclone-MPSS strategy facilitates precise comparisons between experiments and eliminates the need to make final adjustments to expression values.
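Because the counts are absolute, comparing two MPSS runs of different sequencing depths reduces to scaling each run to a common denominator such as transcripts per million (tpm), the unit used later in the caption of Figure 5-11. A minimal sketch, where the function name and example signatures are illustrative:

```python
def to_tpm(counts):
    """Convert raw signature counts to transcripts per million (tpm).

    counts: dict mapping signature sequence -> observed copy count.
    Returns a dict mapping signature -> tpm, so that two experiments
    sequenced to different depths can be compared directly.
    """
    total = sum(counts.values())
    return {sig: 1_000_000 * n / total for sig, n in counts.items()}

# Two runs with 10x different depths give the same tpm for a
# transcript that makes up the same fraction of each sample.
exp1 = to_tpm({"GATCACGT": 50, "GATCTTGA": 950})
exp2 = to_tpm({"GATCACGT": 500, "GATCTTGA": 9500})
```

This is the sense in which the technique is "digital": no curve fitting or cross-chip normalization is needed, only a change of denominator.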

Megaclone

The first half of the process—sequence-specific microbead preparation—is outlined here and graphically depicted in Figure 5-10.

  1. All mRNA transcripts contained in the sample are reverse transcribed into cDNA.

  2. A complex mixture of conjugates between cDNA templates and prefabricated oligonucleotide tags is prepared, where the number of different tags is at least 100 times larger than the number of cDNA templates. (The large ratio ensures that every template in the sample is conjugated to a unique tag.) The available tag library includes more than 16.7 million distinct 32mers.

  3. The sample is amplified by PCR, after which the tags are rendered single stranded and combined with a population of microbeads that carry on their surfaces the complete set of complementary tags, or anti-tags. (Each microbead displays a single anti-tag.) Each template-tag combination will find and bind to its corresponding anti-tag microbead in the mixture. Because the population of tags bound to template cDNAs represents only 1% of the total number of tags, only 1% of the microbeads will be loaded with template molecules. This small population of cDNA-loaded microbeads is separated from the remaining unloaded microbeads with a fluorescence-activated cell sorter (FACS). Each loaded microbead displays between 10^4 and 10^5 identical copies of one of the transcripts.

  4. Using a specially fabricated flow cell, as many as two million microbeads are immobilized for sequencing. Because the position of each microbead remains fixed throughout the process, as many as two million distinct sequencing operations can be carried out in parallel. The design of the flow cell provides enough head space for reagent flow across the microbeads while maintaining an intact monolayer. The strategy involves iterative sequencing reactions and position-specific fluorescence measurements. (Ultrafine positional control is achieved through new scanning technology and software that tracks each microbead as sequencing proceeds.)
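The 100-fold tag excess in step 2 can be sanity-checked with a back-of-the-envelope calculation. Assuming tags are assigned at random, the chance that any given template shares its tag with another is roughly the ratio of templates to tags; the numbers below are illustrative, not Lynx's actual figures:

```python
# Rough estimate of tag-collision risk when cDNA templates are
# conjugated to randomly chosen tags from a large library.
def fraction_uniquely_tagged(n_templates, n_tags):
    """Expected fraction of templates whose tag is shared by no other.

    Each of the other (n_templates - 1) templates independently avoids
    this template's tag with probability (1 - 1/n_tags).
    """
    return (1 - 1 / n_tags) ** (n_templates - 1)

# With a roughly 100-fold tag excess, about 99% of templates end up
# with a tag all their own.
frac = fraction_uniquely_tagged(100_000, 10_000_000)
```

Under these assumptions the vast majority of loaded microbeads carry a single sequence, which is what the downstream sequencing chemistry requires.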

05fig10.jpgFigure 5-10 The Megaclone system for creating a large microbead-based cDNA library representing all mRNA transcripts in a cell.

MPSS

MPSS relies on successive iterations of oligonucleotide digestion, ligation of specially coded adaptors, and probe hybridization. More specifically, the process involves digesting the bound cDNA template to expose a four-base single-stranded overhang sequence; ligation of a series of encoded adaptors, each specific for a single residue and one of the four exposed positions; hybridization of one of 16 fluorescent probes to each of the 4 bound adaptors; and identification of the four bound probes and thus the sequence of the four-base overhang. The process is repeated until each of the bound cDNA templates has been sequenced from the first digestion site to the bound 32-base oligonucleotide tag securing the cDNA template to the microbead. The partial sequences are for identification purposes only; they are ultimately used to search publicly available data sources that contain more complete gene sequences. MPSS, depicted in Figure 5-11, is outlined here:

05fig11.jpgFigure 5-11 Massively Parallel Signature Sequencing (MPSS). Sequences of up to two million microbead-bound cDNA templates are determined in parallel using an iterative process of enzymatic cleavage to expose a four-base single-stranded overhang, encoded adaptor ligation, and sequence interrogation by encoded hybridization probes. Sequencing chemistry steps are executed simultaneously across all microbeads in a specially designed flow cell that maintains the positional integrity of each bead while providing enough head space for the flow of solvents and reagents. The dynamic range of MPSS scales from fewer than 10 transcripts per million (tpm) to as many as 50,000 tpm.

  1. Microbead-bound cDNA templates are digested with the restriction enzyme Dpn II to reveal a single-stranded GATC overhang. (Dpn II cleaves on the 5' side of the sequence 5'...GATC...3', leaving GATC exposed as a single-stranded overhang.)

  2. An initiating adaptor ending in the complementary sequence 3'...CTAG...5' is ligated to the overhang. This adaptor also contains a type IIs endonuclease recognition site positioned upstream of the terminating CTAG. Recall that type IIs endonucleases have the unusual property of their cleavage site being separated from their recognition site by a fixed number of bases—in this case, the recognition site within the adaptor is positioned to place the cleavage site just beyond the CTAG at the adaptor-template junction. Cleavage at this position will expose four new bases at the end of the bead-bound cDNA template.

  3. Digestion with the appropriate type IIs endonuclease exposes the first four bases of the cDNA template at the adaptor-template junction. The four exposed bases are presented on a single-stranded overhang, and the initiating adaptor is released by the digestion.

  4. A mixture of encoded adaptors is used to determine the identity of each of the four exposed bases. Each adaptor is specific for one of the four possible DNA bases in one of the four positions of the overhang. Consequently, the exposed four-base single-stranded overhangs can each be expected to attach one of four adaptors. The presence of a large homogeneous population of cDNA templates on each bead guarantees that a significant (and approximately equal) number of each of the four appropriate adaptors will be attached to the templates.

  5. The identity and order of the exposed bases is determined by hybridizing, one at a time, each of 16 decoder probes to the ends of the adaptors. Because four adaptors were bound to the cDNAs, four decoder probes will hybridize—each one will represent a specific base in one of the four positions. Moreover, each of the 16 probes contains a unique fluorescent marker that can be detected by the laser scanner. Because the bound adaptors are specific both for a DNA base and one of the four positions, the fluorescent signatures on each bead will reveal the order and identity of the exposed bases in the overhang.

  6. As in the case of the initiating adaptor, each encoded adaptor contains a type IIs restriction endonuclease site upstream of the four bases being interrogated. The position of the site is set to guarantee that cleavage by the appropriate type IIs endonuclease will expose a new overhang containing the next four bases. The process of binding base- and position-specific encoded adaptors and matching fluorescent probes is repeated to determine the next four bases in the sequence.

  7. The process continues until every remaining base in the microbead-bound cDNA templates is identified. As a result, for each cDNA template, all bases between the GATC cut site and the tag sequence securing the template to the microbead are determined. This information is subsequently used to search other databases for more sequence and functional information about the transcripts. Because all beads were sequenced in parallel, a complete transcriptional map of the sample is revealed. During the sequencing chemistry steps, a database containing information about each microbead was constructed. Counting identical sequences present in the database reveals the number of copies of each transcript in the original sample.
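The readout in steps 4 through 6 amounts to a lookup: each of the 16 decoder probes corresponds to one (position, base) combination, so the set of fluorescent signals observed on a bead determines the four-base overhang. The toy model below captures that logic; the probe labels are invented for illustration:

```python
from itertools import product

# One decoder probe (modeled here as a label) for each of the 16
# (position, base) combinations in a four-base overhang.
PROBES = {(pos, base): f"probe_{pos}{base}"
          for pos, base in product(range(4), "ACGT")}

def decode_overhang(signals):
    """Recover the four-base overhang from observed probe signals.

    signals: set of probe labels that fluoresced on one microbead.
    Exactly one probe per overhang position should light up.
    """
    bases = {}
    for (pos, base), label in PROBES.items():
        if label in signals:
            bases[pos] = base
    assert len(bases) == 4, "expected one signal per overhang position"
    return "".join(bases[i] for i in range(4))

# Four signals, one per position, identify one four-base overhang.
seq = decode_overhang({"probe_0G", "probe_1A", "probe_2T", "probe_3C"})
```

The real chemistry interrogates the probes one at a time with a laser scanner, but the information content per cycle is exactly this: four signals, one four-base read.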

Perhaps the most compelling feature of the Megaclone-MPSS strategy is the immortality of the data. As previously discussed, one of the central goals of transcriptional profiling is the creation of a central repository containing expression data from thousands of different cell types under a variety of conditions. Megaclone-MPSS facilitates the construction of such repositories because copy counts can be directly compared between experiments. Furthermore, complete annotated sequence information is not strictly needed if the goal is to build transcriptional profiles that can serve as diagnostic tools. The diagnostic simply becomes a pattern of up and down regulation of signature sequences that serve as proxies for complete transcripts. Because an MPSS-sequenced signature is typically in the range of 16 to 20 bases, each signature sequence is likely to represent a unique transcript—it is very unlikely that the same signature sequence will be cleaved from two different transcripts and, therefore, disguised in the sequencing process. SAGE, discussed earlier, is built on a similar strategy of constructing tags of limited length that are later used to query databases containing complete annotated sequences.
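The uniqueness argument is simple counting: even a 16-base signature, the short end of the range, can take on more than four billion distinct values, while a cell expresses on the order of tens of thousands of distinct transcripts. A quick check:

```python
def signature_space(k):
    """Number of distinct DNA sequences of length k (four bases per position)."""
    return 4 ** k

# At the short end of the 16-20 base range, the signature space already
# exceeds four billion values, so two transcripts yielding the same
# signature by chance is very unlikely.
space_16 = signature_space(16)
```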

The Megaclone-MPSS technique is not without drawbacks. One notable issue is its inability to provide complete information regarding SNPs and other polymorphisms. Because each signature sequence represents a final Dpn II digestion product (the length of cDNA between the bead-bound tag and the first GATC), any additional sequence information existing beyond the Dpn II cut site is automatically lost. An up regulated transcript containing a SNP outside the signature sequence will simply appear to be up regulated, and the fact that a polymorphism is present will be lost in the analysis. One could envision a more complex situation where a disease group associated with the up regulation of a specific gene contains two different subpopulations: one with a SNP in the gene, and one without. The up regulation would be quantified by MPSS, but the SNP might fall outside the signature sequence region. These data would not be helpful in explaining the presence of two subgroups, but might have some diagnostic value for the overall disease. Moreover, if the subgroups required two distinctly different treatment regimens, MPSS alone would fail as a complete diagnostic. By comparison, microarray analysis would likely identify both the SNP and the up regulation event but might have other problems related to dynamic range if key transcripts were up or down regulated across a broad range of copy counts.

As with all biochemical techniques, Megaclone-MPSS has strengths and weaknesses. In general, if the goal is to achieve very accurate copy counts for all transcripts across a very broad range, then MPSS is likely to be a valuable tool. Conversely, if the goal is to interrogate every gene at the single nucleotide level, possibly to identify SNPs or other small aberrations, microarray technology is likely to represent a better fit. Many situations require both sets of capabilities. The fact that MPSS data is often used to search public databases populated with more complete sequence information obtained through microarray analysis supports this view. The reverse is also true in the sense that microarray analysis can be used to identify interesting transcripts, which can then be quantified at the single-copy-count level with MPSS.

The Splice-Variant Problem

Messenger RNA splice variants can confound almost any high-throughput expression-profiling technique. SAGE and Megaclone-MPSS are especially vulnerable because they rely on partial sequences to identify expressed transcripts. Splice variants that differ only outside the region being studied cannot be distinguished by these techniques. Even when splice variants are distinguishable, subtle differences outside the region of study will not be revealed. Furthermore, using SAGE- or MPSS-derived sequences as probes to search large gene-sequence databases is unlikely to help because basic sequence data is not useful as a predictor of transcriptional splicing. After SAGE or MPSS has revealed the presence of a specific gene, microarray technology can be used to enumerate the complete set of splice variants.

However, it is important to point out that microarray results can also be difficult to interpret in this context because each spot in the array corresponds to a short—approximately 25 bases—portion of a transcript. Although overlaps between the short cDNA templates contained in the array can be used to assemble complete sequences, the presence of unspliced messages and message fragments removed during splicing can confuse the interpretation. In general, it is almost impossible to use microarray, or any other high-throughput technique, to unambiguously identify the specific combinations of exon variants that occur together in individual mRNA molecules [[26]]. The problem of correctly orienting mRNA fragments into splice variants is depicted in Figure 5-12.

05fig12.gifFigure 5-12 The splice-variant problem. Various combinations of mRNA fragments derived from long unspliced transcripts must be assembled and correctly oriented to identify the final mRNA species. The fact that some of the fragments represent shifted reading frames further complicates the analysis. Splice variants 1 and 2 result from a single base frame shift. Removal of the sequence AGACCCAAGA from splice variant 2 eliminates a stop codon that overlaps the 5' side of the splice site and exposes a new stop codon further downstream. As a result, splice variant 3 contains elements from both reading frames. High-throughput analysis techniques would reveal the presence of mRNA sequence elements from all three splice variants even if the cell did not express all three versions of the gene.

Unambiguous identification of individual mRNA isoforms can be obtained only by using low-throughput approaches based on amplification and end-to-end sequencing of individual transcripts. Nuclease protection assays (NPAs) are often used in this context to test for the presence of individual sequence elements in an isolated transcript or population of transcripts. NPA is based on hybridization of single-stranded, well-characterized antisense probes to an RNA sample. After hybridization, remaining unhybridized probes and sample RNA are removed by digestion with a mixture of nucleases. After nuclease inactivation, the remaining double-stranded probe:target hybrids are precipitated and purified for further sequence analysis. NPAs can reveal the presence of specific exons in a mixture of mRNA species. The combination of high-throughput profiling approaches—SAGE, MPSS, microarray, etc.—with a carefully planned sequence of NPAs, can be a powerful solution to the problem of comprehensive mRNA profiling. These approaches are complemented by a growing base of transcription profiles and complete transcriptome maps that are rapidly becoming part of the public database infrastructure for bioinformatics.
