Bioinformatics Online Home

3. Improvements in time and memory efficiency for global alignments

The original Needleman-Wunsch dynamic programming algorithm allowed single gaps and took N x M steps to align two sequences of lengths N and M. Subsequently, Waterman et al. (1976) extended this algorithm to allow for multiple-sized gaps in the alignment, and the alignment then took N x M^2 steps (where N is the length of the shorter sequence); the local alignment algorithm of Smith and Waterman (1981a,b) requires the same N x M^2 number of steps. Gotoh (1982) improved the performance of the dynamic programming algorithm to N x M steps by using the linear relationship for a gap weight wx = g + rx, where the weight for a gap of length x is the sum of a gap opening penalty (g) and a gap extension penalty (r) times the gap length (x), and by simplifying the dynamic programming algorithm. He reasoned that two of the terms that are maximized in the dynamic programming algorithm, designated here Pi,j and Qi,j, depend only on the values in the current and previous row and column, as indicated below.

Improved dynamic programming algorithm of Gotoh (1982)

The similarity score is written as

   Si,j = max { Si-1,j-1 + si,j, Pi,j, Qi,j }, where

   Pi,j = max { Si-x,j - wx } over 1 <= x <= i, and
   Qi,j = max { Si,j-x - wx } over 1 <= x <= j

P may be obtained in a single step, since

   Pi,j = max { Si-1,j - w1, max over 2 <= x <= i ( Si-x,j - wx ) }
        = max { Si-1,j - w1, max over 1 <= x <= i-1 ( Si-1-x,j - wx+1 ) }
        = max { Si-1,j - w1, max over 1 <= x <= i-1 ( Si-1-x,j - wx - r ) }
        = max { Si-1,j - w1, Pi-1,j - r }

where the penalty of a gap of length x is given by,

wx = g + rx

and g is the gap opening penalty and r the gap extension penalty.

Similarly,

Qi,j = max { Si,j-1 - w1, Qi,j-1 - r }
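
The recurrences above translate directly into code. The following is a minimal Python sketch of the Gotoh algorithm for global alignment with the affine gap weight wx = g + rx; the scoring scheme (the match/mismatch values and the penalties g and r) is illustrative, not taken from the text.

```python
# A minimal sketch of the Gotoh recurrences, assuming a simple
# match/mismatch scoring scheme; all parameter values are illustrative.

NEG = float("-inf")

def gotoh_score(a, b, match=1, mismatch=-1, g=2, r=1):
    """Global alignment score with the affine gap weight wx = g + r*x."""
    n, m = len(a), len(b)
    # S[i][j]: best score aligning a[:i] with b[:j]
    # P[i][j]: best score for alignments ending in a gap in sequence b
    # Q[i][j]: best score for alignments ending in a gap in sequence a
    S = [[NEG] * (m + 1) for _ in range(n + 1)]
    P = [[NEG] * (m + 1) for _ in range(n + 1)]
    Q = [[NEG] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0
    for i in range(1, n + 1):            # first column: one gap of length i
        S[i][0] = P[i][0] = -(g + r * i)
    for j in range(1, m + 1):            # first row: one gap of length j
        S[0][j] = Q[0][j] = -(g + r * j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # either open a new gap (w1 = g + r) or extend one (r)
            P[i][j] = max(S[i - 1][j] - (g + r), P[i - 1][j] - r)
            Q[i][j] = max(S[i][j - 1] - (g + r), Q[i][j - 1] - r)
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s, P[i][j], Q[i][j])
    return S[n][m]
```

For example, with these parameter values gotoh_score("ACGT", "AGT") is 0: three matches (+3) against one gap of length 1 (w1 = g + r = 3).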

Although this procedure can align sequences of lengths N and M (N < M) in M x N steps, it still requires the storage of three matrices of size M x N in memory. Myers and Miller (1988) were able to reduce this memory requirement by storing only enough information to compute the similarity score, which takes N units of memory. This saving of memory space was achieved by recognizing that calculation of each matrix row by the Gotoh method depends only on values in the current row and the most recently calculated row. Thus, only the matrix values in those rows plus a few other numbers need to be saved in an array, and old values can be overwritten by new ones as new rows are calculated. Because local and global alignments are computed in the same way, these same improvements can be used for producing both types of alignments.
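
The row-reuse idea can be sketched in Python as follows: the S values and the Gotoh P values are kept as single rows that are overwritten in place as each new row is calculated, and Q needs only one running value within a row. The scoring scheme is again illustrative.

```python
# A sketch of computing the Gotoh affine-gap score in linear space:
# S and P are single rows overwritten in place; Q is a running scalar.
# Match/mismatch values and gap penalties (g, r) are illustrative.

def gotoh_score_linear(a, b, match=1, mismatch=-1, g=2, r=1):
    """Affine-gap global alignment score in O(len(b)) memory."""
    NEG = float("-inf")
    m = len(b)
    S = [0] + [-(g + r * j) for j in range(1, m + 1)]  # row 0 of S
    P = [NEG] * (m + 1)
    for i in range(1, len(a) + 1):
        diag = S[0]                  # S[i-1][0], saved before overwriting
        S[0] = -(g + r * i)          # first column: one gap of length i
        Q = NEG
        for j in range(1, m + 1):
            # here S[j] still holds row i-1, while S[j-1] already holds row i
            P[j] = max(S[j] - (g + r), P[j] - r)
            Q = max(S[j - 1] - (g + r), Q - r)
            s = match if a[i - 1] == b[j - 1] else mismatch
            diag, S[j] = S[j], max(diag + s, P[j], Q)
    return S[m]
```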

Furthermore, Myers and Miller were able to save additional memory space required for storing the trace-back matrix that keeps track of the moves giving the best score at each matrix position. In doing so, they also reduced the number of sequence comparisons that must be made to find an optimal alignment. The trace-back matrix requires M x N memory positions. Myers and Miller utilized the fact that the dynamic programming algorithm finds the same best alignment by starting at the beginnings of the sequences and moving forward as by starting at the ends of the sequences and moving backward. In their method, a subalignment is started in the forward direction but does not go beyond the half-way point along the first sequence. Another alignment is started from the far ends of the sequences and stops at the same half-way point in the first sequence. At some position J in the second sequence, these forward and backward alignments must converge on an optimal midpoint of the alignment, as illustrated in Figure S3.3. The sum of the two scores is the optimal score for aligning the full sequences, after correcting for subalignments that both end in deletes. Breaking down the trace-back analysis in this way, called the divide-and-conquer approach by Hirschberg (1975), yields two improvements: (1) only one-half as many sequence positions have to be compared by the dynamic programming algorithm to find the optimal alignment (the area inside the colored rectangles), and (2) only one-half as much memory is needed to keep track of the moves. One further round of divide-and-conquer would require filling in only the brown area of the matrix, halving the requirement again. In this manner, dividing alignments into subalignments can be repeated several times to reduce the memory requirement even further, to approximately a linear function of sequence length (Fig. S3.3).
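
The forward-backward midpoint search can be sketched in Python. For brevity, this sketch assumes a simple per-residue (linear) gap penalty d rather than the affine weight, so the correction for subalignments that both end in deletes is not needed; all names and parameter values are illustrative.

```python
# A sketch of the divide-and-conquer midpoint search, assuming a
# linear gap penalty d per residue (the affine case additionally
# requires the delete correction described in the text).

def nw_last_row(a, b, match=1, mismatch=-1, d=2):
    """Last row of the global-alignment score matrix, in linear space."""
    prev = [-d * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [-d * i]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev

def midpoint(a, b, **kw):
    """Position J in b where an optimal alignment crosses row len(a)//2."""
    i_mid = len(a) // 2
    fwd = nw_last_row(a[:i_mid], b, **kw)                # forward half
    bwd = nw_last_row(a[i_mid:][::-1], b[::-1], **kw)    # backward half
    # fwd[j] + bwd[len(b)-j] scores an alignment split at (i_mid, j)
    totals = [f + r for f, r in zip(fwd, reversed(bwd))]
    j_mid = max(range(len(totals)), key=totals.__getitem__)
    return i_mid, j_mid, totals[j_mid]  # total = full optimal score
```

Recursing on the two halves on either side of the midpoint recovers the full alignment in linear space.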
The algorithm can be made even more efficient by confining sequence comparisons to a narrow band of interest in the dynamic programming alignment grid, in combination with the above divide-and-conquer method, as in a global alignment of related sequences of approximately the same length (Chao et al. 1994).
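
A band-limited comparison can be sketched as follows: only the cells within k diagonals of the main diagonal are filled in, and cells outside the band keep an effectively infinite negative score, so the usual maximization ignores them. The bandwidth k, the linear gap penalty, and the scoring values are all illustrative.

```python
# A sketch of band-limited global alignment: only cells with
# |i - j| <= k are filled. Parameter values are illustrative.

def banded_nw(a, b, k, match=1, mismatch=-1, d=2):
    """Global alignment score filling only cells within the band."""
    NEG = float("-inf")
    n, m = len(a), len(b)
    if abs(n - m) > k:
        raise ValueError("band too narrow to reach the matrix corner")
    S = [[NEG] * (m + 1) for _ in range(n + 1)]
    for j in range(min(k, m) + 1):       # first row, within the band
        S[0][j] = -d * j
    for i in range(1, n + 1):
        if i <= k:                       # first column, within the band
            S[i][0] = -d * i
        for j in range(max(1, i - k), min(m, i + k) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # out-of-band neighbors are NEG, so they never win the max
            S[i][j] = max(S[i - 1][j - 1] + s,
                          S[i - 1][j] - d,
                          S[i][j - 1] - d)
    return S[n][m]
```

Only about (2k + 1) x N cells are computed instead of M x N, which is the source of the speedup for similar sequences of near-equal length.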

A similar strategy may be used to obtain a local alignment in space that is a linear function of sequence length. The end points of the local alignment are first found by using the Smith-Waterman algorithm with the Gotoh method to reduce the computational time. The trace-back path is then found by applying the divide-and-conquer approach within the area of the matrix defined by these end points (Chao et al. 1994). As a result of these changes, the dynamic programming algorithm can be run more easily on smaller computers without compromising the ability of the method to find an optimal alignment. Most dynamic programming programs for sequence alignment use these more efficient methods.

Many of the above formulations that reduce the time and space needed to calculate optimal sequence alignments have depended on a reformulation of the dynamic programming algorithm using mathematical graph theory by G. Myers, W. Miller, and their colleagues. Although a description of the graph theory method is beyond the scope of this chapter, the formulation is not very different from that given above. The method, however, provides mathematical rigor, additional insight, and a more critical view of the details of the sequence alignment problem. Readers interested in the application of graph theory to the problem of sequence alignment should consult the review by Chao et al. (1994).

Figure S3.3. The divide-and-conquer approach of Myers and Miller (1988). Whereas the full dynamic programming method requires the entire matrix of size M x N to be filled, where M and N are the sequence lengths, the new approach only requires the positions that are shaded in the graph.

Finding near-optimal alignments

The dynamic programming alignment program reports only the highest-scoring, or optimal, sequence alignment. As discussed previously, there may be other important alignments with scores close to that of the optimal one. Some of these are alternative paths back through the dynamic programming matrix starting at the same high-scoring position; these alignments use some of the same aligned residue pairs as the optimal alignment. Other alignments may use the same residues but align them differently. Yet a third kind of alternative alignment may use entirely different residues, giving a range of possible alignments.

The LFASTA program uses the FASTA algorithm to report a series of alternative alignments of proteins, and PLFASTA plots the alternative alignments in a form much like a dot matrix plot (Pearson and Lipman 1988; Pearson 1990). Another method of finding near-optimal alignments is to identify the highest-scoring alignment, then the next highest-scoring, and so on, in that order. To accomplish this, Saqi and Sternberg (1991) change the values in the scoring matrix after the first alignment has been found, so that the matrix is biased against that alignment and favors others on the next pass.

Waterman and Eggert (1987) devised another widely used method for finding a series of local alignments. After finding the highest-scoring alignment beginning at position i in sequence 1 and position j in sequence 2, the matrix position Hi,j that initiates this alignment is given a score of zero. The H matrix values (and also the Gotoh P and Q matrix values) below and to the right of this position are then rescored for as far into the matrix as the recalculation continues to change the values. This recalculation has the effect of removing the highest-scoring alignment and any other alignment that uses the same residue pairs as the first alignment, although the same residues may still be used in different alignments. The rescored matrix is then scanned for the next-highest local similarity score, and the path leading to it provides the next highest-scoring alignment. The Waterman and Eggert method calculates these alternative alignments in far less time than that taken to produce the original scoring matrix. A more space-efficient version of this method using graph theory has been described (Huang and Miller 1991; Chao et al. 1994).
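
The rescoring idea can be sketched in Python in a deliberately simplified form: a linear gap penalty is used, and rather than limiting the recalculation to the affected region, the whole matrix is recomputed with the cells on the first alignment's path pinned to zero. All parameter values are illustrative.

```python
# A simplified sketch of the Waterman-Eggert idea with a linear gap
# penalty: find the best local alignment, pin the cells on its path to
# zero, and rescore to expose the next-best alignment. Parameter
# values are illustrative; a full implementation rescores only the
# region the change affects.

def sw_matrix(a, b, forbidden=frozenset(), match=1, mismatch=-1, d=2):
    """Smith-Waterman H matrix; cells in `forbidden` are pinned to 0."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if (i, j) in forbidden:
                continue
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] - d, H[i][j - 1] - d)
    return H

def traceback_cells(H, a, b, i, j, match=1, mismatch=-1, d=2):
    """Collect the cells on one best local path ending at (i, j)."""
    cells = set()
    while H[i][j] > 0:
        cells.add((i, j))
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - d:
            i -= 1
        else:
            j -= 1
    return cells

def two_best_local(a, b):
    """Scores of the best and (non-overlapping) second-best alignments."""
    H = sw_matrix(a, b)
    best, bi, bj = max((H[i][j], i, j)
                       for i in range(len(a) + 1)
                       for j in range(len(b) + 1))
    used = traceback_cells(H, a, b, bi, bj)
    H2 = sw_matrix(a, b, forbidden=used)  # rescore without the first path
    second = max(max(row) for row in H2)
    return best, second
```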

Another method of identifying near-optimal alignments is to find all alignments that achieve a similarity score greater than a specified minimum (Chao et al. 1994). In the above section, an optimal midpoint in an alignment was found by forward and backward passes of the dynamic programming algorithm, as illustrated in Figure S3.3. A similar procedure is used here on a rectangular portion of the original dynamic programming matrix, with two changes, as illustrated in Figure S3.4: (1) the entire range of midpoints that give a score of at least the set minimum is located, and (2) the procedure is repeated with midpoints along the second sequence. The result is a set of scores along both midpoints of the matrix that define a band of possible alignments, illustrated by the area between the dark lines. The alignment paths follow a set of adjacent squares as in Figure S3.3, except that the sizes of the squares are adjusted and the squares overlap enough horizontally that the boundary lines can pass from one square to the next. The result is a band of space that fills enough of the scoring matrix that all possible alignment paths can be found.

Figure S3.4. The divide-and-conquer method used to locate points on all alignments that exceed a minimum score. Shown is a portion of the dynamic programming alignment matrix between positions i1 and i2 in sequence 1 and j1 and j2 in sequence 2. All alignments that achieve the minimum score lie within the colored area and between the dark lines.

 

© 2004 by Cold Spring Harbor Laboratory Press. All rights reserved.
No part of these pages, either text or image, may be used for any purpose other than personal use. Therefore, reproduction, modification, storage in a retrieval system, or retransmission, in any form or by any means, electronic, mechanical, or otherwise, for reasons other than personal use, is strictly prohibited without prior written permission.

 

 