3. Improvements in time and memory efficiency for global alignments
The original Needleman-Wunsch dynamic programming algorithm allowed single gaps and took N x M steps to align two sequences of lengths N and M. Subsequently, Waterman et al. (1976) and Smith and Waterman (1981a,b) extended this algorithm to allow gaps of any length in the alignment, and the alignment then took N x M^{2} steps (where N is the length of the shorter sequence). Gotoh (1982) improved the performance of the dynamic programming algorithm by using the linear relationship for a gap weight w_{x} = g + rx, where the weight for a gap of length x is the sum of a gap opening penalty (g) and a gap extension penalty (r) times the gap length (x), and by simplifying the dynamic programming algorithm. He reasoned that two of the terms that are maximized in the dynamic programming algorithm, designated here P_{i,j} and Q_{i,j}, depend only on the values in the current and previous row and column, as indicated below.
Improved dynamic programming algorithm of Gotoh (1982)
The similarity score is written as
S_{i,j} = max { S_{i-1,j-1} + s_{i,j}, P_{i,j}, Q_{i,j} }, where

P_{i,j} = max { S_{i-x,j} - w_{x} } (1 <= x <= i), and
Q_{i,j} = max { S_{i,j-x} - w_{x} } (1 <= x <= j)

P may be obtained in a single step since,

P_{i,j} = max { S_{i-1,j} - w_{1}, max ( S_{i-x,j} - w_{x} ) (2 <= x <= i) }
        = max { S_{i-1,j} - w_{1}, max ( S_{i-1-x,j} - w_{x+1} ) (1 <= x <= i-1) }
        = max { S_{i-1,j} - w_{1}, max ( S_{i-1-x,j} - w_{x} - r ) (1 <= x <= i-1) }
        = max { S_{i-1,j} - w_{1}, P_{i-1,j} - r }

where the penalty for a gap of length x is given by,

w_{x} = g + rx

and g is the gap opening penalty and r the gap extension penalty.

Similarly,

Q_{i,j} = max { S_{i,j-1} - w_{1}, Q_{i,j-1} - r }
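These recurrences can be sketched directly in code. The function below is our own illustration, not from Gotoh's paper: it assumes a simple match/mismatch score in place of a full substitution table and uses illustrative penalty values g = 2 and r = 1.

```python
def gotoh_global_score(a, b, match=1, mismatch=-1, g=2, r=1):
    """Global alignment score with affine gap penalty w_x = g + r*x,
    following Gotoh's S, P, Q recurrences (full-matrix version)."""
    n, m = len(a), len(b)
    NEG = float("-inf")
    S = [[NEG] * (m + 1) for _ in range(n + 1)]  # S[i][j]: best score for a[:i] vs b[:j]
    P = [[NEG] * (m + 1) for _ in range(n + 1)]  # best score ending in a gap in b
    Q = [[NEG] * (m + 1) for _ in range(n + 1)]  # best score ending in a gap in a
    S[0][0] = 0
    for i in range(1, n + 1):                    # leading gap of length i
        S[i][0] = P[i][0] = -(g + r * i)
    for j in range(1, m + 1):
        S[0][j] = Q[0][j] = -(g + r * j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # open a new gap (cost w_1 = g + r) or extend an existing one (cost r)
            P[i][j] = max(S[i - 1][j] - (g + r), P[i - 1][j] - r)
            Q[i][j] = max(S[i][j - 1] - (g + r), Q[i][j - 1] - r)
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s, P[i][j], Q[i][j])
    return S[n][m]
```

With these parameters, aligning GAATTC with GATTC scores five matches minus one gap of length 1 (w_1 = 3), giving 2.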
Although this procedure can align sequences of lengths N and M (N < M) in M x N steps, these operations still require storing three matrices of size M x N in memory. Myers and Miller (1988) were able to reduce this memory requirement by storing just enough information to compute the similarity score in N units of memory. This saving of memory space was achieved by recognizing that calculation of each matrix row by the Gotoh method depends only on the values in the current row and the immediately preceding row. Thus, only the matrix values in those rows plus a few other numbers need to be saved in an array, and old values can be overwritten by new ones as new rows are calculated. Because local and global alignments are computed in the same way, these improvements can be used for producing both types of alignments.
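The row-rolling idea can be sketched as follows, reusing the illustrative scoring parameters above: one row of S and one row of P are kept and overwritten in place, plus a running Q value for the current row (the names are ours, not Myers and Miller's).

```python
def gotoh_score_linear(a, b, match=1, mismatch=-1, g=2, r=1):
    """Affine-gap global alignment score in O(len(b)) memory:
    S and P each hold one matrix row and are overwritten in place."""
    m = len(b)
    NEG = float("-inf")
    S = [0] + [-(g + r * j) for j in range(1, m + 1)]  # row 0 of S
    P = [NEG] * (m + 1)                                # row 0 of P
    for i in range(1, len(a) + 1):
        diag = S[0]               # S[i-1][j-1], carried along the row
        S[0] = -(g + r * i)       # S[i][0]: leading gap
        q = NEG                   # Q[i][0]
        for j in range(1, m + 1):
            P[j] = max(S[j] - (g + r), P[j] - r)    # S[j] still holds S[i-1][j]
            q = max(S[j - 1] - (g + r), q - r)      # S[j-1] already holds S[i][j-1]
            s = match if a[i - 1] == b[j - 1] else mismatch
            diag, S[j] = S[j], max(diag + s, P[j], q)
    return S[m]
```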
Furthermore, Myers and Miller were able to save the additional memory space required for storing the traceback matrix that keeps track of the moves giving the best score at each matrix position. In doing so, they also reduced the number of sequence comparisons that must be made to find an optimal alignment. The traceback matrix requires M x N memory positions. Myers and Miller utilized the fact that the same best alignment is obtained by the dynamic programming algorithm whether one starts at the sequence beginnings and moves forward along the sequences or starts at the sequence ends and moves backward along them. In their method, a subalignment is started in the forward direction but does not go beyond the halfway point along the first sequence. Another alignment is started from the far end of the sequences, ending at the same halfway point in the first sequence. At some position J in the second sequence, these forward and backward alignments must converge on an optimal midpoint in the alignment, as illustrated in Figure S3.3. The sum of the two scores is the optimal score of aligning the full sequences, after correcting for subalignments that both end in deletes. In breaking down the traceback analysis this way, called the divide-and-conquer approach by Hirschberg (1975), two improvements are realized: (1) only one-half as many sequence positions have to be compared by the dynamic programming algorithm to find the optimal alignment (the area inside the colored rectangles), and (2) only one-half as much memory is needed to keep track of the moves. One further round of divide-and-conquer would require filling in only the brown area of the matrix, thus halving the requirement again. In this manner, the dividing of alignments into subalignments can be repeated several times to reduce the memory requirement even further, to approximately a linear function of sequence length (Fig. S3.3).
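The midpoint step can be sketched with a linear gap penalty, a deliberate simplification: Myers and Miller's actual method carries extra bookkeeping to handle affine gaps that cross the midpoint (the "subalignments that both end in deletes" above). Function names and scoring values here are illustrative.

```python
def nw_last_row(a, b, match=1, mismatch=-1, gap=2):
    """Last row of the Needleman-Wunsch score matrix, in linear space."""
    prev = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [-gap * i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s, prev[j] - gap, cur[j - 1] - gap)
        prev = cur
    return prev

def optimal_midpoint(a, b):
    """Column J where an optimal path crosses the middle row of the matrix:
    forward scores up to that row plus backward scores from it.
    The maximum of the summed scores is the full alignment score."""
    mid = len(a) // 2
    fwd = nw_last_row(a[:mid], b)
    bwd = nw_last_row(a[mid:][::-1], b[::-1])[::-1]  # bwd[j]: score of a[mid:] vs b[j:]
    total = [f + w for f, w in zip(fwd, bwd)]
    J = max(range(len(total)), key=total.__getitem__)
    return J, total[J]
```

Recursing on the two halves (a[:mid] versus b[:J], and a[mid:] versus b[J:]) recovers the full traceback in linear space, which is Hirschberg's divide-and-conquer.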
The algorithm can be made even more efficient by confining sequence comparisons to a narrow band of interest in the dynamic programming alignment grid in combination with the above divide-and-conquer method, as in a global alignment of related sequences of approximately the same length (Chao et al. 1994).
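The banding idea on its own can be sketched as below, again with a linear gap penalty and illustrative parameters: cells farther than k from the main diagonal are never filled, so roughly 2k x N cells are computed instead of M x N. The band half-width k must be at least the length difference of the sequences for the final corner to be reachable.

```python
def banded_nw_score(a, b, k, match=1, mismatch=-1, gap=2):
    """Needleman-Wunsch score confined to a band |i - j| <= k around
    the main diagonal; cells outside the band stay at -infinity."""
    NEG = float("-inf")
    n, m = len(a), len(b)
    S = [[NEG] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0
    for j in range(1, min(m, k) + 1):
        S[0][j] = -gap * j
    for i in range(1, n + 1):
        if i <= k:
            S[i][0] = -gap * i
        for j in range(max(1, i - k), min(m, i + k) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # neighbors outside the band are -infinity and never win the max
            S[i][j] = max(S[i - 1][j - 1] + s, S[i - 1][j] - gap, S[i][j - 1] - gap)
    return S[n][m]
```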
A similar strategy may be used to obtain a local alignment in memory that is a linear function of sequence length. The end points of the local alignment are first found by using the Smith-Waterman algorithm with the Gotoh method to reduce the computational time. The traceback path is then found by applying the divide-and-conquer approach within the area of the matrix defined by these end points (Chao et al. 1994). As a result of these changes, the dynamic programming algorithm can be run more easily on smaller computers without compromising its ability to find an optimal alignment. Most dynamic programming programs for sequence alignment use these more efficient methods. Many of the above formulations that reduce the time and space needed to calculate optimal sequence alignments depend on a reformulation of the dynamic programming algorithm in terms of mathematical graph theory by G. Myers, W. Miller, and their colleagues. Although a description of the graph theory method is beyond the scope of this chapter, the formulation is not too different from that given above. The method, however, provides a level of mathematical rigor and insight and a more critical view of the details of the sequence alignment problem. Readers interested in the application of graph theory to sequence alignment should consult the review by Chao et al. (1994).
Figure S3.3. The divide-and-conquer approach of Myers and Miller (1988). Whereas the full dynamic programming method requires the entire matrix of size M x N to be filled, where M and N are the sequence lengths, the new approach requires only the positions that are shaded in the graph.


Finding near-optimal alignments
The dynamic programming alignment program reports only the highest-scoring, or optimal, sequence alignment. As discussed previously, there may be other important alignments with scores nearly as high as the optimal one. These may be alternate paths back through the dynamic programming matrix starting at the same high-scoring position. Such alignments will use some of the same aligned residues that were used in the optimal alignment. Other alignments may use the same residues but align them differently. Yet a third kind of alternative alignment may use entirely different amino acids to achieve a range of possible alignments.
The LFASTA program uses the FASTA algorithm to report a series of alternative alignments of proteins, and PLFASTA plots the alternative alignments in a form much like a dot matrix plot (Pearson and Lipman 1988; Pearson 1990). Another method of finding near-optimal alignments is to identify the highest-scoring alignment, then the next highest-scoring, and so on, in that order. To accomplish this, Saqi and Sternberg (1991) change the values in the scoring matrix after the first alignment has been found, so that they are biased against that alignment and favor others the next time.
Waterman and Eggert (1987) devised another method for finding a series of local alignments that is widely used. After the highest-scoring alignment beginning at position i in sequence 1 and position j in sequence 2 has been found, the matrix position H_{i,j} that initiates this alignment is given a score of zero. The H matrix values (and also the Gotoh P and Q matrix values) below and to the right of this position are then rescored, continuing for as far in the matrix as the recalculation changes the values. This recalculation has the effect of removing the highest-scoring alignment and any other alignment that uses the same residue pairs as the first alignment, although the same residues may still be used in a different alignment. The rescored matrix is then scanned for the next highest local similarity score, and the path leading to it provides the next highest-scoring alignment. The Waterman and Eggert method calculates these alternative alignments in far less time than that taken to produce the original scoring matrix. A more space-efficient version of this method using graph theory has been described (Huang and Miller 1991; Chao et al. 1994).
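A much-simplified sketch of the rescoring idea follows: it uses a linear gap penalty and recomputes the whole matrix with the cells of the previous alignment clamped to zero, rather than rescoring only the region that actually changes, as Waterman and Eggert do. The function name and parameters are illustrative.

```python
def best_local(a, b, banned=frozenset(), match=1, mismatch=-1, gap=2):
    """Smith-Waterman (linear gap): return (score, i, j) of the best
    local alignment end point; cells listed in `banned` are clamped to
    zero so residue pairs from earlier alignments cannot be reused."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if (i, j) in banned:
                continue  # H[i][j] stays 0, as in the rescored matrix
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] - gap, H[i][j - 1] - gap)
    # ties resolve to the largest (i, j) by tuple comparison
    return max((H[i][j], i, j) for i in range(n + 1) for j in range(m + 1))
```

For example, in ACGTACGT versus ACGT the best local alignment is the second ACGT match; clamping its four diagonal cells to zero leaves the first occurrence as the next-best alignment.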
Another method of identifying near-optimal alignments is to find all alignments that achieve a similarity score greater than a specified minimum (Chao et al. 1994). In the section above, an optimal midpoint in an alignment was found by forward and backward passes of the dynamic programming algorithm, as illustrated in Figure S3.3. A similar procedure is used here on a rectangular portion of the original dynamic programming matrix, with two changes, as illustrated in Figure S3.4: (1) the range of midpoints that give a score of at least the set minimum is located, and (2) the procedure is repeated with midpoints along the second sequence. The result is a set of scores along both midpoints of the matrix that defines a band of possible alignments, illustrated by the area between the dark lines. The alignment paths follow a set of adjacent squares as in Figure S3.3, except that the sizes of the squares are adjusted and the squares overlap enough horizontally that the boundary lines can pass from one square to the next. The result is a band of space that fills enough of the scoring matrix that all possible alignment paths can be found.
Figure S3.4. The divide-and-conquer method used to locate points on all alignments that exceed a minimum score. Shown is a portion of the dynamic programming alignment matrix between positions i_{1} and i_{2} in sequence 1 and j_{1} and j_{2} in sequence 2. All alignments that achieve the minimum score lie within the colored area and between the dark lines.


