5. Updates to PAM amino acid scoring matrices
The Dayhoff data set has been augmented to include the 1991 protein database (Gonnet et al. 1992; Jones et al. 1992). The ability of the Dayhoff matrices to identify homologous sequences has also been extensively compared to that of other scoring matrices.
The Gonnet et al. (1992) revised PAM matrix, also known as the Gonnet92 matrix, was prepared by first ordering the sequences with a type of index. This ordering effectively organizes the sequences into a tree in which similar sequences should be represented by nearby branches. Thus, the ordering provides a way to quickly locate similar sequences for alignment. Because aligning the sequences was a much more time-consuming process, this clustering greatly reduced the computational time needed. Alignments between closely related and more distantly related sequences were produced by the Needleman-Wunsch dynamic programming method using the original Dayhoff PAM matrices as a starting point. From these alignments, new scoring matrices were calculated for protein pairs separated by the same distance, and then these matrices were used to refine the alignments, which, in turn, were used to produce a final matrix. Changes in amino acids between all pairs of aligned sequences at the same level of similarity were then calculated. These distributions were then adjusted to the same level of similarity by a matrix multiplication method similar to that used by Dayhoff. Finally, these matrices were used to calculate a new average scoring matrix of distance 250 PAMs, referred to as the Gonnet92 scoring matrix. Benner et al. (1994) later published a set of substitution matrices at 250 PAMs based on the same data. In this case, the data for sets of aligned proteins at the same PAM distance levels (decreased similarity) were combined and used to produce a set of scoring matrices at 250 PAMs identified as BennerN matrices, where N is the approximate PAM distance between the sequences aligned to score the amino acid substitutions. Recommended gap penalties were published along with these matrices (Benner et al. 1993).
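The matrix multiplication step mentioned above can be illustrated with a toy example. The sketch below is not the Gonnet et al. procedure; it only shows the general Dayhoff-style idea of raising a one-step substitution matrix to a power and converting the result to log odds scores. The three-letter alphabet, the frequencies in `FREQ`, the probabilities in `STEP`, and all function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical 3-letter alphabet; every number here is invented for illustration.
FREQ = [0.5, 0.3, 0.2]  # background frequencies, summing to 1

# One-step substitution matrix: STEP[i][j] is the probability that residue i
# is replaced by residue j in one small evolutionary step. Each row sums to 1.
STEP = [
    [0.992, 0.006, 0.002],
    [0.010, 0.984, 0.006],
    [0.005, 0.009, 0.986],
]


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


def log_odds(m, freq):
    """Convert substitution probabilities to scores in units of 10 * log10(odds)."""
    n = len(m)
    return [[10 * math.log10(m[i][j] / freq[j]) for j in range(n)]
            for i in range(n)]


# Extrapolate the one-step matrix to 250 steps, then score it.
M250 = mat_pow(STEP, 250)
SCORES = log_odds(M250, FREQ)

for row in SCORES:
    print(" ".join(f"{s:6.1f}" for s in row))
```

Because the toy `STEP` matrix was built so that FREQ[i] * STEP[i][j] equals FREQ[j] * STEP[j][i], the resulting score table is symmetric, as a substitution scoring matrix is expected to be.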
At least two significant changes between this method and that used by Dayhoff are apparent: (1) the absence of tree-building to infer the amino acid changes occurring in each set of homologous proteins, which would in any case have been a very difficult task for this many proteins, and (2) proteins separated by PAM distances between 6.4 (6% difference) and 100 (>50% difference) were used to score amino acid changes that were used to produce the new matrix, as opposed to the PAM15 distance (approximately 15% difference) used by Dayhoff. It is much more difficult to align more distant proteins, and the placement of mismatches and gaps in the alignment will determine which substitutions will be scored. An improvement would be to use multiple sequence alignments to locate the gaps. The problem is worsened by using an already existing scoring matrix to make the alignment because the first matrix will then influence the scoring of the second. This problem is much reduced when substitutions in closely related sequences are scored, because a realistic alignment can be readily produced. If one accepts the Dayhoff Markovian model of protein evolution as a set of random amino acid changes, then for high PAM distances, the probability that a significant fraction of the observed amino acid changes represents two or more sequential changes greatly increases. In addition, some previously changed amino acids are more likely to have reverted to the original one. For sequences that differ on average at a fraction x of their residues, the distribution of the actual number of mutations at a site should be approximately given by the Poisson distribution $P_n = e^{-x} x^n / n!$, where $P_n$ is the probability of n changes. For sequences that are 15% different (x = 0.15), as used by Dayhoff, $P_0 = 0.86$, $P_1 = 0.13$, and $P_2 = 0.01$. Thus, only 8% of the amino acid changes sampled are expected to be double changes. But for sequences that are 30% different, this ratio is 14%, and for sequences that are 40% different, 20%.
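The Poisson estimate above can be checked numerically. The following is a minimal sketch that reads the quoted ratio as P2/P1 (double changes relative to single changes); that reading is an assumption, and the function names are hypothetical. Under it the ratio is simply x/2, which gives values close to the percentages quoted in the text; small differences may reflect rounding of the probabilities.

```python
import math


def poisson(n, x):
    """Probability of n changes at a site when the mean number of changes is x."""
    return math.exp(-x) * x ** n / math.factorial(n)


def double_to_single_ratio(x):
    """Ratio P2/P1 (assumed reading of the 'double change' ratio in the text)."""
    return poisson(2, x) / poisson(1, x)


# Fractions of differing residues discussed in the text: 15%, 30%, and 40%.
for x in (0.15, 0.30, 0.40):
    p0, p1, p2 = (poisson(n, x) for n in range(3))
    print(f"x={x:.2f}  P0={p0:.2f}  P1={p1:.2f}  P2={p2:.2f}  "
          f"P2/P1={double_to_single_ratio(x):.3f}")
```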
Thus, scoring more distantly related sequences, as performed by Gonnet et al. (1992), does not permit one to evaluate as well the single amino acid changes that occur during a Markovian evolution process because many of the resulting amino acids have been changed a second time. Due to such concerns, it would appear that the Gonnet92 matrix is not a reasonable substitute for the Dayhoff PAM250 matrix as a way to score evolutionary changes in proteins. Rather, it is a hybrid matrix between the Dayhoff PAM model and the BLOSUM model described in Chapter 3.
Another substitute for the Dayhoff PAM250 matrix is an amino acid substitution matrix named PET91 (pair exchange table for year 1991), and also called JTT250 (Jones et al. 1992). This matrix was also based on the comparison of a large number of protein sequences by clustering the sequences on a tree as a rapid method for identifying similar sequences. This method was slightly different from that used by Gonnet et al. (1992) and was based on identifying similar sequences through their having the same complement of amino acid triplets. The resulting matrix was also a log odds matrix at 250 PAMs, although JTT matrices at other evolutionary distances may also be calculated by the program CALCPAM, which may be obtained from the authors. The new data set included 59,190 accepted point mutations in 16,130 protein sequences, a much smaller number than that compared by Gonnet et al. (1992). The JTT matrix was based on sequences that were >85% similar, and in this respect was identical to the original Dayhoff analysis, whereas Gonnet et al. (1992) also used alignments from more distantly related sequences. Jones et al. (1992) showed that their sequence clustering method predicted the same or similar frequencies of amino acid changes in the original Dayhoff data set, although the method of inferring evolutionary change was quite different. These new mutation data matrices (MDM), Gonnet92 and JTT250, are compared to the original Dayhoff MDM matrix at 250 PAMs in Table 3.3 in the book. The scores in this log odds table are in units of 10 × log10 of the odds ratio, as in the original MDM of Dayhoff at 250 PAMs.
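The idea of finding similar sequences through shared amino acid triplets can be sketched as follows. This is not the procedure of Jones et al. (1992), whose details are not given here; it is one simple way to compare triplet complements, using the Jaccard index as an assumed similarity measure. The function names and the example sequences are hypothetical.

```python
def triplets(seq):
    """Return the set of overlapping amino acid triplets in a sequence."""
    return {seq[i:i + 3] for i in range(len(seq) - 2)}


def triplet_similarity(a, b):
    """Jaccard index of the two triplet sets (an assumed measure, for illustration)."""
    ta, tb = triplets(a), triplets(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Two invented sequences differing in the last residue share 3 of 5 distinct triplets.
print(triplet_similarity("ACDEFG", "ACDEFH"))
```

A score like this is cheap to compute, so it could be used to pick candidate pairs before running a full alignment on them.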
There are significant differences both between the new Gonnet92 and JTT250 matrices and the old Dayhoff matrix at 250 PAMs (row 1) and also between the Gonnet92 (row 2) and JTT250 (row 3) matrices. The most striking differences are in the substitutions between C (Cys) and W (Trp) and many of the other amino acids, of which many more had been observed than in the original Dayhoff data set, and some were not observed at all previously. For example, the log odds score between W (Trp) and C (Cys) increased from -8 to -1 and +1, an increase of 5- to 8-fold in probability. In the new protein comparisons, C (Cys) and W (Trp) were exchanged, whereas they were not exchanged in the original Dayhoff analysis. Other amino acid exchanges were rare in the new tables, as in the original data. For example, W (Trp) and N (Asn) were only seen to exchange twice in the JTT data set. A similar but less pronounced increase in score occurred with W (Trp)/G (Gly) and M (Met)/C (Cys). These differences seen in the newly derived matrices are also reflected in a decreased score for aligning these same amino acids (C and W) with each other in sequences.
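The fold changes quoted for score differences follow directly from the units of the table. Assuming the scores are 10 × log10 of the odds ratio, a difference of d score units corresponds to a factor of 10^(d/10) in the odds. The helper name below is hypothetical.

```python
def fold_change(score_difference):
    """Fold change in odds for a score difference, assuming units of 10 * log10(odds)."""
    return 10 ** (score_difference / 10)


# W/C score rising from -8 to -1 (7 units) and from -8 to +1 (9 units).
for old, new in ((-8, -1), (-8, 1)):
    print(f"{old:+d} -> {new:+d}: {fold_change(new - old):.1f}-fold")

# Smaller differences of 2, 3, and 4 score units.
for d in (2, 3, 4):
    print(f"difference of {d}: {fold_change(d):.1f}-fold")
```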
There are also some large differences between the Gonnet92 and JTT250 matrices, both of which were supposed to substitute for the original Dayhoff PAM matrices. These differences may be attributed to the overall number of sequences compared and their relatedness. For example, the Gonnet92 log odds score for W (Trp)/Y (Tyr) is 4, whereas that of JTT250 is 0, a difference in probability of 2.5-fold. There are also a number of examples of variations in log odds scores of 2 (1.6-fold probability) and 3 (2-fold probability). On the basis of the differences in the scoring matrices, the best recommendation that can be made is to use the JTT matrices as a more modern substitute for the Dayhoff matrices. The Gonnet and Benner matrices are useful for scoring amino acid changes between proteins that are more distantly related, because no underlying evolutionary model was used to prepare these matrices. A comparison of these matrices for aligning sequences is given below.