6. Comparisons and uses of amino acid scoring matrices
Comparison of amino acid scoring matrices poses several problems for the reader to sort out. First, many extremely useful scoring matrices are not based on an explicit model of protein evolution, as are the Dayhoff matrices. Instead, they are designed to locate sequence similarity as a basis for some other common feature between proteins, such as structure, biochemical function, or a family relationship, because they were generated by examining similarity in sequences that share such features. These matrices include the BLOSUM, Gonnet92, and JO93 matrices. Recent research (Henikoff and Henikoff 1993; Pearson 1995; 1996; 1998) has shown that, for the majority of sequence analysis projects, these matrices are the most likely to identify a relationship or to find related sequences in a database search, provided that appropriate gap penalties are chosen. At the same time, these matrices are not as useful for discovering evolutionary relationships, because an explicit model of protein evolution was not used in their generation (Altschul et al. 1994). If evolutionary relationships are to be emphasized, the PAM and JTT scoring matrices offer the best analysis, provided that the matrix that corresponds to the evolutionary distance between the sequences and appropriate gap penalties for that matrix are used.
As outlined in the text, the theoretical basis for performing a sequence alignment is to determine whether or not the sequences are homologous, i.e., derived from a common ancestor sequence. The degree of similarity found can indicate how much separation has occurred, expressed in PAM distances. Because different matrices are designed for answering different questions about sequence similarity, it is very difficult to know how to compare them appropriately, and also how to use them to analyze evolutionary relationships. Second, the ability of these matrices to produce alignments needs to be assessed by using known alignments assessed by an independent criterion. The same tests are not appropriate for all matrices because they are based on different criteria. Third, the sequence alignments used for testing matrices should be a representative collection of all known sequence classes and should not contain an overrepresentation of similar sequences, nor should they be the same sequence alignments that were used to produce the matrix. Fourth, each scoring matrix has a particular range of values, some consisting of log-odds scores that are both positive and negative, and others of a simple scale of values that are all positive. To use these matrices, a different gap penalty score that produces reasonable alignments must be used for each. These penalties are usually not published or well known, which introduces a level of uncertainty into the alignments obtained, and also into the reliability of the matrix comparison analyses. When investigators publish a new scoring matrix, they usually devise a scheme for comparing the ability of that matrix with that of others to produce reliable alignments. These types of analyses often overlook the purposes of different matrices and fail to recognize differences in the underlying models of protein evolution implied in their construction. With this sobering introduction in mind, some matrix comparison studies are discussed below.
Amino acid scoring matrices are used for two types of analyses. One is to produce alignments and score similarity between two or more protein sequences for reasons identified above, not forgetting that the goal is to identify homology. A second purpose is to find sequences similar to a test sequence in a database search, such as one performed by the BLAST or FASTA programs. This second function of the scoring matrices is quite different and is discussed in Chapter 7. For producing sequence alignments, there is no best scoring matrix for every purpose, and the best choice for the investigator is to try several. For the Dayhoff matrices, which are based on a particular model of protein evolution, the best alignment is theoretically expected when the correct matrix is used. This matrix should be based on the same evolutionary distance as is apparent between the sequences. The Dayhoff matrix at 250 PAMs of evolutionary distance corresponds to proteins that should be about 20% similar. This level of similarity is within the "twilight zone" of alignment in which most alignments are difficult to prove or to believe. Perhaps in this case, using a Smith-Waterman program will reveal a local area of similarity that is more convincing than the global alignment produced by a Needleman-Wunsch program. Sequences that are 60% similar are separated by approximately 60 PAMs, 50% similar by 80 PAMs, 40% similar by 120 PAMs, and 30% similar by 160 PAMs of distance. These matrices are widely available and more can be generated by the PAM program (distributed with the BLAST package of programs) to accommodate any distance. As a more appropriate matrix is used, the alignment should improve, and this improvement should be reflected in an increase in the score of the alignment, provided that other factors such as the gap penalty are also adjusted in a reasonable way.
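The similarity-to-distance correspondences quoted above can be turned into a small lookup for choosing a Dayhoff matrix. The following Python fragment is only a sketch: the function name `approximate_pam_distance` is hypothetical, and the linear interpolation between the quoted points is an assumption made for illustration, not part of the Dayhoff model.

```python
# Approximate (percent similarity, PAM distance) pairs quoted in the text.
SIMILARITY_TO_PAM = [
    (60.0, 60.0),
    (50.0, 80.0),
    (40.0, 120.0),
    (30.0, 160.0),
    (20.0, 250.0),
]


def approximate_pam_distance(percent_similarity):
    """Estimate a PAM distance for a given percent similarity.

    ASSUMPTION: values between the quoted points are linearly interpolated,
    and values outside the quoted range are clamped to the nearest point.
    """
    points = sorted(SIMILARITY_TO_PAM)  # ascending by percent similarity
    lowest_sim, pam_at_lowest = points[0]
    highest_sim, pam_at_highest = points[-1]
    if percent_similarity <= lowest_sim:
        return pam_at_lowest
    if percent_similarity >= highest_sim:
        return pam_at_highest
    for (s0, p0), (s1, p1) in zip(points, points[1:]):
        if s0 <= percent_similarity <= s1:
            fraction = (percent_similarity - s0) / (s1 - s0)
            return p0 + fraction * (p1 - p0)


if __name__ == "__main__":
    for similarity in (60, 45, 25):
        print(similarity, approximate_pam_distance(similarity))
```

An estimate of this kind only suggests where to start; as noted above, trying several matrices and adjusting the gap penalty for each remains the practical approach.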
In an early study of matrix comparisons, the ability of several of these matrices to identify distantly related sequences was compared with that of the Dayhoff PAM250 matrix (Feng et al. 1985; Doolittle 1989). The PAM250 matrix proved to be the best, but was closely followed by the structural and genetic code matrices. For comparing similar sequences, the genetic code, amino acid side chain, and PAM250 scales were equally useful.
Abagyan and Batalov (1997) performed a study of the value of scoring matrices in predicting global alignments between structurally related protein sequences. They examined alignment scores among a large number of proteins whose structures were either known or presumed to be known based on structural alignments. The objective was to find whether or not the alignment score or other similar features, such as percent identity or similarity, was a reliable indicator of a structural relationship. If so, scores between structurally related proteins should be high and those between unrelated proteins should be low. The proteins analyzed had been collected in a database known as the HSSP (homology-derived secondary structure of proteins) database (Sander and Schneider 1991) (see Chapter 9 in the book). They represented members of the same structural family and also of different structural families. These families had been identified in the SCOP (structural classification of proteins) database (Murzin et al. 1995). The various sequence combinations were aligned by the global Needleman-Wunsch alignment algorithm without end gap penalties, the latter being chosen to favor the location of similar domains in different positions in the sequences being compared. The scoring matrices were similar to those used above by Vogt et al. (1995), including the positive scoring matrices. To standardize the comparisons between matrices, they were each normalized such that the sum of the diagonal scores, when adjusted for residue frequency, added up to 1. A number of different gap opening and extension penalties were also tried. Because these values were chosen to be optimal for the normalized matrix, they will not work for others using the standard scoring matrices. This paper also has an excellent discussion of the statistical significance of alignment scores. The results of this analysis showed some similarities but also some differences with those of Vogt et al.
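The normalization described above can be illustrated with a short Python sketch. It assumes that the normalization is a single multiplicative scale factor chosen so that the frequency-weighted sum of the diagonal scores equals 1; the exact procedure of Abagyan and Batalov is not given here, and the three-residue alphabet, scores, frequencies, and function names are invented purely for illustration.

```python
def normalize_matrix(scores, frequencies):
    """Scale a scoring matrix so that sum_i f_i * S_ii equals 1.

    ASSUMPTION: one multiplicative factor is applied to every entry.
    `scores` maps (residue, residue) pairs to scores and `frequencies`
    maps residues to their background frequencies.
    """
    weighted_diagonal = sum(
        freq * scores[(res, res)] for res, freq in frequencies.items()
    )
    if weighted_diagonal <= 0:
        raise ValueError("frequency-weighted diagonal must be positive")
    factor = 1.0 / weighted_diagonal
    normalized = {pair: value * factor for pair, value in scores.items()}
    return normalized, factor


def normalized_to_raw_penalty(normalized_penalty, factor):
    """Convert a gap penalty chosen for the normalized matrix back to the
    scale of the original matrix (same scale-factor assumption)."""
    return normalized_penalty / factor


# Hypothetical toy data, not taken from any published matrix.
toy_frequencies = {"A": 0.5, "C": 0.25, "W": 0.25}
toy_scores = {
    ("A", "A"): 4, ("C", "C"): 8, ("W", "W"): 8,
    ("A", "C"): -1, ("C", "A"): -1,
    ("A", "W"): -3, ("W", "A"): -3,
    ("C", "W"): -2, ("W", "C"): -2,
}

normalized, factor = normalize_matrix(toy_scores, toy_frequencies)
```

Under this assumption, a gap penalty tuned for the normalized matrix has to be divided by the same factor before it can be used with the unnormalized matrix, which is why the published normalized penalties cannot be applied directly to the standard matrices.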
A similar finding was that the Gonnet92 and BLOSUM50 scoring matrices, normalized as described above, performed the best. Other BLOSUM matrices performed almost as well. The normalized gap opening and extension penalties (adjusted by us to the regular scores for these matrices in parentheses) were 2.4 (11) and 0.15 (1), respectively, for the Gonnet92 matrix and 2.0 (13) and 0.15 (1), respectively, for the BLOSUM50 matrix. The surprise was that the all-positive matrices that had performed the best in the Vogt et al. (1995) study performed the worst in this one. It is not clear why this should be so. One possibility is that a different criterion for success was used. The criterion here was fold recognition, whereas the former study was based on a full sequence alignment score between the sequences of structurally related proteins. The authors conclude that several different scoring matrices should be used because one alone, such as the Gonnet92 matrix, sometimes produces an alignment score that suggests a relationship between unrelated proteins, i.e., it is not very selective, and sometimes fails to demonstrate a relationship when there is known to be one.
Another method of testing a scoring matrix and gap penalty combination is to perform a search of a sequence database with a known member of a protein family and to find how many members of the family are found when the same matrix/penalty is used. In analyses of this kind in which gap penalty was not considered, the BLOSUM62 matrix outperformed the PAM250 matrix in finding more members of 504 different families in the Prosite database (Henikoff and Henikoff 1993). Subsequently, Henikoff and Henikoff (1993) showed, using representatives of the 560 families that were used to produce the BLOSUM matrices, that the BLOSUM62 matrices also performed better in this same test than a series of Dayhoff (80-250) and JTT (80-220) PAM matrices and the matrix, and slightly better than other BLOSUM matrices (45, 50, 80, and 100) and a structure-based scoring matrix (STR, similar to JO93).
Brenner et al. (1998) have conducted a third such detailed study of the ability of sequence alignment programs to recognize proteins that are related by structure and not by sequence. They argue that basing the evaluation of a system on the recognition of proteins in families previously defined by sequence similarity begs the question of finding true homology, which should be based on structure. Therefore, they developed two new databases of protein structural domains derived from the SCOP and PDB (Brookhaven protein data bank) databases, and which are similar to the HSSP database discussed earlier. One new database, PDB90DB, has domains that are <90% identical and the other, PDB40DB, has domains that are <40% identical. Using various sequence alignment methods, each entry in each database was then used to search the rest of that database to discover how sensitive the method was at finding related sequences versus how many errors were made in misidentifying unrelated sequences. The alignment programs included SSEARCH (Smith-Waterman algorithm), BLAST, WU-BLAST, and FASTA and the standard or default scoring matrices and gap penalties of these programs. WU-BLAST2, a newer version of BLAST, uses gapped alignments (see Agarwal and States 1998 and Chapter 7 in the book).
The authors found that use of statistical scores provided a much improved ability to identify structural homology over raw alignment scores or percent identity. The statistical scores of SSEARCH and FASTA reliably predicted the actual number of incorrectly identified proteins. However, the statistical scores estimated by BLAST programs overestimated the significance by one or two orders of magnitude because many more unrelated proteins were found. The very best of the programs, SSEARCH and FASTA (ktuple 1), using statistical scores could at best find only 18% of all related proteins in the PDB40DB database at a 1% error rate. BLAST identified 15% of the relatives, and the other programs were intermediate. In the PDB40DB database, most of the structural homologs cannot be detected by sequence similarity because they have so little sequence similarity. SSEARCH detected hardly any structural homologs that have <15% identity, but could detect most that have >25% identity. Between 15 and 20% identity, 10% were identified, and between 20 and 25%, 40%. Thus, there is a limit to which sequence similarity searching can detect structural homologs that share little sequence similarity. For the PDB90DB database, WU-BLAST identified 38% at the 1% error level, with the older version of BLAST giving 28% and the other programs intermediate values.
The result with WU-BLAST is not surprising, because several scoring regions are likely to be found between the similar sequences in the PDB90DB database. WU-BLAST gives such multiple scoring regions more weight than older versions of BLAST do. An interesting example of two proteins that have 39% identity over an alignment length of 64 residues, but which have different structures, is shown. The alignment scores given for this alignment were not significant. In general, because the databases are now so large, a level of 40% identity for alignments at least 70 residues in length is a reasonable threshold for structural similarity. The databases are available from http://scop.mrc-lmb.cam.ac.uk/scop/.
Pearson (1995) has performed the most detailed analysis to date of the ability of combinations of scoring matrices and gap penalties to assist in the recognition of proteins that are known to be in the same superfamily. Superfamily classification of proteins is performed by the PIR resource and is based on sequence similarity but also on other criteria, such as a similar structure, that provide an indication of an evolutionary relationship. Some members of the same superfamily share significant sequence similarity, but others may not. In such cases, two nonsimilar sequences may be found to have similarity with a third, in-between member of the same family, thus establishing a relationship among the three proteins. Due to such variations in similarity among related proteins, finding members of a family in a database search poses a difficult challenge. Pearson (1995) performed an extensive analysis of 67 superfamilies from the PIR1 database.
Pearson used the computer programs SSEARCH, FASTA, and BLAST, which are designed to be used in a database search and to identify proteins that are related to an input sequence. SSEARCH is an implementation of the Smith-Waterman algorithm, which searches an entire database for sequences that give the highest alignment scores. FASTA and BLAST perform the same function more rapidly by identifying sequences that have short matching patterns or words in sequences, and then aligning them. These programs utilize various scoring matrices and gap penalty combinations and are discussed in greater detail in Chapter 7 in the book. The object of these studies was to compare the numbers of superfamily members found by the above programs in a database search, given as input the sequence of a known family or superfamily member. Several different scoring matrices were tested, each with a set of gap penalties in the range 6, 1 to 16, 1, where 6 to 16 is the penalty for a gap of size 1 and there is an additional 1 for each additional gap position. The number of members found is a measure of the sensitivity of the program, scoring matrix, and gap penalty combination. A better combination of program and scoring system will also not find members of other superfamilies in a database search. This capability is a measure of the selectivity of the program and scoring system. A combined measure of sensitivity and selectivity called the equivalence number was used to compare the methods. This measure is the number of related sequences missed at a similarity score X which balances the number of low-scoring family sequences missed with the number of unrelated sequences with scores at or above X. The equivalence number was scored for each of the 67 families and the results were compared by a simple statistical test. Later, this measure was replaced by more informative and detailed statistical estimates of alignment scores (Pearson 1996, 1998).
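The equivalence number defined above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Pearson's implementation: candidate thresholds are taken from the observed scores, and when no threshold balances the two counts exactly, the first threshold with the smallest imbalance is used. The function name and the example scores are hypothetical.

```python
def equivalence_number(related_scores, unrelated_scores):
    """Return (number of related sequences missed, threshold X).

    X is the score at which the number of related sequences scoring
    below X is closest to the number of unrelated sequences scoring
    at or above X.

    ASSUMPTIONS: thresholds are drawn from the observed scores, and
    ties in imbalance are resolved in favor of the lowest threshold.
    """
    if not related_scores or not unrelated_scores:
        raise ValueError("both score lists must be non-empty")
    thresholds = sorted(set(related_scores) | set(unrelated_scores))
    best = None
    for x in thresholds:
        missed = sum(1 for s in related_scores if s < x)
        false_hits = sum(1 for s in unrelated_scores if s >= x)
        imbalance = abs(missed - false_hits)
        if best is None or imbalance < best[0]:
            best = (imbalance, missed, x)
    return best[1], best[2]


# Hypothetical similarity scores from one database search.
family_scores = [120, 95, 80, 60, 40]
other_scores = [70, 55, 50, 45, 30, 20]

if __name__ == "__main__":
    print(equivalence_number(family_scores, other_scores))
```

A smaller equivalence number indicates a better separation of family members from unrelated sequences; a value of 0 means that some threshold separates the two groups completely.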
In performing this analysis, Pearson (1995) also found that rescaling the alignment scores to adjust for an apparent increase in score with sequence length increased the sensitivity and selectivity of the search. These scaling methods were later improved when it became possible to perform a reliable statistical assessment of the alignment score (Pearson 1996, 1998).
The best combination of program and scoring system was the SSEARCH program (Smith-Waterman program optimized as described previously) in combination with the BLOSUM55 scoring matrix (which is in units of 1/3 bits like the PAM, Gonnet92, and JTT matrices), and gap penalties of 12, 2. The PAM250 matrix performed significantly worse at every gap penalty tested. In contrast, the JO93 (10, 4 to 16, 2) and JTT160/JTT200 (14, 2) combinations performed as well as BLOSUM55 (12, 2). The higher PAM JTT matrices (JTT250 and 320) did not perform as well as BLOSUM55 (12, 2). BLOSUM62 (6, 4 and 8, 2), which is scaled in 1/2-bit units and thus requires smaller gap penalties, was almost, but not quite, as good as BLOSUM55 (12, 2). When length-scaled scores were used, the BLOSUM45 (10, 2; 12, 1; 12, 2) or Gonnet92 (same penalties) performed the best, and JO93 and JTT250 performed poorly. PAM250 (10, 2; 12, 2; 14, 2) did not perform significantly worse than BLOSUM45 (10, 2; 12, 1; 12, 2). Other scoring schemes, such as adding a small, positive value to the scoring matrix, or using a constant penalty for any size gap, or a constant penalty per gap position, did not perform as well as the above conditions. Pearson's results underscore the importance of using appropriate gap penalties. Besides having the undesirable effect of producing global instead of local alignments, a low gap penalty, e.g., 8, 1 in the above study, decreased the selectivity of the method used and led to erroneous estimates of statistical significance (Pearson 1998). Pearson later performed a detailed statistical analysis of alignment scores found in a database search by SSEARCH, using methods described in Chapter 7. BLOSUM50 (12, 2) and BLOSUM50 (14, 2) with rescaled similarity scores were the best combinations, the choice between these two depending on the method used to rescale the scores (Pearson 1998).
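The remark that a matrix scaled in 1/2-bit units requires smaller gap penalties than one scaled in 1/3-bit units can be made concrete with a little arithmetic. The Python sketch below assumes that a penalty should represent the same number of bits on either scale, and that the penalty pairs quoted above follow the convention described earlier (first number for a gap of size 1, second number for each additional gap position). Both function names are hypothetical.

```python
def convert_penalty(penalty, from_units_per_bit, to_units_per_bit):
    """Re-express a gap penalty on a different matrix scale.

    ASSUMPTION: the penalty should correspond to the same number of
    bits before and after conversion. A 1/3-bit matrix has 3 units
    per bit; a 1/2-bit matrix has 2 units per bit.
    """
    return penalty * to_units_per_bit / from_units_per_bit


def gap_cost(length, first_position, additional_position):
    """Total penalty for one gap of the given length.

    ASSUMPTION: `first_position` is the penalty for a gap of size 1
    and `additional_position` is added for each further gap position.
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return first_position + (length - 1) * additional_position


if __name__ == "__main__":
    # An opening penalty of 12 on a 1/3-bit scale, re-expressed in 1/2-bit units.
    print(convert_penalty(12, from_units_per_bit=3, to_units_per_bit=2))
    # Cost of a three-residue gap with penalties 12, 2.
    print(gap_cost(3, 12, 2))
```

A conversion of this kind gives only a starting value; the results above show that the best penalties for each matrix were found empirically.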
