Bioinformatics Online Home
   Chapters Links Problems Enroll for Updates Help


Table 6.1. Types of database searches for proteins
Columns: Type of search | Target database | Method | Type of query data | Examples of programs used, location (also see Tables 6.2, 6.4, 6.7, and 6.8) | Results of database search
A. Sequence similarity search with query sequence
   Target database: protein sequence database (or genomic sequences [a])
   Method: search for database sequence that can be aligned with query sequence
   Type of query data: single sequence, e.g.,
      DAHQSNGA
   Programs, location:
      FASTA (TFASTA [a]), SSEARCH: http://fasta.bioch.virginia.edu/fasta/
      BLASTP (TBLASTN [a]): http://www.ncbi.nlm.nih.gov/BLAST/
      WU-BLAST: http://blast.wustl.edu/
   Results: list of database sequences having the most significant similarity scores
B. Alignment search with profile (scoring matrix [b,d] with gap penalties)
   Target database: protein sequence database
   Method: prepare profile from a multiple sequence alignment (Profilemake) and align profile with database sequence
   Type of query data: profile representing gapped multiple sequence alignment, e.g.,
      D-HQSNGA
      ESHQ-YTM
      EAHQSN-L
      EGVQSYSL
   Programs, location:
      Profilesearch: ftp.sdsc.edu/pub/sdsc/biology
   Results: list of database sequences that can be aligned with the profile
C. Search with position-specific scoring matrix [c,d] (PSSM) representing ungapped sequence alignment (BLOCK)
   Target database: protein sequence database
   Method: prepare PSSM from ungapped region of multiple sequence alignment or search for patterns of same length in unaligned sequences, [c] then use for database search
   Type of query data: PSSM representing ungapped alignment, e.g.,
      DAHQSN
      ESHQSY
      EAHQSN
      EGVQSY
   Programs, location:
      MAST: http://meme.sdsc.edu/meme/website/mast.html
   Results: list of database sequences with one or more patterns represented by PSSM but not necessarily in the same order
D. Iterative alignment search for similar sequences that starts with a query sequence, builds a gapped multiple alignment, and then uses the alignment to augment the search [d]
   Target database: protein sequence database
   Method: uses initial matches to query sequence to build a type of scoring matrix and searches for additional matches to the matrix by an iterative search method [d]
   Type of query data: builds matches to query sequence, e.g.,
      DAHQSNGA
      iteration 1
      H-SNGA EAHQSN-L
      further iterations
   Programs, location:
      PSI-BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
   Results: PSI-BLAST finds a set of sequences related to each other by the presence of common patterns (not every sequence may have same patterns)
E. Search query sequence for patterns representative of protein families [e]
   Target database: database of patterns found in protein families
   Method: search for patterns represented by scoring matrix or hidden Markov model (profile HMM) [e]
   Type of query data: single sequence, e.g.,
      DAHQSNGA
   Programs, location:
      Prosite: http://www.expasy.ch/prosite
      INTERPRO: http://www.ebi.ac.uk/interpro
      Pfam: http://www.sanger.ac.uk/Pfam
      CDD/CDART: http://www.ncbi.nlm.nih.gov/BLAST
      (also see Table 10.5)
   Results: list of sequence patterns found in query sequence
    a Searches of this type include programs that search nucleic acid databases for matches to a query protein sequence by automatically translating the nucleic acid sequences in all six possible reading frames (TFASTA, TBLASTN). These searches may be useful when only genomic sequences or partial cDNA sequences (expressed sequence tag, or EST, sequences) of an organism are available. Genomic sequences that encode proteins may also have been found by gene prediction programs (Chapter 9). The predicted protein is then usually entered in the protein sequence databases, where matches to it may be found by searches of those databases. These gene predictions are error-prone (see Chapter 9).
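The six-frame translation mentioned in footnote a can be sketched as follows. This is a minimal illustration, not the TFASTA or TBLASTN implementation: it assumes the standard genetic code, and the function names (`reverse_complement`, `translate`, `six_frame_translations`) and the frame labels (`+1` to `-3`) are hypothetical choices made for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in T, C, A, G order at each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; unknown codons become 'X', stops become '*'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to translated protein strings."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    # Hypothetical DNA chosen so that frame +1 encodes the table's example query.
    for frame, protein in six_frame_translations("GATGCTCATCAATCTAATGGTGCT").items():
        print(frame, protein)
```

Each of the six translated strings can then be compared with the query protein in the same way as an ordinary protein database entry.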
    b A multiple sequence alignment that includes gaps may be represented by a profile, a type of scoring matrix discussed in Chapter 5, page 189. The consecutive rows of the matrix represent columns of the multiple sequence alignment, and the column values represent the distribution of amino acids in each column of the alignment. The profile includes extra columns with gap opening and extension penalties. The profile is aligned to a sequence by sliding the profile along the sequence and finding the position with the best alignment score by means of a dynamic programming method. The alignment may include gaps in the database sequence. The best-scoring alignments are with database sequences that have a pattern similar to that represented by the profile.
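The dynamic programming step in footnote b can be sketched as below, using the gapped example alignment from row B of the table. This is a simplified illustration and not the Profilemake or Profilesearch method: it assumes log-odds scores against a uniform background with a pseudocount, and it replaces the separate gap opening and extension penalties with a single per-position penalty. The function names and all numeric values are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=0.5, gap_free=3.0, gap_seen=1.0):
    """Return one (scores, gap_penalty) row per column of a gapped alignment."""
    profile = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        # Log2 odds against an assumed uniform background of 1/20 per amino acid.
        scores = {
            aa: math.log2((residues.count(aa) + pseudocount) / total * len(AMINO_ACIDS))
            for aa in AMINO_ACIDS
        }
        # Assumption: gaps cost less in columns where the alignment already has one.
        penalty = gap_seen if "-" in column else gap_free
        profile.append((scores, penalty))
    return profile


def best_profile_score(profile, sequence):
    """Best score for the whole profile against any stretch of the sequence.

    The sequence must use the 20 standard amino acid letters.
    """
    n, m = len(profile), len(sequence)
    # S[i][j]: best score for profile rows 1..i, alignment ending at sequence position j.
    # Row 0 stays at zero, so the profile may start anywhere along the sequence.
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] - profile[i - 1][1]
    for i in range(1, n + 1):
        scores, penalty = profile[i - 1]
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + scores[sequence[j - 1]],  # residue matched to row i
                S[i - 1][j] - penalty,                      # row i matched to a gap
                S[i][j - 1] - penalty,                      # extra residue in the sequence
            )
    # The profile may also end anywhere along the sequence.
    return max(S[n])


if __name__ == "__main__":
    profile = build_profile(["D-HQSNGA", "ESHQ-YTM", "EAHQSN-L", "EGVQSYSL"])
    print(best_profile_score(profile, "KKKEAHQSNSLKKK"))
```

Leaving the first row of the table at zero and taking the maximum over the last row is what lets the profile "slide" along the database sequence without penalizing the unaligned ends.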
    c The position-specific scoring matrix (PSSM), or weight matrix as it is sometimes called, is a representation of a multiple sequence alignment that has no gaps (a BLOCK). The matrix may be made from a multiple sequence alignment, or by searching for patterns of the same length in a set of sequences using pattern-finding or statistical methods (e.g., expectation maximization, Gibbs sampling, ASSET) and then aligning these patterns, as discussed in Chapter 5. The consecutive columns of the matrix represent columns of the aligned patterns and the rows represent the distribution of amino acids in each column of the alignment. The PSSM columns include log odds scores for evaluating matches with a target sequence. The matrix is used to search a sequence for comparable patterns by sliding the matrix along the sequence and, at each position in the sequence, evaluating the match at each column position using the matrix values for that column. The log odds scores for each column are added to obtain a log odds score for the alignment to that sequence position. High log odds scores represent a significant match.
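The sliding search in footnote c can be sketched as follows, using the ungapped example block from row C of the table. This is an illustration only, not the MAST algorithm: the uniform background frequencies, the pseudocount of 0.5, and the function names (`build_pssm`, `scan`, `best_hit`) are assumptions made for this example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(block, pseudocount=0.5):
    """Return one dict of log2 odds scores per column of an ungapped block."""
    total = len(block) + pseudocount * len(AMINO_ACIDS)
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    return [
        {
            aa: math.log2((column.count(aa) + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        }
        for column in zip(*block)
    ]


def scan(pssm, sequence):
    """Slide the PSSM along the sequence; return (start, score) for each window."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Add the column scores to get the log odds score at this position.
        score = sum(column[residue] for column, residue in zip(pssm, window))
        hits.append((start, score))
    return hits


def best_hit(pssm, sequence):
    """Return the (start, score) pair with the highest score.

    Raises ValueError if the sequence is shorter than the PSSM.
    """
    return max(scan(pssm, sequence), key=lambda hit: hit[1])


if __name__ == "__main__":
    pssm = build_pssm(["DAHQSN", "ESHQSY", "EAHQSN", "EGVQSY"])
    print(best_hit(pssm, "MKDAHQSNGA"))
```

Because no gaps are allowed, each window is scored by a simple sum, which is why a PSSM search needs no dynamic programming.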
     d Using a scoring matrix instead of a single query sequence can enhance a database search because the matrix represents the greater amount of sequence variation found in a multiple sequence alignment. Amino acid representation in each column of the alignment is also reflected in the matrix scores for that column; the more common an amino acid, the higher the score for a match to that amino acid. Note also that the matrix does not store any information about correlations between sequence positions. Thus, if two amino acids are commonly found together in the sequences at two positions of the alignment, these will each be independently scored by the matrix, but there will be no information as to their co-occurrence (or covariation) in the sequences. Since this type of information is missing, the matrix can give high scores to patterns that include new combinations of amino acids not found in the original set of sequences. Scoring covariation in sequence positions is discussed further in Chapters 8, 9, and 10.
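The point in footnote d about missing covariation information can be illustrated with the example block from row C. In that block, A at the second position always occurs with N at the sixth position, yet a matrix that scores columns independently still rates the unseen combination of A with Y highly. The scoring scheme (uniform background, pseudocount of 0.5) and the function names are assumptions made for this sketch.

```python
import math
from collections import Counter

BLOCK = ["DAHQSN", "ESHQSY", "EAHQSN", "EGVQSY"]  # example block from row C


def pssm_score(block, sequence, pseudocount=0.5, alphabet_size=20):
    """Sum independent per-column log2 odds scores for a sequence of block length."""
    total = len(block) + pseudocount * alphabet_size
    score = 0.0
    for column, residue in zip(zip(*block), sequence):
        frequency = (column.count(residue) + pseudocount) / total
        score += math.log2(frequency * alphabet_size)  # uniform background assumed
    return score


def pair_counts(block, i, j):
    """Count the residue pairs observed together at columns i and j (0-based)."""
    return Counter((seq[i], seq[j]) for seq in block)


if __name__ == "__main__":
    novel = "EAHQSY"  # A with Y never occurs together in BLOCK
    print(pair_counts(BLOCK, 1, 5))
    print(pssm_score(BLOCK, novel), pssm_score(BLOCK, "DAHQSN"))
```

Here the novel sequence scores higher than DAHQSN, an actual member of the block, because each of its residues is common in its own column even though the combination was never observed.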
    e Pattern databases are described in Chapter 10.

 

© 2004 by Cold Spring Harbor Laboratory Press. All rights reserved.
No part of these pages, either text or image, may be used for any purpose other than personal use. Therefore, reproduction, modification, storage in a retrieval system, or retransmission, in any form or by any means, electronic, mechanical, or otherwise, for reasons other than personal use, is strictly prohibited without prior written permission.

 

 