Bioinformatics Online Home
   Chapters Links Problems Enroll for Updates Help
 CHAPTER 4 PROBLEMS
   


1. Log odds and odds score of a short alignment. This question is a continuation of the question in Part III in Chapter 3 (p. 119).
  1. Using the values calculated in Part III, Chapter 3, what is the log odds score of the following alignment in bits? (Note that these two short sequences have very low sequence complexity because they contain only two of the 20 available amino acids. These sequences were chosen to simplify the calculations. Such alignments are quite high scoring, but the low complexity means that the score can be misleading, as discussed in Chapter 5, p. 254.)
  2. What is the odds score of the following alignment?
        DEDEDEDE
        DDDDDDDD
  3. Using the section "Quick Determination of the Significance of an Alignment Score" (p. 139), and assuming that the above alignment was found by aligning two sequences of length 250, is the alignment significant at the 0.05 level? (That is, is the probability that an alignment of two random sequences of the same length achieves a score at least this high less than 0.05?)
  4. If the gap penalties were very high, e.g., a gap opening penalty of 8 and a gap extension penalty of 8, so that no gaps were produced, and the BLOSUM62 scoring matrix was used, calculate the significance of the alignment using Equation 8. You will need to find the values of K and λ in Table 4.3 (p. 142). Note that λ in this table assumes that the alignment score is in half-bits, so the alignment score must be expressed in these units as well.
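
The arithmetic behind parts 2 through 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the quick test is taken to be the common rule that a bit score must exceed log2(mn/α), and Equation 8 is assumed to have the usual extreme value form P(S ≥ x) = 1 − exp(−Kmn·e^(−λx)). The function names are hypothetical, and no values of K or λ from Table 4.3 are supplied here; they must be looked up in the book.

```python
import math

def bits_to_odds(bit_score):
    """Convert a log odds score in bits (log base 2) to an odds score."""
    return 2.0 ** bit_score

def quick_threshold_bits(m, n, alpha=0.05):
    """Bit score that an alignment of sequences of lengths m and n must
    reach to be significant at level alpha, ASSUMING the rule
    score >= log2(m * n / alpha)."""
    return math.log2(m * n / alpha)

def evd_p_value(score, m, n, K, lam):
    """P(S >= score) under the ASSUMED extreme value form
    1 - exp(-K * m * n * exp(-lam * score)).
    The score must be in the same units that lam assumes (e.g., half-bits)."""
    expected = K * m * n * math.exp(-lam * score)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-expected)

if __name__ == "__main__":
    print(bits_to_odds(3))                  # a 3-bit score is odds of 8:1
    print(quick_threshold_bits(250, 250))   # threshold for two 250-residue sequences
```

To use `evd_p_value`, pass the alignment score in half-bits together with the K and λ read from Table 4.3 for the chosen matrix and gap penalties.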




2. Statistical evaluation of sequence alignment scores. This question is a continuation of Part IV in Chapter 3 (p. 119).

We will calculate significance of an alignment between two sequences by scrambling one sequence many times and recalculating the alignment scores to see how they compare.

  1. The program PRSS on the Pearson FASTA Web site http://fasta.bioch.virginia.edu/ will scramble the second sequence and calculate many alignment scores. Scrambling can be done at the individual amino acid level or with a window of amino acids to keep repetitive sequences intact.
  2. A plot of the scores of the scrambled sequence alignments is shown on the Web page, and these scores are compared to the original alignment score between the sequences.
  3. The scores are fitted to an extreme value distribution curve and K and λ are calculated.
    Note that when many such comparisons are made, e.g., when the first sequence is compared to 100 scrambled second sequences, the expected number of alignments that achieve the original score must be calculated. If the probability that a single alignment with a scrambled sequence achieves the original score is 1/10,000 and 100 scrambled sequences are tested, then the expected number for 100 sequences is 1/10,000 × 100 = 1/100.
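
The scrambling procedure described above can be illustrated with a small Python sketch. It is not PRSS: the local alignment scorer uses a simple match/mismatch scheme with a linear gap penalty rather than a real scoring matrix, the windowed shuffle is one plausible reading of the window option (residues are shuffled only within consecutive windows), and all function names and default parameters are hypothetical.

```python
import random

def scramble(seq, window=1, rng=random):
    """Shuffle all residues (window <= 1), or shuffle residues only within
    consecutive windows so that local composition is preserved."""
    residues = list(seq)
    if window <= 1:
        rng.shuffle(residues)
        return "".join(residues)
    out = []
    for i in range(0, len(residues), window):
        chunk = residues[i:i + window]
        rng.shuffle(chunk)
        out.extend(chunk)
    return "".join(out)

def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty).
    A toy scoring scheme, used here in place of a real matrix."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            s = max(0,
                    prev[j - 1] + (match if x == y else mismatch),
                    prev[j] + gap,
                    cur[j - 1] + gap)
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best

def shuffled_scores(a, b, n_shuffles=100, window=1, seed=0):
    """Scores of a against n_shuffles scrambled copies of b."""
    rng = random.Random(seed)
    return [local_score(a, scramble(b, window, rng)) for _ in range(n_shuffles)]

def expected_count(p_single, n_comparisons):
    """Expected number of comparisons reaching the original score."""
    return p_single * n_comparisons
```

Comparing `local_score(a, b)` with the distribution returned by `shuffled_scores(a, b)` mirrors the plot on the PRSS page; PRSS additionally fits the scrambled scores to an extreme value distribution, which this sketch does not attempt.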

Obtain the same two sequences from GenBank in FASTA format as done previously.
  1. Use PRSS to align the reca.pro (P03017 from the bacterium E. coli) and rad51.pro (P25454 from budding yeast S. cerevisiae) sequences downloaded in Chapter 3 problems with gap penalties of -12 and -2 and perform 1000 scrambled alignments. Note the expect value for the alignment score found between these proteins.
  2. Repeat the analysis with gap penalties of -5 and -1 and note the expect value.
  3. Describe what happened when the gap penalties were reduced.




3. A Bayesian method for estimating evolutionary distance between nucleic acid sequences.
  1. A great deal can be learned about the use of PAM matrices using the DNA PAM matrices as an example (p. 95).
    1. First an alignment between two DNA sequences without any gaps is found. The object is to figure out how long ago (in PAM units of time where 1 PAM equals 10 million years) the sequences might have diverged to give the observed variation (the number of mismatches) in the alignment.
    2. A PAM1 matrix is made for an expected model of evolution, e.g., each base can change into any other base and the overall rate of change in the sequences is 1% (see Table 3.4, p. 107).
    3. For longer periods, e.g., PAM10, the PAM1 matrix is raised to the nth power (n = 10 for PAM10).
    4. The more time, the greater the amount of change expected and these changes are reflected in the log odds scores of each particular PAM matrix. These are shown in the table of nucleic acid substitution matrices (Table 3.6 on p. 108) that assumes a uniform rate of mutation among nucleotides and that a 1% change in sequence represents 10 my (million years) of mutation.
    5. Examine the substitution rates in the following sequences and decide approximately how many years ago they became separated.
          AGTTG ACTAA GCCAG GTCAC
          ACTTG CCGGA GCCTC GTGTC
  2. What log odds and odds scores are found for the alignment for PAM distances of 10, 25, 50, 100, and 125, and which score is highest?
  3. Add up the odds scores and determine the ratio of each to the total. What is the sum of these numbers, and what do these numbers represent?
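
The calculation outlined in this problem can be sketched in Python as follows. The sketch assumes the uniform model described above (each base changes to any of the other three at the same rate, 1% total change per PAM, background frequency 0.25 per base) and an equal prior weight for every PAM distance tested. It works with unrounded log odds scores, so its numbers may differ slightly from those obtained with the rounded entries of Table 3.6. All function names are hypothetical.

```python
import math

BASES = "ACGT"

def pam1(rate=0.01):
    """PAM1 transition matrix: a base stays the same with probability
    1 - rate and changes to each other base with probability rate / 3."""
    off = rate / 3.0
    return [[1.0 - rate if i == j else off for j in range(4)]
            for i in range(4)]

def matmul(a, b):
    """Product of two 4 x 4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def pam(n, rate=0.01):
    """PAMn matrix: PAM1 raised to the nth power."""
    base = pam1(rate)
    result = base
    for _ in range(n - 1):
        result = matmul(result, base)
    return result

def alignment_log_odds(seq1, seq2, n, background=0.25):
    """Log odds score in bits of an ungapped alignment at PAM distance n."""
    m = pam(n)
    return sum(math.log2(m[BASES.index(x)][BASES.index(y)] / background)
               for x, y in zip(seq1, seq2))

def posterior_over_distances(seq1, seq2, distances):
    """Each odds score divided by the sum of all odds scores
    (assumes every distance is equally likely beforehand)."""
    odds = [2.0 ** alignment_log_odds(seq1, seq2, n) for n in distances]
    total = sum(odds)
    return [o / total for o in odds]

# The two sequences from the problem, with the spacing removed.
s1 = "AGTTGACTAAGCCAGGTCAC"
s2 = "ACTTGCCGGAGCCTCGTGTC"

if __name__ == "__main__":
    distances = [10, 25, 50, 100, 125]
    for n, p in zip(distances, posterior_over_distances(s1, s2, distances)):
        print(n, round(alignment_log_odds(s1, s2, n), 2), round(p, 3))
```

Running the script prints, for each PAM distance, the log odds score of the alignment and the share of the total odds score contributed by that distance.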




 

© 2004 by Cold Spring Harbor Laboratory Press. All rights reserved.
No part of these pages, either text or image, may be used for any purpose other than personal use. Therefore, reproduction, modification, storage in a retrieval system, or retransmission, in any form or by any means, electronic, mechanical, or otherwise, for reasons other than personal use, is strictly prohibited without prior written permission.

 

 