 CHAPTER 3 PROBLEMS
   

PART I. DOT MATRIX ANALYSIS

Using DNA Strider on a Macintosh (DNA Strider is available from Dr. Christian Marck, [email protected])

  1. Protein Sequence Comparison

    Compare two Escherichia coli phage repressor protein sequences by copying FASTA-formatted sequences of the phage λ cI repressor protein (accession no. RPBPL) and the phage p22 c2 repressor protein (accession no. RPBP22) from the GenBank display window at http://www.ncbi.nlm.nih.gov/entrez/ into two DNA Strider protein windows as follows:

    1. Highlight lamc1.pro (RPBPL) with the mouse and, using the Copy option in the edit window, copy the sequence into the clipboard.
    2. Start DNA Strider and open a new protein sequence window.
    3. Paste the lamc1.pro sequence into the new protein window using the Paste command in the Edit window.
    4. Place the cursor at the start of the sequence.
    5. Return to the editor and, using the same procedure as above, copy the p22c2.pro sequence (RPBP22) into the clipboard, and then into a second new protein window in DNA Strider.
    6. Place the cursor at the start of the sequence. The sequences in the two windows, one in the top window and the other in the bottom window, may now be compared using the matrix option of DNA Strider.
    7. Hold down the Option key, choose the matrix drag-down window, and choose the protein matrix option. Then release the mouse button.
    8. In the window that appears, set the protein matrix options on the right. Choose a window and stringency of 1, and a scale of auto by sliding the cursor in the windows to these choices. Choose an identity matrix. Then click on the matrix button for proteins.
    9. Examine the matrix for the presence of a row of dots that represents a region of sequence similarity. Note the background matching that also appears, and which will be eliminated below by using a larger window.
    10. Close the matrix window and then repeat the matrix analysis by choosing a window of 2 and a stringency of 2. Note that the similarity stands out much more clearly.
    11. Repeat the matrix analysis again looking for a stringency of 2 in a window of 3 amino acids. Note that the region of similarity stands out more clearly still but that the resolution, i.e., the exact position of the individual amino acid matches, is not as clear.
    12. Repeat the analysis using the amino acid scoring matrix BLOSUM62.
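The window and stringency settings used in the steps above can be sketched in a few lines of Python. This is a minimal illustration with an identity matrix, not DNA Strider's own implementation: the function names are hypothetical, and placing each dot at the start of its window is an assumption (a program may instead place it at the window centre).

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return the set of (i, j) window start positions that score a dot.

    A dot is recorded when at least `stringency` of the `window` residue
    pairs along the diagonal starting at (i, j) are identical.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: one row per position in seq_a."""
    rows = []
    for i in range(len(seq_a)):
        rows.append("".join("*" if (i, j) in dots else "."
                            for j in range(len(seq_b))))
    return "\n".join(rows)


if __name__ == "__main__":
    seq = "GATTACA"  # toy sequence, for illustration only
    print(render(seq, seq, dot_matrix(seq, seq, window=1, stringency=1)))
    print()
    print(render(seq, seq, dot_matrix(seq, seq, window=3, stringency=3)))
```

With a window and stringency of 1, every chance identity produces a dot. Requiring all 3 residues of a 3-residue window to match leaves only the main diagonal for this toy sequence, which is the background-filtering effect the steps ask you to observe.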

  2. DNA Sequence Comparison

    1. To retrieve the DNA sequences of the above two repressor genes, go to the GenBank entries for the phage λ and p22 genomes and retrieve the gene sequences from the features table (NC_001416, complementary strand positions 37227..37940 and NC_002371, complementary strand positions 12764..13414, respectively).
    2. Find the coding sequence entry (CDS) for the repressor genes in Features and click the mouse on the CDS link. A new window will come up with the DNA sequence.
    3. Copy and paste the sequences into two DNA windows in DNA Strider.
    4. Use the DNA matrix option in the matrix window to obtain a dot matrix analysis of these DNA sequences using stringencies and windows of 1 and 1, and 7 and 10, respectively, using the identity matrix.

  3. Self-comparison for Finding Repeated Sequences

    1. Open a GenBank window for the haptoglobin hp2 protein sequence (accession no. 1006264A).
    2. Copy the sequence into a new protein window in DNA Strider.
    3. Use the protein self-matrix option to compare the sequence to itself. Use window 1, stringency 1, and identity matrix.
    4. Note the presence of any repeated elements and where they are.

  4. Complex Repeated Elements

    1. Obtain the human and chicken erythroid transcription factors (accession nos. CAA35120 and P17678, respectively) from GenBank.
    2. Copy and paste the sequences into protein windows in DNA Strider.
    3. Compare these sequences, first each to itself and then to each other using the same stringency and window settings in each case (2/3 or much higher, such as 15/23).
    4. What primary structure features do these proteins share? Look at the sequences and see if you can identify any features, e.g., repeats of the same amino acid, that are affecting the appearance of the dot matrix.

  5. Sequence Complexity

    When the same sequence characters are repeated many times, the complexity of the sequence is said to be low; i.e., only a few of the available sequence characters are used, or only a single character may be present. Such regions can make alignments look artificially good and score artificially high. They become quite apparent on the self-matrix as horizontal or vertical rows of dots.

    1. Examine the self-matrix pattern for the human erythroid factor above (match of stringency 1 to window 1, identity matrix) and describe what is observed around sequence positions 55 and 265.
    2. Examine the sequence and report what is found in the sequence at these positions.
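One simple way to flag low-complexity regions before inspecting them by eye is to count the distinct characters in a sliding window. The sketch below is only an illustration of that idea; the function name, window size, and threshold are assumptions, and the example sequence is invented rather than taken from GenBank.

```python
def low_complexity_windows(seq, window=6, max_distinct=1):
    """Return 0-based start positions of windows that use at most
    `max_distinct` different characters."""
    starts = []
    for i in range(len(seq) - window + 1):
        if len(set(seq[i:i + window])) <= max_distinct:
            starts.append(i)
    return starts


if __name__ == "__main__":
    toy = "MKTAAAAAAAALQE"  # invented sequence with a run of eight A's
    print(low_complexity_windows(toy, window=6, max_distinct=1))
```

Positions reported here are 0-based, whereas sequence positions in the problems are counted from 1.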

Using EMBOSS Dot Matrix Software

For the instructor. A knowledgeable computer support person will need to compile the EMBOSS programs on a UNIX or Linux server (Mac OS X is an alternative, but more time-consuming, option) and then provide X server access to PCs from the server as discussed in Figure 3.5 and the text. The EMBOSS programs are well documented, and online help is accessed through the tfm program followed by the program name, e.g., dotmatcher. Some of the displays done above with DNA Strider cannot be shown because there is a minimal window size of 3 in dotmatcher. An alternative dot matrix program, dotter, is described in the text. The sequences should be retrieved as text files in FASTA format in a convenient location for student access on the server. This task is good practice for students to do themselves if they have an account on the server. For now, they could save the GenBank files on a PC and then move them to the server, for example. It is also a good idea to make a protein scoring matrix that scores identities as 1 and mismatches as 0 using a text editor and place this matrix in the EMBOSS data directory with the other scoring matrices. In Chapter 12, students will learn how to retrieve sequences from GenBank directly using Perl scripts.
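An identity scoring matrix can also be generated with a short script instead of being typed by hand. The sketch below assumes the usual layout of the scoring matrix files in the EMBOSS data directory (comment lines starting with #, a header row of residue letters, then one row of scores per residue). Check the output against an existing matrix file before using it, since those files may also include extra symbols such as ambiguity codes. The function name and the output file name are hypothetical.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard residues


def identity_matrix_text(alphabet=AMINO_ACIDS, match=1, mismatch=0):
    """Build the text of a scoring matrix that scores identities as
    `match` and all other pairs as `mismatch`."""
    lines = ["# Identity matrix: identities score %d, mismatches %d"
             % (match, mismatch)]
    # Header row: the column labels.
    lines.append("   " + "  ".join(alphabet))
    # One row per residue: row label followed by the scores.
    for row in alphabet:
        scores = "  ".join(str(match if row == col else mismatch)
                           for col in alphabet)
        lines.append(row + "  " + scores)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "EIDENTITY" is a made-up file name; choose one that suits the server.
    with open("EIDENTITY", "w") as handle:
        handle.write(identity_matrix_text())
```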

For the students. For the above problems, it is best to have FASTA files of the sequences (other sequence formats can also be used).

  1. Run dotmatcher on the remote server using the X-Window client program. The program prompts for sequence names, reads in the sequences, and prompts for window size and stringency.
  2. For most input queries, press Return to accept the default, which is usually a reasonable choice.
  3. You can also use different scoring matrices to test the effects on the results, but these must be entered as options when you type in the name of the program, e.g., "dotmatcher -matrixfile=mychoice".
  4. Type "tfm dotmatcher" to read about all of the options.
  5. If all goes well, the results will be displayed in a window.




PART II. ALIGNMENT OF TWO SEQUENCES BY THE DYNAMIC PROGRAMMING ALGORITHM

In this section, protein sequence pairs will be aligned using Internet servers.

  1. Using one of the Web sites listed below and the default conditions provided by the Web site, align the protein sequences for the phage λ and p22 phage repressors. Cut and paste FASTA files of the sequences already available into these sites.
  2. Record the resulting percent identity and similarity and briefly describe what each represents.
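As a rough guide to what the two percentages mean, the sketch below computes them from a pair of already aligned sequences. It rests on several assumptions: gaps are written as "-", the denominator is the full alignment length (servers differ on this), and "similar" residues are taken from a small hypothetical set of conservative pairs. Servers commonly count a pair as similar when it has a positive score in the chosen scoring matrix, so consult the documentation of the site you use.

```python
# Hypothetical set of conservative substitutions, for illustration only.
SIMILAR_PAIRS = {frozenset("DE"), frozenset("KR"), frozenset("IL")}


def percent_identity_similarity(aligned_a, aligned_b,
                                similar_pairs=SIMILAR_PAIRS):
    """Return (percent identity, percent similarity) for two aligned
    sequences of equal length, with '-' marking gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = 0
    similar = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # gapped columns count as neither identical nor similar
        if x == y:
            identical += 1
            similar += 1  # identities are also counted as similarities
        elif frozenset((x, y)) in similar_pairs:
            similar += 1
    length = len(aligned_a)
    return 100.0 * identical / length, 100.0 * similar / length


if __name__ == "__main__":
    # Invented five-column alignment.
    print(percent_identity_similarity("MDE-K", "MEEAR"))
```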

Internet Sites for Sequence Alignment

The following are Web sites that will perform sequence alignment of two sequences by the dynamic programming algorithm.

  1. LALIGN (http://fasta.bioch.virginia.edu/) at the University of Virginia. This program is also available for Mac and PC computers but without a windows/mouse interface. The program finds not just one, but n nonoverlapping alignments of two sequences according to the SIM algorithm discussed in the text. In these alignments, the same two residues will never be found together more than once.
  2. SIM (http://us.expasy.org/tools/sim-prot.html) uses the same algorithm as the above site.
  3. BCM (http://searchlauncher.bcm.tmc.edu/) Baylor College of Medicine Web site offers a variety of methods of sequence alignment. Read the "h" option to see how these programs work. Not all of these programs use dynamic programming as the sole method. LFASTA and BLAST2 search for common words and then align on the basis of these words. The program align is a global alignment program based on the Needleman–Wunsch alignment algorithm instead of the Smith–Waterman local alignment algorithm. Unless dealing with strongly similar sequences of the same length, and alike along their entire lengths, a global alignment will not be useful.




PART III. CALCULATION OF SEQUENCE ALIGNMENT SCORES

Calculation of Log Odds and Odds Scores by the BLOSUM Method

In one column of an alignment of a set of related, similar sequences, amino acid D changes to amino acid E at a frequency of 0.10, and the number of times this change is expected based on the number of occurrences of D and E in the column is 0.05.

  1. What is the odds score of finding a D-to-E substitution in an alignment?
  2. What is the log odds score for the D-to-E substitution in bits? (Note: log to base 2 = natural log / 0.693.)
  3. What would be the entry in the BLOSUM amino acid scoring matrix for this substitution? Compare your result to the actual entry in the BLOSUM62 matrix.
  4. In the same column, D does not change at all at a frequency of 0.80, and the expected frequency of D not changing is 0.10. Calculate the corresponding log odds score and the BLOSUM62 entry for D not changing.
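The arithmetic behind these questions can be sketched as follows. The frequencies in the example are invented and deliberately differ from those in the problem, so the problem is still left to work out. The function names are hypothetical, and the last function assumes that matrix entries are log odds scores in half-bit units rounded to the nearest integer, the scaling used for BLOSUM62.

```python
import math


def odds(observed, expected):
    """Odds score: how much more (or less) often a substitution is
    observed than expected by chance."""
    return observed / expected


def log_odds_bits(observed, expected):
    """Log odds score in bits (log to base 2).

    Equivalent to the natural log divided by ln 2, about 0.693.
    """
    return math.log(observed / expected) / math.log(2)


def half_bit_entry(observed, expected):
    """Matrix entry in half-bit units, rounded to the nearest integer."""
    return round(2 * log_odds_bits(observed, expected))


if __name__ == "__main__":
    # Invented values: a substitution seen at 0.02 but expected at 0.08.
    print(odds(0.02, 0.08))           # about 0.25
    print(log_odds_bits(0.02, 0.08))  # about -2 bits
    print(half_bit_entry(0.02, 0.08))
```

A substitution seen less often than expected gets an odds score below 1 and therefore a negative log odds score; one seen more often than expected gets a positive score.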

Log Odds and Odds Score of a Short Alignment

  1. Using the above values, what is the log odds score of the following alignment in bits?
    (Note that these two short sequences have very low sequence complexity because they use only two of the 20 available amino acids. These sequences were chosen to simplify the calculations. Alignments of low-complexity sequences can give quite high scores that give a misleading impression of sequence similarity, as discussed in Chapter 6.)
    DEDEDEDE
    DDDDDDDD
  2. What is the odds score of the above alignment?
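Because log odds scores add where odds scores multiply, the score of an ungapped alignment is the sum of its column scores, and the odds score is 2 raised to that sum when the scores are in bits. The sketch below illustrates this with an invented two-letter score table, not the values from the problem; the table and function names are hypothetical.

```python
# Invented log odds scores in bits, for illustration only.
TOY_SCORES = {("A", "A"): 2.0, ("A", "G"): -1.0, ("G", "G"): 3.0}


def pair_score(x, y, table):
    """Look up a symmetric score: (x, y) and (y, x) share one entry."""
    if (x, y) in table:
        return table[(x, y)]
    return table[(y, x)]


def alignment_log_odds(seq_a, seq_b, table=TOY_SCORES):
    """Sum the column log odds scores of an ungapped alignment (bits)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(pair_score(x, y, table) for x, y in zip(seq_a, seq_b))


def alignment_odds(seq_a, seq_b, table=TOY_SCORES):
    """Convert the summed log odds score in bits back to an odds score."""
    return 2 ** alignment_log_odds(seq_a, seq_b, table)


if __name__ == "__main__":
    print(alignment_log_odds("AGAG", "AAAA"))  # 2 - 1 + 2 - 1 = 2 bits
    print(alignment_odds("AGAG", "AAAA"))      # 2 ** 2 = 4
```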




PART IV. COMPARING ALIGNMENT SCORES WITH SMALL AND LARGE GAP PENALTIES

For this question, use the program LALIGN on the University of Virginia FASTA server http://fasta.bioch.virginia.edu/. This program aligns sequences by a local dynamic programming algorithm and includes end gap penalties. It produces as many different alignments as specified, with no two alignments including a match of the same two sequence positions.

  1. Obtain the following two sequences from GenBank in FASTA format: recA.pro (P03017) from the bacterium E. coli and rad51.pro (P25454) from budding yeast (Saccharomyces cerevisiae). These proteins have the same function, i.e., promoting the pairing of homologous single-stranded DNAs. They almost certainly have the same three-dimensional structure but have diverged enough that they are difficult to align.
  2. Use LALIGN to align the above two sequences with gap penalties of -12 and -2. Note the length of the alignment, the percent identity, and the score of the alignment.
  3. Repeat the alignment with gap penalties of -5 and -1 and note the features of the alignment.
  4. Describe what happened when the gap penalties were reduced. Which of these alignments looks like a local alignment and which looks like a global alignment?




PART V. USING THE DYNAMIC PROGRAMMING METHOD TO CALCULATE THE LOCAL ALIGNMENT OF TWO SHORT SEQUENCES BY HAND

The BLASTP algorithm performs a local alignment between a query sequence and a matching database sequence using the dynamic programming algorithm with the BLOSUM62 scoring matrix, a gap opening penalty of -11, and a gap extension penalty of -1 (i.e., a gap of length 1 has a penalty of -11, one of length 2 a penalty of -12, etc.). Align the sequences MDPW and MEDPW using the Smith–Waterman algorithm described in the dynamic programming notes. Follow the layout of the global alignment example given in the notes, but apply the Smith–Waterman local alignment rules.

  1. Make a matrix for keeping track of best scores and a second matrix to keep track of the moves that give the best scores. (Hint: The alignment of M's, P's, and W's all give high scores, so the problem boils down to how to align D with ED and is actually quite a trivial problem.)
  2. Use the BLOSUM62 matrix and BLASTP gap penalties of -11 and -1. What is the optimal alignment and score between these two sequences?
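To check a hand calculation of this kind, the local alignment score can be computed with a short program. The sketch below is a score-only Smith–Waterman with the gap convention stated above (a gap of length 1 costs 11, each further position costs 1). To leave the problem itself unsolved, it uses a simple match/mismatch score (+5/-4, an assumption) instead of BLOSUM62, and invented sequences. The function names are hypothetical.

```python
def gap_penalty(length, gap_open=11, gap_extend=1):
    """Penalty for a gap of `length` positions: the first position costs
    gap_open, each additional position costs gap_extend."""
    return gap_open + (length - 1) * gap_extend


def local_affine_score(a, b, match=5, mismatch=-4, gap_open=11, gap_extend=1):
    """Best local alignment score with affine gap penalties (score only)."""
    neg = float("-inf")
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]    # best score ending at (i, j)
    E = [[neg] * (m + 1) for _ in range(n + 1)]  # ending with a gap in a
    F = [[neg] * (m + 1) for _ in range(n + 1)]  # ending with a gap in b
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Open a new gap or extend an existing one.
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a local alignment restart anywhere.
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    a, b = "AAAATTTT", "AAAAGTTTT"  # invented sequences
    print(local_affine_score(a, b))                            # large gap penalties
    print(local_affine_score(a, b, gap_open=2, gap_extend=1))  # small gap penalties
```

With the large penalties the best score comes from an ungapped alignment that tolerates one mismatch; with the small penalties a gapped alignment that matches all eight residues scores higher. This is the same trade-off explored in Part IV.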




 

© 2004 by Cold Spring Harbor Laboratory Press. All rights reserved.
No part of these pages, either text or image, may be used for any purpose other than personal use. Therefore, reproduction, modification, storage in a retrieval system, or retransmission, in any form or by any means, electronic, mechanical, or otherwise, for reasons other than personal use, is strictly prohibited without prior written permission.

 

 