Task 6 - Protein structure prediction from evolutionary sequence variation

From Bioinformatikpedia

For the proteins used in this practical, structures have been determined. In real-life projects, however, you often have only sequences and no protein structures, even though structure often provides crucial information and furthers the understanding of the protein's function. Earlier in this practical, you predicted secondary structure elements from protein sequence and generated homology models from the structures of proteins with sequence similarity to your protein. This week, we will use evolutionary couplings (EV) from correlated mutations to predict structures from protein sequence alignments alone, without making use of known 3D structures.

Theoretical background talk

The talk will give an introduction to structure prediction from correlated mutations. In particular, it introduces EVcouplings and EVfold. The slides can be found here: File:Evol variation.pdf.

Task

Address questions concerning this task to Thomas, Julia and Edda.

Retrieving meaningful information from correlated mutations in many cases requires multiple sequence alignments (MSAs) of more than 1,000 or even more than 10,000 sequences. Since your proteins may not have enough homologs, please apply the following steps both to your protein and to GTPase HRas (UniProt ID P01112). Results for HRas can be found here: http://evfold.org/evfold-web/datasets.do -> Appendix 3.

1. Multiple sequence alignment

Retrieve an MSA for HRas, for example from Pfam: Ras (PF00071). Download the full alignment, i.e. approximately 21,000 sequences. For your protein, you can also download the Pfam alignment (check how much of your protein's sequence is covered by its Pfam domain(s)) or use your own alignments from task 2, e.g. PSI-BLAST or HHblits for homology searches and ClustalW to generate an MSA.
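
Two quick sanity checks on an alignment are the number of sequences it contains and the fraction of your protein covered by its Pfam domain(s). The following is a minimal sketch, assuming the alignment is available as FASTA-formatted text and that the domain boundaries are known as 1-based, inclusive (start, end) ranges. The function names, the toy alignment and the example numbers are hypothetical and only serve as illustration.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def domain_coverage(protein_length, domains):
    """Fraction of residues 1..protein_length covered by (start, end) ranges.

    Ranges are 1-based and inclusive; overlapping ranges are counted once.
    """
    covered = set()
    for start, end in domains:
        covered.update(range(max(1, start), min(protein_length, end) + 1))
    return len(covered) / protein_length


# Hypothetical toy alignment, for illustration only.
toy_msa = """>query/1-8
MTEYKLVV
>homolog_a/3-10
MTE-KLVV
>homolog_b/1-7
MSEYKL-V
"""
records = read_fasta(toy_msa)
print(len(records), "sequences in the alignment")

# Hypothetical 200-residue protein with a single domain at positions 20-119.
print(domain_coverage(200, [(20, 119)]))
```

If the sequence count is far below the 1,000 mentioned above, the coupling scores for that protein should be interpreted with caution.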

To calculate correlated mutations with the freecontact tool, the MSA has to be reformatted. Use /usr/share/freecontact/a2m2aln (installed on i12k-biolab). For further details, see the tool's help or look up: man freecontact

2. Calculate and analyze correlated mutations

Use freecontact (installed on i12k-biolab) with standard parameters and evfold as the output format. Read the man page for further details, e.g. a description of the output. You will need the evolutionary coupling scores, i.e. the corrected norm (CN) contact scores from freecontact, for the following analyses. Note that the CN score is an improved version of what was previously called the DI score.

  • Extract all pairs and their scores for residue pairs i and i+n, where n>5, i.e. consider only residue pairs separated by at least 5 residues in the sequence. Why are the scores of residues close in sequence amongst the highest? Why are the pairs distant in sequence (n>5) more interesting for structure prediction?
  • Rank by score in descending order. Look at the values, range and distribution of scores.
  • Check for each residue pair its actual distance in the structure. Consider pairs with a minimum atom distance of less than 5 Å (i.e. at least one pair of atoms, one from each residue, is closer than 5 Å) as true positives, i.e. contacts. How many of the high-scoring pairs are true or false positives? Does this correlate with the value of the score? Visualize the predicted contacts together with the crystal structure contacts in a contact map plot.
  • Take the top L couplings (L = length of aligned protein sequence) with a sequence distance > 5 and sum the scores for each residue. Normalize with the average of the top L scores. [Note, calculate as follows: sum all top L scores and divide by L. The result, i.e. the average of the top L scores, is your normalization parameter. To normalize, divide the sum of scores for each residue by this normalization parameter.] Then rank residues in descending order of scores. Can you determine evolutionary hot spots, i.e. functionally important residues? Compare to conserved sites in the MSA. Compare with your results from task 7 (this comparison can only be done later, once you are working on task 7).
  • Also use EVcouplings from the evolutionary couplings server. Here, the DI score is given. Compare the top 50 DI and CN (from freecontact) scores. How large is the overlap (is it above 80%)?
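
The analysis steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference solution: it assumes that each line of the evfold-format output has six whitespace-separated columns (i, amino acid i, j, amino acid j, MI, CN), which should be checked against the freecontact man page, and that per-residue atom coordinates have already been extracted from the crystal structure into a dictionary. All function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def parse_evfold(lines):
    """Read (i, j, score) tuples from whitespace-separated lines.

    ASSUMPTION: six columns 'i aa_i j aa_j MI CN'; verify with 'man freecontact'.
    """
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue
        pairs.append((int(fields[0]), int(fields[2]), float(fields[5])))
    return pairs


def long_range_ranked(pairs, min_separation=6):
    """Keep pairs with |i - j| >= min_separation (n > 5), highest score first."""
    kept = [p for p in pairs if abs(p[0] - p[1]) >= min_separation]
    return sorted(kept, key=lambda p: p[2], reverse=True)


def min_atom_distance(atoms_a, atoms_b):
    """Smallest distance between any atom of one residue and any atom of the other."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def precision(ranked, residue_atoms, cutoff=5.0):
    """Fraction of predicted pairs with a minimum atom distance below cutoff.

    residue_atoms maps residue number -> list of (x, y, z) tuples.
    Pairs with a residue missing from the structure are skipped.
    """
    hits = total = 0
    for i, j, _ in ranked:
        if i in residue_atoms and j in residue_atoms:
            total += 1
            if min_atom_distance(residue_atoms[i], residue_atoms[j]) < cutoff:
                hits += 1
    return hits / total if total else 0.0


def hotspot_scores(ranked, length):
    """Per-residue score sums over the top `length` pairs, normalized as in the note."""
    top = ranked[:length]
    mean_score = sum(score for _, _, score in top) / length
    totals = defaultdict(float)
    for i, j, score in top:
        totals[i] += score
        totals[j] += score
    normalized = [(res, total / mean_score) for res, total in totals.items()]
    return sorted(normalized, key=lambda item: item[1], reverse=True)


def top_overlap(ranked_a, ranked_b, n=50):
    """Fraction of the top-n residue pairs shared by two rankings."""
    top_a = {tuple(sorted(p[:2])) for p in ranked_a[:n]}
    top_b = {tuple(sorted(p[:2])) for p in ranked_b[:n]}
    return len(top_a & top_b) / n


# Hypothetical toy data, for illustration only.
toy_lines = [
    "1 M 2 T 0.10 0.90",   # sequence neighbours, removed by the filter
    "1 M 10 G 0.20 0.80",
    "3 E 12 V 0.15 0.40",
    "1 M 20 K 0.05 0.30",
]
toy_atoms = {
    1: [(0.0, 0.0, 0.0)], 3: [(0.0, 0.0, 0.0)], 10: [(3.0, 0.0, 0.0)],
    12: [(0.0, 9.0, 0.0)], 20: [(0.0, 0.0, 4.0)],
}
ranked = long_range_ranked(parse_evfold(toy_lines))
print(ranked)
print(precision(ranked, toy_atoms))
print(hotspot_scores(ranked, 2))
```

For the comparison of DI and CN scores, `top_overlap` can be applied to two rankings produced in the same way. Computing `precision` on growing prefixes of the ranked list (top 10, top 50, top L) gives a simple view of how accuracy changes with the score.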

3. Calculate structural models

To generate structural models, use EVfold from the evolutionary couplings server. For this task, consider different numbers of contacts. Typically, optimal models are generated when the number of contacts is approximately 60-70% of L (L = length of aligned protein sequence). Try varying numbers of contacts, including:

  • the optimal 60-70% of L
  • 40% of L
  • 100% of L
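
The contact numbers for a given L follow from simple arithmetic, sketched here. The helper name and the example length of 150 are hypothetical; use the aligned length of your own protein.

```python
def contact_counts(length, percentages=(40, 60, 70, 100)):
    """Map each percentage of L to a number of contacts (rounded to an integer)."""
    return {pct: round(length * pct / 100) for pct in percentages}


# Hypothetical aligned sequence length, for illustration only.
print(contact_counts(150))
```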

When using the EVfold server (EVcouplings and EVfold) to calculate correlated mutations and generate models, use PLM for couplings scoring in addition to the DI score.

To check the quality of the predicted models, calculate the Cα-RMSD to the crystal structure, e.g. using PyMOL, or analyze the corresponding output from the EV server.
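
To make the metric concrete, the sketch below computes the RMSD over matched Cα coordinates. It assumes that the model has already been superimposed onto the crystal structure (for example by PyMOL) and that both coordinate lists contain the same residues in the same order; it does not perform the superposition itself. The function name and the toy coordinates are hypothetical.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD between matched, already superimposed C-alpha coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Hypothetical toy coordinates: each atom is displaced by exactly 1 Å.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
crystal = [(0.0, 0.0, 1.0), (3.8, 0.0, -1.0)]
print(ca_rmsd(model, crystal))
```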

References

See Julia's talk (link above) and here.