# Difference between revisions of "Gaucher Disease: Task 02 - Alignments"

(→Alignment Comparison) |
Kalemanovm (talk | contribs) (→Sequence Searches) |
||

Line 179: | Line 179: | ||

====E-value distribution==== |
====E-value distribution==== |
||

+ | <figure id="evalue"> |
||

− | |||

− | <gallery |
+ | <gallery widths="500px" heights="500px" perrow="2"> |

File:hhblits-2iter-uniprot20.hhr_results.psiblast-1iter-big_80--1iter_big_results.blast-big_results_evalue_hist_log_freq.png|E-value distribution of HHblits hits (2 iterations against uniprot_20 with default E-value cutoff, 0.001), Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with default E-value cutoff, 0.002) and BLAST hits (against big with default parameters). On the X-axis -log of the E-value is plotted, so the smallest (the best) E-value is on the right side. On the Y-axis is logarithmic frequency of the values on the X-axis. |
File:hhblits-2iter-uniprot20.hhr_results.psiblast-1iter-big_80--1iter_big_results.blast-big_results_evalue_hist_log_freq.png|E-value distribution of HHblits hits (2 iterations against uniprot_20 with default E-value cutoff, 0.001), Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with default E-value cutoff, 0.002) and BLAST hits (against big with default parameters). On the X-axis -log of the E-value is plotted, so the smallest (the best) E-value is on the right side. On the Y-axis is logarithmic frequency of the values on the X-axis. |
||

File:hhblits-2iter-uniprot20-10E-10.hhr_results.psiblast-1iter-big_80--1iter_big-10E-10_results_evalue_hist_log_freq.png|E-value distribution of HHblits hits (2 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis -log of the E-value is plotted, so the smallest (the best) E-value is on the right side. On the Y-axis is logarithmic frequency of the values on the X-axis. |
File:hhblits-2iter-uniprot20-10E-10.hhr_results.psiblast-1iter-big_80--1iter_big-10E-10_results_evalue_hist_log_freq.png|E-value distribution of HHblits hits (2 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis -log of the E-value is plotted, so the smallest (the best) E-value is on the right side. On the Y-axis is logarithmic frequency of the values on the X-axis. |
||

Line 186: | Line 186: | ||

File:hhblits-10iter-uniprot20-10E-10.hhr_results.psiblast-9iter-big_80--1iter_big-10E-10_results_evalue_hist_log_freq.png|E-value distribution of HHblits hits (10 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (9 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis -log of the E-value is plotted, so the smallest (the best) E-value is on the right side. On the Y-axis is logarithmic frequency of the values on the X-axis. |
File:hhblits-10iter-uniprot20-10E-10.hhr_results.psiblast-9iter-big_80--1iter_big-10E-10_results_evalue_hist_log_freq.png|E-value distribution of HHblits hits (10 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (9 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis -log of the E-value is plotted, so the smallest (the best) E-value is on the right side. On the Y-axis is logarithmic frequency of the values on the X-axis. |
||

</gallery> |
</gallery> |
||

+ | <small>'''<caption>''' E-value distribution of BLAST and HHblits and Psi-BLAST hits with different combinations of number of iterations and E-value cutoffs.</caption></small> |
||

+ | </figure> |
||

'''Observations''' |
'''Observations''' |
||

Line 194: | Line 196: | ||

====Percentage identity distribution==== |
====Percentage identity distribution==== |
||

+ | <figure id="identity" > |
||

− | |||

− | <gallery |
+ | <gallery widths="500px" heights="500px" perrow="2"> |

File:hhblits-2iter-uniprot20.hhr_results.psiblast-1iter-big_80--1iter_big_results.blast-big_results_identity_hist.png|Identity distribution of HHblits hits (2 iterations against uniprot_20 with default E-value cutoff, 0.001), Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with default E-value cutoff, 0.002) and BLAST hits (against big with default parameters). On the X-axis %identity is plotted [0,1]. On the Y-axis is frequency of the values on the X-axis. |
File:hhblits-2iter-uniprot20.hhr_results.psiblast-1iter-big_80--1iter_big_results.blast-big_results_identity_hist.png|Identity distribution of HHblits hits (2 iterations against uniprot_20 with default E-value cutoff, 0.001), Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with default E-value cutoff, 0.002) and BLAST hits (against big with default parameters). On the X-axis %identity is plotted [0,1]. On the Y-axis is frequency of the values on the X-axis. |
||

File:hhblits-2iter-uniprot20-10E-10.hhr_results.psiblast-1iter-big_80--1iter_big-10E-10_results_identity_hist.png|Identity distribution of HHblits hits (2 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis %identity is plotted [0,1]. On the Y-axis is frequency of the values on the X-axis. |
File:hhblits-2iter-uniprot20-10E-10.hhr_results.psiblast-1iter-big_80--1iter_big-10E-10_results_identity_hist.png|Identity distribution of HHblits hits (2 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis %identity is plotted [0,1]. On the Y-axis is frequency of the values on the X-axis. |
||

Line 201: | Line 203: | ||

File:hhblits-10iter-uniprot20-10E-10.hhr_results.psiblast-9iter-big_80--1iter_big-10E-10_results_identity_hist.png|Identity distribution of HHblits hits (10 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (9 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis %identity is plotted [0,1]. On the Y-axis is frequency of the values on the X-axis. |
File:hhblits-10iter-uniprot20-10E-10.hhr_results.psiblast-9iter-big_80--1iter_big-10E-10_results_identity_hist.png|Identity distribution of HHblits hits (10 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (9 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis %identity is plotted [0,1]. On the Y-axis is frequency of the values on the X-axis. |
||

</gallery> |
</gallery> |
||

+ | <small>'''<caption>''' Identity distribution of BLAST and HHblits and Psi-BLAST hits with different combinations of number of iterations and E-value cutoffs.</caption></small> |
||

+ | </figure> |
||

'''Observations''' |
'''Observations''' |
||

Line 208: | Line 212: | ||

====Alignment length distribution==== |
====Alignment length distribution==== |
||

+ | <figure id="length" > |
||

− | |||

− | <gallery |
+ | <gallery widths="500px" heights="500px" perrow="2"> |

File:hhblits-2iter-uniprot20.hhr_results.psiblast-1iter-big_80--1iter_big_results.blast-big_results_length_hist_log_freq.png|Alignment length distribution of HHblits hits (2 iterations against uniprot_20 with default E-value cutoff, 0.001), Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with default E-value cutoff, 0.002) and BLAST hits (against big with default parameters). On the X-axis alignment length is plotted. On the Y-axis is logarithmic frequency of the values on the X-axis. |
File:hhblits-2iter-uniprot20.hhr_results.psiblast-1iter-big_80--1iter_big_results.blast-big_results_length_hist_log_freq.png|Alignment length distribution of HHblits hits (2 iterations against uniprot_20 with default E-value cutoff, 0.001), Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with default E-value cutoff, 0.002) and BLAST hits (against big with default parameters). On the X-axis alignment length is plotted. On the Y-axis is logarithmic frequency of the values on the X-axis. |
||

File:hhblits-2iter-uniprot20-10E-10.hhr_results.psiblast-1iter-big_80--1iter_big-10E-10_results_length_hist_log_freq.png|Alignment length distribution of HHblits hits (2 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis %identity is plotted [0,1]. On the X-axis alignment length is plotted. On the Y-axis is logarithmic frequency of the values on the X-axis. |
File:hhblits-2iter-uniprot20-10E-10.hhr_results.psiblast-1iter-big_80--1iter_big-10E-10_results_length_hist_log_freq.png|Alignment length distribution of HHblits hits (2 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis %identity is plotted [0,1]. On the X-axis alignment length is plotted. On the Y-axis is logarithmic frequency of the values on the X-axis. |
||

Line 215: | Line 219: | ||

File:hhblits-10iter-uniprot20-10E-10.hhr_results.psiblast-9iter-big_80--1iter_big-10E-10_results_length_hist_log_freq.png|Alignment length distribution of HHblits hits (10 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (9 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis %identity is plotted [0,1]. On the X-axis alignment length is plotted. On the Y-axis is logarithmic frequency of the values on the X-axis. |
File:hhblits-10iter-uniprot20-10E-10.hhr_results.psiblast-9iter-big_80--1iter_big-10E-10_results_length_hist_log_freq.png|Alignment length distribution of HHblits hits (10 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (9 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis %identity is plotted [0,1]. On the X-axis alignment length is plotted. On the Y-axis is logarithmic frequency of the values on the X-axis. |
||

</gallery> |
</gallery> |
||

+ | <small>'''<caption>''' Alignment length distribution of BLAST and HHblits and Psi-BLAST hits with different combinations of number of iterations and E-value cutoffs.</caption></small> |
||

+ | </figure> |
||

'''Observations''' |
'''Observations''' |
||

Line 231: | Line 237: | ||

'''BLAST''' |
'''BLAST''' |
||

+ | <figure id="blast_results" > |
||

− | |||

− | <gallery |
+ | <gallery widths="500px" heights="500px" perrow="2"> |

File:blast-big_80_eval_tpr_precision.png|TPR and precision vs. E-value of BLAST hits after 1 big_80 iteration with default parameters using the L30, L40 and L60 COPS groups as gold standard. On the X-axis E-value is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff. |
File:blast-big_80_eval_tpr_precision.png|TPR and precision vs. E-value of BLAST hits after 1 big_80 iteration with default parameters using the L30, L40 and L60 COPS groups as gold standard. On the X-axis E-value is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff. |
||

File:blast-big_eval_tpr_precision.png|TPR and precision vs. E-value of BLAST hits after 1 big iteration with default parameters using the L30, L40 and L60 COPS groups as gold standard. On the X-axis E-value is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff. |
File:blast-big_eval_tpr_precision.png|TPR and precision vs. E-value of BLAST hits after 1 big iteration with default parameters using the L30, L40 and L60 COPS groups as gold standard. On the X-axis E-value is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff. |
||

</gallery> |
</gallery> |
||

+ | <small>'''<caption>''' TPR and precision vs. E-value of BLAST hits searched in big_80 and big database.</caption></small> |
||

+ | </figure> |
||

'''Psi-BLAST''' |
'''Psi-BLAST''' |
||

+ | <figure id="psiblast_results" > |
||

− | |||

− | <gallery |
+ | <gallery widths="500px" heights="500px" perrow="2"> |

File:psiblast-1iter-big_80--1iter_big--1iter_pdb_eval_tpr_precision.png|TPR and precision vs. E-value of Psi-BLAST hits after 1 big_80, 1 big and 1 pdb_full iterations with default E-value using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff. |
File:psiblast-1iter-big_80--1iter_big--1iter_pdb_eval_tpr_precision.png|TPR and precision vs. E-value of Psi-BLAST hits after 1 big_80, 1 big and 1 pdb_full iterations with default E-value using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff. |
||

File:psiblast-1iter-big_80--1iter_big--1iter_pdb-10E-10_eval_tpr_precision.png|TPR and precision vs. E-value of Psi-BLAST hits after 1 big_80, 1 big and 1 pdb_full iterations with E-value 10E-10 using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff. |
File:psiblast-1iter-big_80--1iter_big--1iter_pdb-10E-10_eval_tpr_precision.png|TPR and precision vs. E-value of Psi-BLAST hits after 1 big_80, 1 big and 1 pdb_full iterations with E-value 10E-10 using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff. |
||

Line 245: | Line 253: | ||

File:psiblast-9iter-big_80--1iter_big--1iter_pdb-10E-10_eval_tpr_precision.png|TPR and precision vs. E-value of Psi-BLAST hits after 9 big_80, 1 big and 1 pdb_full iterations with E-value 10E-10 using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff. |
File:psiblast-9iter-big_80--1iter_big--1iter_pdb-10E-10_eval_tpr_precision.png|TPR and precision vs. E-value of Psi-BLAST hits after 9 big_80, 1 big and 1 pdb_full iterations with E-value 10E-10 using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff. |
||

</gallery> |
</gallery> |
||

+ | <small>'''<caption>''' TPR and precision vs. E-value of Psi-BLAST hits with different combinations of number of iterations and E-value cutoffs.</caption></small> |
||

+ | </figure> |
||

'''HHblits''' |
'''HHblits''' |
||

+ | <figure id="hhblits_results" > |
||

− | |||

− | <gallery |
+ | <gallery widths="500px" heights="500px" perrow="2"> |

File:hhblits-2iter-uniprot20--1iter-pdb.hhr_eval_tpr_precision.png|TPR and precision vs. E-value of HHblits hits after 2 uniprot_20 iterations and 1 pdb_full iteration with default E-value using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff. |
File:hhblits-2iter-uniprot20--1iter-pdb.hhr_eval_tpr_precision.png|TPR and precision vs. E-value of HHblits hits after 2 uniprot_20 iterations and 1 pdb_full iteration with default E-value using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff. |
||

File:hhblits-2iter-uniprot20--1iter-pdb-10E-10.hhr_eval_tpr_precision.png|TPR and precision vs. E-value of HHblits hits after 2 uniprot_20 iterations and 1 pdb_full iteration with E-value 10E-10 using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff. |
File:hhblits-2iter-uniprot20--1iter-pdb-10E-10.hhr_eval_tpr_precision.png|TPR and precision vs. E-value of HHblits hits after 2 uniprot_20 iterations and 1 pdb_full iteration with E-value 10E-10 using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff. |
||

Line 254: | Line 264: | ||

File:hhblits-10iter-uniprot20--1iter-pdb-10E-10.hhr_eval_tpr_precision.png|TPR and precision vs. E-value of HHblits hits after 10 uniprot_20 iterations and 1 pdb_full iteration with E-value 10E-10 using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff. |
File:hhblits-10iter-uniprot20--1iter-pdb-10E-10.hhr_eval_tpr_precision.png|TPR and precision vs. E-value of HHblits hits after 10 uniprot_20 iterations and 1 pdb_full iteration with E-value 10E-10 using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff. |
||

</gallery> |
</gallery> |
||

+ | <small>'''<caption>''' TPR and precision vs. E-value of HHblits hits with different combinations of number of iterations and E-value cutoffs.</caption></small> |
||

+ | </figure> |
||

'''Observations''' |
'''Observations''' |

## Revision as of 01:42, 5 September 2013

Alignments allow a comparisons of Strings. In the field of bioinformatics, sequence alignments show the relation between two or more sequences.

## Theoretical Background

### Description

Pairwise or multiple alignments often contain an aditional line below the proper alignment (or between two aligned sequences for pairwise alignments). This line gives a more accurate description of the relation of the aligned residues above. The symbols show if there is match between identical amino acids or if they are only similar.

Symbols for describing sequence alignments | |||
---|---|---|---|

Program(s) | Symbol | Example | Meaning |

MSAs | * | A A * |
identical residues |

: | I L : |
similar residues | |

. | S P . |
similar residues | |

(Psi-)BLAST | same letter | A A A |
identical residues |

+ | L + V |
similar residues | |

HHblits | | | AF | | AW |
identical or very similar residues |

+ | T + S |
similar residues | |

. | N . H |
non-similar residues |

## Sequence Searches

### Runs

Sequence searches with our query protein sequence - enzyme glucocerebrosidase, P04062.fasta, were done with the following programs:

**BLAST**- using standard parameters
- against big_80
- against big

**Psi-BLAST**- with number of shown hits and alignments set to 10000 (-b, -v options), so that all the hits will be shown.
- with all combinations of:
- 2 iterations: 1 iterations against big_80 followed by 1 iteration against big
- 10 iterations: 9 iterations against big_80 followed by 1 iteration against big
- default E-value cutoff (0.002)
- E-value cutoff 10E-10

- other options leaved default

**HHblits**- with number of shown hits and alignments set to 10000 (-Z, -B options), as in Psi-BLAST
- with all combinations of:
- 2 iterations against uniprot_20
- 10 iterations against uniprot_20
- default E-value cutoff (0.001)
- E-value cutoff 10E-10

- other options leaved default

HHblits searches among representative sequences from different uniprot_20 clusters. When a representative from one cluster is found, all the sequences from this cluster are being added to the hit list. Thus HHblits finds much more hits than PSI-BLAST. Therefore, one additional PSI-BLAST run against the unfiltered big database was performed, using checkpoint files of the last profile created from the previous search in big80, in order to find as much related sequences as possible, as HHblits does indirectly. Summing up, the following databases were used: big_80 for profile building and big for the last iteration for PSI-BLAST and uniprot20 for HHblits. BLAST runs (consisting only from one iteration) we searched once against big_80 and once against big to see the difference in number of found hits.

### Comparison

#### Number and overlap of hits

Number of found hits with each program and parameter combination and overlap of hits between some are summarized in the following tables.

Number and overlap of hits | ||||
---|---|---|---|---|

Program | Settings | Number of hits | Of them overlapped - with ... | |

BLAST | big_80 | 569 | 310 - both HHblits searches with E-val 0.001 | |

big | 1209 | 643 - both HHblits searches with E-val 0.001 | ||

Psi-BLAST | all combinations with 1 iter big | 1216 | 644 - HHblits search with same combination | |

1 iter big_80 + 2 iter big, default E-val | 2274 | 1126 - HHblits 2 iter, default E-val | ||

1 iter big_80 + 2 iter big, E-val 10E-10 | 2099 | 1071 - HHblits 2 iter, E-val 10E-10 | ||

9 iter big_80 + 2 iter big, default E-val | 2274 | 1174 - HHblits 10 iter, default E-val | ||

9 iter big_80 + 2 iter big, E-val 10E-10 | 2099 | 1069 - HHblits 10 iter, E-val 10E-10 | ||

2 iter big_80, default E-val | 1092 | |||

2 iter big_80, E-val 10E-10 | 987 | |||

10 iter big_80, default E-val | 4031 | |||

10 iter big_80, E-val 10E-10 | 1525 | |||

HHblits | 2 iter., E-val 0.001 | 1985 | 644 - Psi-BLAST search with same combination | |

2 iter., E-val 10E-10 | 1922 | 644 - Psi-BLAST search with same combination | ||

10 iter., E-val 0.001 | 17538 | 644 - Psi-BLAST search with same combination | ||

10 iter., E-val 10E-10 | 3189 | 644 - Psi-BLAST search with same combination |

**Observations**

- BLAST search against not filtered big yields significantly more hits than against the redundancy reduced big_80.
- Iterative BLAST (i.e. Psi-BLAST) finds far more hits than BLAST.
- HHblits finds more hits than BLAST and PSI-BLAST (latter when number of iterations comparable).
- HHblits finds more hits after 10 iterations than after 2 iterations.
- HHblits finds less hits when lower E-value cutoff (10E-10) is used. For 2 iterations it is a little less but for 10 iterations it is much less.
- The "high outlier" is number of hits with HHblits search with 10 iterations and default E-value inclusion cutoff.
- Psi-BLAST number of hits is the same in all 4 parameter combinations after 1 last iteration against big. Thought after another iteration against big (done because of the check files), the number of hits using the default E-value cutoff is higher than using the E-value cutoff 10E-10. However, the number of hits stays the same independent of the number of "profile building" iterations against big_80 (1 or 9). In the second iteration against big much more hits are found, almost the double. -> Is one iteration against big not enough to "unfold" the profile?
- Psi-BLAST after the 4 different runs only against big_80 are summarized in the table. The trends of the number of hits are comparable to HHblits (more hits with more iterations and less hits with the more restrictive E-value).
- Number of hits found by both HHblits and Psi-BLAST or both HHblits and BLAST is almost the same and independent of the search parameters we used (therefore in Psi-BLAST the same hits are found after 1 iteration against big with all parameter combinations). But the overlap is significantly higher after 1 additional Psi-BLAST iteration against big.

To gain more understanding about the found hits using different programs and search options, E-value, percentage identity and alignment length distributions were plottet.

#### E-value distribution

<figure id="evalue">

E-value distribution of HHblits hits (2 iterations against uniprot_20 with default E-value cutoff, 0.001), Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with default E-value cutoff, 0.002) and BLAST hits (against big with default parameters). On the X-axis -log of the E-value is plotted, so the smallest (the best) E-value is on the right side. On the Y-axis is logarithmic frequency of the values on the X-axis.

E-value distribution of HHblits hits (2 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis -log of the E-value is plotted, so the smallest (the best) E-value is on the right side. On the Y-axis is logarithmic frequency of the values on the X-axis.

E-value distribution of HHblits hits (10 iterations against uniprot_20 with default E-value cutoff, 0.001) and Psi-BLAST hits (9 iteration against big_80 followed by one iteration against big with default E-value cutoff, 0.002). On the X-axis -log of the E-value is plotted, so the smallest (the best) E-value is on the right side. On the Y-axis is logarithmic frequency of the values on the X-axis.

E-value distribution of HHblits hits (10 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (9 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis -log of the E-value is plotted, so the smallest (the best) E-value is on the right side. On the Y-axis is logarithmic frequency of the values on the X-axis.

E-value distribution of BLAST and HHblits and Psi-BLAST hits with different combinations of number of iterations and E-value cutoffs. </figure>

**Observations**

- BLAST has a similar E-value distribution as Psi-BLAST.
- Same range of E-value for (Psi-)BLAST and HHblits, however diffenert frequency of diffent values: (Psi-)BLAST has more of higher E-values (left side) while HHblits has wo peaks at high and low values.
- For high number of iterations, HHblits finds more hits with lower E-values (especially with default E-value cutoff).

#### Percentage identity distribution

<figure id="identity" >

Identity distribution of HHblits hits (2 iterations against uniprot_20 with default E-value cutoff, 0.001), Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with default E-value cutoff, 0.002) and BLAST hits (against big with default parameters). On the X-axis %identity is plotted [0,1]. On the Y-axis is frequency of the values on the X-axis.

Identity distribution of HHblits hits (2 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis %identity is plotted [0,1]. On the Y-axis is frequency of the values on the X-axis.

Identity distribution of HHblits hits (10 iterations against uniprot_20 with default E-value cutoff, 0.001) and Psi-BLAST hits (9 iteration against big_80 followed by one iteration against big with default E-value cutoff, 0.002). On the X-axis %identity is plotted [0,1]. On the Y-axis is frequency of the values on the X-axis.

Identity distribution of HHblits hits (10 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (9 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis %identity is plotted [0,1]. On the Y-axis is frequency of the values on the X-axis.

Identity distribution of BLAST and HHblits and Psi-BLAST hits with different combinations of number of iterations and E-value cutoffs. </figure>

**Observations**

- BLAST and Psi-BLAST have similar identity distributions, with peaks at about 30%, 85% and 100%.
- HHblits has another high peak at about 20%, which moves to even lower identities with higher number of iterations. This peak is especially high with 100 iterations and default E-value cutoff, because of the high number of found hits.

#### Alignment length distribution

<figure id="length" >

Alignment length distribution of HHblits hits (2 iterations against uniprot_20 with default E-value cutoff, 0.001), Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with default E-value cutoff, 0.002) and BLAST hits (against big with default parameters). On the X-axis alignment length is plotted. On the Y-axis is logarithmic frequency of the values on the X-axis.

Alignment length distribution of HHblits hits (2 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis %identity is plotted [0,1]. On the X-axis alignment length is plotted. On the Y-axis is logarithmic frequency of the values on the X-axis.

Alignment length distribution of HHblits hits (10 iterations against uniprot_20 with default E-value cutoff, 0.001) and Psi-BLAST hits (9 iteration against big_80 followed by one iteration against big with default E-value cutoff, 0.002). On the X-axis %identity is plotted [0,1]. On the X-axis alignment length is plotted. On the Y-axis is logarithmic frequency of the values on the X-axis.

Alignment length distribution of HHblits hits (10 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (9 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis %identity is plotted [0,1]. On the X-axis alignment length is plotted. On the Y-axis is logarithmic frequency of the values on the X-axis.

Alignment length distribution of BLAST and HHblits and Psi-BLAST hits with different combinations of number of iterations and E-value cutoffs. </figure>

**Observations**

- BLAST and Psi-BLAST have similar alignment length distributions.
- HHblits and (Psi-)BLAST alignment lengths have similar range from 1 to round 550 residues.
- (Psi-)BLAST seems to make in average longer alignments than HHblits.

### Evaluation

#### Results

Resulting plots of TPR and precision as a function of E-value are shown below.

**BLAST**
<figure id="blast_results" >

TPR and precision vs. E-value of BLAST hits searched in big_80 and big database. </figure>

**Psi-BLAST**
<figure id="psiblast_results" >

TPR and precision vs. E-value of Psi-BLAST hits after 1 big_80, 1 big and 1 pdb_full iterations with default E-value using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff.

TPR and precision vs. E-value of Psi-BLAST hits after 1 big_80, 1 big and 1 pdb_full iterations with E-value 10E-10 using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff.

TPR and precision vs. E-value of Psi-BLAST hits after 9 big_80, 1 big and 1 pdb_full iterations with default E-value using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff.

TPR and precision vs. E-value of Psi-BLAST hits after 9 big_80, 1 big and 1 pdb_full iterations with E-value 10E-10 using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff.

TPR and precision vs. E-value of Psi-BLAST hits with different combinations of number of iterations and E-value cutoffs. </figure>

**HHblits**
<figure id="hhblits_results" >

TPR and precision vs. E-value of HHblits hits after 2 uniprot_20 iterations and 1 pdb_full iteration with default E-value using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff.

TPR and precision vs. E-value of HHblits hits after 2 uniprot_20 iterations and 1 pdb_full iteration with E-value 10E-10 using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff.

TPR and precision vs. E-value of HHblits hits after 10 uniprot_20 iterations and 1 pdb_full iteration with default E-value using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff.

TPR and precision vs. E-value of HHblits hits after 10 uniprot_20 iterations and 1 pdb_full iteration with E-value 10E-10 using the L30, L40 and L60 COPS groups as gold standard. On the X-axis -log(E-value) is plotted. On the Y-axis are percentual values of TPR and precision of all hits until respective E-value cutoff.

TPR and precision vs. E-value of HHblits hits with different combinations of number of iterations and E-value cutoffs. </figure>

**Observations**

- In
**BLAST**search against**big_80**only one hit was found and it has the same L60 (therefore also L40 and L30) group as the query. It can be seen on the plot as a pair of points: with the highest precision and lowest TPR. - In
**BLAST**search against**big**only hits with two different E-values were found. Also there the precision is high but TPR low, especially for hits with same L30 group. - For the other programs -
**HHblits**and**Psi-BLAST**- it can be observed, that TPR rises with growing E-value and precision drops with growing E-value, because then more TP and FP hits are found.- TPR rises with growing E-value, because more TP hits are detected, especially rapidly for L60. This indicates that more similar hit are detected more easily, almost all L60 are found, however only a small subset of L40 and L30 hits are retrieved. The TPR plot is similar for Psi-BLAST and HHblits searches.
**HHblits**achieves slightly higher precision then**Psi-BLAST**: 24% vs. 9% for L30, 40% vs. 15% for L40. Both programs find all L60 hits. - There is a different behavior in precision:
- In
**Psi-BLAST**searches precision stays 100% until approx. E-value 10E-20, then it starts to drop, more rapidly for L60 group than for L40 and L30 groups. It shows that hits with high E-values probably are not very similar to the query. - In
**HHblits**searches precision is 100% for very low E-values, then as E-value grows precision of all structural groups falls rapidly to about 90% and stays there for a while. After some E-value is achieved (usually approx. > 10E-10), precision continues to drop further. Also here precision for L60 groups falls more rapidly than for L40 and L30 groups.

- In
**Psi-BLAST**results for different settings are overall very similar, except that less TP are found with E-value inclusion cutoff 10E-10.- For
**HHblits**there is one exeption: the plot for HHblits with 10 iterations and default E-value cutoff (for inclusion into profile) looks very differnt. The highest TPR values are achieved: 30% for L30, 52% for L40 and 100% for L60, however the precision drops not only already at very low E-values to about 90%, as also for other HHblits searches, but it continues to drop already after E-value 10E-30 for all the structural groups (again more rapidly for L60). This indicates that many HHblits iterations combined with not very restrictive E-value inclusion cutoff 0.001 allow too many FP to penetrate into the profile and in the end into the result list. We do not recommend to use these search settings. - It could be a good idea to remove
**Psi-BLAST**hits with E-value > 10E-20 and HHblits hits with E-value > 10E-10 (in most of the cases) to reduce the number of FP. However then we will loose some TP that are still found with these high E-values (L30 and L40).

- TPR rises with growing E-value, because more TP hits are detected, especially rapidly for L60. This indicates that more similar hit are detected more easily, almost all L60 are found, however only a small subset of L40 and L30 hits are retrieved. The TPR plot is similar for Psi-BLAST and HHblits searches.

## Multiple Sequence Alignment

For the multiple sequence alignments, three sets were created. Therefore, the results of the previous task were used according to their sequnece identiy to the native protein sequence of glucocerebrosidase. Set 1 and set 2 contain 10 sequences each, including the native protein sequence to keep the alignments in relation to the Gaucher's disease causing protein. The remainig 9 sequences have a sequence identity to the glucocerebrosidase of more than 60% in set 1 and less than 30% in set 2. In set 3, the 20 sequences are over the whole range of sequence identity.

In the following analyses the term "conserved region" is used for a section of columns that feature a higher number of conserved columns or columns of similar residues. Moreover, the term "contigous gap" is used for an amount of consecutive gaps that split up the protein sequence.

Set 1: sequence identity >60% | |
---|---|

ID | Identity in % |

P04062 | 100 |

A9UD35 | 84.1 |

D1L2S0 | 83.0 |

3gxi_A | 99.8 |

2nt1_A | 99.8 |

F6WDY8 | 90.7 |

Q2KHZ8 | 89.2 |

F5CB27 | 81.8 |

F5H241 | 98.2 |

B7Z6S9 | 99.8 |

Set 2: sequence identity <30% | |
---|---|

ID | Identity in % |

P04062 | 100 |

H6CEV7 | 26.5 |

Q21GD0 | 24.3 |

I1WBF3 | 23.7 |

I9HH59 | 29.4 |

D0TN48 | 25.4 |

B1VPJ0 | 27.5 |

E2LY19 | 24.8 |

K9HBW2 | 25.9 |

B5QQZ8 | 27.1 |

Set 3: sequence identity over all | |||
---|---|---|---|

ID | Identity in % | ID | Identity in % |

P04062 | 100 | B5DYA3 | 32.0 |

B4JTN5 | 34.1 | F5CB37 | 81.8 |

E1ZYU8 | 42.7 | J2STU0 | 34.2 |

F4E6W5 | 26.4 | G9BHQ3 | 80.7 |

F4X220 | 41.7 | 2nsx_B | 99.8 |

E5UPZ0 | 21.6 | G9BHQ5 | 82.5 |

C6A5Q0 | 25.0 | D0TIX6 | 25.8 |

B1QKT0 | 27.8 | 1y7v_A | 99.8 |

D4SV88 | 34.6 | A9UD58 | 80.4 |

A9UD54 | 81.8 | G9MUP6 | 20.7 |

### ClustalW

Multiple alignments generated with ClustalW: <figure id="cl_1" >

</figure> <figure id="cl_2" >

</figure> <figure id="cl_3" >

</figure>

**Set 1**, which includes sequences with high similarity, has only 11 conserved columns. But these columns lie densly in an alignment area between position 284 and 337 (marked darkblue in <xr id="cl_1"/>). Also columns that are not conserved by identical residues, but have similar amino acids, can be found in this conserved region. This conserved area forms two alpha helices and one beta-sheet in its secondary structure. The gaps of the alignment split the protein into 5, parts of residues. Especially the last 400 residues have a very high sequence identity between the clucosylcerebrosidase and 6 other proteins. The last contigous gap which is also the longest one lies inside an alpha helix of the secondary structure.

**Set 2** has less conserved columns than set 1 (<xr id="cl_2"/>). They are spread over the alignment and build no conserved region. The contigous gaps are more but shorter.

There exist no conserved columns for **set 3** (<xr id="cl_3"/>). This could result from the greater number of sequences in the set. The more sequences are in an alignment the lesser the probability of having conserved columns. Furthermore, also the sequences with a low sequence identity cause to a loss of a conserved region.

Results of ClustalW in numbers | |||
---|---|---|---|

Set 1 | Set 2 | Set 3 | |

Conserved Columns | 11 | 8 | 0 |

Number of gaps in the Alignm. | 3797 | 5239 | 6721 |

Number of contiguous gaps in P04062 | 6 | 17 | 16 |

Alignment length | 762 | 1079 | 722 |

### Muscle

Multiple alignments generated with Muscle: <figure id="m1" >

</figure> <figure id="m2" >

</figure> <figure id="m3" >

</figure>

In **set 1** the native protein sequence itself has no gaps. There are only gaps at the beginning of the sequence, because the alignment (<xr id="m1"/>) is longer than the length of the glucocerebrosidase sequence. There are 10 conserved columns which all concentrate on the alignment positions between 138-204. The remaining residues of these area also have a high similarity between the different sequences. Only the sequence F5H241 is splitted by a few gaps at its beginning. This high conserved region affects a different region of the glucocerebrosidase (residues 53-119) than the high conservative region detected by ClustalW (residues 171-220).

**Set 2** has also 10 conserved columns (<xr id="m2"/>). But in contrast to set 1, they are widely spread over the alignment. The same observation can be made about columns with similar residues. Through this scattering it seems rather a randomly generated conserved column than a conservation due to functionaly reasons. The high number as well as the partly great length of the contigous gaps straighten the alignment. The alignment gives a great overview of the relation between sequences with low similarity and sequence identity. It also shows that for finding functional important areas that are conserved, a higher similarity and identity is needed.

**Set 3** has no conserved columns (<xr id="m3"/>), for the same reason as explained before, for set 2. It has indeed the highest number of gaps of the three sets. But contrary to the alignment of set 2 the contigous gaps are shorter, partially they are only 1-2 gaps long. So the alignment appears less straightened, as the sequences with high sequence identity keep the alignment a bit compact.

Results of Muscle in numbers | |||
---|---|---|---|

Set 1 | Set 2 | Set 3 | |

Conserved Columns | 10 | 10 | 0 |

Number of gaps in the Alignm. | 2387 | 5609 | 7341 |

Number of contiguous gaps in P04062 | 0 | 26 | 29 |

Alignment length | 621 | 1116 | 753 |

### T-Coffee

Multiple alignments generated with T-Coffee: <figure id="tc1" >

</figure> <figure id="tc2" >

</figure> <figure id="tc3" >

</figure>

The Alignment of **set 1** in <xr id="tc1"/> generated by T-Coffee is very similar to its alignment made by Muscle. However, there exist no conserved columns. This is caused because of the shift of the sequences. So, in no alignment position, the residues of all sequences are aligned. The high conserved region of the Muscle-alignment is almost identical to a conserved region in the T-coffee generated alignment. They only differ in a shift of a part of sequence F5H241. The First few residues are now aligend at the beginning of the glucocerebrosidase instead of position 53. While Muscle includes gaps in this F5H241 part, T-Coffe does not separate the shifted residues.

The other difference to the Muscle alignment is one long contigous gap inside of the protein sequence A9UD35. This extremely long contigous gap shifts the last amino acid of the sequence to the last position of the alignment. As this amino acid (G) is not identical to the remaining amino acids (Q) at the last alignment position of the other sequences, there seem to be no reason for this observation. The suspicion of an error arises, as the amino acid (G) would perfectly fit directly at the end of the sequence without a contigous gap.

**Set 2** shows widly spread conserved columns without building a conserved region. The alignment (<xr id="tc2"/>) itself has lots of gaps. An interpretation of the alignment in a biological way should be made carefully, as it bases on low sequence identity and similarity. Noticeable for this alignment is its beginning and end. All sequences are aligned in these areas with at least 4 residues.

Sequences with an identity over the full range (**set 3**) generate many gaps of different length and no conserved columns (<xr id="tc3"/>). Two different pattern are observable. On the one hand blocks of aligned residues are seperated mostly by a block of gaps and are only connected by 1-2 sequences (for example positions 495-515). On the other hand a block is not completely aligned by residues of all sequences, but has an contigous gap of a few sequences (for example positions 415-450).

Results of Muscle in numbers | |||
---|---|---|---|

Set 1 | Set 2 | Set 3 | |

Conserved Columns | 0 | 12 | 0 |

Number of gaps in the Alignm. | 2387 | 6719 | 9361 |

Number of contiguous gaps in P04062 | 0 | 44 | 50 |

Alignment length | 621 | 1227 | 854 |

### Espresso

Espresso is an advanced programm of T-Coffee. While T-Coffe only uses sequences, Espresso also takes the 3D structure of the proteins into account.

Multiple alignments generated with Espresso:

- Results for Espresso of : Set1 (>60% seq.identity to Glucocerbrosidase)
- Results for Espresso of : Set2 (<30% seq.identity to Glucocerbrosidase)
- Results for Espresso of : Set3 (all % seq.identity to Glucocerbrosidase)

(Please note: in case of an unavailability of the alignments in the links above, please contact the author)

The additional use of the structure improves the alignment of sequences with high sequence identity (**set 1**). In contrary to T-Coffee, Espresso detects now conserved columns. Through a little difference in the alignment score calculating, the residues in the beginning of the F5H241 aligne different than in T-Coffee. Now an conserved area between residues 53 and 119, including 12 high conserved columns and 9 columns with similar amino acids, can be seen in the alignment. The residues are now aligned not equal but similar and better to those in the Muscle-alignment. These conserved area forms four beta-sheets in its secondary structure.
Also the last residue of A9UD35 which was mentioned before in connection with T-Coffee is now aligned directly at the end of its sequence.

The two other sets containing also sequences with low sequence identity show no remarkable improvement to the results of T-Coffee .

Results of Espresso in numbers | |||
---|---|---|---|

Set 1 | Set 2 | Set 3 | |

Conserved Columns | 12 | 11 | 0 |

Number of gaps in the Alignm. | 2387 | 6769 | 9341 |

Number of contiguous gaps in P04062 | 0 | 46 | 51 |

Alignment length | 621 | 1232 | 853 |

### Note

The generated T-Coffee and Espresso alignments of set 2 and 3 seem to focus aligned residues at their beginning or their end. So the question arises: Why? Why are these residues aligned, even though enormous gap penalty accures considering the resulting long contigous gaps. One thing is certain: T-Coffee does not use a freeshift alignments for generating a multiple alignment.

### Alignment Comparison

Alignment comparison of the MSA of sequences with high similarity (**Set 1**) generated by the programs described above.

The Alignments produced by Muscle, T-Coffe and Espresso are very similar. They only differ in their aligning of the sequence F5H241. In case of Espresso the sequence was aligned so that a conserved region is observable. The conserved region, detected by ClustalW is on a different position.

Nevertheless the resulting trees in <xr id="sid"/> based on the Average % of sequence identity differ from each other.

<figure id="sid" >

Trees of sequences with high similarity (set1) paired by average distance using %Identity. </figure>

The trees (<xr id="sid"/>) generated based on the alignment generated by ClustalW and T-Coffee are identically in their sequences pairing. All human proteinns can be found in one subtree. The same proteins are paired in both trees. Just as the remaining mammalian proteins do in the other subtree. They only differ in their distances between paired sequences. This is an unexpected observation, as both alignments differ especially in their conserved region.
But the tree showing the alignment from Muscle deviates enormously from the other two trees. Indeed, the mammalian subtree coincide with the others, but the human proteins do not. The human protein F5H241 which is directly paired to the Glucocerebrosidase by ClustalW and T-Coffee is now linked (yellow) to the subtree consisting of all proteins (red, blue). This is the protein which is aligned different for Muscle and T-Coffee. So Muscle shows a different %Identity between the aligned sequences than ClustalW or T-Coffee.

<figure id="ad" >

Trees of sequences with high similarity (set1) paired by average distance using BLOSUM62. </figure>

By calculating the trees <xr id="ad"/> with the average distance using BLOSUM62, all three trees separate the whale proteins into one subtree. They differ in their pairing of the remaining sequences. Intrestingly, the trees of ClustalW and T-Coffee are not the same anymore, as the trees paired according to their average distance using %identity. Now the alignments of Muscle and T-Coffee to the exact same tree. While Muscle and T-Coffee group the Glucocerebrosidase with the horse and bovine protein, ClustalW groups all human proteins together.

Organisms of Uniprot Entries | |||
---|---|---|---|

Entry | Organism | ||

HUMAN | Human | ||

HORSE | Horse | ||

BOVIN | Bovine | ||

BALPH | Finback whale | ||

DELDE | Saddleback dolphin | ||

DELCA | Long-beaked common dolphin |

### For future research

Multiple sequence alignments are suitable to detect conserved regions on proteins. As for such researches sequences with high similarity cause proper results, the choice of the tool is based on the results of set 1. For a good alignment I would expect few gaps and many conserved columns of residues. However I would only focus on gaps inside the sequences and neglect the gaps at the beginning and the end of sequences. Under these categories I would use and advise Espresso. It shows more conserved columns than Muscle and T-Coffe and less gaps than ClustalW. Even the fact that an alpha helix is cut by a very long contigous gap in the secondary structure, also convinces me to use Espresso instead of ClustalW. The Espresso generated alignment is a compact alignment which provides a conserved region.

In case of having only sequences with low similarity, I would tend to the use of ClustalW. The analyses with set 2 and set 3 show better results for alignments generated by ClustalW. These Alignments have less gaps and less straightened than the alignments produced by Muscle, T-Coffee and Espresso.