Difference between revisions of "Sequence Alignments HEXA"

From Bioinformatikpedia
Latest revision as of 20:54, 30 August 2011

Sequence Searches

Use of database searching tools

  • FASTA

/bin/fasta36 seq.fasta /data/blast/nr/nr > fasta_out.txt

  • BLAST

blastall -p blastp -d /data/blast/nr/nr -i mult_seq.fasta > blast_out.txt

  • PSIBLAST

blastpgp -i seq.fasta -j <#iterations> -h <e-value threshold> -d /data/blast/nr/nr > psiblast_out.txt

  • HHSearch

For HHSearch we used the online server (http://toolkit.tuebingen.mpg.de/hhpred).

Back to [Tay-Sachs Disease]

Statistical results

For the statistical analysis we wrote a script that shows the distribution of the E-values and sequence identities as well as the sets of aligned sequences. Furthermore, we created a Venn diagram to present the overlap between the results of the different search methods (with http://bioinfogp.cnb.csic.es/tools/venny/index.html). First, we compared BLAST, FASTA and PsiBlast (PsiBlast with 3 and 5 iterations and an E-value cutoff of 10E-6). Then we looked at the overlap between all PsiBlast runs.


Overlap of the aligned sequences

As can be seen in Figure 1, FASTA found a large number of matches that were not found by the other methods. About 1400 FASTA hits were found by neither BLAST nor PsiBlast. This is much higher than the number of sequences found by both FASTA and BLAST. This suggests that FASTA reports many additional alignments that are probably of lower quality or even wrong. The two PsiBlast variants deliver the same hits, all of which are also found by FASTA. Furthermore, all BLAST hits were also found by FASTA, and most of them by PsiBlast as well.
In addition, we compared different PsiBlast runs: 3 iterations with two E-value cutoffs (0.005 and 10E-6) and 5 iterations with the same two cutoffs (Figure 2). The Venn diagram shows that the results mostly overlap; only a few hits differ between the runs. This leads to the conclusion that PsiBlast usually returns similar results for different iteration numbers and E-value cutoffs. In summary, the BLAST-based methods agree with each other. In contrast, FASTA delivers many more sequences that are not found by any of the other methods.
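The Venn diagrams were drawn with the Venny web tool; the same regions can also be computed directly from the hit lists. The following is a minimal sketch, assuming each method's hits have already been reduced to a set of sequence identifiers. The function name venn_regions and the toy identifiers are hypothetical and not taken from the original script.

```python
def venn_regions(hit_sets):
    """Group hits by the exact combination of methods that found them.

    hit_sets maps a method name to the set of identifiers it returned.
    The result maps a frozenset of method names to the identifiers found
    by exactly those methods (one entry per non-empty Venn region).
    """
    regions = {}
    all_ids = set().union(*hit_sets.values())
    for seq_id in all_ids:
        found_by = frozenset(
            name for name, hits in hit_sets.items() if seq_id in hits
        )
        regions.setdefault(found_by, set()).add(seq_id)
    return regions


# Toy identifiers for illustration only (not real search results).
hits = {
    "BLAST": {"a", "b", "c"},
    "FASTA": {"a", "b", "c", "d", "e"},
    "PsiBlast": {"a", "b"},
}
regions = venn_regions(hits)
for methods, ids in sorted(regions.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(methods), len(ids))
```

The region keyed by only "FASTA" corresponds to the hits discussed above that neither BLAST nor PsiBlast reports.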


Figure 1: Overlap of results from BLAST, FASTA and PsiBlast.
Legend:
Psiblast #3: 3 Iterations, E-Value Cutoff: 10E-6
Psiblast #5: 5 Iterations, E-Value Cutoff: 10E-6
Figure 2: Overlap of the results from the different PsiBlast runs.
Legend:
Psiblast 1: 3 Iterations, E-Value Cutoff: 0.005
Psiblast 2: 5 Iterations, E-Value Cutoff: 0.005
Psiblast 3: 3 Iterations, E-Value Cutoff: 10E-6
Psiblast 4: 5 Iterations, E-Value Cutoff: 10E-6

Distribution of the sequence identity and the E-value

The following plots show the distributions of the sequence identities and E-values for all methods used. The values (x-axis) and their frequencies (y-axis) were extracted from the corresponding output files.

The first image (Figure 3) shows the distribution of the sequence identities. The first plot shows the distribution for BLAST. This distribution is well balanced: low identities are approximately as common as very high ones. The same holds for the HHSearch plot. In contrast, the FASTA distribution shows very high frequencies for very small identities. This means that FASTA aligns many sequences although they have only a low sequence identity, which could explain why FASTA returns so many hits that do not agree with the other search tools (see the Venn diagram in Figure 1). The last four plots show the corresponding distributions for the different PsiBlast runs, which are very similar to each other. This is another indication that PsiBlast returns very similar results for different parameters. These distributions are also well balanced, with high frequencies for small, medium and high sequence identities.

The second image (Figure 4) shows the distribution of the E-values. The E-value is the number of hits with an equally good or better score that are expected to occur by chance in a database of the given size. Therefore, the smaller the E-value, the more significant the alignment. All plots except the one for FASTA have high frequencies for small E-values, and BLAST has the smallest E-values. The E-values of FASTA range from 0 to 8, whereas no other method has an E-value higher than 1. Furthermore, the highest BLAST E-value is about 1e-29, which is still very low. In summary, this again suggests that BLAST delivers the most reliable hits and FASTA the least reliable ones.
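The script that produced these distributions is not shown. As a minimal sketch of the binning step, assuming the identities (in percent) and E-values have already been parsed from the output files into plain lists, the frequencies could be counted as follows. The function names, the bin width and the floor value for E-values of 0 are assumptions made for illustration.

```python
import math
from collections import Counter


def identity_histogram(identities, bin_width=10):
    """Count hits per identity bin; each key is the lower bound in percent."""
    counts = Counter()
    for value in identities:
        # 100% is put into the top bin instead of opening a bin of its own.
        lower = min(int(value // bin_width) * bin_width, 100 - bin_width)
        counts[lower] += 1
    return dict(sorted(counts.items()))


def evalue_histogram(evalues, floor=1e-200):
    """Count hits per order of magnitude, i.e. per floor(log10(E-value)).

    E-values reported as 0 are clamped to `floor` (an arbitrary choice)
    so that the logarithm is defined.
    """
    counts = Counter()
    for evalue in evalues:
        counts[math.floor(math.log10(max(evalue, floor)))] += 1
    return dict(sorted(counts.items()))


# Toy values for illustration only.
print(identity_histogram([99, 100, 95.1, 26.4, 22]))
print(evalue_histogram([3e-29, 0.005, 5.0, 8.0]))
```

Binning the E-values on a logarithmic scale keeps values around 1e-29 and values between 0 and 8 visible in the same plot.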

Figure 3: Distribution of the sequence identities of the different methods
Figure 4: Distribution of the E-values of the different methods



True positive hits

HSSP (Homology-derived Secondary Structure of Proteins) lists proteins that are homologous and have a similar secondary structure. We therefore used the HSSP alignment to check our results: sequences found both by HSSP and by a search method are counted as true positives. FASTA (Figure 5) has a greater overlap with HSSP than the other methods (about 10 sequences more). The BLAST result (Figure 6) and the PsiBlast runs (Figures 7 to 10) show very similar overlaps. In this case FASTA therefore gave the most true positive hits, although it also has a huge number of false positive predictions.


Figure 5: Overlap between HSSP and FASTA
Figure 6: Overlap between HSSP and BLAST
Figure 7: Overlap between HSSP and PsiBlast with 3 Iterations and 10E-6 cutoff
Figure 8: Overlap between HSSP and PsiBlast with 5 Iterations and 10E-6 cutoff
Figure 9: Overlap between HSSP and PsiBlast with 3 Iterations and 0.005 cutoff
Figure 10: Overlap between HSSP and PsiBlast with 5 Iterations and 0.005 cutoff


With the results of these analyses, we created our file for the multiple alignments.

SeqIdentifier                   Seq Identity   Source

99%-90% Sequence Identity
109157872|pdb|2GK1              99%            Blast
179460|gb|AAA51827.1            99%            Blast
194375013|dbj|BAG62619.1        97%            Blast
296213630|ref|XP_002753354.1    95.1%          Fasta
297296816|ref|XP_002804897.1    93%            Blast

89%-60% Sequence Identity
149692271|ref|XP_001494361.1    85%            Blast
187607461|ref|NP_001119815.1    84.3%          Fasta
67514549|ref|NP_034551.2        84.1%          Fasta
74213671|dbj|BAE35636.1         84%            Blast
178056464|ref|NP_001116693.1    83.2%          Fasta

59%-40% Sequence Identity
187608414|ref|NP_001120459.1    58.3%          Fasta
213513173|ref|NP_001133930.1    57.0%          Fasta
38492599|pdb|1O7A               56%            Blast
867691|gb|AAA68620.1            55%            PsiBlast, 3 Iterations, E-Value Cutoff = 0.005
189239563|ref|XP_975660.2       47%            Blast

39%-20% Sequence Identity
299139410|ref|ZP_07032585.1     36%            PsiBlast, 3 Iterations, E-Value Cutoff = 0.005
281209747|gb|EFA83915.1         33%            Blast
166159759|gb|ABY83272.1         32%            PsiBlast, 5 Iterations, E-Value Cutoff = 0.005
251836937|pdb|3GH4              26.4%          Fasta
212691177|ref|ZP_03299305.1     22%            PsiBlast, 3 Iterations, E-Value Cutoff = 10E-6
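How the sequences in the table above were chosen from the hit lists is not described. One possible way to obtain such a selection is sketched below, assuming the hits are available as (identifier, identity in percent, source) tuples and that up to five sequences with the highest identity are taken per range. Both the selection rule and the name pick_per_range are assumptions, not the procedure we actually used.

```python
# Identity ranges (lower and upper bound in percent) as used in the table above.
RANGES = [(90, 99), (60, 89), (40, 59), (20, 39)]


def pick_per_range(hits, per_range=5):
    """Select up to `per_range` distinct hits for each identity range.

    hits is an iterable of (identifier, identity_percent, source) tuples.
    Hits are visited from highest to lowest identity; hits outside all
    ranges (for example 100% identity) are ignored.
    """
    selected = {bounds: [] for bounds in RANGES}
    seen = set()
    for identifier, identity, source in sorted(hits, key=lambda h: -h[1]):
        if identifier in seen:
            continue
        for low, high in RANGES:
            # high + 1 so that values such as 95.1 or 99.0 fall into 90-99.
            if low <= identity < high + 1 and len(selected[(low, high)]) < per_range:
                selected[(low, high)].append((identifier, identity, source))
                seen.add(identifier)
                break
    return selected


example_hits = [
    ("109157872|pdb|2GK1", 99, "Blast"),
    ("296213630|ref|XP_002753354.1", 95.1, "Fasta"),
    ("38492599|pdb|1O7A", 56, "Blast"),
    ("251836937|pdb|3GH4", 26.4, "Fasta"),
    ("109157872|pdb|2GK1", 99, "Fasta"),  # duplicate reported by a second method
]
selection = pick_per_range(example_hits)
for bounds, chosen in selection.items():
    print(bounds, [identifier for identifier, _, _ in chosen])
```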




Multiple Alignments

  • Cobalt

To use Cobalt, we first had to install the program on our virtual box.

Howto:

  • Download Cobalt from ftp://ftp.ncbi.nlm.nih.gov/pub/cobalt/executables/2.0.1/ (ncbi-cobalt-2.0.1-x64-linux.tar).
  • Uncompress the archive file with tar xfz ncbi-cobalt-2.0.1-x64-linux.tar and change to the uncompressed cobalt directory.
  • This directory contains an executable file called cobalt.

Call: ./cobalt -i mult_seq.fasta -norps T > cobalt_out.aln

  • ClustalW

clustalw -infile=mult_seq.fasta > clustalW_out.aln

  • Muscle

muscle -in mult_seq.fasta -out muscle_out.aln -clw

  • T-Coffee

t_coffee -seq mult_seq.fasta

  • T-Coffee (3D)

t_coffee -seq mult_seq.fasta -mode expresso

Conservation

Next, we wanted to know how strongly the sequences in our multiple sequence alignments are conserved. We compared each position of the alignment to find and count the conserved columns. Furthermore, we also counted the number of gaps in each alignment.

Alignment methods   Gaps   Conserved columns
                           100% cons   >90% cons   >80% cons   >70% cons   >60% cons   >50% cons   >40% cons
Cobalt              384    31          68          75          87          76          0           65
ClustalW            346    29          61          81          84          71          0           65
Muscle              463    32          70          74          84          76          0           74
T-Coffee            609    31          67          74          90          73          0           70
3D T-Coffee         533    32          64          77          89          74          0           70

The alignments differ in the number of gaps, which ranges from 346 (ClustalW) to 609 (T-Coffee). The conservation levels show a clear trend: fully conserved columns (100%) are less frequent than columns in the lower conservation classes. Interestingly, none of the alignments contains any column in the >50% conservation class.
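The counting script itself is not part of this page. The following minimal sketch shows one way such a summary could be computed, assuming the alignment is given as a list of equally long strings with "-" as the gap character, that conservation is the share of sequences carrying the most common residue of a column, and that each column is counted only in the highest class it reaches. These definitions and the name summarize are assumptions and may differ from the script we used.

```python
from collections import Counter


def summarize(alignment, thresholds=(90, 80, 70, 60, 50, 40)):
    """Count gaps and assign each column to its highest conservation class."""
    n_seqs = len(alignment)
    summary = {"gaps": sum(seq.count("-") for seq in alignment), "100%": 0}
    summary.update({f">{t}%": 0 for t in thresholds})
    for column in zip(*alignment):
        residues = [char for char in column if char != "-"]
        if not residues:
            continue
        top = Counter(residues).most_common(1)[0][1]
        if top == n_seqs:
            summary["100%"] += 1
            continue
        for t in thresholds:
            # Integer comparison avoids floating point rounding at the borders.
            if top * 100 > t * n_seqs:
                summary[f">{t}%"] += 1
                break
    return summary


# Toy alignment for illustration only.
toy_alignment = ["MKTA", "MKSA", "MR-A", "MKTG"]
toy_summary = summarize(toy_alignment)
print(toy_summary)
```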

Gaps

Figure 11: Secondary structure alignment for the protein Hexosaminidase A (http://www.pdb.org/pdb/explore/remediatedChain.do?structureId=2GJX&chainId=A)

To identify gaps in secondary structure elements, we wrote a script that compares the aligned sequence of hexosaminidase with the secondary structure sequence from the PDB (http://www.pdb.org/pdb/explore/sequenceText.do?structureId=2GJX&chainId=A), which is visualized in Figure 11. The results for all multiple alignment tools are listed in the following table.

Alignment methods   Sum of Gaps   Gaps in Secondary Structure Elements
                                  Helix   Extended   Coil
Cobalt              384           5       5          1
ClustalW            346           2       0          0
Muscle              463           3       4          1
T-Coffee            609           4       7          4
3D T-Coffee         533           5       4          2
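The comparison script is not shown here. A minimal sketch of the idea, under several assumptions, is given below: the secondary structure is a string with one state per residue of the ungapped reference sequence ("H" for helix, "E" for extended, anything else treated as coil), a gap is counted once per run of "-" characters, and a gap lies inside an element when the residues on both sides have the same state. The function names are hypothetical.

```python
def state(symbol):
    """Reduce a secondary structure symbol to H (helix), E (extended) or C (coil)."""
    return symbol if symbol in ("H", "E") else "C"


def gaps_in_elements(aligned_ref, sec_struct):
    """Count gap runs in the aligned reference row that fall inside one element."""
    if len(aligned_ref.replace("-", "")) != len(sec_struct):
        raise ValueError("secondary structure must match the ungapped sequence")
    counts = {"H": 0, "E": 0, "C": 0}
    residues_seen = 0
    in_gap = False
    for char in aligned_ref:
        if char == "-":
            # Only the first character of a gap run is evaluated;
            # leading and trailing gaps have no residue on one side.
            if not in_gap and 0 < residues_seen < len(sec_struct):
                left = state(sec_struct[residues_seen - 1])
                right = state(sec_struct[residues_seen])
                if left == right:
                    counts[left] += 1
            in_gap = True
        else:
            residues_seen += 1
            in_gap = False
    return counts


# Toy example: one gap run inside a helix, two at element borders.
toy_counts = gaps_in_elements("MK--TA-LV-G", "HHHHEEC")
print(toy_counts)
```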

Functional residues

We found several functional residues in the UniProt database (http://www.uniprot.org/uniprot/P06865):

Function         Position
active site      323
Glycosylation    115
Glycosylation    157
Glycosylation    295
Disulfide bond   58 <-> 104
Disulfide bond   277 <-> 328
Disulfide bond   505 <-> 522

Because these residues are functionally important, they should be conserved. We compared the different alignments and checked whether these residues are conserved.

Amino acid          Residue position          Cobalt                 ClustalW               Muscle                 T-Coffee               3D T-Coffee            Dominant substitution
E (active site)     323                       conserved (21/21)      conserved (21/21)      conserved (21/21)      conserved (21/21)      conserved (21/21)      none
N (Glycosylation)   115                       non-conserved (12/21)  non-conserved (13/21)  non-conserved (13/21)  non-conserved (13/21)  non-conserved (13/21)  Serine
N (Glycosylation)   157                       non-conserved (16/21)  non-conserved (16/21)  non-conserved (16/21)  non-conserved (16/21)  non-conserved (16/21)  Proline and Serine
N (Glycosylation)   295                       non-conserved (14/21)  non-conserved (14/21)  non-conserved (14/21)  non-conserved (14/21)  non-conserved (14/21)  Proline and Aspartic acid
C (Disulfide bond)  58 (connected with 104)   non-conserved (17/21)  non-conserved (15/21)  non-conserved (17/21)  non-conserved (17/21)  non-conserved (16/21)  no dominant substitution
C (Disulfide bond)  104 (connected with 58)   non-conserved (14/21)  non-conserved (14/21)  non-conserved (14/21)  non-conserved (14/21)  non-conserved (15/21)  no dominant substitution
C (Disulfide bond)  277 (connected with 328)  conserved (20/21)      non-conserved (16/21)  non-conserved (18/21)  non-conserved (19/21)  non-conserved (17/21)  Serine
C (Disulfide bond)  328 (connected with 277)  non-conserved (19/21)  non-conserved (19/21)  non-conserved (19/21)  non-conserved (19/21)  non-conserved (19/21)  Arginine and Glutamic acid
C (Disulfide bond)  505 (connected with 522)  non-conserved (17/21)  non-conserved (17/21)  non-conserved (16/21)  non-conserved (16/21)  non-conserved (16/21)  Tyrosine and Glutamine
C (Disulfide bond)  522 (connected with 505)  non-conserved (18/21)  non-conserved (16/21)  non-conserved (18/21)  non-conserved (18/21)  non-conserved (16/21)  no dominant substitution

As can be seen in the table above, only the active site is completely conserved in all alignments. The other positions are nevertheless conserved in most sequences (between 12 and 20 of 21). Some of the observed substitutions probably do not damage the protein much.
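To check a residue in an alignment, its position in the ungapped sequence first has to be mapped to an alignment column, because gaps shift the columns. The sketch below illustrates this step. It assumes the reference row of the alignment covers the full-length sequence, so that its residue numbering matches the UniProt positions; the function names and the toy alignment are hypothetical.

```python
from collections import Counter


def column_for_position(aligned_ref, position):
    """Map a 1-based residue position of the ungapped reference to a 0-based column."""
    residues_seen = 0
    for column, char in enumerate(aligned_ref):
        if char != "-":
            residues_seen += 1
            if residues_seen == position:
                return column
    raise ValueError("position lies outside the reference sequence")


def residue_conservation(alignment, ref_index, position):
    """Return the reference residue, how many rows share it, the row count,
    and the other symbols in that column ordered by frequency."""
    column = column_for_position(alignment[ref_index], position)
    symbols = [seq[column] for seq in alignment]
    ref_residue = symbols[ref_index]
    matches = symbols.count(ref_residue)
    others = Counter(char for char in symbols if char != ref_residue)
    return ref_residue, matches, len(alignment), others.most_common()


# Toy alignment for illustration only; row 0 is the reference.
toy = ["M-KNE", "MAKSE", "M-KNE", "MGRSE"]
print(residue_conservation(toy, 0, 3))
print(residue_conservation(toy, 0, 4))
```

The first value of the returned list of other symbols corresponds to the dominant substitution column in the table above; a "-" in that list means that some sequences have a gap at this position.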
