Difference between revisions of "Sequence Alignments HEXA"
Revision as of 19:21, 23 May 2011
Sequence Searches
Use of database searching tools
- FASTA
/bin/fasta36 seq.fasta /data/blast/nr/nr > fasta_out.txt
- BLAST
blastall -p blastp -d /data/blast/nr/nr -i mult_seq.fasta > blast_out.txt
- PSIBLAST
blastpgp -i seq.fasta -j <#iterations> -h <e-value threshold> -d /data/blast/nr/nr > psiblast_out.txt
- HHSearch
For the HHSearch tool we used the online server for HHSearch.
Statistical results
For the statistical analysis we wrote a script that reports the distribution of the E-values and sequence identities as well as the set of sequences aligned by each method. Furthermore, we created Venn diagrams to present the overlap between the results of the different search methods (with http://bioinfogp.cnb.csic.es/tools/venny/index.html). First, we compared BLAST, FASTA and PsiBlast (PsiBlast with 3 and 5 iterations and an E-value cutoff of 10E-6). Then we looked at the overlap between all PsiBlast runs.
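The script itself is not shown on this page. As a minimal sketch of how such a distribution could be computed, the following Python code bins the sequence identities of a hit list. It assumes tab-separated hits in the 12-column tabular layout (as legacy BLAST writes with -m 8), which is not the output format of the commands above; the function names and the example hits are hypothetical.

```python
from collections import Counter


def parse_tabular_hits(lines):
    """Yield (subject_id, percent_identity, e_value) from tab-separated hit lines.

    Assumes the 12-column tabular layout: query, subject, % identity,
    alignment length, mismatches, gap openings, query start/end,
    subject start/end, E-value, bit score.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        yield fields[1], float(fields[2]), float(fields[10])


def identity_histogram(hits, bin_width=10):
    """Count hits per identity bin; keys are the lower bin edges in percent."""
    counts = Counter()
    for _subject, identity, _evalue in hits:
        # Put 100% into the top bin instead of opening a bin of its own.
        lower = min(int(identity // bin_width) * bin_width, 100 - bin_width)
        counts[lower] += 1
    return dict(sorted(counts.items()))


# Made-up example hits, for illustration only.
example = [
    "query\tsubjA\t99.0\t529\t5\t0\t1\t529\t1\t529\t0.0\t1000",
    "query\tsubjB\t56.2\t500\t200\t3\t10\t520\t5\t510\t1e-150\t530",
    "query\tsubjC\t26.4\t300\t210\t8\t40\t330\t12\t320\t2e-8\t60",
    "query\tsubjD\t22.0\t280\t205\t9\t50\t320\t20\t310\t0.5\t35",
]
hits = list(parse_tabular_hits(example))
print(identity_histogram(hits))
```

The same loop could collect the E-values instead of the identities; since E-values span many orders of magnitude, binning them on a logarithmic scale is probably more informative.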
Overlap of the aligned sequences
FASTA found a large number of matches that were not found by the other methods: about 1400 of its hits were reported by neither BLAST nor PsiBlast. This is much higher than the number of sequences found by both FASTA and BLAST. We conclude that FASTA aligns many sequences whose alignments are probably of lower quality or even wrong. The two PsiBlast variants deliver the same hits, all of which were also found by FASTA. Furthermore, all sequences found by BLAST were also aligned by FASTA, and most of them by PsiBlast as well.
In addition, we compared different PsiBlast runs: 3 iterations with E-value cutoffs of 0.005 and 10E-6, and 5 iterations with the same two cutoffs. The Venn diagram shows that the results overlap almost completely; only a few hits differ between the runs. This suggests that PsiBlast usually delivers similar results for different iteration numbers and E-value cutoffs. In summary, the BLAST-based methods agree with each other, whereas FASTA delivers many more sequences that the other methods do not report.
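The overlaps were obtained with the Venny web tool. The same numbers could also be computed directly from the hit lists with set operations, as in this minimal sketch; the function name and the identifiers in the example are hypothetical.

```python
def overlap_report(hits_by_method):
    """Return the hits shared by all methods and, per method, the hits no other method reported."""
    shared = set.intersection(*hits_by_method.values())
    unique = {}
    for method, ids in hits_by_method.items():
        # Union of the hits of every other method.
        others = set().union(
            *(v for m, v in hits_by_method.items() if m != method)
        )
        unique[method] = ids - others
    return shared, unique


# Made-up identifiers, for illustration only.
hits_by_method = {
    "BLAST": {"id1", "id2", "id3"},
    "FASTA": {"id1", "id2", "id3", "id4", "id5"},
    "PsiBlast": {"id1", "id2"},
}
shared, unique = overlap_report(hits_by_method)
print(sorted(shared))
print({method: sorted(ids) for method, ids in unique.items()})
```

In this toy input every BLAST and PsiBlast hit is also a FASTA hit, and only FASTA has hits of its own, which mirrors the pattern described above.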
Distribution of the sequence identity and the e-value
The following plots show the distributions of the sequence identities and the E-values for all methods used. Both values (x-axis) and their frequencies (y-axis) were extracted from the corresponding output files.
The first image shows the distribution of the sequence identities. The first plot is the distribution for BLAST. Its identity distribution is very balanced, which means that low identities are about as common as very high ones. The same holds for the HHSearch plot. In contrast, the FASTA distribution has high frequencies for very small identities. This means that FASTA aligns many sequences although they share only a low sequence identity with the query, which could explain why FASTA reports so many hits that the other search tools do not find (see the Venn diagram). The last four plots show the corresponding distributions for the different PsiBlast runs, which are very similar. This is another indication that PsiBlast produced very similar results for different parameters. Their distributions are also very balanced, with high frequencies for low, intermediate and high identities.
The second image shows the distribution of the E-values. The E-value is the number of hits with an equally good or better score that is expected to occur by chance. Therefore, the smaller the E-value, the more significant the alignment. All plots except the one for FASTA have high frequencies for small E-values, with BLAST reaching the smallest E-values. The E-values of FASTA range from 0 to 8, whereas the other methods have no E-value higher than 1. Furthermore, the highest BLAST E-value is about 1e-29, which is still very low. In summary, this again suggests that BLAST delivers the best results and FASTA the worst.
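As a reminder of where these numbers come from: for BLAST-type searches the E-value follows the Karlin-Altschul statistics. The formula below is the standard textbook form and is not taken from the analysis on this page. Here m is the query length, n the total length of the database, S the raw alignment score, and K and λ are parameters that depend on the scoring system.

```latex
E = K \, m \, n \, e^{-\lambda S},
\qquad
P = 1 - e^{-E}
```

E is an expected count and can therefore be larger than 1, while P is the probability of finding at least one chance hit with a score of at least S. FASTA estimates its E-values in a different way, so E-values from different tools are only roughly comparable.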
True positive hits
HSSP (Homology-derived Secondary Structure of Proteins) lists proteins that are homologous and have a similar secondary structure. Therefore, we used the HSSP alignment to check our results: the sequences that occur both in our hits and in the HSSP alignment are the true positives. FASTA has a greater overlap than the other methods (about 10 sequences more). The BLAST result and the results of the BLAST variants (PsiBlast runs) show very similar overlaps. In this case, FASTA therefore gave the most true positives (although FASTA also has a huge number of false positive predictions).
With the results of these analyses, we created our file for the multiple alignments.
SeqIdentifier | Seq Identity | source |
---|---|---|
99%-90% Sequence Identity | ||
109157872|pdb|2GK1 | 99% | Blast |
179460|gb|AAA51827.1 | 99% | Blast |
194375013|dbj|BAG62619.1 | 97% | Blast |
296213630|ref|XP_002753354.1 | 95.1% | Fasta |
297296816|ref|XP_002804897.1 | 93% | Blast |
89%-60% Sequence Identity | ||
149692271|ref|XP_001494361.1 | 85% | Blast |
187607461|ref|NP_001119815.1 | 84.3% | Fasta |
67514549|ref|NP_034551.2 | 84.1% | Fasta |
74213671|dbj|BAE35636.1 | 84% | Blast |
178056464|ref|NP_001116693.1 | 83.2% | Fasta |
59%-40% Sequence Identity | ||
187608414|ref|NP_001120459.1 | 58.3% | Fasta |
213513173|ref|NP_001133930.1 | 57.0% | Fasta |
38492599|pdb|1O7A | 56% | Blast |
867691|gb|AAA68620.1 | 55% | PsiBlast, 3 Iterations, E-Value Cutoff = 0.005 |
189239563|ref|XP_975660.2 | 47% | Blast |
39%-20% Sequence Identity | ||
299139410|ref|ZP_07032585.1 | 36% | PsiBlast, 3 Iterations, E-Value Cutoff = 0.005 |
281209747|gb|EFA83915.1 | 33% | Blast |
166159759|gb|ABY83272.1 | 32% | PsiBlast, 5 Iterations, E-Value Cutoff = 0.005 |
251836937|pdb|3GH4 | 26.4% | Fasta |
212691177|ref|ZP_03299305.1 | 22% | PsiBlast, 3 Iterations, E-Value Cutoff = 10E-6 |
Multiple Alignments
- Cobalt
For Cobalt it was first necessary to install the program on our virtual box: download Cobalt from ftp://ftp.ncbi.nlm.nih.gov/pub/cobalt/executables/2.0.1/ (ncbi-cobalt-2.0.1-x64-linux.tar), uncompress the archive file with tar xfz ncbi-cobalt-2.0.1-x64-linux.tar and change directory to the uncompressed cobalt directory.
Call: ./cobalt -i mult_seq.fasta -norps T > cobalt_out.aln
- ClustalW
clustalw -infile=mult_seq.fasta > clustalW_out.aln
- Muscle
muscle -in mult_seq.fasta -out muscle_out.aln -clw
- T-Coffee
t_coffee -seq mult_seq.fasta
- T-Coffee (3D)
t_coffee -seq mult_seq.fasta -mode expresso
Conservation
Alignment method | Gaps | 100% cons | >90% cons | >80% cons | >70% cons | >60% cons | >50% cons | >40% cons |
---|---|---|---|---|---|---|---|---|
Cobalt | 384 | 31 | 68 | 75 | 87 | 76 | 0 | 65 |
ClustalW | 346 | 29 | 61 | 81 | 84 | 71 | 0 | 65 |
Muscle | 463 | 32 | 70 | 74 | 84 | 76 | 0 | 74 |
T-Coffee | 609 | 31 | 67 | 74 | 90 | 73 | 0 | 70 |
3D T-Coffee | 533 | 32 | 64 | 77 | 89 | 74 | 0 | 70 |
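The page does not state how the conservation of a column was measured. One plausible reading is the fraction of sequences that carry the most common residue of the column; the following minimal sketch computes that value for a toy alignment. The definition, the treatment of gaps and the function name are assumptions.

```python
from collections import Counter


def column_conservation(alignment):
    """Return, per column, the fraction of sequences carrying the most common residue.

    Gap characters ('-') count towards the column size but never as the
    most common residue; this convention is an assumption.
    """
    n_seqs = len(alignment)
    fractions = []
    for column in zip(*alignment):
        residues = Counter(c for c in column if c != "-")
        top = residues.most_common(1)[0][1] if residues else 0
        fractions.append(top / n_seqs)
    return fractions


# Toy alignment with rows of equal length, for illustration only.
alignment = [
    "MKT-LV",
    "MKTALV",
    "MRT-LI",
    "MKS-LV",
]
fractions = column_conservation(alignment)
fully_conserved = sum(1 for f in fractions if f == 1.0)
print(fractions, fully_conserved)
```

Counting the columns whose fraction falls into a given range would then give one entry of a table like the one above.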
Gaps
To identify gaps in secondary structure elements we wrote a script that compares the aligned sequence of hexosaminidase with the secondary structure sequence from PDB (http://www.pdb.org/pdb/explore/sequenceText.do?structureId=2GJX&chainId=A). The following table shows the results for all multiple alignment tools.
Alignment method | Sum of gaps | Gaps in helix | Gaps in extended | Gaps in coil |
---|---|---|---|---|
Cobalt | 384 | 5 | 5 | 1 |
ClustalW | 346 | 2 | 0 | 0 |
Muscle | 463 | 3 | 4 | 1 |
T-Coffee | 609 | 4 | 7 | 4 |
3D T-Coffee | 533 | 5 | 4 | 2 |
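The comparison script is not shown here. A minimal sketch of one way to do it: walk along the aligned reference row and count every run of gaps whose neighbouring residues belong to the same secondary structure state. This definition of "gap in an element", the state letters and the function name are assumptions.

```python
from collections import Counter


def gaps_in_elements(aligned_seq, sec_struct):
    """Count gap runs that interrupt a secondary structure element.

    aligned_seq: the reference row of the alignment, '-' marks a gap.
    sec_struct:  one state letter per ungapped residue (e.g. H, E, C).
    A gap run is counted for a state when the residues directly before
    and after it share that state.
    """
    if len(aligned_seq.replace("-", "")) != len(sec_struct):
        raise ValueError("structure string must match the ungapped sequence")
    counts = Counter()
    residue_index = 0  # number of residues seen so far
    in_gap = False
    for char in aligned_seq:
        if char == "-":
            in_gap = True
            continue
        # Leading gaps (residue_index == 0) have no residue before them.
        if in_gap and residue_index > 0:
            before = sec_struct[residue_index - 1]
            after = sec_struct[residue_index]
            if before == after:
                counts[before] += 1
        in_gap = False
        residue_index += 1
    return dict(counts)


# Toy example: 8 residues, helix (H), extended (E) and coil (C).
result = gaps_in_elements("MK--TL-VAG-S", "HHHHEECC")
print(result)
```

In the toy example the gap between L and V lies on the border between helix and extended and is therefore not counted. Leading and trailing gaps are ignored as well.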
Functional residues
We found several functional residues from (link still missing). Because these residues are functionally important, they should be conserved. We compared the different alignments and checked whether these residues are conserved.
Residue | Position | Cobalt | ClustalW | Muscle | T-Coffee | 3D T-Coffee |
---|---|---|---|---|---|---|
R | 178 | conserved | conserved | conserved | conserved | conserved |
D | 207 | conserved (once E) | conserved (once E) | conserved (once E) | conserved (once E) | conserved (once E) |
H | 262 | conserved | conserved | conserved | conserved | conserved |
D | 322 | conserved | conserved | conserved | conserved | conserved |
E | 323 | conserved | conserved | conserved | conserved | conserved |
W | 373 | conserved | conserved (once V, R) | conserved | conserved | conserved |
W | 392 | conserved | conserved (once P, T, G) | conserved | conserved | conserved |
Y | 421 | conserved | conserved (twice G, once -, S) | conserved | conserved | conserved (once H) |
N | 423 | non-conserved | non-conserved | non-conserved | non-conserved | non-conserved |
W | 460 | conserved (once -) | conserved (once -) | conserved (once -) | conserved (once -) | conserved (once -) |
E | 462 | conserved (once -) | conserved (once Q) | conserved (once Q) | conserved (once -) | conserved (once -) |
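Such a check could be automated by mapping each residue position of the reference sequence to its alignment column and tallying what the other sequences carry there. The following is a minimal sketch on a toy alignment. It assumes that positions are 1-based and refer to the ungapped reference row, which may differ from the numbering used in the table; all names are hypothetical.

```python
from collections import Counter


def column_of_position(aligned_ref, position):
    """Return the 0-based alignment column of a 1-based residue position in the reference row."""
    seen = 0
    for column, char in enumerate(aligned_ref):
        if char != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError("position lies beyond the end of the reference sequence")


def residues_at(alignment, ref_name, position):
    """Tally the characters of all sequences in the column of the given reference position."""
    column = column_of_position(alignment[ref_name], position)
    return Counter(row[column] for row in alignment.values())


# Toy alignment, for illustration only.
alignment = {
    "ref": "M-RDE",
    "seq1": "MARDE",
    "seq2": "M-RE-",
}
print(residues_at(alignment, "ref", 3))
print(residues_at(alignment, "ref", 4))
```

A column with a single dominant residue and one or two exceptions would correspond to an entry such as "conserved (once E)" in the table.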