Difference between revisions of "Sequence Alignments HEXA"

From Bioinformatikpedia

Revision as of 15:42, 23 May 2011

Sequence Searches

Use of database searching tools

  • FASTA

/bin/fasta36 seq.fasta /data/blast/nr/nr > fasta_out.txt

  • BLAST

blastall -p blastp -d /data/blast/nr/nr -i mult_seq.fasta > blast_out.txt

  • PSIBLAST

blastpgp -i seq.fasta -j <#iterations> -h <e-value threshold> -d /data/blast/nr/nr > psiblast_out.txt

  • HHSearch

For the HHSearch tool we used the online server for HHSearch.
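
The PSIBLAST call above was run with several parameter combinations (3 and 5 iterations, E-value cutoffs 0.005 and 10E-6, as described in the statistics section). The following is a minimal Python sketch of how these four command lines could be generated. It assumes the legacy `blastpgp` binary and the flags shown above; the function name and the output file naming scheme are hypothetical and not taken from the original page.

```python
from itertools import product

DB = "/data/blast/nr/nr"  # database path as used in the commands above


def psiblast_commands(query="seq.fasta",
                      iterations=(3, 5),
                      evalues=("0.005", "10E-6")):
    """Return (argument list, output file) pairs, one per parameter combination."""
    commands = []
    for n_iter, evalue in product(iterations, evalues):
        # Hypothetical naming scheme for the output files.
        out_file = f"psiblast_j{n_iter}_h{evalue}.txt"
        args = ["blastpgp", "-i", query, "-j", str(n_iter),
                "-h", evalue, "-d", DB]
        commands.append((args, out_file))
    return commands


# Print the commands instead of running them.
for args, out_file in psiblast_commands():
    print(" ".join(args), ">", out_file)
```

The E-value cutoffs are passed through as strings exactly as written on this page, so the sketch does not reinterpret the notation 10E-6.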

Statistical results

For the statistical analysis we wrote a script which shows the distribution of the E-value and the sequence identity as well as the different aligned sequences. Furthermore, we created Venn diagrams to present the overlap between the results of the different search methods (with http://bioinfogp.cnb.csic.es/tools/venny/index.html). First we compared the methods BLAST, FASTA and PsiBlast (PsiBlast with 3 and 5 iterations and an E-value cutoff of 10E-6). Then we looked at the overlap of all PsiBlast runs.


Overlap of the aligned sequences

FASTA found a large number of matches that were not found by the other methods: about 1400 of its hits were found by neither BLAST nor PsiBlast. This is much higher than the number of sequences found by both FASTA and BLAST. It suggests that many of the sequences aligned by FASTA are probably poorer matches or even wrong. The two PsiBlast variants deliver the same hits, all of which are also found by FASTA. Furthermore, all sequences found by BLAST were also aligned by FASTA, and most of them also by PsiBlast.

In addition, we compared different runs of PsiBlast: two runs with 3 iterations and E-value cutoffs of 0.005 and 10E-6, and two runs with 5 iterations and the same two E-value cutoffs. The Venn diagram shows that the results mostly overlap; only a few hits differ between the runs. PsiBlast therefore usually delivers similar results for different iteration numbers and E-value cutoffs.

In summary, the BLAST methods agree with each other. In contrast, FASTA delivers many more sequences, which are not found by any of the other methods.


Overlap of results from BLAST, FASTA and PsiBlast
Overlap of the results from the different PsiBlast runs. Legend: Psiblast 1: 3 iterations, E-value cutoff 0.005; Psiblast 2: 5 iterations, E-value cutoff 0.005; Psiblast 3: 3 iterations, E-value cutoff 10E-6; Psiblast 4: 5 iterations, E-value cutoff 10E-6




The differences between the PsiBlast variants are small. Only Psiblast 1 has 6 aligned sequences that are not shared by the other variants.
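
The overlaps shown in the Venn diagrams can also be computed directly from the hit lists. The following Python sketch is one possible way to do this and is not the script used for this page. It assumes that each method's hits are already available as a set of sequence identifiers; the function name and the toy identifiers are made up for illustration.

```python
def venn_regions(hits_by_method):
    """Map each combination of methods to the IDs found by exactly those methods."""
    regions = {}
    all_ids = set().union(*hits_by_method.values())
    for seq_id in all_ids:
        # Set of all methods that report this sequence.
        found_by = frozenset(m for m, ids in hits_by_method.items() if seq_id in ids)
        regions.setdefault(found_by, set()).add(seq_id)
    return regions


# Toy identifiers, not real search results.
hits = {
    "BLAST": {"s1", "s2"},
    "FASTA": {"s1", "s2", "s3", "s4"},
    "PsiBlast": {"s1", "s3"},
}
for methods, ids in venn_regions(hits).items():
    print(sorted(methods), len(ids))
```

Each key of the returned dictionary corresponds to one region of the Venn diagram, so the size of the FASTA-only region is `len(regions[frozenset({"FASTA"})])`.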


Sequence identity

Sequence identity comparison of the different methods


E-Value

E-Value comparison of the different methods



True positive hits

HSSP (Homology-derived Secondary Structure of Proteins) lists proteins which are homologous and have a similar secondary structure. We therefore use the HSSP alignment to check our results by determining how much the hits of each method overlap with HSSP. The overlapping sequences are the true positives.

Overlap between HSSP and FASTA
Overlap between HSSP and Blast
Overlap between HSSP and PsiBlast with 3 Iterations and 10E-6 cutoff
Overlap between HSSP and PsiBlast with 5 Iterations and 10E-6 cutoff
Overlap between HSSP and PsiBlast with 3 Iterations and 0.005 cutoff
Overlap between HSSP and PsiBlast with 5 Iterations and 0.005 cutoff
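
The true-positive check against HSSP described above amounts to a set intersection. The following Python sketch illustrates this under the assumption that the HSSP entries and the hits of each method use the same kind of sequence identifier; the function name, the result keys and the toy identifiers are hypothetical.

```python
def hssp_summary(hits_by_method, hssp_ids):
    """Count, per method, the hits that also occur in the HSSP alignment."""
    hssp = set(hssp_ids)
    summary = {}
    for method, hits in hits_by_method.items():
        hits = set(hits)
        true_pos = hits & hssp  # hits confirmed by HSSP
        summary[method] = {
            "hits": len(hits),
            "true_positives": len(true_pos),
            "not_in_hssp": len(hits - hssp),
            # Share of the HSSP sequences that the method recovered.
            "fraction_of_hssp": len(true_pos) / len(hssp) if hssp else 0.0,
        }
    return summary


# Toy identifiers, not real data.
hssp_ids = {"a", "b", "c", "d"}
hits = {"FASTA": {"a", "b", "c", "x", "y"}, "BLAST": {"a", "b"}}
for method, row in hssp_summary(hits, hssp_ids).items():
    print(method, row)
```

A hit that is missing from HSSP is not necessarily wrong, so `not_in_hssp` is only an upper bound on the false positives.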


Based on the results of these analyses, we created our input file for the multiple alignments.

SeqIdentifier                   Seq Identity  Source
99%-90% Sequence Identity
109157872|pdb|2GK1              99%           Blast
179460|gb|AAA51827.1            99%           Blast
194375013|dbj|BAG62619.1        97%           Blast
296213630|ref|XP_002753354.1    95.1%         Fasta
297296816|ref|XP_002804897.1    93%           Blast
89%-60% Sequence Identity
149692271|ref|XP_001494361.1    85%           Blast
187607461|ref|NP_001119815.1    84.3%         Fasta
67514549|ref|NP_034551.2        84.1%         Fasta
74213671|dbj|BAE35636.1         84%           Blast
178056464|ref|NP_001116693.1    83.2%         Fasta
59%-40% Sequence Identity
187608414|ref|NP_001120459.1    58.3%         Fasta
213513173|ref|NP_001133930.1    57.0%         Fasta
38492599|pdb|1O7A               56%           Blast
867691|gb|AAA68620.1            55%           PsiBlast, 3 Iterations, E-Value Cutoff = 0.005
189239563|ref|XP_975660.2       47%           Blast
39%-20% Sequence Identity
299139410|ref|ZP_07032585.1     36%           PsiBlast, 3 Iterations, E-Value Cutoff = 0.005
281209747|gb|EFA83915.1         33%           Blast
166159759|gb|ABY83272.1         32%           PsiBlast, 5 Iterations, E-Value Cutoff = 0.005
251836937|pdb|3GH4              26.4%         Fasta
gi|212691177|ref|ZP_03299305.1  22%           PsiBlast, 3 Iterations, E-Value Cutoff = 10E-6
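
The table groups the selected sequences into four identity ranges. The page does not describe how the sequences were picked, so the following Python sketch only illustrates one way such a grouping could be automated. The lower bounds of the ranges, the limit of five sequences per range and the function names are assumptions.

```python
# (label, lower bound in percent); ranges as used in the table above.
BINS = [("99%-90%", 90.0), ("89%-60%", 60.0), ("59%-40%", 40.0), ("39%-20%", 20.0)]


def bin_label(identity):
    """Return the label of the identity range, or None below 20%."""
    for label, lower in BINS:
        if identity >= lower:
            return label
    return None


def pick_per_bin(hits, per_bin=5):
    """Keep at most per_bin hits per range, preferring higher identity."""
    selected = {label: [] for label, _ in BINS}
    for seq_id, identity in sorted(hits, key=lambda h: h[1], reverse=True):
        label = bin_label(identity)
        if label is not None and len(selected[label]) < per_bin:
            selected[label].append((seq_id, identity))
    return selected


# A few rows from the table above as (identifier, identity in percent).
hits = [
    ("109157872|pdb|2GK1", 99.0),
    ("149692271|ref|XP_001494361.1", 85.0),
    ("38492599|pdb|1O7A", 56.0),
    ("251836937|pdb|3GH4", 26.4),
]
for label, rows in pick_per_bin(hits).items():
    print(label, rows)
```

Note that with these lower bounds a hit with 100% identity would also fall into the first range.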

Multiple Alignments

  • Cobalt

Download Cobalt from ftp://ftp.ncbi.nlm.nih.gov/pub/cobalt/executables/2.0.1/ (ncbi-cobalt-2.0.1-x64-linux.tar). Extract the archive with tar xfz ncbi-cobalt-2.0.1-x64-linux.tar and change into the extracted cobalt directory. Call: ./cobalt -i mult_seq.fasta -norps T > cobalt_out.aln

  • ClustalW

clustalw -infile=mult_seq.fasta > clustalW_out.aln

  • Muscle

muscle -in mult_seq.fasta -out muscle_out.aln -clw

  • T-Coffee

t_coffee -seq mult_seq.fasta

  • T-Coffee (3D)

t_coffee -seq mult_seq.fasta -mode expresso


Conservation

Alignment method  Gaps  Conserved columns
                        100% cons  >90% cons  >80% cons  >70% cons  >60% cons  >50% cons  >40% cons
Cobalt            384   31         68         75         87         76         0          65
ClustalW          346   29         61         81         84         71         0          65
Muscle            463   32         70         74         84         76         0          74
T-Coffee          609   31         67         74         90         73         0          70
3D T-Coffee       533   32         64         77         89         74         0          70
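
The page does not state how the conservation values in the table were calculated. As an illustration only, the following Python sketch computes a simple per-column conservation score, assuming that conservation means the fraction of sequences carrying the most common residue of a column and that gaps count against conservation. The function names and the toy alignment are hypothetical.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences that carry the most common residue, per column."""
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)  # column consists of gaps only
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / n_seqs)
    return scores


def count_gaps(alignment):
    """Total number of gap characters in the alignment."""
    return sum(row.count("-") for row in alignment)


# Toy alignment with equally long rows, not taken from the real data.
alignment = ["ACD-", "ACE-", "AGEK"]
scores = column_conservation(alignment)
print(scores, count_gaps(alignment))
print(sum(1 for s in scores if s == 1.0), "fully conserved column(s)")
```

Columns can then be counted per threshold by comparing the scores with the desired percentage, as in the last line of the sketch.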


Gaps

Alignment method  Gaps in total  Secondary structure
                                 Helix  Extended  Coil  No secondary structure
Cobalt            384            5      5         1     373
ClustalW          346            2      0         0     244
Muscle            463            3      4         1     455
T-Coffee          609            4      7         4     594
3D T-Coffee       533            5      4         2     522


Functional residues

We found several functional residues from (link still missing). Because these residues are functionally important, they should be conserved. We compared the different alignments and checked whether these residues are conserved.

Amino acids        Methods
Residue  Position  Cobalt              ClustalW                        Muscle              T-Coffee            3D T-Coffee
R        178       conserved           conserved                       conserved           conserved           conserved
D        207       conserved (once E)  conserved (once E)              conserved (once E)  conserved (once E)  conserved (once E)
H        262       conserved           conserved                       conserved           conserved           conserved
D        322       conserved           conserved                       conserved           conserved           conserved
E        323       conserved           conserved                       conserved           conserved           conserved
W        373       conserved           conserved (once V, R)           conserved           conserved           conserved
W        392       conserved           conserved (once P, T, G)        conserved           conserved           conserved
Y        421       conserved           conserved (twice G, once -, S)  conserved           conserved           conserved (once H)
N        423       non-conserved       non-conserved                   non-conserved       non-conserved       non-conserved
W        460       conserved (once -)  conserved (once -)              conserved (once -)  conserved (once -)  conserved (once -)
E        462       conserved (once -)  conserved (once Q)              conserved (once Q)  conserved (once -)  conserved (once -)
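
Checking a functional residue in an alignment requires mapping its position in the ungapped reference sequence to an alignment column. The following Python sketch shows one way to do this. It assumes that positions are 1-based and refer to the reference sequence, and that the alignment is given as a list of equally long strings; the function names and the toy alignment are hypothetical.

```python
from collections import Counter


def column_of_position(aligned_ref, position):
    """0-based alignment column of a 1-based position in the ungapped reference."""
    seen = 0
    for col, char in enumerate(aligned_ref):
        if char != "-":
            seen += 1
            if seen == position:
                return col
    raise ValueError(f"reference has fewer than {position} residues")


def residues_at(alignment, ref_index, position):
    """Count the characters all sequences show at the reference position."""
    col = column_of_position(alignment[ref_index], position)
    return Counter(row[col] for row in alignment)


# Toy alignment; the first row plays the role of the reference sequence.
alignment = ["MR-DK", "MRADK", "MK-EK"]
print(residues_at(alignment, 0, 3))  # column of the reference residue D
```

A position can be reported as conserved with exceptions when all but a few sequences show the reference residue, which corresponds to the style of the entries in the table.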