Difference between revisions of "Sequence Alignments HEXA"

From Bioinformatikpedia

Revision as of 11:35, 23 May 2011

Sequence Alignments

Sequence Searches

  • FASTA

/bin/fasta36 seq.fasta /data/blast/nr/nr > fasta_out.txt

  • BLAST

blastall -p blastp -d /data/blast/nr/nr -i mult_seq.fasta > blast_out.txt

  • PSIBLAST

blastpgp -i seq.fasta -j <#iterations> -h <e-value threshold> -d /data/blast/nr/nr > psiblast_out.txt

  • HHSearch

For HHSearch we used the online server.

Result Statistics

We wrote a script that reports the distribution of the E-values and sequence identities as well as the sequences aligned by each method. To analyse the overlap between the methods, we drew a Venn diagram (with http://bioinfogp.cnb.csic.es/tools/venny/index.html). We compared BLAST, FASTA and PsiBlast (PsiBlast with 3 and 5 iterations and an E-Value cutoff of 10E-6).
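The E-value distribution mentioned above could be summarised roughly as follows. This is only a minimal sketch and not our original script: the function name `evalue_decades` and the example values are hypothetical, and it assumes the E-values have already been extracted from the search output as floats.

```python
import math
from collections import Counter

def evalue_decades(evalues):
    """Count hits per order of magnitude (floor of log10) of the E-value."""
    counts = Counter()
    for e in evalues:
        # An E-value of exactly 0 has no logarithm, so it gets its own bin.
        key = "0" if e == 0 else math.floor(math.log10(e))
        counts[key] += 1
    return counts

# Hypothetical E-values for illustration, not taken from our searches.
example = [0.0, 3e-120, 5e-50, 2e-50, 4e-6, 0.004]
hist = evalue_decades(example)
for key, n in hist.items():
    print(key, n)
```

Binning by decade keeps very small and moderately small E-values distinguishable, which a linear histogram would not.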


Overlap of the aligned sequences


Comparison ali.png

FASTA found more than 1000 matches, whereas the BLAST-based methods returned fewer results. FASTA therefore aligns more sequences than BLAST, which suggests that many of the FASTA hits are probably false positives. The two PsiBlast variants agreed to a large extent in their aligned sequences, and all of their sequences were also aligned by FASTA. All sequences aligned by BLAST were also found by FASTA.
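The regions of such a Venn diagram can also be computed directly from the hit lists. The following is a sketch under the assumption that each method's hits are available as a set of sequence identifiers; the function name `venn_regions` and the toy identifiers are hypothetical.

```python
def venn_regions(hit_sets):
    """Map each combination of methods to the hits found by exactly those methods."""
    regions = {}
    all_hits = set().union(*hit_sets.values())
    for hit in all_hits:
        # The set of methods that report this hit identifies its Venn region.
        found_by = frozenset(name for name, hits in hit_sets.items() if hit in hits)
        regions.setdefault(found_by, set()).add(hit)
    return regions

# Toy identifiers for illustration only.
hits = {
    "FASTA": {"a", "b", "c", "d"},
    "BLAST": {"a", "b"},
    "PsiBlast": {"a", "c"},
}
regions = venn_regions(hits)

# A subset test shows whether one method's hits are all contained in another's.
blast_in_fasta = hits["BLAST"] <= hits["FASTA"]
```

The region keyed by a single method name holds the hits unique to that method, which is the quantity of interest when comparing variants.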

We also compared different PsiBlast runs with each other: two runs with 3 iterations and E-Value cutoffs of 0.005 and 10E-6, and two runs with 5 iterations and the same two cutoffs.

Comparison psiblast.png

Psiblast 1: 3 Iterations, E-Value Cutoff: 0.005

Psiblast 2: 5 Iterations, E-Value Cutoff: 0.005

Psiblast 3: 3 Iterations, E-Value Cutoff: 10E-6

Psiblast 4: 5 Iterations, E-Value Cutoff: 10E-6

The differences between the PsiBlast variants are small. Only Psiblast 1 has 6 aligned sequences that are not shared by any of the other variants.



TODO:: Seq Id & E-Value



True positive hits

HSSP (Homology-derived Secondary Structure of Proteins) lists proteins that are homologous and have a similar secondary structure. We therefore use the HSSP alignment as a reference to check our results: for each method we determine how many of its hits also appear in HSSP. The overlapping sequences are counted as true positives.

Overlap between HSSP and FASTA
Overlap between HSSP and Blast
Overlap between HSSP and PsiBlast with 3 Iterations and 10E-6 cutoff
Overlap between HSSP and PsiBlast with 5 Iterations and 10E-6 cutoff
Overlap between HSSP and PsiBlast with 3 Iterations and 0.005 cutoff
Overlap between HSSP and PsiBlast with 5 Iterations and 0.005 cutoff
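The overlap with HSSP can be expressed as a simple set intersection. This sketch assumes the hits of a method and the HSSP entries are given as sets of comparable identifiers; the function name `true_positive_stats` and the example identifiers are hypothetical.

```python
def true_positive_stats(method_hits, reference_hits):
    """Return (number of true positives, fraction of the reference recovered,
    fraction of the method's hits confirmed by the reference)."""
    true_positives = method_hits & reference_hits
    recovered = len(true_positives) / len(reference_hits) if reference_hits else 0.0
    confirmed = len(true_positives) / len(method_hits) if method_hits else 0.0
    return len(true_positives), recovered, confirmed

# Toy identifiers for illustration only.
hssp = {"p1", "p2", "p3", "p4"}
method = {"p1", "p2", "x1", "x2", "x3"}
n_tp, recovered, confirmed = true_positive_stats(method, hssp)
```

Note that a hit missing from HSSP is not necessarily wrong; it is only not confirmed by this reference.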


Based on the results of these analyses, we assembled the sequence file for the multiple alignments.

SeqIdentifier || Seq Identity || Source
99%-90% Sequence Identity
gi|109157872|pdb|2GK1 || 99% || Blast
gi|179460|gb|AAA51827.1 || 99% || Blast
gi|194375013|dbj|BAG62619.1 || 97% || Blast
gi|296213630|ref|XP_002753354.1 || 95.1% || Fasta
gi|297296816|ref|XP_002804897.1 || 93% || Blast
89%-60% Sequence Identity
gi|149692271|ref|XP_001494361.1 || 85% || Blast
gi|187607461|ref|NP_001119815.1 || 84.3% || Fasta
gi|67514549|ref|NP_034551.2 || 84.1% || Fasta
gi|74213671|dbj|BAE35636.1 || 84% || Blast
gi|178056464|ref|NP_001116693.1 || 83.2% || Fasta
59%-40% Sequence Identity
gi|187608414|ref|NP_001120459.1 || 58.3% || Fasta
gi|213513173|ref|NP_001133930.1 || 57.0% || Fasta
gi|38492599|pdb|1O7A || 56% || Blast
gi|867691|gb|AAA68620.1 || 55% || PsiBlast, 3 Iterations, E-Value Cutoff = 0.005
gi|189239563|ref|XP_975660.2 || 47% || Blast
39%-20% Sequence Identity
gi|299139410|ref|ZP_07032585.1 || 36% || PsiBlast, 3 Iterations, E-Value Cutoff = 0.005
gi|281209747|gb|EFA83915.1 || 33% || Blast
gi|166159759|gb|ABY83272.1 || 32% || PsiBlast, 5 Iterations, E-Value Cutoff = 0.005
gi|251836937|pdb|3GH4 || 26.4% || Fasta
gi|212691177|ref|ZP_03299305.1 || 22% || PsiBlast, 3 Iterations, E-Value Cutoff = 10E-6
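Sorting hits into the identity ranges used in the table can be sketched as below. The function name `identity_bin` and the bin boundaries as lower bounds are our assumption for this illustration; identities of 90% and above go into the top range, and anything below 20% is left unassigned.

```python
# Lower bound of each range (in percent) and its label, highest first.
BINS = [
    (90, "99%-90%"),
    (60, "89%-60%"),
    (40, "59%-40%"),
    (20, "39%-20%"),
]

def identity_bin(identity):
    """Return the label of the identity range, or None if below 20%."""
    for lower, label in BINS:
        if identity >= lower:
            return label
    return None

# A few identifier/identity pairs copied from the table.
selected = {
    "296213630|ref|XP_002753354.1": 95.1,
    "74213671|dbj|BAE35636.1": 84.0,
    "189239563|ref|XP_975660.2": 47.0,
    "251836937|pdb|3GH4": 26.4,
}
grouped = {}
for seq_id, identity in selected.items():
    grouped.setdefault(identity_bin(identity), []).append(seq_id)
```

Using lower bounds instead of closed intervals avoids gaps for non-integer identities such as 89.5%.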

Multiple Alignments

  • Cobalt

Download Cobalt from ftp://ftp.ncbi.nlm.nih.gov/pub/cobalt/executables/2.0.1/ (ncbi-cobalt-2.0.1-x64-linux.tar). Unpack the archive with tar xfz ncbi-cobalt-2.0.1-x64-linux.tar and change into the unpacked cobalt directory. Then call: ./cobalt -i mult_seq.fasta -norps T > cobalt_out.aln

  • ClustalW

clustalw -infile=mult_seq.fasta > clustalW_out.aln

  • Muscle

muscle -in mult_seq.fasta -out muscle_out.aln -clw

  • T-Coffee

t_coffee -seq mult_seq.fasta

  • T-Coffee (3D)

t_coffee -seq mult_seq.fasta -mode expresso