Sequence Alignments HEXA
Sequence Searches
Use of database searching tools
- FASTA
/bin/fasta36 seq.fasta /data/blast/nr/nr > fasta_out.txt
- BLAST
blastall -p blastp -d /data/blast/nr/nr -i mult_seq.fasta > blast_out.txt
- PSIBLAST
blastpgp -i seq.fasta -j <#iterations> -h <e-value threshold> -d /data/blast/nr/nr > psiblast_out.txt
- HHSearch
For HHSearch we used the online server.
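The PSI-BLAST call above leaves the iteration count and the E-value threshold as placeholders. A minimal sketch of filling them in for several settings is given here, assuming the legacy `blastpgp` binary and the flags from that call (`-i`, `-j`, `-h`, `-d`). The helper name `psiblast_command` and the `VARIANTS` table are illustrative only; the four settings are the ones compared further down this page, and the cutoffs are passed on exactly as written there. The sketch only builds the argument lists and does not run anything.

```python
def psiblast_command(query, database, iterations, evalue_cutoff):
    """Build the argument list for one legacy blastpgp call.

    Flags mirror the call shown in the text: -i query, -j iterations,
    -h E-value threshold, -d database. Output goes to stdout, so the
    caller still has to redirect it to a file.
    """
    return [
        "blastpgp",
        "-i", query,
        "-j", str(iterations),
        "-h", str(evalue_cutoff),
        "-d", database,
    ]


# Hypothetical lookup table: (iterations, cutoff) per variant.
# Cutoffs are kept as strings so they reach the tool unchanged.
VARIANTS = {
    "Psiblast 1": (3, "0.005"),
    "Psiblast 2": (5, "0.005"),
    "Psiblast 3": (3, "10E-6"),
    "Psiblast 4": (5, "10E-6"),
}

commands = {
    name: psiblast_command("seq.fasta", "/data/blast/nr/nr", iterations, cutoff)
    for name, (iterations, cutoff) in VARIANTS.items()
}

for name, argv in commands.items():
    print(name + ":", " ".join(argv))
```

Each list could be passed to `subprocess.run` with `stdout` pointing at an output file, which avoids quoting problems that come with building one shell string.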
Statistical results
We wrote a script that shows the distributions of the E-values and sequence identities as well as the different aligned sequences. To analyse the overlap between the methods, we drew a Venn diagram (with http://bioinfogp.cnb.csic.es/tools/venny/index.html). We compared BLAST, FASTA and PSI-BLAST (PSI-BLAST with 3 and 5 iterations and an E-value cutoff of 10E-6).
Overlap of the aligned sequences
FASTA found more than 1000 matches, whereas the BLAST-based methods returned fewer results. FASTA therefore aligns more sequences than BLAST, which suggests that many of the additional FASTA hits are false positives. The two PSI-BLAST variants agreed closely in their aligned sequences, and all of their sequences were also aligned by FASTA. All sequences aligned by BLAST were also found by FASTA.
We also compared different PSI-BLAST runs: 3 iterations with E-value cutoffs of 0.005 and 10E-6, and 5 iterations with the same two cutoffs.
Psiblast 1: 3 Iterations, E-Value Cutoff: 0.005
Psiblast 2: 5 Iterations, E-Value Cutoff: 0.005
Psiblast 3: 3 Iterations, E-Value Cutoff: 10E-6
Psiblast 4: 5 Iterations, E-Value Cutoff: 10E-6
The differences between the PSI-BLAST variants are small. Only Psiblast 1 has 6 aligned sequences that are not shared with the other variants.
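The overlaps read off the Venn diagrams can also be computed directly from the hit lists. The following is a minimal sketch, assuming each method's hits are already available as a set of sequence identifiers; the function name `overlap_report` and the toy identifiers are hypothetical and are not taken from our search results.

```python
from itertools import combinations


def overlap_report(hit_sets):
    """Return pairwise shared-hit counts and the hits unique to each method.

    hit_sets maps a method name to the set of sequence identifiers it found.
    """
    shared = {
        (a, b): len(hit_sets[a] & hit_sets[b])
        for a, b in combinations(sorted(hit_sets), 2)
    }
    unique = {}
    for name, hits in hit_sets.items():
        # Union of everything found by the other methods.
        others = set().union(*(h for n, h in hit_sets.items() if n != name))
        unique[name] = hits - others
    return shared, unique


# Toy identifiers for illustration only.
example = {
    "BLAST": {"seqA", "seqB"},
    "FASTA": {"seqA", "seqB", "seqC", "seqD"},
    "PSI-BLAST": {"seqA", "seqC"},
}
shared, unique = overlap_report(example)

for pair, count in shared.items():
    print(pair, count)
for name, hits in unique.items():
    print(name, "unique:", sorted(hits))
```

A method whose `unique` set is empty is fully contained in the union of the others, which is the situation described above for BLAST relative to FASTA.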
Sequence identity
E-Value
True positive hits
HSSP (Homology-derived Secondary Structure of Proteins) lists proteins that are homologous and have a similar secondary structure. We therefore used the HSSP alignment to check our results by determining the overlap between HSSP and each of the other methods. The overlapping sequences are counted as true positives.
From the results of these analyses, we created our input file for the multiple alignments.
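The true-positive check amounts to intersecting each method's hits with the HSSP entries. A short sketch under that reading follows, assuming identifiers have already been mapped to a common namespace; the function name `true_positive_summary` and the toy identifiers are hypothetical.

```python
def true_positive_summary(method_hits, reference_ids):
    """Count, per method, how many hits also appear in the reference set."""
    summary = {}
    for method, hits in method_hits.items():
        confirmed = hits & reference_ids
        summary[method] = {
            "true_positives": len(confirmed),
            "not_in_reference": len(hits - reference_ids),
            # Fraction of this method's hits that the reference confirms.
            "fraction_confirmed": len(confirmed) / len(hits) if hits else 0.0,
        }
    return summary


# Toy identifiers for illustration only.
hssp_ids = {"seqA", "seqB", "seqC"}
method_hits = {
    "BLAST": {"seqA", "seqB"},
    "FASTA": {"seqA", "seqB", "seqC", "seqD"},
}
summary = true_positive_summary(method_hits, hssp_ids)
for method, stats in summary.items():
    print(method, stats)
```

Hits outside the reference set are not necessarily wrong; they are only unconfirmed by HSSP.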
Sequence identifier | Sequence identity | Source |
---|---|---|
99%-90% Sequence Identity | ||
109157872|pdb|2GK1 | 99% | Blast |
179460|gb|AAA51827.1 | 99% | Blast |
194375013|dbj|BAG62619.1 | 97% | Blast |
296213630|ref|XP_002753354.1 | 95.1% | Fasta |
297296816|ref|XP_002804897.1 | 93% | Blast |
89%-60% Sequence Identity | ||
149692271|ref|XP_001494361.1 | 85% | Blast |
187607461|ref|NP_001119815.1 | 84.3% | Fasta |
67514549|ref|NP_034551.2 | 84.1% | Fasta |
74213671|dbj|BAE35636.1 | 84% | Blast |
178056464|ref|NP_001116693.1 | 83.2% | Fasta |
59%-40% Sequence Identity | ||
187608414|ref|NP_001120459.1 | 58.3% | Fasta |
213513173|ref|NP_001133930.1 | 57.0% | Fasta |
38492599|pdb|1O7A | 56% | Blast |
867691|gb|AAA68620.1 | 55% | PsiBlast, 3 Iterations, E-Value Cutoff = 0.005 |
189239563|ref|XP_975660.2 | 47% | Blast |
39%-20% Sequence Identity | ||
299139410|ref|ZP_07032585.1 | 36% | PsiBlast, 3 Iterations, E-Value Cutoff = 0.005 |
281209747|gb|EFA83915.1 | 33% | Blast |
166159759|gb|ABY83272.1 | 32% | PsiBlast, 5 Iterations, E-Value Cutoff = 0.005 |
251836937|pdb|3GH4 | 26.4% | Fasta |
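Sorting hits into the four identity ranges of the table can be sketched as below. This assumes the percent identities have already been parsed from the search output; the names `identity_bin` and `group_by_bin` are hypothetical, and the four example rows are copied from the table above.

```python
from collections import defaultdict

# Lower bound (inclusive) and label of each identity range used in the table.
BINS = [
    (90.0, "99%-90%"),
    (60.0, "89%-60%"),
    (40.0, "59%-40%"),
    (20.0, "39%-20%"),
]


def identity_bin(percent_identity):
    """Return the range label for an identity value, or None below 20%.

    Values of 90% and above (including 100%) fall into the top range.
    """
    for lower, label in BINS:
        if percent_identity >= lower:
            return label
    return None


def group_by_bin(hits):
    """Group (identifier, identity) pairs by identity range."""
    grouped = defaultdict(list)
    for seq_id, identity in hits:
        label = identity_bin(identity)
        if label is not None:
            grouped[label].append(seq_id)
    return dict(grouped)


hits = [
    ("109157872|pdb|2GK1", 99.0),
    ("187607461|ref|NP_001119815.1", 84.3),
    ("38492599|pdb|1O7A", 56.0),
    ("251836937|pdb|3GH4", 26.4),
]
grouped = group_by_bin(hits)
for label, ids in grouped.items():
    print(label, ids)
```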
Multiple Alignments
- Cobalt
Download Cobalt from ftp://ftp.ncbi.nlm.nih.gov/pub/cobalt/executables/2.0.1/ (ncbi-cobalt-2.0.1-x64-linux.tar). Extract the archive with tar xfz ncbi-cobalt-2.0.1-x64-linux.tar and change into the extracted cobalt directory. Call: ./cobalt -i mult_seq.fasta -norps T > cobalt_out.aln
- ClustalW
clustalw -infile=mult_seq.fasta > clustalW_out.aln
- Muscle
muscle -in mult_seq.fasta -out muscle_out.aln -clw
- T-Coffee
t_coffee -seq mult_seq.fasta
- T-Coffee (3D)
t_coffee -seq mult_seq.fasta -mode expresso
Conservation
Alignment method | Gaps | 100% cons | >90% cons | >80% cons | >70% cons | >60% cons | >50% cons | >40% cons |
---|---|---|---|---|---|---|---|---|
Cobalt | 384 | 31 | 68 | 75 | 87 | 76 | 0 | 65 |
ClustalW | 346 | 29 | 61 | 81 | 84 | 71 | 0 | 65 |
Muscle | 463 | 32 | 70 | 74 | 84 | 76 | 0 | 74 |
T-Coffee | 609 | 31 | 67 | 74 | 90 | 73 | 0 | 70 |
3D T-Coffee | 533 | 32 | 64 | 77 | 89 | 74 | 0 | 70 |
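One way to obtain per-column conservation values from an alignment is sketched below. The definition used here is an assumption and may differ from the one behind the table: conservation of a column is the fraction of all sequences that carry the most common residue, with gaps counted as non-matching. The function names and the toy alignment are hypothetical.

```python
from collections import Counter


def column_conservation(sequences):
    """Return, per column, the fraction of sequences sharing the top residue.

    All sequences must be aligned rows of equal length; '-' marks a gap
    and never counts as the conserved residue.
    """
    total = len(sequences)
    scores = []
    for column in zip(*sequences):
        residues = [char for char in column if char != "-"]
        top_count = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top_count / total)
    return scores


def count_columns_above(scores, threshold):
    """Count columns whose conservation is strictly above the threshold."""
    return sum(1 for score in scores if score > threshold)


# Toy alignment for illustration only.
alignment = [
    "MKV-A",
    "MKVLA",
    "MRV-A",
    "MKI-G",
]
scores = column_conservation(alignment)
fully_conserved = sum(1 for score in scores if score == 1.0)

print(scores)
print("100% conserved columns:", fully_conserved)
print("columns above 70%:", count_columns_above(scores, 0.7))
```

Note that this sketch counts cumulatively (a fully conserved column is also above 70%), so its counts are not directly comparable to a table that reports disjoint ranges.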
Gaps
Alignment method | #Gaps in sum | Helix | Extended | Coil | No secondary structure |
---|---|---|---|---|---|
Cobalt | 384 | 5 | 5 | 1 | 373 |
ClustalW | 346 | 2 | 0 | 0 | 244 |
Muscle | 463 | 3 | 4 | 1 | 455 |
T-Coffee | 609 | 4 | 7 | 4 | 594 |
3D T-Coffee | 533 | 5 | 4 | 2 | 522 |
Functional residues
We found several functional residues from (link still missing). Because these residues are functionally important, they should be conserved. We compared the different alignments and checked whether these residues are conserved in each of them.
Residue | Position | Cobalt | ClustalW | Muscle | T-Coffee | 3D T-Coffee |
---|---|---|---|---|---|---|
R | 178 | conserved | conserved | conserved | conserved | conserved |
D | 207 | conserved (once E) | conserved (once E) | conserved (once E) | conserved (once E) | conserved (once E) |
H | 262 | conserved | conserved | conserved | conserved | conserved |
D | 322 | conserved | conserved | conserved | conserved | conserved |
E | 323 | conserved | conserved | conserved | conserved | conserved |
W | 373 | conserved | conserved (once V, R) | conserved | conserved | conserved |
W | 392 | conserved | conserved (once P, T, G) | conserved | conserved | conserved |
Y | 421 | conserved | conserved (twice G, once -, S) | conserved | conserved | conserved (once H) |
N | 423 | non-conserved | non-conserved | non-conserved | non-conserved | non-conserved |
W | 460 | conserved (once -) | conserved (once -) | conserved (once -) | conserved (once -) | conserved (once -) |
E | 462 | conserved (once -) | conserved (once Q) | conserved (once Q) | conserved (once -) | conserved (once -) |
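Checking a functional residue in an alignment requires mapping its position in the ungapped reference sequence to an alignment column first. A minimal sketch of that lookup follows, assuming the alignment is available as equal-length rows with '-' for gaps; the names `alignment_column` and `residue_profile` and the toy alignment are hypothetical.

```python
from collections import Counter


def alignment_column(reference_row, position):
    """Map a 1-based position in the ungapped reference to a 0-based column."""
    seen = 0
    for column, char in enumerate(reference_row):
        if char != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError("position lies beyond the end of the reference sequence")


def residue_profile(rows, reference_index, position):
    """Count which characters all rows carry at a reference position."""
    column = alignment_column(rows[reference_index], position)
    return Counter(row[column] for row in rows)


# Toy alignment for illustration only; row 0 plays the reference.
rows = [
    "M-KRW",
    "MAKRW",
    "M-KEW",
    "M-KR-",
]
profile_r = residue_profile(rows, 0, 3)  # third reference residue (R)
profile_w = residue_profile(rows, 0, 4)  # fourth reference residue (W)

print(dict(profile_r))
print(dict(profile_w))
```

A profile with one dominant residue and a single deviating character corresponds to table entries such as "conserved (once E)" or "conserved (once -)".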