Difference between revisions of "Sequence searches and multiple sequence alignments (Phenylketonuria)"
Revision as of 22:04, 6 May 2013
Summary of the task
In this task we compare the protein sequence of interest, phenylalanine hydroxylase (PAH), to other protein sequences. Both the sequence searches and the multiple sequence alignments use the big80 database, which contains subsets of Swiss-Prot and PDB whose entries share a sequence similarity of 80% or less. In addition, searches were run against a PDB database. The programs BLAST, PSI-BLAST and HHblits were used for the sequence searches, and their results served as input for multiple sequence alignments (MSAs) created with ClustalW, Muscle and T-Coffee.
Sequence searches
The following invocations were used for BLAST, PSI-BLAST and HHblits:
BLAST (Basic Local Alignment Search Tool)
blastall -p blastp -d /mnt/project/rost_db/data/big/big_80 -i /mnt/home/student/worfk/Masterpractical/Task2/PAH.fasta -o /mnt/home/student/worfk/Masterpractical/Task2/Blast/PAH_Blast_big_80.out -v 2000 -b 2000
PSI-BLAST (Position-Specific Iterated BLAST)
For PSI-BLAST (PSI-BLAST Tutorial), more than one invocation was performed. Two iterations were run first with an E-value cutoff of 0.002 and then again with a cutoff of 10E-10; the same two cutoffs were then used with ten iterations. An example invocation:
blastpgp -i /mnt/home/student/worfk/Masterpractical/Task2/PAH.fasta -d /mnt/project/rost_db/data/big/big_80 -j 2 -h 0.002 -v 2000 -b 2000 -o psi_blast_big_80_2_2.out -C big_80_check_2_2.chk -Q big_80_matrix_2_2.pssm
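The four PSI-BLAST runs described here (two and ten iterations, each with both E-value cutoffs) differ only in the values passed to `-j` and `-h`. A minimal Python sketch that builds the four command lines could look like this. The query and database paths are taken from the example invocation; the helper name `psiblast_commands` and the output file naming scheme are assumptions made for illustration, not necessarily the names that were actually used.

```python
from itertools import product

# Paths as given in the example invocation.
QUERY = "/mnt/home/student/worfk/Masterpractical/Task2/PAH.fasta"
DATABASE = "/mnt/project/rost_db/data/big/big_80"


def psiblast_commands(iterations=(2, 10), cutoffs=("0.002", "10E-10")):
    """Return one blastpgp argument list per (iterations, cutoff) pair."""
    commands = []
    for n_iter, cutoff in product(iterations, cutoffs):
        # Hypothetical naming scheme: encode both parameters in the file name.
        tag = f"j{n_iter}_h{cutoff}"
        commands.append([
            "blastpgp",
            "-i", QUERY,
            "-d", DATABASE,
            "-j", str(n_iter),  # number of iterations
            "-h", cutoff,       # E-value cutoff for inclusion in the profile
            "-v", "2000",
            "-b", "2000",
            "-o", f"psi_blast_big_80_{tag}.out",
        ])
    return commands


if __name__ == "__main__":
    # Only print the commands; running them requires the cluster setup.
    for cmd in psiblast_commands():
        print(" ".join(cmd))
```

Each argument list can be passed to `subprocess.run` on a machine where `blastpgp` and the database are available.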
HHblits
hhblits -i /mnt/home/student/waldraffs/Masterpraktikum/PAH.fasta -d /mnt/project/rost_db/data/hhblits/uniprot20_02Sep11 -o /mnt/home/student/waldraffs/Masterpraktikum/PAH_2000.hrr -oa3m /mnt/home/student/waldraffs/Masterpraktikum/PAH_2000.a3m -ohhm /mnt/home/student/waldraffs/Masterpraktikum/PAH_2000.hhm -Z 2000 -B 2000
To run all programs at once, one can use Maria's Perl script, as shown here:
perl /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1/run.pl ...
Comparison of the results
Sequence identity in percent
E-Value
GO-terms
For the reference sequence (P00439), the following GO terms were found on QuickGO (http://www.ebi.ac.uk/QuickGO/GAnnotation): GO:0006559 GO:0005506 GO:0009072 GO:0004505 GO:0008652 GO:0044281 GO:0003824 GO:0046872 GO:0005829 GO:0034641 GO:0004497 GO:0008152 GO:0016491 GO:0042136 GO:0016714 GO:0016597 GO:0042423 GO:0055114
To look for similarities between the reference sequence and the sequences found in the searches, we counted how often each of these terms occurs among the hits of each search.
...
"..." | ||||||
---|---|---|---|---|---|---|
GO | BLAST | PSIBLAST 2 x 0.002 | PSIBLAST 2 x 10e-10 | PSIBLAST 10 x 0.002 | PSIBLAST 10 x 10e-10 | HHblits |
GO:0006559 | 165 | 165 | 165 | 165 | 165 | 0 |
GO:0005506 | 449 | 448 | 450 | 434 | 431 | 0 |
GO:0009072 | 449 | 448 | 450 | 434 | 431 | 0 |
GO:0004505 | 207 | 207 | 207 | 205 | 205 | 0 |
GO:0008652 | 6 | 10 | 11 | 12 | 12 | 0 |
GO:0044281 | 0 | 0 | 0 | 0 | 0 | 0 |
GO:0003824 | 12 | 30 | 28 | 33 | 32 | 0 |
GO:0046872 | 7 | 6 | 6 | 6 | 6 | 0 |
GO:0005829 | 0 | 1 | 1 | 2 | 2 | 0 |
GO:0034641 | 0 | 0 | 0 | 0 | 0 | 0 |
GO:0004497 | 449 | 448 | 450 | 434 | 431 | 0 |
GO:0008152 | 554 | 808 | 818 | 712 | 713 | 0 |
GO:0016491 | 158 | 157 | 157 | 154 | 154 | 0 |
GO:0042136 | 2 | 2 | 2 | 2 | 2 | 0 |
GO:0016714 | 445 | 443 | 444 | 434 | 431 | 0 |
GO:0016597 | 554 | 808 | 818 | 712 | 713 | 0 |
GO:0042423 | 15 | 25 | 15 | 15 | 15 | 0 |
GO:0055114 | 458 | 457 | 459 | 443 | 438 | 0 |
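Counting of this kind can be sketched in a few lines of Python. The sketch assumes that the GO annotations of the hits have already been collected into a dictionary mapping each hit ID to its GO terms; how that mapping is obtained is not shown. The function name `count_reference_terms` and the toy input are hypothetical and do not reproduce the numbers in the table.

```python
from collections import Counter


def count_reference_terms(hit_annotations, reference_terms):
    """Count, for each reference GO term, how many hits carry it.

    hit_annotations: dict mapping a hit ID to an iterable of GO term IDs.
    reference_terms: GO term IDs annotated to the reference sequence.
    """
    reference = set(reference_terms)
    counts = Counter()
    for terms in hit_annotations.values():
        # set() ensures a term is counted at most once per hit.
        counts.update(set(terms) & reference)
    # Report every reference term, including those that no hit carries.
    return {term: counts[term] for term in reference_terms}


if __name__ == "__main__":
    # Made-up toy annotations, for illustration only.
    toy_hits = {
        "hit_1": ["GO:0006559", "GO:0005506"],
        "hit_2": ["GO:0005506", "GO:0005506"],
        "hit_3": [],
    }
    print(count_reference_terms(toy_hits, ["GO:0006559", "GO:0005506", "GO:0004505"]))
```

Running the function once per search result (BLAST, each PSI-BLAST setting, HHblits) gives one column of such a table per search.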
Multiple sequence alignments
Datasets
For the multiple sequence alignments, four datasets were generated with a Python script. Three of them contain ten sequences each, two of which are PDB sequences: one group with more than 60% sequence identity to our target sequence (PAH), one with between 30% and 60% identity, and one with less than 30% identity. In the < 30% group the two PDB sequences have 32% identity, because these are the lowest values found in the BLAST output against the PDB dataset. The fourth group contains 20 sequences spanning the whole range of sequence identities, four of which are PDB sequences. For comparison, the reference sequence of phenylalanine hydroxylase (PAH, P00439) was added to all four groups.
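The grouping by sequence identity can be illustrated with a short Python sketch. This is not the script that was actually used. The function names are hypothetical, and the treatment of the boundaries (exactly 30% and exactly 60% fall into the middle group) is an assumption. The sketch also ignores the manual exception for the two PDB sequences with 32% identity.

```python
def identity_group(identity):
    """Map a sequence identity in percent to one of the three groups."""
    if identity > 60:
        return "high"    # more than 60% identity
    if identity >= 30:
        return "medium"  # between 30% and 60% identity (assumed inclusive)
    return "low"         # less than 30% identity


def split_by_identity(hits):
    """Split (sequence ID, identity in percent) pairs into the three groups."""
    groups = {"high": [], "medium": [], "low": []}
    for seq_id, identity in hits:
        groups[identity_group(identity)].append(seq_id)
    return groups


if __name__ == "__main__":
    # One example per group, taken from the tables in this section.
    example = [("Q4VBE2", 76), ("Q45XJ4", 48), ("C8W332", 29)]
    print(split_by_identity(example))
```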
Group of sequences with < 30% (pdb = 32%) identity | ||
---|---|---|
Sequence identity | ID | Protein Name |
29% | C8W332 | Prephenate dehydratase OS=Desulfotomaculum acetoxidans |
29% | F4GLY0 | Phospho-2-dehydro-3-deoxyheptonate aldolase OS=Spirochaeta coccoides |
29% | B8J3L5 | Chorismate mutase OS=Desulfovibrio desulfuricans |
27% | A1ZW97 | Phenylalanine-4-hydroxylase OS=Microscilla marina |
26% | H1GP19 | Putative uncharacterized protein OS=Myroides odoratimimus |
25% | G0L2J6 | Phenylalanine 4-monooxygenase OS=Zobellia galactanivorans |
24% | L9JT09 | Phenylalanine-4-hydroxylase OS=Cystobacter fuscus |
23% | Q08RX0 | Aromatic amino acid hydroxylase, biopterin-dependent OS=Stigmatella aurantiaca |
32% | 1ltu_A (pdb) | PHENYLALANINE-4-HYDROXYLASE |
32% | 1ltz_A (pdb) | PHENYLALANINE-4-HYDROXYLASE |
Group of sequences between 30% and 60% identity | ||
---|---|---|
Sequence identity | ID | Protein Name |
48% | Q45XJ4 | Tyrosine hydroxylase OS=Branchiostoma floridae |
43% | D8U9W1 | Putative uncharacterized protein OS=Volvox carteri |
41% | A5GW18 | Prephenate dehydratase OS=Synechococcus sp. |
38% | Q1Q4R8 | Strongly similar to chorismate mutase/prephenate dehydratase OS=Candidatus Kuenenia stuttgartiensis |
37% | E5Y7E0 | Prephenate dehydratase OS=Bilophila wadsworthia |
33% | F5XW56 | Candidate phenylalanine 4-monooxygenase (Phenylalanine-4-hydroxylase) OS=Ramlibacter tataouinensis |
32% | K0JRN5 | Prephenate dehydratase OS=Saccharothrix espanaensis |
31% | A3UNV6 | Phenylalanine-4-hydroxylase OS=Vibrio splendidus |
59% | 1toh_A (pdb) | TYROSINE HYDROXYLASE |
33% | 2v28_B (pdb) | PHENYLALANINE-4-HYDROXYLASE |
Group of sequences with > 60% identity | ||
---|---|---|
Sequence identity | ID | Protein Name |
76% | Q4VBE2 | Putative uncharacterized protein mgc108157 OS=Xenopus tropicalis |
67% | K1RSS1 | Protein henna OS=Crassostrea gigas |
64% | H3HGU2 | Uncharacterized protein (Fragment) OS=Strongylocentrotus purpuratus |
63% | C3ZNL0 | Putative uncharacterized protein OS=Branchiostoma floridae |
63% | G3MQ02 | Putative uncharacterized protein OS=Amblyomma maculatum |
62% | E4XIM4 | Whole genome shotgun assembly, reference scaffold set, scaffold scaffold_41 OS=Oikopleura dioica |
61% | E9FTL2 | Putative uncharacterized protein OS=Daphnia pulex |
61% | D6WIQ7 | Putative uncharacterized protein OS=Tribolium castaneum |
95% | 1tg2_A (pdb) | Phenylalanine-4-hydroxylase |
96% | 4pah_A (pdb) | PHENYLALANINE HYDROXYLASE |
Group of sequences with different identities (0-100%) | ||
---|---|---|
Sequence identity | ID | Protein Name |
55% | K7F3H7 | Uncharacterized protein OS=Pelodiscus sinensis |
55% | F4PX76 | Phenylalanine 4-monooxygenase OS=Dictyostelium fasciculatum |
53% | K7FZR2 | Uncharacterized protein OS=Pelodiscus sinensis |
52% | A6YIC4 | Tryptophan hydroxylase OS=Platynereis dumerilii |
50% | C0KKU5 | Tyrosine hydroxylase (Fragment) OS=Octopus vulgaris |
48% | I2FKE7 | Tyrosine hydroxylase long variant (Fragment) OS=Gryllus bimaculatus |
47% | K8YPQ2 | Phenylalanine-4-hydroxylase OS=Nannochloropsis gaditana |
44% | B1B5B7 | Chorismate mutase/prephenate dehydratase OS=uncultured Termite group 1 bacterium |
36% | L8LIE0 | Prephenate dehydratase OS=Leptolyngbya sp. |
35% | B9Y6K3 | Putative uncharacterized protein OS=Holdemania filiformis |
35% | A3QCV4 | Phenylalanine 4-hydroxylase OS=Shewanella loihica |
34% | C1F688 | Phenylalanine-4-hydroxylase OS=Acidobacterium capsulatum |
34% | K5BAA8 | Prephenate dehydratase family protein OS=Mycobacterium hassiacum |
34% | G4FWD8 | Chorismate mutase OS=Rhodanobacter sp. |
33% | E2MCW6 | Phenylalanine-4-hydroxylase OS=Pseudomonas syringae pv. tomato |
33% | J2JTB5 | Phenylalanine-4-hydroxylase, monomeric form OS=Rhizobium sp. |
95% | 1tg2_A (pdb) | Phenylalanine-4-hydroxylase |
65% | 3hfb_A (pdb) | Tryptophan 5-hydroxylase 1 |
60% | 2xsn_D (pdb) | TYROSINE 3-MONOOXYGENASE |
32% | 2qmw_B (pdb) | Prephenate dehydratase |
ClustalW
clustalw -align -infile=in.fasta -outfile=out.aln
Muscle
muscle -in in.fasta -clw -out out.aln
T-Coffee
t_coffee in.fasta
3D-Coffee/Expresso
t_coffee -seq in.fasta -method TMalign_pair,sap_pair -template_file EXPRESSO
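Since the same four aligner calls are repeated for each of the four datasets, the command lines can be generated rather than typed by hand. The following Python sketch only builds the argument lists shown above. The function name `msa_commands`, the output naming scheme and the dataset file names are assumptions for illustration.

```python
def msa_commands(fasta, prefix):
    """Return the aligner invocations from this section as argument lists.

    fasta:  input FASTA file of one dataset.
    prefix: hypothetical prefix for the output alignment files.
    """
    return {
        "clustalw": ["clustalw", "-align", f"-infile={fasta}",
                     f"-outfile={prefix}_clustalw.aln"],
        "muscle": ["muscle", "-in", fasta, "-clw",
                   "-out", f"{prefix}_muscle.aln"],
        # T-Coffee is called without an explicit output option, as above.
        "t_coffee": ["t_coffee", fasta],
        "expresso": ["t_coffee", "-seq", fasta,
                     "-method", "TMalign_pair,sap_pair",
                     "-template_file", "EXPRESSO"],
    }


if __name__ == "__main__":
    # Hypothetical dataset file names, one per identity group.
    for name in ("low", "medium", "high", "mixed"):
        for tool, cmd in msa_commands(f"{name}.fasta", name).items():
            print(tool, ":", " ".join(cmd))
```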
Discussion of the multiple sequence alignments and the tools used
...