Sequence searches and multiple sequence alignments (Phenylketonuria)
Revision as of 20:47, 6 May 2013
Summary of the task
In this task we compare the protein sequence of interest, phenylalanine hydroxylase (PAH), to other protein sequences. To this end, both sequence searches and multiple sequence alignments were run using the big80 database, which contains subsets of Swiss-Prot and PDB whose entries have a sequence similarity of 80% or less. In addition, searches against a PDB database were performed. For the sequence searches the programs BLAST, PSI-BLAST and HHblits were used. Their results were then used to create multiple sequence alignments (MSAs) with ClustalW, Muscle and T-Coffee.
Sequence searches
The following invocations were used for BLAST, PSI-BLAST and HHblits:
BLAST (Basic Local Alignment Search Tool)
blastall -p blastp -d /mnt/project/rost_db/data/big/big_80 -i /mnt/home/student/worfk/Masterpractical/Task2/PAH.fasta -o /mnt/home/student/worfk/Masterpractical/Task2/Blast/PAH_Blast_big_80.out -v 2000 -b 2000
PSI-BLAST (Position-Specific Iterated BLAST)
For PSI-BLAST (PSI-BLAST Tutorial) more than one invocation was performed. First, two iterations were run with an E-value cutoff of 0.002 and then again with a cutoff of 10E-10. The same two cutoffs were then used with ten iterations. An example invocation would be:
blastpgp -i /mnt/home/student/worfk/Masterpractical/Task2/PAH.fasta -d /mnt/project/rost_db/data/big/big_80 -j 2 -h 0.002 -v 2000 -b 2000 -o psi_blast_big_80_2_2.out -C big_80_check_2_2.chk -Q big_80_matrix_2_2.pssm
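The four PSI-BLAST runs described above (two or ten iterations, each with both E-value cutoffs) can be sketched as a small Python loop that builds the command lines. This is only an illustration: it reuses the flags and paths from the example invocation, while the output file names and the helper `psiblast_command` are hypothetical and not taken from the original runs.

```python
from itertools import product

# Paths taken from the example invocation above.
QUERY = "/mnt/home/student/worfk/Masterpractical/Task2/PAH.fasta"
DATABASE = "/mnt/project/rost_db/data/big/big_80"

ITERATIONS = (2, 10)                    # values passed to -j
E_VALUE_CUTOFFS = ("0.002", "10E-10")   # values passed to -h


def psiblast_command(iterations, cutoff):
    """Return the blastpgp argument list for one run.

    The output file names are hypothetical; they only need to be
    unique for each combination of iterations and cutoff.
    """
    tag = f"j{iterations}_h{cutoff}"
    return [
        "blastpgp",
        "-i", QUERY,
        "-d", DATABASE,
        "-j", str(iterations),
        "-h", cutoff,
        "-v", "2000",
        "-b", "2000",
        "-o", f"psi_blast_big_80_{tag}.out",
        "-C", f"big_80_check_{tag}.chk",
        "-Q", f"big_80_matrix_{tag}.pssm",
    ]


# One command per (iterations, cutoff) combination.
commands = [psiblast_command(j, h) for j, h in product(ITERATIONS, E_VALUE_CUTOFFS)]

for command in commands:
    print(" ".join(command))
```

The commands are only printed here; each list could be passed to `subprocess.run` on a machine where `blastpgp` and the database are available.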
HHblits
hhblits -i /mnt/home/student/waldraffs/Masterpraktikum/PAH.fasta -d /mnt/project/rost_db/data/hhblits/uniprot20_02Sep11 -o /mnt/home/student/waldraffs/Masterpraktikum/PAH_2000.hrr -oa3m /mnt/home/student/waldraffs/Masterpraktikum/PAH_2000.a3m -ohhm /mnt/home/student/waldraffs/Masterpraktikum/PAH_2000.hhm -Z 2000 -B 2000
To run all programs at once, one could use Maria's Perl script, as shown here:
perl /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1/run.pl ...
Comparison of the results
- Sequence identity in percent
- E-Value
- GO-terms
For the reference sequence (P00432) the following GO terms were found on QuickGO:
"GO:0050661", "GO:0009650", "GO:0020027", "GO:0004096", "GO:0046872", "GO:0042803", "GO:0009060", "GO:0005739", "GO:0051092", "GO:0004601", "GO:0051289", "GO:0051781", "GO:0004046", "GO:0042744", "GO:0006979", "GO:0005778", "GO:0008203", "GO:0005777", "GO:0043066", "GO:0051262", "GO:0016209", "GO:0005102", "GO:0014068", "GO:0032088", "GO:0016491", "GO:0006641", "GO:0016684", "GO:0019899", "GO:0000302", "GO:0055114", "GO:0020037"
To look for similarities between the reference sequence and the sequences found in the searches, those terms are counted. ...
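One way to do such a count is sketched below in Python. The original counting procedure is not given here, so this is an assumption: for every hit, count how many of its GO terms also occur in the reference list. The hit names and their annotations are hypothetical placeholders; only the reference terms are taken from the list above.

```python
# GO terms of the reference sequence, as listed above.
REFERENCE_GO_TERMS = {
    "GO:0050661", "GO:0009650", "GO:0020027", "GO:0004096", "GO:0046872",
    "GO:0042803", "GO:0009060", "GO:0005739", "GO:0051092", "GO:0004601",
    "GO:0051289", "GO:0051781", "GO:0004046", "GO:0042744", "GO:0006979",
    "GO:0005778", "GO:0008203", "GO:0005777", "GO:0043066", "GO:0051262",
    "GO:0016209", "GO:0005102", "GO:0014068", "GO:0032088", "GO:0016491",
    "GO:0006641", "GO:0016684", "GO:0019899", "GO:0000302", "GO:0055114",
    "GO:0020037",
}


def shared_go_terms(hit_terms, reference_terms=REFERENCE_GO_TERMS):
    """Count the GO terms of a hit that also annotate the reference."""
    return len(set(hit_terms) & reference_terms)


# Hypothetical hits with made-up annotations, for illustration only.
hit_annotations = {
    "hit_A": ["GO:0004096", "GO:0020037", "GO:0005515"],
    "hit_B": ["GO:0005515"],
}

shared_counts = {
    hit: shared_go_terms(terms) for hit, terms in hit_annotations.items()
}

for hit, count in shared_counts.items():
    print(f"{hit}: {count} of {len(REFERENCE_GO_TERMS)} reference GO terms shared")
```

Using sets makes the count independent of the order of the terms and of duplicates in a hit's annotation.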
Multiple sequence alignments
Datasets
For the multiple sequence alignments, four different datasets were generated with a Python script. Three groups contain ten sequences each (two of which are PDB sequences): one with more than 60% sequence identity to our target sequence (PAH), one with sequence identity between 30% and 60%, and one with less than 30% sequence identity. In the group with < 30% identity the PDB sequences have 32% identity, because these are the lowest ones found in the BLAST output against the PDB dataset. The fourth group contains 20 sequences covering the whole range of sequence identities, four of which are PDB sequences.
Group of sequences with < 30% (pdb = 32%) identity | ||
---|---|---|
Sequence identity | ID | Protein Name |
29% | C8W332 | Prephenate dehydratase OS=Desulfotomaculum acetoxidans |
29% | F4GLY0 | Phospho-2-dehydro-3-deoxyheptonate aldolase OS=Spirochaeta coccoides |
29% | B8J3L5 | Chorismate mutase OS=Desulfovibrio desulfuricans |
27% | A1ZW97 | Phenylalanine-4-hydroxylase OS=Microscilla marina |
26% | H1GP19 | Putative uncharacterized protein OS=Myroides odoratimimus |
25% | G0L2J6 | Phenylalanine 4-monooxygenase OS=Zobellia galactanivorans |
24% | L9JT09 | Phenylalanine-4-hydroxylase OS=Cystobacter fuscus |
23% | Q08RX0 | Aromatic amino acid hydroxylase, biopterin-dependent OS=Stigmatella aurantiaca |
32% | 1ltu_A (pdb) | PHENYLALANINE-4-HYDROXYLASE |
32% | 1ltz_A (pdb) | PHENYLALANINE-4-HYDROXYLASE |
Group of sequences between 30% and 60% identity | ||
---|---|---|
Sequence identity | ID | Protein Name |
48% | Q45XJ4 | Tyrosine hydroxylase OS=Branchiostoma floridae |
43% | D8U9W1 | Putative uncharacterized protein OS=Volvox carteri |
41% | A5GW18 | Prephenate dehydratase OS=Synechococcus sp. |
38% | Q1Q4R8 | Strongly similar to chorismate mutase/prephenate dehydratase OS=Candidatus Kuenenia stuttgartiensis |
37% | E5Y7E0 | Prephenate dehydratase OS=Bilophila wadsworthia |
33% | F5XW56 | Candidate phenylalanine 4-monooxygenase (Phenylalanine-4-hydroxylase) OS=Ramlibacter tataouinensis |
32% | K0JRN5 | Prephenate dehydratase OS=Saccharothrix espanaensis |
31% | A3UNV6 | Phenylalanine-4-hydroxylase OS=Vibrio splendidus |
59% | 1toh_A (pdb) | TYROSINE HYDROXYLASE |
33% | 2v28_B (pdb) | PHENYLALANINE-4-HYDROXYLASE |
Group of sequences with > 60% identity | ||
---|---|---|
Sequence identity | ID | Protein Name |
76% | Q4VBE2 | Putative uncharacterized protein mgc108157 OS=Xenopus tropicalis |
67% | K1RSS1 | Protein henna OS=Crassostrea gigas |
64% | H3HGU2 | Uncharacterized protein (Fragment) OS=Strongylocentrotus purpuratus |
63% | C3ZNL0 | Putative uncharacterized protein OS=Branchiostoma floridae |
63% | G3MQ02 | Putative uncharacterized protein OS=Amblyomma maculatum |
62% | E4XIM4 | Whole genome shotgun assembly, reference scaffold set, scaffold scaffold_41 OS=Oikopleura dioica |
61% | E9FTL2 | Putative uncharacterized protein OS=Daphnia pulex |
61% | D6WIQ7 | Putative uncharacterized protein OS=Tribolium castaneum |
95% | 1tg2_A (pdb) | Phenylalanine-4-hydroxylase |
96% | 4pah_A (pdb) | PHENYLALANINE HYDROXYLASE |
Group of sequences with different identities (0-100%) | ||
---|---|---|
Sequence identity | ID | Protein Name |
55% | K7F3H7 | Uncharacterized protein OS=Pelodiscus sinensis |
55% | F4PX76 | Phenylalanine 4-monooxygenase OS=Dictyostelium fasciculatum |
53% | K7FZR2 | Uncharacterized protein OS=Pelodiscus sinensis |
52% | A6YIC4 | Tryptophan hydroxylase OS=Platynereis dumerilii |
50% | C0KKU5 | Tyrosine hydroxylase (Fragment) OS=Octopus vulgaris |
48% | I2FKE7 | Tyrosine hydroxylase long variant (Fragment) OS=Gryllus bimaculatus |
47% | K8YPQ2 | Phenylalanine-4-hydroxylase OS=Nannochloropsis gaditana |
44% | B1B5B7 | Chorismate mutase/prephenate dehydratase OS=uncultured Termite group 1 bacterium |
36% | L8LIE0 | Prephenate dehydratase OS=Leptolyngbya sp. |
35% | B9Y6K3 | Putative uncharacterized protein OS=Holdemania filiformis |
35% | A3QCV4 | Phenylalanine 4-hydroxylase OS=Shewanella loihica |
34% | C1F688 | Phenylalanine-4-hydroxylase OS=Acidobacterium capsulatum |
34% | K5BAA8 | Prephenate dehydratase family protein OS=Mycobacterium hassiacum |
34% | G4FWD8 | Chorismate mutase OS=Rhodanobacter sp. |
33% | E2MCW6 | Phenylalanine-4-hydroxylase OS=Pseudomonas syringae pv. tomato |
33% | J2JTB5 | Phenylalanine-4-hydroxylase, monomeric form OS=Rhizobium sp. |
95% | 1tg2_A (pdb) | Phenylalanine-4-hydroxylase |
65% | 3hfb_A (pdb) | Tryptophan 5-hydroxylase 1 |
60% | 2xsn_D (pdb) | TYROSINE 3-MONOOXYGENASE |
32% | 2qmw_B (pdb) | Prephenate dehydratase |
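The original Python script used to build these datasets is not shown. As a minimal sketch of the grouping step, the following assigns hits to the three identity groups. The function name `identity_group` and the treatment of the boundaries (exactly 30% and 60% fall into the middle group) are assumptions, and the exception made for the PDB sequences in the < 30% group is not modelled. The example IDs and identities are taken from the tables above.

```python
def identity_group(identity_percent):
    """Map a sequence identity (in percent) to one of the three groups.

    Assumption: values of exactly 30 and 60 belong to the middle group.
    """
    if identity_percent > 60:
        return "> 60%"
    if identity_percent >= 30:
        return "30-60%"
    return "< 30%"


# A few (ID, identity) pairs from the tables above.
hits = [
    ("Q4VBE2", 76),
    ("E9FTL2", 61),
    ("Q45XJ4", 48),
    ("A3UNV6", 31),
    ("C8W332", 29),
]

groups = {}
for sequence_id, identity in hits:
    groups.setdefault(identity_group(identity), []).append(sequence_id)

for name, members in groups.items():
    print(name, members)
```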
ClustalW
clustalw -align in.fasta
Muscle
muscle -in in.fasta -clw -out out.aln
T-Coffee
t_coffee in.fasta
3D-Coffee
...
Discussion of the multiple sequence alignments and the used tools
...