Sequence searches and multiple sequence alignments (Phenylketonuria)

Summary of the task

In this task we compare the protein of interest, phenylalanine hydroxylase (PAH), to other protein sequences. Both the sequence searches and the multiple sequence alignments are based on the big80 database, which contains subsets of Swiss-Prot and PDB in which the entries share a sequence similarity of 80% or less. In addition, searches against a PDB database were performed. For the sequence searches, the programs BLAST, PSI-BLAST and HHblits were used. Their results were then used to create multiple sequence alignments (MSAs) with ClustalW, Muscle and T-Coffee.

Sequence searches

The following invocations were used for BLAST, PSI-BLAST and HHblits:

BLAST (Basic Local Alignment Search Tool)

blastall -p blastp -d /mnt/project/rost_db/data/big/big_80 \
  -i /mnt/home/student/worfk/Masterpractical/Task2/PAH.fasta \
  -o /mnt/home/student/worfk/Masterpractical/Task2/Blast/PAH_Blast_big_80.out \
  -v 2000 -b 2000

PSI-BLAST (Position-Specific Iterated BLAST)

For PSI-BLAST (PSI-BLAST Tutorial), more than one invocation was performed. First, two iterations were run with an E-value cutoff of 0.002 and then again with a cutoff of 10E-10. The same was done for ten iterations. An example invocation is:

blastpgp -i /mnt/home/student/worfk/Masterpractical/Task2/PAH.fasta \
  -d /mnt/project/rost_db/data/big/big_80 \
  -j 2 -h 0.002 -v 2000 -b 2000 \
  -o psi_blast_big_80_2_2.out \
  -C big_80_check_2_2.chk -Q big_80_matrix_2_2.pssm
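
The four PSI-BLAST runs described above (two or ten iterations, each with both E-value cutoffs) can be generated from one template. The following is only a minimal sketch: it builds the command lines with the flags from the example invocation and prints them without running anything. The file-name labels for the cutoffs (in particular "10" for the 10E-10 runs) are assumptions for illustration, not names taken from the actual runs.

```python
from itertools import product

# Paths copied from the example invocation above.
QUERY = "/mnt/home/student/worfk/Masterpractical/Task2/PAH.fasta"
DATABASE = "/mnt/project/rost_db/data/big/big_80"

# Number of iterations (-j) used in the task.
ITERATIONS = (2, 10)

# E-value cutoffs (-h) as written in the text, mapped to a short label
# used in the output file names. The labels are hypothetical.
CUTOFFS = {"0.002": "2", "10E-10": "10"}


def build_psiblast_command(iterations, cutoff, label):
    """Return the blastpgp argument list for one parameter combination."""
    tag = f"{iterations}_{label}"
    return [
        "blastpgp",
        "-i", QUERY,
        "-d", DATABASE,
        "-j", str(iterations),
        "-h", cutoff,
        "-v", "2000",
        "-b", "2000",
        "-o", f"psi_blast_big_80_{tag}.out",
        "-C", f"big_80_check_{tag}.chk",
        "-Q", f"big_80_matrix_{tag}.pssm",
    ]


# One command per (iterations, cutoff) combination: four in total.
commands = [
    build_psiblast_command(n, cutoff, label)
    for n, (cutoff, label) in product(ITERATIONS, CUTOFFS.items())
]

for command in commands:
    print(" ".join(command))
```

Each list could be passed to subprocess.run on a machine where blastpgp and the database are available.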

HHblits

hhblits -i /mnt/home/student/waldraffs/Masterpraktikum/PAH.fasta \
  -d /mnt/project/rost_db/data/hhblits/uniprot20_02Sep11 \
  -o /mnt/home/student/waldraffs/Masterpraktikum/PAH_2000.hrr \
  -oa3m /mnt/home/student/waldraffs/Masterpraktikum/PAH_2000.a3m \
  -ohhm /mnt/home/student/waldraffs/Masterpraktikum/PAH_2000.hhm \
  -Z 2000 -B 2000

To run all programs at once, one can use Maria's Perl script, as shown here:

perl /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1/run.pl ...

Comparison of the results

Sequence identity in percent

E-Value

GO-terms

For the reference sequence (P00432), the following GO terms were found on QuickGO:

GO:0050661, GO:0009650, GO:0020027, GO:0004096, GO:0046872, GO:0042803, GO:0009060, GO:0005739,
GO:0051092, GO:0004601, GO:0051289, GO:0051781, GO:0004046, GO:0042744, GO:0006979, GO:0005778,
GO:0008203, GO:0005777, GO:0043066, GO:0051262, GO:0016209, GO:0005102, GO:0014068, GO:0032088,
GO:0016491, GO:0006641, GO:0016684, GO:0019899, GO:0000302, GO:0055114, GO:0020037

To look for similarities between the reference sequence and the sequences found in the searches, those terms are counted. ...
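
One possible way to do this counting is sketched below. It assumes that the GO annotations of each hit are already available as a set of term IDs. The hit names and their annotations are made up for illustration, and only the first four reference terms from the list above are used to keep the example short.

```python
from collections import Counter

# First four reference GO terms from the list above (shortened for brevity).
reference_terms = {"GO:0050661", "GO:0009650", "GO:0020027", "GO:0004096"}

# Hypothetical hits with hypothetical annotations, for illustration only.
# "GO:9999999" stands for any term that is not annotated to the reference.
hit_annotations = {
    "hit_A": {"GO:0004096", "GO:0050661", "GO:9999999"},
    "hit_B": {"GO:0004096"},
    "hit_C": set(),
}


def count_shared_terms(reference, annotations):
    """Count, for each reference term, how many hits are annotated with it."""
    counts = Counter()
    for terms in annotations.values():
        counts.update(terms & reference)
    return counts


def shared_fraction(reference, terms):
    """Fraction of the reference terms that also occur in one hit."""
    return len(terms & reference) / len(reference)


counts = count_shared_terms(reference_terms, hit_annotations)
for term in sorted(reference_terms):
    print(term, counts[term])
```

Counting per term shows which functions are shared most often among the hits, while the per-hit fraction gives a rough measure of how similar a single hit is to the reference.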

Multiple sequence alignments

Datasets

For the multiple sequence alignments, four datasets were generated with a Python script. Three groups contain ten sequences each (two of which are PDB sequences): one with more than 60% sequence identity to our target sequence (PAH), one with a sequence identity between 30% and 60%, and one with less than 30% sequence identity. In the group with < 30% identity, the PDB sequences have 32% identity, because these are the lowest ones found in the BLAST output against the PDB dataset. The fourth group contains 20 sequences covering the whole range of sequence identities, four of which are PDB sequences.
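
The grouping by sequence identity can be sketched as follows. This is not the script that was actually used; it only illustrates the thresholds described above. It assumes that the hits have already been parsed into (ID, identity in percent) pairs, and it simply keeps the highest-identity hits in each group, which is an assumption about the selection rule. The example values are taken from the tables below.

```python
def identity_group(identity):
    """Map a sequence identity in percent to one of the three groups."""
    if identity > 60:
        return "high"
    if identity >= 30:
        return "medium"
    return "low"


def build_groups(hits, size=8):
    """Sort hits by identity and fill each group with at most `size` IDs.

    `size` defaults to 8 because two of the ten sequences per group
    are PDB sequences that are added separately.
    """
    groups = {"low": [], "medium": [], "high": []}
    for seq_id, identity in sorted(hits, key=lambda hit: hit[1], reverse=True):
        group = groups[identity_group(identity)]
        if len(group) < size:
            group.append(seq_id)
    return groups


# A few (ID, identity) pairs from the datasets listed below.
hits = [
    ("C8W332", 29),
    ("Q4VBE2", 76),
    ("A3UNV6", 31),
    ("K1RSS1", 67),
    ("Q08RX0", 23),
    ("Q45XJ4", 48),
]

groups = build_groups(hits)
for name, members in groups.items():
    print(name, members)
```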

**Dataset "low":**
Group of sequences with < 30% (pdb = 32%) identity
Sequence identity	ID	Protein Name
29%	C8W332	Prephenate dehydratase OS=Desulfotomaculum acetoxidans
29%	F4GLY0	Phospho-2-dehydro-3-deoxyheptonate aldolase OS=Spirochaeta coccoides
29%	B8J3L5	Chorismate mutase OS=Desulfovibrio desulfuricans
27%	A1ZW97	Phenylalanine-4-hydroxylase OS=Microscilla marina
26%	H1GP19	Putative uncharacterized protein OS=Myroides odoratimimus
25%	G0L2J6	Phenylalanine 4-monooxygenase OS=Zobellia galactanivorans
24%	L9JT09	Phenylalanine-4-hydroxylase OS=Cystobacter fuscus
23%	Q08RX0	Aromatic amino acid hydroxylase, biopterin-dependent OS=Stigmatella aurantiaca
32%	1ltu_A (pdb)	PHENYLALANINE-4-HYDROXYLASE
32%	1ltz_A (pdb)	PHENYLALANINE-4-HYDROXYLASE

**Dataset "medium":**
Group of sequences between 30% and 60% identity
Sequence identity	ID	Protein Name
48%	Q45XJ4	Tyrosine hydroxylase OS=Branchiostoma floridae
43%	D8U9W1	Putative uncharacterized protein OS=Volvox carteri
41%	A5GW18	Prephenate dehydratase OS=Synechococcus sp.
38%	Q1Q4R8	Strongly similar to chorismate mutase/prephenate dehydratase OS=Candidatus Kuenenia stuttgartiensis
37%	E5Y7E0	Prephenate dehydratase OS=Bilophila wadsworthia
33%	F5XW56	Candidate phenylalanine 4-monooxygenase (Phenylalanine-4-hydroxylase) OS=Ramlibacter tataouinensis
32%	K0JRN5	Prephenate dehydratase OS=Saccharothrix espanaensis
31%	A3UNV6	Phenylalanine-4-hydroxylase OS=Vibrio splendidus
59%	1toh_A (pdb)	TYROSINE HYDROXYLASE
33%	2v28_B (pdb)	PHENYLALANINE-4-HYDROXYLASE

**Dataset "high":**
Group of sequences with > 60% identity
Sequence identity	ID	Protein Name
76%	Q4VBE2	Putative uncharacterized protein mgc108157 OS=Xenopus tropicalis
67%	K1RSS1	Protein henna OS=Crassostrea gigas
64%	H3HGU2	Uncharacterized protein (Fragment) OS=Strongylocentrotus purpuratus
63%	C3ZNL0	Putative uncharacterized protein OS=Branchiostoma floridae
63%	G3MQ02	Putative uncharacterized protein OS=Amblyomma maculatum
62%	E4XIM4	Whole genome shotgun assembly, reference scaffold set, scaffold scaffold_41 OS=Oikopleura dioica
61%	E9FTL2	Putative uncharacterized protein OS=Daphnia pulex
61%	D6WIQ7	Putative uncharacterized protein OS=Tribolium castaneum
95%	1tg2_A (pdb)	Phenylalanine-4-hydroxylase
96%	4pah_A (pdb)	PHENYLALANINE HYDROXYLASE

**Dataset "all":**
Group of sequences with different identities (0-100%)
Sequence identity	ID	Protein Name
55%	K7F3H7	Uncharacterized protein OS=Pelodiscus sinensis
55%	F4PX76	Phenylalanine 4-monooxygenase OS=Dictyostelium fasciculatum
53%	K7FZR2	Uncharacterized protein OS=Pelodiscus sinensis
52%	A6YIC4	Tryptophan hydroxylase OS=Platynereis dumerilii
50%	C0KKU5	Tyrosine hydroxylase (Fragment) OS=Octopus vulgaris
48%	I2FKE7	Tyrosine hydroxylase long variant (Fragment) OS=Gryllus bimaculatus
47%	K8YPQ2	Phenylalanine-4-hydroxylase OS=Nannochloropsis gaditana
44%	B1B5B7	Chorismate mutase/prephenate dehydratase OS=uncultured Termite group 1 bacterium
36%	L8LIE0	Prephenate dehydratase OS=Leptolyngbya sp.
35%	B9Y6K3	Putative uncharacterized protein OS=Holdemania filiformis
35%	A3QCV4	Phenylalanine 4-hydroxylase OS=Shewanella loihica
34%	C1F688	Phenylalanine-4-hydroxylase OS=Acidobacterium capsulatum
34%	K5BAA8	Prephenate dehydratase family protein OS=Mycobacterium hassiacum
34%	G4FWD8	Chorismate mutase OS=Rhodanobacter sp.
33%	E2MCW6	Phenylalanine-4-hydroxylase OS=Pseudomonas syringae pv. tomato
33%	J2JTB5	Phenylalanine-4-hydroxylase, monomeric form OS=Rhizobium sp.
95%	1tg2_A (pdb)	Phenylalanine-4-hydroxylase
65%	3hfb_A (pdb)	Tryptophan 5-hydroxylase 1
60%	2xsn_D (pdb)	TYROSINE 3-MONOOXYGENASE
32%	2qmw_B (pdb)	Prephenate dehydratase

ClustalW

clustalw -align -infile=in.fasta -outfile=out.aln

Muscle

muscle -in in.fasta -clw -out out.aln

T-Coffee

t_coffee in.fasta
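
To apply the three invocations above to all four datasets, the command lines can be built per dataset. The sketch below only assembles and prints the commands with the flags shown above. The input and output file names (for example low.fasta) are hypothetical.

```python
# Dataset names as used in the tables above.
DATASETS = ("low", "medium", "high", "all")


def msa_commands(fasta, prefix):
    """Return the ClustalW, Muscle and T-Coffee commands for one FASTA file."""
    return {
        "clustalw": [
            "clustalw", "-align",
            f"-infile={fasta}",
            f"-outfile={prefix}_clustalw.aln",
        ],
        "muscle": ["muscle", "-in", fasta, "-clw", "-out", f"{prefix}_muscle.aln"],
        # T-Coffee is called with the input file only, as in the text.
        "t_coffee": ["t_coffee", fasta],
    }


# Hypothetical file layout: one FASTA file per dataset, e.g. low.fasta.
all_commands = {name: msa_commands(f"{name}.fasta", name) for name in DATASETS}

for name, tools in all_commands.items():
    for tool, command in tools.items():
        print(name, tool, " ".join(command))
```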

3D-Coffee

...

Discussion of the multiple sequence alignments and the used tools

...
