Sequence Search and Multiple Sequence Alignment (PKU)

Short Task Description

Perform database searches using different search tools with the PAH protein as query

Create and evaluate multiple sequence alignments

Reference Sequence of PAH

>sp|P00439|PH4H_HUMAN Phenylalanine-4-hydroxylase OS=Homo sapiens GN=PAH PE=1 SV=1
MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM
EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF
RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE
KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR
IEVLDNTQQLKILADSINSEIGILCSALQKI

Database Searches

Blast

time blast2 -p blastp -d /mnt/project/pracstrucfunc12/data/big/big -i Dropbox/Phenylketonuria/Task1/PAH.fasta -o results_blast2_standard

real 1m47.401s user 1m25.290s sys 0m18.280s

time blast2 -p blastp -d /mnt/project/pracstrucfunc12/data/big/big -i Dropbox/Phenylketonuria/Task1/PAH.fasta -o results_blast2_e-10 -e 0.0000000001 -v 2000

real 1m35.454s user 1m21.700s sys 0m3.100s

PSIBlast

time blastpgp -j 5 -d /mnt/project/pracstrucfunc12/data/big/big_80 -i Dropbox/Phenylketonuria/Task1/PAH.fasta -o psi_blast_standard_5_it

real 8m48.107s user 8m21.950s sys 0m8.730s

HHBlits

time hhblits -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -o results_hhblits_standard

real 6m10.059s user 3m15.640s sys 0m40.220s

hhblits -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -n 4 -e 0.0000001 -o results_hhblits_n4_e-7

HHSearch

time hhsearch -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current_hhm_db

real 13m27.782s user 13m18.120s sys 0m8.480s
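To compare the wall-clock times of the search tools, the `real` values reported by `time` can be converted to seconds. The following is a minimal sketch, not part of the original workflow: the helper name `to_seconds` is hypothetical, and it assumes durations in the `XmY.YYYs` format shown above (no hours component).

```shell
# Hypothetical helper: convert a duration as printed by `time`
# (e.g. 1m47.401s) into seconds. Assumes minutes and seconds only.
to_seconds() {
    echo "$1" | LC_ALL=C awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# "real" times copied from the standard runs above, one per tool.
for entry in blast2:1m47.401s blastpgp:8m48.107s hhblits:6m10.059s hhsearch:13m27.782s; do
    tool=${entry%%:*}    # text before the first colon
    real=${entry#*:}     # text after the first colon
    echo "$tool $(to_seconds "$real") s"
done
```

Note that the runs are not strictly comparable: the BLAST searches used the big database, PSI-BLAST used big_80, and HHBlits and HHSearch used the uniprot20 databases.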

PDB

Instead of mapping the hits in big_80 to PDB to obtain structural information for the alignments, we performed an additional search against PDB at NCBI with the following parameters: scoring matrix PAM70, gap open 10, gap extend 1, composition-based statistics, and at most 1000 target sequences.

MSA

Datasets

We tried to create datasets of 4 sequences plus the reference sequence for each identity range (1: 90%-99%, 2: 60%-89%, 3: 40%-59%, 4: 20%-39%) from our search in the big_80 database. Since these results do not contain PDB structures, we additionally searched PDB directly for proteins in the required range and added them to the datasets. For the most conserved range there is only 1 sequence in big_80, so as an experiment we built the most highly conserved dataset possible from the range 80%-90% and used the structure of the reference sequence for 3D-coffee. The resulting datasets are shown in the following tables. All sequences have roughly the same length as the reference sequence except for G5AMD7 in the first set. G5AMD7 contains an insertion of 162 aa that is easily identified in the alignment, but it shows very high similarity in the remaining sections.

80-90% Sequence Identity
Sequence Identity	ID	Comment
90%	G1P4I7	Uncharacterized protein OS=Myotis lucifugus
89%	G5AMD7	Phenylalanine-4-hydroxylase OS=Heterocephalus glaber
80%	G1KSL1	Uncharacterized protein OS=Anolis carolinensis
100%	1PAH	used as 3D-Template only

60-89% Sequence Identity
Sequence Identity	ID	Comment
80%	G1KSL1	Uncharacterized protein OS=Anolis carolinensis
76%	Q4VBE2	Putative uncharacterized protein mgc108157 OS=Xenopus tropicalis
70%	H2UJM8	Uncharacterized protein OS=Takifugu rubripes
63%	D1LXB2	Phenylalanine hydroxylase OS=Saccoglossus kowalevskii
60%	2XSN_A	human Tyrosine 3-Monooxygenase, also used as 3D-Template

40-59% Sequence Identity
Sequence Identity	ID	Comment
58%	O96947	Phenylalanine hydroxylase OS=Geodia cydonium
55%	D3BKZ8	Phenylalanine 4-monooxygenase OS=Polysphondylium pallidum
49%	Q5RHI3	Novel protein similar to tyrosine hydroxylase OS=Danio rerio
44%	A6P4D3	Tyrosine hydroxylase OS=Dugesia japonica
59%	1TOH_A	Tyrosine hydroxylase from Rattus norvegicus, also used as 3D-Template

20-39% Sequence Identity
Sequence Identity	ID	Comment
37%	F4WGX3	Phenylalanine-4-hydroxylase OS=Acromyrmex echinatior
35%	Q23A76	Biopterin-dependent aromatic amino acid hydroxylase family protein OS=Tetrahymena thermophila
29%	D0I5S9	Phenylalanine-4-hydroxylase OS=Grimontia hollisae
28%	F6G5X4	Phenylalanine-4-hydroxylase OS=Ralstonia solanacearum
31%	1LTU_A	Phenylalanine-4-hydroxylase from Chromobacterium violaceum, also used as 3D-Template
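
The identity ranges used to assemble the datasets above can be expressed as a small shell helper. This is only an illustrative sketch: the function name `identity_bin` is hypothetical and not part of the original workflow, and it assumes the identity is given as a whole-number percentage (a trailing % sign is stripped).

```shell
# Hypothetical helper (not part of the original workflow): map a whole-number
# percent identity to the dataset range it falls into.
#   1: 90-99%   2: 60-89%   3: 40-59%   4: 20-39%   none: outside all ranges
identity_bin() {
    local pid=${1%\%}    # strip a trailing % sign, e.g. "76%" -> "76"
    if   [ "$pid" -ge 90 ] && [ "$pid" -le 99 ]; then echo 1
    elif [ "$pid" -ge 60 ] && [ "$pid" -le 89 ]; then echo 2
    elif [ "$pid" -ge 40 ] && [ "$pid" -le 59 ]; then echo 3
    elif [ "$pid" -ge 20 ] && [ "$pid" -le 39 ]; then echo 4
    else echo none
    fi
}

# Example: bin a few identities taken from the tables above.
for pid in 76% 58% 29% 100%; do
    echo "$pid -> $(identity_bin "$pid")"
done
```

The 100% entry (1PAH) falls outside all four ranges, which matches its role as a 3D template rather than a dataset member.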

ClustalW

clustalw -align -infile=NN.fasta -outfile=clustalW_NN.aln

Muscle

muscle -in NN.fasta -out muscle_NN.fasta

T-coffee

t_coffee NN.fasta

3D-coffee

t_coffee NN.fasta -method sap_pair,slow_pair -template_file <PDB-ID>
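
To run the four aligners on every dataset in the same way, the commands above can be generated per input file. The following is a minimal sketch that only prints the commands instead of executing them, so it can be checked before anything is run. The function name `msa_commands` and the file name `set1.fasta` are hypothetical. The template argument is passed through exactly as the `<PDB-ID>` placeholder above suggests, and the value format that t_coffee expects for it is not confirmed here.

```shell
# Hypothetical helper: print (do not run) the four alignment commands
# for one dataset file, using only the options shown above.
#   $1 = FASTA file of the dataset, $2 = template for 3D-coffee
msa_commands() {
    local fasta=$1 template=$2
    local name
    name=$(basename "$fasta" .fasta)    # e.g. set1.fasta -> set1
    echo "clustalw -align -infile=${fasta} -outfile=clustalW_${name}.aln"
    echo "muscle -in ${fasta} -out muscle_${name}.fasta"
    echo "t_coffee ${fasta}"
    echo "t_coffee ${fasta} -method sap_pair,slow_pair -template_file ${template}"
}

# Example: show the commands for a hypothetical first dataset.
msa_commands set1.fasta 1PAH
```

Once the printed commands look right, the output can be piped to `sh` to execute them.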
