Sequence Search and Multiple Sequence Alignment (PKU)
Contents
Short Task Description
Perform database searches using different search tools with the PAH protein as query
Create and evaluate multiple sequence alignments
Reference Sequence of PAH
>sp|P00439|PH4H_HUMAN Phenylalanine-4-hydroxylase OS=Homo sapiens GN=PAH PE=1 SV=1 MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR IEVLDNTQQLKILADSINSEIGILCSALQKI
Database Searches
Blast
time blast2 -p blastp -d /mnt/project/pracstrucfunc12/data/big/big -i Dropbox/Phenylketonuria/Task1/PAH.fasta -o results_blast2_standard
real 1m47.401s user 1m25.290s sys 0m18.280s
time blast2 -p blastp -d /mnt/project/pracstrucfunc12/data/big/big -i Dropbox/Phenylketonuria/Task1/PAH.fasta -o results_blast2_e-10 -e 0.0000000001 -v 2000
real 1m35.454s user 1m21.700s sys 0m3.100s
PSIBlast
time blastpgp -j 5 -d /mnt/project/pracstrucfunc12/data/big/big_80 -i Dropbox/Phenylketonuria/Task1/PAH.fasta -o psi_blast_standard_5_it
real 8m48.107s user 8m21.950s sys 0m8.730s
HHBlits
time hhblits -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -o results_hhblits_standard
real 6m10.059s user 3m15.640s sys 0m40.220s
hhblits -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -o results_hhblits_n4_e-7 -n 4 -e 0.0000001 -o results_hhblits_n4_e-7
HHSearch
time hhsearch -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current_hhm_db
real 13m27.782s user 13m18.120s sys 0m8.480s
PDB
Instead of mapping hits in big_80 to PDB to get structural information for the alignments, we performed an additional search in PDB at NCBI, with parameters: scoring matrix=PAM70, gap open = 10, gap extend = 1, composition based statistics, max. 1000 target sequences.
MSA
Datasets
We tried to create datasets of 4 sequences + reference sequence for the identity ranges (1: 90%-99%, 2: 60%-89%, 3: 40%-59%, 4: 20%-39%) from our search in the big_80 database. Since these results don't contain PDB structures, we additionally searched PDB directly for proteins in the required range and added them to the dataset. For the most conserved range, there is only 1 sequence in big_80, so we experimentally created the best possible highly conserved dataset in the range 80%-90% and used the structure of the reference sequence for 3D-coffee. The resulting datasets are shown in the following tables. All sequences have roughly the same length as the reference sequence except for G5AMD7 in the first set. G5AMD7 contains an insertion of 162 aa ,that is easily identified in the alignment, but shows a very high similarity in the other sections.
80-90% Sequence Identity | |||
---|---|---|---|
Sequence Identity | ID | Comment | |
90% | G1P4I7 | Uncharacterized protein OS=Myotis lucifugus | |
89% | G5AMD7 | Phenylalanine-4-hydroxylase OS=Heterocephalus glaber | |
80% | G1KSL1 | Uncharacterized protein OS=Anolis carolinensis | |
100% | 1PAH | used as 3D-Template only |
60-89% Sequence Identity | |||
---|---|---|---|
Sequence Identity | ID | Comment | |
80% | G1KSL1 | Uncharacterized protein OS=Anolis carolinensis | |
76% | Q4VBE2 | Putative uncharacterized protein mgc108157 OS=Xenopus tropicalis | |
70% | H2UJM8 | Uncharacterized protein OS=Takifugu rubripes | |
63% | D1LXB2 | Phenylalanine hydroxlase OS=Saccoglossus kowalevskii | |
60% | 2XSN_A | human Tyrosine 3-Monooxygenase, also used as 3D-Template |
40-59% Sequence Identity | |||
---|---|---|---|
Sequence Identity | ID | Comment | |
58% | O96947 | Phenylalanine hydroxylase OS=Geodia cydonium | |
55% | D3BKZ8 | Phenylalanine 4-monooxygenase OS=Polysphondylium pallidum | |
49% | Q5RHI3 | Novel protein similar to tyrosine hydroxylase OS=Danio rerio | |
44% | A6P4D3 | Tyrosine hydroxylase OS=Dugesia japonica | |
59% | 1TOH_A | Tyrosine hydroxylase from rattus norvegicus, also used as 3D-Template |
20-39% Sequence Identity | |||
---|---|---|---|
Sequence Identity | ID | Comment | |
37% | F4WGX3 | Phenylalanine-4-hydroxylase OS=Acromyrmex echinatior | |
35% | Q23A76 | Biopterin-dependent aromatic amino acid hydroxylase family protein OS=Tetrahymena thermophila | |
29% | D0I5S9 | Phenylalanine-4-hydroxylase OS=Grimontia hollisae | |
28% | F6G5X4 | Phenylalanine-4-hydroxylase OS=Ralstonia solanacearum | |
31% | 1LTU_A | Phenylalanine-4-hydroxylase from Chromobacterium violaceum, also used as 3D-Template |
ClustalW
clustalw -align -infile=NN.fasta -outfile=clustalW_NN.aln
Muscle
muscle -in NN.fasta -out muscle_NN.fasta
T-coffee
t_coffee NN.fasta
3D-coffee
t_coffee NN.fasta -method sap_pair,slow_pair -template_file <PDB-ID>