Sequence Search and Multiple Sequence Alignment (PKU)
Short Task Description
The goal of this task is to learn a bit about our protein of interest (phenylalanine hydroxylase/PheOH) from existing databases and comparing it to similar (and probably related) sequences. To do this, we searched the big80 database (7.73 mio sequences), which contains a subset of swissprot and pdb with sequence similarity of 80% or less, with blast (sequence comparison) and psi-blast (profile comparison) and searched a preprocessed and clustered version of uniprot (44,757 clusters) with the more advanced HHblits method (HMM comparison).
From the search results, we created several datasets sorted by sequence similarity (highly conserved, less conserved, a little conserved, ...) and created multiple aligmnents from these sets with four different tools. The results are listed below, the commands, programms and scripts we used are listed at the appropriate places in the text or linked to their own site.
Reference Sequence of PAH
>sp|P00439|PH4H_HUMAN Phenylalanine-4-hydroxylase OS=Homo sapiens GN=PAH PE=1 SV=1 MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR IEVLDNTQQLKILADSINSEIGILCSALQKI
Database Searches
Blast
time blast2 -p blastp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i Dropbox/Phenylketonuria/Task2/PAH.fasta -o results_blast2_standard -v 2000 -b 2000
real 1m34.784s user 1m22.930s sys 0m8.400s
time blast2 -p blastp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i Dropbox/Phenylketonuria/Task2/PAH.fasta -o results_blast2_e-10 -e 0.0000000001 -v 2000 -b 2000
real 1m34.145s user 1m23.380s sys 0m9.270s
PSIBlast
In order to display the runtime PSIBlast uses in order to get results you asked for we used a little skript which uses PSIblast with a variety of parametersettings. The overall message is that more iterations one uses used the more runtime PSIBlast needs. and the harder (lower) you choose your e-Value cutoff the less results you get.
Iterations
If you change the number of iterations (parameter -j) your runtime increases according to the parameter. To show this we used the defaultsettings for anything and increased the iterations from 1 to 10 (see Fig. <xr id="fig:runtimeIterations"/>)
<figure id="fig:runtimeIterations">
</figure>
e-Values
A change in the -e parameter does not change your runtime (see Fig. XXX) but in- or decreases the amount of sequences you derive from your search. This is shown in <xr id="fig:cutoffSize"/> <figure id="fig:cutoffSize">
</figure>
iteration: 1 87.21user 8.37system 1:59.44elapsed 80%CPU (0avgtext+0avgdata 3990608maxresident)k 8852328inputs+1712outputs (222major+675658minor)pagefaults 0swaps iteration: 2 183.40user 5.77system 3:14.13elapsed 97%CPU (0avgtext+0avgdata 10788800maxresident)k 2794968inputs+3344outputs (205major+1343786minor)pagefaults 0swaps iteration: 3 284.08user 3.71system 4:49.33elapsed 99%CPU (0avgtext+0avgdata 10829616maxresident)k 104376inputs+5160outputs (25major+2015594minor)pagefaults 0swaps iteration: 4 389.89user 5.17system 6:36.74elapsed 99%CPU (0avgtext+0avgdata 10837360maxresident)k 56472inputs+6768outputs (9major+2685186minor)pagefaults 0swaps iteration: 5 499.89user 7.36system 8:29.33elapsed 99%CPU (0avgtext+0avgdata 10839264maxresident)k 12592inputs+8808outputs (5major+3354681minor)pagefaults 0swaps iteration: 6 615.74user 8.73system 10:27.22elapsed 99%CPU (0avgtext+0avgdata 10840096maxresident)k 74344inputs+10232outputs (5major+4024207minor)pagefaults 0swaps iteration: 7 736.57user 10.26system 12:30.19elapsed 99%CPU (0avgtext+0avgdata 10840880maxresident)k 168inputs+11912outputs (6major+4693733minor)pagefaults 0swaps iteration: 8 852.37user 12.37system 14:27.78elapsed 99%CPU (0avgtext+0avgdata 10841248maxresident)k 0inputs+13208outputs (0major+5363227minor)pagefaults 0swaps iteration: 9 978.60user 13.86system 16:36.34elapsed 99%CPU (0avgtext+0avgdata 10841328maxresident)k 0inputs+15368outputs (0major+6032716minor)pagefaults 0swaps iteration: 10 1093.38user 15.58system 18:33.79elapsed 99%CPU (0avgtext+0avgdata 10841552maxresident)k 0inputs+17144outputs (0major+6702196minor)pagefaults 0swaps
HHBlits
time hhblits -i Dropbox/Phenylketonuria/Task2/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -o results_hhblits_standard -E 1 -e 0.001 -Z 2000 -B 2000 real 6m10.059s user 3m15.640s sys 0m40.220s
hhblits -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -o results_hhblits_n4_e-7 -n 4 -e 0.0000001 -o results_hhblits_n4_e-7
HHSearch
time hhsearch -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current_hhm_db
real 13m27.782s user 13m18.120s sys 0m8.480s
PDB
Instead of mapping hits in big_80 to PDB to get structural information for the alignments, we performed an additional search in PDB at NCBI, with parameters: scoring matrix=PAM70, gap open = 10, gap extend = 1, composition based statistics, max. 1000 target sequences.
MSA
Datasets
We tried to create datasets of 4 sequences + reference sequence for the identity ranges (1: 90%-99%, 2: 60%-89%, 3: 40%-59%, 4: 20%-39%) from our search in the big_80 database. Since these results do not contain PDB structures, we additionally searched PDB directly for proteins in the required range and added them to the dataset. For the most conserved range, there is only 1 sequence in big_80, so we experimentally created the best possible highly conserved dataset in the range 80%-90% and used the structure of the reference sequence for 3D-coffee. The resulting datasets are shown in the following tables. All sequences have roughly the same length as the reference sequence except for G5AMD7 in the first set. G5AMD7 contains an insertion of 162 aa ,that is easily identified in the alignment, but shows a very high similarity in the other sections.
80-90% Sequence Identity | |||
---|---|---|---|
Sequence Identity | ID | Comment | |
90% | G1P4I7 | Uncharacterized protein OS=Myotis lucifugus | |
89% | G5AMD7 | Phenylalanine-4-hydroxylase OS=Heterocephalus glaber | |
80% | G1KSL1 | Uncharacterized protein OS=Anolis carolinensis | |
100% | 1PAH | used as 3D-Template only |
60-89% Sequence Identity | |||
---|---|---|---|
Sequence Identity | ID | Comment | |
80% | G1KSL1 | Uncharacterized protein OS=Anolis carolinensis | |
76% | Q4VBE2 | Putative uncharacterized protein mgc108157 OS=Xenopus tropicalis | |
70% | H2UJM8 | Uncharacterized protein OS=Takifugu rubripes | |
63% | D1LXB2 | Phenylalanine hydroxlase OS=Saccoglossus kowalevskii | |
60% | 2XSN_A | human Tyrosine 3-Monooxygenase, also used as 3D-Template |
40-59% Sequence Identity | |||
---|---|---|---|
Sequence Identity | ID | Comment | |
58% | O96947 | Phenylalanine hydroxylase OS=Geodia cydonium | |
55% | D3BKZ8 | Phenylalanine 4-monooxygenase OS=Polysphondylium pallidum | |
49% | Q5RHI3 | Novel protein similar to tyrosine hydroxylase OS=Danio rerio | |
44% | A6P4D3 | Tyrosine hydroxylase OS=Dugesia japonica | |
59% | 1TOH_A | Tyrosine hydroxylase from rattus norvegicus, also used as 3D-Template |
20-39% Sequence Identity | |||
---|---|---|---|
Sequence Identity | ID | Comment | |
37% | F4WGX3 | Phenylalanine-4-hydroxylase OS=Acromyrmex echinatior | |
35% | Q23A76 | Biopterin-dependent aromatic amino acid hydroxylase family protein OS=Tetrahymena thermophila | |
29% | D0I5S9 | Phenylalanine-4-hydroxylase OS=Grimontia hollisae | |
28% | F6G5X4 | Phenylalanine-4-hydroxylase OS=Ralstonia solanacearum | |
31% | 1LTU_A | Phenylalanine-4-hydroxylase from Chromobacterium violaceum, also used as 3D-Template |
Raw Results
We present the alignments and basic statistics for the different tools. Images have been created in Jalview. To count gaps and conserved columns a simple python script (Collection of scripts) was used.
ClustalW
Command used to create the alignments:
clustalw -align -infile=NN.fasta -outfile=clustalW_NN.aln
20-39% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 83 |
Q23A76 | 87 |
D0I5S9 | 272 |
F4WGX3 | 69 |
F6G5X4 | 181 |
1LTU_A | 238 |
Conserved columns | 19 |
40-59% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 37 |
A6P4D3 | 1 |
Q5RHI3 | 14 |
D3BKZ8 | 45 |
O96947 | 39 |
1TOH_A | 146 |
Conserved columns | 109 |
60-89% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 17 |
H2UJM8 | 7 |
Q4VBE2 | 24 |
G1KSL1 | 16 |
D1LXB2 | 18 |
2XSN_A | 126 |
Conserved columns | 159 |
80-90% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 165 |
G1P4I7 | 162 |
G1KSL1 | 164 |
G5AMD7 | 4 |
Conserved columns | 329 |
Muscle
Command used to create the alignments:
muscle -in NN.fasta -out muscle_NN.fasta
20-39% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 94 |
Q23A76 | 98 |
D0I5S9 | 283 |
F4WGX3 | 80 |
F6G5X4 | 192 |
1LTU_A | 249 |
Conserved columns | 22 |
40-59% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 45 |
A6P4D3 | 9 |
Q5RHI3 | 22 |
D3BKZ8 | 53 |
O96947 | 47 |
1TOH_A | 154 |
Conserved columns | 109 |
60-89% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 17 |
H2UJM8 | 7 |
Q4VBE2 | 24 |
G1KSL1 | 16 |
D1LXB2 | 18 |
2XSN_A | 126 |
Conserved columns | 159 |
80-90% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 167 |
G1P4I7 | 164 |
G1KSL1 | 166 |
G5AMD7 | 6 |
Conserved columns | 328 |
T-coffee
Command used to create the alignments:
t_coffee NN.fasta
20-39% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 103 |
Q23A76 | 107 |
D0I5S9 | 292 |
F4WGX3 | 89 |
F6G5X4 | 201 |
1LTU_A | 258 |
Conserved columns | 22 |
40-59% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 53 |
A6P4D3 | 17 |
Q5RHI3 | 30 |
D3BKZ8 | 61 |
O96947 | 55 |
1TOH_A | 162 |
Conserved columns | 109 |
60-89% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 18 |
H2UJM8 | 8 |
Q4VBE2 | 25 |
G1KSL1 | 17 |
D1LXB2 | 19 |
2XSN_A | 127 |
Conserved columns | 159 |
80-90% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 165 |
G1P4I7 | 162 |
G1KSL1 | 164 |
G5AMD7 | 4 |
Conserved columns | 329 |
3D-coffee
Command to create the alignments:
t_coffee NN.fasta -method sap_pair,slow_pair -template_file <PDB-ID>
20-39% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 96 |
Q23A76 | 100 |
D0I5S9 | 285 |
F4WGX3 | 82 |
F6G5X4 | 194 |
1LTU_A | 251 |
Conserved columns | 22 |
40-59% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 51 |
A6P4D3 | 15 |
Q5RHI3 | 28 |
D3BKZ8 | 59 |
O96947 | 53 |
1TOH_A | 160 |
Conserved columns | 109 |
60-89% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 21 |
H2UJM8 | 11 |
Q4VBE2 | 28 |
G1KSL1 | 20 |
D1LXB2 | 22 |
2XSN_A | 130 |
Conserved columns | 159 |
80-90% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 165 |
G1P4I7 | 162 |
G1KSL1 | 164 |
G5AMD7 | 4 |
Conserved columns | 329 |
Discussion of the alignment results and alignment tools
Looking at the used tools, for these small datasets (4-6 sequences) there is not much of a difference in execution time, but t-coffee offers the most options and different modes (we used only two here, but it still counts). Comparing the results, the tools perform quite similar, but can be sorted by the basic statistics of introduced gaps and found conserved columns in the following order: ClustalW> Muscle>3D-Coffee>T-Coffee
T-Coffee using the additional 3D information performs minimally better in terms of gaps, more notably in the less conserved datasets. An interesting point that stands out might be that 3D-Coffe always places the first methionines together even if for a shorter sequence missing e.g. an N-terminal part this means a very long gap between this first methionine and the rest of the sequence. This behaviour does not influence the alignment score much, might even be a 'better' alignment as the first methionines belong together functionally, but the single M irritated us a little.
to be continued shortly..