Difference between revisions of "Sequence Search and Multiple Sequence Alignment (PKU)"
Line 32: | Line 32: | ||
===PSIBlast=== |
===PSIBlast=== |
||
+ | In order to display the runtime PSIBlast uses in order to get results you asked for we used a little [[File:PSIBlasttrainer.pl|skript]] which uses PSIblast with a variety of parametersettings. The overall message is that more iterations one uses used the more runtime PSIBlast needs. and the harder (lower) you choose your e-Value cutoff the less results you get. |
||
− | time blastpgp -j 5 -d /mnt/project/pracstrucfunc12/data/big/big_80 -i Dropbox/Phenylketonuria/Task1/PAH.fasta -o psi_blast_standard_5_it |
||
− | real 8m48.107s |
||
− | user 8m21.950s |
||
− | sys 0m8.730s |
||
iteration: 1 |
iteration: 1 |
||
87.21user 8.37system 1:59.44elapsed 80%CPU (0avgtext+0avgdata 3990608maxresident)k |
87.21user 8.37system 1:59.44elapsed 80%CPU (0avgtext+0avgdata 3990608maxresident)k |
Revision as of 19:24, 6 May 2012
Contents
Short Task Description
The goal of this task is to learn a bit about our protein of interest (phenylalanine hydroxylase/PheOH)from existing databases and comparing it to similar (and probably related) sequences. To do this, we searched the big80 database (7.73 mio sequences), which contains a subset of swissprot and pdb with sequence similarity of 80% or less, with blast (sequence comparison) and psi-blast (profile comparison) and searched a preprocessed and clustered version of uniprot (44,757 clusters) with the more advanced HHblits method (HMM comparison).
From the search results, we created several datasets sorted by sequence similarity (highly conserved, less conserved, a little conserved, ...) and created multiple aligmnents from these sets with four different tools. The results are listed below, the commands, programms and scripts we used are listed at the appropriate places in the text or linked to their own site.
Reference Sequence of PAH
>sp|P00439|PH4H_HUMAN Phenylalanine-4-hydroxylase OS=Homo sapiens GN=PAH PE=1 SV=1 MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR IEVLDNTQQLKILADSINSEIGILCSALQKI
Database Searches
Blast
time blast2 -p blastp -d /mnt/project/pracstrucfunc12/data/big/big -i Dropbox/Phenylketonuria/Task1/PAH.fasta -o results_blast2_standard
real 1m47.401s user 1m25.290s sys 0m18.280s
time blast2 -p blastp -d /mnt/project/pracstrucfunc12/data/big/big -i Dropbox/Phenylketonuria/Task1/PAH.fasta -o results_blast2_e-10 -e 0.0000000001 -v 2000
real 1m35.454s user 1m21.700s sys 0m3.100s
PSIBlast
In order to display the runtime PSIBlast uses in order to get results you asked for we used a little skript which uses PSIblast with a variety of parametersettings. The overall message is that more iterations one uses used the more runtime PSIBlast needs. and the harder (lower) you choose your e-Value cutoff the less results you get.
iteration: 1 87.21user 8.37system 1:59.44elapsed 80%CPU (0avgtext+0avgdata 3990608maxresident)k 8852328inputs+1712outputs (222major+675658minor)pagefaults 0swaps iteration: 2 183.40user 5.77system 3:14.13elapsed 97%CPU (0avgtext+0avgdata 10788800maxresident)k 2794968inputs+3344outputs (205major+1343786minor)pagefaults 0swaps iteration: 3 284.08user 3.71system 4:49.33elapsed 99%CPU (0avgtext+0avgdata 10829616maxresident)k 104376inputs+5160outputs (25major+2015594minor)pagefaults 0swaps iteration: 4 389.89user 5.17system 6:36.74elapsed 99%CPU (0avgtext+0avgdata 10837360maxresident)k 56472inputs+6768outputs (9major+2685186minor)pagefaults 0swaps iteration: 5 499.89user 7.36system 8:29.33elapsed 99%CPU (0avgtext+0avgdata 10839264maxresident)k 12592inputs+8808outputs (5major+3354681minor)pagefaults 0swaps iteration: 6 615.74user 8.73system 10:27.22elapsed 99%CPU (0avgtext+0avgdata 10840096maxresident)k 74344inputs+10232outputs (5major+4024207minor)pagefaults 0swaps iteration: 7 736.57user 10.26system 12:30.19elapsed 99%CPU (0avgtext+0avgdata 10840880maxresident)k 168inputs+11912outputs (6major+4693733minor)pagefaults 0swaps iteration: 8 852.37user 12.37system 14:27.78elapsed 99%CPU (0avgtext+0avgdata 10841248maxresident)k 0inputs+13208outputs (0major+5363227minor)pagefaults 0swaps iteration: 9 978.60user 13.86system 16:36.34elapsed 99%CPU (0avgtext+0avgdata 10841328maxresident)k 0inputs+15368outputs (0major+6032716minor)pagefaults 0swaps iteration: 10 1093.38user 15.58system 18:33.79elapsed 99%CPU (0avgtext+0avgdata 10841552maxresident)k 0inputs+17144outputs (0major+6702196minor)pagefaults 0swaps
HHBlits
time hhblits -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -o results_hhblits_standard
real 6m10.059s user 3m15.640s sys 0m40.220s
hhblits -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -o results_hhblits_n4_e-7 -n 4 -e 0.0000001 -o results_hhblits_n4_e-7
HHSearch
time hhsearch -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current_hhm_db
real 13m27.782s user 13m18.120s sys 0m8.480s
PDB
Instead of mapping hits in big_80 to PDB to get structural information for the alignments, we performed an additional search in PDB at NCBI, with parameters: scoring matrix=PAM70, gap open = 10, gap extend = 1, composition based statistics, max. 1000 target sequences.
MSA
Datasets
We tried to create datasets of 4 sequences + reference sequence for the identity ranges (1: 90%-99%, 2: 60%-89%, 3: 40%-59%, 4: 20%-39%) from our search in the big_80 database. Since these results don't contain PDB structures, we additionally searched PDB directly for proteins in the required range and added them to the dataset. For the most conserved range, there is only 1 sequence in big_80, so we experimentally created the best possible highly conserved dataset in the range 80%-90% and used the structure of the reference sequence for 3D-coffee. The resulting datasets are shown in the following tables. All sequences have roughly the same length as the reference sequence except for G5AMD7 in the first set. G5AMD7 contains an insertion of 162 aa ,that is easily identified in the alignment, but shows a very high similarity in the other sections.
80-90% Sequence Identity | |||
---|---|---|---|
Sequence Identity | ID | Comment | |
90% | G1P4I7 | Uncharacterized protein OS=Myotis lucifugus | |
89% | G5AMD7 | Phenylalanine-4-hydroxylase OS=Heterocephalus glaber | |
80% | G1KSL1 | Uncharacterized protein OS=Anolis carolinensis | |
100% | 1PAH | used as 3D-Template only |
60-89% Sequence Identity | |||
---|---|---|---|
Sequence Identity | ID | Comment | |
80% | G1KSL1 | Uncharacterized protein OS=Anolis carolinensis | |
76% | Q4VBE2 | Putative uncharacterized protein mgc108157 OS=Xenopus tropicalis | |
70% | H2UJM8 | Uncharacterized protein OS=Takifugu rubripes | |
63% | D1LXB2 | Phenylalanine hydroxlase OS=Saccoglossus kowalevskii | |
60% | 2XSN_A | human Tyrosine 3-Monooxygenase, also used as 3D-Template |
40-59% Sequence Identity | |||
---|---|---|---|
Sequence Identity | ID | Comment | |
58% | O96947 | Phenylalanine hydroxylase OS=Geodia cydonium | |
55% | D3BKZ8 | Phenylalanine 4-monooxygenase OS=Polysphondylium pallidum | |
49% | Q5RHI3 | Novel protein similar to tyrosine hydroxylase OS=Danio rerio | |
44% | A6P4D3 | Tyrosine hydroxylase OS=Dugesia japonica | |
59% | 1TOH_A | Tyrosine hydroxylase from rattus norvegicus, also used as 3D-Template |
20-39% Sequence Identity | |||
---|---|---|---|
Sequence Identity | ID | Comment | |
37% | F4WGX3 | Phenylalanine-4-hydroxylase OS=Acromyrmex echinatior | |
35% | Q23A76 | Biopterin-dependent aromatic amino acid hydroxylase family protein OS=Tetrahymena thermophila | |
29% | D0I5S9 | Phenylalanine-4-hydroxylase OS=Grimontia hollisae | |
28% | F6G5X4 | Phenylalanine-4-hydroxylase OS=Ralstonia solanacearum | |
31% | 1LTU_A | Phenylalanine-4-hydroxylase from Chromobacterium violaceum, also used as 3D-Template |
Raw Results
We present the alignments and basic statistics for the different tools. Images have been created in Jalview. To count gaps and conserved columns a simple python script (Collection of scripts) was used.
ClustalW
Command used to create the alignments:
clustalw -align -infile=NN.fasta -outfile=clustalW_NN.aln
20-39% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 83 |
Q23A76 | 87 |
D0I5S9 | 272 |
F4WGX3 | 69 |
F6G5X4 | 181 |
1LTU_A | 238 |
Conserved columns | 19 |
40-59% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 37 |
A6P4D3 | 1 |
Q5RHI3 | 14 |
D3BKZ8 | 45 |
O96947 | 39 |
1TOH_A | 146 |
Conserved columns | 109 |
60-89% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 17 |
H2UJM8 | 7 |
Q4VBE2 | 24 |
G1KSL1 | 16 |
D1LXB2 | 18 |
2XSN_A | 126 |
Conserved columns | 159 |
80-90% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 165 |
G1P4I7 | 162 |
G1KSL1 | 164 |
G5AMD7 | 4 |
Conserved columns | 329 |
Muscle
Command used to create the alignments:
muscle -in NN.fasta -out muscle_NN.fasta
20-39% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 94 |
Q23A76 | 98 |
D0I5S9 | 283 |
F4WGX3 | 80 |
F6G5X4 | 192 |
1LTU_A | 249 |
Conserved columns | 22 |
40-59% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 45 |
A6P4D3 | 9 |
Q5RHI3 | 22 |
D3BKZ8 | 53 |
O96947 | 47 |
1TOH_A | 154 |
Conserved columns | 109 |
60-89% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 17 |
H2UJM8 | 7 |
Q4VBE2 | 24 |
G1KSL1 | 16 |
D1LXB2 | 18 |
2XSN_A | 126 |
Conserved columns | 159 |
80-90% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 167 |
G1P4I7 | 164 |
G1KSL1 | 166 |
G5AMD7 | 6 |
Conserved columns | 328 |
T-coffee
Command used to create the alignments:
t_coffee NN.fasta
20-39% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 103 |
Q23A76 | 107 |
D0I5S9 | 292 |
F4WGX3 | 89 |
F6G5X4 | 201 |
1LTU_A | 258 |
Conserved columns | 22 |
40-59% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 53 |
A6P4D3 | 17 |
Q5RHI3 | 30 |
D3BKZ8 | 61 |
O96947 | 55 |
1TOH_A | 162 |
Conserved columns | 109 |
60-89% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 18 |
H2UJM8 | 8 |
Q4VBE2 | 25 |
G1KSL1 | 17 |
D1LXB2 | 19 |
2XSN_A | 127 |
Conserved columns | 159 |
80-90% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 165 |
G1P4I7 | 162 |
G1KSL1 | 164 |
G5AMD7 | 4 |
Conserved columns | 329 |
3D-coffee
Command to create the alignments:
t_coffee NN.fasta -method sap_pair,slow_pair -template_file <PDB-ID>
20-39% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 96 |
Q23A76 | 100 |
D0I5S9 | 285 |
F4WGX3 | 82 |
F6G5X4 | 194 |
1LTU_A | 251 |
Conserved columns | 22 |
40-59% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 51 |
A6P4D3 | 15 |
Q5RHI3 | 28 |
D3BKZ8 | 59 |
O96947 | 53 |
1TOH_A | 160 |
Conserved columns | 109 |
60-89% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 21 |
H2UJM8 | 11 |
Q4VBE2 | 28 |
G1KSL1 | 20 |
D1LXB2 | 22 |
2XSN_A | 130 |
Conserved columns | 159 |
80-90% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 165 |
G1P4I7 | 162 |
G1KSL1 | 164 |
G5AMD7 | 4 |
Conserved columns | 329 |
Discussion of the alignment results and alignment tools
Looking at the used tools, for these small datasets (4-6 sequences) there is not much of a difference in execution time, but t-coffee offers the most options and different modes (we used only two here, but it still counts). Comparing the results, the programms perform quite similar, but can be sorted by the basic statistics of inroduced gaps and found conserved columns in the following order: ClustalW> Muscle>3D-Coffee>T-Coffee
to be continued shortly..