Sequence Search and Multiple Sequence Alignment (PKU)

From Bioinformatikpedia
Revision as of 20:28, 6 May 2012 by Hollizeck (talk | contribs)

Short Task Description

The goal of this task is to learn a bit about our protein of interest (phenylalanine hydroxylase/PheOH)from existing databases and comparing it to similar (and probably related) sequences. To do this, we searched the big80 database (7.73 mio sequences), which contains a subset of swissprot and pdb with sequence similarity of 80% or less, with blast (sequence comparison) and psi-blast (profile comparison) and searched a preprocessed and clustered version of uniprot (44,757 clusters) with the more advanced HHblits method (HMM comparison).

From the search results, we created several datasets sorted by sequence similarity (highly conserved, less conserved, a little conserved, ...) and created multiple aligmnents from these sets with four different tools. The results are listed below, the commands, programms and scripts we used are listed at the appropriate places in the text or linked to their own site.

Reference Sequence of PAH

>sp|P00439|PH4H_HUMAN Phenylalanine-4-hydroxylase OS=Homo sapiens GN=PAH PE=1 SV=1
MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM
EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF
RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE
KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR
IEVLDNTQQLKILADSINSEIGILCSALQKI

Database Searches

Blast

time blast2 -p blastp -d /mnt/project/pracstrucfunc12/data/big/big -i Dropbox/Phenylketonuria/Task1/PAH.fasta -o results_blast2_standard

real 1m47.401s user 1m25.290s sys 0m18.280s

time blast2 -p blastp -d /mnt/project/pracstrucfunc12/data/big/big -i Dropbox/Phenylketonuria/Task1/PAH.fasta -o results_blast2_e-10 -e 0.0000000001 -v 2000

real 1m35.454s user 1m21.700s sys 0m3.100s

PSIBlast

In order to display the runtime PSIBlast uses in order to get results you asked for we used a little skript which uses PSIblast with a variety of parametersettings. The overall message is that more iterations one uses used the more runtime PSIBlast needs. and the harder (lower) you choose your e-Value cutoff the less results you get.

Iterations

If you change the number of iterations (parameter -j) your runtime increases according to the parameter. To show this we used the defaultsettings for anything and increased the iterations from 1 to 10 (see Fig. XXX)

e-Values

A change in the -e parameter does not change your runtime (see Fig. XXX) but in- or decreases the amount of sequences you derive from your search. This is shown in Fig. YYY

Change in the -e parameter results in a lower resultsize

iteration: 1 87.21user 8.37system 1:59.44elapsed 80%CPU (0avgtext+0avgdata 3990608maxresident)k 8852328inputs+1712outputs (222major+675658minor)pagefaults 0swaps iteration: 2 183.40user 5.77system 3:14.13elapsed 97%CPU (0avgtext+0avgdata 10788800maxresident)k 2794968inputs+3344outputs (205major+1343786minor)pagefaults 0swaps iteration: 3 284.08user 3.71system 4:49.33elapsed 99%CPU (0avgtext+0avgdata 10829616maxresident)k 104376inputs+5160outputs (25major+2015594minor)pagefaults 0swaps iteration: 4 389.89user 5.17system 6:36.74elapsed 99%CPU (0avgtext+0avgdata 10837360maxresident)k 56472inputs+6768outputs (9major+2685186minor)pagefaults 0swaps iteration: 5 499.89user 7.36system 8:29.33elapsed 99%CPU (0avgtext+0avgdata 10839264maxresident)k 12592inputs+8808outputs (5major+3354681minor)pagefaults 0swaps iteration: 6 615.74user 8.73system 10:27.22elapsed 99%CPU (0avgtext+0avgdata 10840096maxresident)k 74344inputs+10232outputs (5major+4024207minor)pagefaults 0swaps iteration: 7 736.57user 10.26system 12:30.19elapsed 99%CPU (0avgtext+0avgdata 10840880maxresident)k 168inputs+11912outputs (6major+4693733minor)pagefaults 0swaps iteration: 8 852.37user 12.37system 14:27.78elapsed 99%CPU (0avgtext+0avgdata 10841248maxresident)k 0inputs+13208outputs (0major+5363227minor)pagefaults 0swaps iteration: 9 978.60user 13.86system 16:36.34elapsed 99%CPU (0avgtext+0avgdata 10841328maxresident)k 0inputs+15368outputs (0major+6032716minor)pagefaults 0swaps iteration: 10 1093.38user 15.58system 18:33.79elapsed 99%CPU (0avgtext+0avgdata 10841552maxresident)k 0inputs+17144outputs (0major+6702196minor)pagefaults 0swaps

HHBlits

time hhblits -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -o results_hhblits_standard

real 6m10.059s user 3m15.640s sys 0m40.220s

hhblits -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -o results_hhblits_n4_e-7 -n 4 -e 0.0000001 -o results_hhblits_n4_e-7

HHSearch

time hhsearch -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current_hhm_db

real 13m27.782s user 13m18.120s sys 0m8.480s

PDB

Instead of mapping hits in big_80 to PDB to get structural information for the alignments, we performed an additional search in PDB at NCBI, with parameters: scoring matrix=PAM70, gap open = 10, gap extend = 1, composition based statistics, max. 1000 target sequences.

MSA

Datasets

We tried to create datasets of 4 sequences + reference sequence for the identity ranges (1: 90%-99%, 2: 60%-89%, 3: 40%-59%, 4: 20%-39%) from our search in the big_80 database. Since these results don't contain PDB structures, we additionally searched PDB directly for proteins in the required range and added them to the dataset. For the most conserved range, there is only 1 sequence in big_80, so we experimentally created the best possible highly conserved dataset in the range 80%-90% and used the structure of the reference sequence for 3D-coffee. The resulting datasets are shown in the following tables. All sequences have roughly the same length as the reference sequence except for G5AMD7 in the first set. G5AMD7 contains an insertion of 162 aa ,that is easily identified in the alignment, but shows a very high similarity in the other sections.

80-90% Sequence Identity
Sequence Identity ID Comment
90% G1P4I7 Uncharacterized protein OS=Myotis lucifugus
89% G5AMD7 Phenylalanine-4-hydroxylase OS=Heterocephalus glaber
80% G1KSL1 Uncharacterized protein OS=Anolis carolinensis
100% 1PAH used as 3D-Template only
60-89% Sequence Identity
Sequence Identity ID Comment
80% G1KSL1 Uncharacterized protein OS=Anolis carolinensis
76% Q4VBE2 Putative uncharacterized protein mgc108157 OS=Xenopus tropicalis
70% H2UJM8 Uncharacterized protein OS=Takifugu rubripes
63% D1LXB2 Phenylalanine hydroxlase OS=Saccoglossus kowalevskii
60% 2XSN_A human Tyrosine 3-Monooxygenase, also used as 3D-Template
40-59% Sequence Identity
Sequence Identity ID Comment
58% O96947 Phenylalanine hydroxylase OS=Geodia cydonium
55% D3BKZ8 Phenylalanine 4-monooxygenase OS=Polysphondylium pallidum
49% Q5RHI3 Novel protein similar to tyrosine hydroxylase OS=Danio rerio
44% A6P4D3 Tyrosine hydroxylase OS=Dugesia japonica
59% 1TOH_A Tyrosine hydroxylase from rattus norvegicus, also used as 3D-Template
20-39% Sequence Identity
Sequence Identity ID Comment
37% F4WGX3 Phenylalanine-4-hydroxylase OS=Acromyrmex echinatior
35% Q23A76 Biopterin-dependent aromatic amino acid hydroxylase family protein OS=Tetrahymena thermophila
29% D0I5S9 Phenylalanine-4-hydroxylase OS=Grimontia hollisae
28% F6G5X4 Phenylalanine-4-hydroxylase OS=Ralstonia solanacearum
31% 1LTU_A Phenylalanine-4-hydroxylase from Chromobacterium violaceum, also used as 3D-Template


Raw Results

We present the alignments and basic statistics for the different tools. Images have been created in Jalview. To count gaps and conserved columns a simple python script (Collection of scripts) was used.

ClustalW

Command used to create the alignments:

clustalw -align -infile=NN.fasta -outfile=clustalW_NN.aln
Alignment of set 4 created with ClustalW
20-39% Sequence Identity
Sequence ID Gaps
P00439 83
Q23A76 87
D0I5S9 272
F4WGX3 69
F6G5X4 181
1LTU_A 238
Conserved columns 19
Alignment of set 3 created with ClustalW
40-59% Sequence Identity
Sequence ID Gaps
P00439 37
A6P4D3 1
Q5RHI3 14
D3BKZ8 45
O96947 39
1TOH_A 146
Conserved columns 109


Alignment of set 2 created with ClustalW
60-89% Sequence Identity
Sequence ID Gaps
P00439 17
H2UJM8 7
Q4VBE2 24
G1KSL1 16
D1LXB2 18
2XSN_A 126
Conserved columns 159
Alignment of set 1 created with ClustalW
80-90% Sequence Identity
Sequence ID Gaps
P00439 165
G1P4I7 162
G1KSL1 164
G5AMD7 4
Conserved columns 329

Muscle

Command used to create the alignments:

muscle -in NN.fasta -out muscle_NN.fasta


Alignment of set 4 created with Muscle
20-39% Sequence Identity
Sequence ID Gaps
P00439 94
Q23A76 98
D0I5S9 283
F4WGX3 80
F6G5X4 192
1LTU_A 249
Conserved columns 22


Alignment of set 3 created with Muscle
40-59% Sequence Identity
Sequence ID Gaps
P00439 45
A6P4D3 9
Q5RHI3 22
D3BKZ8 53
O96947 47
1TOH_A 154
Conserved columns 109
Alignment of set 2 created with Muscle
60-89% Sequence Identity
Sequence ID Gaps
P00439 17
H2UJM8 7
Q4VBE2 24
G1KSL1 16
D1LXB2 18
2XSN_A 126
Conserved columns 159
Alignment of set 1 created with Muscle
80-90% Sequence Identity
Sequence ID Gaps
P00439 167
G1P4I7 164
G1KSL1 166
G5AMD7 6
Conserved columns 328

T-coffee

Command used to create the alignments:

t_coffee NN.fasta


Alignment of set 4 created with T-Coffee
20-39% Sequence Identity
Sequence ID Gaps
P00439 103
Q23A76 107
D0I5S9 292
F4WGX3 89
F6G5X4 201
1LTU_A 258
Conserved columns 22
Alignment of set 3 created with T-Coffee
40-59% Sequence Identity
Sequence ID Gaps
P00439 53
A6P4D3 17
Q5RHI3 30
D3BKZ8 61
O96947 55
1TOH_A 162
Conserved columns 109
Alignment of set 2 created with T-Coffee
60-89% Sequence Identity
Sequence ID Gaps
P00439 18
H2UJM8 8
Q4VBE2 25
G1KSL1 17
D1LXB2 19
2XSN_A 127
Conserved columns 159
Alignment of set 1 created with T-Coffee
80-90% Sequence Identity
Sequence ID Gaps
P00439 165
G1P4I7 162
G1KSL1 164
G5AMD7 4
Conserved columns 329

3D-coffee

Command to create the alignments:

t_coffee NN.fasta -method sap_pair,slow_pair -template_file <PDB-ID>


Alignment of set 4 created with 3D-Coffee
20-39% Sequence Identity
Sequence ID Gaps
P00439 96
Q23A76 100
D0I5S9 285
F4WGX3 82
F6G5X4 194
1LTU_A 251
Conserved columns 22
Alignment of set 3 created with 3D-Coffee
40-59% Sequence Identity
Sequence ID Gaps
P00439 51
A6P4D3 15
Q5RHI3 28
D3BKZ8 59
O96947 53
1TOH_A 160
Conserved columns 109
Alignment of set 2 created with 3D-Coffee
60-89% Sequence Identity
Sequence ID Gaps
P00439 21
H2UJM8 11
Q4VBE2 28
G1KSL1 20
D1LXB2 22
2XSN_A 130
Conserved columns 159
Alignment of set 1 created with 3D-Coffee
80-90% Sequence Identity
Sequence ID Gaps
P00439 165
G1P4I7 162
G1KSL1 164
G5AMD7 4
Conserved columns 329

Discussion of the alignment results and alignment tools

Looking at the used tools, for these small datasets (4-6 sequences) there is not much of a difference in execution time, but t-coffee offers the most options and different modes (we used only two here, but it still counts). Comparing the results, the programms perform quite similar, but can be sorted by the basic statistics of inroduced gaps and found conserved columns in the following order: ClustalW> Muscle>3D-Coffee>T-Coffee

to be continued shortly..