Difference between revisions of "Sequence Search and Multiple Sequence Alignment (PKU)"

Revision as of 13:57, 7 May 2012

Short Task Description

The goal of this task is to learn a bit about our protein of interest (phenylalanine hydroxylase/PheOH) from existing databases and comparing it to similar (and probably related) sequences. To do this, we searched the big80 database (7.73 mio sequences), which contains a subset of swissprot and pdb with sequence similarity of 80% or less, with blast (sequence comparison) and psi-blast (profile comparison) and searched a preprocessed and clustered version of uniprot (44,757 clusters) with the more advanced HHblits method (HMM comparison).

From the search results, we created several datasets sorted by sequence similarity (highly conserved, less conserved, a little conserved, ...) and created multiple aligmnents from these sets with four different tools. The results are listed below, the commands, programms and scripts we used are listed at the appropriate places in the text or linked to their own site.

Reference Sequence of PAH

This is the sequence of the fully functional PheOH used as query and reference sequence in all our work.

>sp|P00439|PH4H_HUMAN Phenylalanine-4-hydroxylase OS=Homo sapiens GN=PAH PE=1 SV=1
MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM
EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF
RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE
KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR
IEVLDNTQQLKILADSINSEIGILCSALQKI

Database Searches

Blast

time blast2 -p blastp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i Dropbox/Phenylketonuria/Task2/PAH.fasta -o results_blast2_standard -v 2000 -b 2000

real 1m34.784s user 1m22.930s sys 0m8.400s

time blast2 -p blastp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i Dropbox/Phenylketonuria/Task2/PAH.fasta -o results_blast2_e-10 -e 0.0000000001 -v 2000 -b 2000

real 1m34.145s user 1m23.380s sys 0m9.270s

PSIBlast

In order to display the runtime PSIBlast uses in order to get results you asked for we used a little skript which uses PSIblast with a variety of parametersettings. The overall message is that more iterations one uses used the more runtime PSIBlast needs. and the harder (lower) you choose your e-Value cutoff the less results you get. A general Message cant be more precise, because all the results depend mostly on the data you feed the programm.

Iterations

If you change the number of iterations (parameter -j) your runtime increases according to the parameter. To show this we used the defaultsettings for anything and increased the iterations from 1 to 10 (see Fig. <xr id="fig:runtimeIterations"/>)

more iterations result in a higher runtime, but a change in -e or -h do not mean a higher or lower runtime by themselves. A combination of more iterations and a lower integrations rate might be faster as less iterations though

</figure>

e-Values

A change in the -e parameter does not change your runtime (see <xr id=fig:runtimeIterations />) but in- or decreases the amount of sequences you derive from your search. This is shown in <xr id="fig:cutoffSize"/> <figure id="fig:cutoffSize">

Change in the -e parameter results in a lower resultsize

</figure>

Plots

All of the following plots emphasis for our example how a change of parameter changes the outcome of the searchtool.

e-Values

almost no restriction for the e-Values e=0.1

</figure><figure id="fig:cutoffhisto2">

a bit harded restriction to the e-Values e=0.0001

</figure><figure id="fig:cutoffhisto3">

even stronger restriction to e=0.000000001

</figure>

<xr id="fig:cutoffhisto1"/>, <xr id="fig:cutoffhisto2"/> and <xr id="fig:cutoffhisto3"/> show that the distribution of the e-Values one derives from their search changes if you restrict them. Which also decreases the size of your result as shown above.

HHBlits

time hhblits -i Dropbox/Phenylketonuria/Task2/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -o results_hhblits_standard -E 1 -e 0.001 -Z 2000 -B 2000 real 6m10.059s user 3m15.640s sys 0m40.220s

hhblits -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -o results_hhblits_n4_e-7 -n 4 -e 0.0000001 -o results_hhblits_n4_e-7

PDB

Instead of mapping hits in big_80 to PDB to get structural information for the alignments, we performed an additional search in PDB at NCBI, with parameters: scoring matrix=PAM70, gap open = 10, gap extend = 1, composition based statistics, max. 1000 target sequences.

MSA

Datasets

We tried to create datasets of 4 sequences + reference sequence for the identity ranges (1: 90%-99%, 2: 60%-89%, 3: 40%-59%, 4: 20%-39%) from our search in the big_80 database. Since these results do not contain PDB structures, we additionally searched PDB directly for proteins in the required range and added them to the dataset. For the most conserved range, there is only 1 sequence in big_80, so we experimentally created the best possible highly conserved dataset in the range 80%-90% and used the structure of the reference sequence for 3D-coffee. The resulting datasets are shown in the following tables, the alignments in the figures <xr id="fig:ClustalW_set4"/> to <xr id="fig:3d_coffee_set1"/>. All sequences have roughly the same length as the reference sequence except for G5AMD7 in the first set. G5AMD7 contains an insertion of 162 aa ,that is easily identified in the alignment, but shows a very high similarity in the other sections.

80-90% Sequence Identity
Sequence Identity	ID	Comment
90%	G1P4I7	Uncharacterized protein OS=Myotis lucifugus
89%	G5AMD7	Phenylalanine-4-hydroxylase OS=Heterocephalus glaber
80%	G1KSL1	Uncharacterized protein OS=Anolis carolinensis
100%	1PAH	used as 3D-Template only

60-89% Sequence Identity
Sequence Identity	ID	Comment
80%	G1KSL1	Uncharacterized protein OS=Anolis carolinensis
76%	Q4VBE2	Putative uncharacterized protein mgc108157 OS=Xenopus tropicalis
70%	H2UJM8	Uncharacterized protein OS=Takifugu rubripes
63%	D1LXB2	Phenylalanine hydroxlase OS=Saccoglossus kowalevskii
60%	2XSN_A	human Tyrosine 3-Monooxygenase, also used as 3D-Template

40-59% Sequence Identity
Sequence Identity	ID	Comment
58%	O96947	Phenylalanine hydroxylase OS=Geodia cydonium
55%	D3BKZ8	Phenylalanine 4-monooxygenase OS=Polysphondylium pallidum
49%	Q5RHI3	Novel protein similar to tyrosine hydroxylase OS=Danio rerio
44%	A6P4D3	Tyrosine hydroxylase OS=Dugesia japonica
59%	1TOH_A	Tyrosine hydroxylase from rattus norvegicus, also used as 3D-Template

20-39% Sequence Identity
Sequence Identity	ID	Comment
37%	F4WGX3	Phenylalanine-4-hydroxylase OS=Acromyrmex echinatior
35%	Q23A76	Biopterin-dependent aromatic amino acid hydroxylase family protein OS=Tetrahymena thermophila
29%	D0I5S9	Phenylalanine-4-hydroxylase OS=Grimontia hollisae
28%	F6G5X4	Phenylalanine-4-hydroxylase OS=Ralstonia solanacearum
31%	1LTU_A	Phenylalanine-4-hydroxylase from Chromobacterium violaceum, also used as 3D-Template

Raw Results

We present the alignments and basic statistics for the different tools. Images have been created in Jalview. To count gaps and conserved columns a simple python script (Collection of scripts) was used.

ClustalW

Command used to create the alignments:

clustalw -align -infile=NN.fasta -outfile=clustalW_NN.aln

Alignment of set 4 created with ClustalW

</figure>

20-39% Sequence Identity
Sequence ID	Gaps
P00439	83
Q23A76	87
D0I5S9	272
F4WGX3	69
F6G5X4	181
1LTU_A	238
Conserved columns	19

Alignment of set 3 created with ClustalW

</figure>

40-59% Sequence Identity
Sequence ID	Gaps
P00439	37
A6P4D3	1
Q5RHI3	14
D3BKZ8	45
O96947	39
1TOH_A	146
Conserved columns	109

Alignment of set 2 created with ClustalW

</figure>

60-89% Sequence Identity
Sequence ID	Gaps
P00439	17
H2UJM8	7
Q4VBE2	24
G1KSL1	16
D1LXB2	18
2XSN_A	126
Conserved columns	159

Alignment of set 1 created with ClustalW

</figure>

80-90% Sequence Identity
Sequence ID	Gaps
P00439	165
G1P4I7	162
G1KSL1	164
G5AMD7	4
Conserved columns	329

Muscle

Command used to create the alignments:

muscle -in NN.fasta -out muscle_NN.fasta

Alignment of set 4 created with Muscle

</figure>

20-39% Sequence Identity
Sequence ID	Gaps
P00439	94
Q23A76	98
D0I5S9	283
F4WGX3	80
F6G5X4	192
1LTU_A	249
Conserved columns	22

Alignment of set 3 created with Muscle

</figure>

40-59% Sequence Identity
Sequence ID	Gaps
P00439	45
A6P4D3	9
Q5RHI3	22
D3BKZ8	53
O96947	47
1TOH_A	154
Conserved columns	109

Alignment of set 2 created with Muscle

</figure>

60-89% Sequence Identity
Sequence ID	Gaps
P00439	17
H2UJM8	7
Q4VBE2	24
G1KSL1	16
D1LXB2	18
2XSN_A	126
Conserved columns	159

Alignment of set 1 created with Muscle

</figure>

80-90% Sequence Identity
Sequence ID	Gaps
P00439	167
G1P4I7	164
G1KSL1	166
G5AMD7	6
Conserved columns	328

T-coffee

Command used to create the alignments:

t_coffee NN.fasta

Alignment of set 4 created with T-Coffee

</figure>

20-39% Sequence Identity
Sequence ID	Gaps
P00439	103
Q23A76	107
D0I5S9	292
F4WGX3	89
F6G5X4	201
1LTU_A	258
Conserved columns	22

Alignment of set 3 created with T-Coffee

</figure>

40-59% Sequence Identity
Sequence ID	Gaps
P00439	53
A6P4D3	17
Q5RHI3	30
D3BKZ8	61
O96947	55
1TOH_A	162
Conserved columns	109

Alignment of set 2 created with T-Coffee

</figure>

60-89% Sequence Identity
Sequence ID	Gaps
P00439	18
H2UJM8	8
Q4VBE2	25
G1KSL1	17
D1LXB2	19
2XSN_A	127
Conserved columns	159

Alignment of set 1 created with T-Coffee

</figure>

80-90% Sequence Identity
Sequence ID	Gaps
P00439	165
G1P4I7	162
G1KSL1	164
G5AMD7	4
Conserved columns	329

3D-coffee

Command to create the alignments:

t_coffee NN.fasta -method sap_pair,slow_pair -template_file <PDB-ID>

Alignment of set 4 created with 3D-Coffee

</figure>

20-39% Sequence Identity
Sequence ID	Gaps
P00439	96
Q23A76	100
D0I5S9	285
F4WGX3	82
F6G5X4	194
1LTU_A	251
Conserved columns	22

Alignment of set 3 created with 3D-Coffee

</figure>

40-59% Sequence Identity
Sequence ID	Gaps
P00439	51
A6P4D3	15
Q5RHI3	28
D3BKZ8	59
O96947	53
1TOH_A	160
Conserved columns	109

Alignment of set 2 created with 3D-Coffee

</figure>

60-89% Sequence Identity
Sequence ID	Gaps
P00439	21
H2UJM8	11
Q4VBE2	28
G1KSL1	20
D1LXB2	22
2XSN_A	130
Conserved columns	159

Alignment of set 1 created with 3D-Coffee

</figure>

80-90% Sequence Identity
Sequence ID	Gaps
P00439	165
G1P4I7	162
G1KSL1	164
G5AMD7	4
Conserved columns	329

Discussion of the alignment results and alignment tools

Looking at the used tools, for these small datasets (4-6 sequences) there is not much of a difference in execution time, so we do not use time as a factor in our assessment. Muscle, ClustalW and T-Coffee are similar and easy to use, but T-Coffee offers the most options and different modes (we used only two here) and provides the most detailed output. To assess the alignments, we looked also at the number of gaps and conserved residues, but consider the most important factor in alignment quality the proper alignment of the functional sequence parts.

Comparing the results, the tools also perform quite similar, but can be sorted by the basic statistics of introduced gaps and found conserved columns in the following order: ClustalW> Muscle>3D-Coffee>T-Coffee.

T-Coffee using the additional 3D information performs minimally better in terms of gaps, more notably in the less conserved datasets. An interesting point that stands out might be that 3D-Coffe always places the first residues -- mostly methionines -- together, even if for a shorter sequence missing e.g. an N-terminal part this means a very long gap between this first residue and the rest of the sequence (see figure <xr id="fig:3d_coffee_set2"/> for an example). This behaviour does not influence the alignment score much, might even be a 'better' alignment as the first methionines belong together, but the single residue still feels a little misplaced to us.

In the most conserved dataset the aligments differ only marginally and due to the high overall conservation, the functional domains are obviously aligned correctly as well. For the other sets of sequences, we looked more closely at the central catalytic and the N-terminal regulatory domain:

The catalytic domain of PheOH appears around 280aa-340aa, with the iron-coordinating His285, His290 and Glu330. (needs reference..) This part of the sequence is recognized by all tools in all datasets and the coordinating residues are aligned correctly. In all but the least conserved sets, the alignments are virtually identical in the functional parts and differ only in the placement of gaps. The most differences appear at the N-terminus and likely stem from the different lengths of the sequences that have to be reconciled in the alignment.

A real difference in the results of the tools can be seen near the N-terminus of set 4. T-Coffe, 3D-Coffee and Muscle find a conserved region around Trp187, while ClustalW misaligns D0I5S9, which is the shortest sequence of the set. Look at positions 190-210 of the alignment by ClustalW and 210-230 by the others.

Our favorite tools for this task was probably T-Coffee, due to good results and the wide range of option. Clustal produces less gaps, but misaligned at least one sequence. Muscle does not perform much different from T-Coffee, but provides less functionality. The advantage of muscle, its emphasis on speed, is, as stated before, not taken into account here.

@@ Line 45: / Line 45: @@
 =====e-Values=====
 <figure id="fig:cutoffhisto1">[[File:pkueValuecutoffhisto1.png|thumb|left|300px|<caption>almost no restriction for the e-Values e=0.1</caption>]]</figure><figure id="fig:cutoffhisto2">[[File:pkueValuecutoffhisto2.png|thumb|center|300px|<caption>a bit harded restriction to the e-Values e=0.0001</caption>]]</figure><figure id="fig:cutoffhisto3">[[File:pkueValuecutoffhisto3.png|thumb|right|300px|<caption>even stronger restriction to e=0.000000001</caption>]]</figure>
-<xr id="fig:cutoffhisto3"/>, <xr id="fig:cutoffhisto2"/> and <xr id="fig:cutoffhisto3"/> show that the distribution of the e-Values one derives from their search changes if you restrict them. Which also decreases the size of your result as shown above.
+<xr id="fig:cutoffhisto1"/>, <xr id="fig:cutoffhisto2"/> and <xr id="fig:cutoffhisto3"/> show that the distribution of the e-Values one derives from their search changes if you restrict them. Which also decreases the size of your result as shown above.
 ===HHBlits===
@@ Line 57: / Line 57: @@
 ===PDB===
 Instead of mapping hits in big_80 to PDB to get structural information for the alignments, we performed an additional search in PDB at NCBI, with parameters: scoring matrix=PAM70, gap open = 10, gap extend = 1, composition based statistics, max. 1000 target sequences.
 ==MSA==

Difference between revisions of "Sequence Search and Multiple Sequence Alignment (PKU)"

Revision as of 13:57, 7 May 2012

Contents

Short Task Description

Reference Sequence of PAH

Database Searches

Blast

PSIBlast

Iterations

e-Values

Plots

e-Values

HHBlits

PDB

MSA

Datasets

Raw Results

ClustalW

Muscle

T-coffee

3D-coffee

Discussion of the alignment results and alignment tools

Navigation menu

Views

Personal tools

Bioinformatik navigation

MediaWiki navigation

Search

Tools