Sequence searches and multiple sequence alignments (Phenylketonuria)

From Bioinformatikpedia
(Redirected from Phenylketonuria/Task2)

Summary of the task

In this task we compare the protein sequence of interest, in this case the phenylalanine hydroxylase (PAH), to other protein sequences. Therefore both sequence searches and multiple sequence alignments were done using the big80 database meaning a database that contains subsets of swissprot and pdb, where the entries have a sequence similarity of 80% or less. Furthermore searches against a pdb database were done. For sequence searches the programs BLAST, PSIBLAST and HHblits are used. Their results were taken for the creation of multiple sequence alignments (MSA) using the methods ClustalW, Muscle and TCoffee.

Sequence searches

Lab journal

Comparison of the results

In this part the results of the different sequence searches are analyzed. Therefore the outputs are parsed with Marias Parser and are filtered for their IDs, sequence identities and e-values.

Sequence identity in percent

</figure> </figure> </figure>
<figure id="fig:blast">
Sequence identity distribution of the BLAST search; the x-axis shows the sequence identity from 0 (=0%) to 1 (=100%) and the y-axis the number of sequences that got this sequence identity to the PAH sequence in the BLAST search.
<figure id="fig:psiblast">
Sequence identity distribution of the PSI-BLAST searches; the x-axis shows the sequence identity from 0 (=0%) to 1 (=100%) and the y-axis the number of sequences that got this sequence identity to the PAH sequence in the PSI-BLAST searches: two iterations with an e-value cutoff of 0.002 (blue) and with an e-value cutoff of 10e-10 (yellow); ten iterations with an e-value cutoff of 0.002 (green) and with an e-value cutoff of 10e-10.
<figure id="fig:hhblits">
Sequence identity distribution of the HHblits search; the x-axis shows the sequence identity from 0 (=0%) to 1 (=100%) and the y-axis the number of sequences that got this sequence identity to the PAH sequence in the HHblits search.


At comparing the sequence identities of the different search tools, it could be seen that BLAST (<xr id="fig:blast"/>) and the distributions of PSI-BLAST (<xr id="fig:psiblast"/>) with two iterations for both e-value cutoffs show similar distributions with a maximum frequency between 35% and 40% sequence identity. For ten iterations the curve is shifted a bit to the left with a maximum frequency at about 20%. Altogether, it can be seen that more sequences were found in ten iterations than in two, however with less sequence identity. This can be ascribed to the fact that the profile gets less specific in each iteration. The runs with an e-value cutoff of 10e-10 show a lower number of sequences than the runs with a cutoff of 0.002 as 10e-10 is a more significant cutoff. However, that cutoff is really strict and has some false negatives with higher probability, whereas the cutoff of 0.002 is not very significant and likely includes some false positives. The highest number of sequences was found in the HHblits search (<xr id="fig:hhblits"/>), which distribution of sequence identity shows high similarity to the distribution of PSI-BLAST run with ten iterations.

E-Value

</figure> </figure> </figure>
<figure id="fig:blast-evalue">
Logarithmic e-value distribution of the BLAST search; the x-axis shows the logarithmic e-value and the y-axis the frequency of sequences with that specific e-value; If the logarithmic e-value is smaller the evalue is better.
<figure id="fig:psiblast-evalue">
Logarithmic e-value distribution of the PSI-BLAST search: two iterations with an e-value cutoff of 0.002 (blue) and with an e-value cutoff of 10e-10 (yellow); ten iterations with an e-value cutoff of 0.002 (green) and with an e-value cutoff of 10e-10; the x-axis shows the logarithmic e-value and the y-axis the frequency of sequences with that specific e-value; If the logarithmic e-value is smaller the evalue is better.
<figure id="fig:hhblits-evalue">
Logarithmic e-value distribution of the HHblits search; the x-axis shows the logarithmic e-value and the y-axis the frequency of sequences with that specific e-value; If the logarithmic e-value is smaller the evalue is better.

The distributions of the logarithmic e-values of the sequence searches all look similar with a lowest and best value beneath -400. Nevertheless, it goes up to higher than 0 in BLAST (<xr id="fig:blast-evalue"/>) and HHblits(<xr id="fig:hhblits-evalue"/>). However, the maximum frequency for the e-values in BLAST is still in negativ range, whereas in HHblits it is in positive range meaning that the e-values are not as good. Best e-values are found for PSI-BLAST searches especially for 10 iterations (<xr id="fig:psiblast-evalue"/>). They all show lower frequency at lower e-values and their maximum at higher e-values.


GO-terms

For the reference sequence (P00439) GO-terms (<xr id="go"/>) were found on QuickGO. To look for similarities between the reference sequence and the sequences found in the searches we wrote a program which download those terms for all detected sequences and counted how often the GO terms of PAH are found for the other sequences: <figtable id="go">

"GO numbers and terms of PAH and their number of occurence in the different search results"
GO GO-Term BLAST PSIBLAST

2 x 0.002

PSIBLAST

2 x 10e-10

PSIBLAST

10 x 0.002

PSIBLAST

10 x 10e-10

HHblits
GO:0008152 metabolic process 554 808 818 712 713 5701
GO:0016597 amino acid binding 554 808 818 712 713 5611
GO:0055114 oxidation-reduction process 458 457 459 443 438 3522
GO:0005506 iron ion binding 449 448 450 434 431 1043
GO:0009072 aromatic amino acid family metabolic process 449 448 450 434 431 1043
GO:0004497 monooxygenase activity 449 448 450 434 431 1043
GO:0016714 oxidoreductase activity,

acting on paired donors,

with incorporation or reduction of molecular oxygen,

reduced pteridine as one donor,

and incorporation of one atom of oxygen

445 443 444 434 431 1033
GO:0004505 phenylalanine 4-monooxygenase activity 207 207 207 205 205 442
GO:0006559 L-phenylalanine catabolic process 165 165 165 165 165 395
GO:0016491 oxidoreductase activity 158 157 157 154 154 2456
GO:0003824 catalytic activity 12 30 28 33 32 3237
GO:0042423 catecholamine biosynthetic process 15 25 15 15 15 80
GO:0008652 cellular amino acid biosynthetic process 6 10 11 12 12 1762
GO:0046872 metal ion binding 7 6 6 6 6 1141
GO:0042136 neurotransmitter biosynthetic process 2 2 2 2 2 12
GO:0005829 cytosol 0 1 1 2 2 16
GO:0034641 cellular nitrogen compound metabolic process 0 0 0 0 0 4
GO:0044281 small molecule metabolic process 0 0 0 0 0 4

GO terms and their number of occurence for the protein sequences found in different sequence searches: BLAST, PSI-BLAST with 2 iterations and e-value of 0.002 and 10e-10 and with 10 iterations with the same e-values and finally HHblits. </figtable> When you look at the GO-terms it is obvious that more general descriptions like 'metabolic process' or 'amino acid binding' are found very often. The more specific the GO-terms of PAH gets the less sequences in the searches have the same description. However, in HHblits some of the terms like for example 'catalytic activity' are detected more often as in the other searches even it is incorporated that there are much more sequences found in the HHblits clusters.

Best results

In this part the sequences with highest sequence identity and lowest e-value are compared for the different searches.

Sequence identity

In the following table (<xr id="hhblits"/>) the four proteins with the highest sequence identity found in the HHblits run are presented. <figtable id="hhblits">

Best sequence identities in HHblits
Uinprot-ID Identity E-Value
C9JMN0 1.0 5.0E-6
Q66RJ9 1.0 0.006
Q16021 0.96 8.4E-12
Q66RJ7 0.96 7.1E-4
Proteins with highest identity in HHblits

</figtable> None of these sequences are found in BLAST or the four PSI-BLAST searches. However, some proteins with highest sequence identity found with BLAST also can be found with PSI-BLAST (<xr id="blast"/>). <figtable id="blast">

Best sequence identities in BLAST
Uinprot-ID Identity E-Value Identity

PSI-BLAST 2x0.002

Identity

PSI-BLAST 2x10e-10

Identity

PSI-BLAST 10x0.002

Identity

PSI-BLAST 10x10e-10

F6XY00 0.92 0.0 0.95 0.95 0.93 0.96
L9L9N2 0.89 0.0 0.93 0.92 0.95 0.90
G5AMD7 0.89 0.0 0.93 0.93 0.91 0.91
D3YZ73 0.85 7.0E-20 0.73 0.73 - -
G1KSL1 0.81 0.0 0.82 0.82 0.82 0.79
Proteins with highest identity in BLAST and their identity values in the four PSI-BLAST runs if found.

</figtable> Additionally the sequences (F6XY00, L9L9N2, G5AMD7, G1KSL1) also are the best four in the PSI-BLAST runs only not in the same order but with better e-values than in BLAST. They are not included in the HHblits result. Only D3YZ73 is not under the best in PSI-BLAST and is not found after ten iterations at all. However, in HHblits it is found with a sequence identity of 0.71.

E-Value

The best e-values are those which are lowest. In HHblits the three lowest values are 5.5e-175 (<xr id="hh-eval"/>). <figtable id="hh-eval">

Best e-values in HHblits
Uinprot-ID E-Value Identity
A7UTU7 5.5e-175 0.68
P17752 5.5e-175 0.68
P00439 5.5e-175 0.68
Proteins with lowest e-value in HHblits

</figtable> Again none of these protein sequences are found in BLAST or the four PSI-BLAST searches. The proteins with best e-values of the BLAST run (<xr id="bl-eval"/>), however, are found in all searches, besides H3FAJ0, which is not included in the HHblits result. They have very good e-values in all runs, but are under the three best in BLAST only. <figtable id="bl-eval">

Best e-values in BLAST
Uinprot-ID E-Value Identity E-Value

PSI-BLAST 2x0.002

E-Value

PSI-BLAST 2x10e-10

E-Value

PSI-BLAST 10x0.002

E-Value

PSI-BLAST 10x10e-10

E-Value

HHblits

H3FAJ0 1.0e-177 0.544276457883369 1.0E-157 1.0E-162 1.0E-122 1.0E-126 -
A8WSM6 1.0e-177 0.563636363636364 1.0E-159 1.0E-164 1.0E-127 1.0E-131 1.1E-172
A9UUJ8 1.0e-176 0.567695961995249 1.0E-156 1.0E-160 1.0E-127 1.0E-131 5.5E-175
Proteins with lowest e-value in BLAST and their values in PSI-BLAST and also in HHblits

</figtable>

At PSI-BLAST the three best proteins of the run with two iterations and an e-value cutoff of 10e-10 and the run with ten iterations and an e-value cutoff of 0.002 are the same. The best one is H2UJM8 with an e-value of 1e-179. For two iterations and e-value cutoff of 0.002 it is Q4VBE2 with an e-value of 1e-178 and for ten iterations and e-value cutoff of 10e-10 it is K7F3H7 with an e-value of 1e-143, which is good, but not as good as the other best ones.
When you compare sequence identity and e-value, that sequences with best identities often have not as good e-values and vice versa is remarkable. Especially in the BLAST runs e-values of only 0.0 for the sequences with highest sequence similarity are reached and the sequences with lowest e-values only have sequence identities slightly above 54%. Therefore you can see that a good balance between high sequence identity and low e-value is very hard to get, but necessary to get trustable results.

Multiple sequence alignments

For the images of the Alignments Jalview was used. Colours are shown in Clustalx Format.
Lab journal

Datasets

For the multiple sequence alignments four different datasets were generated with the BLAST outputs and a Python script. Three groups with ten sequences (two of them are pdb-sequences): one with higher than 60% sequence identity to our target sequence (PAH gene), another group with sequence identity between 30% and 60% and one group with lower sequence identity than 30% (<xr id="datasets"/>). In the group with < 30% identity the pdb sequences have a 32% identity, because these are the lowest ones found in the Blast output against the pdb dataset. The fourth group contains 20 sequences with a whole range of sequence identities and four of them are pdb sequences (<xr id="set_all"/>). For the comparison to our reference sequence of the Phenylalanine hydroxylase (PAH - P00439) enzyme we added this sequence to all four groups. <figtable id="datasets">

a) Dataset "low":
Group of sequences with < 30%

(pdb = 32%) identity

Sequence identity ID
29% K0WHA3
29% L0FXW8
29% B4R9Q9
28% C9P8B8
25% I3YW84
25% G0L2J6
25% Q9AG78
24% B4UJD0
32% 1ltu_A (pdb)
32% 1ltz_A (pdb)
b) Dataset "medium":
Group of sequences between

30% and 60% identity

Sequence identity ID
54% I1C3M2
53% F6ZKP1
51% Q54XS1
48% G1KSP0
45% E5SYS4
42% H3EDU0
38% H0HJI4
36% F4WGX3
59% 2toh_A (pdb)
45% 2qmx_A (pdb)
c) Dataset "high":
Group of sequences with

> 60% identity

Sequence identity ID
70% H2UJM8
67% K1RSS1
65% E7D1A7
63% G7YPD5
63% D1LXB2
62% G1KJG2
61% G9B2G8
61% E9RJV0
96% 2pah_A (pdb)
96% 1kw0_A (pdb)
The three datasets with sequence identity a) lower than 30% (and two pdb sequences with 32% identity b) between 30% and 60% and c) bigger than 60%.

</figtable>

<figtable id="set_all">

Dataset "all":
Group of sequences with different identities (0-100%)
Sequence identity ID Sequence identity ID
63% D1LXB2 33% A9D485
63% G3MQ02 33% G7UU95
56% A9UUJ8 31% I3TIA8
54% D2XNL7 31% A0Y3T4
52% E9G1D2 29% D5HB02
44% I1GDE7 27% A1ZW97
41% H1L1J2 96% 5pah_A (pdb)
37% D7AKY2 96% 3pah_A (pdb)
36% B7GIR4 96% 2pah_A (pdb)
35% A0C973 33% 2v27_B (pdb)
Set with whole range of sequence identities including four pdb sequences.

</figtable>

ClustalW

<figure id="clustalW">

a) Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with ClustalW
b) Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with ClustalW
c) Multiple Sequence Alignment of the "high" dataset with sequences >60% identity, which was created with ClustalW
d) Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with ClustalW


Multiple sequence alignments for the four datasets a) low, b) medium, c) high and d) all created with ClustalW.

</figure>

Muscle

<figure id="muscle">

a) Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with Muscle
b) Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with Muscle
c) Multiple Sequence Alignment of the "high" dataset with sequences >60% identity, which was created with Muscle
d) Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with Muscle


Multiple sequence alignments for the four datasets a) low, b) medium, c) high and d) all created with Muscle.

</figure>

T-Coffee

<figure id="tCoffee">

a) Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with T-Coffee
b) Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with T-Coffee
c) Multiple Sequence Alignment of the "high" dataset with sequences >60% identity, which was created with T-Coffee
d) Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with T-Coffee


Multiple sequence alignments for the four datasets a) low, b) medium, c) high and d) all created with T-Coffee.

</figure>

Expresso(3D-Coffee)

Expresso is an extension of 3D-Coffee and uses BLAST to search the PDB database for structures whose sequences are similar to the given sequences. These structures are then used to build the alignment. It is slowlier than T-Coffee itself, but if it finds enough structures it is more accurate than the other programms.

Since we could not run Expresso on the server, we have used this website for the multiple sequence alignments with Expresso (3D-Coffee).

<figure id="expresso">

a) Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with Expresso(3D-Coffee)
b) Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with Expresso(3D-Coffee)
c) Multiple Sequence Alignment of the "high" dataset with sequences > 60% identity, which was created with Expresso(3D-Coffee)
d) Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with Expresso(3D-Coffee)


Multiple sequence alignments for the four datasets a) low, b) medium, c) high and d) all created with Expresso.

</figure>

Results

The following tables (<xr id="low_set"/>,<xr id="medium_set"/> and <xr id="high_set"/>) show the gaps that are included during the creation of the multiple alignments by ClustalW, Muscle, T-Coffee and Expresso. Thereby the sequence of PAH (P00439) is shown in bold, whereas pdb sequences are marked with *.

</figtable>

</figtable>

</figtable>

<figtable id="low_set">
Gaps in Dataset "low":
Group of sequences with < 30% (pdb = 32%) identity
ID ClustalW MUSCLE T-COFFEE EXPRESSO
P00439 3 9 3 3
K0WHA3 230 236 230 230
L0FXW8 401 407 401 401
B4R9Q9 339 345 339 339
C9P8B8 223 229 223 223
I3YW84 396 402 396 396
G0L2J6 336 342 336 336
Q9AG78 399 405 399 399
B4UJD0 280 286 280 280
*1ltu_A 399 405 399 399
*1ltz_A 237 243 237 237

Gaps that are included creating the multiple alignments with the dataset "low". The first column contains the Uniprot IDs, whereas the following columns show the gaps inserted by ClustalW, Muscle, T-Coffee and Expresso.
<figtable id="medium_set">
Gaps in Dataset "medium":
Group of sequences between 30% and 60% identity
ID ClustalW MUSCLE T-COFFEE EXPRESSO
P00439 10 21 25 29
I1C3M2 37 48 52 56
F6ZKP1 20 31 35 39
Q54XS1 43 54 58 62
G1KSP0 287 298 302 306
E5SYS4 297 308 312 316
H3EDU0 75 86 90 94
H0HJI4 402 413 417 421
F4WGX3 283 294 298 302
*2toh_A 282 293 297 301
*2qmx_A 416 427 431 435

Gaps that are included creating the multiple alignments with the dataset "medium". The first column contains the Uniprot IDs, whereas the following columns show the gaps inserted by ClustalW, Muscle, T-Coffee and Expresso.
<figtable id="high_set">
Gaps in Dataset "high":
Group of sequences with > 60% identity
ID ClustalW MUSCLE T-COFFEE EXPRESSO
P00439 8 13 15 15
H2UJM8 20 25 27 27
K1RSS1 280 285 287 287
E7D1A7 342 347 349 349
G7YPD5 115 120 122 122
D1LXB2 162 167 169 169
G1KJG2 220 225 227 227
G9B2G8 42 47 49 49
E9RJV0 353 358 360 360
*2pah_A 220 225 227 227
*1kw0_A 160 165 167 167

Gaps that are included creating the multiple alignments with the dataset "high". The first column contains the Uniprot IDs, whereas the following columns show the gaps inserted by ClustalW, Muscle, T-Coffee and Expresso.

Discussion

Comparing the different tools similar outcomes can be viewed. As expected the dataset with low similarities shows a lot of gaps. Only for our protein P00439 (PAH) few gaps are included. Therefore, the other sequences are shorter and only cover a small part of our protein, which goes roughly from position 175 to 295. Only Muscle shows more gaps than the other tools. For the dataset with identities between 30% and 60% higher conservations could be recognised. Again the results of the different tools are similar, whereby sometimes the conserved regions are shifted due to some gaps inserted between them. The set with the highest sequence similarities shows least gaps and also only few differences between the tools. ClustalW inserted fewest gaps. For the datasets low and high T-Coffee and Expresso have identical numbers of gaps. The multiple alignments can be viewed in <xr id="clustalW"/> with the results of ClustalW,in <xr id="muscle"/> with the results of Muscle, in <xr id="tCoffee"/> with the results of T-Coffee and in <xr id="expresso"/> with the results of Expresso. Thereby a) always comprises the MSA of the low, b) of the medium and c) of the high dataset. d), on the other hand, shows the MSA, which was created with 20 sequences that covers the whole range of similarity with identities between 27% and 96%. Thereby, T-Coffee and Expresso again show very similar outcomes and also Muscle is not that different to those, too. ClustalW, however, seems to have more problems to make a good alignment even with the similar sequences and the conserved amino acids are picked to pieces. Even there are some sequences with low similarities T-Coffee and Expresso show good alignments. Altogether we think that a medium sequence identity is sufficient to get a good MSA. Nevertheless, it is also possible to include sequences with less identities. In such cases also sequences with high similarities should be included and T-Coffee seems to be a good choice to create a reliable MSA.