Sequence searches and multiple sequence alignments (Phenylketonuria)

Summary of the task

In this task we compare the protein sequence of interest, in this case the phenylalanine hydroxylase (PAH), to other protein sequences. Therefore both sequence searches and multiple sequence alignments were done using the big80 database meaning a database that contains subsets of swissprot and pdb, where the entries have a sequence similarity of 80% or less. Furthermore searches against a pdb database were done. For sequence searches the programs BLAST, PSIBLAST and HHblits are used. Their results were taken for the creation of multiple sequence alignments (MSA) using the methods ClustalW, Muscle and TCoffee.

Sequence searches

Lab journal

Comparison of the results

In this part the results of the different sequence searches are analyzed. Therefore the outputs are parsed with Marias Parser and are filtered for their IDs, sequence identities and e-values.

Sequence identity in percent

</figure> </figure> </figure>

Sequence identity distribution of the BLAST search; the x-axis shows the sequence identity from 0 (=0%) to 1 (=100%) and the y-axis the number of sequences that got this sequence identity to the PAH sequence in the BLAST search.
<figure id="fig:blast">
<figure id="fig:psiblast">
<figure id="fig:hhblits">

At comparing the sequence identities of the different search tools, it could be seen that BLAST (<xr id="fig:blast"/>) and the distributions of PSI-BLAST (<xr id="fig:psiblast"/>) with two iterations for both e-value cutoffs show similar distributions with a maximum frequency between 35% and 40% sequence identity. For ten iterations the curve is shifted a bit to the left with a maximum frequency at about 20%. Altogether, it can be seen that more sequences were found in ten iterations than in two, however with less sequence identity. This can be ascribed to the fact that the profile gets less specific in each iteration. The runs with an e-value cutoff of 10e-10 show a lower number of sequences than the runs with a cutoff of 0.002 as 10e-10 is a more significant cutoff. However, that cutoff is really strict and has some false negatives with higher probability, whereas the cutoff of 0.002 is not very significant and likely includes some false positives. The highest number of sequences was found in the HHblits search (<xr id="fig:hhblits"/>), which distribution of sequence identity shows high similarity to the distribution of PSI-BLAST run with ten iterations.

E-Value

</figure> </figure> </figure>

Logarithmic e-value distribution of the BLAST search; the x-axis shows the logarithmic e-value and the y-axis the frequency of sequences with that specific e-value; If the logarithmic e-value is smaller the evalue is better.
<figure id="fig:blast-evalue">
<figure id="fig:psiblast-evalue">
<figure id="fig:hhblits-evalue">

The distributions of the logarithmic e-values of the sequence searches all look similar with a lowest and best value beneath -400. Nevertheless, it goes up to higher than 0 in BLAST (<xr id="fig:blast-evalue"/>) and HHblits(<xr id="fig:hhblits-evalue"/>). However, the maximum frequency for the e-values in BLAST is still in negativ range, whereas in HHblits it is in positive range meaning that the e-values are not as good. Best e-values are found for PSI-BLAST searches especially for 10 iterations (<xr id="fig:psiblast-evalue"/>). They all show lower frequency at lower e-values and their maximum at higher e-values.

GO-terms

For the reference sequence (P00439) GO-terms (<xr id="go"/>) were found on QuickGO. To look for similarities between the reference sequence and the sequences found in the searches we wrote a program which download those terms for all detected sequences and counted how often the GO terms of PAH are found for the other sequences: <figtable id="go">

"GO numbers and terms of PAH and their number of occurence in the different search results"
GO	GO-Term	BLAST	PSIBLAST 2 x 0.002	PSIBLAST 2 x 10e-10	PSIBLAST 10 x 0.002	PSIBLAST 10 x 10e-10	HHblits
GO:0008152	metabolic process	554	808	818	712	713	5701
GO:0016597	amino acid binding	554	808	818	712	713	5611
GO:0055114	oxidation-reduction process	458	457	459	443	438	3522
GO:0005506	iron ion binding	449	448	450	434	431	1043
GO:0009072	aromatic amino acid family metabolic process	449	448	450	434	431	1043
GO:0004497	monooxygenase activity	449	448	450	434	431	1043
GO:0016714	oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, reduced pteridine as one donor, and incorporation of one atom of oxygen	445	443	444	434	431	1033
GO:0004505	phenylalanine 4-monooxygenase activity	207	207	207	205	205	442
GO:0006559	L-phenylalanine catabolic process	165	165	165	165	165	395
GO:0016491	oxidoreductase activity	158	157	157	154	154	2456
GO:0003824	catalytic activity	12	30	28	33	32	3237
GO:0042423	catecholamine biosynthetic process	15	25	15	15	15	80
GO:0008652	cellular amino acid biosynthetic process	6	10	11	12	12	1762
GO:0046872	metal ion binding	7	6	6	6	6	1141
GO:0042136	neurotransmitter biosynthetic process	2	2	2	2	2	12
GO:0005829	cytosol	0	1	1	2	2	16
GO:0034641	cellular nitrogen compound metabolic process	0	0	0	0	0	4
GO:0044281	small molecule metabolic process	0	0	0	0	0	4

GO terms and their number of occurence for the protein sequences found in different sequence searches: BLAST, PSI-BLAST with 2 iterations and e-value of 0.002 and 10e-10 and with 10 iterations with the same e-values and finally HHblits. </figtable> When you look at the GO-terms it is obvious that more general descriptions like 'metabolic process' or 'amino acid binding' are found very often. The more specific the GO-terms of PAH gets the less sequences in the searches have the same description. However, in HHblits some of the terms like for example 'catalytic activity' are detected more often as in the other searches even it is incorporated that there are much more sequences found in the HHblits clusters.

Best results

In this part the sequences with highest sequence identity and lowest e-value are compared for the different searches.

Sequence identity

In the following table (<xr id="hhblits"/>) the four proteins with the highest sequence identity found in the HHblits run are presented. <figtable id="hhblits">

Best sequence identities in HHblits
Uinprot-ID	Identity	E-Value
C9JMN0	1.0	5.0E-6
Q66RJ9	1.0	0.006
Q16021	0.96	8.4E-12
Q66RJ7	0.96	7.1E-4

Proteins with highest identity in HHblits

</figtable> None of these sequences are found in BLAST or the four PSI-BLAST searches. However, some proteins with highest sequence identity found with BLAST also can be found with PSI-BLAST (<xr id="blast"/>). <figtable id="blast">

Best sequence identities in BLAST
Uinprot-ID	Identity	E-Value	Identity PSI-BLAST 2x0.002	Identity PSI-BLAST 2x10e-10	Identity PSI-BLAST 10x0.002	Identity PSI-BLAST 10x10e-10
F6XY00	0.92	0.0	0.95	0.95	0.93	0.96
L9L9N2	0.89	0.0	0.93	0.92	0.95	0.90
G5AMD7	0.89	0.0	0.93	0.93	0.91	0.91
D3YZ73	0.85	7.0E-20	0.73	0.73	-	-
G1KSL1	0.81	0.0	0.82	0.82	0.82	0.79

Proteins with highest identity in BLAST and their identity values in the four PSI-BLAST runs if found.

</figtable> Additionally the sequences (F6XY00, L9L9N2, G5AMD7, G1KSL1) also are the best four in the PSI-BLAST runs only not in the same order but with better e-values than in BLAST. They are not included in the HHblits result. Only D3YZ73 is not under the best in PSI-BLAST and is not found after ten iterations at all. However, in HHblits it is found with a sequence identity of 0.71.

E-Value

The best e-values are those which are lowest. In HHblits the three lowest values are 5.5e-175 (<xr id="hh-eval"/>). <figtable id="hh-eval">

Best e-values in HHblits
Uinprot-ID	E-Value	Identity
A7UTU7	5.5e-175	0.68
P17752	5.5e-175	0.68
P00439	5.5e-175	0.68

Proteins with lowest e-value in HHblits

</figtable> Again none of these protein sequences are found in BLAST or the four PSI-BLAST searches. The proteins with best e-values of the BLAST run (<xr id="bl-eval"/>), however, are found in all searches, besides H3FAJ0, which is not included in the HHblits result. They have very good e-values in all runs, but are under the three best in BLAST only. <figtable id="bl-eval">

Best e-values in BLAST
Uinprot-ID	E-Value	Identity	E-Value PSI-BLAST 2x0.002	E-Value PSI-BLAST 2x10e-10	E-Value PSI-BLAST 10x0.002	E-Value PSI-BLAST 10x10e-10	E-Value HHblits
H3FAJ0	1.0e-177	0.544276457883369	1.0E-157	1.0E-162	1.0E-122	1.0E-126	-
A8WSM6	1.0e-177	0.563636363636364	1.0E-159	1.0E-164	1.0E-127	1.0E-131	1.1E-172
A9UUJ8	1.0e-176	0.567695961995249	1.0E-156	1.0E-160	1.0E-127	1.0E-131	5.5E-175

Proteins with lowest e-value in BLAST and their values in PSI-BLAST and also in HHblits

</figtable>

At PSI-BLAST the three best proteins of the run with two iterations and an e-value cutoff of 10e-10 and the run with ten iterations and an e-value cutoff of 0.002 are the same. The best one is H2UJM8 with an e-value of 1e-179. For two iterations and e-value cutoff of 0.002 it is Q4VBE2 with an e-value of 1e-178 and for ten iterations and e-value cutoff of 10e-10 it is K7F3H7 with an e-value of 1e-143, which is good, but not as good as the other best ones.
When you compare sequence identity and e-value, that sequences with best identities often have not as good e-values and vice versa is remarkable. Especially in the BLAST runs e-values of only 0.0 for the sequences with highest sequence similarity are reached and the sequences with lowest e-values only have sequence identities slightly above 54%. Therefore you can see that a good balance between high sequence identity and low e-value is very hard to get, but necessary to get trustable results.

Multiple sequence alignments

For the images of the Alignments Jalview was used. Colours are shown in Clustalx Format.
Lab journal

Datasets

For the multiple sequence alignments four different datasets were generated with the BLAST outputs and a Python script. Three groups with ten sequences (two of them are pdb-sequences): one with higher than 60% sequence identity to our target sequence (PAH gene), another group with sequence identity between 30% and 60% and one group with lower sequence identity than 30% (<xr id="datasets"/>). In the group with < 30% identity the pdb sequences have a 32% identity, because these are the lowest ones found in the Blast output against the pdb dataset. The fourth group contains 20 sequences with a whole range of sequence identities and four of them are pdb sequences (<xr id="set_all"/>). For the comparison to our reference sequence of the Phenylalanine hydroxylase (PAH - P00439) enzyme we added this sequence to all four groups. <figtable id="datasets">

**a) Dataset "low":**
Group of sequences with < 30% (pdb = 32%) identity
Sequence identity	ID
29%	K0WHA3
29%	L0FXW8
29%	B4R9Q9
28%	C9P8B8
25%	I3YW84
25%	G0L2J6
25%	Q9AG78
24%	B4UJD0
32%	1ltu_A (pdb)
32%	1ltz_A (pdb)

**b) Dataset "medium":**
Group of sequences between 30% and 60% identity
Sequence identity	ID
54%	I1C3M2
53%	F6ZKP1
51%	Q54XS1
48%	G1KSP0
45%	E5SYS4
42%	H3EDU0
38%	H0HJI4
36%	F4WGX3
59%	2toh_A (pdb)
45%	2qmx_A (pdb)

**c) Dataset "high":**
Group of sequences with > 60% identity
Sequence identity	ID
70%	H2UJM8
67%	K1RSS1
65%	E7D1A7
63%	G7YPD5
63%	D1LXB2
62%	G1KJG2
61%	G9B2G8
61%	E9RJV0
96%	2pah_A (pdb)
96%	1kw0_A (pdb)

The three datasets with sequence identity a) lower than 30% (and two pdb sequences with 32% identity b) between 30% and 60% and c) bigger than 60%.

</figtable>

**Dataset "all":**
Group of sequences with different identities (0-100%)
Sequence identity	ID	Sequence identity	ID
63%	D1LXB2	33%	A9D485
63%	G3MQ02	33%	G7UU95
56%	A9UUJ8	31%	I3TIA8
54%	D2XNL7	31%	A0Y3T4
52%	E9G1D2	29%	D5HB02
44%	I1GDE7	27%	A1ZW97
41%	H1L1J2	96%	5pah_A (pdb)
37%	D7AKY2	96%	3pah_A (pdb)
36%	B7GIR4	96%	2pah_A (pdb)
35%	A0C973	33%	2v27_B (pdb)

Set with whole range of sequence identities including four pdb sequences.

</figtable>

ClustalW

a) Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with ClustalW

b) Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with ClustalW

c) Multiple Sequence Alignment of the "high" dataset with sequences >60% identity, which was created with ClustalW

d) Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with ClustalW

Multiple sequence alignments for the four datasets a) low, b) medium, c) high and d) all created with ClustalW.

</figure>

Muscle

a) Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with Muscle

b) Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with Muscle

c) Multiple Sequence Alignment of the "high" dataset with sequences >60% identity, which was created with Muscle

d) Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with Muscle

Multiple sequence alignments for the four datasets a) low, b) medium, c) high and d) all created with Muscle.

</figure>

T-Coffee

a) Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with T-Coffee

b) Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with T-Coffee

c) Multiple Sequence Alignment of the "high" dataset with sequences >60% identity, which was created with T-Coffee

d) Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with T-Coffee

Multiple sequence alignments for the four datasets a) low, b) medium, c) high and d) all created with T-Coffee.

</figure>

Expresso(3D-Coffee)

Expresso is an extension of 3D-Coffee and uses BLAST to search the PDB database for structures whose sequences are similar to the given sequences. These structures are then used to build the alignment. It is slowlier than T-Coffee itself, but if it finds enough structures it is more accurate than the other programms.

Since we could not run Expresso on the server, we have used this website for the multiple sequence alignments with Expresso (3D-Coffee).

a) Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with Expresso(3D-Coffee)

b) Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with Expresso(3D-Coffee)

c) Multiple Sequence Alignment of the "high" dataset with sequences > 60% identity, which was created with Expresso(3D-Coffee)

d) Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with Expresso(3D-Coffee)

Multiple sequence alignments for the four datasets a) low, b) medium, c) high and d) all created with Expresso.

</figure>

Results

The following tables (<xr id="low_set"/>,<xr id="medium_set"/> and <xr id="high_set"/>) show the gaps that are included during the creation of the multiple alignments by ClustalW, Muscle, T-Coffee and Expresso. Thereby the sequence of PAH (P00439) is shown in bold, whereas pdb sequences are marked with *.

</figtable>

**Gaps in Dataset "low":**
Group of sequences with < 30% (pdb = 32%) identity
ID	ClustalW	MUSCLE	T-COFFEE	EXPRESSO
P00439	3	9	3	3
K0WHA3	230	236	230	230
L0FXW8	401	407	401	401
B4R9Q9	339	345	339	339
C9P8B8	223	229	223	223
I3YW84	396	402	396	396
G0L2J6	336	342	336	336
Q9AG78	399	405	399	399
B4UJD0	280	286	280	280
*1ltu_A	399	405	399	399
*1ltz_A	237	243	237	237

Gaps that are included creating the multiple alignments with the dataset "low". The first column contains the Uniprot IDs, whereas the following columns show the gaps inserted by ClustalW, Muscle, T-Coffee and Expresso.

**Gaps in Dataset "medium":**
Group of sequences between 30% and 60% identity
ID	ClustalW	MUSCLE	T-COFFEE	EXPRESSO
P00439	10	21	25	29
I1C3M2	37	48	52	56
F6ZKP1	20	31	35	39
Q54XS1	43	54	58	62
G1KSP0	287	298	302	306
E5SYS4	297	308	312	316
H3EDU0	75	86	90	94
H0HJI4	402	413	417	421
F4WGX3	283	294	298	302
*2toh_A	282	293	297	301
*2qmx_A	416	427	431	435

Gaps that are included creating the multiple alignments with the dataset "medium". The first column contains the Uniprot IDs, whereas the following columns show the gaps inserted by ClustalW, Muscle, T-Coffee and Expresso.

**Gaps in Dataset "high":**
Group of sequences with > 60% identity
ID	ClustalW	MUSCLE	T-COFFEE	EXPRESSO
P00439	8	13	15	15
H2UJM8	20	25	27	27
K1RSS1	280	285	287	287
E7D1A7	342	347	349	349
G7YPD5	115	120	122	122
D1LXB2	162	167	169	169
G1KJG2	220	225	227	227
G9B2G8	42	47	49	49
E9RJV0	353	358	360	360
*2pah_A	220	225	227	227
*1kw0_A	160	165	167	167

Gaps that are included creating the multiple alignments with the dataset "high". The first column contains the Uniprot IDs, whereas the following columns show the gaps inserted by ClustalW, Muscle, T-Coffee and Expresso.

Discussion

Comparing the different tools similar outcomes can be viewed. As expected the dataset with low similarities shows a lot of gaps. Only for our protein P00439 few gaps are included. Therefore the other sequences are shorter and only cover a small part, which goes roughly from position 175 to 295. Only Muscle shows more gaps than the others. For the dataset with identities between 30% and 60% higher conservations could be viewed. Again the results of the different tools are similar, whereby sometimes the conserved regions are shifted due to some gaps inserted between those. The set with highest sequence similarities show least gaps and also only few differences between the tools. ClustalW inserted fewest gaps. For the datasets low and high T-Coffee and Expresso have identical numbers of gaps. The multiple alignments can be viewed in <xr id="clustalW"/> with the results of ClustalW,in <xr id="muscle"/> with the results of Muscke, in <xr id="tCoffee"/> with the results of T-Coffee and in <xr id="expresso"/> with the results of Expresso. Thereby a) always comprise the MSA of the low, b) of the medium and c) of the high dataset. d), on the other hand, shows the MSA, which was created with 20 sequences that covers the whole range of similarity with identities of 27% to 96%. Thereby, T-Coffee and Expresso again show very similar outcomes and also Muscle is not that different to those two. ClustalW, however, seems to have more problems to make a good alignment even with the similar sequences and the conserved amino acids are picked to pieces.

Sequence searches and multiple sequence alignments (Phenylketonuria)

Contents

Summary of the task