Difference between revisions of "Sequence searches and multiple sequence alignments (Phenylketonuria)"

Latest revision as of 22:52, 5 September 2013

Summary of the task

In this task we compare the protein sequence of interest, phenylalanine hydroxylase (PAH), to other protein sequences. To this end, sequence searches and multiple sequence alignments were carried out using the big80 database, which contains subsets of Swiss-Prot and PDB whose entries share a sequence similarity of 80% or less. In addition, searches against a pdb database were run. The sequence searches were performed with BLAST, PSI-BLAST and HHblits. Their results were used to create multiple sequence alignments (MSAs) with ClustalW, Muscle and T-Coffee.

Sequence searches

Lab journal

Comparison of the results

In this part the results of the different sequence searches are analyzed. For this purpose the outputs were parsed with Maria's parser and filtered for their IDs, sequence identities and e-values.

Sequence identity in percent

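<figure id="fig:blast">Sequence identity distribution of the BLAST search; the x-axis shows the sequence identity from 0 (= 0%) to 1 (= 100%) and the y-axis the number of sequences with that sequence identity to the PAH sequence in the BLAST search.</figure>
<figure id="fig:psiblast">Sequence identity distribution of the PSI-BLAST searches; axes as for the BLAST search. Two iterations with an e-value cutoff of 0.002 (blue) and with an e-value cutoff of 10e-10 (yellow); ten iterations with an e-value cutoff of 0.002 (green) and with an e-value cutoff of 10e-10.</figure>
<figure id="fig:hhblits">Sequence identity distribution of the HHblits search; axes as for the BLAST search.</figure>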

Comparing the sequence identities of the different search tools shows that BLAST (<xr id="fig:blast"/>) and PSI-BLAST (<xr id="fig:psiblast"/>) with two iterations, for both e-value cutoffs, yield similar distributions with a maximum frequency between 35% and 40% sequence identity. For ten iterations the curve is shifted slightly to the left, with a maximum frequency at about 20%. Overall, more sequences were found with ten iterations than with two, but with lower sequence identity. This can be ascribed to the profile becoming less specific with each iteration. The runs with an e-value cutoff of 10e-10 return fewer sequences than the runs with a cutoff of 0.002 because 10e-10 is the stricter cutoff. Such a strict cutoff is more likely to produce false negatives, whereas the more permissive cutoff of 0.002 is likely to include some false positives. The highest number of sequences was found in the HHblits search (<xr id="fig:hhblits"/>), whose sequence identity distribution closely resembles that of the PSI-BLAST run with ten iterations.

E-Value

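<figure id="fig:blast-evalue">Logarithmic e-value distribution of the BLAST search; the x-axis shows the logarithmic e-value and the y-axis the frequency of sequences with that e-value. The smaller the logarithmic e-value, the better the e-value.</figure>
<figure id="fig:psiblast-evalue">Logarithmic e-value distribution of the PSI-BLAST searches; axes as for the BLAST search. Two iterations with an e-value cutoff of 0.002 (blue) and with an e-value cutoff of 10e-10 (yellow); ten iterations with an e-value cutoff of 0.002 (green) and with an e-value cutoff of 10e-10.</figure>
<figure id="fig:hhblits-evalue">Logarithmic e-value distribution of the HHblits search; axes as for the BLAST search.</figure>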

The distributions of the logarithmic e-values of the sequence searches all look similar, with the lowest and best value below -400. In BLAST (<xr id="fig:blast-evalue"/>) and HHblits (<xr id="fig:hhblits-evalue"/>), however, the values extend above 0. The maximum frequency in BLAST still lies in the negative range, whereas in HHblits it lies in the positive range, meaning that the e-values are not as good. The best e-values are found for the PSI-BLAST searches, especially with ten iterations (<xr id="fig:psiblast-evalue"/>). All distributions show low frequencies at low e-values and their maximum at higher e-values.
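
The binning behind such a distribution can be sketched as follows. This is only an illustration, not the script used for the figures: the base-10 logarithm, the bin width and the `floor` value that replaces e-values reported as 0.0 are all assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter

def log10_evalue(evalue, floor=1e-305):
    """Return log10 of an e-value; values reported as 0.0 are clamped to `floor` (assumed)."""
    return math.log10(max(evalue, floor))

def evalue_histogram(evalues, bin_width=25):
    """Count e-values per bin on the log10 scale; keys are the lower bin edges."""
    counts = Counter()
    for evalue in evalues:
        lower_edge = math.floor(log10_evalue(evalue) / bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))

# Example values taken from the tables on this page, including one hit reported as 0.0.
evalues = [1.0e-177, 1.0e-176, 7.0e-20, 5.0e-6, 0.006, 0.0]
for lower_edge, count in evalue_histogram(evalues).items():
    print(f"log10(e-value) in [{lower_edge}, {lower_edge + 25}): {count}")
```

Clamping is needed because the logarithm of 0.0 is undefined; where exactly the clamped hits end up on the x-axis depends on the chosen floor.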

GO-terms

For the reference sequence (P00439), GO terms (<xr id="go"/>) were retrieved from QuickGO. To look for similarities between the reference sequence and the sequences found in the searches, we wrote a program that downloads the GO terms of all detected sequences and counts how often the GO terms of PAH occur among them: <figtable id="go">

"GO numbers and terms of PAH and their number of occurrences in the different search results"
GO	GO-Term	BLAST	PSIBLAST 2 x 0.002	PSIBLAST 2 x 10e-10	PSIBLAST 10 x 0.002	PSIBLAST 10 x 10e-10	HHblits
GO:0008152	metabolic process	554	808	818	712	713	5701
GO:0016597	amino acid binding	554	808	818	712	713	5611
GO:0055114	oxidation-reduction process	458	457	459	443	438	3522
GO:0005506	iron ion binding	449	448	450	434	431	1043
GO:0009072	aromatic amino acid family metabolic process	449	448	450	434	431	1043
GO:0004497	monooxygenase activity	449	448	450	434	431	1043
GO:0016714	oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, reduced pteridine as one donor, and incorporation of one atom of oxygen	445	443	444	434	431	1033
GO:0004505	phenylalanine 4-monooxygenase activity	207	207	207	205	205	442
GO:0006559	L-phenylalanine catabolic process	165	165	165	165	165	395
GO:0016491	oxidoreductase activity	158	157	157	154	154	2456
GO:0003824	catalytic activity	12	30	28	33	32	3237
GO:0042423	catecholamine biosynthetic process	15	25	15	15	15	80
GO:0008652	cellular amino acid biosynthetic process	6	10	11	12	12	1762
GO:0046872	metal ion binding	7	6	6	6	6	1141
GO:0042136	neurotransmitter biosynthetic process	2	2	2	2	2	12
GO:0005829	cytosol	0	1	1	2	2	16
GO:0034641	cellular nitrogen compound metabolic process	0	0	0	0	0	4
GO:0044281	small molecule metabolic process	0	0	0	0	0	4

GO terms and their number of occurrences for the protein sequences found in the different sequence searches: BLAST, PSI-BLAST with 2 iterations and e-value cutoffs of 0.002 and 10e-10, PSI-BLAST with 10 iterations and the same cutoffs, and HHblits. </figtable> The GO terms show that general descriptions such as 'metabolic process' or 'amino acid binding' occur very often. The more specific a GO term of PAH is, the fewer sequences in the searches share it. In HHblits, however, some terms such as 'catalytic activity' are detected more often than in the other searches, even when taking into account that many more sequences are found in the HHblits clusters.
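
The counting step can be sketched as below. This is a minimal illustration, not the program that was actually written: the hit IDs and their annotations are hypothetical toy data, and the download from QuickGO is left out. The last lines normalise two counts from the table, taking the 'metabolic process' count as a rough stand-in for the number of annotated hits per search, which is an assumption.

```python
def count_reference_terms(reference_terms, hit_terms):
    """For each reference GO term, count how many hits are annotated with it."""
    counts = {term: 0 for term in reference_terms}
    for terms in hit_terms.values():
        for term in reference_terms & set(terms):
            counts[term] += 1
    return counts

def fraction(count, total):
    """Share of hits carrying a term; 0.0 if there are no hits."""
    return count / total if total else 0.0

# Two GO terms of PAH from the table above.
reference = {"GO:0008152", "GO:0004505"}
# Hypothetical hits with hypothetical annotations, for illustration only.
hits = {
    "hit_A": {"GO:0008152", "GO:0004505"},
    "hit_B": {"GO:0008152"},
    "hit_C": {"GO:0046872"},
}
print(count_reference_terms(reference, hits))

# 'catalytic activity' relative to 'metabolic process', counts from the table.
print(round(fraction(12, 554), 3))     # BLAST
print(round(fraction(3237, 5701), 3))  # HHblits
```

Comparing fractions rather than raw counts makes the searches comparable despite their very different numbers of hits.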

Best results

In this part the sequences with highest sequence identity and lowest e-value are compared for the different searches.

Sequence identity

In the following table (<xr id="hhblits"/>) the four proteins with the highest sequence identity found in the HHblits run are presented. <figtable id="hhblits">

Best sequence identities in HHblits
UniProt-ID	Identity	E-Value
C9JMN0	1.0	5.0E-6
Q66RJ9	1.0	0.006
Q16021	0.96	8.4E-12
Q66RJ7	0.96	7.1E-4

Proteins with highest identity in HHblits

</figtable> None of these sequences is found in the BLAST or the four PSI-BLAST searches. However, some of the proteins with the highest sequence identity in BLAST are also found with PSI-BLAST (<xr id="blast"/>). <figtable id="blast">

Best sequence identities in BLAST
UniProt-ID	Identity	E-Value	Identity PSI-BLAST 2x0.002	Identity PSI-BLAST 2x10e-10	Identity PSI-BLAST 10x0.002	Identity PSI-BLAST 10x10e-10
F6XY00	0.92	0.0	0.95	0.95	0.93	0.96
L9L9N2	0.89	0.0	0.93	0.92	0.95	0.90
G5AMD7	0.89	0.0	0.93	0.93	0.91	0.91
D3YZ73	0.85	7.0E-20	0.73	0.73	-	-
G1KSL1	0.81	0.0	0.82	0.82	0.82	0.79

Proteins with highest identity in BLAST and their identity values in the four PSI-BLAST runs if found.

</figtable> In addition, four of these sequences (F6XY00, L9L9N2, G5AMD7, G1KSL1) are also the four best hits in the PSI-BLAST runs, although not in the same order, and with better e-values than in BLAST. They are not included in the HHblits result. Only D3YZ73 is not among the best hits in PSI-BLAST and is not found at all after ten iterations. In HHblits, however, it is found with a sequence identity of 0.71.

E-Value

The best e-values are the lowest ones. In HHblits the three lowest e-values are all 5.5e-175 (<xr id="hh-eval"/>). <figtable id="hh-eval">

Best e-values in HHblits
UniProt-ID	E-Value	Identity
A7UTU7	5.5e-175	0.68
P17752	5.5e-175	0.68
P00439	5.5e-175	0.68

Proteins with lowest e-value in HHblits

</figtable> Again, none of these protein sequences is found in the BLAST or the four PSI-BLAST searches. The proteins with the best e-values in the BLAST run (<xr id="bl-eval"/>), however, are found in all searches, except for H3FAJ0, which is not included in the HHblits result. They have very good e-values in all runs, but are among the three best only in BLAST. <figtable id="bl-eval">

Best e-values in BLAST
UniProt-ID	E-Value	Identity	E-Value PSI-BLAST 2x0.002	E-Value PSI-BLAST 2x10e-10	E-Value PSI-BLAST 10x0.002	E-Value PSI-BLAST 10x10e-10	E-Value HHblits
H3FAJ0	1.0e-177	0.544276457883369	1.0E-157	1.0E-162	1.0E-122	1.0E-126	-
A8WSM6	1.0e-177	0.563636363636364	1.0E-159	1.0E-164	1.0E-127	1.0E-131	1.1E-172
A9UUJ8	1.0e-176	0.567695961995249	1.0E-156	1.0E-160	1.0E-127	1.0E-131	5.5E-175

Proteins with lowest e-value in BLAST and their values in PSI-BLAST and also in HHblits

</figtable>

In PSI-BLAST the three best proteins of the run with two iterations and an e-value cutoff of 10e-10 and of the run with ten iterations and an e-value cutoff of 0.002 are the same. The best one is H2UJM8 with an e-value of 1e-179. For two iterations and an e-value cutoff of 0.002 the best hit is Q4VBE2 with an e-value of 1e-178, and for ten iterations and an e-value cutoff of 10e-10 it is K7F3H7 with an e-value of 1e-143, which is good, but not as good as the other best hits.
Comparing sequence identity and e-value, it is notable that the sequences with the highest identities are often not the ones with the lowest listed e-values, and vice versa. In the BLAST run the sequences with the highest sequence identity are reported with an e-value of 0.0 (BLAST prints 0.0 when the e-value is too small to be represented, so these hits are highly significant rather than poor), whereas the sequences with the lowest non-zero e-values have sequence identities only slightly above 54%. A good balance between high sequence identity and low e-value is therefore hard to achieve, but it is necessary for trustworthy results.

Multiple sequence alignments

The alignment images were created with Jalview. Colours are shown in the ClustalX colour scheme.
Lab journal

Datasets

For the multiple sequence alignments four different datasets were generated from the BLAST outputs with a Python script. Three groups contain ten sequences each (two of which are pdb sequences): one with more than 60% sequence identity to our target sequence (PAH), one with sequence identity between 30% and 60%, and one with less than 30% sequence identity (<xr id="datasets"/>). In the group with < 30% identity the pdb sequences have 32% identity, because these are the lowest values found in the BLAST output against the pdb dataset. The fourth group contains 20 sequences covering a whole range of sequence identities, four of which are pdb sequences (<xr id="set_all"/>). For comparison, the reference sequence of the phenylalanine hydroxylase enzyme (PAH, P00439) was added to all four groups. <figtable id="datasets">

**a) Dataset "low":**
Group of sequences with < 30% (pdb = 32%) identity
Sequence identity	ID
29%	K0WHA3
29%	L0FXW8
29%	B4R9Q9
28%	C9P8B8
25%	I3YW84
25%	G0L2J6
25%	Q9AG78
24%	B4UJD0
32%	1ltu_A (pdb)
32%	1ltz_A (pdb)

**b) Dataset "medium":**
Group of sequences between 30% and 60% identity
Sequence identity	ID
54%	I1C3M2
53%	F6ZKP1
51%	Q54XS1
48%	G1KSP0
45%	E5SYS4
42%	H3EDU0
38%	H0HJI4
36%	F4WGX3
59%	2toh_A (pdb)
45%	2qmx_A (pdb)

**c) Dataset "high":**
Group of sequences with > 60% identity
Sequence identity	ID
70%	H2UJM8
67%	K1RSS1
65%	E7D1A7
63%	G7YPD5
63%	D1LXB2
62%	G1KJG2
61%	G9B2G8
61%	E9RJV0
96%	2pah_A (pdb)
96%	1kw0_A (pdb)

The three datasets with sequence identity a) lower than 30% (plus two pdb sequences with 32% identity), b) between 30% and 60% and c) greater than 60%.

</figtable>

**Dataset "all":**
Group of sequences with different identities (0-100%)
Sequence identity	ID	Sequence identity	ID
63%	D1LXB2	33%	A9D485
63%	G3MQ02	33%	G7UU95
56%	A9UUJ8	31%	I3TIA8
54%	D2XNL7	31%	A0Y3T4
52%	E9G1D2	29%	D5HB02
44%	I1GDE7	27%	A1ZW97
41%	H1L1J2	96%	5pah_A (pdb)
37%	D7AKY2	96%	3pah_A (pdb)
36%	B7GIR4	96%	2pah_A (pdb)
35%	A0C973	33%	2v27_B (pdb)

Set with whole range of sequence identities including four pdb sequences.

</figtable>
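
The grouping by identity can be sketched as follows. This is not the original Python script, only a minimal illustration of the thresholds described above: the function names are hypothetical, and assigning exactly 30% and exactly 60% to the "medium" group is an assumption, since the text does not say how the boundaries were handled.

```python
def identity_group(identity_percent):
    """Assign a hit to 'low' (< 30%), 'medium' (30% to 60%) or 'high' (> 60%)."""
    if identity_percent < 30:
        return "low"
    if identity_percent <= 60:
        return "medium"
    return "high"

def group_hits(hits):
    """Sort (ID, identity in percent) pairs into the three identity groups."""
    groups = {"low": [], "medium": [], "high": []}
    for seq_id, identity_percent in hits:
        groups[identity_group(identity_percent)].append(seq_id)
    return groups

# A few ID/identity pairs taken from the dataset tables above.
hits = [
    ("K0WHA3", 29), ("I1C3M2", 54), ("H2UJM8", 70),
    ("B4UJD0", 24), ("F4WGX3", 36), ("E9RJV0", 61),
]
for name, ids in group_hits(hits).items():
    print(name, ids)
```

Picking ten sequences per group, and adding the pdb sequences found in the separate search against the pdb dataset, would be further steps on top of this grouping.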

ClustalW

a) Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with ClustalW

b) Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with ClustalW

c) Multiple Sequence Alignment of the "high" dataset with sequences >60% identity, which was created with ClustalW

d) Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with ClustalW

Multiple sequence alignments for the four datasets a) low, b) medium, c) high and d) all created with ClustalW.

</figure>

Muscle

a) Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with Muscle

b) Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with Muscle

c) Multiple Sequence Alignment of the "high" dataset with sequences >60% identity, which was created with Muscle

d) Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with Muscle

Multiple sequence alignments for the four datasets a) low, b) medium, c) high and d) all created with Muscle.

</figure>

T-Coffee

a) Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with T-Coffee

b) Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with T-Coffee

c) Multiple Sequence Alignment of the "high" dataset with sequences >60% identity, which was created with T-Coffee

d) Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with T-Coffee

Multiple sequence alignments for the four datasets a) low, b) medium, c) high and d) all created with T-Coffee.

</figure>

Expresso(3D-Coffee)

Expresso is an extension of 3D-Coffee that uses BLAST to search the PDB database for structures whose sequences are similar to the given sequences. These structures are then used to build the alignment. It is slower than T-Coffee itself, but if it finds enough structures it is more accurate than the other programs.

Since we could not run Expresso on the server, we have used this website for the multiple sequence alignments with Expresso (3D-Coffee).

a) Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with Expresso(3D-Coffee)

b) Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with Expresso(3D-Coffee)

c) Multiple Sequence Alignment of the "high" dataset with sequences > 60% identity, which was created with Expresso(3D-Coffee)

d) Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with Expresso(3D-Coffee)

Multiple sequence alignments for the four datasets a) low, b) medium, c) high and d) all created with Expresso.

</figure>

Results

The following tables (<xr id="low_set"/>, <xr id="medium_set"/> and <xr id="high_set"/>) show the number of gaps inserted by ClustalW, Muscle, T-Coffee and Expresso when creating the multiple alignments. The sequence of PAH (P00439) is shown in bold, and pdb sequences are marked with *.

</figtable>

**Gaps in Dataset "low":**
Group of sequences with < 30% (pdb = 32%) identity
ID	ClustalW	MUSCLE	T-COFFEE	EXPRESSO
P00439	3	9	3	3
K0WHA3	230	236	230	230
L0FXW8	401	407	401	401
B4R9Q9	339	345	339	339
C9P8B8	223	229	223	223
I3YW84	396	402	396	396
G0L2J6	336	342	336	336
Q9AG78	399	405	399	399
B4UJD0	280	286	280	280
*1ltu_A	399	405	399	399
*1ltz_A	237	243	237	237

Gaps inserted when creating the multiple alignments of the dataset "low". The first column contains the UniProt IDs; the following columns show the number of gaps inserted by ClustalW, Muscle, T-Coffee and Expresso.

**Gaps in Dataset "medium":**
Group of sequences between 30% and 60% identity
ID	ClustalW	MUSCLE	T-COFFEE	EXPRESSO
P00439	10	21	25	29
I1C3M2	37	48	52	56
F6ZKP1	20	31	35	39
Q54XS1	43	54	58	62
G1KSP0	287	298	302	306
E5SYS4	297	308	312	316
H3EDU0	75	86	90	94
H0HJI4	402	413	417	421
F4WGX3	283	294	298	302
*2toh_A	282	293	297	301
*2qmx_A	416	427	431	435

Gaps inserted when creating the multiple alignments of the dataset "medium". The first column contains the UniProt IDs; the following columns show the number of gaps inserted by ClustalW, Muscle, T-Coffee and Expresso.

**Gaps in Dataset "high":**
Group of sequences with > 60% identity
ID	ClustalW	MUSCLE	T-COFFEE	EXPRESSO
P00439	8	13	15	15
H2UJM8	20	25	27	27
K1RSS1	280	285	287	287
E7D1A7	342	347	349	349
G7YPD5	115	120	122	122
D1LXB2	162	167	169	169
G1KJG2	220	225	227	227
G9B2G8	42	47	49	49
E9RJV0	353	358	360	360
*2pah_A	220	225	227	227
*1kw0_A	160	165	167	167

Gaps inserted when creating the multiple alignments of the dataset "high". The first column contains the UniProt IDs; the following columns show the number of gaps inserted by ClustalW, Muscle, T-Coffee and Expresso.
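
Per-sequence gap counts like those in the tables above can be obtained from an alignment as sketched below. This is an illustration only: it assumes the alignment is available in aligned FASTA format with `-` or `.` as gap characters, and the two hit sequences in the toy alignment are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping sequence ID to (aligned) sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def gap_counts(alignment, gap_chars="-."):
    """Count the gap characters in every aligned sequence."""
    return {
        name: sum(seq.count(char) for char in gap_chars)
        for name, seq in alignment.items()
    }

# Toy alignment: the hit names and all residues are made up for illustration.
toy_alignment = """\
>P00439
MSTAVLENPG
>hypothetical_hit_1
--TAVL--PG
>hypothetical_hit_2
MST---ENP-
"""
for name, gaps in gap_counts(read_fasta(toy_alignment)).items():
    print(name, gaps)
```

Because every tool aligns the same input sequences, differences in these counts between tools reflect only how many gap columns each tool introduced.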

Discussion

Comparing the different tools reveals similar outcomes. As expected, the dataset with low similarities shows many gaps. Only our protein P00439 (PAH) contains few gaps, because the other sequences are shorter and cover only a small part of our protein, roughly from position 175 to 295. Only Muscle inserts more gaps than the other tools. For the dataset with identities between 30% and 60% higher conservation can be recognised. Again the results of the different tools are similar, although the conserved regions are sometimes shifted by gaps inserted between them. The set with the highest sequence similarities shows the fewest gaps and only a few differences between the tools; here ClustalW inserted the fewest gaps. For the datasets low and high, T-Coffee and Expresso insert identical numbers of gaps.

The multiple alignments can be viewed in <xr id="clustalW"/> (ClustalW), <xr id="muscle"/> (Muscle), <xr id="tCoffee"/> (T-Coffee) and <xr id="expresso"/> (Expresso). In each figure a) shows the MSA of the low, b) of the medium and c) of the high dataset, while d) shows the MSA created from 20 sequences covering the whole range of similarity, with identities between 27% and 96%. For this set T-Coffee and Expresso again show very similar outcomes, and Muscle does not differ much from them either. ClustalW, however, seems to have more difficulty producing a good alignment even for the similar sequences, and the conserved regions are fragmented. Even though the set contains some sequences with low similarity, T-Coffee and Expresso produce good alignments.

Altogether we think that a medium sequence identity is sufficient to obtain a good MSA. It is nevertheless possible to include sequences with lower identity. In such cases sequences with high similarity should be included as well, and T-Coffee seems to be a good choice for creating a reliable MSA.


@@ Line 1: / Line 1: @@
 == Summary of the task ==
-In this task we compare the protein sequence of interest, in this case the phenylalanine hydroxylase (PAH), to other protein sequences. Therefore both sequence searches and multiple sequence alignments were done using the big80 database meaning a database that contains subsets of swissprot and pdb, where the entries have a sequence similarity of 80% or less. Furthermore searches against a pdb database were done. For sequence searches the programs BLAST, PSIBLAST and HHblits are used. Their results were taken for the creation of multiple sequence alignments (MSA) using he methods ClustalW, Muscle and TCoffee.
+In this task we compare the protein sequence of interest, in this case the phenylalanine hydroxylase (PAH), to other protein sequences. Therefore both sequence searches and multiple sequence alignments were done using the big80 database meaning a database that contains subsets of swissprot and pdb, where the entries have a sequence similarity of 80% or less. Furthermore searches against a pdb database were done. For sequence searches the programs BLAST, PSIBLAST and HHblits are used. Their results were taken for the creation of multiple sequence alignments (MSA) using the methods ClustalW, Muscle and TCoffee.
 == Sequence searches ==
+[[Lab Journal - Task 2 (PAH) #Sequence searches|Lab journal]]
-The following invocations were used for Blast, PSI-Blast and HHBlits:
-=== BLAST (Basic Local Alignment Search Tool)===
+=== Comparison of the results ===
+In this part the results of the different sequence searches are analyzed. Therefore the outputs are parsed with Marias [https://i12r-studfilesrv.informatik.tu-muenchen.de/wiki/index.php/Parse_output.pl Parser] and are filtered for their IDs, sequence identities and e-values.
- blastall -p blastp -d /mnt/project/rost_db/data/big/big_80 -i /mnt/home/student/worfk
+====Sequence identity in percent====
- /Masterpractical/Task2/PAH.fasta -o /mnt/home/student/worfk/Masterpractical/Task2/Blast/PAH
+<small>
- _Blast_big_80.out -v 2000 -b 2000
+{|
+|<figure id="fig:blast"> [[File:PAH_Blast_identity.png|thumb|left|300px|'''<caption>''' Sequence identity distribution of the BLAST search; the x-axis shows the sequence identity from 0 (=0%) to 1 (=100%) and the y-axis the number of sequences that got this sequence identity to the PAH sequence in the BLAST search.</caption>]]</figure>
+|<figure id="fig:psiblast">[[File:PAH_PSIBLAST_identity.png|thumb|center|350px|'''<caption>''' Sequence identity distribution of the PSI-BLAST searches; the x-axis shows the sequence identity from 0 (=0%) to 1 (=100%) and the y-axis the number of sequences that got this sequence identity to the PAH sequence in the PSI-BLAST searches: two iterations with an e-value cutoff of 0.002 (blue) and with an e-value cutoff of 10e-10 (yellow); ten iterations  with an e-value cutoff of 0.002 (green) and with an e-value cutoff of 10e-10.</caption>]]</figure>
+|<figure id="fig:hhblits"> [[File:PAH_HHbilts_identity.png|thumb|right|300px|'''<caption>''' Sequence identity distribution of the HHblits search; the x-axis shows the sequence identity from 0 (=0%) to 1 (=100%) and the y-axis the number of sequences that got this sequence identity to the PAH sequence in the HHblits search.</caption>]]</figure>
+|}
+</small>
+<br clear=all>
+At comparing the sequence identities of the different search tools, it could be seen that BLAST (<xr id="fig:blast"/>) and the distributions of PSI-BLAST (<xr id="fig:psiblast"/>) with two iterations for both e-value cutoffs show similar distributions with a maximum frequency between 35% and 40% sequence identity. For ten iterations the curve is shifted a bit to the left with a maximum frequency at about 20%. Altogether, it can be seen that more sequences were found in ten iterations than in two, however with less sequence identity. This can be ascribed to the fact that the profile gets less specific in each iteration. The runs with an e-value cutoff of 10e-10 show a lower number of sequences than the runs with a cutoff of 0.002 as 10e-10 is a more significant cutoff. However, that cutoff is really strict and has some false negatives with higher probability, whereas the cutoff of 0.002 is not very significant and likely includes some false positives. The highest number of sequences was found in the HHblits search (<xr id="fig:hhblits"/>), which distribution of sequence identity shows high similarity to the distribution of PSI-BLAST run with ten iterations.
-=== PSI-BLAST (Position-Specific Iterated BLAST) ===
-For PSI-Blast ([http://www.ncbi.nlm.nih.gov/books/NBK2590/ PSI-BLAST Tutorial]) more than one vocation was performed. First two iterations were done with an E-value cutoff of 0.002 and then again with cutoff 10E-10. The same for ten iterations. An example vocation would be:
+====E-Value====
- blastpgp -i /mnt/home/student/worfk/Masterpractical/Task2/PAH.fasta -d /mnt/project/rost_db
+<small>
- /data/big/big_80 -j 2 -h 0.002 -v 2000 -b 2000 -o psi_blast_big_80_2_2.out -C big_80_check_
+{|
-_2.chk -Q big_80_matrix_2_2.pssm
+|<figure id="fig:blast-evalue">[[File:PAH_Blast_evalue.png|thumb|left|300px|'''<caption>''' Logarithmic e-value distribution of the BLAST search; the x-axis shows the logarithmic e-value and the y-axis the frequency of sequences with that specific e-value; If the logarithmic e-value is smaller the evalue is better.</caption>]]</figure>
+|<figure id="fig:psiblast-evalue">[[File:PAH_PSIBLAST_evalue.png|thumb|center|350px|'''<caption>''' Logarithmic e-value distribution of the PSI-BLAST search: two iterations with an e-value cutoff of 0.002 (blue) and with an e-value cutoff of 10e-10 (yellow); ten iterations with an e-value cutoff of 0.002 (green) and with an e-value cutoff of 10e-10; the x-axis shows the logarithmic e-value and the y-axis the frequency of sequences with that specific e-value; If the logarithmic e-value is smaller the evalue is better.</caption>]]</figure>
+|<figure id="fig:hhblits-evalue">[[File:PAH_HHbilts_evalue.png|thumb|right|300px|'''<caption>''' Logarithmic e-value distribution of the HHblits search; the x-axis shows the logarithmic e-value and the y-axis the frequency of sequences with that specific e-value; If the logarithmic e-value is smaller the evalue is better.</caption>]]</figure>
+|}
+</small>
+The distributions of the logarithmic e-values of the sequence searches all look similar with a lowest and best value beneath -400. Nevertheless, it goes up to higher than 0 in BLAST (<xr id="fig:blast-evalue"/>) and HHblits(<xr id="fig:hhblits-evalue"/>). However, the maximum frequency for the e-values in BLAST is still in negativ range, whereas in HHblits it is in positive range meaning that the e-values are not as good. Best e-values are found for PSI-BLAST searches especially for 10 iterations (<xr id="fig:psiblast-evalue"/>).
+They all show lower frequency at lower e-values and their maximum at higher e-values.
+<br clear=all>
-=== HHblits ===
- hhblits -i /mnt/home/student/waldraffs/Masterpraktikum/PAH.fasta -d /mnt/project/rost_db/data/hhblits/uniprot20_02Sep11
- -o /mnt/home/student/waldraffs/Masterpraktikum/PAH_2000.hrr -oa3m /mnt/home/student/waldraffs/Masterpraktikum/PAH_2000.a3m
- -ohhm /mnt/home/student/waldraffs/Masterpraktikum/PAH_2000.hhm -Z 2000 -B 2000
+====GO-terms====
+For the reference sequence (P00439) GO-terms (<xr id="go"/>) were found on [http://www.ebi.ac.uk/QuickGO/GAnnotation QuickGO].
+To look for similarities between the reference sequence and the sequences found in the searches we wrote a program which download those terms for all detected sequences and counted how often the GO terms of PAH are found for the other sequences:
+<figtable id="go">
+{| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center;"
+|-
+! colspan="8" style="background:#32CD32;" | "GO numbers and terms of PAH and their number of occurence in the different search results"
+|-
+! style="background:#90EE90;" align="center" | GO
+! style="background:#90EE90;" align="center" | GO-Term
+! style="background:#90EE90;" align="center" | BLAST
+! style="background:#90EE90;" align="center" | PSIBLAST
+x 0.002
+! style="background:#90EE90;" align="center" | PSIBLAST
+x 10e-10
+! style="background:#90EE90;" align="center" | PSIBLAST
+x 0.002
+! style="background:#90EE90;" align="center" | PSIBLAST
+x 10e-10
+! style="background:#90EE90;" align="center" | HHblits
+|-
+|[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0008152 GO:0008152]
+|metabolic process
+|554
+|808
+|818
+|712
+|713
+|5701
+|-
+|[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0016597 GO:0016597]
+|amino acid binding
+|554
+|808
+|818
+|712
+|713
+|5611
+|-
+|[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:00055114 GO:0055114]
+|oxidation-reduction process
+|458
+|457
+|459
+|443
+|438
+|3522
+|-
+|[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0005506 GO:0005506]
+|iron ion binding
+|449
+|448
+|450
+|434
+|431
+|1043
+|-
+|[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0009072 GO:0009072]
+|aromatic amino acid family metabolic process
+|449
+|448
+|450
+|434
+|431
+|1043
+|-
+|[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0004497 GO:0004497]
+|monooxygenase activity
+|449
+|448
+|450
+|434
+|431
+|1043
+|-
+|[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0016714 GO:0016714]
+|oxidoreductase activity,
+acting on paired donors,
+with incorporation or reduction of molecular oxygen,
-To perform all programms at once, one could use the Perl-script from Maria, like shown here:
- perl /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1/run.pl ...
+reduced pteridine as one donor,
-=== Comparison of the results ===
+and incorporation of one atom of oxygen
-*Sequence identity in percent
+|445
+|443
+|444
+|434
+|431
+|1033
+|-
+|[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0004505 GO:0004505]
+|phenylalanine 4-monooxygenase activity
+|207
+|207
+|207
+|205
+|205
+|442
+|-
+|[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0006559 GO:0006559]
+|L-phenylalanine catabolic process
+|165
+|165
+|165
+|165
+|165
+|395
+|-
+|[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0016491 GO:0016491]
+|oxidoreductase activity
+|158
+|157
+|157
+|154
+|154
+|2456
+|-
+|[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0003824 GO:0003824]
+|catalytic activity
+|12
+|30
+|28
+|33
+|32
+|3237
+|-
+|[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0042423 GO:0042423]
+|catecholamine biosynthetic process
+|15
+|25
+|15
+|15
+|15
+|80
+|-
+|[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0008652 GO:0008652]
+|cellular amino acid biosynthetic process
+|6
+|10
+|11
+|12
+|12
+|1762
+|-
+|[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0046872 GO:0046872]
+|metal ion binding
+|7
+|6
+|6
+|6
+|6
+|1141
+|-
+|[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0042136 GO:0042136]
+|neurotransmitter biosynthetic process
+|2
+|2
+|2
+|2
+|2
+|12
+|-
+|[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0005829 GO:0005829]
+|cytosol
+|0
+|1
+|1
+|2
+|2
+|16
+|-
+|[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0034641 GO:0034641]
+|cellular nitrogen compound metabolic process
+|0
+|0
+|0
+|0
+|0
+|4
+|-
+|[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0044281 GO:0044281]
+|small molecule metabolic process
+|0
+|0
+|0
+|0
+|0
+|4
+|-
+|}
+<small>'''<caption>''' GO terms and their number of occurence for the protein sequences found in different sequence searches: BLAST, PSI-BLAST with 2 iterations and e-value of 0.002 and 10e-10 and with 10 iterations with the same e-values and finally HHblits.</caption></small>
+</figtable>
+When you look at the GO-terms it is obvious that more general descriptions like 'metabolic process' or 'amino acid binding' are found very often. The more specific the GO-terms of PAH gets the less sequences in the searches have the same description. However, in HHblits some of the terms like for example 'catalytic activity' are detected more often as in the other searches even it is incorporated that there are much more sequences found in the HHblits clusters.
+====Best results====
-*E-Value
+In this part the sequences with highest sequence identity and lowest e-value are compared for the different searches.
+=====Sequence identity=====
+In the following table (<xr id="hhblits"/>) the four proteins with the highest sequence identity found in the HHblits run are presented.
+<figtable id="hhblits">
+{| border="1" style="text-align:center;" cellpadding="5" cellspacing="0" align="center"
-*GO-terms
+|-
-For the reference sequence (P00432) following GO-terms were found on [http://www.ebi.ac.uk/QuickGO/GAnnotation QuickGO]:
+! colspan="8" style="background:#32CD32;" | Best sequence identities in HHblits
- "GO:0050661", "GO:0009650", "GO:0020027", "GO:0004096", "GO:0046872", "GO:0042803", "GO:0009060", "GO:0005739",
+|-
- "GO:0051092", "GO:0004601", "GO:0051289", "GO:0051781", "GO:0004046", "GO:0042744", "GO:0006979", "GO:0005778",
+! style="background:#90EE90;" align="center" | UniProt-ID
- "GO:0008203", "GO:0005777", "GO:0043066", "GO:0051262", "GO:0016209", "GO:0005102", "GO:0014068", "GO:0032088",
+! style="background:#90EE90;" align="center" | Identity
- "GO:0016491", "GO:0006641", "GO:0016684", "GO:0019899", "GO:0000302", "GO:0055114", "GO:0020037"
+! style="background:#90EE90;" align="center" | E-Value
+|-
+|[http://www.uniprot.org/uniprot/C9JMN0 C9JMN0]
+|1.0
+|5.0E-6
+|-
+|[http://www.uniprot.org/uniprot/Q66RJ9 Q66RJ9]
+|1.0
+|0.006
+|-
+|[http://www.uniprot.org/uniprot/Q16021 Q16021]
+|0.96
+|8.4E-12
+|-
+|[http://www.uniprot.org/uniprot/Q66RJ7 Q66RJ7]
+|0.96
+|7.1E-4
+|}
+<center><small>'''<caption>''' Proteins with highest identity in HHblits</caption></small></center>
+</figtable>
+None of these sequences is found in the BLAST search or in any of the four PSI-BLAST searches. However, some of the proteins with the highest sequence identity in BLAST can also be found with PSI-BLAST (<xr id="blast"/>).
+<figtable id="blast">
+{| border="1" style="text-align:center;" cellpadding="5" cellspacing="0" align="center"
+|-
+! colspan="11" style="background:#32CD32;" | Best sequence identities in BLAST
+|-
+! style="background:#90EE90;" align="center" | UniProt-ID
+! style="background:#90EE90;" align="center" | Identity
+! style="background:#90EE90;" align="center" | E-Value
+! style="background:#90EE90;" align="center" | Identity
+PSI-BLAST 2x0.002
+! style="background:#90EE90;" align="center" | Identity
+PSI-BLAST 2x10e-10
+! style="background:#90EE90;" align="center" | Identity
+PSI-BLAST 10x0.002
+! style="background:#90EE90;" align="center" | Identity
+PSI-BLAST 10x10e-10
+|-
+|[http://www.uniprot.org/uniprot/F6XY00 F6XY00]
+|0.92
+|0.0
+|0.95
+|0.95
+|0.93
+|0.96
+|-
+|[http://www.uniprot.org/uniprot/L9L9N2 L9L9N2]
+|0.89
+|0.0
+|0.93
+|0.92
+|0.95
+|0.90
+|-
+|[http://www.uniprot.org/uniprot/G5AMD7 G5AMD7]
+|0.89
+|0.0
+|0.93
+|0.93
+|0.91
+|0.91
+|-
+|[http://www.uniprot.org/uniprot/D3YZ73 D3YZ73]
+|0.85
+|7.0E-20
+|0.73
+|0.73
+| -
+| -
+|-
+|[http://www.uniprot.org/uniprot/G1KSL1 G1KSL1]
+|0.81
+|0.0
+|0.82
+|0.82
+|0.82
+|0.79
+|}
+<center><small>'''<caption>''' Proteins with highest identity in BLAST and their identity values in the four PSI-BLAST runs if found. </caption></small></center>
+</figtable>
+Additionally, four of these sequences (F6XY00, L9L9N2, G5AMD7, G1KSL1) are also the best four in the PSI-BLAST runs, although not in the same order and with better e-values than in BLAST. They are not included in the HHblits result. Only D3YZ73 is not among the best hits in PSI-BLAST and is not found at all after ten iterations. In HHblits, however, it is found with a sequence identity of 0.71.
+=====E-Value=====
+The best e-values are the lowest ones. In HHblits the three best hits all have an e-value of 5.5e-175 (<xr id="hh-eval"/>).
+<figtable id="hh-eval">
+{| border="1" cellpadding="5" cellspacing="0" align="center"
+|-
+! colspan="8" style="background:#32CD32;" | Best e-values in HHblits
+|-
+! style="background:#90EE90;" align="center" | UniProt-ID
+! style="background:#90EE90;" align="center" | E-Value
+! style="background:#90EE90;" align="center" | Identity
+|-
+|[http://www.uniprot.org/uniprot/A7UTU7 A7UTU7]
+|5.5e-175
+|0.68
+|-
+|[http://www.uniprot.org/uniprot/P17752 P17752]
+|5.5e-175
+|0.68
+|-
+|[http://www.uniprot.org/uniprot/P00439 P00439]
+|5.5e-175
+|0.68
+|}
+<center><small>'''<caption>''' Proteins with lowest e-value in HHblits</caption></small></center>
+</figtable>
+Again, none of these protein sequences is found in BLAST or the four PSI-BLAST searches. The proteins with the best e-values in the BLAST run (<xr id="bl-eval"/>), however, are found in all searches, except for H3FAJ0, which is not included in the HHblits result. They have very good e-values in all runs, but are among the three best only in BLAST.
+<figtable id="bl-eval">
+{| border="1" cellpadding="5" cellspacing="0" align="center"
+|-
+! colspan="11" style="background:#32CD32;" | Best e-values in BLAST
+|-
+! style="background:#90EE90;" align="center" | UniProt-ID
+! style="background:#90EE90;" align="center" | E-Value
+! style="background:#90EE90;" align="center" | Identity
+! style="background:#90EE90;" align="center" | E-Value
+PSI-BLAST 2x0.002
+! style="background:#90EE90;" align="center" | E-Value
+PSI-BLAST 2x10e-10
+! style="background:#90EE90;" align="center" | E-Value
+PSI-BLAST 10x0.002
+! style="background:#90EE90;" align="center" | E-Value
+PSI-BLAST 10x10e-10
+! style="background:#90EE90;" align="center" | E-Value
+HHblits
+|-
+|[http://www.uniprot.org/uniprot/H3FAJ0 H3FAJ0]
+|1.0e-177
+|0.544276457883369
+|1.0E-157
+|1.0E-162
+|1.0E-122
+|1.0E-126
+| -
+|-
+|[http://www.uniprot.org/uniprot/A8WSM6 A8WSM6]
+|1.0e-177
+|0.563636363636364
+|1.0E-159
+|1.0E-164
+|1.0E-127
+|1.0E-131
+|1.1E-172
+|-
+|[http://www.uniprot.org/uniprot/A9UUJ8 A9UUJ8]
+|1.0e-176
+|0.567695961995249
+|1.0E-156
+|1.0E-160
+|1.0E-127
+|1.0E-131
+|5.5E-175
+|}
+<center><small>'''<caption>''' Proteins with lowest e-value in BLAST and their values in PSI-BLAST and also in HHblits</caption></small></center>
+</figtable>
+In PSI-BLAST, the three best proteins are the same for the run with two iterations and an e-value cutoff of 10e-10 and for the run with ten iterations and a cutoff of 0.002. The best one is [http://www.uniprot.org/uniprot/H2UJM8 H2UJM8] with an e-value of 1e-179. For two iterations and a cutoff of 0.002 the best hit is [http://www.uniprot.org/uniprot/Q4VBE2 Q4VBE2] with an e-value of 1e-178, and for ten iterations and a cutoff of 10e-10 it is [http://www.uniprot.org/uniprot/K7F3H7 K7F3H7] with an e-value of 1e-143, which is good, but not as good as the other best hits.
-To look for similarities between the reference sequence and the sequences found in the searches, those terms are counted.
+<br clear=all>
-...
+Comparing sequence identity and e-value, it is remarkable that the sequences with the best identities often do not have the best e-values and vice versa. In the BLAST run in particular, the sequences with the highest sequence identity are reported with e-values of 0.0, while the sequences with the lowest listed e-values only have sequence identities slightly above 54%. A good balance between high sequence identity and low e-value is therefore hard to achieve, but necessary to obtain trustworthy results.
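The observation that the two rankings disagree can be checked with a few lines of code. The following is a minimal sketch, not the parser used for this task: the helper `best_hits` and the record layout are assumptions, while the identity and e-value figures are copied from the HHblits tables above.

```python
def best_hits(hits, key, n=3, reverse=False):
    """Return the IDs of the n best hits according to one field."""
    ranked = sorted(hits, key=lambda h: h[key], reverse=reverse)
    return [h["id"] for h in ranked[:n]]

# Identity and e-value as listed in the HHblits tables above.
hhblits_hits = [
    {"id": "C9JMN0", "identity": 1.0, "evalue": 5.0e-6},
    {"id": "Q66RJ9", "identity": 1.0, "evalue": 0.006},
    {"id": "Q16021", "identity": 0.96, "evalue": 8.4e-12},
    {"id": "A7UTU7", "identity": 0.68, "evalue": 5.5e-175},
    {"id": "P17752", "identity": 0.68, "evalue": 5.5e-175},
]

# High identity is better, so sort descending; low e-value is better.
top_identity = best_hits(hhblits_hits, "identity", n=2, reverse=True)
top_evalue = best_hits(hhblits_hits, "evalue", n=2)
overlap = set(top_identity) & set(top_evalue)
```

For these five hits the two top lists share no ID, which is the pattern described in the paragraph above.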
 == Multiple sequence alignments ==
+The alignment images were created with [http://www.jalview.org Jalview]. Colours follow the ClustalX scheme. <br>
+[[Lab Journal - Task 2 (PAH) #Multiple sequence alignments|Lab journal]]
 === Datasets ===
-For the multiple sequence alignments four different datasets were generated with a [https://i12r-studfilesrv.informatik.tu-muenchen.de/wiki/index.php/Phenylketonuria/Task2/Scripts Python script]. Three groups with ten sequences (two of them are pdb-sequences): one with higher than 60% sequence identity to our target sequence (PAH gene), another group with sequence identity between 30% and 60% and one group with lower sequence identity than 30%. In the group with < 30% identity the pdb sequences have a 32% identity, because these are the lowest ones found in the Blast output against the pdb dataset. The fourth group contains 20 sequences with a whole range of sequence identites and four of them are pdb sequences.
+For the multiple sequence alignments four different datasets were generated from the BLAST outputs with a [https://i12r-studfilesrv.informatik.tu-muenchen.de/wiki/index.php/Phenylketonuria/Task2/Scripts Python script]. Three groups contain ten sequences each (two of them pdb sequences): one with more than 60% sequence identity to our target sequence (PAH), another with sequence identity between 30% and 60%, and one with less than 30% sequence identity (<xr id="datasets"/>). In the group with < 30% identity the pdb sequences have 32% identity, because these are the lowest ones found in the BLAST output against the pdb dataset. The fourth group contains 20 sequences covering a whole range of sequence identities, four of which are pdb sequences (<xr id="set_all"/>). For comparison, the reference sequence of the phenylalanine hydroxylase enzyme (PAH, P00439) was added to all four groups.
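The grouping by sequence identity can be sketched as below. This is not the linked Python script but a minimal illustration under stated assumptions: the function names are hypothetical, the treatment of the exact boundaries (30% and 60%) is a guess, and the manual placement of the 32% pdb sequences in the "low" set is not modelled. The example identities are taken from the dataset tables.

```python
def assign_dataset(identity_percent):
    """Map a sequence identity (in percent) to 'low', 'medium' or 'high'."""
    if identity_percent < 30:
        return "low"
    if identity_percent <= 60:  # assumption: 60% itself counts as medium
        return "medium"
    return "high"

def group_hits(hits):
    """Sort (ID, identity) pairs into the three identity groups."""
    groups = {"low": [], "medium": [], "high": []}
    for seq_id, identity in hits:
        groups[assign_dataset(identity)].append(seq_id)
    return groups

# IDs and identities as listed in the dataset tables.
example = [("K0WHA3", 29), ("I1C3M2", 54), ("H2UJM8", 70), ("B4UJD0", 24)]
groups = group_hits(example)
```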
+<figtable id="datasets">
+{|align="center"
+|
-{| border="1" cellpadding="5" cellspacing="0" align="center"
+{| border="1" cellpadding="5" cellspacing="0" align="left"
-|+'''Dataset "low":'''
+|+'''a) Dataset "low":'''
 |-
-! colspan="3" style="background:#32CD32;" | Group of sequences with < 30% (pdb = 32%) identity
+! colspan="2" style="background:#32CD32;" | Group of sequences with < 30%
+(pdb = 32%) identity
 |-
 ! style="background:#90EE90;" align="center" | Sequence identity
 ! style="background:#90EE90;" align="center" | ID
-! style="background:#90EE90;" align="center" | Protein Name
 |-
-| 29%
+| align="center" | 29%
+| align="center" | K0WHA3
-| C8W332
-| Prephenate dehydratase OS=Desulfotomaculum acetoxidans
 |-
-| 29%
+| align="center" | 29%
+| align="center" | L0FXW8
-| F4GLY0
-| Phospho-2-dehydro-3-deoxyheptonate aldolase OS=Spirochaeta coccoides
 |-
-| 29%
+| align="center" | 29%
+| align="center" | B4R9Q9
-| B8J3L5
-| Chorismate mutase OS=Desulfovibrio desulfuricans
 |-
+| align="center" | 28%
-| 27%
+| align="center" | C9P8B8
-| A1ZW97
-| Phenylalanine-4-hydroxylase OS=Microscilla marina
 |-
+| align="center" | 25%
-| 26%
+| align="center" | I3YW84
-| H1GP19
-| Putative uncharacterized protein OS=Myroides odoratimimus
 |-
-| 25%
+| align="center" | 25%
-| G0L2J6
+| align="center" | G0L2J6
-| Phenylalanine 4-monooxygenase OS=Zobellia galactanivorans
 |-
+| align="center" | 25%
-| 24%
+| align="center" | Q9AG78
-| L9JT09
-| Phenylalanine-4-hydroxylase OS=Cystobacter fuscus
 |-
+| align="center" | 24%
-| 23%
+| align="center" | B4UJD0
-| Q08RX0
-| Aromatic amino acid hydroxylase, biopterin-dependent OS=Stigmatella aurantiaca
 |-
-| 32%
+| align="center" | 32%
-| 1ltu_A (pdb)
+| align="center" | 1ltu_A (pdb)
-| PHENYLALANINE-4-HYDROXYLASE
 |-
-| style="border-bottom:3px solid gray;" | 32%
+| style="border-bottom:3px solid gray;" align="center" | 32%
-| style="border-bottom:3px solid gray;" | 1ltz_A (pdb)
+| style="border-bottom:3px solid gray;" align="center" | 1ltz_A (pdb)
-| style="border-bottom:3px solid gray;" | PHENYLALANINE-4-HYDROXYLASE
 |-
 |}
+|
 {| border="1" cellpadding="5" cellspacing="0" align="center"
-|+'''Dataset "medium":'''
+|+'''b) Dataset "medium":'''
 |-
-! colspan="3" style="background:#32CD32;" | Group of sequences between 30% and  60% identity
+! colspan="2" style="background:#32CD32;" | Group of sequences between
+30% and 60% identity
 |-
 ! style="background:#90EE90;" align="center" | Sequence identity
 ! style="background:#90EE90;" align="center" | ID
-! style="background:#90EE90;" align="center" | Protein Name
 |-
+| align="center" | 54%
-| 48%
+| align="center" | I1C3M2
-| Q45XJ4
-| Tyrosine hydroxylase OS=Branchiostoma floridae
 |-
+| align="center" | 53%
-| 43%
+| align="center" | F6ZKP1
-| D8U9W1
-| Putative uncharacterized protein OS=Volvox carteri
 |-
+| align="center" | 51%
-| 41%
+| align="center" | Q54XS1
-| A5GW18
-| Prephenate dehydratase OS=Synechococcus sp.
 |-
+| align="center" | 48%
-| 38%
+| align="center" | G1KSP0
-| Q1Q4R8
-| Strongly similar to chorismate mutase/prephenate dehydratase OS=Candidatus Kuenenia stuttgartiensis
 |-
+| align="center" | 45%
-| 37%
+| align="center" | E5SYS4
-| E5Y7E0
-| Prephenate dehydratase OS=Bilophila wadsworthia
 |-
+| align="center" | 42%
-| 33%
+| align="center" | H3EDU0
-| F5XW56
-| Candidate phenylalanine 4-monooxygenase (Phenylalanine-4-hydroxylase) OS=Ramlibacter tataouinensis
 |-
+| align="center" | 38%
-| 32%
+| align="center" | H0HJI4
-| K0JRN5
-| Prephenate dehydratase OS=Saccharothrix espanaensis
 |-
+| align="center" | 36%
-| 31%
+| align="center" | F4WGX3
-| A3UNV6
-| Phenylalanine-4-hydroxylase OS=Vibrio splendidus
 |-
-| 59%
+| align="center" | 59%
-| 1toh_A (pdb)
+| align="center" | 2toh_A (pdb)
-| TYROSINE HYDROXYLASE
 |-
-| style="border-bottom:3px solid gray;" | 33%
+| style="border-bottom:3px solid gray;" align="center" |45%
-| style="border-bottom:3px solid gray;" | 2v28_B (pdb)
+| style="border-bottom:3px solid gray;" align="center" | 2qmx_A (pdb)
-| style="border-bottom:3px solid gray;" | PHENYLALANINE-4-HYDROXYLASE
 |-
 |}
+|
+{| border="1" cellpadding="5" cellspacing="0" align="right"
+|+'''c) Dataset "high":'''
+|-
+! colspan="3" style="background:#32CD32;" | Group of sequences with
+> 60% identity
+|-
+! style="background:#90EE90;" align="center" | Sequence identity
+! style="background:#90EE90;" align="center" | ID
+|-
+| align="center" | 70%
+| align="center" | H2UJM8
+|-
+| align="center" | 67%
+| align="center" | K1RSS1
+|-
+| align="center" | 65%
+| align="center" | E7D1A7
+|-
+| align="center" | 63%
+| align="center" | G7YPD5
+|-
+| align="center" | 63%
+| align="center" | D1LXB2
+|-
+| align="center" | 62%
+| align="center" | G1KJG2
+|-
+| align="center" | 61%
+| align="center" | G9B2G8
+|-
+| align="center" | 61%
+| align="center" | E9RJV0
+|-
+| align="center" | 96%
+| align="center" | 2pah_A (pdb)
+|-
+| style="border-bottom:3px solid gray;" align="center" | 96%
+| style="border-bottom:3px solid gray;" align="center" | 1kw0_A (pdb)
+|-
+|}
+|}
+<center><small>'''<caption>''' The three datasets with sequence identity '''a)''' lower than 30% (and two pdb sequences with 32% identity), '''b)''' between 30% and 60% and '''c)''' higher than 60%. </caption></small></center>
+</figtable>
+<figtable id="set_all">
 {| border="1" cellpadding="5" cellspacing="0" align="center"
-|+'''Dataset "high":'''
+|+'''Dataset "all":'''
 |-
-! colspan="3" style="background:#32CD32;" | Group of sequences with > 60% identity
+! colspan="4" style="background:#32CD32;" | Group of sequences with different identities (0-100%)
 |-
 ! style="background:#90EE90;" align="center" | Sequence identity
 ! style="background:#90EE90;" align="center" | ID
-! style="background:#90EE90;" align="center" | Protein Name
+! style="background:#90EE90;" align="center" | Sequence identity
+! style="background:#90EE90;" align="center" | ID
 |-
+| align="center" | 63%
-| 76%
+| align="center" | D1LXB2
-| Q4VBE2
+| align="center" | 33%
-| Putative uncharacterized protein mgc108157 OS=Xenopus tropicalis
+| align="center" | A9D485
 |-
+| align="center" | 63%
-| 67%
+| align="center" | G3MQ02
-| K1RSS1
+| align="center" | 33%
-| Protein henna OS=Crassostrea gigas
+| align="center" | G7UU95
 |-
+| align="center" | 56%
-| 64%
+| align="center" | A9UUJ8
-| H3HGU2
+| align="center" | 31%
-| Uncharacterized protein (Fragment) OS=Strongylocentrotus purpuratus
+| align="center" | I3TIA8
 |-
+| align="center" | 54%
-| 63%
+| align="center" | D2XNL7
-| C3ZNL0
+| align="center" | 31%
-| Putative uncharacterized protein OS=Branchiostoma floridae
+| align="center" | A0Y3T4
 |-
+| align="center" | 52%
-| 63%
+| align="center" | E9G1D2
-| G3MQ02
+| align="center" | 29%
-| Putative uncharacterized protein OS=Amblyomma maculatum
+| align="center" | D5HB02
 |-
+| align="center" | 44%
-| 62%
+| align="center" | I1GDE7
-| E4XIM4
+| align="center" | 27%
-| Whole genome shotgun assembly, reference scaffold set, scaffold scaffold_41 OS=Oikopleura dioica
+| align="center" | A1ZW97
 |-
+| align="center" | 41%
-| 61%
+| align="center" | H1L1J2
-| E9FTL2
+| align="center" | 96%
-| Putative uncharacterized protein OS=Daphnia pulex
+| align="center" | 5pah_A (pdb)
 |-
+| align="center" | 37%
-| 61%
+| align="center" | D7AKY2
-| D6WIQ7
+| align="center" | 96%
-| Putative uncharacterized protein OS=Tribolium castaneum
+| align="center" | 3pah_A (pdb)
 |-
+| align="center" | 36%
-| 95%
+| align="center" | B7GIR4
-| 1tg2_A (pdb)
+| align="center" | 96%
-| Phenylalanine-4-hydroxylase
+| align="center" | 2pah_A (pdb)
 |-
-| style="border-bottom:3px solid gray;" | 96%
+| style="border-bottom:3px solid gray;" align="center" | 35%
-| style="border-bottom:3px solid gray;" | 4pah_A (pdb)
+| style="border-bottom:3px solid gray;" align="center" | A0C973
-| style="border-bottom:3px solid gray;" | PHENYLALANINE HYDROXYLASE
+| style="border-bottom:3px solid gray;" align="center" | 33%
+| style="border-bottom:3px solid gray;" align="center" | 2v27_B (pdb)
 |-
 |}
+<center><small>'''<caption>''' Set with whole range of sequence identities including four pdb sequences. </caption></small></center>
+</figtable>
+=== ClustalW ===
+<figure id="clustalW"><small>
+[[File:Low_clustalw.png|thumb|left|1000px|'''a)''' Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with ClustalW]]
+[[File:Medium_clustalw.png|thumb|left|1000px|'''b)''' Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with ClustalW]]
+[[File:High_clustalw.png|thumb|left|1000px|'''c)''' Multiple Sequence Alignment of the "high" dataset with sequences >60% identity, which was created with ClustalW]]
+[[File:All_clustalw.png|thumb|left|1500px|'''d)''' Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with ClustalW]]
+<br clear=all>
+<center>'''<caption>''' Multiple sequence alignments for the four datasets '''a)''' low, '''b)''' medium, '''c)''' high and '''d)''' all created with ClustalW.</caption></center></small>
+</figure>
+<br clear=all>
+=== Muscle ===
+<figure id="muscle"><small>
+[[File:Low_muscle.png|thumb|left|1000px|'''a)''' Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with Muscle]]
+[[File:Medium_muscle.png|thumb|left|1000px|'''b)''' Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with Muscle]]
+[[File:High_muscle.png|thumb|left|1000px|'''c)''' Multiple Sequence Alignment of the "high" dataset with sequences >60% identity, which was created with Muscle]]
+[[File:All_muscle.png|thumb|left|1500px|'''d)''' Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with Muscle]]
+<br clear=all>
+<center>'''<caption>''' Multiple sequence alignments for the four datasets '''a)''' low, '''b)''' medium, '''c)''' high and '''d)''' all created with Muscle.</caption></center></small>
+</figure>
+<br clear=all>
+=== T-Coffee ===
+<figure id="tCoffee"><small>
+[[File:Low_tcoffee.png|thumb|left|1000px|'''a)''' Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with T-Coffee]]
+[[File:Medium_tcoffee.png|thumb|left|1000px|'''b)''' Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with T-Coffee]]
+[[File:High_tcoffee.png|thumb|left|1000px|'''c)''' Multiple Sequence Alignment of the "high" dataset with sequences >60% identity, which was created with T-Coffee]]
+[[File:All_tcoffee.png|thumb|left|1500px|'''d)''' Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with T-Coffee]]
+<br clear=all>
+<center>'''<caption>''' Multiple sequence alignments for the four datasets '''a)''' low, '''b)''' medium, '''c)''' high and '''d)''' all created with T-Coffee.</caption></center></small>
+</figure>
+<br clear=all>
+=== Expresso(3D-Coffee) ===
+[http://www.tcoffee.org/Projects/expresso/ Expresso] is an extension of 3D-Coffee and uses BLAST to search the PDB database for structures whose sequences are similar to the given sequences. These structures are then used to build the alignment. It is slower than T-Coffee itself, but if it finds enough structures it is more accurate than the other programs.
+Since we could not run Expresso on the server, we have used this [http://www.igs.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi?stage1=1&daction=EXPRESSO%283DCoffee%29::Regular website] for the multiple sequence alignments with Expresso (3D-Coffee).
+<figure id="expresso"><small>
+[[File:Low_expresso.png|thumb|left|1000px|'''a)''' Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with Expresso(3D-Coffee)]]
+[[File:Medium_expresso.png|thumb|left|1000px|'''b)''' Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with Expresso(3D-Coffee)]]
+[[File:High_expresso.png|thumb|left|1000px|'''c)''' Multiple Sequence Alignment of the "high" dataset with sequences > 60% identity, which was created with Expresso(3D-Coffee)]]
+[[File:All_expresso.png|thumb|left|1500px|'''d)''' Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with Expresso(3D-Coffee)]]
+<br clear=all>
+<center>'''<caption>''' Multiple sequence alignments for the four datasets '''a)''' low, '''b)''' medium, '''c)''' high and '''d)''' all created with Expresso.</caption></center></small>
+</figure>
+<br clear=all>
+=== Results ===
+The following tables (<xr id="low_set"/>, <xr id="medium_set"/> and <xr id="high_set"/>) show the number of gaps inserted into each sequence when the multiple alignments were created with ClustalW, Muscle, T-Coffee and Expresso. The sequence of PAH (P00439) is shown in bold, and pdb sequences are marked with *.<br>
+{|
+|<figtable id="low_set">
 {| border="1" cellpadding="5" cellspacing="0" align="center"
-|+'''Dataset "all":'''
+|+'''Gaps in Dataset "low":'''
 |-
-! colspan="3" style="background:#32CD32;" | Group of sequences with different identities (0-100%)
+! colspan="5" style="background:#32CD32;" | Group of sequences with < 30% (pdb = 32%) identity
 |-
-! style="background:#90EE90;" align="center" | Sequence identity
 ! style="background:#90EE90;" align="center" | ID
-! style="background:#90EE90;" align="center" | Protein Name
+! style="background:#90EE90;" align="center" | ClustalW
+! style="background:#90EE90;" align="center" | MUSCLE
+! style="background:#90EE90;" align="center" | T-COFFEE
+! style="background:#90EE90;" align="center" | EXPRESSO
 |-
+| align="center" | '''P00439'''
-| 55%
+| align="center" | 3
-| K7F3H7
+| align="center" | 9
-| Uncharacterized protein OS=Pelodiscus sinensis
+| align="center" | 3
+| align="center" | 3
 |-
+| align="center" | K0WHA3
-| 55%
+| align="center" | 230
-| F4PX76
+| align="center" | 236
-| Phenylalanine 4-monooxygenase OS=Dictyostelium fasciculatum
+| align="center" | 230
+| align="center" | 230
 |-
+| align="center" | L0FXW8
-| 53%
+| align="center" | 401
-| K7FZR2
+| align="center" | 407
-| Uncharacterized protein OS=Pelodiscus sinensis
+| align="center" | 401
+| align="center" | 401
 |-
+| align="center" | B4R9Q9
-| 52%
+| align="center" | 339
-| A6YIC4
+| align="center" | 345
-| Tryptophan hydroxylase OS=Platynereis dumerilii
+| align="center" | 339
+| align="center" | 339
 |-
+| align="center" | C9P8B8
-| 50%
+| align="center" | 223
-| C0KKU5
+| align="center" | 229
-| Tyrosine hydroxylase (Fragment) OS=Octopus vulgaris
+| align="center" | 223
+| align="center" | 223
 |-
+| align="center" | I3YW84
-| 48%
+| align="center" | 396
-| I2FKE7
+| align="center" | 402
-| Tyrosine hydroxylase long variant (Fragment) OS=Gryllus bimaculatus
+| align="center" | 396
+| align="center" | 396
 |-
+| align="center" | G0L2J6
-| 47%
+| align="center" | 336
-| K8YPQ2
+| align="center" | 342
-| Phenylalanine-4-hydroxylase OS=Nannochloropsis gaditana
+| align="center" | 336
+| align="center" | 336
 |-
+| align="center" | Q9AG78
-| 44%
+| align="center" | 399
-| B1B5B7
+| align="center" | 405
-| Chorismate mutase/prephenate dehydratase OS=uncultured Termite group 1 bacterium
+| align="center" | 399
+| align="center" | 399
 |-
+| align="center" | B4UJD0
-| 36%
+| align="center" | 280
-| L8LIE0
+| align="center" | 286
-| Prephenate dehydratase OS=Leptolyngbya sp.
+| align="center" | 280
+| align="center" | 280
 |-
+| align="center" | *1ltu_A
-| 35%
+| align="center" | 399
-| B9Y6K3
+| align="center" | 405
-| Putative uncharacterized protein OS=Holdemania filiformis
+| align="center" | 399
+| align="center" | 399
 |-
+| style="border-bottom:3px solid gray;" align="center" | *1ltz_A
-| 35%
+| style="border-bottom:3px solid gray;" align="center" | 237
-| A3QCV4
+| style="border-bottom:3px solid gray;" align="center" | 243
-| Phenylalanine 4-hydroxylase OS=Shewanella loihica
+| style="border-bottom:3px solid gray;" align="center" | 237
+| style="border-bottom:3px solid gray;" align="center" | 237
 |-
+|}
-| 34%
+<small>'''<caption>''' Number of gaps inserted when creating the multiple alignments of the dataset "low". The first column contains the sequence IDs, and the following columns show the number of gaps inserted by ClustalW, Muscle, T-Coffee and Expresso.</caption></small>
-| C1F688
+</figtable>
-| Phenylalanine-4-hydroxylase OS=Acidobacterium capsulatum
+|
+|<figtable id="medium_set">
+{| border="1" cellpadding="5" cellspacing="0" align="center"
+|+'''Gaps in Dataset "medium":'''
 |-
+! colspan="5" style="background:#32CD32;" | Group of sequences between 30% and  60% identity
-| 34%
-| K5BAA8
-| Prephenate dehydratase family protein OS=Mycobacterium hassiacum
 |-
+! style="background:#90EE90;" align="center" | ID
-| 34%
+! style="background:#90EE90;" align="center" | ClustalW
-| G4FWD8
+! style="background:#90EE90;" align="center" | MUSCLE
-| Chorismate mutase OS=Rhodanobacter sp.
+! style="background:#90EE90;" align="center" | T-COFFEE
+! style="background:#90EE90;" align="center" | EXPRESSO
 |-
+| align="center" | '''P00439'''
-| 33%
+| align="center" | 10
-| E2MCW6
+| align="center" | 21
-| Phenylalanine-4-hydroxylase OS=Pseudomonas syringae pv. tomato
+| align="center" | 25
+| align="center" | 29
 |-
+| align="center" | I1C3M2
-| 33%
+| align="center" | 37
-| J2JTB5
+| align="center" | 48
-| Phenylalanine-4-hydroxylase, monomeric form OS=Rhizobium sp.
+| align="center" | 52
+| align="center" | 56
 |-
+| align="center" | F6ZKP1
-| 95%
+| align="center" | 20
-| 1tg2_A  (pdb)
+| align="center" | 31
-| Phenylalanine-4-hydroxylase
+| align="center" | 35
+| align="center" | 39
 |-
+| align="center" | Q54XS1
-| 65%
+| align="center" | 43
-| 3hfb_A
+| align="center" | 54
-| Tryptophan 5-hydroxylase 1
+| align="center" | 58
+| align="center" | 62
 |-
+| align="center" | G1KSP0
-| 60%
+| align="center" | 287
-| 2xsn_D
+| align="center" | 298
-| TYROSINE 3-MONOOXYGENASE
+| align="center" | 302
+| align="center" | 306
 |-
+| align="center" | E5SYS4
-| style="border-bottom:3px solid gray;" | 32%
+| align="center" | 297
-| style="border-bottom:3px solid gray;" | 2qmw_B (pdb)
+| align="center" | 308
-| style="border-bottom:3px solid gray;" | Prephenate dehydratase
+| align="center" | 312
+| align="center" | 316
+|-
+| align="center" | H3EDU0
+| align="center" | 75
+| align="center" | 86
+| align="center" | 90
+| align="center" | 94
+|-
+| align="center" | H0HJI4
+| align="center" | 402
+| align="center" | 413
+| align="center" | 417
+| align="center" | 421
+|-
+| align="center" | F4WGX3
+| align="center" | 283
+| align="center" | 294
+| align="center" | 298
+| align="center" | 302
+|-
+| align="center" | *2toh_A
+| align="center" | 282
+| align="center" | 293
+| align="center" | 297
+| align="center" | 301
+|-
+| style="border-bottom:3px solid gray;" align="center" | *2qmx_A
+| style="border-bottom:3px solid gray;" align="center" | 416
+| style="border-bottom:3px solid gray;" align="center" | 427
+| style="border-bottom:3px solid gray;" align="center" | 431
+| style="border-bottom:3px solid gray;" align="center" | 435
 |-
+|}
+<small>'''<caption>''' Number of gaps inserted when creating the multiple alignments of the dataset "medium". The first column contains the sequence IDs, and the following columns show the number of gaps inserted by ClustalW, Muscle, T-Coffee and Expresso.</caption></small>
+</figtable>
+|
+|<figtable id="high_set">
+{| border="1" cellpadding="5" cellspacing="0" align="center"
+|+'''Gaps in Dataset "high":'''
+|-
+! colspan="5" style="background:#32CD32;" | Group of sequences with > 60% identity
+|-
+! style="background:#90EE90;" align="center" | ID
+! style="background:#90EE90;" align="center" | ClustalW
+! style="background:#90EE90;" align="center" | MUSCLE
+! style="background:#90EE90;" align="center" | T-COFFEE
+! style="background:#90EE90;" align="center" | EXPRESSO
+|-
+| align="center" | '''P00439'''
+| align="center" | 8
+| align="center" | 13
+| align="center" | 15
+| align="center" | 15
+|-
+| align="center" | H2UJM8
+| align="center" | 20
+| align="center" | 25
+| align="center" | 27
+| align="center" | 27
+|-
+| align="center" | K1RSS1
+| align="center" | 280
+| align="center" | 285
+| align="center" | 287
+| align="center" | 287
+|-
+| align="center" | E7D1A7
+| align="center" | 342
+| align="center" | 347
+| align="center" | 349
+| align="center" | 349
+|-
+| align="center" | G7YPD5
+| align="center" | 115
+| align="center" | 120
+| align="center" | 122
+| align="center" | 122
+|-
+| align="center" | D1LXB2
+| align="center" | 162
+| align="center" | 167
+| align="center" | 169
+| align="center" | 169
+|-
+| align="center" | G1KJG2
+| align="center" | 220
+| align="center" | 225
+| align="center" | 227
+| align="center" | 227
+|-
+| align="center" | G9B2G8
+| align="center" | 42
+| align="center" | 47
+| align="center" | 49
+| align="center" | 49
+|-
+| align="center" | E9RJV0
+| align="center" | 353
+| align="center" | 358
+| align="center" | 360
+| align="center" | 360
+|-
+| align="center" | *2pah_A
+| align="center" | 220
+| align="center" | 225
+| align="center" | 227
+| align="center" | 227
+|-
+| style="border-bottom:3px solid gray;" align="center" | *1kw0_A
+| style="border-bottom:3px solid gray;" align="center" | 160
+| style="border-bottom:3px solid gray;" align="center" | 165
+| style="border-bottom:3px solid gray;" align="center" | 167
+| style="border-bottom:3px solid gray;" align="center" | 167
+|-
+|}
+<small>'''<caption>''' Number of gaps inserted when creating the multiple alignments of the dataset "high". The first column contains the sequence IDs, and the following columns show the number of gaps inserted by ClustalW, Muscle, T-Coffee and Expresso.</caption></small>
+</figtable>
 |}
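Gap counts like those in the tables above can be obtained by counting the gap characters in each aligned sequence. The following is a minimal sketch that assumes the alignment is available in aligned FASTA format with `-` as the gap symbol; the function names and the toy alignment are hypothetical and not taken from the actual tool outputs.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of ID -> sequence."""
    sequences = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the sequence ID.
            current = line[1:].split()[0]
            sequences[current] = ""
        elif current is not None:
            sequences[current] += line
    return sequences

def count_gaps(alignment):
    """Count the gap characters ('-') in every aligned sequence."""
    return {seq_id: seq.count("-") for seq_id, seq in alignment.items()}

# Toy alignment (made up for illustration).
toy_alignment = """>seqA
MK-LV--A
>seqB
MKTLVGGA
>seqC
----VG-A
"""
gap_counts = count_gaps(read_fasta(toy_alignment))
```

Tools that write ClustalW-format output would need a different reader, but the counting step stays the same.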
-=== ClustalW ===
+=== Discussion ===
+Comparing the different tools, similar outcomes can be observed. As expected, the dataset with low similarities shows a lot of gaps. Only for our protein P00439 (PAH) few gaps are inserted, which means that the other sequences are shorter and only cover a small part of our protein, roughly from position 175 to 295. Only Muscle inserts more gaps than the other tools. For the dataset with identities between 30% and 60% higher conservation can be recognised. Again the results of the different tools are similar, although the conserved regions are sometimes shifted because of gaps inserted between them. The set with the highest sequence similarities shows the fewest gaps and only few differences between the tools; here ClustalW inserted the fewest gaps. For the datasets low and high, T-Coffee and Expresso have identical numbers of gaps. The multiple alignments can be viewed in <xr id="clustalW"/> (ClustalW), <xr id="muscle"/> (Muscle), <xr id="tCoffee"/> (T-Coffee) and <xr id="expresso"/> (Expresso). In each figure '''a)''' shows the MSA of the low, '''b)''' of the medium and '''c)''' of the high dataset. '''d)''' shows the MSA that was created with 20 sequences covering the whole range of similarity, with identities between 27% and 96%. Here T-Coffee and Expresso again show very similar outcomes, and Muscle does not differ much from them either. ClustalW, however, seems to have more problems producing a good alignment, even for the similar sequences, and the conserved regions are broken into pieces. Even though there are some sequences with low similarities, T-Coffee and Expresso show good alignments. Altogether we think that a medium sequence identity is sufficient to get a good MSA. Nevertheless, it is also possible to include sequences with lower identities. 
In such cases sequences with high similarities should also be included, and T-Coffee seems to be a good choice to create a reliable MSA.
- clustalw -align in.fasta
-=== Muscle ===
- muscle -in in.fasta -clw -out out.aln
-=== T-Coffee ===
- t_coffee in.fasta
-=== 3D-Coffee ===
- ...
+[[Category: Phenylketonuria 2013]]
-=== Discussion of the multiple sequence alignments and the used tools ===
-...