Sequence searches and multiple sequence alignments (Phenylketonuria)
Summary of the task
In this task we compare the protein sequence of interest, in this case the phenylalanine hydroxylase (PAH), to other protein sequences. Therefore both sequence searches and multiple sequence alignments were done using the big80 database meaning a database that contains subsets of swissprot and pdb, where the entries have a sequence similarity of 80% or less. Furthermore searches against a pdb database were done. For sequence searches the programs BLAST, PSIBLAST and HHblits are used. Their results were taken for the creation of multiple sequence alignments (MSA) using the methods ClustalW, Muscle and TCoffee.
Comparison of the results
In this part the results of the different sequence searches are analyzed. Therefore the outputs are parsed with Marias Parser and are filtered for their IDs, sequence identities and e-values.
Sequence identity in percent
</figure> </figure> </figure>
At comparing the sequence identities of the different search tools, it could be seen that BLAST (<xr id="fig:blast"/>) and the distributions of PSI-BLAST (<xr id="fig:psiblast"/>) with two iterations for both e-value cutoffs show similar distributions with a maximum frequency between 35% and 40% sequence identity. For ten iterations the curve is shifted a bit to the left with a maximum frequency at about 20%. Altogether, it can be seen that more sequences were found in ten iterations than in two, however with less sequence identity. This can be ascribed to the fact that the profile gets less specific in each iteration. The runs with an e-value cutoff of 10e-10 show a lower number of sequences than the runs with a cutoff of 0.002 as 10e-10 is a more significant cutoff. However, that cutoff is really strict and has some false negatives with higher probability, whereas the cutoff of 0.002 is not very significant and likely includes some false positives. The highest number of sequences was found in the HHblits search (<xr id="fig:hhblits"/>), which distribution of sequence identity shows high similarity to the distribution of PSI-BLAST run with ten iterations.
</figure> </figure> </figure>
The distributions of the logarithmic e-values of the sequence searches all look similar with a lowest and best value beneath -400. Nevertheless, it goes up to higher than 0 in BLAST (<xr id="fig:blast-evalue"/>) and HHblits(<xr id="fig:hhblits-evalue"/>). However, the maximum frequency for the e-values in BLAST is still in negativ range, whereas in HHblits it is in positive range meaning that the e-values are not as good. Best e-values are found for PSI-BLAST searches especially for 10 iterations (<xr id="fig:psiblast-evalue"/>). They all show lower frequency at lower e-values and their maximum at higher e-values.
For the reference sequence (P00439) GO-terms (<xr id="go"/>) were found on QuickGO. To look for similarities between the reference sequence and the sequences found in the searches we wrote a program which download those terms for all detected sequences and counted how often the GO terms of PAH are found for the other sequences: <figtable id="go">
|"GO numbers and terms of PAH and their number of occurence in the different search results"|
2 x 0.002
2 x 10e-10
10 x 0.002
10 x 10e-10
|GO:0016597||amino acid binding||554||808||818||712||713||5611|
|GO:0005506||iron ion binding||449||448||450||434||431||1043|
|GO:0009072||aromatic amino acid family metabolic process||449||448||450||434||431||1043|
acting on paired donors,
with incorporation or reduction of molecular oxygen,
reduced pteridine as one donor,
and incorporation of one atom of oxygen
|GO:0004505||phenylalanine 4-monooxygenase activity||207||207||207||205||205||442|
|GO:0006559||L-phenylalanine catabolic process||165||165||165||165||165||395|
|GO:0042423||catecholamine biosynthetic process||15||25||15||15||15||80|
|GO:0008652||cellular amino acid biosynthetic process||6||10||11||12||12||1762|
|GO:0046872||metal ion binding||7||6||6||6||6||1141|
|GO:0042136||neurotransmitter biosynthetic process||2||2||2||2||2||12|
|GO:0034641||cellular nitrogen compound metabolic process||0||0||0||0||0||4|
|GO:0044281||small molecule metabolic process||0||0||0||0||0||4|
GO terms and their number of occurence for the protein sequences found in different sequence searches: BLAST, PSI-BLAST with 2 iterations and e-value of 0.002 and 10e-10 and with 10 iterations with the same e-values and finally HHblits. </figtable> When you look at the GO-terms it is obvious that more general descriptions like 'metabolic process' or 'amino acid binding' are found very often. The more specific the GO-terms of PAH gets the less sequences in the searches have the same description. However, in HHblits some of the terms like for example 'catalytic activity' are detected more often as in the other searches even it is incorporated that there are much more sequences found in the HHblits clusters.
In this part the sequences with highest sequence identity and lowest e-value are compared for the different searches.
In the following table (<xr id="hhblits"/>) the four proteins with the highest sequence identity found in the HHblits run are presented. <figtable id="hhblits">
|Best sequence identities in HHblits|
</figtable> None of these sequences are found in BLAST or the four PSI-BLAST searches. However, some proteins with highest sequence identity found with BLAST also can be found with PSI-BLAST (<xr id="blast"/>). <figtable id="blast">
|Best sequence identities in BLAST|
</figtable> Additionally the sequences (F6XY00, L9L9N2, G5AMD7, G1KSL1) also are the best four in the PSI-BLAST runs only not in the same order but with better e-values than in BLAST. They are not included in the HHblits result. Only D3YZ73 is not under the best in PSI-BLAST and is not found after ten iterations at all. However, in HHblits it is found with a sequence identity of 0.71.
The best e-values are those which are lowest. In HHblits the three lowest values are 5.5e-175 (<xr id="hh-eval"/>). <figtable id="hh-eval">
|Best e-values in HHblits|
</figtable> Again none of these protein sequences are found in BLAST or the four PSI-BLAST searches. The proteins with best e-values of the BLAST run (<xr id="bl-eval"/>), however, are found in all searches, besides H3FAJ0, which is not included in the HHblits result. They have very good e-values in all runs, but are under the three best in BLAST only. <figtable id="bl-eval">
|Best e-values in BLAST|
At PSI-BLAST the three best proteins of the run with two iterations and an e-value cutoff of 10e-10 and the run with ten iterations and an e-value cutoff of 0.002 are the same. The best one is H2UJM8 with an e-value of 1e-179. For two iterations and e-value cutoff of 0.002 it is Q4VBE2 with an e-value of 1e-178 and for ten iterations and e-value cutoff of 10e-10 it is K7F3H7 with an e-value of 1e-143, which is good, but not as good as the other best ones.
When you compare sequence identity and e-value, that sequences with best identities often have not as good e-values and vice versa is remarkable. Especially in the BLAST runs e-values of only 0.0 for the sequences with highest sequence similarity are reached and the sequences with lowest e-values only have sequence identities slightly above 54%. Therefore you can see that a good balance between high sequence identity and low e-value is very hard to get, but necessary to get trustable results.
Multiple sequence alignments
For the multiple sequence alignments four different datasets were generated with the BLAST outputs and a Python script. Three groups with ten sequences (two of them are pdb-sequences): one with higher than 60% sequence identity to our target sequence (PAH gene), another group with sequence identity between 30% and 60% and one group with lower sequence identity than 30% (<xr id="datasets"/>). In the group with < 30% identity the pdb sequences have a 32% identity, because these are the lowest ones found in the Blast output against the pdb dataset. The fourth group contains 20 sequences with a whole range of sequence identities and four of them are pdb sequences (<xr id="set_all"/>). For the comparison to our reference sequence of the Phenylalanine hydroxylase (PAH - P00439) enzyme we added this sequence to all four groups. <figtable id="datasets">
|Group of sequences with different identities (0-100%)|
|Sequence identity||ID||Sequence identity||ID|
Expresso is an extension of 3D-Coffee and uses BLAST to search the PDB database for structures whose sequences are similar to the given sequences. These structures are then used to build the alignment. It is slowlier than T-Coffee itself, but if it finds enough structures it is more accurate than the other programms.
Since we could not run Expresso on the server, we have used this website for the multiple sequence alignments with Expresso (3D-Coffee).
The following tables (<xr id="low_set"/>,<xr id="medium_set"/> and <xr id="high_set"/>) show the gaps that are included during the creation of the multiple alignments by ClustalW, Muscle, T-Coffee and Expresso. Thereby the sequence of PAH (P00439) is shown in bold, whereas pdb sequences are marked with *.
Comparing the different tools similar outcomes can be viewed. As expected the dataset with low similarities shows a lot of gaps. Only for our protein P00439 (PAH) few gaps are included. Therefore, the other sequences are shorter and only cover a small part of our protein, which goes roughly from position 175 to 295. Only Muscle shows more gaps than the other tools. For the dataset with identities between 30% and 60% higher conservations could be recognised. Again the results of the different tools are similar, whereby sometimes the conserved regions are shifted due to some gaps inserted between them. The set with the highest sequence similarities shows least gaps and also only few differences between the tools. ClustalW inserted fewest gaps. For the datasets low and high T-Coffee and Expresso have identical numbers of gaps. The multiple alignments can be viewed in <xr id="clustalW"/> with the results of ClustalW,in <xr id="muscle"/> with the results of Muscle, in <xr id="tCoffee"/> with the results of T-Coffee and in <xr id="expresso"/> with the results of Expresso. Thereby a) always comprises the MSA of the low, b) of the medium and c) of the high dataset. d), on the other hand, shows the MSA, which was created with 20 sequences that covers the whole range of similarity with identities between 27% and 96%. Thereby, T-Coffee and Expresso again show very similar outcomes and also Muscle is not that different to those, too. ClustalW, however, seems to have more problems to make a good alignment even with the similar sequences and the conserved amino acids are picked to pieces. Even there are some sequences with low similarities T-Coffee and Expresso show good alignments. Altogether we think that a medium sequence identity is sufficient to get a good MSA. Nevertheless, it is also possible to include sequences with less identities. In such cases also sequences with high similarities should be included and T-Coffee seems to be a good choice to create a reliable MSA.