Difference between revisions of "Sequence Search and Multiple Sequence Alignment (PKU)"
(→Discussion of the alignment results and alignment tools) |
|||
(57 intermediate revisions by 2 users not shown) | |||
Line 2: | Line 2: | ||
The goal of this task is to learn a bit about our protein of interest (phenylalanine hydroxylase/PheOH) from existing databases and comparing it to similar (and probably related) sequences. To do this, we searched the big80 database (7.73 mio sequences), which contains a subset of swissprot and pdb with sequence similarity of 80% or less, with blast (sequence comparison) and psi-blast (profile comparison) and searched a preprocessed and clustered version of uniprot (44,757 clusters) with the more advanced HHblits method (HMM comparison). |
The goal of this task is to learn a bit about our protein of interest (phenylalanine hydroxylase/PheOH) from existing databases and comparing it to similar (and probably related) sequences. To do this, we searched the big80 database (7.73 mio sequences), which contains a subset of swissprot and pdb with sequence similarity of 80% or less, with blast (sequence comparison) and psi-blast (profile comparison) and searched a preprocessed and clustered version of uniprot (44,757 clusters) with the more advanced HHblits method (HMM comparison). |
||
− | From the search results, we created several datasets sorted by sequence similarity (highly conserved, less conserved, a little conserved, ...) and created multiple aligmnents from these sets with four different tools. The results are listed below, the commands, |
+ | From the search results, we created several datasets sorted by sequence similarity (highly conserved, less conserved, a little conserved, ...) and created multiple aligmnents from these sets with four different tools. The results are listed below, the commands, programs and scripts we used are listed at the appropriate places in the text or linked to their own [[Collection_of_scripts|site]]. |
==Reference Sequence of PAH== |
==Reference Sequence of PAH== |
||
Line 18: | Line 18: | ||
==Database Searches== |
==Database Searches== |
||
+ | To increase productivity and reproducibility we used a variety of scripts here. Those which are not just basic R or bashprograms will be listed here |
||
+ | [[Collection_of_scripts#getEvalues.pl|getEvalues.pl]] to derive the eValues of all Blast, PSIBlast or HHBlits outputfiles |
||
+ | [[Collection_of_scripts#HHextract.py|HHextract.py]] to extract all UNiprotIds from the clusters of a HHblits output |
||
===Blast=== |
===Blast=== |
||
− | |||
time blast2 -p blastp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i Dropbox/Phenylketonuria/Task2/PAH.fasta -o results_blast2_standard -v 2000 -b 2000 |
time blast2 -p blastp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i Dropbox/Phenylketonuria/Task2/PAH.fasta -o results_blast2_standard -v 2000 -b 2000 |
||
Line 32: | Line 34: | ||
sys 0m9.270s |
sys 0m9.270s |
||
+ | <figtable id="hhblitsevalue"> |
||
+ | {| class="wikitable" style="float: center; margin: auto" cellpadding="2" |
||
+ | ! scope="row" align="left" | |
||
+ | | align="right" | [[File:pkublastevald.png|thumb|300px| Histogram of the logarithmic E-values standard BLAST run]] |
||
+ | | align="right" | [[File:pkublastident.png|thumb|300px| Histogram of the sequenceidentities for a standard BLAST run]] |
||
+ | |- |
||
+ | |} |
||
+ | </figtable> |
||
===PSIBlast=== |
===PSIBlast=== |
||
− | In order to display the runtime PSIBlast uses in order to get results you asked for we used a little [[Collection of scripts#PSIBlasttrainer.pl|skript]] which uses PSIblast with a variety of |
+ | In order to display the runtime PSIBlast uses in order to get results you asked for we used a little [[Collection of scripts#PSIBlasttrainer.pl|skript]] which uses PSIblast with a variety of parameter settings. The overall message is that the more iterations one uses the more run-time PSIBlast needs. And the harder (lower) you choose your e-Value cutoff the less results you get. A general Message cant be more precise, because all the results depend mostly on the data you feed the programm. |
====Iterations==== |
====Iterations==== |
||
− | If you change the number of iterations (parameter -j) your runtime increases according to the parameter. To show this we used the |
+ | If you change the number of iterations (parameter -j) your runtime increases according to the parameter. To show this we used the default settings for anything and increased the iterations from 1 to 10 (see <xr id="fig:runtimeIterations"/>) |
− | <figure id="fig:runtimeIterations">[[File:pkuruntimeiterations.png|thumb| |
+ | <figure id="fig:runtimeIterations">[[File:pkuruntimeiterations.png|thumb|400px||center|<caption>more iterations result in a higher runtime, but a change in -e or -h do not mean a higher or lower runtime by themselves. A combination of more iterations and a lower integrations rate might be faster as less iterations though</caption>]]</figure> |
+ | |||
+ | |||
+ | <figure id="fig:overlap_psiblast_iterations">[[Image:Venn psiblast.png|200px|thumb|center|<caption>Overlap of PSIBlast (with 1, 3, 5, 7 and 10 iterations) search results in big80</caption>]]</figure> |
||
+ | <xr id="fig:overlap_psiblast_iterations"/> shows the overlap of five PSIBlast searches with different number of iterations. It seems that more than five iterations scarcely result in more hits and of the nine proteins added from 7 to 10 iterations, a high number (33%) are again uncharacterized proteins. |
||
====e-Value cutoff==== |
====e-Value cutoff==== |
||
− | A change in the -e parameter does not change your runtime (see <xr id=fig:runtimeIterations />) but in- or decreases the amount of sequences you derive from your search. This is shown in <xr id="fig:cutoffSize"/> <figure id="fig:cutoffSize">[[File:pkueValuecutoffsize.png|thumb|300px|<caption>Change in the -e parameter results in a lower resultsize</caption>]]</figure> |
+ | A change in the -e parameter does not change your runtime (see <xr id=fig:runtimeIterations />) but in- or decreases the amount of sequences you derive from your search. This is shown in <xr id="fig:cutoffSize"/> <figure id="fig:cutoffSize">[[File:pkueValuecutoffsize.png|thumb|center|300px|<caption>Change in the -e parameter results in a lower resultsize</caption>]]</figure> |
====Plots==== |
====Plots==== |
||
− | All of the following plots |
+ | All of the following plots emphasize for our [[Phenylketonuria|example]] how a change of parameter changes the outcome of the searchtool. |
=====e-Values===== |
=====e-Values===== |
||
+ | <figtable id="psiblasteValue"> |
||
− | <figure id="fig:cutoffhisto1">[[File:pkueValuecutoffhisto1.png|thumb|left|300px|<caption>almost no restriction for the e-Values e=0.1</caption>]]</figure><figure id="fig:cutoffhisto2">[[File:pkueValuecutoffhisto2.png|thumb|left|300px|<caption>a bit harded restriction to the e-Values e=0.0001</caption>]]</figure><figure id="fig:cutoffhisto3">[[File:pkueValuecutoffhisto3.png|thumb|left|300px|<caption>even stronger restriction to e=0.000000001</caption>]]</figure><br> |
||
+ | {| class="wikitable" style="float: center; margin: auto" cellpadding="3" |
||
− | <xr id="fig:cutoffhisto1"/>, <xr id="fig:cutoffhisto2"/> and <xr id="fig:cutoffhisto3"/> show that the distribution of the e-Values one derives from their search changes if you restrict them. Which also decreases the size of your result as shown above. |
||
+ | ! scope="row" align="left" | |
||
+ | | align="right" | [[File:pkueValuecutoffhisto1.png|thumb|left|300px|<caption>different restrictions to the minimum e-Value</caption>|log e-Values with almost no restriction]] |
||
+ | | align="right" | [[File:pkueValuecutoffhisto2.png|thumb|left|300px|log e-Values with more restriction]] |
||
+ | | align="right" | [[File:pkueValuecutoffhisto3.png|thumb|left|300px|log e-Values with strong restriction]] |
||
+ | |- |
||
+ | |} |
||
+ | </figtable> |
||
+ | |||
+ | |||
+ | <xr id="psiblasteValue" /> shows that the distribution of the e-Values one derives from their search changes if you restrict them. Which also decreases the size of your result as shown above. |
||
+ | <figtable id="psiblasteValueit"> |
||
+ | {| class="wikitable" style="float: center; margin: auto" cellpadding="3" |
||
+ | ! scope="row" align="left" | |
||
+ | | align="right" | [[File:pkueValuecutoffithisto1.png|thumb|left|300px|log e-Values after one iterations]] |
||
+ | | align="right" | [[File:pkueValuecutoffithisto2.png|thumb|left|300px|log e-Values after five iterations]] |
||
+ | | align="right" | [[File:pkueValuecutoffithisto3.png|thumb|left|300px|log e-Values after ten iterations]] |
||
+ | |- |
||
+ | |} |
||
+ | </figtable> |
||
+ | In <xr id="psiblasteValue" /> one can clearly see that further iterations of the algorithm add some low e-Value sequences but the majority of the profile is getting fuzzy because the number of sequences with a e-Value around 0 increases the most. |
||
+ | <br style="clear:both;"> |
||
=====identities===== |
=====identities===== |
||
+ | <figtable id="psiblastidentity"> |
||
− | <figure id="fig:cutoffhistoident1">[[File:pkuidentitycutoffhisto1.png|thumb|left|300px|<caption>almost no restriction for the e-Values e=0.1</caption>]]</figure><figure id="fig:cutoffhistoident2">[[File:pkuidentitycutoffhisto2.png|thumb|left|300px|<caption>a bit harded restriction to the e-Values e=0.0001</caption>]]</figure><figure id="fig:cutoffhistoident3">[[File:pkuidentitycutoffhisto3.png|thumb|left|300px|<caption>even stronger restriction to e=0.000000001</caption>]]</figure><br> |
||
+ | {| class="wikitable" style="float: center; margin: auto" cellpadding="3" |
||
+ | ! scope="row" align="left" | |
||
+ | | align="right" | [[File:pkuidentitycutoffhisto1.png|thumb|left|300px|<caption>different restrictions to the minimum e-Value</caption>|identities of sequences with a low restriction to the e-Values]] |
||
+ | | align="right" | [[File:pkuidentitycutoffhisto2.png|thumb|left|300px|identities of sequences with a moderate restriction to the e-Values]] |
||
+ | | align="right" | [[File:pkuidentitycutoffhisto3.png|thumb|left|300px|identities of sequences with a high restriction to the e-Values]] |
||
+ | |- |
||
+ | |} |
||
+ | </figtable> |
||
+ | |||
+ | <xr id="psiblastidentity"/> shows that the identity of the sequences peak at about 30% with restriction to lower e-Values one removes not only low-identity-sequences, but also some with higher identity. |
||
+ | |||
+ | |||
+ | <figtable id="psiblastidentityit"> |
||
+ | {| class="wikitable" style="float: center; margin: auto" cellpadding="3" |
||
+ | ! scope="row" align="left" | |
||
+ | | align="right" | [[File:pkuidentitycutoffithisto1.png|thumb|left|300px|<caption>different restrictions to the minimum e-Value</caption>|identities of sequences with a low restriction to the e-Values]] |
||
+ | | align="right" | [[File:pkuidentitycutoffithisto2.png|thumb|left|300px|identities of sequences with a moderate restriction to the e-Values]] |
||
+ | | align="right" | [[File:pkuidentitycutoffithisto3.png|thumb|left|300px|identities of sequences with a high restriction to the e-Values]] |
||
+ | |- |
||
+ | |} |
||
+ | </figtable> |
||
+ | One can clearly see in <xr id="psiblastidentityit"/> that the distribution of identities shifts its peak from 0.4 to about 0.25 by adding more and more sequences |
||
+ | <br style="clear:both;"> |
||
+ | |||
===HHBlits=== |
===HHBlits=== |
||
time hhblits -i Dropbox/Phenylketonuria/Task2/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -o results_hhblits_standard -E 1 -e 0.001 -Z 2000 -B 2000 |
time hhblits -i Dropbox/Phenylketonuria/Task2/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -o results_hhblits_standard -E 1 -e 0.001 -Z 2000 -B 2000 |
||
Line 61: | Line 120: | ||
user 4m40.113s |
user 4m40.113s |
||
sys 1m58.910s |
sys 1m58.910s |
||
+ | ====results==== |
||
+ | <figtable id="hhblitsevalue"> |
||
+ | {| class="wikitable" style="float: center; margin: auto" cellpadding="2" |
||
+ | ! scope="row" align="left" | |
||
+ | | align="right" | [[File:pkuhhblitsevaluestd.png|thumb|300px| Histogram of the logarithmic E-values of the HHBlits hits with standard parameter]] |
||
+ | | align="right" | [[File:pkuhhblitsevaluerestr.png|thumb|300px| Histogram of the logarithmic E-values of the HHBlits hits with restricted parameter]] |
||
+ | |- |
||
+ | |} |
||
+ | </figtable> |
||
+ | |||
+ | |||
+ | <figtable id="hhblitsident"> |
||
+ | {| class="wikitable" style="float: center; margin: auto" cellpadding="2" |
||
+ | ! scope="row" align="left" | |
||
+ | | align="right" | [[File:pkuhhblitsidentstd.png|thumb|300px| Histogram of the identities in percent of a standard HHBlits run]] |
||
+ | | align="right" | [[File:pkuhhblitsidentrestr.png|thumb|300px| Histogram of the identities in percent HHBlits run with restricted parameters]] |
||
+ | |- |
||
+ | |} |
||
+ | </figtable> |
||
+ | |||
+ | <br style="clear:both;"> |
||
+ | |||
+ | |||
+ | <figure id="fig:overlap_hhblits_iterations">[[Image:Venn_result_hhblits.png|200px|thumb|right|<caption>Overlap of HHBlits (with 4 (blue) and 2 (red) iterations) search results in big80</caption>]]</figure> |
||
+ | <xr id="fig:overlap_hhblits_iterations"/> shows the overlap between the results of different searches with HHBlits with two and four iterations respectively. The large number of unique sequences simply comes from two large clusters and one smaller cluster that are found and lost respectively when the number of iterations is doubled. |
||
+ | |||
+ | ===Comparison of the Searchtools=== |
||
+ | |||
+ | <figure id="fig:overlap_blast_psiblast">[[Image:Venn blast psiblast.png|200px|thumb|right|<caption>Overlap of PSIBlast (with 1, 5 and 10 iterations from left to right) and Blast (rightmost) search results in big80</caption>]]</figure> |
||
+ | |||
+ | Figure <xr id="fig:overlap_blast_psiblast"/> shows the overlap between Blast and PSIBlast with different numbers of iterations. PSIBlast finds as expected much more results at the same default E-value. Much more important is the quality of the results of PSIBlast. A look in the results shows indeed a high number of uncharacterized proteins but still dehydratases and hydroxlyases with annotations similar to our PheOH. |
||
+ | |||
+ | For HHBlits we can compare the results to the other tools indirectly: (PSI)Blast searched in the 'clustered' big_80, which yielded only sequences from Uniprot during our searches, HHBlits searched in a clustered Uniprot release. Therefore we simply check, which clusters of HHBlits were also hit by (PSI)Blast, assuming that the clusters of HHBlits and the representatives in big_80 cover a similar spectrum of sequences. |
||
+ | |||
+ | HHBlits with standard parameters finds 242 clusters of sequences significantly (with E-Value<0.01) similar to PheOH. Only 120 clusters contain sequences also found with Blast with standard parameters. Depending on the number of iterations of PSIBlast, only 86 (with 1 iteration) to 94 (with 10 iterations) clusters found with HHBlits contain sequences found with PSIBlast. From this, we conclude that HHBlits searches much more sensitive than the traditional Blast algorithm, but with considerable costs in execution time. |
||
+ | |||
+ | ===GOTerms=== |
||
+ | After analyzing the output of each of those tools through standard numeric methods we wanted to see if one can see a quality difference with something more concrete than e-Values and identities. So we used GOAnnotations. In order to compare the results from HHBlits with BLAST and PSIBlast we downloaded the annotations for each hit and compared them with the annotations of the reference sequence. We used the RESTful webservice of [http://www.ebi.ac.uk/ http://www.ebi.ac.uk/] <br>[https://www.dropbox.com/s/pczwfu1yha49qf2/GOAnalyzer.jar Here] you can get the .jar containing source and class files of the program.<br> |
||
+ | Our reference sequence has the following annotated GO:Terms |
||
+ | * [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0055114| GO:0055114 oxidation-reduction process] |
||
+ | * [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0042423| GO:0042423 catecholamine biosynthetic process] |
||
+ | * [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0016597| GO:0016597 amino acid binding] |
||
+ | * [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0016714| GO:0016714 oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, reduced pteridine as one donor, and incorporation of one atom of oxygen] |
||
+ | * [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0042136| GO:0042136 neurotransmitter biosynthetic process] |
||
+ | * [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0016491| GO:0016491 oxidoreductase activity] |
||
+ | * [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0008152| GO:0008152 metabolic process] |
||
+ | * [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0034641| GO:0034641 cellular nitrogen compound metabolic process] |
||
+ | * [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0004497| GO:0004497 monooxygenase activity] |
||
+ | * [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0005829| GO:0005829 cytosol] |
||
+ | * [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0046872| GO:0046872 metal ion binding] |
||
+ | * [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0003824| GO:0003824 catalytic activity] |
||
+ | * [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0044281| GO:0044281 small molecule metabolic process] |
||
+ | * [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0008652| GO:0008652 cellular amino acid biosynthetic process] |
||
+ | * [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0004505| GO:0004505 phenylalanine 4-monooxygenase activity] |
||
+ | * [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0009072| GO:0009072 aromatic amino acid family metabolic process] |
||
+ | * [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0005506| GO:0005506 iron ion binding] |
||
+ | * [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0006559| GO:0006559 L-phenylalanine catabolic process] |
||
+ | |||
+ | |||
+ | <figtable id="goidentity"> |
||
+ | {| class="wikitable" style="float: center; margin: auto" cellpadding="3" |
||
+ | ! scope="row" align="left" | |
||
+ | |+ <caption>GoTerm identity comparsion between BLAST PSIBLAST and HHBlits</caption> |
||
+ | | align="right" | <figure id="fig:identityblast">[[File:pkugoidentityblast.png|thumb|left|300px|a) Histogramm of the Gotermidentity gained from BLAST hits]]</figure> |
||
+ | | align="right" | [[File:pkugoidentitypsi.png|thumb|left|300px|b) Histogramm of the Gotermidentity gained from PSIBLAST hits]] |
||
+ | | align="right" | [[File:pkugoidentityhhblits.png|thumb|left|300px|c) Histogramm of the Gotermidentity gained from HHBlits hits]] |
||
+ | |- |
||
+ | |} |
||
+ | </figtable> |
||
+ | In <xr id="goidentity"/> one can see the performance increase from pure BLAST to PSIBlast but due to the increased hitcount on HHBLits with its clusters one can barely compare a result where single sequences are checked versus a cluster to a pairwise check. This check seems to be a further step in order to compare results from those three search tools. <br> |
||
+ | In conclusion I would prefer PSIBLast due to its rather easy usage. But a distinct "best" search tool could not be found for our data. It still depends on rather subjective measurements, than real objective proof. |
||
+ | |||
===PDB=== |
===PDB=== |
||
− | Instead of mapping hits in big_80 to PDB to get structural information for the alignments, we performed an additional search in PDB at NCBI, with parameters: scoring matrix |
+ | Instead of mapping hits in big_80 to PDB to get structural information for the alignments, we performed an additional search in PDB at NCBI, with parameters equivalent to our blast search in big_80: The scoring matrix used is Blosum62, gap open = 10, gap extend = 1 with composition based statistics. The results yielded no hits in the range of 80%-99% identity but still helped to complete our datasets as shown below. |
==MSA== |
==MSA== |
||
===Datasets=== |
===Datasets=== |
||
− | We tried to create datasets of 4 sequences + reference sequence for the identity ranges (1: 90%-99%, 2: 60%-89%, 3: 40%-59%, 4: 20%-39%) from our search in the big_80 database. Since these results do not contain PDB structures, we additionally searched PDB directly for proteins in the required range and added them to the dataset. For the most conserved range, there is only 1 sequence in big_80, so we experimentally created the best possible highly conserved dataset in the range 80%-90% and used the structure of the reference sequence for 3D-coffee. The resulting datasets are shown in the following tables, the alignments in the figures <xr id="fig:ClustalW_set4"/> to <xr id="fig:3d_coffee_set1"/>. All sequences have roughly the same length as the reference sequence except for G5AMD7 in the first set. G5AMD7 contains an insertion of 162 aa |
+ | We tried to create datasets of 4 sequences + reference sequence for the identity ranges (1: 90%-99%, 2: 60%-89%, 3: 40%-59%, 4: 20%-39%) from our search in the big_80 database. Since these results do not contain PDB structures, we additionally searched PDB directly for proteins in the required range and added them to the dataset. For the most conserved range, there is only 1 sequence in big_80, so we experimentally created the best possible highly conserved dataset in the range 80%-90% and used the structure of the reference sequence for 3D-coffee. The resulting datasets are shown in the following tables, the alignments in the figures <xr id="fig:ClustalW_set4"/> to <xr id="fig:3d_coffee_set1"/>. All sequences have roughly the same length as the reference sequence except for G5AMD7 in the first set. G5AMD7 contains an insertion of 162 aa, that is easily identified in the alignment, but shows a very high similarity in the other sections. |
− | {| align="left" border="1" |
+ | {| align="left"; border="1" |
|- |
|- |
||
!colspan="4"| 80-90% Sequence Identity |
!colspan="4"| 80-90% Sequence Identity |
||
Line 94: | Line 225: | ||
|} |
|} |
||
− | {| align=" |
+ | {| align="left"; border="1" |
|- |
|- |
||
!colspan="4"| 60-89% Sequence Identity |
!colspan="4"| 60-89% Sequence Identity |
||
Line 118: | Line 249: | ||
| Phenylalanine hydroxlase OS=Saccoglossus kowalevskii |
| Phenylalanine hydroxlase OS=Saccoglossus kowalevskii |
||
|- |
|- |
||
− | | |
+ | | 61% |
| 2XSN_A |
| 2XSN_A |
||
| human Tyrosine 3-Monooxygenase, also used as 3D-Template |
| human Tyrosine 3-Monooxygenase, also used as 3D-Template |
||
|} |
|} |
||
− | {| align="left" border="1" |
+ | {| align="left"; border="1" |
|- |
|- |
||
!colspan="4"| 40-59% Sequence Identity |
!colspan="4"| 40-59% Sequence Identity |
||
Line 152: | Line 283: | ||
|} |
|} |
||
− | {| align=" |
+ | {| align="left"; border="1" |
|- |
|- |
||
!colspan="4"| 20-39% Sequence Identity |
!colspan="4"| 20-39% Sequence Identity |
||
Line 567: | Line 698: | ||
====3D-coffee==== |
====3D-coffee==== |
||
Command to create the alignments: |
Command to create the alignments: |
||
− | t_coffee NN.fasta -method sap_pair,slow_pair -template_file |
+ | t_coffee NN.fasta -method sap_pair,slow_pair -template_file EXPRESSO |
Line 704: | Line 835: | ||
In the most conserved dataset the aligments differ only marginally and due to the high overall conservation, the functional domains are obviously aligned correctly as well. For the other sets of sequences, we looked more closely at the central catalytic and the N-terminal regulatory domain: |
In the most conserved dataset the aligments differ only marginally and due to the high overall conservation, the functional domains are obviously aligned correctly as well. For the other sets of sequences, we looked more closely at the central catalytic and the N-terminal regulatory domain: |
||
− | The catalytic domain of PheOH appears around 280aa-340aa, with the iron-coordinating His285, His290 and Glu330. (needs reference..) This part of the sequence is recognized by all tools in all datasets and the coordinating residues are aligned correctly. In all but the least conserved sets, the alignments are virtually identical in the functional parts and differ only in the placement of gaps. The most differences appear at the N-terminus and likely stem from the different lengths of the sequences that have to be reconciled in the alignment. |
+ | The catalytic domain of PheOH appears around 280aa-340aa, with the iron-coordinating His285, His290 and Glu330. <!--(needs reference..)--> This part of the sequence is recognized by all tools in all datasets and the coordinating residues are aligned correctly. In all but the least conserved sets, the alignments are virtually identical in the functional parts and differ only in the placement of gaps. The most differences appear at the N-terminus and likely stem from the different lengths of the sequences that have to be reconciled in the alignment. |
A real difference in the results of the tools can be seen near the N-terminus of set 4. T-Coffe, 3D-Coffee and Muscle find a conserved region around Trp187, while ClustalW misaligns D0I5S9, which is the shortest sequence of the set. Look at positions 190-210 of the alignment by ClustalW and 210-230 by the others. |
A real difference in the results of the tools can be seen near the N-terminus of set 4. T-Coffe, 3D-Coffee and Muscle find a conserved region around Trp187, while ClustalW misaligns D0I5S9, which is the shortest sequence of the set. Look at positions 190-210 of the alignment by ClustalW and 210-230 by the others. |
||
Our favorite tools for this task was probably T-Coffee, due to good results and the wide range of option. Clustal produces less gaps, but misaligned at least one sequence. Muscle does not perform much different from T-Coffee, but provides less functionality. The advantage of muscle, its emphasis on speed, is, as stated before, not taken into account here. |
Our favorite tools for this task was probably T-Coffee, due to good results and the wide range of option. Clustal produces less gaps, but misaligned at least one sequence. Muscle does not perform much different from T-Coffee, but provides less functionality. The advantage of muscle, its emphasis on speed, is, as stated before, not taken into account here. |
||
+ | |||
+ | [[Category: Phenylketonuria 2012]] |
Latest revision as of 11:47, 29 August 2012
Short Task Description
The goal of this task is to learn a bit about our protein of interest (phenylalanine hydroxylase/PheOH) from existing databases and comparing it to similar (and probably related) sequences. To do this, we searched the big80 database (7.73 mio sequences), which contains a subset of swissprot and pdb with sequence similarity of 80% or less, with blast (sequence comparison) and psi-blast (profile comparison) and searched a preprocessed and clustered version of uniprot (44,757 clusters) with the more advanced HHblits method (HMM comparison).
From the search results, we created several datasets sorted by sequence similarity (highly conserved, less conserved, a little conserved, ...) and created multiple aligmnents from these sets with four different tools. The results are listed below, the commands, programs and scripts we used are listed at the appropriate places in the text or linked to their own site.
Reference Sequence of PAH
This is the sequence of the fully functional PheOH used as query and reference sequence in all our work.
>sp|P00439|PH4H_HUMAN Phenylalanine-4-hydroxylase OS=Homo sapiens GN=PAH PE=1 SV=1 MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR IEVLDNTQQLKILADSINSEIGILCSALQKI
Database Searches
To increase productivity and reproducibility we used a variety of scripts here. Those which are not just basic R or bashprograms will be listed here
getEvalues.pl to derive the eValues of all Blast, PSIBlast or HHBlits outputfiles HHextract.py to extract all UNiprotIds from the clusters of a HHblits output
Blast
time blast2 -p blastp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i Dropbox/Phenylketonuria/Task2/PAH.fasta -o results_blast2_standard -v 2000 -b 2000 real 1m34.784s user 1m22.930s sys 0m8.400s
time blast2 -p blastp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i Dropbox/Phenylketonuria/Task2/PAH.fasta -o results_blast2_e-10 -e 0.0000000001 -v 2000 -b 2000 real 1m34.145s user 1m23.380s sys 0m9.270s
<figtable id="hhblitsevalue">
</figtable>
PSIBlast
In order to display the runtime PSIBlast uses in order to get results you asked for we used a little skript which uses PSIblast with a variety of parameter settings. The overall message is that the more iterations one uses the more run-time PSIBlast needs. And the harder (lower) you choose your e-Value cutoff the less results you get. A general Message cant be more precise, because all the results depend mostly on the data you feed the programm.
Iterations
If you change the number of iterations (parameter -j) your runtime increases according to the parameter. To show this we used the default settings for anything and increased the iterations from 1 to 10 (see <xr id="fig:runtimeIterations"/>)
<figure id="fig:runtimeIterations">
</figure>
<figure id="fig:overlap_psiblast_iterations">
</figure>
<xr id="fig:overlap_psiblast_iterations"/> shows the overlap of five PSIBlast searches with different number of iterations. It seems that more than five iterations scarcely result in more hits and of the nine proteins added from 7 to 10 iterations, a high number (33%) are again uncharacterized proteins.
e-Value cutoff
A change in the -e parameter does not change your runtime (see <xr id=fig:runtimeIterations />) but in- or decreases the amount of sequences you derive from your search. This is shown in <xr id="fig:cutoffSize"/> <figure id="fig:cutoffSize">
</figure>
Plots
All of the following plots emphasize for our example how a change of parameter changes the outcome of the searchtool.
e-Values
<figtable id="psiblasteValue">
</figtable>
<xr id="psiblasteValue" /> shows that the distribution of the e-Values one derives from their search changes if you restrict them. Which also decreases the size of your result as shown above.
<figtable id="psiblasteValueit">
</figtable>
In <xr id="psiblasteValue" /> one can clearly see that further iterations of the algorithm add some low e-Value sequences but the majority of the profile is getting fuzzy because the number of sequences with a e-Value around 0 increases the most.
identities
<figtable id="psiblastidentity">
</figtable>
<xr id="psiblastidentity"/> shows that the identity of the sequences peak at about 30% with restriction to lower e-Values one removes not only low-identity-sequences, but also some with higher identity.
<figtable id="psiblastidentityit">
</figtable>
One can clearly see in <xr id="psiblastidentityit"/> that the distribution of identities shifts its peak from 0.4 to about 0.25 by adding more and more sequences
HHBlits
time hhblits -i Dropbox/Phenylketonuria/Task2/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -o results_hhblits_standard -E 1 -e 0.001 -Z 2000 -B 2000 real 6m10.059s user 3m15.640s sys 0m40.220s
hhblits -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -o results_hhblits_n4_e-7 -n 4 -e 0.0000001 -o results_hhblits_n4_e-7 real 8m32.287s user 4m40.113s sys 1m58.910s
results
<figtable id="hhblitsevalue">
</figtable>
<figtable id="hhblitsident">
</figtable>
<figure id="fig:overlap_hhblits_iterations">
</figure>
<xr id="fig:overlap_hhblits_iterations"/> shows the overlap between the results of different searches with HHBlits with two and four iterations respectively. The large number of unique sequences simply comes from two large clusters and one smaller cluster that are found and lost respectively when the number of iterations is doubled.
Comparison of the Searchtools
<figure id="fig:overlap_blast_psiblast">
</figure>
Figure <xr id="fig:overlap_blast_psiblast"/> shows the overlap between Blast and PSIBlast with different numbers of iterations. PSIBlast finds as expected much more results at the same default E-value. Much more important is the quality of the results of PSIBlast. A look in the results shows indeed a high number of uncharacterized proteins but still dehydratases and hydroxlyases with annotations similar to our PheOH.
For HHBlits we can compare the results to the other tools indirectly: (PSI)Blast searched in the 'clustered' big_80, which yielded only sequences from Uniprot during our searches, HHBlits searched in a clustered Uniprot release. Therefore we simply check, which clusters of HHBlits were also hit by (PSI)Blast, assuming that the clusters of HHBlits and the representatives in big_80 cover a similar spectrum of sequences.
HHBlits with standard parameters finds 242 clusters of sequences significantly (with E-Value<0.01) similar to PheOH. Only 120 clusters contain sequences also found with Blast with standard parameters. Depending on the number of iterations of PSIBlast, only 86 (with 1 iteration) to 94 (with 10 iterations) clusters found with HHBlits contain sequences found with PSIBlast. From this, we conclude that HHBlits searches much more sensitive than the traditional Blast algorithm, but with considerable costs in execution time.
GOTerms
After analyzing the output of each of those tools through standard numeric methods we wanted to see if one can see a quality difference with something more concrete than e-Values and identities. So we used GOAnnotations. In order to compare the results from HHBlits with BLAST and PSIBlast we downloaded the annotations for each hit and compared them with the annotations of the reference sequence. We used the RESTful webservice of http://www.ebi.ac.uk/
Here you can get the .jar containing source and class files of the program.
Our reference sequence has the following annotated GO:Terms
- GO:0055114 oxidation-reduction process
- GO:0042423 catecholamine biosynthetic process
- GO:0016597 amino acid binding
- GO:0016714 oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, reduced pteridine as one donor, and incorporation of one atom of oxygen
- GO:0042136 neurotransmitter biosynthetic process
- GO:0016491 oxidoreductase activity
- GO:0008152 metabolic process
- GO:0034641 cellular nitrogen compound metabolic process
- GO:0004497 monooxygenase activity
- GO:0005829 cytosol
- GO:0046872 metal ion binding
- GO:0003824 catalytic activity
- GO:0044281 small molecule metabolic process
- GO:0008652 cellular amino acid biosynthetic process
- GO:0004505 phenylalanine 4-monooxygenase activity
- GO:0009072 aromatic amino acid family metabolic process
- GO:0005506 iron ion binding
- GO:0006559 L-phenylalanine catabolic process
<figtable id="goidentity">
<figure id="fig:identityblast"></figure> |
</figtable>
In <xr id="goidentity"/> one can see the performance increase from pure BLAST to PSIBlast but due to the increased hitcount on HHBLits with its clusters one can barely compare a result where single sequences are checked versus a cluster to a pairwise check. This check seems to be a further step in order to compare results from those three search tools.
In conclusion I would prefer PSIBLast due to its rather easy usage. But a distinct "best" search tool could not be found for our data. It still depends on rather subjective measurements, than real objective proof.
PDB
Instead of mapping hits in big_80 to PDB to get structural information for the alignments, we performed an additional search in PDB at NCBI, with parameters equivalent to our blast search in big_80: The scoring matrix used is Blosum62, gap open = 10, gap extend = 1 with composition based statistics. The results yielded no hits in the range of 80%-99% identity but still helped to complete our datasets as shown below.
MSA
Datasets
We tried to create datasets of 4 sequences + reference sequence for the identity ranges (1: 90%-99%, 2: 60%-89%, 3: 40%-59%, 4: 20%-39%) from our search in the big_80 database. Since these results do not contain PDB structures, we additionally searched PDB directly for proteins in the required range and added them to the dataset. For the most conserved range, there is only 1 sequence in big_80, so we experimentally created the best possible highly conserved dataset in the range 80%-90% and used the structure of the reference sequence for 3D-coffee. The resulting datasets are shown in the following tables, the alignments in the figures <xr id="fig:ClustalW_set4"/> to <xr id="fig:3d_coffee_set1"/>. All sequences have roughly the same length as the reference sequence except for G5AMD7 in the first set. G5AMD7 contains an insertion of 162 aa, that is easily identified in the alignment, but shows a very high similarity in the other sections.
80-90% Sequence Identity | |||
---|---|---|---|
Sequence Identity | ID | Comment | |
90% | G1P4I7 | Uncharacterized protein OS=Myotis lucifugus | |
89% | G5AMD7 | Phenylalanine-4-hydroxylase OS=Heterocephalus glaber | |
80% | G1KSL1 | Uncharacterized protein OS=Anolis carolinensis | |
100% | 1PAH | used as 3D-Template only |
60-89% Sequence Identity | |||
---|---|---|---|
Sequence Identity | ID | Comment | |
80% | G1KSL1 | Uncharacterized protein OS=Anolis carolinensis | |
76% | Q4VBE2 | Putative uncharacterized protein mgc108157 OS=Xenopus tropicalis | |
70% | H2UJM8 | Uncharacterized protein OS=Takifugu rubripes | |
63% | D1LXB2 | Phenylalanine hydroxlase OS=Saccoglossus kowalevskii | |
61% | 2XSN_A | human Tyrosine 3-Monooxygenase, also used as 3D-Template |
40-59% Sequence Identity | |||
---|---|---|---|
Sequence Identity | ID | Comment | |
58% | O96947 | Phenylalanine hydroxylase OS=Geodia cydonium | |
55% | D3BKZ8 | Phenylalanine 4-monooxygenase OS=Polysphondylium pallidum | |
49% | Q5RHI3 | Novel protein similar to tyrosine hydroxylase OS=Danio rerio | |
44% | A6P4D3 | Tyrosine hydroxylase OS=Dugesia japonica | |
59% | 1TOH_A | Tyrosine hydroxylase from rattus norvegicus, also used as 3D-Template |
20-39% Sequence Identity | |||
---|---|---|---|
Sequence Identity | ID | Comment | |
37% | F4WGX3 | Phenylalanine-4-hydroxylase OS=Acromyrmex echinatior | |
35% | Q23A76 | Biopterin-dependent aromatic amino acid hydroxylase family protein OS=Tetrahymena thermophila | |
29% | D0I5S9 | Phenylalanine-4-hydroxylase OS=Grimontia hollisae | |
28% | F6G5X4 | Phenylalanine-4-hydroxylase OS=Ralstonia solanacearum | |
31% | 1LTU_A | Phenylalanine-4-hydroxylase from Chromobacterium violaceum, also used as 3D-Template |
Raw Results
We present the alignments and basic statistics for the different tools. Images have been created in Jalview. To count gaps and conserved columns a simple python script (Collection of scripts) was used.
ClustalW
Command used to create the alignments:
clustalw -align -infile=NN.fasta -outfile=clustalW_NN.aln
<figure id="fig:ClustalW_set4">
</figure>
20-39% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 83 |
Q23A76 | 87 |
D0I5S9 | 272 |
F4WGX3 | 69 |
F6G5X4 | 181 |
1LTU_A | 238 |
Conserved columns | 19 |
<figure id="fig:ClustalW_set3">
</figure>
40-59% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 37 |
A6P4D3 | 1 |
Q5RHI3 | 14 |
D3BKZ8 | 45 |
O96947 | 39 |
1TOH_A | 146 |
Conserved columns | 109 |
<figure id="fig:ClustalW_set2">
</figure>
60-89% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 17 |
H2UJM8 | 7 |
Q4VBE2 | 24 |
G1KSL1 | 16 |
D1LXB2 | 18 |
2XSN_A | 126 |
Conserved columns | 159 |
<figure id="fig:ClustalW_set1">
</figure>
80-90% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 165 |
G1P4I7 | 162 |
G1KSL1 | 164 |
G5AMD7 | 4 |
Conserved columns | 329 |
Muscle
Command used to create the alignments:
muscle -in NN.fasta -out muscle_NN.fasta
<figure id="fig:Muscle_set4">
</figure>
20-39% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 94 |
Q23A76 | 98 |
D0I5S9 | 283 |
F4WGX3 | 80 |
F6G5X4 | 192 |
1LTU_A | 249 |
Conserved columns | 22 |
<figure id="fig:Muscle_set3">
</figure>
40-59% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 45 |
A6P4D3 | 9 |
Q5RHI3 | 22 |
D3BKZ8 | 53 |
O96947 | 47 |
1TOH_A | 154 |
Conserved columns | 109 |
<figure id="fig:Muscle_set2">
</figure>
60-89% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 17 |
H2UJM8 | 7 |
Q4VBE2 | 24 |
G1KSL1 | 16 |
D1LXB2 | 18 |
2XSN_A | 126 |
Conserved columns | 159 |
<figure id="fig:Muscle_set1">
</figure>
80-90% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 167 |
G1P4I7 | 164 |
G1KSL1 | 166 |
G5AMD7 | 6 |
Conserved columns | 328 |
T-coffee
Command used to create the alignments:
t_coffee NN.fasta
<figure id="fig:t_coffee_set4">
</figure>
20-39% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 103 |
Q23A76 | 107 |
D0I5S9 | 292 |
F4WGX3 | 89 |
F6G5X4 | 201 |
1LTU_A | 258 |
Conserved columns | 22 |
<figure id="fig:t_coffee_set3">
</figure>
40-59% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 53 |
A6P4D3 | 17 |
Q5RHI3 | 30 |
D3BKZ8 | 61 |
O96947 | 55 |
1TOH_A | 162 |
Conserved columns | 109 |
<figure id="fig:t_coffee_set2">
</figure>
60-89% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 18 |
H2UJM8 | 8 |
Q4VBE2 | 25 |
G1KSL1 | 17 |
D1LXB2 | 19 |
2XSN_A | 127 |
Conserved columns | 159 |
<figure id="fig:t_coffee_set1">
</figure>
80-90% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 165 |
G1P4I7 | 162 |
G1KSL1 | 164 |
G5AMD7 | 4 |
Conserved columns | 329 |
3D-coffee
Command to create the alignments:
t_coffee NN.fasta -method sap_pair,slow_pair -template_file EXPRESSO
<figure id="fig:3d_coffee_set4">
</figure>
20-39% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 96 |
Q23A76 | 100 |
D0I5S9 | 285 |
F4WGX3 | 82 |
F6G5X4 | 194 |
1LTU_A | 251 |
Conserved columns | 22 |
<figure id="fig:3d_coffee_set3">
</figure>
40-59% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 51 |
A6P4D3 | 15 |
Q5RHI3 | 28 |
D3BKZ8 | 59 |
O96947 | 53 |
1TOH_A | 160 |
Conserved columns | 109 |
<figure id="fig:3d_coffee_set2">
</figure>
60-89% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 21 |
H2UJM8 | 11 |
Q4VBE2 | 28 |
G1KSL1 | 20 |
D1LXB2 | 22 |
2XSN_A | 130 |
Conserved columns | 159 |
<figure id="fig:3d_coffee_set1">
</figure>
80-90% Sequence Identity | |
---|---|
Sequence ID | Gaps |
P00439 | 165 |
G1P4I7 | 162 |
G1KSL1 | 164 |
G5AMD7 | 4 |
Conserved columns | 329 |
Discussion of the alignment results and alignment tools
Looking at the used tools, for these small datasets (4-6 sequences) there is not much of a difference in execution time, so we do not use time as a factor in our assessment. Muscle, ClustalW and T-Coffee are similar and easy to use, but T-Coffee offers the most options and different modes (we used only two here) and provides the most detailed output. To assess the alignments, we looked also at the number of gaps and conserved residues, but consider the most important factor in alignment quality the proper alignment of the functional sequence parts.
Comparing the results, the tools also perform quite similar, but can be sorted by the basic statistics of introduced gaps and found conserved columns in the following order: ClustalW> Muscle>3D-Coffee>T-Coffee.
T-Coffee using the additional 3D information performs minimally better in terms of gaps, more notably in the less conserved datasets. An interesting point that stands out might be that 3D-Coffe always places the first residues -- mostly methionines -- together, even if for a shorter sequence missing e.g. an N-terminal part this means a very long gap between this first residue and the rest of the sequence (see figure <xr id="fig:3d_coffee_set2"/> for an example). This behaviour does not influence the alignment score much, might even be a 'better' alignment as the first methionines belong together, but the single residue still feels a little misplaced to us.
In the most conserved dataset the aligments differ only marginally and due to the high overall conservation, the functional domains are obviously aligned correctly as well. For the other sets of sequences, we looked more closely at the central catalytic and the N-terminal regulatory domain:
The catalytic domain of PheOH appears around 280aa-340aa, with the iron-coordinating His285, His290 and Glu330. This part of the sequence is recognized by all tools in all datasets and the coordinating residues are aligned correctly. In all but the least conserved sets, the alignments are virtually identical in the functional parts and differ only in the placement of gaps. The most differences appear at the N-terminus and likely stem from the different lengths of the sequences that have to be reconciled in the alignment.
A real difference in the results of the tools can be seen near the N-terminus of set 4. T-Coffe, 3D-Coffee and Muscle find a conserved region around Trp187, while ClustalW misaligns D0I5S9, which is the shortest sequence of the set. Look at positions 190-210 of the alignment by ClustalW and 210-230 by the others.
Our favorite tools for this task was probably T-Coffee, due to good results and the wide range of option. Clustal produces less gaps, but misaligned at least one sequence. Muscle does not perform much different from T-Coffee, but provides less functionality. The advantage of muscle, its emphasis on speed, is, as stated before, not taken into account here.