Difference between revisions of "Sequence searches and multiple sequence alignments (Phenylketonuria)"
== Summary of the task == |
== Summary of the task == |
||
− | In this task we compare the protein sequence of interest, in this case the phenylalanine hydroxylase (PAH), to other protein sequences. Therefore both sequence searches and multiple sequence alignments were done using the big80 database meaning a database that contains subsets of swissprot and pdb, where the entries have a sequence similarity of 80% or less. Furthermore searches against a pdb database were done. For sequence searches the programs BLAST, PSIBLAST and HHblits are used. Their results were taken for the creation of multiple sequence alignments (MSA) using |
+ | In this task we compare the protein sequence of interest, phenylalanine hydroxylase (PAH), to other protein sequences. To this end, both sequence searches and multiple sequence alignments were performed using the big80 database, which contains subsets of Swiss-Prot and PDB in which the entries have a sequence similarity of 80% or less. In addition, searches against a PDB database were performed. The sequence searches were run with BLAST, PSI-BLAST and HHblits. Their results were used to create multiple sequence alignments (MSAs) with ClustalW, Muscle and T-Coffee. |
== Sequence searches == |
== Sequence searches == |
||
+ | [[Lab Journal - Task 2 (PAH) #Sequence searches|Lab journal]] |
||
− | The following invocations were used for Blast, PSI-Blast and HHBlits: |
||
− | === |
+ | === Comparison of the results === |
+ | In this part the results of the different sequence searches are analyzed. To this end, the outputs are parsed with Maria's [https://i12r-studfilesrv.informatik.tu-muenchen.de/wiki/index.php/Parse_output.pl Parser] and filtered for their IDs, sequence identities and e-values. |
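The filtering step can be illustrated with a minimal Python sketch. It is not the Perl parser linked above; it assumes, purely for illustration, that the parsed output has one hit per line in the form `ID<TAB>identity<TAB>e-value`. The `Hit` type and `parse_hits` function are hypothetical names.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    seq_id: str
    identity: float  # fraction between 0 and 1
    evalue: float


def parse_hits(text: str) -> List[Hit]:
    """Parse lines of the assumed form 'ID<TAB>identity<TAB>e-value'."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        seq_id, identity, evalue = line.split("\t")
        hits.append(Hit(seq_id, float(identity), float(evalue)))
    return hits


# Two hits taken from the BLAST table on this page.
example = "F6XY00\t0.92\t0.0\nD3YZ73\t0.85\t7.0E-20\n"
hits = parse_hits(example)
```

Keeping each hit as a small record makes the later steps (histograms, rankings, overlaps) simple list operations.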
||
− | blastall -p blastp -d /mnt/project/rost_db/data/big/big_80 -i /mnt/home/student/worfk |
||
+ | ====Sequence identity in percent==== |
||
− | /Masterpractical/Task2/PAH.fasta -o /mnt/home/student/worfk/Masterpractical/Task2/Blast/PAH |
||
+ | <small> |
||
− | _Blast_big_80.out -v 2000 -b 2000 |
||
+ | {| |
||
+ | |<figure id="fig:blast"> [[File:PAH_Blast_identity.png|thumb|left|300px|'''<caption>''' Sequence identity distribution of the BLAST search; the x-axis shows the sequence identity from 0 (=0%) to 1 (=100%) and the y-axis the number of sequences with this sequence identity to the PAH sequence in the BLAST search.</caption>]]</figure> |
||
+ | |<figure id="fig:psiblast">[[File:PAH_PSIBLAST_identity.png|thumb|center|350px|'''<caption>''' Sequence identity distribution of the PSI-BLAST searches; the x-axis shows the sequence identity from 0 (=0%) to 1 (=100%) and the y-axis the number of sequences with this sequence identity to the PAH sequence in the PSI-BLAST searches: two iterations with an e-value cutoff of 0.002 (blue) and with an e-value cutoff of 10e-10 (yellow); ten iterations with an e-value cutoff of 0.002 (green) and with an e-value cutoff of 10e-10.</caption>]]</figure> |
||
+ | |<figure id="fig:hhblits"> [[File:PAH_HHbilts_identity.png|thumb|right|300px|'''<caption>''' Sequence identity distribution of the HHblits search; the x-axis shows the sequence identity from 0 (=0%) to 1 (=100%) and the y-axis the number of sequences with this sequence identity to the PAH sequence in the HHblits search.</caption>]]</figure> |
||
+ | |} |
||
+ | </small> |
||
+ | <br clear=all> |
||
+ | Comparing the sequence identities of the different search tools shows that BLAST (<xr id="fig:blast"/>) and PSI-BLAST (<xr id="fig:psiblast"/>) with two iterations, for both e-value cutoffs, have similar distributions with a maximum frequency between 35% and 40% sequence identity. For ten iterations the curve is shifted to the left, with a maximum frequency at about 20%. Overall, more sequences were found with ten iterations than with two, but with lower sequence identity. This can be attributed to the profile becoming less specific with each iteration. The runs with an e-value cutoff of 10e-10 return fewer sequences than the runs with a cutoff of 0.002, because 10e-10 is the stricter cutoff. That strict cutoff is more likely to produce false negatives, whereas the more permissive cutoff of 0.002 is likely to include some false positives. The highest number of sequences was found in the HHblits search (<xr id="fig:hhblits"/>), whose sequence identity distribution closely resembles that of the PSI-BLAST run with ten iterations. |
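How such a distribution and its maximum can be derived is sketched below. This is an illustration only, not the script used for the figures; the 5% bin width and the function names `identity_histogram` and `modal_bin` are assumptions, and the input values are made up.

```python
from collections import Counter


def identity_histogram(identities, bin_percent=5):
    """Count identities (fractions in [0, 1]) per bin of `bin_percent` percent."""
    n_bins = 100 // bin_percent
    counts = Counter()
    for value in identities:
        # Round away floating point noise such as 0.29 * 100 == 28.999...
        percent = round(value * 100, 6)
        # Clamp so that an identity of exactly 1.0 falls into the last bin.
        index = min(int(percent // bin_percent), n_bins - 1)
        counts[index] += 1
    return counts


def modal_bin(counts, bin_percent=5):
    """Return the (lower, upper) percent bounds of the most populated bin."""
    index, _ = counts.most_common(1)[0]
    return index * bin_percent, (index + 1) * bin_percent


# Made-up identities, only to show the call.
counts = identity_histogram([0.36, 0.38, 0.37, 0.21, 0.92, 1.0])
peak = modal_bin(counts)
```

With real search output, `peak` corresponds to the "maximum frequency" range read off the plots.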
||
− | === PSI-BLAST (Position-Specific Iterated BLAST) === |
||
− | For PSI-Blast ([http://www.ncbi.nlm.nih.gov/books/NBK2590/ PSI-BLAST Tutorial]) more than one invocation was performed. First, two iterations were run with an E-value cutoff of 0.002 and then again with a cutoff of 10E-10. The same was done for ten iterations. An example invocation would be: |
||
+ | ====E-Value==== |
||
− | blastpgp -i /mnt/home/student/worfk/Masterpractical/Task2/PAH.fasta -d /mnt/project/rost_db |
||
+ | <small> |
||
− | /data/big/big_80 -j 2 -h 0.002 -v 2000 -b 2000 -o psi_blast_big_80_2_2.out -C big_80_check_ |
||
+ | {| |
||
− | 2_2.chk -Q big_80_matrix_2_2.pssm |
||
+ | |<figure id="fig:blast-evalue">[[File:PAH_Blast_evalue.png|thumb|left|300px|'''<caption>''' Logarithmic e-value distribution of the BLAST search; the x-axis shows the logarithmic e-value and the y-axis the frequency of sequences with that specific e-value; the smaller the logarithmic e-value, the better the e-value.</caption>]]</figure> |
||
+ | |<figure id="fig:psiblast-evalue">[[File:PAH_PSIBLAST_evalue.png|thumb|center|350px|'''<caption>''' Logarithmic e-value distribution of the PSI-BLAST search: two iterations with an e-value cutoff of 0.002 (blue) and with an e-value cutoff of 10e-10 (yellow); ten iterations with an e-value cutoff of 0.002 (green) and with an e-value cutoff of 10e-10; the x-axis shows the logarithmic e-value and the y-axis the frequency of sequences with that specific e-value; the smaller the logarithmic e-value, the better the e-value.</caption>]]</figure> |
||
+ | |<figure id="fig:hhblits-evalue">[[File:PAH_HHbilts_evalue.png|thumb|right|300px|'''<caption>''' Logarithmic e-value distribution of the HHblits search; the x-axis shows the logarithmic e-value and the y-axis the frequency of sequences with that specific e-value; the smaller the logarithmic e-value, the better the e-value.</caption>]]</figure> |
||
+ | |} |
||
+ | </small> |
||
+ | The distributions of the logarithmic e-values of the sequence searches all look similar, with a lowest and best value below -400. In BLAST (<xr id="fig:blast-evalue"/>) and HHblits (<xr id="fig:hhblits-evalue"/>), however, the values extend above 0. The maximum frequency of the e-values in BLAST is still in the negative range, whereas in HHblits it is in the positive range, meaning that the e-values are not as good. The best e-values are found for the PSI-BLAST searches, especially with ten iterations (<xr id="fig:psiblast-evalue"/>). |
||
+ | They all show lower frequency at lower e-values and their maximum at higher e-values. |
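The transformation behind these plots can be sketched as follows. The use of the natural logarithm is an assumption: it would explain values below -400, since ln(1e-177) is about -407.6. The `floor` parameter and the function name `log_evalue` are hypothetical; a floor is needed because an e-value printed as 0.0 has no logarithm.

```python
import math


def log_evalue(evalue, floor=1e-300):
    """Natural log of an e-value; 0.0 is clamped to `floor` to stay defined."""
    return math.log(max(evalue, floor))


best = log_evalue(1.0e-177)   # roughly -407.6, "below -400" in the plots
weak = log_evalue(5.0)        # positive: more than one chance hit expected
zero = log_evalue(0.0)        # clamped instead of raising ValueError
```

Hits with an e-value above 1 end up in the positive range, which is what the HHblits plot shows for its most frequent values.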
||
+ | <br clear=all> |
||
− | === HHblits === |
||
− | hhblits -i /mnt/home/student/waldraffs/Masterpraktikum/PAH.fasta -d /mnt/project/rost_db/data/hhblits/uniprot20_02Sep11 |
||
− | -o /mnt/home/student/waldraffs/Masterpraktikum/PAH_2000.hrr -oa3m /mnt/home/student/waldraffs/Masterpraktikum/PAH_2000.a3m |
||
− | -ohhm /mnt/home/student/waldraffs/Masterpraktikum/PAH_2000.hhm -Z 2000 -B 2000 |
||
+ | ====GO-terms==== |
||
+ | For the reference sequence (P00439) GO-terms (<xr id="go"/>) were found on [http://www.ebi.ac.uk/QuickGO/GAnnotation QuickGO]. |
||
+ | To look for similarities between the reference sequence and the sequences found in the searches, we wrote a program which downloads those terms for all detected sequences and counts how often the GO terms of PAH are found for the other sequences: |
||
+ | <figtable id="go"> |
||
+ | {| border="1" cellpadding="5" cellspacing="0" align="center" style="text-align:center;" |
||
+ | |- |
||
+ | ! colspan="8" style="background:#32CD32;" | "GO numbers and terms of PAH and their number of occurrences in the different search results" |
||
+ | |- |
||
+ | ! style="background:#90EE90;" align="center" | GO |
||
+ | ! style="background:#90EE90;" align="center" | GO-Term |
||
+ | ! style="background:#90EE90;" align="center" | BLAST |
||
+ | ! style="background:#90EE90;" align="center" | PSIBLAST |
||
+ | 2 x 0.002 |
||
+ | ! style="background:#90EE90;" align="center" | PSIBLAST |
||
+ | 2 x 10e-10 |
||
+ | ! style="background:#90EE90;" align="center" | PSIBLAST |
||
+ | 10 x 0.002 |
||
+ | ! style="background:#90EE90;" align="center" | PSIBLAST |
||
+ | 10 x 10e-10 |
||
+ | ! style="background:#90EE90;" align="center" | HHblits |
||
+ | |- |
||
+ | |[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0008152 GO:0008152] |
||
+ | |metabolic process |
||
+ | |554 |
||
+ | |808 |
||
+ | |818 |
||
+ | |712 |
||
+ | |713 |
||
+ | |5701 |
||
+ | |- |
||
+ | |[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0016597 GO:0016597] |
||
+ | |amino acid binding |
||
+ | |554 |
||
+ | |808 |
||
+ | |818 |
||
+ | |712 |
||
+ | |713 |
||
+ | |5611 |
||
+ | |- |
||
+ | |[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0055114 GO:0055114] |
||
+ | |oxidation-reduction process |
||
+ | |458 |
||
+ | |457 |
||
+ | |459 |
||
+ | |443 |
||
+ | |438 |
||
+ | |3522 |
||
+ | |- |
||
+ | |[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0005506 GO:0005506] |
||
+ | |iron ion binding |
||
+ | |449 |
||
+ | |448 |
||
+ | |450 |
||
+ | |434 |
||
+ | |431 |
||
+ | |1043 |
||
+ | |- |
||
+ | |[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0009072 GO:0009072] |
||
+ | |aromatic amino acid family metabolic process |
||
+ | |449 |
||
+ | |448 |
||
+ | |450 |
||
+ | |434 |
||
+ | |431 |
||
+ | |1043 |
||
+ | |- |
||
+ | |[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0004497 GO:0004497] |
||
+ | |monooxygenase activity |
||
+ | |449 |
||
+ | |448 |
||
+ | |450 |
||
+ | |434 |
||
+ | |431 |
||
+ | |1043 |
||
+ | |- |
||
+ | |[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0016714 GO:0016714] |
||
+ | |oxidoreductase activity, |
||
+ | acting on paired donors, |
||
+ | with incorporation or reduction of molecular oxygen, |
||
− | To run all programs at once, one could use Maria's Perl script, as shown here: |
||
− | perl /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1/run.pl ... |
||
+ | reduced pteridine as one donor, |
||
− | === Comparison of the results === |
||
+ | and incorporation of one atom of oxygen |
||
− | *Sequence identity in percent |
||
+ | |445 |
||
+ | |443 |
||
+ | |444 |
||
+ | |434 |
||
+ | |431 |
||
+ | |1033 |
||
+ | |- |
||
+ | |[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0004505 GO:0004505] |
||
+ | |phenylalanine 4-monooxygenase activity |
||
+ | |207 |
||
+ | |207 |
||
+ | |207 |
||
+ | |205 |
||
+ | |205 |
||
+ | |442 |
||
+ | |- |
||
+ | |[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0006559 GO:0006559] |
||
+ | |L-phenylalanine catabolic process |
||
+ | |165 |
||
+ | |165 |
||
+ | |165 |
||
+ | |165 |
||
+ | |165 |
||
+ | |395 |
||
+ | |- |
||
+ | |[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0016491 GO:0016491] |
||
+ | |oxidoreductase activity |
||
+ | |158 |
||
+ | |157 |
||
+ | |157 |
||
+ | |154 |
||
+ | |154 |
||
+ | |2456 |
||
+ | |- |
||
+ | |[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0003824 GO:0003824] |
||
+ | |catalytic activity |
||
+ | |12 |
||
+ | |30 |
||
+ | |28 |
||
+ | |33 |
||
+ | |32 |
||
+ | |3237 |
||
+ | |- |
||
+ | |[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0042423 GO:0042423] |
||
+ | |catecholamine biosynthetic process |
||
+ | |15 |
||
+ | |25 |
||
+ | |15 |
||
+ | |15 |
||
+ | |15 |
||
+ | |80 |
||
+ | |- |
||
+ | |[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0008652 GO:0008652] |
||
+ | |cellular amino acid biosynthetic process |
||
+ | |6 |
||
+ | |10 |
||
+ | |11 |
||
+ | |12 |
||
+ | |12 |
||
+ | |1762 |
||
+ | |- |
||
+ | |[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0046872 GO:0046872] |
||
+ | |metal ion binding |
||
+ | |7 |
||
+ | |6 |
||
+ | |6 |
||
+ | |6 |
||
+ | |6 |
||
+ | |1141 |
||
+ | |- |
||
+ | |[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0042136 GO:0042136] |
||
+ | |neurotransmitter biosynthetic process |
||
+ | |2 |
||
+ | |2 |
||
+ | |2 |
||
+ | |2 |
||
+ | |2 |
||
+ | |12 |
||
+ | |- |
||
+ | |[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0005829 GO:0005829] |
||
+ | |cytosol |
||
+ | |0 |
||
+ | |1 |
||
+ | |1 |
||
+ | |2 |
||
+ | |2 |
||
+ | |16 |
||
+ | |- |
||
+ | |[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0034641 GO:0034641] |
||
+ | |cellular nitrogen compound metabolic process |
||
+ | |0 |
||
+ | |0 |
||
+ | |0 |
||
+ | |0 |
||
+ | |0 |
||
+ | |4 |
||
+ | |- |
||
+ | |[http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0044281 GO:0044281] |
||
+ | |small molecule metabolic process |
||
+ | |0 |
||
+ | |0 |
||
+ | |0 |
||
+ | |0 |
||
+ | |0 |
||
+ | |4 |
||
+ | |- |
||
+ | |} |
||
+ | <small>'''<caption>''' GO terms and their number of occurrences for the protein sequences found in the different sequence searches: BLAST, PSI-BLAST with 2 iterations and e-values of 0.002 and 10e-10, PSI-BLAST with 10 iterations and the same e-values, and finally HHblits.</caption></small> |
||
+ | </figtable> |
||
+ | Looking at the GO terms, it is obvious that more general descriptions like 'metabolic process' or 'amino acid binding' are found very often. The more specific a GO term of PAH is, the fewer sequences in the searches share it. In HHblits, however, some terms such as 'catalytic activity' are detected more often than in the other searches, even when taking into account that many more sequences are found in the HHblits clusters. |
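The counting behind the table can be sketched in a few lines of Python. This is not the program mentioned above: the download from QuickGO is left out, and the annotations are passed in as a plain dictionary. The function name `count_reference_terms` and the hit names are hypothetical; the GO numbers are taken from the table.

```python
from collections import Counter


def count_reference_terms(reference_terms, hit_annotations):
    """Count, per reference GO term, how many hits are annotated with it."""
    reference_terms = set(reference_terms)
    # Start every reference term at zero so that absent terms are reported too.
    counts = Counter({term: 0 for term in reference_terms})
    for terms in hit_annotations.values():
        for term in reference_terms & set(terms):
            counts[term] += 1
    return counts


reference = {"GO:0008152", "GO:0004505"}
annotations = {
    "hypothetical_hit_1": {"GO:0008152"},
    "hypothetical_hit_2": {"GO:0008152", "GO:0004505", "GO:0005829"},
    "hypothetical_hit_3": set(),
}
counts = count_reference_terms(reference, annotations)
```

Terms of a hit that are not annotated for PAH are ignored, which matches the table: only the GO terms of the reference sequence are listed.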
||
+ | ====Best results==== |
||
− | *E-Value |
||
+ | In this part the sequences with highest sequence identity and lowest e-value are compared for the different searches. |
||
+ | =====Sequence identity===== |
||
+ | In the following table (<xr id="hhblits"/>) the four proteins with the highest sequence identity found in the HHblits run are presented. |
||
+ | <figtable id="hhblits"> |
||
+ | {| border="1" style="text-align:center;" cellpadding="5" cellspacing="0" align="center" |
||
− | *GO-terms |
||
+ | |- |
||
− | For the reference sequence (P00432) following GO-terms were found on [http://www.ebi.ac.uk/QuickGO/GAnnotation QuickGO]: |
||
+ | ! colspan="8" style="background:#32CD32;" | Best sequence identities in HHblits |
||
− | ... |
||
+ | |- |
||
+ | ! style="background:#90EE90;" align="center" | UniProt-ID |
||
+ | ! style="background:#90EE90;" align="center" | Identity |
||
+ | ! style="background:#90EE90;" align="center" | E-Value |
||
+ | |- |
||
+ | |[http://www.uniprot.org/uniprot/C9JMN0 C9JMN0] |
||
+ | |1.0 |
||
+ | |5.0E-6 |
||
+ | |- |
||
+ | |[http://www.uniprot.org/uniprot/Q66RJ9 Q66RJ9] |
||
+ | |1.0 |
||
+ | |0.006 |
||
+ | |- |
||
+ | |[http://www.uniprot.org/uniprot/Q16021 Q16021] |
||
+ | |0.96 |
||
+ | |8.4E-12 |
||
+ | |- |
||
+ | |[http://www.uniprot.org/uniprot/Q66RJ7 Q66RJ7] |
||
+ | |0.96 |
||
+ | |7.1E-4 |
||
+ | |} |
||
+ | <center><small>'''<caption>''' Proteins with highest identity in HHblits</caption></small></center> |
||
+ | </figtable> |
||
+ | None of these sequences is found in BLAST or in the four PSI-BLAST searches. However, some of the proteins with the highest sequence identity in BLAST can also be found with PSI-BLAST (<xr id="blast"/>). |
||
+ | <figtable id="blast"> |
||
+ | {| border="1" style="text-align:center;" cellpadding="5" cellspacing="0" align="center" |
||
+ | |- |
||
+ | ! colspan="11" style="background:#32CD32;" | Best sequence identities in BLAST |
||
+ | |- |
||
+ | ! style="background:#90EE90;" align="center" | UniProt-ID |
||
+ | ! style="background:#90EE90;" align="center" | Identity |
||
+ | ! style="background:#90EE90;" align="center" | E-Value |
||
+ | ! style="background:#90EE90;" align="center" | Identity |
||
+ | PSI-BLAST 2x0.002 |
||
+ | ! style="background:#90EE90;" align="center" | Identity |
||
+ | PSI-BLAST 2x10e-10 |
||
+ | ! style="background:#90EE90;" align="center" | Identity |
||
+ | PSI-BLAST 10x0.002 |
||
+ | ! style="background:#90EE90;" align="center" | Identity |
||
+ | PSI-BLAST 10x10e-10 |
||
+ | |- |
||
+ | |[http://www.uniprot.org/uniprot/F6XY00 F6XY00] |
||
+ | |0.92 |
||
+ | |0.0 |
||
+ | |0.95 |
||
+ | |0.95 |
||
+ | |0.93 |
||
+ | |0.96 |
||
+ | |- |
||
+ | |[http://www.uniprot.org/uniprot/L9L9N2 L9L9N2] |
||
+ | |0.89 |
||
+ | |0.0 |
||
+ | |0.93 |
||
+ | |0.92 |
||
+ | |0.95 |
||
+ | |0.90 |
||
+ | |- |
||
+ | |[http://www.uniprot.org/uniprot/G5AMD7 G5AMD7] |
||
+ | |0.89 |
||
+ | |0.0 |
||
+ | |0.93 |
||
+ | |0.93 |
||
+ | |0.91 |
||
+ | |0.91 |
||
+ | |- |
||
+ | |[http://www.uniprot.org/uniprot/D3YZ73 D3YZ73] |
||
+ | |0.85 |
||
+ | |7.0E-20 |
||
+ | |0.73 |
||
+ | |0.73 |
||
+ | | - |
||
+ | | - |
||
+ | |- |
||
+ | |[http://www.uniprot.org/uniprot/G1KSL1 G1KSL1] |
||
+ | |0.81 |
||
+ | |0.0 |
||
+ | |0.82 |
||
+ | |0.82 |
||
+ | |0.82 |
||
+ | |0.79 |
||
+ | |} |
||
+ | <center><small>'''<caption>''' Proteins with highest identity in BLAST and their identity values in the four PSI-BLAST runs if found. </caption></small></center> |
||
+ | </figtable> |
||
+ | In addition, four of these sequences (F6XY00, L9L9N2, G5AMD7, G1KSL1) are also the best four in the PSI-BLAST runs, though not in the same order, and with better e-values than in BLAST. They are not included in the HHblits result. Only D3YZ73 is not among the best in PSI-BLAST and is not found at all after ten iterations. In HHblits, however, it is found with a sequence identity of 0.71. |
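Such comparisons between result lists reduce to set operations. The sketch below is an illustration with a hypothetical helper `compare_results`; the sets contain only the IDs named in the tables and text of this section, not the full search results.

```python
def compare_results(results):
    """Return (IDs shared by all searches, IDs unique to each search)."""
    shared = set.intersection(*results.values())
    unique = {}
    for name, ids in results.items():
        others = set()
        for other_name, other_ids in results.items():
            if other_name != name:
                others |= other_ids
        unique[name] = ids - others
    return shared, unique


# Only the IDs mentioned in this section, not the complete hit lists.
results = {
    "BLAST": {"F6XY00", "L9L9N2", "G5AMD7", "D3YZ73", "G1KSL1"},
    "PSI-BLAST 10x0.002": {"F6XY00", "L9L9N2", "G5AMD7", "G1KSL1"},
    "HHblits": {"D3YZ73", "C9JMN0", "Q66RJ9", "Q16021", "Q66RJ7"},
}
shared, unique = compare_results(results)
```

For these subsets no ID is shared by all three searches, and the four top HHblits hits are unique to HHblits.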
||
+ | =====E-Value===== |
||
− | To look for similarities between the reference sequence and the sequences found in the searches, those terms are counted. |
||
+ | The best e-values are the lowest ones. In HHblits the three lowest values are all 5.5e-175 (<xr id="hh-eval"/>). |
||
− | ... |
||
+ | <figtable id="hh-eval"> |
||
+ | {| border="1" cellpadding="5" cellspacing="0" align="center" |
||
+ | |- |
||
+ | ! colspan="8" style="background:#32CD32;" | Best e-values in HHblits |
||
+ | |- |
||
+ | ! style="background:#90EE90;" align="center" | UniProt-ID |
||
+ | ! style="background:#90EE90;" align="center" | E-Value |
||
+ | ! style="background:#90EE90;" align="center" | Identity |
||
+ | |- |
||
+ | |[http://www.uniprot.org/uniprot/A7UTU7 A7UTU7] |
||
+ | |5.5e-175 |
||
+ | |0.68 |
||
+ | |- |
||
+ | |[http://www.uniprot.org/uniprot/P17752 P17752] |
||
+ | |5.5e-175 |
||
+ | |0.68 |
||
+ | |- |
||
+ | |[http://www.uniprot.org/uniprot/P00439 P00439] |
||
+ | |5.5e-175 |
||
+ | |0.68 |
||
+ | |} |
||
+ | <center><small>'''<caption>''' Proteins with lowest e-value in HHblits</caption></small></center> |
||
+ | </figtable> |
||
+ | Again, none of these protein sequences is found in BLAST or in the four PSI-BLAST searches. The proteins with the best e-values of the BLAST run (<xr id="bl-eval"/>), however, are found in all searches, except for H3FAJ0, which is not included in the HHblits result. They have very good e-values in all runs, but are among the three best only in BLAST. |
||
+ | <figtable id="bl-eval"> |
||
+ | {| border="1" cellpadding="5" cellspacing="0" align="center" |
||
+ | |- |
||
+ | ! colspan="11" style="background:#32CD32;" | Best e-values in BLAST |
||
+ | |- |
||
+ | ! style="background:#90EE90;" align="center" | UniProt-ID |
||
+ | ! style="background:#90EE90;" align="center" | E-Value |
||
+ | ! style="background:#90EE90;" align="center" | Identity |
||
+ | ! style="background:#90EE90;" align="center" | E-Value |
||
+ | PSI-BLAST 2x0.002 |
||
+ | ! style="background:#90EE90;" align="center" | E-Value |
||
+ | PSI-BLAST 2x10e-10 |
||
+ | ! style="background:#90EE90;" align="center" | E-Value |
||
+ | PSI-BLAST 10x0.002 |
||
+ | ! style="background:#90EE90;" align="center" | E-Value |
||
+ | PSI-BLAST 10x10e-10 |
||
+ | ! style="background:#90EE90;" align="center" | E-Value |
||
+ | HHblits |
||
+ | |- |
||
+ | |[http://www.uniprot.org/uniprot/H3FAJ0 H3FAJ0] |
||
+ | |1.0e-177 |
||
+ | |0.544276457883369 |
||
+ | |1.0E-157 |
||
+ | |1.0E-162 |
||
+ | |1.0E-122 |
||
+ | |1.0E-126 |
||
+ | | - |
||
+ | |- |
||
+ | |[http://www.uniprot.org/uniprot/A8WSM6 A8WSM6] |
||
+ | |1.0e-177 |
||
+ | |0.563636363636364 |
||
+ | |1.0E-159 |
||
+ | |1.0E-164 |
||
+ | |1.0E-127 |
||
+ | |1.0E-131 |
||
+ | |1.1E-172 |
||
+ | |- |
||
+ | |[http://www.uniprot.org/uniprot/A9UUJ8 A9UUJ8] |
||
+ | |1.0e-176 |
||
+ | |0.567695961995249 |
||
+ | |1.0E-156 |
||
+ | |1.0E-160 |
||
+ | |1.0E-127 |
||
+ | |1.0E-131 |
||
+ | |5.5E-175 |
||
+ | |} |
||
+ | <center><small>'''<caption>''' Proteins with lowest e-value in BLAST and their values in PSI-BLAST and also in HHblits</caption></small></center> |
||
+ | </figtable> |
||
+ | |||
+ | For PSI-BLAST, the three best proteins of the run with two iterations and an e-value cutoff of 10e-10 and of the run with ten iterations and an e-value cutoff of 0.002 are the same. The best one is [http://www.uniprot.org/uniprot/H2UJM8 H2UJM8] with an e-value of 1e-179. For two iterations and an e-value cutoff of 0.002 it is [http://www.uniprot.org/uniprot/Q4VBE2 Q4VBE2] with an e-value of 1e-178, and for ten iterations and an e-value cutoff of 10e-10 it is [http://www.uniprot.org/uniprot/K7F3H7 K7F3H7] with an e-value of 1e-143, which is good, but not as good as the other best ones. |
||
+ | <br clear=all> |
||
+ | Comparing sequence identity and e-value, it is notable that the sequences with the best identities are often not the ones listed with the best e-values, and vice versa. In the BLAST run in particular, the sequences with the highest sequence identity are reported with an e-value of 0.0, and the sequences with the lowest non-zero e-values only have sequence identities slightly above 54%. Note that BLAST prints 0.0 when an e-value is too small to be displayed, so these hits are highly significant rather than poor. Sequence identity and e-value should therefore be considered together to obtain trustworthy results. |
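A combined ranking can be sketched as a sort on two keys. The tuples below use values from the BLAST tables above (identities rounded); sorting by e-value first and by descending identity second is one possible choice, not the method used for the tables. An e-value printed as 0.0 sorts before every non-zero e-value.

```python
# (UniProt ID, sequence identity, e-value) from the BLAST tables above.
hits = [
    ("D3YZ73", 0.85, 7.0e-20),
    ("H3FAJ0", 0.544, 1.0e-177),
    ("F6XY00", 0.92, 0.0),
    ("A9UUJ8", 0.568, 1.0e-176),
    ("L9L9N2", 0.89, 0.0),
]

# Lowest e-value first; ties are broken by the higher sequence identity.
ranked = sorted(hits, key=lambda hit: (hit[2], -hit[1]))
ranked_ids = [hit[0] for hit in ranked]
```

With this ordering the high-identity hits reported as 0.0 come first, followed by the hits with the lowest non-zero e-values.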
||
== Multiple sequence alignments == |
== Multiple sequence alignments == |
||
+ | The images of the alignments were created with [http://www.jalview.org Jalview]. Colours are shown in the Clustalx colour scheme. <br> |
||
+ | [[Lab Journal - Task 2 (PAH) #Multiple sequence alignments|Lab journal]] |
||
+ | === Datasets === |
||
+ | For the multiple sequence alignments, four different datasets were generated from the BLAST outputs with a [https://i12r-studfilesrv.informatik.tu-muenchen.de/wiki/index.php/Phenylketonuria/Task2/Scripts Python script]. Three groups contain ten sequences each (two of them PDB sequences): one with more than 60% sequence identity to our target sequence (PAH), one with sequence identity between 30% and 60%, and one with less than 30% sequence identity (<xr id="datasets"/>). In the group with < 30% identity the PDB sequences have 32% identity, because these are the lowest ones found in the BLAST output against the PDB dataset. The fourth group contains 20 sequences covering the whole range of sequence identities, four of which are PDB sequences (<xr id="set_all"/>). For comparison, we added our reference sequence of the phenylalanine hydroxylase enzyme (PAH, P00439) to all four groups. |
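The grouping by identity can be sketched as follows. This is not the linked Python script; the function names, the treatment of the boundaries (exactly 30% and 60% count as "medium") and the limit of eight non-PDB sequences per group are assumptions made for illustration. The example IDs and identities are taken from the dataset tables.

```python
def assign_group(identity_percent):
    """Map a sequence identity in percent to a dataset name."""
    if identity_percent < 30:
        return "low"
    if identity_percent <= 60:
        return "medium"
    return "high"


def build_datasets(hits, size=8):
    """Collect up to `size` sequence IDs per identity group, in input order."""
    datasets = {"low": [], "medium": [], "high": []}
    for seq_id, identity_percent in hits:
        group = datasets[assign_group(identity_percent)]
        if len(group) < size:
            group.append(seq_id)
    return datasets


hits = [
    ("H2UJM8", 70), ("E9RJV0", 61),
    ("I1C3M2", 54), ("F4WGX3", 36),
    ("K0WHA3", 29), ("B4UJD0", 24),
]
datasets = build_datasets(hits)
```

The PDB sequences would be added to each group separately, since they come from the search against the PDB dataset.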
||
+ | <figtable id="datasets"> |
||
+ | {|align="center" |
||
+ | | |
||
+ | {| border="1" cellpadding="5" cellspacing="0" align="left" |
||
+ | |+'''a) Dataset "low":''' |
||
+ | |- |
||
+ | ! colspan="2" style="background:#32CD32;" | Group of sequences with < 30% |
||
+ | (pdb = 32%) identity |
||
+ | |- |
||
+ | ! style="background:#90EE90;" align="center" | Sequence identity |
||
+ | ! style="background:#90EE90;" align="center" | ID |
||
+ | |- |
||
+ | | align="center" | 29% |
||
+ | | align="center" | K0WHA3 |
||
+ | |- |
||
+ | | align="center" | 29% |
||
+ | | align="center" | L0FXW8 |
||
+ | |- |
||
+ | | align="center" | 29% |
||
+ | | align="center" | B4R9Q9 |
||
+ | |- |
||
+ | | align="center" | 28% |
||
+ | | align="center" | C9P8B8 |
||
+ | |- |
||
+ | | align="center" | 25% |
||
+ | | align="center" | I3YW84 |
||
+ | |- |
||
+ | | align="center" | 25% |
||
+ | | align="center" | G0L2J6 |
||
+ | |- |
||
+ | | align="center" | 25% |
||
+ | | align="center" | Q9AG78 |
||
+ | |- |
||
+ | | align="center" | 24% |
||
+ | | align="center" | B4UJD0 |
||
+ | |- |
||
+ | | align="center" | 32% |
||
+ | | align="center" | 1ltu_A (pdb) |
||
+ | |- |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 32% |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 1ltz_A (pdb) |
||
+ | |- |
||
+ | |} |
||
+ | | |
||
+ | {| border="1" cellpadding="5" cellspacing="0" align="center" |
||
+ | |+'''b) Dataset "medium":''' |
||
+ | |- |
||
+ | ! colspan="2" style="background:#32CD32;" | Group of sequences between |
||
+ | 30% and 60% identity |
||
+ | |- |
||
+ | ! style="background:#90EE90;" align="center" | Sequence identity |
||
+ | ! style="background:#90EE90;" align="center" | ID |
||
+ | |- |
||
+ | | align="center" | 54% |
||
+ | | align="center" | I1C3M2 |
||
+ | |- |
||
+ | | align="center" | 53% |
||
+ | | align="center" | F6ZKP1 |
||
+ | |- |
||
+ | | align="center" | 51% |
||
+ | | align="center" | Q54XS1 |
||
+ | |- |
||
+ | | align="center" | 48% |
||
+ | | align="center" | G1KSP0 |
||
+ | |- |
||
+ | | align="center" | 45% |
||
+ | | align="center" | E5SYS4 |
||
+ | |- |
||
+ | | align="center" | 42% |
||
+ | | align="center" | H3EDU0 |
||
+ | |- |
||
+ | | align="center" | 38% |
||
+ | | align="center" | H0HJI4 |
||
+ | |- |
||
+ | | align="center" | 36% |
||
+ | | align="center" | F4WGX3 |
||
+ | |- |
||
+ | | align="center" | 59% |
||
+ | | align="center" | 2toh_A (pdb) |
||
+ | |- |
||
+ | | style="border-bottom:3px solid gray;" align="center" |45% |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 2qmx_A (pdb) |
||
+ | |- |
||
+ | |} |
||
+ | | |
||
+ | {| border="1" cellpadding="5" cellspacing="0" align="right" |
||
+ | |+'''c) Dataset "high":''' |
||
+ | |- |
||
+ | ! colspan="2" style="background:#32CD32;" | Group of sequences with
||
+ | > 60% identity |
||
+ | |- |
||
+ | ! style="background:#90EE90;" align="center" | Sequence identity |
||
+ | ! style="background:#90EE90;" align="center" | ID |
||
+ | |- |
||
+ | | align="center" | 70% |
||
+ | | align="center" | H2UJM8 |
||
+ | |- |
||
+ | | align="center" | 67% |
||
+ | | align="center" | K1RSS1 |
||
+ | |- |
||
+ | | align="center" | 65% |
||
+ | | align="center" | E7D1A7 |
||
+ | |- |
||
+ | | align="center" | 63% |
||
+ | | align="center" | G7YPD5 |
||
+ | |- |
||
+ | | align="center" | 63% |
||
+ | | align="center" | D1LXB2 |
||
+ | |- |
||
+ | | align="center" | 62% |
||
+ | | align="center" | G1KJG2 |
||
+ | |- |
||
+ | | align="center" | 61% |
||
+ | | align="center" | G9B2G8 |
||
+ | |- |
||
+ | | align="center" | 61% |
||
+ | | align="center" | E9RJV0 |
||
+ | |- |
||
+ | | align="center" | 96% |
||
+ | | align="center" | 2pah_A (pdb) |
||
+ | |- |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 96% |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 1kw0_A (pdb) |
||
+ | |- |
||
+ | |} |
||
+ | |} |
||
+ | <center><small>'''<caption>''' The three datasets with sequence identity '''a)''' lower than 30% (and two pdb sequences with 32% identity), '''b)''' between 30% and 60% and '''c)''' higher than 60%. </caption></small></center> |
||
+ | </figtable> |
||
+ | |||
+ | <figtable id="set_all"> |
||
+ | {| border="1" cellpadding="5" cellspacing="0" align="center" |
||
+ | |+'''Dataset "all":''' |
||
+ | |- |
||
+ | ! colspan="4" style="background:#32CD32;" | Group of sequences with different identities (0-100%) |
||
+ | |- |
||
+ | ! style="background:#90EE90;" align="center" | Sequence identity |
||
+ | ! style="background:#90EE90;" align="center" | ID |
||
+ | ! style="background:#90EE90;" align="center" | Sequence identity |
||
+ | ! style="background:#90EE90;" align="center" | ID |
||
+ | |- |
||
+ | | align="center" | 63% |
||
+ | | align="center" | D1LXB2 |
||
+ | | align="center" | 33% |
||
+ | | align="center" | A9D485 |
||
+ | |- |
||
+ | | align="center" | 63% |
||
+ | | align="center" | G3MQ02 |
||
+ | | align="center" | 33% |
||
+ | | align="center" | G7UU95 |
||
+ | |- |
||
+ | | align="center" | 56% |
||
+ | | align="center" | A9UUJ8 |
||
+ | | align="center" | 31% |
||
+ | | align="center" | I3TIA8 |
||
+ | |- |
||
+ | | align="center" | 54% |
||
+ | | align="center" | D2XNL7 |
||
+ | | align="center" | 31% |
||
+ | | align="center" | A0Y3T4 |
||
+ | |- |
||
+ | | align="center" | 52% |
||
+ | | align="center" | E9G1D2 |
||
+ | | align="center" | 29% |
||
+ | | align="center" | D5HB02 |
||
+ | |- |
||
+ | | align="center" | 44% |
||
+ | | align="center" | I1GDE7 |
||
+ | | align="center" | 27% |
||
+ | | align="center" | A1ZW97 |
||
+ | |- |
||
+ | | align="center" | 41% |
||
+ | | align="center" | H1L1J2 |
||
+ | | align="center" | 96% |
||
+ | | align="center" | 5pah_A (pdb) |
||
+ | |- |
||
+ | | align="center" | 37% |
||
+ | | align="center" | D7AKY2 |
||
+ | | align="center" | 96% |
||
+ | | align="center" | 3pah_A (pdb) |
||
+ | |- |
||
+ | | align="center" | 36% |
||
+ | | align="center" | B7GIR4 |
||
+ | | align="center" | 96% |
||
+ | | align="center" | 2pah_A (pdb) |
||
+ | |- |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 35% |
||
+ | | style="border-bottom:3px solid gray;" align="center" | A0C973 |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 33% |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 2v27_B (pdb) |
||
+ | |- |
||
+ | |} |
||
+ | <center><small>'''<caption>''' Set with whole range of sequence identities including four pdb sequences. </caption></small></center> |
||
+ | </figtable> |
||
=== ClustalW === |
=== ClustalW === |
||
+ | <figure id="clustalW"><small> |
||
− | ... |
||
+ | [[File:Low_clustalw.png|thumb|left|1000px|'''a)''' Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with ClustalW]] |
||
+ | [[File:Medium_clustalw.png|thumb|left|1000px|'''b)''' Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with ClustalW]] |
||
+ | [[File:High_clustalw.png|thumb|left|1000px|'''c)''' Multiple Sequence Alignment of the "high" dataset with sequences >60% identity, which was created with ClustalW]] |
||
+ | [[File:All_clustalw.png|thumb|left|1500px|'''d)''' Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with ClustalW]] |
||
+ | <br clear=all> |
||
+ | <center>'''<caption>''' Multiple sequence alignments for the four datasets '''a)''' low, '''b)''' medium, '''c)''' high and '''d)''' all created with ClustalW.</caption></center></small> |
||
+ | </figure> |
||
+ | <br clear=all> |
||
=== Muscle === |
=== Muscle === |
||
+ | <figure id="muscle"><small> |
||
− | ... |
||
+ | [[File:Low_muscle.png|thumb|left|1000px|'''a)''' Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with Muscle]] |
||
+ | [[File:Medium_muscle.png|thumb|left|1000px|'''b)''' Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with Muscle]] |
||
+ | [[File:High_muscle.png|thumb|left|1000px|'''c)''' Multiple Sequence Alignment of the "high" dataset with sequences >60% identity, which was created with Muscle]] |
||
+ | [[File:All_muscle.png|thumb|left|1500px|'''d)''' Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with Muscle]] |
||
+ | <br clear=all> |
||
+ | <center>'''<caption>''' Multiple sequence alignments for the four datasets '''a)''' low, '''b)''' medium, '''c)''' high and '''d)''' all created with Muscle.</caption></center></small> |
||
+ | </figure> |
||
+ | <br clear=all> |
||
=== T-Coffee === |
=== T-Coffee === |
||
+ | <figure id="tCoffee"><small> |
||
− | ... |
||
+ | [[File:Low_tcoffee.png|thumb|left|1000px|'''a)''' Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with T-Coffee]] |
||
+ | [[File:Medium_tcoffee.png|thumb|left|1000px|'''b)''' Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with T-Coffee]] |
||
+ | [[File:High_tcoffee.png|thumb|left|1000px|'''c)''' Multiple Sequence Alignment of the "high" dataset with sequences >60% identity, which was created with T-Coffee]] |
||
+ | [[File:All_tcoffee.png|thumb|left|1500px|'''d)''' Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with T-Coffee]] |
||
+ | <br clear=all> |
||
+ | <center>'''<caption>''' Multiple sequence alignments for the four datasets '''a)''' low, '''b)''' medium, '''c)''' high and '''d)''' all created with T-Coffee.</caption></center></small> |
||
+ | </figure> |
||
+ | <br clear=all> |
||
+ | |||
+ | === Expresso(3D-Coffee) === |
||
+ | [http://www.tcoffee.org/Projects/expresso/ Expresso] is an extension of 3D-Coffee and uses BLAST to search the PDB database for structures whose sequences are similar to the given sequences. These structures are then used to build the alignment. It is slower than T-Coffee itself, but if it finds enough structures it is more accurate than the other programs. |
||
+ | |||
+ | Since we could not run Expresso on the server, we have used this [http://www.igs.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi?stage1=1&daction=EXPRESSO%283DCoffee%29::Regular website] for the multiple sequence alignments with Expresso (3D-Coffee). |
||
+ | |||
+ | <figure id="expresso"><small> |
||
+ | [[File:Low_expresso.png|thumb|left|1000px|'''a)''' Multiple Sequence Alignment of the "low" dataset with sequences <30% identity, which was created with Expresso(3D-Coffee)]] |
||
+ | [[File:Medium_expresso.png|thumb|left|1000px|'''b)''' Multiple Sequence Alignment of the "medium" dataset with sequences between 30% and 60% identity, which was created with Expresso(3D-Coffee)]] |
||
+ | [[File:High_expresso.png|thumb|left|1000px|'''c)''' Multiple Sequence Alignment of the "high" dataset with sequences > 60% identity, which was created with Expresso(3D-Coffee)]] |
||
+ | [[File:All_expresso.png|thumb|left|1500px|'''d)''' Multiple Sequence Alignment of the "all" dataset with the whole range of sequence identities, which was created with Expresso(3D-Coffee)]] |
||
+ | <br clear=all> |
||
+ | <center>'''<caption>''' Multiple sequence alignments for the four datasets '''a)''' low, '''b)''' medium, '''c)''' high and '''d)''' all created with Expresso.</caption></center></small> |
||
+ | </figure> |
||
+ | <br clear=all> |
||
+ | |||
+ | === Results === |
||
+ | The following tables (<xr id="low_set"/>, <xr id="medium_set"/> and <xr id="high_set"/>) show the number of gaps inserted into each sequence when the multiple alignments were created with ClustalW, Muscle, T-Coffee and Expresso. The sequence of PAH (P00439) is shown in bold, and pdb sequences are marked with *.<br> |
||
+ | {| |
||
+ | |<figtable id="low_set"> |
||
+ | {| border="1" cellpadding="5" cellspacing="0" align="center" |
||
+ | |+'''Gaps in Dataset "low":''' |
||
+ | |- |
||
+ | ! colspan="5" style="background:#32CD32;" | Group of sequences with < 30% (pdb = 32%) identity |
||
+ | |- |
||
+ | ! style="background:#90EE90;" align="center" | ID |
||
+ | ! style="background:#90EE90;" align="center" | ClustalW |
||
+ | ! style="background:#90EE90;" align="center" | MUSCLE |
||
+ | ! style="background:#90EE90;" align="center" | T-COFFEE |
||
+ | ! style="background:#90EE90;" align="center" | EXPRESSO |
||
+ | |- |
||
+ | | align="center" | '''P00439''' |
||
+ | | align="center" | 3 |
||
+ | | align="center" | 9 |
||
+ | | align="center" | 3 |
||
+ | | align="center" | 3 |
||
+ | |- |
||
+ | | align="center" | K0WHA3 |
||
+ | | align="center" | 230 |
||
+ | | align="center" | 236 |
||
+ | | align="center" | 230 |
||
+ | | align="center" | 230 |
||
+ | |- |
||
+ | | align="center" | L0FXW8 |
||
+ | | align="center" | 401 |
||
+ | | align="center" | 407 |
||
+ | | align="center" | 401 |
||
+ | | align="center" | 401 |
||
+ | |- |
||
+ | | align="center" | B4R9Q9 |
||
+ | | align="center" | 339 |
||
+ | | align="center" | 345 |
||
+ | | align="center" | 339 |
||
+ | | align="center" | 339 |
||
+ | |- |
||
+ | | align="center" | C9P8B8 |
||
+ | | align="center" | 223 |
||
+ | | align="center" | 229 |
||
+ | | align="center" | 223 |
||
+ | | align="center" | 223 |
||
+ | |- |
||
+ | | align="center" | I3YW84 |
||
+ | | align="center" | 396 |
||
+ | | align="center" | 402 |
||
+ | | align="center" | 396 |
||
+ | | align="center" | 396 |
||
+ | |- |
||
+ | | align="center" | G0L2J6 |
||
+ | | align="center" | 336 |
||
+ | | align="center" | 342 |
||
+ | | align="center" | 336 |
||
+ | | align="center" | 336 |
||
+ | |- |
||
+ | | align="center" | Q9AG78 |
||
+ | | align="center" | 399 |
||
+ | | align="center" | 405 |
||
+ | | align="center" | 399 |
||
+ | | align="center" | 399 |
||
+ | |- |
||
+ | | align="center" | B4UJD0 |
||
+ | | align="center" | 280 |
||
+ | | align="center" | 286 |
||
+ | | align="center" | 280 |
||
+ | | align="center" | 280 |
||
+ | |- |
||
+ | | align="center" | *1ltu_A |
||
+ | | align="center" | 399 |
||
+ | | align="center" | 405 |
||
+ | | align="center" | 399 |
||
+ | | align="center" | 399 |
||
+ | |- |
||
+ | | style="border-bottom:3px solid gray;" align="center" | *1ltz_A |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 237 |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 243 |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 237 |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 237 |
||
+ | |- |
||
+ | |} |
||
+ | <small>'''<caption>''' Number of gaps inserted when creating the multiple alignments of the dataset "low". The first column contains the sequence IDs (pdb entries are marked with *), whereas the following columns show the gaps inserted by ClustalW, Muscle, T-Coffee and Expresso.</caption></small> |
||
+ | </figtable> |
||
+ | | |
||
+ | |<figtable id="medium_set"> |
||
+ | {| border="1" cellpadding="5" cellspacing="0" align="center" |
||
+ | |+'''Gaps in Dataset "medium":''' |
||
+ | |- |
||
+ | ! colspan="5" style="background:#32CD32;" | Group of sequences between 30% and 60% identity |
||
+ | |- |
||
+ | ! style="background:#90EE90;" align="center" | ID |
||
+ | ! style="background:#90EE90;" align="center" | ClustalW |
||
+ | ! style="background:#90EE90;" align="center" | MUSCLE |
||
+ | ! style="background:#90EE90;" align="center" | T-COFFEE |
||
+ | ! style="background:#90EE90;" align="center" | EXPRESSO |
||
+ | |- |
||
+ | | align="center" | '''P00439''' |
||
+ | | align="center" | 10 |
||
+ | | align="center" | 21 |
||
+ | | align="center" | 25 |
||
+ | | align="center" | 29 |
||
+ | |- |
||
+ | | align="center" | I1C3M2 |
||
+ | | align="center" | 37 |
||
+ | | align="center" | 48 |
||
+ | | align="center" | 52 |
||
+ | | align="center" | 56 |
||
+ | |- |
||
+ | | align="center" | F6ZKP1 |
||
+ | | align="center" | 20 |
||
+ | | align="center" | 31 |
||
+ | | align="center" | 35 |
||
+ | | align="center" | 39 |
||
+ | |- |
||
+ | | align="center" | Q54XS1 |
||
+ | | align="center" | 43 |
||
+ | | align="center" | 54 |
||
+ | | align="center" | 58 |
||
+ | | align="center" | 62 |
||
+ | |- |
||
+ | | align="center" | G1KSP0 |
||
+ | | align="center" | 287 |
||
+ | | align="center" | 298 |
||
+ | | align="center" | 302 |
||
+ | | align="center" | 306 |
||
+ | |- |
||
+ | | align="center" | E5SYS4 |
||
+ | | align="center" | 297 |
||
+ | | align="center" | 308 |
||
+ | | align="center" | 312 |
||
+ | | align="center" | 316 |
||
+ | |- |
||
+ | | align="center" | H3EDU0 |
||
+ | | align="center" | 75 |
||
+ | | align="center" | 86 |
||
+ | | align="center" | 90 |
||
+ | | align="center" | 94 |
||
+ | |- |
||
+ | | align="center" | H0HJI4 |
||
+ | | align="center" | 402 |
||
+ | | align="center" | 413 |
||
+ | | align="center" | 417 |
||
+ | | align="center" | 421 |
||
+ | |- |
||
+ | | align="center" | F4WGX3 |
||
+ | | align="center" | 283 |
||
+ | | align="center" | 294 |
||
+ | | align="center" | 298 |
||
+ | | align="center" | 302 |
||
+ | |- |
||
+ | | align="center" | *2toh_A |
||
+ | | align="center" | 282 |
||
+ | | align="center" | 293 |
||
+ | | align="center" | 297 |
||
+ | | align="center" | 301 |
||
+ | |- |
||
+ | | style="border-bottom:3px solid gray;" align="center" | *2qmx_A |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 416 |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 427 |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 431 |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 435 |
||
+ | |- |
||
+ | |} |
||
+ | <small>'''<caption>''' Number of gaps inserted when creating the multiple alignments of the dataset "medium". The first column contains the sequence IDs (pdb entries are marked with *), whereas the following columns show the gaps inserted by ClustalW, Muscle, T-Coffee and Expresso.</caption></small> |
||
+ | </figtable> |
||
+ | | |
||
+ | |<figtable id="high_set"> |
||
+ | {| border="1" cellpadding="5" cellspacing="0" align="center" |
||
+ | |+'''Gaps in Dataset "high":''' |
||
+ | |- |
||
+ | ! colspan="5" style="background:#32CD32;" | Group of sequences with > 60% identity |
||
+ | |- |
||
+ | ! style="background:#90EE90;" align="center" | ID |
||
+ | ! style="background:#90EE90;" align="center" | ClustalW |
||
+ | ! style="background:#90EE90;" align="center" | MUSCLE |
||
+ | ! style="background:#90EE90;" align="center" | T-COFFEE |
||
+ | ! style="background:#90EE90;" align="center" | EXPRESSO |
||
+ | |- |
||
+ | | align="center" | '''P00439''' |
||
+ | | align="center" | 8 |
||
+ | | align="center" | 13 |
||
+ | | align="center" | 15 |
||
+ | | align="center" | 15 |
||
+ | |- |
||
+ | | align="center" | H2UJM8 |
||
+ | | align="center" | 20 |
||
+ | | align="center" | 25 |
||
+ | | align="center" | 27 |
||
+ | | align="center" | 27 |
||
+ | |- |
||
+ | | align="center" | K1RSS1 |
||
+ | | align="center" | 280 |
||
+ | | align="center" | 285 |
||
+ | | align="center" | 287 |
||
+ | | align="center" | 287 |
||
+ | |- |
||
+ | | align="center" | E7D1A7 |
||
+ | | align="center" | 342 |
||
+ | | align="center" | 347 |
||
+ | | align="center" | 349 |
||
+ | | align="center" | 349 |
||
+ | |- |
||
+ | | align="center" | G7YPD5 |
||
+ | | align="center" | 115 |
||
+ | | align="center" | 120 |
||
+ | | align="center" | 122 |
||
+ | | align="center" | 122 |
||
+ | |- |
||
+ | | align="center" | D1LXB2 |
||
+ | | align="center" | 162 |
||
+ | | align="center" | 167 |
||
+ | | align="center" | 169 |
||
+ | | align="center" | 169 |
||
+ | |- |
||
+ | | align="center" | G1KJG2 |
||
+ | | align="center" | 220 |
||
+ | | align="center" | 225 |
||
+ | | align="center" | 227 |
||
+ | | align="center" | 227 |
||
+ | |- |
||
+ | | align="center" | G9B2G8 |
||
+ | | align="center" | 42 |
||
+ | | align="center" | 47 |
||
+ | | align="center" | 49 |
||
+ | | align="center" | 49 |
||
+ | |- |
||
+ | | align="center" | E9RJV0 |
||
+ | | align="center" | 353 |
||
+ | | align="center" | 358 |
||
+ | | align="center" | 360 |
||
+ | | align="center" | 360 |
||
+ | |- |
||
+ | | align="center" | *2pah_A |
||
+ | | align="center" | 220 |
||
+ | | align="center" | 225 |
||
+ | | align="center" | 227 |
||
+ | | align="center" | 227 |
||
+ | |- |
||
+ | | style="border-bottom:3px solid gray;" align="center" | *1kw0_A |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 160 |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 165 |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 167 |
||
+ | | style="border-bottom:3px solid gray;" align="center" | 167 |
||
+ | |- |
||
+ | |} |
||
+ | <small>'''<caption>''' Number of gaps inserted when creating the multiple alignments of the dataset "high". The first column contains the sequence IDs (pdb entries are marked with *), whereas the following columns show the gaps inserted by ClustalW, Muscle, T-Coffee and Expresso.</caption></small> |
||
+ | </figtable> |
||
+ | |} |
||
+ | |||
+ | === Discussion === |
||
+ | A comparison of the different tools shows broadly similar outcomes. As expected, the dataset with low similarities contains a lot of gaps. Only our protein P00439 (PAH) receives few gaps. This is because the other sequences are shorter and only cover a small part of our protein, roughly from position 175 to 295. In this dataset, Muscle is the only tool that inserts more gaps than the others. For the dataset with identities between 30% and 60%, higher conservation can be recognised. Again the results of the different tools are similar, although the conserved regions are sometimes shifted because of gaps inserted between them. The set with the highest sequence similarities shows the fewest gaps and only few differences between the tools. ClustalW inserted the fewest gaps. For the datasets low and high, T-Coffee and Expresso have identical numbers of gaps. The multiple alignments are shown in <xr id="clustalW"/> (ClustalW), <xr id="muscle"/> (Muscle), <xr id="tCoffee"/> (T-Coffee) and <xr id="expresso"/> (Expresso). In each figure, '''a)''' shows the MSA of the low, '''b)''' of the medium and '''c)''' of the high dataset, while '''d)''' shows the MSA created from 20 sequences that cover the whole range of similarity, with identities between 27% and 96%. For this set, T-Coffee and Expresso again produce very similar outcomes, and Muscle does not differ much from them either. ClustalW, however, seems to have more trouble producing a good alignment, even for the similar sequences, and the conserved amino acids are broken into pieces. Even though the set contains some sequences with low similarity, T-Coffee and Expresso produce good alignments. Altogether we think that a medium sequence identity is sufficient to get a good MSA. Nevertheless, it is also possible to include sequences with lower identity. 
In such cases sequences with high similarity should be included as well, and T-Coffee seems to be a good choice for creating a reliable MSA. |
||
+ | |||
+ | [[Category: Phenylketonuria 2013]] |
Latest revision as of 22:52, 5 September 2013
Summary of the task
In this task we compare the protein sequence of interest, phenylalanine hydroxylase (PAH), to other protein sequences. To this end, both sequence searches and multiple sequence alignments were carried out using the big80 database, which contains subsets of Swiss-Prot and PDB whose entries have a sequence similarity of 80% or less. In addition, searches against a PDB database were performed. For the sequence searches the programs BLAST, PSI-BLAST and HHblits were used. Their results served as input for the multiple sequence alignments (MSA), which were created with ClustalW, Muscle and T-Coffee.
Sequence searches
Comparison of the results
In this part the results of the different sequence searches are analyzed. To do so, the outputs were parsed with Maria's parser and filtered for their IDs, sequence identities and e-values.
Sequence identity in percent
</figure> </figure> </figure>
<figure id="fig:blast"> |
<figure id="fig:psiblast"> |
<figure id="fig:hhblits"> |
Comparing the sequence identities of the different search tools shows that BLAST (<xr id="fig:blast"/>) and PSI-BLAST (<xr id="fig:psiblast"/>) with two iterations, for both e-value cutoffs, have similar distributions with a maximum frequency between 35% and 40% sequence identity. For ten iterations the curve is shifted slightly to the left, with a maximum frequency at about 20%. Overall, more sequences were found with ten iterations than with two, but with lower sequence identity. This can be attributed to the profile becoming less specific with each iteration. The runs with an e-value cutoff of 10e-10 return fewer sequences than the runs with a cutoff of 0.002, because 10e-10 is the stricter cutoff. That cutoff is very strict and is more likely to produce false negatives, whereas the cutoff of 0.002 is lenient and likely includes some false positives. The highest number of sequences was found in the HHblits search (<xr id="fig:hhblits"/>), whose distribution of sequence identity closely resembles that of the PSI-BLAST runs with ten iterations.
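The binning behind such identity distributions can be sketched as follows. This is only an illustration and not the script used for the figures: the function name `identity_histogram`, the 5% bin width and the assumption that identities are given in percent are hypothetical choices.

```python
from collections import Counter


def identity_histogram(identities, bin_width=5):
    """Count sequence identities (in percent) per bin of `bin_width` percent.

    Returns a dict mapping the lower edge of each bin to the number of hits.
    A value of exactly 100 is placed in the last bin instead of a new one.
    """
    counts = Counter()
    for value in identities:
        start = int(value // bin_width) * bin_width
        start = min(start, 100 - bin_width)
        counts[start] += 1
    return dict(sorted(counts.items()))


def modal_bin(histogram):
    """Return the lower edge of the bin with the highest frequency."""
    return max(histogram, key=histogram.get)
```

With such a histogram, the "maximum frequency between 35% and 40%" mentioned above corresponds to `modal_bin` returning 35.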
E-Value
</figure> </figure> </figure>
<figure id="fig:blast-evalue"> |
<figure id="fig:psiblast-evalue"> |
<figure id="fig:hhblits-evalue"> |
The distributions of the logarithmic e-values of the sequence searches all look similar, with a lowest and therefore best value below -400. In BLAST (<xr id="fig:blast-evalue"/>) and HHblits (<xr id="fig:hhblits-evalue"/>), however, the values extend above 0. The maximum frequency of the e-values in BLAST still lies in the negative range, whereas in HHblits it lies in the positive range, meaning that those e-values are not as good. The best e-values are found in the PSI-BLAST searches, especially with 10 iterations (<xr id="fig:psiblast-evalue"/>). All distributions show low frequencies at low e-values and reach their maximum at higher e-values.
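A small sketch of how e-values might be log-transformed before plotting. The source does not state which logarithm base was used or how e-values of 0.0 were handled, so the natural logarithm and the floor value `1e-300` below are assumptions, and `log_evalue` is a hypothetical helper name.

```python
import math


def log_evalue(evalue, floor=1e-300):
    """Return the natural log of an e-value, clamping values below `floor`.

    Search tools print very small e-values as 0.0, whose logarithm is
    undefined, so such values are replaced by `floor` (an assumed cutoff).
    """
    return math.log(max(evalue, floor))
```

An e-value of 1 maps to 0, so negative values on the log scale correspond to e-values below 1 and positive values to e-values above 1.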
GO-terms
For the reference sequence (P00439), GO terms (<xr id="go"/>) were retrieved from QuickGO. To look for similarities between the reference sequence and the sequences found in the searches, we wrote a program that downloads those terms for all detected sequences and counts how often the GO terms of PAH occur among them: <figtable id="go">
"GO numbers and terms of PAH and their number of occurence in the different search results" | |||||||
---|---|---|---|---|---|---|---|
GO | GO-Term | BLAST | PSIBLAST 2 x 0.002 | PSIBLAST 2 x 10e-10 | PSIBLAST 10 x 0.002 | PSIBLAST 10 x 10e-10 | HHblits |
GO:0008152 | metabolic process | 554 | 808 | 818 | 712 | 713 | 5701 |
GO:0016597 | amino acid binding | 554 | 808 | 818 | 712 | 713 | 5611 |
GO:0055114 | oxidation-reduction process | 458 | 457 | 459 | 443 | 438 | 3522 |
GO:0005506 | iron ion binding | 449 | 448 | 450 | 434 | 431 | 1043 |
GO:0009072 | aromatic amino acid family metabolic process | 449 | 448 | 450 | 434 | 431 | 1043 |
GO:0004497 | monooxygenase activity | 449 | 448 | 450 | 434 | 431 | 1043 |
GO:0016714 | oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, reduced pteridine as one donor, and incorporation of one atom of oxygen | 445 | 443 | 444 | 434 | 431 | 1033 |
GO:0004505 | phenylalanine 4-monooxygenase activity | 207 | 207 | 207 | 205 | 205 | 442 |
GO:0006559 | L-phenylalanine catabolic process | 165 | 165 | 165 | 165 | 165 | 395 |
GO:0016491 | oxidoreductase activity | 158 | 157 | 157 | 154 | 154 | 2456 |
GO:0003824 | catalytic activity | 12 | 30 | 28 | 33 | 32 | 3237 |
GO:0042423 | catecholamine biosynthetic process | 15 | 25 | 15 | 15 | 15 | 80 |
GO:0008652 | cellular amino acid biosynthetic process | 6 | 10 | 11 | 12 | 12 | 1762 |
GO:0046872 | metal ion binding | 7 | 6 | 6 | 6 | 6 | 1141 |
GO:0042136 | neurotransmitter biosynthetic process | 2 | 2 | 2 | 2 | 2 | 12 |
GO:0005829 | cytosol | 0 | 1 | 1 | 2 | 2 | 16 |
GO:0034641 | cellular nitrogen compound metabolic process | 0 | 0 | 0 | 0 | 0 | 4 |
GO:0044281 | small molecule metabolic process | 0 | 0 | 0 | 0 | 0 | 4 |
GO terms and their number of occurrences for the protein sequences found in the different sequence searches: BLAST, PSI-BLAST with 2 iterations and e-value cutoffs of 0.002 and 10e-10, PSI-BLAST with 10 iterations and the same cutoffs, and HHblits. </figtable> Looking at the GO terms, it is apparent that general descriptions such as 'metabolic process' or 'amino acid binding' occur very often. The more specific the GO terms of PAH get, the fewer sequences in the searches share the same description. In HHblits, however, some terms such as 'catalytic activity' are detected disproportionately more often than in the other searches, even when taking into account that many more sequences are found in the HHblits clusters.
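The counting step described above can be sketched as follows. This is not the original program: it assumes the GO annotations have already been downloaded into a dictionary, and the names `count_reference_terms` and `hit_annotations` are hypothetical.

```python
def count_reference_terms(reference_terms, hit_annotations):
    """Count how many hits carry each GO term of the reference sequence.

    reference_terms: iterable of GO IDs annotated to the reference (PAH).
    hit_annotations: dict mapping a hit's sequence ID to its GO IDs
                     (assumed to be fetched beforehand).
    """
    reference = set(reference_terms)
    counts = {term: 0 for term in reference}
    for terms in hit_annotations.values():
        # Each hit contributes at most once per GO term.
        for term in reference & set(terms):
            counts[term] += 1
    return counts
```

Terms of a hit that are not annotated to the reference are ignored, which matches the table: only the GO terms of PAH appear as rows.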
Best results
In this part the sequences with highest sequence identity and lowest e-value are compared for the different searches.
Sequence identity
In the following table (<xr id="hhblits"/>) the four proteins with the highest sequence identity found in the HHblits run are presented. <figtable id="hhblits">
Best sequence identities in HHblits | |||||||
---|---|---|---|---|---|---|---|
UniProt-ID | Identity | E-Value | |||||
C9JMN0 | 1.0 | 5.0E-6 | |||||
Q66RJ9 | 1.0 | 0.006 | |||||
Q16021 | 0.96 | 8.4E-12 | |||||
Q66RJ7 | 0.96 | 7.1E-4 |
</figtable> None of these sequences were found by BLAST or by any of the four PSI-BLAST searches. However, some of the proteins with the highest sequence identity found with BLAST were also found with PSI-BLAST (<xr id="blast"/>). <figtable id="blast">
Best sequence identities in BLAST | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
UniProt-ID | Identity | E-Value | Identity PSI-BLAST 2x0.002 | Identity PSI-BLAST 2x10e-10 | Identity PSI-BLAST 10x0.002 | Identity PSI-BLAST 10x10e-10 | ||||
F6XY00 | 0.92 | 0.0 | 0.95 | 0.95 | 0.93 | 0.96 | ||||
L9L9N2 | 0.89 | 0.0 | 0.93 | 0.92 | 0.95 | 0.90 | ||||
G5AMD7 | 0.89 | 0.0 | 0.93 | 0.93 | 0.91 | 0.91 | ||||
D3YZ73 | 0.85 | 7.0E-20 | 0.73 | 0.73 | - | - | ||||
G1KSL1 | 0.81 | 0.0 | 0.82 | 0.82 | 0.82 | 0.79 |
</figtable> In addition, the sequences F6XY00, L9L9N2, G5AMD7 and G1KSL1 are also the best four in the PSI-BLAST runs, though not in the same order, and with better e-values than in BLAST. They are not included in the HHblits result. Only D3YZ73 is not among the best hits in PSI-BLAST and is not found at all after ten iterations. In HHblits, however, it is found with a sequence identity of 0.71.
E-Value
The best e-values are the lowest ones. In HHblits the three lowest values are all 5.5e-175 (<xr id="hh-eval"/>). <figtable id="hh-eval">
Best e-values in HHblits | |||||||
---|---|---|---|---|---|---|---|
UniProt-ID | E-Value | Identity | |||||
A7UTU7 | 5.5e-175 | 0.68 | |||||
P17752 | 5.5e-175 | 0.68 | |||||
P00439 | 5.5e-175 | 0.68 |
</figtable> Again, none of these protein sequences were found by BLAST or by any of the four PSI-BLAST searches. The proteins with the best e-values of the BLAST run (<xr id="bl-eval"/>), however, are found in all searches, except for H3FAJ0, which is not included in the HHblits result. They have very good e-values in all runs, but are among the three best only in BLAST. <figtable id="bl-eval">
Best e-values in BLAST | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
UniProt-ID | E-Value | Identity | E-Value PSI-BLAST 2x0.002 | E-Value PSI-BLAST 2x10e-10 | E-Value PSI-BLAST 10x0.002 | E-Value PSI-BLAST 10x10e-10 | E-Value HHblits | |||
H3FAJ0 | 1.0e-177 | 0.544276457883369 | 1.0E-157 | 1.0E-162 | 1.0E-122 | 1.0E-126 | - | |||
A8WSM6 | 1.0e-177 | 0.563636363636364 | 1.0E-159 | 1.0E-164 | 1.0E-127 | 1.0E-131 | 1.1E-172 | |||
A9UUJ8 | 1.0e-176 | 0.567695961995249 | 1.0E-156 | 1.0E-160 | 1.0E-127 | 1.0E-131 | 5.5E-175 |
</figtable>
In PSI-BLAST, the three best proteins of the run with two iterations and an e-value cutoff of 10e-10 and of the run with ten iterations and an e-value cutoff of 0.002 are the same. The best one is H2UJM8 with an e-value of 1e-179. For two iterations and an e-value cutoff of 0.002 it is Q4VBE2 with an e-value of 1e-178, and for ten iterations and an e-value cutoff of 10e-10 it is K7F3H7 with an e-value of 1e-143, which is good, but not as good as the other top hits.
When comparing sequence identity and e-value, it is striking that the sequences with the best identities often do not appear among the best e-values and vice versa. In the BLAST run in particular, the sequences with the highest sequence identity are reported with an e-value of 0.0, while the sequences with the lowest e-values listed above only have sequence identities slightly above 54%. Note that BLAST prints 0.0 when an e-value is too small to be displayed, so such hits are highly significant rather than poor. A good balance between high sequence identity and low e-value is therefore hard to achieve, but necessary for trustworthy results.
Multiple sequence alignments
The images of the alignments were created with Jalview. Colours follow the Clustalx colour scheme.
Lab journal
Datasets
For the multiple sequence alignments, four different datasets were generated from the BLAST outputs with a Python script. Three groups contain ten sequences each (two of which are pdb sequences): one with more than 60% sequence identity to our target sequence (PAH), one with sequence identity between 30% and 60%, and one with less than 30% sequence identity (<xr id="datasets"/>). In the group with < 30% identity the pdb sequences have 32% identity, because these are the lowest ones found in the BLAST output against the pdb dataset. The fourth group contains 20 sequences covering the whole range of sequence identities, four of which are pdb sequences (<xr id="set_all"/>). For comparison, we added our reference sequence of the phenylalanine hydroxylase enzyme (PAH, P00439) to all four groups. <figtable id="datasets">
|
|
|
</figtable>
<figtable id="set_all">
Group of sequences with different identities (0-100%) | |||
---|---|---|---|
Sequence identity | ID | Sequence identity | ID |
63% | D1LXB2 | 33% | A9D485 |
63% | G3MQ02 | 33% | G7UU95 |
56% | A9UUJ8 | 31% | I3TIA8 |
54% | D2XNL7 | 31% | A0Y3T4 |
52% | E9G1D2 | 29% | D5HB02 |
44% | I1GDE7 | 27% | A1ZW97 |
41% | H1L1J2 | 96% | 5pah_A (pdb) |
37% | D7AKY2 | 96% | 3pah_A (pdb) |
36% | B7GIR4 | 96% | 2pah_A (pdb) |
35% | A0C973 | 33% | 2v27_B (pdb) |
</figtable>
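The grouping by sequence identity can be sketched as follows. The original Python script is not shown in this page, so this is only an assumed reconstruction: the function name `bin_by_identity`, the input format of `(ID, identity in percent)` pairs and the treatment of the boundary values 30% and 60% as "medium" are hypothetical.

```python
def bin_by_identity(hits, low_cut=30.0, high_cut=60.0):
    """Sort (sequence ID, percent identity) pairs into three groups.

    low:    identity below `low_cut`
    medium: identity from `low_cut` to `high_cut` (both inclusive, assumed)
    high:   identity above `high_cut`
    """
    bins = {"low": [], "medium": [], "high": []}
    for seq_id, identity in hits:
        if identity < low_cut:
            bins["low"].append(seq_id)
        elif identity <= high_cut:
            bins["medium"].append(seq_id)
        else:
            bins["high"].append(seq_id)
    return bins
```

For example, the entries D1LXB2 (63%), A9UUJ8 (56%) and A1ZW97 (27%) from the table above would fall into the high, medium and low group respectively. Selecting the ten sequences per group and adding the pdb sequences would be a separate step.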
ClustalW
<figure id="clustalW">
</figure>
Muscle
<figure id="muscle">
</figure>
T-Coffee
<figure id="tCoffee">
</figure>
Expresso(3D-Coffee)
Expresso is an extension of 3D-Coffee and uses BLAST to search the PDB database for structures whose sequences are similar to the given sequences. These structures are then used to build the alignment. It is slower than T-Coffee itself, but if it finds enough structures it is more accurate than the other programs.
Since we could not run Expresso on the server, we have used this website for the multiple sequence alignments with Expresso (3D-Coffee).
<figure id="expresso">
</figure>
Results
The following tables (<xr id="low_set"/>, <xr id="medium_set"/> and <xr id="high_set"/>) show the number of gaps inserted into each sequence when the multiple alignments were created with ClustalW, Muscle, T-Coffee and Expresso. The sequence of PAH (P00439) is shown in bold, and pdb sequences are marked with *.
<figtable id="low_set">
<figtable id="medium_set">
<figtable id="high_set">
|
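Gap counts per sequence, as reported in the tables above, could be obtained with a sketch like the following. It assumes that each alignment was exported in FASTA format with `-` as the gap character, which the source does not state; `count_gaps` is a hypothetical helper name.

```python
def count_gaps(aligned_fasta):
    """Count gap characters ('-') per sequence in an aligned FASTA string.

    The sequence ID is taken as the first word after '>' in each header.
    Sequence lines may be wrapped over several lines.
    """
    gaps = {}
    current = None
    for line in aligned_fasta.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            gaps[current] = 0
        elif current is not None:
            gaps[current] += line.count("-")
    return gaps
```

Running this once per tool and dataset would give one column of the gap tables each.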
Discussion
A comparison of the different tools shows broadly similar outcomes. As expected, the dataset with low similarities contains a lot of gaps. Only our protein P00439 (PAH) receives few gaps. This is because the other sequences are shorter and only cover a small part of our protein, roughly from position 175 to 295. In this dataset, Muscle is the only tool that inserts more gaps than the others. For the dataset with identities between 30% and 60%, higher conservation can be recognised. Again the results of the different tools are similar, although the conserved regions are sometimes shifted because of gaps inserted between them. The set with the highest sequence similarities shows the fewest gaps and only few differences between the tools. ClustalW inserted the fewest gaps. For the datasets low and high, T-Coffee and Expresso have identical numbers of gaps. The multiple alignments are shown in <xr id="clustalW"/> (ClustalW), <xr id="muscle"/> (Muscle), <xr id="tCoffee"/> (T-Coffee) and <xr id="expresso"/> (Expresso). In each figure, a) shows the MSA of the low, b) of the medium and c) of the high dataset, while d) shows the MSA created from 20 sequences that cover the whole range of similarity, with identities between 27% and 96%. For this set, T-Coffee and Expresso again produce very similar outcomes, and Muscle does not differ much from them either. ClustalW, however, seems to have more trouble producing a good alignment, even for the similar sequences, and the conserved amino acids are broken into pieces. Even though the set contains some sequences with low similarity, T-Coffee and Expresso produce good alignments. Altogether we think that a medium sequence identity is sufficient to get a good MSA. Nevertheless, it is also possible to include sequences with lower identity. In such cases sequences with high similarity should be included as well, and T-Coffee seems to be a good choice for creating a reliable MSA.