Difference between revisions of "Sequence-based predictions (Phenylketonuria)"

From Bioinformatikpedia

Revision as of 13:36, 17 August 2013

Summary

Sequence-based approaches can predict a variety of structural and functional properties of proteins. Here, we used several methods to characterize our protein, phenylalanine hydroxylase (PAH, P00439), and in some cases also other given proteins (in brackets):

  • ReProf for secondary structure prediction (P10775, Q9X0E6, Q08209)
  • IUPred and MD (MetaDisorder) for the prediction of the disorder (P10775, Q9X0E6, Q08209)
  • PolyPhobius and MEMSAT-SVM to predict transmembrane helices (P35462, Q9YDF8, P47863)
  • SignalP to predict signal peptides (P02768, P47863, P11279)
  • GOPET and ProtFun2.0 to predict GO terms
  • Pfam with a sequence search to find out more about the Pfam family of our protein

The results are presented here and discussed in detail.

Secondary structure

Lab journal
In general, the secondary structure of a protein is the local three-dimensional form of its backbone, stabilized by inter-residue interactions such as hydrogen bonds. Alpha-helices and beta-strands are most common, but loops and turns also occur.

Because secondary structure is not easy to observe directly, several methods exist to predict or assign it:

  • ReProf is already installed on the student lab machines and can be called with just the input parameter -i.
  • For the secondary structure prediction with PsiPred v3.3, we used the PSIPRED server.
  • DSSP needs PDB files; we used the DSSP server to create the DSSP files.
    • The PDB IDs are:
      • P10775: 2BNH (Ribonuclease inhibitor, eukaryote: Sus scrofa (pig))
      • Q9X0E6: 1VHF (Divalent-cation tolerance protein CutA, bacterium: Thermotoga maritima)
      • Q08209: 1AUI (Serine/threonine-protein phosphatase 2B catalytic subunit alpha isoform, eukaryote: Homo sapiens)
      • P00439: 1PAH (Phenylalanine-4-hydroxylase, eukaryote: Homo sapiens)

More than one PDB ID exists for each UniProt ID, and the PDB sequences are not completely identical to the UniProt sequences. For example, 1PAH only covers residues 117 to 424. Nevertheless, we tried to choose the most similar structure and aligned it by hand. Positions for which no secondary structure is available are marked with a '-'. Furthermore, the different secondary structure alphabets are harmonized (<xr id="secondary structure"/>) so that all outputs use the ReProf format with the letters E, H and L. For this, we wrote a program that filters the ReProf, PsiPred and DSSP outputs for the secondary structure: filter_seqStruc.pl <figtable id="secondary structure">

"Secondary Structure"
Type ReProf PsiPred DSSP
Helix (alpha) H H GHI
Extended strand (beta) E E BE
Loops/Turns L C ST
The different types to represent secondary structure in ReProf, PsiPred and DSSP.

</figtable>
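The letter mapping in the table above can be sketched in a few lines of Python. This is only an illustration of the idea, not the filter_seqStruc.pl program itself; the names `TO_THREE_STATE` and `to_three_state` are made up for this example, and any symbol that is not listed in the table is assumed to become '-'.

```python
# Mapping taken from the table above (ReProf, PsiPred, DSSP).
# Assumption: symbols not listed in the table are written as '-'.
TO_THREE_STATE = {
    "reprof": {"H": "H", "E": "E", "L": "L"},
    "psipred": {"H": "H", "E": "E", "C": "L"},
    "dssp": {"G": "H", "H": "H", "I": "H",
             "B": "E", "E": "E",
             "S": "L", "T": "L"},
}


def to_three_state(states, source):
    """Translate a secondary structure string into the letters H, E and L."""
    table = TO_THREE_STATE[source.lower()]
    return "".join(table.get(symbol, "-") for symbol in states)


if __name__ == "__main__":
    # Made-up input strings, only to show the conversion.
    print(to_three_state("GGGHHHTT EEB S", "dssp"))
    print(to_three_state("CCHHEEC", "psipred"))
```

Keeping the mapping in one dictionary makes it easy to check it against the table and to add another tool later.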

P10775 (RNH1)

ReProf accepts either a PSSM or a FASTA sequence as input for secondary structure prediction. We used three different inputs and compared them: the FASTA sequence itself, a PSSM generated by PSI-BLAST against the big80 database, and a PSSM generated against the Swissprot database. In the following, the ReProf predictions are compared against the DSSP assignment and the PsiPred prediction (<xr id="DSSP"/>, <xr id="PsiPred"/>). Furthermore, they are compared to the secondary structure recorded in UniProt (<xr id="uniprot"/>). We wrote a Java program (SecStrucComparison.jar) that counts the number of matches between two sequences and the total number of residues. The precision for each letter and in total is then calculated and given in %.

precision = number of matches / number of residues

Positions without an assignment in DSSP or UniProt (indicated by '-') are ignored.
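The calculation can be sketched as follows. This is not the SecStrucComparison.jar program, only a minimal Python illustration of the formula above. The function name `match_percentages` is invented, and it is an assumption that the per-letter value is counted relative to the letters of the reference sequence; a letter that never occurs in the reference is reported as 0.0.

```python
from collections import Counter


def match_percentages(predicted, reference, gap="-"):
    """Percentage of matching positions per letter (H, E, L) and in total.

    Positions where either sequence has the gap symbol are ignored.
    Assumption: per-letter values are counted over the reference letters.
    """
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")

    matches = Counter()
    totals = Counter()
    for p, r in zip(predicted, reference):
        if p == gap or r == gap:
            continue
        totals[r] += 1
        if p == r:
            matches[r] += 1

    result = {}
    for letter in "HEL":
        if totals[letter]:
            result[letter] = 100.0 * matches[letter] / totals[letter]
        else:
            result[letter] = 0.0  # letter does not occur in the reference
    compared = sum(totals.values())
    result["total"] = 100.0 * sum(matches.values()) / compared if compared else 0.0
    return result


if __name__ == "__main__":
    # Made-up toy sequences, not taken from the proteins above.
    print(match_percentages("HHHELL-L", "HHEEL-LL"))
```

In the toy example six positions are compared, because the two positions containing a '-' are skipped.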

<figtable id="DSSP">

"Precision of predicted secondary structures against the DSSP structure."
Letter FASTA PSSM-Big PSSM-Swissprot PsiPred Uniprot
E 21.05 63.16 80.7 84.21 100.0
H 71.94 94.9 92.35 83.16 100.0
L 71.95 85.37 79.27 95.12 0.0
total 63.28 87.16 87.16 86.27 95.96
Comparison of the secondary structures to the DSSP structure

</figtable> <figtable id="PsiPred">

"Precision of predicted secondary structures against the PsiPred structure."
Letter FASTA PSSM-Big PSSM-Swissprot DSSP Uniprot
E 20.0 69.09 90.91 97.96 100.0
H 77.71 100.0 99.4 98.19 100.0
L 61.7 77.02 70.64 65.0 0.0
total 62.5 84.43 83.55 86.27 86.1
Comparison of the secondary structures to the PsiPred structure

</figtable> <figtable id="uniprot">

"Precision of predicted secondary structures against the Uniprot structure."
Letter FASTA PSSM-Big PSSM-Swissprot PsiPred DSSP
E 22.22 55.56 71.11 73.33 80.0
H 74.16 97.19 94.94 89.33 100.0
L 0.0 0.0 0.0 0.0 0.0
total 63.68 88.79 90.13 86.1 95.96
Comparison of the secondary structures to the Uniprot structure

</figtable> The UniProt record of P10775 contains no loops or turns, which is why L has 0% in those comparisons. All three comparisons show that ReProf performs worse with the FASTA sequence than with PSSMs, especially for extended strands, where only about 20% to 22% of the positions match. The two databases (big80 and Swissprot), however, show almost no difference. Because beta strands were predicted better with the Swissprot database and the comparison with UniProt gave a better total, this database is used for the other proteins. Additionally, PsiPred gives results similar to the two ReProf predictions with PSSMs. Although DSSP uses the PDB structure, differences between the DSSP and the UniProt secondary structure can be seen.

ReProf comparison (Q9X0E6, Q08209, P00439)

After choosing the SwissProt database for the PSSMs, we ran ReProf, PsiPred and DSSP for the two proteins Q9X0E6 (CUTA) and Q08209 (PP2BA) as well as for our protein P00439 (PAH). Again, the results are analyzed with our Java program (<xr id="reprof"/>). <figtable id="reprof">

Q9X0E6 (CUTA)

Secondary structure comparison

Type PsiPred DSSP
E 80.95 90.00
H 62.50 69.23
L 66.67 38.10
total 67.33 66.67
Q08209 (PPP3CA)

Secondary structure comparison

Type PsiPred DSSP
E 51.52 71.70
H 80.59 85.71
L 86.67 46.43
total 80.23 64.39
P00439 (PAH)

Secondary structure comparison

Type PsiPred DSSP
E 74.00 57.14
H 88.21 85.82
L 90.82 56.82
total 87.83 73.20
Comparison of the ReProf result with PsiPred and DSSP for the three proteins Q9X0E6, Q08209 and P00439.

</figtable>

Comparing the ratios of matches to the number of residues shows that, in total, the ReProf prediction is more similar to PsiPred than to DSSP. In most cases helices are predicted quite well, but the other forms are also predicted well. In total, the prediction for our protein P00439 shows the highest secondary structure similarity. Only the prediction of extended strands in comparison with DSSP is worse than for the other proteins. Altogether, ReProf seems to make a good prediction.

P00439 sequences

Here the amino acid sequence of our protein (PAH) as well as its secondary structure predictions and records are shown.

 FASTA : MSTAVLENPG LGRKLSDFGQ ETSYIEDNCN QNGAISLIFS LKEEVGALAK VLRLFEENDV
 ReProf: LLLLELLLLL LLLLLLLLLL LLLLLLLLLL LLLEEEEEEE ELLLLHHHHH HHHHHHHLLL
PsiPred: LLLLLLLLLL LLLLLLLLLL LLLLLLLLLL LLLEEEEEEE ELLLLLHHHH HHHHHHHLLL
   DSSP: ---------- ---------- ---------- ---------- ---------- ----------
Uniprot: ---------- ---------- ---------- ---------- ---------- ----------  
 
 FASTA : NLTHIESRPS RLKKDEYEFF THLDKRSLPA LTNIIKILRH DIGATVHELS RDKKKDTVPW
 ReProf: LEEEEELLLL LLLLLLEEEE EEEELLLHHH HHHHHHHHHL LLLLLLLLLL LLLLLLLLLL
PsiPred: EEEEEELLLL LLLLLLEEEE EEELLLLLHH HHHHHHHHHH LLLLLLLLLL LLLLLLLLLL
   DSSP: ---------- ---------- ---------- ---------- ---------- ----------
Uniprot: ---------- ---------- ---------- ---------- ---------- ----------

 FASTA : FPRTIQELDR FANQILSYGA ELDADHPGFK DPVYRARRKQ FADIAYNYRH GQPIPRVEYM
 ReProf: LLHHHHHHHH HHHHHHLLLL LLLLLLLLLL LLHHHHHHHH HHHLLLLLLL LLLLLLLLLL
PsiPred: LLLLHHHHHH HHHHHHHLLL LLLLLLLLLL LHHHHHHHHH HHHHHHLLLL LLLLLLLLLL
   DSSP: --LEHHHHHH HHHHLELL-H HHLLLLLLLL -HHHHHHHHH HHHHHHL--L LL--------
Uniprot: ----HHHHHH HH--EEE--H H-----LLL- -HHHHHHHHH HHHHHH---- ----------

 FASTA : EEEKKTWGTV FKTLKSLYKT HACYEYNHIF PLLEKYCGFH EDNIPQLEDV SQFLQTCTGF
 ReProf: HHHHHHHHHH HHHHHHHLHH HLHHHHHHHH HHHHHHLLLL LLLLLLHHHH HHHHHHLLLL
PsiPred: HHHHHHHHHH HHHHHHHHHL LLHHHHHHHH HHHHHHLLLL LLLLLLHHHH HHHHHHHLLE
   DSSP: HHHHHHHHHH HHHHHHHHHH HE-HHHHHHH HHHHHHH--E LLE---HHHH HHHHHHHHL-
Uniprot: HHHHHHHHHH HHHHHHHHHH ---HHHHHHH HHHHHH---- ------HHHH HHHHHHH---

 FASTA : RLRPVAGLLS SRDFLGGLAF RVFHCTQYIR HGSKPMYTPE PDICHELLGH VPLFSDRSFA
 ReProf: EEEEEELLLL LHHHHHHHHH LHHHHHHEEL LLLLLLLLLL LLHHHHHHLL LLLLLLHHHH
PsiPred: EEEELLLLLL HHHHHHHHHL LEELLLLLLL LLLLLLLLLL LLHHHHHHLL LLLLLLHHHH
   DSSP: EEEE--LE-- HHHHHHHHLL LEEEE----- -LLLLL--LL --HHHHHHHH HHHHLLHHHH
Uniprot: EEE--EE--- HHHHHHHH-- -EEE------ ---------- --HHHHHH-- HHH---HHHH

 FASTA : QFSQEIGLAS LGAPDEYIEK LATIYWFTVE FGLCKQGDSI KAYGAGLLSS FGELQYCLSE
 ReProf: HHHHHHHHHL LLLLHHHHHH HHHHHEEEEE EEEELLLLLL EEEEEEELLL LLHHHHHHLL
PsiPred: HHHHHHHHHL LLLLHHHHHH HHHHHHEEEE EEEEEELLLE EEELLLLLLL HHHHHHHHLL
   DSSP: HHHHHHHHHH LL--HHHHHH HHHHHHHHHH L-EEEELLEE EE--HHHHL- HHHHHHLLLL
Uniprot: HHHHHHHHH- ----HHHHHH HHHHHH-LLL --EEE---EE E---HHH--- HHHHHH--EE

 FASTA : KPKLLPLELE KTAIQNYTVT EFQPLYYVAE SFNDAKEKVR NFAATIPRPF SVRYDPYTQR
 ReProf: LLLLLLLLHH HLLLLLLLLL LLLLHEEHHH HHHHHHHHHH HHHHHLLLLL LLEELLLLLL
PsiPred: LLLLLLLLHH HHHLLLLLLL LLLLLEEEEL LHHHHHHHHH HHHHLLLLLL LLLLLLLLLE
   DSSP: LLEEEE--HH HHLL----LL L--LEEEEEL -HHHHHHHHH HHHHLL--LL -EEEELLLLE
Uniprot: EEEEE---HH H-------EE ---EEEEEE- -HHHHHHHHH HH------EE EEEE-LLL-E

 FASTA : IEVLDNTQQL KILADSINSE IGILCSALQK IK
 ReProf: HHHHLLHHHH HHHHHHHHHH HHHHHHHHHH HL
PsiPred: EEELLLHHHH HHHHHHHHHH HHHHHHHHHH HL
   DSSP: EEE------- ---------- ---------- --
Uniprot: EEE--HHHHH HHHHHHHHHH HHHHHHHHH- --

Again, very similar outcomes can be seen for ReProf and PsiPred on the one hand and for DSSP and UniProt on the other. The later start of the secondary structure for both DSSP and UniProt is conspicuous. However, no two structures are identical. Differences most often result from slightly shifted predictions; for example, in one prediction a helix starts a few amino acids earlier or is a few amino acids longer than in the other.

Other tools are GOR4 and CFSSP. However, when we tried GOR4, its prediction was not that similar to the secondary structures above.

Disorder

Lab journal
Of special interest in protein structure prediction are the so-called disordered regions of a protein chain, which have no fixed spatial structure in the native state. There are two different kinds of disorder:

  • some regions become structured when they bind to other molecules or when the conditions of the biochemical medium change
  • others are always disordered and never become structured

Disordered regions often cause complications in the expression, purification and crystallization of the protein containing them. This is why much attention is paid to their examination and prediction. Most prediction methods identify disorder in proteins through the analysis of the amino acids they contain or of evolutionarily conserved regions. <ref> Michail Yu. Lobanov, Eugeniya I. Furletova, Natalya S. Bogatyreva, Michail A. Roytberg, Oxana V. Galzitskaya (2010): "Library of Disordered Patterns in 3D Protein Structures". PLoS Comput Biol Vol.6:e1000958. doi:10.1371/journal.pcbi.1000958 </ref>

In this part, we want to analyse the disordered regions of our protein (P00439) as well as of the proteins from the secondary structure prediction (P10775, Q9X0E6, Q08209).

IUPred

IUPred predicts disordered regions from the amino acid sequence with an algorithm that estimates the total pairwise inter-residue interaction energy. This is based on the assumption that intrinsically unstructured protein (IUP) sequences do not fold because they cannot form sufficient stabilizing inter-residue interactions. IUPred provides three optional parameters:

For graphical output, one can use the IUPred webserver. Since short and long disorder can only be accessed separately there, we wrote an R script that plots both disorder types in one graphic. The resulting images as well as the images for the globular domains of the proteins are shown in <xr id="iupred"/> below. <figtable id="iupred">

a-1) Long and short disorder comparison of protein P00439 (PAH)
a-2) Globular domain of protein P00439 (PAH)
b-1) Long and short disorder comparison of protein P10775 (RNH1)
b-2) Globular domain of protein P10775 (RNH1)
c-1) Long and short disorder comparison of protein Q9X0E6 (CUTA)
c-2) Globular domain of protein Q9X0E6 (CUTA)
d-1) Long and short disorder comparison of protein Q08209 (PP2BA)
d-2) Globular domain of protein Q08209 (PP2BA)
1) Long and short disorder comparison and 2) globular domain presentations for the four proteins a) P00439, b) P10775, c) Q9X0E6 and d) Q08209.

</figtable> A comparison of the two prediction types, short and long, shows that both are very similar. Usually, the long prediction gives a higher probability at the beginning and end of a sequence than the short one. Otherwise, only slight changes are visible.

For the three proteins P00439 (PAH), P10775 (RNH1) and Q9X0E6 (CUTA), only the long disorder prediction rises above 0.5 at the beginning and end of the sequence, and for all of them the globular region covers the whole protein. Only the protein Q08209 (PP2BA) has disordered regions at the beginning and end of the sequence in both the short and the long disorder prediction. Here, the globular region is located between sequence positions 5 and 446.
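Reading regions above 0.5 off a plot can also be done programmatically. The following Python sketch is not part of our R script; it only shows, under stated assumptions, how per-residue scores could be turned into a list of regions. The function name `regions_above`, the `min_length` parameter and the example scores are hypothetical.

```python
def regions_above(scores, threshold=0.5, min_length=1):
    """Return (start, end) pairs (1-based, inclusive) where score > threshold.

    scores is assumed to hold one disorder score per residue, in sequence order.
    """
    regions = []
    start = None
    for position, score in enumerate(scores, start=1):
        if score > threshold:
            if start is None:
                start = position  # a new region begins here
        elif start is not None:
            if position - start >= min_length:
                regions.append((start, position - 1))
            start = None
    # Close a region that runs until the end of the sequence.
    if start is not None and len(scores) - start + 1 >= min_length:
        regions.append((start, len(scores)))
    return regions


if __name__ == "__main__":
    # Made-up scores for a seven-residue toy sequence.
    toy_scores = [0.9, 0.7, 0.4, 0.2, 0.3, 0.6, 0.8]
    print(regions_above(toy_scores))
```

Applied separately to the long and the short scores, such a list makes it easy to see whether both prediction types flag the same termini.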

MD (MetaDisorder)

MetaDisorder is a prediction tool that includes the following prediction methods:

MetaDisorder uses neural networks to combine the output of other methods with sequence profiles and other sequence features. <ref name="metadisorder"/>

For the comparison of the different methods, we used the output file sequence.mdisorder, because it contains all relevant results. The images in <xr id="metadisorder"/> were generated with a small R script.

<figtable id="metadisorder">

a) Comparison of the MetaDisorder, PROFbval, NORSnet and Ucon result for the protein P00439 (PAH)
b) Comparison of the MetaDisorder, PROFbval, NORSnet and Ucon result for the protein P10775 (RNH1)
c) Comparison of the MetaDisorder, PROFbval, NORSnet and Ucon result for the protein Q9X0E6 (CUTA)
d) Comparison of the MetaDisorder, PROFbval, NORSnet and Ucon result for the protein Q08209 (PP2BA)
Comparison of the MetaDisorder, PROFbval, NORSnet and Ucon result for a)P00439, b)P10775, c)Q9X0E6 and d)Q08209.

</figtable> The output of MetaDisorder (green line) is very similar to the IUPred results. In contrast, PROFbval (blue line) fluctuates strongly and hardly agrees with the predictions of IUPred. This is understandable, since this method is not designed specifically for disordered regions, but mainly for identifying flexible and rigid ones. NORSnet (violetred line) also fluctuates strongly, but predicts lower tendencies for the disordered regions. This is why it very often does not predict any disordered regions and hardly overlaps with the IUPred results. Ucon (orange line) predicts disordered regions a bit better than NORSnet and PROFbval, but not as well as MetaDisorder itself. So, of these four methods, MetaDisorder is the best for the prediction of disordered regions.

DisProt

DisProt is a database of disordered proteins which connects the structure and function of intrinsically disordered proteins (IDPs). Under physiological conditions these proteins form no fixed three-dimensional structure, either in whole or in regions. An IDP is defined as a protein containing at least one disordered region. A big problem of IDP research has been the lack of organized information, and DisProt was designed to address it. To this end, knowledge about the experimental characterization and the functional associations of IDPs has been collected and structured. Thus, DisProt opens doors for a variety of bioinformatic studies. <ref name="disprot"> Megan Sickmeier, Justin A. Hamilton, Tanguy LeGall, Vladimir Vacic, Marc S. Cortese, Agnes Tantos, Beata Szabo, Peter Tompa, Jake Chen, Vladimir N. Uversky, Zoran Obradovic and A. Keith Dunker (2007): "DisProt: the Database of Disordered Proteins". Nucleic Acids Research Vol.35, Database issue: D786–D793. doi:10.1093/nar/gkl893 </ref>

We could not find exact matches in DisProt for our protein or for two of the other proteins, so we used the following best hits obtained with the Sequence Search and the Smith-Waterman search algorithm:

The PSI-BLAST search algorithm gave the same best hits, except for the CUTA protein. There, the E-value of the Smith-Waterman search was better than that of PSI-BLAST, so we used the Smith-Waterman hit.

The only protein with a match in DisProt was Q08209 (PPP3CA).

On the images in <xr id="disprot"/> below, one can see the regions of order and disorder of a given sequence.



<figtable id="disprot">

a) Map of ordered and disordered regions from DisProt for the best sequence hit of P00439 (Tyrosine 3-monooxygenase). The disordered region spans sequence positions 1-155.
b) Map of ordered and disordered regions from DisProt for the best sequence hit of P10775 (NALP1). The disordered region spans sequence positions 31-50.
c) Map of ordered and disordered regions from DisProt for the best sequence hit of Q9X0E6 (Uncharacterized protein). The disordered region covers the whole protein sequence.
d) Map of ordered and disordered regions from DisProt for the protein Q08209. Disordered regions are located at sequence positions 1-13, 374-468, 390-414, 469-486 and 487-521, and the ordered region at positions 14-373.
Map of ordered and disordered regions from DisProt.

</figtable> As we were not able to find exact matches for three of the proteins, we cannot compare their DisProt results to the ones we got from IUPred and MetaDisorder. Q08209 is the only protein with an exact match, for which five different disordered regions were found. This result overlaps with the IUPred prediction, as one can see in <xr id="iupred"/> d): the disordered regions are located at the beginning and the end of the sequence. The same applies to the MetaDisorder prediction (green) in <xr id="metadisorder"/> d). There, the Ucon (orange) and NORSnet (violetred) predictions are very close, too, but they also predict a slight disorder region in the middle of the sequence. The only tool that does not overlap with the results of the other methods is PROFbval.
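Some of the DisProt regions listed for Q08209 overlap or directly follow each other. A short Python sketch, with the invented helper names `merge_regions` and `covered_positions`, shows how such a list can be merged to see which part of the sequence is annotated as disordered. The region list is copied from the caption of panel d) above; treating directly adjacent regions as one stretch is a choice made for this example.

```python
def merge_regions(regions):
    """Merge overlapping or directly adjacent (start, end) regions."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_positions(regions):
    """Number of sequence positions covered by at least one region."""
    return sum(end - start + 1 for start, end in merge_regions(regions))


if __name__ == "__main__":
    # Disordered regions of Q08209 as listed in the DisProt figure caption.
    disprot_q08209 = [(1, 13), (374, 468), (390, 414), (469, 486), (487, 521)]
    print(merge_regions(disprot_q08209))      # [(1, 13), (374, 521)]
    print(covered_positions(disprot_q08209))  # 161
```

After merging, the five entries collapse into one N-terminal and one C-terminal stretch, which is the pattern the comparison with IUPred and MetaDisorder refers to.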

A list of different disorder prediction methods with short definitions can be found on this wiki page or in the paper "A Practical Overview of Protein Disorder Prediction Methods". Some example tools are:

Transmembrane helices

Lab journal
A transmembrane (TM) helix is defined as a membrane-spanning domain with a hydrogen-bonded helical configuration <ref>http://www.uniprot.org Transmembrane helix definition on UniProt, retrieved May 18, 2013</ref>. Alpha-helical TM proteins are involved in a wide range of important biological processes, including the transport of membrane-impermeable molecules and cellular functions like cell signaling, recognition and adhesion. Many of them are also prime drug targets; it is estimated that over half of the drugs on the market target membrane proteins. A big problem in the prediction of transmembrane proteins is the discrimination between TM helices and other features that are usually composed of hydrophobic residues. These include targeting motifs like signal peptides, amphipathic helices and re-entrant helices, which enter and exit the membrane on the same side. The high similarity between such features and the hydrophobic profile of a TM helix leads in many cases to crosstalk between the different types of predictions. If these features are predicted as TM helices, the subsequent topology prediction can be corrupted. TM proteins are also difficult to characterize experimentally and are therefore strongly under-represented in structural databases. However, sequence-based methods allow the investigation of TM protein topology in the absence of structural data. <ref name="memsat"> Timothy Nugent and David T Jones (2009): "Transmembrane protein topology prediction using support vector machines". BMC Bioinformatics Vol.10:159. doi:10.1186/1471-2105-10-159 </ref>

Here, we predicted transmembrane helices with the sequence-based tools PolyPhobius and MEMSAT-SVM for the following proteins:

Then we compared our results with the ones generated with OPM and PDBTM.

PolyPhobius

PolyPhobius is an enhanced version of Phobius and uses hidden Markov models (HMMs) for the prediction of transmembrane topology and signal peptides. Homology information is used to increase the accuracy of the prediction. The method depends on a high-quality global multiple sequence alignment (MSA). <ref name="polyphobius"> Lukas Käll, Anders Krogh and Erik L. L. Sonnhammer (2005): "An HMM posterior decoder for sequence feature prediction that includes homology information". Bioinformatics Vol.21:i251–i257. doi:10.1093/bioinformatics/bti1014 </ref>

Both the prediction and a graphical output are available on the PolyPhobius webserver. The graphical outputs for the four proteins are shown in <xr id="polyphobius"/> below.

<figtable id="polyphobius">

a) PolyPhobius webserver result for P00439 (PAH)
b) PolyPhobius webserver result for P35462 (DRD3)
c) PolyPhobius webserver result for Q9YDF8 (KVAP)
d) PolyPhobius webserver result for P47863 (AQP4)

Graphical results from the PolyPhobius webserver for the proteins a)P00439 (PAH), b)P35462 (DRD3), c)Q9YDF8 (KVAP) and d)P47863 (AQP4). The grey regions show transmembrane (TM) helices. </figtable> PolyPhobius predicts no transmembrane (TM) helices for our protein P00439 (PAH), six for the proteins P35462 (DRD3) and P47863 (AQP4) and seven for Q9YDF8 (KVAP). The whole protein P00439 is predicted as one non-cytoplasmic region, whereas P35462 (DRD3) has three cytoplasmic and four non-cytoplasmic regions, P47863 (AQP4) has four cytoplasmic and four non-cytoplasmic regions, and Q9YDF8 (KVAP) has four cytoplasmic and three non-cytoplasmic regions. No signal peptide was predicted for any of the proteins.

MEMSAT-SVM

MEMSAT-SVM is a revised version of MEMSAT-3 and uses support vector machines (SVMs) for the prediction of transmembrane protein topology. The prediction of signal peptides and re-entrant helices is integrated in this method. It can also discriminate effectively between transmembrane and globular proteins (extremely low false positive and false negative rates). MEMSAT-SVM showed a better accuracy than MEMSAT-3. <ref name="memsat"/>

Since MEMSAT-SVM is currently not running on the biolab servers, we used the MEMSAT-SVM webserver with the FASTA sequences of the proteins named above. The downloaded images of the MEMSAT-SVM predictions are shown in <xr id="memsat-svm"/>.

<figtable id="memsat-svm">

a) Transmembrane helices result from the MEMSAT-SVM webserver for P00439 (PAH)
b) Transmembrane helices result from the MEMSAT-SVM webserver for P35462 (DRD3)
c) Transmembrane helices result from the MEMSAT-SVM webserver for Q9YDF8 (KVAP)
d) Transmembrane helices result from the MEMSAT-SVM webserver for P47863 (AQP4)

Graphical results from the MEMSAT-SVM webserver for the proteins a)P00439 (PAH), b)P35462 (DRD3), c)Q9YDF8 (KVAP) and d)P47863 (AQP4). The yellow squares display Transmembrane (TM) helices. </figtable> MEMSAT-SVM predicts one TM helix for P00439 (PAH) and six for the other proteins P35462 (DRD3), Q9YDF8 (KVAP) and P47863 (AQP4).

OPM

The Orientations of Proteins in Membranes (OPM) database is a collection of transmembrane, monotopic and peripheral proteins from the Protein Data Bank (PDB). For it, a computational approach to calculate the spatial arrangement of protein structures in lipid bilayers was developed and compared with experimental data. In this database one can analyse, sort and search membrane proteins based on different properties. The coordinate files with the calculated membrane boundaries can be downloaded. <ref name="opm"> Mikhail A. Lomize, Andrei L. Lomize, Irina D. Pogozheva and Henry I. Mosberg (2006): "OPM: Orientations of Proteins in Membranes database". Bioinformatics Vol.22:623–625. doi:10.1093/bioinformatics/btk023 </ref>

The results can be found on the following links:

  • We could not find any entry for the protein P00439 (PAH)
  • 3PBL for the protein P35462 (DRD3)
  • 1ORQ and 1ORS for the protein Q9YDF8 (KVAP)
  • 2D57 for the protein P47863 (AQP4)

PDBTM

The PDBTM is a database of transmembrane proteins with known structures. It collects all transmembrane proteins found in the Protein Data Bank (PDB) and determines their membrane-spanning regions. These calculations are based on the TMDET algorithm, which uses only structural information to calculate the most probable location of the lipid bilayer. Like MEMSAT-SVM, it can also distinguish between transmembrane and globular proteins. The database is updated every week to keep it synchronized with the PDB entries. <ref name="pdbtm"> Gabor E. Tusnady, Zsuzsanna Dosztanyi and Istvan Simon (2005): "PDB_TM: selection and membrane localization of transmembrane proteins in the protein data bank". Bioinformatics Vol.20:2964-2972; Nucleic Acids Research Vol.33, Database issue D275–D278. doi:10.1093/nar/gki002 </ref>

The results can be found on the sites below:

  • We could not find any entry for the protein P00439 (PAH)
  • 3PBL for the protein P35462 (DRD3)
  • 1ORQ and 1ORS for the protein Q9YDF8 (KVAP)
  • 2D57 for the protein P47863 (AQP4)

Results and Discussion

As <xr id="TM_PAH"/> shows, our protein P00439 (PAH) does not include any transmembrane helix in its structure. Only MEMSAT-SVM predicts one, at positions 287-302. The PolyPhobius graphical output in <xr id="polyphobius"/> shows a slight peak in this part of the protein, too. However, it is weak (probability under 0.1), so most prediction methods will not predict a transmembrane region here. <figtable id="TM_PAH">

"Transmembrane(TM) helices of protein P00439 (PAH)"
TM PolyPhobius MEMSAT-SVM OPM PDBTM Uniprot
1 - 287-302 - - -
Number and position of the predicted transmembrane helices of PolyPhobius, MEMSAT-SVM, OPM and PDBTM for the protein P00439 (PAH).

</figtable> Since our protein does not have any transmembrane helix, we examine the other proteins in more detail.

First, we look at the results for the protein P35462 (DRD3) shown in <xr id="TM_P35462"/>. Here, all methods except MEMSAT-SVM report seven TM helices. The positions of the TM helices differ only slightly between the methods and always lie in the same range.

<figtable id="TM_P35462">

"Transmembrane(TM) helices of protein P35462 (DRD3)"
TM PolyPhobius MEMSAT-SVM OPM PDBTM Uniprot
1 30-55 32-55 34-52 35-52 33–55
2 66-88 65-88 67-91 68-84 66–88
3 105-126 101-129 101-126 109-123 105–126
4 150-170 151-169 150-170 152-166 150–170
5 188-212 188-209 187-209 191-206 188–212
6 329-352 331-354 330-351 334-347 330–351
7 367-386 - 363-386 368-382 367–388
Number and position of the predicted transmembrane helices of PolyPhobius, MEMSAT-SVM, OPM and PDBTM for the protein P35462 (DRD3).

</figtable> For the protein Q9YDF8 (KVAP), choosing a PDB ID was a big problem, since none of the PDB entries listed in UniProt gave good results in OPM and PDBTM. So, we decided to use the two PDB IDs 1ORS and 1ORQ. Together they cover approximately the area of the predicted TM helices, but they differ from the PolyPhobius and MEMSAT-SVM results. In most cases they lie at lower positions than the predicted TM helices. A reason for this could be that the protein comes from an archaeon. If we compare the predictions in <xr id="TM_Q9YDF8"/> with the TM helix positions in the UniProt entry, the predicted positions always lie in the range of the UniProt TM helices. Therefore, although OPM and PDBTM do not match the prediction results, we suggest that PolyPhobius and MEMSAT-SVM predict the TM helices very well.

<figtable id="TM_Q9YDF8">

"Transmembrane(TM) helices of protein Q9YDF8 (KVAP)"
TM PolyPhobius MEMSAT-SVM OPM (1ORS) OPM (1ORQ) PDBTM (1ORS) PDBTM (1ORQ) Uniprot
1 42-60 43-59 25-46 - 27-50 21-52 39–63
2 68-88 72-90 55-78 - 55-75 57-80 68–92
3 - - 86-97 - 88-107 - -
4 108-129 101-118 100-107 - 118-142 - 109–125
5 137-157 128-143 117-148 - - - 129-145
6 163-184 163-184 - 153-172 - 151-171 160-184
7 196-213 - - 183-195 - - -
8 224-244 221-245 - 207-225 - 209-236 222-253
Number and position of the predicted transmembrane helices of PolyPhobius, MEMSAT-SVM, OPM and PDBTM for the protein Q9YDF8 (KVAP).

</figtable>

For the last of the four proteins, P47863 (AQP4), OPM contains eight TM helices, whereas PDBTM contains six, the same number that PolyPhobius and MEMSAT-SVM predict. Here we used the PDB ID 2D57 (X-ray), because it has a higher resolution than 2ZZ9 (X-ray). The PDB ID 3IYZ (electron microscopy) has very similar TM helices in PDBTM, shifted by about one or two positions, so we used only 2D57 for our comparison. <xr id="TM_P47863"/> shows that the positions of all TM helices overlap between the different methods with minimal changes, as for the proteins discussed above. The only differences are the third and seventh TM helix in the OPM database, which none of the other methods predicted or included. In PDBTM these two regions are displayed as loops.

<figtable id="TM_P47863">

"Transmembrane(TM) helices of protein P47863 (AQP4)"
TM PolyPhobius MEMSAT-SVM OPM PDBTM Uniprot
1 34-58 35-56 34-56 39-55 37–57
2 70-91 71-89 70-88 72-89 65–85
3 - - 98-107 - -
4 115-136 113-136 112-136 116-133 116-136
5 156-177 157-178 156-178 158-177 156-176
6 188-208 190-205 189-203 188-205 185-205
7 - - 214-223 - -
8 231-252 232-252 231-252 231-248 232-252
Number and position of the predicted transmembrane helices of PolyPhobius, MEMSAT-SVM, OPM and PDBTM for the protein P47863 (AQP4).

</figtable>

Comparing all results, all methods used place the TM helices very close to each other, so it is very difficult to decide which method is best for transmembrane helix prediction.
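How close two sets of TM helices are can also be checked with a few lines of Python. This is only a sketch: the function names `overlap` and `matched_helices` are invented, and the rule that a helix counts as found when it shares at least five positions with a reference helix (`min_overlap=5`) is an arbitrary assumption for illustration. The positions are taken from the DRD3 table above.

```python
def overlap(a, b):
    """Number of positions shared by two (start, end) segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def matched_helices(predicted, reference, min_overlap=5):
    """Pair each reference helix with the best overlapping predicted helix.

    Assumption: a reference helix counts as found if the best prediction
    shares at least min_overlap positions with it.
    """
    pairs = []
    for ref in reference:
        best = max(predicted, key=lambda helix: overlap(helix, ref), default=None)
        if best is not None and overlap(best, ref) >= min_overlap:
            pairs.append((ref, best))
    return pairs


if __name__ == "__main__":
    # TM helix positions of P35462 (DRD3) from the table above.
    uniprot = [(33, 55), (66, 88), (105, 126), (150, 170),
               (188, 212), (330, 351), (367, 388)]
    polyphobius = [(30, 55), (66, 88), (105, 126), (150, 170),
                   (188, 212), (329, 352), (367, 386)]
    memsat_svm = [(32, 55), (65, 88), (101, 129), (151, 169),
                  (188, 209), (331, 354)]
    print(len(matched_helices(polyphobius, uniprot)))  # 7
    print(len(matched_helices(memsat_svm, uniprot)))   # 6
```

With these numbers, PolyPhobius finds all seven UniProt helices of DRD3, while MEMSAT-SVM misses only the last one, in line with the table.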

A list of different prediction methods for transmembrane helices is shown here. Some examples are:

and many more can be found!

Signal peptides

Lab journal
Signal peptides are sequences of amino acids in a protein that determine the pathway of the protein in the cell to its destination.
Version 3.0 of SignalP is installed on the server, whereas the SignalP web server runs version 4.1. Only the FASTA sequences are passed as input; the parameters are left unchanged. The output <ref name="output_signalp"> http://www.cbs.dtu.dk/services/SignalP: Description of the output format of SignalP, retrieved May 20, 2013</ref> is given as:

  • SignalP-NN (neural network) reports the maximal values of the C-score (raw cleavage site score), S-score (signal peptide score) and Y-score (combined cleavage site score). Additionally, the mean S (average S-score of the possible signal peptide) and the D-score (discrimination score), a weighted average of the mean S and the max. Y scores, are reported. The D-score discriminates signal peptides from non-signal peptides.
  • SignalP-HMM returns the posterior probabilities for cleavage site (C) and signal peptide (S) for each position in the input sequences. The signal peptide probability is divided into three region probabilities (n, h, c). The maximal cleavage site probability is reported with its position. For eukaryotes, the probability of a signal anchor is also reported.
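The D-score described above can be written out as a short calculation. This is only a sketch of a weighted average: the weight of 0.5, the cutoff and the example scores are assumptions chosen for illustration and are not taken from the SignalP documentation.

```python
def d_score(mean_s, max_y, weight=0.5):
    """Weighted average of the mean S-score and the maximal Y-score.

    weight is the share given to the mean S-score; 0.5 is an assumption
    made for this sketch, not a documented SignalP value.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    return weight * mean_s + (1.0 - weight) * max_y


def has_signal_peptide(mean_s, max_y, cutoff, weight=0.5):
    """Call a signal peptide when the D-score reaches a user-supplied cutoff."""
    return d_score(mean_s, max_y, weight) >= cutoff


# Hypothetical scores, chosen only to illustrate the arithmetic.
print(d_score(mean_s=0.9, max_y=0.7))
print(has_signal_peptide(mean_s=0.9, max_y=0.7, cutoff=0.5))
```

The cutoff is passed in explicitly because the appropriate value depends on the SignalP version and organism group.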

<figtable id="signalp">

UniProt ID | Description | SignalP-NN (3.0) | SignalP-HMM (3.0) | Max cleavage site (3.0) | Prediction (SignalP 3.0) | Prediction (SignalP 4.1) | SignalP (website)
P00439 | PAH (Homo sapiens) | 5 x NO | 0.0 signal peptide, 0.0 signal anchor | 0.000 | non-secretory protein | no signal peptide | not in database
P02768 | ALBU (Homo sapiens) | 5 x YES | 1.0 signal peptide, 0.0 signal anchor | 0.785 | signal peptide | signal peptide | confirmed [1]
P11279 | LAMP1 (Homo sapiens) | 5 x YES | 1.0 signal peptide, 0.0 signal anchor | 0.847 | signal peptide | signal peptide | confirmed [2]
P47863 | AQP4 (Rattus norvegicus) | 4 x NO, 1 x YES (max. S) | 0.526 signal peptide, 0.457 signal anchor | 0.388 | signal peptide | no signal peptide | not in database

Output of SignalP for the four proteins. The SignalP-NN column shows how many of the calculated values are above the threshold for a signal peptide, whereas the SignalP-HMM column gives the probability that the protein has a signal peptide, together with the probability of a signal anchor. Here 0.0 means a probability of 0% and 1.0 stands for 100%. The next column indicates the highest cleavage site probability, followed by the predictions of SignalP version 3.0 and version 4.1 and, finally, whether the prediction is confirmed by the database. </figtable>

<xr id="signalp"/> shows that the two versions of SignalP make different predictions for protein P47863, whereas the predictions for the other proteins agree. Furthermore, for P47863 the probability of a signal peptide is only about 53% and the maximum cleavage site probability is only 39%. In Uniprot the protein is classified as a multi-pass membrane protein, which may be why the N-terminus is identified as a signal peptide. Our protein P00439 has no signal peptide, which is predicted correctly, whereas P02768 and P11279 both show a signal peptide at the N-terminal region; all three predictions have high confidence. The cleavage sites predicted with version 4.1 coincide with the confirmed results on the web server. Regarding transmembrane helices on the web server, none is predicted for P02768, whereas for P11279 a potential transmembrane region between positions 383 and 405 is found.

Other prediction tools are:

  • TatP (predicts presence and location of Twin-arginine signal peptide cleavage sites in bacteria)
  • Phobius (predicts both transmembrane topology and signal peptide)
  • PrediSi (predicts signal peptides)

We tested PrediSi with our proteins and obtained almost the same results as with SignalP version 3: a signal peptide was predicted for P47863, while the predictions for the other three proteins are correct. Phobius, however, which also predicts transmembrane topology, shows the same results as SignalP version 4.

GO terms

In this part we explore two different GO annotation prediction tools. These tools try to predict function and other features from the sequence.

GOPET

The first tool is GOPET. In a GOPET run the confidence threshold can be chosen. By default it is 60%, but we also tried a threshold of 50%. However, this yielded only four more predictions, none of which is found at QuickGO for this protein, so they are not listed. All other inputs were left at their defaults. The resulting predictions can be found in <xr id="gopet"/>.

<figtable id="gopet">

GOPET prediction for UniprotID: P00439
GOid Aspect Confidence GO term
GO:0003824 F 94% catalytic activity
GO:0016491 F 88% oxidoreductase activity
GO:0004497 F 87% monooxygenase activity
GO:0004505 F 84% phenylalanine 4-monooxygenase activity
*GO:0004510 F 80% tryptophan 5-monooxygenase activity
*GO:0004511 F 79% tyrosine 3-monooxygenase activity
GO:0046872 F 78% metal ion binding
GO:0005506 F 78% iron ion binding
*GO:0008199 F 72% ferric iron binding
*GO:0008198 F 72% ferrous iron binding
GO:0016597 F 71% amino acid binding

GOs predicted with GOPET. The first column represents the GO-IDs, the second their aspect, where F stands for molecular function, the third column indicates the confidence value of the prediction of the particular GO and in the last column a short description of the GO can be found. </figtable>

Altogether eleven terms were found, and apart from the four GOs marked with *, all were also found at QuickGO (see also GO-terms). As GOPET is a prediction tool for molecular functions, all results have the aspect F. The three predicted GOs with the highest confidence are the most general and should be correct. The other eight also have fairly good confidence values. Notably, the four GOs not found in QuickGO are very similar to some of the others: two describe monooxygenase activity and the other two iron binding. It seems that GOPET cannot distinguish similar terms very well, but it is nevertheless a good prediction tool for a first annotation.
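The effect of the confidence threshold can be illustrated with the values from the GOPET table. The following sketch is not part of GOPET; it simply filters the eleven predictions by a chosen threshold and reports which fraction of the remaining terms was also found at QuickGO. The names `Prediction`, `above_threshold` and `confirmed_fraction` are hypothetical, and treating the threshold as inclusive is an assumption.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    go_id: str
    confidence: int   # confidence in percent, as reported by GOPET
    term: str
    in_quickgo: bool  # False for the terms marked with * in the table


# Values copied from the GOPET table for P00439.
PREDICTIONS = [
    Prediction("GO:0003824", 94, "catalytic activity", True),
    Prediction("GO:0016491", 88, "oxidoreductase activity", True),
    Prediction("GO:0004497", 87, "monooxygenase activity", True),
    Prediction("GO:0004505", 84, "phenylalanine 4-monooxygenase activity", True),
    Prediction("GO:0004510", 80, "tryptophan 5-monooxygenase activity", False),
    Prediction("GO:0004511", 79, "tyrosine 3-monooxygenase activity", False),
    Prediction("GO:0046872", 78, "metal ion binding", True),
    Prediction("GO:0005506", 78, "iron ion binding", True),
    Prediction("GO:0008199", 72, "ferric iron binding", False),
    Prediction("GO:0008198", 72, "ferrous iron binding", False),
    Prediction("GO:0016597", 71, "amino acid binding", True),
]


def above_threshold(predictions, threshold):
    """Keep predictions whose confidence is at least the threshold (inclusive by assumption)."""
    return [p for p in predictions if p.confidence >= threshold]


def confirmed_fraction(predictions):
    """Fraction of predictions also found at QuickGO (0.0 for an empty list)."""
    if not predictions:
        return 0.0
    return sum(p.in_quickgo for p in predictions) / len(predictions)


for threshold in (60, 80):
    kept = above_threshold(PREDICTIONS, threshold)
    print(f"threshold {threshold}%: {len(kept)} terms, {confirmed_fraction(kept):.2f} found at QuickGO")
```

Raising the threshold removes some of the unconfirmed terms, but the 80% entry for tryptophan 5-monooxygenase activity shows that a high confidence value alone does not separate the similar terms.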

ProtFun

The second tool, ProtFun, reports four different classes/categories for a protein: functional category, enzyme or non-enzyme, enzyme class with EC number, and gene ontology category. Furthermore, each predicted term has two scores. The first indicates the estimated probability that the entry belongs to the predicted class, whereas the second reports the odds that the sequence belongs to the class or category. <xr id="protfun"/> presents the result for our protein PAH. The entries with significant values are marked in green. <figtable id="protfun">

ProtFun prediction for UniprotID: P00439
Functional category                  Prob     Odds
Amino_acid_biosynthesis              0.210    9.530
Biosynthesis_of_cofactors            0.229    3.180
Cell_envelope                        0.034    0.563
Cellular_processes                   0.063    0.867
Central_intermediary_metabolism      0.061    0.970
Energy_metabolism                    0.343    3.815
Fatty_acid_metabolism                0.025    1.889
Purines_and_pyrimidines              0.392    1.615
Regulatory_functions                 0.020    0.125
Replication_and_transcription        0.118    0.438
Translation                          0.204    4.630
Transport_and_binding                0.024    0.060
Enzyme/nonenzyme     Prob     Odds
Enzyme               0.724    2.527
Nonenzyme            0.276    0.387
Enzyme class                     Prob     Odds
Oxidoreductase (EC 1.-.-.-)      0.154    0.738
Transferase    (EC 2.-.-.-)      0.271    0.785
Hydrolase      (EC 3.-.-.-)      0.083    0.261
Lyase          (EC 4.-.-.-)      0.047    1.002
Isomerase      (EC 5.-.-.-)      0.100    3.138
Ligase         (EC 6.-.-.-)      0.019    0.370
Gene Ontology category      Prob     Odds
Signal_transducer           0.075    0.350
Receptor                    0.003    0.016
Hormone                     0.001    0.206
Structural_protein          0.005    0.166
Transporter                 0.025    0.229
Ion_channel                 0.010    0.168
Voltage-gated_ion_channel   0.005    0.232
Cation_channel              0.010    0.215
Transcription               0.043    0.334
Transcription_regulation    0.032    0.255
Stress_response             0.010    0.118
Immune_response             0.012    0.140
Growth_factor               0.006    0.407
Metal_ion_transport         0.009    0.020

GO annotation predicted with ProtFun. The first cell reports the functional categories, the second indicates whether the sequence is predicted to be an enzyme, and the third displays the possible enzyme classes. The final cell lists the GO categories for the sequence. Each prediction is given with its probability and odds. </figtable> The most significant functional category is amino acid biosynthesis, whereas the other categories are not significant for PAH, which is correct. Furthermore, the protein is correctly identified as an enzyme; however, the enzyme class is mistakenly detected as isomerase instead of oxidoreductase, which only reached the fourth-best odds score. Possibly an isomerase is involved in amino acid biosynthesis, so the prediction may not be completely wrong. The correct EC number of PAH is EC 1.14.16.1. Remarkably, there is no significant score for any of the predicted gene ontology categories. Altogether, the prediction mostly seems reasonable. Additionally, the individual features used by ProtFun 2.2 can be viewed if selected.
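The ranking of the enzyme classes can be reproduced from the table values. The sketch below assumes that the reported class is the one with the highest odds; this is an interpretation of the output above rather than a description of ProtFun's internals, and the function name `rank_by_odds` is hypothetical.

```python
# (probability, odds) per enzyme class, copied from the ProtFun table for P00439.
ENZYME_CLASSES = {
    "Oxidoreductase": (0.154, 0.738),
    "Transferase": (0.271, 0.785),
    "Hydrolase": (0.083, 0.261),
    "Lyase": (0.047, 1.002),
    "Isomerase": (0.100, 3.138),
    "Ligase": (0.019, 0.370),
}


def rank_by_odds(classes):
    """Return the class names sorted by descending odds."""
    return sorted(classes, key=lambda name: classes[name][1], reverse=True)


ranking = rank_by_odds(ENZYME_CLASSES)
print("top class by odds:", ranking[0])
print("rank of Oxidoreductase:", ranking.index("Oxidoreductase") + 1)
```

Sorting by probability instead of odds would give a different order, with transferase first, which is why it matters which of the two scores is used for the comparison.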

Another tool to predict the function of a protein is, for example, PFP <ref name="pfp"> Troy Hawkins, Meghana Chitale, Stanislav Luban, and Daisuke Kihara (2009): "PFP: Automated prediction of gene ontology functional annotations with confidence scores using protein sequence data". Proteins Vol.74:566-582. doi:10.1002/prot.22172. </ref>. The results can be seen on PFP(PAH).

Pfam

Pfam is a database of protein families.
Our protein (Pfam P00439) belongs to the ACT and the Biopterin_H (biopterin-dependent aromatic amino acid hydroxylases) families:

  • ACT domains are protein domains associated with metabolism, as they are often linked to metabolic enzymes. In PAH the ACT domain is located at the N-terminus of the protein.
  • Biopterin_H represents a family of aromatic amino acid hydroxylases. All members have a rate-limiting influence on important metabolic pathways. They are regulated by phosphorylation at serines in their N-termini, and it is believed that they include a conserved C-terminal catalytic domain and an unrelated N-terminal regulatory domain. Deficiencies in Biopterin_H cause PKU.

When you run a sequence search, the results are nearly the same. Both domains are found with high posterior probability and are therefore correctly designated as significant. Furthermore, for the ACT family the clan CL0070 is named. The E-value is 3.7e-08 for ACT and as low as 1.8e-192 for Biopterin_H. No active site is predicted for either domain. We also tried including Pfam-B families, but none were found.

Another protein family search tool is DomainSweep, which scans protein family databases. For our protein sequence it predicts aromatic amino acid hydroxylase, phenylalanine-4-hydroxylase and the ACT domain as significant, so it seems to give correct predictions. In the output of DomainSweep all results of the different databases (BLOCKS, CDD, CATH, PFAMA, PRINTS, PRODOM, PROSITE, SMART, SUPERFAMILY, TIGRFAMS) can be viewed.

References

<references/>