Difference between revisions of "Sequence-based predictions (Phenylketonuria)"

From Bioinformatikpedia

Revision as of 18:23, 20 May 2013

Page is still under construction!!!

Summary

Sequence-based prediction approaches can predict a variety of structural and functional properties of proteins. Here, we applied several methods to the sequence of our protein, phenylalanine hydroxylase (PAH, P00439), and in some cases also to other given proteins (listed in brackets):

  • ReProf for secondary structure prediction (P10775, Q9X0E6, Q08209)
  • IUPred and MD (MetaDisorder) for disorder prediction (P10775, Q9X0E6, Q08209)
  • PolyPhobius and MEMSAT-SVM to predict transmembrane helices (P35462, Q9YDF8, P47863)
  • SignalP to predict signal peptides (P02768, P47863, P11279)
  • GOPET and ProtFun2.0 to predict GO terms
  • Pfam with a sequence search to find out more about the Pfam family of our protein

The results are presented and discussed in detail below.

Secondary structure

Secondary structure of a protein is ... (alpha-helices, beta-strand, loops...).

Because the secondary structure of a protein is not easy to observe directly, several methods exist to predict or assign it:

  • ReProf is already installed in the student lab. An example call is shown here:
 reprof -i swiss_matrix_P00439.pssm
  • For the secondary structure prediction with PsiPred v3.3, we used the PSIPRED server.
  • DSSP requires PDB files; we used the DSSP server to create the DSSP files.

There is more than one PDB ID for the UniProt IDs, and the PDB sequences are not completely identical to the UniProt sequences. For example, 1PAH only covers residues 117 to 424. Nevertheless, we tried to choose the most similar structure and aligned it by hand. Positions for which no secondary structure is available are marked with a '-'. Furthermore, the different secondary structure alphabets are mapped to a common format (Table ...), so that all outputs use the same letters as ReProf: E, H and L. For this purpose we wrote a program that filters the secondary structure from the ReProf, PsiPred and DSSP outputs: filter_seqStruc.pl

"Secondary Structure"
Type | ReProf | PsiPred | DSSP
Helix (alpha) | H | H | G, H, I
Extended strand (beta) | E | E | B, E
Loops/Turns | L | C | S, T
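
The mapping in the table can be sketched in a few lines of Python. This is only an illustration, not the filter_seqStruc.pl program mentioned above (which is not shown here); the names to_three_state, DSSP_TO_THREE_STATE and PSIPRED_TO_THREE_STATE are hypothetical, and mapping every unlisted symbol to '-' is an assumption.

```python
# Hypothetical helper; the authors' filter_seqStruc.pl is not reproduced here.
# The letter mappings are taken from the table; any symbol that is not listed
# (for example a blank DSSP state) is written as '-' by assumption.
DSSP_TO_THREE_STATE = {
    "G": "H", "H": "H", "I": "H",   # helices
    "B": "E", "E": "E",             # extended strands
    "S": "L", "T": "L",             # loops and turns
}
PSIPRED_TO_THREE_STATE = {"H": "H", "E": "E", "C": "L"}


def to_three_state(states, mapping):
    """Translate a per-residue state string into the ReProf letters H, E, L."""
    return "".join(mapping.get(state, "-") for state in states)


if __name__ == "__main__":
    print(to_three_state("GGHHIBETS ", DSSP_TO_THREE_STATE))
    print(to_three_state("CCHHEEC", PSIPRED_TO_THREE_STATE))
```

Keeping the mappings in plain dictionaries makes it easy to add another tool's alphabet without touching the conversion function.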


P10775 (RNH1)

ReProf can use either a PSSM matrix or a FASTA sequence as input to predict secondary structure. We used three different inputs and compared them: first the FASTA sequence itself, then a PSSM generated by PSI-BLAST against the big80 database, and finally a PSSM generated against the SwissProt database. In the following tables, the structures predicted by ReProf are compared against the DSSP assignment and the PsiPred prediction, and additionally against the structure recorded in UniProt. A java script (...) was written which counts the number of matches between two sequences and the total number of residues. From these counts, the sensitivity for each letter and in total is calculated and given in %.

sensitivity = number of matches / number of residues

As there are positions which have no prediction in DSSP or Uniprot (indicated by '-'), those positions are ignored.
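
The calculation described above could look roughly like the following Python sketch. It is not the authors' script; the function name sensitivity is hypothetical, and skipping a position when either of the two aligned strings contains '-' is an assumption about how gaps are handled.

```python
from collections import Counter


def sensitivity(reference, prediction, gap="-"):
    """Return the percentage of matching positions per reference letter and in total.

    Both strings must be aligned and of equal length. Positions where either
    string contains the gap symbol are skipped (an assumption of this sketch).
    """
    if len(reference) != len(prediction):
        raise ValueError("sequences must be aligned to the same length")

    residues = Counter()  # compared positions per reference letter
    matches = Counter()   # positions where both strings agree
    for ref, pred in zip(reference, prediction):
        if ref == gap or pred == gap:
            continue
        residues[ref] += 1
        if ref == pred:
            matches[ref] += 1

    result = {letter: round(100.0 * matches[letter] / count, 2)
              for letter, count in residues.items()}
    total = sum(residues.values())
    result["total"] = round(100.0 * sum(matches.values()) / total, 2) if total else 0.0
    return result


if __name__ == "__main__":
    # Toy strings, not taken from the tables in this section.
    print(sensitivity("HHEEL-", "HHELLL"))
```

Letters that never occur in the reference are simply absent from the result, so they would have to be reported separately if a 0.0 entry is wanted.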

"Sensitivity of predicted secondary structures against the DSSP structure."
Letter FASTA PSSM-Big PSSM-Swissprot PsiPred Uniprot
E 21.05 63.16 80.7 84.21 100.0
H 71.94 94.9 92.35 83.16 100.0
L 71.95 85.37 79.27 95.12 0.0
total 63.28 87.16 87.16 86.27 95.96
"Sensitivity of predicted secondary structures against the PsiPred structure."
Letter FASTA PSSM-Big PSSM-Swissprot DSSP Uniprot
E 20.0 69.09 90.91 97.96 100.0
H 77.71 100.0 99.4 98.19 100.0
L 61.7 77.02 70.64 65.0 0.0
total 62.5 84.43 83.55 86.27 86.1



"Sensitivity of predicted secondary structures against the Uniprot structure."
Letter FASTA PSSM-Big PSSM-Swissprot PsiPred DSSP
E 22.22 55.56 71.11 73.33 80.0
H 74.16 97.19 94.94 89.33 100.0
L 0.0 0.0 0.0 0.0 0.0
total 63.68 88.79 90.13 86.1 95.96


The structure of P10775 found in UniProt contains no loops or turns, which is why L has 0% in those comparisons. All three comparisons show that ReProf performs worse with the FASTA sequence than with PSSMs, especially for extended strands, where its true positive rate is only about 20% to 22%. The two databases (big80 and SwissProt), however, show almost no difference. Because beta strands were predicted better with the SwissProt database, and because it gives a better result in the comparison with UniProt, this database is used for the other proteins. In addition, PsiPred gives results similar to the two ReProf predictions with PSSMs. Although DSSP uses knowledge of the PDB structure, differences between the DSSP and the UniProt secondary structure can be seen.

ReProf comparison (Q9X0E6, Q08209, P00439)

After choosing the SwissProt database for the PSSMs, ReProf, PsiPred and DSSP secondary structures were generated for the two proteins Q9X0E6 and Q08209 as well as for our protein P00439. Again, the results were analyzed with our java script (Tables: ...).

Q9X0E6 (CUTA)

Secondary structure comparison

Type PsiPred DSSP
E 80.95 90.0
H 62.5 69.23
L 66.67 38.1
total 67.33 66.67
Q08209 (PPP3CA)

Secondary structure comparison

Type PsiPred DSSP
E 51.52 71.7
H 80.59 85.71
L 86.67 46.43
total 80.23 64.39
P00439 (PAH)

Secondary structure comparison

Type PsiPred DSSP
E 74.0 57.14
H 88.21 85.82
L 90.82 56.82
total 87.83 73.2


Comparing the ratios of matches to the number of residues shows that the ReProf secondary structure is more similar to PsiPred than to DSSP. In most cases helices are predicted quite well, and the other states are also predicted well. Overall, the prediction for our protein P00439 shows the highest secondary structure similarity; only the prediction of extended strands in comparison with DSSP is worse than for the other proteins. Altogether, ReProf seems to make a good prediction.

//TODO? maybe a comparison of PsiPred to DSSP?

P00439 sequences

Here the amino acid sequence of our PAH protein P00439 and its secondary structure predictions and entries are shown.

 FASTA : MSTAVLENPG LGRKLSDFGQ ETSYIEDNCN QNGAISLIFS LKEEVGALAK VLRLFEENDV
 ReProf: LLLLELLLLL LLLLLLLLLL LLLLLLLLLL LLLEEEEEEE ELLLLHHHHH HHHHHHHLLL
PsiPred: LLLLLLLLLL LLLLLLLLLL LLLLLLLLLL LLLEEEEEEE ELLLLLHHHH HHHHHHHLLL
   DSSP: ---------- ---------- ---------- ---------- ---------- ----------
Uniprot: ---------- ---------- ---------- ---------- ---------- ----------  
 
 FASTA : NLTHIESRPS RLKKDEYEFF THLDKRSLPA LTNIIKILRH DIGATVHELS RDKKKDTVPW
 ReProf: LEEEEELLLL LLLLLLEEEE EEEELLLHHH HHHHHHHHHL LLLLLLLLLL LLLLLLLLLL
PsiPred: EEEEEELLLL LLLLLLEEEE EEELLLLLHH HHHHHHHHHH LLLLLLLLLL LLLLLLLLLL
   DSSP: ---------- ---------- ---------- ---------- ---------- ----------
Uniprot: ---------- ---------- ---------- ---------- ---------- ----------

 FASTA : FPRTIQELDR FANQILSYGA ELDADHPGFK DPVYRARRKQ FADIAYNYRH GQPIPRVEYM
 ReProf: LLHHHHHHHH HHHHHHLLLL LLLLLLLLLL LLHHHHHHHH HHHLLLLLLL LLLLLLLLLL
PsiPred: LLLLHHHHHH HHHHHHHLLL LLLLLLLLLL LHHHHHHHHH HHHHHHLLLL LLLLLLLLLL
   DSSP: --LEHHHHHH HHHHLELL-H HHLLLLLLLL -HHHHHHHHH HHHHHHL--L LL--------
Uniprot: ----HHHHHH HH--EEE--H H-----LLL- -HHHHHHHHH HHHHHH---- ----------

 FASTA : EEEKKTWGTV FKTLKSLYKT HACYEYNHIF PLLEKYCGFH EDNIPQLEDV SQFLQTCTGF
 ReProf: HHHHHHHHHH HHHHHHHLHH HLHHHHHHHH HHHHHHLLLL LLLLLLHHHH HHHHHHLLLL
PsiPred: HHHHHHHHHH HHHHHHHHHL LLHHHHHHHH HHHHHHLLLL LLLLLLHHHH HHHHHHHLLE
   DSSP: HHHHHHHHHH HHHHHHHHHH HE-HHHHHHH HHHHHHH--E LLE---HHHH HHHHHHHHL-
Uniprot: HHHHHHHHHH HHHHHHHHHH ---HHHHHHH HHHHHH---- ------HHHH HHHHHHH---

 FASTA : RLRPVAGLLS SRDFLGGLAF RVFHCTQYIR HGSKPMYTPE PDICHELLGH VPLFSDRSFA
 ReProf: EEEEEELLLL LHHHHHHHHH LHHHHHHEEL LLLLLLLLLL LLHHHHHHLL LLLLLLHHHH
PsiPred: EEEELLLLLL HHHHHHHHHL LEELLLLLLL LLLLLLLLLL LLHHHHHHLL LLLLLLHHHH
   DSSP: EEEE--LE-- HHHHHHHHLL LEEEE----- -LLLLL--LL --HHHHHHHH HHHHLLHHHH
Uniprot: EEE--EE--- HHHHHHHH-- -EEE------ ---------- --HHHHHH-- HHH---HHHH

 FASTA : QFSQEIGLAS LGAPDEYIEK LATIYWFTVE FGLCKQGDSI KAYGAGLLSS FGELQYCLSE
 ReProf: HHHHHHHHHL LLLLHHHHHH HHHHHEEEEE EEEELLLLLL EEEEEEELLL LLHHHHHHLL
PsiPred: HHHHHHHHHL LLLLHHHHHH HHHHHHEEEE EEEEEELLLE EEELLLLLLL HHHHHHHHLL
   DSSP: HHHHHHHHHH LL--HHHHHH HHHHHHHHHH L-EEEELLEE EE--HHHHL- HHHHHHLLLL
Uniprot: HHHHHHHHH- ----HHHHHH HHHHHH-LLL --EEE---EE E---HHH--- HHHHHH--EE

 FASTA : KPKLLPLELE KTAIQNYTVT EFQPLYYVAE SFNDAKEKVR NFAATIPRPF SVRYDPYTQR
 ReProf: LLLLLLLLHH HLLLLLLLLL LLLLHEEHHH HHHHHHHHHH HHHHHLLLLL LLEELLLLLL
PsiPred: LLLLLLLLHH HHHLLLLLLL LLLLLEEEEL LHHHHHHHHH HHHHLLLLLL LLLLLLLLLE
   DSSP: LLEEEE--HH HHLL----LL L--LEEEEEL -HHHHHHHHH HHHHLL--LL -EEEELLLLE
Uniprot: EEEEE---HH H-------EE ---EEEEEE- -HHHHHHHHH HH------EE EEEE-LLL-E

 FASTA : IEVLDNTQQL KILADSINSE IGILCSALQK IK
 ReProf: HHHHLLHHHH HHHHHHHHHH HHHHHHHHHH HL
PsiPred: EEELLLHHHH HHHHHHHHHH HHHHHHHHHH HL
   DSSP: EEE------- ---------- ---------- --
Uniprot: EEE--HHHHH HHHHHHHHHH HHHHHHHHH- --

Again, very similar outcomes can be seen for ReProf and PsiPred on the one hand and for DSSP and UniProt on the other. The later start of the secondary structure for both DSSP and UniProt is conspicuous. However, no two structures are identical. Differences most often result from slightly shifted predictions: for example, in one prediction a helix starts a few amino acids earlier or is a few amino acids longer than in another prediction.

Disorder

In this part, we wanted to analyse the disorder of our protein (P00439) as well as the proteins from the secondary structure prediction (P10775, Q9X0E6, Q08209).

IUPred

With IUPred one can predict long and short disordered regions as well as globular domains. ...

First we compiled IUPred with the following command:

cc /opt/iupred/iupred.c -o /mnt/home/student/.../iupred

Afterwards one can invoke the program as shown here:

iupred sequence.fasta long|short|glob > output.txt

Since the output is only written to standard output, we had to redirect it into a file.
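
To turn such an output file into a list of disordered regions, a short Python sketch like the one below may help. It assumes that the long and short modes write comment lines starting with '#', followed by one line per residue with position, amino acid and score, and that scores of 0.5 or higher are treated as disordered; the function name disordered_regions and the example values are hypothetical.

```python
def disordered_regions(iupred_text, threshold=0.5):
    """Collect (start, end) position ranges whose score reaches the threshold.

    Assumed line format: '<position> <amino acid> <score>'. Lines starting
    with '#' and empty lines are ignored.
    """
    regions = []
    start = end = None
    for line in iupred_text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        position, score = int(fields[0]), float(fields[2])
        if score >= threshold:
            if start is None:
                start = position
            end = position
        elif start is not None:
            regions.append((start, end))
            start = None
    if start is not None:  # a region that runs to the last residue
        regions.append((start, end))
    return regions


if __name__ == "__main__":
    # Made-up toy scores, not real IUPred output for any protein in this section.
    example = "# toy example\n1 M 0.80\n2 S 0.70\n3 T 0.30\n4 A 0.60\n"
    print(disordered_regions(example))
```

The glob mode reports globular domains in a different layout, so this sketch is not meant for that output.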

MD (MetaDisorder)

MetaDisorder is a ...

To invoke the program, one can use the following command:

predictprotein --seqfile sequence.fasta --target metadisorder -p output_name -o output-directory

DisProt

DisProt is a database of ...

We could not find exact matches in DisProt for our protein or for two of the other proteins, so we used the following best hits from the sequence search with the Smith-Waterman algorithm:

The PSI-BLAST search gave the same best hits, except for the CUTA protein. There, the E-value of the Smith-Waterman search was better than that of PSI-BLAST, so we used the Smith-Waterman hit.

The only protein with a match in DisProt was Q08209 (PPP3CA).

In the images below, one can see the regions of order and disorder of a given sequence.

Legend for the DisProt images.
Map of ordered and disordered regions from DisProt for the best sequence hit of P00439 (Tyrosine 3-monooxygenase). The disordered region spans sequence positions 1-155.
Map of ordered and disordered regions from DisProt for the best sequence hit of P10775 (NALP1). The disordered region spans sequence positions 31-50.
Map of ordered and disordered regions from DisProt for the best sequence hit of Q9X0E6 (Uncharacterized protein). The disordered region covers the whole protein sequence.
Map of ordered and disordered regions from DisProt for the protein Q08209. Disordered regions are located at sequence positions 1-13, 374-468, 390-414, 469-486 and 487-521; the ordered region is located at positions 14-373.


Transmembrane helices

A transmembrane helix is defined as a membrane-spanning domain with hydrogen-bonded helical configuration <ref>http://www.uniprot.org Transmembrane helix definition on UniProt, retrieved May 18, 2013</ref>. ...

Here, we predicted transmembrane helices with the tools PolyPhobius and MEMSAT-SVM for the following proteins:

Then we compared our results with the ones generated with OPM and PDBTM.

PolyPhobius

PolyPhobius is an enhanced version of Phobius and uses Hidden Markov Models (HMM) for the prediction of transmembrane topology and signal peptides. It utilizes homology information to increase the performance of the prediction and therefore depends on a high-quality global multiple sequence alignment (MSA). <ref name="polyphobius"> Lukas Käll, Anders Krogh and Erik L. L. Sonnhammer (2005). "An HMM posterior decoder for sequence feature prediction that includes homology information". Bioinformatics Vol.21(suppl 1): i251–i257. doi:10.1093/bioinformatics/bti1014 </ref>

We used the PolyPhobius installation on the server at the following path: /mnt/project/pracstrucfunc13/polyphobius/

For the prediction as well as a graphical output one can use the PolyPhobius webserver.

...

MEMSAT-SVM

MEMSAT-SVM is a revised version of MEMSAT-3 and is based on support vector machines (SVM) for the prediction of transmembrane protein topology. The prediction of signal peptides and re-entrant helices is integrated in this method. It can also discriminate effectively between transmembrane and globular proteins (extremely low false positive and false negative rates). MEMSAT-SVM showed a better accuracy than MEMSAT-3. <ref name="memsat"> Timothy Nugent and David T Jones (2009). "Transmembrane protein topology prediction using support vector machines". BMC Bioinformatics Vol.10:159. doi:10.1186/1471-2105-10-159 </ref>

Since MEMSAT-SVM is currently not running on the biolab servers, we used the MEMSAT-SVM web server and the FASTA sequences of the above-named proteins for our prediction.

...

Comparison to OPM and PDBTM

...

Signal peptides

Signal peptides are ... Version 3.0 of SignalP is installed on the server. We tried two different parameter settings for our predictions:

First we simply ran SignalP without any constraints. The only thing that has to be stated is -t euk, as all four sequences are eukaryotic; the other accepted values are gram+ and gram-. With -o the output is written automatically to output.txt; alternatively, it can be redirected to a file with '>'.

 signalp -t euk <UniprotID>.fasta > <UniprotID>_output.out

In our second run we used only the 70 N-terminal residues, as recommended in the SignalP manual page to avoid false positives.

 signalp -trunc 70 -t euk <UniprotID>.fasta > <UniprotID>_trunc.out
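
What the truncation amounts to can be illustrated with a small Python sketch that keeps only the first 70 residues of every record in a FASTA file. This is an illustration of the idea, not of how SignalP implements -trunc internally; the function name truncate_fasta and the toy record are hypothetical.

```python
def truncate_fasta(fasta_text, n_residues=70):
    """Return FASTA text in which every sequence is cut to its first n_residues."""
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:  # store the last record
        records.append((header, "".join(chunks)))
    return "\n".join(f"{h}\n{seq[:n_residues]}" for h, seq in records)


if __name__ == "__main__":
    # Toy record with 100 residues spread over two lines.
    toy = ">toy_protein\n" + "A" * 50 + "\n" + "C" * 50 + "\n"
    print(truncate_fasta(toy))
```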

The SignalP web server runs version 4.1. Here we submitted only the FASTA sequences and did not change any parameters. The output is given as:

  • SignalP-NN (neural network) with the maximal values of C-score (raw cleavage site score), S-score (signal peptide score) and Y-score (combined cleavage site score). Additionally, the mean S (average S-score of the possible signal peptide) and the D-score (discrimination score), which is a weighted average of the mean S and the max. Y scores, are reported. The D-score discriminates signal peptides from non-signal peptides.
  • SignalP-HMM returns the posterior probabilities for cleavage site (C) and signal peptide (S) for each position in the input sequences. The signal peptide probability is divided into three region probabilities (n, h, c). The maximal cleavage site probability is reported with its position. For eukaryotes also the probability of a signal anchor is reported.
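
The D-score described in the SignalP-NN item, a weighted average of the mean S and the max. Y score, can be written down as a small Python sketch. The weight used by SignalP is not given in this text, so the parameter weight_s and its default of 0.5 (a plain average) are assumptions, and the input values in the example are made up.

```python
def d_score(mean_s, max_y, weight_s=0.5):
    """Weighted average of the mean S-score and the maximal Y-score.

    weight_s is a hypothetical parameter: 0.5 gives a plain average, and the
    value actually used by a given SignalP version may differ.
    """
    if not 0.0 <= weight_s <= 1.0:
        raise ValueError("weight_s must lie between 0 and 1")
    return weight_s * mean_s + (1.0 - weight_s) * max_y


if __name__ == "__main__":
    # Made-up scores for illustration only.
    print(round(d_score(0.8, 0.6), 3))
```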

(see also Output format)

UniProtID | SignalP 3.0: SignalP-NN | SignalP 3.0: SignalP-HMM | SignalP 3.0: Max cleavage site | SignalP 3.0: prediction | SignalP 4.1: prediction | SignalP (website): prediction
P00439 | 5 x NO | 0.0 signal peptide, 0.0 signal anchor | 0.000 | non-secretory protein | no signal peptide | not in database
P02768 | 5 x YES | 1.0 signal peptide, 0.0 signal anchor | 0.785 | signal peptide | signal peptide | confirmed [1]
P11279 | 5 x YES | 1.0 signal peptide, 0.0 signal anchor | 0.847 | signal peptide | signal peptide | confirmed [2]
P47863 | 4 x NO, 1 x YES (max. S) | 0.526 signal peptide, 0.457 signal anchor | 0.388 | signal peptide | no signal peptide | not in database


In our case there are only few differences between the runs on the whole sequence and on the N-terminal region only. For example, for the whole sequence the NN result of P47863 also gives a YES for C and not only for max. S. Table ... shows the results of the N-terminal run only. The predictions of the two SignalP versions agree for all proteins except P47863. For this protein, version 3.0 predicts a signal peptide, but only with a probability of about 53%, and the maximum cleavage site probability is only 39%. In UniProt the protein is classified as a multi-pass membrane protein, which may be the reason why the N-terminal region is identified as a signal peptide. Our protein P00439 has no signal peptide, which is predicted correctly, whereas P02768 and P11279 both show a signal peptide in the N-terminal region; all three predictions are made with high confidence. The cleavage sites predicted with version 4.1 coincide with the confirmed results on the web server. Looking at the transmembrane helices on the web server, none is predicted for P02768, whereas for P11279 a potential transmembrane region between positions 383 and 405 can be found. ????

Other prediction tools are:

  • TatP (predicts presence and location of Twin-arginine signal peptide cleavage sites in bacteria)
  • Phobius (predicts both transmembrane topology and signal peptide)
  • PrediSi (predicts signal peptides)

We tested PrediSi with our proteins and got almost the same results as with SignalP version 3: for P47863 a signal peptide was predicted, while the predictions for the other three proteins are correct. Phobius, however, which also predicts transmembrane topology, shows the same results as SignalP version 4.

GO terms

... GOPET

Sequence ID: P00439
GOid Aspect Confidence GO term
GO:0003824 F 94% catalytic activity
GO:0016491 F 88% oxidoreductase activity
GO:0004497 F 87% monooxygenase activity
GO:0004505 F 84% phenylalanine 4-monooxygenase activity
GO:0004510 F 80% tryptophan 5-monooxygenase activity
GO:0004511 F 79% tyrosine 3-monooxygenase activity
GO:0046872 F 78% metal ion binding
GO:0005506 F 78% iron ion binding
GO:0008199 F 72% ferric iron binding
GO:0008198 F 72% ferrous iron binding
GO:0016597 F 71% amino acid binding



ProtFun

Pfam

... Pfam P00439

Discussion

Questions:

  • What features are predicted?
  • Discuss the results for your protein and the example proteins. Using the predictions, what could you learn about your protein and the example proteins? Compare to the available knowledge in UniProt, PDB, DisProt, OPM, PDBTM, Pfam...
  • Look for other methods to get an idea how many different tools are available to predict: secondary structure, disorder, transmembrane, signal peptides and GO terms. You should be able to name several more methods in the discussion. (You can also try out more methods.)
  • What else can be/is predicted from protein sequence alone?
  • Which predictions can be improved considerably by structure-based approaches?

References