Difference between revisions of "Sequence-based predictions (Phenylketonuria)"

From Bioinformatikpedia

Revision as of 09:31, 15 May 2013

Page is still under construction!!!

Summary

Sequence-based prediction methods can infer a variety of structural and functional properties of proteins. Here, we applied several methods to the sequence of phenylalanine hydroxylase (PAH, P00439) and, for some methods, to additional example proteins (given in brackets):

  • ReProf for secondary structure prediction (P10775, Q9X0E6, Q08209)
  • IUPred and MD (MetaDisorder) for disorder prediction (P10775, Q9X0E6, Q08209)
  • PolyPhobius and MEMSAT-SVM to predict transmembrane helices (P35462, Q9YDF8, P47863)
  • SignalP to predict signal peptides (P02768, P47863, P11279)
  • GOPET and ProtFun2.0 to predict GO terms
  • Pfam sequence search to identify the Pfam family of our protein

The results are presented and discussed in detail below.

Secondary structure

We wrote a program, filter_seqStruc.pl, to extract the secondary structure from the ReProf, PsiPred and DSSP outputs.

DSSP requires PDB files as input. Empty positions are converted to '-'. The PDB IDs are:

  • P10775: 2BNH
  • Q9X0E6: 1VHF
  • Q08209: ...
  • P00439: 1PAH (RESIDUES 117-424)
"Secondary Structure"
Type                    ReProf  PsiPred  DSSP
Helix (alpha)           H       H        G, H, I
Extended strand (beta)  E       E        B, E
Loops/Turns             L       C        S, T
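
The letter mapping in the table above can be sketched in Python as follows. This is only an illustration of the reduction to three states, not the filter_seqStruc.pl program itself. The function name `to_three_state` and the dictionary layout are assumptions made for this example; any symbol that is not listed in the table (including empty positions) is written as '-'.

```python
# Letter mapping taken from the table above; every other symbol becomes '-'.
THREE_STATE = {
    "reprof":  {"H": "H", "E": "E", "L": "L"},
    "psipred": {"H": "H", "E": "E", "C": "L"},
    "dssp":    {"G": "H", "H": "H", "I": "H",
                "B": "E", "E": "E",
                "S": "L", "T": "L"},
}


def to_three_state(structure: str, source: str) -> str:
    """Reduce a per-residue structure string to H/E/L, using '-' for unmapped symbols."""
    mapping = THREE_STATE[source.lower()]
    return "".join(mapping.get(symbol, "-") for symbol in structure)


if __name__ == "__main__":
    # Toy input strings, not taken from the proteins on this page.
    print(to_three_state("CCHHHHEEC", "psipred"))   # LLHHHHEEL
    print(to_three_state(" GGGTTBEE S", "dssp"))    # -HHHLLEEE-L
```

After this step all three sources use the same alphabet (H, E, L and '-'), so they can be compared position by position.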




P10775 (RNH1)

In the following tables, the structure predicted by ReProf is compared with the PsiPred prediction and the DSSP assignment. All of them are also compared with the secondary structure recorded in UniProt. A java script (...) was written which counts the number of matches between two sequences and the total number of residues. From these counts, the sensitivity is calculated for each letter and in total, and is given in %.

sensitivity = number of matches / number of residues

Positions that have no assignment in DSSP or UniProt (indicated with '-') are ignored.
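
The calculation described above can be sketched in Python as follows. The original counting script is not shown here, so this is an assumption about its logic rather than a copy of it. The function name `sensitivity` is hypothetical, the second argument is treated as the reference, positions where either string holds '-' are skipped, and a letter that never occurs in the reference is reported as 0.0 (which would be consistent with the L rows in the UniProt comparison).

```python
def sensitivity(predicted: str, reference: str,
                letters: str = "HEL", gap: str = "-") -> dict:
    """Return the sensitivity in percent for each letter and in total.

    Positions where either string holds the gap symbol are ignored.
    A letter that never occurs in the reference is reported as 0.0.
    """
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")

    matches = {}   # reference letter -> number of agreeing positions
    residues = {}  # reference letter -> number of compared positions
    for pred, ref in zip(predicted, reference):
        if pred == gap or ref == gap:
            continue
        residues[ref] = residues.get(ref, 0) + 1
        if pred == ref:
            matches[ref] = matches.get(ref, 0) + 1

    result = {}
    for letter in letters:
        count = residues.get(letter, 0)
        result[letter] = round(100.0 * matches.get(letter, 0) / count, 2) if count else 0.0

    total = sum(residues.values())
    result["total"] = round(100.0 * sum(matches.values()) / total, 2) if total else 0.0
    return result


if __name__ == "__main__":
    # Toy strings, not taken from the proteins on this page.
    print(sensitivity("HHHELLEE", "HHHHLL-E"))
    # {'H': 75.0, 'E': 100.0, 'L': 100.0, 'total': 85.71}
```

In the toy example, seven positions are compared (one is skipped because of '-') and six of them agree, which gives a total of 85.71%.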

"Sensitivity of predicted secondary structures against the DSSP structure."
Letter  FASTA  PSSM-Big  PSSM-Swissprot  PsiPred  UniProt
E       21.05  63.16     80.7            84.21    100.0
H       71.94  94.9      92.35           83.16    100.0
L       71.95  85.37     79.27           95.12    0.0
total   63.28  87.16     87.16           86.27    95.96
"Sensitivity of predicted secondary structures against the PsiPred structure."
Letter  FASTA  PSSM-Big  PSSM-Swissprot  DSSP   UniProt
E       20.0   69.09     90.91           97.96  100.0
H       77.71  100.0     99.4            98.19  100.0
L       61.7   77.02     70.64           65.0   0.0
total   62.5   84.43     83.55           86.27  86.1



"Sensitivity of predicted secondary structures against the Uniprot structure."
Letter  FASTA  PSSM-Big  PSSM-Swissprot  PsiPred  DSSP
E       22.22  55.56     71.11           73.33    80.0
H       74.16  97.19     94.94           89.33    100.0
L       0.0    0.0       0.0             0.0      0.0
total   63.68  88.79     90.13           86.1     95.96


The structure annotated in UniProt contains no loops or turns, which is why L is 0% in those comparisons. ReProf performs worse with the plain FASTA sequence as input than with PSSMs. The two databases (big80 and Swissprot), however, give nearly identical totals (87.16% each against DSSP). Because beta strands are predicted better with the Swissprot PSSM (80.7% vs. 63.16% against DSSP) and its total agreement with UniProt is higher (90.13% vs. 88.79%), this database is used for the other proteins.

Q9X0E6 (CUTA) & Q08209 (PPP3CA)

Secondary structure comparison (Q9X0E6)
Type   PsiPred  DSSP
E      80.95    90.0
H      62.5     69.23
L      66.67    38.1
total  67.33    66.67

Secondary structure comparison (Q08209)
Type   PsiPred  DSSP
E      51.52
H      80.59
L      86.67
total  80.23

...

P00439 (PAH)

...

"Secondary structure comparison"
Type   PsiPred  DSSP
E      74.0     57.14
H      88.21    85.82
L      90.82    56.82
total  87.83    73.2


 FASTA : MSTAVLENPG LGRKLSDFGQ ETSYIEDNCN QNGAISLIFS LKEEVGALAK VLRLFEENDV
 ReProf: LLLLELLLLL LLLLLLLLLL LLLLLLLLLL LLLEEEEEEE ELLLLHHHHH HHHHHHHLLL
PsiPred: LLLLLLLLLL LLLLLLLLLL LLLLLLLLLL LLLEEEEEEE ELLLLLHHHH HHHHHHHLLL
   DSSP: ---------- ---------- ---------- ---------- ---------- ----------
Uniprot: ---------- ---------- ---------- ---------- ---------- ----------  
 
 FASTA : NLTHIESRPS RLKKDEYEFF THLDKRSLPA LTNIIKILRH DIGATVHELS RDKKKDTVPW
 ReProf: LEEEEELLLL LLLLLLEEEE EEEELLLHHH HHHHHHHHHL LLLLLLLLLL LLLLLLLLLL
PsiPred: EEEEEELLLL LLLLLLEEEE EEELLLLLHH HHHHHHHHHH LLLLLLLLLL LLLLLLLLLL
   DSSP: ---------- ---------- ---------- ---------- ---------- ----------
Uniprot: ---------- ---------- ---------- ---------- ---------- ----------

 FASTA : FPRTIQELDR FANQILSYGA ELDADHPGFK DPVYRARRKQ FADIAYNYRH GQPIPRVEYM
 ReProf: LLHHHHHHHH HHHHHHLLLL LLLLLLLLLL LLHHHHHHHH HHHLLLLLLL LLLLLLLLLL
PsiPred: LLLLHHHHHH HHHHHHHLLL LLLLLLLLLL LHHHHHHHHH HHHHHHLLLL LLLLLLLLLL
   DSSP: --LEHHHHHH HHHHLELL-H HHLLLLLLLL -HHHHHHHHH HHHHHHL--L LL--------
Uniprot: ----HHHHHH HH--EEE--H H-----LLL- -HHHHHHHHH HHHHHH---- ----------

 FASTA : EEEKKTWGTV FKTLKSLYKT HACYEYNHIF PLLEKYCGFH EDNIPQLEDV SQFLQTCTGF
 ReProf: HHHHHHHHHH HHHHHHHLHH HLHHHHHHHH HHHHHHLLLL LLLLLLHHHH HHHHHHLLLL
PsiPred: HHHHHHHHHH HHHHHHHHHL LLHHHHHHHH HHHHHHLLLL LLLLLLHHHH HHHHHHHLLE
   DSSP: HHHHHHHHHH HHHHHHHHHH HE-HHHHHHH HHHHHHH--E LLE---HHHH HHHHHHHHL-
Uniprot: HHHHHHHHHH HHHHHHHHHH ---HHHHHHH HHHHHH---- ------HHHH HHHHHHH---

 FASTA : RLRPVAGLLS SRDFLGGLAF RVFHCTQYIR HGSKPMYTPE PDICHELLGH VPLFSDRSFA
 ReProf: EEEEEELLLL LHHHHHHHHH LHHHHHHEEL LLLLLLLLLL LLHHHHHHLL LLLLLLHHHH
PsiPred: EEEELLLLLL HHHHHHHHHL LEELLLLLLL LLLLLLLLLL LLHHHHHHLL LLLLLLHHHH
   DSSP: EEEE--LE-- HHHHHHHHLL LEEEE----- -LLLLL--LL --HHHHHHHH HHHHLLHHHH
Uniprot: EEE--EE--- HHHHHHHH-- -EEE------ ---------- --HHHHHH-- HHH---HHHH

 FASTA : QFSQEIGLAS LGAPDEYIEK LATIYWFTVE FGLCKQGDSI KAYGAGLLSS FGELQYCLSE
 ReProf: HHHHHHHHHL LLLLHHHHHH HHHHHEEEEE EEEELLLLLL EEEEEEELLL LLHHHHHHLL
PsiPred: HHHHHHHHHL LLLLHHHHHH HHHHHHEEEE EEEEEELLLE EEELLLLLLL HHHHHHHHLL
   DSSP: HHHHHHHHHH LL--HHHHHH HHHHHHHHHH L-EEEELLEE EE--HHHHL- HHHHHHLLLL
Uniprot: HHHHHHHHH- ----HHHHHH HHHHHH-LLL --EEE---EE E---HHH--- HHHHHH--EE

 FASTA : KPKLLPLELE KTAIQNYTVT EFQPLYYVAE SFNDAKEKVR NFAATIPRPF SVRYDPYTQR
 ReProf: LLLLLLLLHH HLLLLLLLLL LLLLHEEHHH HHHHHHHHHH HHHHHLLLLL LLEELLLLLL
PsiPred: LLLLLLLLHH HHHLLLLLLL LLLLLEEEEL LHHHHHHHHH HHHHLLLLLL LLLLLLLLLE
   DSSP: LLEEEE--HH HHLL----LL L--LEEEEEL -HHHHHHHHH HHHHLL--LL -EEEELLLLE
Uniprot: EEEEE---HH H-------EE ---EEEEEE- -HHHHHHHHH HH------EE EEEE-LLL-E

 FASTA : IEVLDNTQQL KILADSINSE IGILCSALQK IK
 ReProf: HHHHLLHHHH HHHHHHHHHH HHHHHHHHHH HL
PsiPred: EEELLLHHHH HHHHHHHHHH HHHHHHHHHH HL
   DSSP: EEE------- ---------- ---------- --
Uniprot: EEE--HHHHH HHHHHHHHHH HHHHHHHHH- --

Disorder

IUPred

IUPred predicts long and short disordered regions as well as globular domains. ...

First, we compiled IUPred with the following command:

cc /opt/iupred/iupred.c -o /mnt/home/student/.../iupred

Afterwards, the program can be invoked as shown here, where the second argument is one of long, short or glob:

iupred sequence.fasta long/short/glob > output.txt

Since IUPred writes its results only to standard output, we redirected the output into a file.
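
The saved output can then be processed further. The following Python sketch reads such a file and collects stretches of disordered residues. It rests on several assumptions that are not confirmed on this page: the long and short modes are assumed to write '#' comment lines followed by one "position residue score" line per residue, a score above 0.5 is taken as disordered, and the function names `parse_iupred` and `disordered_regions` are made up for this example. The glob mode uses a different layout and is not handled.

```python
def parse_iupred(text: str) -> list:
    """Parse 'position residue score' lines; comment and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3 or line.lstrip().startswith("#"):
            continue
        rows.append((int(fields[0]), fields[1], float(fields[2])))
    return rows


def disordered_regions(rows: list, cutoff: float = 0.5) -> list:
    """Return (start, end) pairs for runs of residues scoring above the cutoff."""
    regions = []
    start = previous = None
    for position, _residue, score in rows:
        if score > cutoff:
            if start is None:
                start = position
            previous = position
        elif start is not None:
            regions.append((start, previous))
            start = None
    if start is not None:
        regions.append((start, previous))
    return regions


if __name__ == "__main__":
    # Made-up scores in the assumed layout, not real IUPred results.
    sample = """# made-up example values
    1 M     0.8100
    2 S     0.6200
    3 T     0.4100
    4 A     0.3000
    5 V     0.5500
"""
    print(disordered_regions(parse_iupred(sample)))  # [(1, 2), (5, 5)]
```

For a real run, the text would be read from the redirected file, for example with `open("output.txt").read()`.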

MD (MetaDisorder)

MetaDisorder is a ...

To invoke the program, one can use the following command:

predictprotein --seqfile sequence.fasta --target metadisorder -p output_name -o output-directory

DisProt

DisProt is a database of ...

We could not find exact matches in DisProt for our protein or for two of the other proteins, so we used the following best hits from the sequence search with the Smith-Waterman algorithm:

The PSI-BLAST search gave the same best hits, except for the CUTA protein. For CUTA, the E-value of the Smith-Waterman hit was better than that of the PSI-BLAST hit, so we used the Smith-Waterman hit.

The only protein with an exact match in DisProt was Q08209 (PPP3CA).

In the images below, one can see the regions of order and disorder of a given sequence.

Legend for the DisProt images.
Map of ordered and disordered regions from DisProt for the best sequence hit of P00439 (Tyrosine 3-monooxygenase). The disordered region spans sequence positions 1-155.
Map of ordered and disordered regions from DisProt for the best sequence hit of P10775 (NALP1). The disordered region spans sequence positions 31-50.
Map of ordered and disordered regions from DisProt for the best sequence hit of Q9X0E6 (Uncharacterized protein). The whole protein sequence is disordered.
Map of ordered and disordered regions from DisProt for the protein Q08209. The disordered regions span sequence positions 1-13, 374-468, 390-414, 469-486 and 487-521; the ordered region spans positions 14-373.


Transmembrane helices

...

PolyPhobius

...

P00439 (PAH)

...

P35462 (DRD3)

...

Q9YDF8 (KVAP)

...

P47863 (AQP4)

...

MEMSAT-SVM

...

Signal peptides

...

P00439 (PAH)

...

P02768 (ALB)

...

P47863 (AQP4)

...

P11279 (LAMP1)

...

GO terms

...

Pfam

...

Discussion

Questions:

  • What features are predicted?
  • Discuss the results for your protein and the example proteins. Using the predictions, what could you learn about your protein and the example proteins? Compare to the available knowledge in UniProt, PDB, DisProt, OPM, PDBTM, Pfam...
  • Look for other methods to get an idea how many different tools are available to predict: secondary structure, disorder, transmembrane, signal peptides and GO terms. You should be able to name several more methods in the discussion. (You can also try out more methods.)
  • What else can be (or is) predicted from protein sequence alone?
  • Which predictions can be improved considerably by structure-based approaches?