Difference between revisions of "Sequence-based predictions HEXA"

From Bioinformatikpedia

Revision as of 21:25, 30 May 2011

General Information

Secondary Structure Prediction

Prediction of disordered regions

  • DISOPRED

Authors: Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT.

Year: 2004

Source: [Prediction and functional analysis of native disorder in proteins from the three kingdoms of life.]


Description:

This method is based on a neural network that was trained on high-resolution X-ray structures from the PDB. Disordered regions are defined as residues that appear in the sequence record but are missing from the electron density map. This definition can fail, because residues can also be missing from the map as a result of the crystallization process. The method first runs a PSI-BLAST search against a filtered sequence database. Next, a profile is calculated for each residue and classified with the trained neural network.


Prediction:

The prediction result is a file with the predicted disordered regions, together with the precision and recall. A more detailed output is also available. It shows the sequence, the prediction for each residue, and a number from 0 to 9 above the sequence that indicates how likely the prediction is.


Input:

If you run DISOPRED on the console, you have to specify the location of your database. As input, the program needs your sequence in a file in FASTA format.
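As a small illustration of the input format, the following Python sketch reads FASTA-formatted text into a dictionary. It is not part of DISOPRED; the function name `read_fasta` and the example record are hypothetical placeholders.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into {identifier: sequence}."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the identifier is the first word after '>'.
            header = line[1:].split()
            current_id = header[0] if header else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[current_id] += line
    return records


# Placeholder record for illustration only, not a real protein sequence.
example = ">example_protein placeholder entry\nMKTAY\nIAKQR\n"
print(read_fasta(example))
```

The same kind of file can be used as input for the other tools described on this page, since all of them accept FASTA.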


  • POODLE

Prediction of order and disorder by machine-learning

Authors: S. Hirose, K. Shimizu, S. Kanai, Y. Kuroda and T. Noguchi

Year: 2007

There exist three different variants of POODLE.

The first variant is called POODLE-L, which mainly predicts long disordered regions with a length of more than 40 residues.


Source: [POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions.]

The next variant is called POODLE-S, which mainly predicts short disordered regions.


Source: [POODLE-S: web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix.]

The last variant is called POODLE-I, which integrates structural information predictors.


Source: [POODLE-I: Disordered region prediction by integrating POODLE series and structural information predictors based on a workflow approach]

There is also another variant called POODLE-W, which compares different sequences and predicts which sequence is the most disordered one, but this method wasn't used in our analysis.


Description:

POODLE is also a machine-learning-based method. It is based on a two-level SVM (Support Vector Machine).

We describe POODLE-L in detail here, but all POODLE variants use the same principle. The method was trained on disordered proteins and on proteins with no disordered regions. On the first level, the SVM predicts the probability that a 40-residue sequence segment is disordered. If the algorithm finds such a disordered region, the second level of the SVM uses the output of the first level and predicts the probability of being disordered for each amino acid.


Output:

The result of this method is a file listing each amino acid, the prediction of whether it is ordered or disordered, and the probability for that state. Furthermore, you get a graphical view of the result.


Input:

We used the POODLE webserver for our analysis. We pasted our sequence in FASTA format into the input window and chose the POODLE variant.

Prediction of transmembrane helices and signal peptides

  • TMHMM (transmembrane helices hidden Markov model)

Authors: E. L.L. Sonnhammer, G. von Heijne, and A. Krogh
Year: 1998
Source: A hidden Markov model for predicting transmembrane helices in protein sequences.


Description:
TMHMM is a hidden Markov model-based prediction method for transmembrane helices in proteins. The HMM consists of three main locations (core, cap, loop) and seven states (cytoplasmic loop, cytoplasmic cap, helix core, non-cytoplasmic cap, short non-cytoplasmic loop, long non-cytoplasmic loop and globular domain).


Prediction:
For a given protein sequence in FASTA format, this method searches for the best path through the hidden Markov model. There are two output formats, a short one and a long one. The long output format gives additional statistical information (e.g. the expected number of amino acids in transmembrane helices).
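To illustrate what "best path through the hidden Markov model" means, here is a minimal Viterbi sketch in Python. It is a toy model, not TMHMM itself: it has only two states (loop and helix) instead of TMHMM's seven, and the transition probabilities, emission probabilities and the set of hydrophobic residues are made-up assumptions chosen for readability.

```python
import math

STATES = ("loop", "helix")
# Assumed set of hydrophobic residues for this toy example.
HYDROPHOBIC = set("AVLIMFWC")


def log_emit(state, aa):
    # Assumed emissions: helix states prefer hydrophobic residues.
    p_hydrophobic = 0.8 if state == "helix" else 0.2
    return math.log(p_hydrophobic if aa in HYDROPHOBIC else 1 - p_hydrophobic)


def log_trans(prev, state):
    # Assumed transitions: staying in a state is more likely than switching.
    return math.log(0.9 if prev == state else 0.1)


def viterbi(seq):
    """Return the most probable state path for a non-empty sequence.

    'M' marks a helix (membrane) residue, '-' marks a loop residue.
    """
    score = {s: math.log(0.5) + log_emit(s, seq[0]) for s in STATES}
    backpointers = []
    for aa in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_trans(p, s))
            new_score[s] = score[best_prev] + log_trans(best_prev, s) + log_emit(s, aa)
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return "".join("M" if s == "helix" else "-" for s in path)


print(viterbi("DDDDD" + "L" * 10 + "DDDDD"))
```

In this toy model a run of hydrophobic residues flanked by polar residues is labelled as a helix, because two state switches are cheaper than emitting ten hydrophobic residues from the loop state.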


Input:
The method only needs the protein sequence in FASTA format for the prediction.

  • Phobius and PolyPhobius

Phobius:
Authors: Lukas Käll, Anders Krogh and Erik L. L. Sonnhammer
Year: 2004
Source: A Combined Transmembrane Topology and Signal Peptide Prediction Method.

PolyPhobius:
Authors: Lukas Käll, Anders Krogh and Erik Sonnhammer
Year: 2005
Source: An HMM posterior decoder for sequence feature prediction that includes homology information.

Description:
Phobius and PolyPhobius are combined methods that predict both transmembrane helices and signal peptides. Both methods are based on a hidden Markov model and combine the approaches of TMHMM and SignalP. The basis of these methods is the HMM of TMHMM with an additional start state for signal peptides. The difference between Phobius and PolyPhobius is that PolyPhobius also uses homology information for the prediction.

Input:
We used the webserver for Phobius and PolyPhobius, where it was only necessary to paste the protein sequence in FASTA format.

Output:
The server outputs a text file with the predicted position of the signal peptide, the type of the signal peptide and the positions of the transmembrane helices. Furthermore, it outputs a detailed file with the probabilities for each residue to be located in a transmembrane helix or signal peptide. Additionally, the server outputs a picture of the prediction.

Prediction of GO Terms

Secondary Structure prediction

Prediction of disordered regions

Before analysing the results of the different methods, we checked whether our protein has one or more disordered regions. We searched for our protein in the DisProt database and did not find it, so we conclude that our protein has no disordered regions. Another way to find out whether the protein has disordered regions is to check whether its UniProt entry contains a cross-reference to DisProt.


  • Disopred

Disopred predicts two disordered regions in our protein. The first region is at the beginning of the protein (first two residues) and the second region is at the end (last three residues). This prediction is wrong, because it is normal that the first and the last amino acids are missing from the electron density map. So our protein Hexosaminidase A has no disordered regions.

Result of the Disopred prediction. * marks an amino acid that belongs to a disordered region, whereas . marks a non-disordered residue.
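The per-residue symbols described in the caption can be turned into region boundaries with a few lines of Python. This is only a sketch: it assumes the prediction is available as one string of `*` and `.` characters, and the function name `disordered_regions` and the 12-residue example string are hypothetical. The exact layout of the DISOPRED output file is not reproduced here.

```python
import re


def disordered_regions(states):
    """Return 1-based (start, end) pairs for every run of '*' in `states`."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"\*+", states)]


# Toy 12-residue state string, not the real HEXA_HUMAN output.
print(disordered_regions("**.......***"))
```

For the toy string, the function reports one region covering the first two residues and one covering the last three, which mirrors the pattern described above for our protein.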


  • POODLE

We decided to test several POODLE variants and to compare the results.

POODLE-I

POODLE-I predicted five disordered regions:

start position end position length
1 2 2
14 19 6
83 89 7
105 109 5
527 529 3


POODLE-L

POODLE-L found no disordered regions. Therefore, there is no disordered region longer than 40 aa in our protein.


POODLE-S (High B-factor residues)

This POODLE-S variant searches for residues with high B-factor values in the crystal structure, which imply uncertainty in the assignment of the atom positions.

POODLE-S predicted five disordered regions:

start position end position length
0 2 2
13 19 7
83 88 6
105 109 5
526 529 4


POODLE-S (missing residues)

POODLE-S (missing residues) predicts a region as disordered if an amino acid is present in the sequence record but not in the electron density map.

POODLE-S found six disordered regions:

start position end position length
17 18 2
53 61 9
78 109 33
153 153 1
280 280 1
345 345 1


Graphical Output:

Prediction of POODLE-S (High B-factor residues)
Prediction of POODLE-S (missing residues)
Prediction of POODLE-I
Prediction of POODLE-L

Comparison of the different POODLE variants: POODLE-L doesn't find any disordered regions. This is the result we expected, because our protein doesn't possess any disordered regions.

Both POODLE-S variants found several short disordered regions, which is a false result. Interestingly, there seem to be more residues missing from the electron density map than residues with a high B-factor value.

POODLE-I found almost the same result as POODLE-S with high B-factor, which was expected, because POODLE-I combines POODLE-L and POODLE-S (high B-factor).

Therefore, the predictions of short disordered regions are wrong. Only the prediction of POODLE-L is correct.

In general, these predictions are used when nothing else is known about the protein. Therefore, we normally would not know that the prediction is wrong. For that reason, we treat the results as if we trusted them and check whether the disordered regions overlap with the functionally important residues, because disordered regions seem to be functionally very important. We check this for POODLE-S with missing residues and POODLE-I, because POODLE-S with high B-factor values shows almost the same result as POODLE-I.

Functional residues and their predicted disorder state:

residue position | amino acid | function | POODLE-S (missing) | POODLE-I
323 | E | active site | ordered | ordered
115 | N | Glycosylation | ordered | ordered
157 | N | Glycosylation | ordered | ordered
259 | N | Glycosylation | ordered | ordered
58 (connected with 104) | C | Disulfide bond | disordered | ordered
104 (connected with 58) | C | Disulfide bond | disordered | ordered
277 (connected with 328) | C | Disulfide bond | ordered | ordered
328 (connected with 277) | C | Disulfide bond | ordered | ordered
505 (connected with 522) | C | Disulfide bond | ordered | ordered
522 (connected with 505) | C | Disulfide bond | ordered | ordered

As the table above shows, only one disulfide bond is located in a disordered region; all other functionally important residues are located in ordered regions. This is a further good hint that the predictions are wrong.
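The overlap check in the table can be reproduced with a short Python sketch. The region boundaries and residue positions are copied from the tables on this page; the variable and function names are our own choices for illustration.

```python
# Predicted disordered regions as (start, end), taken from the tables above.
POODLE_S_MISSING = [(17, 18), (53, 61), (78, 109), (153, 153), (280, 280), (345, 345)]
POODLE_I = [(1, 2), (14, 19), (83, 89), (105, 109), (527, 529)]

# Functionally important residues of HEXA_HUMAN listed in the table above.
FUNCTIONAL = {
    323: "active site",
    115: "glycosylation",
    157: "glycosylation",
    259: "glycosylation",
    58: "disulfide bond",
    104: "disulfide bond",
    277: "disulfide bond",
    328: "disulfide bond",
    505: "disulfide bond",
    522: "disulfide bond",
}


def state(position, regions):
    """Classify a residue as 'disordered' if it lies inside any region."""
    inside = any(start <= position <= end for start, end in regions)
    return "disordered" if inside else "ordered"


for pos, function in sorted(FUNCTIONAL.items()):
    print(pos, function, state(pos, POODLE_S_MISSING), state(pos, POODLE_I))
```

With these inputs, only positions 58 and 104 fall into a POODLE-S (missing residues) region, and none of the listed residues falls into a POODLE-I region.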

Prediction of transmembrane alpha-helices and signal peptides

Because most of the proteins used in this practical course aren't membrane proteins, we were given five additional proteins for the transmembrane and signal peptide analyses.

Additional proteins:

name | organism | location | transmembrane protein | sequence
BACR_HALSA | Halobacterium salinarium (Archaea) | Cell membrane | Multi-pass membrane protein | [P02945.fasta]
RET4_HUMAN | Human (Homo sapiens) | extracellular space | No | [P02753.fasta]
INSL5_HUMAN | Human (Homo sapiens) | extracellular region | No | [Q9Y5Q6.fasta]
LAMP1_HUMAN | Human (Homo sapiens) | Cell membrane | Single-pass membrane protein | [P11279.fasta]
A4_HUMAN | Human (Homo sapiens) | Cell membrane | Single-pass membrane protein | [P05067.fasta]

TMHMM

We analysed the six sequences with TMHMM.

  • Hexosaminidase A

TODO

  • BACR_HALSA
Prediction of TMHMM for the transmembrane helices of BACR_HALSA
start position end position location
1 22 outside
23 42 TM Helix
43 54 inside
55 77 TM Helix
78 91 outside
92 114 TM Helix
115 120 inside
121 143 TM Helix
144 147 outside
148 170 TM Helix
171 189 inside
190 212 TM Helix
213 262 outside

TMHMM predicts six transmembrane helices for BACR_HALSA. We decided to compare the TMHMM prediction with the actual transmembrane helices in BACR_HALSA:

Comparison between the actual transmembrane helices and the TMHMM result.

The prediction is very good, especially at the beginning, where predicted and actual helices overlap almost completely. Only at the end of the protein is one transmembrane helix missing from the TMHMM prediction: in reality there are seven transmembrane helices, whereas TMHMM predicts only six. This is a serious error, because it makes a difference for the function whether there are six or seven helices, but in general the prediction of TMHMM was quite good.
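The missing helix can also be located programmatically. The annotated helix positions of the real protein are not listed on this page, so the following Python sketch uses the seven-helix Phobius prediction for BACR_HALSA (reported in the Phobius section of this page) as a stand-in reference. That substitution is an assumption made for illustration only, and the function names are our own.

```python
# TM helices as (start, end), copied from the tables on this page.
TMHMM = [(23, 42), (55, 77), (92, 114), (121, 143), (148, 170), (190, 212)]
PHOBIUS = [(23, 42), (54, 76), (96, 114), (121, 142), (148, 169), (190, 212), (218, 237)]


def overlaps(a, b):
    """True if the two closed intervals share at least one residue."""
    return a[0] <= b[1] and b[0] <= a[1]


def unmatched(reference, prediction):
    """Reference helices that no predicted helix overlaps."""
    return [h for h in reference if not any(overlaps(h, p) for p in prediction)]


print(len(TMHMM), len(PHOBIUS))
print(unmatched(PHOBIUS, TMHMM))
```

The only stand-in helix without a TMHMM counterpart is 218 to 237, which lies inside the segment 213 to 262 that TMHMM labels as outside.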


  • RET4_HUMAN
Prediction of TMHMM for the transmembrane helices of RET4_HUMAN
start position end position location
1 201 outside

TMHMM predicts no transmembrane helices. The whole protein is located in the extracellular space.


Comparison between the actual transmembrane helices and the TMHMM result.

The TMHMM prediction is completely right. This shows that TMHMM can also predict that a protein is not a transmembrane protein.


  • INSL5_HUMAN
Prediction of TMHMM for the transmembrane helices of INSL5_HUMAN
start position end position location
1 135 outside

TMHMM predicts no transmembrane helices. The whole protein is located in the extracellular space.


Comparison between the actual transmembrane helices and the TMHMM result.

The TMHMM prediction is again completely right.

  • LAMP1_HUMAN
Prediction of TMHMM for the transmembrane helices of LAMP1_HUMAN


start position end position location
1 10 inside
11 33 TM Helix
34 383 outside
384 406 TM Helix
407 417 inside

TMHMM predicts two transmembrane helices, which are separated by a very long loop located in the extracellular space.

Comparison between the actual transmembrane helices and the TMHMM result.

The prediction of TMHMM is quite good. Only at the beginning of the protein does TMHMM predict one wrong transmembrane helix; the rest of the prediction is correct.

  • A4_HUMAN


Prediction of TMHMM for the transmembrane helices of A4_HUMAN


start position end position location
1 700 outside
701 723 TM Helix
724 770 inside

TMHMM predicts one transmembrane helix at the end of the protein. As we already know, A4_HUMAN is a single-spanning transmembrane protein, so the number of transmembrane helices is predicted correctly. As the next step, we compared the position of the transmembrane helix.

Comparison between the actual transmembrane helices and the TMHMM result.

The result of the TMHMM prediction is quite good. Except for the first residues at the beginning and the exact start position of the transmembrane helix, the prediction is correct.


Phobius and PolyPhobius

  • Hexosaminidase A
Prediction of Phobius for the transmembrane helices and signal peptides of HEXA_HUMAN
Prediction of PolyPhobius for the transmembrane helices and signal peptides of HEXA_HUMAN
Phobius | | | PolyPhobius | |
start position | end position | prediction | start position | end position | prediction
Signal peptide prediction
1 | 5 | N-Region | 1 | 5 | N-Region
6 | 17 | H-Region | 6 | 15 | H-Region
18 | 22 | C-Region | 16 | 19 | C-Region
Summary signal peptide
1 | 22 | Signal Peptide | 1 | 19 | Signal Peptide
Transmembrane helices prediction
23 | 529 | outside | 20 | 520 | outside

Neither method predicts a transmembrane helix, which is correct, because HEXA_HUMAN is located in the lysosome. We compared the results of Phobius and PolyPhobius with the real protein.

Comparison between the prediction of Phobius and the real protein
Comparison between the prediction of PolyPhobius and the real protein

The prediction of Phobius is slightly better than the PolyPhobius prediction, because Phobius predicts the beginning and the end of the signal peptide correctly, whereas PolyPhobius cuts off the last three residues of the signal peptide (it ends at position 19 instead of 22).


  • BACR_HALSA
Prediction of Phobius for the transmembrane helices and signal peptides of BACR_HALSA
Prediction of PolyPhobius for the transmembrane helices and signal peptides of BACR_HALSA
Phobius | | | PolyPhobius | |
start position | end position | prediction | start position | end position | prediction
Signal peptide prediction
No prediction available
Transmembrane helices prediction
23 | 42 | TM helix | 22 | 43 | TM helix
43 | 53 | inside | 44 | 54 | inside
54 | 76 | TM helix | 55 | 77 | TM helix
77 | 95 | outside | 78 | 94 | outside
96 | 114 | TM helix | 95 | 114 | TM helix
115 | 120 | inside | 115 | 120 | inside
121 | 142 | TM helix | 121 | 141 | TM helix
143 | 147 | outside | 142 | 147 | outside
148 | 169 | TM helix | 148 | 166 | TM helix
170 | 189 | inside | 167 | 186 | inside
190 | 212 | TM helix | 187 | 205 | TM helix
213 | 217 | outside | 206 | 215 | outside
218 | 237 | TM helix | 216 | 237 | TM helix
238 | 262 | inside | 238 | 262 | inside

Neither method predicts a signal peptide, but both recognize that this protein is a transmembrane protein with seven helices. The predictions differ only in the start and end positions of the helices, and most boundaries differ by only about 1 to 3 residues.
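The boundary differences between the two predictions can be computed directly from the helix positions in the table. The following Python sketch uses the values from this page; the function name `boundary_shifts` is our own.

```python
# TM helices as (start, end), copied from the BACR_HALSA table above.
PHOBIUS = [(23, 42), (54, 76), (96, 114), (121, 142), (148, 169), (190, 212), (218, 237)]
POLYPHOBIUS = [(22, 43), (55, 77), (95, 114), (121, 141), (148, 166), (187, 205), (216, 237)]


def boundary_shifts(a, b):
    """Absolute difference of start and end position for each helix pair."""
    return [(abs(s1 - s2), abs(e1 - e2)) for (s1, e1), (s2, e2) in zip(a, b)]


shifts = boundary_shifts(PHOBIUS, POLYPHOBIUS)
for number, (start_shift, end_shift) in enumerate(shifts, start=1):
    print(f"helix {number}: start differs by {start_shift}, end differs by {end_shift}")
```

For these values, the largest difference is at the end of the sixth helix (212 in Phobius versus 205 in PolyPhobius); all other boundaries differ by at most three residues.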

To evaluate the predictions, we compared them with the actual transmembrane helices.

Comparison between the prediction of Phobius and the real protein
Comparison between the prediction of PolyPhobius and the real protein


OCTOPUS and SPOCTOPUS

  • HEXA_HUMAN
Prediction of OCTOPUS for the transmembrane helices of HEXA_HUMAN
Prediction of SPOCTOPUS for the transmembrane helices of HEXA_HUMAN
OCTOPUS | | | SPOCTOPUS | |
start position | end position | prediction | start position | end position | prediction
1 | 2 | inside | 1 | 6 | N-terminal of a signal peptide
3 | 23 | TM helix | 7 | 21 | signal peptide
24 | 529 | outside | 22 | 529 | outside

The results of these two predictions differ. OCTOPUS predicts a transmembrane helix, whereas SPOCTOPUS predicts a signal peptide at the same location.
To check which method is right, we compared the predictions with the real protein.


Comparison between the prediction of OCTOPUS and the real protein
Comparison between the prediction of SPOCTOPUS and the real protein
  • BACR_HALSA
Prediction of OCTOPUS for the transmembrane helices of BACR_HALSA
Prediction of SPOCTOPUS for the transmembrane helices of BACR_HALSA
OCTOPUS | | | SPOCTOPUS | |
start position | end position | prediction | start position | end position | prediction
1 | 22 | outside | 1 | 22 | outside
23 | 43 | TM helix | 23 | 43 | TM helix
44 | 54 | inside | 44 | 54 | inside
55 | 75 | TM helix | 55 | 75 | TM helix
76 | 95 | outside | 76 | 95 | outside
96 | 116 | TM helix | 96 | 116 | TM helix
117 | 121 | inside | 117 | 120 | inside
122 | 142 | TM helix | 121 | 141 | TM helix
143 | 147 | outside | 142 | 147 | outside
148 | 168 | TM helix | 148 | 168 | TM helix
169 | 185 | inside | 169 | 185 | inside
186 | 206 | TM helix | 186 | 206 | TM helix
207 | 216 | outside | 207 | 216 | outside
217 | 237 | TM helix | 217 | 237 | TM helix
238 | 262 | inside | 238 | 262 | inside

Both methods give very similar results, which are identical except for a few residues. Both predict the seven transmembrane helices, which is a very good result.
Next we analysed the overlap between the actual transmembrane helices and the prediction results.

Comparison between the prediction of OCTOPUS and the real protein
Comparison between the prediction of SPOCTOPUS and the real protein
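How closely the OCTOPUS and SPOCTOPUS predictions for BACR_HALSA agree can be quantified per residue. The following Python sketch expands the segments from the table above into one label per position and lists the positions where the two methods disagree; the function names are our own.

```python
# Segments as (start, end, label), copied from the BACR_HALSA table above.
OCTOPUS = [
    (1, 22, "outside"), (23, 43, "TM helix"), (44, 54, "inside"),
    (55, 75, "TM helix"), (76, 95, "outside"), (96, 116, "TM helix"),
    (117, 121, "inside"), (122, 142, "TM helix"), (143, 147, "outside"),
    (148, 168, "TM helix"), (169, 185, "inside"), (186, 206, "TM helix"),
    (207, 216, "outside"), (217, 237, "TM helix"), (238, 262, "inside"),
]
SPOCTOPUS = [
    (1, 22, "outside"), (23, 43, "TM helix"), (44, 54, "inside"),
    (55, 75, "TM helix"), (76, 95, "outside"), (96, 116, "TM helix"),
    (117, 120, "inside"), (121, 141, "TM helix"), (142, 147, "outside"),
    (148, 168, "TM helix"), (169, 185, "inside"), (186, 206, "TM helix"),
    (207, 216, "outside"), (217, 237, "TM helix"), (238, 262, "inside"),
]


def per_residue(segments):
    """Expand segments into a {position: label} mapping."""
    labels = {}
    for start, end, label in segments:
        for position in range(start, end + 1):
            labels[position] = label
    return labels


def disagreements(a, b):
    """Positions whose label differs between the two predictions."""
    labels_a, labels_b = per_residue(a), per_residue(b)
    return sorted(p for p in labels_a if labels_a[p] != labels_b.get(p))


print(disagreements(OCTOPUS, SPOCTOPUS))
```

With the values from the table, the two methods disagree only at positions 121 and 142, the two ends of the fourth helix.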

Prediction of GO terms