Canavan Task 3 - Sequence-based predictions

From Bioinformatikpedia
Revision as of 18:05, 21 May 2012 by Vorbergs
Oh, I would sing of mackerel skies,
And why the sea is wet,
Of jelly-fish and conger-eels,
And things that I forget. 

(taken from "The Cumberbunce" by Paul West)

Protocol

Commands, source code and other methodological issues are kept in the protocol.

Secondary Structure Prediction

Information on Proteins

Identifier | P10775 | Q08209 | Q9X0E6
Protein | Ribonuclease inhibitor | Serine/threonine phosphatase (Calcineurin) | Divalent-cation tolerance protein
Organism | Sus scrofa (pig) | Homo sapiens | Thermotoga maritima
Sequence length | 456 | 521 | 101
Subcellular location | Cytoplasm | Nucleus | Cytoplasm
PDB Identifier | 2BNH | 1AUI | 1O5J
Structure | 2bnh bio r 500.jpg | 1aui bio r 500.jpg | 1o5j bio r 500.jpg

Consistent nomenclature and sequence issues

Three-state accuracy

We mapped the (slightly differing) secondary structure elements of the three prediction methods onto the three common possible states C (Coil), H (Helix), and E (Extended; Beta-Sheet) to make comparison of methods easier.
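A minimal sketch of such a mapping, assuming DSSP's eight-state alphabet and one common reduction (H, G, I to helix; E, B to extended; everything else to coil). The mapping actually used for this page is kept in the protocol, so the dictionary and the function name below are assumptions for illustration.

```python
# Assumed reduction of the DSSP 8-state alphabet to three states.
DSSP_TO_THREE = {
    "H": "H", "G": "H", "I": "H",             # helix types
    "E": "E", "B": "E",                       # strand and isolated bridge
    "T": "C", "S": "C", "-": "C", " ": "C",   # turn, bend, unassigned
}


def to_three_state(ss: str) -> str:
    """Map a secondary structure string onto C/H/E; unknown symbols become C."""
    return "".join(DSSP_TO_THREE.get(ch, "C") for ch in ss.upper())


print(to_three_state("--HHHHGGTT-EEEEB-S"))
```

Other reductions exist (for example treating G or B as coil), and the choice shifts the measured accuracies slightly, so the same mapping should be applied to all methods being compared.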

UniProt vs PDB Sequences

DSSP assigns secondary structure from a given 3D protein structure. The PDB entries chosen for the corresponding UniProt sequences are listed in the table above. However, PDB sequences often differ considerably from their UniProt counterparts because of the experimental conditions under which the structure was solved (missing atoms and residues being the main problem). We therefore performed a pairwise alignment of each PDB sequence with its UniProt sequence, so that the DSSP assignment and the predictions can be compared position by position.
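The following sketch illustrates how a pairwise alignment can transfer a DSSP assignment onto UniProt numbering. The alignment tool and parameters actually used are kept in the protocol; the Needleman-Wunsch implementation, the scoring values, the function names and the toy sequences below are assumptions for illustration only.

```python
def global_align(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty (assumed scores)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def project_onto_uniprot(uniprot, pdb, pdb_ss):
    """Return one state per UniProt residue; '?' marks residues missing in the PDB chain."""
    aln_u, aln_p = global_align(uniprot, pdb)
    out, k = [], 0  # k indexes residues of the PDB sequence
    for u, p in zip(aln_u, aln_p):
        state = "?"
        if p != "-":
            state = pdb_ss[k]
            k += 1
        if u != "-":
            out.append(state)
    return "".join(out)


# Toy data: the PDB chain lacks the first two residues and one internal residue.
print(project_onto_uniprot("MKTAYIAKQR", "TAYIKQR", "CHHHEEC"))
```

Positions marked `?` have no structural reference and would be excluded when computing accuracies.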

Basic Analysis of Predictions

P10775 - 2BNH - Ribonuclease Inhibitor

As can be seen from the cartoon picture in the table above, the ribonuclease inhibitor P10775 essentially consists of helices alternating with short beta-sheet elements, which together form a horseshoe-like motif.

Both prediction methods (reprof and psipred) capture this basic structural motif, although reprof has a slight tendency to elongate or shift helices (TODO: pictures, or at least code). <figtable id="P10775">

P10775 - 2BNH Reprof PsiPred
Q3 61.05 91.90
QH 71.94 90.31
QE 21.05 85.96
QC 61.58 95.07
SOV 35-52 68-84
Table ??: State Accuracy and SOV (PDB ID: 2BNH).

</figtable>
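The per-state values in such a table can be computed along the following lines. This is a sketch assuming the common definitions: Q3 is the percentage of residues whose state is predicted correctly, and QH, QE and QC are the percentages of observed helix, strand and coil residues that are predicted correctly. Whether exactly these definitions were used here is not stated on this page; the function name and the toy strings are hypothetical, and SOV is not covered.

```python
def state_accuracies(observed: str, predicted: str) -> dict:
    """Compute Q3 and per-state accuracies (in percent) for three-state strings."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have equal length")
    # Ignore positions without a reference state (e.g. residues missing in the PDB).
    pairs = [(o, p) for o, p in zip(observed, predicted) if o in "HEC"]
    result = {"Q3": 100.0 * sum(o == p for o, p in pairs) / len(pairs)}
    for state in "HEC":
        total = sum(o == state for o, _ in pairs)
        correct = sum(o == p == state for o, p in pairs)
        result["Q" + state] = 100.0 * correct / total if total else float("nan")
    return result


# Toy example, not taken from the actual predictions.
acc = state_accuracies("CCHHHHEEEC", "CCHHHCEEHC")
print({name: round(value, 2) for name, value in acc.items()})
```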

Q08209 - 1AUI - Calcineurin

<figtable id="Q08209">

Q08209 - 1AUI Reprof PsiPred
Q3 45.10 59.54
QH 25.00 52.97
QE 62.32 46.38
QC 63.18 74.06
SOV 35-52 68-84
Table ??: State Accuracy and SOV (PDB ID: 1AUI).

</figtable>

The structure of Calcineurin is more complex and, presumably, its secondary structure elements are much harder to predict. It is difficult to draw firm conclusions from the prediction files alone. Still, one can see that psipred, while it does contain distinct loop regions, seems to predict too few beta-sheet regions. Reprof predicts more beta-sheet regions and keeps the predicted regions shorter in general.

Q9X0E6 - 1O5J - Divalent-cation tolerance protein

Prediction of disordered regions

With IUPred there are three options to predict disorder in a protein:

  • globular domains: reports putative globular (structured) domains of a protein, i.e. regions that do not contain disordered residues
  • short disorder: short disordered segments, such as flexible termini or loops
  • long disorder: long disordered regions of a protein
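For the short and long disorder modes, IUPred reports one score per residue. A minimal sketch of turning such scores into disordered segments, assuming the commonly used cutoff of 0.5; the function name, the `min_len` parameter and the toy scores are hypothetical and not taken from the runs described here.

```python
def disordered_segments(scores, cutoff=0.5, min_len=1):
    """Return 1-based inclusive (start, end) runs of residues with score >= cutoff."""
    segments, start = [], None
    for pos, score in enumerate(scores, start=1):
        if score >= cutoff:
            if start is None:
                start = pos          # a new disordered run begins here
        elif start is not None:
            if pos - start >= min_len:
                segments.append((start, pos - 1))
            start = None
    # Close a run that extends to the C-terminus.
    if start is not None and len(scores) - start + 1 >= min_len:
        segments.append((start, len(scores)))
    return segments


toy_scores = [0.9, 0.8, 0.6, 0.3, 0.2, 0.1, 0.55, 0.2, 0.7, 0.7]
print(disordered_segments(toy_scores))             # all runs
print(disordered_segments(toy_scores, min_len=2))  # drop isolated residues
```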


Aspartoacylase

Running IUPred with the parameter glob resulted in the prediction of one globular domain:

Number of globular domains:     1 
          globular domain       1.    1 - 313

IUPred predicts the N-terminus of the protein (~pos 1-10) to contain disordered residues, which can be seen in <xr id="iupred_short_p45381"/>.

IUPred does not predict any long range disorder for P45381 (see <xr id="iupred_long_p45381"/> ).

<figure id="iupred_short_p45381">
<xr nolink id="iupred_short_p45381"/>
IUPred output for short disorder in Aspartoacylase
</figure>
<figure id="iupred_long_p45381">
<xr nolink id="iupred_long_p45381"/>
IUPred output for long range disorder in Aspartoacylase
</figure>


There was no direct hit for P45381 in DisProt. A PSI-Blast search with the P45381 sequence identified three hits, but all with high E-values. Furthermore, only 40 to 80 residues were aligned, so these hits do not provide any reliable information.

Sequences producing significant alignments        Score (bits)     E Value
        
DP00080                                           29               0.11         
DP00517                                           23               5.3         
DP00102                                           22               8.3

P10775

Running IUPred with the parameter glob resulted in the prediction of one globular domain:

Number of globular domains:     1 
          globular domain       1.    1 - 456 

IUPred predicts the N-terminus (pos 1 - 12) and the C-terminus (452-456) of the protein to contain disordered residues, which can be seen in <xr id="iupred_short_p10775"/>.

IUPred does not predict any long range disorder for P10775 (see <xr id="iupred_long_p10775"/> ).

<figure id="iupred_short_p10775">
<xr nolink id="iupred_short_p10775"/>
IUPred output for short disorder in P10775
</figure>
<figure id="iupred_long_p10775">
<xr nolink id="iupred_long_p10775"/>
IUPred output for long range disorder in P10775
</figure>

There was no direct hit for P10775 in DisProt. Searching via PSI-Blast yielded one significant hit:

Sequences producing significant alignments    Score(bits) E-Value
DP00554                                       123         5e-30   

For this protein, DisProt lists one disordered region at the N-terminus (pos 31-50), which is shown in <xr id="disprot_p10775"/>. IUPred in contrast predicts the first 12 residues to form a disordered region.

<figure id="disprot_p10775">

<xr nolink id="disprot_p10775"/>
Visualization of the DisProt annotation for P10775: only one disordered region is annotated, from pos 31-50

</figure>

Q08209

Running IUPred with the parameter glob resulted in the prediction of one globular domain. Since the protein is 521 residues long and the predicted globular domain ends at position 446, the C-terminal part of the protein lies outside the globular domain and presumably contains disordered regions.

Number of globular domains:     1 
          globular domain       1.    5 - 446 

IUPred predicts the N-terminus (pos 1 - 20) and the C-terminus (460-521) of the protein to contain disordered residues, which can be seen in <xr id="iupred_short_q08209"/>.

Although IUPred predicts many C-terminal residues to show short-range disorder, it does not predict any long-range disorder for Q08209 (see <xr id="iupred_long_q08209"/>).

<figure id="iupred_short_q08209">
<xr nolink id="iupred_short_q08209"/>
IUPred output for short disorder in Q08209
</figure>
<figure id="iupred_long_q08209">
<xr nolink id="iupred_long_q08209"/>
IUPred output for long range disorder in Q08209
</figure>


DisProt annotates 31% of the protein as disordered. <xr id="disprot_q08209"/> shows where the annotated disordered regions are located. DisProt characterizes a region starting as early as position 374 as disordered, whereas IUPred predicts disorder only from residue 460 onwards.

<figure id="disprot_q08209">

<xr nolink id="disprot_q08209"/>
Visualization of the DisProt annotation for Q08209: 31% of the protein is annotated as disordered

</figure>

Q9X0E6

Running IUPred with the parameter glob resulted in the prediction of one globular domain.

Number of globular domains:     1 
          globular domain       1.    1 - 101 

IUPred predicts only a few residues at the N- and C-terminus to be disordered and does not predict any long-range disordered regions.

<figure id="iupred_short_q9x0e6">
<xr nolink id="iupred_short_q9x0e6"/>
IUPred output for short disordered segments in Q9X0E6.
</figure>
<figure id="iupred_long_q9x0e6">
<xr nolink id="iupred_long_q9x0e6"/>
IUPred output for long disordered segments in Q9X0E6.
</figure>

No direct hits were found in DisProt. Searching via PSI-Blast did not work, and a Smith-Waterman search resulted only in minimal alignments of about 20 AA. For the hits found, no disordered regions are annotated in DisProt.

Transmembrane Helix Prediction

We analyzed the prediction of transmembrane helices (TMH) for the proteins listed in <xr id="table_TMH_info"/> and for our protein Aspartoacylase. Besides Polyphobius, we also examined the results of two other TMH predictors, namely TMHMM and PHDhtm.

Information on Proteins

<figtable id="table_TMH_info"> <xr nolink id="table_TMH_info"/> Information on the proteins used for the evaluation of different TMH prediction methods.

Identifier | P35462 | Q9YDF8 | P47863
Protein | D(3) dopamine receptor | Voltage-gated potassium channel | Aquaporin-4
Organism | Homo sapiens (Human) | Aeropyrum pernix | Rattus norvegicus (Rat)
Sequence length | 400 | 295 | 323
Subcellular location | Cell membrane; Multi-pass membrane protein | Cell membrane; Multi-pass membrane protein | Membrane; Multi-pass membrane protein
PDB Identifier | 3PBL | 1ORQ | 2D57
Structure | CD 3PBL.jpg | CD 1ORQ.jpg | CD 2D57.jpg

</figtable>

Aspartoacylase

As expected, the TMH prediction for our protein yielded no transmembrane helices; all residues were predicted to be cytoplasmic.


P35462

For P35462 there is only one structure listed in UniProt: 3PBL. For this structure, OPM and PDBTM list 7 TMH, which is the same number of TMH as in the UniProt annotation for P35462. There is only a slight difference in the localization of the TMH: the boundaries in these three references usually differ by about 1-4 amino acid residues.

All prediction methods yield the same number of TMH. Furthermore, Polyphobius, TMHMM and PHDhtm predict the location of the TMH very accurately, with only small deviations from the reference annotations of UniProt, PDBTM and OPM.


In <xr id="table_3pbl"/> the exact localization of the TMH is listed for the reference sources UniProt, PDBTM and OPM as well as for the prediction methods Polyphobius, TMHMM and PHDhtm. In <xr id="CD_tm_3pbl"/> the length distribution of the predicted and annotated TMH is depicted. One can see that PDBTM in general finds shorter TMH, whereas Polyphobius and OPM find longer helices. Furthermore, the location of the TMH within the sequence is visualized in <xr id="vis_3pbl"/>.


<figtable id="table_3pbl" >

<xr nolink id="table_3pbl"/> AA Position of the predicted/annotated TMH for different methods/sources
UniProt PDBTM OPM Polyphobius TMHMM PHDhtm
33-55 35-52 34-52 30-55 32-54 31-55
66-88 68-84 67-91 66-88 67-89 65-90
105-126 109-123 101-126 105-126 104-126 101-130
150-170 152-166 150-170 150-170 150-172 151-170
188-212 191-206 187-209 188-212 192-214 188-213
330-351 334-347 330-351 329-352 331-353 331-353
367-388 368-382 363-386 367-386 368-390 362-387

</figtable>
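The observation about helix lengths can be checked directly from the table. The following sketch uses the PDBTM, OPM and Polyphobius columns copied from the table above; the variable and function names are made up for illustration.

```python
from statistics import mean

# Helix boundaries (1-based, inclusive) copied from the table for P35462 / 3PBL.
HELICES = {
    "PDBTM": [(35, 52), (68, 84), (109, 123), (152, 166),
              (191, 206), (334, 347), (368, 382)],
    "OPM": [(34, 52), (67, 91), (101, 126), (150, 170),
            (187, 209), (330, 351), (363, 386)],
    "Polyphobius": [(30, 55), (66, 88), (105, 126), (150, 170),
                    (188, 212), (329, 352), (367, 386)],
}


def helix_lengths(segments):
    """Length of each helix in residues, counting both boundary positions."""
    return [end - start + 1 for start, end in segments]


for source, segments in HELICES.items():
    print(f"{source:12s} mean length {mean(helix_lengths(segments)):.1f}")
```

With these boundaries the mean helix length is about 15.7 residues for PDBTM, 22.9 for OPM and 23.0 for Polyphobius.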


<figure id="CD_tm_3pbl">
<xr nolink id="CD_tm_3pbl"/>
Length distribution for the TMH annotated in PDBTM and OPM and the predicted TMH with Polyphobius. PDBTM in general lists shorter TMH.
</figure>
<figure id="vis_3pbl">
<xr nolink id="vis_3pbl"/>
Visualization of the location of the annotated and predicted TMH for P35462(3PBL)
</figure>

P47863

In UniProt there are several structures listed for P47863:

  • 2D57 X-ray 3.20 A
  • 2ZZ9 X-ray 2.80 A
  • 3IYZ electron microscopy 10.00 A

Since 2ZZ9 is a mutant, we decided to use 2D57 as a reference structure with OPM and PDBTM.

Interestingly, OPM lists 8 TMH for P47863, whereas PDBTM agrees with the UniProt annotation and lists 6 TMH. Yet, the two additional helices in OPM are rather short (<10 AA) and correspond to two loop segments in the PDBTM annotation.

Just as there is disagreement between the reference sources, the different prediction methods yield deviating results. Polyphobius and TMHMM predict 6 helices, which correspond to the 6 helices listed in UniProt, PDBTM and OPM. PHDhtm finds only 5 helices, of which helix 2 is about 60 amino acid residues long and matches helices 2 and 3 found by the other methods. This long helix also incorporates the loop region annotated in PDBTM and the additional helix listed in OPM. PHDhtm thus merged these 3 structural elements into one helical region.

In <xr id="table_2D57"/> the exact localization of the TMH is listed for the reference sources UniProt, PDBTM and OPM as well as for the prediction methods Polyphobius, TMHMM and PHDhtm. In <xr id="CD_tm_2D57"/> the length distribution of the TMH predicted with Polyphobius and of the annotated TMH is depicted. One can see that PDBTM in general finds shorter helices, whereas OPM and Polyphobius find longer ones. Furthermore, the location of the TMH within the sequence is visualized in <xr id="vis_2D57"/>.


<figtable id="table_2D57" >

<xr nolink id="table_2D57"/> AA Position of the predicted/annotated TMH for different methods/sources
UniProt | PDBTM | OPM | Polyphobius | TMHMM | PHDhtm
37-57 | 39-55 | 34-56 | 34-58 | 33-55 | 34-56
65-85 | 72-89 | 70-88 | 70-91 | 70-92 | 70-137
- | 95-106(loop) | 98-107 | - | - | 70-137(cont)
116-136 | 116-133 | 112-136 | 115-136 | 112-134 | 70-137(cont)
156-176 | 158-177 | 156-178 | 156-177 | 154-176 | 156-176
185-205 | 188-205 | 189-203 | 188-208 | 189-211 | 190-210
- | 209-222(loop) | 214-223 | - | - | -
232-252 | 231-248 | 231-252 | 231-252 | 231-253 | 224-250

</figtable>
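A merged prediction such as the long PHDhtm helix can be detected automatically by checking which predicted helix overlaps more than one reference helix. The sketch below uses the UniProt and PHDhtm boundaries from the table above; the function names and the minimum overlap of 5 residues are arbitrary assumptions.

```python
# Boundaries (1-based, inclusive) copied from the table for P47863 / 2D57.
UNIPROT = [(37, 57), (65, 85), (116, 136), (156, 176), (185, 205), (232, 252)]
PHDHTM = [(34, 56), (70, 137), (156, 176), (190, 210), (224, 250)]


def overlap(a, b):
    """Number of positions shared by two inclusive segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def merged_helices(predicted, reference, min_overlap=5):
    """Map each predicted helix that covers several reference helices to those helices."""
    merged = {}
    for pred in predicted:
        hits = [ref for ref in reference if overlap(pred, ref) >= min_overlap]
        if len(hits) > 1:
            merged[pred] = hits
    return merged


print(merged_helices(PHDHTM, UNIPROT))
```

For these boundaries only the PHDhtm helix 70-137 is reported, overlapping the UniProt helices 65-85 and 116-136.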


<figure id="CD_tm_2D57">
<xr nolink id="CD_tm_2D57"/>
</figure>
<figure id="vis_2D57">
<xr nolink id="vis_2D57"/>
</figure>

Q9YDF8

For Q9YDF8, UniProt lists annotations for 6 TM regions and 2 intramembrane regions, as well as four different structures:

  • 1ORQ X-ray 3.20 A 31-253
  • 1ORS X-ray 1.90 A 33-160
  • 2A0L X-ray 3.90 A 20-259
  • 2KYH NMR - 19-160

Since only residues 33-160 have been crystallized in 1ORS, we decided to use 1ORQ for comparison with the prediction methods' output.

For Q9YDF8, Polyphobius did not find any homologues with the blast search. Therefore, no homology information could be used for the TMH prediction. The TMH prediction by Polyphobius in general coincides with the UniProt annotation: Polyphobius finds 7 TMH, and their overlap with the TMH listed in UniProt is large. However, OPM and PDBTM list very diverse results. There is only a consensus on TMH 5 and 7. When comparing the OPM annotations for the two structures 1ORQ and 1ORS, one finds tremendous differences:

  • 1ors: C - Tilt: 19° - Segments: 1(25-46), 2(55-78), 3(86-97), 4(100-107), 5(117-148)
  • 1orq: C - Tilt: 31° - Segments: 1(153-172), 2(183-195), 3(207-225)

Yet, if one considers the sequence shift of 13 AA between the 1ORQ PDB sequence and the Q9YDF8 UniProt sequence (see <xr id="seq_shift"/>), both annotations together represent the TMH identified with Phobius and the TMH annotated in UniProt. The same holds for the PDBTM annotations.

<figure id="seq_shift"> 1orq seq shift.png</figure>
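The renumbering can be sketched as follows, using the OPM segments listed above. The offset of 13 is taken from the text; applying it to the 1ORS segments as well, and the direction of the shift (PDB position + 13 = UniProt position), are assumptions that happen to reproduce the OPM column of the table for Q9YDF8 further down. All names are illustrative.

```python
PDB_TO_UNIPROT_OFFSET = 13  # sequence shift stated in the text

# OPM segments in PDB numbering, as listed above.
OPM_SEGMENTS = {
    "1ors": [(25, 46), (55, 78), (86, 97), (100, 107), (117, 148)],
    "1orq": [(153, 172), (183, 195), (207, 225)],
}


def to_uniprot(segments, offset=PDB_TO_UNIPROT_OFFSET):
    """Shift inclusive (start, end) segments from PDB to UniProt numbering."""
    return [(start + offset, end + offset) for start, end in segments]


# Combine both structures into one list of helices in UniProt numbering.
combined = sorted(to_uniprot(OPM_SEGMENTS["1ors"]) + to_uniprot(OPM_SEGMENTS["1orq"]))
for start, end in combined:
    print(f"{start}-{end}")
```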

In general, the three analyzed TMH prediction methods yield comparable results that agree with the annotated TMH locations (as far as the annotations agree with each other). Stronger deviations can be found for the prediction of amino acid positions 100-160. Here, Polyphobius predicts two TMH, TMHMM finds only the first helix, whereas PHDhtm detects one long helix spanning 42 residues. Since the annotation for this region differs between UniProt, OPM and PDBTM, it is hard to decide which method shows the most accurate result.


In <xr id="table_1orq"/> the exact localization of the TMH is listed for the reference sources UniProt, PDBTM and OPM as well as for the prediction methods Polyphobius, TMHMM and PHDhtm. In <xr id="CD_tm_Q9YDF8"/> the length distribution of the TMH predicted with Polyphobius and of the annotated TMH is depicted. One can see that PDBTM in general finds shorter helices, whereas OPM and Polyphobius find longer ones. Furthermore, the location of the TMH within the sequence is visualized in <xr id="vis_1orq"/>.

<figtable id="table_1orq" >

<xr nolink id="table_1orq"/> AA Position of the predicted/annotated TMH for different methods/sources
UniProt PDBTM OPM(=1ors&1orq) Polyphobius TMHMM PHDhtm
39 – 63 34-65(1ORS:40-63) 38-59 39-61 42-60 42-64
68 – 92 70-93(1ORS:68-88) 68-91 68-88 68-87 69-88
99-110 42-60
109 – 125 (1ORS:101-120) 113-120 108-129 107-129 107-149
129 – 145 (1ORS:131-155) 130-161 137-157 107-149(cont)
160 – 184 164-184 166-185 163-184 162-184 162-181
196 – 208(intramembrane) 196-208 196-213 199-218 197-212
222 – 253 222-249 220-238 224-244 225-244 220-247

</figtable>

<figure id="CD_tm_Q9YDF8">
<xr nolink id="CD_tm_Q9YDF8"/>
</figure>
<figure id="vis_1orq">
<xr nolink id="vis_1orq"/>
</figure>

Signal peptides

Checking for possible confusion between TMH and signal peptides with SignalP4.0

SignalP4.0 prediction for P35462: no signal peptide was found
SignalP4.0 prediction for Q9YDF8: no signal peptide was found
SignalP4.0 prediction for P47863: no signal peptide was found

Signal Peptide Prediction

Information on Proteins

<figtable id="table_signalp_info"> <xr nolink id="table_signalp_info"/> Information on the proteins used for the prediction of signal peptides.

Identifier | P02768 | P11279 | P47863
Protein | Serum albumin | Lysosome-associated membrane glycoprotein 1 | Aquaporin-4
Organism | Homo sapiens (Human) | Homo sapiens (Human) | Rattus norvegicus (Rat)
Sequence length | 609 | 417 | 323
Subcellular location | Secreted | Lysosome membrane, Single-pass type I membrane protein | Membrane; Multi-pass membrane protein
PDB Identifier | 1E7I | - | 2D57
Structure | CD 1E7I.jpg | - | CD 2D57.jpg

</figtable>


P02768

<figure id="signalp_p02768">

<xr nolink id="signalp_p02768"/>

</figure>

SignalP3.0 NN predicts a signal peptide for P02768 with a cleavage site between residues 18 and 19 (AYS-RG), with a value of 0.880 at a cutoff of 0.43.

SignalP3.0 HMM has a max. cleavage site probability of 0.759 between pos. 18 and 19.

SignalP4.0 predicts the cleavage site between pos. 18 and 19: AYS-RG with D=0.848


In <xr id="signalp_p02768"/> you can see the graphical output of the SignalP prediction for P02768.


Polyphobius also predicts the signal peptide:

  • N-REGION: 1 - 2
  • H-REGION: 3 - 13
  • C-REGION: 14 - 18
  • NON CYTOPLASMIC: 19 - 609


We also used TargetP to predict the localization of this secreted protein. TargetP can identify the presence of a N-terminal presequence for chloroplast transit peptides (cTP), mitochondrial targeting peptides (mTP) or secretory pathway signal peptides (SP).

TargetP predicts P02768 to be a secretory protein (Loc = S) with a medium reliability (RC=3) and predicts the signal peptide to be 18 residues long. This is in accordance with SignalP (Target P also uses SignalP for cleavage site predictions).

Name                  Len            mTP     SP  other  Loc  RC  TPlen
----------------------------------------------------------------------
sp_P02768_ALBU_HUMAN  609          0.380  0.873  0.013   S    3     18
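
The summary table printed by TargetP is whitespace-separated and easy to process further. A small parsing sketch, assuming the column layout shown above; the function name is hypothetical, and the handling of a non-numeric TPlen entry is a defensive assumption rather than documented behaviour.

```python
TARGETP_OUTPUT = """\
Name                  Len            mTP     SP  other  Loc  RC  TPlen
----------------------------------------------------------------------
sp_P02768_ALBU_HUMAN  609          0.380  0.873  0.013   S    3     18
"""


def parse_targetp(text):
    """Parse the whitespace-separated TargetP summary table into dictionaries."""
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.startswith("-")]
    header, records = rows[0], []
    for fields in rows[1:]:
        rec = dict(zip(header, fields))
        rec["Len"] = int(rec["Len"])
        for key in ("mTP", "SP", "other"):
            rec[key] = float(rec[key])
        rec["RC"] = int(rec["RC"])
        # Assumption: a non-numeric TPlen means no presequence length was reported.
        rec["TPlen"] = int(rec["TPlen"]) if rec["TPlen"].isdigit() else None
        records.append(rec)
    return records


record = parse_targetp(TARGETP_OUTPUT)[0]
if record["TPlen"] is not None:
    print(f"{record['Name']}: Loc={record['Loc']}, RC={record['RC']}, "
          f"cleavage between {record['TPlen']} and {record['TPlen'] + 1}")
```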

P11279

<figure id="signalp_p11279">

<xr nolink id="signalp_p11279"/>

</figure>

SignalP3.0 NN predicts the cleavage site between pos. 28 and 29: ASA-AM with a value of 0.931 and a cutoff at 0.43.

SignalP3.0 HMM has max. cleavage site probability of 0.847 between pos. 28 and 29.

SignalP4.0 predicts the cleavage site between pos. 28 and 29: ASA-AM with D=0.952

In <xr id="signalp_p11279"/> you can see the graphical output of the SignalP prediction for P11279.


Polyphobius also predicts the signal peptide:

  • N-REGION: 1 - 10
  • H-REGION: 11 - 22
  • C-REGION: 23 - 28
  • NON CYTOPLASMIC: 29 - 381
  • TRANSMEM: 382 - 405
  • CYTOPLASMIC: 406 - 417


P47863

<figure id="signalp_p47863">

<xr nolink id="signalp_p47863"/>

</figure>

SignalP3.0 NN reaches no consensus on whether P47863 has a signal peptide or not. The most likely cleavage site is between pos. 54 and 55: SVG-ST.

SignalP3.0 HMM gives a signal peptide probability of 0.723 and a max. cleavage site probability of 0.533 between pos. 56 and 57.

SignalP4.0 predicts no signal peptide, with a value of D=0.154 at a D-cutoff of 0.500.

In <xr id="signalp_p47863"/> you can see the graphical output of the SignalP prediction for P47863.

The prediction of Polyphobius for TMH was already discussed in section "TMH prediction."







Aspartoacylase

<figure id="signalp_p45381">

<xr nolink id="signalp_p45381"/>

</figure>

In SignalP3.0 NN, the majority of the scores predict no signal peptide for P45381. Only the C-score is above the cutoff and predicts a cleavage site at pos. 23.

SignalP3.0 HMM predicts no signal peptide (probability of 0.000 for a signal peptide).

SignalP4.0 predicts no signal peptide, with a value of D=0.124 at a D-cutoff of 0.450.

The prediction of Polyphobius for TMH was already discussed in section "TMH prediction".

In <xr id="signalp_p45381"/> you can see the graphical output of the SignalP prediction for P45381.









GO terms and Pfam

Pfam

AstE_AspA family: Succinylglutamate desuccinylase / Aspartoacylase family