Sequence-based predictions TSD

From Bioinformatikpedia
Revision as of 20:17, 20 May 2012 by Reeb (talk | contribs) (Disorder)

If not noted otherwise, the sequence for all predictions is the HEXA Reference sequence (Uniprot P06865). A protocol for this task can be found here.

Secondary structure

Proteins: Ribonuclease inhibitor P10775, CutA Q9X0E6, CAM-PRP catalytic subunit Q08209
Ribonuclease inhibitor and CutA are located in the cytoplasm, whereas the CAM-PRP catalytic subunit is located in the nucleus.

TODO maybe extend this a bit and change names, CutA and CAM-PRP tell me exactly nothing about what these are.

DSSP and handling of differing sequences

DSSP builds upon 3D structures, so a PDB entry has to be selected for every given Uniprot entry. The chosen mapping is 2bnh for P10775, 1kr4 for Q9X0E6, 1aui for Q08209 and 2gjx for P06865. This creates an additional problem: all other resources base their predictions on the Uniprot sequence, whereas the sequence used by DSSP is inferred from the PDB file and may differ significantly because of changes made by the experimentalists who solved the structure. There are automated ways to resolve this. PDB's new mmCIF files provide a residue-level mapping between the sequence inferred from the ATOM records and the SEQRES record sequence, and SIFTS provides a residue-level mapping between SEQRES and Uniprot. Both resources are automated, however, and while they were surely developed with great care, inspecting the sequences by hand was considered preferable because it also directly points out interesting parts. Therefore manual alignments were performed and, where applicable, special cases are noted in the text.
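The kind of residue-level mapping discussed above can be sketched in a few lines. This is only an illustration and not the procedure used for this page (the alignments here were done manually). It uses Python's general-purpose `difflib.SequenceMatcher` rather than a biological aligner, and the two toy sequences and the function names are made up for the example.

```python
from difflib import SequenceMatcher


def map_residues(pdb_seq: str, uniprot_seq: str) -> dict[int, int]:
    """Map 1-based PDB positions to 1-based Uniprot positions.

    Only residues inside identical blocks are mapped; substitutions and
    insertions in the PDB sequence stay unmapped.
    """
    matcher = SequenceMatcher(None, pdb_seq, uniprot_seq, autojunk=False)
    mapping = {}
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            mapping[block.a + offset + 1] = block.b + offset + 1
    return mapping


def unmapped_positions(pdb_seq: str, mapping: dict[int, int]) -> list[int]:
    """Return the 1-based PDB positions that have no Uniprot partner."""
    return [i for i in range(1, len(pdb_seq) + 1) if i not in mapping]


# Hypothetical toy sequences: the "PDB" sequence lacks both termini and
# carries one substitution (I -> M) compared to the "Uniprot" sequence.
uniprot_toy = "MKTAYIAKQRQISFVKSHFSRQ"
pdb_toy = "AYIAKQRQMSFVKSH"

mapping = map_residues(pdb_toy, uniprot_toy)
special_cases = unmapped_positions(pdb_toy, mapping)
print(mapping)
print("positions to inspect by hand:", special_cases)
```

The unmapped positions are exactly the places where a look at the sequence is worthwhile, for example engineered mutations or residues missing from the structure.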

Comparison of predictions and annotation

In the following, the secondary structure predictions of PSIPred and reprof are compared to the 3D-structure based annotation of DSSP. The Uniprot annotation was not included, since its source is DSSP <ref name="uniprotsecstruct">http://www.uniprot.org/manual/helix</ref>. To compare the methods, a common alphabet had to be established. All outputs were normalized to contain only the residue types H (Helix), E (Beta-strand) or C (Coil). Details on the parsing can be found in the protocol. The figures were created using the Latex package cpssp <ref name="cpssp">http://www.ctan.org/pkg/cpssp</ref>; details on these can also be found in the protocol.
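The normalization to a three-letter alphabet and a simple per-residue agreement score can be sketched as follows. The reduction table is an assumption (a commonly used one that maps H, G and I to H, E and B to E, and everything else to C); the mapping actually used for this page is described in the protocol and may differ. The strings are toy data.

```python
# Assumed 8-state to 3-state reduction; not taken from the protocol.
DSSP_TO_THREE_STATE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(states: str) -> str:
    """Collapse a DSSP state string to H/E/C; unknown symbols become C."""
    return "".join(DSSP_TO_THREE_STATE.get(s, "C") for s in states)


def q3(reference: str, prediction: str) -> float:
    """Fraction of positions where both three-state strings agree."""
    if len(reference) != len(prediction):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    matches = sum(r == p for r, p in zip(reference, prediction))
    return matches / len(reference)


# Toy example: "-" stands for a residue without DSSP assignment.
dssp_toy = "-HHHHGGGTTEEEEBS-"
predicted_toy = "CHHHHHHCCCEEEECCC"

reference_toy = reduce_dssp(dssp_toy)
agreement = q3(reference_toy, predicted_toy)
print(reference_toy)
print(f"Q3 = {agreement:.3f}")
```

A single number like this hides where the disagreements lie, which is why the comparisons below are discussed on the basis of the figures instead.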

P06865

<xr id="fig:ssP06865"/> shows the comparison of assigned secondary structure for the Hex A alpha subunit. Several things are noticeable here. Firstly, there are several short helices in the DSSP annotation that neither PSIPred nor reprof predicted. These short regions are not, as one might first assume, less common helix types; a check of the original DSSP output confirms that they are 'normal' (i+4->i) alpha-helices. This issue cannot be observed for beta-sheets, where the length does not show an immediate impact on prediction performance. Comparing PSIPred and reprof, it is easy to see that PSIPred correlates very well with the DSSP annotation, while reprof often differs from these two. The disagreement is noticeably stronger in the catalytic domain of the protein (cf. <xr id="fig:pfam2gjx"/>) and mostly appears in the prediction of beta-sheets instead of alpha-helices. This is surprising but also means that both methods correctly predict the central TIM barrel in this domain.

<figure id="fig:ssP06865">

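Secondary structure of P06865 (Hex A alpha subunit): DSSP annotation compared to the PSIPred and reprof predictions.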

</figure>

P10775

<xr id="fig:ssP10775"/> shows the comparison of assigned secondary structure for the ribonuclease inhibitor. Both prediction methods largely agree with the DSSP annotation, which is not entirely surprising given the peculiar horseshoe-like structure of alternating alpha-helices and beta-sheets, which hardly leaves room for exchanging or removing a secondary structure element. Nonetheless, there are differences, most noticeably the tendency of reprof to either elongate helices or shift them to the left or right compared to the DSSP and PSIPred annotations. As a consequence, reprof mistakes some of the beta strand-turn-alpha helix motifs for something that would be more accurately described as an alpha helix-turn-alpha helix motif. Given that these inhibitors are known to interact very tightly with the ribonucleases <ref name="Dickson2010">Dickson,K. and Haigis,M. (2005) Ribonuclease inhibitor: structure and function. Progress in nucleic acid research and, 6603, 1-23.</ref>, such a change would be expected to impair function significantly, which makes this a very important finding.

<figure id="fig:ssP10775">

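Secondary structure of P10775 (ribonuclease inhibitor): DSSP annotation compared to the PSIPred and reprof predictions.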

</figure>

Q9X0E6

<xr id="fig:ssQ9X0E6"/> shows the comparison of assigned secondary structure for the 'cation tolerance protein' CutA, which reinforces the behaviour observed so far. While PSIPred's prediction is very close to the DSSP annotation, reprof does not predict some of the beta-sheets and tends to extend helices. Again this is important to note, since the beta-sheet arrangement is thought to be essential for formation of the trimeric biological assembly <ref name="Savchenko2009">Savchenko,A. and Skarina,T. (2004) X-Ray Crystal Structure of CutA From Thermotoga maritima at 1.4 Å Resolution. Proteins: Structure,, 54, 162-165.</ref>.

<figure id="fig:ssQ9X0E6">

Secondary structure of Q9X0E6 (CutA): DSSP annotation compared to the PSIPred and reprof predictions.

</figure>

Q08209

Finally, <xr id="fig:ssQ08209"/> shows the comparison of assigned secondary structure for the phosphatase. This protein seems comparatively hard to predict, as both PSIPred and reprof make several errors. PSIPred does not predict any structural element where DSSP annotates none; however, some elements are assigned the opposite type and several are missed entirely. Reprof, on the other hand, correctly predicts some of these but misses others and again mistakes several long alpha-helices for beta-sheets.

<figure id="fig:ssQ08209">

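Secondary structure of Q08209 (CAM-PRP catalytic subunit): DSSP annotation compared to the PSIPred and reprof predictions.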

</figure>

Conclusion

In conclusion, a general agreement with the DSSP annotation is always achieved at least by PSIPred. The errors of this prediction method mostly remain small and can be considered minor. Reprof shows much more detrimental errors, namely mistaking alpha-helices for beta-sheets or elongating and shifting predicted alpha-helices. Most of these errors can be shown to be unacceptable in terms of the known function of the particular structural element. While the last entry hinted that both methods have almost equally large problems with some proteins, PSIPred was much more convincing overall.

Disorder

The protein disorder prediction was performed with IUPred. The option long was chosen as prediction type, as it is the most suitable for finding disordered regions that are long enough (>30 residues) to have an impact on protein structure.

<figtable id="tbl:iupred">

Iupred P06865.png
Iupred P10775.png
Iupred Q9X0E6.png
Iupred Q08209.png
Table : Protein disorder predictions for P06865, P10775, Q9X0E6 and Q08209 (top left to bottom right)

</figtable>
In the plots in <xr id="tbl:iupred"/> the calculated disorder tendency is displayed for every residue. All predictions for the proteins named above show a fluctuating tendency which is overall lower than 0.5.
The prediction for CutA (<xr id="tbl:iupred"/>, bottom left) has the lowest and least fluctuating curve. Ribonuclease inhibitor (<xr id="tbl:iupred"/>, top right) and HexA (<xr id="tbl:iupred"/>, top left) have very similar profiles. According to the prediction, none of these 3 proteins contains a disordered region. This agrees with the resolved structures of these proteins: for the ribonuclease inhibitor, the leucine-rich repeats leave very little room for disorder, and CutA is very small, with the only coiled regions directly connecting the ordered regions (cf. TODO pic here, also reference in task before).

The only protein exceeding the 0.5 cutoff is the CAM-PRP catalytic subunit (<xr id="tbl:iupred"/>, bottom right), which shows signs of disorder at the beginning of the sequence and towards the end. The first region is about 10 residues long and the latter begins roughly at residue 425 and spans 100 amino acids. This prediction can be validated with the Disprot annotations of CAM-PRP. The assigned disordered regions are 1 - 13, 390 - 414 (CaM-binding domain), 374 - 468, 469 - 486 (Autoinhibitory region) and 487 - 521. All were detected by X-ray crystallography.

The first disordered region was detected well by IUPred. The region starting at position 374 is hinted at by a peak in the prediction; however, the signal alone does not support a disordered region of the size annotated in Disprot. Furthermore, the IUPred results hint at the distinction between the disordered region up to position 486 and the disorder starting at 487, as the curve rises steeply in this region.
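Reading regions off a per-residue disorder curve, as done above by eye, can be sketched as a simple run-length search over the scores. The cutoff of 0.5 is the one used in the text; the function name, the `min_length` parameter and the score values are assumptions for illustration and are not real IUPred output.

```python
def disordered_regions(scores, cutoff=0.5, min_length=1):
    """Return 1-based inclusive (start, end) runs with score > cutoff.

    Runs shorter than min_length residues are dropped.
    """
    regions = []
    start = None
    for position, score in enumerate(scores, start=1):
        if score > cutoff:
            if start is None:
                start = position  # a new run begins here
        elif start is not None:
            if position - start >= min_length:
                regions.append((start, position - 1))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(scores) - start + 1 >= min_length:
        regions.append((start, len(scores)))
    return regions


# Made-up toy scores, one value per residue.
toy_scores = [0.7, 0.6, 0.8, 0.3, 0.2, 0.1, 0.55, 0.4, 0.9, 0.9, 0.95]

all_runs = disordered_regions(toy_scores)
long_runs = disordered_regions(toy_scores, min_length=3)
print(all_runs)
print(long_runs)
```

Raising `min_length` filters out isolated peaks, which mirrors the reasoning that only sufficiently long stretches are considered disordered regions.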

In conclusion, IUPred supplies a very accurate and reliable prediction for the given protein set.


Transmembrane helices

Proteins: Dopamine D3 receptor P35462, KvAP Q9YDF8, AQP-4 P47863
Dopamine D3 receptor, KvAP and AQP-4 are multi-pass membrane proteins.
<figtable id="tab:gopetgo">

Positions of transmembrane helices
Drd3 33–35 66–88 105–126 150–170 188–212 330–351 367–388
KvAP 39–63 68–92 109–125 129–145 160–184 222–253
AQP-4 37–57 65–85 116–136 156–176 185–205 232–252

Table TODO: Assigned transmembrane regions in Uniprot </figtable>

Signal peptides

Proteins: Serum albumin P02768, LAMP-1 P11279, AQP-4 P47863
HEXA, LAMP-1 and serum albumin contain a signal peptide. LAMP-1 is a membrane protein which passes the membrane with one helix. Serum albumin, the main protein of plasma, is a secreted extracellular protein. AQP-4 is a multi-pass membrane protein which forms a water-specific channel and functions in transport.

SignalP

The displayed predictions were made with SignalP version 4.0.
SignalP employs 3 main scores for the prediction of signal peptides: C, S and Y. The S-score is the actual signal peptide prediction, with high scores indicating that the corresponding amino acid is part of a signal peptide and low scores indicating that it is part of the mature protein. The C-score is the cleavage site score, which marks the most likely cleavage site where it is significantly high. (When a cleavage site position is referred to by a single number, the number indicates the first residue in the mature protein.) The Y-score is derived from the C-score combined with the S-score and gives a better cleavage site prediction than the raw C-score alone; Y-max is its maximum.
For non-secretory proteins all scores are supposed to be very low.
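How the three scores relate can be sketched as follows. The sketch uses the form commonly given in the SignalP literature, where the Y-score is the geometric average of the C-score and the slope of the S-score. The window size `d = 3`, the clamping of negative slopes to zero, the handling of the sequence ends and the toy score values are assumptions for illustration; SignalP 4.0 itself may differ in detail.

```python
import math


def y_scores(c_scores, s_scores, d=3):
    """Combine C-scores with the local drop of the S-score.

    For position i the slope is mean(S over the d residues before i)
    minus mean(S over the d residues starting at i). Negative slopes
    are clamped to zero (assumption) so the square root is defined.
    """
    if len(c_scores) != len(s_scores):
        raise ValueError("score lists must have the same length")
    result = []
    for i in range(len(c_scores)):
        before = s_scores[max(0, i - d):i]
        after = s_scores[i:i + d]
        if not before or not after:
            result.append(0.0)  # no slope defined at the very start
            continue
        slope = sum(before) / len(before) - sum(after) / len(after)
        result.append(math.sqrt(c_scores[i] * max(0.0, slope)))
    return result


# Toy scores: S is high for four residues, then drops; C peaks at the drop.
s_toy = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
c_toy = [0.0, 0.0, 0.1, 0.1, 0.8, 0.1, 0.0, 0.0]

y_toy = y_scores(c_toy, s_toy)
# 1-based position of Y-max, i.e. the first residue of the mature protein.
cleavage_position = max(range(len(y_toy)), key=y_toy.__getitem__) + 1
print(cleavage_position)
```

The combination rewards positions where a high C-score coincides with a steep fall of the S-score, which is why Y tends to be sharper than C alone.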

<figtable id="tbl:signalp">

Sp P06865 HEXA HUMAN.png
Sp P47863 AQP4 RAT TSD.png
Sp P11279 LAMP1 HUMAN TSD.png
Sp P02768 ALBU HUMAN TSD.png
Table : Signal peptide predictions.

</figtable>

<xr id="tbl:signalp"/> displays the results of the SignalP predictions and <xr id="tab:signals"/> compares the predicted signal peptide positions with the validation from the Signal Peptide Website. The additional scores can be viewed here.


<figtable id="tab:signals">

Peptide positions Prediction Validation
HexA 1-22 1-22
Serum Albumin 1-18 1-18
LAMP-1 1-28 1-28
AQP-4 - -

Table TODO: Comparison of signal peptide prediction and the assignment from the Signal Peptide Website. </figtable>


HEXA, LAMP-1 and serum albumin are each correctly predicted to carry one signal peptide at the beginning of the sequence, and AQP-4 is correctly identified as having none. Even the exact positions of the peptides are predicted accurately, so SignalP performs very well on this protein set.



Cross check with PolyPhobius

GO terms

GOpet

<xr id="tab:gopetgo"/> lists the GOpet predictions for the HexA protein. Most predictions are given with a very high confidence. Of the 35 GO terms associated with HexA, GOpet identified 6 correctly, while 2 of the predicted terms are false assignments.

<figtable id="tab:gopetgo">

GO-Term ID Type Confidence GO-Term description Validation
GO:0003824 Molecular function 97% catalytic activity true
GO:0004563 Molecular function 96% beta-N-acetylhexosaminidase activity true
GO:0015929 Molecular function 96% hexosaminidase activity false
GO:0016787 Molecular function 96% hydrolase activity true
GO:0016798 Molecular function 96% hydrolase activity acting on glycosyl bonds true
GO:0004553 Molecular function 96% hydrolase activity hydrolyzing O-glycosyl compounds true
GO:0016799 Molecular function 77% hydrolase activity hydrolyzing N-glycosyl compounds false
GO:0046982 Molecular function 61% protein heterodimerization activity true

Table TODO: GO term prediction from GOpet. </figtable>
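The counts quoted for GOpet can be reproduced directly from the table. The sketch below takes the validation column as given and uses the 35 annotated GO terms mentioned in the text as the reference size; the function and variable names are chosen only for this example.

```python
# (GO term, validation) pairs copied from the GOpet table above.
gopet_predictions = [
    ("GO:0003824", True),
    ("GO:0004563", True),
    ("GO:0015929", False),
    ("GO:0016787", True),
    ("GO:0016798", True),
    ("GO:0004553", True),
    ("GO:0016799", False),
    ("GO:0046982", True),
]


def summarize(predictions, annotated_terms):
    """Count correct and false predictions and derive precision/recall."""
    correct = sum(1 for _, is_valid in predictions if is_valid)
    false = len(predictions) - correct
    return {
        "correct": correct,
        "false": false,
        "precision": correct / len(predictions),
        "recall": correct / annotated_terms,
    }


summary = summarize(gopet_predictions, annotated_terms=35)
print(summary)
```

Seen this way, GOpet is fairly precise on HexA (6 of 8 predicted terms are valid) but recovers only a small part of the 35 annotated terms.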

ProtFun2.0

ProtFun2.0 employs various tools for protein function prediction. A large number of feature prediction servers, such as SignalP, are queried, and the information obtained is integrated into final predictions of the cellular role, enzyme class and selected Gene Ontology categories of the submitted sequence.
The Gene Ontology categories are displayed in <xr id="tab:protfun"/>. The highest probability is 10.5% (Receptor) and all other categories stay below 10%, so HexA is not attributed to any of these GO categories. On closer examination of these classes it becomes clear that none of them matches the function of our protein.
Furthermore, the HexA protein is predicted to be an enzyme, more specifically a ligase (EC 6.-.-.-).
"Cell_envelope" is chosen as the functional category with a probability of over 80%. This prediction seems to be the most accurate, although it is not apparent where this classification comes from or how it can be validated or used further. The GO category prediction can be neglected for the HexA protein.


<figtable id="tab:protfun">

Gene Ontology category Probability
Signal_transducer 8.3%
Receptor 10.5%
Hormone 0.1%
Structural_protein 1.0%
Transporter 2.4%
Ion_channel 1.8%
Voltage-gated_ion_channel 0.2%
Cation_channel 1.0%
Transcription 5.8%
Transcription_regulation 2.6%
Stress_response 4.4%
Immune_response 1.4%
Growth_factor 0.5%
Metal_ion_transport 0.9%

Table TODO: GO term prediction from ProtFun2.0. </figtable>

Pfam

The Pfam-A sequence search reveals two significant Pfam-A domains within the Hex A alpha subunit: The Glycosyl hydrolase family 20, domain 2 and the Glycosyl hydrolase family 20, catalytic domain (see <xr id="fig:pfam"/>).

<figure id="fig:pfam">

HEXA Pfam domains.

</figure>

<figure id="fig:pfam2gjx">

Visualization of Pfam domains.
The Glyco hydro 20b domain is coloured green, the Glyco hydro 20 catalytic domain red and the predicted active site residue is shown as sticks and highlighted in purple. Regions that are not mapped to a Pfam domain are grey.

</figure>

HEXA is almost completely spanned by these two domains. The Glyco hydro 20b domain (green) reaches from position 35 to position 165 and the Glyco hydro 20 domain (red) directly follows up, occupying a region from position 167 to 488. Left unmapped are a mostly coiled region at the beginning of the subunit and a helix followed by another coiled region at the end of the sequence. <xr id="fig:pfam2gjx"/> visualizes the Pfam annotation in the 3D structure of the Hex A alpha subunit.
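The statement that HEXA is almost completely spanned by the two domains can be checked with a short calculation. The domain boundaries are the ones given above; the sequence length of 529 residues for P06865 is not stated on this page and is an assumption taken from the Uniprot entry. The function names are chosen only for this example.

```python
def coverage(domains, length):
    """Fraction of residues 1..length covered by at least one domain."""
    covered = set()
    for start, end in domains:
        covered.update(range(start, end + 1))
    return len(covered) / length


def unmapped_segments(domains, length):
    """Return 1-based inclusive segments not covered by any domain."""
    segments = []
    position = 1
    for start, end in sorted(domains):
        if start > position:
            segments.append((position, start - 1))
        position = max(position, end + 1)
    if position <= length:
        segments.append((position, length))
    return segments


HEXA_LENGTH = 529  # assumption: length of Uniprot P06865
pfam_domains = [(35, 165), (167, 488)]  # Glyco hydro 20b, Glyco hydro 20

fraction = coverage(pfam_domains, HEXA_LENGTH)
gaps = unmapped_segments(pfam_domains, HEXA_LENGTH)
print(f"{fraction:.1%} of the sequence is covered")
print("unmapped:", gaps)
```

Under this assumption the unmapped parts are the N-terminal stretch, the single residue between the two domains and the C-terminal stretch described above.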

Both domain annotations are correct, although they do not seem very specific. The catalytic domain also belongs to the Pfam clan Glyco_hydro_tim, consisting of glycosyl hydrolases that contain a TIM barrel fold. The TIM barrel can be seen in the middle of the catalytic domain in <xr id="fig:pfam2gjx"/>. The clan is very large <ref name="pfamclan">PFAM clan statistics</ref> with 41 members and, amongst others, also contains a domain associated with Fabry disease.

Interestingly, Pfam also infers an active site residue, E323, which is indeed thought to be important for catalytic activity, as already outlined in the introduction.

In conclusion, Pfam of course cannot provide the wealth of information the prediction methods claim to deliver. However, its manual curation and high-quality data, in combination with the recently introduced step towards crowdsourcing via Wikipedia <ref name="Punta2012a">Punta,M. et al. (2012) The Pfam protein families database. Nucleic acids research, 40, D290-301.</ref>, make it an at least equally valuable resource.

References

<references/>