Canavan Task 3 - Sequence-based predictions
- 1 Protocol
- 2 Secondary Structure Prediction
- 2.1 Information on Proteins
- 2.2 Consistent nomenclature and sequence issues
- 2.3 Predictions - Analysis and Comparison
- 3 Prediction of disordered regions
- 4 Transmembrane Helix Prediction
- 5 Signal Peptide Prediction
- 6 GO terms and Pfam
Commands, source code and other methodological issues are kept in the protocol.
Secondary Structure Prediction
Information on Proteins
The following table summarizes information on the three proteins that we used, in addition to our own protein aspartoacylase, to predict properties from sequence alone.
Consistent nomenclature and sequence issues
We mapped the (slightly differing) secondary structure elements of the three prediction methods onto the three common possible states C (Coil), H (Helix), and E (Extended; Beta-Sheet) to make comparison of methods easier.
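The reduction to three states can be sketched as follows. The mapping shown is the commonly used reduction of DSSP's eight states and is an assumption here, since the exact mapping applied to each method's output is documented in the protocol rather than in this section.

```python
# Assumed reduction of DSSP's eight states to three; the exact mapping
# used for this report is not stated in this section.
DSSP_TO_THREE = {
    "H": "H", "G": "H", "I": "H",  # alpha-, 3-10- and pi-helix -> Helix
    "E": "E", "B": "E",            # strand and isolated bridge -> Extended
}


def to_three_state(ss_string):
    """Map each symbol to H, E or C; anything not listed above becomes Coil."""
    return "".join(DSSP_TO_THREE.get(symbol, "C") for symbol in ss_string)


print(to_three_state("HGIEBTS-"))  # HHHEECCC
```

Treating every unlisted symbol as coil keeps the function total, so turns, bends and blanks in the DSSP output need no special handling.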
UniProt vs PDB Sequences
DSSP assigns secondary structure based on given 3D structures of proteins. The PDB entries chosen for the corresponding UniProt sequences can be found in the table above. However, PDB sequences often differ significantly from their corresponding UniProt sequences because of the experimental circumstances under which the structure was solved (missing atoms and residues being the main problem). We therefore performed a pairwise alignment of each PDB sequence with its UniProt sequence to allow for comparison of predictions.
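A minimal sketch of such a pairwise alignment is given below. It is a plain Needleman-Wunsch global alignment with an assumed scoring scheme (match +1, mismatch -1, linear gap -2); the tool and parameters actually used are recorded in the protocol, and the two toy sequences are hypothetical.

```python
def global_align(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning seq_a[i-1] with seq_b[j-1].
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical example: a "PDB" sequence missing residues at both termini.
aligned_uniprot, aligned_pdb = global_align("MKTAYIAKQR", "TAYIAK")
print(aligned_uniprot)
print(aligned_pdb)
```

Gap columns in the aligned PDB sequence mark UniProt residues for which DSSP cannot provide a state, so they have to be excluded when predictions are compared with the DSSP assignment.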
Predictions - Analysis and Comparison
P45381 - 2O53 - Aspartoacylase
Aspartoacylase has a fairly complex structure, consisting of helices, beta-sheets and coiled regions with no apparent order or regularity. We therefore expected secondary structure prediction to be difficult. The table below summarizes the Q3 accuracy, while <xr id="reprof_aspa"/> and <xr id="psipred_aspa"/> visualise the prediction results of the two methods. The UniProt and PDB sequences differ considerably, so judging the performance on Q3 alone would be misleading. The figures show that both methods capture long regions without alignment gaps fairly well.
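For reference, Q3 is the percentage of residues whose predicted state (H, E or C) equals the observed state. The sketch below assumes both strings come from the same alignment and skips every column containing a gap, which is one way to avoid penalising a method for residues missing from the PDB structure. The function name and the example strings are hypothetical.

```python
def q3(observed, predicted, gap="-"):
    """Percentage of aligned, gap-free positions with identical states."""
    pairs = [(o, p) for o, p in zip(observed, predicted)
             if o != gap and p != gap]
    if not pairs:
        raise ValueError("no comparable positions")
    correct = sum(o == p for o, p in pairs)
    return 100.0 * correct / len(pairs)


# Hypothetical example: seven comparable columns, six of them correct.
print(round(q3("HHHEEC-C", "HHCEECCC"), 1))
```

Because gap columns are dropped, two proteins with the same Q3 may have been evaluated on very different fractions of their sequence, which is why the figures are needed alongside the score.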
</figure> <figure id="psipred_aspa">
P10775 - 2BNH - Ribonuclease Inhibitor
As can be seen from the cartoon picture in the table 'Information on proteins', the ribonuclease inhibitor P10775 basically consists of helices alternating with small beta-sheet elements (and very regularly so), resembling a horseshoe motif. Also, there are no gaps in the alignment of the PDB and UniProt sequences.
Both prediction methods (ReProf and PsiPred) capture this basic structural motif well (which is also reflected in the scores), even though ReProf tends to elongate coils or shift helices. See <xr id="reprof_P10775"/> and <xr id="psipred_P10775"/>.
|P10775 - 2BNH||Reprof||PsiPred|
</figure> <figure id="psipred_P10775">
Q08209 - 1AUI - Calcineurin
The structure of Calcineurin is more complex than that of P10775 and, presumably, much harder to predict. This is reflected in both the scores and the figures (<xr id="reprof_Q08209"/> and <xr id="psipred_Q08209"/>). The sequence alignment contains some gaps. Again, PsiPred outperforms ReProf.
|Q08209 - 1AUI||Reprof||PsiPred|
</figure> <figure id="psipred_Q08209">
Q9X0E6 - 1O5J - Divalent-cation tolerance protein
The structure of the 'divalent-cation tolerance protein CutA' is again more regular. Consequently, both methods, especially PsiPred, perform well and manage to capture many of the regularly occurring states. See <xr id="reprof_Q9X0E6"/> and <xr id="psipred_Q9X0E6" />.
|Q9X0E6 - 1O5J||Reprof||PsiPred|
</figure> <figure id="psipred_Q9X0E6">
Prediction of disordered regions
With IUPred there are three options to predict disorder in a protein:
- globular domains: finds globular domains in a protein, i.e. regions that do not contain disordered residues
- short disorder: single residues that might lead to disorder
- long disorder: disordered regions of a protein
For P45381, running IUPred with the parameter glob resulted in the prediction of one globular domain:
Number of globular domains: 1
globular domain 1. 1 - 313
IUPred predicts the N-terminus of the protein to contain disordered residues (~pos 1-10), which can be seen in <xr id="iupred_short_p45381"/>.
IUPred does not predict any long-range disorder for P45381 (see <xr id="iupred_long_p45381"/>).
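Statements such as "residues 1-10 are disordered" can be derived from the per-residue IUPred scores by thresholding. The sketch below assumes a cutoff of 0.5 (the value IUPred conventionally uses to separate disorder from order) and that the scores have already been parsed into a list. The function name and the example scores are hypothetical.

```python
def disordered_regions(scores, threshold=0.5, min_len=1):
    """Return 1-based (start, end) runs of residues scoring >= threshold."""
    regions = []
    start = None
    for pos, score in enumerate(scores, start=1):
        if score >= threshold:
            if start is None:
                start = pos          # a new disordered run begins here
        elif start is not None:
            if pos - start >= min_len:
                regions.append((start, pos - 1))
            start = None
    # Close a run that extends to the C-terminus.
    if start is not None and len(scores) - start + 1 >= min_len:
        regions.append((start, len(scores)))
    return regions


# Hypothetical scores: a disordered N-terminus and one isolated residue.
print(disordered_regions([0.9, 0.8, 0.6, 0.2, 0.1, 0.7, 0.3]))
```

Raising `min_len` filters out isolated residues above the cutoff, which helps when only longer regions are of interest.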
|<figure id="iupred_short_p45381">||<figure id="iupred_long_p45381">|
There were no direct hits for P45381 in DisProt. A PSI-BLAST search with the P45381 sequence identified three hits, but with high E-values. Furthermore, only 40 to 80 residues were aligned, which is why these hits cannot provide any reliable information.
|Sequences producing significant alignments||Score (bits)||E-value|
|DP00080||29||0.11|
|DP00517||23||5.3|
|DP00102||22||8.3|
For P10775, running IUPred with the parameter glob resulted in the prediction of one globular domain:
Number of globular domains: 1
globular domain 1. 1 - 456
IUPred predicts the N-terminus (pos 1 - 12) and the C-terminus (452-456) of the protein to contain disordered residues, which can be seen in <xr id="iupred_short_p10775"/>.
IUPred does not predict any long-range disorder for P10775 (see <xr id="iupred_long_p10775"/>).
|<figure id="iupred_short_p10775">||<figure id="iupred_long_p10775">|
There was no direct hit for P10775 in DisProt. Searching via PSI-BLAST yielded one significant hit:
|Sequences producing significant alignments||Score (bits)||E-value|
|DP00554||123||5e-30|
For this protein, DisProt lists one disordered region at the N-terminus (pos 31-50), which is shown in <xr id="disprot_p10775"/>. In contrast, IUPred predicts the first 12 residues to form a disordered region.
For Q08209, running IUPred with the parameter glob resulted in the prediction of one globular domain. Since the protein has a length of 521 residues, this result implies that the C-terminal part of the protein is not part of the globular domain and contains disordered regions.
Number of globular domains: 1
globular domain 1. 5 - 446
For the parameter "short disorder", IUPred predicts the N-terminus (pos 1 - 20) and the C-terminus (460-521) of the protein to contain disordered residues, which can be seen in <xr id="iupred_short_q08209"/>.
Although IUPred predicts short-range disorder for many residues of the C-terminus, it does not predict any long-range disorder for Q08209 (see <xr id="iupred_long_q08209"/>).
|<figure id="iupred_short_q08209">||<figure id="iupred_long_q08209">|
DisProt declares 31% of the protein to be disordered. <xr id="disprot_q08209"/> shows where the annotated disordered regions are located. DisProt already characterizes a region starting from position 374 as disordered, whereas IUPred predicts the disordered region to start at residue 460.
For Q9X0E6, running IUPred with the parameter glob resulted in the prediction of one globular domain.
Number of globular domains: 1
globular domain 1. 1 - 101
With the parameter "short range", IUPred predicts only a few residues at the N- and C-terminus to be disordered; for long range, it predicts no disordered regions.
|<figure id="iupred_short_q9x0e6">||<figure id="iupred_long_q9x0e6">|
No direct hits were found in DisProt. Searching via PSI-BLAST did not work, and a Smith-Waterman search resulted only in minimal alignments of about 20 AA. For the hits found, no disordered regions are annotated in DisProt.
Transmembrane Helix Prediction
We analyzed the prediction of transmembrane helices (TMH) for the proteins listed in <xr id="table_TMH_info"/> and for our protein aspartoacylase. Besides Polyphobius, we also examined the results of two other TMH predictors, namely TMHMM and PHDhtm.
Information on Proteins
<figtable id="table_TMH_info"> <xr nolink id="table_TMH_info"/> Information on the proteins used for the evaluation of different TMH prediction methods.
TMH prediction for our protein yielded the expected result: only cytoplasmic residues were predicted.
For P35462 there is only one structure listed in UniProt: 3pbl. For this structure, OPM and PDBTM list 7 TMH, the same number as in the UniProt annotation for P35462. There is only a slight difference in the localization of the TMH: the annotations of these three references usually differ by about 1-4 amino acid residues.
All prediction methods yield the same number of TMH. Furthermore, Polyphobius, TMHMM and PHDhtm predict the location of the TMH very accurately, with only small deviations from the reference annotations of UniProt, PDBTM and OPM.
In <xr id="table_3pbl"/> the exact localization of the TMH is listed for the reference sources UniProt, PDBTM and OPM as well as for the prediction methods Polyphobius, TMHMM and PHDhtm. In <xr id="CD_tm_3pbl"/> the length distribution of the predicted and annotated TMH is depicted. One can see that PDBTM in general finds shorter TMH, whereas Polyphobius and OPM find longer helices. Furthermore, the location of the TMH within the sequence is visualized in <xr id="vis_3pbl"/>.
<figtable id="table_3pbl" >
|<figure id="CD_tm_3pbl">||<figure id="vis_3pbl">|
In UniProt there are several structures listed for P47863:
- 2D57 X-ray 3.20 A
- 2ZZ9 X-ray 2.80 A
- 3IYZ electron microscopy 10.00 A
Since 2ZZ9 is a mutant, we decided to use 2D57 as a reference structure with OPM and PDBTM.
Interestingly, OPM lists 8 TMH for P47863, whereas PDBTM agrees with the UniProt annotation and lists 6 TMH. Yet, the two additional helices in OPM are rather short (<10 AA) and correspond to two loop segments in the PDBTM annotation.
Just as there is disagreement between the reference sources, the different prediction methods yield deviating results. Polyphobius and TMHMM predict 6 helices, which correspond to the 6 helices listed in UniProt, PDBTM and OPM. PHDhtm finds only 5 helices, of which helix 2 is about 60 amino acid residues long and matches helices 2 and 3 found by the other methods. This long helix also incorporates the loop region annotated in PDBTM and the additional helix listed in OPM. PHDhtm therefore appears to have merged these 3 structural elements into one helical region.
In <xr id="table_2D57"/> the exact localization of the TMH is listed for the reference sources UniProt, PDBTM and OPM as well as for the prediction methods Polyphobius, TMHMM, DAS and PHDhtm. In <xr id="CD_tm_2D57"/> the length distribution of the TMH predicted with Polyphobius and of the annotated TMH is depicted. One can see that PDBTM in general finds shorter helices, whereas OPM and Polyphobius find longer ones. Furthermore, the location of the TMH within the sequence is visualized in <xr id="vis_2D57"/>.
<figtable id="table_2D57" >
|<figure id="CD_tm_2D57">||<figure id="vis_2D57">|
For Q9YDF8, UniProt lists annotations for 6 TM regions and 2 intramembrane regions, as well as four different structures:
- 1ORQ X-ray 3.20 A 31-253
- 1ORS X-ray 1.90 A 33-160
- 2A0L X-ray 3.90 A 20-259
- 2KYH NMR - 19-160
Since in 1ORS only residues 33-160 have been crystallized, we decided to use 1ORQ for comparison with the output of the prediction methods.
For Q9YDF8, Polyphobius did not find any homologues with the BLAST search. Therefore, no homology information could be used for the TMH prediction. The TMH prediction by Polyphobius in general coincides with the UniProt annotation: Polyphobius finds 7 TMH and their overlap with the TMH listed in UniProt is large. However, OPM and PDBTM list very diverse results. There is only a consensus on TMH 5 and 7. When comparing the OPM annotations for the two structures 1ORQ and 1ORS, one finds substantial differences:
- 1ORS: C - Tilt: 19° - Segments: 1(25-46), 2(55-78), 3(86-97), 4(100-107), 5(117-148)
- 1ORQ: C - Tilt: 31° - Segments: 1(153-172), 2(183-195), 3(207-225)
Yet, if one considers the sequence shift of 13 AA between the 1ORQ PDB sequence and the Q9YDF8 UniProt sequence (see <xr id=seq_shift />), both annotations together represent the TMH identified with Phobius and the TMH annotated in UniProt. The same holds for the PDBTM annotations.
In general, the three analyzed TMH prediction methods yield comparable results that agree with the annotated TMH locations (as far as the annotations agree with each other). Stronger deviations can be found for amino acid positions 100-160. Here, Polyphobius predicts two TMH, TMHMM finds only the first helix, whereas PHDhtm detects one long helix spanning 42 residues. Since the annotation for this region differs between UniProt, OPM and PDBTM, it is hard to decide which method gives the most accurate result.
In <xr id="table_1orq"/> the exact localization of the TMH is listed for the reference sources UniProt, PDBTM and OPM as well as for the prediction methods Polyphobius, TMHMM and PHDhtm. In <xr id="CD_tm_Q9YDF8"/> the length distribution of the TMH predicted with Polyphobius and of the annotated TMH is depicted. Only for helix five do the three methods agree on the helix length. Furthermore, the location of the TMH within the sequence is visualized in <xr id="vis_1orq"/>.
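The effect of the numbering shift can be checked with a few lines of code. The sketch below assumes that the offset of 13 has to be added to the 1ORQ numbering to obtain UniProt positions, and that the segments 160-184, 196-208 and 222-253 are the corresponding UniProt annotations. Both are assumptions read off the values in this section; the helper names are hypothetical.

```python
def shift(segments, offset):
    """Renumber (start, end) segments by a constant offset."""
    return [(start + offset, end + offset) for start, end in segments]


def overlap(seg_a, seg_b):
    """Number of residues shared by two inclusive (start, end) segments."""
    return max(0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]) + 1)


# OPM segments for 1ORQ in PDB numbering, as listed above.
opm_1orq = [(153, 172), (183, 195), (207, 225)]
# Assumed UniProt annotation for the same three elements.
uniprot = [(160, 184), (196, 208), (222, 253)]

opm_in_uniprot_numbering = shift(opm_1orq, 13)
for reference, shifted in zip(uniprot, opm_in_uniprot_numbering):
    print(reference, shifted, overlap(reference, shifted))
```

After the shift, each OPM segment overlaps its assumed UniProt counterpart over most of its length, which is the basis for the statement that the two annotations are consistent.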
<figtable id="table_1orq" >
|39 – 63||34-65(1ORS:40-63)||38-59||39-61||42-60||42-64|
|68 – 92||70-93(1ORS:68-88)||68-91||68-88||68-87||69-88|
|109 – 125||(1ORS:101-120)||113-120||108-129||107-129||107-149|
|129 – 145||(1ORS:131-155)||130-161||137-157||107-149(cont)|
|160 – 184||164-184||166-185||163-184||162-184||162-181|
|196 – 208(intramembrane)||196-208||196-213||199-218||197-212|
|222 – 253||222-249||220-238||224-244||225-244||220-247|
|<figure id="CD_tm_Q9YDF8">||<figure id="vis_1orq">|
Checking for possible confusion between TMH and signal peptides with SignalP 4.0
Signal Peptide Prediction
Information on Proteins
<figtable id="table_signalp_info"> <xr nolink id="table_signalp_info"/> Information on the proteins used for the prediction of signal peptides.
SignalP3.0 NN predicts a signal peptide for P02768 with a cleavage site between residues 18 and 19: AYS-RG, with a value of 0.880 at a cutoff of 0.43.
SignalP3.0 HMM gives a maximum cleavage site probability of 0.759 between pos. 18 and 19.
SignalP4.0 predicts the cleavage site between pos. 18 and 19: AYS-RG with D=0.848
In <xr id="signalp_p02768"/> you can see the graphical output of the SignalP prediction for P02768.
Polyphobius also predicts the signal peptide:
- N-REGION: 1 - 2
- H-REGION: 3 - 13
- C-REGION: 14 - 18
- NON CYTOPLASMIC: 19 - 609
We also used TargetP to predict the localization of this secreted protein. TargetP can identify the presence of an N-terminal presequence for chloroplast transit peptides (cTP), mitochondrial targeting peptides (mTP) or secretory pathway signal peptides (SP).
TargetP predicts P02768 to be a secretory protein (Loc = S) with medium reliability (RC=3) and predicts the signal peptide to be 18 residues long. This is in accordance with SignalP (TargetP also uses SignalP for cleavage site predictions).
|Name||Len||mTP||SP||other||Loc||RC||TPlen|
|sp_P02768_ALBU_HUMAN||609||0.380||0.873||0.013||S||3||18|
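The reliability class can be reproduced from the three scores in the TargetP output. The sketch below assumes the RC definition given in the TargetP documentation, where RC depends on the difference between the highest and the second-highest score (1: > 0.8, 2: > 0.6, 3: > 0.4, 4: > 0.2, 5: otherwise). The function names are hypothetical.

```python
def predicted_location(scores):
    """Return the name of the highest-scoring category."""
    return max(scores, key=scores.get)


def reliability_class(scores):
    """RC from the gap between the best and second-best score (assumed bins)."""
    ranked = sorted(scores.values(), reverse=True)
    diff = ranked[0] - ranked[1]
    if diff > 0.8:
        return 1
    if diff > 0.6:
        return 2
    if diff > 0.4:
        return 3
    if diff > 0.2:
        return 4
    return 5


# Scores reported by TargetP for P02768.
p02768 = {"mTP": 0.380, "SP": 0.873, "other": 0.013}
print(predicted_location(p02768), reliability_class(p02768))
```

Under this assumption the medium reliability follows from the comparatively high mTP score of 0.380, which leaves a difference of only about 0.49 to the SP score.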
For P11279, SignalP3.0 NN predicts the cleavage site between pos. 28 and 29: ASA-AM, with a value of 0.931 at a cutoff of 0.43.
SignalP3.0 HMM has max. cleavage site probability of 0.847 between pos. 28 and 29.
SignalP4.0 predicts the cleavage site between pos. 28 and 29: ASA-AM with D=0.952
In <xr id="signalp_p11279"/> you can see the graphical output of the SignalP prediction for P11279.
Polyphobius also predicts the signal peptide:
- N-REGION: 1 - 10
- H-REGION: 11 - 22
- C-REGION: 23 - 28
- NON CYTOPLASMIC: 29 - 381
- TRANSMEM: 382 - 405
- CYTOPLASMIC: 406 - 417
SignalP3.0 NN reaches no consensus on whether P47863 has a signal peptide. The most likely cleavage site is between pos. 54 and 55: SVG-ST.
SignalP3.0 HMM gives a signal peptide probability of 0.723 and a maximum cleavage site probability of 0.533 between pos. 56 and 57.
SignalP4.0 predicts no signal peptide, with a value of D=0.154 at a D-cutoff of 0.500.
In <xr id="signalp_p47863"/> you can see the graphical output of the SignalP prediction for P47863.
The prediction of Polyphobius for TMH was already discussed in section "TMH prediction."
In SignalP3.0 NN, the majority of the scores predict no signal peptide for P45381. Only the C-score is above the cutoff and predicts the cleavage site at pos. 23.
SignalP3.0 HMM predicts no signal peptide (probability of 0.000 for a signal peptide).
SignalP4.0 predicts no signal peptide, with a value of D=0.124 at a D-cutoff of 0.450.
The prediction of Polyphobius for TMH was already discussed in section "TMH prediction".
In <xr id="signalp_p45381"/> you can see the graphical output of the SignalP prediction for P45381.
GO terms and Pfam
GOPET predicts four GO terms for our protein, one of them with a very high confidence (96% for hydrolase activity), and three with a fairly high confidence (81-82%). Of course, we know all four of these predictions to be accurate. So, if we were dealing with a truly unknown protein, we would very likely believe in a hydrolase activity. A confidence of 81-82% is still high, so we could either believe those terms as well or, since we are only talking about three terms, try to have them experimentally validated.
The GOPET predictions indicate that either the protein is very specific and its homologs (on which the predictions are based) are not involved in many reactions, or only one type of reaction is known in which the homologs take part.
|GoID||Ontology||Confidence||Term name||Found in UniProt|
|GO:0016787||Molecular Function||96%||hydrolase activity||yes|
|GO:0004046||Molecular Function||82%||aminoacylase activity||yes|
|GO:0019807||Molecular Function||82%||aspartoacylase activity||yes|
|GO:0016788||Molecular Function||81%||hydrolase activity acting on ester bonds||yes|
ProtFun Output: The arrows indicate the highest information content, not the highest probability in that class. Looking at the output and pretending again we do not know anything about our protein, we would learn that it seems to:
- be involved in central intermediary metabolism (highest information content) (true)
- have to do with Purines and Pyrimidines (highest score) (not true)
- be an enzyme (true)
- have a higher probability to be a Transferase (0.202) than a Hydrolase (0.115) (wrong, it is a Hydrolase)
- definitely not be an Isomerase (highest information content due to high Odds value)
ProtFun Prediction for Aspartoacylase
|Functional category||Prob||Odds|
|Amino_acid_biosynthesis||0.071||3.233|
|Biosynthesis_of_cofactors||0.144||2.003|
|Cell_envelope||0.033||0.535|
|Cellular_processes||0.137||1.875|
|Central_intermediary_metabolism =>||0.334||5.309|
|Energy_metabolism||0.226||2.511|
|Fatty_acid_metabolism||0.022||1.663|
|Purines_and_pyrimidines||0.367||1.512|
|Regulatory_functions||0.021||0.128|
|Replication_and_transcription||0.167||0.625|
|Translation||0.113||2.559|
|Transport_and_binding||0.017||0.042|
|Enzyme/nonenzyme||Prob||Odds|
|Enzyme =>||0.703||2.454|
|Nonenzyme||0.297||0.416|
|Enzyme class||Prob||Odds|
|Oxidoreductase (EC 1.-.-.-)||0.111||0.534|
|Transferase (EC 2.-.-.-)||0.202||0.585|
|Hydrolase (EC 3.-.-.-)||0.115||0.363|
|Lyase (EC 4.-.-.-)||0.031||0.662|
|Isomerase (EC 5.-.-.-) =>||0.084||2.637|
|Ligase (EC 6.-.-.-)||0.074||1.460|
|Gene Ontology category||Prob||Odds|
|Signal_transducer||0.053||0.246|
|Receptor||0.004||0.024|
|Hormone||0.001||0.206|
|Structural_protein||0.001||0.041|
|Transporter||0.025||0.230|
|Ion_channel||0.015||0.257|
|Voltage-gated_ion_channel||0.004||0.173|
|Cation_channel||0.011||0.234|
|Transcription||0.100||0.785|
|Transcription_regulation||0.039||0.313|
|Stress_response||0.010||0.117|
|Immune_response||0.061||0.720|
|Growth_factor||0.006||0.450|
|Metal_ion_transport||0.009||0.020|
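The difference between "highest probability" and "highest information content" can be made explicit with the numbers above. The sketch assumes that the arrow in each ProtFun block marks the category with the largest odds value, which is consistent with the output shown here but is not taken from the ProtFun documentation. The helper name is hypothetical.

```python
# (probability, odds) pairs copied from the ProtFun output for aspartoacylase.
functional_category = {
    "Amino_acid_biosynthesis": (0.071, 3.233),
    "Biosynthesis_of_cofactors": (0.144, 2.003),
    "Cell_envelope": (0.033, 0.535),
    "Cellular_processes": (0.137, 1.875),
    "Central_intermediary_metabolism": (0.334, 5.309),
    "Energy_metabolism": (0.226, 2.511),
    "Fatty_acid_metabolism": (0.022, 1.663),
    "Purines_and_pyrimidines": (0.367, 1.512),
    "Regulatory_functions": (0.021, 0.128),
    "Replication_and_transcription": (0.167, 0.625),
    "Translation": (0.113, 2.559),
    "Transport_and_binding": (0.017, 0.042),
}
enzyme_class = {
    "Oxidoreductase": (0.111, 0.534),
    "Transferase": (0.202, 0.585),
    "Hydrolase": (0.115, 0.363),
    "Lyase": (0.031, 0.662),
    "Isomerase": (0.084, 2.637),
    "Ligase": (0.074, 1.460),
}

PROB, ODDS = 0, 1


def top_by(categories, column):
    """Name of the category with the largest value in the given column."""
    return max(categories, key=lambda name: categories[name][column])


for block in (functional_category, enzyme_class):
    print("highest probability:", top_by(block, PROB),
          "| highest odds:", top_by(block, ODDS))
```

In both blocks the two criteria pick different categories, which is why the arrows in the output do not coincide with the largest probabilities.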
GOPET provided only a few functional annotations, but they are very specific and have a high confidence. ProtFun, in this case, did not give any clear insight into what kind of protein we are dealing with. To get a clearer idea, we would suggest (if there were more time now ;) ) employing at least two more function prediction tools to compare their results and see whether they all point in a similar direction.
Searching Pfam with the sequence of human aspartoacylase (UniProt ID P45381) produced one significant result (E-value 7.9e-71): the Succinylglutamate desuccinylase / Aspartoacylase family, AstE_AspA for short. 33 sequences can be found in EMBL for this family; <xr id="family"/> shows the family tree.
Succinylglutamate desuccinylases catalyse a reaction very similar to that of aspartoacylase:
N-succinyl-L-glutamate + H2O <-> succinate + L-glutamate
(Recall that aspartoacylase catalyses the reaction:
N-Acetyl-L-aspartate + H2O <--> Acetate + L-Aspartate,
so the difference lies only in succinyl vs acetyl, and glutamate vs aspartate)
The AstE_AspA family is part of the peptidase clan, which in turn has 12 members.