Difference between revisions of "Fabry:Sequence-based analyses"

From Bioinformatikpedia

Revision as of 20:02, 18 May 2012




The following analyses were performed on the basis of the α-Galactosidase A sequence. Please consult the journal for the commands used to generate the results.

Secondary structure

Disorder

Transmembrane helices

α-Galactosidase A (P06280)

<figure id="fig:1r46_TM">

Human α-Galactosidase A, plot of posterior probabilities of features (source)

</figure> <figure id="fig:1r46_TMHMM">

Transmembrane helix prediction with TMHMM for human α-Galactosidase A

</figure>

organism: Homo sapiens (Human)
pdb-id: 1r46

Because Polyphobius labels the whole sequence as "NON CYTOPLASMIC", it predicts no transmembrane helices (see Polyphobius output file). This result is consistent with both databases, OPM and PDBTM, and with the TMHMM-2.0 prediction (see <xr id="fig:1r46_TMHMM" />).
<xr id="fig:1r46_TM"/> shows the posterior probabilities of cytoplasmic (green), non-cytoplasmic (blue), TM helix (grey area) and signal peptide (red). The grey area indicates weak evidence for a transmembrane helix, which was not predicted. The first small peak can be observed in both pictures. Considering the probabilities shown and our background knowledge, the prediction appears to be correct. Only the end of the signal peptide, predicted at residue 29, is too early according to our knowledge and should be shifted to position 31.


D(3) dopamine receptor (P35462)

<figtable id="tab:3pbl_TM_PostProb"> Posterior probability plots Polyphobius and TMHMM

Human D(3) dopamine receptor, plot of posterior probabilities of features predicted by Polyphobius
Human D(3) dopamine receptor, plot of posterior probabilities of features predicted by TMHMM

</figtable>

organism: Homo sapiens (Human)
pdb-id: 3pbl (only available structure)

Polyphobius predicts 7 transmembrane regions. Comparing this result with the pictures in <xr id="tab:3pbl_TM"/>, we again see agreement with the two databases. The only difference between the models seems to be the number of residues on the cytosolic side of the membrane (green in the right picture) of this hydrolase/hydrolase inhibitor. According to the PDBTM site, this area comprises residues 1002-1161, which do not even belong to the 400 aa long protein. This is probably a consequence of the experiment from which the structure was derived: the protein is reported to have been engineered (see source1 and source2).
Apart from that, the predicted and observed start and end points differ (see <xr id="tab:3pbl_start_end"/>), mainly those of PDBTM, which leads to a large difference in its mean helix length compared to the other methods (see <xr id="tab:3pbl_TM"/>, picture 3). PDBTM consistently assigns shorter helices, since its helices always start later and end earlier, which can also be seen in <xr id="fig:bitshift" />. The boundaries predicted by Polyphobius can be found in the Polyphobius output file.
The posterior probabilities of the first 6 predicted helices are very similar, as one can see in <xr id="tab:3pbl_TM_PostProb"/>. The only difference is in the last TM helix, to which TMHMM assigns a probability of nearly 90%, whereas Polyphobius assigns only about 55%.

<figtable id="tab:3pbl_start_end"> Start and end points of each transmembrane helix predicted

Cells give start-end (length) for each transmembrane helix (TMH)
Method       TMH 1         TMH 2         TMH 3         TMH 4         TMH 5         TMH 6         TMH 7
Polyphobius  30-55 (26)    66-88 (23)    105-126 (22)  150-170 (21)  188-212 (25)  329-352 (24)  367-386 (20)
OPM          34-52 (19)    67-91 (25)    101-126 (26)  150-170 (21)  187-209 (23)  330-351 (22)  363-382 (24)
PDBTM        35-52 (18)    68-84 (17)    109-123 (15)  152-166 (15)  191-206 (16)  334-347 (14)  368-382 (15)
Uniprot      33-55 (23)    66-88 (23)    105-126 (22)  150-170 (21)  188-212 (25)  330-351 (22)  367-388 (22)
TMHMM        32-54 (23)    67-89 (23)    104-126 (23)  150-172 (23)  192-214 (23)  331-353 (23)  368-390 (23)

</figtable>
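The difference in mean helix length mentioned above can be recomputed directly from the table. The following is only a minimal sketch: the lengths are copied from the table for three of the five methods, and the names `HELIX_LENGTHS` and `mean_lengths` are our own, not part of any of the tools.

```python
from statistics import mean

# Helix lengths for P35462, copied from the table above.
# Only three of the five methods are included to keep the sketch short.
HELIX_LENGTHS = {
    "Polyphobius": [26, 23, 22, 21, 25, 24, 20],
    "PDBTM": [18, 17, 15, 15, 16, 14, 15],
    "TMHMM": [23, 23, 23, 23, 23, 23, 23],
}


def mean_lengths(lengths_by_method):
    """Return the mean helix length per method, rounded to one decimal."""
    return {
        method: round(mean(lengths), 1)
        for method, lengths in lengths_by_method.items()
    }


if __name__ == "__main__":
    for method, value in mean_lengths(HELIX_LENGTHS).items():
        print(f"{method:12s} {value}")
```

With these numbers, the PDBTM helices are on average about seven residues shorter than those of Polyphobius and TMHMM.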

<figtable id="tab:3pbl_TM"> Visualizations for the human D(3) dopamine receptor as a transmembrane protein. Note: Cytoplasmic side of the membrane is different in the pictures

Human D(3) dopamine receptor as depicted in OPM database (Cytoplasmic side on the bottom) (source)
Human D(3) dopamine receptor as shown in PDBTM database (Cytoplasmic side above) (source)
Human D(3) dopamine receptor, comparison of length distribution of different prediction methods

</figtable>


Voltage-gated potassium channel (Q9YDF8)

<figtable id="tab:1orq_TM_PostProb"> Posterior probability plots Polyphobius and TMHMM for Q9YDF8

Voltage-gated potassium channel, plot of posterior probabilities of features predicted by Polyphobius
Voltage-gated potassium channel, plot of posterior probabilities of features predicted by TMHMM

</figtable>

organism: Aeropyrum pernix (strain ATCC 700893 / DSM 11879 / JCM 9820 / NBRC 100138 / K1)
pdb-id: 1orq (smaller resolution than 2AOL, more residues determined)

The voltage-gated potassium channel is a tetrameric protein that makes the membrane permeable to potassium ions in a voltage-dependent manner. It can adopt an open or a closed conformation, and the voltage difference across the membrane switches between the two states (source). To fulfil this function, the protein needs its transmembrane helices. According to the Polyphobius output file, 7 helices are predicted, whereas OPM lists only three and PDBTM four. A careful look at the Polyphobius probability plot in <xr id="tab:1orq_TM_PostProb"/> shows only four pronounced peaks of transmembrane helix posterior probability, yet three of the lower peaks are counted as helices as well.
Only 2 of the 7 helices are common to all methods (see <xr id="tab:1orq_TM" />, picture 3). The third helix is predicted only by Polyphobius, Uniprot and TMHMM, and the fourth only by Polyphobius and Uniprot. Where a method does not predict or show a helix, it treats the region as a membrane loop.
Here OPM has the smallest mean transmembrane helix length (see <xr id="tab:1orq_TM" />, picture 3). Start and end points differ more than for the previously described protein (see <xr id="tab:1orq_start_end" />), although Polyphobius, Uniprot and TMHMM are quite similar. Personally, considering the features predicted by Polyphobius and TMHMM (see <xr id="tab:1orq_TM_PostProb" />), I would trust the PDBTM version the most in this case. On the other hand, the Polyphobius prediction may be wrong here: its advantage of including homologue data does not apply, because the BLAST search did not find any homologous sequences. This might, to a certain degree, explain the rather diverse results.

<figtable id="tab:1orq_start_end"> Start and end points of each transmembrane helix predicted

Cells give start-end (length) for each transmembrane helix (TMH); - means the helix is not reported (length 0)
Method       TMH 1         TMH 2         TMH 3         TMH 4         TMH 5         TMH 6         TMH 7
Polyphobius  42-60 (19)    68-88 (21)    108-129 (22)  137-157 (21)  163-184 (22)  196-213 (18)  224-244 (21)
OPM          -             -             -             -             153-172 (20)  183-195 (13)  207-225 (19)
PDBTM        21-52 (32)    57-80 (24)    -             -             151-171 (21)  209-236 (28)  -
Uniprot      39-63 (25)    68-92 (25)    109-125 (17)  137-157 (21)  160-184 (25)  196-208 (13)  222-253 (32)
TMHMM        39-61 (23)    68-87 (20)    107-129 (23)  -             162-184 (23)  199-218 (20)  225-244 (20)

</figtable>
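Which methods support which helix can be read off the table programmatically. The sketch below assumes that the column assignment of the table is taken as given (it does not test residue overlap); `SEGMENTS`, `supporters` and `shared_by_all` are illustrative names of our own.

```python
# Start/end of each helix column (1-7) for Q9YDF8, copied from the table above.
# None marks a helix that the method does not report.
SEGMENTS = {
    "Polyphobius": [(42, 60), (68, 88), (108, 129), (137, 157),
                    (163, 184), (196, 213), (224, 244)],
    "OPM": [None, None, None, None, (153, 172), (183, 195), (207, 225)],
    "PDBTM": [(21, 52), (57, 80), None, None, (151, 171), (209, 236), None],
    "Uniprot": [(39, 63), (68, 92), (109, 125), (137, 157),
                (160, 184), (196, 208), (222, 253)],
    "TMHMM": [(39, 61), (68, 87), (107, 129), None,
              (162, 184), (199, 218), (225, 244)],
}


def supporters(segments, helix_number):
    """Return the methods that report the helix with the given 1-based number."""
    return [
        method
        for method, helices in segments.items()
        if helices[helix_number - 1] is not None
    ]


def shared_by_all(segments):
    """Return the 1-based numbers of helices reported by every method."""
    n_helices = len(next(iter(segments.values())))
    return [
        number
        for number in range(1, n_helices + 1)
        if len(supporters(segments, number)) == len(segments)
    ]


if __name__ == "__main__":
    print("shared by all methods:", shared_by_all(SEGMENTS))
    for number in (3, 4):
        print(f"helix {number}:", ", ".join(supporters(SEGMENTS, number)))
```

This reproduces the counts given in the text: only helices 5 and 6 are reported by all methods, helix 3 by Polyphobius, Uniprot and TMHMM, and helix 4 by Polyphobius and Uniprot.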

<figtable id="tab:1orq_TM"> Visualizations for the Voltage-gated potassium channel as a transmembrane protein. Note: Cytoplasmic side of the membrane is different in the pictures

Voltage-gated potassium channel as depicted in OPM database (Cytoplasmic side on the bottom) (source)
Voltage-gated potassium channel as shown in PDBTM database (Cytoplasmic side above) (source)
Voltage-gated potassium channel, comparison of length distribution of different prediction methods

</figtable>


Aquaporin-4 (P47863)

<figtable id="tab:2d57_TM_PostProb"> Posterior probability plots Polyphobius and TMHMM for P47863

Aquaporin-4, plot of posterior probabilities of features predicted by Polyphobius
Aquaporin-4, plot of posterior probabilities of features predicted by TMHMM

</figtable>

organism: Rattus norvegicus (Rat)
pdb-id: 2d57 (non-mutant)

Aquaporin-4, a water-specific channel and another homotetramer, was analyzed next. The first two pictures in <xr id="tab:2d57_TM"/> show the protein in its tetrameric structure, whereas in this exercise we examine only one of the four chains. Polyphobius again predicted the number of transmembrane helices correctly according to PDBTM and all the other resources (6, see Polyphobius output file), while OPM alone shows 8 TMH structures. On closer inspection, the 2 "missing" TMHs are interpreted as membrane loops in PDBTM (depicted in orange in <xr id="tab:2d57_TM"/>, picture 2) and Polyphobius, which might be due to the short length of the helical structures (10 residues each, see <xr id="tab:2d57_TM"/>). Considering the posterior probability, I would prefer one of the 6-helix models.
The start and end points of the 6 shared helices again agree well among Polyphobius, OPM, PDBTM, Uniprot and TMHMM. The mean length is again around 23, except for the two databases, which have a mean length of around 18.

<figtable id="tab:2d57_start_end"> Start and end points of each transmembrane helix predicted

Cells give start-end (length) for each transmembrane helix (TMH); - means the helix is not reported (length 0)
Method       TMH 1         TMH 2         TMH 3         TMH 4         TMH 5         TMH 6         TMH 7         TMH 8
Polyphobius  34-58 (25)    70-91 (22)    -             115-136 (22)  156-177 (22)  188-208 (21)  -             231-252 (22)
OPM          34-56 (23)    70-88 (19)    98-107 (10)   112-136 (25)  156-178 (23)  189-203 (15)  214-223 (10)  231-252 (22)
PDBTM        39-55 (17)    72-89 (18)    -             116-133 (18)  158-177 (20)  188-205 (18)  -             231-248 (18)
Uniprot      37-57 (21)    65-85 (21)    -             116-136 (21)  156-176 (21)  185-205 (21)  -             232-252 (21)
TMHMM        33-55 (23)    70-92 (23)    -             112-134 (23)  154-176 (23)  189-211 (23)  -             231-253 (23)

</figtable>

<figtable id="tab:2d57_TM"> Visualizations for the Aquaporin-4 as a transmembrane protein. Note: Cytoplasmic side of the membrane is different in the pictures

Rat's Aquaporin-4 as depicted in OPM database (Cytoplasmic side on the bottom) (source)
Rat's Aquaporin-4 as shown in PDBTM database (Cytoplasmic side above)(source)
Aquaporin-4, comparison of length distribution of different prediction methods

</figtable>


Comparison Polyphobius, OPM, PDBTM

<figure id="fig:bitshift">

Average start and end points of transmembrane helices, Polyphobius compared to OPM, PDBTM, Uniprot and TMHMM

</figure>

In <xr id="fig:bitshift" />, the transmembrane helices predicted by Polyphobius serve as the baseline, and the start and end points of each TMH in the models of OPM, PDBTM, Uniprot and TMHMM (see <xr id="tab:3pbl_start_end" />, <xr id="tab:1orq_start_end" /> and <xr id="tab:2d57_start_end" />) are compared to them. For each of these models, the average was calculated over the helices it shares with Polyphobius. It becomes obvious that the transmembrane helices of OPM tend to be shifted to the left compared to the Polyphobius helices, while the PDBTM helices are usually nested inside the Polyphobius helices because they start later and end earlier. The importance of the great strength of Polyphobius, the inclusion of homologue data, becomes apparent when examining the result for 1orq, for which no homologue data could be found (see Voltage-gated potassium channel). Here all models are very different from the Polyphobius prediction, except for the Uniprot prediction. On closer inspection, we found that the Uniprot prediction relies, among a few other methods, on the Phobius and TMHMM algorithms (see).
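The kind of averaging behind this comparison can be sketched as follows. This is not the script used for the figure: it uses only the D(3) dopamine receptor boundaries and two of the four comparison models, and the function name `mean_shift` is our own. A positive start shift means the helix starts later than the Polyphobius helix, a negative end shift means it ends earlier.

```python
from statistics import mean

# Polyphobius helices for P35462 (start, end), used as the baseline.
REFERENCE = [(30, 55), (66, 88), (105, 126), (150, 170),
             (188, 212), (329, 352), (367, 386)]

# Boundaries of the same helices in two other models, from the same table.
OTHERS = {
    "PDBTM": [(35, 52), (68, 84), (109, 123), (152, 166),
              (191, 206), (334, 347), (368, 382)],
    "TMHMM": [(32, 54), (67, 89), (104, 126), (150, 172),
              (192, 214), (331, 353), (368, 390)],
}


def mean_shift(reference, other):
    """Average start and end offsets of `other` relative to `reference`.

    Helices missing in either model (None) are skipped, so the average is
    taken only over helices that both models share.
    """
    pairs = [(r, o) for r, o in zip(reference, other)
             if r is not None and o is not None]
    start_shift = mean([o[0] - r[0] for r, o in pairs])
    end_shift = mean([o[1] - r[1] for r, o in pairs])
    return round(start_shift, 2), round(end_shift, 2)


if __name__ == "__main__":
    for method, helices in OTHERS.items():
        start, end = mean_shift(REFERENCE, helices)
        print(f"{method:6s} start {start:+.2f}  end {end:+.2f}")
```

For this protein the PDBTM helices start about three residues later and end about four residues earlier than the Polyphobius helices, which is the nesting described above.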
Since the hydrophobic belt of a membrane is on average about 3.5 to 5 nm wide and a helix needs approximately 15 to 20 amino acids to span this width, mean lengths of 18 to 23 amino acids seem reasonable. Nonetheless, TMHMM in particular, which predicts helices of length 23 in almost all cases, and also Polyphobius and Uniprot tend to produce longer helices than that. This might be due either to the type of membrane in which the proteins are located or to a bias in the algorithms.
What also stood out is that the cutoff for still counting a structure as helical is very low in the OPM database, considering that an α-helix needs on average 3.6 amino acid residues per turn. Conversely, OPM reports the smallest number of transmembrane helices for the voltage-gated potassium channel, which cannot be due to the length of the predicted structures.
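To get a feeling for these lengths, the span and the number of turns of a helix can be estimated from its residue count. The sketch below rests on assumptions that are not part of our analysis: an ideal α-helix with an axial rise of 0.15 nm per residue (the textbook value) that stands perpendicular to the membrane, so any tilt is ignored.

```python
# Textbook geometry of an ideal alpha-helix (assumed, not measured here).
RISE_PER_RESIDUE_NM = 0.15  # axial rise per residue
RESIDUES_PER_TURN = 3.6


def helix_span_nm(n_residues):
    """Length along the helix axis in nm for a helix of n_residues."""
    return round(n_residues * RISE_PER_RESIDUE_NM, 2)


def helix_turns(n_residues):
    """Number of helical turns formed by n_residues."""
    return round(n_residues / RESIDUES_PER_TURN, 1)


if __name__ == "__main__":
    # 10: shortest OPM helices in Aquaporin-4; 18 and 23: typical mean lengths.
    for n in (10, 18, 23):
        print(f"{n:2d} residues: {helix_span_nm(n)} nm, {helix_turns(n)} turns")
```

Under these assumptions a 10-residue helix covers fewer than three turns, while 18 to 23 residues correspond to roughly five to six and a half turns.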
In summary, the preferred method depends on what you require of the result. If you want to be sure that your predicted helices really are transmembrane helices, choose the PDBTM models, since in our examples PDBTM was never the only method to label a structure as a TMH. If instead you want to cover all possible structures, choose Polyphobius, since, judged against the two databases, there was no transmembrane helix that it missed entirely.

Signal peptides

Prediction of the presence and location of signal peptide cleavage sites in amino acid sequences.

GO terms

GOPET

Searching the GOPET annotation tool with the AGAL_HUMAN sequence returned 5 GO IDs, which are displayed in <xr id="tab:GOPET"/>. At first glance, since we already know the name and function of the protein, it is a bit surprising that alpha-galactosidase activity is only the fourth entry, with 96% confidence. From our earlier information gathering we know that α-galactosidase A is a hydrolase, so the first three entries are not surprising. Considering that our enzyme is mainly a glycosidase, the two entries at the top of the list make perfect sense.
The last entry is again a bit surprising. α-N-Acetylgalactosaminidase is actually used for enzyme replacement therapy, which we mention on our main page. The two enzymes are structurally similar, but this still does not explain why this GO term is associated with the AGAL protein.

<figtable id="tab:GOPET"> The results of the GOPET search

Result for GOPET search
GO ID       Aspect                           Confidence  GO term
GO:0016798  Molecular Function Ontology (F)  98%         hydrolase activity, acting on glycosyl bonds
GO:0004553  Molecular Function Ontology (F)  98%         hydrolase activity, hydrolyzing O-glycosyl compounds
GO:0016787  Molecular Function Ontology (F)  97%         hydrolase activity
GO:0004557  Molecular Function Ontology (F)  96%         alpha-galactosidase activity
GO:0008456  Molecular Function Ontology (F)  89%         alpha-N-acetylgalactosaminidase activity

</figtable>

ProtFun2.0

Known: EC 3.2.1.22 (EC 3.-.-.-, Hydrolase)
Predicted: EC 6.-.-.- (Ligase)

############## ProtFun 2.2 predictions ##############

>gi_4504009_

# Functional category                  Prob     Odds
  Amino_acid_biosynthesis              0.283   12.847
  Biosynthesis_of_cofactors            0.339    4.708
  Cell_envelope                     => 0.652   10.690
  Cellular_processes                   0.057    0.783
  Central_intermediary_metabolism      0.400    6.343
  Energy_metabolism                    0.151    1.678
  Fatty_acid_metabolism                0.032    2.448
  Purines_and_pyrimidines              0.506    2.082
  Regulatory_functions                 0.013    0.083
  Replication_and_transcription        0.047    0.175
  Translation                          0.211    4.807
  Transport_and_binding                0.549    1.339

# Enzyme/nonenzyme                     Prob     Odds
  Enzyme                            => 0.805    2.811
  Nonenzyme                            0.195    0.273

# Enzyme class                         Prob     Odds
  Oxidoreductase (EC 1.-.-.-)          0.176    0.845
  Transferase    (EC 2.-.-.-)          0.195    0.564
  Hydrolase      (EC 3.-.-.-)          0.244    0.769
  Lyase          (EC 4.-.-.-)          0.029    0.608
  Isomerase      (EC 5.-.-.-)          0.010    0.321
  Ligase         (EC 6.-.-.-)       => 0.141    2.776

# Gene Ontology category               Prob     Odds
  Signal_transducer                    0.090    0.419
  Receptor                             0.014    0.083
  Hormone                              0.002    0.318
  Structural_protein                   0.004    0.127
  Transporter                          0.024    0.222
  Ion_channel                          0.010    0.169
  Voltage-gated_ion_channel            0.003    0.127
  Cation_channel                       0.010    0.215
  Transcription                        0.047    0.367
  Transcription_regulation             0.026    0.204
  Stress_response                      0.049    0.552
  Immune_response                      0.012    0.136
  Growth_factor                        0.006    0.412
  Metal_ion_transport                  0.009    0.020

//
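The mismatch between the known and the predicted enzyme class becomes clearer when the enzyme class block is ranked by each of its two columns. The sketch below only parses the rows shown in the output above; it makes no claim about how ProtFun itself places the `=>` marker, and `parse_block` and `top_by` are helper names of our own.

```python
# Enzyme class rows copied from the ProtFun output above.
ENZYME_CLASS_BLOCK = """\
  Oxidoreductase (EC 1.-.-.-)          0.176    0.845
  Transferase    (EC 2.-.-.-)          0.195    0.564
  Hydrolase      (EC 3.-.-.-)          0.244    0.769
  Lyase          (EC 4.-.-.-)          0.029    0.608
  Isomerase      (EC 5.-.-.-)          0.010    0.321
  Ligase         (EC 6.-.-.-)       => 0.141    2.776
"""


def parse_block(text):
    """Parse rows of the form 'name ... prob odds'; the '=>' marker is dropped."""
    rows = []
    for line in text.splitlines():
        parts = line.replace("=>", " ").split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        rows.append({
            "name": parts[0],
            "prob": float(parts[-2]),
            "odds": float(parts[-1]),
        })
    return rows


def top_by(rows, key):
    """Return the name of the row with the largest value in column `key`."""
    return max(rows, key=lambda row: row[key])["name"]


if __name__ == "__main__":
    rows = parse_block(ENZYME_CLASS_BLOCK)
    print("highest probability:", top_by(rows, "prob"))
    print("highest odds:       ", top_by(rows, "odds"))
```

Ranked by probability, Hydrolase comes first, which matches the known EC class; ranked by odds, Ligase comes first, which matches the marked prediction.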

Pfam

Other programs and resources

A simple internet search for "protein sequence prediction" returns a bewildering number of resources and databases. The first hit leads to Rostlab's PredictProtein (http://www.predictprotein.org/). This program seems to predict nearly everything:

  1. Multiple sequence alignment
  2. ProSite sequence motifs
  3. low-complexity regions (SEG)
  4. Nuclear localisation signals
  5. and predictions of
    1. secondary structure
    2. solvent accessibility
    3. globular regions
    4. transmembrane helices
    5. coiled-coil regions
    6. structural switch regions
    7. B-value
    8. disordered regions
    9. inter-residue contacts
    10. protein-protein and protein-DNA binding sites
    11. sub-cellular localization
    12. domain assignment
    13. beta barrels
    14. cysteine predictions and disulphide bridges

Running a query shows that the tool gives an overview of all the features listed above. At first glance the output seemed rather confusing because of the vast amount of information provided. However, I believe that, given some more attention, it is a good tool for gathering a lot of information in one place. A very good feature is that the output is presented in many different views, which can be switched at any time. A quick look at the properties we examined in this task shows that PredictProtein probably does not distinguish very well between signal peptides and transmembrane helices: for α-Galactosidase A it predicts a TM helix where there actually is a signal peptide (position 11-28; see sections Transmembrane helices and Signal peptides).
Searching for the same term at NCBI (see http://www.ncbi.nlm.nih.gov/pubmed?term=protein%20sequence%20prediction) returns a large number of reviews on the topic, mainly on predicting secondary structure and function from the protein sequence, but also on less common topics such as predicting protein solubility (see http://www.ncbi.nlm.nih.gov/pubmed/19549632 and http://www.ncbi.nlm.nih.gov/pubmed/22172487) or the more recently emerged whole-genome sequence approaches (see http://www.ncbi.nlm.nih.gov/pubmed/17709778).
A quite different topic is covered by Smialowski P, Martin-Galiano AJ, Cox J and Frishman D, who give an overview of techniques for predicting experimental properties of proteins from sequence (see http://www.ncbi.nlm.nih.gov/pubmed/17430194), and thus for predicting experimental success in cloning, expression, soluble expression, purification and crystallization.
In summary, it is very hard to find one's way through the vast number of prediction tools and methods, and one can spend a lot of time finding out which one best fits the requirements of one's topic.