Difference between revisions of "Secondary Structure Prediction BCKDHA"
Revision as of 16:21, 25 August 2011
Contents
- 1. Secondary structure prediction
- 2. Prediction of disordered regions
- 3. Prediction of transmembrane alpha-helices and signal peptides
- 4. Prediction of GO terms
- References
1. Secondary structure prediction
General Information
The secondary structure of a protein is based on its primary structure and consists of alpha-helices, beta-sheets and coils.
alpha-helices
Alpha-helices (Figure 1) are built by H-bonds between the NH-group of an amino acid and the CO-group of the amino acid which is placed four residues earlier (i+4). This form of the helix is the most common one. There are two other types of helices which are very rare. One is called the 3,10-helix because the H-bond is between the NH-group and the CO-group three residues earlier (i+3). The other one is the pi-helix, in which the H-bond is between the NH-group and the CO-group five residues earlier (i+5). The different locations of the CO-group influence the width and the height of the helices.
beta-sheets
The H-bonds (Figure 2) between the CO-group and the NH-group which build a beta-sheet can connect residues that are located far away from each other in the sequence.
There are two different kinds of beta-sheets: the parallel one, where the strands all point in the same direction, and the anti-parallel one, where the strands point alternately in opposite directions.
coils
Coils are irregularly formed elements like turns.
PSIPRED
Basic information
author: David T. Jones (University College London)
year: 1998
version: 2
PSIPRED uses feed-forward neural networks with a single hidden layer, trained by back-propagation, to predict the secondary structure.
To run PSIPRED locally, the output of PSI-BLAST (Position Specific Iterated BLAST) is required as input data.
For the online prediction on the server it is enough to enter an amino acid sequence.
Evaluated with a very stringent cross-validation method, PSIPRED reaches an average Q3 score of 80.7%.
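The Q3 score is the percentage of residues whose three-state label (helix H, strand E, coil C) is predicted correctly. The following is a minimal sketch of that calculation; the function name `q3_score` and the two toy strings are assumptions made for illustration and are not taken from PSIPRED or from the BCKDHA prediction.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose three-state label (H, E, C) matches."""
    if len(predicted) != len(observed):
        raise ValueError("both strings must have the same length")
    if not predicted:
        raise ValueError("strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(predicted)


# Toy example (made-up strings, not taken from the BCKDHA prediction)
predicted = "CHHHHHHCCEEEECC"
observed = "CCHHHHHCCEEEECC"
print(f"Q3 = {q3_score(predicted, observed):.1f}%")
```

For a real comparison, both strings would first have to be reduced to the same three states and cover exactly the same residues.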
The prediction is split into three steps. In the first step, sequence profiles are generated: the position-specific scoring matrix from PSI-BLAST serves as input for the neural network. In the next step the secondary structure is predicted. In the last step the output of the secondary structure prediction is filtered.
There are three different options:
- Mask low complexity regions
- Mask transmembrane helices
- Mask coiled-coil regions
References
[PSIPRED Server]
[Overview of prediction methods]
[History of the PSIPRED]
Prediction
Seq MAVAIAAARVWRLNRGLSQAALLLLRQPGARGLARSHPPRQQQQFSSLDD
Pred CHHHHHHHHHHHHHHHCHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCC
UniProt

Seq KPQFPGASAEFIDKLEFIQPNVISGIPIYRVMDRQGQIINPSEDPHLPKE
Pred CCCCCCCCCCCCCCCCCCCCCCCCCCCEEEEECCCCCCCCCCCCCCCCHH
UniProt EEEE HHH HH

Seq KVLKLYKSMTLLNTMDRILYESQRQGRISFYMTNYGEEGTHVGSAAALDN
Pred HHHHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCCCHHHHHHHHHHCCCC
UniProt HHHHHHHHHHHHHHHHHHHHHHHH EEE HHHHHHHH

Seq TDLVFGQYREAGVLMYRDYPLELFMAQCYGNISDLGKGRQMPVHYGCKER
Pred CCEEECCCCHHHHHHHCCCCHHHHHHHHCCCCCCCCCCCCCCCCCCCCCC
UniProt EEE HHHHHH HHHHHHHHH CCCC CCC

Seq HFVTISSPLATQIPQAVGAAYAAKRANANRVVICYFGEGAASEGDAHAGF
Pred CCCCCCCCCCCCHHHHHHHHHHHHHCCCCCEEEEEECCCCCCHHHHHHHH
UniProt C CCCHHHHHHHHHHHHHHH EEEEEE HHH HHHHHHH

Seq NFAATLECPIIFFCRNNGYAISTPTSEQYRGDGIAARGPGYGIMSIRVDG
Pred HHHHHHCCCEEEEEECCCCCCCCCCCHHCCCCHHHHHCCCCCCCCCEECC
UniProt HHHHH EEEEEEE EEE HHH EEE HHH HHH EEEEEE

Seq NDVFAVYNATKEARRRAVAENQPFLIEAMTYRIGHHSTSDDSSAYRSVDE
Pred HHHHHHHHHHHHHHHHHHCCCCEEEEEECCCCCCCCCCCCCCCCCCHHHH
UniProt EEEEEEEEEEEEEEEEEE EEEEEE

Seq VNYWDKQDHPISRLRHYLLSQGWWDEEQEKAWRKQSRRKVMEAFEQAERK
Pred HHHHHHCCCCHHHHHHHHHHCCCCC HHHHHHHHHHHHHHHHHHHHHHHC
UniProt HHHHHHHHHCCCC HHHHHHHHHHHHHHHHHHHHHHHH

Seq PKPNPNLLFSDVYQEMPAQLRKQQESLARHLQTYGEHYPLDHFDK
Pred CCCCHHHHHHHHHCCCCHHHHHHHHHHHHHHHHHCCCCCCCCCCC
UniProt HHHH EEEE HHHHHHHHHHHHHHHHHHHH HHH
PSIPRED predicted 23 coils, 16 alpha helices and 6 beta sheets, as shown in the alignment above. In Figure 3 these predictions are visualized by pink bars, which stand for the alpha helices, and yellow arrows, which symbolize the beta sheets. PSIPRED does not mark coils with a special symbol, which means that where there is neither a bar nor an arrow there is a coil.
As the alignment of the predicted secondary structure with the real one from UniProt shows, the prediction is completely wrong in the beginning. In the middle part it becomes better, but there are still many mistakes. PSIPRED seems to have more problems with beta sheets than with alpha helices: it predicts non-existing beta sheets or misses existing ones more often than it does for alpha helices. In most cases it predicts the alpha helices quite well. The comparison with the UniProt structure shows that especially the long alpha helices are predicted correctly, except for one long region in the middle of the sequence which should be a long beta sheet but is predicted as an alpha helix.
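The numbers of helices, sheets and coils quoted above are counts of uninterrupted runs of H, E and C in the prediction string. A minimal sketch of such a count is given below; the helper name `count_segments` is an assumption, and only the first 50 residues of the PSIPRED prediction are used as example input. To reproduce the totals, the rows of all blocks would have to be joined into one string first, because a segment can continue across a block boundary.

```python
from collections import Counter
from itertools import groupby


def count_segments(pred: str) -> Counter:
    """Count runs of identical states, e.g. 'CHHHC' has 2 coils and 1 helix."""
    return Counter(state for state, _run in groupby(pred))


# First 50 residues of the PSIPRED prediction shown above
first_block = "CHHHHHHHHHHHHHHHCHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCC"
print(count_segments(first_block))
```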
Jpred3
Basic information
author: Cole C, Barber JD & Barton GJ (Bioinformatics and Computational Biology Research, University of Dundee)
year: 1998
version: 3
Jpred uses a neural network to make the predictions. To predict the secondary structure of a protein sequence or of a multiple alignment of protein sequences the algorithm Jnet is used. The prediction accuracy for secondary structures lies above 81%. Additionally, Jpred makes predictions about the solvent accessibility.
Jpred3 needs a protein sequence or a multiple alignment of protein sequences as input.
It is important that the target sequence is the first sequence in the multiple alignment, since the alignment is modified so that the first sequence does not have any gaps. The alignment has to be in the MSF or in the BLC format.
References
Prediction
When predicting the secondary structure of BCKDHA, JPred found many hits with very good e-values in other proteins.
e-value=0.0
2bew, 2bev, 2beu, 1x80, 1wci, 1u5b, 1olx, 1ols, 1dtw, 1x7y, 1x7z, 1x7x, 1x7w, 2j9f, 2bff, 1v1r, 1olu, 1v16, 1v11, 2bfc, 2bfb, 1v1m, 2bfd, 2bfe
e-value=6e-58
1umd, 1umc, 1umb, 1um9
e-value=1e-57
2bp7, 1qs0, 1w85, 3dva, 1w88
With these hits JPred ran the prediction:
Seq MAVAIAAARVWRLNRGLSQAALLLLRQPGARGLARSHPPRQQQQFSSLDD
Pred HHHHHHHHHHHHHH EEE
Conf 10090009999980000000323546777770000303566666777777
UniProt

Seq KPQFPGASAEFIDKLEFIQPNVISGIPIYRVMDRQGQIINPSEDPHLPKE
Pred EEEEE HH
Conf 77777777777777654567777777308885377740467787776368
UniProt EEEE HHH HH

Seq KVLKLYKSMTLLNTMDRILYESQRQGRISFYMTNYGEEGTHVGSAAALDN
Pred HHHHHHHHHHHHHHHHHHHHHHHH E HHHHHHHHHHH
Conf 99999999999999999999875045000001677517899999885278
UniProt HHHHHHHHHHHHHHHHHHHHHHHH EEE HHHHHHHH

Seq TDLVFGQYREAGVLMYRDYPLELFMAQCYGNISDLGKGRQMPVHYGCKER
Pred EEEE HHHHHHHH HHHHHHHHH
Conf 84465157745788885065689988740677754577777545677777
UniProt EEE HHHHHH HHHHHHHHH CCCC CCC

Seq HFVTISSPLATQIPQAVGAAYAAKRANANRVVICYFGEGAASEGDAHAGF
Pred HHHHHHHHHHHH EEEEEE HHHHHHHH
Conf 64132147888770367889998750688558887407887468999999
UniProt C CCCHHHHHHHHHHHHHHH EEEEEE HHH HHHHHHH

Seq NFAATLECPIIFFCRNNGYAISTPTSEQYRGDGIAARGPGYGIMSIRVDG
Pred HHHH EEEEEEE HHHHHHH EEEEE
Conf 87500888606888703677777777777764067777005725774078
UniProt HHHHH EEEEEEE EEE HHH EEE HHH HHH EEEEEE

Seq NDVFAVYNATKEARRRAVAENQPFLIEAMTYRIGHHSTSDDSSAYRSVDE
Pred HHHHHHHHHHHHHHHHH EEEEEEEEEE HHH
Conf 74689999999999988507985588886354067777777765553688
UniProt EEEEEEEEEEEEEEEEEE EEEEEE

Seq VNYWDKQDHPISRLRHYLLSQGWWDEEQEKAWRKQSRRKVMEAFEQAERK
Pred HHHHHH HHHHHHHHHHH HHHHHHHHHHHHHHHHHHHHHHHH
Conf 99998468758999999986068866899999999999999999988606
UniProt HHHHHHHHHCCCC HHHHHHHHHHHHHHHHHHHHHHHH

Seq PKPNPNLLFSDVYQEMPAQLRKQQESLARHLQTYGEHYPLDHFDK
Pred HHHHHHH HHHHHHHHHHHHHHHH
Conf 887368777523688756899999999999875267777777888
UniProt HHHH EEEE HHHHHHHHHHHHHHHHHHHH HHH
Comparing the Jpred prediction with the secondary structure of BCKDHA in UniProt, as done in the alignment above, shows that the prediction differs a lot from UniProt in the beginning but becomes much better in the middle and in the end. Jpred predicts more helices and fewer beta sheets than there are in the UniProt secondary structure. It is interesting that although there are no alpha helices in the beginning, Jpred predicts them with quite high confidence. This high confidence can also be seen very well in the visualization of the prediction (Figure 4), where it is displayed by black bars. There is one part in the middle of the sequence where Jpred predicts a very long alpha helix although it should be a beta sheet. It is interesting that PSIPRED also had problems with this beta sheet. In the rest of the middle part the prediction of Jpred is quite correct except for a few positions. Figure 4 underlines that the protein mainly consists of alpha helices, since mainly red bars are shown.
DSSP
Basic information
author: Wolfgang Kabsch and Chris Sander (Max-Planck-Institut für medizinische Forschung, Heidelberg)
year: 1983
whole name: Define Secondary Structure of Proteins
Based on atomic coordinates in Protein Data Bank format, DSSP defines the secondary structure of a protein.
With this method the secondary structure is not predicted but determined from the 3D coordinates.
References
[Introduction]
[Explanation]
Prediction
Seq KPQFPGASAEFIDKLEFIQPNVISGIPIYRVMDRQGQIINPSEDPHLPKEKVLKLYKSMT
Pred TT T TT T T TTT T 333 HHHHHHHHHHHH
UniProt EEEEE HH

Seq LLNTMDRILYESQRQGRISFYMTNYGEEGTHVGSAAALDNTDLVFGQYREAGVLMYRDYP
Pred HHHHHHHHHHHHHHTTTTT TT HHHHHHHHHTT TTTSSS TT HHHHHHTT
UniProt HHHHHHHHHHHHHH E HHHHHHHHHHH EEEE HHHHHHHH

Seq LELFMAQCYGNISDLGKGRQMPVHYGCKERHFVTISSPLATQIPQAVGAAYAAKRANANR
Pred HHHHHHHHHT TT TTTT T TT TTTT TTTTTHHHHHHHHHHHHHHTT
UniProt HHHHHHHHH HHHHHHHHHHHH

Seq VVICYFGEGAASEGDAHAGFNFAATLECPIIFFCRNNGYAISTPTSEQYRGDGIAARGPG
Pred SSSSSSTT333THHHHHHHHHHHHTT SSSSSSS TSSTTSS333T TTTTT333T33
UniProt EEEEEE HHHHHHHHHHHH EEEEEEE HHHHHHH

Seq YGIMSIRVDGNDVFAVYNATKEARRRAVAENQPFLIEAMTYRIGHHSTSDDSSAYR
Pred 3T SSSSSSTT HHHHHHHHHHHHHHHHHHT SSSSSS T TTTT 333T
UniProt EEEEE HHHHHHHHHHHHHHHHH EEEEEEEEEE

Seq VNYWDKQDHPISRLRHYLLSQGWWDEEQEKAWRKQSRRKVMEAFEQAERKPKPNPNLLFS
Pred HHHHHHT HHHHHHHHHHHHTT HHHHHHHHHHHHHHHHHHHHHHHHT 3333TT
UniProt HHHHHH HHHHHHHHHHH HHHHHHHHHHHHHHHHHHHHHHHH HHHHHH

Seq DVYQEMPAQLRKQQESLARHLQTYGEHYPLDHFDK
Pred TTTTT HHHHHHHHHHHHHHHHH333T 333
UniProt H HHHHHHHHHHHHHHHH
Description of the visualization of the prediction
It is important to know that the first 50 amino acids of the sequence are not shown and that the part relevant for our protein ends at position 391.
1. line: sequence
2. line: structural elements
3. line: if a residue is involved in symmetry contacts it is labeled with a star
4. line: if a residue is solvent accessible it is labeled with an "A"
Letter code for the secondary structure elements:
- H (blue): alpha helix
- 3 (yellow): residue in isolated beta-bridge
- T (red): hydrogen bonded turn
- S (green): bend
As the comparison of the DSSP assignment with the structure of BCKDHA from UniProt shows, they match to a large extent. Especially the alpha helices are mostly assigned correctly. As the blue regions in Figure 5 show, the protein mainly consists of alpha helices, so most of the assignment is exact. The comparison with the UniProt structure also shows that DSSP has some problems assigning beta sheets.
DSSP offers much more information than the two other tools, since it does not only assign alpha helices, beta sheets and turns but also reports symmetry contacts and solvent accessibility.
2. Prediction of disordered regions
General information
Disordered regions are long regions which do not have a regular secondary structure. They are dynamically flexible and only adopt a regular structure when they bind to a substrate or another protein. In these regions polar and charged amino acids, and especially proline, are overrepresented. Disordered regions are conserved and occur mainly in regions which have a regulatory function. Since disordered regions have no clear secondary structure, they also have no tertiary structure.
DISOPRED
Basic information
author: Jonathan J. Ward, Liam J. McGuffin, Kevin Bryson, Bernard F. Buxton and David T. Jones (University College London)
year: 2004
version: 2
DISOPRED2 identifies disordered regions by searching for residues which appear in the sequence records but have no coordinates in the electron density map. This is a very simple criterion for disorder, and it is not perfect, because the absence of coordinates can also be explained by artifacts of the crystallization process.
References
Publication
DISOPRED server
Information
Prediction
The first line of the output gives the confidence of the prediction, which is shown in the second line. A predicted disordered residue is marked with an asterisk (*). All of the disordered regions are predicted with very high confidence.
DISOPRED predicts disordered regions mainly in the beginning and a few in the end of BCKDHA, as shown by the red fields in Figure 6.
Figure 7 on the right side also shows that the disordered regions are in the beginning and in the end, since the highest peaks are located at these two ends.
POODLE
Basic information
POODLE uses machine learning approaches to predict the disordered regions of an amino acid sequence.
author:
- POODLE-L S. Hirose, K. Shimizu, S. Kanai, Y. Kuroda and T. Noguchi
- POODLE-S K. Shimizu, Y. Muraoka, S. Hirose, and T. Noguchi
- POODLE-W K. Shimizu, Y. Muraoka, S. Hirose, K. Tomii and T. Noguchi
- POODLE-I S. Hirose, K. Shimizu, N. Inoue, S. Kanai and T. Noguchi
year:
- POODLE-L 2007
- POODLE-S 2007
- POODLE-W 2007
- POODLE-I 2008
options:
POODLE-L: This tool searches for disordered regions which are longer than 40 consecutive amino acids.
POODLE-S: Here the focus lies on predicting short disordered regions. There are two different subtools: "Missing residues" and "High B-factor residues"
POODLE-W: With this option the proteins which are mostly disordered can be found.
POODLE-I: In this tool the other three tools are combined. POODLE-I also uses structural information to predict disordered regions. It is based on a work-flow approach.
References
[POODLE-L]
[POODLE-S]
[POODLE-W]
[POODLE-I]
[POODLE server]
[Help]
Prediction
POODLE-S
Figure 8: POODLE-S Missing residues | Figure 9: POODLE-S High B-factor residues |
---|---|

option | disordered region | average confidence |
---|---|---|
Missing residues | 1-56 | 0.75 |
Missing residues | 341-345 | 0.58 |
Missing residues | 420-423 | 0.56 |
High B-factor residues | 6-9 | 0.63 |
High B-factor residues | 15-57 | 0.77 |
High B-factor residues | 93 | 0.53 |
High B-factor residues | 95-96 | 0.55 |
High B-factor residues | 340-354 | 0.67 |
High B-factor residues | 379-402 | 0.59 |
POODLE-S (which predicts short disordered regions) with the option "Missing residues" predicted disordered regions at the positions 1-56, 341-345 and 420-423. This is also shown in Figure 8. The peaks in the green region, which lie above the cut-off value of 0.5, stand for the disordered regions. In the beginning there is a very high and also very long peak. This shows that the tool predicts with very high confidence that there is a long region with no fixed structure in the beginning of the protein. The average confidence of 0.75 can also be seen in the table under the figures. The other two values in this table show that the predictions of the two disordered regions in the end of the protein do not have a very high confidence.
We also ran POODLE-S with the option "High B-factor residues". Here disordered regions were predicted at the positions 6-9, 15-57, 93, 95-96, 340-354 and 379-402. This is also shown in Figure 9. This option predicts more regions with no fixed structure, but as with the option "Missing residues" they are in the beginning and in the end of the protein. Comparing Figure 9 with Figure 8 shows that the predictions in the end are made with more confidence in the second run with "High B-factor residues". The peaks are much higher and also longer, which shows that the predicted disordered regions are longer.
In both runs the POODLE-S score varies a lot in the middle part of the protein between the peaks. There are always small peaks, but they are not high enough to exceed the cut-off value.
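The way a per-residue score profile is turned into the regions and average confidences listed in the table can be sketched as follows. This is only an illustration of thresholding at the cut-off value of 0.5; the function name `disordered_regions` and the score values are made up and are not POODLE output.

```python
from statistics import mean


def disordered_regions(scores, cutoff=0.5):
    """Return (start, end, average score) for each run of residues scoring above the cut-off.

    Positions are 1-based and inclusive, as in the tables of this page.
    """
    regions, start = [], None
    for pos, score in enumerate(scores, start=1):
        if score > cutoff and start is None:
            start = pos  # a new region opens here
        elif score <= cutoff and start is not None:
            regions.append((start, pos - 1, round(mean(scores[start - 1:pos - 1]), 2)))
            start = None
    if start is not None:  # region running to the C-terminus
        regions.append((start, len(scores), round(mean(scores[start - 1:]), 2)))
    return regions


# Made-up per-residue scores, for illustration only
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.6, 0.4, 0.55, 0.65]
print(disordered_regions(scores))
```

Small peaks that stay below the cut-off, like those in the middle part of the protein, produce no region at all in such a scheme.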
POODLE-L
disordered region | 1-48 | 369-428 |
average confidence | 0.6 | 0.67 |
POODLE-L predicts two disordered regions which are longer than 40 amino acids. They are located at the positions 1-48 and 369-428. Figure 10 shows that the predictions are in the beginning and in the end of the protein. But both predictions only have low peaks, so POODLE-L is not completely confident about them. This observation is supported by the average confidence values of 0.6 and 0.67. A possible explanation is that POODLE-L searches for long disordered regions, and the length of the two regions, only slightly above 40 amino acids, is perhaps too short to be a very good match.
Since POODLE-L only looks for long disordered regions, it predicts no disordered regions in the rest of the protein. This is supported by Figure 10, where there are no small peaks in the middle of the plot.
POODLE-W
In Figure 11 the regions which could be disordered, but about which POODLE is not sure, are bordered by blue squares, and the regions which are certainly disordered are bordered by red squares.
0=ordered regions
5=perhaps disordered regions
9=disordered regions
In this case there is no predicted disordered region in the beginning of the protein, which is completely different from the other two POODLE tools we already used. Instead, the prediction of the disordered region in the end is very good: the confidence is high, and the stretch which is predicted to be disordered is very long and reaches to the end of the protein. The first part of the disordered region does not have a high confidence. But the major part of the match is assigned the highest possible confidence of 9, which is marked by the red box in Figure 11.
POODLE-I
disordered region | 1-56 | 341-345 | 370-427 | 443-445 |
average confidence | 0.6 | 0.56 | 0.67 | 0.74 |
POODLE-I predicted four disordered regions at the positions 1-56, 341-345, 370-427 and 443-445. These predictions are shown in Figure 12, where we can see that they are in the beginning and in the end of the protein. The peak in the beginning is quite long, but in its middle it falls very low, so that it is nearly under the cut-off value. That is why the average value is also low. But we can see in the plot (Figure 12) that this peak has two maxima, both around 0.7, which underlines that the prediction is quite certain. The next peak is very short and also has a poor average confidence of 0.56, so it seems that POODLE-I is not sure about this prediction. The third peak is longer than the other peaks and additionally has a good average confidence value of 0.67. The peak directly at the end of the protein has the highest value, but that is understandable since the structure is always less defined in the end of a protein. So we have to be careful with this hit, because it can also be wrong.
Between the predicted regions there are also many small peaks which are not high enough to come over the threshold.
Comparison
POODLE-S (Missing residues) | POODLE-S (High B-factor residues) | POODLE-L | POODLE-W | POODLE-I |
---|---|---|---|---|
1-56 | 6-9 | 1-48 | 325-445 | 1-56 |
341-345 | 15-57 | 369-428 | | 341-345 |
420-423 | 93 | | | 370-427 |
| 95-96 | | | 443-445 |
| 340-354 | | | |
| 379-402 | | | |
Comparing all the POODLE tools, we can summarize that the disordered regions are mainly in the beginning or in the end of the protein. Only POODLE-S predicts them in the middle of the protein, but there the regions are so short and the confidence is so low that it is not certain whether they are really disordered regions. The predicted disordered regions lie mainly at the positions 1-56 and 341-445. It is not surprising that the disordered regions are in the beginning and in the end of the protein, since the structure is generally not very well defined in these regions. So such a hit can also be a false positive caused only by the poor definition of the secondary structure.
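One way to make this comparison more systematic is to count, for every residue, how many of the five POODLE runs call it disordered. The sketch below does this with the regions from the comparison table; the function names and the majority threshold of three tools are assumptions chosen for illustration, not part of POODLE.

```python
from collections import Counter

# Regions copied from the comparison table above (1-based, inclusive)
predictions = {
    "POODLE-S (Missing residues)": [(1, 56), (341, 345), (420, 423)],
    "POODLE-S (High B-factor residues)": [(6, 9), (15, 57), (93, 93), (95, 96),
                                          (340, 354), (379, 402)],
    "POODLE-L": [(1, 48), (369, 428)],
    "POODLE-W": [(325, 445)],
    "POODLE-I": [(1, 56), (341, 345), (370, 427), (443, 445)],
}


def support_per_position(predictions):
    """Count, for every residue, how many tools call it disordered."""
    support = Counter()
    for regions in predictions.values():
        covered = set()
        for start, end in regions:
            covered.update(range(start, end + 1))
        support.update(covered)  # each tool counts once per residue
    return support


def consensus_positions(predictions, min_tools):
    """Sorted positions that at least min_tools tools predict as disordered."""
    support = support_per_position(predictions)
    return sorted(pos for pos, n in support.items() if n >= min_tools)


majority = consensus_positions(predictions, min_tools=3)
print(majority[0], majority[-1])
```

Short isolated hits such as position 93 are supported by a single run only and therefore drop out of the consensus.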
IUPred
Basic information
author: Zsuzsanna Dosztányi, Veronika Csizmók, Péter Tompa and István Simon
year: 2005
IUPred predicts disordered regions by estimating the capacity of polypeptides to form stabilizing contacts. The potential to form these contacts depends on the surrounding sequence and on the chemical properties. This approach is based on the idea that disordered regions have no capacity to form sufficient interresidue interactions so that there is no stabilizing energy.
There are three different prediction types which can be chosen:
- long disorder: predicts context-independent global disorder that encompasses at least 30 consecutive residues of predicted disorder
- short disorder: predicts short, probably context-dependent, disordered regions, such as missing residues in the X-ray structure of an otherwise globular protein
- structured regions: takes the energy profile and finds continuous regions that are confidently predicted to be ordered
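The length criterion of the long disorder type can be pictured as a filter that keeps only regions of at least 30 consecutive residues. The sketch below illustrates just this criterion and is not how IUPred works internally; the function name `long_regions` and the candidate regions are hypothetical.

```python
def long_regions(regions, min_length=30):
    """Keep only (start, end) regions spanning at least min_length residues."""
    return [(s, e) for s, e in regions if e - s + 1 >= min_length]


# Hypothetical regions, chosen only to show the effect of the length threshold
candidates = [(1, 12), (40, 75), (100, 129), (200, 228)]
print(long_regions(candidates))
```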
References
[IUPred server]
[Theory]
Prediction
Prediction type: long disorder
disordered region | 33-50 | 89-93 | 385-388 | 390-397 | 399-401 | 404-413 | 420-422 | 424-428 | 431 |
average confidence | 0.69 | 0.57 | 0.52 | 0.64 | 0.51 | 0.55 | 0.52 | 0.56 | 0.55 |
When using the long disorder tool of IUPred, it predicts several disordered regions. They are located at the positions 33-50, 89-93, 385-388, 390-397, 399-401, 404-413, 420-422, 424-428 and at position 431. Although there are many different regions, they are all located in the beginning or in the end of the protein. Figure 13 shows that mainly the peak in the beginning has a high confidence. Since this hit is quite long, it also contains residues which do not have a high confidence. That is why the average value is only 0.69, which is still quite good. The second peak is very short and additionally has a weak confidence, so it is not certain whether this is a real hit. In the end there are many predicted disordered regions, but except for one, all of these predictions are quite uncertain since the confidence value is only slightly above the cut-off value.
Between all these predicted disordered regions there are many peaks which are only slightly below the threshold. Judging from Figure 13, the whole protein except for the middle part could be part of a disordered region.
Prediction type: short disorder
disordered region | 1 | 33-55 | 92-93 | 393-411 | 415 | 420-421 | 423-425 | 427-428 | 433 | 438-445 |
average confidence | 0.56 | 0.7 | 0.56 | 0.57 | 0.5 | 0.53 | 0.53 | 0.53 | 0.51 | 0.73 |
When using the short disorder tool of IUPred, it predicts several disordered regions. They are located at the positions 1, 33-55, 92-93, 393-411, 415, 420-421, 423-425, 427-428, 433 and 438-445. The hit at position 1 can be neglected because it is just one residue long and its confidence value is only 0.56. The next predicted disordered region seems to be important because it is about 20 residues long and has an average confidence value of 0.7. Figure 14 shows that the average is only 0.7 because the peak is so long. The maximum confidence value of this region is about 0.8, which signals the high confidence of this disordered region. The next hit is only two residues long and has a maximum value of 0.57. Since it is so short, it can also be neglected. After these predictions in the beginning of the protein there are many very short regions in the end of the protein. Apart from the first (393-411) and the last (438-445) of them, they are only one to three residues long. There are two possibilities for these short regions. Either they are regarded as too short to be true disordered regions, which is supported by the low confidence values, or they have to be combined into one long disordered region. The second possibility is supported by the fact that all these short regions lie next to each other. Since all the other programs also predicted a disordered region in the end of the protein, we decided to take the second possibility. The last hit is the most significant one. It is only eight residues long, but the average confidence value is 0.73 and the maximum value is higher than 0.9. Such a clear prediction of a disordered region in the end of the protein is not surprising, because this part of a protein normally has no defined fixed structure. However, the lack of a defined secondary structure does not mean that there is no function.
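Combining neighbouring short hits into one long region can be sketched as merging all regions that are separated by only a few residues. The gap tolerance of 4 residues used below is an assumption chosen for illustration, as is the function name `merge_regions`; the regions are the ones from the short disorder table.

```python
def merge_regions(regions, max_gap):
    """Merge neighbouring (start, end) regions separated by at most max_gap residues."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# IUPred short-disorder regions from the table above (1-based, inclusive)
short_disorder = [(1, 1), (33, 55), (92, 93), (393, 411), (415, 415),
                  (420, 421), (423, 425), (427, 428), (433, 433), (438, 445)]
print(merge_regions(short_disorder, max_gap=4))
```

With this tolerance the hits in the beginning stay separate, while all hits from position 393 onwards collapse into a single region that reaches the end of the protein.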
Prediction type: structured regions
When analyzing the sequence with the option "structured regions", the program did not find any disordered regions in the whole protein. Its only output is the information "Unkown globular domains: 1-445" and Figure 15.
META-Disorder
To run META-Disorder we used the tool of PredictProtein Server. <ref> https://www.predictprotein.org/ </ref>
Prediction
disordered region | 1-9 | 394-400 |
average confidence | 0.63 | 0.57 |
When predicting the disordered regions with META-DISORDER, we only got two regions as possible disordered regions. This can be seen in Figure 15, where only the beginning and the end of the strand are red and the rest is green. The table also shows that the regions lie right at the beginning and at the end. Since there are no other possible disordered regions in this prediction and the green part seems to be clearly not disordered, these two hits could be wrong. It is not certain, but in general these regions of a protein have no well-defined structure although they have a function.
Comparison
Comparing the results of all disordered region prediction tools, we can see that all of them predicted disordered regions in the beginning and in the end of the protein. We have to be careful with these results, because in these regions the structure of a protein is generally not very well defined. The hits may therefore arise from the poor definition of the secondary structure in these regions. But we also have to see that all of the programs predicted these regions, and most of them with high confidence. So it has to be considered that the beginning and the end of BCKDHA may be disordered regions.
3. Prediction of transmembrane alpha-helices and signal peptides
General
Transmembrane Topology
The prediction of the membrane topology of proteins aims at discovering which portions of the protein lie within the lipid bilayer of a membrane and which portions protrude from the membrane into the watery environment. Membrane spanning polypeptides usually form helices of about 20 amino acids length. As the surrounding membrane is hydrophobic, the membrane spanning part of the protein consists of hydrophobic amino acids as well. This information can be used for the prediction of transmembrane helices, which subsequently enables the prediction of the membrane topology. <ref> http://en.wikipedia.org/wiki/Membrane_topology</ref><ref>http://en.wikipedia.org/wiki/Transmembrane_domain</ref>
Prediction tools: TMHMM, OCTOPUS and SPOCTOPUS
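The idea that a membrane spanning helix is a stretch of about 20 hydrophobic residues can be illustrated with a classical sliding-window hydropathy scan. This is a deliberately simple stand-in and not the method used by TMHMM, OCTOPUS or SPOCTOPUS. The sketch uses the Kyte-Doolittle hydropathy scale with a window of 19 residues and a threshold of 1.6; the function names and the artificial test sequence are assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy values per amino acid
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window; entry i covers residues i+1 .. i+window."""
    values = [KD[aa] for aa in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """1-based start positions of windows hydrophobic enough to span a membrane."""
    return [i + 1 for i, h in enumerate(hydropathy_profile(seq, window)) if h > threshold]


# Artificial sequence: charged flanks around a 20-residue hydrophobic stretch
toy = "KKDDEERRKK" + "LLIVALLIVALLIVALLIVA" + "KKDDEERRKK"
print(candidate_tm_windows(toy))
```

A scan like this cannot distinguish a transmembrane helix from the hydrophobic core of a signal peptide, which is one reason why dedicated tools are used in the following sections.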
Signal Peptides
Signal peptides are N-terminal sequence motifs directing proteins to their cellular destination, such as the secretory pathway, mitochondria or chloroplasts.
One example of a signal peptide is the secretory signal peptide (SP), which is an N-terminal peptide that is typically 15-30 amino acids long. A signal peptide has three regions: an N-terminal region (n-region) which is often built up by positively charged residues, a hydrophobic region (h-region) of at least six residues in the middle, and a C-terminal region (c-region) of polar uncharged residues. In eukaryotes the SP targets proteins across the membrane of the endoplasmic reticulum, in prokaryotes across the plasma membrane. The SP is cleaved when the protein crosses the membrane.
Furthermore, there exist chloroplast transit peptides (cTP), which are also N-terminal and are cleaved when the protein enters the chloroplast. The most conserved site in cTPs is an alanine directly after the N-terminal methionine...
<ref>O. Emanuelsson, S. Brunak, G. von Heijne, H. Nielsen, "Locating proteins in the cell using TargetP, SignalP and related tools", Nature Protocols, 2007</ref>
Prediction tools: SignalP, TargetP
Combined transmembrane and signal peptide prediction
As the hydrophobic regions of a transmembrane helix and of a signal peptide are highly similar, the two types of prediction are easily confused with each other. <ref>http://www.ebi.ac.uk/Tools/phobius/help.html</ref>
Prediction tools: Phobius and Polyphobius
In the following section different tools for predicting transmembrane helices and signal peptides are tested. As the BCKDHA protein isn't a transmembrane protein, additional proteins were used for the transmembrane and signal peptide analysis:
name | organism | location | transmembrane protein | signal peptide | function | reference |
---|---|---|---|---|---|---|
A4_HUMAN | Human | Cell membrane | yes | yes | Protease Inhibitor | P05067 |
BACR_HALSA | Halobacterium salinarium | Cell membrane | yes | no | ion transport | P02945 |
INSL5_HUMAN | Human | extracellular region | no | yes | hormone | Q9Y5Q6 |
LAMP1_HUMAN | Human | Cell membrane, Lysosome membrane, Endosome membrane | yes | yes | Presents carbohydrate ligands to selectins | P11279 |
RET4_HUMAN | Human | extracellular space | no | yes | Transport | P02753 |
TMHMM
Method
- Was developed by Sonnhammer, von Heijne and Krogh in 1998 <ref> E.L. Sonnhammer, G. von Heijne and A. Krogh, A hidden Markov model for predicting transmembrane helices in protein sequences, Proc Int Conf Intell Syst Mol Biol.(1998)</ref>
- Predicts the transmembrane topology of membrane-spanning proteins
- Is a membrane topology prediction method based on a hidden Markov model with an architecture of 7 types of states
- Required input: protein sequence in FASTA format
- Can also be run on the TMHMM server
Execution
Before we could execute TMHMM we had to change all occurrences of "/usr/local/bin/" to "/usr/bin" in the following files: tmhmm, tmhmm.ORIG and tmhmmformat.pl
To execute the program we used these commands:
- tmhmm P05067.fasta > tmhmm_out_P05067.txt
- tmhmm P02945.fasta > tmhmm_out_P02945.txt
- tmhmm Q9Y5Q6.fasta > tmhmm_out_Q9Y5Q6.txt
- tmhmm P11279.fasta > tmhmm_out_P11279.txt
- tmhmm P02753.fasta > tmhmm_out_P02753.txt
- tmhmm P12694.fasta > tmhmm_out_P12694.txt
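The six calls above follow one pattern, so they can be generated from the list of accessions. The following is a minimal sketch, assuming that `tmhmm` is on the PATH and that the FASTA files are named after their Uniprot accession; the helper names `tmhmm_jobs` and `run_jobs` are hypothetical and not part of TMHMM.

```python
import subprocess

# Uniprot accessions of the six analysed proteins (taken from the commands above)
ACCESSIONS = ["P05067", "P02945", "Q9Y5Q6", "P11279", "P02753", "P12694"]

def tmhmm_jobs(accessions):
    """Return (command, output file) pairs that mirror the commands listed above."""
    return [(["tmhmm", f"{acc}.fasta"], f"tmhmm_out_{acc}.txt") for acc in accessions]

def run_jobs(jobs):
    """Run each command and redirect its standard output into the output file."""
    for command, out_path in jobs:
        with open(out_path, "w") as out:
            subprocess.run(command, stdout=out, check=True)
```

Calling `run_jobs(tmhmm_jobs(ACCESSIONS))` would then produce one output file per protein.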
Results
BCKDHA
Position | Membrane topology |
---|---|
1-445 | outside |
TMHMM predicted no membrane spanning region for the BCKDHA protein, which corresponds to the information provided in Uniprot.
A4_HUMAN
Position | Membrane topology |
---|---|
1-700 | outside |
701-723 | TMhelix |
724-770 | inside |
TMHMM predicted one transmembrane helix for A4_HUMAN, which agrees with the Uniprot annotation. The predicted helix begins at position 701, whereas Uniprot places the transmembrane region at positions 700-723, as can be seen in Figure 16. The extracellular region reported by Uniprot begins at position 18 because the protein starts with a signal peptide. TMHMM does not predict signal peptides and therefore assigned positions 1-700 to the extracellular region.
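The agreement between a predicted and an annotated transmembrane segment can also be expressed as a number. The following is a minimal sketch, assuming 1-based inclusive residue ranges; the function name `segment_overlap` and the choice of the Jaccard index as measure are illustrative and not part of TMHMM or Uniprot.

```python
def segment_overlap(pred, ref):
    """Jaccard index of two 1-based, inclusive residue ranges given as (start, end)."""
    shared = max(0, min(pred[1], ref[1]) - max(pred[0], ref[0]) + 1)
    union = (pred[1] - pred[0] + 1) + (ref[1] - ref[0] + 1) - shared
    return shared / union

# A4_HUMAN: TMHMM helix 701-723 versus the Uniprot region 700-723
a4_overlap = segment_overlap((701, 723), (700, 723))
```

For the A4_HUMAN helix the two ranges share 23 of 24 residues, so the prediction is off by a single position.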
BACR_HALSA
Position | Membrane topology |
---|---|
1-22 | outside |
23-42 | TMhelix |
43-54 | inside |
55-77 | TMhelix |
78-91 | outside |
92-114 | TMhelix |
115-120 | inside |
121-143 | TMhelix |
144-147 | outside |
148-170 | TMhelix |
171-189 | inside |
190-212 | TMhelix |
213-262 | outside |
The TMHMM prediction differs slightly from the information provided in Uniprot, as can be seen in Figure 17. TMHMM predicted only 13 different domains of the protein (with the end of the protein in the extracellular space), whereas Uniprot reports 15 domains (with the protein ending in the cytoplasm).
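The domain count can be checked directly from the topology table. The following is a minimal sketch that uses the BACR_HALSA rows from the table above; the simplified "range label" text layout and the function name `parse_topology` are assumptions made for illustration and do not reproduce the TMHMM output format.

```python
# BACR_HALSA topology as listed in the table above (1-based, inclusive ranges)
BACR_HALSA_TMHMM = """\
1-22 outside
23-42 TMhelix
43-54 inside
55-77 TMhelix
78-91 outside
92-114 TMhelix
115-120 inside
121-143 TMhelix
144-147 outside
148-170 TMhelix
171-189 inside
190-212 TMhelix
213-262 outside
"""

def parse_topology(text):
    """Turn 'start-end label' lines into (start, end, label) tuples."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue
        span, label = line.split()
        start, end = (int(x) for x in span.split("-"))
        segments.append((start, end, label))
    return segments

segments = parse_topology(BACR_HALSA_TMHMM)
n_helices = sum(1 for _, _, label in segments if label == "TMhelix")
```

With these rows, `segments` holds 13 domains, six of which are transmembrane helices.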
INSL5_HUMAN
Position | Membrane topology |
---|---|
1-135 | outside |
The TMHMM prediction agrees with the fact that INSL5_HUMAN is a hormone and is therefore secreted into the extracellular region. This information is provided by UniProt and can be seen in Figure 18.
LAMP1_HUMAN
Position | Membrane topology |
---|---|
1-10 | inside |
11-33 | TMhelix |
34-383 | outside |
384-406 | TMhelix |
407-417 | inside |
The TMHMM prediction for LAMP1_HUMAN agrees only partially with the Uniprot annotation, as a comparison of the TMHMM results with the UniProt information shown in Figure 19 reveals. The sequence parts belonging to the signal peptide and the lumenal domain are predicted to be an additional transmembrane helix and an extracellular domain. The second transmembrane helix is predicted correctly.
RET4_HUMAN
Position | Membrane topology |
---|---|
1-201 | outside |
The TMHMM prediction for RET4_HUMAN is correct, as RET4_HUMAN is a secreted protein and does not span any membrane.
Phobius and Polyphobius
Methods
- Phobius was developed by Käll et al. <ref>Käll et al., "A Combined Transmembrane Topology and Signal Peptide Prediction Method", Journal of Mol. Biology,338(5):1027-1036, 2004 </ref>
- Combined prediction of transmembrane regions and signal peptides
- Required input information: only sequence in FASTA-Format (20 amino acids and B, Z, X are recognized)
- As transmembrane topology and signal peptides are likely to be conserved during evolution, Polyphobius was established <ref>Käll et al., "An HMM posterior decoder for sequence feature prediction that includes homology information", Bioinformatics, 21 (Suppl 1):i251-i257, 2005</ref>, which includes information from homologous sequences to the query.
- Required input, two options: a query sequence in FASTA format, which is then BLASTed against uniprot_trembl, or an uploaded alignment in FASTA format that provides information about homologs.
Results
A4_HUMAN
Phobius | Polyphobius |
---|---|
Comparing the results of Phobius and Polyphobius shows that the two tools make largely the same prediction, which is also visible in Figure 20 and Figure 21. Both predicted the signal peptide and the membrane topology of A4_HUMAN correctly. The annotated signal peptide and membrane topology of A4_HUMAN can be found in Figure 16.
BACR_HALSA
Phobius | Polyphobius |
---|---|
The predictions of Phobius and Polyphobius differ only slightly in the lengths of the individual domains, as the two tables above show. A comparison of Figure 22 with Figure 23 likewise shows that they are largely the same and differ only slightly in the posterior label probability of the cytoplasmic and non-cytoplasmic regions. Both membrane topology predictions are correct, as a comparison with Figure 17 shows.
INSL5_HUMAN
Phobius | Polyphobius |
---|---|
The Phobius and Polyphobius predictions for INSL5_HUMAN agree with the information given on Uniprot (Figure 18). The results in the table above and a comparison of Figure 24 with Figure 25 show that both tools correctly predicted a signal peptide and a single extracellular region of the protein.
LAMP1_HUMAN
Phobius | Polyphobius |
---|---|
Comparing the results of Phobius and Polyphobius listed in the table above and shown in Figure 26 and Figure 27, we can assume that the two tools made the same predictions. To find out whether these results are correct, we compared them with the information provided by UniProt (Figure 19) and can conclude that the signal peptide and membrane topology predictions made by Phobius and Polyphobius for LAMP1_HUMAN are correct.
RET4_HUMAN
Phobius | Polyphobius |
---|---|
Both tools made nearly the same prediction, as the table above and the visualization of the two predictions (Figure 28, Figure 29) show. Both predict the signal peptide of RET4_HUMAN correctly, as well as the single extracellular region of the protein.
For the BCKDHA protein, Phobius predicted a signal peptide at the beginning of the sequence with a probability of about 90%. The predicted signal peptide is 34 amino acids long. This roughly matches the information given on Uniprot, which states that BCKDHA contains a 45-amino-acid signal peptide for the transfer into the mitochondrion. The rest of the sequence is predicted to be non-cytoplasmic. No part of the protein is predicted to span a membrane. This is also correct, as BCKDHA is located in the mitochondrial matrix according to Uniprot.
BCKDHA
Phobius | Polyphobius |
---|---|
Considering the information given on Uniprot, Polyphobius performed worse than Phobius on the BCKDHA sequence. It predicted no signal sequence at the beginning of the protein. There is a low probability for positions 1-45 to be a signal sequence, but overall the whole sequence is predicted to be non-cytoplasmic, as shown in Figure 31. In contrast, Phobius predicted a signal sequence at positions 1-34 with a very high probability, which is clearly visible in Figure 30.
OCTOPUS and SPOCTOPUS
Methods
- OCTOPUS was developed by Viklund and Elofsson in 2008 <ref>Håkan Viklund and Arne Elofsson, "Improving topology prediction by two-track ANN-based preference scores and an extended topological grammar", Bioinformatics (2008)</ref>
- OCTOPUS (obtainer of correct topologies for uncharacterized sequences) uses a combination of hidden Markov models and artificial neural networks.
- It creates a sequence profile by doing a BLAST search to obtain homologous sequences. The profile is used as input for a neural network that predicts the probability for each residue to be located in a transmembrane(M), interface (I), close loop (L), or globular loop (G) environment as well as the preference to be inside (i) or outside (o) of the membrane. A hidden Markov model is used to calculate the most likely Protein Topology.
- Required input: Protein Sequence in FASTA-Format
- SPOCTOPUS (Viklund et al., 2008<ref>Viklund et al., "A combined predictor of signal peptides and membrane protein topology", Bioinformatics (2008)</ref>) is an extension of OCTOPUS which also predicts signal peptides. A neural network is used to predict a signal peptide preference score. The signal peptide's location is determined by a hidden Markov model. The output contains the information retrieved by OCTOPUS as well as the probability that a residue is located N-terminal of a signal peptide (n) or in a signal peptide (S).
- Required input information: Protein sequence in FASTA-Format
Results
A4_HUMAN
When we compare the results of OCTOPUS and SPOCTOPUS with each other, we can see that both tools predict the membrane topology for A4_HUMAN. The output is visualized in Figure 32, where the brown line shows that the protein is mainly in the non-cytoplasmic region. SPOCTOPUS also detected the signal peptide. Comparing the predictions with the information provided by UniProt shows that the predictions of both tools are correct.
BACR_HALSA
The predictions made by OCTOPUS and SPOCTOPUS for BACR_HALSA are identical and correct. The results are visualized in Figure 33. The red bars show that the protein lies mainly in the transmembrane region. In addition, the alternating brown and green lines indicate that the protein alternates between the non-cytoplasmic and the cytoplasmic region. SPOCTOPUS predicted no signal peptide, which agrees with the information given in Uniprot.
INSL5_HUMAN
When we compare the results of the predictions from OCTOPUS and SPOCTOPUS we can see that both of them predict that the protein is in a non-cytoplasmic region after position 22 or 23. This conclusion is supported by the brown line in Figure 34.
The figure also shows that the two tools made different predictions for the first part of the protein. SPOCTOPUS predicted the signal peptide of INSL5_HUMAN, while OCTOPUS predicted a transmembrane domain for the same part of the sequence. Comparing the results with the information in UniProt shows that the signal peptide predicted by SPOCTOPUS is correct.
LAMP1_HUMAN
Looking at the visualization of the results (Figure 35), we can see that the two tools made largely the same predictions, but they differ at the beginning of the protein. While SPOCTOPUS predicted the beginning as a signal peptide, OCTOPUS assigned an additional inside region and a transmembrane helix to the part of the sequence that contains the signal peptide. As we know from UniProt, the prediction of SPOCTOPUS is the right one because LAMP1_HUMAN has a signal peptide at the beginning of the protein.
RET4_HUMAN
Again the two tools made nearly the same predictions and differ only at the beginning of the protein. As we can see in Figure 36, both predict the protein to be mainly in a non-cytoplasmic region, but while SPOCTOPUS predicts the beginning to be a signal peptide, OCTOPUS assigns this region to a transmembrane helix. Comparing the two predictions with the information provided by UniProt shows that SPOCTOPUS is correct, as there is a signal peptide at the beginning of the protein.
BCKDHA
The OCTOPUS and SPOCTOPUS predictions for the BCKDHA protein are completely contrary in terms of the intracellular and extracellular regions, which becomes very clear in Figure 37. Both predictions are wrong, however, as BCKDHA is not a membrane protein. Furthermore, SPOCTOPUS missed the 45-amino-acid signal peptide at the beginning of the sequence.
SignalP
Method
- SignalP was established by Nielsen et al. in 1997<ref>Nielsen et al., "Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites", Protein Engineering, 10:1-6, 1997</ref>
- Based on neural networks as well as hidden Markov models
- Uses three different scores for the prediction with neural networks:
- S-score (score for the signal peptide)
- C-score (score for the cleavage site)
- Y-score (combination of the S-score and the C-score, which locates the cleavage site more precisely)
- Identifies signal peptides and cleavage sites
- Makes predictions for three different organism groups:
- eukaryotes
- Gram-negative bacteria
- Gram-positive bacteria
- can also be run on the SignalP server
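How the Y-score can combine the other two scores is sketched below. This is a minimal sketch, assuming the Y-score at a position is the geometric mean of the C-score and the drop in mean S-score across that position; the window size `d`, the clamping of negative drops to zero, the function name `y_scores` and the toy score vectors are illustrative assumptions and are not taken from the SignalP program itself.

```python
import math

def y_scores(c, s, d=2):
    """Combine per-residue C- and S-scores into Y-scores (0-based positions).

    Assumed form: Y_i = sqrt(C_i * dS_i), where dS_i is the mean S-score of the
    d residues before position i minus the mean S-score of the d residues
    starting at position i. Positions without a full window keep the value 0.
    """
    n = len(c)
    y = [0.0] * n
    for i in range(d, n - d + 1):
        before = sum(s[i - j] for j in range(1, d + 1)) / d
        after = sum(s[i + j] for j in range(d)) / d
        drop = max(before - after, 0.0)  # a rising S-score gives no support
        y[i] = math.sqrt(c[i] * drop)
    return y

# Toy example: the S-score drops from 1 to 0 exactly where the C-score peaks
toy_s = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
toy_c = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
toy_y = y_scores(toy_c, toy_s)
```

In this toy case the Y-score is high only at the position where a high C-score coincides with the end of the S-score plateau, which is the motivation for using it to place the cleavage site.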
Execution
To run the command line SignalP tool, the path in the SignalP file had to be adapted to /apps/signalp-3.0
The following commands were used to execute SignalP:
- signalp -t euk P05067.fasta > signalp_out_P05067.txt
- signalp -t gram- P02945.fasta > signalp_out_P02945.txt
- signalp -t euk Q9Y5Q6.fasta > signalp_out_Q9Y5Q6.txt
- signalp -t euk P11279.fasta > signalp_out_P11279.txt
- signalp -t euk P02753.fasta > signalp_out_P02753.txt
- signalp -t euk P12694.fasta > signalp_out_P12694.txt
Results
BCKDHA
Both methods (NN and HMM) predicted the most likely cleavage site between positions 32 and 33 (ARG_LA). This is clearly shown by the red lines in Figure 38.
This prediction does not agree with Uniprot, where a signal peptide from position 1-45 is listed.
A4_HUMAN
With both methods, SignalP predicted a cleavage site between positions 17 and 18 with a high probability for a signal peptide.
SignalP predicted the cleavage site for A4_HUMAN correctly.
BACR_HALSA
Both methods (NN and HMM) predicted no cleavage site, and therefore no signal peptide, in the BACR_HALSA sequence.
This is also true according to Uniprot, where no signal peptide is stated.
INSL5_HUMAN
For the INSL5_HUMAN protein, SignalP detected a cleavage site between positions 22 and 23, which follows a predicted signal peptide at the beginning of the sequence.
The signal peptidase I cleavage site was predicted correctly, as Uniprot states a signal peptide from positions 1-22.
LAMP1_HUMAN
With both methods, SignalP predicted a cleavage site between positions 28 and 29, following a detected signal peptide.
The cleavage site prediction made by SignalP for LAMP1_HUMAN is correct. Uniprot shows a signal peptide for this protein which ranges from 1-28 in the sequence.
RET4_HUMAN
SignalP predicted a cleavage site with high probability between positions 18 and 19 in both the NN and the HMM method. This cleavage site is predicted to be after a signal peptide.
This prediction is correct according to Uniprot.
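The comparisons above can be collected in a small table and checked programmatically. The following is a minimal sketch that uses only the positions stated in the text; the Uniprot end for A4_HUMAN is inferred from the statement that its extracellular region begins at position 18, RET4_HUMAN is left out because its Uniprot range is not given here, and the names `SIGNAL_PEPTIDE_END` and `cleavage_offsets` are hypothetical.

```python
# protein: (last residue before the SignalP cleavage site,
#           last residue of the signal peptide according to Uniprot)
SIGNAL_PEPTIDE_END = {
    "BCKDHA": (32, 45),
    "A4_HUMAN": (17, 17),  # Uniprot end inferred: extracellular region starts at 18
    "INSL5_HUMAN": (22, 22),
    "LAMP1_HUMAN": (28, 28),
}

def cleavage_offsets(table):
    """Return, per protein, how many residues the prediction ends before the annotation."""
    return {name: ref - pred for name, (pred, ref) in table.items()}

offsets = cleavage_offsets(SIGNAL_PEPTIDE_END)
mismatches = {name: off for name, off in offsets.items() if off != 0}
```

Only BCKDHA ends up in `mismatches`, with the predicted cleavage site 13 residues before the annotated end of the signal peptide.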
TargetP
Method
- TargetP was developed by Emanuelsson et al. in 2000 <ref>Emanuelsson et al., "Predicting subcellular localization of proteins based on their N-terminal amino acid sequence", J. Mol. Biol., 300: 1005-1016, 2000</ref>
- TargetP predicts the subcellular location of eukaryotic proteins
- Additionally it can make cleavage site predictions
- This method is neural network based. The prediction is based on the N-terminal presequences:
- chloroplast transit peptide(cTP)
- mitochondrial targeting peptide (mTP)
- secretory pathway signal peptide (SP)
- Required input information: Sequence(s) in FASTA format, organism group
- The prediction can also be run on the TargetP server
Results
All TargetP predictions are shown in the table in Figure 39. ODBA_HUMAN (BCKDHA) is predicted to be located in the mitochondrion, which is correct according to Uniprot. All other tested proteins are predicted to enter the secretory pathway and therefore to have a signal peptide. These predictions are correct except for BACR_HALSA, which has no signal peptide. Here, however, TargetP returns a reliability index of four, which indicates an unreliable prediction.
4. Prediction of GO terms
The following section deals with GO term prediction tools. In order to verify the predictions, first the real GO annotations are presented (as they are listed in <ref>http://www.uniprot.org</ref>):
(P: Process, F: Function, C: Component)
BCKDHA
GO Term Name | GO identifier | Aspect |
---|---|---|
Process | ||
metabolic process | 0008152 | P |
branched chain family amino acid catabolic process | 0009083 | P |
cellular nitrogen compound metabolic process | 0034641 | P |
oxidation-reduction process | 0055114 | P |
Function | ||
alpha-ketoacid dehydrogenase activity | 0003826 | F |
3-methyl-2-oxobutanoate dehydrogenase (2-methylpropanoyl-transferring) activity | 0003863 | F |
protein binding | 0005515 | F |
oxidoreductase activity | 0016491 | F |
oxidoreductase activity, acting on the aldehyde or oxo group of donors, disulfide as acceptor | 0016624 | F |
carboxy-lyase activity | 0016831 | F |
metal ion binding | 0046872 | F |
Component | ||
mitochondrion | 0005739 | C |
mitochondrial matrix | 0005759 | C |
mitochondrial alpha-ketoglutarate dehydrogenase complex | 0005947 | C |
A4_HUMAN
GO Term Name | GO identifier | Aspect |
---|---|---|
Process | ||
G2 phase of mitotic cell cycle | 0000085 | P |
suckling behaviour | 0001967 | P |
platelet degranulation | 0002576 | P |
mRNA polyadenylation | 0006378 | P |
regulation of translation | 0006417 | P |
protein phosphorylation | 0006468 | P |
cellular copper ion homeostasis | 0006878 | P |
endocytosis | 0006897 | P |
apoptosis | 0006915 | P |
induction of apoptosis | 0006917 | P |
cell adhesion | 0007155 | P |
regulation of epidermal growth factor receptor activity | 0007176 | P |
Notch signaling pathway | 0007219 | P |
axonogenesis | 0007409 | P |
blood coagulation | 0007596 | P |
mating behavior | 0007617 | P |
locomotory behavior | 0007626 | P |
axon cargo transport | 0008088 | P |
cell death | 0008219 | P |
adult locomotory behavior | 0008344 | P |
visual learning | 0008542 | P |
negative regulation of peptidase activity | 0010466 | P |
positive regulation of peptidase activity | 0010951 | P |
axon midline choice point recognition | 0016199 | P |
neuron remodeling | 0016322 | P |
dendrite development | 0016358 | P |
platelet activation | 0030168 | P |
extracellular matrix organization | 0030198 | P |
forebrain development | 0030900 | P |
neuron projection development | 0031175 | P |
ionotropic glutamate receptor signaling pathway | 0035235 | P |
regulation of multicellular organism growth | 0040014 | P |
innate immune response | 0045087 | P |
negative regulation of neuron differentiation | 0045665 | P |
positive regulation of mitotic cell cycle | 0045931 | P |
positive regulation of transcription from RNA polymerase II promoter | 0045944 | P |
collateral sprouting in absence of injury | 0048699 | P |
regulation of synapse structure and activity | 0050803 | P |
neuromuscular process controlling balance | 0050885 | P |
synaptic growth at neuromuscular junction | 0051124 | P |
neuron apoptosis | 0051402 | P |
smooth endoplasmic reticulum calcium ion homeostasis | 0051563 | P |
Function | ||
DNA binding | 0003677 | F |
serine-type endopeptidase inhibitor activity | 0004867 | F |
receptor binding | 0005102 | F |
binding | 0005488 | F |
protein binding | 0005515 | F |
heparin binding | 0008201 | F |
peptidase activator activity | 0016504 | F |
peptidase inhibitor activity | 0030414 | F |
acetylcholine receptor binding | 0033130 | F |
identical protein binding | 0042802 | F |
metal ion binding | 0046872 | F |
PTB domain binding | 0051425 | F |
Component | ||
extracellular region | 0005576 | C |
membrane fraction | 0005624 | C |
cytoplasm | 0005737 | C |
Golgi apparatus | 0005794 | C |
plasma membrane | 0005886 | C |
integral to plasma membrane | 0005887 | C |
coated pit | 0005905 | C |
cell surface | 0009986 | C |
membrane | 0016020 | C |
integral to membrane | 0016021 | C |
synaptosome | 0019717 | C |
axon | 0030424 | C |
platelet alpha granule lumen | 0031093 | C |
cytoplasmic vesicle | 0031410 | C |
neuromuscular junction | 0031594 | C |
ciliary rootlet | 0035253 | C |
neuron projection | 0042005 | C |
dendritic spine | 0043197 | C |
dendritic shaft | 0043198 | C |
intracellular membrane-bounded organelle | 0043231 | C |
apical part of cell | 0045177 | C |
synapse | 0045202 | C |
perinuclear region of cytoplasm | 0048471 | C |
spindle midzone | 0051233 | C |
BACR_HALSA
GO Term Name | GO identifier | Aspect |
---|---|---|
Process | ||
transport | 0006810 | P |
ion transport | 0006811 | P |
phototransduction | 0007602 | P |
photon transport | 0015992 | P |
protein-chromophore linkage | 0018298 | P |
response to stimulus | 0050896 | P |
Function | ||
receptor activity | 0004872 | F |
ion channel activity | 0005216 | F |
photoreceptor activity | 0009881 | F |
Component | ||
plasma membrane | 0005886 | C |
membrane | 0016020 | C |
integral to membrane | 0016021 | C |
INSL5_HUMAN
GO Term Name | GO identifier | Aspect |
---|---|---|
Process | ||
biological_process | 0008150 | P |
Function | ||
hormone activity | 0005179 | F |
Component | ||
cellular_component | 0005575 | C |
extracellular region | 0005576 | C |
LAMP1_HUMAN
GO Term Name | GO identifier | Aspect |
---|---|---|
Process | ||
autophagy | 0006914 | P |
Component | ||
membrane fraction | 0005624 | C |
lysosome | 0005764 | C |
lysosomal membrane | 0005765 | C |
endosome | 0005768 | C |
late endosome | 0005770 | C |
multivesicular body | 0005771 | C |
plasma membrane | 0005886 | C |
integral to plasma membrane | 0005887 | C |
external side of plasma membrane | 0009897 | C |
cell surface | 0009986 | C |
endosome membrane | 0010008 | C |
membrane | 0016020 | C |
integral to membrane | 0016021 | C |
vesicle | 0031982 | C |
sarcolemma | 0042383 | C |
melanosome | 0042470 | C |
RET4_HUMAN
GO Term Name | GO identifier | Aspect |
---|---|---|
Process | ||
eye development | 0001654 | P |
gluconeogenesis | 0006094 | P |
transport | 0006810 | P |
spermatogenesis | 0007283 | P |
heart development | 0007507 | P |
visual perception | 0007601 | P |
male gonad development | 0008584 | P |
embryo development | 0009790 | P |
maintenance of gastrointestinal epithelium | 0030277 | P |
lung development | 0030324 | P |
positive regulation of insulin secretion | 0033024 | P |
response to retinoic acid | 0032526 | P |
response to insulin stimulus | 0032868 | P |
retinol transport | 0034633 | P |
retinol metabolic process | 0042572 | P |
retinal metabolic process | 0042574 | P |
glucose homeostasis | 0042593 | P |
response to ethanol | 0045471 | P |
embryonic organ morphogenesis | 0048562 | P |
embryonic skeletal system development | 0048706 | P |
cardiac muscle tissue development | 0048738 | P |
female genitalia morphogenesis | 0048807 | P |
response to stimulus | 0050896 | P |
detection of light stimulus involved in visual perception | 0050908 | P |
positive regulation of immunoglobulin secretion | 0051024 | P |
retina development in camera-type eye | 0060041 | P |
negative regulation of cardiac muscle cell proliferation | 0060044 | P |
embryonic retina morphogenesis in camera-type eye | 0060059 | P |
uterus development | 0060065 | P |
vagina development | 0060068 | P |
urinary bladder development | 0060157 | P |
heart trabecula formation | 0060347 | P |
Function | ||
transporter activity | 0005215 | F |
binding | 0005488 | F |
retinoid binding | 0005501 | F |
protein binding | 0005515 | F |
retinal binding | 0016918 | F |
retinol binding | 0019841 | F |
retinol transporter activity | 0034632 | F |
Component | ||
extracellular region | 0005576 | C |
extracellular space | 0005615 | C |
GOPET
Method
- GOPET (Gene Ontology Term Prediction and Evaluation Tool) was described by Vinayagam et al.<ref> Arunachalam Vinayagam, Coral Del Val, Falk Schubert, Roland Eils, Karl-Heinz Glatting, Sándor Suhai, Rainer König, "GOPET: A tool for automated predictions of Gene Ontology terms", BMC Bioinformatics (2006), Volume: 7, Issue: 161, Publisher: BioMed Central, Pages: 161</ref>
- GOPET is a fully automated tool for assigning molecular function terms to a given sequence.
- Based on homology searches and Support Vector Machines
- Required input information: cDNA or protein sequence
- Gene Ontology is used for annotation terms, GO-mapped protein databases for performing homology searches and Support Vector Machines for the prediction and the assignment of confidence values.
- The prediction is organism independent.
Results
BCKDHA
GOid | Aspect | Confidence | GOTerm |
---|---|---|---|
GO:0003824 | F | 97% | catalytic activity |
GO:0016491 | F | 96% | oxidoreductase activity |
GO:0016624 | F | 95% | oxidoreductase activity acting on the aldehyde or oxo group of donors disulfide as acceptor |
GO:0003863 | F | 90% | 3-methyl-2-oxobutanoate dehydrogenase 2-methylpropanoyl-transferring activity |
GO:0004739 | F | 89% | pyruvate dehydrogenase acetyl-transferring activity |
GO:0004738 | F | 78% | pyruvate dehydrogenase activity |
GO:0003826 | F | 77% | alpha-ketoacid dehydrogenase activity |
GO:0047101 | F | 75% | 2-oxoisovalerate dehydrogenase acylating activity |
GO:0008677 | F | 65% | 2-dehydropantoate 2-reductase activity |
GO:0019152 | F | 63% | acetoin dehydrogenase activity |
GO:0030955 | F | 63% | potassium ion binding |
GO:0016616 | F | 62% | oxidoreductase activity acting on the CH-OH group of donors NAD or NADP as acceptor |
GO:0046872 | F | 62% | metal ion binding |
The GOPET predictions for BCKDHA are mostly correct. The GO terms predicted by this tool with a confidence >90% are all listed in the Uniprot entry for BCKDHA, and so is the metal ion binding function.
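The agreement between the GOPET output and the annotation can be quantified with a simple set comparison. The following is a minimal sketch, assuming exact matching of GO identifiers; it uses the molecular function identifiers from the BCKDHA annotation table and from the GOPET table above, and the names `precision_recall`, `UNIPROT_F` and `GOPET_F` are hypothetical.

```python
# Molecular function terms annotated for BCKDHA (annotation table above)
UNIPROT_F = {"0003826", "0003863", "0005515", "0016491",
             "0016624", "0016831", "0046872"}

# Molecular function terms predicted by GOPET (results table above)
GOPET_F = {"0003824", "0016491", "0016624", "0003863", "0004739",
           "0004738", "0003826", "0047101", "0008677", "0019152",
           "0030955", "0016616", "0046872"}

def precision_recall(predicted, annotated):
    """Exact-match precision and recall of a predicted set of GO identifiers."""
    hits = predicted & annotated
    return len(hits) / len(predicted), len(hits) / len(annotated)

precision, recall = precision_recall(GOPET_F, UNIPROT_F)
```

Exact matching is a strict criterion: a general term such as catalytic activity (GO:0003824) counts as a miss although it is a parent of the annotated oxidoreductase activity, so a comparison that takes the GO hierarchy into account would rate the prediction more favourably.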
A4_HUMAN
GOid | Aspect | Confidence | GOTerm |
---|---|---|---|
GO:0004866 | F | 87% | endopeptidase inhibitor activity |
GO:0004867 | F | 86% | serine-type endopeptidase inhibitor activity |
GO:0030568 | F | 83% | plasmin inhibitor activity |
GO:0030304 | F | 83% | trypsin inhibitor activity |
GO:0030414 | F | 82% | peptidase inhibitor activity |
GO:0005488 | F | 79% | binding |
GO:0005515 | F | 74% | protein binding |
GO:0046872 | F | 73% | metal ion binding |
GO:0003677 | F | 71% | DNA binding |
GO:0008201 | F | 70% | heparin binding |
GO:0008270 | F | 69% | zinc ion binding |
GO:0005507 | F | 69% | copper ion binding |
GO:0005506 | F | 67% | iron ion binding |
The GOPET results for A4_HUMAN match the Uniprot annotation quite well. The predicted trypsin inhibitor activity, plasmin inhibitor activity and endopeptidase inhibitor activity are not present in Uniprot. However, these terms are closely related to the predicted serine-type endopeptidase inhibitor activity, which is a true function of the A4_HUMAN protein, so the predictions are not that bad. The same is true for the zinc, copper and iron ion binding functions: all three are metals, and the protein has a metal ion binding function.
BACR_HALSA
GOid | Aspect | Confidence | GOterm |
---|---|---|---|
GO:0005216 | F | 77% | ion channel activity |
GO:0008020 | F | 75% | G-protein coupled photoreceptor activity |
GO:0015078 | F | 60% | hydrogen ion transmembrane transporter activity |
GOPET predicted the ion channel activity and the photoreceptor activity correctly. The hydrogen ion transmembrane transporter activity does not agree with the Uniprot annotations.
INSL5_HUMAN
GOid | Aspect | Confidence | GOterm |
---|---|---|---|
GO:0005179 | F | 80% | hormone activity |
The INSL5_HUMAN protein is correctly predicted to be a hormone.
LAMP1_HUMAN
GOid | Aspect | Confidence | GOterm |
---|---|---|---|
GO:0004812 | F | 60% | aminoacyl-tRNA ligase activity |
GO:0005524 | F | 60% | ATP binding |
For the LAMP1_HUMAN protein, no functional GO annotation is listed in Uniprot.
RET4_HUMAN
GOid | Aspect | Confidence | GOterm |
---|---|---|---|
GO:0005488 | F | 90% | binding |
GO:0005501 | F | 81% | retinoid binding |
GO:0008289 | F | 80% | lipid binding |
GO:0019841 | F | 78% | retinol binding |
GO:0005215 | F | 78% | transporter activity |
GO:0016918 | F | 78% | retinal binding |
GO:0005319 | F | 69% | lipid transporter activity |
GO:0008035 | F | 60% | high-density lipoprotein particle binding |
The GOPET predictions for RET4_HUMAN are correct except for the lipid-linked activities.
Pfam
Method
- Pfam is described by Finn et al. (2008) <ref>Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A (2008). "The Pfam protein families database.". Nucleic Acids Res 36 (Database issue): D281–8</ref>
- Pfam is a database which contains protein families and domains
- The database consists of two different parts: Pfam-A and Pfam-B
- Pfam-A: manually curated and therefore more accurate
- Pfam-B: generated automatically, so the data are of lower quality than in Pfam-A
- Webserver: Pfam
Results
BCKDHA
Pfam found one significant match in the database which is visualized in Figure 40.
- Molecular function
- GO:0016624 (oxidoreductase activity, acting on the aldehyde or oxo group of donors, disulfide as acceptor)
- Biological Process
- GO:0008152 (metabolic process)
A4_HUMAN
Pfam found six significant matches in the database which are visualized in Figure 41.
- Cellular Component
- GO:0016021 (integral to membrane)
- Molecular function
- GO:0005488 (binding)
- GO:0004867 (serine-type endopeptidase inhibitor activity )
- No GO-ID
- E2 domain of amyloid precursor protein
- beta-amyloid precursor protein C-terminus
BACR_HALSA
Pfam found one significant match in the database which is visualized in Figure 42.
- Cellular Component
- GO:0016020 (membrane)
- Molecular function
- GO:0005216 (ion channel activity)
- Biological Process
- GO:0006811 (ion transport)
INSL5_HUMAN
Pfam found one significant match in the database which is visualized in Figure 43.
- Cellular Component
- GO:0005576 (extracellular region)
- Molecular function
- GO:0005179 (hormone activity)
LAMP1_HUMAN
Pfam found one significant match in the database which is visualized in Figure 44.
- Cellular Component
- GO:0016020 (membrane)
RET4_HUMAN
Pfam found one significant match in the database which is visualized in Figure 45.
- Molecular function
- GO:0005488 (binding)
Comparing the Pfam annotations with the already known GO terms for the different proteins shows that the results for all analysed proteins are correct, but far from exhaustive.
ProtFun 2.2
Method
- ProtFun is described in Jensen et al.<ref>L. Juhl Jensen, R. Gupta, N. Blom, D. Devos, J. Tamames, C. Kesmir, H. Nielsen, H. H. Stærfeldt, K. Rapacki, C. Workman, C. A. F. Andersen, S. Knudsen, A. Krogh, A. Valencia and S. Brunak. Prediction of human protein function from post-translational modifications and localization features. J. Mol. Biol., 319:1257-1265, 2002</ref>
- ProtFun is an ab initio server that predicts protein function from sequence. It queries various other servers and integrates the information they provide into the final prediction.
- ProtFun reports only probabilities and odds scores; it does not make a yes/no call on whether a protein has a specific function.
- The arrow (=>) indicates the line with the highest information content.
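Because ProtFun returns only probabilities and odds scores, the interpretation is left to the reader. The sketch below parses score lines written in the style used on this page and sorts them by odds score. The regular expression and the names `Score` and `parse_scores` are our own assumptions; the raw ProtFun output may be laid out differently.

```python
import re
from typing import List, NamedTuple


class Score(NamedTuple):
    category: str
    prob: float
    odds: float
    flagged: bool  # True when the line carries the "(=>)" marker


# Matches lines written in the style used on this page, for example:
#   "- Amino_acid_biosynthesis (Prob: 0.187, Odds: 8.520)"
LINE_RE = re.compile(
    r"^\s*-?\s*(?P<cat>[\w/ ]+?)\s*"
    r"\(Prob:\s*(?P<prob>[\d.]+),\s*Odds:\s*(?P<odds>[\d.]+)\)"
    r"\s*(?P<flag>\(=>\))?\s*$"
)


def parse_scores(text: str) -> List[Score]:
    """Collect every line that matches LINE_RE; other lines are skipped."""
    scores = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            scores.append(Score(
                m.group("cat"),
                float(m.group("prob")),
                float(m.group("odds")),
                m.group("flag") is not None,
            ))
    return scores


# Functional category values for BCKDHA as listed on this page
bckdha_functional = """
- Central_intermediary_metabolism (Prob: 0.321, Odds: 5.096) (=>)
- Amino_acid_biosynthesis (Prob: 0.187, Odds: 8.520)
- Purines_and_pyrimidines (Prob: 0.257, Odds: 1.059)
- Biosynthesis_of_cofactors (Prob: 0.246, Odds: 3.413)
"""

scores = parse_scores(bckdha_functional)
ranked = sorted(scores, key=lambda s: s.odds, reverse=True)
for s in ranked:
    marker = " (=>)" if s.flagged else ""
    print(f"{s.category}: prob={s.prob:.3f}, odds={s.odds:.3f}{marker}")
```

Sorting by odds score is only one possible reading. Note that the entry marked with the arrow is not necessarily the one with the highest odds score.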
Results
BCKDHA
- Functional category
- Central_intermediary_metabolism (Prob: 0.321, Odds: 5.096) (=>)
- Amino_acid_biosynthesis (Prob: 0.187, Odds: 8.520)
- Purines_and_pyrimidines (Prob: 0.257, Odds: 1.059)
- Biosynthesis_of_cofactors (Prob: 0.246, Odds: 3.413)
- Enzyme/nonenzyme
- Enzyme (Prob: 0.769, Odds: 2.683)
- Enzyme class
- Ligase (Prob: 0.085, Odds: 1.673) (=>)
- Lyase (Prob: 0.076, Odds: 1.614)
- Gene Ontology category
- Growth_factor (Prob: 0.009, Odds: 0.609)
- Signal_transducer (Prob: 0.098, Odds: 0.458)
The results of ProtFun2.2 for the prediction of the GO-terms in BCKDHA are listed in Figure 46.
The list above summarizes the most significant results. The program predicts that BCKDHA functions mainly in metabolism (central intermediary metabolism). The second entry under "Functional category", amino acid biosynthesis, has an even higher odds score (8.520), so we also consider it a confident prediction. The other two entries have the next best probability or odds score, but in both cases the odds score is much lower than for the first two entries. We therefore take the first two entries as the best predictions of ProtFun2.2. Comparison with the information in UniProt shows that they are correct. There was no confident prediction for the "Gene Ontology category": we list the two best results above, but both have low probabilities and odds scores below 1, so they are not significant.
A4_HUMAN
- Functional category
- Cell_envelope (Prob: 0.804, Odds: 13.186) (=>)
- Transport_and_binding (Prob: 0.827, Odds: 2.016)
- Biosynthesis_of_cofactors (Prob: 0.261, Odds: 3.623)
- Enzyme/nonenzyme
- Enzyme (Prob: 0.392, Odds: 1.368)
- Enzyme class
- Ligase (Prob: 0.048, Odds: 0.946)
- Transferase (Prob: 0.208, Odds: 0.603)
- Hydrolase (Prob: 0.190, Odds: 0.600)
- Gene Ontology category
- Structural_protein (Prob: 0.034, Odds: 1.205) (=>)
- Stress_response (Prob: 0.076, Odds: 0.862)
- Signal_transducer (Prob: 0.126, Odds: 0.586)
The results of ProtFun2.2 for the prediction of the GO-terms in A4_HUMAN are listed in Figure 47.
The most significant results are shown in the list above. A4_HUMAN is assigned to the functional category "Cell_envelope" with an odds score of 13.186, which indicates a very confident prediction, and according to UniProt it is correct. The next entry in the list, transport and binding, is also a known function of A4_HUMAN, so this prediction is correct even though its odds score is much lower than that of the first entry. The same holds for the prediction that A4_HUMAN is involved in the biosynthesis of cofactors. In the "Gene Ontology category", ProtFun2.2 predicts that the protein is a structural protein, which is again correct according to the information at the beginning of this section. It is not the only correctly predicted GO term for A4_HUMAN, as the two other listed GO terms show. Comparing the whole list of predicted GO terms in Figure 47 with the known annotations shows that all of the predictions are right.
BACR_HALSA
- Functional category
- Transport_and_binding (Prob: 0.791, Odds: 1.929) (=>)
- Biosynthesis_of_cofactors (Prob: 0.186, Odds: 2.589)
- Purines_and_pyrimidines (Prob: 0.302, Odds: 1.244)
- Enzyme/nonenzyme
- Nonenzyme (Prob: 0.801, Odds: 1.122)
- Enzyme class
- none
- Gene Ontology category
- Transporter (Prob: 0.400, Odds: 4.036) (=>)
- Receptor (Prob: 0.355, Odds: 2.087)
The results of ProtFun2.2 for the prediction of the GO-terms in BACR_HALSA are listed in Figure 48.
The most significant results are shown in the list above. Since both the "Functional category" and the "Gene Ontology category" predictions describe BACR_HALSA mainly as a transporter, this prediction can be considered very reliable. The UniProt GO annotations include ion transport and photon transport as well as transport itself, so the prediction is correct. In contrast, the prediction that BACR_HALSA is involved in the biosynthesis of cofactors is wrong, although it has a fairly good odds score. Its probability is very low, however, which shows that both values have to be considered together. The last entry in the list shows that ProtFun2.2 predicts receptor functionality. This prediction is also correct, because UniProt lists receptor activity for this protein.
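Why a low probability can still come with a fairly good odds score can be illustrated with the values listed on this page. In the sketch below we divide the probability by the odds score for two functional categories. The ratio turns out to be nearly the same for every protein within a category, which suggests that the odds score behaves like the probability divided by a category-specific constant. This is only an observation from the numbers reported here, not a formula taken from the ProtFun documentation, and the helper name `implied_ratios` is our own.

```python
# (probability, odds score) pairs copied from the result lists on this page
biosynthesis_of_cofactors = {
    "BCKDHA": (0.246, 3.413),
    "A4_HUMAN": (0.261, 3.623),
    "BACR_HALSA": (0.186, 2.589),
}
transport_and_binding = {
    "A4_HUMAN": (0.827, 2.016),
    "BACR_HALSA": (0.791, 1.929),
    "INSL5_HUMAN": (0.834, 2.033),
    "RET4_HUMAN": (0.800, 1.951),
}


def implied_ratios(pairs):
    """Return probability / odds for each protein in one category."""
    return {protein: prob / odds for protein, (prob, odds) in pairs.items()}


cofactor_ratios = implied_ratios(biosynthesis_of_cofactors)
transport_ratios = implied_ratios(transport_and_binding)

for name, ratios in [("Biosynthesis_of_cofactors", cofactor_ratios),
                     ("Transport_and_binding", transport_ratios)]:
    values = list(ratios.values())
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    print(f"{name}: mean ratio={mean:.3f}, spread={spread:.4f}")
```

Under this reading, a category with a small ratio (such as the biosynthesis of cofactors) reaches a noticeable odds score already at a low probability, whereas a category with a large ratio (such as transport and binding) needs a high probability for an odds score of about 2. This is consistent with the remark that both values have to be considered together.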
INSL5_HUMAN
- Functional category
- Cell_envelope (Prob: 0.756, Odds: 12.393) (=>)
- Transport_and_binding (Prob: 0.834, Odds: 2.033)
- Enzyme/nonenzyme
- Nonenzyme (Prob: 0.791, Odds: 1.109)
- Enzyme class
- none
- Gene Ontology category
- Hormone (Prob: 0.247, Odds: 37.936) (=>)
- Growth_factor (Prob: 0.061, Odds: 4.379)
The results of ProtFun2.2 for the prediction of the GO-terms in INSL5_HUMAN are listed in Figure 49.
The most significant results are listed above. The first prediction of ProtFun2.2, which assigns INSL5_HUMAN to the functional category "cell envelope", has a very high odds score of 12.393. This suggests that the prediction is correct, which is confirmed by the information in UniProt. The prediction that INSL5_HUMAN is involved in transport and binding is also correct. ProtFun also predicted the hormone activity of INSL5_HUMAN correctly, as comparison with the information in UniProt shows. However, it is also the only GO term for this protein, which means that the prediction of "growth factor" has to be wrong.
LAMP1_HUMAN
- Functional category
- Cell_envelope (Prob: 0.804, Odds: 13.186) (=>)
- Transport_and_binding (Prob: 0.834, Odds: 2.033)
- Enzyme/nonenzyme
- Nonenzyme (Prob: 0.724, Odds: 1.014)
- Enzyme class
- none
- Gene Ontology category
- Immune_response (Prob: 0.371, Odds: 4.368) (=>)
- Stress_response (Prob: 0.246, Odds: 2.795)
The results of ProtFun2.2 for the prediction of the GO-terms in LAMP1_HUMAN are listed in Figure 50.
The most significant results can be found in the list above. This protein is predicted to be important for the cell envelope with a very significant probability and odds score. As expected, this result is correct, since it also occurs among the GO terms in UniProt. In contrast, the prediction of transport and binding, which has a good probability but only a low odds score, is wrong. The GO category "Immune_response" predicted by ProtFun for LAMP1_HUMAN is not false, as autophagy is a process often triggered by the immune system in response to foreign substances. Since autophagy is listed in UniProt, the prediction of stress response is also reasonably correct.
RET4_HUMAN
- Functional category
- Cell_envelope (Prob: 0.804, Odds: 13.186) (=>)
- Central_intermediary_metabolism (Prob: 0.197, Odds: 3.128)
- Transport_and_binding (Prob: 0.800, Odds: 1.951)
- Enzyme/nonenzyme
- Enzyme (Prob: 0.544, Odds: 1.900)
- Enzyme class
- Lyase (Prob: 0.059, Odds: 1.264) (=>)
- Hydrolase (Prob: 0.235, Odds: 0.742)
- Gene Ontology category
- Immune_response (Prob: 0.239, Odds: 2.813) (=>)
- Stress_response (Prob: 0.616, Odds: 1.829)
The results of ProtFun2.2 for the prediction of the GO-terms in RET4_HUMAN are listed in Figure 51.
The most significant results can be found in the list above. ProtFun2.2 assigns RET4_HUMAN to "Cell_envelope" with a very high probability and a significant odds score, and comparison with UniProt shows that this prediction is right. The prediction that the protein is involved in metabolism has a low probability and a much lower odds score than the first hit, but it is nevertheless correct. The last of the three listed functional categories is also predicted accurately. For the prediction of immune response, we could not find any support in the GO terms in UniProt, whereas the prediction of stress response was correct.
References
<references />