Sequence-based mutation analysis Gaucher Disease

From Bioinformatikpedia
Revision as of 14:45, 18 June 2012 by Zhangg

The aim of this task was to carry out a thorough analysis of ten mutations and to classify them as disease-causing or non-disease-causing. The mutations were selected by another group from our set of mutations, so their impact was unknown to us prior to this task. We investigated the provided mutations with respect to their physicochemical properties, structural features, and conservation, and employed the tools SIFT, Polyphen2, and SNAP to predict their impact on the phenotype. To quantify the extent to which a mutation is disease-causing, we assigned a disease score where -1 means non-disease-causing, 0 ambiguous, and 1 disease-causing. We averaged the disease scores to obtain a final prediction, which we compared with the true impact of the mutation on the phenotype. Technical details are reported in our protocol.

Mutations

<xr id="tab:mutations"/> contains five randomly chosen Gaucher disease-causing and five non-disease-causing mutations. Disease-causing mutations were sampled from the HGMD, whereas non-disease-causing mutations were sampled from the set of mutations that are present in dbSNP but not in the HGMD. The reference sequence was P04062, which has a 39-residue signal peptide. The ten mutations listed in <xr id="tab:mutations"/> are investigated in the following sections.

<figtable id="tab:mutations">

Nr Position From To
1 99 H R
2 211 V I
3 150 E K
4 236 L P
5 248 W R
6 509 L P
7 351 W C
8 423 A D
9 482 D N
10 83 R S

Randomly selected mutations from HGMD and dbSNP which were used for the sequence-based mutation analysis. </figtable>

Physicochemical properties

We compared the charge, polarity, size, and aromatic character of the wild-type and mutant amino acids and assigned a disease score of 1 to those mutations that have a severe impact on the physicochemical properties (cf. <xr id="tab:props"/>). Mutation number 3 inverts the charge, since glutamate is acidic but lysine is basic. We also considered mutations number 5 and 7 disease-causing, as tryptophan is aromatic and nonpolar, in contrast to the target residues. Substituting alanine, which is small and nonpolar, by the larger and acidic aspartate might also impact the structure and function of the protein.

<figtable id="tab:props">

Nr Wildtype (AA Charge Polarity Size Aromatic) Mutant (AA Charge Polarity Size Aromatic) Disease score
1 H positive polar large no R positive polar large no -1
2 V neutral nonpolar medium no I neutral nonpolar medium no -1
3 E negative polar large no K positive polar large no 1
4 L neutral nonpolar medium no P neutral nonpolar medium no -1
5 W neutral nonpolar large yes R positive polar large no 1
6 L neutral nonpolar medium no P neutral nonpolar medium no -1
7 W neutral nonpolar large yes C neutral polar small no 1
8 A neutral nonpolar small no D negative polar medium no 1
9 D negative polar medium no N neutral polar medium no -1
10 R positive polar large no S neutral polar small no 0

Physicochemical properties of the wildtype and mutant amino acids which were used to classify the mutation as severe or non-severe. </figtable>
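The comparison of wild-type and mutant residues can be sketched in a few lines of Python. This is only an illustration: the `PROPS` table and the helper `changed_properties` are hypothetical names, the property categories are the coarse ones used on this page, and the charges follow the usual convention (Asp/Glu negative, Lys/Arg/His positive). The disease scores on this page were assigned manually, not by this code.

```python
# Coarse residue properties: (charge, polarity, size, aromatic).
# Assumption of this sketch: categories as used on this page, only for the
# residues that occur in the ten investigated mutations.
PROPS = {
    "A": ("neutral", "nonpolar", "small", False),
    "C": ("neutral", "polar", "small", False),
    "D": ("negative", "polar", "medium", False),
    "E": ("negative", "polar", "large", False),
    "H": ("positive", "polar", "large", False),
    "I": ("neutral", "nonpolar", "medium", False),
    "K": ("positive", "polar", "large", False),
    "L": ("neutral", "nonpolar", "medium", False),
    "N": ("neutral", "polar", "medium", False),
    "P": ("neutral", "nonpolar", "medium", False),
    "R": ("positive", "polar", "large", False),
    "S": ("neutral", "polar", "small", False),
    "V": ("neutral", "nonpolar", "medium", False),
    "W": ("neutral", "nonpolar", "large", True),
}
FIELDS = ("charge", "polarity", "size", "aromatic")


def changed_properties(wild_type, mutant):
    """Return the names of the properties that differ between two residues."""
    return [
        name
        for name, before, after in zip(FIELDS, PROPS[wild_type], PROPS[mutant])
        if before != after
    ]


for mutation in ("H99R", "E150K", "W248R", "L236P"):
    wild_type, mutant = mutation[0], mutation[-1]
    print(mutation, changed_properties(wild_type, mutant))
```

Note that such a coarse table reports no change at all for L236P and L509P, which is one reason why a purely property-based comparison cannot capture the special backbone geometry of proline.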

Structural analysis

We used the HHsearch alignment to map the mutations of <xr id="tab:mutations"/> onto 2nt0_A and investigated for each mutated site its solvent accessibility (buried or exposed), its secondary structure (H=Helix, S=Sheet, C=Coil), and whether it lies in a domain region. Based on these features, we estimated a disease score (cf. <xr id="tab:structure"/>, <xr id="fig:structure_all"/>).

<figtable id="tab:structure">

Nr Mutation Acc Secondary structure Domain region Image Disease score
1 H99R exposed C no Structure nr1.png -1
2 V211I exposed C no Structure nr2.png -1
3 E150K exposed C no Structure nr3.png -1
4 L236P exposed C no Structure nr4.png 0
5 W248R buried H yes Structure nr5.png 1
6 L509P exposed S yes Structure nr6.png 1
7 W351C exposed S yes Structure nr7.png 0
8 A423D buried C yes Structure nr8.png 1
9 D482N exposed C yes Structure nr9.png -1
10 R83S exposed C yes Structure nr10.png -1

Location of mutations in 2nt0_A. Blue: wildtype; Red: mutant; Acc: Solvent accessibility. </figtable>

<figure id="fig:structure_all">

2nt0_A along with the selected wildtype residues from <xr id="tab:mutations"/> in blue.

</figure>



W248R lies in the hydrophobic core region of the TIM beta/alpha-barrel domain, and inserting a hydrophilic arginine is likely to impair the catalytic activity of the enzyme, although it does not change the secondary structure according to the PSIPRED prediction. The same holds true for A423D. We considered L509P disease-causing because proline turned the sheet into a loop region at this site. W351C might change the protein structure due to the formation of disulfide bonds.

Conservation

The conservation of an amino acid indicates its importance for the structure and function of the protein. The log-odds substitution score quantifies the likelihood of a substitution: a negative score indicates that the substitution is observed less frequently than expected by chance. This is primarily due to differing physicochemical properties which cause severe structural changes, such that the resulting protein is selected against. Hence, substitutions with a negative score are likely to be disease-causing, whereas a positive score indicates that the mutation is unlikely to affect the protein.

BLOSUM62 scores

The BLOSUM62 substitution matrix was derived by clustering sequences of the Blocks database at a sequence identity threshold of 62% and counting inter-cluster substitutions. The evolutionary distance underlying the BLOSUM62 matrix has turned out to be suitable for many applications. We labeled substitutions with a score close to the minimal score as disease-causing (cf. <xr id="tab:subst_blosum"/>).

<figtable id="tab:subst_blosum">

Nr Mutation Score(mutation) Score(min) Score(max) Disease score
1 H99R 0 -3 8 0
2 V211I 3 -3 4 -1
3 E150K 1 -4 5 0
4 L236P -3 -4 4 1
5 W248R -3 -4 11 1
6 L509P -3 -4 4 1
7 W351C -2 -4 11 1
8 A423D -2 -3 4 1
9 D482N 1 -4 6 0
10 R83S -1 -3 5 0

BLOSUM62 scores of the selected mutations. </figtable>
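The labeling of substitution scores can be illustrated with a simple threshold rule. This is a sketch under assumptions: the cutoffs `low=-2` and `high=3` and the function name `disease_score` are hypothetical and were chosen only because they reproduce the manually assigned labels of the BLOSUM62 table; the page itself does not state fixed thresholds.

```python
def disease_score(substitution_score, low=-2, high=3):
    """Map a substitution score to a disease score.

    Assumed rule (not stated on this page): scores at or below `low` are
    labeled disease-causing (1), scores at or above `high` are labeled
    non-disease-causing (-1), everything in between is ambiguous (0).
    """
    if substitution_score <= low:
        return 1
    if substitution_score >= high:
        return -1
    return 0


# BLOSUM62 scores of the ten mutations, copied from the table above.
BLOSUM62_SCORES = {
    "H99R": 0, "V211I": 3, "E150K": 1, "L236P": -3, "W248R": -3,
    "L509P": -3, "W351C": -2, "A423D": -2, "D482N": 1, "R83S": -1,
}

labels = {mutation: disease_score(score) for mutation, score in BLOSUM62_SCORES.items()}
for mutation, label in labels.items():
    print(mutation, label)
```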

PSSM of all hits

A Position Specific Scoring Matrix (PSSM), or profile, is a matrix which stores the probability P(a|i) of observing amino acid a at position i. It is derived from a sequence alignment, and the position specific substitution scores S(a,i) = log(P(a|i)/P(a)), where P(a) is the background frequency of amino acid a, are more precise than the general BLOSUM62 scores. We therefore computed an alignment (cf. <xr id="fig:subst_pssm_all_ali"/>) from all significant sequences found by performing five rounds of PSI-BLAST, computed a PSSM (cf. <xr id="fig:subst_pssm_all"/>), and used the position specific substitution scores to assign a disease score to each mutation (cf. <xr id="tab:subst_pssm_all"/>). Mutations 5-7 were considered disease-causing since their substitution scores were close to the minimum and the sites were highly conserved.
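How a profile column turns into log-odds scores can be sketched as follows. The example is a toy: the alignment column `toy_column`, the uniform background frequencies, and the add-one pseudocount are assumptions of this sketch, and PSI-BLAST uses a more elaborate sequence weighting and pseudocount scheme.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_column(column, background=None, pseudocount=1.0):
    """Log-odds scores log2(P(a|i) / P(a)) for one alignment column.

    `column` is a string with one residue per aligned sequence. A uniform
    background and a simple additive pseudocount are assumed here.
    """
    if background is None:
        background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    counts = Counter(residue for residue in column if residue in background)
    total = sum(counts.values()) + pseudocount * len(background)
    return {
        a: math.log2(((counts[a] + pseudocount) / total) / background[a])
        for a in background
    }


# Hypothetical, strongly conserved column dominated by tryptophan.
toy_column = "WWWWWWWWFY"
scores = pssm_column(toy_column)
print("W:", round(scores["W"], 2), "R:", round(scores["R"], 2))
```

In such a conserved column the observed residue receives a positive score, while a residue that never occurs, such as arginine here, receives a negative one.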

<figure id="fig:subst_pssm_all_ali">
Sequence alignment of P04062 derived from all significant hits after 5 rounds of PSI-BLAST.
</figure>
<figure id="fig:subst_pssm_all">
PSSM of P04062 derived from all significant hits after 5 rounds of PSI-BLAST.
</figure>

<figtable id="tab:subst_pssm_all">

Nr Mutation Score(mutation) Score(min) Score(max) Conservation Disease score
1 H99R 0 -4 2 Subst pssm all col99.png 0
2 V211I 4 -4 4 Subst pssm all col211.png -1
3 E150K 0 -4 5 Subst pssm all col150.png 0
4 L236P 0 -3 1 Subst pssm all col236.png 0
5 W248R -2 -3 4 Subst pssm all col248.png 1
6 L509P -6 -6 4 Subst pssm all col509.png 1
7 W351C -3 -6 9 Subst pssm all col351.png 1
8 A423D -3 -3 3 Subst pssm all col423.png 1
9 D482N 4 -4 4 Subst pssm all col482.png -1
10 R83S 0 -3 2 Subst pssm all col83.png 0

Position specific substitution scores derived from all significant hits after 5 rounds PSI-BLAST. The respective profile column is shown on the right. </figtable>


PSSM of close homologous sequences

Five rounds of PSI-BLAST also recover distantly homologous sequences whose function is not conserved, i.e. proteins with different functions are incorporated into the alignment. We therefore built an alignment using only the closest homologous sequences, which probably exhibit the same catalytic activity as the query sequence. For this, we used HHfilter with the option -qsc 1.0 to filter the alignment depicted in <xr id="fig:subst_pssm_all_ali"/> from 1050 sequences down to only 60 sequences. The resulting alignment is shown in <xr id="fig:subst_pssm_best_ali"/> and the corresponding profile in <xr id="fig:subst_pssm_best"/>, which is clearly less diverse than the profile shown in <xr id="fig:subst_pssm_all"/>. Since fewer sequences entered the alignment, the substitution scores also became more extreme (cf. <xr id="tab:subst_pssm_best"/>). However, we assigned the same disease scores, except for L236P, as the leucine was more conserved.

<figure id="fig:subst_pssm_best_ali">
Sequence alignment of P04062 derived from the 60 closest homologous sequences after 5 rounds of PSI-BLAST.
</figure>
<figure id="fig:subst_pssm_best">
PSSM of P04062 derived from the 60 closest homologous sequences after 5 rounds of PSI-BLAST.
</figure>

<figtable id="tab:subst_pssm_best">

Nr Mutation Score(mutation) Score(min) Score(max) Conservation Disease score
1 H99R 0 -3 8 Subst pssm best col99.png 0
2 V211I 3 -3 4 Subst pssm best col211.png -1
3 E150K 1 -4 5 Subst pssm best col150.png 0
4 L236P -3 -4 4 Subst pssm best col236.png 1
5 W248R -3 -4 11 Subst pssm best col248.png 1
6 L509P -3 -4 4 Subst pssm best col509.png 1
7 W351C -2 -4 11 Subst pssm best col351.png 1
8 A423D -2 -3 4 Subst pssm best col423.png 1
9 D482N 1 -4 6 Subst pssm best col482.png -1
10 R83S -1 -3 6 Subst pssm best col83.png 0

Position specific substitution scores derived from the 60 closest homologous sequences after 5 rounds PSI-BLAST. The respective profile column is shown on the right. </figtable>


Scoring Mutants

SIFT

SIFT (Sorting Intolerant From Tolerant) is a sequence homology-based tool that predicts whether an amino acid substitution in a protein will affect its function. SIFT is based on the premise that protein evolution is correlated with protein function: functionally important amino acids should be conserved, and mutations at such positions will change the protein function. Unimportant positions, on the contrary, will show much more amino acid variation.

The results predicted by SIFT are shown in <xr id="tab:subst_sift"/>. Substitutions with a score of 0.05 or less are predicted to affect the protein function; all others are predicted to be tolerated:

<figtable id="tab:subst_sift">

Nr Mutation Prediction Score Sequence conservation Disease score
1 H99R TOLERATED 0.74 3.11 -1
2 V211I TOLERATED 1.0 3.10 -1
3 E150K TOLERATED 0.44 3.10 -1
4 L236P AFFECT PROTEIN FUNCTION 0.00 3.10 1
5 W248R AFFECT PROTEIN FUNCTION 0.00 3.10 1
6 L509P AFFECT PROTEIN FUNCTION 0.01 3.11 1
7 W351C AFFECT PROTEIN FUNCTION 0.00 3.10 1
8 A423D AFFECT PROTEIN FUNCTION 0.01 3.10 1
9 D482N TOLERATED 0.77 3.10 -1
10 R83S AFFECT PROTEIN FUNCTION 0.05 3.10 1

SIFT prediction results of the selected mutations. </figtable>
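The cutoff rule can be written down directly. This is a minimal sketch: the helper name `sift_label` is hypothetical, the scores are copied from the table above, and the cutoff is treated as inclusive because R83S, with a score of 0.05, is reported as affecting protein function.

```python
SIFT_CUTOFF = 0.05  # scores at or below the cutoff are predicted to affect function


def sift_label(score, cutoff=SIFT_CUTOFF):
    """Translate a SIFT score into the prediction label used in the table."""
    return "AFFECT PROTEIN FUNCTION" if score <= cutoff else "TOLERATED"


# SIFT scores of the ten mutations, copied from the table above.
SIFT_SCORES = {
    "H99R": 0.74, "V211I": 1.0, "E150K": 0.44, "L236P": 0.00, "W248R": 0.00,
    "L509P": 0.01, "W351C": 0.00, "A423D": 0.01, "D482N": 0.77, "R83S": 0.05,
}

for mutation, score in SIFT_SCORES.items():
    print(mutation, sift_label(score))
```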


Polyphen2

PolyPhen-2 (Polymorphism Phenotyping v2) predicts whether an amino acid substitution will affect protein function by using straightforward physical and comparative considerations. Two pairs of datasets are used to train and test the PolyPhen-2 prediction models. HumDiv uses all damaging alleles with known effects on molecular function that cause human Mendelian diseases, taken from the UniProtKB database; differences between human proteins and their closely related mammalian homologs are assumed to be non-damaging. HumVar uses all human disease-causing mutations from UniProtKB; common human nsSNPs (MAF>1%) without annotated involvement in disease are assumed to be non-damaging.

The results predicted by Polyphen2 are shown in <xr id="tab:subst_polyphen2"/>. Mutations with a score near 0 are predicted as "benign" and those with a score near 1 as "probably damaging":

<figtable id="tab:subst_polyphen2">

Nr Mutation HumDiv (Prediction Score Sensitivity Specificity) HumVar (Prediction Score Sensitivity Specificity) Disease score
1 H99R benign 0.000 1.00 0.00 benign 0.000 1.00 0.00 -1
2 V211I benign 0.000 1.00 0.00 benign 0.001 0.99 0.09 -1
3 E150K benign 0.000 1.00 0.00 benign 0.001 0.99 0.09 -1
4 L236P probably damaging 1.000 0.00 1.00 probably damaging 1.000 0.00 1.00 1
5 W248R probably damaging 1.000 0.00 1.00 probably damaging 0.999 0.09 0.99 1
6 L509P probably damaging 0.992 0.70 0.97 probably damaging 0.988 0.53 0.95 1
7 W351C probably damaging 1.000 0.00 1.00 probably damaging 1.000 0.00 1.00 1
8 A423D probably damaging 1.000 0.00 1.00 probably damaging 0.996 0.36 0.97 1
9 D482N benign 0.000 1.00 0.00 benign 0.002 0.99 0.18 -1
10 R83S benign 0.007 0.96 0.75 benign 0.019 0.95 0.55 -1

PolyPhen2 prediction results of the selected mutations. </figtable>

SNAP

SNAP (screening for non-acceptable polymorphisms) is a tool for evaluating the effects of single amino acid substitutions on protein function. It was developed by Yana Bromberg in the Rost Lab at Columbia University, New York.

The results predicted by SNAP are shown in <xr id="tab:subst_snap"/>, where the reliability index indicates the confidence in a prediction and the expected accuracy gives the likelihood that a given prediction is correct.

<figtable id="tab:subst_snap">

Nr Mutation Prediction Reliability Index Expected Accuracy Disease score
1 H99R Neutral 7 94% -1
2 V211I Neutral 7 94% -1
3 E150K Neutral 6 92% -1
4 L236P Non-neutral 0 58% 1
5 W248R Non-neutral 2 70% 1
6 L509P Neutral 0 53% -1
7 W351C Non-neutral 3 78% 1
8 A423D Non-neutral 1 63% 1
9 D482N Neutral 4 85% -1
10 R83S Neutral 3 78% -1

SNAP prediction results of the selected mutations. </figtable>


Discussion

Summary

<xr id="tab:discussion"/> summarises the sequence-based mutation analysis. It repeats the predictions of the individual properties/methods described above, each expressed as a disease score which is defined as:

  • -1 = non-disease causing
  • 1 = disease causing
  • 0 = ambiguous

A final disease score is obtained by computing the weighted average of all individual disease scores, i.e.:

final disease score = sum (weight of each method * its predicted disease score) / sum (weight of each method)

The final prediction is then made according to the following rule:

  • "non-disease causing" if the final disease score <= 0.0
  • "disease causing" if the final disease score > 0.0
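The weighted average and the decision rule can be sketched as follows. The weights and the L236P and H99R scores are taken from the summary table; the function names `final_disease_score` and `final_prediction` are hypothetical.

```python
# Method weights as listed in the summary table.
WEIGHTS = {
    "Physicochemical": 1.0, "Structure": 1.0, "BLOSUM62": 0.2,
    "PSSM all": 0.4, "PSSM close": 0.4,
    "SIFT": 1.0, "Polyphen2": 1.0, "SNAP": 1.0,
}


def final_disease_score(scores, weights=WEIGHTS):
    """Weighted average of the per-method disease scores (-1, 0 or 1)."""
    total_weight = sum(weights[method] for method in scores)
    weighted_sum = sum(weights[method] * score for method, score in scores.items())
    return weighted_sum / total_weight


def final_prediction(score):
    """Apply the decision rule: strictly positive scores mean disease causing."""
    return "disease causing" if score > 0.0 else "non-disease causing"


# Per-method disease scores for two mutations, copied from the summary table.
l236p = {"Physicochemical": 0, "Structure": -1, "BLOSUM62": 1, "PSSM all": 0,
         "PSSM close": 1, "SIFT": 1, "Polyphen2": 1, "SNAP": 1}
h99r = {"Physicochemical": -1, "Structure": -1, "BLOSUM62": 0, "PSSM all": 0,
        "PSSM close": 0, "SIFT": -1, "Polyphen2": -1, "SNAP": -1}

for name, scores in (("L236P", l236p), ("H99R", h99r)):
    score = final_disease_score(scores)
    print(name, round(score, 2), final_prediction(score))
```

Because BLOSUM62 and the two PSSMs together carry a weight of 1.0, the three conservation measures count as much as one of the other methods.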


<figtable id="tab:discussion">

Property 1 2 3 4 5 6 7 8 9 10 Prediction
Name Weight H99R V211I E150K L236P W248R L509P W351C A423D D482N R83S Accuracy
Physicochemical 1.0 -1 -1 1 0 1 -1 1 1 -1 0 80%
Structure 1.0 -1 -1 -1 -1 1 1 0 1 -1 -1 60%
BLOSUM62 0.2 0 -1 0 1 1 1 1 1 0 0 50%
PSSM all 0.4 0 -1 0 0 1 1 1 1 -1 0 50%
PSSM close 0.4 0 -1 0 1 1 1 1 1 -1 0 60%
SIFT 1.0 -1 -1 -1 1 1 1 1 1 -1 1 70%
Polyphen2 1.0 -1 -1 -1 1 1 1 1 1 -1 -1 80%
SNAP 1.0 -1 -1 -1 1 1 -1 1 1 -1 -1 90%
Average disease score -0.83 -1.00 -0.50 0.43 1.00 0.33 0.83 1.00 -1.00 -0.33
Prediction !Disease !Disease !Disease Disease Disease Disease Disease Disease !Disease !Disease 80%
Verification !Disease !Disease Disease Disease Disease !Disease Disease Disease !Disease !Disease

Summary of the sequence-based mutation analysis. A final disease score is obtained by computing the weighted average of all individual disease scores. Mutations with an average disease score above 0.0 are considered as disease-causing and non-disease causing otherwise. green: "non-disease causing", red: "disease causing", yellow: "ambiguous". </figtable>
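The accuracy column can be recomputed from the rows of the table. In this sketch a prediction counts as correct only if its disease score equals the verified label, so an ambiguous score of 0 is counted as wrong; this convention is an assumption, but it reproduces the reported accuracies. The names `TRUTH`, `PREDICTIONS`, and `accuracy` are hypothetical.

```python
# Verified labels in the order H99R, V211I, E150K, L236P, W248R,
# L509P, W351C, A423D, D482N, R83S (1 = disease, -1 = not disease).
TRUTH = [-1, -1, 1, 1, 1, -1, 1, 1, -1, -1]

# Per-method disease scores, copied row by row from the summary table.
PREDICTIONS = {
    "Physicochemical": [-1, -1, 1, 0, 1, -1, 1, 1, -1, 0],
    "Structure":       [-1, -1, -1, -1, 1, 1, 0, 1, -1, -1],
    "BLOSUM62":        [0, -1, 0, 1, 1, 1, 1, 1, 0, 0],
    "PSSM all":        [0, -1, 0, 0, 1, 1, 1, 1, -1, 0],
    "PSSM close":      [0, -1, 0, 1, 1, 1, 1, 1, -1, 0],
    "SIFT":            [-1, -1, -1, 1, 1, 1, 1, 1, -1, 1],
    "Polyphen2":       [-1, -1, -1, 1, 1, 1, 1, 1, -1, -1],
    "SNAP":            [-1, -1, -1, 1, 1, -1, 1, 1, -1, -1],
}


def accuracy(predicted, truth=TRUTH):
    """Fraction of mutations whose disease score matches the verified label."""
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)


for method, predicted in PREDICTIONS.items():
    print(f"{method}: {accuracy(predicted):.0%}")
```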



The individual properties/methods that we used to predict the effect of the mutations differ in their prediction performance. The physicochemical analysis alone reached 80% prediction accuracy. This suggests that mutations which change the physicochemical properties tend to affect the protein function significantly.

The other property analyses, i.e. the structural analysis and the conservation analysis using BLOSUM62 and the PSSMs, did not perform well (50% to 60% prediction accuracy). These analyses were done manually and can be very subjective, so the prediction results vary from person to person. Still, their rather low prediction accuracy suggests that considering only the structural change or only the sequence conservation is not sufficient to make a good prediction. Combining these methods might improve the prediction accuracy.

All the prediction tools, SIFT, PolyPhen-2 and SNAP, returned satisfactory prediction results. Among them, SNAP shows the best prediction accuracy, 90%.

The final prediction has 80% prediction accuracy, which we consider satisfactory, although it is slightly worse than that of SNAP. The reason is clear: since the final prediction combines the predictions of all individual properties/methods, the relatively poor predictions of the structural and sequence conservation analyses had a negative influence.

H99R

  • H99R is not disease causing. Not listed in HGMD.
  • The final prediction is correct.
  • All individual properties/methods except BLOSUM62 and the PSSMs returned the correct prediction.


  • This mutation replaces histidine with arginine, i.e. a large polar residue with positive charge is replaced by another large polar residue with positive charge. Neither is aromatic. The two residues have similar physicochemical properties. Therefore, the change will probably have only a very minor effect on the protein function.
  • The structural analysis shows that the mutated site is exposed and lies in a coil outside the domain region. This suggests that the mutation might not affect the protein function.

V211I

  • V211I is not disease causing. Not listed in HGMD.
  • The final prediction is correct.
  • All individual properties/methods returned the correct prediction.


  • This mutation replaces valine with isoleucine, i.e. a medium-sized nonpolar residue with neutral charge is replaced by another medium-sized nonpolar residue with neutral charge. Neither is aromatic. The two residues share similar physicochemical properties. Therefore, the change will probably have only a very minor effect on the protein function.
  • The structural analysis shows that the mutated site is exposed and lies in a coil outside the domain region. This suggests that the mutation might not affect the protein function.

E150K

  • Gaucher disease type 1 [1]
  • The final prediction is wrong.
  • Only the physicochemical analysis returned the correct prediction.


  • This mutation replaces glutamic acid with lysine, i.e. a large polar residue with negative charge is replaced by another large polar residue with positive charge. Neither is aromatic. The two residues clearly differ in their side chain charge: glutamate is acidic but lysine is basic. Therefore, a notable effect on the protein function is expected.
  • The structural analysis shows that the mutated site is exposed and lies in a coil outside the domain region. This suggests that the mutation might not affect the protein function. However, this prediction is wrong, which may imply that a mutation outside the domain region can also affect the protein function.

L236P

  • Gaucher disease type 1 [2]
  • The final prediction is correct (average disease score 0.43, cf. <xr id="tab:discussion"/>).
  • All the scoring tools returned the correct prediction. The physicochemical and structural analyses had a negative influence on the final decision.


  • This mutation replaces leucine with proline, i.e. a medium-sized nonpolar residue with neutral charge is replaced by another medium-sized nonpolar residue with neutral charge. Neither is aromatic. The two residues share similar physicochemical properties. Based on the physicochemical analysis alone, we cannot make a correct prediction.
  • The structural analysis shows that the mutated site is exposed and lies in a coil outside the domain region. This suggests that the mutation might not affect the protein function. However, the site is near the core region, and a mutation there might change the protein structure too. Therefore, based on this observation, the prediction is "ambiguous".

W248R

  • Gaucher disease [3]
  • The final prediction is correct.
  • All individual properties/methods returned the correct prediction.


  • This mutation replaces tryptophan with arginine, i.e. a large nonpolar residue with neutral charge is replaced by a large polar residue with positive charge. Tryptophan is aromatic but arginine is not. The two residues clearly differ in their physicochemical properties. Losing the ring structure might change the structure of the protein. Therefore, a notable effect on the protein function is expected.
  • The structural analysis shows that the mutated site is buried and lies in an alpha helix within the domain region. It is located in the hydrophobic core region of the TIM beta/alpha-barrel domain, and inserting a hydrophilic arginine is likely to impair the catalytic activity of the enzyme, although it does not change the secondary structure according to the PSIPRED prediction. Therefore, it is predicted as a harmful mutation.

L509P

  • L509P is not disease causing. Not listed in HGMD.
  • The final prediction is wrong (predicted as disease causing, cf. <xr id="tab:discussion"/>).
  • Only the physicochemical analysis and SNAP returned the correct prediction.


  • This mutation replaces leucine with proline, i.e. a medium-sized nonpolar residue with neutral charge is replaced by another medium-sized nonpolar residue with neutral charge. Neither is aromatic. The two residues share similar physicochemical properties.
  • The structural analysis shows that the mutated site is exposed and lies in a beta sheet within the domain region. The mutation to proline tends to turn the sheet into a loop region at this site. Therefore, it is predicted as a harmful mutation.

W351C

  • Gaucher disease type 1 [4]
  • The final prediction is correct.
  • All individual properties/methods except the structural analysis returned the correct prediction.


  • This mutation replaces tryptophan with cysteine, i.e. a large nonpolar residue with neutral charge is replaced by a small polar residue with neutral charge. Tryptophan is aromatic but cysteine is not. The two residues clearly differ in their physicochemical properties. Losing the ring structure might change the structure of the protein. Therefore, a notable effect on the protein function is expected.
  • The structural analysis shows that the mutated site is exposed and lies in a beta sheet within the domain region. The mutation might change the protein structure due to the formation of disulfide bonds. Since this is not certain, it is predicted as "ambiguous".

A423D

  • Gaucher disease [5]
  • The final prediction is correct.
  • All individual properties/methods returned the correct prediction.


  • This mutation replaces alanine with aspartic acid, i.e. a small nonpolar residue with neutral charge is replaced by a medium-sized polar residue with negative charge. Neither is aromatic. The two residues clearly differ in their physicochemical properties. Such a mutation might change the structure of the protein. Therefore, a notable effect on the protein function is expected.
  • The structural analysis shows that the mutated site is buried and lies in a coil within the domain region. Like W248R, this mutation occurs in the hydrophobic core region of the TIM beta/alpha-barrel domain. Inserting an aspartic acid might influence the catalytic activity of the enzyme. Therefore, it is predicted as a harmful mutation.

D482N

  • D482N is not disease causing. Not listed in HGMD.
  • The final prediction is correct.
  • All individual properties/methods except BLOSUM62 returned the correct prediction.


  • This mutation replaces aspartic acid with asparagine, i.e. a medium-sized polar residue with negative charge is replaced by another medium-sized polar residue with neutral charge. Neither is aromatic. Apart from the charge, the two residues share similar physicochemical properties. Therefore, the change will probably have only a very minor effect on the protein function.
  • The structural analysis shows that the mutated site is exposed and lies in a coil within the domain region. This suggests that the mutation might not affect the protein function, since both residues have similar physicochemical properties.

R83S

  • R83S is not disease causing. Not listed in HGMD.
  • The final prediction is correct.
  • However, only the structural analysis, PolyPhen-2 and SNAP returned the correct prediction.


  • This mutation replaces arginine with serine, i.e. a large polar residue with positive charge is replaced by a small polar residue with neutral charge. Neither is aromatic. The two residues do not seem to share similar physicochemical properties; however, the changes might not be very significant. Therefore, we predicted it as "ambiguous", since the effect of this mutation is not clear from the physicochemical analysis alone.