Structure-based mutation analysis TSD Next »

The journal for this task can be found here.

Mutations

Dataset

Figure 1: SNPs highlighted on the HexA subunit structure.

</figure>

The following SNPs, selected by an unbiased source, will be analysed: M1V, L39R, C58Y, L127R, R170W, R178H, S210F, D258H, L451V and E482K.

The SNPs are visualised in the Hex A subunit alpha structure in <xr id="fig:snpsOnstr"/>.

Chemical properties

The biochemical properties of the wildtype and mutant amino acids of the chosen SNPs are listed in <xr id="tab:biochem"/>. Displayed are the hydrophobicity in form of the hydropathy index (Kyte Doolittle) and the according category, the volume with the matching characterisation, the charge and the grantham score. All are taken from <ref name="snpdbe">http://www.rostlab.org/services/snpdbe/</ref>. The Grantham score defines a chemical distance between two amino acids based on composition, polarity and molecular volume <ref name="grant">Grantham,R. (1974) Amino acid difference formula to help explain protein evolution. Science (New York, N.Y.), 185, 862-4.</ref>. It categorizes codon replacements into classes of increasing chemical dissimilarity, and it ranges from 5 to 215<ref name="grantham">Grantham R. Amino acid difference formula to help explain protein evolution. Science 1974; 185: 862-864 </ref>.
The highest Grantham score has the SNP C58Y where both amino acids have a high difference in volume. The change from glutamic acid to lysine on position 482 seems very severe as the charge changes from negative to positive. Altogether in 6 out of 10 cases the charge is different in wildtype and mutant. In 4 mutations the hydrohpobicity is changed from polar to non-polar or the other way round.

**Table 1:** Biochemical properties.
Mutation	Wildtype			Mutant			Grantham score
	Hydrophobicity	Volume	Charge	Hydrophobicity	Volume	Charge
M1V	1.9 (nonpolar)	162.9 (bulky)	neutral	4.2 (nonpolar)	140.0 (small)	neutral	21
L39R	3.8 (nonpolar)	166.7 (bulky)	neutral	-4.5 (polar)	173.4 (bulky)	positive	102
C58Y	2.5 (polar)	108.5 (small)	neutral	-1.3 (polar)	193.6 (bulky)	neutral	194
L127R	3.8 (nonpolar)	166.7 (bulky)	neutral	-4.5 (polar)	173.4 (bulky)	positive	102
R170W	-4.5 (polar)	173.4 (bulky)	positive	-0.9 (nonpolar)	227.8 (bulky)	neutral	101
R178H	-4.5 (polar)	173.4 (bulky)	positive	-3.2 (polar)	153.2 (bulky)	neutral	29
S210F	-0.8 (polar)	89.0 (tiny)	neutral	2.8 (nonpolar)	189.9 (bulky)	neutral	155
D258H	-3.5 (polar)	111.1 (small)	negative	-3.2 (polar)	153.2 (bulky)	neutral	81
L451V	3.8 (nonpolar)	166.7 (bulky)	neutral	4.2 (nonpolar)	140.0 (small)	neutral	32
E482K	-3.5 (polar)	138.4 (bulky)	negative	-3.9 (polar)	168.6 (bulky)	positive	56

</figtable>

Structural observations

In <xr id="tbl:pymolstruc"/> all SNPs are displayed with the wildtype and mutant structure. The M1V mutation could not be displayed as it is not part of the pdb reference structure. Especially the mutations R170W, R178H and D258H seem disrupting as they destroy the H-bonds present in the wildtype. E482K, L451V and L39R express comparably weak alterations of the natural structure. There are 5 ring-structure substitutions which could be an indicator of a more drastic structure change leading to functional implications.

**Table 2:** Overview of all SNPs: Comparison wildtype and mutant structure. The wildtype is colored green, the mutant purple and polar contacts are displayed in blue.
L39R	C58Y	L127R	R170W	R178H
S210F	D258H	L451V	E482K

</figtable>

**Table 3: DSSP structure** is the secondary structure as annotated by DSSP for PDB entry 2gk1. Explanation for the annotation can found at the DSSP website <ref name="dsspweb">http://swift.cmbi.ru.nl/gv/dssp/DSSP_2.html</ref>. This means, '-' denotes there is no secondary structure assumed at this position, while 'N/A' marks positions that are not resolved in the structure.
Mutation	DSSP structure	PsiPred wt	PsiPred mt
M1V	N/A
L39R	E
C58Y	-
L127R	E
R170W	E
R178H	S
S210F	B
D258H	-
L451V	T
E482K	H

</figtable>

Substitution matrices

BLOSUM62 and PAM

<xr id="tab:pamnlos"/> shows the substitution scores of all mutations for the BLOSUM62, PAM1 and PAM250 matrix. PAM1 and PAM250 are the extremes for this type of substitution matrix and might not be the most commonly applied ones. BLOSUM62 however is one of the standard matrices used by default in large tools like BLAST. The BLOSUM62 matrix used here is the incorrectly calculated version, since it generally performs better <ref name="wrongblosum">Styczynski, M. P., Jensen, K. L., Rigoutsos, I., & Stephanopoulos, G. (2008). BLOSUM62 miscalculations improve search performance. Nature biotechnology, 26(3), 274–275. Nature Publishing Group. Retrieved from http://www.nature.com/nbt/journal/v26/n3/full/nbt0308-274.html</ref>.

An entry in the PAM matrix describes the probability that the wildtype amino acid is replaced by the mutant amino acid in a given timeframe, e.g. 1PAM or 250PAM. Entries in a BLOSUM matrix are slightly less intuitive and describe the ratio between the wildtype and mutant appearing in a biologically founded sequence or them appearing by chance. In addition the values are usually displayed as log-odds. Refer to the original publication <ref name="BLOSUM">Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America, 89, 10915-9.</ref> for details.

**Table 4:** Substitution scores for the SNPs based on the matrices BLOSUM62, PAM1 and PAM250. Values for the PAM matrices have been multiplied by 10000 for better readability. Note that the values for BLOSUM62 are those from the incorrectly calculated matrix <ref name="wrongblosum"/>.
Mutation	BLOSUM62 <ref name="blosumvalsrc">http://www.ncbi.nlm.nih.gov/Class/BLAST/BLOSUM62.txt</ref>	PAM (1<ref name="pamvalsrc1">http://www.icp.ucl.ac.be/~opperd/private/pam1.html</ref>/250<ref name="pamvalsrc250">http://www.icp.ucl.ac.be/~opperd/private/pam250.html</ref>)
M1V	1	17/400
L39R	-2	1/200
C58Y	-2	3/300
L127R	-2	1/200
R170W	-3	2/200
R178H	0	8/500
S210F	-2	2/200
D258H	-1	3/400
L451V	1	11/1500
E482K	1	7/800

</figtable>

In a BLOSUM62 matrix, the values range between -4, a very unlikely substitution to 11, the most likely substitution. With this in mind it becomes apparent that most SNPs describe rare changes. The most likely ones only have values of 1 which does not seem much, compared to the maximum of 11, however the 75% quantile of all values in the matrix is only 0, so a value of 1 is not as low as it might seem. On the other hand a value of 1 seems a lot considering that E482K is a change from a negatively charged amino acid to a positively charged one.

In PAM1 and PAM250, the minimum and maximum values are 0,9976 and 9,72 respectively. The high maximum especially for PAM1, stems from the diagonal of the matrix; the 90% quantile is only 22. As to be expected, given more time for the changes to occur (PAM1 to PAM250), the probability for the substitutions to occur rises. However the general relation between the SNPs mostly stays comparable. The biggest change occurs for M1V which is much more likely given only one unit of time.

From this small example the two matrices seem to correlate well. All values that are high in BLOSUM62 are comparably high in both PAM matrices as well, the only exception being M1V which becomes significantly less probable in PAM250. Concerning the phenotype it would be rather far fetched to make an assumption based only on these values, especially since they are mostly close together. However both matrices seem to agree that if there are non-disease causing SNPs among the set, they are M1V, L451V and E482K, possibly R178H as well.

PSSM

PSI-Blast was used to create a PSSM that should contain more refined mutation scores since they are calculated separately for each position. The resulting scores are shown in <xr id="tab:pssm"/>. Comparing the values to the scores from position-independent matrices used above one can see that the general relations between the SNPs mostly stay the same. Exception are M1V, L451V and E482K becoming less likely and R178H becoming more likely.

The change in M1V is to be expected considering that the first position in the homologous sequences should mostly be occupied by methionine due to the start codon. The fact that R178H becomes more likely is surprising since this position is one of those originally identified to be potentially important for catalysis. This might be a small indicator that the original assumption was wrong or that it at least is not important in all homologs.

L451 is found in a single residue loop that could also be seen as an extension of the preceding alpha-helix and is directly followed by a beta-strand. In any case the side-chain at this position is pointing outside of the protein and with only beta-sheets and alpha-helices nearby it does not seem to fulfil a deeper purpose. Therefore exchanging the leucine against the even comparably similar valine should not be a problem and the decreased score is a surprise that cannot be explained for now. Apparently the residue does serve a purpose in other seqeunces. E482K is again an expected change, since there is not only the change in charge but structural analysis also suggests that glutamate forms hydrogen bonds to neighbouring residues.

**Table 5:** Substitution scores for the SNPs based on a PSSM created with a PSI-Blast query for P06865 on BIG and five iterations. Parameters for the maximum number of returned hits were chosen as in Task 2, *i.e.* 3800 hits.
Mutation	PSI-Blast PSSM
M1V	-6
L39R	-4
C58Y	-2
L127R	-7
R170W	-5
R178H	3
S210F	1
D258H	1
L451V	-2
E482K	-2

</figtable>

Multiple sequence alignments

The original idea, as outlined in the protocol did not work, since PSI-Blast only aligns the beginning part of the sequence. We will have to think about an alternative.

**Table 6:** Substitution scores for the SNPs based on a PSSM created with on an MSA created with clustalw and based on sequences retrieved with a PSI-Blast query for P06865 on nr and two iterations.
Mutation	MSA-PSSM
M1V	2
L39R	-1
C58Y	0
L127R	-3
R170W
R178H
S210F
D258H
L451V
E482K

</figtable>

Prediction

SIFT

SIFT prediction is based on the degree of conservation of amino acid residues in sequence alignments derived from closely related sequences, collected through PSI-BLAST. The output states whether a SNP affects the protein function or not. Added are the score and also the number of sequences for the conservation calculation at this position <ref name ="sift">Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc. 2009;4(7):1073-81. </ref>.

The results for the 10 SNPs are shown in <xr id="tab:sift"/>. The score does not seem to take factors like the number of sequences into account, however the webserver warns, that there is low confidence in the prediction for M1V, since there is only one sequence (the query) for this SNP. The scaled probabilities from the full output suggest that for each position of interest, there are no mutations tolerated at all, and any substitution amino acid should exert an effect. Apparently the positions in question are well conserved in the homologous sequences retrieved by SIFT.

**Table 7: Prediction** is the effect prediction, which is always damaging when the score is below 0.05 <ref name="siftdoc">http://sift.jcvi.org/www/SIFT_help.html#SIFT_PROCEDURE</ref>. **Score** is said score, however the exact calculation remains unclear from both the online documentation as well as the original publication. **Sequences** are the number of sequences with amino acid information at the current position. Median sequence conservation is 3.00 +- 0.02 in all cases but M1V where it is 4.32.
Mutation	Prediction	Sequences
M1V	AFFECT PROTEIN FUNCTION	1
L39R	AFFECT PROTEIN FUNCTION	54
C58Y	AFFECT PROTEIN FUNCTION	55
L127R	AFFECT PROTEIN FUNCTION	67
R170W	AFFECT PROTEIN FUNCTION	66
R178H	AFFECT PROTEIN FUNCTION	65
S210F	AFFECT PROTEIN FUNCTION	68
D258H	AFFECT PROTEIN FUNCTION	68
L451V	AFFECT PROTEIN FUNCTION	66
E482K	AFFECT PROTEIN FUNCTION	65

</figtable>

PolyPhen2

Polyphen2 incorporates a multiple sequence alignment pipeline and a probabilistic classifier based on machine-learning for the annotation of SNPs. The method uses three structure-based and eight sequence-based predictive features. Two datasets were used to train and test PolyPhen2, HumDiv and HumVar <ref name="pph2">Peshkin,L. et al. (2010) A method and server for predicting damaging missense mutations. Nature methods, 7, 248-249.</ref>.

The results of the effect prediction by PolyPhen2 are shown in <xr id="tab:pph2"/>. As can be seen all SNPs but M1V are predicted to have a damaging effect with very high confidence. In fact the false positive rates for these probabilities are below 0.1. Observing the full output it can be seen that PolyPhen2 used structural information from PDB entry 2gk1 for all mutations but M1V, where no structure is available. In addition UniProtKB sequence annotations were used, marking C58 as being part of a disulfide bridge as well as noting that M1V is part of the signal peptide and L39 and C58 are part of a propeptide. Whether these predictions are therefore more reliable remains to be seen at the end when checking the real effect in HGMD, however only from the scores given by PolyPhen2 this would not be apparent, since many other mutations without these additional features have an equally high probability as well as FDR, FPR, TPR.

**Table 8:** Prediction results for all SNPs by PolyPhen2. **Prediction** is the predicted effect for the specific SNP. With 'probably damaging' there are less than 10% false positives <ref name="pph2">Peshkin,L. et al. (2010) A method and server for predicting damaging missense mutations. Nature methods, 7, 248-249.</ref>. **pph2_class** is the output of a binary effect classification and **pph2_prob** the raw probability of the mutation having a damaging effect. The column **based_on** present in the output file as well is not shown since it is a remnant of PolyPhen1 and obsolote <ref name="pph2_output>http://genetics.bwh.harvard.edu/pph2/dokuwiki/appendix_a</ref>.
Mutation	Prediction	pph2_class	pph2_prob
M1V	benign	neutral	0.063
L39R	probably damaging	deleterious	1
C58Y	probably damaging	deleterious	1
L127R	probably damaging	deleterious	1
R170W	probably damaging	deleterious	1
R178H	probably damaging	deleterious	1
S210F	probably damaging	deleterious	1
D258H	probably damaging	deleterious	1
L451V	probably damaging	deleterious	0.994
E482K	probably damaging	deleterious	1

</figtable>

SNAP

SNAP was developed by Bromberg et al. in 2007 <ref name="yana">Yana Bromberg and Burkhard Rost. Snap: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Res, 35(11):3823–3835, 2007.</ref>. The method is based on a neural network with many input features such as SWISS-PROT annotations, biochemical properties, sequence information, PSI-BLAST profiles and predicted 1D-structure. As input SNAP needs only sequence information but will also consider and benefit of available functional and structural annotations. The prediction can be either that the SNP is "neutral" or "non-neutral". SNAP is stated to have a 77% accuracy of non neutral predictions and 76% of the neutral substitutions at 80% accuracy.
The reliability index RI indicates the confidence of a certain prediction and should be considered together with the expected accuracy.

The SNAP prediction results are displayed in <xr id="tab:snap"/>. 4 out of the 10 SNPs are predicted to be neutral but only in the case of M1V the reliability and accuracy are very high. With E482K and S210F the reliability is below 2 which hints at a possible misclassification. The non neutral prediction for L39R and R178H is also comparably weak.

M1 seems to be a unimportant residue as it has 0 effect predictions at this position. 258, 178 and 58 seem to be the worst positions for any mutation whereas at 127 the chance of an effect is just around 50%.

For comparison the effect prediction is also provided by snapfun, the older SNAP method. The predictions differ for L39R and S210F (E482K cannot be compared as snapfun did not yield a prediction for this SNP). If the 2 methods were to be combined in a jury decision both mutations would be considered non-neutral according to accuracy and reliability.

**Table 9:** Prediction results for all SNPs by SNAP. Prediction is the predicted effect for the specific SNP. RI is the reliability index wth a range from 0-9. The expected accuracy ranges from 0 to 100%.
	SNAP2				SNAP
Mutation	Prediction	RI	Accuracy	Non-neutral at this position	Prediction	RI	Accuracy
M1V	neutral	9	91%	0	neutral	4	85%
L39R	non-neutral	2	63%	10	neutral	0	53%
C58Y	non-neutral	4	71%	16	non-neutral	1	63%
L127R	non-neutral	2	63%	7	non-neutral	6	93%
R170W	non-neutral	7	80%	15	non-neutral	7	96%
R178H	non-neutral	1	60%	16	non-neutral	7	96%
S210F	neutral	0	51%	1	non-neutral	5	87%
D258H	non neutral	3	67%	16	non-neutral	7	96%
L451V	neutral	6	79%	9	neutral	1	60%
E482K	neutral	2	59%	2	-	-	-

</figtable>

Consensus

Taking into account all information gained so far, shown in <xr id="tab:overview"/>, but excluding the prediction methods for now, M1V, L451V and E482K are the SNPs which might be non-disease causing. Possible R178H as well, however this position is one of those participating in hydrogen bond forming with the ligand and the absence of an effect therefore unlikely. It should be noted that while most sources seem to agree on these three SNPs, the PSSM obtained through PSI-Blast shows a very different picture: Here these mutations are rather unlikely.
Taking into account the prediction methods as well, the former assumptions are mostly reinforced, however since two methods agree on this, it is to be expected that L451V and E482K are in fact disease causing. Since SNAP is basically the only data point disagreeing on the disease causing status of S210F this is likely a false negative an can be disregarded. The only thing left for debate is therefore M1V. Two methods predict that it is non-disease causing while the other data points are mostly borderline cases or disagreeing completely (PAM and PSSM values). However, disregarding any information obtained in task and considering that this is a mutation altering the start codon, which is highly conserved, it would be surprising to see that this is a mutation that does not have an effect on the protein. In this case ATG changes to GTG. This is in case a possible alternative start codon, however these are only common in prokaryotes <ref name="alberts">Alberts,B. et al. (2008) Molecular biology of the cell 5th ed. Garland Science, New York.</ref>. Since a human protein is observed here, this argument does not apply. The idea is supported by the fact that in the original paper <ref name="m1vpap">Mules,E.H. et al. (1992) Six novel deleterious and three neutral mutations in the gene encoding the alpha-subunit of hexosaminidase A in non-Jewish individuals. American journal of human genetics, 50, 834-41.</ref>, the mutation is identified only on the DNA level and there is no mention of the resulting protein. Therefore, the overall conclusion on M1V must be that it is disease causing, despite the data mostly suggesting otherwise.

**Table 10:** Overview over several key data points collected during the task for every one of the 10 SNPs.
Mutation	Grantham score	BLOSUM62	PAM1/250	PSSM	MSA	SIFT	PolyPhen2	SNAP
M1V	21	1	17/400	-6	2	effect	no effect	no effect
L39R	102	-2	1/200	-4	-1	effect	effect	effect
C58Y	194	-2	3/300	-2	0	effect	effect	effect
L127R	102	-2	1/200	-7	-3	effect	effect	effect
R170W	101	-3	2/200	-5		effect	effect	effect
R178H	29	0	8/500	3		effect	effect	effect
S210F	155	-2	2/200	1		effect	effect	no effect
D258H	81	-1	3/400	1		effect	effect	effect
L451V	32	1	11/1500	-2		effect	effect	no effect
E482K	56	1	7/800	-2		effect	effect	no effect

</figtable>

Evaluation

The first thing that strikes out when looking at the evaluation of the prediction results (c.f <xr id="tab:eval"/>) is that every single SNP is listed in HGMD and therefore associated with TSD. Solely SIFT was able to predict every mutation effect correctly. Polyphen made only 1 mistake with M1V but it identified this prediction as weak, having only one sequence to base its prediction onto. Thus in this case SNAP was clearly outperformed by the other methods. Even if the prediction from snapfun and snap2 are combined in a jury decision there are still 2 mutations misleadingly classified as neutral. M1V even receives a very high accuracy and reliability score.
The Methionine on position 1 seems to be a hard case for prediction methods, probably because it has a low coverage by homologous sequences, no structure is available and it is at right at the beginning of the sequence.

As the set of mutations is unfortunately biased toward disease, and additionally very small no further conclusions concerning the performance of the prediction methods should be derived from the results.

**Table 11:** Validation of the prediction results.
Mutation	SIFT	Polyphen2	SNAP2	TSD involvement	Databases
M1V	effect	no effect	no effect	Yes	HGMD, dbSNP, SNPdbe, SNPedia, OMIM
L39R	effect	effect	effect	Yes	HGMD, dbSNP, SNPdbe, SNPedia, OMIM
C58Y	effect	effect	effect	Yes	HGMD, OMIM
L127R	effect	effect	effect	Yes	HGMD, dbSNP, SNPdbe, SNPedia, OMIM
R170W	effect	effect	effect	Yes	HGMD, dbSNP, SNPdbe, SNPedia, OMIM
R178H	effect	effect	effect	Yes	HGMD, dbSNP, SNPdbe, SNPedia, OMIM
S210F	effect	effect	no effect	Yes	HGMD, dbSNP, SNPdbe, SNPedia, OMIM
D258H	effect	effect	effect	Yes	HGMD, dbSNP, SNPdbe, SNPedia, OMIM
L451V	effect	effect	no effect	Yes	HGMD, dbSNP, SNPdbe, SNPedia, OMIM
E482K	effect	effect	no effect	Yes	HGMD, dbSNP, SNPdbe, SNPedia, OMIM

</figtable>

References

Sequence-based mutation analysis TSD

Contents

Mutations

Dataset

Chemical properties

Structural observations

Substitution matrices

BLOSUM62 and PAM

PSSM

Multiple sequence alignments

Prediction

SIFT

PolyPhen2

SNAP

Consensus

Evaluation

References

Navigation menu

Views

Personal tools

Bioinformatik navigation

MediaWiki navigation

Search

Tools