Gaucher Disease: Task 08 - LabJournal
Choosing Mutation set
The cDNA sequence used by HGMD has the accession number NM_001005741.2. The protein sequence encoded by this cDNA (in one-letter code) is 100% identical to the UniProt reference sequence that we used for all tasks. Therefore, the mutation positions listed by HGMD can be used directly. dbSNP also lists this accession number, so the correct mutation positions can be read from there as well.
To choose mutations from dbSNP, we selected four point mutations from the SNP GeneView report of glucocerebrosidase for NP_001005741.1 (the protein isoform corresponding to the transcript NM_001005741.2), which was already used in task 7. For HGMD, we randomly picked six mutations from the SNP list of NP_001005741.1.
Mutation Analysis
The information about amino acid properties was taken from Wikipedia.
The mutations were visualised with PyMOL. To mutate the residues, we followed the description in the PyMOLWiki. As before, we accounted for the sequence shift of 39 residues between the UniProt and PDB sequences.
We took the secondary structure information from our previous task 3. We always chose the secondary structure type that was predicted by the majority of the prediction tools in that task.
The scores for the mutations were looked up in two substitution matrices: BLOSUM62 and PAM250.
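The numbering shift mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name `to_pdb_numbering` is hypothetical, and the direction of the shift (PDB position = UniProt position minus 39) is an assumption that should be checked against the structure file in use.

```python
import re

# Shift between UniProt and PDB numbering as stated in the text.
# Assumption: PDB position = UniProt position - 39.
NUMBERING_OFFSET = 39


def to_pdb_numbering(mutation: str, offset: int = NUMBERING_OFFSET) -> str:
    """Convert a UniProt-numbered point mutation (e.g. 'N409S') to PDB numbering."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"not a point mutation: {mutation!r}")
    wild_type, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    pdb_position = position - offset
    if pdb_position < 1:
        raise ValueError(f"{mutation} lies before the first residue of the PDB sequence")
    return f"{wild_type}{pdb_position}{mutant}"


if __name__ == "__main__":
    for uniprot_mutation in ["S77R", "N409S", "L483P"]:
        print(uniprot_mutation, "->", to_pdb_numbering(uniprot_mutation))
```

The converted position is the residue number to select in PyMOL when the loaded structure follows the PDB numbering.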
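A matrix lookup of this kind can be sketched as follows. The dictionary holds only the BLOSUM62 entries needed for the ten mutations of this task, transcribed by hand from the standard matrix, so they should be verified against the original table before reuse. The names `BLOSUM62_SUBSET` and `blosum62_score` are illustrative. PAM250 is not included in this sketch.

```python
# Hand-transcribed BLOSUM62 entries for the substitutions studied in this task.
# Assumption: values copied from the standard BLOSUM62 matrix; verify before reuse.
BLOSUM62_SUBSET = {
    ("S", "R"): -1,
    ("N", "S"): 1,
    ("R", "Q"): 1,
    ("L", "F"): 0,
    ("G", "E"): -2,
    ("V", "I"): 3,
    ("T", "M"): -1,
    ("L", "P"): -3,
}


def blosum62_score(wild_type: str, mutant: str) -> int:
    """Look up a substitution score; the matrix is symmetric, so try both orders."""
    if (wild_type, mutant) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(wild_type, mutant)]
    if (mutant, wild_type) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(mutant, wild_type)]
    raise KeyError(f"pair {wild_type}/{mutant} is not in the transcribed subset")


if __name__ == "__main__":
    mutations = ["S77R", "N141S", "R159Q", "L213F", "G241E",
                 "V349I", "T408M", "N409S", "L483P", "N501S"]
    for mutation in mutations:
        # First and last characters are the wild-type and mutant residues.
        print(mutation, blosum62_score(mutation[0], mutation[-1]))
```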
PSSM
We created two different PSSMs. The first was generated with PSI-BLAST:
blastpgp -j 5 -h 10e-6 \
  -i /mnt/home/student/gerkej/gaucher/task8/P04062.fasta \
  -d /mnt/project/pracstrucfunc13/data/big/big_80 \
  -o /mnt/home/student/gerkej/gaucher/task8/psiBl/ev-6/big80_it5.out \
  -Q /mnt/home/student/gerkej/gaucher/task8/psiBl/ev-6/big80_it5.pssm \
  -C /mnt/home/student/gerkej/gaucher/task8/psiBl/ev-6/big80_it5.chk
PSSM by PSI-BLAST
Last position-specific scoring matrix computed, weighted observed percentages rounded down, information per position, and relative weight of gapless real matches to pseudocounts
A R N D C Q E G H I L K M F P S T W Y V A R N D C Q E G H I L K M F P S T W Y V
77 S 0 1 0 1 -1 1 1 -1 0 -1 -2 0 -1 -3 -3 1 1 -4 -2 0 7 9 4 8 1 7 11 3 2 4 4 6 1 1 0 11 11 0 1 8 0.14 inf
141 N -1 0 2 4 -2 2 2 -3 -1 -3 -3 1 -2 -5 -4 0 1 1 -2 -3 5 4 10 20 1 10 15 2 1 1 3 7 1 0 0 7 7 2 1 2 0.40 inf
159 R -6 8 -6 -7 -8 -4 -5 -7 -6 -8 -4 2 -6 -7 -7 -3 -6 1 -7 -7 0 81 0 0 0 0 0 0 0 0 2 11 0 0 0 2 0 2 0 0 2.56 inf
213 L -3 -5 -2 -6 -4 -5 -5 -2 -5 4 2 -5 0 3 -5 -4 -1 3 0 3 2 0 2 0 0 0 0 3 0 23 22 0 2 13 0 0 4 4 2 21 0.72 inf
241 G -1 2 0 2 -4 1 -1 1 -1 -2 -1 3 -2 -3 -1 -1 -1 0 -1 -2 5 11 5 11 0 7 3 11 2 2 6 17 1 1 4 3 4 2 3 3 0.19 inf
349 V 0 -3 -2 -3 0 -4 -3 2 -1 0 -1 -4 -1 4 -5 0 0 5 2 2 10 1 1 1 2 0 1 14 2 5 4 0 2 17 0 6 6 9 7 14 0.42 inf
408 T -1 1 2 2 0 2 0 2 2 -2 -2 0 -1 -1 -1 0 0 -4 0 -2 3 7 10 11 2 9 5 19 5 2 3 4 2 2 2 5 4 0 3 3 0.19 inf
409 N 1 -2 1 1 1 -2 -1 0 1 -1 -1 -1 0 1 0 1 0 2 1 -1 11 0 10 9 4 0 3 6 3 4 4 3 3 6 4 9 6 4 7 4 0.09 inf
483 L -2 -3 -2 -3 -4 -3 -3 0 -2 3 3 -2 0 0 -3 -3 -2 -4 -1 3 2 1 2 2 0 1 1 9 1 13 29 3 2 4 1 2 2 0 2 24 0.46 inf
501 N -6 -6 9 -3 -8 -4 -3 -6 -5 -6 -8 -6 -3 -8 -5 -2 -2 -8 -7 -6 1 0 87 1 0 0 2 0 0 0 0 0 1 0 1 3 3 0 0 1 2.83 inf
The second PSSM is based on an alignment of all mammalian homologs. We identified the homologs on UniProt by running a BLASTP search against the mammalian database. For P04062 we found 140 homologous sequences with an E-value of less than 10e-4. To generate the MSA of the homologous sequences we used Clustal Omega, the successor of ClustalW, which is recommended on the ClustalW web server. With PSI-BLAST we then created the PSSM from the MSA:
blastpgp \
  -i /mnt/home/student/gerkej/gaucher/task8/P04062.fasta \
  -d /mnt/project/pracstrucfunc13/data/big/big_80 \
  -B /mnt/home/student/gerkej/gaucher/task8/psiBl/clustal-omegas.clustal \
  -o /mnt/home/student/gerkej/gaucher/task8/psiBl/big80_it5.out \
  -Q /mnt/home/student/gerkej/gaucher/task8/psiBl/big80_it5.pssm
PSSM by Clustal Omega alignment of homologs
Last position-specific scoring matrix computed, weighted observed percentages rounded down, information per position, and relative weight of gapless real matches to pseudocounts
A R N D C Q E G H I L K M F P S T W Y V A R N D C Q E G H I L K M F P S T W Y V
77 S 1 -1 0 0 -1 0 0 0 -1 -2 -2 0 -1 -2 0 3 1 -2 -1 -1 3 2 2 2 1 1 2 4 1 2 3 2 1 1 2 64 4 0 1 2 0.25 inf
141 N -1 0 5 3 -2 0 1 -1 0 -3 -3 0 -2 -3 -1 0 0 -3 -2 -2 1 1 55 16 0 1 10 1 0 1 1 6 0 1 1 3 1 0 0 1 0.42 inf
159 R -2 5 -1 -2 -3 1 0 -3 -1 -3 -2 2 -2 -2 -2 -1 -1 5 -1 -3 0 86 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 0.76 inf
213 L -2 -2 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 1 0 0 0 0 0 0 0 0 0 0 100 0 0 0 0 0 0 0 0 0 0.49 inf
241 G 0 1 -1 -1 -3 -1 -2 5 -2 -4 -4 -1 -3 -3 -2 0 -2 -3 -3 -3 0 17 0 0 0 0 0 83 0 0 0 0 0 0 0 0 0 0 0 0 0.83 inf
349 V 0 -3 -3 -3 -1 -2 -3 -3 -3 3 1 -2 1 -1 -3 -2 0 -3 -1 4 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 97 0.39 inf
408 T 0 -1 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 -1 1 4 -2 -1 0 1 1 1 1 0 1 1 1 0 1 2 1 0 1 1 1 82 0 1 1 0.37 inf
409 N -1 0 5 1 -2 0 0 0 0 -2 -2 0 -2 -2 -1 0 0 -3 -2 -2 2 1 76 5 0 1 1 2 0 1 2 1 0 1 1 2 1 0 1 1 0.51 inf
483 L -2 -2 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 1 0 0 0 0 0 0 0 0 0 0 100 0 0 0 0 0 0 0 0 0 0.50 inf
501 N -1 0 6 1 -2 0 0 0 0 -3 -3 0 -2 -3 -2 0 0 -3 -2 -2 1 1 86 1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0 1 0.61 inf
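To read a mutation's score out of such a PSSM row, the row can be split into its two blocks of 20 columns (log-odds scores, then weighted observed percentages). The following is a minimal sketch, assuming the whitespace-separated layout shown above. The function name `parse_pssm_row` is hypothetical, and the example row is the entry for position 159 of the PSI-BLAST PSSM.

```python
# Column order of both 20-column blocks, as given in the PSSM header line.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Row for position 159, copied from the PSI-BLAST PSSM above.
ROW_159 = ("159 R -6 8 -6 -7 -8 -4 -5 -7 -6 -8 -4 2 -6 -7 -7 -3 -6 1 -7 -7 "
           "0 81 0 0 0 0 0 0 0 0 2 11 0 0 0 2 0 2 0 0 2.56 inf")


def parse_pssm_row(line: str) -> dict:
    """Split one PSSM row into position, residue, scores, percentages and information."""
    fields = line.split()
    if len(fields) < 43:
        raise ValueError("expected position, residue, 20 scores, 20 percentages, information")
    return {
        "position": int(fields[0]),
        "residue": fields[1],
        "scores": dict(zip(AMINO_ACIDS, (int(value) for value in fields[2:22]))),
        "percentages": dict(zip(AMINO_ACIDS, (int(value) for value in fields[22:42]))),
        "information": float(fields[42]),
    }


if __name__ == "__main__":
    row = parse_pssm_row(ROW_159)
    # R159Q: compare the score of the wild-type residue with that of the mutant.
    print("wild type R:", row["scores"]["R"], "mutant Q:", row["scores"]["Q"])
```

For R159Q this gives a score of 8 for the wild-type arginine and -4 for glutamine, which matches the row as printed.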
Prediction approaches
SIFT
SIFT was run on the web server and returned the following predictions for our mutations. The score ranges between 0 and 1. A score <= 0.05 indicates a substitution that is predicted to affect protein function, i.e. to be disease causing. Otherwise, the substitution is "tolerated", which means that the mutation is predicted to have no effect on the protein function.
S77R
Substitution at pos 77 from S to R is predicted to be TOLERATED with a score of 0.37. Median sequence conservation: 3.10 Sequences represented at this position:15
N141S
Substitution at pos 141 from N to S is predicted to be TOLERATED with a score of 0.15. Median sequence conservation: 3.13 Sequences represented at this position:15
R159Q
Substitution at pos 159 from R to Q is predicted to AFFECT PROTEIN FUNCTION with a score of 0.00. Median sequence conservation: 3.10 Sequences represented at this position:16
L213F
Substitution at pos 213 from L to F is predicted to AFFECT PROTEIN FUNCTION with a score of 0.00. Median sequence conservation: 3.10 Sequences represented at this position:16
G241E
Substitution at pos 241 from G to E is predicted to AFFECT PROTEIN FUNCTION with a score of 0.01. Median sequence conservation: 3.10 Sequences represented at this position:16
V349I
Substitution at pos 349 from V to I is predicted to be TOLERATED with a score of 0.25. Median sequence conservation: 3.10 Sequences represented at this position:16
T408M
Substitution at pos 408 from T to M is predicted to AFFECT PROTEIN FUNCTION with a score of 0.03. Median sequence conservation: 3.10 Sequences represented at this position:16
N409S
Substitution at pos 409 from N to S is predicted to be TOLERATED with a score of 0.05. Median sequence conservation: 3.10 Sequences represented at this position:16
L483P
Substitution at pos 483 from L to P is predicted to AFFECT PROTEIN FUNCTION with a score of 0.00. Median sequence conservation: 3.10 Sequences represented at this position:16
N501S
Substitution at pos 501 from N to S is predicted to AFFECT PROTEIN FUNCTION with a score of 0.00. Median sequence conservation: 3.10 Sequences represented at this position:16
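The SIFT result lines above follow a fixed wording, so they can be turned into records with a regular expression. This is a sketch that assumes exactly the wording shown in this section. The names `SIFT_LINE` and `parse_sift_line` are illustrative and not part of SIFT.

```python
import re

# Pattern for the result sentence as it appears in the SIFT output above.
SIFT_LINE = re.compile(
    r"Substitution at pos (?P<pos>\d+) from (?P<wt>[A-Z]) to (?P<mut>[A-Z]) "
    r"is predicted to (?P<call>be TOLERATED|AFFECT PROTEIN FUNCTION) "
    r"with a score of (?P<score>\d+\.\d+)"
)


def parse_sift_line(line: str) -> dict:
    """Extract mutation, call and score from one SIFT result line."""
    match = SIFT_LINE.search(line)
    if match is None:
        raise ValueError(f"unrecognised SIFT line: {line!r}")
    return {
        "mutation": f"{match['wt']}{match['pos']}{match['mut']}",
        "tolerated": match["call"] == "be TOLERATED",
        "score": float(match["score"]),
    }


if __name__ == "__main__":
    example = ("Substitution at pos 483 from L to P is predicted to AFFECT PROTEIN "
               "FUNCTION with a score of 0.00. Median sequence conservation: 3.10")
    print(parse_sift_line(example))
```

The sketch keeps the call reported by SIFT instead of recomputing it from the displayed score, since the displayed score is rounded to two decimals.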
Polyphen2
Polyphen2 offers two different prediction scores, which were trained and tested on different datasets. As recommended on the website, we decided to focus on the HumVar score, which uses all human disease-causing mutations from UniProtKB as damaging and common human non-synonymous SNPs as non-damaging. For the Polyphen2 nomenclature benign, possibly damaging and probably damaging, we used the identifiers non-disease causing, possibly damaging and disease causing.
The score ranges between 0 and 1 and reflects the probability of the SNP having a damaging effect.
S77R: This mutation is predicted to be benign with a score of 0.170 (sensitivity: 0.89; specificity: 0.72).
N141S: This mutation is predicted to be benign with a score of 0.009 (sensitivity: 0.96; specificity: 0.49).
R159Q: This mutation is predicted to be probably damaging with a score of 0.997 (sensitivity: 0.27; specificity: 0.98).
L213F: This mutation is predicted to be possibly damaging with a score of 0.790 (sensitivity: 0.76; specificity: 0.87).
G241E: This mutation is predicted to be possibly damaging with a score of 0.892 (sensitivity: 0.70; specificity: 0.90).
V349I: This mutation is predicted to be benign with a score of 0.118 (sensitivity: 0.90; specificity: 0.70).
T408M: This mutation is predicted to be benign with a score of 0.113 (sensitivity: 0.90; specificity: 0.69).
N409S: This mutation is predicted to be benign with a score of 0.234 (sensitivity: 0.88; specificity: 0.75).
L483P: This mutation is predicted to be possibly damaging with a score of 0.856 (sensitivity: 0.72; specificity: 0.88).
N501S: This mutation is predicted to be probably damaging with a score of 0.979 (sensitivity: 0.57; specificity: 0.94).
MutationTaster
We used the MutationTaster web version. As input we used:
- Gene: GBA
- Transcript: ENST00000327247 (protein_coding, 2387 bases) NM_001005741
- Position/snippet refers to coding sequence (ORF)
For the alteration, we calculated the position of the mutated bases.
The reported score (prob) is the probability of the prediction.
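Calculating the position of the mutated bases from an amino acid position can be sketched as below. It assumes that coding-sequence numbering starts at 1 with the first base of the start codon and that amino acid position 1 is the initiator methionine. The function name `codon_positions` is hypothetical. Which of the three bases is altered still has to be read from the mutation entry itself.

```python
def codon_positions(aa_position: int) -> tuple:
    """Return the three coding-sequence base positions of the codon for a residue.

    Assumption: ORF numbering is 1-based and residue 1 is the start codon.
    """
    if aa_position < 1:
        raise ValueError("amino acid positions are 1-based")
    first_base = 3 * (aa_position - 1) + 1
    return (first_base, first_base + 1, first_base + 2)


if __name__ == "__main__":
    for position in (77, 409, 483):
        print(position, codon_positions(position))
```

For residue 409, for example, the codon covers coding positions 1225 to 1227.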
S77R
Prediction: disease causing Model: simple_aae, prob: 0.897766357553113 Summary: amino acid sequence changed, protein features (might be) affected, splice site changes
N141S
Prediction: polymorphism Model: simple_aae, prob: 0.828715180775458 Summary: amino acid sequence changed, protein features (might be) affected, splice site changes
R159Q
Prediction: disease causing Model: simple_aae, prob: 0.999998729728442 Summary: amino acid sequence changed, known disease mutation at this position (HGMD CM880035), known disease mutation: rs79653797 (pathogenic), protein features (might be) affected, splice site changes
L213F
Prediction: disease causing Model: simple_aae, prob: 0.99997844587523 Summary: amino acid sequence changed, known disease mutation at this position (HGMD CM057076), protein features (might be) affected, splice site changes
G241E
Prediction: disease causing Model: simple_aae, prob: 0.999988349653367 Summary: amino acid sequence changed, known disease mutation at this position (HGMD CM992894), protein features (might be) affected, splice site changes
V349I
Prediction: disease causing Model: simple_aae, prob: 0.998220211902893 Summary: amino acid sequence changed, protein features (might be) affected, splice site changes
T408M
Prediction: polymorphism Model: simple_aae, prob: 0.595642579948168 Summary: amino acid sequence changed, heterozygous in TGP, known disease mutation at this position (HGMD CM960697), protein features (might be) affected, splice site changes
N409S
Prediction: disease causing Model: simple_aae, prob: 0.998851047063661 Summary: amino acid sequence changed, known disease mutation at this position (HGMD CM880036), protein features (might be) affected, splice site changes
L483P
Prediction: disease causing Model: simple_aae, prob: 0.999999999939023 Summary: amino acid sequence changed, known disease mutation at this position (HGMD CM870010, HGMD CM940819, rs421016 (pathogenic)), protein features (might be) affected, splice site changes
N501S
Prediction: disease causing Model: simple_aae, prob: 0.999999999990917 Summary: amino acid sequence changed, known disease mutation at this position (HGMD CM057072), protein features (might be) affected, splice site changes
SNAP2
For SNAP2 we created a file /.snap2rc with the paths of all databases needed by SNAP2. After that, SNAP2 was run on the Rostlab server as follows:
snapfun -i P04062.fasta -m mutation-list.txt -o snap_P04062.out
nsSNP   Prediction    Reliability Index   Expected Accuracy
-----   -----------   -----------------   -----------------
S77R    Neutral       2                   69%
N141S   Neutral       5                   89%
R159Q   Non-neutral   7                   96%
L213F   Neutral       0                   53%
G241E   Neutral       0                   53%
V349I   Neutral       5                   89%
T408M   Neutral       5                   89%
N409S   Neutral       0                   53%
L483P   Neutral       1                   60%
N501S   Non-neutral   0                   58%