Lab journal task 8
Mutation selection
10 mutations were randomly selected from HGMD and dbSNP.
Mutation analysis
The description of the physicochemical properties is based on the Wikipedia entry for amino acids.
The mutations were visualized with PyMOL. Because the PDB structure 1A6Z starts at position 22 of the reference sequence, we subtracted 22 from the codon position to get the position of the mutation in the structure. The mutations were introduced following the description in "Use PyMOL for this". We modelled the first 9 SNPs, but the last one (Arg330Met) could not be visualized, because the PDB structure is shorter than the reference sequence and only contains residues 22 to 297. The rotamer for each mutated residue was selected based on its orientation and on the size and color of the discs. If no rotamer was free of red discs, we selected the one with the fewest and smallest red discs. For residues located at the surface of the protein, we also tried to find rotamers that do not point into the solvent.
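The position mapping described above can be sketched in a few lines of Python. This is only an illustration: the offset of 22 and the covered range 22 to 297 are taken from the text, while the function name `structure_position` is hypothetical and not part of any tool we used.

```python
from typing import Optional

# Values taken from the journal text above; the names are illustrative only.
OFFSET = 22                    # subtracted from the codon position
FIRST_REF, LAST_REF = 22, 297  # reference positions covered by 1A6Z

def structure_position(ref_pos: int) -> Optional[int]:
    """Map a reference codon position to the structure, or None if not covered."""
    if not FIRST_REF <= ref_pos <= LAST_REF:
        return None
    return ref_pos - OFFSET

# The ten selected mutation positions.
for pos in (53, 63, 67, 97, 130, 168, 183, 217, 282, 330):
    print(pos, structure_position(pos))
```

Position 330 maps to `None`, which matches the observation that Arg330Met could not be visualized.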
The secondary structure of the location of the mutation was taken from the DSSP assignment of the 1A6Z_A structure.
The BLOSUM62 matrix was taken from BLOSUM62 and the PAM250 matrix from PAM250.
The PSSM was computed with PSI-BLAST using 5 iterations and default parameters against the /mnt/project/pracstrucfunc13/data/big/big_80 database:
blastpgp -i /mnt/home/student/betza/data/hfe.fasta -d /mnt/project/pracstrucfunc13/data/big/big_80 -j 5 -o /mnt/home/student/betza/task8/psiblast/iter5_big80.results -Q /mnt/home/student/betza/task8/psiblast/iter5_big80.pssm -C /mnt/home/student/betza/task8/psiblast/iter5_big80.chk
The resulting PSSM is the following (only the 10 mutation positions are shown):
Last position-specific scoring matrix computed, weighted observed percentages rounded down, information per position, and relative weight of gapless real matches to pseudocounts
         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  53 V  -3  -6  -7  -7  -3  -6  -6  -7  -2   0   4  -6   1  -1  -6  -4  -5  -6  -5   6    2   0   0   0   1   0   0   0   1   4  37   0   3   2   0   1   0   0   0  47  1.32 inf
  63 H  -3  -1   2  -3   1  -5  -4  -1   0  -4  -5   0  -3  -7  -3   6  -1  -2  -7  -5    1   3   8   2   3   0   1   5   2   1   1   5   1   0   1  62   4   1   0   1  1.26 inf
  67 R  -3   5   0  -3  -3   2   0  -3  -1  -2  -3   3  -2  -6  -2   0   2  -1  -5  -4    2  33   4   1   1   7   5   2   1   2   2  16   1   0   2   5  13   1   0   1  0.69 inf
  97 M  -1   0   0   0   2   0   0  -1   2   0  -1   0   2   1   0  -1   0   4   1  -1    5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5  0.09 inf
 130 N  -4   0   2   0  -2  -2  -1   5  -2  -2  -4   0  -1  -4  -4  -1   2  -1  -6  -3    1   5  10   4   1   2   4  40   1   3   2   5   1   1   1   5  11   1   0   2  0.67 inf
 168 E  -3   1  -1  -2  -3  -1   0  -3  -2   0   1   4   0   0  -3  -3  -3   0   2   0    2   6   4   2   1   2   7   2   1   6  14  22   2   4   2   3   1   1   7   6  0.29 inf
 183 L  -4  -3  -3  -4  -2  -4  -4  -5  -1   1   5  -2   2  -1  -4  -3  -3  -1   1   2    2   2   1   1   1   1   1   1   1   6  45   3   4   3   1   2   2   1   5  13  0.68 inf
 217 T   0  -3  -2   0  -5  -1  -1   1  -1  -3   1  -2  -1  -3   3   3   0  -6  -1  -1    8   1   1   4   0   3   3  10   1   1  12   2   1   1  15  22   6   0   2   5  0.30 inf
 282 C  -6  -6  -8  -9  12  -8  -9  -8  -8  -7  -7  -8  -7  -5  -8  -5  -6  -8  -5  -6    0   1   0   0  94   0   0   0   0   0   0   0   0   1   0   1   0   0   1   0  4.31 inf
 330 R  -5   5  -4  -6   1  -1  -5  -6   2  -5  -5   3  -3   0  -5  -4  -5   8   4  -6    1  28   1   0   3   2   0   0   4   1   1  18   0   3   1   1   0  23  14   0  1.41 inf
The mutations are marked in purple.
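To read the wild-type and mutant scores off such a PSSM row programmatically, a minimal Python sketch could look like this. It assumes the ASCII layout shown above (position, residue, 20 scores, 20 weighted percentages, two trailing values); the helper `parse_pssm_row` is a hypothetical name, and the example row is the Cys282 row of the big_80 PSSM.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # column order of the PSSM header

def parse_pssm_row(line):
    """Split one ASCII PSSM row into position, residue, scores and percentages."""
    fields = line.split()
    position, wild_type = int(fields[0]), fields[1]
    scores = dict(zip(AMINO_ACIDS, map(int, fields[2:22])))
    percents = dict(zip(AMINO_ACIDS, map(int, fields[22:42])))
    return position, wild_type, scores, percents

# Row for position 282 copied from the PSSM above.
row = ("282 C -6 -6 -8 -9 12 -8 -9 -8 -8 -7 -7 -8 -7 -5 -8 -5 -6 -8 -5 -6 "
       "0 1 0 0 94 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 4.31 inf")
position, wild_type, scores, percents = parse_pssm_row(row)

# Compare the wild-type residue (C) with the mutant residue (Y) of Cys282Tyr.
print(position, wild_type, scores["C"], scores["Y"], percents["C"], percents["Y"])
```

For Cys282Tyr this gives a score of 12 for the wild-type cysteine and -5 for tyrosine.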
To create the MSA, we first searched for homologous mammalian sequences with the NCBI BlastP online tool in UniProtKB/Swiss-Prot and restricted the organisms to Mammalia. The E-value cutoff was set to 0.1 and PSI-BLAST was selected as the algorithm. The maximum number of target sequences was set to 100, because we only wanted to get the close homologs, but we did a second run with a threshold of 20000 to also get remote homologous sequences. All other parameters were left at their defaults. We performed two iterations and then downloaded all matched sequences in FASTA format. Those sequences were then used as input for ClustalW, and the MSA was generated with default parameters. Jalview was then used to save the alignment in FASTA format. The command-line version of Blast 2.2.25+ was then used to generate a PSSM for the HFE protein Q30201 from the MSA:
psiblast -subject Q30201.fasta -in_msa alignment.fa -out_ascii_pssm pssm.txt
The resulting PSSM for the first 100 sequences is the following:
Last position-specific scoring matrix computed, weighted observed percentages rounded down, information per position, and relative weight of gapless real matches to pseudocounts
         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  53 V   0  -3  -3  -4  -1  -3  -3  -4  -3   2   1  -3   1  -1  -3  -2   0  -3  -1   5    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 100  0.47 inf
  63 H   0  -1   0  -1  -2   0   0  -1   5  -3  -3  -1  -2  -2  -1   4   1  -3  -1  -2    0   0   0   0   0   0   0   0  26   0   0   0   0   0   0  74   0   0   0   0  0.47 inf
  67 R  -2   6  -1  -2  -4   1   0  -3  -1  -3  -3   2  -2  -3  -2  -1  -1  -3  -2  -3    0 100   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  0.91 inf
  97 M   0  -2  -1  -2  -1  -1  -1  -2  -2   0   0  -1   3  -2  -2   1   4   1  -2   0    1   0   0   0   0   0   0   0   0   4   0   0  23   0   0   7  62   2   0   0  0.37 inf
 130 N  -2   6   3  -1  -4   0  -1  -2   0  -4  -3   1  -2  -3  -1  -1  -1  -3  -2  -3    0  75  23   0   0   0   0   0   0   0   0   0   0   0   2   0   0   0   0   0  0.73 inf
 168 E  -1   1  -1   0  -4   1   3  -2  -1  -3  -3   5  -2  -4  -1  -1  -1  -3  -2  -3    0   0   0   0   0   0  23   0   0   0   0  77   0   0   0   0   0   0   0   0  0.62 inf
 183 L  -2  -2  -4  -4  -1  -2  -3  -4  -3   1   4  -3   2   0  -3  -3  -1  -2  -1   1    0   0   0   0   0   0   0   0   0   0  96   0   4   0   0   0   0   0   0   0  0.51 inf
 217 T   1   2  -2  -2  -1  -1  -1  -2  -2   2   0   0   0  -2  -2   0   1  -3  -2   2   15  25   0   0   0   0   0   0   0  12   0   0   0   0   0   0  12   0   0  35  0.15 inf
 282 C  -1  -4  -3  -4   9  -3  -4  -3  -3  -2  -2  -3  -2  -3  -3  -1  -1  -3  -3  -1    0   0   0   0 100   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  1.41 inf
 330 R  -1   3  -2  -3  -2   0  -2  -3  -2   1   1   0   5  -1  -3  -2  -1  -2  -2   1    0  30   0   0   0   0   0   0   0  14   6   0  47   0   0   0   0   0   0   3  0.31 inf
This is the PSSM for all homologous sequences:
Last position-specific scoring matrix computed, weighted observed percentages rounded down, information per position, and relative weight of gapless real matches to pseudocounts
         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  53 V   0  -2   1  -2  -3   1   0   0   2   0   0   0  -1   1  -3   0   0   4   1   0    7   0   7   0   0   7   7   7   7   7   7   7   0   7   0   7   7   7   7   7  0.11 inf
  63 H   0   1   0  -1   0   1   0   1   0  -2   0   1  -2  -1   0   1   0  -3  -2  -1    6   7   5   3   3   6   5  10   3   0  10  14   0   3   3  15   4   0   0   4  0.09 inf
  67 R  -2   3  -1  -1  -4   1  -1  -4   1   1  -1   0   1  -3  -1   0   0   4   0   0    2  20   3   3   0   6   2   0   4   9   6   5   4   0   3   7   8   8   4   6  0.26 inf
  97 M  -1  -1  -1  -1   1  -1  -1  -1  -3   1   0  -1   2   0  -1   0   1   2   1   2    4   3   3   4   3   3   3   3   0   7  10   3   7   3   3   5  12   3   5  16  0.10 inf
 130 N  -1   2   2   2  -4   0   1   2   1  -3  -1   0   0  -3   0  -1  -1   0  -3  -2    3  12  12  13   0   2   9  17   4   0   7   5   2   0   5   2   3   2   0   2  0.23 inf
 168 E  -1   0   0   1   2   1   0  -2  -3   1   2   2  -1   1  -3  -2  -1  -3  -2  -2    3   3   5   7   5   5   4   2   0   9  25  19   0   8   0   0   3   0   0   0  0.18 inf
 183 L  -1  -4  -1  -4   3  -1  -4   0  -4   1   4  -4   2  -3   0   0  -3  -4  -4  -1    4   0   3   0   6   3   0   9   0   7  47   0   6   0   4   9   0   0   0   2  0.50 inf
 217 T   1   0   0   0  -2   0   0   0  -2   0   0  -1   1   0   0   0   0  -3  -2   0   15   7   4   5   0   5   5   9   0   6  12   0   6   5   5   4   5   0   0   7  0.04 inf
 282 C   0  -4   1  -4   8   1  -4  -4  -4  -3  -1  -4  -3  -3  -4   1  -3  -4   1   1    7   0   7   0  46   8   0   0   0   0   7   0   0   0   0  10   0   0   6  10  1.16 inf
 330 R   0  -1  -2  -3   2  -1  -3   3  -1  -1   0  -1   0  -1   1  -1   0   4   0   0    6   4   1   0   6   2   0  25   1   2   8   5   3   2   6   2   6   7   3   9  0.22 inf
SIFT
SIFT was executed from the web interface with default parameters.
Val53Met
Substitution at pos 53 from V to M is predicted to AFFECT PROTEIN FUNCTION with a score of 0.00. Median sequence conservation: 3.03. Sequences represented at this position: 396.
His63Asp
Substitution at pos 63 from H to D is predicted to AFFECT PROTEIN FUNCTION with a score of 0.00. Median sequence conservation: 3.03. Sequences represented at this position: 396.
Arg67His
Substitution at pos 67 from R to H is predicted to AFFECT PROTEIN FUNCTION with a score of 0.00. Median sequence conservation: 3.03. Sequences represented at this position: 396.
Met97Ile
Substitution at pos 97 from M to I is predicted to be TOLERATED with a score of 0.53. Median sequence conservation: 3.03. Sequences represented at this position: 395.
Asn130Ser
Substitution at pos 130 from N to S is predicted to AFFECT PROTEIN FUNCTION with a score of 0.02. Median sequence conservation: 3.03. Sequences represented at this position: 396.
Glu168Gln
Substitution at pos 168 from E to Q is predicted to AFFECT PROTEIN FUNCTION with a score of 0.01. Median sequence conservation: 3.03. Sequences represented at this position: 396.
Leu183Pro
Substitution at pos 183 from L to P is predicted to AFFECT PROTEIN FUNCTION with a score of 0.00. Median sequence conservation: 3.03. Sequences represented at this position: 396.
Thr217Ile
Substitution at pos 217 from T to I is predicted to be TOLERATED with a score of 0.92. Median sequence conservation: 3.03. Sequences represented at this position: 396.
Cys282Tyr
Substitution at pos 282 from C to Y is predicted to AFFECT PROTEIN FUNCTION with a score of 0.00. Median sequence conservation: 3.03. Sequences represented at this position: 396.
Arg330Met
Substitution at pos 330 from R to M is predicted to be TOLERATED with a score of 0.11. Median sequence conservation: 3.01. Sequences represented at this position: 255.
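The SIFT output lines above follow a fixed pattern, so they can be collected with a small script. The following Python sketch is an illustration only: the regular expression is written against the wording shown above, the helper `parse_sift` is a hypothetical name, and the cutoff of 0.05 is our assumption about SIFT's default threshold (scores at or below it are reported as affecting protein function).

```python
import re

SIFT_CUTOFF = 0.05  # assumed default SIFT threshold, not stated in the output

SIFT_LINE = re.compile(
    r"Substitution at pos (\d+) from (\w) to (\w) is predicted to "
    r"(AFFECT PROTEIN FUNCTION|be TOLERATED) with a score of ([\d.]+)\."
)

def parse_sift(line):
    """Return (mutation, score, affects_function) for one SIFT output line."""
    match = SIFT_LINE.search(line)
    if match is None:
        raise ValueError("not a SIFT prediction line")
    pos, wild_type, mutant, _, score = match.groups()
    score = float(score)
    return f"{wild_type}{pos}{mutant}", score, score <= SIFT_CUTOFF

example = ("Substitution at pos 130 from N to S is predicted to AFFECT "
           "PROTEIN FUNCTION with a score of 0.02. Median sequence conservation: 3.03")
print(parse_sift(example))
```

Applied to the ten results above, the cutoff reproduces SIFT's own labels: only M97I, T217I and R330M have scores above 0.05.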
PolyPhen-2
PolyPhen-2 was executed from the web interface using default parameters.
Val53Met
HumDiv: This mutation is predicted to be probably damaging with a score of 0.998 (sensitivity: 0.27; specificity: 0.99). HumVar: This mutation is predicted to be possibly damaging with a score of 0.841 (sensitivity: 0.73; specificity: 0.88).
His63Asp
HumDiv: This mutation is predicted to be benign with a score of 0.142 (sensitivity: 0.92; specificity: 0.86). HumVar: This mutation is predicted to be benign with a score of 0.161 (sensitivity: 0.89; specificity: 0.72).
Arg67His
HumDiv: This mutation is predicted to be benign with a score of 0.145 (sensitivity: 0.92; specificity: 0.86). HumVar: This mutation is predicted to be benign with a score of 0.031 (sensitivity: 0.94; specificity: 0.59).
Met97Ile
HumDiv: This mutation is predicted to be possibly damaging with a score of 0.575 (sensitivity: 0.88; specificity: 0.91). HumVar: This mutation is predicted to be benign with a score of 0.114 (sensitivity: 0.90; specificity: 0.69).
Asn130Ser
HumDiv: This mutation is predicted to be possibly damaging with a score of 0.883 (sensitivity: 0.82; specificity: 0.94). HumVar: This mutation is predicted to be benign with a score of 0.282 (sensitivity: 0.87; specificity: 0.76).
Glu168Gln
HumDiv: This mutation is predicted to be probably damaging with a score of 1.000 (sensitivity: 0.00; specificity: 1.00). HumVar: This mutation is predicted to be probably damaging with a score of 0.980 (sensitivity: 0.57; specificity: 0.94).
Leu183Pro
HumDiv: This mutation is predicted to be probably damaging with a score of 1.000 (sensitivity: 0.00; specificity: 1.00). HumVar: This mutation is predicted to be probably damaging with a score of 1.000 (sensitivity: 0.00; specificity: 1.00).
Thr217Ile
HumDiv: This mutation is predicted to be benign with a score of 0.118 (sensitivity: 0.93; specificity: 0.86). HumVar: This mutation is predicted to be benign with a score of 0.097 (sensitivity: 0.91; specificity: 0.68).
Cys282Tyr
HumDiv: This mutation is predicted to be probably damaging with a score of 0.961 (sensitivity: 0.78; specificity: 0.95). HumVar: This mutation is predicted to be possibly damaging with a score of 0.667 (sensitivity: 0.79; specificity: 0.84).
Arg330Met
HumDiv: This mutation is predicted to be probably damaging with a score of 0.997 (sensitivity: 0.41; specificity: 0.98). HumVar: This mutation is predicted to be possibly damaging with a score of 0.781 (sensitivity: 0.76; specificity: 0.87).
MutationTaster
MutationTaster was also executed from the web interface.
Gene: HFE
Transcript: ENST00000357618 (protein_coding, 5286 bases) NM_000410
Val53Met
Prediction: disease causing. Model: simple_aae, prob: 0.974610381496817. Summary: amino acid sequence changed; known disease mutation at this position (HGMD CM994469); protein features (might be) affected; splice site changes.
His63Asp
Prediction: disease causing. Model: simple_aae, prob: 0.974610381496817. Summary: amino acid sequence changed; known disease mutation at this position (HGMD CM994469); protein features (might be) affected; splice site changes.
Arg67His
Prediction: polymorphism. Model: simple_aae, prob: 0.999999997930159. Summary: amino acid sequence changed; listed as SNP; protein features (might be) affected; splice site changes.
Met97Ile
Prediction: disease causing. Model: simple_aae, prob: 0.943409356836766. Summary: amino acid sequence changed; protein features (might be) affected.
Asn130Ser
Prediction: polymorphism. Model: simple_aae, prob: 0.999999996637944. Summary: amino acid sequence changed; listed as SNP; protein features (might be) affected; splice site changes.
Glu168Gln
Prediction: polymorphism. Model: simple_aae, prob: 0.707489599782817. Summary: amino acid sequence changed; known disease mutation at this position (HGMD CM004106); known disease mutation at this position (HGMD CM004810); protein features (might be) affected; splice site changes.
Leu183Pro
Prediction: disease causing. Model: simple_aae, prob: 0.999999979100498. Summary: amino acid sequence changed; known disease mutation at this position (HGMD CM081301); protein features (might be) affected.
Thr217Ile
Prediction: polymorphism. Model: simple_aae, prob: 0.999999999993365. Summary: amino acid sequence changed; listed as SNP; protein features (might be) affected; splice site changes.
Cys282Tyr
Prediction: disease causing. Model: simple_aae, prob: 0.999999999736277 (classification due to ClinVar, real probability is shown anyway). Summary: amino acid sequence changed; heterozygous in TGP; known disease mutation at this position (HGMD CM004391); known disease mutation at this position (HGMD CM960828); known disease mutation: rs1800562 (pathogenic); protein features (might be) affected.
Arg330Met
Prediction: disease causing. Model: simple_aae, prob: 8.23173030693237e-06 (classification due to ClinVar, real probability is shown anyway). Summary: amino acid sequence changed; known disease mutation at this position (HGMD CM990722); known disease mutation: rs111033558 (pathogenic); protein features (might be) affected; splice site changes.
SNAP2
SNAP2 was executed from the command line on the biolab computers. First, we created a ~/.snap2rc file that contains the paths to several databases that SNAP2 needs:
<source lang="bash">
[snap2]
# snapfun_utildir=path - path to package utilities, default: /usr/share/snap2
snap2dir=/usr/share/snap2
# use snap cache [0|1], default: 0
use_snap_cache=0
# snap cache fetch executable
snapc_fetch=/usr/bin/snapc_fetch
# snap cache store executable
snapc_store=/usr/bin/snapc_store
# snap cache root - overrides snap-cache-mgr configuration
snap_cache_root=
# blastpgp_processors
blastpgp_processors=1
# use predictprotein cache, default: 0
use_pp_cache=0
# predictprotein executable
pp_exe=/usr/bin/predictprotein
# sift executable
sift_exe=/usr/bin/sift_for_submitting_fasta_seq.csh
# reprof executable
reprof_exe=/usr/bin/reprof
# blast executable
blast_exe=/usr/bin/blastpgp
[data]
# swiss_dat=path - location of UniProt/Swiss-Prot dat file
swiss_dat=/mnt/project/pracstrucfunc13/data/swissprot/20120501/uniprot_sprot.dat
# db_swiss=path - path to ID index of Swiss-Prot dat file (generated by /usr/share/librg-utils-perl/dbSwiss.pl)
db_swiss=/mnt/project/pracstrucfunc13/data/swissprot/20120501/dbswiss
# PHAT substitution matrix
phat_matrix=/usr/share/snap2/phat.txt
[blast]
# big80=path - path to redundancy reduced database (UniProtKB 80 or equivalent)
big80=/mnt/project/pracstrucfunc13/data/big/big_80
# swiss=path - path to SwissProt database
swiss=/mnt/project/pracstrucfunc13/data/swissprot/uniprot_sprot
</source>
SNAP2 was then executed with the following command:
snapfun -i Q30201.fasta -m mutations.txt -o results.txt
where mutations.txt is a file that contains the mutations in one-letter code. The results are the following:
nsSNP  Prediction   Reliability  Expected Accuracy
V53M   Neutral      0            53%
H63D   Non-neutral  2            63%
R67H   Non-neutral  4            71%
M97I   Neutral      7            87%
N130S  Neutral      3            66%
E168Q  Neutral      7            87%
L183P  Non-neutral  8            91%
T217I  Neutral      0            53%
C282Y  Non-neutral  7            85%
R330M  Non-neutral  7            85%
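The one-letter notation used in mutations.txt and in the table above can be derived from the three-letter names used elsewhere in this journal. The following Python sketch shows one way to do the conversion; the helper `to_one_letter` is a hypothetical name, and printing one mutation per line is only our assumption about the layout of mutations.txt.

```python
import re

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

def to_one_letter(mutation):
    """Convert e.g. 'Cys282Tyr' to 'C282Y'."""
    match = re.fullmatch(r"([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})", mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation: {mutation}")
    wild_type, pos, mutant = match.groups()
    return f"{THREE_TO_ONE[wild_type]}{pos}{THREE_TO_ONE[mutant]}"

mutations = ["Val53Met", "His63Asp", "Arg67His", "Met97Ile", "Asn130Ser",
             "Glu168Gln", "Leu183Pro", "Thr217Ile", "Cys282Tyr", "Arg330Met"]
lines = [to_one_letter(m) for m in mutations]
print("\n".join(lines))  # assumed layout: one mutation per line
```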