Difference between revisions of "Task 8 Lab Journal (MSUD)"

From Bioinformatikpedia
(SNAP)
m (Using MSA)
 
Line 155: Line 155:
 
=== Using MSA ===
 
=== Using MSA ===
   
Reference sequence of BCKDHA was aligned against uniprot database of mammals. Significance cutoff was set to 0.1 and maximal 1000 hit is to be shown. Using the web tool from uniprot, we can easily extract uniprot identifiers of the homologous mammal proteins. Then with the retrieve tool from uniprot all homologous mammal sequences can be extracted. At last, the multiple sequence alignment of reference sequence of BCKDHA and all homologous mammal proteins was generated with the ClustalW2 tool from EBI.
+
Reference sequence of BCKDHA was aligned against UniProt database of mammals. Significance cutoff was set to 0.1 and maximal 1000 hit is to be shown. Using the web tool from UniProt, we can easily extract uniprot identifiers of the homologous mammal proteins. Then with the retrieve tool from uniprot all homologous mammal sequences can be extracted. At last, the multiple sequence alignment of reference sequence of BCKDHA and all homologous mammal proteins was generated with the ClustalW2 tool from EBI.
   
 
== Predicting effects of mutations ==
 
== Predicting effects of mutations ==

Latest revision as of 14:17, 30 August 2013

Visualization of mutations

Very often the PDB structures do not have identical sequences in comparison to their original proteins. Deletion of residues is very frequent in PDB structure. In order to map the positions of mutations reported in SNP databases like HGMD and dbSNP, we have applied global alignment between reference sequence and the sequence in PDB structure of BCKDHA.

Following is the alignment between reference sequence of BCKDHA and chain A of 1U5B. The result shows that the position of mutations reported in SNP databases should be shifted back with 45 residues. It means the 50th residue in reference sequence is actually the 5th in chain A of 1U5B.

########################################
# Program: needle
# Rundate: Sat 29 Jun 2013 21:56:11
# Commandline: needle
#    -auto
#    -stdout
#    -asequence emboss_needle-I20130629-215609-0499-45143350-es.aupfile
#    -bsequence emboss_needle-I20130629-215609-0499-45143350-es.bupfile
#    -datafile EBLOSUM62
#    -gapopen 10.0
#    -gapextend 0.5
#    -endopen 10.0
#    -endextend 0.5
#    -aformat3 pair
#    -sprotein1
#    -sprotein2
# Align_format: pair
# Report_file: stdout
########################################

#=======================================
#
# Aligned_sequences: 2
# 1: NP_000700.1
# 2: SEQUENCE
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 445
# Identity:     400/445 (89.9%)
# Similarity:   400/445 (89.9%)
# Gaps:          45/445 (10.1%)
# Score: 2124.0
# 
#
#=======================================

NP_000700.1        1 MAVAIAAARVWRLNRGLSQAALLLLRQPGARGLARSHPPRQQQQFSSLDD     50
                                                                  |||||
SEQUENCE           1 ---------------------------------------------SSLDD      5

NP_000700.1       51 KPQFPGASAEFIDKLEFIQPNVISGIPIYRVMDRQGQIINPSEDPHLPKE    100
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
SEQUENCE           6 KPQFPGASAEFIDKLEFIQPNVISGIPIYRVMDRQGQIINPSEDPHLPKE     55

NP_000700.1      101 KVLKLYKSMTLLNTMDRILYESQRQGRISFYMTNYGEEGTHVGSAAALDN    150
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
SEQUENCE          56 KVLKLYKSMTLLNTMDRILYESQRQGRISFYMTNYGEEGTHVGSAAALDN    105

NP_000700.1      151 TDLVFGQYREAGVLMYRDYPLELFMAQCYGNISDLGKGRQMPVHYGCKER    200
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
SEQUENCE         106 TDLVFGQYREAGVLMYRDYPLELFMAQCYGNISDLGKGRQMPVHYGCKER    155

NP_000700.1      201 HFVTISSPLATQIPQAVGAAYAAKRANANRVVICYFGEGAASEGDAHAGF    250
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
SEQUENCE         156 HFVTISSPLATQIPQAVGAAYAAKRANANRVVICYFGEGAASEGDAHAGF    205

NP_000700.1      251 NFAATLECPIIFFCRNNGYAISTPTSEQYRGDGIAARGPGYGIMSIRVDG    300
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
SEQUENCE         206 NFAATLECPIIFFCRNNGYAISTPTSEQYRGDGIAARGPGYGIMSIRVDG    255

NP_000700.1      301 NDVFAVYNATKEARRRAVAENQPFLIEAMTYRIGHHSTSDDSSAYRSVDE    350
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
SEQUENCE         256 NDVFAVYNATKEARRRAVAENQPFLIEAMTYRIGHHSTSDDSSAYRSVDE    305

NP_000700.1      351 VNYWDKQDHPISRLRHYLLSQGWWDEEQEKAWRKQSRRKVMEAFEQAERK    400
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
SEQUENCE         306 VNYWDKQDHPISRLRHYLLSQGWWDEEQEKAWRKQSRRKVMEAFEQAERK    355

NP_000700.1      401 PKPNPNLLFSDVYQEMPAQLRKQQESLARHLQTYGEHYPLDHFDK    445
                     |||||||||||||||||||||||||||||||||||||||||||||
SEQUENCE         356 PKPNPNLLFSDVYQEMPAQLRKQQESLARHLQTYGEHYPLDHFDK    400


#---------------------------------------
#---------------------------------------

Analysis of evolutionary conservation of mutated positions

Using PSSM

A position-specific scoring matrix (PSSM) was created with PSI-Blast:

blastpgp -d /mnt/project/pracstrucfunc13/data/big/big_80 -i /mnt/home/student/weish/master-practical-2013/task01/refseq_BCKDHA_protein.fasta -j 5 -h 10e-10 -Q BCKDHA.pssm

The conservation (observed frequency) of the wildtype and mutant amino acids was extracted from the PSSM file with the following Python script (located at /mnt/home/student/schillerl/MasterPractical/task8/extract_aa_frequency.py):


<source lang=python> Extract amino acid frequencies from PsiBlast PSSM (created with -Q option).

Usage: python extract_aa_frequency.py <pssm file> [mutations]

Example for usage: python extract_aa_frequency.py BCKDHA.pssm P38H M82L T151M A222T C264W R265W R314Q R346H I361V Y413H

@author: Laura Schiller

import sys

pssm_file = open(sys.argv[1]) lines = pssm_file.readlines() pssm_file.close()

  1. positions of amino acid frequencies in PSSM

aa_position = { 'A': 22,

               'R': 23,
               'N': 24,
               'D': 25,
               'C': 26,
               'Q': 27,
               'E': 28,
               'G': 29,
               'H': 30,
               'I': 31,
               'L': 32,
               'K': 33,
               'M': 34,
               'F': 35,
               'P': 36,
               'S': 37,
               'T': 38,
               'W': 39,
               'Y': 40,
               'V': 41}

for i in range(2, len(sys.argv)):

   old_aa = sys.argv[i][0]
   new_aa = sys.argv[i][-1]
   position = sys.argv[i][1:(len(sys.argv[i]) - 1)]
   
   for j in range(2, len(lines)):
       splitted = lines[j].split()
       if splitted[0] == position:
           print("Frequencies at position %s: %s %s, %s %s" 
                 % (position, 
                    old_aa, 
                    splitted[aa_position[old_aa]], 
                    new_aa, 
                    splitted[aa_position[new_aa]]))
           break

</source>

Using MSA

Reference sequence of BCKDHA was aligned against UniProt database of mammals. Significance cutoff was set to 0.1 and maximal 1000 hit is to be shown. Using the web tool from UniProt, we can easily extract uniprot identifiers of the homologous mammal proteins. Then with the retrieve tool from uniprot all homologous mammal sequences can be extracted. At last, the multiple sequence alignment of reference sequence of BCKDHA and all homologous mammal proteins was generated with the ClustalW2 tool from EBI.

Predicting effects of mutations

SIFT

SIFT was run on this server with default parameters.

Polyphen2

This server was used with default parameters to run Polyphen2.

MutationTaster

The MutationTaster web tool was used to analyze mutations in BCKDHA. Because the server uses mRNA sequence to evaluate SNPs, we have consulted the reference sequence of BCKDHA (NM_000709) and find out the open reading frame of BCKDHA begins at 40th nucleotide i.e. the first start codon. Then we can map mutations of amino acid to the mRNA sequences.

SNAP

For this part of the task we have used SNAP2. In order to get SNAP2 run properly, we have to write a run-configuration file .snap2rc and put it into the home directory on the student server. Following is the configuration file:

<source lang="ini"> [snap2]

  1. snapfun_utildir=path - path to package utilities, default: /usr/share/snap2

snap2dir=/usr/share/snap2

  1. use snap cache [0|1], default: 0

use_snap_cache=0

  1. snap cache fetch executable

snapc_fetch=/usr/bin/snapc_fetch

  1. snap cache store executable

snapc_store=/usr/bin/snapc_store

  1. snap cache root - overrides snap-cache-mgr configuration

snap_cache_root=

  1. blastpgp_processors

blastpgp_processors=1

  1. use predictprotein cache, default: 0

use_pp_cache=0

  1. predictprotein executable

pp_exe=/usr/bin/predictprotein

  1. sift executable

sift_exe=/usr/bin/sift_for_submitting_fasta_seq.csh

  1. reprof executable

reprof_exe=/usr/bin/reprof

  1. blast executable

blast_exe=/usr/bin/blastpgp


[data]

  1. swiss_dat=path - location of UniProt/Swiss-Prot dat file

swiss_dat=/mnt/project/pracstrucfunc13/data/swissprot/20120501/uniprot_sprot.dat

  1. db_swiss=path - path to ID index of Swiss-Prot dat file (generated by /usr/share/librg-utils-perl/dbSwiss.pl)

db_swiss=/mnt/project/pracstrucfunc13/data/swissprot/20120501/dbswiss

  1. PHAT substitution matrix

phat_matrix=/usr/share/snap2/phat.txt

[blast]

  1. big80=path - path to redundancy reduced database (UniProtKB 80 or equivalent)

big80=/mnt/project/pracstrucfunc13/data/big/big_80

  1. swiss=path - path to SwissProt database

swiss=/mnt/project/pracstrucfunc13/data/swissprot/uniprot_sprot </source>

To call snap2, we have used this command: snap2 -m mutation-list.txt -i ../task01/refseq_BCKDHA_protein.fasta -o snapfun-bckdha.result