Task 2 lab journal (MSUD)

From Bioinformatikpedia
Revision as of 23:51, 5 May 2013 by Weish (talk | contribs) (Data creation)

Sequence searches

Configure parameters

  • We have used blastall for blastp and psiblast
  • BLAST and PSI-BLAST
    • big_80 database was used for sequence search: -d /mnt/project/pracstrucfunc13/data/big/big_80
    • output put format was set to xml and tab separated values:
      • XML output: -m 7
      • TSV output: -m 9
    • the number of hits to be shown was set to 200000: -b 200000
    • For PSI-Blast number of iterations and cutoff of E-values were set with following parameters
      • iteration: -j <number_of_iterations>
      • E-value cutoff: -h <threshold>
  • HHBlits
    • uniprot20 database was used for sequence search: -d /mnt/project/pracstrucfunc13/data/hhblits/uniprot20_current
    • Resulted a3m, hhm and hhr files are stored: -o result.hhr -oa3m result.a3m -ohhm result.hhm
    • Output size was also set to 200000: -Z 200000 -B 200000

Data creation

For each of blast, psi-blast and hhblits, a shell script was written to perform sequence search.

Convert result of hhblits

We find result of hhblits in hhr format is not parser-friendly. So the program hhr2tsv was written. It finds out all hits and their statistics information in hhr file and write the data out to a tsv (tab-separated values) file.

SCOP classification

In order to check the quality of sequence search programs, we have used the parsable file from Structural Classification of Proteins(SCOP). The classification of domains in PDB files can be observed in file dir.cla.scop.txt_1.75.

Id mapping from Uniprot to Gene Ontology

Quality check of sequence search programs was also performed using Gene Ontology(GO) annotations. The Idmapping data was used to assign GO annotations to each Uniprot proteins (genes).

  • Idmapping data in TSV format was downloaded from FTP site of Uniprot: idmapping_selected.tab.gz (large file! 1.2 GB)
  • GO term IDs are stored in the 7th column
  • Uniprot accession codes are in the 1st column

Example of id mapping file: Q6GZX4 001R_FRG3G 2947773 YP_031579.1 81941549; 47060116; 49237298 GO:0006355; GO:0046782; GO:0006351 UniRef100_Q6GZX4 UniRef90_Q6GZX4 UniRef50_Q6GZX4 UPI00003B0FD4 654924 15165820 AY548484 AAT09660.1 Q6GZX3 002L_FRG3G 2947774 YP_031580.1 49237299; 47060117; 81941548 GO:0033644; GO:0016021 UniRef100_Q6GZX3 UniRef90_Q6GZX3 UniRef50_Q6GZX3 UPI00003B0FD5 654924 15165820 AY548484 AAT09661.1 Q197F8 002R_IIV3 4156251 YP_654574.1 106073503; 109287880; 123808694 UniRef100_Q197F8 UniRef90_Q197F8 UniRef50_Q197F8

Multiple sequence alignments

Dataset creation

The datasets were created from the Blast output.

For creating datasets with low, high and whole range sequence identity, the following Python script was used:


<source lang="python"> Use blast output to create datasets with different sequence identities.

Call with: python create_dataset.py <query fasta file> <blast xml output> <database fasta file>

@author: Laura Schiller

import sys from Bio import SeqIO, pairwise2 from Bio.Blast import NCBIXML from Bio.SubsMat import MatrixInfo

def get_sequences(query, blast_xml, db_path, out_file):

   
   Fetch full length sequences for a BLAST result and write them in a FASTA file.
   
   @param query:     fasta file with query sequence.
   @param blast_xml: xml file with BLAST search result.
   @param db_path:   path of db fasta file.
   @param out_file:  file to store the sequences.
   @return:          name of the file where the sequences are stored.
   
   
   print("get all sequences for blast result in %s" % blast_xml)
   query_sequence = SeqIO.read(query, "fasta")
   hit_seqs = [query_sequence]
   
   blast_output = open(blast_xml)
   blast_result = NCBIXML.read(blast_output)
   blast_output.close()
   
   hit_list = []
   for alignment in blast_result.alignments:
       hit_list.append(alignment.title.split(" ")[1]) # the id of the sequence    
   
   # get all blast hits
   seqs_db = SeqIO.parse(db_path, "fasta")
   counter = 0
   number = len(blast_result.alignments)
   for seq in seqs_db:
       if seq.id in hit_list:
           hit_seqs.append(seq)
           counter = counter + 1
           if (counter % 100) == 0:
               print("%d of %d sequences found" % (counter, number))
           if counter == number:
               break
   print("%d of %d sequences found" % (counter, len(blast_result.alignments)))        
   
   SeqIO.write(hit_seqs, out_file, "fasta")
   
   print("sequences saved in %s" % out_file)
   return out_file

def filter_seqs(seq_file, name):

   
   Filter sequences according to sequence identity limits.
   
   @param seq_file: fasta file with sequences.
   @param name:     string used for output file names.
   
   
   sequences = SeqIO.parse(seq_file, "fasta")
   seqs = []
   for seq in sequences:
       seqs.append(seq)
   query = seqs.pop(0) # always keep query (first sequence)
   # lists for low / high / whole range sequence identity
   filtered = [[query], [query], [query]]
   
   # identify hits with pdb structures -> these are preferentially taken
   hits_pdb = [hit for hit in seqs if (hit.id.split("|")[0] == "pdb")]
   seqs = [seq for seq in seqs if not (seq.id in [pdb_seq.id for pdb_seq in hits_pdb])]
   hits_pdb.extend(seqs) # now pdb hits are at the beginning
   
   print("filter sequences")
   for seq in hits_pdb:           
       try: 
           ident = identity(query, seq) 
       except KeyError: # raises if there is a non amino acid letter
           continue
       if (len(filtered[0]) < 10):
           keep = True            
           for seq2 in filtered[0]:
               try: 
                   ident2 = identity(seq, seq2) 
               except KeyError:
                   keep = False
                   continue
               if ident2 >= 0.3:
                   keep = False
                   break
           if keep:
               filtered[0].append(seq)
       if (len(filtered[1]) < 10) and (ident > 0.6):
           filtered[1].append(seq)
       if (len(filtered[2]) < 8) and (ident >= 0.3) and (ident <= 0.6):
           filtered[2].append(seq)
       if (len(filtered[0]) == 10) and (len(filtered[1]) == 10) and (len(filtered[2]) == 8):
           break
   
   # for whole range take a part of low and a part of high sequence identity
   # plus the rest middle sequence identity
   filtered[2].extend(filtered[0][1:min(len(filtered[0]), 7)])
   filtered[2].extend(filtered[1][1:min(len(filtered[1]), 7)])
   
   SeqIO.write(filtered[0], name + "_low_seq_ident.fasta", "fasta")
   SeqIO.write(filtered[1], name + "_high_seq_ident.fasta", "fasta")
   SeqIO.write(filtered[2], name + "_whole_range_seq_ident.fasta", "fasta")
   
   print("sequences with low / high / whole range sequence identity saved in:")
   print(name + "_low_seq_ident.fasta")
   print(name + "_high_seq_ident.fasta")
   print(name + "_whole_range_seq_ident.fasta") 

def identity(seq1, seq2):

   
   Calculate relative sequence identity of two sequences.
   
   @return: number of identical residues divided by mean length.
   
   #pairwise alignment
   matrix = MatrixInfo.blosum62
   gap_open = -10
   gap_extend = -0.5     
   alignment = pairwise2.align.globalds(seq1, seq2, matrix, gap_open, gap_extend)
   seq1_aligned = alignment[0][0]
   seq2_aligned = alignment[0][1]
   #sequence identity
   ident = sum(c1 == c2 for c1, c2 in zip(seq1_aligned, seq2_aligned))
   ref_length = (len(seq1) + len(seq2)) / 2 # mean length
   return float(ident) / ref_length


if __name__ == '__main__':

   namestring = sys.argv[1].split("/")[-1].split(".")[0] # used as beginning of the output files
   print("-----------------------------------------------------------")
   all_seqs = get_sequences(sys.argv[1], sys.argv[2], sys.argv[3], namestring + "_all_sequences.fasta")
   filter_seqs(all_seqs, namestring)

</source>


Note: for DBT there were some problems aligning very long sequences to calculate mutual sequence identities. So there, for the low sequence identity group, only sequences that have less than 6 times the length of the query sequence were taken.

Calling MSA programs

Call of T-Coffee:

#!/bin/bash

proteins=( BCKDHA BCKDHB DBT DLD )
identities=( low high whole_range )

for protein in ${proteins[*]}
do
	for identity in ${identities[*]}
	do
		t_coffee -output fasta -infile ${protein}_${identity}_seq_ident.fasta -outfile ${protein}_${identity}_seq_ident_tcoffee.fasta 
	done
done

Muscle:

#!/bin/bash

proteins=( BCKDHA BCKDHB DBT DLD )
identities=( low high whole_range )

for protein in ${proteins[*]}
do
	for identity in ${identities[*]}
	do
		muscle -in ${protein}_${identity}_seq_ident.fasta -out ${protein}_${identity}_seq_ident_muscle.fasta 
	done
done

Mafft:

#!/bin/bash

proteins=( BCKDHA BCKDHB DBT DLD )
identities=( low high whole_range )

for protein in ${proteins[*]}
do
	for identity in ${identities[*]}
	do
		mafft ${protein}_${identity}_seq_ident.fasta > ${protein}_${identity}_seq_ident_mafft.fasta  
	done
done