Difference between revisions of "Task 2 lab journal (MSUD)"
(→Sequence searches) |
(→Perform statistics) |
||
Line 90: | Line 90: | ||
Here is the R code: |
Here is the R code: |
||
− | <source lang=" |
+ | <source lang="scheme"> |
####################################################################### |
####################################################################### |
||
#File: evalAlign.R |
#File: evalAlign.R |
Revision as of 00:15, 6 May 2013
Contents
Sequence searches
Configure parameters
- We have used blastall for blastp and psiblast
- BLAST and PSI-BLAST
- big_80 database was used for sequence search: -d /mnt/project/pracstrucfunc13/data/big/big_80
- output put format was set to xml and tab separated values:
- XML output: -m 7
- TSV output: -m 9
- the number of hits to be shown was set to 200000: -b 200000
- For PSI-Blast number of iterations and cutoff of E-values were set with following parameters
- iteration: -j <number_of_iterations>
- E-value cutoff: -h <threshold>
- HHBlits
- uniprot20 database was used for sequence search: -d /mnt/project/pracstrucfunc13/data/hhblits/uniprot20_current
- Resulted a3m, hhm and hhr files are stored: -o result.hhr -oa3m result.a3m -ohhm result.hhm
- Output size was also set to 200000: -Z 200000 -B 200000
Data creation
For each of blast, psi-blast and hhblits, a shell script was written to perform sequence search.
- Blast: all-blast.sh run-blast.pl
- Psi-Blast: all-psiblast.sh
- HHBlits: all-hhblits.sh
Convert result of hhblits
We find result of hhblits in hhr format is not parser-friendly. So the program hhr2tsv was written. It finds out all hits and their statistics information in hhr file and write the data out to a tsv (tab-separated values) file.
SCOP classification
In order to check the quality of sequence search programs, we have used the parsable file from Structural Classification of Proteins(SCOP). The classification of domains in PDB files can be observed in file dir.cla.scop.txt_1.75.
Id mapping from Uniprot to Gene Ontology
Quality check of sequence search programs was also performed using Gene Ontology(GO) annotations. The Idmapping data was used to assign GO annotations to each Uniprot proteins (genes).
- Idmapping data in TSV format was downloaded from FTP site of Uniprot: idmapping_selected.tab.gz (large file! 1.2 GB) README of idmapping
- GO term IDs are stored in the 7th column
- Uniprot accession codes are in the 1st column
Following command was used for extraction of all Uniprot entries with GO annotation:
zcat idmapping_selected.tab.gz | cut -f1,7 | grep -vP '\s$' > uniprot_go_mapping.tsv
For efficient searching of GO terms, the mapping data was stored into a SQLite3 database. Following is the schema of the database : <source lang='sql'> CREATE TABLE uniprot_go_mapping (
uniprot_ac CHAR( 6 ) NOT NULL, go_term CHAR( 10 ) NOT NULL, PRIMARY KEY ( uniprot_ac, go_term )
); CREATE INDEX idx_uniprot_ac ON uniprot_go_mapping (uniprot_ac ASC); </source>
Following is the python script for transporting mapping data into SQLite. <source lang='python'>
- !/usr/bin/env python
import sqlite3 import math conn = sqlite3.connect('idmapping.sqlite3') c = conn.cursor()
callback function for map procedure pack key and value into one tuple def generate_key_val_pair(key, val): return (key, val)
tsv = file('/mnt/datentank/uniprot_go_reduced.map')
entriesBuffer = []
- add uniprot ac and GO term pairs into sqlite db
for lineId, line in enumerate(tsv): line = line.strip() keyval = line.split('\t') key = keyval[0] vals = keyval[1].split('; ') keys = [ key ] * len(vals) entries = map(generate_key_val_pair, keys, vals) if (len(entriesBuffer) < 1000000): entriesBuffer.extend(entries) else: c.executemany("INSERT OR IGNORE INTO uniprot_go_mapping \ (uniprot_ac,go_term) VALUES (?,?);", entriesBuffer) conn.commit() entriesBuffer[:] = [] print 'Processed %d lines' % (lineId)
tsv.close() </source>
Perform statistics
After all hhr files were converted into tsv files and preparation of SQLite3 database for id mapping, a R script was used for all the statistical tasks, including distribution of e-values and identity, intersection of hits, SCOP and GO tests.
Here is the R code:
<source lang="scheme">
- File: evalAlign.R
- Description: perform pairwise comparison of sequence search methods
library(ggplot2)
library(RSQLite)
loadHHBlitsTSV <- function(file) {
data <- read.csv(file=file, sep='\t', header=TRUE) data$eval <- data$e_value return(data)
}
loadBlastTSV <- function(file) {
data <- read.csv(file=file, sep='\t', header=FALSE, comment.char='#') names(data) <- c('query', 'hit', 'identity', 'algn_len', 'mismatch', 'gap_open', 'q_start', 'q_end', 's_start', 's_end', 'eval', 'score') return(data)
}
- get pdb hits from sequence search results
getPDBEntries <- function(data) {
d <- data[grep('pdb\\|pdb', data$hit),] d$chain <- gsub(pattern='^.*(\\w)$', replacement='\\1', d$hit) d$pdb <- gsub(pattern='^.*(\\w{4})_\\w$', replacement='\\1', d$hit) d$pdb_chain <- paste(d$pdb, d$chain, sep='_') return(d)
}
- get uniprot hits from sequence search results
getUniProtEntries <- function(data) {
d <- data[grep("^tr|sp\\|", data$hit), ] d$uniprot <- gsub(pattern="^(tr|sp)\\|(\\w*)\\|.*$", replacement="\\2", d$hit) return(d)
}
- load parsable file: dir.cla.scop.txt
loadSCOPClassification <- function(file) {
data <- read.csv(file=file, sep='\t', comment.char='#') names(data) <- c('domain', 'pdb', 'chain', 'sccs', 'sunid', 'fullstr') data$pdb_chain <- paste(data$pdb, substr(data$chain,1,1), sep='_') return(data)
}
getSCOPFold <- function(scopData) {
gsub(pattern="^(\\w\\.\\d*)\\..*$", replacement="\\1", scopData$sccs)
}
getUniprotGO <- function(data, database.path) {
conn <- dbConnect(dbDriver('SQLite'), dbname=database.path) uniprot_acs <- unique(data$uniprot) #create temperory table dbGetQuery(conn, "CREATE TEMPORARY TABLE temp_uniprot_ac(uniprot_ac CHAR(6));") dbSendPreparedQuery( conn, "INSERT INTO temp_uniprot_ac(uniprot_ac) VALUES(:uniprot_ac)", bind.data=data.frame(uniprot_ac=data$uniprot)) result <- dbGetQuery( conn, "SELECT count(t.uniprot_ac) as count, go_term as go FROM temp_uniprot_ac t,uniprot_go_mapping m WHERE t.uniprot_ac = m.uniprot_ac GROUP BY go_term ORDER BY count") dbSendQuery(conn, "DROP TABLE temp_uniprot_ac;") dbDisconnect(conn) return(result)
}
- configure test cases for reference sequences
querys <- c('BCKDHA', 'BCKDHB', 'DBT', 'DLD') data.path <-
'/home/wei/git/MasterPractical2013/results/task02/01-seq-search/search-results'
color.palette <- 'Set2' database.path <- '/home/wei/idmapping.sqlite3' image.type <- '.png'
- load SCOP classification
scop <- loadSCOPClassification(
file='/home/wei/git/MasterPractical2013/data/SCOP/dir.cla.scop.txt_1.75')
for (query in querys) {
#configure input files input.filenames <- c(paste('blastp_', query, '.tsv', sep=), paste('psiblast-2_iterations_eval_0.002-refseq_', query, '_protein.fasta.tsv', sep=), paste('psiblast-2_iterations_eval_10e-10-refseq_', query, '_protein.fasta.tsv', sep=), paste('psiblast-10_iterations_eval_0.002-refseq_', query, '_protein.fasta.tsv', sep=), paste('psiblast-10_iterations_eval_10e-10-refseq_', query, '_protein.fasta.tsv', sep=), paste('hhblits_refseq_', query, '_protein.fasta.hhr.tsv', sep=)) data.names <- c('blast', 'psiblast(iter. 2, e-val. 0.002)', 'psiblast(iter. 2, e-val. 10e-10)', 'psiblast(iter. 10, e-val. 0.002)', 'psiblast(iter. 10, e-val. 10e-10)', 'hhblits') data.titles <- c('Blast', 'PSI-Blast[iter. 2, eval. 0.002]', 'PSI-Blast[iter. 2, eval. 10e-10]', 'PSI-Blast[iter. 10, eval. 0.002]', 'PSI-Blast[iter. 10, eval. 10e-10]', 'HHBlits') #load data data <- list() evals <- c() methods <- c() identities <- c() for (index in 1:length(input.filenames)) { data.name <- data.names[index] print(data.name) input.path <- file.path(data.path, input.filenames[index]) if (data.name != 'hhblits') { frame <- loadBlastTSV(input.path) } else { frame <- loadHHBlitsTSV(input.path) } datadata.name <- frame n <- length(frame$eval) evals <- c(evals, frame$eval) methods <- c(methods, rep(x=data.name, n)) identities <- c(identities, frame$identity) } DATA <- data.frame(evalue=evals, identity=identities, method=methods)
- evaluation: e-value distribution
PLOT <- ggplot(DATA, aes(x=evalue)) PLOT + geom_density(aes(colour=factor(method), fill=factor(method)), alpha=.7) + scale_x_log10() + scale_alpha(range=c(0, 1)) + ggtitle(paste('E-value distribtution (', query, ')', sep=)) + xlab('E-value') + ylab('Density')+ scale_colour_brewer(palette=color.palette) + scale_fill_brewer(palette=color.palette) ggsave(paste("e-value-distribution_", query, image.type, sep=), width=8.3, height=6.8, dpi=100)
- evaluation: identity distribution
PLOT <- ggplot(DATA, aes(x=identity)) PLOT + geom_density(aes(colour=factor(method), fill=factor(method)), alpha=.7) + scale_alpha(range=c(0, 1)) + ggtitle(paste('Identity distribtution (', query, ')', sep=)) + xlab('Identity') + ylab('Density')+ scale_colour_brewer(palette=color.palette) + scale_fill_brewer(palette=color.palette) ggsave(paste("identity_distribution_", query,image.type, sep=), width=8.3, height=6.8, dpi=100)
- evaluation: intersection curve
thresholds <- c(0, 1e-100, 1e-90, 1e-80, 1e-70, 1e-60, 1e-50, 1e-40, 1e-30, 1e-20, 1e-10, 1, 10) for (index1 in 1:length(data.names)) { dataset1 <- dataindex1 hits1 <- dataset1$hit meth.name <- data.names[index1] threshold <- c() quality <- c() method <- c() for (index2 in 1:length(data.names)) { if (index2 == index1) { next(); } dataset2 <- dataindex2 hits2 <- dataset2$hit curmeth <- data.names[index2] for (thr in thresholds) { intersection <- length( intersect( hits1[ dataset1$eval <= thr ], hits2[ dataset2$eval <= thr ]) ) avg.intersect <- 0.5 * ( intersection / length(hits1) + intersection / length(hits2) ) threshold <- c(threshold, thr) method <- c(method, curmeth) quality <- c(quality, avg.intersect) } } performance <- data.frame(threshold, quality, method) PLOT <- ggplot(performance, aes(x=threshold, y=quality, colour=method)) PLOT + geom_line(size=1) + geom_point() + scale_x_log10() + ggtitle( paste('Relative intersections (comparison to ',meth.name,')', sep=) ) + xlab('E-value') + ylab('intersecting results') + scale_colour_brewer(palette=color.palette) + ylim(0,1) ggsave( paste("intersection_to_", meth.name,"_", query,image.type, sep=), width=8.3, height=6.8, dpi=100) }
- evaluation: protein structure classification
for (methId in 1:length(data.names)) { method <- data.names[methId] dataset <- getPDBEntries(datamethod ) subsetSCOP <- scop[ scop$pdb_chain %in% dataset$pdb_chain, ] d <- data.frame(folds=getSCOPFold(subsetSCOP)) if (length(d$folds) == 0) { next(); } d$freq <- rep(0, length(d$folds)) ftable <- table(d$folds) for (fold in names(ftable)) { d$freq[ d$folds == fold ] <- ftable[ fold ] } PLOT <- ggplot(d, aes(x=reorder(folds, freq))) PLOT + geom_histogram() + ggtitle("Histogram of fold classes from annotated pdb hits") + xlab('fold class') ggsave( paste("SCOP_histogram_", method, "_", query, image.type, sep=), width=8.3, height=6.8, dpi=100 ); }
- evaluation: Gene Ontology
- common_gos <- list()
for (methId in 1:length(data.names)) { method <- data.names[methId] dataset <- getUniProtEntries(datamethod ) if (length(dataset$uniprot) == 0) { next(); } result <- getUniprotGO(dataset, database.path=database.path) total.count <- length(dataset$uniprot) #result$percent <- result$count / length(dataset$uniprot) result$percent <- result$count / total.count #result <- result[ result$count > 0.05 * total.count, ] result <- result[ order(result$count, decreasing=TRUE)[1:5], ] PLOT <- ggplot(result, aes(x=reorder(go, count), y=percent)) PLOT + geom_bar(stat='identity') + coord_flip() + ggtitle("Histogram of top-5 go terms from annotated uniprot hits") + xlab('GO term') + ylab('Frequency') + ylim(0,1) ggsave( paste("GO_histogram_", method, "_", query, ".png", sep=), width=8.3, height=6.8, dpi=100 );
- go_terms <- data.frame(threshold = thresholds, count=rep(0, length(thresholds)))
- for (thr in thresholds)
- {
- subset <- dataset[ dataset$eval <= thr, ]
- result <- getUniprotGO(dataset, database.path=database.path)
- go_terms$count[ go_terms$threshold == thr ] <- length(unique(result$go))
- }
- common_gos[method] <- go_terms
} save.image(file=paste('Workspace_', query, '.RData', sep=), compress="xz")
} </source>
Multiple sequence alignments
Dataset creation
The datasets were created from the Blast output.
For creating datasets with low, high and whole range sequence identity, the following Python script was used:
<source lang="python"> Use blast output to create datasets with different sequence identities.
Call with: python create_dataset.py <query fasta file> <blast xml output> <database fasta file>
@author: Laura Schiller
import sys from Bio import SeqIO, pairwise2 from Bio.Blast import NCBIXML from Bio.SubsMat import MatrixInfo
def get_sequences(query, blast_xml, db_path, out_file):
Fetch full length sequences for a BLAST result and write them in a FASTA file. @param query: fasta file with query sequence. @param blast_xml: xml file with BLAST search result. @param db_path: path of db fasta file. @param out_file: file to store the sequences. @return: name of the file where the sequences are stored. print("get all sequences for blast result in %s" % blast_xml) query_sequence = SeqIO.read(query, "fasta") hit_seqs = [query_sequence] blast_output = open(blast_xml) blast_result = NCBIXML.read(blast_output) blast_output.close() hit_list = [] for alignment in blast_result.alignments: hit_list.append(alignment.title.split(" ")[1]) # the id of the sequence # get all blast hits seqs_db = SeqIO.parse(db_path, "fasta") counter = 0 number = len(blast_result.alignments) for seq in seqs_db: if seq.id in hit_list: hit_seqs.append(seq) counter = counter + 1 if (counter % 100) == 0: print("%d of %d sequences found" % (counter, number)) if counter == number: break print("%d of %d sequences found" % (counter, len(blast_result.alignments))) SeqIO.write(hit_seqs, out_file, "fasta") print("sequences saved in %s" % out_file) return out_file
def filter_seqs(seq_file, name):
Filter sequences according to sequence identity limits. @param seq_file: fasta file with sequences. @param name: string used for output file names. sequences = SeqIO.parse(seq_file, "fasta") seqs = [] for seq in sequences: seqs.append(seq) query = seqs.pop(0) # always keep query (first sequence) # lists for low / high / whole range sequence identity filtered = [[query], [query], [query]] # identify hits with pdb structures -> these are preferentially taken hits_pdb = [hit for hit in seqs if (hit.id.split("|")[0] == "pdb")] seqs = [seq for seq in seqs if not (seq.id in [pdb_seq.id for pdb_seq in hits_pdb])] hits_pdb.extend(seqs) # now pdb hits are at the beginning print("filter sequences") for seq in hits_pdb: try: ident = identity(query, seq) except KeyError: # raises if there is a non amino acid letter continue if (len(filtered[0]) < 10): keep = True for seq2 in filtered[0]: try: ident2 = identity(seq, seq2) except KeyError: keep = False continue if ident2 >= 0.3: keep = False break if keep: filtered[0].append(seq) if (len(filtered[1]) < 10) and (ident > 0.6): filtered[1].append(seq) if (len(filtered[2]) < 8) and (ident >= 0.3) and (ident <= 0.6): filtered[2].append(seq) if (len(filtered[0]) == 10) and (len(filtered[1]) == 10) and (len(filtered[2]) == 8): break # for whole range take a part of low and a part of high sequence identity # plus the rest middle sequence identity filtered[2].extend(filtered[0][1:min(len(filtered[0]), 7)]) filtered[2].extend(filtered[1][1:min(len(filtered[1]), 7)]) SeqIO.write(filtered[0], name + "_low_seq_ident.fasta", "fasta") SeqIO.write(filtered[1], name + "_high_seq_ident.fasta", "fasta") SeqIO.write(filtered[2], name + "_whole_range_seq_ident.fasta", "fasta") print("sequences with low / high / whole range sequence identity saved in:") print(name + "_low_seq_ident.fasta") print(name + "_high_seq_ident.fasta") print(name + "_whole_range_seq_ident.fasta")
def identity(seq1, seq2):
Calculate relative sequence identity of two sequences. @return: number of identical residues divided by mean length.
#pairwise alignment matrix = MatrixInfo.blosum62 gap_open = -10 gap_extend = -0.5 alignment = pairwise2.align.globalds(seq1, seq2, matrix, gap_open, gap_extend) seq1_aligned = alignment[0][0] seq2_aligned = alignment[0][1]
#sequence identity ident = sum(c1 == c2 for c1, c2 in zip(seq1_aligned, seq2_aligned)) ref_length = (len(seq1) + len(seq2)) / 2 # mean length return float(ident) / ref_length
if __name__ == '__main__':
namestring = sys.argv[1].split("/")[-1].split(".")[0] # used as beginning of the output files print("-----------------------------------------------------------") all_seqs = get_sequences(sys.argv[1], sys.argv[2], sys.argv[3], namestring + "_all_sequences.fasta") filter_seqs(all_seqs, namestring)
</source>
Note: for DBT there were some problems aligning very long sequences to calculate mutual sequence identities. So there, for the low sequence identity group, only sequences that have less than 6 times the length of the query sequence were taken.
Calling MSA programs
Call of T-Coffee:
#!/bin/bash proteins=( BCKDHA BCKDHB DBT DLD ) identities=( low high whole_range ) for protein in ${proteins[*]} do for identity in ${identities[*]} do t_coffee -output fasta -infile ${protein}_${identity}_seq_ident.fasta -outfile ${protein}_${identity}_seq_ident_tcoffee.fasta done done
Muscle:
#!/bin/bash proteins=( BCKDHA BCKDHB DBT DLD ) identities=( low high whole_range ) for protein in ${proteins[*]} do for identity in ${identities[*]} do muscle -in ${protein}_${identity}_seq_ident.fasta -out ${protein}_${identity}_seq_ident_muscle.fasta done done
Mafft:
#!/bin/bash proteins=( BCKDHA BCKDHB DBT DLD ) identities=( low high whole_range ) for protein in ${proteins[*]} do for identity in ${identities[*]} do mafft ${protein}_${identity}_seq_ident.fasta > ${protein}_${identity}_seq_ident_mafft.fasta done done