Lab Journal of Task 3 (MSUD)
For task 3 we have used the reference sequence of BCKDHA and other given example proteins.
Contents
Secondary structure
Prediction and assignment
- PSSMs were created with Psi-Blast:
blastpgp -d /mnt/project/pracstrucfunc13/data/big/big_80 -i P10775.fasta -j 2 -h 10e-10 -Q P10775_big80.blastPsiMat
blastpgp -d /mnt/project/pracstrucfunc13/data/swissprot/uniprot_sprot -i P10775.fasta -j 2 -h 10e-10 -Q P10775_SwissProt.blastPsiMat
- ReProf was run for P10775 with a simple fasta file and with a PSSM (generated with big_80 and SwissProt, respectively) as input:
reprof -i P10775.fasta
reprof -i P10775_big80.blastPsiMat
reprof -i P10775_SwissProt.blastPsiMat
- For the remaining proteins (P12694, Q08209 and Q9X0E6) ReProf was run with a big_80 PSSM analogously as shown above.
- PsiPred and DSSP were run on the following servers: PsiPred, DSSP_server.
- The PDB files used as input for DSSP are located at
/mnt/home/student/schillerl/MasterPractical/task3/pdb_structures/
. If there were more than one PDB structures available those which the highest coverage over the sequence and with the highest resolution were taken preferentially. These structures were used: 2BFD (P12694), 2BNH (P10775), 1AUI (Q08209) and 1KR4 (Q9X0E6).
- To parse the output of ReProf and PsiPred, we used Sonja's script. For parsing the DSSP output, we used the following Python script (which is located at
/mnt/home/student/schillerl/MasterPractical/task3/dssp/parse_dssp_output.py
):
<source lang=python>
Parse DSSP output to a sequence of H, E and L (secondary structure). If there is more
than one sequence in the DSSP output, only the first sequence is parsed.
Call with python parse_dssp_putput.py <dssp output file> <sequence length> <output file>
@author: Laura Schiller
import sys
seq_len = int(sys.argv[2]) dssp_file = open(sys.argv[1]) out_file = open(sys.argv[3], "w")
- translation of DSSP secondary structure to H, E and L
secstr = { 'H': "H",
'G': "H", 'I': "H", 'E': "E", 'B': "E", 'S': "L", 'T': "L", ' ': "L" }
line = dssp_file.readline() while not line.startswith(" # RESIDUE"):
line = dssp_file.readline()
line = dssp_file.readline()
current_position = 1 while (line):
try: position = int(line[7:10]) except ValueError: if(line[13:15] == "!*"): # new amino acid chain break else: line = dssp_file.readline() continue structure = secstr[line[16:17]] if(current_position != position): for i in range(position - current_position): out_file.write("-") # no DSSP output for that residue current_position += 1 out_file.write(structure) current_position += 1 line = dssp_file.readline()
for i in range(seq_len - current_position + 1):
out_file.write("-")
dssp_file.close() out_file.close() </source>
Note: for the proteins P12694 and Q9X0E6 the residue numbering in the DSSP output did not agree with the fasta sequences, so the output of the above script for these proteins was manually modified to fit the fasta sequences (for P12694 the beginning of the sequence was missing in the DSSP output, so there was added sufficiently many '-' for this part; for Q9X0E6 there were some additional residues in the beginning and end in the DSSP output, which are missing in the fasta sequence, so these parts were deleted). The reason for these discrepancies may be, that the PDB files that were used as input for DSSP do not exactly match the fasta sequences from UniProt.
The results for ReProf and PsiPred predictions and the DSSP assignments are in the following folders:
/mnt/home/student/schillerl/MasterPractical/task3/reprof/
/mnt/home/student/schillerl/MasterPractical/task3/psipred/
/mnt/home/student/schillerl/MasterPractical/task3/dssp/
Position specific scoring matrices (PSSM) used as input for ReProt are located at:
/mnt/home/student/schillerl/MasterPractical/task3/pssm/
Evaluation of prediction approaches
- The ReProf predictions for P10775 (derived with different approaches, see above) were compared with the DSSP assignment by using the following Python script (located at
/mnt/home/student/schillerl/MasterPractical/task3/evaluate_secstr_reprof.py
), which calculates the recall, precision and f-measure of the predictions. Positions that lack a DSSP assignment (parsed as '-' by the above script) were ignored for the calculation.
Recall and Precision are defined as follows:
- recall = TP / (TP + FN)
- precision = TP / (TP + FP)
- f-measure = 2 * recall * precision / (recall + precision)
where TP means true positive, FP false positive and FN false negative.
<source lang="python">
dssp_file = open("./dssp/P10775_secstr.txt")
dssp = dssp_file.readline()
dssp_file.close()
for reprof_run in ["./reprof/P10775_secstr.txt", "./reprof/P10775_big80_secstr.txt", "./reprof/P10775_SwissProt_secstr.txt"]: reprof_file = open(reprof_run) reprof = reprof_file.readline() reprof_file.close()
assert len(dssp) == len(reprof)
sum1 = {'E': 0, 'H': 0, 'L': 0, 'all': 0} sum2 = {'E': 0, 'H': 0, 'L': 0, 'all': 0} found = {'E': 0, 'H': 0, 'L': 0, 'all': 0} right = {'E': 0, 'H': 0, 'L': 0, 'all': 0}
for i in range(len(dssp)): for secstr in ['E', 'H', 'L']: if dssp[i] == secstr: sum1[secstr] += 1 if reprof[i] == secstr: found[secstr] += 1 if reprof[i] == secstr: sum2[secstr] += 1 if dssp[i] == secstr: right[secstr] += 1
for sum in [sum1, sum2, found, right]: sum['all'] = sum['E'] + sum['H'] + sum['L']
recall = {'E': 0.0, 'H': 0.0, 'L': 0.0, 'all': 0} precision = {'E': 0.0, 'H': 0.0, 'L': 0.0, 'all': 0}
print "-----------" print "%s:" % reprof_run print "-----------"
for secstr in ['H', 'E', 'L', 'all']: recall[secstr] = (float(found[secstr]) / sum1[secstr]) print "Recall for %s: %f" % (secstr, recall[secstr]) precision[secstr] = (float(right[secstr]) / sum2[secstr]) print "Precision for %s: %f" % (secstr, precision[secstr]) print "F-measure for %s: %f" % (secstr, (2 * precision[secstr] * recall[secstr] / (precision[secstr] + recall[secstr]))) </source>
Comparison of ReProf to PsiPred and DSSP
The Reprof results for the four proteins were compared with the PsiPred predictions and DSSP assignments with the following Python script (located at/mnt/home/student/schillerl/MasterPractical/task3/cmp_secstr_reprof_psipred_dssp.py
), which calculates the percent identities (number of matching residues devided by all residues; for single secondary structure elements the matching number was divided by the maximum number of appearances of this element in one of the results; positions with no DSSP assignment were excluded from the comparison with DSSP):
<source lang=python>
for protein in ["P12694", "P10775", "Q08209", "Q9X0E6"]:
reprof_file = open("./reprof/" + protein + "_big80_secstr.txt") reprof = reprof_file.readline() reprof_file.close()
psipred_file = open("./psipred/" + protein + "_secstr.txt") psipred = psipred_file.readline() psipred_file.close()
dssp_file = open("./dssp/" + protein + "_secstr.txt") dssp = dssp_file.readline() dssp_file.close()
assert len(reprof) == len(dssp) == len(psipred)
ident_psipred = {'E': 0, 'H': 0, 'L': 0} ident_dssp = {'E': 0, 'H': 0, 'L': 0}
for i in range(len(reprof)): if reprof[i] == psipred[i]: if reprof[i] == "E": ident_psipred['E'] += 1 elif reprof[i] == "H": ident_psipred['H'] += 1 elif reprof[i] == "L": ident_psipred['L'] += 1 if reprof[i] == dssp[i]: if reprof[i] == "E": ident_dssp['E'] += 1 elif reprof[i] == "H": ident_dssp['H'] += 1 elif reprof[i] == "L": ident_dssp['L'] += 1
print "-----------" print "%s:" % protein print "-----------" print "percent identity to psipred:" print "H %.3f" % (float(ident_psipred['H']) / max(sum(c == 'H' for c in reprof), sum(c == 'H' for c in psipred))) print "E %.3f" % (float(ident_psipred['E']) / max(sum(c == 'E' for c in reprof), sum(c == 'E' for c in psipred))) print "L %.3f" % (float(ident_psipred['L']) / max(sum(c == 'L' for c in reprof), sum(c == 'L' for c in psipred))) print "all %.3f" % (float(ident_psipred['H'] + ident_psipred['E'] + ident_psipred['L']) / len(reprof)) print "percent identity to dssp:" print "H %.3f" % (float(ident_dssp['H']) / max(sum((reprof[i] == 'H') and (dssp[i] != '-') for i in range(len(reprof))), sum(c == 'H' for c in dssp))) print "E %.3f" % (float(ident_dssp['E']) / max(sum((reprof[i] == 'E') and (dssp[i] != '-') for i in range(len(reprof))), sum(c == 'E' for c in dssp))) print "L %.3f" % (float(ident_dssp['L']) / max(sum((reprof[i] == 'L') and (dssp[i] != '-') for i in range(len(reprof))), sum(c == 'L' for c in dssp))) print "all %.3f" % (float(ident_dssp['H'] + ident_dssp['E'] + ident_dssp['L']) / sum(c != '-' for c in dssp)) </source>
Disordered protein
IUPred
- Predictions were performed through the web server of IUPred. Graphical profiles of the results were downloaded.
- Output of IUPred are stored in the directory /mnt/home/student/weish/master-practical-2013/task03/02-disordered-protein/iupred
- We have also performed the prediction from command-line, following is the bash script:
<source lang="bash">
- !/bin/sh -e
INPUT=$HOME/master-practical-2013/task03 OUTPUT=$HOME/master-practical-2013/task03/02-disordered-protein/iupred PARAMS="long short glob"
if [ ! -d $OUTPUT ]; then
mkdir $OUTPUT
fi
for seq in $INPUT/*.fasta do
filename=`basename $seq` for param in $PARAMS do iupred $seq $param > $OUTPUT/iupred_${filename}_$param.tsv done
done </source>
Statistics over IUPred results were performed using following R script: <source lang="lisp">
- evaluate IUPred
library(ggplot2)
proteins <- c('BCKDHA', 'P10775', 'Q9X0E6', 'Q08209')
- profile.names <- c('long', 'short')
data.path <- file.path('02-disordered-protein', 'iupred')
loadProfiles <- function(proteins, data.path) {
profiles <- list(); for (protein in proteins) { long.file <- file.path( data.path, paste('iupred_', protein, '.fasta_long.tsv', sep=) ) short.file <- file.path( data.path, paste('iupred_', protein, '.fasta_short.tsv', sep=) ) long <- read.table(file=long.file, header=FALSE, comment.char='#') names(long) <- c('pos', 'aa', 'score') long$type <- rep(x='long', times=length(long$pos)) short <- read.table(file=short.file, header=FALSE, comment.char='#') names(short) <- c('pos', 'aa', 'score') short$type <- rep(x='short', times=length(short$pos)) #append short disorders to the table long[1:length(short$aa) + length(long$aa), ] <- short long$protein <- protein profilesprotein <- long } return(profiles)
}
calcRMSD <- function(a, b) {
diff <- a - b return( sqrt( sum(diff**2) / length(diff) ) )
}
combineDataset <- function(profiles) {
firstProt <- names(profiles)[1] data <- profilesfirstProt for (protein in names(profiles)) { if (protein == firstProt) { next() } subset <- profilesprotein len <- length(data$protein) data[len + 1:length(subset$protein), ] <- subset } row.names(data) <- 1:length(data$protein) return(data)
}
- Load all profiles
-
profiles <- loadProfiles(proteins, data.path)
- Correlation between profiles of short and long disorders
-
rmsd <- c() corr <- c()
for (protein in proteins) {
profile <- profilesprotein long <- profile[ profile$type == 'long', ] short <- profile[ profile$type == 'short', ] diff <- long$score - short$score #plot both Plot <- ggplot(profile, aes(x=pos, score, color=type)) Plot + geom_line(size=1) + xlab('position') + ggtitle(paste('Profile of disorder scores (', protein , ')', sep=)) ggsave(filename=paste('profile', protein, 'both.png', sep='_'), width=280, height=99, units='mm') #plot diff Plot <- ggplot(data.frame(diff,pos=1:length(diff)), aes(x=pos, y=diff)) Plot + geom_line(size=1) + xlab('position') + ylab('difference') + ggtitle(paste('Profile difference between long and short disorder scores (', protein , ')', sep=)) ggsave(filename=paste('profile', protein, 'diff.png', sep='_'), width=280, height=99, units='mm') #root mean standard deviation rmsd <- c(rmsd, calcRMSD(long$score, short$score)) corr <- c(corr, cor(long$score, short$score))
}
- Barplot of root mean standard deviation
-
Plot <- ggplot(
data.frame(rmsd, protein=proteins), aes(protein,rmsd, fill=protein))
Plot + geom_bar(stat="identity") + ylab('RMSD') +
geom_text(aes(label=sprintf("%.4f", rmsd),y = rmsd+0.015, x = protein),position = position_dodge(width=0.9))+ ggtitle('Root mean standard deviation between long and short disorder scores')
ggsave(filename='RMSD_all.png')
- Barplot of pearson correlation coefficient
-
Plot <- ggplot(
data.frame(corr, protein=proteins), aes(protein,corr, fill=protein))
Plot + geom_bar(stat="identity") + ylab('correlation') +
geom_text(aes(label=sprintf("%.4f", corr),y = corr+0.015, x = protein),position = position_dodge(width=0.9)) + ggtitle('Correlation between long and short disorder scores')
ggsave('Correlation_all.png')
- Boxplot of scores
-
dataset <- combineDataset(profiles) Plot <- ggplot(dataset, aes(protein, score, fill=type)) Plot + geom_boxplot() + ggtitle('Boxplot of disorder scores') ggsave('Boxplot_all.png') </source>
MetaDisorder(MD)
- As the man page of metadisorder describes, the prediction of disordered region is based on the results of other programs such as NORSnet, PROFbval etc. Rather than directly call metadisorder we have used the wrapper program predictprotein as is described on the exercise page.
- Comparison to DisProt database:
- 3 sequences do not have entries in DisProt
- Local alignments were performed to compare the assignment of disordered regions between DisProt and metadisorder.
Following script was called for the task:
<source lang="bash">
- !/bin/sh -e
INPUT=$HOME/master-practical-2013/task03 OUTPUT=$HOME/master-practical-2013/task03/metadisorder EXE=predictprotein
- make output directory
if [ ! -d $OUTPUT ]; then
mkdir $OUTPUT
fi
- call metadisorder for all query sequences
for seq in $INPUT/*.fasta do
filename=`basename $seq` $EXE --seqfile $seq --target metadisorder -p metadisorder_$filename \ -o $OUTPUT
done echo Done! </source>
Following R script was used to plot the profiles predicted by MetaDisorder: <source lang="lisp">
- evaluate metadisorder
library(ggplot2)
data.path <- './02-disordered-protein/metadisorder' proteins <- c('BCKDHA', 'P10775', 'Q9X0E6', 'Q08209') protein.length <- c(445, 456, 101, 521)
- Load metadisorder results
data <- list() for (i in 1:length(proteins)) {
protein <- proteins[i] len <- protein.length[i] file <- file.path( data.path, paste('metadisorder_', protein, '.fasta.mdisorder', sep=)) dataprotein <- read.table(file, header=TRUE, nrows=len)
}
- Plot profile
getGlobularRanges <- function(profile) {
#binary class: 1 for globular, 0 for disordered classes <- 2 - unclass(profile$MD2st) xmin <- c() xmax <- c() curStart <- 0 for (idx in 1:length(profile$Number)) { #start position of globular region if (classes[idx] == 1 && curStart == 0) { curStart = idx; xmin <- c(xmin, idx - 1) next() } #end position of globular region if (classes[idx] == 0 && curStart != 0) { curStart = 0; xmax <- c(xmax, idx - 1) } } if (length(xmax) < length(xmin)) { xmax <- c(xmax, idx) } return(data.frame(xmin, xmax))
}
for (protein in proteins) {
profile <- dataprotein ranges <- getGlobularRanges(profile) ggplot() + geom_rect(data=ranges, aes(xmin=xmin,xmax=xmax,ymin=0.6,ymax=0.7), fill='blue', alpha=.6) + geom_line(data=profile, aes(Number, MD_raw), size=1) + geom_vline(xintercept = c(0, length(profile$Number)), colour="blue", linetype = "longdash")+ xlab('raw MD score') + ylab('position') + ggtitle(paste('Disorder profile and structural regions (', protein, ')', sep=)) ggsave(paste('metadisorder_profile_', protein, '.png', sep=), width=280, height=99, units='mm')
} </source>
Transmembrane helices
Polyphobius
- Preparation of input data of Polyphobius:
- Pipeline for data generation:
- search for homologous sequences (blastget)
- moving query sequence to top of sequence list
- perform multiple sequence alignment (kalign)
- run polyphobius (jphobius)
- Pipeline for data generation:
- For each of the 4 sequences, we have also used the web-tool of Polyphobius to perform predictions and get resulted polts.
Following Perl script was used for the automatic running of Polyphobius: polyphobius.pl
For each given example sequences and the reference sequence of BCKDHA, following shell commands were used: <source lang="bash"> cd /mnt/home/student/weish/master-practical-2013/task03/03-transmembrane-helices for q in seqs/*.fasta; do
../polyphobius.pl -q $q -o polyphobius
done </source>
MEMSAT-SVM
For each of the example proteins and reference sequence of BCKDHA we have performed prediction using the web-tool of MEMSAT-SVM.
Signal peptides
SignalP version 3.0 was run as follows:
signalp -t euk P02768.fasta > signalp/P02768.txt
signalp -t euk P47863.fasta > signalp/P47863.txt
signalp -t euk P11279.fasta > signalp/P11279.txt
(results see /mnt/home/student/schillerl/MasterPractical/task3/signalp/
)
Additionally version 4.1 was run on the SignalP server with default options.
GO terms
The methods were run with default parameters on the following servers: GOPET version 2.0, ProtFun2.0 and Pfam sequence search.