Gaucher Disease: Task 02 - Lab Journal

Sequence Searches

The scripts written for this task can be found in /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1, the created outputs in /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/output/task1 and all intermediate and resulting files in /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/results/task1.

Runs

The script run.pl was written and used for the runs (using the bash script run.sh). PSSM files - a3m and hhr for HHblits, chk ("checkpoint") and PSSM for Psi-BLAST were created in order to start the search against another database, from big_80 to big for Psi-BLAST and later against a PDB database for the evaluation. Note: in Psi-BLAST in order to create a checkpoint file, one iteration more must be done, as the checkpoint file is created at the beginning of a iteration, not at its end. For example, to resume 9 iterations against big_80 with 1 iteration against big and then with 1 iteration against pdb_seqres the following calls were made (from ~/master_practical/Assignment2_Alignments/output/task1):

../../scripts/task1/run.pl --in ../../gaucher_data/P04062.fasta --out_p psiblast-10iter-big_80 --iter_p 10 --db_b big_80 --c psiblast-9iter-big_80.chk

../../scripts/task1/run.pl --in ../../gaucher_data/P04062.fasta --out_p psiblast-9iter-big_80--2iter_big --iter_p 2 --db_b big 
--r psiblast-9iter-big_80.chk --c psiblast-9iter-big_80--1iter_big.chk

../../scripts/task1/run.pl --in ../../gaucher_data/P04062.fasta --out_p psiblast-9iter-big_80--1iter_big--1iter_pdb --db_b pdb_seqres --iter_p 1 
--r psiblast-9iter-big_80--1iter_big.chk

Databases

We have used the following databases versions:

Database	Location	Version
big & big_80	/mnt/project/rost_db/data/big/big[_80]	before 30 Apr 2013?
uniprot_20	/mnt/project/rost_db/data/hhblits/uniprot_20	02 Sep 2011
pdb_seqres	/mnt/home/rost/kloppmann/data/blast_db/pdb_seqres	3 May 2013
pdb_full	/mnt/project/rost_db/data/hhblits/pdb_full	before 30 Apr 2013?
COPS	/mnt/project/pracstrucfunc13/data/COPS/COPS-ChainHierarchy.txt	27 Apr 2012

Comparison

The results were parsed and analysed using the script parse_output.pl. The distributions of E-value, sequence identity and alignment length were plotted using the R-scripts hist_eval_iden_len.r and hist_eval_iden_len-with_blast.r

Evaluation

Method

The quality of hits retrieved by HHblits, PSI-BLAST and BLAST was evaluated using the COPS L30/L40/L60 structural similarity groups as standard of truth. Protein chains with same L30/L40/L60 group have 30/40/60% structural similarity, respectively. For example, when using the L30 group as standard of truth, the hit is a true positive (TP) if it has the same L30 group as the query, otherwise it is a false positive (FP).
The query structure 1OGS chain A of glucocerebrosidase was used for comparison with the following COPS groups:
- L30 of query: 3zr6_A
- L40 of query: 3ii1_A
- L60 of query: 2v3e_B
The total number of chains in COPS with the same L30/L40/L60 group as the query is calculated and stored:
- L30: 2058
- L40: 1126
- L60: 80
Parsed results of the runs, also with the script parse_output.pl (using the bash script parse_output.sh), include the following information for each hit (as mentioned, in case of more then one alignment of the query and the same found sequence, only the one with the lowest E-value was used): E-value, whether the hit is a TP or a FP, query ID and hit ID.
Then the evaluation routine tpr_precision.pl was applied (using the bash script tpr_precision.sh). In the evaluation routine, each file is first sorted according to ascending E-value. Afterwards, the following rates are calculated for each line (from the first to the current line): TPR (sensitivity) and precision. Each rate is calculated specifically for each query seen so far, and then the average is taken (sum of all rates for each query divided by the number of the queries seen so far). The rates are defined as follows:

TPR = TP/(TP+FN)
precision = TP/(TP+FP)

where:

TP = the number of true positive hits, namely all found hits with the same COPS L30/L40/L60 group as the query
FP = the number of false positive hits, that is with different groups than the query
FN = the number of false negatives, which are not detected chains with the same group as the query
TP+FN = the number of all positives, i.e. the number of all known PDB chains with the same group as the query
TP+FP = the number of all found hits

Multiple Sequence Alignment

The multiple sequence alignments (MSA) were generated on the following webservers and visualized with jalView: ClustalW : version 1.83 Muscle (on EMBL-EBI) : version 3.8.31

Gaucher Disease: Task 02 - Lab Journal

Contents

Sequence Searches

Runs

Databases

Comparison

Evaluation

Method

Multiple Sequence Alignment

Navigation menu

Views

Personal tools

Bioinformatik navigation

MediaWiki navigation

Search

Tools