Gaucher Disease: Task 02 - Lab Journal
Contents
Sequence Searches
The scripts written for this task can be found in /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1
, the created outputs in /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/output/task1
and all intermediate and resulting files in /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/results/task1
.
Runs
The script run.pl was written and used for the runs (using the bash script run.sh
). PSSM files - a3m and hhr for HHblits, chk ("checkpoint") and PSSM for Psi-BLAST were created in order to start the search against another database, from big_80 to big for Psi-BLAST and later against a PDB database for the evaluation.
Note: in Psi-BLAST in order to create a checkpoint file, one iteration more must be done, as the checkpoint file is created at the beginning of a iteration, not at its end. For example, to resume 9 iterations against big_80 with 1 iteration against big and then with 1 iteration against pdb_seqres the following calls were made (from ~/master_practical/Assignment2_Alignments/output/task1
):
../../scripts/task1/run.pl --in ../../gaucher_data/P04062.fasta --out_p psiblast-10iter-big_80 --iter_p 10 --db_b big_80 --c psiblast-9iter-big_80.chk
../../scripts/task1/run.pl --in ../../gaucher_data/P04062.fasta --out_p psiblast-9iter-big_80--2iter_big --iter_p 2 --db_b big --r psiblast-9iter-big_80.chk --c psiblast-9iter-big_80--1iter_big.chk
../../scripts/task1/run.pl --in ../../gaucher_data/P04062.fasta --out_p psiblast-9iter-big_80--1iter_big--1iter_pdb --db_b pdb_seqres --iter_p 1 --r psiblast-9iter-big_80--1iter_big.chk
Databases
We have used the following databases versions:
Database | Location | Version |
---|---|---|
big & big_80 | /mnt/project/rost_db/data/big/big[_80] | before 30 Apr 2013? |
uniprot_20 | /mnt/project/rost_db/data/hhblits/uniprot_20 | 02 Sep 2011 |
pdb_seqres | /mnt/home/rost/kloppmann/data/blast_db/pdb_seqres | 3 May 2013 |
pdb_full | /mnt/project/rost_db/data/hhblits/pdb_full | before 30 Apr 2013? |
COPS | /mnt/project/pracstrucfunc13/data/COPS/COPS-ChainHierarchy.txt | 27 Apr 2012 |
Comparison
The results were parsed and analysed using the script parse_output.pl.
The distributions of E-value, sequence identity and alignment length were plotted using the R-scripts hist_eval_iden_len.r
and hist_eval_iden_len-with_blast.r
Evaluation
Method
- The quality of hits retrieved by HHblits, PSI-BLAST and BLAST was evaluated using the COPS L30/L40/L60 structural similarity groups as standard of truth. Protein chains with same L30/L40/L60 group have 30/40/60% structural similarity, respectively. For example, when using the L30 group as standard of truth, the hit is a true positive (TP) if it has the same L30 group as the query, otherwise it is a false positive (FP).
- The query structure 1OGS chain A of glucocerebrosidase was used for comparison with the following COPS groups:
- L30 of query: 3zr6_A
- L40 of query: 3ii1_A
- L60 of query: 2v3e_B
- The total number of chains in COPS with the same L30/L40/L60 group as the query is calculated and stored:
- L30: 2058
- L40: 1126
- L60: 80
- Parsed results of the runs, also with the script parse_output.pl (using the bash script
parse_output.sh
), include the following information for each hit (as mentioned, in case of more then one alignment of the query and the same found sequence, only the one with the lowest E-value was used): E-value, whether the hit is a TP or a FP, query ID and hit ID. - Then the evaluation routine tpr_precision.pl was applied (using the bash script
tpr_precision.sh
). In the evaluation routine, each file is first sorted according to ascending E-value. Afterwards, the following rates are calculated for each line (from the first to the current line): TPR (sensitivity) and precision. Each rate is calculated specifically for each query seen so far, and then the average is taken (sum of all rates for each query divided by the number of the queries seen so far). The rates are defined as follows:
TPR = TP/(TP+FN)
precision = TP/(TP+FP)
where:
- TP = the number of true positive hits, namely all found hits with the same COPS L30/L40/L60 group as the query
- FP = the number of false positive hits, that is with different groups than the query
- FN = the number of false negatives, which are not detected chains with the same group as the query
- TP+FN = the number of all positives, i.e. the number of all known PDB chains with the same group as the query
- TP+FP = the number of all found hits
Multiple Sequence Alignment
The multiple sequence alignments (MSA) were generated on the following webservers and visualized with jalView: ClustalW : version 1.83 Muscle (on EMBL-EBI) : version 3.8.31