Gaucher Disease: Task 02 - Lab Journal

From Bioinformatikpedia
Revision as of 12:31, 21 May 2013 by Kalemanovm (talk | contribs) (Comparison)

Sequence Searches

The scripts written for this task can be found in /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1, the created outputs in /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/output/task1 and all intermediate and resulting files in /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/results/task1.

Runs

The script run.pl was written and used for the runs (using the bash script run.sh). PSSM files - a3m and hhr for HHblits, chk ("checkpoint") and PSSM for Psi-BLAST were created in order to start the search against another database, from big_80 to big for Psi-BLAST and later against a PDB database for the evaluation.

Databases

We have used the following databases versions:

Database Location Version
big & big_80 /mnt/project/rost_db/data/big/big[_80] before 30 Apr 2013?
uniprot_20 /mnt/project/rost_db/data/hhblits/uniprot_20 02 Sep 2011
pdb_seqres /mnt/home/rost/kloppmann/data/blast_db/pdb_seqres 3 May 2013
pdb_full /mnt/project/rost_db/data/hhblits/pdb_full before 30 Apr 2013?
COPS /mnt/project/pracstrucfunc13/data/COPS/COPS-ChainHierarchy.txt 27 Apr 2012

Comparison

The results were parsed and analysed using the script parse_output.pl. The distributions of E-value, sequence identity and alignment length were plotted using the R-scripts hist_eval_iden_len.r and hist_eval_iden_len-with_blast.r

Evaluation

Method

  • The quality of hits retrieved by HHblits, PSI-BLAST and BLAST was evaluated using the COPS L30/L40/L60 structural similarity groups as standard of truth. Protein chains with same L30/L40/L60 group have 30/40/60% structural similarity, respectively. For example, when using the L30 group as standard of truth, the hit is a true positive (TP) if it has the same L30 group as the query, otherwise it is a false positive (FP).
  • The query structure 1OGS chain A of glucocerebrosidase was used for comparison with the following COPS groups:
    • L30 of query: 3zr6_A
    • L40 of query: 3ii1_A
    • L60 of query: 2v3e_B
  • The total number of chains in COPS with the same L30/L40/L60 group as the query is calculated and stored:
    • L30: 2058
    • L40: 1126
    • L60: 80
  • Parsed results of the runs, also with the script parse_output.pl (using the bash script parse_output.sh), include the following information for each hit (as mentioned, in case of more then one alignment of the query and the same found sequence, only the one with the lowest E-value was used): E-value, whether the hit is a TP or a FP, query ID and hit ID.
  • Then the evaluation routine tpr_precision.pl was applied (using the bash script tpr_precision.sh). In the evaluation routine, each file is first sorted according to ascending E-value. Afterwards, the following rates are calculated for each line (from the first to the current line): TPR (sensitivity) and precision. Each rate is calculated specifically for each query seen so far, and then the average is taken (sum of all rates for each query divided by the number of the queries seen so far). The rates are defined as follows:

TPR = TP/(TP+FN)
precision = TP/(TP+FP)

where:

  • TP = the number of true positive hits, namely all found hits with the same COPS L30/L40/L60 group as the query
  • FP = the number of false positive hits, that is with different groups than the query
  • FN = the number of false negatives, which are not detected chains with the same group as the query
  • TP+FN = the number of all positives, i.e. the number of all known PDB chains with the same group as the query
  • TP+FP = the number of all found hits