Gaucher Disease: Task 02 - Lab Journal

From Bioinformatikpedia
Revision as of 15:36, 19 May 2013 by Kalemanovm

Sequence Searches

Runs

The script run.pl was written and used for the runs. Intermediate files (a3m and hhr for HHblits; chk ("checkpoint") and PSSM for PSI-BLAST) were created so that a search could be restarted against another database: from big_80 to big for PSI-BLAST, and later against a PDB database for the evaluation.

Comparison

The results were parsed and analysed using the script parse_output.pl.


Evaluation

Method

  • The quality of hits retrieved by HHblits, PSI-BLAST and BLAST was evaluated using the COPS L30/L40/L60 structural similarity groups as the standard of truth. Protein chains with the same L30/L40/L60 group have 30/40/60% structural similarity, respectively. For example, when the L30 group is used as the standard of truth, a hit is a true positive (TP) if it has the same L30 group as the query; otherwise it is a false positive (FP).
  • The query structure 1OGS chain A of glucocerebrosidase was used for comparison with the following COPS groups:
    • L30 of query: 3zr6_A
    • L40 of query: 3ii1_A
    • L60 of query: 2v3e_B
  • The total number of chains in COPS with the same L30/L40/L60 group as the query was calculated and stored:
    • L30: 2058
    • L40: 1126
    • L60: 80
  • The results of the runs, again parsed with the script parse_output.pl, include the following information for each hit: E-value, whether the hit is a TP or an FP, query ID and hit ID. As mentioned, when the query aligned more than once to the same found sequence, only the alignment with the lowest E-value was used.
  • The evaluation routine (tpr_precision.pl) first sorts each file by ascending E-value. It then calculates the following rates cumulatively for each line, i.e. over all hits from the first line up to the current line: TPR (sensitivity) and precision, which are defined as follows:

TPR = TP/(TP+FN)
precision = TP/(TP+FP)

where:

  • TP = the number of true positive hits, namely all found hits with the same COPS L30/L40/L60 group as the query
  • FP = the number of false positive hits, i.e. found hits with a different group than the query
  • FN = the number of false negatives, i.e. chains with the same group as the query that were not detected
  • TP+FN = the number of all positives, i.e. the number of all known PDB chains with the same group as the query
  • TP+FP = the number of all found hits
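
The evaluation steps described above (keeping the lowest E-value per hit, sorting by ascending E-value, and computing cumulative TPR and precision) can be illustrated with a minimal Python sketch. This is not the tpr_precision.pl script itself: the names Hit, best_hits and tpr_precision_curve are hypothetical, the group labels and E-values are invented toy data, and total_positives stands for TP+FN, which in the real evaluation would be the stored group size (for example 2058 for L30).

```python
from typing import Iterable, List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One parsed alignment (hypothetical record layout)."""
    hit_id: str    # ID of the found chain
    evalue: float  # E-value of the alignment
    group: str     # COPS group of the found chain at the chosen level


def best_hits(hits: Iterable[Hit]) -> List[Hit]:
    """Keep only the alignment with the lowest E-value for each hit ID."""
    best = {}
    for h in hits:
        if h.hit_id not in best or h.evalue < best[h.hit_id].evalue:
            best[h.hit_id] = h
    return list(best.values())


def tpr_precision_curve(
    hits: Iterable[Hit], query_group: str, total_positives: int
) -> List[Tuple[float, float, float]]:
    """Return (E-value, TPR, precision) after each hit, in ascending E-value order.

    total_positives is TP+FN: the number of all known chains that share
    the query's group.
    """
    tp = fp = 0
    curve = []
    for h in sorted(best_hits(hits), key=lambda x: x.evalue):
        if h.group == query_group:
            tp += 1  # same group as the query: true positive
        else:
            fp += 1  # different group: false positive
        curve.append((h.evalue, tp / total_positives, tp / (tp + fp)))
    return curve


# Toy data (invented): three distinct chains, one of them aligned twice.
toy_hits = [
    Hit("chainA", 1e-50, "g1"),
    Hit("chainB", 1e-10, "g2"),
    Hit("chainA", 1e-5, "g1"),  # second alignment of chainA, dropped
    Hit("chainC", 1e-3, "g1"),
]
curve = tpr_precision_curve(toy_hits, query_group="g1", total_positives=4)
for evalue, tpr, precision in curve:
    print(f"{evalue:.0e}\tTPR={tpr:.3f}\tprecision={precision:.3f}")
```

Because FN is never counted directly, the sketch divides by the known group size (TP+FN) to obtain the TPR, which matches the definition given above.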