Gaucher Disease: Task 02 - Lab Journal
From Bioinformatikpedia
Revision as of 15:36, 19 May 2013 by Kalemanovm (talk | contribs)
Sequence Searches
Runs
The script run.pl was written and used for the runs. Besides the search results, the following files were created: a3m and hhr for HHblits, and chk ("checkpoint") and PSSM for PSI-BLAST. These files make it possible to restart a search against another database: from big_80 to big for PSI-BLAST, and later against a PDB database for the evaluation.
Comparison
The results were parsed and analysed using the script parse_output.pl.
Evaluation
Method
- The quality of hits retrieved by HHblits, PSI-BLAST and BLAST was evaluated using the COPS L30/L40/L60 structural similarity groups as the standard of truth. Protein chains with the same L30/L40/L60 group have 30/40/60% structural similarity, respectively. For example, when the L30 group is used as the standard of truth, a hit is a true positive (TP) if it has the same L30 group as the query; otherwise it is a false positive (FP).
- The query structure 1OGS chain A of glucocerebrosidase was used for comparison with the following COPS groups:
- L30 of query: 3zr6_A
- L40 of query: 3ii1_A
- L60 of query: 2v3e_B
- The total number of chains in COPS with the same L30/L40/L60 group as the query is calculated and stored:
- L30: 2058
- L40: 1126
- L60: 80
- The results of the runs, parsed again with the script parse_output.pl, include the following information for each hit: E-value, whether the hit is a TP or an FP, query ID and hit ID. As mentioned, if the query was aligned more than once to the same found sequence, only the alignment with the lowest E-value was used.
- In the evaluation routine (tpr_precision.pl) each file is first sorted according to ascending E-value. Afterwards, the following rates are calculated for each line (from the first to the current line): TPR (sensitivity) and precision, which are defined as follows:
TPR = TP/(TP+FN)
precision = TP/(TP+FP)
where:
- TP = the number of true positive hits, namely all found hits with the same COPS L30/L40/L60 group as the query
- FP = the number of false positive hits, that is, found hits with a different group than the query
- FN = the number of false negatives, that is, chains with the same group as the query that were not detected
- TP+FN = the number of all positives, i.e. the number of all known PDB chains with the same group as the query
- TP+FP = the number of all found hits
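The cumulative calculation described above can be illustrated with a minimal Python sketch. This is not the tpr_precision.pl script itself, whose source is not shown here. The function names, the tuple layout (hit ID, E-value, TP flag) and the toy hit IDs are assumptions made for illustration only. The value 80 is the number of chains in the query's L60 group given above and stands for TP+FN.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (hit ID, E-value, True if the hit is a TP)
Hit = Tuple[str, float, bool]


def best_hit_per_id(hits: Iterable[Hit]) -> List[Hit]:
    """Keep only the lowest-E-value alignment for each hit ID."""
    best = {}
    for hit_id, evalue, is_tp in hits:
        if hit_id not in best or evalue < best[hit_id][1]:
            best[hit_id] = (hit_id, evalue, is_tp)
    return list(best.values())


def tpr_precision_curve(hits: Iterable[Hit],
                        total_positives: int) -> List[Tuple[float, float]]:
    """Return one (TPR, precision) pair per hit, cumulated over the
    hits sorted by ascending E-value.

    total_positives is TP+FN, i.e. the number of all known chains
    with the same COPS group as the query.
    """
    ranked = sorted(best_hit_per_id(hits), key=lambda h: h[1])
    curve = []
    tp = fp = 0
    for _hit_id, _evalue, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        tpr = tp / total_positives      # TP / (TP + FN)
        precision = tp / (tp + fp)      # TP / (TP + FP)
        curve.append((tpr, precision))
    return curve


if __name__ == "__main__":
    # Toy data with made-up hit IDs; "a_A" appears twice on purpose.
    toy_hits = [
        ("a_A", 1e-50, True),
        ("b_A", 1e-10, False),
        ("a_A", 1e-5, True),
        ("c_B", 1e-30, True),
    ]
    for tpr, precision in tpr_precision_curve(toy_hits, total_positives=80):
        print(f"TPR={tpr:.4f}  precision={precision:.4f}")
```

Because TP+FN is fixed by the COPS group size, the TPR can only grow as more hits are included, whereas the precision drops whenever a false positive is added.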