Difference between revisions of "Gaucher Disease: Task 02 - Lab Journal"
From Bioinformatikpedia
Kalemanovm (talk | contribs) |
Revision as of 19:04, 19 May 2013
Sequence Searches
Runs
The script run.pl was written and used for the runs. Profile files (a3m and hhr for HHblits; chk ("checkpoint") and PSSM for PSI-BLAST) were created so that a search could be restarted against another database: from big_80 to big for PSI-BLAST, and later against a PDB database for the evaluation.
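The two-stage idea above (build a profile on one database, then restart the search from it on another) can be sketched as a set of command lines. This is only an illustration in Python, not the content of run.pl: it assumes the HH-suite `hhblits` and BLAST+ `psiblast` command-line tools (run.pl may have used other tools or options, such as the legacy blastpgp), and the query file name, output names and iteration count are hypothetical placeholders. The database paths are the ones listed in the Databases table.

```python
# Sketch only: builds command lines, does not run them.
# File names and the iteration count are hypothetical placeholders.

BIG_80 = "/mnt/project/rost_db/data/big/big_80"
BIG = "/mnt/project/rost_db/data/big/big"
UNIPROT_20 = "/mnt/project/rost_db/data/hhblits/uniprot_20"


def hhblits_cmd(query, db, a3m_out, hhr_out, iterations=2):
    """HHblits search that writes the alignment (a3m) and the result file (hhr)."""
    return ["hhblits", "-i", query, "-d", db,
            "-oa3m", a3m_out, "-o", hhr_out, "-n", str(iterations)]


def psiblast_profile_cmd(query, db, chk_out, pssm_out, iterations=2):
    """PSI-BLAST run that saves a checkpoint and an ASCII PSSM."""
    return ["psiblast", "-query", query, "-db", db,
            "-num_iterations", str(iterations),
            "-out_pssm", chk_out, "-out_ascii_pssm", pssm_out]


def psiblast_restart_cmd(chk_in, db, out):
    """Restart PSI-BLAST from a saved checkpoint against another database."""
    # -in_pssm replaces -query; the two options cannot be combined.
    return ["psiblast", "-in_pssm", chk_in, "-db", db, "-out", out]


if __name__ == "__main__":
    print(" ".join(hhblits_cmd("query.fasta", UNIPROT_20, "query.a3m", "query.hhr")))
    print(" ".join(psiblast_profile_cmd("query.fasta", BIG_80, "query.chk", "query.pssm")))
    print(" ".join(psiblast_restart_cmd("query.chk", BIG, "query_big.out")))
```

Keeping the checkpoint from the big_80 run means the second search starts from the profile already built, instead of from the single query sequence.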
Databases
We used the following database versions:
Database | Location | Version |
---|---|---|
uniprot_20 | /mnt/project/rost_db/data/hhblits/uniprot_20 | 02 Sep 2011 |
big & big_80 | /mnt/project/rost_db/data/big/big[_80] | before 30 Apr 2013? |
pdb_seqres | /mnt/home/rost/kloppmann/data/blast_db/pdb_seqres | 3 May 2013 |
pdb_full | /mnt/project/rost_db/data/hhblits/pdb_full | before 30 Apr 2013? |
COPS | /mnt/project/pracstrucfunc13/data/COPS/COPS-ChainHierarchy.txt | 27 Apr 2012 |
Comparison
The results were parsed and analysed using the script parse_output.pl.
Evaluation
Method
- The quality of hits retrieved by HHblits, PSI-BLAST and BLAST was evaluated using the COPS L30/L40/L60 structural similarity groups as the standard of truth. Protein chains in the same L30/L40/L60 group have 30/40/60% structural similarity, respectively. For example, when the L30 group is used as the standard of truth, a hit is a true positive (TP) if it has the same L30 group as the query; otherwise it is a false positive (FP).
- The query structure 1OGS chain A of glucocerebrosidase was used for comparison with the following COPS groups:
- L30 of query: 3zr6_A
- L40 of query: 3ii1_A
- L60 of query: 2v3e_B
- The total number of chains in COPS with the same L30/L40/L60 group as the query was calculated and stored:
- L30: 2058
- L40: 1126
- L60: 80
- The parsed results of the runs, also produced with the script parse_output.pl, include the following information for each hit: E-value, whether the hit is a TP or an FP, query ID and hit ID. As mentioned, if the query was aligned more than once to the same found sequence, only the alignment with the lowest E-value was used.
- In the evaluation routine (tpr_precision.pl), each file is first sorted by ascending E-value. Afterwards, the following rates are calculated cumulatively for each line (over all lines from the first to the current one): TPR (sensitivity) and precision, which are defined as follows:
TPR = TP/(TP+FN)
precision = TP/(TP+FP)
where:
- TP = the number of true positive hits, namely all found hits with the same COPS L30/L40/L60 group as the query
- FP = the number of false positive hits, i.e. found hits with a different group than the query
- FN = the number of false negatives, i.e. chains with the same group as the query that were not detected
- TP+FN = the number of all positives, i.e. the number of all known PDB chains with the same group as the query
- TP+FP = the number of all found hits
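The evaluation described above can be sketched as follows. This is a minimal Python illustration, not the tpr_precision.pl script itself: the `Hit` record, the function names and the toy hits are assumptions made for the example. Only the procedure comes from the text: keep the lowest-E-value alignment per hit, sort by ascending E-value, then compute cumulative TPR and precision.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    hit_id: str       # identifier of the found chain
    evalue: float     # E-value of the alignment
    cops_group: str   # COPS group of the hit at the chosen level (e.g. L30)


def best_hit_per_sequence(hits):
    """Keep only the lowest-E-value alignment for each hit ID."""
    best = {}
    for h in hits:
        if h.hit_id not in best or h.evalue < best[h.hit_id].evalue:
            best[h.hit_id] = h
    return list(best.values())


def tpr_precision_curve(hits, query_group, total_positives):
    """Return one (E-value, TPR, precision) tuple per ranked hit.

    total_positives is TP+FN: the number of COPS chains sharing the
    query's group (e.g. 2058 for L30 in the journal above).
    """
    ranked = sorted(best_hit_per_sequence(hits), key=lambda h: h.evalue)
    tp = fp = 0
    curve = []
    for h in ranked:
        if h.cops_group == query_group:
            tp += 1   # same group as the query: true positive
        else:
            fp += 1   # different group: false positive
        curve.append((h.evalue, tp / total_positives, tp / (tp + fp)))
    return curve


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_hits = [
        Hit("hitA", 1e-50, "3zr6_A"),
        Hit("hitA", 1e-10, "3zr6_A"),   # second alignment of hitA, dropped
        Hit("hitB", 1e-20, "other"),
        Hit("hitC", 1e-5, "3zr6_A"),
    ]
    for evalue, tpr, precision in tpr_precision_curve(toy_hits, "3zr6_A", 4):
        print(f"{evalue:.0e}\tTPR={tpr:.2f}\tprecision={precision:.2f}")
```

Because TP+FN is fixed by the COPS group size, it is passed in as a constant and does not have to be derived from the hit list; FN never needs to be counted explicitly.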