Difference between revisions of "Gaucher Disease: Task 02 - Lab Journal"

From Bioinformatikpedia

Revision as of 19:04, 19 May 2013

Sequence Searches

Runs

The script run.pl was written and used for the runs. Profile files (a3m and hhr for HHblits; chk ("checkpoint") and PSSM for PSI-BLAST) were written out so that a search could be continued against another database: from big_80 to big for PSI-BLAST, and later against a PDB database for the evaluation.

Databases

We used the following database versions:

Database      Location                                                         Version
uniprot_20    /mnt/project/rost_db/data/hhblits/uniprot_20                     02 Sep 2011
big & big_80  /mnt/project/rost_db/data/big/big[_80]                           before 30 Apr 2013?
pdb_seqres    /mnt/home/rost/kloppmann/data/blast_db/pdb_seqres                3 May 2013
pdb_full      /mnt/project/rost_db/data/hhblits/pdb_full                       before 30 Apr 2013?
COPS          /mnt/project/pracstrucfunc13/data/COPS/COPS-ChainHierarchy.txt   27 Apr 2012

Comparison

The results were parsed and analysed using the script parse_output.pl.
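The Evaluation section notes that when the query aligns more than once to the same found sequence, only the alignment with the lowest E-value is kept. The sketch below illustrates that filtering step in Python. It is not the authors' parse_output.pl; the function name `best_hit_per_sequence` and the hit IDs and E-values are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Tuple


def best_hit_per_sequence(hits: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Keep, for every hit ID, only the lowest E-value that was seen."""
    best: Dict[str, float] = {}
    for hit_id, evalue in hits:
        # Replace the stored value only if this alignment is more significant.
        if hit_id not in best or evalue < best[hit_id]:
            best[hit_id] = evalue
    return best


# Hypothetical input: "hitA" was aligned twice, "hitB" once.
raw_hits = [("hitA", 3e-10), ("hitB", 2e-5), ("hitA", 1e-50)]
best = best_hit_per_sequence(raw_hits)
print(best)
```

Here the second alignment of "hitA" replaces the first because its E-value is lower, so each found sequence contributes exactly one line to the later evaluation.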

Evaluation

Method

  • The quality of hits retrieved by HHblits, PSI-BLAST and BLAST was evaluated using the COPS L30/L40/L60 structural similarity groups as the standard of truth. Protein chains in the same L30/L40/L60 group have 30/40/60% structural similarity, respectively. For example, when the L30 group is used as the standard of truth, a hit is a true positive (TP) if it has the same L30 group as the query; otherwise it is a false positive (FP).
  • The query structure, chain A of glucocerebrosidase (PDB 1OGS), was used for the comparison. It belongs to the following COPS groups:
    • L30 of query: 3zr6_A
    • L40 of query: 3ii1_A
    • L60 of query: 2v3e_B
  • The total number of chains in COPS with the same L30/L40/L60 group as the query was counted and stored:
    • L30: 2058
    • L40: 1126
    • L60: 80
  • The results of the runs, parsed with the script parse_output.pl, include the following information for each hit: E-value, whether the hit is a TP or an FP, query ID and hit ID. As mentioned, when the query aligned more than once to the same found sequence, only the alignment with the lowest E-value was used.
  • In the evaluation routine (tpr_precision.pl), each file is first sorted by ascending E-value. Then, for each line, the following rates are calculated cumulatively over all hits from the first line up to the current line: TPR (sensitivity) and precision, which are defined as follows:

TPR = TP/(TP+FN)
precision = TP/(TP+FP)

where:

  • TP = the number of true positive hits, namely all found hits with the same COPS L30/L40/L60 group as the query
  • FP = the number of false positive hits, i.e. found hits whose group differs from that of the query
  • FN = the number of false negatives, i.e. chains with the same group as the query that were not found
  • TP+FN = the number of all positives, i.e. the number of all known PDB chains with the same group as the query
  • TP+FP = the number of all found hits
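The cumulative calculation described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' tpr_precision.pl: the function name `cumulative_tpr_precision`, the tuple layout `(E-value, is_tp)` and the toy hits are hypothetical. The parameter `n_positives` stands for TP+FN, i.e. the number of all known chains in the query's group.

```python
from typing import List, Tuple


def cumulative_tpr_precision(
    hits: List[Tuple[float, bool]], n_positives: int
) -> List[Tuple[float, float, float]]:
    """Return (E-value, TPR, precision) for every rank of the sorted hit list.

    hits        -- (E-value, is_true_positive) per hit; layout is an assumption
    n_positives -- TP+FN, the number of all chains in the query's group
    """
    ranked = sorted(hits, key=lambda hit: hit[0])  # ascending E-value
    rows = []
    tp = fp = 0
    for evalue, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        tpr = tp / n_positives      # TP / (TP + FN)
        precision = tp / (tp + fp)  # TP / (TP + FP)
        rows.append((evalue, tpr, precision))
    return rows


# Toy data: four hits, three of them true positives, four positives in total.
toy_hits = [(1e-30, True), (1e-3, False), (1e-10, True), (0.5, True)]
rows = cumulative_tpr_precision(toy_hits, n_positives=4)
for evalue, tpr, precision in rows:
    print(f"{evalue:g}\t{tpr:.3f}\t{precision:.3f}")
```

For a real evaluation, `n_positives` would be the group size listed in the Method section (for example 80 for the L60 group), so the TPR can only reach 1 if every chain of that group is found.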