Difference between revisions of "Gaucher Disease: Task 02 - Lab Journal"

From Bioinformatikpedia
 
(38 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
==Sequence Searches==
 
==Sequence Searches==
  +
The scripts written for this task can be found in <code>/mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1</code>, the created outputs in <code>/mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/output/task1</code> and all intermediate and resulting files in <code>/mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/results/task1</code>.
  +
 
===Runs===
 
===Runs===
The script [[run.pl]] was written and used for the runs. PSSM files - a3m and hhr for HHblits, chk ("checkpoint") and PSSM for Psi-BLAST were created in order to start the search against another database, from big_80 to big for Psi-BLAST and later against a PDB database for the evaluation.
+
The script [[run.pl]] was written and used for the runs (using the bash script <code>run.sh</code>). PSSM files - a3m and hhr for HHblits, chk ("checkpoint") and PSSM for Psi-BLAST were created in order to start the search against another database, from big_80 to big for Psi-BLAST and later against a PDB database for the evaluation.
  +
'''Note''': in Psi-BLAST in order to create a checkpoint file, one iteration more must be done, as the checkpoint file is created at the beginning of a iteration, not at its end. For example, to resume 9 iterations against big_80 with 1 iteration against big and then with 1 iteration against pdb_seqres the following calls were made (from <code>~/master_practical/Assignment2_Alignments/output/task1</code>): <br>
  +
  +
../../scripts/task1/run.pl --in ../../gaucher_data/P04062.fasta --out_p psiblast-10iter-big_80 --iter_p 10 --db_b big_80 --c psiblast-9iter-big_80.chk
  +
  +
../../scripts/task1/run.pl --in ../../gaucher_data/P04062.fasta --out_p psiblast-9iter-big_80--2iter_big --iter_p 2 --db_b big
  +
--r psiblast-9iter-big_80.chk --c psiblast-9iter-big_80--1iter_big.chk
  +
  +
../../scripts/task1/run.pl --in ../../gaucher_data/P04062.fasta --out_p psiblast-9iter-big_80--1iter_big--1iter_pdb --db_b pdb_seqres --iter_p 1
  +
--r psiblast-9iter-big_80--1iter_big.chk
  +
  +
====Databases====
  +
We have used the following databases versions:
  +
  +
{| border="1" cellpadding="5" cellspacing="0" align="center"
  +
|-
  +
! style="background:#efefef;" align="center" | Database
  +
! style="background:#efefef;" align="center" | Location
  +
! style="background:#efefef;" align="center" | Version
  +
|-
  +
| align="center" |big & big_80
  +
| align="center" |/mnt/project/rost_db/data/big/big[_80]
  +
| align="center" |before 30 Apr 2013?
  +
|-
  +
| align="center" |uniprot_20
  +
| align="center" |/mnt/project/rost_db/data/hhblits/uniprot_20
  +
| align="center" |02 Sep 2011
  +
|-
  +
| align="center" |pdb_seqres
  +
| align="center" |/mnt/home/rost/kloppmann/data/blast_db/pdb_seqres
  +
| align="center" |3 May 2013
  +
|-
  +
| align="center" |pdb_full
  +
| align="center" |/mnt/project/rost_db/data/hhblits/pdb_full
  +
| align="center" |before 30 Apr 2013?
  +
|-
  +
| align="center" |COPS
  +
| align="center" |/mnt/project/pracstrucfunc13/data/COPS/COPS-ChainHierarchy.txt
  +
| align="center" |27 Apr 2012
  +
|-
  +
|}
   
 
===Comparison===
 
===Comparison===
The results were parsed and analysed using the script [[parse_output.pl]].
+
The results were parsed and analyzed using the script [[parse_output.pl]].
  +
The distributions of E-value, sequence identity and alignment length were plotted using the R-scripts <code>hist_eval_iden_len.r</code> and <code>hist_eval_iden_len-with_blast.r</code>
  +
  +
===Evaluation===
  +
====Method====
  +
*The quality of hits retrieved by HHblits, PSI-BLAST and BLAST was evaluated using the COPS L30/L40/L60 structural similarity groups as standard of truth. Protein chains with same L30/L40/L60 group have 30/40/60% structural similarity, respectively. For example, when using the L30 group as standard of truth, the hit is a true positive (TP) if it has the same L30 group as the query, otherwise it is a false positive (FP).
  +
*The query structure 1OGS chain A of glucocerebrosidase was used for comparison with the following COPS groups:
  +
**L30 of query: 3zr6_A
  +
**L40 of query: 3ii1_A
  +
**L60 of query: 2v3e_B
  +
*The total number of chains in COPS with the same L30/L40/L60 group as the query is calculated and stored:
  +
**L30: 2058
  +
**L40: 1126
  +
**L60: 80
  +
*Parsed results of the runs, also with the script [[parse_output.pl]] (using the bash script <code>parse_output.sh</code>), include the following information for each hit (as mentioned, in case of more then one alignment of the query and the same found sequence, only the one with the lowest E-value was used): E-value, whether the hit is a TP or a FP, query ID and hit ID.
  +
*Then the evaluation routine [[tpr_precision.pl]] was applied (using the bash script <code>tpr_precision.sh</code>). In the evaluation routine, each file is first sorted according to ascending E-value. Afterwards, the following rates are calculated for each line (from the first to the current line): TPR (sensitivity) and precision. Each rate is calculated specifically for each query seen so far, and then the average is taken (sum of all rates for each query divided by the number of the queries seen so far). The rates are defined as follows: <br>
  +
<center>
  +
TPR = TP/(TP+FN) <br>
  +
precision = TP/(TP+FP) <br>
  +
</center>
  +
where:
  +
* TP = the number of true positive hits, namely all found hits with the same COPS L30/L40/L60 group as the query
  +
* FP = the number of false positive hits, that is with different groups than the query
  +
* FN = the number of false negatives, which are not detected chains with the same group as the query
  +
* TP+FN = the number of all positives, i.e. the number of all known PDB chains with the same group as the query
  +
* TP+FP = the number of all found hits
  +
  +
  +
==Multiple Sequence Alignment==
  +
  +
===Sets===
  +
The three sets containing 10 or 20 sequences were extracted from the HHblits results of exercise 1 (Sequence Searches) with 10 iterations against uniprot_20 with an E-value cutoff 10E-10.<br>
  +
All sets were compiled with <code>/mnt/home/student/gerkej/gaucher/task2/parseSets.py</code>.
  +
  +
===Program in detail===
  +
<code>parseSets.py</code> splits the HHblits file into 3 parts. One part contains all sequences with an sequence identity >60%. The next part includes all sequences below the 30% threshold. The remaining sequences were joined in the last part. While Set 1 and Set 2 pull random sequences out of the corresponding part, Set 3 is generated differently. To guarantee a constant distribution of the sequence identities within the 20 sequences, 1/3 of the sequences were randomly taken of each of the three parts.
  +
  +
===Alignments===
  +
The multiple sequence alignments (MSA) were generated and visualized with JalView. JalView provides a web service for ClustalW, Muscle and T-Coffee. Espresso was run on the web server. Espresso searches automatically the PDB with BLAST to extract the structures of the sequences. The visualization was generated of the program itself. Similarity trees were also generated with JalView. Therefore, the average distance was used with both %identity and Blosum62 as "scoring matrix".
  +
  +
The used versions:
  +
  +
*ClustalW: version 2.1
  +
*Muscle: version 3.8.31
  +
*T-Coffee: version 8.99
  +
*[http://tcoffee.crg.cat/apps/tcoffee/do:expresso Espresso]: version 9.03
  +
  +
For a closer look at the secondary structure that might be affected by an alignment gap, the PDB entry 1OGS was used.

Latest revision as of 01:29, 6 September 2013

Sequence Searches

The scripts written for this task can be found in /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1, the created outputs in /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/output/task1 and all intermediate and resulting files in /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/results/task1.

Runs

The script run.pl was written and used for the runs (using the bash script run.sh). PSSM files - a3m and hhr for HHblits, chk ("checkpoint") and PSSM for Psi-BLAST were created in order to start the search against another database, from big_80 to big for Psi-BLAST and later against a PDB database for the evaluation. Note: in Psi-BLAST in order to create a checkpoint file, one iteration more must be done, as the checkpoint file is created at the beginning of a iteration, not at its end. For example, to resume 9 iterations against big_80 with 1 iteration against big and then with 1 iteration against pdb_seqres the following calls were made (from ~/master_practical/Assignment2_Alignments/output/task1):

../../scripts/task1/run.pl --in ../../gaucher_data/P04062.fasta --out_p psiblast-10iter-big_80 --iter_p 10 --db_b big_80 --c psiblast-9iter-big_80.chk
../../scripts/task1/run.pl --in ../../gaucher_data/P04062.fasta --out_p psiblast-9iter-big_80--2iter_big --iter_p 2 --db_b big 
--r psiblast-9iter-big_80.chk --c psiblast-9iter-big_80--1iter_big.chk
../../scripts/task1/run.pl --in ../../gaucher_data/P04062.fasta --out_p psiblast-9iter-big_80--1iter_big--1iter_pdb --db_b pdb_seqres --iter_p 1 
--r psiblast-9iter-big_80--1iter_big.chk

Databases

We have used the following databases versions:

Database Location Version
big & big_80 /mnt/project/rost_db/data/big/big[_80] before 30 Apr 2013?
uniprot_20 /mnt/project/rost_db/data/hhblits/uniprot_20 02 Sep 2011
pdb_seqres /mnt/home/rost/kloppmann/data/blast_db/pdb_seqres 3 May 2013
pdb_full /mnt/project/rost_db/data/hhblits/pdb_full before 30 Apr 2013?
COPS /mnt/project/pracstrucfunc13/data/COPS/COPS-ChainHierarchy.txt 27 Apr 2012

Comparison

The results were parsed and analyzed using the script parse_output.pl. The distributions of E-value, sequence identity and alignment length were plotted using the R-scripts hist_eval_iden_len.r and hist_eval_iden_len-with_blast.r

Evaluation

Method

  • The quality of hits retrieved by HHblits, PSI-BLAST and BLAST was evaluated using the COPS L30/L40/L60 structural similarity groups as standard of truth. Protein chains with same L30/L40/L60 group have 30/40/60% structural similarity, respectively. For example, when using the L30 group as standard of truth, the hit is a true positive (TP) if it has the same L30 group as the query, otherwise it is a false positive (FP).
  • The query structure 1OGS chain A of glucocerebrosidase was used for comparison with the following COPS groups:
    • L30 of query: 3zr6_A
    • L40 of query: 3ii1_A
    • L60 of query: 2v3e_B
  • The total number of chains in COPS with the same L30/L40/L60 group as the query is calculated and stored:
    • L30: 2058
    • L40: 1126
    • L60: 80
  • Parsed results of the runs, also with the script parse_output.pl (using the bash script parse_output.sh), include the following information for each hit (as mentioned, in case of more then one alignment of the query and the same found sequence, only the one with the lowest E-value was used): E-value, whether the hit is a TP or a FP, query ID and hit ID.
  • Then the evaluation routine tpr_precision.pl was applied (using the bash script tpr_precision.sh). In the evaluation routine, each file is first sorted according to ascending E-value. Afterwards, the following rates are calculated for each line (from the first to the current line): TPR (sensitivity) and precision. Each rate is calculated specifically for each query seen so far, and then the average is taken (sum of all rates for each query divided by the number of the queries seen so far). The rates are defined as follows:

TPR = TP/(TP+FN)
precision = TP/(TP+FP)

where:

  • TP = the number of true positive hits, namely all found hits with the same COPS L30/L40/L60 group as the query
  • FP = the number of false positive hits, that is with different groups than the query
  • FN = the number of false negatives, which are not detected chains with the same group as the query
  • TP+FN = the number of all positives, i.e. the number of all known PDB chains with the same group as the query
  • TP+FP = the number of all found hits


Multiple Sequence Alignment

Sets

The three sets containing 10 or 20 sequences were extracted from the HHblits results of exercise 1 (Sequence Searches) with 10 iterations against uniprot_20 with an E-value cutoff 10E-10.
All sets were compiled with /mnt/home/student/gerkej/gaucher/task2/parseSets.py.

Program in detail

parseSets.py splits the HHblits file into 3 parts. One part contains all sequences with an sequence identity >60%. The next part includes all sequences below the 30% threshold. The remaining sequences were joined in the last part. While Set 1 and Set 2 pull random sequences out of the corresponding part, Set 3 is generated differently. To guarantee a constant distribution of the sequence identities within the 20 sequences, 1/3 of the sequences were randomly taken of each of the three parts.

Alignments

The multiple sequence alignments (MSA) were generated and visualized with JalView. JalView provides a web service for ClustalW, Muscle and T-Coffee. Espresso was run on the web server. Espresso searches automatically the PDB with BLAST to extract the structures of the sequences. The visualization was generated of the program itself. Similarity trees were also generated with JalView. Therefore, the average distance was used with both %identity and Blosum62 as "scoring matrix".

The used versions:

  • ClustalW: version 2.1
  • Muscle: version 3.8.31
  • T-Coffee: version 8.99
  • Espresso: version 9.03

For a closer look at the secondary structure that might be affected by an alignment gap, the PDB entry 1OGS was used.