Gaucher Disease: Task 02 - Lab Journal
The scripts written for this task can be found in
/mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1, the created outputs in
/mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/output/task1 and all intermediate and resulting files in
The script run.pl was written and used for the runs (using the bash script
run.sh). PSSM files - a3m and hhr for HHblits, chk ("checkpoint") and PSSM for Psi-BLAST were created in order to start the search against another database, from big_80 to big for Psi-BLAST and later against a PDB database for the evaluation.
Note: in Psi-BLAST in order to create a checkpoint file, one iteration more must be done, as the checkpoint file is created at the beginning of a iteration, not at its end. For example, to resume 9 iterations against big_80 with 1 iteration against big and then with 1 iteration against pdb_seqres the following calls were made (from
../../scripts/task1/run.pl --in ../../gaucher_data/P04062.fasta --out_p psiblast-10iter-big_80 --iter_p 10 --db_b big_80 --c psiblast-9iter-big_80.chk
../../scripts/task1/run.pl --in ../../gaucher_data/P04062.fasta --out_p psiblast-9iter-big_80--2iter_big --iter_p 2 --db_b big --r psiblast-9iter-big_80.chk --c psiblast-9iter-big_80--1iter_big.chk
../../scripts/task1/run.pl --in ../../gaucher_data/P04062.fasta --out_p psiblast-9iter-big_80--1iter_big--1iter_pdb --db_b pdb_seqres --iter_p 1 --r psiblast-9iter-big_80--1iter_big.chk
We have used the following databases versions:
|big & big_80||/mnt/project/rost_db/data/big/big[_80]||before 30 Apr 2013?|
|uniprot_20||/mnt/project/rost_db/data/hhblits/uniprot_20||02 Sep 2011|
|pdb_seqres||/mnt/home/rost/kloppmann/data/blast_db/pdb_seqres||3 May 2013|
|pdb_full||/mnt/project/rost_db/data/hhblits/pdb_full||before 30 Apr 2013?|
|COPS||/mnt/project/pracstrucfunc13/data/COPS/COPS-ChainHierarchy.txt||27 Apr 2012|
The results were parsed and analyzed using the script parse_output.pl.
The distributions of E-value, sequence identity and alignment length were plotted using the R-scripts
- The quality of hits retrieved by HHblits, PSI-BLAST and BLAST was evaluated using the COPS L30/L40/L60 structural similarity groups as standard of truth. Protein chains with same L30/L40/L60 group have 30/40/60% structural similarity, respectively. For example, when using the L30 group as standard of truth, the hit is a true positive (TP) if it has the same L30 group as the query, otherwise it is a false positive (FP).
- The query structure 1OGS chain A of glucocerebrosidase was used for comparison with the following COPS groups:
- L30 of query: 3zr6_A
- L40 of query: 3ii1_A
- L60 of query: 2v3e_B
- The total number of chains in COPS with the same L30/L40/L60 group as the query is calculated and stored:
- L30: 2058
- L40: 1126
- L60: 80
- Parsed results of the runs, also with the script parse_output.pl (using the bash script
parse_output.sh), include the following information for each hit (as mentioned, in case of more then one alignment of the query and the same found sequence, only the one with the lowest E-value was used): E-value, whether the hit is a TP or a FP, query ID and hit ID.
- Then the evaluation routine tpr_precision.pl was applied (using the bash script
tpr_precision.sh). In the evaluation routine, each file is first sorted according to ascending E-value. Afterwards, the following rates are calculated for each line (from the first to the current line): TPR (sensitivity) and precision. Each rate is calculated specifically for each query seen so far, and then the average is taken (sum of all rates for each query divided by the number of the queries seen so far). The rates are defined as follows:
TPR = TP/(TP+FN)
precision = TP/(TP+FP)
- TP = the number of true positive hits, namely all found hits with the same COPS L30/L40/L60 group as the query
- FP = the number of false positive hits, that is with different groups than the query
- FN = the number of false negatives, which are not detected chains with the same group as the query
- TP+FN = the number of all positives, i.e. the number of all known PDB chains with the same group as the query
- TP+FP = the number of all found hits
Multiple Sequence Alignment
The three sets containing 10 or 20 sequences were extracted from the HHblits results of exercise 1 (Sequence Searches) with 10 iterations against uniprot_20 with an E-value cutoff 10E-10.
All sets were compiled with
Program in detail
parseSets.py splits the HHblits file into 3 parts. One part contains all sequences with an sequence identity >60%. The next part includes all sequences below the 30% threshold. The remaining sequences were joined in the last part. While Set 1 and Set 2 pull random sequences out of the corresponding part, Set 3 is generated differently. To guarantee a constant distribution of the sequence identities within the 20 sequences, 1/3 of the sequences were randomly taken of each of the three parts.
The multiple sequence alignments (MSA) were generated and visualized with JalView. JalView provides a web service for ClustalW, Muscle and T-Coffee. Espresso was run on the web server. Espresso searches automatically the PDB with BLAST to extract the structures of the sequences. The visualization was generated of the program itself. Similarity trees were also generated with JalView. Therefore, the average distance was used with both %identity and Blosum62 as "scoring matrix".
The used versions:
- ClustalW: version 2.1
- Muscle: version 3.8.31
- T-Coffee: version 8.99
- Espresso: version 9.03
For a closer look at the secondary structure that might be affected by an alignment gap, the PDB entry 1OGS was used.