Difference between revisions of "Sequence Alignments Gaucher Disease"

From Bioinformatikpedia
(Methods (lab journal))
(Gaps in secondary structure)
 
(93 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
== Sequence searches ==
 
== Sequence searches ==
coming soon...
 
   
===Blast===
+
=== Introduction ===
  +
The homology search tools BLAST, PSI-BLAST<ref name="blastpgp">Altschul, S; Gish, W; Miller, W; Myers, E; Lipman, D (1990). ''Basic local alignment search tool''</ref>, and HHblits<ref name="hhblits">Remmert, M.; Biegert, A.; Hauser A.; Soeding, J. (2012). ''HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment''</ref> are essential for many applications in bioinformatics. BLAST allows for a rapid identification of homologous sequences by simple sequence-to-sequence comparison. The iterative search tool PSI-BLAST builds upon BLAST but performs sequence-to-profile comparisons where the profile is derived from the hits of the previous search iteration. Only significant hits whose e-value is below a predefined cut-off (-h) are included into the profile. The higher the number of search iterations (-j) the more precise become the profile substitution scores and the more sensitive PSI-BLAST. HHblits likewise performs several search iterations but compares Hidden Markov Models (HMMs) instead of sequences to profiles which makes it even more sensitive. In this task, we compared BLAST to PSI-BLAST and HHblits (2 iterations) and investigated the impact of the number of PSI-BLAST iterations (-j 2, 10) and the e-value inclusion threshold (-h 2e-3, 1e-10) on the homology detection performance. Technical details are documented in our [[protocol_search|protocol]].
   
===PSI-Blast===
+
=== Benchmark setting ===
   
===HHBlits===
+
==== Query sequence ====
  +
We used the UniProt sequence [http://www.uniprot.org/uniprot/P04062 P04062] as query for all sequence search tools:
  +
<code>
  +
>sp|P04062|GLCM_HUMAN Glucosylceramidase OS=Homo sapiens GN=GBA PE=1 SV=3
  +
MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNAT
  +
YCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGF
  +
GGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPMASCDFSIRTYTYADTPDD
  +
FQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQP
  +
GDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIA
  +
RDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAK
  +
ATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHSIITNLLYHVVGWTDW
  +
NLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQK
  +
NDLDAVALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ
  +
</code>
  +
  +
==== Databases ====
  +
Two different databases were provided for BLAST/PSI-BLAST and BLAST:
  +
  +
===== BLAST/PSI-BLAST: big, big_80 =====
  +
Big is an unclustered database containing both all UniProtKB<ref name="uniprot">Uniprot, C. (2010). ''Ongoing and future developments at the Universal Protein Resource''</ref> entries and some PDB entries (altogether 21375097 sequences). Big_80 was built by applying CD-HIT<ref name="cd-hit">Li, W.; Godzik, A. (2006). ''Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences''</ref> for clustering all UniprotKB sequences with a sequence identity above 80%. Big_80 therefore consists of only 7730747 clusters which are represented by a single cluster sequence. Due to the clustering of highly similar sequences, the sequence space can be searched faster without loosing much information.
  +
  +
===== HHblits: uniprot20_current =====
  +
For carrying out HMM-to-HMM comparison, HHblits requires an A3M and an HMM database which are built by applying kClust<ref name="kclust">Mayer, C.; Soeding, J. (unpublished) ''kClust''.</ref> for clustering similar sequences where each cluster is represented by a database A3M/HMM. Clusters of the uniprot20_current database (altogether 3129234) exhibit a maximum inter-cluster sequence identity of 20% which means that sequences with an identity above 80% are clustered just as in case of big_80. However, a cluster is not a single sequence but an alignment of all clustered sequences. For instance, the cluster <tt>>UP20|BABBAGIBA|3|240_consensus</tt> consists of 3 sequences.
  +
  +
===== Comparing results on big and uniprot20_current =====
  +
In order to be able to compare the BLAST/PSI-BLAST hits on big/big_80 with the HHblits hits on uniprot20_current, we used following workaround:
  +
* n iterations PSI-BLAST were evaluated by performing n-1 iterations against big_80 and only the last search iteration against big. All PDB hits were ignored in the final result files such that only unclustered UniProtKB sequences (>tr, >sp) were considered. Alternatively, we might have performed all n search iterations against the full big database but which would have increased the computational costs considerably.
  +
* Two iterations HHblits were evaluated by performing two iterations against the uniprot20_current database and using the UniProtKB cluster members instead of the cluster representative as hits. For this, we translated each <tt>>UP20</tt> cluster of the HHblits result file into the the corresponding <tt>>tr</tt>, <tt>>sp</tt> entries which were obtained from the <tt>uniprot20_02Sep11_a3m_db</tt> database file.
  +
Hence we effectively compared all search tools on the full UniProtKB database. The output parameters were adjusted such that all hits with an e-value < 10 were listed which were used for the subsequent analysis.
  +
  +
=== Runtime ===
  +
<figure id="fig:search_runtime_big_80">
  +
[[File:runtime_big_80.png|thumb|350px|right|<caption>Runtime in seconds for searching big_80 (BLAST/PSI-BLAST) and uniprot20_current (HHblits).</caption>]]
  +
</figure>
  +
<xr id="fig:search_runtime_big_80"/> shows the runtime of the different search methods measured on the server i12k-biolab01. The BLAST/PSI-BLAST measurements refer to the big_80 database whereas the HHblits measurement refers to the uniprot20_current database. Since the size of these databases differ (see above), the runtime of BLAST/PSI-BLAST is not directly comparable to the runtime of HHblits. One also has to consider that the server i12k-biolab01 carries out processes of multiple users which might have led to fluctuations of the time measurements. Nevertheless, figure shows that the runtime increased proportionally to the number of search iterations. Furthermore, the runtime increased by 10% when using an inclusion threshold of 2e-3 instead of 1e-10 for 2 iterations and by even 28% for 10 iterations. This is due to the fact that a more liberal inclusion threshold increases the number of hits which are used for computing the PSSM of the next iteration which causes a higher runtime. HMM-to-HMM comparisons carried out by HHblits are computationally more costly than comparing sequences or sequences to profiles. The runtime of HHblits is therefore higher than the runtime of BLAST and PSI-BLAST although the query HMM is only compared to a small fraction of database HMMs which pass two pre-filter steps.
  +
  +
=== Number of significant hits ===
  +
<figure id="fig:search_nhits_big">
  +
[[File:nhits_big.png|thumb|350px|right|<caption>Number of significant hits (e-value < 1e-3) found by searching UniProtKB.</caption>]]
  +
</figure>
  +
<xr id="fig:search_nhits_big"/> shows the number of significant UniprotKb hits (evalue < 1e-3) detected by BLAST/PSI-BLAST and HHblits. Iterative search tools such as PSI-BLAST and HHblits allow for a higher sensitivity in detecting remote homologous sequences due to position specific substitution scores derived from hits of previous search iterations. The more search iterations are carried out the more diverse the position specific substitution scores become and the more distantly related sequences with a low sequence similarity can be recognised. This accounts why 10 iterations led to more hits than two iterations. However, if the inclusion threshold is very strict (1e-10), only highly significant hits with a high similarity are added to the PSSM. Such hits are normally already found in the first two iterations such that further iterations do not yield more hits. Two rounds HHblits led to less significant hits than two rounds PSI-BLAST. Since HHblits is much better in discriminating true positives from false positives, this indicates that PSI-BLAST reports more significant false positives than HHblits.
  +
  +
=== Overlapping hits ===
  +
We investigated the overlap of BLAST/PSI-BLAST hits depending on (1) the number of search iterations -j and (2) the inclusion threshold -h as well as (3) the overlap of BLAST/PSI-BLAST hits with HHblits hits.
  +
# As described in the previous section, more hits are found when performing more iterations (<xr id="fig:search_overlap_blastpgp_big_j"/>). However, not all hits detected after two iterations were still found after ten iterations. This demonstrates that the refinement of position specific substitution scores helps to exclude false positives which are considered as significant after only one iteration.
  +
# By increasing the inclusion threshold, PSI-BLAST found new significant hits which might be true positives but also false positives. As expected, after two iterations all hits found by using -h 1-e10 are also found by using the more relaxed threshold of -h 2e-3 (<xr id="fig:search_overlap_blastpgp_big_h"/>). Since more sequences are included into the PSSM when using -h 2e-3 instead of 1e-10, the PSSMs differ which accounts that not all hits detected by using -h 2-e3 are also detected by using -h 1e-10.
  +
# Two iterations HHblits yielded 59 hits which were not detected by two iterations of PSI-BLAST (<xr id="fig:search_overlap_hhblits_big_j2"/>). These are probably true homologous sequences which could by inferred from the additional information about homology captured in the HMMs.
  +
{|align="center"
  +
|<figure id="fig:search_overlap_blastpgp_big_j">[[File:overlap_blastpgp_big_j.png|thumb|200px|left|<caption>Venn diagram of overlapping significant hits (e-value < 1e-3) of PSI-BLAST -j 2 and PSI-BLAST -j 10.</caption>]] </figure>
  +
|<figure id="fig:search_overlap_blastpgp_big_h">[[File:overlap_blastpgp_big_h.png|thumb|200px|<caption>Venn diagram of overlapping significant hits (e-value < 1e-3) of PSI-BLAST -h 2e-3 and PSI-BLAST -h 1e-10.</caption>]]</figure>
  +
|<figure id="fig:search_overlap_hhblits_big_j2">[[File:overlap_hhblits_big_j2.png|thumb|150px|<caption>Venn diagram of overlapping significant hits (e-value < 1e-3) of two iterations PSI-BLAST and two iterations HHblits.</caption>]]</figure>
  +
|}
  +
  +
=== Distribution of the e-value ===
  +
{|style="float:right"
  +
|<figure id="fig:search_dist_evalue_big">[[File:dist_evalue_big.png|thumb|250px|<caption>Relative frequency of hits a particular e-value.</caption>]]</figure>
  +
|<figure id="fig:search_dist_id_big">[[File:dist_id_big.png|thumb|250px|<caption>Relative frequency of hits with a particular sequence identity.</caption>]]</figure>
  +
|}
  +
<xr id="fig:search_dist_evalue_big"/> depicts the distributions of e-values of the different methods whereby only hits with an e-value < 10 were taken into account. Altogether, only a small fraction of database sequences obtained an e-value smaller than 1e-3. The iterative methods tend to report lower e-values since the e-value of hits which are added to the PSSM becomes lower the more search iterations are carried out. Likewise, a low inclusion threshold produces highly significant hits. Note that the reported e-value does not have to coincide with the actual e-value: the e-value reported by iterative search tools is in most cases two optimistic, i.e. more change hits actually occur than states by the reported e-value.
  +
  +
=== Distribution of the sequence identity ===
  +
<xr id="fig:search_dist_id_big"/> shows the relative frequency of significant hits (e-evalue < 1e-3) which were found within different ranges of sequence identity. The sequence identity of most hist was either greater than 0.75 or it was between 0.1 and 0.4. Close homologous sequences with an identity above 0.75 could easily be detected by all search methods. Detecting more distantly related sequences within the twilight zone (identity < 0.3) is, however, a much harder task and requires iterative search methods such as HHblits. This accounts why the fractions of hits reported by BLAST was higher for an identity >0.8 but lower for an identity below 0.3. Another interesting observation is that performing a high number of PSI-BLAST iterations with a liberal e-value (-j 10, -h 2e-3) led to many significant hits with a very low identity. This shows that most hits were probably false positives and that an inclusion threshold of 2e-3 is too high.
  +
  +
=== Validating the hit list ===
  +
{|style="float:right"
  +
|<figure id="fig:search_val_roc">[[File:val_roc.png|thumb|250px|<caption>ROC performance of BLAST,PSI_BLAST_j2_h2e-3,HHblits_n2 on a SCOP benchmark set.</caption>]]</figure>
  +
|<figure id="fig:search_val_overlap_pdb">[[File:val_overlap_pdb.png|thumb|250px|<caption>Overlapping PDB hits of the big database.</caption>]]</figure>
  +
|}
  +
The sensitivity of homology search tools is normally estimated by a ROC plot analysis on a dataset of proteins whose whose evolutionary relationship is known. Often the SCOP<ref name="scop">Loredana Lo Conte,
  +
Bart Ailey, Tim J. P. Hubbard, Steven E. Brenner,Alexey G. Murzin, Cyrus Chothia (2000). ''SCOP: a Structural Classification of Proteins database.''</ref> database is used as benchmark set which subdivides proteins into different classes, folds, superfamilies, and families. Proteins belonging to the same superfamily are defined as true positives (TP) whereas proteins of different folds are treated as false positives (FP). Based upon this classification scheme, a ROC plot can be derived which shows the number of TP which could be found until a certain number of FP. The larger the area under the TP vs. FP curve, the more sensitive is the homology search tool.<ref name="benchmark">Soeding, J.; Remmert, M. (2011) ''Protein sequence comparison and fold recognition: progress and
  +
good-practice benchmarking''</ref> <xr id="fig:search_val_roc"/> depicts such a ROC plot which I created for my bachelor thesis. It underlines the higher sensitivity of HHblits compared to PSI-BLAST and BLAST.<br/>
  +
Since there was no common database for BLAST and HHblits with proteins of known evolutionary relationship available, we limited the validation on the PDB entries available in the big database. <xr id="fig:search_val_overlap_pdb"/> shows the number of common PDB entries found by the different BLAST/PSI-BLAST runs. 66 entries were detected by all five runs which are most likely true positives. There were 30 and 138 entries which were exclusively found by PSI-BLAST_j10_he1-10 and PSI-BLAST_j10_he2-3, respectively. We manually investigated the PDB annotations of some of these entries and found that PSI-BLAST_j10_he2-3 produced more hits than PSI-BLAST_j10_he1-10 which are not related to the query structure [http://www.rcsb.org/pdb/explore/explore.do?structureId=1OGS 1OGS]. We assume that doing a higher number of search iterations with a liberal inclusion threshold increases the risk of reporting significant false positives: if a particular false positive is once added to the PSSM, more false positives will be found in the subsequent iterations which are homologous to that false positive.
   
 
== Multiple sequence alignment ==
 
== Multiple sequence alignment ==
   
  +
In this multiple sequence alignments task, We should at first create a data set containing choose 20 relevant sequences from the sequence searches. Then different multiple alignments methods,i.e., ClustalW, Muscle, T-Coffee will be applied. The resulted alignments will be viewed by using Jalview and further analysis work will be done based on the observation.
=== Methods (lab journal) ===
 
   
  +
=== Methods ===
==== Sequences chosen for the multiple Alignment ====
 
  +
  +
==== Tools ====
  +
  +
===== ClustalW =====
  +
Clustal is a classic widely used multiple sequence alignment computer program[http://www.ebi.ac.uk/Tools/msa/clustalw2/]. For the given sequences, it generates at first the pairwise alignment, then creates a phylogenetic tree and Uses the phylogenetic tree to carry out a multiple alignment. ClustalW is its command line interface.
  +
  +
Here is the usage:
  +
clustalw -infile=msa.fasta -outfile=msa_clustal.aln
  +
  +
===== MUSCLE =====
  +
Muscle (MUltiple Sequence Comparison by Log- Expectation) is an iteration-based method. Elements of its algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function clled the log-expectation score, and refinement using tree-dependent restricted partitioning. It is reorted to be able to get better accuracy and speed than ClustalW2 or T-Coffee[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC390337/?tool=pmcentrez].
  +
  +
Here is the usage:
  +
/mnt/opt/T-Coffee/bin/muscle -in msa.fasta -out msa_muscle.aln
  +
  +
The installed tools on the student machines (/mnt/opt/T-Coffee/bin/) do not seem to work well(buffer overflow reported). We can simply use the web server on EMBL[http://www.ebi.ac.uk/Tools/msa/muscle/] .
  +
  +
===== T-Coffee =====
  +
T-Coffee (Tree-based Consistency Objective Function For alignment Evaluation) is a progressive approach which combines results obtained with several alignment methods. 3D-Coffe can use structural information from PDB files[http://tcoffee.crg.cat/apps/tcoffee/index.html].
  +
  +
Here is the usage for T-Coffee:
  +
/mnt/opt/T-Coffee/bin/t_coffee msa.fasta -special_mode accurate
  +
  +
Here is the usage for 3dcoffe with our own template files:
  +
/mnt/opt/T-Coffee/bin/t_coffee -seq msa.fasta -template_file template_list -method sap_pair
  +
  +
Here the option "-method sap_pair" specifies that sap algorithm for structural alignment will be used. And the option "-template_file template_list" the sequence/template association by using following format: >sequence_name1_P_PDBID1, for example, in our case, this template_list should contains:
  +
>tr|H2Q075|H2Q075_PANTR _P_ 2vt0A
  +
>tr|Q9KIJ7|Q9KIJ7_SALTY _P_ 2wnwA
  +
>sp|P04062|GLCM_HUMAN _P_ 2v3d
   
 
=== Results ===
 
=== Results ===
   
  +
==== Sequences chosen for the multiple Alignment ====
  +
We Choose 20 sequences from 4 identity ranges and try to take those from different organisms. We try to avoid taking the sequence without annotation. We find that there are very little candidate sequences at the identity range 50% - 80%. The most available sequences at that range are the sequences without annotations, therefore are not chosen. Two exceptions are “E7EZM1” and “Q4RID9”, the first is an Uncharacterized protein and the later is a whole genome shotgun sequence from Tetraodon nigroviridis (Spotted green pufferfish). They are chosen because they get high scores in identity range 50% - 60%. and the aligments might help to find underlying function of such proteins. Fortunately, 2 pdb "2vt0" and "2wnw", one is associated to "H2Q075" with 98% identity , one is associated to "Q9KIJ7" with 30% identity are involved. <br/>
  +
  +
The following table shows the detailed information about the chosen sequences, where P04062 is the reference sequence:
  +
  +
{| border="1" style="text-align:center; border-spacing:0; width:100%;"
  +
|-
  +
|Uniprot/PDB ID
  +
|annotation
  +
|Organism
  +
|Identity
  +
|-
  +
|colspan="4"|'''reference sequence'''
  +
|-
  +
|P04062
  +
|Glucosylceramidase
  +
|Homo sapiens
  +
|100%
  +
|-
  +
|colspan="4"|'''99 - 90% sequence identity'''
  +
|-
  +
|D3DV87
  +
|Glucosidase, beta
  +
|Homo sapiens
  +
|99%
  +
|-
  +
|Q9BDT0
  +
|Glucosylceramidase
  +
|Pan troglodytes (Chimpanzee)
  +
|99%
  +
|-
  +
|Q5R8E3
  +
|Glucosylceramidase
  +
|Pongo abelii
  +
|98%
  +
|-
  +
|F5H241
  +
|Glucosylceramidase
  +
|Homo sapiens
  +
|98%
  +
|-
  +
|H2Q075 (pdb:2vt0)
  +
|Glucosylceramidase
  +
|Pan troglodytes (Chimpanzee)
  +
|98%
  +
|-
  +
|colspan="4"|'''89 - 60% sequence identity'''
  +
|-
  +
|F1RLJ2
  +
|Glucosylceramidase
  +
|Sus scrofa (Pig)
  +
|89%
  +
|-
  +
|F1N1D5
  +
|Glucosylceramidase
  +
|Bos taurus (Bovine)
  +
|88%
  +
|-
  +
|G5BDF5
  +
|Glucosylceramidase (Fragment)
  +
|Heterocephalus glaber (Naked mole rat)
  +
|87%
  +
|-
  +
|G3HC47
  +
|Glucosylceramidase
  +
|Cricetulus griseus (Chinese hamster)
  +
|87%
  +
|-
  +
|P17439
  +
|Glucosylceramidase
  +
|Mus musculus (Mouse)
  +
|86%
  +
|-
  +
|colspan="4"|'''59 - 40% sequence identity'''
  +
|-
  +
|E7EZM1
  +
|Uncharacterized protein
  +
|Danio rerio (Zebrafish)
  +
|56%
  +
|-
  +
|Q4RID9
  +
|Chromosome 8 whole genome shotgun sequence
  +
|Tetraodon nigroviridis (Spotted green pufferfish)
  +
|54%
  +
|-
  +
|A8PTH1
  +
|O-Glycosyl hydrolase family 30 protein
  +
|Brugia malayi (Filarial nematode worm)
  +
|48%
  +
|-
  +
|F1L2Y4
  +
|Glucosylceramidase
  +
|Ascaris suum (Pig roundworm)
  +
|45%
  +
|-
  +
|B7PK21
  +
|Beta-glucocerebrosidase
  +
|Ixodes scapularis (Black-legged tick)
  +
|44%
  +
|-
  +
|colspan="4"|'''39 - 20% sequence identity'''
  +
|-
  +
|E2C4A6
  +
|Glucosylceramidase
  +
|Harpegnathos saltator (Jerdon's jumping ant)
  +
|38%
  +
|-
  +
|G3IZH3
  +
|Glucosylceramidase
  +
|Methylobacter tundripaludum SV96
  +
|34%
  +
|-
  +
|F7PUQ5
  +
|Glucosylceramidase
  +
|Haloplasma contractile SSD-17B
  +
|32%
  +
|-
  +
|Q9KIJ7(pdb:2wnw)
  +
|Activated by transcription factor SsrB,glycoside hydrolase family enzyme
  +
|Salmonella typhimurium
  +
|30%
  +
|-
  +
|Q1IIZ7
  +
|Glucosylceramidase
  +
|Koribacter versatilis (strain Ellin345)
  +
|27%
  +
|-
  +
|}
  +
<br/>
  +
  +
After having the data set, the next step, as told in the task website, three groups with different sequence identity should be generated: one contains only sequences with low sequence identity (<40%), one contains only sequences with high sequence identity (>60%) and one contains sequences covering the whole range of sequence identity. As it is not very clearly explained how the grouping should exactly be, we build up 3 groups as following:
  +
We should build up 3 groups, each with 10 sequences, then we will have:
  +
  +
*the first group contains 10 sequences with identity 20%-59%(since there are only 5 sequences with identity <40% )
  +
  +
*the second group contains another 10 sequences with identity 60%-99%
  +
  +
*the third group contains 5 sequences from the first group and 5 sequences from the second group.
  +
  +
There is sequence duplication in group 3 as compared to group 1 and group 2. If it is not allowed, we should then choose at least another 10 sequences to build this group which results in additional work.
  +
  +
The following shows the grouping information:
  +
  +
*group 1: E7EZM1, Q4RID9, A8PTH1, F1L2Y4, B7PK21, E2C4A6, G3IZH3, F7PUQ5, Q9KIJ7(pdb:2wnw), Q1IIZ7
  +
  +
*group 2: D3DV87, Q9BDT0, Q5R8E3, F5H241, H2Q075(pdb:2vt0), F1RLJ2, F1N1D5, G5BDF5, G3HC47, P17439
  +
  +
*group 3: D3DV87, Q5R8E3, H2Q075(pdb:2vt0), F1RLJ2, P17439, A8PTH1, B7PK21, E2C4A6, F7PUQ5, Q9KIJ7(pdb:2wnw)
  +
  +
==== Conserved Columns and Gaps ====
  +
  +
Different multiple sequence alignments are then applied on these three groups (the reference sequence P04062 is also included for each group). Then we get different alignment results which are visualized by using Jalview:
  +
  +
<br/>
  +
  +
*Group 1 with sequence identity 27% - 56% <br/>
  +
  +
**ClustalW
  +
  +
[[Image:msa_group1_clustal.png|thumb|1200px|center|Alignment of group1 created with Clustal]]
  +
  +
**Muscle
  +
  +
[[Image:msa_group1_muscle.png|thumb|1200px|center|Alignment of group1 created with Muscle]]
  +
  +
**T-Coffee
  +
  +
[[Image:msa_group1_tcoffee.png|thumb|1200px|center|Alignment of group1 created with T-Coffee]]
  +
  +
**3dcoffee
  +
  +
[[Image:msa_group1_3dcoffee.png|thumb|1200px|center|Alignment of group1 created with 3dcoffee]]
  +
  +
<br/>
  +
  +
*Group 2 with sequence identity 86% - 99%
  +
  +
**ClustalW
  +
  +
[[Image:msa_group2_clustal.png|thumb|1200px|center|Alignment of group2 created with Clustal]]
  +
  +
**Muscle
  +
  +
[[Image:msa_group2_muscle.png|thumb|1200px|center|Alignment of group2 created with Muscle]]
  +
  +
**T-Coffee
  +
  +
[[Image:msa_group2_tcoffee.png|thumb|1200px|center|Alignment of group2 created with T-Coffee]]
  +
  +
**3dcoffee
  +
  +
[[Image:msa_group2_3dcoffee.png|thumb|1200px|center|Alignment of group2 created with 3dcoffee]]
  +
  +
<br/>
  +
  +
*Group 3 with full sequence identity 27% - 99%
  +
  +
**ClustalW
  +
  +
[[Image:msa_group3_clustal1.png|thumb|1200px|center|Alignment of group3 created with Clustal]]
  +
  +
**Muscle
  +
  +
[[Image:msa_group3_muscle1.png|thumb|1200px|center|Alignment of group3 created with Muscle]]
  +
  +
**T-Coffee
  +
  +
[[Image:msa_group3_tcoffee1.png|thumb|1200px|center|Alignment of group3 created with T-Coffee]]
  +
  +
**3dcoffee
  +
  +
[[Image:msa_group3_3dcoffee.png|thumb|1200px|center|Alignment of group3 created with 3dcoffee]]
  +
  +
<br/>
  +
  +
In each group, all the alignment tools return overall similar results. We can hardly find difference between the results of T-Coffee and that of 3dcoffee. It implies that for the performance of 3dcoffee, our own template of pdb can contribute very little.
  +
  +
Based on the above multiple sequence alignment, the conserved columns and gaps can be calculated and shown in <xr id="conser_gap"/> , where the conserved columns are listed at different ranges (the conservation degree as shown in Jalview ):
  +
  +
<figtable id="conser_gap">
  +
{| border="1" style="text-align:center; border-spacing:0;"
  +
|-
  +
|colspan="1"|''''''
  +
|colspan="4"|'''group1 with identity < 56%'''
  +
|colspan="4"|'''group2 with identity > 88%'''
  +
|colspan="4"|'''group3 with full range identity'''
  +
|-
  +
!rowspan="2"|'''Alignment tools'''
  +
|colspan="2"|'''Conserved Columns'''
  +
|colspan="2"|'''Gaps'''
  +
|colspan="2"|'''Conserved Columns'''
  +
|colspan="2"|'''Gaps'''
  +
|colspan="2"|'''Conserved Columns'''
  +
|colspan="2"|'''Gaps'''
  +
|-
  +
|'''>=10'''
  +
|'''>=8'''
  +
|'''Gaps(consensus)'''
  +
|'''Gaps(reference)'''
  +
|'''>=10'''
  +
|'''>=8'''
  +
|'''Gaps(consensus)'''
  +
|'''Gaps(reference)'''
  +
|'''>=10'''
  +
|'''>=8'''
  +
|'''Gaps(consensus)'''
  +
|'''Gaps(reference)'''
  +
|-
  +
|ClustalW
  +
|40
  +
|94
  +
|127
  +
|107
  +
|202
  +
|230
  +
|0
  +
|0
  +
|65
  +
|136
  +
|10
  +
|17
  +
|-
  +
|Muscle
  +
|45
  +
|104
  +
|136
  +
|111
  +
|198
  +
|230
  +
|0
  +
|0
  +
|68
  +
|140
  +
|18
  +
|20
  +
|-
  +
|T-Coffee
  +
|44
  +
|104
  +
|145
  +
|126
  +
|202
  +
|235
  +
|0
  +
|0
  +
|67
  +
|138
  +
|19
  +
|23
  +
|-
  +
|3dCoffee
  +
|44
  +
|102
  +
|144
  +
|126
  +
|202
  +
|235
  +
|0
  +
|0
  +
|67
  +
|138
  +
|19
  +
|23
  +
|}
  +
</figtable>
  +
'''Table 1''': Conserved columns and gaps in different multiple alignment tools
  +
  +
==== Functionally important residues ====
  +
we can find the corresponding functionally important sites with annotation (active sites) from Uniprot[http://www.uniprot.org/uniprot/P04062]. There are two sites, locate at position 274, 379 respectively. :
  +
  +
Active site 274 Proton donor
  +
Active site 379 Nucleophile
  +
  +
Check them at different alignments and we find that both of them 100% conserved in the alignment. It is easy to understand since the most selected sequence have similar function annotation.
  +
  +
==== Gaps in secondary structure ====
  +
  +
We can find the secondary structure annotation under uniprot[http://www.uniprot.org/uniprot/P04062]. We count the number of gaps appeared in each secondary element in different alingments. Here we only look at the alignments from group3 with the whole range identity.
  +
  +
<figtable id="gap_secend">
  +
{| border="1" style="text-align:center; border-spacing:0;"
  +
! Secondary Structure Element || position || length || ClustalW || Muscle || T-Coffee || 3dcoffee
  +
|-
  +
|Beta strand|| 49-52 || 4 || 0 || 0 || 0 || 0
  +
|-
  +
|Beta strand || 54-60 || 7 || 0 || 0 || 0 || 0
  +
|-
  +
|Beta strand|| 75-82 || 8 || 0 || 0 || 0 || 0
  +
|-
  +
|Beta strand||88-94 || 7 || 0 || 0 || 0 || 1
  +
|-
  +
|Beta strand|| 96-98 || 3 || 0 || 4 || 4 || 4
  +
|-
  +
|Beta strand || 103-116|| 14 || 4 || 0 || 0 || 0
  +
|-
  +
|Beta strand|| 119-123 || 5 || 0 || 0 || 0 || 0
  +
|-
  +
|Helix || 126-132 || 7 || 0 || 0 || 0 || 0
  +
|-
  +
|Helix|| 137-148 || 12 || 0 || 0 || 0 || 0
  +
|-
  +
|Turn || 150-153 || 4 || 0 || 1 || 1 || 1
  +
|-
  +
|Beta strand || 157-163 || 7 || 0 || 0 || 0 || 0
  +
|-
  +
|Beta strand || 166-170 || 5 || 0 || 0 || 0 || 0
  +
|-
  +
|Beta strand || 177-179 || 3 || 0 || 0 || 0 || 0
  +
|-
  +
|Helix || 190-193 || 4 || 0 || 0 || 0 || 0
  +
|-
  +
|Helix || 196-206 || 11 || 0 || 0 || 0 || 0
  +
|-
  +
|Beta strand || 212-218 || 7 || 0 || 0 || 0 || 0
  +
|-
  +
|Helix || 222-224 || 0 || 3 || 0 || 0 || 0
  +
|-
  +
|Beta strand || 229-233 || 5 || 0 || 0 || 0 || 0
  +
|-
  +
|Beta strand || 235-238 || 4 || 0 || 0 || 0 || 0
  +
|-
  +
|Helix || 243-261 || 19 || 0 || 0 || 0 || 0
  +
|-
  +
|Beta strand || 267-271 || 5 || 0 || 0 || 0 || 0
  +
|-
  +
|Helix || 277-279 || 3 || 0 || 0 || 0 || 0
  +
|-
  +
|Helix || 292-301 || 10 || 0 || 0 || 0 || 0
  +
|-
  +
|Helix || 202-208 || 6 || 0 || 0 || 0 || 0
  +
|-
  +
|Turn || 311-314 || 4 || 0 || 2 || 1 || 1
  +
|-
  +
|Beta strand || 315-323 || 9 || 0 || 0 || 0 || 0
  +
|-
  +
|Helix || 324-326 || 3 || 0 || 0 || 0 || 0
  +
|-
  +
|Helix || 329-335 || 7 || 0 || 0 || 0 || 0
  +
|-
  +
|Helix || 338-341 || 4 || 0 || 0 || 0 || 0
  +
|-
  +
|Beta strand || 346-352 || 7 || 0 || 0 || 0 || 0
  +
|-
  +
|Helix || 359-369 || 11 || 0 || 0 || 0 || 0
  +
|-
  +
|Beta strand || 373-381 || 9 || 0 || 0 || 0 || 0
  +
|-
  +
|Helix || 396-411 || 16 || 0 || 0 || 0 || 0
  +
|-
  +
|Beta strand || 414-422 || 9 || 0 || 0 || 0 || 0
  +
|-
  +
|Beta strand || 440-444 || 5 || 0 || 0 || 0 || 0
  +
|-
  +
|Helix || 445-447 || 3 || 0 || 0 || 0 || 0
  +
|-
  +
|Beta strand || 449-452 || 4 || 0 || 0 || 0 || 0
  +
|-
  +
|Helix || 454-463 || 10 || 0 || 0 || 0 || 0
  +
|-
  +
|Beta strand || 471-479 || 9 || 0 || 3 || 4 || 4
  +
|-
  +
|Beta strand || 482-489 || 8 || 2 || 0 || 0 || 0
  +
|-
  +
|Beta strand || 495-501 || 7 || 0 || 0 || 0 || 0
  +
|-
  +
|Beta strand || 503-505 || 3 || 0 || 0 || 0 || 0
  +
|-
  +
|Beta strand || 507-513 || 7 || 0 || 2 || 0 || 1
  +
|-
  +
|Turn || 514-516 || 3 || 0 || 0 || 0 || 0
  +
|-
  +
|Beta strand || 517-523 || 7 || 1 || 0 || 2 || 2
  +
|-
  +
|Beta strand || 527-533 || 7 || 0 || 0 || 0 || 0
  +
|}
  +
</figtable>
  +
  +
'''Table 2''': Gaps in secondary structure element
  +
  +
In <xr id="gap_secend"/>, we can see the muscle, T-Coffee and 3dcoffe show gaps at the same position at most time while ClustalW not. It implies that ClustalW might not work well with structural information. At the secondary structure position Beta strand (96-98), turn (150-153), turn (311-314), Betastrand (411-479) and (517-523), the most alignment report gaps which deserves further investigation.
   
 
=== Discussion ===
 
=== Discussion ===
  +
  +
Four multiple sequence alignment tools ClustalW, Muscle, T-Coffee and 3dcoffee are used. As we can see from the screen shoot of all the alignments and the number of conserved columns and gaps shown in <xr id="conser_gap"/> , all the tools return similar alignments. Among them, ClustalW produces a bit less conserved columns which implies that ClustalW shows slightly weaker performance compared with other 3 tools. It is reasonable that the alignment from the sequences with low identity seem not good, with less conserved columns and more gaps, still the most functionally important sites are highly conserved and perfectly aligned.
  +
  +
We have found very little pdb structure available for our 3d-coffee alignment. Only two are found for the selected sequences and one for reference sequence. It shows very little improvement, i.e. 3dcoffee get almost the same results. Of course, change the options might give more help but in principle will not show big progress. If we really want to improve the performance by using structural information with 3dcoffe, the only possible way seems to associate more pdb structures to the sequences. An alternative way is to use Expresso[http://tcoffee.crg.cat/apps/tcoffee/do:expresso] which will automatically blast and fetch available pdb structures for the given sequences (fro our data set, Expresso can return alignment with notable difference. Such results will be added shortly later).
  +
  +
Intuitively a good alignment seems to be an alignment with more conserved columns and less gaps. But it is often that case, because the aligned sequence might be functionally related but evolutionary distant (or vice versa) and therefore more gaps will appear. Therefore a good alignment should be able to reflect the true evolutionary relationship of the given sequences. The potential functionally important sites which are highly conserved, should be aligned. Moreover, the alignment tools should be able to combine additional information, e.g., the pdb structure like 3dcoffee does, will much more improve the performance and might become favored tools.
  +
  +
  +
  +
  +
== References ==
  +
<references/>

Latest revision as of 12:45, 8 May 2012

Sequence searches

Introduction

The homology search tools BLAST, PSI-BLAST<ref name="blastpgp">Altschul, S; Gish, W; Miller, W; Myers, E; Lipman, D (1990). Basic local alignment search tool</ref>, and HHblits<ref name="hhblits">Remmert, M.; Biegert, A.; Hauser A.; Soeding, J. (2012). HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment</ref> are essential for many applications in bioinformatics. BLAST allows for a rapid identification of homologous sequences by simple sequence-to-sequence comparison. The iterative search tool PSI-BLAST builds upon BLAST but performs sequence-to-profile comparisons where the profile is derived from the hits of the previous search iteration. Only significant hits whose e-value is below a predefined cut-off (-h) are included into the profile. The higher the number of search iterations (-j) the more precise become the profile substitution scores and the more sensitive PSI-BLAST. HHblits likewise performs several search iterations but compares Hidden Markov Models (HMMs) instead of sequences to profiles which makes it even more sensitive. In this task, we compared BLAST to PSI-BLAST and HHblits (2 iterations) and investigated the impact of the number of PSI-BLAST iterations (-j 2, 10) and the e-value inclusion threshold (-h 2e-3, 1e-10) on the homology detection performance. Technical details are documented in our protocol.

Benchmark setting

Query sequence

We used the UniProt sequence P04062 as query for all sequence search tools:

>sp|P04062|GLCM_HUMAN Glucosylceramidase OS=Homo sapiens GN=GBA PE=1 SV=3
MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNAT
YCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGF
GGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPMASCDFSIRTYTYADTPDD
FQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQP
GDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIA
RDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAK
ATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHSIITNLLYHVVGWTDW
NLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQK
NDLDAVALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ

Databases

Two different databases were provided for BLAST/PSI-BLAST and BLAST:

BLAST/PSI-BLAST: big, big_80

Big is an unclustered database containing both all UniProtKB<ref name="uniprot">Uniprot, C. (2010). Ongoing and future developments at the Universal Protein Resource</ref> entries and some PDB entries (altogether 21375097 sequences). Big_80 was built by applying CD-HIT<ref name="cd-hit">Li, W.; Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences</ref> for clustering all UniprotKB sequences with a sequence identity above 80%. Big_80 therefore consists of only 7730747 clusters which are represented by a single cluster sequence. Due to the clustering of highly similar sequences, the sequence space can be searched faster without loosing much information.

HHblits: uniprot20_current

For carrying out HMM-to-HMM comparison, HHblits requires an A3M and an HMM database which are built by applying kClust<ref name="kclust">Mayer, C.; Soeding, J. (unpublished) kClust.</ref> for clustering similar sequences where each cluster is represented by a database A3M/HMM. Clusters of the uniprot20_current database (altogether 3129234) exhibit a maximum inter-cluster sequence identity of 20% which means that sequences with an identity above 80% are clustered just as in case of big_80. However, a cluster is not a single sequence but an alignment of all clustered sequences. For instance, the cluster >UP20|BABBAGIBA|3|240_consensus consists of 3 sequences.

Comparing results on big and uniprot20_current

In order to be able to compare the BLAST/PSI-BLAST hits on big/big_80 with the HHblits hits on uniprot20_current, we used following workaround:

  • n iterations PSI-BLAST were evaluated by performing n-1 iterations against big_80 and only the last search iteration against big. All PDB hits were ignored in the final result files such that only unclustered UniProtKB sequences (>tr, >sp) were considered. Alternatively, we might have performed all n search iterations against the full big database but which would have increased the computational costs considerably.
  • Two iterations HHblits were evaluated by performing two iterations against the uniprot20_current database and using the UniProtKB cluster members instead of the cluster representative as hits. For this, we translated each >UP20 cluster of the HHblits result file into the the corresponding >tr, >sp entries which were obtained from the uniprot20_02Sep11_a3m_db database file.

Hence we effectively compared all search tools on the full UniProtKB database. The output parameters were adjusted such that all hits with an e-value < 10 were listed which were used for the subsequent analysis.

Runtime

<figure id="fig:search_runtime_big_80">

Runtime in seconds for searching big_80 (BLAST/PSI-BLAST) and uniprot20_current (HHblits).

</figure> <xr id="fig:search_runtime_big_80"/> shows the runtime of the different search methods measured on the server i12k-biolab01. The BLAST/PSI-BLAST measurements refer to the big_80 database whereas the HHblits measurement refers to the uniprot20_current database. Since the size of these databases differ (see above), the runtime of BLAST/PSI-BLAST is not directly comparable to the runtime of HHblits. One also has to consider that the server i12k-biolab01 carries out processes of multiple users which might have led to fluctuations of the time measurements. Nevertheless, figure shows that the runtime increased proportionally to the number of search iterations. Furthermore, the runtime increased by 10% when using an inclusion threshold of 2e-3 instead of 1e-10 for 2 iterations and by even 28% for 10 iterations. This is due to the fact that a more liberal inclusion threshold increases the number of hits which are used for computing the PSSM of the next iteration which causes a higher runtime. HMM-to-HMM comparisons carried out by HHblits are computationally more costly than comparing sequences or sequences to profiles. The runtime of HHblits is therefore higher than the runtime of BLAST and PSI-BLAST although the query HMM is only compared to a small fraction of database HMMs which pass two pre-filter steps.

Number of significant hits

<figure id="fig:search_nhits_big">

Number of significant hits (e-value < 1e-3) found by searching UniProtKB.

</figure> <xr id="fig:search_nhits_big"/> shows the number of significant UniprotKb hits (evalue < 1e-3) detected by BLAST/PSI-BLAST and HHblits. Iterative search tools such as PSI-BLAST and HHblits allow for a higher sensitivity in detecting remote homologous sequences due to position specific substitution scores derived from hits of previous search iterations. The more search iterations are carried out the more diverse the position specific substitution scores become and the more distantly related sequences with a low sequence similarity can be recognised. This accounts why 10 iterations led to more hits than two iterations. However, if the inclusion threshold is very strict (1e-10), only highly significant hits with a high similarity are added to the PSSM. Such hits are normally already found in the first two iterations such that further iterations do not yield more hits. Two rounds HHblits led to less significant hits than two rounds PSI-BLAST. Since HHblits is much better in discriminating true positives from false positives, this indicates that PSI-BLAST reports more significant false positives than HHblits.

Overlapping hits

We investigated the overlap of BLAST/PSI-BLAST hits depending on (1) the number of search iterations -j and (2) the inclusion threshold -h as well as (3) the overlap of BLAST/PSI-BLAST hits with HHblits hits.

  1. As described in the previous section, more hits are found when performing more iterations (<xr id="fig:search_overlap_blastpgp_big_j"/>). However, not all hits detected after two iterations were still found after ten iterations. This demonstrates that the refinement of position specific substitution scores helps to exclude false positives which are considered as significant after only one iteration.
  2. By increasing the inclusion threshold, PSI-BLAST found new significant hits which might be true positives but also false positives. As expected, after two iterations all hits found by using -h 1-e10 are also found by using the more relaxed threshold of -h 2e-3 (<xr id="fig:search_overlap_blastpgp_big_h"/>). Since more sequences are included into the PSSM when using -h 2e-3 instead of 1e-10, the PSSMs differ which accounts that not all hits detected by using -h 2-e3 are also detected by using -h 1e-10.
  3. Two iterations HHblits yielded 59 hits which were not detected by two iterations of PSI-BLAST (<xr id="fig:search_overlap_hhblits_big_j2"/>). These are probably true homologous sequences which could by inferred from the additional information about homology captured in the HMMs.
</figure> </figure> </figure>
<figure id="fig:search_overlap_blastpgp_big_j">
Venn diagram of overlapping significant hits (e-value < 1e-3) of PSI-BLAST -j 2 and PSI-BLAST -j 10.
<figure id="fig:search_overlap_blastpgp_big_h">
Venn diagram of overlapping significant hits (e-value < 1e-3) of PSI-BLAST -h 2e-3 and PSI-BLAST -h 1e-10.
<figure id="fig:search_overlap_hhblits_big_j2">
Venn diagram of overlapping significant hits (e-value < 1e-3) of two iterations PSI-BLAST and two iterations HHblits.

Distribution of the e-value

</figure> </figure>
<figure id="fig:search_dist_evalue_big">
Relative frequency of hits a particular e-value.
<figure id="fig:search_dist_id_big">
Relative frequency of hits with a particular sequence identity.

<xr id="fig:search_dist_evalue_big"/> depicts the distributions of e-values of the different methods whereby only hits with an e-value < 10 were taken into account. Altogether, only a small fraction of database sequences obtained an e-value smaller than 1e-3. The iterative methods tend to report lower e-values since the e-value of hits which are added to the PSSM becomes lower the more search iterations are carried out. Likewise, a low inclusion threshold produces highly significant hits. Note that the reported e-value does not have to coincide with the actual e-value: the e-value reported by iterative search tools is in most cases two optimistic, i.e. more change hits actually occur than states by the reported e-value.

Distribution of the sequence identity

<xr id="fig:search_dist_id_big"/> shows the relative frequency of significant hits (e-evalue < 1e-3) which were found within different ranges of sequence identity. The sequence identity of most hist was either greater than 0.75 or it was between 0.1 and 0.4. Close homologous sequences with an identity above 0.75 could easily be detected by all search methods. Detecting more distantly related sequences within the twilight zone (identity < 0.3) is, however, a much harder task and requires iterative search methods such as HHblits. This accounts why the fractions of hits reported by BLAST was higher for an identity >0.8 but lower for an identity below 0.3. Another interesting observation is that performing a high number of PSI-BLAST iterations with a liberal e-value (-j 10, -h 2e-3) led to many significant hits with a very low identity. This shows that most hits were probably false positives and that an inclusion threshold of 2e-3 is too high.

Validating the hit list

</figure> </figure>
<figure id="fig:search_val_roc">
ROC performance of BLAST,PSI_BLAST_j2_h2e-3,HHblits_n2 on a SCOP benchmark set.
<figure id="fig:search_val_overlap_pdb">
Overlapping PDB hits of the big database.

The sensitivity of homology search tools is normally estimated by a ROC plot analysis on a dataset of proteins whose whose evolutionary relationship is known. Often the SCOP<ref name="scop">Loredana Lo Conte, Bart Ailey, Tim J. P. Hubbard, Steven E. Brenner,Alexey G. Murzin, Cyrus Chothia (2000). SCOP: a Structural Classification of Proteins database.</ref> database is used as benchmark set which subdivides proteins into different classes, folds, superfamilies, and families. Proteins belonging to the same superfamily are defined as true positives (TP) whereas proteins of different folds are treated as false positives (FP). Based upon this classification scheme, a ROC plot can be derived which shows the number of TP which could be found until a certain number of FP. The larger the area under the TP vs. FP curve, the more sensitive is the homology search tool.<ref name="benchmark">Soeding, J.; Remmert, M. (2011) Protein sequence comparison and fold recognition: progress and good-practice benchmarking</ref> <xr id="fig:search_val_roc"/> depicts such a ROC plot which I created for my bachelor thesis. It underlines the higher sensitivity of HHblits compared to PSI-BLAST and BLAST.
Since there was no common database for BLAST and HHblits with proteins of known evolutionary relationship available, we limited the validation on the PDB entries available in the big database. <xr id="fig:search_val_overlap_pdb"/> shows the number of common PDB entries found by the different BLAST/PSI-BLAST runs. 66 entries were detected by all five runs which are most likely true positives. There were 30 and 138 entries which were exclusively found by PSI-BLAST_j10_he1-10 and PSI-BLAST_j10_he2-3, respectively. We manually investigated the PDB annotations of some of these entries and found that PSI-BLAST_j10_he2-3 produced more hits than PSI-BLAST_j10_he1-10 which are not related to the query structure 1OGS. We assume that doing a higher number of search iterations with a liberal inclusion threshold increases the risk of reporting significant false positives: if a particular false positive is once added to the PSSM, more false positives will be found in the subsequent iterations which are homologous to that false positive.

Multiple sequence alignment

In this multiple sequence alignments task, We should at first create a data set containing choose 20 relevant sequences from the sequence searches. Then different multiple alignments methods,i.e., ClustalW, Muscle, T-Coffee will be applied. The resulted alignments will be viewed by using Jalview and further analysis work will be done based on the observation.

Methods

Tools

ClustalW

Clustal is a classic widely used multiple sequence alignment computer program[1]. For the given sequences, it generates at first the pairwise alignment, then creates a phylogenetic tree and Uses the phylogenetic tree to carry out a multiple alignment. ClustalW is its command line interface.

Here is the usage:

clustalw -infile=msa.fasta -outfile=msa_clustal.aln
MUSCLE

Muscle (MUltiple Sequence Comparison by Log- Expectation) is an iteration-based method. Elements of its algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function clled the log-expectation score, and refinement using tree-dependent restricted partitioning. It is reorted to be able to get better accuracy and speed than ClustalW2 or T-Coffee[2].

Here is the usage:

/mnt/opt/T-Coffee/bin/muscle -in msa.fasta -out msa_muscle.aln

The installed tools on the student machines (/mnt/opt/T-Coffee/bin/) do not seem to work well(buffer overflow reported). We can simply use the web server on EMBL[3] .

T-Coffee

T-Coffee (Tree-based Consistency Objective Function For alignment Evaluation) is a progressive approach which combines results obtained with several alignment methods. 3D-Coffe can use structural information from PDB files[4].

Here is the usage for T-Coffee:

/mnt/opt/T-Coffee/bin/t_coffee msa.fasta -special_mode accurate

Here is the usage for 3dcoffe with our own template files:

/mnt/opt/T-Coffee/bin/t_coffee -seq msa.fasta -template_file template_list -method sap_pair 

Here the option "-method sap_pair" specifies that sap algorithm for structural alignment will be used. And the option "-template_file template_list" the sequence/template association by using following format: >sequence_name1_P_PDBID1, for example, in our case, this template_list should contains:

>tr|H2Q075|H2Q075_PANTR _P_ 2vt0A
>tr|Q9KIJ7|Q9KIJ7_SALTY _P_ 2wnwA
>sp|P04062|GLCM_HUMAN _P_ 2v3d

Results

Sequences chosen for the multiple Alignment

We Choose 20 sequences from 4 identity ranges and try to take those from different organisms. We try to avoid taking the sequence without annotation. We find that there are very little candidate sequences at the identity range 50% - 80%. The most available sequences at that range are the sequences without annotations, therefore are not chosen. Two exceptions are “E7EZM1” and “Q4RID9”, the first is an Uncharacterized protein and the later is a whole genome shotgun sequence from Tetraodon nigroviridis (Spotted green pufferfish). They are chosen because they get high scores in identity range 50% - 60%. and the aligments might help to find underlying function of such proteins. Fortunately, 2 pdb "2vt0" and "2wnw", one is associated to "H2Q075" with 98% identity , one is associated to "Q9KIJ7" with 30% identity are involved.

The following table shows the detailed information about the chosen sequences, where P04062 is the reference sequence:

Uniprot/PDB ID annotation Organism Identity
reference sequence
P04062 Glucosylceramidase Homo sapiens 100%
99 - 90% sequence identity
D3DV87 Glucosidase, beta Homo sapiens 99%
Q9BDT0 Glucosylceramidase Pan troglodytes (Chimpanzee) 99%
Q5R8E3 Glucosylceramidase Pongo abelii 98%
F5H241 Glucosylceramidase Homo sapiens 98%
H2Q075 (pdb:2vt0) Glucosylceramidase Pan troglodytes (Chimpanzee) 98%
89 - 60% sequence identity
F1RLJ2 Glucosylceramidase Sus scrofa (Pig) 89%
F1N1D5 Glucosylceramidase Bos taurus (Bovine) 88%
G5BDF5 Glucosylceramidase (Fragment) Heterocephalus glaber (Naked mole rat) 87%
G3HC47 Glucosylceramidase Cricetulus griseus (Chinese hamster) 87%
P17439 Glucosylceramidase Mus musculus (Mouse) 86%
59 - 40% sequence identity
E7EZM1 Uncharacterized protein Danio rerio (Zebrafish) 56%
Q4RID9 Chromosome 8 whole genome shotgun sequence Tetraodon nigroviridis (Spotted green pufferfish) 54%
A8PTH1 O-Glycosyl hydrolase family 30 protein Brugia malayi (Filarial nematode worm) 48%
F1L2Y4 Glucosylceramidase Ascaris suum (Pig roundworm) 45%
B7PK21 Beta-glucocerebrosidase Ixodes scapularis (Black-legged tick) 44%
39 - 20% sequence identity
E2C4A6 Glucosylceramidase Harpegnathos saltator (Jerdon's jumping ant) 38%
G3IZH3 Glucosylceramidase Methylobacter tundripaludum SV96 34%
F7PUQ5 Glucosylceramidase Haloplasma contractile SSD-17B 32%
Q9KIJ7(pdb:2wnw) Activated by transcription factor SsrB,glycoside hydrolase family enzyme Salmonella typhimurium 30%
Q1IIZ7 Glucosylceramidase Koribacter versatilis (strain Ellin345) 27%


After having the data set, the next step, as told in the task website, three groups with different sequence identity should be generated: one contains only sequences with low sequence identity (<40%), one contains only sequences with high sequence identity (>60%) and one contains sequences covering the whole range of sequence identity. As it is not very clearly explained how the grouping should exactly be, we build up 3 groups as following: We should build up 3 groups, each with 10 sequences, then we will have:

  • the first group contains 10 sequences with identity 20%-59%(since there are only 5 sequences with identity <40% )
  • the second group contains another 10 sequences with identity 60%-99%
  • the third group contains 5 sequences from the first group and 5 sequences from the second group.

There is sequence duplication in group 3 as compared to group 1 and group 2. If it is not allowed, we should then choose at least another 10 sequences to build this group which results in additional work.

The following shows the grouping information:

  • group 1: E7EZM1, Q4RID9, A8PTH1, F1L2Y4, B7PK21, E2C4A6, G3IZH3, F7PUQ5, Q9KIJ7(pdb:2wnw), Q1IIZ7
  • group 2: D3DV87, Q9BDT0, Q5R8E3, F5H241, H2Q075(pdb:2vt0), F1RLJ2, F1N1D5, G5BDF5, G3HC47, P17439
  • group 3: D3DV87, Q5R8E3, H2Q075(pdb:2vt0), F1RLJ2, P17439, A8PTH1, B7PK21, E2C4A6, F7PUQ5, Q9KIJ7(pdb:2wnw)

Conserved Columns and Gaps

Different multiple sequence alignments are then applied on these three groups (the reference sequence P04062 is also included for each group). Then we get different alignment results which are visualized by using Jalview:


  • Group 1 with sequence identity 27% - 56%
    • ClustalW
Alignment of group1 created with Clustal
    • Muscle
Alignment of group1 created with Muscle
    • T-Coffee
Alignment of group1 created with T-Coffee
    • 3dcoffee
Alignment of group1 created with 3dcoffee


  • Group 2 with sequence identity 86% - 99%
    • ClustalW
Alignment of group2 created with Clustal
    • Muscle
Alignment of group2 created with Muscle
    • T-Coffee
Alignment of group2 created with T-Coffee
    • 3dcoffee
Alignment of group2 created with 3dcoffee


  • Group 3 with full sequence identity 27% - 99%
    • ClustalW
Alignment of group3 created with Clustal
    • Muscle
Alignment of group3 created with Muscle
    • T-Coffee
Alignment of group3 created with T-Coffee
    • 3dcoffee
Alignment of group3 created with 3dcoffee


In each group, all the alignment tools return overall similar results. We can hardly find difference between the results of T-Coffee and that of 3dcoffee. It implies that for the performance of 3dcoffee, our own template of pdb can contribute very little.

Based on the above multiple sequence alignment, the conserved columns and gaps can be calculated and shown in <xr id="conser_gap"/> , where the conserved columns are listed at different ranges (the conservation degree as shown in Jalview ):

<figtable id="conser_gap">

' group1 with identity < 56% group2 with identity > 88% group3 with full range identity
Alignment tools Conserved Columns Gaps Conserved Columns Gaps Conserved Columns Gaps
>=10 >=8 Gaps(consensus) Gaps(reference) >=10 >=8 Gaps(consensus) Gaps(reference) >=10 >=8 Gaps(consensus) Gaps(reference)
ClustalW 40 94 127 107 202 230 0 0 65 136 10 17
Muscle 45 104 136 111 198 230 0 0 68 140 18 20
T-Coffee 44 104 145 126 202 235 0 0 67 138 19 23
3dCoffee 44 102 144 126 202 235 0 0 67 138 19 23

</figtable> Table 1: Conserved columns and gaps in different multiple alignment tools

Functionally important residues

we can find the corresponding functionally important sites with annotation (active sites) from Uniprot[5]. There are two sites, locate at position 274, 379 respectively. :

Active site 274 Proton donor	
Active site 379 Nucleophile

Check them at different alignments and we find that both of them 100% conserved in the alignment. It is easy to understand since the most selected sequence have similar function annotation.

Gaps in secondary structure

We can find the secondary structure annotation under uniprot[6]. We count the number of gaps appeared in each secondary element in different alingments. Here we only look at the alignments from group3 with the whole range identity.

<figtable id="gap_secend">

Secondary Structure Element position length ClustalW Muscle T-Coffee 3dcoffee
Beta strand 49-52 4 0 0 0 0
Beta strand 54-60 7 0 0 0 0
Beta strand 75-82 8 0 0 0 0
Beta strand 88-94 7 0 0 0 1
Beta strand 96-98 3 0 4 4 4
Beta strand 103-116 14 4 0 0 0
Beta strand 119-123 5 0 0 0 0
Helix 126-132 7 0 0 0 0
Helix 137-148 12 0 0 0 0
Turn 150-153 4 0 1 1 1
Beta strand 157-163 7 0 0 0 0
Beta strand 166-170 5 0 0 0 0
Beta strand 177-179 3 0 0 0 0
Helix 190-193 4 0 0 0 0
Helix 196-206 11 0 0 0 0
Beta strand 212-218 7 0 0 0 0
Helix 222-224 0 3 0 0 0
Beta strand 229-233 5 0 0 0 0
Beta strand 235-238 4 0 0 0 0
Helix 243-261 19 0 0 0 0
Beta strand 267-271 5 0 0 0 0
Helix 277-279 3 0 0 0 0
Helix 292-301 10 0 0 0 0
Helix 202-208 6 0 0 0 0
Turn 311-314 4 0 2 1 1
Beta strand 315-323 9 0 0 0 0
Helix 324-326 3 0 0 0 0
Helix 329-335 7 0 0 0 0
Helix 338-341 4 0 0 0 0
Beta strand 346-352 7 0 0 0 0
Helix 359-369 11 0 0 0 0
Beta strand 373-381 9 0 0 0 0
Helix 396-411 16 0 0 0 0
Beta strand 414-422 9 0 0 0 0
Beta strand 440-444 5 0 0 0 0
Helix 445-447 3 0 0 0 0
Beta strand 449-452 4 0 0 0 0
Helix 454-463 10 0 0 0 0
Beta strand 471-479 9 0 3 4 4
Beta strand 482-489 8 2 0 0 0
Beta strand 495-501 7 0 0 0 0
Beta strand 503-505 3 0 0 0 0
Beta strand 507-513 7 0 2 0 1
Turn 514-516 3 0 0 0 0
Beta strand 517-523 7 1 0 2 2
Beta strand 527-533 7 0 0 0 0

</figtable>

Table 2: Gaps in secondary structure element

In <xr id="gap_secend"/>, we can see the muscle, T-Coffee and 3dcoffe show gaps at the same position at most time while ClustalW not. It implies that ClustalW might not work well with structural information. At the secondary structure position Beta strand (96-98), turn (150-153), turn (311-314), Betastrand (411-479) and (517-523), the most alignment report gaps which deserves further investigation.

Discussion

Four multiple sequence alignment tools ClustalW, Muscle, T-Coffee and 3dcoffee are used. As we can see from the screen shoot of all the alignments and the number of conserved columns and gaps shown in <xr id="conser_gap"/> , all the tools return similar alignments. Among them, ClustalW produces a bit less conserved columns which implies that ClustalW shows slightly weaker performance compared with other 3 tools. It is reasonable that the alignment from the sequences with low identity seem not good, with less conserved columns and more gaps, still the most functionally important sites are highly conserved and perfectly aligned.

We have found very little pdb structure available for our 3d-coffee alignment. Only two are found for the selected sequences and one for reference sequence. It shows very little improvement, i.e. 3dcoffee get almost the same results. Of course, change the options might give more help but in principle will not show big progress. If we really want to improve the performance by using structural information with 3dcoffe, the only possible way seems to associate more pdb structures to the sequences. An alternative way is to use Expresso[7] which will automatically blast and fetch available pdb structures for the given sequences (fro our data set, Expresso can return alignment with notable difference. Such results will be added shortly later).

Intuitively a good alignment seems to be an alignment with more conserved columns and less gaps. But it is often that case, because the aligned sequence might be functionally related but evolutionary distant (or vice versa) and therefore more gaps will appear. Therefore a good alignment should be able to reflect the true evolutionary relationship of the given sequences. The potential functionally important sites which are highly conserved, should be aligned. Moreover, the alignment tools should be able to combine additional information, e.g., the pdb structure like 3dcoffee does, will much more improve the performance and might become favored tools.



References

<references/>