Difference between revisions of "Sequence Alignments Gaucher Disease"

From Bioinformatikpedia

Revision as of 10:00, 6 May 2012

Sequence searches

Introduction

The homology search tools BLAST, PSI-BLAST, and HHblits are essential for many applications in bioinformatics. BLAST rapidly identifies homologous sequences by simple sequence-to-sequence comparison. The iterative search tool PSI-BLAST builds upon BLAST but performs sequence-to-profile comparisons, where the profile is derived from the hits of the previous search iteration. Only significant hits whose e-value is below a predefined cut-off (-h) are included in the profile. The more search iterations are performed (-j), the more precise the profile substitution scores become and the more sensitive PSI-BLAST is. HHblits likewise performs several search iterations but compares Hidden Markov Models (HMMs) with each other instead of sequences with profiles, which makes it even more sensitive. In this task, we compared BLAST to PSI-BLAST and HHblits (2 rounds) and investigated the impact of the number of PSI-BLAST iterations (-j 2, 10) and the e-value inclusion threshold (-h 2e-3, 1e-10) on the homology detection performance.
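The parameter combinations examined here can be enumerated with a short script. The following is a minimal sketch, assuming the legacy `blastpgp` executable is used for PSI-BLAST; only `-j` and `-h` are named in the text, while `-i`, `-d`, `-e`, `-o`, the query file name and the output file names are assumptions made for illustration.

```python
from itertools import product

# Hypothetical file and database names, for illustration only.
QUERY = "P04062.fasta"
DATABASE = "big_80"


def psiblast_command(iterations, inclusion_evalue, report_evalue=10):
    """Build a legacy blastpgp argument list for one parameter combination."""
    out = f"psiblast_j{iterations}_h{inclusion_evalue}.out"
    return [
        "blastpgp",
        "-i", QUERY,                  # query sequence in FASTA format (assumed flag)
        "-d", DATABASE,               # database to search (assumed flag)
        "-j", str(iterations),        # number of search iterations
        "-h", str(inclusion_evalue),  # e-value cut-off for profile inclusion
        "-e", str(report_evalue),     # e-value cut-off for reported hits (assumed flag)
        "-o", out,                    # output file (assumed flag)
    ]


# All four combinations of iteration count and inclusion threshold.
commands = [
    psiblast_command(j, h) for j, h in product((2, 10), ("2e-3", "1e-10"))
]

for cmd in commands:
    print(" ".join(cmd))
```

The commands are only printed, not executed, so the sketch can be inspected without a BLAST installation.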

Benchmark setting

Query sequence

We used the UniProt sequence P04062 as the query for all sequence search tools:

>sp|P04062|GLCM_HUMAN Glucosylceramidase OS=Homo sapiens GN=GBA PE=1 SV=3
MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNAT
YCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGF
GGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPMASCDFSIRTYTYADTPDD
FQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQP
GDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIA
RDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAK
ATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHSIITNLLYHVVGWTDW
NLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQK
NDLDAVALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ

Databases

Different databases were provided for BLAST/PSI-BLAST and for HHblits:

BLAST/PSI-BLAST: big, big_80

Big is an unclustered database containing all UniProtKB entries and some PDB entries (21375097 sequences altogether). Big_80 was built by applying CD-HIT to cluster all UniProtKB sequences with a sequence identity above 80%. Big_80 therefore consists of only 7730747 clusters, each of which is represented by a single cluster sequence. Because highly similar sequences are clustered, the sequence space can be searched faster without losing much information.
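The size reduction achieved by the clustering can be estimated from the two counts given above. This is only a rough sketch: big also contains some PDB entries, whereas big_80 was clustered from the UniProtKB sequences, so the ratio is an approximation.

```python
BIG_SEQUENCES = 21375097   # sequences in big (UniProtKB plus some PDB entries)
BIG_80_CLUSTERS = 7730747  # cluster representatives in big_80

# Fraction of entries that remain after clustering at 80% sequence identity.
fraction = BIG_80_CLUSTERS / BIG_SEQUENCES

print(f"big_80 contains {fraction:.1%} as many entries as big")
```

In other words, a search against big_80 has to scan only a little more than a third as many sequences as a search against big.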

HHblits: uniprot20_current

To carry out HMM/HMM comparisons, HHblits requires an A3M and an HMM database. These are built by applying kClust to cluster similar sequences, with each cluster represented by one database A3M/HMM. The clusters of the uniprot20_current database (3129234 altogether) exhibit a maximum inter-cluster sequence identity of 20%. Any sequences more similar than this, including those above the 80% threshold used for big_80, therefore end up in the same cluster. However, a cluster is not a single sequence but an alignment of all clustered sequences. For instance, the cluster '>UP20|BABBAGIBA|3|240_consensus' consists of 3 sequences.
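The number of sequences in a cluster can be read directly from the cluster name. The following is a minimal sketch, assuming that all cluster names follow the four-field layout of the example above and that the third field is the member count, as the example suggests; the last field is kept verbatim and not interpreted.

```python
from typing import NamedTuple


class ClusterHeader(NamedTuple):
    database: str
    cluster_id: str
    n_sequences: int
    remainder: str  # trailing field, kept verbatim and not interpreted here


def parse_cluster_header(header: str) -> ClusterHeader:
    """Split a header such as '>UP20|BABBAGIBA|3|240_consensus' into fields."""
    fields = header.lstrip(">").split("|")
    if len(fields) != 4:
        raise ValueError(f"unexpected cluster header: {header!r}")
    database, cluster_id, n_sequences, remainder = fields
    return ClusterHeader(database, cluster_id, int(n_sequences), remainder)


cluster = parse_cluster_header(">UP20|BABBAGIBA|3|240_consensus")
print(cluster.cluster_id, cluster.n_sequences)  # prints "BABBAGIBA 3"
```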

Comparing results on big and uniprot20_current

To compare the BLAST/PSI-BLAST hits on big/big_80 with the HHblits hits on uniprot20_current, we used the following workaround:

  • PSI-BLAST runs with n iterations were evaluated by performing n-1 iterations against big_80 and only the last search iteration against big. All PDB hits were ignored in the final result files so that only unclustered UniProtKB sequences (>tr, >sp) were considered. Alternatively, we could have performed all n search iterations against the full big database, but this would have increased the computational costs considerably.
  • HHblits runs were evaluated by performing two iterations against the uniprot20_current database and using the UniProtKB cluster members instead of the cluster representatives as hits. For this, we translated each >UP20 cluster of the HHblits result files into the corresponding >tr, >sp entries, which can be obtained from the uniprot20_02Sep11_a3m_db database file.
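The two post-processing steps of this workaround can be sketched as follows. This is an illustration under assumptions, not the scripts actually used: the mapping from cluster names to member entries is assumed to have been extracted from the A3M database file beforehand, and all identifiers below (including the `pdb|` prefix) are made up.

```python
def is_uniprot_hit(hit_id):
    """Keep only UniProtKB entries (TrEMBL 'tr|' or Swiss-Prot 'sp|')."""
    return hit_id.lstrip(">").startswith(("tr|", "sp|"))


def expand_clusters(cluster_hits, members_by_cluster):
    """Replace each cluster hit by its UniProtKB members.

    Hit order is preserved and each member is reported only once.
    """
    seen = set()
    expanded = []
    for cluster_id in cluster_hits:
        for member in members_by_cluster.get(cluster_id, []):
            if is_uniprot_hit(member) and member not in seen:
                seen.add(member)
                expanded.append(member)
    return expanded


# Made-up identifiers, for illustration only.
members_by_cluster = {
    "CLUSTER_A": ["sp|X00001|EXAMPLE1", "tr|X00002|EXAMPLE2"],
    "CLUSTER_B": ["tr|X00003|EXAMPLE3", "sp|X00001|EXAMPLE1"],
}
hhblits_hits = ["CLUSTER_A", "CLUSTER_B"]
blast_hits = ["sp|X00001|EXAMPLE1", "pdb|1abc|A", "tr|X00003|EXAMPLE3"]

hhblits_members = expand_clusters(hhblits_hits, members_by_cluster)
blast_uniprot = [hit for hit in blast_hits if is_uniprot_hit(hit)]

print(hhblits_members)
print(blast_uniprot)
```

After both steps, the hit lists of all tools refer to individual UniProtKB entries and can be compared as plain sets of identifiers.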

Hence, we effectively compared all search tools on the full UniProtKB database. The output parameters were adjusted so that all hits with an e-value < 10 were listed; these hits were used for the subsequent comparisons.

Runtime

The figure shows the runtime of the different search methods measured on the server i12k-biolab01. The BLAST/PSI-BLAST measurements refer to the big_80 database, whereas the HHblits measurement refers to the uniprot20_current database. Since the sizes of these databases differ (see above), the runtime of BLAST/PSI-BLAST is not directly comparable to the runtime of HHblits. One also has to consider that the server i12k-biolab01 runs processes of multiple users, which leads to fluctuations in the time measurements. Nevertheless, the figure shows that the runtime increases proportionally to the number of search iterations. Furthermore, the runtime increased by 10% when using an inclusion threshold of 2e-3 instead of 1e-10 for 2 iterations, and by as much as 28% for 10 iterations. The reason is that a more liberal inclusion threshold increases the number of hits used for computing the PSSM of the next iteration, which leads to a higher runtime. The HMM-to-HMM comparisons carried out by HHblits are computationally more costly than sequence-to-sequence or sequence-to-profile comparisons. The runtime of HHblits is therefore higher than that of BLAST and PSI-BLAST, even though the query HMM is only compared to the small fraction of database HMMs that pass two pre-filter steps.
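The percentages quoted above are relative increases over the run with the stricter threshold. A minimal sketch of that calculation is given below; the timings are placeholder values chosen for illustration, not the measurements from the figure.

```python
def relative_increase(baseline_seconds, new_seconds):
    """Return the runtime increase of a run over its baseline in percent."""
    if baseline_seconds <= 0:
        raise ValueError("baseline runtime must be positive")
    return (new_seconds - baseline_seconds) / baseline_seconds * 100.0


# Placeholder timings in seconds, NOT measured values.
runtime_strict = 200.0   # e.g. inclusion threshold 1e-10
runtime_liberal = 220.0  # e.g. inclusion threshold 2e-3

increase = relative_increase(runtime_strict, runtime_liberal)
print(f"runtime increase: {increase:.1f}%")
```

Because the server is shared between users, it may be worth averaging several runs per setting before computing such a ratio.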



Multiple sequence alignment

Methods (lab journal)

Sequences chosen for the multiple alignment

Results

Discussion