Difference between revisions of "Sequence Alignments Gaucher Disease"

From Bioinformatikpedia
(Tools)
(ClustalW)
Line 206: Line 206:
   
 
Here is the usage:
 
Here is the usage:
clustalw -infile=msa.fa -outfile=msaout.aln
+
clustalw -infile=msa.fa -outfile=msaout.aln
   
 
=== Results ===
 
=== Results ===

Revision as of 16:05, 6 May 2012

Sequence searches

Introduction

The homology search tools BLAST, PSI-BLAST, and HHblits are essential for many applications in bioinformatics. BLAST allows for a rapid identification of homologous sequences by simple sequence-to-sequence comparison. The iterative search tool PSI-BLAST builds upon BLAST but performs sequence-to-profile comparisons where the profile is derived from the hits of the previous search iteration. Only significant hits whose e-value is below a predefined cut-off (-h) are included into the profile. The higher the number of search iterations (-j) the more precise become the profile substitution scores and the more sensitive PSI-BLAST. HHblits likewise performs several search iterations but compares Hidden Markov Models (HMMs) instead of sequences to profiles which makes it even more sensitive. In this task, we compared BLAST to PSI-BLAST and HHblits (2 iterations) and investigated the impact of the number of PSI-BLAST iterations (-j 2, 10) and the e-value inclusion threshold (-h 2e-3, 1e-10) on the homology detection performance. Technical details are documented in our protocol.

Benchmark setting

Query sequence

We used the uniprot sequence P04062 as query for all sequence search tools:

>sp|P04062|GLCM_HUMAN Glucosylceramidase OS=Homo sapiens GN=GBA PE=1 SV=3
MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNAT
YCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGF
GGAMTDAAALNILALSPPAQNLLLKSYFSEEGIGYNIIRVPMASCDFSIRTYTYADTPDD
FQLHNFSLPEEDTKLKIPLIHRALQLAQRPVSLLASPWTSPTWLKTNGAVNGKGSLKGQP
GDIYHQTWARYFVKFLDAYAEHKLQFWAVTAENEPSAGLLSGYPFQCLGFTPEHQRDFIA
RDLGPTLANSTHHNVRLLMLDDQRLLLPHWAKVVLTDPEAAKYVHGIAVHWYLDFLAPAK
ATLGETHRLFPNTMLFASEACVGSKFWEQSVRLGSWDRGMQYSHSIITNLLYHVVGWTDW
NLALNPEGGPNWVRNFVDSPIIVDITKDTFYKQPMFYHLGHFSKFIPEGSQRVGLVASQK
NDLDAVALMHPDGSAVVVVLNRSSKDVPLTIKDPAVGFLETISPGYSIHTYLWRRQ

Databases

Two different databases were provided for BLAST/PSI-BLAST and BLAST:

BLAST/PSI-BLAST: big, big_80

Big is an unclustered database containing both all UniprotKb entries and some PDB entries (altogether 21375097 sequences). Big_80 was built by applying CD-HIT for clustering all UniprotKb sequences with a sequence identity above 80%. Big_80 therefore consists of only 7730747 clusters which are represented by a single cluster sequence. Due to the clustering of highly similar sequences, the sequence space can be searched faster without loosing much information.

HHblits: uniprot20_current

For carrying out HHM/HHM comparison, HHblits requires an A3M and an HMM database which are built by applying kClust for clustering similar sequences where each cluster is represented by a database A3M/HMM. Clusters of the uniprot20_current database (altogether 3129234) exhibit a maximum inter-cluster sequence identity of 20% which means that sequences with an identity above 80% are clustered just as in case of big_80. However, a cluster is not a single sequence but an alignment of all clustered sequences. For instance, the cluster '>UP20|BABBAGIBA|3|240_consensus' consists of 3 sequences.

Comparing results on big and uniprot20_current

In order to be able to compare the BLAST/PSI-BLAST hits on big/big_80 with the HHblits hits on uniprot20_current, we used following workaround:

  • n iterations PSI-BLAST were evaluated by performing n-1 iterations against big_80 and only the last search iterations against big. All PDB hits were ignored in the final result files such that only unclustered UniprotKb sequences (>tr, >sp) were considered. Alternatively, we might have performed all n search iterations against the full big database but which would have increased the computational costs considerably.
  • Two iterations HHblits were evaluated by performing two iterations against the uniprot20_current database and using the UniprotKb cluster members instead of the cluster representative as hits. For this, we translated each >UP20 cluster of the HHblits result files into the the corresponding >tr, >sp entries which be obtained from the uniprot20_02Sep11_a3m_db database file.

Hence we effectively compared all search tools on the full UniprotKb database. The output parameters were adjusted such that all hits with an e-value < 10 were listed which were used for the subsequent comparisons.

Runtime

<figure id="fig:search_runtime_big_80">

Runtime in seconds for searching big_80 (BLAST/PSI-BLAST) and uniprot20_current (HHblits).

</figure> <xr id="fig:search_runtime_big_80"/> shows the runtime of the different search methods measured on the server i12k-biolab01. The BLAST/PSI-BLAST measurements refer to the big_80 database whereas the HHblits measurement on the uniprot20_current database. Since the size of these databases differ (see above), the runtime of BLAST/PSI-BLAST is therefore not directly comparable to the runtime of HHblits. One also has to consider that the server i12k-biolab01 carries out processes of multiple users which leads to fluctuations of the time measurements. Nevertheless, figure shows that the runtime increases proportionally to the number of search iterations. Furthermore, the runtime increased by 10% when using an inclusion threshold of 2e-3 instead of 1e-10 for 2 iterations and by even 28% for 10 iterations. This is due to the fact that a more liberal inclusion threshold increases the number of hits which are used for computing the PSSM of the next iteration which leads to a higher runtime. HMM-to-HMM comparisons carried out by HHblits are computationally more costly than comparing sequences or sequences to profiles such that the runtime of HHblits is higher than the runtime of BLAST and PSI-BLAST although the query HMM is only compared to a small fraction of database HMMs which pass two pre-filter steps.


Number of significant hits

<figure id="fig:search_nhits_big">

Number of significant hits (e-value < 1e-3) found by searching UniProtKB.

</figure> <xr id="fig:search_nhits_big"/> shows the number of significant UniprotKb hits (evalue < 1e-3) detected by BLAST/PSI-BLAST and HHblits. Iterative search tools such as PSI-BLAST and HHblits allow for a higher sensitivity in detecting remote homologous sequences due to position specific substitution scores derived from hits of previous search iterations. The more search iterations are carried out the more diverse the positions specific substitution scores become the the more distantly related sequences with a low sequence similarity can be recognised. This accounts why 10 iterations led to more hits than two iterations. However, if the inclusion threshold is very strict (1e-10), only highly significant hits with a high similarity are added to the PSSM. Such hits are normally already found in the first two iterations such that further iterations do not yield more hits. Two rounds HHblits led to less significant hits than two rounds PSI-BLAST. Since HHblits is much better in discriminating true positives from false positives, this indicates that PSI-BLAST reports more significant false positives than HHblits.

Overlapping hits

We investigated the overlap of BLAST/PSI-BLAST hits depending on (1) the number of search iterations -j and (2) the inclusion threshold -h as well as (3) the overlap of BLAST/PSI-BLAST hits with HHblits hits.

  1. As described in the previous section, more hits are found when performing more iterations (<xr id="fig:search_overlap_blastpgp_big_j"/>). However, not all hits found after two iterations were still found after ten iterations. This demonstrates that the refinement of position specific substitution scores helps to exclude false positives which are considered as significant after only one iteration.
  2. By increasing the inclusion threshold, PSI-BLAST finds new significant hits which might be true positives but also false positives. As expected, after two iterations all hits found by using -h 1-e10 are also found by using the more relaxed threshold of -h 2e-3 (<xr id="fig:search_overlap_blastpgp_big_h"/>). Since more sequences are included into the PSSM when using -h 2e-3 instead of 1e-10, the PSSMs differ which accounts that not all hits detected by using -h 2-e3 are also detected by using -h 1e-10.
  3. Two iterations HHblits yielded 59 hits which were not detected by two iterations of PSI-BLAST (<xr id="fig:search_overlap_hhblits_big_j2"/>). These are probably true homologous sequences which could by inferred from the additional information about homology captured in the HMMs.
</figure> </figure> </figure>
<figure id="fig:search_overlap_blastpgp_big_j">
Venn diagram of overlapping significant hits (e-value < 1e-3) of PSI-BLAST -j 2 and PSI-BLAST -j 10.
<figure id="fig:search_overlap_blastpgp_big_h">
Venn diagram of overlapping significant hits (e-value < 1e-3) of PSI-BLAST -h 2e-3 and PSI-BLAST -h 1e-10.
<figure id="fig:search_overlap_hhblits_big_j2">
Venn diagram of overlapping significant hits (e-value < 1e-3) of two iterations PSI-BLAST and two iterations HHblits.

Distribution of the e-value

</figure> </figure>
<figure id="fig:search_dist_evalue_big">
Relative frequency of hits a particular e-value.
<figure id="fig:search_dist_id_big">
Relative frequency of hits with a particular sequence identity.

(<xr id="fig:search_dist_evalue_big"/>) depicts the distributions of e-values of the different methods where only hits with an e-value < 10 were taken into account. Altogether, only a small fractions of database sequences obtained an e-value small than 1e-3. The iterative tend to report lower e-values since the e-value of hits which are added to the PSSM becomes lower the more search iterations are carried out. Likewise, a low inclusion threshold produces highly significant hits. Note that the reported e-value does not have to coincide with the actual e-value: the e-value reported by iterative search tools is in most cases two optimistic, i.e. that more change hits actually occur than states by the reported e-value.

Distribution of the sequence identity

(<xr id="fig:search_dist_id_big"/>) shows the relative frequency of significant hits (e-evalue < 1e-3) which were found within different ranges of sequence identity. Most hits have either a sequence identity greater 0.75 or between 0.1 and 0.4. Close homologous sequences with an identity above 0.75 can easily detected by all search methods. Detecting more distantly related sequences within the twilight zone (identity < 0.3) is a much harder task and requires iterative search methods such as HHblits. This accounts why the fractions of hits reported by BLAST is higher for an identity >0.8 but lower for an identity below 0.3. Another interesting observation is that performing a high number of PSI-BLAST iterations with a liberal e-value (-j 10, -h 2e-3) leads to many significant hits with a vanishing identity. This shows that most hits are probably false positives and that the inclusion threshold of 2e-3 is too high.


Multiple sequence alignment

Introduction

In this multiple sequence alignments task, We should at first create a data set containing choose 20 relevant sequences from the sequence searches. Then different multiple alignments methods,i.e., ClustalW, Muscle, T-Coffee will be applied. The resulted alignments will be viewed by using Jalview and further analysis work will be done based on the observation.

Methods (lab journal)

Sequences chosen for the multiple Alignment

We Choose 20 sequences from 4 identity ranges and try to take those from different organisms with similar function annotation. We try to avoid taking the sequence without annotation. We find that there are very little candidate sequences at the identity range 50% - 80%. The most available sequences at that range are the sequences without annotations, therefore are not chosen. Two exceptions are “tr|E7EZM1|” and “tr|Q4RID9|”, the first is an Uncharacterized protein and the later is a whole genome shotgun sequence from Tetraodon nigroviridis (Spotted green pufferfish). They are chosen because they get high scores in identity range 50% - 60%. and the aligments might help to find underlying function of such proteins. Fortunately, 2 pdb files, one with 98% , one with 30% sequence identity are involved.

The following table shows the detailed information about the chosen sequences:

Uniprot/PDB ID annotation Organism Identity
99 - 90% sequence identity
D3DV87 Glucosidase, beta Homo sapiens 99%
Q9BDT0 Glucosylceramidase Chimpanzee 99%
Q5R8E3 Glucosylceramidase Pongo abelii 98%
F5H241 Glucosylceramidase Homo sapiens 98%
2V3D_A Glucosylceramidase Homo sapiens 98%
89 - 60% sequence identity
F1RLJ2 Glucosylceramidase Sus scrofa (Pig) 89%
F1N1D5 Glucosylceramidase Bos taurus (Bovine) 88%
G5BDF5 Glucosylceramidase (Fragment) Heterocephalus glaber (Naked mole rat) 87%
G3HC47 Glucosylceramidase Cricetulus griseus (Chinese hamster) 87%
P17439 Glucosylceramidase Mus musculus (Mouse) 86%
59 - 40% sequence identity
E7EZM1 Uncharacterized protein Danio rerio (Zebrafish) 56%
Q4RID9 Chromosome 8 whole genome shotgun sequence Tetraodon nigroviridis (Spotted green pufferfish) 54%
A8PTH1 O-Glycosyl hydrolase family 30 protein Brugia malayi (Filarial nematode worm) 48%
F1L2Y4 Glucosylceramidase Ascaris suum (Pig roundworm) 45%
B7PK21 Beta-glucocerebrosidase Ixodes scapularis (Black-legged tick) 44%
39 - 20% sequence identity
E2C4A6 Glucosylceramidase Harpegnathos saltator (Jerdon's jumping ant) 38%
G3IZH3 Glucosylceramidase Methylobacter tundripaludum SV96 34%
F7PUQ5 Glucosylceramidase Haloplasma contractile SSD-17B 32%
2wnw_A Glycoside Hydrolase Family Enzyme Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 30%
Q1IIZ7 Glucosylceramidase Koribacter versatilis (strain Ellin345) 27%

Tools

ClustalW

Clustal is a widely used multiple sequence alignment computer program[1]. For the given sequences, it generates at first the pairwise alignment, then creates a phylogenetic tree and Uses the phylogenetic tree to carry out a multiple alignment. ClustalW is its command line interface.

Here is the usage:

clustalw -infile=msa.fa -outfile=msaout.aln

Results

Discussion

References

<references/>