Difference between revisions of "Sequence Alignments TSD"

From Bioinformatikpedia
m (stuff about COPS and overlaps)
m (Multiple sequence alignment)
Line 263: Line 263:
   
 
== Multiple sequence alignment ==
 
== Multiple sequence alignment ==
  +
wip

Revision as of 11:11, 6 May 2012

Sequence searches

There are several alignment methods provided by various initiatives, who tackle the problem of sequence searches. Here some of them are applied for the Hex A protein and analyzed. For the searches non redundant protein databases are used. The outputs are adapted to each other and put in comparison in order to determine the best results. A protocol containing the basic steps taken is available.

As a further note: There has been some debate as to what we are supposed to do with the -v and -b parameters. For the following, we interpreted the task such that these values are to be set in a way that all returned hits can still be considered relevant. Therefore our values are empirically set to cut off the results at e-Values around e-5 to e-3. This still leaves us enough results to actually see a distribution and of course a hit with e-3 is still less reliable than one with e-200.

Blast

The first sequence similarity search with the Hex A protein was run with Blast. Here the default settings which provide an output of 250 alignments cover a just a small fraction of similar proteins as the e-value of the last hit receives a significantly low e-value of 3e-48. This shows that the sequence search can be continued and more sequences added safely. This is especially important because there are sequences with a comparably low sequence identity of 20% needed for the multiple sequence alignment. The sequence identity correlates with the hit rank of blast, meaning that with a worse sequence identity the e-value is overall expected to increase. To manage between quality deterioration with a worse e-value and on the other hand the need for low sequence identity a limitation of the output sequences was chosen of 1200. Here the e-value does not go beyond 1e-4 and thus the quality of the alignment is still sufficient but there are also sequences aligned with the required low sequence identity. <figtable id="blastidev">

Tsd blastallIdent.png
Tsd blastallEval.png
Table 1: Blast output distributions.

</figtable>

The results from the BIG80 database contain only Uniprot sequences, which can be explained by the type of clustering used for big_80. The employed method CD-HIT tends to prefer longer sequences and since the ATOM and SEQRES based sequences in a PDB entry are expected to be shorter then the according Uniprot entry, PDB hits are scarce in BIG80.
The Blast search finds altogether 1232 sequences of which 1200 are unique.

The sequence identity distribution as well as the e-value distribution are displayed in <xr id="blastidev"/>.
The identity accumulates around 20-40% with a peak of over 150 sequences with 30%. The most often occuring e-value is between e-100 and e-10. Together with the peak of the e-value distribution at e-80 this can be a sign for good alignment quality. Note that e- Value and sequence identity do not perfectly correlate, however the obvious cuts in the distribution of sequence identity with no hit below 20% and in the e-Value distribution which ends before 0 are results of how we chose the maximum number of reported hits to include only hits that we consider relevant (c.f. above).


PSI-Blast

The PSI-Blast alignment was assessed in different constellations of e-value cutoff and iteration number for the profile compilation. An appropriate output cutoff was also chosen to avoid irrelevant hits. This threshold was set to 1200 to limit the results to relevant hits just like in the simple Blast search.
After the profile creation from the BIG80 database the search space could be extended.Thus the profiles were used to run Psi-Blast against the larger BIG database. Here the results were restricted to 3800 sequences.

All results are unique hits from the BIG80 as well as the BIG database. In the result set from BIG80 there are only Uniprot sequences whereas in the BIG hits a range between 117-122 proteins are from PDB.
Within the iterations for the profile compilation the proteins are chosen from a pool of ca. 1350 unique proteins within the 10 iteration runs. The 2 iterations consist each of around 1240 unique hits. This shows that in process of each iteration new sequences can be added to the alignments.

The performance for the different combinations of e-value and iterations for the search in the BIG80 database as well as the BIG database are shown in <xr id="tab:psiblast"/>. This comparison depicts the general trend of a higher runtime with a higher number of iterations. As there are other circumstances which also affect the performance such as other concurrent processes on the computing ressources (and the ensuing bottlenecks in I/O and CPU ressources), there is no further reliable inference from the run time results possible.

<figtable id="tab:psiblast">
Iterations 2 2 10 10
E-value 0.002 10E-10 0.002 10E-10
BIG80 3m53 4m3 18m57 21m9
BIG 17m19 11m13 16m39 11m13

Table 2: Different performances of PSI-Blast. </figtable>


<figtable id="tab:psiblastHits">
Iterations 2 2 10 10
E-value 0.002 10E-10 0.002 10E-10
All 1237 1240 1139 1114
Unique 1200 1200 1113 1084
BIG80 database
Iterations 2 2 10 10
E-value 0.002 10E-10 0.002 10E-10
All 4046 4045 4035 4032
Unique 3800 3800 3800 3800
BIG database

Table 3: Comparison of all results and Unique hits of PSI-Blast. </figtable>


<figtable id="tbl:psiblastevals">

Tsd psiblastIdents allfour.png
Tsd psiblastBIGIdents allfour.png
Tsd psiblastEvals allfour.png
Tsd psiblastBIGEvals allfour.png
Table 4: PSI-Blast output distributions: identity distributions (upper row), e-value distributions (bottom row), results from BIG80 (left), results from BIG (right).

</figtable>

The <xr id="tbl:psiblastevals"/> shows the identity and e-value distribution from BIG80 and BIG.

The frequency distributions of the sequence identity are comparably equal within the different iteration and e-value combinations.
Also between the BIG80 and the BIG database there are only little differences. The maximum frequency settles in both cases around 20% but the distribution for 10 Iterations is shifted torwards 0. Thus the iteration increase seems to lay the focus more on distant families.

The e-value frequencies differ notably between the BIG80 and the BIG database. Although they both have their minimum around e-400 the BIG hits are more wide-spread and have a maximal e-value of e-4 in contrast to the maximum of the BIG80 database of e-80.
Additionally differences between the distinct profiles become apparent. The change in e-value does not affect the distribution significantly. The increase of iterations exhibits a slight shift to higher e-values, which shows that more iterations can lead to a quality loss of the alignments.

Overall it can be concluded that in this case different e-value cut-offs lead to no alteration of the results whereas a change of iterations from the default 2 to 10 might lead to a sensitivity loss for the protein family.

<figure id="fig:psi-blast combinations big">

Overlap of all PSI-Blast runs against the BIG database: The result sets are comprised of 3800 unique proteins. With the mapping to Uniprot there are some additions due to possible ambiguity. 2 Iterations, Evalue: 0.002 = purple; 2 Iterations, Evalue: 10e-10 = yellow; 10 Iterations, Evalue: 0.002 = green; 10 Iterations, Evalue: 10e-10 = blue.

</figure>


The sequence overlap from all profile combinations applied to the search in BIG is displayed in <xr id="fig:psi-blast combinations big"/>. To adjuts the different sets for comparability the pdb identifiers were mapped to Uniprot whenever possible.

The adjustments of the iteration number and the e-value cut-off clearly account for differences in the result set. Each result set expresses an unique intersection with the other sets. It is especially striking that the results with the iteration number of 10 have a high count of hits only present in each respective set. This does not only illustrate the impact of the iteration number on the result proteins but also the relevance of the e-value cut-off which was not apparent so far. The change of e-value in the PSI-Blast profiles did not lead to differences in the sequence identity or e-value distributions but evidently it does have an effect on the composition of the results.



HHBlits

The next step was the performance of the sequence search with HHBlits against Uniprot. The results were limited to 460 representatives in accordance with the output relevance of the previous alignments.

<figtable id="hhblitsidev">

Tsd hhblitsIdent.png
Tsd hhblitsEval.png
Table 5: HHblits output distributions.

</figtable>

The sequence identity and e-value distribution of the results are display in <xr id="hhblitsidev"/>. The number of values is reduces in comparison to the Blast alignments as the scores are only provided for each cluster of proteins and not for every sequence separately.

The sequence identity distribution looks very similar to the distribution of the Blast output with a maximum around 30% and a decrease to both sides it. As the frequencies are 5 times lower than those from the Blast alignments and represent a cluster it should be kept in mind that the results can not be compared one to one.
The e-value rises continuously from e-350 to e-4 with a maximum at the highest e-values. The distribution is very different to the previous e-value frequencies that express more than one local minimum. The simple Blast distribution even consists of two local maxima.


Evaluation


<xr id="tbl:taboverlap all"/> shows the obverlap in Uniprot sequences among all three sequence search models. <figtable id="tbl:taboverlap all">

All psi2e-10Venn.png
All psi10e-10Venn.png
Table 6: Overlap of all Uniprot based searches.

</figtable> Around 960 Uniprot IDs are found by all three methods, however an interesting pattern emerges when comparing the data of PSI-Blast runs with a different number of iterations. While the relationship among the methods in raw numbers of overlap stays almost the same there is an apparant change in overlap between Blast and PSI-Blast. Using 2 PSI-Blast iterations the resulting set is a superset of the Blast results. The larger size of the PSI-Blast results is easily explained by the larger database used (BIG for PSI-Blast vs. BIG80 for Blast).
However when the iterations are increaed to 10 runs the PSI-Blast set begins to strongly deviate from the Blast results. This can be explained as an effect of the refined search pattern applied after the higher number of iterations leading to a different set of proteins. This is coarsely supported by the according distribution of eValues (c.f. <xr id="tbl:psiblastevals"/>) and a slight shift to the right in the distribution of identities (c.f. <xr id="tbl:psiblastevals"/>) which illustrate that this change of hits is a shift away from the protein family of Hex A.

As another point, from <xr id="tbl:taboverlap all"/> it also seems that the PSI-Blast iteration number does not have a strong effect on the overlap to HHblits, however <xr id="fig:psi-blast combinations big"/> highlights that they differ significantly.


PDB Hits

General

To further assess the performance of the sequence search algorithms we built a dataset of PDB hits. To build a set of maximum size all hits found directly were merged with all hits that could be mapped from Uniprot IDs, using the UniprotKB mapping webservice. <xr id="fig:tsd_pdbhits_overlap"/> shows the overlap of the resulting sets. As to be expected for a search in structural data the overall number of hits is much lower than for sequences alone. The equality of results from both PSI-Blast runs, using only two iterations is interesting and was not apparaent from previous plots since the PDB hits only account for a small number of hits.

<figure id="fig:tsd_pdbhits_overlap">

Overlap of all PDB hits retrieved directly as well as by mapping the Uniprot identifiers.

</figure>

For the following we will work with the union set of 81 hits and take a look at the few sequences that are only found by some methods afterwards.

COPS

<xr id="fig:tsd_inoutdistri"/> shows that . In theory we would have to account for the two PDB IDs 2gjx and 2gk1, however both are in the same cluster in the L99 and S90 classifications (and are therefore also in the same cluster in less stringent clusterings).

For the structure based clustering it can be observed that the number of PDB hits that are in the same cluster as our reference protein falls, the stronger the clustering citeria becomes. This is expected behaviour. Closer observation of the cluster at L99 stringency shows that the hits in the cluster are in fact all Hex A or Hex B isozymes and therefore indeed highly relevant. Lowering the criteria to L80, the hits that are found additionally (3nsm, 3ozo, 3nsn and 3ozp) all belong to another beta-Hexosaminidase (OfHex1), which although having a strong homology in the catalytic domain undergoes conformational changes which is in a strong contrast to Hex A and Hex B where this is not known (http://www.jbc.org/content/286/6/4049 TODO). Additionally even the PDB IDs found in the L60 classification still belong to beta-Hexosminidases and could therefore still be considered interesting. Only at L40 the hits start to get unreliable having no apparent relation to any Hexosaminidase type anymore. We will therefore try to limit our choice of homolog protein structure for the following course of this task to those included in the cluster when using L60 classification.

It is interesting to see that the clusters for S90 and S30 classification are equal. This suggests that sequence identity might not be considered as a valuable classification tool. While the IDs in the cluster are not wrong (they are equal to the ones in the L99 classification), sequence identity seems to be a too strong criteria even at low values to be able to capture slightly further related proteins.

The five IDs (2jnk, 2w1q, 2w1s, 2w1u, 2wdb) that could only be found by using 10 iterations in PSI-Blast contain one hydrolase, while the others belong to carbohydrate-binding modules. None of them are in a cluster with our protein of interest in COPS, independent of how loose the cluster classification is chosen. Closer examination of the hits might lead to answers regarding their importance, however from the knowledge we have gained so far these hits do not seem to be make a point for higher sensitivity by an increased number of PSI-Blast runs and therefor support our initial finding that an increase in iterations might actually lead to a loss of sensitivity.

<figure id="fig:tsd_inoutdistri">

Overlap of all 77 PDB hits from all methods with the according COPS clusters. The clusters for S30 and S90 are equal.

</figure>

COPS has already been used as a benchmark to assess the quality of sequence search methods (TODO http://bioinformatics.oxfordjournals.org/content/26/4/574.full.pdf+html). In this publication HHblits seemed to perform best with a significant difference to PSI-Blast and even more so Blast. Of course the authors used a large dataset and our special case cannot be seen as representative, nonetheless we want to state that lead of HHblits cannot be reproduced for our application since a variation of PSI-Blast parameters is able to capture all hits in the union and there is not single ID that is only found by HHblits.

Multiple sequence alignment

wip