Sequence Alignments TSD

From Bioinformatikpedia
Revision as of 23:56, 4 May 2012 by Meiera

Sequence searches

Several sequence search methods are available from different groups. Here, some of them are applied to the Hex A protein and their results are analyzed. All searches were run against non-redundant protein databases. The outputs were brought into a common format and compared in order to determine the best results. A protocol containing the basic steps taken is available.

Blast

The first sequence similarity search with the Hex A protein was run with Blast. With the default settings, which report 250 alignments, only a small fraction of the similar proteins is covered: the last hit still has a very low e-value of 3e-48. This shows that the search can safely be extended to include more sequences. This is especially important because sequences with a comparably low sequence identity of about 20% are needed for the multiple sequence alignment. The sequence identity correlates with the hit rank of Blast, meaning that the e-value is expected to increase overall as the sequence identity decreases. As a compromise between the loss of quality at higher e-values and the need for sequences with low identity, the output was limited to 1200 sequences. With this limit the e-value does not exceed 1e-4, so the alignment quality is still sufficient, while sequences with the required low sequence identity are included. <figtable id="blastidev">

Tsd blastallIdent.png
Tsd blastallEval.png
Table 1: Blast output distributions.

</figtable>

The results from the BIG80 database contain only Uniprot sequences. This can be explained by the clustering used to build BIG80 (based on CD-HIT), in which long sequences are preferred. All hits are unique.

The sequence identity distribution and the e-value distribution are displayed in <xr id="blastidev"/>.
The identities accumulate around 20-40%, with a peak of over 150 sequences at 30%. Most e-values lie between e-100 and e-10. Together with the peak of the e-value distribution at e-80, this can be taken as a sign of good alignment quality.
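The two distributions can be recomputed from a tabular hit list. The following is a minimal sketch, assuming the Blast hits are available in the common 12-column tab-separated layout (percent identity in column 3, e-value in column 11). The function names and the five example lines are hypothetical and are not taken from the actual run.

```python
import math
from collections import Counter

# Hypothetical example lines in the 12-column tabular Blast layout
# (query, subject, % identity, ..., e-value, bit score); not real results.
EXAMPLE_LINES = [
    "HEXA\thit_1\t100.0\t529\t0\t0\t1\t529\t1\t529\t0.0\t1090",
    "HEXA\thit_2\t56.3\t520\t210\t6\t5\t524\t8\t521\t2e-175\t610",
    "HEXA\thit_3\t31.2\t480\t300\t12\t30\t509\t25\t490\t3e-48\t180",
    "HEXA\thit_4\t22.5\t410\t290\t15\t60\t469\t40\t430\t5e-05\t48",
    "HEXA\thit_5\t19.0\t120\t90\t4\t200\t319\t10\t125\t0.5\t30",
]


def parse_hits(lines):
    """Yield (subject_id, percent_identity, evalue) for each tabular hit line."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        yield fields[1], float(fields[2]), float(fields[10])


def filter_hits(hits, max_evalue=1e-4):
    """Keep only hits whose e-value does not exceed the cutoff."""
    return [hit for hit in hits if hit[2] <= max_evalue]


def identity_histogram(hits, bin_width=10):
    """Count hits per identity bin; the key is the lower bin edge in percent."""
    return Counter(int(ident // bin_width) * bin_width for _, ident, _ in hits)


def evalue_histogram(hits, bin_width=10):
    """Count hits per log10(e-value) bin; the key is the lower bin edge."""
    counts = Counter()
    for _, _, evalue in hits:
        if evalue == 0.0:
            # An e-value of exactly 0 has no logarithm; keep it in its own bin.
            counts[-math.inf] += 1
        else:
            exponent = math.log10(evalue)
            counts[math.floor(exponent / bin_width) * bin_width] += 1
    return counts


kept = filter_hits(parse_hits(EXAMPLE_LINES))
print(sorted(identity_histogram(kept).items()))
print(sorted(evalue_histogram(kept).items()))
```

The cutoff of 1e-4 mirrors the limit mentioned above. The bin width of 10 is an arbitrary choice for the sketch.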


PSI-Blast

PSI-Blast was assessed with different combinations of e-value cutoff and number of iterations for the profile compilation. An appropriate output cutoff was also chosen to avoid irrelevant hits. As in the simple Blast search, this threshold was set to 1200 hits.
After the profiles had been created from the BIG80 database, the search space could be extended. The profiles were therefore used to run PSI-Blast against the larger BIG database. Here the results were restricted to 3800 sequences.
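The two-stage procedure (build a profile on BIG80, then search BIG with it) can be sketched as two command lines. The sketch below only assembles the argument lists and does not run anything. It assumes the `psiblast` program from BLAST+. The page does not state which PSI-Blast binary was used, and it is also an assumption that the e-value cutoff corresponds to the profile inclusion threshold. The file names and database names are hypothetical.

```python
def profile_command(query_fasta, database, iterations, inclusion_evalue,
                    pssm_file, max_hits):
    """Stage 1: iterate against the clustered database and save the profile."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),
        # Assumption: the e-value cutoff is the profile inclusion threshold.
        "-inclusion_ethresh", inclusion_evalue,
        "-num_descriptions", str(max_hits),
        "-num_alignments", str(max_hits),
        "-out_pssm", pssm_file,
    ]


def search_command(pssm_file, database, max_hits):
    """Stage 2: search the larger database, starting from the saved profile."""
    return [
        "psiblast",
        "-in_pssm", pssm_file,  # the profile replaces the query sequence
        "-db", database,
        "-num_descriptions", str(max_hits),
        "-num_alignments", str(max_hits),
    ]


# Hypothetical file and database names; hit limits as stated in the text.
stage1 = profile_command("hexa.fasta", "big_80", 10, "0.002", "hexa.pssm", 1200)
stage2 = search_command("hexa.pssm", "big", 3800)
print(" ".join(stage1))
print(" ".join(stage2))
```

Each list could be passed to `subprocess.run(..., check=True)` on a machine where the program and the databases are installed.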

All results are unique hits, both from the BIG80 and from the BIG database. The result set from BIG80 contains only Uniprot sequences, whereas between 117 and 122 of the BIG hits are PDB proteins.
During profile compilation, the runs with 10 iterations draw their hits from a pool of ca. 1350 unique proteins. The runs with 2 iterations each comprise around 1240 unique hits. This shows that new sequences can be added to the alignments in each iteration.
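How many new sequences each iteration contributes can be counted with simple set operations. This is a minimal sketch, assuming the hit identifiers of every iteration have already been extracted into one list per iteration. The function name and the example identifiers are hypothetical.

```python
def new_hits_per_iteration(iterations):
    """Return (new hits per iteration, size of the pool of unique hits)."""
    seen = set()
    added = []
    for hits in iterations:
        fresh = set(hits) - seen  # identifiers not reported by earlier iterations
        added.append(len(fresh))
        seen |= fresh
    return added, len(seen)


# Hypothetical identifiers for three iterations of one run.
example_run = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["b", "d", "e"],
]
added, pool_size = new_hits_per_iteration(example_run)
print(added, pool_size)
```

In this toy run the pool grows to 5 unique identifiers although no single iteration reports more than 3.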

The runtimes of the different combinations of e-value and iteration number for the searches in the BIG80 and BIG databases are shown in <xr id="tab:psiblast"/>. The comparison shows the general trend of a higher runtime with more iterations, which is visible in the BIG80 searches. Because other circumstances also affect the runtime, such as concurrent processes on the computing resources (and the ensuing I/O and CPU bottlenecks), no further reliable conclusions can be drawn from the runtimes.

<figtable id="tab:psiblast">
Iterations        2        2        10       10
E-value cutoff    0.002    10E-10   0.002    10E-10
BIG80 runtime     3m53     4m3      18m57    21m9
BIG runtime       17m19    11m13    16m39    11m13

Table 2: PSI-Blast runtimes for the different settings, given as minutes and seconds (e.g. 3m53 = 3 min 53 s). </figtable>
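To compare the runtimes in Table 2 numerically, the durations can be converted to seconds. The sketch below assumes that a value such as 3m53 means 3 minutes and 53 seconds. The dictionary layout and the function names are hypothetical.

```python
import re

# Runtimes as listed in Table 2, keyed by (database, iterations, e-value cutoff).
RUNTIMES = {
    ("BIG80", 2, "0.002"): "3m53",
    ("BIG80", 2, "10E-10"): "4m3",
    ("BIG80", 10, "0.002"): "18m57",
    ("BIG80", 10, "10E-10"): "21m9",
    ("BIG", 2, "0.002"): "17m19",
    ("BIG", 2, "10E-10"): "11m13",
    ("BIG", 10, "0.002"): "16m39",
    ("BIG", 10, "10E-10"): "11m13",
}


def to_seconds(text):
    """Convert a duration such as '3m53' (minutes, seconds) to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)", text)
    if match is None:
        raise ValueError(f"unexpected duration format: {text!r}")
    minutes, seconds = (int(part) for part in match.groups())
    return minutes * 60 + seconds


def iteration_ratio(database, evalue):
    """Runtime with 10 iterations divided by the runtime with 2 iterations."""
    slow = to_seconds(RUNTIMES[(database, 10, evalue)])
    fast = to_seconds(RUNTIMES[(database, 2, evalue)])
    return slow / fast


for database in ("BIG80", "BIG"):
    for evalue in ("0.002", "10E-10"):
        print(database, evalue, round(iteration_ratio(database, evalue), 2))
```

With the values of Table 2 the ratio is close to 5 for the BIG80 searches and close to 1 for the BIG searches. Given the caveats about concurrent load stated above, these ratios should be read as rough indications only.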


<figtable id="tbl:psiblastevals">

Tsd psiblastIdents allfour.png
Tsd psiblastBIGIdents allfour.png
Tsd psiblastEvals allfour.png
Tsd psiblastBIGEvals allfour.png
Table 3: PSI-Blast output distributions.

</figtable>

The frequency distributions of the sequence identity are very similar across the different combinations of iteration number and e-value. There are also only small differences between the BIG80 and the BIG database. In both cases the maximum frequency lies around 20%.


The e-value frequencies differ notably between the BIG80 and the BIG database. Although both have their minimum e-value around e-400, the BIG hits are more widely spread and reach a maximal e-value of e-4, in contrast to a maximum of e-80 for the BIG80 database.
In addition, differences between the individual profiles become apparent. For example, in BIG80 the run with 10 iterations and an e-value cutoff of 0.002 seems to have performed worse than the rest.

<figure id="fig:psi-blast combinations big">

Overlap of all PSI-Blast runs against the BIG database. Each result set comprises 3800 unique proteins. The mapping to Uniprot adds some entries because of possible ambiguity. 2 iterations, e-value 0.002 = purple; 2 iterations, e-value 10e-10 = yellow; 10 iterations, e-value 0.002 = green; 10 iterations, e-value 10e-10 = blue.

</figure>


The sequence overlap of all profile combinations applied to the search in BIG is displayed in <xr id="fig:psi-blast combinations big"/>. To make the sets comparable, the PDB identifiers were mapped to Uniprot whenever possible. The choice of iteration number and e-value cutoff clearly accounts for differences in the result sets. Each result set has a unique intersection with the other sets.
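The identifier mapping and the comparison of two result sets can be sketched as follows. This assumes a lookup table from PDB identifiers to Uniprot accessions is already available as a dictionary. The identifiers, the function names and the use of the Jaccard index as overlap measure are illustrative choices, not the procedure used for the figure.

```python
def map_to_uniprot(ids, pdb_to_uniprot):
    """Replace PDB identifiers by Uniprot accessions where a mapping exists."""
    mapped = set()
    for identifier in ids:
        # One PDB entry may map to several accessions, so a set can grow here.
        mapped.update(pdb_to_uniprot.get(identifier, [identifier]))
    return mapped


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical mapping and result sets.
PDB_TO_UNIPROT = {"1abc_A": ["U1"], "2xyz_B": ["U2", "U3"]}
run_a = map_to_uniprot({"1abc_A", "2xyz_B", "U4"}, PDB_TO_UNIPROT)
run_b = map_to_uniprot({"U1", "U4", "U5"}, PDB_TO_UNIPROT)

print(sorted(run_a), sorted(run_b), jaccard(run_a, run_b))
```

In the toy data, `run_a` grows from three to four identifiers after the mapping, which illustrates how ambiguous mappings can add entries to a result set.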


HHblits

Next, a sequence search was run with HHblits against Uniprot. The results were limited to 460 representatives, in accordance with the output relevance of the previous searches.

<figtable id="hhblitsidev">

Tsd hhblitsIdent.png
Tsd hhblitsEval.png
Table 4: HHblits output distributions.

</figtable>

The sequence identity distribution looks very similar to that of the Blast output. The frequencies are lower because the values are reported for whole clusters and not for every sequence separately.
The e-values range continuously from e-350 to e-4, with the highest frequency at the highest e-values.


Evaluation

<xr id="tbl:taboverlap all"/> shows the overlap in Uniprot sequences among all three sequence search methods. Around 960 Uniprot IDs are found by all three methods. However, an interesting pattern emerges when comparing PSI-Blast runs with different numbers of iterations. While the raw overlap numbers among the methods stay almost the same, there is an apparent change in the overlap between Blast and PSI-Blast. With 2 PSI-Blast iterations the resulting set is a superset of the Blast results. The larger size of the PSI-Blast result set is easily explained by the larger database used (BIG for PSI-Blast vs. BIG80 for Blast). However, when the number of iterations is increased to 10, the PSI-Blast set begins to deviate strongly from the Blast results. This could be an effect of the more fine-grained search pattern applied after the higher number of iterations, a theory that is roughly supported by the corresponding distribution of e-values (cf. <xr id="tbl:psiblastevals"/>) and a slight shift to the right in the distribution of identities (not shown).

In addition, <xr id="tbl:taboverlap all"/> suggests that the PSI-Blast iteration number does not have a strong effect on the overlap with HHblits, whereas <xr id="fig:psi-blast combinations big"/> highlights that the PSI-Blast result sets themselves differ significantly.
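The region sizes of such a three-way comparison, and the superset relation between Blast and PSI-Blast, can be computed from the identifier sets. This is a minimal sketch with hypothetical identifiers and a hypothetical function name. It does not reproduce the actual counts.

```python
from collections import Counter


def venn_regions(named_sets):
    """Count identifiers per exact combination of methods that found them."""
    membership = {}
    for name, ids in named_sets.items():
        for identifier in ids:
            membership.setdefault(identifier, set()).add(name)
    return Counter(tuple(sorted(names)) for names in membership.values())


# Hypothetical result sets of Uniprot identifiers.
results = {
    "Blast": {"a", "b", "c"},
    "PSI-Blast": {"a", "b", "c", "d", "e"},
    "HHblits": {"a", "d", "f"},
}

regions = venn_regions(results)
for methods, count in sorted(regions.items()):
    print(" & ".join(methods), count)

# True if every Blast hit is also a PSI-Blast hit.
print(results["PSI-Blast"] >= results["Blast"])
```

Each identifier is counted in exactly one region, so the region counts add up to the number of distinct identifiers over all methods.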


<figtable id="tbl:taboverlap all">

All psi2e-10Venn.png
All psi10e-10Venn.png
Table 5: Overlap of all Uniprot based searches.

</figtable>


Multiple sequence alignment