Sequence Alignments TSD

From Bioinformatikpedia
« Previous Tay-Sachs Disease

The journal for this task can be found here.

Sequence searches

There are several methods provided by various initiatives, who tackle the problem of sequence searches. Here, some of them are applied and analyzes on the example of the Hex A alpha subunit. For all searches, non redundant protein databases are used. The outputs are normalized and put in comparison to determine the best results.

There has been some debate how to set the -v and -b parameters. For the following, the task was interpreted such that these values are to be set in a way that all returned hits can still be considered relevant. Therefore the values used here are empirically set to cut off the results at e-Values around e-5 to e-3. This still leaves enough results to actually see a distribution and of course a hit with e-3 is still less reliable than one with e-200.


The first sequence similarity search with HEXA was run with Blast. Here, the default settings which provide an output of 250 alignments cover only a small fraction of similar proteins since the e-value of the last hit receives a significantly low e-value of 3e-48. This suggests that the sequence search could be continued and more sequences added safely. This is especially important because there are sequences with a comparably low sequence identity of 20% needed for the multiple sequence alignment. The sequence identity correlates with the hit rank of blast, meaning that with a worse sequence identity, the e-value is overall expected to increase. To manage between quality deterioration with a worse e-value and on the other hand the need for low sequence identity a limitation of 1200 was chosen for the output sequences. Here the e-value does not go beyond 1e-4 and thus the quality of the alignment is still sufficient while there are also sequences aligned with the required low sequence identity. <figtable id="blastidev">

Tsd blastallIdent.png
Tsd blastallEval.png
Table 1: Blast output distributions.


The results from the BIG80 database contain only Uniprot sequences, which can be explained by the type of clustering used for BIG80. The employed method CD-HIT tends to prefer longer sequences and since the ATOM and SEQRES based sequences in a PDB entry are expected to be shorter then the according Uniprot entry, which does not reflect post-translational processing, PDB hits are scarce in BIG80.
The Blast search finds altogether 1232 sequences, 1200 of which are unique.

The sequence identity distribution, as well as the e-value distribution are displayed in <xr id="blastidev"/>.
The identity accumulates around 20-40% with a peak of over 150 sequences that have 30% sequence identity. The most often occuring e-value is between e-100 and e-10. Together with the peak of the e-value distribution at e-80 this can be a sign for good alignment quality. Note that e- Value and sequence identity do not perfectly correlate, however the obvious cuts in the distribution of sequence identity with no hit below 20% and in the e-Value distribution which ends before 0 are results of how we chose the maximum number of reported hits to include only hits that we consider relevant (c.f. above).


The PSI-Blast alignment was assessed in different constellations of e-value cutoff and iteration number for the profile compilation. An appropriate output cutoff was also chosen to avoid irrelevant hits. This threshold was set to 1200 to limit the results to relevant hits just like in the simple Blast search.
After the profile creation from the BIG80 database the search space could be extended.Thus the profiles were used to run Psi-Blast against the larger BIG database. Here the results were restricted to 3800 sequences.

All results are unique hits from the BIG80 as well as the BIG database. In the result set from BIG80 there are only Uniprot sequences whereas in the BIG hits a range between 117-122 proteins are from PDB.
Within the iterations for the profile compilation the proteins are chosen from a pool of ca. 1350 unique proteins within the 10 iteration runs. The 2 iterations consist each of around 1240 unique hits. This shows that in process of each iteration new sequences can be added to the alignments.

The performance for the different combinations of e-value and iterations for the search in the BIG80 database as well as the BIG database are shown in <xr id="tab:psiblast"/>. This comparison depicts the general trend of a higher runtime with a higher number of iterations. As there are other circumstances which also affect the performance such as other concurrent processes on the computing ressources (and the ensuing bottlenecks in I/O and CPU ressources), there is no further reliable inference from the run time results possible.

<figtable id="tab:psiblast">

Table 2: Different performances of PSI-Blast.

Iterations 2 2 10 10
E-value 0.002 10E-10 0.002 10E-10
BIG80 3m53 4m3 18m57 21m9
BIG 17m19 11m13 16m39 11m13


<figtable id="tab:psiblastHits">
Iterations 2 2 10 10
E-value 0.002 10E-10 0.002 10E-10
All 1237 1240 1139 1114
Unique 1200 1200 1113 1084
BIG80 database
Iterations 2 2 10 10
E-value 0.002 10E-10 0.002 10E-10
All 4046 4045 4035 4032
Unique 3800 3800 3800 3800
BIG database
Table 3: Comparison of all results and Unique hits of PSI-Blast.


<figtable id="tbl:psiblastevals">

Tsd psiblastIdents allfour.png
Tsd psiblastBIGIdents allfour.png
Tsd psiblastEvals allfour.png
Tsd psiblastBIGEvals allfour.png
Table 4: PSI-Blast output distributions: identity distributions (upper row), e-value distributions (bottom row), results from BIG80 (left), results from BIG (right).


The <xr id="tbl:psiblastevals"/> shows the identity and e-value distribution from BIG80 and BIG.

The frequency distributions of the sequence identity are comparably equal within the different iteration and e-value combinations.
Also between the BIG80 and the BIG database there are only little differences. The maximum frequency settles in both cases around 20% but the distribution for 10 Iterations is shifted torwards 0. Thus the iteration increase seems to lay the focus more on distant families.

The e-value frequencies differ notably between the BIG80 and the BIG database. Although they both have their minimum around e-400 the BIG hits are more wide-spread and have a maximal e-value of e-4 in contrast to the maximum of the BIG80 database of e-80.
Additionally differences between the distinct profiles become apparent. The change in e-value does not affect the distribution significantly. The increase of iterations exhibits a slight shift to higher e-values, which shows that more iterations can lead to a quality loss of the alignments.

Overall it can be concluded that in this case different e-value cut-offs lead to no alteration of the results whereas a change of iterations from the default 2 to 10 might lead to a sensitivity loss for the protein family.

<figure id="fig:psi-blast combinations big">

Figure 1: Overlap of all PSI-Blast runs against the BIG database: The result sets are comprised of 3800 unique proteins. With the mapping to Uniprot there are some additions due to possible ambiguity. 2 Iterations, Evalue: 0.002 = purple; 2 Iterations, Evalue: 10e-10 = yellow; 10 Iterations, Evalue: 0.002 = green; 10 Iterations, Evalue: 10e-10 = blue.


The sequence overlap from all profile combinations applied to the search in BIG is displayed in <xr id="fig:psi-blast combinations big"/>. To adjuts the different sets for comparability the pdb identifiers were mapped to Uniprot whenever possible.

The adjustments of the iteration number and the e-value cut-off clearly account for differences in the result set. Each result set expresses an unique intersection with the other sets. It is especially striking that the results with the iteration number of 10 have a high count of hits only present in each respective set. This does not only illustrate the impact of the iteration number on the result proteins but also the relevance of the e-value cut-off which was not apparent so far. The change of e-value in the PSI-Blast profiles did not lead to differences in the sequence identity or e-value distributions but evidently it does have an effect on the composition of the results.


The next step was the performance of the sequence search with HHBlits against Uniprot. The results were limited to 460 representatives in accordance with the output relevance of the previous alignments.

<figtable id="hhblitsidev">

Tsd hhblitsIdent.png
Tsd hhblitsEval.png
Table 5: HHblits output distributions.


The sequence identity and e-value distribution of the results are display in <xr id="hhblitsidev"/>. The number of values is reduces in comparison to the Blast alignments as the scores are only provided for each cluster of proteins and not for every sequence separately.

The sequence identity distribution looks very similar to the distribution of the Blast output with a maximum around 30% and a decrease to both sides it. As the frequencies are 5 times lower than those from the Blast alignments and represent a cluster it should be kept in mind that the results can not be compared one to one.
The e-value rises continuously from e-350 to e-4 with a maximum at the highest e-values. The distribution is very different to the previous e-value frequencies that express more than one local minimum. The simple Blast distribution even consists of two local maxima.


<xr id="tbl:taboverlap all"/> shows the obverlap in Uniprot sequences among all three sequence search models. <figtable id="tbl:taboverlap all">

All psi2e-10Venn.png
All psi10e-10Venn.png
Table 6: Overlap of all Uniprot based searches.

</figtable> Around 960 Uniprot IDs are found by all three methods, however an interesting pattern emerges when comparing the data of PSI-Blast runs with a different number of iterations. While the relationship among the methods in raw numbers of overlap stays almost the same there is an apparant change in overlap between Blast and PSI-Blast. Using 2 PSI-Blast iterations the resulting set is a superset of the Blast results. The larger size of the PSI-Blast results is easily explained by the larger database used (BIG for PSI-Blast vs. BIG80 for Blast).
However when the iterations are increaed to 10 runs the PSI-Blast set begins to strongly deviate from the Blast results. This can be explained as an effect of the refined search pattern applied after the higher number of iterations leading to a different set of proteins. This is coarsely supported by the according distribution of eValues (c.f. <xr id="tbl:psiblastevals"/>) and a slight shift to the right in the distribution of identities (c.f. <xr id="tbl:psiblastevals"/>) which illustrate that this change of hits is a shift away from the protein family of Hex A.

As another point, from <xr id="tbl:taboverlap all"/> it also seems that the PSI-Blast iteration number does not have a strong effect on the overlap to HHblits, however <xr id="fig:psi-blast combinations big"/> highlights that they differ significantly.

It stands out that all alignment methods provide a great amount of hits with very low e-values. This shows that a a threshold for any sequence similarity search should be chosen rather very conservatively.
Altogether the alignments provided by PSI-Blast have a larger fraction of significantly low e-values in comparison to HHBlits. However the overlap of HHblits with Blast and PSI-Blast, which is comparably high, stays the same even with the shift of the PSI-Blast results. This could be a quality sign for the HHblits results.

PDB Hits


<figure id="fig:tsd_pdbhits_overlap">

Figure 2: Overlap of all PDB hits retrieved directly as well as by mapping the Uniprot identifiers.


To further assess the performance of the sequence search algorithms we built a dataset of PDB hits. To build a set of maximum size all hits found directly were merged with all hits that could be mapped from Uniprot IDs, using the UniprotKB mapping webservice. <xr id="fig:tsd_pdbhits_overlap"/> shows the overlap of the resulting sets. As to be expected for a search in structural data the overall number of hits is much lower than for sequences alone. The equality of results from both PSI-Blast runs, using only two iterations is interesting and was not apparaent from previous plots since the PDB hits only account for a small number of hits.

For the following we will work with the union set of 81 hits and take a look at the few sequences that are only found by some methods afterwards.


<xr id="fig:tsd_inoutdistri"/> shows the number of PDB hits that fall into a cluster with the protein of interest according to COPS. In theory we would have to account for the two PDB IDs 2gjx and 2gk1, however both are in the same cluster in the L99 and S90 classifications (and are therefore also in the same cluster in less stringent clusterings).

<figure id="fig:tsd_inoutdistri">

Figure 3: Overlap of all 77 PDB hits from all methods with the according COPS clusters. The clusters for S30 and S90 are equal.


For the structure based clustering it can be observed that the number of PDB hits that are in the same cluster as our reference protein falls, the stronger the clustering citeria becomes. This is expected behaviour. Closer observation of the cluster at L99 stringency shows that the hits in the cluster are in fact all Hex A or Hex B isozymes and therefore indeed highly relevant. Lowering the criteria to L80, the hits that are found additionally (3nsm, 3ozo, 3nsn and 3ozp) all belong to another beta-Hexosaminidase (OfHex1), which although having a strong homology in the catalytic domain undergoes conformational changes which is in a strong contrast to Hex A and Hex B where this is not known <ref name="Liu2011">Liu, T., Zhang, H., Liu, F., Wu, Q., Shen, X., & Yang, Q. (2011). Structural determinants of an insect beta-N-Acetyl-D-hexosaminidase specialized as a chitinolytic enzyme. The Journal of biological chemistry, 286(6), 4049-58. doi:10.1074/jbc.M110.184796</ref>. Additionally even the PDB IDs found in the L60 classification still belong to beta-Hexosminidases and could therefore still be considered interesting. Only at L40 the hits start to get unreliable having no apparent relation to any Hexosaminidase type anymore. We will therefore try to limit our choice of homolog protein structure for the following course of this task to those included in the cluster when using L60 classification.

It is interesting to see that the clusters for S90 and S30 classification are equal. This suggests that sequence identity might not be considered as a valuable classification tool. While the IDs in the cluster are not wrong (they are equal to the ones in the L99 classification), sequence identity seems to be a too strong criteria even at low values to be able to capture slightly further related proteins.

The five IDs (2jnk, 2w1q, 2w1s, 2w1u, 2wdb) that could only be found by using 10 iterations in PSI-Blast contain one hydrolase, while the others belong to carbohydrate-binding modules. None of them are in a cluster with our protein of interest in COPS, independent of how loose the cluster classification is chosen. Closer examination of the hits might lead to answers regarding their importance, however from the knowledge we have gained so far these hits do not seem to be make a point for higher sensitivity by an increased number of PSI-Blast runs and therefor support our initial finding that an increase in iterations might actually lead to a loss of sensitivity.

COPS has already been used as a benchmark to assess the quality of sequence search methods <ref name=Frank2010>Frank, K., Gruber, M., & Sippl, M. J. (2010). COPS Benchmark: interactive analysis of database search methods. Bioinformatics, 26(4), 574-575. doi:10.1093/bioinformatics/btp712</ref>. In this publication HHblits seemed to perform best with a significant difference to PSI-Blast and even more so Blast. Of course the authors used a large dataset and our special case cannot be seen as representative, nonetheless we want to state that lead of HHblits cannot be reproduced for our application since a variation of PSI-Blast parameters is able to capture all hits in the union and there is not single ID that is only found by HHblits.

Multiple sequence alignment

Dataset creation

Based on the previous sequence searches we built our dataset by first checking the identities of all PDB hits we found. However since several structures might be linked to the same Uniprot entry and it does not make sense to align five times the (almost) identical sequence from the SEQRES record, we mapped the PDB IDs back to Uniprot and used the Uniprot sequence. This provided us with 2 sequences for the 39 - 20% range and 1 for the 59 - 40% one. The only way to get a sequence with known structure into the 99 - 90% range would be by including our own sequence, supplying the 'obivous' structures 2gkj and 2gjx already mentioned in the TSD introduction. The final four sets are shown in <xr id="tab:foursets"/>.

<figtable id="tab:foursets">
Uniprot AC Identity Comment
99 - 90% seq .ident.
H0WW55 90.34
G1RUL9 92.25
G7MY51 95.93
F6TCZ5 95.09
Q5RC84 98.87
89 - 60% seq .ident.
F1NEX5 67.33
G3W2Y3 70.18
H0VHJ3 78.9
G5BHB4 80.65
Q5DLV9 88.37
59 - 40% seq .ident.
P07686 56.55 Structure given by 1o7a_D, 3lmy_A, 1nou_B, 1now_B, 1np0_B
D3B4C6 40
G8LK57 45
E1ZMZ8 50
F1NTQ2 60.1
39 - 20% seq .ident.
Q06GJ0 27.45 Structure given by 3nsn_A, 3nsm_A, 3ozo_A, 3ozp_A
D0VX21 20.68 Structure given by 3gh5_A
B3JMV2 20
A6NPT1 25
D4DXX2 30
Table 7: 20 sequences devided into 4 equal buckets of sequence identity.


For the MSA data set the first objective was to find proteins with uniformly distributed sequence identity to fit the 4 buckets given in the taks description. Since there was no PDB hit for the 89 - 60% range set anyway the decision was made to use the <40%, >60%, all classification. Of course for that, additional 5 sequences with 40% sequence identity had to be found to fill the <40% set. The other ones were filled with sequences already found among the four ranges before. From these sets, and by additional searches to find 10 sequences in every cluster, the three datasets shown in <xr id="tbl:threesets"/> were compiled. The PDB ID associated with every set is denoted. It was chosen from the ones available and checked to be in the same cluster as the reference protein structure in the strongest COPS classification type possible. Sets were filled with ranges of identities as equal as possible and lower e-Values were preferred.

<figtable id="tbl:threesets">
Identity < 40% Identity > 60% Whole range
3nsn_A 2gjx_A 3lmy_A
P06865 P06865 P06865
Q06GJ0 F1NEX5 D3B4C6
A6NPT1 H0VHJ3 P07686
D1VWZ7 H0WW55 C1F7Y6
C1F7Y6 Q5RC84 G7MY51
Table 8: 3 groups of different sequence identity with 11 members.




<figure id="fig:gaps">

Figure 4: Gaps inserted by the different MSA methods.


<xr id="tbl:conservedColumns"/> displays the count of conserved columns in each of the MSAs. It stands out that the conservation in both the alignments of sequences with an identity below 40% and those with the whole range of identity is considerably low with only around 3 columns, whereas with 60% sequence identity a far better alignment conservation can be observed.

<xr id="fig:gaps"/> gives an overview of the gaps inserted in the alignments by ClustalW, Muscle and T-Coffee. T-Coffee has the highest gap insertions which is almost 2 times higher than ClustalW. The group of sequences with below 40% identity receives the greatest amount of gaps in Muscle and T-Coffee but not in ClustalW where the proteins from the whole identity range get inserted the most gaps.

<figtable id="tbl:conservedColumns">
# Conserved columns ClustalW Muscle T-Coffee
Identity < 40% 1 3 4
Identity > 60% 254 255 255
Whole range 3 6 3
Table 9: Conserved columns in MSA outputs


<xr id="alignments"/> contains excerpts of the MSAs at the crucial residues 322/323 and their surrounding. In the set with <40% sequence identity, differences between the methods are clearly observable. ClustalW performs best, being able to align the essential ASP322 in every of the 11 sequences. The following GLU323 is also correctly aligned and seems to have mutated in some of the sequences. T-Coffee also gets most of the sequences right at this position but seems to have put more emphasis on the preceding LEU319 thereby getting two of the sequences misaligned. The ClustalW alignment instead requires a longer gap in every but one sequence prior to the ASP322. Nonetheless, since this residue is part of an loop region, the alignment of ClustalW seems more convincing. Lastly, the alignment done by muscle seems to be the worst. Two sequences are misaligned at the 322 residue but additionally the surrounding region and most of the alignment in total seems to be highly fragment with very few regions of conservation. Several conserved columns are found in the region from residue 116 to 201, however none of these are known to be important. A possible exception is ARG178 which we initially found to be a possible hydrogen bonding partner for the ligand and which is aligned in 8 sequences.

The set of sequences with >60% similarity has sequences ranging from 67-98% identity. The high conservation is immediately apparent in the MSA: All methods align all essential positions correctly. Even in the regions of lower conservation at the beginning of the sequences there is hardly a difference between the alignments. In terms of alignment quality no method seems to stand out, so taking this aside, muscle could be considered as the best choice due to shorter runtime.

On the set with the full sequence identity range T-Coffee delivers the most promising alignment at position 322. ASP322 is aligned with other aspartic acids and two glutamic acids, which due to their similar structure and charged property are plausible substitutions. Similarly GLU323 is aligned in all but one sequence. ClustalW and Muscle show slightly worse performances with one sequence clearly being aligned wrong. It is D3B4C6 in both cases, which seems to be only far related to our protein of interest. It is the one sequence off in many of the otherwise conserved columns.

<figtable id="alignments">

Identity < 40% Identity > 60% Range
Table 10: Alignments around the hex A position 322. Deformation energies and Eigenvalues.



Muscle performed worst on sequences with low similarity, however took up to the other methods on such with higher similarity. Since it is faster than ClustalW and T-Coffee, if high sequence similarity is given, Muscle should be considered as the method of choice. For sequences with lower similarity ClustalW showed the most promising results, performing slightly better than T-Coffee. That being said, the recently released ClustalOmega <ref name="Sievers2011">Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., Li, W., Lopez, R., et al. (2011). Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular systems biology, 7(539), 539. doi:10.1038/msb.2011.75</ref> might be interesting to look at. For a set with sequence identites ranging from low to high T-Coffee seems to be the best performer, however a decision is hard to make in this case and the other methods should be considered as well.

T-Coffee 3D

In this mode T-Coffee is given a structure for every input sequence. The additional information can be used to improve the multiple sequence alignment. However since the datasets contain very few sequences for which structures can be obtained, T-Coffee in 3D mode cannot be applied to all of them. One idea is to force a 'representative' template structure on all sequences, however the sequences in the dataset do not share enough identity and T-Coffee refuses to use the template. Since the identities are so low, forcing the structure can hardly improve anything. On the other hand, using the fallback (proba_pair) is not desired either, since this is just a normal pairwise sequence alignment method. Instead a subset of the dataset with <40% sequence identity was built so that every sequence has a known structure. The set consists of the reference P06865 (2gjxA), as well as Q06GJ0 (3nsnA) and D0VX21 (3gh5A).

Additionally it should be noted that the default sap_pair structural aligner was not used due to buffer overflow errors and the built-in pdb_pair was used instead.


Quantitatively there is hardly a difference between the alignment using structural information and the one with only sequences as input. The number of conserved columns is 73 and 70, and the gaps only differ by one for each sequence. Looking at the alignment in more detail a few differences can be made out. Interesting are the first 8 residues which are (poorly) aligned in T-Coffee but not in 3D-Coffee, likely because these parts of the sequence are not part of the PDB entry.

Finally it must also be noted again, that an additional difficulty is introduced by using the Uniprot sequences which first have to be aligned to the PDB atom record. Nonetheless the Uniprot sequences are the ones, one would probably work with in reality so this should be considered the more realistic case. Another difference is found from position 75 in the reference protein on. The following section describes a loop region, that is unresolved in 2gjx and 3lmy. While T-Coffee tries to align the region 3D-Coffee correctly recognizes the lack of homology and accordingly places gaps for the region. From this point on both alignment methods agree in the core as well as the outer regions of the protein with only very minor variations in the placement of gaps. This agreement can be attributed to the high propensity of secondary structure assumed in these regions. The coiled regions are very short and sometimes even buried among secondary structure elements giving them hardly any room to deviate. The high agreement includes the catalytically important residues ARG178, ASP207, ASP322 and GLU462 that were already highlighted in the introduction. It does not go for ASN423 which shows no conservation among the three methods.


To summarize, 3D-Coffee hinted at its capability to improve alignment quality given additional information in the form of structural data. While the alignment could only be slightly improved in this case, it was mostly attributed to the fact, that it was good to begin with. From the results observed here there seems to be no disadvantage in using 3D-Coffee over T-Coffee, the largest hurdle being the common lack of structures for query sequences.