Sequence Alignments Hemochromatosis
Hemochromatosis>>Task 2: Sequence alignments
Contents
Short Task Description
- Apply and compare different sequence search algorithms.
- Use the results for MSAs
Detailed description: Task_alignments
Protocol
We've made a protocol with the commands and scripts we used to create our results: Protocol
Reference Sequence
Sequence from Uniprot: Q30201
>sp|Q30201|HFE_HUMAN Hereditary hemochromatosis protein OS=Homo sapiens GN=HFE PE=1 SV=1 MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
Sequence Searches
In this task we've tested three different sequence search algorithms against each other. We also used three different databases:
- Big80 for BLAST and PSI-BLAST
- Big for PSI-BLAST
- Uniprot20 for HHblits
The results for each search run were filtered for unique hits and an e-Value cutoff of 2e-3. The cutoff was chosen to be in accordance with the given minimum cutoff (2-e3) for the PSI-BLAST profiles. For the evaluation of the hits we used two methods:
- GO term analysis: We compared each hit's GO term with the GO terms of HFE (Q30201; 27 GO terms). Then we computed two percentages. The first one is the percentage of GO terms the hit has in common with HFE, the second one is the percentage of GO terms HFE has in common with the hit.
- COPS classification: For each hit that could be mapped to a PDB ID we checked if it was within the same COPS cluster (L30-L99,S30,S90) as HFE.
BLAST
<figtable id="blastdist">
Table 1: E-Value and identity distributions of the BLAST search against Big80. |
</figtable>
The first BLAST search against the Big80 database reached the hit limit of 250 sequences with an e-Value of e-30 for the worst hit. So we did it again with a new limit of 1500 reported hits. These hits were then filtered for unique IDs and an e-Value cutoff of 2e-3. After the filtering 1159 hits were left.
The distributions for the e-Values and identities of these hits are shown in <xr id="blastdist"/>. Most of the e-Values are between 1e-50 and 2e-3 (cutoff). Only few hits have a better e-Value. The identities are piled between 20% and 40% with two peaks at around 27% and 34% respectively.
For the evaluation of the results we compared the hits' GO terms and COPS classifications. <xr id="blastgo"/> shows that the majority of the hits share almost all of their GO terms with the HFE protein. In contrast, only about 10% to 15% of HFE's GO terms are shared by most of the hits. This might be caused by the fact that most of the hit proteins didn't have as much GO terms as HFE (27 GO terms). The COPS classification (see <xr id="blastcops"/>) shows that most (59) of the 69 PDB entries share at least 80% structure similarity with HFE, but only a few have a high sequence similarity (which is in accordance with the previous identity statistic). Overall BLAST shows the best performance for finding proteins of similar structure (see <xr id="allcops"/>).
<figtable id="blastgo">
Table 2: Common GO terms within BLAST hits. |
</figtable>
<figure id="blastcops">
</figure>
PSI-BLAST
<figtable id="psiblastdist">
Table 3: E-Value and identity distributions of the PSI-BLast search against Big80. |
</figtable>
After the BLAST search, we also performed several searches with PSI-BLAST. This time we increased the number of reported hits and used a variety of parameter combinations to test their impact on the search results. The parameters to be changed were 'h' and 'j'. The first one, 'h', sets the e-Value cutoff for the inclusion of sequences into the PSI-BLAST-profile. The second one, 'j', is for the number of iterations for the PSI-BLAST search. For each of these parameters we used two different values: 2e-3 and 1e-10 for 'h', 2 and 10 for 'j'. This resulted in a total of 4 combinations.
The first 4 searches were against Big80 with a maximum of 10000 reported hits. We also saved the PSI-BLAST profiles for later (see bellow). The hits for each individual parameter combination were again filtered for unique IDs and an e-Value cutoff of 2e-3. This resulted in the following number of hits:
- h=2e-3, j=2: 1892 (2786 prefiltered)
- h=2e-3, j=10: 1704 (2734 prefiltered)
- h=1e-10, j=2: 2058 (3574 prefiltered)
- h=1e-10, j=10: 2035 (3458 prefiltered)
<xr id="psiblastdist"/> shows the e-Value and identity distributions for the Big80 results. In contrast to the previous BLAST search, we have more significant e-Values, but the identity shifts a bit to the left (lower identity). The differences between the parameter combinations are quite easy to spot. The lower e-Value cutoff (1-e10) also produces more significant hits (lower e-Values). This might be caused by the inclusion of fewer sequences into the profile and therefore a higher specificity for more closely related sequences (i.e. low e-Values). An increased number of iterations on the other hand reduces the number of significant hits and seems to slightly reduce the average identity.
After the searches against Big80 we also ran PSI-BLAST against the Big database. We reused the profiles from the Big80 runs and also increased the maximum of reported hits to 100000.
- h=2e-3, j=2: 23840 (25934 prefiltered)
- h=2e-3, j=10: 25756 (30616 prefiltered)
- h=1e-10, j=2: 26483 (28766 prefiltered)
- h=1e-10, j=10: 27535 (29609 prefiltered)
The two combinations with 10 iterations threw multiple error messages (but finished the process nevertheless). These errors were due to an internal code failure of PSI-BLAST and caused by too many possible hits[1].
The performances of the different PSI-BLAST runs (see <xr id="psiblastruntime"/>) show that the cutoff for the profiles ('h') doesn't really affect the runtime. The number of iteration on the other hand has a big impact on the runtime. The size of the database, of course, also affects the runtime. The exceptionally high runtime for the 10-iteration runs against Big might also be caused by the errors mentioned above.
<figtable id="psiblastruntime">
Iterations | 2 | 2 | 10 | 10 |
---|---|---|---|---|
E-Value | 0.002 | 10E-10 | 0.002 | 10E-10 |
Big80 | 3m21 | 3m6 | 16m39 | 16m41 |
Big | 28m17 | 26m43 | 367m15 | 64m4 |
Table 4: Runtime analysis of PSI-BLAST. </figtable>
The GO term and COPS classification analysis yields similar results to BLAST. The different parameter values have almost no effect on the GO terms. The only visible effect is that a higher iteration count seems to have a small negative effect on the GO term analysis. In the COPS analysis (see <xr id="psiblastcops"/>) the parameters seem to make almost no difference in the distribution of the different clusters (L30-L99,S30,S90), but a more strict e-Value cutoff and higher iteration count both have a negative effect on the overall percentages of hits within these clusters (see <xr id="allcops"/>).
<figtable id="psiblastgo">
Table 5: Common GO terms within PSI-BLAST hits. |
</figtable>
<figtable id="psiblastcops">
Table 6: COPS classification of PSI-BLAST hits. |
</figtable>
HHblits
<figtable id="hhblitsdist">
Table 7: E-Value and identity distributions of the HHblits search against Uniprot20. |
</figtable>
The final sequence search algorithm we apllied was HHblits. This time we searched against another database, Uniprot20. We set the number of reported hits (clusters) to 600 which corresponds to a worst e-Value of 0.0021. After filtering for unique clusters, we had 585 clusters left. Within these clusters we had 27588 unique Uniprot ACs. The most significant cluster with an e-Value of 2e-131 and an 40% identities accounted for 2771 (about 10% total) of these Uniprot ACs.
In <xr id="hhblitsdist"/> you can see the distribution of the clusters' e-Values and identities. Like in the BLAST results, most e-Values are between 1e-50 and 2e-3. The majority of the identities are again between 20% and 40%, with a peak at around 26%.
The runtime was 13m 13s. This means that HHblits' is more or less between BLAST's and PSI-BLAST's runtime.
HHblits shows about the same results for the GO term analysis (<xr id="hhblitsgo"/>) regarding the percentage of HFE's GO terms. In contrast to the previous results, HHblits also has a high peak for the hit's GO terms at around 65% to 70%. This means that HHblits finds more proteins with a more distant related function than the other algorithms. The COPS classification (<xr id="hhblitscops"/> and <xr id="allcops"/>) is similar to PSI-BLAST (database: Big) in the L99 and L80 clusters, but HHblits has a better performance to find proteins in the L30, L40, and L60 clusters than PSI-BLAST (database: Big).
<figtable id="hhblitsgo">
Table 8: Common GO terms within HHblits hits. |
</figtable>
<figure id="hhblitscops">
</figure>
Comparison
To compare the different search algorithms and PSI-BLAST combinations we collected the unique hits and created several Venn diagrams (Oliveros, J.C. (2007) VENNY. An interactive tool for comparing lists with Venn Diagrams.). In order to compare PSI-BLAST and HHblits we took the runs of PSI-BLAST against the Big database. BLAST was never run against Big, so all of BLAST's hits are based on the Big80 database.
<figtable id="psiblastoverlap">
Table 9: Overlap between the different PSI-BLAST runs. |
</figtable>
First we compared the different variations of PSI-BLAST runs. <xr id="psiblastoverlap"/> shows that the different parameters have only a small effect on the results as most of them are shared between all combinations. Only the adjustment of the inclusion e-Value cutoff produces some hits unique to these configurations. This is true for both databases: Big80 and Big.
<figtable id="alloverlap">
Table 10: Overlap between the different search algorithms. BLAST results are always against the Big80 database, HHblits against Uniprot20, and
PSI-BLAST against Big (left and middle figure) or Big80 (right figure). |
</figtable>
When comparing all three algorithms (<xr id="alloverlap"/>), we can see that PSI-BLAST and BLAST have the most hits in common. This is not surprising as both algorithms use a similar approach to collect thier hits. HHblits and PSI-BLAST share about half of their hits. This seems also to be true for HHblits and BLAST, but the number of BLAST hits is too small to be sure.
<figure id="allcops">
</figure>
Last we compared the COPS analysis for all algorithms and PSI-BLAST combinations (see <xr id="allcops"/>). BLAST shows the best performance among all of them in the L99 and L80 cluster, though it is equal to PSI-BLAST (database: Big80) in the L30 to L60 clusters. When comparing the results for HHblits and PSI-BLAST (database: Big), HHblits yields slightly better results in the L30 to L60 range.
Multiple Sequence Alignments
Dataset
We created three groups of sequences for the MSAs. One group with only 40% (or lower) sequence identity to HFE, one with 60% identity or higher (max 99%), and a last one with 20% to 99% identity. We tried to get several sequences with PDB entries into the lists, but the only sequence with a identity higher than 40% was Q30201 (HFE) itself. In contrast, the whole 40% or lower group consists only of Uniprot ACs that can be mapped to PDB entries. Of course, every group also includes the HFE protein which means that there are really 3 groups with 11 sequences each.
<figtable id="msagroup60">
Uniprot AC (Group 60-99) | Identity | Comment |
---|---|---|
G3QU39 | 99.14 | Uncharacterized protein OS=Gorilla gorilla gorilla |
H2PI54 | 97.41 | Uncharacterized protein OS=Pongo abelii |
Q6B0J5 | 95.98 | HFE protein OS=Homo sapiens |
G7P2L8 | 94.54 | Putative uncharacterized protein OS=Macaca fascicularis |
F7GRH8 | 90.52 | Uncharacterized protein OS=Callithrix jacchus |
F7DKE9 | 86.18 | Uncharacterized protein OS=Callithrix jacchus |
Q9GL42 | 79.89 | Hereditary hemochromatosis protein homolog OS=Dicerorhinus sumatrensis |
F6RUG7 | 78.16 | Uncharacterized protein OS=Equus caballus |
G3THV5 | 75.21 | Uncharacterized protein (Fragment) OS=Loxodonta africana |
G5BQE5 | 67.66 | Hereditary hemochromatosis protein-like protein OS=Heterocephalus glaber |
</figtable>
<figtable id="msagroup40">
Uniprot AC (Group 00-40) | Identity | Comment |
---|---|---|
P16391 | 36.03 | RT1 class I histocompatibility antigen, AA alpha chain OS=Rattus norvegicus |
P05534 | 35.13 | HLA class I histocompatibility antigen, A-24 alpha chain OS=Homo sapiens |
Q30597 | 33.72 | MHC class I Mamu-A*02 (Fragment) OS=Macaca mulatta |
P01900 | 32.11 | H-2 class I histocompatibility antigen, D-D alpha chain OS=Mus musculus |
Q31093 | 31.74 | Histocompatibility 2, M region locus 3 OS=Mus musculus |
P14432 | 29.14 | H-2 class I histocompatibility antigen, TLA(B) alpha chain OS=Mus musculus |
Q860W6 | 27.74 | Major histocompatibility complex class Ib M10.5 (Fragment) OS=Mus musculus |
Q31615 | 26.38 | MHC class I H2-TL-27-129 mRNA (b haplotype), complete cds OS=Mus musculus |
Q31206 | 25.87 | MHC class I H2-TL-T10-129 mRNA (b haplotype), complete cds OS=Mus musculus |
P01921 | 21.10 | H-2 class II histocompatibility antigen, A-D beta chain OS=Mus musculus |
</figtable>
<figtable id="msagroup00">
Uniprot AC (Group 00-99) | Identity | Comment |
---|---|---|
Q6B0J5 | 95.98 | HFE protein OS=Homo sapiens |
F7GRH8 | 90.52 | Uncharacterized protein OS=Callithrix jacchus |
F7DKE9 | 86.18 | Uncharacterized protein OS=Callithrix jacchus |
Q9GL42 | 79.89 | Hereditary hemochromatosis protein homolog OS=Dicerorhinus sumatrensis |
G5BQE5 | 67.66 | Hereditary hemochromatosis protein-like protein OS=Heterocephalus glaber |
G1PHG2 | 57.43 | Uncharacterized protein (Fragment) OS=Myotis lucifugus |
F7C3B3 | 40.11 | Uncharacterized protein OS=Macaca mulatta |
Q30597 | 33.72 | MHC class I Mamu-A*02 (Fragment) OS=Macaca mulatta |
Q860W6 | 27.74 | Major histocompatibility complex class Ib M10.5 (Fragment) OS=Mus musculus |
P01921 | 21.10 | H-2 class II histocompatibility antigen, A-D beta chain OS=Mus musculus |
</figtable>
ClustalW
ClustalW 0-40
<figure id="clustal040">
</figure>
Sequence | Gap |
---|---|
Q30597 | 79 |
P01900 | 71 |
Q860W6 | 109 |
Q31093 | 100 |
Q31615 | 57 |
P05534 | 71 |
P01921 | 171 |
Q30201 | 88 |
P16391 | 65 |
P14432 | 52 |
Q31206 | 30 |
Conserved |
---|
23 |
ClustalW 0-99
<figure id="clustal099">
</figure>
Sequence | Gap |
---|---|
Q30597 | 42 |
Q860W6 | 72 |
G5BQE5 | 32 |
Q9GL42 | 51 |
G1PHG2 | 61 |
Q6B0J5 | 54 |
P01921 | 134 |
Q30201 | 51 |
F7C3B3 | 146 |
F7DKE9 | 62 |
F7GRH8 | 51 |
Conserved |
---|
21 |
ClustalW 60-99
<figure id="clustal6099">
</figure>
Sequence | Gap |
---|---|
G3THV5 | 12 |
H2PI54 | 25 |
F6RUG7 | 25 |
G5BQE5 | 6 |
Q9GL42 | 25 |
Q6B0J5 | 28 |
Q30201 | 25 |
G3QU39 | 25 |
F7GRH8 | 25 |
F7DKE9 | 36 |
G7P2L8 | 25 |
Conserved |
---|
175 |
Muscle
Muscle Group 0-40
<figure id="muscle040">
</figure>
Sequence | Gap |
---|---|
P01900 | 82 |
Q30597 | 90 |
Q31093 | 111 |
Q860W6 | 120 |
Q31615 | 68 |
P05534 | 82 |
Q30201 | 99 |
P01921 | 182 |
P16391 | 76 |
P14432 | 63 |
Q31206 | 41 |
Conserved |
---|
22 |
Muscle Group 0-99
<figure id="muscle099">
</figure>
Sequence | Gap |
---|---|
Q30597 | 44 |
Q860W6 | 74 |
G5BQE5 | 34 |
Q9GL42 | 53 |
Q6B0J5 | 56 |
G1PHG2 | 63 |
Q30201 | 53 |
P01921 | 136 |
F7C3B3 | 148 |
F7DKE9 | 64 |
F7GRH8 | 53 |
Conserved |
---|
22 |
Muscle Group 60-99
<figure id="muscle06099">
</figure>
Sequence | Gap |
---|---|
G3THV5 | 12 |
H2PI54 | 25 |
F6RUG7 | 25 |
G5BQE5 | 6 |
Q9GL42 | 25 |
Q6B0J5 | 28 |
Q30201 | 25 |
G3QU39 | 25 |
F7GRH8 | 25 |
F7DKE9 | 36 |
G7P2L8 | 25 |
Conserved |
---|
175 |
T-Coffee
T-Coffee Group 0-40
<figure id="tcoffee040">
</figure>
Sequence | Gap |
---|---|
Q30597 | 92 |
P01900 | 84 |
Q31093 | 113 |
Q860W6 | 122 |
Q31615 | 70 |
P05534 | 84 |
P01921 | 184 |
Q30201 | 101 |
P14432 | 65 |
P16391 | 78 |
Q31206 | 43 |
Conserved |
---|
22 |
T-Coffee Group 0-99
<figure id="tcoffee099">
</figure>
Sequence | Gap |
---|---|
Q30597 | 51 |
Q860W6 | 81 |
G5BQE5 | 41 |
Q9GL42 | 60 |
Q6B0J5 | 63 |
G1PHG2 | 70 |
P01921 | 143 |
Q30201 | 60 |
F7C3B3 | 155 |
F7DKE9 | 71 |
F7GRH8 | 60 |
Conserved |
---|
22 |
T-Coffee Group 60-99
<figure id="tcoffee6099">
</figure>
Sequence | Gap |
---|---|
G3THV5 | 12 |
H2PI54 | 25 |
F6RUG7 | 25 |
G5BQE5 | 6 |
Q9GL42 | 25 |
Q6B0J5 | 28 |
Q30201 | 25 |
G3QU39 | 25 |
G7P2L8 | 25 |
F7GRH8 | 25 |
F7DKE9 | 25 |
Conserved |
---|
175 |
3D-Coffee
T-Coffee Group 0-40
<figure id="3dcoffee040">
</figure>
Sequence | Gap |
---|---|
Q30597 | 92 |
P01900 | 84 |
Q31093 | 113 |
Q860W6 | 122 |
Q31615 | 70 |
P05534 | 84 |
P01921 | 184 |
Q30201 | 101 |
P14432 | 65 |
P16391 | 78 |
Q31206 | 43 |
Conserved |
---|
22 |
Conservation of important positions
Position | 3DCoffee0-40 | T-Coffee0-40 | T-Coffee0-99 | T-Coffee60-99 | Muscle0-40 | Muscle0-99 | Muscle60-99 | ClustalW0-40 | ClustalW0-99 | ClustalW60-99 |
---|---|---|---|---|---|---|---|---|---|---|
110 | 9/10(Gap) | 9/10(Gap) | 8/10(Gap) | 10/10 | 9/10(Gap) | 8/10(Gap) | 10/10 | 9/10(Mismatch) | 8/10(Gap/Mismatch) | 10/10 |
130 | X | X | X | 10/10 | X | X | 10/10 | X | X | 10/10 |
234 | X | X | X | 10/10 | X | X | 10/10 | X | X | 10/10 |
124 | 9/10(Gap) | 9/10(Gap) | 10/10 | 10/10 | 9/10(Gap) | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 |
187 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 |
225 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 9/10(Mismatch) | 10/10 | 10/10 | 10/10 | 10/10 |
282 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 |
All positions are some at which the protein is modified.
The positions are not ordered by their appearance because of their type of modification.
The first three noted positions (110, 130, 234) are positions which are modified via glycosylation. The last four are responsible for disulfide-bonds (as they're grouped together, position 124 builds a disulfide bond with position 187, the same goes for positions 225 and 282).
A X represents a non-conserved position within the alignment, meaning that 5 or less sequences have the same amino acid as the HFE sequence at that position.
One can see that in nearly all groups and in all alignments the disulfide positions are conserved. This may show that the disulfide bond is crucial for the 3d structure of the protein and therefore its function. Of the Glycosylation positions only position 110 seems to be quite conserved, meaning it has a higher probability to be important than the other glycolysation positions.
The conservation of all these positions when looking at the 60-99 groups can be explained by the high similarity of the proteins.
Also one should keep in mind that these statistics are withdrawn from 10 proteins, and due to random selection of these errors may have occured.
When looking at the domain region(positions 207-298) one can see in the MSAs that this region in all groups seems to be highly conserved. Even in the group with 20-40% identity. This is what we expected because the methods should find similar proteins and as we know mutations that last occur less often in functional domains.