Sequence Alignments Hemochromatosis

From Bioinformatikpedia
Revision as of 15:58, 7 May 2012 by Bernhoferm (talk | contribs) (Undo revision 18211 by Bernhoferm (Talk))

Henry Frankenstein: Look! It's moving. It's alive. It's alive... It's alive, it's moving, it's alive, it's alive, it's alive, it's alive, IT'S ALIVE!

Victor Moritz: Henry - In the name of God!

Henry Frankenstein: Oh, in the name of God! Now I know what it feels like to be God!

Task Description

Protocol

Sorry for the inconvenience (not being able to read something); we are rerunning some of the data...


Reference Sequence

Sequence from Uniprot: Q30201

>sp|Q30201|HFE_HUMAN Hereditary hemochromatosis protein OS=Homo sapiens GN=HFE PE=1 SV=1
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVF
YDHESRRVEPRTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQV
ILGCEMQEDNSTEGYWKYGYDGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNR
AYLERDCPAQLQQLLELGRGVLDQQVPPLVKVTHHVTSSVTTLRCRALNYYPQNITMKWL
KDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRYTCQVEHPGLDQPLIVIWEPS
PSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE

Sequence Searches


BLAST

<figtable id="blastdist">

Hemo eval blast 80.png
Hemo ident blast 80.png
Table 1: E-Value and identity distributions of the Blast search against Big80.

</figtable>

The first BLAST search against the Big80 database reached the hit limit of 250 sequences, with an e-Value of e-30 for the worst hit. We therefore repeated the search with a limit of 1500 reported hits. These hits were then filtered for unique IDs and by an e-Value cutoff of 2e-3. After filtering, 1159 hits were left.
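The filtering step can be sketched as follows. This is only a minimal illustration: the function name `filter_hits`, the `(ID, e-Value)` tuple layout and the inclusive cutoff are our assumptions, and the input is toy data rather than real search output.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, float]  # (hit ID, e-Value); assumed in-memory form of one parsed hit


def filter_hits(hits: Iterable[Hit], max_evalue: float = 2e-3) -> List[Hit]:
    """Keep each ID once (with its best e-Value), then apply the cutoff."""
    best: Dict[str, float] = {}
    for hit_id, evalue in hits:
        # The same ID may be reported more than once; keep its lowest e-Value.
        if hit_id not in best or evalue < best[hit_id]:
            best[hit_id] = evalue
    # Assumption: the cutoff is inclusive (e-Value <= 2e-3).
    return [(hit_id, evalue) for hit_id, evalue in best.items() if evalue <= max_evalue]


# Toy input, not real search results.
toy_hits = [("A", 1e-50), ("B", 5e-3), ("A", 1e-20), ("C", 2e-3)]
filtered = filter_hits(toy_hits)
print(filtered)
```

Here "A" is kept once with its better e-Value, "B" is dropped by the cutoff and "C" sits exactly on the cutoff.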

The distributions of the e-Values and identities of these hits are shown in <xr id="blastdist"/>. Most of the e-Values lie between 1e-50 and 2e-3 (the cutoff); only a few hits have a better e-Value. The identities cluster between 20% and 40%, with two peaks at around 27% and 34%.

For the evaluation of the results we compared the hits' GO terms and COPS classifications. <xr id="blastgo"/> shows that the majority of the hits share almost all of their GO terms with the HFE protein. In contrast, most of the hits cover only about 10% to 15% of HFE's GO terms. This is probably because most of the hit proteins have fewer GO terms than HFE (27 GO terms). The COPS classification (see <xr id="blastcops"/>) shows that most (59) of the 69 PDB entries share at least 80% structure similarity with HFE, but only a few have a high sequence similarity (which is in accordance with the identity distribution above). Overall, BLAST shows the best performance for finding proteins of similar structure (see <xr id="allcops"/>).
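The two GO term overlaps discussed above (share of the hit's terms and share of HFE's terms) can be computed with plain set operations. The sketch below is an illustration only: the function name `go_overlap` is ours and the GO identifiers are placeholders, not HFE's real annotation.

```python
from typing import Iterable, Tuple


def go_overlap(hit_terms: Iterable[str], query_terms: Iterable[str]) -> Tuple[float, float]:
    """Return (fraction of the hit's terms shared, fraction of the query's terms shared)."""
    hit_set, query_set = set(hit_terms), set(query_terms)
    shared = hit_set & query_set
    frac_of_hit = len(shared) / len(hit_set) if hit_set else 0.0
    frac_of_query = len(shared) / len(query_set) if query_set else 0.0
    return frac_of_hit, frac_of_query


# Placeholder annotation: 27 made-up terms for the query, 4 for one hit.
query_go = [f"GO:{i:07d}" for i in range(1, 28)]
hit_go = ["GO:0000001", "GO:0000002", "GO:0000003", "GO:9999999"]

frac_of_hit, frac_of_query = go_overlap(hit_go, query_go)
print(f"{frac_of_hit:.0%} of the hit's terms, {frac_of_query:.0%} of the query's terms")
```

A hit with few annotations can share most of its own terms with the query while covering only a small part of the query's 27 terms, which is the asymmetry described in the text.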

<figtable id="blastgo">

Overlap of hit's GO terms.
Overlap of HFE's GO terms.
Table 2: Common GO terms within BLAST hits.

</figtable>

<figure id="blastcops">

Figure 1: COPS classification of BLAST hits.

</figure>


PSI-BLAST

<figtable id="psiblastdist">

Hemo eval psi 80.png
Hemo ident psi 80.png
Table 3: E-Value and identity distributions of the PSI-BLAST search against Big80.

</figtable>

After the BLAST search, we also performed several searches with PSI-BLAST. This time we increased the number of reported hits and used a variety of parameter combinations to test their impact on the search results. The parameters we varied were 'h' and 'j'. The first, 'h', sets the e-Value cutoff for the inclusion of sequences in the PSI-BLAST profile. The second, 'j', sets the number of iterations of the PSI-BLAST search. For each parameter we used two values: 2e-3 and 1e-10 for 'h', and 2 and 10 for 'j'. This resulted in a total of 4 combinations.
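The four parameter combinations can be generated programmatically. The sketch below assumes the legacy `blastpgp` executable, whose `-h` and `-j` options match the parameter names used here; the query file name, database name and output file names are hypothetical placeholders. The commands are only built, not executed.

```python
from itertools import product
from typing import List

INCLUSION_EVALUES = ["2e-3", "1e-10"]  # values for 'h' (profile inclusion cutoff)
ITERATIONS = [2, 10]                   # values for 'j' (number of iterations)


def build_commands(query: str = "query.fasta", database: str = "big_80") -> List[List[str]]:
    """Build one blastpgp argument list per (h, j) combination."""
    commands = []
    for h, j in product(INCLUSION_EVALUES, ITERATIONS):
        commands.append([
            "blastpgp",
            "-i", query,                       # query sequence (placeholder file name)
            "-d", database,                    # database (placeholder name)
            "-h", h,                           # e-Value cutoff for profile inclusion
            "-j", str(j),                      # number of iterations
            "-o", f"psiblast_h{h}_j{j}.out",   # output file (placeholder name)
        ])
    return commands


commands = build_commands()
for command in commands:
    print(" ".join(command))
```

Each argument list could then be passed to `subprocess.run` if the executable and database are available.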


The first 4 searches were against Big80 with a maximum of 10000 reported hits. We also saved the PSI-BLAST profiles for later (see below). The hits for each individual parameter combination were again filtered for unique IDs and by an e-Value cutoff of 2e-3. This resulted in the following numbers of hits:

  • h=2e-3, j=2: 1892 (2786 prefiltered)
  • h=2e-3, j=10: 1704 (2734 prefiltered)
  • h=1e-10, j=2: 2058 (3574 prefiltered)
  • h=1e-10, j=10: 2035 (3458 prefiltered)

<xr id="psiblastdist"/> shows the e-Value and identity distributions for the Big80 results. In contrast to the previous BLAST search, we have more significant e-Values, but the identities shift slightly towards lower values. The differences between the parameter combinations are quite easy to spot. The lower e-Value cutoff (1e-10) produces more significant hits (lower e-Values). This might be caused by the inclusion of fewer sequences in the profile and therefore a higher specificity for more closely related sequences (i.e. low e-Values). An increased number of iterations, on the other hand, reduces the number of significant hits and seems to slightly reduce the average identity.


After the searches against Big80 we also ran PSI-BLAST against the Big database. We reused the profiles from the Big80 runs and increased the maximum number of reported hits to 100000. This resulted in the following numbers of hits:

  • h=2e-3, j=2: 23840 (25934 prefiltered)
  • h=2e-3, j=10: 25756 (30616 prefiltered)
  • h=1e-10, j=2: 26483 (28766 prefiltered)
  • h=1e-10, j=10: 27535 (29609 prefiltered)

The two combinations with 10 iterations threw multiple error messages (but finished the process nevertheless). These errors were due to an internal code failure of PSI-BLAST and caused by too many possible hits.


The runtimes of the different PSI-BLAST runs (see <xr id="psiblastruntime"/>) show that the profile inclusion cutoff ('h') hardly affects the runtime. The number of iterations, on the other hand, has a big impact. The size of the database, of course, also affects the runtime. The exceptionally high runtime of the 10-iteration runs against Big might also be caused by the errors mentioned above.

<figtable id="psiblastruntime">

Iterations   2       2       10       10
E-Value      0.002   1e-10   0.002    1e-10
Big80        3m21    3m6     16m39    16m41
Big          28m17   26m43   367m15   64m4
Table 4: Runtime analysis of PSI-BLAST.

</figtable>
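To make the effect of the iteration count concrete, the runtimes from Table 4 can be converted to seconds and compared. This is a small sketch: the helper name `to_seconds` and the dictionary layout are our own, and we assume the table entries are minutes and seconds.

```python
import re


def to_seconds(runtime: str) -> int:
    """Convert strings such as '3m21' or '13m 13s' to seconds."""
    match = re.fullmatch(r"(\d+)m\s*(\d+)s?", runtime.strip())
    if match is None:
        raise ValueError(f"unrecognised runtime: {runtime!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


# (database, 'h' cutoff) -> (runtime with 2 iterations, runtime with 10 iterations),
# copied from Table 4.
RUNTIMES = {
    ("Big80", "2e-3"): ("3m21", "16m39"),
    ("Big80", "1e-10"): ("3m6", "16m41"),
    ("Big", "2e-3"): ("28m17", "367m15"),
    ("Big", "1e-10"): ("26m43", "64m4"),
}

# Ratio of the 10-iteration runtime to the 2-iteration runtime.
slowdown = {
    key: to_seconds(ten) / to_seconds(two) for key, (two, ten) in RUNTIMES.items()
}
for (database, cutoff), factor in slowdown.items():
    print(f"{database}, h={cutoff}: 10 iterations take {factor:.1f}x as long as 2")
```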


The GO term and COPS classification analyses yield results similar to BLAST. The different parameter values have almost no effect on the GO terms; the only visible effect is that a higher iteration count seems to slightly worsen the GO term overlap. In the COPS analysis (see <xr id="psiblastcops"/>) the parameters likewise make little visible difference, although a stricter e-Value cutoff and a higher iteration count both have a negative effect (see Comparison section, <xr id="allcops"/>).

<figtable id="psiblastgo">

Overlap of hit's GO terms (Big80).
Overlap of HFE's GO terms (Big80).
Overlap of hit's GO terms (Big).
Overlap of HFE's GO terms (Big).
Table 5: Common GO terms within PSI-BLAST hits.

</figtable>

<figtable id="psiblastcops">

Database: Big80
Database: Big
Table 6: COPS classification of PSI-BLAST hits.

</figtable>


HHblits

<figtable id="hhblitsdist">

Hemo eval hhblits 80.png
Hemo ident hhblits 80.png
Table 7: E-Value and identity distributions of the HHblits search against Uniprot20.

</figtable>

The final sequence search algorithm we applied was HHblits. This time we searched against a different database, Uniprot20. We set the number of reported hits (clusters) to 600, which corresponds to a worst e-Value of 0.0021. After filtering for unique clusters, 585 clusters were left. These clusters contained 27588 unique Uniprot ACs. The most significant cluster, with an e-Value of 2e-131 and an identity of 40%, accounted for 2771 (about 10%) of these Uniprot ACs.

<xr id="hhblitsdist"/> shows the distribution of the clusters' e-Values and identities. As in the BLAST results, most e-Values are between 1e-50 and 2e-3. The majority of the identities are again between 20% and 40%, with a peak at around 26%.

The runtime was 13m 13s, which places HHblits more or less between BLAST and PSI-BLAST in terms of runtime.


HHblits shows about the same results in the GO term analysis (<xr id="hhblitsgo"/>) regarding the percentage of HFE's GO terms. In contrast to the previous results, HHblits also has a high peak for the hit's GO terms at around 65% to 70%. This means that HHblits finds more proteins with a more distantly related function than the other algorithms. The COPS classification is similar to PSI-BLAST (database: Big) in the L99 and L80 clusters, but HHblits performs better than PSI-BLAST (database: Big) at finding proteins in the L30, L40, and L60 clusters.

<figtable id="hhblitsgo">

Overlap of hit's GO terms.
Overlap of HFE's GO terms.
Table 8: Common GO terms within HHblits hits.

</figtable>

<figure id="hhblitscops">

Figure 2: COPS classification of HHblits hits.

</figure>


Comparison

<figtable id="psiblastoverlap">

Database: Big80.
Database: Big.
Overlap between the different PSI-BLAST runs.

</figtable>


<figtable id="alloverlap">

Databases: Big80, Uniprot20.
Databases: Big, Uniprot20.
Databases: Big, Uniprot20.
Database: Big80.
Overlap between the different search algorithms. In the first figure (only) PSI-BLAST includes the sum of all unique hits from all PSI-BLAST iterations (against Big80).

</figtable>
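The overlaps between the hit sets of the different searches come down to set intersections. The sketch below is illustrative only: the function names are ours and the IDs are toy placeholders, not actual hits.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def pairwise_overlap(hit_sets: Dict[str, Iterable[str]]) -> Dict[Tuple[str, str], int]:
    """Number of shared IDs for every pair of searches."""
    sets = {name: set(ids) for name, ids in hit_sets.items()}
    return {(a, b): len(sets[a] & sets[b]) for a, b in combinations(sets, 2)}


def found_by_all(hit_sets: Dict[str, Iterable[str]]) -> Set[str]:
    """IDs reported by every search."""
    sets = [set(ids) for ids in hit_sets.values()]
    return set.intersection(*sets) if sets else set()


# Toy IDs, not real search results.
toy_sets = {
    "BLAST": ["P1", "P2", "P3"],
    "PSI-BLAST": ["P2", "P3", "P4"],
    "HHblits": ["P3", "P5"],
}
overlaps = pairwise_overlap(toy_sets)
core = found_by_all(toy_sets)
print(overlaps, core)
```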

<figure id="allcops">

Figure 3: COPS classification statistics for all algorithms.

</figure>


Multiple Sequence Alignments


Dataset

<figtable id="msagroup60">

Uniprot AC (Group 60-99) Identity Comment
G3QU39 99.14 Uncharacterized protein OS=Gorilla gorilla gorilla
H2PI54 97.41 Uncharacterized protein OS=Pongo abelii
Q6B0J5 95.98 HFE protein OS=Homo sapiens
G7P2L8 94.54 Putative uncharacterized protein OS=Macaca fascicularis
F7GRH8 90.52 Uncharacterized protein OS=Callithrix jacchus
F7DKE9 86.18 Uncharacterized protein OS=Callithrix jacchus
Q9GL42 79.89 Hereditary hemochromatosis protein homolog OS=Dicerorhinus sumatrensis
F6RUG7 78.16 Uncharacterized protein OS=Equus caballus
G3THV5 75.21 Uncharacterized protein (Fragment) OS=Loxodonta africana
G5BQE5 67.66 Hereditary hemochromatosis protein-like protein OS=Heterocephalus glaber

</figtable>

<figtable id="msagroup40">

Uniprot AC (Group 00-40) Identity Comment
P16391 36.03 RT1 class I histocompatibility antigen, AA alpha chain OS=Rattus norvegicus
P05534 35.13 HLA class I histocompatibility antigen, A-24 alpha chain OS=Homo sapiens
Q30597 33.72 MHC class I Mamu-A*02 (Fragment) OS=Macaca mulatta
P01900 32.11 H-2 class I histocompatibility antigen, D-D alpha chain OS=Mus musculus
Q31093 31.74 Histocompatibility 2, M region locus 3 OS=Mus musculus
P14432 29.14 H-2 class I histocompatibility antigen, TLA(B) alpha chain OS=Mus musculus
Q860W6 27.74 Major histocompatibility complex class Ib M10.5 (Fragment) OS=Mus musculus
Q31615 26.38 MHC class I H2-TL-27-129 mRNA (b haplotype), complete cds OS=Mus musculus
Q31206 25.87 MHC class I H2-TL-T10-129 mRNA (b haplotype), complete cds OS=Mus musculus
P01921 21.10 H-2 class II histocompatibility antigen, A-D beta chain OS=Mus musculus

</figtable>

<figtable id="msagroup00">

Uniprot AC (Group 00-99) Identity Comment
Q6B0J5 95.98 HFE protein OS=Homo sapiens
F7GRH8 90.52 Uncharacterized protein OS=Callithrix jacchus
F7DKE9 86.18 Uncharacterized protein OS=Callithrix jacchus
Q9GL42 79.89 Hereditary hemochromatosis protein homolog OS=Dicerorhinus sumatrensis
G5BQE5 67.66 Hereditary hemochromatosis protein-like protein OS=Heterocephalus glaber
G1PHG2 57.43 Uncharacterized protein (Fragment) OS=Myotis lucifugus
F7C3B3 40.11 Uncharacterized protein OS=Macaca mulatta
Q30597 33.72 MHC class I Mamu-A*02 (Fragment) OS=Macaca mulatta
Q860W6 27.74 Major histocompatibility complex class Ib M10.5 (Fragment) OS=Mus musculus
P01921 21.10 H-2 class II histocompatibility antigen, A-D beta chain OS=Mus musculus

</figtable>


ClustalW

View of the alignment of the 0-40 group aligned by ClustalW

ClustalW 0-40

Sequence Gap
Q30597 79
P01900 71
Q860W6 109
Q31093 100
Q31615 57
P05534 71
P01921 171
Q30201 88
P16391 65
P14432 52
Q31206 30
Conserved
23
View of the alignment of the 0-99 group aligned by ClustalW

ClustalW 0-99

Sequence Gap
Q30597 42
Q860W6 72
G5BQE5 32
Q9GL42 51
G1PHG2 61
Q6B0J5 54
P01921 134
Q30201 51
F7C3B3 146
F7DKE9 62
F7GRH8 51
Conserved
21


View of the alignment of the 60-99 group aligned by ClustalW

ClustalW 60-99

Sequence Gap
G3THV5 12
H2PI54 25
F6RUG7 25
G5BQE5 6
Q9GL42 25
Q6B0J5 28
Q30201 25
G3QU39 25
F7GRH8 25
F7DKE9 36
G7P2L8 25
Conserved
175
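Per-sequence gap counts and the number of conserved columns, as listed in the tables of this section, can be derived from an alignment as sketched below. We assume here that "Gap" is the number of gap characters in a sequence and that "Conserved" is the number of columns in which all sequences carry the same residue; the source does not define either, and the toy alignment is made up.

```python
from typing import Dict


def gap_counts(alignment: Dict[str, str]) -> Dict[str, int]:
    """Number of gap characters ('-') per aligned sequence."""
    return {name: sequence.count("-") for name, sequence in alignment.items()}


def conserved_columns(alignment: Dict[str, str]) -> int:
    """Number of columns without gaps in which every sequence has the same residue."""
    rows = list(alignment.values())
    return sum(
        1
        for column in zip(*rows)  # iterate over alignment columns
        if "-" not in column and len(set(column)) == 1
    )


# Toy alignment (all rows have equal length), not taken from the real data.
toy_alignment = {
    "seq1": "MK-LV",
    "seq2": "MKALV",
    "seq3": "M-ALI",
}
gaps = gap_counts(toy_alignment)
conserved = conserved_columns(toy_alignment)
print(gaps, conserved)
```

In the toy alignment only the first and fourth columns are identical across all three sequences.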



Muscle

View of the alignment of the 0-40 group aligned by Muscle

Muscle Group 0-40

Sequence Gap
P01900 82
Q30597 90
Q31093 111
Q860W6 120
Q31615 68
P05534 82
Q30201 99
P01921 182
P16391 76
P14432 63
Q31206 41
Conserved
22
View of the alignment of the 0-99 group aligned by Muscle

Muscle Group 0-99

Sequence Gap
Q30597 44
Q860W6 74
G5BQE5 34
Q9GL42 53
Q6B0J5 56
G1PHG2 63
Q30201 53
P01921 136
F7C3B3 148
F7DKE9 64
F7GRH8 53
Conserved
22


View of the alignment of the 60-99 group aligned by Muscle

Muscle Group 60-99

Sequence Gap
G3THV5 12
H2PI54 25
F6RUG7 25
G5BQE5 6
Q9GL42 25
Q6B0J5 28
Q30201 25
G3QU39 25
F7GRH8 25
F7DKE9 36
G7P2L8 25
Conserved
175



T-Coffee

View of the alignment of the 0-40 group aligned by T-Coffee

T-Coffee 0-40

Sequence Gap
Q30597 92
P01900 84
Q31093 113
Q860W6 122
Q31615 70
P05534 84
P01921 184
Q30201 101
P14432 65
P16391 78
Q31206 43
Conserved
2
View of the alignment of the 0-99 group aligned by T-Coffee

T-Coffee 0-99

Sequence Gap
Q30597 51
Q860W6 81
G5BQE5 41
Q9GL42 60
Q6B0J5 63
G1PHG2 70
P01921 143
Q30201 60
F7C3B3 155
F7DKE9 71
F7GRH8 60
Conserved
22


View of the alignment of the 60-99 group aligned by T-Coffee

T-Coffee 60-99

Sequence Gap
G3THV5 12
H2PI54 25
F6RUG7 25
G5BQE5 6
Q9GL42 25
Q6B0J5 28
Q30201 25
G3QU39 25
G7P2L8 25
F7GRH8 25
F7DKE9 25
Conserved
175



3D-Coffee