Fabry Disease » Sequence alignments (sequence searches and multiple alignments)
Introduction
This page contains our results and discussions. The lab journal can be found here.
Reference sequence
The reference sequence of α-Galactosidase A that will be used in this task was obtained from Swissprot P06280.
>gi|4504009|ref|NP_000160.1| alpha-galactosidase A precursor [Homo sapiens]
MQLRNPELHLGCALALRFLALVSWDIPGARALDNGLARTPTMGWLHWERFMCNLDCQEEPDSCISEKLFM
EMAELMVSEGWKDAGYEYLCIDDCWMAPQRDSEGRLQADPQRFPHGIRQLANYVHSKGLKLGIYADVGNK
TCAGFPGSFGYYDIDAQTFADWGVDLLKFDGCYCDSLENLADGYKHMSLALNRTGRSIVYSCEWPLYMWP
FQKPNYTEIRQYCNHWRNFADIDDSWKSIKSILDWTSFNQERIVDVAGPGGWNDPDMLVIGNFGLSWNQQ
VTQMALWAIMAAPLFMSNDLRHISPQAKALLQDKDVIAINQDPLGKQGYQLRQGDNFEVWERPLSGLAWA
VAMINRQEIGGPRSYTIAVASLGKGVACNPACFITQLLPVKRKLGFYEWTSRLRSHINPTGTVLLQLENT
MQMSLKDLL
Sequence searches
Blast
GO terms of P06280 and each BLAST hit (with E-value <= 0.003) compared. The upper picture shows the percentage of shared terms relative to the number of GO terms of P06280 (AGAL_HUMAN); the second picture shows it relative to the number of GO terms of each hit.
First we performed a BLAST search with the default parameters. Since all hits were significant, we raised the number of one-line descriptions shown (-v) as well as the number of database sequences to show alignments for (-b). This led to 663 hits with an E-value smaller than or equal to 0.003, which we defined as the significance threshold for our search. For these proteins we extracted the E-value, the number of positive and of identical amino acids in the pairwise alignments, and the length of each hit. You can see a histogram of each of these features on the left.
For the comparison of the GO terms, we obtained the set of terms for each hit and counted how many of them are shared with the GO terms of the search protein α-Galactosidase A. We divided the number of common terms by the number of GO terms of P06280 (49). Since these proportions are very small, we also explored the fraction of each hit's GO terms that is shared with the reference terms, dividing the number of common terms by the number of terms of the hit. The histogram of this second ratio shows that on average over 80% of the GO terms of each hit are shared with those of AGAL. The low average coverage of the α-Galactosidase A terms by the hit terms may be due to the fact that humans are a lot more complex than the species the homologous hits belong to, so that the protein has to fulfill more needs in a more complex organism and thus has more GO terms assigned.
The average length of the hits fits the length of the α-Galactosidase A protein (429 residues) very well. This can be seen in the picture below (Histogram of the length of the BLAST hits for P06280).
On average over 51% of the aligned residues are positives and almost 36% are identical. Since the positives reported by BLAST already include the identical residues, these two values should not be added up: on average about 51% of the residues in each alignment are similar to the protein sequence of AGAL, and about 36% are identical.
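The two ratios described above can be written down compactly. The following is only a minimal sketch and not the script we actually used; the function name `go_overlap` and the GO identifiers in the example are invented for illustration.

```python
def go_overlap(ref_terms, hit_terms):
    """Return (shared / |reference terms|, shared / |hit terms|)."""
    ref, hit = set(ref_terms), set(hit_terms)
    shared = len(ref & hit)
    # Guard against empty annotation sets to avoid a division by zero.
    frac_of_ref = shared / len(ref) if ref else 0.0
    frac_of_hit = shared / len(hit) if hit else 0.0
    return frac_of_ref, frac_of_hit


# Toy GO identifiers, invented for illustration only.
reference = {"GO:0000001", "GO:0000002", "GO:0000003", "GO:0000004"}
hit = {"GO:0000001", "GO:0000002", "GO:0000009"}

frac_of_ref, frac_of_hit = go_overlap(reference, hit)
print(frac_of_ref, frac_of_hit)
```

A hit with few GO terms can score high on the second ratio and low on the first, which is the pattern seen in the two histograms.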
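The percentages of identical and positive residues can be read from the statistics line of each pairwise alignment. This is a minimal sketch, assuming the plain-text BLAST report format with a line of the form "Identities = a/n (x%), Positives = b/n (y%)"; the helper name `alignment_fractions` and the example line are made up for illustration. In this format the Positives count includes the identical residues.

```python
import re

# Matches the statistics line of one pairwise alignment in a text BLAST report.
STATS = re.compile(
    r"Identities\s*=\s*(\d+)/(\d+).*?Positives\s*=\s*(\d+)/(\d+)"
)


def alignment_fractions(stats_line):
    """Return (fraction identical, fraction positive) for one alignment."""
    match = STATS.search(stats_line)
    if match is None:
        raise ValueError("no Identities/Positives statistics found")
    ident, ident_len, pos, pos_len = map(int, match.groups())
    return ident / ident_len, pos / pos_len


# Invented example line, not taken from our BLAST output.
line = " Identities = 36/100 (36%), Positives = 51/100 (51%), Gaps = 4/100 (4%)"
identical, positive = alignment_fractions(line)

# Residues that are similar but not identical (conservative substitutions).
conservative_only = positive - identical
print(identical, positive, conservative_only)
```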
<figtable id="blastidev">
|
Histogram of the logarithmic E-values of the BLAST hits for P06280
|
Histogram of the positive amino acids of the pairwise alignments of the BLAST hits for P06280
|
|
Histogram of the identical amino acids of the pairwise alignments of the BLAST hits for P06280
|
Histogram of the length of the BLAST hits for P06280
|
</figtable>
Psi-Blast
Overview of the Psi-Blast result sets
We also used Psi-Blast, the profile-based BLAST program, to search against the big80 database. The one-line description output parameter (-v) was increased to show a maximum of 4000 hits, and the same goes for the number of alignments to show (-b). There were four runs, combining 2 or 10 iterations with an E-value cut-off of 2e-3 or 1e-9. All other options were left at their default values. The exact command-line calls can be obtained from the journal.
Iterations  E-value cut-off  Number of hits
2           2e-3             1129
2           1e-9             683
10          2e-3             2000 (maximal hits)
10          1e-9             1491
- Psi-Blast run with 2 iterations and an E-value threshold of 2e-3
- Psi-Blast run with 2 iterations and an E-value threshold of 1e-9
- Psi-Blast run with 10 iterations and an E-value threshold of 2e-3
- Psi-Blast run with 10 iterations and an E-value threshold of 1e-9
HHblits
We searched the "big80" database with HHblits using the default settings and also with the maximum number of possible iterations (8).
The HHblits search was performed with the maximum E-value in the summary and alignment list set to 0.003 (-E) and the minimum number of lines in the summary hit list set to 700 (-z). From this search we obtained only 325 significant clusters.
We also compared the GO terms in a similar manner as in the BLAST section. Here we discovered that on average only 14% of the AGAL_HUMAN protein's GO terms are included in the hits' terms. The "reverse" calculation revealed that around 70% of the hits' GO classes are in common with the search protein. This is rather low in comparison to the BLAST results.
The mean E-value, in contrast, is almost equal to the average E-value of the BLAST search. The same applies to the fraction of identical amino acids. These values can be compared with the BLAST values because HHblits reports exactly one E-value, one %Identical and one %Similar value per cluster.
<figtable id="blastidev">
|
2 iterations - default
|
|
GO terms of P06280 and each HHblits hit (with E-value < 0.003) compared. The upper picture shows the percentage of shared terms relative to the number of GO terms of P06280 (AGAL_HUMAN); the second picture shows it relative to the number of GO terms of each hit.
|
Histogram of the logarithmic E-values of the HHblits hits for P06280
|
|
Histogram of the similarity of the HHblits hits to P06280
|
Histogram of the identical amino acids of the pairwise alignments of the HHblits hits for P06280
|
</figtable>
Since we thought that the number of significant hits was too low, we performed another HHblits search with 8 iterations. Doing so, we obtained 729 clusters with an E-value smaller than or equal to 0.003.
The similarity in GO terms improved, but all other comparative values, such as the average E-value, the similarity and the fraction of identical residues, got worse.
Thus increasing the number of iterations may help to find more homologous proteins, but since the similarity is lower, these proteins may also be less conserved than those detected with fewer iterations.
<figtable id="blastidev">
|
8 iterations
|
|
GO terms of P06280 and each HHblits hit (with E-value < 0.003 and 8 iterations) compared. The upper picture shows the percentage of shared terms relative to the number of GO terms of P06280 (AGAL_HUMAN); the second picture shows it relative to the number of GO terms of each hit.
|
Histogram of the logarithmic E-values of the HHblits search with 8 iterations for P06280
|
|
Histogram of the similarity of the HHblits hits (search with 8 iterations) to P06280
|
Histogram of the identical amino acids of the pairwise alignments of the HHblits hits (search with 8 iterations) for P06280
|
</figtable>
The first HHblits run took about 2.5 minutes, the second one about 16 minutes (see section Time).
Comparison sequence searches
Comparing the hits
Venn diagram of proteins found by BLAST, HHBlits and HHBlits with 8 iterations
|
Venn diagram of the proteins found by BLAST, Psi-BLAST (10 iterations and E-value cutoff 1e-9) and HHBlits with 8 iterations
|
Venn diagram of the first 100 proteins found by BLAST, HHBlits and HHBlits with 8 iterations
|
Venn diagram of the first 100 proteins found by BLAST, Psi-BLAST and HHBlits with 8 iterations
|
Venn diagrams created with: Oliveros, J.C. (2007) VENNY. An interactive tool for comparing lists with Venn Diagrams.
For the Venn diagrams above, only the first identifier of each HHblits cluster is used. This simplification keeps the number of proteins comparable across the methods, but it ignores all other members of a cluster.
In the Venn diagrams one can see that only a small portion of the hits is shared by all three methods. Each method seems to have a largely unique set of findings. The biggest overlap is between the BLAST and Psi-BLAST hits, which matches our expectations, since these two use similar approaches, while HHblits searches by iterative HMM-HMM comparison. This becomes most obvious in the last picture, where only the 100 best hits of each of the three methods are compared. Only 6 hits are common to all methods. Of the remaining 94, about half are shared by BLAST and Psi-BLAST, and the other half is unique to either BLAST or Psi-BLAST. HHblits has 84 unique hits and shares 5 hits solely with each of the BLAST algorithms. The comparison of all hits with an E-value smaller than or equal to 0.003 in all methods looks similar. It is noteworthy that here a small number of hits is shared only by HHblits and BLAST (52), as well as only by Psi-BLAST and HHblits (2).
The hits of the two HHblits searches with 2 and 8 iterations also show a large overlap.
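The counts in such a three-set Venn diagram can also be obtained directly with set operations. The diagrams themselves were created with VENNY; the following is only a minimal sketch for checking the numbers, with an invented function name `venn3_regions` and toy identifiers.

```python
def venn3_regions(a, b, c):
    """Split three collections of identifiers into the seven Venn regions."""
    a, b, c = set(a), set(b), set(c)
    return {
        "only_a": a - b - c,
        "only_b": b - a - c,
        "only_c": c - a - b,
        "a_and_b_only": (a & b) - c,
        "a_and_c_only": (a & c) - b,
        "b_and_c_only": (b & c) - a,
        "all_three": a & b & c,
    }


# Toy identifiers, invented for illustration only.
blast = {"P1", "P2", "P3", "P4"}
psiblast = {"P2", "P3", "P4", "P5"}
hhblits = {"P4", "P6"}

regions = venn3_regions(blast, psiblast, hhblits)
counts = {name: len(ids) for name, ids in regions.items()}
print(counts)
```

The seven region sizes always add up to the size of the union of the three sets, which gives a simple consistency check for a diagram.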
In the picture below, all identifiers in the HHblits clusters were used. In this case a lot more identifiers are shared among all three methods. This holds not relative to the total number of identifiers in the HHblits clusters (8463), but relative to the number of identifiers that the two BLAST methods do not share with HHblits: here only 193 identifiers are not shared with HHblits at all.
The comparison of the first 100 hits of BLAST and Psi-BLAST was performed with the first 100 clusters in the HHblits output. Again a larger number of identifiers was shared by all three methods (39), and only few hits were unique to the BLAST searches (29 in total).
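The difference between using only the first identifier of each HHblits cluster and using all identifiers of each cluster can be illustrated as follows. This is a minimal sketch; the dictionary layout (first identifier mapped to the list of all cluster members) and all identifiers are assumptions made for illustration and do not reflect the actual HHblits output format.

```python
# Hypothetical clusters: first identifier -> all identifiers in the cluster.
clusters = {
    "REP1": ["REP1", "M1", "M2"],
    "REP2": ["REP2"],
    "REP3": ["REP3", "M3"],
}


def representative_ids(clusters):
    """Only the first identifier of each cluster."""
    return set(clusters)


def all_member_ids(clusters):
    """Every identifier that occurs in any cluster."""
    return {member for members in clusters.values() for member in members}


# Invented BLAST hit list for the comparison.
blast_hits = {"M1", "M3", "REP2", "X9"}

shared_with_representatives = blast_hits & representative_ids(clusters)
shared_with_all_members = blast_hits & all_member_ids(clusters)
print(len(shared_with_representatives), len(shared_with_all_members))
```

In this toy case the overlap grows from one to three identifiers once all cluster members are considered, which mirrors the effect described above.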
Venn diagram of proteins found by BLAST, Psi-BLAST and HHBlits with 8 iterations. In this picture all identifiers in each cluster are used for the HHBlits result
|
Venn diagram of the first 100 proteins found by BLAST, Psi-BLAST and HHBlits with 8 iterations. In this picture all identifiers in each cluster are used for the HHBlits result
|
Comparing the Evalues
On the right, you can see an animated histogram of the distribution of the E-values for the searches performed with the different methods.
The R script is based on Andrea's R script psiBlast.evalueHist.Rscript.
The most obvious observation is that the E-value distribution of the Psi-BLAST hits is very different from that of the other two methods. The Psi-BLAST histogram has its maximum at a logarithmic E-value of around -60, while the histograms of the BLAST and the HHblits searches peak much closer to zero. Comparing especially the BLAST and Psi-BLAST results, the advantage of refinement steps and more iterations becomes clear, since the quality of the hits, in terms of the E-value, increases. Thus, with respect to the E-values, we would prefer using Psi-BLAST.
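The histograms were produced with an R script. Purely as an illustration of the binning on the logarithmic scale, here is a minimal Python sketch; it is not a translation of that script. The function name, the bin width of 10 and the floor value used for E-values reported as 0 are assumptions.

```python
import math
from collections import Counter


def log10_evalue_histogram(evalues, bin_width=10, floor=1e-200):
    """Count E-values per bin of log10(E-value).

    E-values reported as 0 have no logarithm, so they are clamped to
    `floor` first (an arbitrary choice made for this sketch).
    """
    counts = Counter()
    for evalue in evalues:
        log_e = math.log10(max(evalue, floor))
        # Lower edge of the bin that contains log_e.
        bin_start = math.floor(log_e / bin_width) * bin_width
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


# A few E-values in the style of the search results, plus one zero.
histogram = log10_evalue_histogram([1e-63, 4e-61, 2e-93, 0.003, 0.0])
for bin_start, count in histogram.items():
    print(f"[{bin_start}, {bin_start + 10}): {count}")
```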
Time
We measured the running time of the programs with the command "time".
Method   Parameter         Time
Blast    b = 700, v = 700  1m53.944s
HHblits  default           2m19.519s
HHblits  n = 8             16m7.754s
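To compare the running times numerically, the values in the minutes/seconds notation printed by "time" can be converted to seconds. This is a minimal sketch; the helper name `parse_time` is made up, and the values are copied from the table above.

```python
import re


def parse_time(text):
    """Convert a string such as '16m7.754s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unexpected time format: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


timings = {
    "Blast": "1m53.944s",
    "HHblits default": "2m19.519s",
    "HHblits n = 8": "16m7.754s",
}
seconds = {method: parse_time(value) for method, value in timings.items()}

# Factor by which the run with 8 iterations is slower than the default run.
slowdown = seconds["HHblits n = 8"] / seconds["HHblits default"]
print(seconds, round(slowdown, 2))
```

With these numbers the HHblits run with 8 iterations takes roughly seven times as long as the default run.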
Multiple sequence alignments
Dataset
The dataset was generated from the result set of the Psi-Blast run with 10 iterations and an E-value cut-off of 1e-9. We used the following 30 proteins to create multiple sequence alignments with the different methods. Since there were no sequences with a sequence identity of more than 90%, we created three datasets instead: one with 10 sequences spanning the whole range of sequence identity, one with sequences having a sequence identity <40% and one with sequences having a sequence identity >60%.
id eVal identity coverage alignment_length
# whole range = Set100
tr|C7PCU7|C7PCU7_CHIPD 5e-63 21 0.9933 474
tr|B3RSE1|B3RSE1_TRIAD 2e-93 49 0.8415 362
tr|G2PG26|G2PG26_STRVO 1e-105 33 0.5854 427
tr|Q8RX86|Q8RX86_ARATH 1e-105 35 0.9814 422
tr|G8NYA7|G8NYA7_GRAMM 4e-61 22 0.6186 470
tr|H1Q7I8|H1Q7I8_9ACTO 1e-97 37 0.5409 396
tr|E1ZHK5|E1ZHK5_CHLVA 8e-80 38 0.8485 368
sp|Q0CEF5|AGALG_ASPTN 4e-63 12 0.611 478
tr|F5BFS9|F5BFS9_TOBAC 1e-106 36 0.9534 410
tr|F8FLU8|F8FLU8_PAEMK 1e-76 10 0.6091 474
# <40% sequence identity = Set40
tr|B8P149|B8P149_POSPM 3e-80 28 0.9425 432
tr|G2TQE8|G2TQE8_BACCO 7e-68 8 0.5795 452
tr|F9HJT9|F9HJT9_9STRE 3e-70 11 0.5709 452
tr|H2JN17|H2JN17_STRHJ 3e-69 23 0.774 504
tr|C5AKH4|C5AKH4_BURGB 2e-67 26 0.5488 403
sp|Q0CEF5|AGALG_ASPTN 4e-63 12 0.611 478
tr|B3CFN7|B3CFN7_9BACE 1e-78 26 0.5828 412
tr|D4KDQ2|D4KDQ2_9FIRM 8e-76 10 0.611 483
tr|D4W2N5|D4W2N5_9FIRM 1e-67 10 0.5478 435
tr|F2USV1|F2USV1_SALS5 1e-88 35 1 467
# >60% sequence identity = Set60
tr|G1P280|G1P280_MYOLU 1e-108 78 0.9699 420
tr|Q4RTE7|Q4RTE7_TETNG 7e-89 71 0.7319 314
tr|F1Q5G5|F1Q5G5_DANRE 1e-106 67 0.9138 392
tr|E1B725|E1B725_BOVIN 1e-111 76 0.9727 428
tr|H2U095|H2U095_TAKRU 1e-101 65 0.9424 412
tr|G1T044|G1T044_RABIT 1e-109 82 0.9698 417
tr|C0HA45|C0HA45_SALSA 1e-102 63 0.9534 409
tr|H0WQ54|H0WQ54_OTOGA 4e-87 71 0.9953 428
tr|G3WK18|G3WK18_SARHA 1e-108 72 0.9388 414
tr|H2L5H7|H2L5H7_ORYLA 1e-100 61 0.9534 411
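Splitting a hit list by sequence identity, as done for Set40 and Set60, can be sketched as follows. This is only an illustration of the thresholds and not the procedure we used to pick the ten sequences of each set; the function name `split_by_identity` is invented, and the example entries are a handful of rows from the table above.

```python
def split_by_identity(hits, low=40, high=60):
    """Return (hits below `low` % identity, hits above `high` % identity)."""
    below = [hit for hit in hits if hit[1] < low]
    above = [hit for hit in hits if hit[1] > high]
    return below, above


# (identifier, % identity) pairs taken from the dataset table.
hits = [
    ("tr|G2TQE8|G2TQE8_BACCO", 8),
    ("tr|F2USV1|F2USV1_SALS5", 35),
    ("tr|B3RSE1|B3RSE1_TRIAD", 49),
    ("tr|H2L5H7|H2L5H7_ORYLA", 61),
    ("tr|G1T044|G1T044_RABIT", 82),
]

set40_candidates, set60_candidates = split_by_identity(hits)
print([name for name, _ in set40_candidates])
print([name for name, _ in set60_candidates])
```

Hits between the two thresholds, such as B3RSE1_TRIAD with 49% identity, fall into neither group and can only appear in the set spanning the whole range.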
Results
Note (Rackersederj, 13:07, 5 May 2012 (UTC)): It may be sufficient to discuss only the interesting parts of the alignments, such as the active site. The active site (D170 and D231) and, in part, its surrounding residues are highly conserved. Other functional sites that could be examined are the glycosylation sites (139, 192, 215, 408) and the disulfide bonds (52 ↔ 94, 56 ↔ 63, 142 ↔ 172, 202 ↔ 223, 378 ↔ 382).
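The tables below list the number of gaps per sequence and the number of conserved columns for each alignment. How such numbers can be derived from an alignment is sketched here. The sketch assumes that the alignment is available as a mapping from sequence name to gapped sequence and that "conserved" means a column without gaps in which all sequences carry the same residue; this definition, the function names and the toy alignment are assumptions made for illustration.

```python
def gap_counts(alignment):
    """Number of gap characters ('-') in each aligned sequence."""
    return {name: seq.count("-") for name, seq in alignment.items()}


def conserved_columns(alignment):
    """Number of gap-free columns in which all residues are identical."""
    seqs = list(alignment.values())
    if not seqs:
        return 0
    if len({len(seq) for seq in seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return sum(
        1
        for column in zip(*seqs)
        if "-" not in column and len(set(column)) == 1
    )


# Toy alignment, invented for illustration only.
toy_alignment = {
    "seqA": "MK-LDD",
    "seqB": "MKALD-",
    "seqC": "MR-LDE",
}

gaps = gap_counts(toy_alignment)
conserved = conserved_columns(toy_alignment)
print(gaps, conserved)
```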
ClustalW
Set 100
Sequence ID             Number of gaps
tr|G2PG26|G2PG26_STRVO  133
tr|C7PCU7|C7PCU7_CHIPD  372
sp|P06280|AGAL_HUMAN    389
tr|G8NYA7|G8NYA7_GRAMM  97
sp|Q0CEF5|AGALG_ASPTN   93
tr|E1ZHK5|E1ZHK5_CHLVA  452
tr|Q8RX86|Q8RX86_ARATH  422
tr|F8FLU8|F8FLU8_PAEMK  89
tr|B3RSE1|B3RSE1_TRIAD  453
tr|F5BFS9|F5BFS9_TOBAC  405
tr|H1Q7I8|H1Q7I8_9ACTO  134
conserved               2
Set 40
Sequence ID             Number of gaps
tr|C5AKH4|C5AKH4_BURGB  290
tr|G2TQE8|G2TQE8_BACCO  267
tr|H2JN17|H2JN17_STRHJ  373
tr|F9HJT9|F9HJT9_9STRE  256
sp|P06280|AGAL_HUMAN    568
sp|Q0CEF5|AGALG_ASPTN   272
tr|D4W2N5|D4W2N5_9FIRM  254
tr|B3CFN7|B3CFN7_9BACE  333
tr|B8P149|B8P149_POSPM  562
tr|D4KDQ2|D4KDQ2_9FIRM  254
tr|F2USV1|F2USV1_SALS5  534
conserved               4
Set 60
Sequence ID             Number of gaps
tr|H2U095|H2U095_TAKRU  24
tr|H0WQ54|H0WQ54_OTOGA  34
tr|F1Q5G5|F1Q5G5_DANRE  49
tr|G1P280|G1P280_MYOLU  26
sp|P06280|AGAL_HUMAN    29
tr|G3WK18|G3WK18_SARHA  17
tr|E1B725|E1B725_BOVIN  19
tr|Q4RTE7|Q4RTE7_TETNG  81
tr|H2L5H7|H2L5H7_ORYLA  30
tr|G1T044|G1T044_RABIT  28
tr|C0HA45|C0HA45_SALSA  50
conserved               154
Muscle
Set 100
Sequence ID             Number of gaps
sp|Q0CEF5|AGALG_ASPTN   217
tr|F8FLU8|F8FLU8_PAEMK  213
tr|C7PCU7|C7PCU7_CHIPD  496
tr|G8NYA7|G8NYA7_GRAMM  221
sp|P06280|AGAL_HUMAN    513
tr|B3RSE1|B3RSE1_TRIAD  577
tr|G2PG26|G2PG26_STRVO  257
tr|H1Q7I8|H1Q7I8_9ACTO  258
tr|E1ZHK5|E1ZHK5_CHLVA  576
tr|Q8RX86|Q8RX86_ARATH  546
tr|F5BFS9|F5BFS9_TOBAC  529
conserved               10
Set 40
Sequence ID             Number of gaps
tr|H2JN17|H2JN17_STRHJ  485
tr|C5AKH4|C5AKH4_BURGB  402
tr|B8P149|B8P149_POSPM  674
sp|P06280|AGAL_HUMAN    680
tr|F2USV1|F2USV1_SALS5  646
tr|B3CFN7|B3CFN7_9BACE  445
sp|Q0CEF5|AGALG_ASPTN   384
tr|G2TQE8|G2TQE8_BACCO  379
tr|F9HJT9|F9HJT9_9STRE  368
tr|D4KDQ2|D4KDQ2_9FIRM  366
tr|D4W2N5|D4W2N5_9FIRM  366
conserved               8
Set 60
Sequence ID             Number of gaps
tr|H2L5H7|H2L5H7_ORYLA  29
tr|C0HA45|C0HA45_SALSA  49
tr|F1Q5G5|F1Q5G5_DANRE  48
tr|Q4RTE7|Q4RTE7_TETNG  80
tr|H2U095|H2U095_TAKRU  23
tr|G3WK18|G3WK18_SARHA  16
tr|H0WQ54|H0WQ54_OTOGA  33
tr|E1B725|E1B725_BOVIN  18
tr|G1P280|G1P280_MYOLU  25
sp|P06280|AGAL_HUMAN    28
tr|G1T044|G1T044_RABIT  27
conserved               156
T-Coffee
Set 100
Sequence ID             Number of gaps
tr|G2PG26|G2PG26_STRVO  516
tr|C7PCU7|C7PCU7_CHIPD  755
sp|P06280|AGAL_HUMAN    772
sp|Q0CEF5|AGALG_ASPTN   476
tr|E1ZHK5|E1ZHK5_CHLVA  835
tr|G8NYA7|G8NYA7_GRAMM  480
tr|Q8RX86|Q8RX86_ARATH  805
tr|F8FLU8|F8FLU8_PAEMK  472
tr|F5BFS9|F5BFS9_TOBAC  788
tr|B3RSE1|B3RSE1_TRIAD  836
tr|H1Q7I8|H1Q7I8_9ACTO  517
conserved               9
Set 40
Sequence ID             Number of gaps
tr|C5AKH4|C5AKH4_BURGB  541
tr|G2TQE8|G2TQE8_BACCO  518
tr|H2JN17|H2JN17_STRHJ  624
tr|F9HJT9|F9HJT9_9STRE  507
sp|P06280|AGAL_HUMAN    819
sp|Q0CEF5|AGALG_ASPTN   523
tr|D4W2N5|D4W2N5_9FIRM  505
tr|B3CFN7|B3CFN7_9BACE  584
tr|B8P149|B8P149_POSPM  813
tr|D4KDQ2|D4KDQ2_9FIRM  505
tr|F2USV1|F2USV1_SALS5  785
conserved               12
Set 60
Sequence ID             Number of gaps
tr|H2U095|H2U095_TAKRU  44
tr|H0WQ54|H0WQ54_OTOGA  54
tr|F1Q5G5|F1Q5G5_DANRE  69
tr|G1P280|G1P280_MYOLU  46
sp|P06280|AGAL_HUMAN    49
tr|G3WK18|G3WK18_SARHA  37
tr|Q4RTE7|Q4RTE7_TETNG  101
tr|E1B725|E1B725_BOVIN  39
tr|H2L5H7|H2L5H7_ORYLA  50
tr|G1T044|G1T044_RABIT  48
tr|C0HA45|C0HA45_SALSA  70
conserved               155
3D-Coffee
Set 100
Sequence ID             Number of gaps
tr|G2PG26|G2PG26_STRVO  516
tr|C7PCU7|C7PCU7_CHIPD  755
sp|P06280|AGAL_HUMAN    772
sp|Q0CEF5|AGALG_ASPTN   476
tr|E1ZHK5|E1ZHK5_CHLVA  835
tr|G8NYA7|G8NYA7_GRAMM  480
tr|Q8RX86|Q8RX86_ARATH  805
tr|F8FLU8|F8FLU8_PAEMK  472
tr|F5BFS9|F5BFS9_TOBAC  788
tr|B3RSE1|B3RSE1_TRIAD  836
tr|H1Q7I8|H1Q7I8_9ACTO  517
conserved               9
Set 40
Sequence ID             Number of gaps
tr|C5AKH4|C5AKH4_BURGB  541
tr|G2TQE8|G2TQE8_BACCO  518
tr|H2JN17|H2JN17_STRHJ  624
tr|F9HJT9|F9HJT9_9STRE  507
sp|P06280|AGAL_HUMAN    819
sp|Q0CEF5|AGALG_ASPTN   523
tr|D4W2N5|D4W2N5_9FIRM  505
tr|B3CFN7|B3CFN7_9BACE  584
tr|B8P149|B8P149_POSPM  813
tr|D4KDQ2|D4KDQ2_9FIRM  505
tr|F2USV1|F2USV1_SALS5  785
conserved               12
Set 60
Sequence ID             Number of gaps
tr|H2U095|H2U095_TAKRU  44
tr|H0WQ54|H0WQ54_OTOGA  54
tr|F1Q5G5|F1Q5G5_DANRE  69
tr|G1P280|G1P280_MYOLU  46
sp|P06280|AGAL_HUMAN    49
tr|G3WK18|G3WK18_SARHA  37
tr|Q4RTE7|Q4RTE7_TETNG  101
tr|E1B725|E1B725_BOVIN  39
tr|H2L5H7|H2L5H7_ORYLA  50
tr|G1T044|G1T044_RABIT  48
tr|C0HA45|C0HA45_SALSA  70
conserved               155