Difference between revisions of "Fabry:Sequence alignments (sequence searches and multiple alignments)"

Revision as of 22:38, 6 May 2012

Introduction

This page contains our results and discussions. The lab journal can be found here.

Reference sequence

The reference sequence of α-Galactosidase A that will be used in this task was obtained from Swissprot P06280.

>gi|4504009|ref|NP_000160.1| alpha-galactosidase A precursor [Homo sapiens]
MQLRNPELHLGCALALRFLALVSWDIPGARALDNGLARTPTMGWLHWERFMCNLDCQEEPDSCISEKLFM
EMAELMVSEGWKDAGYEYLCIDDCWMAPQRDSEGRLQADPQRFPHGIRQLANYVHSKGLKLGIYADVGNK
TCAGFPGSFGYYDIDAQTFADWGVDLLKFDGCYCDSLENLADGYKHMSLALNRTGRSIVYSCEWPLYMWP
FQKPNYTEIRQYCNHWRNFADIDDSWKSIKSILDWTSFNQERIVDVAGPGGWNDPDMLVIGNFGLSWNQQ
VTQMALWAIMAAPLFMSNDLRHISPQAKALLQDKDVIAINQDPLGKQGYQLRQGDNFEVWERPLSGLAWA
VAMINRQEIGGPRSYTIAVASLGKGVACNPACFITQLLPVKRKLGFYEWTSRLRSHINPTGTVLLQLENT
MQMSLKDLL

Sequence searches

Blast

GO terms of P06280 and each BLAST hit (with E-value <= 0.003) compared. Percentage of terms shared, relative to the number of GO terms of P06280 (AGAL_HUMAN) in the upper picture and relative to the number of GO terms of each hit in the second picture.

First we performed a BLAST search with the default parameters. Since all hits were significant, we raised the number of one-line descriptions shown (-v) as well as the number of database sequences for which alignments are shown (-b). This yielded 663 hits with an E-value less than or equal to 0.003, which we defined as the significance threshold in our search. For these proteins we extracted the E-value, the number of positive and of identical amino acids in the pairwise alignments, and the length of each hit. A histogram of each of these features is shown in the figures of this section.

For the comparison of the GO terms, we obtained the set of terms for each hit and counted how many of them are shared with the GO terms of the search protein α-Galactosidase A. We divided the number of common terms by the number of GO terms of P06280 (49). Since these proportions are very small, we also explored the fraction of each hit's GO terms that is shared with the reference terms, so we divided the number of common terms by the number of terms of the hit. The histogram of this second ratio shows that on average over 80% of the GO terms of each hit are shared with those of AGAL. The low average agreement in the other direction (AGAL terms found among the hit terms) may be due to the fact that humans are a lot more complex than the species the homologous hits belong to: the protein has to fulfill more needs in a more complex organism and thus has more GO terms assigned.
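The two ratios described above can be sketched in a few lines of Python. This is only an illustration of the calculation, not the script used for this page; the function name `go_overlap` and the GO identifiers in the example are placeholders.

```python
def go_overlap(reference_terms, hit_terms):
    """Return (shared / |reference terms|, shared / |hit terms|)."""
    ref = set(reference_terms)
    hit = set(hit_terms)
    shared = len(ref & hit)
    # Guard against empty annotation sets to avoid division by zero.
    frac_of_reference = shared / len(ref) if ref else 0.0
    frac_of_hit = shared / len(hit) if hit else 0.0
    return frac_of_reference, frac_of_hit


# Placeholder identifiers, for illustration only (not real annotations).
reference = {"GO:x1", "GO:x2", "GO:x3", "GO:x4"}
hit = {"GO:x1", "GO:x2", "GO:x9"}

frac_ref, frac_hit = go_overlap(reference, hit)
print(f"{frac_ref:.2f} of the reference terms, {frac_hit:.2f} of the hit terms are shared")
```

A reference protein with many annotations will always give a small first ratio, which is why the second ratio is the more informative one for sparsely annotated hits.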

The average length fits the length of the α-Galactosidase A protein very well. This can be seen in the figure "Histogram of the length of the BLAST hits for P06280". On average over 51% of the residues are positive and almost 36% are identical. Since BLAST counts identical residues among the positives, the two values overlap and should not be added up: on average about 51% of the aligned residues are similar to the protein sequence of AGAL, and about 36% are identical.
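As a rough sketch of how the identity and positive fractions can be read from a BLAST pairwise alignment, the snippet below parses the "Identities = x/y ..., Positives = x/y ..." statistics line of the text report. The example line is made up, and `parse_alignment_stats` is a hypothetical helper, not part of BLAST.

```python
import re

# Matches the statistics line printed above each pairwise alignment.
_STATS = re.compile(r"Identities = (\d+)/(\d+).*?Positives = (\d+)/(\d+)")


def parse_alignment_stats(line):
    """Return (identity fraction, positive fraction) for one statistics line."""
    match = _STATS.search(line)
    if match is None:
        raise ValueError(f"not a BLAST statistics line: {line!r}")
    identical, id_len, positive, pos_len = map(int, match.groups())
    return identical / id_len, positive / pos_len


# Made-up example line, for illustration only.
example = " Identities = 150/420 (36%), Positives = 215/420 (51%), Gaps = 12/420 (3%)"
identity, positives = parse_alignment_stats(example)
print(f"identity {identity:.1%}, positives {positives:.1%}")
```

In the BLAST report the positives count already contains the identical positions, so the positive fraction is the upper of the two numbers rather than an independent quantity.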

Histogram of the logarithmic E-values of the BLAST hits for P06280
Histogram of the positive amino acids of the pairwise alignments of the BLAST hits for P06280
Histogram of the identical amino acids of the pairwise alignments of the BLAST hits for P06280
Histogram of the length of the BLAST hits for P06280


Psi-Blast

HHblits

We searched the "big80" database with HHblits using the default settings and also with the maximum number of possible iterations (8).

The HHblits search was performed with the maximum E-value in the summary and alignment list set to 0.003 (-E) and the minimum number of lines in the summary hit list set to 700 (-z). This search yielded only 326 significant hits.
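A call with these settings might be assembled as in the following sketch. The -E and -z options are the ones named above and -n sets the number of iterations; the file names and the database path are assumptions made for illustration, and the exact command used for this page is recorded in the lab journal, not here.

```python
import shlex


def hhblits_command(query, database, output, iterations=2,
                    max_evalue=0.003, min_lines=700):
    """Build an hhblits argument list (paths are caller-supplied placeholders)."""
    return [
        "hhblits",
        "-i", query,             # input query sequence
        "-d", database,          # database basename, e.g. the big80 database
        "-o", output,            # result file
        "-n", str(iterations),   # number of search iterations
        "-E", str(max_evalue),   # maximum E-value in summary and alignment list
        "-z", str(min_lines),    # minimum number of lines in summary hit list
    ]


# Hypothetical file names, for illustration only.
cmd = hhblits_command("P06280.fasta", "big80", "P06280_hhblits.hhr")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`; calling `hhblits_command(..., iterations=8)` gives the second search described on this page.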

We also compared the GO terms in a similar manner as in the BLAST section. Here we discovered that on average only 14% of the AGAL_HUMAN protein's GO terms are included in the hits' terms. The "reverse" calculation revealed that around 70% of the hits' GO classes are in common with the search protein. This is rather low in comparison to the BLAST results.

The mean E-value in contrast is almost equal to the average E-value of the BLAST search. The same applies to the number of identical amino acids.

2 iterations - default
GO terms of P06280 and each HHblits hit (with E-value < 0.003) compared. Percentage of terms shared, relative to the number of GO terms of P06280 (AGAL_HUMAN) in the upper picture and relative to the number of GO terms of each hit in the second picture.
Histogram of the logarithmic E-values of the HHblits hits for P06280
Histogram of the similarity of the HHblits hits to P06280
Histogram of the identical amino acids of the pairwise alignments of the HHblits hits for P06280


Since we considered the number of significant hits too low, we performed another HHblits search with 8 iterations. This yielded 729 hits with an E-value less than or equal to 0.003.

The similarity in GO terms improved, but all other comparative values, such as the average E-value, the similarity and the number of identical residues, got worse.

Thus increasing the number of iterations may help to find more homologous proteins, but since their similarity is lower, their conservation may also not be as high as for proteins detected with fewer iterations.

8 iterations
GO terms of P06280 and each HHblits hit (with E-value < 0.003 and 8 iterations) compared. Percentage of terms shared, relative to the number of GO terms of P06280 (AGAL_HUMAN) in the upper picture and relative to the number of GO terms of each hit in the second picture.
Histogram of the logarithmic E-values of the HHblits search with 8 iterations for P06280
Histogram of the similarity of the HHblits hits (search with 8 iterations) to P06280
Histogram of the identical amino acids of the pairwise alignments of the HHblits hits (search with 8 iterations) for P06280


The first HHblits run took about 2 minutes 20 seconds, the second one about 16 minutes (see section Time).

Comparison sequence searches

Comparing the hits

Venn diagram of proteins found by BLAST, HHBlits and HHBlits with 8 iterations

Venn diagram of the proteins found by BLAST, Psi-BLAST (10 iterations and E-value cutoff 1e-9) and HHBlits with 8 iterations

Venn diagram of the first 100 proteins found by BLAST, HHBlits and HHBlits with 8 iterations

Venn diagram of the first 100 proteins found by BLAST, Psi-BLAST and HHBlits with 8 iterations

Venn diagrams created with Oliveros, J.C. (2007) VENNY. An interactive tool for comparing lists with Venn Diagrams.

The Venn diagrams show that only a small portion of the hits is shared by all three methods; each method has a largely unique set of findings. The biggest overlap is between the BLAST and Psi-BLAST hits, which matches our expectations, since these two use similar approaches, while HHblits searches by iterative HMM-HMM comparison.

This becomes most obvious in the last picture, where only the 100 best hits of each of the three methods are compared. Only 6 hits are common to all methods. Of the remaining 94 hits of BLAST and of Psi-BLAST, about half are shared between the two, and the other half is unique to the respective method. HHblits has 84 unique hits and shares 5 hits solely with each of the BLAST algorithms.

The comparison of all hits with an E-value less than or equal to 0.003 in all methods looks similar. It is noteworthy that here a small number of hits is shared only by HHblits and BLAST (52), and only by Psi-BLAST and HHblits (2). The two HHblits searches with 2 and 8 iterations also show a large overlap.
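The region counts of such a three-way Venn diagram can be reproduced with plain set operations. The following is a minimal sketch, independent of VENNY; the function name `venn3_regions` and the toy identifiers are placeholders, and in practice the three inputs would be the hit ID lists of the searches.

```python
def venn3_regions(a, b, c):
    """Split three ID collections into the seven regions of a Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "all_three": a & b & c,
        "a_and_b_only": (a & b) - c,
        "a_and_c_only": (a & c) - b,
        "b_and_c_only": (b & c) - a,
        "a_only": a - b - c,
        "b_only": b - a - c,
        "c_only": c - a - b,
    }


# Toy identifiers, for illustration only.
blast = {"P1", "P2", "P3", "P4"}
psiblast = {"P2", "P3", "P5"}
hhblits = {"P3", "P4", "P6"}

regions = venn3_regions(blast, psiblast, hhblits)
for name, members in regions.items():
    print(name, len(members))
```

The seven regions are disjoint, so their sizes add up to the size of the union of the three lists, which is a quick sanity check on any diagram.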

Comparing the Evalues

Above you can see an animated histogram of the distribution of the E-values for the searches performed with the different methods. The R script is based on Andrea's R script psiBlast.evalueHist.Rscript.

The most obvious observation is that the E-value distribution of the Psi-BLAST hits is very different from that of the other two methods. The Psi-BLAST histogram has its maximum at a logarithmic E-value of around -60, while the histograms of the BLAST and the HHblits searches peak much closer to zero. Comparing the BLAST and Psi-BLAST results in particular, the advantage of refinement over several iterations becomes clear, since the quality of the hits, as measured by the E-value, increases. With respect to the E-values we would therefore prefer Psi-BLAST.
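The histograms are drawn over logarithmic E-values. The sketch below shows one way to prepare such data in Python; it is not the R script mentioned above. The floor value for E-values reported as 0 and the bin width of 10 are assumptions chosen for illustration.

```python
import math
from collections import Counter


def log10_evalues(evalues, floor=1e-200):
    """Take log10 of each E-value; values of 0 are clamped to `floor` first."""
    return [math.log10(max(e, floor)) for e in evalues]


def histogram(values, bin_width=10):
    """Count values per bin; each key is the lower edge of a bin."""
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# A few E-values taken from the dataset table further down this page.
evalues = [1e-108, 5e-63, 4e-61, 0.003]
hist = histogram(log10_evalues(evalues))
for lower_edge, count in hist.items():
    print(f"[{lower_edge}, {lower_edge + 10}): {count}")
```

Clamping is needed because search tools report extremely small E-values as 0, for which the logarithm is undefined.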

Time

We evaluated the time the programs ran with the command "time"

Method	Parameter	Time
Blast	b = 700, v = 700	1m53.944s
HHBlits	default	2m19.519s
HHBlits	n = 8	16m7.754s
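To compare the runtimes numerically, the output format of "time" can be converted to seconds. A minimal sketch, where `parse_time` is a hypothetical helper and the three values are those from the table above:

```python
import re


def parse_time(text):
    """Convert a string such as '16m7.754s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unexpected time format: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


timings = {
    "Blast": "1m53.944s",
    "HHblits (default)": "2m19.519s",
    "HHblits (n = 8)": "16m7.754s",
}
seconds = {name: parse_time(value) for name, value in timings.items()}
ratio = seconds["HHblits (n = 8)"] / seconds["HHblits (default)"]
print(seconds)
print(f"8 iterations took {ratio:.1f} times as long as the default run")
```

With these values, going from 2 to 8 iterations multiplies the runtime by roughly seven.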

Multiple sequence alignments

Dataset

The dataset was generated from the result set of the Psi-Blast run with 10 iterations and an E-value cut-off of 1e-9. We used the following 30 proteins to create multiple sequence alignments with the different methods. Since there were no sequences with a sequence identity of more than 90%, we created three datasets: one with 10 sequences spanning the whole range of sequence identity, one with sequences with a sequence identity <40%, and one with sequences with a sequence identity >60%.

id	eVal	identity	coverage	alignment_length

#whole range
tr|C7PCU7|C7PCU7_CHIPD	5e-63	21	0.9933	474
tr|B3RSE1|B3RSE1_TRIAD	2e-93	49	0.8415	362
tr|G2PG26|G2PG26_STRVO	1e-105	33	0.5854	427
tr|Q8RX86|Q8RX86_ARATH	1e-105	35	0.9814	422
tr|G8NYA7|G8NYA7_GRAMM	4e-61	22	0.6186	470
tr|H1Q7I8|H1Q7I8_9ACTO	1e-97	37	0.5409	396
tr|E1ZHK5|E1ZHK5_CHLVA	8e-80	38	0.8485	368
sp|Q0CEF5|AGALG_ASPTN	4e-63	12	0.611	478
tr|F5BFS9|F5BFS9_TOBAC	1e-106	36	0.9534	410
tr|F8FLU8|F8FLU8_PAEMK	1e-76	10	0.6091	474

#<40% sequence identity
tr|B8P149|B8P149_POSPM	3e-80	28	0.9425	432
tr|G2TQE8|G2TQE8_BACCO	7e-68	8	0.5795	452
tr|F9HJT9|F9HJT9_9STRE	3e-70	11	0.5709	452
tr|H2JN17|H2JN17_STRHJ	3e-69	23	0.774	504
tr|C5AKH4|C5AKH4_BURGB	2e-67	26	0.5488	403
sp|Q0CEF5|AGALG_ASPTN	4e-63	12	0.611	478
tr|B3CFN7|B3CFN7_9BACE	1e-78	26	0.5828	412
tr|D4KDQ2|D4KDQ2_9FIRM	8e-76	10	0.611	483
tr|D4W2N5|D4W2N5_9FIRM	1e-67	10	0.5478	435
tr|F2USV1|F2USV1_SALS5	1e-88	35	1	467
 
#>60% sequence identity
tr|G1P280|G1P280_MYOLU	1e-108	78	0.9699	420
tr|Q4RTE7|Q4RTE7_TETNG	7e-89	71	0.7319	314
tr|F1Q5G5|F1Q5G5_DANRE	1e-106	67	0.9138	392
tr|E1B725|E1B725_BOVIN	1e-111	76	0.9727	428
tr|H2U095|H2U095_TAKRU	1e-101	65	0.9424	412
tr|G1T044|G1T044_RABIT	1e-109	82	0.9698	417
tr|C0HA45|C0HA45_SALSA	1e-102	63	0.9534	409
tr|H0WQ54|H0WQ54_OTOGA	4e-87	71	0.9953	428
tr|G3WK18|G3WK18_SARHA	1e-108	72	0.9388	414
tr|H2L5H7|H2L5H7_ORYLA	1e-100	61	0.9534	411
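Selecting candidates for the <40% and >60% datasets amounts to filtering the Psi-Blast hits by their identity column. The following sketch shows the idea on four rows from the table above; `split_by_identity` is a hypothetical helper, and how the final 10 sequences per set were picked from the candidates is not covered here.

```python
def split_by_identity(records, low=40, high=60):
    """Return (records with identity < low, records with identity > high)."""
    below = [r for r in records if r[1] < low]
    above = [r for r in records if r[1] > high]
    return below, above


# (id, sequence identity in %) pairs taken from the table above.
records = [
    ("tr|G1T044|G1T044_RABIT", 82),
    ("tr|F2USV1|F2USV1_SALS5", 35),
    ("tr|B3RSE1|B3RSE1_TRIAD", 49),
    ("tr|G2TQE8|G2TQE8_BACCO", 8),
]

below_40, above_60 = split_by_identity(records)
print([name for name, _ in below_40])
print([name for name, _ in above_60])
```

Hits between the two thresholds, such as B3RSE1_TRIAD with 49%, fall into neither of the two sets and can only appear in the whole-range dataset.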

Results

TODO: Add pictures of MSA and find a way to present them, since they are _very_ wide --Rackersederj 07:06, 5 May 2012 (UTC)
Maybe only interesting parts or the active site...? --Rackersederj 13:07, 5 May 2012 (UTC) ... The active site (D170 and D231) and parts of its surroundings are highly conserved! Other functional sites to consider: the glycosylation sites (139, 192, 215, 408) and the disulfide bonds (52 ↔ 94, 56 ↔ 63, 142 ↔ 172, 202 ↔ 223, 378 ↔ 382)

ClustalW

msa/clustalw_fabry_dataset_0.msa

Sequence ID	Number of gaps
tr|G2PG26|G2PG26_STRVO	144
tr|C7PCU7|C7PCU7_CHIPD	383
tr|G8NYA7|G8NYA7_GRAMM	108
sp|Q0CEF5|AGALG_ASPTN	104
tr|E1ZHK5|E1ZHK5_CHLVA	463
tr|Q8RX86|Q8RX86_ARATH	433
tr|F5BFS9|F5BFS9_TOBAC	416
tr|B3RSE1|B3RSE1_TRIAD	464
tr|F8FLU8|F8FLU8_PAEMK	100
tr|H1Q7I8|H1Q7I8_9ACTO	145

conserved	2

msa/clustalw_fabry_dataset_40.msa

Sequence ID	Number of gaps
tr|C5AKH4|C5AKH4_BURGB	187
tr|G2TQE8|G2TQE8_BACCO	164
tr|H2JN17|H2JN17_STRHJ	270
tr|F9HJT9|F9HJT9_9STRE	153
sp|Q0CEF5|AGALG_ASPTN	169
tr|D4W2N5|D4W2N5_9FIRM	151
tr|B3CFN7|B3CFN7_9BACE	230
tr|B8P149|B8P149_POSPM	459
tr|D4KDQ2|D4KDQ2_9FIRM	151
tr|F2USV1|F2USV1_SALS5	431

conserved	2

msa/clustalw_fabry_dataset_61.msa

Sequence ID	Number of gaps
tr|H2U095|H2U095_TAKRU	23
tr|H0WQ54|H0WQ54_OTOGA	33
tr|F1Q5G5|F1Q5G5_DANRE	48
tr|G1P280|G1P280_MYOLU	25
tr|G3WK18|G3WK18_SARHA	16
tr|E1B725|E1B725_BOVIN	18
tr|Q4RTE7|Q4RTE7_TETNG	80
tr|H2L5H7|H2L5H7_ORYLA	29
tr|G1T044|G1T044_RABIT	27
tr|C0HA45|C0HA45_SALSA	49

conserved	154
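The per-sequence gap counts and the number of conserved columns in these tables can be derived from an alignment along the following lines. This is a sketch, not the script that produced the numbers; in particular it assumes that "conserved" counts columns in which all sequences carry the same residue and no gap. The toy alignment is made up.

```python
def gaps_per_sequence(alignment):
    """Count gap characters ('-') in every aligned sequence."""
    return {name: row.count("-") for name, row in alignment.items()}


def conserved_columns(alignment):
    """Count columns where all sequences have the same residue and no gap."""
    rows = list(alignment.values())
    return sum(
        1
        for column in zip(*rows)
        if "-" not in column and len(set(column)) == 1
    )


# Toy alignment with equally long rows, for illustration only.
alignment = {
    "seq1": "MK-LV",
    "seq2": "MKALV",
    "seq3": "MR-LV",
}

gaps = gaps_per_sequence(alignment)
conserved = conserved_columns(alignment)
print(gaps, conserved)
```

Because every row of an alignment has the same length, a sequence's gap count is simply the alignment length minus its own length, which explains why shorter sequences and longer alignments both raise the numbers.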

Muscle

msa/muscle_fabry_dataset_0.msa

Sequence ID	Number of gaps
sp|Q0CEF5|AGALG_ASPTN	187
tr|F8FLU8|F8FLU8_PAEMK	183
tr|C7PCU7|C7PCU7_CHIPD	466
tr|G8NYA7|G8NYA7_GRAMM	191
tr|B3RSE1|B3RSE1_TRIAD	547
tr|G2PG26|G2PG26_STRVO	227
tr|H1Q7I8|H1Q7I8_9ACTO	228
tr|E1ZHK5|E1ZHK5_CHLVA	546
tr|Q8RX86|Q8RX86_ARATH	516
tr|F5BFS9|F5BFS9_TOBAC	499

conserved	6

msa/muscle_fabry_dataset_40.msa

Sequence ID	Number of gaps
tr|H2JN17|H2JN17_STRHJ	305
tr|C5AKH4|C5AKH4_BURGB	222
tr|B8P149|B8P149_POSPM	494
tr|F2USV1|F2USV1_SALS5	466
tr|B3CFN7|B3CFN7_9BACE	265
sp|Q0CEF5|AGALG_ASPTN	204
tr|G2TQE8|G2TQE8_BACCO	199
tr|F9HJT9|F9HJT9_9STRE	188
tr|D4KDQ2|D4KDQ2_9FIRM	186
tr|D4W2N5|D4W2N5_9FIRM	186

conserved	2

msa/muscle_fabry_dataset_61.msa

Sequence ID	Number of gaps
tr|H2L5H7|H2L5H7_ORYLA	39
tr|C0HA45|C0HA45_SALSA	59
tr|F1Q5G5|F1Q5G5_DANRE	58
tr|Q4RTE7|Q4RTE7_TETNG	90
tr|H2U095|H2U095_TAKRU	33
tr|H0WQ54|H0WQ54_OTOGA	43
tr|G3WK18|G3WK18_SARHA	26
tr|E1B725|E1B725_BOVIN	28
tr|G1T044|G1T044_RABIT	37
tr|G1P280|G1P280_MYOLU	35

conserved	157

T-Coffee

msa/tcoffe_fabry_dataset_0.msa

Sequence ID	Number of gaps
tr|G2PG26|G2PG26_STRVO	491
tr|C7PCU7|C7PCU7_CHIPD	730
sp|Q0CEF5|AGALG_ASPTN	451
tr|E1ZHK5|E1ZHK5_CHLVA	810
tr|G8NYA7|G8NYA7_GRAMM	455
tr|Q8RX86|Q8RX86_ARATH	780
tr|F8FLU8|F8FLU8_PAEMK	447
tr|F5BFS9|F5BFS9_TOBAC	763
tr|B3RSE1|B3RSE1_TRIAD	811
tr|H1Q7I8|H1Q7I8_9ACTO	492

conserved	8

msa/tcoffe_fabry_dataset_40.msa

Sequence ID	Number of gaps
tr|B3CFN7|B3CFN7_9BACE	540
tr|C5AKH4|C5AKH4_BURGB	497
tr|G2TQE8|G2TQE8_BACCO	474
tr|H2JN17|H2JN17_STRHJ	580
tr|B8P149|B8P149_POSPM	769
tr|F9HJT9|F9HJT9_9STRE	463
sp|Q0CEF5|AGALG_ASPTN	479
tr|D4W2N5|D4W2N5_9FIRM	461
tr|D4KDQ2|D4KDQ2_9FIRM	461
tr|F2USV1|F2USV1_SALS5	741

conserved	11

msa/tcoffe_fabry_dataset_61.msa

Sequence ID	Number of gaps
tr|H2U095|H2U095_TAKRU	44
tr|H0WQ54|H0WQ54_OTOGA	54
tr|F1Q5G5|F1Q5G5_DANRE	69
tr|G1P280|G1P280_MYOLU	46
tr|G3WK18|G3WK18_SARHA	37
tr|Q4RTE7|Q4RTE7_TETNG	101
tr|E1B725|E1B725_BOVIN	39
tr|H2L5H7|H2L5H7_ORYLA	50
tr|G1T044|G1T044_RABIT	48
tr|C0HA45|C0HA45_SALSA	70

conserved	156

3D-Coffee

msa/3Dcoffee_fabry_dataset_0.msa

Sequence ID	Number of gaps
tr|G2PG26|G2PG26_STRVO	491
tr|C7PCU7|C7PCU7_CHIPD	730
sp|Q0CEF5|AGALG_ASPTN	451
tr|E1ZHK5|E1ZHK5_CHLVA	810
tr|G8NYA7|G8NYA7_GRAMM	455
tr|Q8RX86|Q8RX86_ARATH	780
tr|F8FLU8|F8FLU8_PAEMK	447
tr|F5BFS9|F5BFS9_TOBAC	763
tr|B3RSE1|B3RSE1_TRIAD	811
tr|H1Q7I8|H1Q7I8_9ACTO	492

conserved	8

msa/3Dcoffee_fabry_dataset_40.msa

Sequence ID	Number of gaps
tr|B3CFN7|B3CFN7_9BACE	540
tr|C5AKH4|C5AKH4_BURGB	497
tr|G2TQE8|G2TQE8_BACCO	474
tr|H2JN17|H2JN17_STRHJ	580
tr|B8P149|B8P149_POSPM	769
tr|F9HJT9|F9HJT9_9STRE	463
sp|Q0CEF5|AGALG_ASPTN	479
tr|D4W2N5|D4W2N5_9FIRM	461
tr|D4KDQ2|D4KDQ2_9FIRM	461
tr|F2USV1|F2USV1_SALS5	741

conserved	11

msa/3Dcoffee_fabry_dataset_61.msa

Sequence ID	Number of gaps
tr|H2U095|H2U095_TAKRU	44
tr|H0WQ54|H0WQ54_OTOGA	54
tr|F1Q5G5|F1Q5G5_DANRE	69
tr|G1P280|G1P280_MYOLU	46
tr|G3WK18|G3WK18_SARHA	37
tr|Q4RTE7|Q4RTE7_TETNG	101
tr|E1B725|E1B725_BOVIN	39
tr|H2L5H7|H2L5H7_ORYLA	50
tr|G1T044|G1T044_RABIT	48
tr|C0HA45|C0HA45_SALSA	70

conserved	156

