Difference between revisions of "Fabry:Sequence alignments (sequence searches and multiple alignments):Results"

From Bioinformatikpedia

Revision as of 09:37, 6 May 2012

Please see Task 2 for our scripts and line of action on this topic.

Reference sequence

The reference sequence of α-Galactosidase A used in this task was obtained from Swiss-Prot (P06280).

>gi|4504009|ref|NP_000160.1| alpha-galactosidase A precursor [Homo sapiens]
MQLRNPELHLGCALALRFLALVSWDIPGARALDNGLARTPTMGWLHWERFMCNLDCQEEPDSCISEKLFM
EMAELMVSEGWKDAGYEYLCIDDCWMAPQRDSEGRLQADPQRFPHGIRQLANYVHSKGLKLGIYADVGNK
TCAGFPGSFGYYDIDAQTFADWGVDLLKFDGCYCDSLENLADGYKHMSLALNRTGRSIVYSCEWPLYMWP
FQKPNYTEIRQYCNHWRNFADIDDSWKSIKSILDWTSFNQERIVDVAGPGGWNDPDMLVIGNFGLSWNQQ
VTQMALWAIMAAPLFMSNDLRHISPQAKALLQDKDVIAINQDPLGKQGYQLRQGDNFEVWERPLSGLAWA
VAMINRQEIGGPRSYTIAVASLGKGVACNPACFITQLLPVKRKLGFYEWTSRLRSHINPTGTVLLQLENT
MQMSLKDLL

Sequence searches

Blast

<figtable id="blastidev">

GO terms of P06280 and each BLAST hit (with E-value <= 0.003) compared. Upper picture: percentage of shared terms relative to the number of GO terms of P06280 (AGAL_HUMAN). Second picture: percentage of shared terms relative to the number of GO terms of each hit
Histogram of the logarithmic E-values of the BLAST hits for P06280
Histogram of the positive amino acids of the pairwise alignments of the BLAST hits for P06280
Histogram of the identical amino acids of the pairwise alignments of the BLAST hits for P06280
Histogram of the length of the BLAST hits for P06280

</figtable>

First we performed a BLAST search with the default parameters. Since all hits were significant, we raised the number of one-line descriptions shown (-v) as well as the number of database sequences for which alignments are shown (-b). This led to 663 hits with an E-value smaller than or equal to 0.003, which we used as the significance threshold in our search. For these proteins we extracted the E-value, the number of positive and of identical amino acids in the pairwise alignments, and the length of each hit. A histogram of each of these features is shown on the left.
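The significance filter described above can be sketched as follows. This is only a minimal illustration, assuming the hits have already been parsed into (identifier, E-value) pairs; the function name `significant_hits` and the example values are hypothetical and are not taken from the scripts of Task 2.

```python
def significant_hits(hits, cutoff=0.003):
    """Keep the hits whose E-value is smaller than or equal to the cutoff."""
    return [(name, evalue) for name, evalue in hits if evalue <= cutoff]


# Hypothetical example values, not real search results.
example = [("hit_a", 1e-120), ("hit_b", 0.003), ("hit_c", 0.5)]
kept = significant_hits(example)
print(kept)
```

The comparison uses `<=` so that a hit lying exactly on the threshold is still counted as significant, matching the wording "smaller than or equal to 0.003".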

For the comparison of the GO terms, we obtained the set of terms for each hit and counted how many of them it has in common with the GO terms of the search protein α-Galactosidase A. We divided the number of common terms by the number of GO terms of P06280 (49). Since these proportions are very small, we also explored the fraction of each hit's GO terms that is shared with the reference terms, i.e. we divided the number of common terms by the number of terms of the hit. The histogram of this second ratio shows that on average over 80% of the GO terms of each hit are shared with those of AGAL. The low average coverage of the α-Galactosidase A terms by the hit terms may be due to the fact that humans are a lot more complex than the species the homologous hits belong to, so the protein has to fulfill more needs in a more complex organism and thus has more GO terms assigned.
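The two ratios can be written down compactly. The following is a minimal sketch, assuming the GO terms of the reference and of a hit are available as collections of identifiers; the function name `go_overlap` and the placeholder term names are hypothetical.

```python
def go_overlap(reference_terms, hit_terms):
    """Return (shared / reference terms, shared / hit terms) for one hit."""
    reference_terms = set(reference_terms)
    hit_terms = set(hit_terms)
    common = reference_terms & hit_terms
    of_reference = len(common) / len(reference_terms) if reference_terms else 0.0
    of_hit = len(common) / len(hit_terms) if hit_terms else 0.0
    return of_reference, of_hit


# Placeholder term names, not real GO identifiers.
reference = {"term_a", "term_b", "term_c", "term_d"}
hit = {"term_a", "term_b", "term_x"}
of_reference, of_hit = go_overlap(reference, hit)
print(of_reference, of_hit)
```

A hit with few terms can reach a high second ratio while covering only a small part of the reference terms, which is the pattern described above.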

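The average length fits the length of the α-Galactosidase A protein very well. This can be seen in the picture "Histogram of the length of the BLAST hits for P06280" on the left. On average, over 51% of the residues are positive and almost 36% are identical. Adding these up gives 87% similar residues per alignment, although this sum is only valid if the positive count excludes identities; the "Positives" value reported by BLAST already includes them, in which case the similarity to the protein sequence of AGAL is about 51%.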


Psi-Blast

HHblits

We searched the "big80" database with HHblits using the default settings and also with the maximum number of possible iterations (8). <figtable id="blastidev">

2 iterations - default
GO terms of P06280 and each HHblits hit (with E-value < 0.003) compared. Upper picture: percentage of shared terms relative to the number of GO terms of P06280 (AGAL_HUMAN). Second picture: percentage of shared terms relative to the number of GO terms of each hit
Histogram of the logarithmic E-values of the HHblits hits for P06280
Histogram of the similarity of the HHblits hits to P06280
Histogram of the identical amino acids of the pairwise alignments of the HHblits hits for P06280

</figtable>


The HHblits search was performed with the maximum E-value in the summary and alignment list set to 0.003 (-E) and the minimum number of lines in the summary hit list set to 700 (-z). From this search we obtained only 326 significant hits.

We also compared the GO terms in the same manner as in the BLAST section. On average, only 14% of the GO terms of AGAL_HUMAN are included in the hits' terms. The "reverse" calculation revealed that around 70% of the hits' GO terms are shared with the search protein, which is rather low compared to the BLAST results (over 80%).

The mean E-value, in contrast, is almost equal to the average E-value of the BLAST search. The same applies to the number of identical amino acids.



<figtable id="blastidev">

8 iterations
GO terms of P06280 and each HHblits hit (with E-value < 0.003 and 8 iterations) compared. Upper picture: percentage of shared terms relative to the number of GO terms of P06280 (AGAL_HUMAN). Second picture: percentage of shared terms relative to the number of GO terms of each hit
Histogram of the logarithmic E-values of the HHblits search with 8 iterations for P06280
Histogram of the similarity of the HHblits hits (search with 8 iterations) to P06280
Histogram of the identical amino acids of the pairwise alignments of the HHblits hits (search with 8 iterations) for P06280

</figtable>
Since we considered the number of significant hits too low, we performed another HHblits search with 8 iterations. This yielded 729 hits with an E-value smaller than or equal to 0.003.

The similarity in GO terms improved, but all other comparative values, such as average E-value, similarity and identical residues, got worse.

Thus increasing the number of iterations may help to find more homologous proteins, but since the similarity is lower, these proteins may be less conserved than those detected with fewer iterations.



The first HHblits run took about 2.5 minutes, the second one about 16 minutes (see section Time).

Comparison sequence searches

Comparing the hits

Venn diagram of the proteins found by BLAST, HHblits and HHblits with 8 iterations
Venn diagram of the proteins found by BLAST, Psi-BLAST (10 iterations and E-value cutoff 10e-10) and HHblits with 8 iterations
Venn diagram of the first 100 proteins found by BLAST, HHblits and HHblits with 8 iterations
Venn diagram of the first 100 proteins found by BLAST, Psi-BLAST and HHblits with 8 iterations

Venn diagrams created with Oliveros, J.C. (2007) VENNY. An interactive tool for comparing lists with Venn Diagrams.

The Venn diagrams show that only a small portion of the hits is shared by all three methods; each method has a largely unique set of findings. The biggest overlap is between the BLAST and Psi-BLAST hits, which matches our expectations, since these two use similar approaches, while HHblits searches by iterative HMM-HMM comparison. This becomes most obvious in the last picture, where only the 100 best hits of the three methods are compared. Only eleven hits are common to all methods. The remaining 89 BLAST hits are shared with Psi-BLAST, while the remaining 89 HHblits hits are unique to the HHblits search. The comparison of all hits with an E-value smaller than or equal to 0.003 looks similar. It is noteworthy that here a small number of hits is shared only by HHblits and BLAST (52) or only by Psi-BLAST and HHblits (2). The two HHblits searches with 2 and 8 iterations also overlap to a large extent.
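The regions of such a three-set Venn diagram can also be computed directly from the hit lists. The sketch below is an illustration with plain Python sets and is not how the diagrams above were produced (those were created with VENNY); the function name `venn_regions` and the placeholder identifiers are hypothetical.

```python
def venn_regions(a, b, c):
    """Split three hit lists into the seven disjoint regions of a Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "all_three": a & b & c,
        "a_and_b_only": (a & b) - c,
        "a_and_c_only": (a & c) - b,
        "b_and_c_only": (b & c) - a,
        "a_only": a - b - c,
        "b_only": b - a - c,
        "c_only": c - a - b,
    }


# Placeholder identifiers, not real hits.
blast = ["P1", "P2", "P3", "P4"]
psiblast = ["P2", "P3", "P5"]
hhblits = ["P3", "P4", "P6"]
regions = venn_regions(blast, psiblast, hhblits)
for name, members in regions.items():
    print(name, sorted(members))
```

Because the seven regions are disjoint, their sizes add up to the number of distinct proteins found by all methods together, which is a quick sanity check for the counts read off a diagram.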

Comparing the E-values

Fabry animation.gif

Above you can see an animated histogram of the distribution of the E-values for the searches performed with the different methods. The R script is based on Andrea's R script psiBlast.evalueHist.Rscript.

As one can clearly see, the number of significant hits in the Psi-BLAST search exceeds the number of hits in either of the other two searches by far. Its histogram also looks more like a normal distribution with a mean of about -80 on the logarithmic scale, while the histograms of the BLAST and HHblits searches do not, but are concentrated towards zero. The fewest hits are generated by the "ordinary" BLAST search (663); the Psi-BLAST search finds roughly ten times as many (6868). Thus, with respect to the E-values, we would prefer Psi-BLAST.
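The binning behind a histogram of logarithmic E-values can be sketched in a few lines. This is not the R script mentioned above but a minimal Python illustration; the base-10 logarithm, the bin width of 10, the floor used for E-values reported as 0 and the example values are all assumptions.

```python
import math
from collections import Counter


def log_evalue_histogram(evalues, bin_width=10, floor=1e-300):
    """Count log10(E-value) per bin; each bin is labelled by its lower edge."""
    counts = Counter()
    for evalue in evalues:
        # E-values reported as 0 have no logarithm, so clamp them to a floor.
        log_e = math.log10(max(evalue, floor))
        bin_start = math.floor(log_e / bin_width) * bin_width
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical E-values, not real search results.
histogram = log_evalue_histogram([3e-85, 2e-81, 5e-12, 0.002])
print(histogram)
```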

Time

We measured the run time of the programs with the command "time".


Method     Parameter           Time
Blast      b = 700, v = 700    1m53.944s
HHBlits    default             2m19.519s
HHBlits    n = 8               16m7.754s
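To compare these run times numerically, the values printed by "time" can be converted to seconds. The following is a minimal sketch; the function name `parse_time` and the dictionary keys are hypothetical, while the time strings are the ones from the table above.

```python
import re


def parse_time(text):
    """Convert a string such as '16m7.754s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


timings = {
    "Blast": "1m53.944s",
    "HHBlits default": "2m19.519s",
    "HHBlits n = 8": "16m7.754s",
}
seconds = {method: parse_time(value) for method, value in timings.items()}
for method, value in seconds.items():
    print(f"{method}: {value:.3f} s")
print(f"ratio: {seconds['HHBlits n = 8'] / seconds['HHBlits default']:.1f}")
```

With these values the HHBlits search with 8 iterations takes roughly seven times as long as the default search.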


Multiple sequence alignments

Dataset

We used the following 21 proteins to create multiple sequence alignments with the different methods. Since we only had three sequences for the first group, we added an additional sequence to each of the other ones to have a sufficiently large number of proteins in our datasets.

#99 - 90% sequence identity 
>tr|B4DLT5|B4DLT5_HUMAN cDNA FLJ56739, highly similar to Alpha-galactosidase A (EC 3.2.1.22) OS=Homo sapiens | Identities = 183/183 (100%)
>tr|G3SI81|G3SI81_GORGO Uncharacterized protein OS=Gorilla gorilla|Identities = 390/432 (90%)
>UP20|LOZLIBUBA|1|64 Alpha-galactosidase A (Fragment). [Ateles belzebuth chamek]|O97898|Identities=97%

#89 - 60% sequence identity
>tr|G1P280|G1P280_MYOLU Uncharacterized protein OS=Myotis lucifugus|Identities = 341/410 (83%)
>tr|G1T044|G1T044_RABIT Uncharacterized protein OS=Oryctolagus|Identities = 348/420 (82%)
>tr|E1B725|E1B725_BOVIN Uncharacterized protein OS=Bos taurus GN=GLA|Identities = 334/413 (80%)
>tr|D3ZJF9|D3ZJF9_RAT Galactosidase, alpha (Mapped), isoform CRA_a|Identities = 331/418 (79%)
>tr|H2L5H7|H2L5H7_ORYLA Uncharacterized protein (Fragment)|Identities = 255/402 (63%)
>tr|E1BT44|E1BT44_CHICK Uncharacterized protein OS=Gallus gallus|Identities = 276/385 (71%)
 
#59 - 40% sequence identity 
>tr|E2B637|E2B637_HARSA Alpha-N-acetylgalactosaminidase|Identities = 204/411 (49%)
>tr|F4WJD6|F4WJD6_ACREC Alpha-N-acetylgalactosaminidase|Identities = 195/394 (49%)
>tr|D2T1A8|D2T1A8_CHOPA Putative alpha-N-acetylgalactosaminidase|Identities = 202/413 (48%)
>tr|G5BPR2|G5BPR2_HETGA Alpha-N-acetylgalactosaminidase|Identities = 168/292 (57%)
>tr|F5BFS9|F5BFS9_TOBAC Alpha-galactosidase OS=Nicotiana tabacum|Identities = 145/329 (44%)
>tr|Q3UZX5|Q3UZX5_MOUSE Putative uncharacterized protein OS=Mus|Identities = 140/239 (58%)

#39 - 20% sequence identity 
>tr|F1T8Q0|F1T8Q0_9CLOT Alpha-galactosidase (Precursor)|Identities = 164/416 (39%)
>tr|G6DHZ5|G6DHZ5_DANPL Alpha-N-acetylgalactosaminidase OS=Danaus|Identities = 110/294 (37%)
>tr|C7G8V8|C7G8V8_9FIRM Alpha-galactosidase OS=Roseburia|Identities = 134/347 (38%)
>tr|F8EMB8|F8EMB8_RUNSL Alpha-galactosidase (Precursor) OS=Runella|Identities = 106/366 (28%)
>tr|G0FV12|G0FV12_AMYMD Melibiase OS=Amycolatopsis mediterranei S699|Identities = 104/411 (25%)
>tr|B9TQP6|B9TQP6_RICCO Alpha-galactosidase/alpha-n-acetylgalactosaminidase|Identities = 31/93 (33%)
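The assignment of a hit to one of the four identity groups can be reproduced from the "Identities = x/y" part of its header. This is a minimal sketch, assuming each group is defined by its lower bound, so that a 100% hit lands in the top group as in the list above; the function names are hypothetical, and headers without a fraction (such as the Ateles belzebuth chamek entry) are not handled.

```python
import re

# (lower bound in percent, group label as used in the dataset list)
GROUPS = [(90, "99 - 90%"), (60, "89 - 60%"), (40, "59 - 40%"), (20, "39 - 20%")]


def identity_percent(header):
    """Extract x/y from 'Identities = x/y (..%)' and return it as a percentage."""
    match = re.search(r"Identities\s*=\s*(\d+)/(\d+)", header)
    if match is None:
        return None
    identical, length = int(match.group(1)), int(match.group(2))
    return 100.0 * identical / length


def identity_group(percent):
    """Return the label of the first group whose lower bound is reached."""
    for lower, label in GROUPS:
        if percent >= lower:
            return label
    return None


header = ">tr|F5BFS9|F5BFS9_TOBAC Alpha-galactosidase OS=Nicotiana tabacum|Identities = 145/329 (44%)"
percent = identity_percent(header)
print(round(percent, 1), identity_group(percent))
```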

Results

TODO: Add pictures of MSA and find a way to present them, since they are _very_ wide --Rackersederj 07:06, 5 May 2012 (UTC)
Maybe only interesting parts or the active site...? --Rackersederj 13:07, 5 May 2012 (UTC) ... Active site (D170 and D231) and partly the surrounding parts are highly conserved! Functional sites... maybe also glycosylation sites (139, 192, 215, 408) and disulfide bonds (52 ↔ 94, 56 ↔ 63, 142 ↔ 172, 202 ↔ 223, 378 ↔ 382)
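Before cutting the MSA pictures down to the functional sites, the listed positions can be checked against the reference sequence given at the top of this page. The sketch below is only an illustration; the helper name `residue` is hypothetical. It uses 1-based positions in the ungapped reference sequence, so the positions still have to be mapped to alignment columns, which are shifted by gaps.

```python
# Reference sequence as given in the section "Reference sequence".
REFERENCE = (
    "MQLRNPELHLGCALALRFLALVSWDIPGARALDNGLARTPTMGWLHWERFMCNLDCQEEPDSCISEKLFM"
    "EMAELMVSEGWKDAGYEYLCIDDCWMAPQRDSEGRLQADPQRFPHGIRQLANYVHSKGLKLGIYADVGNK"
    "TCAGFPGSFGYYDIDAQTFADWGVDLLKFDGCYCDSLENLADGYKHMSLALNRTGRSIVYSCEWPLYMWP"
    "FQKPNYTEIRQYCNHWRNFADIDDSWKSIKSILDWTSFNQERIVDVAGPGGWNDPDMLVIGNFGLSWNQQ"
    "VTQMALWAIMAAPLFMSNDLRHISPQAKALLQDKDVIAINQDPLGKQGYQLRQGDNFEVWERPLSGLAWA"
    "VAMINRQEIGGPRSYTIAVASLGKGVACNPACFITQLLPVKRKLGFYEWTSRLRSHINPTGTVLLQLENT"
    "MQMSLKDLL"
)

GLYCOSYLATION_SITES = [139, 192, 215, 408]
DISULFIDE_BONDS = [(52, 94), (56, 63), (142, 172), (202, 223), (378, 382)]


def residue(sequence, position):
    """Return the residue at a 1-based sequence position."""
    return sequence[position - 1]


glyco = {pos: residue(REFERENCE, pos) for pos in GLYCOSYLATION_SITES}
bonds = {
    pair: (residue(REFERENCE, pair[0]), residue(REFERENCE, pair[1]))
    for pair in DISULFIDE_BONDS
}
print(glyco)
print(bonds)
print(231, residue(REFERENCE, 231))
```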