Difference between revisions of "Glucocerebrosidase sequence alignments"

Revision as of 22:16, 23 May 2011

Sequence searches

Several different tools were used in order to look for sequences that are related to glucocerebrosidase in the non-redundant sequence database.

FASTA

As Fasta was not initially installed, it was downloaded from the EBI FTP Download Site <ref>ftp://ftp.ebi.ac.uk/pub/software/unix/fasta/fasta36/</ref>.

Command:

../bin/fasta36 gbaseq.fasta /data/blast/nr/nr > fasta_gba_search.out

BLAST

Command:

blastall -p blastp -d /data/blast/nr/nr -i gbaseq.fasta -o blast.out

PSI-BLAST

This tool was used four times with the following parameters:

3 iterations, e-value cutoff 0.005

3 iterations, e-value cutoff 10e-6

5 iterations, e-value cutoff 0.005

5 iterations, e-value cutoff 10e-6

Command:

blastpgp -d /data/blast/nr/nr -i gbaseq.fasta -o psi_blast_x_y.out -j x -h y

Furthermore the online version of HHSearch <ref>http://toolkit.lmb.uni-muenchen.de/hhpred</ref> was used to search against the pdb70 database of May 14th.

Results

The sequence search with FASTA returned 520 sequences. BLAST, as well as the different PSI-BLAST runs, returned 500 sequences due to restrictions in the output format. The search with HHSearch against the pdb17 database resulted in 100 sequences.

Overlap

Overlap of the results of the different sequence searches. (At most 5 different lists could be captured in one diagram).

The overlaps of the results of the different tools (FASTA, BLAST, 4 PSI-BLAST runs) have been investigated and are visualized using Venn-Diagrams (created with <ref>http://bioinformatics.psb.ugent.be/webtools/Venn/</ref>). The results of HHSearch are not included as a different database (pdb70) was used and therefore the results could not be compared.
In total, the six different sequence searches returned 626 unique sequences whereof 405 sequences were returned by each of the searches and 41 were only found by a single one. There were no sequences which have only been found by PSI-BLAST with 3 iterations whereas FASTA returned 25 unique sequences. These numbers indicate that the different sequence searches return very similar sequences and that each of the tools could be used to retrieve the related sequences of glucocerebrosidase.

Sequence Identity

Identity distribution of the different sequence searches.

As BLAST only shows details for the 250 "best" alignments, the sequence identitiy could only be analyzed for the 250 first sequences of the BLAST and PSI-BLAST results. The table below shows the number of sequences that fall into a certain interval. These numbers are also visualized in the diagram to the right. Most of the sequences which BLAST, PSI-BLAST and FASTA found lie within the range of 20 to 39 percent whereas HHSearch has its peak in the area of 0 to 19 percent. As HHSearch was applied to the pdb database, only sequences with a known 3D-structure could be found. The fact that no sequence with an identity in the interval of 30 to 99 percent was found with HHsearch indicates that there are no known structures of closely related proteins of glucocerebrosidase. The search returned one sequence with a 100% identity: 2NT0 <ref>http://www.pdb.org/pdb/explore/explore.do?structureId=2NT0</ref>, the structure of glucocerebrosidase with a pharmacological chaperone. Interestingly, almost no sequences were found with a sequence identity in the intervall of 60 to 80 percent by neither of the tools.

Identity	BLAST	PSI-BLAST 3 iterations e-value cutoff 0.005	PSI-BLAST 3 iterations e-value cutoff 10e-6	PSI-BLAST 5 iterations e-value cutoff 0.005	PSI-BLAST 5 iterations e-value cutoff 10e-6	FASTA	HHSearch
0-9 %	0	0	0	0	0	0	8
10-19%	0	0	0	0	0	0	89
20-29%	43	85	85	96	94	240	2
30-39 %	127	98	98	90	91	160	0
40-49 %	28	26	26	24	25	44	0
50-59 %	6	4	4	3	3	6	0
60-69 %	3	0	0	0	0	3	0
70-79 %	1	1	1	1	1	6	0
80-89 %	15	9	9	9	9	27	0
90-100 %	27	27	27	27	27	34	1
Total	250	250	250	250	250	520	100

E-Values

E-Value distribution of the different sequence searches.

The logarithms of the E-Values of the different datasets are visualized in the diagram to the right. E-Values of zero are not shown in this diagram as the logarithm for zero is undefined. BLAST found 37 sequences, FASTA 25, PSI-BLAST (3 it.) 19-21 and HHsearch found 5 sequences with an E-Value of zero. The results (round 3/5) of the different PSI-BLAST runs show similar E-Value distributions which cover "better" values than the results of BLAST, FASTA and HHSearch. Results of the latter can be compared to the values of the first rounds of the four PSI-BLAST runs. PSI-BLAST uses sequences found in the previous round to construct a score model for the next round and is therefore more sensitive to distantly related proteins. HHSearch returns sequences with the worst E-Values, which indicates, that the pdb database does not provide many related sequences to our reference sequence. This observation is in agreement with the results of the identity distrubution.

Comparison to HSSP

The sequences obtained were compared to the search results of the term "glucocerebrosidase" in the HSSP database <ref>http://mrs.cmbi.ru.nl/mrs-5/entry?db=hssp&id=3keh&rq=glucocerebrosidase</ref>, which provides information of homology-derived secondary structure of proteins. As the nonredundant database (nr) consists of several databases (GenBank, RefSeq, PDB, SwissProt, PIR, PRF) the identifiers of the sequences had to be converted into Uniprot AC numbers (as used in the HSSP database)<ref>http://pir.georgetown.edu/pirwww/search/idmapping.shtml</ref>. Apart from HHSearch the different search tools almost aquire the same precision and the same recall. BLAST and FASTA return slighlty better results then the different PSI-BLAST runs.

	BLAST	PSI-BLAST 3 0.005	PSI-BLAST 3 10e-6	PSI-BLAST 5 0.005	PSI-BLAST 5 10e-6	FASTA	HHSearch
Recall	0.54	0.47	0.47	0.47	0.47	0.54	0.01
Precision	0.64	0.56	0.55	0.55	0.55	0.61	0.04

Multiple sequence alignments

Sequences used for multiple sequence alignments

For the multiple sequence alignments the reference sequence and twenty other sequences (found with the sequence searches) were used. It was tried to avoid hypothetical and potential proteins and to take sequences that have similiar identities in all sequence searches. For the multiple sequence alignments sequences from a variety of organisms like human, orang-utan, mouse, bacteria and worms were used.
The following tables show the chosen sequences with their identities in the different database searches. Only one pdb structure with less than a 100% sequence identity was found and could be used for the multiple sequence alignment.
Our reference sequence: P04062, GLCM_HUMAN Glucosylceramidase

99 - 90% sequence identity
NP_001127488.1	glucosylceramidase precursor	Pongo abelii	95.0, 98.0, 98.0, 98.0, 98.0, 98.1
3KE0	A Chain A, Crystal Structure Of N370s Glucocerebrosidase At Acidic Ph.		97.0, 99.0, 99.0, 99.0, 99.0, 99.8
EAW53100.1	glucosidase, beta; acid (includes glucosylceramidase), isoform CRA_a	Homo sapiens	97.0, 99.0, 99.0, 99.0, 99.0, 99.6
NP_001165283.1	glucosylceramidase isoform 3 precursor	Homo sapiens	88.0, 90.0, 90.0, 90.0, 90.0, 90.9
NP_001128784.1	DKFZP469B0323 protein	Pongo abelii	95.0, 97.0, 97.0, 97.0, 97.0, 97.4
89 - 60% sequence identity
NP_032120.1	glucosylceramidase isoform 1	Mus musculus	84.0, 86.0, 86.0, 86.0, 86.0, 86.4
EDL15229.1	glucosidase, beta, acid, isoform CRA_a	Mus musculus	84.0, 86.0, 86.0, 86.0, 86.0, 86.3
NP_001121111.1	glucosidase, beta, acid	Rattus norvegicus	85.0, 87.0, 87.0, 87.0, 87.0, 87.6
NP_001039886.1	glucosylceramidase precursor	Bos taurus	86.0, 89.0, 89.0, 89.0, 89.0, 89.2
NP_001005730.1	glucosylceramidase precursor	Sus scrofa	87.0, 89.0, 89.0, 89.0, 89.0, 89.6
59 - 40% sequence identity
EFN73638.1	Glucosylceramidase	Camponotus floridanus	41.0, 40.0, 40.0, 41.0, 40.0, 42.2
CAG11843.1	unnamed protein product	Tetraodon nigroviridis	52.0, 53.0, 53.0, 53.0, 53.0, 54.2
NP_500785.1	hypothetical protein Y4C6B.6	Caenorhabditis elegans	41.0, 40.0, 39.0, 40.0, 39.0, 41.9
EFA07058.1	hypothetical protein TcasGA2_TC010035	Tribolium castaneum	41.0, 42.0, 41.0, 42.0, 41.0, 43.2
EFO26573.1	O-glycosyl hydrolase family 30 protein	Loa loa	40.0, 40.0, 40.0, 40.0, 40.0, 41.7
39 - 20% sequence identity
ZP_07040024.1	glucosylceramidase	Bacteroides sp. 3_1_23	26.0, 24.0, 24.0, 24.0, 24.0, 25.5
YP_244236.1	glycosyl hydrolase	Xanthomonas campestris pv. campestris str. 8004	33.0, 31.0, 30.0, 31.0, 31.0, 33.4
ZP_01885435.1	glycosyl hydrolase	Pedobacter sp. BAL39	36.0, 33.0, 32.0, 33.0, 33.0, 37.2
ZP_07388379.1	Glucan endo-1,6-beta-glucosidase	Paenibacillus curdlanolyticus YK9	28.0, 24.0, 23.0, 24.0, 24.0, 30.1
NP_623885.1	O-glycosyl hydrolase family protein	Thermoanaerobacter tengcongensis MB4	37.0, 34.0, 33.0, 34.0, 32.0, 37.5

Tools

Cobalt

Cobalt was not yet installed, so we downloaded it from the NCBI Server<ref>ftp://ftp.ncbi.nlm.nih.gov/pub/cobalt/executables/2.0.1/</ref>.

command
time /home/student/Desktop/ncbi-cobalt-2.0.1/cobalt -i multiple_alignment.fasta -norps T > cobalt_multiple_alignment.aln

	time
real	0m3.488s
user	0m2.320s
sys	0m0.180s

ClustalW

command
time clustalw

	time
real	0m40.625s
user	0m5.320s
sys	0m0.070s

Muscle

command
time muscle -in multiple_alignment.fasta -out muscle_multiple_alignment.aln

	time
real	0m3.018s
user	0m1.710s
sys	0m0.100s

T-Coffee

command
time t_coffee multiple_alignment.fasta

	time
real	0m41.360s
user	0m34.000s
sys	0m0.920s

3D-Coffee/Expresso

command
time t_coffee -seq multiple_alignment.fasta -mode expresso -pdb_type dn

	time
real	12m19.825s
user	5m17.140s
sys	0m46.970s

The following pictures show cut-outs of the different alignments with Jalview.

Multiple Alignment by Cobalt in Jalview

Multiple Alignment by ClustalW in Jalview

Multiple Alignment by Muscle with Jalview

Multiple Alignment by T-Coffee with Jalview

Multiple Alignment by 3D-Coffee/Expresso in Jalview

Results

Conserved Columns

	Cobalt	ClustalW	Muscle	T-Coffee	3D-Coffee/Expresso
>50%	133	133	131	125	123
>60%	98	98	96	104	90
>70%	64	64	62	67	58
>80%	53	54	54	50	57
>90%	52	51	51	52	49
100%	25	23	26	26	25

Looking at the alignments in Jalview one can see, that after about 100-120 residues of the glucocerebrosidase sequence the conservation gets very high. The first part does not seem to be very important for the function of the protein and therefore mutated gradually in the different organisms. The conserved regions which follow may be necessary for the catalytic function of the protein. If there are mutations, these are substitutions to amino acids with the same properties, for example arginine to lysine, so that the function is maintained.

Functionally important residues

Glutamine residues 235 and 340 play key roles in the active site<ref>http://www.ncbi.nlm.nih.gov/books/NBK1269/</ref>. The different alignments almost show the same high conservation:

T-Coffee	15/21, 17/21
3D-Coffee	15/21, 16/21
Muscle	15/21, 17/21
Cobalt	15/21, 17/21
ClustalW	15/21, 17/21

In the other sequences asparagine and threonine are aligned to the glutamines, which also are polar amino acids.

Gaps in the reference structure

	Cobalt	ClustalW	Muscle	T-Coffee	3D-Coffee/Expresso
# Gaps	413	404	442	441	546

The reason for most of the gaps are sequence insertions. The largest example is ZP_07388379.1 which has an additional 300 amino acids at the end. Another one is CAG11843.1 with two times 30 amino acid insertions. The remaining gaps are from the alignment itself.
The 3D-Coffee alignment has the most gaps. The tool tried to align the structure and therefore there are more gaps because amino acids are only aligned, when the structure fits, too. In the Jalview alignment there are a lot of very "gappy" parts, as one can see in the picture.

Example of a very "gappy" part of the 3D-coffee alignment

Gaps in secondary structure elements

Secondary structure of Glucocerebroside <ref>http://www.pdb.org/pdb/explore/remediatedSequence.do?structureId=1ogs&params.annotationsStr=DSSP%2CSCOP%2CSingle+Nucleotide+Polymorphism+%28SNP%29&params.showJmol=false</ref>

Glucocerebroside with pymol. The interesting regions are colored.

sec. structure	position	Cobalt	ClustalW	Muscle	T-Coffee	3D-Coffee/Expresso
Beta strand	49-52	0	0	0	0	2
Beta strand	54-60	0	0	0	0	0
Beta strand	75-82	5	0	0	0	0
Beta strand	88-94	0	0	0	1	9
Beta strand	96-98	0	0	0	4	18
Beta strand	103-116	4	0	23	0	6
Beta strand	119-123	0	0	0	0	0
Helix	126-132	0	0	0	0	0
Helix	137-148	0	0	0	0	19
Turn	150-153	28	28	28	28	1
Beta strand	157-163	0	0	0	0	0
Beta strand	166-170	0	0	0	0	0
Beta strand	177-179	0	0	0	0	0
Helix	190-193	0	0	0	0	0
Helix	196-206	0	0	0	0	0
Beta strand	212-218	0	0	0	0	2
Helix	222-224	0	0	0	0	0
Beta strand	229-233	7	0	0	7	7
Beta strand	235-238	0	0	0	0	0
Helix	243-261	0	0	0	0	1
Beta strand	267-271	0	0	0	0	0
Helix	277-279	0	0	0	0	0
Helix	292-301	0	0	0	0	0
Helix	202-208	0	0	0	0	0
Turn	311-314	0	0	1	1	2
Beta strand	315-323	0	0	0	0	0
Helix	324-326	0	0	0	0	0
Helix	329-335	0	0	0	0	0
Helix	338-341	0	0	0	0	0
Beta strand	346-352	0	0	0	0	0
Helix	359-369	0	0	0	2	2
Beta strand	373-381	13	13	13	13	0
Helix	396-411	0	0	8	6	13
Beta strand	414-422	8	0	0	0	0
Beta strand	440-444	0	0	0	0	1
Helix	445-447	0	0	1	0	1
Beta strand	449-452	0	0	3	0	0
Helix	454-463	0	0	0	0	1
Beta strand	471-479	1	0	2	2	5
Beta strand	482-489	0	1	0	0	0
Beta strand	495-501	0	0	0	0	0
Beta strand	503-505	0	29	0	0	0
Beta strand	507-513	0	0	10	30	1
Turn	514-516	0	0	0	0	40
Beta strand	517-523	0	0	0	1	12
Beta strand	527-533	4	0	0	0	160

The secondary structure information was obtained from Uniprot<ref>http://www.uniprot.org/uniprot/P04062</ref>. To compare the secondary structure the first picture can be used.
Turn position 150-153 and the beta sheet position 373-381 are of interest, as all of the tools but 3D-Coffee show the same number of gaps. The end of the protein position 503-533 is also interesting, because the different alignments show different gaps. These structures are marked in the picture, which was made with Pymol.
Muscle, Cobalt, ClustalW and T-Coffee only show a few exceptions with few gaps, that they have on their own, whereas 3D-Coffee has many gaps of its own. The turn and the beta sheet mentioned before have no gaps in the alignment with 3D-Coffee but in all other alignments. The results of 3D-Coffee are very different to the alignments of the other tools. Therefore including the 3D-structure seems to have a great influence on the alignment.
The most important thing is, that the secondary structures with the functionally important glutamine residues, beta strand position 235-238 and helix position 338-341 have no gaps in all alignments. Hence the alignments keep the active site aligned, which means, they have a good quality.

References

@@ Line 98: / Line 98: @@
 For the multiple sequence alignments the reference sequence and twenty other sequences (found with the sequence searches) were used. It was tried to avoid hypothetical and potential proteins and to take sequences that have similiar identities in all sequence searches. For the multiple sequence alignments sequences from a variety of organisms like human, orang-utan, mouse, bacteria and worms were used.<br/>
-The following tables show the chosen sequences with their identities in the different database searches. Only one pdb structure with less than a 100% sequence identity was found and could be used for the multiple sequence alignment.
+The following tables show the chosen sequences with their identities in the different database searches. Only one pdb structure with less than a 100% sequence identity was found and could be used for the multiple sequence alignment.<br/>
 Our reference sequence: P04062, GLCM_HUMAN Glucosylceramidase

Difference between revisions of "Glucocerebrosidase sequence alignments"

Revision as of 22:16, 23 May 2011

Contents

Sequence searches

Results

Overlap

Sequence Identity

E-Values

Comparison to HSSP

Multiple sequence alignments

Sequences used for multiple sequence alignments

Tools

Results

Conserved Columns

Functionally important residues

Gaps in the reference structure

Gaps in secondary structure elements

References

Navigation menu

Views

Personal tools

Bioinformatik navigation

MediaWiki navigation

Search

Tools