Glucocerebrosidase sequence alignments
- 1 Sequence searches
- 2 Multiple sequence alignments
- 2.1 Sequences used for multiple sequence alignments
- 2.2 Tools and Results
- 2.3 Analysis
- 3 References
Several different tools were used in order to look for sequences that are related to glucocerebrosidase in the non-redundant sequence database.
- As Fasta was not initially installed, it was downloaded from the EBI FTP Download Site <ref>ftp://ftp.ebi.ac.uk/pub/software/unix/fasta/fasta36/</ref>.
../bin/fasta36 gbaseq.fasta /data/blast/nr/nr > fasta_gba_search.out
blastall -p blastp -d /data/blast/nr/nr -i gbaseq.fasta -o blast.out
- This tool was used four times with the following parameters:
- 3 iterations, e-value cutoff 0.005
- 3 iterations, e-value cutoff 10e-6
- 5 iterations, e-value cutoff 0.005
- 5 iterations, e-value cutoff 10e-6
blastpgp -d /data/blast/nr/nr -i gbaseq.fasta -o psi_blast_x_y.out -j x -h y
Furthermore the online version of HHSearch <ref>http://toolkit.lmb.uni-muenchen.de/hhpred</ref> was used to search against the pdb70 database of May 14th.
The sequence search with FASTA returned 520 sequences. BLAST, as well as the different PSI-BLAST runs, returned 500 sequences due to restrictions in the output format. The search with HHSearch against the pdb17 database resulted in 100 sequences.
The overlaps of the results of the different tools (FASTA, BLAST, 4 PSI-BLAST runs) have been investigated and are visualized in Figure 1 using Venn-Diagrams (created with <ref>http://bioinformatics.psb.ugent.be/webtools/Venn/</ref>). The results of HHSearch are not included as a different database (pdb70) was used and therefore the results could not be compared.
In total, the six different sequence searches returned 626 unique sequences whereof 405 sequences were returned by each of the searches and 41 were only found by a single one. There were no sequences which have only been found by PSI-BLAST with 3 iterations whereas FASTA returned 25 unique sequences. These numbers indicate that the different sequence searches return very similar sequences and that each of the tools could be used to retrieve the related sequences of glucocerebrosidase.
As BLAST only shows details for the 250 "best" alignments, the sequence identitiy could only be analyzed for the 250 first sequences of the BLAST and PSI-BLAST results. The table below shows the number of sequences that fall into a certain interval. These numbers are also visualized in Figure 2 to the right. Most of the sequences which BLAST, PSI-BLAST and FASTA found lie within the range of 20 to 39 percent whereas HHSearch has its peak in the area of 0 to 19 percent. As HHSearch was applied to the pdb database, only sequences with a known 3D-structure could be found. The fact that no sequence with an identity in the interval of 30 to 99 percent was found with HHsearch indicates that there are no known structures of closely related proteins of glucocerebrosidase. The search returned one sequence with a 100% identity: 2NT0 <ref>http://www.pdb.org/pdb/explore/explore.do?structureId=2NT0</ref>, the structure of glucocerebrosidase with a pharmacological chaperone. Interestingly, almost no sequences were found with a sequence identity in the intervall of 60 to 80 percent by either of the tools.
e-value cutoff 0.005
e-value cutoff 10e-6
e-value cutoff 0.005
e-value cutoff 10e-6
The logarithms of the E-Values of the different datasets are visualized in Figure 3 to the right. E-Values of zero are not shown in this diagram as the logarithm for zero is undefined. BLAST found 37 sequences, FASTA 25, PSI-BLAST (3 it.) 19-21 and HHsearch found 5 sequences with an E-Value of zero. The results (round 3/5) of the different PSI-BLAST runs show similar E-Value distributions which cover "better" values than the results of BLAST, FASTA and HHSearch. Results of the latter can be compared to the values of the first rounds of the four PSI-BLAST runs. PSI-BLAST uses sequences found in the previous round to construct a score model for the next round and is therefore more sensitive to distantly related proteins. HHSearch returns sequences with the worst E-Values, which indicates, that the pdb database does not provide many related sequences to our reference sequence. This observation is in agreement with the results of the identity distrubution.
Comparison to HSSP
The sequences obtained were compared to the 582 search results of the term "glucocerebrosidase" in the HSSP database <ref>http://mrs.cmbi.ru.nl/mrs-5/entry?db=hssp&id=3keh&rq=glucocerebrosidase</ref>, which provides information of homology-derived secondary structure of proteins. As the nonredundant database (nr) consists of several databases (GenBank, RefSeq, PDB, SwissProt, PIR, PRF) the identifiers of the sequences had to be converted into Uniprot AC numbers (as used in the HSSP database)<ref>http://pir.georgetown.edu/pirwww/search/idmapping.shtml</ref>.
Apart from HHSearch the different search tools almost aquire the same precision and the same recall. BLAST and FASTA return slighlty better results then the different PSI-BLAST runs.
|BLAST||PSI-BLAST 3 0.005||PSI-BLAST 3 10e-6||PSI-BLAST 5 0.005||PSI-BLAST 5 10e-6||FASTA||HHSearch|
Multiple sequence alignments
Sequences used for multiple sequence alignments
For the multiple sequence alignments the reference sequence and twenty other sequences (found with the sequence searches) were used. It was tried to avoid hypothetical and potential proteins and to take sequences that have similiar identities in all sequence searches. For the multiple sequence alignments sequences from a variety of organisms like human, orang-utan, mouse, bacteria and worms were used.
The following tables show the chosen sequences with their identities in the different database searches. Only one pdb structure with less than a 100% sequence identity was found and could be used for the multiple sequence alignment.
Our reference sequence: P04062, GLCM_HUMAN Glucosylceramidase
|99 - 90% sequence identity|
|NP_001127488.1||glucosylceramidase precursor||Pongo abelii||95.0, 98.0, 98.0, 98.0, 98.0, 98.1|
|3KE0||A Chain A, Crystal Structure Of N370s Glucocerebrosidase At Acidic Ph.||97.0, 99.0, 99.0, 99.0, 99.0, 99.8|
|EAW53100.1||glucosidase, beta; acid (includes glucosylceramidase), isoform CRA_a||Homo sapiens||97.0, 99.0, 99.0, 99.0, 99.0, 99.6|
|NP_001165283.1||glucosylceramidase isoform 3 precursor||Homo sapiens||88.0, 90.0, 90.0, 90.0, 90.0, 90.9|
|NP_001128784.1||DKFZP469B0323 protein||Pongo abelii||95.0, 97.0, 97.0, 97.0, 97.0, 97.4|
|89 - 60% sequence identity|
|NP_032120.1||glucosylceramidase isoform 1||Mus musculus||84.0, 86.0, 86.0, 86.0, 86.0, 86.4|
|EDL15229.1||glucosidase, beta, acid, isoform CRA_a||Mus musculus||84.0, 86.0, 86.0, 86.0, 86.0, 86.3|
|NP_001121111.1||glucosidase, beta, acid||Rattus norvegicus||85.0, 87.0, 87.0, 87.0, 87.0, 87.6|
|NP_001039886.1||glucosylceramidase precursor||Bos taurus||86.0, 89.0, 89.0, 89.0, 89.0, 89.2|
|NP_001005730.1||glucosylceramidase precursor||Sus scrofa||87.0, 89.0, 89.0, 89.0, 89.0, 89.6|
|59 - 40% sequence identity|
|EFN73638.1||Glucosylceramidase||Camponotus floridanus||41.0, 40.0, 40.0, 41.0, 40.0, 42.2|
|CAG11843.1||unnamed protein product||Tetraodon nigroviridis||52.0, 53.0, 53.0, 53.0, 53.0, 54.2|
|NP_500785.1||hypothetical protein Y4C6B.6||Caenorhabditis elegans||41.0, 40.0, 39.0, 40.0, 39.0, 41.9|
|EFA07058.1||hypothetical protein TcasGA2_TC010035||Tribolium castaneum||41.0, 42.0, 41.0, 42.0, 41.0, 43.2|
|EFO26573.1||O-glycosyl hydrolase family 30 protein||Loa loa||40.0, 40.0, 40.0, 40.0, 40.0, 41.7|
|39 - 20% sequence identity|
|ZP_07040024.1||glucosylceramidase||Bacteroides sp. 3_1_23||26.0, 24.0, 24.0, 24.0, 24.0, 25.5|
|YP_244236.1||glycosyl hydrolase||Xanthomonas campestris pv. campestris str. 8004||33.0, 31.0, 30.0, 31.0, 31.0, 33.4|
|ZP_01885435.1||glycosyl hydrolase||Pedobacter sp. BAL39||36.0, 33.0, 32.0, 33.0, 33.0, 37.2|
|ZP_07388379.1||Glucan endo-1,6-beta-glucosidase||Paenibacillus curdlanolyticus YK9||28.0, 24.0, 23.0, 24.0, 24.0, 30.1|
|NP_623885.1||O-glycosyl hydrolase family protein||Thermoanaerobacter tengcongensis MB4||37.0, 34.0, 33.0, 34.0, 32.0, 37.5|
Tools and Results
Several different multiple alignment tools have been applied in this analysis. The different tools are described in the section below.
COBALT, which stands for constriant-based alignment tool, does progressive multiple alignment of protein sequences with the help of a collection of pairwise constraints derived from conserved domain database, protein motif database and local sequence similarity. The tool was first presented in 2007 by Papadopoulos et al. <ref>Papadopoulos JS and Agarwala R (2007) COBALT: constraint-based alignment tool for multiple protein sequences, Bioinformatics 23:1073-79.</ref>
- As Cobalt was not yet installed, it was downloaded from the NCBI Server<ref>ftp://ftp.ncbi.nlm.nih.gov/pub/cobalt/executables/2.0.1/</ref>.
time /home/student/Desktop/ncbi-cobalt-2.0.1/cobalt -i multiple_alignment.fasta -norps T > cobalt_multiple_alignment.aln
ClustalW is a multiple sequence alignment method which introduces individual weights for divergent protein sequences and varies the amino acid substitution matrices at different alignment stages according to the divergence of the sequences. The tool was first introduced by Thompson et al. in 1994. <ref>Thompson, Higgins, Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.1004. Nucleic Acids Res. 22: 4673-4680 </ref>
MUSCLE, standing for MUltiple Sequence Comparison by Log- Expectation, is a program for creating multiple alignments of protein sequences. The method is based on distance estimation using kmer counting, progressive alignments with a log-expectation score and refinements using tree-dependent restricted partitioning. <ref>Edgar R.C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 2004, Vol.32, No. 5</ref>
time muscle -in multiple_alignment.fasta -out muscle_multiple_alignment.aln
T-Coffee is a progressive multiple sequence alignment tool that pre-processes a data set of all pair-wise alignments between the sequences which provides information for the progressive alignment. The intermediate alignments are therefore based not only on the sequences aligned to them, but also on how all of the sequences align with each other. <ref>Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000 Sep 8;302(1):205-17.</ref>
time t_coffee multiple_alignment.fasta
3D-Coffee is a specialised version of T-Coffee which additionally uses structural information. The original sequences are replaced by homologous structures and are aligned using structural aligners. This process results in a structure based multiple sequence alignment. Expresso is an extension of 3D-Coffee which automatically identifies structural templates with BLAST. <ref>http://www.tcoffee.org/Projects_home_page/expresso_home_page.html</ref>
time t_coffee -seq multiple_alignment.fasta -mode expresso -pdb_type dn
Looking at the alignments in Jalview one can see, that after about 100-120 residues of the glucocerebrosidase sequence the conservation gets very high. The first part does not seem to be very important for the function of the protein and therefore mutated gradually in the different organisms. The conserved regions which follow may be necessary for the catalytic function of the protein. If there are mutations, these are substitutions to amino acids with the same properties, for example arginine to lysine, so that the function of the protein can be maintained.
Functionally important residues
Glutamic acid residues 235 (274) and 340 (379) play key roles in the active site<ref>http://www.ncbi.nlm.nih.gov/books/NBK1269/</ref> as proton donor and nucleophile<ref>http://www.uniprot.org/uniprot/P04062</ref>. The different alignments show a 100% conservation of these positions:
The conservation of the amino acid modifications was analyzed as well:
|Tool||Glycolisation 58||Glycolisation 98||Glycolisation 185||Glycolisation 309||Glycolisation 501||Disulfide bond 43-55||Disulfide bond 57-62|
The conservation of the amino acid modifications is not as high as the conservation of the active site residues. Only the residue at position 501 shows a 100% conservation. Resiudes, which are not part of the active site, are more likely to mutate, as they are not necessarily needed for the function of the protein. A mutation which does not influence the catalytic activity but might change the local conformation of the protein therefore can get fixed.
Gaps in the reference structure
Most of the gaps are due to sequence insertions. The largest example is ZP_07388379.1 which has an additional 300 amino acids at the end. Another one is CAG11843.1 with two times 30 amino acid insertions. The remaining gaps are from the alignment itself.
The 3D-Coffee alignment has the most gaps. The tool tried to align the structure and therefore there are more gaps because amino acids are only aligned, when the structure fits, too. In the Jalview alignment there are a lot of very "gappy" parts, as one can see in the picture.
Gaps in secondary structure elements
The secondary structure information was obtained from Uniprot<ref>http://www.uniprot.org/uniprot/P04062</ref> (cf. Figure 10). Turn position 150-153 and the beta sheet position 373-381 are of interest, as all of the tools but 3D-Coffee show the same number of gaps. The end of the protein position 503-533 is also interesting, because the different alignments show different gaps. These structures are marked in the picture, which was made with Pymol.
Muscle, Cobalt, ClustalW and T-Coffee only show a few exceptions with few gaps, that they have on their own, whereas 3D-Coffee has many gaps of its own. The turn and the beta sheet mentioned before have no gaps in the alignment with 3D-Coffee but in all other alignments. The results of 3D-Coffee are very different to the alignments of the other tools. Therefore including the 3D-structure seems to have a great influence on the alignment.
Interestingly 3D-Coffee splits some secondary structure elements with many gaps. For example the turn from position 514 to 516 with 40 gaps and the beta strand from position 527 to 533 with even 160 gaps. In the following you can see the alignment:
The reason for the high number of gaps in the 3D-Coffee alignment are the Valine/Glycine and the Tryptophan columns. No other alignment shows this strange split. But we do not know why this occurs. Looking at the PDB structure does not help. 3D-Coffee seems to recognise some secondary structure elements in the large insertions that fit better to Valine/Glycine and Tryptophan.