Glucocerebrosidase sequence alignments

From Bioinformatikpedia
Revision as of 16:15, 23 May 2011 by Braunt (talk | contribs) (Sequence searches)

Still under Construction

Sequence searches

Several different tools were used in order to look for sequences that are related to glucocerebrosidase in the non-redundant sequence database.

  • FASTA
As Fasta was not initially installed, it was downloaded from the EBI FTP Download Site <ref>ftp://ftp.ebi.ac.uk/pub/software/unix/fasta/fasta36/</ref>.
Command:
../bin/fasta36 gbaseq.fasta /data/blast/nr/nr > fasta_gba_search.out
  • BLAST
Command:
blastall -p blastp -d /data/blast/nr/nr -i gbaseq.fasta -o blast.out
  • PSI-BLAST
This tool was used 4 times with all different combinations of 3 or 5 iterations (x) and an E-value cut-off (y) of 0.005 or 10e-6.
Command:
blastpgp -d /data/blast/nr/nr -i gbaseq.fasta -o psi_blast_x_y.out -j x -h y


Furthermore the online version of HHSearch <ref>http://toolkit.lmb.uni-muenchen.de/hhpred</ref> was used to search against the pdb70 database of May 14th.

Results

The sequence search with FASTA returned 520 sequences. BLAST, as well as the different PSI-BLAST runs returned 500 sequences. The search with HHSearch against the pdb database only resulted in 100 sequences.

Overlap

Overlap of the results of the different sequence searches. (At most 5 different lists could be captured in one diagram).

The overlaps between the results of the different tools (FASTA, BLAST, 4 PSI-BLAST runs) have been investigated and are visualized using Venn-Diagrams (created with <ref>http://bioinformatics.psb.ugent.be/webtools/Venn/</ref>). The results of HHSearch are not included as a different database (pdb70) was used and therefore the results can not be compared.
In total, the 6 different sequence searches returned 626 unique sequences whereof 405 sequences were returned by each of the searches and 41 were only found by a single one. There were no sequences which have only been found by PSI-BLAST runs with 3 iterations whereas FASTA returned 25 sequences that have not been found by any other tool. These numbers indicate that the different sequence searches return very similar sequences and that each of the tools could be used to retrieve the related sequences of glucocerebrosidase.


Sequence Identity

As BLAST only shows details for the 250 "best" alignments, the sequence identitiy could only be analyzed for the 250 first sequences of the BLAST and PSI-BLAST results. The table below shows the number of sequences that fall into a certain interval. These numbers are also visualized in the diagram on the right. Most of the sequences which BLAST, PSI-BLAST and FASTA found lie within the range of 20 to 39 percent whereas HHSearch has its peak in the area of 0 to 19 percent. As HHSearch was applied to the pdb database, only sequences with a known 3D-structure could be found. The fact that no sequence was found with a sequence identity in the interval of 30 to 99 percent indicates, that there are no known structures of closely related proteins of glucocerebrosidase. The search returned one sequence with a 100% identity: 2NT0 <ref>http://www.pdb.org/pdb/explore/explore.do?structureId=2NT0</ref>, the structure of glucocerebrosidase with a pharmacological chaperone.

Identity distribution of the different sequence searches.
Identity BLAST PSI-BLAST
3 iterations
e-value cutoff 0.005
PSI-BLAST
3 iterations
e-value cutoff 10e-6
PSI-BLAST
5 iterations
e-value cutoff 0.005
PSI-BLAST
5 iterations
e-value cutoff 10e-6
FASTA HHSearch
0-9 % 0 0 0 0 0 0 8
10-19% 0 0 0 0 0 0 89
20-29% 43 85 85 96 94 240 2
30-39 % 127 98 98 90 91 160 0
40-49 % 28 26 26 24 25 44 0
50-59 % 6 4 4 3 3 6 0
60-69 % 3 0 0 0 0 3 0
70-79 % 1 1 1 1 1 6 0
80-89 % 15 9 9 9 9 27 0
90-100 % 27 27 27 27 27 34 1
Total 250 250 250 250 250 520 100


E-Values

E-Value distribution of the different sequence searches.

The results (round 3/5) of the different PSI-BLAST runs show a similar E-Value distribution which covers "better" values than the results from BLAST, FASTA and HHSearch. Results of the latter can be compared to the values of the first rounds of the four PSI-BLAST runs. PSI-BLAST uses sequences found in the previous round to construct score model for the next round. PSI-BLAST is therefore more sensitive to distantly related proteins. HHSearch returns sequences with the worst E-Values, which indicates, that the pdb database does not provide many related sequences to our reference sequence. This observation is in agreement with the results of the identity distrubution.


Comparison to HSSP

BLAST PSI-BLAST
3 iterations
e-value cutoff 0.005
PSI-BLAST
3 iterations
e-value cutoff 10e-6
PSI-BLAST
5 iterations
e-value cutoff 0.005
PSI-BLAST
5 iterations
e-value cutoff 10e-6
FASTA HHSearch
0.54 0.47 0.46 0.47 0.46 0.54 0.01

Multiple sequence alignments

Sequences used for multiple sequence alignments

For the multiple sequence alignments we used our reference sequence and twenty sequences we had found with sequence searches. We tried to avoid hypothetical sequences and tried to take sequences, that have similiar identities in all sequence searches. For the multiple sequence alignments we have a variety of organisms with human, orang-utan, mouse and also bacteria and worms.
The following tables show the chosen sequences with their identities in the different searches. We only found one pdb structure.

our reference sequence: P04062, GLCM_HUMAN Glucosylceramidase

99 - 90% sequence identity
NP_001127488.1 glucosylceramidase precursor Pongo abelii 95.0, 98.0, 98.0, 98.0, 98.0, 98.1
3KE0 A Chain A, Crystal Structure Of N370s Glucocerebrosidase At Acidic Ph. 97.0, 99.0, 99.0, 99.0, 99.0, 99.8
EAW53100.1 glucosidase, beta; acid (includes glucosylceramidase), isoform CRA_a Homo sapiens 97.0, 99.0, 99.0, 99.0, 99.0, 99.6
NP_001165283.1 glucosylceramidase isoform 3 precursor Homo sapiens 88.0, 90.0, 90.0, 90.0, 90.0, 90.9
NP_001128784.1 DKFZP469B0323 protein Pongo abelii 95.0, 97.0, 97.0, 97.0, 97.0, 97.4
89 - 60% sequence identity
NP_032120.1 glucosylceramidase isoform 1 Mus musculus 84.0, 86.0, 86.0, 86.0, 86.0, 86.4
EDL15229.1 glucosidase, beta, acid, isoform CRA_a Mus musculus 84.0, 86.0, 86.0, 86.0, 86.0, 86.3
NP_001121111.1 glucosidase, beta, acid Rattus norvegicus 85.0, 87.0, 87.0, 87.0, 87.0, 87.6
NP_001039886.1 glucosylceramidase precursor Bos taurus 86.0, 89.0, 89.0, 89.0, 89.0, 89.2
NP_001005730.1 glucosylceramidase precursor Sus scrofa 87.0, 89.0, 89.0, 89.0, 89.0, 89.6
59 - 40% sequence identity
EFN73638.1 Glucosylceramidase Camponotus floridanus 41.0, 40.0, 40.0, 41.0, 40.0, 42.2
CAG11843.1 unnamed protein product Tetraodon nigroviridis 52.0, 53.0, 53.0, 53.0, 53.0, 54.2
NP_500785.1 hypothetical protein Y4C6B.6 Caenorhabditis elegans 41.0, 40.0, 39.0, 40.0, 39.0, 41.9
EFA07058.1 hypothetical protein TcasGA2_TC010035 Tribolium castaneum 41.0, 42.0, 41.0, 42.0, 41.0, 43.2
EFO26573.1 O-glycosyl hydrolase family 30 protein Loa loa 40.0, 40.0, 40.0, 40.0, 40.0, 41.7
39 - 20% sequence identity
ZP_07040024.1 glucosylceramidase Bacteroides sp. 3_1_23 26.0, 24.0, 24.0, 24.0, 24.0, 25.5
YP_244236.1 glycosyl hydrolase Xanthomonas campestris pv. campestris str. 8004 33.0, 31.0, 30.0, 31.0, 31.0, 33.4
ZP_01885435.1 glycosyl hydrolase Pedobacter sp. BAL39 36.0, 33.0, 32.0, 33.0, 33.0, 37.2
ZP_07388379.1 Glucan endo-1,6-beta-glucosidase Paenibacillus curdlanolyticus YK9 28.0, 24.0, 23.0, 24.0, 24.0, 30.1
NP_623885.1 O-glycosyl hydrolase family protein Thermoanaerobacter tengcongensis MB4 37.0, 34.0, 33.0, 34.0, 32.0, 37.5

Tools

  • Cobalt

Cobalt was not yet installed, so we downloaded it from the NCBI Server<ref>ftp://ftp.ncbi.nlm.nih.gov/pub/cobalt/executables/2.0.1/</ref>.

command
time /home/student/Desktop/ncbi-cobalt-2.0.1/cobalt -i multiple_alignment.fasta -norps T > cobalt_multiple_alignment.aln

time
real 0m3.488s
user 0m2.320s
sys 0m0.180s


  • ClustalW

command
time clustalw

time
real 0m40.625s
user 0m5.320s
sys 0m0.070s


  • Muscle

command
time muscle -in multiple_alignment.fasta -out muscle_multiple_alignment.aln

time
real 0m3.018s
user 0m1.710s
sys 0m0.100s


  • T-Coffee

command
time t_coffee multiple_alignment.fasta

time
real 0m41.360s
user 0m34.000s
sys 0m0.920s


  • 3D-Coffee/Expresso

command
time t_coffee -seq multiple_alignment.fasta -mode expresso -pdb_type dn

time
real 12m19.825s
user 5m17.140s
sys 0m46.970s

The following pictures show cut-outs of the different alignments with Jalview.

Multiple Alignment by Cobalt in Jalview
Multiple Alignment by ClustalW in Jalview
Multiple Alignment by Muscle with Jalview
Multiple Alignment by T-Coffee with Jalview
Multiple Alignment by 3D-Coffee/Expresso in Jalview

Results

Conserved Columns

Cobalt ClustalW Muscle T-Coffee 3D-Coffee/Expresso
>50% 133 133 131 125 123
>60% 98 98 96 104 90
>70% 64 64 62 67 58
>80% 53 54 54 50 57
>90% 52 51 51 52 49
100% 25 23 26 26 25

If you look at the alignments in Jalview you can see, that after about 100-120 residues of glucoceribrosidase the conservation gets very high. The first part does not seem to be very important for the function of the protein and therefore mutated gradually in the different organisms. The conserved regions which follow may be necessary for the catalytic function of the protein. If there are mutations, there are amino acids with the same properties, for example arginine and lysine, so the function is maintained.

Functionally important residues

Glutamine residues 235 and 340 play key roles in the active site<ref>http://www.ncbi.nlm.nih.gov/books/NBK1269/</ref>. The alignments show almost the same conservation:

T-Coffee 15/21, 17/21
3D-Coffee 15/21, 16/21
Muscle 15/21, 17/21
Cobalt 15/21, 17/21
ClustalW 15/21, 17/21

In the other sequences asparagine and threonine are aligned to the glutamines, which also are polar amino acids.

Gaps in the reference structure

Cobalt ClustalW Muscle T-Coffee 3D-Coffee/Expresso
# Gaps 413 404 442 441 546

The reason for most of the gaps are sequences with inserts. The largest example is ZP_07388379.1 which has additionally 300 amino acids at the end. Another one is CAG11843.1 with two times 30 amino acids. The remaining gaps are from the alignment itself.
The 3D-Coffee alignment has the most gaps. It tried to align the structure, therefore there are more gaps because amino acids are only aligned, when also the structure fits. In the Jalview alignment there are a lot of very "gappy" parts, as you can see in the picture.

Example of a very "gappy" part of the 3D-coffee alignment

Gaps in secondary structure elements

Glucocerebroside with pymol. The interesting regions are colored.
sec. structure position Cobalt ClustalW Muscle T-Coffee 3D-Coffee/Expresso
Beta strand 49-52 0 0 0 0 2
Beta strand 54-60 0 0 0 0 0
Beta strand 75-82 5 0 0 0 0
Beta strand 88-94 0 0 0 1 9
Beta strand 96-98 0 0 0 4 18
Beta strand 103-116 4 0 23 0 6
Beta strand 119-123 0 0 0 0 0
Helix 126-132 0 0 0 0 0
Helix 137-148 0 0 0 0 19
Turn 150-153 28 28 28 28 1
Beta strand 157-163 0 0 0 0 0
Beta strand 166-170 0 0 0 0 0
Beta strand 177-179 0 0 0 0 0
Helix 190-193 0 0 0 0 0
Helix 196-206 0 0 0 0 0
Beta strand 212-218 0 0 0 0 2
Helix 222-224 0 0 0 0 0
Beta strand 229-233 7 0 0 7 7
Beta strand 235-238 0 0 0 0 0
Helix 243-261 0 0 0 0 1
Beta strand 267-271 0 0 0 0 0
Helix 277-279 0 0 0 0 0
Helix 292-301 0 0 0 0 0
Helix 202-208 0 0 0 0 0
Turn 311-314 0 0 1 1 2
Beta strand 315-323 0 0 0 0 0
Helix 324-326 0 0 0 0 0
Helix 329-335 0 0 0 0 0
Helix 338-341 0 0 0 0 0
Beta strand 346-352 0 0 0 0 0
Helix 359-369 0 0 0 2 2
Beta strand 373-381 13 13 13 13 0
Helix 396-411 0 0 8 6 13
Beta strand 414-422 8 0 0 0 0
Beta strand 440-444 0 0 0 0 1
Helix 445-447 0 0 1 0 1
Beta strand 449-452 0 0 3 0 0
Helix 454-463 0 0 0 0 1
Beta strand 471-479 1 0 2 2 5
Beta strand 482-489 0 1 0 0 0
Beta strand 495-501 0 0 0 0 0
Beta strand 503-505 0 29 0 0 0
Beta strand 507-513 0 0 10 30 1
Turn 514-516 0 0 0 0 40
Beta strand 517-523 0 0 0 1 12
Beta strand 527-533 4 0 0 0 160

The secondary structure information is from Uniprot<ref>http://www.uniprot.org/uniprot/P04062</ref>. To compare the secondary structure the first picture can be used.
Interesting are the turn position 150-153 and the beta sheet position 373-381, where all but 3D-Coffee have the same number of gaps. The end of the protein position 503-533 is also interesting, because the different alignments show different gaps. These structures are marked in the picture, which is made with Pymol.
Muscle, Cobalt, ClustalW and T-Coffee show only a few exceptions with few gaps, that they have alone whereas 3D-Coffee has many gaps alone. The turn and the beta sheet mentioned before have no gaps in the alignment with 3D-Coffee but in all other alignments. The results of 3D-Coffee are very different to the alignments of the other tools. Therefore including the 3D-structure seems to have a great influence on the alignment.
The most important thing is, that the secondary structures with the functionally important glutamine residues, beta strand position 235-238 and helix position 338-341 have no gaps in all alignments. Hence the alignments keep the active site aligned, which means, they have a good quality.

References

<references />