Difference between revisions of "Glucocerebrosidase sequence alignments"
Revision as of 06:25, 23 May 2011
Sequence searches
Several different tools were used in order to look for sequences that are related to glucocerebrosidase in the non-redundant sequence database.
- FASTA
- As FASTA was not installed initially, it was downloaded from the EBI FTP download site <ref>ftp://ftp.ebi.ac.uk/pub/software/unix/fasta/fasta36/</ref>.
- Command:
../bin/fasta36 gbaseq.fasta /data/blast/nr/nr > fasta_gba_search.out
- BLAST
- Command:
blastall -p blastp -d /data/blast/nr/nr -i gbaseq.fasta -o blast.out
- PSI-BLAST
- PSI-BLAST was run four times, covering every combination of 3 or 5 iterations (x) and an E-value cut-off (y) of 0.005 or 10e-6.
- Command:
blastpgp -d /data/blast/nr/nr -i gbaseq.fasta -o psi_blast_x_y.out -j x -h y
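The four PSI-BLAST runs differ only in the values of x and y, so the command lines can be generated rather than typed by hand. The following is a minimal sketch, assuming that x and y were substituted literally into the output file name; the function name `psi_blast_commands` is hypothetical and not part of the original workflow.

```python
from itertools import product

# Parameter values taken from the text: iterations (x) and E-value cut-off (y).
ITERATIONS = ["3", "5"]
E_VALUE_CUTOFFS = ["0.005", "10e-6"]


def psi_blast_commands(query="gbaseq.fasta", database="/data/blast/nr/nr"):
    """Build one blastpgp command line per (x, y) combination.

    Assumption: the output file is named psi_blast_<x>_<y>.out.
    """
    commands = []
    for x, y in product(ITERATIONS, E_VALUE_CUTOFFS):
        commands.append(
            f"blastpgp -d {database} -i {query} "
            f"-o psi_blast_{x}_{y}.out -j {x} -h {y}"
        )
    return commands


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run in a shell.
    for command in psi_blast_commands():
        print(command)
```

The sketch only prints the commands; running them is left to the shell so that each search can be started and monitored separately.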
Furthermore, the online version of HHSearch <ref>http://toolkit.lmb.uni-muenchen.de/hhpred</ref> was used to search the pdb70 database of May 14th.
Results
The sequence search with FASTA returned 520 sequences. BLAST and each of the PSI-BLAST runs returned 500 sequences. The search with HHSearch against the pdb70 database returned only 100 sequences.
Overlap
The overlaps between the results of the different tools (FASTA, BLAST, 4 PSI-BLAST runs) were investigated and visualized using Venn diagrams (created with <ref>http://bioinformatics.psb.ugent.be/webtools/Venn/</ref>). The results of HHSearch are not included because a different database (pdb70) was searched, so the results cannot be compared.
In total, the 6 sequence searches returned 626 unique sequences, of which 405 were returned by every search and 41 were found by only a single one. No sequences were found exclusively by the PSI-BLAST runs with 3 iterations, whereas FASTA returned 25 sequences that were not found by any other tool. These numbers indicate that the different searches return very similar sets of sequences and that any of the tools could be used to retrieve sequences related to glucocerebrosidase.
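The overlap figures can also be derived directly from the hit lists without a Venn diagram tool. The sketch below assumes that the identifiers of each search have already been extracted into one set per search; the identifiers in the example are toy values, and the function name `overlap_summary` is hypothetical.

```python
from collections import Counter


def overlap_summary(hits_per_search):
    """Summarize how the hit sets of several searches overlap.

    hits_per_search maps a search name to an iterable of sequence identifiers.
    Returns the number of unique identifiers, the number found by every
    search, and the number found by exactly one search.
    """
    occurrences = Counter()
    for identifiers in hits_per_search.values():
        # set() guards against an identifier listed twice by the same search.
        occurrences.update(set(identifiers))
    n_searches = len(hits_per_search)
    return {
        "unique": len(occurrences),
        "in_all": sum(1 for c in occurrences.values() if c == n_searches),
        "in_one": sum(1 for c in occurrences.values() if c == 1),
    }


if __name__ == "__main__":
    # Toy identifiers, not taken from the real search output.
    toy_hits = {
        "FASTA": {"seqA", "seqB", "seqC"},
        "BLAST": {"seqA", "seqB"},
        "PSI-BLAST": {"seqA", "seqD"},
    }
    print(overlap_summary(toy_hits))
```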
Sequence Identity
As BLAST only shows details for the 250 "best" alignments, the sequence identity could only be analyzed for the first 250 sequences of the BLAST and PSI-BLAST results. The table below shows the number of sequences that fall into each identity interval. These numbers are also visualized in the diagram on the right. Most of the sequences found by BLAST, PSI-BLAST and FASTA lie in the range of 20 to 39 percent, whereas HHSearch peaks in the range of 0 to 19 percent. As HHSearch was applied to the pdb70 database, only sequences with a known 3D structure could be found. The fact that HHSearch found no sequence with an identity between 30 and 99 percent indicates that there are no known structures of closely related proteins of glucocerebrosidase. The search returned one sequence with 100% identity: 2NT0 <ref>http://www.pdb.org/pdb/explore/explore.do?structureId=2NT0</ref>, the structure of glucocerebrosidase with a pharmacological chaperone.
Identity | BLAST | PSI-BLAST 3 iterations e-value cutoff 0.005 | PSI-BLAST 3 iterations e-value cutoff 10e-6 | PSI-BLAST 5 iterations e-value cutoff 0.005 | PSI-BLAST 5 iterations e-value cutoff 10e-6 | FASTA | HHSearch |
---|---|---|---|---|---|---|---|
0-9 % | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
10-19 % | 0 | 0 | 0 | 0 | 0 | 0 | 89 |
20-29 % | 43 | 85 | 85 | 96 | 94 | 240 | 2 |
30-39 % | 127 | 98 | 98 | 90 | 91 | 160 | 0 |
40-49 % | 28 | 26 | 26 | 24 | 25 | 44 | 0 |
50-59 % | 6 | 4 | 4 | 3 | 3 | 6 | 0 |
60-69 % | 3 | 0 | 0 | 0 | 0 | 3 | 0 |
70-79 % | 1 | 1 | 1 | 1 | 1 | 6 | 0 |
80-89 % | 15 | 9 | 9 | 9 | 9 | 27 | 0 |
90-100 % | 27 | 27 | 27 | 27 | 27 | 34 | 1 |
Total | 250 | 250 | 250 | 250 | 250 | 520 | 100 |
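A table like the one above can be produced by sorting the identity values of each search into 10-percent intervals. The following sketch assumes that the identities are available as plain numbers in percent and that fractional values are truncated (so 29.6 counts as 20-29 %); how the original counts treated such values is not stated. The name `bin_identities` is hypothetical.

```python
# Interval labels matching the rows of the table: 0-9 %, ..., 80-89 %, 90-100 %.
BIN_LABELS = [f"{low}-{low + 9} %" for low in range(0, 90, 10)] + ["90-100 %"]


def bin_identities(identities):
    """Count how many identity values (in percent) fall into each interval."""
    counts = {label: 0 for label in BIN_LABELS}
    for value in identities:
        if not 0 <= value <= 100:
            raise ValueError(f"identity out of range: {value}")
        # Assumption: truncate to an integer; 100 is folded into the last bin.
        index = min(int(value) // 10, 9)
        counts[BIN_LABELS[index]] += 1
    return counts


if __name__ == "__main__":
    # A few identity values as they appear in the sequence tables of this page.
    example = [98.1, 99.8, 86.4, 42.2, 25.5]
    for label, count in bin_identities(example).items():
        print(f"{label:>9} | {count}")
```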
Discussion
Multiple sequence alignments
Sequences used for multiple sequence alignments
For the multiple sequence alignments we used our reference sequence and twenty sequences found in the sequence searches. We tried to avoid hypothetical sequences and to choose sequences that have similar identities in all sequence searches. The selection covers a variety of organisms, including human, orang-utan, mouse, bacteria and worms.
The following tables show the chosen sequences with their identities (in percent) in the different searches. We only found one PDB structure.
Our reference sequence: P04062, GLCM_HUMAN Glucosylceramidase
99 - 90% sequence identity | |||
NP_001127488.1 | glucosylceramidase precursor | Pongo abelii | 95.0, 98.0, 98.0, 98.0, 98.0, 98.1 |
3KE0, chain A | Chain A, Crystal Structure Of N370s Glucocerebrosidase At Acidic Ph. | | 97.0, 99.0, 99.0, 99.0, 99.0, 99.8 |
EAW53100.1 | glucosidase, beta; acid (includes glucosylceramidase), isoform CRA_a | Homo sapiens | 97.0, 99.0, 99.0, 99.0, 99.0, 99.6 |
NP_001165283.1 | glucosylceramidase isoform 3 precursor | Homo sapiens | 88.0, 90.0, 90.0, 90.0, 90.0, 90.9 |
NP_001128784.1 | DKFZP469B0323 protein | Pongo abelii | 95.0, 97.0, 97.0, 97.0, 97.0, 97.4 |
89 - 60% sequence identity | |||
NP_032120.1 | glucosylceramidase isoform 1 | Mus musculus | 84.0, 86.0, 86.0, 86.0, 86.0, 86.4 |
EDL15229.1 | glucosidase, beta, acid, isoform CRA_a | Mus musculus | 84.0, 86.0, 86.0, 86.0, 86.0, 86.3 |
NP_001121111.1 | glucosidase, beta, acid | Rattus norvegicus | 85.0, 87.0, 87.0, 87.0, 87.0, 87.6 |
NP_001039886.1 | glucosylceramidase precursor | Bos taurus | 86.0, 89.0, 89.0, 89.0, 89.0, 89.2 |
NP_001005730.1 | glucosylceramidase precursor | Sus scrofa | 87.0, 89.0, 89.0, 89.0, 89.0, 89.6 |
59 - 40% sequence identity | |||
EFN73638.1 | Glucosylceramidase | Camponotus floridanus | 41.0, 40.0, 40.0, 41.0, 40.0, 42.2 |
CAG11843.1 | unnamed protein product | Tetraodon nigroviridis | 52.0, 53.0, 53.0, 53.0, 53.0, 54.2 |
NP_500785.1 | hypothetical protein Y4C6B.6 | Caenorhabditis elegans | 41.0, 40.0, 39.0, 40.0, 39.0, 41.9 |
EFA07058.1 | hypothetical protein TcasGA2_TC010035 | Tribolium castaneum | 41.0, 42.0, 41.0, 42.0, 41.0, 43.2 |
EFO26573.1 | O-glycosyl hydrolase family 30 protein | Loa loa | 40.0, 40.0, 40.0, 40.0, 40.0, 41.7 |
39 - 20% sequence identity | |||
ZP_07040024.1 | glucosylceramidase | Bacteroides sp. 3_1_23 | 26.0, 24.0, 24.0, 24.0, 24.0, 25.5 |
YP_244236.1 | glycosyl hydrolase | Xanthomonas campestris pv. campestris str. 8004 | 33.0, 31.0, 30.0, 31.0, 31.0, 33.4 |
ZP_01885435.1 | glycosyl hydrolase | Pedobacter sp. BAL39 | 36.0, 33.0, 32.0, 33.0, 33.0, 37.2 |
ZP_07388379.1 | Glucan endo-1,6-beta-glucosidase | Paenibacillus curdlanolyticus YK9 | 28.0, 24.0, 23.0, 24.0, 24.0, 30.1 |
NP_623885.1 | O-glycosyl hydrolase family protein | Thermoanaerobacter tengcongensis MB4 | 37.0, 34.0, 33.0, 34.0, 32.0, 37.5 |
Tools used
- Cobalt
Cobalt was not yet installed, so we downloaded it from the NCBI Server<ref>ftp://ftp.ncbi.nlm.nih.gov/pub/cobalt/executables/2.0.1/</ref>.
command
time /home/student/Desktop/ncbi-cobalt-2.0.1/cobalt -i multiple_alignment.fasta -norps T > cobalt_multiple_alignment.aln
time | |
real | 0m3.488s |
user | 0m2.320s |
sys | 0m0.180s |
- ClustalW
command
time clustalw
time | |
real | 0m40.625s |
user | 0m5.320s |
sys | 0m0.070s |
- Muscle
command
time muscle -in multiple_alignment.fasta -out muscle_multiple_alignment.aln
time | |
real | 0m3.018s |
user | 0m1.710s |
sys | 0m0.100s |
- T-Coffee
command
time t_coffee multiple_alignment.fasta
time | |
real | 0m41.360s |
user | 0m34.000s |
sys | 0m0.920s |
- 3D-Coffee/Expresso
command
time t_coffee -seq multiple_alignment.fasta -mode expresso -pdb_type dn
time | |
real | 12m19.825s |
user | 5m17.140s |
sys | 0m46.970s |
The following pictures show cut-outs of the different alignments with Jalview.
Results
Conserved Columns
Conservation | Cobalt | ClustalW | Muscle | T-Coffee | 3D-Coffee/Expresso |
---|---|---|---|---|---|
>50% | 133 | 133 | 131 | 125 | 123 |
>60% | 98 | 98 | 96 | 104 | 90 |
>70% | 64 | 64 | 62 | 67 | 58 |
>80% | 53 | 54 | 54 | 50 | 57 |
>90% | 52 | 51 | 51 | 52 | 49 |
100% | 25 | 23 | 26 | 26 | 25 |
In the Jalview alignments, conservation becomes very high after about the first 100-120 residues of glucocerebrosidase. The first part does not seem to be very important for the function of the protein and has therefore diverged gradually between the different organisms. The conserved regions that follow may be necessary for the catalytic function of the protein. Where substitutions occur there, the residues are replaced by amino acids with similar properties, for example arginine and lysine, so the function is maintained.
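Counts of conserved columns can be reproduced from an alignment with a few lines of code. The sketch below uses one possible definition, which is an assumption and not necessarily the one used for the table: the conservation of a column is the fraction of all sequences that carry its most common residue, with gap characters not counted as residues. The toy alignment and the function names are hypothetical.

```python
from collections import Counter


def column_conservation(alignment):
    """Return, per column, the fraction of sequences sharing the top residue.

    alignment is a list of equally long aligned sequences; '-' marks a gap.
    """
    n_sequences = len(alignment)
    fractions = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        fractions.append(top / n_sequences)
    return fractions


def count_conserved(alignment, threshold):
    """Number of columns whose conservation is strictly above threshold."""
    return sum(1 for f in column_conservation(alignment) if f > threshold)


def count_fully_conserved(alignment):
    """Number of columns in which every sequence has the same residue."""
    return sum(
        1 for column in zip(*alignment)
        if "-" not in column and len(set(column)) == 1
    )


if __name__ == "__main__":
    toy_alignment = ["ACD-", "ACE-", "ATDK"]  # toy data
    print(count_conserved(toy_alignment, 0.5))
    print(count_fully_conserved(toy_alignment))
```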
Functionally important residues
Glutamic acid residues 235 and 340 play key roles in the active site<ref>http://www.ncbi.nlm.nih.gov/books/NBK1269/</ref>. The table lists, for each alignment, in how many of the 21 sequences the residue is conserved at the two positions. The alignments show almost the same conservation:
T-Coffee | 15/21, 17/21 |
3D-Coffee | 15/21, 16/21 |
Muscle | 15/21, 17/21 |
Cobalt | 15/21, 17/21 |
ClustalW | 15/21, 17/21 |
In the other sequences asparagine and threonine are aligned to these residues; both are also polar amino acids.
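Checking a single residue such as position 235 or 340 requires translating the position in the ungapped reference sequence into an alignment column first. The following sketch shows one way to do this; it assumes 1-based residue numbering along the reference as it appears in the alignment, and the toy alignment and function names are hypothetical.

```python
def alignment_column(aligned_reference, position):
    """Map a 1-based residue position of the ungapped reference sequence
    to the 0-based column of the aligned (gapped) reference."""
    seen = 0
    for column, char in enumerate(aligned_reference):
        if char != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError(f"reference has fewer than {position} residues")


def conservation_at(alignment, reference_index, position):
    """Count the sequences that carry the reference residue at `position`.

    Returns (matching sequences, total sequences); the reference itself
    is included in both numbers.
    """
    reference = alignment[reference_index]
    column = alignment_column(reference, position)
    matches = sum(1 for seq in alignment if seq[column] == reference[column])
    return matches, len(alignment)


if __name__ == "__main__":
    # Toy alignment; the first row plays the role of the reference sequence.
    toy_alignment = ["M-EKE", "MAEKE", "M-QKE", "MANKT"]
    matches, total = conservation_at(toy_alignment, 0, 2)
    print(f"{matches}/{total}")
```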
Gaps in the reference structure
Metric | Cobalt | ClustalW | Muscle | T-Coffee | 3D-Coffee/Expresso |
---|---|---|---|---|---|
# Gaps | 413 | 404 | 442 | 441 | 546 |
Most of the gaps are caused by sequences with insertions. The largest example is ZP_07388379.1, which has an additional 300 amino acids at the end. Another is CAG11843.1, with two insertions of 30 amino acids each. The remaining gaps come from the alignment itself.
The 3D-Coffee alignment has the most gaps. Because it tries to align the structures, amino acids are only aligned when the structure also fits, which leads to more gaps. In the Jalview view of this alignment there are many very "gappy" regions, as can be seen in the picture.
Gaps in secondary structure elements
Secondary structure | Position | Cobalt | ClustalW | Muscle | T-Coffee | 3D-Coffee/Expresso |
---|---|---|---|---|---|---|
Beta strand | 49-52 | 0 | 0 | 0 | 0 | 2 |
Beta strand | 54-60 | 0 | 0 | 0 | 0 | 0 |
Beta strand | 75-82 | 5 | 0 | 0 | 0 | 0 |
Beta strand | 88-94 | 0 | 0 | 0 | 1 | 9 |
Beta strand | 96-98 | 0 | 0 | 0 | 4 | 18 |
Beta strand | 103-116 | 4 | 0 | 23 | 0 | 6 |
Beta strand | 119-123 | 0 | 0 | 0 | 0 | 0 |
Helix | 126-132 | 0 | 0 | 0 | 0 | 0 |
Helix | 137-148 | 0 | 0 | 0 | 0 | 19 |
Turn | 150-153 | 28 | 28 | 28 | 28 | 1 |
Beta strand | 157-163 | 0 | 0 | 0 | 0 | 0 |
Beta strand | 166-170 | 0 | 0 | 0 | 0 | 0 |
Beta strand | 177-179 | 0 | 0 | 0 | 0 | 0 |
Helix | 190-193 | 0 | 0 | 0 | 0 | 0 |
Helix | 196-206 | 0 | 0 | 0 | 0 | 0 |
Beta strand | 212-218 | 0 | 0 | 0 | 0 | 2 |
Helix | 222-224 | 0 | 0 | 0 | 0 | 0 |
Beta strand | 229-233 | 7 | 0 | 0 | 7 | 7 |
Beta strand | 235-238 | 0 | 0 | 0 | 0 | 0 |
Helix | 243-261 | 0 | 0 | 0 | 0 | 1 |
Beta strand | 267-271 | 0 | 0 | 0 | 0 | 0 |
Helix | 277-279 | 0 | 0 | 0 | 0 | 0 |
Helix | 292-301 | 0 | 0 | 0 | 0 | 0 |
Helix | 302-308 | 0 | 0 | 0 | 0 | 0 |
Turn | 311-314 | 0 | 0 | 1 | 1 | 2 |
Beta strand | 315-323 | 0 | 0 | 0 | 0 | 0 |
Helix | 324-326 | 0 | 0 | 0 | 0 | 0 |
Helix | 329-335 | 0 | 0 | 0 | 0 | 0 |
Helix | 338-341 | 0 | 0 | 0 | 0 | 0 |
Beta strand | 346-352 | 0 | 0 | 0 | 0 | 0 |
Helix | 359-369 | 0 | 0 | 0 | 2 | 2 |
Beta strand | 373-381 | 13 | 13 | 13 | 13 | 0 |
Helix | 396-411 | 0 | 0 | 8 | 6 | 13 |
Beta strand | 414-422 | 8 | 0 | 0 | 0 | 0 |
Beta strand | 440-444 | 0 | 0 | 0 | 0 | 1 |
Helix | 445-447 | 0 | 0 | 1 | 0 | 1 |
Beta strand | 449-452 | 0 | 0 | 3 | 0 | 0 |
Helix | 454-463 | 0 | 0 | 0 | 0 | 1 |
Beta strand | 471-479 | 1 | 0 | 2 | 2 | 5 |
Beta strand | 482-489 | 0 | 1 | 0 | 0 | 0 |
Beta strand | 495-501 | 0 | 0 | 0 | 0 | 0 |
Beta strand | 503-505 | 0 | 29 | 0 | 0 | 0 |
Beta strand | 507-513 | 0 | 0 | 10 | 30 | 1 |
Turn | 514-516 | 0 | 0 | 0 | 0 | 40 |
Beta strand | 517-523 | 0 | 0 | 0 | 1 | 12 |
Beta strand | 527-533 | 4 | 0 | 0 | 0 | 160 |
The secondary structure information is from UniProt<ref>http://www.uniprot.org/uniprot/P04062</ref>.
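Gap counts per secondary structure element can be obtained from the aligned reference row alone. The sketch below assumes one possible definition: the gaps of an element are the gap characters in the reference row between the alignment columns of its first and last residue. Whether the numbers in the table were counted exactly this way is not stated. The function names and the toy sequence are hypothetical.

```python
def residue_columns(aligned_reference):
    """Alignment columns (0-based) that hold a residue of the reference."""
    return [i for i, char in enumerate(aligned_reference) if char != "-"]


def total_gaps(aligned_reference):
    """Number of gap characters in the aligned reference sequence."""
    return aligned_reference.count("-")


def gaps_in_element(aligned_reference, start, end):
    """Gap characters inside the element spanning residues start..end.

    start and end are 1-based positions in the ungapped reference sequence.
    """
    columns = residue_columns(aligned_reference)
    if not 1 <= start <= end <= len(columns):
        raise ValueError("element lies outside the reference sequence")
    first, last = columns[start - 1], columns[end - 1]
    return aligned_reference[first:last + 1].count("-")


if __name__ == "__main__":
    toy_reference = "MK--LV-AG"  # toy aligned reference row
    print(total_gaps(toy_reference))
    print(gaps_in_element(toy_reference, 2, 4))
```

Gaps that lie directly before the first or after the last residue of an element are not attributed to it under this definition.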
References
<references />