Difference between revisions of "Glucocerebrosidase sequence alignments"

From Bioinformatikpedia

Revision as of 13:31, 22 May 2011

Sequence searches

Several tools were used to search the non-redundant (nr) sequence database for sequences related to glucocerebrosidase.

  • FASTA
As FASTA was not initially installed, it was downloaded from the EBI FTP download site <ref>ftp://ftp.ebi.ac.uk/pub/software/unix/fasta/fasta36/</ref>.
Command:
../bin/fasta36 gbaseq.fasta /data/blast/nr/nr > fasta_gba_search.out
  • BLAST
Command:
blastall -p blastp -d /data/blast/nr/nr -i gbaseq.fasta -o blast.out
  • PSI-BLAST
PSI-BLAST was run four times, once for each combination of 3 or 5 iterations (x) and an E-value cut-off (y) of 0.005 or 10e-6.
Command:
blastpgp -d /data/blast/nr/nr -i gbaseq.fasta -o psi_blast_x_y.out -j x -h y
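
The four PSI-BLAST runs described above can be enumerated programmatically. The following is only a minimal sketch: the function name `psiblast_commands` is hypothetical, and the script merely builds the command lines from the template shown above without running them.

```python
from itertools import product

ITERATIONS = (3, 5)            # values of x given in the text
E_VALUES = ("0.005", "10e-6")  # values of y given in the text


def psiblast_commands(query="gbaseq.fasta", db="/data/blast/nr/nr"):
    """Return one blastpgp command line per (x, y) combination."""
    commands = []
    for x, y in product(ITERATIONS, E_VALUES):
        out = f"psi_blast_{x}_{y}.out"  # same naming scheme as the template
        commands.append(f"blastpgp -d {db} -i {query} -o {out} -j {x} -h {y}")
    return commands


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for command in psiblast_commands():
        print(command)
```

The printed lines could be pasted into a shell or passed to `subprocess.run`, provided `blastpgp` and the database are available.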


Furthermore, the online version of HHSearch <ref>http://toolkit.lmb.uni-muenchen.de/hhpred</ref> was used to search against the pdb70 database of May 14th.

Results

The sequence search with FASTA returned 520 sequences. BLAST and the PSI-BLAST runs each returned 500 sequences. The HHSearch search against the pdb70 database returned only 100 sequences.

Overlap

Overlap of the results of the different sequence searches. (At most 5 different lists could be captured in one diagram).

The overlaps between the results of the different tools (FASTA, BLAST, 4 PSI-BLAST runs) were investigated and visualized using Venn diagrams (created with <ref>http://bioinformatics.psb.ugent.be/webtools/Venn/</ref>). The results of HHSearch are not included because a different database (pdb70) was used, so the results cannot be compared.
In total, the 6 sequence searches returned 626 unique sequences, of which 405 were returned by every search and 41 were found by only a single one. No sequences were found exclusively by the PSI-BLAST runs with 3 iterations, whereas FASTA returned 25 sequences that were not found by any other tool. These numbers indicate that the different sequence searches return very similar sets of sequences and that any of the tools could be used to retrieve sequences related to glucocerebrosidase.
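
The kind of overlap statistics reported above (unique sequences, sequences found by every search, sequences found by only one search) can be computed with plain set operations. This is a sketch under assumptions: each result list is taken to be already reduced to a set of sequence identifiers, the function names are hypothetical, and the identifiers in the usage example are toy data rather than the actual hits.

```python
from collections import Counter


def overlap_summary(results):
    """Summarize overlaps; `results` maps a search name to a set of IDs."""
    counts = Counter()
    for ids in results.values():
        counts.update(set(ids))  # each search counts an ID at most once
    n_searches = len(results)
    return {
        "unique": len(counts),
        "in_all": sum(1 for c in counts.values() if c == n_searches),
        "in_one": sum(1 for c in counts.values() if c == 1),
    }


def exclusive_to(results, name):
    """Return the IDs found by `name` and by no other search."""
    others = set().union(*(ids for key, ids in results.items() if key != name))
    return set(results[name]) - others


if __name__ == "__main__":
    toy = {
        "FASTA": {"a", "b", "c", "d"},
        "BLAST": {"a", "b", "e"},
        "PSI-BLAST": {"a", "b", "c"},
    }
    print(overlap_summary(toy))
    print(exclusive_to(toy, "FASTA"))
```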


Sequence Identity

As BLAST only shows details for the 250 "best" alignments, the sequence identity could only be analyzed for the first 250 sequences of the BLAST and PSI-BLAST results.

Identity distribution of the different sequence searches. PSI-BLAST columns are labeled "iterations / E-value cut-off".

Identity   BLAST   PSI-BLAST   PSI-BLAST   PSI-BLAST   PSI-BLAST   FASTA   HHSearch
                   3 / 0.005   3 / 10e-6   5 / 0.005   5 / 10e-6
0-9%       0       0           0           0           0           0       8
10-19%     0       0           0           0           0           0       89
20-29%     43      85          85          96          94          240     2
30-39%     127     98          98          90          91          160     0
40-49%     28      26          26          24          25          44      0
50-59%     6       4           4           3           3           6       0
60-69%     3       0           0           0           0           3       0
70-79%     1       1           1           1           1           6       0
80-89%     15      9           9           9           9           27      0
90-100%    27      27          27          27          27          34      1
Total      250     250         250         250         250         520     100
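
A distribution like the one in the table can be derived by sorting percent identities into 10-point bins. The sketch below is an illustration only: it assumes the identities have already been extracted from the search output as numbers, that values are assigned to bins by rounding down, and that 100% belongs to the last bin. The function names are hypothetical.

```python
def bin_label(identity):
    """Map a percent identity (0 to 100) to its 10-point bin label."""
    low = min(int(identity // 10) * 10, 90)  # 100% falls into the last bin
    high = 100 if low == 90 else low + 9
    return f"{low}-{high}%"


def identity_histogram(identities):
    """Count how many identities fall into each bin, keeping empty bins."""
    histogram = {bin_label(low): 0 for low in range(0, 100, 10)}
    for value in identities:
        histogram[bin_label(value)] += 1
    return histogram


if __name__ == "__main__":
    # Toy input, not the actual search results.
    for label, count in identity_histogram([95.0, 84.0, 41.0, 26.0, 100]).items():
        print(label, count)
```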




Discussion

Multiple sequence alignments

Sequences used for multiple sequence alignments

For the multiple sequence alignments we used our reference sequence and twenty sequences found with the sequence searches. We tried to avoid hypothetical sequences and to choose sequences that have similar identities in all sequence searches. The following tables show the chosen sequences with their identities in the different searches. We only found one PDB structure (which was also in the HSSP database).

Our reference sequence: P04062, GLCM_HUMAN Glucosylceramidase

99 - 90% sequence identity

NP_001127488.1 glucosylceramidase precursor Pongo abelii 95.0, 98.0, 98.0, 98.0, 98.0, 98.1
3KE0 A Chain A, Crystal Structure Of N370s Glucocerebrosidase At Acidic Ph. 97.0, 99.0, 99.0, 99.0, 99.0, 99.8
EAW53100.1 glucosidase, beta; acid (includes glucosylceramidase), isoform CRA_a Homo sapiens 97.0, 99.0, 99.0, 99.0, 99.0, 99.6
NP_001165283.1 glucosylceramidase isoform 3 precursor Homo sapiens 88.0, 90.0, 90.0, 90.0, 90.0, 90.9
NP_001128784.1 DKFZP469B0323 protein Pongo abelii 95.0, 97.0, 97.0, 97.0, 97.0, 97.4

89 - 60% sequence identity

NP_032120.1 glucosylceramidase isoform 1 Mus musculus 84.0, 86.0, 86.0, 86.0, 86.0, 86.4
EDL15229.1 glucosidase, beta, acid, isoform CRA_a Mus musculus 84.0, 86.0, 86.0, 86.0, 86.0, 86.3
NP_001121111.1 glucosidase, beta, acid Rattus norvegicus 85.0, 87.0, 87.0, 87.0, 87.0, 87.6
NP_001039886.1 glucosylceramidase precursor Bos taurus 86.0, 89.0, 89.0, 89.0, 89.0, 89.2
NP_001005730.1 glucosylceramidase precursor Sus scrofa 87.0, 89.0, 89.0, 89.0, 89.0, 89.6

59 - 40% sequence identity

EFN73638.1 Glucosylceramidase Camponotus floridanus 41.0, 40.0, 40.0, 41.0, 40.0, 42.2
CAG11843.1 unnamed protein product Tetraodon nigroviridis 52.0, 53.0, 53.0, 53.0, 53.0, 54.2
NP_500785.1 hypothetical protein Y4C6B.6 Caenorhabditis elegans 41.0, 40.0, 39.0, 40.0, 39.0, 41.9
EFA07058.1 hypothetical protein TcasGA2_TC010035 Tribolium castaneum 41.0, 42.0, 41.0, 42.0, 41.0, 43.2
EFO26573.1 O-glycosyl hydrolase family 30 protein Loa loa 40.0, 40.0, 40.0, 40.0, 40.0, 41.7

39 - 20% sequence identity

ZP_07040024.1 glucosylceramidase Bacteroides sp. 3_1_23 26.0, 24.0, 24.0, 24.0, 24.0, 25.5
YP_244236.1 glycosyl hydrolase Xanthomonas campestris pv. campestris str. 8004 33.0, 31.0, 30.0, 31.0, 31.0, 33.4
ZP_01885435.1 glycosyl hydrolase Pedobacter sp. BAL39 36.0, 33.0, 32.0, 33.0, 33.0, 37.2
ZP_07388379.1 Glucan endo-1,6-beta-glucosidase Paenibacillus curdlanolyticus YK9 28.0, 24.0, 23.0, 24.0, 24.0, 30.1
NP_623885.1 O-glycosyl hydrolase family protein Thermoanaerobacter tengcongensis MB4 37.0, 34.0, 33.0, 34.0, 32.0, 37.5

Tools used

  • Cobalt

Cobalt was not yet installed, so we downloaded it from ftp://ftp.ncbi.nlm.nih.gov/pub/cobalt/executables/2.0.1/.

command
time /home/student/Desktop/ncbi-cobalt-2.0.1/cobalt -i multiple_alignment.fasta -norps T > cobalt_multiple_alignment.aln

time
real 0m3.488s
user 0m2.320s
sys 0m0.180s
Multiple Alignment by Cobalt in Jalview
  • ClustalW

command
time clustalw

time
real 0m40.625s
user 0m5.320s
sys 0m0.070s
Multiple Alignment by ClustalW in Jalview
  • Muscle

command
time muscle -in multiple_alignment.fasta -out muscle_multiple_alignment.aln

time
real 0m3.018s
user 0m1.710s
sys 0m0.100s
Multiple Alignment by Muscle with Jalview
  • T-Coffee

command
time t_coffee multiple_alignment.fasta

time
real 0m41.360s
user 0m34.000s
sys 0m0.920s
Multiple Alignment by T-Coffee with Jalview
  • 3D-Coffee/Expresso

command
time t_coffee -seq multiple_alignment.fasta -mode expresso -pdb_type dn

time
real 12m19.825s
user 5m17.140s
sys 0m46.970s
Multiple Alignment by 3D-Coffee/Expresso in Jalview
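
To compare the running times, the `real` values reported by `time` can be converted to seconds and sorted. This is a small sketch using the figures listed above; the helper names `to_seconds` and `ranked` are hypothetical.

```python
import re

# Matches the "XmY.YYYs" format printed by the shell's `time` builtin.
TIME_PATTERN = re.compile(r"^(?:(\d+)m)?(\d+(?:\.\d+)?)s$")

# `real` times copied from the runs listed above.
REAL_TIMES = {
    "Cobalt": "0m3.488s",
    "ClustalW": "0m40.625s",
    "Muscle": "0m3.018s",
    "T-Coffee": "0m41.360s",
    "3D-Coffee/Expresso": "12m19.825s",
}


def to_seconds(text):
    """Convert a string such as '12m19.825s' to seconds."""
    match = TIME_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def ranked(times):
    """Return the tool names ordered from fastest to slowest."""
    return sorted(times, key=lambda tool: to_seconds(times[tool]))


if __name__ == "__main__":
    for tool in ranked(REAL_TIMES):
        print(f"{tool}: {to_seconds(REAL_TIMES[tool]):.3f} s")
```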

Results

Gaps in the reference structure

               Cobalt   ClustalW   Muscle   T-Coffee   3D-Coffee/Expresso
# Gaps         413      404        442      441        546
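
Gap counts like these can be obtained by counting gap characters in the aligned reference row. The sketch below makes two assumptions that the source does not confirm: the alignment is available in aligned FASTA format, and `-` is the gap character. The function names are hypothetical and the sequences in the usage example are toy data.

```python
def read_aligned_fasta(lines):
    """Parse aligned FASTA lines into {identifier: aligned sequence}."""
    alignment, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # identifier up to the first space
            alignment[name] = ""
        elif name is not None:
            alignment[name] += line  # sequences may span several lines
    return alignment


def count_gaps(aligned_sequence, gap="-"):
    """Count the gap characters in one aligned sequence."""
    return aligned_sequence.count(gap)


if __name__ == "__main__":
    toy = [">ref toy reference", "MK--LV", "A-", ">other", "MKAALVAA"]
    alignment = read_aligned_fasta(toy)
    print(count_gaps(alignment["ref"]))
```

For a real file, the lines could be read with `open(path)` and the reference row selected by its identifier.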

Conserved Columns

Conservation   Cobalt   ClustalW   Muscle   T-Coffee   3D-Coffee/Expresso
>50%           133      133        131      125        123
>60%           98       98         96       104        90
>70%           64       64         62       67         58
>80%           53       54         54       50         57
>90%           52       51         51       52         49
100%           25       23         26       26         25
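
One possible way to count conserved columns is sketched below. The source does not state how conservation was measured, so the definition used here is an assumption: the conservation of a column is the fraction of all sequences that carry its most frequent residue, with gaps never counted as a residue. The function names are hypothetical.

```python
from collections import Counter


def column_conservation(rows, gap="-"):
    """Return, per column, the fraction of rows sharing the top residue."""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*rows):
        residues = Counter(ch for ch in column if ch != gap)
        top = residues.most_common(1)[0][1] if residues else 0
        scores.append(top / len(rows))  # gaps lower the score
    return scores


def count_conserved(rows, threshold):
    """Count columns above `threshold`; a threshold of 1.0 means exactly 100%."""
    scores = column_conservation(rows)
    if threshold >= 1.0:
        return sum(1 for score in scores if score == 1.0)
    return sum(1 for score in scores if score > threshold)


if __name__ == "__main__":
    toy = ["ACDE", "ACDF", "AC-G", "AGDH"]  # toy alignment, not real data
    for threshold in (0.5, 0.8, 1.0):
        print(threshold, count_conserved(toy, threshold))
```

Because the thresholds are cumulative, a column counted at a higher threshold is also counted at every lower one, which matches the decreasing counts in the table.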

Gaps in secondary structure elements

References

<references />