^{by Benjamin Drexler and Fabian Grandke}

Sequence Searches

GLA sequence was searched in the PDB non-redundant(nr) database using three different tools, that are listed below. An additional search has been made applying HHsearch on the pdb70 database(max.70% sequence identity).

Blast

We used the NCBI Blast Version 2.2.18 with the command:

blast -i sequence.fasta -d database -p blastp

Fasta

As no fasta program was installed, we downloaded fasta36 and installed it to the virtual machine. The command to run the program is:

fasta36 sequence.fasta database

PSI-Blast

We used the PSI-Blast version 2.2.18 with the command:

blastpgp -i sequence.fasta -d database -j iterations -h e-value

Parameter

We used the following combinations of parameter to run the program:

3 iterations and e-value threshold of 0.005
3 iterations and e-value threshold of 0.002
3 iterations and e-value threshold of 10e-6
5 iterations and e-value threshold of 0.005
5 iterations and e-value threshold of 0.002
5 iterations and e-value threshold of 10e-6

HHsearch

We used the online tool from Gene Center of the LMU Munich with the default parameters:

Database = PDB70
Max. number of PSI-BLAST iterations = 3
Alignment mode = local

The results are still available: Results.

Overlap

Figure 1 shows the overlap between the results of Blast, Fasta, and PSI-Blast with five iterations and e-values of 10e-6 and 0.005. The most significant values in this diagram are the overlap between all four groups, the overlap between the both PSI-Blast groups and the number of Fasta-only sequences. Figure 2 shows the overlap between the results of PSI-Blast with three/five iterations and e-values of 10e-6 and 0.005. Nearly all sequences have been found by all four groups, so the variation of iteration numbers has not much impact in case of our dataset. Figure 3 shows the overlap between the results of PSI-Blast with five iterations and e-values of 10e-6, 0.002 and 0.005. Variation of e-value threshold has obviously not much influence on the results, the number of sequences that do not appear in different groups is very low. The figures were created using an online resource.

Figure 1: Overlap between Blast-, Fasta- and PSI-Blast results.

Figure 2: Overlap between PSI-Blast results with different iteration numbers.

Figure 3: Overlap between PSI-Blast results with different e-values.

E-value Distribution

Figure 4: Distribution of the e-values.

Figure 4 shows the e-value distributions of the programs.

Since all blast based programs had serveral hits with an e-values of 0, it was not possible to plot the logarithm of the distribution correctly. The certain values have been set to -500 to provide a plot at all. We used the logarithm function to provide a clearlier representation of the e-values. A plot of the original values is allocated on this page. As seen before, the results of the PSI-Blast runs are almost equal, but the ones with five iterations produce slightly better e-values. The Blast results are a little worse and the Fasta runs e-values are even worse. As they are on a completely different level, the HHsearch values are not comparable.

Identity Distribution

Figure 5: Distribution of the identities.

Figure 5 shows the identity distributions of the programs. As not all programs delivered the same number of sequences, the values are normalized to 100 hits. Despite some differences, the majority distribution of the identities is similar, except for HHsearch. All the other ones have peaks between 30-45% identity.

Especially the PSI-Blast results are almost equal. The peaks of Blast and Fasta have a huge overlap, as well. Only the results of HHsearch are outliers to the other distributions.

Runtime Analysis

The runtime of each program was measured by using the command time as a prefix in the commandline.

Program	Runtime
Blast	2:40 min
Fasta	5:16 min
PSI-Blast: 3 Iterations: E-value cutoff 10e-6	7:50 min
PSI-Blast: 3 Iterations: E-value cutoff 0.002	7:48 min
PSI-Blast: 3 Iterations: E-value cutoff 0.005	7:55 min
PSI-Blast: 5 Iterations: E-value cutoff 10e-6	13:27 min
PSI-Blast: 5 Iterations: E-value cutoff 0.002	13:06 min
PSI-Blast: 5 Iterations: E-value cutoff 0.005	12:49 min

HSSP

We used a HSSP online resource to find proteins related to alpha galactosidase and choose results, that are found in homo sapiens. The resulting files contained the ID's that were found by HSSP, which were used to classify the hits of the searches (Fasta/Blast/PSI-Blast) as true positives (TP). Afterwards we were able to calculate the sensitivity of the search tools in respect to our query.

Program	Sensitivity
Blast	20.9%
Fasta	31.6%
PSI-Blast: 3 Iterations: E-value cutoff 10e-6	24.0%
PSI-Blast: 3 Iterations: E-value cutoff 0.002	24.1%
PSI-Blast: 3 Iterations: E-value cutoff 0.005	24.2%
PSI-Blast: 5 Iterations: E-value cutoff 10e-6	24.0%
PSI-Blast: 5 Iterations: E-value cutoff 0.002	24.5%
PSI-Blast: 5 Iterations: E-value cutoff 0.005	24.5%

The sensitivity varies between 20% and 32%. It is remarkable that the different parameters of PSI-Blast do not have any noticeable impact on the result.<br\> Overall these results are not satisfactory, since these sensitivities are very low. We do not think that this is a good benchmark for the search tools and assume that these values are underestimates because of two reasons. First, it could be the case that the result of the HSSP research is not an exhaustive collection of true positives. Second, we only evaluated one query and hence there is a huge bias. Therefore we encourage you to check out some of the results of the other groups:

Hint: To map the IDs from different databases we used the Uniprot ID Mapping.

Multiple Sequence Alignments

Selection of Sequences

We selected twenty sequences of the Fasta/Blast/PSI-Blast search results which fullfilled the following criteria:

about 400-450 amino acids long
true positive according to the research with HSSP

Unfortunately we were not able to find a sequence of a PDB structure with an identity between 89%-60%. The selected sequences are listed in the following table. We also added our reference sequence to the multiple sequence alignment.

GenBank Identifier	Source	Description	Organism	Identity
99%-90% Sequence Identity
295789486	PDB	alpha-galactosidase A, chain A	Homo sapiens	99%
62896813	GenBank	alpha-galactosidase	Homo sapiens	99%
269914455	PDB	alpha-galactosidase	Homo sapiens	99%
297710567	RefSeq	alpha-galactosidase A-like	Pongo abelii	96%
296235998	RefSeq	alpha-galactosidase A	Callithrix jacchus	95%
89%-60% Sequence Identity
301788124	RefSeq	alpha-galactosidase A-like	Ailuropoda melanoleuca	83%
133778924	RefSeq	alpha-galactosidase A	Mus musculus	78%
114051916	RefSeq	alpha-N-acetylgalactosaminidase	Bombyx mori	76%
291190554	RefSeq	alpha-galactosidase A	Salmo salar	67%
148228315	RefSeq	alpha-galactosidase	Xenopus laevis	65%
59%-40% Sequence Identity
20151048	PDB	alpha-N-acetylgalactosaminidase, chain A	Gallus gallus	57%
261824882	PDB	alpha-N-acetylgalactosaminidase, chain A	Homo sapiens	54%
148229665	RefSeq	alpha-N-acetylgalactosaminidase	Xenopus laevis	47%
92096920	GenBank	NAGA protein	Bos taurus	46%
260593558	RefSeq	alpha-galactosidase	Prevotella veroralis F0319	41%
39%-20% Sequence Identity
51701639	Swiss-Prot	alpha-galactosidase precursor	Lachancea cidri	38%
74626383	Swiss-Prot	alpha-galactosidase B precursor	Aspergillus niger	35%
299856763	PDB	alpha-galactosidase, chain A	Saccharomyces cerevisiae	34%
310699603	GenBank	alpha-D-galactopyranosidase	Fusarium oxysporum	33%
226293587	Swiss-Prot	alpha-galactosidase precursor	Torulaspora delbrueckii	31%

Programs

Cobalt

We used NCBI Cobalt version 2.0.1 with the command:

cobalt -i sequences.fasta -norps T

Multiple sequence alignment of the 21 sequences by Cobalt in JalView.

ClustalW

We used ClustalW version 1.83 with the command:

clustalw -infile=sequences.fasta

Multiple sequence alignment of the 21 sequences by ClustalW in JalView.

Muscle

We used Muscle version 3.8.31 with the command:

muscle -in sequences.fasta -out muscle_msa.aln

Multiple sequence alignment of the 21 sequences by Muscle in JalView.

T-Coffee

The basic command to start T-Coffee version 8.99 is:

t_coffee sequences.fasta

Multiple sequence alignment of the 21 sequences by T-Coffee in JalView.

T-Coffee 3D

To start the 3D mode the additional parameters -mode expresso -pdb_type dn were given as a suffix to the command.

Multiple sequence alignment of the 21 sequences by T-Coffee 3D in JalView.

Results

Conservation

We used the conservation index according to Livingstone C.D. and Barton G.J.<ref name=livingstone>Livingstone C.D. and Barton G.J. (1993), "Protein Sequence Alignments: A Strategy for the Hierarchical Analysis of Residue Conservation.", CABIOS Vol. 9 No. 6 (745-756)), PubMed</ref> to determine whether a residue is conserved or not. The conservation index was calculated by JalView and ranges from 0 to 11.

Program	>= 8	>= 9	>= 10	>= 11
Cobalt	88	70	46	36
ClustalW	87	70	44	36
Muscle	87	69	46	36
T-Coffee	90	72	47	37
T-Coffee 3D	83	60	39	33

Quite obviously the number of conserved residues decreases with an increasing threshold of the conservation index. Overall all programs achieve a similar number of conserved resdiues. Only T-Coffee 3D has lower numbers for all thresholds. This could be due to the fact that T-Coffee 3D tries to incorporate structural information. But the higher the threshold of the conservation index, the smaller the difference. Hence it is very likely that the strongly conserved residues are also existing in the MSA of T-Coffee 3D.

Functionally Important Residues

We used the sequence annotation of UniProt to determine functionally important residues. These residues and their corresponding conservation index in the multiple sequence alignment are listed in the following table. Additionally we also calculated the average conservation index of all positions which are not a gap in the consensus sequence. We can use this value to evaluate whether a functionally important residue has the tendency to be conserved or not.

Type	Position	Cobalt	ClustalW	Muscle	T-Coffee	T-Coffee 3D
Average	-	3.44	3.52	3.52	3.53	3.27
Active Site	170	5	5	5	5	5
Active Site	231	11	11	11	11	11
Glycosylation	139	2	2	2	2	2
Glycosylation	192	4	4	4	4	4
Glycosylation	215	0	3	0	3	3
Glycosylation	408	7	7	0	7	6
Disulfide bond¹	52 / 94	3 / 11	3 / 11	3 / 11	3 / 11	3 / 11
Disulfide bond¹	56 / 63	0 / 0	1 / 0	2 / 2	0 / 0	0 / 0
Disulfide bond¹	142 / 172	8 / 11	9 / 11	8 / 11	8 / 11	7 / 11
Disulfide bond¹	202 / 223	11 / 7	11 / 7	11 / 7	11 / 7	8 / 7
Disulfide bond¹	378 / 382	4 / 0	0 / 0	0 / 0	0 / 0	3 / 1

¹ Since two residues are part of a disulfide bond, the two positions or values are separated by "/".

First of all, we have a great variety in these results. The residue 231 that is part of the active site has a conservation index of 11 in the MSA of all programs, which is clearly a strong sign of conservation. In contrast, some of the residues which are glycosylated (e.g. residue 139) or establish a disulfide bond (e.g. residues 56 / 63) have conservation indices that are close to 0. Overall the glycosylated residues seem to have no tendency to be conservated in the MSA, since a vast majority of their values range between 0 and 4. The story is quite different for the residues that establish a disulfide bond. It is remarkable that there is a correlation between the conservation indices of a residue pair, e.g. one residue has a high conservation index when the other residue also has a high conservation index. This obversation makes sense in respect to the biological contact, because the disulfide bond needs both residues to be established. The residues 52 / 94 are the only exception to this.

Number of Gaps

Program	Length of MSA	Number of gaps (consensus)	Number of gaps (reference)
Cobalt	586	69	188
ClustalW	563	46	165
Muscle	589	74	191
T-Coffee	588	64	190
T-Coffee 3D	685	163	287

We determined the number of gaps in the consensus sequence and also in the reference sequence. The length of the MSA and the number of gaps does not differ much among the programs except for T-Coffee 3D. This could be due to the fact that T-Coffee 3D tries to incorporate structural information.

Gaps in Secondary Structure

Secondary Structure Element	Positions	Cobalt	ClustalW	Muscle	T-Coffee	T-Coffee 3D
Beta strand	42 – 46	0	0	0	0	0
Helix	47 – 50	0	0	0	0	0
Turn	56 – 58	0	0	0	0	0
Turn	60 – 62	0	0	0	0	0
Helix	66 – 78	0	0	0	0	1
Helix	81 – 84	0	0	0	0	0
Beta strand	88 – 90	0	0	0	0	0
Turn	110 – 112	0	0	0	0	0
Helix	116 – 126	0	0	0	0	0
Beta strand	130 – 140	0	1	0	0	2
Beta strand	144 – 146	0	0	0	0	0
Turn	149 – 151	0	0	0	0	3
Helix	152 – 162	0	0	0	0	0
Beta strand	166 – 170	0	0	0	0	0
Helix	177 – 194	38	0	23	0	34
Beta strand	199 – 202	1	1	1	1	0
Helix	204 – 208	3	0	3	0	0
Turn	209 – 211	0	0	0	0	4
Helix	216 – 219	0	0	0	0	0
Turn	220 – 222	0	0	0	0	0
Beta strand	224 – 227	0	0	0	0	0
Helix	236 – 247	0	18	29	22	36
Turn	248 – 252	0	0	0	0	0
Helix	253 – 256	0	0	0	0	0
Beta strand	261 – 264	0	0	0	0	1
Beta strand	272 – 274	0	0	0	0	0
Helix	277 – 289	0	0	0	0	0
Beta strand	294 – 296	0	0	0	0	0
Helix	305 – 312	0	0	0	0	0
Helix	314 – 320	0	0	0	0	0
Beta strand	329 – 332	0	0	0	0	0
Beta strand	334 – 343	11	0	0	0	2
Beta strand	345 – 355	0	0	0	0	1
Beta strand	359 – 361	0	0	0	0	1
Beta strand	363 – 369	20	0	18	58	2
Turn	370 – 372	0	0	0	0	3
Turn	374 – 376	0	0	0	0	0
Beta strand	379 – 383	15	0	5	0	6
Beta strand	385 – 390	0	0	0	0	49
Beta strand	396 – 400	0	0	0	2	1
Beta strand	402 – 407	0	0	0	0	1
Beta strand	412 – 419	0	0	1	1	14
Overall number of gaps in a secondary structure	-	88	20	80	84	161
Overall number of gaps	-	188	165	191	190	287
Percentage of gaps¹	-	46.8%	10.5%	41.9%	44.2%	56.1%

¹ Percentage of gaps describes the ratio of number of gaps in a secondary structure to the overall number of gaps, e.g. (88 / 188 = 0.468 = 46.8% for Cobalt)

We used the secondary structure annotation of UniProt to determine the number of gaps in a secondary structure of our reference sequence. It is remarkable that ClustalW has a significant lower number of percentage of gaps. Hence we assume that ClustalW is better able to take secondary structure into consideration than the other programs.

References

Sequence Alignment GLA

Contents