Sequence Alignment GLA
Sequence Searches
GLA sequence was searched in the PDB non-redundant(nr) database using three different tools, that are listed below. An additional search has been made applying HHsearch on the pdb70 database(max.70% sequence identity).
Blast
We used the NCBI Blast Version 2.2.18 with the command:
blast -i sequence.fasta -d database -p blastp
Fasta
As no fasta program was installed, we downloaded fasta36 and installed it to the virtual machine. The command to run the program is:
fasta36 sequence.fasta database
PSI-Blast
We used the PSI-Blast version 2.2.18 with the command:
blastpgp -i sequence.fasta -d database -j iterations -h e-value
Parameter
We used the following combinations of parameter to run the program:
- 3 iterations and e-value threshold of 0.005
- 3 iterations and e-value threshold of 0.002
- 3 iterations and e-value threshold of 10e-6
- 5 iterations and e-value threshold of 0.005
- 5 iterations and e-value threshold of 0.002
- 5 iterations and e-value threshold of 10e-6
HHsearch
We used the online tool from Gene Center of the LMU Munich with the default parameters:
- Database = PDB70
- Max. number of PSI-BLAST iterations = 3
- Alignment mode = local
The results are still available: Results.
Overlap
Figure 1 shows the overlap between the results of Blast, Fasta, and PSI-Blast with five iterations and e-values of 10e-6 and 0.005. The most significant values in this diagram are the overlap between all four groups, the overlap between the both PSI-Blast groups and the number of Fasta-only sequences. Figure 2 shows the overlap between the results of PSI-Blast with three/five iterations and e-values of 10e-6 and 0.005. Nearly all sequences have been found by all four groups, so the variation of iteration numbers has not much impact in case of our dataset. Figure 3 shows the overlap between the results of PSI-Blast with five iterations and e-values of 10e-6, 0.002 and 0.005. Variation of e-value threshold has obviously not much influence on the results, the number of sequences that do not appear in different groups is very low. The figures were created using an online resource.
E-value Distribution
Figure 4 shows the e-value distributions of the programs.
Since all blast based programs had a huge number of e-values of the value 0, it was impossible to plot the logarithm of the distribution correctly. The certain values have been set to -500 to provide any plot at all. The logarithm function was necessary, because some outliers were so widely spread, that there was no visible distribution(Plot). As seen before, the results of the PSI-Blast runs are almost equal, but the ones with five iterations produce slightly better e-values. The Blast results are a little worse and the Fasta runs e-values are even worse. As they are on a completely different level, the HHsearch values are not comparable. |
Identity Distribution
Figure 5 shows the identity distributions of the programs. As not all programs delivered the same number of sequences, the values are normalized to 100. Despite some differences, the majority distribution of the identities is similar, except for HHsearch. The other ones does all have peaks between 30-45% identity. |
Runtime Analysis
The runtime of each program was measured by using the command time
as a prefix in the commandline.
Program | Runtime |
---|---|
Blast | 2:40 min |
Fasta | 5:16 min |
PSI-Blast: 3 Iterations: E-value cutoff 10e-6 | 7:50 min |
PSI-Blast: 3 Iterations: E-value cutoff 0.002 | 7:48 min |
PSI-Blast: 3 Iterations: E-value cutoff 0.005 | 7:55 min |
PSI-Blast: 5 Iterations: E-value cutoff 10e-6 | 13:27 min |
PSI-Blast: 5 Iterations: E-value cutoff 0.002 | 13:06 min |
PSI-Blast: 5 Iterations: E-value cutoff 0.005 | 12:49 min |
HSSP
We used a HSSP online resource to find proteins related to alpha galactosidase and choose results, that are found in homo sapiens. The resulting files contained the ID's that were found by HSSP, which were used to classify the hits of the searches (Fasta/Blast/PSI-Blast) as true positives (TP). Afterwards we were able to calculate the sensitivity of the search tools in respect to our query.
Program | Sensitivity |
---|---|
Blast | 20.9% |
Fasta | 31.6% |
PSI-Blast: 3 Iterations: E-value cutoff 10e-6 | 24.0% |
PSI-Blast: 3 Iterations: E-value cutoff 0.002 | 24.1% |
PSI-Blast: 3 Iterations: E-value cutoff 0.005 | 24.2% |
PSI-Blast: 5 Iterations: E-value cutoff 10e-6 | 24.0% |
PSI-Blast: 5 Iterations: E-value cutoff 0.002 | 24.5% |
PSI-Blast: 5 Iterations: E-value cutoff 0.005 | 24.5% |
The sensitivity varies between 20% and 32%. It is remarkable that the different parameters of PSI-Blast do not have any noticeable impact on the result.<br\> Overall these results are not satisfactory, since these sensitivities are very low. We do not think that this is a good benchmark for the search tools and assume that these values are underestimates because of two reasons. First, it could be the case that the result of the HSSP research is not an exhaustive collection of true positives. Second, we only evaluated one query and hence there is a huge bias. Therefore we encourage you to check out some of the results of the other groups:
Hint: To map the IDs from different databases we used the Uniprot ID Mapping.
Multiple Sequence Alignments
Selection of Sequences
We selected twenty sequences of the Fasta/Blast/PSI-Blast search results which fullfilled the following criteria:
- about 400-450 amino acids long
- true positive according to the research with HSSP
Unfortunately we were not able to find a sequence of a PDB structure with an identity between 89%-60%. The selected sequences are listed in the following table. We also added our reference sequence to the multiple sequence alignment.
GenBank Identifier | Source | Description | Organism | Identity |
---|---|---|---|---|
99%-90% Sequence Identity | ||||
295789486 | PDB | alpha-galactosidase A, chain A | Homo sapiens | 99% |
62896813 | GenBank | alpha-galactosidase | Homo sapiens | 99% |
269914455 | PDB | alpha-galactosidase | Homo sapiens | 99% |
297710567 | RefSeq | alpha-galactosidase A-like | Pongo abelii | 96% |
296235998 | RefSeq | alpha-galactosidase A | Callithrix jacchus | 95% |
89%-60% Sequence Identity | ||||
301788124 | RefSeq | alpha-galactosidase A-like | Ailuropoda melanoleuca | 83% |
133778924 | RefSeq | alpha-galactosidase A | Mus musculus | 78% |
114051916 | RefSeq | alpha-N-acetylgalactosaminidase | Bombyx mori | 76% |
291190554 | RefSeq | alpha-galactosidase A | Salmo salar | 67% |
148228315 | RefSeq | alpha-galactosidase | Xenopus laevis | 65% |
59%-40% Sequence Identity | ||||
20151048 | PDB | alpha-N-acetylgalactosaminidase, chain A | Gallus gallus | 57% |
261824882 | PDB | alpha-N-acetylgalactosaminidase, chain A | Homo sapiens | 54% |
148229665 | RefSeq | alpha-N-acetylgalactosaminidase | Xenopus laevis | 47% |
92096920 | GenBank | NAGA protein | Bos taurus | 46% |
260593558 | RefSeq | alpha-galactosidase | Prevotella veroralis F0319 | 41% |
39%-20% Sequence Identity | ||||
51701639 | Swiss-Prot | alpha-galactosidase precursor | Lachancea cidri | 38% |
74626383 | Swiss-Prot | alpha-galactosidase B precursor | Aspergillus niger | 35% |
299856763 | PDB | alpha-galactosidase, chain A | Saccharomyces cerevisiae | 34% |
310699603 | GenBank | alpha-D-galactopyranosidase | Fusarium oxysporum | 33% |
226293587 | Swiss-Prot | alpha-galactosidase precursor | Torulaspora delbrueckii | 31% |
Programs
Cobalt
We used NCBI Cobalt version 2.0.1 with the command:
cobalt -i sequences.fasta -norps T
ClustalW
We used ClustalW version 1.83 with the command:
clustalw -infile=sequences.fasta
Muscle
We used Muscle version 3.8.31 with the command:
muscle -in sequences.fasta -out muscle_msa.aln
T-Coffee
The basic command to start T-Coffee version 8.99 is:
t_coffee sequences.fasta
T-Coffee 3D
To start the 3D mode the additional parameters -mode expresso -pdb_type dn
were given as a suffix to the command.
Results
Conservation
We used the conservation index according to Livingstone C.D. and Barton G.J.<ref name=livingstone>Livingstone C.D. and Barton G.J. (1993), "Protein Sequence Alignments: A Strategy for the Hierarchical Analysis of Residue Conservation.", CABIOS Vol. 9 No. 6 (745-756)), PubMed</ref> to determine whether a residue is conserved or not. The conservation index was calculated by JalView and ranges from 0 to 11.
Program | >= 8 | >= 9 | >= 10 | >= 11 |
---|---|---|---|---|
Cobalt | 88 | 70 | 46 | 36 |
ClustalW | 87 | 70 | 44 | 36 |
Muscle | 87 | 69 | 46 | 36 |
T-Coffee | 90 | 72 | 47 | 37 |
T-Coffee 3D | 83 | 60 | 39 | 33 |
Quite obviously the number of conserved residues decreases with an increasing threshold of the conservation index. Overall all programs achieve a similar number of conserved resdiues. Only T-Coffee 3D has lower numbers for all thresholds. This could be due to the fact that T-Coffee 3D tries to incorporate structural information. But the higher the threshold of the conservation index, the smaller the difference. Hence it is very likely that the strongly conserved residues are also existing in the MSA of T-Coffee 3D.
Functionally Important Residues
We used the sequence annotation of UniProt to determine functionally important residues. These residues and their corresponding conservation index in the multiple sequence alignment are listed in the following table. Additionally we also calculated the average conservation index of all positions which are not a gap in the consensus sequence. We can use this value to evaluate whether a functionally important residue has the tendency to be conserved or not.
Type | Position | Cobalt | ClustalW | Muscle | T-Coffee | T-Coffee 3D |
---|---|---|---|---|---|---|
Average | - | 3.44 | 3.52 | 3.52 | 3.53 | 3.27 |
Active Site | 170 | 5 | 5 | 5 | 5 | 5 |
Active Site | 231 | 11 | 11 | 11 | 11 | 11 |
Glycosylation | 139 | 2 | 2 | 2 | 2 | 2 |
Glycosylation | 192 | 4 | 4 | 4 | 4 | 4 |
Glycosylation | 215 | 0 | 3 | 0 | 3 | 3 |
Glycosylation | 408 | 7 | 7 | 0 | 7 | 6 |
Disulfide bond | 52 / 94 | 3 / 11 | 3 / 11 | 3 / 11 | 3 / 11 | 3 / 11 |
Disulfide bond | 56 / 63 | 0 / 0 | 1 / 0 | 2 / 2 | 0 / 0 | 0 / 0 |
Disulfide bond | 142 / 172 | 8 / 11 | 9 / 11 | 8 / 11 | 8 / 11 | 7 / 11 |
Disulfide bond | 202 / 223 | 11 / 7 | 11 / 7 | 11 / 7 | 11 / 7 | 8 / 7 |
Disulfide bond | 378 / 382 | 4 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 3 / 1 |
Since two residues are part of a disulfide bond, the two values are separated by the "/".
First of all, we have a great variety in these results. The residue 231 that is part of the active site has a conservation index of 11 in the MSA of all programs, which is clearly a strong sign of conservation. In contrast, some of the residues which are glycosylated (e.g. residue 139) or establish a disulfide bond (e.g. residues 56 / 63) have conservation indices that are close to 0. Overall the glycosylated residues seem to have no tendency to be conservated in the MSA, since a vast majority of their values range between 0 and 4. The story is quite different for the residues that establish a disulfide bond. It is remarkable that there is a correlation between the conservation indices of a residue pair, e.g. one residue has a high conservation index when the other residue also has a high conservation index. This obversation makes sense in respect to the biological contact, because the disulfide bond needs both residues to be established. The residues 52 / 94 are the only exception to this.
Number of Gaps
We determined the number of gaps in the consensus sequence and also in the reference sequence.
Program | Length of MSA | Number of gaps (consensus) | Number of gaps (reference) |
---|---|---|---|
Cobalt | 586 | 69 | 188 |
ClustalW | 563 | 46 | 165 |
Muscle | 589 | 74 | 191 |
T-Coffee | 588 | 64 | 190 |
T-Coffee 3D | 685 | 163 | 287 |
The length of the MSA and the number of gaps does not differ much among the programs except for T-Coffee 3D. This could be due to the fact that T-Coffee 3D tries to incorporate structural information.
Gaps in Secondary Structure
Secondary Structure Element | Positions | Cobalt | ClustalW | Muscle | T-Coffee | T-Coffee 3D |
---|---|---|---|---|---|---|
Beta strand | 42 – 46 | 0 | 0 | 0 | 0 | 0 |
Helix | 47 – 50 | 0 | 0 | 0 | 0 | 0 |
Turn | 56 – 58 | 0 | 0 | 0 | 0 | 0 |
Turn | 60 – 62 | 0 | 0 | 0 | 0 | 0 |
Helix | 66 – 78 | 0 | 0 | 0 | 0 | 1 |
Helix | 81 – 84 | 0 | 0 | 0 | 0 | 0 |
Beta strand | 88 – 90 | 0 | 0 | 0 | 0 | 0 |
Turn | 110 – 112 | 0 | 0 | 0 | 0 | 0 |
Helix | 116 – 126 | 0 | 0 | 0 | 0 | 0 |
Beta strand | 130 – 140 | 0 | 1 | 0 | 0 | 2 |
Beta strand | 144 – 146 | 0 | 0 | 0 | 0 | 0 |
Turn | 149 – 151 | 0 | 0 | 0 | 0 | 3 |
Helix | 152 – 162 | 0 | 0 | 0 | 0 | 0 |
Beta strand | 166 – 170 | 0 | 0 | 0 | 0 | 0 |
Helix | 177 – 194 | 38 | 0 | 23 | 0 | 34 |
Beta strand | 199 – 202 | 1 | 1 | 1 | 1 | 0 |
Helix | 204 – 208 | 3 | 0 | 3 | 0 | 0 |
Turn | 209 – 211 | 0 | 0 | 0 | 0 | 4 |
Helix | 216 – 219 | 0 | 0 | 0 | 0 | 0 |
Turn | 220 – 222 | 0 | 0 | 0 | 0 | 0 |
Beta strand | 224 – 227 | 0 | 0 | 0 | 0 | 0 |
Helix | 236 – 247 | 0 | 18 | 29 | 22 | 36 |
Turn | 248 – 252 | 0 | 0 | 0 | 0 | 0 |
Helix | 253 – 256 | 0 | 0 | 0 | 0 | 0 |
Beta strand | 261 – 264 | 0 | 0 | 0 | 0 | 1 |
Beta strand | 272 – 274 | 0 | 0 | 0 | 0 | 0 |
Helix | 277 – 289 | 0 | 0 | 0 | 0 | 0 |
Beta strand | 294 – 296 | 0 | 0 | 0 | 0 | 0 |
Helix | 305 – 312 | 0 | 0 | 0 | 0 | 0 |
Helix | 314 – 320 | 0 | 0 | 0 | 0 | 0 |
Beta strand | 329 – 332 | 0 | 0 | 0 | 0 | 0 |
Beta strand | 334 – 343 | 11 | 0 | 0 | 0 | 2 |
Beta strand | 345 – 355 | 0 | 0 | 0 | 0 | 1 |
Beta strand | 359 – 361 | 0 | 0 | 0 | 0 | 1 |
Beta strand | 363 – 369 | 20 | 0 | 18 | 58 | 2 |
Turn | 370 – 372 | 0 | 0 | 0 | 0 | 3 |
Turn | 374 – 376 | 0 | 0 | 0 | 0 | 0 |
Beta strand | 379 – 383 | 15 | 0 | 5 | 0 | 6 |
Beta strand | 385 – 390 | 0 | 0 | 0 | 0 | 49 |
Beta strand | 396 – 400 | 0 | 0 | 0 | 2 | 1 |
Beta strand | 402 – 407 | 0 | 0 | 0 | 0 | 1 |
Beta strand | 412 – 419 | 0 | 0 | 1 | 1 | 14 |
Overall | - | 88 | 20 | 80 | 84 | 161 |
not finished yet...
References
<references />