Sequence Alignment GLA

From Bioinformatikpedia
Revision as of 02:28, 24 May 2011 by Drexler (talk | contribs) (Multiple Sequence Alignments)

Sequence Searches

The GLA sequence was searched against the PDB non-redundant (nr) database using the three tools listed below. An additional search was run with HHsearch against the pdb70 database (max. 70% sequence identity).

Blast

We used the NCBI Blast Version 2.2.18 with the command:

blastall -i sequence.fasta -d database -p blastp

Fasta

As no Fasta program was installed, we downloaded fasta36 and installed it on the virtual machine. The command to run the program is:

fasta36 sequence.fasta database

PSI-Blast

We used the PSI-Blast version 2.2.18 with the command:

blastpgp -i sequence.fasta -d database -j iterations -h e-value

Parameter

We ran the program with the following parameter combinations:

  • 3 iterations and e-value threshold of 0.005
  • 3 iterations and e-value threshold of 0.002
  • 3 iterations and e-value threshold of 10e-6
  • 5 iterations and e-value threshold of 0.005
  • 5 iterations and e-value threshold of 0.002
  • 5 iterations and e-value threshold of 10e-6
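The six runs listed above are the cross product of two iteration counts and three e-value thresholds. A minimal sketch of how such a set of command lines could be generated is given here; the function name psiblast_commands and the placeholder file names are assumptions for illustration, and the e-value strings are passed through exactly as written in the list.

```python
from itertools import product

# Values copied from the parameter list; "10e-6" is kept as written there.
ITERATIONS = [3, 5]
E_VALUES = ["0.005", "0.002", "10e-6"]


def psiblast_commands(query="sequence.fasta", database="database"):
    """Return one blastpgp argument list per (iterations, e-value) pair.

    The query and database names are placeholders, as in the command above.
    """
    commands = []
    for iterations, e_value in product(ITERATIONS, E_VALUES):
        commands.append([
            "blastpgp",
            "-i", query,
            "-d", database,
            "-j", str(iterations),
            "-h", e_value,
        ])
    return commands


if __name__ == "__main__":
    # Print the command lines instead of running them.
    for cmd in psiblast_commands():
        print(" ".join(cmd))
```

Each argument list could be handed to subprocess.run to execute the search; printing them first makes it easy to check the combinations before starting runs that take several minutes each.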

HHsearch

We used the online tool from the Gene Center of LMU Munich with the default parameters:

  • Database = PDB70
  • Max. number of PSI-BLAST iterations = 3
  • Alignment mode = local

The results are still available: Results.

HSSP

We used an HSSP online resource to find proteins related to alpha-galactosidase and chose the results found in Homo sapiens. The resulting files contained the IDs found by HSSP, which we used to classify the hits of the searches (Fasta/Blast/PSI-Blast) as true positives (TP). We were then able to calculate the sensitivity of each search tool with respect to our query.
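The exact calculation is not spelled out on this page. A minimal sketch, assuming the usual definition of sensitivity as TP / (TP + FN) with the HSSP IDs as the reference set, could look as follows. The function name and the toy IDs are hypothetical.

```python
def sensitivity(hit_ids, reference_ids):
    """Fraction of reference IDs recovered by a search: TP / (TP + FN).

    Assumption: the reference set (here, the HSSP IDs) defines all positives.
    """
    reference = set(reference_ids)
    if not reference:
        raise ValueError("reference set is empty")
    true_positives = set(hit_ids) & reference
    return len(true_positives) / len(reference)


if __name__ == "__main__":
    # Hypothetical toy IDs, not taken from the actual search results.
    reference = {"P1", "P2", "P3", "P4"}
    hits = ["P1", "P3", "X9"]
    print(f"{sensitivity(hits, reference):.1%}")  # prints 50.0%
```

Hits that are not in the reference set (X9 in the toy example) do not affect the sensitivity; they would only matter for measures such as precision.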

Program Sensitivity
Blast 20.9%
Fasta 31.6%
PSI-Blast: 3 Iterations: E-value cutoff 10e-6 24.0%
PSI-Blast: 3 Iterations: E-value cutoff 0.002 24.1%
PSI-Blast: 3 Iterations: E-value cutoff 0.005 24.2%
PSI-Blast: 5 Iterations: E-value cutoff 10e-6 24.0%
PSI-Blast: 5 Iterations: E-value cutoff 0.002 24.5%
PSI-Blast: 5 Iterations: E-value cutoff 0.005 24.5%

Overlap

Figure 1 shows the overlap between the results of Blast, Fasta, and PSI-Blast with five iterations and e-values of 10e-6 and 0.005. The most notable values in this diagram are the overlap between all four groups, the overlap between the two PSI-Blast groups, and the number of sequences found only by Fasta. Figure 2 shows the overlap between the results of PSI-Blast with three and five iterations and e-values of 10e-6 and 0.005. Nearly all sequences were found by all four groups, so varying the number of iterations has little impact on our dataset. Figure 3 shows the overlap between the results of PSI-Blast with five iterations and e-values of 10e-6, 0.002 and 0.005. Varying the e-value threshold also has little influence on the results: very few sequences are missing from any of the groups. The figures were created using an online resource.

Figure 1: Overlap between Blast-, Fasta- and PSI-Blast results.
Figure 2: Overlap between PSI-Blast results with different iteration numbers.
Figure 3: Overlap between PSI-Blast results with different e-values.
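The figures were produced with an online resource, so the following is only a sketch of how the region sizes of such an overlap diagram could be computed from the hit lists. The function name overlap_regions and the toy IDs are assumptions for illustration.

```python
def overlap_regions(groups):
    """Map each combination of group names to the IDs found by exactly those groups.

    groups: dict of group name -> iterable of sequence IDs.
    """
    id_sets = {name: set(ids) for name, ids in groups.items()}
    all_ids = set().union(*id_sets.values())
    regions = {}
    for seq_id in all_ids:
        # The region of an ID is the set of all groups that contain it.
        members = frozenset(name for name, ids in id_sets.items() if seq_id in ids)
        regions.setdefault(members, set()).add(seq_id)
    return regions


if __name__ == "__main__":
    # Hypothetical toy hit lists.
    toy = {
        "Blast": {"a", "b"},
        "Fasta": {"b", "c", "d"},
        "PSI-Blast": {"a", "b", "d"},
    }
    for members, ids in sorted(overlap_regions(toy).items(), key=lambda kv: sorted(kv[0])):
        print(", ".join(sorted(members)), "->", len(ids))
```

Because every ID is assigned to exactly one region, the region sizes add up to the number of distinct sequences, which is a useful sanity check against the diagram.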

E-value Distribution

Figure 4: Distribution of the e-values.
Figure 4 shows the e-value distributions of the programs.

Since all Blast-based programs reported a large number of e-values equal to 0, whose logarithm is undefined, the logarithm of the distribution could not be plotted directly. These values were therefore set to -500 so that they could be plotted at all. The logarithmic scale was necessary because some outliers were so widely spread that no distribution was visible otherwise (Plot).
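A minimal sketch of this transformation is given below. The base of the logarithm is not stated on this page, so base 10 is an assumption, as is the function name; the placeholder of -500 is the value reported in the text.

```python
import math

ZERO_PLACEHOLDER = -500.0  # value the text reports using for e-values of 0


def log_e_values(e_values, placeholder=ZERO_PLACEHOLDER):
    """Return log10 of each e-value, substituting a placeholder for zeros.

    Assumption: base-10 logarithm; the page does not state the base.
    """
    return [math.log10(e) if e > 0 else placeholder for e in e_values]


if __name__ == "__main__":
    # Hypothetical e-values, including one reported as exactly 0.
    print(log_e_values([0.0, 1.0, 1e-5]))
```

The placeholder only needs to be smaller than any real logarithm in the data so that the zero values form a separate, clearly identifiable bar in the plot.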

Identity Distribution

Figure 5: Distribution of the identities.
Figure 5 shows the identity distributions of the programs. As not all programs returned the same number of sequences, the values are normalized to 100. Despite some differences, the identity distributions are broadly similar, except for HHsearch. All other programs show peaks between 30% and 45% identity.
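One way to read "normalized to 100" is that each bin holds the percentage of a program's hits, so that the bins of every program add up to 100. A sketch under that assumption follows; the function name and the bin width of 5 percentage points are hypothetical choices, not taken from the figure.

```python
def identity_histogram(identities, bin_width=5):
    """Return {bin_start: percent_of_hits} for identities given in percent (0-100).

    Assumption: normalization means the bin values of one program sum to 100.
    """
    if not identities:
        return {}
    counts = {}
    for identity in identities:
        # Put an identity of exactly 100% into the last bin.
        start = min(int(identity // bin_width) * bin_width, 100 - bin_width)
        counts[start] = counts.get(start, 0) + 1
    total = len(identities)
    return {start: 100.0 * n / total for start, n in sorted(counts.items())}


if __name__ == "__main__":
    # Hypothetical identities of four hits.
    print(identity_histogram([31, 33, 42, 100]))
    # prints {30: 50.0, 40: 25.0, 95: 25.0}
```

With this normalization, a program that returns few hits can be compared directly with one that returns many, because only the shape of the distribution is plotted.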

Runtime Analysis

The runtime of each program was measured by prefixing its command line with the time command.
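The measurements on this page were taken with the shell's time command. As an alternative for scripted runs, the sketch below measures wall-clock time from Python; this is a swapped-in technique, not the method used here, and unlike time it does not report user and system CPU time. The function name timed_run is an assumption.

```python
import subprocess
import sys
import time


def timed_run(command):
    """Run a command (list of arguments) and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    elapsed = time.perf_counter() - start
    return completed.returncode, elapsed


if __name__ == "__main__":
    # Placeholder command so the sketch runs without the search tools installed.
    code, seconds = timed_run([sys.executable, "-c", "pass"])
    minutes, rest = divmod(seconds, 60)
    print(f"exit code {code}, {int(minutes)}:{rest:05.2f} min")
```

To time one of the searches, the placeholder command would be replaced by the corresponding argument list, for example the fasta36 call shown earlier.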

Program Runtime
Blast 2:40 min
Fasta 5:16 min
PSI-Blast: 3 Iterations: E-value cutoff 10e-6 7:50 min
PSI-Blast: 3 Iterations: E-value cutoff 0.002 7:48 min
PSI-Blast: 3 Iterations: E-value cutoff 0.005 7:55 min
PSI-Blast: 5 Iterations: E-value cutoff 10e-6 13:27 min
PSI-Blast: 5 Iterations: E-value cutoff 0.002 13:06 min
PSI-Blast: 5 Iterations: E-value cutoff 0.005 12:49 min

Multiple Sequence Alignments

Selection of Sequences

We selected twenty sequences from the Fasta/Blast/PSI-Blast search results which fulfilled the following criteria:

  • about 400-450 amino acids long
  • true positive according to the HSSP search
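The two criteria can be expressed as a simple filter. The sketch below assumes the hits are available as (ID, sequence) pairs and treats the length range as hard bounds, although the text only says "about"; the function name and the toy data are hypothetical.

```python
def select_candidates(hits, true_positive_ids, min_len=400, max_len=450):
    """Keep hits whose length lies in [min_len, max_len] and whose ID is a true positive.

    hits: iterable of (sequence_id, sequence) pairs.
    Assumption: strict length bounds, although the criterion says "about".
    """
    tp = set(true_positive_ids)
    return [
        (seq_id, seq)
        for seq_id, seq in hits
        if min_len <= len(seq) <= max_len and seq_id in tp
    ]


if __name__ == "__main__":
    # Hypothetical toy hits: only "a" satisfies both criteria.
    toy_hits = [("a", "M" * 420), ("b", "M" * 300), ("c", "M" * 430)]
    selected = select_candidates(toy_hits, true_positive_ids={"a", "b"})
    print([seq_id for seq_id, _ in selected])  # prints ['a']
```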

Unfortunately, we were not able to find a sequence of a PDB structure with an identity between 60% and 89%. The selected sequences are listed in the following table. We also added our reference sequence to the multiple sequence alignment.

GenBank Identifier Source Description Organism Identity
99%-90% Sequence Identity
295789486 PDB alpha-galactosidase A, chain A Homo sapiens 99%
62896813 GenBank alpha-galactosidase Homo sapiens 99%
269914455 PDB alpha-galactosidase Homo sapiens 99%
297710567 RefSeq alpha-galactosidase A-like Pongo abelii 96%
296235998 RefSeq alpha-galactosidase A Callithrix jacchus 95%
89%-60% Sequence Identity
301788124 RefSeq alpha-galactosidase A-like Ailuropoda melanoleuca 83%
133778924 RefSeq alpha-galactosidase A Mus musculus 78%
114051916 RefSeq alpha-N-acetylgalactosaminidase Bombyx mori 76%
291190554 RefSeq alpha-galactosidase A Salmo salar 67%
148228315 RefSeq alpha-galactosidase Xenopus laevis 65%
59%-40% Sequence Identity
20151048 PDB alpha-N-acetylgalactosaminidase, chain A Gallus gallus 57%
261824882 PDB alpha-N-acetylgalactosaminidase, chain A Homo sapiens 54%
148229665 RefSeq alpha-N-acetylgalactosaminidase Xenopus laevis 47%
92096920 GenBank NAGA protein Bos taurus 46%
260593558 RefSeq alpha-galactosidase Prevotella veroralis F0319 41%
39%-20% Sequence Identity
51701639 Swiss-Prot alpha-galactosidase precursor Lachancea cidri 38%
74626383 Swiss-Prot alpha-galactosidase B precursor Aspergillus niger 35%
299856763 PDB alpha-galactosidase, chain A Saccharomyces cerevisiae 34%
310699603 GenBank alpha-D-galactopyranosidase Fusarium oxysporum 33%
226293587 Swiss-Prot alpha-galactosidase precursor Torulaspora delbrueckii 31%
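The grouping used in the table can be reproduced with a small lookup. This is a sketch that assumes identities are given as whole percentages; the names IDENTITY_BINS and identity_bin are hypothetical, while the ranges are the four group headers of the table.

```python
# (low, high) bounds in percent, matching the four group headers of the table.
IDENTITY_BINS = [(90, 99), (60, 89), (40, 59), (20, 39)]


def identity_bin(identity):
    """Return the group label for a whole-percent identity, or None if outside all groups."""
    for low, high in IDENTITY_BINS:
        if low <= identity <= high:
            return f"{high}%-{low}%"
    return None


if __name__ == "__main__":
    # Identities taken from three rows of the table.
    for value in (99, 83, 31):
        print(value, identity_bin(value))
```

An identity of 100% falls outside all four groups and returns None, which matches the table: the reference sequence itself is not listed in any group.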

Programs

Cobalt

We used NCBI Cobalt version 2.0.1 with the command:

cobalt -i sequences.fasta -norps T

Multiple sequence alignment of the 21 sequences by Cobalt in JalView.

ClustalW

We used ClustalW version 1.83 with the command:

clustalw -infile=sequences.fasta

Multiple sequence alignment of the 21 sequences by ClustalW in JalView.

Muscle

We used Muscle version 3.8.31 with the command:

muscle -in sequences.fasta -out muscle_msa.aln

Multiple sequence alignment of the 21 sequences by Muscle in JalView.

T-Coffee

The basic command to start T-Coffee version 8.99 is:

t_coffee sequences.fasta

Multiple sequence alignment of the 21 sequences by T-Coffee in JalView.

T-Coffee 3D

To start the 3D mode, the additional parameters -mode expresso -pdb_type dn were appended to the command.

Multiple sequence alignment of the 21 sequences by T-Coffee 3D in JalView.

Results

Conservation
