Task 2: Sequence alignments (sequence searches and multiple alignments)

From Bioinformatikpedia
Revision as of 21:22, 23 May 2011 by Meier (talk | contribs) (Overlap)

Task description

The full description of this task can be found here.


Task 2.1: Sequence searches

BLAST

Running

time sudo blastall -p blastp -d '/data/blast/nr/nr' -i ./reference.fasta -o './reference.blast' -b 500

real 11m30.762s
user 3m11.440s
sys 0m12.250s

Results

Found true positives 220 of 500 predictions.

Pah.blast coverage.png Pah.blast score.png Pah.blast identity.png

FASTA

Installation

  • Used Virtual Box with Linux.

Running

time ./fasta36 /home/student/reference.fasta /data/nr/nr

interactive:

  • Enter filename for results []: /home/student/reference.fasta_search
  • How many scores do you want to see: 500
  • More scores? 0
  • Display alignments also? (y/n) [n] y
  • number of alignments [500]? 500

real 10m13.878s
user 7m26.270s
sys 0m20.230s

Results

Found positives 224 of 500 predictions.

Pah.fasta search coverage.png Pah.fasta search score.png Pah.fasta search identity.png

PSI-BLAST

Running

Parameterset 1

time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e10E-6_i3.blast' -h 10E-6 -j 3 -C './reference_i3_e10E-6.chk'

real 37m56.447s
user 14m27.620s
sys 0m54.620s

Parameterset 2

time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e005_i3.blast' -h 0.005 -j 3 -C './reference_i3_e005.chk'

real 37m41.487s
user 14m42.850s
sys 0m52.370s

Parameterset 3

time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e005_i5.blast' -h 0.005 -j 5 -C './reference_i5_e005.chk'

real 62m22.175s
user 26m25.410s
sys 1m20.700s

Parameterset 4

time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e10E-6_i5.blast' -h 10E-6 -j 5 -C './reference_i5_e10E-6.chk'

real 61m59.284s
user 25m55.920s
sys 1m21.620s

Results

Parameterset 1

3 Iterations, 10E-6 E-Value-Threshold
In iteration 2 220/500
In iteration 3 224/500
Psi-Blast had during the different iterations 255 different positives at least one time in its result list.

Pah psi 10E-6 i3 coverage.png Pah psi 10E-6 i3 score.png Pah psi 10E-6 i3 identity.png

Parameterset 2

3 Iterations, 0.005 E-Value-Threshold
In iteration 2 220/500
In iteration 3 225/500
Psi-Blast had during the different iterations 256 different positives at least one time in its result list.

Pah psi 005 i3 coverage.png Pah psi 005 i3 score.png Pah psi 005 i3 identity.png

Parameterset 3

5 Iterations, 0.005 E-Value-Threshold
In iteration 2 220/500
In iteration 3 225/500
In iteration 4 223/500
In iteration 5 225/500
Psi-Blast had during the different iterations 261 different positives at least one time in its result list.

Pah psi 005 i5 coverage.png Pah psi 005 i5 score.png Pah psi 005 i5 identity.png

Parameterset 4

5 Iterations, 10E-6 E-Value-Threshold
In iteration 2 220/500
In iteration 3 224/500
In iteration 4 223/500
In iteration 5 226/500
Psi-Blast had during the different iterations 260 different positives at least one time in its result list.

Pah psi 10E-6 i5 coverage.png Pah psi 10E-6 i5 score.png Pah psi 10E-6 i5 identity.png

HHSearch

Installation

Preparing the HHM-Database
Configure HHSearch-Tools

In the manual of HHSearch it was adviced to add the information of the secondary structure to the multiple alignment used for the query. Therefore it was necessary to run the addpsipred script of HHSearch. This script was not configured in the virtual box. Several parameters have to be adjusted.

  • Changes in /apps/bin/addpsipred:

my $psipreddir="/apps/psipred_2.5";
my $ncbidir="/apps/blast_old/bin";
my $perl="/apps/bin";
my $dummydb="/home/student/tmp";

  • Copy /apps/bin/reformat to /apps/bin/reformat.pl


Running

Parameterset 1

time hhsearch -i ./reference.fasta -d /data/hmm/pdb70.db -b 500 -o ./reference_simple.hhsearch

real 8m33.171s
user 5m14.530s
sys 0m3.510s

Parameterset 2

alignblast reference_psi_e10E-6_i3.blast reference_psi_e10E-6_i3.a3m
addpsipred /home/student/workspace/reference_psi_e10E-6_i3.a3m
time hhsearch -i reference_psi_e10E-6_i3.a3m -d /data/hmm/pdb70.db -o reference_psi_e10E-6_i3.hhsearch

real 16m27.258s
user 7m47.220s
sys 0m6.290s

To run HHSearch with the profile of a PSI-BLAST search the output of PSI-BLAST has to be converted to a multiple alignment by the script alignblast. The performance of HHSearch can be increases by adding the information of the secondary structure to the multiple alignment. This can be done with the script addspipred (uses the secondary structure prediction of PSIPRED).

Parameterset 3

alignblast reference_psi_e005_i3.blast reference_psi_e005_i3.a3m
addpsipred /home/student/workspace/reference_psi_e005_i3.a3m
time hhsearch -i reference_psi_e005_i3.a3m -d /data/hmm/pdb70.db -o reference_psi_e005_i3.hhsearch

real 16m7.216s
user 7m41.840s
sys 0m5.570s

Parameterset 4

alignblast reference_psi_e005_i5.blast reference_psi_e005_i5.blast.a3m
addpsipred /home/student/workspace/reference_psi_e005_i5.blast.a3m
time hhsearch -i reference_psi_e005_i5.blast.a3m -d /data/hmm/pdb70.db -o reference_psi_e005_i5.blast.hhsearch

real 7m49.907s
user 7m15.310s
sys 0m4.320s

Parameterset 5

alignblast reference_psi_e10E-6_i5.blast reference_psi_e10E-6_i5.a3m
addpsipred /home/student/workspace/reference_psi_e10E-6_i5.a3m
time hhsearch -i reference_psi_e10E-6_i5.a3m -d /data/hmm/pdb70.db -o reference_psi_e10E-6_i5.hhsearch

real 8m10.730s
user 7m33.190s
sys 0m5.390s

Results

Parameterset 0

Run on Simple Fasta-File.
Found 6 positives.

Pah psi simple.hhsearch coverage.png Pah psi simple.hhsearch score.png Pah psi simple.hhsearch identity.png

Parameterset 1

Run on Psi-Blast-Output of parameter set 1.
Found 6 positives.

Pah psi e10E-6 i3.hhsearch coverage.png Pah psi e10E-6 i3.hhsearch score.png Pah psi e10E-6 i3.hhsearch identity.png

Parameterset 2

Run on Psi-Blast-Output of parameter set 2.
Found 6 positives.

Pah psi e005 i3.hhsearch coverage.png Pah psi e005 i3.hhsearch score.png Pah psi e005 i3.hhsearch identity.png

Parameterset 3

Run on Psi-Blast-Output of parameter set 3.
Found 6 positives.

Pah psi e005 i5.hhsearch coverage.png Pah psi e005 i5.hhsearch score.png Pah psi e005 i5.hhsearch identity.png

Parameterset 4

Run on Psi-Blast-Output of parameter set 4.
Found 6 positives.

Pah psi e10E-6 i5.hhsearch coverage.png Pah psi e10E-6 i5.hhsearch score.png Pah psi e10E-6 i5.hhsearch identity.png

Comparing the Results

Defining True Positives

HSSP - Some Positives

Getting the entry of PAH from HSSP http://mrs.cmbi.ru.nl/mrs-5/entry?db=hssp&id=2pah&q=phenylalanine%20hydroxylase

HSSP - More Positives

HHSearch was run with a dataset of the PDB. BLAST was run with the a dataset of the NR database. NR itself is a mixture of SwissProt, RefSeq, PIR, PRF, PDB and GenBank CDS translations. HSSP itself contains only references to SwissProt entries. Therefore it was necessary to get the cross-references of the Swissprot entries in HSSP to the other databases. For this purpose we created a java-tool.

Overlap

unfiltered sets:

Nr. Method/Parameter-Set Size of Result-Set
1 reference_psi_e10E-6_i3.blast_found_all 499
2 reference.fasta_search_found_all 499
3 reference_psi_e10E-6_i3.hhsearch_found_all 10
4 reference_simple.hhsearch_found_all 496
5 reference_psi_e005_i5.blast_found_all 499
6 reference_psi_e005_i5.hhsearch_found_all 10
7 reference.blast_found_all 499
8 reference_psi_e10E-6_i5.blast_found_all 499
9 reference_psi_e005_i3.hhsearch_found_all 494
10 reference_psi_e005_i3.blast_found_all 499
11 reference_psi_e10E-6_i5.hhsearch_found_all 10


Nr./Nr. 1 2 3 4 5 6 7 8 9 10 11
0 499 414 3 3 485 3 405 482 3 494 3
1 414 499 4 4 410 4 478 406 4 413 4
2 3 4 10 10 3 10 4 3 10 3 10
3 3 4 10 496 3 10 4 3 58 3 10
4 485 410 3 3 499 3 401 490 3 488 3
5 3 4 10 10 3 10 4 3 10 3 10
6 405 478 4 4 401 4 499 397 4 404 4
7 482 406 3 3 490 3 397 499 3 486 3
8 3 4 10 58 3 10 4 3 494 3 10
9 494 413 3 3 488 3 404 486 3 499 3
10 3 4 10 10 3 10 4 3 10 3 10


positive sets:

Nr. Method/Parameter-Set Size of Result-Set
1 reference_psi_e005_i3.hhsearch_found_positives 6
2 reference_psi_e10E-6_i3.hhsearch_found_positives 6
3 reference.fasta_search_found_positives 224
4 reference_psi_e005_i5.blast_found_positives 226
5 reference_simple.hhsearch_found_positives 6
6 reference_psi_e10E-6_i5.hhsearch_found_positives 6
7 reference_psi_e10E-6_i5.blast_found_positives 226
8 reference_psi_e005_i3.blast_found_positives 223
9 reference.blast_found_positives 220
10 reference_psi_e005_i5.hhsearch_found_positives 6
11 reference_psi_e10E-6_i3.blast_found_positives 223


Nr./Nr. 1 2 3 4 5 6 7 8 9 10 11
0 6 6 4 3 6 6 3 3 4 6 3
1 6 6 4 3 6 6 3 3 4 6 3
2 4 4 224 193 4 4 192 194 211 4 194
3 3 3 193 226 3 3 224 220 188 3 219
4 6 6 4 3 6 6 3 3 4 6 3
5 6 6 4 3 6 6 3 3 4 6 3
6 3 3 192 224 3 3 226 219 187 3 217
7 3 3 194 220 3 3 219 223 188 3 220
8 4 4 211 188 4 4 187 188 220 4 188
9 6 6 4 3 6 6 3 3 4 6 3
10 3 3 194 219 3 3 217 220 188 3 223

Summary

BLAST is the fastest method. Its result is comparable to FASTA and PSI-BLAST.

It was mentioned that FASTA is more sensitive than BLAST. The number of found positives does not underlie this statement.

PSI-Pred is the slowest of the used methods on the base of the measured time. The final results of each iteration are comparable to a simple BLAST or FASTA search. Between the different runs it finds sometimes new positives, sometimes it loses positives and sometimes it finds formerly lost positives again. The ranks of the positives spread. It's not easy to dermine without the knowledge of the positives which results are good.

The only way to count the found positives (according to HSSP) is to use the cross-references in the SwissProt entries listed in HSSP. There are several problems concerning this method. Not all possible references might be annotated in the SwissProt entry. That means not every non-"positive" is automatically a false positive. Regarding the performance of an alignment method it's also necessary to regard the ranking of the positives. Therefore some non-positive gaps especially at the beginning (high score, high coverage, high identity, but non-positive) might be just wrongly annotated as non-positive.

HHSearch is in fact slower than PSI-Blast. We used for some of our HHSearch examples (1-4) PSI-Blast results. Adding the PSI-Blast runtime HHSearch is really slow. But it's performance is marvelous. It found only seven positives, but these are (in every parameter set) the first seven results and are clearly separated from the others by high scores. As already mentioned that's not a lot. But in HSSP are only (indirectly) 61 PDB entries listed. And probably some of them got lost by the filtering of the PDB HHM-database (less than 70 % sequence-identity).

In general it seems that PAH is a widely studied gene. Therefore it is quite easy to find targets. Even in the PDB were seven structures (of 61 possible) found, which are associated to PAH by HSSP.

With more difficult (less similar knowledge in the databases) queries the performance of the methods would separate more clearly. There PSI-BLAST and HHSearch would probably outperform BLAST and FASTA.

In my opinion HHSearch run with PSI-BLAST profiles is the tool of choice. The true positives can be more easily determined by the scores, which is necessary if searching for a virgin (nothing is known about the positives) and difficult sequence. HHSearch is part of the HHPred method, whose servers were on top in the last CASP-Competition. This shows its capabilities in finding reliable alignments, which could even be used for homology modeling.

Task 2.2: Multiple sequence alignments

Methodology

For each identity range (99%-90%, 89%-60, 59%-40% and 39%-20%) we selected five sequences of which at least one sequence has a solved structure as well. In order to find these sequences we looked in our blast and psi-blast results for appropriate sequences, four for each identity range. A sequences was considered as appropriate if it has the wanted sequence identity to our reference sequence and if this sequence has roughly the same length as our reference sequence. The reason for this is that we wanted to avoid leading and trailing gaps in our alignments. However, this was not always possible , especially for the low identity range (39%-20%). Last but not least, we used our hhsearch results (which was queried against a profile database of all pdb entries) to find appropriate sequences for each identity range. The list of selected sequences is as follows:


Sequence Identity GI/PDB Remark
97% 4557819 phenylalanine-4-hydroxylase [Homosapiens]
95% 1J8T catalytic domain of human phenylalanine hydroxylase Fe(II)
94% 296212713 PREDICTED:phenylalanine-4-hydroxylase [Callithrix jacchus]
91% 296212713 PREDICTED:phenylalanine-4-hydroxylase-like [Ailuropodamelanoleuca]
90% 171543886 phenylalanine-4-hydroxylase [Musmusculus]
89% 126339754 PREDICTED: similar toPhenylalanine-4-hydroxylase (PAH) (Phe-4-monooxygenase)[Monodelphis domestica]
82% 224095451 PREDICTED: phenylalanine hydroxylase [Taeniopygia guttata]
73% 32442452 Pah [Danio rerio]
66% 3HF6 crystal structure of human tryptophan hydroxylase type 1 with bound LP-521834 and FE
60% 194865387 GG14450 [Drosophila erecta]
59% 2TOH TYROSINE HYDROXYLASE CATALYTIC AND TETRAMERIZATION DOMAINS FROM RAT
56% 4883767 PREDICTED: phenylalanine hydroxylase [Caenorhabditis elegans]
48% 74096365 tyrosine hydroxylase 2 [Takifugu rubripes]
48% 301617387 PREDICTED: tyrosine3-monooxygenase-like [Xenopus (Silurana) tropicalis]
45% 115532103 TryPtophan Hydroxylase familymember (tph-1) [Caenorhabditis elegans]
37% 85860954 prephenate dehydratase [Syntrophus aciditrophicus SB]
31% 114570403 phenylalanine 4-monooxygenase [Maricaulis maris MCS10]
30% 1LTV CRYSTAL STRUCTURE OF CHROMOBACTERIUM VIOLACEUM PHENYLALANINE HYDROXYLASE, STRUCTURE WITH BOUND OXIDIZED Fe(III)
25% 94312464 phenylalanine 4-monooxygenase[Cupriavidus metallidurans CH34]
23% 298578904 aromatic amino acid hydroxylase[Chlamydophila pneumoniae]


Then we created a FASTA file for each identity range which contains the 5 sequences selected and our PAH reference sequence. For this purpose we executed the following commands:

//99-90

cat reference_pah_aa.fasta gi_4557819.fasta pdb_1J8T.fasta gi_296212713.fasta gi_301759315.fasta gi_171543886.fasta > ../99_90.fasta

//89-60

cat reference_pah_aa.fasta gi_126339754.fasta gi_224095451.fasta gi_32442452.fasta pdb_3hf6.fasta gi_194865387.fasta > ../89_60.fasta

//59-40

cat reference_pah_aa.fasta pdb_2TOH.fasta gi_4883767.fasta gi_74096365.fasta gi_301617387.fasta gi_115532103.fasta > ../59_40.fasta

//39-20

cat reference_pah_aa.fasta gi_85860954.fasta gi_114570403.fasta pdb_1ltv_fasta gi_94312464.fasta gi_298578904.fasta > ../39_20.fasta


Creating multiple sequence alignments with Cobalt

Installation

Before we could start to create our multiple sequence alignments (MSA) with Cobalt we had to install it first. We retrieved the latest copy from the NCBI server and installed it with the following commands:

//download of cobalt

wget ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/cobalt/*

//set enviroment var

COBALT=/home/student/master_praktikum/task2/task22/cobalt

//make cobalt.linux executable

chmod +x cobalt.linux

Execution

The execution of Cobalt was straight forward. To create our MSA we used the standard parameter as proposed in the README file. We used the following commands to create a MSA for each identity range:


$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 99_90.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_99_90.out

$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 89_60.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_89_60.out

$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 59_40.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_59_40.out

$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 39_20.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_39_20.out


Creating multiple sequence alignments with ClustalW

Installation

Installation was not necessary since ClustalW was already installed.

Execution

The execution of ClustalW was straight forward. To create our MSA we used the standard parameters. We used the following commands to create a MSA for each identity range:

clustalw -align -infile=99_90.fasta -type=PROTEIN -outfile=clustalw_out/clustalw_99_90.out

clustalw -align -infile=89_60.fasta -type=PROTEIN -outfile=clustalw_out/clustalw_89_60.out

clustalw -align -infile=59_40.fasta -type=PROTEIN -outfile=clustalw_out/clustalw_59_40.out

clustalw -align -infile=39_20.fasta -type=PROTEIN -outfile=clustalw_out/clustalw_39_20.out


Creating multiple sequence alignments with Muscle

Installation

Installation was not necessary since Muscle was already installed.

Execution

The execution of Muscle was straight forward. To create our MSA we used the standard parameters. We used the following commands to create a MSA for each identity range:

muscle -in 99_90.fasta -out muscle_out/muscle_99_90.out

muscle -in 89_60.fasta -out muscle_out/muscle_89_60.out

muscle -in 59_40.fasta -out muscle_out/muscle_59_40.out

muscle -in 39_20.fasta -out muscle_out/muscle_39_20.out


Creating multiple sequence alignments with T-Coffee

Installation

Installation was not necessary since Muscle was already installed.

Execution

The execution of T-Coffee was straight forward. The parameter “-method sap_pair,slow_pair” defines the method, which is in our case a MSA which considers the PDB file defined by the parameter “-template_file 1J8T”. We used the following commands to create a MSA for each identity range:

t_coffee 99_90.fasta -method sap_pair,slow_pair -template_file 1J8T

t_coffee 89_60.fasta -method sap_pair,slow_pair -template_file 3HF6

t_coffee 59_40.fasta -method sap_pair,slow_pair -template_file 2TOH

t_coffee 39_20.fasta -method sap_pair,slow_pair -template_file 1LTV


Results and discussion

Cobalt

MSA of 99-90% sequence identity
99 90 cobalt.png
Conservation
Gaps
Sequence ID Number of gaps
PAH
4557819
1J8T
296212713
301759315
171543886
MSA of 89-60% sequence identity
89 60 cobalt.png
Conservation
Gaps
Sequence ID Number of gaps
PAH
4557819
1J8T
296212713
301759315
171543886
MSA of 59-40% sequence identity
59 40 cobalt.png
Conservation
Gaps
MSA of 39-20% sequence identity
39 20 cobalt.png
Conservation
Gaps

ClustalW

MSA of 99-90% sequence identity
99 90 clustalw.png
Conservation
Gaps
MSA of 89-60% sequence identity
89 60 clustalw.png
Conservation
Gaps
MSA of 59-40% sequence identity
59 40 clustalw.png
Conservation
Gaps
MSA of 39-20% sequence identity
39 20 clustalw.png
Conservation
Gaps

Muscle

MSA of 99-90% sequence identity
99 90 muscle.png
Conservation
Gaps
MSA of 89-60% sequence identity
89 60 muscle.png
Conservation
Gaps
MSA of 59-40% sequence identity
59 40 muscle.png
Conservation
Gaps
MSA of 39-20% sequence identity
39 20 muscle.png
Conservation
Gaps

T-Coffee

MSA of 99-90% sequence identity
99 90 tcoffee.png
Conservation
Gaps
MSA of 89-60% sequence identity
89 60 tcoffee.png
Conservation
Gaps
MSA of 59-40% sequence identity
59 40 tcoffee.png
Conservation
Gaps
MSA of 39-20% sequence identity
39 20 tcoffee.png
Conservation
Gaps