Difference between revisions of "Task 2: Sequence alignments (sequence searches and multiple alignments)"

From Bioinformatikpedia
(Creating multiple sequence alignments with Cobalt)
Line 382: Line 382:
$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 99_90.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_99_90.out
$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 99_90.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_99_90.out
$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 89_60.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_89_60.out
$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 89_60.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_89_60.out
$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 59_40.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_59_40.out
$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 59_40.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_59_40.out
$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 39_20.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_39_20.out
$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 39_20.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_39_20.out

Revision as of 19:00, 23 May 2011

Task description

The full description of this task can be found here.

Task 2.1: Sequence searches



time sudo blastall -p blastp -d '/data/blast/nr/nr' -i ./reference.fasta -o './reference.blast' -b 500

real 11m30.762s
user 3m11.440s
sys 0m12.250s


Found true positives 220 of 500 predictions.

Pah.blast coverage.png Pah.blast score.png Pah.blast identity.png



  • Used Virtual Box with Linux.


time ./fasta36 /home/student/reference.fasta /data/nr/nr


  • Enter filename for results []: /home/student/reference.fasta_search
  • How many scores do you want to see: 500
  • More scores? 0
  • Display alignments also? (y/n) [n] y
  • number of alignments [500]? 500

real 10m13.878s
user 7m26.270s
sys 0m20.230s


Found positives 224 of 500 predictions.

Pah.fasta search coverage.png Pah.fasta search score.png Pah.fasta search identity.png



Parameterset 1

time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e10E-6_i3.blast' -h 10E-6 -j 3 -C './reference_i3_e10E-6.chk'

real 37m56.447s
user 14m27.620s
sys 0m54.620s

Parameterset 2

time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e005_i3.blast' -h 0.005 -j 3 -C './reference_i3_e005.chk'

real 37m41.487s
user 14m42.850s
sys 0m52.370s

Parameterset 3

time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e005_i5.blast' -h 0.005 -j 5 -C './reference_i5_e005.chk'

real 62m22.175s
user 26m25.410s
sys 1m20.700s

Parameterset 4

time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e10E-6_i5.blast' -h 10E-6 -j 5 -C './reference_i5_e10E-6.chk'

real 61m59.284s
user 25m55.920s
sys 1m21.620s


Parameterset 1

3 Iterations, 10E-6 E-Value-Threshold In iteration 2 220/500 In iteration 3 224/500 Psi-Blast had during the different iterations 255 different positives at least one time in its result list.

Pah psi 10E-6 i3 coverage.png Pah psi 10E-6 i3 score.png Pah psi 10E-6 i3 identity.png

Parameterset 2

3 Iterations, 0.005 E-Value-Threshold In iteration 2 220/500 In iteration 3 225/500 Psi-Blast had during the different iterations 256 different positives at least one time in its result list.

Pah psi 005 i3 coverage.png Pah psi 005 i3 score.png Pah psi 005 i3 identity.png

Parameterset 3

5 Iterations, 0.005 E-Value-Threshold In iteration 2 220/500 In iteration 3 225/500 In iteration 4 223/500 In iteration 5 225/500 Psi-Blast had during the different iterations 261 different positives at least one time in its result list.

Pah psi 005 i5 coverage.png Pah psi 005 i5 score.png Pah psi 005 i5 identity.png

Parameterset 4

5 Iterations, 10E-6 E-Value-Threshold In iteration 2 220/500 In iteration 3 224/500 In iteration 4 223/500 In iteration 5 226/500 Psi-Blast had during the different iterations 260 different positives at least one time in its result list.

Pah psi 10E-6 i5 coverage.png Pah psi 10E-6 i5 score.png Pah psi 10E-6 i5 identity.png



Preparing the HHM-Database
Configure HHSearch-Tools

In the manual of HHSearch it was adviced to add the information of the secondary structure to the multiple alignment used for the query. Therefore it was necessary to run the addpsipred script of HHSearch. This script was not configured in the virtual box. Several parameters have to be adjusted.

  • Changes in /apps/bin/addpsipred:

my $psipreddir="/apps/psipred_2.5";
my $ncbidir="/apps/blast_old/bin";
my $perl="/apps/bin";
my $dummydb="/home/student/tmp";

  • Copy /apps/bin/reformat to /apps/bin/reformat.pl


Parameterset 1

time hhsearch -i ./reference.fasta -d /data/hmm/pdb70.db -b 500 -o ./reference_simple.hhsearch

real 8m33.171s
user 5m14.530s
sys 0m3.510s

Parameterset 2

alignblast reference_psi_e10E-6_i3.blast reference_psi_e10E-6_i3.a3m
addpsipred /home/student/workspace/reference_psi_e10E-6_i3.a3m
time hhsearch -i reference_psi_e10E-6_i3.a3m -d /data/hmm/pdb70.db -o reference_psi_e10E-6_i3.hhsearch

real 16m27.258s
user 7m47.220s
sys 0m6.290s

To run HHSearch with the profile of a PSI-BLAST search the output of PSI-BLAST has to be converted to a multiple alignment by the script alignblast. The performance of HHSearch can be increases by adding the information of the secondary structure to the multiple alignment. This can be done with the script addspipred (uses the secondary structure prediction of PSIPRED).

Parameterset 3

alignblast reference_psi_e005_i3.blast reference_psi_e005_i3.a3m
addpsipred /home/student/workspace/reference_psi_e005_i3.a3m
time hhsearch -i reference_psi_e005_i3.a3m -d /data/hmm/pdb70.db -o reference_psi_e005_i3.hhsearch

real 16m7.216s
user 7m41.840s
sys 0m5.570s

Parameterset 4

alignblast reference_psi_e005_i5.blast reference_psi_e005_i5.blast.a3m
addpsipred /home/student/workspace/reference_psi_e005_i5.blast.a3m
time hhsearch -i reference_psi_e005_i5.blast.a3m -d /data/hmm/pdb70.db -o reference_psi_e005_i5.blast.hhsearch

real 7m49.907s
user 7m15.310s
sys 0m4.320s

Parameterset 5

alignblast reference_psi_e10E-6_i5.blast reference_psi_e10E-6_i5.a3m
addpsipred /home/student/workspace/reference_psi_e10E-6_i5.a3m
time hhsearch -i reference_psi_e10E-6_i5.a3m -d /data/hmm/pdb70.db -o reference_psi_e10E-6_i5.hhsearch

real 8m10.730s
user 7m33.190s
sys 0m5.390s

Comparing the Results

HSSP - Some Positives

Getting the entry of PAH from HSSP http://mrs.cmbi.ru.nl/mrs-5/entry?db=hssp&id=2pah&q=phenylalanine%20hydroxylase

HSSP - More Positives

HHSearch was run with a dataset of the PDB. BLAST was run with the a dataset of the NR database. NR itself is a mixture of SwissProt, RefSeq, PIR, PRF, PDB and GenBank CDS translations. HSSP itself contains only references to SwissProt entries. Therefore it was necessary to get the cross-references of the Swissprot entries in HSSP to the other databases. For this purpose we created a java-tool (only runable with linux).

Task 2.2: Multiple sequence alignments


For each identity range (99%-90%, 89%-60, 59%-40% and 39%-20%) we selected five sequences of which at least one sequence has a solved structure as well. In order to find these sequences we looked in our blast and psi-blast results for appropriate sequences, four for each identity range. A sequences was considered as appropriate if it has the wanted sequence identity to our reference sequence and if this sequence has roughly the same length as our reference sequence. The reason for this is that we wanted to avoid leading and trailing gaps in our alignments. However, this was not always possible , especially for the low identity range (39%-20%). Last but not least, we used our hhsearch results (which was queried against a profile database of all pdb entries) to find appropriate sequences for each identity range. The list of selected sequences is as follows:

Sequence Identity GI/PDB Remark
97% 4557819 phenylalanine-4-hydroxylase [Homosapiens]
95% 1J8T catalytic domain of human phenylalanine hydroxylase Fe(II)
94% 296212713 PREDICTED:phenylalanine-4-hydroxylase [Callithrix jacchus]
91% 296212713 PREDICTED:phenylalanine-4-hydroxylase-like [Ailuropodamelanoleuca]
90% 171543886 phenylalanine-4-hydroxylase [Musmusculus]
89% 126339754 PREDICTED: similar toPhenylalanine-4-hydroxylase (PAH) (Phe-4-monooxygenase)[Monodelphis domestica]
82% 224095451 PREDICTED: phenylalanine hydroxylase [Taeniopygia guttata]
73% 32442452 Pah [Danio rerio]
66% 3HF6 crystal structure of human tryptophan hydroxylase type 1 with bound LP-521834 and FE
60% 194865387 GG14450 [Drosophila erecta]
56% 4883767 PREDICTED: phenylalanine hydroxylase [Caenorhabditis elegans]
48% 74096365 tyrosine hydroxylase 2 [Takifugu rubripes]
48% 301617387 PREDICTED: tyrosine3-monooxygenase-like [Xenopus (Silurana) tropicalis]
45% 115532103 TryPtophan Hydroxylase familymember (tph-1) [Caenorhabditis elegans]
37% 85860954 prephenate dehydratase [Syntrophus aciditrophicus SB]
31% 114570403 phenylalanine 4-monooxygenase [Maricaulis maris MCS10]
25% 94312464 phenylalanine 4-monooxygenase[Cupriavidus metallidurans CH34]
23% 298578904 aromatic amino acid hydroxylase[Chlamydophila pneumoniae]

Then we created a FASTA file for each identity range which contains the 5 sequences selected and our PAH reference sequence. For this purpose we executed the following commands:


cat reference_pah_aa.fasta gi_4557819.fasta pdb_1J8T.fasta gi_296212713.fasta gi_301759315.fasta gi_171543886.fasta > ../99_90.fasta


cat reference_pah_aa.fasta gi_126339754.fasta gi_224095451.fasta gi_32442452.fasta pdb_3hf6.fasta gi_194865387.fasta > ../89_60.fasta


cat reference_pah_aa.fasta pdb_2TOH.fasta gi_4883767.fasta gi_74096365.fasta gi_301617387.fasta gi_115532103.fasta > ../59_40.fasta


cat reference_pah_aa.fasta gi_85860954.fasta gi_114570403.fasta pdb_1ltv_fasta gi_94312464.fasta gi_298578904.fasta > ../39_20.fasta

Creating multiple sequence alignments with Cobalt


Before we could start to create our multiple sequence alignments (MSA) with Cobalt we had to install it first. We retrieved the latest copy from the NCBI server and installed it with the following commands:

//download of cobalt

wget ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/cobalt/*

//set enviroment var


//make cobalt.linux executable

chmod +x cobalt.linux


The execution of Cobalt was straight forward. To create our MSA we used the standard parameter as proposed in the README file. We used the following commands to create a MSA for each identity range:

$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 99_90.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_99_90.out

$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 89_60.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_89_60.out

$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 59_40.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_59_40.out

$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 39_20.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_39_20.out