Task 2: Sequence alignments (sequence searches and multiple alignments)

From Bioinformatikpedia
Revision as of 11:36, 30 August 2011 by Meier (talk | contribs) (BLAST)

Task description

The search for good alignments is often the first step in any bioinformatics application, e.g. homology modeling, function analysis, etc. By such an alignment the user wants to get more information about a probably new sequence. In an alignment two or more sequences are compared due to similarity. If two sequences are very similar due to an alignment the probability is high that these two sequences were derived from one ancestral sequence (homologs, orthologs, paralogs). Therefore it is possible to compare or assign annotations between the two sequences: function, catalytic site, secondary structure, etc.

There are some terms, which appear frequently together with alignments:

  • Homologs: Two sequences share a high sequence identity and are derived from the same ancestral sequence.
  • Orthologs: The sequences were separated by a speciation event.
  • Paralogs: The sequences were separated by a duplication event within a species.

If more sequences are compared in a multiple alignment, it can be possible to derive sites of the protein, which have not changed between these sequences. These sites are called conserved, which means that they did not change in the evolution from a common ancestral sequence. If more sequences are compared, it is easier to see blocks in the alignment. These blocks can be caused by domains of the protein, if amino acid sequences were compared. A domain can be the foundation of one of the the protein's functions, e. g. binding, enzymatic activity, etc.

The derived relationship between the sequences is somehow only a construct in mind. There is no and there will never be a complete record of evolution. Therefore the results of such an alignment can give hints, but should always be regarded critically.

There are several dynamic programming approaches to calculate optimal alignments. Optimal means in this case that the score of the alignment cannot be better with the used scoring scheme.

  • Smith-Waterman calculates a local alignment, that means only the highest scoring region is aligned.
  • Needleman-Wunsch calculates a global alignment, that means the whole sequence is regarded in the alignment.
  • Gotoh uses affine gap costs. In the general dynamic programming approaches there are penalties defined for introducing gaps in the alignment. Gotoh differentiate between opening a new gap and extending an already opened gap. That means, that Gotoh uses two different gap penalties for the two cases. By this scheme, blocks are more favored, since the penalty of opening a new gap is usually ten times the penalty of extending a gap.

These dynamic programming approaches calculate optimal alignments, but are very computationally intensive. It is not reasonable to use these approaches to search whole databases for similar sequences to a target sequence.

Therefore we use several heuristics to find similar sequences for the protein PAH. In the second part we select several of the found sequences to create a multiple alignment.

The full description of this task can be found here.

Task 2.1: Sequence searches

BLAST

A detailed description of BLAST can be found here

Running

time sudo blastall -p blastp -d '/data/blast/nr/nr' -i ./reference.fasta -o './reference.blast' -b 500

real 11m30.762s
user 3m11.440s
sys 0m12.250s

Results

Found true positives 220 of 500 predictions.

Figure 1: Plot of the frequency of BLAST search hits versus the coverage with the query sequence of PAH (Parameters: b=500, database=NR)
Figure 2: Plot of the frequency of BLAST search hits versus the score with the query sequence of PAH (Parameters: b=500, database=NR)
Figure 3: Plot of the frequency of BLAST search hits versus the identity with the query sequence of PAH (Parameters: b=500, database=NR)

FASTA

A detailed description of FASTA can be found here.

Installation

  • Used Virtual Box with Linux.

Running

time ./fasta36 /home/student/reference.fasta /data/nr/nr

interactive:

  • Enter filename for results []: /home/student/reference.fasta_search
  • How many scores do you want to see: 500
  • More scores? 0
  • Display alignments also? (y/n) [n] y
  • number of alignments [500]? 500

real 10m13.878s
user 7m26.270s
sys 0m20.230s

Results

Found positives 224 of 500 predictions.

Figure 4: Plot of the frequency of FASTA search hits versus the coverage with the query sequence of PAH (Parameters: default, database: nr)
Figure 5: Plot of the frequency of FASTA search hits versus the coverage with the query sequence of PAH (Parameters: default, database: nr)
Figure 6: Plot of the frequency of FASTA search hits versus the coverage with the query sequence of PAH (Parameters: default, database: nr)

PSI-BLAST

A detailed description of PSI-BLAST can be found here.

Running

Parameterset 1

time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e10E-6_i3.blast' -h 10E-6 -j 3 -C './reference_i3_e10E-6.chk'

real 37m56.447s
user 14m27.620s
sys 0m54.620s

Parameterset 2

time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e005_i3.blast' -h 0.005 -j 3 -C './reference_i3_e005.chk'

real 37m41.487s
user 14m42.850s
sys 0m52.370s

Parameterset 3

time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e005_i5.blast' -h 0.005 -j 5 -C './reference_i5_e005.chk'

real 62m22.175s
user 26m25.410s
sys 1m20.700s

Parameterset 4

time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e10E-6_i5.blast' -h 10E-6 -j 5 -C './reference_i5_e10E-6.chk'

real 61m59.284s
user 25m55.920s
sys 1m21.620s

Results

Parameterset 1

3 Iterations, 10E-6 E-Value-Threshold
In iteration 2 220/500
In iteration 3 224/500
Psi-Blast had during the different iterations 255 different positives at least one time in its result list.

Figure 7: Plot of the frequency of PSI-BLAST search hits versus the coverage with the query sequence of PAH (Parameters: h=10E-6, j=3, database: nr)
Figure 8: Plot of the frequency of PSI-BLAST search hits versus the score with the query sequence of PAH (Parameters: h=10E-6, j=3, database: nr)
Figure 9: Plot of the frequency of PSI-BLAST search hits versus the identity with the query sequence of PAH (Parameters: h=10E-6, j=3, database: nr)
Parameterset 2

3 Iterations, 0.005 E-Value-Threshold
In iteration 2 220/500
In iteration 3 225/500
Psi-Blast had during the different iterations 256 different positives at least one time in its result list.

Figure 10: Plot of the frequency of PSI-BLAST search hits versus the coverage with the query sequence of PAH (Parameters: h=0.005, j=5, database: nr)
Figure 11: Plot of the frequency of PSI-BLAST search hits versus the score with the query sequence of PAH (Parameters: h=0.005, j=5, database: nr)
Figure 12: Plot of the frequency of PSI-BLAST search hits versus the identity with the query sequence of PAH (Parameters: h=0.005, j=5, database: nr)
Parameterset 3

5 Iterations, 0.005 E-Value-Threshold
In iteration 2 220/500
In iteration 3 225/500
In iteration 4 223/500
In iteration 5 225/500
Psi-Blast had during the different iterations 261 different positives at least one time in its result list.

Figure 13: Plot of the frequency of PSI-BLAST search hits versus the coverage with the query sequence of PAH (Parameters: h=0.005, j=5, database: nr)
Figure 14: Plot of the frequency of PSI-BLAST search hits versus the score with the query sequence of PAH (Parameters: h=0.005, j=5, database: nr)
Figure 15: Plot of the frequency of PSI-BLAST search hits versus the identity with the query sequence of PAH (Parameters: h=0.005, j=5, database: nr)
Parameterset 4

5 Iterations, 10E-6 E-Value-Threshold
In iteration 2 220/500
In iteration 3 224/500
In iteration 4 223/500
In iteration 5 226/500
Psi-Blast had during the different iterations 260 different positives at least one time in its result list.

Figure 16: Plot of the frequency of PSI-BLAST search hits versus the coverage with the query sequence of PAH (Parameters: h=10E-6, j=5, database: nr)
Figure 17: Plot of the frequency of PSI-BLAST search hits versus the score with the query sequence of PAH (Parameters: h=10E-6, j=5, database: nr)
Figure 18: Plot of the frequency of PSI-BLAST search hits versus the identity with the query sequence of PAH (Parameters: h=10E-6, j=5, database: nr)

HHSearch

A detailed description of HHSearch can be found here.

Installation

Preparing the HHM-Database
Configure HHSearch-Tools

In the manual of HHSearch it was adviced to add the information of the secondary structure to the multiple alignment used for the query. Therefore it was necessary to run the addpsipred script of HHSearch. This script was not configured in the virtual box. Several parameters have to be adjusted.

  • Changes in /apps/bin/addpsipred:

my $psipreddir="/apps/psipred_2.5";
my $ncbidir="/apps/blast_old/bin";
my $perl="/apps/bin";
my $dummydb="/home/student/tmp";

  • Copy /apps/bin/reformat to /apps/bin/reformat.pl


Running

Parameterset 0

time hhsearch -i ./reference.fasta -d /data/hmm/pdb70.db -b 500 -o ./reference_simple.hhsearch

real 8m33.171s
user 5m14.530s
sys 0m3.510s

Parameterset 1

alignblast reference_psi_e10E-6_i3.blast reference_psi_e10E-6_i3.a3m
addpsipred /home/student/workspace/reference_psi_e10E-6_i3.a3m
time hhsearch -i reference_psi_e10E-6_i3.a3m -d /data/hmm/pdb70.db -o reference_psi_e10E-6_i3.hhsearch

real 16m27.258s
user 7m47.220s
sys 0m6.290s

To run HHSearch with the profile of a PSI-BLAST search the output of PSI-BLAST has to be converted to a multiple alignment by the script alignblast. The performance of HHSearch can be increases by adding the information of the secondary structure to the multiple alignment. This can be done with the script addspipred (uses the secondary structure prediction of PSIPRED).

Parameterset 2

alignblast reference_psi_e005_i3.blast reference_psi_e005_i3.a3m
addpsipred /home/student/workspace/reference_psi_e005_i3.a3m
time hhsearch -i reference_psi_e005_i3.a3m -d /data/hmm/pdb70.db -o reference_psi_e005_i3.hhsearch

real 16m7.216s
user 7m41.840s
sys 0m5.570s

Parameterset 3

alignblast reference_psi_e005_i5.blast reference_psi_e005_i5.blast.a3m
addpsipred /home/student/workspace/reference_psi_e005_i5.blast.a3m
time hhsearch -i reference_psi_e005_i5.blast.a3m -d /data/hmm/pdb70.db -o reference_psi_e005_i5.blast.hhsearch

real 7m49.907s
user 7m15.310s
sys 0m4.320s

Parameterset 4

alignblast reference_psi_e10E-6_i5.blast reference_psi_e10E-6_i5.a3m
addpsipred /home/student/workspace/reference_psi_e10E-6_i5.a3m
time hhsearch -i reference_psi_e10E-6_i5.a3m -d /data/hmm/pdb70.db -o reference_psi_e10E-6_i5.hhsearch

real 8m10.730s
user 7m33.190s
sys 0m5.390s

Results

Parameterset 0

Run on Simple Fasta-File.
Found 6 positives.

Figure 19: Plot of the frequency of HHSearch search hits versus the coverage with the query sequence of PAH (parameters: b=500, database=PDB70, i=fasta sequence of PAH)
Figure 20: Plot of the frequency of HHSearch search hits versus the score with the query sequence of PAH (parameters: b=500, database=PDB70, i=fasta sequence of PAH)
Figure 21: Plot of the frequency of HHSearch search hits versus the identity with the query sequence of PAH (parameters: b=500, database=PDB70, i=fasta sequence of PAH)
Parameterset 1

Run on Psi-Blast-Output of parameter set 1.
Found 6 positives.

Figure 22: Plot of the frequency of HHSearch search hits versus the coverage with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=10E-6, j=3, database=nr)
Figure 23: Plot of the frequency of HHSearch search hits versus the score with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=10E-6, j=3, database=nr)
Figure 24: Plot of the frequency of HHSearch search hits versus the identity with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=10E-6, j=3, database=nr)
Parameterset 2

Run on Psi-Blast-Output of parameter set 2.
Found 6 positives.

Figure 25: Plot of the frequency of HHSearch search hits versus the coverage with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=0.005, j=3, database=nr)
Figure 26: Plot of the frequency of HHSearch search hits versus the score with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=0.005, j=3, database=nr)
Figure 27: Plot of the frequency of HHSearch search hits versus the identity with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=0.005, j=3, database=nr)
Parameterset 3

Run on Psi-Blast-Output of parameter set 3.
Found 6 positives.

Figure 28: Plot of the frequency of HHSearch search hits versus the coverage with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=0.005, j=5, database=nr)
Figure 29: Plot of the frequency of HHSearch search hits versus the score with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=0.005, j=5, database=nr)
Figure 30: Plot of the frequency of HHSearch search hits versus the identity with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=0.005, j=5, database=nr)
Parameterset 4

Run on Psi-Blast-Output of parameter set 4.
Found 6 positives.

Figure 31: Plot of the frequency of HHSearch search hits versus the coverage with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=10E-6, j=5, database=nr)
Figure 32: Plot of the frequency of HHSearch search hits versus the score with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=10E-6, j=5, database=nr)
Figure 33: Plot of the frequency of HHSearch search hits versus the identity with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=10E-6, j=5, database=nr)

Comparing the Results

Defining True Positives

HSSP - Some Positives

Getting the entry of PAH from HSSP http://mrs.cmbi.ru.nl/mrs-5/entry?db=hssp&id=2pah&q=phenylalanine%20hydroxylase

HSSP - More Positives

HHSearch was run with a dataset of the PDB. BLAST was run with the a dataset of the NR database. NR itself is a mixture of SwissProt, RefSeq, PIR, PRF, PDB and GenBank CDS translations. HSSP itself contains only references to SwissProt entries. Therefore it was necessary to get the cross-references of the Swissprot entries in HSSP to the other databases. For this purpose we created a java-tool.

Overlap

unfiltered sets:

Nr. Method/Parameter-Set Size of Result-Set
1 psi-blast e=10E-6 i=3 499
2 fasta 499
3 hhsearch e=10E-6 i=3 10
4 hhsearch simple 496
5 psi-blast e=0.005 i=5 499
6 hhsearch (psi-blast: e=0.005 i=5) 10
7 blast 499
8 psi-blast e=10E-6 i=5 499
9 hhsearch (psi-blast e=0.005 i=3) 494
10 hhsearch (psi-blast e=0.005 i=3) 499
11 hhsearch (psi-blast e=10E-6 i=5) 10


Nr./Nr. 1 2 3 4 5 6 7 8 9 10 11
0 499 414 3 3 485 3 405 482 3 494 3
1 414 499 4 4 410 4 478 406 4 413 4
2 3 4 10 10 3 10 4 3 10 3 10
3 3 4 10 496 3 10 4 3 58 3 10
4 485 410 3 3 499 3 401 490 3 488 3
5 3 4 10 10 3 10 4 3 10 3 10
6 405 478 4 4 401 4 499 397 4 404 4
7 482 406 3 3 490 3 397 499 3 486 3
8 3 4 10 58 3 10 4 3 494 3 10
9 494 413 3 3 488 3 404 486 3 499 3
10 3 4 10 10 3 10 4 3 10 3 10


positive sets:

Nr. Method/Parameter-Set Size of Result-Set
1 hhsearch (psi-blast e=0.005 i=3) 6
2 hhsearch (psi-blast e=10E-6 i=3) 6
3 fasta 224
4 psi-blast e=0.005 i=5 226
5 hhsearch - simple 6
6 hhsearch (psi-blast e=10E-6 i=5) 6
7 psi-blast e=10E-6 i=5 226
8 psi-blast e=0.005 i=3 223
9 blast 220
10 hhsearch (psi-blast e=0.005 i=5) 6
11 psi-blast e=10E-6 i=3 223


Nr./Nr. 1 2 3 4 5 6 7 8 9 10 11
0 6 6 4 3 6 6 3 3 4 6 3
1 6 6 4 3 6 6 3 3 4 6 3
2 4 4 224 193 4 4 192 194 211 4 194
3 3 3 193 226 3 3 224 220 188 3 219
4 6 6 4 3 6 6 3 3 4 6 3
5 6 6 4 3 6 6 3 3 4 6 3
6 3 3 192 224 3 3 226 219 187 3 217
7 3 3 194 220 3 3 219 223 188 3 220
8 4 4 211 188 4 4 187 188 220 4 188
9 6 6 4 3 6 6 3 3 4 6 3
10 3 3 194 219 3 3 217 220 188 3 223

Summary

BLAST is the fastest method. Its result is comparable to FASTA and PSI-BLAST.

It was mentioned that FASTA is more sensitive than BLAST. The number of found positives does not underlie this statement.

PSI-Pred is the slowest of the used methods on the base of the measured time. The final results of each iteration are comparable to a simple BLAST or FASTA search. Between the different runs it finds sometimes new positives, sometimes it loses positives and sometimes it finds formerly lost positives again. The ranks of the positives spread. It's not easy to dermine without the knowledge of the positives which results are good.

The only way to count the found positives (according to HSSP) is to use the cross-references in the SwissProt entries listed in HSSP. There are several problems concerning this method. Not all possible references might be annotated in the SwissProt entry. That means not every non-"positive" is automatically a false positive. Regarding the performance of an alignment method it's also necessary to regard the ranking of the positives. Therefore some non-positive gaps especially at the beginning (high score, high coverage, high identity, but non-positive) might be just wrongly annotated as non-positive.

HHSearch is in fact slower than PSI-Blast. We used for some of our HHSearch examples (1-4) PSI-Blast results. Adding the PSI-Blast runtime HHSearch is really slow. But it's performance is marvelous. It found only seven positives, but these are (in every parameter set) the first seven results and are clearly separated from the others by high scores. As already mentioned that's not a lot. But in HSSP are only (indirectly) 61 PDB entries listed. And probably some of them got lost by the filtering of the PDB HHM-database (less than 70 % sequence-identity).

In general it seems that PAH is a widely studied gene. Therefore it is quite easy to find targets. Even in the PDB were seven structures (of 61 possible) found, which are associated to PAH by HSSP.

With more difficult (less similar knowledge in the databases) queries the performance of the methods would separate more clearly. There PSI-BLAST and HHSearch would probably outperform BLAST and FASTA.

In my opinion HHSearch run with PSI-BLAST profiles is the tool of choice. The true positives can be more easily determined by the scores, which is necessary if searching for a virgin (nothing is known about the positives) and difficult sequence. HHSearch is part of the HHPred method, whose servers were on top in the last CASP-Competition. This shows its capabilities in finding reliable alignments, which could even be used for homology modeling.

Task 2.2: Multiple sequence alignments

Methodology

For each identity range (99%-90%, 89%-60, 59%-40% and 39%-20%) we selected five sequences of which at least one sequence has a solved structure as well. In order to find these sequences we looked in our blast and psi-blast results for appropriate sequences, four for each identity range. A sequences was considered as appropriate if it has the wanted sequence identity to our reference sequence and if this sequence has roughly the same length as our reference sequence. The reason for this is that we wanted to avoid leading and trailing gaps in our alignments. However, this was not always possible , especially for the low identity range (39%-20%). Last but not least, we used our hhsearch results (which was queried against a profile database of all pdb entries) to find appropriate sequences for each identity range. The list of selected sequences is as follows:


Sequence Identity GI/PDB Remark
97% 4557819 phenylalanine-4-hydroxylase [Homosapiens]
95% 1J8T catalytic domain of human phenylalanine hydroxylase Fe(II)
94% 296212713 PREDICTED:phenylalanine-4-hydroxylase [Callithrix jacchus]
91% 296212713 PREDICTED:phenylalanine-4-hydroxylase-like [Ailuropodamelanoleuca]
90% 171543886 phenylalanine-4-hydroxylase [Musmusculus]
89% 126339754 PREDICTED: similar toPhenylalanine-4-hydroxylase (PAH) (Phe-4-monooxygenase)[Monodelphis domestica]
82% 224095451 PREDICTED: phenylalanine hydroxylase [Taeniopygia guttata]
73% 32442452 Pah [Danio rerio]
66% 3HF6 crystal structure of human tryptophan hydroxylase type 1 with bound LP-521834 and FE
60% 194865387 GG14450 [Drosophila erecta]
59% 2TOH TYROSINE HYDROXYLASE CATALYTIC AND TETRAMERIZATION DOMAINS FROM RAT
56% 4883767 PREDICTED: phenylalanine hydroxylase [Caenorhabditis elegans]
48% 74096365 tyrosine hydroxylase 2 [Takifugu rubripes]
48% 301617387 PREDICTED: tyrosine3-monooxygenase-like [Xenopus (Silurana) tropicalis]
45% 115532103 TryPtophan Hydroxylase familymember (tph-1) [Caenorhabditis elegans]
37% 85860954 prephenate dehydratase [Syntrophus aciditrophicus SB]
31% 114570403 phenylalanine 4-monooxygenase [Maricaulis maris MCS10]
30% 1LTV CRYSTAL STRUCTURE OF CHROMOBACTERIUM VIOLACEUM PHENYLALANINE HYDROXYLASE, STRUCTURE WITH BOUND OXIDIZED Fe(III)
25% 94312464 phenylalanine 4-monooxygenase[Cupriavidus metallidurans CH34]
23% 298578904 aromatic amino acid hydroxylase[Chlamydophila pneumoniae]


Then we created a FASTA file for each identity range which contains the 5 sequences selected and our PAH reference sequence. For this purpose we executed the following commands:

//99-90

cat reference_pah_aa.fasta gi_4557819.fasta pdb_1J8T.fasta gi_296212713.fasta gi_301759315.fasta gi_171543886.fasta > ../99_90.fasta

//89-60

cat reference_pah_aa.fasta gi_126339754.fasta gi_224095451.fasta gi_32442452.fasta pdb_3hf6.fasta gi_194865387.fasta > ../89_60.fasta

//59-40

cat reference_pah_aa.fasta pdb_2TOH.fasta gi_4883767.fasta gi_74096365.fasta gi_301617387.fasta gi_115532103.fasta > ../59_40.fasta

//39-20

cat reference_pah_aa.fasta gi_85860954.fasta gi_114570403.fasta pdb_1ltv_fasta gi_94312464.fasta gi_298578904.fasta > ../39_20.fasta


Creating multiple sequence alignments with Cobalt

Installation

Before we could start to create our multiple sequence alignments (MSA) with Cobalt we had to install it first. We retrieved the latest copy from the NCBI server and installed it with the following commands:

//download of cobalt

wget ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/cobalt/*

//set enviroment var

COBALT=/home/student/master_praktikum/task2/task22/cobalt

//make cobalt.linux executable

chmod +x cobalt.linux

Execution

The execution of Cobalt was straight forward. To create our MSA we used the standard parameter as proposed in the README file. We used the following commands to create a MSA for each identity range:


$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 99_90.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_99_90.out

$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 89_60.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_89_60.out

$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 59_40.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_59_40.out

$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 39_20.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_39_20.out


Creating multiple sequence alignments with ClustalW

Installation

Installation was not necessary since ClustalW was already installed.

Execution

The execution of ClustalW was straight forward. To create our MSA we used the standard parameters. We used the following commands to create a MSA for each identity range:

clustalw -align -infile=99_90.fasta -type=PROTEIN -outfile=clustalw_out/clustalw_99_90.out

clustalw -align -infile=89_60.fasta -type=PROTEIN -outfile=clustalw_out/clustalw_89_60.out

clustalw -align -infile=59_40.fasta -type=PROTEIN -outfile=clustalw_out/clustalw_59_40.out

clustalw -align -infile=39_20.fasta -type=PROTEIN -outfile=clustalw_out/clustalw_39_20.out


Creating multiple sequence alignments with Muscle

Installation

Installation was not necessary since Muscle was already installed.

Execution

The execution of Muscle was straight forward. To create our MSA we used the standard parameters. We used the following commands to create a MSA for each identity range:

muscle -in 99_90.fasta -out muscle_out/muscle_99_90.out

muscle -in 89_60.fasta -out muscle_out/muscle_89_60.out

muscle -in 59_40.fasta -out muscle_out/muscle_59_40.out

muscle -in 39_20.fasta -out muscle_out/muscle_39_20.out


Creating multiple sequence alignments with T-Coffee

Installation

Installation was not necessary since T-Coffee was already installed.

Execution

The execution of T-Coffee was straight forward. The parameter “-method sap_pair,slow_pair” defines the method, which is in our case a MSA which considers the PDB file defined by the parameter “-template_file 1J8T”. We used the following commands to create a MSA for each identity range:

t_coffee 99_90.fasta -method sap_pair,slow_pair -template_file 1J8T

t_coffee 89_60.fasta -method sap_pair,slow_pair -template_file 3HF6

t_coffee 59_40.fasta -method sap_pair,slow_pair -template_file 2TOH

t_coffee 39_20.fasta -method sap_pair,slow_pair -template_file 1LTV


Results and discussion

Cobalt

MSA of 99-90% sequence identity
Figure 34: Multiple alignment calculated by Cobalt for the selected sequences with 99 to 90% sequence identity to PAH
Conservation

304 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps
Sequence ID Number of gaps
PAH 22
4557819 22
1J8T 149
296212713 1
301759315 23
171543886 21

Gaps are only introduced to the N- and C-Terminal ends of the MSA. There are no gaps in secondary structure elements of PAH.

MSA of 89-60% sequence identity
Figure 35: Multiple alignment calculated by Cobalt for the selected sequences with 89 to 60% sequence identity to PAH
Conservation

161 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps
Sequence ID Number of gaps
PAH 21
126339754 9
224095451 27
32442452 24
3HF6 183
194865387 21

Gaps are only introduced to the N- and C-Terminal ends of the MSA. There are no gaps in secondary structure elements of PAH.

MSA of 59-40% sequence identity
Figure 36: Multiple alignment calculated by Cobalt for the selected sequences with 59 to 40% sequence identity to PAH
Conservation

102 columns are conserved. The area around the catalytic center is also conserved (around position 290 on PAH).

Gaps
Sequence ID Number of gaps
PAH 57
2TOH 166
4883767 52
74096365 33
301617387 38
115532103 80

Most gaps are introduced to the N- and C-Terminal ends of the MSA. There is one gap in the two sheets from MSA position 372 to 393 of PAH.

MSA of 39-20% sequence identity
Figure 37: Multiple alignment calculated by Cobalt for the selected sequences with 39 to 20% sequence identity to PAH
Conservation

1 columns are conserved. The area around the catalytic center is in 4 of 5 sequences conserved. (position 290 on PAH).

Gaps
Sequence ID Number of gaps
PAH 19
85860954 117
114570403 178
1LTV 174
94312464 161
298578904 125

Most gaps are introduced to the N- and C-Terminal ends of the MSA. There are two gaps from MSA position 320 to 337 (helix) and from 417 to 420 (helix) in PAH.

ClustalW

MSA of 99-90% sequence identity
Figure 38: Multiple alignment calculated by ClustalW for the selected sequences with 99 to 90% sequence identity to PAH
Conservation

304 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps
Sequence ID Number of gaps
PAH 22
4557819 22
1J8T 149
296212713 1
301759315 23
171543886 21

Gaps are only introduced to the N- and C-Terminal ends of the MSA. There are no gaps in secondary structure elements of PAH.

MSA of 89-60% sequence identity
Figure 39: Multiple alignment calculated by ClustalW for the selected sequences with 89 to 60% sequence identity to PAH
Conservation

161 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps
Sequence ID Number of gaps
PAH 18
126339754 6
224095451 24
32442452 21
3HF6 180
194865387 18

Gaps are only introduced to the N- and C-Terminal ends of the MSA. There are no gaps in secondary structure elements of PAH.

MSA of 59-40% sequence identity
Figure 40: Multiple alignment calculated by ClustalW for the selected sequences with 59 to 40% sequence identity to PAH
Conservation

102 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps
Sequence ID Number of gaps
PAH 49
2TOH 158
4883767 44
74096365 25
301617387 40
115532103 72

Most gaps are introduced to the N- and C-Terminal ends of the MSA. Two gaps from MSA position 366 to 376 and from 379 to 384 are disrupting two sheets in PAH.

MSA of 39-20% sequence identity
Figure 41: Multiple alignment calculated by ClustalW for the selected sequences with 39 to 20% sequence identity to PAH
Conservation

2 columns are conserved. The area around the catalytic center is in 5 from 6 sequences conserved (position 290 on PAH).

Gaps
Sequence ID Number of gaps
PAH 20
85860954 118
114570403 179
1LTV 175
94312464 162
298578904 126

Most gaps are introduced to the N- and C-Terminal ends of the MSA. Two gaps from MSA position 308 to 324 and from 438 to 441 are disrupting one helix and one sheet in PAH.

Muscle

MSA of 99-90% sequence identity
Figure 42: Multiple alignment calculated by Muscle for the selected sequences with 99 to 90% sequence identity to PAH
Conservation

304 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps
Sequence ID Number of gaps
PAH 22
4557819 22
1J8T 149
296212713 1
301759315 23
171543886 21

Gaps are introduced to the N- and C-Terminal ends of the MSA. There are no gaps in secondary structure elements of PAH.

MSA of 89-60% sequence identity
Figure 43: Multiple alignment calculated by Muscle for the selected sequences with 89 to 60% sequence identity to PAH
Conservation

161 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps
Sequence ID Number of gaps
PAH 24
126339754 12
224095451 30
32442452 27
3HF6 186
194865387 24

Gaps are introduced to the N- and C-Terminal ends of the MSA. There are no gaps in secondary structure elements of PAH.

MSA of 59-40% sequence identity
Figure 44: Multiple alignment calculated by Muscle for the selected sequences with 59 to 40% sequence identity to PAH
Conservation

102 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps
Sequence ID Number of gaps
PAH 62
2TOH 171
4883767 57
74096365 38
301617387 43
115532103 85

Most gaps are introduced to the N- and C-Terminal ends of the MSA. There are two gaps from 376 to 386 and from 392 to 397 in two sheets.

MSA of 39-20% sequence identity
Figure 45: Multiple alignment calculated by Muscle for the selected sequences with 39 to 20% sequence identity to PAH
Conservation

7 columns are conserved. The area around the catalytic center is in 5 of 6 sequences conserved (position 290 on PAH).

Gaps
Sequence ID Number of gaps
PAH 46
85860954 144
114570403 205
1LTV 201
94312464 188
298578904 152

Most gaps are introduced to the N- and C-Terminal ends of the MSA. There are gaps from 232 to 239, from 251 to 256, from 342 to 359 and from 391 to 392 in two sheets.

T-Coffee

MSA of 99-90% sequence identity
Figure 46: Multiple alignment calculated by T-Coffee for the selected sequences with 99 to 90% sequence identity to PAH
Conservation

304 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps
Sequence ID Number of gaps
PAH 22
4557819 22
1J8T 149
296212713 1
301759315 23
171543886 21

Gaps are introduced to the N- and C-Terminal ends of the MSA. There are no gaps in the secondary structure elements of PAH.

MSA of 89-60% sequence identity
Figure 47: Multiple alignment calculated by T-Coffee for the selected sequences with 89 to 60% sequence identity to PAH
Conservation

160 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps
Sequence ID Number of gaps
PAH 20
126339754 8
224095451 26
32442452 23
3HF6 182
194865387 20

Gaps are introduced to the N- and C-Terminal ends of the MSA. There are no gaps in the secondary structure elements of PAH.

MSA of 59-40% sequence identity
Figure 48: Multiple alignment calculated by T-Coffee for the selected sequences with 59 to 40% sequence identity to PAH
Conservation

102 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps
Sequence ID Number of gaps
PAH 75
2TOH 184
4883767 70
74096365 51
301617387 56
115532103 98

Gaps are introduced to the N- and C-Terminal ends of the MSA. There are no gaps in the secondary structure elements of PAH.

MSA of 39-20% sequence identity
Figure 49: Multiple alignment calculated by T-Coffee for the selected sequences with 39 to 20% sequence identity to PAH
Conservation

5 columns are conserved. The area around the catalytic center is in 5 of 6 sequences conserved (position 290 on PAH).

Gaps
Sequence ID Number of gaps
PAH 109
85860954 207
114570403 268
1LTV 264
94312464 251
298578904 215

Most gaps are introduced to the N- and C-Terminal ends of the MSA. There are gaps from 224 to 228, from 375 to 377, from 412 to 422 and from 457 to 496 in PAH.