Task 2: Sequence alignments (sequence searches and multiple alignments)

Task description

The search for good alignments is often the first step in any bioinformatics application, e.g. homology modeling, function analysis, etc. By such an alignment the user wants to get more information about a probably new sequence. In an alignment two or more sequences are compared due to similarity. If two sequences are very similar due to an alignment the probability is high that these two sequences were derived from one ancestral sequence (homologs, orthologs, paralogs). Therefore it is possible to compare or assign annotations between the two sequences: function, catalytic site, secondary structure, etc.

There are some terms, which appear frequently together with alignments:

Homologs: Two sequences share a high sequence identity and are derived from the same ancestral sequence.
Orthologs: The sequences were separated by a speciation event.
Paralogs: The sequences were separated by a duplication event within a species.

If more sequences are compared in a multiple alignment, it can be possible to derive sites of the protein, which have not changed between these sequences. These sites are called conserved, which means that they did not change in the evolution from a common ancestral sequence. If more sequences are compared, it is easier to see blocks in the alignment. These blocks can be caused by domains of the protein, if amino acid sequences were compared. A domain can be the foundation of one of the the protein's functions, e. g. binding, enzymatic activity, etc.

The derived relationship between the sequences is somehow only a construct in mind. There is no and there will never be a complete record of evolution. Therefore the results of such an alignment can give hints, but should always be regarded critically.

There are several dynamic programming approaches to calculate optimal alignments. Optimal means in this case that the score of the alignment cannot be better with the used scoring scheme.

Smith-Waterman calculates a local alignment, that means only the highest scoring region is aligned.
Needleman-Wunsch calculates a global alignment, that means the whole sequence is regarded in the alignment.
Gotoh uses affine gap costs. In the general dynamic programming approaches there are penalties defined for introducing gaps in the alignment. Gotoh differentiate between opening a new gap and extending an already opened gap. That means, that Gotoh uses two different gap penalties for the two cases. By this scheme, blocks are more favored, since the penalty of opening a new gap is usually ten times the penalty of extending a gap.

These dynamic programming approaches calculate optimal alignments, but are very computationally intensive. It is not reasonable to use these approaches to search whole databases for similar sequences to a target sequence.

Therefore we use several heuristics to find similar sequences for the protein PAH. In the second part we select several of the found sequences to create a multiple alignment.

The full description of this task can be found here.

Task 2.1: Sequence searches

BLAST

A detailed description of BLAST can be found here

Running

time sudo blastall -p blastp -d '/data/blast/nr/nr' -i ./reference.fasta -o './reference.blast' -b 500

real 11m30.762s user 3m11.440s sys 0m12.250s

Results

Found true positives 220 of 500 predictions.

Figure 1: Plot of the frequency of BLAST search hits versus the coverage with the query sequence of PAH (Parameters: b=500, database=NR)

Figure 2: Plot of the frequency of BLAST search hits versus the score with the query sequence of PAH (Parameters: b=500, database=NR)

Figure 3: Plot of the frequency of BLAST search hits versus the identity with the query sequence of PAH (Parameters: b=500, database=NR)

FASTA

A detailed description of FASTA can be found here.

Installation

Used Virtual Box with Linux.

Download fasta3.tar.gz from ftp://ftp.ebi.ac.uk/pub/software/unix/fasta/
Unzip the archive
Build it with: make -f /apps/fasta/make/Makefile.linux64 all

Running

time ./fasta36 /home/student/reference.fasta /data/nr/nr

interactive:

Enter filename for results []: /home/student/reference.fasta_search
How many scores do you want to see: 500
More scores? 0
Display alignments also? (y/n) [n] y
number of alignments [500]? 500

real 10m13.878s user 7m26.270s sys 0m20.230s

Results

Found positives 224 of 500 predictions.

Figure 4: Plot of the frequency of FASTA search hits versus the coverage with the query sequence of PAH (Parameters: default, database: nr)

Figure 5: Plot of the frequency of FASTA search hits versus the coverage with the query sequence of PAH (Parameters: default, database: nr)

Figure 6: Plot of the frequency of FASTA search hits versus the coverage with the query sequence of PAH (Parameters: default, database: nr)

PSI-BLAST

A detailed description of PSI-BLAST can be found here.

Running

Parameterset 1

time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e10E-6_i3.blast' -h 10E-6 -j 3 -C './reference_i3_e10E-6.chk'

real 37m56.447s user 14m27.620s sys 0m54.620s

Parameterset 2

time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e005_i3.blast' -h 0.005 -j 3 -C './reference_i3_e005.chk'

real 37m41.487s user 14m42.850s sys 0m52.370s

Parameterset 3

time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e005_i5.blast' -h 0.005 -j 5 -C './reference_i5_e005.chk'

real 62m22.175s user 26m25.410s sys 1m20.700s

Parameterset 4

time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e10E-6_i5.blast' -h 10E-6 -j 5 -C './reference_i5_e10E-6.chk'

real 61m59.284s user 25m55.920s sys 1m21.620s

Results

Parameterset 1

3 Iterations, 10E-6 E-Value-Threshold
In iteration 2 220/500
In iteration 3 224/500
Psi-Blast had during the different iterations 255 different positives at least one time in its result list.

Figure 7: Plot of the frequency of PSI-BLAST search hits versus the coverage with the query sequence of PAH (Parameters: h=10E-6, j=3, database: nr)

Figure 8: Plot of the frequency of PSI-BLAST search hits versus the score with the query sequence of PAH (Parameters: h=10E-6, j=3, database: nr)

Figure 9: Plot of the frequency of PSI-BLAST search hits versus the identity with the query sequence of PAH (Parameters: h=10E-6, j=3, database: nr)

Parameterset 2

3 Iterations, 0.005 E-Value-Threshold
In iteration 2 220/500
In iteration 3 225/500
Psi-Blast had during the different iterations 256 different positives at least one time in its result list.

Figure 10: Plot of the frequency of PSI-BLAST search hits versus the coverage with the query sequence of PAH (Parameters: h=0.005, j=5, database: nr)

Figure 11: Plot of the frequency of PSI-BLAST search hits versus the score with the query sequence of PAH (Parameters: h=0.005, j=5, database: nr)

Figure 12: Plot of the frequency of PSI-BLAST search hits versus the identity with the query sequence of PAH (Parameters: h=0.005, j=5, database: nr)

Parameterset 3

5 Iterations, 0.005 E-Value-Threshold
In iteration 2 220/500
In iteration 3 225/500
In iteration 4 223/500
In iteration 5 225/500
Psi-Blast had during the different iterations 261 different positives at least one time in its result list.

Figure 13: Plot of the frequency of PSI-BLAST search hits versus the coverage with the query sequence of PAH (Parameters: h=0.005, j=5, database: nr)

Figure 14: Plot of the frequency of PSI-BLAST search hits versus the score with the query sequence of PAH (Parameters: h=0.005, j=5, database: nr)

Figure 15: Plot of the frequency of PSI-BLAST search hits versus the identity with the query sequence of PAH (Parameters: h=0.005, j=5, database: nr)

Parameterset 4

5 Iterations, 10E-6 E-Value-Threshold
In iteration 2 220/500
In iteration 3 224/500
In iteration 4 223/500
In iteration 5 226/500
Psi-Blast had during the different iterations 260 different positives at least one time in its result list.

Figure 16: Plot of the frequency of PSI-BLAST search hits versus the coverage with the query sequence of PAH (Parameters: h=10E-6, j=5, database: nr)

Figure 17: Plot of the frequency of PSI-BLAST search hits versus the score with the query sequence of PAH (Parameters: h=10E-6, j=5, database: nr)

Figure 18: Plot of the frequency of PSI-BLAST search hits versus the identity with the query sequence of PAH (Parameters: h=10E-6, j=5, database: nr)

HHSearch

A detailed description of HHSearch can be found here.

Installation

Preparing the HHM-Database

Download pdb70_29May10.hhm.tar.gz from ftp://ftp.tuebingen.mpg.de/pub/protevo/HHsearch/databases/
Unzip the archive
Make a database: cat *.hhm >> pdb70.db
Move the db to an appropriate directory: sudo mv pdb70.db ../pdb70.db

Configure HHSearch-Tools

In the manual of HHSearch it was adviced to add the information of the secondary structure to the multiple alignment used for the query. Therefore it was necessary to run the addpsipred script of HHSearch. This script was not configured in the virtual box. Several parameters have to be adjusted.

Changes in /apps/bin/addpsipred:

my $psipreddir="/apps/psipred_2.5"; my $ncbidir="/apps/blast_old/bin"; my $perl="/apps/bin"; my $dummydb="/home/student/tmp";

Copy /apps/bin/reformat to /apps/bin/reformat.pl

Running

Parameterset 0

time hhsearch -i ./reference.fasta -d /data/hmm/pdb70.db -b 500 -o ./reference_simple.hhsearch

real 8m33.171s user 5m14.530s sys 0m3.510s

Parameterset 1

alignblast reference_psi_e10E-6_i3.blast reference_psi_e10E-6_i3.a3m
addpsipred /home/student/workspace/reference_psi_e10E-6_i3.a3m
time hhsearch -i reference_psi_e10E-6_i3.a3m -d /data/hmm/pdb70.db -o reference_psi_e10E-6_i3.hhsearch

real 16m27.258s user 7m47.220s sys 0m6.290s

To run HHSearch with the profile of a PSI-BLAST search the output of PSI-BLAST has to be converted to a multiple alignment by the script alignblast. The performance of HHSearch can be increases by adding the information of the secondary structure to the multiple alignment. This can be done with the script addspipred (uses the secondary structure prediction of PSIPRED).

Parameterset 2

alignblast reference_psi_e005_i3.blast reference_psi_e005_i3.a3m
addpsipred /home/student/workspace/reference_psi_e005_i3.a3m
time hhsearch -i reference_psi_e005_i3.a3m -d /data/hmm/pdb70.db -o reference_psi_e005_i3.hhsearch

real 16m7.216s user 7m41.840s sys 0m5.570s

Parameterset 3

alignblast reference_psi_e005_i5.blast reference_psi_e005_i5.blast.a3m
addpsipred /home/student/workspace/reference_psi_e005_i5.blast.a3m
time hhsearch -i reference_psi_e005_i5.blast.a3m -d /data/hmm/pdb70.db -o reference_psi_e005_i5.blast.hhsearch

real 7m49.907s user 7m15.310s sys 0m4.320s

Parameterset 4

alignblast reference_psi_e10E-6_i5.blast reference_psi_e10E-6_i5.a3m
addpsipred /home/student/workspace/reference_psi_e10E-6_i5.a3m
time hhsearch -i reference_psi_e10E-6_i5.a3m -d /data/hmm/pdb70.db -o reference_psi_e10E-6_i5.hhsearch

real 8m10.730s user 7m33.190s sys 0m5.390s

Results

Parameterset 0

Run on Simple Fasta-File.
Found 6 positives.

Figure 19: Plot of the frequency of HHSearch search hits versus the coverage with the query sequence of PAH (parameters: b=500, database=PDB70, i=fasta sequence of PAH)

Figure 20: Plot of the frequency of HHSearch search hits versus the score with the query sequence of PAH (parameters: b=500, database=PDB70, i=fasta sequence of PAH)

Figure 21: Plot of the frequency of HHSearch search hits versus the identity with the query sequence of PAH (parameters: b=500, database=PDB70, i=fasta sequence of PAH)

Parameterset 1

Run on Psi-Blast-Output of parameter set 1.
Found 6 positives.

Figure 22: Plot of the frequency of HHSearch search hits versus the coverage with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=10E-6, j=3, database=nr)

Figure 23: Plot of the frequency of HHSearch search hits versus the score with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=10E-6, j=3, database=nr)

Figure 24: Plot of the frequency of HHSearch search hits versus the identity with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=10E-6, j=3, database=nr)

Parameterset 2

Run on Psi-Blast-Output of parameter set 2.
Found 6 positives.

Figure 25: Plot of the frequency of HHSearch search hits versus the coverage with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=0.005, j=3, database=nr)

Figure 26: Plot of the frequency of HHSearch search hits versus the score with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=0.005, j=3, database=nr)

Figure 27: Plot of the frequency of HHSearch search hits versus the identity with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=0.005, j=3, database=nr)

Parameterset 3

Run on Psi-Blast-Output of parameter set 3.
Found 6 positives.

Figure 28: Plot of the frequency of HHSearch search hits versus the coverage with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=0.005, j=5, database=nr)

Figure 29: Plot of the frequency of HHSearch search hits versus the score with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=0.005, j=5, database=nr)

Figure 30: Plot of the frequency of HHSearch search hits versus the identity with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=0.005, j=5, database=nr)

Parameterset 4

Run on Psi-Blast-Output of parameter set 4.
Found 6 positives.

Figure 31: Plot of the frequency of HHSearch search hits versus the coverage with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=10E-6, j=5, database=nr)

Figure 32: Plot of the frequency of HHSearch search hits versus the score with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=10E-6, j=5, database=nr)

Figure 33: Plot of the frequency of HHSearch search hits versus the identity with the query sequence of PAH (HHSearch parameters: database=PDB70, i=a3m profile of PSI-BLAST run with sequence of PAH; PSI-BLAST parameters: h=10E-6, j=5, database=nr)

Comparing the Results

Defining True Positives

HSSP - Some Positives

Getting the entry of PAH from HSSP http://mrs.cmbi.ru.nl/mrs-5/entry?db=hssp&id=2pah&q=phenylalanine%20hydroxylase

HSSP - More Positives

HHSearch was run with a dataset of the PDB. BLAST was run with the a dataset of the NR database. NR itself is a mixture of SwissProt, RefSeq, PIR, PRF, PDB and GenBank CDS translations. HSSP itself contains only references to SwissProt entries. Therefore it was necessary to get the cross-references of the Swissprot entries in HSSP to the other databases. For this purpose we created a java-tool.

Overlap

unfiltered sets:

Nr.	Method/Parameter-Set	Size of Result-Set
1	psi-blast e=10E-6 i=3	499
2	fasta	499
3	hhsearch e=10E-6 i=3	10
4	hhsearch simple	496
5	psi-blast e=0.005 i=5	499
6	hhsearch (psi-blast: e=0.005 i=5)	10
7	blast	499
8	psi-blast e=10E-6 i=5	499
9	hhsearch (psi-blast e=0.005 i=3)	494
10	hhsearch (psi-blast e=0.005 i=3)	499
11	hhsearch (psi-blast e=10E-6 i=5)	10

Nr./Nr.	1	2	3	4	5	6	7	8	9	10	11
0	499	414	3	3	485	3	405	482	3	494	3
1	414	499	4	4	410	4	478	406	4	413	4
2	3	4	10	10	3	10	4	3	10	3	10
3	3	4	10	496	3	10	4	3	58	3	10
4	485	410	3	3	499	3	401	490	3	488	3
5	3	4	10	10	3	10	4	3	10	3	10
6	405	478	4	4	401	4	499	397	4	404	4
7	482	406	3	3	490	3	397	499	3	486	3
8	3	4	10	58	3	10	4	3	494	3	10
9	494	413	3	3	488	3	404	486	3	499	3
10	3	4	10	10	3	10	4	3	10	3	10

positive sets:

Nr.	Method/Parameter-Set	Size of Result-Set
1	hhsearch (psi-blast e=0.005 i=3)	6
2	hhsearch (psi-blast e=10E-6 i=3)	6
3	fasta	224
4	psi-blast e=0.005 i=5	226
5	hhsearch - simple	6
6	hhsearch (psi-blast e=10E-6 i=5)	6
7	psi-blast e=10E-6 i=5	226
8	psi-blast e=0.005 i=3	223
9	blast	220
10	hhsearch (psi-blast e=0.005 i=5)	6
11	psi-blast e=10E-6 i=3	223

Nr./Nr.	1	2	3	4	5	6	7	8	9	10	11
0	6	6	4	3	6	6	3	3	4	6	3
1	6	6	4	3	6	6	3	3	4	6	3
2	4	4	224	193	4	4	192	194	211	4	194
3	3	3	193	226	3	3	224	220	188	3	219
4	6	6	4	3	6	6	3	3	4	6	3
5	6	6	4	3	6	6	3	3	4	6	3
6	3	3	192	224	3	3	226	219	187	3	217
7	3	3	194	220	3	3	219	223	188	3	220
8	4	4	211	188	4	4	187	188	220	4	188
9	6	6	4	3	6	6	3	3	4	6	3
10	3	3	194	219	3	3	217	220	188	3	223

Summary

BLAST is the fastest method. Its result is comparable to FASTA and PSI-BLAST.

It was mentioned that FASTA is more sensitive than BLAST. The number of found positives does not underlie this statement.

PSI-Pred is the slowest of the used methods on the base of the measured time. The final results of each iteration are comparable to a simple BLAST or FASTA search. Between the different runs it finds sometimes new positives, sometimes it loses positives and sometimes it finds formerly lost positives again. The ranks of the positives spread. It's not easy to dermine without the knowledge of the positives which results are good.

The only way to count the found positives (according to HSSP) is to use the cross-references in the SwissProt entries listed in HSSP. There are several problems concerning this method. Not all possible references might be annotated in the SwissProt entry. That means not every non-"positive" is automatically a false positive. Regarding the performance of an alignment method it's also necessary to regard the ranking of the positives. Therefore some non-positive gaps especially at the beginning (high score, high coverage, high identity, but non-positive) might be just wrongly annotated as non-positive.

HHSearch is in fact slower than PSI-Blast. We used for some of our HHSearch examples (1-4) PSI-Blast results. Adding the PSI-Blast runtime HHSearch is really slow. But it's performance is marvelous. It found only seven positives, but these are (in every parameter set) the first seven results and are clearly separated from the others by high scores. As already mentioned that's not a lot. But in HSSP are only (indirectly) 61 PDB entries listed. And probably some of them got lost by the filtering of the PDB HHM-database (less than 70 % sequence-identity).

In general it seems that PAH is a widely studied gene. Therefore it is quite easy to find targets. Even in the PDB were seven structures (of 61 possible) found, which are associated to PAH by HSSP.

With more difficult (less similar knowledge in the databases) queries the performance of the methods would separate more clearly. There PSI-BLAST and HHSearch would probably outperform BLAST and FASTA.

In my opinion HHSearch run with PSI-BLAST profiles is the tool of choice. The true positives can be more easily determined by the scores, which is necessary if searching for a virgin (nothing is known about the positives) and difficult sequence. HHSearch is part of the HHPred method, whose servers were on top in the last CASP-Competition. This shows its capabilities in finding reliable alignments, which could even be used for homology modeling.

Task 2.2: Multiple sequence alignments

Methodology

For each identity range (99%-90%, 89%-60, 59%-40% and 39%-20%) we selected five sequences of which at least one sequence has a solved structure as well. In order to find these sequences we looked in our blast and psi-blast results for appropriate sequences, four for each identity range. A sequences was considered as appropriate if it has the wanted sequence identity to our reference sequence and if this sequence has roughly the same length as our reference sequence. The reason for this is that we wanted to avoid leading and trailing gaps in our alignments. However, this was not always possible , especially for the low identity range (39%-20%). Last but not least, we used our hhsearch results (which was queried against a profile database of all pdb entries) to find appropriate sequences for each identity range. The list of selected sequences is as follows:

Sequence Identity	GI/PDB	Remark
97%	4557819	phenylalanine-4-hydroxylase [Homosapiens]
95%	1J8T	catalytic domain of human phenylalanine hydroxylase Fe(II)
94%	296212713	PREDICTED:phenylalanine-4-hydroxylase [Callithrix jacchus]
91%	296212713	PREDICTED:phenylalanine-4-hydroxylase-like [Ailuropodamelanoleuca]
90%	171543886	phenylalanine-4-hydroxylase [Musmusculus]
89%	126339754	PREDICTED: similar toPhenylalanine-4-hydroxylase (PAH) (Phe-4-monooxygenase)[Monodelphis domestica]
82%	224095451	PREDICTED: phenylalanine hydroxylase [Taeniopygia guttata]
73%	32442452	Pah [Danio rerio]
66%	3HF6	crystal structure of human tryptophan hydroxylase type 1 with bound LP-521834 and FE
60%	194865387	GG14450 [Drosophila erecta]
59%	2TOH	TYROSINE HYDROXYLASE CATALYTIC AND TETRAMERIZATION DOMAINS FROM RAT
56%	4883767	PREDICTED: phenylalanine hydroxylase [Caenorhabditis elegans]
48%	74096365	tyrosine hydroxylase 2 [Takifugu rubripes]
48%	301617387	PREDICTED: tyrosine3-monooxygenase-like [Xenopus (Silurana) tropicalis]
45%	115532103	TryPtophan Hydroxylase familymember (tph-1) [Caenorhabditis elegans]
37%	85860954	prephenate dehydratase [Syntrophus aciditrophicus SB]
31%	114570403	phenylalanine 4-monooxygenase [Maricaulis maris MCS10]
30%	1LTV	CRYSTAL STRUCTURE OF CHROMOBACTERIUM VIOLACEUM PHENYLALANINE HYDROXYLASE, STRUCTURE WITH BOUND OXIDIZED Fe(III)
25%	94312464	phenylalanine 4-monooxygenase[Cupriavidus metallidurans CH34]
23%	298578904	aromatic amino acid hydroxylase[Chlamydophila pneumoniae]

Then we created a FASTA file for each identity range which contains the 5 sequences selected and our PAH reference sequence. For this purpose we executed the following commands:

//99-90

cat reference_pah_aa.fasta gi_4557819.fasta pdb_1J8T.fasta gi_296212713.fasta gi_301759315.fasta gi_171543886.fasta > ../99_90.fasta

//89-60

cat reference_pah_aa.fasta gi_126339754.fasta gi_224095451.fasta gi_32442452.fasta pdb_3hf6.fasta gi_194865387.fasta > ../89_60.fasta

//59-40

cat reference_pah_aa.fasta pdb_2TOH.fasta gi_4883767.fasta gi_74096365.fasta gi_301617387.fasta gi_115532103.fasta > ../59_40.fasta

//39-20

cat reference_pah_aa.fasta gi_85860954.fasta gi_114570403.fasta pdb_1ltv_fasta gi_94312464.fasta gi_298578904.fasta > ../39_20.fasta

Creating multiple sequence alignments with Cobalt

Installation

Before we could start to create our multiple sequence alignments (MSA) with Cobalt we had to install it first. We retrieved the latest copy from the NCBI server and installed it with the following commands:

//download of cobalt

wget ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/cobalt/*

//set enviroment var

COBALT=/home/student/master_praktikum/task2/task22/cobalt

//make cobalt.linux executable

chmod +x cobalt.linux

Execution

The execution of Cobalt was straight forward. To create our MSA we used the standard parameter as proposed in the README file. We used the following commands to create a MSA for each identity range:

$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 99_90.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_99_90.out

$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 89_60.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_89_60.out

$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 59_40.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_59_40.out

$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 39_20.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_39_20.out

Creating multiple sequence alignments with ClustalW

Installation

Installation was not necessary since ClustalW was already installed.

Execution

The execution of ClustalW was straight forward. To create our MSA we used the standard parameters. We used the following commands to create a MSA for each identity range:

clustalw -align -infile=99_90.fasta -type=PROTEIN -outfile=clustalw_out/clustalw_99_90.out

clustalw -align -infile=89_60.fasta -type=PROTEIN -outfile=clustalw_out/clustalw_89_60.out

clustalw -align -infile=59_40.fasta -type=PROTEIN -outfile=clustalw_out/clustalw_59_40.out

clustalw -align -infile=39_20.fasta -type=PROTEIN -outfile=clustalw_out/clustalw_39_20.out

Creating multiple sequence alignments with Muscle

Installation

Installation was not necessary since Muscle was already installed.

Execution

The execution of Muscle was straight forward. To create our MSA we used the standard parameters. We used the following commands to create a MSA for each identity range:

muscle -in 99_90.fasta -out muscle_out/muscle_99_90.out

muscle -in 89_60.fasta -out muscle_out/muscle_89_60.out

muscle -in 59_40.fasta -out muscle_out/muscle_59_40.out

muscle -in 39_20.fasta -out muscle_out/muscle_39_20.out

Creating multiple sequence alignments with T-Coffee

Installation

Installation was not necessary since T-Coffee was already installed.

Execution

The execution of T-Coffee was straight forward. The parameter “-method sap_pair,slow_pair” defines the method, which is in our case a MSA which considers the PDB file defined by the parameter “-template_file 1J8T”. We used the following commands to create a MSA for each identity range:

t_coffee 99_90.fasta -method sap_pair,slow_pair -template_file 1J8T

t_coffee 89_60.fasta -method sap_pair,slow_pair -template_file 3HF6

t_coffee 59_40.fasta -method sap_pair,slow_pair -template_file 2TOH

t_coffee 39_20.fasta -method sap_pair,slow_pair -template_file 1LTV

Results and discussion

Cobalt

MSA of 99-90% sequence identity

Figure 34: Multiple alignment calculated by Cobalt for the selected sequences with 99 to 90% sequence identity to PAH

Conservation

304 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps

Sequence ID	Number of gaps
PAH	22
4557819	22
1J8T	149
296212713	1
301759315	23
171543886	21

Gaps are only introduced to the N- and C-Terminal ends of the MSA. There are no gaps in secondary structure elements of PAH.

MSA of 89-60% sequence identity

Figure 35: Multiple alignment calculated by Cobalt for the selected sequences with 89 to 60% sequence identity to PAH

Conservation

161 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps

Sequence ID	Number of gaps
PAH	21
126339754	9
224095451	27
32442452	24
3HF6	183
194865387	21

Gaps are only introduced to the N- and C-Terminal ends of the MSA. There are no gaps in secondary structure elements of PAH.

MSA of 59-40% sequence identity

Figure 36: Multiple alignment calculated by Cobalt for the selected sequences with 59 to 40% sequence identity to PAH

Conservation

102 columns are conserved. The area around the catalytic center is also conserved (around position 290 on PAH).

Gaps

Sequence ID	Number of gaps
PAH	57
2TOH	166
4883767	52
74096365	33
301617387	38
115532103	80

Most gaps are introduced to the N- and C-Terminal ends of the MSA. There is one gap in the two sheets from MSA position 372 to 393 of PAH.

MSA of 39-20% sequence identity

Figure 37: Multiple alignment calculated by Cobalt for the selected sequences with 39 to 20% sequence identity to PAH

Conservation

1 columns are conserved. The area around the catalytic center is in 4 of 5 sequences conserved. (position 290 on PAH).

Gaps

Sequence ID	Number of gaps
PAH	19
85860954	117
114570403	178
1LTV	174
94312464	161
298578904	125

Most gaps are introduced to the N- and C-Terminal ends of the MSA. There are two gaps from MSA position 320 to 337 (helix) and from 417 to 420 (helix) in PAH.

ClustalW

MSA of 99-90% sequence identity

Figure 38: Multiple alignment calculated by ClustalW for the selected sequences with 99 to 90% sequence identity to PAH

Conservation

304 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps

Sequence ID	Number of gaps
PAH	22
4557819	22
1J8T	149
296212713	1
301759315	23
171543886	21

Gaps are only introduced to the N- and C-Terminal ends of the MSA. There are no gaps in secondary structure elements of PAH.

MSA of 89-60% sequence identity

Figure 39: Multiple alignment calculated by ClustalW for the selected sequences with 89 to 60% sequence identity to PAH

Conservation

161 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps

Sequence ID	Number of gaps
PAH	18
126339754	6
224095451	24
32442452	21
3HF6	180
194865387	18

Gaps are only introduced to the N- and C-Terminal ends of the MSA. There are no gaps in secondary structure elements of PAH.

MSA of 59-40% sequence identity

Figure 40: Multiple alignment calculated by ClustalW for the selected sequences with 59 to 40% sequence identity to PAH

Conservation

102 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps

Sequence ID	Number of gaps
PAH	49
2TOH	158
4883767	44
74096365	25
301617387	40
115532103	72

Most gaps are introduced to the N- and C-Terminal ends of the MSA. Two gaps from MSA position 366 to 376 and from 379 to 384 are disrupting two sheets in PAH.

MSA of 39-20% sequence identity

Figure 41: Multiple alignment calculated by ClustalW for the selected sequences with 39 to 20% sequence identity to PAH

Conservation

2 columns are conserved. The area around the catalytic center is in 5 from 6 sequences conserved (position 290 on PAH).

Gaps

Sequence ID	Number of gaps
PAH	20
85860954	118
114570403	179
1LTV	175
94312464	162
298578904	126

Most gaps are introduced to the N- and C-Terminal ends of the MSA. Two gaps from MSA position 308 to 324 and from 438 to 441 are disrupting one helix and one sheet in PAH.

Muscle

MSA of 99-90% sequence identity

Figure 42: Multiple alignment calculated by Muscle for the selected sequences with 99 to 90% sequence identity to PAH

Conservation

304 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps

Sequence ID	Number of gaps
PAH	22
4557819	22
1J8T	149
296212713	1
301759315	23
171543886	21

Gaps are introduced to the N- and C-Terminal ends of the MSA. There are no gaps in secondary structure elements of PAH.

MSA of 89-60% sequence identity

Figure 43: Multiple alignment calculated by Muscle for the selected sequences with 89 to 60% sequence identity to PAH

Conservation

161 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps

Sequence ID	Number of gaps
PAH	24
126339754	12
224095451	30
32442452	27
3HF6	186
194865387	24

Gaps are introduced to the N- and C-Terminal ends of the MSA. There are no gaps in secondary structure elements of PAH.

MSA of 59-40% sequence identity

Figure 44: Multiple alignment calculated by Muscle for the selected sequences with 59 to 40% sequence identity to PAH

Conservation

102 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps

Sequence ID	Number of gaps
PAH	62
2TOH	171
4883767	57
74096365	38
301617387	43
115532103	85

Most gaps are introduced to the N- and C-Terminal ends of the MSA. There are two gaps from 376 to 386 and from 392 to 397 in two sheets.

MSA of 39-20% sequence identity

Figure 45: Multiple alignment calculated by Muscle for the selected sequences with 39 to 20% sequence identity to PAH

Conservation

7 columns are conserved. The area around the catalytic center is in 5 of 6 sequences conserved (position 290 on PAH).

Gaps

Sequence ID	Number of gaps
PAH	46
85860954	144
114570403	205
1LTV	201
94312464	188
298578904	152

Most gaps are introduced to the N- and C-Terminal ends of the MSA. There are gaps from 232 to 239, from 251 to 256, from 342 to 359 and from 391 to 392 in two sheets.

T-Coffee

MSA of 99-90% sequence identity

Figure 46: Multiple alignment calculated by T-Coffee for the selected sequences with 99 to 90% sequence identity to PAH

Conservation

304 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps

Sequence ID	Number of gaps
PAH	22
4557819	22
1J8T	149
296212713	1
301759315	23
171543886	21

Gaps are introduced to the N- and C-Terminal ends of the MSA. There are no gaps in the secondary structure elements of PAH.

MSA of 89-60% sequence identity

Figure 47: Multiple alignment calculated by T-Coffee for the selected sequences with 89 to 60% sequence identity to PAH

Conservation

160 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps

Sequence ID	Number of gaps
PAH	20
126339754	8
224095451	26
32442452	23
3HF6	182
194865387	20

Gaps are introduced to the N- and C-Terminal ends of the MSA. There are no gaps in the secondary structure elements of PAH.

MSA of 59-40% sequence identity

Figure 48: Multiple alignment calculated by T-Coffee for the selected sequences with 59 to 40% sequence identity to PAH

Conservation

102 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).

Gaps

Sequence ID	Number of gaps
PAH	75
2TOH	184
4883767	70
74096365	51
301617387	56
115532103	98

Gaps are introduced to the N- and C-Terminal ends of the MSA. There are no gaps in the secondary structure elements of PAH.

MSA of 39-20% sequence identity

Figure 49: Multiple alignment calculated by T-Coffee for the selected sequences with 39 to 20% sequence identity to PAH

Conservation

5 columns are conserved. The area around the catalytic center is in 5 of 6 sequences conserved (position 290 on PAH).

Gaps

Sequence ID	Number of gaps
PAH	109
85860954	207
114570403	268
1LTV	264
94312464	251
298578904	215

Most gaps are introduced to the N- and C-Terminal ends of the MSA. There are gaps from 224 to 228, from 375 to 377, from 412 to 422 and from 457 to 496 in PAH.

Task 2: Sequence alignments (sequence searches and multiple alignments)

Contents

Task description

Task 2.1: Sequence searches

BLAST

Running

Results

FASTA

Installation

Running

Results

PSI-BLAST

Running

Parameterset 1

Parameterset 2

Parameterset 3

Parameterset 4

Results

Parameterset 1

Parameterset 2

Parameterset 3

Parameterset 4

HHSearch

Installation

Preparing the HHM-Database

Configure HHSearch-Tools

Running

Parameterset 0

Parameterset 1

Parameterset 2

Parameterset 3

Parameterset 4

Results

Parameterset 0

Parameterset 1

Parameterset 2

Parameterset 3

Parameterset 4

Comparing the Results

Defining True Positives

HSSP - Some Positives

HSSP - More Positives

Overlap

Summary

Task 2.2: Multiple sequence alignments

Methodology

Creating multiple sequence alignments with Cobalt

Installation

Execution

Creating multiple sequence alignments with ClustalW

Installation

Execution

Creating multiple sequence alignments with Muscle

Installation

Execution

Creating multiple sequence alignments with T-Coffee

Installation

Execution

Results and discussion

Cobalt

MSA of 99-90% sequence identity

Conservation

Gaps

MSA of 89-60% sequence identity

Conservation

Gaps

MSA of 59-40% sequence identity

Conservation

Gaps

MSA of 39-20% sequence identity

Conservation

Gaps

ClustalW

MSA of 99-90% sequence identity

Conservation

Gaps

MSA of 89-60% sequence identity

Conservation

Gaps

MSA of 59-40% sequence identity