Task 2: Sequence alignments (sequence searches and multiple alignments)
Contents
Task description
The search for good alignments is often the first step in any bioinformatics application, e.g. homology modeling, function analysis, etc. By such an alignment the user wants to get more information about a probably new sequence. In an alignment two or more sequences are compared due to similarity. If two sequences are very similar due to an alignment the probability is high that these two sequences were derived from one ancestral sequence (homologs, orthologs, paralogs). Therefore it is possible to compare or assign annotations between the two sequences: function, catalytic site, secondary structure, etc.
There are some terms, which appear frequently together with alignments:
- Homologs: Two sequences share a high sequence identity and are derived from the same ancestral sequence.
- Orthologs: The sequences were separated by a speciation event.
- Paralogs: The sequences were separated by a duplication event within a species.
If more sequences are compared in a multiple alignment, it can be possible to derive sites of the protein, which have not changed between these sequences. These sites are called conserved, which means that they did not change in the evolution from a common ancestral sequence. If more sequences are compared, it is easier to see blocks in the alignment. These blocks can be caused by domains of the protein, if amino acid sequences were compared. A domain can be the foundation of one of the the protein's functions, e. g. binding, enzymatic activity, etc.
The derived relationship between the sequences is somehow only a construct in mind. There is no and there will never be a complete record of evolution. Therefore the results of such an alignment can give hints, but should always be regarded critically.
There are several dynamic programming approaches to calculate optimal alignments. Optimal means in this case that the score of the alignment cannot be better with the used scoring scheme.
- Smith-Waterman calculates a local alignment, that means only the highest scoring region is aligned.
- Needleman-Wunsch calculates a global alignment, that means the whole sequence is regarded in the alignment.
- Gotoh uses affine gap costs. In the general dynamic programming approaches there are penalties defined for introducing gaps in the alignment. Gotoh differentiate between opening a new gap and extending an already opened gap. That means, that Gotoh uses two different gap penalties for the two cases. By this scheme, blocks are more favored, since the penalty of opening a new gap is usually ten times the penalty of extending a gap.
These dynamic programming approaches calculate optimal alignments, but are very computationally intensive. It is not reasonable to use these approaches to search whole databases for similar sequences to a target sequence.
Therefore we use several heuristics to find similar sequences for the protein PAH. In the second part we select several of the found sequences to create a multiple alignment.
The full description of this task can be found here.
Task 2.1: Sequence searches
BLAST
A detailed description of BLAST can be found here
Running
time sudo blastall -p blastp -d '/data/blast/nr/nr' -i ./reference.fasta -o './reference.blast' -b 500
real 11m30.762s
user 3m11.440s
sys 0m12.250s
Results
Found true positives 220 of 500 predictions.
FASTA
A detailed description of FASTA can be found here.
Installation
- Used Virtual Box with Linux.
- Download fasta3.tar.gz from ftp://ftp.ebi.ac.uk/pub/software/unix/fasta/
- Unzip the archive
- Build it with: make -f /apps/fasta/make/Makefile.linux64 all
Running
time ./fasta36 /home/student/reference.fasta /data/nr/nr
interactive:
- Enter filename for results []: /home/student/reference.fasta_search
- How many scores do you want to see: 500
- More scores? 0
- Display alignments also? (y/n) [n] y
- number of alignments [500]? 500
real 10m13.878s
user 7m26.270s
sys 0m20.230s
Results
Found positives 224 of 500 predictions.
PSI-BLAST
A detailed description of PSI-BLAST can be found here.
Running
Parameterset 1
time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e10E-6_i3.blast' -h 10E-6 -j 3 -C './reference_i3_e10E-6.chk'
real 37m56.447s
user 14m27.620s
sys 0m54.620s
Parameterset 2
time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e005_i3.blast' -h 0.005 -j 3 -C './reference_i3_e005.chk'
real 37m41.487s
user 14m42.850s
sys 0m52.370s
Parameterset 3
time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e005_i5.blast' -h 0.005 -j 5 -C './reference_i5_e005.chk'
real 62m22.175s
user 26m25.410s
sys 1m20.700s
Parameterset 4
time blastpgp -d '/data/nr/nr' -i './reference.fasta' -o './reference_psi_e10E-6_i5.blast' -h 10E-6 -j 5 -C './reference_i5_e10E-6.chk'
real 61m59.284s
user 25m55.920s
sys 1m21.620s
Results
Parameterset 1
3 Iterations, 10E-6 E-Value-Threshold
In iteration 2 220/500
In iteration 3 224/500
Psi-Blast had during the different iterations 255 different positives
at least one time in its result list.
Parameterset 2
3 Iterations, 0.005 E-Value-Threshold
In iteration 2 220/500
In iteration 3 225/500
Psi-Blast had during the different iterations 256 different positives
at least one time in its result list.
Parameterset 3
5 Iterations, 0.005 E-Value-Threshold
In iteration 2 220/500
In iteration 3 225/500
In iteration 4 223/500
In iteration 5 225/500
Psi-Blast had during the different iterations 261 different positives
at least one time in its result list.
Parameterset 4
5 Iterations, 10E-6 E-Value-Threshold
In iteration 2 220/500
In iteration 3 224/500
In iteration 4 223/500
In iteration 5 226/500
Psi-Blast had during the different iterations 260 different positives
at least one time in its result list.
HHSearch
A detailed description of HHSearch can be found here.
Installation
Preparing the HHM-Database
- Download pdb70_29May10.hhm.tar.gz from ftp://ftp.tuebingen.mpg.de/pub/protevo/HHsearch/databases/
- Unzip the archive
- Make a database: cat *.hhm >> pdb70.db
- Move the db to an appropriate directory: sudo mv pdb70.db ../pdb70.db
Configure HHSearch-Tools
In the manual of HHSearch it was adviced to add the information of the secondary structure to the multiple alignment used for the query. Therefore it was necessary to run the addpsipred script of HHSearch. This script was not configured in the virtual box. Several parameters have to be adjusted.
- Changes in /apps/bin/addpsipred:
my $psipreddir="/apps/psipred_2.5";
my $ncbidir="/apps/blast_old/bin";
my $perl="/apps/bin";
my $dummydb="/home/student/tmp";
- Copy /apps/bin/reformat to /apps/bin/reformat.pl
Running
Parameterset 0
time hhsearch -i ./reference.fasta -d /data/hmm/pdb70.db -b 500 -o ./reference_simple.hhsearch
real 8m33.171s
user 5m14.530s
sys 0m3.510s
Parameterset 1
alignblast reference_psi_e10E-6_i3.blast reference_psi_e10E-6_i3.a3m
addpsipred /home/student/workspace/reference_psi_e10E-6_i3.a3m
time hhsearch -i reference_psi_e10E-6_i3.a3m -d /data/hmm/pdb70.db -o reference_psi_e10E-6_i3.hhsearch
real 16m27.258s
user 7m47.220s
sys 0m6.290s
To run HHSearch with the profile of a PSI-BLAST search the output of PSI-BLAST has to be converted to a multiple alignment by the script alignblast. The performance of HHSearch can be increases by adding the information of the secondary structure to the multiple alignment. This can be done with the script addspipred (uses the secondary structure prediction of PSIPRED).
Parameterset 2
alignblast reference_psi_e005_i3.blast reference_psi_e005_i3.a3m
addpsipred /home/student/workspace/reference_psi_e005_i3.a3m
time hhsearch -i reference_psi_e005_i3.a3m -d /data/hmm/pdb70.db -o reference_psi_e005_i3.hhsearch
real 16m7.216s
user 7m41.840s
sys 0m5.570s
Parameterset 3
alignblast reference_psi_e005_i5.blast reference_psi_e005_i5.blast.a3m
addpsipred /home/student/workspace/reference_psi_e005_i5.blast.a3m
time hhsearch -i reference_psi_e005_i5.blast.a3m -d /data/hmm/pdb70.db -o reference_psi_e005_i5.blast.hhsearch
real 7m49.907s
user 7m15.310s
sys 0m4.320s
Parameterset 4
alignblast reference_psi_e10E-6_i5.blast reference_psi_e10E-6_i5.a3m
addpsipred /home/student/workspace/reference_psi_e10E-6_i5.a3m
time hhsearch -i reference_psi_e10E-6_i5.a3m -d /data/hmm/pdb70.db -o reference_psi_e10E-6_i5.hhsearch
real 8m10.730s
user 7m33.190s
sys 0m5.390s
Results
Parameterset 0
Run on Simple Fasta-File.
Found 6 positives.
Parameterset 1
Run on Psi-Blast-Output of parameter set 1.
Found 6 positives.
Parameterset 2
Run on Psi-Blast-Output of parameter set 2.
Found 6 positives.
Parameterset 3
Run on Psi-Blast-Output of parameter set 3.
Found 6 positives.
Parameterset 4
Run on Psi-Blast-Output of parameter set 4.
Found 6 positives.
Comparing the Results
Defining True Positives
HSSP - Some Positives
Getting the entry of PAH from HSSP http://mrs.cmbi.ru.nl/mrs-5/entry?db=hssp&id=2pah&q=phenylalanine%20hydroxylase
HSSP - More Positives
HHSearch was run with a dataset of the PDB. BLAST was run with the a dataset of the NR database. NR itself is a mixture of SwissProt, RefSeq, PIR, PRF, PDB and GenBank CDS translations. HSSP itself contains only references to SwissProt entries. Therefore it was necessary to get the cross-references of the Swissprot entries in HSSP to the other databases. For this purpose we created a java-tool.
Overlap
unfiltered sets:
Nr. | Method/Parameter-Set | Size of Result-Set |
---|---|---|
1 | psi-blast e=10E-6 i=3 | 499 |
2 | fasta | 499 |
3 | hhsearch e=10E-6 i=3 | 10 |
4 | hhsearch simple | 496 |
5 | psi-blast e=0.005 i=5 | 499 |
6 | hhsearch (psi-blast: e=0.005 i=5) | 10 |
7 | blast | 499 |
8 | psi-blast e=10E-6 i=5 | 499 |
9 | hhsearch (psi-blast e=0.005 i=3) | 494 |
10 | hhsearch (psi-blast e=0.005 i=3) | 499 |
11 | hhsearch (psi-blast e=10E-6 i=5) | 10 |
Nr./Nr. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 499 | 414 | 3 | 3 | 485 | 3 | 405 | 482 | 3 | 494 | 3 |
1 | 414 | 499 | 4 | 4 | 410 | 4 | 478 | 406 | 4 | 413 | 4 |
2 | 3 | 4 | 10 | 10 | 3 | 10 | 4 | 3 | 10 | 3 | 10 |
3 | 3 | 4 | 10 | 496 | 3 | 10 | 4 | 3 | 58 | 3 | 10 |
4 | 485 | 410 | 3 | 3 | 499 | 3 | 401 | 490 | 3 | 488 | 3 |
5 | 3 | 4 | 10 | 10 | 3 | 10 | 4 | 3 | 10 | 3 | 10 |
6 | 405 | 478 | 4 | 4 | 401 | 4 | 499 | 397 | 4 | 404 | 4 |
7 | 482 | 406 | 3 | 3 | 490 | 3 | 397 | 499 | 3 | 486 | 3 |
8 | 3 | 4 | 10 | 58 | 3 | 10 | 4 | 3 | 494 | 3 | 10 |
9 | 494 | 413 | 3 | 3 | 488 | 3 | 404 | 486 | 3 | 499 | 3 |
10 | 3 | 4 | 10 | 10 | 3 | 10 | 4 | 3 | 10 | 3 | 10 |
positive sets:
Nr. | Method/Parameter-Set | Size of Result-Set |
---|---|---|
1 | hhsearch (psi-blast e=0.005 i=3) | 6 |
2 | hhsearch (psi-blast e=10E-6 i=3) | 6 |
3 | fasta | 224 |
4 | psi-blast e=0.005 i=5 | 226 |
5 | hhsearch - simple | 6 |
6 | hhsearch (psi-blast e=10E-6 i=5) | 6 |
7 | psi-blast e=10E-6 i=5 | 226 |
8 | psi-blast e=0.005 i=3 | 223 |
9 | blast | 220 |
10 | hhsearch (psi-blast e=0.005 i=5) | 6 |
11 | psi-blast e=10E-6 i=3 | 223 |
Nr./Nr. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6 | 6 | 4 | 3 | 6 | 6 | 3 | 3 | 4 | 6 | 3 |
1 | 6 | 6 | 4 | 3 | 6 | 6 | 3 | 3 | 4 | 6 | 3 |
2 | 4 | 4 | 224 | 193 | 4 | 4 | 192 | 194 | 211 | 4 | 194 |
3 | 3 | 3 | 193 | 226 | 3 | 3 | 224 | 220 | 188 | 3 | 219 |
4 | 6 | 6 | 4 | 3 | 6 | 6 | 3 | 3 | 4 | 6 | 3 |
5 | 6 | 6 | 4 | 3 | 6 | 6 | 3 | 3 | 4 | 6 | 3 |
6 | 3 | 3 | 192 | 224 | 3 | 3 | 226 | 219 | 187 | 3 | 217 |
7 | 3 | 3 | 194 | 220 | 3 | 3 | 219 | 223 | 188 | 3 | 220 |
8 | 4 | 4 | 211 | 188 | 4 | 4 | 187 | 188 | 220 | 4 | 188 |
9 | 6 | 6 | 4 | 3 | 6 | 6 | 3 | 3 | 4 | 6 | 3 |
10 | 3 | 3 | 194 | 219 | 3 | 3 | 217 | 220 | 188 | 3 | 223 |
Summary
BLAST is the fastest method. Its result is comparable to FASTA and PSI-BLAST.
It was mentioned that FASTA is more sensitive than BLAST. The number of found positives does not underlie this statement.
PSI-Pred is the slowest of the used methods on the base of the measured time. The final results of each iteration are comparable to a simple BLAST or FASTA search. Between the different runs it finds sometimes new positives, sometimes it loses positives and sometimes it finds formerly lost positives again. The ranks of the positives spread. It's not easy to dermine without the knowledge of the positives which results are good.
The only way to count the found positives (according to HSSP) is to use the cross-references in the SwissProt entries listed in HSSP. There are several problems concerning this method. Not all possible references might be annotated in the SwissProt entry. That means not every non-"positive" is automatically a false positive. Regarding the performance of an alignment method it's also necessary to regard the ranking of the positives. Therefore some non-positive gaps especially at the beginning (high score, high coverage, high identity, but non-positive) might be just wrongly annotated as non-positive.
HHSearch is in fact slower than PSI-Blast. We used for some of our HHSearch examples (1-4) PSI-Blast results. Adding the PSI-Blast runtime HHSearch is really slow. But it's performance is marvelous. It found only seven positives, but these are (in every parameter set) the first seven results and are clearly separated from the others by high scores. As already mentioned that's not a lot. But in HSSP are only (indirectly) 61 PDB entries listed. And probably some of them got lost by the filtering of the PDB HHM-database (less than 70 % sequence-identity).
In general it seems that PAH is a widely studied gene. Therefore it is quite easy to find targets. Even in the PDB were seven structures (of 61 possible) found, which are associated to PAH by HSSP.
With more difficult (less similar knowledge in the databases) queries the performance of the methods would separate more clearly. There PSI-BLAST and HHSearch would probably outperform BLAST and FASTA.
In my opinion HHSearch run with PSI-BLAST profiles is the tool of choice. The true positives can be more easily determined by the scores, which is necessary if searching for a virgin (nothing is known about the positives) and difficult sequence. HHSearch is part of the HHPred method, whose servers were on top in the last CASP-Competition. This shows its capabilities in finding reliable alignments, which could even be used for homology modeling.
Task 2.2: Multiple sequence alignments
Methodology
For each identity range (99%-90%, 89%-60, 59%-40% and 39%-20%) we selected five sequences of which at least one sequence has a solved structure as well. In order to find these sequences we looked in our blast and psi-blast results for appropriate sequences, four for each identity range. A sequences was considered as appropriate if it has the wanted sequence identity to our reference sequence and if this sequence has roughly the same length as our reference sequence. The reason for this is that we wanted to avoid leading and trailing gaps in our alignments. However, this was not always possible , especially for the low identity range (39%-20%). Last but not least, we used our hhsearch results (which was queried against a profile database of all pdb entries) to find appropriate sequences for each identity range. The list of selected sequences is as follows:
Sequence Identity | GI/PDB | Remark |
---|---|---|
97% | 4557819 | phenylalanine-4-hydroxylase [Homosapiens] |
95% | 1J8T | catalytic domain of human phenylalanine hydroxylase Fe(II) |
94% | 296212713 | PREDICTED:phenylalanine-4-hydroxylase [Callithrix jacchus] |
91% | 296212713 | PREDICTED:phenylalanine-4-hydroxylase-like [Ailuropodamelanoleuca] |
90% | 171543886 | phenylalanine-4-hydroxylase [Musmusculus] |
89% | 126339754 | PREDICTED: similar toPhenylalanine-4-hydroxylase (PAH) (Phe-4-monooxygenase)[Monodelphis domestica] |
82% | 224095451 | PREDICTED: phenylalanine hydroxylase [Taeniopygia guttata] |
73% | 32442452 | Pah [Danio rerio] |
66% | 3HF6 | crystal structure of human tryptophan hydroxylase type 1 with bound LP-521834 and FE |
60% | 194865387 | GG14450 [Drosophila erecta] |
59% | 2TOH | TYROSINE HYDROXYLASE CATALYTIC AND TETRAMERIZATION DOMAINS FROM RAT |
56% | 4883767 | PREDICTED: phenylalanine hydroxylase [Caenorhabditis elegans] |
48% | 74096365 | tyrosine hydroxylase 2 [Takifugu rubripes] |
48% | 301617387 | PREDICTED: tyrosine3-monooxygenase-like [Xenopus (Silurana) tropicalis] |
45% | 115532103 | TryPtophan Hydroxylase familymember (tph-1) [Caenorhabditis elegans] |
37% | 85860954 | prephenate dehydratase [Syntrophus aciditrophicus SB] |
31% | 114570403 | phenylalanine 4-monooxygenase [Maricaulis maris MCS10] |
30% | 1LTV | CRYSTAL STRUCTURE OF CHROMOBACTERIUM VIOLACEUM PHENYLALANINE HYDROXYLASE, STRUCTURE WITH BOUND OXIDIZED Fe(III) |
25% | 94312464 | phenylalanine 4-monooxygenase[Cupriavidus metallidurans CH34] |
23% | 298578904 | aromatic amino acid hydroxylase[Chlamydophila pneumoniae] |
Then we created a FASTA file for each identity range which contains the 5 sequences selected and our PAH reference sequence. For this purpose we executed the following commands:
//99-90
cat reference_pah_aa.fasta gi_4557819.fasta pdb_1J8T.fasta gi_296212713.fasta gi_301759315.fasta gi_171543886.fasta > ../99_90.fasta
//89-60
cat reference_pah_aa.fasta gi_126339754.fasta gi_224095451.fasta gi_32442452.fasta pdb_3hf6.fasta gi_194865387.fasta > ../89_60.fasta
//59-40
cat reference_pah_aa.fasta pdb_2TOH.fasta gi_4883767.fasta gi_74096365.fasta gi_301617387.fasta gi_115532103.fasta > ../59_40.fasta
//39-20
cat reference_pah_aa.fasta gi_85860954.fasta gi_114570403.fasta pdb_1ltv_fasta gi_94312464.fasta gi_298578904.fasta > ../39_20.fasta
Creating multiple sequence alignments with Cobalt
Installation
Before we could start to create our multiple sequence alignments (MSA) with Cobalt we had to install it first. We retrieved the latest copy from the NCBI server and installed it with the following commands:
//download of cobalt
wget ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/cobalt/*
//set enviroment var
COBALT=/home/student/master_praktikum/task2/task22/cobalt
//make cobalt.linux executable
chmod +x cobalt.linux
Execution
The execution of Cobalt was straight forward. To create our MSA we used the standard parameter as proposed in the README file. We used the following commands to create a MSA for each identity range:
$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 99_90.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_99_90.out
$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 89_60.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_89_60.out
$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 59_40.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_59_40.out
$COBALT/cobalt.linux -db $COBALT/cdd -b $COBALT/cdd.blocks -i 39_20.fasta -p $COBALT/patterns -f $COBALT/cdd.freq > cobalt_out/cobalt_39_20.out
Creating multiple sequence alignments with ClustalW
Installation
Installation was not necessary since ClustalW was already installed.
Execution
The execution of ClustalW was straight forward. To create our MSA we used the standard parameters. We used the following commands to create a MSA for each identity range:
clustalw -align -infile=99_90.fasta -type=PROTEIN -outfile=clustalw_out/clustalw_99_90.out
clustalw -align -infile=89_60.fasta -type=PROTEIN -outfile=clustalw_out/clustalw_89_60.out
clustalw -align -infile=59_40.fasta -type=PROTEIN -outfile=clustalw_out/clustalw_59_40.out
clustalw -align -infile=39_20.fasta -type=PROTEIN -outfile=clustalw_out/clustalw_39_20.out
Creating multiple sequence alignments with Muscle
Installation
Installation was not necessary since Muscle was already installed.
Execution
The execution of Muscle was straight forward. To create our MSA we used the standard parameters. We used the following commands to create a MSA for each identity range:
muscle -in 99_90.fasta -out muscle_out/muscle_99_90.out
muscle -in 89_60.fasta -out muscle_out/muscle_89_60.out
muscle -in 59_40.fasta -out muscle_out/muscle_59_40.out
muscle -in 39_20.fasta -out muscle_out/muscle_39_20.out
Creating multiple sequence alignments with T-Coffee
Installation
Installation was not necessary since T-Coffee was already installed.
Execution
The execution of T-Coffee was straight forward. The parameter “-method sap_pair,slow_pair” defines the method, which is in our case a MSA which considers the PDB file defined by the parameter “-template_file 1J8T”. We used the following commands to create a MSA for each identity range:
t_coffee 99_90.fasta -method sap_pair,slow_pair -template_file 1J8T
t_coffee 89_60.fasta -method sap_pair,slow_pair -template_file 3HF6
t_coffee 59_40.fasta -method sap_pair,slow_pair -template_file 2TOH
t_coffee 39_20.fasta -method sap_pair,slow_pair -template_file 1LTV
Results and discussion
Cobalt
MSA of 99-90% sequence identity
Conservation
304 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).
Gaps
Sequence ID | Number of gaps |
---|---|
PAH | 22 |
4557819 | 22 |
1J8T | 149 |
296212713 | 1 |
301759315 | 23 |
171543886 | 21 |
Gaps are only introduced to the N- and C-Terminal ends of the MSA. There are no gaps in secondary structure elements of PAH.
MSA of 89-60% sequence identity
Conservation
161 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).
Gaps
Sequence ID | Number of gaps |
---|---|
PAH | 21 |
126339754 | 9 |
224095451 | 27 |
32442452 | 24 |
3HF6 | 183 |
194865387 | 21 |
Gaps are only introduced to the N- and C-Terminal ends of the MSA. There are no gaps in secondary structure elements of PAH.
MSA of 59-40% sequence identity
Conservation
102 columns are conserved. The area around the catalytic center is also conserved (around position 290 on PAH).
Gaps
Sequence ID | Number of gaps |
---|---|
PAH | 57 |
2TOH | 166 |
4883767 | 52 |
74096365 | 33 |
301617387 | 38 |
115532103 | 80 |
Most gaps are introduced to the N- and C-Terminal ends of the MSA. There is one gap in the two sheets from MSA position 372 to 393 of PAH.
MSA of 39-20% sequence identity
Conservation
1 columns are conserved. The area around the catalytic center is in 4 of 5 sequences conserved. (position 290 on PAH).
Gaps
Sequence ID | Number of gaps |
---|---|
PAH | 19 |
85860954 | 117 |
114570403 | 178 |
1LTV | 174 |
94312464 | 161 |
298578904 | 125 |
Most gaps are introduced to the N- and C-Terminal ends of the MSA. There are two gaps from MSA position 320 to 337 (helix) and from 417 to 420 (helix) in PAH.
ClustalW
MSA of 99-90% sequence identity
Conservation
304 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).
Gaps
Sequence ID | Number of gaps |
---|---|
PAH | 22 |
4557819 | 22 |
1J8T | 149 |
296212713 | 1 |
301759315 | 23 |
171543886 | 21 |
Gaps are only introduced to the N- and C-Terminal ends of the MSA. There are no gaps in secondary structure elements of PAH.
MSA of 89-60% sequence identity
Conservation
161 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).
Gaps
Sequence ID | Number of gaps |
---|---|
PAH | 18 |
126339754 | 6 |
224095451 | 24 |
32442452 | 21 |
3HF6 | 180 |
194865387 | 18 |
Gaps are only introduced to the N- and C-Terminal ends of the MSA. There are no gaps in secondary structure elements of PAH.
MSA of 59-40% sequence identity
Conservation
102 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).
Gaps
Sequence ID | Number of gaps |
---|---|
PAH | 49 |
2TOH | 158 |
4883767 | 44 |
74096365 | 25 |
301617387 | 40 |
115532103 | 72 |
Most gaps are introduced to the N- and C-Terminal ends of the MSA. Two gaps from MSA position 366 to 376 and from 379 to 384 are disrupting two sheets in PAH.
MSA of 39-20% sequence identity
Conservation
2 columns are conserved. The area around the catalytic center is in 5 from 6 sequences conserved (position 290 on PAH).
Gaps
Sequence ID | Number of gaps |
---|---|
PAH | 20 |
85860954 | 118 |
114570403 | 179 |
1LTV | 175 |
94312464 | 162 |
298578904 | 126 |
Most gaps are introduced to the N- and C-Terminal ends of the MSA. Two gaps from MSA position 308 to 324 and from 438 to 441 are disrupting one helix and one sheet in PAH.
Muscle
MSA of 99-90% sequence identity
Conservation
304 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).
Gaps
Sequence ID | Number of gaps |
---|---|
PAH | 22 |
4557819 | 22 |
1J8T | 149 |
296212713 | 1 |
301759315 | 23 |
171543886 | 21 |
Gaps are introduced to the N- and C-Terminal ends of the MSA. There are no gaps in secondary structure elements of PAH.
MSA of 89-60% sequence identity
Conservation
161 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).
Gaps
Sequence ID | Number of gaps |
---|---|
PAH | 24 |
126339754 | 12 |
224095451 | 30 |
32442452 | 27 |
3HF6 | 186 |
194865387 | 24 |
Gaps are introduced to the N- and C-Terminal ends of the MSA. There are no gaps in secondary structure elements of PAH.
MSA of 59-40% sequence identity
Conservation
102 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).
Gaps
Sequence ID | Number of gaps |
---|---|
PAH | 62 |
2TOH | 171 |
4883767 | 57 |
74096365 | 38 |
301617387 | 43 |
115532103 | 85 |
Most gaps are introduced to the N- and C-Terminal ends of the MSA. There are two gaps from 376 to 386 and from 392 to 397 in two sheets.
MSA of 39-20% sequence identity
Conservation
7 columns are conserved. The area around the catalytic center is in 5 of 6 sequences conserved (position 290 on PAH).
Gaps
Sequence ID | Number of gaps |
---|---|
PAH | 46 |
85860954 | 144 |
114570403 | 205 |
1LTV | 201 |
94312464 | 188 |
298578904 | 152 |
Most gaps are introduced to the N- and C-Terminal ends of the MSA. There are gaps from 232 to 239, from 251 to 256, from 342 to 359 and from 391 to 392 in two sheets.
T-Coffee
MSA of 99-90% sequence identity
Conservation
304 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).
Gaps
Sequence ID | Number of gaps |
---|---|
PAH | 22 |
4557819 | 22 |
1J8T | 149 |
296212713 | 1 |
301759315 | 23 |
171543886 | 21 |
Gaps are introduced to the N- and C-Terminal ends of the MSA. There are no gaps in the secondary structure elements of PAH.
MSA of 89-60% sequence identity
Conservation
160 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).
Gaps
Sequence ID | Number of gaps |
---|---|
PAH | 20 |
126339754 | 8 |
224095451 | 26 |
32442452 | 23 |
3HF6 | 182 |
194865387 | 20 |
Gaps are introduced to the N- and C-Terminal ends of the MSA. There are no gaps in the secondary structure elements of PAH.
MSA of 59-40% sequence identity
Conservation
102 columns are conserved. The area around the catalytic center is also conserved (position 290 on PAH).
Gaps
Sequence ID | Number of gaps |
---|---|
PAH | 75 |
2TOH | 184 |
4883767 | 70 |
74096365 | 51 |
301617387 | 56 |
115532103 | 98 |
Gaps are introduced to the N- and C-Terminal ends of the MSA. There are no gaps in the secondary structure elements of PAH.
MSA of 39-20% sequence identity
Conservation
5 columns are conserved. The area around the catalytic center is in 5 of 6 sequences conserved (position 290 on PAH).
Gaps
Sequence ID | Number of gaps |
---|---|
PAH | 109 |
85860954 | 207 |
114570403 | 268 |
1LTV | 264 |
94312464 | 251 |
298578904 | 215 |
Most gaps are introduced to the N- and C-Terminal ends of the MSA. There are gaps from 224 to 228, from 375 to 377, from 412 to 422 and from 457 to 496 in PAH.