Sequence Search and Multiple Sequence Alignment (PKU)

From Bioinformatikpedia

Revision as of 18:28, 3 May 2012

Short Task Description

Perform database searches using different search tools with the PAH protein as query

Create and evaluate multiple sequence alignments


Reference Sequence of PAH

>sp|P00439|PH4H_HUMAN Phenylalanine-4-hydroxylase OS=Homo sapiens GN=PAH PE=1 SV=1
MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM
EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF
RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE
KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR
IEVLDNTQQLKILADSINSEIGILCSALQKI

Database Searches

BLAST

time blast2 -p blastp -d /mnt/project/pracstrucfunc12/data/big/big -i Dropbox/Phenylketonuria/Task1/PAH.fasta -o results_blast2_standard

real 1m47.401s user 1m25.290s sys 0m18.280s

time blast2 -p blastp -d /mnt/project/pracstrucfunc12/data/big/big -i Dropbox/Phenylketonuria/Task1/PAH.fasta -o results_blast2_e-10 -e 0.0000000001 -v 2000

real 1m35.454s user 1m21.700s sys 0m3.100s

PSI-BLAST

time blastpgp -j 5 -d /mnt/project/pracstrucfunc12/data/big/big_80 -i Dropbox/Phenylketonuria/Task1/PAH.fasta -o psi_blast_standard_5_it

real 8m48.107s user 8m21.950s sys 0m8.730s


HHBlits

time hhblits -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -o results_hhblits_standard

real 6m10.059s user 3m15.640s sys 0m40.220s

hhblits -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -n 4 -e 0.0000001 -o results_hhblits_n4_e-7

HHSearch

time hhsearch -i Dropbox/Phenylketonuria/Task1/PAH.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current_hhm_db

real 13m27.782s user 13m18.120s sys 0m8.480s
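
To compare the runtimes reported above, the `time` summaries can be converted to seconds and ranked. The following is a minimal sketch, not part of the original workflow: the labels and the helper names `parse_time` and `ranked_by_wall_clock` are made up for illustration, and the numbers are copied from the timings listed in this section.

```python
import re

# `time` summaries copied from the runs reported in this section.
TIMINGS = {
    "blast2 (default)": "real 1m47.401s user 1m25.290s sys 0m18.280s",
    "blast2 (-e 0.0000000001 -v 2000)": "real 1m35.454s user 1m21.700s sys 0m3.100s",
    "blastpgp (-j 5)": "real 8m48.107s user 8m21.950s sys 0m8.730s",
    "hhblits (default)": "real 6m10.059s user 3m15.640s sys 0m40.220s",
    "hhsearch": "real 13m27.782s user 13m18.120s sys 0m8.480s",
}

# Matches fields such as "real 1m47.401s".
_FIELD = re.compile(r"(real|user|sys)\s+(\d+)m([\d.]+)s")


def parse_time(line):
    """Convert a `time` summary line into a dict of seconds per field."""
    return {name: int(minutes) * 60 + float(seconds)
            for name, minutes, seconds in _FIELD.findall(line)}


def ranked_by_wall_clock(timings):
    """Return (label, parsed times) pairs sorted by wall-clock ("real") time."""
    parsed = {label: parse_time(line) for label, line in timings.items()}
    return sorted(parsed.items(), key=lambda item: item[1]["real"])


if __name__ == "__main__":
    for label, t in ranked_by_wall_clock(TIMINGS):
        cpu = t["user"] + t["sys"]
        print(f"{label:34s} real={t['real']:8.3f}s  cpu={cpu:8.3f}s")
```

Note that the searches use different databases (big, big_80, uniprot20), so the ranking describes these particular runs rather than the tools in general.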

PDB

Instead of mapping hits in big_80 to PDB to obtain structural information for the alignments, we performed an additional search in PDB at NCBI with the following parameters: scoring matrix PAM70, gap open 10, gap extend 1, composition-based statistics, max. 1000 target sequences.

MSA

Datasets

We tried to create datasets of four sequences plus the reference sequence for each identity range (1: 90%-99%, 2: 60%-89%, 3: 40%-59%, 4: 20%-39%) from our search in the big_80 database. Since these results do not contain PDB structures, we additionally searched PDB directly for proteins in the required range and added them to the datasets. For the most conserved range, big_80 contains only one sequence, so as an experiment we built the most highly conserved dataset possible, covering the range 80%-90%, and used the structure of the reference sequence for 3D-Coffee. The resulting datasets are shown in the following tables. All sequences have roughly the same length as the reference sequence except for G5AMD7 in the first set. G5AMD7 contains an insertion of 162 aa that is easily identified in the alignment, but it is highly similar to the reference in the remaining regions.
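
The two quantities used in this paragraph, sequence identity and the length of an insertion relative to the reference, can be read off a pairwise alignment. The sketch below is only an illustration: the text does not state how identity was computed, so the definition used here (identical columns divided by columns in which both sequences have a residue) is an assumption, and the function names and the toy sequences are made up.

```python
def percent_identity(aligned_ref, aligned_seq):
    """Percent of identical columns among columns without a gap.

    Assumption: gap columns are excluded from the denominator. Other
    definitions (e.g. dividing by the reference length) give other values.
    """
    if len(aligned_ref) != len(aligned_seq):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for a, b in zip(aligned_ref, aligned_seq):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def longest_insertion(aligned_ref, aligned_seq):
    """Longest run of columns with a gap in the reference and a residue in the other sequence."""
    best = run = 0
    for a, b in zip(aligned_ref, aligned_seq):
        if a == "-" and b != "-":
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


if __name__ == "__main__":
    # Toy alignment, not taken from the actual datasets.
    ref = "MSTAV---LENPG"
    seq = "MSTAVKKKLEDPG"
    print(percent_identity(ref, seq))   # identity over ungapped columns
    print(longest_insertion(ref, seq))  # insertion length relative to ref
```

Applied to the aligned reference and G5AMD7 rows, `longest_insertion` would be expected to report the long insertion described above, provided the aligner places it as a single gap block in the reference.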

80-90% Sequence Identity

Sequence Identity | ID     | Comment
90%               | G1P4I7 | Uncharacterized protein OS=Myotis lucifugus
89%               | G5AMD7 | Phenylalanine-4-hydroxylase OS=Heterocephalus glaber
80%               | G1KSL1 | Uncharacterized protein OS=Anolis carolinensis
100%              | 1PAH   | used as 3D-Template only

60-89% Sequence Identity

Sequence Identity | ID     | Comment
80%               | G1KSL1 | Uncharacterized protein OS=Anolis carolinensis
76%               | Q4VBE2 | Putative uncharacterized protein mgc108157 OS=Xenopus tropicalis
70%               | H2UJM8 | Uncharacterized protein OS=Takifugu rubripes
63%               | D1LXB2 | Phenylalanine hydroxlase OS=Saccoglossus kowalevskii
60%               | 2XSN_A | human Tyrosine 3-Monooxygenase, also used as 3D-Template

40-59% Sequence Identity

Sequence Identity | ID     | Comment
58%               | O96947 | Phenylalanine hydroxylase OS=Geodia cydonium
55%               | D3BKZ8 | Phenylalanine 4-monooxygenase OS=Polysphondylium pallidum
49%               | Q5RHI3 | Novel protein similar to tyrosine hydroxylase OS=Danio rerio
44%               | A6P4D3 | Tyrosine hydroxylase OS=Dugesia japonica
59%               | 1TOH_A | Tyrosine hydroxylase from Rattus norvegicus, also used as 3D-Template

20-39% Sequence Identity

Sequence Identity | ID     | Comment
37%               | F4WGX3 | Phenylalanine-4-hydroxylase OS=Acromyrmex echinatior
35%               | Q23A76 | Biopterin-dependent aromatic amino acid hydroxylase family protein OS=Tetrahymena thermophila
29%               | D0I5S9 | Phenylalanine-4-hydroxylase OS=Grimontia hollisae
28%               | F6G5X4 | Phenylalanine-4-hydroxylase OS=Ralstonia solanacearum
31%               | 1LTU_A | Phenylalanine-4-hydroxylase from Chromobacterium violaceum, also used as 3D-Template
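
Assigning hits to the four identity ranges can be sketched as a simple lookup. This is an illustration only: the range boundaries are the ones stated in the Datasets paragraph, the example identities are copied from the tables above, and the names `RANGES`, `identity_set` and `group_by_set` are hypothetical. Identities are assumed to be rounded to whole percent, as in the tables.

```python
# Identity ranges (inclusive, in percent) as stated in the Datasets paragraph.
RANGES = {1: (90, 99), 2: (60, 89), 3: (40, 59), 4: (20, 39)}


def identity_set(identity_percent):
    """Return the set number whose range contains the identity, or None."""
    for set_no, (low, high) in RANGES.items():
        if low <= identity_percent <= high:
            return set_no
    return None


def group_by_set(hits):
    """Group a {sequence ID: identity} mapping by set number."""
    groups = {set_no: [] for set_no in RANGES}
    for seq_id, identity in hits.items():
        set_no = identity_set(identity)
        if set_no is not None:
            groups[set_no].append(seq_id)
    return groups


if __name__ == "__main__":
    # A few entries copied from the tables above.
    hits = {"G1P4I7": 90, "G5AMD7": 89, "G1KSL1": 80,
            "Q4VBE2": 76, "O96947": 58, "F4WGX3": 37}
    for set_no, ids in group_by_set(hits).items():
        print(set_no, ids)
```

With the ranges as stated, only the 90% hit among these entries lands in set 1, which matches the reason given for building the 80%-90% dataset instead.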

ClustalW

Command used to create the alignments:

clustalw -align -infile=NN.fasta -outfile=clustalW_NN.aln

[Figure: Alignment of set 4 created with ClustalW]
[Figure: Alignment of set 3 created with ClustalW]
[Figure: Alignment of set 2 created with ClustalW]
[Figure: Alignment of set 1 created with ClustalW]

Muscle

muscle -in NN.fasta -out muscle_NN.fasta

T-Coffee

Command used to create the alignments:

t_coffee NN.fasta


[Figure: Alignment of set 4 created with T-Coffee]
[Figure: Alignment of set 3 created with T-Coffee]
[Figure: Alignment of set 2 created with T-Coffee]
[Figure: Alignment of set 1 created with T-Coffee]

3D-Coffee

Command used to create the alignments:

t_coffee NN.fasta -method sap_pair,slow_pair -template_file <PDB-ID>
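
Since the same four commands are repeated for every dataset, they can be generated from the dataset name. The sketch below only builds the command strings with the flags shown in this section and does not run anything. The function name `alignment_commands` is hypothetical, and passing a bare PDB ID such as 1PAH to `-template_file` simply mirrors the `<PDB-ID>` placeholder above; whether a given T-Coffee installation accepts that form is an assumption.

```python
import shlex


def alignment_commands(dataset, pdb_id=None):
    """Build the alignment command lines for one dataset.

    `dataset` replaces the NN placeholder used in this section;
    `pdb_id` fills the <PDB-ID> placeholder of the 3D-Coffee call.
    """
    fasta = f"{dataset}.fasta"
    commands = {
        "clustalw": ["clustalw", "-align", f"-infile={fasta}",
                     f"-outfile=clustalW_{dataset}.aln"],
        "muscle": ["muscle", "-in", fasta, "-out", f"muscle_{dataset}.fasta"],
        "t_coffee": ["t_coffee", fasta],
    }
    if pdb_id is not None:
        commands["3d_coffee"] = ["t_coffee", fasta,
                                 "-method", "sap_pair,slow_pair",
                                 "-template_file", pdb_id]
    # shlex.join quotes arguments where needed, so the strings are shell-safe.
    return {tool: shlex.join(args) for tool, args in commands.items()}


if __name__ == "__main__":
    for tool, command in alignment_commands("NN", pdb_id="1PAH").items():
        print(f"{tool}: {command}")
```

The resulting strings could be passed to a shell or, as argument lists, to `subprocess.run`, once the tools are available on the machine.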