Difference between revisions of "Sequence searches and multiple sequence alignments (Phenylketonuria)"

From Bioinformatikpedia
Revision as of 19:18, 6 May 2013

Summary of the task

In this task we compare the protein sequence of interest, phenylalanine hydroxylase (PAH), to other protein sequences. Both sequence searches and multiple sequence alignments were done using the big80 database, which contains subsets of Swiss-Prot and PDB whose entries have a sequence similarity of 80% or less. In addition, searches against a PDB database were done. For the sequence searches the programs BLAST, PSI-BLAST and HHblits were used. Their results were used to create multiple sequence alignments (MSA) with the methods ClustalW, Muscle and T-Coffee.

Sequence searches

The following invocations were used for BLAST, PSI-BLAST and HHblits:

BLAST (Basic Local Alignment Search Tool)

blastall -p blastp -d /mnt/project/rost_db/data/big/big_80 \
  -i /mnt/home/student/worfk/Masterpractical/Task2/PAH.fasta \
  -o /mnt/home/student/worfk/Masterpractical/Task2/Blast/PAH_Blast_big_80.out \
  -v 2000 -b 2000

PSI-BLAST (Position-Specific Iterated BLAST)

For PSI-BLAST (PSI-BLAST Tutorial) several invocations were performed: two iterations with an E-value cutoff of 0.002 and again with a cutoff of 10E-10, and ten iterations with the same two cutoffs. An example invocation would be:

blastpgp -i /mnt/home/student/worfk/Masterpractical/Task2/PAH.fasta \
  -d /mnt/project/rost_db/data/big/big_80 -j 2 -h 0.002 -v 2000 -b 2000 \
  -o psi_blast_big_80_2_2.out -C big_80_check_2_2.chk -Q big_80_matrix_2_2.pssm
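
The four PSI-BLAST runs (two or ten iterations, each with both E-value cutoffs) could be generated in a loop. The following Python sketch only builds the command lines and does not run anything. The flags are copied from the example invocation; the relative query path and the file name suffixes for the second cutoff are assumptions made for illustration.

```python
from itertools import product

# Hypothetical relative path; the example invocation uses an absolute path.
QUERY = "PAH.fasta"
# Database path taken from the example invocation.
DATABASE = "/mnt/project/rost_db/data/big/big_80"

# E-value cutoff -> file name tag. The tag "2" matches the example file
# names (..._2_2.out); the tag "10" for the second cutoff is a guess.
CUTOFFS = {"0.002": "2", "10E-10": "10"}
ITERATIONS = (2, 10)


def blastpgp_command(iterations, evalue, tag):
    """Build one blastpgp argument list with the flags of the example."""
    suffix = f"{iterations}_{tag}"
    return [
        "blastpgp", "-i", QUERY, "-d", DATABASE,
        "-j", str(iterations), "-h", evalue,
        "-v", "2000", "-b", "2000",
        "-o", f"psi_blast_big_80_{suffix}.out",
        "-C", f"big_80_check_{suffix}.chk",
        "-Q", f"big_80_matrix_{suffix}.pssm",
    ]


commands = [
    blastpgp_command(iterations, evalue, tag)
    for iterations, (evalue, tag) in product(ITERATIONS, CUTOFFS.items())
]

for command in commands:
    print(" ".join(command))
```

Each list could be passed to `subprocess.run` on a machine where `blastpgp` and the database are available.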

HHblits

hhblits -i /mnt/home/student/waldraffs/Masterpraktikum/PAH.fasta -d /mnt/project/rost_db/data/hhblits/uniprot20_02Sep11 \
  -o /mnt/home/student/waldraffs/Masterpraktikum/PAH_2000.hrr -oa3m /mnt/home/student/waldraffs/Masterpraktikum/PAH_2000.a3m \
  -ohhm /mnt/home/student/waldraffs/Masterpraktikum/PAH_2000.hhm -Z 2000 -B 2000


To run all programs at once, one can use Maria's Perl script, as shown here:

perl /mnt/home/student/kalemanovm/master_practical/Assignment2_Alignments/scripts/task1/run.pl ...

Comparison of the results

  • Sequence identity in percent
  • E-Value
  • GO-terms

For the reference sequence (P00432) the following GO terms were found on QuickGO: ...

To assess the similarity between the reference sequence and the sequences found in the searches, the occurrences of those terms among the hits are counted. ...
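
The counting step can be sketched in Python as follows. This is a minimal illustration, not the script used for this task: the function name `count_shared_terms`, the hit names and the GO labels are made-up placeholders, and the real term lists would have to come from QuickGO or the search output.

```python
from collections import Counter


def count_shared_terms(reference_terms, hit_terms):
    """Count, for each reference GO term, how many hits carry it.

    reference_terms: iterable of GO term labels of the reference sequence.
    hit_terms: dict mapping a hit ID to an iterable of its GO term labels.
    """
    reference = set(reference_terms)
    # Start every reference term at zero so unused terms are reported too.
    counts = Counter({term: 0 for term in reference})
    for terms in hit_terms.values():
        for term in reference & set(terms):
            counts[term] += 1
    return counts


# Placeholder data, not taken from QuickGO.
reference = ["GO:ref_a", "GO:ref_b"]
hits = {
    "hit_1": ["GO:ref_a"],
    "hit_2": ["GO:ref_a", "GO:other"],
}
shared = count_shared_terms(reference, hits)
for term in sorted(shared):
    print(term, shared[term])
```

Terms that occur only in a hit and not in the reference are ignored here, since the comparison is made relative to the reference sequence.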

Multiple sequence alignments

Datasets

For the multiple sequence alignments four different datasets were generated with a Python script (https://i12r-studfilesrv.informatik.tu-muenchen.de/wiki/index.php/Phenylketonuria/Task2/Scripts). Three groups contain ten sequences each, two of which are pdb sequences: one group with more than 60% sequence identity to our target sequence (PAH), one with sequence identity between 30% and 60%, and one with less than 30% sequence identity. In the group with < 30% identity the pdb sequences have 32% identity, because these are the lowest-identity hits found in the Blast output against the pdb dataset. The fourth group contains 20 sequences spanning the whole range of sequence identities; four of them are pdb sequences.
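
The grouping by sequence identity can be illustrated with a short Python sketch. It is not the linked script; the function names and the limit of eight non-pdb sequences per group (ten minus the two pdb sequences) are assumptions, and the special case of the 32% pdb sequences in the "low" group is not handled. The example IDs and identities are taken from the tables on this page.

```python
from collections import defaultdict


def identity_group(identity):
    """Map a percent identity to the target onto a dataset name."""
    if identity > 60:
        return "high"
    if identity >= 30:
        return "medium"
    return "low"


def group_hits(hits, per_group=8):
    """Sort (ID, identity) pairs into groups, keeping at most per_group each."""
    groups = defaultdict(list)
    for seq_id, identity in hits:
        name = identity_group(identity)
        if len(groups[name]) < per_group:
            groups[name].append(seq_id)
    return dict(groups)


example_hits = [
    ("Q4VBE2", 76),
    ("Q45XJ4", 48),
    ("A3UNV6", 31),
    ("C8W332", 29),
]
grouped = group_hits(example_hits)
for name, ids in grouped.items():
    print(name, ids)
```

Hits at exactly 30% or 60% are placed in the "medium" group here, which is one possible reading of "between 30% and 60%".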


Dataset "low":
Group of sequences with < 30% (pdb = 32%) identity
Sequence identity   ID            Protein name
29%                 C8W332        Prephenate dehydratase OS=Desulfotomaculum acetoxidans
29%                 F4GLY0        Phospho-2-dehydro-3-deoxyheptonate aldolase OS=Spirochaeta coccoides
29%                 B8J3L5        Chorismate mutase OS=Desulfovibrio desulfuricans
27%                 A1ZW97        Phenylalanine-4-hydroxylase OS=Microscilla marina
26%                 H1GP19        Putative uncharacterized protein OS=Myroides odoratimimus
25%                 G0L2J6        Phenylalanine 4-monooxygenase OS=Zobellia galactanivorans
24%                 L9JT09        Phenylalanine-4-hydroxylase OS=Cystobacter fuscus
23%                 Q08RX0        Aromatic amino acid hydroxylase, biopterin-dependent OS=Stigmatella aurantiaca
32%                 1ltu_A (pdb)  PHENYLALANINE-4-HYDROXYLASE
32%                 1ltz_A (pdb)  PHENYLALANINE-4-HYDROXYLASE


Dataset "medium":
Group of sequences between 30% and 60% identity
Sequence identity   ID            Protein name
48%                 Q45XJ4        Tyrosine hydroxylase OS=Branchiostoma floridae
43%                 D8U9W1        Putative uncharacterized protein OS=Volvox carteri
41%                 A5GW18        Prephenate dehydratase OS=Synechococcus sp.
38%                 Q1Q4R8        Strongly similar to chorismate mutase/prephenate dehydratase OS=Candidatus Kuenenia stuttgartiensis
37%                 E5Y7E0        Prephenate dehydratase OS=Bilophila wadsworthia
33%                 F5XW56        Candidate phenylalanine 4-monooxygenase (Phenylalanine-4-hydroxylase) OS=Ramlibacter tataouinensis
32%                 K0JRN5        Prephenate dehydratase OS=Saccharothrix espanaensis
31%                 A3UNV6        Phenylalanine-4-hydroxylase OS=Vibrio splendidus
59%                 1toh_A (pdb)  TYROSINE HYDROXYLASE
33%                 2v28_B (pdb)  PHENYLALANINE-4-HYDROXYLASE


Dataset "high":
Group of sequences with > 60% identity
Sequence identity   ID            Protein name
76%                 Q4VBE2        Putative uncharacterized protein mgc108157 OS=Xenopus tropicalis
67%                 K1RSS1        Protein henna OS=Crassostrea gigas
64%                 H3HGU2        Uncharacterized protein (Fragment) OS=Strongylocentrotus purpuratus
63%                 C3ZNL0        Putative uncharacterized protein OS=Branchiostoma floridae
63%                 G3MQ02        Putative uncharacterized protein OS=Amblyomma maculatum
62%                 E4XIM4        Whole genome shotgun assembly, reference scaffold set, scaffold scaffold_41 OS=Oikopleura dioica
61%                 E9FTL2        Putative uncharacterized protein OS=Daphnia pulex
61%                 D6WIQ7        Putative uncharacterized protein OS=Tribolium castaneum
95%                 1tg2_A (pdb)  Phenylalanine-4-hydroxylase
96%                 4pah_A (pdb)  PHENYLALANINE HYDROXYLASE


Dataset "all":
Group of sequences with different identities (0-100%)
Sequence identity   ID            Protein name
55%                 K7F3H7        Uncharacterized protein OS=Pelodiscus sinensis
55%                 F4PX76        Phenylalanine 4-monooxygenase OS=Dictyostelium fasciculatum
53%                 K7FZR2        Uncharacterized protein OS=Pelodiscus sinensis
52%                 A6YIC4        Tryptophan hydroxylase OS=Platynereis dumerilii
50%                 C0KKU5        Tyrosine hydroxylase (Fragment) OS=Octopus vulgaris
48%                 I2FKE7        Tyrosine hydroxylase long variant (Fragment) OS=Gryllus bimaculatus
47%                 K8YPQ2        Phenylalanine-4-hydroxylase OS=Nannochloropsis gaditana
44%                 B1B5B7        Chorismate mutase/prephenate dehydratase OS=uncultured Termite group 1 bacterium
36%                 L8LIE0        Prephenate dehydratase OS=Leptolyngbya sp.
35%                 B9Y6K3        Putative uncharacterized protein OS=Holdemania filiformis
35%                 A3QCV4        Phenylalanine 4-hydroxylase OS=Shewanella loihica
34%                 C1F688        Phenylalanine-4-hydroxylase OS=Acidobacterium capsulatum
34%                 K5BAA8        Prephenate dehydratase family protein OS=Mycobacterium hassiacum
34%                 G4FWD8        Chorismate mutase OS=Rhodanobacter sp.
33%                 E2MCW6        Phenylalanine-4-hydroxylase OS=Pseudomonas syringae pv. tomato
33%                 J2JTB5        Phenylalanine-4-hydroxylase, monomeric form OS=Rhizobium sp.
95%                 1tg2_A (pdb)  Phenylalanine-4-hydroxylase
65%                 3hfb_A (pdb)  Tryptophan 5-hydroxylase 1
60%                 2xsn_D (pdb)  TYROSINE 3-MONOOXYGENASE
32%                 2qmw_B (pdb)  Prephenate dehydratase

ClustalW

clustalw -align in.fasta

Muscle

muscle -in in.fasta -clw -out out.aln

T-Coffee

t_coffee out.fasta

3D-Coffee

...

Discussion of the multiple sequence alignments and the used tools

...