Difference between revisions of "Task6 Hemochromatosis Protocol"
Bernhoferm (talk | contribs) (Created page with "Casting Spell: Create Protocol Page") |
Bernhoferm (talk | contribs) (→BLOSUM62/PAM1/PAM250) |
||
(8 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
+ | == Amino acid features == |
||
− | Casting Spell: Create Protocol Page |
||
+ | |||
+ | |||
+ | The values were taken from [http://en.wikipedia.org/wiki/Amino_acid Wiki: Amino Acids] and [http://en.wikipedia.org/wiki/Proteinogenic_amino_acid Wiki: Proteinogenic Amino Acids]. |
||
+ | |||
+ | <br style="clear:both;"> |
||
+ | |||
+ | == BLOSUM62/PAM1/PAM250 == |
||
+ | |||
+ | |||
+ | * BLOSUM62: [http://www.ncbi.nlm.nih.gov/Class/BLAST/BLOSUM62.txt NCBI: BLAST] (last accessed 17th Jun 2012) |
||
+ | |||
+ | A R N D C Q E G H I L K M F P S T W Y V B Z X * |
||
+ | A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4 |
||
+ | R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4 |
||
+ | N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4 |
||
+ | D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4 |
||
+ | C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4 |
||
+ | Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4 |
||
+ | E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 |
||
+ | G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4 |
||
+ | H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4 |
||
+ | I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4 |
||
+ | L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4 |
||
+ | K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4 |
||
+ | M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4 |
||
+ | F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4 |
||
+ | P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4 |
||
+ | S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4 |
||
+ | T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4 |
||
+ | W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4 |
||
+ | Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4 |
||
+ | V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4 |
||
+ | B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4 |
||
+ | Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 |
||
+ | X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4 |
||
+ | * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1 |
||
+ | |||
+ | |||
+ | * PAM1: [http://www.icp.ucl.ac.be/~opperd/private/pam1.html http://www.icp.ucl.ac.be] (last accessed 17th Jun 2012) |
||
+ | |||
+ | Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val |
||
+ | A R N D C Q E G H I L K M F P S T W Y V |
||
+ | Ala A 9867 2 9 10 3 8 17 21 2 6 4 2 6 2 22 35 32 0 2 18 |
||
+ | Arg R 1 9913 1 0 1 10 0 0 10 3 1 19 4 1 4 6 1 8 0 1 |
||
+ | Asn N 4 1 9822 36 0 4 6 6 21 3 1 13 0 1 2 20 9 1 4 1 |
||
+ | Asp D 6 0 42 9859 0 6 53 6 4 1 0 3 0 0 1 5 3 0 0 1 |
||
+ | Cys C 1 1 0 0 9973 0 0 0 1 1 0 0 0 0 1 5 1 0 3 2 |
||
+ | Gln Q 3 9 4 5 0 9876 27 1 23 1 3 6 4 0 6 2 2 0 0 1 |
||
+ | Glu E 10 0 7 56 0 35 9865 4 2 3 1 4 1 0 3 4 2 0 1 2 |
||
+ | Gly G 21 1 12 11 1 3 7 9935 1 0 1 2 1 1 3 21 3 0 0 5 |
||
+ | His H 1 8 18 3 1 20 1 0 9912 0 1 1 0 2 3 1 1 1 4 1 |
||
+ | Ile I 2 2 3 1 2 1 2 0 0 9872 9 2 12 7 0 1 7 0 1 33 |
||
+ | Leu L 3 1 3 0 0 6 1 1 4 22 9947 2 45 13 3 1 3 4 2 15 |
||
+ | Lys K 2 37 25 6 0 12 7 2 2 4 1 9926 20 0 3 8 11 0 1 1 |
||
+ | Met M 1 1 0 0 0 2 0 0 0 5 8 4 9874 1 0 1 2 0 0 4 |
||
+ | Phe F 1 1 1 0 0 0 0 1 2 8 6 0 4 9946 0 2 1 3 28 0 |
||
+ | Pro P 13 5 2 1 1 8 3 2 5 1 2 2 1 1 9926 12 4 0 0 2 |
||
+ | Ser S 28 11 34 7 11 4 6 16 2 2 1 7 4 3 17 9840 38 5 2 2 |
||
+ | Thr T 22 2 13 4 1 3 2 2 1 11 2 8 6 1 5 32 9871 0 2 9 |
||
+ | Trp W 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 9976 1 0 |
||
+ | Tyr Y 1 0 3 0 3 0 1 0 4 1 1 0 0 21 0 1 1 2 9945 1 |
||
+ | Val V 13 2 1 1 3 2 2 3 3 57 11 1 17 1 3 2 10 0 2 9901 |
||
+ | |||
+ | |||
+ | * PAM250: [http://www.icp.ucl.ac.be/~opperd/private/pam250.html http://www.icp.ucl.ac.be] (last accessed 17th Jun 2012) |
||
+ | |||
+ | Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val |
||
+ | A R N D C Q E G H I L K M F P S T W Y V |
||
+ | Ala A 13 6 9 9 5 8 9 12 6 8 6 7 7 4 11 11 11 2 4 9 |
||
+ | Arg R 3 17 4 3 2 5 3 2 6 3 2 9 4 1 4 4 3 7 2 2 |
||
+ | Asn N 4 4 6 7 2 5 6 4 6 3 2 5 3 2 4 5 4 2 3 3 |
||
+ | Asp D 5 4 8 11 1 7 10 5 6 3 2 5 3 1 4 5 5 1 2 3 |
||
+ | Cys C 2 1 1 1 52 1 1 2 2 2 1 1 1 1 2 3 2 1 4 2 |
||
+ | Gln Q 3 5 5 6 1 10 7 3 7 2 3 5 3 1 4 3 3 1 2 3 |
||
+ | Glu E 5 4 7 11 1 9 12 5 6 3 2 5 3 1 4 5 5 1 2 3 |
||
+ | Gly G 12 5 10 10 4 7 9 27 5 5 4 6 5 3 8 11 9 2 3 7 |
||
+ | His H 2 5 5 4 2 7 4 2 15 2 2 3 2 2 3 3 2 2 3 2 |
||
+ | Ile I 3 2 2 2 2 2 2 2 2 10 6 2 6 5 2 3 4 1 3 9 |
||
+ | Leu L 6 4 4 3 2 6 4 3 5 15 34 4 20 13 5 4 6 6 7 13 |
||
+ | Lys K 6 18 10 8 2 10 8 5 8 5 4 24 9 2 6 8 8 4 3 5 |
||
+ | Met M 1 1 1 1 0 1 1 1 1 2 3 2 6 2 1 1 1 1 1 2 |
||
+ | Phe F 2 1 2 1 1 1 1 1 3 5 6 1 4 32 1 2 2 4 20 3 |
||
+ | Pro P 7 5 5 4 3 5 4 5 5 3 3 4 3 2 20 6 5 1 2 4 |
||
+ | Ser S 9 6 8 7 7 6 7 9 6 5 4 7 5 3 9 10 9 4 4 6 |
||
+ | Thr T 8 5 6 6 4 5 5 6 4 6 4 6 5 3 6 8 11 2 3 6 |
||
+ | Trp W 0 2 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 55 1 0 |
||
+ | Tyr Y 1 1 2 1 3 1 1 1 3 2 2 1 2 15 1 2 2 3 31 2 |
||
+ | Val V 7 4 4 4 4 4 4 4 5 4 15 10 4 10 5 5 5 72 4 17 |
||
+ | |||
+ | <br style="clear:both;"> |
||
+ | |||
+ | == Secondary structure == |
||
+ | |||
+ | |||
+ | Sequence and PyMOL editing were performed by hand. The SwissModel models were created with the [http://swissmodel.expasy.org/workspace/index.php?func=modelling_simple1 WebServer] (template: 1a6zC). |
||
+ | |||
+ | <br style="clear:both;"> |
||
+ | |||
+ | == PSSM-Matrix and homolog search == |
||
+ | |||
+ | The PSSM and the homolog search were computed with PSI-BLAST using the command: |
||
+ | blastpgp -i Q30201.fasta -j 5 -Q outMatrix.txt -o output.txt -d /mnt/project/pracstrucfunc12/data/big/big |
||
+ | |||
+ | To retrieve all homologous proteins, we retained only the results of the fifth iteration and used the following script to extract them: |
||
+ | |||
+ | |||
+ | <source lang="perl"> |
||
+ | #!/usr/bin/perl |
||
+ | |||
+ | use strict; |
||
+ | use warnings; |
||
+ | #use autodie; |
||
+ | my $file= shift; |
||
+ | |||
+ | open (FILE, "<", $file) or die "Cannot open $file: $!\n"; |
||
+ | while (<FILE>){ |
||
+ | my $currentLine=$_; |
||
+ | chomp $currentLine; |
||
+ | |||
+ | if ($currentLine=~m/^..\|([^|]+)\|.*\d\d\d/){ |
||
+ | print "$1\n"; |
||
+ | } |
||
+ | } |
||
+ | </source> |
||
+ | |||
+ | Because we only wanted mammalian protein sequences, we used UniProt to retrieve the corresponding entries for these proteins as a flat file. |
||
+ | |||
+ | To get only mammalian proteins we used the following script on the flat file: |
||
+ | |||
+ | |||
+ | |||
+ | <source lang="perl"> |
||
+ | #!/usr/bin/perl |
||
+ | |||
+ | use strict; |
||
+ | use warnings; |
||
+ | #use autodie; |
||
+ | my $file= shift; |
||
+ | |||
+ | my @currentIDs=""; |
||
+ | open (FILE, "<", $file) or die "Cannot open $file: $!\n"; |
||
+ | while (<FILE>){ |
||
+ | my $currentLine=$_; |
||
+ | chomp $currentLine; |
||
+ | |||
+ | if ($currentLine=~m/^AC\s(.*)/gi){ |
||
+ | $currentLine=$1; |
||
+ | # print "$currentLine 1\n"; |
||
+ | $currentLine=~s/\s//gi; |
||
+ | |||
+ | |||
+ | push(@currentIDs, split(/;/,$currentLine)); |
||
+ | #print "$1\n"; |
||
+ | } elsif ($currentLine=~m/^SQ\s(.*)/gi){ |
||
+ | @currentIDs=""; |
||
+ | } elsif ($currentLine=~m/^OC.*Mammalia/gi){ |
||
+ | # print "Mammalian entry found\n"; |
||
+ | foreach my $entry (@currentIDs){ |
||
+ | if ($entry ne ""){ |
||
+ | print "$entry\n"; |
||
+ | } |
||
+ | } |
||
+ | } |
||
+ | } |
||
+ | </source> |
||
+ | |||
+ | This gave us the IDs of all mammalian proteins that are homologous to our protein; we retrieved their sequences through UniProt. |
||
+ | |||
+ | Because some entries list several IDs in the flat file, we kept only the IDs that PSI-BLAST had found by using the |
||
+ | uniq -d |
||
+ | command, which prints only lines that occur more than once in a row. |
||
+ | |||
+ | After adding our own sequence, we aligned these sequences with ClustalW and MUSCLE to obtain the MSAs. |
||
+ | |||
+ | clustalw -align -infile=Mammals.fasta -outfile=MammalsClustalw.fasta |
||
+ | muscle -in Mammals.fasta -out MuscleMSAMammals.fasta |
Latest revision as of 17:17, 18 June 2012
Amino acid features
The values were taken from Wiki: Amino Acids and Wiki: Proteinogenic Amino Acids.
BLOSUM62/PAM1/PAM250
- BLOSUM62: NCBI: BLAST (last accessed 17th Jun 2012)
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
- PAM1: http://www.icp.ucl.ac.be (last accessed 17th Jun 2012)
       Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr  Val
         A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
Ala A 9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
Arg R    1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
Asn N    4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
Asp D    6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
Cys C    1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Gln Q    3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
Glu E   10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
Gly G   21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
His H    1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
Ile I    2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
Leu L    3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
Lys K    2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
Met M    1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
Phe F    1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
Pro P   13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
Ser S   28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
Thr T   22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
Trp W    0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Tyr Y    1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
Val V   13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901
- PAM250: http://www.icp.ucl.ac.be (last accessed 17th Jun 2012)
      Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
Ala A  13   6   9   9   5   8   9  12   6   8   6   7   7   4  11  11  11   2   4   9
Arg R   3  17   4   3   2   5   3   2   6   3   2   9   4   1   4   4   3   7   2   2
Asn N   4   4   6   7   2   5   6   4   6   3   2   5   3   2   4   5   4   2   3   3
Asp D   5   4   8  11   1   7  10   5   6   3   2   5   3   1   4   5   5   1   2   3
Cys C   2   1   1   1  52   1   1   2   2   2   1   1   1   1   2   3   2   1   4   2
Gln Q   3   5   5   6   1  10   7   3   7   2   3   5   3   1   4   3   3   1   2   3
Glu E   5   4   7  11   1   9  12   5   6   3   2   5   3   1   4   5   5   1   2   3
Gly G  12   5  10  10   4   7   9  27   5   5   4   6   5   3   8  11   9   2   3   7
His H   2   5   5   4   2   7   4   2  15   2   2   3   2   2   3   3   2   2   3   2
Ile I   3   2   2   2   2   2   2   2   2  10   6   2   6   5   2   3   4   1   3   9
Leu L   6   4   4   3   2   6   4   3   5  15  34   4  20  13   5   4   6   6   7  13
Lys K   6  18  10   8   2  10   8   5   8   5   4  24   9   2   6   8   8   4   3   5
Met M   1   1   1   1   0   1   1   1   1   2   3   2   6   2   1   1   1   1   1   2
Phe F   2   1   2   1   1   1   1   1   3   5   6   1   4  32   1   2   2   4  20   3
Pro P   7   5   5   4   3   5   4   5   5   3   3   4   3   2  20   6   5   1   2   4
Ser S   9   6   8   7   7   6   7   9   6   5   4   7   5   3   9  10   9   4   4   6
Thr T   8   5   6   6   4   5   5   6   4   6   4   6   5   3   6   8  11   2   3   6
Trp W   0   2   0   0   0   0   0   0   1   0   1   0   0   1   0   1   0  55   1   0
Tyr Y   1   1   2   1   3   1   1   1   3   2   2   1   2  15   1   2   2   3  31   2
Val V   7   4   4   4   4   4   4   4   5   4  15  10   4  10   5   5   5  72   4  17
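To illustrate how a substitution matrix such as BLOSUM62 is used, the following shell sketch sums the scores of two aligned, ungapped sequences of equal length. It assumes the matrix is stored in a text file in the usual NCBI layout (a header row of residue letters, then one labelled row per residue); the function name `pair_score` and the file name in the usage note are hypothetical. The PAM1 and PAM250 tables listed here carry two label columns and appear to hold mutation probabilities (scaled by 10000 and 100) rather than log-odds scores, so the sketch is not meant for them as written.

```shell
# Sum substitution scores over two aligned, ungapped sequences.
# $1: matrix file (header row of residue letters, then one row per residue)
# $2, $3: the two sequences, which must have the same length
pair_score() {
    awk -v a="$2" -v b="$3" '
        /^#/ || NF == 0 { next }    # skip comment lines and blank lines
        !have_header {              # first data line: remember column letters
            for (i = 1; i <= NF; i++) col[i] = $i
            have_header = 1
            next
        }
        {                           # matrix row: $1 is the row letter
            for (i = 2; i <= NF; i++) score[$1, col[i - 1]] = $i
        }
        END {
            if (length(a) != length(b)) {
                print "length mismatch" > "/dev/stderr"
                exit 1
            }
            total = 0
            # residue pairs missing from the matrix silently count as 0
            for (k = 1; k <= length(a); k++)
                total += score[substr(a, k, 1), substr(b, k, 1)]
            print total
        }
    ' "$1"
}
```

For example, with the BLOSUM62 table saved as `BLOSUM62.txt`, `pair_score BLOSUM62.txt ARN ARA` adds the scores for A/A (4), R/R (5) and N/A (-2) and prints 7.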
Secondary structure
Sequence and PyMOL editing were performed by hand. The SwissModel models were created with the WebServer (template: 1a6zC).
PSSM-Matrix and homolog search
The PSSM and the homolog search were computed with PSI-BLAST using the command:
blastpgp -i Q30201.fasta -j 5 -Q outMatrix.txt -o output.txt -d /mnt/project/pracstrucfunc12/data/big/big
To retrieve all homologous proteins, we retained only the results of the fifth iteration and used the following script to extract them:
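<source lang="perl">
#!/usr/bin/perl
# Print the ID of every hit line in a PSI-BLAST report.
use strict;
use warnings;

my $file = shift or die "Usage: $0 <psiblast_output>\n";

open(my $fh, '<', $file) or die "Cannot open $file: $!\n";
while (my $currentLine = <$fh>) {
    chomp $currentLine;
    # Hit lines look like "db|ID|NAME ..."; the line must also contain
    # a run of three digits (for example a bit score of 100 or more).
    if ($currentLine =~ m/^..\|([^|]+)\|.*\d\d\d/) {
        print "$1\n";
    }
}
close($fh);
</source>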
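The step of keeping only the fifth iteration can be sketched in shell as follows. This is a minimal sketch and not necessarily how it was done here: it assumes that the PSI-BLAST report marks the start of each iteration with a line beginning `Results from round N`, and the function name `round_block` and the file name `round5.txt` are hypothetical.

```shell
# Print only the part of a PSI-BLAST report that belongs to one round.
# $1: report file, $2: round number
# Assumes every iteration starts with a line "Results from round N".
round_block() {
    awk -v n="$2" '
        /^Results from round / { keep = ($4 == n) }   # switch at each header
        keep                                          # print while in round n
    ' "$1"
}
```

With the report from the command above, `round_block output.txt 5 > round5.txt` would write the fifth round to a separate file on which the extraction script can then be run. If the search converges before round 5, the result is empty.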
Because we only wanted mammalian protein sequences, we used UniProt to retrieve the corresponding entries for these proteins as a flat file.
To get only mammalian proteins we used the following script on the flat file:
<source lang="perl">
#!/usr/bin/perl
# Print the IDs of all mammalian entries in a UniProt flat file.
use strict;
use warnings;

my $file = shift or die "Usage: $0 <uniprot_flat_file>\n";

my @currentIDs;
open(my $fh, '<', $file) or die "Cannot open $file: $!\n";
while (my $currentLine = <$fh>) {
    chomp $currentLine;

    if ($currentLine =~ m/^AC\s(.*)/) {
        # AC lines hold a semicolon-separated list of IDs.
        (my $ids = $1) =~ s/\s//g;
        push(@currentIDs, split(/;/, $ids));
    } elsif ($currentLine =~ m/^SQ\s/) {
        # The SQ line starts the sequence block: reset for the next entry.
        @currentIDs = ();
    } elsif ($currentLine =~ m/^OC.*Mammalia/i) {
        # The lineage contains Mammalia: print the IDs of this entry.
        print "$_\n" for grep { $_ ne "" } @currentIDs;
    }
}
close($fh);
</source>
This gave us the IDs of all mammalian proteins that are homologous to our protein; we retrieved their sequences through UniProt.
Because some entries list several IDs in the flat file, we kept only the IDs that PSI-BLAST had found by using the
uniq -d
command, which prints only lines that occur more than once in a row.
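One way to obtain such an intersection with `uniq -d` is sketched below, assuming one ID per line in two files, one holding the PSI-BLAST IDs and one holding the mammalian IDs. The function name `common_ids` and the file names in the usage note are hypothetical, and this is an illustration rather than the exact command line used here. Because `uniq -d` only compares adjacent lines, the combined list is sorted first.

```shell
# Print the IDs that occur in both files (one ID per line in each file).
# Each file is de-duplicated first, so an ID that is repeated inside a
# single file is not mistaken for a shared one.
common_ids() {
    { sort -u "$1"; sort -u "$2"; } | sort | uniq -d
}
```

A call such as `common_ids psiblast_ids.txt mammal_ids.txt > mammal_hits.txt` would then leave only the IDs present in both lists.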
After adding our own sequence, we aligned these sequences with ClustalW and MUSCLE to obtain the MSAs.
clustalw -align -infile=Mammals.fasta -outfile=MammalsClustalw.fasta
muscle -in Mammals.fasta -out MuscleMSAMammals.fasta