Difference between revisions of "Task6 Hemochromatosis Protocol"

From Bioinformatikpedia
Casting Spell: Create Protocol Page

== PSSM-Matrix and homolog search ==

The PSSM and the homolog search were computed with PSI-BLAST, using the command:

blastpgp -i Q30201.fasta -j5 -Q outMatrix.txt -o output.txt -d /mnt/project/pracstrucfunc12/data/big/big

To retrieve all homologous proteins, we kept only the results of the fifth iteration and extracted their IDs with the following script:
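
<source lang="perl">
#!/usr/bin/perl
# Print the ID of every hit line in a PSI-BLAST report.
use strict;
use warnings;

my $file = shift or die "Usage: $0 <psiblast_output>\n";

open(my $fh, '<', $file) or die "Cannot open $file: $!\n";
while (my $currentLine = <$fh>) {
    chomp $currentLine;

    # Hit lines start with a two-letter database tag and a pipe, e.g. "sp|ID|...".
    # The greedy (.+) captures everything up to the last pipe; the rest of the
    # line must contain three consecutive digits.
    if ($currentLine =~ m/^..\|(.+)\|.*\d\d\d.*/i) {
        print "$1\n";
    }
}
close($fh);
</source>
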
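The protocol does not show how the fifth iteration was isolated from output.txt. The following is a minimal sketch of one way to do it, assuming that the report marks each iteration with a line starting with "Results from round", as the legacy blastpgp text output does. The file names toy_output.txt and last_round.txt and the toy report contents are hypothetical.

```shell
# Toy stand-in for a multi-round PSI-BLAST report (hypothetical content).
cat > toy_output.txt <<'EOF'
Results from round 1
sp|P00001|A_HUMAN  example hit   120  1e-30
Results from round 2
sp|P00002|B_MOUSE  example hit   150  2e-40
sp|P00003|C_RAT    example hit   110  5e-25
EOF

# Line number of the last "Results from round" marker in the report.
last_round_line=$(grep -n '^Results from round' toy_output.txt | tail -n 1 | cut -d: -f1)

# Keep everything from that marker to the end of the file.
tail -n +"$last_round_line" toy_output.txt > last_round.txt

cat last_round.txt
```

The extraction script above could then be run on last_round.txt instead of the full report, so that hits seen only in earlier rounds are not printed.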

Because we only wanted mammalian protein sequences, we used UniProt to retrieve the entries for these proteins in flat-file format.

To keep only mammalian proteins, we ran the following script on the flat file:

<source lang="perl">
#!/usr/bin/perl
# Print the accession numbers of all mammalian entries in a UniProt flat file.
use strict;
use warnings;

my $file = shift or die "Usage: $0 <uniprot_flat_file>\n";

my @currentIDs = ();
open(my $fh, '<', $file) or die "Cannot open $file: $!\n";
while (my $currentLine = <$fh>) {
    chomp $currentLine;

    if ($currentLine =~ m/^AC\s(.*)/i) {
        # AC lines list the accessions of the entry, separated by semicolons.
        my $accessions = $1;
        $accessions =~ s/\s//g;
        push(@currentIDs, split(/;/, $accessions));
    } elsif ($currentLine =~ m/^SQ\s/i) {
        # The SQ line follows the AC and OC lines, so reset for the next entry.
        @currentIDs = ();
    } elsif ($currentLine =~ m/^OC.*Mammalia/i) {
        # The lineage contains Mammalia: print all accessions of this entry.
        print "$_\n" for @currentIDs;
    }
}
close($fh);
</source>

This gave us the IDs of all mammalian proteins that are homologous to our protein, and we retrieved their sequences through UniProt.

Because some proteins have several IDs listed in the flat file, we kept only the IDs that PSI-BLAST had found, using the command:

uniq -d
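
Note that uniq -d only prints lines that are repeated on adjacent lines, so its input has to be sorted. The exact pipeline is not given here. One plausible sketch, assuming the PSI-BLAST IDs and the mammalian IDs are stored in two files with one ID per line (psiblast_ids.txt, mammal_ids.txt, mammal_hits.txt and their toy contents are hypothetical):

```shell
# Toy ID lists (hypothetical accessions).
printf '%s\n' P00001 P00002 P00003 > psiblast_ids.txt
printf '%s\n' P00002 P00003 Q99999 > mammal_ids.txt

# Deduplicate each list on its own, then concatenate and sort so that
# IDs present in both lists end up on adjacent lines.
# uniq -d prints one copy of each such repeated ID.
{ sort -u psiblast_ids.txt; sort -u mammal_ids.txt; } | sort | uniq -d > mammal_hits.txt

cat mammal_hits.txt
```

Deduplicating each list first matters: otherwise an ID that is repeated within a single list would be reported even if it is missing from the other list.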

After adding our own sequence, we aligned these sequences with ClustalW and MUSCLE to obtain the MSAs:

clustalw -align -infile=Mammals.fasta -outfile=MammalsClustalw.fasta

muscle -in Mammals.fasta -out MuscleMSAMammals.fasta
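
Both aligners read Mammals.fasta, so that file has to contain the query sequence as well as the retrieved homologs. A minimal sketch of that preparation step with toy data; homologs.fasta and all sequence contents are hypothetical stand-ins, not the real sequences:

```shell
# Toy query and homolog files (hypothetical sequences).
cat > Q30201.fasta <<'EOF'
>Q30201 toy query sequence
ACDEFGHIKL
EOF
cat > homologs.fasta <<'EOF'
>P00002 toy homolog
ACDEFGHLKL
>P00003 toy homolog
ACDQFGHIKL
EOF

# Put the query first, followed by the homologs.
cat Q30201.fasta homologs.fasta > Mammals.fasta

# Sanity check: the number of FASTA headers should be homologs + 1.
grep -c '^>' Mammals.fasta
```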

Revision as of 11:38, 17 June 2012
