Difference between revisions of "Fabry:Sequence alignments (sequence searches and multiple alignments)/Journal"

From Bioinformatikpedia
 
(36 intermediate revisions by 2 users not shown)
Line 1: Line 1:
  +
[[Fabry Disease]] » [[Fabry:Sequence alignments (sequence searches and multiple alignments)|Sequence alignments]] » Journal
Please see [[Fabry:Sequence_alignments_(sequence_searches_and_multiple_alignments):Results | Task 2 Results]] for our results on this topic.
 
  +
<hr>
Please see also [[Fabry:Sequence_alignments_(sequence_searches_and_multiple_alignments):Scripts | Task 2 Scripts]] for the used scripts.
 
   
   
  +
Please see [[Fabry:Sequence_alignments_(sequence_searches_and_multiple_alignments):Scripts | Task 2 Scripts]] for the used scripts.
== Reference sequence ==
 
   
The reference sequence of [[Alpha-galactosidase|α-Galactosidase A]] that will be used in the following tasks was obtained from Swissprot [http://www.uniprot.org/uniprot/P06280 P06280].
 
 
>gi|4504009|ref|NP_000160.1| alpha-galactosidase A precursor [Homo sapiens]
 
MQLRNPELHLGCALALRFLALVSWDIPGARALDNGLARTPTMGWLHWERFMCNLDCQEEPDSCISEKLFM
 
EMAELMVSEGWKDAGYEYLCIDDCWMAPQRDSEGRLQADPQRFPHGIRQLANYVHSKGLKLGIYADVGNK
 
TCAGFPGSFGYYDIDAQTFADWGVDLLKFDGCYCDSLENLADGYKHMSLALNRTGRSIVYSCEWPLYMWP
 
FQKPNYTEIRQYCNHWRNFADIDDSWKSIKSILDWTSFNQERIVDVAGPGGWNDPDMLVIGNFGLSWNQQ
 
VTQMALWAIMAAPLFMSNDLRHISPQAKALLQDKDVIAINQDPLGKQGYQLRQGDNFEVWERPLSGLAWA
 
VAMINRQEIGGPRSYTIAVASLGKGVACNPACFITQLLPVKRKLGFYEWTSRLRSHINPTGTVLLQLENT
 
MQMSLKDLL
 
   
 
== Sequence searches ==
 
== Sequence searches ==
Line 21: Line 11:
   
 
blastall -p blastp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i P06280.fasta -m 0 -o blastsearch_default.out -v 700 -b 700
 
blastall -p blastp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i P06280.fasta -m 0 -o blastsearch_default.out -v 700 -b 700
./extract_ids_blast.sh blastsearch_default.out
+
perl <span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/extract_ids_blast.pl.html extract_ids_blast.pl]</span> blastsearch_default.out
perl ../download-annotation.pl blastsearch_default_ids.txt
+
perl ../<span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/download-annotation.pl.html download-annotation.pl]</span> blastsearch_default_ids.txt
perl ../compare_GO_terms.pl P06280 blastsearch_default_ids_GOterms.tsv
+
perl ../<span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/compare_GO_terms.pl.html compare_GO_terms.pl]</span> P06280 blastsearch_default_ids_GOterms.tsv
perl parse_blast.pl blastsearch_default.out
+
perl <span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/parse_blast.pl.html parse_blast.pl]</span> blastsearch_default.out
  +
  +
R CMD BATCH <span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/hist_blast.R.html hist_blast.R]</span>
   
  +
=== Psi-Blast ===
   
  +
The following command was used to run Psi-Blast with AGAL as query sequence against big80. It was run with two and ten iterations configured and an e-value cut-off of 2e-3 and 1e-9, respectively.
The run took about 2 minutes (see section [[Sequence_alignments_(sequence_searches_and_multiple_alignments)#Time | Time]])
 
   
  +
$ bash <span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/run_psi_blast.sh.html run_psi_blast.sh]</span> &> run_psi_blast.log
=== Psi-Blast ===
 
   
  +
The log file contains the runtimes of the different psi-blast runs:
Iterations: 2
 
  +
Evalue: 0.002
 
  +
{| style="border-spacing: 0em; text-align: center; margin: 2em 0px;"
 
  +
|-
real 3m30.256s
 
  +
! scope="col" style="border-bottom: 2px solid #000; padding: 0px 1em;" | Iterations
user 2m58.070s
 
  +
! scope="col" style="border-bottom: 2px solid #000; padding: 0px 1em; border-left: 2px solid #000;" | E-value cut-off
sys 0m13.360s
 
  +
! scope="col" style="border-bottom: 2px solid #000; padding: 0px 2em; border-left: 2px solid #000;" | Runtime
 
  +
|-
Iterations: 2
 
  +
| 2
Evalue: 0.000000001
 
  +
| style="border-left: 2px solid #000;" | 2e-3
 
  +
| style="border-left: 2px solid #000;" | 3m0.814s
real 3m8.507s
 
  +
|-
user 3m5.180s
 
  +
| 2
sys 0m2.400s
 
  +
| style="border-left: 2px solid #000;" | 1e-9
 
  +
| style="border-left: 2px solid #000;" | 3m9.422s
Iterations: 2
 
  +
|-
Evalue: 0.0000000001
 
  +
| 10
 
  +
| style="border-left: 2px solid #000;" | 2e-3
real 3m10.271s
 
  +
| style="border-left: 2px solid #000;" | 14m29.179s
user 3m7.620s
 
  +
|-
sys 0m2.190s
 
  +
| 10
 
  +
| style="border-left: 2px solid #000;" | 1e-9
Iterations: 10
 
  +
| style="border-left: 2px solid #000;" | 15m39.251s
Evalue: 0.002
 
  +
|-
 
  +
|}
real 15m29.218s
 
  +
user 15m8.910s
 
  +
Afterwards, the psi-blast output was parsed to collect the all the information about all the hits of the last iteration, which include the e-value, the sequence identity, the coverage in the longer sequence of the pairwise alignment and the length of the alignment. When there were more than one alignment per hit, we used the first one which was also listed in the short result output.
sys 0m12.730s
 
  +
 
  +
$ for i in psi_results_*.txt; do
Iterations: 10
 
  +
perl <span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/parse_psiblast.pl.html parse_psiblast.pl]</span> "$i" > "${i%.*}.stats"
Evalue: 0.000000001
 
  +
done
 
  +
real 16m33.748s
 
  +
The histograms were generated with the [https://dl.dropbox.com/u/13796643/fabry/msa_scripts/generate_histograms.sh.html generate_histograms.sh] script:
user 16m12.500s
 
  +
$ bash <span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/generate_histograms.sh.html generate_histograms.sh]</span> *.stats
sys 0m13.080s
 
  +
 
  +
GO term comparison
Iterations: 10
 
  +
Evalue: 0.0000000001
 
  +
$ perl ../<span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/download-annotation.pl.html download-annotation.pl]</span> ids_psiblast_10its_eVal_1e-9.txt
 
  +
$ perl ../<span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/compare_GO_terms.pl.html compare_GO_terms.pl]</span> P06280 ids_psiblast_10its_eVal_1e-9_GOterms.tsv
real 16m20.137s
 
  +
$ bash <span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/hist_psiblast.sh.html hist_psiblast.sh]</span> ids_psiblast_10its_eVal_1e-9_GOterms_comparison.txt
user 15m55.910s
 
sys 0m13.190s
 
   
 
=== HHblits / HHsearch ===
 
=== HHblits / HHsearch ===
Line 77: Line 69:
   
 
time hhblits -i ../P06280.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -e 0.003 -o hhblits_default.out -E 0.003 -z 700
 
time hhblits -i ../P06280.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -e 0.003 -o hhblits_default.out -E 0.003 -z 700
./extract_ids_hhblits.sh hhblits_default.out
+
./<span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/extract_ids_hhblits.sh.html extract_ids_hhblits.sh]</span> hhblits_default.out
  +
perl <span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/parse_hhblits.pl.html parse_hhblits.pl]</span> hhblits_default.out
perl ../download-annotation.pl hhblits_default_ids.txt
 
  +
perl ../<span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/download-annotation.pl.html download-annotation.pl]</span> hhblits_default.out_cluster_ids_only.tsv
perl ../compare_GO_terms.pl P06280 hhblits_default_ids_GOterms.tsv
 
  +
perl ../<span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/compare_GO_terms.pl.html compare_GO_terms.pl]</span> P06280 hhblits_default.out_cluster_ids_only_GOterms.tsv
perl parse_hhblits.pl hhblits_default.out
 
 
 
 
time hhblits -i ../P06280.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -e 0.003 -o hhblits_n8_neu.out -E 0.003 -n 8 -z 800 -b 800
 
time hhblits -i ../P06280.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -e 0.003 -o hhblits_n8_neu.out -E 0.003 -n 8 -z 800 -b 800
./extract_ids_hhblits.sh hhblits_n8_neu.out
+
./<span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/extract_ids_hhblits.sh.html extract_ids_hhblits.sh]</span> hhblits_n8_neu.out
  +
perl <span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/parse_hhblits.pl.html parse_hhblits.pl]</span> hhblits_n8_neu.out
perl ../download-annotation.pl hhblits_n8_neu_ids.txt
 
  +
perl ../<span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/download-annotation.pl.html download-annotation.pl]</span> hhblits_n8_neu.out_cluster_ids_only.tsv
perl ../compare_GO_terms.pl P06280 hhblits_n8_neu_ids_GOterms.tsv
 
  +
perl ../<span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/compare_GO_terms.pl.html compare_GO_terms.pl]</span> P06280 hhblits_n8_neu.out_cluster_ids_only_GOterms.tsv
perl parse_hhblits.pl hhblits_n8_neu.out
 
 
 
R CMD BATCH hist_hhblits.R
+
R CMD BATCH <span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/hist_hhblits.R.html hist_hhblits.R]</span>
 
The first HHblits run took about 2.5 minutes, the second one about 16 minutes (see section [[Sequence_alignments_(sequence_searches_and_multiple_alignments)#Time | Time]]).
 
 
   
 
== Comparison ==
 
== Comparison ==
Line 97: Line 86:
 
Venn diagrams created with [http://bioinfogp.cnb.csic.es/tools/venny/index.html Oliveros, J.C. (2007) VENNY. An interactive tool for comparing lists with Venn Diagrams.]
 
Venn diagrams created with [http://bioinfogp.cnb.csic.es/tools/venny/index.html Oliveros, J.C. (2007) VENNY. An interactive tool for comparing lists with Venn Diagrams.]
   
>R CMD BATCH all_Evalues.R
+
>R CMD BATCH <span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/all_Evalues.R.html all_Evalues.R]</span>
  +
  +
== Multiple sequence alignments ==
  +
  +
The following commands were used in our bash script [https://dl.dropbox.com/u/13796643/fabry/msa_scripts/calculate_msas.sh.html calculate_msas.sh] to generate the multiple sequence alignments. The pictures were obtained by using [http://www.jalview.org/ jalview].
  +
  +
$ clustalw -infile="<filename>.fasta" -outfile="msa/clustalw_<filename>.msa" &
  +
  +
$ muscle -in "<filename>.fasta" -out "msa/muscle_<filename>.msa" &
  +
  +
$ /mnt/opt/T-Coffee/bin/t_coffee -seq "<filename>.fasta" -outfile "msa/tcoffe_<filename>.msa" &
  +
  +
$ /mnt/opt/T-Coffee/bin/t_coffee -seq "<filename>.fasta" -method sap_pair -template_file "<filename>.pdb" \
  +
-outfile "msa/3Dcoffee_<filename>.msa" &
  +
  +
We counted the number of gaps and conserved columns with the perl script [https://dl.dropbox.com/u/13796643/fabry/msa_scripts/countGaps.pl.html countGaps.pl]. There is also a small wrapper script - [https://dl.dropbox.com/u/13796643/fabry/msa_scripts/countAllGaps.sh.html countAllGaps.sh] which runs countGaps.pl on all .msa files in a specific folder:
  +
  +
#!/bin/bash
  +
  +
for file in msa/*.msa; do
  +
perl <span class="plainlinks">[https://dl.dropbox.com/u/13796643/fabry/msa_scripts/countGaps.pl.html countGaps.pl]</span> "$file" > "${file%.*}.counts"
  +
done
  +
  +
[[Category: Fabry Disease 2012]]

Latest revision as of 17:06, 9 May 2012

Fabry Disease » Sequence alignments » Journal



Please see Task 2 Scripts for the used scripts.


Sequence searches

Blast

We searched the "big80" database with Blast with the following command:

blastall -p blastp -d /mnt/project/pracstrucfunc12/data/big/big_80 -i P06280.fasta -m 0 -o blastsearch_default.out -v 700 -b 700
perl extract_ids_blast.pl blastsearch_default.out
perl ../download-annotation.pl blastsearch_default_ids.txt
perl ../compare_GO_terms.pl P06280 blastsearch_default_ids_GOterms.tsv
perl parse_blast.pl blastsearch_default.out

R CMD BATCH hist_blast.R

Psi-Blast

The following command was used to run Psi-Blast with AGAL as query sequence against big80. It was run with two and ten iterations configured and an e-value cut-off of 2e-3 and 1e-9, respectively.

$ bash run_psi_blast.sh &> run_psi_blast.log

The log file contains the runtimes of the different psi-blast runs:

Iterations E-value cut-off Runtime
2 2e-3 3m0.814s
2 1e-9 3m9.422s
10 2e-3 14m29.179s
10 1e-9 15m39.251s

Afterwards, the psi-blast output was parsed to collect the all the information about all the hits of the last iteration, which include the e-value, the sequence identity, the coverage in the longer sequence of the pairwise alignment and the length of the alignment. When there were more than one alignment per hit, we used the first one which was also listed in the short result output.

$ for i in psi_results_*.txt; do
    perl parse_psiblast.pl "$i" > "${i%.*}.stats"
  done

The histograms were generated with the generate_histograms.sh script:

$ bash generate_histograms.sh *.stats

GO term comparison

$ perl ../download-annotation.pl ids_psiblast_10its_eVal_1e-9.txt
$ perl ../compare_GO_terms.pl P06280 ids_psiblast_10its_eVal_1e-9_GOterms.tsv
$ bash hist_psiblast.sh ids_psiblast_10its_eVal_1e-9_GOterms_comparison.txt

HHblits / HHsearch

We searched the "big80" database with HHblits using the default settings and also with the maximum number of possible iterations (8) with the following commands:

time hhblits -i ../P06280.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -e 0.003 -o hhblits_default.out -E 0.003  -z 700
./extract_ids_hhblits.sh hhblits_default.out
perl parse_hhblits.pl hhblits_default.out
perl ../download-annotation.pl hhblits_default.out_cluster_ids_only.tsv
perl ../compare_GO_terms.pl P06280 hhblits_default.out_cluster_ids_only_GOterms.tsv

time hhblits -i ../P06280.fasta -d /mnt/project/pracstrucfunc12/data/hhblits/uniprot20_current -e 0.003 -o hhblits_n8_neu.out -E 0.003 -n 8 -z 800 -b 800
./extract_ids_hhblits.sh hhblits_n8_neu.out
perl parse_hhblits.pl hhblits_n8_neu.out
perl ../download-annotation.pl hhblits_n8_neu.out_cluster_ids_only.tsv
perl ../compare_GO_terms.pl P06280 hhblits_n8_neu.out_cluster_ids_only_GOterms.tsv

R CMD BATCH hist_hhblits.R

Comparison

Venn diagrams created with Oliveros, J.C. (2007) VENNY. An interactive tool for comparing lists with Venn Diagrams.

  >R CMD BATCH all_Evalues.R

Multiple sequence alignments

The following commands were used in our bash script calculate_msas.sh to generate the multiple sequence alignments. The pictures were obtained by using jalview.

$ clustalw -infile="<filename>.fasta" -outfile="msa/clustalw_<filename>.msa" &

$ muscle -in "<filename>.fasta" -out "msa/muscle_<filename>.msa" &

$ /mnt/opt/T-Coffee/bin/t_coffee -seq "<filename>.fasta" -outfile "msa/tcoffe_<filename>.msa" &

$ /mnt/opt/T-Coffee/bin/t_coffee -seq "<filename>.fasta" -method sap_pair -template_file "<filename>.pdb" \
    -outfile "msa/3Dcoffee_<filename>.msa" &

We counted the number of gaps and conserved columns with the perl script countGaps.pl. There is also a small wrapper script - countAllGaps.sh which runs countGaps.pl on all .msa files in a specific folder:

#!/bin/bash

for file in msa/*.msa; do
	perl countGaps.pl "$file" > "${file%.*}.counts"
done