Difference between revisions of "ARS A Sequence alignments"

From Bioinformatikpedia
m (Blast)
m (PSI-Blast)
Line 17: Line 17:
 
=== PSI-Blast ===
 
=== PSI-Blast ===
   
# run PSI-Blast with different combinations of parameters
 
> blastpgp -d /mnt/project/pracstrucfunc12/data/big/big_80 -m 8 -b 10000 -j 2 -i ARSA.fas -o ARSA.psiBlast.j2.hDefault
 
280.470u 30.420s 5:55.56 87.4% 0+0k 18005712+1880io 792pf+0w
 
> blastpgp -d /mnt/project/pracstrucfunc12/data/big/big_80 -m 8 -b 10000 -j 10 -i ARSA.fas -o ARSA.psiBlast.j10.hDefault
 
[blastpgp] ERROR: ncbiapi [000.000] ObjMgrNextAvailEntityID failed with idx 2048
 
2111.750u 44.740s 36:55.25 97.3% 0+0k 12951072+11856io 1271pf+0w
 
> blastpgp -d /mnt/project/pracstrucfunc12/data/big/big_80 -m 8 -b 10000 -j 2 -h 1e-10 -i ARSA.fas -o ARSA.psiBlast.j2.h1e-10
 
> blastpgp -d /mnt/project/pracstrucfunc12/data/big/big_80 -m 8 -b 10000 -j 10 -h 1e-10 -i ARSA.fas -o ARSA.psiBlast.j10.h1e-10
 
   
 
* Running iterative blasts takes a while (see table below). The more iterations the longer the run-time. However, decreasing the inclusion threshold speeds up the process.
 
* Running iterative blasts takes a while (see table below). The more iterations the longer the run-time. However, decreasing the inclusion threshold speeds up the process.
   
 
{| class="wikitable"
 
{| class="wikitable"
! scope="col" align="left"| Parameter
+
! scope="col" align="left"| Runtime [s]
 
! scope="col"| j = 2
 
! scope="col"| j = 2
 
! scope="col"| j = 10
 
! scope="col"| j = 10
 
|-
 
|-
 
! scope="row" align="left" | h = 0.002
 
! scope="row" align="left" | h = 0.002
| align="right" | 280 s
+
| align="right" | 280
| align="right" | 2111 s
+
| align="right" | 2111
 
|-
 
|-
 
! scope="row" align="left" | h = 10E-10
 
! scope="row" align="left" | h = 10E-10
| align="right" | s
+
| align="right" |
 
|
 
|
 
|-
 
|-
 
|}
 
|}
   
  +
* The number of unique matches increases with more iterations.
# To evaluate the final results, we have to dissect the output into separate files: Look for the first hit, find at what line numbers it occurs and then cut the files accordingly.
 
  +
{| class="wikitable"
grep -n G3IH84 ARSA.psiBlast.j2.hDefault
 
  +
! scope="col" align="left"| # unique matches
tail -n +4679 ARSA.psiBlast.j2.hDefault > ARSA.psiBlast.j2.hDefault.lastIter
 
  +
! scope="col"| j = 2
  +
! scope="col"| j = 10
  +
|-
  +
! scope="row" align="left" | h = 0.002
  +
| align="right" |
  +
| align="right" |
  +
|-
  +
! scope="row" align="left" | h = 10E-10
  +
| align="right" |
  +
|
  +
|-
  +
|}
   
 
=== Comparison ===
 
=== Comparison ===

Revision as of 13:43, 12 April 2012

In this task, we explore the sequence space around human lysosomal arylsulfatase A (ARS A).

Sequence searches

We compare different methods to search a database of non-redundant proteins (for details see protocol).

Blast

BLAST result: distribution of percent sequence identity of unique matches
BLAST result: distribution of e-values of unique matches
  • This simple search finds 3763 sequence matches with an e-value below the default reporting cutoff of 10.
  • Of these, 3120 have an e-value that meets the default inclusion threshold of the iterative BLAST (h = 0.002).
  • Some sequences are matched twice with different alignments, which leaves 3513 unique sequence matches.
  • The distributions of percent sequence identity and e-values show that many matches lie between 20 and 40 percent sequence identity and that most e-values are around 10^-6.
  • The BLAST search in big_80 finds only one matching PDB entry. This is partly due to the clustering used to build big_80 (based on CD-HIT), which prefers long sequences over shorter ones.
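
The three counts above can be derived from a tabular report with a few lines of shell. The following is a minimal sketch, assuming the search was written with `-m 8` (the tabular format used by the PSI-BLAST commands on this page), so that column 2 holds the subject ID and column 11 the e-value. The function name `summarise_hits` and the file name `ARSA.blast` are hypothetical, and "unique" is taken here to mean distinct subject IDs.

```shell
# Summarise a tabular BLAST report (-m 8 layout assumed):
# column 2 = subject ID, column 11 = e-value.
# $1: report file, $2: inclusion threshold (defaults to 0.002)
summarise_hits() {
    awk -v incl="${2:-0.002}" '
        !/^#/ && NF >= 12 {
            total++                                   # every reported alignment
            if ($11 + 0 <= incl + 0) included++       # passes the inclusion threshold
            if (!($2 in seen)) { seen[$2] = 1; unique++ }  # first time this subject is seen
        }
        END { printf "total=%d included=%d unique=%d\n", total, included, unique }
    ' "$1"
}
```

Calling `summarise_hits ARSA.blast` prints one line with the three counts; a second argument such as `1e-10` replaces the inclusion threshold.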

PSI-Blast

  • Running iterative BLAST searches takes a while (see table below). The more iterations, the longer the run time. However, lowering the inclusion threshold speeds up the process.
Runtime [s] | j = 2 | j = 10
h = 0.002   | 280   | 2111
h = 1e-10   |       |
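
The `time` lines recorded for the two default-threshold runs (for example `280.470u 30.420s 5:55.56 87.4% ...`) list user CPU time, system CPU time and elapsed time. The table values match the user CPU figure, while the j = 2 run took 5:55.56 of elapsed time. The following is a minimal sketch for converting such a line into seconds, assuming the default csh/tcsh `time` layout; `time_to_seconds` is a hypothetical helper name.

```shell
# Convert one line of csh/tcsh `time` output into seconds.
# Field 1 = user CPU ("...u"), field 2 = system CPU ("...s"),
# field 3 = elapsed time as [h:]m:s.
time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        u = $1; sub(/u$/, "", u)          # strip the trailing "u"
        s = $2; sub(/s$/, "", s)          # strip the trailing "s"
        n = split($3, part, ":")          # h:m:s or m:s
        wall = 0
        for (i = 1; i <= n; i++) wall = wall * 60 + part[i]
        printf "user=%.2f sys=%.2f wall=%.2f\n", u, s, wall
    }'
}

time_to_seconds "280.470u 30.420s 5:55.56 87.4% 0+0k 18005712+1880io 792pf+0w"
# user=280.47 sys=30.42 wall=355.56
```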
  • The number of unique matches increases with more iterations.
# unique matches | j = 2 | j = 10
h = 0.002        |       |
h = 1e-10        |       |
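
A tabular report contains the hits of all iterations one after another, so counting the unique matches of a run first requires isolating the last iteration. The notes in the diff above do this by hand with `grep -n G3IH84` followed by `tail -n +4679`. The sketch below automates the same idea. It assumes that the top hit of the first iteration also opens the last iteration and that its rows are contiguous, which may not hold for every run; `last_iteration` and `count_unique` are hypothetical helper names.

```shell
# Print the rows of the last PSI-BLAST iteration from a tabular (-m 8) report.
# $1: report file, $2: ID of the top hit (G3IH84 in the commands above)
last_iteration() {
    start=$(awk -v id="$2" '
        index($2, id) { if (!in_block) start = NR; in_block = 1; next }  # first row of a run of top-hit rows
        { in_block = 0 }
        END { print start }               # empty if the ID never occurs
    ' "$1")
    [ -n "$start" ] || return 1           # top hit not found
    tail -n +"$start" "$1"
}

# Count distinct subject IDs (column 2) read from standard input.
count_unique() {
    cut -f 2 | sort -u | wc -l | tr -d ' '
}
```

For example, `last_iteration ARSA.psiBlast.j2.hDefault G3IH84 | count_unique` would give the number of unique matches in the final iteration of the j = 2 run.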

Comparison

Multiple sequence alignments