Difference between revisions of "ARS A Sequence alignments"
From Bioinformatikpedia
Revision as of 09:39, 2 May 2012
In this task, we explore the sequence space around the Human lysosomal arylsulfatase A (ARS A).
Sequence searches
We compare different methods to search a database of non-redundant proteins (for details see protocol).
Blast
<figure id="fig:ARSA_blastHistogram_id">
</figure>
<figure id="fig:ARSA_blastHistogram_eVal">
</figure>
- This simple search finds 3763 sequence matches with an e-value better (smaller) than the default 10.
- Of these, 3120 have an e-value matching the default criterion for inclusion in the iterative BLAST.
- Some of the sequence matches occur twice with different alignments. The number of unique sequence matches is 3513.
- The distributions of percent sequence identity (see <xr id="fig:ARSA_blastHistogram_id"/>) and e-values (see <xr id="fig:ARSA_blastHistogram_eVal"/>) show that there are many sequence matches between 20 and 40 percent sequence identity and that the majority of e-values are around 10^-6.
- The BLAST search in big_80 finds only one matching PDB entry. However, this is partly due to the clustering used for big_80 (based on CD-HIT), where long sequences are preferred over shorter ones.
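The counts reported in the list above (matches below the reporting cutoff, matches below the inclusion cutoff, unique sequences) can be reproduced from a list of hits along the following lines. This is only a minimal sketch: the `Hit` record, the function `summarize_hits` and the toy data are hypothetical, and the inclusion cutoff of 0.002 is an assumption taken from the larger of the two h values used in the PSI-Blast runs.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    """One BLAST match (hypothetical record, not a BLAST output format)."""
    subject_id: str   # identifier of the matched database sequence
    identity: float   # percent sequence identity of the alignment
    evalue: float     # e-value of the alignment


def summarize_hits(hits: Iterable[Hit],
                   report_cutoff: float = 10.0,
                   inclusion_cutoff: float = 0.002) -> Dict[str, int]:
    """Count reported hits, hits passing the inclusion cutoff, and unique sequences."""
    # Hits with an e-value better (smaller) than the reporting cutoff.
    reported = [h for h in hits if h.evalue < report_cutoff]
    # Subset that would also qualify for inclusion in an iterative search.
    included = [h for h in reported if h.evalue < inclusion_cutoff]
    # A sequence matched twice with different alignments is counted once.
    unique_ids = {h.subject_id for h in reported}
    return {
        "reported": len(reported),
        "included": len(included),
        "unique": len(unique_ids),
    }


# Toy data (made up for illustration): seqA is matched with two alignments.
toy_hits = [
    Hit("seqA", 35.0, 1e-30),
    Hit("seqA", 28.0, 5e-4),
    Hit("seqB", 22.0, 0.5),
    Hit("seqC", 41.0, 1e-6),
    Hit("seqD", 19.0, 12.0),   # worse than the reporting cutoff, dropped
]
summary = summarize_hits(toy_hits)
print(summary)
```

Counting unique identifiers with a set is what separates the number of alignments from the number of matched sequences.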
PSI-Blast
<figtable id="tab:ARSA_psiBlast_runtimes">
Runtime [s] | j = 2 | j = 10 |
---|---|---|
h = 0.002 | 280 | 2111 |
h = 10E-10 | 280 | 2040 |
</figtable>
# unique matches | j = 2 | j = 10 |
---|---|---|
h = 0.002 | 6421 | 7554 |
h = 10E-10 | 7115 | 8366 |
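To make the two tables above easier to compare, the following sketch computes a few ratios from the listed values. The numbers are copied from the tables; the dictionary layout and the helper names are illustrative assumptions, and the h values are kept as the labels used in the tables rather than converted to numbers.

```python
# Values copied from the two tables; keys are the h labels as written there.
runtime_s = {
    "0.002": {2: 280, 10: 2111},
    "10E-10": {2: 280, 10: 2040},
}
unique_matches = {
    "0.002": {2: 6421, 10: 7554},
    "10E-10": {2: 7115, 10: 8366},
}


def iteration_ratio(table, h):
    """Ratio of the j = 10 value to the j = 2 value for a given h."""
    return table[h][10] / table[h][2]


def threshold_ratio(table, j):
    """Ratio of the stricter-h value to the h = 0.002 value for a given j."""
    return table["10E-10"][j] / table["0.002"][j]


for h in runtime_s:
    print(f"h = {h}: runtime x{iteration_ratio(runtime_s, h):.2f}, "
          f"unique matches x{iteration_ratio(unique_matches, h):.2f} "
          f"(j = 10 vs. j = 2)")

for j in (2, 10):
    print(f"j = {j}: runtime x{threshold_ratio(runtime_s, j):.3f}, "
          f"unique matches x{threshold_ratio(unique_matches, j):.3f} "
          f"(stricter h vs. h = 0.002)")
```

In these data, five times as many iterations cost roughly seven times the runtime but yield fewer than 20 percent more unique matches, while the stricter threshold adds about 10 percent more unique matches at nearly the same runtime.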
<figure id="fig:ARSA_psiBlastHistogram_eVal">
</figure>
<figure id="fig:ARSA_psiBlastHistogram_id">
</figure>
- Running iterative BLAST searches takes a while (see <xr id="tab:ARSA_psiBlast_runtimes"/>, table "Runtimes depending on parameters"). The more iterations, the longer the runtime: going from j = 2 to j = 10 raises it from 280 s to more than 2000 s. Decreasing the inclusion threshold speeds up the process only slightly, and only at j = 10 (2040 s vs. 2111 s), presumably because fewer sequences are added to the profile; at j = 2 the runtime is unchanged.
- The number of unique matches increases with more iterations (see table "Number of unique matches depending on parameters"). Notably, it also increases with a decreased (stricter) inclusion threshold. Probably the PSSM built with stricter inclusion is more specific and therefore produces more significant hits.
- <xr id="fig:ARSA_psiBlastHistogram_eVal"/> shows the distribution of e-values for the different PSI-BLAST runs.
- The distribution shifts to higher (less meaningful) e-values with more iterations (compare orange vs. green and blue vs. violet distributions).
- It shifts to lower (more meaningful) e-values with more stringent inclusion cutoff (compare orange vs. blue and green vs. violet distributions).
- Likewise, <xr id="fig:ARSA_psiBlastHistogram_id"/> shows the distribution of sequence identities for the different PSI-BLAST runs.
- Changing the e-value cutoff (orange vs. blue, green vs. violet) hardly changes the distribution.
- Increasing the number of iterations moves the distribution to less sequence identity.
- These observations are as expected: more iterations can cause the profile to drift and become less specific for the protein family, whereas a more stringent inclusion cutoff keeps the profile more specific for the family.
- The sensitivity (ratio of found true positives to all possible hits) and the accuracy (ratio of true positives among all found hits, commonly called precision) of the search are defined here by the hits among PDB sequences.
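The two ratios defined in the last point can be computed from sets of identifiers as sketched below. This is an illustration under assumptions: the function names and the identifiers are placeholders, not actual PDB entries or results of the searches described here. The second ratio (true positives among all found hits) is implemented under its common name, precision.

```python
from typing import Set


def sensitivity(found: Set[str], relevant: Set[str]) -> float:
    """Found true positives divided by all possible (relevant) hits."""
    if not relevant:
        return 0.0
    return len(found & relevant) / len(relevant)


def precision(found: Set[str], relevant: Set[str]) -> float:
    """True positives divided by all found hits."""
    if not found:
        return 0.0
    return len(found & relevant) / len(found)


# Placeholder identifiers (hypothetical, for illustration only).
relevant_pdb = {"entry1", "entry2", "entry3", "entry4", "entry5"}
found_pdb = {"entry1", "entry2", "entry3", "unrelated1"}

print(f"sensitivity = {sensitivity(found_pdb, relevant_pdb):.2f}")
print(f"precision   = {precision(found_pdb, relevant_pdb):.2f}")
```

Both functions return 0.0 for an empty denominator, so a search without any hits among the reference sequences does not raise an error.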