Difference between revisions of "ARS A Sequence alignments"

From Bioinformatikpedia
m (PSI-Blast)
m
 
(25 intermediate revisions by the same user not shown)
Line 1: Line 1:
In this [[task_alignments|task]], we explore the sequence space around the Human lysosomal arylsulfatase A ([[ARS A]]).
+
In this [[task_alignments_2012|task]], we explore the sequence space around the Human lysosomal arylsulfatase A ([[ARS A]]).
   
 
== Sequence searches ==
 
== Sequence searches ==
Line 20: Line 20:
 
* The distributions of percent sequence identity (see <xr id="fig:ARSA_blastHistogram_id"/>) and e-Values (see <xr id="fig:ARSA_blastHistogram_eVal"/>) shows that there are many sequence matches between 20 and 40 percent sequence identity and that the majority of e-Values is around 10^-6.
 
* The distributions of percent sequence identity (see <xr id="fig:ARSA_blastHistogram_id"/>) and e-Values (see <xr id="fig:ARSA_blastHistogram_eVal"/>) shows that there are many sequence matches between 20 and 40 percent sequence identity and that the majority of e-Values is around 10^-6.
 
* The blast search in big_80 finds only one matching pdb entry. However, this is partly due to the way of clustering used for big_80 (based on CD_hit), where long sequences are preferred over shorter ones.
 
* The blast search in big_80 finds only one matching pdb entry. However, this is partly due to the way of clustering used for big_80 (based on CD_hit), where long sequences are preferred over shorter ones.
  +
  +
<br style="clear:both;">
   
 
=== PSI-Blast ===
 
=== PSI-Blast ===
  +
<table id="tab:ARSA_psiBlast_runtimes">
 
  +
<figtable id="tab:ARSA_psiBlast_runtimes">
{| class="wikitable" style="float: right; border: 2px solid darkgray;" border="1"
 
  +
{| class="wikitable" style="float: right; border: 2px solid darkgray;" cellpadding="2"
|+ '''Runtimes depending on parameters'''
 
  +
|+ <caption>''Runtimes depending on parameters''</caption>
 
! scope="col" align="left"| Runtime [s]
 
! scope="col" align="left"| Runtime [s]
 
! scope="col"| j = 2
 
! scope="col"| j = 2
Line 38: Line 41:
 
|-
 
|-
 
|}
 
|}
</table>
+
</figtable>
   
  +
{| class="wikitable" style="float: right; border: 2px solid darkgray;" border="1"
 
  +
* Running iterative blasts takes a while (see <xr id="tab:ARSA_psiBlast_runtimes"/>). The more iterations, the longer the run-time. However, decreasing the inclusion threshold speeds up the process since fewer sequences are added to the profile.
|+ '''Number of unique matches depending on parameters'''
 
  +
! scope="col" align="left"| # unique matches
 
  +
<br style="clear:both;">
  +
  +
<figtable id="tab:ARSA_psiBlast_hits">
  +
{| class="wikitable" style="float: right; border: 2px solid darkgray;" cellpadding="2"
  +
|+ <caption>''Number of unique matches depending on parameters''</caption>
  +
! scope="col" align="left"| # matches
 
! scope="col"| j = 2
 
! scope="col"| j = 2
 
! scope="col"| j = 10
 
! scope="col"| j = 10
Line 55: Line 64:
 
|-
 
|-
 
|}
 
|}
  +
</figtable>
  +
  +
* The number of unique matches increases with more iterations (see <xr id="tab:ARSA_psiBlast_hits"/>). Notably, the number of unique matches increases with decreased inclusion threshold. Probably the PSSM built with stricter inclusion is more specific and therefore produces more significant hits.
  +
  +
<br style="clear:both;">
   
 
<figure id="fig:ARSA_psiBlastHistogram_eVal">
 
<figure id="fig:ARSA_psiBlastHistogram_eVal">
Line 64: Line 78:
 
</figure>
 
</figure>
   
* Running iterative blasts takes a while (see <xr id="tab:ARSA_psiBlast_runtimes"/> table "Runtimes depending on parameters"). The more iterations, the longer the run-time. However, decreasing the inclusion threshold speeds up the process since fewer sequences are added to the profile.
 
* The number of unique matches increases with more iterations (see table "Number of unique matches depending on parameters"). Notably, the number of unique matches increases with decreased inclusion threshold. Probably the PSSM built with stricter inclusion is more specific and therefore produces more significant hits.
 
 
* <xr id="fig:ARSA_psiBlastHistogram_eVal"/> shows the distribution of e-values for the different PSI-BLAST runs.
 
* <xr id="fig:ARSA_psiBlastHistogram_eVal"/> shows the distribution of e-values for the different PSI-BLAST runs.
 
** The distribution shifts to higher (less meaningful) e-values with more iterations (compare orange vs. green and blue vs. violet distributions).
 
** The distribution shifts to higher (less meaningful) e-values with more iterations (compare orange vs. green and blue vs. violet distributions).
Line 73: Line 85:
 
** Increasing the number of iterations moves the distribution to less sequence identity.
 
** Increasing the number of iterations moves the distribution to less sequence identity.
 
* These observations are as expected: More iterations can move the profile to become less significant for the protein family. A more stringent inclusion cutoff makes the profile more sensitive for the family.
 
* These observations are as expected: More iterations can move the profile to become less significant for the protein family. A more stringent inclusion cutoff makes the profile more sensitive for the family.
  +
<br style="clear:both;">
* The sensitivity (ratio of the found true positives to all possible hits) and accuracy (ratio of true positive among all found hits) of the search is defined here by the hits among pdb sequences.
 
  +
  +
  +
  +
* The sensitivity (ratio of the found true positives to all possible hits) and accuracy (ratio of true positive among all found hits) of the search is defined here by the hits among pdb sequences.
  +
** All pdb codes in the same COPS (for more info see http://www.came.sbg.ac.at/resources.php) L30 or L40 group as ARS A (structural representative [http://www.rcsb.org/pdb/explore.do?structureId=1n2l 1n2l]) count as "true positives".
  +
** For "sensitivity" all pdb codes in the same COPS L40 group as ARS A (29 structures) count as possible hits.
  +
** All pdb hits together define the denominator for "accuracy".
  +
  +
<figtable id="tab:ARSA_psiBlast_sensitivity">
  +
{| class="wikitable" style="float: right; border: 2px solid darkgray;" cellpadding="2"
  +
|+ <caption>''Sensitivity depending on parameters''</caption>
  +
! scope="col" align="left" width="100pt"| Sensitivity [%] in L40 / L30
  +
! scope="col"| j = 2
  +
! scope="col"| j = 10
  +
|-
  +
! scope="row" align="left" | h = 0.002
  +
| align="right" | 86 / 64
  +
| align="right" | 72 / 57
  +
|-
  +
! scope="row" align="left" | h = 10E-10
  +
| align="right" | 72 / 57
  +
| align="right" | 72 / 57
  +
|-
  +
|}
  +
</figtable>
  +
  +
* <xr id="tab:ARSA_psiBlast_sensitivity"/> summarises how the sensitivity changes with different parameter settings.
  +
* The sensitivity is identical for almost all settings. Only the profile produced with a low number of iterations and default inclusion parameters retrieves more true positives.
  +
  +
<br style="clear:both;">
  +
  +
<figtable id="tab:ARSA_psiBlast_accuracy">
  +
{| class="wikitable" style="float: right; border: 2px solid darkgray;" cellpadding="2"
  +
|+ <caption>''Accuracy depending on parameters''</caption>
  +
! scope="col" align="left" width="100pt"| Accuracy [%] in L40 / L30
  +
! scope="col"| j = 2
  +
! scope="col"| j = 10
  +
|-
  +
! scope="row" align="left" | h = 0.002
  +
| align="right" | 35 / 61
  +
| align="right" | 47 / 82
  +
|-
  +
! scope="row" align="left" | h = 10E-10
  +
| align="right" | 40 / 62
  +
| align="right" | 24 / 38
  +
|-
  +
|}
  +
</figtable>
  +
  +
* <xr id="tab:ARSA_psiBlast_accuracy"/> summarises how the accuracy changes with different parameter settings.
  +
* The accuracy is highest using the profile with default inclusion parameters and 10 iterations. This is because the fewest total number of pdb hits was found with this search, while still finding the true positives found by the other searches (no significant drop in sensitivity).
  +
* This is not what I would have expected. I thought the accuracy would be highest in the searches with lower inclusion threshold. I also expected the accuracy to go down with increasing the number of iterations and to decrease less with a tighter inclusion threshold. -- This might be a result of the bad statistics.
  +
  +
<figtable id="tab:ARSA_psiBlast_pdb">
  +
{| class="wikitable" style="float: right; border: 2px solid darkgray;" cellpadding="2"
  +
|+ <caption>''Distributions of e-values for different structural similarity groups, depending on parameters''</caption>
  +
! scope="col" align="left"|
  +
! scope="col"| j = 2
  +
! scope="col"| j = 10
  +
|-
  +
! scope="row" align="left" | h = 0.002
  +
| align="right" | [[File:ARSA_psiBlastHistogram_pdb_j2.hDefault_eVal.png|thumb|200px]]
  +
| align="right" | [[File:ARSA_psiBlastHistogram_pdb_j10.hDefault_eVal.png|thumb|200px]]
  +
|-
  +
! scope="row" align="left" | h = 10E-10
  +
| align="right" | [[File:ARSA_psiBlastHistogram_pdb_j2.h1e-10_eVal.png|thumb|200px]]
  +
| align="right" | [[File:ARSA_psiBlastHistogram_pdb_j10.h1e-10_eVal.png|thumb|200px]]
  +
|-
  +
|}
  +
</figtable>
  +
  +
  +
* ''However'', looking at the results in detail (<xr id="tab:ARSA_psiBlast_pdb"/>) reveals that more stringent profiles (lower e-value inclusion cutoff) lead to more separation between the structural similarity groups. This also separates the true and false positives better. Increasing the number of iterations has a similar effect.
  +
  +
<br style="clear:both;">
   
 
=== HHblits ===
 
=== HHblits ===

Latest revision as of 10:52, 25 April 2013

In this task, we explore the sequence space around the Human lysosomal arylsulfatase A (ARS A).

Sequence searches

We compare different methods to search a database of non-redundant proteins (for details see protocol).

Blast

<figure id="fig:ARSA_blastHistogram_id">

BLAST result: distribution of percent sequence identity of unique matches

</figure>

<figure id="fig:ARSA_blastHistogram_eVal">

BLAST result: distribution of e-values of unique matches

</figure>

  • This simple search finds 3763 sequence matches with an e-value better (smaller) than the default 10.
  • Of these, 3120 have an e-value matching the default criterion for inclusion in the iterative BLAST.
  • Some of the sequence matches occur twice with different alignments. The number of unique sequence matches is: 3513.
  • The distributions of percent sequence identity (see <xr id="fig:ARSA_blastHistogram_id"/>) and e-values (see <xr id="fig:ARSA_blastHistogram_eVal"/>) show that many sequence matches lie between 20 and 40 percent sequence identity and that most e-values are around 10^-6.
  • The BLAST search in big_80 finds only one matching pdb entry. This is partly due to the clustering used for big_80 (based on CD-HIT), which prefers long sequences over shorter ones as cluster representatives.
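The three counts reported above (all matches, matches passing the inclusion criterion, unique matches) can be reproduced with a few lines of code. The following is a minimal sketch, assuming the BLAST result is available in the 12-column tabular format (subject identifier in column 2, e-value in column 11) and assuming an inclusion threshold of 0.002; the function name and the example rows are hypothetical and not taken from the original protocol.

```python
def summarize_hits(lines, inclusion_evalue=0.002):
    """Count all, includable, and unique hits in BLAST tabular output.

    Assumes the 12-column tabular format: column 2 holds the subject
    identifier and column 11 the e-value.
    """
    total = 0
    includable = 0
    subjects = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id = fields[1]
        evalue = float(fields[10])
        total += 1
        if evalue <= inclusion_evalue:
            includable += 1
        subjects.add(subject_id)  # repeated alignments collapse here
    return {"total": total, "includable": includable, "unique": len(subjects)}


# Hypothetical example rows: subject_A is hit twice with different alignments.
example_rows = [
    ["ARSA", "subject_A", "35.0", "480", "300", "5", "1", "480", "1", "470", "1e-60", "220"],
    ["ARSA", "subject_A", "28.0", "120", "85", "2", "10", "130", "200", "320", "1e-20", "95"],
    ["ARSA", "subject_B", "22.0", "90", "70", "3", "50", "140", "5", "95", "0.5", "30"],
]
example_lines = ["\t".join(row) for row in example_rows]
summary = summarize_hits(example_lines)
```

In this toy input, two of the three alignments pass the inclusion threshold and the two alignments to subject_A count as one unique match.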


PSI-Blast

<figtable id="tab:ARSA_psiBlast_runtimes">

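{| class="wikitable" style="float: right; border: 2px solid darkgray;" cellpadding="2"
|+ <caption>''Runtimes depending on parameters''</caption>
! scope="col" align="left"| Runtime [s]
! scope="col"| j = 2
! scope="col"| j = 10
|-
! scope="row" align="left" | h = 0.002
| align="right" | 280
| align="right" | 2111
|-
! scope="row" align="left" | h = 10E-10
| align="right" | 280
| align="right" | 2040
|}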

</figtable>


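  • Running iterative BLAST searches takes a while (see <xr id="tab:ARSA_psiBlast_runtimes"/>). The more iterations, the longer the runtime: 10 iterations take roughly seven times as long as 2 iterations. Decreasing the inclusion threshold speeds up the 10-iteration run only slightly (2040 s vs. 2111 s), presumably because fewer sequences are added to the profile; with 2 iterations the runtime is unchanged (280 s).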
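The parameter grid behind the runtime table can be scripted. The sketch below assumes the legacy `blastpgp` program, whose `-j` (number of iterations) and `-h` (e-value threshold for inclusion in the profile) options match the j and h used in the table. The query file, database name, and output file names are placeholders, and the helper functions are illustrative only; they are not the commands from the original protocol.

```python
import subprocess
import time


def build_blastpgp_command(query, database, iterations, inclusion_evalue, output):
    """Assemble a blastpgp call for one (j, h) parameter combination."""
    return [
        "blastpgp",
        "-i", query,                  # query sequence (FASTA)
        "-d", database,               # sequence database
        "-j", str(iterations),        # maximum number of iterations
        "-h", str(inclusion_evalue),  # e-value threshold for profile inclusion
        "-o", output,                 # report file
    ]


def timed_run(command):
    """Run a command and return the wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


# Placeholder file names; the commands are only built here, not executed.
commands = {
    (j, h): build_blastpgp_command("query.fasta", "big_80", j, h, f"psiblast_j{j}_h{h}.out")
    for j in (2, 10)
    for h in (0.002, 1e-10)
}
```

Calling `timed_run(commands[(2, 0.002)])` would then give one cell of the runtime table, provided `blastpgp` and the database are installed.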


<figtable id="tab:ARSA_psiBlast_hits">

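{| class="wikitable" style="float: right; border: 2px solid darkgray;" cellpadding="2"
|+ <caption>''Number of unique matches depending on parameters''</caption>
! scope="col" align="left"| # matches
! scope="col"| j = 2
! scope="col"| j = 10
|-
! scope="row" align="left" | h = 0.002
| align="right" | 6421
| align="right" | 7554
|-
! scope="row" align="left" | h = 10E-10
| align="right" | 7115
| align="right" | 8366
|}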

</figtable>

  • The number of unique matches increases with more iterations (see <xr id="tab:ARSA_psiBlast_hits"/>), e.g. from 6421 to 7554 for h = 0.002. Notably, it also increases when the inclusion threshold is decreased (7115 vs. 6421 for j = 2). Probably the PSSM built with stricter inclusion is more specific and therefore produces more significant hits.


<figure id="fig:ARSA_psiBlastHistogram_eVal">

PSI-BLAST result: distribution of e-values of unique matches

</figure>

<figure id="fig:ARSA_psiBlastHistogram_id">

PSI-BLAST result: distribution of sequence identities of unique matches

</figure>

  • <xr id="fig:ARSA_psiBlastHistogram_eVal"/> shows the distribution of e-values for the different PSI-BLAST runs.
    • The distribution shifts to higher (less meaningful) e-values with more iterations (compare orange vs. green and blue vs. violet distributions).
    • It shifts to lower (more meaningful) e-values with more stringent inclusion cutoff (compare orange vs. blue and green vs. violet distributions).
  • Likewise, <xr id="fig:ARSA_psiBlastHistogram_id"/> shows the distribution of sequence identities for the different PSI-BLAST runs.
    • Changing the e-value cutoff (orange vs. blue, green vs. violet) hardly changes the distribution.
    • Increasing the number of iterations moves the distribution towards lower sequence identity.
  • These observations are as expected: more iterations can make the profile drift and become less specific for the protein family, while a more stringent inclusion cutoff keeps the profile more specific for the family.
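E-value distributions such as the ones discussed above are easiest to compare on a logarithmic scale. The following is a minimal sketch of such a binning, assuming the e-values of the unique matches have already been extracted into a list. The function name and the choice to place e-values of exactly 0 into a fixed lowest bin are assumptions made for this illustration.

```python
import math
from collections import Counter


def log10_evalue_histogram(evalues, zero_bin=-200):
    """Bin e-values by the floor of their decimal exponent.

    E-values reported as 0 (or below the lowest bin) are placed
    in `zero_bin`, since log10(0) is undefined.
    """
    counts = Counter()
    for evalue in evalues:
        if evalue <= 0.0:
            exponent = zero_bin
        else:
            exponent = max(zero_bin, math.floor(math.log10(evalue)))
        counts[exponent] += 1
    return dict(sorted(counts.items()))


# Hypothetical e-values, not taken from the actual search results.
histogram = log10_evalue_histogram([2e-6, 3e-6, 5e-40, 0.0])
```

Computing one such histogram per PSI-BLAST run makes a shift of the distribution between runs directly visible as a change in the most populated exponents.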



  • The sensitivity (ratio of the found true positives to all possible hits) and the accuracy (fraction of true positives among all found hits, i.e. the precision) of the search are defined here by the hits among pdb sequences.
    • All pdb codes in the same COPS (for more info see http://www.came.sbg.ac.at/resources.php) L30 or L40 group as ARS A (structural representative 1n2l) count as "true positives".
    • For "sensitivity" all pdb codes in the same COPS L40 group as ARS A (29 structures) count as possible hits.
    • All pdb hits together define the denominator for "accuracy".
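The two measures defined above can be written down compactly. This is a minimal sketch, assuming the pdb hits of a search and the members of the COPS group are available as collections of pdb codes. The function name and the four-letter codes in the example are placeholders, not real search results.

```python
def sensitivity_and_accuracy(pdb_hits, group_members):
    """Return (sensitivity, accuracy) as defined in the text.

    sensitivity = true positives / all members of the structural group
    accuracy    = true positives / all pdb hits found by the search
    """
    hits = set(pdb_hits)
    group = set(group_members)
    true_positives = hits & group
    sensitivity = len(true_positives) / len(group) if group else 0.0
    accuracy = len(true_positives) / len(hits) if hits else 0.0
    return sensitivity, accuracy


# Placeholder pdb codes: two of three hits belong to a group of four.
example_sensitivity, example_accuracy = sensitivity_and_accuracy(
    ["aaaa", "bbbb", "xxxx"],
    ["aaaa", "bbbb", "cccc", "dddd"],
)
```

In this toy case the search recovers half of the group (sensitivity 0.5), and two thirds of its pdb hits are true positives (accuracy about 0.67).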

<figtable id="tab:ARSA_psiBlast_sensitivity">

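{| class="wikitable" style="float: right; border: 2px solid darkgray;" cellpadding="2"
|+ <caption>''Sensitivity depending on parameters''</caption>
! scope="col" align="left" width="100pt"| Sensitivity [%] in L40 / L30
! scope="col"| j = 2
! scope="col"| j = 10
|-
! scope="row" align="left" | h = 0.002
| align="right" | 86 / 64
| align="right" | 72 / 57
|-
! scope="row" align="left" | h = 10E-10
| align="right" | 72 / 57
| align="right" | 72 / 57
|}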

</figtable>

  • <xr id="tab:ARSA_psiBlast_sensitivity"/> summarises how the sensitivity changes with different parameter settings.
  • The sensitivity is identical for almost all settings (72 / 57 %). Only the profile produced with few iterations (j = 2) and the default inclusion threshold (h = 0.002) retrieves more true positives (86 / 64 %).


<figtable id="tab:ARSA_psiBlast_accuracy">

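{| class="wikitable" style="float: right; border: 2px solid darkgray;" cellpadding="2"
|+ <caption>''Accuracy depending on parameters''</caption>
! scope="col" align="left" width="100pt"| Accuracy [%] in L40 / L30
! scope="col"| j = 2
! scope="col"| j = 10
|-
! scope="row" align="left" | h = 0.002
| align="right" | 35 / 61
| align="right" | 47 / 82
|-
! scope="row" align="left" | h = 10E-10
| align="right" | 40 / 62
| align="right" | 24 / 38
|}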

</figtable>

  • <xr id="tab:ARSA_psiBlast_accuracy"/> summarises how the accuracy changes with different parameter settings.
  • The accuracy is highest using the profile with the default inclusion threshold and 10 iterations (47 / 82 %). This is because this search found the smallest total number of pdb hits, while still finding the true positives found by the other searches (no significant drop in sensitivity).
  • This is not what I would have expected. I thought the accuracy would be highest in the searches with the lower inclusion threshold. I also expected the accuracy to go down with an increasing number of iterations, and to decrease less with a tighter inclusion threshold. This might be a result of the poor statistics: the evaluation rests on a small number of pdb structures (29 in the L40 group).
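How much the small number of structures limits these comparisons can be illustrated with a confidence interval for a proportion. The sketch below uses the Wilson score interval, which is a technique chosen for this illustration and not part of the original analysis. The counts 25 and 21 are assumptions, back-calculated from the reported L40 sensitivities of 86 % and 72 % of 29 structures.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives about 95 %)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denominator = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denominator
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denominator
    )
    return center - half_width, center + half_width


# Assumed counts: 25/29 is about 86 %, 21/29 is about 72 %.
high_interval = wilson_interval(25, 29)
low_interval = wilson_interval(21, 29)
```

Under these assumptions the two intervals overlap considerably, so a difference of this size between parameter settings should be interpreted with caution.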

<figtable id="tab:ARSA_psiBlast_pdb">

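{| class="wikitable" style="float: right; border: 2px solid darkgray;" cellpadding="2"
|+ <caption>''Distributions of e-values for different structural similarity groups, depending on parameters''</caption>
! scope="col" align="left"|
! scope="col"| j = 2
! scope="col"| j = 10
|-
! scope="row" align="left" | h = 0.002
| align="right" | [[File:ARSA_psiBlastHistogram_pdb_j2.hDefault_eVal.png|thumb|200px]]
| align="right" | [[File:ARSA_psiBlastHistogram_pdb_j10.hDefault_eVal.png|thumb|200px]]
|-
! scope="row" align="left" | h = 10E-10
| align="right" | [[File:ARSA_psiBlastHistogram_pdb_j2.h1e-10_eVal.png|thumb|200px]]
| align="right" | [[File:ARSA_psiBlastHistogram_pdb_j10.h1e-10_eVal.png|thumb|200px]]
|}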

</figtable>


  • However, looking at the results in detail (<xr id="tab:ARSA_psiBlast_pdb"/>) reveals that more stringent profiles (lower e-value inclusion cutoff) lead to more separation between the structural similarity groups. This also separates the true and false positives better. Increasing the number of iterations has a similar effect.


HHblits

Comparison

Multiple sequence alignments