Difference between revisions of "Sequence and multiple alignments"
(→Database search) |
m |
||
(78 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
+ | <sup>by [[User:Greil|Robert Greil]] and [[User:Landerer|Cedric Landerer]]</sup> |
||
=Sequence Alignments= |
=Sequence Alignments= |
||
== Database search== |
== Database search== |
||
− | [[File:Identity histo.png|thumb|Identity distribution of blast, fasta and psiblast]] |
+ | [[File:Identity histo.png|thumb|Figure 1: Identity distribution of blast, fasta and psiblast]] |
− | [[File:Fasta_score_histo.png|thumb|score distribution of fasta]] |
+ | [[File:Fasta_score_histo.png|thumb|Figure 2: score distribution of fasta]] |
− | [[File:Blast_psiblast_score_histo.png|thumb|score distribution of blast and psiblast]] |
+ | [[File:Blast_psiblast_score_histo.png|thumb|Figure 3: score distribution of blast and psiblast]] |
+ | [[File:Overlap.png|thumb|Figure 4: overlap between all methods]] |
||
− | + | To find homologues sequences which are adequate for a template based structure prediction, we used different alignment tools like BLAST, FASTA and PSI-BLAST, for the search in a non-reduntant (NR) database. Because of the absence of the database in hmm format, we decided not to integrate the hhpred web-results. These results are not comparable because of the different database structure. |
|
− | We called BLAST and FASTA with the standard |
+ | We called BLAST and FASTA with the standard parameters. |
− | ''blast -p blastp -i 1a6z.fasta -d /data/nr/nr -o blastp_1a6z_on_NR.txt'' runtime 5:34<br> |
+ | ''blast -p blastp -i 1a6z.fasta -d /data/nr/nr -o blastp_1a6z_on_NR.txt'' '''runtime 5:34'''<br> |
− | ''../../Desktop/fasta-36.3.4/bin/fasta36 -q 1a6z.fasta /data/nr/nr >fasta_1a6z_on_nr.txt'' |
+ | ''../../Desktop/fasta-36.3.4/bin/fasta36 -q 1a6z.fasta /data/nr/nr >fasta_1a6z_on_nr.txt'' '''runtime 10:44'''<br> |
<br> |
<br> |
||
PSI-BLAST was used with different parameter settings. |
PSI-BLAST was used with different parameter settings. |
||
− | ''blastpgp -i 1a6z.fasta -d /data/nr/nr'' runtime 21:40 <br> |
+ | ''blastpgp -i 1a6z.fasta -d /data/nr/nr'' runtime 21:40 <br> |
− | ''blastpgp -i 1a6z.fasta -d /data/nr/nr -j 6 -e 10E-6 > psiblast_on_NR_itera3_e-val_10E-6.txt'' runtime 23:49<br> |
+ | ''blastpgp -i 1a6z.fasta -d /data/nr/nr -j 6 -e 10E-6 > psiblast_on_NR_itera3_e-val_10E-6.txt'' '''runtime 23:49'''<br> |
− | ''blastpgp -i 1a6z.fasta -d /data/nr/nr -j 3 -e 10E-6 > psiblast_on_NR_itera6_e-val_10E-6.txt'' runtime 15:08<br> |
+ | ''blastpgp -i 1a6z.fasta -d /data/nr/nr -j 3 -e 10E-6 > psiblast_on_NR_itera6_e-val_10E-6.txt'' '''runtime 15:08'''<br> |
− | ''blastpgp -i 1a6z.fasta -d /data/nr/nr -j 3 > psiblast_on_NR_itera3.txt'' runtime 9:41<br> |
+ | ''blastpgp -i 1a6z.fasta -d /data/nr/nr -j 3 > psiblast_on_NR_itera3.txt'' '''runtime 9:41'''<br> |
− | ''blastpgp -i 1a6z.fasta -d /data/nr/nr -j 6 > psiblast_on_NR_itera6.txt'' runtime 23:30<br> |
+ | ''blastpgp -i 1a6z.fasta -d /data/nr/nr -j 6 > psiblast_on_NR_itera6.txt'' '''runtime 23:30'''<br> |
− | We also compared the runtime of the different tools. Suprisingly for PSI-BLAST, in our case, the e-Value and the number of iterations had no impact on the results, so we used PSI-BLAST with standart parameter for the comparison. |
||
+ | To compare the different alignment tools, we looked at the distributions of the identity and the scores of the results. The results are shown in Figure 1 to Figure 3. While PSI-BLAST found an equal amount of homologues in the range around 20% to 40%, FASTA and BLAST found more homologues with an identity around 35%. In total, we obtained more results with a lower sequence identity while using PSI-BLAST and more close homologues while using FASTA and BLAST. In the score distributions, we see, the most results found by FASTA are in a range from 20 to 200, while in the case of BLAST and PSI-BLAST the most results are in a range from 150 to 250. We can now propose that, in our case, PSI-BLAST has an advantage to find more distant related proteins, while BLAST and FASTA find more close related proteins. |
||
+ | We also compared the runtime of the different tools. Surprisingly for PSI-BLAST, in our case, the e-Value and the number of iterations had no impact on the results (see Figure 4), so we used PSI-BLAST with standard parameters for the comparison. |
||
− | As we looked for an overlab between the reference sequences given by HSSP, no overlap occured. |
||
+ | |||
− | The HSSP reference is given with UniProt AC, so we mapped the ID's on the GenBank ID's using PIR<ref>pir.georgetown.edu/pirwww/search/idmapping.shtml</ref>. Than we compared the ID's without any result. |
||
+ | As we looked for an overlap between the reference sequences given by HSSP, no overlap occurred. |
||
− | We also mapped the PDB ID's found by FASTA to UniProt. Here we found a small overlap of about 34 Proteins. |
||
+ | The HSSP reference is given with UniProt AC, so we mapped the ID's on the GenBank ID's using PIR<ref>pir.georgetown.edu/pirwww/search/idmapping.shtml</ref>. Than we compared the id's without any result. |
||
− | But the issue is, that the highest identety given by a pdb entry is abput 40%. |
||
+ | We also mapped the PDB id's found by FASTA to UniProt. Here we found a small overlap of about 34 Proteins. |
||
+ | But the main issue is, that the highest identity given by a pdb entry is about 40%. So we found just week homologues templates for a later structure prediction. |
||
{| class="centered" |
{| class="centered" |
||
− | |[[File: |
+ | |[[File:Venn_blast_fasta_psi.png|thumb|Figure 5: overlap between FASTA, BLAST and PSI-BLAST]] |
− | |[[File: |
+ | |[[File:Ven_ref_blast_fasta.png|thumb|Figure 6: overlap with the reference id's]] |
− | |[[File: |
+ | |[[File:Ven_ref_pdb.png|thumb|Figure 7: overlap with the mapped pdb id's]] |
− | |[[File:Ven_ref_pdb.png|thumb|] overlab with the mapped pdb id's] |
||
|} |
|} |
||
+ | |||
+ | Figures 5 to 7 show the overlap between the different result sets. Figure 5 shows the overlap between the results of FASTA, BLAST and PSI-BLAST. Here we can see a strong overlap between the different methods. In the case of FASTA, no limitation in the amount of results which are found was set. In the case of BLAST and PSI-BLAST, this restriction is set to 500 as a standard value. Figure 6 shows that we were not able to find any protein from our reference set by searching with an alignment tool. The reference set contains proteins which are from HSSP proposed to have a homologue secondary structure. In Figure 7, we show the overlap between the proteins found by the search with the global alignment tools for which were able to find a PDB id's also in the reference set. Interesting here is, that the same sequence can have different id's in the same database but is assigned to the same PDB id. So in total, we could find 34 sequences with a 3D-Structure. |
||
=Multiple Alignments= |
=Multiple Alignments= |
||
+ | For the multiple sequence alignment we used the following sequences. |
||
− | Used Sequences |
||
+ | |||
{| border="1" style="text-align:center; border-spacing:0;" |
{| border="1" style="text-align:center; border-spacing:0;" |
||
!SeqIdentifier |
!SeqIdentifier |
||
Line 43: | Line 49: | ||
!colspan="4"| 99-90% Sequence Identity |
!colspan="4"| 99-90% Sequence Identity |
||
|- |
|- |
||
− | |AAG29574.1||94%|| |
+ | |AAG29574.1||94%|| '''!EXCLUDED''' hemochromatosis splice variant |
|- |
|- |
||
!colspan="4"| 89-60% Sequence Identity |
!colspan="4"| 89-60% Sequence Identity |
||
Line 49: | Line 55: | ||
|AAO47091.1|| 87% || hemochromatosis |
|AAO47091.1|| 87% || hemochromatosis |
||
|- |
|- |
||
+ | |NP_620576.1||86%|| hereditary hemochromatosis protein isoform 7 precursor |
||
− | |Q9GL42.1|| 68% || HFE_DICSU RecName |
||
|- |
|- |
||
|NP_620577.1|| 86% ||hereditary hemochromatosis protein isoform 8 precursor |
|NP_620577.1|| 86% ||hereditary hemochromatosis protein isoform 8 precursor |
||
+ | |- |
||
+ | |AAH74721.1||84%|| HFE protein |
||
|- |
|- |
||
|NP_620578.1|| 78% || hereditary hemochromatosis protein isoform 9 precursor |
|NP_620578.1|| 78% || hereditary hemochromatosis protein isoform 9 precursor |
||
|- |
|- |
||
− | |AAF01222.1||75%|| hereditary haemochromatosis protein precursor |
+ | |AAF01222.1||75%|| '''!EXCLUDED''' hereditary haemochromatosis protein precursor |
|- |
|- |
||
+ | |Q9GL42.1|| 68% || HFE_DICSU RecName |
||
− | |EDL86571.1||62%|| hemochromatosis, isoform CRA_c |
||
− | |- |
||
− | |NP_620576.1||86%|| hereditary hemochromatosis protein isoform 7 precursor |
||
|- |
|- |
||
|Q9GL43.1||68%|| HFE_DICBI RecName |
|Q9GL43.1||68%|| HFE_DICBI RecName |
||
− | |- |
||
− | |AAH74721.1||84%|| HFE protein |
||
|- |
|- |
||
|NP_001166905.1||63%|| hereditary hemochromatosis protein homolog isoform 2 precursor |
|NP_001166905.1||63%|| hereditary hemochromatosis protein homolog isoform 2 precursor |
||
+ | |- |
||
+ | |EDL86571.1||62%|| hemochromatosis, isoform CRA_c |
||
|- |
|- |
||
|P70387.1||61%||HFE_MOUSE |
|P70387.1||61%||HFE_MOUSE |
||
Line 78: | Line 84: | ||
|- |
|- |
||
|AAR25266.1||37%|| MHC class I heavy chain |
|AAR25266.1||37%|| MHC class I heavy chain |
||
− | |- |
||
− | |AAZ30022.1||34%|| MHC class I antigen alpha chain |
||
|- |
|- |
||
|XP_585880.4||37%|| PREDICTED: histocompatibility 2, Q region locus 10-like |
|XP_585880.4||37%|| PREDICTED: histocompatibility 2, Q region locus 10-like |
||
Line 86: | Line 90: | ||
|- |
|- |
||
|BAD23967.1||35.7%|| MHC class I antigen |
|BAD23967.1||35.7%|| MHC class I antigen |
||
+ | |- |
||
− | |||
+ | |AAZ30022.1||34%|| MHC class I antigen alpha chain |
||
+ | |- |
||
|} |
|} |
||
+ | |||
+ | The first results were not suitable, so we excluded the two sequences which causes the most gaps. These gaps were due to the different length of the sequences. Both sequence had just a length of around 120 amino acids. Therefore, they caused the insertion of long strands of gaps that lead to badly conservated regions. |
||
+ | |||
+ | Excluded sequences from all MSAs: |
||
+ | * tr|Q9HC71|Q9HC71_HUMAN (AAG29574.1 / 105 aa) |
||
+ | * gi|6010711|gb|AAF01222.1|AF184234_1 (129 aa) |
||
+ | |||
+ | Thus we have no more sequences in the region of 99 to 90% sequenced identity, but the one we had was only a fragment and therefore only partial useful for an alignment. Without these sequences our alignments and other analysis should get quite better. |
||
+ | |||
+ | |||
==Programs== |
==Programs== |
||
− | All programs except cobalt were preinstalled at our VM. Cobalt was downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/cobalt/executables/2.0.1/ (file depending on system architecture) and extracted. Afterwards the current directory was changed to ~/usr/bin and a symbolic link was created (ln -s < |
+ | All programs except cobalt were preinstalled at our VM. Cobalt was downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/cobalt/executables/2.0.1/ (file depending on system architecture) and extracted. Afterwards the current directory was changed to ~/usr/bin and a symbolic link was created (ln -s <extracted-cobalt-runnable> cobalt). |
+ | ===Cobalt<ref>http://bioinformatics.oxfordjournals.org/content/23/9/1073.long</ref>=== |
||
− | ===Cobalt=== |
||
− | [[File:Cobalt.png|600px|thumb|msa using cobalt]] |
+ | [[File:Cobalt.png|600px|thumb|Figure 8: msa using cobalt]] |
call: |
call: |
||
− | * '''cobalt -i |
+ | * '''cobalt -i input.txt -norps T -outfmt clustalw > cobalt.aln''' |
+ | switches: |
||
+ | * '''-norps T''': use no downloaded database |
||
+ | * '''-outfmt clustalw''': use output formatting of clustalw |
||
time: |
time: |
||
* 0,55s user |
* 0,55s user |
||
* 0,02s system |
* 0,02s system |
||
+ | The MSA is shown in Figure 8. |
||
− | ===ClustalW=== |
||
+ | |||
− | [[File:Clustalw.png|600px|thumb|msa using clustalW]] |
||
+ | ===ClustalW<ref>http://www.ncbi.nlm.nih.gov/pmc/articles/PMC308517/?tool=pubmed</ref>=== |
||
+ | [[File:Clustalw.png|600px|thumb|Figure 9: msa using clustalW]] |
||
call: |
call: |
||
− | * '''clustalw |
+ | * '''clustalw input.txt''' |
time: |
time: |
||
Line 109: | Line 130: | ||
* 0,04s system |
* 0,04s system |
||
+ | The MSA is shown in Figure 9. |
||
− | ===Muscle=== |
||
+ | |||
− | [[File:Muscle.png|600px|thumb|msa using muscle]] |
||
+ | ===Muscle<ref>http://nar.oxfordjournals.org/content/32/5/1792.long</ref>=== |
||
+ | [[File:Muscle.png|600px|thumb|Figure 10: msa using muscle]] |
||
call: |
call: |
||
− | * '''muscle -in |
+ | * '''muscle -in input.txt -clw -out muscle.aln''' |
+ | switches: |
||
+ | * '''-clw''': use output formatting of clustalw |
||
+ | * '''-out muscle.aln''': write output to muscle.aln |
||
time: |
time: |
||
* 1,25s user |
* 1,25s user |
||
* 0,00s system |
* 0,00s system |
||
+ | The MSA is shown in Figure 10. |
||
− | ===T-Coffee=== |
||
+ | |||
+ | ===T-Coffee<ref>http://www.sciencedirect.com/science/article/pii/S0022283600940427</ref>=== |
||
====standard==== |
====standard==== |
||
− | [[File:tcoffee_std.png|600px|thumb|msa using |
+ | [[File:tcoffee_std.png|600px|thumb|Figure 11: msa using t-coffee (standard)]] |
call: |
call: |
||
− | * '''t_coffee |
+ | * '''t_coffee input.txt''' |
time: |
time: |
||
Line 128: | Line 156: | ||
* 0,18s system |
* 0,18s system |
||
+ | The MSA is shown in Figure 8. |
||
− | ====3dcoffee==== |
||
+ | ====3dcoffee/expresso==== |
||
+ | Unfortunately were we not able to run the 3dcoffee, because it crashed everytime. Therefore we used '''-mode expresso'''. |
||
+ | [[File:expresso.png|600px|thumb|Figure 12: msa using expresso mode of t-coffee]] |
||
call: |
call: |
||
− | * '''t_coffee |
+ | * '''t_coffee input.txt -mode expresso''' |
+ | |||
+ | switches: |
||
+ | * '''-mode expresso''': use special mode with 3d structure supported alignment |
||
time: |
time: |
||
− | * |
+ | * 4m22s user |
− | * |
+ | * 0,11s system |
+ | The MSA is shown in Figure 12. |
||
− | Unfortunately were we not able to run the 3d mode successfully, because our 3dcoffee crashed everytime with an rather unusual and useless error report. |
||
==Alignments== |
==Alignments== |
||
+ | All programs worked fine and produced a tight aligned part from the mid to the end (position ~170 to 390) with high conservation and of good quality. Without the two short sequences, all extremely gapped parts in the alignments disappeared. Even the use of the structure supported alignment at t-coffee resulted in an nice alignment. But our best pdb entry was at ~40% sequence identity, so gaps had to be inserted in order to incorporate the structure. This issue affects also the following sections, like the amount and quality of conserved regions or the amount of introduced gaps. |
||
− | All of our alignments are very bad alignments strechted with many gaps, which led to the decision that our sequence input is not very useful. Because of this, we decided to try it again but this time with more direct homologs to the HFE protein. This step will be done in the next two to four days because it is too late now to recalculate everything. Maybe we can also fix the strange 3dcoffee bug this time. |
||
+ | |||
+ | All programs except the special mode of t-coffee have an clearly obvious gap between ~160 and ~170, by expresso it is between ~190 - ~200. That gap is caused by insertions at the hfe protein in the rat and mouse. |
||
==Residues== |
==Residues== |
||
+ | But in order to present a visible result, we used the conserved region calculation of JalView. |
||
− | Furthermore we have almost no conserved column (one column with the same amino acid at all sequences) which we think depends also on our input sequences. |
||
+ | |||
+ | The values ranges from 0 (no conservation at all) to 11 (full conserved row). |
||
+ | {| border="1" style="text-align:center; border-spacing:0;" |
||
+ | !name |
||
+ | !0 |
||
+ | !2 |
||
+ | !5 |
||
+ | !7 |
||
+ | !9 |
||
+ | !11 |
||
+ | |- |
||
+ | | quality || colspan="2" | very bad || colspan="2" | medium || colspan="2" | very good || length of msa |
||
+ | |- |
||
+ | | cobalt ||161||21||37||23||16||34 || rowspan="5" | 402 |
||
+ | |- |
||
+ | | clustalW ||156||26||34||22||0||33 |
||
+ | |- |
||
+ | | muscle ||157||22||33||23||11||34 |
||
+ | |- |
||
+ | | t-coffee (std) ||193||23||34||20||17||34 |
||
+ | |- |
||
+ | | t-coffee (expresso) ||280||16||29||16||11||26 |
||
+ | |- |
||
+ | |} |
||
+ | It is clearly visible that all our alignments have a large region of bad conservation in the beginning of the alignments, probably a region of no functional importance. After position ~160 to ~200, a large block of medium to good conservation is visible. Maybe that part contains the biological function or important domains. This block ends around position ~340 to ~400. |
||
+ | |||
+ | Around .4 to .7 percentage of the alignments are no conservation and therefore gaps or many mutation. It it logical, that t-coffee expresso produces the most non-conservations because of the additional constraint to use the structural information. |
||
===Functional Important=== |
===Functional Important=== |
||
− | + | We used the information available at [http://www.uniprot.org/uniprot/Q30201 UniProt] for determining functional important residues and regions. |
|
{| border="1" style="text-align:center; border-spacing:0;" |
{| border="1" style="text-align:center; border-spacing:0;" |
||
!key |
!key |
||
Line 151: | Line 214: | ||
!length |
!length |
||
!description |
!description |
||
+ | !cobalt |
||
− | |- |
||
+ | !clustalW |
||
− | | Topological domain || 23 – 306 || 284 || Extracellular Potential |
||
+ | !muscle |
||
+ | !t-coffee (std) |
||
+ | !t-coffee (ex) |
||
|- |
|- |
||
+ | | Glycosylation || 110 || 1 || N-linked (GlcNAc...) Potential || c2 || c2 || c1 || c2 || c2 |
||
− | | Transmembrane || 307 – 330 || 24 || Helical; Potential |
||
|- |
|- |
||
+ | | Glycosylation || 130 || 1 || N-linked (GlcNAc...) Potential || c4 || c4 || c0 || c4 || c4 |
||
− | | Topological domain || 331 – 348 || 18 || Cytoplasmic Potential |
||
|- |
|- |
||
+ | | Glycosylation || 234 || 1 || N-linked (GlcNAc...) Potential || c6 || c6 || c1 || c6 || c5 |
||
− | | Domain || 207 – 298 || 92 || Ig-like C1-type |
||
|- |
|- |
||
+ | | Disulfide bond || 124 ↔ 187 || 1 || || c11/c11 || c11/c11 || c11/c11 || c11/c11 || c11/c11 |
||
− | | Region || 23 – 114 || 92 || Alpha-1 |
||
|- |
|- |
||
+ | |Disulfide bond || 225 ↔ 282 || 1 || || c11/c11 || c11/c11 || c11/c11 || c11/c11 || c5/c11 |
||
− | | Region || 15 – 205 || 91 || Alpha-2 |
||
|- |
|- |
||
+ | |} |
||
− | | Region || 206 – 297 || 92 || Alpha-3 |
||
+ | |||
+ | We excluded all domain and other regions, because they are of no use this time. The only important functional positions are the position of the glycosylation and the both position of the disulfide bond. |
||
+ | |||
+ | We are using the conserved region index from JalView again, cX stands for the value of conservation at this position. One can see, that the conservation of the glycosylation is only low to medium, so not important for the function of the protein. All disulfide bonds, except in expresso are perfectly conserved, that could be a result of a slightly different folding with maintaining the linking between the both cysteine residues. |
||
+ | |||
+ | ==Gaps== |
||
+ | {| border="1" style="text-align:center; border-spacing:0;" |
||
+ | !name |
||
+ | !gaps# |
||
|- |
|- |
||
+ | |cobalt || 85 |
||
− | | Region || 298 – 306 || 9 || Connecting peptide |
||
+ | |- |
||
+ | |clustalW || 78 |
||
+ | |- |
||
+ | |muscle || 83 |
||
+ | |- |
||
+ | |t-coffee (std) || 114 |
||
+ | |- |
||
+ | |t-coffee (ex) || 166 |
||
|- |
|- |
||
|} |
|} |
||
+ | There are very less gaps in the reference sequence in fraction to its length of 348 aa. Only both of the t-coffee variants create more gaps. We do not know why the standard version produces also more gaps, but we are sure the expresso mode has to follow the constraint to align two position only if the structure matches and thus inserts more gaps. |
||
− | ==Gaps== |
||
+ | |||
− | Counting and analyzing the gaps is quite useless in this actual view because all results will be redone and corrected. |
||
+ | ===Gaps in Secondary Structure=== |
||
+ | {| border="1" style="text-align:center; border-spacing:0;" |
||
+ | !name |
||
+ | !helix |
||
+ | !beta strand |
||
+ | !turn |
||
+ | !no sse |
||
+ | |- |
||
+ | |cobalt || 0 || 8 || 1 ||70 |
||
+ | |- |
||
+ | |clustalW || 0 || 9 ||0 ||1 |
||
+ | |- |
||
+ | |muscle || 0 ||8 ||0 ||40 |
||
+ | |- |
||
+ | |t-coffee (std) || 0 ||8 ||0 ||116 |
||
+ | |- |
||
+ | |t-coffee (ex) || 5 || 8 ||0 || 64 |
||
+ | |- |
||
+ | |} |
||
+ | |||
+ | The vast majority of the gaps were in regions with no important secondary structure and therefore broke nothing. Every alignment broke also some of the beta strands but only t-coffee (expresso) broke an helix totally. Cobalt also broke the leading turn into the helix, but not the helix itself. We were surprised by expresso, because it incorporated the secondary structure in the alignment and thus was the method, that must absolutely not brake the helix. |
||
+ | |||
+ | == References == |
||
+ | <references /> |
||
+ | |||
+ | [[Category : Hemochromatosis]] |
Latest revision as of 12:47, 30 August 2011
by Robert Greil and Cedric Landerer
Contents
- 1 Sequence Alignments
- 2 Multiple Alignments
- 2.1 Programs
- 2.1.1 Cobalt<ref>http://bioinformatics.oxfordjournals.org/content/23/9/1073.long</ref>
- 2.1.2 ClustalW<ref>http://www.ncbi.nlm.nih.gov/pmc/articles/PMC308517/?tool=pubmed</ref>
- 2.1.3 Muscle<ref>http://nar.oxfordjournals.org/content/32/5/1792.long</ref>
- 2.1.4 T-Coffee<ref>http://www.sciencedirect.com/science/article/pii/S0022283600940427</ref>
- 2.2 Alignments
- 2.3 Residues
- 2.4 Gaps
- 2.5 References
- 2.1 Programs
Sequence Alignments
Database search
To find homologues sequences which are adequate for a template based structure prediction, we used different alignment tools like BLAST, FASTA and PSI-BLAST, for the search in a non-reduntant (NR) database. Because of the absence of the database in hmm format, we decided not to integrate the hhpred web-results. These results are not comparable because of the different database structure.
We called BLAST and FASTA with the standard parameters.
blast -p blastp -i 1a6z.fasta -d /data/nr/nr -o blastp_1a6z_on_NR.txt runtime 5:34
../../Desktop/fasta-36.3.4/bin/fasta36 -q 1a6z.fasta /data/nr/nr >fasta_1a6z_on_nr.txt runtime 10:44
PSI-BLAST was used with different parameter settings.
blastpgp -i 1a6z.fasta -d /data/nr/nr runtime 21:40
blastpgp -i 1a6z.fasta -d /data/nr/nr -j 6 -e 10E-6 > psiblast_on_NR_itera3_e-val_10E-6.txt runtime 23:49
blastpgp -i 1a6z.fasta -d /data/nr/nr -j 3 -e 10E-6 > psiblast_on_NR_itera6_e-val_10E-6.txt runtime 15:08
blastpgp -i 1a6z.fasta -d /data/nr/nr -j 3 > psiblast_on_NR_itera3.txt runtime 9:41
blastpgp -i 1a6z.fasta -d /data/nr/nr -j 6 > psiblast_on_NR_itera6.txt runtime 23:30
To compare the different alignment tools, we looked at the distributions of the identity and the scores of the results. The results are shown in Figure 1 to Figure 3. While PSI-BLAST found an equal amount of homologues in the range around 20% to 40%, FASTA and BLAST found more homologues with an identity around 35%. In total, we obtained more results with a lower sequence identity while using PSI-BLAST and more close homologues while using FASTA and BLAST. In the score distributions, we see, the most results found by FASTA are in a range from 20 to 200, while in the case of BLAST and PSI-BLAST the most results are in a range from 150 to 250. We can now propose that, in our case, PSI-BLAST has an advantage to find more distant related proteins, while BLAST and FASTA find more close related proteins.
We also compared the runtime of the different tools. Surprisingly for PSI-BLAST, in our case, the e-Value and the number of iterations had no impact on the results (see Figure 4), so we used PSI-BLAST with standard parameters for the comparison.
As we looked for an overlap between the reference sequences given by HSSP, no overlap occurred. The HSSP reference is given with UniProt AC, so we mapped the ID's on the GenBank ID's using PIR<ref>pir.georgetown.edu/pirwww/search/idmapping.shtml</ref>. Than we compared the id's without any result. We also mapped the PDB id's found by FASTA to UniProt. Here we found a small overlap of about 34 Proteins. But the main issue is, that the highest identity given by a pdb entry is about 40%. So we found just week homologues templates for a later structure prediction.
Figures 5 to 7 show the overlap between the different result sets. Figure 5 shows the overlap between the results of FASTA, BLAST and PSI-BLAST. Here we can see a strong overlap between the different methods. In the case of FASTA, no limitation in the amount of results which are found was set. In the case of BLAST and PSI-BLAST, this restriction is set to 500 as a standard value. Figure 6 shows that we were not able to find any protein from our reference set by searching with an alignment tool. The reference set contains proteins which are from HSSP proposed to have a homologue secondary structure. In Figure 7, we show the overlap between the proteins found by the search with the global alignment tools for which were able to find a PDB id's also in the reference set. Interesting here is, that the same sequence can have different id's in the same database but is assigned to the same PDB id. So in total, we could find 34 sequences with a 3D-Structure.
Multiple Alignments
For the multiple sequence alignment we used the following sequences.
SeqIdentifier | Seq Identity | Protein description | |
---|---|---|---|
99-90% Sequence Identity | |||
AAG29574.1 | 94% | !EXCLUDED hemochromatosis splice variant | |
89-60% Sequence Identity | |||
AAO47091.1 | 87% | hemochromatosis | |
NP_620576.1 | 86% | hereditary hemochromatosis protein isoform 7 precursor | |
NP_620577.1 | 86% | hereditary hemochromatosis protein isoform 8 precursor | |
AAH74721.1 | 84% | HFE protein | |
NP_620578.1 | 78% | hereditary hemochromatosis protein isoform 9 precursor | |
AAF01222.1 | 75% | !EXCLUDED hereditary haemochromatosis protein precursor | |
Q9GL42.1 | 68% | HFE_DICSU RecName | |
Q9GL43.1 | 68% | HFE_DICBI RecName | |
NP_001166905.1 | 63% | hereditary hemochromatosis protein homolog isoform 2 precursor | |
EDL86571.1 | 62% | hemochromatosis, isoform CRA_c | |
P70387.1 | 61% | HFE_MOUSE | |
59-40% Sequence Identity | |||
XP_001511789.1 | 41% | PREDICTED: similar to MHC class I heavy chain antigen | |
XP_002713938.1 | 40% | PREDICTED: histocompatibility 2, Q region locus 10-like | |
39-20% Sequence Identity | |||
AAR25266.1 | 37% | MHC class I heavy chain | |
XP_585880.4 | 37% | PREDICTED: histocompatibility 2, Q region locus 10-like | |
AAX51393.1 | 36% | MHC class I antigen alpha chain precursor | |
BAD23967.1 | 35.7% | MHC class I antigen | |
AAZ30022.1 | 34% | MHC class I antigen alpha chain |
The first results were not suitable, so we excluded the two sequences which causes the most gaps. These gaps were due to the different length of the sequences. Both sequence had just a length of around 120 amino acids. Therefore, they caused the insertion of long strands of gaps that lead to badly conservated regions.
Excluded sequences from all MSAs:
- tr|Q9HC71|Q9HC71_HUMAN (AAG29574.1 / 105 aa)
- gi|6010711|gb|AAF01222.1|AF184234_1 (129 aa)
Thus we have no more sequences in the region of 99 to 90% sequenced identity, but the one we had was only a fragment and therefore only partial useful for an alignment. Without these sequences our alignments and other analysis should get quite better.
Programs
All programs except cobalt were preinstalled at our VM. Cobalt was downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/cobalt/executables/2.0.1/ (file depending on system architecture) and extracted. Afterwards the current directory was changed to ~/usr/bin and a symbolic link was created (ln -s <extracted-cobalt-runnable> cobalt).
Cobalt<ref>http://bioinformatics.oxfordjournals.org/content/23/9/1073.long</ref>
call:
- cobalt -i input.txt -norps T -outfmt clustalw > cobalt.aln
switches:
- -norps T: use no downloaded database
- -outfmt clustalw: use output formatting of clustalw
time:
- 0,55s user
- 0,02s system
The MSA is shown in Figure 8.
ClustalW<ref>http://www.ncbi.nlm.nih.gov/pmc/articles/PMC308517/?tool=pubmed</ref>
call:
- clustalw input.txt
time:
- 1,32s user
- 0,04s system
The MSA is shown in Figure 9.
Muscle<ref>http://nar.oxfordjournals.org/content/32/5/1792.long</ref>
call:
- muscle -in input.txt -clw -out muscle.aln
switches:
- -clw: use output formatting of clustalw
- -out muscle.aln: write output to muscle.aln
time:
- 1,25s user
- 0,00s system
The MSA is shown in Figure 10.
T-Coffee<ref>http://www.sciencedirect.com/science/article/pii/S0022283600940427</ref>
standard
call:
- t_coffee input.txt
time:
- 11,00s user
- 0,18s system
The MSA is shown in Figure 8.
3dcoffee/expresso
Unfortunately were we not able to run the 3dcoffee, because it crashed everytime. Therefore we used -mode expresso.
call:
- t_coffee input.txt -mode expresso
switches:
- -mode expresso: use special mode with 3d structure supported alignment
time:
- 4m22s user
- 0,11s system
The MSA is shown in Figure 12.
Alignments
All programs worked fine and produced a tight aligned part from the mid to the end (position ~170 to 390) with high conservation and of good quality. Without the two short sequences, all extremely gapped parts in the alignments disappeared. Even the use of the structure supported alignment at t-coffee resulted in an nice alignment. But our best pdb entry was at ~40% sequence identity, so gaps had to be inserted in order to incorporate the structure. This issue affects also the following sections, like the amount and quality of conserved regions or the amount of introduced gaps.
All programs except the special mode of t-coffee have an clearly obvious gap between ~160 and ~170, by expresso it is between ~190 - ~200. That gap is caused by insertions at the hfe protein in the rat and mouse.
Residues
But in order to present a visible result, we used the conserved region calculation of JalView.
The values ranges from 0 (no conservation at all) to 11 (full conserved row).
name | 0 | 2 | 5 | 7 | 9 | 11 | |
---|---|---|---|---|---|---|---|
quality | very bad | medium | very good | length of msa | |||
cobalt | 161 | 21 | 37 | 23 | 16 | 34 | 402 |
clustalW | 156 | 26 | 34 | 22 | 0 | 33 | |
muscle | 157 | 22 | 33 | 23 | 11 | 34 | |
t-coffee (std) | 193 | 23 | 34 | 20 | 17 | 34 | |
t-coffee (expresso) | 280 | 16 | 29 | 16 | 11 | 26 |
It is clearly visible that all our alignments have a large region of bad conservation in the beginning of the alignments, probably a region of no functional importance. After position ~160 to ~200, a large block of medium to good conservation is visible. Maybe that part contains the biological function or important domains. This block ends around position ~340 to ~400.
Around .4 to .7 percentage of the alignments are no conservation and therefore gaps or many mutation. It it logical, that t-coffee expresso produces the most non-conservations because of the additional constraint to use the structural information.
Functional Important
We used the information available at UniProt for determining functional important residues and regions.
key | positions | length | description | cobalt | clustalW | muscle | t-coffee (std) | t-coffee (ex) |
---|---|---|---|---|---|---|---|---|
Glycosylation | 110 | 1 | N-linked (GlcNAc...) Potential | c2 | c2 | c1 | c2 | c2 |
Glycosylation | 130 | 1 | N-linked (GlcNAc...) Potential | c4 | c4 | c0 | c4 | c4 |
Glycosylation | 234 | 1 | N-linked (GlcNAc...) Potential | c6 | c6 | c1 | c6 | c5 |
Disulfide bond | 124 ↔ 187 | 1 | c11/c11 | c11/c11 | c11/c11 | c11/c11 | c11/c11 | |
Disulfide bond | 225 ↔ 282 | 1 | c11/c11 | c11/c11 | c11/c11 | c11/c11 | c5/c11 |
We excluded all domain and other regions, because they are of no use this time. The only important functional positions are the position of the glycosylation and the both position of the disulfide bond.
We are using the conserved region index from JalView again, cX stands for the value of conservation at this position. One can see, that the conservation of the glycosylation is only low to medium, so not important for the function of the protein. All disulfide bonds, except in expresso are perfectly conserved, that could be a result of a slightly different folding with maintaining the linking between the both cysteine residues.
Gaps
name | gaps# |
---|---|
cobalt | 85 |
clustalW | 78 |
muscle | 83 |
t-coffee (std) | 114 |
t-coffee (ex) | 166 |
There are very less gaps in the reference sequence in fraction to its length of 348 aa. Only both of the t-coffee variants create more gaps. We do not know why the standard version produces also more gaps, but we are sure the expresso mode has to follow the constraint to align two position only if the structure matches and thus inserts more gaps.
Gaps in Secondary Structure
name | helix | beta strand | turn | no sse |
---|---|---|---|---|
cobalt | 0 | 8 | 1 | 70 |
clustalW | 0 | 9 | 0 | 1 |
muscle | 0 | 8 | 0 | 40 |
t-coffee (std) | 0 | 8 | 0 | 116 |
t-coffee (ex) | 5 | 8 | 0 | 64 |
The vast majority of the gaps were in regions with no important secondary structure and therefore broke nothing. Every alignment broke also some of the beta strands but only t-coffee (expresso) broke an helix totally. Cobalt also broke the leading turn into the helix, but not the helix itself. We were surprised by expresso, because it incorporated the secondary structure in the alignment and thus was the method, that must absolutely not brake the helix.
References
<references />