Difference between revisions of "Sequence and multiple alignments"

From Bioinformatikpedia
(Database search)
(Database search)
Line 28: Line 28:
 
We also compared the runtime of the different tools. Surprisingly for PSI-BLAST, in our case, the e-Value and the number of iterations had no impact on the results, so we used PSI-BLAST with standard parameters for the comparison.
 
We also compared the runtime of the different tools. Surprisingly for PSI-BLAST, in our case, the e-Value and the number of iterations had no impact on the results, so we used PSI-BLAST with standard parameters for the comparison.
   
  +
To compare the different alignment tools, we looked at the distributions of the identity and the scores of the results. The results are shown in Figure 1 to Figure 3.
   
 
As we looked for an overlap between the reference sequences given by HSSP, no overlap occured.
 
As we looked for an overlap between the reference sequences given by HSSP, no overlap occured.

Revision as of 12:04, 24 August 2011

TODO

re-check all pictures
re-reference all pictures
add all references/quotes


Sequence Alignments

Database search

Figure 1: Identity distribution of blast, fasta and psiblast
Figure 2: score distribution of fasta
Figure 3: score distribution of blast and psiblast

To find homologues sequences, we used different alignment tools like BLAST, FASTA and PSI-BLAST, for the search in a non-reduntant (NR) database. Because of the absence of the database in a hmm format, we decided not to integrate the hhpred web-results. These results are not comparable because of the different database structure. This is done to find sequences which are adequate for a template based structure prediction.

We called BLAST and FASTA with the standard parameters. blast -p blastp -i 1a6z.fasta -d /data/nr/nr -o blastp_1a6z_on_NR.txt runtime 5:34
../../Desktop/fasta-36.3.4/bin/fasta36 -q 1a6z.fasta /data/nr/nr >fasta_1a6z_on_nr.txt the runtime was 10:44.

PSI-BLAST was used with different parameter settings. blastpgp -i 1a6z.fasta -d /data/nr/nr runtime 21:40
blastpgp -i 1a6z.fasta -d /data/nr/nr -j 6 -e 10E-6 > psiblast_on_NR_itera3_e-val_10E-6.txt runtime 23:49
blastpgp -i 1a6z.fasta -d /data/nr/nr -j 3 -e 10E-6 > psiblast_on_NR_itera6_e-val_10E-6.txt runtime 15:08
blastpgp -i 1a6z.fasta -d /data/nr/nr -j 3 > psiblast_on_NR_itera3.txt runtime 9:41
blastpgp -i 1a6z.fasta -d /data/nr/nr -j 6 > psiblast_on_NR_itera6.txt runtime 23:30

We also compared the runtime of the different tools. Surprisingly for PSI-BLAST, in our case, the e-Value and the number of iterations had no impact on the results, so we used PSI-BLAST with standard parameters for the comparison.

To compare the different alignment tools, we looked at the distributions of the identity and the scores of the results. The results are shown in Figure 1 to Figure 3.

As we looked for an overlap between the reference sequences given by HSSP, no overlap occured. The HSSP reference is given with UniProt AC, so we mapped the ID's on the GenBank ID's using PIR<ref>pir.georgetown.edu/pirwww/search/idmapping.shtml</ref>. Than we compared the ID's without any result. We also mapped the PDB ID's found by FASTA to UniProt. Here we found a small overlap of about 34 Proteins. But the main issue is, that the highest identity given by a pdb entry is about 40%. So we found just week homologues templates for a later sturcture prediction.

Figure 4: overlap between all methods
Figure 5: overlap between fasta blast and psiblast
Figure 6: overlap with the reference id's
Figure 7: overlap with the mapped pdb id's

Multiple Alignments

Used Sequences

SeqIdentifier Seq Identity Protein description
99-90% Sequence Identity
AAG29574.1 94% !EXCLUDED hemochromatosis splice variant
89-60% Sequence Identity
AAO47091.1 87% hemochromatosis
NP_620576.1 86% hereditary hemochromatosis protein isoform 7 precursor
NP_620577.1 86% hereditary hemochromatosis protein isoform 8 precursor
AAH74721.1 84% HFE protein
NP_620578.1 78% hereditary hemochromatosis protein isoform 9 precursor
AAF01222.1 75% !EXCLUDED hereditary haemochromatosis protein precursor
Q9GL42.1 68% HFE_DICSU RecName
Q9GL43.1 68% HFE_DICBI RecName
NP_001166905.1 63% hereditary hemochromatosis protein homolog isoform 2 precursor
EDL86571.1 62% hemochromatosis, isoform CRA_c
P70387.1 61% HFE_MOUSE
59-40% Sequence Identity
XP_001511789.1 41% PREDICTED: similar to MHC class I heavy chain antigen
XP_002713938.1 40% PREDICTED: histocompatibility 2, Q region locus 10-like
39-20% Sequence Identity
AAR25266.1 37% MHC class I heavy chain
XP_585880.4 37% PREDICTED: histocompatibility 2, Q region locus 10-like
AAX51393.1 36% MHC class I antigen alpha chain precursor
BAD23967.1 35.7% MHC class I antigen
AAZ30022.1 34% MHC class I antigen alpha chain

Because our first results were really bad, we are given the advice to include/exclude some of our sequences. We decided to exclude the two sequences, that cause the most gaps. Both had around ~120 aminoacids, so they caused the insertion of long strands of gaps that lead to bad conservated regions.

Excluded sequences from all MSAs:

  • tr|Q9HC71|Q9HC71_HUMAN (AAG29574.1 / 105 aa)
  • gi|6010711|gb|AAF01222.1|AF184234_1 (129 aa)

Thus we have no more sequences in the region of 99 to 90% sequenced identity, but the one we had was only a fragment and therefore only partial alignmentable. Without these sequences our alignments and other analysis should get quite better.


Programs

All programs except cobalt were preinstalled at our VM. Cobalt was downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/cobalt/executables/2.0.1/ (file depending on system architecture) and extracted. Afterwards the current directory was changed to ~/usr/bin and a symbolic link was created (ln -s <extraced-cobalt-runnable> cobalt).

Cobalt

msa using cobalt

call:

  • cobalt -i input.txt -norps T -outfmt clustalw > cobalt.aln

switches:

  • -norps T: use no downloaded database
  • -outfmt clustalw: use output formatting of clustalw

time:

  • 0,55s user
  • 0,02s system

ClustalW

msa using clustalW

call:

  • clustalw input.txt

time:

  • 1,32s user
  • 0,04s system

Muscle

msa using muscle

call:

  • muscle -in input.txt -clw -out muscle.aln

switches:

  • -clw: use output formatting of clustalw
  • -out muscle.aln: write output to muscle.aln

time:

  • 1,25s user
  • 0,00s system

T-Coffee

standard

msa using tcoffee (standard)

call:

  • t_coffee input.txt

time:

  • 11,00s user
  • 0,18s system

3dcoffee/expresso

Unfortunately were we not able to run the 3dcoffee, because it crashed everytime. Therefore we used -mode expresso.

msa using expresso mode of t-coffee

call:

  • t_coffee input.txt -mode expresso

switches:

  • -mode expresso: use special mode with 3d structure supported alignment

time:

  • 4m22s user
  • 0,11s system

Alignments

All programs worked fine and produced a tight aligned part from the mid to the end (position ~170 to 390) with high conservation and of good quality. Without the two short sequences, all extremely gapped parts in the alignments disappeard. Even the use of the structure supported alignment at t-coffee resulted in an nice alignment. But our best pdb entry was at ~40% sequence identity, so gaps had to be inserted in order to incorporate the structure. This problem affects also the following sections, like the amount and quality of conserved regions or the amount of introduced gaps.

All programs except the special mode of t-coffee have an clearly obvious gap between ~160 and ~170, by expresso it is between ~190 - ~200. That gap is caused by insertions at the hfe protein in the rat and mouse.

Residues

But in order to present a visible result, we used the conserved region calculation of JalView.

The values ranges from 0 (no conservation at all) to 11 (full conserved row).

name 0 2 5 7 9 11
quality very bad medium very good length of msa
cobalt 161 21 37 23 16 34 402
clustalW 156 26 34 22 0 33
muscle 157 22 33 23 11 34
t-coffee (std) 193 23 34 20 17 34
t-coffee (expresso) 280 16 29 16 11 26

It is clearly visible that all our alignments have a large region of bad conservation in the beginning of the alignments, propably a region of no functional importance. After position ~160 to ~200, a large block of medium to good conservation is visible. Maybe that part contains the biological function or important domains. This block ends around position ~340 to ~400.

Around .4 to .7 percentage of the alignments are no conservation and therefore gaps or many mutation. It it logical, that t-coffee expresso produces the most non-conservations because of the additional constraint to use the structural information.

Functional Important

We used the information available at UniProt for determining functional important residues and regions.

key positions length description cobalt clustalW muscle t-coffee (std) t-coffee (ex)
Glycosylation 110 1 N-linked (GlcNAc...) Potential c2 c2 c1 c2 c2
Glycosylation 130 1 N-linked (GlcNAc...) Potential c4 c4 c0 c4 c4
Glycosylation 234 1 N-linked (GlcNAc...) Potential c6 c6 c1 c6 c5
Disulfide bond 124 ↔ 187 1 c11/c11 c11/c11 c11/c11 c11/c11 c11/c11
Disulfide bond 225 ↔ 282 1 c11/c11 c11/c11 c11/c11 c11/c11 c5/c11

We excluded all domain and other regions, because they are of no use this time. The only important functional positions are the position of the glycosylation and the both position of the disulfide bond.

We are using the conserved region index from JalView again, cX stands for the value of conservation at this position. One can see, that the conservation of the glycosylation is only low to medium, so not important for the function of the protein. All disulfide bonds, except in expresso are perfectly conserved, that could be a result of a slightly different folding with maintaining the linking between the both cysteine residues.

Gaps

name gaps#
cobalt 85
clustalW 78
muscle 83
t-coffee (std) 114
t-coffee (ex) 166

There are very less gaps in the reference sequence in fraction to its length of 348 aa. Only both of the t-coffee variants create more gaps. We dont not know why the standard version produces also more gaps, but we are sure the expresso mode has to follow the contraint to align two position only if the structure matches and thus inserts more gaps.

Gaps in Secondary Structure

name helix beta strand turn no sse
cobalt 0 8 1 70
clustalW 0 9 0 1
muscle 0 8 0 40
t-coffee (std) 0 8 0 116
t-coffee (ex) 5 8 0 64

The vast majority of the gaps were in regions with no important secondary structure and therefore broke nothing. Every alignment broke also some of the beta strands but only t-coffee (expresso) broke an helix totally. Cobalt also broke the leading turn into the helix, but not the helix itself. We were suprised by expresso, because it incorporated the secondary structure in the alignment and thus was the method, that must absolutely not brake the helix.