Sequence and multiple alignments
Contents
Sequence Alignments
Database search
We used different tools for the search in a non-reduntant(NR) database like BLAST, FASTA and PSI-BLAST. Because of the absence of the database in a hmm format, we decided not to integrate the hhpred web-results. These results are not compareable.
We called BLAST and FASTA with the standard parameters.
blast -p blastp -i 1a6z.fasta -d /data/nr/nr -o blastp_1a6z_on_NR.txt runtime 5:34
../../Desktop/fasta-36.3.4/bin/fasta36 -q 1a6z.fasta /data/nr/nr >fasta_1a6z_on_nr.txt the runtime was 10:44.
PSI-BLAST was used with different parameter settings.
blastpgp -i 1a6z.fasta -d /data/nr/nr runtime 21:40
blastpgp -i 1a6z.fasta -d /data/nr/nr -j 6 -e 10E-6 > psiblast_on_NR_itera3_e-val_10E-6.txt runtime 23:49
blastpgp -i 1a6z.fasta -d /data/nr/nr -j 3 -e 10E-6 > psiblast_on_NR_itera6_e-val_10E-6.txt runtime 15:08
blastpgp -i 1a6z.fasta -d /data/nr/nr -j 3 > psiblast_on_NR_itera3.txt runtime 9:41
blastpgp -i 1a6z.fasta -d /data/nr/nr -j 6 > psiblast_on_NR_itera6.txt runtime 23:30
We also compared the runtime of the different tools. Surprisingly for PSI-BLAST, in our case, the e-Value and the number of iterations had no impact on the results, so we used PSI-BLAST with standard parameters for the comparison.
As we looked for an overlap between the reference sequences given by HSSP, no overlap occured.
The HSSP reference is given with UniProt AC, so we mapped the ID's on the GenBank ID's using PIR<ref>pir.georgetown.edu/pirwww/search/idmapping.shtml</ref>. Than we compared the ID's without any result.
We also mapped the PDB ID's found by FASTA to UniProt. Here we found a small overlap of about 34 Proteins.
But the issue is, that the highest identety given by a pdb entry is abput 40%.
Multiple Alignments
Used Sequences
SeqIdentifier | Seq Identity | Protein description | |
---|---|---|---|
99-90% Sequence Identity | |||
AAG29574.1 | 94% | emochromatosis splice variant | |
89-60% Sequence Identity | |||
AAO47091.1 | 87% | hemochromatosis | |
NP_620576.1 | 86% | hereditary hemochromatosis protein isoform 7 precursor | |
NP_620577.1 | 86% | hereditary hemochromatosis protein isoform 8 precursor | |
AAH74721.1 | 84% | HFE protein | |
NP_620578.1 | 78% | hereditary hemochromatosis protein isoform 9 precursor | |
AAF01222.1 | 75% | hereditary haemochromatosis protein precursor | |
Q9GL42.1 | 68% | HFE_DICSU RecName | |
Q9GL43.1 | 68% | HFE_DICBI RecName | |
NP_001166905.1 | 63% | hereditary hemochromatosis protein homolog isoform 2 precursor | |
EDL86571.1 | 62% | hemochromatosis, isoform CRA_c | |
P70387.1 | 61% | HFE_MOUSE | |
59-40% Sequence Identity | |||
XP_001511789.1 | 41% | PREDICTED: similar to MHC class I heavy chain antigen | |
XP_002713938.1 | 40% | PREDICTED: histocompatibility 2, Q region locus 10-like | |
39-20% Sequence Identity | |||
AAR25266.1 | 37% | MHC class I heavy chain | |
XP_585880.4 | 37% | PREDICTED: histocompatibility 2, Q region locus 10-like | |
AAX51393.1 | 36% | MHC class I antigen alpha chain precursor | |
BAD23967.1 | 35.7% | MHC class I antigen | |
AAZ30022.1 | 34% | MHC class I antigen alpha chain |
Programs
All programs except cobalt were preinstalled at our VM. Cobalt was downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/cobalt/executables/2.0.1/ (file depending on system architecture) and extracted. Afterwards the current directory was changed to ~/usr/bin and a symbolic link was created (ln -s <extraced-cobalt-runnable> cobalt).
Cobalt
call:
- cobalt -i hfe_fasta_multi.txt -norps T -outfmt clustalw > cobalt.txt
switches:
- -norps T: use no downloaded database
- -outfmt clustalw: use output formatting of clustalw
time:
- 0,55s user
- 0,02s system
ClustalW
call:
- clustalw hfe_fasta_multi.txt
time:
- 1,32s user
- 0,04s system
Muscle
call:
- muscle -in hfe_fasta_multi.txt -clw -out muscle.txt
switches:
- -clw: use output formatting of clustalw
- -out muscle.txt: write output to muscle.txt
time:
- 1,25s user
- 0,00s system
T-Coffee
standard
call:
- t_coffee hfe_fasta_multi.txt
time:
- 11,00s user
- 0,18s system
3dcoffee/expresso
Unfortunately were we not able to run the 3d mode, because it crashed everytime. Therefore we used -mode expresso instead with the following results.
call:
- t_coffee hfe_fasta_multi.txt -mode expresso
switches:
- -mode expresso: use special mode with 3d structure supported alignment
time:
- 4m22s user
- 0,11s system
Alignments
All programs worked fine and produced a good aligned part from the mid to the end (position ~170 to 390). The use of the structure supported alignment at t-coffee resulted in an extremely gapped first part of the alignment. This is probably caused by non similar structures of our input sequences and the need to incorporate the structure information in the alignment. That can only be done by inserting many gaps as placeholders for region of no or very less structural similarty.
All programs except the special mode of t-coffee have an clearly obvious gap between ~160 and ~170. That gap is caused by insertions at the hfe protein in the rat and mouse.
Residues
But in order to present a visible result, we used the conserved region calculation of JalView. The values ranges from 0 (no conservation at all) to 11 (full conserved row).
name | 0 (bad) | >=5 | >=8 | 11 (very good) |
---|---|---|---|---|
cobalt | ~100 | ~2 | ~0 | ~0 |
clustalW | ~113 | ~12 | ~3 | ~1 |
muscle | ~120 | ~7 | ~3 | ~1 |
t-coffee (std) | ~127 | ~3 | ~0 | ~0 |
t-coffee (expresso) | ~130 | ~3 | ~0 | ~0 |
It is clearly visible that our alignments have almost no conserved rows, except two hits at clustalW and cobalt alignments. We still do not know the reason for that but we think, it derives from our input sequences and their length and sequence identity difference. Therefore the non-conserved rows are about .12% to .25% of our complete msas, which is a too high fraction.
Functional Important
We used the information available at UniProt for determining functional important residues and regions.
key | positions | length | description | cobalt | clustalW | muscle | t-coffee (std) | t-coffe (3d) |
---|---|---|---|---|---|---|---|---|
Glycosylation | 110 | 1 | N-linked (GlcNAc...) Potential | yes - c2 | yes - c1 | yes - c2 | yes - c2 | yes - c2 |
Glycosylation | 130 | 1 | N-linked (GlcNAc...) Potential | no - c0 | yes - c4 | yes - c4 | no - c0 | yes - c4 |
Glycosylation | 234 | 1 | N-linked (GlcNAc...) Potential | yes - c1 | yes - c1 | yes - c1 | yes - c1 | no - c0 |
We excluded all domain and other regions, because they are of no use this time. The only important functional positions are the position of the glycosylation and disulfide bond. Because the disulfide bond positions are only amino acid replacements, we exclude them also.
Gaps
Counting and analyzing the gaps is quite useless in this actual view because all results will be redone and corrected.