Revision as of 17:04, 30 May 2011

Sequence Alignments

Database search

Identity distribution of blast, fasta and psiblast

score distribution of fasta

score distribution of blast and psiblast

We used different tools for the search in a non-reduntant(NR) database like BLAST, FASTA and PSI-BLAST. Because of the absence of the database in a hmm format, we decided not to integrate the hhpred web-results. These results are not compareable.

We called BLAST and FASTA with the standard parameters. blast -p blastp -i 1a6z.fasta -d /data/nr/nr -o blastp_1a6z_on_NR.txt runtime 5:34
../../Desktop/fasta-36.3.4/bin/fasta36 -q 1a6z.fasta /data/nr/nr >fasta_1a6z_on_nr.txt the runtime was 10:44.

PSI-BLAST was used with different parameter settings. blastpgp -i 1a6z.fasta -d /data/nr/nr runtime 21:40
blastpgp -i 1a6z.fasta -d /data/nr/nr -j 6 -e 10E-6 > psiblast_on_NR_itera3_e-val_10E-6.txt runtime 23:49
blastpgp -i 1a6z.fasta -d /data/nr/nr -j 3 -e 10E-6 > psiblast_on_NR_itera6_e-val_10E-6.txt runtime 15:08
blastpgp -i 1a6z.fasta -d /data/nr/nr -j 3 > psiblast_on_NR_itera3.txt runtime 9:41
blastpgp -i 1a6z.fasta -d /data/nr/nr -j 6 > psiblast_on_NR_itera6.txt runtime 23:30
We also compared the runtime of the different tools. Surprisingly for PSI-BLAST, in our case, the e-Value and the number of iterations had no impact on the results, so we used PSI-BLAST with standard parameters for the comparison.

As we looked for an overlap between the reference sequences given by HSSP, no overlap occured. The HSSP reference is given with UniProt AC, so we mapped the ID's on the GenBank ID's using PIR<ref>pir.georgetown.edu/pirwww/search/idmapping.shtml</ref>. Than we compared the ID's without any result. We also mapped the PDB ID's found by FASTA to UniProt. Here we found a small overlap of about 34 Proteins. But the issue is, that the highest identety given by a pdb entry is abput 40%.

overlap between all methods

overlap between fasta blast and psiblast

overlap with the reference id's

overlap with the mapped pdb id's

Multiple Alignments

Used Sequences

SeqIdentifier	Seq Identity	Protein description
99-90% Sequence Identity
AAG29574.1	94%	emochromatosis splice variant
89-60% Sequence Identity
AAO47091.1	87%	hemochromatosis
Q9GL42.1	68%	HFE_DICSU RecName
NP_620577.1	86%	hereditary hemochromatosis protein isoform 8 precursor
NP_620578.1	78%	hereditary hemochromatosis protein isoform 9 precursor
AAF01222.1	75%	hereditary haemochromatosis protein precursor
EDL86571.1	62%	hemochromatosis, isoform CRA_c
NP_620576.1	86%	hereditary hemochromatosis protein isoform 7 precursor
Q9GL43.1	68%	HFE_DICBI RecName
AAH74721.1	84%	HFE protein
NP_001166905.1	63%	hereditary hemochromatosis protein homolog isoform 2 precursor
P70387.1	61%	HFE_MOUSE
59-40% Sequence Identity
XP_001511789.1	41%	PREDICTED: similar to MHC class I heavy chain antigen
XP_002713938.1	40%	PREDICTED: histocompatibility 2, Q region locus 10-like
39-20% Sequence Identity
AAR25266.1	37%	MHC class I heavy chain
AAZ30022.1	34%	MHC class I antigen alpha chain
XP_585880.4	37%	PREDICTED: histocompatibility 2, Q region locus 10-like
AAX51393.1	36%	MHC class I antigen alpha chain precursor
BAD23967.1	35.7%	MHC class I antigen

Programs

All programs except cobalt were preinstalled at our VM. Cobalt was downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/cobalt/executables/2.0.1/ (file depending on system architecture) and extracted. Afterwards the current directory was changed to ~/usr/bin and a symbolic link was created (ln -s <extraced-cobalt-runnable> cobalt).

Cobalt

msa using cobalt

call:

cobalt -i hfe_fasta_multi.txt -norps T -outfmt clustalw > cobalt.txt

switches:

-norps T: use no downloaded database
-outfmt clustalw: use output formatting of clustalw

time:

0,55s user
0,02s system

ClustalW

msa using clustalW

call:

clustalw hfe_fasta_multi.txt

time:

1,32s user
0,04s system

Muscle

msa using muscle

call:

muscle -in hfe_fasta_multi.txt -clw -out muscle.txt

switches:

-clw: use output formatting of clustalw
-out muscle.txt: write output to muscle.txt

time:

1,25s user
0,00s system

T-Coffee

standard

msa using tcoffee (standard)

call:

t_coffee hfe_fasta_multi.txt

time:

11,00s user
0,18s system

3dcoffee/expresso

Unfortunately were we not able to run the 3d mode, because it crashed everytime. Therefore we used -mode expresso instead with the following results.

msa using expresso mode of t-coffee

call:

t_coffee hfe_fasta_multi.txt -mode expresso

switches:

-mode expresso: use special mode with 3d structure supported alignment

time:

4m22s user
0,11s system

Alignments

All programs worked fine and produced a good aligned part from the mid to the end (position ~170 to 390). The use of the structure supported alignment at t-coffee resulted in an extremely gapped first part of the alignment. This is probably caused by non similar structures of our input sequences and the need to incorporate the structure information in the alignment. That can only be done by inserting many gaps as placeholders for region of no or very less structural similarty.

All programs except the special mode of t-coffee have broad gap between ~160 and ~170. That gap is caused by insertions at the hfe protein in the rat and mouse.

Residues

Furthermore we have almost no conserved column (one column with the same amino acid at all sequences) which we think depends also on our input sequences.

But in order to present a visible result, we used the conserved region calculation of JalView. The values ranges from 0 (no conservation at all) to 11 (full conserved row).

name	0	>=5	>=8	10
cobalt	~100	~2	~0	~0
clustalW	~113	~12	~3	~1
muscle	~120	~7	~3	~1
t-coffee (std)	~127	~3	~0	~0
t-coffee (expresso)	~130	~3	~0	~0

Functional Important

We used the information available at UniProt for determining functional important residues and regions.

key	positions	length	description
Topological domain	23 – 306	284	Extracellular Potential
Transmembrane	307 – 330	24	Helical; Potential
Topological domain	331 – 348	18	Cytoplasmic Potential
Domain	207 – 298	92	Ig-like C1-type
Region	23 – 114	92	Alpha-1
Region	15 – 205	91	Alpha-2
Region	206 – 297	92	Alpha-3
Region	298 – 306	9	Connecting peptide
Glycosylation	110	1	N-linked (GlcNAc...) Potential
Glycosylation	130	1	N-linked (GlcNAc...) Potential
Glycosylation	234	1	N-linked (GlcNAc...) Potential
Disulfide bond	124 <-> 187
Disulfide bond	225 <-> 282

Gaps

Counting and analyzing the gaps is quite useless in this actual view because all results will be redone and corrected.

@@ Line 149: / Line 149: @@
 ==Alignments==
-All programs except the 3dmode of t-coffee worked fine and produced a good aligned part from the mid to the end (position ~170 to 390). The use of the structure supported alignment at t-coffee resulted in an extremely gapped alignment. This is probably caused by non similar structures of our input sequences and the need to align the structures best matching. That can only be done by inserting many gaps as placeholders for region of no or very less structural similarty.
+All programs worked fine and produced a good aligned part from the mid to the end (position ~170 to 390). The use of the structure supported alignment at t-coffee resulted in an extremely gapped first part of the alignment. This is probably caused by non similar structures of our input sequences and the need to incorporate the structure information in the alignment. That can only be done by inserting many gaps as placeholders for region of no or very less structural similarty.
+All programs except the special mode of t-coffee have broad gap between ~160 and ~170. That gap is caused by insertions at the hfe protein in the rat and mouse.
 ==Residues==

Difference between revisions of "Sequence and multiple alignments"

Revision as of 17:04, 30 May 2011

Contents

Sequence Alignments

Database search

Multiple Alignments

Programs

Cobalt

ClustalW

Muscle

T-Coffee

standard

3dcoffee/expresso

Alignments

Residues

Functional Important

Gaps

Navigation menu

Views

Personal tools

Bioinformatik navigation

MediaWiki navigation

Search

Tools