Difference between revisions of "Gaucher Disease: Task 02 - Alignments"

Revision as of 13:51, 6 May 2013

Work is still in progress here. Please check back this evening ;)

Alignments allow the comparison of strings. In the field of bioinformatics, sequence alignments show the relation between two or more sequences.

Theoretical Background

Description

Pairwise or multiple alignments often contain an additional line below the actual alignment (or between the two aligned sequences for pairwise alignments). This line describes the relation between the aligned residues above it in more detail. The symbols show whether there is a match between identical amino acids or whether the residues are only similar.

Symbols for describing sequence alignments
Program(s)	Symbol	Example	Meaning
MSAs	*	blub	identical residues
	:		similar residues
	.
(Psi-)BLAST	same letter	A A A	identical residues
(Psi-)BLAST	+	L + V	similar residues
HHblits	|	AF | | AW	identical or very similar residues?
	+	T + S	similar residues?
	.	N . H	non-similar residues?
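As an illustration of how such a symbol line can be derived, here is a minimal sketch in the (Psi-)BLAST style: the same letter for identical residues and `+` for similar ones. The similarity groups below are a hypothetical simplification chosen for the example; BLAST itself decides similarity from positive substitution-matrix scores, not from fixed groups.

```python
# Hypothetical similarity groups, for illustration only (assumption, not
# what BLAST uses internally).
SIMILAR_GROUPS = [
    set("ILVM"), set("FWY"), set("KRH"),
    set("DE"), set("NQ"), set("ST"), set("AG"),
]


def are_similar(a, b):
    """Return True if both residues fall into one of the assumed groups."""
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def match_line(seq_a, seq_b):
    """Build the middle line of a pairwise alignment in BLAST style."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    symbols = []
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            symbols.append(" ")      # gap: no symbol
        elif a == b:
            symbols.append(a)        # identical residues: repeat the letter
        elif are_similar(a, b):
            symbols.append("+")      # similar residues
        else:
            symbols.append(" ")      # non-similar residues
    return "".join(symbols)


if __name__ == "__main__":
    query, subject = "ALT-K", "AVSGR"
    print(query)
    print(match_line(query, subject))
    print(subject)
```

With these assumed groups, the pairs L/V and T/S from the table above are both reported as `+`.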

Sequence Searches

Sequence searches with our query protein sequence, P04062.fasta, were done with the following programs:

BLAST
- using standard parameters
- against big_80
- against big

Psi-BLAST
- with number of shown hits and alignments set to 10000 (-b, -v options), so that all the hits will be shown.
- with all combinations of:
  - 2 iterations: 1 iteration against big_80 followed by 1 iteration against big
  - 10 iterations: 9 iterations against big_80 followed by 1 iteration against big
  - default E-value cutoff (0.002)
  - E-value cutoff 10E-10
- other options left at their defaults

HHblits
- with number of shown hits and alignments set to 10000 (-Z, -B options), as in Psi-BLAST
- with all combinations of:
  - 2 iterations against uniprot_20
  - 10 iterations against uniprot_20
  - default E-value cutoff (0.002)
  - E-value cutoff 10E-10
- other options left at their defaults
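The parameter combinations listed above can be enumerated mechanically. The following is only a sketch of that bookkeeping, not the contents of run.pl; the function and dictionary key names are hypothetical, and the E-value cutoffs are kept as strings exactly as written above.

```python
from itertools import product

ITERATIONS = [2, 10]
E_VALUE_CUTOFFS = ["0.002", "10E-10"]  # kept verbatim from the text


def psiblast_plan(n_iter):
    """n_iter - 1 iterations against big_80, then 1 iteration against big."""
    return [("big_80", n_iter - 1), ("big", 1)]


def hhblits_plan(n_iter):
    """All iterations against uniprot_20."""
    return [("uniprot_20", n_iter)]


def parameter_grid(program, database_plan):
    """Return one run description per (iterations, E-value) combination."""
    runs = []
    for n_iter, e_value in product(ITERATIONS, E_VALUE_CUTOFFS):
        runs.append({
            "program": program,
            "iterations": n_iter,
            "e_value": e_value,
            "databases": database_plan(n_iter),
        })
    return runs


if __name__ == "__main__":
    for run in parameter_grid("psiblast", psiblast_plan):
        print(run)
    for run in parameter_grid("hhblits", hhblits_plan):
        print(run)
```

Each program ends up with four runs: two iteration settings times two E-value cutoffs.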

The script run.pl was written and used for the runs. Profile files (a3m and hhr for HHblits; chk ("checkpoint") and PSSM for Psi-BLAST) were saved so that a search could be continued against another database: from big_80 to big for Psi-BLAST, and later against a PDB database for the evaluation.

For Psi-BLAST, a search against big_80 was done first in order to create a good profile; a final iteration against big was then run with this profile. The idea was to get as many hits as possible, so that the results are comparable with HHblits, where the runs were made against the clustered HMM database. All uniprot_20 cluster members were counted in the following comparison and evaluation.

Comparison

The results were parsed and analysed using the script parse_output.pl.

Number and overlap of hits

The number of hits found with each program and parameter combination, and the overlap of hits between some of them, are summarized in the following tables. <todo tables>
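The overlap between runs can be computed with plain set intersections once each run has been reduced to a set of hit identifiers. This is a minimal sketch and not the code of parse_output.pl; the run names and hit identifiers in the example are made up.

```python
from itertools import combinations


def overlap_table(hits_by_run):
    """Map each pair of run names to the number of hits both runs found."""
    table = {}
    for (name_a, hits_a), (name_b, hits_b) in combinations(
        sorted(hits_by_run.items()), 2
    ):
        table[(name_a, name_b)] = len(hits_a & hits_b)
    return table


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    hits = {
        "blast": {"hit1", "hit2", "hit3"},
        "psiblast": {"hit2", "hit3", "hit4"},
        "hhblits": {"hit3", "hit4"},
    }
    for pair, shared in overlap_table(hits).items():
        print(pair, shared)
```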

Furthermore, the E-value, percentage identity and alignment length distributions were plotted.
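Because E-values span many orders of magnitude, such a distribution is usually easier to read on a logarithmic scale. The sketch below bins E-values by their power of ten; the function name and the `floor` parameter (used to avoid taking the logarithm of a reported E-value of 0) are assumptions for this example and do not describe how the plots here were actually made.

```python
import math
from collections import Counter


def log10_evalue_histogram(e_values, floor=1e-200):
    """Count E-values per power of ten (floor of log10)."""
    counts = Counter()
    for e in e_values:
        # Clamp to `floor` so that an E-value of 0.0 does not break log10.
        exponent = math.floor(math.log10(max(e, floor)))
        counts[exponent] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    print(log10_evalue_histogram([2e-5, 3e-5, 0.002, 0.5]))
```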

E-value distribution

Percentage identity distribution

Alignment length distribution

Evaluation

Validation was performed against the COPS L30 - L60 groups.

Multiple Sequence Alignment

For the multiple sequence alignments, three sets were created. To this end, the results of the previous task were grouped according to their sequence identity to the native protein sequence of glucocerebrosidase. Set 1 and set 2 each contain 10 sequences, including the native protein sequence, to keep the alignments in relation to the protein that causes Gaucher's disease. The remaining 9 sequences have a sequence identity to glucocerebrosidase of more than 60% in set 1 and less than 30% in set 2. In set 3, the sequences span the whole range of sequence identity. The multiple sequence alignments were made with ClustalW, Muscle and T-Coffee.
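The grouping by identity thresholds can be sketched as follows. The function names and class labels are hypothetical; the three example identities are taken from the tables below, and the actual selection of the sequences for each set was not necessarily done this way.

```python
def assign_identity_class(identity_percent):
    """Classify a hit by its sequence identity to the query (in %)."""
    if identity_percent > 60.0:
        return "high"    # candidate for set 1
    if identity_percent < 30.0:
        return "low"     # candidate for set 2
    return "middle"      # only eligible for the mixed set 3


def group_by_identity(identities):
    """Group sequence IDs by identity class; `identities` maps ID -> %."""
    groups = {"high": [], "low": [], "middle": []}
    for seq_id, identity in identities.items():
        groups[assign_identity_class(identity)].append(seq_id)
    return groups


if __name__ == "__main__":
    example = {"A9UD35": 84.1, "H6CEV7": 26.5, "E1ZYU8": 42.7}
    print(group_by_identity(example))
```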

Set 1: sequence identity >60%
ID	Identity in %
P04062	100
A9UD35	84.1
D1L2S0	83.0
3gxi_A	99.8
2nt1_A	99.8
F6WDY8	90.7
Q2KHZ8	89.2
F5CB27	81.8
F5H241	98.2
B7Z6S9	99.8

Set 2: sequence identity <30%
ID	Identity in %
P04062	100
H6CEV7	26.5
Q21GD0	24.3
I1WBF3	23.7
I9HH59	29.4
D0TN48	25.4
B1VPJ0	27.5
E2LY19	24.8
K9HBW2	25.9
B5QQZ8	27.1

Set 3: sequence identity over all
ID	Identity in %	ID	Identity in %
P04062	100	B5DYA3	32.0
B4JTN5	34.1	F5CB37	81.8
E1ZYU8	42.7	J2STU0	34.2
F4E6W5	26.4	G9BHQ3	80.7
F4X220	41.7	2nsx_B	99.8
E5UPZ0	21.6	G9BHQ5	82.5
C6A5Q0	25.0	D0TIX6	25.8
B1QKT0	27.8	1y7v_A	99.8
D4SV88	34.6	A9UD58	80.4
A9UD54	81.8	G9MUP6	20.7

ClustalW

Multiple alignments generated with ClustalW:

Results for Clustalw of : Set1
Results for Clustalw of : Set2
Results for Clustalw of : Set3

Set 1, which includes sequences with high similarity, has only 11 conserved columns. However, these columns lie densely within a region of 55 amino acids. Columns that are not conserved by identical residues but contain similar amino acids can also be found in this highly conserved region (marked blue in the linked multiple alignment results above). The gaps of the alignment split the protein into 7 mostly long stretches of residues. In particular, the last 400 residues have a very high sequence identity between glucocerebrosidase and 6 other proteins.

Set 2 has fewer conserved columns than set 1. They are spread over the alignment and do not form a conserved region. There are more contiguous gaps, but they are shorter.

There are no conserved columns for set 3. This could result from the greater number of sequences in the set: the more sequences an alignment contains, the lower the probability of conserved columns. Because the sequences span the whole range of identity, the sequences with low identity cause the loss of a conserved region. This can be seen by looking only at the sequences with high similarity (marked green in the alignment of set 3).

Results of ClustalW in numbers
	Set 1	Set 2	Set 3
Conserved Columns	11	8	0
Number of contiguous gaps in P04062	7	18	18
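The two quantities in the table can be counted directly from the aligned sequences. This is a minimal sketch under two assumptions: a column counts as conserved only if every sequence has the same residue and no gap, and every run of `-` (including leading and trailing ones) counts as one contiguous gap. Whether the numbers above were obtained with exactly these definitions is not stated here.

```python
import re


def conserved_columns(alignment):
    """Count columns where all sequences share one residue and none has a gap."""
    count = 0
    for column in zip(*alignment):
        if "-" not in column and len(set(column)) == 1:
            count += 1
    return count


def contiguous_gaps(aligned_sequence):
    """Count runs of one or more '-' characters in one aligned sequence."""
    return len(re.findall(r"-+", aligned_sequence))


if __name__ == "__main__":
    toy_alignment = ["AC-GT", "ACTGA", "AC-GT"]  # toy data, not from the sets
    print(conserved_columns(toy_alignment))
    print(contiguous_gaps("--AC---G-T"))
```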

Muscle

Multiple alignments generated with Muscle:

Results for Muscle of : Set1
Results for Muscle of : Set2
Results for Muscle of : Set3

In set 1 the native protein sequence itself has no internal gaps. There are only gaps at the beginning of the sequence, because the alignment is longer than the glucocerebrosidase sequence. There are also no conserved columns. This is caused by the shift of the sequences: at no alignment position are residues of all sequences aligned. The whole alignment has only one long contiguous gap, inside the protein sequence F5H241. Some of the shorter sequences (D1L2S0, A9UD35, F5CB27) are aligned at exactly those alignment positions where F5H241 has its gap. If only the aligned residues were considered and the gaps were neglected, there would be many conserved columns.

In contrast to set 1, set 2 has 10 conserved columns. However, they are widely spread over the alignment. The same holds for columns with similar residues. Because of this scattering, the conserved columns seem to have arisen by chance rather than through conservation for functional reasons. The high number and the partly great length of the contiguous gaps stretch out the alignment. The alignment gives a good overview of the relation between sequences with low similarity and sequence identity. It also shows that finding functionally important conserved regions requires higher similarity and identity.

Results of Muscle in numbers
	Set 1	Set 2	Set 3
Conserved Columns	0	10
Number of contiguous gaps in P04062	1	27
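The remark on set 1, that many columns would be conserved if gaps were neglected, corresponds to a relaxed definition of conservation. A minimal sketch of that relaxed count follows; the function name and the `min_residues` parameter are assumptions introduced for this example.

```python
def conserved_columns_ignoring_gaps(alignment, min_residues=2):
    """Count columns whose non-gap residues are all identical.

    Columns with fewer than `min_residues` residues are skipped, since a
    single residue is trivially "conserved".
    """
    count = 0
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if len(residues) >= min_residues and len(set(residues)) == 1:
            count += 1
    return count


if __name__ == "__main__":
    toy_alignment = ["--ACG", "TTAC-", "-TACG"]  # toy data, not from the sets
    print(conserved_columns_ignoring_gaps(toy_alignment))
```

In the toy alignment only two columns are gap-free and identical, but four count as conserved once gaps are ignored, which mirrors the effect described for the shifted sequences.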

