Gaucher Disease: Task 02 - Alignments
Alignments allow the comparison of strings. In bioinformatics, sequence alignments show the relation between two or more sequences.
Theoretical Background
Description
Pairwise or multiple alignments often contain an additional line below the alignment itself. This line describes the relation between the aligned residues above it more precisely: the symbols show whether a column contains identical amino acids or only similar ones.
Symbols for describing sequence alignments | ||
---|---|---|
Symbol | Example | Meaning |
* | blub | identical residues |
: | | similar residues |
. | | |
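How such a symbol line can be derived from an alignment is sketched below. This is a minimal illustration, not the code of any alignment program: the function name `annotation_line` is hypothetical, the toy sequences are made up, and the residue groups used for ":" are assumed to follow the "strong" groups of the Clustal convention. The "." symbol is left out because its meaning is not given in the table above.

```python
# Residue groups treated as "similar"; assumed here to match the
# strongly conserved groups of the Clustal convention.
STRONG_GROUPS = ("STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW")


def annotation_line(aligned):
    """Return one symbol per alignment column: '*', ':' or ' '."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    symbols = []
    for column in zip(*aligned):
        residues = set(column)
        if "-" in residues:
            symbols.append(" ")   # columns with gaps get no symbol
        elif len(residues) == 1:
            symbols.append("*")   # identical residues
        elif any(residues <= set(group) for group in STRONG_GROUPS):
            symbols.append(":")   # similar residues
        else:
            symbols.append(" ")
    return "".join(symbols)


if __name__ == "__main__":
    toy = ["MKV-A", "MRI-A"]      # made-up sequences for illustration
    for seq in toy:
        print(seq)
    print(annotation_line(toy))
```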
Sequence Searches
Sequence searches with our query protein sequence, P04062.fasta, were done with the following programs:
- BLAST
  - using standard parameters
  - against big_80
  - against big
- Psi-BLAST
  - with the number of shown hits and alignments set to 10000 (-b, -v options), so that all hits are shown
  - with all combinations of:
    - 2 iterations: 1 iteration against big_80 followed by 1 iteration against big
    - 10 iterations: 9 iterations against big_80 followed by 1 iteration against big
    - default E-value cutoff (0.002)
    - E-value cutoff 10E-10
  - other options left at their defaults
- HHblits
  - with the number of shown hits and alignments set to 10000 (-Z, -B options), as in Psi-BLAST
  - with all combinations of:
    - 2 iterations against uniprot_20
    - 10 iterations against uniprot_20
    - default E-value cutoff (0.002)
    - E-value cutoff 10E-10
  - other options left at their defaults
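The parameter combinations listed above can be enumerated as in the following sketch. It is only an illustration in Python and not the run.pl script itself; the function names and dictionary keys are hypothetical, and the E-value cutoffs are kept as the strings given in the list.

```python
from itertools import product

ITERATIONS = (2, 10)
EVALUE_CUTOFFS = ("0.002", "10E-10")  # values as written in the list above


def psiblast_runs():
    """All Psi-BLAST runs: n-1 iterations on big_80, then 1 on big."""
    runs = []
    for total_iterations, cutoff in product(ITERATIONS, EVALUE_CUTOFFS):
        runs.append({
            "program": "Psi-BLAST",
            "evalue_cutoff": cutoff,
            # (database, number of iterations) for each stage
            "stages": [("big_80", total_iterations - 1), ("big", 1)],
        })
    return runs


def hhblits_runs():
    """All HHblits runs: every iteration goes against uniprot_20."""
    return [
        {
            "program": "HHblits",
            "evalue_cutoff": cutoff,
            "stages": [("uniprot_20", total_iterations)],
        }
        for total_iterations, cutoff in product(ITERATIONS, EVALUE_CUTOFFS)
    ]


if __name__ == "__main__":
    for run in psiblast_runs() + hhblits_runs():
        print(run)
```

Each program ends up with four runs (two iteration counts times two cutoffs).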
The script run.pl was written and used for the runs. Output files (a3m and hhr for HHblits; chk ("checkpoint") and PSSM for Psi-BLAST) were saved so that a search could be continued against another database: from big_80 to big for Psi-BLAST, and later against a PDB database for the evaluation.
For Psi-BLAST, a search against big_80 was run first in order to build a good profile; a final iteration against big was then run with this profile. The idea was to get as many hits as possible, so that the results are comparable with those of HHblits, where the runs were made against the clustered HMM database. All uniprot_20 cluster members were counted in the following comparison and evaluation.
Comparison
Overlap of hits
Percentage identity distribution
E-value distribution
Evaluation
Validation was performed against the COPS L30 - L60 groups.
Multiple Sequence Alignment
For the multiple sequence alignments, three sets were created. To this end, the results of the previous task were grouped according to their sequence identity to the native protein sequence of glucocerebrosidase. Set 1 and set 2 each contain 10 sequences, including the native protein sequence, to keep the alignments in relation to the protein that causes Gaucher's disease. The remaining 9 sequences have a sequence identity to glucocerebrosidase of more than 60% in set 1 and less than 30% in set 2. In set 3, the sequences cover the whole range of sequence identity. The multiple sequence alignments were made with ClustalW, Muscle and T-Coffee.
Set 1: sequence identity >60% | |
---|---|
ID | Identity in % |
P04062 | 100 |
A9UD35 | 84.1 |
D1L2S0 | 83.0 |
3gxi_A | 99.8 |
2nt1_A | 99.8 |
F6WDY8 | 90.7 |
Q2KHZ8 | 89.2 |
F5CB27 | 81.8 |
F5H241 | 98.2 |
B7Z6S9 | 99.8 |
Set 2: sequence identity <30% | |
---|---|
ID | Identity in % |
P04062 | 100 |
H6CEV7 | 26.5 |
Q21GD0 | 24.3 |
I1WBF3 | 23.7 |
I9HH59 | 29.4 |
D0TN48 | 25.4 |
B1VPJ0 | 27.5 |
E2LY19 | 24.8 |
K9HBW2 | 25.9 |
B5QQZ8 | 27.1 |
Set 3: sequence identity over all | |||
---|---|---|---|
ID | Identity in % | ID | Identity in % |
P04062 | 100 | B5DYA3 | 32.0 |
B4JTN5 | 34.1 | F5CB37 | 81.8 |
E1ZYU8 | 42.7 | J2STU0 | 34.2 |
F4E6W5 | 26.4 | G9BHQ3 | 80.7 |
F4X220 | 41.7 | 2nsx_B | 99.8 |
E5UPZ0 | 21.6 | G9BHQ5 | 82.5 |
C6A5Q0 | 25.0 | D0TIX6 | 25.8 |
B1QKT0 | 27.8 | 1y7v_A | 99.8 |
D4SV88 | 34.6 | A9UD58 | 80.4 |
A9UD54 | 81.8 | G9MUP6 | 20.7 |
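The identity thresholds used for set 1 and set 2 can be expressed as a simple filter. The sketch below is a hedged illustration: the function name `split_by_identity` is hypothetical, and only a handful of IDs and identities from the tables above are used as input. How the nine sequences per set were picked from all candidates is not stated here, so the sketch covers the thresholds only.

```python
def split_by_identity(hits, high=60.0, low=30.0):
    """Split hits (ID -> identity in %) into a >high and a <low group."""
    high_set = [hit_id for hit_id, identity in hits.items() if identity > high]
    low_set = [hit_id for hit_id, identity in hits.items() if identity < low]
    return high_set, low_set


if __name__ == "__main__":
    # A few entries taken from the tables above.
    hits = {
        "A9UD35": 84.1,
        "F6WDY8": 90.7,
        "H6CEV7": 26.5,
        "Q21GD0": 24.3,
        "E1ZYU8": 42.7,  # falls between the thresholds, so in neither group
    }
    set1_candidates, set2_candidates = split_by_identity(hits)
    print("identity > 60%:", set1_candidates)
    print("identity < 30%:", set2_candidates)
```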
ClustalW
Set 1, which includes sequences with high similarity, has only 11 conserved columns. However, these columns lie densely within a region of 55 amino acids. Columns that are not conserved but contain similar amino acids can also be found in this highly conserved region (marked blue in the linked results below). The gaps of the alignment split the protein into 7 mostly long stretches of residues. In particular, the last 400 residues show a very high sequence identity between glucocerebrosidase and 6 other proteins.
Set 2 has fewer conserved columns than set 1. They are spread over the alignment and do not form a conserved region. There are more contiguous gaps, but they are shorter.
There are no conserved columns in set 3. This could result from the greater number of sequences in the set: the more sequences an alignment contains, the lower the probability of finding conserved columns.
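The two quantities discussed here, conserved columns and contiguous gaps in one sequence, can be counted from an alignment as sketched below. This is an assumption about how such numbers may be obtained, not a description of how the values in this section were actually determined; the function names and the toy alignment are hypothetical. Here a column counts as conserved only if all residues are identical and no gap is present, and leading and trailing gap runs are counted as well.

```python
import re


def conserved_columns(aligned):
    """Count columns in which every sequence has the same residue."""
    return sum(
        1
        for column in zip(*aligned)
        if "-" not in column and len(set(column)) == 1
    )


def gap_runs(aligned_sequence):
    """Count contiguous stretches of gap characters in one aligned sequence."""
    return len(re.findall(r"-+", aligned_sequence))


if __name__ == "__main__":
    toy_alignment = ["MK-VA", "MR-VA", "MKTVA"]  # made-up example
    print("conserved columns:", conserved_columns(toy_alignment))
    print("gap runs in first sequence:", gap_runs(toy_alignment[0]))
```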
- Results of ClustalW for: Set 1
Results of ClustalW | |||
---|---|---|---|
 | Set 1 | Set 2 | Set 3 |
Conserved Columns | 11 | 8 | 0 |
Number of contiguous gaps in P04062 | 7 | 18 |