Gaucher Disease: Task 02 - Alignments

From Bioinformatikpedia
Revision as of 21:44, 6 May 2013 by Gerkej (talk | contribs) (ClustalW)

Work is still in progress here. Please check today evening ;)

Alignments allow a comparisons of Strings. In the field of bioinformatics, sequence alignments show the relation between two or more sequences.

Theoretical Background

Description

Pairwise or multiple alignments often contain an aditional line below the proper alignment (or between two aligned sequences for pairwise alignments). This line gives a more accurate description of the relation of the aligned residues above. The symbols show if there is match between identical amino acids or if they are only similar.

Symbols for describing sequence alignments
Program(s) Symbol Example Meaning
MSAs * blub identical residues
: similar residues
.
(Psi-)BLAST same letter A
A
A
identical residues
+ L
+
V
similar residues
HHblits | AF
| |
AW
identical or very similar residues?
+ T
+
S
similar residues?
. N
.
H
non-similar residues?

Sequence Searches

Sequence searches with our query protein sequence, P04062.fasta, were done with the following programs:

  • BLAST
    • using standard parameters
    • against big_80
    • against big
  • Psi-BLAST
    • with number of shown hits and alignments set to 10000 (-b, -v options), so that all the hits will be shown.
    • with all combinations of:
      • 2 iterations: 1 iterations against big_80 followed by 1 iteration against big
      • 10 iterations: 9 iterations against big_80 followed by 1 iteration against big
      • default E-value cutoff (0.002)
      • E-value cutoff 10E-10
    • other options leaved default
  • HHblits
    • with number of shown hits and alignments set to 10000 (-Z, -B options), as in Psi-BLAST
    • with all combinations of:
      • 2 iterations against uniprot_20
      • 10 iterations against uniprot_20
      • default E-value cutoff (0.002)
      • E-value cutoff 10E-10
    • other options leaved default

The script run.pl was written and used for the runs. PSSM files - a3m and hhr for HHblits, chk ("checkpoint") and PSSM for Psi-BLAST were created in order to start the search against another database, from big_80 to big for Psi-BLAST and later against a PDB database for the evaluation.

HHblits searches among representative sequences from different uniprot_20 clusters. When a representative from one cluster is found, all the sequences from this cluster are being added to the hit list. Thus HHblits finds much more hits than PSI-BLAST. Therefore, one additional PSI-BLAST run against the unfiltered big database was performed, using checkpoint files of the last profile created from the previous search in big80, in order to find as much related sequences as possible, as HHblits does indirectly. Summing up, the following databases were used: big_80 for profile building and big for the last iteration for PSI-BLAST and uniprot20 for HHblits. BLAST runs (consisting only from one iteration) we searched once against big_80 and once against big to see the difference in number of found hits.

Comparison

The results were parsed and analysed using the script parse_output.pl.

Number and overlap of hits Number of found hits with each program and parameter combination and overlap of hits between some are summerized in the following tables. <todo tables>

Furthermore, E-value, percentage identity and alignment length distributions were plottet.

E-value distribution

E-value distribution of HHblits hits (2 iterations against uniprot_20 with default E-value cutoff, 0.001), Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with default E-value cutoff, 0.002) and BLAST hits (against big with default parameters). On the X-axis -log of the E-value is plotted, so the smallest (the best) E-value is on the right side. On the Y-axis is logarithmic frequency of the values on the X-axis.
E-value distribution of HHblits hits (2 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis -log of the E-value is plotted, so the smallest (the best) E-value is on the right side. On the Y-axis is logarithmic frequency of the values on the X-axis.
E-value distribution of HHblits hits (10 iterations against uniprot_20 with default E-value cutoff, 0.001) and Psi-BLAST hits (9 iteration against big_80 followed by one iteration against big with default E-value cutoff, 0.002). On the X-axis -log of the E-value is plotted, so the smallest (the best) E-value is on the right side. On the Y-axis is logarithmic frequency of the values on the X-axis.
E-value distribution of HHblits hits (10 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (9 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis -log of the E-value is plotted, so the smallest (the best) E-value is on the right side. On the Y-axis is logarithmic frequency of the values on the X-axis.


Percentage identity distribution

Identity distribution of HHblits hits (2 iterations against uniprot_20 with default E-value cutoff, 0.001), Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with default E-value cutoff, 0.002) and BLAST hits (against big with default parameters). On the X-axis %identity is plotted [0,1]. On the Y-axis is frequency of the values on the X-axis.
Identity distribution of HHblits hits (2 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis %identity is plotted [0,1]. On the Y-axis is frequency of the values on the X-axis.
Identity distribution of HHblits hits (10 iterations against uniprot_20 with default E-value cutoff, 0.001) and Psi-BLAST hits (9 iteration against big_80 followed by one iteration against big with default E-value cutoff, 0.002). On the X-axis %identity is plotted [0,1]. On the Y-axis is frequency of the values on the X-axis.
Identity distribution of HHblits hits (10 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (9 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis %identity is plotted [0,1]. On the Y-axis is frequency of the values on the X-axis.


Alignment length distribution

Alignment length distribution of HHblits hits (2 iterations against uniprot_20 with default E-value cutoff, 0.001), Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with default E-value cutoff, 0.002) and BLAST hits (against big with default parameters). On the X-axis alignment length is plotted. On the Y-axis is logarithmic frequency of the values on the X-axis.
Alignment length distribution of HHblits hits (2 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (1 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis %identity is plotted [0,1]. On the X-axis alignment length is plotted. On the Y-axis is logarithmic frequency of the values on the X-axis.
Alignment length distribution of HHblits hits (10 iterations against uniprot_20 with default E-value cutoff, 0.001) and Psi-BLAST hits (9 iteration against big_80 followed by one iteration against big with default E-value cutoff, 0.002). On the X-axis %identity is plotted [0,1]. On the X-axis alignment length is plotted. On the Y-axis is logarithmic frequency of the values on the X-axis.
Alignment length distribution of HHblits hits (10 iterations against uniprot_20 with E-value cutoff 10E-10) and Psi-BLAST hits (9 iteration against big_80 followed by one iteration against big with E-value cutoff 10E-10). On the X-axis %identity is plotted [0,1]. On the X-axis alignment length is plotted. On the Y-axis is logarithmic frequency of the values on the X-axis.


Evaluation

The quality of hits retrieved by HHblits, PSI-BLAST and BLAST was evaluated using the COPS L30/L40/L60 structural similarity groups as standard of truth. Protein chains with same L30/L40/L60 group have 30/40/60% structural similarity, respectively. For example, when using the L30 group as standard of truth, the hit is a true positive (TP) if it has the same L30 group as the query, otherwise it is a false positive (FP). Parsed results of the runs, also with the script parse_output.pl, include the following information for each hit (as mentioned, in case of more then one alignment of the query and the same found sequence, only the one with the lowest E-value was used): E-value, whether the hit is a TP or a FP, query ID and hit ID.
In the evaluation routine (tpr_precision.pl) each file is first sorted according to ascending E-value. Afterwards, the following rates are calculated for each line (from the first to the current line): TPR (sensitivity) and precision, which are defined as follows:

TPR = TP/(TP+FN)
precision = TP/(TP+FP)

where:

  • TP = the number of true positive hits, namely all found hits with the same COPS L30/L40/L60 group as the query
  • FP = the number of false positive hits, that is with different groups than the query
  • FN = the number of false negatives, which are not detected chains with the same group as the query
  • TP+FN = the number of all positives, i.e. the number of all known PDB chains with the same group as the query
  • TP+FP = the number of all found hits


Multiple Sequence Alignment

For the multiple sequence alignments three sets were created. Therefore the results of the previous task were according to their sequnece identiy to the native protein sequence of glucocerebrosidase. Set 1 and set 2 contains 10 sequences, including the native protein sequence to keep the alignments in relation to the Gaucher's disease causing protein. For the remainig 9 sequences have a sequence identity to the glucocerebrosidase of more than 60% in set 1 and less than 30% in set 2. In set 3, the sequences are over the whole range of sequence identity. The multiple sequence alignments were made with Clustalw, Muscle, T-Coffee and Espresso.

In the following analyses the term "conserved region" is used for a section of columns that feature a higher number of conserved columns or columns of similar residues.

Set 1: sequence identity >60%
ID Identity in %
P04062 100
A9UD35 84.1
D1L2S0 83.0
3gxi_A 99.8
2nt1_A 99.8
F6WDY8 90.7
Q2KHZ8 89.2
F5CB27 81.8
F5H241 98.2
B7Z6S9 99.8
Set 2: sequence identity <30%
ID Identity in %
P04062 100
H6CEV7 26.5
Q21GD0 24.3
I1WBF3 23.7
I9HH59 29.4
D0TN48 25.4
B1VPJ0 27.5
E2LY19 24.8
K9HBW2 25.9
B5QQZ8 27.1
Set 3: sequence identity over all
ID Identity in % ID Identity in %
P04062 100 B5DYA3 32.0
B4JTN5 34.1 F5CB37 81.8
E1ZYU8 42.7 J2STU0 34.2
F4E6W5 26.4 G9BHQ3 80.7
F4X220 41.7 2nsx_B 99.8
E5UPZ0 21.6 G9BHQ5 82.5
C6A5Q0 25.0 D0TIX6 25.8
B1QKT0 27.8 1y7v_A 99.8
D4SV88 34.6 A9UD58 80.4
A9UD54 81.8 G9MUP6 20.7

ClustalW

Multiple alignments generated with ClustalW:

  • Results for Clustalw of  : Set1
  • Results for Clustalw of  : Set2
  • Results for Clustalw of  : Set3


Set 1, which includes sequences with high similarity has only 11 conserved columns. But these columns lie densly in an area of 55 amio acids. Also columns that are not conserved by identical residues, but have similar amino acids can be found in this high conserved region (marked blue in the linked multiple alignment results above). The gaps of the alignment split the protein into 7, mostly longer parts of of residues. Especially the last 400 residues have a very high sequence identity between the clucosylcerebrosidase and 6 other proteins.

Set 2 has less conserved columns than set 1. They are spread over the alignment and build no conserved region. The contiguous gaps are more but shorter.

There exist no conserved columns for set 3. This could result from the greater number of sequences in the set. The more sequences are in an alignment the lesser the probability of having conserved columns. Because of the sequences having an identity over the whole range, the sequences with a low identity cause to this loss of a conserved region. This can be seen by looking only on the sequences with high similarity (marked green in the alignment of set 3).


Results of ClustalW in numbers
Set 1 Set 2 Set 3
Conserved Columns 11 8 0
Number of gaps in the Alignm. 3797 5239 6721
Number of contiguous gaps in P04062 7 18 18

Muscle

Multiple alignments generated with Muscle:

  • Results for Muscle of  : Set1
  • Results for Muscle of  : Set2
  • Results for Muscle of  : Set3

In Set 1 the native protein sequence itself has no gap. There are only gaps at the beginning of the sequence, because the length of the alignment is longer than the length of clucocerebrosidase sequence. There also exist no conserved columns. This is caused because of the shift of the sequences. So, in no alignment position, the residues of all sequences are aligned. The whole alignment has only one long contigous gap inside of the protein sequence F5H241. Some of the shorter sequences (D1L2S0, A9UD35, F5CB27) are aligned at these alignment positions, where F5H241 has its gap. If only the aligned residues would be considered and the gaps were neglected, there would be a lot of conserved columns.

In contrary to set 1, set 2 has 10 conserved columns. However they are widly spread over the alignment. The same observation can be made of columns with similar residues. Through this scattering it seems rather a randomly generated conserved column than a conservation due to functionaly reasons. The high number as well as the partly great length of the contigous gaps straighten the alignment. The alignment gives a great overview of the relation between sequences with low similarity and sequence identity. It also shows that for finding functional important areas that are conserved, a higher similarity and identity is needed.

Set 3 has also no conserved columns, for the same reason as explained befor, for set 1. It has indeed the highest number of gaps of the three sets. But contrary to the alignment of set 1 the contigous gaps are shorter, partially they are only 1-2 gaps long. So the alignment appears less straightened, as the sequences with high sequence identity keep the alignment a bit compact.


Results of Muscle in numbers
Set 1 Set 2 Set 3
Conserved Columns 0 10 0
Number of gaps in the Alignm.
Number of contiguous gaps in P04062 1 27 36

T-Coffee

Multiple alignments generated with T-Coffee:

  • Results for T-Coffee of  : Set1
  • Results for T-Coffee of  : Set2
  • Results for T-Coffee of  : Set3


The Alignment of set 1 generated by T-Coffee is similar to its alignment made by Muscle. There are again no conserved columns because of the shift of the sequence fragments. The T-Coffee alignment has a few more gaps, which are also very long, and cause a different fragment shift than in the Muscle alignment. Interestingly the short sequences A9UD35 and D1L2S0 have one gap. This extremely long contigous gap shifts the last amino acid of the sequence to the last position of the alignment. As these two amino acids (G, D) are not identical to the remaining amino acids (Q) at the last alignment position, there seem to be no reason for this observation. The suspicion of an error arises, as the two amino acids (G,D) would perfectly fit at the end of ther sequence without a contigous gap. This observation is emphasised in violet coloring in the visualized Alignment, which is linked above.

Set 2 shows widly spread conserved columns without building a conserved region. The alignment itself has lots of gaps. An interpretation of the alignment in a biological way should be made carefully, as it bases on low sequence identity and similarity. Noticeable for this alignment is its beginning and end. All sequences are aligned in these areas with at least 4 residues.

Sequences with an identity over the full range (set 3) generate many gaps of different length and no conserved columns. On the one hand blocks of aligned residues are seperated mostly by a sinlge gap due to a few sequences. On the other hand a block is not completely aligned by residues of all sequences, but has an contigous gap of a few sequences. (Examples are shown in violet in the T-Coffee alignment of set 3)


Results of Muscle in numbers
Set 1 Set 2 Set 3
Conserved Columns 0 12 0
Number of gaps in the Alignm.
Number of contiguous gaps in P04062 1 32 39


All generated T-Coffee alignments seem to focus aligned residues at their beginning or their end. So the question arises: Why? Why are these residues aligend, even though enormous gap penalty accrue considering the resulting long contigous gaps. One thing is certain: T-Coffee does not use a freeshift alignments for generating a multiple alignment.

Espresso

Espresso is an advanced programm of T-Coffee. While T-Coffe only uses sequences, Espresso also takes the 3D structure of the Proteins into account. The Alignments were generated on the T-Coffee server which provides both programms. Esspresso searches automatically the PDB with BLAST to extract the structures of the sequences.

Multiple alignments generated with Espresso:

  • Results for Espresso of  : Set1
  • Results for Espresso of  : Set2
  • Results for Espresso of  : Set3

(Please note: in case of an unavailability of the Alignments in the links above, please contact the author)

The additional use of the structure improves the alignment of Sequences with high sequence identity (set 1). In contrary to T-Coffee Espresso detects now conserved columns. Through a little difference in the alignment score calculating, the sequence F5H241 alignes different than in T-Coffee. Now an conserved area of 34 residues including 12 conserved columns and 9 columns with similar amino acids can be seen in the alignment.

For the two other sets containing also sequences with low sequence identity show no remarkable improvement to the results of T-Coffee.


Results of Espresso in numbers
Set 1 Set 2 Set 3
Conserved Columns 12 11 0
Number of gaps in the Alignm.
Number of contiguous gaps in P04062 1 46 51

Alignment Comparison

Set 1: high similarity

The Alignments produced by Muscle, T-Coffe and Espresso are very similar. They only differ in their aligning of the sequence F5H241. In case of Espresso the sequence was aligned so that a conserved region is observable. This region overlaps with the conserved region, detected by ClustalW, even though the complete alignment of clustalw differs more from the other three.

Set 2: low similarity

For future research

Multiple sequence alignments are suitable to detect conserved regions on proteins. As for such researches sequences with high similarity cause proper results, the choice of the tool is based on the results of set 1. For a good alignment I would expect few gaps and many conserved columns of residues. Under these categories I would use and advise Espresso. It shows more conserved columns than Muscle and T-Coffe, and also has less gaps than clustalW. The Espresso generated Alignment is a compact alignment which provides a conserved region.

In case of only having sequences with low similarity, I would tend to the use of clustalW. The analyses with set 2 and set 3 show better results for Alignments generated by clustalW. These Alignments have less gaps and less straightened than the alignments produced by Muscle, T-Coffee and Espresso.