Difference between revisions of "Task 2: Alignments"
Line 127: | Line 127: | ||
<figtable id="psiblast pdb evalue"> |
<figtable id="psiblast pdb evalue"> |
||
{| align="center" |
{| align="center" |
||
− | |align="center" | [[File:pdb_psi_hist_e.png|center|frame|Histogram of the log(evalue) distribution of the results of the four Psiblast searches against pdb_seqres using the checkpoint files from the runs before. (2 iterations with default evalue cutoff (0.002) and 10E-10 and 10 iterations with default evalue cutoff(0.002) and 10E-10). The x-axis shows the natural logarithm of the evalue and the y-axis the frequency. The better the evalue the smaller is log(evalue).]] |
+ | |align="center" | [[File:pdb_psi_hist_e.png|center|frame|'''a)''' Histogram of the log(evalue) distribution of the results of the four Psiblast searches against pdb_seqres using the checkpoint files from the runs before. (2 iterations with default evalue cutoff (0.002) and 10E-10 and 10 iterations with default evalue cutoff(0.002) and 10E-10). The x-axis shows the natural logarithm of the evalue and the y-axis the frequency. The better the evalue the smaller is log(evalue).]] |
− | |align="center" | [[File:pdb_psi_box_e.png|center|frame|Boxplot of the log(evalue) of the four different Psiblast searches against pdb_seqres using the checkpoint files from the runs before. (2 iterations with default evalue cutoff (0.002) and 10E-10 and 10 iterations with default evalue cutoff(0.002) and 10E-10). The thick black line inside the box is the median and the upper and lower end of the box the 25% and 75% quantils. The single points below the whiskers are outliers.]] |
+ | |align="center" | [[File:pdb_psi_box_e.png|center|frame|''' b)''' Boxplot of the log(evalue) of the four different Psiblast searches against pdb_seqres using the checkpoint files from the runs before. (2 iterations with default evalue cutoff (0.002) and 10E-10 and 10 iterations with default evalue cutoff(0.002) and 10E-10). The thick black line inside the box is the median and the upper and lower end of the box the 25% and 75% quantils. The single points below the whiskers are outliers.]] |
|+ style="caption-side: bottom; text-align: left" |<font size=1.5>'''Figure 2:''' E-value distribution of the the results from four Psiblast runs in pdb_seqres. |
|+ style="caption-side: bottom; text-align: left" |<font size=1.5>'''Figure 2:''' E-value distribution of the the results from four Psiblast runs in pdb_seqres. |
||
|} |
|} |
||
Line 136: | Line 136: | ||
<figtable id="psiblast pdb identity"> |
<figtable id="psiblast pdb identity"> |
||
{| align="center" |
{| align="center" |
||
− | |align="center" | [[File:pdb_psi_hist_i.png|left|frame|Histogram of the identity distribution of the results of the four different Psiblast searches against pdb_seqres using the checkpoint files from the runs before. (2 iterations with default evalue cutoff (0.002) and 10E-10 and 10 iterations with default evalue cutoff(0.002) and 10E-10). The x-axis shows the relative identity between the query and the hits.]] |
+ | |align="center" | [[File:pdb_psi_hist_i.png|left|frame|'''a)''' Histogram of the identity distribution of the results of the four different Psiblast searches against pdb_seqres using the checkpoint files from the runs before. (2 iterations with default evalue cutoff (0.002) and 10E-10 and 10 iterations with default evalue cutoff(0.002) and 10E-10). The x-axis shows the relative identity between the query and the hits.]] |
|align="center" | [[File:pdb_psi_box_i.png|center|frame|Boxplot of the identity of the results of the four different Psiblast searches against pdb_seqres using the checkpoint files from the runs before. (2 iterations with default evalue cutoff (0.002) and 10E-10 and 10 iterations with default evalue cutoff(0.002) and 10E-10). The thick black line inside the box is the median and the upper and lower end of the box the 25% and 75% quantils. The single points above and below the whiskers are outliers.]] |
|align="center" | [[File:pdb_psi_box_i.png|center|frame|Boxplot of the identity of the results of the four different Psiblast searches against pdb_seqres using the checkpoint files from the runs before. (2 iterations with default evalue cutoff (0.002) and 10E-10 and 10 iterations with default evalue cutoff(0.002) and 10E-10). The thick black line inside the box is the median and the upper and lower end of the box the 25% and 75% quantils. The single points above and below the whiskers are outliers.]] |
||
|+ style="caption-side: bottom; text-align: left" |<font size=1.5>'''Figure 2:''' Identity distribution of the results from the four Psiblast runs in pdb_seqres. |
|+ style="caption-side: bottom; text-align: left" |<font size=1.5>'''Figure 2:''' Identity distribution of the results from the four Psiblast runs in pdb_seqres. |
||
Line 178: | Line 178: | ||
<figtable id="hhblits"> |
<figtable id="hhblits"> |
||
{| align="center" |
{| align="center" |
||
− | |align="center" | [[File:hhblits_hist_e.png|center|frame|Evalue distribution of the results of the HHblits search with default parameters using the using the uniprot20_02Sep11 database. The x-axis shows the natural logarithm of the evalue and the y-axis the frequency. The better the evalue the smaller |
+ | |align="center" | [[File:hhblits_hist_e.png|center|frame|'''a)''' Evalue distribution of the results of the HHblits search with default parameters using the using the uniprot20_02Sep11 database. The x-axis shows the natural logarithm of the evalue and the y-axis the frequency. The better the evalue the smaller the log(evalue).]] |
− | |align="center" | [[File:hhblits_hist_i.png|center|frame|Identity distribution of the results of the HHblits search in the uniprot20_02Sep11 database using default parameters. The x-axis shows the relative identity between the query and the hits.]] |
+ | |align="center" | [[File:hhblits_hist_i.png|center|frame''' b)''' |Identity distribution of the results of the HHblits search in the uniprot20_02Sep11 database using default parameters. The x-axis shows the relative identity between the query and the hits.]] |
|+ style="caption-side: bottom; text-align: left" |<font size=1.5>'''Figure 2:'''Evalue and identity distribution of the hhblits results from uniprot20. |
|+ style="caption-side: bottom; text-align: left" |<font size=1.5>'''Figure 2:'''Evalue and identity distribution of the hhblits results from uniprot20. |
||
|} |
|} |
||
Line 192: | Line 192: | ||
<figtable id="hhblits"> |
<figtable id="hhblits"> |
||
{| align="center" |
{| align="center" |
||
− | |align="center" | [[File:compDefault_e.png|left|frame|TODO evalue]] |
+ | |align="center" | [[File:compDefault_e.png|left|frame|'''a)''' TODO evalue]] |
− | |align="center" |[[File:compDefault_i.png|center|frame|TODO idenity]] |
+ | |align="center" |[[File:compDefault_i.png|center|frame|''' b)''' TODO idenity]] |
|+ style="caption-side: bottom; text-align: left" |<font size=1.5>'''Figure 2:''' Comparison of Blast, Psiblast and hhblits with respect to the E-value and the identity. |
|+ style="caption-side: bottom; text-align: left" |<font size=1.5>'''Figure 2:''' Comparison of Blast, Psiblast and hhblits with respect to the E-value and the identity. |
||
|} |
|} |
Revision as of 00:23, 28 August 2013
Sequence alignments are a good starting point for analysing a new protein sequence. Alignment search tools such as Blast, Psiblast or hhblits can be used to find related protein sequences.
Sequence Searches
The HFE protein sequence was used to conduct sequence searches with the three different search tools Blast, Psiblast and hhblits. CATH and COPS, two protein structure classification databases, were used to evaluate and compare the results of the different tools.
CATH is a database that hierarchically classifies all domain structures from PDB. This hierarchy consists of four major levels: 1. Class 2. Architecture 3. Topology 4. Homologous superfamily
We compared the first level of the CATH hierarchy (Class) to check whether the hits have the same fold class as the query. The class of a domain is determined by its secondary structure composition (mostly alpha, mostly beta, mixed alpha/beta or few secondary structures). Each domain of a protein is assigned to a CATH class, which means that one query protein can span several fold classes, one for each domain. The HFE protein consists of two domains that belong to CATH fold classes 2 and 3.
Another database for protein structure clustering is COPS. It consists of different structural groups, each of which contains all proteins with a structural similarity above a threshold. For example, all proteins in an L30 group have a structural similarity of at least 30%. This means that if two proteins are not in the same L30 group, they have less than 30% structural similarity. The L30 groups of the hits were determined to check whether or not they are in the same group as the HFE protein.
Blast
The Blast search in the big_80 database yielded 1504 hits.
<figtable id="blast distribution">
</figtable>
<xr id="blast distribution"/> shows the E-value and identity distributions of the Blast search hits. Only a few hits have a very low (that is, very good) E-value of 1e-160; most of the good hits have an E-value between 1e-50 and 1e-40 (left peak in a)). The majority of the results have a worse E-value, above 1e-20. Nearly all hits share 20% to 40% sequence identity with the query.
The Blast results contain only 13 PDB IDs. Nevertheless, the number of CATH classes shared between HFE and each hit was computed as described above. Of the 13 PDB hits, only 3 share both domain fold classes with HFE, 6 have one domain fold class in common and 4 have none.
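The values behind such histograms can be extracted from the search output. The following is a minimal sketch, assuming the hits were exported in the 12-column tabular BLAST format (percent identity in column 3, E-value in column 11). The function name, the floor used for E-values reported as 0 and the two example lines are illustrative and not taken from the original analysis.

```python
import math

def parse_tabular_hits(lines, evalue_floor=1e-200):
    """Return (log_evalues, identities) from 12-column tabular BLAST lines.

    Assumes percent identity in column 3 and E-value in column 11.
    E-values reported as 0 are clamped to evalue_floor (an arbitrary
    choice) so that the natural logarithm is defined.
    """
    log_evalues, identities = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        identity = float(fields[2]) / 100.0           # percent -> fraction
        evalue = max(float(fields[10]), evalue_floor)
        log_evalues.append(math.log(evalue))          # natural logarithm
        identities.append(identity)
    return log_evalues, identities

# Made-up example lines, only to show the expected input shape.
example = [
    "query\thit1\t35.0\t300\t180\t5\t1\t300\t10\t305\t1e-45\t170",
    "query\thit2\t22.5\t280\t200\t9\t5\t290\t3\t280\t0.0\t600",
]
logs, idents = parse_tabular_hits(example)
```

The two returned lists can be passed directly to any plotting routine to produce the histogram and boxplot.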
number of same fold classes | frequency |
---|---|
0 | 4 |
1 | 6 |
2 | 3 |
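The counting behind this table can be sketched as follows. This is a minimal illustration, assuming the CATH classes of each PDB hit have already been looked up. The hit names and their classes are hypothetical; only the query classes 2 and 3 come from the text.

```python
from collections import Counter

def shared_class_histogram(query_classes, hit_classes_by_id):
    """Count, for each hit, how many of the query's CATH classes it shares.

    Returns a Counter mapping 'number of shared classes' -> 'number of hits'.
    Repeated classes within one hit (several domains of the same class)
    are counted once.
    """
    query = set(query_classes)
    return Counter(
        len(query & set(classes)) for classes in hit_classes_by_id.values()
    )

# HFE domains belong to CATH classes 2 and 3 (from the text).
query = {2, 3}
# Hypothetical hits with hypothetical class assignments.
hits = {"hitA": [2, 3], "hitB": [3], "hitC": [1], "hitD": [2, 2]}
hist = shared_class_histogram(query, hits)
for shared in sorted(hist):
    print(shared, hist[shared])
```

The same pattern applies to the COPS check: replace the class sets by the L30 group identifier of each hit and count the hits whose identifier equals that of the query.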
Only 6 of the 13 pdb hits are in the same COPS L30 group as the HFE protein.
All results indicate that some of the Blast hits are not really related to HFE and are thus false positives.
Psiblast
Four different Psiblast runs against big_80 were done, using all four combinations of:
- 2 iterations
- 10 iterations
- default E-value cutoff (0.002)
- E-value cutoff 10E-10
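The four combinations can be enumerated programmatically. The sketch below assumes the legacy `blastpgp` program, whose `-j` (iterations) and `-h` (E-value cutoff) options match the j/h notation used in the tables of this section; the query, output and checkpoint file names are hypothetical. The cutoff values are kept as the strings written above rather than being converted to numbers, and the commands are only built, not executed.

```python
from itertools import product

ITERATIONS = ["2", "10"]
CUTOFFS = ["0.002", "10E-10"]  # kept exactly as written in the text

def build_runs(query="query.fasta", database="big_80"):
    """Return a list of (label, command) pairs, one per parameter combination.

    File names are placeholders; adjust them to the local setup.
    """
    runs = []
    for j, h in product(ITERATIONS, CUTOFFS):
        label = f"j{j}_h{h}"
        command = [
            "blastpgp",
            "-i", query,                      # query sequence
            "-d", database,                   # search database
            "-j", j,                          # number of iterations
            "-h", h,                          # E-value cutoff for profile inclusion
            "-o", f"psiblast_{label}.out",    # hypothetical output file
            "-C", f"psiblast_{label}.chk",    # hypothetical checkpoint file
        ]
        runs.append((label, command))
    return runs

runs = build_runs()
for label, command in runs:
    print(label, " ".join(command))
```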
The E-value and identity distributions of the four runs are plotted below.
<figtable id="psiblast evalue">
</figtable>
<figtable id="psiblast identity">
</figtable>
The boxplot of the E-value distribution shows that the default parameter settings (j=2 and h=0.002) can be considered the best. The median of log(E-value) is higher than in the other three Psiblast runs. This indicates that the percentage of hits with a good E-value is higher in the default Psiblast run than in the other three. In addition, the median sequence identity is also highest with the default parameters.
The results of the CATH and COPS classification analysis are listed below:
psiblast parameters | j=2, h=0.002 | j=2, h=10E-10 | j=10, h=0.002 | j=10, h=10E-10 |
---|---|---|---|---|
# hits | 2564 | 3163 | 2601 | 3343 |
# PDB hits | 66 | 72 | 56 | 66 |
0 CATH class | 21 | 21 | 20 | 21 |
1 CATH class | 42 | 48 | 33 | 42 |
2 CATH classes | 3 | 3 | 3 | 3 |
same L30 group | 6 | 6 | 6 | 6 |
There are 1095 proteins in the same COPS L30 group as the query HFE. Since the big_80 database does not contain all proteins from the PDB, the quality of the program cannot be assessed using the COPS L30 group of HFE. Nevertheless, only a few of the PDB hits found belong to the same L30 group and share the same CATH fold classes with HFE.
To obtain more PDB IDs, we also ran four Psiblast searches against /mnt/home/rost/kloppmann/data/blast_db/pdb_seqres using the profiles created with big_80. These four result sets were compared as well.
<figtable id="psiblast pdb evalue">
</figtable>
<figtable id="psiblast pdb identity">
</figtable>
psiblast checkfiles | j=2, h=0.002 | j=2, h=10E-10 | j=10, h=0.002 | j=10, h=10E-10 |
---|---|---|---|---|
# hits | 4307 | 4814 | 3874 | 4058 |
0 CATH class | 561 | 586 | 529 | 539 |
1 CATH class | 2798 | 3280 | 2397 | 2571 |
2 CATH classes | 948 | 948 | 948 | 948 |
same L30 group | 1047 | 1053 | 1045 | 1052 |
1095 proteins are in the same COPS L30 group as HFE. The different Psiblast searches found nearly all proteins that are in the same L30 group as HFE. Therefore, the decision about which parameter settings lead to the best results cannot be based on the PDB hits, since the differences are minimal. Nevertheless, j=10, h=0.002 had the fewest hits, but the highest percentage of hits in the same L30 group as the query.
We also checked how many of the hits overlapped between the four different runs. The results are shown in the Venn diagram below:
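The percentages referred to here can be recomputed from the table. The sketch below uses the hit counts and L30 counts listed above; treating the L30 count divided by 1095 as the fraction of the group that was recovered is an assumption, since it presumes that each counted hit corresponds to a distinct group member.

```python
# (number of hits, hits in the same L30 group) per run, taken from the table.
RUNS = {
    "j=2, h=0.002": (4307, 1047),
    "j=2, h=10E-10": (4814, 1053),
    "j=10, h=0.002": (3874, 1045),
    "j=10, h=10E-10": (4058, 1052),
}
GROUP_SIZE = 1095  # proteins in the same COPS L30 group as HFE

def l30_stats(hits, same_group, group_size=GROUP_SIZE):
    """Return (fraction of hits in the group, fraction of the group found)."""
    return same_group / hits, same_group / group_size

stats = {name: l30_stats(*counts) for name, counts in RUNS.items()}
best = max(stats, key=lambda name: stats[name][0])

for name, (in_group, recovered) in stats.items():
    print(f"{name}: {in_group:.1%} of hits in L30 group, "
          f"{recovered:.1%} of the group found")
print("highest share of hits in the L30 group:", best)
```

The fraction of the group found differs by less than one percentage point between the runs, while the share of hits inside the group ranges from roughly 22% to 27%.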
TODO Venn diagramm !!!!
Additionally, the 100 best hits of each run were compared to the 100 best hits of the other three runs. They were the same with all four parameter combinations. Thus, the parameter settings only matter if a high number of hits is needed, but not if only the best hits are used in further analyses.
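Such a comparison can be sketched with plain set operations. The example below is an illustration under assumptions: each run is represented as a list of (hit ID, E-value) pairs, "best" means lowest E-value, and the run names and toy hits are made up. Ties at the cutoff position are broken by input order, which may matter for real data.

```python
def top_n_ids(hits, n=100):
    """Return the IDs of the n hits with the lowest E-values."""
    ranked = sorted(hits, key=lambda hit: hit[1])
    return {hit_id for hit_id, _ in ranked[:n]}

def same_top_hits(runs, n=100):
    """True if every run has the same set of n best hit IDs."""
    tops = [top_n_ids(hits, n) for hits in runs.values()]
    return all(top == tops[0] for top in tops[1:])

def shared_by_all(runs):
    """IDs that occur in every run (the central region of a Venn diagram)."""
    id_sets = [{hit_id for hit_id, _ in hits} for hits in runs.values()]
    return set.intersection(*id_sets)

# Toy data, only to show the calls.
toy_runs = {
    "run1": [("a", 1e-50), ("b", 1e-30), ("c", 1e-5)],
    "run2": [("b", 1e-28), ("a", 1e-48), ("d", 1e-3)],
}
print(same_top_hits(toy_runs, n=2))
print(sorted(shared_by_all(toy_runs)))
```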
HHblits
<figtable id="hhblits">
</figtable>
It is remarkable that HHblits finds so many hits with a very low E-value. This may be explained by the fact that HHblits found several clusters with a good E-value. Since we used all proteins in these clusters for the analyses, and not only the cluster representatives, the number of hits is clearly higher than the number of hits from the Blast and Psiblast searches.
Comparison
<figtable id="hhblits">
</figtable>
TODO: Venn diagrams of hits overlap.
Multiple sequence alignments
The following data sets were chosen for the Multiple Sequence Alignments:
Below 30% SeqID:
AC Number | Sequence Identity |
F6JYA9 | 0.29 |
D9J389 | 0.28 |
D5MSB3 | 0.27 |
B3FRK2 | 0.20 |
3ov6_A | 0.22 |
1p7k_L | 0.21 |
H0Y1D0 | 0.20 |
B3FRK3 | 0.18 |
Q8HWL2 | 0.17 |
Above 60% SeqID:
AC Number | Sequence Identity |
B4DDZ1 | 0.83 |
Q5EEZ1 | 0.76 |
G3THV5 | 0.75 |
G1MBW1 | 0.73 |
H0VAR7 | 0.72 |
F1PX48 | 0.71 |
G1T7D7 | 0.70 |
O35799 | 0.68 |
G5BQE5 | 0.67 |
Whole range of SeqID:
AC Number | Sequence Identity |
B4DDZ1 | 0.83 |
Q5EEZ1 | 0.76 |
G1MBW1 | 0.73 |
H0VAR7 | 0.72 |
F1PX48 | 0.71 |
G1T7D7 | 0.70 |
G5BQE5 | 0.67 |
Q95IT9 | 0.38 |
2qrt_A | 0.38 |
Q8HX83 | 0.36 |
F6WCX4 | 0.34 |
F6JYA9 | 0.29 |
D9J389 | 0.28 |
D5MSB3 | 0.27 |
2zok_E | 0.24 |
H0Y1D0 | 0.20 |
Q8SNJ4 | 0.22 |
B3FRK3 | 0.18 |
Q8HWL2 | 0.17 |
pdb:
1p7k_L 0.21
3ov6_A 0.22
2qrt_A 0.38
2zok_E 0.24
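The grouping into identity ranges can be expressed as a simple filter. The sketch below uses the thresholds from the set names (below 30% and above 60%) and a few accession/identity pairs from the tables above; how the sequences were actually picked from the search results is not stated, so the function is only an illustration.

```python
def split_by_identity(identities, low=0.30, high=0.60):
    """Split {accession: identity} into the 'below low' and 'above high' sets."""
    below = {ac: value for ac, value in identities.items() if value < low}
    above = {ac: value for ac, value in identities.items() if value > high}
    return below, above

# A few entries taken from the tables above.
identities = {
    "B4DDZ1": 0.83,
    "Q5EEZ1": 0.76,
    "Q95IT9": 0.38,
    "2qrt_A": 0.38,
    "F6JYA9": 0.29,
    "Q8HWL2": 0.17,
}
below_30, above_60 = split_by_identity(identities)
print(sorted(below_30))
print(sorted(above_60))
```

Entries between the two thresholds, such as Q95IT9 and 2qrt_A, end up in neither subset and only appear in the whole-range set.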
ClustalW
Below 30% SeqID: ClustalW_Below_30
Above 60% SeqID: ClustalW_Above_60
Whole range of SeqID: ClustalW_whole
MAFFT
Below 30%: MAFFT_30
Above 60%: MAFFT_60
Whole range: MAFFT_whole
T-Coffee
Below 30%: Tcoffee_30
Above 60%: Tcoffee_60
Whole range: Tcoffee_whole